[AccessD] A2003: Checking for Similar Names and or Addresses

John W. Colby jwcolby at gmail.com
Fri May 15 10:02:54 CDT 2015


I would like to interject here that while this is fine and a worthwhile 
method to use in some cases (and I LOVE collections!!!), it won't scale 
well.  I was faced with dealing with 220 MILLION records, essentially 
every name / address in the US (well, close).  Using set based (query) 
methods was the only solution.  As you can imagine, trying to do that in 
code will simply never complete.  Even doing it in SQL takes a while.

That is why I did all the preliminaries discussed, such as preprocessing 
(one time, with the results stored) to trim the padding left and right, 
etc.  Once that is done, using a hash, calculated once and stored, 
allows you to index on that hash and then self-join on the hash to find 
duplicates.  Doing a GROUP BY (on the hash) and COUNT (on the hash or 
PK) allows you to filter down to only the hashes with duplicates - 
WHERE Cnt > 1.  Then you can store that query (exposing the hash) and 
join it back to the table on the hash to pull all the duplicate 
records.  Sort by hash, then by whatever suits your fancy (LName / 
FName) to actually see and take care of the duplicates.
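
A minimal T-SQL sketch of that pipeline, assuming a hypothetical 
tblNames table with a primary key PK and a stored hash column NameHash 
(any of the hashes described below would do):

    -- One-time preprocessing: trim the padding and store the result back
    UPDATE tblNames
    SET FName = LTRIM(RTRIM(FName)),
        LName = LTRIM(RTRIM(LName));

    -- Keep only the hashes that occur more than once
    SELECT NameHash, COUNT(PK) AS Cnt
    INTO #Dupes
    FROM tblNames
    GROUP BY NameHash
    HAVING COUNT(PK) > 1;   -- the "WHERE Cnt > 1" filter

    -- Join back on the hash to pull every duplicate record
    SELECT t.*
    FROM tblNames AS t
    INNER JOIN #Dupes AS d
        ON d.NameHash = t.NameHash
    ORDER BY t.NameHash, t.LName, t.FName;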

I build three hashes (a T-SQL sketch follows below):

1) HashPerson: FNameLNameAddrZip5Zip4
2) HashFamily: LNameAddrZip5Zip4
3) HashAddr: AddrZip5Zip4

Notice that, in the US at least, city and state are not required since 
Zip5 and Zip4 get you there.  In fact this gets rid of a bunch of 
ambiguity (in the US), since the same city can be called several 
different things; regardless of what the city field contains, the 
Zip5 / Zip4 is identical.
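
One way to calculate and store those three hashes is as persisted 
computed columns.  A sketch, assuming the same hypothetical tblNames 
and that the one-time cleanup above guarantees trimmed, non-NULL fields 
(HASHBYTES keeps the key compact, though the raw concatenation would 
also work):

    ALTER TABLE tblNames ADD
        HashPerson AS HASHBYTES('MD5', FName + LName + Addr + Zip5 + Zip4) PERSISTED,
        HashFamily AS HASHBYTES('MD5', LName + Addr + Zip5 + Zip4) PERSISTED,
        HashAddr   AS HASHBYTES('MD5', Addr + Zip5 + Zip4) PERSISTED;

    -- Note: with + concatenation, a NULL in any field makes the whole
    -- hash NULL, which is another reason the preprocessing pass matters.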

Using these hashes allows me to then select all records at a given 
address, all records with the same last name at a given address, or just 
individuals.

Because the hash is calculated once and stored, I can index the hash, 
and joins on the hash run very fast.
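
Indexing the stored hashes is one statement apiece.  For example (same 
hypothetical names, with the key value just illustrative):

    CREATE INDEX IX_HashFamily ON tblNames (HashFamily);

    -- Everyone with the same last name at a given address
    DECLARE @h VARBINARY(16) =
        (SELECT HashFamily FROM tblNames WHERE PK = 12345);
    SELECT * FROM tblNames WHERE HashFamily = @h;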

Additionally, given this method I can have different databases (and I 
do) and pull matching records across them.  For example I have 
databases of:

1) People with dogs and cats.
2) People with children, with child-age fields
3) People with boats
4) People with a TON of various demographics info.

So I can join the dogs-and-cats database to the kids database on the 
person hash to get EXACT people matches.  Or I can join on the Family 
hashes to get people in the same family that have kids and dogs or 
cats.  Or families with kids and boats, or kids and boats and dogs.
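
A cross-database match, sketched with hypothetical database names 
(PetsDB / KidsDB) but the same table and hash columns as above:

    -- EXACT person matches between the two databases
    SELECT p.PK AS PetsPK, k.PK AS KidsPK
    FROM PetsDB.dbo.tblNames AS p
    INNER JOIN KidsDB.dbo.tblNames AS k
        ON k.HashPerson = p.HashPerson;

    -- Swap in HashFamily to match whole families instead of individuals.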

This method of using hashes has made the process of matching up families 
and people across databases possible, and actually quite trivial.  And 
FAST!  OK, fastER.  Joining a 220 million record database to a 110 
million record database still takes time, but at least it is possible.  
It just can't be done record-by-record in code.

And finally, I recognize that the OP said it was all done in Access.  
However, SQL Server is free for us developers, so the hash calculation, 
and in fact all of this processing (on a one-time basis), can be done 
on SQL Server and the results written back to Access.  And that will 
probably be much faster, both in development and certainly in 
processing, than trying to develop code to do it in Access.
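
As one way (of several) to push the results back into an Access file 
from the SQL Server side, assuming the ACE OLE DB provider is installed 
and ad hoc distributed queries are enabled (the path and table names 
are hypothetical, and the target table must already exist in the 
.accdb):

    -- Push the primary keys of the duplicate records back to Access
    INSERT INTO OPENROWSET('Microsoft.ACE.OLEDB.12.0',
        'C:\Data\Contacts.accdb';'Admin';'',
        'SELECT PK FROM tblDuplicatePKs')
    SELECT t.PK
    FROM tblNames AS t
    INNER JOIN #Dupes AS d
        ON d.NameHash = t.NameHash;

Linking the SQL Server tables into Access and running an append query 
from the Access side works just as well.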

John W. Colby

On 5/15/2015 10:26 AM, Bill Benson wrote:
> Jim Dettman - I would add to a "collection", which, if a duplicate,
> would produce an error.
> But seriously, that is one reason I like collections: they error out
> on duplicates.
> If you write a routine that compresses each name (trims and removes
> spaces) and then adds it to a collection, then any duplicate
> (Err.Number <> 0) means that the duplicate should be written to a
> duplicates table.
> You can also instantiate and add to a counter collection for all
> erroring items, with the compressed full name as the key.  With each
> new error (with error handling still set to On Error Resume Next),
> read any current value from the counter collection based on the key,
> remove the item (if it's not there, this won't hurt), and add it back
> incremented by 1 ... so that when you are done with this routine you
> have one collection which is a list of all duplicates, and a second
> collection which holds the number of duplicates per value.  Then loop
> from 1 to col.Count and append to a table of duplicates that has 2
> columns, where the 1st comes from colNames and the 2nd comes from
> colCounter(colNames(i)).
>
> Then run a query any time you want where the compressed name
> concatenation from the data list equals a value in this table, and
> you have all records that duplicated.
>
> Even with several thousand names, creating and maintaining this list
> of duplicates would take seconds, not minutes.
>


