[AccessD] A2003: Checking for Similar Names and or Addresses

Doug Murphy dw-murphy at cox.net
Fri May 15 11:40:21 CDT 2015


Hi John,

I am trying to learn here. Why is hashing better than concatenating the same
fields and using the concatenation the way you are using your hashes?

Doug

-----Original Message-----
From: AccessD [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of
John W. Colby
Sent: Friday, May 15, 2015 8:03 AM
To: Access Developers discussion and problem solving
Subject: Re: [AccessD] A2003: Checking for Similar Names and or Addresses

I would like to interject here that while this is fine and a worthwhile
method to use in some cases (and I LOVE collections!!!), it won't scale
well.  I was faced with dealing with 220 MILLION records, essentially every
name / address in the US (well, close). Using set-based (query) methods was
the only solution.  As you can imagine, trying to do that in code will
simply never complete.  Even doing it in SQL will take a while.

That is why I did all the preliminaries discussed, such as preprocessing
(one time and stored) to remove padding left and right, etc.  Once that is
done, using a hash, calculated once and stored, allows you to index on
that hash and then self-join on the hash to find duplicates.  
Doing a groupby (on the hash) and count (on the hash or PK) allows you to
filter on only the hashes with duplicates - WHERE Cnt > 1.  Then you can
store that query (exposing the hash) and join it back to the table on the
hash to pull all the duplicate records.  Sort by hash, then by whatever
suits your fancy (LName / FName) to actually see and take care of
duplicates.
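
To make that group-by / join-back concrete, here is a minimal T-SQL sketch;
the table and column names (tblPerson, HashPerson, LName, FName) are
illustrative, and the HAVING clause plays the role of the saved query's
Cnt > 1 filter:

    -- All rows whose stored person hash occurs more than once.
    SELECT p.*
    FROM tblPerson AS p
    INNER JOIN
        (SELECT HashPerson
         FROM tblPerson
         GROUP BY HashPerson
         HAVING COUNT(*) > 1) AS dup
        ON dup.HashPerson = p.HashPerson
    ORDER BY p.HashPerson, p.LName, p.FName;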

I build three hashes:

1) HashPerson: FNameLNameAddrZip5Zip4
2) HashFamily: LNameAddrZip5Zip4
3) HashAddr: AddrZip5Zip4

Notice that in the US at least, city and state are not required since
Zip5 and Zip4 get you there.  In fact this gets rid of a bunch of ambiguity
(in the US): the city can be called several different things, but
regardless of what the city field contains, the Zip5 / Zip4 is identical.
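
If the hashes are calculated on SQL Server, HASHBYTES is one way to do the
one-time calculation.  A sketch, assuming varbinary hash columns on the same
illustrative table; the '|' separator keeps field boundaries unambiguous, and
the trims stand in for the preprocessing step:

    -- One-time hash calculation; city and state are deliberately omitted.
    -- NULLs in any source field would need ISNULL handling first.
    UPDATE tblPerson
    SET HashPerson = HASHBYTES('SHA1',
            UPPER(LTRIM(RTRIM(FName))) + '|' +
            UPPER(LTRIM(RTRIM(LName))) + '|' +
            UPPER(LTRIM(RTRIM(Addr))) + '|' + Zip5 + '|' + Zip4),
        HashFamily = HASHBYTES('SHA1',
            UPPER(LTRIM(RTRIM(LName))) + '|' +
            UPPER(LTRIM(RTRIM(Addr))) + '|' + Zip5 + '|' + Zip4),
        HashAddr = HASHBYTES('SHA1',
            UPPER(LTRIM(RTRIM(Addr))) + '|' + Zip5 + '|' + Zip4);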

Using these hashes allows me to then select all records at a given address,
all records with the same last name at a given address, or just individuals.

Because the hash is calculated and stored, I can index the hash, and joins
on the hash work very fast.
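
A sketch of the indexes that make those joins fast (again, illustrative
names):

    -- The stored hashes are short, fixed-width values, so they index well.
    CREATE INDEX IX_tblPerson_HashPerson ON tblPerson (HashPerson);
    CREATE INDEX IX_tblPerson_HashFamily ON tblPerson (HashFamily);
    CREATE INDEX IX_tblPerson_HashAddr ON tblPerson (HashAddr);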

Additionally, given this method I can have different databases (and I
do) and pull matching records from across them.  For
example I have databases of:

1) People with dogs and cats.
2) People with children with child age fields
3) People with boats
4) People with a TON of various demographics info.

So I can join the dogs-and-cats database to the kids database on the person
hash to get EXACT people matches.  Or I can join on the Family hash to get
people in the same
family that have kids and dogs or cats.  Or Families with kids and boats, or
Kids and boats and dogs.
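
As a sketch of one such cross-database match (database and table names are
made up, and both databases are assumed to live on the same SQL Server
instance):

    -- Families that appear in both the pets database and the kids database.
    SELECT k.*, p.*
    FROM KidsDB.dbo.tblPerson AS k
    INNER JOIN PetsDB.dbo.tblPerson AS p
        ON p.HashFamily = k.HashFamily;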

This method of using hashes has made the process of matching up families and
people across databases possible, and actually quite trivial.  And FAST!
Ok, fastER.  Joining a 220 million record database with a 110 million record
database still takes time, but at least it is possible.  
It just can't be done with code.

And finally, I recognize that the OP said it was all done in Access.  
However SQL Server is free for us developers, so getting the hashes
calculated, and in fact all of this processing (on a one-time basis), can be
done on SQL Server and the results written back to Access.  And probably
much faster, both in development and certainly in processing, than trying to
develop code to do it in Access.

John W. Colby

On 5/15/2015 10:26 AM, Bill Benson wrote:
> Jim Debtman, I would add to a "collection", which, if a duplicate, would 
> produce an error.
> But seriously, that is one reason I like collections: they error out on 
> duplicates.
> If you write a routine that compresses (trims and removes spaces) and then 
> adds to a collection, any duplicate (Err.Number <> 0) means that 
> the duplicate should be written to a duplicates table.
> You can also instantiate and add to a counter collection for all 
> erroring items, with the compressed full name as the key, and with each 
> new error (with error handling still set to On Error Resume Next), 
> read any current value from the counter collection based on the key, add 
> 1 to it, remove the item (if it's not there, this won't hurt), and add 
> it back incremented by 1 ... so that when you are done with this 
> routine you have one collection which is a list of all duplicates, and a 
> second collection which is a series containing the number of 
> duplicates per value. Then run a loop from 1 to col.Count, and append 
> to a table of duplicates that has 2 columns, the 1st coming from 
> colNames and the 2nd from
> colCounter(colNames(i)).
>
> Then run a query any time you want where the compressed name 
> concatenation from the data list equals a value in this table, and you 
> have all the records that duplicated.
>
> Even with several thousand names, creating and maintaining this list of 
> duplicates would take seconds, not minutes.
>

--
AccessD mailing list
AccessD at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/accessd
Website: http://www.databaseadvisors.com


