[AccessD] A2003: Checking for Similar Names and or Addresses
John W. Colby
jwcolby at gmail.com
Fri May 15 11:48:50 CDT 2015
It isn't clear that it always is. A hash is a fixed length string. So
you exchange a variable length string for a (typically but not always
shorter) fixed length string, but you give up readability. The longer
the concatenated string, the more likely it is that storing, indexing and
comparing the hashes will win.
In any event you want to calc the string (hash or straight
concatenation), store it, and index it.
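For what it's worth, here is a minimal VBA sketch of calculating a fixed
length hash from the concatenated key. It assumes the .NET Framework's
MD5CryptoServiceProvider is exposed to COM (it normally is on a Windows box
with the Framework installed); the function name is just illustrative.

Public Function HashKey(ByVal sKey As String) As String
    ' Returns a fixed length, 32 hex character MD5 digest of sKey.
    Dim oMD5 As Object
    Dim abytIn() As Byte
    Dim abytOut() As Byte
    Dim i As Long
    ' Assumption: the .NET MD5 class is registered for COM interop.
    Set oMD5 = CreateObject("System.Security.Cryptography.MD5CryptoServiceProvider")
    abytIn = StrConv(sKey, vbFromUnicode)   ' string -> ANSI byte array
    abytOut = oMD5.ComputeHash_2(abytIn)    ' 16 byte digest
    For i = LBound(abytOut) To UBound(abytOut)
        HashKey = HashKey & Right$("0" & Hex$(abytOut(i)), 2)
    Next i
End Function

Whether that or the raw concatenated string is stored, the important part is
that it is calculated once, stored in its own field, and indexed.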
John W. Colby
On 5/15/2015 12:40 PM, Doug Murphy wrote:
> HI John,
>
> I am trying to learn here. Why is hashing better than concatenating the
> same fields and using that the way you are using your hashes?
>
> Doug
>
> -----Original Message-----
> From: AccessD [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of
> John W. Colby
> Sent: Friday, May 15, 2015 8:03 AM
> To: Access Developers discussion and problem solving
> Subject: Re: [AccessD] A2003: Checking for Similar Names and or Addresses
>
> I would like to interject here that while this is fine and a worthwhile
> method to use in some cases (and I LOVE collections!!!), it won't scale
> well. I was faced with dealing with 220 MILLION records, essentially every
> name / address in the US (well, close). Using set-based (query) methods was
> the only solution. As you can imagine, trying to do that in code will
> simply never complete. Even doing it in SQL will take a while.
>
> That is why I did all the preliminaries discussed, such as preprocessing
> (one time and stored) to remove padding left and right, etc. Once that is
> done, using a hash, calculated once and stored, allows you to index on
> that hash and then self join on the hash to find duplicates.
> Doing a group by (on the hash) and count (on the hash or PK) allows you to
> filter on only the hashes with duplicates - WHERE Cnt > 1. Then you can
> store that query (exposing the hash) and join it back to the table on the
> hash to pull all the duplicate records. Sort by hash, then by whatever
> suits your fancy (LName / FName), to actually see and take care of the
> duplicates.
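Roughly, the two queries I am describing look like this (the table, field
and query names below are just placeholders, not my actual schema):

Public Sub BuildDupeQueries()
    ' Sketch only: a totals query exposing the hash and its count,
    ' plus a second query that joins back to pull every duplicate record.
    Dim db As DAO.Database
    Set db = CurrentDb
    db.CreateQueryDef "qryDupeHashes", _
        "SELECT HashPerson, Count(*) AS Cnt " & _
        "FROM tblPeople GROUP BY HashPerson HAVING Count(*) > 1;"
    db.CreateQueryDef "qryDupeRecords", _
        "SELECT p.* FROM tblPeople AS p INNER JOIN qryDupeHashes AS d " & _
        "ON p.HashPerson = d.HashPerson " & _
        "ORDER BY p.HashPerson, p.LName, p.FName;"
End Sub

The same SQL can of course be typed straight into the query designer.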
>
> I build a
>
> 1) HashPerson: FNameLNameAddrZip5Zip4
> 2) HashFamily: LNameAddrZip5Zip4
> 3) HashAddr: AddrZip5Zip4
>
> Notice that in the US at least, city and state are not required since
> Zip5 and Zip4 get you there. In fact it gets rid of a bunch of ambiguity
> (in the US) since the city can be called several different things, but
> regardless of what the city field contains, the zip5 / zip4 is identical.
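In an all-Access setup, a one-time pass like the following would populate
the three hash fields (again, table and field names are only placeholders,
and it reuses the HashKey() sketch above):

Public Sub CalcHashes()
    ' One-time, stored calculation of the three hash fields.
    Dim rs As DAO.Recordset
    Set rs = CurrentDb.OpenRecordset("tblPeople", dbOpenDynaset)
    Do While Not rs.EOF
        rs.Edit
        rs!HashPerson = HashKey(rs!FName & rs!LName & rs!Addr & rs!Zip5 & rs!Zip4)
        rs!HashFamily = HashKey(rs!LName & rs!Addr & rs!Zip5 & rs!Zip4)
        rs!HashAddr = HashKey(rs!Addr & rs!Zip5 & rs!Zip4)
        rs.Update
        rs.MoveNext
    Loop
    rs.Close
End Sub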
>
> Using these hashes allows me to then select all records at a given address,
> all records with the same last name at a given address, or just individuals.
>
> Because the hash is calced and stored, I can index the hash, and joins on
> the hash work very fast.
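Indexing the stored hashes is just DDL, for example (names illustrative):

Public Sub IndexHashes()
    ' Run once; an index per hash field keeps the self joins and
    ' cross-database joins on the hashes fast.
    CurrentDb.Execute "CREATE INDEX idxHashPerson ON tblPeople (HashPerson);"
    CurrentDb.Execute "CREATE INDEX idxHashFamily ON tblPeople (HashFamily);"
    CurrentDb.Execute "CREATE INDEX idxHashAddr ON tblPeople (HashAddr);"
End Sub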
>
> Additionally, given this method I can have different databases (and I
> do) where I can pull matching records from the different databases. For
> example I have databases of:
>
> 1) People with dogs and cats.
> 2) People with children with child age fields
> 3) People with boats
> 4) People with a TON of various demographics info.
>
> So I can join the dogs-and-cats database to the kids database on the person
> hash to get EXACT people matches. Or I can join on the family hash to get
> people in the same family that have kids and dogs or cats. Or families with
> kids and boats, or kids and boats and dogs.
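With the source tables linked into one front end, a cross-database match is
then just another join on the stored hash. A sketch, with invented table
names:

Public Sub BuildKidsWithDogsQuery()
    ' Exact person matches between two linked databases, joined on HashPerson;
    ' swap in HashFamily to match at the family level instead.
    CurrentDb.CreateQueryDef "qryKidsWithDogs", _
        "SELECT k.*, d.* FROM tblKids AS k INNER JOIN tblDogsCats AS d " & _
        "ON k.HashPerson = d.HashPerson;"
End Sub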
>
> This method of using hashes has made the process of matching up families and
> people across databases possible, and actually quite trivial. And FAST!
> Ok, fastER. Joining a 220 million record database with a 110 million record
> database still takes time, but at least it is possible.
> It just can't be done with code.
>
> And finally, I recognize that the OP said it was all done in Access.
> However, SQL Server is free for us developers, so calculating the hashes,
> and in fact all of this processing (on a one-time basis), can be done on SQL
> Server and the results written back to Access. And it is probably much
> faster, both in development and certainly in processing, than trying to
> develop code to do it in Access.
>
> John W. Colby
>
> On 5/15/2015 10:26 AM, Bill Benson wrote:
>> Jim Dettman, I would add to a "collection", which would produce an error
>> if it hit a duplicate.
>> But seriously, that is one reason I like collections: they error out on
>> duplicates.
>> If you write a routine that compresses the name (trims and removes spaces)
>> and then adds it to a collection, then any duplicate (Err.Number <> 0)
>> means that record should be written to a duplicates table.
>> You can also instantiate and add to a counter collection for all erroring
>> items, with the compressed full name as the key. With each new error (and
>> with error handling still set to On Error Resume Next), read any current
>> value from the counter collection based on the key, add 1 to it, remove the
>> item (if it's not there, this won't hurt), and add it back incremented by
>> 1 ... so that when you are done with this routine you have one collection
>> which is a list of all duplicates, and a second collection which holds the
>> number of duplicates per value. Then run a counter from 1 to col.Count and
>> append to a table of duplicates that has 2 columns, where the 1st comes
>> from colNames(i) and the 2nd comes from colCounter(colNames(i)).
>>
>> Then run a query any time you want where the compressed name
>> concatenation from the data list equals a value in this table, and you
>> have all the records that are duplicated.
>>
>> Even with several thousand names, creating and maintaining this list of
>> duplicates would take seconds, not minutes.
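For reference, a minimal sketch of the routine Bill describes; the
recordset, field and table names (tblNames, FullName, tblDupes) are
placeholders:

Public Sub FindDupes()
    Dim rs As DAO.Recordset
    Dim colNames As New Collection      ' one item per distinct compressed key
    Dim colCounter As New Collection    ' occurrence count keyed by compressed key
    Dim sKey As String
    Dim lngCnt As Long
    Dim i As Long
    Set rs = CurrentDb.OpenRecordset("tblNames", dbOpenSnapshot)
    On Error Resume Next
    Do While Not rs.EOF
        sKey = Replace(Trim$(Nz(rs!FullName, "")), " ", "")  ' compress the name
        colNames.Add sKey, sKey          ' errors (and is ignored) on a duplicate key
        lngCnt = 0
        lngCnt = colCounter(sKey)        ' stays 0 if the key has no count yet
        colCounter.Remove sKey           ' harmless if it is not there
        colCounter.Add lngCnt + 1, sKey  ' add it back, incremented by 1
        rs.MoveNext
    Loop
    On Error GoTo 0
    rs.Close
    ' Append every compressed key that occurred more than once to a duplicates table.
    For i = 1 To colNames.Count
        If colCounter(colNames(i)) > 1 Then
            CurrentDb.Execute "INSERT INTO tblDupes (CompressedName, DupeCount) " & _
                "VALUES ('" & Replace(colNames(i), "'", "''") & "', " & _
                colCounter(colNames(i)) & ");"
        End If
    Next i
End Sub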
>>
> --
> AccessD mailing list
> AccessD at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com
>