[AccessD] A2003: Checking for Similar Names and or Addresses
Doug Murphy
dw-murphy at cox.net
Fri May 15 12:43:14 CDT 2015
OK, thank you. For the size of project I typically work on, concatenation
would probably suffice.
Doug
-----Original Message-----
From: AccessD [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of
John W. Colby
Sent: Friday, May 15, 2015 9:49 AM
To: Access Developers discussion and problem solving
Subject: Re: [AccessD] A2003: Checking for Similar Names and or Addresses
It isn't clear that it always is. A hash is a fixed-length string. So you
exchange a variable-length string for a (typically, but not always,
shorter) fixed-length string, but you give up readability. The longer the
concatenated string, the more likely it is that storing, indexing and
comparing hashes will win.
In any event, you want to calc the string (hash or straight
concatenation), then store and index it.
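The trade-off can be sketched outside Access. Here is a minimal Python illustration (the field names and the choice of MD5 are my assumptions; the thread does not specify which hash function is used) of a variable-length concatenated key versus a fixed-length hash:

```python
import hashlib

def concat_key(fname, lname, addr, zip5, zip4):
    # Variable-length key: human-readable, grows with the underlying data.
    return f"{fname}{lname}{addr}{zip5}{zip4}".strip().lower()

def hash_key(fname, lname, addr, zip5, zip4):
    # Fixed-length key: always 32 hex characters (MD5), whatever the input size.
    return hashlib.md5(
        concat_key(fname, lname, addr, zip5, zip4).encode()
    ).hexdigest()

k1 = concat_key("John", "Colby", "123 Main St", "12345", "6789")
k2 = hash_key("John", "Colby", "123 Main St", "12345", "6789")
# k1 varies in length with the fields; k2 is always 32 characters.
```

Either key can be stored and indexed; the hash simply keeps the index entries a constant, usually smaller, size.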
John W. Colby
On 5/15/2015 12:40 PM, Doug Murphy wrote:
> HI John,
>
> I am trying to learn here. Why is hashing better than concatenating
> the same fields and using the result as you are doing with your hashes?
>
> Doug
>
> -----Original Message-----
> From: AccessD [mailto:accessd-bounces at databaseadvisors.com] On Behalf
> Of John W. Colby
> Sent: Friday, May 15, 2015 8:03 AM
> To: Access Developers discussion and problem solving
> Subject: Re: [AccessD] A2003: Checking for Similar Names and or
> Addresses
>
> I would like to interject here that while this is fine and a
> worthwhile method to use in some cases (and I LOVE collections!!!), it
> won't scale well. I was faced with dealing with 220 MILLION records,
> essentially every name / address in the US (well, close). Using
> set-based (query) methods was the only solution. As you can imagine,
> trying to do that in code will simply never complete. Even doing it in
> SQL will take a while.
>
> That is why I did all the preliminaries discussed, such as
> preprocessing (one time, with the result stored) to remove left and
> right padding, etc. Once that is done, using a hash, calculated once
> and stored, allows you to index on that hash and then self-join on the
> hash to find duplicates.
> Doing a group-by (on the hash) and count (on the hash or PK) allows
> you to filter down to only the hashes with duplicates - WHERE Cnt > 1.
> Then you can store that query (exposing the hash) and join it back to
> the table on the hash to pull all the duplicate records. Sort by hash,
> then by whatever suits your fancy (LName / FName) to actually see and
> take care of duplicates.
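That group-by / self-join pattern can be sketched with SQLite from Python (the table, column names and MD5 are hypothetical stand-ins for whatever the real schema and hash function are):

```python
import hashlib
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE people (pk INTEGER PRIMARY KEY, "
    "fname TEXT, lname TEXT, addr TEXT, zip5 TEXT, hash TEXT)"
)
# The hash is calculated once, stored, and indexed.
con.execute("CREATE INDEX ix_people_hash ON people (hash)")

def person_hash(fname, lname, addr, zip5):
    return hashlib.md5(f"{fname}{lname}{addr}{zip5}".lower().encode()).hexdigest()

rows = [("Jane", "Doe", "1 Elm St", "12345"),
        ("Jane", "Doe", "1 Elm St", "12345"),   # exact duplicate
        ("Bob", "Roe", "2 Oak Av", "54321")]
for fname, lname, addr, zip5 in rows:
    con.execute(
        "INSERT INTO people (fname, lname, addr, zip5, hash) VALUES (?,?,?,?,?)",
        (fname, lname, addr, zip5, person_hash(fname, lname, addr, zip5)),
    )

# Group on the hash, keep only hashes with Cnt > 1, join back to pull the rows.
dupes = con.execute("""
    SELECT p.pk, p.fname, p.lname
    FROM people AS p
    INNER JOIN (SELECT hash FROM people GROUP BY hash HAVING COUNT(*) > 1) AS d
      ON p.hash = d.hash
    ORDER BY p.hash, p.lname, p.fname
""").fetchall()
# dupes contains the two "Jane Doe" rows.
```

The inner SELECT is the stored "duplicates" query; joining it back on the indexed hash is what keeps this fast at scale.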
>
> I build:
>
> 1) HashPerson: FNameLNameAddrZip5Zip4
> 2) HashFamily: LNameAddrZip5Zip4
> 3) HashAddr: AddrZip5Zip4
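A sketch of those three hash levels in Python (the record layout and MD5 are assumptions for illustration):

```python
import hashlib

def h(s):
    # One hash helper used for all three granularities.
    return hashlib.md5(s.lower().encode()).hexdigest()

rec = {"fname": "John", "lname": "Colby",
       "addr": "123 Main St", "zip5": "12345", "zip4": "6789"}

hash_person = h(rec["fname"] + rec["lname"] + rec["addr"] + rec["zip5"] + rec["zip4"])
hash_family = h(rec["lname"] + rec["addr"] + rec["zip5"] + rec["zip4"])
hash_addr   = h(rec["addr"] + rec["zip5"] + rec["zip4"])
# Three granularities: exact person, same surname at an address, the address itself.
```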
>
> Notice that in the US at least, city and state are not required, since
> Zip5 and Zip4 get you there. In fact this gets rid of a bunch of
> ambiguity (in the US), since the city can be called several different
> things, but regardless of what the city field contains, the Zip5 /
> Zip4 is identical.
>
> Using these hashes allows me to then select all records at a given
> address, all records with the same last name at a given address, or
> just individuals.
>
> Because the hash is calced and stored, I can index the hash, and joins
> on the hash work very fast.
>
> Additionally, given this method I can have different databases (and I
> do) where I can pull matching records from the different databases.
> For example, I have databases of:
>
> 1) People with dogs and cats.
> 2) People with children with child age fields
> 3) People with boats
> 4) People with a TON of various demographics info.
>
> So I can join the dogs-and-cats database to the kids database on the
> person hash to get EXACT people matches. Or I can join family hashes
> to get people in the same family that have kids and dogs or cats. Or
> families with kids and boats, or kids and boats and dogs.
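A toy version of that cross-database family join, again in SQLite (the table names and hash values here are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pets (hash_family TEXT, species TEXT)")
con.execute("CREATE TABLE kids (hash_family TEXT, child_age INTEGER)")
con.executemany("INSERT INTO pets VALUES (?,?)", [("fam1", "dog"), ("fam2", "cat")])
con.executemany("INSERT INTO kids VALUES (?,?)", [("fam1", 7), ("fam3", 12)])

# Families that appear in BOTH databases: join on the stored family hash.
both = con.execute("""
    SELECT DISTINCT p.hash_family
    FROM pets AS p
    INNER JOIN kids AS k ON p.hash_family = k.hash_family
""").fetchall()
# Only "fam1" has both a pet and a child.
```

Because both tables carry the same precomputed hash column, the databases never need to agree on anything except how the hash was built.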
>
> This method of using hashes has made the process of matching up
> families and people across databases possible, and actually quite
> trivial. And FAST!
> Ok, fastER. Joining a 220 million record database with a 110 million
> record database still takes time, but at least it is possible.
> It just can't be done with code.
>
> And finally, I recognize that the OP said it was all done in Access.
> However, SQL Server is free for us developers, so getting the hashes
> calculated, and in fact all this processing (on a one-time basis), can
> be done on SQL Server and written back to Access. And probably much
> faster, both in development and certainly in processing, than trying
> to develop code to do it in Access.
>
> John W. Colby
>
> On 5/15/2015 10:26 AM, Bill Benson wrote:
>> Jim Dettman, I would add to a "collection", which, if handed a
>> duplicate, would produce an error.
>> But seriously, that is one reason I like collections: they error out
>> on duplicates.
>> If you write a routine that compresses (trims and removes spaces)
>> and then adds to a collection, then any duplicate (err.Number <> 0)
>> means that the item should be written to a duplicates table.
>> You can also instantiate and add to a counter collection for all
>> erroring items, with the compressed full name as the key. With each
>> new error (with error handling still set to On Error Resume Next),
>> read any current value from the counter collection based on the key,
>> remove the item (if it's not there, this won't hurt), and add it back
>> incremented by 1. When you are done with this routine you have one
>> collection which is a list of all duplicates, and a second collection
>> which holds the number of duplicates per value. Then loop from 1 to
>> col.Count and append to a duplicates table with two columns: the
>> first comes from colNames and the second from
>> colCounter(colNames(i)).
>>
>> Then run a query any time you want where the compressed name
>> concatenation from the data list equals a value in this table, and
>> you have all the records that were duplicated.
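Bill's routine is VBA, but the idea can be sketched in Python, with a dict standing in for the Collection (a key-membership test replaces the Collection's error-on-duplicate-key behavior):

```python
def find_duplicates(names):
    seen = {}    # plays the role of the VBA Collection keyed on compressed name
    counts = {}  # compressed name -> total occurrences of each duplicated name
    for raw in names:
        # "Compress": trim and remove interior spaces, normalize case.
        key = "".join(raw.split()).lower()
        if key in seen:
            # A VBA Collection would raise an error here (err.Number <> 0);
            # a dict lets us just test membership and bump the counter.
            counts[key] = counts.get(key, 1) + 1
        else:
            seen[key] = raw
    return counts

counts = find_duplicates(["John Colby", "john  colby", "Doug Murphy", "JOHN COLBY"])
# counts == {"johncolby": 3}: one name appeared three times in total.
```

As in the VBA version, the result is one structure listing the duplicated keys and, for each, how many times it occurred; that table can then drive the final query.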
>>
>> Even with several thousand names, creating and maintaining this list
>> of duplicates would take seconds, not minutes.
>>
> --
> AccessD mailing list
> AccessD at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com
>