jwcolby
jwcolby at colbyconsulting.com
Sat Sep 15 06:48:21 CDT 2007
>However, do you first put all to Ucase()? No I didn't but should have! My addresses are standardized but I do not know whether they standardize the case as well, I suspect not. And yes, it would almost certainly make a difference between the hash codes. The nice thing is that once calculated, you never have to do that again. You would have to remember to do the UCASE against any other address record so that you were comparing apples to oranges. Good catch. I will have to go back and recalc my hashes doing the UCASE. John W. Colby Colby Consulting www.ColbyConsulting.com -----Original Message----- From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Gmail Sent: Saturday, September 15, 2007 4:33 AM To: 'Access Developers discussion and problem solving' Subject: [AccessD] HashBytes John, This is a very clever way of doing things and I might do similar. Currently I store Postcode+County+town+street+houseno/name and check against them. Your sounds a lot better (and a lot cleverer). However, do you first put all to Ucase()? I ask this because the MD5 code that Gustav kindly provided, does make a distinction between the same words with different case. Ie, "Max" and "max" will give a different hash value. Max Ps. I have renamed the topic, hope that is ok. -----Original Message----- From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of jwcolby Sent: Friday, September 14, 2007 4:58 AM To: 'Access Developers discussion and problem solving' Subject: Re: [AccessD] First in a group Borge, The hash was done in SQL Server. There is a Hash function inside of SQL Server that I used. It has various "brands" of hash and I chose one (sha1), then just ran all of the address fields into the hash function and out pops a binary (~160 bit?) hash number. UPDATE [PSM_DPSI11000].[dbo].[tblData] SET [HashAddr] = hashbytes('sha1', Addr + Zip5 + Zip4) ,[HashFamily] = hashbytes('sha1', Addr + Zip5 + Zip4 + LName) ,[HashPerson] = hashbytes('sha1', Addr + Zip5 + Zip4 + LName + FName) Google HashBytes for more info on that function. I hash each of these strings and store the three values for every person record. I can then compare the hash values to determine a fact collision - same address, same family, same person. Obviously this does not handle mis-spellings, that is not the intent here. Having hashes I can do compares or even joins on the hash columns. I index the columns as well. It is just simpler (and faster, and easier) than doing a multi-column join to identify matches. The hash value is an 20 byte value so I think / hope / pray it becomes more efficient than a multi-column index as well. In the address case it is a marginal call. Zip5 + zip4 is nine bytes plus the average of Address but on average a 20 byte hash wins I think. I was previously using X bytes from the address plus zip5 plus zip 4 but that left me open to problems with instances like: 1) 1723 Twin Pines Apt A. 2) 1723 Twin Pines Apt B. 3) 1722 Twin Pines If I took the leading 6 characters (for example) then the examples 1 and 2 match and if I took the last 6 characters then examples 2 and 3 match. A hash simply seemed superior. Once you throw in the first and last name as well then the hash unquestionably wins. What really helps it all work is the fact that I do address validation such that I know that each address is EXACTLY valid (is actually deliverable) and just throw away any that are not deliverable so I do not have to deal with mis-spellings in the address part at least. John W. Colby Colby Consulting www.ColbyConsulting.com -----Original Message----- From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of pcs at azizaz.com Sent: Thursday, September 13, 2007 10:30 PM To: Access Developers discussion and problemsolving Subject: [AccessD] First in a group John Colby wrote: ... I do have an address hash that generates a hash string for just the address part. So it seems like something like a group by address hash and then a top1 is the right direction, but of course it has to be the TOP 1 in each GROUP. ... I am assuming that the result of an address hash is a unique numeric value for the particual address text string being 'hashed'? .... I am interested in knowing more about the address hash .... a function ? A pointer or link to an example will do, thanks Regards borge -- AccessD mailing list AccessD at databaseadvisors.com http://databaseadvisors.com/mailman/listinfo/accessd Website: http://www.databaseadvisors.com -- AccessD mailing list AccessD at databaseadvisors.com http://databaseadvisors.com/mailman/listinfo/accessd Website: http://www.databaseadvisors.com -- AccessD mailing list AccessD at databaseadvisors.com http://databaseadvisors.com/mailman/listinfo/accessd Website: http://www.databaseadvisors.com