[AccessD] HashBytes

Sat Sep 15 06:48:21 CDT 2007

>However, do you first put all to Ucase()? 

No I didn't but should have!  My addresses are standardized but I do not
know whether they standardize the case as well, I suspect not.  And yes, it
would almost certainly make a difference between the hash codes. 

The nice thing is that once calculated, you never have to do that again.
You would have to remember to do the UCASE against any other address record
so that you were comparing apples to oranges.

Good catch.  I will have to go back and recalc my hashes doing the UCASE.  

John W. Colby
Colby Consulting
www.ColbyConsulting.com 
-----Original Message-----
From: accessd-bounces at databaseadvisors.com
[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Gmail
Sent: Saturday, September 15, 2007 4:33 AM
To: 'Access Developers discussion and problem solving'
Subject: [AccessD] HashBytes

 John,
This is a very clever way of doing things and I might do similar.  Currently
I store Postcode+County+town+street+houseno/name and check against them.
Your sounds a lot better (and a lot cleverer).
However, do you first put all to Ucase()?  
I ask this because the MD5 code that Gustav kindly provided, does make a
distinction between the same words with different case. Ie, "Max" and "max"
will give a different hash value.

Max
Ps. I have renamed the topic, hope that is ok.

-----Original Message-----
From: accessd-bounces at databaseadvisors.com
[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Friday, September 14, 2007 4:58 AM
To: 'Access Developers discussion and problem solving'
Subject: Re: [AccessD] First in a group

Borge,

The hash was done in SQL Server.  There is a Hash function inside of SQL
Server that I used.  It has various "brands" of hash and I chose one (sha1),
then just ran all of the address fields into the hash function and out pops
a binary (~160 bit?) hash number.

UPDATE [PSM_DPSI11000].[dbo].[tblData]
   SET [HashAddr] = hashbytes('sha1', Addr + Zip5 + Zip4)
      ,[HashFamily] = hashbytes('sha1', Addr + Zip5 + Zip4 + LName)
      ,[HashPerson] = hashbytes('sha1', Addr + Zip5 + Zip4 + LName + FName)

Google HashBytes for more info on that function.

I hash each of these strings and store the three values for every person
record.  I can then compare the hash values to determine a fact collision -
same address, same family, same person.  Obviously this does not handle
mis-spellings, that is not the intent here.  Having hashes I can do compares
or even joins on the hash columns.  I index the columns as well.  It is just
simpler (and faster, and easier) than doing a multi-column join to identify
matches.  

The hash value is an 20 byte value so I think / hope / pray it becomes more
efficient than a multi-column index as well.  In the address case it is a
marginal call.  Zip5 + zip4 is nine bytes plus the average of Address but on
average a 20 byte hash wins I think.  I was previously using X bytes from
the address plus zip5 plus zip 4 but that left me open to problems with
instances like:

1) 1723 Twin Pines Apt A.
2) 1723 Twin Pines Apt B.
3) 1722 Twin Pines

If I took the leading 6 characters (for example) then the examples 1 and 2
match and if I took the last 6 characters then examples 2 and 3 match.  A
hash simply seemed superior.  Once you throw in the first and last name as
well then the hash unquestionably wins.  What really helps it all work is
the fact that I do address validation such that I know that each address is
EXACTLY valid (is actually deliverable) and just throw away any that are not
deliverable so I do not have to deal with mis-spellings in the address part
at least.

John W. Colby
Colby Consulting
www.ColbyConsulting.com
-----Original Message-----
From: accessd-bounces at databaseadvisors.com
[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of pcs at azizaz.com
Sent: Thursday, September 13, 2007 10:30 PM
To: Access Developers discussion and problemsolving
Subject: [AccessD] First in a group

John Colby wrote:

...
I do have an address hash that generates a hash string for just the address
part.

So it seems like something like a group by address hash and then a top1 is
the right direction, but of course it has to be the TOP 1 in each GROUP.
...

I am assuming that the result of an address hash is a unique numeric value
for the particual address text string being 'hashed'? ....

I am interested in knowing more about the address hash .... 
a function ?

A pointer or link to an example will do, thanks

Regards
borge
--
AccessD mailing list
AccessD at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/accessd
Website: http://www.databaseadvisors.com

--
AccessD mailing list
AccessD at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/accessd
Website: http://www.databaseadvisors.com

--
AccessD mailing list
AccessD at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/accessd
Website: http://www.databaseadvisors.com