[dba-VB] SHA1 to compute a hash

jwcolby jwcolby at colbyconsulting.com
Fri Mar 18 21:08:01 CDT 2011


In my databases I create SHA1 hashes to enable joining between tables and pull identical records 
(identical for the fields hashed).  I create:

1) A HashAddr of the zip5, zip4 and addr.  IOW I simply append the three values and feed them into 
SHa1 and out pops a number which I store in a field in my table.

2) A HashFamily of the Zip5, Zip4, Addr ad LName.

3) A HashPerson of Zip5, Zip4, Addr, LName and FName.

I am getting known collisions between different addresses (I have discovered and investigated 
collisions) in my HashAddr when I have many millions of addresses.  I need to address this.

Back when I made my design decisions (2004) my hardware consisted of single core processors, 4 gigs 
ram, Windows x32 etc.  Now I have 8 cores, 32 gigs Ram, Windows X64 etc.  IOW I was to a great 
extent constrained by my hardware "back in the day" whereas I am much less so now.

I am about to redesign my process.

I am considering simply appending in the city and state strings to all of the inputs: Addr, City, 
St, Zip5, Zip4 as the address base and then the same with LName and FName for the other two 
respective hashes.

The objective is to minimize hash collisions, not prevent some crypto attack.  I use these hash 
fields to join between multi million record tables so If I need to discover info in TableA where the 
HashAddr is the same as in TableB, I need the probability of a collision between different addresses 
(family/Person) to be as close to zero as I can get it.

My questions are:

1) Whether anyone out there is using a hash in this manner?
2) Has anyone seen a table of collision probability between messages of a given (short) message 
length.  My message is 9 digits for the zip5/4 and the address could be something as short as PO Box 
1, or Apt 1.  IOW the total message length of 14 is pretty common.  Adding the state would give me 
minimum message lengths of only 16 and City would only add a few more characters.
3) Does anyone know if just adding the same data back in again would decrease the collision 
probability. IOW Zip5,Zip4,Addr,City,St,Zip5,Zip4Etc.

Any experience out there?


-- 
John W. Colby
www.ColbyConsulting.com



More information about the dba-VB mailing list