jwcolby
jwcolby at colbyconsulting.com
Fri Mar 18 21:08:01 CDT 2011
In my databases I create SHA1 hashes to enable joining between tables and pull identical records (identical for the fields hashed). I create: 1) A HashAddr of the zip5, zip4 and addr. IOW I simply append the three values and feed them into SHa1 and out pops a number which I store in a field in my table. 2) A HashFamily of the Zip5, Zip4, Addr ad LName. 3) A HashPerson of Zip5, Zip4, Addr, LName and FName. I am getting known collisions between different addresses (I have discovered and investigated collisions) in my HashAddr when I have many millions of addresses. I need to address this. Back when I made my design decisions (2004) my hardware consisted of single core processors, 4 gigs ram, Windows x32 etc. Now I have 8 cores, 32 gigs Ram, Windows X64 etc. IOW I was to a great extent constrained by my hardware "back in the day" whereas I am much less so now. I am about to redesign my process. I am considering simply appending in the city and state strings to all of the inputs: Addr, City, St, Zip5, Zip4 as the address base and then the same with LName and FName for the other two respective hashes. The objective is to minimize hash collisions, not prevent some crypto attack. I use these hash fields to join between multi million record tables so If I need to discover info in TableA where the HashAddr is the same as in TableB, I need the probability of a collision between different addresses (family/Person) to be as close to zero as I can get it. My questions are: 1) Whether anyone out there is using a hash in this manner? 2) Has anyone seen a table of collision probability between messages of a given (short) message length. My message is 9 digits for the zip5/4 and the address could be something as short as PO Box 1, or Apt 1. IOW the total message length of 14 is pretty common. Adding the state would give me minimum message lengths of only 16 and City would only add a few more characters. 3) Does anyone know if just adding the same data back in again would decrease the collision probability. IOW Zip5,Zip4,Addr,City,St,Zip5,Zip4Etc. Any experience out there? -- John W. Colby www.ColbyConsulting.com