Stuart McLachlan
stuart at lexacorp.com.pg
Sat Mar 19 05:20:16 CDT 2011
How are you creating your hash? Can you post a few examples of different data strings and colliding SHA1 hashes. I can probably make a lot of money out of them. AFAIK, no one other than you has found any. -- Stuart On 18 Mar 2011 at 22:08, jwcolby wrote: > In my databases I create SHA1 hashes to enable joining between tables > and pull identical records (identical for the fields hashed). I > create: > > 1) A HashAddr of the zip5, zip4 and addr. IOW I simply append the > three values and feed them into SHa1 and out pops a number which I > store in a field in my table. > > 2) A HashFamily of the Zip5, Zip4, Addr ad LName. > > 3) A HashPerson of Zip5, Zip4, Addr, LName and FName. > > I am getting known collisions between different addresses (I have > discovered and investigated collisions) in my HashAddr when I have > many millions of addresses. I need to address this. > > Back when I made my design decisions (2004) my hardware consisted of > single core processors, 4 gigs ram, Windows x32 etc. Now I have 8 > cores, 32 gigs Ram, Windows X64 etc. IOW I was to a great extent > constrained by my hardware "back in the day" whereas I am much less so > now. > > I am about to redesign my process. > > I am considering simply appending in the city and state strings to all > of the inputs: Addr, City, St, Zip5, Zip4 as the address base and then > the same with LName and FName for the other two respective hashes. > > The objective is to minimize hash collisions, not prevent some crypto > attack. I use these hash fields to join between multi million record > tables so If I need to discover info in TableA where the HashAddr is > the same as in TableB, I need the probability of a collision between > different addresses (family/Person) to be as close to zero as I can > get it. > > My questions are: > > 1) Whether anyone out there is using a hash in this manner? > 2) Has anyone seen a table of collision probability between messages > of a given (short) message length. My message is 9 digits for the > zip5/4 and the address could be something as short as PO Box 1, or Apt > 1. IOW the total message length of 14 is pretty common. Adding the > state would give me minimum message lengths of only 16 and City would > only add a few more characters. 3) Does anyone know if just adding the > same data back in again would decrease the collision probability. IOW > Zip5,Zip4,Addr,City,St,Zip5,Zip4Etc. > > Any experience out there? > > > -- > John W. Colby > www.ColbyConsulting.com > _______________________________________________ > dba-VB mailing list > dba-VB at databaseadvisors.com > http://databaseadvisors.com/mailman/listinfo/dba-vb > http://www.databaseadvisors.com > >