[dba-VB] SHA1 to compute a hash

jwcolby jwcolby at colbyconsulting.com
Sat Mar 19 07:09:56 CDT 2011


 >AFAIK, no one other than you has found any.

LOL.

You are thinking of hashing messages with thousands of characters.  here's the deal.

Let's assume that I hash X characters in a single field.  The message length is the key, not the 
number of fields.  Assume also that I am really only hashing letters, numbers and special characters 
- the characters that are in a name or address.


If I hash 100 1 character messages, my probability of a collision if extremely high since there are 
only about 128 such alphanumeric characters (even less).

Now hash a thousand 1 character messages.  The probably just climbed immensely.
Now hash a million such 1 character messages.

What I am really calculating is the probability that I will repeat the same character string.

However if I up the number of characters, let's say 10 character messages.  *Now* the hash is really 
trying to prevent the same output for two *different* inputs.  Well that is always what a hash is 
trying to do.  But the point is that the probability of a collision decreases with message length 
increase.

The probability of a collision between any two messages of 10 random characters is much higher than 
the probability of a collision between any two 100 character messages, and the probability of a 
collision between any two 1000 character messages is much higher still.

That is just the way hash functions work.  Nobody claims that hashes don't create collisions, and 
everyone agrees that the longer the message, the lower the probability of a collision between two 
messages (to a point).

So, my address "messages" currently look like

89364 4456 PO Box 1
76543 9876 Apt 2
97867 3546 1723 Twin Pines Dr

The point here is that all address "messages" are short.  Now turn the address message into

89364 4456 PO Box 1 CA San Diego
76543 9876 Apt 2 TX Dallas
97867 3546 1723 Twin Pines Dr NC Hudson

And the length of the message increases fairly dramatically, decreasing the probability of a collision.

Remember too that I am hashing hundreds of millions of records - about 350 million addresses in 
various tables so far.  The other thing that affects the probability of a hash collision is the 
number of messages hashed.  Hash enough records and you *will* 100% probability create a hash that 
is the same for two different input strings - even for long messages.  That is just the nature of 
the business.

The probability of a collision is determined by three things.

1) The (strength of the) hash algorithm.
2) The message length
3) The quantity of messages.

1) I am using a fairly strong hash algorithm - SHA1
2) I am using a fairly short hash message - Zip5 / zip4 / Addr
3) I am hashing a fairly high number of messages - 350 million (and counting) messages (addresses)

Believe me, hashes have already occurred, I have seen them.  It is an easy thing to test.  Join two 
tables on the same hash and in the query compare that each address field is the same between the two 
tables.  Zip5 <> zip5 *or* zip4 <> zip4 *or* addr <> addr.

If the hash is the same (inner join) AND one of the three fields is different then you have a collision.

I have done this test and I have found collisions between my address hash.  I have never tested my 
last name or first name hash fields.

The problem with finding hashes is that they are rare, and in order to do an exhaustive test I would 
need to pull all the hash and data fields into a single table and compare that table against itself. 
  My data is contained in a bunch of databases.  11 million "dogs and cats", 21 million "kids", 7 
million "smokers", 23 million (and about to double) emails.  65 million "all adults", 100 million 
"mortgage" etc.  So it is difficult to find hash collisions because (at least in the small tables) 
the probability of a collision is low entirely within a table.

Believe me, it *does* happen though (at least for the address hash).

John W. Colby
www.ColbyConsulting.com

On 3/19/2011 6:20 AM, Stuart McLachlan wrote:
> How are you creating your hash?
>
> Can you post a few examples of different  data strings and colliding SHA1 hashes.   I can
> probably make a lot of money out of them.   AFAIK, no one other than you has found any.
>



More information about the dba-VB mailing list