Stuart McLachlan
stuart at lexacorp.com.pg
Sat Mar 19 07:51:13 CDT 2011
In the case of 1000 single character messages, you are bound to get collisions since there are only 256 possible original messages. You will be hashing the same value multiple times Apart from that, your understanding is incorrect. It doesn't matter how long the string is. The chances of a collision with two different messages remains the same. The message is hashed using padded blocks of a fixed length. There is no more chance of a collision between "a" and "b" than there is between "aaaaaaaaaaaaaaaaaaa" and "baaaaaaaaaaaaaaaaaaa". Specifically, the chance of a collision within n different messages, using b bits of encryption is (n*(n-1)/2) * (1/2^b). Note that the length of the message doesn't come into that equation. The probability of a collision is determined by only TWO things. 1) The length of the digest. 2) The quantity of messages. As for "everyone agrees". Noone who understands how it works, agrees. -- Stuart On 19 Mar 2011 at 8:09, jwcolby wrote: > >AFAIK, no one other than you has found any. > > LOL. > > You are thinking of hashing messages with thousands of characters. > here's the deal. > > Let's assume that I hash X characters in a single field. The message > length is the key, not the number of fields. Assume also that I am > really only hashing letters, numbers and special characters - the > characters that are in a name or address. > > > If I hash 100 1 character messages, my probability of a collision if > extremely high since there are only about 128 such alphanumeric > characters (even less). > > Now hash a thousand 1 character messages. The probably just climbed > immensely. Now hash a million such 1 character messages. > What I am really calculating is the probability that I will repeat the > same character string. > > However if I up the number of characters, let's say 10 character > messages. *Now* the hash is really trying to prevent the same output > for two *different* inputs. Well that is always what a hash is trying > to do. But the point is that the probability of a collision decreases > with message length increase. > > The probability of a collision between any two messages of 10 random > characters is much higher than the probability of a collision between > any two 100 character messages, and the probability of a collision > between any two 1000 character messages is much higher still. > > That is just the way hash functions work. Nobody claims that hashes > don't create collisions, and everyone agrees that the longer the > message, the lower the probability of a collision between two messages > (to a point). > > So, my address "messages" currently look like > > 89364 4456 PO Box 1 > 76543 9876 Apt 2 > 97867 3546 1723 Twin Pines Dr > > The point here is that all address "messages" are short. Now turn the > address message into > > 89364 4456 PO Box 1 CA San Diego > 76543 9876 Apt 2 TX Dallas > 97867 3546 1723 Twin Pines Dr NC Hudson > > And the length of the message increases fairly dramatically, > decreasing the probability of a collision. > > Remember too that I am hashing hundreds of millions of records - about > 350 million addresses in various tables so far. The other thing that > affects the probability of a hash collision is the number of messages > hashed. Hash enough records and you *will* 100% probability create a > hash that is the same for two different input strings - even for long > messages. That is just the nature of the business. > > The probability of a collision is determined by three things. > > 1) The (strength of the) hash algorithm. > 2) The message length > 3) The quantity of messages. ...