[dba-VB] SHA1 to compute a hash

Sat Mar 19 08:10:47 CDT 2011

I've just realised where the confusion comes in.  When you talk about the chance of collisions 
increasing decreasing message length, you are talking about general hashing algorithms 
such as CRC etc.   You need to undersand the difference between simple and cryptographic 
hashing.  

Cryptographic hashing algorithms are specifically designed to be "collision resistant".

I repeat, to date no one has found a collision in SHA1 (except you apparently <g>)

-- 
Stuart

On 19 Mar 2011 at 22:51, Stuart McLachlan wrote:

> In the case of 1000 single character messages, you are bound to get
> collisions since there are only 256 possible original messages.  You
> will be hashing the same value multiple times
> 
> Apart from that, your understanding is incorrect. It doesn't matter
> how long the string is.  The chances of a collision with two different
> messages remains the same.
> 
> The message is hashed using padded blocks of a fixed length.
> There is no more chance of a collision between "a" and "b" than there
> is between "aaaaaaaaaaaaaaaaaaa" and "baaaaaaaaaaaaaaaaaaa".
> 
> Specifically, the chance of a collision within n different messages,
> using b bits of encryption is (n*(n-1)/2) * (1/2^b).  
> 
> Note that the length of the message doesn't come into that equation.
> 
>  The probability of a collision is determined by only TWO things.
> 
>  1) The length of the digest.
>  2) The quantity of messages.
> 
> As for "everyone agrees".  Noone who understands how it works, agrees.
> 
> -- 
> Stuart
> 
> On 19 Mar 2011 at 8:09, jwcolby wrote:
> 
> >  >AFAIK, no one other than you has found any.
> > 
> > LOL.
> > 
> > You are thinking of hashing messages with thousands of characters.
> > here's the deal.
> > 
> > Let's assume that I hash X characters in a single field.  The
> > message length is the key, not the number of fields.  Assume also
> > that I am really only hashing letters, numbers and special
> > characters - the characters that are in a name or address.
> > 
> > 
> > If I hash 100 1 character messages, my probability of a collision if
> > extremely high since there are only about 128 such alphanumeric
> > characters (even less).
> > 
> > Now hash a thousand 1 character messages.  The probably just climbed
> > immensely. Now hash a million such 1 character messages. What I am
> > really calculating is the probability that I will repeat the same
> > character string.
> > 
> > However if I up the number of characters, let's say 10 character
> > messages.  *Now* the hash is really trying to prevent the same
> > output for two *different* inputs.  Well that is always what a hash
> > is trying to do.  But the point is that the probability of a
> > collision decreases with message length increase.
> > 
> > The probability of a collision between any two messages of 10 random
> > characters is much higher than the probability of a collision
> > between any two 100 character messages, and the probability of a
> > collision between any two 1000 character messages is much higher
> > still.
> > 
> > That is just the way hash functions work.  Nobody claims that hashes
> > don't create collisions, and everyone agrees that the longer the
> > message, the lower the probability of a collision between two
> > messages (to a point).
> > 
> > So, my address "messages" currently look like
> > 
> > 89364 4456 PO Box 1
> > 76543 9876 Apt 2
> > 97867 3546 1723 Twin Pines Dr
> > 
> > The point here is that all address "messages" are short.  Now turn
> > the address message into
> > 
> > 89364 4456 PO Box 1 CA San Diego
> > 76543 9876 Apt 2 TX Dallas
> > 97867 3546 1723 Twin Pines Dr NC Hudson
> > 
> > And the length of the message increases fairly dramatically,
> > decreasing the probability of a collision.
> > 
> > Remember too that I am hashing hundreds of millions of records -
> > about 350 million addresses in various tables so far.  The other
> > thing that affects the probability of a hash collision is the number
> > of messages hashed.  Hash enough records and you *will* 100%
> > probability create a hash that is the same for two different input
> > strings - even for long messages.  That is just the nature of the
> > business.
> > 
> > The probability of a collision is determined by three things.
> > 
> > 1) The (strength of the) hash algorithm.
> > 2) The message length
> > 3) The quantity of messages.
> ...
> 
> _______________________________________________
> dba-VB mailing list
> dba-VB at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/dba-vb
> http://www.databaseadvisors.com
> 
>