[dba-VB] Merge rows

jwcolby jwcolby at colbyconsulting.com
Tue Sep 22 09:21:15 CDT 2009


Jack,

I was leaving that out to prevent obscuring the points I wanted to discuss.

In fact what happened is that I received a text file, fixed width that contained all of this data in 
it.  The file contained 65 million records that included fields for first name, last name, address, 
city, state, and zip.

I loaded it into a table, with an autonumber PKID.  I then pulled the name / address info out into a 
second table and ran it through the address validation program I have discussed in the past (virtual 
machines etc).  That process causes 15 million records to fall out as undeliverable.  50 million 
records remained, the records we discuss here.

So I have two tables, one with PK / name / address and the second with PK / poll information data.

What I actually do to group the records is in the Address table is that I created a HashPerson 
field, binary(250).  I then ran the first/last/addr/zip5/zip4 through a hash function (in SQL 
Server) which returned a repeatable unique number for each such string of data input.  So now I have 
this hash field which contains exactly the same binary representation for every record with the same 
first/last/addr/zip5/zip4.  So I can group on the HashPerson to count duplicates, I can select on 
HashPerson to get the PKIDs of all the records for a single person etc.

Obviously I am not effectively dealing with issues like "Jack Colby" and "John Colby" at the same 
address.  Same person, nicknames.  I have not even attempted to do such things.  Nor mis-spellings, 
minor variations in address etc.  These things DO exist, but statistically they are such a small 
percentage of the total, and I had so much else to deal with, that it made no sense to go there yet. 
  Maybe some day.

John W. Colby
www.ColbyConsulting.com


Jack and Pat wrote:
> John,
> 
> For clarity, how do you identify a specific John Colby? OK so may be there
> is only 1 John Colby, but how would you distinguish (and with what
> certainty) to Tom Smith's for example.
> Is there some defined population of individuals that get polled  and
> repolled?
> Jack
> 
> _______________________________________________
> dba-VB mailing list
> dba-VB at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/dba-vb
> http://www.databaseadvisors.com
> 
> 



More information about the dba-VB mailing list