jwcolby
jwcolby at colbyconsulting.com
Tue Sep 22 09:21:15 CDT 2009
Jack, I was leaving that out to prevent obscuring the points I wanted to discuss. In fact what happened is that I received a text file, fixed width that contained all of this data in it. The file contained 65 million records that included fields for first name, last name, address, city, state, and zip. I loaded it into a table, with an autonumber PKID. I then pulled the name / address info out into a second table and ran it through the address validation program I have discussed in the past (virtual machines etc). That process causes 15 million records to fall out as undeliverable. 50 million records remained, the records we discuss here. So I have two tables, one with PK / name / address and the second with PK / poll information data. What I actually do to group the records is in the Address table is that I created a HashPerson field, binary(250). I then ran the first/last/addr/zip5/zip4 through a hash function (in SQL Server) which returned a repeatable unique number for each such string of data input. So now I have this hash field which contains exactly the same binary representation for every record with the same first/last/addr/zip5/zip4. So I can group on the HashPerson to count duplicates, I can select on HashPerson to get the PKIDs of all the records for a single person etc. Obviously I am not effectively dealing with issues like "Jack Colby" and "John Colby" at the same address. Same person, nicknames. I have not even attempted to do such things. Nor mis-spellings, minor variations in address etc. These things DO exist, but statistically they are such a small percentage of the total, and I had so much else to deal with, that it made no sense to go there yet. Maybe some day. John W. Colby www.ColbyConsulting.com Jack and Pat wrote: > John, > > For clarity, how do you identify a specific John Colby? OK so may be there > is only 1 John Colby, but how would you distinguish (and with what > certainty) to Tom Smith's for example. > Is there some defined population of individuals that get polled and > repolled? > Jack > > _______________________________________________ > dba-VB mailing list > dba-VB at databaseadvisors.com > http://databaseadvisors.com/mailman/listinfo/dba-vb > http://www.databaseadvisors.com > >