[dba-VB] Merge / Purge

jwcolby jwcolby at colbyconsulting.com
Wed Oct 14 22:18:19 CDT 2009


Charlotte,

 > Unless they have data in the same fields.  Then what happens?

Well, so what?  Either overwrite or don't.  Remember it is impossible to tell which is correct. 
They can't BOTH be correct.  Flip a coin and do something, it literally doesn't matter what.

John Colby Owns_A_Dog = 'Y'
John Colby Owns_A_Dog = 'N'

Which one is true?  We have no way to say one record is more recent so flip a coin and merge the data.

In fact in most cases (90% of all the fields) the data is "Y" or nothing.  These records are the 
result of people taking polls.  So in 90% of the cases, the person will answer "Y" or they have no 
answer because they weren't asked that question in that poll.

The objective here is to get a ton of individual poll responses merged into a single record.  I did 
a count earlier today and one person is in the database 36 times, 26 times, 17, 16, two 14s, four 
13s etc.  3,080,470 "duplicate" person fact records (out of 50 million total records).

So when I try to ask the question "income bracket in (A-T)" and ("Owns a dog" OR "owns a cat") AND 
"mail order buyer", I end up with cases where "Owns a dog" was answered in one record and "mail 
order buyer" was answered in another record.  So the AND fails whereas in fact the person answered 
affirmatively to both questions, just not in the same record.  This is nuts.  I need to merge things 
down so that there is a single record for a single person.  ALL ANSWERS need to be in one record. 
If there are conflicting records, flip a coin, pick one answer.

Now before you go off on normalization and a single column fact table, just remember that I have 50 
million records X 550 columns.  A single column fact table would end up with 25 Tera records and I 
don't have a supercomputer.

John W. Colby
www.ColbyConsulting.com


Charlotte Foust wrote:
> Unless they have data in the same fields.  Then what happens?
> 
> Charlotte Foust 
> 
> -----Original Message-----
> From: dba-vb-bounces at databaseadvisors.com
> [mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of jwcolby
> Sent: Wednesday, October 14, 2009 1:20 PM
> To: stuart at lexacorp.com.pg; Discussion concerning Visual Basic and
> related programming issues.
> Subject: Re: [dba-VB] Merge / Purge
> 
> Stuart,
> 
> Yes there is a PKID.  The thing is, every record is equal so it simply
> doesn't matter in which order the records are pulled.  Any of the
> duplicates can be treated as the first record to be updated by whatever
> is found in the subsequent records.
> 
> It literally doesn't matter which record is encountered first.
> 
> John W. Colby
> www.ColbyConsulting.com




More information about the dba-VB mailing list