[dba-VB] Merge / Purge

jwcolby jwcolby at colbyconsulting.com
Wed Oct 14 12:37:15 CDT 2009


I have a problem that AFAICT is going to have to be attacked with programming, and it will need to 
be FAST programming.

Imagine I have a table with tens of millions of records, hundreds of fields.  In this table I have 
information about people, likes, uses, owns, income, etc.

In a small percentage I have "duplicate" records except that the records are not duplicates per se 
but rather "belong to the same person", but have different information about likes, dislikes, uses, 
owns etc.

It is impossible to rate the validity of any record over any other record.  So I have three records 
about John Colby, one says he drinks diet Pepsi, and drives a Chevy.  The next "duplicate" record is 
about John Colby but doesn't have any info about drinks or drives but has info about eats Chinese 
food and listens to the blues.  The third record is also about John Colby, says he drives a Ford, 
doesn't mention drinks or eats, but says he listens to classical music.

Again there is no way to rate the validity or even "most recentness" of the records.

Now, I need to "merge" these records so that one single record says

John Colby

1 Drinks Diet Pepsi
1 Drives Chevy
2 Eats Chinese
2 Listens Blues
3 Drives Ford
3 Listens Classical

Merges into one record:

Drinks Pepsi
Drives Ford
Eats Chinese
Listens Classical

OK, I have solved the problem of selecting every record for John Colby.  Now I need to "merge" the 
data from all the fields of all the records into one single record using the "most recent field 
wins" strategy.  Every field in every record has to be examined.  "Existing data" cannot be 
overwritten by "no data" but MUST be overwritten by "most recent data".  Recent is defined by record 
position, NOT date (I don't have date info).

As you can see, I cannot simply select one of the records and say that is the real data, I MUST 
merge all of the records, field by field.  When I have one record with the "sum" of all the data in 
the multiple records, I write that record back out and delete the "other" records.

There are millions of duplicates (about 3 million out of 50 million IIRC).  In some cases I have two 
records for a single person, the maximum I have is about 15 records for a single person.  So if 
there are 3 million "dupes" I will have a maximum of 1.5 million "merge operations" where I perform 
this merge process.  In fact it will be somewhat fewer because there are more than 2 dupes in many 
cases.

In any case, it is a largish number of records, and a largish number of fields.  My concept of what 
I would do is:

1) Open all records for a given person with dupes, probably in a recordset(?).
2) Use the fields collection to iterate through each field position.
3) Compare rec1 to recN in a loop for each field.
4) If data in the current field of the current record, write that data to the current field in 
record 1.  Notice that ACTUAL DATA in CurrentField.CurrentRec overwrites data in 
CurrentField.Record1 in every case.  OTOH, No data in CurrentField.CurrentRec, no update.
5) Increment field.  Go to 4)
6) Repeat until all fields compared
7) Increment record.  Goto 3)
8) Repeat until all records compared to record 1
9) Save record1 back to the database
10) Delete record 2-N
11 Go to 1)
12 Repeat until all duplicates processed.

So that is the overview.  This is where all the Masters are invited to chime in with "use a dataset" 
or "use this other technology", as well as WHY, and where possible pointers to example code or whatever.

For speed reasons I am looking at doing this with C#.  I could do this in short order in Access but 
I am afraid that the speed would be abysmal.  Also I want to be able to use threads perhaps to spawn 
off multiple instances of this task so that my quad core machine could have as many as three of 
these processes running at any given time.  I really have no clue how fast this would run so I have 
no clue whether I would need to do that but if I make it modular then perhaps I could do that if 
required.

-- 
John W. Colby
www.ColbyConsulting.com



More information about the dba-VB mailing list