[dba-VB] Merge / Purge

jwcolby jwcolby at colbyconsulting.com
Wed Oct 14 13:05:50 CDT 2009


Gustav,

My thought was that I would be doing this in a disconnected ADO recordset.  However, as we all know, I 
do not understand the technology.  The vision in my head is that once the records are pulled into 
a disconnected recordset, the data is sitting in memory in some kind of .NET / ADO in-memory 
structure.  I would think that doing the update there would be pretty fast.  Once that first record 
is updated (in memory) with the values from all of the subsequent records, write that updated (first) 
record back to the physical database and delete the subsequent records from the physical database.

But what do I know?
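
For what it's worth, the picture in my head looks something like the sketch below, done with an 
ADO.NET DataTable / DataAdapter.  The table name (tblPeople), the key columns (DupeGroupID, PK) and 
the connection string are placeholders, not the real schema:

using System;
using System.Data;
using System.Data.SqlClient;

class DisconnectedMergeSketch
{
    // Pull one group of "duplicate" rows into memory, fold every later row's
    // non-empty fields into the first row, then push the update and deletes back.
    static void MergeGroup(string connStr, int dupeGroupId)
    {
        var adapter = new SqlDataAdapter(
            "SELECT * FROM tblPeople WHERE DupeGroupID = @g ORDER BY PK", connStr);
        adapter.SelectCommand.Parameters.AddWithValue("@g", dupeGroupId);
        var builder = new SqlCommandBuilder(adapter);   // generates the UPDATE/DELETE commands

        var dt = new DataTable();
        adapter.Fill(dt);                               // the "disconnected" in-memory copy

        DataRow first = dt.Rows[0];
        for (int r = 1; r < dt.Rows.Count; r++)
        {
            foreach (DataColumn col in dt.Columns)
            {
                if (col.ColumnName == "PK") continue;   // never touch the key
                object v = dt.Rows[r][col];
                if (v != DBNull.Value && v.ToString().Trim().Length > 0)
                    first[col] = v;                     // later non-empty value wins
            }
            dt.Rows[r].Delete();                        // mark the duplicate for deletion
        }

        adapter.Update(dt);                             // write record 1 back, delete 2-N
    }
}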

John W. Colby
www.ColbyConsulting.com


Gustav Brock wrote:
> Hi John
> 
> I would skip the update and delete parts as these are very slow.
> Instead, read/loop through the sorted table and append a single, merged record per name to a new table (a rough sketch follows the list below).
> 
> Thus:
> 1. Sort by name and "recency" (what is the word for this, the most recent?)
> 2. Loop while collecting data for each name
> 3. For each name, append the collected and arranged data to the new (cleaned) table
> 4. Leave the old table as a backup
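> 
> Something along these lines, perhaps.  I am assuming SQL Server / ADO.NET; the table names 
> (tblPeople, tblPeopleClean - same layout) and the grouping column (PersonKey) are only 
> placeholders for whatever you actually have:
> 
> using System;
> using System.Data;
> using System.Data.SqlClient;
> 
> class AppendToCleanTableSketch
> {
>     // Walk the source table once, sorted by person; whenever the key changes,
>     // append the merged ("collapsed") row to the new, clean table.
>     static void Run(string connStr)
>     {
>         using (var conn = new SqlConnection(connStr))
>         using (var bulk = new SqlBulkCopy(connStr) { DestinationTableName = "tblPeopleClean" })
>         {
>             conn.Open();
>             var cmd = new SqlCommand("SELECT * FROM tblPeople ORDER BY PersonKey, PK", conn);
> 
>             using (var rdr = cmd.ExecuteReader())
>             {
>                 var buffer = new DataTable();              // same layout as the source
>                 for (int i = 0; i < rdr.FieldCount; i++)
>                     buffer.Columns.Add(rdr.GetName(i), rdr.GetFieldType(i));
> 
>                 object key = null;
>                 object[] merged = null;
> 
>                 while (rdr.Read())
>                 {
>                     if (!Equals(rdr["PersonKey"], key))    // a new person starts
>                     {
>                         if (merged != null) buffer.Rows.Add(merged);
>                         key = rdr["PersonKey"];
>                         merged = new object[rdr.FieldCount];
>                     }
>                     for (int i = 0; i < rdr.FieldCount; i++)
>                         if (!rdr.IsDBNull(i))
>                             merged[i] = rdr.GetValue(i);   // later rows overwrite, nulls never do
>                 }
>                 if (merged != null) buffer.Rows.Add(merged);
> 
>                 bulk.WriteToServer(buffer);                // in real use, flush in batches
>             }
>         }
>     }
> }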
> 
> I haven't worked with that many records, so I cannot tell which coding method will turn out fastest.
> 
> /gustav
> 
> 
>>>> jwcolby at colbyconsulting.com 14-10-2009 19:37 >>>
> I have a problem that AFAICT is going to have to be attacked with programming, and it will need to 
> be FAST programming.
> 
> Imagine I have a table with tens of millions of records, hundreds of fields.  In this table I have 
> information about people, likes, uses, owns, income, etc.
> 
> In a small percentage of cases I have "duplicate" records, except that the records are not duplicates per se 
> but rather "belong to the same person" while holding different information about likes, dislikes, uses, 
> owns, etc.
> 
> It is impossible to rate the validity of any record over any other record.  Say I have three records 
> about John Colby: one says he drinks Diet Pepsi and drives a Chevy.  The next "duplicate" record is 
> about John Colby but doesn't have any info about drinks or drives; instead it says he eats Chinese 
> food and listens to the blues.  The third record is also about John Colby; it says he drives a Ford, 
> doesn't mention drinks or eats, but says he listens to classical music.
> 
> Again there is no way to rate the validity or even "most recentness" of the records.
> 
> Now, I need to "merge" these records so that one single record says
> 
> John Colby
> 
> 1 Drinks Diet Pepsi
> 1 Drives Chevy
> 2 Eats Chinese
> 2 Listens Blues
> 3 Drives Ford
> 3 Listens Classical
> 
> Merges into one record:
> 
> Drinks Pepsi
> Drives Ford
> Eats Chinese
> Listens Classical
> 
> OK, I have solved the problem of selecting every record for John Colby.  Now I need to "merge" the 
> data from all the fields of all the records into one single record using the "most recent field 
> wins" strategy.  Every field in every record has to be examined.  "Existing data" cannot be 
> overwritten by "no data" but MUST be overwritten by "most recent data".  Recent is defined by record 
> position, NOT date (I don't have date info).
> 
> As you can see, I cannot simply select one of the records and say that is the real data, I MUST 
> merge all of the records, field by field.  When I have one record with the "sum" of all the data in 
> the multiple records, I write that record back out and delete the "other" records.
> 
> There are millions of duplicates (about 3 million out of 50 million IIRC).  In some cases I have two 
> records for a single person; the maximum I have is about 15 records for a single person.  So if 
> there are 3 million "dupes" I will have a maximum of 1.5 million "merge operations" where I perform 
> this merge process.  In fact it will be somewhat fewer because there are more than 2 dupes in many 
> cases.
> 
> In any case, it is a largish number of records, and a largish number of fields.  My concept of what 
> I would do is:
> 
> 1) Open all records for a given person with dupes, probably in a recordset(?).
> 2) Use the fields collection to iterate through each field position.
> 3) Compare rec1 to recN in a loop for each field.
> 4) If there is data in the current field of the current record, write that data to the current field in 
> record 1.  Note that ACTUAL DATA in CurrentField.CurrentRec overwrites data in 
> CurrentField.Record1 in every case.  OTOH, no data in CurrentField.CurrentRec means no update.
> 5) Increment field.  Go to 4)
> 6) Repeat until all fields compared
> 7) Increment record.  Goto 3)
> 8) Repeat until all records compared to record 1
> 9) Save record1 back to the database
> 10) Delete record 2-N
> 11) Go to 1)
> 12) Repeat until all duplicates processed.  (A rough sketch of this outer loop follows the list.)
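> 
> Roughly, the outer loop might look like this in C#.  PersonKey is a placeholder for however I end 
> up identifying the dupes, and MergeGroup() stands in for steps 2 through 10:
> 
> using System;
> using System.Collections.Generic;
> using System.Data.SqlClient;
> 
> class MergeDriverSketch
> {
>     static void Run(string connStr)
>     {
>         // Step 1: find every person that has more than one record.
>         var keys = new List<int>();
>         using (var conn = new SqlConnection(connStr))
>         {
>             conn.Open();
>             var cmd = new SqlCommand(
>                 "SELECT PersonKey FROM tblPeople GROUP BY PersonKey HAVING COUNT(*) > 1", conn);
>             using (var rdr = cmd.ExecuteReader())
>                 while (rdr.Read())
>                     keys.Add(rdr.GetInt32(0));
>         }
> 
>         // Steps 11 and 12: repeat the per-person merge until every group is done.
>         foreach (int key in keys)
>             MergeGroup(connStr, key);
>     }
> 
>     static void MergeGroup(string connStr, int personKey)
>     {
>         // Steps 2-10: compare each field of records 2..N against record 1,
>         // keep the latest non-empty value, save record 1, delete the rest.
>     }
> }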
> 
> So that is the overview.  This is where all the Masters are invited to chime in with "use a dataset" 
> or "use this other technology", as well as WHY, and where possible pointers to example code or whatever.
> 
> For speed reasons I am looking at doing this with C#.  I could do this in short order in Access, but 
> I am afraid that the speed would be abysmal.  Also I want to be able to use threads, perhaps to spawn 
> off multiple instances of this task, so that my quad core machine could have as many as three of 
> these processes running at any given time.  I really have no clue how fast this would run, so I have 
> no clue whether I would need to do that, but if I make it modular then perhaps I could do that if 
> required.
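> 
> If the per-person merge is wrapped in a method like the MergeGroup() placeholder above, capping it 
> at three concurrent workers might be as simple as this (assuming .NET 4's Task Parallel Library is 
> available):
> 
>         // In the driver sketch, replace the foreach over keys with a parallel loop.
>         // Requires: using System.Threading.Tasks;
>         var options = new ParallelOptions { MaxDegreeOfParallelism = 3 };
>         Parallel.ForEach(keys, options, key => MergeGroup(connStr, key));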
> 


