[dba-VB] Merge / Purge

jwcolby jwcolby at colbyconsulting.com
Wed Oct 14 13:31:14 CDT 2009


Further to this.  I have a hash field which contains an SHA-1 hash of FName + LName + Addr + Zip5 + 
Zip4.  These hash strings are absolutely unique (tested and confirmed), so what I am doing is pulling 
a view grouped by the hash key where Count(PK) > 1.  That gives me the set of all hash keys that have 
multiple ("dupe") records.

The hash key (and in fact all of the name / address fields) live in a separate table related 1 
to 1 with the fact table which I am "merging".  So the hash would really pull a set of PKIDs, which 
would then be used to pull all of the fact records for the person that the hash key belongs to.  Once 
those fact records are sitting in memory I would think the update code would zip along pretty well.
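
A sketch of pulling one person's fact records into memory, again with assumed names (tblFact, 
tblNameAddr, PKID, HashKey) and nothing tested:

using System.Data;
using System.Data.SqlClient;

static class FactLoader
{
    // Pull every fact record belonging to one hash key into a DataTable.
    public static DataTable LoadFactRecords(string connString, string hashKey)
    {
        const string sql =
            "SELECT f.* FROM tblFact f " +
            "INNER JOIN tblNameAddr n ON n.PKID = f.PKID " +
            "WHERE n.HashKey = @HashKey";

        var dt = new DataTable();
        using (var cn = new SqlConnection(connString))
        using (var da = new SqlDataAdapter(sql, cn))
        {
            da.SelectCommand.Parameters.AddWithValue("@HashKey", hashKey);
            da.Fill(dt);   // Fill opens and closes the connection itself
        }
        return dt;
    }
}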



John W. Colby
www.ColbyConsulting.com


Gustav Brock wrote:
> Hi John
> 
> I would skip the update and delete parts as these are very slow.
> Instead, read/loop through the sorted table and append the single records to a new table.
> 
> Thus:
> 1. Sort by name and recency (what is the word for this, the most recent?)
> 2. Loop while collecting data for each name
> 3. For each name, append the collected and arranged data to the new (cleaned) table
> 4. Leave the old table as a backup
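> 
> Something along these lines is what I mean: one forward pass over the sorted source, flushing a 
> merged row whenever the key changes.  A rough sketch only; tblSource, PersonKey and AppendRow are 
> placeholders and none of it is tested:
> 
> using System.Data.SqlClient;
> 
> static class SortAndAppend
> {
>     public static void Run(string connString)
>     {
>         // PersonKey is a placeholder for whatever identifies one person
>         // (name + address, or a hash of them); PKID order stands in for recency.
>         const string sql =
>             "SELECT * FROM tblSource ORDER BY PersonKey, PKID";
> 
>         using (var cn = new SqlConnection(connString))
>         using (var cmd = new SqlCommand(sql, cn))
>         {
>             cn.Open();
>             using (var rdr = cmd.ExecuteReader())
>             {
>                 string currentKey = null;
>                 object[] merged = null;
>                 while (rdr.Read())
>                 {
>                     var key = (string)rdr["PersonKey"];
>                     if (key != currentKey)
>                     {
>                         if (merged != null) AppendRow(merged);   // flush the previous person
>                         currentKey = key;
>                         merged = new object[rdr.FieldCount];
>                     }
>                     // A later non-null value overwrites; nulls never do.
>                     for (int i = 0; i < rdr.FieldCount; i++)
>                         if (!rdr.IsDBNull(i)) merged[i] = rdr.GetValue(i);
>                 }
>                 if (merged != null) AppendRow(merged);           // flush the last person
>             }
>         }
>     }
> 
>     // Placeholder: append the merged row to the new (cleaned) table,
>     // e.g. via SqlBulkCopy or a parameterised INSERT.
>     static void AppendRow(object[] row) { }
> }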
> 
> I haven't worked with that many records so I cannot tell which coding method will turn out fastest.
> 
> /gustav
> 
> 
>>>> jwcolby at colbyconsulting.com 14-10-2009 19:37 >>>
> I have a problem that AFAICT is going to have to be attacked with programming, and it will need to 
> be FAST programming.
> 
> Imagine I have a table with tens of millions of records, hundreds of fields.  In this table I have 
> information about people, likes, uses, owns, income, etc.
> 
> In a small percentage of cases I have "duplicate" records, except that the records are not duplicates 
> per se but rather belong to the same person, each with different information about likes, dislikes, 
> uses, owns, etc.
> 
> It is impossible to rate the validity of any record over any other record.  So I have three records 
> about John Colby, one says he drinks diet Pepsi, and drives a Chevy.  The next "duplicate" record is 
> about John Colby but doesn't have any info about drinks or drives but has info about eats Chinese 
> food and listens to the blues.  The third record is also about John Colby, says he drives a Ford, 
> doesn't mention drinks or eats, but says he listens to classical music.
> 
> Again there is no way to rate the validity or even "most recentness" of the records.
> 
> Now, I need to "merge" these records so that one single record says
> 
> John Colby
> 
> 1 Drinks Diet Pepsi
> 1 Drives Chevy
> 2 Eats Chinese
> 2 Listens Blues
> 3 Drives Ford
> 3 Listens Classical
> 
> Merges into one record:
> 
> Drinks Diet Pepsi
> Drives Ford
> Eats Chinese
> Listens Classical
> 
> OK, I have solved the problem of selecting every record for John Colby.  Now I need to "merge" the 
> data from all the fields of all the records into one single record using the "most recent field 
> wins" strategy.  Every field in every record has to be examined.  "Existing data" cannot be 
> overwritten by "no data" but MUST be overwritten by "most recent data".  Recent is defined by record 
> position, NOT date (I don't have date info).
> 
> As you can see, I cannot simply select one of the records and say that is the real data, I MUST 
> merge all of the records, field by field.  When I have one record with the "sum" of all the data in 
> the multiple records, I write that record back out and delete the "other" records.
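> 
> The field-level rule itself is tiny.  A sketch of the test I have in mind; the blank-string check is 
> an assumption about what "no data" looks like in these tables:
> 
> using System;
> 
> static class MergeRules
> {
>     // "Most recent field wins": the later value overwrites only where it
>     // actually contains data; null / DBNull / blank never overwrites.
>     public static object MergeField(object current, object candidate)
>     {
>         bool hasData = candidate != null
>                     && candidate != DBNull.Value
>                     && candidate.ToString().Trim().Length > 0;
>         return hasData ? candidate : current;
>     }
> }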
> 
> There are millions of duplicates (about 3 million out of 50 million IIRC).  In some cases I have two 
> records for a single person, the maximum I have is about 15 records for a single person.  So if 
> there are 3 million "dupes" I will have a maximum of 1.5 million "merge operations" where I perform 
> this merge process.  In fact it will be somewhat fewer because there are more than 2 dupes in many 
> cases.
> 
> In any case, it is a largish number of records, and a largish number of fields.  My concept of what 
> I would do is (a rough sketch follows the list):
> 
> 1) Open all records for a given person with dupes, probably in a recordset(?).
> 2) Use the fields collection to iterate through each field position.
> 3) Compare rec1 to recN in a loop for each field.
> 4) If data in the current field of the current record, write that data to the current field in 
> record 1.  Notice that ACTUAL DATA in CurrentField.CurrentRec overwrites data in 
> CurrentField.Record1 in every case.  OTOH, No data in CurrentField.CurrentRec, no update.
> 5) Increment field.  Go to 4)
> 6) Repeat until all fields compared
> 7) Increment record.  Goto 3)
> 8) Repeat until all records compared to record 1
> 9) Save record1 back to the database
> 10) Delete record 2-N
> 11) Go to 1)
> 12) Repeat until all duplicates processed.
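> 
> And the rough sketch: steps 2 through 10, working against the in-memory DataTable of one person's 
> records (oldest first, PK in column 0; both of those are assumptions):
> 
> using System;
> using System.Data;
> 
> static class Merger
> {
>     // dt holds every record for one person, ordered so later rows are "more recent".
>     public static void MergeIntoFirstRow(DataTable dt)
>     {
>         DataRow target = dt.Rows[0];
>         for (int r = 1; r < dt.Rows.Count; r++)
>         {
>             DataRow source = dt.Rows[r];
>             for (int c = 1; c < dt.Columns.Count; c++)     // column 0 = PK, leave it alone
>             {
>                 object val = source[c];
>                 bool hasData = val != DBNull.Value &&
>                                val.ToString().Trim().Length > 0;
>                 if (hasData) target[c] = val;              // real data overwrites, blanks never do
>             }
>             source.Delete();                               // step 10: drop records 2..N
>         }
>         // A SqlDataAdapter.Update(dt) with Update and Delete commands wired up
>         // would then write record 1 back and remove the deleted rows (steps 9 and 10).
>     }
> }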
> 
> So that is the overview.  This is where all the Masters are invited to chime in with "use a dataset" 
> or "use this other technology", as well as WHY, and where possible pointers to example code or whatever.
> 
> For speed reasons I am looking at doing this with C#.  I could do this in short order in Access but 
> I am afraid that the speed would be abysmal.  Also I want to be able to use threads perhaps to spawn 
> off multiple instances of this task so that my quad core machine could have as many as three of 
> these processes running at any given time.  I really have no idea how fast this will run, so I don't 
> know whether that will be necessary, but if I make the code modular I can add threading later if 
> required.
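> 
> On the threading side, the .NET 4 Task Parallel Library (still in beta as I write this) should make 
> the fan-out cheap: Parallel.ForEach over the set of dupe keys, capped at three workers so one core 
> stays free.  A sketch, where both helpers are stand-ins for the query and merge pieces above:
> 
> using System;
> using System.Collections.Generic;
> using System.Threading.Tasks;
> 
> static class Driver
> {
>     public static void Run(string connString)
>     {
>         // Cap at three workers so one core stays free for SQL Server / the OS.
>         var options = new ParallelOptions { MaxDegreeOfParallelism = 3 };
>         Parallel.ForEach(GetDupeKeys(connString), options, key =>
>         {
>             // Each worker must open its own connection; SqlConnection is not thread safe.
>             MergeOnePerson(connString, key);
>         });
>     }
> 
>     // Stand-ins for the pieces above: pull the keys that have dupes,
>     // and run the field-by-field merge for one person's records.
>     static List<string> GetDupeKeys(string connString)
>     { throw new NotImplementedException(); }
> 
>     static void MergeOnePerson(string connString, string key)
>     { throw new NotImplementedException(); }
> }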
> 


