[dba-VB] Merge / Purge

jwcolby jwcolby at colbyconsulting.com
Wed Oct 14 13:39:10 CDT 2009


You are right of course, but I am thinking the update would just fly.  There are about 550 fields in 
the fact table.  Thus I would be comparing / updating 550 fields, sequentially, between Rec1 and 
Rec2, then Rec1 / Rec3, Rec1 / Rec4, etc. until all fact records for a given hash key (person) are 
finished.
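
The core of what I am picturing for the in-memory compare is something like the sketch below (just 
a sketch: it assumes the duplicate rows for one hash key are already loaded into an ADO.NET 
DataTable, oldest row first, and it needs using System.Data):

// Sketch only.  Fold rows 2..N into row 1: the most recent field wins,
// but a blank / null field never overwrites existing data.
private static void MergeIntoFirstRow(DataTable dupes)
{
    DataRow target = dupes.Rows[0];
    for (int r = 1; r < dupes.Rows.Count; r++)
    {
        DataRow source = dupes.Rows[r];
        foreach (DataColumn col in dupes.Columns)
        {
            object val = source[col];
            if (val != DBNull.Value && val.ToString().Trim().Length > 0)
                target[col] = val;      // actual data always overwrites
        }
    }
}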

The "facts" can be anything from strings of 20-30 bytes down to a simple "Y".  The vast majority of 
fields contain "Y" or nothing.  A small minority contain codes ('A' - 'T', for example), and a very 
small number contain somewhat longer strings.

This is not a transaction database, so there will be no locking contention, etc.  Just me and my 
update program whistling away.

I will be working on a copy of the main table of course.  Maybe writing the modified record 
(Record 1) to one table and the deleted records (Records 2-N) to another table, then doing the 
deletes / inserts at the end.  It probably will pay to copy all of the duplicate fact records out 
to a temp table and work on that as the source table.
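
For the staging and the final deletes / inserts I am thinking of something along these lines (a 
sketch only, assuming a SQL Server back end: tblFact, tblFactDupes, tblFactMerged, tblFactDeletes, 
HashKey, PK and connString are all placeholder names, and it needs using System.Data.SqlClient):

using (SqlConnection cn = new SqlConnection(connString))
{
    cn.Open();

    // Copy every fact record whose hash key appears more than once to a work table.
    string copyDupes =
        @"SELECT f.* INTO tblFactDupes FROM tblFact f
          WHERE f.HashKey IN (SELECT HashKey FROM tblFact
                              GROUP BY HashKey HAVING COUNT(*) > 1)";
    using (SqlCommand cmd = new SqlCommand(copyDupes, cn))
    {
        cmd.CommandTimeout = 0;
        cmd.ExecuteNonQuery();
    }

    // ... merge in memory, writing the surviving record 1s to tblFactMerged
    //     and records 2-N to tblFactDeletes ...

    // Final pass: pull all of the dupes out of the main table, put the merged rows back.
    string swap =
        @"DELETE f FROM tblFact f
            INNER JOIN tblFactDupes d ON d.PK = f.PK;
          INSERT INTO tblFact SELECT * FROM tblFactMerged;";
    using (SqlCommand cmd = new SqlCommand(swap, cn))
    {
        cmd.CommandTimeout = 0;
        cmd.ExecuteNonQuery();
    }
}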

John W. Colby
www.ColbyConsulting.com


Gustav Brock wrote:
> Hi John
> 
> Yes, you may be right, except that the amount of data is so big. But again, I can't tell for sure.
> However, my method is so simple that in any case I would try it first. If speed shows up as an issue you can easily expand to follow your scheme.
> 
> /gustav
> 
>>>> jwcolby at colbyconsulting.com 14-10-2009 20:05 >>>
> Gustav,
> 
> My thought was that I would be doing this in a disconnected ADO recordset.  However, as we all know, 
> I do not understand the technology.  The vision in my head says that once the records are pulled into 
> a disconnected recordset then the data is sitting in memory in some kind of .NET / ADO in memory 
> structure.  I would think that doing the update there would be pretty fast.  Once that first record 
> was updated (in memory) with the values from all subsequent records, then write that updated (first) 
> record back to the physical database and delete the subsequent records from the physical database.
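> 
> The pattern in my head is something like this sketch (assuming SQL Server and System.Data.SqlClient; 
> tblFactDupes, HashKey, PK and connString are assumptions, and in practice it would probably be 
> filled one batch of hash keys at a time rather than the whole table at once):
> 
> using (SqlDataAdapter da = new SqlDataAdapter(
>            "SELECT * FROM tblFactDupes ORDER BY HashKey, PK", connString))
> using (SqlCommandBuilder cb = new SqlCommandBuilder(da))
> {
>     DataTable dt = new DataTable();
>     da.Fill(dt);                 // pull the records into memory, disconnected
>     // ... fold each hash key's rows into its first row, then call
>     //     row.Delete() on rows 2-N ...
>     da.Update(dt);               // writes the one update and the deletes back
> }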
> 
> But what do I know?
> 
> John W. Colby
> www.ColbyConsulting.com 
> 
> 
> Gustav Brock wrote:
>> Hi John
>>
>> I would skip the update and delete parts as these are very slow.
>> Instead, read/loop through the sorted table and append the single records to a new table.
>>
>> Thus:
>> 1. Sort by name and "recency" (what is the word for this, the most recent?)
>> 2. Loop while collecting data for each name
>> 3. For each name, append the collected and arranged data to the new (cleaned) table (sketched below)
>> 4. Leave the old table as a backup
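>>
>> In code that pass could look something like this (just a sketch, assuming SQL Server: tblFact, 
>> tblFactClean, HashKey and connString are placeholder names, BuildEmptyCopyOfFactTable() is an 
>> assumed helper, and you would flush the output to the server every so many thousand rows rather 
>> than hold it all in memory):
>>
>> DataTable outTbl = BuildEmptyCopyOfFactTable();   // assumed helper: same columns as tblFact
>> using (SqlConnection cn = new SqlConnection(connString))
>> using (SqlCommand cmd = new SqlCommand("SELECT * FROM tblFact ORDER BY HashKey", cn))
>> {
>>     cn.Open();
>>     using (SqlDataReader rdr = cmd.ExecuteReader())
>>     {
>>         DataRow current = null;
>>         string lastKey = null;
>>         while (rdr.Read())
>>         {
>>             string key = rdr["HashKey"].ToString();
>>             if (key != lastKey)                  // new person: start a new output row
>>             {
>>                 current = outTbl.NewRow();
>>                 outTbl.Rows.Add(current);
>>                 lastKey = key;
>>             }
>>             for (int i = 0; i < rdr.FieldCount; i++)
>>                 if (!rdr.IsDBNull(i) && rdr.GetValue(i).ToString().Trim().Length > 0)
>>                     current[i] = rdr.GetValue(i);   // later rows win, blanks never do
>>         }
>>     }
>>     using (SqlBulkCopy bcp = new SqlBulkCopy(connString))
>>     {
>>         bcp.DestinationTableName = "tblFactClean";
>>         bcp.WriteToServer(outTbl);               // append the cleaned rows
>>     }
>> }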
>>
>> I haven't worked with that many records so I cannot tell which coding method will turn out fastest.
>>
>> /gustav
>>
>>
>>>>> jwcolby at colbyconsulting.com 14-10-2009 19:37 >>>
>> I have a problem that AFAICT is going to have to be attacked with programming, and it will need to 
>> be FAST programming.
>>
>> Imagine I have a table with tens of millions of records, hundreds of fields.  In this table I have 
>> information about people, likes, uses, owns, income, etc.
>>
>> In a small percentage of cases I have "duplicate" records, except that the records are not duplicates 
>> per se but rather "belong to the same person" while holding different information about likes, 
>> dislikes, uses, owns, etc.
>>
>> It is impossible to rate the validity of any record over any other record.  So I have three records 
>> about John Colby, one says he drinks diet Pepsi, and drives a Chevy.  The next "duplicate" record is 
>> about John Colby but doesn't have any info about drinks or drives but has info about eats Chinese 
>> food and listens to the blues.  The third record is also about John Colby, says he drives a Ford, 
>> doesn't mention drinks or eats, but says he listens to classical music.
>>
>> Again there is no way to rate the validity or even "most recentness" of the records.
>>
>> Now, I need to "merge" these records so that one single record says
>>
>> John Colby
>>
>> 1 Drinks Diet Pepsi
>> 1 Drives Chevy
>> 2 Eats Chinese
>> 2 Listens Blues
>> 3 Drives Ford
>> 3 Listens Classical
>>
>> Merges into one record:
>>
>> Drinks Diet Pepsi
>> Drives Ford
>> Eats Chinese
>> Listens Classical
>>
>> OK, I have solved the problem of selecting every record for John Colby.  Now I need to "merge" the 
>> data from all the fields of all the records into one single record using the "most recent field 
>> wins" strategy.  Every field in every record has to be examined.  "Existing data" cannot be 
>> overwritten by "no data" but MUST be overwritten by "most recent data".  Recent is defined by record 
>> position, NOT date (I don't have date info).
>>
>> As you can see, I cannot simply select one of the records and say that is the real data, I MUST 
>> merge all of the records, field by field.  When I have one record with the "sum" of all the data in 
>> the multiple records, I write that record back out and delete the "other" records.
>>
>> There are millions of duplicates (about 3 million out of 50 million IIRC).  In some cases I have two 
>> records for a single person, the maximum I have is about 15 records for a single person.  So if 
>> there are 3 million "dupes" I will have a maximum of 1.5 million "merge operations" where I perform 
>> this merge process.  In fact it will be somewhat fewer because there are more than 2 dupes in many 
>> cases.
>>
>> In any case, it is a largish number of records, and a largish number of fields.  My concept of what 
>> I would do is:
>>
>> 1) Open all records for a given person with dupes, probably in a recordset(?).
>> 2) Use the fields collection to iterate through each field position.
>> 3) Compare rec1 to recN in a loop for each field.
>> 4) If there is data in the current field of the current record, write that data to the current field 
>> in record 1.  Notice that ACTUAL DATA in CurrentField.CurrentRec overwrites data in 
>> CurrentField.Record1 in every case.  OTOH, no data in CurrentField.CurrentRec means no update.
>> 5) Increment field.  Go to 4)
>> 6) Repeat until all fields compared
>> 7) Increment record.  Goto 3)
>> 8) Repeat until all records compared to record 1
>> 9) Save record1 back to the database
>> 10) Delete record 2-N
>> 11) Go to 1)
>> 12) Repeat until all duplicates processed.  (The outer loop is sketched just below.)
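>>
>> In code the outer loop (steps 1 and 9 - 12) might look something like this, where 
>> GetDuplicateHashKeys(), LoadRecordsFor(), MergeIntoFirstRow(), SaveRecord() and DeleteRecord() 
>> are assumed helpers standing in for the numbered steps:
>>
>> foreach (string key in GetDuplicateHashKeys())       // step 1: who has dupes
>> {
>>     DataTable dupes = LoadRecordsFor(key);           // step 1: open all records for that person
>>     MergeIntoFirstRow(dupes);                        // steps 2 - 8: the field-by-field compare
>>     SaveRecord(dupes.Rows[0]);                       // step 9: write record 1 back
>>     for (int r = 1; r < dupes.Rows.Count; r++)       // step 10: delete records 2-N
>>         DeleteRecord(dupes.Rows[r]);
>> }                                                    // steps 11 - 12: next set of dupes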
>>
>> So that is the overview.  This is where all the Masters are invited to chime in with "use a dataset" 
>> or "use this other technology", as well as WHY, and where possible pointers to example code or whatever.
>>
>> For speed reasons I am looking at doing this with C#.  I could do this in short order in Access, but 
>> I am afraid that the speed would be abysmal.  I also want to be able to use threads, perhaps to spawn 
>> off multiple instances of this task so that my quad core machine could have as many as three of 
>> these processes running at any given time.  I really have no clue how fast this would run, so I have 
>> no clue whether I would need that, but if I make it modular then perhaps I could add it if required.
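>>
>> If the threading does turn out to be worth it, I am picturing something like this sketch: three 
>> plain worker threads, each taking every third hash key (GetDuplicateHashKeys() and 
>> MergeOnePerson() are assumed helpers, each worker would need its own connection, and it needs 
>> using System.Threading and System.Collections.Generic):
>>
>> List<string> keys = GetDuplicateHashKeys();
>> const int workers = 3;                               // leave a core free for SQL Server
>> List<Thread> threads = new List<Thread>();
>> for (int w = 0; w < workers; w++)
>> {
>>     int slice = w;                                   // private copy for the anonymous delegate
>>     Thread t = new Thread(delegate()
>>     {
>>         for (int i = slice; i < keys.Count; i += workers)
>>             MergeOnePerson(keys[i]);                 // one full merge / save / delete pass
>>     });
>>     t.Start();
>>     threads.Add(t);
>> }
>> foreach (Thread t in threads)
>>     t.Join();                                        // wait for all of the workers to finish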
> 
> 


