Gustav Brock
Gustav at cactus.dk
Wed Oct 14 12:52:41 CDT 2009
Hi John

I would skip the update and delete parts, as these are very slow. Instead, read/loop through the sorted table and append the single records to a new table. Thus:

1. Sort by name and recency (is that the word for "most recent first"?)
2. Loop while collecting data for each name
3. For each name, append the collected and arranged data to the new (cleaned) table
4. Leave the old table as a backup

I haven't worked with that many records, so I cannot tell which coding method will turn out fastest.

/gustav

>>> jwcolby at colbyconsulting.com 14-10-2009 19:37 >>>
I have a problem that AFAICT is going to have to be attacked with programming, and it will need to be FAST programming.

Imagine I have a table with tens of millions of records and hundreds of fields. In this table I have information about people: likes, uses, owns, income, etc. A small percentage are "duplicate" records, except that the records are not duplicates per se: they belong to the same person but hold different information about likes, dislikes, uses, owns, etc. It is impossible to rate the validity of any record over any other record.

So I have three records about John Colby. One says he drinks Diet Pepsi and drives a Chevy. The next "duplicate" record is about John Colby but doesn't have any info about drinks or drives; it says he eats Chinese food and listens to the blues. The third record is also about John Colby; it says he drives a Ford, doesn't mention drinks or eats, but says he listens to classical music. Again, there is no way to rate the validity or even the "most recentness" of the records.

Now, I need to "merge" these records so that one single record says John Colby:

1  Drinks Diet Pepsi
1  Drives Chevy
2  Eats Chinese
2  Listens Blues
3  Drives Ford
3  Listens Classical

merges into one record:

   Drinks Pepsi
   Drives Ford
   Eats Chinese
   Listens Classical

OK, I have solved the problem of selecting every record for John Colby.
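For illustration, the sort-collect-append approach suggested in the reply at the top of the thread can be sketched roughly as below. This is Python rather than the Access or C# the thread discusses, and the row layout and field names are hypothetical stand-ins; it assumes rows arrive already sorted by name with the most recent duplicate last.

```python
# Hedged sketch: read the sorted table once, merge each name's group,
# append one record per name to a new (cleaned) table. Rows are modeled
# as dicts; "no data" is None or "" here, which is an assumption.
from itertools import groupby

def merge_group(rows):
    """Collapse one person's duplicates: later non-empty values win."""
    merged = dict(rows[0])
    for row in rows[1:]:
        for field, value in row.items():
            if value not in (None, ""):   # never overwrite with "no data"
                merged[field] = value
    return merged

def clean_table(sorted_rows):
    """Single pass over the sorted table; append, never update/delete."""
    cleaned = []
    for _, group in groupby(sorted_rows, key=lambda r: r["name"]):
        cleaned.append(merge_group(list(group)))
    return cleaned
```

Applied to the three example records for John Colby, this yields one record with Diet Pepsi, Ford, Chinese, and Classical, since the Ford and Classical values come from the more recent rows.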
Now I need to "merge" the data from all the fields of all the records into one single record, using a "most recent field wins" strategy. Every field in every record has to be examined. Existing data cannot be overwritten by "no data" but MUST be overwritten by more recent data. "Recent" is defined by record position, NOT by date (I don't have date info). As you can see, I cannot simply select one of the records and say that is the real data; I MUST merge all of the records, field by field. When I have one record with the "sum" of all the data in the multiple records, I write that record back out and delete the other records.

There are millions of duplicates (about 3 million out of 50 million, IIRC). In some cases I have two records for a single person; the maximum is about 15 records for a single person. So if there are 3 million "dupes" I will have at most 1.5 million merge operations, and in fact somewhat fewer, because in many cases there are more than two dupes. In any case, it is a largish number of records and a largish number of fields.

My concept of what I would do is:

1) Open all records for a given person with dupes, probably in a recordset(?).
2) Use the Fields collection to iterate through each field position.
3) Compare rec1 to recN in a loop for each field.
4) If there is data in the current field of the current record, write that data to the current field in record 1. Notice that ACTUAL DATA in CurrentField.CurrentRec overwrites data in CurrentField.Record1 in every case. OTOH, no data in CurrentField.CurrentRec, no update.
5) Increment field. Go to 4).
6) Repeat until all fields compared.
7) Increment record. Go to 3).
8) Repeat until all records compared to record 1.
9) Save record 1 back to the database.
10) Delete records 2-N.
11) Go to 1).
12) Repeat until all duplicates processed.

So that is the overview.
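The numbered steps above map fairly directly onto a loop over field positions. A hedged Python sketch follows, standing in for the recordset/Fields-collection code described; the list-of-values row layout and the "no data means None or empty string" rule are assumptions, not something the original specifies.

```python
# Rough sketch of steps 1-10 for one person's group of duplicates.
# `fields` mimics iterating the Fields collection by position;
# rows[0] is record 1, later rows are the "more recent" records.
def merge_duplicates(fields, rows):
    """Merge records 2..N into record 1; later non-empty data wins."""
    rec1 = list(rows[0])
    for rec in rows[1:]:                 # 7) increment record
        for i in range(len(fields)):     # 2)/5) walk each field position
            if rec[i] not in (None, ""): # 4) ACTUAL DATA overwrites;
                rec1[i] = rec[i]         #    "no data" never updates
    return rec1                          # 9) save record 1 (2..N deleted)
```

Record deletion (step 10) and the outer loop over all duplicate groups (steps 11-12) would wrap this function in the real database code.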
This is where all the Masters are invited to chime in with "use a dataset" or "use this other technology", as well as WHY, and where possible pointers to example code or whatever.

For speed reasons I am looking at doing this in C#. I could do this in short order in Access, but I am afraid the speed would be abysmal. I also want to be able to use threads, perhaps to spawn off multiple instances of this task so that my quad-core machine could have as many as three of these processes running at any given time. I really have no clue how fast this would run, so I have no clue whether I would need to do that, but if I make it modular then perhaps I could do that if required.

--
John W. Colby
www.ColbyConsulting.com
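One way to sketch the "up to three parallel processes" idea: partition people by a stable hash of the name, so each worker owns a disjoint set of duplicate groups and no merge ever spans two processes. This is Python for illustration only (the poster is considering C# threads); the function names and the choice of CRC32 as the partitioning hash are assumptions.

```python
# Hedged sketch: assign each person to exactly one of N workers.
# zlib.crc32 is used because it is stable across runs, unlike
# Python's built-in hash() for strings.
import zlib

def bucket_of(name, n_workers=3):
    """Stable worker index for a person's name."""
    return zlib.crc32(name.encode("utf-8")) % n_workers

def partition(names, n_workers=3):
    """Split the distinct names into one work list per worker."""
    buckets = [[] for _ in range(n_workers)]
    for name in names:
        buckets[bucket_of(name, n_workers)].append(name)
    return buckets
```

Each bucket could then be handed to its own process (or thread), each running the merge loop independently over its slice of names.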