jwcolby
jwcolby at colbyconsulting.com
Wed Oct 14 13:39:10 CDT 2009
You are right of course, but I am thinking the update would just fly. There are about 550 fields in the fact table. Thus I would be comparing / updating 550 fields, sequentially, between Rec1 and Rec2, then Rec1 / Rec3, Rec1 / Rec4 and so on until all fact records for a given hash key (person) are finished.

The "facts" can be anything from strings of 20-30 bytes down to a simple "Y". The vast majority of fields contain "Y" or nothing. A small minority contain codes ('A' - 'T'), for example, and a very small number contain somewhat longer strings.

This is not a transaction database, so there will be no locking contention etc. Just me and my update program whistling away. I will be working on a copy of the main table, of course, maybe writing the modified record (Record 1) to one table and the deleted records (Records 2-N) to another table, and at the end doing the deletes / inserts. It probably will pay to copy all of the duplicate fact records out to a temp table and work on that as the source table.

John W. Colby
www.ColbyConsulting.com

Gustav Brock wrote:
> Hi John
>
> Yes, you may be right, except that the amount of data is so big. But again, I can't tell for sure.
> However, my method is so plain and simple that in any case I would try that first. If speed shows up as an issue, you can easily expand to follow your scheme.
>
> /gustav
>
>>>> jwcolby at colbyconsulting.com 14-10-2009 20:05 >>>
> Gustav,
>
> My thought was that I would be doing this in a disconnected ADO recordset. However, as we all know, I do not understand the technology. The vision in my head says that once the records are pulled into a disconnected recordset, the data is sitting in memory in some kind of .NET / ADO in-memory structure. I would think that doing the update there would be pretty fast. Once that first record was updated (in memory) with the values from all subsequent records, I would write that updated (first) record back to the physical database and delete the subsequent records from the physical database.
>
> But what do I know?
>
> John W. Colby
> www.ColbyConsulting.com
>
> Gustav Brock wrote:
>> Hi John
>>
>> I would skip the update and delete parts, as these are very slow.
>> Instead, read/loop through the sorted table and append the single records to a new table.
>>
>> Thus:
>> 1. Sort by name and "recency" (what is the word for this, the most recent?)
>> 2. Loop while collecting data for each name
>> 3. For each name, append the collected and arranged data to the new (cleaned) table
>> 4. Leave the old table as a backup
>>
>> I haven't worked with that many records, so I cannot tell which coding method will turn out fastest.
>>
>> /gustav
>>
>>>>> jwcolby at colbyconsulting.com 14-10-2009 19:37 >>>
>> I have a problem that AFAICT is going to have to be attacked with programming, and it will need to be FAST programming.
>>
>> Imagine I have a table with tens of millions of records and hundreds of fields. In this table I have information about people: likes, uses, owns, income, etc.
>>
>> In a small percentage I have "duplicate" records, except that the records are not duplicates per se but rather "belong to the same person" and have different information about likes, dislikes, uses, owns, etc.
>>
>> It is impossible to rate the validity of any record over any other record. So I have three records about John Colby: one says he drinks Diet Pepsi and drives a Chevy.
The next "duplicate" record is >> about John Colby but doesn't have any info about drinks or drives but has info about eats Chinese >> food and listens to the blues. The third record is also about John Colby, says he drives a Ford, >> doesn't mention drinks or eats, but says he listens to classical music. >> >> Again there is no way to rate the validity or even "most recentness" of the records. >> >> Now, I need to "merge" these records so that one single record says >> >> John Colby >> >> 1 Drinks Diet Pepsi >> 1 Drives Chevy >> 2 Eats Chinese >> 2 Listens Blues >> 3 Drives Ford >> 3 Listens Classical >> >> Merges into one record: >> >> Drinks Pepsi >> Drives Ford >> Eats Chinese >> Listens Classical >> >> OK, I have solved the problem of selecting every record for John Colby. Now I need to "merge" the >> data from all the fields of all the records into one single record using the "most recent field >> wins" strategy. Every field in every record has to be examined. "Existing data" cannot be >> overwritten by "no data" but MUST be overwritten by "most recent data". Recent is defined by record >> position, NOT date (I don't have date info). >> >> As you can see, I cannot simply select one of the records and say that is the real data, I MUST >> merge all of the records, field by field. When I have one record with the "sum" of all the data in >> the multiple records, I write that record back out and delete the "other" records. >> >> There are millions of duplicates (about 3 million out of 50 million IIRC). In some cases I have two >> records for a single person, the maximum I have is about 15 records for a single person. So if >> there are 3 million "dupes" I will have a maximum of 1.5 million "merge operations" where I perform >> this merge process. In fact it will be somewhat fewer because there are more than 2 dupes in many >> cases. >> >> In any case, it is a largish number of records, and a largish number of fields. My concept of what >> I would do is: >> >> 1) Open all records for a given person with dupes, probably in a recordset(?). >> 2) Use the fields collection to iterate through each field position. >> 3) Compare rec1 to recN in a loop for each field. >> 4) If data in the current field of the current record, write that data to the current field in >> record 1. Notice that ACTUAL DATA in CurrentField.CurrentRec overwrites data in >> CurrentField.Record1 in every case. OTOH, No data in CurrentField.CurrentRec, no update. >> 5) Increment field. Go to 4) >> 6) Repeat until all fields compared >> 7) Increment record. Goto 3) >> 8) Repeat until all records compared to record 1 >> 9) Save record1 back to the database >> 10) Delete record 2-N >> 11 Go to 1) >> 12 Repeat until all duplicates processed. >> >> So that is the overview. This is where all the Masters are invited to chime in with "use a dataset" >> or "use this other technology", as well as WHY, and where possible pointers to example code or whatever. >> >> For speed reasons I am looking at doing this with C#. I could do this in short order in Access but >> I am afraid that the speed would be abysmal. Also I want to be able to use threads perhaps to spawn >> off multiple instances of this task so that my quad core machine could have as many as three of >> these processes running at any given time. I really have no clue how fast this would run so I have >> no clue whether I would need to do that but if I make it modular then perhaps I could do that if >> required. 
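
Gustav's alternative (sort by the person key, loop, and append one merged row per person to a new clean table, never updating or deleting in place) might look roughly like the sketch below. It streams the fact table through a SqlDataReader ordered by the key, applies the same "non-empty field wins" rule per group, and appends the merged rows with SqlBulkCopy; the use of SqlBulkCopy, the batch size, and all table, column, and connection names are assumptions for illustration, not details from the thread:

using System;
using System.Data;
using System.Data.SqlClient;

static class CleanTableBuilder
{
    public static void Build(string connStr)
    {
        using (var src = new SqlConnection(connStr))
        using (var dst = new SqlConnection(connStr))
        {
            src.Open();
            dst.Open();

            var cmd = new SqlCommand(
                "SELECT * FROM dbo.FactTable ORDER BY PersonHashKey", src);
            cmd.CommandTimeout = 0;   // long-running scan

            using (SqlDataReader reader = cmd.ExecuteReader())
            using (var bulk = new SqlBulkCopy(dst) { DestinationTableName = "dbo.FactTableClean" })
            {
                // Buffer of merged rows, one per person, shaped like the source.
                var merged = new DataTable();
                for (int i = 0; i < reader.FieldCount; i++)
                    merged.Columns.Add(reader.GetName(i), reader.GetFieldType(i));

                int keyOrdinal = reader.GetOrdinal("PersonHashKey");
                object currentKey = null;
                DataRow target = null;

                while (reader.Read())
                {
                    object key = reader[keyOrdinal];

                    if (!key.Equals(currentKey))
                    {
                        // Key boundary: the previous person is complete, so this
                        // is a safe point to flush a batch to the clean table.
                        if (merged.Rows.Count >= 10000)
                        {
                            bulk.WriteToServer(merged);
                            merged.Clear();
                        }

                        currentKey = key;
                        target = merged.NewRow();        // first record for this person
                        for (int i = 0; i < reader.FieldCount; i++)
                            target[i] = reader[i];
                        merged.Rows.Add(target);
                    }
                    else
                    {
                        // Duplicate for the same person: non-empty fields win.
                        for (int i = 0; i < reader.FieldCount; i++)
                        {
                            object v = reader[i];
                            if (v != DBNull.Value && v.ToString().Trim().Length > 0)
                                target[i] = v;
                        }
                    }
                }

                if (merged.Rows.Count > 0)
                    bulk.WriteToServer(merged);          // final partial batch
            }
        }
    }
}

Whether this streaming append beats the update/delete approach at 50 million rows is exactly the open question in the thread; either way, the ORDER BY on the person key (or a clustered index on it) is doing most of the heavy lifting.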