jwcolby
jwcolby at colbyconsulting.com
Wed Oct 14 13:39:10 CDT 2009
You are right of course, but I am thinking the update would just fly. There are about 550 fields in
the fact table. Thus I would be comparing / updating 550 fields, sequentially, between Rec1 and
Rec2, then Rec1 / Rec3, Rec1 / Rec4, etc., until all fact records for a given hash key (person) are
finished.
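Roughly, the inner loop I have in mind (untested; it assumes the person's fact rows are already
pulled into an ADO.NET DataTable, and the pkColumn parameter is just a placeholder for whatever the
fact table's key column turns out to be):

using System;
using System.Data;

// Fold rows 2-N into row 1, field by field. Actual data in a later row
// always overwrites row 1; empty values never do.
static void MergePersonRows(DataTable personRows, string pkColumn)
{
    DataRow first = personRows.Rows[0];
    for (int r = 1; r < personRows.Rows.Count; r++)
    {
        DataRow current = personRows.Rows[r];
        foreach (DataColumn col in personRows.Columns)
        {
            if (col.ColumnName == pkColumn) continue;      // keep record 1's own key
            object value = current[col];
            if (value != DBNull.Value && value.ToString().Trim().Length > 0)
                first[col] = value;                        // most recent actual data wins
        }
    }
}

With 550 columns that is only 550 * (N - 1) in-memory compares per person, which is why I think it
would just fly.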
The "facts" can be anything from strings of 20-30 bytes down to a simple "Y". The vast majority of
fields contain "Y" or nothing. A small minority contain codes ('A' - 'T') for example. And a very
small number contain somewhat longer strings.
This is not a transaction database, so there will be no locking contention, etc. Just me and my
update program whistling away.
I will be working on a copy of the main table of course, maybe writing the modified record
(Record 1) to one table and the deleted records (Records 2-N) to another, then doing the deletes /
inserts at the end. It will probably pay to copy all of the duplicate fact records out to a temp
table and work on that as the source table.
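Something like this for the write-out step, again untested and with made-up table names
(tblFactMerged holds the surviving merged rows, tblFactDeleteIDs holds the PK IDs of all the
original duplicate rows, including record 1, since the merged copy gets re-inserted):

using System.Data;
using System.Data.SqlClient;

// Batch the results instead of touching tblFact row by row: bulk copy the
// merged rows and the keys to remove into two staging tables, then apply
// the whole batch in one set-based pass at the end.
static void FlushStaging(SqlConnection cn, DataTable mergedRows, DataTable deleteIds)
{
    using (var copyMerged = new SqlBulkCopy(cn) { DestinationTableName = "tblFactMerged" })
        copyMerged.WriteToServer(mergedRows);
    using (var copyDeletes = new SqlBulkCopy(cn) { DestinationTableName = "tblFactDeleteIDs" })
        copyDeletes.WriteToServer(deleteIds);
}

static void ApplyStaging(SqlConnection cn)
{
    // If PKID is an identity column, the INSERT would also need SET IDENTITY_INSERT ON.
    const string sql =
        "DELETE f FROM tblFact f JOIN tblFactDeleteIDs d ON d.PKID = f.PKID; " +
        "INSERT INTO tblFact SELECT * FROM tblFactMerged;";
    using (var cmd = new SqlCommand(sql, cn) { CommandTimeout = 0 })
        cmd.ExecuteNonQuery();
}

That way tblFact itself only gets touched twice, in two set-based statements, instead of millions
of single-row updates and deletes.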
John W. Colby
www.ColbyConsulting.com
Gustav Brock wrote:
> Hi John
>
> Yes, you may be right, except that the amount of data is so big. But again, I can't tell for sure.
> However, my method is so plain simple that in any case I would try that first. If speed shows up as an issue you can easily expand to follow your scheme.
>
> /gustav
>
>>>> jwcolby at colbyconsulting.com 14-10-2009 20:05 >>>
> Gustav,
>
> My thought was that I would be doing this in a disconnected ADO recordset. However, as we all know,
> I do not understand the technology. The vision in my head says that once the records are pulled
> into a disconnected recordset, the data is sitting in some kind of .NET / ADO in-memory structure.
> I would think that doing the update there would be pretty fast. Once that first record was updated
> (in memory) with the values from all subsequent records, I would write that updated (first) record
> back to the physical database and delete the subsequent records from the physical database.
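> Something like this is what I am picturing, untested and with made-up table / column names:
>
> using System.Data;
> using System.Data.SqlClient;
>
> // Pull one person's fact rows into a disconnected DataTable, merge in
> // memory, then push the update of record 1 and the deletes of records
> // 2-N back to SQL Server in a single Update() call.
> static void ProcessPerson(string connStr, string hashKey)
> {
>     using (var cn = new SqlConnection(connStr))
>     using (var da = new SqlDataAdapter("SELECT * FROM tblFact WHERE HashKey = @hk", cn))
>     {
>         da.SelectCommand.Parameters.AddWithValue("@hk", hashKey);
>         var builder = new SqlCommandBuilder(da);                   // generates the UPDATE / DELETE commands
>         builder.ConflictOption = ConflictOption.OverwriteChanges;  // WHERE clause on the PK only
>         var dt = new DataTable();
>         da.Fill(dt);                                               // the disconnected, in-memory copy
>
>         // (field-by-field merge of rows 2-N into row 0 goes here)
>
>         for (int r = dt.Rows.Count - 1; r > 0; r--)
>             dt.Rows[r].Delete();                                   // mark the duplicates for deletion
>
>         da.Update(dt);                                             // one round trip back to the server
>     }
> }
>
> If it works the way I think it does, all the comparing happens in memory and SQL Server only sees
> one update and a handful of deletes per person.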
>
> But what do I know?
>
> John W. Colby
> www.ColbyConsulting.com
>
>
> Gustav Brock wrote:
>> Hi John
>>
>> I would skip the update and delete parts as these are very slow.
>> Instead, read/loop through the sorted table and append the single records to a new table.
>>
>> Thus:
>> 1. Sort by name and "recency" (what is the word for this, the most recent?)
>> 2. Loop while collecting data for each name
>> 3. For each name, append the collected and arranged data to the new (cleaned) table
>> 4. Leave the old table as a backup
>>
>> I haven't worked with that many records so I cannot tell which coding method will turn out fastest.
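>> But roughly the loop would look like this (untested, only to show the idea; table and key names
>> are placeholders):
>>
>> using System.Data;
>> using System.Data.SqlClient;
>>
>> // One pass over the source table, sorted by person: merge each group of
>> // rows in memory and append the single result row to a new, clean table.
>> static void BuildCleanTable(string connStr)
>> {
>>     using (var src = new SqlConnection(connStr))
>>     using (var dst = new SqlConnection(connStr))
>>     {
>>         src.Open();
>>         dst.Open();
>>         using (var cmd = new SqlCommand(
>>                    "SELECT * FROM tblFact ORDER BY HashKey", src) { CommandTimeout = 0 })
>>         using (var rdr = cmd.ExecuteReader())
>>         using (var bulk = new SqlBulkCopy(dst) { DestinationTableName = "tblFactClean" })
>>         {
>>             var buffer = new DataTable();                   // merged rows, flushed in batches
>>             for (int i = 0; i < rdr.FieldCount; i++)
>>                 buffer.Columns.Add(rdr.GetName(i), rdr.GetFieldType(i));
>>
>>             object[] merged = null;
>>             string currentKey = null;
>>             while (rdr.Read())
>>             {
>>                 string key = rdr["HashKey"].ToString();
>>                 if (key != currentKey)                      // a new person starts here
>>                 {
>>                     if (merged != null) buffer.Rows.Add(merged);
>>                     if (buffer.Rows.Count >= 10000) { bulk.WriteToServer(buffer); buffer.Clear(); }
>>                     merged = new object[rdr.FieldCount];
>>                     currentKey = key;
>>                 }
>>                 for (int i = 0; i < rdr.FieldCount; i++)
>>                     if (!rdr.IsDBNull(i))
>>                         merged[i] = rdr.GetValue(i);        // actual data wins, nulls never overwrite
>>             }
>>             if (merged != null) buffer.Rows.Add(merged);
>>             bulk.WriteToServer(buffer);                     // final partial batch
>>         }
>>     }
>> }
>>
>> The big table is read only once, in sorted order, and the clean table is only ever appended to,
>> so there are no updates or deletes at all.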
>>
>> /gustav
>>
>>
>>>>> jwcolby at colbyconsulting.com 14-10-2009 19:37 >>>
>> I have a problem that AFAICT is going to have to be attacked with programming, and it will need to
>> be FAST programming.
>>
>> Imagine I have a table with tens of millions of records, hundreds of fields. In this table I have
>> information about people, likes, uses, owns, income, etc.
>>
>> In a small percentage of cases I have "duplicate" records, except that the records are not
>> duplicates per se; they "belong to the same person" but have different information about likes,
>> dislikes, uses, owns, etc.
>>
>> It is impossible to rate the validity of any record over any other record. So say I have three
>> records about John Colby: one says he drinks Diet Pepsi and drives a Chevy. The next "duplicate"
>> record is about John Colby but doesn't have any info about drinks or drives; instead it has info
>> about eats Chinese food and listens to the blues. The third record is also about John Colby, says
>> he drives a Ford, doesn't mention drinks or eats, but says he listens to classical music.
>>
>> Again there is no way to rate the validity or even "most recentness" of the records.
>>
>> Now, I need to "merge" these records so that one single record says
>>
>> John Colby
>>
>> 1 Drinks Diet Pepsi
>> 1 Drives Chevy
>> 2 Eats Chinese
>> 2 Listens Blues
>> 3 Drives Ford
>> 3 Listens Classical
>>
>> Merges into one record:
>>
>> Drinks Diet Pepsi
>> Drives Ford
>> Eats Chinese
>> Listens Classical
>>
>> OK, I have solved the problem of selecting every record for John Colby. Now I need to "merge" the
>> data from all the fields of all the records into one single record using the "most recent field
>> wins" strategy. Every field in every record has to be examined. "Existing data" cannot be
>> overwritten by "no data" but MUST be overwritten by "most recent data". Recent is defined by record
>> position, NOT date (I don't have date info).
>>
>> As you can see, I cannot simply select one of the records and say that is the real data, I MUST
>> merge all of the records, field by field. When I have one record with the "sum" of all the data in
>> the multiple records, I write that record back out and delete the "other" records.
>>
>> There are millions of duplicates (about 3 million out of 50 million IIRC). In some cases I have two
>> records for a single person; the maximum I have is about 15 records for one person. So if there
>> are 3 million "dupes" I will have a maximum of about 1.5 million "merge operations" to perform. In
>> fact it will be somewhat fewer, because there are more than 2 dupes in many cases.
>>
>> In any case, it is a largish number of records, and a largish number of fields. My concept of what
>> I would do is:
>>
>> 1) Open all records for a given person with dupes, probably in a recordset(?).
>> 2) Use the fields collection to iterate through each field position.
>> 3) Compare rec1 to recN in a loop for each field.
>> 4) If there is data in the current field of the current record, write that data to the current
>> field in record 1. Note that ACTUAL DATA in CurrentField.CurrentRec overwrites the data in
>> CurrentField.Record1 in every case; OTOH, no data in CurrentField.CurrentRec means no update.
>> 5) Increment field. Go to 4)
>> 6) Repeat until all fields compared
>> 7) Increment record. Go to 3)
>> 8) Repeat until all records compared to record 1
>> 9) Save record1 back to the database
>> 10) Delete record 2-N
>> 11) Go to 1)
>> 12) Repeat until all duplicates processed.
>>
>> So that is the overview. This is where all the Masters are invited to chime in with "use a dataset"
>> or "use this other technology", as well as WHY, and where possible pointers to example code or whatever.
>>
>> For speed reasons I am looking at doing this with C#. I could do this in short order in Access,
>> but I am afraid that the speed would be abysmal. Also, I want to be able to use threads, perhaps
>> to spawn off multiple instances of this task, so that my quad core machine could have as many as
>> three of these processes running at any given time. I really have no clue how fast this would run,
>> so I have no clue whether I would need to do that, but if I make it modular then perhaps I could
>> do that if required.
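>>
>> For the threading part, something along these lines might do, with GetDuplicateKeys() and
>> ProcessPerson() standing in for whatever the real routines end up being:
>>
>> using System.Collections.Generic;
>> using System.Threading;
>>
>> // Split the per-person merges across three worker threads, each pulling
>> // the next hash key from a shared queue and using its own connection.
>> static void MergeAllDuplicates(string connStr)
>> {
>>     var dupeKeys = new Queue<string>(GetDuplicateKeys(connStr));  // hash keys with 2+ fact rows
>>     var gate = new object();
>>
>>     ThreadStart worker = delegate
>>     {
>>         while (true)
>>         {
>>             string key;
>>             lock (gate)
>>             {
>>                 if (dupeKeys.Count == 0) return;
>>                 key = dupeKeys.Dequeue();
>>             }
>>             ProcessPerson(connStr, key);           // the per-person merge routine
>>         }
>>     };
>>
>>     var threads = new List<Thread>();
>>     for (int i = 0; i < 3; i++)                    // three workers on the quad core
>>     {
>>         var t = new Thread(worker);
>>         t.Start();
>>         threads.Add(t);
>>     }
>>     threads.ForEach(th => th.Join());              // wait for all of them to finish
>> }
>>
>> Each worker gets its own connection, so three of them should keep the quad core busy without
>> stepping on each other, and still leave a core free for SQL Server and the OS.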
>
>
> _______________________________________________
> dba-VB mailing list
> dba-VB at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/dba-vb
> http://www.databaseadvisors.com
>
>