[dba-VB] What to do, what to do?

Fri Nov 13 22:46:18 CST 2009

John,

I don't think your machine, or any for that matter, have the memory 
to load 50 million records, extract 40 million, transform them and 
then load them back to the 50 million and update the database from 
the recordset.  Datasets are for working with a small number of 
records in memory, not millions.

Robert

At 12:00 PM 11/13/2009, you wrote:
>Date: Fri, 13 Nov 2009 11:39:14 -0500
>From: jwcolby <jwcolby at colbyconsulting.com>
>Subject: Re: [dba-VB] What to do, what to do?
>To: "Discussion concerning Visual Basic and related programming
>         issues."        <dba-vb at databaseadvisors.com>
>Message-ID: <4AFD8BB2.5030203 at colbyconsulting.com>
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>Shamil,
>
>Given what you now know about what I want to do...
>
>Why does each table get created in it's own DataSet.  I thought a 
>data set could have multiple
>tables.  I envisioned the following structure out in SQL Server:
>
>tblHSID - contains 6.4 million records.  Source of records for tblMergeSrc.
>tblMergeSrc - Contains X thousand records for Y hundred 
>people.  Working table in memory.
>tblMergedRecs - the single record merged into will be created here 
>(or copied here).  Working table
>in memory.
>tblPurgedRecs - The multiple records merged FROM will be archived 
>here.  Working table in memory.
>
>tblHashPK - Hash for a person / address, PKs for person / 
>address.  There is one hash for each
>person / hash, there are multiple PKs for each person / 
>address.  One record for each record in
>tblHSID.
>
>tblHashCnt - Group by Hash, Cnt PK.  Server two purposes, provides a 
>"DISTINCT" hash for each person
>being processed as well as a count of how many records will be merged down.
>
>So... tblHSID is the source of the records to be 
>processed.  Contains ONLY people who actually have
>"duplicates", most people don't.
>
>The Top(N) records from tblHashCnt will select the Top(N) PEOPLE by 
>number of records to be merged.
>   Merge the ones with the most records first.
>
>That will be joined to tblHashPK to pull all of the PKs for the 
>Top(N) people about to be processed.
>
>The PKs from that will be joined to tblHSID to pull a set of records 
>to be merged.  Copied into
>tblMergeSrc.  This is the first time HSID records (records to be 
>merged) will be moved into a table
>in a dataset.
>
>tblMergedRecs is empty but exists in the dataset.
>tblPurgedRecs is empty but exists in the dataset.
>
>The merge will most likely happen in tblMergeSrc although a new 
>record could be created in
>tblMergedRecs to work in.  Not decided yet.
>
>In the end, the merge happens in the dataset in memory.  The actual 
>ONE merged record for each
>person ends up in tblMergedRecs.
>
>The actual (multiple) records merged into other records are moved 
>into tblPurgedRecs.  This is an
>archive.
>
>The records in tblPurgedRecs are then deleted from tblMergedSrc, and 
>eventually deleted out of
>tblHSID back in the database.
>
>The records for the next person are processed.  Merged records moved 
>to tblMergedRecs.  "Purged"
>records moved to tblPurgedRecs.
>
>Deleted from tblMergeSrc.
>
>Wash, rinse repeat.
>
>Now... I don't care how many data sets I have but I just question 
>why I would create a new dataset
>for each table above.  It seems that having them all in a single 
>dataset would minimize confusion
>and stuff.  Logically it seems that working with multiple tables 
>inside of a single DataSet object
>MIGHT be faster than moving data back and forth between tables in 
>different DataSet objects, though
>maybe not.
>
>And of course... my book talks all about connections, then all about 
>commands, and then all about
>parameters, then...
>
>How to work with a data reader.  WTF over?  What happened to my 
>dataset and table objects?  Not a
>WORD about those two objects.
>
>Sigh.
>
>John W. Colby