jwcolby
jwcolby at colbyconsulting.com
Tue Sep 22 07:20:59 CDT 2009
I think you are right in most regards. Copying all of the dupes out to a new table for deduping is definitely a must. After that I think it will be a case of iterating the recordset, grabbing sets of duplicate records, updating, deleting the duplicates, and moving on. In fact, perhaps even updating out to a second table, i.e. reading from one and writing to another; the destination table will hold the resulting deduped recordset.

As I mentioned in a previous email, the string length limitation will force me to process subsets of fields at a time. I also have to log all of the records being deleted so that I can delete those same records from my original source table.

At least this kind of puzzle is fun to me.

John W. Colby
www.ColbyConsulting.com


Shamil Salakhetdinov wrote:
> Hi John,
>
> 3.086 million records seems to be a manageable quantity to "dedupe" in one
> go, provided the source 50-million-record table is already prepared.
>
> For "deduping" you can use:
>
> 1. One stored procedure which:
>    a) will insert a new individual attributes record, and
>    b) will update non-null values for an existing individual attributes
>       record.
>
> 2. One C#/VB.NET function to get records from the source 50-million-record
>    table and to "feed" the above stored procedure.
>
> That's a rough plan.
>
> Of course, developing an SP with 640 parameters and the C#/VB.NET code
> working with that SP is not easy manual work - that SP and code could be
> generated.
>
> I'm not sure yet whether an SP can have 640 parameters, so several SPs may
> be needed.
>
> To get the source rows, a T-SQL SELECT expression with the ROW_NUMBER()
> function should be used to produce relatively small batches of records to
> process. This selection can be done using multi-threaded C#/VB.NET code.
> Do you have any index on the source 50-million-row table? The ROW_NUMBER()
> function will need such an index (I'm talking MS SQL 2005 here; maybe in
> MS SQL 2008 ROW_NUMBER() can be used without any indexes)...
>
> When the source data is deduped into the ~3.086 million record table, that
> "small" table can get indexes on a few of the most selective columns, and
> probably one compound natural-key clustered index built from the most
> selective (/most often used in queries) columns. If there are a few such
> columns, at least one of which is used in all the queries, then your
> application response time should finally be measured in seconds... I hope...
>
> That's basically it.
>
> Please correct me if you think the above plan can't be carried out for some
> reason I'm missing here.
>
> Thank you.
>
> --
> Shamil
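
A minimal T-SQL sketch of the dedupe pass John describes above: collapse each group of duplicates into one row in a second table, keeping non-null values, and log the rows being removed so they can later be deleted from the original source table. All table and column names here (DupeWork, DedupedOut, DeletedLog, MatchKey, PK, Attr1-Attr3) are made up for illustration; the real table has roughly 640 attribute columns.

-- Collapse each group of duplicates to one row, keeping any non-null value
-- per column. MAX() ignores NULLs, so a non-null value wins whenever one
-- exists in the group.
INSERT INTO dbo.DedupedOut (MatchKey, Attr1, Attr2, Attr3)
SELECT MatchKey, MAX(Attr1), MAX(Attr2), MAX(Attr3)
FROM dbo.DupeWork
GROUP BY MatchKey;

-- Log the primary keys of the rows being collapsed away, so the same
-- records can later be deleted from the original 50 million row table.
INSERT INTO dbo.DeletedLog (PK)
SELECT d.PK
FROM dbo.DupeWork AS d
JOIN (SELECT MatchKey, MIN(PK) AS KeepPK
      FROM dbo.DupeWork
      GROUP BY MatchKey) AS k
  ON k.MatchKey = d.MatchKey
WHERE d.PK <> k.KeepPK;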
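
And a sketch of the insert-or-update stored procedure Shamil outlines in point 1, again with hypothetical names and only a couple of the ~640 parameters shown; COALESCE() keeps the existing value whenever the incoming parameter is NULL. (For what it's worth, SQL Server allows up to 2,100 parameters per stored procedure, so a single 640-parameter SP should at least be possible.)

CREATE PROCEDURE dbo.UpsertIndividual
    @MatchKey varchar(100),
    @Attr1    varchar(50) = NULL,  -- in reality there would be ~640 of these
    @Attr2    varchar(50) = NULL
AS
BEGIN
    SET NOCOUNT ON;

    IF EXISTS (SELECT 1 FROM dbo.DedupedOut WHERE MatchKey = @MatchKey)
        -- b) update only the attributes that arrived non-null
        UPDATE dbo.DedupedOut
        SET Attr1 = COALESCE(@Attr1, Attr1),
            Attr2 = COALESCE(@Attr2, Attr2)
        WHERE MatchKey = @MatchKey;
    ELSE
        -- a) insert a new individual attributes record
        INSERT INTO dbo.DedupedOut (MatchKey, Attr1, Attr2)
        VALUES (@MatchKey, @Attr1, @Attr2);
END;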
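
Finally, a sketch of pulling one batch from the big source table with ROW_NUMBER(), per Shamil's batching suggestion; the table name, columns, and batch boundaries are invented, and the ORDER BY column should be covered by an index.

-- Fetch rows 1,000,001 - 1,050,000 of the source table as one batch.
WITH numbered AS (
    SELECT PK, MatchKey, Attr1, Attr2,
           ROW_NUMBER() OVER (ORDER BY PK) AS rn
    FROM dbo.SourceTable
)
SELECT PK, MatchKey, Attr1, Attr2
FROM numbered
WHERE rn BETWEEN 1000001 AND 1050000;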