jwcolby
jwcolby at colbyconsulting.com
Tue Sep 22 07:20:59 CDT 2009
I think you are right in most regards. Copying all of the dupes out to a new table for deduping is definitely a must. After that I think it will be a case of iterating the recordset, grabbing sets of duplicate records, updating, deleting the duplicates, and moving on. In fact, perhaps even updating out to a second table, i.e. reading from one and writing to another; the destination table will hold the resulting deduped recordset.

As I mentioned in a previous email, the string length limitation will force me to process subsets of fields at a time. I also have to log all of the records being deleted so that I can delete those same records from my original source table.

At least this kind of puzzle is fun to me.

John W. Colby
www.ColbyConsulting.com


Shamil Salakhetdinov wrote:
> Hi John,
>
> 3.086 million records seems to be a manageable quantity to "dedupe" in one
> go, provided the source 50-million-record table is already prepared.
>
> For "deduping" you can use:
>
> 1. One stored procedure which:
>    a) will insert a new individual attributes record, and
>    b) will update non-null values for an existing individual attributes
>       record.
>
> 2. One C#/VB.NET function to get records from the source 50-million-record
>    table and to "feed" the above stored procedure.
>
> That's a rough plan.
>
> Of course, developing an SP with 640 parameters and the C#/VB.NET code
> working with that SP is not easy manual work - that SP and code could be
> generated.
>
> I'm not sure yet whether an SP can have 640 parameters, so several SPs may
> be needed.
>
> To get the source rows, a T-SQL SELECT expression with the ROW_NUMBER()
> function should be used to produce relatively small batches of records to
> process. This selection can be done using multi-threaded C#/VB.NET code.
> Do you have any index on the source 50-million-row table? The ROW_NUMBER()
> function will need such an index (I'm talking MS SQL 2005 here; maybe in
> MS SQL 2008 ROW_NUMBER() can be used without any indexes)...
>
> When the source data is deduped into the ~3.086 million record table, that
> "small" table can get indexes on a few of the most selective columns, and
> probably one compound natural-key clustered index built from the most
> selective (/most often used in queries) columns. If there are a few such
> columns, at least one of which is used in all the queries, then your
> application response time should finally be measured in seconds... I hope...
>
> That's basically it.
>
> Please correct me if you think the above plan can't be carried out for some
> reason I'm missing here.
>
> Thank you.
>
> --
> Shamil
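
A minimal T-SQL sketch of the dedupe pass John describes above: collapse each group of duplicates into one row in a second table, keeping non-null values, and log the rows being removed so they can later be deleted from the original source table. All table and column names here (DupeWork, DedupedOut, DeletedLog, MatchKey, PK, Attr1-Attr3) are made up for illustration; the real table has roughly 640 attribute columns.

-- Collapse each group of duplicates to one row, keeping any non-null value
-- per column. MAX() ignores NULLs, so a non-null value wins whenever one
-- exists in the group.
INSERT INTO dbo.DedupedOut (MatchKey, Attr1, Attr2, Attr3)
SELECT MatchKey, MAX(Attr1), MAX(Attr2), MAX(Attr3)
FROM dbo.DupeWork
GROUP BY MatchKey;

-- Log the primary keys of the rows being collapsed away, so the same
-- records can later be deleted from the original 50 million row table.
INSERT INTO dbo.DeletedLog (PK)
SELECT d.PK
FROM dbo.DupeWork AS d
JOIN (SELECT MatchKey, MIN(PK) AS KeepPK
      FROM dbo.DupeWork
      GROUP BY MatchKey) AS k
  ON k.MatchKey = d.MatchKey
WHERE d.PK <> k.KeepPK;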
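
And a sketch of the insert-or-update stored procedure Shamil outlines in point 1, again with hypothetical names and only a couple of the ~640 parameters shown; COALESCE() keeps the existing value whenever the incoming parameter is NULL. (For what it's worth, SQL Server allows up to 2,100 parameters per stored procedure, so a single 640-parameter SP should at least be possible.)

CREATE PROCEDURE dbo.UpsertIndividual
    @MatchKey varchar(100),
    @Attr1    varchar(50) = NULL,  -- in reality there would be ~640 of these
    @Attr2    varchar(50) = NULL
AS
BEGIN
    SET NOCOUNT ON;

    IF EXISTS (SELECT 1 FROM dbo.DedupedOut WHERE MatchKey = @MatchKey)
        -- b) update only the attributes that arrived non-null
        UPDATE dbo.DedupedOut
        SET Attr1 = COALESCE(@Attr1, Attr1),
            Attr2 = COALESCE(@Attr2, Attr2)
        WHERE MatchKey = @MatchKey;
    ELSE
        -- a) insert a new individual attributes record
        INSERT INTO dbo.DedupedOut (MatchKey, Attr1, Attr2)
        VALUES (@MatchKey, @Attr1, @Attr2);
END;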
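
Finally, a sketch of pulling one batch from the big source table with ROW_NUMBER(), per Shamil's batching suggestion; the table name, columns, and batch boundaries are invented, and the ORDER BY column should be covered by an index.

-- Fetch rows 1,000,001 - 1,050,000 of the source table as one batch.
WITH numbered AS (
    SELECT PK, MatchKey, Attr1, Attr2,
           ROW_NUMBER() OVER (ORDER BY PK) AS rn
    FROM dbo.SourceTable
)
SELECT PK, MatchKey, Attr1, Attr2
FROM numbered
WHERE rn BETWEEN 1000001 AND 1050000;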