Max Wanadoo
max.wanadoo at gmail.com
Tue Sep 22 08:31:09 CDT 2009
On the contrary, it looks very simple to do in code, but i dont know about creating a complex sql string like that, if it gets that complicated I just do it in code - maybe a tad slower but what the heck, do something else and just let it run. strip some records into an mdb and email me if you want. max On 22/09/2009, jwcolby <jwcolby at colbyconsulting.com> wrote: > Max, > > > JC, I don't understand what strings you are manipulating... > > In order to do this you (or at least I) pretty much have to use dynamic > queries. Create a string > variable, then build up the update SQL statement. That SQL string cannot be > more than 4K long IIRC. > > Well imagine 640 fields with field names like (a real field name) > > "medication_taken_for_adhd_adderall" or > "receive_info_on_health_related_bladder_control". > > Obviously the SQL string is going to get lengthy! > > > Do your fieldname indicate the contents, ie Fldname = "Has Dog" with a > value of true/false > > Yes, generally the field contains a 'Y' or an empty string ''. In some > cases (rare) the field has > various codes or numbers. For example there is a field for income with > codes '1' through 'T' with > each one being an income band of $5K. > > > Also, if the first JCRecord is going to be the master, then just > overwrite the contents with > subsequent JCRecords as you previously indicated. > > Yes, except you can't just "overwrite it", you have to update it IF there is > data in the next record. > > Two records > > Has_Dog, Has_Cat, Boat_Length > 'Y','','' > '','','42' > > If you simply "overwrite" the first with the second then you lose the fact > that the person has a dog > because you overwrote that field in the first record with an empty string > from the second record. > > EVERY FIELD of EVERY RECORD has to be analyzed to see if there is data. If > so then update the first > record with the actual data containing field from the second record. But do > NOT update the fields > where the second record does not contain data. > > It starts to look more complex eh? > > Additionally, once this "merge" is done, all the records containing "merged > data" have to be > retained. All the records which were merged in to other records have to be > deleted. > > John W. Colby > www.ColbyConsulting.com > > > Max Wanadoo wrote: >> JC, I dont understand what strings you are manipulating...I am >> confused by the seemingly conflicting postings. >> Do you fieldname indicte the contents, ie Fldname = "Has Dog" with a >> value of true/false >> >> or is the fieldname = "Fld1" and the contents "Has Dog"? >> >> Also, if the first JCRecord is going to be the master, then just >> overwrite the contents with subsequent JCRecords as you previously >> indicated. If the recordsource is index on key and accessed in that >> order, then you have only to store the previous entry and change it as >> the recname changes. >> >> max >> >> >> On 22/09/2009, jwcolby <jwcolby at colbyconsulting.com> wrote: >>> I think you are right in most regards. Copying all of the dupes out to a >>> new table for deduping is >>> definitely a must. After that I think it will be a case of iterating the >>> recordset grabbing sets of >>> dupe records, updating, deleting the duplicates, moving on. In fact >>> perhaps >>> even updating out to a >>> second table, i.e. read from one write to another. The destination table >>> will hold the resulting >>> deduped recordset. >>> >>> As I mentioned in a previous email, the string length limitation will >>> force >>> me to process subsets of >>> fields at a time. >>> >>> I also have to log all of the records being deleted so that I can delete >>> those records back in my >>> original source table. >>> >>> At least this kind of puzzle is fun to me. >>> >>> John W. Colby >>> www.ColbyConsulting.com >>> >>> >>> Shamil Salakhetdinov wrote: >>>> Hi John, >>>> >>>> 3.086 million records seems to be manageable to be "deduped" in one go >>>> provided the source 50 million records table is already prepared. >>>> >>>> For "deduping" you can use >>>> >>>> 1. One stored procedure which: >>>> a) will insert new individual attributes' record and >>>> b) will update not null values for existing individual attributes' >>>> record. >>>> >>>> 2. One C#/VB.NET function to get records from the source 50 million >>>> records' >>>> table and to "feed" the above stored procedure. >>>> >>>> That's a rough plan. >>>> >>>> Of course developing SP with 640 parameters and C#/VB.NET code working >>>> with >>>> that SP is not an easy manual work - that SP and code could be >>>> generated. >>>> >>>> Not sure yet here that SP can have 640 parameters therefore it may be >>>> needed >>>> to have several SPs. >>>> >>>> To get source rows a T-SQL select expression with ROW_NUMBER() function >>>> should be used to have relatively small subsets of record batches to >>>> process. This selection can be done using multi-threaded C#/VB.Net code >>>> . >>>> Do >>>> you have any index over source 50 million rows table? - ROW_NUMBER() >>>> function will need such an index (I'm talking MS SQL 2005 here, maybe in >>>> MS >>>> SQL 2008 ROW_NUMBER() can be used without any indexes)... >>>> >>>> When source data deduped into ~3.086 million records table then that >>>> "small" >>>> table can get a few the most selective columns' indexes, and probably >>>> one >>>> compound natural key clustered index built using the most selective >>>> (/the >>>> most often used in queries) columns. If there are a few such columns at >>>> least one of which is used in all the queries then your application >>>> response >>>> time should be finally measured in seconds... I hope... >>>> >>>> That's basically it. >>>> >>>> Please correct me if you suppose that the above plan can't be fulfilled >>>> because of some reasons I'm missing here. >>>> >>>> Thank you. >>>> >>>> >>>> -- >>>> Shamil >>> _______________________________________________ >>> dba-VB mailing list >>> dba-VB at databaseadvisors.com >>> http://databaseadvisors.com/mailman/listinfo/dba-vb >>> http://www.databaseadvisors.com >>> >>> >> _______________________________________________ >> dba-VB mailing list >> dba-VB at databaseadvisors.com >> http://databaseadvisors.com/mailman/listinfo/dba-vb >> http://www.databaseadvisors.com >> >> > _______________________________________________ > dba-VB mailing list > dba-VB at databaseadvisors.com > http://databaseadvisors.com/mailman/listinfo/dba-vb > http://www.databaseadvisors.com > >