[dba-VB] SPAM-LOW: Re: Merge rows

Max Wanadoo max.wanadoo at gmail.com
Tue Sep 22 08:31:09 CDT 2009


On the contrary, it looks very simple to do in code, but i dont know
about creating a complex sql string like that, if it gets that
complicated I just do it in code - maybe a tad slower but what the
heck, do something else and just let it run.

strip some records into an mdb and email me if you want.

max


On 22/09/2009, jwcolby <jwcolby at colbyconsulting.com> wrote:
> Max,
>
>  > JC, I don't understand what strings you are manipulating...
>
> In order to do this you (or at least I) pretty much have to use dynamic
> queries.  Create a string
> variable, then build up the update SQL statement.  That SQL string cannot be
> more than 4K long IIRC.
>
> Well imagine 640 fields with field names like (a real field name)
>
> "medication_taken_for_adhd_adderall" or
> "receive_info_on_health_related_bladder_control".
>
> Obviously the SQL string is going to get lengthy!
>
>  > Do your fieldname indicate the contents, ie Fldname = "Has Dog" with a
> value of true/false
>
> Yes, generally the field contains a 'Y' or an empty string ''.  In some
> cases (rare) the field has
> various codes or numbers.  For example there is a field for income with
> codes '1' through 'T' with
> each one being an income band of $5K.
>
>  > Also, if the first JCRecord is going to be the master, then just
> overwrite the contents with
> subsequent JCRecords as you previously indicated.
>
> Yes, except you can't just "overwrite it", you have to update it IF there is
> data in the next record.
>
> Two records
>
> Has_Dog, Has_Cat, Boat_Length
> 'Y','',''
> '','','42'
>
> If you simply "overwrite" the first with the second then you lose the fact
> that the person has a dog
> because you overwrote that field in the first record with an empty string
> from the second record.
>
> EVERY FIELD of EVERY RECORD has to be analyzed to see if there is data.  If
> so then update the first
> record with the actual data containing field from the second record.  But do
> NOT update the fields
> where the second record does not contain data.
>
> It starts to look more complex eh?
>
> Additionally, once this "merge" is done, all the records containing "merged
> data" have to be
> retained.  All the records which were merged in to other records have to be
> deleted.
>
> John W. Colby
> www.ColbyConsulting.com
>
>
> Max Wanadoo wrote:
>> JC, I dont understand what strings you are manipulating...I am
>> confused by the seemingly conflicting postings.
>> Do you fieldname indicte the contents, ie Fldname = "Has Dog" with a
>> value of true/false
>>
>> or is the fieldname = "Fld1" and the contents "Has Dog"?
>>
>> Also, if the first JCRecord is going to be the master, then just
>> overwrite the contents with subsequent JCRecords as you previously
>> indicated.  If the recordsource is index on key and accessed in that
>> order, then you have only to store the previous entry and change it as
>> the recname changes.
>>
>> max
>>
>>
>> On 22/09/2009, jwcolby <jwcolby at colbyconsulting.com> wrote:
>>> I think you are right in most regards.  Copying all of the dupes out to a
>>> new table for deduping is
>>> definitely a must.  After that I think it will be a case of iterating the
>>> recordset grabbing sets of
>>> dupe records, updating, deleting the duplicates, moving on.  In fact
>>> perhaps
>>> even updating out to a
>>> second table, i.e. read from one write to another.  The destination table
>>> will hold the resulting
>>> deduped recordset.
>>>
>>> As I mentioned in a previous email, the string length limitation will
>>> force
>>> me to process subsets of
>>> fields at a time.
>>>
>>> I also have to log all of the records being deleted so that I can delete
>>> those records back in my
>>> original source table.
>>>
>>> At least this kind of puzzle is fun to me.
>>>
>>> John W. Colby
>>> www.ColbyConsulting.com
>>>
>>>
>>> Shamil Salakhetdinov wrote:
>>>> Hi John,
>>>>
>>>> 3.086 million records seems to be manageable to be "deduped" in one go
>>>> provided the source 50 million records  table is already prepared.
>>>>
>>>> For "deduping" you can use
>>>>
>>>> 1. One stored procedure which:
>>>>    a) will insert new individual attributes' record and
>>>>    b) will update not null values for existing individual attributes'
>>>> record.
>>>>
>>>> 2. One C#/VB.NET function to get records from the source 50 million
>>>> records'
>>>> table and to "feed" the above stored procedure.
>>>>
>>>> That's a rough plan.
>>>>
>>>> Of course developing SP with 640 parameters and C#/VB.NET code working
>>>> with
>>>> that SP is not an easy manual work - that SP and code could be
>>>> generated.
>>>>
>>>> Not sure yet here that SP can have 640 parameters therefore it may be
>>>> needed
>>>> to have several SPs.
>>>>
>>>> To get source rows a T-SQL select expression with ROW_NUMBER() function
>>>> should be used to have relatively small subsets of record batches to
>>>> process. This selection can be done using multi-threaded C#/VB.Net code
>>>> .
>>>> Do
>>>> you have any index over source 50 million rows table? - ROW_NUMBER()
>>>> function will need such an index (I'm talking MS SQL 2005 here, maybe in
>>>> MS
>>>> SQL 2008 ROW_NUMBER() can be used without any indexes)...
>>>>
>>>> When source data deduped into ~3.086 million records table then that
>>>> "small"
>>>> table can get a few the most selective columns' indexes, and probably
>>>> one
>>>> compound natural key clustered index built using the most selective
>>>> (/the
>>>> most often used in queries) columns. If there are a few such columns at
>>>> least one of which is used in all the queries then your application
>>>> response
>>>> time should be finally measured in seconds... I hope...
>>>>
>>>> That's basically it.
>>>>
>>>> Please correct me if you suppose that the above plan can't be fulfilled
>>>> because of some reasons I'm missing here.
>>>>
>>>> Thank you.
>>>>
>>>>
>>>> --
>>>> Shamil
>>> _______________________________________________
>>> dba-VB mailing list
>>> dba-VB at databaseadvisors.com
>>> http://databaseadvisors.com/mailman/listinfo/dba-vb
>>> http://www.databaseadvisors.com
>>>
>>>
>> _______________________________________________
>> dba-VB mailing list
>> dba-VB at databaseadvisors.com
>> http://databaseadvisors.com/mailman/listinfo/dba-vb
>> http://www.databaseadvisors.com
>>
>>
> _______________________________________________
> dba-VB mailing list
> dba-VB at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/dba-vb
> http://www.databaseadvisors.com
>
>



More information about the dba-VB mailing list