[dba-VB] SPAM-LOW: Re: Merge rows

Shamil Salakhetdinov shamil at smsconsulting.spb.ru
Tue Sep 22 09:10:29 CDT 2009


Hi John,

I suppose you can use static (compiled) SP - I assume you have a unique
natural (alt-)key, say, FirstName, LastName (of course you'll have some more
fields in this key:

Procedure MySP
(
  @firstName nvarchar(50),  
  @lastName nvarchar(50),
  ---
  @f1 int,
   ...
  @fn nvarchar(80)
)


set nocount on
declare @testCount int
select @testCount = count(*) from MyTable where FirstName = @firstName and
LastName = @lastName

if (@testCount = 0)
begin
  -- insert
End

else

begin
 update myTable
    set 
        f1 = IsNull(@f1, f1), 
        ...
        fn = case (isnull(len(@fn),0)) when 0 then fn else @fn end
        where FirstName = @firstName and LastName = @lastName
end

set nocount off

Try.

--
Shamil

-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Tuesday, September 22, 2009 5:25 PM
To: Discussion concerning Visual Basic and related programming issues.
Subject: Re: [dba-VB] SPAM-LOW: Re: Merge rows

Max,

 > JC, I don't understand what strings you are manipulating...

In order to do this you (or at least I) pretty much have to use dynamic
queries.  Create a string 
variable, then build up the update SQL statement.  That SQL string cannot be
more than 4K long IIRC.

Well imagine 640 fields with field names like (a real field name)

"medication_taken_for_adhd_adderall" or
"receive_info_on_health_related_bladder_control".

Obviously the SQL string is going to get lengthy!

 > Do your fieldname indicate the contents, ie Fldname = "Has Dog" with a
value of true/false

Yes, generally the field contains a 'Y' or an empty string ''.  In some
cases (rare) the field has 
various codes or numbers.  For example there is a field for income with
codes '1' through 'T' with 
each one being an income band of $5K.

 > Also, if the first JCRecord is going to be the master, then just
overwrite the contents with 
subsequent JCRecords as you previously indicated.

Yes, except you can't just "overwrite it", you have to update it IF there is
data in the next record.

Two records

Has_Dog, Has_Cat, Boat_Length
'Y','',''
'','','42'

If you simply "overwrite" the first with the second then you lose the fact
that the person has a dog 
because you overwrote that field in the first record with an empty string
from the second record.

EVERY FIELD of EVERY RECORD has to be analyzed to see if there is data.  If
so then update the first 
record with the actual data containing field from the second record.  But do
NOT update the fields 
where the second record does not contain data.

It starts to look more complex eh?

Additionally, once this "merge" is done, all the records containing "merged
data" have to be 
retained.  All the records which were merged in to other records have to be
deleted.

John W. Colby
www.ColbyConsulting.com


Max Wanadoo wrote:
> JC, I dont understand what strings you are manipulating...I am
> confused by the seemingly conflicting postings.
> Do you fieldname indicte the contents, ie Fldname = "Has Dog" with a
> value of true/false
> 
> or is the fieldname = "Fld1" and the contents "Has Dog"?
> 
> Also, if the first JCRecord is going to be the master, then just
> overwrite the contents with subsequent JCRecords as you previously
> indicated.  If the recordsource is index on key and accessed in that
> order, then you have only to store the previous entry and change it as
> the recname changes.
> 
> max
> 
> 
> On 22/09/2009, jwcolby <jwcolby at colbyconsulting.com> wrote:
>> I think you are right in most regards.  Copying all of the dupes out to a
>> new table for deduping is
>> definitely a must.  After that I think it will be a case of iterating the
>> recordset grabbing sets of
>> dupe records, updating, deleting the duplicates, moving on.  In fact
perhaps
>> even updating out to a
>> second table, i.e. read from one write to another.  The destination table
>> will hold the resulting
>> deduped recordset.
>>
>> As I mentioned in a previous email, the string length limitation will
force
>> me to process subsets of
>> fields at a time.
>>
>> I also have to log all of the records being deleted so that I can delete
>> those records back in my
>> original source table.
>>
>> At least this kind of puzzle is fun to me.
>>
>> John W. Colby
>> www.ColbyConsulting.com
>>
>>
>> Shamil Salakhetdinov wrote:
>>> Hi John,
>>>
>>> 3.086 million records seems to be manageable to be "deduped" in one go
>>> provided the source 50 million records  table is already prepared.
>>>
>>> For "deduping" you can use
>>>
>>> 1. One stored procedure which:
>>>    a) will insert new individual attributes' record and
>>>    b) will update not null values for existing individual attributes'
>>> record.
>>>
>>> 2. One C#/VB.NET function to get records from the source 50 million
>>> records'
>>> table and to "feed" the above stored procedure.
>>>
>>> That's a rough plan.
>>>
>>> Of course developing SP with 640 parameters and C#/VB.NET code working
>>> with
>>> that SP is not an easy manual work - that SP and code could be
generated.
>>>
>>> Not sure yet here that SP can have 640 parameters therefore it may be
>>> needed
>>> to have several SPs.
>>>
>>> To get source rows a T-SQL select expression with ROW_NUMBER() function
>>> should be used to have relatively small subsets of record batches to
>>> process. This selection can be done using multi-threaded C#/VB.Net code
.
>>> Do
>>> you have any index over source 50 million rows table? - ROW_NUMBER()
>>> function will need such an index (I'm talking MS SQL 2005 here, maybe in
>>> MS
>>> SQL 2008 ROW_NUMBER() can be used without any indexes)...
>>>
>>> When source data deduped into ~3.086 million records table then that
>>> "small"
>>> table can get a few the most selective columns' indexes, and probably
one
>>> compound natural key clustered index built using the most selective
(/the
>>> most often used in queries) columns. If there are a few such columns at
>>> least one of which is used in all the queries then your application
>>> response
>>> time should be finally measured in seconds... I hope...
>>>
>>> That's basically it.
>>>
>>> Please correct me if you suppose that the above plan can't be fulfilled
>>> because of some reasons I'm missing here.
>>>
>>> Thank you.
>>>
>>>
>>> --
>>> Shamil





More information about the dba-VB mailing list