jwcolby
jwcolby at colbyconsulting.com
Fri Sep 24 07:46:09 CDT 2010
I am not a DBA, and I don't play one on TV (or this list). OTOH I *use* SQL Server every single workday.

The other day I was doing a task I do all the time, validating addresses for a bunch of lists. The process has several steps, some of which are quite time consuming. One step in particular runs a UDF to calculate a code, taking four fields as input and saving the returned code in a field in the same table. *SET BASED* (of course). Another process generates an SHA1 hash on three to five fields and stores the results in three fields in the same table. *SET BASED* (of course). No cursors here. ;)

The tables I do this on are anywhere from 3.5 million names up to 65 million names. Everything was fine until I tried to run the process on the biggest table, whereupon it "went to lunch". I started at 5 pm, and by midnight when I went to bed it was still chunking away. At 8:00 am when I got up to my office - still chunking away. It seems that my poor underpowered server was paging memory. Maybe. Remember, I am not a SQL god, so I really don't know for certain why it went to lunch. In any event, trying to do this whole thing (the hash field generation, it appears) as a single set on 65 million records was just unwieldy.

So I broke it into cursors... ;) *JUST KIDDING*...

I drive all of my processes from C#. So we went into the "big processes" and broke them down into sets or chunks of records. I am absolutely certain there is a much more efficient way to do it, and if I were a SQL God (or any kind of god) I probably would have picked a different way. However, being a poor working schmuck with a job to get done, I used the tools I know.

We dynamically added a "processed" field to the table, which we then used to store a flag marking records that had been processed. We used a do while (RecordsAffected > 0) loop and updated 500K records at a pop, setting the "processed" flag and grabbing the top(500K) where Processed is null each time through the loop. We enclosed it all in transactions for good measure, and dropped the processed flag field when finished. Voila, each 500K chunk takes about 30 seconds, and while it still takes half an hour to process 65 million records, it no longer goes to lunch and never comes back. The process that had not finished in 12 hours the previous day took 30 minutes yesterday.

The morals of the story are (in no particular order):

1) I rarely use cursors, and wasn't in this instance.
2) I was using set based updates, and when it tried to scale it went to lunch anyway.
3) I am not a SQL God (or any other kind of god).
4) Breaking the big set based process down into chunks (still set based) avoided the issue.
5) Had I asked how to accomplish this "the right way" on *my* SQL Server list, I would likely have received an RTBOL kind of response.
6) So I used the tools I know to accomplish this - C# and command objects running dynamic SQL constructed in C#, using cmd.ExecuteNonQuery.

In the process I discovered that I could send an entire set (a pair in this case) of action queries, enclosed in transactions, as a single SQL statement executed by the C# command object. I tried this in order to wrap the pair of queries in a transaction, and it works just fine as long as the SQL is syntactically correct. Unfortunately the RecsAffected is a sum of the records affected by the individual action queries, and to be honest I don't know what will happen, in terms of error information returned back into my command object, if the first action query fails. I assume that it will be the error from the query that failed.
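For anyone curious, a minimal sketch of that kind of loop is below. The table, column, and UDF names (dbo.tblNames, Processed, HashFld, dbo.udfSha1Hash, FName/LName/Addr1) are placeholders for illustration, not the real schema, but the shape is the same: flag and update 500K rows per pass, a pair of action queries wrapped in one transaction and sent as a single batch, looping until ExecuteNonQuery reports zero rows affected.

// Sketch of the chunked, set-based update loop driven from C#.
// All object names are placeholders; assumes HashFld starts out NULL.
using System;
using System.Data.SqlClient;

class ChunkedHashUpdate
{
    static void Main()
    {
        const string connStr =
            "Server=MyServer;Database=MyListDb;Integrated Security=SSPI;";

        using (var cn = new SqlConnection(connStr))
        {
            cn.Open();

            // Dynamically add the temporary "processed" flag field.
            Exec(cn, "ALTER TABLE dbo.tblNames ADD Processed bit NULL");

            // A pair of action queries wrapped in one transaction, executed as a
            // single batch. ExecuteNonQuery returns the SUM of the rows affected
            // by both UPDATEs, so the loop ends when both come back zero.
            const string chunkSql =
                "BEGIN TRANSACTION; " +
                "UPDATE TOP (500000) dbo.tblNames " +
                "   SET Processed = 1 WHERE Processed IS NULL; " +
                "UPDATE dbo.tblNames " +
                "   SET HashFld = dbo.udfSha1Hash(FName, LName, Addr1) " +
                " WHERE Processed = 1 AND HashFld IS NULL; " +
                "COMMIT TRANSACTION;";

            int recordsAffected;
            do
            {
                recordsAffected = Exec(cn, chunkSql);   // roughly 30 seconds per chunk
                Console.WriteLine("Rows affected this pass: " + recordsAffected);
            } while (recordsAffected > 0);

            // Drop the flag field when finished.
            Exec(cn, "ALTER TABLE dbo.tblNames DROP COLUMN Processed");
        }
    }

    static int Exec(SqlConnection cn, string sql)
    {
        using (var cmd = new SqlCommand(sql, cn))
        {
            cmd.CommandTimeout = 0;   // no client-side timeout on the long passes
            return cmd.ExecuteNonQuery();
        }
    }
}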
Anyway, mission accomplished.

-- 
John W. Colby
www.ColbyConsulting.com