jwcolby
jwcolby at colbyconsulting.com
Fri Sep 24 07:46:09 CDT 2010
I am not a DBA, and I don't play one on TV (or this list). OTOH I *use* SQL Server every single workday.

The other day I was doing a task I do all the time, validating addresses for a bunch of lists. The process has several steps, some of which are quite time consuming. One step in particular runs a UDF to calculate a code, taking four fields as input and saving the returned code in a field in the same table. *SET BASED* (of course). Another process generates an SHA1 hash on three to five fields and stores the results in three fields in the same table. *SET BASED* (of course). No cursors here. ;)

The tables I do this on are anywhere from 3.5 million names up to 65 million names. Everything was fine until I tried to run the process on the biggest table, whereupon it "went to lunch". I started at 5 pm, and by midnight when I went to bed it was still chunking away. At 8:00 am when I got up to my office - still chunking away. It seems that my poor underpowered server was paging memory. Maybe. Remember, I am not a SQL god, so I really don't know for certain why it went to lunch. In any event, trying to do this whole thing (the hash field generation, it appears) as a single set on 65 million records was just unwieldy.

So I broke it into cursors... ;) *JUST KIDDING*...

I drive all of my processes from C#. So we went into the "big processes" and broke them down into sets or chunks of records. I am absolutely certain there is a much more efficient way to do it, and if I were a SQL God (or any kind of god) I probably would have picked a different way. However, being a poor working schmuck with a job to get done, I used the tools I know.

We dynamically added a "processed" field to the table, which we then used to store a flag marking records that had been processed. We used a do while (RecordsAffected > 0) loop and updated 500K records at a pop, setting the "processed" flag and grabbing the top(500K) where Processed is null each time through the loop. We enclosed it all in transactions for good measure, and dropped the processed flag field when finished. Voila, each 500K chunk takes about 30 seconds, and while it still takes half an hour to process 65 million records, it no longer goes to lunch and never comes back. The process that had not finished in 12 hours the previous day took 30 minutes yesterday.

The morals of the story are (in no particular order):

1) I rarely use cursors, and wasn't in this instance.
2) I was using set based updates, and when it tried to scale it went to lunch anyway.
3) I am not a SQL God (or any other kind of god).
4) Breaking the big set based process down into chunks (still set based) avoided the issue.
5) Had I asked how to accomplish this "the right way" on *my* SQL Server list, I would likely have received an RTBOL kind of response.
6) So I used the tools I know to accomplish this - C# and command objects running dynamic SQL constructed in C#, using cmd.ExecuteNonQuery.

In the process I discovered that I could send an entire set (a pair in this case) of action queries, enclosed in transactions, as a single SQL statement executed by the C# command object. I tried this in order to wrap the pair of queries in a transaction, and it works just fine as long as the SQL is syntactically correct. Unfortunately the RecsAffected is a sum of the records affected by the individual action queries, and to be honest I don't know what will happen, in terms of error information returned back into my command object, if the first action query fails. I assume that it will be the error from the query that failed.
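For anyone curious, a minimal sketch of that kind of loop is below. The table, column, and UDF names (dbo.tblNames, Processed, HashFld, dbo.udfSha1Hash, FName/LName/Addr1) are placeholders for illustration, not the real schema, but the shape is the same: flag and update 500K rows per pass, a pair of action queries wrapped in one transaction and sent as a single batch, looping until ExecuteNonQuery reports zero rows affected.

// Sketch of the chunked, set-based update loop driven from C#.
// All object names are placeholders; assumes HashFld starts out NULL.
using System;
using System.Data.SqlClient;

class ChunkedHashUpdate
{
    static void Main()
    {
        const string connStr =
            "Server=MyServer;Database=MyListDb;Integrated Security=SSPI;";

        using (var cn = new SqlConnection(connStr))
        {
            cn.Open();

            // Dynamically add the temporary "processed" flag field.
            Exec(cn, "ALTER TABLE dbo.tblNames ADD Processed bit NULL");

            // A pair of action queries wrapped in one transaction, executed as a
            // single batch. ExecuteNonQuery returns the SUM of the rows affected
            // by both UPDATEs, so the loop ends when both come back zero.
            const string chunkSql =
                "BEGIN TRANSACTION; " +
                "UPDATE TOP (500000) dbo.tblNames " +
                "   SET Processed = 1 WHERE Processed IS NULL; " +
                "UPDATE dbo.tblNames " +
                "   SET HashFld = dbo.udfSha1Hash(FName, LName, Addr1) " +
                " WHERE Processed = 1 AND HashFld IS NULL; " +
                "COMMIT TRANSACTION;";

            int recordsAffected;
            do
            {
                recordsAffected = Exec(cn, chunkSql);   // roughly 30 seconds per chunk
                Console.WriteLine("Rows affected this pass: " + recordsAffected);
            } while (recordsAffected > 0);

            // Drop the flag field when finished.
            Exec(cn, "ALTER TABLE dbo.tblNames DROP COLUMN Processed");
        }
    }

    static int Exec(SqlConnection cn, string sql)
    {
        using (var cmd = new SqlCommand(sql, cn))
        {
            cmd.CommandTimeout = 0;   // no client-side timeout on the long passes
            return cmd.ExecuteNonQuery();
        }
    }
}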
Anyway, mission accomplished.

-- 
John W. Colby
www.ColbyConsulting.com