jwcolby
jwcolby at colbyconsulting.com
Wed Sep 30 08:46:29 CDT 2009
I wrote this email yesterday in the middle of a very frustrating day. After my testing this morning I now realize that there is this quite significant "overhead" for the first pass processing, overhead which is not a fixed amount but which increases with the size of the data set. That overhead will likely explain the results I encountered and documented below. I am sending this email because it explains in detail the process and the stored procedure and udf which actually does the processing. I am about to do some testing similar to the last email where I do processing on specific size chunks of records in order to see numbers similar to the last email. The idea is to update an AddrValid Char() field with various codes depending on how the address plays out. PO if the address line has a "PO BOX" in it, "INV" (invalid) if it has a null in the zip4, "ANK" if it has a "77" in the ANK field and so forth. These codes then tell me various things about the addresses and allow me to select the addresses for ANK processing, filter out PO boxes etc. Awhile back I created a udf into which I passed a set of fields which will be analyzed and which returns a code up to three characters wide (or null). I then have a stored procedure which is a simple update, updating the AddrValid with the results passed back by this udf. The table being updated is 20 million records (in this case). I just went through the exercise of passing in a view where I selected the TOP() one million rows where AddrValid IS NULL. Processing one million rows takes 17 seconds. I ran that about five times, counting the remaining rows with AddrValid IS NULL. Each time the time to process the million rows was in fact about 17-18 seconds. When I got down to 15 million rows, having successfully processed about 6 million rows, I decided to pass in the table itself instead of the view with the TOP(1000000) statement, figuring that it would take 15 (million rows remaining) x 17 seconds to complete, or 4.25 minutes. WRONG. 11:40 later it is still processing. Sigh. However ATM I am not locked up! BTW I manually set the IO affinity mask to the same three cores which I assigned to SQL Server. Previously I had told it to do the assignment itself. I also set Max Degree of Parallelism to 2 per Nancy Lytle's suggestion. So as of right now, it is chunking away, 16:28 now. But I am not locked up. I am poking around looking at stuff. Windows task manager shows a single CPU processing... sometimes. Mostly it is flatlining, but it does occasionally spring to life. It is interesting to me that when I move from "SELECT TOP(1000000) * FROM tblX) as a saved query passed in to the stored procedure to just passing in tblX itself, I go from one million records every 17 seconds to "we ain't finishing for some reason". There are only 20 million records in the table. It should have been a quick and easy 4:30 task to finish up the 15 million records where AddrValid is still null. With the exception of the field being updated (AddrValid), all of the fields used in the stored procedure are contained in an index. The SP itself looks like this: BEGIN TRY SELECT @SQL = 'UPDATE [' + @DBName + '].[dbo].[' + @TblName + '] SET [AddrValid] = _DataMaster.dbo.udf_AddrValid(ank_, Addr, Zip4, dpv_) WHERE (AddrValid IS NULL)' exec (@SQL) print 'success UPDATING AddrValid on ' + @DBName + '.' + @TblName END TRY BEGIN CATCH print 'There was an error UPDATING AddrValid ANK ' print ERROR_MESSAGE() END CATCH And the udf looks like this: ALTER FUNCTION [dbo].[udf_AddrValid] ( -- Add the parameters for the function here @ANK varchar(50), @Address varchar(50), @Zip4 varchar(50), @dpv varchar(50) ) RETURNS char(3) AS BEGIN -- Declare the return variable here DECLARE @Return char(3) -- Add the T-SQL statements to compute the return value here IF @ANK = '77' SELECT @Return = 'ANK' ELSE IF left(@Address,5) = 'MOVED' SELECT @Return = 'MOV' ELSE if @Zip4 is null SELECT @Return = 'INV' else IF left(@Address,2) = 'PO' SELECT @Return = 'PO' ELSE IF (@dpv = 'YN') SELECT @Return = 'V' ELSE IF ((@dpv = 'DN') OR (@dpv = 'SN')) SELECT @Return = 'E' ELSE SELECT @Return = 'INV' -- Return the result of the function RETURN @Return END WELL... it just finished. 26:09 to process a 20 million record table. John W. Colby www.ColbyConsulting.com