jwcolby
jwcolby at colbyconsulting.com
Sat Sep 25 00:09:36 CDT 2010
Well guys, I got the OK today from the client to purchase the new server. It will use the AMD Magny-Cours Opteron to build out a 16 core machine with 64 gigs of memory. It has been a long time coming. When I started this business in August 2004 I did not know a thing about SQL Server. I started with a single-core 2800 AMD with 4 gigs of RAM running Windows 2003 x32 and SQL Server 2000. I moved to 4 cores and 16 gigs a couple of years ago, still Windows 2003 (x64) but SQL Server 2005 and now 2008, and have since accumulated several more large lists and done a ton of development in C# to get my application clicking. For you .Net folks, I am now writing in C#, and have hired a programmer part-time to help me out as well.

We have designed a system to process my name / address lists. The idea is that every list needs to be sent through address validation every 30 days. My first list had 65 million rows and came to me on a set of DVDs as Zip files, in fixed-width format. I started learning SQL Server, imported the list and did every single thing manually. It took me two weeks to get the first list imported, trailing spaces stripped off the fields and set up for use. It took hours to do much of anything! Hell, it took DAYS to do much of anything.

Today I have two lists with 100 million names, one with 64 million (split off from the original list), the "database from hell" with 50 million records (the original list), an email list with 23 million, a children list with 22 million, a "dogs and cats" list with 11 million, and smoker lists with 5 million and 3 million. I just got an additional email list with 57 million (not yet fully processed). Nine lists with about 420 million names and addresses when aggregated. I need to address-validate all of those names every month.

I have developed a "table driven" system where I write supervisor records that hold the name of the database and some other flag fields (written to by the process). I have broken the process down into four stages:

1) Stage 1 BCPs the data out of the table to a disk directory, comma delimited, in two-million-record chunks (one chunk per file).

2) Stage 2 moves the files from the output staging directory to a virtual machine. The virtual machine runs a third-party application that performs the address validation on a chunk (CSV file). That application is written in FoxPro for Windows, thus the chunk size limitation. The third-party application continuously scans an input directory looking for files, processes them, and places the results and a CASS report (PDF) into an output directory. The virtual machine is and will continue to be my bottleneck, since it can only process a two-million-record chunk roughly every 50 minutes, about half of which time is spent out on a server in California. However, I can buy additional licenses to set up more VMs.

3) Stage 3 watches that output directory for files popping out of the third-party application and moves them off the virtual machine back to an input staging directory.

4) Stage 4 imports them back into a temp database in SQL Server, builds a "valid address" flag field and three SHA1 hash fields, and then moves only the moved addresses back into the live database. Before it brings the moved addresses back in, it archives the old addresses (where those people lived before they moved).

Each stage runs in its own thread and is asynchronous to the other stages. IOW, stage 1 can be outputting files from TableX while stage 2 is moving files from a previously exported TableY out to the virtual machine, stage 3 is watching for files from the third-party app (on the VM) and moving them to the input staging directory where they belong, and stage 4 is watching for files to appear in its input staging directory and importing them back into SQL Server.
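For the C# folks, here is a bare-bones sketch of the general shape of it: each stage on its own thread, with stage 1 shelling out to bcp.exe to dump a two-million-row chunk as a CSV. This is NOT my production code, just the skeleton, and the table, key column, server and path names are made up for illustration.

using System;
using System.Diagnostics;
using System.Threading;

class PipelineSketch
{
    static void Main()
    {
        // Each stage runs on its own thread and polls for work, so the stages
        // overlap: stage 1 can be exporting one table while stage 2 is still
        // shipping a previously exported table out to the VM.
        Thread[] stages = new Thread[]
        {
            new Thread(Stage1ExportLoop),
            new Thread(Stage2MoveToVmLoop),
            new Thread(Stage3MoveFromVmLoop),
            new Thread(Stage4ImportLoop)
        };
        foreach (Thread t in stages) t.Start();
        foreach (Thread t in stages) t.Join();
    }

    // Stage 1: BCP one 2,000,000-row chunk out to the staging directory as CSV.
    static void ExportChunk(long firstId, long lastId, int chunkNo)
    {
        string query = string.Format(
            "SELECT * FROM MyListDb.dbo.NameAddress WHERE PK BETWEEN {0} AND {1}",
            firstId, lastId);
        string outFile = string.Format(@"D:\StageOut\MyListDb\chunk{0:D4}.csv", chunkNo);
        string args = string.Format(
            "\"{0}\" queryout \"{1}\" -c -t, -S MYSERVER -T", query, outFile);

        ProcessStartInfo psi = new ProcessStartInfo("bcp.exe", args);
        psi.UseShellExecute = false;
        using (Process p = Process.Start(psi))
        {
            p.WaitForExit();   // one chunk at a time; the calling loop picks the key ranges
        }
    }

    static void Stage1ExportLoop()     { /* read supervisor records, call ExportChunk per chunk */ }
    static void Stage2MoveToVmLoop()   { /* copy finished CSVs to the VM's input directory */ }
    static void Stage3MoveFromVmLoop() { /* watch the VM's output directory, move results back */ }
    static void Stage4ImportLoop()     { /* bulk import results, flag valid addresses, hash, update */ }
}

The real thing reads the supervisor / chunk records to decide what to export and logs status back as each file finishes, but that is the basic shape.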
Stage 1 kind of drives the process, since it writes a record for each two-million-record chunk file exported out of SQL Server. That record (child to the supervisor record that "orders" the address validation for a table) defines the staging out / in paths for the later stages to work with, and provides fields for all the stages to log completion dates and status on a per-file basis, as well as RecordsAffected (where appropriate) and start / stop times for each file for each stage. Each table has its own staging in / out directory on disk, created by stage 1, and the paths are logged in the supervisor table. After the files are processed, stage 1 cleans up by zipping up the files in the output staging directory. After stage 4 successfully imports files back into SQL Server, it cleans up by zipping up the files that have been imported. Stage 4 also creates a brand new supervisor record set for 30 days out from its completion. IOW, "do it again in 30 days".

All of this was originally written in stored procedures in SQL Server, but that was clumsy at best. I pretty much had paper lists of stored procedures to run in a specific order to do this processing. Then stored procedures running the paper lists of stored procedures. It was getting out of hand!!! I tried to automate it using VBA and Access (because that is what I knew), but with a single thread for everything that didn't work well at all. Last fall I took a C# class at the local community college and started writing the system described above, using C# to automate the process by executing the stored procedures written over the years that did all this stuff. That actually worked quite well, but now the source was split between SQL Server stored procedures, which are (relatively) hard to write / debug / maintain, and C# / .Net (a dream to maintain). When I hired Paul in May we went into high gear and have just finished completely replacing the old stored procedures with equivalent SQL sent directly from C#. Basically we took the stored procedures, analyzed the dynamic SQL in each one, pulled that SQL into C# and used a command object to send it out to SQL Server. Forms / combos to select databases to process, combos to select tables, etc. "Wizard" interfaces to do all of this stuff. Far, far away from the paper lists of stored procedures of yesteryear. Table driven, logged status, and NLogger for errors. Visual SVN for source control.

Oh, and I have C# code to do backups / restores of these databases. SQL Server 2008 has compression built in, and with SMO it is a breeze to write this kind of code in C#.
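For anyone curious what the SMO backup piece boils down to, it is roughly this (a stripped-down sketch: the server, database and backup path are placeholders, and you need references to the SMO assemblies that install with SQL Server 2008):

using Microsoft.SqlServer.Management.Common;
using Microsoft.SqlServer.Management.Smo;

class BackupSketch
{
    static void Main()
    {
        // Trusted connection to the instance; "MYSERVER" is a placeholder.
        Server server = new Server(new ServerConnection("MYSERVER"));

        Backup bk = new Backup();
        bk.Action = BackupActionType.Database;
        bk.Database = "MyListDb";
        bk.Devices.AddDevice(@"E:\Backups\MyListDb.bak", DeviceType.File);
        bk.CompressionOption = BackupCompressionOptions.On; // the 2008 native compression
        bk.Initialize = true;                               // overwrite any existing backup set
        bk.SqlBackup(server);
    }
}

Restores go the same way with the SMO Restore class.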
And the system also processes the orders that the client sends, which use these tables.

Needless to say, I need more horsepower to keep up with the system as it exists today. Four cores and 16 gigs of memory just can't handle the task(s) any more. The new server will have 16 cores and 64 gigs of RAM, and I am using 8 gig ECC registered sticks so that if I need to I can drop in another 64 gigs later. Next year when Bulldozer comes out I will likely upgrade to 32 cores (if required) and 128 gigs of RAM; with luck that will be a simple processor swap. I will eventually be placing the main databases (for use) on Flash drives to speed up the reads. Most of the actual use (as opposed to the address updates) is read-only, i.e. I never write to the name / address tables (other than the monthly address updates). Flash drives provide smoking fast reads and insane IOPS, and I already have a coprocessor RAID controller to drive them. Coupled with 16 cores and 64 gigs of memory, this *should* give me a very responsive system.

So there ya have it, the list processing system dreamed up by an old Access guy. Available today (well, the new server is on the way) at a home office somewhere in the foothills of the Smoky Mountains of NC. I am pretty proud of my Frankenstein. It has been five years in the making, but it is finally coming together.

One of the guys I do business with called the other day. He said he was talking with a guy who claimed to have two BILLION names / addresses in a bunch of lists, and that that was probably low, perhaps more like 6 to 8 billion. I'm thinking I might actually be able to handle that business (not that I will necessarily get it). Two years ago it would have been impossible.

My dream is to be sitting on the beach, laptop at my side, remote desktopped into my servers, watching them make money for me. We shall see.

--
John W. Colby
www.ColbyConsulting.com