[dba-VB] New server

jwcolby jwcolby at colbyconsulting.com
Sat Sep 25 00:09:36 CDT 2010


Well guys, I got the OK today from the client to purchase the new server.  It will use the AMD 
Magny-Cours Opteron to build out a 16-core machine with 64 gigs of memory.

It has been a long time coming.

When I started this business in August 2004 I did not know a thing about SQL Server.  I started with 
a single-core 2800 AMD with 4 gigs of RAM running Windows 2003 x32 and SQL Server 2000.

I moved to 4 cores and 16 gigs a couple of years ago, still Windows 2003 (but x64) with SQL Server 
2005 and now 2008, and have since accumulated several more large lists and done a ton of development 
in C# to get my application clicking.

For you .Net folks, I am now writing in C#, and have hired a programmer part time to help me out as 
well.  We have designed a system to process my name / address lists.  The idea is that every list 
needs to be sent through address validation every 30 days.

My first list had 65 million rows and came to me on a set of DVDs with Zip files.  Fixed width 
format.  I started learning SQL Server, imported the list and did every single thing manually.  It 
took me two weeks to get the first list imported, trailing spaces stripped off the fields and set up 
for use.  It took hours to do much of anything!  Hell, it took DAYS to do much of anything.

Today I have two lists with 100 million names, one with 64 million (split off from the original 
list), the "database from hell" with 50 million records (the original list), an email list with 23 
million, a children list with 22 million, a "dogs and cats" list with 11 million, and two smoker 
lists with 5 million and 3 million.  I just got an additional email list with 57 million (not yet 
fully processed).

9 lists with about 420 million names and addresses when aggregated.  I need to address validate all 
of those names every month.

I have developed a "table driven" system where I write supervisor records that hold the name of the 
database and some other flag fields (written to by the process).  I have broken the process down 
into 4 stages.

1) Stage 1 BCPs the data out of the table to a disk directory, comma delimited, in two-million-record 
chunks, one chunk per file (there is a rough sketch of the bcp call just after stage 4 below).

2) Stage 2 moves the files from the output staging directory to a virtual machine.

The virtual machine runs a third party application that performs the address validation on a chunk 
(CSV file).  That application is written in FoxPro for Windows, hence the chunk size limitation.  The 
third party application continuously scans an input directory looking for files, processes the files 
and places the results and a CASS report (PDF) into an output directory.

The virtual machine is and will continue to be my bottleneck, as it can only process a 2 million 
record chunk approximately every 50 minutes, about half of which is spent out on a server in 
California.  However, I can buy additional licenses to set up more VMs.

3) My stage three watches that output directory for files popping out of the third party application 
and moves them off the virtual machine back to an input staging directory.

4) Stage four imports them back into a temp database in SQL Server, builds a "valid address" flag 
field and three SHA1 hash fields, and then moves only the move addresses (records for people who have 
moved) back into the live database.  Before it brings the move addresses back in, it archives the old 
addresses (where they lived before they moved).
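
For what it's worth, the stage one export boils down to shelling out to bcp.  A rough sketch of the 
shape of it (the table, column and switch choices here are illustrative guesses, not the production 
code):

using System.Diagnostics;

class Stage1Export
{
    // Export one 2-million-row chunk to a comma delimited file via bcp queryout.
    static void ExportChunk(string dbName, string tableName, long firstPk, long lastPk,
                            string outFile, string serverName)
    {
        string query = string.Format(
            "SELECT * FROM [{0}].[dbo].[{1}] WHERE PK BETWEEN {2} AND {3}",
            dbName, tableName, firstPk, lastPk);

        // -c = character data, -t, = comma field terminator, -T = trusted connection
        var args = "\"" + query + "\" queryout \"" + outFile + "\" -c -t, -S " + serverName + " -T";

        using (var bcp = Process.Start(new ProcessStartInfo("bcp", args) { UseShellExecute = false }))
        {
            bcp.WaitForExit();
        }
    }
}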

Each stage runs in a thread and is asynchronous to the other stages.  IOW, stage one can be 
outputting files from TableX while stage two is moving files from a previously exported TableY out to 
the virtual machine.  Stage three is always watching for files from the third party app (on the VM) 
and moving them to the input staging directory where they belong, and stage four is watching for 
files to appear in its input staging directory and importing them back into SQL Server.
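
The plumbing for that is pretty simple.  Roughly (class and method names below are invented for 
illustration, not the real ones), each stage is its own polling loop on its own thread:

using System;
using System.Threading;

class StageRunner
{
    // Start a background thread that repeatedly does one unit of work for a stage,
    // then sleeps and looks again.  Stage 1 polls the supervisor table, stages 2-4
    // poll their directories.
    public static Thread StartStage(string name, Action doOneUnitOfWork)
    {
        var thread = new Thread(() =>
        {
            while (true)
            {
                doOneUnitOfWork();
                Thread.Sleep(5000);
            }
        }) { IsBackground = true, Name = name };
        thread.Start();
        return thread;
    }
}

// All four stages run concurrently, each against its own queue of work:
//   StageRunner.StartStage("Stage1-Export",  ExportNextChunk);
//   StageRunner.StartStage("Stage2-ToVM",    MoveNextFileToVm);
//   StageRunner.StartStage("Stage3-FromVM",  MoveNextFileFromVm);
//   StageRunner.StartStage("Stage4-Import",  ImportNextFile);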

Stage one kind of drives the process since it writes a record for each 2 million record chunk file 
exported out of SQL Server.  That record (child to the supervisor record that "orders" the address 
validation for a table) defines the staging out / in paths for the later stages to work with, and 
provides fields for all the stages to log completion dates and status on a per file basis, as well 
as RecordsAffected (where appropriate) and start / stop times for each file for each stage.
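
To give a feel for the chunk record (the field names here are made up to show the shape of it; the 
real table has more in it):

using System;

// One row per 2 million record file, child to the supervisor record.
class FileChunkRecord
{
    public int SupervisorId;          // which table / validation "order" this chunk belongs to
    public string FileName;           // e.g. SomeTable_0001.csv (illustrative name)
    public string StagingOutPath;     // where stage 1 wrote the file
    public string StagingInPath;      // where stage 3 drops the validated file
    public DateTime? Stage1Start, Stage1Stop;   // start / stop times per stage
    public DateTime? Stage2Start, Stage2Stop;
    public DateTime? Stage3Start, Stage3Stop;
    public DateTime? Stage4Start, Stage4Stop;
    public string Stage1Status, Stage2Status, Stage3Status, Stage4Status;
    public int? RecordsAffected;      // rows written back by stage 4, where appropriate
}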

Each table has its own staging in / out directory on disk, created by stage one, and the paths are 
logged in the supervisor table.  After the files are processed, stage one cleans up by zipping up the 
files in the output staging directory.  After stage four successfully imports files back into SQL 
Server, it cleans up by zipping up the files that have been imported.  Stage four also creates a 
brand new supervisor record set for 30 days out from its completion.  IOW, "do it again in 30 days".

All of this was originally written in stored procedures in SQL Server, but that was clumsy at best. 
I pretty much had paper lists of stored procedures to run in a specific order to do this processing. 
Then stored procedures running the paper lists of stored procedures.  It was getting out of hand!!! 
I tried to automate it using VBA and Access (because that is what I knew), but with a single thread 
for everything that didn't work well at all.

Last fall I took a C# class at the local community college and started writing the system described 
above, using C# to automate the system, executing the stored procedures previously written over the 
years that did all this stuff.  That actually worked quite well, but now the source was split 
between SQL Server stored procedures, which are (relatively) hard to write / debug / maintain, and 
C# / .Net (a dream to maintain).

When I hired Paul in May, we went into high gear and have just finished completely replacing the old 
stored procedures with equivalent SQL sent directly from C#.  Basically we took the stored 
procedures, analyzed the dynamic SQL in each one, pulled that SQL into C# and used a command object 
to send the SQL out to SQL Server.
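
The pattern is nothing fancy: build the SQL string, hand it to a command object, run it.  A stripped 
down example (the UPDATE shown is just an illustration of the kind of statement involved, e.g. 
building one of the SHA1 hash fields with HASHBYTES - not one of our actual statements):

using System.Data.SqlClient;

class SqlRunner
{
    // Send a piece of SQL to SQL Server and return the rows affected.
    public static int RunSql(string connectionString, string sql)
    {
        using (var cn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, cn))
        {
            cmd.CommandTimeout = 0;        // these statements can run for a long time
            cn.Open();
            return cmd.ExecuteNonQuery();
        }
    }
}

// For example:
//   int rows = SqlRunner.RunSql(cnStr,
//       "UPDATE dbo.TempImport " +
//       "SET AddrHash = HASHBYTES('SHA1', Addr + '|' + City + '|' + St + '|' + Zip)");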

Forms / combos to select databases to process, combos to select tables etc.  "Wizard" interfaces to 
do all of this stuff.  Far far away from the paper lists of stored procedures of yesteryear.

Table driven, logged status and NLogger for errors.  Visual SVN for source control.

Oh, and I have C# code to do backup / restores of these databases.  SQL Server 2008 has compression 
built in and with SMO it is a breeze to write this kind of code in C#.
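
For what it's worth, the SMO side boils down to something like this (server and database names are 
placeholders):

using Microsoft.SqlServer.Management.Common;
using Microsoft.SqlServer.Management.Smo;

class BackupHelper
{
    public static void BackupDatabase(string serverName, string dbName, string backupFile)
    {
        var server = new Server(new ServerConnection(serverName));

        var backup = new Backup
        {
            Action = BackupActionType.Database,
            Database = dbName,
            CompressionOption = BackupCompressionOptions.On   // SQL 2008 native compression
        };
        backup.Devices.AddDevice(backupFile, DeviceType.File);
        backup.SqlBackup(server);     // runs the backup
    }
}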

And the system also processes the orders that the client sends, which use these tables.

Needless to say, I need more horsepower to keep up with the system as it exists today. Four cores 
and 16 gigs of memory just can't handle the task(s) any more.

The new server will have 16 cores and 64 gigs of RAM, and I am using 8 gig ECC registered sticks so 
that if I need to I can drop in another 64 gigs later.  Next year when Bulldozer comes out I will 
likely upgrade to 32 cores (if required) and 128 gigs of RAM.  With luck that will be a simple 
processor swap.

I will eventually be placing the main databases (for use) on flash drives to speed up the reads. 
Most of the actual use (as opposed to the address updates) is read only, i.e. I never write to the 
name / address tables (other than the monthly address updates).  Flash drives provide smoking fast 
reads and insane IOPS, and I already have a coprocessor RAID controller to drive them.  Coupled with 
16 cores and 64 gigs of memory, this *should* give me a very responsive system.

So there ya have it, the list processing system dreamed up by an old Access guy.  Available today 
(well, the new server is on the way) at a home office somewhere in the foothills of the Smoky 
Mountains of NC.

I am pretty proud of my Frankenstein.  It has been 5 years in the making but it is finally coming 
together.

One of the guys I do business with called the other day.  Said he was talking with a guy who claimed 
to have two BILLION names / addresses in a bunch of lists, and he said that was probably low, 
perhaps more like 6 to 8 billion.  I'm thinking I might be able to actually handle that business 
(not that I will necessarily get it).  Two years ago it would have been impossible.

My dream is to be sitting on the beach, with a laptop at my side, remote desktop into my servers, 
watching my servers make money for me.  We shall see.

-- 
John W. Colby
www.ColbyConsulting.com


