John W. Colby
jwcolby at colbyconsulting.com
Tue Sep 14 11:45:32 CDT 2004
Just to let you know some numbers on the database as it currently stands.

First, I have two almost identical machines which I built to hold and manipulate the database. Both machines use an MSI K8N Neo motherboard with built-in 4-port SATA, dual-channel IDE, and gigabit LAN. The processor is the 754-pin Athlon 64 at 3 GHz. The server currently has Windows 2000 and 3 GB of RAM installed. Apparently Win2K can use up to 4 GB of RAM whereas XP is limited to 2 GB. Unfortunately I cannot persuade SQL Server to use more than 2 GB of RAM, so I am not sure that more memory than that is really useful in this context. The server then has (4) 250 GB Maxtor SATA drives holding the data files and (1) 250 GB Maxtor IDE drive holding the log file. The second machine is currently running XP Pro. Since the two new machines have gigabit NICs built into the motherboard, I bought an 8-port gigabit switch so they could talk at full speed. In general I have found that I can run operations from the second machine against the server over gigabit LAN at close to full speed, i.e. the LAN is no longer a severe bottleneck (it definitely was at 100 Mbit).

The db is 164 million records, 660 fields, ~3 KB per record. I have split the db into 5 files on (5) 250 GB hard disks. I turned on bulk logging for the import, and when the raw data finished importing, the data files were about 40 GB apiece after compacting and the log file about 100 MB. Turning on bulk logging is what enabled the process to complete in a timely manner.

After the import I added a PK field, a long autoincrement. Doing that locked up the machine for over 24 hours but finished with the new field created as expected. When it finished, the log file was sitting at its 200 GB maximum size and the four data files were sitting at about 80 GB each, but they shrank back to about 40-60 GB. I then added an index on the state field and ended up with two data files at 40 GB and two at 80 GB.
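To illustrate the shape of that step, here is a tiny sqlite3 sketch (table and field names are hypothetical; on SQL Server the real statements would be along the lines of ALTER TABLE ... ADD PK int IDENTITY(1,1) followed by CREATE INDEX). SQLite cannot add an autoincrement column in place, so the sketch rebuilds the table, which is roughly the kind of full-table rewrite that ran overnight on 164 million rows:

```python
import sqlite3

# In-memory stand-in for the real database; names are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tblData (state TEXT, city TEXT)")
con.executemany("INSERT INTO tblData VALUES (?, ?)",
                [("NC", "Raleigh"), ("NC", "Durham"), ("VA", "Richmond")])

# SQLite cannot ADD COLUMN ... AUTOINCREMENT, so rebuild the table --
# a full-row rewrite, much like the server-side operation that took 24+ hours.
con.execute("""CREATE TABLE tblData2 (
                   PK INTEGER PRIMARY KEY AUTOINCREMENT,
                   state TEXT,
                   city TEXT)""")
con.execute("""INSERT INTO tblData2 (state, city)
               SELECT state, city FROM tblData ORDER BY rowid""")
con.execute("DROP TABLE tblData")
con.execute("ALTER TABLE tblData2 RENAME TO tblData")

# The index on the state field.
con.execute("CREATE INDEX ix_state ON tblData (state)")

rows = con.execute("SELECT PK, state FROM tblData ORDER BY PK").fetchall()
print(rows)
```

The rebuild-then-rename pattern is only how SQLite forces you to do it; SQL Server does the equivalent rewrite internally when the IDENTITY column is added.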
Adding the PK field and its index, plus the index on the state field, apparently added about 80 GB of data to the 160 GB used for the data itself. I have to apologize, but I didn't keep written logs of the file sizes after each step. Each of these operations also took overnight. After each operation I used shrink to compact the files back down. Strangely, the shrink operation took a rather long time on two of the data files (as much as an hour) but only about a minute on the other two and on the log file.

I have turned to WinZip for backup. Zipping any of the database or log files gives me about a 13-to-1 compression ratio, so I can zip the files and store a 43 GB file in a 3.3 GB zip. Of course it does take a while. I have the server zip one file and the other NEO-based machine zip another. The second NEO finishes about 10% behind the server (time-wise), which is pretty darned good IMHO. Someday I will time the zip operation.

I have run a count of records grouped by state. That operation took a mere 1 minute and clearly demonstrates the power of the index! I also built a view to pull an "every hundredth record" subset of the data using a "WHERE (PK % 100) = 0" clause. This process takes about 25 minutes to create the dataset. My intent is to use that view to create a table containing that sub-dataset, then use that smaller table to run the various analyses the client is asking for. I still have to run what I fear will be a very time-consuming process to strip trailing spaces from various fields. A lot of work remains, but at least I have it all in SQL Server and can produce results now.

John W. Colby
www.ColbyConsulting.com
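P.S. A minimal sketch of the grouped count and the every-hundredth-record view, using Python's sqlite3 as a stand-in (table and view names are hypothetical; on SQL Server the GROUP BY query and CREATE VIEW are essentially the same, and the smaller table would be made with SELECT ... INTO from the view):

```python
import sqlite3

# Hypothetical miniature of the real table: 1000 rows split across two states.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tblData (PK INTEGER PRIMARY KEY, state TEXT)")
con.executemany("INSERT INTO tblData (PK, state) VALUES (?, ?)",
                [(i, "NC" if i % 2 else "VA") for i in range(1, 1001)])

# Count of records grouped by state -- the query the state index makes fast.
counts = con.execute("""SELECT state, COUNT(*) FROM tblData
                        GROUP BY state ORDER BY state""").fetchall()

# The "every hundredth record" view, using the WHERE (PK % 100) = 0 clause.
con.execute("""CREATE VIEW vwSample AS
               SELECT * FROM tblData WHERE (PK % 100) = 0""")

# Materialize the view into a smaller table for the client's analyses
# (SQLite's CREATE TABLE ... AS SELECT; T-SQL would use SELECT ... INTO).
con.execute("CREATE TABLE tblSample AS SELECT * FROM vwSample")
sample_count = con.execute("SELECT COUNT(*) FROM tblSample").fetchone()[0]
print(counts, sample_count)
```

Note that PK % 100 picks every hundredth PK value, so gaps in the PK sequence (from deletes) would make the sample slightly smaller than 1%.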
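P.P.S. The trailing-space cleanup could look like this sqlite3 sketch (RTRIM exists in both SQLite and T-SQL; restricting the UPDATE to rows that actually have trailing spaces is my assumption about how to keep the pass from rewriting every row):

```python
import sqlite3

# Hypothetical sample with some trailing-space pollution.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tblData (PK INTEGER PRIMARY KEY, city TEXT)")
con.executemany("INSERT INTO tblData (PK, city) VALUES (?, ?)",
                [(1, "Raleigh   "), (2, "Durham"), (3, "Boone  ")])

# Strip trailing spaces in place; the WHERE clause skips rows that are
# already clean, so only the dirty fraction of the table gets rewritten.
con.execute("UPDATE tblData SET city = RTRIM(city) WHERE city <> RTRIM(city)")

cities = con.execute("SELECT city FROM tblData ORDER BY PK").fetchall()
print(cities)
```

On the real table this would be one UPDATE per affected field, and with bulk-logged recovery off it will be fully logged, which is presumably why the step looks time-consuming.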