John W. Colby
jwcolby at colbyconsulting.com
Sun Sep 5 23:57:43 CDT 2004
Aww you're just sadistic at heart! ;-) In fact I have now had this thing in my lap for well over two weeks and still haven't managed to get it into a SQL Server db, though I FEEL it getting closer. The data is (IMHO) simply too BIG to do what you are talking about. We are talking about 600 fields here. The data is broken down into 22 files of 10 gb of raw data, each containing 3 million records of 600 fields. That is a LOT of data. Imagine that it takes "ONLY" 1/2 hour to clean up each field, we are now talking 300 hours. Would it really "only" take 1/2 hour to clean up one field in 22 files containing (a total of) 65 million records. Doubtful to say the least. If cleanup is going to occur, I think it will have to occur inside the database once it is imported. I can do the cleanup over months once I am up and running. As for the LAN thing, I currently have a 100 mbit router, with two very fast computers (3ghz a64 with 2g memory apiece), my laptop (A64 with 512mb ram) and 2 older desktop machines, 2.5g and 1.3g Athlons with 512g ram. The new A64s have 1 gigabit NICs so I have ordered an 8 port gigabit switch. As much moving of the raw data as I am doing that can't hurt. I am waffling re whether a gigabit nic would help the other two desktops - it seems unlikely but maybe. In the end my wife's computer (the 1.3g Athlon) will go back to being her machine, I am only really using it for the import, and SHE has no need for gigabit LAN. I am using one of the new A64s as the SQL Server. I am using the other 4 machines to "throw data" at the server. It actually was working well, until I discovered the "quote" problem. Sigh. Each 10g file takes about 45 minutes to an hour to bulk insert in to the server. With 4 machines hammering away I was actually making progress. Now I have to "start over" but once I do and get all the machines rolling again I should be in business shortly - assuming I don't find something else to get in the way. The biggest show stopper was simply getting the storage on line and figuring out that I could use multiple files for the db instead of one big file. In fact it appears that having multiple files is faster than a single file. I REALLY wanted to get my system set up with a Gig of storage, Raid 1 so that the reliability would be there from the start. It turns out that in my economic bracket it simply isn't possible to get that much storage, with hardware raid, in a single box. I don't have the bucks for SCSI and with a realistic max 250gb hard disk size, 1g (raid 1) is 8 drives. Plus the system disk, plus the cd rom. In fact it will be possible but only with PCI raid controllers, which in this case are slower because of having to deal with the PCI bus. My new motherboards can deal with 8 disks directly, 4 SATA and 4 IDE, directly off the Nvidia chip (doesn't go through the PCI bus) but that leaves the system disk (disks if you raid that too) and the CD. Something's gotta give! Since I MUST show results I backed off, set up 1G (4 SATA Drives) NON Raid plus the system disk (non raid), CD and another 250g IDE drive just for misc. stuff (raw data anyone?). That's working fine but somewhere down the line I will be faced with bringing it up to snuff. Or building a perfect backup system. Wouldn't you know my reading of SQL Server's backup says that a backup can only go to a hard disk on the same machine. HELLO... Unless the backup is highly compressed we are now talking another terrabyte. How do you backup a terrabyte? OK, now how do you do it on my budget? Anyway... I have always been the kind that would bravely say "Yea I can do that", and then go learn how to do it. This is one of those experiences... IN SPADES. But I WILL persevere! John W. Colby www.ColbyConsulting.com -----Original Message----- From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Jim Lawrence (AccessD) Sent: Monday, September 06, 2004 12:01 AM To: dba-sqlserver at databaseadvisors.com Subject: RE: [dba-SQLServer] Quotes in data part II Hi John: I Have been following the data saga and have been quiet enjoying it. Not it a sadistic way but in a sympathetic way. I am glad it is you and not me. Here is some observations and suggestions. 1. Why not import the data is chunks. Clean up each chunk by change the 'Y' and 'N' fields to True and False (1,0). Then port the data to the finally database. This could even be on another computer. Start partially breaking the data up along the lines described in that Warehousing article I posted a couple of days ago. This process could be accomplished by a SP. The Database/tables holding the cleaned up data would be far smaller than the raw imported data. The conversion would take longer to get all the data imported but in actual fact the whole process would be shorter. (Note if you are using a couple of computers connected across a LAN, a Switch is a better choice that a Hub.) 2. Do not even think of putting keys on the tables until all the data has been imported. Nothing consumes resources and slows performance more than having to manage keys when massive amounts of data is being imported. 3. Trying to do an exact match is almost impossible. There will be a number of matches but then there is the data where the person has changed their name, their name has been misspelled, a number of times, they have moved residence etc..etc... A lot of the matching can be done by the programs but it comes down grunt work...like ten cloistered data-clerks each with a 5 year contract. These comments may have already been covered or totally redundant but I hope there is something there that will help out. Jim