[dba-SQLServer] Quotes in data part II

John W. Colby jwcolby at colbyconsulting.com
Mon Sep 6 00:20:40 CDT 2004


LOL.  I keep talking about 1 GB of data.  Of course it is turning out to be
almost 1 TB.  

John W. Colby
www.ColbyConsulting.com 

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of John W.
Colby
Sent: Monday, September 06, 2004 12:58 AM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] Quotes in data part II


Aww you're just sadistic at heart!  ;-)

In fact I have now had this thing in my lap for well over two weeks and
still haven't managed to get it into a SQL Server db, though I FEEL it
getting closer.

The data is (IMHO) simply too BIG to do what you are talking about.  We are
talking about 600 fields here.  The data is broken down into 22 files of 10
GB of raw data, each containing 3 million records of 600 fields.  That is a
LOT of data.  Imagine that it takes "ONLY" 1/2 hour to clean up each field;
we are now talking 300 hours.  Would it really "only" take 1/2 hour to clean
up one field across 22 files containing (a total of) 65 million records?
Doubtful, to say the least.  If cleanup is going to occur, I think it will
have to occur inside the database once it is imported.  I can do the cleanup
over months once I am up and running.
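The kind of in-database cleanup I have in mind would look something like this (tblRaw and the column name are made up for illustration):

```sql
-- Sketch only: tblRaw and FirstName are placeholder names.
-- Strip stray double quotes and trim whitespace, one column at a time,
-- after the raw text has already been bulk-loaded.
UPDATE tblRaw
SET FirstName = LTRIM(RTRIM(REPLACE(FirstName, '"', '')))
WHERE FirstName LIKE '%"%'
   OR FirstName LIKE ' %'
   OR FirstName LIKE '% ';
```

Run one column per pass; the WHERE clause keeps each pass from rewriting rows that are already clean.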

As for the LAN thing, I currently have a 100 Mbit router, with two very fast
computers (3 GHz A64s with 2 GB of memory apiece), my laptop (an A64 with
512 MB of RAM) and 2 older desktop machines, 2.5 GHz and 1.3 GHz Athlons
with 512 MB of RAM.  The new A64s have gigabit NICs, so I have ordered an 8
port gigabit switch.  With as much moving of the raw data as I am doing,
that can't hurt.  I am waffling over whether a gigabit NIC would help the
other two desktops - it seems unlikely, but maybe.  In the end my wife's
computer (the 1.3 GHz Athlon) will go back to being her machine; I am only
really using it for the import, and SHE has no need for a gigabit LAN.

I am using one of the new A64s as the SQL Server.  I am using the other 4
machines to "throw data" at the server.  It actually was working well, until
I discovered the "quote" problem.  Sigh.
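For reference, the loads are plain BULK INSERT statements along these lines (table name and path are placeholders).  As I understand it, BULK INSERT in SQL Server 2000 has no notion of quoted fields - a double quote comes in as literal data - which is exactly why stray quotes are such a headache:

```sql
-- Sketch only: table name and UNC path are placeholders.
-- TABLOCK lets the load take a bulk-update lock for better throughput.
BULK INSERT tblRaw
FROM '\\machine2\raw\chunk01.txt'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    TABLOCK
);
```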

Each 10 GB file takes about 45 minutes to an hour to bulk insert into the
server.  With 4 machines hammering away I was actually making progress.  Now
I have to "start over" but once I do and get all the machines rolling again
I should be in business shortly - assuming I don't find something else to
get in the way.  The biggest show stopper was simply getting the storage on
line and figuring out that I could use multiple files for the db instead of
one big file.  In fact it appears that having multiple files is faster than
a single file.
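By multiple files I mean something along these lines (names, paths, and sizes are all illustrative); spreading the database across files on separate spindles lets SQL Server issue I/O against them in parallel:

```sql
-- Sketch only: database name, file paths, and sizes are placeholders.
CREATE DATABASE BigList
ON PRIMARY
    (NAME = BigList1, FILENAME = 'D:\Data\BigList1.mdf', SIZE = 200GB),
    (NAME = BigList2, FILENAME = 'E:\Data\BigList2.ndf', SIZE = 200GB),
    (NAME = BigList3, FILENAME = 'F:\Data\BigList3.ndf', SIZE = 200GB)
LOG ON
    (NAME = BigListLog, FILENAME = 'G:\Logs\BigList_log.ldf', SIZE = 20GB);
```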

I REALLY wanted to get my system set up with a terabyte of storage, RAID 1,
so that the reliability would be there from the start.  It turns out that in
my economic bracket it simply isn't possible to get that much storage, with
hardware RAID, in a single box.  I don't have the bucks for SCSI, and with a
realistic max 250 GB hard disk size, 1 TB (RAID 1) is 8 drives.  Plus the
system disk, plus the CD-ROM.  In fact it would be possible, but only with
PCI RAID controllers, which in this case are slower because of having to
deal with the PCI bus.  My new motherboards can deal with 8 disks directly,
4 SATA and 4 IDE, directly off the Nvidia chip (doesn't go through the PCI
bus), but that leaves the system disk (disks if you RAID that too) and the
CD.   Something's gotta give!

Since I MUST show results I backed off, set up 1 TB (4 SATA drives)
non-RAID, plus the system disk (non-RAID), CD and another 250 GB IDE drive
just for misc. stuff (raw data anyone?).  That's working fine, but somewhere
down the line I will be faced with bringing it up to snuff.  Or building a
perfect backup system.  

Wouldn't you know, my reading of SQL Server's backup says that a backup can
only go to a hard disk on the same machine.  HELLO... Unless the backup is
highly compressed we are now talking another terabyte.  How do you back up a
terabyte?  OK, now how do you do it on my budget?
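For what it's worth, the backup itself is a one-liner (database name and path made up).  As I read the docs, a UNC path to a share the SQL Server service account can reach is also accepted, which might be a way around the local-disk limit:

```sql
-- Sketch only: database name and target path are placeholders.
-- A UNC path like '\\machine2\backups\BigList_full.bak' should also work
-- if the SQL Server service account has write access to the share.
BACKUP DATABASE BigList
TO DISK = 'E:\Backups\BigList_full.bak'
WITH INIT;
```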

Anyway... I have always been the kind that would bravely say "Yeah, I can do
that," and then go learn how to do it.  This is one of those experiences...
IN SPADES.  But I WILL persevere!

John W. Colby
www.ColbyConsulting.com 

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Jim
Lawrence (AccessD)
Sent: Monday, September 06, 2004 12:01 AM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] Quotes in data part II


Hi John:

I have been following the data saga and have been quite enjoying it.  Not in
a sadistic way but in a sympathetic way.  I am glad it is you and not me.

Here are some observations and suggestions.

1. Why not import the data in chunks?  Clean up each chunk by changing the
'Y' and 'N' fields to True and False (1, 0).  Then port the data to the
final database.  This could even be on another computer.  Start partially
breaking the data up along the lines described in that warehousing article I
posted a couple of days ago.  This process could be accomplished by an SP.
The database/tables holding the cleaned-up data would be far smaller than
the raw imported data.  The conversion would take longer to get all the data
imported, but in actual fact the whole process would be shorter.  (Note: if
you are using a couple of computers connected across a LAN, a switch is a
better choice than a hub.)
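The SP step I mean would be something along these lines (staging and final table names, and the columns, are placeholders):

```sql
-- Sketch only: tblRawChunk (varchar flags) and tblClean (bit columns)
-- are placeholder names.  Convert 'Y'/'N' text to 1/0 as the rows are
-- ported from the staging chunk to the final table.
INSERT INTO tblClean (PersonID, HasBoat)
SELECT PersonID,
       CASE WHEN Boat = 'Y' THEN 1 ELSE 0 END
FROM tblRawChunk;
```

A bit column is 1 byte (and SQL Server packs up to 8 of them together), versus at least 1 byte plus varchar overhead for the raw text, so 600 fields of flags shrink considerably.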

2. Do not even think of putting keys on the tables until all the data has
been imported.  Nothing consumes resources and slows performance more than
having to manage keys while massive amounts of data are being imported.
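In other words, build the key once after the load instead of maintaining it row by row (placeholder names again):

```sql
-- Sketch only: table and column names are placeholders.
-- Adding the primary key after the bulk load builds the index in one pass.
ALTER TABLE tblClean
ADD CONSTRAINT PK_tblClean PRIMARY KEY CLUSTERED (PersonID);
```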

3. Trying to do an exact match is almost impossible.  There will be a number
of matches, but then there is the data where the person has changed their
name, their name has been misspelled a number of times, they have moved
residence, etc.  A lot of the matching can be done by programs, but it comes
down to grunt work... like ten cloistered data clerks each with a 5 year
contract.
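As a sketch of the program side of the matching (in Python, with made-up names), a simple similarity score can flag near-matches for the clerks to review instead of relying on an exact join:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring case and extra whitespace."""
    a_norm = " ".join(a.lower().split())
    b_norm = " ".join(b.lower().split())
    return SequenceMatcher(None, a_norm, b_norm).ratio()

# An exact join would miss the first pair; a threshold around 0.85
# would flag it for human review.
print(name_similarity("John W. Colby", "John W Colby"))  # close to 1.0
print(name_similarity("John Colby", "Jane Smith"))       # much lower
```

Anything scoring between the threshold and 1.0 goes into a review queue; only the leftovers need the grunt work.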

These comments may already have been covered or be totally redundant, but I
hope there is something here that will help out.

Jim


_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com