[dba-SQLServer] Going over to the dark side

Thu Aug 12 22:56:51 CDT 2004

John,

I cannot address all of your points, but I am working with compressed CSV
files daily.  They are about 100MB zipped and come out to around 20 million
rows with 8 fields /row.  Without knowing how the data is being given to
you, but assuming it is set up the same way as mine is (bcp'd into the
database on the fly and then zipped and made available to download) there
are some considerations you need to look at regarding the data.

All of my raw data is sent as a varchar 8000 so that the DateTime field and
the 5 character Name field is the same size.  That can be greatly reduced by
judicious use of appropriate fields in SQL Server.  But it causes some
interesting issues trying to get the raw data in there in the first place -
there are often errors in the length of the fields and extra spaces that
need to be dealt with.  All of this can be done in the initial import using
DTS and  some careful monitoring.  In one set of data I have to make 2
passes - one to get the fields into the database and a second DTS after I
filter out the crap that would cause import issues. I know the DateTime
field should be a DateTime Field, but I have to pull it in as a varchar to
start because it is not always date and time information in that row! Again,
this should only be a one time shot for you.

I would suggest looking at the DTS functions - you can import a CSV file and
create the table(s) on the fly, including or excluding the columns that you
want based on the table you are creating, so that would not be a major
issue.  Once the tables are created, you can go back and index them as
necessary.

If you were to expect to have any real speed out of this you will need a
heavy duty server.  I run the data through a couple of iterations and the
most intensive one is a correlation process where I have to compare every
number in a  column to every other column's data.  With around 800 columns
holding 390 rows each, a dual processor (1.8 GHz each) 3GB system takes
about 5 hours to crunch the data - and that is a dedicated SQL Server box.
What you are looking at will not be as numerically intense as the
calculations necessary for correlations, but the sheer size of the tables,
queries and sub-queries to get what you want will be as intense in terms of
processor and memory, if not more.

My suggestion is that until they can show you the data, so that you can see
what the 350 demographic fields are and how they could be related (for
example if there is an "owns a boat" how many other boat-related demographic
points are there, and can that be queried differently), you don't want to
make any commitments to doing anything.

A VB or Access form front end to a SQL backend would be possible, but trying
to make a query that will show every 30 - 45 year old red-headed left-handed
boat owner that does not smoke and has a MasterCard with less than 10% of
their limit left on it, who is married to a blonde woman without enhancement
surgery and still has all of her original teeth and likes the double-wide
they live in on the shores of Lake Okeechobee when they are not vacationing
at Dollywood or Branson might be a difficult proposition. Especially when
you have to include their kids in the mix.

HTH,

Mark

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of John W.
Colby
Sent: Thursday, August 12, 2004 10:11 PM
To: dba-sqlserver at databaseadvisors.com
Subject: [dba-SQLServer] Going over to the dark side

I have been contacted by what appears to be a startup marketing firm, who
want to take a 64 million name database and pull it into SQL Server.  They
are getting the data in comma delimited format, apparently compressed, on
two DVDs - totaling something like 60 gbytes of raw text data.  They have
never seen the data (just purchased it) but they think it is about 400-500
fields of some personal info but mostly demographic stuff.  Things like
"owns a boat, owns a motor home, has credit cards, ethnic, sex, income" etc.
Their intention is to pull subsets of the data and sell it to other
companies.

This is all very vague since all I know so far is what I have gleaned from
one of them in a handful of phone conversations.  They haven't seen the
data, don't know how to get it our of the DVDs etc.

I have no experience with trying to get something like that into SQL Server.
I have a couple of questions.

First, if the data is comma delimited, my assumption is that the first line
will be field names, followed by data.  Is SQL Server (or some other big db)
capable of exporting directly into a cab file or zip file?  If this is two
DVDs both of which are compressed comma delimited files, how do I uncompress
the data before importing it into SQL Server?  They think it is 60gb.  I
have room for a 60gb database but not the uncompressed data as well as the
database.  I can of course just go buy a huge hard disk (200 gb) but I just
spent a lot of time getting a mirror up and especially for something of this
nature I would want to get it on a mirrored drive.  Plus they want to start
looking at the data as soon as possible.

Second, is this a job for bcp?  How do I get it in?  Writing a VB function
to parse it doesn't seem reasonable.

Third, how long is it going to take to get that much data into a SQL Server
table.  It apparently is a single flat file which should translate to a
single table of 400-500 fields.  I assume that something like bcp would
handle building the table given the field name in the first line, comma
delimited?  If not, how do I look at that first line so that I can go build
the table "manually" in order to do the import?

Fourth, what tool would I use to query that?  Access has query limits that
appear to eliminate it as a tool for this, never mind the speed issues.  On
the other hand if the actual number of fields exported out are small (the
data exported doesn't have to contain the demographics data, just the
personal data) then perhaps an ADP would allow a form with controls to
select the demographics, then SQL Server directly dumps the data.

Fifth, it seems logical that I would want to index the demographics fields
(several hundred fields) so that queries could efficiently pull subsets of
the data.  How "big" is the database going to get with indexes on 350 (or
more) fields?  IOW, the raw data is 65gb - turned into a properly indexed
table, what does that translate into in terms of SQL Database file size?

Sixth, given 350 demographic fields, wouldn't I need to pull subsets to
maintain lists of all the valid values for a given field.  IOW, if there is
a field for ethnicity, with 15 choices, it seems I would need to run a
"distinct" query against that field and build a table of all the possible
values rather than run a "distinct" on the fly to populate the combo that
allows choices.  Now multiply that by 350 fields.  That looks like a lot of
preprocessing just to get ready to start hashing the data.

Seventh, How much processor power / memory is needed to handle a database of
this nature?  Is this something that I could reasonably expect to buy /
build?

These guys are examining their options, but three options they mentioned
are:

1) Just hire me (or someone) to "take an order and produce a file, on a CD
or tape".  IOW, I own the server, I hold the db, and I take an order and fed
ex out a CD.

2) Buy a server, stick it in their office and hire me to set it up and
maintain it.

3) Hire a company out there on the internet somewhere to put the data on a
server at a server farm.  Then query that, somehow get the data onto a cd or
a tape.  Much bigger question there since I would no longer just figure it
out and do it all myself.

Assuming a reasonable fee per data pull my preferences would be ordered 1,
2, and 3.

Option 1 sets up a constant income stream but with the issue of having to be
available to build the CDs.
Option 3 is just too much setup, at least initially.

Is anyone out there doing something of this nature?  Any words of wisdom?

John W. Colby
www.ColbyConsulting.com

_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com