[dba-SQLServer] Going over to the dark side

Arthur Fuller artful at rogers.com
Fri Aug 13 12:56:07 CDT 2004


There's a wonderful store nearby my home whose manager and employees
have impressed me like few if any elsewhere. They specialize in very
powerful equipment at excellent prices. No cheap crap anywhere in the
store. Anyway, they have a LaCie 400GB external hard drive with USB
interface for $449 CDN and a FireWire interface for $479 CDN. (For quick
conversion to USD, knock about 1/4 to 1/3 off the dollar figure.) LaCie
makes similar drives going all the way to 1TB. See www.lacie.com for
some awesome equipment. (I have no connection to this firm.)

You could always send the client to me and I'll run out and grab one of
these drives, and burn the clients' CDs til the cows come home :)

As to the file format on the DVDs, is the compression standard such as
ZIP or RAR? You could unzip one with space available, perhaps. To read
the first line, just go to a DOS window and type:

Type filename.ext

Then hit Ctrl-C quickly. You should be able to read most or all of the
first line that way. Experiment with how quickly you have to hit Ctrl-C.
I would want to see a row of data too.

Assuming enough space available, you could simply try to import the data
into Access or Excel. Either method would give you a peek at the first
row and some sample data too. You could then abort the import if you
don't have sufficient space available.

Some archive programs offer the ability to create N files of a specified
size, so that's one possible approach. But frankly I'd just go with the
Lacie drive.

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of John W.
Colby
Sent: Thursday, August 12, 2004 11:11 PM
To: dba-sqlserver at databaseadvisors.com
Subject: [dba-SQLServer] Going over to the dark side


I have been contacted by what appears to be a startup marketing firm,
who want to take a 64 million name database and pull it into SQL Server.
They are getting the data in comma delimited format, apparently
compressed, on two DVDs - totaling something like 60 gbytes of raw text
data.  They have never seen the data (just purchased it) but they think
it is about 400-500 fields of some personal info but mostly demographic
stuff.  Things like "owns a boat, owns a motor home, has credit cards,
ethnic, sex, income" etc. Their intention is to pull subsets of the data
and sell it to other companies.  

This is all very vague since all I know so far is what I have gleaned
from one of them in a handful of phone conversations.  They haven't seen
the data, don't know how to get it our of the DVDs etc.

I have no experience with trying to get something like that into SQL
Server. I have a couple of questions.  

First, if the data is comma delimited, my assumption is that the first
line will be field names, followed by data.  Is SQL Server (or some
other big db) capable of exporting directly into a cab file or zip file?
If this is two DVDs both of which are compressed comma delimited files,
how do I uncompress the data before importing it into SQL Server?  They
think it is 60gb.  I have room for a 60gb database but not the
uncompressed data as well as the database.  I can of course just go buy
a huge hard disk (200 gb) but I just spent a lot of time getting a
mirror up and especially for something of this nature I would want to
get it on a mirrored drive.  Plus they want to start looking at the data
as soon as possible.

Second, is this a job for bcp?  How do I get it in?  Writing a VB
function to parse it doesn't seem reasonable.  

Third, how long is it going to take to get that much data into a SQL
Server table.  It apparently is a single flat file which should
translate to a single table of 400-500 fields.  I assume that something
like bcp would handle building the table given the field name in the
first line, comma delimited?  If not, how do I look at that first line
so that I can go build the table "manually" in order to do the import?  

Fourth, what tool would I use to query that?  Access has query limits
that appear to eliminate it as a tool for this, never mind the speed
issues.  On the other hand if the actual number of fields exported out
are small (the data exported doesn't have to contain the demographics
data, just the personal data) then perhaps an ADP would allow a form
with controls to select the demographics, then SQL Server directly dumps
the data.

Fifth, it seems logical that I would want to index the demographics
fields (several hundred fields) so that queries could efficiently pull
subsets of the data.  How "big" is the database going to get with
indexes on 350 (or
more) fields?  IOW, the raw data is 65gb - turned into a properly
indexed table, what does that translate into in terms of SQL Database
file size?

Sixth, given 350 demographic fields, wouldn't I need to pull subsets to
maintain lists of all the valid values for a given field.  IOW, if there
is a field for ethnicity, with 15 choices, it seems I would need to run
a "distinct" query against that field and build a table of all the
possible values rather than run a "distinct" on the fly to populate the
combo that allows choices.  Now multiply that by 350 fields.  That looks
like a lot of preprocessing just to get ready to start hashing the data.

Seventh, How much processor power / memory is needed to handle a
database of this nature?  Is this something that I could reasonably
expect to buy / build?

These guys are examining their options, but three options they mentioned
are:

1) Just hire me (or someone) to "take an order and produce a file, on a
CD or tape".  IOW, I own the server, I hold the db, and I take an order
and fed ex out a CD.  

2) Buy a server, stick it in their office and hire me to set it up and
maintain it.

3) Hire a company out there on the internet somewhere to put the data on
a server at a server farm.  Then query that, somehow get the data onto a
cd or a tape.  Much bigger question there since I would no longer just
figure it out and do it all myself.  

Assuming a reasonable fee per data pull my preferences would be ordered
1, 2, and 3.  

Option 1 sets up a constant income stream but with the issue of having
to be available to build the CDs. Option 3 is just too much setup, at
least initially.

Is anyone out there doing something of this nature?  Any words of
wisdom?

John W. Colby
www.ColbyConsulting.com 


_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com




More information about the dba-SQLServer mailing list