[dba-SQLServer] VLDBs, the saga - Ramblings only

Arthur Fuller artful at rogers.com
Sun Aug 29 09:48:39 CDT 2004


>> If I can in fact do this at all on the class of machines available
>> to me, it should be fun. It has been a much bigger challenge than I
>> expected, I can tell you that.

We tried to caution you :) Big databases demand big hardware. There are
no shortcuts (e.g. compressed drives). For 65M rows I would want at
least 800GB of space and as much RAM as the box can handle. There are
thresholds involved here, though. Go past 8GB, for example, and you
have to reconfigure everything so SQL Server can exploit the RAM. I
don't have a URL handy on this, but it is a well-known issue. If you
want to do some reading on the subject, I'll look for a URL for you.
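
In the meantime, if it is AWE on SQL Server 2000 Enterprise that is at
issue (my assumption), the server-side half of that reconfiguration is
roughly:

    -- Sketch only: assumes SQL 2000 Enterprise Edition, /PAE set in
    -- boot.ini, and "Lock Pages in Memory" granted to the service account
    EXEC sp_configure 'show advanced options', 1
    RECONFIGURE
    EXEC sp_configure 'awe enabled', 1
    RECONFIGURE
    -- cap the buffer pool explicitly; AWE memory is not managed dynamically
    EXEC sp_configure 'max server memory', 10240   -- MB
    RECONFIGURE
    -- 'awe enabled' only takes effect after the SQL Server service restarts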

A.

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of John W.
Colby
Sent: Saturday, August 28, 2004 1:11 PM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] VLDBs, the saga - Ramblings only


Mark,

Good to hear from you again.  I have to assume that you only check your
list email every few weeks?

This is an OLAP database (more or less) from what I can tell, rather
than a transaction database. There is only a single table, completely
(and intentionally) denormalized for speed. Not that it has to remain
that way; I can do it any way I want in the end, but that is how the
data comes to me.

The whole idea is that there are 65 million names of Americans ONLY,
with about 600 fields that categorize them: what they like to eat,
drink, smoke, wear, drive and play with, which medical drugs they use,
etc. The data will be loaded once (MAYBE!!! if I can get that done),
then read hundreds of times a month, but only updated every few months
as new names come in (additions / updates to existing data) or new
survey results arrive. This is definitely NOT like a banking database
with millions of new records every month, nor even a call center or
order entry database. NO transactions ever. There is in fact some
maintenance involved, such as checking that addresses are valid,
honoring "do not mail" lists and so forth, but in general this is a
read-only database.

What I have been told is that, for example, 100 thousand records are
pulled to build a mailing, selected by age, income, and perhaps race or
geographic locality if appropriate to the marketing effort. If it is a
mailing destined for car buyers, the set is perhaps narrowed down to
those who like the brand being sold in this mass mailing, or even
(intentionally) similar competitors.
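
Purely as a sketch of what such a pull looks like (the real schema is
not settled, so every table and column name here is invented):

    -- Pull 100K prospects for a hypothetical car-buyer mailing
    SELECT TOP 100000 PersonID, FirstName, LastName, Addr1, City, State, Zip
    FROM tblDemographics
    WHERE Age BETWEEN 35 AND 50
      AND IncomeCode >= 6          -- some income bracket cutoff
      AND State IN ('NY', 'NJ', 'CT')
      AND BuysNewCars = 'Y'
      AND OnDoNotMailList = 'N'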

A result set comes back. Of those 100 thousand mailings, only 2%
respond, so 2,000 postcards come back asking for more info or whatever.
The idea now is to look at the demographics of those 2,000 and see what
they have in common, and, just as important, who DIDN'T respond. If the
mailing was sent to ages 25 to 50, but only 2% of those from 25 to 30
responded and 5% of those from 30 to 35, while 40% of those from 35 to
50 responded, then we know that the mailing "didn't work" for the 25-35
age group, and we can perhaps give up on them, or target them
differently. But it is safe(r) to mail a million pieces out to ages 35
to 50, because many more of them will respond, making the responses per
piece mailed more cost effective.
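
That response analysis is just an aggregate over the "filter" and
"response" tables I describe below. Another sketch with invented names
(tblMailingDetail holds one row per name selected for a mailing,
tblResponse one row per response):

    SELECT AgeBand,
           COUNT(*) AS Mailed,
           SUM(Responded) AS Responses,
           100.0 * SUM(Responded) / COUNT(*) AS PctResponse
    FROM (SELECT CASE WHEN d.Age BETWEEN 25 AND 29 THEN '25-29'
                      WHEN d.Age BETWEEN 30 AND 34 THEN '30-34'
                      ELSE '35-50'   -- mailed set was ages 25-50
                 END AS AgeBand,
                 CASE WHEN r.PersonID IS NULL THEN 0 ELSE 1 END AS Responded
          FROM tblMailingDetail md
          JOIN tblDemographics d ON d.PersonID = md.PersonID
          LEFT JOIN tblResponse r ON r.MailingID = md.MailingID
                                 AND r.PersonID = md.PersonID
          WHERE md.MailingID = 42) AS x
    GROUP BY AgeBand
    ORDER BY AgeBand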

As you can see, this is just a TON of analysis of the demographics, but
very little (in fact NO) updating of the original data. We will be
adding a "business" database that holds client info (who is buying the
address data sets), "filter" records of which records were selected for
a given mailing, and "response" records that tell us who responded. But
these will probably be a handful of tables holding a few to a few
hundred (just off the top of my head) records per mailing data set, to
capture this "who bought the data" / "what was mailed" / "who
responded" information.
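
Off the top of my head, that side schema could be as small as this
(all names and types are hypothetical):

    CREATE TABLE tblClient (            -- who is buying the data sets
        ClientID    int IDENTITY(1,1) PRIMARY KEY,
        ClientName  varchar(100) NOT NULL
    )

    CREATE TABLE tblMailing (           -- one row per mailing produced
        MailingID   int IDENTITY(1,1) PRIMARY KEY,
        ClientID    int NOT NULL REFERENCES tblClient (ClientID),
        MailedOn    datetime NOT NULL,
        Criteria    varchar(4000) NULL  -- the filter used to pull the set
    )

    CREATE TABLE tblMailingDetail (     -- "filter" records: who was selected
        MailingID   int NOT NULL REFERENCES tblMailing (MailingID),
        PersonID    int NOT NULL,       -- points into the big name table
        PRIMARY KEY (MailingID, PersonID)
    )

    CREATE TABLE tblResponse (          -- "response" records: who answered
        MailingID   int NOT NULL,
        PersonID    int NOT NULL,
        RespondedOn datetime NOT NULL,
        PRIMARY KEY (MailingID, PersonID)
    )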

If I can in fact do this at all on the class of machines available to
me, it should be fun. It has been a much bigger challenge than I
expected, I can tell you that.

John W. Colby
www.ColbyConsulting.com 

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Mark L.
Breen
Sent: Saturday, August 28, 2004 12:33 PM
To: dba-sqlserver at databaseadvisors.com
Subject: Re: [dba-SQLServer] VLDBs, the saga - Ramblings only


Hello John,

I have just downloaded my email for the list and there were 5,500 or so
emails in total, so I have just read through your series of problems
with SQL Server.

Firstly, I envy you, you lucky guy, having such a big db to work with!

Secondly, sorry to hear about the amount of hassle that you have had.

I have recently been working on a 2GB database. When it sits on the
server, my laptop runs fine, but with the db on my laptop and its 1/2GB
of RAM, the laptop grinds to a halt, so I would be in favour of running
the db off the development machine.

I know that you have already looked at some of the obvious things, such
as ensuring that there are no indexes on the tables during the load,
ensuring that the fields are no bigger than they need to be, using
integers where you can, etc.

I have a few other comments to make, although I do not think there are
any revelations in them. DTS is probably the best way to get the data
in, and SQL 2000 is better at truncating / shrinking than SQL 7.
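
For the shrinking side, assuming SQL 2000 and a database called BigDB
(name invented), it comes down to one of these:

    -- shrink the whole database, leaving 10% free space
    DBCC SHRINKDATABASE (BigDB, 10)

    -- or shrink just the data file to a target size (in MB)
    DBCC SHRINKFILE (BigDB_Data, 300000)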

Have you considered just importing a few hundred records from each file
and getting the thing up and running? I am presuming that once you get
it imported you want to start development. Wouldn't it be nicer to work
with just 1 million records for the initial development period?
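
For a partial load like that you would not even need DTS; BULK INSERT
can stop after N rows. A sketch, with the path, table name and
delimiters all assumed:

    -- load only the first 100,000 rows of one raw text file
    BULK INSERT tblDemographics
    FROM 'D:\Raw\file01.txt'
    WITH (FIELDTERMINATOR = '\t',
          ROWTERMINATOR = '\n',
          LASTROW = 100000,
          TABLOCK)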

Have you considered building a little VB app that would import 10k
records at a time and then do a database integrity check and index
rebuild (the work that the maintenance wizard does)? Then let the VB
app just work away for a week or so; meanwhile you could continue to
use the hardware that you have.
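
The maintenance step between batches boils down to something like this
(database and table names invented; whether a full index rebuild is
affordable that often on a table this size is another question):

    DBCC CHECKDB ('BigDB') WITH NO_INFOMSGS   -- integrity check
    DBCC DBREINDEX ('tblDemographics')        -- rebuild all indexes
    UPDATE STATISTICS tblDemographics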

For the duration of this project, would you consider forgetting about
redundancy and just using whatever disk space you have? I also have a
dislike (not based on fact, just emotion) for compression. I wonder how
it will affect the performance of SQL Server if you ever get the db
fully loaded.

Keep in touch and let us know how the project goes.

Best of luck,

Mark




----- Original Message ----- 
From: "John W. Colby" <jwcolby at colbyconsulting.com>
To: "'Access Developers discussion and problem solving'"
<accessd at databaseadvisors.com>; <dba-sqlserver at databaseadvisors.com>
Sent: Saturday, August 28, 2004 12:56 PM
Subject: [dba-SQLServer] VLDBs, the saga - Ramblings only


> I have been working on getting a rather large database into SQL
> Server. This thing is ~65 million names plus demographics info and
> will be used for bulk mailing analysis. I have been struggling for
> weeks to make this happen. Having no experience with a database this
> big, I had no idea how big the final db would be. I tried to get it
> into a 160g drive, but it rapidly became obvious that wouldn't hold
> it. I then purchased two 200g drives and used Raid0 to make a big
> 400g drive. I thought I had turned on compression, but after getting
> well into the extraction process I discovered this wasn't the case.
> I then started trying to figure out how to get the drive compressed.
>
> Long story short, an NTFS drive can be compressed, even a raid array
> such as this, however... When you format the drive there is a control
> that lets you select the allocation unit (cluster) size. I had
> selected the "compress" check box but then selected a 64K cluster
> size. As I started investigating why the drive wasn't compressed, it
> turns out that only cluster sizes of 512 bytes to 4K allow
> compression. Anything larger causes the "compress drive" check box to
> gray out, and the drive ends up uncompressed.
>
> By this time I had already spent several days extracting zipped files
> of data and BCPing them into SQL Server, so I had an MDF file of over
> 300gb and no place to put it!
>
> Sigh.
>
> Out of desperation I decided to try zipping the database file. I
> started it PKZipping last night onto an empty 160g partition. This
> morning I had a 10gb zipped file that supposedly contains the MDF
> file!
>
> I then deleted the partition on the 400gb Raid array and started
> playing with the compression / cluster size, which is when I
> discovered the >4K cluster size gotcha. I set the cluster size to 4K
> and quick formatted, then started unzipping the MDF file to the
> (compressed) 400gb raid array.
>
> We shall see. The unzip is not finished; in fact it has several hours
> to go yet. If this works I will celebrate.
>
> This whole project has been a challenge. It looks like the database
> will be around 600g for JUST the data, never mind any indexes. I
> simply don't have the money to build a Raid5 array to up the
> uncompressed drive size. Even if I did, IDE drives larger than 250gb
> are expensive and AFAICT only available at 5400 RPM. Plus the
> overhead of Raid5 is "one drive", which means I'd need (4) 300g
> drives to build a 900g usable space Raid5 array. Raid0 (which I am
> currently using) tops out at 600g using (2) 300g drives
> (uncompressed). So far my (2) drive Raid0 array using 200g drives has
> cost me $240, plus a controller I already had. A Raid5 solution using
> 300g drives would cost about $1200 just for the new controller and 4
> drives!
>
> With any luck, given the massive compression PKZip managed to attain,
> I will be able to shoehorn in the 600g.
>
> Update 8-(
>
> As I write this I just got a "delayed write failed" message from
> Windows, saying it lost data trying to write to the disk. I have
> tried to turn off write caching, but can't seem to find the magic
> button to make Windows quit using "delayed write".
>
> BIG sigh!
>
> If I can't get the db back out of the zip file, I will be facing a
> weekend of "starting from scratch": getting the data out of the raw
> text files and back into SQL Server!
>
> And I thought this would be fairly easy.
>
> John W. Colby
> www.ColbyConsulting.com
>
>



_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com



