[dba-SQLServer] VLDBs, the saga - Ramblings only

Shamil Salakhetdinov shamil at users.mns.ru
Mon Aug 30 00:41:49 CDT 2004


John,

May I ask whether you made any calculations in advance to see what to
expect from your data loading? (Almost nobody I know does - but I started
doing that some time ago, after it happened to me a few times that the
"brute force" approach doesn't work well :) )

Why not normalize the data as the first step of your data loading
adventure?

I mean:

- you have 65 million records with 600 fields each; let's assume each
field is 20 bytes (not Unicode).
Then you get 65 million * 600 * 20 = 780GB. Even if the average record is
smaller than that, it should(?) still be well over 4KB, which means only
ONE record fits on ONE page - MS SQL records can't be longer than 8KB -
so 65 million pages * 8KB is ~520GB without indexes...

If, as the first step, you go through all your source data and normalize
it, you get something like:

65 million * 600 * 4 = 156GB - and that looks manageable even on an
ordinary modern IDE drive, even one connected through USB2 - a cheap and
quick enough solution for starters. (I assume that some references from
the normalized table will be 4 bytes long, some 2 bytes, others 1 byte,
and some fields like First-, Middle- and Last-Name will be left intact -
so on average that will probably come to about 4 bytes per field. And
four bytes are enough to reference even a 65,000,000-row data
dictionary/reference table: 65,000,000 = 03 DF D2 40 hexadecimal.)
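
Just to make the arithmetic above easy to re-check with your own
assumptions (the 20 bytes per raw field and 4 bytes per normalized field
are of course only my guesses), here it is as a tiny C++ snippet -
decimal GB, i.e. 10^9 bytes, as in the figures above:

    // Back-of-envelope size estimates for the 65-million-record load.
    // Assumptions only: 600 fields/record, ~20 bytes/field raw,
    // ~4 bytes/field after normalization, one >4KB row per 8KB page.
    #include <cstdio>

    int main()
    {
        const double GB      = 1e9;    // decimal gigabytes, as used above
        const double records = 65e6;
        const double fields  = 600;

        double raw        = records * fields * 20;  // ~780 GB raw character data
        double paged      = records * 8192;         // one row per 8KB page => ~520 GB on disk
        double normalized = records * fields * 4;   // ~4-byte references => ~156 GB

        std::printf("raw data        : %6.0f GB\n", raw / GB);
        std::printf("one row per page: %6.0f GB\n", paged / GB);
        std::printf("normalized      : %6.0f GB\n", normalized / GB);
        return 0;
    }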

So as the first step - get all your source data unzipped onto one HDD
(160GB)?

Second step - analyze the unzipped data and find the best way to get it
normalized. (A simple utility that reads the source data, computes a hash
code of every field and counts the number of distinct hash codes per
field would not be a bad idea - such a utility is a quick and reliable
way to get a good approximation of where you can go with your huge volume
of source data, especially if you write it in C++/C#/VB.NET; a rough
sketch follows below. I'm getting into this field now, so I could help
here just for fun, to share this challenge of yours - but spare time is a
problem, so that help may not come as quickly as you may need it now...
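
Something like this is what I have in mind - a rough, untested sketch,
and the details (one record per line, tab-delimited fields, file name on
the command line) are only my assumptions about your source files; it
uses the standard C++ containers, so substitute whatever hash set your
compiler ships with:

    // Counts, for every column of a delimited text file, how many distinct
    // 64-bit hash codes its values produce - a cheap approximation of the
    // column's cardinality, good enough to decide what is worth normalizing.
    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <unordered_set>
    #include <vector>

    int main(int argc, char* argv[])
    {
        if (argc < 2) { std::cerr << "usage: " << argv[0] << " <source-file>\n"; return 1; }

        std::ifstream in(argv[1]);
        if (!in) { std::cerr << "cannot open " << argv[1] << "\n"; return 1; }

        std::vector<std::unordered_set<std::uint64_t>> distinct;  // one set of hashes per column
        std::hash<std::string> hasher;
        std::string line;

        while (std::getline(in, line))
        {
            std::size_t col = 0, start = 0;
            while (start <= line.size())               // walk the tab-separated fields
            {
                std::size_t end = line.find('\t', start);
                if (end == std::string::npos) end = line.size();

                if (col >= distinct.size()) distinct.resize(col + 1);
                distinct[col].insert(hasher(line.substr(start, end - start)));

                ++col;
                start = end + 1;
            }
        }

        for (std::size_t c = 0; c < distinct.size(); ++c)
            std::cout << "column " << c << ": " << distinct[c].size() << " distinct values\n";

        return 0;
    }

With 65 million records and 600 columns the per-column sets can get big
for the high-cardinality columns, so you may want to run it over a range
of columns at a time, or over a sample of the records first.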

Third step - make a good (semi-)normalized data model (don't forget a
clustered primary key - Identity - I know you like them :) ), and
calculate well what size it will get when it is implemented in an MS SQL
database...
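
For that "calculate well" part you can feed the per-column distinct
counts from the utility above into a little estimator - pick a 1-, 2- or
4-byte reference per column depending on how many distinct values it has
(leaving free-text columns such as the names at their raw width) and
multiply the resulting row width by 65 million. A rough sketch with
made-up example counts:

    // Rough estimator of the normalized main table's size.
    // Assumptions only: fixed-width reference columns, no row/page overhead.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Reference width needed to address a lookup table with that many rows.
    static int refWidth(std::uint64_t distinctValues)
    {
        if (distinctValues <= 256)    return 1;   // tinyint
        if (distinctValues <= 65536)  return 2;   // smallint
        return 4;                                 // int - enough even for 65,000,000
    }

    int main()
    {
        const std::uint64_t rows = 65000000;

        // Example only - in reality these come from the hash-counting
        // utility's output, one distinct count per source field.
        std::vector<std::uint64_t> distinctPerColumn = { 2, 51, 43000, 12000000 };

        std::uint64_t rowBytes = 0;
        for (std::uint64_t d : distinctPerColumn)
            rowBytes += refWidth(d);

        std::cout << "bytes per row: " << rowBytes << "\n";
        std::cout << "main table   : " << rows * rowBytes / 1e9 << " GB\n";
        return 0;
    }

(The lookup/dictionary tables themselves come on top of that, but for
columns with few distinct values they are tiny compared to the main
table.)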

Fourth step - load normalized data, maybe in several steps....
....
N-th step - get all your data loaded and manageable. This is the point
where you can denormalize it again if you need to (I don't think that
will be needed - or possible with your/your customer's resources - and
the OLAP tasks should work well on the (semi-)normalized database
mentioned above), maybe as a set of federated databases, linked
databases, partitioned views...

I'd also advise you to now read carefully the "SQL Server Architecture"
chapter in BOL...

Of course it's easy to give advice - and it's not that easy to get
through the challenge you have...

I'm not here that often these days, but I'm online on MS Messenger every
working day (~8:00 - 22:00 (MT+3), Shamil at Work) - so you can reach me
there if you need some of my help...

HTH &  I hope you'll get it up&running soon,
Shamil

----- Original Message ----- 
From: "John W. Colby" <jwcolby at colbyconsulting.com>
To: "'Access Developers discussion and problem solving'"
<accessd at databaseadvisors.com>; <dba-sqlserver at databaseadvisors.com>
Sent: Saturday, August 28, 2004 3:56 PM
Subject: [dba-SQLServer] VLDBs, the saga - Ramblings only


> I have been working on getting a rather large database into SQL Server.
> This thing is ~65 million names plus demographics info and will be used
> for bulk mailing analysis.  I have been struggling for weeks to make this
> happen.  Having no experience with a database this big, I had no idea how
> big the final db would be.  I tried to get it in to a 160g drive but it
> rapidly became obvious that wouldn't hold it.  I then purchased two 200g
> drives and used Raid0 to make a big 400 g drive.  I thought I turned on
> compression but after getting well into the extraction process I
> discovered this wasn't the case.  I then started trying to figure out how
> to get the drive compressed.
>
> Long story short, a NTFS drive can be compressed, even a raid array such
> as this, however...  There is a control that allows you to select the
> sector size.  I had selected the "compress" check box but then selected a
> sector size of 64K.  As I started investigating why the drive wasn't
> compressed it turns out that only sector sizes of 512 to 4K bytes allow
> compression.  Anything larger causes the "compress drive" check box to
> gray out and the drive ends up uncompressed.
>
> By this time I had already spent several days extracting zipped files of
> data and BCPing them into SQL Server so I had a MDF file of over 300gb
> and no place to put it!
>
> Sigh.
>
> Out of desperation I decided to try zipping the database file.  I started
> it PK zipping last night onto an empty 160g partition.  This morning I
> had a 10gb zipped file that supposedly contains the MDF file!
>
> I then deleted the partition on the 400gb Raid array and started playing
> with the compression / block size which is when I discovered the >4K
> sector size gotcha.  I set the sector size to 4K and quick formatted,
> then started unzipping the MDF file to the (compressed) 400gb raid array.
>
> We shall see.  The unzip is not finished, in fact has several hours to go
> yet.  If this works I will celebrate.
>
> This whole project has been a challenge.  It looks like the database will
> be around 600g for JUST the data, never mind any indexes.  I simply don't
> have the money to build a raid 5 array to up the uncompressed drive size.
> Even if I did, IDE drives larger than 250gb are expensive and AFAICT only
> available in 5200 RPM.  Plus the overhead of Raid5 is "One Drive" which
> means I'd need (4) 300g drives to build a 900g usable space raid5 array.
> Raid0 (which I am currently using) tops out at 600g using (2) 300g drives
> (uncompressed).  So far my (2) drive Raid0 array using 200g drives has
> cost me $240 plus a controller I already had.  A Raid5 solution using
> 300g drives would cost about $1200 just for the new controller and 4
> drives!
>
> With any luck, given the massive compression PKZip managed to attain, I
> will be able to shoehorn the 600g.
>
> Update 8-(
>
> As I write this I just got a "delayed write failed" message from Windows
> saying it lost data trying to write to the disk.  I have tried to turn
> off write caching but can't seem to find the magic button to cause
> Windows to quit using "Delayed write".
>
> BIG sigh!
>
> If I can't get the db back out of the zip file I will be facing a weekend
> of "starting from scratch" on getting the data out of the raw text files
> and back in to SQL Server!
>
> And I thought this would be fairly easy.
>
> John W. Colby
> www.ColbyConsulting.com
>
>
> _______________________________________________
> dba-SQLServer mailing list
> dba-SQLServer at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
> http://www.databaseadvisors.com
>



