[dba-SQLServer] VLDBs, the saga - Ramblings only

John W. Colby jwcolby at colbyconsulting.com
Mon Aug 30 07:37:38 CDT 2004


Shamil,

In fact the numbers don't look like that at all, plus I really didn't know
what the fields looked like.  Looking at the data in 600 fields is
non-trivial all by itself.

However I can tell you it isn't a straight calc like that.  Many, and
probably most, of the fields are a simple true/false, which led me to expect
a MUCH smaller db size.  They just have a T or an F in them (the character).

Further, it is not helpful to normalize them.  This thing is SO huge that
putting it back together again in queries joining a dozen or more fields /
tables would be a slow nightmare.  Even normalizing 600 fields would be many,
many hours (days?  weeks?) of work.  In fact I think it may just be better
left denormalized.

John W. Colby
www.ColbyConsulting.com 

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Shamil
Salakhetdinov
Sent: Monday, August 30, 2004 1:42 AM
To: dba-sqlserver at databaseadvisors.com
Subject: Re: [dba-SQLServer] VLDBs, the saga - Ramblings only


John,

May I ask: did you make any calculations in advance to see what to expect
from your data loading? (Almost nobody I know does - but I started doing
that some time ago, after the "brute force" approach failed me a few
times :) )

Why not normalize it as the first step of your data-loading adventure?

I mean:

- you have 65 million records with 600 fields each; let's assume that each
field is 20 bytes (not Unicode).  Then you get 65 million * 600 * 20 = 780GB.
(Even if the average record is smaller than that, it is for sure(?) more
than 4KB, so you get only ONE record loaded per page - MS SQL records can't
be longer than 8KB - which still means ~520GB without indexes...)

If, as the first step, you go through all your source data and normalize it,
you get something like:

65 million * 600 * 4 = 156GB - the latter looks manageable even on an
ordinary modern IDE drive, even one connected through USB2 - a cheap and
quick enough solution for starters.  (I assume that some references into the
normalized lookup tables will be 4 bytes, some 2 bytes, some 1 byte, and
some fields like first, middle, and last name will be left intact - so
roughly 4 bytes per field on average.  And four bytes are enough to
reference even 65,000,000 data dictionary/lookup rows (65,000,000 = 03
DF D2 40 hexadecimal).)
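The arithmetic above can be checked with a few lines of code (a sketch only; the 65 million rows, 600 fields, and per-field byte sizes are the assumptions from this thread, not measured figures):

```python
# Rough size estimates for the 65M-row, 600-field table (figures from this thread).
ROWS = 65_000_000
FIELDS = 600

raw_gb = ROWS * FIELDS * 20 / 10**9   # 20 bytes/field, denormalized
page_gb = ROWS * 8192 / 10**9         # one >4KB row per 8KB SQL Server page
norm_gb = ROWS * FIELDS * 4 / 10**9   # ~4 bytes/field after normalization

print(f"denormalized, 20 B/field: {raw_gb:.0f} GB")   # 780 GB
print(f"one row per 8 KB page:    {page_gb:.0f} GB")  # ~532 GB
print(f"normalized, 4 B/field:    {norm_gb:.0f} GB")  # 156 GB

# Four bytes are indeed enough to reference 65,000,000 lookup rows:
print(hex(65_000_000))  # 0x3dfd240
```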

So as the first step - get all your source data unzipped onto one HDD (160GB)?

Second step - analyze the unzipped data and find the best way to get it
normalized.  (A simple utility that reads the source data, computes a hash
code of every field value, and counts the distinct hash codes per field
would not be a bad idea - such a utility should be a quick and reliable way
to get a good approximation of where normalization can take you with your
huge volume of source data, especially if you write it in C++/C#/VB.NET.
I'm getting into this field now, so I could help here just for fun and share
this challenge of yours, but spare time is a problem, so my help may not be
as quick as you need it now...)

Third step - build a good (semi-)normalized data model (don't forget a
clustered primary key - Identity - you like them, I know :) ), and calculate
carefully what size it will be when implemented in an MS SQL database...

Fourth step - load the normalized data, maybe in several passes... ...  N-th
step - get all your data loaded and manageable.  This is the point where you
can denormalize it again if you need to (I don't think that will be
needed/possible with your/your customer's resources, and the OLAP tasks
should work well on the (semi-)normalized database mentioned above), maybe
as a set of federated databases, linked databases, partitioned views...

I'd also advise you to now read carefully the "SQL Server Architecture"
chapter from BOL...

Of course it's easy to advise - and not that easy to go through the
challenge you have...

I'm not here that often these days, but I'm online every working day on MS
Messenger (~ 8:00 - 22:00 (MT+3), Shamil at Work) - so you can reach me
there if you need some of my help...

HTH &  I hope you'll get it up&running soon,
Shamil




