Shamil Salakhetdinov
shamil at users.mns.ru
Mon Aug 30 08:48:44 CDT 2004
John,

I didn't know they were such simple T/F fields... Why is your database getting
so huge then? Do you have any explanations/calculations?

> Further it is not helpful to normalize them. This is SO huge that putting
> them back together again in queries joining a dozen or more fields / tables
> would be a slow nightmare.

This is incredibly quick in MS SQL... And I'm not talking about "brute force"
joining back...

Of course, all my first thoughts/advice probably don't make sense now that I
know your source data are quite different from what I had thought...

> Even normalizing 600 fields would be many many
> hours (days? Weeks?) of work.

I don't think so - but to avoid getting it wrong again I would need to see
some of your source data...

> But in fact I think it may just be better left denormalized.

Yes, very probably, if they are such simple T/F fields. (Although then I can't
see how they represent the habits and tastes of Americans, which I think vary
so much...)

Shamil

----- Original Message -----
From: "John W. Colby" <jwcolby at colbyconsulting.com>
To: <dba-sqlserver at databaseadvisors.com>
Sent: Monday, August 30, 2004 4:37 PM
Subject: RE: [dba-SQLServer] VLDBs, the saga - Ramblings only


> Shamil,
>
> In fact the numbers don't look like that at all, plus I really didn't know
> what the fields looked like. Looking at the data in 600 fields is
> non-trivial all by itself.
>
> However I can tell you it isn't a straight calc like that. Many and
> probably most of the fields are a simple true/false, which led me to expect
> a MUCH smaller db size. They just have a T or an F in them (the character).
>
> Further it is not helpful to normalize them. This is SO huge that putting
> them back together again in queries joining a dozen or more fields / tables
> would be a slow nightmare. Even normalizing 600 fields would be many many
> hours (days? Weeks?) of work. But in fact I think it may just be better
> left denormalized.
>
> John W. Colby
> www.ColbyConsulting.com
>
> -----Original Message-----
> From: dba-sqlserver-bounces at databaseadvisors.com
> [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Shamil
> Salakhetdinov
> Sent: Monday, August 30, 2004 1:42 AM
> To: dba-sqlserver at databaseadvisors.com
> Subject: Re: [dba-SQLServer] VLDBs, the saga - Ramblings only
>
>
> John,
>
> May I ask whether you made any calculations in advance to see what to
> expect from your data loading? (Almost nobody I know does - but I started
> doing that some time ago - it has happened to me that the "brute force"
> approach doesn't work well sometimes :) )
>
> Why not make it normalized as a first step of your data loading adventure?
>
> I mean:
>
> - you have 65 million records with 600 fields each, and let's assume that
> each field is 20 bytes (not Unicode). Then you get 65 million * 600 * 20 =
> 780GB (even if the average size of a record is less than that, it should
> for sure(?) be more than 4KB, and therefore you get ONE record loaded per
> ONE page - MS SQL records can't be longer than 8KB - so that's ~520GB
> without indexes...)
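To make the arithmetic in the paragraph above easy to replay, here is a minimal
C# sketch (an illustration added here, not part of the original message) that
reproduces the raw-size estimate and the one-record-per-page bound. The
65 million rows, 600 fields, 20-byte average field and 8KB SQL Server page come
from the message; the class and variable names are just illustrative.

// Back-of-envelope size estimates for the 65M-row, 600-field table,
// replaying the arithmetic from the quoted paragraph above.
using System;

class SizeEstimate
{
    static void Main()
    {
        const long rows = 65000000;        // 65 million records
        const long fields = 600;           // fields per record
        const long bytesPerField = 20;     // assumed average width, non-Unicode
        const long pageBytes = 8 * 1024;   // SQL Server data page is 8KB

        // Raw estimate: every field stored at the assumed average width.
        long rawBytes = rows * fields * bytesPerField;
        Console.WriteLine("Raw estimate: {0} GB", rawBytes / 1000000000);   // 780 GB

        // If a row is wider than half a page, only one row fits per page, so
        // the table cannot shrink below one 8KB page per row - roughly the
        // ~520GB figure in the message (which rounds the page to 8,000 bytes).
        long onePagePerRow = rows * pageBytes;
        Console.WriteLine("One page per row: {0} GB", onePagePerRow / 1000000000); // ~532 GB
    }
}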
> If as the first step you go through all your source data and get it
> normalized, you get something like:
>
> 65 million * 600 * 4 = 156GB - the latter looks manageable even with an
> ordinary modern IDE drive, even one connected through USB2 - that's a cheap
> and quick enough solution for starters. (I assume that some references from
> the normalized tables will be 4 bytes long, some two bytes long, others 1
> byte long, and some fields like First-, Middle- and Last- name will be left
> intact - so all that will probably average out to about 4 bytes per field.
> And four bytes are enough to reference even a 65,000,000-row data
> dictionary/reference table (65,000,000 = 03 DF D2 40 hexadecimal).)
>
> So as the first step - get all your source data unzipped onto one HDD
> (160GB)?
>
> Second step - analyze the unzipped data and find the best way to get it
> normalized. (Here a simple utility that reads the source data, computes a
> hash code of every field and counts the distinct hash codes for every field
> would not be a bad idea - such a utility should be a very quick and
> reliable way to get a good approximation of where you can get with your
> huge volume of source data, especially if you write it in C++/C#/VB.NET.
> I'm getting into this field now - so I can help here just for fun, to share
> this challenge of yours, but spare time is a problem here, so that help may
> not come as quickly as you may need it now...)
>
> Third step - make a good (semi-)normalized data model (don't forget a
> clustered primary key - Identity - you like them, I know :)), and calculate
> carefully what size it will reach when it is implemented in an MS SQL
> database...
>
> Fourth step - load the normalized data, maybe in several steps... ...
> N-th step - get all your data loaded and manageable. This is the point
> where you can denormalize it again if you need to (I don't think that will
> be needed/possible with your/your customer's resources, and the OLAP tasks
> should work well on the (semi-)normalized database mentioned above), maybe
> as a set of federated databases, linked databases, partitioned views...
>
> I'd also advise you to now read carefully the "SQL Server Architecture"
> chapter from BOL...
>
> Of course it's easy to give advice - and not that easy to get through the
> challenge you have...
>
> I'm not here that often these days, but I'm online on MS Messenger every
> working day (~8:00 - 22:00 (MT+3), Shamil at Work) - so you can reach me
> there if you need some of my help...
>
> HTH & I hope you'll get it up & running soon,
> Shamil
>
>
> _______________________________________________
> dba-SQLServer mailing list
> dba-SQLServer at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
> http://www.databaseadvisors.com
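As a rough illustration of the "second step" utility described in the quoted
message (read the source data, hash every field value, and count the distinct
hash codes per field to see how well each column would normalize), here is a
minimal C# sketch. It is only a sketch under assumptions the message does not
spell out: the comma-delimited input format, the file path and the per-column
HashSet buckets are illustrative choices, not the original author's design.

// Rough field-cardinality scanner along the lines described in the message:
// hash every field value and count the distinct hash codes seen per column.
using System;
using System.Collections.Generic;
using System.IO;

class FieldCardinalityScan
{
    static void Main(string[] args)
    {
        // Assumed input: a comma-delimited text export of the source data.
        string path = args.Length > 0 ? args[0] : "source_data.csv";
        char delimiter = ',';

        var distinctHashesPerColumn = new List<HashSet<int>>();

        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] values = line.Split(delimiter);

                // Grow the per-column buckets the first time a wider row appears.
                while (distinctHashesPerColumn.Count < values.Length)
                    distinctHashesPerColumn.Add(new HashSet<int>());

                // Keep only the hash code of each value, not the value itself,
                // so each column costs a few bytes per distinct value while
                // approximating its distinct count.
                for (int i = 0; i < values.Length; i++)
                    distinctHashesPerColumn[i].Add(values[i].GetHashCode());
            }
        }

        // Columns with very few distinct values (e.g. just "T"/"F") are the
        // cheapest to normalize or to store as 1-byte codes.
        for (int i = 0; i < distinctHashesPerColumn.Count; i++)
            Console.WriteLine("Field {0}: ~{1} distinct value(s)",
                              i + 1, distinctHashesPerColumn[i].Count);
    }
}

For genuinely high-cardinality columns (names, street addresses) the per-column
sets can still grow to tens of millions of entries, so sampling the file first,
or swapping the sets for a cardinality sketch, would be a sensible refinement
of the same idea.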