[dba-SQLServer] VLDBs, the saga - Ramblings only

Charlotte Foust cfoust at infostatsystems.com
Mon Aug 30 10:44:16 CDT 2004


From what John has described, he has, in essence, a data warehouse.  It
could certainly be broken up into one or more fact tables with dimension
tables to make the slicing and dicing easier, but you still wouldn't
wind up with "normal" normalization, just 1NF tables.  I've worked with
this kind of data on a smaller scale (I used to work for a company that
would have been a customer of that kind of list).  I suspect the
demographics they're using as selection criteria should probably be put
into dimension tables, which will actually make the database bigger but
will make the queries MUCH faster and easier.
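
For instance (just a sketch with made-up names, assuming one of those
demographics is something like an income band), the layout might look
roughly like this:

    -- dimension table: each distinct value stored once
    CREATE TABLE dimIncomeBand (
        IncomeBandID tinyint NOT NULL PRIMARY KEY,
        IncomeBand   varchar(50) NOT NULL
    )

    -- fact table: one row per list entry, carrying just the keys
    CREATE TABLE factListEntry (
        ListEntryID  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        IncomeBandID tinyint NOT NULL
            REFERENCES dimIncomeBand (IncomeBandID)
        -- ...plus the remaining flags and demographic keys
    )

The dimension rows and the extra keys are where the added size comes
from, but a query slicing on income band only has to touch a tiny
lookup table.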

Charlotte Foust


-----Original Message-----
From: Shamil Salakhetdinov [mailto:shamil at users.mns.ru] 
Sent: Monday, August 30, 2004 6:49 AM
To: dba-sqlserver at databaseadvisors.com
Subject: Re: [dba-SQLServer] VLDBs, the saga - Ramblings only


John,

I didn't know they were such simple T/F fields...

Why is your database getting so huge, then? Do you have any
explanations/calculations?

> Further it is not helpful to normalize them.  This is SO huge that
> putting them back together again in queries joining a dozen or more
> fields / tables would be a slow nightmare.
Joins like that are incredibly quick in MS SQL....
And I'm not talking about "brute force" joining back, either...
Of course, all of my first thoughts/advice probably don't make much
sense now that I know your source data are quite different from what I
had imagined...
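
For example, a query that "joins back" a handful of tiny lookup tables
(invented names, just to illustrate) is exactly the kind of thing the MS
SQL optimizer handles very well:

    SELECT p.PersonID, p.LastName, s.StateName, i.IncomeBand, h.HobbyName
    FROM tblPerson p
        JOIN tlkpState  s ON s.StateID  = p.StateID
        JOIN tlkpIncome i ON i.IncomeID = p.IncomeID
        JOIN tlkpHobby  h ON h.HobbyID  = p.HobbyID
    WHERE h.HobbyName = 'Golf'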

> Even normalizing 600 fields would be many many
> hours (days?  Weeks?) of work.
I don't think so - but to avoid getting it wrong again I would need to
see some of your source data...

> But in fact I think it may just be better left denormalized.
Yes, very probably, if they really are such simple T/F fields. (Although
then I can't see how they manage to represent what I'd think are the
quite different habits and tastes of Americans...)
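
By the way, if they really are single 'T'/'F' characters, each flag
costs a full byte as char(1) - about 600 bytes per row just for the
flags. Declared as bit, MS SQL packs up to 8 flags into one byte, so the
same 600 flags would take roughly 75 bytes. Just a sketch with invented
names, since I haven't seen your layout:

    CREATE TABLE tblListFlags (
        PersonID int NOT NULL PRIMARY KEY,
        HasGolf  bit NOT NULL,   -- bit columns are packed 8 per byte
        HasBoat  bit NOT NULL
        -- ...and so on for the remaining flags
    )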

Shamil


----- Original Message ----- 
From: "John W. Colby" <jwcolby at colbyconsulting.com>
To: <dba-sqlserver at databaseadvisors.com>
Sent: Monday, August 30, 2004 4:37 PM
Subject: RE: [dba-SQLServer] VLDBs, the saga - Ramblings only


> Shamil,
>
> In fact the numbers don't look like that at all, plus I really didn't
> know what the fields looked like.  Looking at the data in 600 fields is
> non-trivial all by itself.
>
> However I can tell you it isn't a straight calc like that.  Many and
> probably most of the fields are a simple true false which led me to
> expect a MUCH smaller db size.  They just have a T or a F in them (the
> character).
>
> Further it is not helpful to normalize them.  This is SO huge that
> putting them back together again in queries joining a dozen or more
> fields / tables would be a slow nightmare.  Even normalizing 600 fields
> would be many many hours (days?  Weeks?) of work.  But in fact I think
> it may just be better left denormalized.
>
> John W. Colby
> www.ColbyConsulting.com
>
> -----Original Message-----
> From: dba-sqlserver-bounces at databaseadvisors.com
> [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of
> Shamil Salakhetdinov
> Sent: Monday, August 30, 2004 1:42 AM
> To: dba-sqlserver at databaseadvisors.com
> Subject: Re: [dba-SQLServer] VLDBs, the saga - Ramblings only
>
>
> John,
>
> May I ask: did you make any calculations in advance to see what to
> expect from your data loading? (Almost nobody I know does - but I
> started doing so a while ago, after it happened to me that the "brute
> force" approach sometimes doesn't work well :) )
>
> Why not normalize the data as a first step of your data loading
> adventure?
>
> I mean:
>
> - you have 65 million records with 600 fields each, and let's assume
> each field averages 20 bytes (not Unicode). Then you get 65 million *
> 600 * 20 = 780GB. (Even if the average record is smaller than that, it
> will surely(?) still be more than 4KB, so only ONE record fits on ONE
> page - MS SQL rows can't be longer than 8KB - which gives ~520GB before
> indexes...)
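>
> The arithmetic, if you want to redo it yourself (the 20-byte average is
> only my assumption, of course):
>
>     -- 65 million rows * 600 fields * 20 bytes, expressed in GB
>     SELECT CAST(65000000 AS bigint) * 600 * 20 / 1000000000.0  -- ~780
>     -- one row per 8KB page once a row is wider than 4KB
>     SELECT CAST(65000000 AS bigint) * 8 / 1000000.0            -- ~520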
>
> If as the first step you go through all your source data and get them 
> normalized you get something like:
>
> 65 million * 600 * 4 = 156GB - and that looks manageable even on an
> ordinary modern IDE drive, even one connected through USB2 - a cheap
> and quick enough solution for starters. (I assume some references from
> the normalized table will be 4 bytes long, some 2 bytes, others 1 byte,
> and some fields like First-, Middle- and Last- name will be left intact
> - so on average that probably gives a 4-byte field. And four bytes are
> enough to reference even a 65,000,000-row data dictionary/reference
> table: 65,000,000 = 03 DF D2 40 in hexadecimal.)
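>
> As a sketch of what I mean by such references (the names are invented),
> picking the narrowest key type that fits each field:
>
>     CREATE TABLE tblPerson (
>         PersonID  int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
>         FirstName varchar(30) NULL,   -- left intact
>         LastName  varchar(30) NULL,   -- left intact
>         GenderID  tinyint  NOT NULL,  -- 1-byte reference, <= 255 values
>         StateID   smallint NOT NULL,  -- 2-byte reference, <= 32767 values
>         CityID    int      NOT NULL   -- 4-byte reference, 65 million+ values
>     )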
>
> So as a first step - get all your source data unzipped onto one HDD
> (160GB)?
>
> Second step - analyze the unzipped data and find the best way to get it
> normalized. (A simple utility that reads the source data, computes a
> hash code for every field and counts how many distinct hash codes each
> field has would not be a bad idea here - such a utility would be a very
> quick and reliable way to get a good approximation of where you can go
> with your huge volume of source data, especially if you write it in
> C++/C#/VB.NET. I'm getting into this field now, so I could help here
> just for fun and share this challenge of yours, but spare time is a
> problem, so that help may not come as quickly as you need it now...)
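>
> A rough T-SQL equivalent of the same idea, if you bulk-load the raw
> text into a wide staging table first (invented column names): count the
> distinct values per field to see which fields deserve a reference
> table:
>
>     SELECT COUNT(DISTINCT Field001) AS Field001_values,
>            COUNT(DISTINCT Field002) AS Field002_values,
>            COUNT(DISTINCT Field003) AS Field003_values
>     FROM stgSourceData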
>
> Third step - design a good (semi-)normalized data model (don't forget a
> clustered primary key - Identity - you like them, I know :) ), and
> calculate carefully what size it will reach when implemented in an MS
> SQL database...
>
> Fourth step - load the normalized data, maybe in several passes...
> ... N-th step - get all your data loaded and manageable. This is the
> point where you can denormalize it again if you need to (I don't think
> that will be needed - or even possible with your/your customer's
> resources - and the OLAP tasks should work well on the
> (semi-)normalized database mentioned above), maybe as a set of
> federated databases, linked databases, partitioned views...
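>
> A sketch of what a local partitioned view could look like here, with
> invented table names and an arbitrary split on PersonID ranges:
>
>     -- member tables hold disjoint ranges, enforced by CHECK constraints
>     CREATE TABLE tblPerson_1 (
>         PersonID int NOT NULL PRIMARY KEY
>             CHECK (PersonID BETWEEN 1 AND 20000000),
>         LastName varchar(30) NULL
>     )
>     CREATE TABLE tblPerson_2 (
>         PersonID int NOT NULL PRIMARY KEY
>             CHECK (PersonID BETWEEN 20000001 AND 40000000),
>         LastName varchar(30) NULL
>     )
>     GO
>     -- the view presents all the member tables as one logical table
>     CREATE VIEW vwPerson AS
>         SELECT PersonID, LastName FROM tblPerson_1
>         UNION ALL
>         SELECT PersonID, LastName FROM tblPerson_2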
>
> I'd also advise you now to read the "SQL Server Architecture" chapter
> of BOL carefully...
>
> Of course it's easy to give advice - and not so easy to go through the
> challenge you have...
>
> I'm not here that often these days, but I'm online every working day on
> MS Messenger (~ 8:00 - 22:00 (MT+3), Shamil at Work) - so you can reach
> me there if you need some of my help...
>
> HTH &  I hope you'll get it up&running soon,
> Shamil
>
>

_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com



