Shamil Salakhetdinov
shamil at users.mns.ru
Mon Aug 30 08:48:44 CDT 2004
John,

I didn't know they were such simple T/F fields... Why is your database getting
so huge then? Do you have any explanations/calculations?

> Further it is not helpful to normalize them. This is SO huge that putting
> them back together again in queries joining a dozen or more fields / tables
> would be a slow nightmare.

This is incredibly quick in MS SQL... And I'm not talking about "brute force"
joining back...

Of course, all my first thoughts/advice probably don't make sense now that I
know your source data are quite different from what I had thought...

> Even normalizing 600 fields would be many many
> hours (days? Weeks?) of work.

I don't think so - but to avoid getting it wrong again I would need to see
some of your source data...

> But in fact I think it may just be better left denormalized.

Yes, very probably, if they are such simple T/F fields. (Although then I can't
see how they represent the habits and tastes of Americans, which I think vary
so much...)

Shamil

----- Original Message -----
From: "John W. Colby" <jwcolby at colbyconsulting.com>
To: <dba-sqlserver at databaseadvisors.com>
Sent: Monday, August 30, 2004 4:37 PM
Subject: RE: [dba-SQLServer] VLDBs, the saga - Ramblings only


> Shamil,
>
> In fact the numbers don't look like that at all, plus I really didn't know
> what the fields looked like. Looking at the data in 600 fields is
> non-trivial all by itself.
>
> However I can tell you it isn't a straight calc like that. Many and
> probably most of the fields are a simple true/false, which led me to expect
> a MUCH smaller db size. They just have a T or an F in them (the character).
>
> Further it is not helpful to normalize them. This is SO huge that putting
> them back together again in queries joining a dozen or more fields / tables
> would be a slow nightmare. Even normalizing 600 fields would be many many
> hours (days? Weeks?) of work. But in fact I think it may just be better
> left denormalized.
>
> John W. Colby
> www.ColbyConsulting.com
>
> -----Original Message-----
> From: dba-sqlserver-bounces at databaseadvisors.com
> [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Shamil
> Salakhetdinov
> Sent: Monday, August 30, 2004 1:42 AM
> To: dba-sqlserver at databaseadvisors.com
> Subject: Re: [dba-SQLServer] VLDBs, the saga - Ramblings only
>
>
> John,
>
> May I ask whether you made any calculations in advance to see what to
> expect from your data loading? (Almost nobody I know does - but I started
> doing that some time ago - it has happened to me that the "brute force"
> approach doesn't work well sometimes :) )
>
> Why not make it normalized as a first step of your data loading adventure?
>
> I mean:
>
> - you have 65 million records with 600 fields each, and let's assume that
> each field is 20 bytes (not Unicode). Then you get 65 million * 600 * 20 =
> 780GB (even if the average size of a record is less than that, it should
> for sure(?) be more than 4KB, and therefore you get ONE record loaded per
> ONE page - MS SQL records can't be longer than 8KB - so that's ~520GB
> without indexes...)
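To make the arithmetic in the paragraph above easy to replay, here is a minimal
C# sketch (an illustration added here, not part of the original message) that
reproduces the raw-size estimate and the one-record-per-page bound. The
65 million rows, 600 fields, 20-byte average field and 8KB SQL Server page come
from the message; the class and variable names are just illustrative.

// Back-of-envelope size estimates for the 65M-row, 600-field table,
// replaying the arithmetic from the quoted paragraph above.
using System;

class SizeEstimate
{
    static void Main()
    {
        const long rows = 65000000;        // 65 million records
        const long fields = 600;           // fields per record
        const long bytesPerField = 20;     // assumed average width, non-Unicode
        const long pageBytes = 8 * 1024;   // SQL Server data page is 8KB

        // Raw estimate: every field stored at the assumed average width.
        long rawBytes = rows * fields * bytesPerField;
        Console.WriteLine("Raw estimate: {0} GB", rawBytes / 1000000000);   // 780 GB

        // If a row is wider than half a page, only one row fits per page, so
        // the table cannot shrink below one 8KB page per row - roughly the
        // ~520GB figure in the message (which rounds the page to 8,000 bytes).
        long onePagePerRow = rows * pageBytes;
        Console.WriteLine("One page per row: {0} GB", onePagePerRow / 1000000000); // ~532 GB
    }
}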
> If as the first step you go through all your source data and get it
> normalized, you get something like:
>
> 65 million * 600 * 4 = 156GB - the latter looks manageable even with an
> ordinary modern IDE drive, even one connected through USB2 - that's a cheap
> and quick enough solution for starters. (I assume that some references from
> the normalized tables will be 4 bytes long, some two bytes long, others 1
> byte long, and some fields like First-, Middle- and Last- name will be left
> intact - so all that will probably average out to about 4 bytes per field.
> And four bytes are enough to reference even a 65,000,000-row data
> dictionary/reference table (65,000,000 = 03 DF D2 40 hexadecimal).)
>
> So as the first step - get all your source data unzipped onto one HDD
> (160GB)?
>
> Second step - analyze the unzipped data and find the best way to get it
> normalized. (Here a simple utility that reads the source data, computes a
> hash code of every field and counts the distinct hash codes for every field
> would not be a bad idea - such a utility should be a very quick and
> reliable way to get a good approximation of where you can get with your
> huge volume of source data, especially if you write it in C++/C#/VB.NET.
> I'm getting into this field now - so I can help here just for fun, to share
> this challenge of yours, but spare time is a problem here, so that help may
> not come as quickly as you may need it now...)
>
> Third step - make a good (semi-)normalized data model (don't forget a
> clustered primary key - Identity - you like them, I know :)), and calculate
> carefully what size it will reach when it is implemented in an MS SQL
> database...
>
> Fourth step - load the normalized data, maybe in several steps... ...
> N-th step - get all your data loaded and manageable. This is the point
> where you can denormalize it again if you need to (I don't think that will
> be needed/possible with your/your customer's resources, and the OLAP tasks
> should work well on the (semi-)normalized database mentioned above), maybe
> as a set of federated databases, linked databases, partitioned views...
>
> I'd also advise you to now read carefully the "SQL Server Architecture"
> chapter from BOL...
>
> Of course it's easy to give advice - and not that easy to get through the
> challenge you have...
>
> I'm not here that often these days, but I'm online on MS Messenger every
> working day (~8:00 - 22:00 (MT+3), Shamil at Work) - so you can reach me
> there if you need some of my help...
>
> HTH & I hope you'll get it up & running soon,
> Shamil
>
>
> _______________________________________________
> dba-SQLServer mailing list
> dba-SQLServer at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
> http://www.databaseadvisors.com
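As a rough illustration of the "second step" utility described in the quoted
message (read the source data, hash every field value, and count the distinct
hash codes per field to see how well each column would normalize), here is a
minimal C# sketch. It is only a sketch under assumptions the message does not
spell out: the comma-delimited input format, the file path and the per-column
HashSet buckets are illustrative choices, not the original author's design.

// Rough field-cardinality scanner along the lines described in the message:
// hash every field value and count the distinct hash codes seen per column.
using System;
using System.Collections.Generic;
using System.IO;

class FieldCardinalityScan
{
    static void Main(string[] args)
    {
        // Assumed input: a comma-delimited text export of the source data.
        string path = args.Length > 0 ? args[0] : "source_data.csv";
        char delimiter = ',';

        var distinctHashesPerColumn = new List<HashSet<int>>();

        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] values = line.Split(delimiter);

                // Grow the per-column buckets the first time a wider row appears.
                while (distinctHashesPerColumn.Count < values.Length)
                    distinctHashesPerColumn.Add(new HashSet<int>());

                // Keep only the hash code of each value, not the value itself,
                // so each column costs a few bytes per distinct value while
                // approximating its distinct count.
                for (int i = 0; i < values.Length; i++)
                    distinctHashesPerColumn[i].Add(values[i].GetHashCode());
            }
        }

        // Columns with very few distinct values (e.g. just "T"/"F") are the
        // cheapest to normalize or to store as 1-byte codes.
        for (int i = 0; i < distinctHashesPerColumn.Count; i++)
            Console.WriteLine("Field {0}: ~{1} distinct value(s)",
                              i + 1, distinctHashesPerColumn[i].Count);
    }
}

For genuinely high-cardinality columns (names, street addresses) the per-column
sets can still grow to tens of millions of entries, so sampling the file first,
or swapping the sets for a cardinality sketch, would be a sensible refinement
of the same idea.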