[dba-SQLServer] Just an FYI

jwcolby jwcolby at colbyconsulting.com
Sat Dec 8 15:08:50 CST 2007


>Thanks for promoting me to your Good Friends list!

We have our differences but "I love ya Arthur" (in that sobbing tone).

>Perhaps I forgot to suggest that the only rows/columns of interest

Yes, that is what I am doing.  Any field with no data holds an empty string
("") and the transform only pulls WHERE <> ''.  So I only have valid data
points.  You can see that because the average number of data points per
field is 15.93 million, not 51 million (the number of records).
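
In case it helps to see it, each field's pass of the transform is roughly
the sketch below - the table, column and variable names are placeholders,
not the real ones:

-- Rough sketch of one field's transform pass (placeholder names).
DECLARE @FieldID int;
SET @FieldID = 1;                  -- ID of this column in the field lookup table

INSERT INTO tblVertical (FieldID, PKID, Data)
SELECT @FieldID,
       bf.PKID,                    -- primary key of the wide-table record
       bf.fldSomeField             -- the value itself
FROM   tblBigFlat bf
WHERE  bf.fldSomeField <> '';      -- skip the empty strings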


John W. Colby
Colby Consulting
www.ColbyConsulting.com 
-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Arthur
Fuller
Sent: Saturday, December 08, 2007 3:38 PM
To: dba-sqlserver at databaseadvisors.com
Subject: Re: [dba-SQLServer] Just an FYI

Thanks for promoting me to your Good Friends list!

Perhaps I forgot to suggest that the only rows/columns of interest are those
with values that are neither NULL nor equivalent to "". The entire point of
this exercise is that we record only meaningful values. In the case of your
YesNo columns, if a column contains neither Yes nor No then it ought not to
be written to the child table. All we care about are the Yes and No answers.
The empty values are of no interest, since they contain no information.

A.

P.S.
We have had our wars, which is one of several reasons why I deem you a Good
Friend too, John.

On 12/8/07, jwcolby <jwcolby at colbyconsulting.com> wrote:
>
> In converting the database from hell to a "vertical" table, I have
> processed 36 fields out of 684 fields from a table with 51.157 million
> records.
>
> It has found 573,397,612 data elements in those 36 fields - an average of
> 15.93 million data points per field.
>
> It has consumed 12.246 GBytes of data space (no indexes yet) - an average
> of 340 megs per field.
>
> It is taking roughly 300 seconds per field to process at this point in 
> time, though that bounces around a lot depending on how many data 
> elements it finds in a field.
>
> Assuming that the statistics so far hold true over the rest of the data
> (not a really good assumption) and assuming that the insert time per
> element remains the same (a horrid assumption), it will take (684-36) * 5
> minutes to finish the conversion - 3240 minutes, or 54 hours.  It will
> also take 684 * 340 megs of data space - roughly 232 gigs.  The table,
> when finished, would hold 684 * 15.93 million data points - 10,896.12
> million, or roughly 10.9 billion data points.
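>
> A quick back-of-the-envelope check of those numbers, assuming the
> 684-field count and the per-field averages from the first 36 fields hold:
>
> -- Sanity check of the estimates above.  Assumes 684 total fields,
> -- 5 minutes per field, 340 megs per field, and 15.93 million data
> -- points per field.
> SELECT (684 - 36) * 5  AS MinutesRemaining,     -- 3240 minutes, 54 hours
>        684 * 340       AS MegsOfDataSpace,      -- 232,560 megs, ~232 gigs
>        684 * 15.93     AS MillionsOfDataPoints; -- 10,896.12 million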
>
> The point of this exercise is to build a table with three fields: the
> FieldID from a field lookup table, the PKID of the record that the data
> point is found in, and the data itself.  I will have converted a table
> containing 50 million records but 684 fields into a table with roughly 11
> billion records but 3 fields.  Obviously a table scan is being used on the
> wide table most of the time, since I cannot realistically index 684
> fields, and even if I did, many (most) of the fields are a single
> character consisting of 'Y'.
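>
> To make that concrete, the pair of tables would look roughly like the
> following - the names are placeholders, and only the two integers and the
> varchar(100) reflect my actual plan:
>
> -- Rough sketch of the field lookup table and the "vertical" table.
> CREATE TABLE tblFieldLookup (
>     FieldID   int IDENTITY(1,1) PRIMARY KEY,
>     FieldName varchar(128) NOT NULL  -- original column name from the wide table
> );
>
> CREATE TABLE tblVertical (
>     FieldID int NOT NULL,            -- which original column the value came from
>     PKID    int NOT NULL,            -- primary key of the wide-table record
>     Data    varchar(100) NOT NULL    -- the data point itself
> );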
>
> Once the table is built I will build a single index on the table: a
> clustered index with the fields arranged FieldID, Data, PKID, on the
> theory that most operations will be a "where" on a field (thus the
> FieldID first) where the data is a certain value (thus the Data next),
> returning only the PKID.
>
> If I understand things (a HUGE assumption) this clustered index would 
> be optimized for that kind of search.
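>
> In other words, something along these lines (placeholder names again):
>
> -- The single clustered index, arranged FieldID, Data, PKID.
> CREATE CLUSTERED INDEX IX_tblVertical_FieldID_Data_PKID
>     ON tblVertical (FieldID, Data, PKID);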
>
> So in the case of the "vertical" table, an "index scan" will be performed
> BUT all the records for a given field will be contiguous, BUT it will
> have to scan (worst case) 11 billion records, BUT it will only have to
> read three fields consisting of two integers and a varchar(100).
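>
> The typical query would be something on the order of the following (the
> field ID and the 'Y' value are made up for illustration):
>
> -- Typical lookup: which records have a given value in a given field?
> SELECT PKID
> FROM   tblVertical
> WHERE  FieldID = 42
>   AND  Data = 'Y';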
>
> What would be interesting but probably not feasible would be to 
> analyze which fields are used the most by my client and place them 
> first in the table.  That would guarantee that the most used fields 
> occur earliest in the search.  THAT would make an enormous difference 
> I think.  Then if there are only 5 million data points in a field and 
> they are contiguous, and they are at the front of the table...
>
> My good friend Arthur assures me that the new table will on average 
> win the race.  We shall see.
>
> Does anyone have a gut feeling for the results?
>
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com
>
_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com



