jwcolby
jwcolby at colbyconsulting.com
Sat Dec 8 14:20:51 CST 2007
In converting the database from hell to a "vertical" table, I have processed 36 fields out of 684 fields from a table with 51.157 million records. It has found 573,397,612 data elements in those 36 fields - 15.93 million data points per field on average - and has consumed 12.246 GBytes of data space (no indexes yet), about 340 megs per field. It is taking roughly 300 seconds per field to process at this point, though that bounces around a lot depending on how many data elements it finds in a field.

Assuming that the statistics so far hold true over the rest of the data (not a real good assumption) and that the insert time per element remains the same (a horrid assumption), it will take (684 - 36) * 5 minutes to finish the conversion - 3240 minutes, or 54 hours. It will also take 340 megs * 684 fields of data space - about 232 gigs. The table when finished would hold 684 * 15.93 million data points - 10896.12 million, roughly 10.9 billion data points.

The point of this exercise is to build a table with three fields: the FieldID from a field lookup table, the PKID of the record the data point is found in, and the data itself. I will have converted a table containing 50 million records but 684 fields into a table with 11 billion records but 3 fields. Obviously a table scan is being used on the wide table most of the time; I cannot realistically index 684 fields, and even if I did, many (most) of the fields hold a single character, 'Y'.

Once the table is built I will build a single index on it, a clustered index with the fields arranged FieldID, Data, PKID, on the theory that most operations will be a "where" on a field (thus the FieldID first) where the data is a certain value (thus the Data next), returning only the PKID. If I understand things (a HUGE assumption), this clustered index would be optimized for that kind of search. So in the case of the "vertical" table, an "index scan" will be performed, BUT all the records for a given field will be contiguous, BUT it will have to scan (worst case) 11 billion records, BUT it will only have to read three fields consisting of two integers and a varchar(100).

What would be interesting, but probably not feasible, would be to analyze which fields my client uses the most and place them first in the table. That would guarantee that the most used fields occur earliest in the search, and THAT would make an enormous difference, I think. Then if there are only 5 million data points in a field, and they are contiguous, and they are at the front of the table...

My good friend Arthur assures me that the new table will on average win the race. We shall see. Does anyone have a gut feeling for the results?

John W. Colby
Colby Consulting
www.ColbyConsulting.com
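
P.S. For the curious, here is roughly the shape of the vertical table and the clustered index described above. This is only a sketch - the table and column names (dbo.VerticalData, FieldID, PKID, Data) and the example FieldID value are placeholders I picked for illustration, not my actual schema.

-- Vertical (field / record / value) table: one row per non-empty data element
CREATE TABLE dbo.VerticalData (
    FieldID int          NOT NULL,   -- ID from the field lookup table
    PKID    int          NOT NULL,   -- PK of the source record in the wide table
    Data    varchar(100) NOT NULL    -- the data element itself
);

-- Single clustered index, ordered FieldID, Data, PKID, so all rows for a
-- given field (and a given value within that field) are stored contiguously
CREATE CLUSTERED INDEX CIX_VerticalData
    ON dbo.VerticalData (FieldID, Data, PKID);

-- Typical query shape: "where" on one field for a given value, returning PKIDs
SELECT PKID
FROM dbo.VerticalData
WHERE FieldID = 42       -- placeholder field ID
  AND Data = 'Y';

With the clustered index ordered that way, all the rows for a given FieldID (and a given Data value within it) sit together on disk, which is the contiguity I am counting on.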