[dba-SQLServer] My learning experience - was RE: Clustered index

Sun Dec 9 14:33:58 CST 2007

This email is going to present results of a "side by side" (but not quite
apples to apples) comparison of the work I have done over the last few years
on a huge database which I lovingly call "the database from hell".  But
first some background.  

I have three different databases.  One called HSIDRaw (first generation),
the next called HSID (second generation),  the next called HSIDVert.  

tblHSIDCean: 

HSIDRaw has a table called tblHSIDClean which is all of the original data,
65 million records, with close to 700 fields.  This data comes from an old
"fixed width" text file which I imported years ago.  It was this database /
table that I have mostly been using for the last couple of years.  Each
record in tblHSIDClean has an autonumber PKID field which is the PK of each
record in tblHSIDClean, created at the time I did the data import.  This
PKID carries through to everything else I do.  The data in this table
consists of responses to polls, so although there are 65 million records in
the database, not all people answered every poll.  There is data about boat
ownership (about 7 million responders), income (about 40 million
responders), travel (unknown number of responders ATM) etc.  So every field
does not contain data for every record, in fact in some fields the data is
sparsely populated.  

I took the name / address info from this table and had it "address
validated", the results of which sit in another table of ~51 million valid
deliverable name / address records (tblAZHSID).  That data carries the PKID
from tblHSIDClean as well so that I can relate a clean address back to these
tblHSIDClean records.  That "address validated" table contains ONLY address
validation info, and NONE of the data in tblHSIDClean.  tblAZHSID is of no
interest to this discussion other than so that you can conceptualize how the
data might be used to get "names / addresses of people WHERE...", i.e.
mailing list data, which is the ultimate requirement of this data.
tblAZHSID has address info, tblHSIDClean contains WHERE info, and they are
related on PKID.

INDEXES: In tblHSIDClean I used indexes on the fields that the client would
ask about.  I would build the index on demand (as I needed it) and they
slowly accumulated.  The indexes were always just a single field, no "cover
index" concept at this point.  These indexes provided fast COUNTS but did
not work well for returning PKID sets for further processing in complex
queries with WHERE clauses covering many different unrelated fields - WHERE
Income IN() AND BoatLength IN() AND HasDogs = 'Y'.

tblHSID: 

In the last month I extracted tblHSID From HSIDClean, out into a new
database called HSID.  tblHSID has the leading ~100 fields stripped off.
These 100 fields are OLD address validation stuff and is never used at all
for any reason.  Thus tblHSID has 584 fields.  Additionally, tblHSID only
has 51 million records.  When I did address validation years ago It turned
out that about 14 million records "fell out" of the 65 million records
because of bad address info.  The 51 million records represent all the
records that were deliverable.  This table carries the PKID field from
tblHSIDClean so that I know which record in tblHSID (new) relates to which
record in tblHSIDClean (old).  To summarize, tblHSID has ~100 fewer fields
and 14 million fewer records than tblHSIDClean.

INDEXES: In this database I started building "cover indexes" where each
index was GENERALLY still for a single field but would also include the PKID
as well.  In a couple of cases I build a cover index where a logical group
of fields were all contained in the index, as well as the PKID.  In these
cover indexes I place the data field first in the index, and then the PKID,
on the theory that I would be doing WHERE clauses using this index to look
for specific values in the data field.  In some cases the results are simple
counts, how many records with HullType = 'A' (boaters with aluminum hulls)
or NarrowIncomeBand = 'T' (people with incomes > 150K).  So covering indexes
SHOULD make the process of getting counts of specific data elements faster,
and in particular should make the process of getting actual PKID sets for
these data elements faster.  One of the uses of the database is to get
actual addresses for all of the people WHERE..., for which I need PKID sets
in order to facilitate comparisons and joins, and also to get at the name /
address data over in tblAZHSID.

tblHSIDVert:

In the last few days I created HSIDVert database with tblHSIDVert_001_083,
tblHSIDVert_084_143, tblHSIDVert_144_233 etc.  Each table contains data
points from specific sets of fields.  You can think of each table as a
vertical partition of tblHSID, although of course it is more than that.

In these tables I completely restructured the data in the manner I have
discussed in the last few days.  I got a list of all of the remaining data
fields (all of the fields in tblHSID), 584 fields total, and placed the
field names into tblHSIDFldName with an autoincrement PKID which becomes the
FieldNameID.  I then created tables named as above, with just three columns,
FldNameID, FldValue, and PKID.  PKID of course is the PKID from records in
tblHSID and tblHSIDClean and simply tells me which RECORD back in tblHSID /
tblHSIDClean each "data point record" is coming from.  FldNameID comes from
the above mentioned tblHSIDFldName and simply tells me which FIELD each data
point comes from.  Data is the data point or element from the field
specified.

I then created a process that pulls ONLY the data points out of tblHSID for
each field in tblHSID, and place those data points into tblHSIDVert.  IOW I
grabbed "just the facts ma'am" and "stacked them" into a single column of
data, with a pair of other fields to keep it all straight - the FieldNameID
and the PKID.

INDEXES: In HSIDVert I created a single clustered index covering all three
fields, in the order of FldNameID, Data, PKID.  If you think about the
concept, the "second generation" (tblHSID) is of course better than the
"first generation" (tblHSIDClean) for three reasons, fewer records, fewer
fields, and covering indexes.  HOWEVER I still have to create these covering
indexes as the client asks for results using a specific field.  I.e. each
field has its own distinct covering index, and there are still 584 possible
fields needing a cover index.  That's a lot of indexes.  The "third
generation" has some advantages in that ONLY the actual data is contained in
the table, (no empty fields) and EVERY field gets an index automatically due
to the fact that this table uses a cover index (the clustered index).

So which "schema" will win the race?  

Results in part two of this email.

John W. Colby
Colby Consulting
www.ColbyConsulting.com