jwcolby
jwcolby at colbyconsulting.com
Thu Nov 29 09:25:10 CST 2007
Arthur, first of all thanks for the response. Next, some terminology.

1) HSIDRaw is the original HSID file (the database from hell) as imported from the text files provided by the client. Roughly 65 million records.

2) AZHSID is a table created by pulling the name / address fields AND the PKID from HSIDRaw. That data was sent out for address validation processing (AccuZip - thus the AZxxx prefix). The data returned from address validation is guaranteed deliverable addresses, plus some additional data fields having to do with the address validation process. Roughly 51 million records after deleting the undeliverable records.

NOTICE that the records in AZHSID have all of the valid ADDRESS information but none of the SELECTION data. HSIDRaw has all of the valid SELECTION data but none of the valid ADDRESS data.

In order to perform the whole process, I use HSIDRaw to pull fields for selection, INCLUDING the PKID field. I perform selections and get a resulting PKID set from HSIDRaw which is "the records selected". I then join that result set to AZHSID in order to:

a) narrow down to just the records with valid deliverable addresses (for counts);
b) pull valid deliverable ADDRESS information, should this be a real order for addresses.

Thus the two tables work together to form a unified whole. (A sketch of one of these count / pull queries appears after point 4 below.)

To address your points:

>From what I have gathered about your app, I see not a single advantage
>to your creation of a clustered index.

1) A clustered index (from my understanding) is a physical ordering of the table by the PK, as you explained - in this case the "autonumber" int field set up originally on HSIDRaw. That PK is then used in AZHSID to allow joining back to HSIDRaw when required. My understanding is that the leaf of the clustered index contains the entire record (all of the fields), thus once you find the record you "are there" and can go get data from any field. This would be helpful in any case where the field of interest is not indexed separately. Obviously, with hundreds of fields, not every field will have an index. So the "advantage" would be in those cases where the individual field does not have an index but a WHERE is being performed on that field. Or so it would seem.

>IMO, your largest problem is the width of the table. Whenever I detect
>several hundred columns, I detect a wrong approach.

2) Yes, obviously. Unfortunately it is beyond my power to correct this. You and I did discuss moving this out to a "normalized" system where each value in each column becomes just a record in a single table. I agree that this would likely allow a huge performance increase; unfortunately it is beyond my capabilities to perform this normalization.

>I know you receive new records occasionally

3) No, in fact these tables NEVER!!! change (OK, I do have one table that I get updates to). These are data tables that I got years ago, imported into SQL Server, and have only ever USED since the import. So I do NOT update HSIDRaw; it is carved in stone. AZHSID will be updated, but that is the address table, not the "fact (SELECTION) table".

4) I am in the process of rebuilding HSIDRaw by tearing out the first 100 or so fields, which were old "address validation" fields and are no longer valid. Additionally, I am using the data you provided on the "optimum width" of each field in HSIDRaw to build a new HSIDRaw table with ONLY the selection fields plus the PKID, with the selection fields converted to the correct data type - char(1) in most cases. Oh, and I also removed all of the HSIDRaw records not found in AZHSID, i.e. those which do not have a deliverable address. (Roughly as sketched in the second snippet below.)
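To make that concrete, a count and a pull look roughly like the following. Only HSIDRaw, AZHSID and PKID are real names here; the selection and address fields (HasPool, Income, Address1 and so on) are stand-ins invented for illustration.

    -- Count: how many selected records have a deliverable address?
    SELECT COUNT(*)
    FROM HSIDRaw AS r
    JOIN AZHSID AS a ON a.PKID = r.PKID
    WHERE r.HasPool = 'Y'
      AND r.Income = '9';

    -- Order: pull the deliverable addresses for the same selection.
    SELECT a.PKID, a.FirstName, a.LastName, a.Address1, a.City, a.St, a.Zip
    FROM HSIDRaw AS r
    JOIN AZHSID AS a ON a.PKID = r.PKID
    WHERE r.HasPool = 'Y'
      AND r.Income = '9';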
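The rebuild in point 4 boils down to something like this (again, every column name except PKID is a stand-in; the real table has hundreds of selection fields):

    -- Build the new narrow HSIDRaw: PKID plus the selection fields,
    -- each converted to its optimum width, keeping only the records
    -- that have a deliverable address in AZHSID.
    SELECT r.PKID,
           CAST(r.HasPool AS char(1)) AS HasPool,
           CAST(r.Income  AS char(1)) AS Income
           -- ... one CAST per surviving selection field ...
    INTO HSIDRawNew
    FROM HSIDRaw AS r
    WHERE EXISTS (SELECT 1 FROM AZHSID AS a WHERE a.PKID = r.PKID);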
Thus AZHSID and HSIDRaw now have exactly the same PKIDs. Before doing this, AZHSID had 51 million records and HSIDRaw had 65 million. Now HSIDRaw has 51 million records as well. Since HSIDRaw is now "cleaned up" - the first 100 fields deleted, the remaining fields "optimized" to the correct data types, and the useless undeliverable records gone - I hope to make it even more useful by correctly indexing the selection fields.

Yes, a change to the "normalized" method that you and I discussed would be nice, but unless and until I can figure out how to do that, it is not helpful to keep pounding away that "that is the correct way to do things". I have to deal with reality, and the reality is that I can't do the normalization. So let's discuss what I can do.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Arthur Fuller
Sent: Thursday, November 29, 2007 9:02 AM
To: dba-sqlserver at databaseadvisors.com
Subject: Re: [dba-SQLServer] How are indexes built

JC:

First of all, you need to understand what a clustered index is, and what it is not. Perhaps you already know, but in case not, let me try to explain, in perhaps simplistic terms. Creating a clustered index physically re-orders the table according to the order of said index. That will take a while, given the size of your table. It will also consume (temporarily) a large amount of disk space, since it has to re-order the table in tempdb and then write it back to yourdb.

From what I have gathered about your app, I see not a single advantage to your creation of a clustered index. A clustered index (again, typically) would be most useful in a compound-index situation, i.e. an index on Order Number + Order Detail Number, so gathering up the Order Details would require minimal visits to the table. Your case is much different from this, and AFAIS you would gain nothing by creating a clustered index, and conversely you would lose all the time it takes to physically re-order said table.

IMO, your largest problem is the width of the table. Whenever I detect several hundred columns, I detect a wrong approach. We have previously discussed this, and I hold my ground. All these columns (or at least many of them) ought to be rows in a child table, not columns in the principal table. Then you could creatively use indexes. Imagine a hundred columns whose contents are Y/N/null. No point in recording the nulls; only the Y/Ns are significant. Give each of the current columns a PK and a meaningful description in the "attributes" table, and then index said PK. Yes, this will result in many millions of rows, one for every Y/N attribute in your big fat table. But it will make queries such as attribute A + attribute B + attribute C - attribute D much simpler. This would translate in the new scheme to EXISTS and NOT EXISTS clauses, and since all of this could be handled by the index, relatively few disk hits would be required (relative to the full-table scans your current model requires for almost every query).

I know you receive new records occasionally, and porting said new records to this model will involve creating new rows in said child table and then populating them with the PK from the user, the PK from the new attribute, and the immediate value. I would vastly prefer going down that road to the full-table-scan scenarios you are currently locked into.

Arthur
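For concreteness, the child-table scheme Arthur describes might look roughly like the following. Every name except PKID is invented for illustration, and the compound clustered index on the child table is one reading of his Order Number + Order Detail Number point applied to this case.

    -- One row per former column: its PK and a meaningful description.
    CREATE TABLE Attribute (
        AttributeID int NOT NULL PRIMARY KEY,
        Description varchar(100) NOT NULL
    );

    -- One row per significant (non-null) Y/N value; the nulls are
    -- simply never recorded.
    CREATE TABLE PersonAttribute (
        PKID        int     NOT NULL,  -- same PKID as HSIDRaw / AZHSID
        AttributeID int     NOT NULL,  -- FK to Attribute
        Value       char(1) NOT NULL   -- 'Y' or 'N'
    );

    -- Compound clustered index: all rows for one attribute sit
    -- together, so the query below is answered from the index.
    CREATE CLUSTERED INDEX CIX_PersonAttribute
        ON PersonAttribute (AttributeID, PKID);

    -- "Attribute A + attribute B - attribute D" as EXISTS / NOT EXISTS
    -- (attribute C would be one more EXISTS block like the first two).
    SELECT a.PKID
    FROM AZHSID AS a
    WHERE EXISTS (SELECT 1 FROM PersonAttribute AS pa
                  WHERE pa.PKID = a.PKID
                    AND pa.AttributeID = 1   -- attribute A
                    AND pa.Value = 'Y')
      AND EXISTS (SELECT 1 FROM PersonAttribute AS pa
                  WHERE pa.PKID = a.PKID
                    AND pa.AttributeID = 2   -- attribute B
                    AND pa.Value = 'Y')
      AND NOT EXISTS (SELECT 1 FROM PersonAttribute AS pa
                      WHERE pa.PKID = a.PKID
                        AND pa.AttributeID = 4   -- attribute D
                        AND pa.Value = 'Y');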