jwcolby
jwcolby at colbyconsulting.com
Thu Nov 29 09:25:10 CST 2007
Arthur, first of all thanks for the response. Next, some terminology.

1) HSIDRaw is the original HSID file (the database from hell) as imported from the text files provided by the client. Roughly 65 million records.

2) AZHSID is a table created by pulling the name / address fields AND the PKID from HSIDRaw. That data was sent out for address validation processing (AccuZip - thus the AZxxx prefix). The data returned from address validation is guaranteed deliverable addresses, plus some additional data fields having to do with the address validation process. Roughly 51 million records after deleting the undeliverable records.

NOTICE that the records in AZHSID have all of the valid ADDRESS information but none of the SELECTION data. HSIDRaw has all of the valid SELECTION data but none of the valid ADDRESS data.

In order to perform the whole process, I use HSIDRaw to pull fields for selection, INCLUDING the PKID field. I perform selections and get a resulting PKID set from HSIDRaw which is "the records selected". I then join that result set to AZHSID in order to:

a) narrow down to just the records with valid deliverable addresses (for counts);
b) pull valid deliverable ADDRESS information, should this be a real order for addresses.

Thus the two tables work together to form a unified whole. (A sketch of one of these count / pull queries appears after point 4 below.)

To address your points:

>From what I have gathered about your app, I see not a single advantage
>to your creation of a clustered index.

1) A clustered index (from my understanding) is a physical ordering of the table by the PK, as you explained - in this case the "autonumber" int field set up originally on HSIDRaw. That PK is then used in AZHSID to allow joining back to HSIDRaw when required. My understanding is that the leaf of the clustered index contains the entire record (all of the fields), thus once you find the record you "are there" and can go get data from any field. This would be helpful in any case where the field of interest is not indexed separately. Obviously, with hundreds of fields, not every field will have an index. So the "advantage" would be in those cases where the individual field does not have an index but a WHERE is being performed on that field. Or so it would seem.

>IMO, your largest problem is the width of the table. Whenever I detect
>several hundred columns, I detect a wrong approach.

2) Yes, obviously. Unfortunately it is beyond my power to correct this. You and I did discuss moving this out to a "normalized" system where each value in each column becomes just a record in a single table. I agree that this would likely allow a huge performance increase; unfortunately it is beyond my capabilities to perform this normalization.

>I know you receive new records occasionally

3) No, in fact these tables NEVER!!! change (OK, I do have one table that I get updates to). These are data tables that I got years ago, imported into SQL Server, and have only ever USED since the import. So I do NOT update HSIDRaw; it is carved in stone. AZHSID will be updated, but that is the address table, not the "fact (SELECTION) table".

4) I am in the process of rebuilding HSIDRaw by tearing out the first 100 or so fields, which were old "address validation" fields and are no longer valid. Additionally, I am using the data you provided on the "optimum width" of each field in HSIDRaw to build a new HSIDRaw table with ONLY the selection fields plus the PKID, with the selection fields converted to the correct data type - char(1) in most cases. Oh, and I also removed all of the HSIDRaw records not found in AZHSID, i.e. those which do not have a deliverable address. (Roughly as sketched in the second snippet below.)
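To make that concrete, a count and a pull look roughly like the following. Only HSIDRaw, AZHSID and PKID are real names here; the selection and address fields (HasPool, Income, Address1 and so on) are stand-ins invented for illustration.

    -- Count: how many selected records have a deliverable address?
    SELECT COUNT(*)
    FROM HSIDRaw AS r
    JOIN AZHSID AS a ON a.PKID = r.PKID
    WHERE r.HasPool = 'Y'
      AND r.Income = '9';

    -- Order: pull the deliverable addresses for the same selection.
    SELECT a.PKID, a.FirstName, a.LastName, a.Address1, a.City, a.St, a.Zip
    FROM HSIDRaw AS r
    JOIN AZHSID AS a ON a.PKID = r.PKID
    WHERE r.HasPool = 'Y'
      AND r.Income = '9';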
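The rebuild in point 4 boils down to something like this (again, every column name except PKID is a stand-in; the real table has hundreds of selection fields):

    -- Build the new narrow HSIDRaw: PKID plus the selection fields,
    -- each converted to its optimum width, keeping only the records
    -- that have a deliverable address in AZHSID.
    SELECT r.PKID,
           CAST(r.HasPool AS char(1)) AS HasPool,
           CAST(r.Income  AS char(1)) AS Income
           -- ... one CAST per surviving selection field ...
    INTO HSIDRawNew
    FROM HSIDRaw AS r
    WHERE EXISTS (SELECT 1 FROM AZHSID AS a WHERE a.PKID = r.PKID);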
Thus AZHSID and HSIDRaw now have exactly the same PKIDs. Before doing this, AZHSID had 51 million records and HSIDRaw had 65 million. Now HSIDRaw has 51 million records as well. Since HSIDRaw is now "cleaned up" - the first 100 fields deleted, the remaining fields "optimized" to the correct data types, and the useless undeliverable records gone - I hope to make it even more useful by correctly indexing the selection fields.

Yes, a change to the "normalized" method that you and I discussed would be nice, but unless and until I can figure out how to do that, it is not helpful to keep pounding away that "that is the correct way to do things". I have to deal with reality, and the reality is that I can't do the normalization. So let's discuss what I can do.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Arthur Fuller
Sent: Thursday, November 29, 2007 9:02 AM
To: dba-sqlserver at databaseadvisors.com
Subject: Re: [dba-SQLServer] How are indexes built

JC:

First of all, you need to understand what a clustered index is, and what it is not. Perhaps you already know, but in case not, let me try to explain, in perhaps simplistic terms. Creating a clustered index physically re-orders the table according to the order of said index. That will take a while, given the size of your table. It will also consume (temporarily) a large amount of disk space, since it has to re-order the table in tempdb and then write it back to yourdb.

From what I have gathered about your app, I see not a single advantage to your creation of a clustered index. A clustered index (again, typically) would be most useful in a compound-index situation, i.e. an index on Order Number + Order Detail Number, so gathering up the Order Details would require minimal visits to the table. Your case is much different from this, and AFAIS you would gain nothing by creating a clustered index, and conversely you would lose all the time it takes to physically re-order said table.

IMO, your largest problem is the width of the table. Whenever I detect several hundred columns, I detect a wrong approach. We have previously discussed this, and I hold my ground. All these columns (or at least many of them) ought to be rows in a child table, not columns in the principal table. Then you could creatively use indexes. Imagine a hundred columns whose contents are Y/N/null. No point in recording the nulls; only the Y/Ns are significant. Give each of the current columns a PK and a meaningful description in the "attributes" table, and then index said PK. Yes, this will result in many millions of rows, one for every Y/N attribute in your big fat table. But it will make queries such as attribute A + attribute B + attribute C - attribute D much simpler. This would translate in the new scheme to EXISTS and NOT EXISTS clauses, and since all of this could be handled by the index, relatively few disk hits would be required (relative to the full-table scans your current model requires for almost every query).

I know you receive new records occasionally, and porting said new records to this model will involve creating new rows in said child table and then populating them with the PK from the user, the PK from the new attribute, and the immediate value. I would vastly prefer going down that road to the full-table-scan scenarios you are currently locked into.

Arthur
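For concreteness, the child-table scheme Arthur describes might look roughly like the following. Every name except PKID is invented for illustration, and the compound clustered index on the child table is one reading of his Order Number + Order Detail Number point applied to this case.

    -- One row per former column: its PK and a meaningful description.
    CREATE TABLE Attribute (
        AttributeID int NOT NULL PRIMARY KEY,
        Description varchar(100) NOT NULL
    );

    -- One row per significant (non-null) Y/N value; the nulls are
    -- simply never recorded.
    CREATE TABLE PersonAttribute (
        PKID        int     NOT NULL,  -- same PKID as HSIDRaw / AZHSID
        AttributeID int     NOT NULL,  -- FK to Attribute
        Value       char(1) NOT NULL   -- 'Y' or 'N'
    );

    -- Compound clustered index: all rows for one attribute sit
    -- together, so the query below is answered from the index.
    CREATE CLUSTERED INDEX CIX_PersonAttribute
        ON PersonAttribute (AttributeID, PKID);

    -- "Attribute A + attribute B - attribute D" as EXISTS / NOT EXISTS
    -- (attribute C would be one more EXISTS block like the first two).
    SELECT a.PKID
    FROM AZHSID AS a
    WHERE EXISTS (SELECT 1 FROM PersonAttribute AS pa
                  WHERE pa.PKID = a.PKID
                    AND pa.AttributeID = 1   -- attribute A
                    AND pa.Value = 'Y')
      AND EXISTS (SELECT 1 FROM PersonAttribute AS pa
                  WHERE pa.PKID = a.PKID
                    AND pa.AttributeID = 2   -- attribute B
                    AND pa.Value = 'Y')
      AND NOT EXISTS (SELECT 1 FROM PersonAttribute AS pa
                      WHERE pa.PKID = a.PKID
                        AND pa.AttributeID = 4   -- attribute D
                        AND pa.Value = 'Y');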