jwcolby
jwcolby at colbyconsulting.com
Sun Dec 9 18:42:49 CST 2007
This email is going to present results of a "side by side" (but not quite apples-to-apples) comparison of the work I have done over the last few years on a huge database which I lovingly call "the database from hell". But first some background. I have three different databases: one called HSIDRaw (first generation), the next called HSID (second generation), and the last called HSIDVert (third generation).

tblHSIDClean: HSIDRaw has a table called tblHSIDClean which holds all of the original data: 65 million records with close to 700 fields. This data comes from an old "fixed width" text file which I imported years ago. It is this database / table that I have mostly been using for the last couple of years. Each record in tblHSIDClean has an autonumber PKID field, created at the time I did the import, which is the PK of each record; this PKID carries through to everything else I do.

The data in this table consists of responses to polls, so although there are 65 million records in the database, not all people answered every poll. There is data about boat ownership (about 7 million responders), income (about 40 million responders), travel (unknown number of responders ATM), etc. So not every field contains data for every record; in fact in some fields the data is sparsely populated.

I took the name / address info from this table and had it "address validated"; the results sit in another table of ~51 million valid, deliverable name / address records (tblAZHSID). That data carries the PKID from tblHSIDClean as well, so that I can relate a clean address back to the tblHSIDClean records. The "address validated" table contains ONLY address validation info, and NONE of the data in tblHSIDClean. tblAZHSID is of no interest to this discussion other than to help you conceptualize how the data might be used to get "names / addresses of people WHERE...", i.e. mailing list data, which is the ultimate requirement of this data.
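To make the PKID relationship concrete, here is a minimal sketch (using Python's sqlite3 as a stand-in for SQL Server, with tiny invented data, not the author's actual schema): the sparse poll-response table and the address-validated table are joined on the shared PKID to produce "names / addresses of people WHERE...".

```python
import sqlite3

# Illustrative only: tiny stand-ins for tblHSIDClean (sparse poll responses)
# and tblAZHSID (validated, deliverable addresses), related on PKID.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tblHSIDClean (PKID INTEGER PRIMARY KEY, HasDogs TEXT, NarrowIncomeBand TEXT);
CREATE TABLE tblAZHSID   (PKID INTEGER PRIMARY KEY, Name TEXT, Addr TEXT);
INSERT INTO tblHSIDClean VALUES (1,'Y','T'), (2,NULL,'T'), (3,'Y',NULL);
-- PKID 2 "fell out" of address validation, so it has no deliverable address.
INSERT INTO tblAZHSID VALUES (1,'Ann','1 Elm St'), (3,'Bob','9 Oak Ave');
""")

# "Names / addresses of people WHERE ..." -- the ultimate use of the data:
rows = con.execute("""
    SELECT a.Name, a.Addr
    FROM tblHSIDClean AS c
    JOIN tblAZHSID    AS a ON a.PKID = c.PKID
    WHERE c.HasDogs = 'Y'
    ORDER BY a.PKID
""").fetchall()
print(rows)  # only PKIDs present in BOTH tables and matching the WHERE
```

Only people who both answered the poll question and survived address validation come back, which is why carrying the same PKID through every table matters.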
tblAZHSID has address info, tblHSIDClean contains WHERE info, and they are related on PKID.

INDEXES: In tblHSIDClean I used indexes on the fields that the client would ask about. I would build each index on demand (as I needed it) and they slowly accumulated. The indexes were always just a single field; there was no "cover index" concept at this point. These indexes provided fast COUNTs but did not work well for returning PKID sets for further processing in complex queries with WHERE clauses covering many different unrelated fields, e.g. WHERE Income IN (...) AND BoatLength IN (...) AND HasDogs = 'Y'.

tblHSID: In the last month I extracted tblHSID from tblHSIDClean, out into a new database called HSID. tblHSID has the leading ~100 fields stripped off. Those 100 fields are OLD address validation stuff and are never used at all for any reason. Thus tblHSID has 584 fields. Additionally, tblHSID has only 51 million records: when I did address validation years ago, it turned out that about 14 million of the 65 million records "fell out" because of bad address info, and the 51 million records represent all the records that were deliverable. This table carries the PKID field from tblHSIDClean so that I know which record in tblHSID (new) relates to which record in tblHSIDClean (old). To summarize, tblHSID has ~100 fewer fields and 14 million fewer records than tblHSIDClean.

INDEXES: In this database I started building "cover indexes", where each index was GENERALLY still for a single field but would also include the PKID. In a couple of cases I built a cover index where a logical group of fields was contained in the index, along with the PKID. In these cover indexes I place the data field first in the index and then the PKID, on the theory that I would be doing WHERE clauses using this index to look for specific values in the data field.
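The "data field first, then PKID" layout can be sketched as follows (again sqlite3 standing in for SQL Server, with invented data; on SQL Server the equivalent would be an index on (HullType, PKID), or (HullType) INCLUDE (PKID)). The point of a cover index is that a query filtering on the data field and returning PKIDs can be answered entirely from the index, never touching the wide base table:

```python
import sqlite3

# Illustrative cover-index sketch, not the author's actual schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tblHSID (PKID INTEGER, HullType TEXT, NarrowIncomeBand TEXT);
CREATE INDEX ix_HullType ON tblHSID (HullType, PKID);
INSERT INTO tblHSID VALUES (1,'A','T'), (2,'F',NULL), (3,'A','T');
""")

# Fast COUNT and fast PKID-set retrieval for one field:
count = con.execute(
    "SELECT COUNT(*) FROM tblHSID WHERE HullType = 'A'").fetchone()[0]
pkids = [r[0] for r in con.execute(
    "SELECT PKID FROM tblHSID WHERE HullType = 'A' ORDER BY PKID")]
print(count, pkids)

# The planner confirms the index alone satisfies the query (no row lookup):
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT PKID FROM tblHSID WHERE HullType = 'A'"
).fetchall()
print(plan)  # detail text mentions a COVERING INDEX
```

The drawback the post goes on to describe still applies: each of the 584 fields needs its own such index before queries on that field get this benefit.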
In some cases the results are simple counts: how many records with HullType = 'A' (boaters with aluminum hulls) or NarrowIncomeBand = 'T' (people with incomes > 150K). So covering indexes SHOULD make the process of getting counts of specific data elements faster, and in particular should make the process of getting actual PKID sets for those data elements faster. One of the uses of the database is to get actual addresses for all of the people WHERE..., for which I need PKID sets in order to facilitate comparisons and joins, and also to get at the name / address data over in tblAZHSID.

tblHSIDVert: In the last few days I created the HSIDVert database with tblHSIDVert_001_083, tblHSIDVert_084_143, tblHSIDVert_144_233, etc. Each table contains data points from a specific set of fields. You can think of each table as a vertical partition of tblHSID, although of course it is more than that. In these tables I completely restructured the data in the manner I have discussed in the last few days.

I got a list of all of the remaining data fields (all of the fields in tblHSID), 584 fields total, and placed the field names into tblHSIDFldName with an autoincrement PKID which becomes the FldNameID. I then created tables named as above, with just three columns: FldNameID, FldValue, and PKID. PKID of course is the PKID from the records in tblHSID and tblHSIDClean, and simply tells me which RECORD back in tblHSID / tblHSIDClean each "data point record" comes from. FldNameID comes from the above-mentioned tblHSIDFldName and simply tells me which FIELD each data point comes from. FldValue is the data point or element from the field specified.

I then created a process that pulls ONLY the data points out of tblHSID, for each field in tblHSID, and places those data points into the tblHSIDVert tables. IOW I grabbed "just the facts, ma'am" and "stacked them" into a single column of data, with a pair of other fields to keep it all straight: the FldNameID and the PKID.
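The stacking step can be sketched like this (sqlite3 stand-in, a tiny invented field list and data, and a single tblHSIDVert table rather than the partitioned tblHSIDVert_* set): only non-NULL values are pulled from the wide table into the narrow (FldNameID, FldValue, PKID) shape.

```python
import sqlite3

# Illustrative "stacking" sketch: wide sparse table -> narrow data-point table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tblHSID (PKID INTEGER, HullType TEXT, NarrowIncomeBand TEXT);
INSERT INTO tblHSID VALUES (1,'A','T'), (2,NULL,'T'), (3,'A',NULL);

CREATE TABLE tblHSIDFldName (FldNameID INTEGER PRIMARY KEY, FldName TEXT);
INSERT INTO tblHSIDFldName VALUES (1,'HullType'), (2,'NarrowIncomeBand');

CREATE TABLE tblHSIDVert (FldNameID INTEGER, FldValue TEXT, PKID INTEGER);
""")

# One INSERT...SELECT per source field, keeping only rows where that field
# actually holds data -- "just the facts, ma'am":
for fld_id, fld_name in con.execute(
        "SELECT FldNameID, FldName FROM tblHSIDFldName").fetchall():
    con.execute(
        f"INSERT INTO tblHSIDVert "
        f"SELECT ?, {fld_name}, PKID FROM tblHSID WHERE {fld_name} IS NOT NULL",
        (fld_id,))

rows = con.execute(
    "SELECT FldNameID, FldValue, PKID FROM tblHSIDVert ORDER BY FldNameID, PKID"
).fetchall()
print(rows)
```

Notice the NULLs simply never make it across: the two empty cells in the wide table produce no rows, so the narrow table holds exactly four data points for six wide-table cells.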
INDEXES: In HSIDVert I created a single clustered index covering all three fields, in the order FldNameID, FldValue, PKID.

If you think about the concept, the "second generation" (tblHSID) is of course better than the "first generation" (tblHSIDClean) for three reasons: fewer records, fewer fields, and covering indexes. HOWEVER, I still have to create those covering indexes as the client asks for results using a specific field. I.e. each field has its own distinct covering index, and there are still 584 possible fields needing a cover index. That's a lot of indexes.

The "third generation" has some advantages in that ONLY the actual data is contained in the table (no empty fields), and EVERY field gets an index automatically, due to the fact that the table's clustered index (FldNameID, FldValue, PKID) is itself a cover index.

So which "schema" will win the race? Results in part two of this email.

John W. Colby
Colby Consulting
www.ColbyConsulting.com