jwcolby
jwcolby at colbyconsulting.com
Sat Dec 8 14:20:51 CST 2007
In converting the database from hell to a "vertical" table, I have processed 36 fields out of 684 fields from a table with 51.157 million records. It has found 573,397,612 data elements in those 36 fields - 15.93 million data points per field on average - and has consumed 12.246 GBytes of data space (no indexes yet), about 340 megs per field. It is taking roughly 300 seconds per field to process at this point, though that bounces around a lot depending on how many data elements it finds in a field.

Assuming that the statistics so far hold true over the rest of the data (not a real good assumption) and that the insert time per element remains the same (a horrid assumption), it will take (684 - 36) * 5 minutes to finish the conversion - 3240 minutes, or 54 hours. It will also take 340 megs * 684 fields of data space - about 232 gigs. The table when finished would hold 684 * 15.93 million data points - 10896.12 million, roughly 10.9 billion data points.

The point of this exercise is to build a table with three fields: the FieldID from a field lookup table, the PKID of the record the data point is found in, and the data itself. I will have converted a table containing 50 million records but 684 fields into a table with 11 billion records but 3 fields. Obviously a table scan is being used on the wide table most of the time; I cannot realistically index 684 fields, and even if I did, many (most) of the fields hold a single character, 'Y'.

Once the table is built I will build a single index on it, a clustered index with the fields arranged FieldID, Data, PKID, on the theory that most operations will be a "where" on a field (thus the FieldID first) where the data is a certain value (thus the Data next), returning only the PKID. If I understand things (a HUGE assumption), this clustered index would be optimized for that kind of search. So in the case of the "vertical" table, an "index scan" will be performed, BUT all the records for a given field will be contiguous, BUT it will have to scan (worst case) 11 billion records, BUT it will only have to read three fields consisting of two integers and a varchar(100).

What would be interesting, but probably not feasible, would be to analyze which fields my client uses the most and place them first in the table. That would guarantee that the most used fields occur earliest in the search, and THAT would make an enormous difference, I think. Then if there are only 5 million data points in a field, and they are contiguous, and they are at the front of the table...

My good friend Arthur assures me that the new table will on average win the race. We shall see. Does anyone have a gut feeling for the results?

John W. Colby
Colby Consulting
www.ColbyConsulting.com
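
P.S. For the curious, here is roughly the shape of the vertical table and the clustered index described above. This is only a sketch - the table and column names (dbo.VerticalData, FieldID, PKID, Data) and the example FieldID value are placeholders I picked for illustration, not my actual schema.

-- Vertical (field / record / value) table: one row per non-empty data element
CREATE TABLE dbo.VerticalData (
    FieldID int          NOT NULL,   -- ID from the field lookup table
    PKID    int          NOT NULL,   -- PK of the source record in the wide table
    Data    varchar(100) NOT NULL    -- the data element itself
);

-- Single clustered index, ordered FieldID, Data, PKID, so all rows for a
-- given field (and a given value within that field) are stored contiguously
CREATE CLUSTERED INDEX CIX_VerticalData
    ON dbo.VerticalData (FieldID, Data, PKID);

-- Typical query shape: "where" on one field for a given value, returning PKIDs
SELECT PKID
FROM dbo.VerticalData
WHERE FieldID = 42       -- placeholder field ID
  AND Data = 'Y';

With the clustered index ordered that way, all the rows for a given FieldID (and a given Data value within it) sit together on disk, which is the contiguity I am counting on.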