jwcolby
jwcolby at colbyconsulting.com
Thu Dec 6 14:37:51 CST 2007
I am taking a stab at what Arthur has been harping on about, which is turning my database from hell into a hellish normalized table with something like PKID, FieldName, Value. This essentially stacks every piece of data into a single table of three fields (four if you include a table name column).

I would normally be skeptical of such a schema, however in this specific case I have a table of 51 million rows with 600 fields where, in some cases, there is data in field XYZ for 30 million records but in field ABC for only 5 million rows. As you can see, by doing it this way I pull out exactly as many values for a given field as there are actual data values in that field of the source table, and then place the value and field name into a new table. Index it, and when I ask for field ABC I only get back 5 million records. Furthermore my table is only 3 fields wide instead of 600 fields wide.

However... the field names are things like Forwarding_Address_Or_Nixie_Run_Codes, Number_of_Persons and Second_Name_of_Opposite_Gender_Match_Indicator. My issue now is that the data storage for the FIELD NAMES becomes gigantic: 37 million records where the DATA is a single character 0-9 or A-T (good) but the NAME is Narrow_Income_Band (bad). Thus I REALLY need to turn the field names into a lookup table so that a NameID (integer) is used instead of a FieldName (varchar(50)). Doing it that way would reduce my table to PKID (int), FldNameID (int) and FldVal (varchar(50)). Throwing a cover index on that would allow me to ask any question with something like:

SELECT PKID FROM tblHSIDVertical WHERE FldNameID = 13 AND FldVal = 'A'

Once constructed, the table would have exactly as many records as there were actual data values in ALL of the fields, and there would be ~600 "sets" of answers. All queries would look like the above query.
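A minimal sketch of the design described above, using Python's built-in SQLite driver so it is runnable anywhere. The table and column names (tblHSIDVertical, FldNameID, FldVal) come from the post; the lookup table name, field IDs, and the tiny sample data are invented for illustration only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Lookup table: each long field name is stored exactly once and is
# referenced by an integer ID from the vertical table.
cur.execute("CREATE TABLE tblFieldName "
            "(FldNameID INTEGER PRIMARY KEY, FldName TEXT UNIQUE)")

# Vertical table: one row per (record, field) pair that actually has data,
# so sparsely populated fields cost nothing for the empty records.
cur.execute("CREATE TABLE tblHSIDVertical "
            "(PKID INTEGER, FldNameID INTEGER, FldVal TEXT)")

# Cover index: a query that filters on (FldNameID, FldVal) and returns
# PKID can be satisfied entirely from this index.
cur.execute("CREATE INDEX ixCover ON tblHSIDVertical (FldNameID, FldVal, PKID)")

for name in ("Narrow_Income_Band", "Number_of_Persons"):
    cur.execute("INSERT INTO tblFieldName (FldName) VALUES (?)", (name,))

def fld_id(name):
    """Translate a field name into its integer ID via the lookup table."""
    return cur.execute("SELECT FldNameID FROM tblFieldName WHERE FldName = ?",
                       (name,)).fetchone()[0]

# Invented sample data: three records have a Narrow_Income_Band value,
# only one has a Number_of_Persons value.
cur.executemany("INSERT INTO tblHSIDVertical VALUES (?, ?, ?)", [
    (1, fld_id("Narrow_Income_Band"), "A"),
    (2, fld_id("Narrow_Income_Band"), "B"),
    (3, fld_id("Narrow_Income_Band"), "A"),
    (1, fld_id("Number_of_Persons"), "2"),
])

# The query pattern from the post: all PKIDs where a given field = 'A'.
pkids = [r[0] for r in cur.execute(
    "SELECT PKID FROM tblHSIDVertical WHERE FldNameID = ? AND FldVal = 'A'",
    (fld_id("Narrow_Income_Band"),))]
print(sorted(pkids))  # [1, 3]
```

The point of the cover index is that every query in the system has the same shape, so a single composite index on (FldNameID, FldVal, PKID) serves all ~600 "sets" of answers.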
Let's pick a number out of the air and say that there are ON AVERAGE 10 million records with values in any given field. That means the resulting database would have 10 million * 600 = ~6 billion records, BUT the records would only be three fields wide and there would be exactly ONE index (a cover index). Further, I could do queries with ANDs and ORs much more simply (I think). Of course I would also have to do lookups to get the FldNameID, but it seems like it should work.

Arthur has been hammering me that this is the "right way" to do this, and I kind of agree. I am currently doing table scans in many cases (across 50 million records) because there are only two values in many of the columns, 'Y' and '' (true or no value).

Your thoughts on this?

John W. Colby
Colby Consulting
www.ColbyConsulting.com
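To make the "ANDs and ORs much more simply" point concrete: in this vertical layout each per-field predicate produces a set of PKIDs, and set operators combine them. A hedged sketch follows, again using SQLite via Python; the field IDs (13, 27) and sample rows are invented, and note how the sparse 'Y'/'' columns mentioned above are represented by simply not storing rows for the empty records.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tblHSIDVertical "
            "(PKID INTEGER, FldNameID INTEGER, FldVal TEXT)")
con.execute("CREATE INDEX ixCover ON tblHSIDVertical (FldNameID, FldVal, PKID)")

# Invented data: field 13 has values for records 1-3; field 27 is one of
# the sparse 'Y'-or-empty columns, so only the 'Y' records get a row.
con.executemany("INSERT INTO tblHSIDVertical VALUES (?, ?, ?)", [
    (1, 13, "A"), (2, 13, "A"), (3, 13, "B"),
    (1, 27, "Y"), (3, 27, "Y"),
])

# AND: records where field 13 = 'A' AND field 27 = 'Y', as a set
# intersection of two index-only lookups.
and_ids = [r[0] for r in con.execute("""
    SELECT PKID FROM tblHSIDVertical WHERE FldNameID = 13 AND FldVal = 'A'
    INTERSECT
    SELECT PKID FROM tblHSIDVertical WHERE FldNameID = 27 AND FldVal = 'Y'
""")]
print(and_ids)  # [1]

# OR: records matching either predicate, as a set union.
or_ids = [r[0] for r in con.execute("""
    SELECT PKID FROM tblHSIDVertical WHERE FldNameID = 13 AND FldVal = 'A'
    UNION
    SELECT PKID FROM tblHSIDVertical WHERE FldNameID = 27 AND FldVal = 'Y'
""")]
print(sorted(or_ids))  # [1, 2, 3]
```

Each branch of the INTERSECT/UNION is the same narrow index seek as a single-field query, which is exactly the behavior the cover index is meant to guarantee.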