[dba-SQLServer] A step backwards

jwcolby jwcolby at colbyconsulting.com
Thu Dec 6 14:37:51 CST 2007


I am taking a stab at what Arthur has been harping on about, which is
turning my database from hell into a hellish normalized table with something
like PKID, FieldName, Value.  This essentially stacks every piece of data
into a single table of three fields (four if you include a table name
column).

I would normally be skeptical of such a schema however in this specific case
I have a table of 51 million rows with 600 fields where in come cases there
is data in fieldXYZ for 30 million records, but for field ABC only 5 million
rows.  As you can see, by doing it this way I pull out exactly the number of
values for a given field as there actual data values in the source table /
that field, and then place the value and field name into a new table.  Index
it and when I ask for fieldABC I only get back 5 million records.
Furthermore my table is only 3 fields wide instead of 600 fields wide.

However... the field names are things like
Forwarding_Address_Or_Nixie_Run_Codes, Number_of_Persons and
Second_Name_of_Opposite_Gender_Match_Indicator.  My issue now is that the
data storage for the FIELD NAMES becomes gigantic.  37 million records where
the DATA is a single character 0-9 or A-T (good) but the NAME is
Narrow_Income_Band (bad).

Thus I REAAAALY need to turn the field names into a lookup table so that the
nameID (integer) is used instead of a FieldName (varchar(50)).  Doing it
that way would then reduce my table to PKID (int), FldNameID (int) and
FldVal (VarChar(50)).  

Throwing a cover index on that would allow me to ask any question with
something like 

SELECT PKID from tblHSIDVertical
WHERE FldNameID = 13 and FldVal = 'A'

Once constructed the table would have exactly as many records as there were
actual data values in ALL of the fields, and there would be ~600 "sets" of
answers.  All queries would look like the above query.  Let's pick a number
out of the air and say that there are ON AVERAGE 10 million records with
values in any given field, that means that the resulting database would have
10 million * 600 records, BUT the records would only be three fields and
there would be exactly ONE index (a cover index).

Further I could do queries with ANDS and ORs much more simply (I think).  Of
course I would also have to do lookups to get the FieldNameID but it seems
like it should work.

Arthur has been hammering me that this is the "right way" to do this, and I
kind of agree.  I am currently doing table scans in many cases (across 50
million records) because there are only two values in many of the columns,
'Y' and '' (true or no value).

Your thoughts on this?

John W. Colby
Colby Consulting
www.ColbyConsulting.com 




More information about the dba-SQLServer mailing list