John W. Colby
jwcolby at colbyconsulting.com
Sun Nov 7 21:50:06 CST 2004
I think I have figured out a strategy for the nVLDB - 64 million records, ~640 fields. The deal is that the table consists of sets of fields, typically 10-20 fields, where the "subject" of the field is the same, perhaps age, income, people in household, ethnicity, drugs taken, types of music listened to, types of books read etc. I have started breaking out these sets of fields into individual tables, with a PK field that is the PK of the main table. In other words I will now have a table with perhaps 12 fields that are yes / no / null listens to music type x, y, z. Another table with perhaps 20 fields with yes / no / null takes zocor, etc. etc. I will end up breaking the 640 fields down into perhaps 30 tables where each table holds a set of fields from the original table. This allows me to index each field which was not possible with the big table simply because any given table can only contain 240 indexes. The idea now is to be able to select a handful of these subtables, perhaps income, age, music, reading and autos, join them back using the PK, then do a where income >20k and <40K, age > 20 and <30, music = blues or rock, etc... Then do a count(pk) to get a count of people having those characteristics. Additionally I can now get a count on the values in each field, 2 million no, 250k yes, 60 million null "listens to rap" etc. Sound feasible? The issue now is that I need to do this stuff programmatically from Access or perhaps a web environment (eventually). IOW, I need to build up a query joining the income, age, music, reading and autos tables on PK, throw in the where(s), and get a count of PK. I need to do this dynamically from VB (or access), where I can let the user select tables to join (sets of related fields), hand back lists of fields to select where values from, and lists of values that are actually in those fields. I can do all of that stuff, where I need assistance is how to hand the finished SQL statement off to SQL Server and have it run the query, handing back the count. I can't have Access (Jet) brute force it simply because the tables each have 64 million records in them. Just getting the PKs back to JET would take days. So, how do I hand SQL Server a dynamically built SQL statement, ask it to evaluate the SQL statement and return a single count value? Eventually I also need to have SQL take that same where clause and hand me back a set of fields / records from the main name / address subtable for sending to the bulk mail client. I am considering doing this in an ADP environment, where I build and save the SQL as a UDF or something like that- not that I've ever done anything like that before. Or get out of Access entirely (though the data environment sure is nice) and use .NET or something. John W. Colby www.ColbyConsulting.com Contribute your unused CPU cycles to a good cause: http://folding.stanford.edu/