Tina Norris Fields
tinanfields at torchlake.com
Mon Dec 10 10:21:13 CST 2007
I'm interested. So after the crunch is done, please do come back and tell
us how it all worked out. BTW, I just love watching how you think!

Tina

jwcolby wrote:
> OK, so here I am going to run some timings on various things, using all
> three databases. My client gave me an actual order to provide counts of
> the HSID data against the smoker data (a different table). From the HSID
> data I need to do selections:
>
> NarrowIncomeBand  codes 1-5
>                   codes 6-8
>                   codes 9, A-D
>                   codes E-I
>                   codes J-S
>                   codes T
>
> Gender - Male     Code 1
> Gender - Female   Code 2
>
> YearModelOfBoat   Code <> ''
>
> Pool              Code 'Y'
>
> Etc., etc. The client is trying to understand the data in two completely
> unrelated tables: the HSID file and the TFF smokers, as well as the HSID
> file and the SMK smokers.
>
> The order actually has me join PKID result sets to another pair of
> tables, the TFF smoker file and the SMK smoker file. IOW, count:
>
> TFF smokers WHERE narrow income band 1-5
> SMK smokers WHERE narrow income band 1-5
>
> TFF smokers WHERE Gender 1
> SMK smokers WHERE Gender 1
>
> TFF smokers WHERE Gender 2
> SMK smokers WHERE Gender 2
>
> etc., etc.
>
> The problem is that TFF and SMK smokers are unrelated to each other, and
> also unrelated to HSID. In order to get counts, I need to somehow get
> something in common, which will be the name / address data. That part of
> the equation is really too complex to go into here ("beyond the scope of
> this discussion"), so I am going to punt and ignore the join between HSID
> and any other file (smokers in this case) and instead simply discuss how
> I "prep" the HSID file to allow me to compare it later on, and then
> compare timings of those "prepped" queries.
>
> Create order structures:
>
> I typically create a directory structure - PSM (the client) \ Orders \
> OrderNum. In that directory on the disk I place the spreadsheet
> containing the order, as well as any queries that I create that I want to
> save out as actual query files, any documents I get or create, and any
> CSV or fixed-width output files in the case of orders for names or
> counts. Counts generally go into spreadsheets directly.
>
> I also create a new database container to hold the views and temp tables
> required to fill that specific order, with the name of the order in the
> database name. Once I have a database container I...
>
> Create the Indexes:
>
> The next thing I have to do, in the case of tblHSIDClean and tblHSID, is
> to go in and make sure that I have indexes on the fields that I am going
> to do WHERE comparisons on. Sometimes they already exist from a previous
> order; sometimes I have to create the indexes because there isn't one
> already. In the case of tblHSIDClean I will create indexes on each field
> individually. In tblHSID I will create "cover indexes" on either the
> individual field AND the PKID, or, if there are several RELATED fields in
> the WHERE clause, a cover index including all of the related fields AND
> the PKID field. Creating these indexes typically takes a few minutes per
> index, with the exact time varying by how many actual data points are in
> a specific field, as well as how many fields are in the cover index (in
> the case of tblHSID).
>
> In the case of tblHSIDVert, I already have a single clustered index and
> that is all I need.
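A cover index of the kind described above might be created with something
like the following. This is only a sketch, assuming SQL Server (T-SQL)
syntax; the index names, the dbo schema, and the pairing of fields in the
second index are illustrative assumptions, not taken from the actual HSID
database:

    -- A "cover index" on one WHERE field plus the PKID, so the later
    -- count/join queries can be answered from the index alone.
    CREATE INDEX IX_tblHSID_NarrowIncomeBand
        ON dbo.tblHSID (NarrowIncomeBand, PKID);

    -- The multi-field variant: several related WHERE fields together,
    -- again carrying the PKID along (the pairing here is illustrative).
    CREATE INDEX IX_tblHSID_Boat_Pool
        ON dbo.tblHSID (YearModelOfBoat, Pool, PKID);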
> Create views:
>
> Next I create views to pull each field with the PKID for each WHERE
> column. Strictly speaking, individual queries are not required, i.e. I
> could create one mongo query (and I used to do so), but in the past
> anyway these mongo queries would take anywhere from 20 minutes to an hour
> to run. The problem was simply that if anything was not right I would
> have to make changes and rerun, leading to LOOOOONG times to get the
> thing done, though not necessarily lots of billable time. So I learned to
> create "field specific" views that would pull just the PKIDs required for
> one piece of the final result, and later join the pieces into a whole
> once the pieces pulled the right things.
>
> With a query for each piece of the bigger WHERE clause, I would then
> start the process of combining the queries to AND and OR them into the
> larger "whole WHERE clause". Joins between the PKIDs form the ANDs. So if
> I need (for example) WHERE Income IN ('D', 'E', 'F') AND CasinoGambler =
> 'Y', I would create two queries, one to get all the PKIDs matching the
> income criteria, the other with all the PKIDs that match the
> CasinoGambler criteria, and then join the PKIDs in a higher-level query
> to AND the result sets.
>
> Oftentimes I need to get COUNTS, so I would build a query on top of the
> queries that AND / OR the smaller queries, to COUNT() the PKIDs that make
> it through the subqueries.
>
> If you have gotten this far then you understand how I go about
> constructing the stuff required to fill the order. The order might just
> be "counts" of how many people in an income range gamble, or how many
> people have dogs AND have cats, or how many people have children, are in
> a specific income range, and have cats. My client then knows how many
> "addresses" he can deliver for those criteria, and can go to his client
> with a proposal for supplying addresses. He might get the order, in which
> case he comes back to me and says, "now deliver 100K name / address
> records in zip codes X, Y and Z". And of course this assumes he asked for
> counts in those zip codes, and that at least 100K records made it through
> the filters.
>
> That's the idea, and now I will present some timing results for some of
> the sub queries used in an actual order; this order is just for counts,
> as discussed at the top of this email. I will construct the sub queries
> for all three different databases and time them. I will NOT take it to
> the point where I tie it into the smoker databases because that "is
> beyond the scope of this discussion".
>
> I am also going to bring AccessD into this because there might be folks
> on that list interested in this subject.
>
> OTOH maybe NOBODY except Arthur is interested! ;-)
>
> Next email, the results. It may be a day or so. I have to actually get
> this out TONIGHT, so I will be doing whatever I have to do to get
> results. I will do timings tomorrow sometime.
>
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com
>
>
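To put a concrete shape on the view-and-join pattern described in the
message above, here is a minimal T-SQL sketch. The view names, aliases,
and the dbo schema are illustrative assumptions, not objects from the
actual order database:

    -- Field-specific views: each pulls only the PKIDs that satisfy one
    -- piece of the overall WHERE clause.
    CREATE VIEW dbo.vwIncomeDEF AS
        SELECT PKID FROM dbo.tblHSID WHERE Income IN ('D', 'E', 'F');
    GO
    CREATE VIEW dbo.vwCasinoGambler AS
        SELECT PKID FROM dbo.tblHSID WHERE CasinoGambler = 'Y';
    GO
    -- Joining the PKID sets ANDs the two criteria; a COUNT() sits on top.
    -- (An OR of two pieces could be a UNION of the PKID sets instead.)
    SELECT COUNT(*) AS MatchingRecords
    FROM dbo.vwIncomeDEF AS i
    INNER JOIN dbo.vwCasinoGambler AS g
        ON g.PKID = i.PKID;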