jwcolby
jwcolby at colbyconsulting.com
Sun Dec 9 20:32:31 CST 2007
OK, so here I am going to run some timings on various things, using all three databases. My client gave me an actual order to provide counts of the HSID data against the smoker data (a different table). From the HSID data I need to do selections:

- NarrowIncomeBand: codes 1-5, codes 6-8, codes 9, A-D, codes E-I, codes J-S, code T
- Gender - Male: code 1
- Gender - Female: code 2
- YearModelOfBoat: code <> ''
- Pool: code 'Y'
- Etc., etc.

The client is trying to understand the data in two completely unrelated tables: the HSID file and the TFF smokers, as well as the HSID file and the SMK smokers. The order actually has me join PKID result sets to another pair of tables, the TFF smoker file and the SMK smoker file. IOW, count:

- TFF smokers WHERE narrow income band 1-5
- SMK smokers WHERE narrow income band 1-5
- TFF smokers WHERE Gender 1
- SMK smokers WHERE Gender 1
- TFF smokers WHERE Gender 2
- SMK smokers WHERE Gender 2
- etc., etc.

The problem is that TFF and SMK smokers are unrelated between themselves, and also unrelated to HSID. In order to get counts, I need to somehow get something in common, which will be the name / address data. That part of the equation is really too complex to go into here ("beyond the scope of this discussion"), so I am going to punt and ignore the join between HSID and any other file (the smokers in this case) and instead simply discuss how I "prep" the HSID file to allow me to compare it later on, and then compare timings of those "prepped" queries.

Create order structures:

I typically create a directory structure - PSM (the client) \ Orders \ OrderNum. In that directory on the disk I place the spreadsheet containing the order, as well as any queries that I create and want to save out as actual query files, any documents I get or create, and any CSVs or fixed-width output files in the case of orders for names or counts. Counts generally go into spreadsheets directly.
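To make the selections concrete, here is a rough sketch of the kind of per-criterion SELECT involved. The table and column names (tblHSID, PKID, NarrowIncomeBand, Gender) are taken from the discussion above, but the literal code values are my shorthand for the bands listed, not the actual schema:

```sql
-- A sketch only: pull the PKIDs matching one income band.
-- Table/column names follow the text; actual code values may differ.
SELECT PKID
FROM tblHSID
WHERE NarrowIncomeBand IN ('1', '2', '3', '4', '5');

-- And the PKIDs matching one gender code:
SELECT PKID
FROM tblHSID
WHERE Gender = '1';  -- Male
```

Each such query yields nothing but a set of PKIDs, which is what makes the later AND/OR combining cheap.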
I also create a new database container to hold the views and temp tables required to fill that specific order, with the name of the order in the database name. Once I have a database container I...

Create the Indexes:

The next thing I have to do, in the case of tblHSIDClean and tblHSID, is to go in and make sure that I have indexes on the fields that I am going to do WHERE comparisons on. Sometimes they already exist from a previous order; sometimes I have to create them because there isn't one already. In the case of tblHSIDClean I create an index on each field individually. In tblHSID I create "cover indexes" of either the individual field AND the PKID, or, if there are several RELATED fields in the WHERE clause, a cover index including all of the related fields AND the PKID field. Creating these indexes typically takes a few minutes per index, with the exact time varying by how many actual data points are in a specific field, as well as how many fields are in the cover index (in the case of tblHSID). In the case of tblHSIDVert, I already have a single clustered index and that is all I need.

Create views:

Next I create views to pull each field with the PKID for each WHERE column. Strictly speaking, individual queries are not required, i.e. I could create one mongo query (and I used to do so), but in the past these mongo queries would take anywhere from 20 minutes to an hour to run. The problem was simply that if anything was not right I would have to make changes and rerun, leading to LOOOOONG times to get the thing done, though not necessarily lots of billable time. So I learned to create "field specific" views that would pull just the PKIDs required for one piece of the final result, and later join the pieces into a whole once each piece pulled the right things.
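The index and view prep described above might look roughly like the following. This is a hedged sketch, not the actual DDL: the index and view names are hypothetical, and only the table and column names come from the text.

```sql
-- tblHSIDClean: a plain index on each WHERE field individually.
CREATE INDEX IX_HSIDClean_NarrowIncomeBand
    ON tblHSIDClean (NarrowIncomeBand);

-- tblHSID: a "cover index" of the field plus the PKID, so that a
-- PKID-only select can be answered from the index alone.
CREATE INDEX IX_HSID_NarrowIncomeBand_PKID
    ON tblHSID (NarrowIncomeBand, PKID);

-- Or, for several RELATED fields used together in one WHERE clause,
-- one cover index with all of them plus the PKID:
CREATE INDEX IX_HSID_Boat_Pool_PKID
    ON tblHSID (YearModelOfBoat, Pool, PKID);

-- A "field specific" view pulling just the PKIDs for one piece
-- of the eventual WHERE clause:
CREATE VIEW vHSID_Income1to5 AS
SELECT PKID
FROM tblHSID
WHERE NarrowIncomeBand IN ('1', '2', '3', '4', '5');
```

On SQL Server 2005 and later, the non-key column could also go in an INCLUDE clause rather than the index key; either way the point is that the index covers the query.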
With a query for each piece of the bigger WHERE clause, I then start the process of combining the queries, ANDing and ORing them into the larger "whole WHERE clause". Joins between the PKIDs form the ANDs. So if I need (for example) WHERE Income IN ('D', 'E', 'F') AND CasinoGambler = 'Y', I create two queries, one to get all the PKIDs matching the income criteria, the other to get all the PKIDs matching the CasinoGambler criteria, and then join the PKIDs in a higher-level query to AND the result sets. Oftentimes I need to get COUNTS, so I build a query on top of the queries that AND / OR the smaller queries, to COUNT() the PKIDs that make it through the subqueries.

If you have gotten this far then you understand how I go about constructing the stuff required to fill the order. The order might just be "counts" of how many people in an income range gamble, or how many people have dogs AND have cats, or how many people have children, are in a specific income range, and have cats. My client then knows how many "addresses" he can deliver for those criteria, and can go to his client with a proposal for supplying addresses. He might get the order, in which case he comes back to me and says, "now deliver 100K name / address records in zip codes X, Y and Z". Of course this assumes he asked for counts in those zip codes, and that at least 100K records made it through the filters.

That's the idea, and now I will present some timing results for some of the sub queries used in an actual order; this order is just for counts, as discussed at the top of this email. I will construct the sub queries for all three different databases and time them. I will NOT take it to the point where I tie it into the smoker databases, because that "is beyond the scope of this discussion". I am also going to bring AccessD into this because there might be folks on that list interested in this subject. OTOH maybe NOBODY except Arthur is interested! ;-)

Next email, the results.
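The combining step above can be sketched like so. The Income / CasinoGambler example comes straight from the text; the view names themselves are hypothetical:

```sql
-- One "field specific" view per piece of the WHERE clause:
CREATE VIEW vHSID_IncomeDEF AS
SELECT PKID FROM tblHSID WHERE Income IN ('D', 'E', 'F');

CREATE VIEW vHSID_CasinoGambler AS
SELECT PKID FROM tblHSID WHERE CasinoGambler = 'Y';

-- The join between the PKIDs forms the AND:
CREATE VIEW vHSID_IncomeDEF_AND_Gambler AS
SELECT i.PKID
FROM vHSID_IncomeDEF AS i
INNER JOIN vHSID_CasinoGambler AS g
        ON g.PKID = i.PKID;

-- And the count query built on top of the combined view:
SELECT COUNT(*) AS Cnt
FROM vHSID_IncomeDEF_AND_Gambler;
```

An OR, by the same logic, would be a UNION of the PKID views rather than a join.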
It may be a day or so. I have to actually get this out TONIGHT, so I will be doing whatever I have to do to get results. I will do the timings sometime tomorrow.

John W. Colby
Colby Consulting
www.ColbyConsulting.com