Tina Norris Fields
tinanfields at torchlake.com
Mon Dec 10 10:21:13 CST 2007
I'm interested. So after the crunch is done, please do come back and
tell us how it all worked out. BTW I just love watching how you think!
Tina
jwcolby wrote:
> OK, so here I am going to run some timings on various things, using all
> three databases. My client gave me an actual order to provide counts of the
> HSID data against the smoker data (a different table). From the HSID data I
> need to do selections:
>
> NarrowIncomeBand codes 1-5
> codes 6-8
> codes 9, A-D
> codes E-I
> codes J-S
> codes T
>
> Gender - Male Code 1
> Gender - Female Code 2
>
> YearModelOfBoat Code <> ''
>
> Pool Code 'Y'
>
> Etc Etc. The client is trying to understand the data in two different
> completely unrelated tables, the HSID file and the TFF smokers, as well as
> the HSID file and the SMK smokers.
>
> The order actually has me join PKID result sets to another pair of tables,
> the TFF smoker file and the SMK smoker file. IOW count:
>
> TFF smokers WHERE narrow income band 1-5
> SMK smokers WHERE narrow income band 1-5
>
> TFF Smokers WHERE Gender 1
> SMK Smokers WHERE Gender 1
>
> TFF Smokers WHERE Gender 2
> SMK Smokers WHERE Gender 2
>
> etc etc.
>
> The problem is that TFF and SMK smokers are unrelated to each other,
> and also unrelated to HSID. In order to get counts, I need to somehow get
> something in common, which will be the name / address data. That part of
> the equation is really too complex to go into here ("beyond the scope of
> this discussion") and so I am going to punt and ignore the join between HSID
> and any other file (smokers in this case) and instead simply discuss how I
> "prep" the HSID file to allow me to compare it later on, and then compare
> timings of those "prepped" queries.
>
> Create order structures:
>
> I typically create a directory structure - PSM (the client) \ Orders \
> OrderNum. In that directory on the disk I place the spreadsheet containing
> the order as well as any queries that I create that I want to save out as
> actual query files, any documents I get or create, and any CSVs or fixed
> width output files in the case of orders for names or counts. Counts
> generally go into spreadsheets directly.
>
> I also create a new database container to hold the views and temp tables
> required to fill that specific order, with the name of the order in the
> database name. Once I have a database container I...
>
> Create the Indexes:
>
> The next thing I have to do in the case of tblHSIDClean and tblHSID is to go
> in and make sure that I have indexes on the fields that I am going to do
> WHERE comparisons on. Sometimes they already exist from a previous order,
> sometimes I have to create the indexes because there isn't one already. In
> the case of tblHSIDClean I will create indexes on each field
> individually. In tblHSID I will create "cover indexes" of either the
> individual field AND the PKID, or if there are several RELATED fields in the
> WHERE clause I will create a cover index including all of the related fields
> AND the PKID field. Creating these indexes typically takes a few minutes
> per index, with the exact time varying by how many actual data points are in
> a specific field, as well as how many fields are in the cover index (in the
> case of tblHSID).
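>
> Just to make that concrete, here is the sort of thing I run (the column
> names are only illustrative of my naming, not necessarily the exact
> schema):
>
>     -- tblHSIDClean: plain single-field index on the WHERE column
>     CREATE INDEX IX_HSIDClean_NarrowIncomeBand
>         ON dbo.tblHSIDClean (NarrowIncomeBand);
>
>     -- tblHSID: "cover index" holding the WHERE field(s) plus the PKID,
>     -- so the selection never has to touch the base table at all
>     CREATE INDEX IX_HSID_NarrowIncomeBand_PKID
>         ON dbo.tblHSID (NarrowIncomeBand, PKID);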
>
> In the case of tblHSIDVert, I already have a single clustered index and that
> is all I need.
>
> Create views:
>
> Next I create views to pull each field with the PKID for each WHERE column.
> Strictly speaking individual queries are not required, i.e. I could create
> one mongo query (and I used to do so), but in the past anyway these mongo
> queries would take anywhere from 20 minutes to an hour to run. The problem
> was simply that if anything was not right I would have to make changes and
> rerun, leading to LOOOOONG times to get the thing done, though not
> necessarily lots of billable time. So I learned to create "field specific"
> views that would pull just the PKIDs required for one piece of the final
> result, and later join the pieces into a whole once the pieces pulled the
> right things.
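>
> A "field specific" view is nothing fancy - just the PKID plus the WHERE
> test for one piece of the order, along these lines (a sketch; the real
> field names and codes vary from order to order):
>
>     -- All PKIDs falling in the low narrow income bands
>     CREATE VIEW dbo.vHSID_IncomeBand_1_5 AS
>     SELECT PKID
>       FROM dbo.tblHSID
>      WHERE NarrowIncomeBand IN ('1', '2', '3', '4', '5');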
>
> With a query for each piece of the bigger WHERE clause, I would then start
> the process of combining the queries to AND and OR them into the larger
> "whole WHERE clause". Joins between the PKIDs form the ANDS. So if I need
> (for example) WHERE Income in('D', 'E', 'F') AND CasinoGambler = 'Y', I
> would create two queries, one to get all the PKIDs matching the income
> criteria, the other with all the PKIDs that match the CasinoGambler, and
> then join the PKIDs in a higher level query to AND the result set.
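>
> The AND falls out of an inner join on PKID between the two views,
> roughly like this (the view names are made up to match this example):
>
>     -- AND: only PKIDs present in BOTH pieces survive the inner join.
>     -- An OR of two pieces would be a UNION of the PKID sets instead.
>     SELECT inc.PKID
>       FROM dbo.vHSID_Income_D_F AS inc
>      INNER JOIN dbo.vHSID_CasinoGambler_Y AS cg
>         ON cg.PKID = inc.PKID;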
>
> Oftentimes I need to get COUNTS, so I would build a query on top of the
> queries that AND / OR the smaller queries, to COUNT() the PKIDs that make it
> through the subqueries.
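>
> The count is then just one more thin layer over that join, e.g.:
>
>     -- COUNT the PKIDs that make it through the ANDed pieces
>     SELECT COUNT(*) AS RecordCount
>       FROM dbo.vHSID_Income_D_F AS inc
>      INNER JOIN dbo.vHSID_CasinoGambler_Y AS cg
>         ON cg.PKID = inc.PKID;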
>
> If you have gotten this far then you understand how I go about constructing
> the stuff required to fill the order. The order might just be "counts" of
> how many people in an income range gamble, or how many people have dogs AND
> have cats, or how many people have children, are in a specific income range,
> and have cats. My client then knows how many "addresses" he can deliver for
> those criteria, and can go to his client with a proposal for supplying
> addresses. He might get the order, in which case he comes back to me and
> says, "now deliver 100K name / Address records in zip codes X, Y and Z".
> And of course this assumes he asked for counts in those zip codes, and that
> at least 100K records made it through the filters.
>
> That's the idea, and now I will present some timing results for some of the
> sub queries used in an actual order; this order is just for counts as
> discussed at the top of this email. I will construct the sub queries for
> all three different databases and time them. I will NOT take it to the
> point where I tie it into the smoker databases because that "is beyond the
> scope of this discussion".
>
> I am also going to bring AccessD into this because there might be folks on
> that list interested in this subject.
>
> OTOH maybe NOBODY except Arthur is interested! ;-)
>
> Next email, the results. It may be a day or so. I have to actually get
> this out TONIGHT so I will be doing whatever I have to do to get results. I
> will do timings tomorrow sometime.
>
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com
>
>