jwcolby
jwcolby at colbyconsulting.com
Sun Dec 9 20:32:31 CST 2007
OK, so here I am going to run some timings on various things, using all three databases. My client gave me an actual order to provide counts of the HSID data against the smoker data (a different table). From the HSID data I need to do selections:

- NarrowIncomeBand: codes 1-5, codes 6-8, codes 9, A-D, codes E-I, codes J-S, code T
- Gender - Male: code 1
- Gender - Female: code 2
- YearModelOfBoat: code <> ''
- Pool: code 'Y'
- Etc., etc.

The client is trying to understand the data in two completely unrelated tables: the HSID file and the TFF smokers, as well as the HSID file and the SMK smokers. The order actually has me join PKID result sets to another pair of tables, the TFF smoker file and the SMK smoker file. IOW, count:

- TFF smokers WHERE narrow income band 1-5
- SMK smokers WHERE narrow income band 1-5
- TFF smokers WHERE Gender 1
- SMK smokers WHERE Gender 1
- TFF smokers WHERE Gender 2
- SMK smokers WHERE Gender 2
- etc., etc.

The problem is that TFF and SMK smokers are unrelated between themselves, and also unrelated to HSID. In order to get counts, I need to somehow get something in common, which will be the name / address data. That part of the equation is really too complex to go into here ("beyond the scope of this discussion"), so I am going to punt and ignore the join between HSID and any other file (the smokers in this case) and instead simply discuss how I "prep" the HSID file to allow me to compare it later on, and then compare timings of those "prepped" queries.

Create order structures:

I typically create a directory structure - PSM (the client) \ Orders \ OrderNum. In that directory on the disk I place the spreadsheet containing the order, as well as any queries that I create and want to save out as actual query files, any documents I get or create, and any CSVs or fixed-width output files in the case of orders for names or counts. Counts generally go into spreadsheets directly.
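To make the selections concrete, here is a rough sketch of the kind of per-criterion SELECT involved. The table and column names (tblHSID, PKID, NarrowIncomeBand, Gender) are taken from the discussion above, but the literal code values are my shorthand for the bands listed, not the actual schema:

```sql
-- A sketch only: pull the PKIDs matching one income band.
-- Table/column names follow the text; actual code values may differ.
SELECT PKID
FROM tblHSID
WHERE NarrowIncomeBand IN ('1', '2', '3', '4', '5');

-- And the PKIDs matching one gender code:
SELECT PKID
FROM tblHSID
WHERE Gender = '1';  -- Male
```

Each such query yields nothing but a set of PKIDs, which is what makes the later AND/OR combining cheap.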
I also create a new database container to hold the views and temp tables required to fill that specific order, with the name of the order in the database name. Once I have a database container I...

Create the Indexes:

The next thing I have to do, in the case of tblHSIDClean and tblHSID, is to go in and make sure that I have indexes on the fields that I am going to do WHERE comparisons on. Sometimes they already exist from a previous order; sometimes I have to create them because there isn't one already. In the case of tblHSIDClean I create an index on each field individually. In tblHSID I create "cover indexes" of either the individual field AND the PKID, or, if there are several RELATED fields in the WHERE clause, a cover index including all of the related fields AND the PKID field. Creating these indexes typically takes a few minutes per index, with the exact time varying by how many actual data points are in a specific field, as well as how many fields are in the cover index (in the case of tblHSID). In the case of tblHSIDVert, I already have a single clustered index and that is all I need.

Create views:

Next I create views to pull each field with the PKID for each WHERE column. Strictly speaking, individual queries are not required, i.e. I could create one mongo query (and I used to do so), but in the past these mongo queries would take anywhere from 20 minutes to an hour to run. The problem was simply that if anything was not right I would have to make changes and rerun, leading to LOOOOONG times to get the thing done, though not necessarily lots of billable time. So I learned to create "field specific" views that would pull just the PKIDs required for one piece of the final result, and later join the pieces into a whole once each piece pulled the right things.
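The index and view prep described above might look roughly like the following. This is a hedged sketch, not the actual DDL: the index and view names are hypothetical, and only the table and column names come from the text.

```sql
-- tblHSIDClean: a plain index on each WHERE field individually.
CREATE INDEX IX_HSIDClean_NarrowIncomeBand
    ON tblHSIDClean (NarrowIncomeBand);

-- tblHSID: a "cover index" of the field plus the PKID, so that a
-- PKID-only select can be answered from the index alone.
CREATE INDEX IX_HSID_NarrowIncomeBand_PKID
    ON tblHSID (NarrowIncomeBand, PKID);

-- Or, for several RELATED fields used together in one WHERE clause,
-- one cover index with all of them plus the PKID:
CREATE INDEX IX_HSID_Boat_Pool_PKID
    ON tblHSID (YearModelOfBoat, Pool, PKID);

-- A "field specific" view pulling just the PKIDs for one piece
-- of the eventual WHERE clause:
CREATE VIEW vHSID_Income1to5 AS
SELECT PKID
FROM tblHSID
WHERE NarrowIncomeBand IN ('1', '2', '3', '4', '5');
```

On SQL Server 2005 and later, the non-key column could also go in an INCLUDE clause rather than the index key; either way the point is that the index covers the query.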
With a query for each piece of the bigger WHERE clause, I then start the process of combining the queries, ANDing and ORing them into the larger "whole WHERE clause". Joins between the PKIDs form the ANDs. So if I need (for example) WHERE Income IN ('D', 'E', 'F') AND CasinoGambler = 'Y', I create two queries, one to get all the PKIDs matching the income criteria, the other to get all the PKIDs matching the CasinoGambler criteria, and then join the PKIDs in a higher-level query to AND the result sets. Oftentimes I need to get COUNTS, so I build a query on top of the queries that AND / OR the smaller queries, to COUNT() the PKIDs that make it through the subqueries.

If you have gotten this far then you understand how I go about constructing the stuff required to fill the order. The order might just be "counts" of how many people in an income range gamble, or how many people have dogs AND have cats, or how many people have children, are in a specific income range, and have cats. My client then knows how many "addresses" he can deliver for those criteria, and can go to his client with a proposal for supplying addresses. He might get the order, in which case he comes back to me and says, "now deliver 100K name / address records in zip codes X, Y and Z". Of course this assumes he asked for counts in those zip codes, and that at least 100K records made it through the filters.

That's the idea, and now I will present some timing results for some of the sub queries used in an actual order; this order is just for counts, as discussed at the top of this email. I will construct the sub queries for all three different databases and time them. I will NOT take it to the point where I tie it into the smoker databases, because that "is beyond the scope of this discussion". I am also going to bring AccessD into this because there might be folks on that list interested in this subject. OTOH maybe NOBODY except Arthur is interested! ;-)

Next email, the results.
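The combining step above can be sketched like so. The Income / CasinoGambler example comes straight from the text; the view names themselves are hypothetical:

```sql
-- One "field specific" view per piece of the WHERE clause:
CREATE VIEW vHSID_IncomeDEF AS
SELECT PKID FROM tblHSID WHERE Income IN ('D', 'E', 'F');

CREATE VIEW vHSID_CasinoGambler AS
SELECT PKID FROM tblHSID WHERE CasinoGambler = 'Y';

-- The join between the PKIDs forms the AND:
CREATE VIEW vHSID_IncomeDEF_AND_Gambler AS
SELECT i.PKID
FROM vHSID_IncomeDEF AS i
INNER JOIN vHSID_CasinoGambler AS g
        ON g.PKID = i.PKID;

-- And the count query built on top of the combined view:
SELECT COUNT(*) AS Cnt
FROM vHSID_IncomeDEF_AND_Gambler;
```

An OR, by the same logic, would be a UNION of the PKID views rather than a join.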
It may be a day or so. I have to actually get this out TONIGHT, so I will be doing whatever I have to do to get results. I will do the timings sometime tomorrow.

John W. Colby
Colby Consulting
www.ColbyConsulting.com