artful at rogers.com
Thu Feb 15 21:21:26 CST 2007
Blame it all on me, JC. I told you how to do it and you resist. That's OK; I have been resisted before, LOL. But I'm right. You normalize the columns, and admittedly you end up with 600M rows, but they are indexable and your speed goes up 1000 times. Your queries reduce to something like "Smoker = Y and Cialis = Y and Divorced = Soon". Add to that the fact that the retrieved rows all automatically point to the people in question. I don't see the point of your objection. This seems to me entirely straightforward, and 600M rows is child's play, assuming that everything is stored and indexed correctly. I can do an indexed search on 600M rows in less than a second. I've got the data to prove it.

Arthur Fuller
Technical Writer, Data Modeler, SQL Sensei
Artful Databases Organization
www.artfulsoftware.com

----- Original Message ----
From: JWColby <jwcolby at colbyconsulting.com>
To: dba-sqlserver at databaseadvisors.com
Sent: Thursday, February 15, 2007 3:04:55 PM
Subject: Re: [dba-SQLServer] John Colby's problem

OK, a little background. The client sells lists of names to bulk mailers (paper bulk mail). The client gets lists (what I call surveys, because that is a more accurate description), which contain name/address information and then (usually) survey information such as "What brand of cigarette do you smoke?" or "What types of computers do you use?" etc. These surveys can be about ANYTHING that my client finds useful enough to purchase the lists for. My job is to somehow make sense of these.

I take the lists and pull them into SQL Server. I then immediately send the lists out for address-validation processing. "Sending out" really means that I process the names and send them through an address-processing system running on my machine, which can reach out over the internet to get certain parts of the processing done (NCOA specifically). At the end of the day, I then have a list of names which have been validated. Some of the names/addresses fail validation and I purge them.
All the names which pass validation I keep. Passing means that the ADDRESS is actually deliverable, and MAYBE the person actually lives at that address. They PROBABLY lived at that address at some time in the past (when was the survey taken?). I then have to build a master table of addresses, people, and people-who-live-at-addresses. I also have to tie the SURVEY (remember the survey questions) back in to the names/addresses.

Let's take an example. My client buys a list of 80 million names/addresses from a mortgage bank. It has names / addresses / income information / property information (info about the property purchased). I import it into SQL Server and export it back out for address validation. In THIS case, likely an EXTREMELY high percentage of the names/addresses are valid, since a mortgage company used the info to process mortgage applications. Anyway, as you can see, I now have two distinct "sets" of data: the person / address / PersonAddress set, and the information about the property (and of course also about the person's income etc.).

Now, my job is to tie all this back in to a master database of names and addresses. For example, if a person purchased a property, they now live there; I know their income, I know the property price, I know the number of rooms, whether it has a pool, etc. That ADDRESS info has to be "matched" against addresses already in my database from other lists that I have processed in the past. The PERSON information also needs to be tied in to information I have about PEOPLE that I obtained from other lists. I might get an NCOA record that shows that person MOVING TO the new house they just purchased, ALREADY IN MY TABLES from some other list. Furthermore, I have to create a brand new table (and yes, Arthur has other ideas as yet unimplemented) that contains the PROPERTY information and links it to the table of deliverable addresses I have built up.

So I end up with a system of tables: people / address / survey1 / survey2 / survey3 / survey4.
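The people / address / PersonAddress split John describes above might be sketched like this. This is a minimal illustration using SQLite from Python rather than SQL Server, and every table and column name here is my own invention, not John's actual schema:

```python
import sqlite3

# Sketch of the person / address / PersonAddress design:
# one row per distinct person, one per distinct deliverable address,
# and an m-m table recording who lived where, and when.
# (All names illustrative only.)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Person (
    PersonID  INTEGER PRIMARY KEY,
    LastName  TEXT
);
CREATE TABLE Address (
    AddressID INTEGER PRIMARY KEY,
    Street    TEXT,
    Zip       TEXT
);
-- m-m table: an NCOA move just closes out one row and adds another
CREATE TABLE PersonAddress (
    PersonID  INTEGER REFERENCES Person(PersonID),
    AddressID INTEGER REFERENCES Address(AddressID),
    FromDate  TEXT,
    ToDate    TEXT
);
""")

# A person who moved: the old address stays on file, closed by ToDate.
cur.execute("INSERT INTO Person VALUES (1, 'Smith')")
cur.execute("INSERT INTO Address VALUES (10, '1 Elm St', '27101')")
cur.execute("INSERT INTO Address VALUES (11, '9 Oak Ave', '27104')")
cur.execute("INSERT INTO PersonAddress VALUES (1, 10, '2001-05', '2006-11')")
cur.execute("INSERT INTO PersonAddress VALUES (1, 11, '2006-11', NULL)")

# Current address = the open-ended PersonAddress row.
cur.execute("""
    SELECT a.Street FROM PersonAddress pa
    JOIN Address a ON a.AddressID = pa.AddressID
    WHERE pa.PersonID = 1 AND pa.ToDate IS NULL
""")
print(cur.fetchone()[0])   # 9 Oak Ave
```

The point of the m-m table is exactly the scenario John mentions: a new mortgage list or an NCOA hit never overwrites anything, it just adds or closes a PersonAddress row.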
Surveys 1/2/3/4... all end up with pointers to a specific PERSON, since that information is about their personal preferences (brands of soap, soft drinks, computers, etc.).

NOW... (are you still with me?) I get a request from my client (who got a request from HIS client) for a COUNT of all the people in age bracket XXX, income bracket YYY, zip codes AAA, who own a pool and drive a Mercedes or a BMW. I create a system of views/queries to pull all of the pieces together, and count those people. I then get another order from my client (for a different client of theirs) for a count of all the people in zip code BBB who use detergent X; they really don't care about the age or income. Next count order, next count order, modify count order 1, modify count order 2, new count order, etc.

I end up with a single list of addresses, a single list of people, a many-to-many table of what addresses people lived at when, and 47 different SURVEY tables, until tomorrow of course when my client buys surveys number 49, 50 and 51 and I start the process of integrating that data into the system.

Arthur in the meantime espouses a system where each answer in each survey table is merged into one big SurveyAnswer table with FKs back to people / survey fields. Which is almost assuredly the correct answer; however, that method requires a huge programming effort (and immense crunching on my computer) to get the data normalized in this manner, and then to allow extraction of the data in order to do the counts. And if I use this model, what happens when I get another list from another mortgage company where the same basic information (property size, number of rooms, has a pool) is purchased? I do NOT want this stored as two different "surveys" but rather merged into one.

Well anyway, there ya go. I need organization.
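The big SurveyAnswer table Arthur argues for at the top of the thread would look roughly like this. Again this is a sketch in SQLite rather than SQL Server, with invented attribute names; the index is the part that makes his "600M rows, sub-second" claim plausible:

```python
import sqlite3

# Sketch of Arthur's tall SurveyAnswer table: one row per
# (person, attribute, answer) instead of one wide column per
# survey question. Names are illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE SurveyAnswer (
    PersonID  INTEGER,
    Attribute TEXT,
    Answer    TEXT
);
-- This index is what makes the 600M-row version searchable:
-- each (Attribute, Answer) criterion is a direct index seek.
CREATE INDEX ix_attr_answer ON SurveyAnswer (Attribute, Answer, PersonID);
""")

cur.executemany("INSERT INTO SurveyAnswer VALUES (?, ?, ?)", [
    (1, "Smoker",   "Y"),
    (1, "Divorced", "Soon"),
    (2, "Smoker",   "Y"),
    (2, "Divorced", "N"),
])

# 'Smoker = Y AND Divorced = Soon' becomes an intersection of
# index seeks on the same table, each yielding PersonIDs directly.
cur.execute("""
    SELECT COUNT(*) FROM (
        SELECT PersonID FROM SurveyAnswer
        WHERE Attribute = 'Smoker' AND Answer = 'Y'
        INTERSECT
        SELECT PersonID FROM SurveyAnswer
        WHERE Attribute = 'Divorced' AND Answer = 'Soon'
    )
""")
print(cur.fetchone()[0])   # 1  (only person 1 matches both criteria)
```

This also answers John's merge worry: a second mortgage list with the same "has a pool" fact just inserts rows with the same Attribute value, rather than becoming a 48th survey table.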
I need organization on the front end when I turn these lists into data, and I need organization on the back end when I process orders for counts against the data, and someday when I process orders for name/address lists to be sold. Some day soon (assuming you live in the USA) I will know EVERYTHING about YOU! Bwaaahaaaaahaaa. Kinda scary really. Enough to make you think twice about taking surveys.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Greg Worthey
Sent: Thursday, February 15, 2007 2:27 PM
To: dba-sqlserver at databaseadvisors.com
Subject: [dba-SQLServer] John Colby's problem

John,

If I understand you correctly, here's what you need to do: figure out a set of criteria that describe a "count order", then put each order as a record in a table. You say all the orders are different, but after you get enough of them, they should all fit some set of criteria. Then, when there is some demand against that order (give a summary of counts, or give all names, etc.), you have a SQL query builder that uses the criteria from the order to query the data on demand.

It sounds like you're replicating data where all you need to do is query it. If you create a database for each order, you'll have a huge mess on your hands when they start coming in frequently. Generalize the criteria for the orders and keep everything in one database, and it should be tidy.

Greg Worthey
Worthey Solutions
www.worthey.com

_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com
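Greg's suggestion in the thread above, storing each count order as criteria rows and generating the SQL from them on demand, could be sketched as follows. This assumes the tall SurveyAnswer(PersonID, Attribute, Answer) layout Arthur describes, and every table, column, and function name here is hypothetical:

```python
import sqlite3

# Sketch of Greg's idea: each count order is just rows of criteria
# in a table, and the counting SQL is generated from them on demand,
# so no per-order database or view is ever created.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE SurveyAnswer (PersonID INTEGER, Attribute TEXT, Answer TEXT);
CREATE TABLE CountOrder  (OrderID  INTEGER, Attribute TEXT, Answer TEXT);
""")
cur.executemany("INSERT INTO SurveyAnswer VALUES (?, ?, ?)", [
    (1, "Zip", "27101"), (1, "Detergent", "X"),
    (2, "Zip", "27101"), (2, "Detergent", "Y"),
])
# Order 1: everyone in zip 27101 who uses detergent X.
cur.executemany("INSERT INTO CountOrder VALUES (?, ?, ?)", [
    (1, "Zip", "27101"), (1, "Detergent", "X"),
])

def build_count_sql(order_id):
    """Generate one INTERSECT branch per criterion row in the order."""
    crit = cur.execute(
        "SELECT Attribute, Answer FROM CountOrder WHERE OrderID = ?",
        (order_id,)).fetchall()
    branch = ("SELECT PersonID FROM SurveyAnswer "
              "WHERE Attribute = ? AND Answer = ?")
    sql = ("SELECT COUNT(*) FROM ("
           + " INTERSECT ".join([branch] * len(crit)) + ")")
    params = [value for pair in crit for value in pair]
    return sql, params

sql, params = build_count_sql(1)
print(cur.execute(sql, params).fetchone()[0])   # 1  (only person 1 matches)
```

Modifying count order 1 or adding count order 48 then means editing rows in CountOrder, not writing new views, which is the tidiness Greg is after.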