artful at rogers.com
Thu Feb 15 21:21:26 CST 2007
Blame it all on me, JC. I told you how to do it and you resist. That's OK; I have been resisted before, LOL. But I'm right. You normalize the columns, and admittedly you end up with 600M rows, but they are indexable and your speed goes up 1000 times. Your queries reduce to something like "Smoker = Y and Cialis = Y and Divorced = Soon". Add to that the fact that the retrieved rows all automatically point to the people in question. I don't see the point of your objection. This seems to me entirely straightforward, and 600M rows is child's play, assuming that everything is stored and indexed correctly. I can do an indexed search on 600M rows in less than a second. I've got the data to prove it.

Arthur Fuller
Technical Writer, Data Modeler, SQL Sensei
Artful Databases Organization
www.artfulsoftware.com

----- Original Message ----
From: JWColby <jwcolby at colbyconsulting.com>
To: dba-sqlserver at databaseadvisors.com
Sent: Thursday, February 15, 2007 3:04:55 PM
Subject: Re: [dba-SQLServer] John Colby's problem

OK, a little background. The client sells lists of names to bulk mailers (paper bulk mail). The client gets lists (what I call surveys, because that is a more accurate description), which contain name/address information and then (usually) survey information such as "What brand of cigarette do you smoke?" or "What types of computers do you use?" etc. These surveys can be about ANYTHING that my client finds useful enough to purchase the lists for. My job is to somehow make sense of these.

I take the lists and pull them into SQL Server. I then immediately send the lists out for address-validation processing. "Sending out" really means that I process the names and send them through an address-processing system running on my machine, which can reach out over the internet to get certain parts of the processing done (NCOA specifically). At the end of the day, I then have a list of names which have been validated. Some of the names/addresses fail validation and I purge them.
All the names which pass validation I keep. Passing means that the ADDRESS is actually deliverable, and MAYBE the person actually lives at that address. They PROBABLY lived at that address at some time in the past (when was the survey taken?). I then have to build a master table of addresses, people, and people-who-live-at-addresses. I also have to tie the SURVEY (remember the survey questions) back in to the names/addresses.

Let's take an example. My client buys a list of 80 million names/addresses from a mortgage bank. It has names / addresses / income information / property information (info about the property purchased). I import it into SQL Server and export it back out for address validation. In THIS case, likely an EXTREMELY high percentage of the names/addresses are valid, since a mortgage company used the info to process mortgage applications. Anyway, as you can see, I now have two distinct "sets" of data: the person / address / PersonAddress set, and the information about the property (and of course also about the person's income etc.).

Now, my job is to tie all this back in to a master database of names and addresses. For example, if a person purchased a property, they now live there; I know their income, I know the property price, I know the number of rooms, whether it has a pool, etc. That ADDRESS info has to be "matched" against addresses already in my database from other lists that I have processed in the past. The PERSON information also needs to be tied in to information I have about PEOPLE that I obtained from other lists. I might get an NCOA record that shows that person MOVING TO the new house they just purchased, ALREADY IN MY TABLES from some other list. Furthermore, I have to create a brand new table (and yes, Arthur has other ideas as yet unimplemented) that contains the PROPERTY information and links it to the table of deliverable addresses I have built up.

So I end up with a system of tables: people / address / survey1 / survey2 / survey3 / survey4.
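The people / address / PersonAddress split John describes above might be sketched like this. This is a minimal illustration using SQLite from Python rather than SQL Server, and every table and column name here is my own invention, not John's actual schema:

```python
import sqlite3

# Sketch of the person / address / PersonAddress design:
# one row per distinct person, one per distinct deliverable address,
# and an m-m table recording who lived where, and when.
# (All names illustrative only.)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Person (
    PersonID  INTEGER PRIMARY KEY,
    LastName  TEXT
);
CREATE TABLE Address (
    AddressID INTEGER PRIMARY KEY,
    Street    TEXT,
    Zip       TEXT
);
-- m-m table: an NCOA move just closes out one row and adds another
CREATE TABLE PersonAddress (
    PersonID  INTEGER REFERENCES Person(PersonID),
    AddressID INTEGER REFERENCES Address(AddressID),
    FromDate  TEXT,
    ToDate    TEXT
);
""")

# A person who moved: the old address stays on file, closed by ToDate.
cur.execute("INSERT INTO Person VALUES (1, 'Smith')")
cur.execute("INSERT INTO Address VALUES (10, '1 Elm St', '27101')")
cur.execute("INSERT INTO Address VALUES (11, '9 Oak Ave', '27104')")
cur.execute("INSERT INTO PersonAddress VALUES (1, 10, '2001-05', '2006-11')")
cur.execute("INSERT INTO PersonAddress VALUES (1, 11, '2006-11', NULL)")

# Current address = the open-ended PersonAddress row.
cur.execute("""
    SELECT a.Street FROM PersonAddress pa
    JOIN Address a ON a.AddressID = pa.AddressID
    WHERE pa.PersonID = 1 AND pa.ToDate IS NULL
""")
print(cur.fetchone()[0])   # 9 Oak Ave
```

The point of the m-m table is exactly the scenario John mentions: a new mortgage list or an NCOA hit never overwrites anything, it just adds or closes a PersonAddress row.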
Surveys 1/2/3/4... all end up with pointers to a specific PERSON, since that information is about their personal preferences (brands of soap, soft drinks, computers, etc.).

NOW... (are you still with me?) I get a request from my client (who got a request from HIS client) for a COUNT of all the people in age bracket XXX, income bracket YYY, zip codes AAA, who own a pool and drive a Mercedes or a BMW. I create a system of views/queries to pull all of the pieces together, and count those people. I then get another order from my client (for a different client of theirs) for a count of all the people in zip code BBB who use detergent X; they really don't care about the age or income. Next count order, next count order, modify count order 1, modify count order 2, new count order, etc.

I end up with a single list of addresses, a single list of people, a many-to-many table of what addresses people lived at when, and 47 different SURVEY tables, until tomorrow of course when my client buys surveys number 49, 50 and 51 and I start the process of integrating that data into the system.

Arthur in the meantime espouses a system where each answer in each survey table is merged into one big SurveyAnswer table with FKs back to people / survey fields. Which is almost assuredly the correct answer; however, that method requires a huge programming effort (and immense crunching on my computer) to get the data normalized in this manner, and then to allow extraction of the data in order to do the counts. And if I use this model, what happens when I get another list from another mortgage company where the same basic information (property size, number of rooms, has a pool) is purchased? I do NOT want this stored as two different "surveys" but rather merged into one.

Well anyway, there ya go. I need organization.
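The big SurveyAnswer table Arthur argues for at the top of the thread would look roughly like this. Again this is a sketch in SQLite rather than SQL Server, with invented attribute names; the index is the part that makes his "600M rows, sub-second" claim plausible:

```python
import sqlite3

# Sketch of Arthur's tall SurveyAnswer table: one row per
# (person, attribute, answer) instead of one wide column per
# survey question. Names are illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE SurveyAnswer (
    PersonID  INTEGER,
    Attribute TEXT,
    Answer    TEXT
);
-- This index is what makes the 600M-row version searchable:
-- each (Attribute, Answer) criterion is a direct index seek.
CREATE INDEX ix_attr_answer ON SurveyAnswer (Attribute, Answer, PersonID);
""")

cur.executemany("INSERT INTO SurveyAnswer VALUES (?, ?, ?)", [
    (1, "Smoker",   "Y"),
    (1, "Divorced", "Soon"),
    (2, "Smoker",   "Y"),
    (2, "Divorced", "N"),
])

# 'Smoker = Y AND Divorced = Soon' becomes an intersection of
# index seeks on the same table, each yielding PersonIDs directly.
cur.execute("""
    SELECT COUNT(*) FROM (
        SELECT PersonID FROM SurveyAnswer
        WHERE Attribute = 'Smoker' AND Answer = 'Y'
        INTERSECT
        SELECT PersonID FROM SurveyAnswer
        WHERE Attribute = 'Divorced' AND Answer = 'Soon'
    )
""")
print(cur.fetchone()[0])   # 1  (only person 1 matches both criteria)
```

This also answers John's merge worry: a second mortgage list with the same "has a pool" fact just inserts rows with the same Attribute value, rather than becoming a 48th survey table.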
I need organization on the front end when I turn these lists into data, and I need organization on the back end when I process orders for counts against the data, and someday when I process orders for name/address lists to be sold. Some day soon (assuming you live in the USA) I will know EVERYTHING about YOU! Bwaaahaaaaahaaa. Kinda scary really. Enough to make you think twice about taking surveys.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Greg Worthey
Sent: Thursday, February 15, 2007 2:27 PM
To: dba-sqlserver at databaseadvisors.com
Subject: [dba-SQLServer] John Colby's problem

John,

If I understand you correctly, here's what you need to do: figure out a set of criteria that describe a "count order", then put each order as a record in a table. You say all the orders are different, but after you get enough of them, they should all fit some set of criteria. Then, when there is some demand against that order (give a summary of counts, or give all names, etc.), you have a SQL query builder that uses the criteria from the order to query the data on demand.

It sounds like you're replicating data where all you need to do is query it. If you create a database for each order, you'll have a huge mess on your hands when they start coming in frequently. Generalize the criteria for the orders and keep everything in one database, and it should be tidy.

Greg Worthey
Worthey Solutions
www.worthey.com

_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com
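Greg's suggestion in the thread above, storing each count order as criteria rows and generating the SQL from them on demand, could be sketched as follows. This assumes the tall SurveyAnswer(PersonID, Attribute, Answer) layout Arthur describes, and every table, column, and function name here is hypothetical:

```python
import sqlite3

# Sketch of Greg's idea: each count order is just rows of criteria
# in a table, and the counting SQL is generated from them on demand,
# so no per-order database or view is ever created.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE SurveyAnswer (PersonID INTEGER, Attribute TEXT, Answer TEXT);
CREATE TABLE CountOrder  (OrderID  INTEGER, Attribute TEXT, Answer TEXT);
""")
cur.executemany("INSERT INTO SurveyAnswer VALUES (?, ?, ?)", [
    (1, "Zip", "27101"), (1, "Detergent", "X"),
    (2, "Zip", "27101"), (2, "Detergent", "Y"),
])
# Order 1: everyone in zip 27101 who uses detergent X.
cur.executemany("INSERT INTO CountOrder VALUES (?, ?, ?)", [
    (1, "Zip", "27101"), (1, "Detergent", "X"),
])

def build_count_sql(order_id):
    """Generate one INTERSECT branch per criterion row in the order."""
    crit = cur.execute(
        "SELECT Attribute, Answer FROM CountOrder WHERE OrderID = ?",
        (order_id,)).fetchall()
    branch = ("SELECT PersonID FROM SurveyAnswer "
              "WHERE Attribute = ? AND Answer = ?")
    sql = ("SELECT COUNT(*) FROM ("
           + " INTERSECT ".join([branch] * len(crit)) + ")")
    params = [value for pair in crit for value in pair]
    return sql, params

sql, params = build_count_sql(1)
print(cur.execute(sql, params).fetchone()[0])   # 1  (only person 1 matches)
```

Modifying count order 1 or adding count order 48 then means editing rows in CountOrder, not writing new views, which is the tidiness Greg is after.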