[dba-SQLServer] Append only new records

jwcolby jwcolby at colbyconsulting.com
Tue Mar 4 21:14:22 CST 2008


To this point I have been doing it mostly with VB.Net and ADO recordsets.
What I have built so far is a system where I define imports and exports
using metadata tables in SQL Server.  Each import / export metadata table
(manually filled in to this point - a one-time task) holds information about
the SQL Server table that is the source or target, the directories the
import / export files move to and from, etc.  I then have a map table (for
export) with metadata defining which fields in the source table are to be
exported.  Flat file import field metadata is generally provided in an Excel
spreadsheet or CSV file by the list provider.
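
To give a feel for it, the metadata tables look very roughly like this.  A
simplified sketch only - the table and column names here are invented for
illustration, not my actual schema:

    CREATE TABLE dbo.ImportExportSpec (
        SpecID        int IDENTITY(1,1) PRIMARY KEY,
        Direction     char(1)      NOT NULL,  -- 'I' = import, 'E' = export
        TableName     sysname      NOT NULL,  -- the SQL Server source / target
        FileDirectory varchar(255) NOT NULL,  -- where the flat / CSV files live
        MaxRecords    int          NOT NULL   -- max records per output file
    );

    CREATE TABLE dbo.FieldMap (
        SpecID     int     NOT NULL REFERENCES dbo.ImportExportSpec(SpecID),
        FieldName  sysname NOT NULL,  -- column in the source / target table
        FieldOrder int     NOT NULL,  -- position in the flat / CSV file
        FieldWidth int     NULL       -- width, for fixed-width flat files
    );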

Like that.

That is all working.  I can import or export CSV files and flat files simply
by defining metadata records and running my program.  The process consists
of an initial import of a list from a slew of flat files (always flat files -
that seems to be the standard in the industry) into SQL Server, followed by
an immediate export of just the name / address fields back out to CSV files
for a third party address validation program running on a server here in my
office, which performs the address validation - CASS and DPV.  The
validation program watches the directories where I drop the CSV files.  Each
file holds a million records, though the max record count is defined in my
metadata.
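
The million-records-per-file chunking is nothing exotic.  Conceptually each
file is driven by something like the following, though the real SQL is built
from the metadata and the names here are made up:

    -- Chunk N of the name / address fields for the validation export.
    -- @ChunkSize comes from the MaxRecords metadata; @ChunkNumber comes
    -- from the loop in the VB.Net code that writes each CSV file.
    DECLARE @ChunkSize int, @ChunkNumber int;
    SET @ChunkSize = 1000000;
    SET @ChunkNumber = 1;

    SELECT PKID, FirstName, LastName, Address1, City, St, Zip
    FROM (SELECT PKID, FirstName, LastName, Address1, City, St, Zip,
                 ROW_NUMBER() OVER (ORDER BY PKID) AS RowNum
          FROM dbo.SourceList) AS x
    WHERE x.RowNum BETWEEN (@ChunkNumber - 1) * @ChunkSize + 1
                       AND @ChunkNumber * @ChunkSize;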

The third party validation program performs the validation, including
uploading some info to an NCOA server over the internet.  Back comes NCOA
data for whatever records (people) have moved to a new physical location.
The validation program reintegrates the NCOA information into fields in the
validation records it is assembling.  The resulting data is then dropped as
CSV files into directories which my program is monitoring.

My VB.Net program then imports the resulting address-validated CSV files
(one million records per file) back into SQL Server into validation tables
which hold just the name / address / validation data (plus the PK of course).
From there I have to discover which addresses are undeliverable, using the
validation codes from the CASS / DPV process, and delete those records
entirely.  They are completely undeliverable, usually bogus or incorrect
address info.  The NCOA codes tell me when a person has moved from one
(valid) address to another (valid) address.  So I have to store the original
valid address record (which is useful to me) and then update it with the new
valid address information from the NCOA (where they currently live).  Once
all the addresses are validated and any NCOA addresses are discovered and
written, I create the person, family and address hash codes.  I now have two
tables: the original data imported from the flat files, and a smaller subset
of address validation records (all valid, deliverable and NCOAd), related by
a long PKID.  One address list is now processed.
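
If it helps to picture that step, the "delete the undeliverables, then hash"
part boils down to something like this.  A sketch only - the table / column
names and code values are invented for the example, and HASHBYTES is simply
one way to build the hash in SQL Server 2005:

    -- Throw away the records the CASS / DPV process flagged as undeliverable.
    -- (The code values below are illustrative only.)
    DELETE FROM dbo.ValidatedAddress
    WHERE DPVConfirm NOT IN ('Y', 'S', 'D');

    -- Build the address hash from the standardized, validated address fields.
    -- AddressHash would be a varbinary(20) column for a SHA1 hash.
    UPDATE dbo.ValidatedAddress
    SET AddressHash = HASHBYTES('SHA1',
            UPPER(Address1) + '|' + UPPER(City) + '|' + St + '|'
            + Zip + ISNULL(Zip4, ''));

The person and family hashes are built the same way, just from different
combinations of the name and address fields.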

However... the whole "export to validation, import from validation" piece
has to be performed periodically.  Of course that is just my code running
again, the same code mentioned above.  And of course my code also has to log
every process so that I can bill my clients for the processing performed on
their data.
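
The logging itself is nothing fancy - roughly a table like this that the
VB.Net code writes a row into for every run (names again invented for
illustration):

    CREATE TABLE dbo.ProcessLog (
        LogID       int IDENTITY(1,1) PRIMARY KEY,
        ClientID    int          NOT NULL,
        ListName    varchar(100) NOT NULL,
        ProcessName varchar(50)  NOT NULL,  -- 'Import', 'ExportToCASS', 'NCOA'...
        RecordCount int          NOT NULL,
        RunStart    datetime     NOT NULL,
        RunEnd      datetime     NULL
    );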

It is a rather intricate dance all in all, and it is all orchestrated by
VB.Net code that I wrote in Visual Studio against SQL Server 2005, plus of
course the third party CASS / DPV / NCOA program running on a server here in
my office.

The part we were discussing is kind of ancillary to all the stuff mentioned
above.  Once I have validated and NCOAd addresses, it would be nice to build
a validated address table as a precursor to a system that tracks people
movements, as well as learning other stuff.  For example, if I have a list of
mortgages, it includes all kinds of "house" information such as number of
rooms, square footage, lot size, etc., as well as mortgage information - sale
price, mortgage payment, etc. - PLUS the address.  I have another, completely
unrelated list of people / addresses with perhaps buying preferences for
electronics, or soap, or cat food.  I have another completely unrelated name
/ address list with income, ethnicity, number of children, and whatever else
that list may contain.  By having the VALIDATED address as common
information I can correlate completely unrelated lists and gain more
information than I can gather from any individual list.
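
With the validated address (or a hash of it) in every list, the cross-list
correlation becomes a plain join.  Conceptually something like this, with
made-up table names:

    -- Which validated addresses show up in both the mortgage list and the
    -- buying preferences list?
    SELECT m.AddressHash, m.SalePrice, m.SquareFootage, b.PetFoodBrand
    FROM dbo.MortgageList_Validated AS m
    JOIN dbo.BuyerPrefs_Validated AS b
      ON b.AddressHash = m.AddressHash;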

EVEN if it is not the same person, I can discover that the person who lives
at some address has this kind of house, that SOMEONE who lived there bought
this kind of food, drove this kind of car, was of this ethnic background, etc.
In essence, simply by knowing that you live at an address, and by gathering
sufficient information about your neighbors, I can INFER what kinds of cars
people in your neighborhood are LIKELY to drive, even if I have no specific
information about the person at any given address.  People in my
neighborhood don't drive Mercedes or BMWs, so why send a mailing for
Mercedes or BMWs to my neighborhood?  OTOH there are a LOT of dogs in my
neighborhood, so it would be a safe bet that a dog food ad would go over
well.
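
In SQL terms that inference is just aggregation at the neighborhood (zip
plus 4) level rather than at the individual address.  Along these lines,
again with invented names:

    -- For each zip+4, what share of known households owns a dog?
    SELECT Zip, Zip4,
           COUNT(*) AS KnownHouseholds,
           SUM(CASE WHEN OwnsDog = 1 THEN 1 ELSE 0 END) AS DogHouseholds,
           CAST(SUM(CASE WHEN OwnsDog = 1 THEN 1 ELSE 0 END) AS float)
               / COUNT(*) AS DogShare
    FROM dbo.PetOwners_Validated
    GROUP BY Zip, Zip4;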

In theory anyway.

Do you use those "shopper cards" at the grocery store that give you a
discount on various items in your shopping cart?  Well the real purpose is
not to give you a discount, but rather to track what you buy.  If you scan
your card, everything in your cart is associated with your card.  Now I (my
client really) can buy a list of everything that you purchased.  He can then
ask me to discover what kind of soap people buy who also buy dog food.  By
BRAND, by zip plus 4.  If you were silly enough to give them a correct
address, I can now discover that people who live in this neighborhood buy
this kind of dog food and this kind of soap and this kind of soft drink and
this kind of...  And so my client can then sell your name and address to
Procter and Gamble to direct-mail you a flyer about the exact soap or dog
food that you (or your neighbors) buy that is going on sale at the local
supermarket.  Or send you a flyer about the brand of car that you purchased
last (or that your neighbor purchased last).  The interesting thing is that
even if you were not silly enough to give a real address... your neighbor
did.

My client likes to say that "junk mail is only junk if you do not want it",
and that if he can learn enough about you or even your neighbors, he can
target you with the junk mail that you will actually use.  Kind of makes
sense in a twisted way.  At any rate, he pays me to maintain servers with
all his lists and I am building a system to allow him to do exactly this
kind of stuff. 

It has been interesting.

John W. Colby
Colby Consulting
www.ColbyConsulting.com 
-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Paul
Nielsen
Sent: Tuesday, March 04, 2008 5:20 PM
To: 'Discussion concerning MS SQL Server'
Subject: Re: [dba-SQLServer] Append only new records

Have you considered writing a proc with dynamic SQL so you can pass in the
name of the source table (and maybe the column names) as a parameter and it
works with any source table?
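
Something along these lines (just a rough sketch, with the names made up):

    CREATE PROC dbo.ExportNameAddress
        @SourceTable sysname
    AS
    BEGIN
        DECLARE @sql nvarchar(max);

        -- Build the SELECT against whatever source table was passed in.
        -- QUOTENAME guards against a bad or malicious table name.
        SET @sql = N'SELECT PKID, FirstName, LastName, Address1, City, St, Zip '
                 + N'FROM dbo.' + QUOTENAME(@SourceTable) + N';';

        EXEC sp_executesql @sql;
    END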

-Paul 

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Tuesday, March 04, 2008 3:07 PM
To: 'Discussion concerning MS SQL Server'
Subject: Re: [dba-SQLServer] Append only new records

Every table has addresses.  These are marketing lists of name / address from
various sources.

MOST lists are "static" in that I do not get new versions of the lists,
although I do get updates to one of my lists.  OTOH, EVERY list has to be
NCOA (National Change of Address) processed occasionally, with
"occasionally" being defined as every few months.  Thus I will get address
changes from the NCOA process every few months for each list.

My lists contain ~50 million records, ~75 million records and ~150 million
records, plus a bunch of smaller lists of 10 million or so.  I can get a new
list at any time and need to be able to cross reference them, i.e. which
records in list A are also in list B.  That is why the Hash idea was so
appealing.  An autonumber in any given list is only really useful within
that list; it does not in any way "match" an autonumber in any other list.

However, I do in fact use an integer autonumber for a couple of purposes.
One thing I do is take the NCOA-processed data and store it in a new table,
with the PKID relating back to the original list.  The NCOA data adds
significant "value" to the address, including such things as area code,
county, congressional district, lat/long, etc.  Given that the NCOA has to
be performed on a regular basis, and given that I am attempting to track
PEOPLE (who move around), not addresses (which obviously do not move), it is
useful to keep the NCOA data separate from the original list data, which
contains demographic information such as race, income, sex, purchasing
preferences and so forth.

The other use of the PKID is simply that it allows me to export for the NCOA
process and then match the returning records back to the original records.
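
That match-back is just a keyed update once the returning file is staged.
Conceptually something like this, with the staging table and column names
invented for the example:

    -- Fold the returned CASS / DPV / NCOA fields back onto the original
    -- rows, matched on the PKID that went out with the export.
    UPDATE v
    SET v.DPVConfirm   = s.DPVConfirm,
        v.NCOAMoveDate = s.NCOAMoveDate,
        v.NewAddress1  = s.NewAddress1
    FROM dbo.ValidatedAddress AS v
    JOIN dbo.Staging_ValidationReturn AS s
      ON s.PKID = v.PKID;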

But to correlate between lists I really need a hash type of functionality.
It is useful to discover that a record in a list about purchasing
preferences "matches" a record in another list about mortgages or yet
another list about insurance policies.

John W. Colby
Colby Consulting
www.ColbyConsulting.com



