[dba-SQLServer] Append only new records

jwcolby jwcolby at colbyconsulting.com
Tue Mar 4 16:07:13 CST 2008


Every table has addresses.  These are marketing lists of name / address from
various sources.

MOST lists are "static" in that I do not get new versions of the lists,
although I do get updates to one of my lists.  OTOH, EVERY list has to be
NCOA (National Change of Address) processed occasionally, with
"occasionally" being defined as every few months.  Thus I will get address
changes from the NCOA process every few months for each list.

My lists contain ~50 million records, ~ 75 million records and ~150 million
records plus a bunch of smaller lists, 10 million etc.  I can get a new list
at any time and need to be able to cross reference them, i.e. which records
in list A are also in list B.  That is why the Hash idea was so appealing.
A autonumber in any given list is only really useful within that list, it
does not in any way "match" an autonumber in any other list.  

However I do in fact use an integer autonumber for a couple of purposes.
One thing I do is take the NCOA processed data and store it in a new table,
with the PKID then relating back to the original list.  The NCOA data adds
significant "value" to the address including such things as area code,
county, congressional district, lat/long etc.  Given that the NCOA has to be
performed on a regular basis, and given that I am attempting to track PEOPLE
(who move around) not addresses (which obviously do not move) it is useful
to keep the NCOA data separate from the original list data which contains
demographics information such as race, income, sex, purchasing preferences
and so forth.

The other usage of the PKID is simply that it allows me to export for the
NCOA process and then match the returning records to the original records.  

But to correlate between lists I really need a hash type of functionality.
It is useful to discover that a record in a list about purchasing
preferences "matches" a record in another list about mortgages or yet
another list about insurance policies.

John W. Colby
Colby Consulting
www.ColbyConsulting.com 
-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Paul
Nielsen
Sent: Tuesday, March 04, 2008 4:06 PM
To: 'Discussion concerning MS SQL Server'
Subject: Re: [dba-SQLServer] Append only new records

Yep, SQL Server typically has larger databases. The average sized db for SQL
Server is probably 500Gb +. At the last PASS conference when asked how many
of you have databases over 1 Tb most hands went up. I normally test any app
with millions of rows. 

This is my last contract job,
http://sqlblog.com/blogs/paul_nielsen/archive/2007/12/12/10-lessons-from-35k
-tps.aspx

in your case I see two points of pain,
First, the " many different tables". How many tables have addresses? Is it
dynamic or a fixed set of tables. Either way it's scriptable. 

Secondly, hashes for large sets tend to not be unique. I'd use an bigint
identity. 

The process is not difficult. 
1) Insert all the addresses from every table into a single table with an
identity. 
2) Delete dups.
3) If you dislike the gaps, alter the table to remove the identity and then
add a new identity. 
4) Add AddressID column to each of the many tables
5) update manytables set mt.addressid =  a.addressid from manytables join
address on a.address = manytables.address

Is really is dirt simple of a onetime conversion process. 

-Paul 



-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Tuesday, March 04, 2008 1:12 PM
To: 'Discussion concerning MS SQL Server'
Subject: Re: [dba-SQLServer] Append only new records

LOL.  Nothing is dirt simple when you are dealing with a hundred million
records.

I have many different tables, each table has name / address information.  I
already create a hash of the address fields (HashAddr), the address plus
last name (HashFamily) and Address / last name / first name (HashPerson).
Any given table will contain millions of records where there are multiple
people in a family at the same address.  I am looking to carve out just the
address (and it's hash) and dump it into an address table, however it needs
to be there only once.  That will form a "validated address table" where by
definition if the address is in there it is valid and deliverable.  The hash
of the address fields then becomes the PK for that table since it is unique.
Having done that I can then do joins on the address hash to see everyone who
has ever lived at any address.  I can also get relationships between the
lists.

Of course the address table will have more than a hundred million rows.  The
obvious is to do a "not in" query to select hashes not in the address table,
then append those records into the address table.  However with a hundred
million records and growing, that kind of operation could take a few
seconds.  8~)


John W. Colby
Colby Consulting
www.ColbyConsulting.com
-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Paul
Nielsen
Sent: Tuesday, March 04, 2008 1:36 PM
To: 'Discussion concerning MS SQL Server'
Subject: Re: [dba-SQLServer] Append only new records

You could always pump them all into a temp table, delete dups, and then move
them into the real table. It'd be dirt simple and repeatable. 

-Paul 

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Monday, March 03, 2008 3:17 PM
To: 'Discussion concerning MS SQL Server'
Subject: [dba-SQLServer] Append only new records

I am building a table of known validated addresses in the US.  For my
purposes all I have is Address, city, state, zip and HashAddr.  HashAddr is
a has which is guaranteed to be unique given the input which is the address
fields given.  

In Access you could append records in at random and those records which did
not collide with a unique index goes in, and those that do collide do not go
in.  Obviously an index on the HashAddress will be the unique index.  Is it
possible to do the same kind of thing in SQL Server?  When I have tried this
before (in SQL Server), the whole process aborted if a collision occurred,
with no records appended if there was any collision.

John W. Colby
Colby Consulting
www.ColbyConsulting.com 

_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com


__________ NOD32 2918 (20080303) Information __________

This message was checked by NOD32 antivirus system.
http://www.eset.com


_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com

_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com


__________ NOD32 2921 (20080304) Information __________

This message was checked by NOD32 antivirus system.
http://www.eset.com


_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com




More information about the dba-SQLServer mailing list