jwcolby
jwcolby at colbyconsulting.com
Tue Mar 4 16:07:13 CST 2008
Every table has addresses. These are marketing lists of name / address data from various sources. MOST lists are "static" in that I do not get new versions of them, although I do get updates to one of my lists. OTOH, EVERY list has to be NCOA (National Change of Address) processed occasionally, with "occasionally" being defined as every few months. Thus I will get address changes from the NCOA process every few months for each list. My lists contain ~50 million records, ~75 million records and ~150 million records, plus a bunch of smaller lists of 10 million or so. I can get a new list at any time and need to be able to cross-reference them, i.e. discover which records in list A are also in list B. That is why the hash idea was so appealing. An autonumber in any given list is only really useful within that list; it does not in any way "match" an autonumber in any other list.

However, I do in fact use an integer autonumber for a couple of purposes. One thing I do is take the NCOA-processed data and store it in a new table, with the PKID then relating back to the original list. The NCOA data adds significant "value" to the address, including such things as area code, county, congressional district, lat/long etc. Given that the NCOA has to be performed on a regular basis, and given that I am attempting to track PEOPLE (who move around), not addresses (which obviously do not move), it is useful to keep the NCOA data separate from the original list data, which contains demographic information such as race, income, sex, purchasing preferences and so forth. The other use of the PKID is simply that it allows me to export for the NCOA process and then match the returning records to the original records.

But to correlate between lists I really need a hash type of functionality. It is useful to discover that a record in a list about purchasing preferences "matches" a record in another list about mortgages, or yet another list about insurance policies.

John W.
Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Paul Nielsen
Sent: Tuesday, March 04, 2008 4:06 PM
To: 'Discussion concerning MS SQL Server'
Subject: Re: [dba-SQLServer] Append only new records

Yep, SQL Server typically has larger databases. The average-sized db for SQL Server is probably 500Gb+. At the last PASS conference, when asked how many of you have databases over 1Tb, most hands went up. I normally test any app with millions of rows. This is my last contract job: http://sqlblog.com/blogs/paul_nielsen/archive/2007/12/12/10-lessons-from-35k-tps.aspx

In your case I see two points of pain. First, the "many different tables". How many tables have addresses? Is it a dynamic or a fixed set of tables? Either way it's scriptable. Secondly, hashes for large sets tend not to be unique. I'd use a bigint identity. The process is not difficult:

1) Insert all the addresses from every table into a single table with an identity.
2) Delete dups.
3) If you dislike the gaps, alter the table to remove the identity and then add a new identity.
4) Add an AddressID column to each of the many tables.
5) update mt set mt.addressid = a.addressid from manytables mt join address a on a.address = mt.address

It really is a dirt-simple one-time conversion process.

-Paul

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Tuesday, March 04, 2008 1:12 PM
To: 'Discussion concerning MS SQL Server'
Subject: Re: [dba-SQLServer] Append only new records

LOL. Nothing is dirt simple when you are dealing with a hundred million records. I have many different tables, each of which has name / address information.
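[Paul's five conversion steps above can be sketched out in T-SQL roughly as follows. This is only an illustration: the table names (Address, ListA) and column names are assumed, not taken from the actual schema, and each source-table INSERT/UPDATE would be scripted per table.]

```sql
-- 1) Collect every address into a single table with a bigint identity.
CREATE TABLE Address (
    AddressID bigint IDENTITY(1,1) NOT NULL,
    Addr      varchar(100),
    City      varchar(50),
    St        char(2),
    Zip       varchar(10)
);

INSERT INTO Address (Addr, City, St, Zip)
SELECT Addr, City, St, Zip FROM ListA;
-- repeat (or script) this INSERT for each remaining source table

-- 2) Delete dups, keeping the lowest AddressID per address.
DELETE a
FROM Address a
WHERE a.AddressID > (SELECT MIN(a2.AddressID)
                     FROM Address a2
                     WHERE a2.Addr = a.Addr AND a2.Zip = a.Zip);

-- 4) Add the AddressID column to each of the many tables...
ALTER TABLE ListA ADD AddressID bigint;

-- 5) ...and back-fill it by joining on the address fields.
UPDATE la
SET la.AddressID = a.AddressID
FROM ListA la
JOIN Address a ON a.Addr = la.Addr AND a.Zip = la.Zip;
```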
I already create a hash of the address fields (HashAddr), the address plus last name (HashFamily), and address / last name / first name (HashPerson). Any given table will contain millions of records where there are multiple people in a family at the same address.

I am looking to carve out just the address (and its hash) and dump it into an address table, where it needs to appear only once. That will form a "validated address table" where, by definition, if the address is in there it is valid and deliverable. The hash of the address fields then becomes the PK for that table, since it is unique. Having done that, I can then do joins on the address hash to see everyone who has ever lived at any address. I can also get relationships between the lists.

Of course the address table will have more than a hundred million rows. The obvious approach is to do a "not in" query to select hashes not in the address table, then append those records into the address table. However, with a hundred million records and growing, that kind of operation could take a few seconds. 8~)

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Paul Nielsen
Sent: Tuesday, March 04, 2008 1:36 PM
To: 'Discussion concerning MS SQL Server'
Subject: Re: [dba-SQLServer] Append only new records

You could always pump them all into a temp table, delete dups, and then move them into the real table. It'd be dirt simple and repeatable.

-Paul

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Monday, March 03, 2008 3:17 PM
To: 'Discussion concerning MS SQL Server'
Subject: [dba-SQLServer] Append only new records

I am building a table of known validated addresses in the US. For my purposes all I have is Address, city, state, zip and HashAddr.
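[The "append only new hashes" operation John describes above is usually written as an INSERT guarded by NOT EXISTS rather than NOT IN. A minimal sketch; AddressTable, Staging, and their columns are assumed names:]

```sql
-- Append only staged rows whose HashAddr is not already present.
INSERT INTO AddressTable (HashAddr, Addr, City, St, Zip)
SELECT s.HashAddr, s.Addr, s.City, s.St, s.Zip
FROM Staging s
WHERE NOT EXISTS (SELECT 1
                  FROM AddressTable a
                  WHERE a.HashAddr = s.HashAddr);
```

[With a unique index on HashAddr, the NOT EXISTS probe is an index seek per staged row rather than a scan of the full hundred-million-row table.]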
HashAddr is a hash which is guaranteed to be unique given the input, which is the address fields listed. In Access you could append records in at random: those records which did not collide with a unique index went in, and those that did collide did not go in. Obviously an index on HashAddr will be the unique index. Is it possible to do the same kind of thing in SQL Server? When I have tried this before (in SQL Server), the whole process aborted if a collision occurred, with no records appended if there was any collision.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com
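[For the record, SQL Server does offer the Access-style behavior the original question asks for: a unique index created WITH (IGNORE_DUP_KEY = ON) discards colliding rows with a warning instead of aborting the whole INSERT. A minimal sketch, with assumed table and column names:]

```sql
-- Rows that collide on HashAddr are dropped with the warning
-- "Duplicate key was ignored."; the remaining rows are inserted.
CREATE UNIQUE INDEX IX_AddressTable_HashAddr
    ON AddressTable (HashAddr)
    WITH (IGNORE_DUP_KEY = ON);

INSERT INTO AddressTable (HashAddr, Addr, City, St, Zip)
SELECT HashAddr, Addr, City, St, Zip
FROM Staging;
```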