Paul Nielsen
pauln at sqlserverbible.com
Tue Mar 4 21:41:24 CST 2008
I built nearly the same type of system for an airline data warehouse a couple years ago using all stored procs + bulk insert. The code was small and it screamed. Bulk Insert is very fast. -Paul -----Original Message----- From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of jwcolby Sent: Tuesday, March 04, 2008 8:14 PM To: 'Discussion concerning MS SQL Server' Subject: Re: [dba-SQLServer] Append only new records To this point I have been doing it mostly with VB.Net and ADO recordsets. What I have built so far is a system where I define imports and exports using metadata tables in SQL Server. Each import / export metadata table (manually filled in to this point - one time) has information about the SQL Server table that is the source or target, the directories where the import / export files are moving from, etc. I then have a map table (for export) with metadata defining what the fields are in the source table to be exported. Flat file import field metadata is generally provided in an excel spreadsheet or CSV file by the list provider. Like that. That is all working. I can import or export CSV files and flat files simply by defining metadata records and running my program. The process consists of an initial import of a list from a slew of flat files (always flat files, that seems to be the standard in the industry) into sql server. Then an immediate export of just the name / address fields back out to CSV files, to a third party address validation program running on a server here in my office that performs the address validation - CASS and DPV. The validation program is running a watch on directories where I drop the CSV files. Each file has a million records, though I define the max record count in my metadata. The third party validation program performs the validation including uploading some info to an NCOA server over the internet. Back comes NCOA data for whatever records (people) have moved physical locations. The third party validation program reintegrates the NCOA information into fields in the validation records it is assembling. The returning data is then dropped as CSV files into directories which my program is monitoring. My VB.Net program then imports the resulting address validated CSV files (one million record files) back into SQL Server into validation tables which hold just the name / address / validation data (plus the PK of course). >From there I have to discover which addresses are undeliverable using validation codes from the CASS / DPV process and delete those records entirely. They are completely undeliverable, usually bogus or incorrect address info. The NCOA codes tell me when a person has moved from one (valid) address to another (valid) address. So I have to store the original valid address record (that is useful to me) and then update with the new valid address information from the NCOA (where they currently live). Once all the addresses are validated and any NCOA addresses are discovered and written, I create the person, family and address hash codes. I now have two tables, the original data imported from the flat files, and a smaller subset of address validation records (all valid, deliverable and NCOAd), related by a long PKID. One address list is now processed. However... the whole "export to validation, import from validation" piece has to be performed periodically. Of course that is just my code running again, the same code mentioned above. And of course my code also has to log all the processes so that I can bill my clients for the processes performed on their data. It is a rather intricate dance all in all, and it is all orchestrated by VB.Net code that I wrote in Visual Studio and using SQL Server 2005, and of course a third party CASS / DPV / NCOA program running on a server here in my office. The part we were discussing is kind of ancillary to all the stuff mentioned above. Once I have validated and NCOAd addresses, it would be nice to build a validated address table as a precursor to a system to track people movements as well as learning other stuff. For example, if I have a list of mortgages, these include all kinds of "house" information such as number of rooms, square footages, lot size etc, as well as mortgage information - sale price, mortgage payment etc. PLUS the address. I have another, completely unrelated list of people / addresses with perhaps buying preferences for electronics, or soap or cat food. I have another completely unrelated name / address list with income, ethnicity, number of children, and whatever else that list may contain. By having the VALIDATED address as common information I can now correlate completely unrelated lists and gain more information than I can gather from any individual list. EVEN if it is not the same person, I can discover that a person that lives at some address has this kind of house, that SOMEONE who lived there bought this kind of food, drove this kind of car, was this ethnic background etc. In essence simply by knowing that you lived at an address and drove a certain kind of car, by gathering sufficient information about your neighbors I can INFER what kinds of cars people in your neighborhood are LIKELY to drive, even if I have no specific information about the person at any given address. People in my neighborhood don't drive Mercedes or BMWs so why send a mailing for Mercedes or BMWs to my neighborhood? OTOH there are a LOT of dogs in my neighborhood, so it would be a safe bet that a dog food ad would go over well. In theory anyway. Do you use those "shopper cards" at the grocery store that give you a discount on various items in your shopping cart? Well the real purpose is not to give you a discount, but rather to track what you buy. If you scan your card, everything in your cart is associated with your card. Now I (my client really) can buy a list of everything that you purchased. He can then ask me to discover what kind of soap people buy who also buy dog food. By BRAND, by zip plus 4. If you were silly enough to give them a correct address, I can now discover that people who live in this neighborhood buy this kind of dog food and this kind of soap and this kind of soft drinks and this kind of... And so my client can then sell your name and address to Proctor and Gamble to direct mail you a flyer about the exact soap or dog food that you (or your neighbors) buy that is going on sale at the local supermarket. Or send you a flyer about the brand of car that you purchased last (or that your neighbor purchased last). The interesting thing is that even if you were not silly enough to give a real address... your neighbor did. My client likes to say that "junk mail is only junk if you do not want it", and that if he can learn enough about you or even your neighbors, he can target you with the junk mail that you will actually use. Kind of makes sense in a twisted way. At any rate, he pays me to maintain servers with all his lists and I am building a system to allow him to do exactly this kind of stuff. It has been interesting. John W. Colby Colby Consulting www.ColbyConsulting.com -----Original Message----- From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Paul Nielsen Sent: Tuesday, March 04, 2008 5:20 PM To: 'Discussion concerning MS SQL Server' Subject: Re: [dba-SQLServer] Append only new records Have you considered writing a proc with dynamic SQL so you can pass in the name of the source table (and maybe the column names) as a parameter and it works with any source table? -Paul -----Original Message----- From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of jwcolby Sent: Tuesday, March 04, 2008 3:07 PM To: 'Discussion concerning MS SQL Server' Subject: Re: [dba-SQLServer] Append only new records Every table has addresses. These are marketing lists of name / address from various sources. MOST lists are "static" in that I do not get new versions of the lists, although I do get updates to one of my lists. OTOH, EVERY list has to be NCOA (National Change of Address) processed occasionally, with "occasionally" being defined as every few months. Thus I will get address changes from the NCOA process every few months for each list. My lists contain ~50 million records, ~ 75 million records and ~150 million records plus a bunch of smaller lists, 10 million etc. I can get a new list at any time and need to be able to cross reference them, i.e. which records in list A are also in list B. That is why the Hash idea was so appealing. A autonumber in any given list is only really useful within that list, it does not in any way "match" an autonumber in any other list. However I do in fact use an integer autonumber for a couple of purposes. One thing I do is take the NCOA processed data and store it in a new table, with the PKID then relating back to the original list. The NCOA data adds significant "value" to the address including such things as area code, county, congressional district, lat/long etc. Given that the NCOA has to be performed on a regular basis, and given that I am attempting to track PEOPLE (who move around) not addresses (which obviously do not move) it is useful to keep the NCOA data separate from the original list data which contains demographics information such as race, income, sex, purchasing preferences and so forth. The other usage of the PKID is simply that it allows me to export for the NCOA process and then match the returning records to the original records. But to correlate between lists I really need a hash type of functionality. It is useful to discover that a record in a list about purchasing preferences "matches" a record in another list about mortgages or yet another list about insurance policies. John W. Colby Colby Consulting www.ColbyConsulting.com _______________________________________________ dba-SQLServer mailing list dba-SQLServer at databaseadvisors.com http://databaseadvisors.com/mailman/listinfo/dba-sqlserver http://www.databaseadvisors.com __________ NOD32 2922 (20080305) Information __________ This message was checked by NOD32 antivirus system. http://www.eset.com