JMoss
jmoss111 at bellsouth.net
Mon Sep 6 11:07:00 CDT 2004
John,

Even with that type of match key you will have dups, whether from poor data entry, from people giving their names or addresses a bit differently, or from the same address entered without a zip code. With high-volume records like that, you will see it all. The worst records to convert are those obtained from web sites where people leave their contact information; lots of garbage. We did CRM for the NBA and various major league sports franchises, and they were the hardest to ETL.

Are you running the A64 on a 32-bit OS, and if so, do you really think that you see a good performance boost from the 64-bit architecture?

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of John W. Colby
Sent: Monday, September 06, 2004 8:15 AM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] SQL Server Hash algorithm - question, then rambling update

Thanks for the MATCH FIELD definition. It's always good to know what others found to work. I wonder whether a match field ends up unique; it certainly sounds as if it would.

John W.
Colby
www.ColbyConsulting.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of JMoss
Sent: Monday, September 06, 2004 4:12 AM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] SQL Server Hash algorithm - question, then rambling update

John,

We used tools from PeopleSmith for hygiene and deduping. RightFielder refielded badly fielded lists, recognized and refielded names out of company names, and refielded primary from secondary addresses. Personator split and reformatted full or inverse names into prefix, first name, middle name, last name, and suffix, created genderized prefixes, and split addresses. Styleist propercased names and addresses, corrected punctuation, and expanded abbreviations. DoubleTake internally parses names and addresses into parts (i.e. a street address is parsed into street number, street name, street direction such as N or North, etc.), builds matchcodes from them, and dedupes.

To build a match key roughly the way DoubleTake does, use:

- the first three characters of the first name, after ensuring that any prefix or title has been removed from the first name and placed in a prefix column;
- the first four characters of the last name, after ensuring that any suffix has been removed from the end of the last name and placed in a suffix column;
- the first five characters of the company name;
- the first seven characters of address line 1;
- the last two characters of address line 1;
- the first five characters of address line 2;
- the first five characters of the city; and
- the first five characters of the zip, after ensuring that the zip is formatted properly and has been parsed into zip and zip plus 4.

Also, we used householding upon the customer's request. We performed these steps before the data load, via a DataJunction process against a template file, which was then loaded into the production databases.
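[Editor's note: the truncation rules above can be sketched in code. This is a minimal illustration only, not the actual DoubleTake algorithm; the record field names (first_name, last_name, company, address1, address2, city, zip) are hypothetical, and it assumes the prefix/suffix stripping and zip parsing described above have already been done.]

```python
def match_key(rec):
    """Build a DoubleTake-style match key from a cleaned record (dict).

    Field names are illustrative, not from any particular schema.
    Assumes name prefixes/suffixes already moved to their own columns
    and zip already parsed into zip and zip+4.
    """
    def take(s, n):
        # Normalize: uppercase, keep only letters/digits, pad short
        # values so every segment has a fixed width.
        cleaned = "".join(ch for ch in (s or "").upper() if ch.isalnum())
        return cleaned[:n].ljust(n, "_")

    addr1 = (rec.get("address1") or "").replace(" ", "")
    return "".join([
        take(rec.get("first_name"), 3),   # first 3 of first name
        take(rec.get("last_name"), 4),    # first 4 of last name
        take(rec.get("company"), 5),      # first 5 of company
        take(addr1, 7),                   # first 7 of address line 1
        take(addr1[-2:], 2),              # last 2 of address line 1
        take(rec.get("address2"), 5),     # first 5 of address line 2
        take(rec.get("city"), 5),         # first 5 of city
        take(rec.get("zip"), 5),          # first 5 of zip
    ])
```

Two records that agree on all of these truncated segments collapse to the same 36-character key, which is what makes the match fuzzier than comparing whole fields.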
You might want to experiment with the number of characters used from each field in the match key and see what kind of results you get, because we used specialized tools for this process. We never indexed a match key, nor did one exist in the prod db; it existed only before the load, on the template file. But we did do a merge/purge weekly, or after a large ETL process.

Jim

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of John W. Colby
Sent: Sunday, September 05, 2004 9:45 AM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] SQL Server Hash algorithm - question, then rambling update

MATCH FIELD! That's the name they called it.

What I haven't discovered is whether the field is actually required, or whether a unique index on all these fields prevents dupes and I'm done. It seems like an extra step to pull those X characters out, append them all together, then drop them in a new field. They insist that it is needed, but they don't understand databases. I understand databases, but I don't know their business. I suspect that this new "match field" is then indexed to prevent dupes. Is it used for anything else? Is it a standard definition (how many characters from which fields)?

John W. Colby
www.ColbyConsulting.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of JMoss
Sent: Sunday, September 05, 2004 10:26 AM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] SQL Server Hash algorithm - question, then rambling update

John,

What the client is talking about is taking x characters from first name, x characters from last name, x characters from address1, x characters from city, x characters from state, and x characters from the zip code, and creating a match key field from that. Other criteria could be added to ensure the uniqueness of the record.
Then you could use something like http://www.codeproject.com/database/xp_md5.asp to build a hash key. Either key could be used for deduping.

Jim

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of John W. Colby
Sent: Sunday, September 05, 2004 9:07 AM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] SQL Server Hash algorithm - question, then rambling update

Eric,

>Why not use Merge Replication to merge your data?

Maybe pure ignorance. As I have said, I know nothing about SQL Server; I am learning as I go. However, I am merging data in from a set of 22 comma-delimited text files. From the word "replication" I assumed that this would merge data already in a SQL Server database.

>For a unique id you can use the uniqueidentifier field type

Precisely. I will be using a uniqueidentifier field as my PK. However, I also need a unique index to prevent putting the same record in the database twice. This is a database of people in the US. I need something that attempts to recognize John Colby in Northfield CT as already in the db, and not put a second copy in the database. The client says that the industry uses a field where they take N characters from field A, M characters from field B, O characters from field C, etc. I haven't seen any sign of such a field in the data that I am importing, but they keep saying we need such a thing.

John W. Colby
www.ColbyConsulting.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Eric Barro
Sent: Sunday, September 05, 2004 3:27 AM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] SQL Server Hash algorithm - question, then rambling update

John,

Why not use Merge Replication to merge your data?

For a unique id you can use the uniqueidentifier field type. Select yes for the "is rowguid" option.
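[Editor's note: the hash-key idea Jim suggests above (an MD5 over the match key, which is what the xp_md5 extended stored procedure provides inside SQL Server) can be illustrated outside the database. A minimal Python sketch, purely to show the mechanics; the dedupe helper is an illustration, not part of any of the tools discussed:]

```python
import hashlib

def hash_key(match_key: str) -> str:
    # 32-character hex MD5 digest, comparable in spirit to what the
    # xp_md5 extended stored procedure returns inside SQL Server.
    return hashlib.md5(match_key.encode("utf-8")).hexdigest()

def dedupe(records, key_func):
    # Keep the first record seen for each hash key; drop the rest.
    seen = set()
    kept = []
    for rec in records:
        h = hash_key(key_func(rec))
        if h not in seen:
            seen.add(h)
            kept.append(rec)
    return kept
```

Note that a hash key like this only catches exact duplicates of the match key; the fuzziness comes from the truncation rules used to build the match key itself, not from the hash.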
SQL Server creates a unique 16-byte id that you can use. In fact, this is the field type used by the field when you include the table for replication.

---
Eric Barro
Senior Systems Analyst
Advanced Field Services
(208) 772-7060
http://www.afsweb.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of John W. Colby
Sent: Saturday, September 04, 2004 11:35 PM
To: dba-sqlserver at databaseadvisors.com
Subject: [dba-SQLServer] SQL Server Hash algorithm - question, then rambling update

Does anyone know of a robust hash algorithm for SQL Server? I need to generate a unique id (natural PK) for the nVLDB address database I am working on. I currently have a gaggle (as opposed to a google) of machines busily importing the raw data into the database, and once that is done I have to build a unique identifier that can be used to find an address record, if it exists, for merges and updates. Anyone who has worked on databases like this, feel free to pipe in with how you did it.

Rambling update - I have been building a pair of machines with the MSI Nforce3 motherboard for the Athlon64 - http://www.newegg.com/app/viewProductDesc.asp?description=13-130-457&depa=0 - with a 3 GHz A64 processor. It promises 4 SATA channels and 2 IDE channels for 8 total devices, and that RAIDs can be built from any combination of drives. I had hoped to set up a bunch of RAID 1 disk pairs to hold the database, but in the end I had a heck of a time trying to get the RAID to function reliably. I was trying to set up a RAID 1 boot disk and spent literally days installing and reinstalling, trying to get that working. Never did. I did, however, get a REAL PERSON at MSI to talk to, and with luck will sort that problem out next week. I finally had to give up for now and get on with the show, since I need results next week.
As a result, I built one of the machines with (4) SATA 250 GB disks for a terabyte of storage, and threw in another 250 GB drive on an IDE channel to hold the log file and the raw data files from which the import is happening.

This whole thing has been an exercise in humility, I must admit, with much time spent going nowhere. I purchased (2) 1 GB DIMMs a few weeks ago for an existing Athlon machine (my old desktop), and when I purchased the pair of motherboards I purchased (3) more gig sticks, one of which turned out to be bad. It turns out that the bad stick was one of the two I opened just to test the RAM, and so I wrongly came to the conclusion that the RAM simply did not work with these motherboards. Luckily, after getting one of the motherboards up, I went back "one last time" to try manually tweaking the RAM parameters and see if they would work (didn't want to RMA them if I didn't have to), and discovered that in fact 2 sticks did work.

Long story medium long, TODAY I finally got ALL of the machines up and running: the two new motherboards and my original dev machine (which I had retired in favor of my laptop). I am so tired of Windows installs I could spit. Thank the big kahuna for high-speed internet.

I also figured out how to get SQL Server to use multiple files on different disks as a single database, which, given my total ignorance about SQL Server, I consider a major victory. When I started this I thought I needed a single file on a HUGE disk. So I have the (4) 250 GB SATA drives each holding a single mdf file, for a total capacity of 1 terabyte. By my calculations the data will be around 600+ GB, giving me a little head room. A fifth 250 GB drive will hold the indexes (assuming I can figure out how to tell SQL where to put them).

I now have the second A64 machine, my old desktop, my new laptop, and my wife's desktop all running bcp queries dumping the raw data into the server. Each machine is also simultaneously unzipping a raw data file - ~350g zipped, 10g unzipped.
Talk about saturating your network bandwidth! With luck, by this time tomorrow I will have all the data in and be able to start the REAL work.

Tomorrow I have to figure out the unique index thing, plus start to look at (become familiar with) the data fields. I also want to build an "autonumber" PK. Eventually I would like to experiment with dividing the data out onto different machines. The database currently has 600 fields, and I fear that I am about to bump into the 8K-per-record limitation that someone on the list mentioned. If that happens I will have to divide the db vertically. I also need to build some views containing subsets of the fields to make analysis and exporting easier. So much remains to be done, and none of it could proceed until I got all the data in, which was a waaaay bigger task than I anticipated.

I have ordered the MS Action Pack, and somewhere down the road I hope to get a Windows 2003 server set up. I have heard rumors that it can run in 64-bit mode, and that SQL Server can as well, so if that is true I will be testing a 64-bit system, perhaps setting up two identical systems, one 32-bit and one 64-bit, for a side-by-side speed comparison. Of course, I need to get some PAYING work done to allow me to do that. ;-)

Anyway, that's all the news that's fit to print. Thanks to all the folks who have been giving me suggestions and reading materials.

John W.
Colby
www.ColbyConsulting.com

_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com