[dba-SQLServer] SQL Server Hash algorithm - question, then rambling update

JMoss jmoss111 at bellsouth.net
Mon Sep 6 11:07:00 CDT 2004


John,

Even with that type of match key you will have dups - poor data entry,
people giving their names or addresses a bit differently, the same address
but without a zip code; with high-volume records like that you will see it
all. The worst records to convert are those obtained from web sites where
people leave their contact information - lots of garbage. We did CRM for
the NBA and various major league sports franchises, and they were the
hardest to ETL.

Are you running the A64 on a 32-bit OS, and if so, do you really think
that you see a good performance boost from the 64-bit architecture?

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of John W.
Colby
Sent: Monday, September 06, 2004 8:15 AM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] SQL Server Hash algorithm - question, then
rambling update


Thanks for the MATCH FIELD definition.  It's always good to know what others
found to work.  I wonder whether a match field ends up unique; it certainly
sounds as if it would.

John W. Colby
www.ColbyConsulting.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of JMoss
Sent: Monday, September 06, 2004 4:12 AM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] SQL Server Hash algorithm - question, then
rambling update


John,

We used tools from PeopleSmith for hygiene and deduping:

- RightFielder refielded badly fielded lists, recognized and refielded
  names out of company names, and separated primary from secondary
  addresses.
- Personator split and reformatted full or inverse names into prefix,
  first name, middle name, last name, and suffix, created genderized
  prefixes, and split addresses.
- Styleist propercased names and addresses, corrected punctuation, and
  expanded abbreviations.
- DoubleTake internally parsed names and addresses into parts (e.g. a
  street address is parsed into street number, street name, street
  direction such as N or North, etc.), built matchcodes from them, and
  deduped.

To build a match key in essentially the manner that DoubleTake does,
concatenate:

- the first three characters of the first name (after ensuring that any
  prefix or title has been removed from the first name and placed in a
  prefix column);
- the first four characters of the last name (after ensuring that any
  suffix has been removed from the end of the last name and placed in a
  suffix column);
- the first five characters of the company name;
- the first seven characters of address line 1;
- the last two characters of address line 1;
- the first five characters of address line 2;
- the first five characters of the city; and
- the first five characters of the zip (after ensuring that the zip is
  formatted properly and has been parsed into zip and zip plus 4).
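
For illustration, a minimal T-SQL sketch of that concatenation (the
dbo.Contacts table and its column names here are made up; the ISNULLs
matter, because concatenating a NULL would otherwise NULL out the whole
key):

-- Build a DoubleTake-style match key from already-cleaned columns.
-- Table and column names are hypothetical.
SELECT
    UPPER(
        LEFT(ISNULL(FirstName, ''), 3)
      + LEFT(ISNULL(LastName, ''), 4)
      + LEFT(ISNULL(Company, ''), 5)
      + LEFT(ISNULL(Address1, ''), 7)
      + RIGHT(RTRIM(ISNULL(Address1, '')), 2)
      + LEFT(ISNULL(Address2, ''), 5)
      + LEFT(ISNULL(City, ''), 5)
      + LEFT(ISNULL(Zip, ''), 5)
    ) AS MatchKey
FROM dbo.Contacts;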

Also, we did householding when the customer requested it.

We performed these steps before the data load, via a DataJunction process,
on a template file that was then loaded into the production databases.

You might want to experiment with the number of characters taken from each
field in the match key and see what kind of results you get, since we used
specialized tools for this process.

We never indexed a match key, nor did one exist in the prod db; it existed
only on the template file, before the load. But we did do a merge/purge
weekly or after a large ETL process.

Jim



-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of John W.
Colby
Sent: Sunday, September 05, 2004 9:45 AM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] SQL Server Hash algorithm - question, then
rambling update


MATCH FIELD!  That's what they called it.  What I haven't discovered is
whether the field is actually required, or whether a unique index on all
these fields prevents dupes and I'm done.  It seems like an extra step to
pull those X characters out, append them all together, and then drop them
into a new field.  They insist that it is needed, but they don't understand
databases.  I understand databases, but I don't know their business.

I suspect that this new "match field" is then indexed to prevent dupes.  Is
it used for anything else?  Is it a standard definition (how many characters
from which fields)?

John W. Colby
www.ColbyConsulting.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of JMoss
Sent: Sunday, September 05, 2004 10:26 AM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] SQL Server Hash algorithm - question, then
rambling update


John,

What the client is talking about is taking x characters from first name, x
characters from last name, x characters from address1, x characters from
city, x characters from state, and x characters from the zip code and
creating a match key field from that. Other criteria could be added to
ensure the uniqueness of the record.

Then you could use something like
http://www.codeproject.com/database/xp_md5.asp to build a hash key.

Either key could be used for deduping.
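
A sketch of both, assuming the match key has already been stored in a
MatchKey column on a hypothetical dbo.Contacts table. Note that HASHBYTES
arrived in SQL Server 2005; on SQL Server 2000 the xp_md5 extended stored
procedure from the article above fills the same role.

-- Hash the match key down to a fixed-width 16-byte value.
-- (HASHBYTES needs SQL Server 2005 or later; use xp_md5 on 2000.)
SELECT MatchKey,
       HASHBYTES('MD5', MatchKey) AS HashKey
FROM dbo.Contacts;

-- A unique index on the stored key then blocks dups at load time:
CREATE UNIQUE INDEX IX_Contacts_MatchKey ON dbo.Contacts (MatchKey);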

Jim



-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of John W.
Colby
Sent: Sunday, September 05, 2004 9:07 AM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] SQL Server Hash algorithm - question, then
rambling update


Eric,

>Why not use Merge Replication to merge your data?

Maybe pure ignorance.  As I have said, I know nothing about SQL Server; I'm
learning as I go.  However, I am merging in data from a set of 22
comma-delimited text files.  From the word "replication" I assumed that this
would merge data already in a SQL Server database.

>For a unique id you can use the uniqueidentifier field type

Precisely.  I will be using a uniqueidentifier field as my PK.  However, I
also need a unique index to prevent putting the same record in the database
twice.  This is a database of people in the US.  I need something that
attempts to recognize John Colby in Northfield, CT as already in the db and
not put a second copy in the database.

The client says that the industry uses a field where they take N characters
from field A and M characters from field B and O characters from field C
etc.  I haven't seen any sign of such a field in the data that I am
importing, but they keep saying we need such a thing.

John W. Colby
www.ColbyConsulting.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Eric Barro
Sent: Sunday, September 05, 2004 3:27 AM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] SQL Server Hash algorithm - question, then
rambling update


John,

Why not use Merge Replication to merge your data?

For a unique id you can use the uniqueidentifier field type. Select yes for
the "is rowguid" option. SQL Server creates a unique 16-byte (128-bit) id
that you can use. In fact, this is the field type used for that column when
you include the table in replication.
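
In DDL terms, a sketch (table and column names invented here): marking the
column ROWGUIDCOL is the script equivalent of the "is rowguid" option, and
a NEWID() default fills the value in automatically.

-- uniqueidentifier PK with the ROWGUIDCOL property; names hypothetical.
CREATE TABLE dbo.People (
    PersonID uniqueidentifier ROWGUIDCOL NOT NULL
        CONSTRAINT DF_People_PersonID DEFAULT NEWID()
        CONSTRAINT PK_People PRIMARY KEY,
    FirstName varchar(50) NULL,
    LastName  varchar(50) NULL
);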

---
Eric Barro
Senior Systems Analyst
Advanced Field Services
(208) 772-7060
http://www.afsweb.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of John W.
Colby
Sent: Saturday, September 04, 2004 11:35 PM
To: dba-sqlserver at databaseadvisors.com
Subject: [dba-SQLServer] SQL Server Hash algorithm - question, then
rambling update


Does anyone know of a robust hash algorithm for SQL Server?  I need to
generate a unique id (natural PK) for the nVLDB address database I am
working on.  I currently have a gaggle (as opposed to a google) of machines
busily importing the raw data into the database, and once that is done I
have to build a unique identifier that can be used to find an address
record, if it exists, for merges and updates.  Anyone who has worked on
databases like this, feel free to pipe in with how you did this.

Rambling update - I have been building a pair of machines with the MSI
Nforce3 motherboard for the Athlon 64 -

http://www.newegg.com/app/viewProductDesc.asp?description=13-130-457&depa=0

with a 3 GHz A64 processor.  The board promises 4 SATA channels and 2 IDE
channels for 8 total devices, and that RAID arrays can be built from any
combination of drives.  I had hoped to set up a bunch of RAID 1 disk pairs
to hold the database, but in the end I had a heck of a time trying to get
the RAID to function reliably.  I was trying to set up a RAID 1 boot disk
and spent literally days installing and reinstalling, trying to get that
working.  Never did.  I did, however, get a REAL PERSON at MSI to talk to,
and with luck will sort that problem out next week.

I finally had to give up for now and get on with the show, since I need
results next week.  As a result I built one of the machines with (4) 250 GB
SATA disks for a terabyte of storage, and threw in another 250 GB drive on
an IDE channel to hold the log file and the raw data files from which the
import is happening.

This whole thing has been an exercise in humility, I must admit, with much
time spent going nowhere.  I purchased (2) 1 GB DIMMs a few weeks ago for an
existing Athlon machine (my old desktop), and when I purchased the pair of
motherboards I purchased (3) more gig sticks, one of which turned out to be
bad.  The bad stick was one of the two I opened just to test the RAM, so I
wrongly concluded that the RAM simply did not work with these motherboards.
Luckily, after getting one of the motherboards up, I went back "one last
time" to manually tweak the RAM parameters and see if they would work (I
didn't want to RMA them if I didn't have to), and discovered that in fact 2
sticks did work.  Long story medium long: TODAY I finally got ALL of the
machines up and running - the two new motherboards and my original dev
machine (which I had retired in favor of my laptop).  I am so tired of
Windows installs I could spit.  Thank the big kahuna for high-speed
internet.

I also figured out how to get SQL Server to use multiple files on different
disks as a single database, which, given my total ignorance of SQL Server, I
consider a major victory.  When I started this I thought I needed a single
file on a HUGE disk.  So I have the (4) 250 GB SATA drives each holding one
of the database's data files, for a total capacity of 1 terabyte.  By my
calculations the data will be around 600+ GB, giving me a little head room.
A fifth 250 GB drive will hold the indexes (assuming I can figure out how to
tell SQL Server where to put them).
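
For anyone curious, a sketch of the syntax involved (database name, logical
file names, and drive paths are all invented).  CREATE DATABASE takes a
list of files, and a separate filegroup gives indexes a home of their own,
which answers the "where to put them" question:

-- Spread one database across several drives; names/paths hypothetical.
CREATE DATABASE AddressDB
ON PRIMARY
    (NAME = Addr1, FILENAME = 'C:\data\Addr1.mdf'),
    (NAME = Addr2, FILENAME = 'D:\data\Addr2.ndf'),
    (NAME = Addr3, FILENAME = 'E:\data\Addr3.ndf'),
    (NAME = Addr4, FILENAME = 'F:\data\Addr4.ndf'),
FILEGROUP IndexFG
    (NAME = AddrIdx, FILENAME = 'G:\data\AddrIdx.ndf')
LOG ON
    (NAME = AddrLog, FILENAME = 'H:\log\AddrLog.ldf');
GO

-- Indexes are then steered to that drive by naming the filegroup
-- (run against a table created in AddressDB):
CREATE INDEX IX_Contacts_Zip ON dbo.Contacts (Zip) ON IndexFG;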

I now have the second A64 machine, my old desktop, my new laptop, and my
wife's desktop all running bcp queries dumping the raw data into the server.
Each machine is also simultaneously unzipping a raw data file - ~350g
zipped, 10g unzipped.  Talk about saturating your network bandwidth!  With
luck, by this time tomorrow I will have all the data in and be able to start
the REAL work.
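
For reference, a typical bcp invocation for this kind of load looks
something like the line below (server, database, table, and file names are
invented; -c is character mode, -t sets the field terminator, -T uses a
trusted connection, and -b commits in batches):

bcp AddressDB.dbo.Contacts in C:\raw\file01.txt -S MYSERVER -T -c -t, -b 10000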

Tomorrow I have to figure out the unique index thing, plus start to look at
(become familiar with) the data fields.  I also want to build an
"autonumber" PK.  Eventually I would like to experiment with dividing the
data out onto different machines.  The database currently has 600 fields,
and I fear that I am about to bump into the 8K-per-record limitation that
someone on the list has mentioned.  If that happens I will have to divide
the db vertically.  I also need to build some views containing subsets of
the fields to make analysis and exporting easier.  So much remains to be
done, and none of it could proceed until I got all the data in, which was a
waaaay bigger task than I anticipated.
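
Two small sketches of those last two items, with invented names: an
"autonumber" in SQL Server terms is an IDENTITY column, and a view is just
a stored SELECT over the subset of fields you care about.

-- An "autonumber" PK is an IDENTITY column; names hypothetical.
ALTER TABLE dbo.Contacts
    ADD ContactNum int IDENTITY(1,1) NOT NULL;
GO

-- A view exposing a manageable subset of the 600 fields.
CREATE VIEW dbo.vNameAddress
AS
SELECT ContactNum, FirstName, LastName, Address1, City, State, Zip
FROM dbo.Contacts;
GO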

I have ordered the MS Action Pack, and somewhere down the road I hope to get
a Windows 2003 Server set up.  I have heard rumors that it can run in 64-bit
mode, and that SQL Server can as well, so if that is true I will be testing
a 64-bit system, perhaps setting up two identical systems, one 32-bit and
one 64-bit, for a side-by-side speed comparison.  Of course I need to get
some PAYING work done to allow me to do that.  ;-)

Anyway, that's all the news that's fit to print.  Thanks to all the folks
that have been giving me suggestions and reading materials.

John W. Colby
www.ColbyConsulting.com


_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com




