[dba-SQLServer] SQL Server Hash algorithm - question, then rambling update

Sun Sep 5 09:26:27 CDT 2004

John,

What the client is talking about is taking x characters from first name, x
characters from last name, x characters from address1, x characters from
city, x characters from state, and x characters from the zip code and
creating a match key field from that. Other criteria could be added to
ensure the uniqueness of the record.

Then you could use something like
http://www.codeproject.com/database/xp_md5.asp to build a hash key.

Either key could be used for purposes of deduping.

Jim

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com]On Behalf Of John W.
Colby
Sent: Sunday, September 05, 2004 9:07 AM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] SQL Server Hash algorithm - question,then
rambling update

Eric,

>Why not use Merge Replication to merge your data?

Maybe pure ignorance.  As I have said I know nothing about SQL Server,
learning as I go.  However I am merging data in from a set of 22 comma
delimited text files.  From the word "replication" I assumed that this would
merge data already in a sql server database.

>For a unique id you can use the uniqueidentifier field type

Precisely.  I will be using a uniqueidentifier field as my pk.  However I
also need a unique index to prevent putting the same record in the database
twice.  This is a database of people in the US.  I need something to attempt
to recognize John Colby in Northfield CT as already in the db and not put a
second copy in the database.

The client says that the industry uses a field where they take N characters
from field A and M characters from field B and O characters from field C
etc.  I haven't seen any sign of such a field in the data that I am
importing, but they keep saying we need such a thing.

John W. Colby
www.ColbyConsulting.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Eric Barro
Sent: Sunday, September 05, 2004 3:27 AM
To: dba-sqlserver at databaseadvisors.com
Subject: RE: [dba-SQLServer] SQL Server Hash algorithm - question,then
rambling update

John,

Why not use Merge Replication to merge your data?'

For a unique id you can use the uniqueidentifier field type. Select yes for
the "is rowguid" option. SQL server creates a unique 16-bit id that you can
use. In fact this is the field type used by the field when you include the
table for replication.

---
Eric Barro
Senior Systems Analyst
Advanced Field Services
(208) 772-7060
http://www.afsweb.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com]On Behalf Of John W.
Colby
Sent: Saturday, September 04, 2004 11:35 PM
To: dba-sqlserver at databaseadvisors.com
Subject: [dba-SQLServer] SQL Server Hash algorithm - question,then rambling
update

Does anyone know of a robust hash algorithm for SQL Server?  I need to
generate a unique id (natural PK) for the nVLDB address database I am
working on.  I currently have a gaggle (as opposed to a google) of machines
busily importing the raw data into the database and once that is done I have
to build a unique identifier that can be used to find an address record if
it exists for merges and updates.  Anyone who has worked on databases like
this feel free to pipe in with how you did this.

Rambling update - I have been building a pair of machines with the MSI
Nforce3 motherboard for the ATHLON64 -

http://www.newegg.com/app/viewProductDesc.asp?description=13-130-457&depa=0

With a 3ghz A64 processor.  It promises 4 SATA channels and 2 IDE channels
for 8 total devices.  It promises that raids can be built from any
combination of drives.  I had hoped to set up a bunch of raid 1 containing
disk pairs to hold the database but in the end had a heck of a time trying
to get the raid to function reliably.  I was trying to set up a Raid 1 boot
disk and spent literally days installing and reinstalling trying to get that
working.  Never did.  I did however get a REAL PERSON at MSI to talk to and
with luck will sort that problem next week.

I finally simply had to give up for now and get on with the show since I
need results next week.  As a result I built one of the machines with (4)
SATA 250gb disks for a terrabyte of storage, threw in another 250 g drive on
an IDE channel to hold the log file and the raw data files from which the
import is happening.

This whole thing has been an exercise in humility I must admit, with much
time spent going nowhere.  I purchased (2) 1g Dimms a few weeks ago for an
existing Athlon machine (my old desktop) and when I purchased the pair of
motherboards I purchased (3) more gig sticks, one of which turns out to be
bad.  It turns out that the bad stick was one of two I opened to just test
the ram and so wrongly came to the conclusion that the ram simply did not
work with these motherboards.  Luckily after getting one of the motherboards
up I went back "one last time" to try and manually tweak the ram parameters
and see if they would work (didn't want to RMA them if I didn't have to) and
discovered that in fact 2 sticks did work.  Long story medium long, TODAY I
finally got ALL of the machines up and running.  The two new motherboards
and my original dev machine (which I had retired in favor of my Laptop).  I
am so tired of Windows installs I could spit.  Thank the big cahuna for high
speed internet.

I also figured out how to get SQL Server to use multiple files on different
disks as a single database, which given my total ignorance about SQL Server
I consider to be a major victory. When I started this I thought I needed a
single file on a HUGE disk.  So I have the (4) 250gb SATA drives each
holding a single mdf file for a total capacity of 1 terabyte.  By my
calculations the data will be around 600+ gbytes, giving me a little head
room.  A fifth 250gb drive will hold the indexes (assuming I can figure out
how to tell SQL where to put them).

I now have the second a64 machine, my old desktop, my new laptop and my
Wife's desktop all running bcp queries dumping the raw data into the server.
Each machine is also simultaneously unzipping a raw data file - ~350g
zipped, 10g unzipped.  Talk about saturating your network bandwidth!  With
luck, by this time tomorrow I will have all the data in and be able to start
the REAL work.

Tomorrow I have to figure out the unique index thing, plus start to look at
(become familiar with) the data fields.  I also want to build an
"autonumber" PK.  Eventually I would like to experiment with dividing the
data out onto different machines.  The database currently has 600 fields and
I fear that I am about to bump into the 8k / record limitation that someone
on the list has mentioned.  If that happens I will have to divide the db
vertically.  I also need to build some views containing subsets of the
fields to make analysis and exporting easier.  So much remains to be done,
and none of it could proceed until I got all the data in which was a waaaay
bigger task than I anticipated.

I have ordered the MS Action pack and somewhere down the road I hope to get
a Windows 2003 server set up.  I have heard rumors that it can run 64 bit
mode, and that SQL Server can as well, so if that is true I will be testing
a 64 bit system, perhaps setting up two identical systems, one 32 bit and
one 64 bit for a side by side speed comparison.  Of course I need to get
some PAYING work done to allow me to do that.  ;-)

Anyway, that's all the news that's fit to print.  Thanks to all the folks
that have been giving me suggestions and reading materials.

John W. Colby
www.ColbyConsulting.com

_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com

_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com

_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com