jwcolby
jwcolby at colbyconsulting.com
Wed May 16 10:09:53 CDT 2007
I now have two "high powered" servers built up. Both run Windows 2003 and SQL Server 2005. At the moment I am running my entire database on a single machine. The database consists of a couple of largish "raw data" tables (65 million recs / 700+ fields; 97 million recs / 149 fields), which contain the data pulled in from text files, with an autoincrement PK added for tracking purposes. Each raw data table then has its address / PK fields pulled out and "sent out" for address validation. The results are then reimported back into SQL Server, into the same DB the raw data table sits in. I have created a separate database file set for each "database" (raw / validated). As I mentioned earlier, I am creating a new database file set for each "order" I receive from my customer, in which I build up the views required to process that specific order. That is working quite well, BTW.

I have a bunch of questions re performance. I have discovered that I can create queries / views that pull the data straight out of the desired db / table for use in another db, simply by referencing the database / table. I think I can do the same thing if pieces are on another server instance.

What I am considering doing is placing these huge raw / validated database files out on the Stonehenge server, leaving the Azul server to contain and process the orders. Stonehenge is the newer machine and has a single partition with 1.6 TB open, and I will be adding another 500 GB to that soon, so it seems like the logical home for the big databases. I have a Gbit switch between all the machines on my network.

My question is, will there be any pros / cons to organizing things this way? I can get about 450 MB/sec burst streaming data off of my RAID arrays, which is considerably above the capacity of the Gbit switch (roughly 125 MB/sec), but it seems unlikely that SQL Server would actually process data at that speed anyway. So I want to place the big source databases on the new server and the order database on the original server.
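For what it's worth, here is a hedged sketch of the two referencing styles described above. Within one instance, a three-part name (database.schema.table) is all that is needed; across instances, SQL Server 2005 uses a linked server and four-part names. The database name RawDB and the column SomeField are illustrative assumptions, not names from the actual system:

```sql
-- Same instance, different database: a three-part name works directly.
-- (RawDB is a hypothetical database name.)
SELECT PKID
FROM RawDB.dbo.tblHSIDRaw
WHERE SomeField = 'X';

-- Different server instance: register the remote instance once as a
-- linked server (assuming Windows authentication is already in place)...
EXEC sp_addlinkedserver
    @server     = N'Stonehenge',   -- remote instance name (assumed)
    @srvproduct = N'SQL Server';

-- ...then reference its tables with four-part names.
SELECT PKID
FROM Stonehenge.RawDB.dbo.tblHSIDRaw
WHERE SomeField = 'X';
```

In my understanding, the optimizer will generally try to push a simple WHERE like this to the remote instance ("remote query execution"), but it is not guaranteed for every plan shape, which bears on the processing questions below.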
To give an example of a real order, I created a set of queries:

1) One query talks to the tblHSIDRaw table (75 million records / 700 fields), asking for "SELECT PKID ... WHERE ...". The WHERE clause encompasses about 9 different fields. All the fields are indexed, though how useful the indexes are (in all cases) is in doubt.

2) Another query uses a small table of 180 ZIPs provided by the client. That ZIP table is joined on the ZIP column of tblAZHSID, which is the table that has been processed for valid addresses. tblAZHSID is ~50 million records with about 20 fields: "SELECT PK FROM tblZIPS INNER JOIN tblAZHSID ON ...".

3) A third query requests data from tblHSIDRaw with a WHERE on a single specific field: "SELECT PK FROM tblHSIDRaw WHERE FieldX IN ()". I did that only because it has to be ANDed with all 9 fields in the first query. Perhaps not the most efficient, but it works.

4) A query that inner joins all the PKs and returns a result set.

So what happens if I place tblHSIDRaw and tblAZHSID out on Stonehenge? I assume that processor is passed the SELECT clauses and performs the processing required to return the PKs requested? Does Azul do any processing - perhaps the join of the result sets in query 4 to create the final data set? Is the process faster or slower than if the whole shootin' match ran on a single machine?

John W. Colby
Colby Consulting
www.ColbyConsulting.com
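The four steps above can be sketched as a single statement using CTEs (available in SQL Server 2005). This is only an illustration; the specific criteria fields, values, and the use of PKID as the common key column are assumptions, and the IN () list from step 3 is left as a placeholder since the real values weren't given:

```sql
-- Hypothetical sketch of the order pipeline; all criteria are assumed.
WITH Q1 AS (   -- step 1: nine ANDed criteria against the raw table
    SELECT PKID
    FROM tblHSIDRaw
    WHERE Field1 = 'A' AND Field2 = 'B' /* ...and the other 7 fields... */
),
Q2 AS (        -- step 2: PKs whose validated address is in the client's ZIPs
    SELECT a.PKID
    FROM tblZIPS z
    INNER JOIN tblAZHSID a ON a.ZIP = z.ZIP
),
Q3 AS (        -- step 3: the extra single-field criterion
    SELECT PKID
    FROM tblHSIDRaw
    WHERE FieldX IN (/* client-supplied values */)
)
SELECT q1.PKID  -- step 4: intersect the three PK sets
FROM Q1 q1
INNER JOIN Q2 q2 ON q2.PKID = q1.PKID
INNER JOIN Q3 q3 ON q3.PKID = q1.PKID;
```

As to where the work runs: with four-part names the optimizer on Azul decides how much of each branch to remote to Stonehenge, and it may choose to pull more rows across the wire than you'd like. One way (again, a suggestion, not something from the original post) to force each criteria scan to run entirely on Stonehenge is OPENQUERY, e.g. SELECT * FROM OPENQUERY(Stonehenge, 'SELECT PKID FROM RawDB.dbo.tblHSIDRaw WHERE ...'), so that only the PK lists travel over the Gbit link and Azul performs just the final join.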