jwcolby
jwcolby at colbyconsulting.com
Wed May 16 10:09:53 CDT 2007
I now have two "high powered" servers built up. Both run Windows 2003 and SQL Server 2005. At the moment I am running my entire database on a single machine. The database consists of a couple of largish "raw data" tables (65 million recs / 700+ fields; 97 million recs / 149 fields), which contain the data pulled in from text files, with an autoincrement PK added for tracking purposes. Each raw data table then has its address / PK fields pulled out and "sent out" for address validation. The results are then reimported back into SQL Server, into the same DB the raw data table sits in. I have created a separate database file set for each "database" (raw / validated). As I mentioned earlier, I am creating a new database file set for each "order" I receive from my customer, in which I build up the views required to process that specific order. That is working quite well, BTW.

I have a bunch of questions re performance. I have discovered that I can create queries / views that pull the data straight out of the desired db / table for use in another db, simply by referencing the database / table. I think I can do the same thing if pieces are on another server instance.

What I am considering doing is placing these huge raw / validated database files out on the Stonehenge server, leaving the Azul server to contain and process the orders. Stonehenge is the newer machine and has a single partition with 1.6 TB open, and I will be adding another 500 GB to that soon, so it seems like the logical home for the big databases. I have a Gbit switch between all the machines on my network.

My question is, will there be any pros / cons to organizing things this way? I can get about 450 MB/sec burst streaming data off of my RAID arrays, which is considerably above the capacity of the Gbit switch (roughly 125 MB/sec), but it seems unlikely that SQL Server would actually process data at that speed anyway. So I want to place the big source databases on the new server and the order database on the original server.
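For what it's worth, here is a hedged sketch of the two referencing styles described above. Within one instance, a three-part name (database.schema.table) is all that is needed; across instances, SQL Server 2005 uses a linked server and four-part names. The database name RawDB and the column SomeField are illustrative assumptions, not names from the actual system:

```sql
-- Same instance, different database: a three-part name works directly.
-- (RawDB is a hypothetical database name.)
SELECT PKID
FROM RawDB.dbo.tblHSIDRaw
WHERE SomeField = 'X';

-- Different server instance: register the remote instance once as a
-- linked server (assuming Windows authentication is already in place)...
EXEC sp_addlinkedserver
    @server     = N'Stonehenge',   -- remote instance name (assumed)
    @srvproduct = N'SQL Server';

-- ...then reference its tables with four-part names.
SELECT PKID
FROM Stonehenge.RawDB.dbo.tblHSIDRaw
WHERE SomeField = 'X';
```

In my understanding, the optimizer will generally try to push a simple WHERE like this to the remote instance ("remote query execution"), but it is not guaranteed for every plan shape, which bears on the processing questions below.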
To give an example of a real order, I created a set of queries:

1) One query talks to the tblHSIDRaw table (75 million records / 700 fields), asking for "SELECT PKID ... WHERE ...". The WHERE clause encompasses about 9 different fields. All the fields are indexed, though how useful the indexes are (in all cases) is in doubt.

2) Another query uses a small table of 180 ZIPs provided by the client. That ZIP table is joined on the ZIP column of tblAZHSID, which is the table that has been processed for valid addresses. tblAZHSID is ~50 million records with about 20 fields: "SELECT PK FROM tblZIPS INNER JOIN tblAZHSID ON ...".

3) A third query requests data from tblHSIDRaw with a WHERE on a single specific field: "SELECT PK FROM tblHSIDRaw WHERE FieldX IN ()". I did that only because it has to be ANDed with all 9 fields in the first query. Perhaps not the most efficient, but it works.

4) A query that inner joins all the PKs and returns a result set.

So what happens if I place tblHSIDRaw and tblAZHSID out on Stonehenge? I assume that processor is passed the SELECT clauses and performs the processing required to return the PKs requested? Does Azul do any processing - perhaps the join of the result sets in query 4 to create the final data set? Is the process faster or slower than if the whole shootin' match ran on a single machine?

John W. Colby
Colby Consulting
www.ColbyConsulting.com
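The four steps above can be sketched as a single statement using CTEs (available in SQL Server 2005). This is only an illustration; the specific criteria fields, values, and the use of PKID as the common key column are assumptions, and the IN () list from step 3 is left as a placeholder since the real values weren't given:

```sql
-- Hypothetical sketch of the order pipeline; all criteria are assumed.
WITH Q1 AS (   -- step 1: nine ANDed criteria against the raw table
    SELECT PKID
    FROM tblHSIDRaw
    WHERE Field1 = 'A' AND Field2 = 'B' /* ...and the other 7 fields... */
),
Q2 AS (        -- step 2: PKs whose validated address is in the client's ZIPs
    SELECT a.PKID
    FROM tblZIPS z
    INNER JOIN tblAZHSID a ON a.ZIP = z.ZIP
),
Q3 AS (        -- step 3: the extra single-field criterion
    SELECT PKID
    FROM tblHSIDRaw
    WHERE FieldX IN (/* client-supplied values */)
)
SELECT q1.PKID  -- step 4: intersect the three PK sets
FROM Q1 q1
INNER JOIN Q2 q2 ON q2.PKID = q1.PKID
INNER JOIN Q3 q3 ON q3.PKID = q1.PKID;
```

As to where the work runs: with four-part names the optimizer on Azul decides how much of each branch to remote to Stonehenge, and it may choose to pull more rows across the wire than you'd like. One way (again, a suggestion, not something from the original post) to force each criteria scan to run entirely on Stonehenge is OPENQUERY, e.g. SELECT * FROM OPENQUERY(Stonehenge, 'SELECT PKID FROM RawDB.dbo.tblHSIDRaw WHERE ...'), so that only the PK lists travel over the Gbit link and Azul performs just the final join.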