Shamil Salakhetdinov
shamil at users.mns.ru
Mon Jan 7 07:33:13 CST 2008
Hi John,

Do not get "demoralized" :)

I just thought that with all your great hardware you could use the great .NET Framework + MS SQL 2005 + your custom code, running in several threads to the max, and get results even quicker using a sequential scan of one de-normalized table than you get by joining three tables and building many indexes for that join, indexes which "eat" HDD space and memory...

...it's pretty easy to run multi-threaded code on .NET - and my hypothesis is that with such multi-threaded code (or even without it) you can get your report text file in the same time or even quicker...

..."piping-through" - by that I mean a sequential scan of the de-normalized table, during which only the fields used as query filter criteria and the result fields are read...

...I could be wrong, but you can try to adapt the code I posted in this thread just to "pipe-through" your main table - if that is comparable (or less?) in time than the join, then my hypothesis is almost correct - the next step would be to try it with all three tables joined into one de-normalized table...

...AFAIK the .NET SqlClient namespace's set of classes works with SQL Server at a very low level, almost at the physical records/blocks level, and when SqlDataReader is used the sequential scan is incredibly speedy...

Just a hypothesis... I could be wrong - I have never had the opportunity to "pipe-through" as large an amount of data as you have there...

--
Shamil

-----Original Message-----
From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Monday, January 07, 2008 6:10 AM
To: 'Access Developers discussion and problem solving'
Subject: Re: [AccessD] x64 tests

Shamil,

>...kept in demoralized form...

The only one demoralized was me. I don't understand what you mean by "piping through". In any event, it appears that my hardware is sufficient, though it is expensive. The most expensive part is the RAID array, which I needed in any event. Back when I started this, 320 gig drives were the biggest I could afford. By putting eight 320 gig drives on a dedicated controller in RAID 6 I got ~300 (real) gigs x 6 of storage, which was big enough to deal with all the data itself. Once you have that, the streaming read rates are jaw dropping. Reading data off of 8 drives at once gives me very close to a 400 megs / second read rate. That is enough to keep the processors busy!

You mention expensive hardware; I built my servers from parts for about $3000 each, and that was about a year ago now. The controller was $500, and the 8 320 gig drives were more than 1/2 of the system cost, but it is that array which makes it fly.

These databases are pretty much read-only, static data that is rarely if ever updated. The selection fields are never updated at all. I do have to build indexes and stuff, but that is "as required". So I mostly have to focus on building a system that can READ lots of data quickly. Then it turns out that the "cover" indexes carve out frequently used fields into little "mini tables", kinda sorta.

I had a heck of a time originally because of the width of the tables, over 700 fields as I received it. So even a table scan on a single field could take 3 minutes. It was not pretty. With the cover indexes, one per field actually used in a select, suddenly SQL Server is going to the indexes to get what is needed instead of going to the actual 700-field-wide table. That just made all the difference in the world.
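For illustration, one such "cover" index might be created from .NET roughly like this; the field names (Income, PKID) are placeholders, not the actual HSID schema, and the connection string is made up:

    using System;
    using System.Data.SqlClient;

    class CreateCoverIndex
    {
        static void Main()
        {
            // Placeholder connection string - point it at the real server/database.
            string connStr = "Data Source=.;Initial Catalog=HSIDDb;Integrated Security=SSPI";

            // One narrow nonclustered index per selection field: the key is the
            // criteria field, INCLUDE carries the PK so a filter on that field
            // never has to touch the 700-field-wide base table ("covering" it).
            string ddl = @"CREATE NONCLUSTERED INDEX IX_HSID_Income
                           ON dbo.HSID (Income)
                           INCLUDE (PKID)";

            using (SqlConnection cn = new SqlConnection(connStr))
            using (SqlCommand cmd = new SqlCommand(ddl, cn))
            {
                cmd.CommandTimeout = 0;   // an index build over 50 million rows can take a while
                cn.Open();
                cmd.ExecuteNonQuery();
            }
            Console.WriteLine("Cover index created.");
        }
    }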
>...and then you can "pipe-through" all those denormalized data rows using code, which, if you use VB.Net/C#, would be very speedy and which could implement much more advanced filtering/data mining than T-SQL...

So what do you mean by this, "pipe through"?

As for data mining, I am not sure that I would define my usage as data mining. My interpretation of data mining is more analysis than anything. I am really just selecting recordsets. I need names / addresses where... The end purpose is to send names / addresses to a bulk mailer who wants to send an advertisement to a selected group of people. OTOH, if I can learn something new I am certainly all ears.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Shamil Salakhetdinov
Sent: Sunday, January 06, 2008 8:35 PM
To: 'Access Developers discussion and problem solving'
Subject: Re: [AccessD] x64 tests

John,

I thought you could have all your three tables joined and kept in demoralized form... ...then there will be no need for indexes... ...and then you can "pipe-through" all those denormalized data rows using code, which, if you use VB.Net/C#, would be very speedy and which could implement much more advanced filtering/data mining than T-SQL...

I could be wrong, but joining 50 mln. x 50 mln. x ZipCodeWorld to get a 600,000-row result is quite a task and it needs expensive hardware, while "piping-through" 50 mln. denormalized data rows in code can, I expect, be done at similar or higher speed on less expensive hardware...

--
Shamil

-----Original Message-----
From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Sunday, January 06, 2008 8:40 PM
To: 'Access Developers discussion and problem solving'
Subject: Re: [AccessD] x64 tests

I don't really understand what you are doing. I suppose you are saying just get the select fields and process the "do they match the criteria" part in code?

Before we go further, remember that we are simply trying to select name / address fields for export, based on a set of criteria. We have to use three different tables: HSID has a PK field and 640 fields of data. I would need to select 50 million records of the PK and N criteria fields (2 age fields, 1 income field and 8 hobby fields). I also have a table AZHSID which has the PKID and the name / address fields, so I would need to get the PKID plus fname, lname, city, state, zip and a HashFamily field - 50 million records of that data. Then I have ZipCodeWorld, which has the zip code information that tells me what zips are in a specific metropolitan code.

So I have to build a query to join the HSID, AZHSID and Accuzip tables and pull all of the fields, then process them in code? I would also have to write some sort of generic code to do the selection.

Of course there are many ways to perform the queries anyway, and I should go back and build them differently just to see what happens. I could have simply built up one big query. In the past I hesitated to do this because the queries would take 20-30 minutes to run. If anything went wrong I would have to make changes and retest, waiting another 20-30 minutes. So I learned to break things down into pieces so that the pieces only took a few minutes each.

My client had counts that he thought represented the number of records for each criterion, and I was trying to check the counts on my tables against what he thought he had. He had a count for the age / sex, the income, and then each hobby, so I built up separate queries for those pieces so that I could count the pieces.
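One of those count-check pieces might look roughly like the following; the column name and income band are placeholders, since the actual criteria fields are not spelled out in the thread:

    using System;
    using System.Data.SqlClient;

    class CountPiece
    {
        static void Main()
        {
            // Placeholder connection string.
            string connStr = "Data Source=.;Initial Catalog=HSIDDb;Integrated Security=SSPI";

            // Count just one criterion (an income band) so the result can be
            // checked against the client's expected count before combining
            // all the pieces into the full selection.
            string sql = @"SELECT COUNT(*)
                           FROM dbo.HSID
                           WHERE Income BETWEEN @IncLow AND @IncHigh";

            using (SqlConnection cn = new SqlConnection(connStr))
            using (SqlCommand cmd = new SqlCommand(sql, cn))
            {
                cmd.CommandTimeout = 0;
                cmd.Parameters.AddWithValue("@IncLow", 50000);
                cmd.Parameters.AddWithValue("@IncHigh", 100000);
                cn.Open();
                int count = (int)cmd.ExecuteScalar();
                Console.WriteLine("Income piece: {0:N0} records", count);
            }
        }
    }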
In the end, the point of this exercise was not to optimize the process but rather to see how moving from x32 to x64 changed the time required. Unfortunately that part is masked by the disk subsystem, so I didn't really learn anything there. Then I wanted to learn how moving from 4 gigs to 8 gigs changed the time required, and likewise how moving from dual to quad core changed the time. I really expected the doubling of memory to substantially change the times, but it did not. I also expected SQL Server x64 to be able to natively handle large amounts of memory, not depend on AWE, which I understand to be "paging" of memory.

At this point I am actually quite happy with a 5 minute time to grab all the data and get it out to a text file. I think that for processing 50 million records in two tables, and narrowing it down to 600 thousand records using about 10 select fields and three joins, 5 minutes is doing OK. Especially since it was taking me 20-30 minutes before I moved to the "cover" queries. I do think I will go back and build this as one big query and see how long that takes.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Shamil Salakhetdinov
Sent: Sunday, January 06, 2008 9:11 AM
To: 'Access Developers discussion and problem solving'
Subject: Re: [AccessD] x64 tests

Hello John,

Did you ever try to just "pipe-through" your millions of records and process them in code? I mean, nowadays memory and processors are so incredibly quick, and your query (selection) criteria look rather simple - therefore it could be (much) quicker to get results using some custom coding...

I do not have millions of records to test this hypothesis, but here is some sample C# code which runs against a DB with 30,000+ member records and cycles 2,500 times to get the counter to 80+ million, and it does get the data into local vars for further processing - the result is that it takes 3 min 30 sec for 89+ million records on a "slow" dual-core Pentium with 2GB RAM and a simple 200GB IDE drive...

Please correct me if I made some mistakes below. Thanks.

--
Shamil
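A rough sketch of the kind of timing loop described (not the original posting; the connection string, table and column names here are assumptions):

    using System;
    using System.Data.SqlClient;
    using System.Diagnostics;

    class PipeThroughTest
    {
        static void Main()
        {
            // Placeholder connection string and member table - adjust to the real Db.
            string connStr = "Data Source=.;Initial Catalog=MembersDb;Integrated Security=SSPI";
            string sql = "SELECT MemberId, LastName, BirthDate FROM dbo.Members";

            long rows = 0;
            Stopwatch sw = Stopwatch.StartNew();

            // Cycle over the ~30,000-row member table enough times to push the
            // counter past 80 million rows, reading each column into local vars.
            for (int cycle = 0; cycle < 2500; cycle++)
            {
                using (SqlConnection cn = new SqlConnection(connStr))
                using (SqlCommand cmd = new SqlCommand(sql, cn))
                {
                    cn.Open();
                    using (SqlDataReader rdr = cmd.ExecuteReader())
                    {
                        while (rdr.Read())
                        {
                            int id = rdr.GetInt32(0);           // "pipe-through": pull the
                            string name = rdr.GetString(1);     // fields into local variables
                            DateTime born = rdr.GetDateTime(2); // for further processing
                            rows++;
                        }
                    }
                }
            }

            sw.Stop();
            Console.WriteLine("{0:N0} rows in {1}", rows, sw.Elapsed);
        }
    }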