Shamil Salakhetdinov
shamil at users.mns.ru
Mon Jan 7 07:33:13 CST 2008
Hi John,

Do not get "demoralized" :)

I just thought that with all your great hardware you could use the great .NET Framework + MS SQL 2005 + your custom code, running in several threads to the max, and get results even quicker using a sequential scan of one de-normalized table than you get by joining three tables and building many indexes for that join, indexes which "eat" HDD space and memory...

...it's pretty easy to run multi-threaded code on .NET - and my hypothesis is that with such multi-threaded code (or even without it) you can get your report text file in the same time or even quicker...

..."piping-through" - by that I mean a sequential scan of the de-normalized table, during which only the fields used as query filter criteria and the result fields are read...

...I could be wrong, but you can try to adapt the code I posted in this thread just to "pipe-through" your main table - if that is comparable (or less?) in time than the join, then my hypothesis is almost correct - the next step would be to try it with all three tables joined into one de-normalized table...

...AFAIK the .NET SqlClient namespace's set of classes works with SQL Server at a very low level, almost at the physical records/blocks level, and when SqlDataReader is used the sequential scan is incredibly speedy...

Just a hypothesis... I could be wrong - I have never had the opportunity to "pipe-through" as large an amount of data as you have there...

--
Shamil

-----Original Message-----
From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Monday, January 07, 2008 6:10 AM
To: 'Access Developers discussion and problem solving'
Subject: Re: [AccessD] x64 tests

Shamil,

>...kept in demoralized form...

The only one demoralized was me. I don't understand what you mean by "piping through". In any event, it appears that my hardware is sufficient, though it is expensive. The most expensive part is the RAID array, which I needed in any event. Back when I started this, 320 gig drives were the biggest I could afford. By putting eight 320 gig drives on a dedicated controller in RAID 6 I got ~300 (real) gigs x 6 of storage, which was big enough to deal with all the data itself. Once you have that, the streaming read rates are jaw dropping. Reading data off of 8 drives at once gives me very close to a 400 megs / second read rate. That is enough to keep the processors busy!

You mention expensive hardware; I built my servers from parts for about $3000 each, and that was about a year ago now. The controller was $500, and the 8 320 gig drives were more than 1/2 of the system cost, but it is that array which makes it fly.

These databases are pretty much read-only, static data that is rarely if ever updated. The selection fields are never updated at all. I do have to build indexes and stuff, but that is "as required". So I mostly have to focus on building a system that can READ lots of data quickly. Then it turns out that the "cover" indexes carve out frequently used fields into little "mini tables", kinda sorta.

I had a heck of a time originally because of the width of the tables, over 700 fields as I received it. So even a table scan on a single field could take 3 minutes. It was not pretty. With the cover indexes, one per field actually used in a select, suddenly SQL Server is going to the indexes to get what is needed instead of going to the actual 700-field-wide table. That just made all the difference in the world.
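For illustration, one such "cover" index might be created from .NET roughly like this; the field names (Income, PKID) are placeholders, not the actual HSID schema, and the connection string is made up:

    using System;
    using System.Data.SqlClient;

    class CreateCoverIndex
    {
        static void Main()
        {
            // Placeholder connection string - point it at the real server/database.
            string connStr = "Data Source=.;Initial Catalog=HSIDDb;Integrated Security=SSPI";

            // One narrow nonclustered index per selection field: the key is the
            // criteria field, INCLUDE carries the PK so a filter on that field
            // never has to touch the 700-field-wide base table ("covering" it).
            string ddl = @"CREATE NONCLUSTERED INDEX IX_HSID_Income
                           ON dbo.HSID (Income)
                           INCLUDE (PKID)";

            using (SqlConnection cn = new SqlConnection(connStr))
            using (SqlCommand cmd = new SqlCommand(ddl, cn))
            {
                cmd.CommandTimeout = 0;   // an index build over 50 million rows can take a while
                cn.Open();
                cmd.ExecuteNonQuery();
            }
            Console.WriteLine("Cover index created.");
        }
    }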
>...and then you can "pipe-through" all those denormalized data rows using code, which, if you use VB.Net/C#, would be very speedy and which could implement much more advanced filtering/data mining than T-SQL...

So what do you mean by this, "pipe through"?

As for data mining, I am not sure that I would define my usage as data mining. My interpretation of data mining is more analysis than anything. I am really just selecting recordsets. I need names / addresses where... The end purpose is to send names / addresses to a bulk mailer who wants to send an advertisement to a selected group of people. OTOH, if I can learn something new I am certainly all ears.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Shamil Salakhetdinov
Sent: Sunday, January 06, 2008 8:35 PM
To: 'Access Developers discussion and problem solving'
Subject: Re: [AccessD] x64 tests

John,

I thought you could have all your three tables joined and kept in demoralized form... ...then there will be no need for indexes... ...and then you can "pipe-through" all those denormalized data rows using code, which, if you use VB.Net/C#, would be very speedy and which could implement much more advanced filtering/data mining than T-SQL...

I could be wrong, but joining 50 mln. x 50 mln. x ZipCodeWorld to get a 600,000-row result is quite a task and it needs expensive hardware, while "piping-through" 50 mln. denormalized data rows in code can, I expect, be done at similar or higher speed on less expensive hardware...

--
Shamil

-----Original Message-----
From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Sunday, January 06, 2008 8:40 PM
To: 'Access Developers discussion and problem solving'
Subject: Re: [AccessD] x64 tests

I don't really understand what you are doing. I suppose you are saying just get the select fields and process the "do they match the criteria" part in code?

Before we go further, remember that we are simply trying to select name / address fields for export, based on a set of criteria. We have to use three different tables: HSID has a PK field and 640 fields of data. I would need to select 50 million records of the PK and N criteria fields (2 age fields, 1 income field and 8 hobby fields). I also have a table AZHSID which has the PKID and the name / address fields, so I would need to get the PKID plus fname, lname, city, state, zip and a HashFamily field - 50 million records of that data. Then I have ZipCodeWorld, which has the zip code information that tells me what zips are in a specific metropolitan code.

So I have to build a query to join the HSID, AZHSID and Accuzip tables and pull all of the fields, then process them in code? I would also have to write some sort of generic code to do the selection.

Of course there are many ways to perform the queries anyway, and I should go back and build them differently just to see what happens. I could have simply built up one big query. In the past I hesitated to do this because the queries would take 20-30 minutes to run. If anything went wrong I would have to make changes and retest, waiting another 20-30 minutes. So I learned to break things down into pieces so that the pieces only took a few minutes each.

My client had counts that he thought represented the number of records for each criterion, and I was trying to check the counts on my tables against what he thought he had. He had a count for the age / sex, the income, and then each hobby, so I built up separate queries for those pieces so that I could count the pieces.
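One of those count-check pieces might look roughly like the following; the column name and income band are placeholders, since the actual criteria fields are not spelled out in the thread:

    using System;
    using System.Data.SqlClient;

    class CountPiece
    {
        static void Main()
        {
            // Placeholder connection string.
            string connStr = "Data Source=.;Initial Catalog=HSIDDb;Integrated Security=SSPI";

            // Count just one criterion (an income band) so the result can be
            // checked against the client's expected count before combining
            // all the pieces into the full selection.
            string sql = @"SELECT COUNT(*)
                           FROM dbo.HSID
                           WHERE Income BETWEEN @IncLow AND @IncHigh";

            using (SqlConnection cn = new SqlConnection(connStr))
            using (SqlCommand cmd = new SqlCommand(sql, cn))
            {
                cmd.CommandTimeout = 0;
                cmd.Parameters.AddWithValue("@IncLow", 50000);
                cmd.Parameters.AddWithValue("@IncHigh", 100000);
                cn.Open();
                int count = (int)cmd.ExecuteScalar();
                Console.WriteLine("Income piece: {0:N0} records", count);
            }
        }
    }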
In the end, the point of this exercise was not to optimize the process but rather to see how moving from x32 to x64 changed the time required. Unfortunately that part is masked by the disk subsystem, so I didn't really learn anything there. Then I wanted to learn how moving from 4 gigs to 8 gigs changed the time required, and likewise how moving from dual to quad core changed the time. I really expected the doubling of memory to substantially change the times, but it did not. I also expected SQL Server x64 to be able to natively handle large amounts of memory, not depend on AWE, which I understand to be "paging" of memory.

At this point I am actually quite happy with a 5 minute time to grab all the data and get it out to a text file. I think that for processing 50 million records in two tables, and narrowing it down to 600 thousand records using about 10 select fields and three joins, 5 minutes is doing OK. Especially since it was taking me 20-30 minutes before I moved to the "cover" queries. I do think I will go back and build this as one big query and see how long that takes.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Shamil Salakhetdinov
Sent: Sunday, January 06, 2008 9:11 AM
To: 'Access Developers discussion and problem solving'
Subject: Re: [AccessD] x64 tests

Hello John,

Did you ever try to just "pipe-through" your millions of records and process them in code? I mean, nowadays memory and processors are so incredibly quick, and your query (selection) criteria look rather simple - therefore it could be (much) quicker to get results using some custom coding...

I do not have millions of records to test this hypothesis, but here is some sample C# code which runs against a DB with 30,000+ member records and cycles 2,500 times to get the counter to 80+ million, and it does get the data into local vars for further processing - the result is that it takes 3 min 30 sec for 89+ million records on a "slow" dual-core Pentium with 2GB RAM and a simple 200GB IDE drive...

Please correct me if I made some mistakes below. Thanks.

--
Shamil
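A rough sketch of the kind of timing loop described (not the original posting; the connection string, table and column names here are assumptions):

    using System;
    using System.Data.SqlClient;
    using System.Diagnostics;

    class PipeThroughTest
    {
        static void Main()
        {
            // Placeholder connection string and member table - adjust to the real Db.
            string connStr = "Data Source=.;Initial Catalog=MembersDb;Integrated Security=SSPI";
            string sql = "SELECT MemberId, LastName, BirthDate FROM dbo.Members";

            long rows = 0;
            Stopwatch sw = Stopwatch.StartNew();

            // Cycle over the ~30,000-row member table enough times to push the
            // counter past 80 million rows, reading each column into local vars.
            for (int cycle = 0; cycle < 2500; cycle++)
            {
                using (SqlConnection cn = new SqlConnection(connStr))
                using (SqlCommand cmd = new SqlCommand(sql, cn))
                {
                    cn.Open();
                    using (SqlDataReader rdr = cmd.ExecuteReader())
                    {
                        while (rdr.Read())
                        {
                            int id = rdr.GetInt32(0);           // "pipe-through": pull the
                            string name = rdr.GetString(1);     // fields into local variables
                            DateTime born = rdr.GetDateTime(2); // for further processing
                            rows++;
                        }
                    }
                }
            }

            sw.Stop();
            Console.WriteLine("{0:N0} rows in {1}", rows, sw.Elapsed);
        }
    }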