[dba-SQLServer] SQL Server 2005 - long count time

jwcolby jwcolby at colbyconsulting.com
Wed Jun 13 09:09:53 CDT 2007


 
Michael,

The indexes on the where clauses are the biggie.  There are about 650
demographics fields (on one of these tables, 65 million demographics
records).  Putting an index on each field is time consuming to say the
least.  Not to mention space consuming.  I am doing that as the client comes
in with orders needing the where clause on a given field.

This whole thing needs to be normalized such that I have a "fact table"
which is a field with the demographics data, a field with the fact name
(name of the original field in the demographics table) plus a field back to
the person/address that "owns" that fact.  Then I could merge all of the
demographics "facts" from every demographics table I get from the client
into a single fact table, and all of the addresses into a person / address
table (system).

I suspect that I will do that down the road a bit (as I get more comfortable
in SQL Server) but I have to make the data usable in the meantime.

John W. Colby
Colby Consulting
www.ColbyConsulting.com 
-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Michael
Maddison
Sent: Wednesday, June 13, 2007 9:56 AM
To: dba-sqlserver at databaseadvisors.com
Subject: Re: [dba-SQLServer] SQL Server 2005 - long count time

' Now that I understand what the CI indexes do, I will probably go back and
add both a CI and a NCI to each table, on the PKID, in order to allow SQL
Server to decide which is more efficient in any given case and use whichever
it needs in each situation.'

>>>Also as inserts and updates aren't an issue makes sure that both
sides of any join fields have indexes as well.
Also any columns in the where clause.  

cheers

MM

Michael,

>In my test scenario with 3mill rows the query select count(*) from
[testtable] was minutes faster on a heap, which I wouldn't have expected...
>It seemed to be the only query that was faster though.


Now take into account that it was minutes faster on a 3 million row table,
how much faster on a 90 million row table.

This leaves us to wonder if perhaps both would not be beneficial?  The nci
for doing counts (and I do a LOT of them!!!) and a clustered index for
actual joins, leaving it up to SQL Server as to which it uses.  The thing to
understand here is that ATM I have use a system where I have two actual
tables, a 90 million "original data" table and a matching N million "address
validated" table, which contains ONLY the address fields, NOT any additional
demographics fields.  The reason that the address validated table is "N
million" is that I delete non valid addresses from that table in order to
pare down the size.  So the "address validated table" has the same PKID
values as the "original data" table.  

When I process data, I use the original "data table" in where clauses
because that table contains the demographics - age, income, children,
preferences.  The "address validated" table has address information, but it
also has information that the address is valid just in the PK alone.
In other words, the fact that it is in the valid address table in and of
itself means that the address is valid (because I deleted all non-valid
records from THAT table.

So my client calls me and says "give me a count of all the households
(another subject) in these zips".  I can do that without ever going back to
the original data table.  In other cases he says "give me a count of all the
addresses in these zips where age=x, income is between y and z and has
female children.  That query needs to go back to the original data table for
the demographics information.

So you can see that these PK indexes play an absolutely huge role in my
business, both from the perspective of joins between the two tables as well
as from the perspective of raw counts when demographics are not needed.

Now, I can hear the protestations already:

1) Why not merge the valid address back in to the demographics table.  

Because this is but one of the demographics tables I deal with, there will
be literally dozens of them.  At some point in time the valid address data
from each demographics table has to be merged into a single valid address
table with an even more complex m-m table saying this valid address is
linked to THESE (multiple) demographics tables.

2) OK then why not delete the non-valid addresses from the demographics
table?

This is a "should be done" but in the end it makes only a small (but still
significant) dent in the total number of records in the demographics table,
about 20% maximum (so far).  What I really need to do is go back and remove
ALL address data from the demographics table.  Basically, someone DID merge
all of the address validation data back in to the demographics table (back
before I got ahold of it) and so the demographics table now has probably 40
(out of 700) fields that are old, out of date address validation info.

Even if I do that, I STILL need the PKs in both tables in order to join
demographics with valid address, and I need them to be FAST.  Now that I
understand what the CI indexes do, I will probably go back and add both a CI
and a NCI to each table, on the PKID, in order to allow SQL Server to decide
which is more efficient in any given case and use whichever it needs in each
situation.

John W. Colby
Colby Consulting
www.ColbyConsulting.com




More information about the dba-SQLServer mailing list