[dba-SQLServer] SQL Server 2005 - long count time

Michael Maddison michael at ddisolutions.com.au
Wed Jun 13 18:08:27 CDT 2007


OLAP may be an option but I think John will still run into issues with
100's of dimensions.
Actually with the changes Arthur recommended then it should work a
treat... in theory...

cheers

Michael M

John,

How about using olap? This would do a lot of preprocessing to make it
quick and easy to find counts and sums. My understanding is that SQL
Server 2005 has very good olap options. (My experience is limited to
Cognos, but a 300 million record cube of web site hit data took 20
minutes to process 12-20 dimensions and about 6 measures. The resulting
cube gave instant answers.)

If you did that, MDX (similar to SQL) would allow you to write queries
against the olap cube. Because of the pre-processing of the olap cube,
MDX query response should be instant. 

I originally got into this sort of stuff when the cross tabs I was
running with Access as a front end to SQL Server 2000 were taking too
long. The company I was working for wanted to use Cognos (2000's olap
was fairly rudimentary before the development of MDX) so that's the
direction I took, but I've heard lots of good things about 2005's olap.

Fred

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Wednesday, June 13, 2007 11:21 AM
To: dba-sqlserver at databaseadvisors.com
Subject: Re: [dba-SQLServer] SQL Server 2005 - long count time

Arthur,

Yes you did recommend that strategy.  And of course it is just the
implementation that is holding me back.

And of course it is not just hundreds of millions rows in the fact
table.  

tblHSID - 50 million records * 700 fields = 35 BILLION records
tblInfutor - 90 million records * 120 fields = 10.8 BILLION records
tblAmericanTrust - 70 million records * 50 fields = 3.5 BILLION fields
Etc Etc Etc

Thus we are already up to close to 50 billion records in the fact table,
with only three lists processed.  Something tells me that doing searches
and joins on 50 billion records is never going to be fast on the
hardware I can afford. 

This ain't exactly your grandma's recipe database.

John W. Colby
Colby Consulting
www.ColbyConsulting.com
-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Arthur
Fuller
Sent: Wednesday, June 13, 2007 10:35 AM
To: dba-sqlserver at databaseadvisors.com
Subject: Re: [dba-SQLServer] SQL Server 2005 - long count time

I think that I recommended that strategy a while back. Each of those
hundreds of columns ought to be a row (use a SQL Variant column
definition or alternatively have a child table that contains a column of
every data type, but that approach is cumbersome).

Using this approach, only the not-null (or not empty) columns would
become rows. Yes, this table would contain many millions of rows, but
you would achieve search speeds an order of magnitude (or more) better
than what you are currently achieving.

This approach also requires a third table consisting in essence of the
column-names in your current table. So that rather than 500 columns, you
have 500 rows in said lookup table, and if and only if the current
person/household has a value in column 134 do you add a row to the
bridge table PersonAttributes.

Why is this a huge gain? Because the lookup table and the bridge table
can be indexed. For example, let's suppose that we are looking for
divorced females who own a flat-screen TV (and who isn't? but I'll let
that slide).
These values constitute 3 rows in the aforementioned Attributes table,
with PKs that can be found quickly. So ultimately a search such as this
becomes something like "SELECT * FROM People INNER JOIN Attributes WHERE
People.PersonID = Attributes.PersonID and AttributeID IN( 123, 234,
345).

Obviously there is a more optimal way to write that Select statement but
I am on a keystroke diet LOL.

A.

On 6/13/07, jwcolby <jwcolby at colbyconsulting.com> wrote:
>
>
> Michael,
>
> The indexes on the where clauses are the biggie.  There are about 650 
> demographics fields (on one of these tables, 65 million demographics 
> records).  Putting an index on each field is time consuming to say the

> least.  Not to mention space consuming.  I am doing that as the client

> comes in with orders needing the where clause on a given field.
>
> This whole thing needs to be normalized such that I have a "fact
table"
> which is a field with the demographics data, a field with the fact 
> name (name of the original field in the demographics table) plus a 
> field back to the person/address that "owns" that fact.  Then I could 
> merge all of the demographics "facts" from every demographics table I 
> get from the client into a single fact table, and all of the addresses

> into a person / address table (system).
>
> I suspect that I will do that down the road a bit (as I get more 
> comfortable in SQL Server) but I have to make the data usable in the 
> meantime.
>
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com
> -----Original Message-----
> From: dba-sqlserver-bounces at databaseadvisors.com
> [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of 
> Michael Maddison
> Sent: Wednesday, June 13, 2007 9:56 AM
> To: dba-sqlserver at databaseadvisors.com
> Subject: Re: [dba-SQLServer] SQL Server 2005 - long count time
>
> ' Now that I understand what the CI indexes do, I will probably go 
> back and add both a CI and a NCI to each table, on the PKID, in order 
> to allow SQL Server to decide which is more efficient in any given 
> case and use whichever it needs in each situation.'
>
> >>>Also as inserts and updates aren't an issue makes sure that both
> sides of any join fields have indexes as well.
> Also any columns in the where clause.
>
> cheers
>
> MM
>
> Michael,
>
> >In my test scenario with 3mill rows the query select count(*) from
> [testtable] was minutes faster on a heap, which I wouldn't have 
> expected...
> >It seemed to be the only query that was faster though.
>
>
> Now take into account that it was minutes faster on a 3 million row 
> table, how much faster on a 90 million row table.
>
> This leaves us to wonder if perhaps both would not be beneficial?  The

> nci for doing counts (and I do a LOT of them!!!) and a clustered index

> for actual joins, leaving it up to SQL Server as to which it uses.
> The thing to understand here is that ATM I have use a system where I 
> have two actual tables, a 90 million "original data" table and a 
> matching N million "address validated" table, which contains ONLY the 
> address fields, NOT any additional demographics fields.  The reason 
> that the address validated table is "N million" is that I delete non 
> valid addresses from that table in order to pare down the size.  So 
> the "address validated table" has the same PKID values as the 
> "original data" table.
>
> When I process data, I use the original "data table" in where clauses 
> because that table contains the demographics - age, income, children, 
> preferences.  The "address validated" table has address information, 
> but it also has information that the address is valid just in the PK 
> alone.
> In other words, the fact that it is in the valid address table in and 
> of itself means that the address is valid (because I deleted all 
> non-valid records from THAT table.
>
> So my client calls me and says "give me a count of all the households 
> (another subject) in these zips".  I can do that without ever going 
> back to the original data table.  In other cases he says "give me a 
> count of all the addresses in these zips where age=x, income is 
> between y and z and has female children.  That query needs to go back 
> to the original data table for the demographics information.
>
> So you can see that these PK indexes play an absolutely huge role in 
> my business, both from the perspective of joins between the two tables

> as well as from the perspective of raw counts when demographics are 
> not needed.
>
> Now, I can hear the protestations already:
>
> 1) Why not merge the valid address back in to the demographics table.
>
> Because this is but one of the demographics tables I deal with, there 
> will be literally dozens of them.  At some point in time the valid 
> address data from each demographics table has to be merged into a 
> single valid address table with an even more complex m-m table saying 
> this valid address is linked to THESE (multiple) demographics tables.
>
> 2) OK then why not delete the non-valid addresses from the 
> demographics table?
>
> This is a "should be done" but in the end it makes only a small (but 
> still
> significant) dent in the total number of records in the demographics 
> table, about 20% maximum (so far).  What I really need to do is go 
> back and remove ALL address data from the demographics table.
> Basically, someone DID merge all of the address validation data back 
> in to the demographics table (back before I got ahold of it) and so 
> the demographics table now has probably 40 (out of 700) fields that 
> are old, out of date address validation info.
>
> Even if I do that, I STILL need the PKs in both tables in order to 
> join demographics with valid address, and I need them to be FAST.  Now

> that I understand what the CI indexes do, I will probably go back and 
> add both a CI and a NCI to each table, on the PKID, in order to allow 
> SQL Server to decide which is more efficient in any given case and use

> whichever it needs in each situation.
>
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com
>
> _______________________________________________
> dba-SQLServer mailing list
> dba-SQLServer at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
> http://www.databaseadvisors.com
>
>
_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com

_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com



_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com





More information about the dba-SQLServer mailing list