[dba-SQLServer] Billions and billions of rows

Arthur Fuller fuller.artful at gmail.com
Sat Aug 22 14:09:50 CDT 2015


Thanks for the insights, JC. I think that you could do even better, but
you've already saved so much time that shaving a few seconds off would
require much more effort than the payoff would be worth. Congrats!

Arthur

On Fri, Aug 21, 2015 at 12:25 PM, John W. Colby <jwcolby at gmail.com> wrote:

> Just an update: after rebuilding this as a pair of fields - FKDB101 /
> FKSimmons (int) - set as the PK (clustered index) with page compression,
> the total storage dropped to just the storage of the data (compressed)
> plus a tiny amount (a hundred MB?) that SQL Server uses for something
> index-related.
>
> The time to get counts has dropped by an order of magnitude.  Just to
> recap: previously I would look up, in a Simmons definition table, which
> operations were required to "translate" a given field in DB101 into a
> corresponding field in Simmons.  Simmons has 220 defined "fields" which
> are "translated" from DB101 and mapped into Simmons.  Just as an example,
> I have a single age field.  Simmons has (IIRC) five-year age brackets.
> So to count people in Simmons age bracket 30-35 I would pull all records
> with an age between 30 and 35 and count them.  From that example you can
> see that to get that specific Simmons count I had to dynamically create
> SQL and perform a count on the pulled set of records.
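The dynamic-SQL approach described above can be sketched as follows. This is a minimal, scaled-down illustration using Python's sqlite3 standard library; the table and column names (db101, age) and the definition dictionary are made up for the example, not JC's actual schema.

```python
import sqlite3

# Tiny in-memory stand-in for the 220-million-row DB101 table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db101 (pk INTEGER PRIMARY KEY, age INTEGER)")
conn.executemany("INSERT INTO db101 (age) VALUES (?)",
                 [(a,) for a in (22, 31, 33, 35, 47, 34, 61)])

# A "Simmons definition" row says which DB101 field and which range
# define a data point; here, the 30-35 age bracket from the example.
definition = {"field": "age", "lo": 30, "hi": 35}

# Build the SQL dynamically from the definition, then pull and count.
sql = f"SELECT COUNT(*) FROM db101 WHERE {definition['field']} BETWEEN ? AND ?"
count = conn.execute(sql, (definition["lo"], definition["hi"])).fetchone()[0]
print(count)  # 4 of the sample ages fall in 30-35
```

The cost in the real system comes from running a query like this against the full data set once per data point; the query construction itself is cheap.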
>
> Given that, the count for each Simmons data point took around 3.5 minutes
> to define the dynamic SQL (not particularly time intensive) and then pull
> and count the required records against the full DB101 data set (220 million
> records).  Of course the actual time varied all over the map, but that
> was the average.
>
> So what I did was "preselect" the records that fit each Simmons data point
> and store them in this new junction table:
>
> a pair of fields - FKDB101 / FKSimmons (int)
>
> So now, while I still have to perform joins on at least one other table,
> the join is between the already selected PKDB101 set and another table.
>
> The results are lightning fast, with each count requiring about 9 seconds
> (average for the junction table) instead of about 3.5 minutes (dynamic
> SQL).  I can now do my entire count of all 220 Simmons data points in
> around 30 minutes instead of days.
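The junction-table idea can be sketched the same way. Again a minimal Python/sqlite3 illustration with made-up names; SQL Server specifics such as the clustered index and page compression have no sqlite equivalent and are only noted in comments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db101 (pk INTEGER PRIMARY KEY, age INTEGER)")
conn.executemany("INSERT INTO db101 (age) VALUES (?)",
                 [(a,) for a in (22, 31, 33, 35, 47, 34, 61)])

# Junction table: just the two FK fields, with the pair as the primary
# key (in SQL Server this would be the clustered, page-compressed PK).
conn.execute("""CREATE TABLE junction (
                  fk_simmons INTEGER,
                  fk_db101   INTEGER,
                  PRIMARY KEY (fk_simmons, fk_db101))""")

# "Preselect" once per data point: data point 1 = age bracket 30-35.
conn.execute("""INSERT INTO junction
                SELECT 1, pk FROM db101 WHERE age BETWEEN 30 AND 35""")

# Every later count is now a trivial lookup on the junction table,
# with no per-field dynamic SQL against the wide base table.
count = conn.execute(
    "SELECT COUNT(*) FROM junction WHERE fk_simmons = ?", (1,)).fetchone()[0]
print(count)
```

The expensive range scan is paid once, at load time; the repeated counts become narrow index lookups.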
>
> The junction table contains just over seven billion records.  DB101
> contains about 220 million records and several hundred fields.  The Simmons
> table contains 228 records, each record mapping one specific DB101 field
> into a similar Simmons "field".
>
> The Simmons information is then further filtered on other (non-Simmons)
> fields in DB101 - for example, Simmons counts against people with diabetes,
> people who are Jewish, or people who go cruising.
>
> SQL Server is very efficient at determining how to sequence the various
> parts of a query to optimize retrieval of the result set, so it tends
> to preselect the Jewish (for example) records and then perform the
> "Simmons stuff" against only that much smaller result set.
>
> What this means is that with either method (old dynamic SQL or new
> junction table), the actual time to get the counts is reduced by the
> "other selection" criteria.  But the junction table method still wins
> hands down in time to completion.
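Combining the junction table with a non-Simmons filter looks like this in the same sqlite3 sketch (the has_diabetes column is a hypothetical stand-in for one of DB101's several hundred fields):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db101 (pk INTEGER PRIMARY KEY, age INTEGER, "
             "has_diabetes INTEGER)")
conn.executemany("INSERT INTO db101 (age, has_diabetes) VALUES (?, ?)",
                 [(31, 1), (33, 0), (35, 1), (47, 1), (34, 0)])
conn.execute("CREATE TABLE junction (fk_simmons INTEGER, fk_db101 INTEGER, "
             "PRIMARY KEY (fk_simmons, fk_db101))")
conn.execute("INSERT INTO junction SELECT 1, pk FROM db101 "
             "WHERE age BETWEEN 30 AND 35")

# Count one Simmons data point, further filtered on a non-Simmons
# DB101 field.  The optimizer is free to apply the selective filter
# first and join the (smaller) result against the junction table.
count = conn.execute("""
    SELECT COUNT(*)
    FROM junction j
    JOIN db101 d ON d.pk = j.fk_db101
    WHERE j.fk_simmons = ? AND d.has_diabetes = 1""", (1,)).fetchone()[0]
print(count)
```

Of the four sample rows in the 30-35 bracket, two also have has_diabetes = 1, so the filtered count is 2.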
>
> John W. Colby
>
> On 8/9/2015 11:06 PM, John W. Colby wrote:
>
>> So the table ended up with 7 billion rows and a combined 320 GB of data
>> and index space.
>>
>> I decided to rebuild the table using just the FKs (PKSimmons and
>> PKDB101), and a PK of those two fields combined.  I also compressed the
>> index (page).  It is unclear how much compression I will get since the data
>> being compressed are two integer FK fields.  All just scientific inquiry.
>> :)
>>
>> I can say however that the Index storage is almost nothing.
>>
>> So I ran a count on the first attempt at this junction table, selecting
>> just the second field (PKSimmons), and the time was 1.5 minutes vs. just
>> over 2 minutes doing it the old (hard-coded) way.  Not a significant
>> savings, I must admit.  That was with a noncompressed, nonclustered
>> index, however.  We shall see whether a single compressed, clustered
>> index will speed things up.  Of course it is likely to take me another
>> day to yank the two FK fields out of the first table and insert them
>> into the new table (INSERT INTO / SELECT).
>>
>>
> _______________________________________________
> dba-SQLServer mailing list
> dba-SQLServer at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
> http://www.databaseadvisors.com
>
>


-- 
Arthur

