[dba-SQLServer] Billions and billions of rows
John W. Colby
jwcolby at gmail.com
Fri Aug 21 11:25:30 CDT 2015
Just an update: after rebuilding this as a pair of fields (FKDB101 /
FKSimmons, both Int), set as the PK (clustered index) and compressed
(page), the total storage dropped to just the storage of the data
(compressed) plus a tiny amount (a hundred MB?) that SQL Server uses
for index-related overhead.
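Roughly, the new table looks like this (the table and constraint names
are simplified stand-ins, and the key column order shown here is
illustrative; leading with FKSimmons makes the per-data-point range
scans cheap):

    CREATE TABLE tblSimmonsJunction (
        FKDB101   int NOT NULL,   -- PKDB101 of the qualifying DB101 record
        FKSimmons int NOT NULL,   -- PKSimmons of the Simmons data point
        CONSTRAINT PK_SimmonsJunction
            PRIMARY KEY CLUSTERED (FKSimmons, FKDB101)
    ) WITH (DATA_COMPRESSION = PAGE);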
The time to get counts has dropped by an order of magnitude. Just to
recap, previously I would look up in a Simmons Definition table what
operations were required to "translate" a given field in DB101 to a
corresponding field in Simmons. Simmons has 220 defined "fields" which
are "translated from DB101 and mapped into Simmons". Just as an example,
I have a single age field. Simmons has (IIRC) 5-year age brackets. So
to count people in the Simmons age bracket 30-35 I would pull all
records with an age between 30 and 35 and count them. From that example
you can see that to get that specific Simmons count I had to dynamically
create SQL and perform a count on the pulled set of records.
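As a rough illustration, the dynamic SQL for that age bracket boiled
down to something like this (tblDB101 and Age are stand-in names; the
real predicate was assembled from the Simmons Definition table):

    DECLARE @sql nvarchar(max);
    -- Predicate built at runtime from the Simmons Definition table;
    -- hard coded here just to show the shape of it
    SET @sql = N'SELECT COUNT(*) FROM tblDB101 WHERE Age BETWEEN 30 AND 35';
    EXEC sp_executesql @sql;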
Given that, the count for each Simmons data point took around 3.5
minutes: defining the dynamic SQL (not particularly time-intensive) and
then pulling and counting the required records against the full DB101
data set (220 million records). Of course the actual time varied all
over the map, but that was the average.
So what I did was "preselect" the records that fit each Simmons data
point and store them in this new junction table, which is just that
pair of fields: FKDB101 / FKSimmons (Int).
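The preselect itself is just one INSERT per Simmons data point, along
these lines (the literal 42 stands in for whatever PKSimmons that data
point actually has):

    -- One of 220 such inserts; 42 stands in for the actual PKSimmons
    INSERT INTO tblSimmonsJunction (FKDB101, FKSimmons)
    SELECT PKDB101, 42
    FROM tblDB101
    WHERE Age BETWEEN 30 AND 35;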
So now, while I still have to perform joins on at least one other table,
the join is between the already selected PKDB101 set and another table.
The results are lightning fast, with each count requiring about 9
seconds (average for the junction table) instead of about 3.5 minutes
(dynamic SQL). I can now do my entire count of all 220 Simmons data
points in around 30 minutes instead of days.
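With the junction table in place, the unfiltered count for any one data
point is nothing more than (again, 42 standing in for a real PKSimmons):

    SELECT COUNT(*)
    FROM tblSimmonsJunction
    WHERE FKSimmons = 42;   -- one clustered-index range scan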
The junction table contains just over seven billion records. DB101
contains about 220 million records and several hundred fields. The
Simmons table contains 228 records, each record mapping one specific
DB101 field into a similar Simmons "field".
The Simmons information is then further filtered on other (non-Simmons)
fields in DB101, for example Simmons counts against people with
diabetes, or people who are Jewish, or people who go cruising.
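Those filtered counts are just the junction table joined back to DB101,
something like this (IsJewish is a stand-in for whatever flag field
actually holds that data):

    SELECT COUNT(*)
    FROM tblSimmonsJunction j
    JOIN tblDB101 d ON d.PKDB101 = j.FKDB101
    WHERE j.FKSimmons = 42     -- the Simmons data point
      AND d.IsJewish = 1;      -- the "other selection" criteria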
SQL Server is very efficient at determining how to order the various
parts of a query to optimize the retrieval of the result set, so it
tends to preselect the Jewish (for example) records and then only
perform the "Simmons stuff" against that much smaller result set.
What this means is that with either method (old dynamic SQL or new
junction table) the actual time to get the counts is reduced by the
"other selection" criteria. But the junction table method still wins
hands down in time to completion.
John W. Colby
On 8/9/2015 11:06 PM, John W. Colby wrote:
> So the table ended up with 7 billion rows and a combined 320 GB of
> data and index space.
>
> I decided to rebuild the table using just the FKs (PKSimmons and
> PKDB101), and a PK of those two fields combined. I also compressed
> the index (page). It is unclear how much compression I will get since
> the data being compressed are two integer FK fields. All just
> scientific inquiry. :)
>
> I can say however that the Index storage is almost nothing.
>
> So I ran a count on the first attempt at this junction table,
> selecting just the second PKSimmons value, and the time was 1.5
> minutes vs just over 2 minutes for doing it the old (hard-coded) way.
> Not a significant savings, I must admit. That was with a noncompressed
> / nonclustered index, however. We shall see whether a single
> compressed / clustered index will speed things up. Of course it is
> likely to take me another day to yank the two FK fields out of the
> first table and insert them into the new table (insert into / select).
>