[dba-SQLServer] 2.45 billion rows and counting (was: PK Cast fails)
John W. Colby
jwcolby at gmail.com
Sat Aug 8 10:00:12 CDT 2015
The Simmons table I am building now has 2.5 billion rows. The stored
procedure is running a "loop" iterating through a table of 228 Simmons
demographics definitions, pulling the matching PKs from DB101 (225
million possible PKs) one definition at a time and inserting them into
the junction table mentioned in the previous email. Sometime last night
the log file ran out of room (filled up the log disk) and stopped the
process at Simmons Def 70. Luckily I am writing out each SQL statement
as it is executed, so I could see where it stopped. I cleaned off a
bunch of very large log files and restarted this morning.
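A minimal sketch of that logging step, assuming the statements go into a
log table (the table and column names are made up for the example; the
real process may just as easily PRINT them or write them to a file):

CREATE TABLE dbo.tblSqlLog (
    LogID      bigint IDENTITY(1,1) PRIMARY KEY,
    SimmonsDef int           NOT NULL,
    SqlText    nvarchar(max) NOT NULL,
    ExecutedAt datetime2     NOT NULL DEFAULT SYSDATETIME()
);

DECLARE @def int = 70;
DECLARE @sql nvarchar(max) = N'SELECT 1;';  -- placeholder for the generated statement

-- Log the statement first, then run it; after a crash the last logged
-- row shows which Simmons definition to restart from.
INSERT INTO dbo.tblSqlLog (SimmonsDef, SqlText) VALUES (@def, @sql);
EXEC sys.sp_executesql @sql;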
While I can see the total storage (records) in the junction table, what
I can't do is get a MAX() of the Simmons PK in the junction table to see
where I am in the process. I was hoping that info would be in metadata
somewhere, but apparently it is not; it seems the server has to scan the
whole table looking for the MAX() value. I haven't let it run in a query
window, and trying to do it through a view timed out, for obvious
reasons. And so I wait for completion. I would estimate the junction
table will contain somewhere north of 10 billion records when done.
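Roughly, the two checks look like this, assuming the junction table is
named dbo.tblSimmonsJunction (the name is a stand-in). The row count
really is available from metadata and comes back instantly; the MAX()
is not, and without an index leading on PKSimmons it turns into a full
scan:

-- approximate row count straight from metadata, no scan
SELECT SUM(p.rows) AS ApproxRows
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID(N'dbo.tblSimmonsJunction')
  AND p.index_id IN (0, 1);   -- heap or clustered index

-- the progress check I actually want; scans until PKSimmons is indexed
SELECT MAX(PKSimmons) AS MaxDef
FROM dbo.tblSimmonsJunction WITH (NOLOCK);  -- dirty read, avoids blocking the load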
I have never even come close to this quantity of records in a single
table, so I am still unsure whether this will even make things faster.
Assuming an index on two fields:
PKSimmons - PKs for each of 220 Simmons Definitions
PKDB101 - DB101 PKs that match the Simmons Definition
Will doing a join on PKDB101 for a single value of PKSimmons return a
recordset any time in the useful future? I have to join the PKDB101 to
another table (and possibly two), then do the WHERE stuff on the other
table(s), then count the remaining records to determine which DB101
records match the Simmons definition.
This essentially pre-selects exactly the DB101 records that match the
Simmons definition, storing the selected DB101 PKs in the junction
table.
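Roughly, the query I am asking about has this shape (every table and
column name other than PKSimmons and PKDB101 is made up for the
example):

DECLARE @SimmonsDef int = 70;

SELECT COUNT_BIG(*) AS MatchingRecords
FROM dbo.tblSimmonsJunction AS j
JOIN dbo.tblDB101 AS d
  ON d.PK = j.PKDB101
WHERE j.PKSimmons = @SimmonsDef
  AND d.SomeField = 'whatever';   -- stand-in for the WHERE stuff

With the two-field index in place, the filter on PKSimmons should become
a seek into one contiguous range of the index rather than a scan of the
whole 10-billion-row junction table.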
The way I do it right now is to dynamically build a SQL statement,
joining everything required, doing a pretty complicated WHERE, and
counting the resulting records. Rinse and repeat 220 times, once for
each Simmons definition.
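In sketch form, assuming the definitions table carries the criteria as
text (the cursor and all of the table and column names are illustrative
only, not the actual code):

DECLARE @def int, @criteria nvarchar(max), @sql nvarchar(max);

DECLARE defs CURSOR LOCAL FAST_FORWARD FOR
    SELECT PKSimmons, Criteria
    FROM dbo.tblSimmonsDefinitions
    ORDER BY PKSimmons;

OPEN defs;
FETCH NEXT FROM defs INTO @def, @criteria;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- the count returned here belongs to Simmons definition @def
    SET @sql = N'SELECT COUNT_BIG(*) '
             + N'FROM dbo.tblDB101 AS d '
             + N'JOIN dbo.tblOther AS o ON o.PKDB101 = d.PK '
             + N'WHERE ' + @criteria + N';';
    EXEC sys.sp_executesql @sql;
    FETCH NEXT FROM defs INTO @def, @criteria;
END;
CLOSE defs;
DEALLOCATE defs;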
The other question is as follows. I use a BigInt as the PK and make it
a defined PK, i.e. the table ends up clustered on that PK. Should I have
included PKSimmons and PKDB101 in that PK so that those two fields are
already indexed by the PK index? As it is now, I will be creating a
separate (non clustered) index on just those two fields when done. The
records are being inserted into the table in PKSimmons order, i.e. an
entire set of PKDB101 values is inserted on each iteration of my code,
one set for each PKSimmons value. So had I included at least PKSimmons
in the (clustered) PK, would it have made any difference in performance?
Or when doing a join, will the server scan the clustered index anyway
and just find all the groups by virtue of the fact that the PK is an
autonumber and all records for each PKSimmons value are contiguous?
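To make the two layouts I am weighing concrete (names and types are
illustrative, not the actual DDL):

-- Option A: roughly what I have now - surrogate BigInt identity as the
-- clustered PK, with the two-field non clustered index built after the
-- load finishes.
CREATE TABLE dbo.tblSimmonsJunction (
    PK        bigint IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_tblSimmonsJunction PRIMARY KEY CLUSTERED,
    PKSimmons int NOT NULL,
    PKDB101   int NOT NULL
);
CREATE NONCLUSTERED INDEX IX_Simmons_DB101
    ON dbo.tblSimmonsJunction (PKSimmons, PKDB101);

-- Option B: composite clustered PK on the two business keys, no
-- surrogate column at all.
CREATE TABLE dbo.tblSimmonsJunction2 (
    PKSimmons int NOT NULL,
    PKDB101   int NOT NULL,
    CONSTRAINT PK_tblSimmonsJunction2
        PRIMARY KEY CLUSTERED (PKSimmons, PKDB101)
);

The trade-off I am asking about: option B lets a filter on PKSimmons
seek directly on the clustered index, while option A has nothing to seek
on until the separate two-field index exists, contiguous rows or not.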
The things that keep me up at night.
--
John W. Colby