[dba-SQLServer] 2.45 billion rows and counting was PK Cast fails

Sat Aug 8 11:47:56 CDT 2015

I think you have crossed the line into "big data" world. Hadoop and friends. 

Jim

----- Original Message -----
From: "John W. Colby" <jwcolby at gmail.com>
To: "Access Developers discussion and problem solving" <accessd at databaseadvisors.com>, "Discussion concerning MS SQL Server" <dba-sqlserver at databaseadvisors.com>
Sent: Saturday, August 8, 2015 8:00:12 AM
Subject: [dba-SQLServer] 2.45 billion rows and counting was PK Cast fails

The Simmons table I am building now has 2.5 billion rows.  The stored 
procedure is running a "loop" iterating through a table of 228 Simmons 
demographics definitions, pulling PKs from DB101 (225 million possible 
PKs) one at a time, and inserting them into the junction table mentioned 
in the previous email.  Sometime last night the log file ran out of room 
(filled up the log disk) and stopped the process, at Simmons Def 70.  
Luckily I am writing the individual sql statements out as each one is 
executed so I could see where it stopped.  I cleaned a bunch of very 
large log files off and restarted this morning.

While I can see the total storage (records) in the junction table, what 
I can't do is get a Max() of the Simmons PK in the junction table to see 
where I am in the process.  I was hoping that info would be in metadata 
somewhere but apparently not, it seems that it has to scan through the 
table looking for the max() value.  While I haven't let it run in a 
query window, trying to do that in a view timed out for obvious reasons.

And so I wait for completion.  I would estimate the junction table will 
contain somewhere north of 10 billion records when done.

I have never even come close to this quantity of records in a single 
table so I am still unsure whether this will even make things faster.  
Assuming an index on two fields:

PKSimmons - PKs for each of 220 Simmons Definitions
PKDB101 - DB101 PKs that match the Simmons Definition

Will doing a join on PKDB101 for a single value of PKSimmons return a 
recordset any time in the useful future?  I have to join the PKDB101 to 
another table (and possibly two), then do WHERE stuff on that other 
table(s), then count the remaining records to determine which DB101 
records match the Simmons definition.

This essentially pre-selects just exactly the DB101 records that match 
the simmons definition, storing the selected DB101 PKs in the junction 
table.

The way I do it right now is just dynamically build a SQL statement, 
joining everything required, doing a pretty complicated where and 
counting the resulting records.  Rinse and repeat 220 times, once for 
each Simmons definition.

The other question is as follows.  I use a BigInt as the PK and make it 
a defined PK, i.e. the table ends up clustered on that PK. Should I have 
included the PDSimmons and PKDB101 in that PK so that those two fields 
are already indexed using the PK index?  As it is now, I will be 
creating a separate (non clustered) index on just those two fields when 
done.  Given that the records are being inserted in the table in 
PKSimmons order, i.e. an entire set of PK101 values inserted each 
iteration of my code, one set for each PKSimmons value.  So had I used 
at least the PKSimmons in the (clustered) PK would have it made any 
difference in performance?  Or when doing a join, will the server scan 
the clustered index anyway and just find all the groups by virtue of the 
fact that the PK is autonumber and all records for each PKSimmons value 
are contiguous?

The things that keep me up at night.

-- 
John W. Colby

_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com