[AccessD] Analyzing groups of dated records

John W. Colby jwcolby at gmail.com
Mon Mar 2 15:26:51 CST 2015


I wrote an application that updates 500 million records a month. The 
process involves exporting the records out of SQL Server to text files, 
feeding the files to a third-party app, and then re-importing the files 
into SQL Server.

I would like to analyze the number of files processed per hour, per day, 
and so on.  This would let me discover whether there is an optimal time 
of day / week / month to do this processing.

When I export the files I create a parent "supervisor" record (and a 
class instance in C#); that instance is then tasked with exporting the 
files to disk.  As files are exported, a process record (and class 
instance) is created for each file.  The process record is then updated 
with the date / time at which the various processes occur:

1) The date / time that the file is created on disk.
2) The date / time that the file is sent to the virtual machine for 
processing, as well as which VM is processing that file.
3) The date / time that the resulting file comes back from the VM.
4) The date / time that the file is imported back into SQL Server.
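For concreteness, a process record like the one described above might be sketched as a small class.  This is my own sketch (the real app is C#, and these field names are invented, not the actual schema) just to show the four timestamps plus the VM name:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ProcessRecord:
    """One record per exported file; timestamps are filled in as each stage completes."""
    file_name: str
    supervisor_id: int                            # links back to the parent "supervisor" record
    created_on_disk: Optional[datetime] = None    # step 1: file written to disk
    sent_to_vm: Optional[datetime] = None         # step 2: file handed to a VM
    vm_name: Optional[str] = None                 # which VM is processing the file
    received_from_vm: Optional[datetime] = None   # step 3: result file copied back
    imported_to_sql: Optional[datetime] = None    # step 4: re-imported into SQL Server
```

Each stage's thread only ever fills in its own timestamp, which is what makes the later per-stage analysis possible.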

Each of these processes runs on threads that can be started and stopped 
individually.

Step 1 (export to disk) can occur whether or not the VMs are being fed 
with data.
Step 2 (file to VM) can occur as soon as (and as long as) any file is 
out on disk ready for processing.
Step 3 (file back from VM) can occur as soon as the result file is ready 
out on the VM.
Step 4 (file into SQL Server) can occur as soon as (and as long as) any 
result file has been brought back from the VM.

Step 1 truly is standalone, and can occur whenever the source data table 
is ready to be processed.  I create a start date / time to begin 
processing each database.
Steps 2 and 3 are actually (and unfortunately) run on the same thread 
and can be thought of as a single process.  A file is sent to the VM, 
the date / time / VM name is logged, and then the thread just watches 
for the resulting file.  When the file is created, the thread copies the 
file back from the VM to a receiving directory and the date / time is 
logged.
Step 4 is standalone, at least in that if there are files available in 
the receiving directory (the returned flag is set), then that process can run.

My objective when designing this whole thing was to make the processing 
asynchronous, allow multiple VMs, allow me to take VMs offline and 
still process, etc.

It all works quite well.

So basically, the bottleneck is the third-party process running in the 
VMs.  Each VM can process one file at a time.  Each VM will process 
approximately 1 million records / hour.  However, in the middle of the 
process out on the VM, the data has to be uploaded over the internet to 
a server out in California for processing; change records are then 
downloaded back to the VM and integrated into the data in that file.  
IOW, Change of Address records (actual MOVES) come back.  And finally 
the app spits out a (pipe-delimited) CSV of the data I sent to the VM, 
plus additional flag fields.

It turns out that the upload / process / download on the remote server 
is the overall bottleneck.

Since I have "date / time sent to VM" and "date / time received from VM" 
recorded in my process records, I can (if I can figure out how) analyze 
how many records / hour I am processing over a given time window.  I 
want to know: does it change?  I know for a fact that I am one of the 
largest users in terms of total records per month.  I also know that the 
server in California runs small files before large files.  What I do not 
know is how that selection process works, nor do I know the time-of-day 
loads on their servers.  I could program my system to "self-throttle": 
only work at specific times of day, only work on weekends, use smaller 
files (fewer records / file), use bigger files, etc.  In order to do 
that, though, I need real, live analysis of results as I make changes.
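Turning the two timestamps into a throughput number is just elapsed time arithmetic.  A minimal sketch (function and field names are my own; in T-SQL the elapsed time would come from DATEDIFF on the two columns):

```python
from datetime import datetime

def records_per_hour(records: int, sent: datetime, received: datetime) -> float:
    """Throughput for one file: records processed divided by elapsed hours on the VM."""
    hours = (received - sent).total_seconds() / 3600.0
    return records / hours

# A 1-million-record file that took 75 minutes round-trip:
sent = datetime(2015, 3, 2, 10, 0, 0)
received = datetime(2015, 3, 2, 11, 15, 0)
rate = records_per_hour(1_000_000, sent, received)  # 800,000 records / hour
```

Aggregating that per-file figure over a chosen window (and re-running it after each throttling change) is the "real, live analysis" piece.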

I would like to start by just analyzing hourly "files / hour" kinds of 
metrics.  Analysis is not my forte, however, and I really don't know the 
big picture on how to group and count records by date / time.  Big 
picture: how do I ask SQL how many records I have from midnight to 1, 1 
to 2, 2 to 3, etc.?  Or even midnight to 4 am, 4 am to 8 am, etc.?  Then 
by day of the week, and so forth.  Pick a date / time to begin, a date / 
time to end, and a time period to group on, then translate that into 
SQL.  And of course put it on paper (screen) in a visual sort of way.

I can (and have) just counted files processed between midnight and 8 am 
and divided by 8 to get files per hour.  NOT FUN!  Really not fun to do 
a thorough analysis that way.

I like FUN and easy and repeatable, and automatic.

-- 
John W. Colby
