[AccessD] Analyzing groups of dated records
John W. Colby
jwcolby at gmail.com
Mon Mar 2 15:26:51 CST 2015
I wrote an application that updates 500 million records a month. The
process involves exporting the records out of SQL Server to text files,
feeding the files off to a third-party app, and then re-importing the
files back into SQL Server.
I would like to analyze the number of files processed per hour, per day
etc. This would allow me to discover whether there is an optimum time
of day / week / month to do this processing.
When I export the files I create a parent "supervisor" record (and class
instance in C#); that instance is then tasked with exporting the files
out to disk. As files are exported, a process record (and class
instance) is created for each file. The process record is then updated
with the date / time that the various processes occur:
1) The date / time that the file is created on disk
2) The date / time that the file is sent to the virtual machine for
processing, as well as which VM is processing that file.
3) The date / time that the resulting file comes back from the VM
4) The date / time that the file is imported back into SQL Server.
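Those four timestamps suggest a process-record table along these lines. This is a hypothetical sketch using Python's stdlib sqlite3 for the demo; the actual table and column names in the app are my invention:

```python
import sqlite3

# Hypothetical process-record table; column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ProcessRecord (
        FileName        TEXT PRIMARY KEY,
        ExportedToDisk  TEXT,   -- 1) file created on disk
        SentToVM        TEXT,   -- 2) file sent to the VM
        VMName          TEXT,   --    which VM is processing it
        ReceivedFromVM  TEXT,   -- 3) result file back from the VM
        ImportedToSQL   TEXT    -- 4) file imported back into SQL Server
    )
""")
conn.execute(
    "INSERT INTO ProcessRecord VALUES (?, ?, ?, ?, ?, ?)",
    ("db01_0001.txt", "2015-03-01 00:05:00", "2015-03-01 00:10:00",
     "VM1", "2015-03-01 01:02:00", "2015-03-01 01:15:00"),
)
row = conn.execute("SELECT VMName FROM ProcessRecord").fetchone()
print(row[0])  # VM1
```

With every stage timestamped in one row per file, all the later throughput questions reduce to queries over this table.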
Each of these processes runs on threads that can be started and stopped
individually.
Step 1 (export to disk) can occur whether or not the VMs are being fed
with data.
Step 2 (file to VM) can occur as soon as (and as long as) any file is
out on disk ready for processing.
Step 3 (file back from VM) can occur as soon as the result file is ready
out on the VM
Step 4 (file into SQL Server) can occur as soon as (and as long as) any
result file has been brought back from the VM.
Step 1 truly is standalone, and can occur whenever the source data table
is ready to be processed. I create a start date / time to begin
processing each database.
Steps 2 and 3 are actually (and unfortunately) run on the same thread
and can be thought of as a single process. A file is sent to the VM,
the date / time / VM name is logged, and then the thread just watches
for the resulting file. When the file is created, the thread copies the
file back from the VM to a receiving directory and the date / time is
logged.
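A minimal sketch of that combined steps-2-and-3 watcher thread. This is my own Python illustration, not the actual C# code; the directory layout and the "_result.csv" naming are assumptions, and a second thread stands in for the VM:

```python
import tempfile, threading, time
from datetime import datetime
from pathlib import Path

def feed_vm(src, vm_dir, recv_dir, log):
    # Step 2: copy the file to the VM's directory and log when.
    (vm_dir / src.name).write_bytes(src.read_bytes())
    log["sent_to_vm"] = datetime.now()
    # Then the thread just watches for the result file the VM app writes.
    result = vm_dir / (src.stem + "_result.csv")  # result name is an assumption
    while not result.exists():
        time.sleep(0.1)
    # Step 3: copy the result back to the receiving directory and log when.
    (recv_dir / result.name).write_bytes(result.read_bytes())
    log["received_from_vm"] = datetime.now()

# Demo with a simulated VM: the main thread plays the third-party app.
base = Path(tempfile.mkdtemp())
for d in ("src", "vm", "recv"):
    (base / d).mkdir()
(base / "src" / "file1.txt").write_bytes(b"records...")
log = {}
worker = threading.Thread(
    target=feed_vm,
    args=(base / "src" / "file1.txt", base / "vm", base / "recv", log))
worker.start()
time.sleep(0.3)  # pretend the VM is processing
(base / "vm" / "file1_result.csv").write_bytes(b"records|flags")
worker.join()
print(sorted(log))  # ['received_from_vm', 'sent_to_vm']
```

Polling for the result file keeps the two steps on one thread, which matches the description above, at the cost of tying a send slot up until its result comes back.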
Step 4 is standalone, at least in that if there are files available in
the receiving directory (returned flag is set), then that process can run.
My objective when designing this whole thing was to make the processing
asynchronous, allow multiple VMs, and allow me to take VMs offline and
still keep processing, etc.
It all works quite well.
So basically, the bottleneck is the third-party process running in the
VMs. Each VM can process one file at a time. Each VM will process
approximately 1 million records / hour. However, in the middle of the
process out on the VM, the data has to be uploaded over the internet to
a server out in California for processing; change records are then
downloaded back to the VM and the changes integrated back into the data
in that file. IOW, it is Change of Address processing: the actual MOVES
come back. And finally the app spits out a (pipe-delimited) CSV of the
data I sent to the VM plus additional flag fields.
It turns out that the upload / process / download on the remote server
is the overall bottleneck.
Since I have "Date / time sent to VM" and "Date / time received from VM"
recorded in my process records, I can (if I can figure out how) analyze
how many records / hour I am processing over a given time window. I
want to know: does it change? I know for a fact that I am one of the
largest users in terms of total records per month. I also know that the
server in California runs small files before large files. What I do not
know is how that selection process works, nor the time-of-day load on
those servers. I could program my system to "self throttle": only work
at specific times of day, only work on the weekends, use smaller files
(fewer records / file), use bigger files, etc. In order to do that,
though, I need real live analysis of the results as I make changes.
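For example, a records-per-hour figure over a window can be computed directly from the sent / received pairs. A sketch with hypothetical sample data, assuming one million records per file as described above:

```python
from datetime import datetime

# Hypothetical sample: (records_in_file, sent_to_vm, received_from_vm).
files = [
    (1_000_000, "2015-03-01 00:10", "2015-03-01 01:10"),  # 1.0 hours on the VM
    (1_000_000, "2015-03-01 01:12", "2015-03-01 02:42"),  # 1.5 hours on the VM
]

fmt = "%Y-%m-%d %H:%M"
total_records = sum(n for n, _, _ in files)
total_hours = sum(
    (datetime.strptime(recv, fmt) - datetime.strptime(sent, fmt)).total_seconds()
    / 3600
    for _, sent, recv in files
)
rate = round(total_records / total_hours)
print(rate)  # 800000 records / hour across this window
```

Re-running the same calculation over different windows (weekday vs. weekend, small files vs. large) is what would show whether the California server's throughput actually changes.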
I would like to start by just analyzing hourly "files / hour" kinds of
metrics. Analysis is not my forte, however, and I really don't know the
big picture on how to group and count records by date / time. Big
picture: how do I ask SQL how many records I have from midnight to 1, 1
to 2, 2 to 3, etc.? Or even midnight to 4 am, 4 am to 8 am, etc.? Then
by day of the week, and so forth. Pick a date / time to begin, a date /
time to end, and a time period to group on, then translate that into
SQL. And of course put it on paper (screen) in a visual sort of way.
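One way to ask those questions is to group on a computed bucket of the timestamp. A runnable sketch against SQLite (Python stdlib, with made-up table / column names); in SQL Server the same grouping would use DATEPART(hour, col) and DATEPART(weekday, col) instead of strftime:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ProcessRecord (ReceivedFromVM TEXT)")
conn.executemany(
    "INSERT INTO ProcessRecord VALUES (?)",
    [("2015-03-01 00:30:00",), ("2015-03-01 00:45:00",),
     ("2015-03-01 01:10:00",), ("2015-03-01 05:20:00",)],
)

# Files per hour: group on the hour component of the timestamp.
# (SQLite strftime('%H', ...); T-SQL equivalent: DATEPART(hour, ...).)
per_hour = conn.execute("""
    SELECT strftime('%H', ReceivedFromVM) AS hr, COUNT(*)
    FROM ProcessRecord
    GROUP BY hr ORDER BY hr
""").fetchall()
print(per_hour)  # [('00', 2), ('01', 1), ('05', 1)]

# Files per 4-hour block: integer-divide the hour by 4, so midnight-4am
# is block 0, 4am-8am is block 1, and so on. (T-SQL: DATEPART(hour, col) / 4.)
per_block = conn.execute("""
    SELECT CAST(strftime('%H', ReceivedFromVM) AS INTEGER) / 4 AS blk, COUNT(*)
    FROM ProcessRecord
    GROUP BY blk ORDER BY blk
""").fetchall()
print(per_block)  # [(0, 3), (1, 1)]

# Files per day of week. (T-SQL: DATEPART(weekday, ...).)
per_dow = conn.execute("""
    SELECT strftime('%w', ReceivedFromVM) AS dow, COUNT(*)
    FROM ProcessRecord
    GROUP BY dow ORDER BY dow
""").fetchall()
print(per_dow)  # [('0', 4)] -- all on 2015-03-01, a Sunday ('0' = Sunday)
```

Adding a WHERE clause on the begin / end date / time restricts the window, so "pick a start, an end, and a bucket size" is just three parameters to one query.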
I can (and have) just counted files processed between midnight and 8 and
divided by 8 to get files per hour. NOT FUN! Really not fun to do a
thorough analysis.
I like FUN and easy and repeatable, and automatic.
--
John W. Colby