[dba-VB] c# lock()

jwcolby jwcolby at colbyconsulting.com
Sat Mar 26 23:12:20 CDT 2011


Well...

In programming you can do any task 50 different ways.  The way I did it is to create a Supervisor 
table and a ProcessFile table in SQL Server.  Each supervisor record represents a database in SQL 
Server that contains an address table which needs to be processed.  The ProcessFile records are 
children of tblSupervisor and represent the chunks (files).

The supervisor record stores all of the high level flags about the database (address table) being 
processed.  The ProcessFile contains all of the flags about each chunk being processed.

I have four different high level "stages", as I call them, that a supervisor goes through: AccuzipOut 
- table to CSV files in Staging Out (stage 1), Staging Out to VM (stage 2), VM to Staging In (stage 
3), and Staging In to SQL Server (stage 4).
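
Purely as an illustration of those four stages (the enum and its member names are just shorthand 
for this email, not the actual code):

    // Shorthand for the four stages - illustration only, not the actual code.
    public enum Stage
    {
        AccuzipOut = 1,      // stage 1: address table to CSV chunks in Staging Out
        StagingOutToVM = 2,  // stage 2: copy each chunk file to an Accuzip VM
        VMToStagingIn = 3,   // stage 3: copy Accuzip output back to Staging In
        StagingInToSQL = 4   // stage 4: BCP the processed CSVs back into SQL Server
    }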

1) Export all chunks (files) to a SQL Out staging directory.  Each chunk ends up being a CSV file 
with PKID, FName, LName, Addr, City, St, Zip5, Zip4.  Each chunk file needs to be sorted in Zip5 / 
Zip4 order because the third party application that processes these files is immensely faster if the 
file is presorted.
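
The export itself can be done by shelling out to bcp.exe with a queryout query that carries the 
ORDER BY.  A sketch only - the server name and output path are made up, and the tblChunk / 
dbXyzAZOut names are explained further down:

    using System.Diagnostics;

    class BcpExportSketch
    {
        // Sketch: export one pre-sorted chunk to CSV via bcp queryout.
        // MYSERVER and the output path are placeholders, not the real values.
        static void ExportChunk(string chunkCsvPath)
        {
            string query =
                "SELECT PKID, FName, LName, Addr, City, St, Zip5, Zip4 " +
                "FROM dbXyzAZOut.dbo.tblChunk ORDER BY Zip5, Zip4";

            var psi = new ProcessStartInfo("bcp.exe",
                "\"" + query + "\" queryout \"" + chunkCsvPath + "\" -c -t, -S MYSERVER -T")
            {
                UseShellExecute = false
            };

            using (var p = Process.Start(psi))
            {
                p.WaitForExit();    // block until bcp has finished writing the chunk file
            }
        }
    }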

2) One by one, copy each chunk file to a virtual machine - AccuzipIn directory.

The Accuzip virtual machine runs on a Windows 2008 server running Hyper-V manager.  I currently run 
3 different VMs, only one of which runs Accuzip.  For a while I actually ran 3 VMs running Accuzip, 
and I am designing such that I can add additional Accuzip VMs in the future.  What this means is 
that each file could go to any valid Accuzip VM, and when the chunk file is placed on a VM it has to 
log which VM it went to because...

3) The Accuzip software automatically senses and processes the file in the InputTo directory.  A 
file of 2 million records takes about 25 minutes to CASS process locally on the VM, then is uploaded 
to a server in California for further NCOA processing.  The server in California belongs to Accuzip, 
and they stage big files, so it can take a while before a file even begins to process.  Once it 
processes, the NCOA server sends a small file back to my VM, where address change information is 
integrated back into the local Accuzip database file (FoxPro).  After that the data is exported back 
to CSV with a bunch of additional fields and is placed in the Accuzip OutputFrom directory, along 
with a PDF which contains the CASS / NCOA statistics.  This PDF is a "proof of processing" document 
that I have to keep forever and have to send to the client.

I have a Directory Watcher which is watching the OutputFrom directory.  When *all* of the resulting 
Accuzip output files are in the OutputFrom directory, my Directory Watcher event triggers my program 
to copy all of these files back to the SQL Server In staging directory.
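
The watcher itself is just a FileSystemWatcher on that directory.  A bare-bones sketch - the path 
is a placeholder, and the real handler does the "are all the files here yet" check before copying 
anything:

    using System;
    using System.IO;

    class OutputFromWatcherSketch
    {
        // Sketch: watch the Accuzip OutputFrom directory for finished output files.
        static void Main()
        {
            var watcher = new FileSystemWatcher(@"\\AZVM1\Accuzip\OutputFrom")
            {
                EnableRaisingEvents = true
            };

            watcher.Created += (sender, e) =>
            {
                // Real code: check that *all* expected output files for the chunk
                // are present, then copy them back to the SQL Server In staging dir.
                Console.WriteLine("New file: " + e.FullPath);
            };

            Console.ReadLine();    // keep this console sketch alive
        }
    }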

4) Once all of the Accuzip files for a single chunk are back in the SQL Server In directory, my 
application BCPs the CSV into an input chunk table.  From there, decisions are made about whether I 
need to import every address back in or only the moves.  In any event some or all of the address 
records (with the additional fields from Accuzip) are pulled into an AZData table back in the 
database that the address data originally came from.

So...

1) A supervisor object creates a clsDbAZOut instance.
2) clsDbAZOut creates an actual temporary database, dbXyzAZOut.
3) clsDbAZOut pulls every record needing to be accuzipped into a table (tblAZOut) with a super PK, 
sorted on Zip5/4.
4) clsDbAZOut counts the records in tblAZOut and starts to build clsProcessFile instances, which 
hold the information about the chunk processing.  clsProcessFile can write its info back to the 
ProcessFile records on SQL Server.
5) clsDbAZOut has a thread which builds tblChunk, gets ChunkSize records into the chunk table and 
BCPs tblChunk to CSV files in a staging out directory (a sketch of the chunking arithmetic follows 
this list).
6) clsDbAZOut BCPs the chunk to file.
7) At this point the clsProcessFile is handed off to clsVM.
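
Here is the chunking arithmetic from steps 4 through 6, boiled down to a runnable sketch.  The 
record count and chunk size are examples only, and the real fill / BCP work is just a comment:

    using System;

    // Sketch of the chunking arithmetic in steps 4-6 (numbers are examples only).
    class ChunkingSketch
    {
        const int ChunkSize = 2000000;             // records per chunk file

        static void Main()
        {
            int totalRecords = 51000000;           // e.g. the count of rows in tblAZOut

            // Round up so the last, partial chunk still gets its own file.
            int chunkCount = (totalRecords + ChunkSize - 1) / ChunkSize;

            for (int i = 0; i < chunkCount; i++)
            {
                int firstRecord = i * ChunkSize;
                int recordsInChunk = Math.Min(ChunkSize, totalRecords - firstRecord);

                // Real code: fill tblChunk with these rows, BCP it out to Staging Out,
                // stamp the clsProcessFile flags and store the instance in clsSupervisor.
                Console.WriteLine("Chunk {0}: {1} records starting at {2}",
                    i, recordsInChunk, firstRecord);
            }
        }
    }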


clsSupervisor has a strongly typed dictionary which stores the clsProcessFile instances created by 
clsDbAZOut.  clsSupervisor can read ProcessFile records from SQL Server through a clsProcessFile 
factory, or clsDbAZOut can use the class factory to create new ones.  In all cases they are stored 
in the dictionary in clsSupervisor.
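
That shared dictionary is the kind of thing lock() is for - the export thread adds clsProcessFile 
instances while other threads read them.  A minimal sketch of guarding it (member names are 
illustrative, not my actual code; clsProcessFile itself is sketched a bit further down):

    using System.Collections.Generic;

    // Minimal sketch of guarding the shared dictionary with lock().
    // Member names are illustrative, not the actual clsSupervisor code.
    public class clsSupervisor
    {
        private readonly object _lock = new object();
        private readonly Dictionary<int, clsProcessFile> _processFiles =
            new Dictionary<int, clsProcessFile>();

        public void AddProcessFile(clsProcessFile pf)
        {
            lock (_lock)    // only one thread touches the dictionary at a time
            {
                _processFiles[pf.ChunkNo] = pf;
            }
        }

        public bool TryGetProcessFile(int chunkNo, out clsProcessFile pf)
        {
            lock (_lock)
            {
                return _processFiles.TryGetValue(chunkNo, out pf);
            }
        }
    }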

clsProcessFile has flags which document that (a rough sketch follows the list):
a) The chunk table was created
b) The chunk table exported to file
c) The chunk went to a VM (and which one)
d) The chunk file was (eventually) found in the Accuzip OutputFrom directory
e) The resulting AZData file (and PDF) were moved from the VM back to SQL Server In staging
f) The file in SQL Server In staging was pulled back into a chunk table in a second temporary 
database, dbXyzAZIn
g) That the chunk table was imported back into the original database.
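
Roughly, those flags boil down to dateTime values written as each step finishes (null / not yet 
written = hasn't happened).  A sketch, with shorthand property names rather than the exact ones:

    using System;

    // Sketch of clsProcessFile's flags as nullable dateTime values (names are shorthand).
    // A null means "hasn't happened yet", so a chunk can pick up from any point.
    public class clsProcessFile
    {
        public int ChunkNo { get; set; }

        public DateTime? ChunkTableCreatedOn { get; set; }   // a) chunk table created
        public DateTime? ExportedToFileOn { get; set; }      // b) chunk table exported to CSV
        public string    SentToVM { get; set; }              // c) which VM the chunk went to
        public DateTime? SentToVMOn { get; set; }            //    ...and when
        public DateTime? FoundInOutputFromOn { get; set; }   // d) seen in Accuzip OutputFrom
        public DateTime? MovedToStagingInOn { get; set; }    // e) AZData file + PDF back in Staging In
        public DateTime? ImportedToAZInOn { get; set; }      // f) pulled into the dbXyzAZIn chunk table
        public DateTime? ImportedToLiveOn { get; set; }      // g) imported back into the original database
    }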

As you can see, clsProcessFile is the center of the universe for a single chunk.
clsSupervisor is the center of the universe for the entire process.

It is a bit difficult to discuss how the entire program works in a single email.

A clsManager loads one or a dozen clsSupervisor instances.
clsManager loads clsVM, which supervises a virtual machine running the third party process.

clsVM moves files from SQL Staging Out into the VM one chunk at a time, updating clsProcessTable flags.
clsVM moves files from the VM back to SQL Staging In, updating flags.
The VM may not be available, or the Accuzip program may be down, etc.

clsSupervisor loads one to N clsProcessTable instances.  clsSupervisor loads a clsDBExp and clsDBImp.

clsDBExp creates new clsProcessTable instances, one for each chunk it exports - storing them in 
clsSupervisor.  clsDBExp supervises the export process.

clsDBImp supervises the import process, getting the chunks back into a temp database in SQL Server 
and from there into the live database.

I am trying to design the system such that it is asynchronous.  Stage1, stage 2/3 and stage 4 can 
all do their thing independently of each other.  Each step of each stage is logged (dateTime flags 
written) so that any piece can just pick up where it left off.

That means any Supervisor and any ProcessFile (chunk) can pick up from any point.

In the meantime, the clsManager is watching the tblSupervisor for supervisor records ready to 
process - a thread.

clsSupervisor is watching for ProcessTable records and dispatching them as required to the clsDBExp, 
clsDBImp or clsVM - a thread in each supervisor.

clsDBExp is processing chunks, creating CSV files - a thread.

clsVM is watching for clsProcessTable instances ready to go to the VM.  It is also watching the 
output directory and sending files back to the SQL Server - a thread and two file watchers (and two 
events).

clsDBImp is watching for clsProcessTable instances ready to import back into SQL Server - a thread.

Most of these threads are writing flags (as they finish their specific process) and reading flags in 
clsProcessTable to decide what chunk has to be processed in which class.
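
Boiled down, each of those threads runs the same pattern: wake up, look (under the lock) for a 
chunk whose flags say "ready for my step", do the work, stamp the next flag, sleep.  A sketch of 
the VM side of that, with illustrative names and clsProcessFile as sketched earlier:

    using System;
    using System.Collections.Generic;
    using System.Threading;

    // Sketch of the polling pattern the worker threads share (names illustrative).
    public class VmWorkerSketch
    {
        private readonly object _lock = new object();
        private readonly List<clsProcessFile> _processFiles = new List<clsProcessFile>();
        private volatile bool _stop;

        public void Start()
        {
            new Thread(WatchForWork) { IsBackground = true }.Start();
        }

        private void WatchForWork()
        {
            while (!_stop)
            {
                clsProcessFile next;
                lock (_lock)    // the same lock every thread uses around the shared collection
                {
                    next = _processFiles.Find(pf =>
                        pf.ExportedToFileOn != null && pf.SentToVMOn == null);
                }

                if (next != null)
                {
                    // ... File.Copy the chunk CSV to the VM's AccuzipIn share ...
                    next.SentToVM = "AZVM1";           // log which VM it went to
                    next.SentToVMOn = DateTime.Now;    // stamp the flag (and write it back to SQL)
                }
                else
                {
                    Thread.Sleep(5000);                // nothing ready; look again shortly
                }
            }
        }
    }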

Oh, and I have three list controls on a form which the various classes write to in order to display 
the status for Stage 1, Stage 2/3 and Stage 4.
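
One wrinkle there: WinForms controls can only be touched from the UI thread, so status text coming 
from the worker threads has to be marshalled over with InvokeRequired / BeginInvoke.  A sketch, 
assuming WinForms and using lstStage1 as an illustrative control name:

    using System;
    using System.Windows.Forms;

    // Sketch: pushing status text from a worker thread onto a WinForms ListBox.
    public partial class frmStatus : Form
    {
        private readonly ListBox lstStage1 = new ListBox();   // normally created by the designer

        public void LogStage1(string message)
        {
            if (lstStage1.InvokeRequired)
            {
                // Called from a worker thread: bounce the call over to the UI thread.
                lstStage1.BeginInvoke(new Action<string>(LogStage1), message);
            }
            else
            {
                lstStage1.Items.Add(DateTime.Now.ToString("HH:mm:ss") + "  " + message);
            }
        }
    }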

So pretty much a single supervisor is processing at a time - exporting files, sending them to the 
VM, importing them back in, etc.  It is possible for an order supervisor to cut in at the head of 
the line, however.

Not a simple task co-ordinating all of this stuff.

I used to co-ordinate it all manually!  ;)

John W. Colby
www.ColbyConsulting.com

On 3/26/2011 8:08 PM, Shamil Salakhetdinov wrote:
> John --
>
> I'm falling asleep here now - would it be a correct assumption that you can
> generalize your global task like this - you have, say:
>
> - split input data into 50 Chunks;
> - every chunk should be processed in 10 steps;
> - every step's average processing time is 100 minutes.
>
> Then in total you need 50*10*100 = 50,000 minutes, or ~833.3 hours, to
> process all your input data sequentially, chunk after chunk, step after step -
> say, you don't have enough computer power to process some chunks in
> parallel, and you can't process the whole unsplit input data as it's too
> large for your system...
>
> But you have "50 small computers" and you can process every of 50 input
> chunks in parallel - then you'll need just 10*100 = 1,000 minutes (~16.67
> hours) to complete the job.
>
> I assume that the processing of all the chunks is independent - then you can:
>
> 1) Create a Scheduler class instance;
> 2) The Scheduler class creates 10 TaskQueue class instances - 1 queue for every
> processing step type;
> 3) The Scheduler class defines splitting criteria to split the input data into 50
> chunks;
> 4) The Scheduler defines completion criteria - when all 50 chunks get collected
> in an 11th FinalQueue;
> 5) The Scheduler seeds the first InputChunkSplitTaskQueue with 50
> SplitTasksDescriptors;
> 6) From time to time - say every half a minute - the Scheduler class instance
> creates 500 worker classes in 500 threads (500 worker classes => 50 chunks *
> 10 steps), which "hunt for the job" in their dedicated Queues: if they find a
> Job they work on it, if not - they "die"...
> 7) When a worker class completes processing its job it queues its "task
> results" into the next step's Queue; if a worker class fails then it puts its
> incomplete task descriptor back into its dedicated queue and "dies", or it
> could be the Scheduler's (or SchedulerAssistant's) job to find failed
> workers' "traps" and to resubmit their failed work to the proper queue
> according to the custom workflow descriptors attached to every chunk...
>
> 500 worker classes is overkill, as only 50 worker classes can have a job at
> any time, but that seems to be an easy "brute force" and "lazy parallel
> programming" approach - and it should work...
> Or you can make the production of worker class instances smarter: the
> Scheduler class can start a special thread for a WorkerClassGenerator instance,
> which will monitor all 10 TaskQueues, and if it finds an item in a
> Queue, it will pick it up, create the corresponding Worker class in a
> parallel thread and pass the WorkItem to that Worker class for
> processing...
>
> When the approach described above works, you can then easily(?)
> scale it by splitting your (constant size) input data into 100 chunks; if
> every chunk can then be processed in half the time on average (50 minutes),
> the whole job can be completed in 500 minutes = ~8.33 hours, etc...
>
> Please correct me if I oversimplified your application business area...
>
> Thank you.
>
> --
> Shamil
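
As an aside on Shamil's queue-per-step suggestion above: it maps pretty directly onto a 
producer/consumer queue such as .NET 4's BlockingCollection.  A bare-bones sketch of one step's 
queue feeding the next, with a handful of workers - every name below is mine, not Shamil's:

    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    // Bare-bones sketch of one "step queue" with a small pool of workers (.NET 4).
    class ChunkTask { public int ChunkNo; }

    class StepQueueSketch
    {
        static void Main()
        {
            var thisStepQueue = new BlockingCollection<ChunkTask>();
            var nextStepQueue = new BlockingCollection<ChunkTask>();

            // A few workers "hunting for a job" on this step's queue.
            for (int w = 0; w < 3; w++)
            {
                Task.Factory.StartNew(() =>
                {
                    foreach (ChunkTask task in thisStepQueue.GetConsumingEnumerable())
                    {
                        // ... do this step's work on the chunk ...
                        nextStepQueue.Add(task);    // hand the chunk on to the next step
                    }
                });
            }

            // The scheduler seeds the first queue with one task per chunk.
            for (int chunk = 0; chunk < 50; chunk++)
                thisStepQueue.Add(new ChunkTask { ChunkNo = chunk });

            thisStepQueue.CompleteAdding();         // no more chunks; workers drain and exit
            Console.ReadLine();
        }
    }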


