[dba-VB] c# lock()

Shamil Salakhetdinov shamil at smsconsulting.spb.ru
Sun Mar 27 19:08:25 CDT 2011


Hi John --

Yes, I see: you tend to build a multi-threaded application/utility system
with centralized control - and I tend to build an anarchical free-market
model based application/utility system :)

Thank you.

--
Shamil
 
-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: 27 March 2011 8:12
To: Discussion concerning Visual Basic and related programming issues.
Subject: Re: [dba-VB] c# lock()

Well...

In programming you can do any task 50 different ways.  The way I did it is
to create a Supervisor table and a ProcessFile table in SQL Server.  Each
supervisor record represents a database in SQL Server containing an address
table which needs to be processed.  The ProcessFile records are children of
tblSupervisor and represent chunks (files).

The supervisor record stores all of the high level flags about the database
(address table) being processed.  The ProcessFile contains all of the flags
about each chunk being processed.

I have four different high-level "stages", as I call them, that a supervisor
goes through: AccuzipOut - table to CSV files in staging out (stage 1),
Staging out to VM (stage 2), VM to Staging in (stage 3), Staging in to SQL
Server (stage 4).

1) Export all chunks (files) to a SQL Out staging directory.  Each chunk
ends up being a CSV file with PKID, FName, LName, Addr, City, St, Zip5,
Zip4.  Each file needs to be sorted in Zip5/Zip4 order because the
third-party application that processes these files is immensely faster when
the file is presorted.

2) One by one, copy each chunk file to a virtual machine - AccuzipIn
directory.

The Accuzip virtual machine runs on a Windows 2008 server running Hyper-V
manager.  I currently run 3 different VMs, only one of which runs Accuzip.
For a while I actually ran 3 VMs running Accuzip, and I am designing such
that I can add additional Accuzip VMs in the future.  What this means is
that each file could go to any valid Accuzip VM, and when a chunk file is
placed on a VM, the program has to log which VM it went to because...

3) The Accuzip software automatically senses and processes the file in the
InputTo directory.  A file of 2 million records takes about 25 minutes to
CASS process locally on the VM, then is uploaded to a server in California
for further NCOA processing.  The server in California belongs to Accuzip
and they stage big files, so it can take a while before a file even begins
to process.  Once it processes, the NCOA server sends a small file back to
my VM, where address change information is integrated back into the local
Accuzip database file (FoxPro).  After that the data is exported back to CSV
with a bunch of additional fields and is placed in the Accuzip OutputFrom
directory, along with a PDF which contains the CASS / NCOA statistics.  This
PDF is a "proof of processing" document that I have to keep forever and have
to send to the client.

I have a Directory Watcher which is watching the OutputFrom directory.  When
*all* of the resulting Accuzip output files are in the OutputFrom directory,
my Directory Watcher event triggers my program to copy all of these files
back to the SQL Server In staging directory.
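
In .NET terms that Directory Watcher boils down to a FileSystemWatcher.  A
minimal sketch, assuming the "all files present" test is just a count against
what the ProcessFile record expects (the paths, names and that rule are
illustrative, not the real code):

using System;
using System.IO;

// Minimal sketch of a watcher on the Accuzip OutputFrom directory.  The
// directory path, the expected file count and the event shape are all
// assumptions for illustration.
public class OutputFromWatcher
{
    private readonly FileSystemWatcher _watcher;
    private readonly string _outputFromDir;
    private readonly int _expectedFileCount;

    // Raised once *all* of the Accuzip output files for the chunk are present.
    public event Action<string> ChunkReady;

    public OutputFromWatcher(string outputFromDir, int expectedFileCount)
    {
        _outputFromDir = outputFromDir;
        _expectedFileCount = expectedFileCount;

        _watcher = new FileSystemWatcher(outputFromDir);
        _watcher.Created += OnFileCreated;
        _watcher.EnableRaisingEvents = true;
    }

    private void OnFileCreated(object sender, FileSystemEventArgs e)
    {
        // Don't fire on the first file - wait until the whole set has arrived.
        if (Directory.GetFiles(_outputFromDir).Length >= _expectedFileCount)
        {
            var handler = ChunkReady;
            if (handler != null)
                handler(Path.GetFileNameWithoutExtension(e.Name));
        }
    }
}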

4) Once all of the Accuzip files for a single chunk are back in the SQL
Server In directory, my application BCPs the CSV into an input chunk table.
From there decisions are made about whether I need to import every address
back in or only the moves.  In any event some or all of the address records
(with the additional fields from Accuzip) are pulled into an AZData table
back in the database that the address data originally came from.
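
That BCP step could simply be a shell-out to bcp.exe.  A hedged sketch,
assuming a comma-delimited file, a trusted connection and invented server
and table names:

using System;
using System.Diagnostics;

// Hedged sketch: shell out to bcp.exe to bulk-load one chunk CSV into an input
// chunk table.  The switches assume a comma-delimited file and a trusted
// connection; database, server and table names are illustrative.
public static class ChunkImporter
{
    public static void BcpIn(string csvPath, string chunkTable)
    {
        var psi = new ProcessStartInfo
        {
            FileName = "bcp.exe",
            // -c = character data, -t, = comma field terminator, -T = trusted connection
            Arguments = string.Format("dbXyzAZIn..{0} in \"{1}\" -c -t, -S MYSQLSERVER -T",
                chunkTable, csvPath),
            UseShellExecute = false,
            RedirectStandardOutput = true
        };

        using (var proc = Process.Start(psi))
        {
            string output = proc.StandardOutput.ReadToEnd();
            proc.WaitForExit();
            if (proc.ExitCode != 0)
                throw new Exception("bcp failed for " + csvPath + ": " + output);
        }
    }
}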

So...

1) A supervisor object creates a clsDbAZOut instance.
2) clsDbAZOut creates an actual temporary database, dbXyzAZOut.
3) clsDbAZOut pulls every record needing to be accuzipped into a table with
a super PK, sorted on Zip5/4.
4) clsDbAZOut counts the records in tblAZOut and starts to build
clsProcessFile instances, which hold the information about the chunk
processing.  clsProcessFile can write its info back to the ProcessFile
records on SQL Server.
5) clsDbAZOut has a thread which builds tblChunk, gets ChunkSize records
into the chunk table and BCPs tblChunk to CSV files in a staging out
directory (a rough sketch of this loop follows the list).
6) clsDbAZOut BCPs the chunk to file.
7) At this point the clsProcessFile is handed off to clsVM.
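
The sketch promised in item 5 - the chunking/bookkeeping shape of that
thread, with the SQL Server and bcp work stubbed out, the helper names
invented, and the clsProcessFile flags borrowed from the sketch a little
further down:

using System;
using System.Collections.Generic;

// Hedged sketch of the clsDbAZOut export thread (items 4-6 above).  The SQL
// Server and bcp work is stubbed out; only the chunking/bookkeeping shape is
// shown, and the helper names are invented.
public class DbAZOutSketch
{
    public int ChunkSize = 2000000;

    // In the real code the dictionary lives in clsSupervisor.
    public Dictionary<int, clsProcessFile> ProcessFiles = new Dictionary<int, clsProcessFile>();

    public void ExportAllChunks()
    {
        long totalRows = CountRows();                                    // item 4: records in tblAZOut
        int chunkCount = (int)((totalRows + ChunkSize - 1) / ChunkSize);

        for (int chunkNo = 0; chunkNo < chunkCount; chunkNo++)
        {
            BuildChunkTable(chunkNo);               // item 5: next ChunkSize rows into tblChunk
            BcpChunkToCsv(chunkNo);                 // item 6: tblChunk out to a CSV in staging out

            var pf = new clsProcessFile();          // record the chunk's state so the supervisor
            pf.ChunkNo = chunkNo;                   // (and a restart) can see where it got to
            pf.ChunkTableCreated = DateTime.Now;
            pf.ChunkExportedToFile = DateTime.Now;
            ProcessFiles.Add(chunkNo, pf);          // item 7 hands the chunk off to clsVM from here
        }
    }

    // Stubs standing in for the real SQL Server / bcp calls.
    private long CountRows() { return 0; }
    private void BuildChunkTable(int chunkNo) { }
    private void BcpChunkToCsv(int chunkNo) { }
}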


clsSupervisor has a strongly typed dictionary which stores the
clsProcessFile instances created by clsDbAZOut.  clsSupervisor can read
ProcessFile records from SQL Server through a clsProcessFile factory, or
clsDbAZOut can use a class factory to create new ones.  In all cases they
are stored in the dictionary in clsSupervisor.
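
Since the export, VM and import threads all reach into that dictionary, this
is exactly where the lock() from the subject line comes in.  A hedged sketch
of how clsSupervisor might guard it (the method names are mine, not the real
ones):

using System.Collections.Generic;

// Hedged sketch of clsSupervisor's strongly typed dictionary of chunks.  Every
// thread goes through methods like these, so access to the dictionary is
// guarded with lock() - which is what the c# lock() subject line is about.
public class clsSupervisorSketch
{
    private readonly object _lock = new object();
    private readonly Dictionary<int, clsProcessFile> _processFiles =
        new Dictionary<int, clsProcessFile>();

    // Called by the clsProcessFile factory / clsDbAZOut when a chunk is created.
    public void AddProcessFile(clsProcessFile pf)
    {
        lock (_lock)
        {
            _processFiles[pf.ChunkNo] = pf;
        }
    }

    // Called by the clsVM / clsDBImp threads when they need a specific chunk.
    public clsProcessFile GetProcessFile(int chunkNo)
    {
        lock (_lock)
        {
            clsProcessFile pf;
            _processFiles.TryGetValue(chunkNo, out pf);
            return pf;
        }
    }
}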

clsProcessFile has flags which document that (a minimal sketch of such a
flags class follows this list):
a) The chunk table was created
b) The chunk table was exported to file
c) The chunk went to a VM (and which one)
d) The chunk file was (eventually) found in the Accuzip OutputFrom directory
e) The resulting AZData file (and PDF) were moved from the VM back to SQL
Server In staging
f) The file in SQL Server In staging was pulled back into a chunk table in a
second temporary database, dbXyzAZIn
g) The chunk table was imported back into the original database.
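
The minimal sketch mentioned above - clsProcessFile reduced to just those
flags, stored as dateTimes so each flag is also a timestamp (the property
names are guesses):

using System;

// Minimal sketch of clsProcessFile reduced to the state flags (a) through (g)
// above.  Property names are guesses; the real class also persists itself to
// the ProcessFile record in SQL Server.
public class clsProcessFile
{
    public int ChunkNo { get; set; }

    public DateTime? ChunkTableCreated { get; set; }     // (a)
    public DateTime? ChunkExportedToFile { get; set; }   // (b)
    public DateTime? SentToVM { get; set; }              // (c)
    public string VMName { get; set; }                   //     ...and which VM it went to
    public DateTime? FoundInOutputFrom { get; set; }     // (d)
    public DateTime? MovedToSqlServerIn { get; set; }    // (e)
    public DateTime? ImportedToChunkTable { get; set; }  // (f) into the dbXyzAZIn temp database
    public DateTime? ImportedToOriginalDb { get; set; }  // (g)

    // Stand-in for writing the flags back to the ProcessFile record.
    public void Save() { /* UPDATE ProcessFile SET ... WHERE ChunkNo = @ChunkNo */ }
}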

As you can see, clsProcessFile is the center of the universe for a single
chunk.
clsSupervisor is the center of the universe for the entire process.

It is a bit difficult to discuss how the entire program works in a single
email.

A clsManager loads one or a dozen clsSupervisor instances.
clsManager loads clsVM, which supervises a virtual machine that runs the
third-party process.

clsVM moves files from SQL Staging Out into the VM one chunk at a time,
updating clsProcessTable flags.
clsVM moves files from the VM back to SQL Staging In, updating flags.
The VM may not be available, or the Accuzip program may be down, etc.
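
The move itself can be a plain File.Copy against the VM's share; the
important part is stamping which VM got the chunk so the return trip knows
where to look, and leaving the flags alone if the copy fails.  A sketch with
invented share and flag names:

using System;
using System.IO;

// Hedged sketch of clsVM's stage 2 move: copy one chunk CSV from the SQL
// staging out directory to a VM's AccuzipIn share and record which VM got it.
// The share path, VM name and flag names are illustrative.
public class clsVMSketch
{
    public string VMName = "AZVM01";
    public string AccuzipInShare = @"\\AZVM01\AccuzipIn";

    public void SendChunkToVM(clsProcessFile pf, string stagingOutCsv)
    {
        string dest = Path.Combine(AccuzipInShare, Path.GetFileName(stagingOutCsv));

        try
        {
            File.Copy(stagingOutCsv, dest, true);
        }
        catch (IOException)
        {
            // The VM (or Accuzip on it) may be down; leave the flags untouched
            // so a later pass - or another VM - can pick the chunk up.
            return;
        }

        pf.SentToVM = DateTime.Now;   // the chunk is now on the VM...
        pf.VMName = VMName;           // ...and we know which VM to watch for its return
        pf.Save();
    }
}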

clsSupervisor loads one to N clsProcessTable instances.  clsSupervisor loads
a clsDBExp and clsDBImp.

clsDBExp creates new clsProcessTable instances, one for each chunk it
exports - storing them in clsSupervisor.  clsDBExp supervises the export
process.

clsDBImp supervises the import process, getting the chunks back into a temp
database in SQL Server and from there into the live database.

I am trying to design the system such that it is asynchronous.  Stage 1,
stage 2/3 and stage 4 can all do their thing independently of each other.
Each step of each stage is logged (dateTime flags written) so that any piece
can just pick up where it left off.
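
Because each step leaves a dateTime flag behind, "pick up where it left off"
can be a pure function of those flags - something along these lines (the
stage names mirror the four stages above; the real logic is surely richer):

// Hedged sketch: given the dateTime flags on a chunk, decide which stage
// should touch it next.
public enum NextAction { Export, SendToVM, WaitForAccuzip, CopyBackToSqlIn, ImportToSql, Done }

public static class ChunkResume
{
    public static NextAction WhatNext(clsProcessFile pf)
    {
        if (pf.ImportedToOriginalDb != null) return NextAction.Done;
        if (pf.MovedToSqlServerIn != null)   return NextAction.ImportToSql;
        if (pf.FoundInOutputFrom != null)    return NextAction.CopyBackToSqlIn;
        if (pf.SentToVM != null)             return NextAction.WaitForAccuzip;
        if (pf.ChunkExportedToFile != null)  return NextAction.SendToVM;
        return NextAction.Export;
    }
}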

That means any Supervisor and any ProcessFile (chunk) can pick up from any
point.

In the meantime, the clsManager is watching the tblSupervisor for supervisor
records ready to process - a thread.

clsSupervisor is watching for ProcessTable records and dispatching them as
required to the clsDBExp, clsDBImp or clsVM - a thread in each supervisor.

clsDBExp is processing chunks, creating CSV files - a thread.

clsVM is watching for clsProcessTable instances ready to go to the VM.  It
is also watching the output directory and sending files back to the SQL
Server - a thread and two file watchers (and two events).

clsDBImp is watching for clsProcessTable instances ready to import back into
SQL Server - a thread.

Most of these threads are writing flags (as they finish their specific
process) and reading flags in clsProcessTable to decide what chunk has to be
processed in which class.
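
Each of those threads is essentially the same polling loop: take the lock,
scan the chunks for one whose flags say it is ready for this particular
stage, release the lock, then do the slow work outside it.  A hedged sketch
of clsDBImp's version (the wiring is illustrative):

using System.Collections.Generic;
using System.Threading;

// Hedged sketch of one of those polling threads - clsDBImp's version.  It
// scans the shared chunk dictionary under the supervisor's lock for a chunk
// whose flags say "back in SQL Server In staging but not yet imported", then
// does the slow import work outside the lock.
public class clsDBImpSketch
{
    private readonly object _lock;                                  // the same lock object clsSupervisor uses
    private readonly Dictionary<int, clsProcessFile> _processFiles; // clsSupervisor's dictionary

    public clsDBImpSketch(object sharedLock, Dictionary<int, clsProcessFile> processFiles)
    {
        _lock = sharedLock;
        _processFiles = processFiles;
    }

    public void ImportThread()
    {
        while (true)
        {
            clsProcessFile next = null;

            lock (_lock)
            {
                foreach (clsProcessFile pf in _processFiles.Values)
                {
                    if (pf.MovedToSqlServerIn != null && pf.ImportedToOriginalDb == null)
                    {
                        next = pf;
                        break;
                    }
                }
            }

            if (next != null)
                ImportChunk(next);      // BCP into the dbXyzAZIn temp database, then into the live one
            else
                Thread.Sleep(30000);    // nothing ready yet; poll again in half a minute
        }
    }

    private void ImportChunk(clsProcessFile pf) { /* stand-in for the real import */ }
}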

Oh, and I have three list controls on a form which the various classes
write to in order to display the status for Stage 1, Stage 2/3 and Stage 4.
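
One wrinkle there: worker threads can't touch WinForms controls directly, so
those status writes have to be marshalled onto the UI thread with
Invoke/BeginInvoke.  A minimal sketch (the control name is invented):

using System;
using System.Windows.Forms;

// Hedged sketch of the status display: worker threads call ReportStatus, which
// marshals the update onto the UI thread before touching the ListBox.
public class frmStatus : Form
{
    private readonly ListBox lstStage1 = new ListBox();   // invented name for the Stage 1 list control

    public frmStatus()
    {
        lstStage1.Dock = DockStyle.Fill;
        Controls.Add(lstStage1);
    }

    public void ReportStatus(string message)
    {
        if (InvokeRequired)
        {
            // We're on a worker thread - bounce the call over to the UI thread.
            BeginInvoke(new Action<string>(ReportStatus), message);
            return;
        }
        lstStage1.Items.Add(DateTime.Now.ToString("HH:mm:ss") + "  " + message);
    }
}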

So pretty much a single supervisor is processing at a time, exporting files,
sending them to the VM, importing them back in, etc.  It is possible,
however, for an order supervisor to cut in at the head of the line.

Not a simple task co-ordinating all of this stuff.

I used to co-ordinate it all manually!  ;)

John W. Colby
www.ColbyConsulting.com

On 3/26/2011 8:08 PM, Shamil Salakhetdinov wrote:
> John --
>
> I'm falling asleep here now - would it be a correct assumption that you
> can generalize your global task like this - you have, say:
>
> - the input data split into 50 chunks;
> - every chunk should be processed in 10 steps;
> - every step's average processing time is 100 minutes.
>
> Then in total you need 50*10*100 = 50,000 minutes, or ~833.(3) hours, to
> process all your input data sequentially, chunk after chunk, step after
> step - assuming you don't have enough computer power to process chunks in
> parallel, and you can't process the whole unsplit input data as it's too
> large for your system...
>
> But you have "50 small computers" and you can process each of the 50
> input chunks in parallel - then you'll need just 10*100 = 1,000 minutes
> (~16.67 hours) to complete the job.
>
> I assume that all the chunks' processing is independent - then you can:
>
> 1) Create a Scheduler class instance;
> 2) The Scheduler class creates 10 TaskQueue class instances - 1 queue for
> every processing step type;
> 3) The Scheduler class defines splitting criteria to split the input data
> into 50 chunks;
> 4) The Scheduler defines completion criteria - when all 50 chunks get
> collected in an 11th FinalQueue;
> 5) The Scheduler seeds the first InputChunkSplitTaskQueue with 50
> SplitTasksDescriptors;
> 6) From time to time - say every half a minute - the Scheduler class
> instance creates 500 worker classes in 500 threads (500 worker classes
> => 50 chunks * 10 steps), which "hunt for the job" in their dedicated
> Queues: if they find a Job they work on it, if not - they "die"...
> 7) When a worker class completes processing its job it queues its
> "task results" into the next step's Queue; if a worker class fails then
> it puts its incomplete task descriptor back into its dedicated queue
> and "dies", or it could be the Scheduler's (or SchedulerAssistant's)
> job to find failed workers' "traps" and to resubmit their failed work
> to the proper queue according to the custom workflow descriptors attached
> to every chunk...
>
> 500 worker classes is overkill, as only 50 worker classes can have a
> job at any one time, but that seems to be an easy "brute force" and "lazy
> parallel programming" approach - and it should work...
> Or you can make the worker class production process smarter: the
> Scheduler class can start a special thread for a WorkerClassGenerator
> instance, which will monitor all 10 TaskQueues, and if it finds an
> item in a Queue, it will pick it up, create the corresponding
> Worker class in a parallel thread and pass the WorkItem to the
> Worker class for processing...
>
> When the approach described above works, then you can easily(?) scale
> it by splitting your (constant size) input data into 100 chunks; if every
> chunk can then be processed in half the time - on average 50 minutes -
> then the whole job can be completed in 500 minutes = ~8.33(3) hours,
> etc...
>
> Please correct me if I oversimplified your application business area...
>
> Thank you.
>
> --
> Shamil
_______________________________________________
dba-VB mailing list
dba-VB at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-vb
http://www.databaseadvisors.com



