jwcolby
jwcolby at colbyconsulting.com
Sat Mar 26 23:12:20 CDT 2011
Well... In programming you can do any task 50 different ways. The way I did it is to create a Supervisor table and a ProcessFile table in SQL Server. Each supervisor record represents a database in SQL Server containing an address table that needs to be processed. The ProcessFile records are children of tblSupervisor and represent chunks (files). The supervisor record stores all of the high level flags about the database (address table) being processed; the ProcessFile record stores all of the flags about each chunk being processed.

I have four different high level "stages", as I call them, that a supervisor goes through:

Stage 1 - AccuzipOut: table to CSV files in Staging Out
Stage 2 - Staging Out to VM
Stage 3 - VM to Staging In
Stage 4 - Staging In to SQL Server

1) Export all chunks (files) to a SQL Out staging directory. Each chunk ends up being a CSV file with PKID, FName, LName, Addr, City, St, Zip5, Zip4. Each chunk file needs to be sorted in Zip5 / Zip4 order because the third party application that processes these files is immensely faster if the file is presorted. (See the sketch after this list.)

2) One by one, copy each chunk file to a virtual machine - the AccuzipIn directory. The Accuzip virtual machine runs on a Windows 2008 server running Hyper-V Manager. I currently run 3 different VMs, only one of which runs Accuzip. For a while I actually ran 3 VMs running Accuzip, and I am designing such that I can add additional Accuzip VMs in the future. What this means is that each file could go to any valid Accuzip VM, and when the chunk file is placed on a VM I have to log which VM it went to because...

3) The Accuzip software automatically senses and processes the file in the InputTo directory. A file of 2 million records takes about 25 minutes to CASS process locally on the VM, then is uploaded to a server in California for further NCOA processing. The server in California belongs to Accuzip, and they stage big files, so it can take a while before a file even begins to process. Once it processes, the NCOA server sends a small file back to my VM, where address change information is integrated back into the local Accuzip database file (FoxPro). After that the data is exported back to CSV with a bunch of additional fields and is placed in the Accuzip OutputFrom directory, along with a PDF which contains the CASS / NCOA statistics. This PDF is a "proof of processing" document that I have to keep forever and have to send to the client. I have a directory watcher which is watching the OutputFrom directory. When *all* of the resulting Accuzip output files are in the OutputFrom directory, my directory watcher event triggers my program to copy all of these files back to the SQL Server In staging directory.

4) Once all of the Accuzip files for a single chunk are back in the SQL Server In directory, my application BCPs the CSV into an input chunk table. From there decisions are made whether I need to import every address back in or only the moves. In any event, some or all of the address records (with the additional fields from Accuzip) are pulled into an AZData table back in the database that the address data originally came from.
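Roughly, the stage 1 export boils down to something like the sketch below - the database, table and directory names here are just placeholders, and it assumes bcp.exe is on the path with a trusted connection. My actual code is wrapped in the classes described next; this is only the shape of it.

// Minimal sketch: export one presorted chunk to CSV with bcp queryout.
// dbXyzAZOut, tblAZOut and the staging path are illustrative names only.
using System.Diagnostics;

class ChunkExporter
{
    // Exports rows [firstPkid, firstPkid + chunkSize) to a presorted CSV file.
    public static void ExportChunk(string server, long firstPkid, int chunkSize,
                                   int chunkNo, string stagingOutDir)
    {
        string query =
            "SELECT PKID, FName, LName, Addr, City, St, Zip5, Zip4 " +
            "FROM dbXyzAZOut.dbo.tblAZOut " +
            $"WHERE PKID >= {firstPkid} AND PKID < {firstPkid + chunkSize} " +
            "ORDER BY Zip5, Zip4";                       // presort for Accuzip

        string csvPath = System.IO.Path.Combine(stagingOutDir, $"Chunk{chunkNo:D4}.csv");

        var psi = new ProcessStartInfo("bcp",
            $"\"{query}\" queryout \"{csvPath}\" -c -t, -S {server} -T")
        {
            UseShellExecute = false
        };

        using (var proc = Process.Start(psi))
        {
            proc.WaitForExit();                          // block until this chunk's file is written
        }
    }
}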
So, at the class level:

1) A supervisor object creates a clsDbAZOut instance.
2) clsDbAZOut creates an actual temporary database, dbXyzAZOut.
3) clsDbAZOut pulls every record needing to be Accuzipped into a table with a super PK, sorted on Zip5 / Zip4.
4) clsDbAZOut counts the records in tblAZOut and starts to build clsProcessFile instances, which hold the information about the chunk processing. clsProcessFile can write its info back to the ProcessFile records on SQL Server.
5) clsDbAZOut has a thread which builds tblChunk, gets ChunkSize records into the chunk table, and BCPs tblChunk to CSV files in the staging out directory.
6) clsDbAZOut BCPs the chunk to file.
7) At this point the clsProcessFile is handed off to clsVM.

clsSupervisor has a strongly typed dictionary which stores the clsProcessFile instances created by clsDbAZOut. clsSupervisor can read ProcessFile records from SQL Server through a clsProcessFile factory, or clsDbAZOut can use a class factory to create new ones. In all cases they are stored in the dictionary in clsSupervisor.

clsProcessFile has flags which document that:

a) The chunk table was created
b) The chunk table was exported to file
c) The chunk went to a VM (and which one)
d) The chunk file was (eventually) found in the Accuzip OutputFrom directory
e) The resulting AZData file (and PDF) were moved from the VM back to SQL Server In staging
f) The file in SQL Server In staging was pulled back into a chunk table in a second temporary database, dbXyzAZIn
g) The chunk table was imported back into the original database

As you can see, clsProcessFile is the center of the universe for a single chunk. clsSupervisor is the center of the universe for the entire process.

It is a bit difficult to discuss how the entire program works in a single email. A clsManager loads one or a dozen clsSupervisor instances. clsManager loads clsVM, which supervises a virtual machine running the third party process. clsVM moves files from SQL Staging Out into the VM one chunk at a time, updating clsProcessTable flags. clsVM moves files from the VM back to SQL Staging In, updating flags. The VM may not be available, or the Accuzip program may be down, etc. clsSupervisor loads one to N clsProcessTable instances. clsSupervisor loads a clsDBExp and a clsDBImp. clsDBExp creates new clsProcessTable instances, one for each chunk it exports, storing them in clsSupervisor. clsDBExp supervises the export process. clsDBImp supervises the import process, getting the chunks back into SQL Server into a temp database and from there into the live database.

I am trying to design the system such that it is asynchronous. Stage 1, stage 2/3 and stage 4 can all do their thing independently of each other. Each step of each stage is logged (DateTime flags written) so that any piece can just pick up where it left off. That means any supervisor and any ProcessFile (chunk) can pick up from any point.

In the meantime:

- clsManager is watching tblSupervisor for supervisor records ready to process - a thread.
- clsSupervisor is watching for ProcessTable records and dispatching them as required to clsDBExp, clsDBImp or clsVM - a thread in each supervisor.
- clsDBExp is processing chunks, creating CSV files - a thread.
- clsVM is watching for clsProcessTable instances ready to go to the VM. It is also watching the output directory and sending files back to the SQL Server - a thread and two file watchers (and two events).
- clsDBImp is watching for clsProcessTable instances ready to import back into SQL Server - a thread.

Most of these threads are writing flags (as they finish their specific process) and reading flags in clsProcessTable to decide which chunk has to be processed in which class. Oh, and I have three list controls on a form which the various classes write to, displaying the status for Stage 1, Stage 2/3 and Stage 4.
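To give a flavor of that flag-driven dispatch, here is a rough sketch: a chunk record carries nullable DateTime flags for each stage, and a worker thread picks up whatever is exported but not yet sent to a VM. The class and flag names are placeholders, not the actual clsProcessFile / clsVM.

// Minimal sketch of the flag-driven idea: nullable DateTime stage flags plus a
// polling worker thread. All names here are illustrative only.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

class ProcessFileInfo
{
    public int ChunkNo;
    public DateTime? ChunkExported;    // stage 1 done: CSV sitting in Staging Out
    public DateTime? SentToVM;         // stage 2 done (also record which VM)
    public string VMName;
    public DateTime? BackFromVM;       // stage 3 done: Accuzip output in Staging In
    public DateTime? ImportedToSql;    // stage 4 done
}

class VmDispatcher
{
    readonly Dictionary<int, ProcessFileInfo> _chunks;   // held by the supervisor
    readonly object _lock = new object();

    public VmDispatcher(Dictionary<int, ProcessFileInfo> chunks) { _chunks = chunks; }

    // Worker thread: push exported-but-not-yet-sent chunks to the VM, one at a time.
    public void Run()
    {
        while (true)
        {
            ProcessFileInfo next;
            lock (_lock)
            {
                next = _chunks.Values
                    .Where(c => c.ChunkExported != null && c.SentToVM == null)
                    .OrderBy(c => c.ChunkNo)
                    .FirstOrDefault();
            }
            if (next == null) { Thread.Sleep(5000); continue; }   // nothing ready yet

            CopyChunkToVm(next);                                  // stage 2 work
            lock (_lock) { next.SentToVM = DateTime.Now; next.VMName = "AZ-VM1"; }
        }
    }

    void CopyChunkToVm(ProcessFileInfo pf) { /* File.Copy to the VM's AccuzipIn share */ }
}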
So pretty much a single supervisor is processing at a time, exporting files, sending them to the VM, importing them back in, etc. It is possible, however, for an order supervisor to cut in at the head of the line. Not a simple task coordinating all of this stuff. I used to coordinate it all manually! ;)

John W. Colby
www.ColbyConsulting.com

On 3/26/2011 8:08 PM, Shamil Salakhetdinov wrote:
> John --
>
> I'm falling asleep here now - would that be a correct assumption that you
> can generalize your global task like that - you have, say:
>
> - split input data into 50 chunks;
> - every chunk should be processed in 10 steps;
> - every step's average processing time is 100 minutes.
>
> Then in total it is needed 50*10*100 = 50,000 minutes or ~833.(3) hours to
> process all your input data sequentially, chunk after chunk, step after
> step - say, you don't have enough computer power to process some chunks in
> parallel, and you can't process the whole unsplit input data as it's too
> large for your system...
>
> But you have "50 small computers" and you can process every one of the 50
> input chunks in parallel - then you'll need just 10*100 = 1,000 minutes
> (~16.67 hours) to complete the job.
>
> I assume that all the chunks' processing is independent - then you can:
>
> 1) Create a Scheduler class instance;
> 2) The Scheduler class creates 10 TaskQueue class instances - 1 queue for
> every processing step type;
> 3) The Scheduler class defines splitting criteria to split input data into
> 50 chunks;
> 4) The Scheduler defines completion criteria - when all 50 chunks get
> collected in an 11th FinalQueue;
> 5) The Scheduler seeds the first InputChunkSplitTaskQueue with 50
> SplitTaskDescriptors;
> 6) From time to time - say every half a minute - the Scheduler class
> instance creates 500 worker classes in 500 threads (500 worker classes =>
> 50 chunks * 10 steps), which "hunt for the job" in their dedicated queues:
> if they find a job they work on it, if not - they "die"...
> 7) When a worker class completes processing its job it queues its "task
> results" into the next step's queue; if a worker class fails then it puts
> its incomplete task descriptor back into its dedicated queue and "dies", or
> that could be the Scheduler's (or SchedulerAssistant's) job to find failed
> workers' "trap" and to resubmit their failed work to the proper queue
> according to the custom workflow descriptors attached to every chunk...
>
> 500 worker classes is overkill, as only 50 worker classes can have a job at
> any time, but that seems to be an easy "brute force" and "lazy parallel
> programming" approach - and it should work...
> Or you can make the production of worker class instances smarter: the
> Scheduler class can start a special thread for a WorkerClassGenerator
> instance, which will monitor all 10 TaskQueues, and if it finds an item in
> a queue, it will pick it up, create the corresponding Worker class in a
> parallel thread and pass the WorkItem to the Worker class for processing...
>
> When the approach described above works, then you can easily(?) scale it by
> splitting your (constant size) input data into 100 chunks, and then if
> every chunk can be processed in half the time - on average (50 minutes) -
> then the whole job can be completed in 500 minutes = ~8.33(3) hours etc...
>
> Please correct me if I oversimplified your application business area...
>
> Thank you.
>
> --
> Shamil
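For reference, a minimal sketch of the per-stage queue / worker pattern described in the quoted reply, assuming .NET's BlockingCollection with one long-running worker per stage. The class names are illustrative only and come from neither Shamil's outline nor my own classes.

// Minimal sketch of the scheduler / per-stage queue idea: a completed item is
// handed to the next stage's queue; each stage has one worker hunting its queue.
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ChunkWorkItem { public int ChunkNo; }

class StagePipeline
{
    // One queue per processing step.
    readonly BlockingCollection<ChunkWorkItem>[] _queues;

    public StagePipeline(int stageCount)
    {
        _queues = new BlockingCollection<ChunkWorkItem>[stageCount];
        for (int i = 0; i < stageCount; i++)
            _queues[i] = new BlockingCollection<ChunkWorkItem>();
    }

    // Seed the first queue with a chunk descriptor.
    public void Seed(ChunkWorkItem item) { _queues[0].Add(item); }

    // Start one long-running worker per stage; each processes its own queue.
    public void Start(Action<int, ChunkWorkItem> processStage)
    {
        for (int stage = 0; stage < _queues.Length; stage++)
        {
            int s = stage;                                   // capture loop variable
            Task.Factory.StartNew(() =>
            {
                foreach (var item in _queues[s].GetConsumingEnumerable())
                {
                    processStage(s, item);                   // do this stage's work
                    if (s + 1 < _queues.Length)
                        _queues[s + 1].Add(item);            // queue for the next stage
                }
            }, TaskCreationOptions.LongRunning);
        }
    }
}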