Vestpocket Software

Postal Merge Example

The Postal Merge example, shown below, is a project in which some of the jobs have stream inputs and outputs.  It may look like it cannot take advantage of parallel processing, but its looks are deceiving: streams can be segmented, and the segments can be processed in parallel on different worker machines.  The 'Get Addresses' job has a non-stream input, a query that specifies which addresses are to be fetched from the database.  It produces a stream of address objects that is sent to the 'Merge Addresses Into Document' job.  The 'Choose A Document' job has both a non-stream input and a non-stream output: the input is a document name, and the output is a single document object, which is also sent to the 'Merge Addresses Into Document' job.  The 'Merge Addresses Into Document' job therefore has a stream input, the address objects, and a single input, the document object.  It merges each address object into the document object and produces a stream of merged document objects.  Finally, the 'Send Mail To Post' job accepts the stream of merged document objects as input and mails them.
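As a rough sketch of the pipeline just described, the four jobs could be modeled with Python generators, where each generator is a stream.  The job names come from the example; the function signatures, data shapes, and in-memory 'database' are illustrative assumptions, not the product's actual API:

```python
def get_addresses(query):
    # Non-stream input: a query.  Output: a stream of address objects.
    # The in-memory list is a stand-in for a real database fetch.
    database = [
        {"name": "A. Reader", "city": "Springfield"},
        {"name": "B. Writer", "city": "Shelbyville"},
    ]
    for address in database:
        if query(address):
            yield address

def choose_a_document(document_name):
    # Non-stream input and output: a single document (template) object.
    return {"name": document_name, "body": "Dear {name} of {city}, ..."}

def merge_addresses_into_document(addresses, document):
    # Stream input (the addresses) plus a single input (the document).
    # Output: a stream of merged document objects.
    for address in addresses:
        yield document["body"].format(**address)

def send_mail_to_post(merged_documents):
    # Stream input: merged documents, one piece of mail per object.
    return ["MAILED: " + doc for doc in merged_documents]

query = lambda address: True  # fetch every address
document = choose_a_document("Letter")
mailed = send_mail_to_post(
    merge_addresses_into_document(get_addresses(query), document))
```

Because the address stream is consumed lazily, a scheduler could cut it into segments and hand each segment, together with the single document object, to a different worker machine.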

If two streams are combined, each object in one stream is combined with every object in the other stream.  This can lead to 'combinatorial explosions'.  A combinatorial explosion can be predicted by knowing the number of objects in each stream.  For instance, if 636 objects in one stream are combined with 234 objects in another stream, the resulting stream will have 148,824 objects.  Combining three streams can lead to a catastrophic combinatorial explosion: if the two streams above are combined with a third stream of 223 objects, the result is a stream of 33,187,752 objects.

Streams have a function in the system that resembles a 'do loop' in a programming language.  Just as nested do loops consume time in programming languages, combined streams consume time and resources in a parallel system.  Stream segmentation can help.  When combining two streams, each individual object from one stream can be sent to a different worker server to be combined with all the objects in the other stream.  That worker server could in turn segment its incoming stream and send the segments to still more workers.  This could result in a combinatorial explosion of requests for workers.  If the workers were available, this would not cause a problem.  If, however, there were a shortage of workers, the available workers' in-baskets could quickly be overloaded.  Segmenting a job with three input streams involves similar logic and could result in an even more likely catastrophic combinatorial meltdown.

Calculations involving the number of streams, the length of each stream, and the estimated execution time of the job when processing a single object can help predict the total estimated run time, and can guide the stream segmentation strategy and the worker job distribution strategy.  Sometimes this cannot easily be predicted before a project is run, and it cannot be predicted at all if any of the jobs generate streams of unpredictable lengths.
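The arithmetic above can be checked directly: since the combined stream holds every combination of one object from each input stream, its length is simply the product of the stream lengths.

```python
from math import prod

def combined_stream_length(*stream_lengths):
    # Each object of every stream pairs with each object of the others,
    # so the combined stream's length is the product of the lengths.
    return prod(stream_lengths)

print(combined_stream_length(636, 234))        # 148824 objects
print(combined_stream_length(636, 234, 223))   # 33187752 objects
```

The product grows multiplicatively with each added stream, which is why a third stream turns a large result into a catastrophic one.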
A query of a database usually produces an unpredictable number of output objects.
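When the stream lengths are known, the run-time estimate described above reduces to a small calculation.  This is a minimal sketch under optimistic assumptions: the per-object time and worker count are illustrative, segmentation is perfectly even, and scheduling and transfer overhead are ignored.

```python
from math import prod

def estimated_run_time(stream_lengths, seconds_per_object, workers):
    # Total work is one job execution per object in the combined stream;
    # dividing by the worker count assumes perfectly even segmentation
    # and no overhead, so this is a best-case estimate.
    total_objects = prod(stream_lengths)
    return total_objects * seconds_per_object / workers

# The 636 x 234 example at 1 ms per combined object:
serial = estimated_run_time([636, 234], 0.001, workers=1)     # ~148.8 s
parallel = estimated_run_time([636, 234], 0.001, workers=32)  # ~4.65 s
```

When any input stream has an unpredictable length, as with the database query above, the estimate can only be refined after the stream has been produced, or produced in part.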