Vestpocket Software

Parallel Data Flow Project Development Framework

The Parallel Console, shown below, is the heart of the Parallel Data Flow Project Development Framework. The Parallel Console is used to construct and run data flow networks of SOA objects, called Projects. It also serves as an interface to a master server which is part of a collection of networked computers. Each computer on the network has a subordinate server and a worker server, which are interconnected to the master server that is connected to the parallel console. The console is used to define a computational project made up of individual jobs that will be run in parallel on the parallel computer network, as shown in the PERT Chart view below. The individual jobs may also be projects. The console directs the master server to run a project. The console can also submit a batch run of projects to the master server. The system requires a database which supports ODBC.

The master server breaks down each project into a form that can be run in a parallel fashion, as shown in the GANTT Chart view below. It runs a brief simulation of the project, taking into account the estimated complexity of the jobs in the project and the power of its subordinate master server computers, to help it work out a distribution strategy. Using its distribution strategy, it then assigns parallel parts of the project to subordinate servers, which schedule the jobs in a serial fashion to subordinate worker servers. The worker servers put the jobs into a queue and run them in a serial fashion. If a job is a project, the process is repeated recursively on the subordinate master server.

A project is a set of SOA objects or 'transformations', and a job is a single SOA object or 'transformation'. Each SOA object may be connected to one or more SOA objects by data flow 'relationships', as shown in the Connections List view below. No loops are allowed, so these SOA objects form a directed acyclic graph. A job can be thought of as a computer program that completes a single part of a project.
A project is also considered a transformation and can appear as a job within a project thus allowing recursion. After the project is defined, it can be assigned to be run on the parallel network. A project can also be referred to as an executable directed graph that allows for data flow between the SOA objects in the graph. The user can view the progress of the project as the jobs within the project are run in parallel. As the jobs are run their colors change from red, indicating that they are running, to cyan indicating they are finished. After the overall project is finished, the user can view the results of any of the jobs within the project and also view the results at the end points of the overall project. A report writer can produce a hard copy of the results of the execution of the project. A list of projects can also be created and submitted to the master server as a batch run. There is no theoretical limit to the number of computers in the parallel network and no theoretical limit to the number of jobs in the project. Versions of the parallel console are written in both Java and in C#. The program shown on this page is written in Java.
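
The rule that a project's connections may not form a loop can be illustrated with a small sketch. This is not the framework's actual code: the class name ProjectGraph and its methods are hypothetical, and jobs are identified here only by name. A connection is refused when the downstream job can already reach the upstream job, since adding it would close a loop.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model (not the framework's API): a project is a set of named
// jobs joined by directed data flow connections, and loops are rejected.
class ProjectGraph {
    // job name -> names of the jobs directly downstream of it
    private final Map<String, Set<String>> downstream = new LinkedHashMap<>();

    void addJob(String name) {
        downstream.putIfAbsent(name, new LinkedHashSet<>());
    }

    // True if 'to' can be reached from 'from' by following connections.
    boolean reaches(String from, String to) {
        Deque<String> stack = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        stack.push(from);
        while (!stack.isEmpty()) {
            String job = stack.pop();
            if (job.equals(to)) {
                return true;
            }
            if (seen.add(job)) {
                for (String next : downstream.getOrDefault(job, Set.of())) {
                    stack.push(next);
                }
            }
        }
        return false;
    }

    // Adds the connection up -> down and returns true, or returns false
    // without changing the graph if the connection would create a loop.
    boolean connect(String up, String down) {
        if (!downstream.containsKey(up) || !downstream.containsKey(down)) {
            throw new IllegalArgumentException("unknown job");
        }
        if (reaches(down, up)) {
            return false; // also covers connecting a job to itself
        }
        downstream.get(up).add(down);
        return true;
    }

    public static void main(String[] args) {
        ProjectGraph project = new ProjectGraph();
        project.addJob("Load");
        project.addJob("Compute");
        project.addJob("Report");
        System.out.println(project.connect("Load", "Compute"));
        System.out.println(project.connect("Compute", "Report"));
        System.out.println(project.connect("Report", "Load")); // refused: closes a loop
    }
}
```

Because a project can itself appear as a job, a fuller model would presumably let a node hold a nested ProjectGraph; that is left out here to keep the sketch short.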

The Pert Chart view above allows the user to construct a project and create the data flow dependencies of the jobs within a project. It is used mostly during the construction and editing of a project. During the construction phase the user can create new jobs in a project, drag and drop connections between jobs, drag jobs to new locations, and edit the properties of a job such as its name and the name of the SOA object it represents. When the construction of a project is finished, the user can run the project by connecting to the server and clicking the button with the ! character on it. The icons change color when the user runs the job. The icon is blue if the job has not run, red if the job is running, and cyan if the job is finished. The animation, at the top of the home page, is similar in appearance to a project as it is running. During the editing phase the user can add and delete jobs, create or destroy connections, change inputs, etc., and then rerun the project. The user can also view the characteristics, inputs, and outputs for any job in this view. There is another view, which is not shown, that allows the user to manage the details of the data flow between each job within the project. The project begins running with the jobs that have no dependencies. In the project shown above, there is only one such job at the top of the graph. If there were more than one such job in the graph, the jobs would begin in parallel. The inputs for the jobs are obtained as input objects from the database. The jobs are executed on separate worker computers on the network which are assigned by a master server computer. When the jobs are finished, their computed results are stored in the database as output objects. Values in the output objects from the upstream jobs can override values in the input objects for the dependent jobs that are downstream, thus allowing the upstream jobs to provide variable inputs for the downstream jobs. This is called data flow between SOA objects. 
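
The override of downstream inputs by upstream outputs described above can be sketched as a simple merge. Everything named here is an assumption made for illustration: the class DataFlowOverride, the use of plain maps for input and output objects, and the mapping from output names to input names are not taken from the framework itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: input and output objects are modelled as name/value maps.
class DataFlowOverride {
    // storedInputs:    the downstream job's input object as read from the database
    // upstreamOutputs: the output object computed by an upstream job
    // outputToInput:   which upstream output name overrides which downstream input name
    static Map<String, Object> effectiveInputs(Map<String, Object> storedInputs,
                                               Map<String, Object> upstreamOutputs,
                                               Map<String, String> outputToInput) {
        Map<String, Object> result = new LinkedHashMap<>(storedInputs);
        for (Map.Entry<String, String> link : outputToInput.entrySet()) {
            String outputName = link.getKey();
            String inputName = link.getValue();
            // Only override when the upstream job actually produced the value.
            if (upstreamOutputs.containsKey(outputName)) {
                result.put(inputName, upstreamOutputs.get(outputName));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> stored = Map.of("rate", 0.05, "years", 10);
        Map<String, Object> outputs = Map.of("computedRate", 0.07);
        Map<String, String> links = Map.of("computedRate", "rate");
        System.out.println(effectiveInputs(stored, outputs, links));
    }
}
```

Inputs that have no incoming relationship keep their stored values, which matches the idea that upstream jobs supply only the variable part of a downstream job's inputs.
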
The master server continuously checks the jobs that have completed to see if the necessary job dependencies to start new jobs have been satisfied. When all of the dependencies to start new jobs are satisfied, the new jobs are assigned to worker computers on the network. The process continues until the last job is completed. At this point the job on the bottom of the graph is finished running, and the final results are available. If there were more than one job at the bottom of the graph, all of them would be finished at this point.

The Properties tab displays a sorted list of all of the jobs in a project. When the user double-clicks on a job, a window opens showing the properties of the job. The Input tab also shows a sorted list of the jobs in the project. When the user double-clicks on a job, a window appears showing the input values for the job. The user can change any of the input values and then rerun the project. There is also a view that shows which outputs from an upstream job override which inputs in a downstream job. The user can select the Output tab and double-click on a job in a sorted list. A window will appear showing the computed values for that job. If a project is run several times with different inputs, all of the inputs and outputs for each run are stored as objects in the database. The user can select a previous run and view all of the inputs and outputs for that run; these values are read-only. There is a tool for collecting, filtering, analyzing, and comparing the results of previous runs. There is also a tool for collecting the inputs and outputs from any run of any project for use as the inputs for future runs of any project.
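
The dependency checking performed by the master server, as described above, can be sketched with a thread pool standing in for the worker computers. This is an illustration under stated assumptions, not the framework's implementation: the class DependencyScheduler is hypothetical, jobs are plain names, the "work" of a job is empty, and the dependency map is assumed to be acyclic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: a thread pool plays the role of the worker computers.
class DependencyScheduler {
    // dependsOn maps every job to the jobs it depends on (possibly none).
    // Assumes the graph is acyclic; a loop would leave this method waiting forever.
    // Returns the order in which jobs were observed to finish.
    static List<String> run(Map<String, List<String>> dependsOn, int workerCount) {
        Map<String, Integer> unsatisfied = new HashMap<>();
        Map<String, List<String>> dependents = new HashMap<>();
        for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
            unsatisfied.put(e.getKey(), e.getValue().size());
            for (String upstream : e.getValue()) {
                if (!dependsOn.containsKey(upstream)) {
                    throw new IllegalArgumentException("unknown job: " + upstream);
                }
                dependents.computeIfAbsent(upstream, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        ExecutorService workers = Executors.newFixedThreadPool(workerCount);
        BlockingQueue<String> finished = new LinkedBlockingQueue<>();
        List<String> completionOrder = new ArrayList<>();
        try {
            // The project begins with the jobs that have no dependencies.
            for (String job : dependsOn.keySet()) {
                if (unsatisfied.get(job) == 0) {
                    dispatch(workers, finished, job);
                }
            }
            while (completionOrder.size() < dependsOn.size()) {
                String done = finished.take(); // wait for any job to complete
                completionOrder.add(done);
                for (String next : dependents.getOrDefault(done, List.of())) {
                    // Start a job once its last outstanding dependency is satisfied.
                    if (unsatisfied.merge(next, -1, Integer::sum) == 0) {
                        dispatch(workers, finished, next);
                    }
                }
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for jobs", ex);
        } finally {
            workers.shutdown();
        }
        return completionOrder;
    }

    private static void dispatch(ExecutorService workers, BlockingQueue<String> finished, String job) {
        workers.submit(() -> {
            // A real worker would run the job's program here.
            finished.add(job);
        });
    }

    public static void main(String[] args) {
        Map<String, List<String>> project = new HashMap<>();
        project.put("Top", List.of());
        project.put("Left", List.of("Top"));
        project.put("Right", List.of("Top"));
        project.put("Bottom", List.of("Left", "Right"));
        System.out.println(run(project, 2));
    }
}
```

In the diamond-shaped example, "Top" always finishes first and "Bottom" last, while "Left" and "Right" may finish in either order because they run in parallel.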

The Gantt Chart view above shows the phases the project must go through as the jobs are executed in parallel by worker computers on the network, and an estimate of the time required to finish the jobs. In the case that a project has been executed, the user can switch to another view and see the actual execution times in the chart. The user can watch the status of the jobs as they are completed on this view as well as in the Pert view. The user can also view the characteristics, inputs, and outputs for any job in this view by double clicking on a particular job.
If the project has not been run, no outputs are available and the view of actual execution times is not available. A phase is defined as the parallel execution of a collection of jobs whose input requirements have been satisfied by the execution of the jobs in the previous phase. As you can see from the chart above, the jobs in Phase 2 must run before the jobs in Phase 3 are started.
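
One plausible way to derive phases from the dependency graph is sketched below. It is an interpretation rather than the framework's documented algorithm: a job with no dependencies is placed in Phase 1, and every other job is placed one phase after the latest of the jobs it depends on. The class name PhasePlanner is hypothetical, and the graph is assumed to be acyclic.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: assigns each job the earliest phase in which it can run.
class PhasePlanner {
    // dependsOn maps each job to the jobs it depends on.
    // Assumes no loops; a loop would cause unbounded recursion.
    static Map<String, Integer> phases(Map<String, List<String>> dependsOn) {
        Map<String, Integer> phase = new HashMap<>();
        for (String job : dependsOn.keySet()) {
            phaseOf(job, dependsOn, phase);
        }
        return phase;
    }

    private static int phaseOf(String job, Map<String, List<String>> dependsOn,
                               Map<String, Integer> phase) {
        Integer known = phase.get(job);
        if (known != null) {
            return known; // already computed
        }
        int result = 1; // jobs without dependencies run in Phase 1
        for (String upstream : dependsOn.getOrDefault(job, List.of())) {
            result = Math.max(result, phaseOf(upstream, dependsOn, phase) + 1);
        }
        phase.put(job, result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> project = new HashMap<>();
        project.put("Top", List.of());
        project.put("Left", List.of("Top"));
        project.put("Right", List.of("Top"));
        project.put("Bottom", List.of("Left", "Right"));
        System.out.println(phases(project));
    }
}
```

Jobs that share a phase number have no dependencies on one another and are candidates for parallel execution.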

The Connections List view above shows the connections or 'relationships' between the SOA objects or 'transformations'. As you can see 'Job One' is connected to 'Job1', 'Job14', 'Job15', 'Job2', and 'Job3'. The tables at the bottom of the view shown above allow the user to make connections between jobs using a 'drag and drop' technique. If the user selects a job in the lower left table, the list on the lower right will contain the jobs that are eligible to be connected to. The user can then drag the selected item on the left to the eligible item on the right to make the connection between the jobs.
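
The list of jobs that are eligible to be connected to can be computed from the existing connections. The sketch below assumes one possible rule, since the page does not state it: a target is eligible if it is not the selected job itself, is not already connected to it, and would not close a loop. The class ConnectionEligibility is hypothetical, and dragging is assumed to create a connection from the selected job to the target.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of how the right-hand list might be filled.
class ConnectionEligibility {
    // downstream maps every job to the jobs it is already connected to.
    static List<String> eligibleTargets(Map<String, Set<String>> downstream, String selected) {
        Set<String> alreadyConnected = downstream.getOrDefault(selected, Set.of());
        List<String> eligible = new ArrayList<>();
        for (String candidate : downstream.keySet()) {
            if (candidate.equals(selected)) continue;             // no self connections
            if (alreadyConnected.contains(candidate)) continue;   // connection already exists
            if (reaches(downstream, candidate, selected)) continue; // would close a loop
            eligible.add(candidate);
        }
        Collections.sort(eligible);
        return eligible;
    }

    // True if 'to' can be reached from 'from' by following connections.
    static boolean reaches(Map<String, Set<String>> downstream, String from, String to) {
        Deque<String> stack = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        stack.push(from);
        while (!stack.isEmpty()) {
            String job = stack.pop();
            if (job.equals(to)) return true;
            if (seen.add(job)) {
                for (String next : downstream.getOrDefault(job, Set.of())) {
                    stack.push(next);
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> project = Map.of(
                "Job One", Set.of("Job1"),
                "Job1", Set.of("Job2"),
                "Job2", Set.of(),
                "Job3", Set.of());
        System.out.println(eligibleTargets(project, "Job2"));
    }
}
```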

The Job Properties view, shown above, displays a table of the jobs in the project and some of their properties. Double clicking on a job will cause a dialog showing the properties of the job to appear. The user can change the name of the job, the wait type of the job, the estimated time of the job, etc. The user can also view and edit the input and output properties of the job, and the name of the actual physical program, batch file, etc. that will be run when the job is executed by the server at run time.

The Batch Run Control Panel above shows the projects that have been submitted to the master server in the top list. The bottom list is used to compose batch runs by selecting projects from a project library and adding them to the list. Once the list is complete, the user clicks the 'Submit' button. The project objects in the bottom list are then added to the list of projects submitted to the master computer in the top list. The user can then monitor the progress of the projects in the batch queue by viewing the Worker and Batch Run Status dialog.

Configuring a network of parallel computers

Any number of machines can be worker computers in the parallel network. To get more computational power, simply configure a new worker computer and connect it to the network. When the worker starts up, it contacts its parent master server computer, reports that it is ready to work, and describes its processing power and other characteristics so that the master server can take them into account when scheduling workers. Workers can join or leave the parallel network even while a project is running. The master computer dynamically adjusts its running strategy to cope with more or fewer workers. There must, however, be at least one worker available on the network to run a project. A project has a theoretical optimal number of workers, and the master server uses this as a guide when assigning workers. The master server, worker, and console can exist on a single computer if desired. The console, master server, and workers can also be anywhere on the network and even at separate remote locations on the internet. There can theoretically be multiple master servers running multiple dependent projects on a very large parallel network, perhaps with thousands of workers all over the world, controlled by a single remote console.
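
How a master server might use reported processing power when workers join and leave can be sketched as follows. The strategy shown, a greedy assignment that gives the longest remaining job to the worker expected to finish it earliest, is a stand-in chosen for illustration and is not claimed to be the framework's distribution strategy. The class WorkerRegistry, its methods, and the power and time units are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: workers register with a relative processing power,
// and independent jobs are spread across whichever workers are present.
class WorkerRegistry {
    private final Map<String, Double> powerByWorker = new LinkedHashMap<>();

    void join(String worker, double power) {
        if (power <= 0) {
            throw new IllegalArgumentException("power must be positive");
        }
        powerByWorker.put(worker, power);
    }

    void leave(String worker) {
        powerByWorker.remove(worker);
    }

    // estimatedTimeByJob: estimated run time of each job on a worker of power 1.0.
    // Returns worker -> jobs in the order that worker would run them.
    Map<String, List<String>> assign(Map<String, Double> estimatedTimeByJob) {
        if (powerByWorker.isEmpty()) {
            throw new IllegalStateException("at least one worker is required");
        }
        List<Map.Entry<String, Double>> jobs = new ArrayList<>(estimatedTimeByJob.entrySet());
        // Longest jobs first; ties broken by name so the plan is deterministic.
        jobs.sort((a, b) -> {
            int byTime = Double.compare(b.getValue(), a.getValue());
            return byTime != 0 ? byTime : a.getKey().compareTo(b.getKey());
        });
        Map<String, Double> busyUntil = new LinkedHashMap<>();
        Map<String, List<String>> plan = new LinkedHashMap<>();
        for (String worker : powerByWorker.keySet()) {
            busyUntil.put(worker, 0.0);
            plan.put(worker, new ArrayList<>());
        }
        for (Map.Entry<String, Double> job : jobs) {
            String best = null;
            double bestFinish = Double.MAX_VALUE;
            for (String worker : powerByWorker.keySet()) {
                double finish = busyUntil.get(worker) + job.getValue() / powerByWorker.get(worker);
                if (finish < bestFinish) {
                    bestFinish = finish;
                    best = worker;
                }
            }
            busyUntil.put(best, bestFinish);
            plan.get(best).add(job.getKey());
        }
        return plan;
    }

    public static void main(String[] args) {
        WorkerRegistry registry = new WorkerRegistry();
        registry.join("fast", 2.0);
        registry.join("slow", 1.0);
        Map<String, Double> jobs = Map.of("j1", 4.0, "j2", 2.0, "j3", 2.0);
        System.out.println(registry.assign(jobs));
        registry.leave("slow"); // the plan is simply recomputed with fewer workers
        System.out.println(registry.assign(jobs));
    }
}
```

Recomputing the plan whenever a worker joins or leaves is the simplest way to model the dynamic adjustment described above; a real scheduler would also have to account for jobs that are already running.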