Makeflow is Copyright (C) 2009 The University of Notre Dame. This software is distributed under the GNU General Public License. See the file COPYING for details.
You can run a Makeflow on your local machine to test it out. If you have a multi-core machine, then you can run multiple tasks simultaneously. If you have a Condor pool or a Sun Grid Engine batch system, then you can send your jobs there to run. If you don't already have a batch system, Makeflow comes with a system called Work Queue that will let you distribute the load across any collection of machines, large or small.
Makeflow is part of the Cooperating Computing Tools. You can download the CCTools from this web page, follow the installation instructions, and you are ready to go.
Makeflow attempts to generate all of the target files in a script. It examines all of the rules and determines which rules must run before others. Where possible, it runs commands in parallel to reduce the execution time.
Here is a Makeflow that uses the convert utility to make an animation. It downloads an image from the web, creates four variations of the image, and then combines them back together into an animation. The first and the last task are marked as LOCAL to force them to run on the controlling machine.
CURL=/usr/bin/curl
CONVERT=/usr/bin/convert
URL=http://www.cse.nd.edu/~ccl/images/capitol.jpg

capitol.montage.gif: capitol.jpg capitol.90.jpg capitol.180.jpg capitol.270.jpg capitol.360.jpg
    LOCAL $CONVERT -delay 10 -loop 0 capitol.jpg capitol.90.jpg capitol.180.jpg capitol.270.jpg capitol.360.jpg capitol.270.jpg capitol.180.jpg capitol.90.jpg capitol.montage.gif

capitol.90.jpg: capitol.jpg $CONVERT
    $CONVERT -swirl 90 capitol.jpg capitol.90.jpg

capitol.180.jpg: capitol.jpg $CONVERT
    $CONVERT -swirl 180 capitol.jpg capitol.180.jpg

capitol.270.jpg: capitol.jpg $CONVERT
    $CONVERT -swirl 270 capitol.jpg capitol.270.jpg

capitol.360.jpg: capitol.jpg $CONVERT
    $CONVERT -swirl 360 capitol.jpg capitol.360.jpg

capitol.jpg: $CURL
    LOCAL $CURL -o capitol.jpg $URL

Note that Makeflow differs from Make in a few important ways. Read section 4 below to get all of the details.
% makeflow example.makeflow

Note that if you run it a second time, nothing will happen, because all of the files are built:
% makeflow example.makeflow
makeflow: nothing left to do

Use the -c option to clean everything up before trying it again:
% makeflow -c example.makeflow

If you have access to a batch system running SGE, then you can direct Makeflow to run your jobs there:
% makeflow -T sge example.makeflow

Or, if you have a Condor pool, then you can direct Makeflow to run your jobs there:
% makeflow -T condor example.makeflow

To submit Makeflow as a Condor job that submits more Condor jobs:
% condor_submit_makeflow example.makeflow

You will notice that a workflow can run very slowly if you submit each batch job to SGE or Condor, because it typically takes 30 seconds or so to start each batch job running. To get around this limitation, we provide the Work Queue system. This allows Makeflow to function as a master process that quickly dispatches work to remote worker processes.
To begin, let's assume that you are logged into a machine named barney.nd.edu. Start your Makeflow like this:
% makeflow -T wq example.makeflow

Then, submit 10 worker processes to Condor like this:
% condor_submit_workers barney.nd.edu 9123 10
Submitting job(s)..........
Logging submit event(s)..........
10 job(s) submitted to cluster 298.

Or, submit 10 worker processes to SGE like this:
% sge_submit_workers barney.nd.edu 9123 10

Or, submit 10 worker processes to Torque like this:
% torque_submit_workers barney.nd.edu 9123 10

Or, you can start workers manually on any other machine you can log into:
% work_queue_worker barney.nd.edu 9123

Once the workers begin running, Makeflow will dispatch multiple tasks to each one very quickly. If a worker should fail, Makeflow will retry the work elsewhere, so it is safe to submit many workers to an unreliable system.
When the Makeflow completes, your workers will still be available, so you can either run another Makeflow with the same workers, remove them from the batch system, or wait for them to expire. If you do nothing for 15 minutes, they will automatically exit.
Note that condor_submit_workers, sge_submit_workers, and torque_submit_workers are simple shell scripts, so you can edit them directly if you would like to change batch options or other details.
MEMORY=100

one: src
    cmd

CATEGORY="simulation"
DISK=500

CATEGORY="preprocessing"
MEMORY=200
DISK=200

two: src
    cmd

three: src
    cmd

CATEGORY="simulation"
DISK=700

four: src
    cmd

five: src
    @CATEGORY="preprocessing"
    cmd

six: src
    cmd

CATEGORY="analysis"
CORES=1
MEMORY=400
DISK=400

seven: src
    cmd
export CORES=4
makeflow ...
| Category | Cores | Memory (MB) | Disk (MB) |
Note that in this example, in job five, the CATEGORY temporarily has the value of "preprocessing", and then it is re-established to "simulation" (so that four and six belong to the same category). This is done using rule lexical scoping, explained in the next section.
# This is an incorrect rule.

output.txt:
    ./mysim.exe -c calib.data -o output.txt

However, the following is correct, because the rule states all of the files needed to run the simulation. Makeflow will use this information to construct a batch job that consists of mysim.exe and calib.data and uses it to produce output.txt:
# This is a correct rule.

output.txt: mysim.exe calib.data
    ./mysim.exe -c calib.data -o output.txt

When a regular file is specified as an input file, it means the command relies on the contents of that file. When a directory is specified as an input file, however, it could mean one of two things: either the command depends on the contents inside the directory, or the command relies merely on the existence of the directory (for example, because the command will add more files into the directory later, and it does not matter what is already in it). Makeflow assumes that an input directory indicates that the command relies only on the directory's existence.
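For instance, a rule might list a directory as an input merely so that the directory exists before the command deposits a file into it. This is a sketch only; the names mydata and generate.sh are hypothetical:

```
mydata/result.txt: mydata generate.sh
    ./generate.sh > mydata/result.txt
```

Here Makeflow only requires that the mydata directory exist before the rule runs; it does not track the directory's contents.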
SOME_VARIABLE=original_value

#rule 1
target_1: source_1
    command_1

#rule 2
target_2: source_2
    @SOME_VARIABLE=local_value_for_2
    command_2

#rule 3
target_3: source_3
    command_3

In this example, SOME_VARIABLE has the value 'original_value' for rules 1 and 3, and the value 'local_value_for_2' for rule 2.
When using Condor, this string will be added to each submit file. For example, if you want to add Requirements and Rank lines to your Condor submit files, add this to your Makeflow:
BATCH_OPTIONS = Requirements = (Memory>1024)
When using SGE, the string will be added to the qsub options. For example, to specify that jobs should be submitted to the devel queue:
BATCH_OPTIONS = -q devel
local_name->remote_name

indicates that the file local_name is called remote_name in the remote system. Consider the following example:
b.out: a.in myprog
    LOCAL myprog a.in > b.out

c.out->out: a.in->in1 b.out myprog->prog
    prog in1 b.out > out

The first rule runs locally, using the executable myprog and the local file a.in to locally create b.out. The second rule runs remotely, but the remote system expects a.in to be named in1, c.out to be named out, and so on. Note that we did not need to rename the file b.out. Without remote file renaming, we would have to create either a symbolic link or a copy of the files with the expected names.
% makeflow -D example.makeflow | dot -T gif > example.gif
% makeflow -D example.makeflow | grep '^N[0-9]\+ \[label=' | wc -l
% makeflow -T wq -p 9567 example.makeflow
% makeflow -T wq -a -N MyProj example.makeflow

The -N option gives the master a project name, 'MyProj'. The -a option enables catalog mode on the master. Only in catalog mode does a master advertise its information, such as the project name, running status, hostname, and port number, to a catalog server. A worker can then retrieve this information from the same catalog server. The above command uses the default catalog server at Notre Dame, which runs 24/7. We will talk about how to set up your own catalog server later.
To start a worker that automatically finds MyProj's master via the default Notre Dame catalog server:
% work_queue_worker -a -N MyProj

The '-a' option enables catalog mode on the worker, which tells the worker to contact a catalog server to find the hostname and port of the project specified by the -N option.
You can also give multiple -N options to a worker. The worker will find out from the catalog server which of the specified projects are running and randomly select one to work for. When that project is done, the worker repeats this process. Thus, the worker can move on to a different master without being stopped and given the new master's hostname and port. An example of specifying multiple projects:
% work_queue_worker -a -N proj1 -N proj2 -N proj3
% makeflow --password mypwfile ...
% work_queue_worker --password mypwfile ...
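The password file itself is simply a file containing a shared secret string. A minimal sketch of creating one (the secret shown here is made up):

```shell
# Create a password file holding the shared secret, readable only by you.
echo "my-shared-secret" > mypwfile
chmod 600 mypwfile
```

Pass the same file to both the master and the workers so that they can authenticate each other.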
Now let's look at how to set up your own catalog server. Say you want to run your catalog server on a machine named catalog.somewhere.edu. The default port that the catalog server listens on is 9097; you can change it via the '-p' option.
% catalog_server

Now you have a catalog server listening at catalog.somewhere.edu:9097. To make your masters and workers contact this catalog server, simply add the '-C hostname:port' option to both your master and worker:
% makeflow -T wq -C catalog.somewhere.edu:9097 -N MyProj example.makeflow
% work_queue_worker -C catalog.somewhere.edu:9097 -a -N MyProj
For clusters using managers similar to SGE or Moab that are configured to preclude the use of Work Queue, we have the "Cluster" custom driver. To use the "Cluster" driver, the Makeflow must be run in a parallel filesystem available to the entire cluster, and the following environment variables must be set.
% $SUBMIT_COMMAND $SUBMIT_OPTIONS "<rule name>" $CLUSTER_NAME.wrapper "<rule commandline>"

The wrapper script is a shell script that takes the command to be run as an argument and handles the bookkeeping operations necessary for Makeflow.
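To make the role of the wrapper concrete, here is a minimal, hypothetical sketch of one. A real $CLUSTER_NAME.wrapper performs additional bookkeeping; the file names below are made up for illustration:

```shell
# Write a toy wrapper that runs the quoted command line it receives
# as its single argument and records the exit status in a side file.
cat > sketch.wrapper <<'EOF'
#!/bin/sh
# $1 is the entire rule command line, passed as one quoted argument.
sh -c "$1"
status=$?
echo "$status" > sketch.wrapper.status
exit $status
EOF
chmod +x sketch.wrapper

# Exercise it with a trivial command line.
./sketch.wrapper "echo hello > hello.txt"
```

After it runs, hello.txt contains the command's output and sketch.wrapper.status records an exit status of 0.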
As the workflow execution progresses, Makeflow can automatically delete intermediate files that are no longer needed. In this context, an intermediate file is an input of some rule that is the target of another rule. Therefore, by default, garbage collection does not delete the original input files, nor final target files.
Which files are deleted can be tailored from the default by appending files to the Makeflow variables GC_PRESERVE_LIST and GC_COLLECT_LIST. Files added to GC_PRESERVE_LIST are never deleted, thus it is used to mark intermediate files that should not be deleted. Similarly, GC_COLLECT_LIST marks final target files that should be deleted. Makeflow is conservative, in the sense that GC_PRESERVE_LIST takes precedence over GC_COLLECT_LIST, and original input files are never deleted, even if they are listed in GC_COLLECT_LIST.
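For instance, the lists can be set like ordinary Makeflow variables. This is a sketch; the file names are hypothetical:

```
GC_PRESERVE_LIST=checkpoint.dat
GC_COLLECT_LIST=final_report.log
```

Here checkpoint.dat is an intermediate file that should survive garbage collection, while final_report.log is a target file that may be deleted once the workflow no longer needs it.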
Makeflow offers two modes for garbage collection: reference count, and on demand. In reference count mode, intermediate files are deleted as soon as no rule lists them as an input. The on-demand mode is similar to reference count, except that files are deleted only as needed to keep the space consumed on the local file system below a given threshold.
To activate reference count garbage collection:
To activate on-demand garbage collection, with a threshold of 500MB:
makeflow -gon_demand -G500000000
In Makeflow, each task is called a node. You can think of it as a node in the tree structure of the target workflow. There is a unique id, a positive integer, associated with each node; this id is referred to as the node id. The Makeflow log file starts with a list of node (task) descriptions, where each task is specified in the following format:
# NODE node_id original_command
# SYMBOL node_id symbol
# PARENTS node_id parent_node_id_1 parent_node_id_2 ... parent_node_id_n
# SOURCES node_id source_file_1 source_file_2 ... source_file_n
# TARGETS node_id target_file_1 target_file_2 ... target_file_n
# COMMAND node_id translated_command

These annotation lines all start with '#' and will be treated as comments by Makeflow. When Makeflow tries to resume the execution of a Makeflow script from an existing Makeflow log file, it skips over these commented lines. Also, because the annotations are commented out, third-party applications such as gnuplot can skip them when plotting data.
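Because the descriptions are plain '#'-prefixed lines, they are easy to inspect with standard tools. For example, the number of tasks in a log can be found by counting the '# NODE' lines. This sketch fabricates a two-task log header purely for illustration:

```shell
# Write two made-up node description lines to a sample log file.
cat > sample.makeflowlog <<'EOF'
# NODE 0 /usr/bin/convert -swirl 90 capitol.jpg capitol.90.jpg
# NODE 1 /usr/bin/curl -o capitol.jpg http://www.cse.nd.edu/~ccl/images/capitol.jpg
# PARENTS 0 1
EOF

# Count one '# NODE' line per task; this prints 2.
grep -c '^# NODE' sample.makeflowlog
```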
The body sections of a Makeflow log file come after the node specification section. The following is a snippet from example.makeflow.makeflowlog:
# STARTED timestamp
1347281321284638 5 1 9206 5 1 0 0 0 6
1347281321348488 5 2 9206 5 0 1 0 0 6
1347281321348760 4 1 9207 4 1 1 0 0 6
1347281321348958 3 1 9208 3 2 1 0 0 6
1347281321629802 4 2 9207 3 1 2 0 0 6
1347281321630005 2 1 9211 2 2 2 0 0 6
1347281321635236 3 2 9208 2 1 3 0 0 6
1347281321635463 1 1 9212 1 2 3 0 0 6
1347281321742870 2 2 9211 1 1 4 0 0 6
1347281321752857 1 2 9212 1 0 5 0 0 6
1347281321753064 0 1 9215 0 1 5 0 0 6
1347281325731146 0 2 9215 0 0 6 0 0 6
# COMPLETED timestamp

If a Makeflow execution completes without errors, a body section of the log consists of a series of node (task) status change lines (the 10-column lines) surrounded by a pair of "# STARTED" and "# COMPLETED" lines, which record the start and end unix time of that execution. A node status change line has the following form:
timestamp node_id new_state job_id nodes_waiting nodes_running nodes_complete nodes_failed nodes_aborted node_id_counter

Here is the column specification:
# ABORTED timestamp
# FAILED timestamp

When file garbage collection is enabled, the log file records information at each collection cycle. Collection information is included in lines starting with the '# GC' prefix:
# GC timestamp collected time_spent dag_gc_collected

Each garbage collection line records the garbage collection statistics during a garbage collection cycle:
% makeflow -b some_output_directory example.makeflow