DISC Tutorial B - The Makeflow Workflow System

Prerequisites:
  • Familiarity with basic Unix/Linux commands.
  • Ability to use a text editor to create and modify text files.
  • Completed Lecture 4 in the DISC online course.
  • To complete Tutorial B, you will need access to a Linux cluster equipped with a batch system such as HTCondor, PBS, Torque, SGE, or another system supported by Makeflow. If you don't have a cluster, you can run the simple examples on a single server or desktop running Linux.

    Setup for Notre Dame

    1. If using the wireless network, make sure that you are using the eduroam network and not the ND-Guest network.
    2. Connect to a CRC front end node. If you are using Linux or Mac, just open up a terminal and use ssh:
      ssh USERNAME@crcfe01.crc.nd.edu
      
      If you are using a Windows machine, download and install PuTTY and use that to connect to the host condorfe.crc.nd.edu.
    3. Once logged in, you will need to install the Makeflow software in your home directory. The simplest way is to check out the source code and build it, which should only take a minute:
      git clone https://github.com/cooperative-computing-lab/cctools cctools-src
      cd cctools-src
      ./configure
      make
      make install
      
      The software is now installed in $HOME/cctools. To use it directly, you will need to add it to your path. Use the first command if your shell is bash-compatible (bash, sh, zsh), or the second if it is csh/tcsh: (if one fails, just try the other)
      export PATH=$HOME/cctools/bin:$PATH
      setenv PATH $HOME/cctools/bin:$PATH
      
    4. Now, check that makeflow is in your path before proceeding:
      makeflow -v
      
    5. Finally, add the SGE commands to your path. Again, use whichever command works:
      export PATH=/opt/sge/bin/lx-amd64:$PATH
      setenv PATH /opt/sge/bin/lx-amd64:$PATH
      
      And check that you can run qstat:
      qstat
      

    Simple Example

    Let's begin by using Makeflow to run a handful of simulation codes. First, make and enter a clean directory to work in:
    cd $HOME
    mkdir tutorial
    cd tutorial
    
    Now, download this program, which performs a highly "sophisticated" simulation of black holes colliding together:
    wget http://www.nd.edu/~dthain/courses/disc/tutorialB/simulation.py
    
    Try running it once, just to see what it does:
    chmod 755 simulation.py
    ./simulation.py 5
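
    The tutorial's simulation.py is only reachable from the course page. If the download is unavailable, a stand-in with the same command-line shape (one integer run number, text on stdin, results on stdout) will let you follow along. The sketch below is a guess at that interface, not the actual course script:

```python
#!/usr/bin/env python3
# sim_standin.py - hypothetical stand-in for the course's simulation.py.
# Takes one integer argument (the run number) and prints a few lines of
# fake "simulation" output. It ignores stdin, so the makeflow redirection
# from input.txt still works.
import sys

def simulate(run):
    lines = []
    for step in range(3):
        # fake trajectory: the black holes close in as the steps advance
        lines.append("run %d step %d: black holes at distance %d"
                     % (run, step, 100 - 40 * step))
    lines.append("run %d: merger complete" % run)
    return lines

if __name__ == "__main__":
    run = int(sys.argv[1]) if len(sys.argv) > 1 else 0
    for line in simulate(run):
        print(line)
```

    If you use it, chmod 755 the file and substitute its name wherever the examples below name simulation.py.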
    
    Now, let's use Makeflow to run several simulations. Create a file called example.makeflow and paste the following text into it. Each rule names a target file, then a colon, then the files it depends on, followed by a tab-indented command that produces the target:
    input.txt:
    	LOCAL /bin/echo "Simulate Black Holes" > input.txt
    
    output.1: simulation.py input.txt
    	./simulation.py 1 < input.txt > output.1
    
    output.2: simulation.py input.txt
    	./simulation.py 2 < input.txt > output.2
    
    output.3: simulation.py input.txt
    	./simulation.py 3 < input.txt > output.3
    
    output.4: simulation.py input.txt
    	./simulation.py 4 < input.txt > output.4
    
    To run it on your local machine, one job at a time, do this:
    makeflow -j 1 example.makeflow
    
    Note that if you run it a second time, nothing will happen, because all of the files are built:
    makeflow example.makeflow
    makeflow: nothing left to do
    
    Use the -c option to clean everything up before trying it again:
    makeflow -c example.makeflow
    
    Of course, you are running on a machine with multiple cores. If you leave out the -j option, then makeflow will run as many jobs as you have cores:
    makeflow example.makeflow
    
    If the jobs are expected to be long-running, then you can dispatch them to a local batch system such as SGE, Condor, or Torque by using the appropriate command:
    makeflow -T sge example.makeflow
    makeflow -T condor example.makeflow
    makeflow -T torque example.makeflow
    ...
    
    After that completes, examine the output files (output.1 and so on), and you will notice that each job ran on a different machine in the cluster.
    Answer these questions using what you have learned so far:
    1. How long did the workflow take when using a single core? Why?
    2. How much faster did it run when using all cores? Why?
    3. Did it run faster or slower when using the cluster? Why?

    Running Makeflow with Work Queue

    Sometimes, submitting jobs individually to a batch system is not convenient: each job can wait a long time in the queue before receiving service, or you may not have a batch system set up at all. Instead, you can use the Work Queue system to run the jobs. To do this, first start makeflow in Work Queue (wq) mode, using -p 0 to have it listen on any available port:
    makeflow -c example.makeflow
    makeflow -T wq example.makeflow -p 0
    listening for workers on port XXXX.
    ...
    
    You are going to need to have two terminals open at once for the next step, so open up another terminal (or PuTTY session) and line it up next to your first one. (You may have to set your PATH again as noted above.) Then, in the new terminal, start a worker using the same port number:
    work_queue_worker localhost XXXX
    
    Go back to your first shell and observe that the makeflow has finished. Your worker process will stay there for a few minutes until it is sure that Makeflow has finished. Use Control-C to forcibly kill the worker, if you have to.
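
    To see why a single worker is enough to drain the whole queue, it helps to picture the master/worker pattern underneath: the makeflow process listens on a port, and each worker that connects pulls tasks until none remain. The toy below is a conceptual sketch of that handshake only; it is not the real Work Queue protocol, which also ships input and output files between master and worker and manages many workers at once:

```python
#!/usr/bin/env python3
# Toy master/worker exchange over a socket, illustrating the idea behind
# Work Queue. The master listens, a worker connects, and tasks flow one at
# a time until the master says "done".
import socket
import threading

def run_worker(port):
    # connect to the master and process tasks until told to stop
    sock = socket.create_connection(("127.0.0.1", port))
    lines = sock.makefile("r")
    for line in lines:
        cmd = line.strip()
        if cmd == "done":
            break
        # "run" the task: here we just echo a fake result back
        sock.sendall(("finished: " + cmd + "\n").encode())
    sock.close()

def run_master(tasks):
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))   # port 0: let the OS choose, like makeflow -p 0
    srv.listen(1)
    port = srv.getsockname()[1]
    print("listening for workers on port %d" % port)
    w = threading.Thread(target=run_worker, args=(port,))
    w.start()
    conn, _ = srv.accept()
    rfile = conn.makefile("r")
    results = []
    for t in tasks:
        conn.sendall((t + "\n").encode())      # dispatch one task
        results.append(rfile.readline().strip())  # wait for its result
    conn.sendall(b"done\n")                    # tell the worker to exit
    w.join()
    conn.close()
    srv.close()
    return results

if __name__ == "__main__":
    print(run_master(["simulate 1", "simulate 2", "simulate 3"]))
```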

    Of course, remembering port numbers all the time gets old fast, so try the same thing again, but use the -N option to give makeflow and the worker a matching project name. (Replace MYPROJECT with a name of your choice.)

    makeflow -c example.makeflow
    makeflow -T wq example.makeflow -N MYPROJECT
    listening for workers on port XXXX
    ...
    
    Now open up another shell and run your worker with a project name:
    work_queue_worker -N MYPROJECT
    
    When using a project name, your workflow is advertised to the catalog server, and can be viewed using work_queue_status:
    work_queue_status
    
    Answer these questions using what you have learned so far:
    1. What is the largest/smallest Work Queue application currently running?
    2. Did your workflow run faster or slower on the cluster or via the worker? Why?

    Running Workers on the Cluster via SGE

    Of course, we don't really want to run workers on the head node, which would quickly become overloaded with running jobs. Let's instead start five workers on the cluster using SGE:
    sge_submit_workers -N MYPROJECT 5
    Creating worker submit scripts in dthain-workers...
    Your job 18728 ("worker.sh") has been submitted
    Your job 18729 ("worker.sh") has been submitted
    Your job 18730 ("worker.sh") has been submitted
    Your job 18731 ("worker.sh") has been submitted
    Your job 18732 ("worker.sh") has been submitted
    
    Use the qstat command to observe that they are submitted (and possibly running):
    qstat -u $USER
    job-ID     prior   name       user         state submit/start at     queue                      
    ------------------------------------------------------------------------------------------------
         18728 100.49976 worker.sh  dthain       r     06/02/2016 12:04:45 long@d6copt172.crc.nd.edu
         18729 100.49976 worker.sh  dthain       r     06/02/2016 12:04:47 long@d6copt184.crc.nd.edu
         18730 100.49976 worker.sh  dthain       r     06/02/2016 12:04:47 long@d6copt025.crc.nd.edu
         18731 100.49976 worker.sh  dthain       r     06/02/2016 12:04:48 long@d6copt025.crc.nd.edu
         18732 100.49976 worker.sh  dthain       r     06/02/2016 12:04:48 long@dqcneh084.crc.nd.edu
    
    Now, restart your Makeflow and it will use the workers already running in SGE:
    makeflow -c example.makeflow
    makeflow -T wq example.makeflow -N MYPROJECT
    listening for workers on port XXXX.
    ...
    
    You can leave the workers running there, if you want to start another Makeflow. (Try cleaning up and running again right now.) They will remain until they have been idle for fifteen minutes, then will stop automatically.

    If you add the -d all option to Makeflow, it will display debugging information that shows where each task was sent, when it was returned, and so forth:

    makeflow -c example.makeflow
    makeflow -T wq example.makeflow -N MYPROJECT -d all
    listening for workers on port XXXX.
    

    (Alternate) Running Workers on the Cluster via Condor

    Of course, we don't really want to run workers on the head node, so let's instead start five workers using Condor:
    condor_submit_workers -N MYPROJECT 5
    Creating worker submit scripts in dthain-workers...
    Submitting job(s).....
    5 job(s) submitted to cluster 258192.
    
    Use the condor_q command to observe that they are submitted to Condor:
    condor_q
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD       
    258192.0   dthain          5/31 16:03   0+00:00:12 R  0   0.7  work_queue_worker 
    258192.1   dthain          5/31 16:03   0+00:00:12 R  0   0.7  work_queue_worker 
    258192.2   dthain          5/31 16:03   0+00:00:12 R  0   0.7  work_queue_worker 
    258192.3   dthain          5/31 16:03   0+00:00:12 R  0   0.7  work_queue_worker 
    258192.4   dthain          5/31 16:03   0+00:00:11 R  0   0.7  work_queue_worker
    
    Now, restart your Makeflow and it will use the workers already running in Condor:
    makeflow -c example.makeflow
    makeflow -T wq example.makeflow -N MYPROJECT
    listening for workers on port XXXX.
    ...
    
    You can leave the workers running there, if you want to start another Makeflow. They will remain until they have been idle for fifteen minutes, then will stop automatically.

    If you add the -d all option to Makeflow, it will display debugging information that shows where each task was sent, when it was returned, and so forth:

    makeflow -c example.makeflow
    makeflow -T wq example.makeflow -N MYPROJECT -d all
    listening for workers on port XXXX.
    

    Homework Assignment

    Check out the Makeflow Examples Repository and look closely at the BWA example workflow. Set up a "Medium" sized run of BWA, then compare the performance of running it locally and on your cluster. How much faster does the workflow run on your cluster? Why?

    For More Information

    See the Makeflow Web Page for a complete user's manual, man pages, example workflows, and research papers.