DISC Tutorial C - Concurrent Programming with Work Queue

Prerequisites:

  • Familiarity with basic Unix/Linux commands.
  • Ability to use a text editor to create and modify text files.
  • Completed Lecture 5 in the DISC online course.
  • Setup for Notre Dame

    1. If using the wireless network, make sure that you are using the eduroam network and not the ND-Guest network.
    2. Connect to a CRC front end node. If you are using Linux or Mac, just open up a terminal and use ssh:
      ssh USERNAME@crcfe01.crc.nd.edu
      
      If you are using a Windows machine, download and install PuTTY and use that to connect to the host condorfe.crc.nd.edu.
    3. Add the HTCondor software to your path, using the command that matches your shell (setenv for csh/tcsh, export for bash/sh):
      setenv PATH /afs/crc.nd.edu/user/condor/software/bin:$PATH
      export PATH=/afs/crc.nd.edu/user/condor/software/bin:$PATH
      

    Install the Software

    First, you will need to install the cctools software in your home directory, if you haven't done so already. The simplest way is to check out the source code and build it, which should only take a minute:
    git clone https://github.com/cooperative-computing-lab/cctools cctools-src
    cd cctools-src
    ./configure
    make
    make install
    
    The software is now installed in $HOME/cctools. To use it directly, add it to your path with the command that matches your shell: export for bash/sh, setenv for csh/tcsh. (If one fails, just try the other.)
    export PATH=$HOME/cctools/bin:$PATH
    setenv PATH $HOME/cctools/bin:$PATH
    
    Now, check that work_queue_status is in your path before proceeding:
    work_queue_status
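
If the command is not found, it can help to see exactly which directory (if any) on your PATH provides each tool. The following is a minimal sketch using only the Python standard library; the helper name find_on_path is an assumption made for illustration and is not part of cctools.

```python
import os


def find_on_path(program, path=None):
    """Return the full path of an executable found on PATH, or None.

    Hypothetical helper for illustration; it mimics what the shell does
    when it looks up a command name.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, program)
        # Accept only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


# Report where the two tools used in this tutorial would be run from.
for tool in ("work_queue_status", "work_queue_worker"):
    print("%-20s %s" % (tool, find_on_path(tool) or "NOT FOUND"))
```

If either tool is reported as NOT FOUND, repeat the PATH setup above in the current window.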
    

    Run the Example Locally

    Download the example Work Queue application in your favorite language:
    C:      wget http://ccl.cse.nd.edu/software/manuals/work_queue_example.c
    Python: wget http://ccl.cse.nd.edu/software/manuals/work_queue_example.py
    Perl:   wget http://ccl.cse.nd.edu/software/manuals/work_queue_example.pl
    
    Depending on your language, you will need to either compile the program (C) or set an environment variable (Python and Perl). The setenv commands below are for csh/tcsh; in bash, use export NAME=value instead:
    C:      gcc work_queue_example.c -o work_queue_example -I${HOME}/cctools/include/cctools -L${HOME}/cctools/lib -lwork_queue -ldttools -lm
    Python: setenv PYTHONPATH ${HOME}/cctools/lib/python2.7/site-packages
    Perl:   setenv PERL5LIB ${HOME}/cctools/lib/perl5/site_perl
    
    The example program simply runs gzip on whatever files you give on the command line, so run it like this:
    C:      ./work_queue_example *
    Python: python work_queue_example.py *
    Perl:   perl work_queue_example.pl *
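
To make "runs gzip on whatever files you give" concrete, the sketch below builds one shell command per input file, roughly the way a task description could be assembled before it is handed to Work Queue. It uses only the Python standard library. The helper name gzip_task_command and the exact command string are assumptions for illustration, not code copied from work_queue_example.py.

```python
import sys


def gzip_task_command(infile):
    """Return (command, outfile) for compressing one input file.

    Hypothetical helper: the name and the command layout are illustrative.
    """
    outfile = "%s.gz" % infile
    # Read the input on stdin and write the compressed data to outfile.
    command = "gzip < %s > %s" % (infile, outfile)
    return command, outfile


if __name__ == "__main__":
    # One command per file named on the command line (the shell expands *).
    for infile in sys.argv[1:]:
        command, outfile = gzip_task_command(infile)
        print("%s  ->  %s" % (command, outfile))
```

Each command is independent of the others, which is what allows the tasks to run concurrently on different workers. Note that this sketch does not quote file names, so names containing spaces would need extra care.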
    
    The example program listens on the default port, 9123. If another process on the same machine is already using that port, you will get the following error:
    couldn't listen on port 9123: Address already in use
    
    Whether you have this problem or not, go ahead and modify the program to use port zero instead, which lets Work Queue choose an available port:

    C Example:
    q = work_queue_create(0);
    if(!q) {
        printf("couldn't listen on port");
        return 1;
    }
    
    Python Example:
    try:
        q = WorkQueue(0)
    except:
        print "Instantiation of Work Queue failed!"
        sys.exit(1)
    
    Perl Example:
    my $q = work_queue_create(0);
    if (not defined($q)) {
        print "Instantiation of Work Queue failed!\n";
        exit 1;
    }
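
The "Address already in use" error and the port-zero remedy can both be reproduced with plain sockets, without Work Queue installed. The sketch below is an operating-system-level analogy using only the Python standard library: it asks for port 0, reads back the port that was actually assigned, and then shows that a second listener on the same port is refused. The helper name listen_on is an assumption for illustration, and Work Queue's own port selection may work differently internally.

```python
import socket


def listen_on(port):
    """Open a TCP listening socket on localhost (hypothetical helper)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", port))
    s.listen(1)
    return s


# Port 0 asks the operating system to pick any free port.
first = listen_on(0)
chosen = first.getsockname()[1]
print("listening on port %d..." % chosen)

# A second listener on the same port fails, like the fixed port 9123 can.
try:
    second = listen_on(chosen)
    second.close()
    collided = False
except socket.error as e:
    collided = True
    print("couldn't listen on port %d: %s" % (chosen, e))

first.close()
```

This is why the application must print the port it ended up with: with port zero, the number is not known until the program is running.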
    
    Now, run it again, and you should see this:
    listening on port XXXX...
    ...
    waiting for tasks to complete...
    
    Now, open up a new window (you may have to repeat your PATH setup as above) and run a single local worker:
    work_queue_worker localhost XXXX
    
    After a short delay, you should see all the tasks completing in the first window. After another short delay, the worker should stop when it realizes there is no more work to be done. You can also stop the worker by pressing Control-C.

    Running Workers on the Cluster

    Of course, we don't really want to run workers on your own computer, so let's instead start five workers on the cluster using HTCondor, or whatever batch system is available to you locally. Workers submitted to the cluster cannot know in advance which host and port your application is listening on, so we give the application a project name that the workers can use to find it. First, edit the application so that it sets a project name:

    C Example:
    q = work_queue_create(0);
    if(!q) {
        printf("couldn't listen....
        return 1;
    }
    
    work_queue_specify_name(q,"MYPROJECT");
    
    printf("listening on port %d...\n", work_queue_port(q));
    
    
    Python Example:
    try:
        q = WorkQueue(0)
    except:
        print "Instantiation of Work Queue failed!"
        sys.exit(1)

    q.specify_name("MYPROJECT")
    
    print "listening on port %d..." % q.port
    
    
    Perl Example:
    my $q = work_queue_create(0);
    if (not defined($q)) {
        print "Instantiation of Work Queue failed!\n";
        exit 1;
    }

    work_queue_specify_name($q,"MYPROJECT");

    my $port = work_queue_port($q);
    print "listening on port $port...\n";
    
    Now, submit workers to Condor with the same project name:
    condor_submit_workers -N MYPROJECT 5
    Creating worker submit scripts in /tmp/nhazekam-workers...
    Submitting job(s).....
    5 job(s) submitted to cluster 643692.
    
    Use the condor_q command to observe that they are submitted to Condor:
    condor_q -submitter $USER
    -- Submitter: nhazekam@nd.edu : <129.74.85.211:9618?... : crcfe01.crc.nd.edu
    ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
    643692.0   nhazekam       10/18 14:16   0+00:00:02 R  0   0.7  work_queue_worker
    643692.1   nhazekam       10/18 14:16   0+00:00:02 R  0   0.7  work_queue_worker
    643692.2   nhazekam       10/18 14:16   0+00:00:02 R  0   0.7  work_queue_worker
    643692.3   nhazekam       10/18 14:16   0+00:00:02 R  0   0.7  work_queue_worker
    643692.4   nhazekam       10/18 14:16   0+00:00:02 R  0   0.7  work_queue_worker
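
The two ways of starting workers in this tutorial differ only in how the worker locates the application: by host and port for the local worker, or by project name for the workers submitted to the cluster. The sketch below builds both command lines as argument lists, which is a convenient form if you later want to launch them from a script (for example with subprocess.call). The helper names are assumptions for illustration; the commands and the -N option are the ones used in this tutorial.

```python
def local_worker_command(host, port):
    """Worker that connects directly to a known host and port."""
    return ["work_queue_worker", host, str(port)]


def condor_workers_command(project, count):
    """Submit `count` workers that find the application by project name."""
    return ["condor_submit_workers", "-N", project, str(count)]


# Print the commands exactly as they would be typed at the prompt.
# 9123 stands in for whatever port your application reported.
print(" ".join(local_worker_command("localhost", 9123)))
print(" ".join(condor_workers_command("MYPROJECT", 5)))
```

The project name passed with -N must match the name set in the application, otherwise the workers will never find it.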
    
    
    Now, restart your application and it will use the workers already running in Condor:
    ./work_queue_example *
    listening on port XXXX...
    

    While the application is running, you can open up another window and see the progress of your application with work_queue_status, which will show you the project name, master host, and number of tasks in each state. This information is updated approximately once per minute.

    When the application completes, your workers will still be in the cluster. You can leave them there, if you want to start another application quickly. If they are not used within fifteen minutes, they will stop automatically.

    In Class Exercise

    Modify the example program to run each of the tasks sequentially. That is, for each file to be compressed, submit the task, wait for it to finish, then submit the next one, until all are complete.