Parallel DNA Analysis with Work Queue

In the previous assignment, you connected existing programs into a high-throughput workflow for video rendering. In this assignment, you will go in the opposite direction: you will take a sequential program and adapt it to run on a distributed system using Work Queue.

Getting Started with Work Queue

Download and install the cctools software in your home directory on one of the student machines:
cd $HOME
wget http://ccl.cse.nd.edu/software/files/cctools-7.0.4-source.tar.gz
tar xvzf cctools-7.0.4-source.tar.gz
cd cctools-7.0.4-source
./configure --prefix $HOME/cctools --tcp-low-port 9000
make
make install
cd $HOME
The software is now installed in $HOME/cctools, so you must add it to your path. If you use tcsh (or csh), do this:
setenv PATH ${PATH}:${HOME}/cctools/bin
If you use bash instead, then do this:
export PATH=${PATH}:${HOME}/cctools/bin
Now double check that you can run the various commands, like this:
makeflow -v
work_queue_worker -v
work_queue_status
To complete the assignment, you will need to become familiar with the manuals and other materials online. I recommend you read the Work Queue manual next and try running an example program.
  • Work Queue Web Page
  • Makeflow Web Page

A Brief Introduction to Sequence Alignment

    Download and examine the file agambiae.small.fasta. It contains 250 short DNA sequences ("reads") obtained from sequencing Anopheles gambiae, the common mosquito. (Faculty at ND study the transmission of malaria through mosquitoes; you can see more at vectorbase.org.) Each line like >1101555423223 gives the ID number of a sequence as it was obtained from the sequencing machine, and is followed by the actual DNA string. Each string is just a fragment of the complete DNA of the organism.
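To make the file layout concrete, here is a minimal Python sketch of a FASTA reader. It assumes only what is described above: a header line starting with ">" followed by one or more lines of DNA. The function name parse_fasta and the second ID in the sample data are made up for illustration; they are not part of the provided tools.

```python
def parse_fasta(lines):
    """Yield (seq_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# Hypothetical sample data; the second ID is invented for this example.
sample = [
    ">1101555423223",
    "GCTCAGCCATCTACTACAAATCGGT",
    ">1101555423224",
    "TCTACTACAAATC",
    "GGGTCAACGATCT",
]
reads = list(parse_fasta(sample))
```

To read the real data, pass an open file instead of the sample list, for example list(parse_fasta(open("agambiae.small.fasta"))). If the file is as described, that list should hold 250 entries.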

    If our goal is to assemble all of these reads into a complete genome, our first step would be to compare each sequence to every other one, to see which ones are similar, or overlap. This is known as alignment. The result of alignment is a line-up between the letters in each string, and an overall score indicating the quality of the alignment.
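The idea of an alignment score can be illustrated with a short Python sketch of Smith-Waterman scoring. This is not the code inside swaligntool. The scoring values (match +2, mismatch -1, gap -1 per position) are assumptions chosen for illustration, and the sketch computes only the best local score, not the line-up itself. The helper all_pairs_count assumes that comparing A to B and B to A counts as one comparison.

```python
def sw_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, with a simple
    per-position (linear) gap penalty.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for ch_a in a:
        curr = [0]  # column 0 is always zero for local alignment
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            score = max(0, diag, up, left)
            curr.append(score)
            if score > best:
                best = score
        prev = curr
    return best


def all_pairs_count(n):
    """Number of unordered pairs among n sequences."""
    return n * (n - 1) // 2
```

Under these assumptions, all_pairs_count(250) is 31125, which gives a sense of how many comparisons the 250-read file implies. The table inside sw_score has one cell per pair of letters, so the work per comparison grows with the product of the two string lengths.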

    Download and unpack the program swaligntool.tar.gz, which compares two DNA strings on the command line using the Smith-Waterman algorithm. Note that the tool consists of a Python main program (swaligntool) and a directory containing a Python library (swalign). Python will find the library if it is in the current working directory.

    Use it like this:

    ./swaligntool GCTCAGCCATCTACTACAAATCGGT TCTACTACAAATCGGGTCAACGATCT
    Query: cmdline (26 nt)
    Ref  : cmdline (25 nt)
    
    Query:    0 TCTACTACAAATCGGGT 17
                ||||||||||||| |||
    Ref  :    9 TCTACTACAAATC-GGT 25
    Score: 31
    Matches: 16 (94.1%)
    Mismatches: 1
    CIGAR: 13M1I3M
    
    The tool will also read sequences out of files, so if you wanted to compare a single sequence (say, in file 1.fasta) to all other sequences, you could do this:
    head -n2 agambiae.small.fasta >  1.fasta
    ./swaligntool 1.fasta agambiae.small.fasta
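The head command above works because each record in the file occupies exactly two lines. The same step can be sketched in Python for every sequence at once. The helper names write_query_files and alignment_command are hypothetical, and the sketch assumes you already have (id, sequence) pairs in memory. It only builds the command strings; it does not run them.

```python
import os
import tempfile


def write_query_files(reads, out_dir):
    """Write each (seq_id, sequence) pair to its own two-line FASTA file.

    Files are named 1.fasta, 2.fasta, ... in out_dir; returns their paths.
    """
    paths = []
    for index, (seq_id, sequence) in enumerate(reads, start=1):
        path = os.path.join(out_dir, "%d.fasta" % index)
        with open(path, "w") as handle:
            handle.write(">%s\n%s\n" % (seq_id, sequence))
        paths.append(path)
    return paths


def alignment_command(query_file, database_file="agambiae.small.fasta"):
    """Build the command line that compares one query file to a database."""
    return "./swaligntool %s %s" % (query_file, database_file)


# Small demonstration with made-up reads in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_paths = write_query_files([("1", "ACGT"), ("2", "GGCC")], demo_dir)
demo_commands = [alignment_command(os.path.basename(p)) for p in demo_paths]
```

Each entry of demo_commands is a string such as "./swaligntool 1.fasta agambiae.small.fasta", which you could run by hand to check a single comparison before automating anything.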
    

    The Assignment

    Hints

  • Begin by testing your program on a (very) small subset of data that can complete in less than a minute, so that you can debug problems very quickly as they arise. Run with a single worker running on the same machine as the master at first. Once that works, try a handful of workers in the Condor pool. Only after that works, try a large number of workers.
  • If your tasks don't work right away, try displaying the standard output generated by each task (task->output in the C API) and turning on debug output in the worker (work_queue_worker -d all).
  • You may find it useful for your program to display some output indicating tasks submitted, completed, etc. as it proceeds, to aid with debugging. It's OK to leave this in the output, as long as the end of the output contains the top ten list indicated above.
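The debugging hints above can be sketched with the Work Queue Python binding. This is one possible shape for the submit-and-wait loop, not a solution to the assignment. The helper names progress_line and run_tasks, the port number 9123, and the input_files argument are assumptions made for illustration. Check the Work Queue manual for the exact calls available in your installed version, especially for sending directories such as swalign.

```python
def progress_line(submitted, completed):
    """Format a one-line progress report for the master's output."""
    return "tasks: %d submitted, %d completed, %d outstanding" % (
        submitted, completed, submitted - completed)


def run_tasks(commands, input_files, port=9123):
    """Submit one task per command and print each task's output.

    commands: list of shell command strings (hypothetical).
    input_files: names of files every task needs (hypothetical).
    port: assumed port for workers to connect to.
    """
    # Imported here so progress_line can be used without cctools installed.
    from work_queue import WorkQueue, Task, WORK_QUEUE_INPUT

    queue = WorkQueue(port)
    for command in commands:
        task = Task(command)
        for name in input_files:
            # cache=True asks workers to keep the file between tasks.
            task.specify_file(name, name, WORK_QUEUE_INPUT, cache=True)
        queue.submit(task)

    submitted, completed = len(commands), 0
    while not queue.empty():
        task = queue.wait(5)  # wait up to 5 seconds for a finished task
        if task:
            completed += 1
            print("task %d exited with status %d" % (task.id, task.return_status))
            print(task.output)  # standard output of the task
            print(progress_line(submitted, completed))
```

To try it, start a worker on the same machine with work_queue_worker -d all localhost 9123 and call run_tasks with a handful of commands. Printing task.output for every task is noisy, so you may want to print it only when return_status is nonzero once things work.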

    What to Turn In

    Your dropbox directory is:
    /afs/nd.edu/courses/cse/cse40822.01/dropbox/YOURNAME/a2
    
    Turn in the following:
  • All of your source code.
  • The output of Part 2, showing the top ten matches.
  • A short lab report in PDF format that explains your estimate in Part 1, the results obtained in Part 2, and a carefully-prepared graph showing the performance results obtained in Part 3. Briefly discuss the results and explain whether the performance observed matches your expectations.

    This assignment is due on Monday, October 1st at 11:59 PM.
    CSE 40822 / A2