A2: Parallel DNA Analysis with Work Queue

In the previous assignment, you connected existing programs into a high-throughput workflow for video rendering. In this assignment, you will go in the opposite direction: you will take a sequential program and adapt it to a distributed system using Work Queue.

Getting Started with Work Queue

Download and install the cctools software in your home directory on one of the student machines:

cd $HOME
wget http://ccl.cse.nd.edu/software/files/cctools-7.0.4-source.tar.gz
tar xvzf cctools-7.0.4-source.tar.gz
cd cctools-7.0.4-source
./configure --prefix $HOME/cctools --tcp-low-port 9000
make
make install
cd $HOME

The software is now installed in $HOME/cctools, so you must set your path appropriately. If you use tcsh or csh, do this:

setenv PATH ${PATH}:${HOME}/cctools/bin

If you use bash instead, then do this:

export PATH=${PATH}:${HOME}/cctools/bin

Now double check that you can run the various commands, like this:

makeflow -v
work_queue_worker -v
work_queue_status

To complete the assignment, you will need to become familiar with the manuals and other materials online. I recommend you read the Work Queue manual next and try running an example program.

Work Queue Web Page

Makeflow Web Page

A Brief Introduction to Sequence Alignment

Download and examine the file agambiae.small.fasta. It contains 250 short DNA sequences ("reads") obtained from sequencing Anopheles gambiae, the African malaria mosquito. (Faculty at ND study the transmission of malaria through mosquitoes; you can see more at vectorbase.org.) Lines like >1101555423223 give the ID number assigned to the sequence by the sequencing machine, and each is followed by the actual DNA string. Each string is just a fragment of the complete DNA of the organism.
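To make the file layout concrete, here is a minimal sketch of a FASTA reader in Python. It assumes only what is described above: a header line starting with > that carries the ID, followed by one or more lines of DNA letters. The function name read_fasta and the two sample records are hypothetical and are not taken from the real data file.

```python
def read_fasta(lines):
    """Yield (sequence_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            fields = line[1:].split()
            seq_id, chunks = (fields[0] if fields else ""), []
        elif seq_id is not None:
            # Sequence data may be wrapped across several lines.
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# Hypothetical input, for illustration only.
sample = [">read1", "GCTCAGCC", "ATCT", ">read2", "TCTACTAC"]
records = list(read_fasta(sample))
print(records)
```

To read the real file, pass an open file object instead of the list, for example read_fasta(open("agambiae.small.fasta")).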

If our goal is to assemble all of these reads into a complete genome, our first step would be to compare each sequence to every other one, to see which ones are similar, or overlap. This is known as alignment. The result of alignment is a line-up between the letters in each string, and an overall score indicating the quality of the alignment.
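To illustrate what an alignment score measures, here is a minimal sketch of local alignment scoring in the style of Smith-Waterman. The scoring values (match=2, mismatch=-1, gap=-1) and the single linear gap penalty are assumptions chosen for illustration; they are not necessarily what the provided tool uses, so scores from this sketch may differ from the tool's output.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Uses a simple linear gap penalty and keeps only two rows of the
    dynamic programming table, so it reports the score but not the line-up.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))        # four matches: 8
print(smith_waterman_score("TTACGTTT", "GGACGGG")) # shared "ACG": 6
```

Each comparison costs time proportional to the product of the two string lengths, which is why comparing every read to every other read becomes expensive.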

Download and unpack the program swaligntool.tar.gz, which compares two DNA strings on the command line using the Smith-Waterman algorithm. Note that the tool consists of a Python main program (swaligntool) and a directory containing a Python library (swalign). Python will find the library if it is in the current working directory.

Use it like this:

./swaligntool GCTCAGCCATCTACTACAAATCGGT TCTACTACAAATCGGGTCAACGATCT
Query: cmdline (26 nt)
Ref  : cmdline (25 nt)

Query:    0 TCTACTACAAATCGGGT 17
            ||||||||||||| |||
Ref  :    9 TCTACTACAAATC-GGT 25
Score: 31
Matches: 16 (94.1%)
Mismatches: 1
CIGAR: 13M1I3M

The tool will also read sequences out of files, so if you wanted to compare a single sequence (say, in file 1.fasta) to all other sequences, you could do this:

head -n2 agambiae.small.fasta >  1.fasta
./swaligntool 1.fasta agambiae.small.fasta
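If you want to use the scores in a program, one option is to capture the tool's text output and pull out the Score: lines. The sketch below assumes every comparison in the output reports a line of the form "Score: N", as in the single-pair example above; check the real multi-sequence output before relying on that. The helper names run_swaligntool and extract_scores are hypothetical.

```python
import re
import subprocess

# Matches lines such as "Score: 31" anywhere in the report.
SCORE_LINE = re.compile(r"^Score:\s*(-?\d+)\s*$", re.MULTILINE)


def extract_scores(report):
    """Return every score found in the tool's text output, in order."""
    return [int(s) for s in SCORE_LINE.findall(report)]


def run_swaligntool(first, second, tool="./swaligntool"):
    """Run the tool on two arguments (strings or files) and return its output.

    Not called here; it assumes the tool is in the current directory.
    """
    return subprocess.check_output([tool, first, second],
                                   universal_newlines=True)


# Part of the sample output shown earlier in this handout.
sample_report = """Query:    0 TCTACTACAAATCGGGT 17
            ||||||||||||| |||
Ref  :    9 TCTACTACAAATC-GGT 25
Score: 31
Matches: 16 (94.1%)
Mismatches: 1
CIGAR: 13M1I3M
"""
print(extract_scores(sample_report))  # [31]
```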

The Assignment

Part 1: Select a handful of items from the data, and measure how many comparisons can be completed in a minute. Now, how long would it take to compare every sequence to every other sequence sequentially? How many machines would you need to finish this job in about an hour? We only gave you a portion of the data -- the complete A. gambiae data consists of 100,000 sequences. How long would that take?
Part 2: Write a Work Queue application that compares every sequence in a given file to every other sequence in that file (excluding self-comparisons) and then outputs only the ten pairs of sequences with the best match scores. Use this program to obtain the ten best matches in agambiae.small.fasta, using Work Queue workers running in the campus Condor pool. The output should look something like this:
```
./compareit agambiae.small.fasta
Listening on port 9785...
Top Ten Matches:
1: sequence 1101555423543 matches 1101897423223 with a score of 807
2: sequence 1101555423223 matches 1101555423223 with a score of 643
...
10: sequence 1101557298657 matches 1101555400923 with a score of 35
```
Part 3: Evaluate the performance of your application on a varying number of workers: 50, 100, 150, 200. For each number of workers, measure the run time and compute the speedup (estimated sequential time / parallel time) and efficiency (100% * speedup / workers).
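The arithmetic behind Parts 1 and 3 can be sketched as follows. This is only an illustration under stated assumptions: each unordered pair is compared once (A against B is not repeated as B against A), the machine count assumes perfect scaling, and the rate of 100 comparisons per minute and the 415-second parallel run time are hypothetical placeholders, not measurements. Substitute your own numbers.

```python
import math


def unordered_pairs(n):
    """Number of distinct pairs among n sequences, excluding self-comparisons."""
    return n * (n - 1) // 2


def sequential_seconds(n, comparisons_per_minute):
    """Estimated time to run every pairwise comparison on one machine."""
    return unordered_pairs(n) * 60.0 / comparisons_per_minute


def machines_needed(total_seconds, target_seconds=3600.0):
    """Machines required to meet the target, assuming perfect scaling."""
    return int(math.ceil(total_seconds / target_seconds))


def speedup(sequential_time, parallel_time):
    return sequential_time / parallel_time


def efficiency_percent(sequential_time, parallel_time, workers):
    return 100.0 * speedup(sequential_time, parallel_time) / workers


RATE = 100.0  # hypothetical comparisons per minute; measure your own
est = sequential_seconds(250, RATE)
print(unordered_pairs(250), est, machines_needed(est))          # 31125 18675.0 6
print(speedup(est, 415.0), efficiency_percent(est, 415.0, 50))  # 45.0 90.0
```

Because the number of pairs grows with the square of the number of sequences, going from 250 to 100,000 sequences multiplies the work by far more than the factor of 400 in input size.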

Hints

Begin by testing your program on a (very) small subset of data that can complete in less than a minute, so that you can debug problems very quickly as they arise. Run with a single worker running on the same machine as the master at first. Once that works, try a handful of workers in the Condor pool. Only after that works, try a large number of workers.

If your tasks don't work right away, try displaying the standard output generated by each task (task->output), and try turning on the debug output in the worker (work_queue_worker -d all).
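As one way to display task output while debugging, here is a minimal sketch using the Work Queue Python bindings that ship with cctools. It assumes the work_queue module is importable (for example, by adding the cctools Python library directory to PYTHONPATH) and that the commands and input file names you pass in are your own. The helper names format_task_report and run_and_report are hypothetical; the import is placed inside the function so the formatting helper can be tried without cctools installed.

```python
def format_task_report(task_id, command, status, output):
    """Build a short human-readable report for one finished task."""
    return "task %d (%s) exited with status %d\n%s" % (
        task_id, command, status, output or "(no output)")


def run_and_report(commands, input_files=(), port=0):
    """Submit shell commands as tasks and print each task's output.

    commands:    list of command-line strings, one per task
    input_files: local files every task needs (sent to workers and cached)
    port:        0 asks Work Queue to pick an available port
    """
    import work_queue as wq  # assumed to be installed with cctools

    q = wq.WorkQueue(port=port)
    print("Listening on port %d..." % q.port)

    for command in commands:
        t = wq.Task(command)
        for name in input_files:
            t.specify_input_file(name, name, cache=True)
        q.submit(t)

    while not q.empty():
        t = q.wait(5)  # returns None if nothing finished within 5 seconds
        if t:
            print(format_task_report(t.id, t.command, t.return_status, t.output))


print(format_task_report(1, "echo hello", 0, "hello\n"))
```

For a first test, something like run_and_report(["echo hello"]) with a single local work_queue_worker pointed at the printed port is enough to confirm that tasks travel to the worker and that output comes back.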

You may find it useful for your program to display some output indicating tasks submitted, completed, etc. as it proceeds, to aid with debugging. It's OK to leave this in the output, as long as the end of the output contains the top ten list indicated above.
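Producing the final list from collected results can be sketched as below. The sketch assumes results have been gathered as (id_a, id_b, score) tuples; that layout, the function name format_top_matches, and the sample IDs and scores are hypothetical and used only for illustration.

```python
import heapq


def format_top_matches(results, count=10):
    """Return the report lines for the best-scoring pairs.

    results: iterable of (id_a, id_b, score) tuples
    Self-comparisons are dropped before ranking.
    """
    candidates = [r for r in results if r[0] != r[1]]
    best = heapq.nlargest(count, candidates, key=lambda r: r[2])
    lines = ["Top Ten Matches:"]
    for rank, (id_a, id_b, score) in enumerate(best, 1):
        lines.append("%d: sequence %s matches %s with a score of %d"
                     % (rank, id_a, id_b, score))
    return lines


# Hypothetical results, not real alignment scores.
demo = [("s1", "s2", 40), ("s1", "s1", 999), ("s2", "s3", 75), ("s1", "s3", 12)]
report = format_top_matches(demo)
print("\n".join(report))
```

heapq.nlargest avoids sorting the entire result list, which matters little for 250 sequences but helps as the number of pairs grows.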

What to Turn In

Your dropbox directory is:

/afs/nd.edu/courses/cse/cse40822.01/dropbox/YOURNAME/a2

Turn in the following:

All of your source code.

The output of part two, showing the top ten matches.

A short lab report in PDF format that explains your estimate in Part 1, the results obtained in Part 2, and a carefully-prepared graph showing the performance results obtained in Part 3. Briefly discuss the results and explain whether the performance observed matches your expectations.

This assignment is due on Monday, October 1st at 11:59PM.
