SAND User's Manual

Last Updated February 2010

SAND is Copyright (C) 2010 The University of Notre Dame. This software is distributed under the GNU General Public License. See the file COPYING for details.

Overview

SAND is a set of modules for genome assembly that are built atop the Work Queue platform for large-scale distributed computation on clusters, clouds, grids, or assorted collections of machines. SAND was designed as a modular replacement for the conventional overlapper in the Celera assembler, separated into two distinct steps: candidate selection and alignment. Given a set of sequences, the modules can produce a set of candidate pairs of sequences and compute the alignments on those pairs, storing the alignment results in OVL format for use farther down the Celera pipeline.

SAND is part of the Cooperating Computing Tools. You can download the CCTools from this web page, follow the installation instructions, and you are ready to go.

The SAND modules

The two SAND modules are very similar to the combined Overlapper module in Celera, except that they facilitate easy and flexible parallelization on distributed resources.

SAND requires data in formats that differ slightly from other bioinformatics applications. Included in the SAND package is a set of tools that allow users to convert data into the formats required by SAND. The primary difference is that sequence data is stored in a compressed fasta-like format, which by convention we call .cfa. The details of this format are discussed in the next paragraph, and conversion to this format is discussed in the Example Execution section.

The candidate filtering module produces a set of promising pairs of sequences to align. By default, the module uses k-mer filtering, which selects a pair of sequences for alignment if the two sequences share a perfect match on a short subsequence of length k. The filtering is done by a serial program that is included with the SAND candidate filtering module, but alternate candidate selection routines can be specified in its place. The candidate filtering module requires a list of sequences in the compressed fasta-like format, which encodes the actual sequence data {A,C,T,G} into a 2-bit format but retains human-readable metadata:

>sequence_1_name number_of_bases number_of_bytes\n
Sequence encoded into 2-bit Format, with a trailing newline\n
>sequence_2_name number_of_bases number_of_bytes\n
Sequence encoded into 2-bit Format, with a trailing newline\n
>sequence_3_name number_of_bases number_of_bytes\n
...
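
To make the 2-bit idea concrete, here is a minimal Python sketch of packing bases four to a byte and building the metadata line. It is an illustration only: the bit assignment (A=0, C=1, G=2, T=3), the left-to-right packing order, the reading of number_of_bytes as the count of packed bytes, and the function names are all assumptions, since this manual does not specify the actual bit layout used by sand_compress_reads.

```python
# Hypothetical 2-bit code; the real .cfa bit layout may differ.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def pack_bases(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for start in range(0, len(seq), 4):
        chunk = seq[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so decoding always reads top bits first.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(data: bytes, num_bases: int) -> str:
    """Recover the first num_bases bases from packed data."""
    bases = []
    for i in range(num_bases):
        shift = 6 - 2 * (i % 4)
        bases.append(BITS_TO_BASE[(data[i // 4] >> shift) & 0b11])
    return "".join(bases)


def header_line(name: str, seq: str) -> str:
    """Build the human-readable metadata line for one sequence."""
    return f">{name} {len(seq)} {len(pack_bases(seq))}"
```

The base count must be stored alongside the packed data because the final byte may be only partly used; that is one plausible reason the metadata line carries both numbers.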

The alignment module computes sequence alignments using a user-supplied alignment algorithm. It requires a list of sequences in the same compressed fasta-like format as the candidate filtering module, a list of candidate pairs, and an alignment executable. The list of candidate pairs may be either prepared in advance or generated concurrently by the candidate filtering step in a production pipeline. It is a file in the format:

sequence_1_name sequence_2_name alignment_flag extra_data
sequence_1_name sequence_3_name alignment_flag extra_data
sequence_2_name sequence_3_name alignment_flag extra_data
sequence_3_name sequence_4_name alignment_flag extra_data
...

The alignment flag is the direction of alignment, signified by either "1" (forward, left to right) or "-1" (backward, right to left). The extra data is a free-form field of up to 128 ASCII characters that may be used to pass additional information to the alignment algorithm. In the case of the algorithms bundled with the SAND modules, it is used only for metadata about the candidate.

The algorithms bundled with the SAND alignment module are simple implementations of Smith-Waterman alignment and a basic banded alignment heuristic. The modules use the same OVL record format as the Celera assembler.
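
The candidate list format described above can be read with a few lines of Python. The following is a sketch rather than SAND's own parser: it assumes the fields are separated by whitespace and that the extra data may itself contain spaces, and the names Candidate and parse_candidate are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    first: str       # name of the first sequence
    second: str      # name of the second sequence
    direction: int   # 1 = forward, -1 = backward
    extra: str       # free-form data, up to 128 ASCII characters


def parse_candidate(line: str) -> Candidate:
    """Parse one line of a candidate file into a Candidate record."""
    # Split into at most four fields so the extra data keeps its spaces.
    parts = line.rstrip("\n").split(None, 3)
    if len(parts) < 3:
        raise ValueError(f"expected at least 3 fields: {line!r}")
    first, second, flag = parts[:3]
    extra = parts[3] if len(parts) > 3 else ""
    direction = int(flag)  # raises ValueError if the flag is not a number
    if direction not in (1, -1):
        raise ValueError(f"alignment flag must be 1 or -1: {flag!r}")
    if len(extra) > 128 or not extra.isascii():
        raise ValueError("extra data must be at most 128 ASCII characters")
    return Candidate(first, second, direction, extra)
```

A check like this can be useful when candidates come from an alternate selection routine, since a malformed flag is caught before any alignment work is dispatched.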

Example execution

Given a set of sequences in fasta format my_sequences.fa, this section will walk through the process of creating the compressed fasta-like format, completing the candidate filtering, and computing the alignments. This walk-through is nearly identical to the example provided within the SAND distribution, which includes 20 sequences (in actuality, 10 duplicated to ensure that there are viable candidates), and a correct final set of alignment records to compare against.

Data Conversion

The primary data conversion tool is sand_compress_reads, which converts fasta files into the compressed fasta-like format used with our modules.

 % sand_compress_reads < my_sequences.fa > my_sequences.cfa 
The compressed file my_sequences.cfa is then used to filter out a list of candidates. This is the first step in the pipeline that uses Work Queue, so we first describe how to procure workers.

Procuring Resources

To begin, let's assume that you are logged into a machine named barney.nd.edu. In order to procure workers, you can use your batch system (such as SGE or Condor), or you can execute the workers yourself. To make this a little easier, we have written some tools, provided in the CCTools, that submit workers to each of these two common batch systems.
This is an example of submitting 10 worker processes to Condor:

% condor_submit_workers barney.nd.edu 9123 10
Submitting job(s)..........
Logging submit event(s)..........
10 job(s) submitted to cluster 298.
Or, submitting 10 worker processes to SGE:
% sge_submit_workers barney.nd.edu 9123 10
Or, you can start workers manually on any other machine you can log into, using the worker executable built in the CCTools:
% worker barney.nd.edu 9123
Once the workers begin running, the SAND modules can dispatch tasks to each one very quickly. If a worker should fail, Work Queue will retry the work elsewhere, so it is safe to submit many workers to an unreliable system.

When the SAND module's master process completes, your workers will still be available, so you can either run another master with the same workers, remove them from the batch system, or wait for them to expire. If you do nothing for 15 minutes, they will automatically exit.

Note that condor_submit_workers and sge_submit_workers are simple shell scripts, so you can edit them directly if you would like to change batch options or other details.

Candidate Filtering

The candidate filtering master requires only the newly created my_sequences.cfa as input. Its only required output argument is the name of the candidate list to be created, which this document will refer to as my_candidates.cand:
% sand_filter_master my_sequences.cfa my_candidates.cand
Note that a progress table will be printed to standard out, while more detailed performance information is printed to standard error. We strongly suggest redirecting at least standard error, if not both, to a file while executing.

For many large datasets, preprocessing must be done before candidate filtering in order to discard subsequences that are repeated so often that they will not be useful. Failing to do so does not change the correctness of the filtering, but will increase its runtime significantly. Given the original list of (uncompressed) sequences, we can create a list (called repeats.meryl) of short sequences of length k that are repeated n times with the following commands. In this case k is 24 and n is 100.
% meryl -B -m 24 -C -L 100 -v -o output.meryl -s my_sequences.fa
% meryl -Dt -s output.meryl -n 100 > repeats.meryl

Alignment

The program sand_align_master accepts an alignment program sand_align_kernel with some options for alignment, the newly created my_candidates.cand, and the sequences file my_sequences.cfa. It divides the work up among multiple workers and produces my_results.ovl, which indicates which sequences overlap significantly, in the Celera OVL format. The options -q 0.04 -m 40 passed to sand_align_kernel indicate a minimum alignment quality of 0.04 and a minimum alignment length of 40 bases.

% sand_align_master sand_align_kernel -e "-q 0.04 -m 40" my_candidates.cand my_sequences.cfa my_results.ovl
Again, a progress table will be printed to standard out:
 Total | Workers   | Tasks                      Avg | K-Cand K-Seqs | Total
  Time | Idle Busy | Submit Idle  Run   Done   Time | Loaded Loaded | Speedup
     0 |    0    0 |      0    0    0      0   0.00 |      0      0 |  0.00
     8 |    0   48 |    100   52   48      0   0.00 |   1000    284 |  0.00
    10 |    0   86 |    100   13   86      1   7.07 |   1000    284 |  0.71
    36 |    1   83 |    181   14   83      2  19.47 |   1810    413 |  1.08
   179 |    1   83 |    259   92   83      3  22.51 |   2590   1499 |  0.38
   180 |    1   83 |    259   92   83     18  28.19 |   2590   1499 |  2.82
   186 |    2   80 |    259   15   80     85  28.54 |   2590   1499 | 13.04
   199 |    2   80 |    334   90   80     86  29.96 |   3340   1499 | 12.95
   200 |    2   80 |    334   90   80    114  59.43 |   3340   1499 | 33.88
   202 |    2   81 |    334    9   81    165  86.08 |   3340   1499 | 70.32
The columns of the output are as follows:
  • Total Time is the elapsed time the master has been running.
  • Workers Idle is the number of workers that are connected, but do not have a task to run.
  • Workers Busy is the number of workers that are currently running a task.
  • Tasks Submitted is the cumulative number of tasks created by the master.
  • Tasks Idle is the number of tasks waiting for a worker.
  • Tasks Running is the number of tasks currently running on a worker.
  • Tasks Done is the cumulative number of tasks completed.
  • Avg Time is the average time a task takes to run. An average time of 60 seconds is a good goal.
  • K-Cand Loaded indicates the number of candidates loaded into memory (in thousands).
  • K-Seqs Loaded indicates the number of sequences loaded into memory (in thousands).
  • Speedup is the approximate speed of the distributed framework, relative to one processor.
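
If standard output has been saved to a file, the progress table can be turned into named values for later inspection. The sketch below is a convenience for reading such a log, not part of SAND: it assumes every data row has exactly the eleven numeric columns shown above, and the field names and the function parse_progress_row are hypothetical.

```python
from typing import Dict, Optional

# One name per column of the progress table, left to right (names are ours).
FIELDS = (
    "total_time", "workers_idle", "workers_busy",
    "tasks_submitted", "tasks_idle", "tasks_running", "tasks_done",
    "avg_time", "kcand_loaded", "kseqs_loaded", "speedup",
)


def parse_progress_row(line: str) -> Optional[Dict[str, float]]:
    """Return the columns of one data row, or None for any other line."""
    tokens = line.replace("|", " ").split()
    if len(tokens) != len(FIELDS):
        return None  # wrong number of columns: not a data row
    try:
        values = [float(token) for token in tokens]
    except ValueError:
        return None  # non-numeric tokens: a header line
    return dict(zip(FIELDS, values))
```

For example, the last non-None row of a saved log gives the final Tasks Done count and Speedup for the run.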

Tuning Suggestions

  • As a rule of thumb, a single task should take a minute or two. If tasks are much longer than that, it becomes more difficult to measure progress and recover from failures. If tasks are much shorter than that, the overhead of managing the tasks becomes excessive. Use the -n parameter to increase or decrease the size of tasks.
  • When using banded alignment (the default), the -q match quality parameter has a significant effect on speed. A higher quality threshold will consider more alignments, but take longer and produce more output.
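
The rule of thumb about task length can be turned into a rough calculation for choosing a new value for -n. This sketch assumes that task run time grows roughly in proportion to task size, which is an assumption rather than something the manual guarantees; the function name suggest_task_size and the 60-second default target are illustrative.

```python
def suggest_task_size(current_size: int, avg_task_seconds: float,
                      target_seconds: float = 60.0) -> int:
    """Scale the task size so the average task takes about target_seconds.

    Assumes run time is proportional to task size (a simplification).
    """
    if current_size <= 0 or avg_task_seconds <= 0 or target_seconds <= 0:
        raise ValueError("sizes and times must be positive")
    scaled = current_size * target_seconds / avg_task_seconds
    # Never suggest an empty task.
    return max(1, round(scaled))
```

Feed it the Avg Time column from the progress table and the task size used for that run. Treat the result as a starting point and re-check Avg Time on the next run.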

For More Information

For the latest information about SAND, please visit our web site and subscribe to our mailing list.