SAND is Copyright (C) 2010 The University of Notre Dame. This software is distributed under the GNU General Public License. See the file COPYING for details.
SAND is a set of modules for genome assembly that are built atop the Work Queue platform for large-scale distributed computation on clusters, clouds, grids, or assorted collections of machines. SAND was designed as a modular replacement for the conventional overlapper in the Celera assembler, separated into two distinct steps: candidate selection and alignment. Given a set of sequences, the modules can produce a set of candidate pairs of sequences and compute the alignments on those pairs, storing the alignment results in OVL format for use farther down the Celera pipeline.
SAND is part of the Cooperative Computing Tools (CCTools). You can download the CCTools from the CCTools web page, follow the installation instructions, and you are ready to go.
The two SAND modules are very similar to the combined Overlapper module in Celera, except that they facilitate easy and flexible parallelization on distributed resources.
SAND requires data in formats that differ slightly from those of other bioinformatics applications. The SAND package includes a set of tools that convert data into the required formats. The primary difference is that sequence data is stored in a compressed fasta-like format, which by convention we call .cfa. The details of this format are discussed in the next paragraph, and conversion to this format is discussed in the Example Execution section.
The candidate filtering module produces a set of promising pairs of sequences to align. By default, the module uses k-mer filtering, which selects a pair of sequences for alignment if the two share a perfect match over a short subsequence (a k-mer). The filtering is done by a serial program that is included with the SAND candidate filtering module, but alternate candidate selection routines can be specified in its place. The candidate filtering module requires a list of sequences in compressed fasta-like format, which encodes the actual sequence data {A,C,T,G} into a 2-bit format, but retains human-readable metadata:
>sequence_1_name number_of_bases number_of_bytes\n
Sequence encoded into 2-bit format, with a trailing newline\n
>sequence_2_name number_of_bases number_of_bytes\n
Sequence encoded into 2-bit format, with a trailing newline\n
>sequence_3_name number_of_bases number_of_bytes\n
...
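To make the record layout above concrete, here is a minimal Python sketch of how one such record could be built. It is an illustration only: the base-to-bit mapping, the bit order within a byte, and the choice to exclude the trailing newline from number_of_bytes are all assumptions, and the encoding used by sand_compress_reads may differ. The function names are hypothetical.

```python
# Hypothetical 2-bit mapping; SAND's actual bit assignment may differ.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_bases(bases: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte, first base in the high bits."""
    packed = bytearray()
    for start in range(0, len(bases), 4):
        chunk = bases[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a final partial chunk so the unused low bits are zero.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


def cfa_record(name: str, bases: str) -> bytes:
    """Build one record: readable header line, packed data, trailing newline."""
    data = pack_bases(bases.upper())
    # Assumption: number_of_bytes counts only the packed data, not the newline.
    header = f">{name} {len(bases)} {len(data)}\n".encode("ascii")
    return header + data + b"\n"
```

Packing four bases per byte is what makes the file roughly a quarter the size of plain fasta, while the header line stays readable with ordinary text tools.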
The alignment module computes sequence alignments using a user-supplied alignment algorithm. It requires a list of sequences in the same compressed fasta-like format as the candidate filtering module, a list of candidate pairs, and an alignment executable. The list of candidate pairs, which may be either chosen in advance or produced concurrently by the candidate selection step in a production pipeline, is a file in the format:
sequence_1_name sequence_2_name alignment_flag extra_data
sequence_1_name sequence_3_name alignment_flag extra_data
sequence_2_name sequence_3_name alignment_flag extra_data
sequence_3_name sequence_4_name alignment_flag extra_data
...
The alignment flag is the direction of alignment, signified by either "1" (forward, left to right) or "-1" (backward, right to left). The extra data is a free-form field of up to 128 ASCII characters that may be used as additional information to the alignment algorithm. In the case of the algorithms bundled with the SAND modules, it is used only for metadata about the candidate.
The algorithms bundled with the SAND alignment module are simple implementations of a Smith-Waterman alignment, and a basic banded alignment heuristic. The modules use the same OVL record format as used by the Celera assembler.
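The candidate pair format described above can be read with a few lines of Python. The sketch below is not part of SAND; the names Candidate and parse_candidate are hypothetical, and it assumes that fields are separated by whitespace and that the extra data may be absent.

```python
from typing import NamedTuple

MAX_EXTRA_CHARS = 128  # limit stated for the extra-data field


class Candidate(NamedTuple):
    first: str
    second: str
    direction: int  # 1 = forward, -1 = backward
    extra: str


def parse_candidate(line: str) -> Candidate:
    """Parse one candidate line (assumes whitespace-separated fields)."""
    # Split at most three times so the free-form extra data stays intact.
    fields = line.rstrip("\n").split(None, 3)
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields: {line!r}")
    first, second, flag = fields[:3]
    extra = fields[3] if len(fields) == 4 else ""
    if flag not in ("1", "-1"):
        raise ValueError(f"alignment flag must be 1 or -1, got {flag!r}")
    if len(extra) > MAX_EXTRA_CHARS or not extra.isascii():
        raise ValueError("extra data must be at most 128 ASCII characters")
    return Candidate(first, second, int(flag), extra)
```

A check like this can be useful when candidates come from an alternate selection routine rather than the bundled filter, since a malformed flag is caught before any alignment work is dispatched.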
Given a set of sequences in fasta format, my_sequences.fa, this section walks through the process of creating the compressed fasta-like format, completing the candidate filtering, and computing the alignments. This walk-through is nearly identical to the example provided within the SAND distribution, which includes 20 sequences (in actuality, 10 sequences duplicated to ensure that there are viable candidates) and a correct final set of alignment records to compare against.
The primary data conversion tool is sand_compress_reads, which converts fasta files into the compressed fasta-like format used with our modules.
% sand_compress_reads < my_sequences.fa > my_sequences.cfa
The compressed file my_sequences.cfa is then used to filter out a list of candidates. This is the first step in the pipeline that uses the Work Queue, so let us discuss that here.
To begin, let's assume that you are logged into a machine named
barney.nd.edu. In order to procure workers, you can use your
batch system (such as those running SGE or Condor), or you can execute the
workers yourself. In order to make this a little easier, we have
written some tools, provided in the CCTools, that submit workers to
each of these two common batch systems.
This is an example of submitting 10 worker processes to Condor:
% condor_submit_workers barney.nd.edu 9123 10
Submitting job(s)..........
Logging submit event(s)..........
10 job(s) submitted to cluster 298.
Or, submitting 10 worker processes to SGE:
% sge_submit_workers barney.nd.edu 9123 10
Or, you can start workers manually on any other machine you can log into, using the worker executable built in the CCTools:
% worker barney.nd.edu 9123
Once the workers begin running, the SAND modules can dispatch tasks to each one very quickly. If a worker should fail, Work Queue will retry the work elsewhere, so it is safe to submit many workers to an unreliable system.
When the SAND module's master process completes, your workers will still be available, so you can either run another master with the same workers, remove them from the batch system, or wait for them to expire. If you do nothing for 15 minutes, they will automatically exit.
Note that condor_submit_workers and sge_submit_workers are simple shell scripts, so you can edit them directly if you would like to change batch options or other details.
% sand_filter_master my_sequences.cfa my_candidates.cand
Note that a progress table will be printed to standard output, while more detailed performance information is printed to standard error. We strongly suggest redirecting at least standard error, if not both, to a file while executing.
For many large datasets, preprocessing must be done before candidate filtering in order to discard subsequences that are repeated so often that they will not be useful. Failing to do so does not change the correctness of the filtering, but will increase its runtime significantly. Given the original list of (uncompressed) sequences, we can create a list (called repeats.meryl) of short sequences of length k that are repeated n times with the following commands. In this case k is 24 and n is 100.
% meryl -B -m 24 -C -L 100 -v -o output.meryl -s my_sequences.fa
% meryl -Dt -s output.meryl -n 100 > repeats.meryl
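To see why discarding heavily repeated k-mers matters, the following Python sketch shows a simplified k-mer filter with an optional repeat list. It is not the SAND or meryl implementation: the function names are hypothetical, "repeated n times" is interpreted here as "at least n occurrences", and reverse-complement matches are ignored for brevity.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def repeated_kmers(seqs, k, n):
    """Return the k-mers that occur at least n times across all sequences."""
    counts = Counter(km for s in seqs.values() for km in kmers(s, k))
    return {km for km, count in counts.items() if count >= n}


def candidate_pairs(seqs, k, repeats=frozenset()):
    """Pair up sequences that share at least one k-mer not listed in repeats."""
    owners = defaultdict(set)  # k-mer -> names of sequences containing it
    for name, s in seqs.items():
        for km in kmers(s, k):
            if km not in repeats:
                owners[km].add(name)
    pairs = set()
    for names in owners.values():
        # Every pair of sequences sharing this k-mer becomes a candidate.
        pairs.update(combinations(sorted(names), 2))
    return pairs
```

A k-mer shared by m sequences contributes on the order of m squared pairs, so a handful of very common k-mers can dominate the candidate list. Skipping them shrinks the work without affecting pairs that also share a less common k-mer.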
The program sand_align_master takes an alignment program (here sand_align_kernel) with its alignment options, the newly created my_candidates.cand, and the sequence file my_sequences.cfa. It divides the work among multiple workers and produces my_results.ovl, which indicates which sequences overlap significantly, in the Celera OVL format. The options -q 0.04 -m 40 passed to sand_align_kernel indicate a minimum alignment quality of 0.04 and a minimum alignment length of 40 bases.
% sand_align_master sand_align_kernel -e "-q 0.04 -m 40" my_candidates.cand my_sequences.cfa my_results.ovl
Again, a progress table will be printed to standard out:
Total | Workers   | Tasks                    Avg | K-Cand K-Seqs | Total
 Time | Idle Busy | Submit Idle  Run Done   Time | Loaded Loaded | Speedup
    0 |    0    0 |      0    0    0    0   0.00 |      0      0 |    0.00
    8 |    0   48 |    100   52   48    0   0.00 |   1000    284 |    0.00
   10 |    0   86 |    100   13   86    1   7.07 |   1000    284 |    0.71
   36 |    1   83 |    181   14   83    2  19.47 |   1810    413 |    1.08
  179 |    1   83 |    259   92   83    3  22.51 |   2590   1499 |    0.38
  180 |    1   83 |    259   92   83   18  28.19 |   2590   1499 |    2.82
  186 |    2   80 |    259   15   80   85  28.54 |   2590   1499 |   13.04
  199 |    2   80 |    334   90   80   86  29.96 |   3340   1499 |   12.95
  200 |    2   80 |    334   90   80  114  59.43 |   3340   1499 |   33.88
  202 |    2   81 |    334    9   81  165  86.08 |   3340   1499 |   70.32
The columns of the output are as follows: