SAND is Copyright (C) 2010 The University of Notre Dame. This software is distributed under the GNU General Public License. See the file COPYING for details.
SAND is a set of modules for genome assembly that are built atop the Work Queue platform for large-scale distributed computation on clusters, clouds, grids, or assorted collections of machines. SAND was designed as a modular replacement for the conventional overlapper in the Celera assembler, separated into two distinct steps: candidate selection and alignment. Given a set of sequences, the modules can produce a set of candidate pairs of sequences and compute the alignments on those pairs, storing the alignment results in OVL format for use farther down the Celera pipeline.
SAND is part of the Cooperative Computing Tools (CCTools). You can download the CCTools from this web page, follow the installation instructions, and you are ready to go.
The two SAND modules are very similar to the combined Overlapper module in Celera, except that they facilitate easy and flexible parallelization on distributed resources.
SAND requires data in formats that differ slightly from those used by other bioinformatics applications. The SAND package includes a set of tools that convert data into the required formats. The primary difference is that sequence data is stored in a compressed fasta-like format, which by convention we call .cfa. The details of this format are discussed in the next paragraph, and conversion to this format is discussed in the Example Execution section.
The candidate filtering module produces a set of promising pairs of sequences to align. By default, the module uses k-mer filtering, which selects a pair of sequences for alignment if it has a perfect match on a short alignment. The filtering is done by a serial program that is included with the SAND candidate filtering module, but alternate candidate selection routines can be specified in its place. The candidate filtering module requires a list of sequences in compressed fasta-like format, which encodes the actual sequence data {A,C,T,G} into a 2-bit format, but retains human-readable metadata:
>sequence_1_name number_of_bases number_of_bytes\n
Sequence encoded into 2-bit Format, with a trailing newline\n
>sequence_2_name number_of_bases number_of_bytes\n
Sequence encoded into 2-bit Format, with a trailing newline\n
>sequence_3_name number_of_bases number_of_bytes\n
...
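To make the idea of a 2-bit encoding concrete, here is a minimal Python sketch. It is not the code used by SAND: the mapping A=0, C=1, G=2, T=3, the packing of four bases per byte with the first base in the top two bits, and the reading of number_of_bytes as the length of the packed data are all assumptions made for illustration, and the function names are hypothetical.

```python
# Hypothetical 2-bit codes; the real .cfa mapping and bit order may differ.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_bases(seq):
    """Pack a string over {A,C,G,T} into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        shift = 6 - 2 * (i % 4)  # first base of each byte goes in the top two bits
        out[i // 4] |= CODES[base] << shift
    return bytes(out)


def unpack_bases(data, num_bases):
    """Recover the first num_bases bases from packed bytes."""
    return "".join(
        BASES[(data[i // 4] >> (6 - 2 * (i % 4))) & 3] for i in range(num_bases)
    )


def cfa_header(name, seq):
    """Build a header line: name, base count, packed byte count (assumed meaning)."""
    return ">%s %d %d" % (name, len(seq), len(pack_bases(seq)))
```

The base count must be stored alongside the packed bytes because the last byte may be only partly used, which is presumably why the header carries both numbers.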
The alignment module computes sequence alignments using a user-supplied alignment algorithm. It requires a list of sequences in the same compressed fasta-like format as the candidate filtering module, a list of candidate pairs, and an alignment executable. The list of candidate pairs may be prepared in advance or, in a production pipeline, generated concurrently by the candidate filtering step. It is a file in the format:
sequence_1_name sequence_2_name alignment_flag extra_data
sequence_1_name sequence_3_name alignment_flag extra_data
sequence_2_name sequence_3_name alignment_flag extra_data
sequence_3_name sequence_4_name alignment_flag extra_data
...

The alignment flag is the direction of alignment, signified by either "1" (forward, left to right) or "-1" (backward, right to left). The extra data is a free-form field of up to 128 ASCII characters that may be used as additional information to the alignment algorithm. In the case of the algorithms bundled with the SAND modules, it is used only for metadata about the candidate.

The algorithms bundled with the SAND alignment module are simple implementations of a Smith-Waterman alignment and a basic banded alignment heuristic. The modules use the same OVL record format as used by the Celera assembler.
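The candidate record layout described above can be read with a few lines of Python. This is a sketch, not part of SAND: it assumes that fields are separated by whitespace, that the extra data may be empty or contain spaces, and the names Candidate and parse_candidate are hypothetical.

```python
from collections import namedtuple

Candidate = namedtuple("Candidate", "name1 name2 direction extra")


def parse_candidate(line):
    """Parse 'name1 name2 flag [extra...]' into a Candidate."""
    fields = line.strip().split(None, 3)  # keep any spaces inside the extra data
    if len(fields) < 3:
        raise ValueError("expected at least 3 fields: %r" % line)
    direction = int(fields[2])
    if direction not in (1, -1):
        raise ValueError("alignment flag must be 1 or -1: %r" % line)
    extra = fields[3] if len(fields) > 3 else ""
    if len(extra) > 128:
        raise ValueError("extra data exceeds 128 characters")
    return Candidate(fields[0], fields[1], direction, extra)
```

A custom alignment executable could use a parser along these lines to decide whether to align a pair in the forward or the backward direction.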
Given a set of sequences in fasta format, my_sequences.fa, and an alignment executable, align.exe, that implements an arbitrary alignment algorithm based on the data formats described above, this section walks through the process of creating the compressed fasta-like format, completing the candidate filtering, and computing the alignments. This walk-through is nearly identical to the example provided within the SAND distribution, which includes 20 sequences (in actuality, 10 sequences duplicated to ensure that there are viable candidates) and a correct final set of alignment records to compare against.
The primary data conversion tool is sand_compress_reads, which converts fasta files into the compressed fasta-like format used with our modules.
% sand_compress_reads < my_sequences.fa > my_sequences.cfa

The compressed file my_sequences.cfa is then used to filter out a list of candidates. This is the first step in the pipeline that uses the Work Queue, so let us discuss that here.
To begin, let's assume that you are logged into a machine named
barney.nd.edu. In order to procure workers, you can use your
batch system (such as SGE or Condor), or you can execute the
workers yourself. In order to make this a little easier, we have
written some tools, provided in the CCTools, that submit workers to
each of these two common batch systems.
This is an example of submitting 10 worker processes to Condor:
% condor_submit_workers barney.nd.edu 9123 10
Submitting job(s)..........
Logging submit event(s)..........
10 job(s) submitted to cluster 298.

Or, submitting 10 worker processes to SGE:

% sge_submit_workers barney.nd.edu 9123 10

Or, you can start workers manually on any other machine you can log into, using the worker executable built in the CCTools:

% worker barney.nd.edu 9123

Once the workers begin running, the SAND modules can dispatch tasks to each one very quickly. If a worker should fail, Work Queue will retry the work elsewhere, so it is safe to submit many workers to an unreliable system.
When the SAND module's master process completes, your workers will still be available, so you can either run another master with the same workers, remove them from the batch system, or wait for them to expire. If you do nothing for 15 minutes, they will automatically exit.
Note that condor_submit_workers and sge_submit_workers are simple shell scripts, so you can edit them directly if you would like to change batch options or other details.
The candidate filtering master takes the newly-created my_sequences.cfa as its only input requirement. Its only output requirement is the candidate list to be created, which this document will refer to as my_candidates.cand. A custom filtering executable is optional (without it, the standard one provided with the module, sand_filter_mer_seq, will be used). The most basic set of optional arguments to the master indicate the port on which it should listen for workers (9123 in our Work Queue example above), the number of subsets to split the workload tasks into, and the -b option because we are using the binary cfa format. The filtering executable must be in the same directory as the filtering master.
% sand_filter_master -p 9123 -s 10000 -b my_sequences.cfa my_candidates.cand

Note that a progress table will be printed to standard output, while more detailed performance information is printed to standard error. We strongly suggest redirecting at least standard error, if not both, to a file while executing.
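To make the default k-mer criterion concrete, the following Python sketch selects every pair of sequences that shares at least one exact substring of length k. It illustrates the idea only and is not the implementation of sand_filter_mer_seq: it looks at the forward strand only, holds everything in memory, and the function name kmer_candidates is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kmer_candidates(sequences, k):
    """Return sorted name pairs whose sequences share at least one exact k-mer.

    sequences: dict mapping sequence name -> string of bases.
    """
    index = defaultdict(set)  # k-mer -> names of the sequences containing it
    for name, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)

    pairs = set()
    for names in index.values():
        pairs.update(combinations(sorted(names), 2))
    return sorted(pairs)
```

For example, with the reads {"a": "ACGTACGT", "b": "TTACGTAA", "c": "GGGGGGGG"} and k = 5, only the pair ("a", "b") is selected, because those two reads share the 5-mer ACGTA.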
For many large datasets, preprocessing must be done before candidate filtering in order to discard subsequences that are repeated so often that they will not be useful. Failing to do so does not change the correctness of the filtering, but will increase its runtime significantly. Given the original list of (uncompressed) sequences, we can create a list (called repeats.meryl) of short sequences of length k that are repeated n times with the following commands. In this case k is 24 and n is 100.
% meryl -B -m 24 -C -L 100 -v -o output.meryl -s my_sequences.fa
% meryl -Dt -s output.meryl -n 100 > repeats.meryl
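Conceptually, the repeat list is the set of k-mers whose occurrence count reaches a threshold. The Python sketch below shows that computation on a small scale. It is not how meryl works internally, it assumes that "repeated n times" means "occurring at least n times", it counts forward-strand k-mers only, and the function name repeated_kmers is hypothetical.

```python
from collections import Counter


def repeated_kmers(sequences, k, n):
    """Return the sorted k-mers that occur at least n times across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return sorted(kmer for kmer, count in counts.items() if count >= n)
```

Excluding such k-mers from filtering avoids generating a candidate pair for every two sequences that merely share a very common subsequence, which is where the runtime savings come from.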
The alignment master takes the sequence file my_sequences.cfa and the newly created my_candidates.cand as its input data requirements. It also requires a serial alignment executable, which we proposed above as align.exe. Examples of serial alignment executables provided with the SAND modules are sand_sw_alignment (a standard Smith-Waterman alignment algorithm) and sand_banded_alignment (a simple banding heuristic). The only output argument is the set of alignment computations, my_results.ovl. The most basic set of optional arguments to the master indicates the port on which it should listen for workers (9123 in our Work Queue example above) and the number of individual alignments per Work Queue task.
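As a rough illustration of what a serial alignment executable computes for each candidate pair, here is a minimal Smith-Waterman scoring sketch in Python. It is not the code of sand_sw_alignment: the scoring parameters (match=2, mismatch=-1, gap=-1) are arbitrary assumptions, only the best local score is returned rather than an OVL record, and the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of strings a and b (linear gap cost)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

A real alignment executable would also track where the best alignment starts and ends so that it can emit a complete overlap record. With an alignment executable in hand, the alignment master is run as follows.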
% sand_align_master -p 9123 -n 1000 align.exe my_candidates.cand my_sequences.cfa my_results.ovl

Again, a progress table will be printed to standard output, while more detailed performance information is printed to standard error. We strongly suggest redirecting at least standard error, if not both, to a file while executing. The progress table is formatted as:
Time | WI WR WB |  TS TW TR TC | TD   AR   AF    WS | Speedup
 90s |  0  0  0 |   0  0  0  0 |  0  nan  nan   nan |    0.00
 96s | 30  0  6 | 100 94  6  0 |  0  nan  nan   nan |    0.00
101s | 34  1  5 | 100 93  5  0 |  2 7.96 0.28 28.00 |    0.16
103s | 35  2 12 | 100 84 12  0 |  4 8.12 0.21 38.02 |    0.32

where: