Scientific Supercomputing at the NIH

Burrows-Wheeler Alignment (BWA) Tool on Helix

BWA is a fast, lightweight tool that aligns short sequences to a sequence database, such as the human reference genome. By default, BWA finds alignments within edit distance 2 of the query sequence, except that it disallows gaps close to the ends of the query. It can also be tuned to find a fraction of longer gaps, at the cost of speed and of more false alignments.
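
The tuning mentioned above is done through options to the 'aln' command. The following is a minimal sketch, assuming the option letters described in the BWA manual linked under Documentation: -n for the maximum edit distance, -o for gap opens, -e for gap extensions, and -i for the distance from the read ends within which indels are disallowed. The function name, the file names and the values are illustrative only, not recommended settings. Defaults differ between BWA releases, so check the 'bwa aln' usage message of the installed version first.

```shell
# Hypothetical helper: run 'bwa aln' with looser mismatch/gap settings.
# Option letters follow the BWA manual; the values are examples only.
BWA=${BWA:-/usr/local/bwa/bwa}   # install path used on this page

aln_relaxed() {
    ref=$1      # indexed reference (FASTA file name used with 'index')
    reads=$2    # reads in FASTQ format
    out=$3      # output file holding the SA coordinates (.sai)
    # -n 3 : allow up to 3 differences (edit distance)
    # -o 1 : allow at most 1 gap open
    # -e 5 : allow up to 5 gap extensions (finds longer gaps, but slower)
    # -i 5 : disallow indels within 5 bp of the ends of the read
    "$BWA" aln -n 3 -o 1 -e 5 -i 5 "$ref" "$reads" > "$out"
}
```

For example, 'aln_relaxed tttF3.csfasta ttt.fastq ttt.sai' would write the relaxed alignments to ttt.sai. Raising -n or -e widens the search, so expect longer run times and more false alignments.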

BWA excels in speed: mapping 2 million high-quality 35bp reads against the human genome can be done in 20 minutes. Aligners usually gain such speed at the cost of huge memory use, disallowing gaps, and/or hard limits on the maximum read length and the maximum number of mismatches. BWA makes none of these trade-offs. It remains relatively lightweight (2.3GB of memory for human alignment), performs gapped alignment, and sets no hard limit on read length or maximum mismatches.

Given a database file in FASTA format, BWA first builds a BWT index with the 'index' command. The alignments in suffix array (SA) coordinates are then generated with the 'aln' command. The resulting file contains ALL the alignments found by BWA. The 'samse' (single-end) or 'sampe' (paired-end) command then converts SA coordinates to chromosomal coordinates. For single-end reads, most of the computing time is spent on finding the SA coordinates (the 'aln' command). For paired-end reads, half of the computing time may be spent on pairing (the 'sampe' command) given 32bp reads. Using longer reads reduces the fraction of time spent on pairing, because each end in a pair is mapped to fewer places.
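
The three steps above can be sketched for a paired-end sample as follows. This is only an illustration: the function name and all file names are hypothetical, and the argument order for 'sampe' is taken from the BWA manual linked under Documentation. The function prints the commands rather than running them, so they can be checked before use.

```shell
# Print the BWA commands for one paired-end sample, one command per line.
# Hypothetical helper; every file name passed in is a placeholder.
BWA=${BWA:-/usr/local/bwa/bwa}   # install path used on this page

bwa_pe_commands() {
    ref=$1    # reference in FASTA format
    r1=$2     # first-end reads (FASTQ)
    r2=$3     # second-end reads (FASTQ)
    out=$4    # prefix for the .sai and .sam output files

    # 1. Build the BWT index (needed once per reference).
    echo "$BWA index -a bwtsw $ref"
    # 2. Find the SA coordinates for each end separately.
    echo "$BWA aln $ref $r1 > $out.1.sai"
    echo "$BWA aln $ref $r2 > $out.2.sai"
    # 3. Pair the ends and convert to chromosomal coordinates (SAM).
    echo "$BWA sampe $ref $out.1.sai $out.2.sai $r1 $r2 > $out.sam"
}
```

Calling 'bwa_pe_commands ref.fa reads_1.fastq reads_2.fastq sample1' prints four command lines, which can be reviewed and then piped to sh. File names containing spaces would need extra quoting.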

Version

Type '/usr/local/bwa/bwa' on the command line. Run without arguments, BWA prints its version and usage.

Sample Session on Helix

BWA sample files can be copied from:

/usr/local/bwa/sample

Copy these sample files into a directory in your own area, then run the following from that directory:

% cd /home/user/bwa/run1

% /usr/local/bwa/bwa index -a bwtsw tttF3.csfasta

% /usr/local/bwa/bwa aln tttF3.csfasta ttt.fastq > ttt.sai

% /usr/local/bwa/bwa samse tttF3.csfasta ttt.sai ttt.single.fastq > ttt.sam

Documentation

http://bio-bwa.sourceforge.net/bwa.shtml