SOAP, developed by R. Li and colleagues at the Beijing Genomics Institute, is a program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. It features in super fast and accurate alignment for huge amounts of short reads generated by Illumina/Solexa Genome Analyzer. It require only 2 minutes aligning one million single-end reads onto the human reference genome. SOAPaligner is that it now supports a wide range of the read length. For detailed information, see http://soap.genomics.org.cn/
SOAP contains several programs:
- SOAPaligner/soap2 is a program for faster and efficient alignemt for short oligonucleotide onto reference sequences. SOAPaligner/soap2 is compatible with numerous applications, including single-read or pair-end resequencing.
- SOAPsnp, SOAPsnp is an accurate consensus sequence builder based on soap1 and SOAPaligner/soap2's alignment output. It calculates a quality score for each consensus base, which can be used for any latter process to call SNPs.
- SOAPindel, SOAPindel is developed to find the insertion and deletion specially for re-sequence technology
- SOAPsv, SOAPsv is a program for detecting the structural variation
- SOAPdenovo, a short read de novo assembly tool, is a package for assembling short oligonucleotide into contigs and scanfolds.
- SOAPals is designed to use RNA-Seq reads for genome-wide ab initio detection of splice junction sites and identification of alternative splicing (als) events. Comparing to previous tools, the tool detects junctions with relatively low false positive. And remarkably it is useful for detecting junctions for mRNAs with relatively low expression level.
To add the SOAP executables to your path, it is easiest to use 'module load soap', as in the example below:
helix% module avail soap ----------- /usr/local/Modules/3.2.9/modulefiles ------------- soap/2.2.1 soap3/0.01b helix% module load soap helix% module list Currently Loaded Modulefiles: 1) soap/2.2.1
How To Use
Before running a SOAP executable on Helix, type 'module load soap' to set up the paths correctly. This needs to be done only once per login session.
To run SOAPaligner, we need to build index files for the reference genome, and then search reads against the formatted index files.
1. Format reference sequence:
eg: ./2bwt-builder ~/human_genome.fa
Then under the directory there will be 13 index files, all their prefixes are your_fasta file name with “.index” added, e.g. human_genome.fa.index. The suffixes include *.amb, *.ann, *.bwt, *.fmv, *.hot, *.lkt, *.pac, *.rev.bwt, *.rev.fmv, *.rev.lkt, *.rev.pac, *.sa, and *.sai.
2. Alignment quick start:
For alignment of single-end reads:
For paired-end reads:
NOTE: For the –D option, the program can only accept the prefix of your index files, such as “~/human_genome.fa.index”.
-a STR Query file, for SE reads alignment or one end of PE reads
-b STR Query b file, one end of PE reads
-o STR Output file for alignment results
-2 STR Output file contains mapped but unpaired reads when do PE alignment
-u STR Output file for unmapped reads, [none]
-m INT Minimal insert size INT allowed for PE, 
-x INT Maximal insert size INT allowed for PE, 
-n INT Filter low quality reads contain more INT bp Ns, 
-t Output reads id instead reads name, [none]
-r INT How to report repeat hits, 0=none; 1=random one; 2=all, 
-A Report all the mismatches in SOAP format or not, [none] only report the mismatches in seed, and the number of mismatches in remain
-R RF alignment for long insert size(>= 2k bps) PE data, [none] FR alignment
-l INT For long reads with high error rate at 3'-end, those can't align whole length, then first align 5' INT bp subsequence as a seed,  use whole length of the read
-v INT Totally allowed mismatches in one read, 
-M INT Match mode for each read or the seed part of read, which shouldn't contain more than 2 mismaches, 
0: exact match only
1: 1 mismatch match only
2: 2 mismatch match only
3: [gap] (coming soon)
4: find the best hits
-p INT Multithreads, n threads, 
1. Required parameters:
Note that here we say “sorted’ means alignments of each chromosome are sorted first by chromosome name lexicographically and then by coordinates on each chromosome numerically.
-d <FILE> Reference DNA sequence in FASTA format
-o <FILE> Output consensus file
2. Optional parameters:(default in [ ])
FASTQ files generated by Illumina base-calling pipeline use ‘@’ as 0, but some institutes use ‘!’ as 0.
-g <Double> Global error dependency coefficient, 0.0(complete dependent)~1.0(complete independent)[0.9]
-p <Double> PCR error dependency coefficient, 0.0(complete dependent)~1.0(complete independent)[0.5]
Sequencing errors are found slightly repeatable (once an error occur, additional errors also tend to occur) due to various reasons. Therefore, observations of sequencing errors are not complete independent.The main source of repeatable errors is believed to be PCR amplification in sequencing process. The proper values of the two parameters rely on wetlab process. Nonetheless, the default value generally work at most time.
-r <Double> novel altHOM prior probability [0.0005]
-e <Double> novel HET prior probability [0.0010]
The two are prior probabilities of homozygous SNPs (altHOM) and heterozygous SNPs (HET), which are
used in Bayes formula calculation. Note these are prior probabilities of a new (novel) SNP. They are
expected to be stringent. For different species, the two values should change if necessary.
-t set transition/transversion ratio to 2:1 in prior probability
-s <FILE> Pre-formatted known SNP information.
The file consist of a lot of lines like this one:
chr1 201979756 1 1 0 0.161 0 0 0.839 rs568
The columns from left to right are: name of chromosome, coordinate on the chromosome, whether the SNP has allele frequency information (1 is true, 0 is false), whether the SNP is validated by experiment (1 is true, 0 is false), whether the SNP is actually an indel (1 is true, 0 is false), frequency of A, frequency of C, frequency of T, frequency of G, SNP id. For known SNP sites that do not have allele frequency information, the frequency information can be arbitrarily determined as any positive values, which only imply what alleles have already been deposited in the database.
-2 specify this option will REFINE SNP calling using known SNP information [Off]
-a <Double> Validated HET prior, if no allele frequency known [0.1]
-b <Double> Validated altHOM prior, if no allele frequency known[0.05]
-j <Double> Unvalidated HET prior, if no allele frequency known [0.02]
-k <Double> Unvalidated altHOM rate, if no allele frequency known[0.01]
The parameters are related to using external SNP information to alter prior probabilities for SNP calling.
SOAPsnp will try using allele frequency information as prior probability in calling genotypes for each site.
If the allele frequency information is absent, it will use the above 4 parameters as prior probability.
-u Enable rank sum test (that check whether the two allele of a possible HET call have same sequencing quality) to give HET further penalty for better accuracy. [Off]
-n Enable binomial probability calculation (that check whether the two allele are observed equally)to give HET further penalty for better accuracy. [Off]
-m Enable monoploid calling mode, this will ensure all consensus as HOM and you probably should SPECIFY higher altHOM rate. [Off]
-q Only output potential SNPs. Useful in Text output mode. [Off]
-M <FILE> Output the quality calibration matrix; the matrix can be reused with -I if you rerun the program
-I <FILE> Input previous quality calibration matrix. It cannot be used simutaneously with -M
-L <short> maximum length of read 
Please note that once length of some reads exceeds the parameter will probably collapse the program.
-Q <short> maximum FASTQ quality score 
-F <int> Output format. 0: Text; 1: GLFv2; 2: GPFv2.
-E <String> Extra headers EXCEPT CHROMOSOME FIELD specified in GLFv2 output. Format is "TypeName1:DataName1:TypeName2:DataName2"
-T <FILE> Only call consensus on regions specified in FILE. Format of this file is:
-h Display this help
1) Chromosome ID
2) Coordinate on chromosome, start from 1
3) Reference genotype
4) Consensus genotype
5) Quality score of consensus genotype
6) Best base, average quality score of best base
7) Count of uniquely mapped best base
8) Count of all mapped best base
9) Second best bases, average quality score of second best base
10) Count of uniquely mapped second best base
11) Count of all mapped second best base
12) Sequencing depth of the site, rank sum test p_value
13) Average copy number of nearby region
14) Whether the site is a dbSNP.
2.GLFv2 and GPFv2
GLFv2 (Genome Likelihood Format v2) is a binary file format proposed by Prof. R. Durbin.
Soap sample files can be copied from:
The infile.fasta contains 4 million reads of 33-bp Solexa sequencing data looking like this:
>HWI-EAS3_2:6:1:860:735 GTTTCCGTAGTGTAGTGGTTATTCTTATTCCGT >HWI-EAS3_2:6:1:446:8 GGTAGATCTGATGTCTGGTGAGTCGTATGCCGT >HWI-EAS3_2:6:1:29:559 GCATAAGATTAGAAGGTCGTATGCCGTCTTCTG .......... .......... >HWI-EAS9_2:6:1:454:45 GTACAGTACTGTGATAACTGATCGTATGCCGTC >HWI-EAS9_2:6:1:181:407 GTACAGTACTGTGATAACTGAATCGTATGCCGT
The backbonefile.fasta is a 250 million bp genomic sequence in fasta format as well.
The following example formats the reference sequence, align reads to reference sequence, then run SOAPsnp to assembly consensus sequence before identify SNPs.
1. Format reference sequence:
<helix> 230% /usr/local/soap2/soap/2bwt-builder backbonefile.fasta
Parsing FASTA file..
Finished. Parsed 1 sequences.
Elapsed time = 13.85 s
Elapsed time = 43.20 s
Finished constructing BWT in 97 iterations.
Elapsed time = 108.90 s
Finished saving BWT.
Elapsed time = 1.06 s
Building Reversed BWT..
Finished constructing Reversed BWT in 97 iterations.
Elapsed time = 123.99 s
Finished saving BWT.
Elapsed time = 1.05 s
Finished loading BWT.
Elapsed time = 0.12 s
Building SA value...
Finished building SA value.
Elapsed time = 98.08 s
Building High-Occ Hash Table...
Elapsed time = 54.51 s
Building SA index...
Finished building SA index.
Elapsed time = 6.43 s
Index building is completed.
Total elapsed time = 451.21 s
Begin Program SOAPaligner/soap2
Thu Apr 23 09:03:04 2009
Query File a: jean_in.solexa.fasta
Output File: out.soap
Load Index Table ...
Load Index Table OK
Begin Alignment ...
131072 ok 6.43 sec
262144 ok 13.03 sec
393216 ok 19.64 sec
524288 ok 26.35 sec
655360 ok 33.02 sec
786432 ok 39.81 sec
917504 ok 46.57 sec
3670016 ok 191.44 sec
3801088 ok 198.58 sec
3932160 ok 205.65 sec
4063232 ok 212.68 sec
4150401 ok 217.38 sec
Total Reads: 4150401
Alignment: 5418 ( 0.13%)
Total Elapsed Time: 219.59
- Load Index Table: 2.15
- Alignment: 217.44
Thu Apr 23 09:06:44 2009
% soapsnp -i out.soap -d backbonefile.fasta -o out.soapsnp
Reading Chromosome and dbSNP information Done.
Correction Matrix Done!