SIFT can provide exome-wide analysis of single nucleotide variants and indels.

    SIFT_exome_nssnvs.pl (for single nucleotide variants)
    SIFT_exome_indels.pl (for indels)
    SNPClassifier (for placing variants in genome)
    (for SNPClassifier, see the documentation in bin/SNPClassifier/ directory)

1. SIFT_exome_nssnvs.pl script takes as input a list of chromosome coordinates of coding single nucleotide variants and outputs variant annotation along with SIFT predictions and scores. This tool requires human variation databases built using SQLite3 that need to be downloaded before the tool can be used.

1a. Setting up:

Ensure that the instructions listed in STANDALONE_INSTALLATION have been performed to set up the standalone SIFT installation. We refer to SIFT_HOME as the location of the SIFT standalone installation.

i. Variation databases can be obtained from the following URL using wget. On the command prompt, run the following command:

    wget http://sift-dna.org/www/sift_hg19_db.tar.gz

Upon downloading sift_hg19_db.tar.gz, unzip the file:

    tar xzvf sift_hg19_db.tar.gz

Access the sift_hg19_db directory, and unzip each file that ends with .tar.gz.

    cd sift_hg19_db

Examples:

    tar xzvf Variation_CHR7.sqlite.tar.gz
    tar xzvf chr10_1kg.sqlite.tar.gz

Run the following 2 commands:

    mkdir 1000Genomes
    mv *_1kg.sqlite 1000Genomes

ii. Copy the two gff files from /config/gff_37 to the directory where the Human variation database is.

    /config/gff_37/repeat_db.gff
    /config/gff_37/small_indel_errors_ref.gff

The directory where the Human Variation 37 database is located is referred to below as the Human variation database directory.

1b.
Preparing the input

Format Example 1: RESIDUE BASED COORDINATE SYSTEM (comma separated)

    3,81780820,-1,T/C
    2,43881517,1,A/T,#User Comment
    2,43857514,1,T/C
    6,88375602,1,G/A,#User Comment
    22,29307353,-1,T/A
    10,115912482,-1,C/T

Format Example 2: SPACE BASED COORDINATE SYSTEM (comma separated)

    3,81780819,81780820,-1,T/C
    2,43881516,43881517,1,A/T,#User Comment
    2,43857513,43857514,1,T/C
    6,88375601,88375602,1,G/A,#User Comment
    22,29307352,29307353,-1,T/A
    10,115912481,115912482,-1,C/T

An example input file is provided: SIFT_HOME/test/snvs.input

Format Description
[comma separated: chromosome,coordinate,orientation,alleles,user comment(optional) ]
Please do not use spaces except in the user comments field.

Coordinate System: SIFT accepts both residue-based and space-based coordinates for single nucleotide variants. If there is only one column of coordinates, as shown in Example 1 above, SIFT assumes the coordinate system is residue-based. If there are two columns, as shown in Example 2 above, SIFT assumes the coordinate system is space-based.

The space-based coordinate system counts the spaces before and after bases rather than the bases themselves. Zero always refers to the space before the first base. The sequence 'ACGT' has coordinates (0,4) and its subsequence 'CG' has coordinates (1,3) as shown in Example 3 below. The difference between the start and end coordinates gives the sequence length. Misinterpretation of these coordinates can easily lead to 'off-by-one' errors. Space-based coordinates become necessary when describing insertions/deletions and genomic rearrangements.

Example 3:

    0 A 1 C 2 G 3 T 4

In a residue-based system, as described in Example 4 below, each base is assigned a coordinate based on its absolute position, starting from 1. The sequence 'ACGT' has coordinates (1,4) and its subsequence 'CG' has coordinates (2,3).

Example 4:

    A C G T
    1 2 3 4

Orientation: Use 1 for positive strand and -1 for negative strand.
If orientation is not known, use 1 as default.

Alleles: Use 'base1/base2' where either base1 or base2 may be the reference allele. SIFT will predict for the non-reference allele only. If you need a prediction for the reference allele, use base1/base1 where base1 is the reference allele.

1c. Running the tool

Navigate to the SIFT_HOME/bin directory.

Following is the usage of SIFT_exome_nssnvs.pl

usage: ./SIFT_exome_nssnvs.pl
    -i  input file of variants (format described in 1b)
    -d  Human variation database directory (see 1a)
    -o  output directory (default: SIFT_HOME/tmp)
    -m  Yes to output multiple transcripts if exists: default No

The following optional parameters can also be entered if you wish the results to include additional information. They are not included by default.

    -A 1 to output Ensembl Gene ID
    -B 1 to output Gene Name
    -C 1 to output Gene Description
    -D 1 to output Ensembl Protein Family ID
    -E 1 to output Ensembl Protein Family Description
    -F 1 to output Ensembl Transcript Status (Known / Novel)
    -G 1 to output Protein Family Size
    -H 1 to output Ka/Ks (Human-mouse)
    -I 1 to output Ka/Ks (Human-macaque)
    -J 1 to output OMIM Disease: default
    -K 1 to output Allele Frequencies (All Hapmap Populations - weighted average)
    -L 1 to output Allele Frequencies (CEU Hapmap population)
    -M 1 to output Allele Frequencies (HCB Hapmap population)
    -N 1 to output Allele Frequencies (JPT Hapmap population)
    -O 1 to output Allele Frequencies (YRI Hapmap population)
    -P 1 to output 1000 Genomes Average Allele Frequencies
    -Q 1 to output 1000 Genomes European Population Allele Frequencies
    -R 1 to output 1000 Genomes East Asian Population Allele Frequencies
    -S 1 to output 1000 Genomes West African Population Allele Frequencies
    -T 1 to output 1000 Genomes South Asian Population Allele Frequencies
    -U 1 to output 1000 Genomes American Population Allele Frequencies

To run the example input provided in the SIFT_HOME/test directory,

    ./SIFT_exome_nssnvs.pl -i ../test/snvs_build37.input -d <Human variation database directory>

The output directory is SIFT_HOME/tmp/ by default and is printed to the screen after submitting the command line.

2.
SIFT_exome_indels.pl script takes as input a list of chromosome coordinates of coding insertion/deletion variants and outputs variant annotation. SIFT scores and predictions are not provided at this stage. This tool requires human coding information files that need to be downloaded before the tool can be used.

2a. Setting up:

The human coding information files can be obtained from the following URLs using wget:

    http://sift.bii.a-star.edu.sg/packages/db/Coding_info_36
    http://sift.bii.a-star.edu.sg/packages/db/Coding_info_37

The list of files to be downloaded from Coding_info_36 can be found in the directory SIFT_db_list/wget_list_Coding_info_36
The list of files to be downloaded from Coding_info_37 can be found in the directory SIFT_db_list/wget_list_Coding_info_37

For example, run the following command to get 10.1-135374737.coding_info for Coding_info_36:

    wget http://sift.bii.a-star.edu.sg/packages/db/Coding_info_36/10.1-135374737.coding_info

Download all the files under Coding_info_36, and place them in the directory SIFT_HOME/coding_info/Coding_info_36/, or a directory of your choice. This directory is referred to in this document as the Coding_info_36 directory.

Perform the same steps for Coding_info_37. Unzip the files and place them in SIFT_HOME/coding_info/Coding_info_37, or a directory of your choice. This location is referred to in this document as the Coding_info_37 directory.

2b.
Preparing the input

Format Example: SPACE BASED COORDINATE SYSTEM (comma separated)

    10,102760304,102760304,1,GCGGCT,#User comment 1
    10,50205013,50205013,1,ACACACACACAC
    5,179134934,179134935,1,/,#User comment 2
    1,153108866,153108866,1,CTGCTGCTGCTG
    11,6368547,6368547,1,GCTGGCGCTGGC
    11,65081932,65081932,1,AGCAGC
    12,110521161,110521164,1,/
    12,116990733,116990736,1,/
    12,123453048,123453048,1,CTG
    12,131113090,131113090,1,GCA
    12,1932613,1932613,1,CTG

Format Description
[comma separated: chromosome,begin coordinate,end coordinate,orientation,alleles,user comment(optional) ]
Please do not use spaces except in the user comments field.

Coordinate System: SIFT accepts only space-based coordinates for insertion / deletion variants. The space-based coordinate system counts the spaces before and after bases rather than the bases themselves. Zero always refers to the space before the first base. The sequence 'ACGT' has coordinates (0,4) and its subsequence 'CG' has coordinates (1,3) as shown in Example 1 below. The difference between the start and end coordinates gives the sequence length. Misinterpretation of these coordinates can easily lead to 'off-by-one' errors. Space-based coordinates become necessary when describing insertions/deletions and genomic rearrangements.

Example 1:

    0 A 1 C 2 G 3 T 4

Orientation: Use 1 for positive strand and -1 for negative strand. If orientation is not known, use 1 as default.

Alleles:

For an insertion, the begin and end coordinates should be the same and the allele should be a string of inserted nucleotides in one of the following formats.
    1. ----/ATGC
    2. -/ATGC
    3. ATGC

For a deletion, the difference between the begin and end coordinates should be equal to the length of the deleted string. The allele can either be left blank or be specified in one of the following formats.
    1. ATGC/----
    2. ATGC/-
    3. /

2c.
Running the tool

Navigate to the SIFT_HOME/bin directory.

Following is the usage of SIFT_exome_indels.pl

usage: ./SIFT_exome_indels.pl
    -i  input file of indels (format described in 2b)
    -c  coding information directory (see 2a)
    -d  Human variation database directory
    -o  output directory (default: SIFT_HOME/tmp)

All values should be in local 0 space based coordinates.

To run the example input provided in the SIFT_HOME/test directory,

    perl ./SIFT_exome_indels.pl -i ../test/indels_build36.input -c <coding information directory> -d <Human variation database directory>

The default output directory is SIFT_HOME/tmp/ and the job_id is printed to the screen after submitting the command line.

2d. Description of output
(This can also be viewed on the SIFT website at http://sift.jcvi.org/www/chr_coords_example_indels.html)

Amino Acid Position Change
This column contains the change coordinates within the original protein sequence and the modified protein sequence. For example, the insertion of GCGGCT at location 102760304 of chromosome 10 of Homo sapiens (represented by input row: 10,102760304,102760304,1,GCGGCT) inserts two additional amino acids, Arginine 'R' and Serine 'S', at position 145 to 147 (space based coordinates) in the modified protein sequence.

>ENST00000238965; MISMATCH = 145-145
GPQEQGSPASCFETSPAGHATQASPYHPRACRGGFYLLPVNGFPEEEDNGELRERLGALK
VSPSASAPRHPHKGIPPLQDVPVDAFTPLRIACTPPPQLPPVAPRPLRPNWLLTEPLSRE
HPPQSQIRGRAQSRSRSRSRSRSRSSRGQGKSPGRRSPSPVPTPAPSMTNGRYHKPRKAR
PPLPRPLDGEAAKVGAKQGPSESGTEGTAKEAAMKNPSGELKTVTLSKMKQSLGISISGG
IESKVQPMVKIEKIFPGGAAFLSGALQAGFELVAVDGENLEQVTHQRAVDTIRRAYRNKA
REPMELVVRVPGPSPRPSPSDSSALTDGGLPADHLPAHQPLDAAPVPAHWLPEPPTNPQT
PPTDARLLQPTPSPAPSPALQTPDSKPAPSPRIP

>ENST00000238965; MISMATCH = 145-147
GPQEQGSPASCFETSPAGHATQASPYHPRACRGGFYLLPVNGFPEEEDNGELRERLGALK
VSPSASAPRHPHKGIPPLQDVPVDAFTPLRIACTPPPQLPPVAPRPLRPNWLLTEPLSRE
HPPQSQIRGRAQSRSRSRSRSRSRSrsSRGQGKSPGRRSPSPVPTPAPSMTNGRYHKPRK
ARPPLPRPLDGEAAKVGAKQGPSESGTEGTAKEAAMKNPSGELKTVTLSKMKQSLGISIS
GGIESKVQPMVKIEKIFPGGAAFLSGALQAGFELVAVDGENLEQVTHQRAVDTIRRAYRN
KAREPMELVVRVPGPSPRPSPSDSSALTDGGLPADHLPAHQPLDAAPVPAHWLPEPPTNP
QTPPTDARLLQPTPSPAPSPALQTPDSKPAPSPRIP

Indel location
This percentage indicates the approximate location of the indel in the protein.
For example, a value less than 50% means that the indel is located in the first half of the protein and is close to the amino terminus, whereas a number greater than 50% means that the indel is closer to the carboxy terminus.

Transcript Visualization

    <---{}--{}[]--[*.]--[]--[]--[]--[]--[]--[]--[]{}---|

The above example visualization mimics the structure of the transcript containing the indel.

    <--- indicates the 3' end
    ---| indicates the 5' end
    {}   indicate UTR
    []   indicates a coding exon
    --   indicates an intron
    .    indicates the start of insertion or deletion
    *    indicates the end of deletion

If the 3' end of the transcript appears to the left of the 5' end, as in this case, then the transcript is transcribed from the negative strand. This transcript has two 3'UTRs, one 5'UTR, nine exons and nine introns. The indel both starts and ends in the 8th coding exon.

Nucleotide change
The input allele (insertion or deletion) and +/- 5 base pairs are shown. For example, the user input for insertion variant "10,102760304,102760304,1,GCGGCT" will populate this column with the following information

    cggct-GCGGCT-acggc

whereas a user input for deletion variant "12,110521161,110521164,1,/" will populate this column with the following information

    TGCTG-ctg-TTGCT

For insertions, the inserted bases are displayed in uppercase and the flanking bases are displayed in lowercase. For deletions, the deleted bases are displayed in lowercase whereas the flanking bases are displayed in uppercase.

Amino acid change
This column displays the amino acid change caused by the indel. For example QQTT->QQqTT indicates the addition of amino acid Glutamine ('Q') in the modified protein sequence, whereas EEeDA->EEDA indicates the deletion of amino acid Glutamic acid, 'E' in the modified protein sequence.

Protein sequence change
This column links original and modified protein sequence files with regions of mismatch (caused due to indel) colored in red.
For example, an insertion represented by the user input "1,153108866,153108866,1,CTGCTGCTGCTG" causes an expansion in polyglutamine tract as shown in the following fasta format sequences. The Fasta headers contain the Ensembl transcript ID along with the coordinates of change.

>ENST00000271915; MISMATCH = 80-80
MDTSGHFHDSGVGDLDEDPKCPCPSSGDEQQQQQQQQQQQQPPPPAPPAAPQQPLGPSLQ
PQPPQLQQQQQQQQQQQQQQPPHPLSQLAQLQSQPVHPGLLHSSPTAFRAPPSSNSTAIL
HPSSRQGSQLNLNDHLLGHSPSSTATSGPGGGSRHRQASPLVHRRDSNPFTEIAMSSCKY
SGGVMKPLSRLSASRRNLIEAETEGQPLQLFSPSNPPEIVISSREDNHAHQTLLHHPNAT
HNHQHAGTTASSTTFPKANKRKNQNIGYKLGHRRALFEKRKRLSDYALIFGMFGIVVMVI
ETELSWGLYSKDSMFSLALKCLISLSTIILLGLIIAYHTREVQLFVIDNGADDWRIAMTY
ERILYISLEMLVCAIHPIPGEYKFFWTARLAFSYTPSRAEADVDIILSIPMFLRLYLIAR
VMLLHSKLFTDASSRSIGALNKINFNTRFVMKTLMTICPGTVLLVFSISLWIIAAWTVRV
CERYHDQQDVTSNFLGAMWLISITFLSIGYGDMVPHTYCGKGVCLLTGIMGAGCTALVVA
VVARKLELTKAEKHVHNFMMDTQLTKRIKNAAANVLRETWLIYKHTKLLKKIDHAKVRKH
QRKFLQAIHQLRSVKMEQRKLSDQANTLVDLSKMQNVMYDLITELNDRSEDLEKQIGSLE
SKLEHLTASFNSLPLLIADTLRQQQQQLLSAIIEARGVSVAVGTTHTPISDSPIGVSSTS
FPTPYTSSSSC

>ENST00000271915; MISMATCH = 80-84
MDTSGHFHDSGVGDLDEDPKCPCPSSGDEQQQQQQQQQQQQPPPPAPPAAPQQPLGPSLQ
PQPPQLQQQQQQQQQQQQQQqqqqPPHPLSQLAQLQSQPVHPGLLHSSPTAFRAPPSSNS
TAILHPSSRQGSQLNLNDHLLGHSPSSTATSGPGGGSRHRQASPLVHRRDSNPFTEIAMS
SCKYSGGVMKPLSRLSASRRNLIEAETEGQPLQLFSPSNPPEIVISSREDNHAHQTLLHH
PNATHNHQHAGTTASSTTFPKANKRKNQNIGYKLGHRRALFEKRKRLSDYALIFGMFGIV
VMVIETELSWGLYSKDSMFSLALKCLISLSTIILLGLIIAYHTREVQLFVIDNGADDWRI
AMTYERILYISLEMLVCAIHPIPGEYKFFWTARLAFSYTPSRAEADVDIILSIPMFLRLY
LIARVMLLHSKLFTDASSRSIGALNKINFNTRFVMKTLMTICPGTVLLVFSISLWIIAAW
TVRVCERYHDQQDVTSNFLGAMWLISITFLSIGYGDMVPHTYCGKGVCLLTGIMGAGCTA
LVVAVVARKLELTKAEKHVHNFMMDTQLTKRIKNAAANVLRETWLIYKHTKLLKKIDHAK
VRKHQRKFLQAIHQLRSVKMEQRKLSDQANTLVDLSKMQNVMYDLITELNDRSEDLEKQI
GSLESKLEHLTASFNSLPLLIADTLRQQQQQLLSAIIEARGVSVAVGTTHTPISDSPIGV
SSTSFPTPYTSSSSC

Causes Nonsense Mediated Decay
Nonsense mediated decay (NMD) is a cellular mechanism of mRNA surveillance to detect nonsense mutations and prevent the
expression of truncated or erroneous proteins. This column indicates whether the input indel is likely to cause NMD. If NMD occurs, then the indel is equivalent to a gene deletion because the mRNA is never translated.

There is no NMD when:
    1) the resulting premature termination codon is in the last exon
    -or-
    2) the resulting premature termination codon is in the last 50 nucleotides in the second to last exon

Repeat detected
This column gets populated if the input insertion/deletion is found to expand or contract a coding repeat region. For example, an input row '1,153108866,153108866,1,CTGCTGCTGCTG' causes an insertion resulting in the expansion of a poly-glutamine repeat. A poly-glutamine repeat of length 14 that expands to length 18 is illustrated in this column by 'PQL(q)14P-->PQL(q)18P'. The repeat amino acid(s) are shown in parentheses followed by the repeat number and bounded by flanking amino acids.

Warning: NCBI reference miscall
If you receive a reference miscall warning in the coordinates column (first column) of the output table, this means that your input coordinates overlap or contain a location that is not a true indel, but is likely to be an error in the NCBI human genome reference sequence.
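
As a rough illustration of the two NMD exceptions listed under "Causes Nonsense Mediated Decay", the sketch below checks whether a premature termination codon falls in the last exon or in the last 50 nucleotides of the second to last exon. The function name escapes_nmd, the list of exon lengths, and the 0-based offset into the spliced transcript are assumptions made for this example. They are not part of SIFT_exome_indels.pl, whose internal implementation may differ.

```python
from typing import Sequence


def escapes_nmd(exon_lengths: Sequence[int], ptc_offset: int, window: int = 50) -> bool:
    """Return True when no NMD is expected under the two rules above.

    exon_lengths: exon lengths in nucleotides, in transcript (5' to 3') order.
                  This is a hypothetical representation, not SIFT's data model.
    ptc_offset:   0-based offset of the first nucleotide of the premature
                  termination codon within the spliced transcript.
    window:       size of the exempt region at the end of the
                  second to last exon (50 nucleotides in the text).
    """
    total = sum(exon_lengths)
    if not 0 <= ptc_offset < total:
        raise ValueError("ptc_offset lies outside the transcript")

    # A single-exon transcript has only a last exon, so rule 1 applies.
    if len(exon_lengths) == 1:
        return True

    # Rule 1: the termination codon is in the last exon.
    last_exon_start = total - exon_lengths[-1]
    if ptc_offset >= last_exon_start:
        return True

    # Rule 2: the termination codon is within the last `window`
    # nucleotides of the second to last exon.
    penultimate_start = last_exon_start - exon_lengths[-2]
    window_start = max(penultimate_start, last_exon_start - window)
    return ptc_offset >= window_start


if __name__ == "__main__":
    exons = [100, 200, 150]          # illustrative exon lengths only
    for offset in (50, 249, 250, 310):
        print(offset, escapes_nmd(exons, offset))
```

With the illustrative exon lengths above, the last exon starts at offset 300, so the exempt region of the second to last exon covers offsets 250 to 299.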