SIFT

I. INSTALLATION

Ensure that the instructions in STANDALONE_INSTALLATION have already been
performed. Refer to the paths set in /config/config_env.txt for the
following directories:

   = Location of the temporary directory where the results are stored.
   = Location of the Blast installation.
   = Location of Blimps directory.

II. DATABASE FORMAT

This step is required if you are inputting a protein sequence (III.A).
The commands for inputting your own protein alignment (III.B) take no
database argument.

SIFT searches a database of protein sequences to find homologous sequences.
You will need to download a database of protein sequences and format it
properly so that SIFT subroutines can read it.

A. Database from EMBL:

   ftp://ftp.ebi.ac.uk/pub/databases/fastafiles/uniprot/

After downloading the databases, you should have 3 .gz files:

   uniprotkb_swissprot.gz
   uniprotkb_swissprotsv.gz
   uniprotkb_trembl.gz

In the directory containing the above files, run the following commands:

> zcat uniprotkb_swissprot.gz | awk '{if (/^>/) { print "> " $2 " " substr ($_,2,10000)} else { print $_}}' > swiss.uni
> gunzip uniprotkb_trembl.gz
> cat uniprotkb_trembl | perl -pe 's/\|/\t/g' | awk ' {if (/^>/) { print ">" $2 } else { print $_}}' | perl -pe 's/\>tr//' > trembl.uni
> cat swiss.uni trembl.uni > swiss_trembl.uni
> $NCBI/formatdb -i swiss_trembl.uni -t 'Uniprot-TrEMBL ' -p T
> $NCBI/formatdb -i swiss.uni -t 'Uniprot-Swiss ' -p T

****************** OR ***********************

B. Database from NCBI

Perform this only if you wish to use NCBI's database. Use ftp to obtain:

   ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz

In the directory containing nr.gz, run the following commands:

> gunzip nr.gz
> cat nr | perl -pe 's/\|/\t/g' | awk '{if (/^>/) { print $1 $2} else{ print;}}' > nr_formatted
> $NCBI/formatdb -i nr_formatted -t 'NCBI_NR ' -p T

These commands assume that NCBI's blast package (including psi-blast), as
well as formatdb, is installed in the directory that $NCBI refers to. The
commands rewrite the sequence names into the format that SIFT needs for
proper parsing.
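To make the effect of the NCBI reformatting pipeline easier to see, here is a minimal Python sketch of what the perl and awk steps do to each line of nr. The function name and the sample header are hypothetical and used only for illustration; the sketch is not part of SIFT and only mirrors the one-liner shown in II.B.

```python
def reformat_nr_line(line: str) -> str:
    # Step 1 (perl -pe 's/\|/\t/g'): turn every "|" into a tab.
    text = line.rstrip("\n").replace("|", "\t")
    # Step 2 (awk): non-header lines are printed unchanged.
    if not text.startswith(">"):
        return text
    # Header lines: awk splits on runs of blanks and tabs and prints
    # the first two fields joined with no separator.
    fields = text.split()
    first = fields[0] if fields else ""
    second = fields[1] if len(fields) > 1 else ""
    return first + second


if __name__ == "__main__":
    # Hypothetical header, for illustration only.
    print(reformat_nr_line(">gi|12345|ref|NP_000001.1| example protein"))
    print(reformat_nr_line("MKPVTLYDVA"))
```

Everything after the second field of a header is discarded, so each sequence name is reduced to a single short token.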
If you have your own protein sequence database and SIFT is not properly
recognizing the names, go to /config/orig_bin/Alignment.c and modify
"fix_names". Recompile ALL programs, then re-run the /config/setup_env.pl
perl script as directed in STANDALONE_INSTALLATION Section I Part D.

III. RUNNING SIFT

A. Input: Protein sequence. (SIFT chooses homologues.)

Requires 3 inputs:

   1) Protein sequence in fasta format.
   2) Protein database to search. These sequences are assumed to be
      functional.
   3) File of substitutions to be predicted on (optional). See
      test/lacI.subst for an example of the format. Alternatively, you
      can print scores for the entire protein sequence.

Results will be stored in tmp/.SIFTprediction.

COMMANDLINE FOR A LIST OF SUBSTITUTIONS:

If you are in the SIFT installation's bin directory, the commandline is:

   csh ./SIFT_for_submitting_fasta_seq.csh

EXAMPLE: If you have a list of substitutions, type the following:

   csh ./SIFT_for_submitting_fasta_seq.csh test/lacI.fasta test/lacI.subst 2.75

Results will appear in lacI.fasta.SIFTprediction and look something like:

   K2S    TOLERATED    0.08   3.47   LOW CONFIDENCE
   P3M    TOLERATED    0.08   3.35   LOW CONFIDENCE
   V15K   INTOLERANT   0.00   2.84

(Note: the results may differ depending on the version of the Uniprot
database that you use.)

According to this output, the SIFT score for K2S is 0.08, and the median
information of the sequences that have an amino acid represented at
position 2 is 3.47. If this number exceeds 3.25, the substitution is
annotated as having LOW CONFIDENCE, which means too few sequences were
represented at that position. For V15K the median information is 2.84,
which does not exceed 3.25, so there are enough sequences for confidence
in that prediction.

COMMANDLINE TO PRINT ALL SIFT SCORES

   bin/SIFT_for_submitting_fasta_seq.csh -

A dash "-" replaces the list of substitutions. Results will appear in
/lacI.fasta.SIFTprediction.
Each row is a position in the sequence (row 1 is amino acid position 1,
row 2 is amino acid position 2), and the SIFT scores for each amino acid
substitution at that position are printed in the row.

B. Input: Your own protein alignment

COMMANDLINE FORMAT FOR A LIST OF SUBSTITUTIONS:

If you are in SIFT_HOME, the commandline is:

   env BLIMPS_DIR= bin/info_on_seqs

where BLIMPS_DIR is the path to the Blimps directory set during
installation.

EXAMPLE: Type in:

   env BLIMPS_DIR= bin/info_on_seqs test/lacI.alignedfasta test/lacI.subst test/lacI.fasta.SIFTprediction

The prediction results will appear in test/lacI.fasta.SIFTprediction.
Read III.A for description of output.

COMMANDLINE TO PRINT ALL SIFT SCORES:

   env BLIMPS_DIR= bin/info_on_seqs -

EXAMPLE: Type in:

   env BLIMPS_DIR= bin/info_on_seqs test/lacI.alignedfasta - test/lacI.fasta.SIFTprediction

Scores for each position will appear in the file. Read III.A for
description of output.

COMMANDLINE TO PRINT ALL SIFT SCORES (protein sequence input, as in III.A)

   csh bin/SIFT_for_submitting_fasta_seq.csh - BEST

A dash "-" replaces the substitution file, and BEST is optional. Results
will appear in /.SIFTprediction. Read III.A for description of output.
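For downstream processing of a substitution-list result, the output lines described in III.A can be read with a short script. The following is a minimal Python sketch, not part of SIFT: the names Prediction and parse_prediction are hypothetical, and the column layout (substitution, call, SIFT score, median information) is assumed from the example output in III.A only. It does not handle the all-scores output produced with "-".

```python
import re
from typing import NamedTuple, Optional

# Median-information cutoff quoted in III.A; values above it are LOW CONFIDENCE.
LOW_CONFIDENCE_CUTOFF = 3.25

# Assumed layout, taken from the III.A example: "K2S TOLERATED 0.08 3.47 ..."
_LINE = re.compile(r"^([A-Z])(\d+)([A-Z])\s+(\S+)\s+([\d.]+)\s+([\d.]+)")


class Prediction(NamedTuple):
    ref: str            # original amino acid
    position: int       # 1-based position in the sequence
    alt: str            # substituted amino acid
    call: str           # e.g. TOLERATED or INTOLERANT
    score: float        # SIFT score
    median_info: float  # median information at this position
    low_confidence: bool


def parse_prediction(line: str) -> Optional[Prediction]:
    """Parse one result line; return None if it does not fit the assumed layout."""
    m = _LINE.match(line.strip())
    if m is None:
        return None
    ref, pos, alt, call, score, info = m.groups()
    median_info = float(info)
    return Prediction(ref, int(pos), alt, call, float(score), median_info,
                      median_info > LOW_CONFIDENCE_CUTOFF)


if __name__ == "__main__":
    for text in ("K2S TOLERATED 0.08 3.47 LOW CONFIDENCE",
                 "V15K INTOLERANT 0.00 2.84"):
        print(parse_prediction(text))
```

Recomputing the confidence flag from the median information, instead of trusting the trailing text, keeps the sketch independent of how the annotation is spelled in a given SIFT version.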