SEQIO -- A Package for Sequence File I/O

FORMAT.DOC - The SEQIO File Formats


These three format variants are included in the `GCG-*' set of formats.

File Format Types

Each format is considered to be one of the following types, which gives a basic description of the capabilities and common uses of the format:
T_SEQONLY
The entries of the format contain only a sequence. They do not contain any place to store sequence information or comments.
(Plain, Raw)
T_DATABANK
The entries are used mainly to store unadorned sequences (i.e., not used for sequences containing alignment characters).
(GenBank, PIR, EMBL, Swiss-Prot, their GCG-* forms, ASN.1)
T_GENERAL
The entries can contain both unadorned sequences and alignment sequences. In addition, there is a place to store sequence information and comments.
(FASTA, NBRF, IG/Stanford, their GCG-* forms, GCG)
T_LIMITED
The entries can contain both unadorned sequences and alignment sequences, but there is no place to store extra sequence information and comments.
(FASTA-old, NBRF-old, IG-old, their GCG-* forms)
T_ALIGNMENT
The entries are used mainly to store multiple sequence alignments. They are not considered to contain much sequence information and do not have any place to store comments.
(PHYLIP, Clustalw, MSF)
T_OUTPUT
The format is the output of an alignment program, and these formats are read-only.
(FASTA-output, BLAST-output)
These types may be of some use when developing software that wishes to perform different operations based on this file type information (the "fmtseq" program included in the distribution is one such piece of software).

(NOTE: Why is having someplace to store comments so important? Well, one of the goals of this package is to try to unify all of the file formats and be able to capture and transfer as much information as possible from one format to another. The plans are to use these comment sections as the place to store any extra information for which there is no explicit spot in the entry. And that can't happen if the file format doesn't have a comment section. This is also the reason for the FASTA, NBRF and IG/Stanford variants mentioned above.)

Automatically Determining the Format Type

The SEQIO package has the ability to automatically determine the format of a file, if that file is one of the following formats:

The Raw format and all of the format variations (*-old, *fast) must be explicitly specified in order to be used. The package makes the format determination in two phases. The first phase looks at the initial non-whitespace text of the file. The second phase looks at the text of the first entry in the file. Both of these phases occur during the opening of the file.

First Phase

The first phase operation first skips over an e-mail header at the beginning of the file, if the file begins with the string "From ". It then looks for the first non-whitespace character of the file and attempts to match that non-whitespace text to one of the following keywords (where the matching is case-insensitive and the `?' character is a wildcard which can match any character in the file):
    GenBank - "LOCUS ", "GB???.SEQ          Genetic Sequence Data Bank"
       NBRF - ">??;"
      FASTA - ">"
       EMBL - "ID   ", "CC ", "XX "
        PIR - "\\\", "ENTRY", "P R O T E I N  S E Q U E N C E  D A T A B A S E"
IG/Stanford - ";"
      ASN.1 - "Bioseq-set ::= {", "Seq-set ::= {"
  FASTA-out - "FASTA", "TFASTA", "SSEARCH", "LFASTA", "LALIGN", "ALIGN"
     PHYLIP - "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"
   Clustalw - "CLUSTAL"
        MSF - "PileUp"
  BLAST-out - "BLASTN", "BLASTP", "BLASTX"
The keyword matching occurs in the order specified here, and the first matching keyword specifies the file format. So, for NBRF and FASTA files, if the first entry's header line has a ';' as the third character after the initial '>', the file format is taken to be NBRF. Files without that semi-colon are taken to be in FASTA format.

If there's a match, then the file format has been determined. Otherwise, the file's format is considered to be `Plain' at this point.
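The first-phase match can be pictured with a short Python sketch. This is only an illustration of the rules stated above (ordered keywords, case-insensitive comparison, `?' as a one-character wildcard), not the package's own code. The names KEYWORDS, matches and guess_format are hypothetical, and only a subset of the keyword table is included.

```python
# Hypothetical sketch of the first-phase keyword match.
# Only a subset of the keyword table is listed, in the documented order.
KEYWORDS = [
    ("GenBank", ["LOCUS "]),
    ("NBRF", [">??;"]),
    ("FASTA", [">"]),
    ("EMBL", ["ID   ", "CC ", "XX "]),
    ("IG/Stanford", [";"]),
    ("Clustalw", ["CLUSTAL"]),
    ("MSF", ["PileUp"]),
]


def matches(keyword, text):
    """Case-insensitive prefix match where '?' matches any one character."""
    if len(text) < len(keyword):
        return False
    return all(k == "?" or k.lower() == t.lower()
               for k, t in zip(keyword, text))


def guess_format(contents):
    """Return the first format whose keyword matches, else 'Plain'."""
    text = contents.lstrip()  # start at the first non-whitespace character
    for name, keywords in KEYWORDS:
        if any(matches(k, text) for k in keywords):
            return name
    return "Plain"
```

Because NBRF is tested before FASTA, a file starting with ">P1;" is reported as NBRF, while any other file starting with ">" falls through to FASTA.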

Second Phase

The second phase distinguishes more subtle variations of the file formats by looking in more detail at the text of the entries. The possible changes in the determined format are the following:


The SEQIO File Format Implementations

The package has six main (internal) operations that encapsulate the details of the file formats. Those operations are:
read
Read the input file to find the beginning and end of the next entry in the file. Also, find the beginning of the lines containing the sequence and if the entry explicitly specifies a sequence length, get that value.
getseq
Retrieve the sequence, if it exists, from the entry.
rawseq
Retrieve the raw sequence, if it exists, from the entry. The raw sequence typically contains the sequence characters plus any alignment or notational characters.
getinfo
Get one piece or all of the SEQINFO information from the entry.
putseq
Given a sequence and SEQINFO structure, output a correctly formatted entry.
annotate
Output an entry's text, adding new text to its comment section (creating a comment section, if none exists in the entry).
Each of the supported file formats will be described in terms of what those six operations do for that format.

General Comments


Raw Format

In the raw format, all of the characters of the file are the characters of the sequence (including spaces, newlines, non-printable characters, and so on).

The read operation simply reads the whole file. The getseq and rawseq operations return that text. The getinfo operation merely stores the filename in the description field. The putseq operation just outputs the sequence characters. And there is no annotate operation.


Plain Format

In the plain format, all of the alphabetic characters of the file are taken as the characters of the sequence, while spaces, newlines, position numbers and other punctuation characters are ignored.

The read operation reads in the whole file. The getseq operation extracts all of the alphabetic characters from the text. The rawseq operation extracts all of the non-whitespace and non-numeric characters from the text. The getinfo operation stores the filename in the description field.
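The difference between the two extraction rules can be shown with a small Python sketch. The function names are hypothetical, and the sketch assumes that "alphabetic" means ASCII letters.

```python
def plain_getseq(text):
    """Keep only the (ASCII) alphabetic characters."""
    return "".join(c for c in text if c.isascii() and c.isalpha())


def plain_rawseq(text):
    """Keep every character that is neither whitespace nor a digit."""
    return "".join(c for c in text if not c.isspace() and not c.isdigit())


sample = "  1 ACGT-ACGT 10\n 11 NN*\n"
print(plain_getseq(sample))  # letters only
print(plain_rawseq(sample))  # letters plus '-' and '*'
```

The same pair of rules (alphabetic characters for getseq; non-whitespace, non-numeric characters for rawseq) recurs in the GenBank, PIR, EMBL, FASTA and NBRF descriptions below.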

The putseq operation outputs the sequence in one of two formats, depending on the sequence's alphabet. If the alphabet is DNA, RNA or Protein, or the alphabet is Unknown but does not contain newline characters, the sequence is output 60 sequence characters per line, with interspersed spaces to improve the look of the output. If the alphabet is Unknown and it contains newline characters, then it is output as is.


GenBank Flat-File Format

The read operation first looks for a "LOCUS" line and extracts the sequence length from positions 23-29 of that line (if the text there consists of digits). Then, it looks for the entry-ending "//" line, along with the "ORIGIN" line which specifies where the sequence lines begin. The "ORIGIN" line is not required; however, if it does not exist, the entry is assumed to contain no sequence.

The getseq operation scans the sequence lines, from just after the "ORIGIN" line to the "//" line. All alphabetic characters there are assumed to be part of the sequence. No assumptions are made about the format of these lines.

The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.

The getinfo operation looks first at the "LOCUS" line. It takes the identifier from positions 13-22 (and assumes it's a GenBank id, unless marked by an identifier prefix), the alphabet determination from positions 37-40, whether it's circular from the existence of the keyword "circular" at positions 43-52, and the date from positions 63-73. Then, it looks for the "ACCESSION", "NID", "PID", "DEFINITION", "COMMENT" and "SOURCE" lines, where `lines' here means one or more text lines corresponding to that part of the entry and where the lines can appear in any order. Accession numbers, NID numbers and PID numbers are extracted from the "ACCESSION", "NID" and "PID" lines, respectively. The description is taken from the "DEFINITION" line. Comments are retrieved from the "COMMENT" line. The organism name is taken from the "ORGANISM" sub-record of the "SOURCE" line. The getinfo operation cannot determine the value of the isfragment field (since that is not explicitly given anywhere in the entry).
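As an illustration of the fixed-column reading of the "LOCUS" line, here is a Python sketch using the column positions quoted above. The function name, the dictionary keys and the case-insensitive test for "circular" are assumptions made for the example, not part of the package.

```python
def parse_locus(line):
    """Pull a few fields out of a LOCUS line by column position."""

    def field(first, last):
        # Columns are 1-based and inclusive, as in the text above.
        return line[first - 1:last].strip()

    length_text = field(23, 29)
    is_number = length_text.isascii() and length_text.isdigit()
    return {
        "id": field(13, 22),
        "length": int(length_text) if is_number else None,
        "alphabet": field(37, 40),
        # Assumption: the keyword is compared without regard to case.
        "circular": "circular" in field(43, 52).lower(),
        "date": field(63, 73),
    }


# A LOCUS line assembled field by field so the column layout is explicit.
EXAMPLE_LOCUS = "{:<12}{:<10}{:>7} bp    {:<4}  {:<10}{:<10}{}".format(
    "LOCUS", "A02201", "664", "DNA", "", "UNC", "10-MAR-1993")

print(parse_locus(EXAMPLE_LOCUS))
```

If the length columns do not consist of digits, the sketch leaves the length unset, which mirrors the "if the text there consists of digits" condition of the read operation.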

The putseq operation outputs an entry with the following lines (in order): LOCUS, DEFINITION, ACCESSION, NID, SOURCE/ORGANISM, COMMENT, BASE COUNT, ORIGIN, sequence lines, //. The form of these lines follows that described in the GenBank Release Notes, with the following exceptions:

The annotate operation replaces or appends to the COMMENT line, if it exists. If no COMMENT line exists, then a new COMMENT line will be inserted (or rather output between the existing lines of the entry) just before one of the following lines (whichever comes first in the entry): FEATURES, BASE COUNT or ORIGIN. One of those lines must appear in the entry.

Example GenBank entry:

LOCUS       A02201        664 bp    DNA             UNC       10-MAR-1993
DEFINITION  Phage phi-105 DNA for immF plypeptide.
ACCESSION   A02201
SOURCE      .
  ORGANISM  Bacteriophage phi-105
COMMENT     NCBI gi: 345121
            
            SEQIO retrieval from GenBank database entry.   07-Feb-1996
BASE COUNT      237 a    111 c    144 g    172 t
ORIGIN
        1 tgatcaccta tctcctttac aacacatagt gcctcactgt gccactgtgt cttgtggcat
       61 gacacaatta tagtatccga atgtcggaaa tacaatacta aaaaagacgg aaatacaagt
      121 attttttagt aaattgacgg aaatacaaga taaatactct ctgaatcttt aaaatgcttg
      181 aatttcgtca aatttcgact tttacaaaat gtcgtgaata ccatacaatt tagacatacc
      241 ttaacgggag gtgataatca tgctggatgg gaaaaagctt ggggctttaa ttaaggacaa
      301 aagaaaagaa aagcacttga aacagacaga aatggcgaag gcactgggta tgtccagaac
      361 ttatctctct gatatcgaaa acggcagata tctgccgagt acaaaaacac tttccagaat
      421 agcgatttta ataaatctgg atttaaatgt gttaaaaatg acggaaatac aagtagttga
      481 ggagggtgga tatgatagag ctgccggcac atgtagaaga caggctttat gagattttta
      541 tgaaactatc agttccaagg ttgcttgaga aagaagccct ggagaaagga gagaagccga
      601 atgcggaaag aaaaggcgct tgacctcgcg gccttcttcg ctgaatttga acaaatgatg
      661 atca
//

GBFAST variation of GenBank

The read operation performs the same steps as the GenBank read, but it makes some additional assumptions. First, all keywords must appear in uppercase. Second, the sequence length must appear in positions 23-29 on the "LOCUS" line. Third, an "ORIGIN" line must appear in the entry (as must a sequence). Fourth, all of the lines of sequence except the last must be in the format described in the Release Notes, and so must be 75 characters long (9 characters for the position number, 60 characters of sequence, 6 spaces), plus the newline character. See the above example.

The getseq operation assumes that the sequence lines are in the format described in the previous paragraph, and all of the characters in the correct positions in that format are assumed to be characters of the sequence. So, if the line format is incorrect, you will get garbage as the sequence.
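A positional extraction of this kind might look like the following Python sketch. The function name is hypothetical, and the sketch assumes the layout seen in the example entry: a 9-character position number followed by six groups, each made of one space and 10 sequence characters.

```python
def gbfast_line_sequence(line):
    """Take sequence characters by position from one GenBank sequence line.

    Assumed layout: 9 characters of position number, then six groups of
    one space followed by 10 sequence characters (75 characters in all).
    """
    groups = [line[10 + 11 * i: 20 + 11 * i] for i in range(6)]
    return "".join(groups).rstrip()


groups = ["tgatcaccta", "tctcctttac", "aacacatagt",
          "gcctcactgt", "gccactgtgt", "cttgtggcat"]
full_line = "1".rjust(9) + "".join(" " + g for g in groups)
last_line = "661".rjust(9) + " atca"

print(gbfast_line_sequence(full_line))
print(gbfast_line_sequence(last_line))
```

No character is inspected, only its position, so a line whose columns are shifted yields shifted output rather than an error.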

The rawseq operation here is exactly the same as the getseq operation, since the GenBank sequences don't contain other characters.

The getinfo, putseq and annotate functions are the same as in the GenBank format.


PIR/CODATA Format

The read operation first looks for an "ENTRY" line. It then looks for the entry ending "///" line, but during this scan it also looks for the "SUMMARY" line and the "SEQUENCE" line. If the "SUMMARY" line is found, the sequence length is extracted by scanning for "#length" on the line, and then looking for digits after that keyword. The "SEQUENCE" line specifies the beginning of the sequence lines (starting on the next line), and no sequence is assumed to appear in the entry if the "SEQUENCE" line is missing.
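The "#length" scan on the "SUMMARY" line can be sketched in Python as follows. The function name is hypothetical, and skipping the whitespace between the keyword and the digits is an assumption based on the example entry.

```python
def pir_summary_length(line):
    """Return the number following '#length' on a SUMMARY line, or None."""
    pos = line.find("#length")
    if pos < 0:
        return None
    rest = line[pos + len("#length"):].lstrip()
    digits = ""
    for c in rest:
        if c not in "0123456789":
            break
        digits += c
    return int(digits) if digits else None


print(pir_summary_length("SUMMARY          #length 105"))
```

Since the keyword is searched for anywhere on the line, other "#" fields may precede or follow it.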

The getseq operation scans the sequence lines from just after the "SEQUENCE" line to the "///" line ending the entry. All alphabetic characters on those lines are assumed to be in the sequence. No format for those lines is assumed.

The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.

The getinfo operation first looks at the "ENTRY" line. The next word (i.e., non-whitespace string) after the "ENTRY" keyword is taken for an identifier, and then the rest of the line is searched for a "#type" option. If the word after "#type" is "fragment", the isfragment field is set to 1. Then, the entry is searched for the "ACCESSIONS", "COMMENT", "DATE", "ORGANISM" and "TITLE" lines, which can appear in any order. The "ACCESSIONS" line holds accession numbers (and the search for the "ACCESSIONS" line will also find lines beginning with just "ACCESSION", for backward compatibility). The "COMMENT" lines hold comments. The "DATE" line holds the date, and the date taken is the last given on the line, with the assumption being that the dates on the line are specified from oldest to newest (not absolutely accurate, but handling dates better is on my TODO list). The "TITLE" line holds the description, an optional organism name and possibly one of the keywords "(fragment)", "(fragments)" or "(tentative sequence)". The text before the string " - " is taken for the description, and the rest of the text, except for a trailing keyword, is taken for the organism name. If the keywords "(fragment)" or "(fragments)" appear at the end of the string, isfragment is set to 1. If "(tentative sequence)" appears, it is considered part of the description. The "ORGANISM" line holds an organism name which is taken if the "TITLE" line does not specify an organism.

The putseq operation outputs a PIR entry containing the following lines (in order): ENTRY, TITLE, ORGANISM, DATE, ACCESSIONS, COMMENT, SUMMARY, SEQUENCE, sequence lines, ///. The format of those lines follows the PIR Release Notes, with the following exceptions:

The annotate operation replaces or appends to the COMMENT line, if it exists. If no COMMENT line exists, then a new COMMENT line will be inserted just before one of the following lines (whichever comes first in the entry): GENETIC, CLASSIFICATION, KEYWORDS, FEATURE, SUMMARY or SEQUENCE. One of those lines must appear in the entry.

Example PIR entry:

ENTRY            CCMST       #type complete
TITLE            cytochrome c, testis-specific - mouse
ORGANISM         #formal_name mouse
DATE             04-Nov-1994
ACCESSIONS       B28160; A00012
COMMENT    Mammalian testis contains two forms of cytochrome c, one identical
           with the form found in somatic tissues and another that is
           expressed in a stage-specific manner during spermatogenic
           differentiation.
           
           SEQIO retrieval from PIR database entry.   07-Feb-1996
SUMMARY          #length 105
SEQUENCE
                5        10        15        20        25        30
      1 M G D A E A G K K I F V Q K C A Q C H T V E K G G K H K T G
     31 P N L W G L F G R K T G Q A P G F S Y T D A N K N K G V I W
     61 S E E T L M E Y L E N P K K Y I P G T K M I F A G I K K K S
     91 E R E D L I K Y L K Q A T S S
///

PIRFAST Variation of PIR

The read operation performs the same steps as the PIR read, but it makes some additional assumptions. First, all keywords must appear in uppercase. Second, a "SUMMARY" line must appear in the entry, and it must contain a "#length" field (although the field can appear anywhere on the line). Third, a "SEQUENCE" line must appear in the entry immediately after the "SUMMARY" line (and the entry must contain a sequence). Fourth, the format of the sequence lines must be as given in the PIR database, and so must be either 67 or 68 characters long (7 characters for the position number, 30 characters of sequence, 30 or 31 spaces or notational characters), plus the newline character. See the above example.

The getseq operation assumes that the sequence lines are in the format described in the previous paragraph, and all of the characters in the correct positions in that format are assumed to be characters of the sequence. So, if the line format is incorrect, you will get garbage as the sequence.

The rawseq operation here does not use the "fast" implementation, but uses the rawseq operation of the basic PIR format.

The getinfo, putseq and annotate functions are the same as in the PIR format.


EMBL/Swiss-Prot File Formats

NOTE: The EMBL and Swiss-Prot file format implementations are essentially the same, differing only in their putseq and annotate operations. So, we'll describe them together.

NOTE2: The EMBL read, getseq and getinfo implementations have been tested on, and are compatible with, the "EMBL" entries in the EMBL, EPD, aids-db, ENZYME, PROSITE and Swiss-Prot databases. Because of the variations of the entries in these databases, some of the assumptions made in the implementations will differ from the official EMBL or Swiss-Prot file format descriptions.

The read operation first looks for an "ID " line. It then looks for the entry ending "//" line, but during this scan it also looks for an "SQ " line and a line beginning with two spaces. If the "SQ " line is found and the next word after "SQ Sequence" consists of digits, it is taken for the sequence length. The first line beginning with two spaces is assumed to be the beginning of the sequence lines, and if no such lines appear, the entry is assumed to contain no sequence.
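The length lookup on the "SQ " line can be sketched in Python as shown here. The function name is hypothetical, and accepting "Sequence" in any letter case is an assumption prompted by the two example entries in this section.

```python
def embl_sq_length(line):
    """Return the number that follows 'SQ Sequence', or None."""
    words = line.split()
    if len(words) < 3 or words[0] != "SQ":
        return None
    # Assumption: the keyword "Sequence" is compared without regard to case.
    if words[1].lower() != "sequence":
        return None
    word = words[2]
    return int(word) if word.isascii() and word.isdigit() else None


print(embl_sq_length("SQ   Sequence 805 BP; 226 A; 158 C; 224 G; 194 T; 3 other;"))
print(embl_sq_length("SQ   SEQUENCE   924 AA;"))
```

When the word after "SQ Sequence" is not made up of digits, no length is recorded and the sequence lines themselves determine the length.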

The getseq operation scans the sequence lines from the first line beginning with two spaces to the "//" line ending the entry. All alphabetic characters on those lines are assumed to be in the sequence. No format for those lines is assumed.

The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.

The getinfo operation first looks at the "ID " line. The next word (i.e., non-whitespace string) after the "ID" keyword is taken for an identifier, and an attempt is made to determine if it is an EMBL id, an EPD id, a Swiss-Prot id, or something else. It does this by counting the number of semi-colons on the line and checking whether the line ends with a period. If three semi-colons and a period are found, then the string just before the third semi-colon is checked, and the identifier is assumed to be an EPD id if that string is "EPD" and is assumed to be an EMBL id otherwise. If two semi-colons and a period are found, and the string just before the second semi-colon is "PRT", the identifier is assumed to be a Swiss-Prot id. Otherwise, the identifier is some other id. After figuring out the type of identifier and extracting it from the line, the rest of the line is searched for words that specify the alphabet ("DNA", "RNA", "PRT", and so on) and whether the sequence is circular ("circular").
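The semi-colon counting rule can be illustrated with a short Python sketch. The function name and the returned labels are hypothetical, and the sketch reads "the string just before" a semi-colon as the last whitespace-delimited word in front of it.

```python
def classify_id_line(line):
    """Guess the identifier type from the shape of an 'ID' line."""
    text = line.rstrip()
    parts = text.split(";")
    semicolons = len(parts) - 1
    has_period = text.endswith(".")

    def word_before(n):
        # Last word in front of the n-th semi-colon (1-based).
        words = parts[n - 1].split()
        return words[-1] if words else ""

    if semicolons == 3 and has_period:
        return "EPD" if word_before(3) == "EPD" else "EMBL"
    if semicolons == 2 and has_period and word_before(2) == "PRT":
        return "Swiss-Prot"
    return "other"


print(classify_id_line("ID   CM23SRIBR  converted; DNA; UNC; 805 BP."))
print(classify_id_line("ID   104K_THEPA  CONVERTED;      PRT;   924 AA."))
```

The two lines used here are the "ID" lines of the example EMBL and Swiss-Prot entries shown later in this section.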

Then the rest of the entry is searched for the "AC ", "NI ", "PI ", "DT ", "DE ", "OS ", "CC " and "XX " lines, which can appear in any order. The "AC ", "NI " and "PI " lines contain accession, NID and PID numbers. The "DT " lines contain dates, of which the date on the last "DT " line is taken, under the assumption that the dates are given from oldest to newest. The "DE " lines contain the description, and may end with one of the keywords "(fragment)" or "(fragments)", in which case isfragment is set to 1. The "OS " lines specify the organism name. The "CC " and "XX " lines specify the comment lines, about which there are a couple of things to note. First, an "XX " line is different from any line beginning with "XX", in that three spaces must appear after the "XX" and non-whitespace text must appear after that, in order for it to be considered a comment line. These lines do not occur in the official EMBL or Swiss-Prot formats, but do appear in some of the variations. Second, more than one comment section can appear in an entry. When a "CC " line is reached, the comment section beginning at that line is assumed to consist of all "CC " and "XX" lines (note the lack of spaces after the "XX") following that line, up to the first line not beginning with "CC" or "XX" (and ignoring a trailing "XX" line). When an "XX " line is seen, all following "XX " lines are considered part of that comment section. The text of these sections is concatenated together to make up the comment lines.

For the EMBL format, the putseq operation outputs an EMBL entry containing the following lines (in order): ID, AC, NI, DT, DE, OS, CC, SQ, sequence lines, //. In the output, XX lines are added between each of the lines (except the sequence lines) as specified in the EMBL format. The format of the lines follows the EMBL Release Notes, with the following exceptions:

For the Swiss-Prot format, the putseq operation outputs a Swiss-Prot entry containing the following lines (in order): ID, AC, DT, DE, OS, CC, SQ, sequence lines, //. The format of the lines follows the Swiss-Prot Release Notes, with the following exceptions:

For the EMBL format, the annotate operation replaces or appends to the "CC " or "XX " lines, if any exist. The operation looks for the first comment section, and will insert or replace at that point. If no comment section exists, then a new comment section using "CC " lines will be inserted (or rather output between the existing lines of the entry) as follows. If a "DR ", "PR ", "FH " or "FT " line appears in the entry, the comment is inserted just before the first of those lines. Otherwise, the comment is inserted just before the "SQ " line or the sequence lines (the lines beginning with two spaces). One of these lines must appear in the entry.

For the Swiss-Prot format, the annotate operation replaces or appends to the "CC " lines, if they exist. If no comment section exists, then a new comment section will be inserted (or rather output between the existing lines of the entry) as follows. If a "DR ", "KW " or "FT " line appears in the entry, the comment is inserted just before the first of those lines. Otherwise, the comment is inserted just before the "SQ " or sequence lines. One of these lines must appear in the entry.

Example EMBL entry:

ID   CM23SRIBR  converted; DNA; UNC; 805 BP.
XX
AC   X80636;
XX
DT   22-MAR-1995
XX
DE   C.mucosalis gene for 23S ribosomal RNA (fragment)
XX
OS   Campylobacter mucosalis
XX
CC   SEQIO retrieval from EMBL-format entry.   07-Feb-1996
XX
SQ   Sequence 805 BP; 226 A; 158 C; 224 G; 194 T; 3 other;
     gattctgcgc ggaaaatata acggggctaa aatgagtacc gaagctttag acttagtttt        60
     actaagtggt aggagcgttc tattcagcgt tgaaggtgta ccggtaagga gcgctggagc       120
     ggatagaagt gagcatgcag gcatgagtag cgataattgg ggtgagaatc cccaacgccg       180
     taarcccaag gtttcctacg cgatgctcgt catcgtaggg ttagccgggt cctaagcaaa       240
     gtccgaaagg ggtatgcgat ggaaaattgg ttaatattcc aatgccaaca ttattgtgcg       300
     atggaaggac gcttagagtt aaaggagcca gctgatggaa gtgctggtcg aaaggtgtag       360
     gttgagttac aggcaaatcc gtaactcttt atccgagacc ccacaggcgt ttgaagttct       420
     tcggaatgga tgacgaatcc ttgatactgt cgagccaaga aaagtttcta agtttagata       480
     atgttgcccg taccgtaaac cgacacaggt gggtgggatg agtattctaa ggcgcgtgga       540
     agaactctct tcaaggaact ctgcaaaata gcaccgtatc ttcggtataa ggtgtgccta       600
     actttgtgaa ggatttactc cgtaagcatt gaaggttaca acaaagagtc cctcccgact       660
     gtttaccaaa aacacagcac tctgctaact cgtaagagga tgtatagggt gtgacgcctg       720
     cccggtgctc gaaggttaat tgatggggty agcagyaatg cgaagctctt gatcgaagcc       780
     cgagtaaacg gccgccgtaa ctata                                             805
//

Example Swiss-Prot entry:

ID   104K_THEPA  CONVERTED;      PRT;   924 AA.
AC   P15711;
DT   01-AUG-1992
DE   104 KD MICRONEME-RHOPTRY ANTIGEN.
OS   THEILERIA PARVA.
CC   -!- DEVELOPMENTAL STAGE: SPOROZOITE ANTIGEN.
CC   -!- SUBCELLULAR LOCATION: IN MICRONEME/RHOPTRY COMPLEXES.
CC   
CC   SEQIO retrieval from Swiss-Prot database entry.   07-Feb-1996
SQ   SEQUENCE   924 AA;
     MKFLILLFNI LCLFPVLAAD NHGVGPQGAS GVDPITFDIN SNQTGPAFLT AVEMAGVKYL
     QVQHGSNVNI HRLVEGNVVI WENASTPLYT GAIVTNNDGP YMAYVEVLGD PNLQFFIKSG
     DAWVTLSEHE YLAKLQEIRQ AVHIESVFSL NMAFQLENNK YEVETHAKNG ANMVTFIPRN
     GHICKMVYHK NVRIYKATGN DTVTSVVGFF RGLRLLLINV FSIDDNGMMS NRYFQHVDDK
     YVPISQKNYE TGIVKLKDYK HAYHPVDLDI KDIDYTMFHL ADATYHEPCF KIIPNTGFCI
     TKLFDGDQVL YESFNPLIHC INEVHIYDRN NGSIICLHLN YSPPSYKAYL VLKDTGWEAT
     THPLLEEKIE ELQDQRACEL DVNFISDKDL YVAALTNADL NYTMVTPRPH RDVIRVSDGS
     EVLWYYEGLD NFLVCAWIYV SDGVASLVHL RIKDRIPANN DIYVLKGDLY WTRITKIQFT
     QEIKRLVKKS KKKLAPITEE DSDKHDEPPE GPGASGLPPK APGDKEGSEG HKGPSKGSDS
     SKEGKKPGSG KKPGPAREHK PSKIPTLSKK PSGPKDPKHP RDPKEPRKSK SPRTASPTRR
     PSPKLPQLSK LPKSTSPRSP PPPTRPSSPE RPEGTKIIKT SKPPSPKPPF DPSFKEKFYD
     DYSKAASRSK ETKTTVVLDE SFESILKETL PETPGTPFTT PRPVPPKRPR TPESPFEPPK
     DPDSPSTSPS EFFTPPESKR TRFHETPADT PLPDVTAELF KEPDVTAETK SPDEAMKRPR
     SPSEYEDTSP GDYPSLPMKR HRLERLRLTT TEMETDPGRM AKDASGKPVK LKRSKSFDDL
     TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP
     KKPDSAYIPS ILAILVVSLI VGIL
//

EMBLFAST/SPFAST Variation of EMBL/Swiss-Prot

The read operation performs the same steps as the EMBL/Swiss-Prot read, but it makes some additional assumptions. First, all keywords must appear in uppercase, with one exception noted next. Second, an "SQ Sequence" line must appear in the entry, although the keyword "Sequence" can appear in uppercase, as in "SQ SEQUENCE". Third, the sequence length must be the next word after "SQ Sequence". Fourth, the format of the sequence lines must be as in the EMBL or Swiss-Prot databases. The EMBL sequence lines are 80 characters long (5 spaces, 60 sequence characters with 5 interspersed spaces, and 10 characters with a right-justified position number), plus the newline character. The Swiss-Prot sequence lines are 70 characters long (same as EMBL except no position numbers), plus the newline.

The getseq operation assumes that the sequence lines are in the format described in the previous paragraph, and all of the characters in the correct positions in that format are assumed to be characters of the sequence. So, if the line format is incorrect, you will get garbage as the sequence.

The rawseq operation here is exactly the same as the getseq operation, since the EMBL and Swiss-Prot sequences don't contain other characters.

The getinfo, putseq and annotate functions are the same as in the EMBL/Swiss-Prot format.


FASTA/FASTA-old File Formats

NOTE: The implementation of the FASTA format here follows the format described in the FASTA program documentation, with the exception that, at the beginning of the entry, multiple lines beginning with either '>' or ';' can appear. This was done in order to better distinguish the entry's header lines from the sequence lines (where comments beginning with ';' are permitted). This exception only occurs when reading FASTA entries. The FASTA output functions only use ';' for those additional header lines.

The read operation looks for a line beginning with '>'. That line is taken as the header/description line for the entry. If that line has been formatted using the standard one-line description format (see file "user.doc"), then the sequence length is extracted from that line. The operation then looks for the next line which does not begin with a '>' and which does not begin with a ';'. If such a line occurs before the next line with a '>', that line is the first line of the sequence. Finally, the operation looks for the entry's end at either the next line which does begin with a '>' or the end of the file.

The getseq operation scans the sequence lines (all of the lines not beginning with '>'). All alphabetic characters on those lines are assumed to be in the sequence, except that when a semi-colon appears on a line, the rest of that line is considered a comment and not part of the sequence. No format for those lines is assumed.
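The FASTA getseq rule can be sketched in Python like this. The function name is hypothetical, and the sketch assumes it is handed the text of exactly one entry and that "alphabetic" means ASCII letters.

```python
def fasta_getseq(entry_text):
    """Collect the sequence letters of one FASTA entry."""
    residues = []
    for line in entry_text.splitlines():
        if line.startswith(">"):
            continue  # header line
        code = line.split(";", 1)[0]  # drop a ';' comment, if any
        residues.extend(c for c in code if c.isascii() and c.isalpha())
    return "".join(residues)


entry = (">gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.\n"
         ";NCBI gi: 579066\n"
         "  gatcagctgc gacacaacta ; a trailing comment\n"
         "  acgt\n")
print(fasta_getseq(entry))
```

Header lines that begin with ';' need no special case here, because everything from the semi-colon onward is discarded as a comment.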

The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.

The getinfo operation first looks at the first header line of the entry, and parses it according to the one-line description format specified in file "user.doc". It then considers any following lines that begin either with a '>' or a ';' as comment lines. Any other comments in the entry are ignored.

In the FASTA format, the putseq operation outputs a first header line according to the one-line description format. The comment/history lines and the sequence identifiers are output as additional header lines that begin with a ';'. Finally, the sequence is output.

In the FASTA-old format, the putseq operation only outputs the first header line and the sequence lines. No comment/history lines are output, and the identifiers appear in the header line.

In the FASTA format, the annotate operation either replaces, appends or inserts the comment lines just after the first header line. There is no annotate operation in the FASTA-old format.

Example FASTA entry:

>gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
;
;NCBI gi: 579066
;
;SEQIO retrieval from GenBank database entry.   07-Feb-1996
  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c

Example FASTA-old entry:

>gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c

NBRF/NBRF-old File Formats

NOTE: The implementation of the NBRF format follows the format descriptions given in the release notes of the VMS version of the PIR database, with the following exceptions:

  1. An identifier list (with identifiers separated by '|') can appear after the ';' on the first line of the entry, and there is no limitation to the length of that identifier list.
  2. The second line of the entry is treated as a full one-line description (so it can contain more than just the description and organism name).
  3. The NBRF header lines (which occur after the sequence) are assumed to begin at the first line whose second character is a ';', and run until the end of the entry. So, the sequence lines cannot contain such a line (or the sequence will only be partially read).
  4. Every "C;Comment: " line in the header lines is assumed to contain a space between the "C;Comment:" and the comment text. This space (or whatever character appears there) is not considered part of the comment text.

The read operation first looks for a line beginning with '>', which contains a two-character code and database identifiers for the sequence. The next line, which should not begin with a '>', contains a one-line description of the sequence, and the operation attempts to extract the sequence length from that line. After that, the operation scans the sequence lines looking for the beginning of the header lines or the end of the entry. The header lines begin with the first line whose second character is ';', and they are not required to appear in an entry. The end of the entry is either the first line which begins with a '>', or the end of the file.

The getseq operation scans the sequence lines from just after the description line to either the first occurrence of a '*', the beginning of the header lines or the end of the entry. All alphabetic characters on those lines are assumed to be in the sequence. No format for those lines is assumed.
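The three stopping conditions of the NBRF getseq can be sketched in Python as follows. The function name is hypothetical, the entry is an illustrative one, and the sketch assumes it is handed the text of exactly one entry whose first two lines are the identification and description lines.

```python
def nbrf_getseq(entry_text):
    """Collect the sequence letters of one NBRF entry."""
    residues = []
    for line in entry_text.splitlines()[2:]:  # skip id and description lines
        if len(line) > 1 and line[1] == ";":
            break  # start of the header lines, e.g. "C;Comment: ..."
        head, star, _ = line.partition("*")
        residues.extend(c for c in head if c.isascii() and c.isalpha())
        if star:
            break  # '*' terminates the sequence
    return "".join(residues)


entry = (">P1;CCMST\n"
         "cytochrome c, testis-specific - mouse\n"
         "  MGDAEAGKKI FVQKC\n"
         "  AQCHT*\n"
         "C;Accession: B28160\n")
print(nbrf_getseq(entry))
```

A sequence line whose second character happens to be ';' would be taken for a header line, which is the partial-read risk noted in exception 3 above.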

The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.
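
The getseq and rawseq rules above can be sketched in a few lines of Python. This is only an illustration of the scanning rules as described here, not the SEQIO source; the function name `nbrf_sequence` and the shortened sample entry are made up for the example.

```python
def nbrf_sequence(entry_lines, raw=False):
    """Collect sequence characters from the lines of one NBRF entry.

    entry_lines[0] is assumed to be the '>' identification line and
    entry_lines[1] the description line, so scanning starts at index 2.
    """
    chars = []
    for line in entry_lines[2:]:
        # A line whose second character is ';' begins the header lines.
        if len(line) > 1 and line[1] == ";":
            break
        for ch in line:
            if ch == "*":                      # end-of-sequence marker
                return "".join(chars)
            if raw:
                # rawseq: keep everything except whitespace and digits
                keep = not ch.isspace() and not ch.isdigit()
            else:
                # getseq: keep alphabetic characters only
                keep = ch.isalpha()
            if keep:
                chars.append(ch)
    return "".join(chars)


# Hypothetical, shortened entry used only for illustration.
entry = [
    ">DL;gb:A14666",
    "PRLB promoter - Bacteriophage lambda, 281 bp.",
    "  gatcagctgc gac-acaact",
    "  agaaggcgat c*",
    "C;Date: 18-AUG-1994",
]
print(nbrf_sequence(entry))
print(nbrf_sequence(entry, raw=True))
```

The only difference between the two modes is the character filter, which is why the rawseq result keeps the '-' that getseq drops.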

The getinfo operation first looks at the initial identification line. The format of that line is ">??;..." where "??" is a two character description and "..." is a list of identifiers. Six forms of the two character description are recognized

and the appropriate alphabet, isfragment and iscircular values are set. The list of identifiers is added to mainid, mainacc and idlist. If no identifier prefix is specified for an identifier (either by the identifier itself or by the "IdPrefix" information field of the database's BIOSEQ entry, if a database search is being performed), then "oth" for Other is used. The next line in the entry is parsed according to the one-line description format. Then, if the header lines were found in the entry during the read operation, they are scanned, looking for lines beginning with "C;Accession:", "C;Comment:" and "C;Date:" which give the accession numbers, comments and date, respectively.

In the NBRF format, the putseq operation outputs an initial identification line of the appropriate form, containing one of the two-character descriptions above (or "XX" if the alphabet is Unknown) and the list of identifiers in idlist. It then outputs a one-line description according to the one-line description format. The sequence is output and terminated with a '*'. Finally, the date, accession numbers and comments/history are output in lines beginning with "C;Date:", "C;Accession:" and "C;Comment:", respectively.

In the NBRF-old format, the putseq operation only outputs the initial identification line, the description line and the sequence lines. In addition, only one identifier is placed on the initial identification line, and if that identifier was not an accession number, the main accession number is added to the beginning of the description line.

For the NBRF format, the annotate operation replaces or appends the "C;Comment: " lines, if they exist. If no comment lines exist, then a new comment section will be inserted (or rather output between the existing lines of the entry) as follows. If a "C;Genetics:", "C;Complex:", "C;Function:", "C;Superfamily:", "C;Keywords:" or "F;" line appears in the entry, the comment is inserted just before the first of those lines. Otherwise, the comment is inserted at the end of the entry.
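
The rule for placing a new comment section can be written as a short search. The sketch below covers only the case where the entry has no existing comment lines; the names `AFTER_COMMENT` and `comment_insert_index` are hypothetical and are not part of SEQIO.

```python
# Line prefixes that the new comment section must precede,
# taken from the list in the paragraph above.
AFTER_COMMENT = ("C;Genetics:", "C;Complex:", "C;Function:",
                 "C;Superfamily:", "C;Keywords:", "F;")


def comment_insert_index(entry_lines):
    """Return the index at which new "C;Comment: " lines would be inserted."""
    for i, line in enumerate(entry_lines):
        if line.startswith(AFTER_COMMENT):
            return i                 # just before the first such line
    return len(entry_lines)          # otherwise, at the end of the entry


# Hypothetical entry tail used only for illustration.
tail = ["  acgt*", "C;Date: 18-AUG-1994", "C;Keywords: promoter"]
print(comment_insert_index(tail))
```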

There is no annotate operation in the NBRF-old format.

Example NBRF entry:

>DL;gb:A14666
PRLB promoter - Bacteriophage lambda, 281 bp.
  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c*
C;Date: 18-AUG-1994
C;Accession: A14666
C;Comment: NCBI gi: 579066
C;Comment: 
C;Comment: SEQIO retrieval from GenBank database entry.   23-Mar-1996

Example NBRF-old entry:

>DL;gb:A14666
~A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c*

IG/Stanford, IG-old/Stanford-old File Formats

The read operation first looks for a line beginning with ';'. The operation then looks for the next line which does not begin with a ';'. All of the lines beginning with ';' make up the comment lines, and the first line not beginning with ';' contains the sequence's description. If the description line has been formatted using the standard one-line description format (see file "user.doc"), then the sequence length is extracted from that line. Finally, the operation looks for the entry's end at either the next line which does begin with a ';' or the end of the file.

The getseq operation scans the sequence lines from just after the description line until either the end of the entry is reached, or a '1' or a '2' appears. All alphabetic characters on those lines are assumed to be in the sequence. No format for those lines is assumed.

The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.

The getinfo operation first gets the comment lines at the beginning of the entry, and then parses the description line according to the one-line description format. Finally, it looks for a '1' or '2' at the end of the sequence, and sets iscircular to 0 or 1, respectively.
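
As an illustration of how the read, getseq and getinfo rules fit together for this format, here is a small Python sketch. It is not taken from the SEQIO source. The function name `parse_ig_entry` is invented, and returning None for iscircular when no '1' or '2' is present is an assumption of the sketch, since the text does not say what happens in that case.

```python
def parse_ig_entry(entry_lines):
    """Split one IG/Stanford entry into its parts.

    Returns (comments, description, sequence, iscircular).
    """
    i = 0
    comments = []
    # Leading lines beginning with ';' are the comment lines.
    while i < len(entry_lines) and entry_lines[i].startswith(";"):
        comments.append(entry_lines[i][1:])
        i += 1
    # The first line not beginning with ';' is the description line.
    description = entry_lines[i] if i < len(entry_lines) else ""
    seq = []
    iscircular = None                 # assumption: unknown without a terminator
    for line in entry_lines[i + 1:]:
        for ch in line:
            if ch in "12":
                # '1' ends a linear sequence, '2' a circular one.
                iscircular = 0 if ch == "1" else 1
                return comments, description, "".join(seq), iscircular
            if ch.isalpha():
                seq.append(ch)
    return comments, description, "".join(seq), iscircular


# Hypothetical, shortened entry used only for illustration.
ig_entry = [
    ";NCBI gi: 579066",
    ";",
    "gb:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.",
    "  gatcagctgc gacacaacta",
    "  cagagatgat c1",
]
print(parse_ig_entry(ig_entry))
```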

In the IG/Stanford format, the putseq operation outputs any comment/history lines (or just the line ";\n" if there are no comment/history lines), a one-line description, the sequence and finally either a '1' or '2' depending on the value of iscircular.

In the IG-old/Stanford-old format, the putseq operation outputs the same text as in the IG/Stanford format except that exactly one comment/history line is output.

In the IG/Stanford format, the annotate operation either replaces, appends or inserts the comment lines at the beginning of the entry. There is no annotate operation in the IG-old/Stanford-old format.

Example IG/Stanford entry:

;NCBI gi: 579066
;
;SEQIO retrieval from GenBank database entry.   07-Feb-1996
gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c1

Example IG-old/Stanford-old entry:

;NCBI gi: 579066
gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c1

ASN.1 Text File Format

NOTE: This file format implementation is not nearly complete enough to handle all of the variations of ASN.1 text files. I concentrated the implementation on handling the "Bioseq" sequence records defined as part of the "Bioseq-set" structure, i.e., it looks for each "Bioseq-set.seq-set.seq" record in the file, where '.' separates the initial keywords for each level of sub-record. (See the NCBI toolkit for the definitions of the "Bioseq-set" and "Bioseq" syntax, and the values of those initial keywords).

However, it does handle all of the syntactic requirements of the ASN.1 text format. It makes no assumptions on the structure of the file, handling a completely free-form file (with one exception listed below). It does assume that the format consists of a hierarchy of records, where a record consists of a text string identifier and then a pair of matching braces bounding the contents of the record (except for simple records which contain only one or more strings and numbers).

The read operation looks for the beginning of each "Bioseq-set.seq-set.seq" record in the file. The operation assumes that this record is a "Bioseq" record, and looks for the end of it. Also, the read operation makes the syntactic requirement that the open brace beginning the "seq" record is separated from its initial keyword by exactly one space (i.e., the operation looks for the string "seq {"). After scanning to the end of the "seq" record, the operation looks for the "seq.inst.length" sub-record. If found, the sequence length is extracted from that sub-record.
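
Finding the end of a "seq" record amounts to matching braces while ignoring any braces that occur inside quoted strings. The Python sketch below shows one way to do that. It is an illustration under simplifying assumptions, not the SEQIO code: the function name `find_seq_records` is made up, string values are assumed to be delimited by double quotes, and the sketch does not check that a match for "seq {" lies inside a "Bioseq-set.seq-set" record.

```python
def find_seq_records(text):
    """Yield (start, end) offsets of each 'seq { ... }' record in text."""
    pos = 0
    while True:
        start = text.find("seq {", pos)
        if start < 0:
            return
        depth = 0
        in_string = False
        end = None
        for i in range(start, len(text)):
            ch = text[i]
            if ch == '"':
                in_string = not in_string      # a doubled "" toggles twice
            elif not in_string:
                if ch == "{":
                    depth += 1
                elif ch == "}":
                    depth -= 1
                    if depth == 0:             # matching close brace found
                        end = i + 1
                        break
        if end is None:
            return                             # unbalanced braces: stop
        yield start, end
        pos = end


# Hypothetical, heavily shortened file used only for illustration.
asn_text = (
    'Bioseq-set ::= {\n  seq-set {\n'
    '    seq {\n      descr { title "a } b" } ,\n      inst { length 3 } } ,\n'
    '    seq {\n      inst { length 5 } } } }\n'
)
records = [asn_text[s:e] for s, e in find_seq_records(asn_text)]
print(len(records))
```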

The getseq operation looks for the "seq.inst.seq-data" sub-record in the entry. If found, the sequence is extracted from that sub-record. (NOTE: This operation can only handle sequences that have been encoded in the `iupacna', `iupacaa', `ncbi2na' or `ncbi4na' formats.)

The rawseq operation is the same as the getseq operation, since the `iupacna', `iupacaa', `ncbi2na' and `ncbi4na' formats do not contain non-alphabetic characters.

The getinfo operation looks for a large number of possible sub-records for information about the sequence. To find database identifiers, it looks in the "seq.id" sub-record for the sub-sub-records "pir.name", "pir.accession", "swissprot.name", "swissprot.accession", "genbank.name", "genbank.accession", "embl.name", "embl.accession", "ddbj.name", "ddbj.accession", "prf.name", "prf.accession", "other.name", "other.accession", "pdb.mol", "gi", "giim.id", "gibbsq" and "gibbmt". Any identifiers found are added to the idlist. To find the date information, it looks in the "seq.descr" sub-record to find the sub-sub-records "create-date", "update-date", "genbank.date", "genbank.entry-date", "embl.creation-date", "embl.update-date", "pir.date", "sp.created", "sp.sequpd", "sp.annotupd" and "pdb.deposition".

Then, the operation searches for the description, organism and comment information in the "seq.descr" sub-record. For the description, the operation searches for the sub-sub-records "title", "pdb.compound" and "name" and picks one of them for the description ("title" if found, else "pdb.compound", else "name"). For the organism, the sub-sub-records "org.taxname", "org.common", "pir.source" and "pdb.source" are searched. For the comments, all of the "comment" sub-sub-records in "seq.descr" are concatenated together to make up the comment lines.

Finally, the alphabet is picked up from the "seq.descr.mol-type", "seq.descr.modif.dna", "seq.descr.modif.rna" or "seq.inst.mol" sub-records, the isfragment field is set to 1 if "seq.descr.modif.partial" exists, and the iscircular field is set to 1 if the data string in "seq.inst.topology" is "circular".

The putseq operation outputs a "Bioseq" record for the sequence as part of a "Bioseq-set" structure (i.e., the appropriate strings are output before the first putseq operation, between the "Bioseq" records and when the file is closed, so that the file consists of a correctly formatted "Bioseq-set" record). The form of the file mirrors that of the Bioseq-set example given in the NCBI toolkit.

(NOTE: Because some text must be output when the file is closed (i.e., when seqfclose is called), you MUST call seqfclose when writing an ASN.1 file. If you don't call seqfclose, the text file will not be complete.)

The annotate operation either replaces, creates or appends the comment lines in the "seq.descr" sub-record (i.e., the comment lines are the "seq.descr.comment" records). If no "seq.descr" sub-record exists, one is created in the most appropriate place in the "seq" record. If the entry given to the annotate operation is not a Bioseq "seq" record, an error occurs.

(NOTE: Using the annotate operation by itself will NOT create a valid ASN.1 text file. You must output the following strings before the first entry, between entries, and after the last entry (again, assuming the entries are "Bioseq" records taken from the "Bioseq-set" hierarchy):

   Before the first entry:  "Bioseq-set ::= {\n  seq-set {\n"
          Between entries:  " ,\n"
     After the last entry:  " } }\n")

A Complete ASN.1 Text File:

Bioseq-set ::= {
  seq-set {
    seq {
      id {
        genbank {
          name "A14666" ,
          accession "A14666" } } ,
      descr {
        title "PRLB promoter" ,
        org {
          taxname "Bacteriophage lambda" } ,
        update-date
          str "18-AUG-1994" ,
        comment "NCBI gi: 579066" ,
        comment "SEQIO retrieval from GenBank database entry.  07-Feb-1996" } ,
      inst {
        repr raw ,
        mol dna ,
        length 281 ,
        seq-data
          iupacna "gatcagctgcgacacaactagtttacttactcgcttattaaaccagacccacaatcttt
tacacagatacaatatttttagtggaaacttcttgacatttcggcccatgacctttactctgttataaattactttta
tgggggacgatcacactagcaaaggagttacctaagccccgaatgttcaatgggaagacttccccaatcatgacccac
attacgggaccccaagttgcggagaagaaggcgatgtaaactgtcaaagcaatcacagagatgatc" } } } }

GCG Format

The read operation first looks for a line that ends with the string ".." (or more precisely, a line whose last non-whitespace characters are ".."). That line should be the GCG information line, and should look something like the following:
  gb:A02201  Length: 664  June 21, 1996 18:42  Type: N  Check: 9896  ..
although any or all of this information (except the "..") can be missing. If the line contains the "Length:" keyword, then the read operation will extract the sequence length. The read operation then reads the rest of the file, and assumes that those lines contain the sequence.
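
Because every field except the trailing ".." is optional, the information line is easiest to handle by searching for each keyword independently. The following Python sketch illustrates that approach; it is not the SEQIO implementation, and the function name `parse_gcg_info_line` and the returned dictionary keys are invented for the example.

```python
import re


def parse_gcg_info_line(line):
    """Return a dict of the fields found, or None if this is not an info line."""
    # The last non-whitespace characters of the line must be "..".
    if not line.rstrip().endswith(".."):
        return None
    info = {}
    m = re.search(r"Length:\s*(\d+)", line)
    if m:
        info["length"] = int(m.group(1))
    m = re.search(r"Type:\s*(\S+)", line)
    if m:
        info["type"] = m.group(1)
    m = re.search(r"Check:\s*(\d+)", line)
    if m:
        info["check"] = int(m.group(1))
    return info


sample = "  gb:A02201  Length: 664  June 21, 1996 18:42  Type: N  Check: 9896  .."
print(parse_gcg_info_line(sample))
```

A line such as "ORIGIN" yields None, while a bare ".." line is accepted and yields an empty dictionary, matching the statement that all other information may be missing.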

The getseq operation scans the sequence lines. All alphabetic characters on those lines are assumed to be in the sequence. No format for those lines is assumed.

The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence. During this operation, any period `.' appearing in the sequence lines is assumed to be a gap character and translated into a dash `-' (SEQIO's canonical gap character).

The getinfo operation takes the date and the alphabet from the GCG information line (if the date and the "Type:" fields are there), sets the description to the first word of the GCG information line (if it isn't "Length:"), and then takes all of the lines up to the GCG information line as the comment.

The putseq operation first outputs any comment lines, outputs a complete GCG information line (with a valid checksum), and then outputs the sequence lines in the default format shown below. Any dash `-' appearing in the output sequence is assumed to be a gap character and automatically translated into a period `.'.

There currently is no annotate function.


GCG-* Formats

The processing of the GCG-* formats essentially merges the processing of the GCG format on the sequence lines with the processing of the GenBank, PIR, EMBL, Swiss-Prot, FASTA, FASTA-old, NBRF, NBRF-old, IG/Stanford and IG-old formats when dealing with the header lines of each entry. So, see above for the details on that processing.

The one exception to this rule is the relationship between the NBRF and GCG-NBRF formats. Since the NBRF entries contain "header" information that actually appears at the end of the entry, and the GCG format requires that the last thing in an entry be the sequence, the GCG and non-GCG forms of the NBRF entries differ more than the other formats. In the GCG-NBRF format, the lines before the GCG information line are assumed to contain the two header lines normally found in the NBRF entries, immediately followed by the lines normally appearing at the end of the file (the "C;Comment:", "C;Accession:" and other lines). After those lines, the GCG information line and sequence lines should appear, and be the last things in the entry. The fmtseq program and SEQIO package have been implemented to make this transformation between the NBRF and GCG-NBRF formats.

An example GCG-GenBank entry:

LOCUS       A14666        281 bp    DNA             PHG       18-AUG-1994
DEFINITION  PRLB promoter.
ACCESSION   A14666
KEYWORDS    .
SOURCE      Bacteriophage lambda.
  ORGANISM  Bacteriophage lambda
            Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
REFERENCE   1  (bases 1 to 281)
  AUTHORS   Michiels,F., Delcour,J., Mahillon,J., Joos,H., Platteeuw,C. and
            Josson,K.
  TITLE     Transformed lactic acid bacteria
  JOURNAL   Patent: EP 0311469-A 10 12-APR-1989;
            PLANT GENETIC SYSTEMS N.V.; UNIVERSITE CATHOLIQUE DE LOUVAIN
COMMENT     NCBI gi: 579066
FEATURES             Location/Qualifiers
     source          1..281
                     /organism="Bacteriophage lambda"
     RBS             158..166
     CDS             180..254
                     /note="PRLB;  NCBI gi: 579067"
                     /codon_start=1
                     /translation="MFNGKTSPIMTHITGPQVAEKKAM"
BASE COUNT       89 a     67 c     52 g     73 t
ORIGIN      

  gb:A14666  Length: 281  June 28, 1996 16:23  Type: N  Check: 2754  ..

       1 gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc 

      51 acaatctttt acacagatac aatattttta gtggaaactt cttgacattt 

     101 cggcccatga cctttactct gttataaatt acttttatgg gggacgatca 

     151 cactagcaaa ggagttacct aagccccgaa tgttcaatgg gaagacttcc 

     201 ccaatcatga cccacattac gggaccccaa gttgcggaga agaaggcgat 

     251 gtaaactgtc aaagcaatca cagagatgat c

An example GCG-NBRF entry:

>DL;gb:A14666
PRLB promoter - Bacteriophage lambda, 281 bp.
C;Date: 18-AUG-1994
C;Accession: A14666
C;Comment: NCBI gi: 579066
C;Comment: 
C;Comment: SEQIO retrieval from GenBank database.   28-Jun-1996

  gb:A14666  Length: 281  June 28, 1996 16:22  Type: N  Check: 2754  ..

       1 gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc 

      51 acaatctttt acacagatac aatattttta gtggaaactt cttgacattt 

     101 cggcccatga cctttactct gttataaatt acttttatgg gggacgatca 

     151 cactagcaaa ggagttacct aagccccgaa tgttcaatgg gaagacttcc 

     201 ccaatcatga cccacattac gggaccccaa gttgcggaga agaaggcgat 

     251 gtaaactgtc aaagcaatca cagagatgat c 


MSF Multiple Sequence Format

The read operation first looks for a GCG information line of the following form:
 Pileup.Msf  MSF: 729  Type: N  June 21, 1996 15:02  Check: 3171 ..
although any or all of this information can be missing, except the ".." and the "MSF: %d" section, the second of which the read operation uses to get the sequence length. After the information line, the read operation looks for the sequence name lines, which are of the form
 Name: Humhbbbpc        Len:   729  Check: 6463  Weight:  1.00
where the "Name: " field gives the sequence identifier and must appear on any non-blank line in this section of the MSF file (the other fields are ignored, and the length is assumed to be the same as the global length). The sequence name lines section ends when a line beginning with "//" appears. Any number of blank lines can be interspersed in this section, but any non-blank line should contain the above format. The rest of the file is assumed to contain the sequence lines, where each sequence line begins with the sequence name followed by a space, as in:
           401                                                450
Humhbbbpc  CAGAAATTTA GTGTTTTCTC AGTCAGTTAA CATTCCTTCA AC........ 
Humhbbbpd  CAGAAATTTA GTGTTTTCTC AGTCAGTTAA CATTCCTTCA AC........ 
Humhbbbpe  CAGACAAATG GAACAGAATA GAGAGCCCAG AAATAAGACC ACATG..... 
Humhbbbpf  CAGACAAATG GAACAGAATA GAGAGCCCAG AAATAAGACC ACATG..... 
Humhbbbpg  AATACAAAAT CAGTAGCATT TCATATATAA A......... .......... 
Humhbbbph  AATACAAAAT CAGTAGCATT TCATATATAA A......... .......... 
Humhbbbp1  AAGTGATGAA ATTGTGTATT CAATGTAGTC TCAAGAGAAT TGAAAACCAA 
Humhbbbpa  AAATAAAAGG ATGGAGGAAG ATCTACCAAG CA........ .......... 
Humhbbbpb  AAATAAAAGG ATGGAGGAAT ATCTACCAAG CA........ .......... 
Humhbbbp2  AGCT.AAAGG ATTGTAAATG CACTAATCAG CACTCTGTGT CTAGCTCAAG 
No format of the sequence lines or presence or absence of the position number lines (401...450) is assumed, except for the initial sequence name. The sequence lines run to the end of the file.

The getseq operation finds every sequence line beginning with the corresponding sequence name (the sequences are ordered by the order of sequence names in the sequence names section). All alphabetic characters appearing after the sequence name are taken for the sequence.
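
The two-stage layout (a name section ended by "//", then sequence lines keyed by name) can be illustrated with the Python sketch below. It is a simplified illustration rather than the SEQIO code: the function name `parse_msf` is hypothetical, the input is assumed to be the lines that follow the information line, and all sequences are extracted at once instead of one per getseq call.

```python
def parse_msf(lines):
    """Return {name: sequence} from the lines following the MSF information line."""
    names = []
    body_start = len(lines)
    for i, line in enumerate(lines):
        if line.startswith("//"):              # end of the name section
            body_start = i + 1
            break
        fields = line.split()
        if len(fields) > 1 and fields[0] == "Name:":
            names.append(fields[1])
    parts = {name: [] for name in names}
    for line in lines[body_start:]:
        for name in names:
            # A sequence line begins with the name followed by a space.
            if line.startswith(name + " "):
                rest = line[len(name):]
                # getseq keeps alphabetic characters only, so '.' is dropped.
                parts[name].append("".join(c for c in rest if c.isalpha()))
                break
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Hypothetical, shortened file body used only for illustration.
msf_lines = [
    " Name: seqA   Len: 12  Check: 1  Weight: 1.00",
    " Name: seqAB  Len: 12  Check: 2  Weight: 1.00",
    "",
    "//",
    "",
    "        1        12",
    "seqA   ACGT.. ACGTAC",
    "seqAB  TTTT.. GG....",
]
print(parse_msf(msf_lines))
```

Requiring a space after the name keeps "seqA" from also matching the lines of "seqAB".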

The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence. During this operation, any period `.' appearing in the sequence lines is assumed to be a gap character and translated into a dash `-' (SEQIO's canonical gap character).

The getinfo operation takes the date and the alphabet from the GCG information line (if the date and the "Type:" fields are there), sets the description to the sequence name found in the sequence name section, and then takes all of the lines up to the GCG information line as the comment.

The putseq operation outputs an MSF file exactly mimicking the files output by GCG using "PileUp" in its default mode, except that only the keyword "PileUp" appears on the first line and no comments are output. Any dashes `-' found in the sequences are assumed to be gap characters and are automatically translated into periods `.'. If the sequences are of different lengths, the putseq operation will pad the smaller sequences with periods `.'.

(IMPORTANT: The one unusual feature about the putseq operation is that, unlike all of the other putseq operations except Clustalw and PHYLIP, the actual output does not occur until `seqfclose' is called to close the file. Because the MSF format must know the number of entries before it can begin the output, the sequences cannot be output at each call to `seqfwrite'. What the putseq operation does, on each call to `seqfwrite', is make a copy of the sequence and a sequence identifier (either the main identifier, description or organism name). Then, when `seqfclose' is called, all of the sequences are output in the correct format.)
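
The buffer-then-flush behavior described in this note can be sketched as a small Python class. This only illustrates the pattern (copy on each write, output everything on close, pad to a common length); the class name `DeferredAlignmentWriter` is made up, and the one-line-per-sequence output is a placeholder, not the real MSF layout.

```python
import io


class DeferredAlignmentWriter:
    """Buffers sequences on write() and emits them all on close()."""

    def __init__(self, out, pad_char="."):
        self.out = out
        self.pad_char = pad_char
        self.entries = []              # (identifier, sequence) pairs

    def write(self, identifier, sequence):
        # Nothing is output yet; only a copy of the data is kept.
        self.entries.append((identifier, str(sequence)))

    def close(self):
        # Only now is the number of entries and the longest length known.
        width = max((len(seq) for _, seq in self.entries), default=0)
        for identifier, seq in self.entries:
            # Translate '-' gaps to '.' and pad the shorter sequences.
            padded = seq.replace("-", ".").ljust(width, self.pad_char)
            self.out.write(f"{identifier}  {padded}\n")


buf = io.StringIO()
writer = DeferredAlignmentWriter(buf)
writer.write("a", "AC-GT")
writer.write("b", "ACG")
print(repr(buf.getvalue()))            # still empty before close()
writer.close()
print(buf.getvalue(), end="")
```

The same pattern applies to any format whose first output line depends on the total number of sequences.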

There currently is no annotate function.

An example MSF file:

PileUp


 pir.msf  MSF: 104  Type: P  June 28, 1996 17:04  Check: 3466  ..

 Name: pir:CCCZ         Len:   104  Check: 9501  Weight:  1.00
 Name: pir:CCMQR        Len:   104  Check: 9512  Weight:  1.00
 Name: pir:CCMKP        Len:   104  Check: 9066  Weight:  1.00
 Name: pir:CCRB         Len:   104  Check: 8395  Weight:  1.00
 Name: pir:CCGW         Len:   104  Check: 8496  Weight:  1.00
 Name: pir:CCCM         Len:   104  Check: 8496  Weight:  1.00

//

            1                                                   50
pir:CCCZ    GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
pir:CCMQR   GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
pir:CCMKP   GDVFKGKRIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQASGFTYTE
pir:CCRB    GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
pir:CCGW    GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
pir:CCCM    GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD

            51                                                 100
pir:CCCZ    ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
pir:CCMQR   ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
pir:CCMKP   ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
pir:CCRB    ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKDE RADLIAYLKK
pir:CCGW    ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
pir:CCCM    ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK

            101
pir:CCCZ    ATNE
pir:CCMQR   ATNE
pir:CCMKP   ATNE
pir:CCRB    ATNE
pir:CCGW    ATNE
pir:CCCM    ATNE


PHYLIP Interleaved and Sequential File Formats

NOTE: The implementation here is more flexible than other implementations, although it is a bit restrictive in its output, in that

  1. Both interleaved and sequential formats are supported and rigorously distinguished. See below for the details.
  2. An input file in the PHYLIP format can contain one or more PHYLIP entries, where each entry must be separated only by whitespace. Mixed files (some interleaved entries, some sequential entries) are supported.
  3. Any number of blank lines or lines filled only with whitespace can be included in the file. Blank lines do not disrupt the parsing of the entries.
  4. The output operation does NOT output more than one entry per file, because I have yet to completely figure out the SEQIO interface issues. (Note that this may change in a future version.)
  5. This implementation was done using the documentation from Version 3.5c. Whether it works with earlier versions is not known.
The read operation first skips whitespace characters and then looks for the number of sequences and the sequence length (those two numbers must be the first thing in the entry). On that initial line, it also looks for the option characters 'A', 'C', 'F', 'M', 'U', 'W'. If any of the options except 'U' are found, the operation then skips any subsequent lines that begin with a match to the character strings "ANCESTOR ", "CATEGORIES", "FACTORS ", "MIXTURE ", or "WEIGHTS ". A line is considered to match one of the strings if the first 10 characters of the line contain a prefix of the string padded by spaces. Also, these lines are skipped only if the corresponding option was given on that first line.
(NOTE: This may cause some problems on an entry such as this one:
3 6 A
A         ABCDEF
B         BCDEFG
C         CDEFGH
because the second line of the entry is treated as an "ANCESTOR " line, when in fact it was a sequence line. But, from looking at the documentation, the PHYLIP programs would die on this entry, too. And replacing "A " with something like "Alpha " eliminates the problem.)
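
The matching rule, and the reason the entry above is misread, can be made concrete with a short Python sketch. It is one reading of the rule as stated (the first 10 characters, with trailing spaces removed, must be a non-empty prefix of the keyword for an option given on the first line), not the SEQIO code; the names `OPTION_KEYWORDS` and `is_option_line` are hypothetical.

```python
# Keyword associated with each option character ('U' has no such line).
OPTION_KEYWORDS = {"A": "ANCESTOR", "C": "CATEGORIES", "F": "FACTORS",
                   "M": "MIXTURE", "W": "WEIGHTS"}


def is_option_line(line, options):
    """True if line would be skipped as an option line.

    options is the string of option characters found on the first line.
    """
    field = line[:10].rstrip()
    if not field:
        return False
    return any(OPTION_KEYWORDS[opt].startswith(field)
               for opt in options if opt in OPTION_KEYWORDS)


print(is_option_line("A         ABCDEF", "A"))   # the problem case above
print(is_option_line("Alpha     ABCDEF", "A"))   # the suggested workaround
```

Under this reading, "A" is a prefix of "ANCESTOR" and so the sequence line is skipped, while "Alpha" is not a prefix and the line is kept.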

After skipping those initial lines, the read operation tries to match the subsequent lines to the interleaved and sequential file formats. The following criteria are the keys to distinguishing between the two formats:

  1. The line giving the initial piece of a sequence must be at least 10 characters long and there must be at least one non-whitespace character in those first ten characters. This should be the sequence identifier, and its characters are not counted as part of the sequence.
  2. In the Interleaved format, all of the sequence substrings in each block of the entry must have the same length. A block is a set of "number-of-sequences" lines (not counting blank lines) which contain a piece of each of the sequences.
  3. The end of each sequence must occur on its own line, without any additional non-whitespace text after the sequence characters.

If one format but not the other matches, or both formats match and the input format has been specified as PHYLIP-Int or PHYLIP-seq (instead of just PHYLIP), then the entry format has been successfully determined. Otherwise (if neither matches, or both match and the format was specified only as PHYLIP), a parse error is triggered. However, given the above criteria and the fact that the operation attempts to completely match both formats against the text, the likelihood that the formats will match the same text is extremely remote.

Finally, if the 'U' option has been set on the entry's first line, the read operation skips the user trees listed in the entry, to get to the end of the entry. The format of the user trees consists of a line giving the number of trees, followed by any number of lines of text where each user tree description is ended by a semi-colon (the operation just counts the semi-colons it sees). The end of the entry is at the end of the line containing the last semi-colon.
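
Since the trees are skipped purely by counting semi-colons, the step reduces to a small loop. The Python sketch below illustrates it; the function name `skip_user_trees` is invented, and the first field of the starting line is assumed to be the tree count.

```python
def skip_user_trees(lines, start):
    """Return the index of the first line after the user trees.

    lines[start] is assumed to hold the number of trees.
    """
    remaining = int(lines[start].split()[0])
    i = start + 1
    while remaining > 0 and i < len(lines):
        # Each ';' ends one tree description; trees may span several lines.
        remaining -= lines[i].count(";")
        i += 1
    return i


# Hypothetical tree section used only for illustration.
tree_lines = ["2", "((A,B),", "C);", "(A,(B,C));", "next entry"]
print(skip_user_trees(tree_lines, 0))
```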

The getseq operation finds the first line of the appropriate sequence in the entry (i.e., the `seqfseqno' sequence), skips the 10 character identifier and retrieves the sequence. All alphabetic characters are considered to be in the sequence.

The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.

The getinfo operation takes the 10 character sequence identifier to be the description of the sequence. No other information is retrieved.

The putseq operation outputs an Interleaved or Sequential entry exactly as described in the PHYLIP program documentation. If the sequences output are of different lengths, the putseq operation will pad the smaller sequences with dashes `-'.

(IMPORTANT: The one unusual feature about the putseq operation is that, unlike all of the other putseq operations except Clustalw and MSF, the actual output does not occur until `seqfclose' is called to close the file. Because the PHYLIP format must know the number of entries before it can output the first line, the sequences cannot be output at each call to `seqfwrite'. What the putseq operation does is, on each call to `seqfwrite', it makes a copy of the sequence and a sequence identifier (either the mainid, mainacc, description or organism name). Then, when `seqfclose' is called, all of the sequences are output in the correct format.)

There is no annotate function.

Example PHYLIP Interleaved entry:

     6    104
pir:CCCZ   GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA 
pir:CCMQR  GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA 
pir:CCMKP  GDVFKGKRIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQASGFTYTE 
pir:CCRB   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD 
pir:CCGW   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD 
pir:CCCM   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD 

           ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK 
           ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK 
           ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK 
           ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKDE RADLIAYLKK 
           ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK 
           ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK 

           ATNE 
           ATNE 
           ATNE 
           ATNE 
           ATNE 
           ATNE 

Example PHYLIP Sequential entry:

     6    104
pir:CCCZ   GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
           ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
           ATNE
pir:CCMQR  GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
           ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
           ATNE
pir:CCMKP  GDVFKGKRIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQASGFTYTE
           ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
           ATNE
pir:CCRB   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
           ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKDE RADLIAYLKK
           ATNE
pir:CCGW   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
           ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
           ATNE
pir:CCCM   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
           ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
           ATNE

Clustalw Format

The read operation first skips the header line of the file, and then skips any blank lines. The next non-blank line is assumed to begin the first block. The sequence lines of each block contain first an identifier of 15 characters and then the rest of the line is sequence. Those sequence lines must begin with a non-whitespace character. After the sequence lines in each block, there is an additional line to highlight closely related columns in the alignment, followed by zero or more blank lines. This additional line and all of the lines occurring between blocks must either be empty or begin with a whitespace character. There is only one entry per file, and the whole file is assumed to consist of these sequence blocks.

The getseq operation finds the first line of the appropriate sequence in the entry (i.e., the `seqfseqno' sequence), skips the 15 character identifier and retrieves the sequence. All alphabetic characters are considered to be in the sequence.

The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.
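
The block structure described above can be read with a single pass over the file. The Python sketch below illustrates the getseq rules; it is not the SEQIO code. The function name `parse_clustalw` is hypothetical, and collecting the pieces by identifier (rather than by position within each block) is a simplification made for the example.

```python
def parse_clustalw(lines):
    """Return a list of (identifier, sequence) pairs in file order."""
    order = []
    pieces = {}
    for line in lines[1:]:                     # lines[0] is the header line
        if not line or line[0].isspace():
            continue                           # blank, marker or inter-block line
        ident = line[:15].rstrip()             # 15-character identifier field
        residues = "".join(c for c in line[15:] if c.isalpha())
        if ident not in pieces:
            pieces[ident] = []
            order.append(ident)
        pieces[ident].append(residues)
    return [(ident, "".join(pieces[ident])) for ident in order]


# Hypothetical, shortened file used only for illustration.
clustal_lines = [
    "CLUSTAL W(*.**) multiple sequence alignment",
    "",
    "seq1".ljust(15) + "ACGT-A",
    "seq2".ljust(15) + "ACGTTA",
    " " * 15 + "****  ",
    "",
    "seq1".ljust(15) + "GG",
    "seq2".ljust(15) + "G-",
]
print(parse_clustalw(clustal_lines))
```

For the rawseq behavior, the character filter would keep every character that is neither whitespace nor a digit, so the '-' gap characters would be retained.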

The getinfo operation takes the 15 character sequence identifier to be the description of the sequence. No other information is retrieved.

The putseq operation outputs a Clustalw entry exactly as the clustalw program does, except that the version number is replaced with "*.**" and the package does not look for closely related columns in the output alignment (it simply outputs a line of whitespace without any '*' or '.' characters). If the sequences are of different lengths, the putseq operation will pad the smaller sequences with dashes '-'.

(IMPORTANT: The one unusual feature about the putseq operation is that, unlike all of the other putseq operations except PHYLIP and MSF, the actual output does not occur until `seqfclose' is called to close the file. Because the Clustalw format must know the number of entries before it can output the first line, the sequences cannot be output at each call to `seqfwrite'. What the putseq operation does is, on each call to `seqfwrite', it makes a copy of the sequence and a sequence identifier (either the mainid, mainacc, description or organism name). Then, when `seqfclose' is called, all of the sequences are output in the correct format.)

There is no annotate function.

Example Clustalw file:

CLUSTAL W(*.**) multiple sequence alignment



pir:CCCZ       GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWG
pir:CCMQR      GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGITWG
pir:CCMKP      GDVFKGKRIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQASGFTYTEANKNKGIIWG
pir:CCRB       GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWG
pir:CCGW       GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWG
pir:CCCM       GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWG
                                                                           

pir:CCCZ       EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
pir:CCMQR      EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
pir:CCMKP      EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
pir:CCRB       EDTLMEYLENPKKYIPGTKMIFAGIKKKDERADLIAYLKKATNE
pir:CCGW       EETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE
pir:CCCM       EETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE
                                                           

FASTA-output Formats

NOTE: With the exceptions listed below, this implementation can read and understand the output from the FASTA, TFASTA, SSEARCH, LFASTA, LALIGN and ALIGN programs, run in either interactive or non-interactive mode, with the output formatted with the MARKX option set to any of 0, 1, 2, 3 or 10.

The exceptions are:

  1. The program must have been run in non-interactive mode in order for the automatic format determination to work correctly. By "non-interactive", I mean that the initial header output by the program:
        FASTA searches a protein or DNA sequence data bank
        version 2.0u4 Feb., 1996
       Please cite:
        W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
       .
       .
       .
    

    must appear in the text given as input.

  2. If FASTA, TFASTA or SSEARCH is run in interactive mode, no information will be known about the query sequence (its information is in the initial header, which is not included in the file specified to receive the program output).
  3. The ALIGN program must be run in non-interactive mode in order for the package to parse its output correctly (i.e., the initial header must occur in the text). For the other programs, the package will parse the output correctly if the file format is specified as `FASTA-output'.
  4. The implementation was tested against version 2.0u4. If the output of earlier versions differs, the implementation may not work.

The read operation first scans the text occurring before the first alignment in the file. This initial text is ignored, except where it gives information about the sequences being aligned. The initial text of some of the output formats contains lines in one of the following forms:
 >GT8.7 transl. of pa875.con, 19 to 675: 217 aa
 >musplfm transl. of musplfm.seq, 2 to 676 : 224 aa

(A) musplfm.aa >musplfm transl. of musplfm.seq, 2 to 676          - 224 aa
(B) lcbo.aa    >LCBO - Prolactin precursor - Bovine               - 229 aa

>musplfm transl. of musplfm.seq, 2 to 676           224 aa vs.
>LCBO - Prolactin precursor - Bovine                229 aa
The text after the '>' is parsed to extract the sequence id (the first word after the '>'), a sequence description, the sequence length and alphabet information about the sequence.
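
As a rough illustration of this parsing step, the Python sketch below handles the line forms shown above with one regular expression. It is not the package's parser: the function name is hypothetical, and the mapping from the "aa" and "nt" suffixes to an alphabet is an assumption, since the text does not say how the alphabet is determined.

```python
import re

# id, description, optional ':' or '-' separator, length, unit suffix
_HEADER = re.compile(r"(\S+)\s*(.*?)\s*[:\-]?\s*(\d+)\s+(aa|nt)\b")

# Assumed mapping from the length unit to an alphabet name.
ALPHABETS = {"aa": "protein", "nt": "dna"}


def parse_fasta_output_header(line):
    """Return a dict describing the sequence named after '>', or None."""
    _, found, rest = line.partition(">")
    if not found:
        return None
    match = _HEADER.match(rest)
    if match is None:
        return None
    seqid, description, length, unit = match.groups()
    return {
        "id": seqid,                  # first word after the '>'
        "description": description,   # text between the id and the length
        "length": int(length),
        "alphabet": ALPHABETS[unit],
    }
```

Given the first example line above, the sketch yields the id "GT8.7" and the length 217.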

Then, the read operation reads the "entries" of the file, where each entry is considered to be the text describing an alignment between two sequences. Different programs output different sets of alignments, but all six of the supported FASTA programs output one or more two-sequence alignments. Thus, every entry in this format contains two sequences.

The getseq operation extracts the appropriate sequence from the entry (the first or second sequence if the `seqfseqno' value is 1 or 2, respectively). All alphabetic characters are considered part of the sequence, except that if the output was generated with MARKX=2, then any periods occurring in the second sequence are replaced with the corresponding character of the first sequence.
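
The period substitution for MARKX=2 output can be sketched as follows. The sketch assumes that "corresponding character" means the character in the same alignment column, and that both strings have already been cut from the same columns of the alignment text; the function name is hypothetical.

```python
def expand_markx2(first, second):
    """Replace each '.' in `second` with the same-column character of `first`."""
    expanded = []
    for column, char in enumerate(second):
        if char == "." and column < len(first):
            expanded.append(first[column])   # identical residue, copied over
        else:
            expanded.append(char)            # differing residue, gap or blank
    return "".join(expanded)
```

For example, expanding "...F...R" against "GDVEKGKK" gives "GDVFKGKR". The usual alphabetic filtering would then be applied to the expanded string.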

The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence (with the exception of period substitution mentioned above).

The getinfo operation extracts a main identifier, a description and an alphabet for the appropriate sequence, if available. It also constructs a comment that begins with the following:

From SSEARCH output alignment of:
 >musplfm transl. of musplfm.seq, 2 to 676, 224 aa
 >LCBO - Prolactin precursor - Bovine, 229 aa
This gives the name of the program whose output is being parsed, and the descriptions of the two sequences whose alignment produced the current sequence. This text is then followed by any information from the alignment describing the score of that pairwise alignment. The format of this text depends on the FASTA program executed and the MARKX value, as it is just copied from the program output.

There is no putseq or annotate operation.


BLAST-output Formats

NOTE: With the exceptions listed below, this implementation can read and understand the output from the BLASTN, BLASTP or BLASTX programs (and maybe even the TBLAST* programs, although that has not been tested yet). The exceptions are:

  1. Automatic recognition of the BLAST-output format requires that one of the keywords BLASTN, BLASTP or BLASTX be the first word in the file (possibly after an e-mail header). Many of the BLAST e-mail servers prepend a description of their service before the actual BLAST output, which disrupts the recognition by the package. So, for output obtained from an e-mail server, the input format must be set explicitly.
  2. The implementation was tested on output generated by versions 1.2 and 1.4.9. If the output is different in version 1.3 or 2.0, the implementation may not work (although the implementation can correctly handle gaps in the alignments, so that change from 1.* to 2.0 is handled).

The read operation first scans the text occurring before the first alignment in the file. This initial text is ignored, except where it gives information about the sequences being aligned. The initial text of some of the output formats contains lines of the following form:
Query=  gb:A02201|acc:A02201 Phage phi-105 DNA for immF plypeptide - Bacteriophage phi-
        (665 letters)
The text after "Query=" and before the line containing the "(... letters)" is parsed as a oneline description, and the number inside the "(... letters)" is taken as the length of the query sequence.
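
A minimal Python sketch of this step is shown below. The function name is hypothetical, and accepting commas inside the letter count is an assumption rather than something the text states.

```python
import re

_LETTERS = re.compile(r"\s*\((\d[\d,]*) letters\)")


def parse_query_header(lines):
    """Return (oneline description, query length), or None if not found."""
    parts = []
    in_query = False
    for line in lines:
        if not in_query:
            if line.startswith("Query="):
                in_query = True
                parts.append(line[len("Query="):].strip())
            continue
        match = _LETTERS.match(line)
        if match:
            length = int(match.group(1).replace(",", ""))
            return " ".join(p for p in parts if p), length
        parts.append(line.strip())   # continuation of the description
    return None
```

Applied to the two example lines above, the sketch returns the text after "Query=" as the description and 665 as the length.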

Then, the read operation reads the "entries" of the file, where each entry is considered to be the text describing an alignment between two sequences. The BLAST alignment format consists of header lines specifying the sequence that matches the query, followed by one or more pairwise alignments of substrings of the matching sequence and the query. The read operation first scans the header lines, which are of the form:

>emb|X02799|NCPHI105 Bacillus subtilis phage phi 105 immunity region with
            repressor gene and ORF >emb|A11144|A11144 phage phi 105 repressor
            (ORF1)-Orf 2 genes and there flanking regions
            Length = 1306
where the "Length =" line ends the list of oneline descriptions of the sequences that match the query (in the next pairwise alignment(s)). The read operation extracts the oneline description and length of the sequence.
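
The header scan can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, continuation lines are simply joined with single spaces, and commas in the length (as in "40,937") are assumed to be thousands separators.

```python
import re

_LENGTH = re.compile(r"\s*Length\s*=\s*(\d[\d,]*)")


def parse_hit_header(lines):
    """Return (oneline description, length) for the first '>' header found."""
    parts = []
    for line in lines:
        if not parts:
            # Still looking for the line that opens the header.
            if line.startswith(">"):
                parts.append(line[1:].strip())
            continue
        match = _LENGTH.match(line)
        if match:
            # The "Length =" line ends the list of descriptions.
            return " ".join(parts), int(match.group(1).replace(",", ""))
        if line.strip():
            parts.append(line.strip())   # continuation line
    return None
```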

The read operation considers an "entry" to consist only of the actual score reporting text and pairwise alignment text. So, while the header lines above are scanned for their information, the entry reported by the package begins at the line containing either "Plus Strand HSPs:", "Minus Strand HSPs:" or "Score =", and ends just after the last line of the pairwise alignment text. This is done to make the entry text reported by the package more uniform. Thus, the following BLAST output would be reported as two entries, the first beginning at the "Plus Strand HSPs:" line and running through the first pairwise alignment, and the second beginning with the "Score = 89..." line. The header lines will not be reported in any entry, and will only be scanned to extract the oneline description and length information.

>emb|Z68118|CER01E6 Caenorhabditis elegans cosmid R01E6
            Length = 40,937

  Plus Strand HSPs:

 Score = 127 (35.1 bits), Expect = 3.2, Sum P(2) = 0.96
 Identities = 39/56 (69%), Positives = 39/56 (69%), Strand = Plus / Plus

Query:    426 ATTTTAATAAATCTGGATTTAAATGTGTTAAAAATGACGGAAATACAAGTAGTTGA 481
              ||||||||||||||    ||||||  | |||||||||  | || |    || || |
Sbjct:  35266 ATTTTAATAAATCTCATCTTAAATTAGATAAAAATGAATGCAAAATTTATATTTTA 35321

 Score = 89 (24.6 bits), Expect = 3.2, Sum P(2) = 0.96
 Identities = 25/34 (73%), Positives = 25/34 (73%), Strand = Plus / Plus

Query:     93 ACAATACTAAAAAAGACGGAAATACAAGTATTTT 126
              ||||||||||||||    | ||   || ||||||
Sbjct:  31613 ACAATACTAAAAAATCTTGTAAACAAAATATTTT 31646
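
The entry boundaries used in this example can be sketched in Python as below. This is not the package's code; the function name is hypothetical, and the sketch assumes the input ends with the last alignment (any trailing program statistics would otherwise be attached to the final entry).

```python
STRAND_MARKERS = ("Plus Strand HSPs:", "Minus Strand HSPs:")


def split_blast_entries(lines):
    """Group BLAST output lines into entries, leaving out the '>' headers."""
    entries = []
    current = None           # lines of the entry being collected, if any
    awaiting_score = False   # True after a strand line, until its "Score ="

    def finish(entry):
        # The entry ends just after the last line of alignment text.
        while entry and not entry[-1].strip():
            entry.pop()
        if entry:
            entries.append(entry)

    for line in lines:
        text = line.strip()
        if text.startswith(">"):
            finish(current or [])
            current = None               # header lines belong to no entry
        elif text in STRAND_MARKERS:
            finish(current or [])
            current = [line]
            awaiting_score = True
        elif text.startswith("Score ="):
            if current is not None and awaiting_score:
                current.append(line)     # first score under a strand line
                awaiting_score = False
            else:
                finish(current or [])
                current = [line]
        elif current is not None:
            current.append(line)
    finish(current or [])
    return entries
```

Run over the example output above, the sketch produces two entries: one starting at the "Plus Strand HSPs:" line and one starting at the "Score = 89" line.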

The getseq operation extracts the appropriate sequence from the entry (the first or second sequence if the `seqfseqno' value is 1 or 2, respectively). All alphabetic characters are considered part of the sequence.

The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.
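
For BLAST output, the sequence text sits on the "Query:" and "Sbjct:" lines between two coordinates. The Python sketch below assumes that sequence 1 is the query and sequence 2 is the matching ("Sbjct") sequence, which the text implies but does not state outright; the function name is hypothetical.

```python
def extract_blast_sequence(entry_lines, seqno, raw=False):
    """Collect one sequence of a pairwise alignment from its entry lines."""
    prefix = "Query:" if seqno == 1 else "Sbjct:"
    pieces = []
    for line in entry_lines:
        if not line.startswith(prefix):
            continue                     # score text and match-marker lines
        body = line[len(prefix):]        # coordinates and sequence text
        if raw:
            # rawseq: drop only whitespace and digits (gaps are kept)
            kept = (c for c in body if not c.isspace() and not c.isdigit())
        else:
            # getseq: keep alphabetic characters only
            kept = (c for c in body if c.isascii() and c.isalpha())
        pieces.append("".join(kept))
    return "".join(pieces)
```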

The getinfo operation extracts a main identifier, a description and an alphabet for the appropriate sequence, if available. It also constructs a comment that begins with the following:

From BLASTN/BLASTP/BLASTX output alignment of:
   >gb:A02201|acc:A02201 Phage phi-105 DNA for immF plypeptide - Bacteriophage phi
and
   >emb|X02799|NCPHI105 Bacillus subtilis phage phi 105 immunity
              region with repressor gene and ORF 
   >emb|A11144|A11144 phage phi 105 repressor (ORF1)-Orf 2 genes
              and there flanking regions
This gives the name of the program whose output is being parsed, and the descriptions of the two sequences whose alignment produced the current sequence. This text is then followed by any information from the alignment describing the score of that pairwise alignment.

There is no putseq or annotate operation.


James R. Knight, knight@cs.ucdavis.edu
June 28, 1996