1.0 GENERAL DESCRIPTION
(Last updated by GB on 10mar2011)
The masterfile (MF) provides an integration of DNA sequence and its
feature annotations, with the feature annotations embedded at the corresponding
positions in the sequence. The naming rules for genetic elements have
been designed such that they are easily interpretable by a human reader,
but are at the same time computer-readable (the MF format is a superset
of the FASTA format).
We have developed a number of tools that take MF files as input (basic sequence operations, syntax and logic checking, and gene/intergenic region extraction). Particularly useful is one, developed in collaboration with NCBI, that permits translation of an MF directly into a GenBank record (ASN.1 and GBF flatfile format), making it much easier to submit sequences to NCBI's publicly accessible data repositories.
The current document describes the conventions for the MF annotations that occur in contig headers, comment lines and feature lines.
The MF header is essentially free text, with all lines starting with a ";;" in the first two columns (which indicates that these lines are to be ignored/not parsed by software).
Here, we usually store information that will ultimately be included in various fields of GenBank entries: (i) a concise description of the sequence (e.g., 'Complete mitochondrial DNA sequence from Mus domesticus') ("DEFINITION" field); (ii) names of the people who provided the DNA and oversaw sequencing ("AUTHORS" field); and (iii) organism name, strain, subcellular localization of the molecule whose sequence is described, the nature of the sequence (DNA, RNA, etc.), whether it is complete (if applicable), and genetic code (all specified under GenBank's "source" feature key).
The sequence is composed of the following legal characters:
Sequence numbering and "!" characters in the MF are not translated to GenBank records. "!" is used in the MF to delimit the anticodons of tRNA genes or individual repeat units in repeat arrays.
All sequence feature lines start with a single semicolon ";" in the first column, followed by an arbitrary number of blanks, a "G-" (always capital G), the name of a feature, an arrow symbol (==> or <==, pointing right or left depending on the direction of the element in the contig, 5'->3' or 3'->5', respectively), and an obligatory "start", "end" or "point" qualifier (see also section 3.5, location operators).
; G-cox1 ==> start     indicates the start of the coding region of the cox1 gene.
; G-cox1 ==> end       indicates the end of the coding region of the cox1 gene.
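The feature line layout described above can be recognized with a single regular expression. The following is a minimal sketch, not the parser used by the existing MF tools; the pattern, the function name `parse_feature_line` and the dictionary keys are all hypothetical and chosen for illustration.

```python
import re

# Hypothetical pattern for one MF feature line, following the description:
# ";" in column 1 (but not ";;"), optional blanks, "G-" plus a feature name,
# an arrow, and one of start/end/point. Anything after that is kept as raw
# qualifier text.
FEATURE_RE = re.compile(
    r'^;(?!;)\s*G-(?P<name>\S+?)\s*(?P<arrow>==>|<==)\s*'
    r'(?P<position>start|end|point)\b\s*(?P<qualifiers>.*)$'
)

def parse_feature_line(line):
    """Return the parts of a feature line as a dict, or None for other lines."""
    match = FEATURE_RE.match(line.rstrip())
    if match is None:
        return None
    fields = match.groupdict()
    # "==>" marks an element running 5'->3' in the contig, "<==" 3'->5'.
    fields["direction"] = "5'->3'" if fields["arrow"] == "==>" else "3'->5'"
    return fields
```

Header lines beginning with ";;" and plain sequence lines return None, so the function can be used as a filter while reading a file line by line.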
When present as the last non-whitespace character of a feature line, a backslash indicates that the subsequent feature line is a continuation of this line, e.g.:
; G-cox2-I3 ==> start /note="this intron is inserted in gene positions\
;; identical to introns identified in Prototheca wickerhamii and Allomyces\
;; macrogynus mtDNA."
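Before feature lines are parsed, continued lines can be merged into one logical line. The sketch below assumes, based only on the example above, that continuation lines begin with ";;" and that the pieces are joined with a single blank; the function name `join_continuations` is hypothetical.

```python
def join_continuations(lines):
    """Merge lines ending in a backslash with the line(s) that follow them."""
    merged = []
    pending = None  # text of a logical line that is still being continued
    for raw in lines:
        line = raw.rstrip()
        if pending is not None:
            # Assumption: continuation lines start with ";;", as in the example.
            piece = line[2:] if line.startswith(";;") else line
            line = pending + " " + piece.strip()
            pending = None
        if line.endswith("\\"):
            # The backslash is the last non-whitespace character: keep waiting.
            pending = line[:-1].rstrip()
        else:
            merged.append(line)
    if pending is not None:
        merged.append(pending)  # dangling backslash on the final line
    return merged
```

Lines without a trailing backslash pass through unchanged, so the function can be applied to a whole file.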
G- is always written with a capital G, but the remainder of a G- identifier may be written in either upper or lower case, e.g.:
G-cox1-I5-ORF### = G-cox1-i5-orf###
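A tool comparing identifiers therefore needs to treat everything after the "G-" prefix as case-insensitive. A minimal sketch (the function name `same_identifier` is hypothetical):

```python
def same_identifier(first, second):
    """Compare two G- identifiers, ignoring case after the mandatory 'G-'."""
    if not (first.startswith("G-") and second.startswith("G-")):
        return False  # the prefix itself must use a capital G
    return first[2:].casefold() == second[2:].casefold()
```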
Location operators are composed of (i) the arrow symbols (==> or <==), pointing right or left, depending on the direction of the element (5'->3' or 3'->5', respectively); (ii) the obligatory start, end or point qualifiers associated with the arrow symbols; and (iii) the optional /join qualifier, which is used in conjunction with the -F feature to annotate fragments of genetic elements (see feature description or example 5.1). The GenBank location operator "group" and the MF "-F" feature are equivalent and will be discussed in the section that describes this feature. The start and end qualifiers mark the extent of a genetic element and must always be used in pairs with identical feature line identifiers (see Example 1, below), whereas point qualifiers occur singly (see Example 2).
Combinations of features (exons, introns, intronic open reading frames and other elements associated with genetic elements) are described by nested expressions.
Example 1: cox1 with the sequence "ATGCTGCTTGTTGATTAA".
TATATTAAAAAAAA
; G-cox1 ==> start
ATGCTGCTTGTTGATTAA
; G-cox1 ==> end
GGTCCCCCCC
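Example 1 can be read back programmatically by collecting the sequence lines that lie between a start line and its matching end line. The sketch below makes several assumptions that are not part of this specification: it handles only "==>" features, skips lines beginning with ">" as FASTA-style contig headers, and treats every letter on a sequence line as a residue while dropping numbering, blanks and "!". The names `FEATURE_RE` and `extract_forward_features` are hypothetical.

```python
import re

# Hypothetical pattern for a feature line; not taken from any existing MF tool.
FEATURE_RE = re.compile(r'^;(?!;)\s*G-(\S+?)\s*(==>|<==)\s*(start|end|point)\b')

def extract_forward_features(mf_lines):
    """Map feature names to the sequence between their start and end lines."""
    collecting = {}   # lower-cased name -> sequence chunks gathered so far
    finished = {}
    for raw in mf_lines:
        line = raw.strip()
        if not line or line.startswith((";;", ">")):
            continue                     # blank, header/comment or contig header
        match = FEATURE_RE.match(line)
        if match:
            name, arrow, position = match.groups()
            key = name.lower()           # identifiers are case-insensitive
            if arrow == "==>" and position == "start":
                collecting[key] = []
            elif arrow == "==>" and position == "end" and key in collecting:
                finished[key] = "".join(collecting.pop(key))
            continue
        if line.startswith(";"):
            continue                     # semicolon line that is not a feature line
        # Assumption: residues are letters; numbering, blanks and "!" are dropped.
        residues = re.sub(r"[^A-Za-z]", "", line)
        for chunks in collecting.values():
            chunks.append(residues)
    return finished
```

Applied to the five lines of Example 1, the function returns a dictionary with the single key "cox1" mapped to the 18-nucleotide coding sequence.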
Example 2: Positional operator.
The A following the G-Var annotation is a polymorphic site (A or G).
TCTCTCTGATC
; G-Var-mut1 ==> point /substitution= A=>G /polymorph
AGGGGG
MF features that contain general information about the sequence are included in the MF header. Features that describe specific genetic elements, or their location, are embedded into the nucleotide sequence and include: coding regions of protein genes (CDS), rRNAs, tRNAs, RNAs (e.g., mRNAs, gRNAs, etc.), exons, introns, mobile elements, signals/sites/conserved regions, sequence variation, DNA elements of unidentified function and fragments of genetic elements. Further genetic features may be added later (e.g., genes encoding the RNA subunits of ribozymes). For a summary of features and their associated qualifiers see Appendix 1. Many GenBank features that describe very specific genetic features, such as LTR, are used as MF qualifiers (/LTR) in association with a more general feature, in this case signals/sites/conserved regions. For a list of MF qualifiers that are equivalent to GenBank features, see Appendix 5.
Features can occur either singly, in combination, or in deeply nested expressions (see section 4.5, introns and an example).
When additional qualifiers (Appendix 4) have been adopted by OGMP that do not have equivalents in the GenBank format, they equate to the "note" qualifier of GenBank records. An example for the translation of an MF intron qualifier (groupIA) to GenBank syntax (right side) follows (see also Appendix 3):
; G-cox1-I3 ==> start /group=1a        intron /gene="cox1" /number=3 /note="/group=1a"
Likewise, when MF features do not have a GenBank equivalent, the description of this MF feature (see Appendix 3) should be represented in the /note field of the GenBank record, in the style of MF qualifiers. This permits back-translation from GenBank to MF records. E.g.:
; G-Mob-DHE13 ==> start                misc_feature /note="/mobile /DHE13"
Note that in the MF annotation system, gene names are used for feature identification. The gene names, products (protein, ribosomal RNA, tRNA), and synonyms are defined in an external look-up table (see Appendix 2).
Example: Start and end of the coding region of the gene cox1
; G-cox1 ==> start
; G-cox1 ==> end
Open reading frames whose products have not been identified are called orf and are identified by the number of amino acids that they potentially encode. If the same number occurs more than once in a genome, additional characters are used for distinction.
; G-orf123 ==> start
; G-orf123a ==> start
Specific ymf numbers will be assigned to orfs found in organellar genomes; also see the mitochondrial subsection of the Mendel database. As soon as a ymf number has been given to an orf, its homologs in other organisms will be given this same ymf number. If synonyms were used in past publications, these can be identified by a /synonym=" " qualifier that may contain the synonymous name, the species for which this synonym was introduced and a reference number in square brackets [#] pointing to a relevant citation listed in the reference section of the sequence record.
; G-orf227 ==> start /ymf#### /synonym="urf a (S. pombe mt) [#]"
; G-orf175a ==> end /ymf#### /synonym="orfB (plants mt)"
; G-orf175b ==> start /ymf####
; G-orf248 ==> end /ymf#### /synonym="orf244 (Marchantia mt) [#]"
Specific orf qualifiers will be used to classify certain conserved classes of common orfs. Note that these occur frequently as free-standing 'genes', as well as intronic orfs.
Example: Start and end of the coding region of the rns gene
; G-rns ==> start
; G-rns ==> end
Example: Start and end of the coding region of the trnG(ucc) gene
; G-trnG(ucc) ==> start
; G-trnG(ucc) ==> end
This feature makes it possible to annotate where an mRNA or any other expressed RNA starts and ends in a DNA sequence.
Example: Start and end of an RNA
; G-RNA ==> start
; G-RNA ==> end
Exons ("E#") are annotated according to their location in a gene with increasing numbers, E1, E2, ...
Example: End of cox1 intron 2, followed by the start of cox1 exon 3.
; G-cox1-I2 ==> end
; G-cox1-E3 ==> start
As for exons, introns (I#) are annotated according to their location in a gene with increasing numbers, I1, I2, .... Introns can be specified by three qualifiers: (i) the group they belong to (in organelles, groups 1, 2 and 3 have been introduced and further subclassified into subgroups, like 1a, 1b, etc.), (ii) their insertion site, expressed as a homologous gene position. The insertion site can be defined with a /hompos= statement, which refers to a nucleotide position in a ribosomal or tRNA sequence (single number) or to an amino acid plus codon position (two numbers joined by a "-"). An optional reference species associated with the /hompos statement can be added in brackets. Formal rules and standards for the /hompos definition will be described elsewhere, in more detail.
; G-cox1-I3 ==> start /group1b /hompos=255-2 (Allomyces)
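The two forms of /hompos described above (a single number, or two numbers joined by "-", each optionally followed by a species in brackets) can be told apart as follows. This is only a sketch, since the formal /hompos rules are defined elsewhere; the function name `parse_hompos` and the returned keys are hypothetical.

```python
import re

# Hypothetical pattern: one number, an optional "-number", and an optional
# reference species in round brackets (with or without a preceding blank).
HOMPOS_RE = re.compile(r'/hompos=(\d+)(?:-(\d+))?\s*(?:\(([^)]*)\))?')

def parse_hompos(qualifiers):
    """Extract the /hompos statement from a qualifier string, if present."""
    match = HOMPOS_RE.search(qualifiers)
    if match is None:
        return None
    first, second, species = match.groups()
    if second is None:
        # Single number: nucleotide position in a ribosomal or tRNA sequence.
        return {"kind": "nucleotide", "position": int(first), "species": species}
    # Two numbers: amino acid position plus position within the codon.
    return {"kind": "codon", "amino_acid": int(first),
            "codon_position": int(second), "species": species}
```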
Twintrons are introns that are inserted into introns. In order to mark this special case of intron structure, the "outer" intron containing the insertion should be qualified with /twintron,
; G-cox1-I3 ==> start /group1b /hompos=255-2 /twintron
and the inserted intron is annotated at the sequence position where it starts, with a "II" (or "II1,II2 ...") annotation, e.g.:
; G-cox1-I3-II1 ==> start /group3
Other genetic elements residing inside introns, such as intron orfs, genes, etc., are annotated using the established rules for nested features, and are labelled with a mandatory /intronic qualifier. Example for the start of an intron orf:
; G-cox1-I1-orf### ==> start /intronic /LAGLIDADG
In addition, if intronic orfs are in frame with their upstream exons, the following qualifier is used:
/inframe /first_aa=[amino acid]
Note that inframe intronic orfs must not include upstream exon sequences, but rather must start with the first codon within the intron. In the (frequent) case that this codon is not an initiation codon, the qualifier /first_aa=[amino acid in IUPAC code] is required.
The concept of mobile elements cannot be easily represented in GenBank terminology. The supplied /transposon qualifier of the "source" feature is only of limited use. Repetitive mobile elements could be translated into GenBank terminology by employing the "repeat_region" feature, whereas non-repetitive mobile elements can only be translated by using the "misc_feature" feature. For reasons of consistency, we will always use misc_feature.
; G-Mob-"element_name" ==> start /GIY-YIG
Note that some of the qualifiers of the "G-Sig-" feature are genuine features in GenBank (e.g., promoter, LTR, RBS, and D-loop, see Appendix 5), and that there exist two types of features in this category, one that describes sequence ranges and another one sites:
; G-Sig-"signal_name" ==> start
; G-Sig-"signal_name" ==> point (for positions, sites, without start and end)
The telomeric regions are annotated using the qualifiers /repeat_unit and /organization="text", to define the sequence of the telomere repeat and the number of units, respectively. The number of units can be given by a precise (e.g., 20), a relative (>20) or a range value (20-30). For more details, see the definition of this qualifier in Appendix 7.
In the following example, the feature lines in a masterfile describe (i) the start of the telomere repeat which consists of 20 to 35 repeat units, named "name_rep"; (ii) the start and (iii) the end of the repeat unit. Note the nesting of elements and their names, for the complete telomere region and its repeat unit.
; G-Sig-"tel_name" ==> start /organization="(name_rep)20-35"
; G-Sig-"tel_name"-"name_rep" ==> start /repeat_unit
; G-Sig-"tel_name"-"name_rep" ==> end
; G-Sig-"tel_name" ==> end
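The three ways of stating the number of units (precise, relative, range) can be distinguished when reading an /organization value. The sketch below covers only the simple form "(unit_name)count" used in the example; the full syntax is defined in Appendix 7 and is not reproduced here. The function name `parse_simple_organization` and the returned tuple layout are hypothetical.

```python
import re

# Hypothetical pattern for the simplest /organization value: one unit name in
# round brackets followed by ">N" (relative), "N-M" (range) or "N" (precise).
ORG_RE = re.compile(r'^\((?P<unit>[^()]+)\)(?P<count>>\d+|\d+-\d+|\d+)$')

def parse_simple_organization(value):
    """Return (unit, kind, low, high), or None if the value is not simple."""
    match = ORG_RE.match(value.strip())
    if match is None:
        return None
    unit, count = match.group("unit"), match.group("count")
    if count.startswith(">"):
        # Relative value: more than `low` units, no upper bound given.
        return (unit, "relative", int(count[1:]), None)
    if "-" in count:
        low, high = (int(part) for part in count.split("-"))
        return (unit, "range", low, high)
    return (unit, "precise", int(count), int(count))
```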
The -F# (-F followed by a number) feature is used in conjunction with a G-feature to annotate fragments of a genetic element that belong to one functional unit (e.g. rRNA in pieces, trans-splicing introns, protein coding regions interrupted by insertions, proteins interrupted by protein introns, etc.).
; G-"gene_name"-F# /join (see example)
If the fragments have to be joined before translation into a protein sequence, the qualifier /join is used, as in the example above. In all other cases (e.g. genes in pieces, trans-splicing introns) the default for "-F#" is equivalent to the GenBank "group" operator that indicates the functional relationship of these fragments.
GenBank annotations use the "join" and "group" operators to describe the masterfile -F feature.
This feature does not define a genetic element like the other features described above, but is rather an operator indicating sequence variation of any nature at the DNA or RNA level. Like sequence motifs of unknown function (see also next section, 4.10), its feature elements are not nestable as they describe features at the sequence level only.
; G-Var-"var_name" ==> start (for sequence variation ranges)
; G-Var-"var_name" ==> point (for substitutions and insertions)
; G-Var-ins1 ==> point /insertion="AGCTAGATAGGTGG"
; G-Var-subs2 ==> point /substitution G==>A (substitute following G with A)
; G-Var-del3 ==> start /deletion
; G-Var-del3 ==> end
; G-Var-pmo7 ==> point /substitution G==>A,C,T /polymorph (G can be A, C or T)
This feature is used to annotate repeats, potential RNA/DNA secondary structures and other regular sequence motifs. Like sequence variations (see the previous section, 4.9), its feature elements are not nestable as they describe features at the sequence level only.
; G-Mot-"motif_name" ==> start
The repeat units of repetitive sequences are annotated using the /repeat_unit and /organization="text" qualifiers to define both the sequence of the repeats and their organization. Note that individual units in a repeat region are annotated without indicating their context (nesting; this differs from the annotation of the repeat units of telomeres). The reason for the different syntax is that repeat units of telomeric regions only occur within these regions and are functional units of the telomere, whereas this is not necessarily the case for other repeat regions of unknown function.
; G-Mot-"name_region" ==> start /organization="(name_rep)3"
; G-Mot-"name_rep" ==> start /repeat_unit
; G-Mot-"name_rep" ==> end
This example describes the repeat region with the name "name_region" consisting of three identical copies of a sequence motif in tandem arrangement, named "name_rep". The repeat unit is identified by the qualifier /repeat_unit.
In order to express more complicated topologies of repeat units within one region, a specific syntax for the /organization qualifier was developed (see Appendix 7).
The following example shows a cox1 gene (nucleotide sequence not shown) that contains three introns; the introns in turn contain intron orfs, and one intron orf is split by a mobile element (Mob). The translation of the latter intron orf into a protein is only possible after removal of the Mob sequence. This is achieved by joining the two fragments of the intron orf (F1 and F2). In order to emphasize the nested structure, the expression is written in an indented form. Note that qualifiers are used only once, when the genetic element is defined at its start position.
; G-cox1 ==> start
;   G-cox1-E1 ==> start
;   G-cox1-E1 ==> end
;   G-cox1-I1 ==> start /group1b
;     G-cox1-I1-orf# ==> start /intronic /LAGLIDADG /inframe
;     G-cox1-I1-orf# ==> end
;   G-cox1-I1 ==> end
;   G-cox1-E2 ==> start
;   G-cox1-E2 ==> end
;   G-cox1-I2 ==> start /group2
;   G-cox1-I2 ==> end
;   G-cox1-E3 ==> start
;   G-cox1-E3 ==> end
;   G-cox1-I3 ==> start /group1a /hompos=255-2(Allomyces)
;     G-cox1-I3-orf## ==> start /intronic /endo /inframe /pseudo
;       G-cox1-I3-orf##-F1 ==> start
;       G-cox1-I3-orf##-F1 ==> end
;       G-cox1-I3-orf##-Mob-DHE13 ==> start /dispersed
;       G-cox1-I3-orf##-Mob-DHE13 ==> end
;       G-cox1-I3-orf##-F2 ==> start
;       G-cox1-I3-orf##-F2 ==> end
;     G-cox1-I3-orf## ==> end
;   G-cox1-I3 ==> end
;   G-cox1-E4 ==> start
;   G-cox1-E4 ==> end
; G-cox1 ==> end
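Because start and end qualifiers must always occur in pairs with identical identifiers, a nested expression such as this one can be checked mechanically. The following sketch is not the logic checker mentioned in the general description; it only verifies that every start has a partner end (in either order, so that "<==" features are not rejected) and ignores point lines. The names `FEATURE_RE` and `check_pairing` are hypothetical.

```python
import re

# Hypothetical pattern; leading blanks are tolerated for indented listings.
FEATURE_RE = re.compile(r'^\s*;(?!;)\s*G-(\S+?)\s*(==>|<==)\s*(start|end|point)\b')

def check_pairing(mf_lines):
    """Return a list of messages for start/end lines that lack a partner."""
    problems = []
    pending = {}  # lower-cased name -> (qualifier seen first, line number)
    for number, line in enumerate(mf_lines, start=1):
        match = FEATURE_RE.match(line)
        if match is None:
            continue                      # header, comment or sequence line
        name, _arrow, position = match.groups()
        if position == "point":
            continue                      # point qualifiers occur singly
        key = name.lower()                # identifiers are case-insensitive
        if key not in pending:
            pending[key] = (position, number)
        elif pending[key][0] == position:
            problems.append(f"line {number}: second '{position}' of G-{name} "
                            "before its partner")
        else:
            del pending[key]              # start and end found: pair complete
    for key, (position, number) in pending.items():
        problems.append(f"line {number}: '{position}' of G-{key} has no partner")
    return problems
```

An empty list means every start/end pair is complete; the check says nothing about whether the nesting itself is biologically sensible.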
The conversion of the example outlined above to a GenBank record might look like (arbitrary coordinates and gi numbers):
     CDS             join(37..1806,2562..2581,3442..4074,4828..5123)
                     /gene="cox1"
                     /EC_number="184.108.40.206"
                     /note="NCBI gi: 467847"
                     /codon_start=1
                     /product="cytochrome oxidase, subunit 1"
                     /translation="MV.........IQV"
     exon            37..1806
                     /gene="cox1"
                     /number=1
     intron          1807..2561
                     /note="/groupIB"
                     /number=1
     CDS             1807..2440
                     /gene="orf#"
                     /note="/intronic /LAGLIDADG /inframe"
                     /codon_start=2
                     /translation="KID.....RSSIHY"
     exon            2562..2581
                     /gene="cox1"
                     /number=2
     intron          2582..3441
                     /note="/groupII"
                     /number=2
     exon            3442..4074
                     /gene="cox1"
                     /number=3
     intron          4075..4827
                     /note="/groupIA /hompos=255-2(Allomyces)"
                     /number=3
     CDS             join(4077..4277,4377..4755)
                     /gene="orf##"
                     /pseudo
                     /note="/intronic /endo /inframe"
                     /codon_start=3
                     /translation="KNL.....NLKNL"
     misc_feature    4278..4376
                     /note="/Mob-DHE13 /dispersed"
     exon            4828..5123
                     /gene="cox1"
                     /number=4
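The CDS location and the intron ranges in this record follow directly from the exon coordinates. The sketch below shows that arithmetic under the assumption that exons are given as 1-based, inclusive (start, end) pairs in ascending order; it is an illustration, not the NCBI conversion tool, and the function names `join_location` and `introns_between` are hypothetical.

```python
def join_location(segments):
    """Format (start, end) pairs as a GenBank location string."""
    if not segments:
        raise ValueError("at least one segment is required")
    parts = [f"{start}..{end}" for start, end in segments]
    # A single segment needs no operator; several are wrapped in join(...).
    return parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"

def introns_between(exons):
    """Return the ranges lying between consecutive exons."""
    return [(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(exons, exons[1:])]

# Exon coordinates taken from the example record above.
cox1_exons = [(37, 1806), (2562, 2581), (3442, 4074), (4828, 5123)]
cds_location = join_location(cox1_exons)
intron_ranges = introns_between(cox1_exons)
```

With the four cox1 exons, `cds_location` reproduces the join(...) expression of the first CDS feature, and `intron_ranges` matches the three intron ranges listed in the record.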