P. Rioux, B.F. Lang, G. Burger.
Université de Montréal, Department of Biochemistry, 2900 Blvd Edouard-Montpetit, Montreal, QC, Canada
GOBASE is a specialized biological relational database that integrates diverse data on organelles, including DNA and protein sequences, RNA secondary structure diagrams, taxonomic information, and genetic maps of completely sequenced mitochondrial DNAs. GOBASE has been publicly accessible via the WWW since 1996 and currently houses mitochondrial data. The major part of the data in GOBASE, i.e., sequence and taxonomic data, is retrieved from the public sequence data repository at NCBI and validated by in-house experts. Maintaining a curated database carries a very high labor cost, largely because genomic sequences are being generated at an unprecedented rate and because records retrieved from public repositories contain annotation errors and nonstandard or misleading information that require correction.
Here, we present the challenges involved in standardizing the nomenclature of mitochondrial genes. Although the majority of researchers in mitochondrial genomics have adopted the gene nomenclature developed by the CPGN in collaboration with the GOBASE team, a significant number of entries are still submitted to the public data repositories with nonconforming gene names. In particular, the scientific community working on animal mitochondrial genomes has maintained its own gene nomenclature. To minimize human intervention in the regular updates of GOBASE, we designed a computer program that recognizes nonstandard gene names and applies a set of expert-defined rules, which also draw on taxonomic information, to assign gene names automatically. Only a small number of gene names remain that have to be inspected by experts. A final validation step checks whether the sequences to which a gene name was assigned automatically are significantly similar to certified sequences of the corresponding gene class.
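The rule-based name assignment described above can be illustrated with a minimal Python sketch. It is not the GOBASE program itself: the synonym table, the taxon-dependent rule, and the function name `assign_gene_name` are hypothetical and chosen only for illustration. The sketch normalizes a submitted name, tries taxon-specific rules first, then a general synonym table, and flags anything unresolved for expert inspection.

```python
import re

# Hypothetical synonym table (normalized submitted name -> standardized name).
# The entries are illustrative examples, not the actual GOBASE rule set.
SYNONYMS = {
    "coi": "cox1", "co1": "cox1",
    "coii": "cox2", "co2": "cox2",
    "coiii": "cox3", "co3": "cox3",
    "cytb": "cob",
    "nd1": "nad1", "nd2": "nad2",
}

# Hypothetical taxon-dependent rules: (normalized name, taxon) -> standardized name.
TAXON_RULES = {
    ("nd4l", "Metazoa"): "nad4L",
}

# Names that are already standard, keyed by their normalized form.
STANDARD_BY_KEY = {
    name.lower(): name
    for name in set(SYNONYMS.values()) | set(TAXON_RULES.values())
}


def assign_gene_name(raw_name, lineage):
    """Return (standard_name, status) for a submitted gene name.

    lineage is a list of taxon names for the source organism.
    status is 'taxon_rule', 'synonym', 'standard', or 'needs_review'.
    """
    # Normalize: drop whitespace, underscores, and hyphens, then lower-case.
    key = re.sub(r"[\s_\-]+", "", raw_name).lower()

    # 1. Rules that only apply within a given taxon take precedence.
    for taxon in lineage:
        if (key, taxon) in TAXON_RULES:
            return TAXON_RULES[(key, taxon)], "taxon_rule"

    # 2. Taxon-independent synonyms.
    if key in SYNONYMS:
        return SYNONYMS[key], "synonym"

    # 3. The name already conforms to the standard.
    if key in STANDARD_BY_KEY:
        return STANDARD_BY_KEY[key], "standard"

    # 4. Nothing matched: leave the decision to an expert.
    return None, "needs_review"


if __name__ == "__main__":
    animal = ["Eukaryota", "Metazoa"]
    for submitted in ["COI", "cyt b", "ND4L", "nad1", "orf123"]:
        print(submitted, assign_gene_name(submitted, animal))
```

In this sketch, only names returned with the status `needs_review` would reach a curator. The final sequence-similarity validation is a separate step and is not covered here.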