MIMDB-96, Meeting on Interconnection of Molecular Biology Databases.
St. Louis, Missouri, USA, 12-15 June 1996.
The Organelle Genome Database Project (GOBASE)
Maria Korab-Laskowska,
Pierre Rioux,
B. Franz Lang,
Gertraud Burger,
Michael W. Gray*, and
Timothy G. Littlejohn#
Departement de biochimie, Universite de Montreal, Montreal,
Quebec, Canada
*Department of Biochemistry, Dalhousie University, Halifax,
Nova Scotia, Canada
#ANGIS, University of Sydney, Sydney, Australia
An important advantage that organelles offer to genomics is the
number of genomes already completely sequenced or currently
being sequenced (e.g., by the Canadian "Organelle Genome
Megasequencing Program", OGMP). No larger collection of completely
sequenced genomes currently exists. Organelle genomes are
information-rich and typically encode ~12-100 genes (proteins, tRNAs,
and rRNAs). These genes can be easily identified because most belong to
a highly conserved repertoire.
At present it is difficult or impossible to access all the relevant
information associated with organelles. Data are often dispersed among a
number of sources, difficult to locate, incomplete, and sometimes erroneous.
In their present disorganized state, organelle genomic data constitute a major
underexploited source of information. The Organelle Genome Database
Project (GOBASE) is intended to rectify this situation.
A relational model has been adopted to implement GOBASE using
the SYBASE RDBMS, installed on a Sun SPARCstation. Custom software for
populating and maintaining the database has been developed using
the Perl language and Sybase's Open-Client development tools. The query
interface is supported through WWW forms using the
Web/Genera gateway.
In its final form, GOBASE will be populated with many different data types,
including:
- primary sequences (nucleotide and protein)
- genetic and physical maps
- RNA secondary structures
- genotype and phenotype data
- biochemical and physiological data
- organismal information
Much of the data will be integrated from a wide range of molecular
biology databases. However, GOBASE will also contribute new
information by supplying genetic maps, compiling organismal information
and constructing RNA secondary structures. The first release of
GOBASE contains mitochondrial sequence data drawn from the Entrez
database system (ASN.1 format), and taxonomy data extracted from the
NCBI Taxon database. Future versions will be expanded by adding new
entities to the database, and by offering analytical tools such as
subsequence extraction and neighbouring feature searches.
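
As an illustration of what subsequence extraction and a neighbouring
feature search might involve, the following is a minimal sketch in Python.
It is not the GOBASE implementation: the Feature record, the function
names, and the 1-based inclusive coordinates are assumptions made for this
example, and the genome is treated as linear even though many organelle
genomes are circular.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical annotated feature; coordinates are 1-based, inclusive."""
    name: str
    start: int
    end: int
    strand: str = "+"


# Translation table for complementing unambiguous nucleotides.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_subsequence(genome: str, feature: Feature) -> str:
    """Return the feature's sequence, reverse-complemented on the '-' strand."""
    seq = genome[feature.start - 1:feature.end]
    if feature.strand == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq


def neighbours(features, target_name: str, flank: int = 1):
    """Return up to `flank` features on each side of the named feature,
    ordered by start coordinate (linear genome assumed)."""
    ordered = sorted(features, key=lambda f: f.start)
    idx = next(i for i, f in enumerate(ordered) if f.name == target_name)
    lo = max(0, idx - flank)
    return ordered[lo:idx] + ordered[idx + 1:idx + 1 + flank]


if __name__ == "__main__":
    genome = "AACCGGTTAC"  # made-up sequence for illustration
    feats = [
        Feature("geneA", 1, 4),
        Feature("geneB", 5, 8, "-"),
        Feature("geneC", 9, 10),
    ]
    print(extract_subsequence(genome, feats[1]))
    print([f.name for f in neighbours(feats, "geneB")])
```

A circular genome would additionally need wrap-around handling, both for
features that span the origin and for the neighbours of the first and last
features.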
The most complex task in integrating the data into
the local database is the resolution of semantic conflicts. These conflicts
are caused by:
- inconsistencies in names (e.g., gene synonyms)
- differences in data interpretation and annotation
- annotation errors
- missing information
We have resolved some of the systematic errors by using lists of synonyms
for all main features, and by inferring missing information or correcting
incorrect information from related features (e.g., gene names, gene types,
gene products and functions). An interactive program that will allow experts
to verify the data and correct the remaining errors is under development.
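
The synonym-list approach described above can be sketched as follows.
This is a hedged illustration in Python rather than the Perl code used
for GOBASE: the synonym table, the gene-information table, the record
layout, and the function names are all assumptions made for the example,
and the two genes shown are only a small illustrative sample.

```python
# Hypothetical synonym list: canonical gene name -> known aliases.
SYNONYMS = {
    "cox1": ["coxI", "coI"],
    "cob": ["cytb", "cyb"],
}

# Hypothetical reference data used to infer missing fields.
GENE_INFO = {
    "cox1": ("protein", "cytochrome c oxidase subunit 1"),
    "cob": ("protein", "cytochrome b"),
}

# Case-insensitive lookup from any known spelling to the canonical name.
_LOOKUP = {
    alias.lower(): canon
    for canon, aliases in SYNONYMS.items()
    for alias in [canon, *aliases]
}


def normalize_gene_name(name):
    """Return the canonical gene name, or None if the name is unknown."""
    return _LOOKUP.get(name.strip().lower())


def repair_record(record):
    """Return a copy of the record with a canonical gene name and with an
    empty type or product filled in from GENE_INFO. Records whose gene name
    cannot be resolved are flagged for review by an expert."""
    fixed = dict(record)
    canon = normalize_gene_name(record.get("gene") or "")
    if canon is None:
        fixed["needs_review"] = True
        return fixed
    gene_type, product = GENE_INFO[canon]
    fixed["gene"] = canon
    if not fixed.get("type"):
        fixed["type"] = gene_type
    if not fixed.get("product"):
        fixed["product"] = product
    fixed["needs_review"] = False
    return fixed


if __name__ == "__main__":
    print(repair_record({"gene": "CytB", "product": ""}))
    print(repair_record({"gene": "orf123"}))
```

Records that cannot be resolved automatically are only flagged, not
altered, which leaves the final decision to the interactive expert
correction step.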
GOBASE is supported by the
Canadian Genome Analysis & Technology program (CGAT).