The Organelle Genome Database Project (GOBASE)

MIMBD-96, Meeting on Interconnection of Molecular Biology Databases.
St. Louis, Missouri, USA, 12-15 June 1996.

Maria Korab-Laskowska, Pierre Rioux, B. Franz Lang, Gertraud Burger, Michael W. Gray*, and Timothy G. Littlejohn#

Département de biochimie, Université de Montréal, Montréal, Québec, Canada

*Department of Biochemistry, Dalhousie University, Halifax, Nova Scotia, Canada

#ANGIS, University of Sydney, Sydney, Australia

An important advantage that organelles offer to genomics is the number of genomes that have already been completely sequenced or are currently being sequenced (e.g., by the Canadian "Organelle Genome Megasequencing Program", OGMP). No larger collection of completely sequenced genomes currently exists. Organelle genomes are information-rich and typically encode ~12-100 genes (for proteins, tRNAs, and rRNAs). These genes are easily identified because most belong to a highly conserved repertoire.

At present it is difficult or impossible to access all the relevant information associated with organelles. Data are often dispersed among a number of sources and difficult to locate; they are frequently incomplete and may contain errors. In their present disorganized state, organelle genomic data constitute a major underexploited source of information. The Organelle Genome Database Project (GOBASE) is intended to rectify this situation.

GOBASE is implemented as a relational database using the Sybase RDBMS, installed on a Sun SPARCstation. Custom software for populating and maintaining the database has been developed in Perl using Sybase's Open Client development tools. The query interface is provided through WWW forms using the Web/Genera gateway.

In its final stage, GOBASE will be populated with many different data types, including:

primary sequences (nucleotide and protein)
genetic and physical maps
RNA secondary structures
genotype and phenotype data
biochemical and physiological data
organismal information
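
As a rough illustration of how a relational design can tie such data types together, the sketch below links organisms, sequences and annotated features through foreign keys. It is not the GOBASE schema: every table name, column name and data row is invented for the example, Python stands in for the project's Perl tooling, and SQLite stands in for the Sybase server.

```python
import sqlite3

# Hypothetical schema: names are illustrative only, and SQLite is used
# here purely as a stand-in for a relational server such as Sybase.
SCHEMA = """
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    lineage     TEXT
);
CREATE TABLE sequence (
    sequence_id INTEGER PRIMARY KEY,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
    accession   TEXT NOT NULL,
    residues    TEXT NOT NULL
);
CREATE TABLE feature (
    feature_id  INTEGER PRIMARY KEY,
    sequence_id INTEGER NOT NULL REFERENCES sequence(sequence_id),
    gene_name   TEXT,
    gene_type   TEXT,
    start_pos   INTEGER NOT NULL,
    end_pos     INTEGER NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database holding a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO organism VALUES (1, 'Example organism', 'Eukaryota')"
    )
    conn.execute(
        "INSERT INTO sequence VALUES (1, 1, 'DEMO0001', 'ATGGCCATTGTAATGGGCCGCTGA')"
    )
    conn.executemany(
        "INSERT INTO feature VALUES (?, ?, ?, ?, ?, ?)",
        [(1, 1, "cox1", "protein", 1, 9), (2, 1, "trnM", "tRNA", 12, 18)],
    )
    return conn


def genes_for_organism(conn, organism_name):
    """List (gene_name, gene_type) pairs for one organism, in map order."""
    return conn.execute(
        """SELECT f.gene_name, f.gene_type
           FROM feature f
           JOIN sequence s ON s.sequence_id = f.sequence_id
           JOIN organism o ON o.organism_id = s.organism_id
           WHERE o.name = ?
           ORDER BY f.start_pos""",
        (organism_name,),
    ).fetchall()
```

Keeping organisms, sequences and features in separate tables means that further entities (maps, secondary structures, phenotype data) could be attached by adding tables that reference the existing keys, rather than by restructuring what is already stored.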

Much of the data will be integrated from a wide range of molecular biology databases. However, GOBASE will also contribute new information by supplying genetic maps, compiling organismal information and constructing RNA secondary structures. The first release of GOBASE contains mitochondrial sequence data drawn from the Entrez database system (ASN.1 format), and taxonomy data extracted from the NCBI Taxon database. Future versions will be expanded by adding new entities to the database, and by offering analytical tools such as subsequence extraction and neighbouring feature searches.
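
The two analytical tools mentioned above can be pictured with a minimal Python sketch. It assumes that features are stored as 1-based, inclusive coordinates on a linear sequence; the `Feature` class, both function names and the `max_gap` parameter are hypothetical, and strand and genome circularity are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def extract_subsequence(residues, feature):
    """Return the residues covered by a feature."""
    return residues[feature.start - 1:feature.end]


def neighbouring_features(features, target, max_gap):
    """Return the other features lying within max_gap residues of target.

    Overlapping features count as having a gap of zero.
    """
    found = []
    for f in features:
        if f == target:
            continue
        if f.end < target.start:
            gap = target.start - f.end - 1   # f lies upstream of target
        elif f.start > target.end:
            gap = f.start - target.end - 1   # f lies downstream of target
        else:
            gap = 0                          # f overlaps target
        if gap <= max_gap:
            found.append(f)
    return sorted(found, key=lambda f: f.start)
```

A real implementation for organelle genomes would probably also need to handle circular maps, where a neighbour may sit across the origin, and features on the complementary strand.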

The most complex task in integrating the data into the local database is the resolution of semantic conflicts. These conflicts are caused by:

inconsistencies in names (e.g., gene synonyms)
differences in data interpretation and annotation
annotation errors
missing information

We have resolved some of the systematic errors by using lists of synonyms for all main features, and by inferring missing or incorrect information from related features (e.g., gene names, gene types, gene products and functions). An interactive program that lets experts verify the data and correct the remaining errors is under development.
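
A synonym-based clean-up of this kind might look like the following Python sketch. It is an assumption about the general approach, not the project's code: the synonym and product tables are tiny illustrative samples, the record layout (a dictionary with `gene` and `product` keys) is invented, and records that cannot be resolved are simply flagged for expert review.

```python
# Illustrative samples only; a real synonym list would be far larger.
SYNONYMS = {
    "cox1": {"coi", "coxi", "co1"},
    "cob": {"cytb"},
}
KNOWN_PRODUCTS = {
    "cox1": "cytochrome c oxidase subunit 1",
    "cob": "apocytochrome b",
}


def build_lookup(synonyms):
    """Map every alias (and each canonical name) to its canonical name."""
    lookup = {}
    for canonical, aliases in synonyms.items():
        for alias in aliases | {canonical}:
            lookup[alias.lower()] = canonical
    return lookup


LOOKUP = build_lookup(SYNONYMS)


def normalise(record):
    """Return (cleaned_record, needs_review) for one annotation record.

    The gene name is replaced by its canonical form, and a missing
    product is inferred from the gene name. Records whose gene name is
    unknown are returned unchanged and flagged for manual review.
    """
    cleaned = dict(record)
    raw_name = (record.get("gene") or "").strip().lower()
    canonical = LOOKUP.get(raw_name)
    if canonical is None:
        return cleaned, True
    cleaned["gene"] = canonical
    if not cleaned.get("product"):
        cleaned["product"] = KNOWN_PRODUCTS[canonical]
    return cleaned, False
```

For example, `normalise({"gene": "COI", "product": ""})` yields a record with the gene renamed to `cox1` and the product filled in, while a record with an unrecognised gene name is passed through with the review flag set.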

GOBASE is supported by the Canadian Genome Analysis & Technology program (CGAT).