Departement de biochimie, Universite de Montreal, Montreal, Quebec, Canada
*Department of Biochemistry, Dalhousie University, Halifax, Nova Scotia, Canada
#ANGIS, University of Sydney, Sydney, Australia
The Gobase database currently contains mitochondrial sequence records (nucleic acid and amino acid) retrieved from NCBI's Entrez system, taxonomy data extracted from NCBI's Taxon database, standard gene names with information about their products (populated by Gobase), and genetic maps (established by OGMP). It also supports links to secondary structures (group I introns, 23S rRNA, 16S rRNA) and to objects in the following databases: Entrez, PID, Mendel, ENZYME, EcoCyc, WIT (Puma), PIR, SWISS-PROT, and Flybase.
The Gobase object classes reflect the major groups of genetic elements and in many cases differ from the NCBI data model. Each object class has a set of attributes that describes the objects belonging to the class. An attribute may hold single or multiple values from Gobase tables, as well as hyperlinks to other objects in Gobase and in external databases. The Gobase objects are accessible through WWW forms.
There are five groups of Gobase tables. Primary tables are populated with Entrez sequence records; their structure corresponds to the NCBI data model (ASN.1 definitions). Secondary tables integrate the information about Gobase objects, with each secondary table holding the information about one Gobase object class. Their attributes (columns) may contain information taken from the Primary tables (as annotated in an ASN.1 record), information annotated or corrected by Gobase experts, or information calculated or inferred from other attributes (for example, testing whether an ORF is contained in an intron, or verifying the cellular location of a sequence from its gene name). Secondary tables may also include new rows for features that are not explicitly annotated in ASN.1 records but are calculated from related features (for example, introns and exons calculated from NCBI CDS features). Gene/Product tables contain standard names and general information about genes and products, supplied by the Gobase experts. Extlink tables contain the URLs of PostScript files (genetic maps, secondary structures) and links from Gobase objects to objects in external databases. Taxon tables contain taxonomy data extracted from the NCBI Taxon dump files.
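The two derived-data examples mentioned above (introns calculated from CDS features, and the test for an ORF lying inside an intron) can be illustrated with a minimal Python sketch. It is not the Gobase implementation, which the abstract describes as SQL scripts and Perl. It assumes that a CDS is given as a list of exon segments with 1-based inclusive coordinates on a single strand; the function names introns_from_cds and is_contained are hypothetical.

```python
from typing import List, Tuple

# Assumption: a segment is a (start, end) pair, 1-based and inclusive.
Interval = Tuple[int, int]


def introns_from_cds(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exon segments of one CDS."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only a real gap between two exons counts as an intron.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


def is_contained(outer: Interval, inner: Interval) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


if __name__ == "__main__":
    cds_exons = [(1, 100), (201, 300), (451, 500)]  # illustrative coordinates
    introns = introns_from_cds(cds_exons)
    print(introns)
    orf = (320, 430)  # hypothetical ORF
    print(any(is_contained(intron, orf) for intron in introns))
```

Each derived intron would correspond to a new row in a Secondary table, and the containment test to a calculated attribute of an ORF object.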
Analysis of the data retrieved from Entrez has revealed inconsistencies in gene/product names, annotation errors, and missing information. Inconsistencies in gene/product names are being resolved by experts using the LINK program. LINK is an interactive learning program that recognizes a gene/product name using its implemented rules and supplies the gene/product identifier of the corresponding standard name. The rules are of three types. The first group is implemented as a list of synonyms or text preprocessing rules. The second group uses additional information, fetched from Gobase tables, to assign the correct name; the current version of LINK contains rules based on taxon name, anticodon information, length of the ORF, and cellular location. The third group contains special cases in which an expert has assigned a name to an individual record. Other systematic corrections and annotations of the data are maintained by SQL scripts during the development of the Secondary tables. Expert corrections are maintained in a table and will later be included in the data manager.
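The three rule groups described above can be sketched as a small Python resolver. This is an illustration only, not the LINK program: the rule tables, the record identifiers, the output name format for tRNA genes, and the order in which the groups are tried (special cases, then context rules, then synonyms) are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    record_id: str
    raw_name: str
    anticodon: Optional[str] = None  # context fetched from other tables


# Group 1 (assumed content): synonyms applied after text preprocessing.
SYNONYMS = {
    "coi": "cox1",
    "cytochrome oxidase subunit i": "cox1",
    "nd1": "nad1",
}

# Group 3 (assumed content): names assigned by an expert to single records.
SPECIAL_CASES = {"rec-0042": "nad1"}


def preprocess(name: str) -> str:
    """Lowercase, trim, and collapse whitespace, underscores, and hyphens."""
    return re.sub(r"[\s_\-]+", " ", name.strip().lower())


def resolve(rec: Record) -> Optional[str]:
    """Return a standard name, or None if the record needs expert review."""
    if rec.record_id in SPECIAL_CASES:
        return SPECIAL_CASES[rec.record_id]
    name = preprocess(rec.raw_name)
    # Group 2 (assumed rule): use the anticodon to disambiguate a tRNA gene.
    if name in ("trna leu", "trnl") and rec.anticodon:
        return f"trnL({rec.anticodon.lower()})"
    return SYNONYMS.get(name)


if __name__ == "__main__":
    print(resolve(Record("r1", "  COI ")))
    print(resolve(Record("r2", "tRNA-Leu", anticodon="UAA")))
    print(resolve(Record("r3", "unrecognized name")))
```

Returning None for an unrecognized name reflects the interactive role of the expert: an unresolved record would be presented for manual assignment, and the decision could then be stored as a new synonym or special case.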
The Gobase database has been implemented using the Sybase RDBMS. The query interface is provided through WWW forms using the Web/Genera gateway. Custom software has been developed in Perl and with Sybase's Open-Client development tools.