TBestDB: The Protist EST Database

Investigators: G. Burger, B. Franz Lang, University of Montreal, Montreal, PQ.

Initial project description

Introduction

Genomics has permeated virtually all fields in the life sciences, from ecology, developmental biology and genetics to immunology and pharmacology. Several large-scale genomics projects are well underway world-wide to determine the genetic make-up of humans, the weed Arabidopsis, yeast, numerous eubacteria, archaea and viruses, and many more new projects are being started or anticipated. One of these new initiatives is the Protist EST program (PEP), which aims to explore the diversity of eukaryotic genomes in a systematic and comprehensive way.

Inherent to the genomics approach is the immense body of data it generates. As a consequence, the staggering growth of molecular sequence information and, more recently, also of functional and structural data has made the collection and storage of such data in centralized public-domain repositories a challenge. The scientific community is now increasingly becoming aware that repositories alone are not sufficient to fully exploit this wealth of information. This appraisal has led us to propose an integrated and rigorously curated database for the PEP-generated data.

Rationale

PEP aims to determine the expressed genome portions of a phylogenetically broad range of eukaryotes. The large body of data that the PEP initiative expects to generate will provide us for the first time with the critical material to delve into a number of fundamental biological questions, spanning from the origin of the eukaryotic cell to the acquisition of primary and second-hand chloroplasts and the phylogenetic affiliation of the enigmatic amoebae and heterotrophic flagellates.

Although each of the participating sub-projects is worthwhile in its own right, the strength and novelty of PEP lie in its integration, which has major benefits at the analytical level. Through the generation of new and broad data sets of eukaryotic proteins, much more rigorous gene-based and genome-based phylogenetic analyses will become feasible. Furthermore, these new data will enhance our ability to infer metabolic pathways, molecular mechanisms and cellular processes in a given organism, even with only partial knowledge of its coding capacity. Finally, this comprehensive information is expected to facilitate the identification of moderately conserved genes or genes from highly derived taxa, which is often hampered by the lack of phylogenetically intermediate species.

In summary, the Protist EST database will be a centerpiece of the PEP collaboration. It will be located at and operated through the University of Montreal.

Objectives

As described in more detail below, TBestDB will serve two major purposes:

1. Management of the data generated by PEP, including their

   - archiving
   - organization
   - dissemination

2. Analysis of these data, including their

   - validation
   - full annotation
   - cross-referencing within the PEP projects and with published data
   - comparative analyses

1. Data Management

Efficient data management is a main objective of TBestDB. This is important in light of the large body of sequence information that will be generated by PEP, estimated at >50 Mbp of coding DNA sequence in the first year alone, and increasing to 0.5-1 Gbp from about 30 different organisms in year 5.

The data collected and centrally stored in TBestDB will consist not only of the bare nucleotide sequences and deduced protein sequences, but also of a multitude of additional, related information. This includes, for example, clone library and clone retrieval codes, details about library construction, cloning vectors used, and growth conditions of the cells from which mRNAs were purified. It will be possible to inspect the original sequence readings (SCF-formatted trace files) for an assessment of data quality, and to obtain information about experimental aspects of sequences. Finally, descriptions of the organisms used, the source of strains, and taxonomic information will be provided. A complete compilation of these 'experimental' data sets stored in TBestDB is given in Table 1.

In addition to collection and storage of the generated data, TBestDB will provide the ultimate means for their dissemination, making them freely accessible via the Internet. Initially, the information will be available only to members of the consortium, but after a brief delay, the data will be made accessible to the scientific community at large.

2. Data Analysis

To allow comparative sequence analyses at any stage of the projects, the 'experimental' data described above will be verified by the database curator as soon as they are added to the database, and completed or corrected if required (in consultation with the submitter). In addition, higher-level information (sequence annotation, gene ontology) will also be included in the database, as detailed below.

2.1. Sequence annotation

Sequence annotations are the prime information in TBestDB. A complete annotation of a cDNA sequence will include the location of the coding region as well as of motifs and signals present on the mRNA (e.g., poly-A site, splice sites for alternatively spliced mRNAs, if applicable); gene assignment and gene family classification; the deduced protein sequence including the location of domains and signals (e.g., sorting and processing signals, ATP-binding sites); product assignment and general information about gene products, their cellular location and their involvement in biochemical pathways (see Table 1).

Through the centralization of the PEP data, a consistent and standardized nomenclature of genes and products can be easily implemented (or changed, if required), and likewise, synonyms can be supported. A unified nomenclature will permit precise and complete data retrieval by using gene and product name, sub-cellular location (nucleus, mitochondrion, etc.), the process in which deduced proteins participate (e.g., translation, transmembrane transport), and the specific molecular reaction gene products may perform (such as electron transfer or DNA methylation). It is widely recognized that public domain data repositories, as they do not have the authority or mandate for imposing a unifying nomenclature, let alone annotation standards, suffer from severe shortcomings in data queries.
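The synonym support described above can be pictured as a lookup that maps every accepted spelling of a gene name onto one canonical name before a query is run. The following is a minimal sketch only: the table contents, the function names and the normalization rule (case-insensitive, spaces ignored) are assumptions made for illustration, not the nomenclature or the implementation adopted by TBestDB.

```python
# Hypothetical synonym table: canonical gene name -> accepted synonyms.
# The entries are examples only, not the TBestDB nomenclature.
SYNONYMS = {
    "cox1": {"coxI", "COI", "co1"},
    "cob": {"cytb", "cyt b"},
}


def normalize(name):
    """Assumed normalization rule: ignore case and embedded spaces."""
    return name.lower().replace(" ", "")


def build_lookup(synonyms):
    """Map every normalized spelling (canonical or synonym) to its canonical name."""
    lookup = {}
    for canonical, alternatives in synonyms.items():
        for name in {canonical, *alternatives}:
            lookup[normalize(name)] = canonical
    return lookup


def canonical_name(name, lookup):
    """Return the canonical gene name, or None if the name is unknown."""
    return lookup.get(normalize(name))


lookup = build_lookup(SYNONYMS)
print(canonical_name("COI", lookup))    # cox1
print(canonical_name("Cyt b", lookup))  # cob
```

Because the table is held in one place, renaming a gene or adding a synonym changes a single entry rather than every annotated record.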

Most important for the ultimate comparative analyses that the PEP consortium members intend to perform will be the careful cross-referencing of the data among the various PEP projects and with data from model organisms that are pertinent to our studies. Such models include yeast and the nematode C. elegans, the only two eukaryotes for which the complete set of protein sequences as well as the complete nuclear genome sequence are known, as well as cyanobacteria (Synechocystis PC6803) and alpha-proteobacteria (Rickettsia prowazekii; Caulobacter and Sinorhizobium underway), which share common ancestry with chloroplasts and mitochondria, respectively. Links between data from the individual PEP sub-projects and between PEP data and published data will be established on the basis of similarity scores.
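As an illustration of how links could be derived from similarity scores, the sketch below keeps, for each query sequence, the best-scoring subjects that pass a cut-off. The use of E-values as the score, the threshold of 1e-10, the limit of three links per query and all identifiers are assumptions chosen for the example; the text above does not specify them.

```python
from collections import defaultdict


def build_cross_references(hits, max_evalue=1e-10, top_n=3):
    """Select cross-reference links from similarity search hits.

    hits: iterable of (query_id, subject_id, evalue) tuples.
    Returns a dict mapping each query to its best subjects (lowest E-value
    first). Queries without any hit below the cut-off are omitted.
    """
    by_query = defaultdict(list)
    for query_id, subject_id, evalue in hits:
        if evalue <= max_evalue:
            by_query[query_id].append((evalue, subject_id))
    return {
        query_id: [subject_id for _, subject_id in sorted(pairs)[:top_n]]
        for query_id, pairs in by_query.items()
    }


# Hypothetical hits; the identifiers are invented for the example.
example_hits = [
    ("est1", "yeastA", 1e-50),
    ("est1", "wormB", 1e-5),
    ("est1", "syn1", 1e-20),
    ("est2", "yeastC", 0.1),
]
print(build_cross_references(example_hits))  # {'est1': ['yeastA', 'syn1']}
```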

2.2. The gene ontology framework

In order to allow full interoperability with single-organism databases for a number of eukaryotes such as yeast, mouse, C. elegans, Arabidopsis, etc., we will implement the gene ontology framework that has been developed by the Gene Ontology (GO) consortium [2,3]. This conceptual framework consists of three main categories into which biological data are classified: (i) cellular component, (ii) biological process and (iii) molecular function, each with numerous, precisely defined subcategories.

The implementation of this framework in TBestDB will associate protist proteins with particular terms in the structured network of the GO vocabulary. The association may be based on sequence analysis; for example, we may detect a targeting signal in a cDNA sequence that points to the cellular location of the encoded protein. Alternatively, a newly determined protist protein may display sequence similarity with known proteins that have already been classified according to GO terminology. This information would provide us with clues about the role and location of the protist protein in question. In this way, we will be able to tap into the wealth of functional information that has been accumulated during the past decades by exploiting the powerful tools of classical genetics, genetic engineering and protein chemistry that are well established in these organisms. Association files are now available from FlyBase, Mouse Genome Database (MGD), Saccharomyces Genome Database (SGD), Arabidopsis Information Resource (TAIR), and WormBase [59].
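The similarity-based association described above can be sketched as a transfer of GO terms from already classified proteins to a new protist protein. This is a simplified illustration under stated assumptions: the E-value cut-off, the data layout and the protein identifiers are hypothetical, and a real pipeline would also record the evidence on which each association rests.

```python
GO_ASPECTS = ("cellular_component", "biological_process", "molecular_function")


def transfer_go_terms(hits, associations, max_evalue=1e-20):
    """Propose GO terms for a query protein from its similarity hits.

    hits: list of (subject_id, evalue) for one query protein.
    associations: dict subject_id -> list of (aspect, go_term), as might be
        loaded from an association file (layout assumed for this sketch).
    Returns a dict with one set of proposed terms per GO category.
    """
    proposed = {aspect: set() for aspect in GO_ASPECTS}
    for subject_id, evalue in hits:
        if evalue > max_evalue:
            continue  # similarity too weak to justify a transfer
        for aspect, term in associations.get(subject_id, []):
            proposed[aspect].add(term)
    return proposed


# Hypothetical example data; 'known1' and 'known2' are invented identifiers.
example_associations = {
    "known1": [
        ("cellular_component", "GO:0005739"),  # mitochondrion
        ("biological_process", "GO:0006412"),  # translation
    ],
    "known2": [("molecular_function", "GO:0003735")],
}
example_hits = [("known1", 1e-40), ("known2", 1e-3)]
print(transfer_go_terms(example_hits, example_associations))
```

Proposals of this kind are hypotheses about role and location, to be confirmed by the curator rather than accepted automatically.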

Due to the phylogenetic breadth and depth of the organisms studied in the PEP project, we expect to discover molecular functions and biological processes that are not yet included in the GO vocabulary. Because this framework is designed for increasing knowledge and changing views about the roles of genes and proteins in cells, its fine structure is readily expandable. We have been invited to collaborate with the GO consortium at such time that new discoveries necessitate the expansion of the GO vocabulary (see appended letter of collaboration).

2.3. The workbench

In addition to data organization and annotation, TBestDB will provide tools within the database for data analysis, in particular for comparative and phylogenetic analyses, which are the main focus of the PEP initiative. A selection of bioinformatics tools that we intend to employ regularly for our analyses will be included in an integrated working environment (widely referred to as a workbench) that will be accessible from within the database. The tools provided (see Table 2 for a complete listing) will support translation from nucleotide into protein sequence, similarity searches in local and remote databases, and searches for motifs and signals (e.g., PROSITE [1] or user-defined patterns). Searches for sequence patterns can yield information complementary to that provided by a global (FASTA) or local (BLAST) similarity search, and the two approaches, when combined, are powerful for identifying the function of new proteins. The workbench will also allow users to generate multiple sequence alignments, visualize them and import them into other analysis programs. Multiple alignments are typically used for the construction of phylogenetic trees, and we will offer a number of different methods for phylogenetic inference, including distance, parsimony and likelihood methods.
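To illustrate the kind of motif search mentioned above, the following sketch converts a PROSITE-style pattern into a regular expression and scans a protein sequence with it. It covers only the basic pattern syntax (residues, `x`, `[..]`, `{..}`, repeat counts, and the `<` and `>` anchors) and is an assumption about how such a tool could work, not a description of the workbench; the function names are hypothetical.

```python
import re

# One pattern element, optionally followed by a repeat count: x(4), x(2,3)
ELEMENT_RULE = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):    # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = ELEMENT_RULE.fullmatch(element)
        if match is None:
            raise ValueError(f"empty element in pattern {pattern!r}")
        core, low, high = match.groups()
        if core == "x":                                   # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# ATP/GTP-binding site motif A (P-loop) written in PROSITE syntax.
print(prosite_to_regex("[AG]-x(4)-G-K-[ST]"))                       # [AG].{4}GK[ST]
print(find_motif("[AG]-x(4)-G-K-[ST]", "MTEYKLVVVGAGGVGKSALT"))     # [(9, 'GAGGVGKS')]
```

The same function accepts user-defined patterns, provided they are written in the same syntax.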

The generation of a large amount of phylogenetic information can be automated. Among other automated analyses, we will run a maximum likelihood analysis of related proteins as new reads are added to the database. This has been done successfully for the Sinorhizobium genome project (map, output list, and tree). For TBestDB, the genes from protists will be even more diverse, as will the branch lengths among closely related taxa, and they will require more careful analyses. These will necessitate the incorporation of rate heterogeneity and will be more computationally intensive. One way to alleviate this is to make use of the added computational power of clusters. The most recent PUZZLE package has been rewritten to take advantage of multiple processors, but most maximum likelihood programs have not yet been rewritten. Clusters of CPUs within machines and clusters between machines face different problems, but these are largely technological. Both PUZZLE and the maximum likelihood programs from PHYLIP will be part of the TBestDB workbench.

A larger problem that requires careful research is turning phylogenies generated from individual genes into phylogenies for organisms. Current data from prokaryotes suggest that this is not always possible due to extensive horizontal gene transfer. There is currently not enough information to determine whether single-celled eukaryotes also undergo large amounts of horizontal gene transfer. If they do (establishing this would itself be a major advance of the PEP project), then only segments of the genomes (and/or groups of related genes) might share phylogenies. If they do not, then software to combine these phylogenies must be developed, and it should incorporate not only phylogenetic information but also gene function/classification and genomic/chromosomal position. Wherever possible, this should be done in a 'hypothesis testing' framework. As TBestDB develops, we will need to evaluate which software tools to implement. The database must be very flexible in its approach, since it will serve several independent laboratories, each with its own requirements. In addition, these demands will help to point to new tools that need to be researched and developed.

3. Database structure

For secure and accurate maintenance (addition, deletion, modification) of information as well as easy searching and retrieval, dedicated database software will be employed, as described in a following section. To represent the PEP data in the database in a way that reflects basic biological concepts, we have classified the information into eight categories or entities, namely SEQUENCE, GENE, PROTEIN, RNA, SIGNAL, TAXON, PROTEIN-PROTEIN_INTERACTION, and GENE_ONTOLOGY. These entities have numerous attributes to represent the essential properties of the data; for example, the attributes of SEQUENCE are length, location_from, location_to, molecule type, sequence_id, etc. (Table 2). The TBestDB entities are a sub-set of those used in the GOBASE database, except for GENE_ONTOLOGY and PROTEIN-PROTEIN_INTERACTION. The latter entity will accommodate data that will be generated in the context of a functional genomics project in which G.B. and B.F.L. are co-applicants, namely the 'Foyer d'Expertise' application to Genome-Quebec, which proposes large-scale protein-protein interaction studies in yeast and Arabidopsis. The cataloguing of proteins that interact with one another will make it possible to infer potential functional assignments of genes by virtue of association, which can then be tested by targeted experiments. In the course of the PEP initiative, we expect to find protist proteins that show significant sequence similarity with the interacting proteins detected in these functional genomics studies, which would provide valuable insight into the role of the protist proteins.
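The entity and attribute model described above maps naturally onto relational tables. The sketch below shows three of the entities and one query across them. It is an illustration only: it uses Python's built-in sqlite3 module so that it is self-contained, whereas TBestDB is planned for a different RDBMS; the SEQUENCE attributes follow those named in the text, while all other column names, the table layout and the example values are assumptions.

```python
import sqlite3

# Assumed layout for three of the entities. Only the SEQUENCE attributes
# (sequence_id, molecule_type, length, location_from, location_to) are
# taken from the text; everything else is hypothetical.
SCHEMA = """
CREATE TABLE taxon (
    taxon_id      INTEGER PRIMARY KEY,
    species_name  TEXT NOT NULL
);
CREATE TABLE sequence (
    sequence_id   INTEGER PRIMARY KEY,
    taxon_id      INTEGER NOT NULL REFERENCES taxon(taxon_id),
    molecule_type TEXT NOT NULL,
    length        INTEGER NOT NULL CHECK (length > 0),
    location_from INTEGER NOT NULL,
    location_to   INTEGER NOT NULL
);
CREATE TABLE gene (
    gene_id       INTEGER PRIMARY KEY,
    sequence_id   INTEGER NOT NULL REFERENCES sequence(sequence_id),
    gene_name     TEXT NOT NULL
);
"""


def open_demo_database():
    """Create an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def genes_for_taxon(conn, species_name):
    """List (gene_name, sequence_id, length) for all genes of one taxon."""
    cursor = conn.execute(
        """
        SELECT g.gene_name, s.sequence_id, s.length
        FROM gene AS g
        JOIN sequence AS s ON s.sequence_id = g.sequence_id
        JOIN taxon AS t ON t.taxon_id = s.taxon_id
        WHERE t.species_name = ?
        ORDER BY g.gene_name
        """,
        (species_name,),
    )
    return cursor.fetchall()
```

Keeping taxonomic, sequence and gene information in separate tables means that a correction to a species name, for instance, is made once and is reflected in every query that joins to it.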

4. Software developments

Software development will be required mainly for the workbench. While most of the analysis programs require only minor adaptations, the major developments will concern the workbench interface itself, i.e., the integration of the various tools, input format conversions and the visualization of the results. For example, no on-line editor for multiple alignments seems to be available at the moment, which is why we intend to develop a new tool for this central task. Further new programs are anticipated for data submission, which will allow members of the PEP consortium to submit their validated sequence data plus annotations to TBestDB in a defined, computer-parsable format. To facilitate the generation of a submission file, an automatic tool will be developed that employs defined syntax rules.
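The submission format itself is not specified here. Purely as an illustration of what checking a file against defined syntax rules could look like, the sketch below validates an invented 'field: value' format; the field names, the rules and the function name are all hypothetical.

```python
import re

# Invented rules for a hypothetical submission format (one field per line).
REQUIRED_FIELDS = ("clone_id", "organism", "library", "sequence")
LINE_RULE = re.compile(r"^([a-z_]+):\s*(.+)$")
SEQUENCE_RULE = re.compile(r"^[ACGTN]+$")


def parse_submission(text):
    """Parse one submission record and collect syntax errors.

    Returns (record, errors): a dict of the fields that were read and a
    list of human-readable error messages (empty if the record is valid).
    """
    record, errors = {}, []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are allowed
        match = LINE_RULE.match(line)
        if match is None:
            errors.append(f"line {number}: expected 'field: value'")
            continue
        field, value = match.groups()
        if field in record:
            errors.append(f"line {number}: duplicate field '{field}'")
            continue
        record[field] = value.strip()
    for field in REQUIRED_FIELDS:
        if field not in record:
            errors.append(f"missing required field '{field}'")
    sequence = record.get("sequence", "")
    if sequence and not SEQUENCE_RULE.match(sequence.upper()):
        errors.append("sequence contains characters other than A, C, G, T, N")
    return record, errors
```

Reporting every problem in one pass, rather than stopping at the first, lets a submitter correct a file in a single round trip.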

Program developments are also anticipated for enhanced and streamlined phylogenetic analyses, which will be pursued particularly by GBG, and which will constitute a major informatics research component of the TBestDB.

With the tools made available through TBestDB, PEP consortium members and the database curator alike will quickly be able to compare, analyze and cross-reference information among the various PEP projects as new data are added. The wider scientific community will also have access to the workbench. (If hardware performance is not sufficient to serve both the PEP consortium and the wider community, we may limit CPU-intensive, time-consuming jobs for external users until a hardware upgrade is available.)

5. Database implementation

TBestDB will be implemented in a relational database management system (RDBMS), which has proven to be most appropriate for such types of biological data (see GOBASE [4,5]). After many years of experience with commercial database software (Sybase), we have now opted for an open source RDBMS (MaxSQL [6] or PostgreSQL [7]) for reasons of flexibility and economy.

The database will have a Web interface that will be publicly accessible via the Internet for queries, viewing and downloading of data (read access). Again for reasons of flexibility, we intend to develop a custom Web interface with the public-domain PHP4 development tool [8]. In GOBASE, we recently replaced the Web/Genera interface builder with a PHP4-generated interface, and we are satisfied with its functionality, its ease and speed of implementation, its extensive graphical capabilities and the ease with which the database backend can be accessed.

Privileged write access for individual members to the TBestDB data that they have contributed will also be implemented through a Web-based interface, which in this case will require user authentication.
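The access rule described above can be reduced to a small check that the interface would apply before any modification. The sketch below is written in Python for illustration only; the field names are hypothetical, and granting write access to the curator is an assumption based on the curator's role in correcting data, not something stated in this section.

```python
def can_modify(user, record):
    """Decide whether a user may modify a database record.

    user: dict with 'name', 'authenticated' and 'is_curator' (assumed fields).
    record: dict with 'submitter', the member who contributed the data.
    """
    if not user.get("authenticated", False):
        return False  # write access always requires authentication
    if user.get("is_curator", False):
        return True   # assumption: the curator may correct any record
    return record.get("submitter") == user.get("name")


# Hypothetical users and record.
record = {"submitter": "member_a"}
print(can_modify({"name": "member_a", "authenticated": True}, record))  # True
print(can_modify({"name": "member_b", "authenticated": True}, record))  # False
```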

References

1. Hofmann K, Bucher P, Falquet L, Bairoch A (1999) The PROSITE database, its status in 1999. Nucleic Acids Res 27:215-219.
2. The Gene Ontology Consortium (2000) Gene ontology: tool for the unification of biology. Nature Genet 25:25-29.
3. http://www.geneontology.org/.
4. Korab-Laskowska M, Rioux P, Brossard N, Littlejohn TG, Gray MW, Lang BF, Burger G (1998). The organelle genome database project (GOBASE). Nucleic Acids Res 26:138-144.
5. Shimko N, Liu L, Lang BF, Burger G (2001) GOBASE: the Organelle Genome Database. Nucleic Acids Res 29:128-132.
6. MaxSQL is a MySQL distribution. http://www.maxsql.com/news/article-28.html.
7. http://www.postgresql.org/docs/devhistory.html.
8. PHP, a hypertext preprocessor. http://www.php.net/version4/.

Contact address: Gertraud Burger. Last Update: 12 Nov 2002