Information Management in an Organelle Genome Megasequencing Project

Tim Littlejohn, Organelle Genome Megasequencing Program

tim@bch.umontreal.ca

Possibly the largest group of users of molecular biology databases is the genome sequencing community. While a number of genomics projects are well underway, few are presently producing large quantities of comparative genomic data. The Organelle Genome Megasequencing Project (OGMP), centred at the University of Montreal, is unusual in that it generates large quantities of complete genomic sequence information (presently focused on mitochondrial genomes) suitable for comparative genomic analysis. The Megasequencing Unit of the OGMP currently completes between 3 and 6 entire genomes per year, and with predicted improvements in sequencing technology, this output should double within the next year.

The OGMP shares many of the data management challenges experienced by other genomic groups: for example, gene identification and function inference, gene product structural prediction, and integration of molecular, genetic, biochemical, and genome organisation data. Because of the phylogenetic emphasis of the project, the OGMP has a major focus on comparative genomics. Hence, the information retrieval and data management requirements of the project are both wide and deep.

The Informatics Division of the OGMP (OGMP-ID) is concerned with data management and information retrieval from a wide variety of sources. Specific aims of the OGMP-ID include development of informatics tools for accessing existing large networked databases, provision of specialist databases to the wider scientific community, enhancement of existing genome analysis toolsets, and development of large sequence project management and comparative genomics analysis tools (with an emphasis on phylogenetics). Key issues for the OGMP-ID are the management of genomic data and its efficient, integrated, global distribution in a form readily accessible to researchers in the field.

Current projects of the OGMP-ID include development of a number of tools for accessing diverse molecular sequence databases. These include:

CLEVER - Command Line Entrez, a command driven search engine for querying the NCBI Entrez database (Clever Distribution);
FERRET - Sequence retrieval and reformatting tool that fetches sequence data from a number of repositories, including Entrez, the email server at NCBI, the FARFetch database and local collections. FERRET uses CLEVER amongst other sequence retrieval methods, and by choosing repositories in the most network sensitive and computationally efficient manner, reduces unnecessary workloads and traffic. FERRET creates sequence data entries that are interoperable with many other databases and applications, and is particularly useful for constructing local databases suitable for further (e.g. phylogenetic) analysis;
BBLAST - batch BLAST search tool for large scale sequence similarity searches. BBLAST accesses networked databases either directly through the Internet or indirectly via email servers at NCBI;
TBOB - Text BLAST Output Browser, for managing results from large scale sequence similarity searches. TBOB has integrated sequence retrieval functions (using FERRET) and interfaces to any other sequence analysis tool (such as FASTA, CLUSTALV, PHYLIP programs, etc.) using SERAM (TBOB distribution);
SERAM - SEquence Retrieval and Application Manager. SERAM is a management utility that oversees sequence collection by FERRET and, once all sequences are retrieved into a local database, launches a user requested application and ensures its faithful execution. SERAM could be used, for instance, to manage a multiple sequence alignment job (e.g. using CLUSTALV) for a set of sequences being browsed in TBOB after a BBLAST search.

The OGMP-ID is also developing tools for management of multiple sequencing projects. These include:

COSMA - An integrated tool for automatic assembly, analysis, annotation and submission of complete genome sequences. COSMA will rely heavily on tools such as Staden's BAP utility for sequence assembly and BBLAST and FERRET for gene identification.
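
The retrieve-then-analyse flow described for FERRET and SERAM can be illustrated with a minimal Python sketch. It is not the OGMP implementation: the function names, the numeric cost ranking, and the in-memory repositories are all hypothetical stand-ins, assumed here only to show the idea of trying the cheapest repository first and launching an application only after every sequence has been collected locally.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical model of a repository: (name, relative cost, lookup function).
# A lower cost stands for a cheaper source (e.g. a local collection).
Repository = Tuple[str, int, Callable[[str], Optional[str]]]


def fetch_sequence(accession: str, repositories: List[Repository]) -> Tuple[str, str]:
    """Try repositories from cheapest to most expensive.

    Returns (repository name, sequence) for the first repository that has
    the accession, and raises KeyError if none of them does.
    """
    for name, _cost, lookup in sorted(repositories, key=lambda repo: repo[1]):
        sequence = lookup(accession)
        if sequence is not None:
            return name, sequence
    raise KeyError(f"{accession} not found in any repository")


def retrieve_then_run(
    accessions: List[str],
    repositories: List[Repository],
    application: Callable[[Dict[str, str]], object],
):
    """Collect every sequence into a local store, then run the application once.

    If any retrieval fails, the KeyError propagates and the application
    is never launched on an incomplete data set.
    """
    local_db: Dict[str, str] = {}
    for accession in accessions:
        _source, sequence = fetch_sequence(accession, repositories)
        local_db[accession] = sequence
    return application(local_db)


# Illustrative data only: two in-memory dictionaries stand in for a local
# collection and a remote database.
local_collection = {"A1": "ATGC"}
remote_database = {"A1": "ATGC", "B2": "GGCCTA"}
repositories: List[Repository] = [
    ("remote", 10, remote_database.get),
    ("local", 1, local_collection.get),
]

# A stand-in "application" that reports the length of each retrieved sequence.
lengths = retrieve_then_run(
    ["A1", "B2"],
    repositories,
    lambda db: {acc: len(seq) for acc, seq in db.items()},
)
```

In this sketch the accession held locally is served from the local collection without touching the remote source, which mirrors the stated goal of reducing unnecessary workloads and traffic. A real tool would replace the dictionary lookups with network or email-server requests.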

In addition, integrated genomic databases (initially with an emphasis on organelle genomes and phylogenetics) and image databases (including organisms of phylogenetic interest, with both macro and ultrastructural information) are under construction. Furthermore, through various Internet communication methods (Gopher, World Wide Web, etc.), the OGMP-ID participates in the global movement for unification of genomic data organisation, elimination of non-essential database and software development redundancy, and communication between researchers in the field of information management for genomics.
