Information Management in an Organelle Genome Megasequencing Project
Tim Littlejohn, Organelle Genome Megasequencing Program
tim@bch.umontreal.ca
Possibly the largest group of users of molecular biology databases is
the genome sequencing community. While a number of genomics
projects are well underway, few are presently producing large
quantities of comparative genomic data. The Organelle Genome
Megasequencing Project (OGMP), centred at the University of
Montreal, is unusual in that it generates large quantities of
complete genomic sequence information (presently focused on
mitochondrial genomes) suitable for comparative genomic analysis.
The Megasequencing Unit of the OGMP currently completes between 3
and 6 entire genomes per year, and with predicted improvements in
sequencing technology, this output should double within the next
year.
The OGMP shares many of the data management challenges experienced
by other genomic groups, for example gene identification and
function inference, gene product structural predictions, and
integration of molecular, genetic, biochemical, and genome
organisation data.
Because of the phylogenetic emphasis of the project, OGMP has a
major focus on comparative genomics. Hence, the information
retrieval and data management requirements of the project are both
wide and deep.
The Informatics Division of the OGMP (OGMP-ID) is concerned with
data management and information retrieval from a wide variety of
sources. Specific aims of the OGMP-ID include development of
informatics tools for accessing existing large networked databases,
providing specialist databases to the wider scientific community,
enhancement of existing genome analysis toolsets and development of
large sequence project management and comparative genomics analysis
tools (with an emphasis on phylogenetics). Management of genomic
data and its efficient, integrated, global distribution in a way
that is readily accessible by researchers in the field are key
issues for the OGMP-ID.
Current projects of the OGMP-ID include development of a number of
tools for accessing diverse molecular sequence databases. These
include:
- CLEVER - Command Line Entrez, a command driven search engine for
querying the NCBI Entrez database (CLEVER distribution);
- FERRET - Sequence retrieval and reformatting tool that fetches
sequence data from a number of repositories, including Entrez, the
email server at NCBI, the FARFetch database and local collections.
FERRET uses CLEVER among other sequence retrieval methods and, by
choosing repositories in the most network-sensitive and
computationally efficient manner, reduces unnecessary workload and
network traffic. FERRET creates sequence data entries that are
interoperable with many other databases and applications, and is
particularly useful for constructing local databases suitable for
further (e.g. phylogenetic) analysis;
- BBLAST - batch BLAST search tool for large scale sequence
similarity searches. BBLAST accesses networked databases either
directly through the Internet or indirectly via email servers at
NCBI;
- TBOB - Text BLAST Output Browser, for managing results from large scale
sequence similarity searches. TBOB has integrated sequence
retrieval functions (using FERRET) and interfaces to any other
sequence analysis tool (such as FASTA, CLUSTALV, PHYLIP programs,
etc.) using SERAM (TBOB distribution);
- SERAM - SEquence Retrieval and Application Manager. SERAM is a
management utility that oversees sequence collection by FERRET and,
once all sequences are retrieved into a local database, launches a
user requested application and ensures its faithful execution.
SERAM could be used, for instance, to manage a multiple sequence
alignment job (e.g. using CLUSTALV) for a set of sequences being
browsed in TBOB after a BBLAST search.
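The retrieval and management behaviour described for FERRET and SERAM
can be illustrated with a minimal Python sketch. It is not the OGMP
code: the names Repository, retrieve, collect_then_run and to_fasta,
the numeric cost field, and the FASTA-style output are all assumptions
made for illustration. The sketch tries repositories from cheapest to
most expensive for each sequence, and launches the requested
application only once every sequence is in the local collection.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical interface: a fetcher maps an accession to a sequence,
# or returns None when the repository does not hold that entry.
Fetcher = Callable[[str], Optional[str]]


@dataclass
class Repository:
    name: str
    cost: int        # assumed relative network/compute cost; lower is tried first
    fetch: Fetcher


def retrieve(accession: str,
             repositories: Iterable[Repository]) -> Optional[Tuple[str, str]]:
    """Return (repository name, sequence) from the cheapest repository
    that holds the accession, or None if no repository has it."""
    for repo in sorted(repositories, key=lambda r: r.cost):
        sequence = repo.fetch(accession)
        if sequence is not None:
            return repo.name, sequence
    return None


def collect_then_run(accessions: Iterable[str],
                     repositories: List[Repository],
                     application: Callable[[Dict[str, str]], str]) -> str:
    """Gather every sequence into a local collection, then run the
    application on it. Nothing is launched if any sequence is missing."""
    local_db: Dict[str, str] = {}
    missing: List[str] = []
    for accession in accessions:
        hit = retrieve(accession, repositories)
        if hit is None:
            missing.append(accession)
        else:
            local_db[accession] = hit[1]
    if missing:
        raise LookupError("could not retrieve: " + ", ".join(missing))
    return application(local_db)


def to_fasta(local_db: Dict[str, str]) -> str:
    """Stand-in application: format the local collection as FASTA text."""
    return "".join(f">{acc}\n{seq}\n" for acc, seq in local_db.items())
```

In this sketch a local collection would be given a low cost and a
remote server a high one, so that network requests are made only for
entries not already held locally. In practice the stand-in to_fasta
would be replaced by a call to an alignment or phylogeny program.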
The OGMP-ID is also developing tools for management of multiple
sequencing projects. This includes:
- COSMA - An integrated tool for automatic assembly, analysis,
annotation and submission of complete genome sequences. COSMA will
rely heavily on tools such as Staden's BAP utility for sequence
assembly and BBLAST and FERRET for gene identification.
In addition, integrated genomic databases (initially with an
emphasis on organelle genomes and phylogenetics) and image databases
(including organisms of phylogenetic interest, with both macro and
ultrastructural information) are under construction. Furthermore,
through various Internet communication methods (Gopher, World Wide
Web, etc.), the OGMP-ID participates in the global movement for
unification of genomic data organisation, elimination of
non-essential database and software development redundancy, and
communication between researchers in the field of information
management for genomics.