European Conference for Computational Biology, Saarbruecken, Germany, October 6-9, 2002

Identification of ORFs from organelle genomes: A data mining approach using C4.5 machine learning algorithm

Sivakumar Kannan, Geneviève Boucher, Gertraud Burger

*Université de Montréal, Department of Biochemistry, 2900 Blvd Edouard-Montpetit, Montreal, QC, H3T 1J4, Canada

Genomes of mitochondria and chloroplasts from diverse eukaryotes carry on average 5 to 20 ORFs whose function has remained elusive. To understand the biological role of these ORFs, we have developed a comprehensive analysis procedure based on data mining methods. First, a non-redundant dataset of known mitochondrial proteins from diverse species was compiled from GOBASE [1]. To represent the knowledge embedded in the sequences, each protein was characterized by various bioinformatics analyses (e.g., of physico-chemical properties). The results were assembled into a table with the known proteins in the rows and their features (attributes) in the columns, each distinct function constituting a class. The data mining program C4.5 [2] was then used to look for patterns in the data and to learn classification rules. The as yet unannotated ORFs from organelle genomes sequenced by OGMP [3] and others were represented in the same way as the known proteins, and their putative functions were inferred by assigning each ORF to one of the functional classes using the learnt rules. The advantage of data mining algorithms (C4.5 in this case) over other machine learning algorithms such as neural networks is that C4.5 classifiers are expressed as decision trees or sets of if-then rules. Such forms are human-readable, and hence human expertise can be used to improve the efficiency of the rules. The presented work also serves as a pilot study for automating the analysis of complete cDNA sequences in the context of the Canadian Protist EST Program (PEP). In addition, it enables us to experiment with different ways of representing a sequence by various attributes and to determine the most informative attributes for machine learning algorithms.
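The procedure described above (attribute table, rule learning, classification of unannotated ORFs) can be illustrated with a minimal Python sketch. It is not the C4.5 program itself: it learns only a single if-then rule, selected by the gain ratio criterion that C4.5 uses to choose splits, and it omits recursion, pruning and missing-value handling. The sequences, the class labels (class_A, class_B), the residue groupings and all function names are invented for illustration and are not taken from GOBASE or OGMP data.

```python
import math
from collections import Counter

# Assumed residue groupings, chosen for illustration only.
HYDROPHOBIC = set("AVILMFWC")
CHARGED = set("DEKR")

def describe(seq):
    """Turn one protein sequence into a row of simple numeric attributes."""
    n = len(seq)
    return {
        "length": n,
        "hydrophobic_frac": sum(r in HYDROPHOBIC for r in seq) / n,
        "charged_frac": sum(r in CHARGED for r in seq) / n,
    }

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(rows, labels, attr, threshold):
    """Gain ratio of the binary split 'attr <= threshold'."""
    left = [l for r, l in zip(rows, labels) if r[attr] <= threshold]
    right = [l for r, l in zip(rows, labels) if r[attr] > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = entropy(labels) - remainder
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info

def learn_stump(rows, labels):
    """Pick the single attribute/threshold pair with the highest gain ratio.

    Simplification: candidate thresholds are midpoints between adjacent
    observed values, and only one split is learnt (no tree, no pruning).
    """
    best_score, best_attr, best_t = 0.0, None, None
    for attr in rows[0]:
        values = sorted({r[attr] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            score = gain_ratio(rows, labels, attr, t)
            if score > best_score:
                best_score, best_attr, best_t = score, attr, t
    if best_attr is None:
        raise ValueError("no attribute separates the classes")
    left = [l for r, l in zip(rows, labels) if r[best_attr] <= best_t]
    right = [l for r, l in zip(rows, labels) if r[best_attr] > best_t]
    return {
        "attr": best_attr,
        "threshold": best_t,
        "if_le": Counter(left).most_common(1)[0][0],   # majority class, left branch
        "if_gt": Counter(right).most_common(1)[0][0],  # majority class, right branch
    }

def classify(rule, seq):
    """Apply the learnt if-then rule to a new, unannotated sequence."""
    value = describe(seq)[rule["attr"]]
    return rule["if_le"] if value <= rule["threshold"] else rule["if_gt"]

# Invented toy training set: (sequence, hypothetical functional class).
TOY = [
    ("MAVILLFWVAILMFCA", "class_A"),
    ("MLIVAFWLLGAVIFSA", "class_A"),
    ("MKDERKSDEEKRTDNE", "class_B"),
    ("MSKEDRRNKDEEQKRD", "class_B"),
]

rows = [describe(seq) for seq, _ in TOY]
labels = [label for _, label in TOY]
rule = learn_stump(rows, labels)

# The rule is human-readable, which is the property emphasized above.
print(f"if {rule['attr']} <= {rule['threshold']:.3f} then {rule['if_le']} "
      f"else {rule['if_gt']}")
print(classify(rule, "MKRDEEKKSDERNDEQ"))  # invented "unannotated ORF"
```

Because the learnt rule is a plain attribute/threshold pair, an expert can inspect it, question the threshold, or replace the attribute, which is harder to do with the weights of a neural network.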

[1] Shimko N, Liu L, Lang BF and Burger G (2001) GOBASE: the Organelle Genome Database. Nucleic Acids Res., 29(1):128-132, http://megasun.bch.umontreal.ca/gobase
[2] Quinlan JR (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
[3] The Organelle Genome Megasequencing Program (OGMP), http://megasun.bch.umontreal.ca/ogmpproj.html