Predicting Protein Function Using Machine-Learned Hierarchical Classifiers

Roman Eisner

Supervisors: Duane Szafron and Paul Lu


Abstract
High-performance, accurate protein function prediction is a challenging problem in bioinformatics. Many contemporary ontologies, such as the Gene Ontology, have a hierarchical structure that can be exploited to improve prediction accuracy and to lower the computational cost of protein function prediction. This dissertation leverages the structure of the hierarchy in two ways. First, a novel method of creating hierarchy-aware training sets for machine-learned classifiers is introduced and shown to be the most accurate of the methods evaluated. Second, the hierarchy is used to reduce the computational cost of classification. A sound methodology for evaluating hierarchical classifiers using global cross-validation is also introduced. Because biologists often use BLAST to identify potential functions of new proteins, the hierarchical methods are compared against BLAST as a baseline and show improvements in both predictive performance and coverage. This dissertation focuses on the prediction of protein function within the Gene Ontology, but the techniques are applicable to hierarchical classification in general.
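
One common way a hierarchy can reduce classification cost is top-down pruning: a term's classifier is run only if its parent's classifier accepted the protein, so whole subtrees are skipped after a rejection. The sketch below illustrates that general idea only; it is not taken from the dissertation. The function predict_top_down, the toy term names, and the motif-matching "classifiers" are all hypothetical, and the hierarchy is assumed to be a simple tree rather than the Gene Ontology's full graph.

```python
from typing import Callable, Dict, List, Set


def predict_top_down(
    root: str,
    children: Dict[str, List[str]],
    classifiers: Dict[str, Callable[[str], bool]],
    sequence: str,
) -> Set[str]:
    """Return the predicted terms, skipping every subtree under a rejected term.

    Assumes a tree-shaped hierarchy (each term has one parent).
    """
    predicted: Set[str] = set()
    stack = [root]
    while stack:
        term = stack.pop()
        if classifiers[term](sequence):
            predicted.add(term)
            # Only the children of an accepted term are ever evaluated.
            stack.extend(children.get(term, []))
    return predicted


# Hypothetical toy setup: each "classifier" just looks for a motif
# and records that it was called, so the saved work is visible.
calls: List[str] = []


def make_classifier(term: str, motif: str) -> Callable[[str], bool]:
    def classify(sequence: str) -> bool:
        calls.append(term)
        return motif in sequence
    return classify


children = {
    "function": ["binding", "catalysis"],
    "binding": ["dna_binding"],
    "catalysis": ["kinase"],
}
classifiers = {
    "function": make_classifier("function", ""),  # empty motif always matches
    "binding": make_classifier("binding", "AAA"),
    "dna_binding": make_classifier("dna_binding", "CCC"),
    "catalysis": make_classifier("catalysis", "GGG"),
    "kinase": make_classifier("kinase", "TTT"),
}

result = predict_top_down("function", children, classifiers, "MAAAK")
print(sorted(result))  # terms accepted for this toy sequence
print(calls)           # classifiers that were actually run
```

In this toy run the "catalysis" classifier rejects the sequence, so the "kinase" classifier beneath it is never invoked. With one classifier per ontology term, that kind of pruning is where the savings would come from.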


Dissertation (2.7 MB)
IEEE CIBCB 2005 Paper (with corrected F-measure formula - 480 KB) and Poster (226 KB)
Seminar Slides (1.8 MB)
ISMB Poster (184 KB)