Abstract
High-performance, accurate protein function prediction is a challenging problem
in bioinformatics. Many contemporary ontologies, such as the Gene Ontology, have a
hierarchical structure that can be exploited to improve prediction accuracy and
lower the computational cost of protein function prediction. The structure of the
hierarchy is leveraged in two ways. First, a novel method of creating hierarchy-aware
training sets for machine-learned classifiers is introduced and shown to be the
most accurate of the methods evaluated. Second, the hierarchy is used to reduce the computational
cost of classification. A sound methodology for evaluating hierarchical classifiers
using global cross-validation is introduced. Biologists often use BLAST to identify
potential functions of new proteins. The hierarchical methods are therefore compared
against BLAST as a baseline and show improvements in both predictive performance and
coverage. This dissertation focuses on the prediction of protein function within the
Gene Ontology, but the techniques are applicable to hierarchical classification in
general.