Improving Subcellular Localization Prediction using Text Classification and the Gene Ontology
Each protein performs its function at one or more specific locations
within a cell. Knowing this subcellular location is important for
understanding the protein's function and for facilitating its
purification. There are now many
computational techniques for predicting location based on sequence
analysis and database information from homologs. A few recent
techniques use text from biological abstracts; our goal is to
improve the prediction accuracy of such text-based techniques.
We identify three techniques for improving text-based prediction:
a rule for removing ambiguous abstracts, a mechanism for using synonyms
from the Gene Ontology (GO), and a mechanism for using the GO hierarchy
to generalize terms. We show that these three techniques can
significantly improve the accuracy of protein subcellular-location
predictors that use text extracted from PubMed abstracts whose
references are recorded in Swiss-Prot.
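
As a rough illustration of the three techniques named above, the following Python sketch shows one way they could fit together. It is not the implementation used in this work: the GO fragment is a tiny hand-written stand-in, the function names (`remove_ambiguous`, `go_features`) are hypothetical, and the ambiguity rule shown (drop any abstract linked to more than one location label) is an assumption, since the abstract does not spell out the rule.

```python
from collections import defaultdict

# Toy, hand-written stand-in for a fragment of the Gene Ontology.
# These entries are illustrative assumptions, not data taken from GO.
GO_SYNONYMS = {
    "mitochondrion": ["mitochondria"],
    "nucleus": ["cell nucleus"],
}
GO_PARENT = {
    "mitochondrial inner membrane": "mitochondrion",
    "mitochondrion": "cytoplasm",
    "nucleolus": "nucleus",
}


def remove_ambiguous(records):
    """Keep only abstracts that are linked to a single location label.

    records is a list of (abstract_id, location_label) pairs.
    Assumed rule: an abstract cited with two or more labels is ambiguous.
    """
    labels = defaultdict(set)
    for abstract_id, label in records:
        labels[abstract_id].add(label)
    return [(a, l) for a, l in records if len(labels[a]) == 1]


def go_features(text):
    """Return GO terms found in text, plus all of their ancestors."""
    text = text.lower()
    all_terms = set(GO_PARENT) | set(GO_PARENT.values()) | set(GO_SYNONYMS)

    # Synonym resolution: a term counts as present if its name or any
    # of its synonyms occurs in the text (simple substring matching).
    found = set()
    for term in all_terms:
        names = [term] + GO_SYNONYMS.get(term, [])
        if any(name in text for name in names):
            found.add(term)

    # Hierarchy generalization: add every ancestor of each found term.
    features = set(found)
    for term in found:
        parent = GO_PARENT.get(term)
        while parent is not None:
            features.add(parent)
            parent = GO_PARENT.get(parent)
    return features
```

In this sketch, the returned term set would be added to the bag-of-words features of an abstract before training a text classifier, so that abstracts mentioning different but related terms share features.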