With the rapid growth of modern sequencing technology, a growing number of complete
genomes have been sequenced, leading to an explosion of new gene sequences.
Unfortunately, for most of these new sequences, we do not know any of their properties,
such as their sub-cellular localizations. It is too time-consuming to determine
the properties of each sequence manually in a biological laboratory, since it typically
takes months or even years to determine the properties of even a single protein sequence.
A much quicker alternative is to use computational techniques to make predictions in a
high-throughput fashion. The Proteome Analyst SUB-cellular localization server (PA-SUB) is a
system that uses machine learning techniques to predict the sub-cellular localization of
each protein in a proteome. This dissertation demonstrates how PA-SUB applies established machine
learning techniques to make accurate predictions. It describes techniques and experiments that establish
reliable training/testing datasets, significant feature extraction, efficient algorithm selection,
and convincing validation methods.
PA-SUB is described and the results of its application are presented along with concrete examples.
Experiments are presented that compare different feature identification techniques and different
classifier technologies. We obtained excellent results (approximately 90% accuracy under 5-fold cross-validation) for
sub-cellular localization prediction. These results are better than previously published results obtained with other
techniques. More generally, this dissertation demonstrates that computational machine learning techniques
can be used effectively for the prediction of protein properties.
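
As an illustration of how a 5-fold cross-validation accuracy of the kind reported above is computed, the following is a minimal sketch. It is not the PA-SUB implementation: the token-voting classifier, the function names, and the toy tokens and labels are all hypothetical and chosen only to keep the example self-contained.

```python
from collections import Counter, defaultdict


def k_fold_indices(n, k=5):
    """Split indices 0..n-1 into k folds of near-equal size (round-robin)."""
    return [list(range(i, n, k)) for i in range(k)]


def train_token_voter(features, labels):
    """Hypothetical toy classifier: each token votes for the labels it was seen with."""
    votes = defaultdict(Counter)
    for tokens, label in zip(features, labels):
        for token in tokens:
            votes[token][label] += 1
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen tokens

    def predict(tokens):
        tally = Counter()
        for token in tokens:
            tally.update(votes.get(token, {}))
        return tally.most_common(1)[0][0] if tally else default

    return predict


def cross_validation_accuracy(features, labels, train_fn, k=5):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for held_out in k_fold_indices(len(features), k):
        held = set(held_out)
        train_idx = [i for i in range(len(features)) if i not in held]
        predict = train_fn([features[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        correct += sum(predict(features[i]) == labels[i] for i in held_out)
    return correct / len(features)


# Made-up toy data: one annotation token per protein, two localization classes.
toy_features = [{"mitochondrial"}, {"nuclear"}] * 5
toy_labels = ["mitochondrion", "nucleus"] * 5
accuracy = cross_validation_accuracy(toy_features, toy_labels, train_token_voter, k=5)
```

Every example is used for testing exactly once and is never part of the training set that produced its own prediction, so the pooled accuracy estimates performance on unseen sequences.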