With rapid advances in sequencing technologies, the availability of biological sequence
information has exploded. To make use of this information,
researchers need quality annotations about the biological function of each protein
encoded by the sequences. These annotations have traditionally been made in the
laboratory through physical experiments. The current growth of raw
data has outstripped the resources available for determining protein function by
physical means, and it even outpaces the abilities
of human annotators working with computational tools. To take advantage
of this mountain of information from high-throughput sequencing efforts,
analysis techniques that are equally automated and high-throughput are required.
The analysis of sequence patterns through the use of machine learning techniques
provides a mechanism by which current knowledge about protein sequences
can be used to automatically annotate and classify new proteins.
This work compares current protein function prediction techniques in the context
of the high-throughput and automated classification task. A recently introduced
probabilistic model, the probabilistic suffix tree (PST), is developed further.
This work gives an alternative presentation of PSTs, as well as results
using an efficient implementation of the model. The results show the promise of
this new tool for automatic and high-throughput protein function prediction.