Predicting Sub-cellular Localization using Machine-Learned Classifiers in Proteome Analyst

Zhiyong Lu, Duane Szafron, Russ Greiner, Paul Lu, David Wishart, Brett Poulin, John Anvik, Cam Macdonnell and Roman Eisner
Bioinformatics, 2004.

Motivation: Identifying the destination or localization of proteins is key to understanding their function and facilitating their purification. A number of existing computational prediction methods are based on sequence analysis. However, these methods are limited in scope, accuracy and most particularly breadth of coverage. Rather than using sequence information alone, we have explored the use of database text annotations from homologs and machine learning to substantially improve the prediction of subcellular location.

Results: We have constructed five machine-learning classifiers for predicting subcellular localization of proteins from animals, plants, fungi, Gram-negative bacteria and Gram-positive bacteria, which are 81% accurate for fungi and 92% to 94% accurate for the other four categories. These are the most accurate subcellular predictors across the widest set of organisms ever published. Our predictor

Proteome Analyst -- Transparent High-throughput Protein Annotation: Function, Localization and Custom Predictors

Duane Szafron, Paul Lu, Russ Greiner, David Wishart, Zhiyong Lu, Brett Poulin, Roman Eisner, John Anvik, and Cam Macdonnell
ICML 2003 Workshop: Machine Learning in Bioinformatics

Modern sequencing technology permits sequencing of entire genomes, whose gene sequences require annotation. It is too time-consuming to predict the properties of each protein sequence manually and to organize the results of many prediction tools by hand. The prediction process must be automated, but the predictions must also be transparent. That is, the rationale for each prediction should be easily examinable by anyone who wishes to use the prediction. Proteome Analyst (PA) is a web-based system for predicting the properties of each protein in a proteome. PA has three interesting features. First, it is a single web-based system that allows the user to select a wide range of analytic tools and automatically apply them to each protein in a proteome. In essence, PA provides one-stop automatic high-throughput analysis. Second, PA has the ability to explain its predictions to users. PA is based on established machine learning techniques, but makes every prediction transparent to its users. Third, PA allows users to create their own transparent custom predictors without programming.

Proteome Analyst -- High throughput Protein Function Prediction (poster)

Roman Eisner, Brett Poulin, Duane Szafron, Paul Lu, Russ Greiner, Bahram Habibi-Nazhad, and David Wishart
ISMB'02 (Poster)

Proteome Analyst (PA) is a web-based tool for predicting the functions of each sequence in a proteome. For example, one or more classification-based function predictors can be applied to any sequence. More importantly, PA users can easily train their own custom classification-based predictors and apply them to their sequences.

Learning to Predict Protein Function from Sequence (poster)

Zhiyong Lu and Xiaomeng Wu

Proteome Analyst (PA) is a web-based tool designed to autonomously infer the functional characteristics of each protein in a proteome. We investigate various machine learning algorithms to obtain the optimal proteome function prediction.

Proteome Analyst: Custom Predictions with Explanations in a Web-based Tool for High-Throughput Proteome Annotations

Duane Szafron, Paul Lu, Russ Greiner, David Wishart, Brett Poulin, Roman Eisner, Zhiyong Lu, John Anvik, Cam Macdonnell, Alona Fyshe and David Meeuwis
Nucleic Acids Research, in press, July 2004.

Proteome Analyst (PA) (http://www.cs.ualberta.ca/~bioinfo/PA/) is a publicly available, high-throughput, Web-based system for predicting various properties of each protein in an entire proteome. Using machine-learned classifiers, PA can predict, for example, the GeneQuiz general function and Gene Ontology (GO) molecular function of a protein. As well, PA is currently the most accurate and most comprehensive system for predicting subcellular localization, the location within a cell where a protein performs its main function. Two other capabilities of PA are notable. First, PA can create a custom classifier to predict a new property, without requiring any programming, based on labeled training data (i.e., a set of examples, each with the correct classification label) provided by a user. PA has been used to create custom classifiers for K-ion proteins and other general-function ontologies. Second, PA provides a sophisticated explanation feature that shows why one prediction is chosen over another. The PA system produces a Naive Bayes classifier, which is amenable to a graphical and interactive approach to explanations for its predictions; transparent predictions increase the user's confidence in, and understanding of, PA.
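The two capabilities described above (training a classifier from user-labeled examples, then showing why one prediction beats another) can be illustrated with a minimal sketch. This is not PA's implementation: it assumes each protein is reduced to a bag of annotation keywords, uses a plain multinomial Naive Bayes with add-one smoothing, and every function name, keyword, and label below is hypothetical. The explanation here is simply the per-keyword log-odds between two candidate labels, which is the kind of quantity that makes a Naive Bayes prediction easy to inspect.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Train on (tokens, label) pairs; returns a plain dict model (hypothetical layout)."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {"labels": label_counts, "tokens": token_counts, "vocab": vocabulary}


def token_log_prob(model, token, label):
    """log P(token | label) with add-one (Laplace) smoothing."""
    counts = model["tokens"][label]
    total = sum(counts.values())
    return math.log((counts[token] + 1) / (total + len(model["vocab"])))


def predict(model, tokens):
    """Return (best_label, per-label log scores); unseen tokens are ignored."""
    n_examples = sum(model["labels"].values())
    known = [t for t in tokens if t in model["vocab"]]
    scores = {}
    for label, count in model["labels"].items():
        log_prior = math.log(count / n_examples)
        scores[label] = log_prior + sum(token_log_prob(model, t, label) for t in known)
    return max(scores, key=scores.get), scores


def explain(model, tokens, label_a, label_b):
    """Per-token log-odds: positive values favour label_a over label_b."""
    return {
        t: token_log_prob(model, t, label_a) - token_log_prob(model, t, label_b)
        for t in tokens
        if t in model["vocab"]
    }


# Toy, made-up training data: annotation keywords with a localization label.
training = [
    (["membrane", "transport"], "membrane"),
    (["nucleus", "dna", "binding"], "nucleus"),
    (["membrane", "channel"], "membrane"),
    (["dna", "transcription"], "nucleus"),
]
model = train_naive_bayes(training)
best_label, scores = predict(model, ["dna", "binding"])
evidence = explain(model, ["dna", "binding"], "nucleus", "membrane")
```

Because Naive Bayes scores are sums of independent per-feature terms, the `evidence` dictionary decomposes the gap between two labels exactly, feature by feature, which is what makes a graphical explanation straightforward.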

Cancer, SNPs, and Machine Learning (poster)

Brett Poulin, Jennifer Listgarten, Russell Greiner, Sambasivarao Damar, Thomas Kolacz, Xiang Wan, David Wishart and Brent Zanke
ISMB'02 (Poster)

Single nucleotide polymorphisms (SNPs) are genetic variations that may affect susceptibility to disease. We discuss the accuracy and efficiency of using various machine learning techniques with SNP data in order to distinguish between individuals who have breast cancer and those who do not.

Automated diagnosis of IEMs using High-Throughput Quantitative NMR Spectroscopy (poster)

Ajit Singh, Russell Greiner, Victor Dorian, and Brent Lefebvre
ISMB'02 (Poster)

We examine the utility of high-throughput Nuclear Magnetic Resonance (NMR) data for the diagnosis of inborn errors of metabolism (IEMs). We propose a two-layer Bayesian network with the novel addition of Bayesian error bars, which provide confidence intervals for our diagnoses.