Predicting Sub-cellular Localization using Machine-Learned Classifiers in Proteome Analyst

Z. Lu, D. Szafron, R. Greiner, P. Lu, D. Wishart, B. Poulin, J. Anvik, C. Macdonell, and R. Eisner
Department of Computing Science, University of Alberta, Edmonton, AB, Canada T6G 2E8
Contact: bioinfo@cs.ualberta.ca

  1. Introductory page
    -- This page defines the abbreviations and terminology used on the following pages. It also links to all the data used in the experiments and explains how we extracted the data from Swiss-Prot_41.

  2. A Complete Survey of Current Sub-cellular Localization Predictors
    -- The table on this page lists systems for sub-cellular localization prediction that have been developed over the past few years using a variety of prediction algorithms. Both accuracy and coverage are reported.

  3. Confusion Matrices for 6 classifiers: Animals, Plants, Fungi, Gram+, Gram-, and Archaea
    -- To evaluate our classifiers, we use a standard machine learning technique called 5-fold cross validation. The tables on this page are the confusion matrices for the 5-fold cross validation results of each classifier.

  4. Confusion Matrices for 1-organism classifiers
    -- We built a classifier, called a 1-organism classifier, from all training data except the sequences from one specific organism, and then applied it to the sequences of that organism in Swiss-Prot. This simulates the situation in which a classifier is used to predict the sub-cellular locations of all the sequences in a newly sequenced organism.

  5. Confusion Matrices for PA-SUB classifier built using Nair & Rost's 1161 sequences
    -- To compare our classification technique with the Swiss-Prot lexical technique of Nair and Rost (Nair and Rost, 2002), we constructed two custom sub-cellular localization classifiers using their single ontology and their training data.

  6. Confusion Matrices for PA-SUB classifier built using PSORT-B training data
    -- A further comparison of our classification technique, this time using the reliable PSORT-B Gram-negative data (Gardy et al., 2003).

  7. Whole Proteome Coverage
    -- When PA-SUB is applied to an entire organism, some sequences have no homologs, so no features can be extracted for the classifier to use. PA-SUB makes no sub-cellular localization prediction for these sequences.

  8. Confusion Matrices for feature extracting methods
    -- Results for the different ways of selecting PSI-BLAST homologs and extracting features from various Swiss-Prot fields (KWORD, IPR and SCELL).

  9. Confusion Matrices for different classification techniques
    -- Besides Naive Bayes, we evaluated several other kinds of classifiers. We show the results for Artificial Neural Nets (ANN), Support Vector Machines (SVM), Tree-augmented Naive Bayes (TAN) and three nearest neighbour classifiers (1NN, 3NN and 5NN).
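
For readers unfamiliar with the evaluation behind the confusion matrices listed above, the following is a minimal sketch of 5-fold cross validation producing a confusion matrix. It is not the PA-SUB implementation: the token-voting classifier is a toy stand-in for a learned classifier, and the feature tokens, location labels, and function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

def k_fold_indices(n, k=5):
    """Split indices 0..n-1 into k folds by round-robin assignment."""
    return [list(range(i, n, k)) for i in range(k)]

def train(samples):
    """Toy stand-in for a learned classifier: count how often each
    feature token co-occurs with each location label."""
    counts = defaultdict(Counter)
    for tokens, label in samples:
        for token in tokens:
            counts[token][label] += 1
    return counts

def predict(counts, tokens, default):
    """Let every known token vote for the labels it was seen with."""
    votes = Counter()
    for token in tokens:
        votes.update(counts.get(token, {}))
    return votes.most_common(1)[0][0] if votes else default

def cross_validated_confusion(samples, k=5):
    """Return a Counter mapping (true label, predicted label) to a count,
    where every sample is predicted by a model that never saw it."""
    matrix = Counter()
    for held_out in k_fold_indices(len(samples), k):
        held = set(held_out)
        training = [s for i, s in enumerate(samples) if i not in held]
        counts = train(training)
        # Fall back to the most frequent training label (an assumption).
        default = Counter(label for _, label in training).most_common(1)[0][0]
        for i in held_out:
            tokens, true_label = samples[i]
            matrix[(true_label, predict(counts, tokens, default))] += 1
    return matrix

# Hypothetical data: (feature tokens, sub-cellular location).
samples = (
    [(["dna-binding"], "nucleus")] * 5
    + [(["transmembrane"], "membrane")] * 5
)

matrix = cross_validated_confusion(samples, k=5)
for (true_label, predicted), count in sorted(matrix.items()):
    print(f"true={true_label:<9} predicted={predicted:<9} count={count}")
```

Each sequence is held out exactly once, so the entries of the matrix sum to the number of sequences. Off-diagonal entries count sequences whose predicted location differs from the annotated one.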