Extracting Ontological Labels for Training Sequences
Although the Swiss-Prot database contains a sub-cellular localization field,
this field does not contain a single ontological label for each sequence.
There are two problems. First, if the field is empty for a sequence,
the sequence cannot be used for training. Second, even when the field
is not empty, it often contains a long text string rather than a simple ontological
label. We therefore constructed a parser that extracts a simple ontological
label whenever possible.
Rules that our parser uses to label potential training sequences:
1) Check to see if the field contains one of the ontological
labels. If it does not, the sequence is rejected as a training sequence.
2) If it contains more than one ontological label, it is also rejected,
unless one label is an organelle and the other is membrane.
3) If it contains an ontological label but also contains the phrase 'potential'
or 'by similarity', it is rejected, unless the number of training sequences
with that label is small (< 1.5% of the total number of training instances),
in which case it is accepted.
4) If the ontological label is 'cell wall' and the field text contains the word
'attached', it is rejected.
5) All proteins labeled as 'fragment' are rejected.
Steps 2), 3) and 4) require some explanation:
Step 2): It is common to describe a protein as being in a membrane of a
specific organelle. In this case, the correct label is the organelle.
Step 3) is complicated because we would like to be conservative with our
training data. Therefore, we want to reject any annotations that contain words
like 'potential' or 'by similarity' in favor of annotations that have been
verified in a laboratory. However, for ontological labels with low numbers
of training instances, we found that accepting "higher risk" annotations is
necessary to obtain enough training data so that the classifiers have good
accuracy. Note that we followed the lead of PSORT-B in assigning any sequence
whose annotation contains the phrase 'cell wall' to the extracellular class for
the plant and fungi ontologies, since the Swiss-Prot data is not very accurate
in these cases.
Step 4): We found many Swiss-Prot SCELL annotations for proteins that are
not in the cell wall but whose text contains the phrase 'attached to the cell wall'.
Therefore, we needed a mechanism to remove this potential misinformation.
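The five rules, together with the explanations of Steps 2) to 4), can be
sketched in Python as follows. This is only a minimal illustration under stated
assumptions, not the parser itself: the label set, the list of organelles, the
use of simple substring matching, and the names extract_label and rare_labels
are all hypothetical. The sketch also assumes that the 'fragment' status is
available as a separate flag, and that the set of labels falling below the 1.5%
threshold has been computed beforehand from the sequences accepted so far.

```python
from collections import Counter

# Hypothetical label set for illustration; the real ontology depends on
# the taxonomic category and is not listed in the text.
LABELS = ("nucleus", "peroxisome", "cytoplasm", "membrane", "cell wall")
ORGANELLES = {"nucleus", "peroxisome"}          # assumed organelle labels
UNCERTAIN_PHRASES = ("potential", "by similarity")
RARE_FRACTION = 0.015                           # the 1.5% threshold of rule 3


def rare_labels(counts, threshold=RARE_FRACTION):
    """Return the labels whose share of all training instances is below threshold."""
    total = sum(counts.values())
    if total == 0:
        return set(LABELS)
    return {label for label in LABELS if counts.get(label, 0) / total < threshold}


def extract_label(text, is_fragment=False, rare=frozenset()):
    """Return a single ontological label for the field text, or None if rejected."""
    if is_fragment:                              # rule 5: fragments are rejected
        return None
    lowered = text.lower()
    found = {label for label in LABELS if label in lowered}
    if not found:                                # rule 1: no label in the field
        return None
    if len(found) > 1:                           # rule 2: more than one label
        organelles = found & ORGANELLES
        if len(found) == 2 and "membrane" in found and len(organelles) == 1:
            label = next(iter(organelles))       # organelle membrane -> organelle
        else:
            return None
    else:
        label = next(iter(found))
    uncertain = any(phrase in lowered for phrase in UNCERTAIN_PHRASES)
    if uncertain and label not in rare:          # rule 3: unverified annotation
        return None
    if label == "cell wall" and "attached" in lowered:   # rule 4
        return None
    return label


if __name__ == "__main__":
    counts = Counter({"nucleus": 99, "peroxisome": 1})
    rare = rare_labels(counts)
    for text in ("Peroxisome membrane",
                 "Peroxisome (by similarity)",
                 "Nucleus (by similarity)",
                 "Attached to the cell wall"):
        print(repr(text), "->", extract_label(text, rare=rare))
```

Because the acceptance of an uncertain annotation depends on how many sequences
already carry that label, a sketch like this would probably be run in two passes:
one to count the confidently labeled sequences, and one to admit 'potential' or
'by similarity' annotations for the labels that turn out to be rare.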
To obtain our Swiss-Prot training data, we divided all Swiss-Prot entries
into five categories based on taxonomy: animal, plant, fungi, Gram-negative (GN)
bacteria, and Gram-positive (GP) bacteria. We then applied our parser to each
category to obtain as many training sequences as we could find.