Extracting Ontological Labels for Training Sequences


Although the Swiss-Prot database contains a sub-cellular localization field, this field does not provide a single ontological label for each sequence. There are two problems. First, if the field is empty for a sequence, the sequence cannot be used for training. Second, even when the field is not empty, it often contains a long text string rather than a simple ontological label. We therefore constructed a parser that extracts a simple ontological label whenever possible.

Our parser uses the following rules to label potential training sequences:

1) Check to see if the field contains one of the ontological labels. If it does not, the sequence is rejected as a training sequence.

2) If it contains more than one ontological label, it is also rejected, unless one label is an organelle and the other is membrane.

3) If it contains an ontological label but also contains the phrase 'potential' or 'by similarity', it is rejected when the number of training sequences with that label is high. However, if the number of training sequences with that label is small (< 1.5% of the total number of training instances), it is accepted.

4) If the ontological label is 'cell wall' and the field also contains the word 'attached', it is rejected.

5) All proteins labeled as 'fragment' are rejected.
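The five rules above can be sketched in code as follows. This is a minimal illustration, not our actual parser: the label list, the organelle set, the function names, and the input format are all hypothetical. In particular, rule 3 depends on label counts that are only known after parsing, so the sketch assumes a two-pass scheme in which the counts are taken over the unhedged annotations only.

```python
import re
from collections import Counter

# Hypothetical label set; the real ontology differs per taxonomic category.
LABELS = ["nucleus", "mitochondrion", "cytoplasm", "extracellular",
          "membrane", "cell wall"]
ORGANELLES = {"nucleus", "mitochondrion"}      # assumed organelle labels
HEDGES = ("potential", "by similarity")        # phrases named in rule 3
RARE_FRACTION = 0.015                          # the 1.5% threshold of rule 3


def find_labels(text):
    """Return the set of ontological labels mentioned in the field text."""
    lowered = text.lower()
    return {lab for lab in LABELS
            if re.search(r"\b" + re.escape(lab) + r"\b", lowered)}


def candidate_label(text, is_fragment=False):
    """Apply rules 1, 2, 4 and 5. Return (label, hedged) or None if rejected."""
    if is_fragment:                                   # rule 5
        return None
    lowered = text.lower()
    found = find_labels(lowered)
    if not found:                                     # rule 1
        return None
    if len(found) > 1:                                # rule 2
        organelles = found & ORGANELLES
        if len(found) == 2 and "membrane" in found and len(organelles) == 1:
            label = next(iter(organelles))            # organelle wins
        else:
            return None
    else:
        label = next(iter(found))
    if label == "cell wall" and "attached" in lowered:  # rule 4
        return None
    hedged = any(h in lowered for h in HEDGES)
    return label, hedged


def label_sequences(entries):
    """entries: iterable of (seq_id, field_text, is_fragment).

    Returns {seq_id: label} for the accepted training sequences.
    Assumption: rule 3 is applied in a second pass, using label
    frequencies computed from the unhedged annotations.
    """
    confident, hedged = {}, {}
    for seq_id, text, is_fragment in entries:
        result = candidate_label(text, is_fragment)
        if result is None:
            continue
        label, is_hedged = result
        (hedged if is_hedged else confident)[seq_id] = label

    counts = Counter(confident.values())
    total = sum(counts.values())
    accepted = dict(confident)
    for seq_id, label in hedged.items():              # rule 3
        if total and counts[label] / total < RARE_FRACTION:
            accepted[seq_id] = label
    return accepted
```

With this sketch, a field such as 'Mitochondrion inner membrane.' yields the label 'mitochondrion', while 'Attached to the cell wall.' is rejected. A hedged annotation such as 'Nucleus (By similarity).' is kept only if 'nucleus' accounts for less than 1.5% of the unhedged training set.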

Steps 2), 3) and 4) require some explanation:

Step 2): It is common to describe a protein as being in the membrane of a specific organelle. In this case, the correct label is the organelle.

Step 3): This step is complicated because we would like to be conservative with our training data. We therefore want to reject any annotation that contains words like 'potential' or 'by similarity' in favor of annotations that have been verified in a laboratory. However, for ontological labels with few training instances, we found that accepting these "higher risk" annotations is necessary to obtain enough training data for the classifiers to achieve good accuracy. Note that we followed the PSORT-B lead of including any sequences that contain the phrase 'cell wall' in the extracellular class for the plant and fungi ontologies, since the Swiss-Prot data is not very accurate in these cases.

Step 4): We found many Swiss-Prot SCELL annotations containing the phrase 'attached to the cell wall' for proteins that are not in the cell wall. Therefore, we needed a mechanism to remove this potential misinformation.

To obtain our Swiss-Prot training data, we simply divided all Swiss-Prot entries into five categories based on taxonomy (animal, plant, fungi, Gram-negative (GN) bacteria and Gram-positive (GP) bacteria) and then applied our parser to obtain as many training sequences in each category as we could find.
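The taxonomic split can be sketched as below. This is only an illustration under stated assumptions: the function names and the input format (a lineage given as a list of taxon names) are hypothetical, and deciding Gram status from a short list of phyla is a crude stand-in, since the text does not say how GN and GP bacteria were actually distinguished.

```python
from collections import defaultdict

# Assumption: bacteria in these phyla are treated as Gram-positive and all
# other bacteria as Gram-negative. This is a rough illustrative rule only.
GRAM_POSITIVE_PHYLA = {"Firmicutes", "Actinobacteria"}


def taxonomic_category(lineage):
    """Map a lineage (list of taxon names) to one of the five categories.

    Returns None for entries outside the five categories (e.g. archaea).
    """
    taxa = set(lineage)
    if "Metazoa" in taxa:
        return "animal"
    if "Viridiplantae" in taxa:
        return "plant"
    if "Fungi" in taxa:
        return "fungi"
    if "Bacteria" in taxa:
        return "GP bacteria" if taxa & GRAM_POSITIVE_PHYLA else "GN bacteria"
    return None


def split_by_category(entries):
    """entries: iterable of (seq_id, lineage). Returns {category: [seq_id, ...]}."""
    groups = defaultdict(list)
    for seq_id, lineage in entries:
        category = taxonomic_category(lineage)
        if category is not None:
            groups[category].append(seq_id)
    return dict(groups)
```

Each resulting group would then be passed separately through the labeling rules, so that the 1.5% threshold of rule 3 is evaluated within a single category.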