One vs Multiple Test Instances

We use standard 5-fold cross-validation to evaluate our BiC(RoBiC, ...) system: we first divide the data into 5 balanced folds FF = {F₁, …, F₅}, then use the information in FF - {F_i} to find labels for each instance r_i,j ∈ F_i. Recall, however, that BiC(RoBiC, ...)'s first step finds the biclusters based on both (the nonlabel part of) FF - {F_i} and F_i (ie, on all of FF). This means the biclusters (and hence the classification) for r_5,1 depend on r_5,2, r_5,3, ..., r_5,15, as well as on F₁, …, F₄.

Does this make a difference? In particular, how does this scenario compare with the more "standard" version, where the label for r_5,1 depends only on itself and F₁, …, F₄, but NOT on r_5,2, r_5,3, ..., r_5,15?

To find out, we took 4/5 of the BreastCancer data as the training set D, which here has 61 instances. We then considered each of the remaining 15 elements R = { r₁, r₂, ... r₁₅} one by one. Here, we used the set of instances D_i = D ∪ { r_i } to produce the set of k=30 biclusters, B_i = { B_i,1, B_i,2, ..., B_i,k} = RoBiC( D_i, k).
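The one-at-a-time procedure above can be sketched as follows. This is only an illustrative outline: the `biclusterer` argument stands in for RoBiC, whose actual interface is not given here, and the representation of a bicluster as a (patients, genes) pair of sets, along with the names `bicluster_sets_one_at_a_time` and `toy_biclusterer`, are assumptions made for the example.

```python
from typing import Callable, Hashable, List, Sequence, Set, Tuple

# Assumed representation: a bicluster is a (patient ids, gene ids) pair of sets.
Bicluster = Tuple[Set[Hashable], Set[Hashable]]
Biclusterer = Callable[[Sequence[Hashable], int], List[Bicluster]]


def bicluster_sets_one_at_a_time(
    train: Sequence[Hashable],
    held_out: Sequence[Hashable],
    biclusterer: Biclusterer,
    k: int = 30,
) -> List[List[Bicluster]]:
    """Return [B_1, ..., B_m], where B_i = biclusterer(D ∪ {r_i}, k)."""
    results = []
    for r in held_out:
        d_i = list(train) + [r]  # D_i = D ∪ {r_i}: training set plus ONE test instance
        results.append(biclusterer(d_i, k))
    return results


def toy_biclusterer(instances: Sequence[Hashable], k: int) -> List[Bicluster]:
    """Hypothetical placeholder, NOT RoBiC: one bicluster holding every instance."""
    return [(set(instances), {"g1", "g2"})][:k]


if __name__ == "__main__":
    D = [f"d{n}" for n in range(61)]     # 61 training instances
    R = [f"r{n}" for n in range(1, 16)]  # 15 held-out instances
    B = bicluster_sets_one_at_a_time(D, R, toy_biclusterer, k=30)
    print(len(B))  # one bicluster set per held-out instance
```

Passing the biclustering routine in as a callable keeps the loop independent of how RoBiC is actually invoked; any function with the assumed signature can be substituted.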

Of course, each of these B_i bicluster sets can be very different from one another. We can allay some of our worries if we find that these 15 different bicluster sets are similar to one another, and also to the biclusters obtained using the full FF set of instances, B^* = RoBiC( FF, k). Below we present two ways to measure these similarities, focussing on just the first three biclusters for each set --- ie, comparing the members of {B_i,1}_i = {B_1,1, B_2,1, ..., B_15,1} with one another and with B^*₁; then comparing {B_i,2}_i with each other and with B^*₂; and finally dealing with {B_i,3}_i and B^*₃. For notation: each bicluster B_i,j involves a particular set of genes G_i,j.

(See also UseOnlyTraining for another way to use only the training data.)

** Comparing α_i,j with α_i^*: FMeasure **

To compare any pair of sets --- eg, G_1,1 with G₁^* --- we can use (a variant of) F-measure index,

F(A, B) = Fmeasure(A, B) = (2 × |A ∩ B|) / (|A| + |B|)

(It is easy to confirm that this corresponds to (2× Prec(A,B) × Recall(A,B))/(Prec(A,B) + Recall(A,B)) where Prec(A,B) = |A ∩ B|/|A| and where Recall(A,B) = |A ∩ B|/|B|.)
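As a quick check of this identity, here is a minimal Python sketch of the set-based F-measure together with Prec and Recall as defined above. The function names, and the convention of returning 1.0 for two empty sets, are choices made for this example only.

```python
def f_measure(a, b):
    """F(A, B) = 2|A ∩ B| / (|A| + |B|) for two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return 2 * len(a & b) / (len(a) + len(b))


def precision(a, b):
    """Prec(A, B) = |A ∩ B| / |A| (A must be non-empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a)


def recall(a, b):
    """Recall(A, B) = |A ∩ B| / |B| (B must be non-empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(b)


if __name__ == "__main__":
    A = {1, 2, 3, 4}
    B = {3, 4, 5, 6, 7, 8}
    p, r = precision(A, B), recall(A, B)
    print(f_measure(A, B))      # direct set formula
    print(2 * p * r / (p + r))  # harmonic mean of Prec and Recall
```

For these toy sets, |A ∩ B| = 2, so both expressions evaluate to 4/10 = 0.4 (up to floating-point rounding).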

We therefore computed the 15 values F( G_i,1, G₁^*), one for each single-patient addition, associated with the first bicluster of each bicluster set. These are graphed in the far-left region of the left plot in Figure 1 below, as a box-and-whisker plot (produced with Matlab's BOXPLOT). We see that the mean is around 0.85, and one standard deviation is only a few percent. The middle region of this plot corresponds to the second biclusters { F( G_i,2, G₂^*) }, and the far-right region to the third biclusters { F( G_i,3, G₃^*) }.
Figure 1: Box-and-Whisker plot of (left) F( G_i,1, G₁^*), F( G_i,2, G₂^*) and F( G_i,3, G₃^*); and (right) F( G_i,1, G_j,1), F( G_i,2, G_j,2) and F( G_i,3, G_j,3).

We also compared all (¹⁵₂) = 105 pairs F( G_i,1, G_j,1 ), for i ≠ j. The right plot of Figure 1 shows those values for the first, second and third biclusters. Notice the average F-score here is around 0.95 for both the first and second biclusters.
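The pairwise comparison can be sketched in Python as below. The gene sets here are synthetic stand-ins, not the BreastCancer results, and `pairwise_f` is a name introduced only for this illustration.

```python
from itertools import combinations


def f_measure(a, b):
    """F(A, B) = 2|A ∩ B| / (|A| + |B|) for two non-empty sets."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))


def pairwise_f(sets):
    """F-measure for every unordered pair (i < j) of the given sets."""
    return [f_measure(a, b) for a, b in combinations(sets, 2)]


if __name__ == "__main__":
    # Synthetic example: 15 overlapping "gene sets" of 10 genes each.
    gene_sets = [set(range(i, i + 10)) for i in range(15)]
    scores = pairwise_f(gene_sets)
    print(len(scores))                # number of unordered pairs among 15 sets
    print(sum(scores) / len(scores))  # average pairwise F-score
```

Because F is symmetric in its arguments, it is enough to evaluate each unordered pair once, which for 15 sets gives the 105 values mentioned above.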

** Membership Histograms across {α_i,j}_i for α = G, P, PG and j=1,2,3 **

Figure 2(left) presents a histogram of the patients in the 15 biclusters {P_i,1}_i = {P_1,1, P_2,1, ..., P_15,1}; that is, we see that 6 patients appeared in all 15 of the bicluster#1's, and that 3 appeared in only 1. (This is out of the total of 11 patients that appeared in any bicluster#1; ie, in ∪_j P_j,1.) Figure 2(mid) (resp., Figure 2(right)) shows a histogram of the 17 patients in the 15 bicluster#2's ∪_j P_j,2 (resp., 16 patients in the 15 bicluster#3's ∪_j P_j,3). It is not surprising that slightly fewer patients, 5, appear in all 15 of the second biclusters, and only 4 in all 15 of the third biclusters.

Figure 3 deals with genes. We note that almost 300 genes (of around 400) appear in all 15 bicluster#1's, and around 800 (of 1000) genes in all 15 bicluster#2's.
Figure 2: Histogram of the number of patients in each of the 15 biclusters for the first bicluster (left), the second bicluster (center) and the third bicluster (right).

Figure 3: Histogram of the number of genes in each of the 15 biclusters for the first bicluster (left), the second bicluster (center) and the third bicluster (right).
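The membership histograms of Figures 2 and 3 count, for each patient (or gene), how many of the 15 biclusters it belongs to, and then tally how many members share each count. A minimal Python sketch, using made-up patient ids and a hypothetical helper name `membership_histogram`:

```python
from collections import Counter


def membership_histogram(sets):
    """Map c -> number of elements that appear in exactly c of the given sets."""
    # appearances[x] = number of sets that contain element x
    appearances = Counter(x for s in sets for x in set(s))
    return Counter(appearances.values())


if __name__ == "__main__":
    # Toy example with 3 "biclusters" of patients (illustrative ids only).
    patient_sets = [{"p1", "p2"}, {"p1", "p2", "p3"}, {"p1"}]
    hist = membership_histogram(patient_sets)
    for count in sorted(hist):
        print(count, hist[count])  # sets-containing-the-member, number of such members
```

The values of the histogram sum to the size of the union of the sets, which corresponds to the "total of 11 patients that appeared in any bicluster#1" in the text.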

** Comparing biclusters for each of B₁ⁱ and biclusters containing all of the 15 instances (B₁^*, B₂^*, B₃^*) **

Figure 4: Histogram of the number of patients in each of the 15 biclusters and the bicluster containing all instances for the first bicluster (left), the second bicluster (center) and the third bicluster (right).

Figure 5: Histogram of the number of genes in each of the 15 biclusters and the bicluster containing all instances for the first bicluster (left), the second bicluster (center) and the third bicluster (right).
