Decision Trees - Page 5


Gain Ratio

You should have seen the following tree:

Notice this tree is much smaller than the tree produced by Splitting Randomly, as it has only 12 nodes, including 5 internal nodes. It is claiming the MallRats win when:

  • Joe defends forwards during 5pm games
  • Fred starts on 7pm games and
  • when Joe plays center on away games that Joe does not start.

We can then use this tree to predict the outcome of the game:

Where When Fred Starts Joe offence Joe defense Opp C Outcome
Away 9pm No Center Forward Tall ??

Following the relevant path, we see the answer is "Lost", as this game will start at 9pm.

Problems with Information Gain

While there are several motivations for using InformationGain as a quality measure, there are also several limitations. One problem is: it tends to prefer attributes with MANY values; ie, it would prefer to split on say the 20-valued "OpponentName" attribute (if we had included it here), over the current 2- or 3-valued attributes. Indeed, it would further prefer to split on "AudienceSize" field, which could have any integer values {1,2,3,..., 150, 151, ..., 1023, 1024, ... }).

Why? The information possible for a k-ary attribute is between 0 and ln k; this means the values for a binary attribute cannot be larger than ln 2 = 1, while the largest possible value for a 20-ary attribute is ln 20 == 4.3, and for a 1024-ary attribute is ln 1024 = 10.

Note, however, these larger-arity attributes are not necessarily better, even though the InformationGain criteria would prefer them. (Indeed, it is likely that "AudienceSize" may in fact be one of the least important attributes!)

One trick to avoid this specific problem is to use GainRatio, which basically levels the playing field by penalizing the multiple-valued attributes. (How? See Gain Ratio Note for details.)

Now you try it...

To try out this criteria: Go back to the applet and change the "Splitting Function" to "Gain Ratio" (Recall that you can change the splitting function using the "Splitting Function" submenu under the "Algorithm" menu).

Then choose "Run" from the "Algorithm" menu ("<Ctrl><Alt>R") to see the resulting tree. Or you can "Trace Run", or "Step" or whatever. If you left-click on any unfinished node , you will now get the Gain Ratio score for the attributes, rather than the Gain score.

You can also flip back and forth between these splitting functions while growing the tree. (Again see Manual for details.)


Information Gain Decision Trees Evaluation + Cross Validation