AIxploratorium
Decision Trees


 

Modifications to the LearnDT algorithm

[[This page is still being written. It will be completed soon!]]

Many Learning Algorithms: Why use InformationGain? Or even GainRatio? Yes, they do try to find the smallest decision tree, which is sometimes useful (see Occam). But there remain several issues. First, it is not clear that the smallest tree really is the most accurate (see No Free Lunch). Second, even if the smallest tree really is best, our specific algorithm does not always find the absolute smallest, as it is only GREEDILY hill-climbing. (Note that finding the smallest tree is NP-hard.) The third, and most damning, problem is that our learner is only using a SAMPLE of the instances (here, of the games), which may not be representative of the true distribution. For example, perhaps the MallRats will almost always lose away games where Fred starts and Joe plays center (both offense and defense). But in the one such game we actually saw, they just happened to win. (This is like flipping a 0.99-heads coin and getting tails.)

The pruning idea attempts to address this. A variety of other "tweaks" have been proposed as well. These include:

  • Learn a set of decision trees, each based on a different sample. See Four Current Trends.
  • Use some other splitting criterion. From Article, there are several obvious criteria a splitting function should satisfy:
    • the best score should be for a pure distribution, of the form [1, 0, 0, 0];
    • the worst score should be for a uniform distribution, of the form [1/4, 1/4, 1/4, 1/4]; and
    • the measure should be symmetric --- i.e., the same for [0.3, 0.1, 0, 0.6] as for [0.1, 0, 0.6, 0.3], etc.

    Note that both InformationGain and GainRatio satisfy these properties. Another such measure is the GINI measure:

    Gini(A) = 1 - sum_i p_i^2

    where p_i is the fraction of the instances (at the node being split) that belong to class i. You can try this out: in the applet, reset the "Split Function" pull-down menu to "GINI". (A small sketch of how these measures are computed appears below, after this list.)

    (Other measures include "Statistical Chi2 test", "Marshall Correction", "G statistic", ...)

  • Try other PRUNING functions; see Other Pruners

Note: many researchers have found that the choice of splitting criterion matters relatively little, but that pruning is essential. For that reason, essentially all decision tree learners include some pruning facility.
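
To make these measures concrete, here is a small Python sketch (our own illustration, not the applet's code) that evaluates both the entropy underlying InformationGain and the GINI measure on a few class distributions:

    import math

    def entropy(p):
        """Entropy of a class distribution p (a list of fractions summing to 1)."""
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def gini(p):
        """GINI measure: 1 - sum_i p_i^2."""
        return 1.0 - sum(pi * pi for pi in p)

    # The pure, uniform, and two symmetric distributions listed above:
    for dist in ([1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25],
                 [0.3, 0.1, 0, 0.6], [0.1, 0, 0.6, 0.3]):
        print(dist, round(entropy(dist), 3), round(gini(dist), 3))

Both measures give their best (lowest) score to the pure distribution [1, 0, 0, 0], their worst (highest) score to the uniform distribution, and identical scores to the two reorderings --- exactly the three properties listed above.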



  • Each classifier has an associated "decision boundary", which separates the different classes. To introduce this idea, consider a far simpler learning problem: predicting who is bald, given the person's gender (from {Male, Female}) and age range (from {Baby, Child, Teenager, Adult, OldAdult}). Imagine our data looks like this:
    Gender   Age        Bald
    ------   --------   ----
    Male     Baby       No
    Male     Teenager   No
    Female   OldAdult   No
    Female   Teenager   No
    Male     OldAdult   Yes

    We could graph this: using one attribute per axis, each instance corresponds to a point. We could then put a BOX ([]) on each point representing a positive instance (i.e., a bald person), and a circle (o) on each negative instance:

    [Produce this graph!]

    This graph also includes a line that separates the positive from the negative instances.

    Here, we drew a straight line. The obvious decision tree for this data would be

    [Draw this tree!]

    Its "decision surface" is a bit more complicated, as it first splits on age and then, for certain ages, further divides on gender:

    [Produce this revised graph!]

    Notice that the decision boundary here is composed of a series of axis-parallel segments. This is true of decision trees in general.
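
    As a concrete sketch (our own reconstruction --- the figures are not drawn here), the following Python fragment encodes one tree that is consistent with the five examples above and with the split-on-age-then-gender description; every test compares a single attribute to a value, which is why the boundary is axis-parallel:

        def predict_bald(gender, age):
            if age == "OldAdult":                             # first split: Age
                return "Yes" if gender == "Male" else "No"    # then split on Gender
            return "No"                                       # Baby, Child, Teenager, Adult

        # Check the tree against the five training examples:
        examples = [("Male", "Baby", "No"), ("Male", "Teenager", "No"),
                    ("Female", "OldAdult", "No"), ("Female", "Teenager", "No"),
                    ("Male", "OldAdult", "Yes")]
        assert all(predict_bald(g, a) == bald for g, a, bald in examples)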

    Now recall our basketball example: here, each instance corresponds to a point in a 6-dimensional "instance" space, as each instance has a "Where" value (either Home or Away), a "When" value (one of 5pm, 7pm or 9pm), a "Fred Starts" value (either Yes or No), etc. (As each game corresponds to exactly one combination of these values, there are 2*3*2*2*2*2 = 96 possible types of games. Of course, several different games could have the same description.) Each of the decision trees we considered again corresponds to a decision boundary built from a sequence of such axis-parallel "hyper-planes".

    Another approach is to consider "oblique" trees, whose splits need not be axis-parallel. See Salzberg, Kasif, ...


  • Why not learn MANY different decision trees, and average their predictions together?

    [[to be completed]]
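
    One common form of this idea, sketched here under our own assumptions (learn_dt below stands in for the LearnDT algorithm; its name and signature are hypothetical), is to learn each tree from a different bootstrap resample of the games --- as in the "different samples" bullet above --- and then classify a new game by majority vote:

        import random
        from collections import Counter

        def learn_ensemble(games, learn_dt, n_trees=11, rng=random.Random(0)):
            """Learn n_trees trees, each from a bootstrap resample of the games.
            `learn_dt` is assumed to map a list of labelled games to a function
            that classifies a single game (e.g. as "Win" or "Lose")."""
            trees = []
            for _ in range(n_trees):
                sample = [rng.choice(games) for _ in games]   # resample with replacement
                trees.append(learn_dt(sample))
            return trees

        def classify_by_vote(trees, game):
            """Classify a game by majority vote over the ensemble."""
            votes = Counter(tree(game) for tree in trees)
            return votes.most_common(1)[0][0]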


  • This algorithm is greedy --- trying to find the single split that most improves things. What if no single split is effective --- e.g., XOR? Here, there is NO advantage to any single split, but with 2 splits, we can be perfect. In fact, that is true for our current situation:

    Note the lower left is the exclusive-or of "OppC=Tall (+) JoeDefense=Center".
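
    To see the problem numerically, here is a small sketch (our own, using a generic two-attribute XOR rather than the game attributes themselves): no single split gains any information on an XOR concept, yet a two-level tree classifies it perfectly.

        import math
        from collections import Counter

        def entropy(labels):
            n = len(labels)
            return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

        def info_gain(data, attr):
            """Information gain of splitting `data` (a list of (attributes, label)
            pairs) on the single attribute `attr`."""
            remainder = 0.0
            for value in {row[attr] for row, _ in data}:
                subset = [label for row, label in data if row[attr] == value]
                remainder += len(subset) / len(data) * entropy(subset)
            return entropy([label for _, label in data]) - remainder

        # The four equally frequent cases of the concept "A xor B":
        xor_data = [({"A": a, "B": b}, a != b)
                    for a in (True, False) for b in (True, False)]

        print(info_gain(xor_data, "A"))   # 0.0 -- splitting on A alone looks useless
        print(info_gain(xor_data, "B"))   # 0.0 -- so does splitting on B
        # ...yet the tree "if A then (not B) else B" classifies every case correctly.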

    [[to be completed]]

    Actually there is a yet-smaller tree. Can you find it?
