AIxploratorium
Decision Trees - Page 7


 

Overfitting

As noted above, learning means generalizing from the observed data. This in turn corresponds to a form of "educated guessing", which is NOT guaranteed to be correct. ... Learning means always saying you're sorry.

This is true for decision tree learning as well. Like most standard learning algorithms, a decision tree learner will first try to produce a tree that is perfect over the training sample, in that the tree correctly classifies every training instance.

This is problematic, however, as this perfect decision tree may be encoding the idiosyncrasies of the specific dataset used, rather than the true distribution. This is especially likely when the underlying "classification function" is stochastic (ie, when there is no deterministic function that is always correct; see Impure Note), or when the training dataset is relatively small.

To illustrate these ideas, consider the task of determining the "bias" of a coin: ie, the percentage of the time it will land on Heads when it is flipped. We know that the empirical average (the number of times it landed on Heads, divided by the total number of flips) will converge to the true bias as we take more and more flips. However, this estimate may be significantly wrong if we consider very few flips. For example, consider flipping a coin only ONE time. If this one flip landed "Heads", we would infer that this coin ALWAYS lands Heads --- ie, "Prob of heads = 1.0". This does represent the data perfectly (all 1 flip of it), but is unlikely to correspond to the underlying distribution.
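This small-sample effect is easy to simulate. The sketch below (TRUE_BIAS = 0.6 is an arbitrary choice for illustration) estimates the bias of a coin from different numbers of flips; the one-flip estimate is always 0.0 or 1.0, while larger samples converge toward the true bias:

```python
import random

random.seed(0)
TRUE_BIAS = 0.6  # hypothetical coin: lands Heads 60% of the time

def estimate_bias(n_flips):
    """Empirical average: fraction of n_flips that landed Heads."""
    heads = sum(random.random() < TRUE_BIAS for _ in range(n_flips))
    return heads / n_flips

for n in (1, 10, 100, 10000):
    print(f"{n:5d} flips -> estimated bias {estimate_bias(n):.3f}")
```

With one flip the estimate is perfect on the observed data yet badly wrong about the coin; with 10,000 flips it sits very close to 0.6.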

To take another (now-standard) example: Suppose we want to decide whether the FrogLegs will win their next basketball game, given rather silly training (read "uncorrelated") data: We look over the first 50 attendees of each game, and ask whether each attendee's birthday falls in the first half of the calendar year (ie, January through June) or the second half. Here, there is no correlation between any of these attributes and the classification ("Won"/"Lost").
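Even on such pure-noise data, a tree that perfectly fits the training sample is easy to build: with 50 binary attributes, training instances are almost surely all distinguishable, so the "perfect" tree amounts to a lookup table over the training games. The sketch below (a simplified stand-in, not the applet's actual learner) shows what happens: 100% training accuracy, chance-level holdout accuracy.

```python
import random

random.seed(1)
N_ATTRS = 50   # one binary attribute per attendee's birthday half
N_TRAIN, N_HOLDOUT = 40, 30

def make_game():
    # attributes and label are independent: pure noise
    attrs = tuple(random.randint(0, 1) for _ in range(N_ATTRS))
    label = random.choice(("Won", "Lost"))
    return attrs, label

train = [make_game() for _ in range(N_TRAIN)]
holdout = [make_game() for _ in range(N_HOLDOUT)]

# A tree that is "perfect" on the training set behaves like a lookup
# table: it memorizes the label of each training instance exactly.
table = {attrs: label for attrs, label in train}

def predict(attrs):
    return table.get(attrs, "Won")  # unseen instance: fall back to a default

def accuracy(dataset):
    return sum(predict(a) == lbl for a, lbl in dataset) / len(dataset)

print("train accuracy:  ", accuracy(train))    # 1.0 by construction
print("holdout accuracy:", accuracy(holdout))  # roughly 0.5: no better than chance
```

The memorized tree "explains" the training data perfectly while telling us nothing about future games, which is overfitting in its starkest form.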


Try it out

Go to the applet, and try to learn from the "FrogLegs" dataset. Try first using the Gain splitting criterion (although it should not matter which splitting function you use).

Here, we are using the first 70 games as training data. You can run this using the "FrogLegs" dataset (which can be loaded by selecting "FrogLegs" from the "Load Dataset" submenu under the "Dataset" menu).

Now, let's say that instead of building a tree using all the available data, we take 30 examples (at random) and move them to our holdout dataset (which you can do by following the procedure mentioned on the previous page). We can then estimate the true error of the tree learned from the remaining 40-record training set by running it on this holdout set of 30 records.

  • Try running the algorithm (by choosing "Run" from the "Algorithm" menu) with the 30 record holdout dataset.
  • Right-click on the "Tree View" panel and choose the "Show Testing Results" option to view the tree's performance on the holdout records.
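The applet performs these steps internally; conceptually, the random train/holdout split and the holdout error estimate look like the sketch below (split_holdout and holdout_error are illustrative stand-ins, and the toy games list merely fills in for the real FrogLegs records):

```python
import random

def split_holdout(records, n_holdout, seed=0):
    """Randomly move n_holdout records out of the training set."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled[n_holdout:], shuffled[:n_holdout]  # (train, holdout)

def holdout_error(predict, holdout):
    """Fraction of holdout records the learned tree misclassifies --
    an estimate of the tree's true error on unseen games."""
    wrong = sum(predict(attrs) != label for attrs, label in holdout)
    return wrong / len(holdout)

# e.g. with the 70 FrogLegs games: 40 for training, 30 held out
games = [((i % 2,), "Won" if i % 3 else "Lost") for i in range(70)]  # toy stand-in
train, holdout = split_holdout(games, 30)
print(len(train), len(holdout))  # 40 30
```

Because the holdout records played no part in growing the tree, the error measured on them is not flattered by memorization, unlike the training-set error.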

The tree should perform quite poorly on the holdout dataset. We need to develop a way to increase the tree's classification accuracy on the holdout records.
