Decision Trees


Using the Decision Tree Applet

Everything in this section applies to the applet.

This applet includes three panels: the upper left displays the current dataset; the upper right shows the algorithm being executed; and the bottom shows the decision tree as it is grown. The applet also has two menus: the "Dataset" menu allows a user to load a new dataset; the "Algorithm" menu includes a series of algorithm controls and a preferences option.

  • To run this system, you must first specify the relevant dataset. Do this by making a choice from the "Load Dataset" submenu under the "Dataset" menu. The system starts up using "Basketball". You can change this to "CarEvaluation", "FrogLegs", "Genome" or "PlayTennis". (The remaining option, "Custom", is not yet implemented.)
  • Next, specify which splitting algorithm to use by indicating the "Splitting Function".

    To specify the splitting function, select one of the options from the "Splitting Function" submenu under the "Algorithm" menu. The relevant functions are: "Random", "Gain", "Gain Ratio", and "GINI".

  • If you like, you may also choose a pruning algorithm.

    To specify the pruning algorithm, select one of the options from the "Pruning Algorithm" submenu under the "Algorithm" menu. By default, the pruning algorithm is set to "None" -- no pruning is performed on the finished tree. The two pruning algorithms available are "Reduce-error" and "Pessimistic".

    If you select "Reduce-error" pruning, you'll need to create a "Testing" dataset before you continue (reduced-error pruning uses a holdout dataset to test the classification accuracy of a particular tree).

  • Finally, you can run the algorithm. Do this by selecting "Run" from the "Algorithm" pull-down menu.
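The splitting functions differ only in how they score a candidate attribute. As a rough sketch (in Python; the function names and the dict-based record representation here are illustrative, not the applet's actual code), the Gain, Gain Ratio, and GINI measures can be computed as:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """GINI impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_reduction(examples, labels, attribute, impurity):
    """Weighted impurity reduction from splitting `examples` on `attribute`.
    With impurity=entropy this is the information Gain; with impurity=gini
    it is the GINI-based score."""
    n = len(labels)
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [l for e, l in zip(examples, labels) if e[attribute] == value]
        remainder += len(subset) / n * impurity(subset)
    return impurity(labels) - remainder

def gain_ratio(examples, labels, attribute):
    """Gain normalized by the entropy of the split itself, which
    penalizes attributes with many distinct values."""
    split_info = entropy([e[attribute] for e in examples])
    if split_info == 0.0:
        return 0.0
    return impurity_reduction(examples, labels, attribute, entropy) / split_info
```

("Random" simply picks an attribute at random, so it needs no scoring function.)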

To summarize, here are all of the actions you can perform (in each case, pull down the "Algorithm" menu to the entry indicated):

  • Select "Run" to see the resulting decision tree in the "Tree View" panel.
  • Select "Initialize" to clear the "Tree View" panel (if necessary), and tell the applet to use the most-recently specified dataset.
  • Select "Trace Run" to see how the code actually works: this executes the applet algorithm, but more slowly than "Run". It also shows the line in the code being executed. Note that this code is recursive -- in fact, the LearnDT procedure will be called once for each node generated. (See "Description of the Code" below.)

    If the algorithm is executing too quickly (or too slowly), you can adjust the speed using the "Algorithm Trace Step Delay" slider. Choose the "Preferences..." option from the "Algorithm" menu to display the slider.

  • While this code is executing (within either "Run" or "Trace Run"):
    • Select "Pause" to interrupt the execution. You can then continue with "Run", "Trace Run", or "Step" (or terminate the execution completely with "Cancel").
    • Select "Cancel" to stop the execution.
  • Select "Step" to step through the code, line by line.
  • Select "Backup" to undo the last node generated -- returning it to an "unfinished node", represented by a blue triangle.

To simplify user interaction, you can use the keyboard directly: for example, typing "<Ctrl><Alt>R" (that is, holding down the Ctrl, Alt and R keys simultaneously) is the same as pulling down the "Run" option. The complete set is below (note that each key corresponds to the first character of the action).

Run        <Ctrl><Alt>R
Trace Run  <Ctrl><Alt>T
Pause      <Ctrl><Alt>P
Cancel     <Ctrl><Alt>C
Initialize <Ctrl><Alt>I
Backup     <Ctrl><Alt>B

Description of the Code


You will see there is some overhead at the start of the LearnDT algorithm, which checks:
  • whether any examples reach this location (if not, use the "most common" label);
  • whether we have reached a leaf node, because all of the examples that reach this location have the same label;
  • whether there are any attributes left to consider here. (There are many situations where this is NOT true; see Not Pure.)

If we pass these tests, LearnDT next decides which attribute to split on, using whatever is specified in the "Split Function" menu (e.g., Gain). This leads to several recursive calls --- one for each branch (value) of the selected attribute.
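Schematically, the recursion looks like the Python sketch below (the names and the dict-based record representation are assumptions for illustration; the applet's actual code differs). Note the base cases mirror the three overhead checks above, and there is one recursive call per branch:

```python
from collections import Counter

def learn_dt(examples, labels, attributes, default_label, choose_attribute):
    """Recursive decision-tree learner (one call per generated node).
    `choose_attribute(examples, labels, attributes)` stands in for the
    selected Split Function (Random, Gain, Gain Ratio, or GINI)."""
    # No examples reach this node: fall back on the most common label above.
    if not labels:
        return default_label
    # Pure node: every example that reaches here has the same label.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to split on: use the most common label here.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    best = choose_attribute(examples, labels, attributes)
    majority = Counter(labels).most_common(1)[0][0]
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    # One recursive call per value (branch) of the chosen attribute.
    for value in {e[best] for e in examples}:
        branch = [(e, l) for e, l in zip(examples, labels) if e[best] == value]
        tree[best][value] = learn_dt([e for e, _ in branch],
                                     [l for _, l in branch],
                                     remaining, majority, choose_attribute)
    return tree
```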

During this tree-construction process, each "unfinished node" is represented as a blue triangle. When the code is interrupted, you can then left-click on any such unfinished node to see the "Gain" score for each of the available attributes (or Gain Ratio, or GINI -- depending on the value of "Split Function"). You may also, if you wish, specify the attribute for this node (finished or not), or indicate that it should become a leaf node.

You can then continue specifying other attributes by hand, or continue with "Run", "Trace Run", "Step", or "Cancel".

Note that LearnDT is "greedy", in that it never backs up to replace a finished node --- whether it originally assigned the attribute, or you did.
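Pruning, when enabled, is a separate pass over the finished tree. As a rough sketch of Reduce-error pruning (all names here are illustrative, and for simplicity each subtree is evaluated only on the holdout records that reach it -- a common simplification, not necessarily the applet's exact procedure):

```python
from collections import Counter

def classify(tree, record):
    """Follow attribute tests down to a leaf label."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][record[attr]]
    return tree

def accuracy(tree, records, labels):
    hits = sum(classify(tree, r) == l for r, l in zip(records, labels))
    return hits / len(labels)

def reduce_error_prune(tree, train_records, train_labels,
                       test_records, test_labels):
    """Bottom-up: collapse a subtree to the majority training label
    whenever that does not lower accuracy on the holdout ("Testing")
    records that reach this node."""
    if not isinstance(tree, dict):
        return tree
    attr = next(iter(tree))
    # Prune each branch first, passing down only the records it receives.
    for value in list(tree[attr]):
        tr = [(r, l) for r, l in zip(train_records, train_labels)
              if r[attr] == value]
        te = [(r, l) for r, l in zip(test_records, test_labels)
              if r[attr] == value]
        tree[attr][value] = reduce_error_prune(
            tree[attr][value],
            [r for r, _ in tr], [l for _, l in tr],
            [r for r, _ in te], [l for _, l in te])
    # Try collapsing this node to a leaf.
    leaf = Counter(train_labels).most_common(1)[0][0]
    if test_labels and accuracy(leaf, test_records, test_labels) >= \
            accuracy(tree, test_records, test_labels):
        return leaf
    return tree
```

This is why Reduce-error pruning requires a "Testing" dataset: without holdout records, there is no independent accuracy estimate to justify collapsing a subtree.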

Other Nuances

"Right-clicking" on any panel gives you the option of expanding that panel to occupy the entire 600x800 screen (or, once expanded, compressing it back). On the bottom panel, you also have the option of removing the statistics (the "0 of 20 examples correctly classified" label, initially in the upper left corner, and the "0% ... 100%" legend on the right). You can also zoom the tree to various levels; note that the actual attribute/class values are only shown at the largest (100%) zoom level.

A subset of the records in the dataset reaches each node in the decision tree. When you click on a node in the decision tree, those records are shaded yellow.
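The highlighted subset is simply the records consistent with every attribute test on the path from the root down to the clicked node. A minimal illustration (the record and path representations are assumptions):

```python
def records_reaching(records, path):
    """Return the records consistent with every (attribute, value)
    test on the path from the root down to a node."""
    return [r for r in records
            if all(r[attr] == val for attr, val in path)]

# For example, with a path that tests one attribute:
records = [{'outlook': 'sunny', 'windy': 'no'},
           {'outlook': 'rain', 'windy': 'no'}]
subset = records_reaching(records, [('outlook', 'sunny')])
```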