3. A Problem With Binary Search Trees

To finish off the subject of binary search trees, I would like to return to the issue that originally motivated them - we wanted a data structure that could be searched in logN time, where N is the number of values in the data structure. Do BSTs actually meet this requirement? In other words, Is it true that I can find any value X in any BST in logN time? What do you think? Yes, no?

Well, while you think about that, here is something that is certainly true. The time to search in a BST is definitely limited by the height (or depth) of the tree. Each step in the search goes down one level, so in the absolute worst case, we will have to go all the way from the root to the deepest leaf in order to find X, or to find out that X is not in the tree. So we can say with certainty that search is O(Height).

But we want to know, is search O(logN)? Well, since we know search is O(Height), this is really asking is Height O(logN) ... is it always the case that the height of a BST is proportional to the logarithm of the # of nodes?

Unfortunately, the answer to all the questions is no. There are BSTs in which Height and # of Nodes are equal! In these trees, searching is O(N), linear time, not log time. It's obvious once I draw one:

This is called a Degenerate tree, or linear tree. In it, no node has two subtrees, every node has 1 (except the lone leaf who has 0). Obviously this is nothing more than a sorted linked list, which was where all our problems began!

So, aren't we are back where we started? Not quite. Most Binary Search Trees are not degenerate: in most of them Height is proportional to LOG(# nodes), and we get efficient search.

For example, the book cites a theorem (page 267) that if you start with an empty BST and add to it N randomly generated values, the expected Height of the tree is 1.4(log(N+1)). This is good news. It basically says that you only get a degenerate BST if you insert values in a systematic way. So, whenever you build a BST make sure you do not insert the values in increasing or decreasing order. Or if you find yourself with a degenerate tree it is worthwhile to rebuild it by removing all its values and re-inserting them in random order.

There is another approach to keeping Height related to log(# nodes), based on the observation that in all degenerate BSTs, there is a node that has one subtree that is much deeper than the other ... in our example, the right subtree of `1' is much taller than the left subtree. What would happen if we absolutely required that the subtrees be the same height (within 1)?

Answer: Height would always be O(log(# nodes)), so search would always be O(logN).

Trees with this property are called Balanced Trees, and that's our next topic.