Lecture 11 - Representing Tables

1.2. Representing Tables

We have studied a couple of data structures that could be used to implement tables relatively efficiently. What are they?

Answer: A heap would be good for insertion and deletion, but terrible for retrieval: it gives fast access only to its top element, so finding an arbitrary key means searching the whole structure. In most applications, retrieval is the principal operation: you build up a table initially with a sequence of insertions, and then do a large number of retrievals; deletion is usually rare. The importance of retrieval makes heaps a poor way to implement tables.

The best choices are binary search trees or B-trees, giving O(log N) insertion, deletion and retrieval. For a binary search tree this bound holds only if the tree is kept balanced; an unbalanced tree can degrade to O(N).
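To make the cost of retrieval concrete, here is a minimal sketch of lookup in a binary search tree. The node layout and the name bst_find are assumptions made for illustration; they are not taken from the course code. Each step discards one subtree, so the number of steps is bounded by the height of the tree, which is O(log N) when the tree is balanced.

```c
#include <stddef.h>

/* Hypothetical node layout: an integer key with one integer of data. */
struct node {
    int key;
    int data;
    struct node *left;   /* subtree of smaller keys */
    struct node *right;  /* subtree of larger keys  */
};

/* Return the node holding `key`, or NULL if the key is not in the tree.
   The loop follows a single path from the root downwards. */
const struct node *bst_find(const struct node *root, int key)
{
    while (root != NULL && root->key != key)
        root = (key < root->key) ? root->left : root->right;
    return root;
}
```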

Can we do any better? We will look at a technique called hashing that aims to make these operations constant time. That may seem impossible, but hashing does indeed come very close to achieving this goal.

To see how this might be done, let's start with a familiar data structure and try to extend it to handle tables in general. Can anyone think of a simple data structure that allows any of its elements to be accessed in constant time? You are all familiar with one...

Answer: The array. We can access any position of an array in constant time. We think of the subscript as the key, and the value stored in the array as the data. Given the key, we can access the data in constant time.

For example, suppose I wanted to store in a table information about the students in this class. I could use an array of size 100, say, and assign to each student a particular position in the array. I'd tell each of you your number, calling it your student number. When you came to ask me about your marks, or when I needed to update your record, I'd use your student number as a subscript in the array.
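The student-number scheme can be sketched in C as follows. This is only an illustration under stated assumptions: the record fields, the in_use flag, and the names table_put and table_get are hypothetical, and student numbers are assumed to run from 0 to 99. Both operations do a fixed amount of work regardless of how many students are stored.

```c
#include <stddef.h>
#include <string.h>

#define CLASS_SIZE 100  /* array size used in the example */

/* Hypothetical record layout; the fields are chosen for illustration. */
struct student {
    int  in_use;     /* 1 if this slot holds a record, 0 otherwise */
    char name[32];
    int  mark;
};

/* The table itself: static storage, so every slot starts out unused. */
static struct student table[CLASS_SIZE];

/* Insert or update a record. The student number is the subscript,
   so no searching is needed. Returns 0 if the number is out of range. */
int table_put(int student_number, const char *name, int mark)
{
    struct student *s;

    if (student_number < 0 || student_number >= CLASS_SIZE)
        return 0;
    s = &table[student_number];
    s->in_use = 1;
    strncpy(s->name, name, sizeof s->name - 1);
    s->name[sizeof s->name - 1] = '\0';  /* always terminate the copy */
    s->mark = mark;
    return 1;
}

/* Retrieve a record, or NULL if nothing is stored under that number. */
const struct student *table_get(int student_number)
{
    if (student_number < 0 || student_number >= CLASS_SIZE)
        return NULL;
    return table[student_number].in_use ? &table[student_number] : NULL;
}
```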

This is the basic idea behind a hash table. In fact, the only flaw in the strategy that needs to be addressed is the step in which I tell you what your "student number" is. In practice, we usually do not control the key values: the set of possible keys is given to us as part of the problem, and we must accommodate it.

To carry on with our example, suppose that circumstances forced me to use some part of your personal data as the key - say your social insurance number. Because this is a number, I can still use my original strategy. I'd use your social insurance number as an array subscript, and store your information in the position that it indexed.

Can anyone see the problem with this?

Answer: Social insurance numbers are 9 digits long, so I would need an array of size 10**9, even though I only have information about 100 students.
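The arithmetic behind this answer can be checked with a small sketch. The record layout and the function names below are assumptions made for illustration. The functions compute how many slots a key of a given number of decimal digits requires when used directly as a subscript, how many of those slots stay empty, and how many bytes the array would occupy.

```c
#include <stddef.h>

/* Hypothetical record layout; the fields are chosen for illustration. */
struct record {
    char name[32];
    int  mark;
};

/* Slots needed when a key of `digits` decimal digits is used directly
   as the array subscript: 10 raised to the power `digits`. */
unsigned long long slots_for_digits(unsigned digits)
{
    unsigned long long slots = 1;
    while (digits-- > 0)
        slots *= 10;
    return slots;
}

/* Slots that remain empty when only `records` of them are occupied. */
unsigned long long unused_slots(unsigned long long slots,
                                unsigned long long records)
{
    return slots - records;
}

/* Total size in bytes of an array with `slots` records. */
unsigned long long table_bytes(unsigned long long slots)
{
    return slots * sizeof(struct record);
}
```

For a 9-digit key, slots_for_digits(9) is 1,000,000,000, and with 100 students unused_slots reports 999,999,900 empty positions. On a platform where this record happens to occupy 36 bytes, the array would need about 36 GB to hold 100 records.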

The constraints, then, that we are working with are these:

1. The set of possible key values is very large. This set might even be unbounded: if your name were used as the key, there would be infinitely many possible names.
2. The set of actual key values is quite small.
3. To get constant-time operations, we must use an array to store the information.

The array cannot possibly be large enough to have a different position for every possible key. And, in any case, we must be able to accommodate keys whose types (such as real numbers or strings) are not legal array subscripts in C.