2. Minimizing The Number Of Collisions

There are two strategies for minimizing the number of collisions. The first is simply to choose a hashing function that spreads the possible key values evenly across all the different positions in the hash table. A commonly used form of hashing function, when keys are integers (or easily converted to integers), is:

H(Key) = (P * Key) mod TABLE_SIZE

where P and TABLE_SIZE are different prime numbers. Unless the key values have some unusual properties - for instance, they are all multiples of TABLE_SIZE - a hashing function of this form will distribute them uniformly across all the different array positions.
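In C, such a function might look like the following sketch; the particular primes P = 31 and TABLE_SIZE = 101 are illustrative choices, not values fixed by the method:

    /* H(Key) = (P * Key) mod TABLE_SIZE, with P and TABLE_SIZE both prime.
       P = 31 and TABLE_SIZE = 101 are illustrative choices. */
    #define P           31
    #define TABLE_SIZE  101

    unsigned int hash(unsigned long key)
    {
        return (unsigned int)((P * key) % TABLE_SIZE);
    }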

In the above example, the table should have been of size 101 (a prime), not 100: if the table size is 100 and all the keys are even numbers, none of them will ever be mapped to an odd position in the table.
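A small C program illustrates the point; the even keys and the multiplier 31 are arbitrary illustrative choices:

    /* With a table size of 100 and even keys, (31 * key) % 100 is always even,
       so the odd positions 1, 3, 5, ... are never used.  With the prime 101,
       the same keys spread over both odd and even positions. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long keys[] = { 2, 4, 6, 8, 10, 12 };   /* even keys (illustrative) */
        for (int i = 0; i < 6; i++) {
            unsigned long k = keys[i];
            printf("key %lu: mod 100 -> %lu, mod 101 -> %lu\n",
                   k, (31 * k) % 100, (31 * k) % 101);
        }
        return 0;
    }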

The second strategy is simply to make the table larger, either by allowing several values to be stored in an array position (the `bucket' method), or by having more positions available. Doubling the size of the table will halve the expected number of collisions.
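A minimal sketch of the `bucket' method in C, assuming each array position holds a linked list of colliding entries (the struct and function names here are illustrative, not taken from the text):

    #include <stdlib.h>

    #define P           31
    #define TABLE_SIZE  101

    struct entry {
        unsigned long key;
        int           value;
        struct entry *next;           /* next entry in the same bucket */
    };

    /* Each position is a bucket: the head of a (possibly empty) list. */
    static struct entry *table[TABLE_SIZE];

    static unsigned int hash(unsigned long key)
    {
        return (unsigned int)((P * key) % TABLE_SIZE);
    }

    /* Insert at the front of the bucket; keys that collide simply share a list. */
    void insert(unsigned long key, int value)
    {
        unsigned int h = hash(key);
        struct entry *e = malloc(sizeof *e);
        if (e == NULL)
            return;                   /* out of memory; a real table would report this */
        e->key   = key;
        e->value = value;
        e->next  = table[h];
        table[h] = e;
    }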

The latter strategy gives rise to an important property of hash tables that we have not seen in any other data structure. With a hash table, the efficiency of the operations is under our control to a significant extent. We directly control the size of the table, and thereby control the efficiency of our operations. We can easily make the expected number of collisions as small as we like: all we have to do is increase the table's size.

The load factor is defined to be the ratio:

        load factor = (number of values stored in the table) / TABLE_SIZE

When the load factor is small - i.e. when the table is relatively empty - the chance of a collision is small and the operations are `almost' constant time. When the load factor is high, the operations degrade to at least log-time, and can even degrade to linear time (if collisions are resolved in an unsophisticated way). The load factor is directly under our control (because we can choose the size of the table), and this gives us control over the efficiency of our algorithms.
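As an illustration, a table might check its load factor after each insertion and enlarge itself when a chosen threshold is exceeded; the counters, the 0.75 threshold, and the function name below are illustrative assumptions, not details from the text:

    /* Hypothetical counters; in a real table these would be maintained by
       the insert and delete operations. */
    static int table_size  = 101;     /* number of positions in the table */
    static int num_entries = 0;       /* number of values currently stored */

    /* Load factor = number of values stored / table size.
       Returns 1 if the table should be enlarged (and every entry rehashed,
       since the hash values change when the table size changes). */
    int table_too_full(void)
    {
        double load_factor = (double)num_entries / (double)table_size;
        return load_factor > 0.75;    /* 0.75 is an illustrative threshold */
    }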