Searching for Optimal Solutions

Flash back to 1980. A small boy, no more than 12 years old, stands nervously, awaiting the start of the competition. A small multi-coloured cube is in his hands. The fingers twitch with anticipation, kneading the cube as if it were bread dough. The judges give their signal and… three… two… one… Go! All traces of nervousness are gone, as the boy shows amazing dexterity. The cube dances in his hands as he nimbly twists and rotates parts of the cube. Bang! The cube is thrust on to the table, stopping the clock. Twenty-seven seconds have elapsed since he started, but that is too long. To be a champion Rubik’s Cube solver, one has to be work faster — 20 seconds is competitive; 27 is not. With a look of disappointment, the boy leaves the stage, and another comes forward to take his place.

Fast forward to the year 2000. The human genome has been sequenced. The blue print of human life is now available in a large database — roughly 3,000,000,000 nucleotides of data. It has taken over a decade to acquire this data, but now the real science can begin. We know the text of the human genome; now we have to figure out what it means. DNA consists of four nucleotides: adenine, cytosine, guanine, and thiamine, better known simply by their first initials A, C, G, and T. Evolution means that species are related, some more closely than others. If the meaning of a piece of, say, monkey DNA can be deciphered, this may benefit the analysis of human DNA. After all, evolutionarily speaking, monkeys and humans are related. Sequence alignment consists of taking a DNA strand (a sequence of A, C, G, and Ts), such as for a monkey, and seeing if we can find similar strands, for example in humans. The DNA between the two species is unlikely to identical, but may be similar. We need to compare two DNA sequences and find out just how similar they are. For example, consider the two sequences:

Sequence 1: A T G G T C A

Sequence 2: A G T T G A

Note that in only one column do the sequences share the same nucleotide. Superficially, it appears that the sequences have little in common. By inserting gaps ("-") into the sequences, we can maximize the number of matching positions:

Sequence 1: A T G G T C - A

Sequence 2: A - G - T T G A

Now the sequences match in four places, and the sequences are much more similar than we thought at first.

__________

The above two scenarios are not as different as you might think. One would like to solve Rubik’s Cube puzzles in the fewest number of moves. The ideal situation is to use the minimal number of moves, an optimal solution. For a pair of DNA sequences, one wants to align them to accentuate the similarity between them. The ideal case is the alignment that shows the maximum possible similarity. Both problems involve finding the best possible answer for a given metric.

The A* Algorithm

A* is one of the foundational algorithms of artificial intelligence. Given certain constraints, it is guaranteed to produce the best (optimal answer). The algorithm can be used for many real-world problems including job-shop scheduling, path finding, planning, DNA sequence alignment, and puzzle solving.

The basic A* idea is simple and elegant. It is based on two quantities, historically named g and h. g is the measure of how much effort has been expended to reach a partial solution, while h is a measure of how much additional effort is needed to read a goal.

Consider driving from Toronto to Edmonton. In Toronto, you might estimate the distance to Edmonton to be 2,000 kilometres. In this case, you have not travelled any distance (g = 0), but have a heuristic estimate of how far you have to go (h = 2,000). The total distance to travel (g+h) appears to be 2,000 kilometres.

You get on the road and drive until you reach Winnipeg. Now you have travelled 1,305 kilometres. However, looking at the map you revise your estimate; it now appears that there are 1,000 kilometres left in the trip. Part of your new estimate is accurate (g) and has no error (after all, you just finished travelling that distance). The other part is heuristic (h). Presumably the error in your estimate for the rest of the trip has decreased, since you are now estimating the distance to complete a shorter trip (Winnipeg to Edmonton). You now have a revised total trip length of g+h = 1,305+1,000 = 2,305 kilometres.

Finally, you arrive in Edmonton. The trip from Toronto was 2,335 kilometres. This is an accurate value; there is no heuristic component (h = 0).

A* uses actual and heuristic values to compute an optimal solution. In an A* search, each node has a g value. This might be the distance travelled, the number of moves made while solving a Rubik’s cube, or the score of the best partial sequence alignment see thus far. Each node also has an h value that is an estimate of the remaining distance to travel, the number of moves left until Rubik’s cube is solved, or the likely increase possible in the remaining part of the sequence being aligned.

A* searches for an optimal solution. At each point in the search, you may have different alternatives (e.g. choosing a move to play in Rubik’s cube). A* scores each choice based on their g and h values. The option having the lowest g+h score is considered first. That path is explored by committing to one of the choices, and considering whether one has moved closer or further away from the gaol. At each step of the search, A* selects the choice with the lowest g+h and explores it. Eventually, it reaches the goal state and the search stops.

If you want an optimal solution, then A* places a constraint on the h-values: they must never over-estimate the cost of reaching the goal node (or, in search parlance, the heuristic must be admissible). With this condition, optimality is guaranteed. Without it, A* still works, but a non-optimal solution may result.

The pseudo-code for A* follows below. <<SHOULD BE A LINK>> Note that this is a simplified version that abstracts away all the management of the open list. Note also that one special case is not considered in the code; the first solution found may not be the optimal one. Once a solution is found, the program has to check if there are any nodes with an f-value less than the best solution. If so, they might be better solutions, and they have to be examined.

A*( StartState, GoalState )

// Initial estimate of solution cost

g = 0

h = HeuristicEstimate( StartState, GoalState )

StartState.g = g

StartState.h = h

StartState.f = g + h

// Initialize the OpenList

Add StartState to OpenList

Done = false

while( not Done )

{

// Find the OpenList entry with the lowest f-score

ConsiderState = BestfScore( OpenList )

Remove ConsiderState from OpenList

// Consider each successor

for each successor S of ConsiderState

{

// g-value is the cost of getting from the

// StartState to ConsiderState plus the cost of

// going from ConsiderState to S

S.g = ConsiderState.g + Cost( ConsiderState, S )

S.h = Heuristic( S, GoalState )

S.f = g + h

// Check for a solution

if( S.h == 0 ) {

done = true

}

else {

Add S to OpenList

}

A* is simple and elegant. But it has two disadvantages:

Performance is strongly tied to the quality of the heuristic estimate. The better it is, the more efficient the search.

A* can use a lot of memory. The algorithm maintains a list of nodes that are under consideration (the open list). The "best" candidate for a solution, the node with the lowest score, is considered and all of its successor states get added to the open list. Thus, at each stage of the algorithm, one node is removed from the open list (the best one), and possibly many nodes are added (all of its successors). The program may not have to run very long before all of memory is exhausted.

Let’s look at these two issues in turn.

IDA*: A Memory-Efficient Variation of A*

Richard Korf introduced Iterative Deepening A* (IDA*) in 1985. The idea is to trade space for time: use less space (a lot less space!) but have the program run a bit slower. It turns out that IDA* is asymptotically equivalent to A*.

IDA* performs a series of assertions, trying to prove whether they are true or not. Assume that hstart is the heuristic estimate of the StartState. Also assume (for simplicity) that every move has a cost of 1 (as is true in Rubik’s Cube). IDA* starts out by asserting that the problem can be solved in hstart moves. It conducts a search and either one of two results comes back: it found a solution, in which case hstart is the optimal answer, or no solution was found. Any node that satisfies the condition g+h > hstart can have the search discontinued; it fails to satisfy the assertion. If the search fails, IDA* tries again. This time it asserts that the solution takes hstart+1 moves, and a search is begun. Again, either a solution is found (problem solved) or the search fails and another iteration is started (asserting the answer is hstart+2). This continues until the problem is solved or a limit on computing resources is reached.

To save space, each IDA* iteration does not use A*. Instead it conducts a depth-first search, considering each move as deeply as possible until either a solution is found or it is proven that a solution does not exist for the conjecture.

This applet illustrates IDA* with the sliding-tile puzzles.