Once a policy, $\pi$, has been improved using $V^\pi$ to yield a better policy, $\pi'$, we can then compute $V^{\pi'}$ and improve it again to yield an even better $\pi''$. We can thus obtain a sequence of monotonically improving policies and value functions:

$$\pi_0 \xrightarrow{\ E\ } V^{\pi_0} \xrightarrow{\ I\ } \pi_1 \xrightarrow{\ E\ } V^{\pi_1} \xrightarrow{\ I\ } \pi_2 \xrightarrow{\ E\ } \cdots \xrightarrow{\ I\ } \pi^* \xrightarrow{\ E\ } V^*,$$

where $\xrightarrow{\ E\ }$ denotes a policy *evaluation* and
$\xrightarrow{\ I\ }$ denotes a policy *improvement*. Each policy is guaranteed to be a
strict improvement over the previous one (unless it is already optimal). Because
a finite MDP has only a finite number of policies, this process must converge to an
optimal policy and optimal value function in a finite number of iterations.

This way of finding an optimal policy is called *policy iteration*.
A complete algorithm is given in Figure 4.3.
Note that each policy evaluation, itself an iterative computation, is started
with the value function for the previous policy. This typically results in a great
increase in the speed of convergence of policy evaluation (presumably because the
value function changes little from one policy to the next).

1. Initialization
   $V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$

2. Policy Evaluation
   Repeat
       $\Delta \leftarrow 0$
       For each $s \in \mathcal{S}$:
           $v \leftarrow V(s)$
           $V(s) \leftarrow \sum_{s'} \mathcal{P}^{\pi(s)}_{ss'} \bigl[ \mathcal{R}^{\pi(s)}_{ss'} + \gamma V(s') \bigr]$
           $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
   until $\Delta < \theta$ (a small positive number)

3. Policy Improvement
   *policy-stable* $\leftarrow$ *true*
   For each $s \in \mathcal{S}$:
       $b \leftarrow \pi(s)$
       $\pi(s) \leftarrow \arg\max_a \sum_{s'} \mathcal{P}^a_{ss'} \bigl[ \mathcal{R}^a_{ss'} + \gamma V(s') \bigr]$
       If $b \neq \pi(s)$, then *policy-stable* $\leftarrow$ *false*
   If *policy-stable*, then stop; else go to 2

**Figure 4.3:** Policy iteration (using iterative policy evaluation) for $V^*$.
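The boxed algorithm translates almost line for line into code. The following is an illustrative sketch, not the book's own program: it assumes the dynamics are supplied as arrays `P[a][s][s']` of transition probabilities and `R[a][s][s']` of expected rewards, conventions of our choosing. As noted above, `V` is carried over from one evaluation to the next, giving each evaluation a warm start.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Policy iteration for a finite MDP (illustrative sketch).

    P[a][s][s'] -- probability of moving to s' from s under action a.
    R[a][s][s'] -- expected reward for that transition.
    """
    n_actions, n_states = len(P), len(P[0])
    V = np.zeros(n_states)            # 1. Initialization
    pi = np.zeros(n_states, dtype=int)
    while True:
        # 2. Policy evaluation: sweep until V stops changing by more than theta.
        # V is reused across evaluations, so each one starts warm.
        while True:
            delta = 0.0
            for s in range(n_states):
                v = V[s]
                a = pi[s]
                V[s] = np.sum(P[a][s] * (R[a][s] + gamma * V))  # in-place update
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # 3. Policy improvement: make pi greedy with respect to V.
        policy_stable = True
        for s in range(n_states):
            old_action = pi[s]
            q = [np.sum(P[a][s] * (R[a][s] + gamma * V))
                 for a in range(n_actions)]
            pi[s] = int(np.argmax(q))
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return pi, V
```

The in-place sweeps correspond to the iterative policy evaluation of the previous section; the outer loop terminates because a finite MDP has only finitely many deterministic policies.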

Policy iteration often converges in surprisingly few iterations. This is illustrated by the example in the preceding section. The bottom-left diagram in Figure 4.2 shows the value function for the equiprobable random policy, and the bottom-right diagram shows a greedy policy for this value function. The policy improvement theorem assures us that these policies are better than the original random policy. In this case, however, these policies are not just better, but optimal, proceeding to the terminal states in the minimum number of steps. In this example, policy iteration would find the optimal policy after just one iteration.

**Example: Jack's Car Rental**
Jack manages two locations for a nationwide car rental company. Each day,
some number of customers arrive at each location to rent cars. If Jack has a car
available, he rents it out and is credited $10 by the national
company. If he is out of cars at that location, then the business is lost. Cars
become available for renting the day after they are returned.
To help ensure that cars are available where they are needed, Jack can move
them between the two locations overnight, at a cost of $2 per car moved. We
assume that the numbers of cars requested and returned at each location are Poisson
random variables, meaning that the probability that the number is $n$ is
$\frac{\lambda^n}{n!} e^{-\lambda}$, where $\lambda$ is the expected number. Suppose $\lambda$
is 3 and 4 for rental requests at the first and second locations and 3 and 2 for
dropoffs. To simplify the problem slightly, we assume that there can be no more
than 20 cars at each location (any additional cars are returned to the nationwide
company, and thus disappear from the problem) and a maximum of 5 cars can be
moved from one location to the other in one night. We take the discount rate to
be $\gamma = 0.9$ and formulate this as a continuing finite MDP, where the time steps are
days, the state is the number of cars at each location at the end of the day, and
the actions are the net number of cars moved between the two locations overnight.
Figure 4.4 shows the sequence of policies found by policy iteration
starting from the policy that never moves any cars.
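As one concrete illustration of the dynamics just described, the Poisson probabilities can be precomputed, with the tail mass beyond the cap lumped into the last entry (since, e.g., requests beyond the cars on hand all have the same effect). This is a sketch under our own naming and truncation conventions, not part of the problem statement:

```python
from math import exp, factorial

def poisson_pmf(n, lam):
    """P(N = n) for a Poisson random variable with mean lam."""
    return lam ** n / factorial(n) * exp(-lam)

def truncated_poisson(lam, max_n):
    """Probabilities for n = 0..max_n, with all probability mass
    beyond max_n lumped into the final entry, so the list sums to 1."""
    p = [poisson_pmf(n, lam) for n in range(max_n)]
    p.append(1.0 - sum(p))  # tail mass for n >= max_n
    return p
```

With the 20-car cap, a table like `truncated_poisson(3, 20)` would be computed once per request/return distribution and reused in every backup.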

**Figure 4.4:** The sequence of policies found by policy iteration on Jack's car
rental problem, and the final state-value function. The first
five diagrams show, for each number of cars at each location at the end of the
day, the number of cars to be moved from the first location to the second (negative
numbers indicate transfers from the second location to the first). Each successive
policy is a strict improvement over the previous policy, and the last policy is
optimal.

**Exercise (programming)**

Write a program for policy iteration and re-solve Jack's car rental problem with the following changes. One of Jack's employees at the first location rides a bus home each night and lives very near the second location. She is happy to shuttle one car to the second location for free. Each additional car still costs $2, as do all cars moved in the other direction. In addition, Jack has limited parking space at each location. If more than 10 cars are kept overnight at a location (after any moving of cars), then an additional cost of $4 must be incurred to use a second parking lot (independent of how many cars are kept there). These sorts of nonlinearities and arbitrary dynamics often occur in real problems and cannot easily be handled by optimization methods other than dynamic programming. To check your program, first replicate the results given for the original problem. If your computer is too slow for the full problem, cut all the numbers of cars in half.
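The nonlinear cost structure of the modified problem is easy to express in code even though it breaks linear-programming formulations. The helpers below are one possible reading of the exercise (names and sign conventions are ours; positive moves go from location 1 to location 2), not a solution to it:

```python
def move_cost(cars_moved_1_to_2):
    """Overnight transfer cost under the modified problem:
    the employee shuttles one car from location 1 to 2 for free;
    every other car, in either direction, costs $2."""
    if cars_moved_1_to_2 > 0:
        return 2 * (cars_moved_1_to_2 - 1)  # first car 1 -> 2 is free
    return 2 * abs(cars_moved_1_to_2)       # all cars 2 -> 1 cost $2

def parking_cost(cars_at_location):
    """$4 for a second lot if more than 10 cars are kept overnight."""
    return 4 if cars_at_location > 10 else 0
```

Each state-action backup would simply subtract these costs from the expected rental income; no other part of the policy iteration machinery changes.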

**Exercise.**

How would policy iteration be defined for action values? Give a complete algorithm for computing $Q^*$, analogous to Figure 4.3 for computing $V^*$. Please pay special attention to this exercise, because the ideas involved will be used throughout the rest of the book.

**Exercise.**

Suppose you are restricted to considering only policies that are
*$\varepsilon$-soft*, meaning that the probability of selecting each action in
each state, $s$, is at least $\varepsilon/|\mathcal{A}(s)|$. Describe qualitatively
the changes that would be required in each of the steps 3, 2, and 1, in that
order, of the policy iteration algorithm for $V^*$
(Figure 4.3).
