Our reason for computing the value function for a policy is to help
find better policies. Suppose we have determined the value function
$v_\pi$ for an arbitrary deterministic policy $\pi$. For some state $s$ we
would like to know whether or not we should change the policy to
deterministically choose an action $a \neq \pi(s)$.
We know how good it is to follow the current
policy from $s$--that is, $v_\pi(s)$--but would it be better or worse to
change to the new policy? One way to answer this question is to consider
selecting $a$ in $s$ and thereafter following the existing policy, $\pi$. The
value of this way of behaving is

$$
q_\pi(s,a) \doteq \mathbb{E}\!\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \,\middle|\, S_t = s, A_t = a\right]
= \sum_{s',r} p(s',r \mid s,a)\bigl[r + \gamma v_\pi(s')\bigr]. \tag{4.6}
$$
The key criterion is whether this is greater than or less than $v_\pi(s)$. If it is greater--that is, if it is better to select $a$ once in $s$ and thereafter follow $\pi$ than it would be to follow $\pi$ all the time--then one would expect it to be better still to select $a$ every time $s$ is encountered, and that the new policy would in fact be a better one overall.
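As a concrete illustration, the one-step lookahead value $q_\pi(s,a)$ can be computed directly from $v_\pi$ and the dynamics. The sketch below assumes a tabular MDP stored as nested dicts, with `P[s][a]` a list of `(probability, next_state, reward)` triples; this representation and the function name are our own choices for illustration:

```python
def q_from_v(P, v, s, a, gamma=1.0):
    """Expected return of taking action a in state s and thereafter
    following the policy whose state-value function is v (Eq. 4.6)."""
    return sum(prob * (reward + gamma * v[s2])
               for prob, s2, reward in P[s][a])

# Tiny made-up example: from state 0, action 1 yields reward 1 and lands
# in state 1, whose value under the current policy is 4.
P = {0: {1: [(1.0, 1, 1.0)]}}
v = {1: 4.0}
print(q_from_v(P, v, 0, 1, gamma=0.5))  # 1 + 0.5 * 4 = 3.0
```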

That this is true is a special case of a general result called the *policy improvement theorem*. Let $\pi$ and $\pi'$ be any pair of
deterministic policies such that, for all $s \in \mathcal{S}$,

$$
q_\pi(s, \pi'(s)) \ge v_\pi(s). \tag{4.7}
$$

Then the policy $\pi'$ must be as good as, or better than, $\pi$. That is, it must obtain greater or equal expected return from all states $s \in \mathcal{S}$:

$$
v_{\pi'}(s) \ge v_\pi(s). \tag{4.8}
$$
Moreover, if there is strict inequality of (4.7) at any state, then there must be strict inequality of (4.8) at at least one state. This result applies in particular to the two policies that we considered in the previous paragraph, an original deterministic policy, $\pi$, and a changed policy, $\pi'$, that is identical to $\pi$ except that $\pi'(s) = a \neq \pi(s)$. Obviously, (4.7) holds at all states other than $s$. Thus, if $q_\pi(s,a) > v_\pi(s)$, then the changed policy is indeed better than $\pi$.

The idea behind the proof of the policy improvement theorem is easy to
understand. Starting from (4.7), we keep expanding the
$q_\pi$ side and reapplying (4.7) until we get $v_{\pi'}(s)$:

$$
\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
&= \mathbb{E}\!\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \,\middle|\, S_t = s, A_t = \pi'(s)\right] \\
&= \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \,\middle|\, S_t = s\right] \\
&\le \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \,\middle|\, S_t = s\right] \\
&= \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \,\middle|\, S_t = s\right] \\
&\le \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 v_\pi(S_{t+3}) \,\middle|\, S_t = s\right] \\
&\;\;\vdots \\
&\le \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \,\middle|\, S_t = s\right] \\
&= v_{\pi'}(s).
\end{aligned}
$$
So far we have seen how, given a policy and its value function, we can easily
evaluate a change in the policy at a single state to a particular action.
It is a natural extension to consider changes at
*all* states and to *all* possible actions, selecting at each state
the action that appears best according to $q_\pi(s,a)$. In other words, to
consider the new *greedy* policy, $\pi'$, given by

$$
\begin{aligned}
\pi'(s) &\doteq \arg\max_a q_\pi(s,a) \\
&= \arg\max_a \mathbb{E}\!\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \,\middle|\, S_t = s, A_t = a\right] \\
&= \arg\max_a \sum_{s',r} p(s',r \mid s,a)\bigl[r + \gamma v_\pi(s')\bigr],
\end{aligned} \tag{4.9}
$$

where $\arg\max_a$ denotes the value of $a$ at which the expression that follows is maximized (with ties broken arbitrarily). The greedy policy takes the action that looks best in the short term--after one step of lookahead--according to $v_\pi$. By construction, the greedy policy meets the conditions of the policy improvement theorem (4.7), so we know that it is as good as, or better than, the original policy. The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called *policy improvement*.

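The greedy step (4.9) is a single sweep over states. A sketch under an assumed tabular representation, `P[s][a]` = list of `(probability, next_state, reward)` triples:

```python
def greedy(P, v, gamma=1.0):
    """pi'(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')].
    Ties are broken arbitrarily (here: max keeps the first maximizer)."""
    return {s: max(P[s], key=lambda a: sum(p * (r + gamma * v[s2])
                                           for p, s2, r in P[s][a]))
            for s in P}

# Made-up two-state MDP; v is the value function of a poor policy.
P = {0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
     1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]}}
print(greedy(P, {0: 0.0, 1: 0.0}, gamma=0.5))  # {0: 1, 1: 1}
```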
Suppose the new greedy policy, $\pi'$, is as good as, but not better than, the old
policy $\pi$. Then $v_{\pi'} = v_\pi$, and from (4.9) it follows that for all
$s \in \mathcal{S}$:

$$
\begin{aligned}
v_{\pi'}(s) &= \max_a \mathbb{E}\!\left[R_{t+1} + \gamma v_{\pi'}(S_{t+1}) \,\middle|\, S_t = s, A_t = a\right] \\
&= \max_a \sum_{s',r} p(s',r \mid s,a)\bigl[r + \gamma v_{\pi'}(s')\bigr].
\end{aligned}
$$
But this is the same as the Bellman optimality equation (4.1), and therefore $v_{\pi'}$ must be $v_*$, and both $\pi$ and $\pi'$ must be optimal policies. Policy improvement thus must give us a strictly better policy except when the original policy is already optimal.
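This suggests alternating evaluation and greedy improvement until the policy stops changing, at which point, by the argument above, the resulting value function satisfies the Bellman optimality equation. A minimal sketch; the two-state MDP and the `P[s][a]` = `(probability, next_state, reward)`-triple representation are assumptions for illustration:

```python
def evaluate(P, pi, gamma, tol=1e-12):
    """Iterative evaluation of a deterministic policy pi[s] -> a."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new = sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][pi[s]])
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v

def greedy(P, v, gamma):
    return {s: max(P[s], key=lambda a: sum(p * (r + gamma * v[s2])
                                           for p, s2, r in P[s][a]))
            for s in P}

P = {0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
     1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]}}
gamma, pi = 0.5, {0: 0, 1: 0}
while True:
    v = evaluate(P, pi, gamma)
    new_pi = greedy(P, v, gamma)
    if new_pi == pi:        # greedy change leaves pi unchanged: pi is optimal
        break
    pi = new_pi

# v now satisfies v(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
for s in P:
    q_max = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in P[s])
    assert abs(v[s] - q_max) < 1e-9
```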

So far in this section we have considered the special case of deterministic
policies. In the general case, a stochastic policy $\pi$ specifies probabilities,
$\pi(a \mid s)$, for taking each action, $a$, in each state, $s$. We will not go
through the details, but in fact all the ideas of this section extend easily to
stochastic policies. In particular, the policy improvement theorem carries through
as stated for the stochastic case, under the natural definition:

$$
q_\pi(s, \pi'(s)) = \sum_a \pi'(a \mid s)\, q_\pi(s, a).
$$
In addition, if there are ties in policy improvement steps such as (4.9)--that is, if there are several actions at which the maximum is achieved--then in the stochastic case we need not select a single action from among them. Instead, each maximizing action can be given a portion of the probability of being selected in the new greedy policy. Any apportioning scheme is allowed as long as all submaximal actions are given zero probability.
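One natural apportioning scheme splits the probability uniformly over the maximizing actions. A sketch, again with an assumed `P[s][a]` = `(probability, next_state, reward)`-triple representation and a small tolerance for comparing floating-point action values:

```python
def stochastic_greedy(P, v, gamma=1.0, tol=1e-9):
    """Greedy improvement that spreads probability evenly over all actions
    achieving the maximum in (4.9); submaximal actions get probability 0."""
    pi = {}
    for s in P:
        q = {a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
             for a in P[s]}
        best = max(q.values())
        winners = [a for a in q if q[a] >= best - tol]
        pi[s] = {a: (1.0 / len(winners) if a in winners else 0.0)
                 for a in P[s]}
    return pi

# One state, three actions; actions 0 and 1 tie for the maximum.
P = {0: {0: [(1.0, 0, 1.0)], 1: [(1.0, 0, 1.0)], 2: [(1.0, 0, 0.0)]}}
print(stochastic_greedy(P, {0: 0.0}))  # {0: {0: 0.5, 1: 0.5, 2: 0.0}}
```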

The last row of Figure 4.2 shows an example of policy improvement for stochastic policies. Here the original policy, $\pi$, is the equiprobable random policy, and the new policy, $\pi'$, is greedy with respect to $v_\pi$. The value function $v_\pi$ is shown in the bottom-left diagram and the set of possible $\pi'$ is shown in the bottom-right diagram. The states with multiple arrows in the $\pi'$ diagram are those in which several actions achieve the maximum in (4.9); any apportionment of probability among these actions is permitted. The value function of any such policy, $v_{\pi'}(s)$, can be seen by inspection to be either $-1$, $-2$, or $-3$ at all states, $s$, whereas $v_\pi(s)$ is at most $-14$. Thus, $v_{\pi'}(s) \ge v_\pi(s)$, for all $s$, illustrating policy improvement. Although in this case the new policy $\pi'$ happens to be optimal, in general only an improvement is guaranteed.
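The bottom row of Figure 4.2 can be reproduced numerically. The sketch below hand-codes the 4x4 gridworld of Figure 4.1 (reward $-1$ per step, undiscounted, terminal corners), evaluates the equiprobable random policy, then improves it greedily; the representation and helper names (`step`, `evaluate`) are our own:

```python
import numpy as np

# 4x4 gridworld: states 0..15, corners 0 and 15 terminal (absorbing),
# reward -1 per step, off-grid moves leave the state unchanged, gamma = 1.
def step(s, a):
    if s in (0, 15):
        return s, 0.0
    r, c = divmod(s, 4)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]   # up, down, left, right
    r2, c2 = min(max(r + dr, 0), 3), min(max(c + dc, 0), 3)
    return 4 * r2 + c2, -1.0

def evaluate(pi, tol=1e-8):
    """Iterative evaluation of a stochastic policy pi[s] = action probs."""
    v = np.zeros(16)
    while True:
        v_new = np.array([sum(pi[s][a] * (reward + v[s2])
                              for a in range(4)
                              for s2, reward in [step(s, a)])
                          for s in range(16)])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

random_pi = [[0.25] * 4 for _ in range(16)]
v_rand = evaluate(random_pi)        # the -14, -18, -20, -22 pattern of Fig. 4.1

# Greedy improvement w.r.t. v_rand, ties broken by the first maximum.
greedy_pi = []
for s in range(16):
    q = [reward + v_rand[s2] for a in range(4) for s2, reward in [step(s, a)]]
    best = int(np.argmax(q))
    greedy_pi.append([float(a == best) for a in range(4)])
v_greedy = evaluate(greedy_pi)

assert np.all(v_greedy >= v_rand - 1e-6)   # policy improvement theorem
```

As the text notes, in this particular gridworld the improved policy is already optimal: its value at each state is minus the number of steps to the nearest terminal corner.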