We now extend value-prediction methods using function approximation to control methods, following the pattern of GPI. First we extend the the state-value prediction methods to action-value prediction methods, then we combine them with policy-improvement and action-selection techniques. As usual, the problem of ensuring exploration is solved by pursuing either an on-policy or an off-policy approach.

The extension to action-value prediction is straightforward. In this case it is the action-value function, , that is represented as a parameterized functional form with parameter vector . Whereas before we considered training examples of the form , now we consider examples of the form . The example output, , can be any approximation of , including the usual backed-up values such as the full Monte-Carlo return, , or the 1-step Sarsa-style return, . The general gradient-descent update for action-value prediction is

For example, the backward view of the action-value method analogous to TD() is

where

and

with . We call this method * gradient-descent
Sarsa()
*, particularly when it is elaborated to form a full control method. For
a constant policy, this method converges in the same way that TD()
does,
with the same kind of error bound (8.9).

To form control methods, we need to couple such action-value prediction methods
with techniques for policy improvement and action selection. Suitable
techniques applicable to continuous actions, or to actions from large discrete
sets, are a topic of ongoing research with as yet no clear resolution. On the
other hand, if the action set is discrete and not too large, then we can
use the techniques already developed in previous chapters. That is, for each
possible action, **a**, available in the current state, , we can compute
and then find the greedy action .
Policy improvement is done by changing the estimation policy to the greedy
policy (in off-policy methods) or to a soft approximation of the greedy
policy such as the
-greedy policy (in on-policy methods). Actions are selected according to this
same policy in on-policy methods, or by an arbitrary policy in off-policy methods.

Initialize arbitrarily and Repeat (for each episode): For all : With probability : Repeat (for each step of episode): For all : (optional block for replacing traces) For all : For all : (accumulating traces) or (replacing traces) Take actiona, observe reward,r, and next state, For all : With probability : until is terminal

Initialize arbitrarily and Repeat (for each episode): For all : Repeat (for each step of episode): With probability : else For all : Take actiona, observe reward,r, and next state, For all : until is terminal

Figures 8.8 and 8.9 show examples of on-policy (Sarsa()
)
and off-policy (Watkins's Q()
) control methods using function approximation.
Both methods use linear, gradient-descent function approximation with binary
features, such as in tile coding and Kanerva coding. Both methods use an
-greedy policy for action selection, and the Sarsa method uses it for GPI as
well. Both compute the sets of present features, , corresponding to the
current state and all possible actions, **a**. If the value function for each
action is a separate linear function of the same features (a common case), then
the indices of all the may be the same except for an offset, simplifying
the computation significantly.

All the methods we have discussed above have used * accumulating*
eligibility traces. Although replacing traces (Section 7.8) are known to have
advantages in tabular methods, they do not directly extend to the use of function
approximation. Recall that the idea of replacing traces is to reset a state's
trace to **1** each time it is visited instead of incrementing it by **1**.
But with function approximation there is no single trace corresponding to
a state, just a trace for each component of , each of which corresponds
to many states. One approach that seems to work well for linear,
gradient-descent function approximation with binary features is to treat the
features as if they were states for the purposes of replacing traces. That is,
each time a state is encountered that has feature **i**, the trace for
feature **i** is set to **1** rather than being incremented by **1** as it would be with
accumulating traces.

When working with state-action traces, it may also be useful to clear (set
to zero) the traces of all non-selected actions in the states encountered (see
Section 7.8). This idea can also be extended to the case of linear function
approximation with binary features. For each state encountered, we first clear
the traces of all features for the state and the actions not selected, then
we set to
**1** the traces of the features for the state and the action that was selected.
As we noted for the tabular case, this may or may not be the best way to
proceed when using replacing traces. A procedural specification of both
kinds of traces, including the optional clearing for non-selected actions,
is given for the Sarsa algorithm in Figure 8.8.

** Example .**
* The Mountain-Car Task*.
Consider the task of driving an underpowered car up a steep mountain road, as
suggested by the diagram in the upper left of Figure 8.10. The
difficulty is that gravity is stronger than the car's engine, and even at full
throttle it cannot accelerate up the steep slope. The only solution is to
first move away from the goal and up the opposite slope on the left. Then,
by applying full throttle the car can build up enough inertia to carry
it up the steep slope even though it is slowing down the whole way. This is a
simple example of a continuous control task where things have to get worse in a
sense (farther from the goal) before they can get
better. Many control methodologies have great difficulties with tasks of
this kind unless explicitly aided by a human designer.

**Figure 8.10:** The mountain-car task (upper left panel) and the cost-to-go function
() learned during one run.

The reward in this problem is **-1** on all time steps until the car has moved
past its goal position at the top of the mountain, which ends the episode.
There are three possible actions: full throttle forward (**+1**), full
throttle reverse (**-1**), and zero throttle (**0**). The car moves according to
a simplified physics. It's position,
, and velocity, , are given by

where the **bound** operation enforces and . When reached the left bound, then
was reset to zero. When it reached right bound, then the goal was reached
and the episode was terminated. Each episode started from a random position and
velocity uniformly chosen from these ranges. To convert the two continuous state
variables to binary features, we used grid-tilings exactly as in
Figure 8.5. We used ten tilings, each offset by a random
fraction of a tile width.

The Sarsa algorithm in Figure 8.8 (using replace
traces and the optional clearing) readily solved this task, learning
a near optimal policy within 100 episodes. Figure 8.10
shows the negative of the value function (the * cost-to-go* function)
learned on one run, using the parameters
, , and (). The initial action values were
all zero, which was optimistic (all true values are negative in this task),
causing extensive exploration to occur even though the exploration parameter,
,
was **0**. This can be seen in the middle-top panel of the figure, labeled ``Step
428." At this time not even one episode has been completed, but the car has
oscillated back and forth in the valley, following circular trajectories in state
space. All the states visited frequently are valued worse than
unexplored states, because the actual rewards have been worse than what was
(unrealistically) expected. This continually drives the agent away from wherever
it has been to explore new states, until a solution is found.
Figure 8.11 shows the results of a detailed study of the effect
of the parameters
and
, and of the kind of traces, on the rate of learning
on this task.

**Figure 8.11:** The effect of
,
, and the kind of traces on early performance on the
Mountain-Car task. This study used five tilings.

** Exercise .* **

Describe how the actor-critic control method could be combined with gradient-descent function approximation.

Sat May 31 15:08:20 EDT 1997