R-learning is an off-policy control method for the advanced version of the
reinforcement learning problem in which one neither discounts nor divides
experience into distinct episodes with finite returns. In this case one seeks
to obtain the maximum reward per time step. The
value functions for a policy, $\pi$, are defined relative to the
average expected reward per time step under the policy, $\rho^\pi$:

$$\rho^\pi = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} E_\pi[r_t],$$

assuming the process is ergodic (nonzero probability of reaching any state from any other under any policy) and thus that $\rho^\pi$ does not depend on the starting state. From any state, in the long run the average reward is the same, but there is a transient. From some states better-than-average rewards are received for a while, and from others worse-than-average rewards are received. It is this transient that defines the value of a state:

$$\tilde{V}^\pi(s) = \sum_{k=1}^{\infty} E_\pi\big[r_{t+k} - \rho^\pi \,\big|\, s_t = s\big],$$

and the value of a state-action pair is similarly the transient difference in reward when starting in that state and taking that action:

$$\tilde{Q}^\pi(s,a) = \sum_{k=1}^{\infty} E_\pi\big[r_{t+k} - \rho^\pi \,\big|\, s_t = s,\, a_t = a\big].$$

We call these relative values because they are relative to the average reward under the policy.

There are subtle distinctions that need to be drawn between different kinds of optimality in the undiscounted continuing case. Nevertheless, for most practical purposes it may be adequate simply to order policies according to their average reward per time step, in other words, according to their $\rho^\pi$. For now let us consider all policies that attain the maximal value of $\rho^\pi$ to be optimal.
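The definitions above can be made concrete with a small numerical sketch. The chain, rewards, and constants below are illustrative assumptions, not from the text: a two-state ergodic Markov chain under some fixed policy, for which we compute $\rho^\pi$ from the stationary distribution and approximate the relative values $\tilde{V}^\pi$ by truncating their defining sum.

```python
import numpy as np

# Illustrative two-state chain under a fixed policy (numbers are assumptions).
# P[s, s'] is the transition probability; r[s] is the expected reward in state s.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
r = np.array([1.0, 0.0])

# Average reward rho^pi: the stationary distribution d (the left eigenvector
# of P for eigenvalue 1, normalized to sum to 1) dotted with the rewards.
w, v = np.linalg.eig(P.T)
d = np.real(v[:, np.argmin(np.abs(w - 1))])
d /= d.sum()
rho = d @ r  # 5/6 for these numbers

# Relative values: truncate V(s) = sum_{k>=1} ( E[r_{t+k} | s_t = s] - rho ).
# The terms decay geometrically for an ergodic chain, so a finite sum suffices.
V = np.zeros(2)
Pk = np.eye(2)
for _ in range(500):
    Pk = Pk @ P
    V += Pk @ r - rho

print(rho, V)
```

State 0 tends to keep earning the above-average reward 1, so its transient is positive; state 1's is negative. The computed $V$ satisfies the Bellman relation $V = P(r + V) - \rho$ up to truncation error.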

Other than its use of relative values, R-learning is a standard TD control method based on off-policy GPI, much like Q-learning. It maintains two policies, a behavior policy and an estimation policy, plus an action-value function and an estimated average reward. The behavior policy is used to generate experience; it might be the $\varepsilon$-greedy policy with respect to the action-value function. The estimation policy is the one involved in GPI. It is typically the greedy policy with respect to the action-value function. If $\pi$ is the estimation policy, then the action-value function, $Q$, is an approximation of $\tilde{Q}^\pi$ and the average reward, $\rho$, is an approximation of $\rho^\pi$. The complete algorithm is given in Figure 6.16. There has been little experience with this method and it should be considered experimental.
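A minimal sketch of the R-learning updates described above (following Schwartz's algorithm: $\delta = r - \rho + \max_{a'} Q(s',a') - Q(s,a)$, with $\rho$ updated only on greedy actions) on a toy MDP. The MDP, step sizes, and seed are assumptions chosen for illustration; in this MDP the optimal behavior is to loop in state 1 for reward 1 per step, so $\rho$ should approach 1.

```python
import random

# Illustrative two-state, two-action MDP (an assumption, not from the text):
# in state 0 both actions give reward 0 (action 1 moves to state 1);
# in state 1 both actions give reward 1 (action 1 loops in state 1,
# action 0 returns to state 0). Optimal average reward is 1.
def step(s, a):
    if s == 0:
        return (1, 0.0) if a == 1 else (0, 0.0)
    return (1, 1.0) if a == 1 else (0, 1.0)

def r_learning(steps=20000, alpha=0.1, beta=0.01, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]  # action-value table, approximates Q-tilde
    rho = 0.0                     # estimated average reward
    s = 0
    for _ in range(steps):
        # behavior policy: epsilon-greedy with respect to Q
        if rng.random() < epsilon:
            a = rng.randrange(2)
        else:
            a = max(range(2), key=lambda x: Q[s][x])
        greedy = Q[s][a] == max(Q[s])  # was a greedy (estimation-policy) action?
        s2, r = step(s, a)
        delta = r - rho + max(Q[s2]) - Q[s][a]
        Q[s][a] += alpha * delta
        if greedy:
            rho += beta * delta  # update rho only on greedy actions
        s = s2
    return Q, rho

Q, rho = r_learning()
print(Q, rho)
```

Note the asymmetry with Q-learning: because there is no discounting, the average reward $\rho$ must be estimated on-line and subtracted from each reward, and it is updated only when the action taken agrees with the (greedy) estimation policy, so that off-policy exploratory actions do not bias it.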