- Emphatic TD(λ); Yu's convergence proof

- Weighted importance sampling
version of LSTD(λ), linear-complexity algorithms

- True online TD(λ)
- The predictive
approach to knowledge representation; PEAK; Horde; nexting

- Fast gradient-based TD algorithms, nonlinear case, GQ(lambda),
control, Maei's thesis

- RL book

- Temporal-difference learning; TD(lambda) details
- The
TD model of Pavlovian conditioning; earlier Sutton-Barto
model; more biological 1982
& 1986;
and instrumental
learning

- Dyna; as an integrated
architecture; with
FA 1996, 2008

- The options paper; UAV example; precursor not superseded;
- Policy gradient methods; Incremental Natural Actor-Critic Algorithms
- PhD thesis, introduced actor-critic
architectures and "temporal credit assignment"

- PSRs; the predictive representations hypothesis; TD networks; with options
- RL for RoboCup soccer keepaway

- RL with continuous state and action
spaces

- Step-size
adaptation by meta-gradient descent; IDBD; improved; earliest pub; in classical conditioning; in human category
learning, in
tracking

- Random representations; representation search; feature discovery; more
- Pole-balancing; tracking nonstationarity
- Exponentiated-gradient RL; fuller TR
- A study in alpha and lambda

- Two problems with backprop

- Chris Watkins's thesis
- Boyan's LSTD(lambda),
1999

- Barto and Bradtke LSTD, 1996
- Williams, 1992
- Lin, 1992

- Ross, 1983, chapter 2
- Minsky, 1960, Steps to AI
- Good, 1965, Speculations concerning the first ultraintelligent machine
- Selfridge, 1958, Pandemonium
- Samuel, 1959
- Dayan, 1992
- Tesauro, 1992, TD-Gammon
- Watkins and Dayan, 1992
- Hamid Maei's PhD thesis, 2011
- Masoud Shahamiri's MSc thesis, 2008
- Janey Yu's proof of convergence of Emphatic TD(λ)
- Adam White's PhD thesis
- David Silver's PhD thesis
- Kavosh Asadi's MSc thesis

For any broken links, please send email to rich@richsutton.com.

van Hasselt, H., Sutton, R. S. (2015) Learning to Predict Independent of Span. ArXiv:1508.04582.

ABSTRACT: We
consider how to learn multi-step predictions e�ffciently. Conventional
algorithms wait until observing actual outcomes before performing the
computations to update their predictions. If predictions are made at a
high rate or span over a large amount of time, substantial computation
can be required to store all relevant observations and to update all
predictions when the outcome is finally observed. We show that the
exact same predictions can be learned in a much more computationally
congenial way, with uniform per-step computation that does not depend
on the span of the predictions. We apply this idea to various settings
of increasing generality, repeatedly adding desired properties and each
time deriving an equivalent span-independent algorithm for the
conventional algorithm that satisfies these desiderata. Interestingly,
along the way several known algorithmic constructs emerge spontaneously
from our derivations, including dutch eligibility traces, temporal
difference errors, and averaging. This allows us to link these
constructs one-to-one to the corresponding desiderata, unambiguously
connecting the ‘how’ to the ‘why’. Each step, we make sure that the
derived algorithm subsumes the previous algorithms, thereby retaining
their properties. Ultimately we arrive at a single general
temporal-difference algorithm that is applicable to the full setting of
reinforcement learning.

van Seijen, H., Mahmood, A. R., Pilarski, P. M., Machado, M. C., Sutton, R. S. (in preparation). True online temporal-difference learning.

ABSTRACT: TD(λ) is
a core algorithm of modern reinforcement learning. Its appeal comes
from its simple interpretation provided by its forward view (that is,
the λ-parameter sets the trade-off between bias and variance of
updates), and that it can be implemented in real-time in an inexpensive
manner. Recently, a new version of TD(λ) was introduced, called
true-online TD(λ). Algorithmically, this version only makes two small
changes to the update rules of TD(λ), and the extra computational cost
is negligible in most cases. However, true-online TD(λ) follows the
ideas underlying the forward view much more closely. In particular,
true-online TD(λ) can perform fully unbiased updates.

We perform an extensive empirical comparison between true-online TD(λ) and TD(λ), and their control versions, true-online Sarsa(λ) and Sarsa(λ). The domains we use include intuitive toy-problems, random MRPs, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment. For each domain considered, the true-online version performs at least as good as the traditional version at its best parameter settings, while often outperforming it. In addition, the true-online versions are less sensitive with respect to their parameters, making it easier to find good parameter settings. These results undeniably show the benefits of true-online TD(λ) and true-online Sarsa(λ), and, more generally, are a validation of the true-online approach, which focusses on obtaining an exact equivalence with a well-defined forward view.

Besides the empirical study, we establish the general principles underlying a true-online method and discuss several other true-online methods.

We perform an extensive empirical comparison between true-online TD(λ) and TD(λ), and their control versions, true-online Sarsa(λ) and Sarsa(λ). The domains we use include intuitive toy-problems, random MRPs, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment. For each domain considered, the true-online version performs at least as good as the traditional version at its best parameter settings, while often outperforming it. In addition, the true-online versions are less sensitive with respect to their parameters, making it easier to find good parameter settings. These results undeniably show the benefits of true-online TD(λ) and true-online Sarsa(λ), and, more generally, are a validation of the true-online approach, which focusses on obtaining an exact equivalence with a well-defined forward view.

Besides the empirical study, we establish the general principles underlying a true-online method and discuss several other true-online methods.

van Seijen, H., Mahmood, A. R., Pilarski, P. M., Sutton, R. S. (2015). An empirical evaluation of true online TD(λ). In Proceedings of the 2015 European Workshop on Reinforcement Learning.

ABSTRACT: The true
online TD(λ) algorithm has recently been proposed (van Seijen and
Sutton, 2014) as a universal replacement for the popular TD(λ)
algorithm, in temporal-difference learning and reinforcement learning.
True online TD(λ) has better theoretical properties than conventional
TD(λ), and the expectation is that it also results in faster learning.
In this paper, we put this hypothesis to the test by empirically
comparing true online TD(λ) with TD(λ) on a variety of domains. Our
domains include a challenging one-state and two-state example, random
Markov reward processes, and a real-world myoelectric prosthetic arm.
We use linear function approximation with tabular, binary, and
non-binary features. We assess the algorithms along three dimensions:
computational cost, learning speed, and ease of use. Our results
confirm the strength of true online TD(λ): 1) for sparse feature
vectors, the computational overhead with respect to TD(λ) is
negligible; for non-sparse features the computation time is at most
twice that of TD(λ), 2) across all domains/representations the learning
speed of true online TD(λ) is often better, but never worse than that
of TD(λ), and 3) true online TD(λ) is easier to use, because it does
not require choosing between trace types, and it is generally more
stable with respect to the step-size. Overall, our results suggest that
true online TD(λ) should be the first choice when looking for an
efficient, general-purpose TD method.

Mahmood, A. R., Yu, H., White, M., Sutton, R. S. (2015). Emphatic temporal-difference learning. In Proceedings of the 2015 European Workshop on Reinforcement Learning.

ABSTRACT: Emphatic
algorithms are temporal-difference learning algorithms that change
their effective state distribution by selectively emphasizing and
de-emphasizing their updates on different time steps. Recent works by
Sutton, Mahmood and White (2015) and Yu (2015) show that by varying the
emphasis in a particular way these algorithms become stable and
convergent under off-policy training with linear function
approximation. This paper serves as a unified summary of the available
results from both works. Additionally, we empirically demonstrate the
benefits of emphatic algorithms, due to the flexible learning using
state-dependent discounting, bootstrapping and a user-specified
allocation of function approximation resources.

Mahmood, A. R., Sutton, R. S. (2015). Off-policy learning based on weighted importance sampling with linear computational complexity. In Proceedings of the 2015 Conference on Uncertainty in Artificial Intelligence.

ABSTRACT:
Importance sampling is an essential component of model-free off-policy
learning algorithms. Weighted importance sampling (WIS) is generally
considered superior to ordinary importance sampling but, when combined
with function approximation, it has hitherto required computational
complexity that is O(n^{2}) or more in the number of features.
In this paper we introduce new off-policy learning algorithms that
obtain the benefits of WIS with O(n) computational complexity. Our
algorithms maintain for each component of the parameter vector a
measure of the extent to which that component has been used in previous
examples. This measure is used to determine component-wise step sizes,
merging the ideas of stochastic gradient descent and sample averages.
We present our main WIS-based algorithm first in an intuitive acausal
form (the forward view) and then derive a causal algorithm using
eligibility traces that is equivalent but more efficient (the backward
view). In three small experiments, our algorithms performed
significantly better than prior O(n) algorithms for off-policy policy
evaluation. We also show that our adaptive step-size technique alone
can improve the performance of on-policy algorithms such as TD(λ) and
true online TD(λ).

Sutton, R. S., Mahmood, A. R., White, M. (2015). An emphatic approach to the problem of off-policy temporal-difference learning. ArXiv:1503.04269. Implementation guide. Code.

ABSTRACT: In this
paper we introduce the idea of improving the performance of parametric
temporal- difference (TD) learning algorithms by selectively
emphasizing or de-emphasizing their updates on different time steps. In
particular, we show that varying the emphasis of linear TD(λ)’s updates
in a particular way causes its expected update to become stable under
off-policy training. The only prior model-free TD methods to achieve
this with per-step computation linear in the number of function
approximation parameters are the gradient-TD family of methods
including TDC, GTD(λ), and GQ(λ). Compared to these methods, our
emphatic TD(λ) is simpler and easier to use; it has only one learned
parameter vector and one step-size parameter. Our treatment includes
general state-dependent discounting and bootstrapping functions, and a
way of specifying varying degrees of interest in accurately valuing
different states.

van Seijen, H., Sutton, R. S. (2015). A deeper look at planning as learning from replay. In: Proceedings of the 32nd International Conference on Machine Learning, Lille, France.

ABSTRACT:
Q-learning, the most popular of reinforcement learning algorithms, has
always included an extension to eligibility traces to enable more rapid
learning and improved asymptotic performance on non-Markov problems.
The λ parameter smoothly shifts on-policy algorithms such as TD(λ) and
Sarsa(λ) from a pure bootstrapping form (λ = 0) to a pure Monte Carlo
form (λ = 1). In off-policy algorithms, including Q(λ), GQ(λ), and
off-policy LSTD(λ), the λ parameter is intended to play the same role,
but does not; on every exploratory action these algorithms bootstrap
regardless of the value of λ, and as a result they fail to approximate
Monte Carlo learning when λ = 1. It may seem that this is inevitable
for any online off-policy algorithm; if updates are made on each step
on which the target policy is followed, then how could just the right
updates be ‘un-made’ upon deviation from the target policy? In this
paper, we introduce a new version of Q(λ) that does exactly that,
without significantly increased algorithmic complexity. En route to our
new Q(λ), we introduce a new derivation technique based on the
forward-view/backward-view analysis familiar from TD(λ) but extended to
apply at every time step rather than only at the end of episodes. We
apply this technique to derive first a new off-policy version of TD(λ),
called PTD(λ), and then our new Q(λ), called PQ(λ).

Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2014). Time course of the rabbit's nictitating membrane during acquisition, extinction, and reacquisition. Learning and Memory 21:585-590. Cold Spring Harbor Press.

ABSTRACT: The
present experiment tested whether or not the time course of a
conditioned eyeblink response, particularly its duration, would expand
and contract, as the magnitude of the conditioned response (CR) changed
massively during acquisition, extinction, and reacquisition. The CR
duration remained largely constant throughout the experiment, while CR
onset and peak time occurred slightly later during extinction. The
results suggest that computational models can account for these results
by using two layers of plasticity conforming to the sequence of
synapses in the cerebellar pathways that mediate eyeblink conditioning.

Mahmood, A. R., van Hasselt, H, Sutton, R. S. (2014). Weighted importance sampling for off-policy learning with linear function approximation.

ABSTRACT: Importance
sampling is an essential component of off-policy model-free
reinforcement learning algorithms. However, its most effective variant,
weighted importance sampling,
does not carry over easily to function
approximation and, because of this, it is not utilized in existing
off-policy learning algorithms. In this paper, we take two steps toward
bridging this gap. First, we show that weighted importance sampling can
be viewed as a special case of weighting the error of individual
training samples, and that this weighting has theoretical and empirical
benefits similar to those of weighted importance sampling. Second, we
show that these benefits extend to a new weighted-importance-sampling
version of off-policy LSTD(λ). We show empirically that our
new
WIS-LSTD(λ) algorithm can result in much
more rapid and reliable
convergence than conventional off-policy LSTD(λ) (Yu 2010, Bertsekas & Yu
2009).

Yao, H., Szepesvari, Cs., Sutton, R., Modayil, J., Bhatnagar, S. (2014). Universal Option Models.

ABSTRACT: We consider
the problem of learning models of options for real-time abstract planning, in the
setting where reward functions can be specified at any time and their expected
returns must be efficiently computed. We introduce a new model for an
option that is independent of any reward function, called the universal option model (UOM). We
prove that the UOM of an option can construct a traditional option
model given a reward function, and also supports efficient computation
of the option-conditional return. We extend the UOM to linear function
approximation, and we show it gives the TD solution of option returns
and value functions of policies over options. We provide a stochastic
approximation algorithm for incrementally learning UOMs from data and
prove its consistency. We demonstrate our method in two domains. The
first domain is a real-time strategy game, where the
controller must select the best game unit to accomplish
dynamically-specified tasks. The second domain is
article recommendation, where each user query defines a new
reward function and an article’s relevance is the expected return from following a
policy that follows the citations between articles. Our experiments
show that UOMs are substantially more efficient than previously known
methods in evaluating option returns and policies over options.

Edwards, A. L., Dawson, M. R., Hebert, J. S., Sutton, R. S., Chan, K. M., Pilarski, P.M. (2014). Adaptive switching in practice: Improving myoelectric prosthesis performance through reinforcement learning. In: Proceedings of MEC14: Myoelectric Controls Symposium, Fredericton, New Brunswick, Canada.

van Hasselt, H., Mahmood, A. R., Sutton, R. S. (2014). Off-policy TD(λ) with a true online equivalence. In: Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, Quebec City, Canada.

ABSTRACT: Van
Seijen and Sutton (2014) recently proposed a new version of the linear
TD(λ) learning
algorithm that is exactly equivalent to an online forward view and that
empirically performed better than its classical counterpart in both
prediction and control problems. However, their algorithm is restricted
to on-policy learning. In the more general case of off-policy learning,
in which the policy whose outcome is predicted and the policy used to
generate data may be different, their algorithm cannot be applied. One
reason for this is that the algorithm bootstraps and thus is subject to
instability problems when function approximation is used. A second
reason true online TD(λ) cannot be used for off-policy learning is that the
off-policy case requires sophisticated importance sampling in its
eligibility traces. To address these limitations, we generalize their
equivalence result and use this generalization to construct the first
online algorithm to be exactly equivalent to an off-policy forward
view. We show this algorithm, named true online GTD(λ), empirically outperforms
GTD(λ) (Maei,
2011) which was derived from the same objective as our forward view but
lacks the exact online equivalence. In the general theorem that allows
us to derive this new algorithm, we encounter a new general
eligibility-trace update.

Sutton, R. S., Mahmood, A. R., Precup, D., van Hasselt, H. (2014). A new Q(λ) with interim forward view and Monte Carlo equivalence. In: Proceedings of the 31st International Conference on Machine Learning, Beijing, China.

ABSTRACT:
Q-learning, the most popular of reinforcement learning algorithms, has
always included an extension to eligibility traces to enable more rapid
learning and improved asymptotic performance on non-Markov problems.
The λ parameter smoothly shifts on-policy algorithms such as TD(λ) and
Sarsa(λ) from a pure bootstrapping form (λ = 0) to a pure Monte Carlo
form (λ = 1). In off-policy algorithms, including Q(λ), GQ(λ), and
off-policy LSTD(λ), the λ parameter is intended to play the same role,
but does not; on every exploratory action these algorithms bootstrap
regardless of the value of λ, and as a result they fail to approximate
Monte Carlo learning when λ = 1. It may seem that this is inevitable
for any online off-policy algorithm; if updates are made on each step
on which the target policy is followed, then how could just the right
updates be ‘un-made’ upon deviation from the target policy? In this
paper, we introduce a new version of Q(λ) that does exactly that,
without significantly increased algorithmic complexity. En route to our
new Q(λ), we introduce a new derivation technique based on the
forward-view/backward-view analysis familiar from TD(λ) but extended to
apply at every time step rather than only at the end of episodes. We
apply this technique to derive first a new off-policy version of TD(λ),
called PTD(λ), and then our new Q(λ), called PQ(λ).

van Seijen, H., Sutton, R. S. (2014). True online TD(λ). In: Proceedings of the 31st International Conference on Machine Learning, Beijing, China. As published. With two minor errors corrected.

ABSTRACT:TD(λ) is
a core algorithm of modern reinforcement learning. Its appeal comes
from its equivalence to a clear and conceptually simple forward view,
and the fact that it can be implemented online in an inexpensive
manner. However, the equivalence between TD(λ) and the forward view is
exact only for the off-line version of the algorithm (in which updates
are made only at the end of each episode). In the online version of
TD(λ) (in which updates are made at each step, which generally performs
better and is always used in applications) the match to the forward
view is only approximate. In a sense this is unavoidable for the
conventional forward view, as it itself presumes that the estimates are
unchanging during an episode. In this paper we introduce a new forward
view that takes into account the possibility of changing estimates and
a new variant of TD(λ) that exactly achieves it. Our algorithm uses a
new form of eligibility trace similar to but different from
conventional accumulating and replacing traces. The overall
computational complexity is the same as TD(λ), even when using function
approximation. In our empirical comparisons, our algorithm outperformed
TD(λ) in all of its variations. It seems, by adhering more truly to the
original goal of TD(λ)—matching an intuitively clear forward view even
in the online case—that we have found a new algorithm that simply
improves on classical TD(λ).

Modayil, J., White, A., Sutton, R. S. (2014). Multi-timescale Nexting in a Reinforcement Learning Robot. Adaptive Behavior 22(2):146-160.

ABSTRACT: The term
‘nexting’ has been used by psychologists to refer to the propensity of
people and many other animals to continually predict what will happen
next in an immediate, local, and personal sense. The ability to ‘next’
constitutes a basic kind of awareness and knowledge of one’s
environment. In this paper we present results with a robot that learns
to next in real time, making thousands of predictions about sensory
input signals at timescales from 0.1 to 8 seconds. Our predictions are
formulated as a generalization of the value functions commonly used in
reinforcement learning, where now an arbitrary function of the sensory
input signals is used as a pseudo reward, and the discount rate
determines the timescale. We show that six thousand predictions, each
computed as a function of six thousand features of the state, can be
learned and updated online ten times per second on a laptop computer,
using the standard TD(λ) algorithm with linear function approximation. This
approach is sufficiently computationally efficient to be used for
real-time learning on the robot and sufficiently data efficient to
achieve substantial accuracy within 30 minutes. Moreover, a single
tile-coded feature representation suffices to accurately predict many
different signals over a significant range of timescales. We also
extend nexting beyond simple timescales by letting the discount rate be
a function of the state and show that nexting predictions of this more
general form can also be learned with substantial accuracy. General
nexting provides a simple yet powerful mechanism for a robot to acquire
predictive knowledge of the dynamics of its environment.

Edwards, A. L., Kearney, A., Dawson, M. R., Sutton, R. S., Pilarski, P. M. (2013). Temporal-difference learning to assist human decision making during the control of an artificial limb. In: Proceedings of the First Interdisciplinary Conference on Reinforcement Learning and Decision Making, and http://arxiv.org/abs/1309.4714.

ABSTRACT: In this
work we explore the use of reinforcement learning (RL) to help with
human decision making, combining state-of-the-art RL algorithms with an
application to prosthetics. Managing human-machine interaction is a
problem of considerable scope, and the simplification of human-robot
interfaces is especially important in the domains of biomedical
technology and rehabilitation medicine. For example, amputees who
control artificial limbs are often required to quickly switch between a
number of control actions or modes of operation in order to operate
their devices. We suggest that by learning to anticipate (predict) a
user's behaviour, artificial limbs could take on an active role in a
human's control decisions so as to reduce the burden on their users.
Recently, we showed that RL in the form of general value functions
(GVFs) could be used to accurately detect a user's control intent prior
to their explicit control choices. In the present work, we explore the
use of temporal-difference learning and GVFs to predict when users will
switch their control influence between the different motor functions of
a robot arm. Experiments were performed using a multi-function robot
arm that was controlled by muscle signals from a user's body (similar
to conventional artificial limb control). Our approach was able to
acquire and maintain forecasts about a user's switching decisions in
real time. It also provides an intuitive and reward-free way for users
to correct or reinforce the decisions made by the machine learning
system. We expect that when a system is certain enough about its
predictions, it can begin to take over switching decisions from the
user to streamline control and potentially decrease the time and effort
needed to complete tasks. This preliminary study therefore suggests a
way to naturally integrate human- and machine-based decision making
systems.

Silver, D., Sutton, R. S., Mueller, M. (2013). Temporal-difference search in computer go. In: Proceedings of the ICAPS-13 Workshop on Planning and Learning.

ABSTRACT:
Temporal-difference (TD) learning is one of the most successful and
broadly applied solutions to the reinforcement learning problem; it has
been used to achieve master-level play in chess, checkers and
backgammon. Monte-Carlo tree search is a recent algorithm for
simulation-based
search, which has been used to achieve master-level play in Go. We have
introduced a new approach to high-performance planning (Silver et al.,
2012). Our method, TD search, combines TD learning with
simulation-based search. Like Monte-Carlo tree search, value estimates
are updated by learning online from simulated experience. Like TD
learning, it uses value function approximation and bootstrapping to
efficiently generalise between related states. We applied TD search to
the game of 9 × 9 Go, using a million binary features matching
simple patterns of stones. Without any explicit search tree, our
approach outperformed a vanilla Monte-Carlo tree search with the same
number of simulations. When combined with a simple alpha-beta search,
our program also outperformed all traditional (pre-Monte-Carlo) search
and machine learning programs on the 9 × 9 Computer Go Server.

Mahmood, A. R., Sutton, R. S. (2013). Representation Search through Generate and Test. In: Proceedings of the AAAI Workshop on Learning Rich Representations from Low-Level Sensors, Bellevue, Washington, USA.

ABSTRACT: Learning
representations from data is one of the fundamental problems of
artificial intelligence and machine learning. Many different approaches
exist for learning representations, but what constitutes a good
representation is not yet well understood. Much less is known about how
to learn representations from an unending stream of data. In this work,
we view the problem of representation learning as one of learning
features (e.g., hidden units of neural networks) such that performance
of the underlying base system continually improves. We study an
important case where learning is done fully online (i.e, on an
example-by-example basis). In the presence of an unending stream of
data, the computational cost of the learning element should not grow
with time and cannot be much more than that of the performance element.
Few methods can be used effectively in this case. We show that a search
approach to representation learning can naturally fit into this
setting. In this approach good representations are searched by
generating different features and then testing them for utility. We
show that a fully online method can be developed, which utilizes this
generate and test approach, learns representations continually, and
adds only a small fraction to the overall computation. We believe
online representation search constitutes an important step to- ward
effective and inexpensive solutions to representation learning problems.

van Seijen, H., Sutton, R. S. (2013). Efficient planning in MDPs by small backups. In: Proceedings of the 30th International Conference on Machine Learning, pp. 361-369, Atlanta, USA.

ABSTRACT:
Efficient planning plays a crucial role in model-based reinforcement
learning. This efficiency is largely determined by the order in which
backups are performed. A particularly effective ordering strategy is
the strategy employed by prioritized sweeping. Prioritized sweeping
orders backups according to a heuristic, such that backups that cause a
large value change are selected first. The Bellman error predicts the
value change caused by a full backup exactly, but its computation is
expensive. Hence, methods do not use the Bellman error as a heuristic,
but try to approximate the ordering induced by it with a heuristic that
is cheaper to compute. We introduce the first efficient prioritized
sweeping method that uses exact Bellman error ordering. The core of
this method is a novel backup that allows for efficient computation of
the Bellman error, while its effect as well as its computational
complexity is equal to that of a full backup. We demonstrate
empirically that the obtained method achieves substantially higher
convergence rates than other prioritized sweeping implementations.

Pilarski, P. M., Dick, T. B., Sutton, R. S. (2013). Real-time prediction learning for the simultaneous actuation of multiple prosthetic joints. In: Proceedings of the 2013 IEEE International Conference on Rehabilitation Robotics, Seattle, USA, June 24–26. 8 pages.

ABSTRACT:
Integrating learned predictions into a prosthetic control system
promises to enhance multi-joint prosthesis use by amputees. In this
article, we present a preliminary study of different cases where it may
be beneficial to use a set of temporally extended predictions—learned
and maintained in real time—within an engineered or learned prosthesis
controller. Our study demonstrates the first successful combination of
actor-critic reinforcement learning with real-time prediction learning.
We evaluate this new approach to control learning during the
myoelectric operation of a robot limb. Our results suggest that the
integration of real-time prediction and control learning may speed
control policy acquisition, allow unsupervised adaptation in
myoelectric controllers, and facilitate synergies in highly actuated
limbs. These experiments also show that temporally extended prediction
learning enables anticipatory actuation, opening the way for
coordinated motion in assistive robotic devices. Our work therefore
provides initial evidence that real-time prediction learning is a
practical way to support intuitive joint control in increasingly
complex prosthetic systems.

Pilarski, P. M., Dawson, M. R., Degris, T., Chan, K. M., Hebert, J. S., Carey, J. P., Sutton, R. S. (2013). Adaptive Artificial Limbs: A Real-time Approach to Prediction and Anticipation, IEEE Robotics & Automation Magazine 20(1):53-64.

ABSTRACT:
Predicting the future has long been regarded as a powerful means to
improvement and success. The ability to make accurate and timely
predictions enhances our ability to control our situation and our
environment. Assistive robotics is one prominent area where foresight
of this kind can bring improved quality of life. In this article, we
present a new approach to acquiring and maintaining predictive
knowledge during the online, ongoing operation of an assistive robot.
The ability to learn accurate, temporally abstracted predictions is
shown through two case studies—able-bodied subjects engaging in the
myoelectric control of a robot arm, and an amputee participant’s
interactions with a myoelectric training robot. To our knowledge, this
work is the first demonstration of a practical method for real-time
prediction learning during myoelectric control. Our approach therefore
represents a fundamental tool for addressing one major unsolved
problem: amputee-specific adaptation during the ongoing operation of a
prosthetic device. The findings in this article also contribute a first
explicit look at prediction learning in prosthetics as an important
goal in its own right, independent of its intended use within a
specific controller or system. Our results suggest that real-time
learning of predictions and anticipations is a significant step towards
more intuitive myoelectric prostheses and other assistive robotic
devices.

Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2013). Timing and cue competition in conditioning of the nictitating membrane response of the rabbit (Oryctolagus cuniculus), Learning and Memory 20:97-102. Cold Spring Harbor Press.

ABSTRACT:Rabbits
were classically conditioned using compounds of tone and light
conditioned stimuli (CSs) presented with either simultaneous onsets
(Experiment 1) or serial onsets (Experiment 2) in a delay conditioning
paradigm. Training with the simultaneous compound reduced the
likelihood of a conditioned response (CR) to the individual CSs
(“mutual overshadowing”) but left CR timing unaltered. CR peaks were
consistently clustered around the time of unconditioned stimulus (US)
delivery. Training with the serial compound (CSA->CSB->US)
reduced responding to CSB (“temporal primacy/information effect”) but
this effect was prevented by prior CSB->US pairings. In both cases,
serial compound training altered CR timing. On CSA->CSB test trials,
the CRs were accelerated; the CR peaks occurred after CSB onset but
well before the time of US delivery. Conversely, CRs on CSB– trials
were decelerated; the distribution of CR peaks was variable but
centered well after the US. Timing on CSB– trials was at most only
slightly accelerated. The results are discussed with respect to
processes of generalization and spectral timing applicable to the
cerebellar and forebrain pathways in eyeblink preparations.

White, A., Modayil, J., Sutton, R. S. (2012). Scaling Life-long Off-policy Learning. In Proceedings of the Second Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics (ICDL-Epirob), San Diego, USA.

ABSTRACT:In this
paper, we pursue an approach to scaling life-long learning using
parallel, off-policy reinforcement-learning algorithms. In life-long
learning, a robot continually learns from a life-time of experience,
slowly acquiring and applying skills and knowledge to new situations.
Many of the benefits of life-long learning are a result of scaling the
amount of training data, processed by the robot, to long sensorimotor
streams. Another dimension of scaling can be added by allowing
off-policy sampling from the unending stream of sensorimotor data
generated by a long-lived robot. Recent algorithmic developments have
made it possible to apply off-policy algorithms to life-long learning,
in a sound way, for the first time. We assess the scalability of these
off-policy algorithms on a physical robot. We show that hundreds of
accurate multi-step predictions can be learned about several policies
in parallel and in realtime. We present the first online measures of
off-policy learning progress. Finally we demonstrate that our robot,
using the new off-policy measures, can learn 8000 predictions about 300
distinct policies, a substantial increase in scale compared to previous
simulated and robotic life-long learning systems.

Modayil, J., White, A., Pilarski, P. M., Sutton, R. S. (2012). Acquiring a Broad Range of Empirical Knowledge in Real Time by Temporal-Difference Learning. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Seoul, Korea.

ABSTRACT: Several
robot capabilities rely on predictions about the temporally extended
consequences of a robot’s behaviour. We describe how a robot can both
learn and make many such predictions in real time using a standard
algorithm. Our experiments show that a mobile robot can learn and make
thousands of accurate predictions at 10 Hz. The predictions are about
the future of all of the robot’s sensors and many internal state
variables at multiple time-scales. All the predictions share a single
set of features and learning parameters. We demonstrate the generality
of this method with an application to a different platform, a robot arm
operating at 50 Hz. Here, learned predictions can be used to measurably
improve the user interface. The temporally extended predictions learned
in real time by this method constitute a basic form of knowledge about
the dynamics of the robot’s interaction with the environment. We also
show how this method can be extended to express more general forms of
knowledge.

Modayil, J., White, A., Pilarski, P. M., Sutton, R. S. (2012). Acquiring Diverse Predictive Knowledge in Real Time by Temporal-difference Learning. International Workshop on Evolutionary and Reinforcement Learning for Autonomous Robot Systems, Montpellier, France.

ABSTRACT: Existing
robot algorithms demonstrate several capabilities that are enabled by a
robot’s knowledge of the temporally- extended consequences of its
behaviour. This knowledge consists of real-time predictions—predictions
that are conventionally computed by iterating a small one-timestep
model of the robot’s dynamics. Given the utility of such predictions,
alternatives are desirable when this conventional approach is not
applicable, for example when an adequate model of the one-timestep
dynamics is either not available or not computationally tractable. We
describe how a robot can both learn and make many such predictions in
real-time using a standard reinforcement learning algorithm. Our
experiments show that a mobile robot can learn and make thousands of
accurate predictions at 10 Hz about the future of all of its sensors
and many internal state vari- ables at multiple time-scales. The method
uses a single set of features and learning parameters that are shared
across all the predictions. We demonstrate the generality of these
predictions with an application to a different platform, a robot arm
operating at 50 Hz. Here, the predictions are about which arm joint the
user wants to move next, a difficult situation to model analytically,
and we show how the learned predictions enable measurable improvements
to the user interface. The predictions learned in real-time by this
method constitute a basic form of knowledge about the robot’s
interaction with the environ- ment, and extensions of this method can
express more general forms of knowledge.

Ludvig, E. A., Sutton, R. S., Kehoe, E. J. (2012). Evaluating the TD model of classical conditioning. Learning and Behavior 40(3):305-319. Springer.

ABSTRACT: The
temporal-difference (TD) algorithm from reinforcement learning provides
a simple method for incrementally learning predictions of upcoming
events. Applied to classical conditioning, TD models suppose that
animals learn a real-time prediction of the unconditioned stimulus (US)
on the basis of all available conditioned stimuli (CSs). In the TD
model, similar to other error-correction models, learning is driven by
prediction errors—the difference between the change in US prediction
and the actual US. With the TD model, however, learning occurs
continuously from moment to moment and is not artificially constrained
to occur in trials. Accordingly, a key feature of any TD model is the
assumption about the representation of a CS on a moment-to-moment
basis. Here, we evaluate the performance of the TD model with a
heretofore unexplored range of classical conditioning tasks. To do so,
we consider three stimulus representations that vary in their degree of
temporal generalization and evaluate how the representation influences
the performance of the TD model on these conditioning tasks.

Pilarski, P. M., Sutton, R. S. (2012) Between instruction and reward: Human-prompted switching. Proceedings of the AAAI Fall Symposium on Robots Learning Interactively from Human Teachers. AAAI Technical Report FS-12-07, pp. 46–52.

ABSTRACT: Intelligent systems
promise to amplify, augment, and extend innate human abilities. A
principal example is that of assistive rehabilitation robots—artificial
intelligence and machine learning enable new electromechanical systems
that restore biological functions lost through injury or illness. In
order for an intelligent machine to assist a human user, it must be
possible for a human to communicate their intentions and preferences to
their non-human counterpart. While there are a number of techniques
that a human can use to direct a machine learning system, most research
to date has focused on the contrasting strategies of instruction and
reward. The primary contribution of our work is to demonstrate that the
middle ground between instruction and reward is a fertile space for
research and immediate technological progress. To support this idea, we
introduce the setting of human-prompted switching, and illustrate the
successful combination of switching with interactive learning using a
concrete real-world example: human control of a multi-joint robot arm.
We believe techniques that fall between the domains of instruction and
reward are complementary to existing approaches, and will open up new
lines of rapid progress for interactive human training of machine
learning systems.

Pilarski, P. M., Dawson, M. R., Degris, T., Carey, J. P., Chan, K. M., Hebert, J. S., Sutton, R. S. (2012). Towards Prediction-Based Prosthetic Control. In Proceedings of the 2012 Conference of the International Functional Electrical Stimulation Society, September 9-12, Banff, Canada, 4 pages.

ABSTRACT: Predictions and
anticipations form a basis for pattern-recognition-based control
systems, including those used in next-generation prostheses and
assistive devices. In this work, we outline how new methods in
real-time prediction learning provide an approach to one of the
principal open problems in multi-function myoelectric control—robust,
ongoing, amputee-specific adaptation. Techniques from reinforcement
learning and general value functions are applied to learn and update
predictions during continuing human interactions with multi-function
robotic appendages. Tests were performed with a targeted motor and
sensory reinnervation amputee subject and non-amputee participants.
Results show that this online prediction learning approach is able to
accurately anticipate a diverse set of human and robot signals by more
than two seconds, and at different levels of temporal abstraction.
These methods allow predictions to be adapted during real-time use,
which is an important step toward the intuitive control of powered
prostheses and other assistive devices.

Degris, T., White, M., Sutton, R. S. (2012). Off-policy Actor-Critic. In

ABSTRACT: This paper presents the
first actor-critic algorithm for off-policy reinforcement learning. Our
algorithm is online and incremental, and its per-time-step complexity
scales linearly with the number of learned weights. Previous work on
actor-critic algorithms is limited to the on-policy setting and does
not take advantage of the recent advances in off-policy gradient
temporal-difference learning. Off-policy techniques, such as Greedy-GQ,
enable a target policy to be learned while following and obtaining data
from another (behavior) policy. For many problems, however,
actor-critic methods are more practical than action value methods (like
Greedy-GQ) because they explicitly represent the policy; consequently,
the policy can be stochastic and utilize a large action space. In this
paper, we illustrate how to practically combine the generality and
learning potential of off-policy learning with the flexibility in
action selection given by actor-critic methods. We derive an
incremental, linear time and space complexity algorithm that includes
eligibility traces, prove convergence under assumptions similar to
previous off-policy algorithms, and empirically show better or
comparable performance to existing algorithms on standard
reinforcement-learning benchmark problems.

Pilarski, P. M., Dawson, M. R., Degris, T., Carey, J. P., Sutton, R. S. (2012). Dynamic Switching and Real-time Machine Learning for Improved Human Control of Assistive Biomedical Robots. In Proceedings of the Fourth IEEE International Conference on Biomedical Robotics and Biomechatronics (BioRob), June 24-27, Roma, Italy, 296-302.

ABSTRACT: A general problem for
human-machine interaction occurs when a machine’s controllable
dimensions outnumber the control channels available to its human user.
In this work, we examine one prominent example of this problem: amputee
switching between the multiple functions of a powered artificial limb.
We propose a dynamic switching approach that learns during ongoing
interaction to anticipate user behaviour, thereby presenting the most
effective control option for a given context or task. Switching
predictions are learned in real time using temporal difference methods
and reinforcement learning, and demonstrated within the context of a
robotic arm and a multi-function myoelectric controller. We find that a
learned, dynamic switching order is able to out-perform the best fixed
(non-adaptive) switching regime on a standard prosthetic proficiency
task, increasing the number of optimal switching suggestions by 23%,
and decreasing the expected transition time between degrees of freedom
by more than 14%. These preliminary results indicate that real-time
machine learning, specifically online prediction and anticipation, may
be an important tool for developing more robust and intuitive
controllers for assistive biomedical robots. We expect these techniques
will transfer well to near-term use by patients. Future work will
describe clinical testing of this approach with a population of amputee
patients.

Sutton, R. S. (2012). Beyond reward: The problem of knowledge and data (extended abstract). In: Proceedings of the 21st International Conference on Inductive Logic Programming, S.H. Muggleton, A. Tamaddoni-Nezhad, F.A. Lisi (Eds.): ILP 2011, LNAI 7207, pp. 2--6. Springer, Heidelberg.

Degris, T., Modayil, J. (2012). Scaling-up Knowledge for a Cognizant Robot.” In notes of the AAAI Spring Symposium on Designing Intelligent Robots: Reintegrating AI.

ABSTRACT: This paper takes a new
approach to the old adage that knowledge is the key for artificial
intelligence. A cognizant robot is a robot with a deep and immediately
accessible understanding of its interaction with the environment—an
understanding the robot can use to flexibly adapt to novel situations.
Such a robot will need a vast amount of situated, revisable, and
expressive knowledge to display flexible intelligent behaviors. Instead
of relying on human-provided knowledge, we propose that an arbitrary
robot can autonomously acquire pertinent knowledge directly from
everyday interaction with the environment. We show how existing ideas
in reinforcement learning can enable a robot to maintain and improve
its knowledge. The robot performs a continual learning process that
scales-up knowledge acquisition to cover a large number of facts,
skills and predictions. This knowledge has semantics that are grounded
in sensorimotor experience. We see the approach of developing more
cognizant robots as a necessary key step towards broadly competent
robots.

Degris, T., Pilarski, P. M., Sutton, R. S. (2012). Model-Free Reinforcement Learning with Continuous Action in Practice. In Proceedings of the 2012 American Control Conference.

ABSTRACT: Reinforcement learning
methods are often considered as a potential solution to enable a robot
to adapt to changes in real time to an unpredictable environment.
However, with continuous action, only a few existing algorithms are
practical for real-time learning. In such a setting, most effective
methods have used a parameterized policy structure, often with a
separate parameterized value function. The goal of this paper is to
assess such actor-critic methods to form a fully specified practical
algorithm. Our specific contributions include 1) developing the
extension of existing incremental policy-gradient algorithms to use
eligibility traces, 2) an empirical comparison of the resulting
algorithms using continuous actions, 3) the evaluation of a
gradient-scaling technique that can significantly improve performance.
Finally, we apply our actor--critic algorithm to learn on a robotic
platform with a fast sensorimotor cycle (10ms). Overall, these results
constitute an important step towards practical real-time learning
control with continuous action.

Mahmood, A. R., Sutton, R. S., Degris, T., Pilarski, P. M. (2012). Tuning-free step-size adaptation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Kyoto, Japan.

ABSTRACT: Incremental learning
algorithms based on gradient descent are effective and popular in
online supervised learning, reinforcement learning, signal processing,
and many other application areas. An oft-noted drawback of these
algorithms is that they include a step-size parameter that needs to be
tuned for best performance, which might require manual intervention and
significant domain knowledge or additional data. In many cases, an
entire vector of step-size parameters (e.g., one for each input
feature) needs to be tuned in order to attain the best performance of
the algorithm. To address this, several methods have been proposed for
adapting step sizes online. For example, Sutton’s IDBD method can find
the best vector step size for the LMS algorithm, and Schraudolph’s ELK1
method, an extension of IDBD to neural networks, has proven effective
on large applications, such as 3D hand tracking. However, to date all
such step-size adaptation methods have included a tunable step-size
parameter of their own, which we call the meta-step-size parameter. In this
paper we show that the performance of existing step-size adaptation
methods are strongly dependent on the choice of their meta-step-size
parameter and that their meta-step-size parameter cannot be set
reliably in a problem-independent way. We introduce a series of
modifications and normalizations to the IDBD method that together
eliminate the need to tune the meta-step-size parameter to the
particular problem. We show that the resulting overall algorithm,
called Autostep, performs as
well or better than the existing step-size adaptation methods on a
number of idealized and robot prediction problems and does not require
any tuning of its meta-step-size parameter. The ideas behind Autostep
are not restricted to the IDBD method and the same principles are
potentially applicable to other incremental learning settings, such as
reinforcement learning.

Silver, D., Sutton, R. S., Müller, M. (2012). Temporal-difference search in computer Go, Machine Learning 87(2):183-219. Pdf copy for personal research purposes only.

ABSTRACT: Temporal-difference learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. The key idea is to update a value function from episodes of real experience, by bootstrapping from future value estimates, and using value function approximation to generalise between related states. Monte-Carlo tree search is a recent algorithm for high-performance search, which has been used to achieve master-level play in Go. The key idea is to use the mean outcome of simulated episodes of experience to evaluate each state in a search tree. We introduce a new approach to high-performance search in Markov decision processes and two-player games. Our method, temporal-difference search, combines temporal-difference learning with simulation-based search. Like Monte-Carlo tree search, the value function is updated from simulated experience; but like temporal-difference learning, it uses value function approximation and bootstrapping to efficiently generalise between related states. We apply temporal-difference search to the game of 9 × 9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed an unenhanced Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9 × 9 Computer Go Server.

Modayil, J., White, A., Sutton, R. S. (2012). Multi-timescale Nexting in a Reinforcement Learning Robot. Presented at the 2012 International Conference on Adaptive Behaviour, Odense, Denmark. To appear in: SAB 12, LNAI 7426, pp. 299-309, T. Ziemke, C. Balkenius, and J. Hallam, Eds., Springer Heidelberg. Also ArXiv preprint 1112.1133.

ABSTRACT: Maintaining accurate world knowledge in a complex and changing environment is a perennial problem for robots and other artificial intelligence systems. Our architecture for addressing this problem, called Horde, consists of a large number of independent reinforcement learning sub-agents, or demons. Each demon is responsible for answering a single predictive or goal-oriented question about the world, thereby contributing in a factored, modular way to the system's overall knowledge. The questions are in the form of a value function, but each demon has its own policy, reward function, termination function, and terminal-reward function unrelated to those of the base problem. Learning proceeds in parallel by all demons simultaneously so as to extract the maximal training information from whatever actions are taken by the system as a whole. Gradient-based temporal-difference learning methods are used to learn efficiently and reliably with function approximation in this off-policy setting. Horde runs in constant time and memory per time step, and is thus suitable for learning online in real-time applications such as robotics. We present results using Horde on a multi-sensored mobile robot to successfully learn goal-oriented behaviors and long-term predictions from off-policy experience. Horde is a significant incremental step towards a real-time architecture for efficient learning of general knowledge from unsupervised sensorimotor interaction.

Pilarski, P. M., Dawson, M. R., Degris, T.,Fahimi, F., Carey, J. P., Sutton, R. S. (2011). Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning, Proceedings of the 2011 IEEE International Conference on Rehabilitation Robotics (ICORR), Zurich, Switzerland, pp. 134-140.

ABSTRACT: As a
contribution toward the goal of adaptable, intelligent artificial
limbs, this work introduces a continuous actor-critic reinforcement
learning method for optimizing the control of multi-function
myoelectric devices. Using a simulated upper-arm robotic prosthesis, we
demonstrate how it is possible to derive successful limb controllers
from myoelectric data using only a sparse human-delivered training
signal, without requiring detailed knowledge about the task domain.
This reinforcement-based machine learning framework is well suited for
use by both patients and clinical staff, and may be easily adapted to
different application domains and the needs of individual amputees. To
our knowledge, this is the first myoelectric control approach that
facilitates the online learning of new amputee-specific motions based
only on a one-dimensional (scalar) feedback signal provided by the user
of the prosthesis.

Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the Tenth International Conference on Autonomous Agents and Multiagent Systems, Taipei, Taiwan.

ABSTRACT: Maintaining accurate world knowledge in a complex and changing environment is a perennial problem for robots and other artificial intelligence systems. Our architecture for addressing this problem, called Horde, consists of a large number of independent reinforcement learning sub-agents, or demons. Each demon is responsible for answering a single predictive or goal-oriented question about the world, thereby contributing in a factored, modular way to the system's overall knowledge. The questions are in the form of a value function, but each demon has its own policy, reward function, termination function, and terminal-reward function unrelated to those of the base problem. Learning proceeds in parallel by all demons simultaneously so as to extract the maximal training information from whatever actions are taken by the system as a whole. Gradient-based temporal-difference learning methods are used to learn efficiently and reliably with function approximation in this off-policy setting. Horde runs in constant time and memory per time step, and is thus suitable for learning online in real-time applications such as robotics. We present results using Horde on a multi-sensored mobile robot to successfully learn goal-oriented behaviors and long-term predictions from off-policy experience. Horde is a significant incremental step towards a real-time architecture for efficient learning of general knowledge from unsupervised sensorimotor interaction.

Modayil, J., Pilarski, P. M., White, A., Degris, T., Sutton, R. S. (2010). Off-policy knowledge maintenance for robots, Robotics Science and Systems Workshop (Towards Closing the Loop: Active Learning for Robotics). Extended abstract and poster.

ABSTRACT: A fundamental difficulty in robotics arises
from changes in the experienced environment—periods when the robot’s
current situation differs from past experience. We present an
architecture whereby many independent reinforcement learn- ing agents
(or demons) observe the
behaviour of a single robot. Each demon learns one piece of world
knowledge represented with a generalized value function. This
architecture allows the demons to update their knowledge online and
off-policy from the robot’s behaviour. We present one approach to
active exploration using curiosity—an internal measure of learning
progress—and conclude with a preliminary result showing how a robot can
adapt its prediction of the time needed to come to a full stop.

Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2010). Timing in trace conditioning of the nictitating membrane response of the rabbit (Oryctolagus cuniculus): Scalar, nonscalar, and adaptive features. Learning and Memory 17:600-604. Cold Spring Harbor Press.

Using
interstimulus intervals (ISIs) of 125, 250, and 500 msec in trace
conditioning of the rabbit nictitating membrane response, the offset
times and durations of conditioned responses (CRs) were collected along
with onset and peak latencies. All measures were proportional to the
ISI, but only onset and peak latencies conformed to the criterion for
scalar timing. Regarding the CR’s possible protective overlap of the
unconditioned stimulus (US), CR duration increased with ISI, while the
peak’s alignment with the US declined. Implications for models of
timing and CR adaptiveness are discussed.

Maei, H. R., Szepesvari, Cs., Bhatnagar, S., Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In

ABSTRACT: We present the ﬁrst temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-diﬀerence learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation.

Maei, H. R., Sutton, R. S. (2010). GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In

ABSTRACT: A new family of gradient temporal-difference learning
algorithms have recently been introduced by Sutton, Maei and others in
which function approximation is much more straightforward. In this
paper, we introduce the GQ(λ)
algorithm which can be seen as extension of that work to a more general
setting including eligibility traces and off-policy learning of
temporally abstract predictions. These extensions bring us closer to
the ultimate goal of this work—the development of a universal
prediction learning algorithm suitable for learning experientially
grounded knowledge of the world. Eligibility traces are essential to
this goal because they
bridge the temporal gaps in cause and effect when experience is
processed at a temporally ﬁne resolution. Temporally abstract
predictions are also essential as the means for representing abstract,
higher-level knowledge about courses of action, or options. GQ(λ) can
be thought of as an extension of Q-learning. We extend existing
convergence results for policy evaluation to this setting and carry out
a forward-view/backward-view analysis to derive and prove the validity
of the new algorithm.

Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2009). Magnitude and timing of conditioned responses in delay and trace classical conditioning of the nictitating membrane respons of the rabbit (oryctolagus cuniculus),

ABSTRACT: The
present experiment characterized conditioned nictitating membrane (NM)
movements as a function of CS duration, using the full range of
discernible movements (>.06 mm) rather than movements exceeding a
conventional criterion (>.50 mm). The CS–US interval was fixed at
500 ms, while across groups, the duration of the CS was 50 ms (trace),
550 ms (delay), or 1050 ms (extended delay). The delay group showed the
highest level of acquisition. When tested with the different CS
durations, the delay and extended delay groups showed large reductions
in their responses when their CS was shortened to 50 ms, but the trace
group maintained its response at all durations. Timing of the
conditioned movements appeared similar across all manipulations. The
results suggest that the CS has both a fine timing function tied to CS
onset and a general predictive function tied to CS duration, both of
which may be mediated by cerebellar pathways.

Maei, H. R., Szepesvari, Cs., Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S. (2010). Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation. In

ABSTRACT: We introduce the first temporal-difference learning
algorithms that converge with smooth value function approximators, such
as neural networks. Conventional temporal-difference (TD) methods, such
as TD(λ), Q-learning and Sarsa have been used successfully with
function approximation in many applications. However, it is well known
that off-policy sampling, as well as nonlinear function approximation,
can cause these algorithms to become unstable (i.e., the parameters of
the approximator may diverge). Sutton et al. (2009a, 2009b) solved the
problem of off-policy learning with linear TD algorithms by introducing
a new objective function, related to the Bellman error, and algorithms
that perform stochastic gradient-descent on this function. These
methods can be viewed as natural generalizations to previous TD
methods, as they converge to the same limit points when used with
linear function approximation methods. We generalize this work to
nonlinear function approximation. We present a Bellman error objective
function and two gradient-descent TD algorithms that optimize it. We
prove the asymptotic almost-sure convergence of both algorithms, for
any finite Markov decision process and any smooth value function
approximator, to a locally optimal solution. The algorithms are
incremental and the computational complexity per time step scales
linearly with the number of parameters of the approximator. Empirical
results obtained in the game of Go demonstrate the algorithms’
effectiveness.

Interview with Richard S. Sutton, by Verena Heidrich-Meisner of Kuenstliche Intelligenz. (2009). pp. 41-43.

Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., Lee, M. (2009). Natural Actor-Critic Algorithms.

ABSTRACT: We present four new
reinforcement learning algorithms based on actor-critic,
natural-gradient and function-approximation ideas, and we provide their
convergence proofs.
Actor-critic reinforcement learning methods are online approximations
to policy iteration in which the value-function parameters are
estimated using temporal difference learning and the policy parameters
are updated by stochastic gradient descent. Methods based on policy
gradient in this way are of special interest because of their
compatibility with function approximation methods, which are needed to
handle large or infinite state spaces. The use of temporal
difference learning in this way is of interest because in many
applications it dramatically reduces the variance of the
gradient estimates. The use of the natural gradient is of interest
because it can produce better conditioned parameterizations and has
been shown to further reduce variance in some cases. Our results extend
prior two-timescale convergence results for actor-critic methods by
Konda and Tsitsiklis by using temporal difference learning in the actor
and by
incorporating natural gradients. Our results extend prior empirical
studies of
natural actor-critic methods by Peters, Vijayakumar and Schaal by
providing the
first convergence proofs and the first fully incremental algorithms.

Sutton, R. S. (2009). The grand challenge of
predictive empirical abstract knowledge. Working Notes of the
IJCAI-09 Workshop on Grand Challenges for Reasoning from Experiences.ABSTRACT: We survey ongoing work at the
University of Alberta on an experience-oriented approach to artificial
intelligence (AI) based an reinforcement learning ideas. We seek to
ground world knowledge in a minimal ontology of just the signals
passing back and forth between the AI agent and its environment at a
fine temporal scale. The challenge is to connect these low-level
signals to higher-level representations in such a way that the
knowledge remains grounded and autonomously verifiable. The
mathematical ideas of temporally abstract options, option models, and
temporal-difference networks can be applied to begin to address this
challenge. This has been illustrated in several simple computational
worlds, and we seek now to extend the approach to a physically realized
robot. A recent theoretical development is the extension of simple
temporal-difference methods to off-policy forms using function
approximation; this should enable, for the first time, efficient and
reliable intra-option learning, bringing the goal of predictive
empirical abstract knowledge closer to achievement.

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, Cs., Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In

Sutton, Szepesvari and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD. This algorithm appears to extend linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.

Kehoe, E. J., Olsen, K. N., Ludvig, E. A., Sutton, R. S. (2009). Scalar timing varies with response magnitude in classical conditioning of the rabbit nictitating membrane response (Oryctolagus cuniculus). Behavioral Neuroscience 123, 212–217.

The present experiment was aimed at characterizing the timing of conditioned nictitating membrane (NM) movements as function of the interstimulus interval (ISI) in delay conditioning for rabbits (Oryctolagus cuniculus). Onset latency and peak latency were approximately, but not strictly, scalar for all but the smallest movements (<.10 mm). That is, both the mean and standard deviation of the timing measures increased in proportion to the ISI, but their coefficients of variation (standard deviation/mean) tended to be larger for shorter ISIs. For all ISIs, the absolute timing of the NM movements covaried with magnitude. The smaller movements (approximately, .11-.50 mm) were highly variable, and their peaks tended to occur well after the time of US delivery. The larger movements (>.50 mm) were less variable, and their peaks were better aligned with the time of US delivery. These results are discussed with respect to their implications for current models of timing in eyeblink conditioning.Sutton, R. S., Szepesvari, Cs., Maei, H. R. (2009). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In

We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. Thegradient temporal-difference(GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on itsL2norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without LSTD's quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.

Ludvig, E., Sutton, R. S., Verbeek, E., Kehoe, E. J. (2009). A computational model of hippocampal function in trace conditioning. In

We present a new reinforcement-learning model for the role of the hippocampus in classical conditioning, focusing on the differences between trace and delay conditioning. In the model, all stimuli are represented both as unindividuated wholes and as a series of temporal elements with varying delays. These two stimulus representations interact, producing different patterns of learning in trace and delay conditioning. The model proposes that hippocampal lesions eliminate long-latency temporal elements, but preserve short-latency temporal elements. For trace conditioning, with no contiguity between stimulus and reward, these long-latency temporal elements are vital to learning adaptively timed responses. For delay conditioning, in contrast, the continued presence of the stimulus supports conditioned responding, and the short-latency elements suppress responding early in the stimulus. In accord with the empirical data, simulated hippocampal damage impairs trace conditioning, but not delay conditioning, at medium-length intervals. With longer intervals, learning is impaired in both procedures, and, with shorter intervals, in neither. In addition, the model makes novel predictions about the response topography with extended stimuli or post-training lesions. These results demonstrate how temporal contiguity, as in delay conditioning, changes the timing problem faced by animals, rendering it both easier and less susceptible to disruption by hippocampal lesions.

Cutumisu, M., Szafron, D., Bowling, M., Sutton R. S. (2008). Agent learning using action-dependent learning rates in computer role-playing games. Proceedings of the 4th Artificial Intelligence and Interactive Digital Entertainment Conference, pp. 22-29.

We introduce the
ALeRT (Action-dependent
Learning Rates with Trends) algorithm that
makes two modifications to the learning rate and one change to the
exploration rate of traditional reinforcement learning techniques. Our
learning rates are action-dependent and increase or decrease based on
trends in reward sequences. Our exploration rate decreases when the
agent is learning successfully and increases otherwise. These
improvements result in faster learning. We implemented this algorithm
in NWScript, a scripting language used by BioWare Corp.’s Neverwinter Nights game, with the
goal of improving the behaviours of game agents so that they react more
intelligently to game events. Our goal is to provide an agent with the
ability to (1) discover
favourable policies in a multi-agent computer role-playing game
situation and (2) adapt to
sudden changes in the environment.

Silver, D., Sutton, R. S., and Mueller, M. (2008). Sample-based learning and search with permanent and transient memories. In

We study a reinforcement learning architecture that encompasses both learning and planning, and apply it to high performance Computer Go. In this domain, the most successful planning methods are based on sample-based search algorithms, such as UCT, in which states are individuated, and the most successful learning methods are based on temporal-difference learning algorithms, such as Sarsa, in which linear function approximation is used. In both cases, an estimate of the value function is formed, but in the first case it is transient, computed and then discarded after each move, whereas in the second case it is more permanent, slowly accumulating over many moves and games. The idea of our architecture, Dyna-2, is for the transient planning memory and the permanent learning memory to remain separate, but for both to be based on linear function approximation and both to be updated by Sarsa. This architecture subsumes Sarsa and UCT as special cases. We apply Dyna-2 to 9x9 Computer Go, using a million binary features in the function approximator, based on templates matching small fragments of the board. We test our approach against GnuGo, a standard benchmark opponent, and on the Computer Go Online Server (CGOS). We show that the combination of permanent and transient memories performs better than either memory alone, and significantly outperforms a traditional approach using alpha-beta search. Our program based on Dyna-2 achieves a higher rating on CGOS than any other handcrafted or traditional search based program. This is the first exploration of the Dyna concept in a large-scale, high-performance application.Sutton, R. S., Szepesvari, Cs., Geramifard, A., Bowling, M., (2008). Dyna-style planning with linear function approximation and prioritized sweeping, Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pp. 528-536.

We consider the problem of efficiently learning optimal control policies and value functions over large state spaces in an online setting in which estimates must be available after each interaction with the world. This paper develops an explicitly model-based approach extending the Dyna architecture to linear function approximation. Dyna-style planning proceeds by generating imaginary experience from the world model and then applying model-free reinforcement learning algorithms to the imagined state transitions. Our main results are to prove that linear Dyna-style planning converges to a unique solution independent of the generating distribution, under natural conditions. In the policy evaluation setting, we prove that the limit point is the least-squares (LSTD) solution. An implication of our results is that prioritized-sweeping can be soundly extended to the linear approximation case, backing up to preceding features rather than to preceding states. We introduce two versions of prioritized sweeping with linear Dyna and briefly illustrate their performance empirically on the Mountain Car and Boyan Chain problems.

Ludvig, E. A., Sutton, R. S., Kehoe, E. J. (2008) Stimulus representation and the timing of reward-prediction errors in models of the dopamine system,

The phasic firing of dopamine neurons has been theorized to encode a reward-prediction error as formalized by the temporal-difference (TD) algorithm in reinforcement learning. Most TD models of dopamine have assumed a stimulus representation, known as the complete serial compound, in which each moment in a trial is distinctly represented. We introduce a more realistic temporal stimulus representation for the TD model. In our model, all external stimuli, including rewards, spawn a series of internal microstimuli, which grow weaker and more diffuse over time. These microstimuli are used by the TD learning algorithm to generate predictions of future reward. This new stimulus representation injects temporal generalization into the TD model and enhances correspondence between model and data in several experiments, including those when rewards are omitted or received early. This improved fit mostly derives from the absence of large negative errors in the new model, suggesting that dopamine alone can encode the full range of TD errors in these situations.

Kehoe, E. J., Ludvig, E. A., Dudeney, J. E., Neufeld, J., Sutton, R. S. (2008). Magnitude and timing of nictitating membrane movements during classical conditioning of the rabbit (oryctolagus cuniculus),

A trial-by-trial, subject-by-subject analysis was conducted to determine whether generation of the conditioned response (CR) occurs on a continuous or all-or-none basis. Three groups of rabbits were trained on different partial reinforcement schedules with the conditioned stimulus presented alone on 10%, 30%, or 50%, respectively, of all trials. Plots of each rabbit’s nictitating membrane movements revealed that their magnitude rose in a continuous fashion. Response growth during acquisition followed a sigmoidal curve, and the timing of CR-sized movements was largely stable throughout the experiment. The results are discussed with respect to alternative models of CR generation.

Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., Lee, M. (2008). Incremental Natural Actor-Critic Algorithms.

ABSTRACT: We present four new
reinforcement learning algorithms based on actor-critic and
natural-gradient ideas, and provide their convergence proofs.
Actor-critic reinforcement learning methods are online approximations
to policy iteration in which the value-function parameters are
estimated using temporal difference learning and the policy parameters
are updated by stochastic gradient descent. Methods based on policy
gradient in this way are of special interest because of their
compatibility with function approximation methods, which are needed to
handle large or infinite state spaces, and the use of temporal
difference learning in this way is of interest because in many
applications it dramatically reduces the variance of the policy
gradient estimates. The use of the natural gradient is of interest
because it can produce better conditioned parameterizations and has
been shown to further reduce variance in some cases. Our results extend
prior two-timescale convergence results for actor-critic methods by
Konda et al. by using temporal difference learning in the actor and by
incorporating natural gradients, and extend prior empirical studies of
natural-gradient actor-critic methods by Peters et al. by providing the
first convergence proofs and the first fully incremental algorithms.

Sutton, R. S., Koop,
A., Silver, D. (2007). On the
Role of Tracking in Stationary Environments. In ABSTRACT: It is often thought that
learning algorithms that track the best
solution, as opposed to converging to it, are important only on
nonstationary problems. We present three results suggesting that this
is not so. First we illustrate in a simple concrete example, the Black
and White problem, that tracking can perform better than any converging
algorithm on a stationary problem. Second, we show the same point on a
larger, more realistic problem, an application of temporal-difference
learning to computer Go. Our third result suggests that tracking in
stationary problems could be important for meta-learning research
(e.g., learning to learn, feature selection, transfer). We apply a
meta-learning algorithm for step-size adaptation, IDBD,e to the Black
and White problem, showing that meta-learning has a dramatic long-term
effect on performance whereas, on an analogous converging problem,
meta-learning has only a small second-order effect. This small result
suggests a way of eventually overcoming a major obstacle to
meta-learning research: the lack of an independent methodology for task
selection.

Silver, D., Sutton, R. S., and Mueller, M. (2007) Reinforcement Learning of Local Shape in the Game of Go, Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07).

ABSTRACT: We explore an application to
the game of Go of a reinforcement
learning approach based on a linear evaluation function and large
numbers of binary features. This strategy has proved effective in game
playing programs and other reinforcement learning applications. We
apply this strategy to Go by creating over a million features based on
templates for small fragments of the board, and then use temporal
difference learning and self-play. This method identifies hundreds of
low level shapes with recognisable significance to expert Go players,
and provides quantitive estimates of their values. We analyse the
relative contributions to performance of templates of different types
and sizes. Our results show that small, translation-invariant templates
are surprisingly effective. We assess the performance of our program by
playing against the Average Liberty Player and a variety of computer
opponents on the 9x9 Computer Go Server. Our linear evaluation function
appears to outperform all other static evaluation functions that do not
incorporate substantial domain knowledge.

Geramifard, A., Bowling, M., Zinkevich, M., and Sutton, R. S. (2007). iLSTD: Eligibility Traces and Convergence Analysis, Advances in Neural Information Processing Systems 19 (NIPS*06).

ABSTRACT: We present new theoretical
and empirical results with the iLSTD algorithm for policy evaluation in
reinforcement learning with linear function approximation. iLSTD
is an incremental method for achieving results similar to LSTD, the
data-efficient, least-squares version of temporal difference learning,
without incurring the full cost of the LSTD computation. LSTD is O(n^{2}
), where n is the number of
parameters in the linear function approximator, while iLSTD is O(n).
In this paper, we generalize the previous iLSTD algorithm and present
three new results: (1) the first convergence proof for an iLSTD
algorithm; (2) an extension to incorporate eligibility traces without
changing the asymptotic computational complexity; and (3) the first
empirical results with an iLSTD algorithm for a problem (mountain car)
with feature vectors large enough (n
= 10,000) to show substantial computational advantages over LSTD.

Geramifard, A., Bowling, M., Sutton, R. S. (2006). Incremental Least-Squares Temporal Difference Learning. In

ABSTRACT: Approximate policy evaluation
with linear function approximation is a commonly arising problem in
reinforcement learning, usually solved using temporal difference (TD)
algorithms. In this paper we introduce a new variant of linear TD
learning, called incremental least-squares TD learning, or iLSTD. This
method is more data efficient than conventional TD algorithms such as
TD(0) and is more computationally efficient than non-incremental
least-squares TD methods such as LSTD (Bradtke & Barto 1996; Boyan
1999). In particular, we show that the per-time-step complexities of
iLSTD and TD(0) are O(n), where n is the number of features,
whereas that of LSTD is O(n^{2}
). This difference can be decisive in modern applications of
reinforcement learning where the use of a large number features has
proven to be an effective solution strategy. We present empirical
comparisons, using the test problem introduced by Boyan (1999), in
which iLSTD converges faster than TD(0) and almost as fast as LSTD.

Precup, D., Sutton, R. S., Paduraru, C.,
Koop, A., Singh, S. (2006).
Off-policy Learning with
Recognizers. Advances in
Neural Information Processing
Systems 18 (NIPS*05).ABSTRACT: We introduce a new algorithm
for
off-policy temporal-difference learning with function approximation
that has much lower variance and requires less knowledge of the
behavior policy than prior methods. We develop the notion of a
recognizer, a filter on actions that distorts the behavior policy to
produce a related target policy with low-variance importance-sampling
corrections. We also consider target policies that are deviations from
the state distribution of the behavior policy, such as potential
temporally abstract options, which further reduces variance. This paper
introduces recognizers and their potential advantages, then develops a
full algorithm for MDPs and proves that its updates are in the same
direction as on-policy TD updates, which implies asymptotic
convergence. Our algorithm achieves this without knowledge of the
behavior policy or even requiring that there exists a behavior policy.

Sutton, R. S., Rafols, E. J., Koop, A. (2006). Temporal abstraction in temporal-difference networks. Advances in Neural Information Processing Systems 18 (NIPS*05).

ABSTRACT: Temporal-difference (TD)
networks have
been proposed as a way of representing and learning a wide variety of
predictions about the interaction between an agent and its environment
(Sutton & Tanner, 2005). These predictions are compositional in that their targets
are defined in terms of other predictions, and subjunctive
in that they are about what would happen if an action or sequence of
actions were taken. In conventional TD networks, the
inter-related predictions are at successive time steps and contingent
on a single action; here we generalize them to accommodate extended
time intervals and contingency on whole ways of behaving. Our
generalization is based on the options framework for temporal
abstraction (Sutton, Precup & Singh, 1999). The primary
contribution of this paper is to introduce a new algorithm for
intra-option learning in TD networks with function approximation and
eligibility traces. We present empirical examples of our
algorithm's effectiveness and of the greater representational
expressiveness of temporally-abstract TD networks.

Tanner, B., Sutton, R. S. (2005) TD(lambda) Networks: Temporal-Difference Networks with Eligibility Traces.

ABSTRACT: Temporal-difference (TD)
networks have been introduced as a formalism for expressing and
learning grounded world knowledge in a predictive form (Sutton &
Tanner, 2005). Like conventional TD(0) methods, the learning algorithm
for TD networks uses 1-step backups to train prediction units about
future events. In conventional TD learning, the TD(lambda) algorithm is
often used to do more general multi-step backups of future predictions.
In our work, we introduce a generalization of the 1-step TD network
specification that is based on the TD(lambda) learning al- gorithm,
creating
TD(lambda) networks. We present experimental results that show
TD(lambda)
networks can learn solutions in more complex environments than TD
networks. We also show that in problems that can be solved by TD
networks, TD(lambda) networks generally learn solutions much faster
than
their 1-step counterparts. Finally, we present an analysis of our
algorithm that shows that the computational cost of TD(lambda) networks
is
only slightly more than that of TD networks.

Rafols, E. J., Ring, M. B., Sutton, R. S., Tanner, B. 2005) Using Predictive Representations to Improve Generalization in Reinforcement Learning.

ABSTRACT: The predictive
representations
hypothesis holds that particularly good generalization will result from
representing the state of the world in terms of predictions about
possible future experience. This hypothesis has been a central
motivation behind recent research in, for example, PSRs and TD
networks. In this paper we present the first explicit investigation of
this hypothesis. We show in a reinforcement-learning example (a
grid-world navigation task) that a predictive representation in tabular
form can learn much faster than both the tabular explicit-state
representation and a tabular history-based method.

Tanner, B., Sutton, R. S. (2005) Temporal-Difference Networks with History.

ABSTRACT: Temporal-difference (TD)
networks are a
formalism for expressing and learning grounded world knowledge in a
predictive form [Sutton and Tanner, 2005]. However, not all partially
observable Markov decision processes can be efficiently learned with TD
networks. In this paper, we extend TD networks by allowing the
network-update process (answer network) to depend on the recent history
of previous actions and observations rather than only on the most
recent action and observation. We show that this extension enables the
solution of a larger class of problems than can be solved by the
original TD networks or by historybased methods alone. In addition, we
apply TD networks to a problem that, while still simple, is
significantly larger than has previously been considered. We show that
history-extended TD networks can learn much of the common-sense
knowledge of an egocentric gridworld domain with a single bit of
perception.

Stone, P., Sutton, R. S., Kuhlmann, G. (2005) Reinforcement Learning for RoboCup-Soccer Keepaway. Adaptive Behavior. Some simulations of keepaway referenced in the paper and keepaway software.

ABSTRACT: RoboCup simulated soccer
presents many challenges to reinforcement learning methods, including a
large state space, hidden and uncertain state, multiple independent
agents learning simultaneously, and long and variable delays in the
effects of actions. We describe our application of episodic SMDP
Sarsa(lambda) with linear tile-coding function approximation and
variable lambda
to learning higher-level decisions in a keepaway subtask of RoboCup
soccer. In keepaway, one team, "the keepers," tries to keep control of
the ball for as long as possible despite the efforts of "the takers."
The keepers learn individually when to hold the ball and when to pass
to a teammate. Our agents learned policies that significantly
outperform a range of benchmark policies. We demonstrate the generality
of our approach by applying it to a number of task variations including
different field sizes and different numbers of players on each team.

Sutton, R. S., Tanner, B. (2005) Temporal-Difference Networks.
Advances in Neural Information
Processing
Systems 17, pages 1377-1384. Associated talk from 12/14/05.
Small-size version of same talk.

ABSTRACT: We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that when actions are introduced, and the inter-prediction relationships made contingent on them, the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms.

Littman, M.L., Sutton, R.S., Singh, S. (2002). Predictive Representations of State. In:

Stone, P., Sutton, R. S. (2002). Keepaway soccer: A machine learning testbed. In: Lecture Notes in Computer Science 2377, pp. 207-237, Springer.

RoboCup simulated soccer presents many
challenges to machine learning (ML) methods, including a large state
space, hidden and uncertain state, multiple agents, and long and
variable delays in the effects of actions. While there have been many
successful ML applications to portions of the robotic soccer task, it
appears to be still beyond the capabilities of modern machine learning
techniques to enable a team of 11 agents to successfully learn the full
robotic soccer task from sensors to actuators. Because the successful
applications to portions of the task have been embedded in different
teams and have often addressed different sub-tasks, they have been
difficult to compare. We put forth keepaway soccer as a domain suitable
for directly comparing different machine learning approaches to robotic
soccer. It is complex enough that it can’t be solved trivially, yet
simple enough that complete machine learning approaches are feasible.
In keepaway, one team, “the keepers,” tries to keep control of the ball
for as long as possible despite the efforts of “the takers.” The
keepers learn individually when to hold the ball and when to pass to a
teammate, while the takers learn when to charge the ball-holder and
when to cover possible passing lanes. We fully specify the domain and
summarize some initial, successful learning results.

Stone, P., Sutton, R.S. (2001). Scaling reinforcement learning toward RoboCup soccer.

Precup, D., Sutton, R.S., Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. digitally remastered version. scanned version.

Stone, P., Sutton, R.S., Singh, S.
(2001).
Reinforcement Learning for 3 vs. 2 Keepaway. In: *RoboCup-2000:
Robot Soccer World Cup IV,* P. Stone, T. Balch, and G.
Kraetszchmar, Eds., Springer Verlag. An earlier
version appeared in the *Proceedings
of the RoboCup-2000 Workshop,* Melbourne, Australia.

Sutton, R.S., McAllester, D., Singh, S., Mansour, Y. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation.

Precup, D., Sutton, R.S., Singh, S. (2000). Eligibility Traces for Off-Policy Policy Evaluation.

Sutton, R.S. (1999). Open
theoretical
questions in reinforcement learning. In *Proceedings of the
Fourth European Conference on
Computational Learning Theory* (Proceedings EuroCOLT'99), pp.
11-17, Fischer, P., Simon, H.U., Eds. Springer-Verlag.

Sutton, R.S. (1999). Reinforcement
Learning. In Rob Wilson and Frank Keil (Eds.) *The MIT Encyclopedia of the
Cognitive Sciences,*
MIT Press.

Sutton, R.S., Precup, D., Singh, S. (1999).
Between MDPs and semi-MDPs:
A Framework for Temporal Abstraction in Reinforcement Learning. *Artificial
Intelligence 112*:181-211. An earlier version appeared as
Technical Report 98-74, Department of
Computer
Science, University of Massachusetts, Amherst, MA 01003. April, 1998. Associated talk from 3/5/98.
Humorous excerpts
from the JAIR reviews recommending rejection of this paper.

Sutton, R.S., Singh, S., Precup, D., Ravindran, B. (1999). Improved switching among temporally abstract actions. Djvu version.

Moll, R., Barto, A.G., Perkins, T. J., Sutton, R.S. (1999).
Learning instance-independent value functions to enhance local search.
*Advances in Neural Information Processing Systems 11*
(Proceedings of the 1998 conference), pp. 1017-1023.
MIT Press.

Sutton, R. S., Reinforcement learning:
Past, present, and future. Extended abstract in Simulated Evolution and Learning,
McKay, B., Yao, X., Newton, C. S., Kim, J.-H., Furuhashi, T., Eds.
Published as Lecture Notes in
Computer Science 1585, pp. 195–197, Springer, 1999.

Sutton, R.S., Barto, A.G. (1998).
*Reinforcement
Learning: An Introduction*.
MIT Press.

Sutton, R.S., Precup, D., Singh, S. (1998).
Intra-option learning about temporally abstract actions.
*Proceedings of the 15th International Conference on Machine
Learning,*
pp. 556-564.
Morgan Kaufmann.

Precup, D., Sutton, R.S. (1998) Multi-time models for temporally abstract planning. Djvu version.

McGovern, A., and Sutton, R.S. (1998). Macro-actions in reinforcement learning: An empirical analysis. Technical Report 98-70, University of Massachusetts, Department of Computer Science.

Precup, D., Sutton, R.S., Singh, S. (1998).
Theoretical results on reinforcement learning with temporally abstract
options.
In *Machine Learning: ECML-98, Proceedings of the 10th European
Conference on Machine Learning, Chemnitz, Germany,* pp. 382-393.
Springer Verlag.

Santamaria, J.C., Sutton, R.S., Ram, A. (1998). Experiments with reinforcement learning in problems with continuous state and action spaces,

McGovern, A., Precup, D., Ravindran, B.,
Singh, S., Sutton, R.S.
(1998). Hierarchical Optimal Control of
MDPs. *Proceedings of the Tenth Yale Workshop on Adaptive and
Learning Systems,*
pp. 186-191.

Barto, A. G., Sutton, R. S., Reinforcement learning in artificial intelligence. In Neural Network Models of Cognition, Donahoe, J. W., Packard Dorsel, V., Eds., pp. 358–386, Elsevier, 1997.

This chapter provides an overview of an approach to the study of learning that, in broad terms, has developed as a part of the field of Artifificial Intelligence (AI), where it is calledreinforcement learningdue to its roots in reinforcement theories of animal learning. We introduce the field from the perspective of AI and engineering, describing some of its key features, providing a formal model of the reinforcement-learning problem, and defining basic concepts that are exploited by solution methods. Detailed discussion of solution methods themselves and their history are very broad topics that we do not attempt to cover here.

Sutton, R.S. (1997). On the significance of Markov decision processes. In W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud (Eds.)

Precup, D., Sutton, R.S. (1997). Multi-time models for reinforcement learning.

Precup, D., Sutton, R.S. (1997). Exponentiated gradient methods for reinforcement learning.

McGovern, A., Sutton, R.S., Fagg, A.H. (1997). Roles of macro-actions in accelerating reinforcement learning.

Precup, D., Sutton, R.S., Singh, S.P. (1997). Planning with closed-loop macro actions.

Mehra, R.K., Ravichandran, B., Cabrera, J.B.D., Greve, D.N., Sutton, R.S. (1997). Towards self-learning adaptive scheduling for ATM networks, Proceedings of the 36th Conference on Decision and Control, pp. 2393-2398, San Diego, California USA.

Sutton, R.S. (1996).
Generalization in reinforcement
learning: Successful examples using sparse coarse
coding. Camera ready postscript. Digitally remastered pdf.
Djvu version.
*Advances in Neural Information Processing Systems 8*
(Proceedings of the 1995 conference),
pp. 1038-1044, MIT Press. Errata: there is small error in the equations
of motion of the acrobot; for the correct equations, see the
RL textbook page on the acrobot (thanks to Barry Nichols for
spotting this).

Singh, S.P., Sutton, R.S. (1996).
Reinforcement learning with
replacing eligibility traces. *Machine Learning 22*:
123-158.

Precup, D., Sutton, R.S. (1996). Empirical comparison of gradient descent and exponentiated gradient descent in supervised and reinforcement learning. Technical Report UM-CS-1996-070, Department of Computer Science, University of Massachusetts, Amherst, MA 01003.

Kuvayev, L., Sutton, R.S. (1996).
Model-based reinforcement
learning with an approximate, learned model.
*Proceedings of the Ninth Yale Workshop on Adaptive and Learning
Systems,*
pp. 101-105, Yale University, New Haven, CT. Some additional results
are in this earlier version of the
same paper.

Mehra, R.K., Ravichandran, B., Sutton, R.S. (1996). Adaptive
intelligent scheduling for ATM networks.
*Proceedings of the Ninth Yale Workshop on Adaptive and Learning
Systems,*
pp. 106-111, Yale University, New Haven, CT.

Sutton, R.S. (1995). On the Virtues of Linear Learning and Trajectory Distributions. Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference.

Sutton, R.S. (1995).
TD models: Modeling the world at a
mixture of time scales.
*Proceedings of the Twelfth International Conference on Machine
Learning,*
pp. 531-539, Morgan Kaufmann.

Sutton, R.S., Singh, S.P. (1994).
On bias and step size in
temporal-difference learning.
*Proceedings of the Eighth Yale Workshop on Adaptive and Learning
Systems,* pp. 91-96, Yale University, New Haven, CT.

Sutton, R.S., Whitehead, S.D. (1993).
Online learning with random
representations.
*Proceedings of the Tenth International Conference on Machine
Learning,* pp. 314-321, Morgan Kaufmann.

Sutton, R.S. (1992).
Adapting bias by gradient
descent: An
incremental version of delta-bar-delta.
*Proceedings of the Tenth National Conference on Artificial
Intelligence,*
pp. 171-176, MIT Press.
Associated talk from 2/2/04.

Sutton, R.S.
(1992).
Gain adaptation beats least squares? *Proceedings of the
Seventh Yale Workshop on Adaptive and Learning Systems,* pp.
161-166, Yale University, New Haven, CT.

Sutton, R.S. (1992).
Machines that Learn and
Mimic the Brain.
In * ACCESS, GTE's Journal of Science and Technology,* 1992.
Reprinted in * Stethoscope Quarterly,* Spring.

Sutton, R.S. (1992).
Reinforcement learning
architectures.
*Proceedings ISKIT'92 International Symposium on Neural Information
Processing*,
Fukuoka, Japan.

Sutton, R.S. (Ed.) (1992). *Reinforcement
Learning,* book
version of a special issue of *Machine Learning* on
reinforcement learning (Volume 8, Numbers 3/4). Kluwer. Introduction: The challenge of
reinforcement learning.

Sanger, T.D., Sutton, R.S., Matheus, C.J. (1992).
Iterative construction of sparse polynomial approximations.
*Advances in Neural Information Processing Systems 4,* pp.
1064-1071, Morgan Kaufmann.

Gluck, M., Glauthier, P., Sutton, R.S. (1992).
Adaptation of cue-specific learning rates
in network models of human
category learning, * Proceedings of the Fouteenth
Annual Conference of the Cognitive
Science Society,* pp. 540-545, Erlbaum.

Sutton,
R.S. (1991).
Planning by incremental dynamic programming.
*Proceedings of the Eighth International Workshop on Machine
Learning,* pp. 353-357, Morgan Kaufmann.

Sutton, R.S. (1991).
Dyna, an integrated architecture for
learning, planning and reacting.
*Working Notes of the 1991 AAAI Spring Symposium on
Integrated Intelligent Architectures* and *SIGART Bulletin 2,*
pp. 160-163.

Sutton, R.S. (1991).
Integrated
modeling and control based on reinforcement
learning and dynamic programming.
In D. S. Touretzky (ed.), * Advances in Neural
Information Processing Systems 3,* pages 471-478.

Sutton, R.S. (1991).
Reinforcement learning
architectures for animats,
* Proceedings of the First International Conference on Simulation
of Adaptive Behavior: From Animals to Animats,* pp. 288-296. smaller gzipped version.

Sutton, R.S., Matheus, C.J. (1991). Learning polynomial functions by feature construction.

Sutton, R.S., Barto,
A.G., Williams, R. (1991).
Reinforcement learning is
direct adaptive optimal control, * Proceedings of the American
Control Conference,* pages 2143-2146. Also published in * IEEE
Control
Systems Magazine 12,* No. 2, 19-22.

Miller, W. T., Sutton, R. S., Werbos, P. J. (Eds.), Neural Networks for Control. MIT Press, 1991.

Sutton, R.S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming.

Sutton, R.S. (1990).
First results with Dyna,
an integrated architecture for
learning, planning, and reacting. In * Neural Networks for
Control,*
Miller, T., Sutton, R.S., & Werbos, P., Eds., MIT Press.

Sutton,
R.S., Barto, A.G. (1990).
Time-derivative models of pavlovian reinforcement.
In *Learning and Computational Neuroscience: Foundations of
Adaptive Networks,* M. Gabriel and J. Moore, Eds., pp. 497-537. MIT
Press.

Barto, A.G., Sutton, R.S., Watkins, C.J.C.H. (1990). Learning and sequential decision making. In

Barto, A.G., Sutton, R.S., & Watkins, C.
(1990).
Sequential
decision problems and neural networks.
In D. S. Touretzky (ed.), * Advances in Neural
Information Processing Systems 2,* pp. 686-693.

Whitehead, S., Sutton, R.S., & Ballard, D. (1990).
Advances in reinforcement learning
and their implications for intelligent control, * Proceedings
of
the
Fifth IEEE International Symposium on Intelligent Control 1990,*
pp. 1289-1297.

Franklin, J., Sutton, R.S., Anderson, C., Selfridge, O., &
Schwartz, D. (1990).
Connectionist learning control at GTE
laboratories, * Intelligent
Control and Adaptive Systems,* G. Rodriguez, Ed., Proc. SPIE 1196,
pp. 242-253.

Anderson, C., Franklin, J., & Sutton, R.S. (1990).
Learning a nonlinear model
of a manufacturing process using multilayer connectionist networks,
*Proceedings of the
Fifth IEEE International Symposium on Intelligent Control 1990,*
pp. 404-409.

Sutton, R.S. (1989).
Artificial intelligence as a control
problem: Comments on the
relationship between machine learning and intelligent control.
Appeared in Machine Learning in
a dynamic world.*
Proceedings of the IEEE International Symposium on Intelligent Control
1988,* pp. 500-507.

Sutton, R.S. (1989).
Implementation details of the
TD(lambda) procedure for the case
of vector predictions and backpropagation.
GTE Laboratories Technical Report TR87-509.1, as corrected August 1989.
GTE Laboratories, 40 Sylvan Road, Waltham, MA 02254.

Franklin, J., Sutton, R.S., & Anderson, C. (1989).
Application of connectionist learning
methods to manufacturing process monitoring, * Proceedings of
the
IEEE International Symposium on Intelligent Control 1988,* pp.
709-712.

Sutton, R.S. (1988).
Learning to predict by the methods of temporal differences.
*Machine Learning 3*: 9-44, erratum p. 377. Scan of paper as published,
with erratum added. Digitally
remastered
with missing figure in place.

Sutton, R.S. (1988).
Convergence theory for a new kind
of
prediction learning,
* Proceedings of the 1988 Workshop on Computational Learning Theory,*
pp. 421-42.

Sutton, R.S. (1988).
NADALINE: A normalized adaptive
linear element that learns efficiently.
GTE Laboratories Technical Report TR88-509.4. GTE Laboratories, 40
Sylvan Road, Waltham, MA 02254.

Selfridge, O., Sutton, R.S., & Anderson, C. (1988).
Selected bibliography on connectionism,
In: * Evolution, Learning, and Cognition,*
Y.C. Lee (Ed.), pp. 391-403, World Scientific Publishing.

Sutton, R.S., & Barto, A.G. (1987).
A temporal-difference model of
classical conditioning,
* Proceedings of the Ninth Annual Conference of the Cognitive
Science Society,* pp. 355-378.

Sutton, R.S. (1986).
Two problems with backpropagation and
other
steepest-descent learning procedures for networks,
* Proceedings of the Eighth Annual Conference of the Cognitive
Science Society,* pp. 823-831.
Reprinted in * Artificial Neural Networks: Concepts and
Theory,* edited by P. Mehra and B. Wah, IEEE Computer Society
Press,
1992. smaller gzipped version.

Moore, J., Desmond, J, Berthier, N., Blazis, D., Sutton, R.S., & Barto, A.G. (1986). Simulation of the classically conditioned nictitating membrane response by a neuron-like adaptive element: Response topography, neuronal firing, and interstimulus intervals,

Sutton, R.S. (1985).
Learning distributed, searchable,
internal models,
* Proceedings of the Distributed Artificial Intelligence Workshop,*
pp. 287-289.

Sutton, R.S., & Pinette, B. (1985).
The learning of world models by
connectionist networks, * Proceedings of the Seventh Annual
Conference of the Cognitive Science Society,* pp. 54-64.

Barto, A.G. & Sutton, R.S. (1985).
Neural problem solving. In: *
Synaptic
Modification, Neuron Selectivity, and Nervous System Organization,*
W.B. Levy & J.A. Anderson (Eds.), pp. 123-152.
Lawrence Erlbaum.

Selfridge, O., Sutton, R.S., &
Barto, A.G. (1985).
Training and tracking in robotics,
* Proceedings of the Ninth
International Joint Conference on Artificial Intelligence,* pp.
670-672.

Moore, J., Desmond, J., Berthier, N., Blazis, D., Sutton, R.S., &
Barto, A.G. (1985).
Connectionist learning in real
time: Sutton-Barto
adaptive element and classical conditioning of the nictitating membrane
response, * Proceedings of the Seventh Annual
Conference of the Cognitive Science Society,* pp. 318-322.

Sutton, R.S. (1984).
Temporal credit assignment in
reinforcement learning (106 Mbytes). Ph.D.
dissertation, Department of Computer Science,
University of Massachusetts, Amherst, MA 01003.
Published as COINS Technical Report 84-2.

Barto, A.G., Sutton, R.S., & Anderson,
C. (1983).
Neuron-like adaptive
elements that can solve difficult learning control problems, *
IEEE Transactions on Systems, Man, and Cybernetics, SMC-13*:
834-846. 12.4 Mb pdf,
3.5 Mb gzipped

Sutton, R.S. (1982 - unpublished draft).
A theory of salience change dependent
on the relationship between discrepancies
on successive trials on which the stimulus is present. smaller gzipped version.

Barto, A.G. & Sutton, R.S. (1982). Simulation of anticipatory responses in classical conditioning by a neuron-like adaptive element,

Barto, A.G., Anderson, C., &
Sutton, R.S. (1982). Synthesis of
nonlinear control
surfaces by a layered associative network, * Biological
Cybernetics 43*:175-185.

Barto, A.G., Sutton, R.S., & Anderson, C. (1982).
Spatial learning simulation
systems, * Proceedings of the 10th IMACS World Congress on
Systems
Simulation and Scientific Computation,* pp. 204-206.

Sutton, R.S. (1981). Adaptation of learning rate parameters.
In: Goal Seeking Components for
Adaptive Intelligence: An Initial Assessment,
by A. G. Barto and R. S. Sutton. Air Force Wright Aeronautical
Laboratories Technical Report AFWAL-TR-81-1070. Wright-Patterson Air
Force Base, Ohio 45433.

Sutton, R.S., & Barto, A.G.
(1981).
Toward a modern theory of
adaptive networks: Expectation and prediction, * Psychological
Review 88*:135-140. Translated into Spanish by G.
Ruiz to appear in the journal * Estudios de Psicologia.*

Sutton, R.S., & Barto,
A.G. (1981).
An adaptive network that
constructs and uses
an internal model of its world, * Cognition and Brain Theory 4*:217-246.

Barto, A.G., Sutton, R.S. (1981).
Goal seeking components for
adaptive intelligence: An initial
assessment.
Air Force Wright Aeronautical Laboratories Technical Report
AFWAL-TR-81-1070.
Wright-Patterson Air Force Base, Ohio 45433. 524 pages. (Appendix C is available separately.)

Barto, A.G., Sutton, R.S., &
Brouwer, P. (1981).
Associative search network:
A reinforcement learning associative memory, * Biological
Cybernetics 40*:201-211.

Barto, A.G. & Sutton, R.S. (1981).
Landmark learning: An
illustration of associative search,
* Biological Cybernetics 42*:1-8.

Sutton, R.S. (1978).
Single channel theory: A neuronal
theory of learning,
* Brain Theory Newsletter 3,* No. 3/4, pp. 72-75.
(earliest publication)

Sutton, R.S. (1978 -
unpublished).
A unified theory of expectation in
classical and instrumental
conditioning.
Bachelors thesis, Stanford University.

Sutton, R.S. (1978 - unpublished).
Learning theory support for a
single channel theory of the brain.