Rich Sutton's Publications


For any broken links, please send email to rich@richsutton.com.


Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2014). Time course of the rabbit's nictitating membrane during acquisition, extinction, and reacquisition. Learning and Memory 21:585-590. Cold Spring Harbor Press.

ABSTRACT: The present experiment tested whether or not the time course of a conditioned eyeblink response, particularly its duration, would expand and contract as the magnitude of the conditioned response (CR) changed massively during acquisition, extinction, and reacquisition. The CR duration remained largely constant throughout the experiment, while CR onset and peak time occurred slightly later during extinction. The results suggest that computational models can account for these findings by using two layers of plasticity conforming to the sequence of synapses in the cerebellar pathways that mediate eyeblink conditioning.


Mahmood, A. R., van Hasselt, H., Sutton, R. S. (2014). Weighted importance sampling for off-policy learning with linear function approximation. Advances in Neural Information Processing Systems 27, Montreal, Canada.

ABSTRACT: Importance sampling is an essential component of off-policy model-free reinforcement learning algorithms. However, its most effective variant, weighted importance sampling, does not carry over easily to function approximation and, because of this, it is not utilized in existing off-policy learning algorithms. In this paper, we take two steps toward bridging this gap. First, we show that weighted importance sampling can be viewed as a special case of weighting the error of individual training samples, and that this weighting has theoretical and empirical benefits similar to those of weighted importance sampling. Second, we show that these benefits extend to a new weighted-importance-sampling version of off-policy LSTD(λ). We show empirically that our new WIS-LSTD(λ) algorithm can result in much more rapid and reliable convergence than conventional off-policy LSTD(λ) (Yu 2010, Bertsekas & Yu 2009).
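
The contrast between ordinary and weighted importance sampling that motivates this work can be illustrated in a few lines. The sketch below is only a tabular, bandit-style illustration of the two estimators, not the WIS-LSTD(λ) algorithm of the paper; the policies, reward means, and sample size are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two actions; the behaviour policy is uniform, the target policy prefers action 1.
behaviour = np.array([0.5, 0.5])
target = np.array([0.1, 0.9])
true_means = np.array([0.0, 1.0])                   # expected reward of each action

actions = rng.choice(2, size=1000, p=behaviour)
rewards = true_means[actions] + rng.normal(0.0, 1.0, size=actions.shape)
rho = target[actions] / behaviour[actions]          # importance-sampling ratios

ois = np.mean(rho * rewards)                        # ordinary importance sampling
wis = np.sum(rho * rewards) / np.sum(rho)           # weighted importance sampling

print(f"true value {true_means @ target:.3f}  OIS {ois:.3f}  WIS {wis:.3f}")
```

Running this repeatedly shows the usual pattern: the weighted estimator has lower variance at the cost of a small bias, which is the benefit the paper carries over to function approximation.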


Yao, H., Szepesvari, Cs., Sutton, R. S., Modayil, J., Bhatnagar, S. (2014). Universal Option Models. Advances in Neural Information Processing Systems 27, Montreal, Canada.

ABSTRACT: We consider the problem of learning models of options for real-time abstract planning, in the setting where reward functions can be specified at any time and their expected returns must be efficiently computed. We introduce a new model for an option that is independent of any reward function, called the universal option model (UOM). We prove that the UOM of an option can construct a traditional option model given a reward function, and also supports efficient computation of the option-conditional return. We extend the UOM to linear function approximation, and we show it gives the TD solution of option returns and value functions of policies over options. We provide a stochastic approximation algorithm for incrementally learning UOMs from data and prove its consistency. We demonstrate our method in two domains. The first domain is a real-time strategy game, where the controller must select the best game unit to accomplish dynamically-specified tasks. The second domain is article recommendation, where each user query defines a new reward function and an article’s relevance is the expected return from following a policy that follows the citations between articles. Our experiments show that UOMs are substantially more efficient than previously known methods in evaluating option returns and policies over options.
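
As a rough tabular illustration of the reward-independence idea (not the linear UOM or the algorithms of the paper): an option's model can be summarized by its expected discounted state occupancy and its discounted terminal-state distribution, after which the return for any reward function specified later is just a dot product. The chain world, success probability, and rollout counts below are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, gamma = 5, 0.9

def run_option(start):
    """Option: move right (succeeding w.p. 0.8) until the last state, then terminate."""
    s, states, discounts, g = start, [], [], 1.0
    while s < n_states - 1:
        states.append(s)
        discounts.append(g)
        s = s + 1 if rng.random() < 0.8 else s
        g *= gamma
    return states, discounts, s, g

# Tabular "UOM": expected discounted occupancy U and discounted terminal distribution M.
U = np.zeros((n_states, n_states))
M = np.zeros((n_states, n_states))
n_rollouts = 2000
for start in range(n_states - 1):
    for _ in range(n_rollouts):
        states, discounts, s_term, g_term = run_option(start)
        for s, g in zip(states, discounts):
            U[start, s] += g / n_rollouts
        M[start, s_term] += g_term / n_rollouts
# M is the transition part of a traditional option model; U @ r supplies the reward part.

# Reward functions specified only now: the same reward-free model serves both.
r_cost = -np.ones(n_states)            # a step-cost reward
r_goal = np.eye(n_states)[2]           # reward only for occupying state 2
print("option returns, step-cost reward:", np.round(U @ r_cost, 2))
print("option returns, state-2 reward  :", np.round(U @ r_goal, 2))
```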


Edwards, A. L., Dawson, M. R., Hebert, J. S., Sutton, R. S., Chan, K. M., Pilarski, P.M. (2014). Adaptive switching in practice: Improving myoelectric prosthesis performance through reinforcement learning. In: Proceedings of MEC14: Myoelectric Controls Symposium, Fredericton, New Brunswick, Canada.


van Hasselt, H., Mahmood, A. R., Sutton, R. S. (2014). Off-policy TD(λ) with a true online equivalence. In: Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, Quebec City, Canada.

ABSTRACT: Van Seijen and Sutton (2014) recently proposed a new version of the linear TD(λ) learning algorithm that is exactly equivalent to an online forward view and that empirically performed better than its classical counterpart in both prediction and control problems. However, their algorithm is restricted to on-policy learning. In the more general case of off-policy learning, in which the policy whose outcome is predicted and the policy used to generate data may be different, their algorithm cannot be applied. One reason for this is that the algorithm bootstraps and thus is subject to instability problems when function approximation is used. A second reason true online TD(λ) cannot be used for off-policy learning is that the off-policy case requires sophisticated importance sampling in its eligibility traces. To address these limitations, we generalize their equivalence result and use this generalization to construct the first online algorithm to be exactly equivalent to an off-policy forward view. We show this algorithm, named true online GTD(λ), empirically outperforms GTD(λ) (Maei, 2011) which was derived from the same objective as our forward view but lacks the exact online equivalence. In the general theorem that allows us to derive this new algorithm, we encounter a new general eligibility-trace update.
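
For orientation, the sketch below shows the conventional off-policy linear TD(λ) update with importance-sampling ratios folded into the eligibility trace, the kind of baseline that the paper's true-online derivation improves on; it is not the true online GTD(λ) algorithm itself, and the random features and ratios in the driver are stand-ins.

```python
import numpy as np

def offpolicy_td_lambda_step(w, e, x, x_next, reward, rho,
                             alpha=0.1, gamma=0.99, lam=0.8):
    """One step of conventional off-policy linear TD(lambda).

    rho is the importance-sampling ratio pi(a|s) / b(a|s) for the action taken.
    This baseline bootstraps on every step and can be unstable off-policy,
    which is the behaviour the paper's derivation is designed to address.
    """
    delta = reward + gamma * w @ x_next - w @ x   # TD error
    e = rho * (gamma * lam * e + x)               # IS-weighted accumulating trace
    w = w + alpha * delta * e
    return w, e

# Tiny driver with random features and ratios (illustration only).
rng = np.random.default_rng(0)
n = 8
w, e = np.zeros(n), np.zeros(n)
x = rng.random(n)
for _ in range(100):
    x_next, reward, rho = rng.random(n), rng.normal(), rng.uniform(0.5, 1.5)
    w, e = offpolicy_td_lambda_step(w, e, x, x_next, reward, rho)
    x = x_next
```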


Sutton, R. S., Mahmood, A. R., Precup, D., van Hasselt, H. (2014). A new Q(λ) with interim forward view and Monte Carlo equivalence. In: Proceedings of the 31st International Conference on Machine Learning, Beijing, China.

ABSTRACT: Q-learning, the most popular of reinforcement learning algorithms, has always included an extension to eligibility traces to enable more rapid learning and improved asymptotic performance on non-Markov problems. The λ parameter smoothly shifts on-policy algorithms such as TD(λ) and Sarsa(λ) from a pure bootstrapping form (λ = 0) to a pure Monte Carlo form (λ = 1). In off-policy algorithms, including Q(λ), GQ(λ), and off-policy LSTD(λ), the λ parameter is intended to play the same role, but does not; on every exploratory action these algorithms bootstrap regardless of the value of λ, and as a result they fail to approximate Monte Carlo learning when λ = 1. It may seem that this is inevitable for any online off-policy algorithm; if updates are made on each step on which the target policy is followed, then how could just the right updates be ‘un-made’ upon deviation from the target policy? In this paper, we introduce a new version of Q(λ) that does exactly that, without significantly increased algorithmic complexity. En route to our new Q(λ), we introduce a new derivation technique based on the forward-view/backward-view analysis familiar from TD(λ) but extended to apply at every time step rather than only at the end of episodes. We apply this technique to derive first a new off-policy version of TD(λ), called PTD(λ), and then our new Q(λ), called PQ(λ).
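
The bootstrapping-on-exploratory-actions behaviour described above is easiest to see in tabular Watkins-style Q(λ), sketched below with assumed toy dimensions: the traces are cut to zero whenever the action taken is not greedy, so λ = 1 never recovers Monte Carlo learning. This is the baseline that the paper's PQ(λ) repairs, not the new algorithm itself.

```python
import numpy as np

def watkins_q_lambda_step(Q, E, s, a, r, s_next, a_next,
                          alpha=0.1, gamma=0.95, lam=0.9):
    """One step of tabular Watkins's Q(lambda).

    Traces are zeroed whenever the action actually taken (a_next) is not
    greedy for the next state, so the update falls back to bootstrapping
    regardless of lambda, which is the limitation discussed in the abstract.
    """
    greedy_next = np.argmax(Q[s_next])
    delta = r + gamma * Q[s_next, greedy_next] - Q[s, a]
    E[s, a] += 1.0                      # accumulating trace
    Q += alpha * delta * E
    if a_next == greedy_next:
        E *= gamma * lam                # decay traces along the greedy path
    else:
        E[:] = 0.0                      # exploratory action: cut all traces
    return Q, E

# Toy driver: 5 states, 2 actions, random transitions (illustration only).
rng = np.random.default_rng(0)
Q, E = np.zeros((5, 2)), np.zeros((5, 2))
s, a = 0, 0
for _ in range(200):
    r, s_next = rng.normal(), rng.integers(5)
    a_next = rng.integers(2)            # stand-in for an epsilon-greedy choice
    Q, E = watkins_q_lambda_step(Q, E, s, a, r, s_next, a_next)
    s, a = s_next, a_next
```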


van Seijen, H., Sutton, R. S. (2014). True online TD(λ). In: Proceedings of the 31st International Conference on Machine Learning, Beijing, China. As published. With two minor errors corrected.

ABSTRACT: TD(λ) is a core algorithm of modern reinforcement learning. Its appeal comes from its equivalence to a clear and conceptually simple forward view, and the fact that it can be implemented online in an inexpensive manner. However, the equivalence between TD(λ) and the forward view is exact only for the off-line version of the algorithm (in which updates are made only at the end of each episode). In the online version of TD(λ) (in which updates are made at each step, which generally performs better and is always used in applications) the match to the forward view is only approximate. In a sense this is unavoidable for the conventional forward view, as it itself presumes that the estimates are unchanging during an episode. In this paper we introduce a new forward view that takes into account the possibility of changing estimates and a new variant of TD(λ) that exactly achieves it. Our algorithm uses a new form of eligibility trace similar to but different from conventional accumulating and replacing traces. The overall computational complexity is the same as TD(λ), even when using function approximation. In our empirical comparisons, our algorithm outperformed TD(λ) in all of its variations. It seems, by adhering more truly to the original goal of TD(λ)—matching an intuitively clear forward view even in the online case—that we have found a new algorithm that simply improves on classical TD(λ).
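
For readers who want the update in executable form, the sketch below follows the true online TD(λ) equations as they are usually presented (the dutch-style eligibility trace and the V_old correction term); the random features and rewards in the driver are only stand-ins for a real prediction problem.

```python
import numpy as np

def true_online_td_lambda(features, rewards, alpha=0.05, gamma=0.99, lam=0.9):
    """True online TD(lambda) with linear function approximation.

    features[t] is the feature vector x_t and rewards[t] is R_{t+1}.
    """
    n = features.shape[1]
    w, z, v_old = np.zeros(n), np.zeros(n), 0.0
    for t in range(len(rewards)):
        x, x_next = features[t], features[t + 1]
        v, v_next = w @ x, w @ x_next
        delta = rewards[t] + gamma * v_next - v
        z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x)) * x   # dutch trace
        w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
        v_old = v_next
    return w

# Stand-in data: random features and rewards (illustration only).
rng = np.random.default_rng(0)
T, n = 500, 10
features = rng.random((T + 1, n))
rewards = rng.normal(size=T)
print(true_online_td_lambda(features, rewards))
```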



Modayil, J., White, A., Sutton, R. S. (2014). Multi-timescale Nexting in a Reinforcement Learning Robot. Adaptive Behavior 22(2):146-160.

ABSTRACT: The term ‘nexting’ has been used by psychologists to refer to the propensity of people and many other animals to continually predict what will happen next in an immediate, local, and personal sense. The ability to ‘next’ constitutes a basic kind of awareness and knowledge of one’s environment. In this paper we present results with a robot that learns to next in real time, making thousands of predictions about sensory input signals at timescales from 0.1 to 8 seconds. Our predictions are formulated as a generalization of the value functions commonly used in reinforcement learning, where now an arbitrary function of the sensory input signals is used as a pseudo reward, and the discount rate determines the timescale. We show that six thousand predictions, each computed as a function of six thousand features of the state, can be learned and updated online ten times per second on a laptop computer, using the standard TD(λ) algorithm with linear function approximation. This approach is sufficiently computationally efficient to be used for real-time learning on the robot and sufficiently data efficient to achieve substantial accuracy within 30 minutes. Moreover, a single tile-coded feature representation suffices to accurately predict many different signals over a significant range of timescales. We also extend nexting beyond simple timescales by letting the discount rate be a function of the state and show that nexting predictions of this more general form can also be learned with substantial accuracy. General nexting provides a simple yet powerful mechanism for a robot to acquire predictive knowledge of the dynamics of its environment.
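
A minimal sketch of the nexting idea under assumed toy settings (random binary features and a random sensor stream in place of the robot's tile-coded sensorimotor data): the many predictions differ only in which signal serves as pseudo reward and which discount rate sets the timescale, and all of them share one feature vector and one TD(λ) update per time step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_sensors = 50, 4
timescales = np.array([1.0, 10.0, 80.0])            # in time steps
gammas = 1.0 - 1.0 / timescales                     # the discount rate sets the timescale

# One prediction (GVF) per (sensor, timescale) pair; all share the same features.
n_preds = n_sensors * len(timescales)
gamma_vec = np.tile(gammas, n_sensors)              # discount for each prediction
W = np.zeros((n_preds, n_features))                 # one weight vector per prediction
Z = np.zeros_like(W)                                # eligibility traces
alpha, lam = 0.1 / 5, 0.9                           # 5 = number of active features below

x = np.zeros(n_features)
for t in range(2000):
    # Stand-in for the robot's sensor readings and sparse binary features.
    sensors = rng.random(n_sensors)
    x_next = np.zeros(n_features)
    x_next[rng.choice(n_features, size=5, replace=False)] = 1.0
    cumulants = np.repeat(sensors, len(timescales))  # pseudo reward for each prediction

    delta = cumulants + gamma_vec * (W @ x_next) - W @ x     # one TD error per GVF
    Z = (gamma_vec * lam)[:, None] * Z + x                    # accumulating traces
    W += alpha * delta[:, None] * Z
    x = x_next
```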


Edwards, A. L., Kearney, A., Dawson, M. R., Sutton, R. S., Pilarski, P. M. (2013). Temporal-difference learning to assist human decision making during the control of an artificial limb. In: Proceedings of the First Interdisciplinary Conference on Reinforcement Learning and Decision Making, and http://arxiv.org/abs/1309.4714.

ABSTRACT: In this work we explore the use of reinforcement learning (RL) to help with human decision making, combining state-of-the-art RL algorithms with an application to prosthetics. Managing human-machine interaction is a problem of considerable scope, and the simplification of human-robot interfaces is especially important in the domains of biomedical technology and rehabilitation medicine. For example, amputees who control artificial limbs are often required to quickly switch between a number of control actions or modes of operation in order to operate their devices. We suggest that by learning to anticipate (predict) a user's behaviour, artificial limbs could take on an active role in a human's control decisions so as to reduce the burden on their users. Recently, we showed that RL in the form of general value functions (GVFs) could be used to accurately detect a user's control intent prior to their explicit control choices. In the present work, we explore the use of temporal-difference learning and GVFs to predict when users will switch their control influence between the different motor functions of a robot arm. Experiments were performed using a multi-function robot arm that was controlled by muscle signals from a user's body (similar to conventional artificial limb control). Our approach was able to acquire and maintain forecasts about a user's switching decisions in real time. It also provides an intuitive and reward-free way for users to correct or reinforce the decisions made by the machine learning system. We expect that when a system is certain enough about its predictions, it can begin to take over switching decisions from the user to streamline control and potentially decrease the time and effort needed to complete tasks. This preliminary study therefore suggests a way to naturally integrate human- and machine-based decision making systems.


Silver, D., Sutton, R. S., Müller, M. (2013). Temporal-difference search in computer Go. In: Proceedings of the ICAPS-13 Workshop on Planning and Learning.

ABSTRACT: Temporal-difference (TD) learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. Monte-Carlo tree search is a recent algorithm for simulation-based search, which has been used to achieve master-level play in Go. We have introduced a new approach to high-performance planning (Silver et al., 2012). Our method, TD search, combines TD learning with simulation-based search. Like Monte-Carlo tree search, value estimates are updated by learning online from simulated experience. Like TD learning, it uses value function approximation and bootstrapping to efficiently generalise between related states. We applied TD search to the game of 9 × 9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed a vanilla Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9 × 9 Computer Go Server.


Mahmood, A. R., Sutton, R. S. (2013). Representation Search through Generate and Test. In: Proceedings of the AAAI Workshop on Learning Rich Representations from Low-Level Sensors, Bellevue, Washington, USA.

ABSTRACT: Learning representations from data is one of the fundamental problems of artificial intelligence and machine learning. Many different approaches exist for learning representations, but what constitutes a good representation is not yet well understood. Much less is known about how to learn representations from an unending stream of data. In this work, we view the problem of representation learning as one of learning features (e.g., hidden units of neural networks) such that performance of the underlying base system continually improves. We study an important case where learning is done fully online (i.e., on an example-by-example basis). In the presence of an unending stream of data, the computational cost of the learning element should not grow with time and cannot be much more than that of the performance element. Few methods can be used effectively in this case. We show that a search approach to representation learning can naturally fit into this setting. In this approach good representations are searched by generating different features and then testing them for utility. We show that a fully online method can be developed, which utilizes this generate and test approach, learns representations continually, and adds only a small fraction to the overall computation. We believe online representation search constitutes an important step toward effective and inexpensive solutions to representation learning problems.
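
A minimal sketch of the generate-and-test loop under assumed toy settings: randomly generated features feed an online linear learner, each feature's utility is read off the magnitude of its outgoing weight, and a small fraction of the lowest-utility features is replaced with fresh candidates at every step. This illustrates only the search loop, not the particular generator or tester studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_features = 10, 100
alpha, replace_fraction = 0.01, 0.01

V = rng.normal(size=(n_features, n_inputs))   # input weights of the generated features
w = np.zeros(n_features)                      # outgoing weights of the base learner
age = np.zeros(n_features)

def target_fn(s):                             # unknown function being tracked (stand-in)
    return np.sin(s[:3].sum())

for t in range(20000):
    s = rng.normal(size=n_inputs)
    phi = np.tanh(V @ s)                      # feature outputs
    err = target_fn(s) - w @ phi
    w += alpha * err * phi                    # online LMS update of the base system
    age += 1

    # Tester: utility of a feature ~ |outgoing weight|; bias against replacing young features.
    utility = np.abs(w) + (age < 500) * 1e6
    n_replace = max(1, int(replace_fraction * n_features))
    worst = np.argsort(utility)[:n_replace]
    V[worst] = rng.normal(size=(n_replace, n_inputs))   # generate fresh candidates
    w[worst] = 0.0
    age[worst] = 0.0
```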


van Seijen, H., Sutton, R. S. (2013). Efficient planning in MDPs by small backups. In: Proceedings of the 30th International Conference on Machine Learning, pp. 361-369, Atlanta, USA.

ABSTRACT: Efficient planning plays a crucial role in model-based reinforcement learning. This efficiency is largely determined by the order in which backups are performed. A particularly effective ordering strategy is the strategy employed by prioritized sweeping. Prioritized sweeping orders backups according to a heuristic, such that backups that cause a large value change are selected first. The Bellman error predicts the value change caused by a full backup exactly, but its computation is expensive. Hence, methods do not use the Bellman error as a heuristic, but try to approximate the ordering induced by it with a heuristic that is cheaper to compute. We introduce the first efficient prioritized sweeping method that uses exact Bellman error ordering. The core of this method is a novel backup that allows for efficient computation of the Bellman error, while its effect as well as its computational complexity is equal to that of a full backup. We demonstrate empirically that the obtained method achieves substantially higher convergence rates than other prioritized sweeping implementations.
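
For context, the sketch below gives classical prioritized sweeping on a known tabular model with the priority taken as the absolute Bellman error of a state; the paper's contribution, making this exact ordering cheap through small backups, is not reproduced here, and the toy MDP is an assumption of the example.

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 20, 2, 0.95

# Known toy model: P[s, a] is a distribution over next states, R[s, a] an expected reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))
V = np.zeros(n_states)

def bellman_error(s):
    return np.max(R[s] + gamma * P[s] @ V) - V[s]

# Priority queue ordered by |Bellman error|; stale entries are re-checked on pop.
queue = [(-abs(bellman_error(s)), s) for s in range(n_states)]
heapq.heapify(queue)

for _ in range(2000):
    if not queue:
        break
    _, s = heapq.heappop(queue)
    error = bellman_error(s)            # recompute in case the stored priority is stale
    if abs(error) < 1e-8:
        continue
    V[s] += error                       # full backup of state s
    # The Bellman errors of predecessors of s may have changed; re-queue them.
    for s_pred in range(n_states):
        if P[s_pred, :, s].max() > 0.0:
            heapq.heappush(queue, (-abs(bellman_error(s_pred)), s_pred))

print("largest remaining |Bellman error|:",
      max(abs(bellman_error(s)) for s in range(n_states)))
```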


Pilarski, P. M., Dick, T. B., Sutton, R. S. (2013). Real-time prediction learning for the simultaneous actuation of multiple prosthetic joints. In: Proceedings of the 2013 IEEE International Conference on Rehabilitation Robotics, Seattle, USA, June 24–26. 8 pages.

ABSTRACT: Integrating learned predictions into a prosthetic control system promises to enhance multi-joint prosthesis use by amputees. In this article, we present a preliminary study of different cases where it may be beneficial to use a set of temporally extended predictions—learned and maintained in real time—within an engineered or learned prosthesis controller. Our study demonstrates the first successful combination of actor-critic reinforcement learning with real-time prediction learning. We evaluate this new approach to control learning during the myoelectric operation of a robot limb. Our results suggest that the integration of real-time prediction and control learning may speed control policy acquisition, allow unsupervised adaptation in myoelectric controllers, and facilitate synergies in highly actuated limbs. These experiments also show that temporally extended prediction learning enables anticipatory actuation, opening the way for coordinated motion in assistive robotic devices. Our work therefore provides initial evidence that real-time prediction learning is a practical way to support intuitive joint control in increasingly complex prosthetic systems.


Pilarski, P. M., Dawson, M. R., Degris, T., Chan, K. M., Hebert, J. S., Carey, J. P., Sutton, R. S. (2013). Adaptive Artificial Limbs: A Real-time Approach to Prediction and Anticipation, IEEE Robotics & Automation Magazine 20(1):53-64.

ABSTRACT: Predicting the future has long been regarded as a powerful means to improvement and success. The ability to make accurate and timely predictions enhances our ability to control our situation and our environment. Assistive robotics is one prominent area where foresight of this kind can bring improved quality of life. In this article, we present a new approach to acquiring and maintaining predictive knowledge during the online, ongoing operation of an assistive robot. The ability to learn accurate, temporally abstracted predictions is shown through two case studies—able-bodied subjects engaging in the myoelectric control of a robot arm, and an amputee participant’s interactions with a myoelectric training robot. To our knowledge, this work is the first demonstration of a practical method for real-time prediction learning during myoelectric control. Our approach therefore represents a fundamental tool for addressing one major unsolved problem: amputee-specific adaptation during the ongoing operation of a prosthetic device. The findings in this article also contribute a first explicit look at prediction learning in prosthetics as an important goal in its own right, independent of its intended use within a specific controller or system. Our results suggest that real-time learning of predictions and anticipations is a significant step towards more intuitive myoelectric prostheses and other assistive robotic devices.


Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2013). Timing and cue competition in conditioning of the nictitating membrane response of the rabbit (Oryctolagus cuniculus), Learning and Memory 20:97-102. Cold Spring Harbor Press.

ABSTRACT: Rabbits were classically conditioned using compounds of tone and light conditioned stimuli (CSs) presented with either simultaneous onsets (Experiment 1) or serial onsets (Experiment 2) in a delay conditioning paradigm. Training with the simultaneous compound reduced the likelihood of a conditioned response (CR) to the individual CSs (“mutual overshadowing”) but left CR timing unaltered. CR peaks were consistently clustered around the time of unconditioned stimulus (US) delivery. Training with the serial compound (CSA->CSB->US) reduced responding to CSB (“temporal primacy/information effect”) but this effect was prevented by prior CSB->US pairings. In both cases, serial compound training altered CR timing. On CSA->CSB test trials, the CRs were accelerated; the CR peaks occurred after CSB onset but well before the time of US delivery. Conversely, CRs on CSB– trials were decelerated; the distribution of CR peaks was variable but centered well after the US. Timing on CSB– trials was at most only slightly accelerated. The results are discussed with respect to processes of generalization and spectral timing applicable to the cerebellar and forebrain pathways in eyeblink preparations.


White, A., Modayil, J., Sutton, R. S. (2012). Scaling Life-long Off-policy Learning. In Proceedings of the Second Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics (ICDL-Epirob), San Diego, USA.

ABSTRACT: In this paper, we pursue an approach to scaling life-long learning using parallel, off-policy reinforcement-learning algorithms. In life-long learning, a robot continually learns from a life-time of experience, slowly acquiring and applying skills and knowledge to new situations. Many of the benefits of life-long learning are a result of scaling the amount of training data processed by the robot to long sensorimotor streams. Another dimension of scaling can be added by allowing off-policy sampling from the unending stream of sensorimotor data generated by a long-lived robot. Recent algorithmic developments have made it possible to apply off-policy algorithms to life-long learning, in a sound way, for the first time. We assess the scalability of these off-policy algorithms on a physical robot. We show that hundreds of accurate multi-step predictions can be learned about several policies in parallel and in real time. We present the first online measures of off-policy learning progress. Finally, we demonstrate that our robot, using the new off-policy measures, can learn 8000 predictions about 300 distinct policies, a substantial increase in scale compared to previous simulated and robotic life-long learning systems.


Modayil, J., White, A., Pilarski, P. M., Sutton, R. S. (2012). Acquiring a Broad Range of Empirical Knowledge in Real Time by Temporal-Difference Learning. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Seoul, Korea.

ABSTRACT: Several robot capabilities rely on predictions about the temporally extended consequences of a robot’s behaviour. We describe how a robot can both learn and make many such predictions in real time using a standard algorithm. Our experiments show that a mobile robot can learn and make thousands of accurate predictions at 10 Hz. The predictions are about the future of all of the robot’s sensors and many internal state variables at multiple time-scales. All the predictions share a single set of features and learning parameters. We demonstrate the generality of this method with an application to a different platform, a robot arm operating at 50 Hz. Here, learned predictions can be used to measurably improve the user interface. The temporally extended predictions learned in real time by this method constitute a basic form of knowledge about the dynamics of the robot’s interaction with the environment. We also show how this method can be extended to express more general forms of knowledge.


Modayil, J., White, A., Pilarski, P. M., Sutton, R. S. (2012). Acquiring Diverse Predictive Knowledge in Real Time by Temporal-difference Learning. International Workshop on Evolutionary and Reinforcement Learning for Autonomous Robot Systems, Montpellier, France.

ABSTRACT: Existing robot algorithms demonstrate several capabilities that are enabled by a robot’s knowledge of the temporally extended consequences of its behaviour. This knowledge consists of real-time predictions—predictions that are conventionally computed by iterating a small one-timestep model of the robot’s dynamics. Given the utility of such predictions, alternatives are desirable when this conventional approach is not applicable, for example when an adequate model of the one-timestep dynamics is either not available or not computationally tractable. We describe how a robot can both learn and make many such predictions in real-time using a standard reinforcement learning algorithm. Our experiments show that a mobile robot can learn and make thousands of accurate predictions at 10 Hz about the future of all of its sensors and many internal state variables at multiple time-scales. The method uses a single set of features and learning parameters that are shared across all the predictions. We demonstrate the generality of these predictions with an application to a different platform, a robot arm operating at 50 Hz. Here, the predictions are about which arm joint the user wants to move next, a difficult situation to model analytically, and we show how the learned predictions enable measurable improvements to the user interface. The predictions learned in real-time by this method constitute a basic form of knowledge about the robot’s interaction with the environment, and extensions of this method can express more general forms of knowledge.


Ludvig, E. A., Sutton, R. S., Kehoe, E. J. (2012). Evaluating the TD model of classical conditioning. Learning and Behavior 40(3):305-319. Springer.

ABSTRACT: The temporal-difference (TD) algorithm from reinforcement learning provides a simple method for incrementally learning predictions of upcoming events. Applied to classical conditioning, TD models suppose that animals learn a real-time prediction of the unconditioned stimulus (US) on the basis of all available conditioned stimuli (CSs). In the TD model, similar to other error-correction models, learning is driven by prediction errors—the difference between the change in US prediction and the actual US. With the TD model, however, learning occurs continuously from moment to moment and is not artificially constrained to occur in trials. Accordingly, a key feature of any TD model is the assumption about the representation of a CS on a moment-to-moment basis. Here, we evaluate the performance of the TD model with a heretofore unexplored range of classical conditioning tasks. To do so, we consider three stimulus representations that vary in their degree of temporal generalization and evaluate how the representation influences the performance of the TD model on these conditioning tasks.
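
A minimal sketch of the TD model of conditioning with one of the representations the paper considers, the complete serial compound (one feature per time step since CS onset); the trial layout, constants, and delay-conditioning setup are assumptions for illustration, and the other stimulus representations are not shown.

```python
import numpy as np

trial_len, cs_onset, cs_dur, us_time = 60, 10, 40, 45   # time steps within a trial
alpha, gamma, lam = 0.1, 0.97, 0.9

# Complete serial compound: one binary feature per time step since CS onset.
n_features = trial_len
def csc_features(t):
    x = np.zeros(n_features)
    if cs_onset <= t < cs_onset + cs_dur:
        x[t - cs_onset] = 1.0
    return x

w = np.zeros(n_features)
for trial in range(200):
    e = np.zeros(n_features)
    v_prev, x_prev = 0.0, np.zeros(n_features)
    for t in range(trial_len):
        x = csc_features(t)
        us = 1.0 if t == us_time else 0.0
        v = w @ x
        delta = us + gamma * v - v_prev        # moment-to-moment prediction error
        e = gamma * lam * e + x_prev           # eligibility trace over past stimulus elements
        w += alpha * delta * e
        v_prev, x_prev = v, x

# The learned US prediction ramps up toward the time of the US within the CS period.
print(np.round([w @ csc_features(t) for t in range(cs_onset, cs_onset + cs_dur)], 2))
```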


Pilarski, P. M., Sutton, R. S. (2012). Between instruction and reward: Human-prompted switching. Proceedings of the AAAI Fall Symposium on Robots Learning Interactively from Human Teachers. AAAI Technical Report FS-12-07, pp. 46–52.

ABSTRACT: Intelligent systems promise to amplify, augment, and extend innate human abilities. A principal example is that of assistive rehabilitation robots—artificial intelligence and machine learning enable new electromechanical systems that restore biological functions lost through injury or illness. In order for an intelligent machine to assist a human user, it must be possible for a human to communicate their intentions and preferences to their non-human counterpart. While there are a number of techniques that a human can use to direct a machine learning system, most research to date has focused on the contrasting strategies of instruction and reward. The primary contribution of our work is to demonstrate that the middle ground between instruction and reward is a fertile space for research and immediate technological progress. To support this idea, we introduce the setting of human-prompted switching, and illustrate the successful combination of switching with interactive learning using a concrete real-world example: human control of a multi-joint robot arm. We believe techniques that fall between the domains of instruction and reward are complementary to existing approaches, and will open up new lines of rapid progress for interactive human training of machine learning systems.


Pilarski, P. M., Dawson, M. R., Degris, T., Carey, J. P., Chan, K. M., Hebert, J. S., Sutton, R. S. (2012). Towards Prediction-Based Prosthetic Control. In Proceedings of the 2012 Conference of the International Functional Electrical Stimulation Society, September 9-12, Banff, Canada, 4 pages.

ABSTRACT: Predictions and anticipations form a basis for pattern-recognition-based control systems, including those used in next-generation prostheses and assistive devices. In this work, we outline how new methods in real-time prediction learning provide an approach to one of the principal open problems in multi-function myoelectric control—robust, ongoing, amputee-specific adaptation. Techniques from reinforcement learning and general value functions are applied to learn and update predictions during continuing human interactions with multi-function robotic appendages. Tests were performed with a targeted motor and sensory reinnervation amputee subject and non-amputee participants. Results show that this online prediction learning approach is able to accurately anticipate a diverse set of human and robot signals by more than two seconds, and at different levels of temporal abstraction. These methods allow predictions to be adapted during real-time use, which is an important step toward the intuitive control of powered prostheses and other assistive devices.


Degris, T., White, M., Sutton, R. S. (2012). Off-policy Actor-Critic. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Great Britain. Demo.

ABSTRACT: This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable a target policy to be learned while following and obtaining data from another (behavior) policy. For many problems, however, actor-critic methods are more practical than action value methods (like Greedy-GQ) because they explicitly represent the policy; consequently, the policy can be stochastic and utilize a large action space. In this paper, we illustrate how to practically combine the generality and learning potential of off-policy learning with the flexibility in action selection given by actor-critic methods. We derive an incremental, linear time and space complexity algorithm that includes eligibility traces, prove convergence under assumptions similar to previous off-policy algorithms, and empirically show better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems.
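
A heavily simplified sketch of the off-policy actor-critic idea: both the critic's TD update and the actor's policy-gradient step are weighted by the importance-sampling ratio of the target policy to the behaviour policy. For brevity the critic here is plain linear TD(0) rather than the gradient-TD critic of Off-PAC, eligibility traces are omitted, and the softmax policy, features, and stand-in environment are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 16, 3
gamma, alpha_v, alpha_u = 0.99, 0.1, 0.01

v = np.zeros(n_features)                      # critic weights
u = np.zeros((n_actions, n_features))         # actor weights (softmax policy)

def target_policy(x):
    prefs = u @ x
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

behaviour = np.ones(n_actions) / n_actions    # fixed uniform behaviour policy

x = rng.random(n_features)
for t in range(5000):
    a = rng.choice(n_actions, p=behaviour)
    reward, x_next = rng.normal(), rng.random(n_features)   # stand-in environment step

    pi = target_policy(x)
    rho = pi[a] / behaviour[a]                # importance-sampling ratio
    delta = reward + gamma * v @ x_next - v @ x

    v += alpha_v * rho * delta * x            # simplified critic (TD(0), not gradient-TD)
    grad_log_pi = -np.outer(pi, x)            # gradient of log softmax policy
    grad_log_pi[a] += x
    u += alpha_u * rho * delta * grad_log_pi  # off-policy policy-gradient step

    x = x_next
```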


Pilarski, P. M., Dawson, M. R., Degris, T., Carey, J. P., Sutton, R. S. (2012). Dynamic Switching and Real-time Machine Learning for Improved Human Control of Assistive Biomedical Robots. In Proceedings of the Fourth IEEE International Conference on Biomedical Robotics and Biomechatronics (BioRob), June 24-27, Roma, Italy, 296-302.

ABSTRACT: A general problem for human-machine interaction occurs when a machine’s controllable dimensions outnumber the control channels available to its human user. In this work, we examine one prominent example of this problem: amputee switching between the multiple functions of a powered artificial limb. We propose a dynamic switching approach that learns during ongoing interaction to anticipate user behaviour, thereby presenting the most effective control option for a given context or task. Switching predictions are learned in real time using temporal difference methods and reinforcement learning, and demonstrated within the context of a robotic arm and a multi-function myoelectric controller. We find that a learned, dynamic switching order is able to out-perform the best fixed (non-adaptive) switching regime on a standard prosthetic proficiency task, increasing the number of optimal switching suggestions by 23%, and decreasing the expected transition time between degrees of freedom by more than 14%. These preliminary results indicate that real-time machine learning, specifically online prediction and anticipation, may be an important tool for developing more robust and intuitive controllers for assistive biomedical robots. We expect these techniques will transfer well to near-term use by patients. Future work will describe clinical testing of this approach with a population of amputee patients.


Sutton, R. S. (2012). Beyond reward: The problem of knowledge and data (extended abstract). In: Proceedings of the 21st International Conference on Inductive Logic Programming, S.H. Muggleton, A. Tamaddoni-Nezhad, F.A. Lisi (Eds.): ILP 2011, LNAI 7207, pp. 2-6. Springer, Heidelberg.


Degris, T., Modayil, J. (2012). Scaling-up Knowledge for a Cognizant Robot. In notes of the AAAI Spring Symposium on Designing Intelligent Robots: Reintegrating AI.

ABSTRACT: This paper takes a new approach to the old adage that knowledge is the key for artificial intelligence. A cognizant robot is a robot with a deep and immediately accessible understanding of its interaction with the environment—an understanding the robot can use to flexibly adapt to novel situations. Such a robot will need a vast amount of situated, revisable, and expressive knowledge to display flexible intelligent behaviors. Instead of relying on human-provided knowledge, we propose that an arbitrary robot can autonomously acquire pertinent knowledge directly from everyday interaction with the environment. We show how existing ideas in reinforcement learning can enable a robot to maintain and improve its knowledge. The robot performs a continual learning process that scales-up knowledge acquisition to cover a large number of facts, skills and predictions. This knowledge has semantics that are grounded in sensorimotor experience. We see the approach of developing more cognizant robots as a necessary key step towards broadly competent robots.


Degris, T., Pilarski, P. M., Sutton, R. S. (2012). Model-Free Reinforcement Learning with Continuous Action in Practice. In Proceedings of the 2012 American Control Conference.

ABSTRACT: Reinforcement learning methods are often considered as a potential solution to enable a robot to adapt to changes in real time to an unpredictable environment. However, with continuous action, only a few existing algorithms are practical for real-time learning. In such a setting, most effective methods have used a parameterized policy structure, often with a separate parameterized value function. The goal of this paper is to assess such actor-critic methods to form a fully specified practical algorithm. Our specific contributions include 1) developing the extension of existing incremental policy-gradient algorithms to use eligibility traces, 2) an empirical comparison of the resulting algorithms using continuous actions, 3) the evaluation of a gradient-scaling technique that can significantly improve performance. Finally, we apply our actor-critic algorithm to learn on a robotic platform with a fast sensorimotor cycle (10ms). Overall, these results constitute an important step towards practical real-time learning control with continuous action.
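
A compact sketch in the spirit of contribution (1) above: an actor-critic with a Gaussian policy over a continuous action and eligibility traces on both actor and critic. The radial-basis features, toy tracking task, constants, and the clipping of the exploration scale are assumptions of the example, and the gradient-scaling technique of contribution (3) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 8
gamma, lam, alpha_v, alpha_u = 0.97, 0.7, 0.05, 0.005
centers = np.linspace(-1.0, 1.0, n_features)

def features(obs):
    return np.exp(-((obs - centers) ** 2) / 0.1)      # radial-basis features of the observation

v = np.zeros(n_features)                              # critic weights
w_mu, w_sigma = np.zeros(n_features), np.zeros(n_features)   # actor: mean and log-std
e_v, e_mu, e_sigma = (np.zeros(n_features) for _ in range(3))

obs = 0.0
x = features(obs)
for t in range(20000):
    mu = w_mu @ x
    sigma = np.exp(np.clip(w_sigma @ x, -4.0, 1.0))   # keep exploration scale sane (toy example)
    a = rng.normal(mu, sigma)                         # continuous action from the Gaussian policy

    reward = -(a - obs) ** 2                          # stand-in task: track the moving target
    obs_next = np.sin(0.003 * t)
    x_next = features(obs_next)

    delta = reward + gamma * v @ x_next - v @ x       # TD error shared by actor and critic

    e_v = gamma * lam * e_v + x                       # critic trace and update
    v += alpha_v * delta * e_v

    g_mu = (a - mu) / sigma**2 * x                    # Gaussian policy-gradient terms
    g_sigma = ((a - mu) ** 2 / sigma**2 - 1.0) * x
    e_mu = gamma * lam * e_mu + g_mu                  # actor traces and updates
    e_sigma = gamma * lam * e_sigma + g_sigma
    w_mu += alpha_u * delta * e_mu
    w_sigma += alpha_u * delta * e_sigma

    x = x_next
```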



Mahmood, A. R., Sutton, R. S., Degris, T., Pilarski, P. M. (2012). Tuning-free step-size adaptation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Kyoto, Japan.


ABSTRACT: Incremental learning algorithms based on gradient descent are effective and popular in online supervised learning, reinforcement learning, signal processing, and many other application areas. An oft-noted drawback of these algorithms is that they include a step-size parameter that needs to be tuned for best performance, which might require manual intervention and significant domain knowledge or additional data. In many cases, an entire vector of step-size parameters (e.g., one for each input feature) needs to be tuned in order to attain the best performance of the algorithm. To address this, several methods have been proposed for adapting step sizes online. For example, Sutton’s IDBD method can find the best vector step size for the LMS algorithm, and Schraudolph’s ELK1 method, an extension of IDBD to neural networks, has proven effective on large applications, such as 3D hand tracking. However, to date all such step-size adaptation methods have included a tunable step-size parameter of their own, which we call the meta-step-size parameter. In this paper we show that the performance of existing step-size adaptation methods is strongly dependent on the choice of their meta-step-size parameter and that their meta-step-size parameter cannot be set reliably in a problem-independent way. We introduce a series of modifications and normalizations to the IDBD method that together eliminate the need to tune the meta-step-size parameter to the particular problem. We show that the resulting overall algorithm, called Autostep, performs as well or better than the existing step-size adaptation methods on a number of idealized and robot prediction problems and does not require any tuning of its meta-step-size parameter. The ideas behind Autostep are not restricted to the IDBD method and the same principles are potentially applicable to other incremental learning settings, such as reinforcement learning.
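
As background, the sketch below gives Sutton's IDBD rule for the LMS setting as usually presented, the starting point that Autostep builds on; the meta-step-size value and the toy non-stationary regression stream are assumptions of the example, and Autostep's normalizations themselves are not reproduced.

```python
import numpy as np

def idbd_step(w, beta, h, x, target, theta=0.01):
    """One step of IDBD: per-weight step sizes alpha_i = exp(beta_i), adapted online."""
    err = target - w @ x
    beta = beta + theta * err * x * h          # meta-gradient update of the log step sizes
    alpha = np.exp(beta)
    w = w + alpha * err * x                    # LMS update with per-weight step sizes
    h = h * np.clip(1.0 - alpha * x * x, 0.0, None) + alpha * err * x
    return w, beta, h

# Toy non-stationary regression stream (illustration only).
rng = np.random.default_rng(0)
n = 20
w, beta, h = np.zeros(n), np.full(n, np.log(0.05)), np.zeros(n)
w_true = np.zeros(n)
for t in range(10000):
    if t % 1000 == 0:
        w_true[:5] = rng.normal(size=5)        # only the first 5 inputs are ever relevant
    x = rng.normal(size=n)
    target = w_true @ x + 0.1 * rng.normal()
    w, beta, h = idbd_step(w, beta, h, x, target)
```

Over time the adapted step sizes for the five relevant inputs grow while those for the irrelevant inputs shrink, which is the behaviour Autostep preserves while removing the need to tune theta.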



Silver, D., Sutton, R. S., Müller, M. (2012). Temporal-difference search in computer Go, Machine Learning 87(2):183-219. Pdf copy for personal research purposes only.
ABSTRACT: Temporal-difference learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. The key idea is to update a value function from episodes of real experience, by bootstrapping from future value estimates, and using value function approximation to generalise between related states. Monte-Carlo tree search is a recent algorithm for high-performance search, which has been used to achieve master-level play in Go. The key idea is to use the mean outcome of simulated episodes of experience to evaluate each state in a search tree. We introduce a new approach to high-performance search in Markov decision processes and two-player games. Our method, temporal-difference search, combines temporal-difference learning with simulation-based search. Like Monte-Carlo tree search, the value function is updated from simulated experience; but like temporal-difference learning, it uses value function approximation and bootstrapping to efficiently generalise between related states. We apply temporal-difference search to the game of 9 × 9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed an unenhanced Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9 × 9 Computer Go Server.
 
Modayil, J., White, A., Sutton, R. S. (2012). Multi-timescale Nexting in a Reinforcement Learning Robot. Presented at the 2012 International Conference on Adaptive Behaviour, Odense, Denmark. To appear in: SAB 12, LNAI 7426, pp. 299-309, T. Ziemke, C. Balkenius, and J. Hallam, Eds., Springer Heidelberg. Also ArXiv preprint 1112.1133.
 
Pilarski, P. M., Dawson, M. R., Degris, T., Fahimi, F., Carey, J. P., Sutton, R. S. (2011). Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning, Proceedings of the 2011 IEEE International Conference on Rehabilitation Robotics (ICORR), Zurich, Switzerland, pp. 134-140.

ABSTRACT: As a contribution toward the goal of adaptable, intelligent artificial limbs, this work introduces a continuous actor-critic reinforcement learning method for optimizing the control of multi-function myoelectric devices. Using a simulated upper-arm robotic prosthesis, we demonstrate how it is possible to derive successful limb controllers from myoelectric data using only a sparse human-delivered training signal, without requiring detailed knowledge about the task domain. This reinforcement-based machine learning framework is well suited for use by both patients and clinical staff, and may be easily adapted to different application domains and the needs of individual amputees. To our knowledge, this is the first myoelectric control approach that facilitates the online learning of new amputee-specific motions based only on a one-dimensional (scalar) feedback signal provided by the user of the prosthesis.


Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the Tenth International Conference on Autonomous Agents and Multiagent Systems, Taipei, Taiwan.
ABSTRACT: Maintaining accurate world knowledge in a complex and changing environment is a perennial problem for robots and other artificial intelligence systems.  Our architecture for addressing this problem, called Horde, consists of a large number of independent reinforcement learning sub-agents, or demons. Each demon is responsible for answering a single predictive or goal-oriented question about the world, thereby contributing in a factored, modular way to the system's overall knowledge. The questions are in the form of a value function, but each demon has its own policy, reward function, termination function, and terminal-reward function unrelated to those of the base problem. Learning proceeds in parallel by all demons simultaneously so as to extract the maximal training information from whatever actions are taken by the system as a whole. Gradient-based temporal-difference learning methods are used to learn efficiently and reliably with function approximation in this off-policy setting. Horde runs in constant time and memory per time step, and is thus suitable for learning online in real-time applications such as robotics. We present results using Horde on a multi-sensored mobile robot to successfully learn goal-oriented behaviors and long-term predictions from off-policy experience. Horde is a significant incremental step towards a real-time architecture for efficient learning of general knowledge from unsupervised sensorimotor interaction.
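
A schematic sketch of the Horde organization under simplifying assumptions: each demon holds one generalized value function with its own cumulant, termination, and target policy, and every demon updates in parallel from the single behaviour stream. For brevity each demon here learns with importance-sampling-weighted TD(0) rather than the gradient-TD methods the architecture relies on, and the observation stream and feature vectors are stand-ins.

```python
import numpy as np
from dataclasses import dataclass, field

n_features, n_actions = 32, 4
rng = np.random.default_rng(0)

@dataclass
class Demon:
    """One GVF: its own question (cumulant, gamma, target policy) and answer (weights)."""
    cumulant: callable          # maps the observation to a pseudo reward
    gamma: float                # termination / timescale
    target_probs: np.ndarray    # target policy (state-independent here for brevity)
    w: np.ndarray = field(default_factory=lambda: np.zeros(n_features))

    def update(self, x, a, behaviour_probs, obs_next, x_next, alpha=0.1):
        rho = self.target_probs[a] / behaviour_probs[a]
        delta = self.cumulant(obs_next) + self.gamma * self.w @ x_next - self.w @ x
        self.w += alpha * rho * delta * x      # simplified off-policy TD(0) update

# Many demons asking different questions about the same sensorimotor stream.
demons = [Demon(cumulant=lambda o, i=i: o[i % 4],
                gamma=g,
                target_probs=rng.dirichlet(np.ones(n_actions)))
          for i in range(50) for g in (0.0, 0.9, 0.99)]

behaviour_probs = np.ones(n_actions) / n_actions
x = np.zeros(n_features)
for t in range(1000):
    a = rng.choice(n_actions, p=behaviour_probs)
    obs_next = rng.random(4)                   # stand-in robot observation
    x_next = (rng.random(n_features) < 0.1).astype(float)
    for demon in demons:                       # all demons learn in parallel from one stream
        demon.update(x, a, behaviour_probs, obs_next, x_next)
    x = x_next
```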
 
Modayil, J., Pilarski, P. M., White, A., Degris, T., Sutton, R. S. (2010). Off-policy knowledge maintenance for robots, Robotics Science and Systems Workshop (Towards Closing the Loop: Active Learning for Robotics). Extended abstract and poster.

ABSTRACT: A fundamental difficulty in robotics arises from changes in the experienced environment—periods when the robot’s current situation differs from past experience. We present an architecture whereby many independent reinforcement learning agents (or demons) observe the behaviour of a single robot. Each demon learns one piece of world knowledge represented with a generalized value function. This architecture allows the demons to update their knowledge online and off-policy from the robot’s behaviour. We present one approach to active exploration using curiosity—an internal measure of learning progress—and conclude with a preliminary result showing how a robot can adapt its prediction of the time needed to come to a full stop.

Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2010). Timing in trace conditioning of the nictitating membrane response of the rabbit (Oryctolagus cuniculus): Scalar, nonscalar, and adaptive features. Learning and Memory 17:600-604. Cold Spring Harbor Press.

ABSTRACT: Using interstimulus intervals (ISIs) of 125, 250, and 500 msec in trace conditioning of the rabbit nictitating membrane response, the offset times and durations of conditioned responses (CRs) were collected along with onset and peak latencies. All measures were proportional to the ISI, but only onset and peak latencies conformed to the criterion for scalar timing. Regarding the CR’s possible protective overlap of the unconditioned stimulus (US), CR duration increased with ISI, while the peak’s alignment with the US declined. Implications for models of timing and CR adaptiveness are discussed.


Maei, H. R., Szepesvari, Cs., Bhatnagar, S., Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel.
ABSTRACT: We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation.
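
A sketch of the one-step Greedy-GQ update with linear action-value features, as I understand it from this line of work: the correction term uses the features of the greedy next action, and a second weight vector is maintained alongside the main one. The feature shapes, step sizes, and random driver are assumptions of the illustration.

```python
import numpy as np

def greedy_gq_step(theta, w, phi_sa, phi_next_all, reward,
                   alpha=0.05, beta=0.05, gamma=0.99):
    """One step of (one-step) Greedy-GQ with linear action-value features.

    phi_sa: features of the state-action pair actually taken.
    phi_next_all: features of every action in the next state (one row per action).
    """
    q_next = phi_next_all @ theta
    phi_bar = phi_next_all[np.argmax(q_next)]          # features of the greedy next action
    delta = reward + gamma * q_next.max() - theta @ phi_sa
    theta = theta + alpha * (delta * phi_sa - gamma * (w @ phi_sa) * phi_bar)
    w = w + beta * (delta - w @ phi_sa) * phi_sa       # secondary weight vector
    return theta, w

# Toy driver with random features and a fixed behaviour policy (illustration only).
rng = np.random.default_rng(0)
n, n_actions = 12, 3
theta, w = np.zeros(n), np.zeros(n)
for t in range(1000):
    phi_sa = rng.random(n)
    phi_next_all = rng.random((n_actions, n))
    theta, w = greedy_gq_step(theta, w, phi_sa, phi_next_all, rng.normal())
```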

Maei, H. R., Sutton, R. S. (2010). GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, Lugano, Switzerland. GQ(λ) quick reference guide. GQ(λ) reference implementation in Java, C++, and C with header file.

ABSTRACT: A new family of gradient temporal-difference learning algorithms has recently been introduced by Sutton, Maei and others in which function approximation is much more straightforward. In this paper, we introduce the GQ(λ) algorithm which can be seen as an extension of that work to a more general setting including eligibility traces and off-policy learning of temporally abstract predictions. These extensions bring us closer to the ultimate goal of this work—the development of a universal prediction learning algorithm suitable for learning experientially grounded knowledge of the world. Eligibility traces are essential to this goal because they bridge the temporal gaps in cause and effect when experience is processed at a temporally fine resolution. Temporally abstract predictions are also essential as the means for representing abstract, higher-level knowledge about courses of action, or options. GQ(λ) can be thought of as an extension of Q-learning. We extend existing convergence results for policy evaluation to this setting and carry out a forward-view/backward-view analysis to derive and prove the validity of the new algorithm.


Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2009). Magnitude and timing of conditioned responses in delay and trace classical conditioning of the nictitating membrane response of the rabbit (Oryctolagus cuniculus), Behavioral Neuroscience 123(5):1095-1101.

ABSTRACT: The present experiment characterized conditioned nictitating membrane (NM) movements as a function of CS duration, using the full range of discernible movements (>.06 mm) rather than movements exceeding a conventional criterion (>.50 mm). The CS–US interval was fixed at 500 ms, while across groups, the duration of the CS was 50 ms (trace), 550 ms (delay), or 1050 ms (extended delay). The delay group showed the highest level of acquisition. When tested with the different CS durations, the delay and extended delay groups showed large reductions in their responses when their CS was shortened to 50 ms, but the trace group maintained its response at all durations. Timing of the conditioned movements appeared similar across all manipulations. The results suggest that the CS has both a fine timing function tied to CS onset and a general predictive function tied to CS duration, both of which may be mediated by cerebellar pathways.


Maei, H. R., Szepesvari, Cs., Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S. (2010). Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation. In Advances in Neural Information Processing Systems 22, Vancouver, BC, pp. 1204-1212. December 2009. MIT Press.

ABSTRACT: We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient-descent on this function. These methods can be viewed as natural generalizations to previous TD methods, as they converge to the same limit points when used with linear function approximation methods. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms’ effectiveness.


Interview with Richard S. Sutton, by Verena Heidrich-Meisner of Künstliche Intelligenz. (2009). pp. 41-43.


Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., Lee, M. (2009). Natural Actor-Critic Algorithms. Automatica. Unfortunately, due to a disagreement with the publisher, and to the mistake of using Bibtex, all the citations in this paper are in alphabetical order rather than the order in which credit is due. We regret this. Conference version. TR version with experiments.

ABSTRACT: We present four new reinforcement learning algorithms based on actor-critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradient in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.


Sutton, R. S. (2009). The grand challenge of predictive empirical abstract knowledge. Working Notes of the IJCAI-09 Workshop on Grand Challenges for Reasoning from Experiences.

ABSTRACT: We survey ongoing work at the University of Alberta on an experience-oriented approach to artificial intelligence (AI) based on reinforcement learning ideas. We seek to ground world knowledge in a minimal ontology of just the signals passing back and forth between the AI agent and its environment at a fine temporal scale. The challenge is to connect these low-level signals to higher-level representations in such a way that the knowledge remains grounded and autonomously verifiable. The mathematical ideas of temporally abstract options, option models, and temporal-difference networks can be applied to begin to address this challenge. This has been illustrated in several simple computational worlds, and we seek now to extend the approach to a physically realized robot. A recent theoretical development is the extension of simple temporal-difference methods to off-policy forms using function approximation; this should enable, for the first time, efficient and reliable intra-option learning, bringing the goal of predictive empirical abstract knowledge closer to achievement.


Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, Cs., Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada. Presentation at ICML-09.
ABSTRACT: Sutton, Szepesvari and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD. This algorithm appears to extend linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.
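
The two updates described above can be written compactly for the on-policy, linear case; in the sketch below the feature vectors, step sizes, and random driver are assumptions for illustration, and the importance weighting needed for off-policy training is not shown.

```python
import numpy as np

def gtd2_step(theta, w, x, x_next, reward, alpha=0.05, beta=0.05, gamma=0.99):
    """One step of GTD2 for linear policy evaluation."""
    delta = reward + gamma * theta @ x_next - theta @ x
    theta = theta + alpha * (x - gamma * x_next) * (x @ w)
    w = w + beta * (delta - x @ w) * x
    return theta, w

def tdc_step(theta, w, x, x_next, reward, alpha=0.05, beta=0.05, gamma=0.99):
    """One step of TDC: the conventional TD update plus a correction term."""
    delta = reward + gamma * theta @ x_next - theta @ x
    theta = theta + alpha * (delta * x - gamma * (x @ w) * x_next)
    w = w + beta * (delta - x @ w) * x
    return theta, w

# Toy driver with random features (illustration only).
rng = np.random.default_rng(0)
n = 10
theta, w = np.zeros(n), np.zeros(n)
x = rng.random(n)
for t in range(1000):
    x_next, reward = rng.random(n), rng.normal()
    theta, w = tdc_step(theta, w, x, x_next, reward)
    x = x_next
```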

Kehoe, E. J., Olsen, K. N., Ludvig, E. A., Sutton, R. S. (2009). Scalar timing varies with response magnitude in classical conditioning of the rabbit nictitating membrane response (Oryctolagus cuniculus). Behavioral Neuroscience 123, 212–217.
ABSTRACT: The present experiment was aimed at characterizing the timing of conditioned nictitating membrane (NM) movements as a function of the interstimulus interval (ISI) in delay conditioning for rabbits (Oryctolagus cuniculus). Onset latency and peak latency were approximately, but not strictly, scalar for all but the smallest movements (<.10 mm). That is, both the mean and standard deviation of the timing measures increased in proportion to the ISI, but their coefficients of variation (standard deviation/mean) tended to be larger for shorter ISIs. For all ISIs, the absolute timing of the NM movements covaried with magnitude. The smaller movements (approximately .11-.50 mm) were highly variable, and their peaks tended to occur well after the time of US delivery. The larger movements (>.50 mm) were less variable, and their peaks were better aligned with the time of US delivery. These results are discussed with respect to their implications for current models of timing in eyeblink conditioning.

Sutton, R. S., Szepesvari, Cs., Maei, H. R. (2009). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems 21, Vancouver, BC. December 2008. MIT Press.
ABSTRACT: We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by LSTD, but without LSTD's quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.
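
For reference, the objective that GTD descends, as described above, can be written (in standard notation, chosen here for illustration) as the squared norm of the expected TD(0) update,

    J(\theta) \;=\; \bigl\lVert \mathbb{E}[\,\delta_t \phi_t\,] \bigr\rVert^2,
    \qquad
    \delta_t \;=\; r_{t+1} + \gamma\,\theta^\top \phi_{t+1} - \theta^\top \phi_t,

with a second, O(n) set of weights maintained to estimate the expectation incrementally.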

Ludvig, E.,  Sutton, R. S., Verbeek, E., Kehoe, E. J. (2009). A computational model of hippocampal function in trace conditioning. In Advances in Neural Information Processing Systems 21, pp. 993-1000, Vancouver, BC. December 2008. MIT Press.
ABSTRACT: We present a new reinforcement-learning model for the role of the hippocampus in classical conditioning, focusing on the differences between trace and delay conditioning. In the model, all stimuli are represented both as unindividuated wholes and as a series of temporal elements with varying delays. These two stimulus representations interact, producing different patterns of learning in trace and delay conditioning. The model proposes that hippocampal lesions eliminate long-latency temporal elements, but preserve short-latency temporal elements. For trace conditioning, with no contiguity between stimulus and reward, these long-latency temporal elements are vital to learning adaptively timed responses. For delay conditioning, in contrast, the continued presence of the stimulus supports conditioned responding, and the short-latency elements suppress responding early in the stimulus. In accord with the empirical data, simulated hippocampal damage impairs trace conditioning, but not delay conditioning, at medium-length intervals. With longer intervals, learning is impaired in both procedures, and, with shorter intervals, in neither. In addition, the model makes novel predictions about the response topography with extended stimuli or post-training lesions. These results demonstrate how temporal contiguity, as in delay conditioning, changes the timing problem faced by animals, rendering it both easier and less susceptible to disruption by hippocampal lesions.

Cutumisu, M., Szafron, D., Bowling, M., Sutton R. S. (2008). Agent learning using action-dependent learning rates in computer role-playing games. Proceedings of the 4th Artificial Intelligence and Interactive Digital Entertainment Conference, pp. 22-29.

ABSTRACT: We introduce the ALeRT (Action-dependent Learning Rates with Trends) algorithm that makes two modifications to the learning rate and one change to the exploration rate of traditional reinforcement learning techniques. Our learning rates are action-dependent and increase or decrease based on trends in reward sequences. Our exploration rate decreases when the agent is learning successfully and increases otherwise. These improvements result in faster learning. We implemented this algorithm in NWScript, a scripting language used by BioWare Corp.’s Neverwinter Nights game, with the goal of improving the behaviours of game agents so that they react more intelligently to game events. Our goal is to provide an agent with the ability to (1) discover favourable policies in a multi-agent computer role-playing game situation and (2) adapt to sudden changes in the environment.
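
The published ALeRT update rules are not reproduced here; the following is only a hedged sketch of the general idea of an action-dependent, trend-driven step size, with illustrative constants and a deliberately simplistic trend test.

    def trend_step_size(alpha, recent_deltas, grow=1.1, shrink=0.9,
                        alpha_min=0.01, alpha_max=1.0):
        """Hypothetical rule in the spirit of ALeRT: if the recent TD errors for
        this action share a sign (a trend in the reward sequence), grow the
        action's step size; otherwise shrink it."""
        signs = {d > 0 for d in recent_deltas if d != 0}
        factor = grow if len(signs) == 1 else shrink
        return min(alpha_max, max(alpha_min, alpha * factor))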


Silver, D., Sutton, R. S., and Mueller, M. (2008). Sample-based learning and search with permanent and transient memories. In Proceedings of the 25th International Conference on Machine Learning, pp. 968-975, Helsinki, Finland. July 2008. Omnipress.
ABSTRACT: We study a reinforcement learning architecture that encompasses both learning and planning, and apply it to high performance Computer Go. In this domain, the most successful planning methods are based on sample-based search algorithms, such as UCT, in which states are individuated, and the most successful learning methods are based on temporal-difference learning algorithms, such as Sarsa, in which linear function approximation is used. In both cases, an estimate of the value function is formed, but in the first case it is transient, computed and then discarded after each move, whereas in the second case it is more permanent, slowly accumulating over many moves and games. The idea of our architecture, Dyna-2, is for the transient planning memory and the permanent learning memory to remain separate, but for both to be based on linear function approximation and both to be updated by Sarsa. This architecture subsumes Sarsa and UCT as special cases. We apply Dyna-2 to 9x9 Computer Go, using a million binary features in the function approximator, based on templates matching small fragments of the board. We test our approach against GnuGo, a standard benchmark opponent, and on the Computer Go Online Server (CGOS). We show that the combination of permanent and transient memories performs better than either memory alone, and significantly outperforms a traditional approach using alpha-beta search. Our program based on Dyna-2 achieves a higher rating on CGOS than any other handcrafted or traditional search-based program. This is the first exploration of the Dyna concept in a large-scale, high-performance application.
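
A minimal sketch of the central Dyna-2 idea described above: two linear memories over the same features, with action selection using their sum and the transient memory discarded when search moves on. The class and method names are illustrative, and the Sarsa updates to each memory are omitted.

    import numpy as np

    class Dyna2Value:
        """Permanent plus transient linear memories over shared binary features."""
        def __init__(self, n_features):
            self.theta_perm = np.zeros(n_features)    # learned from real experience
            self.theta_trans = np.zeros(n_features)   # refined by simulated search

        def value(self, phi):
            return (self.theta_perm + self.theta_trans).dot(phi)

        def new_search(self):
            self.theta_trans[:] = 0.0   # transient memory is discarded, permanent kept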

Sutton, R. S., Szepesvari, Cs., Geramifard, A., Bowling, M. (2008). Dyna-style planning with linear function approximation and prioritized sweeping. Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pp. 528-536.
ABSTRACT: We consider the problem of efficiently learning optimal control policies and value functions over large state spaces in an online setting in which estimates must be available after each interaction with the world. This paper develops an explicitly model-based approach extending the Dyna architecture to linear function approximation. Dyna-style planning proceeds by generating imaginary experience from the world model and then applying model-free reinforcement learning algorithms to the imagined state transitions. Our main results are to prove that linear Dyna-style planning converges to a unique solution independent of the generating distribution, under natural conditions. In the policy evaluation setting, we prove that the limit point is the least-squares (LSTD) solution. An implication of our results is that prioritized sweeping can be soundly extended to the linear approximation case, backing up to preceding features rather than to preceding states. We introduce two versions of prioritized sweeping with linear Dyna and briefly illustrate their performance empirically on the Mountain Car and Boyan Chain problems.
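
A minimal sketch of one linear Dyna-style planning step in the policy-evaluation setting described above: a learned linear model (F, b) produces an expected next feature vector and reward for a sampled feature vector, and a TD(0)-style update is applied to the imagined transition. The sampling of phi and the names are illustrative.

    import numpy as np

    def linear_dyna_planning_step(theta, F, b, phi, gamma=0.99, alpha=0.1):
        """One imagined backup using the learned linear world model (F, b)."""
        phi_next = F.dot(phi)      # model's expected next feature vector
        reward = b.dot(phi)        # model's expected reward
        delta = reward + gamma * theta.dot(phi_next) - theta.dot(phi)
        return theta + alpha * delta * phi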

Ludvig, E. A., Sutton, R. S., Kehoe, E. J. (2008) Stimulus representation and the timing of reward-prediction errors in models of the dopamine system, Neural Computation 20:3034-3054.
ABSTRACT: The phasic firing of dopamine neurons has been theorized to encode a reward-prediction error as formalized by the temporal-difference (TD) algorithm in reinforcement learning. Most TD models of dopamine have assumed a stimulus representation, known as the complete serial compound, in which each moment in a trial is distinctly represented. We introduce a more realistic temporal stimulus representation for the TD model. In our model, all external stimuli, including rewards, spawn a series of internal microstimuli, which grow weaker and more diffuse over time. These microstimuli are used by the TD learning algorithm to generate predictions of future reward. This new stimulus representation injects temporal generalization into the TD model and enhances correspondence between model and data in several experiments, including those in which rewards are omitted or received early. This improved fit mostly derives from the absence of large negative errors in the new model, suggesting that dopamine alone can encode the full range of TD errors in these situations.
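
A minimal sketch of a microstimulus-style feature construction consistent with the description above: a stimulus onset starts a decaying trace, and a set of basis elements spread over the trace's height yields features that grow weaker and more diffuse with elapsed time. The Gaussian basis and the constants are illustrative assumptions, not the paper's exact parameterization.

    import numpy as np

    def microstimuli(steps_since_onset, n_micro=10, decay=0.985, width=0.08):
        """Feature vector contributed by one stimulus at a given elapsed time."""
        trace = decay ** steps_since_onset           # memory trace fading over time
        centers = np.linspace(1.0, 0.0, n_micro)     # basis centers over trace height
        return trace * np.exp(-(trace - centers) ** 2 / (2.0 * width ** 2))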

Kehoe, E. J., Ludvig, E. A., Dudeney, J. E., Neufeld, J., Sutton, R. S. (2008). Magnitude and timing of nictitating membrane movements during classical conditioning of the rabbit (Oryctolagus cuniculus), Behavioral Neuroscience 122(2):471-476.
ABSTRACT: A trial-by-trial, subject-by-subject analysis was conducted to determine whether generation of the conditioned response (CR) occurs on a continuous or all-or-none basis. Three groups of rabbits were trained on different partial reinforcement schedules with the conditioned stimulus presented alone on 10%, 30%, or 50%, respectively, of all trials. Plots of each rabbit’s nictitating membrane movements revealed that their magnitude rose in a continuous fashion. Response growth during acquisition followed a sigmoidal curve, and the timing of CR-sized movements was largely stable throughout the experiment. The results are discussed with respect to alternative models of CR generation.

Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., Lee, M. (2008). Incremental Natural Actor-Critic Algorithms. Twenty-First Annual Conference on Neural Information Processing Systems, pp. 105-112. Conference held in Vancouver, Canada, December 2007. Journal version. TR version with experiments.

ABSTRACT: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradient in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces, and the use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the policy gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda et al. by using temporal difference learning in the actor and by incorporating natural gradients, and extend prior empirical studies of natural-gradient actor-critic methods by Peters et al. by providing the first convergence proofs and the first fully incremental algorithms.
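
A heavily hedged sketch of one two-timescale actor-critic step of the kind analyzed in these papers: a linear TD critic plus a policy-gradient actor that uses the TD error as its advantage estimate. The natural-gradient variants additionally maintain compatible-feature weights; that machinery, the step-size schedules, and the names below are omitted or assumed for illustration.

    import numpy as np

    def actor_critic_step(v, theta, phi, phi_next, grad_log_pi, reward,
                          gamma=0.99, alpha_v=0.1, alpha_pi=0.01):
        """One step: TD(0) critic update (fast timescale), then a policy-gradient
        actor update (slow timescale) using the TD error as the advantage signal."""
        delta = reward + gamma * v.dot(phi_next) - v.dot(phi)
        v = v + alpha_v * delta * phi
        theta = theta + alpha_pi * delta * grad_log_pi   # grad_log_pi = d/dtheta log pi(a|s)
        return v, theta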


Sutton, R. S., Koop, A., Silver, D. (2007). On the Role of Tracking in Stationary Environments. In Proceedings of the 2007 International Conference on Machine Learning. Associated talk from 6/21/07.

ABSTRACT: It is often thought that learning algorithms that track the best solution, as opposed to converging to it, are important only on nonstationary problems. We present three results suggesting that this is not so. First we illustrate in a simple concrete example, the Black and White problem, that tracking can perform better than any converging algorithm on a stationary problem. Second, we show the same point on a larger, more realistic problem, an application of temporal-difference learning to computer Go. Our third result suggests that tracking in stationary problems could be important for meta-learning research (e.g., learning to learn, feature selection, transfer). We apply a meta-learning algorithm for step-size adaptation, IDBD, to the Black and White problem, showing that meta-learning has a dramatic long-term effect on performance whereas, on an analogous converging problem, meta-learning has only a small second-order effect. This small result suggests a way of eventually overcoming a major obstacle to meta-learning research: the lack of an independent methodology for task selection.
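
For reference, a compact sketch of the IDBD step-size adaptation rule (Sutton, 1992) in its linear supervised form, which is the kind of meta-learning applied in the experiment above; the variable names and the meta step size are illustrative.

    import numpy as np

    def idbd_update(w, beta, h, x, target, meta_rate=0.01):
        """One IDBD step for a linear predictor: per-feature step sizes exp(beta)
        are themselves adapted by the gradient of error with respect to beta."""
        delta = target - w.dot(x)
        beta = beta + meta_rate * delta * x * h
        alpha = np.exp(beta)                              # per-feature step sizes
        w = w + alpha * delta * x
        h = h * np.clip(1.0 - alpha * x * x, 0.0, None) + alpha * delta * x
        return w, beta, h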


Silver, D., Sutton, R. S., and Mueller, M. (2007) Reinforcement Learning of Local Shape in the Game of Go, Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07).

ABSTRACT: We explore an application to the game of Go of a reinforcement learning approach based on a linear evaluation function and large numbers of binary features. This strategy has proved effective in game playing programs and other reinforcement learning applications. We apply this strategy to Go by creating over a million features based on templates for small fragments of the board, and then use temporal difference learning and self-play. This method identifies hundreds of low-level shapes with recognisable significance to expert Go players, and provides quantitative estimates of their values. We analyse the relative contributions to performance of templates of different types and sizes. Our results show that small, translation-invariant templates are surprisingly effective. We assess the performance of our program by playing against the Average Liberty Player and a variety of computer opponents on the 9x9 Computer Go Server. Our linear evaluation function appears to outperform all other static evaluation functions that do not incorporate substantial domain knowledge.


Geramifard, A., Bowling, M., Zinkevich, M., and Sutton, R. S. (2007). iLSTD: Eligibility Traces and Convergence Analysis, Advances in Neural Information Processing Systems 19 (NIPS*06).

ABSTRACT: We present new theoretical and empirical results with the iLSTD algorithm for policy evaluation in reinforcement learning with linear function approximation. iLSTD is an incremental method for achieving results similar to LSTD, the data-efficient, least-squares version of temporal difference learning, without incurring the full cost of the LSTD computation. LSTD is O(n²), where n is the number of parameters in the linear function approximator, while iLSTD is O(n). In this paper, we generalize the previous iLSTD algorithm and present three new results: (1) the first convergence proof for an iLSTD algorithm; (2) an extension to incorporate eligibility traces without changing the asymptotic computational complexity; and (3) the first empirical results with an iLSTD algorithm for a problem (mountain car) with feature vectors large enough (n = 10,000) to show substantial computational advantages over LSTD.
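
A minimal sketch of one iLSTD step along the lines described above: maintain the LSTD statistics incrementally together with the residual mu = b - A theta, then descend on the dimensions with the largest residuals. Dense arrays are used here for clarity; the O(n) per-step cost in the paper relies on sparse features, and the names are illustrative.

    import numpy as np

    def ilstd_step(theta, A, mu, phi, phi_next, reward, gamma=0.99, alpha=0.1, m=1):
        """Update the statistics with one transition, then do m greedy descent steps."""
        dA = np.outer(phi, phi - gamma * phi_next)   # rank-one update to A
        A = A + dA
        mu = mu + reward * phi - dA.dot(theta)       # keep mu = b - A.theta current
        for _ in range(m):
            i = int(np.argmax(np.abs(mu)))           # dimension with largest residual
            theta[i] += alpha * mu[i]
            mu = mu - alpha * mu[i] * A[:, i]        # restore the invariant
        return theta, A, mu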


Geramifard, A., Bowling, M., Sutton, R. S. (2006). Incremental Least-Squares Temporal Difference Learning.   In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06), pages 356-361.

ABSTRACT: Approximate policy evaluation with linear function approximation is a commonly arising problem in reinforcement learning, usually solved using temporal difference (TD) algorithms. In this paper we introduce a new variant of linear TD learning, called incremental least-squares TD learning, or iLSTD. This method is more data efficient than conventional TD algorithms such as TD(0) and is more computationally efficient than non-incremental least-squares TD methods such as LSTD (Bradtke & Barto 1996; Boyan 1999). In particular, we show that the per-time-step complexities of iLSTD and TD(0) are O(n), where n is the number of features, whereas that of LSTD is O(n²). This difference can be decisive in modern applications of reinforcement learning where the use of a large number of features has proven to be an effective solution strategy. We present empirical comparisons, using the test problem introduced by Boyan (1999), in which iLSTD converges faster than TD(0) and almost as fast as LSTD.


Precup, D., Sutton, R. S., Paduraru, C., Koop, A., Singh, S. (2006). Off-policy Learning with Recognizers. Advances in Neural Information Processing Systems 18 (NIPS*05).

ABSTRACT: We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has much lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for MDPs and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Our algorithm achieves this without knowledge of the behavior policy or even requiring that there exists a behavior policy.
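
For reference, the relationships that make recognizers useful, written in standard notation (the symbols are chosen here for illustration): a recognizer c(s, a) in [0, 1] applied to the behavior policy b implicitly defines the target policy, and the importance-sampling correction depends only on c and the recognition probability mu(s), which can be estimated from data, not on b itself.

    \pi(a \mid s) \;=\; \frac{c(s,a)\, b(a \mid s)}{\mu(s)},
    \qquad
    \mu(s) \;=\; \sum_{a'} c(s,a')\, b(a' \mid s),
    \qquad
    \rho(s,a) \;=\; \frac{\pi(a \mid s)}{b(a \mid s)} \;=\; \frac{c(s,a)}{\mu(s)}.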


Sutton, R. S., Rafols, E. J., Koop, A. (2006). Temporal abstraction in temporal-difference networks. Advances in Neural Information Processing Systems 18 (NIPS*05).

ABSTRACT: Temporal-difference (TD) networks have been proposed as a way of representing and learning a wide variety of predictions about the interaction between an agent and its environment (Sutton & Tanner, 2005).  These predictions are compositional in that their targets are defined in terms of other predictions, and subjunctive in that they are about what would happen if an action or sequence of actions were taken.  In conventional TD networks, the inter-related predictions are at successive time steps and contingent on a single action; here we generalize them to accommodate extended time intervals and contingency on whole ways of behaving.  Our generalization is based on the options framework for temporal abstraction (Sutton, Precup & Singh, 1999).  The primary contribution of this paper is to introduce a new algorithm for intra-option learning in TD networks with function approximation and eligibility traces.  We present empirical examples of our algorithm's effectiveness and of the greater representational expressiveness of temporally-abstract TD networks.


Tanner, B., Sutton, R. S. (2005). TD(lambda) Networks: Temporal-Difference Networks with Eligibility Traces. Proceedings of the 2005 International Conference on Machine Learning, pp. 889-896.

ABSTRACT: Temporal-difference (TD) networks have been introduced as a formalism for expressing and learning grounded world knowledge in a predictive form (Sutton & Tanner, 2005). Like conventional TD(0) methods, the learning algorithm for TD networks uses 1-step backups to train prediction units about future events. In conventional TD learning, the TD(lambda) algorithm is often used to do more general multi-step backups of future predictions. In our work, we introduce a generalization of the 1-step TD network specification that is based on the TD(lambda) learning algorithm, creating TD(lambda) networks. We present experimental results that show TD(lambda) networks can learn solutions in more complex environments than TD networks. We also show that in problems that can be solved by TD networks, TD(lambda) networks generally learn solutions much faster than their 1-step counterparts. Finally, we present an analysis of our algorithm that shows that the computational cost of TD(lambda) networks is only slightly more than that of TD networks.


Rafols, E. J., Ring, M. B., Sutton, R. S., Tanner, B. (2005). Using Predictive Representations to Improve Generalization in Reinforcement Learning. Proceedings of the 2005 International Joint Conference on Artificial Intelligence, pp. 835-840.

ABSTRACT: The predictive representations hypothesis holds that particularly good generalization will result from representing the state of the world in terms of predictions about possible future experience. This hypothesis has been a central motivation behind recent research in, for example, PSRs and TD networks. In this paper we present the first explicit investigation of this hypothesis. We show in a reinforcement-learning example (a grid-world navigation task) that a predictive representation in tabular form can learn much faster than both the tabular explicit-state representation and a tabular history-based method.


Tanner, B., Sutton, R. S. (2005). Temporal-Difference Networks with History. Proceedings of the 2005 International Joint Conference on Artificial Intelligence, pp. 865-870.

ABSTRACT: Temporal-difference (TD) networks are a formalism for expressing and learning grounded world knowledge in a predictive form [Sutton and Tanner, 2005]. However, not all partially observable Markov decision processes can be efficiently learned with TD networks. In this paper, we extend TD networks by allowing the network-update process (answer network) to depend on the recent history of previous actions and observations rather than only on the most recent action and observation. We show that this extension enables the solution of a larger class of problems than can be solved by the original TD networks or by history-based methods alone. In addition, we apply TD networks to a problem that, while still simple, is significantly larger than has previously been considered. We show that history-extended TD networks can learn much of the common-sense knowledge of an egocentric gridworld domain with a single bit of perception.


Stone, P., Sutton, R. S., Kuhlmann, G. (2005). Reinforcement Learning for RoboCup-Soccer Keepaway. Adaptive Behavior. Some simulations of keepaway referenced in the paper and keepaway software are also available.

ABSTRACT: RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple independent agents learning simultaneously, and long and variable delays in the effects of actions. We describe our application of episodic SMDP Sarsa(lambda) with linear tile-coding function approximation and variable lambda to learning higher-level decisions in a keepaway subtask of RoboCup soccer. In keepaway, one team, "the keepers," tries to keep control of the ball for as long as possible despite the efforts of "the takers." The keepers learn individually when to hold the ball and when to pass to a teammate. Our agents learned policies that significantly outperform a range of benchmark policies. We demonstrate the generality of our approach by applying it to a number of task variations including different field sizes and different numbers of players on each team.
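
A minimal sketch of the kind of linear Sarsa(lambda) update with binary (tile-coded) features used in this line of work; the replacing traces, episodic treatment, and SMDP bookkeeping of the actual keepaway learners are simplified or omitted here.

    import numpy as np

    def sarsa_lambda_step(theta, z, active, next_active, reward,
                          gamma=1.0, lam=0.9, alpha=0.1):
        """One linear Sarsa(lambda) update; `active` and `next_active` are the
        indices of the binary features for (s, a) and (s', a')."""
        delta = reward + gamma * theta[next_active].sum() - theta[active].sum()
        z *= gamma * lam            # decay all eligibility traces
        z[active] = 1.0             # replacing traces on the active features
        theta += alpha * delta * z
        return theta, z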


Sutton, R. S., Tanner, B. (2005). Temporal-Difference Networks. Advances in Neural Information Processing Systems 17, pages 1377-1384. Associated talk from 12/14/05. Small-size version of same talk.

ABSTRACT: We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that when actions are introduced, and the inter-prediction relationships made contingent on them, the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms.
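
A heavily hedged sketch of a TD-network step consistent with the description above: the predictions are computed from the previous predictions together with the latest action and observation, and each prediction is moved toward a target that the question network defines in terms of other predictions at the next step. The sigmoid answer network and single-step targets are illustrative assumptions.

    import numpy as np

    def td_network_step(W, y_prev, action_obs, targets, alpha=0.1):
        """One step of a simple TD network: recurrent features in, predictions out,
        delta-rule update toward the (bootstrapped) targets."""
        x = np.concatenate(([1.0], y_prev, action_obs))   # bias, old predictions, a/o
        y = 1.0 / (1.0 + np.exp(-W.dot(x)))               # answer network (sigmoid)
        W += alpha * np.outer((targets - y) * y * (1.0 - y), x)
        return W, y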

Littman, M.L., Sutton, R.S., Singh, S. (2002). Predictive Representations of State. In: Advances in Neural Information Processing Systems 14:1555-1561 (Proceedings of the 2001 conference). MIT Press. 
ABSTRACT: We show that states of a dynamical system can be usefully represented by multi-step, action-conditional predictions of future observations. State representations that are grounded in data in this way may be easier to learn, generalize better, and be less dependent on accurate prior models than, for example, POMDP state representations. Building on prior work by Jaeger and by Rivest and Schapire, in this paper we compare and contrast a linear specialization of the predictive approach with the state representations used in POMDP and in k-order Markov models. Ours is the first specific formulation of the predictive idea that includes both stochasticity and actions (controls). We show that any system has a linear predictive state representation with number of predictions no greater than the number of states in its minimal POMDP model.
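
For reference, the linear predictive-state update underlying this formulation, in conventional notation: the state is the vector p(h) of predictions for a set of core tests, and after taking action a and observing o each core-test prediction is a ratio of linear functions of p(h), with the m vectors as learned parameters.

    p(q_i \mid hao) \;=\; \frac{p(aoq_i \mid h)}{p(ao \mid h)}
    \;=\; \frac{m_{aoq_i}^{\top}\, p(h)}{m_{ao}^{\top}\, p(h)}.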

Stone, P., Sutton, R. S. (2002). Keepaway soccer: A machine learning testbed. In: Lecture Notes in Computer Science 2377, pp. 207-237, Springer.

ABSTRACT: RoboCup simulated soccer presents many challenges to machine learning (ML) methods, including a large state space, hidden and uncertain state, multiple agents, and long and variable delays in the effects of actions. While there have been many successful ML applications to portions of the robotic soccer task, it appears to be still beyond the capabilities of modern machine learning techniques to enable a team of 11 agents to successfully learn the full robotic soccer task from sensors to actuators. Because the successful applications to portions of the task have been embedded in different teams and have often addressed different sub-tasks, they have been difficult to compare. We put forth keepaway soccer as a domain suitable for directly comparing different machine learning approaches to robotic soccer. It is complex enough that it can’t be solved trivially, yet simple enough that complete machine learning approaches are feasible. In keepaway, one team, “the keepers,” tries to keep control of the ball for as long as possible despite the efforts of “the takers.” The keepers learn individually when to hold the ball and when to pass to a teammate, while the takers learn when to charge the ball-holder and when to cover possible passing lanes. We fully specify the domain and summarize some initial, successful learning results.


Stone, P., Sutton, R.S. (2001). Scaling reinforcement learning toward RoboCup soccer. Proceedings of the 18th International Conference on Machine Learning, pp. 537-544. ABSTRACT: RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple agents, and long and variable delays in the effects of actions. We describe our application of episodic SMDP Sarsa(lambda) with linear tile-coding function approximation and variable lambda to learning higher-level decisions in a keepaway subtask of RoboCup soccer. In keepaway, one team, "the keepers," tries to keep control of the ball for as long as possible despite the efforts of "the takers." The keepers learn individually when to hold the ball and when to pass to a teammate, while the takers learn when to charge the ball-holder and when to cover possible passing lanes. Our agents learned policies that significantly out-performed a range of benchmark policies. We demonstrate the generality of our approach by applying it to a number of task variations including different field sizes and different numbers of players on each team.
Precup, D., Sutton, R.S., Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. digitally remastered version. scanned version. Proceedings of the 18th International Conference on Machine Learning. Associated talk from 7/1/01. ABSTRACT: We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal, learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(lambda) over state-action pairs with importance sampling ideas from our previous work. We prove that, given training under any epsilon-soft policy, the algorithm converges w.p.1 to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic, 2001) to the action-value function for an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We also illustrate our method empirically on a small policy evaluation problem, showing reduced variance compared to the most obvious importance sampling algorithm for this problem. Our current results are limited to episodic tasks with episodes of bounded length.

Stone, P., Sutton, R.S., Singh, S. (2001). Reinforcement Learning for 3 vs. 2 Keepaway. In: RoboCup-2000: Robot Soccer World Cup IV, P. Stone, T. Balch, and G. Kraetzschmar, Eds., Springer Verlag. An earlier version appeared in the Proceedings of the RoboCup-2000 Workshop, Melbourne, Australia.

ABSTRACT: As a sequential decision problem, robotic soccer can benefit from research in reinforcement learning. We introduce the 3 vs. 2 keepaway domain, a subproblem of robotic soccer implemented in the RoboCup soccer server. We then explore reinforcement learning methods for policy evaluation and action selection in this distributed, real-time, partially observable, noisy domain. We present empirical results demonstrating that a learned policy can dramatically outperform hand-coded policies.
Sutton, R.S., McAllester, D., Singh, S., Mansour, Y. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. Advances in Neural Information Processing Systems 12 (Proceedings of the 1999 conference), pp. 1057-1063. MIT Press. An earlier version, as submitted May 1999, appeared as an AT&T Labs Technical Report. Later we began but never finished another paper, Comparing Policy-Gradient Algorithms.
ABSTRACT: Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
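
For reference, the policy gradient theorem at the heart of this paper, in standard notation,

    \frac{\partial \rho}{\partial \theta}
    \;=\; \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s,a)}{\partial \theta}\, Q^{\pi}(s,a),

where d^pi is the appropriate state distribution under pi, and Q^pi may be replaced by a compatible function approximator f_w without introducing bias into the gradient estimate.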

Precup, D., Sutton, R.S., Singh, S. (2000). Eligibility Traces for Off-Policy Policy Evaluation. Proceedings of the 17th International Conference on Machine Learning, pp. 759-766. Morgan Kaufmann. ABSTRACT: Eligibility traces have been shown to speed reinforcement learning, to make it more robust to hidden states, and to provide a link between Monte Carlo and temporal-difference methods. Here we generalize eligibility traces to off-policy learning, in which one learns about a policy different from the policy that generates the data. Off-policy methods can greatly multiply learning, as many policies can be learned about from the same data stream, and have been identified as particularly useful for learning about subgoals and temporally extended macro-actions. In this paper we consider the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method. We analyze and compare this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling. Our main results are 1) to establish the consistency and bias properties of the new methods and 2) to empirically rank the new methods, showing improvement over one-step and Monte Carlo methods. Our results are restricted to model-free, table-lookup methods and to offline updating (at the end of each episode) although several of the algorithms could be applied more generally.
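
A minimal sketch of the per-decision importance-sampling idea that these algorithms build on, in the tabular episodic setting: each reward is reweighted by the product of target-to-behavior probability ratios up to that time. This shows only the weighted-return computation, not any particular trace algorithm from the paper.

    def per_decision_is_return(episode, pi, b, gamma=1.0):
        """episode: list of (state, action, reward) generated by behavior policy b.
        Returns the per-decision importance-weighted return for the first state."""
        g, rho, discount = 0.0, 1.0, 1.0
        for state, action, reward in episode:
            rho *= pi(state, action) / b(state, action)   # running likelihood ratio
            g += discount * rho * reward
            discount *= gamma
        return g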

Sutton, R.S. (1999). Open theoretical questions in reinforcement learning. In Proceedings of the Fourth European Conference on Computational Learning Theory (Proceedings EuroCOLT'99), pp. 11-17, Fischer, P., Simon, H.U., Eds. Springer-Verlag.


Sutton, R.S. (1999). Reinforcement Learning. In Rob Wilson and Frank Keil (Eds.) The MIT Encyclopedia of the Cognitive Sciences, MIT Press.

ABSTRACT: Reinforcement learning is an approach to artificial intelligence that emphasizes learning by the individual from its interaction with its environment. This contrasts with classical approaches to artificial intelligence and machine learning, which have downplayed learning from interaction, focusing instead on learning from a knowledgeable teacher, or on reasoning from a complete model of the environment. Modern reinforcement learning research is highly interdisciplinary; it includes researchers specializing in operations research, genetic algorithms, neural networks, psychology, and control engineering.


Sutton, R.S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence 112:181-211. An earlier version appeared as Technical Report 98-74, Department of Computer Science, University of Massachusetts, Amherst, MA 01003. April, 1998. Associated talk from 3/5/98.  Humorous excerpts from the JAIR reviews recommending rejection of this paper.

ABSTRACT: Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key, longstanding challenges for AI. In this paper we consider how these challenges can be addressed within the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action in this framework to include options---closed-loop policies for taking action over a period of time. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Overall, we show that options enable temporally abstract knowledge and action to be included in the reinforcement learning framework in a natural and general way. In particular, we show that options may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Q-learning. Formally, a set of options defined over an MDP constitutes a semi-Markov decision process (SMDP), and the theory of SMDPs provides the foundation for the theory of options. However, the most interesting issues concern the interplay between the underlying MDP and the SMDP and are thus beyond SMDP theory. We present results for three such cases: 1) we show that the results of planning with options can be used during execution to interrupt options and thereby perform even better than planned, 2) we introduce new intra-option methods that are able to learn about an option from fragments of its execution, and 3) we propose a notion of subgoal that can be used to improve the options themselves. All of these results have precursors in the existing literature; the contribution of this paper is to establish them in a simpler and more general setting with fewer changes to the existing reinforcement learning framework. In particular, we show that these results can be obtained without committing to (or ruling out) any particular approach to state abstraction, hierarchy, function approximation, or the macro-utility problem.
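
A minimal sketch of the option abstraction and the SMDP-style backup it supports, as described above; the data-structure names are illustrative and only the value-learning step is shown.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Option:
        """A closed-loop option: initiation set, policy, and termination condition."""
        can_start: Callable[[object], bool]    # I: states where the option may begin
        policy: Callable[[object], object]     # pi: action chosen in each state
        terminate: Callable[[object], float]   # beta: probability of terminating

    def smdp_q_update(Q, s, option, s_next, discounted_reward, k, gamma=0.99, alpha=0.1):
        """Backup after the option ran for k steps, accumulating discounted_reward,
        and terminated in s_next."""
        target = discounted_reward + (gamma ** k) * max(Q[s_next].values())
        Q[s][option] += alpha * (target - Q[s][option])
        return Q
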
Sutton, R.S., Singh, S., Precup, D., Ravindran, B. (1999). Improved switching among temporally abstract actions. Djvu version. Advances in Neural Information Processing Systems 11 (Proceedings of the 1998 conference), MIT Press. Associated talk from 12/2/98. ABSTRACT: In robotics and other control applications it is commonplace to have a pre-existing set of controllers for solving subtasks, perhaps hand-crafted or previously learned or planned, and still face a difficult problem of how to choose and switch among the controllers to solve an overall task as well as possible. In this paper we present a framework based on Markov decision processes and semi-Markov decision processes for phrasing this problem, a basic theorem regarding the improvement in performance that can be obtained by switching flexibly between given controllers, and example applications of the theorem. In particular, we show how an agent can plan with these high-level controllers and then use the results of such planning to find an even better plan, by modifying the existing controllers, with negligible additional cost and no re-planning. In one of our examples, the complexity of the problem is reduced from 24 billion state-action pairs to less than a million state-controller pairs.

Moll, R., Barto, A.G., Perkins, T. J., Sutton, R.S. (1999). Learning instance-independent value functions to enhance local search. Advances in Neural Information Processing Systems 11 (Proceedings of the 1998 conference), pp. 1017-1023. MIT Press.

ABSTRACT: Reinforcement learning methods can be used to improve the performance of local search algorithms for combinatorial optimization by learning an evaluation function that predicts the outcome of search. The evaluation function is therefore able to guide search more effectively to low-cost solutions than can the original cost function. We describe a reinforcement learning method for enhancing local search that combines aspects of previous work by Zhang and Dietterich (1995) and Boyan and Moore (1997, 1998). In an off-line learning phase, a value function is learned that is useful for guiding search for multiple problem sizes and instances. We illustrate our technique by developing several such functions for the Dial-A-Ride Problem, a variant of the well-known Traveling Salesman's Problem. Overall, our learned hillclimbing results exhibit an improvement of more than 30% over the standard Dial-A-Ride local search algorithm.


Sutton, R. S., Reinforcement learning: Past, present, and future. Extended abstract in Simulated Evolution and Learning, McKay, B., Yao, X., Newton, C. S., Kim, J.-H., Furuhashi, T., Eds. Published as Lecture Notes in Computer Science 1585, pp. 195–197, Springer, 1999.


Sutton, R.S., Barto, A.G. (1998). Reinforcement Learning: An Introduction. MIT Press.

This introductory textbook on reinforcement learning is targeted toward engineers and scientists in artificial intelligence, operations research, neural networks, and control systems, and we hope it will also be of interest to psychologists and neuroscientists. It also contains several new results. An html version is available.


Sutton, R.S., Precup, D., Singh, S. (1998). Intra-option learning about temporally abstract actions. Proceedings of the 15th International Conference on Machine Learning, pp. 556-564. Morgan Kaufmann.

ABSTRACT: Several researchers have proposed modeling temporally abstract actions in reinforcement learning by the combination of a policy and a termination condition, which we refer to as an option. Value functions over options and models of options can be learned using methods designed for semi-Markov decision processes (SMDPs). However, these methods all require an option to be executed to termination. In this paper we explore methods that learn about an option from small fragments of experience consistent with that option, even if the option itself is not executed. We call these methods intra-option learning methods because they learn from experience within an option. Intra-option methods are sometimes much more efficient than SMDP methods because they can use off-policy temporal-difference mechanisms to learn simultaneously about all the options consistent with an experience, not just the few that were actually executed. In this paper we present intra-option learning methods for learning value functions over options and for learning multi-step models of the consequences of options. We present computational examples in which these new methods learn much faster than SMDP methods and learn effectively when SMDP methods cannot learn at all. We also sketch a convergence proof for intra-option value learning.
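
For reference, an intra-option value-learning backup in the spirit of the methods above, in standard notation: after any one-step transition consistent with option o (even if o was not the option being executed), its value can be updated without waiting for termination.

    Q(s,o) \;\leftarrow\; Q(s,o) + \alpha \Bigl[\, r + \gamma\bigl( (1-\beta_o(s'))\,Q(s',o)
    \;+\; \beta_o(s') \max_{o'} Q(s',o') \bigr) - Q(s,o) \Bigr].
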
Precup, D., Sutton, R.S. (1998) Multi-time models for temporally abstract planning. Djvu version. Advances in Neural Information Processing Systems 10, MIT Press. ABSTRACT: Planning and learning at multiple levels of temporal abstraction is a key problem for artificial intelligence. In this paper we summarize an approach to this problem based on the mathematical framework of Markov decision processes and reinforcement learning. Current model-based reinforcement learning is based on one-step models that can not represent common-sense higher-level actions, such as going to lunch, grasping an object, or flying to Denver. This paper generalizes prior work on temporally abstract models (Sutton, 1995) and extends it from the prediction setting to include actions, control, and planning. We introduce a more general form of temporally abstract model, the multi-time model, and establish its suitability for planning and learning by virtue of its relationship to Bellman equations. This paper summarizes the theoretical framework of multi-time models and illustrates their potential advantages in a gridworld planning task.

McGovern, A., and Sutton, R.S. (1998). Macro-actions in reinforcement learning: An empirical analysis. Technical Report 98-70, University of Massachusetts, Department of Computer Science.

ABSTRACT: Several researchers have proposed reinforcement learning methods that obtain advantages in learning by using temporally extended actions, or macro-actions, but none has carefully analyzed what these advantages are. In this paper, we separate and analyze two advantages of using macro-actions in reinforcement learning: the effect on exploratory behavior, independent of learning, and the effect on the speed with which the learning process propagates accurate value information. We empirically measure the separate contributions of these two effects in gridworld and simulated robotic environments. In these environments, both effects were significant, but the effect of value propagation was larger. We also compare the accelerations of value propagation due to macro-actions and eligibility traces in the gridworld environment. Although eligibility traces increased the rate of convergence to the optimal value function compared to learning with macro-actions but without eligibility traces, eligibility traces did not permit the optimal policy to be learned as quickly as it was using macro-actions.

Precup, D., Sutton, R.S., Singh, S. (1998). Theoretical results on reinforcement learning with temporally abstract options. In Machine Learning: ECML-98, Proceedings of the 10th European Conference on Machine Learning, Chemnitz, Germany, pp. 382-393. Springer Verlag.

ABSTRACT: We present new theoretical results on planning within the framework of temporally abstract reinforcement learning (Precup & Sutton, 1997; Sutton, 1995). Temporal abstraction is a key step in any decision making system that involves planning and prediction. In temporally abstract reinforcement learning, the agent is allowed to choose among "options", whole courses of action that may be temporally extended, stochastic, and contingent on previous events. Examples of options include closed-loop policies such as picking up an object, as well as primitive actions such as joint torques. Knowledge about the consequences of options is represented by special structures called multi-time models. In this paper we focus on the theory of planning with multi-time models. We define new Bellman equations that are satisfied for sets of multi-time models. As a consequence, multi-time models can be used interchangeably with models of primitive actions in a variety of well-known planning methods including value iteration, policy improvement and policy iteration.
Santamaria, J.C., Sutton, R.S., Ram, A. (1998). Experiments with reinforcement learning in problems with continuous state and action spaces, Adaptive Behavior 6(2): 163-218. postscript. An earlier version appeared as Technical Report UM-CS-1996-088, Department of Computer Science, University of Massachusetts, Amherst, MA 01003. The source code for all the experiments is also available here. ABSTRACT: A key element in the solution of reinforcement learning problems is the value function. The purpose of this function is to measure the long-term utility or value of any given state and it is important because an agent can use it to decide what to do next. A common problem in reinforcement learning when applied to systems having continuous states and action spaces is that the value function must operate with a domain consisting of real-valued variables, which means that it should be able to represent the value of infinitely many state and action pairs. For this reason, function approximators are used to represent the value function when a closed-form solution of the optimal policy is not available. In this paper, we extend a previously proposed reinforcement learning algorithm so that it can be used with function approximators that generalize the value of individual experiences across both state and action spaces. In particular, we discuss the benefits of using sparse coarse-coded function approximators to represent value functions and describe in detail three implementations: CMAC, instance-based, and case-based. Additionally, we discuss how function approximators having different degrees of resolution in different regions of the state and action spaces may influence the performance and learning efficiency of the agent. We propose a simple and modular technique that can be used to implement function approximators with non-uniform degrees of resolution so that it can represent the value function with higher accuracy in important regions of the state and action spaces. We performed extensive experiments in the double-integrator and pendulum swing-up systems to demonstrate the proposed ideas.

McGovern, A., Precup, D., Ravindran, B., Singh, S., Sutton, R.S. (1998). Hierarchical Optimal Control of MDPs. Proceedings of the Tenth Yale Workshop on Adaptive and Learning Systems, pp. 186-191.

ABSTRACT: Fundamental to reinforcement learning, as well as to the theory of systems and control, is the problem of representing knowledge about the environment and about possible courses of action hierarchically, at a multiplicity of interrelated temporal scales. For example, a human traveler must decide which cities to go to, whether to fly, drive, or walk, and the individual muscle contractions involved in each step. In this paper we survey a new approach to reinforcement learning in which each of these decisions is treated uniformly. Each low-level action and high-level course of action is represented as an option, a (sub)controller and a termination condition. The theory of options is based on the theories of Markov and semi-Markov decision processes, but extends these in significant ways. Options can be used in place of actions in all the planning and learning methods conventionally used in reinforcement learning. Options and models of options can be learned for a wide variety of different subtasks, and then rapidly combined to solve new tasks. Options enable planning and learning simultaneously at a wide variety of time scales, and toward a wide variety of subtasks, substantially increasing the efficiency and abilities of reinforcement learning systems.
Barto, A. G., Sutton, R. S., Reinforcement learning in artificial intelligence. In Neural Network Models of Cognition, Donahoe, J. W., Packard Dorsel, V., Eds., pp. 358–386, Elsevier, 1997.
This chapter provides an overview of an approach to the study of learning that, in broad terms, has developed as a part of the field of Artificial Intelligence (AI), where it is called reinforcement learning due to its roots in reinforcement theories of animal learning. We introduce the field from the perspective of AI and engineering, describing some of its key features, providing a formal model of the reinforcement-learning problem, and defining basic concepts that are exploited by solution methods. Detailed discussion of solution methods themselves and their history are very broad topics that we do not attempt to cover here.

Sutton, R.S. (1997). On the significance of Markov decision processes. In W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud (Eds.) Artificial Neural Networks -- ICANN'97, pp. 273-282. Springer. ABSTRACT: Formulating the problem facing an intelligent agent as a Markov decision process (MDP) is increasingly common in artificial intelligence, reinforcement learning, artificial life, and artificial neural networks. In this short paper we examine some of the reasons for the appeal of this framework. Foremost among these are its generality, simplicity, and emphasis on goal-directed interaction between the agent and its environment. MDPs may be becoming a common focal point for different approaches to understanding the mind. Finally, we speculate that this focus may be an enduring one insofar as many of the efforts to extend the MDP framework end up bringing a wider class of problems back within it.
Precup, D., Sutton, R.S. (1997). Multi-time models for reinforcement learning. Proceedings of the ICML'97 Workshop on Modelling in Reinforcement Learning. ABSTRACT: Reinforcement learning can be used not only to predict rewards, but also to predict states, i.e. to learn a model of the world's dynamics. Models can be defined at different levels of temporal abstraction. Multi-time models are models that focus on predicting what will happen, rather than when a certain event will take place. Based on multi-time models, we can define abstract actions, which enable planning (presumably in a more efficient way) at various levels of abstraction.
Precup, D., Sutton, R.S. (1997). Exponentiated gradient methods for reinforcement learning. Proceedings of the 14th International Conference on Machine Learning, pp. 272-277, Morgan Kaufmann. ABSTRACT: This paper introduces and evaluates a natural extension of linear exponentiated gradient methods that makes them applicable to reinforcement learning problems. Just as these methods speed up supervised learning, we find that they can also increase the efficiency of reinforcement learning. Comparisons are made with conventional reinforcement learning methods on two test problems using CMAC function approximators and replacing traces. On a small prediction task, exponentiated gradient methods showed no improvement, but on a larger control task (Mountain Car) they improved the learning speed by approximately 25%. A more detailed analysis suggests that the difference may be due to the distribution of irrelevant features.
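
For reference, the contrast between additive gradient descent and Kivinen and Warmuth's exponentiated gradient update, shown here for a generic gradient g and a normalized positive weight vector; the reinforcement-learning extension in the paper replaces the supervised gradient with temporal-difference terms.

    \text{GD: } w_i \leftarrow w_i - \eta\, g_i,
    \qquad
    \text{EG: } w_i \leftarrow \frac{w_i\, e^{-\eta\, g_i}}{\sum_j w_j\, e^{-\eta\, g_j}}.
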
McGovern, A., Sutton, R.S., Fagg, A.H. (1997). Roles of macro-actions in accelerating reinforcement learning. Proceedings of the 1997 Grace Hopper Celebration of Women in Computing, pp. 13-17. ABSTRACT: We analyze the use of built-in policies, or macro-actions, as a form of domain knowledge that can improve the speed and scaling of reinforcement learning algorithms. Such macro-actions are often used in robotics, and macro-operators are also well-known as an aid to state-space search in AI systems. The macro-actions we consider are closed-loop policies with termination conditions. The macro-actions can be chosen at the same level as primitive actions. Macro-actions commit the learning agent to act in a particular, purposeful way for a sustained period of time. Overall, macro-actions may either accelerate or retard learning, depending on the appropriateness of the macro-actions to the particular task. We analyze their effect in a simple example, breaking the acceleration effect into two parts: 1) the effect of the macro-action in changing exploratory behavior, independent of learning, and 2) the effect of the macro-action on learning, independent of its effect on behavior. In our example, both effects are significant, but the latter appears to be larger. Finally, we provide a more complex gridworld illustration of how appropriately chosen macro-actions can accelerate overall learning.
Precup, D., Sutton, R.S., Singh, S.P. (1997). Planning with closed-loop macro actions. Working notes of the 1997 AAAI Fall Symposium on Model-directed Autonomous Systems, pp. 70-76. ABSTRACT: Planning and learning at multiple levels of temporal abstraction is a key problem for artificial intelligence. In this paper we summarize an approach to this problem based on the mathematical framework of Markov decision processes and reinforcement learning. Conventional model-based reinforcement learning uses primitive actions that last one time step and that can be modeled independently of the learning agent. These can be generalized to macro actions, multi-step actions specified by an arbitrary policy and a way of terminating. Macro actions generalize the classical notion of a macro operator in that they are closed loop, uncertain, and of variable duration. Macro actions are needed to represent common-sense higher-level actions such as going to lunch, grasping an object, or traveling to a distant city. This paper generalizes prior work on temporally abstract models (Sutton, 1995) and extends it from the prediction setting to include actions, control, and planning. We define a semantics of models of macro actions that guarantees the validity of planning using such models. This paper presents new results in the theory of planning with macro actions and illustrates its potential advantages in a gridworld task.

Mehra, R.K., Ravichandran, B., Cabrera, J.B.D., Greve, D.N., Sutton, R.S. (1997). Towards self-learning adaptive scheduling for ATM networks, Proceedings of the 36th Conference on Decision and Control, pp. 2393-2398, San Diego, California USA.

Sutton, R.S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. Camera-ready postscript. Digitally remastered pdf. Djvu version. Advances in Neural Information Processing Systems 8 (Proceedings of the 1995 conference), pp. 1038-1044, MIT Press. Errata: there is a small error in the equations of motion of the acrobot; for the correct equations, see the RL textbook page on the acrobot (thanks to Barry Nichols for spotting this).

ABSTRACT: On large problems, reinforcement learning systems must use parameterized function approximators such as neural networks in order to generalize between similar situations and actions. In these cases, there are no strong theoretical results on the accuracy of convergence, and computational results have been mixed. In particular, Boyan and Moore reported at last year's meeting a series of negative results in attempting to apply dynamic programming together with function approximation to simple control problems with continuous state spaces. In this paper, we present positive results for all the control tasks they attempted, and for one that is significantly larger. The most important differences are that we used sparse-coarse-coded function approximators (CMACs) whereas they used mostly global function approximators, and that we learned online whereas they learned offline. Boyan and Moore and others have suggested that the problems they encountered could be solved by using actual outcomes (rollouts), as in classical Monte Carlo methods, and as in the TD(lambda) algorithm when lambda=1. However, in our experiments this always resulted in substantially poorer performance. We conclude that reinforcement learning can work robustly in conjunction with function approximators, and that there is little justification at present for avoiding the case of general lambda.
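
A minimal sketch of sparse coarse coding with a CMAC (tile coding) of the kind used here: several offset grids each contribute exactly one active binary feature, so the approximate value is a sum of a few weights. The two-dimensional state, grid sizes, and offsets are illustrative.

    import numpy as np

    def tile_indices(x, y, n_tilings=8, tiles_per_dim=9):
        """One active tile index per tiling for a state (x, y) in [0, 1) x [0, 1)."""
        indices = []
        for t in range(n_tilings):
            offset = t / (n_tilings * tiles_per_dim)      # each tiling is shifted slightly
            ix = int((x + offset) * tiles_per_dim) % tiles_per_dim
            iy = int((y + offset) * tiles_per_dim) % tiles_per_dim
            indices.append(t * tiles_per_dim ** 2 + ix * tiles_per_dim + iy)
        return indices

    def approximate_value(weights, x, y):
        return sum(weights[i] for i in tile_indices(x, y))   # sparse linear value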


Singh, S.P., Sutton, R.S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning 22: 123-158.

ABSTRACT: The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that the method corresponding to replacing traces is closely related to the maximum likelihood solution for these tasks, and that its mean squared error is always lower in the long run. Computational results confirm these analyses and show that they are applicable more generally. In particular, we show that replacing traces significantly improve performance and reduce parameter sensitivity on the Mountain-Car task, a full reinforcement-learning problem with a continuous state space, when using a feature-based function approximator.
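
A minimal tabular sketch of the two trace-update rules being compared; the episode format, step size, and decay parameters are illustrative assumptions rather than the paper's experimental settings.

```python
import numpy as np

def td_lambda_episode(transitions, V, alpha=0.1, gamma=1.0, lam=0.9,
                      replacing=True):
    """One episode of tabular TD(lambda), with either trace type.

    transitions: list of (state, reward, next_state) tuples, where
    next_state is None at termination.  V is a NumPy array indexed by state.
    """
    z = np.zeros_like(V)
    for s, r, s2 in transitions:
        # Trace update -- the only difference between the two methods:
        z *= gamma * lam
        if replacing:
            z[s] = 1.0       # replacing trace: reset to 1 on every revisit
        else:
            z[s] += 1.0      # accumulating (conventional) trace
        target = r + (gamma * V[s2] if s2 is not None else 0.0)
        V += alpha * (target - V[s]) * z
    return V
```
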
Precup, D., Sutton, R.S. (1996). Empirical comparison of gradient descent and exponentiated gradient descent in supervised and reinforcement learning. Technical Report UM-CS-1996-070, Department of Computer Science, University of Massachusetts, Amherst, MA 01003. ABSTRACT: This report describes a series of results using the exponentiated gradient descent (EG) method recently proposed by Kivinen and Warmuth. Prior work is extended by comparing speed of learning on a nonstationary problem and on an extension to backpropagation networks. Most significantly, we present an extension of the EG method to temporal-difference and reinforcement learning. This extension is compared to conventional reinforcement learning methods on two test problems using CMAC function approximators and replace traces. On the larger of the two problems, the average loss was approximately 25% smaller for the EG method. The relative computational complexity and parameter sensitivity of the two methods are also discussed.

Kuvayev, L., Sutton, R.S. (1996). Model-based reinforcement learning with an approximate, learned model. Proceedings of the Ninth Yale Workshop on Adaptive and Learning Systems, pp. 101-105, Yale University, New Haven, CT. Some additional results are in this earlier version of the same paper.

ABSTRACT: Model-based reinforcement learning, in which a model of the environment's dynamics is learned and used to supplement direct learning from experience, has been proposed as a general approach to learning and planning. We present the first experiments with this idea in which the model of the environment's dynamics is both approximate and learned online. These experiments involve the Mountain Car task, which requires approximation of both value function and model because it has continuous state variables. We used models of the simplest possible form, state-aggregation or grid models, and CMACs to represent the value function. We find that model-based methods do indeed perform better than model-free reinforcement learning on this task, but only slightly.

Mehra, R.K., Ravichandran, B., Sutton, R.S. (1996). Adaptive intelligent scheduling for ATM networks. Proceedings of the Ninth Yale Workshop on Adaptive and Learning Systems, pp. 106-111, Yale University, New Haven, CT.

ABSTRACT: Asynchronous Transfer Mode (ATM) is a fast emerging information technology that promises to provide interoperable multi-media services for commercial and defense applications. Unlike the commercial broadband networks for which ATM was originally designed, defense or tactical ATM networks must be able to traverse low-rate transmission links. To allow more flexible and efficient use of this limited bandwidth resource, optimal traffic management in ATM networks is critical. This paper demonstrates the feasibility of developing Self-Learning Adaptive (SLA) scheduling algorithms using Reinforcement Learning (RL). This technique was applied to simulated data and proved to be more efficient than fixed static scheduling methods.

Sutton, R.S. (1995). On the Virtues of Linear Learning and Trajectory Distributions. Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference.


Sutton, R.S. (1995). TD models: Modeling the world at a mixture of time scales. Proceedings of the Twelfth International Conference on Machine Learning, pp. 531-539, Morgan Kaufmann.

ABSTRACT: Temporal-difference (TD) learning can be used not just to predict rewards, as is commonly done in reinforcement learning, but also to predict states, i.e., to learn a model of the world's dynamics. We present theory and algorithms for intermixing TD models of the world at different levels of temporal abstraction within a single structure. Such multi-scale TD models can be used in model-based reinforcement-learning architectures and dynamic programming methods in place of conventional Markov models. This enables planning at higher and varied levels of abstraction, and, as such, may prove useful in formulating methods for hierarchical or multi-level planning and reinforcement learning. In this paper we treat only the prediction problem---that of learning a model and value function for the case of fixed agent behavior. Within this context, we establish the theoretical foundations of multi-scale models and derive TD algorithms for learning them. Two small computational experiments are presented to test and illustrate the theory. This work is an extension and generalization of the work of Singh (1992), Dayan (1993), and Sutton & Pinette (1985).

Sutton, R.S., Singh, S.P. (1994). On bias and step size in temporal-difference learning. Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, pp. 91-96, Yale University, New Haven, CT.

ABSTRACT: We present results for three new algorithms for setting the step-size parameters, alpha and lambda, of temporal-difference learning methods such as TD(lambda). The overall task is that of learning to predict the outcome of an unknown Markov chain based on repeated observations of its state trajectories. The new algorithms select step-size parameters online in such a way as to eliminate the bias normally inherent in temporal-difference methods. We compare our algorithms with conventional Monte Carlo methods. Monte Carlo methods have a natural way of setting the step size: for each state s they use a step size of 1/n_s, where n_s is the number of times state s has been visited. We seek and come close to achieving comparable step-size algorithms for TD(lambda). One new algorithm uses a lambda=1/n_s schedule to achieve the same effect as processing a state backwards with TD(0), but remains completely incremental. Another algorithm uses a lambda at each time equal to the estimated transition probability of the current transition. We present empirical results showing improvement in convergence rate over Monte Carlo methods and conventional TD(lambda). A limitation of our results at present is that they apply only to tasks whose state trajectories do not contain cycles.
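
The Monte Carlo baseline's 1/n_s step size can be written as a simple incremental sample average of returns. The sketch below is an illustrative every-visit, undiscounted version of that baseline, not the paper's exact procedure.

```python
from collections import defaultdict

def monte_carlo_predict(episodes):
    """Every-visit Monte Carlo prediction with the 1/n_s step size.

    episodes: list of episodes, each a list of (state, reward) pairs, where
    `reward` is received on the transition out of `state`.  With step size
    1/n_s, the estimate for each state is exactly the sample average of the
    (undiscounted) returns observed from that state.
    """
    V = defaultdict(float)
    n = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk backward so the return following each visit is available.
        for state, reward in reversed(episode):
            G += reward
            n[state] += 1
            V[state] += (G - V[state]) / n[state]   # incremental sample average
    return V
```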

Sutton, R.S., Whitehead, S.D. (1993). Online learning with random representations. Proceedings of the Tenth International Conference on Machine Learning, pp. 314-321, Morgan Kaufmann.

ABSTRACT: We consider the requirements of online learning---learning which must be done incrementally and in realtime, with the results of learning available soon after each new example is acquired. Despite the abundance of methods for learning from examples, there are few that can be used effectively for online learning, e.g., as components of reinforcement learning systems. Most of these few, including radial basis functions, CMACs, Kohonen's self-organizing maps, and those developed in this paper, share the same structure. All expand the original input representation into a higher dimensional representation in an unsupervised way, and then map that representation to the final answer using a relatively simple supervised learner, such as a perceptron or LMS rule. Such structures learn very rapidly and reliably, but have been thought either to scale poorly or to require extensive domain knowledge. To the contrary, some researchers (Rosenblatt, 1962; Gallant & Smith, 1987; Kanerva, 1988; Prager & Fallside, 1988) have argued that the expanded representation can be chosen largely at random with good results. The main contribution of this paper is to develop and test this hypothesis. We show that simple random-representation methods can perform as well as nearest-neighbor methods (while being more suited to online learning), and significantly better than backpropagation. We find that the size of the random representation does increase with the dimensionality of the problem, but not unreasonably so, and that the required size can be reduced substantially using unsupervised-learning techniques. Our results suggest that randomness has a useful role to play in online supervised learning and constructive induction.
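
A minimal sketch of the structure described above: a fixed random expansion of the input into many binary features, followed by a simple supervised learner trained online with the LMS rule. The random-threshold-unit expansion and the normalized step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_features = 10, 500
# Fixed random expansion layer: each feature is a random linear threshold
# unit over the original inputs (weights and thresholds drawn at random
# once and never trained).
W_rand = rng.standard_normal((n_features, n_inputs))
b_rand = rng.standard_normal(n_features)

def expand(x):
    """Map an input vector to a high-dimensional binary feature vector."""
    return (W_rand @ x + b_rand > 0.0).astype(float)

# Simple supervised output layer trained online with the LMS (delta) rule.
w = np.zeros(n_features)
alpha = 0.1

def lms_update(x, target):
    """One online LMS update on the random features; returns the prediction."""
    phi = expand(x)
    y = w @ phi
    # Normalize the step by the number of active features for stability.
    w[:] += alpha * (target - y) * phi / max(phi.sum(), 1.0)
    return y
```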


Sutton, R.S. (1992). Adapting bias by gradient descent: An incremental version of delta-bar-delta. Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 171-176, MIT Press. Associated talk from 2/2/04.

ABSTRACT: Appropriate bias is widely viewed as the key to efficient learning and generalization. I present a new algorithm, the Incremental Delta-Bar-Delta (IDBD) algorithm, for the learning of appropriate biases based on previous learning experience. The IDBD algorithm is developed for the case of a simple, linear learning system---the LMS or delta rule with a separate learning-rate parameter for each input. The IDBD algorithm adjusts the learning-rate parameters, which are an important form of bias for this system. Because bias in this approach is adapted based on previous learning experience, the appropriate testbeds are drifting or non-stationary learning tasks. For particular tasks of this type, I show that the IDBD algorithm performs better than ordinary LMS and in fact finds the optimal learning rates. The IDBD algorithm extends and improves over prior work by Jacobs and by me in that it is fully incremental and has only a single free parameter. This paper also extends previous work by presenting a derivation of the IDBD algorithm as gradient descent in the space of learning-rate parameters. Finally, I offer a novel interpretation of the IDBD algorithm as an incremental form of hold-one-out cross validation.
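
A compact sketch in the spirit of IDBD: per-input log learning rates adapted by a meta-gradient rule on top of an LMS learner. The exact update details and initialization constants here are assumptions drawn from common presentations of the algorithm, not taken from this abstract.

```python
import numpy as np

class IDBD:
    """Sketch of an incremental delta-bar-delta learner: an LMS rule with one
    meta-learned step size per input (details are assumptions as noted above).
    """

    def __init__(self, n_inputs, theta=0.01, init_alpha=0.05):
        self.theta = theta                                 # single meta step size
        self.w = np.zeros(n_inputs)                        # prediction weights
        self.beta = np.full(n_inputs, np.log(init_alpha))  # log step sizes
        self.h = np.zeros(n_inputs)                        # decaying memory traces

    def update(self, x, target):
        """One example: predict, then adapt step sizes and weights."""
        delta = target - self.w @ x                     # prediction error
        self.beta += self.theta * delta * x * self.h    # meta-gradient step
        alpha = np.exp(self.beta)                       # per-input step sizes
        self.w += alpha * delta * x                     # LMS step, per-input rate
        self.h = self.h * np.clip(1.0 - alpha * x * x, 0.0, None) \
                 + alpha * delta * x
        return delta
```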

Sutton, R.S. (1992). Gain adaptation beats least squares? Proceedings of the Seventh Yale Workshop on Adaptive and Learning Systems, pp. 161-166, Yale University, New Haven, CT.

ABSTRACT: I present computational results suggesting that gain-adaptation algorithms based in part on connectionist learning methods may improve over least squares and other classical parameter-estimation methods for stochastic time-varying linear systems. The new algorithms are evaluated with respect to classical methods along three dimensions: asymptotic error, computational complexity, and required prior knowledge about the system. The new algorithms are all of the same order of complexity as LMS methods, O(n), where n is the dimensionality of the system, whereas least-squares methods and the Kalman filter are O(n^2). The new methods also improve over the Kalman filter in that they do not require a complete statistical model of how the system varies over time. In a simple computational experiment, the new methods are shown to produce asymptotic error levels near that of the optimal Kalman filter and significantly below those of least-squares and LMS methods. The new methods may perform better even than the Kalman filter if there is any error in the filter's model of how the system varies over time.


Sutton, R.S. (1992). Machines that Learn and Mimic the Brain. In ACCESS, GTE's Journal of Science and Technology, 1992. Reprinted in Stethoscope Quarterly, Spring.


Sutton, R.S. (1992). Reinforcement learning architectures. Proceedings ISKIT'92 International Symposium on Neural Information Processing, Fukuoka, Japan.

ABSTRACT: Reinforcement learning is the learning of a mapping from situations to actions so as to maximize a scalar reward or reinforcement signal. The learner is not told which action to take, as in most forms of learning, but instead must discover which actions yield the highest reward by trying them. In the most interesting and challenging cases, actions affect not only the immediate reward, but also the next situation, and through that all subsequent rewards. These two characteristics---trial-and-error search and delayed reward---are the two most important distinguishing features of reinforcement learning. In this paper I present a brief overview of the development of reinforcement learning architectures over the past decade, including reinforcement-comparison, actor-critic, and Q-learning architectures. Finally, I present Dyna, a class of architectures based on reinforcement learning but which go beyond trial-and-error learning to include a learned internal model of the world. By intermixing conventional trial and error with hypothetical trial and error using the world model, Dyna systems can plan and learn optimal behavior very rapidly.


Sutton, R.S. (Ed.) (1992). Reinforcement Learning, book version of a special issue of Machine Learning on reinforcement learning (Volume 8, Numbers 3/4). Kluwer.  Introduction: The challenge of reinforcement learning.


Sanger, T.D., Sutton, R.S., Matheus, C.J. (1992). Iterative construction of sparse polynomial approximations. Advances in Neural Information Processing Systems 4, pp. 1064-1071, Morgan Kaufmann.

ABSTRACT: We present an iterative algorithm for nonlinear regression based on construction of sparse polynomials. Polynomials are built sequentially from lower to higher order. Selection of new terms is accomplished using a novel look-ahead approach that predicts whether a variable contributes to the remaining error. The algorithm is based on the tree-growing heuristic in LMS Trees which we have extended to approximation of arbitrary polynomials of the input features. In addition, we provide a new theoretical justification for this heuristic approach. The algorithm is shown to discover a known polynomial from samples, and to make accurate estimates of pixel values in an image processing task.


Gluck, M., Glauthier, P., Sutton, R.S. (1992). Adaptation of cue-specific learning rates in network models of human category learning, Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society, pp. 540-545, Erlbaum.

ABSTRACT: Recent engineering considerations have prompted an improvement to the least mean squares (LMS) learning rule for training one-layer adaptive networks: incorporating a dynamically modifiable learning rate for each associative weight accelerates overall learning and provides a mechanism for adjusting the salience of individual cues (Sutton, 1992a,b). Prior research has established that the standard LMS rule can characterize aspects of animal learning (Rescorla & Wagner, 1972) and human category learning (Gluck & Bower, 1988a,b). We illustrate here how this enhanced LMS rule is analogous to adding a cue-salience or attentional component to the psychological model, giving the network model a means for discriminating between relevant and irrelevant cues. We then demonstrate the effectiveness of this enhanced LMS rule for modeling human performance in two non-stationary learning tasks for which the standard LMS network model fails to adequately account for the data (Hurwitz, 1990; Gluck, Glauthier, & Sutton, in preparation).

Sutton, R.S. (1991). Planning by incremental dynamic programming. Proceedings of the Eighth International Workshop on Machine Learning, pp. 353-357, Morgan Kaufmann.

ABSTRACT: This paper presents the basic results and ideas of dynamic programming as they relate most directly to the concerns of planning in AI. These form the theoretical basis for the incremental planning methods used in the integrated architecture Dyna. These incremental planning methods are based on continually updating an evaluation function and the situation-action mapping of a reactive system. Actions are generated by the reactive system and thus involve minimal delay, while the incremental planning process guarantees that the actions and evaluation function will eventually be optimal---no matter how extensive a search is required. These methods are well suited to stochastic tasks and to tasks in which a complete and accurate model is not available. For tasks too large to implement the situation-action mapping as a table, supervised-learning methods must be used, and their capabilities remain a significant limitation of the approach.

Sutton, R.S. (1991). Dyna, an integrated architecture for learning, planning and reacting. Working Notes of the 1991 AAAI Spring Symposium on Integrated Intelligent Architectures and SIGART Bulletin 2, pp. 160-163.

ABSTRACT: Dyna is an AI architecture that integrates learning, planning, and reactive execution. Learning methods are used in Dyna both for compiling planning results and for updating a model of the effects of the agent's actions on the world. Planning is incremental and can use the probabilistic and ofttimes incorrect world models generated by learning processes. Execution is fully reactive in the sense that no planning intervenes between perception and action. Dyna relies on machine learning methods for learning from examples---these are among the basic building blocks making up the architecture---yet is not tied to any particular method. This paper briefly introduces Dyna and discusses its strengths and weaknesses with respect to other architectures.


Sutton, R.S. (1991). Integrated modeling and control based on reinforcement learning and dynamic programming. In D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 3, pages 471-478.

ABSTRACT: This is a summary of results with Dyna, a class of architectures for intelligent systems based on approximating dynamic programming methods. Dyna architectures integrate trial-and-error (reinforcement) learning and execution-time planning into a single process operating alternately on the world and on a learned forward model of the world. We describe and show results for two Dyna architectures. The Dyna-PI architecture is based on dynamic programming's policy iteration method and can be related to existing AI ideas such as evaluation functions and universal plans (reactive systems). Using a navigation task, results are shown for a simple Dyna-PI system which simultaneously learns by trial and error, learns a world model, and plans optimal routes using the evolving world model. The Dyna-Q architecture is based on Watkins's Q-learning, a new kind of reinforcement learning. Dyna-Q uses a less familiar set of data structures than does Dyna-PI, but is arguably simpler to implement and use. We show that Dyna-Q architectures are easy to adapt for use in changing environments.
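
A minimal tabular sketch in the spirit of the Dyna-Q architecture: one-step Q-learning from real experience interleaved with planning updates replayed from a learned (here deterministic, last-outcome) model. The environment interface and parameter values are assumptions for illustration only.

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=50, planning_steps=10, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Dyna-Q sketch.  `env` is assumed to provide reset(),
    step(action) -> (next_state, reward, done), and a list `env.actions`.
    """
    Q = defaultdict(float)     # Q[(state, action)]
    model = {}                 # model[(state, action)] = (reward, next_state, done)

    def greedy_value(s):
        return max(Q[(s, a)] for a in env.actions)

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < eps:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # Direct RL: one-step Q-learning from the real transition.
            target = r + (0.0 if done else gamma * greedy_value(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # Model learning: remember the last observed outcome.
            model[(s, a)] = (r, s2, done)
            # Planning: replay simulated transitions drawn from the model.
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pdone else gamma * greedy_value(ps2))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```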


Sutton, R.S. (1991). Reinforcement learning architectures for animats, Proceedings of the First International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pp. 288-296.  smaller gzipped version.

ABSTRACT: In the first part of this paper I argue that the learning problem facing animats is essentially that which has been studied as the reinforcement learning problem---the learning of behavior by trial and error without an explicit teacher. A brief overview is presented of the development of reinforcement learning architectures over the past decade, with references to the literature.

The second part of the paper presents Dyna, a class of architectures based on reinforcement learning but which go beyond trial-and-error learning. Dyna architectures include a learned internal model of the world. By intermixing conventional trial and error with hypothetical trial and error using the world model, Dyna systems can plan and learn optimal behavior very rapidly. Results are shown for simple Dyna systems that learn from trial and error while they simultaneously learn a world model and use it to plan optimal action sequences. We also show that Dyna architectures are easy to adapt for use in changing environments.


Sutton, R.S., Matheus, C.J. (1991). Learning polynomial functions by feature construction. Proceedings of the Eighth International Workshop on Machine Learning, pp. 208-212, Morgan Kaufmann. ABSTRACT: We present a method for learning higher-order polynomial functions from examples using linear regression and feature construction. Regression is used on a set of training instances to produce a weight vector for a linear function over the feature set. If this hypothesis is imperfect, a new feature is constructed by forming the product of the two features that most effectively predict the squared error of the current hypothesis. This algorithm is then repeated. In an extension to this method, the specific pair of features to combine is selected by measuring their joint ability to predict the hypothesis' error.
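
A rough sketch of the feature-construction loop described in the abstract. The correlation-based scoring of candidate features against the squared error is an assumption; the paper's exact selection criterion may differ.

```python
import numpy as np

def grow_polynomial_features(X, y, n_new_features=5):
    """Iteratively add product features that best predict the current error.

    X: (n_samples, n_features) array of input features; y: target values.
    Each round fits a linear least-squares hypothesis, then appends one new
    feature: the product of the two existing features most strongly
    correlated (in absolute value) with the squared error of that hypothesis.
    """
    F = np.asarray(X, dtype=float).copy()
    y = np.asarray(y, dtype=float)
    for _ in range(n_new_features):
        w, *_ = np.linalg.lstsq(F, y, rcond=None)       # current linear hypothesis
        sq_err = (y - F @ w) ** 2
        # Score each feature by |correlation| with the squared error.
        Fc = F - F.mean(axis=0)
        ec = sq_err - sq_err.mean()
        scores = np.abs(Fc.T @ ec) / (np.linalg.norm(Fc, axis=0)
                                      * np.linalg.norm(ec) + 1e-12)
        i, j = np.argsort(scores)[-2:]                  # two best predictors
        F = np.column_stack([F, F[:, i] * F[:, j]])     # add their product
    return F
```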


Sutton, R.S., Barto, A.G., Williams, R. (1991). Reinforcement learning is direct adaptive optimal control, Proceedings of the American Control Conference, pages 2143-2146. Also published in IEEE Control Systems Magazine 12, No. 2, 19-22.

ABSTRACT: In this paper we present a control-systems perspective on one of the major neural-network approaches to learning control, reinforcement learning. Control problems can be divided into two classes: 1) regulation and tracking problems, in which the objective is to follow a reference trajectory, and 2) optimal control problems, in which the objective is to extremize a functional of the controlled system's behavior that is not necessarily defined in terms of a reference trajectory. Adaptive methods for problems of the first kind are well known, and include self-tuning regulators and model-reference methods, whereas adaptive methods for optimal-control problems have received relatively little attention. Moreover, the adaptive optimal-control methods that have been studied are almost all indirect methods, in which controls are recomputed from an estimated system model at each step. This computation is inherently complex, making adaptive methods in which the optimal controls are estimated directly more attractive. We view reinforcement learning methods as a computationally simple, direct approach to the adaptive optimal control of nonlinear systems. For concreteness, we focus on one reinforcement learning method (Q-learning) and on its analytically proven capabilities for one class of adaptive optimal control problems (Markov decision problems with unknown transition probabilities).
Miller, W. T., Sutton, R. S., Werbos, P. J. (Eds.), Neural Networks for Control.  MIT Press, 1991.


Sutton, R.S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Proceedings of the Seventh International Conference on Machine Learning, pp. 216-224, Morgan Kaufmann. Also appeared as "Artificial intelligence by dynamic programming," in Proceedings of the Sixth Yale Workshop on Adaptive and Learning Systems, pp. 89-95.  smaller gzipped postscript version. ABSTRACT: This paper extends previous work with Dyna, a class of architectures for intelligent systems based on approximating dynamic programming methods. Dyna architectures integrate trial-and-error (reinforcement) learning and execution-time planning into a single process operating alternately on the world and on a learned model of the world. In this paper, I present and show results for two Dyna architectures. The Dyna-PI architecture is based on dynamic programming's policy iteration method and can be related to existing AI ideas such as evaluation functions and universal plans (reactive systems). Using a navigation task, results are shown for a simple Dyna-PI system that simultaneously learns by trial and error, learns a world model, and plans optimal routes using the evolving world model. The Dyna-Q architecture is based on Watkins's Q-learning, a new kind of reinforcement learning. Dyna-Q uses a less familiar set of data structures than does Dyna-PI, but is arguably simpler to implement and use. We show that Dyna-Q architectures are easy to adapt for use in changing environments.


Sutton, R.S. (1990). First results with Dyna, an integrated architecture for learning, planning, and reacting. In Neural Networks for Control, Miller, T., Sutton, R.S., & Werbos, P., Eds., MIT Press.


Sutton, R.S., Barto, A.G. (1990). Time-derivative models of pavlovian reinforcement. In Learning and Computational Neuroscience: Foundations of Adaptive Networks, M. Gabriel and J. Moore, Eds., pp. 497-537. MIT Press.

Presentation and analysis of models of classical conditioning based on temporal-difference learning. Most comprehensive treatment of the TD model of classical conditioning.
Barto, A.G., Sutton, R.S., Watkins, C.J.C.H. (1990). Learning and sequential decision making. In Learning and Computational Neuroscience: Foundations of Adaptive Networks, M. Gabriel and J.W. Moore, Eds., pp. 539-602, MIT Press. The following abstract is from the technical report version of this paper. ABSTRACT: In this report we show how the class of adaptive prediction methods that Sutton called "temporal difference," or TD, methods are related to the theory of sequential decision making. TD methods have been used as "adaptive critics" in connectionist learning systems, and have been proposed as models of animal learning in classical conditioning experiments. Here we relate TD methods to decision tasks formulated in terms of a stochastic dynamical system whose behavior unfolds over time under the influence of a decision maker's actions. Strategies are sought for selecting actions so as to maximize a measure of long-term payoff gain. Mathematically, tasks such as this can be formulated as Markovian decision problems, and numerous methods have been proposed for learning how to solve such problems. We show how a TD method can be understood as a novel synthesis of concepts from the theory of stochastic dynamic programming, which comprises the standard method for solving such tasks when a model of the dynamical system is available, and the theory of parameter estimation, which provides the appropriate context for studying learning rules in the form of equations for updating associative strengths in behavioral models, or connection weights in connectionist networks. Because this report is oriented primarily toward the non-engineer interested in animal learning, it presents tutorials on stochastic sequential decision tasks, stochastic dynamic programming, and parameter estimation.


Barto, A.G., Sutton, R.S., & Watkins, C. (1990). Sequential decision problems and neural networks. In D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 2, pp. 686-693.

ABSTRACT: Many problems require the formation of sequences of decisions whose consequences emerge over time periods of variable and uncertain duration. Decision-making strategies must be formed that take into account both the expected short- and long-term consequences of decisions. Problems such as this can be formulated as stochastic sequential decision problems, and can be solved by stochastic dynamic programming methods. In this paper, we describe the stochastic sequential decision framework and dynamic programming, and we show how the Adaptive Heuristic Critic (AHC) algorithm, a variant of which was used in the adaptive pole-balancer of Barto, Sutton, and Anderson, fits into this framework. We relate the AHC algorithm to its antecedents and discuss the utility of neural-network methods within this context. We present several examples illustrating the utility of these methods, and conclude by discussing implications for adaptive optimal control and the modeling of animal learning.


Whitehead, S., Sutton, R.S., & Ballard, D. (1990). Advances in reinforcement learning and their implications for intelligent control, Proceedings of the Fifth IEEE International Symposium on Intelligent Control 1990, pp. 1289-1297.


Franklin, J., Sutton, R.S., Anderson, C., Selfridge, O., & Schwartz, D. (1990). Connectionist learning control at GTE laboratories, Intelligent Control and Adaptive Systems, G. Rodriguez, Ed., Proc. SPIE 1196, pp. 242-253.


Anderson, C., Franklin, J., & Sutton, R.S. (1990). Learning a nonlinear model of a manufacturing process using multilayer connectionist networks, Proceedings of the Fifth IEEE International Symposium on Intelligent Control 1990, pp. 404-409.


Sutton, R.S. (1989). Artificial intelligence as a control problem: Comments on the relationship between machine learning and intelligent control. Appeared in Machine Learning in a dynamic world. Proceedings of the IEEE International Symposium on Intelligent Control 1988, pp. 500-507.


Sutton, R.S. (1989). Implementation details of the TD(lambda) procedure for the case of vector predictions and backpropagation. GTE Laboratories Technical Report TR87-509.1, as corrected August 1989. GTE Laboratories, 40 Sylvan Road, Waltham, MA 02254.


Franklin, J., Sutton, R.S., & Anderson, C. (1989). Application of connectionist learning methods to manufacturing process monitoring, Proceedings of the IEEE International Symposium on Intelligent Control 1988, pp. 709-712.


Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3: 9-44, erratum p. 377. Scan of paper as published, with erratum added. Digitally remastered with missing figure in place.

ABSTRACT: This article introduces a class of incremental learning procedures specialized for prediction---that is, for using past experience with an incompletely known system to predict its future behavior. Whereas conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes, the new methods assign credit by means of the difference between temporally successive predictions. Although such temporal-difference methods have been used in Samuel's checker player, Holland's bucket brigade, and the author's Adaptive Heuristic Critic, they have remained poorly understood. Here we prove their convergence and optimality for special cases and relate them to supervised-learning methods. For most real-world prediction problems, temporal-difference methods require less memory and less peak computation than conventional methods; and they produce more accurate predictions. We argue that most problems to which supervised learning is currently applied are really prediction problems of the sort to which temporal-difference methods can be applied to advantage.
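
A minimal sketch of linear TD(lambda) for the prediction setting of this paper: multi-step trajectories ending in a scalar outcome, linear predictions, and an eligibility-trace vector that credits earlier states for later prediction errors. The data format and parameter values are illustrative assumptions.

```python
import numpy as np

def linear_td_lambda(trajectories, n_features, alpha=0.01, lam=0.8):
    """Linear TD(lambda) for multi-step prediction of a terminal outcome.

    trajectories: list of (features, outcome) pairs, where `features` is a
    list of feature vectors x_1..x_T observed along one sequence and
    `outcome` is the scalar outcome z revealed at the end.  Predictions are
    linear, P(x) = w . x, and successive-prediction errors are credited to
    earlier states through an eligibility-trace vector.
    """
    w = np.zeros(n_features)
    for features, outcome in trajectories:
        e = np.zeros(n_features)                 # eligibility trace vector
        for t, x in enumerate(features):
            e = lam * e + x
            next_pred = (w @ features[t + 1] if t + 1 < len(features)
                         else outcome)           # final target is the outcome
            delta = next_pred - w @ x            # temporal-difference error
            w = w + alpha * delta * e
    return w
```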


Sutton, R.S. (1988). Convergence theory for a new kind of prediction learning, Proceedings of the 1988 Workshop on Computational Learning Theory, pp. 421-42.


Sutton, R.S. (1988). NADALINE: A normalized adaptive linear element that learns efficiently. GTE Laboratories Technical Report TR88-509.4. GTE Laboratories, 40 Sylvan Road, Waltham, MA 02254.

ABSTRACT: This paper introduces a variant of the ADALINE in which the input signals are normalized to have zero mean and unit variance, and in which the bias or "threshold weight" is learned slightly differently. These changes result in a linear learning element that learns much more efficiently and rapidly, and that is much less dependent on the choice of the step-size parameter. Using simulation experiments, learning time improvements ranging from 30% to hundreds of times are shown. The memory and computational complexity of the new element remains O(N), where N is the number of input signals, and the added computations are entirely local. Theoretical analysis indicates that the new element learns optimally fast in a certain sense and to the extent that the input signals are statistically independent.
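
A rough sketch in the spirit of the idea: inputs are normalized online to zero mean and unit variance before a standard LMS update, with the bias tracking the target level directly. The running-statistics scheme and bias handling are assumptions, not the report's exact algorithm.

```python
import numpy as np

class NormalizedLMS:
    """Online LMS over inputs normalized to zero mean and unit variance."""

    def __init__(self, n_inputs, alpha=0.1, stat_rate=0.01):
        self.alpha, self.rate = alpha, stat_rate
        self.mean = np.zeros(n_inputs)   # running input means
        self.var = np.ones(n_inputs)     # running input variances
        self.w = np.zeros(n_inputs)
        self.bias = 0.0

    def update(self, x, target):
        """Normalize the input, predict, and apply one LMS step."""
        self.mean += self.rate * (x - self.mean)
        self.var += self.rate * ((x - self.mean) ** 2 - self.var)
        xn = (x - self.mean) / np.sqrt(self.var + 1e-8)
        y = self.w @ xn + self.bias
        err = target - y
        self.w += self.alpha * err * xn
        self.bias += self.alpha * err    # bias absorbs the target's mean level
        return y
```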


Selfridge, O., Sutton, R.S., & Anderson, C. (1988). Selected bibliography on connectionism, In: Evolution, Learning, and Cognition, Y.C. Lee (Ed.), pp. 391-403, World Scientific Publishing.


Sutton, R.S., & Barto, A.G. (1987). A temporal-difference model of classical conditioning, Proceedings of the Ninth Annual Conference of the Cognitive Science Society, pp. 355-378.

ABSTRACT: Rescorla and Wagner's model of classical conditioning has been one of the most influential and successful theories of this fundamental learning process. The learning rule of their theory was first described as a learning procedure for connectionist networks by Widrow and Hoff. In this paper we propose a similar confluence of psychological and engineering constraints. Sutton has recently argued that adaptive prediction methods called temporal-difference methods have advantages over other prediction methods for certain types of problems. Here we argue that temporal-difference methods can provide detailed accounts of aspects of classical conditioning behavior. We present a model of classical conditioning behavior that takes the form of a temporal-difference prediction method. We argue that it is an improvement over the Rescorla-Wagner model in its handling of within-trial temporal effects such as the ISI dependency, primacy effects, and the facilitation of remote associations in serial-compound conditioning. The new model is closely related to the model of classical conditioning that we proposed in 1981, but avoids some of the problems with that model recently identified by Moore et al. We suggest that the theory of adaptive prediction on which our model is based provides insight into the functionality of classical conditioning behavior.


Sutton, R.S. (1986). Two problems with backpropagation and other steepest-descent learning procedures for networks, Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pp. 823-831. Reprinted in Artificial Neural Networks: Concepts and Theory, edited by P. Mehra and B. Wah, IEEE Computer Society Press, 1992.  smaller gzipped version.

ABSTRACT: This article contributes to the theory of network learning procedures by identifying and analyzing two problems with the backpropagation procedure of Rumelhart, Hinton, and Williams (1985) that may slow its learning. Both problems are due to backpropagation's being a gradient- or steepest-descent method in the weight space of the network. The first problem is that steepest descent is a particularly poor descent procedure for surfaces containing ravines---places which curve more sharply in some directions than others---and such ravines are common and pronounced in performance surfaces arising from networks. The second problem is that steepest descent results in a high level of interference between learning with different patterns, because those units that have so far been found most useful are also those most likely to be changed to handle new patterns. The same problems probably also arise with the Boltzmann machine learning procedure (Ackley, Hinton, and Sejnowski, 1985) and with reinforcement learning procedures (Barto and Anderson, 1985), as these are also steepest-descent procedures. Finally, some directions in which to look for improvements to backpropagation based on alternative descent procedures are briefly considered.
Moore, J., Desmond, J, Berthier, N., Blazis, D., Sutton, R.S., & Barto, A.G. (1986). Simulation of the classically conditioned nictitating membrane response by a neuron-like adaptive element: Response topography, neuronal firing, and interstimulus intervals, Behavioural Brain Research 21: 143-154. ABSTRACT: A neuron-like adaptive element with computational features suitable for classical conditioning, the Sutton-Barto (S-B) model, was extended to simulate real-time aspects of the conditioned nictitating membrane (NM) response. The aspects of concern were response topography, CR-related neuronal firing, and interstimulus interval (ISI) effects for forward-delay and trace conditioning paradigms. The topography of the NM CR has the following features: response latency after CS onset decreases over trials; response amplitude increases gradually within the ISI and attains its maximum coincidentally with the UR. A similar pattern characterizes the firing of some (but not all) neurons in brain regions demonstrated experimentally to be important for NM conditioning. The variant of the S-B model described in this paper consists of a set of parameters and implementation rules based on 10-ms computational time steps. It differs from the original S-B model in a number of ways. The main difference is the assumption that CS inputs to the adaptive element are not instantaneous but are instead shaped by unspecified coding processes so as to produce outputs that conform with the real-time properties of NM conditioning. The model successfully simulates the aforementioned features of NM response topography. It is also capable of simulating appropriate ISI functions, i.e. with maximum conditioning strength with ISIs of 250 ms, for forward-delay and trace paradigms. The original model's successful treatment of multiple-CS phenomena, such as blocking, conditioned inhibition, and higher-order conditioning, are retained by the present model.


Sutton, R.S. (1985). Learning distributed, searchable, internal models, Proceedings of the Distributed Artificial Intelligence Workshop, pp. 287-289.


Sutton, R.S., & Pinette, B. (1985). The learning of world models by connectionist networks, Proceedings of the Seventh Annual Conference of the Cognitive Science Society, pp. 54-64.

ABSTRACT: Connectionist learning schemes have hitherto seemed far removed from cognitive skills such as reasoning, planning, and the formation of internal models. In this article we investigate what sort of world models a connectionist system might learn and how it might do so. A learning scheme is presented that forms such models based on observed stimulus-stimulus relationships. The basis of the scheme is a recurrently connected network of simple, neuron-like processing elements. The net produces a set of predictions of future stimuli based on the current stimuli, where these predictions are based on a model and involve multiple-step chains of predictions. Results are presented from computer simulations of the scheme connected to a simple world consisting of a stochastic maze (Markov process). By wandering around the maze the network learns its construction. When reinforcement is subsequently introduced, the solution to the maze is learned much more quickly than it is without the "exploration" period. The form and logic of the experiment is the same as that of the latent learning experiments of animal learning research.


Barto, A.G. & Sutton, R.S. (1985). Neural problem solving. In: Synaptic Modification, Neuron Selectivity, and Nervous System Organization, W.B. Levy & J.A. Anderson (Eds.), pp. 123-152. Lawrence Erlbaum.


Selfridge, O., Sutton, R.S., & Barto, A.G. (1985). Training and tracking in robotics, Proceedings of the Ninth International Joint Conference on Artificial Intelligence, pp. 670-672.

ABSTRACT: We explore the use of learning schemes in training and adapting performance on simple coordination tasks. The task is 1-D pole balancing. Several programs incorporating learning have already achieved this [1,3,8]: the problem is to move a cart along a short piece of track so as to keep a pole balanced on its end; the pole is hinged to the cart at its bottom, and the cart is moved either to the left or to the right by a force of constant magnitude. The form of the task considered here, after [3], involves a genuinely difficult credit-assignment problem. We use a learning scheme previously developed and analyzed [1,7] to achieve performance through reinforcement, and extend it to include changing and new requirements. For example, the length or mass of the pole can change, the bias of the force, its strength, and so on; and the system can be tasked to avoid certain regions altogether. In this way we explore the learning system's ability to adapt to changes and to profit from a selected training sequence, both of which are of obvious utility in practical robotics applications.

The results here were obtained using a computer simulation of the pole-balancing problem. A movie will be shown of the performance of the system under the various requirements and tasks.

[1] Barto, Sutton, & Anderson, 1983
[3] Michie & Chambers, 1968
[7] Sutton, 1984
[8] Widrow & Smith, 1964


Moore, J., Desmond, J., Berthier, N., Blazis, D., Sutton, R.S., & Barto, A.G. (1985). Connectionist learning in real time: Sutton-Barto adaptive element and classical conditioning of the nictitating membrane response, Proceedings of the Seventh Annual Conference of the Cognitive Science Society, pp. 318-322.


Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning (106 Mbytes).  Ph.D. dissertation, Department of Computer Science, University of Massachusetts, Amherst, MA 01003. Published as COINS Technical Report 84-2.

ABSTRACT: This dissertation describes computational experiments comparing the performance of a range of reinforcement learning algorithms. The experiments are designed to focus on aspects of the credit-assignment problem having to do with determining when the behavior that deserves credit occurred. The issues of knowledge representation involved in developing new features or refining existing ones are not addressed.

The algorithms considered include some from learning automata theory, mathematical learning theory, early "cybernetic" approaches to learning, Samuel's checker-playing program, Michie and Chambers's "Boxes" system, and a number of new algorithms. The tasks were selected to involve, first in isolation and then in combination, the issues of misleading generalizations, delayed reinforcement, unbalanced reinforcement, and secondary reinforcement. The tasks range from simple, abstract "two-armed bandit" tasks to a physically realistic pole-balancing task.

The results indicate several areas where the algorithms presented here perform substantially better than those previously studied. An unbalanced distribution of reinforcement, misleading generalizations, and delayed reinforcement can greatly retard learning and in some cases even make it counterproductive. Performance can be substantially improved in the presence of these common problems through the use of mechanisms of reinforcement comparison and secondary reinforcement. We present a new algorithm similar to the "learning-by-generalization" algorithm used for altering the static evaluation function in Samuel's checker-playing program. Simulation experiments indicate that the new algorithm performs better than a version of Samuel's algorithm suitably modified for reinforcement learning tasks. Theoretical analysis in terms of an "ideal reinforcement signal" sheds light on the relationship between these two algorithms and other temporal credit-assignment algorithms.


Barto, A.G., Sutton, R.S., & Anderson, C. (1983). Neuron-like adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, SMC-13: 834-846. 12.4 Mb pdf. 3.5 Mb gzipped.

ABSTRACT: It is shown how a system consisting of two neuronlike adaptive elements can solve a difficult learning control problem. The task is to balance a pole that is hinged to a movable cart by applying forces to the cart's base. It is assumed that the equations of motion of the cart-pole system are not known and that the only feedback evaluating performance is a failure signal that occurs when the pole falls past a certain angle from the vertical, or the cart reaches an end of a track. This evaluative feedback is of much lower quality than is required by standard adaptive control techniques. It is argued that the learning problems faced by adaptive elements that are components of adaptive networks are at least as difficult as this version of the pole-balancing problem. The learning system consists of a single associative search element (ASE) and a single adaptive critic element (ACE). In the course of learning to balance the pole, the ASE constructs associations between input and output by searching under the influence of reinforcement feedback, and the ACE constructs a more informative evaluation function than reinforcement feedback alone can provide. The differences between this approach and other attempts to solve problems using neuronlike elements are discussed, as is the relation of this work to classical and instrumental conditioning in animal learning studies and its possible implications for research in the neurosciences.
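
A simplified sketch of the ASE/ACE arrangement over a binary state feature vector: the ACE (critic) converts the sparse failure signal into an internal reinforcement using its own predictions, and the ASE (actor) adjusts its stochastic action weights using that signal and an eligibility trace. Parameter values and trace details are illustrative assumptions, not the paper's settings.

```python
import numpy as np

class ASEACE:
    """Actor (ASE) and critic (ACE) learning over a binary state vector x."""

    def __init__(self, n_features, alpha=0.5, beta=0.1, gamma=0.95,
                 actor_trace=0.9, critic_trace=0.8):
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.ld, self.mu = actor_trace, critic_trace
        self.w = np.zeros(n_features)     # ASE (actor) weights
        self.v = np.zeros(n_features)     # ACE (critic) weights
        self.e = np.zeros(n_features)     # actor eligibility trace
        self.xbar = np.zeros(n_features)  # critic stimulus trace
        self.prev_p = 0.0                 # previous critic prediction

    def act(self, x, noise_std=0.1):
        """Choose a +1/-1 action (e.g. push left or right) with noisy search."""
        y = 1.0 if self.w @ x + np.random.normal(0.0, noise_std) > 0.0 else -1.0
        self.e = self.ld * self.e + (1.0 - self.ld) * y * x   # what was done where
        self.xbar = self.mu * self.xbar + (1.0 - self.mu) * x
        self.prev_p = self.v @ x
        return y

    def learn(self, x_next, r, failed):
        """Turn the raw reinforcement r into an internal signal and update."""
        p_next = 0.0 if failed else self.v @ x_next
        r_hat = r + self.gamma * p_next - self.prev_p   # internal reinforcement
        self.w += self.alpha * r_hat * self.e           # ASE: reinforce recent actions
        self.v += self.beta * r_hat * self.xbar         # ACE: improve predictions
```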


Sutton, R.S. (1982 - unpublished draft). A theory of salience change dependent on the relationship between discrepancies on successive trials on which the stimulus is present. smaller gzipped version.

ABSTRACT: All current theories of salience change (Frey and Sears, 1978; Kirk, 1974; Mackintosh, 1975; Dickinson and Mackintosh, 1979; Moore and Stickney, 1980; Pearce and Hall, 1980) determine change in salience based solely on events occurring on a single trial. However, a theoretical view of classical conditioning as prediction suggests that the relationship between the discrepancies on different trials on which a stimulus occurs should be used to change its salience. This paper presents a theory of salience change based on this idea and discusses its relationship to known experimental results and other theories of salience change. Several experiments are proposed where the theory makes novel predictions subject to experimental test. A mathematical derivation of the main equation of the theory from high-level assumptions is presented in the appendix.
Barto, A.G. & Sutton, R.S. (1982). Simulation of anticipatory responses in classical conditioning by a neuron-like adaptive element, Behavioural Brain Research 4:221-235. SUMMARY: A neuron-like adaptive element is described that produces an important feature of the anticipatory nature of classical conditioning. The response that occurs after training (conditioned response) usually begins earlier than the reinforcing stimulus (unconditioned stimulus). The conditioned response therefore usually anticipates the unconditioned stimulus. This aspect of classical conditioning has been largely neglected by hypotheses that neurons provide single unit analogs of classical conditioning. This paper briefly presents the model and extends earlier results by computer simulation of conditioned inhibition and chaining of associations.


Barto, A.G., Anderson, C., & Sutton, R.S. (1982). Synthesis of nonlinear control surfaces by a layered associative network, Biological Cybernetics 43:175-185.

ABSTRACT: An approach to solving nonlinear control problems is illustrated by means of a layered associative network composed of adaptive elements capable of reinforcement learning. The first layer adaptively develops a representation in terms of which the second layer can solve the problem linearly. The adaptive elements comprising the network employ a novel type of learning rule whose properties, we argue, are essential to the adaptive behavior of the layered network. The behavior of the network is illustrated by means of a spatial learning problem that requires the formation of nonlinear associations. We argue that this approach to nonlinearity can be extended to a large class of nonlinear control problems.


Barto, A.G., Sutton, R.S., & Anderson, C. (1982). Spatial learning simulation systems, Proceedings of the 10th IMACS World Congress on Systems Simulation and Scientific Computation, pp. 204-206.


Sutton, R.S. (1981). Adaptation of learning rate parameters.  In: Goal Seeking Components for Adaptive Intelligence: An Initial Assessment, by A. G. Barto and R. S. Sutton. Air Force Wright Aeronautical Laboratories Technical Report AFWAL-TR-81-1070. Wright-Patterson Air Force Base, Ohio 45433.


Sutton, R.S., & Barto, A.G. (1981). Toward a modern theory of adaptive networks: Expectation and prediction, Psychological Review 88:135-140. Translated into Spanish by G. Ruiz to appear in the journal Estudios de Psicologia.

ABSTRACT: Many adaptive neural theories are based on neuronlike adaptive elements that can behave as single unit analogs of associative conditioning. In this article we develop a similar adaptive element, but one which is more closely in accord with the facts of animal learning theory than elements commonly studied in adaptive network research. We suggest that an essential feature of classical conditioning that has been largely overlooked by adaptive network theorists is its predictive nature. The adaptive element we present learns to increase its response rate in anticipation of increased stimulation, producing a conditioned response before the occurrence of the unconditioned stimulus. The element also is in strong agreement with the behavioral data regarding the effects of stimulus context, since it is a temporally refined extension of the Rescorla-Wagner model. We show by computer simulation that the element becomes sensitive to the most reliable, nonredundant, and earliest predictors of reinforcement. We also point out that the model solves many of the stability and saturation problems encountered in network simulations. Finally, we discuss our model in light of recent advances in the physiology and biochemistry of synaptic mechanisms.
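
A rough sketch of a trace-based anticipatory learning element in the spirit of the model: associative strengths change in proportion to the change in the element's output times an eligibility trace of each stimulus, so stimuli that reliably precede increased stimulation gain strength. The specific trace forms and constants are assumptions, not necessarily the published model's exact equations.

```python
import numpy as np

def anticipatory_element(cs, us, c=0.1, stim_decay=0.8, out_decay=0.8):
    """Trace-based anticipatory learning over one stimulus sequence.

    cs: array of shape (T, n) with conditioned-stimulus activations per step.
    us: array of shape (T,) with unconditioned-stimulus activations.
    Associative strengths grow for stimuli whose traces are active when the
    element's output rises above its own recent (traced) level, i.e. for
    stimuli that precede and predict increases in stimulation.
    """
    cs = np.asarray(cs, dtype=float)
    us = np.asarray(us, dtype=float)
    T, n = cs.shape
    V = np.zeros(n)        # associative strengths
    xbar = np.zeros(n)     # per-stimulus eligibility traces
    ybar = 0.0             # trace of the element's recent output
    for t in range(T):
        y = us[t] + V @ cs[t]            # element output (CR plus UR drive)
        V += c * (y - ybar) * xbar       # learn from the change in output
        xbar = stim_decay * xbar + (1.0 - stim_decay) * cs[t]
        ybar = out_decay * ybar + (1.0 - out_decay) * y
    return V
```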


Sutton, R.S., & Barto, A.G. (1981). An adaptive network that constructs and uses an internal model of its world, Cognition and Brain Theory 4:217-246.

ABSTRACT: Many theorists have emphasized the role of an "internal model of the world" in directing intelligent adaptive behavior. An internal model can be used to internally simulate the consequences of possible actions in order to choose among them without the necessity of overtly performing them. Animal learning theorists have taken latent learning experiments as demonstrations that animals can learn and use such internal models. In this paper, we describe an adaptive network of neuronlike components that constructs and uses an internal model, and we demonstrate this ability by describing a computer simulation of its behavior in a simplified analog of a latent learning task. The task has been made as simple as possible while still retaining those features that make behavior in latent learning tasks difficult to account for by connectionist models. The network illustrates a principle by which connectionist-like learning rules can give rise to behavior apparently requiring the formation and use of internal models. As such, it may help form a bridge between brain theory and connectionist models on the one hand, and cognitive and information processing models on the other.


Barto, A.G., Sutton, R.S. (1981). Goal seeking components for adaptive intelligence: An initial assessment. Air Force Wright Aeronautical Laboratories Technical Report AFWAL-TR-81-1070. Wright-Patterson Air Force Base, Ohio 45433. 524 pages.  (Appendix C is available separately.)

ABSTRACT: This report assesses the promise of a network approach to adaptive problem solving in which the network components themselves possess considerable adaptive power. We show that components designed with attention to the temporal aspects of reinforcement learning can acquire knowledge about feedback pathways in which they are embedded and can use this knowledge to seek their preferred inputs, thus combining pattern recognition, search, and control functions. A review of adaptive network research shows that networks of components having these capabilities have not been studied previously. We demonstrate that simple networks of these elements can solve types of problems that are beyond the capabilities of networks studied in the past. An associative memory is presented that retains the generalization capabilities and noise resistance of associative memories previously studied but does not require a "teacher" to provide the desired associations. It conducts active, closed-loop searches for the most rewarding associations. We provide an example in which these searches are conducted through the system's external environment and an example in which they are conducted through an internal predictive model of that environment. The latter system is capable of a simple form of latent learning. We argue that components capable of making progress toward their goals when embedded in environments that are indifferent, or even hostile, with respect to these goals can form the substrate for a decentralized, parallel, and pervasively adaptive problem solver. We discuss the hypothesis that neural information processing can be understood in these terms, and we relate our results to animal learning data.


Barto, A.G., Sutton, R.S., & Brouwer, P. (1981). Associative search network: A reinforcement learning associative memory, Biological Cybernetics 40:201-211.

ABSTRACT: An associative memory system is presented which does not require a "teacher" to provide the desired associations. For each input key it conducts a search for the output pattern which optimizes an external payoff or reinforcement signal. The associative search network (ASN) combines pattern recognition and function optimization capabilities in a simple and effective way. We define the associative search problem, discuss conditions under which the associative search network is capable of solving it, and present results from computer simulations. The synthesis of sensori-motor control surfaces is discussed as an example of the associative search problem.


Barto, A.G. & Sutton, R.S. (1981). Landmark learning: An illustration of associative search, Biological Cybernetics 42:1-8.

ABSTRACT: In a previous paper we defined the associative search problem and presented a system capable of solving it under certain conditions. In this paper we interpret a spatial learning problem as an associative search task and describe the behavior of an adaptive network capable of solving it. This example shows how naturally the associative search problem can arise and permits the search, association, and generalization properties of the adaptive network to be clearly illustrated.


Sutton, R.S. (1978). Single channel theory: A neuronal theory of learning, Brain Theory Newsletter 3, No. 3/4, pp. 72-75. (earliest publication)


Sutton, R.S. (1978 - unpublished). A unified theory of expectation in classical and instrumental conditioning. Bachelors thesis, Stanford University.


Sutton, R.S. (1978 - unpublished). Learning theory support for a single channel theory of the brain.