Research

On this page, I delve into various research directions that I have explored through the papers I have authored. While there is some overlap among them, the categorized sections below provide a helpful overview of most (though not all) of my research interests. They also shed light on the connections between some of the papers I have written.

I strongly believe that addressing sequential decision making problems entails tackling three primary challenges: representation learning (or generalization), exploration, and credit-assignment. To overcome these hurdles, diverse techniques can be employed, including temporal abstractions, model-based reinforcement learning, and policy optimization. These approaches can be explored within various contexts, such as continual learning problems, real-world applications, and computer games.

When it comes to my research style, I primarily focus on empirical investigations, placing significant emphasis on proper empirical evaluation and best practices for conducting such evaluations. Additionally, I am particularly excited in unifying algorithms that seemingly lack a connection at first glance but can be shown to be deeply connected.

Continual learning

Description

Reinforcement learning approaches have been successfully used in several real-world applications, such as navigating balloons in the stratosphere and automatically producing plasma configurations for Tokamak nuclear fusion reactors. Largely, these systems can be characterized as a search for a fixed policy that once deployed no longer changes. The search is typically done offline using a batch of historical data or more often on a simulator. Indeed, in some cases the pursuit of a stationary policy is wholly appropriate as the policy does not need to change once deployed. But there are many applications in which a stationary policy can quickly become ineffective once the system is deployed and more data becomes available. In water treatment plants, for example, we observe sensor changes, with data drifting, within minutes [3]. More generally, continual adaptation is unavoidable in situations in which the world around the learning system continually changes.

Some of the published research we have done in this problem include introducing an Atari-based framework for research in continual reinforcement learning [2], characterizing the loss of plasticity in performant deep reinforcement learning systems and showing how remarkably simple solutions can mitigate such an issue [2]. It is important though to understand the underlying phenomenon that leads to loss of plasticity, and currently it seems loss of curvature is the most complete explanation we have for this phenomenon [5].

When thinking about continual learning as a problem formulation, it is then natural to ask the question about what are the effective solutions for the traditional challenges of sequential-decision making problems. For generalization, it seems that discrete representations (as in multiple one-hot vectors) seem to be more effective for continual learning [4]. Another question to ask is how to effectively explore an ever-changing environment. We have shown that discovering options in order to perform temporally-extended exploration is a remarkably effective strategy [1].

References

[1] Deep Laplacian-based Options for Temporally-Extended Exploration

M. Klissarov, and M. C. Machado

In International Conference on Machine Learning (ICML), 2023

@inproceedings{klissarov2023deep, title = {Deep Laplacian-based Options for Temporally-Extended Exploration}, author = {Martin Klissarov and Marlos C. Machado}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2023} }

Selecting exploratory actions that generate a rich stream of experience for better learning is a fundamental challenge in reinforcement learning (RL). An approach to tackle this problem consists in selecting actions according to specific policies for an extended period of time, also known as options. A recent line of work to derive such exploratory options builds upon the eigenfunctions of the graph Laplacian. Importantly, until now these methods have been mostly limited to tabular domains where (1) the graph Laplacian matrix was either given or could be fully estimated, (2) performing eigendecomposition on this matrix was computationally tractable, and (3) value functions could be learned exactly. Additionally, these methods required a separate option discovery phase. These assumptions are fundamentally not scalable. In this paper we address these limitations and show how recent results for directly approximating the eigenfunctions of the Laplacian can be leveraged to truly scale up options-based exploration. To do so, we introduce a fully online deep RL algorithm for discovering Laplacian-based options and evaluate our approach on a variety of pixel-based tasks. We compare to several state-of-the-art exploration methods and show that our approach is effective, general, and especially promising in non-stationary settings.
[2] Loss of Plasticity in Continual Deep Reinforcement Learning

Z. Abbas, R. Zhao, J. Modayil, A. White, and M. C. Machado

In Conference on Lifelong Learning Agents (CoLLAs), 2023

@inproceedings{abbas2023loss, title = {Loss of Plasticity in Continual Deep Reinforcement Learning}, author = {Zaheer Abbas and Rosie Zhao and Joseph Modayil and Adam White and Marlos C. Machado}, booktitle = {Conference on Lifelong Learning Agents (CoLLAs)}, year = {2023} }

The ability to learn continually is essential in a complex and changing world. In this paper, we characterize the behavior of canonical value-based deep reinforcement learning (RL) approaches under varying degrees of non-stationarity. In particular, we demonstrate that deep RL agents lose their ability to learn good policies when they cycle through a sequence of Atari 2600 games. This phenomenon is alluded to in prior work under various guises -- e.g., loss of plasticity, implicit under-parameterization, primacy bias, and capacity loss. We investigate this phenomenon closely at scale and analyze how the weights, gradients, and activations change over time in several experiments with varying dimensions (e.g., similarity between games, number of games, number of frames per game), with some experiments spanning 50 days and 2 billion environment interactions. Our analysis shows that the activation footprint of the network becomes sparser, contributing to the diminishing gradients. We investigate a remarkably simple mitigation strategy -- Concatenated ReLUs (CReLUs) activation function -- and demonstrate its effectiveness in facilitating continual learning in a changing environment.
[3] GVFs in the Real World: Making Predictions Online for Water Treatment

M. K. Janjua, H. Shah, M. White, E. Miahi, M. C. Machado, and A. White

Machine Learning, 2023

@article{janjua2023gvfs, title = {GVFs in the Real World: Making Predictions Online for Water Treatment}, author = {Muhammad Kamran Janjua and Haseeb Shah and Martha White and Erfan Miahi and Marlos C. Machado and Adam White}, journal = {Machine Learning}, year = {2023} }

In this paper we investigate the use of reinforcement-learning based prediction approaches for a real drinking-water treatment plant. Developing such a prediction system is a critical step on the path to optimizing and automating water treatment. Before that, there are many questions to answer about predictability of the data, suitable neural network architectures, how to overcome partially observability, and more. We first describe this dataset, and highlight challenges with seasonality, nonstationarity, partial observability and heterogeneity across sensors and operation modes of the plant. We then describe General Value Function (GVF) predictions—discounted cumulative sums of observations–and highlight why they might be preferable to classical n-step predictions common in time series prediction. We discuss how to use offline data to appropriately pre-train our temporal difference learning (TD) agents that learn these GVF predictions, including how to select hyperparameters for online fine-tuning in deployment. We find that the TD prediction agent obtains an overall lower normalized mean-squared error than the n-step prediction agent. Finally, we show the importance of learning in deployment, by contrasting to a TD agent trained purely offline with no online updating. This final result is one of the first to motivate the importance of adapting predictions in realtime, for non-stationary high-volume systems in the real-world.
[4] Harnessing Discrete Representations For Continual Reinforcement Learning

Edan Meyer, Adam White, and M. C. Machado

CoRR abs/2312.01203, 2023

@article{meyer2023harnessing, title = {Harnessing Discrete Representations For Continual Reinforcement Learning}, author = {Edan Meyer and Adam White and Marlos C. Machado}, journal = {CoRR}, volume = {abs/2312.01203}, year = {2023} }

Reinforcement learning (RL) agents make decisions using nothing but observations from the environment, and consequently, heavily rely on the representations of those observations. Though some recent breakthroughs have used vector-based categorical representations of observations, often referred to as discrete representations, there is little work explicitly assessing the significance of such a choice. In this work, we provide a thorough empirical investigation of the advantages of representing observations as vectors of categorical values within the context of reinforcement learning. We perform evaluations on world-model learning, model-free RL, and ultimately continual RL problems, where the benefits best align with the needs of the problem setting. We find that, when compared to traditional continuous representations, world models learned over discrete representations accurately model more of the world with less capacity, and that agents trained with discrete representations learn better policies with less data. In the context of continual RL, these benefits translate into faster adapting agents. Additionally, our analysis suggests that the observed performance improvements can be attributed to the information contained within the latent vectors and potentially the encoding of the discrete representation itself.
[5] Curvature Explains Loss of Plasticity

Alex Lewandowski, Haruto Tanaka, Dale Schuurmans, and M. C. Machado

CoRR abs/2312.00246, 2023

@article{lewandowski2023curvature, title = {Curvature Explains Loss of Plasticity}, author = {Alex Lewandowski and Haruto Tanaka and Dale Schuurmans and Marlos C. Machado}, journal = {CoRR}, volume = {abs/2312.00246}, year = {2023} }

Loss of plasticity is a phenomenon in which neural networks lose their ability to learn from new experience. Despite being empirically observed in several problem settings, little is understood about the mechanisms that lead to loss of plasticity. In this paper, we offer a consistent explanation for plasticity loss, based on an assertion that neural networks lose directions of curvature during training and that plasticity loss can be attributed to this reduction in curvature. To support such a claim, we provide a systematic empirical investigation of plasticity loss across several continual supervised learning problems. Our findings illustrate that curvature loss coincides with and sometimes precedes plasticity loss, while also showing that previous explanations are insufficient to explain loss of plasticity in all settings. Lastly, we show that regularizers which mitigate loss of plasticity also preserve curvature, motivating a simple distributional regularizer that proves to be effective across the problem settings considered.

Temporal abstractions

Description

In the reinforcement learning problem, an agent interacts with its environment such that the agent receives an observation from the environment and takes an action based on the received observations. This interaction takes place at every time step, which is often the fundamental unit of time in this problem formulation. Nevertheless, several decision making problems involve operating over different time scales. The options framework is maybe the most common formalism that allows us to do so, giving agents the ability to reason in terms of actions extended in time. This framework models courses of actions as options, which have the ability to accelerate learning in different ways, allowing, for example, faster credit assignment [9], planning [3], transfer learning [1,9], and better exploration [1,2,4-8,10].

Nevertheless, approaches based on the options framework often start with the assumption that a reasonable set of options is known beforehand. When this is not the case, there are no definitive answers for which options one should consider. In our long-standing research [1,2,5-10], we have developed a formulation based on the successor representation, advocating that it serves as a natural foundation for discovering and utilizing temporal abstractions. The core idea underlying our proposed techniques is captured in a comprehensive framework for option discovery [2,6]. Here, the agent's representation is used to identify useful options, which are then used to further improve its representation. This results in a virtuous, never-ending, cycle in which both the representation and the options are constantly refined based on each other. We extensively discuss this framework in one of our papers published in the Journal of Machine Learning Research (JMLR) [2].

More specifically, the research we have published on temporal abstraction, among other things, introduced the idea of eigenoptions[1,2,6,7,8], popularizing the idea of temporally-extended exploration through options. In fact, we have even used temporally-extended exploration with options when learning to navigate balloons in the stratosphere. We have also unified multiple option discovery methods and characterized the types of options that are most effective for planning [3]. Additionally, we have investigated strategies for integrating different techniques into a cohesive option discovery method [1,4,5,9]. For a comprehensive understanding of my approach to and utilization of temporal abstractions in reinforcement learning, I recommend starting with my recent papers [1-3].

References

[1] Deep Laplacian-based Options for Temporally-Extended Exploration

M. Klissarov, and M. C. Machado

In International Conference on Machine Learning (ICML), 2023

@inproceedings{klissarov2023deep, title = {Deep Laplacian-based Options for Temporally-Extended Exploration}, author = {Martin Klissarov and Marlos C. Machado}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2023} }

Selecting exploratory actions that generate a rich stream of experience for better learning is a fundamental challenge in reinforcement learning (RL). An approach to tackle this problem consists in selecting actions according to specific policies for an extended period of time, also known as options. A recent line of work to derive such exploratory options builds upon the eigenfunctions of the graph Laplacian. Importantly, until now these methods have been mostly limited to tabular domains where (1) the graph Laplacian matrix was either given or could be fully estimated, (2) performing eigendecomposition on this matrix was computationally tractable, and (3) value functions could be learned exactly. Additionally, these methods required a separate option discovery phase. These assumptions are fundamentally not scalable. In this paper we address these limitations and show how recent results for directly approximating the eigenfunctions of the Laplacian can be leveraged to truly scale up options-based exploration. To do so, we introduce a fully online deep RL algorithm for discovering Laplacian-based options and evaluate our approach on a variety of pixel-based tasks. We compare to several state-of-the-art exploration methods and show that our approach is effective, general, and especially promising in non-stationary settings.
[2] Temporal Abstraction in Reinforcement Learning with the Successor Representation

M. C. Machado, A. Barreto, D. Precup, and M. Bowling

Journal of Machine Learning Research (JMLR), 24, pp. 1-69, 2023

@article{machado2023temporal, title = {Temporal Abstraction in Reinforcement Learning with the Successor Representation}, author = {Marlos C. Machado and Andre Barreto and Doina Precup and Michael Bowling}, journal = {Journal of Machine Learning Research (JMLR)}, volume = {24}, number = {80}, pages = {1--69}, year = {2023} }

Reasoning at multiple levels of temporal abstraction is one of the key attributes of intelligence. In reinforcement learning, this is often modeled through temporally extended courses of actions called \emph{options}. Options allow agents to make predictions and to operate at different levels of abstraction within an environment. Nevertheless, approaches based on the options framework often start with the assumption that a reasonable set of options is known beforehand. When this is not the case, there are no definitive answers for which options one should consider. In this paper, we argue that the successor representation, which encodes states based on the pattern of state visitation that follows them, can be seen as a natural substrate for the discovery and use of temporal abstractions. To support our claim, we take a big picture view of recent results, showing how the successor representation can be used to discover options that facilitate either temporally-extended exploration or planning. We cast these results as instantiations of a general framework for option discovery in which the agent’s representation is used to identify useful options, which are then used to further improve its representation. This results in a virtuous, never-ending, cycle in which both the representation and the options are constantly refined based on each other. Beyond option discovery itself, we also discuss how the successor representation allows us to augment a set of options into a combinatorially large counterpart without additional learning. This is achieved through the combination of previously learned options. Our empirical evaluation focuses on options discovered for temporally-extended exploration and on the use of the successor representation to combine them. Our results shed light on important design decisions involved in the definition of options and demonstrate the synergy of different methods based on the successor representation, such as eigenoptions and the option keyboard.
[3] Reward-Respecting Subtasks for Model-Based Reinforcement Learning

R. Sutton, M. C. Machado, G. Z. Holland, D. Szepesvari, F. Timbers, B. Tanner, and A. White

Artificial Intelligence (AIJ), 2023

@article{sutton2023reward-respecting, title = {Reward-Respecting Subtasks for Model-Based Reinforcement Learning}, author = {Richard S. Sutton and Marlos C. Machado and G. Zacharias Holland and David Szepesvari and Finbarr Timbers and Brian Tanner and Adam White}, journal = {Artificial Intelligence}, year = {2023} }

To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress in state abstraction, but, although the theory of time abstraction has been extensively developed based on the options framework, in practice options have rarely been used in planning. One reason for this is that the space of possible options is immense and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks such as reaching a bottleneck state, or maximizing a sensory signal other than the reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. The subtasks proposed in most previous work ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option stops. We show that options and option models obtained from such reward-respecting subtasks are much more likely to be useful in planning and can be learned online and off-policy using existing learning algorithms. Reward respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how the algorithms for learning values, policies, options, and models can be unified using general value functions.
[4] Temporal Abstractions-Augmented Temporally Contrastive Learning: An Alternative to the Laplacian in RL

A. Erraqabi, M. C. Machado, M. Zhao, S. Sukhbaatar, A. Lazaric, L. Denoyer, Y. Bengio

In Conference on Uncertainty in Artificial Intelligence (UAI), 2022

@inproceedings{erraqabi2022temporal, author = {Akram Erraqabi and Marlos C. Machado and Mingde Zhao and Sainbayar Sukhbaatar and Alessandro Lazaric and Ludovic Denoyer and Yoshua Bengio}, title = {Temporal Abstractions-Augmented Temporally Contrastive Learning: An Alternative to the Laplacian in RL}, booktitle = {Conference on Uncertainty in Artificial Intelligence (UAI)}, year = {2022} }

In reinforcement learning, the graph Laplacian has proved to be a valuable tool in the task-agnostic setting, with applications ranging from skill discovery to reward shaping. Recently, learning the Laplacian representation has been framed as the optimization of a temporally-contrastive objective to overcome its computational limitations in large (or continuous) state spaces. However, this approach requires uniform access to all states in the state space, overlooking the exploration problem that emerges during the representation learning process. In this work, we propose an alternative method that is able to recover, in a non-uniform-prior setting, the expressiveness and the desired properties of the Laplacian representation. We do so by combining the representation learning with a skill-based covering policy, which provides a better training distribution to extend and refine the representation. We also show that a simple augmentation of the representation objective with the learned temporal abstractions improves dynamics-awareness and helps exploration. We find that our method succeeds as an alternative to the Laplacian in the non-uniform setting and scales to challenging continuous control environments. Finally, even if our method is not optimized for skill discovery, the learned skills can successfully solve difficult continuous navigation tasks with sparse rewards, where standard skill discovery approaches are no so effective.
[5] Exploration in Reinforcement Learning with Deep Covering Options

Y. Jinnai, J. W. Park, M. C. Machado, and G. Konidaris

In International Conference on Learning Representations (ICLR), 2020

@inproceedings{jinnai2020exploration, author = {Yuu Jinnai and Jee W. Park and Marlos C. Machado and George Konidaris}, title = {Exploration in Reinforcement Learning with Deep Covering Options}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2020} }

While many option discovery methods have been proposed to accelerate exploration in reinforcement learning, they are often heuristic. Recently, covering options was proposed to discover a set of options that provably reduce the upper bound of the environment's cover time, a measure of the difficulty of exploration. Covering options are computed using the eigenvectors of the graph Laplacian, but they are constrained to tabular tasks and are not applicable to tasks with large or continuous state-spaces. We introduce deep covering options, an online method that extends covering options to large state spaces, automatically discovering task-agnostic options that encourage exploration. We evaluate our method in several challenging sparse-reward domains and we show that our approach identifies less explored regions of the state-space and successfully generates options to visit these regions, substantially improving both the exploration and the total accumulated reward.
[6] Efficient Exploration in Reinforcement Learning through Time-Based Representations

M. C. Machado

Ph.D. Thesis, University of Alberta, 2019

@phdthesis{machado2019efficient, author = {Marlos C. Machado}, title = {Efficient Exploration in Reinforcement Learning through Time-Based Representations}, school = {University of Alberta}, year = {2019} }

In the reinforcement learning (RL) problem an agent must learn how to act optimally through trial-and-error interactions with a complex, unknown, stochastic environment. The actions taken by the agent influence not just the immediate reward it observes but also the future states and rewards it will observe, implicitly requiring the agent to deal with the trade-off between short-term and long-term consequences. In this context, the problem of exploration is the problem of selecting appropriate actions to explore the state space to gather information while taking this trade-off into consideration.In this dissertation I advocate that agents' exploration strategy can be guided by the process of representation learning. I support this claim by introducing different exploration approaches for RL algorithms that are applicable to complex environments with sparse rewards. They all use learned time-based representations, state representations that capture the temporal aspect of RL problems, implicitly encoding the temporal proximity of states. The two instantiations of time-based representations I use are proto-value functions (PVFs) and the successor representation (SR).The first approaches I introduce are based on the idea of option-based exploration. Option-based exploration hinges on the assumption that an agent that exhibits purposeful behavior is more likely to visit states that are far from its current state than an agent that randomly selects actions at every time step. I model this purposefulness through options, which, in reinforcement learning, represent temporally extended courses of actions over different time scales. I then introduce algorithms capable of discovering options autonomously through PVFs and the SR.I also introduce count-based exploration approaches, which are based on the idea of keeping state visitation counts to ensure all states (or abstractions of a state) are visited a proper number of times. I show that the norm of the SR, while it is being learned, incorporates state visitation counts and I use this result to introduce RL algorithms that achieve state-of-the-art results in large domains that require function approximation.I evaluate my algorithms in both tabular domains and Atari 2600 games. I use tabular domains such as the 4-room domain, RiverSwim, and SixArms in order to develop a better intuition about the proposed algorithms and to compare the proposed approaches to classic baselines in the field. I use Atari 2600 games to evaluate the scalability and generality of the proposed approaches since the state space of Atari 2600 games is too large, requiring function approximation. I discuss approaches based on linear and non-linear function approximation.
[7] Eigenoption Discovery through the Deep Successor Representation

M. C. Machado, C. Rosenbaum, X. Guo, M. Liu, G. Tesauro, and M. Campbell

In International Conference on Learning Representations (ICLR), 2018

@inproceedings{machado2018eigenoption, author = {Marlos C. Machado and Clemens Rosenbaum and Xiaoxiao Guo and Miao Liu and Gerald Tesauro and Murray Campbell}, title = {Eigenoption Discovery through the Deep Successor Representation}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2018} }

Options in reinforcement learning allow agents to hierarchically decompose a task into subtasks, having the potential to speed up learning and planning. However, autonomously learning effective sets of options is still a major challenge in the field. In this paper we focus on the recently introduced idea of using representation learning methods to guide the option discovery process. Specifically, we look at eigenoptions, options obtained from representations that encode diffusive information flow in the environment. We extend the existing algorithms for eigenoption discovery to settings with stochastic transitions and in which handcrafted features are not available. We propose an algorithm that discovers eigenoptions while learning non-linear state representations from raw pixels. It exploits recent successes in the deep reinforcement learning literature and the equivalence between proto-value functions and the successor representation. We use traditional tabular domains to provide intuition about our approach and Atari 2600 games to demonstrate its potential.
[8] A Laplacian Framework for Option Discovery in Reinforcement Learning

M. C. Machado, M. G. Bellemare, M. Bowling

In International Conference on Machine Learning (ICML), 2017

@inproceedings{machado2017laplacian, author = {Marlos C. Machado and Marc G. Bellemare and Michael Bowling}, title = {A Laplacian Framework for Option Discovery in Reinforcement Learning}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2017} }

Representation learning and option discovery are two of the biggest challenges in reinforcement learning (RL). Proto-value functions (PVFs) are a well-known approach for representation learning in MDPs. In this paper we address the option discovery problem by showing how PVFs implicitly define options. We do it by introducing eigenpurposes, intrinsic reward functions derived from the learned representations. The options discovered from eigenpurposes traverse the principal directions of the state space. They are useful for multiple tasks because they are discovered without taking the environment’s rewards into consideration. Moreover, different options act at different time scales, making them helpful for exploration. We demonstrate features of eigenpurposes in traditional tabular domains as well as in Atari 2600 games.
[9] The Eigenoption-Critic Framework

M. Liu, M. C. Machado, G. Tesauro, and M. Campbell

In NeurIPS Hierarchical RL Workshop, 2017

@inproceedings{liu2017eigenoption-critic, title = {The Eigenoption-Critic Framework}, author = {Miao Liu and Marlos C. Machado and Gerald Tesauro and Murray Campbell}, booktitle = {NeurIPS Hierarchical RL Workshop}, year = {2017} }

Eigenoptions (EOs) have been recently introduced as a promising idea for generating a diverse set of options through the graph Laplacian, having been shown to allow efficient exploration. Despite its initial promising results, a couple of issues in current algorithms limit its application, namely: (1) EO methods require two separate steps (eigenoption discovery and reward maximization) to learn a control policy, which can incur a significant amount of storage and computation; (2) EOs are only defined for problems with discrete state-spaces and; (3) it is not easy to take the environment's reward function into consideration when discovering EOs. To addresses these issues, we introduce an algorithm termed eigenoption-critic (EOC) based on the Option-critic (OC) framework [Bacon17], a general hierarchical reinforcement learning (RL) algorithm that allows learning the intra-option policies simultaneously with the policy over options. We also propose a generalization of EOC to problems with continuous state-spaces through the Nyström approximation. EOC can also be seen as extending OC to nonstationary settings, where the discovered options are not tailored for a single task.
[10] Learning Purposeful Behaviour in the Absence of Rewards

M. C. Machado, and M. Bowling

In ICML Workshop on Abstraction in Reinforcement Learning, 2016

@inproceedings{machado2016learning, title = {Learning Purposeful Behaviour in the Absence of Rewards}, author = {Marlos C. Machado and Michael Bowling}, booktitle = {ICML Workshop on Abstraction in Reinforcement Learning}, year = {2016} }

Artificial intelligence is commonly defined as the ability to achieve goals in the world. In the reinforcement learning framework, goals are encoded as reward functions that guide agent behaviour, and the sum of observed rewards provide a notion of progress. However, some domains have no such reward signal, or have a reward signal so sparse as to appear absent. Without reward feedback, agent behaviour is typically random, often dithering aimlessly and lacking intentionality. In this paper we present an algorithm capable of learning purposeful behaviour in the absence of rewards. The algorithm proceeds by constructing temporally extended actions (options), through the identification of purposes that are "just out of reach" of the agent's current behaviour. These purposes establish intrinsic goals for the agent to learn, ultimately resulting in a suite of behaviours that encourage the agent to visit different parts of the state space. Moreover, the approach is particularly suited for settings where rewards are very sparse, and such behaviours can help in the exploration of the environment until reward is observed.

Exploration

Description

Selecting exploratory actions that generate a rich stream of experience for better learning is a fundamental challenge in reinforcement learning. This challenge, known as the exploration problem, is about minimizing the number of samples (i.e., interactions) an agent requires to effectively learn and perform well in an initially unknown environment. It is this aspect that sets reinforcement learning apart from many other problem formulations in machine learning.

Our research has focused extensively on this topic, resulting in several published papers. A recurring theme in our work revolves around promoting exploration by explicitly considering various courses of actions within the environment [1-3,6,8-11]. This line of research has contributed to the popularization of option-based exploration. Notably, we introduced the concept of eigenoptions, which are options derived from the eigenvectors of the graph Laplacian or the successor representation [10]. While many existing exploration techniques in reinforcement learning employ (pseudo-)counts to generate exploration bonuses, these approaches often assume the availability of problem-specific density models when combined with function approximation. In contrast, we have developed a general technique that avoids this requirement by utilizing the norm of the successor representation to estimate counts [5]. Additionally, we have conducted thorough performance analyses of count-based exploration methods [7]. Furthermore, we have proposed a simple method for achieving domain-independent optimistic initialization when using function approximation [12]. Lastly, we have presented several results highlighting the influence of baselines on the exploratory behavior of policy gradient methods [4].

References

[1] Deep Laplacian-based Options for Temporally-Extended Exploration

M. Klissarov, and M. C. Machado

In International Conference on Machine Learning (ICML), 2023

@inproceedings{klissarov2023deep, title = {Deep Laplacian-based Options for Temporally-Extended Exploration}, author = {Martin Klissarov and Marlos C. Machado}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2023} }

Selecting exploratory actions that generate a rich stream of experience for better learning is a fundamental challenge in reinforcement learning (RL). An approach to tackle this problem consists in selecting actions according to specific policies for an extended period of time, also known as options. A recent line of work to derive such exploratory options builds upon the eigenfunctions of the graph Laplacian. Importantly, until now these methods have been mostly limited to tabular domains where (1) the graph Laplacian matrix was either given or could be fully estimated, (2) performing eigendecomposition on this matrix was computationally tractable, and (3) value functions could be learned exactly. Additionally, these methods required a separate option discovery phase. These assumptions are fundamentally not scalable. In this paper we address these limitations and show how recent results for directly approximating the eigenfunctions of the Laplacian can be leveraged to truly scale up options-based exploration. To do so, we introduce a fully online deep RL algorithm for discovering Laplacian-based options and evaluate our approach on a variety of pixel-based tasks. We compare to several state-of-the-art exploration methods and show that our approach is effective, general, and especially promising in non-stationary settings.
[2] Temporal Abstraction in Reinforcement Learning with the Successor Representation

M. C. Machado, A. Barreto, D. Precup, and M. Bowling

Journal of Machine Learning Research (JMLR), 24, pp. 1-69, 2023

@article{machado2023temporal, title = {Temporal Abstraction in Reinforcement Learning with the Successor Representation}, author = {Marlos C. Machado and Andre Barreto and Doina Precup and Michael Bowling}, journal = {Journal of Machine Learning Research (JMLR)}, volume = {24}, number = {80}, pages = {1--69}, year = {2023} }

Reasoning at multiple levels of temporal abstraction is one of the key attributes of intelligence. In reinforcement learning, this is often modeled through temporally extended courses of actions called \emph{options}. Options allow agents to make predictions and to operate at different levels of abstraction within an environment. Nevertheless, approaches based on the options framework often start with the assumption that a reasonable set of options is known beforehand. When this is not the case, there are no definitive answers for which options one should consider. In this paper, we argue that the successor representation, which encodes states based on the pattern of state visitation that follows them, can be seen as a natural substrate for the discovery and use of temporal abstractions. To support our claim, we take a big picture view of recent results, showing how the successor representation can be used to discover options that facilitate either temporally-extended exploration or planning. We cast these results as instantiations of a general framework for option discovery in which the agent’s representation is used to identify useful options, which are then used to further improve its representation. This results in a virtuous, never-ending, cycle in which both the representation and the options are constantly refined based on each other. Beyond option discovery itself, we also discuss how the successor representation allows us to augment a set of options into a combinatorially large counterpart without additional learning. This is achieved through the combination of previously learned options. Our empirical evaluation focuses on options discovered for temporally-extended exploration and on the use of the successor representation to combine them. Our results shed light on important design decisions involved in the definition of options and demonstrate the synergy of different methods based on the successor representation, such as eigenoptions and the option keyboard.
[3] Temporal Abstractions-Augmented Temporally Contrastive Learning: An Alternative to the Laplacian in RL

A. Erraqabi, M. C. Machado, M. Zhao, S. Sukhbaatar, A. Lazaric, L. Denoyer, Y. Bengio

In Conference on Uncertainty in Artificial Intelligence (UAI), 2022

@inproceedings{erraqabi2022temporal, author = {Akram Erraqabi and Marlos C. Machado and Mingde Zhao and Sainbayar Sukhbaatar and Alessandro Lazaric and Ludovic Denoyer and Yoshua Bengio}, title = {Temporal Abstractions-Augmented Temporally Contrastive Learning: An Alternative to the Laplacian in RL}, booktitle = {Conference on Uncertainty in Artificial Intelligence (UAI)}, year = {2022} }

In reinforcement learning, the graph Laplacian has proved to be a valuable tool in the task-agnostic setting, with applications ranging from skill discovery to reward shaping. Recently, learning the Laplacian representation has been framed as the optimization of a temporally-contrastive objective to overcome its computational limitations in large (or continuous) state spaces. However, this approach requires uniform access to all states in the state space, overlooking the exploration problem that emerges during the representation learning process. In this work, we propose an alternative method that is able to recover, in a non-uniform-prior setting, the expressiveness and the desired properties of the Laplacian representation. We do so by combining the representation learning with a skill-based covering policy, which provides a better training distribution to extend and refine the representation. We also show that a simple augmentation of the representation objective with the learned temporal abstractions improves dynamics-awareness and helps exploration. We find that our method succeeds as an alternative to the Laplacian in the non-uniform setting and scales to challenging continuous control environments. Finally, even if our method is not optimized for skill discovery, the learned skills can successfully solve difficult continuous navigation tasks with sparse rewards, where standard skill discovery approaches are no so effective.
[4] Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization

W. Chung*, V. Thomas*, M. C. Machado, and N. Le Roux

In International Conference on Machine Learning (ICML), 2021

@inproceedings{chung2021beyond, author = {Wesley Chung and Valentin Thomas and Marlos C. Machado and Nicolas Le Roux}, title = {Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2021} }

Bandit and reinforcement learning (RL) problems can often be framed as optimization problems where the goal is to maximize average performance while having access only to stochastic estimates of the true gradient. Traditionally, stochastic optimization theory predicts that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates. In this paper we demonstrate that the standard view is too limited for bandit and RL problems. To allow our analysis to be interpreted in light of multi-step MDPs, we focus on techniques derived from stochastic optimization principles (e.g., natural policy gradient and EXP3) and we show that some standard assumptions from optimization theory are violated in these problems. We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics and that seemingly innocuous choices like the baseline can determine whether an algorithm converges. These theoretical findings match our empirical evaluation, which we extend to multi-state MDPs.
[5] Count-Based Exploration with the Successor Representation

M. C. Machado, M. G. Bellemare, and M. Bowling

In AAAI Conference on Artificial Intelligence (AAAI), 2020

@inproceedings{machado2020counts, author = {Marlos C. Machado and Marc G. Bellemare and Michael Bowling}, title = {Count-Based Exploration with the Successor Representation}, booktitle = {AAAI Conference on Artificial Intelligence (AAAI)}, year = {2020} }

In this paper we introduce a simple approach for exploration in reinforcement learning (RL) that allows us to develop theoretically justified algorithms in the tabular case but that is also extendable to settings where function approximation is required. Our approach is based on the successor representation (SR), which was originally introduced as a representation defining state generalization by the similarity of successor states. Here we show that the norm of the SR, while it is being learned, can be used as a reward bonus to incentivize exploration. In order to better understand this transient behavior of the norm of the SR we introduce the substochastic successor representation (SSR) and we show that it implicitly counts the number of times each state (or feature) has been observed. We use this result to introduce an algorithm that performs as well as some theoretically sample-efficient approaches. Finally, we extend these ideas to a deep RL algorithm and show that it achieves state-of-the-art performance in Atari 2600 games when in a low sample-complexity regime.
[6] Exploration in Reinforcement Learning with Deep Covering Options

Y. Jinnai, J. W. Park, M. C. Machado, and G. Konidaris

In International Conference on Learning Representations (ICLR), 2020

@inproceedings{jinnai2020exploration, author = {Yuu Jinnai and Jee W. Park and Marlos C. Machado and George Konidaris}, title = {Exploration in Reinforcement Learning with Deep Covering Options}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2020} }

While many option discovery methods have been proposed to accelerate exploration in reinforcement learning, they are often heuristic. Recently, covering options was proposed to discover a set of options that provably reduce the upper bound of the environment's cover time, a measure of the difficulty of exploration. Covering options are computed using the eigenvectors of the graph Laplacian, but they are constrained to tabular tasks and are not applicable to tasks with large or continuous state-spaces. We introduce deep covering options, an online method that extends covering options to large state spaces, automatically discovering task-agnostic options that encourage exploration. We evaluate our method in several challenging sparse-reward domains and we show that our approach identifies less explored regions of the state-space and successfully generates options to visit these regions, substantially improving both the exploration and the total accumulated reward.
[7] On Bonus Based Exploration Methods In The Arcade Learning Environment

A. A. Taiga, W. Fedus, M. C. Machado, A. Courville, and M. G. Bellemare

In International Conference on Learning Representations (ICLR), 2020

@inproceedings{taiga2020bonus, author = {Adrien Ali Taiga and William Fedus and Marlos C. Machado and Aaron Courville and Marc G. Bellemare}, title = {On Bonus Based Exploration Methods In The Arcade Learning Environment}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2020} }

Research on exploration in reinforcement learning, as applied to Atari 2600 game-playing, has emphasized tackling difficult exploration problems such as Montezuma's Revenge (Bellemare et al., 2016). Recently, bonus-based exploration methods, which explore by augmenting the environment reward, have reached above-human average performance on such domains. In this paper we reassess popular bonus-based exploration methods within a common evaluation framework. We combine Rainbow (Hessel et al., 2018) with different exploration bonuses and evaluate its performance on Montezuma's Revenge, Bellemare et al.'s set of hard of exploration games with sparse rewards, and the whole Atari 2600 suite. We find that while exploration bonuses lead to higher score on Montezuma's Revenge they do not provide meaningful gains over the simpler epsilon-greedy scheme. In fact, we find that methods that perform best on that game often underperform epsilon-greedy on easy exploration Atari 2600 games. We find that our conclusions remain valid even when hyperparameters are tuned for these easy-exploration games. Finally, we find that none of the methods surveyed benefit from additional training samples (1 billion frames, versus Rainbow's 200 million) on Bellemare et al.'s hard exploration games. Our results suggest that recent gains in Montezuma's Revenge may be better attributed to architecture change, rather than better exploration schemes; and that the real pace of progress in exploration research for Atari 2600 games may have been obfuscated by good results on a single domain.
[8] Efficient Exploration in Reinforcement Learning through Time-Based Representations

M. C. Machado

Ph.D. Thesis, University of Alberta, 2019

@phdthesis{machado2019efficient, author = {Marlos C. Machado}, title = {Efficient Exploration in Reinforcement Learning through Time-Based Representations}, school = {University of Alberta}, year = {2019} }

In the reinforcement learning (RL) problem an agent must learn how to act optimally through trial-and-error interactions with a complex, unknown, stochastic environment. The actions taken by the agent influence not just the immediate reward it observes but also the future states and rewards it will observe, implicitly requiring the agent to deal with the trade-off between short-term and long-term consequences. In this context, the problem of exploration is the problem of selecting appropriate actions to explore the state space to gather information while taking this trade-off into consideration.In this dissertation I advocate that agents' exploration strategy can be guided by the process of representation learning. I support this claim by introducing different exploration approaches for RL algorithms that are applicable to complex environments with sparse rewards. They all use learned time-based representations, state representations that capture the temporal aspect of RL problems, implicitly encoding the temporal proximity of states. The two instantiations of time-based representations I use are proto-value functions (PVFs) and the successor representation (SR).The first approaches I introduce are based on the idea of option-based exploration. Option-based exploration hinges on the assumption that an agent that exhibits purposeful behavior is more likely to visit states that are far from its current state than an agent that randomly selects actions at every time step. I model this purposefulness through options, which, in reinforcement learning, represent temporally extended courses of actions over different time scales. I then introduce algorithms capable of discovering options autonomously through PVFs and the SR.I also introduce count-based exploration approaches, which are based on the idea of keeping state visitation counts to ensure all states (or abstractions of a state) are visited a proper number of times. I show that the norm of the SR, while it is being learned, incorporates state visitation counts and I use this result to introduce RL algorithms that achieve state-of-the-art results in large domains that require function approximation.I evaluate my algorithms in both tabular domains and Atari 2600 games. I use tabular domains such as the 4-room domain, RiverSwim, and SixArms in order to develop a better intuition about the proposed algorithms and to compare the proposed approaches to classic baselines in the field. I use Atari 2600 games to evaluate the scalability and generality of the proposed approaches since the state space of Atari 2600 games is too large, requiring function approximation. I discuss approaches based on linear and non-linear function approximation.
[9] Eigenoption Discovery through the Deep Successor Representation

M. C. Machado, C. Rosenbaum, X. Guo, M. Liu, G. Tesauro, and M. Campbell

In International Conference on Learning Representations (ICLR), 2018

@inproceedings{machado2018eigenoption, author = {Marlos C. Machado and Clemens Rosenbaum and Xiaoxiao Guo and Miao Liu and Gerald Tesauro and Murray Campbell}, title = {Eigenoption Discovery through the Deep Successor Representation}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2018} }

Options in reinforcement learning allow agents to hierarchically decompose a task into subtasks, having the potential to speed up learning and planning. However, autonomously learning effective sets of options is still a major challenge in the field. In this paper we focus on the recently introduced idea of using representation learning methods to guide the option discovery process. Specifically, we look at eigenoptions, options obtained from representations that encode diffusive information flow in the environment. We extend the existing algorithms for eigenoption discovery to settings with stochastic transitions and in which handcrafted features are not available. We propose an algorithm that discovers eigenoptions while learning non-linear state representations from raw pixels. It exploits recent successes in the deep reinforcement learning literature and the equivalence between proto-value functions and the successor representation. We use traditional tabular domains to provide intuition about our approach and Atari 2600 games to demonstrate its potential.
[10] A Laplacian Framework for Option Discovery in Reinforcement Learning

M. C. Machado, M. G. Bellemare, M. Bowling

In International Conference on Machine Learning (ICML), 2017

@inproceedings{machado2017laplacian, author = {Marlos C. Machado and Marc G. Bellemare and Michael Bowling}, title = {A Laplacian Framework for Option Discovery in Reinforcement Learning}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2017} }

Representation learning and option discovery are two of the biggest challenges in reinforcement learning (RL). Proto-value functions (PVFs) are a well-known approach for representation learning in MDPs. In this paper we address the option discovery problem by showing how PVFs implicitly define options. We do it by introducing eigenpurposes, intrinsic reward functions derived from the learned representations. The options discovered from eigenpurposes traverse the principal directions of the state space. They are useful for multiple tasks because they are discovered without taking the environment’s rewards into consideration. Moreover, different options act at different time scales, making them helpful for exploration. We demonstrate features of eigenpurposes in traditional tabular domains as well as in Atari 2600 games.
[11] Learning Purposeful Behaviour in the Absence of Rewards

M. C. Machado, and M. Bowling

In ICML Workshop on Abstraction in Reinforcement Learning, 2016

@inproceedings{machado2016learning, title = {Learning Purposeful Behaviour in the Absence of Rewards}, author = {Marlos C. Machado and Michael Bowling}, booktitle = {ICML Workshop on Abstraction in Reinforcement Learning}, year = {2016} }

Artificial intelligence is commonly defined as the ability to achieve goals in the world. In the reinforcement learning framework, goals are encoded as reward functions that guide agent behaviour, and the sum of observed rewards provide a notion of progress. However, some domains have no such reward signal, or have a reward signal so sparse as to appear absent. Without reward feedback, agent behaviour is typically random, often dithering aimlessly and lacking intentionality. In this paper we present an algorithm capable of learning purposeful behaviour in the absence of rewards. The algorithm proceeds by constructing temporally extended actions (options), through the identification of purposes that are "just out of reach" of the agent's current behaviour. These purposes establish intrinsic goals for the agent to learn, ultimately resulting in a suite of behaviours that encourage the agent to visit different parts of the state space. Moreover, the approach is particularly suited for settings where rewards are very sparse, and such behaviours can help in the exploration of the environment until reward is observed.
[12] Domain-Independent Optimistic Initialization for Reinforcement Learning

M. C. Machado, S. Srinivasan, and M. Bowling

In AAAI Workshop on Learning for General Competency in Video Games, 2015

@inproceedings{machado2015domain-independent, title = {Domain-Independent Optimistic Initialization for Reinforcement Learning}, author = {Marlos C. Machado and Sriram Srinivasan and Michael Bowling}, booktitle = {AAAI Workshop on Learning for General Competency in Video Games}, year = {2015} }

In Reinforcement Learning (RL), it is common to use optimistic initialization of value functions to encourage exploration. However, such an approach generally depends on the domain, viz., the scale of the rewards must be known, and the feature representation must have a constant norm. We present a simple approach that performs optimistic initialization with less dependence on the domain.

Representation learning

Description

When discussing the problem of reinforcement learning, we often describe it as follows: at each time step the agent is in a specific state and it takes an action. In response, the environment transitions to another state according to a probability kernel, accompanied by a bounded reward signal. The agent's objective is to learn a policy that maximizes the expected cumulative sum of discounted rewards. While this description is accurate, it becomes too simplistic when dealing with real-world problems where enumerating every state in the environment is infeasible, and the agent rarely revisits the exact same state. Additionally, the agent often times does not have access to the true underlying state of the world, but to a high-dimensional observation of it, which means that often we also have to deal with partial observability. These challenges require our algorithms to learn a representation that captures the relevant information one can generalize over, identifying similarities between observations. Obviously, there are many different ways of generalizing over experience, but some ways yield better learning outcomes. Learning appropriate representations is thus one of the fundamental challenges in reinforcement learning, and machine learning more broadly.

Generalization in reinforcement learning occupies a central position in our research and shapes much of our work. In our initial paper on the topic, we characterized the inductive biases of convolutional networks and quantified their impact on the performance of deep reinforcement learning algorithms like DQN [6]. Subsequently, we published one of the first papers in the field highlighting how policies learned by deep reinforcement learning algorithms can struggle to generalize, even when evaluated in remarkably similar environments [5]. To address these issues, we proposed more effective approaches to representation learning, specifically tailored for reinforcement learning. We incorporated the inherent sequential structure of reinforcement learning into the representation learning process [4]. Really understanding things has always been a driving force behind our work. In the context of deep reinforcement learning, we extensively investigated the properties of learned representations, quantifying and correlating these properties with actual generalization performance [3]. We also looked at representation learning in the context of continual learning, showing how discrete representations (as in multiple one-hot vectors) seem to be much more effective than the traditional continuous representations [7]. The same seems to be the case for model-learning [7].

As previously mentioned, representation learning remains at the forefront of our research and has influenced our approach to other problems. Notably, a significant portion of our work on temporal abstractions is based on using the representation learning process to guide option discovery [2]. A particularly important representation in this line of work is the Laplacian representation , which was originally introduced in the tabular case. Until recently, the methods that approximated the eigenfunctions of the graph Laplacian with stochastic gradient descent were not guaranteed to find the true solution, they were untunable, and were not able to estimate the eigenvalues of the graph Laplacian; we addressed all those issues with ALLO [9].

Lastly, generalization is just one aspect of the story; dealing with partial observabilty is also crucial. In such scenarios, the agent must rely on more than just the current sensory inputs; it must construct an agent state that summarizes previous interactions with the world. Many impressive reinforcement learning applications tackle this challenge by using environment-specific functions to aid in summarizing the agent's history. We identified, characterized, and justified this practice, coining the term auxiliary inputs to describe these additional information sources [1]. We also introduced the recurrent linear transformer [8] to be able to learn to summarize history by combining the best of both transformers and recurrent neural networks.

References

[1] Agent-State Construction with Auxiliary Inputs

R. Y. Tao, A. White, and M. C. Machado

Transactions on Machine Learning Research (TMLR), 2023

@article{tao2023agent-state, title = {Agent-State Construction with Auxiliary Inputs}, author = {Ruo Yu Tao and Adam White and Marlos C. Machado}, journal = {Transactions on Machine Learning Research (TMLR)}, year = {2023} }

In many, if not every realistic sequential decision-making task, the decision-making agent is not able to model the full complexity of the world. The environment is often much larger and more complex than the agent, a setting also known as partial observability. In such settings, the agent must leverage more than just the current sensory inputs; it must construct an agent state that summarizes previous interactions with the world. Currently, a popular approach for tackling this problem is to learn the agent-state function via a recurrent network from the agent's sensory stream as input. Many impressive reinforcement learning applications have instead relied on environment-specific functions to aid the agent's inputs for history summarization. These augmentations are done in multiple ways, from simple approaches like concatenating observations to more complex ones such as uncertainty estimates. Although ubiquitous in the field, these additional inputs, which we term auxiliary inputs, are rarely emphasized, and it is not clear what their role or impact is. In this work we explore this idea further, and relate these auxiliary inputs to prior classic approaches to state construction. We present a series of examples illustrating the different ways of using auxiliary inputs for reinforcement learning. We show that these auxiliary inputs can be used to discriminate between observations that would otherwise be aliased, leading to more expressive features that smoothly interpolate between different states. Finally, we show that this approach is complementary to state-of-the-art methods such as recurrent neural networks and truncated back-propagation through time, and acts as a heuristic that facilitates longer temporal credit assignment, leading to better performance.
[2] Temporal Abstraction in Reinforcement Learning with the Successor Representation

M. C. Machado, A. Barreto, D. Precup, and M. Bowling

Journal of Machine Learning Research (JMLR), 24, pp. 1-69, 2023

@article{machado2023temporal, title = {Temporal Abstraction in Reinforcement Learning with the Successor Representation}, author = {Marlos C. Machado and Andre Barreto and Doina Precup and Michael Bowling}, journal = {Journal of Machine Learning Research (JMLR)}, volume = {24}, number = {80}, pages = {1--69}, year = {2023} }

Reasoning at multiple levels of temporal abstraction is one of the key attributes of intelligence. In reinforcement learning, this is often modeled through temporally extended courses of actions called \emph{options}. Options allow agents to make predictions and to operate at different levels of abstraction within an environment. Nevertheless, approaches based on the options framework often start with the assumption that a reasonable set of options is known beforehand. When this is not the case, there are no definitive answers for which options one should consider. In this paper, we argue that the successor representation, which encodes states based on the pattern of state visitation that follows them, can be seen as a natural substrate for the discovery and use of temporal abstractions. To support our claim, we take a big picture view of recent results, showing how the successor representation can be used to discover options that facilitate either temporally-extended exploration or planning. We cast these results as instantiations of a general framework for option discovery in which the agent’s representation is used to identify useful options, which are then used to further improve its representation. This results in a virtuous, never-ending, cycle in which both the representation and the options are constantly refined based on each other. Beyond option discovery itself, we also discuss how the successor representation allows us to augment a set of options into a combinatorially large counterpart without additional learning. This is achieved through the combination of previously learned options. Our empirical evaluation focuses on options discovered for temporally-extended exploration and on the use of the successor representation to combine them. Our results shed light on important design decisions involved in the definition of options and demonstrate the synergy of different methods based on the successor representation, such as eigenoptions and the option keyboard.
[3] Investigating the Properties of Neural Network Representations in Reinforcement Learning

H. Wang, E. Miahi, M. White, M. C. Machado, Z. Abbas, R. Kumaraswamy, V. Liu, and A. White

Artificial Intelligence, 2024

@article{wang2024investigating, title = {Investigating the Properties of Neural Network Representations in Reinforcement Learning}, author = {Han Wang and Erfan Miahi and Martha White and Marlos C. Machado and Zaheer Abbas and Raksha Kumaraswamy and Vincent Liu and Adam White}, journal = {Artificial Intelligence}, volume = {}, year = {2024} }

In this paper we investigate the properties of representations learned by deep reinforcement learning systems. Much of the earlier work in representation learning for reinforcement learning focused on designing fixed-basis architectures to achieve properties thought to be desirable, such as orthogonality and sparsity. In contrast, the idea behind deep reinforcement learning methods is that the agent designer should not encode representational properties, but rather that the data stream should determine the properties of the representation -- good representations emerge under appropriate training schemes. In this paper we bring these two perspectives together, empirically investigating the properties of representations that support transfer in reinforcement learning. This analysis allows us to provide novel hypotheses regarding impact of auxiliary tasks in end-to-end training of non-linear reinforcement learning methods. We introduce and measure six representational properties over more than 25 thousand agent-task settings. We consider DQN agents with convolutional networks in a pixel-based navigation environment. We develop a method to better understand \emph{why} some representations work better for transfer, through a systematic approach varying task similarity and measuring and correlating representation properties with transfer performance.
[4] Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning

R. Agarwal, M. C. Machado, P. S. Castro, and M. G. Bellemare

In International Conference on Learning Representations (ICLR) [Spotlight], 2021

@inproceedings{agarwal2021contrastive, author = {R. Agarwal and Marlos C. Machado and Pablo S. Castro and Marc G. Bellemare}, title = {Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2021} }

Reinforcement learning methods trained on few environments rarely learn policies that generalize to unseen environments. To improve generalization, we incorporate the inherent sequential structure in reinforcement learning into the representation learning process. This approach is orthogonal to recent approaches, which rarely exploit this structure explicitly. Specifically, we introduce a theoretically motivated policy similarity metric (PSM) for measuring behavioral similarity between states. PSM assigns high similarity to states for which the optimal policies in those states as well as in future states are similar. We also present a contrastive representation learning procedure to embed any state similarity metric, which we instantiate with PSM to obtain policy similarity embeddings (PSEs). We demonstrate that PSEs improve generalization on diverse benchmarks, including LQR with spurious correlations, a jumping task from pixels, and Distracting DM Control Suite.
[5] Generalization and Regularization in DQN

J. Farebrother, M. C. Machado, and M. Bowling

In NeurIPS Deep Reinforcement Learning Workshop, 2018

@article{farebrother2018generalization, title = {Generalization and Regularization in DQN}, author = {Jesse Farebrother and Marlos C. Machado and Michael Bowling}, booktitle = {NeurIPS Deep Reinforcement Learning Workshop}, year = {2018} }

Deep reinforcement learning algorithms have shown an impressive ability to learn complex control policies in high-dimensional tasks. However, despite the ever-increasing performance on popular benchmarks, policies learned by deep reinforcement learning algorithms can struggle to generalize when evaluated in remarkably similar environments. In this paper we propose a protocol to evaluate generalization in reinforcement learning through different modes of Atari 2600 games. With that protocol we assess the generalization capabilities of DQN, one of the most traditional deep reinforcement learning algorithms, and we provide evidence suggesting that DQN overspecializes to the training environment. We then comprehensively evaluate the impact of dropout and ℓ2 regularization, as well as the impact of reusing learned representations to improve the generalization capabilities of DQN. Despite regularization being largely underutilized in deep reinforcement learning, we show that it can, in fact, help DQN learn more general features. These features can be reused and fine-tuned on similar tasks, considerably improving DQN's sample efficiency.
[6] State of the Art Control of Atari Games Using Shallow Reinforcement Learning [Best paper nominee]

Y. Liang, M. C. Machado, E. Talvitie, and M. Bowling

In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2016

@inproceedings{liang16state, author = {Yitao Liang and Marlos C. Machado and Erik Talvitie and Michael H. Bowling}, title = {State of the Art Control of Atari Games Using Shallow Reinforcement Learning}, booktitle = {International Conference on Autonomous Agents and Multiagent Systems (AAMAS)}, year = {2016} }

The recently introduced Deep Q-Networks (DQN) algorithm has gained attention as one of the first successful combinations of deep neural networks and reinforcement learning. Its promise was demonstrated in the Arcade Learning Environment (ALE), a challenging framework composed of dozens of Atari 2600 games used to evaluate general competency in AI. It achieved dramatically better results than earlier approaches, showing that its ability to learn good representations is quite robust and general. This paper attempts to understand the principles that underlie DQN's impressive performance and to better contextualize its success. We systematically evaluate the importance of key representational biases encoded by DQN's network by proposing simple linear representations that make use of these concepts. Incorporating these characteristics, we obtain a computationally practical feature set that achieves competitive performance to DQN in the ALE. Besides offering insight into the strengths and weaknesses of DQN, we provide a generic representation for the ALE, significantly reducing the burden of learning a representation for each game. Moreover, we also provide a simple, reproducible benchmark for the sake of comparison to future work in the ALE.
[7] Harnessing Discrete Representations For Continual Reinforcement Learning

Edan Meyer, Adam White, and M. C. Machado

CoRR abs/2312.01203, 2023

@article{meyer2023harnessing, title = {Harnessing Discrete Representations For Continual Reinforcement Learning}, author = {Edan Meyer and Adam White and Marlos C. Machado}, journal = {CoRR}, volume = {abs/2312.01203}, year = {2023} }

Reinforcement learning (RL) agents make decisions using nothing but observations from the environment, and consequently, heavily rely on the representations of those observations. Though some recent breakthroughs have used vector-based categorical representations of observations, often referred to as discrete representations, there is little work explicitly assessing the significance of such a choice. In this work, we provide a thorough empirical investigation of the advantages of representing observations as vectors of categorical values within the context of reinforcement learning. We perform evaluations on world-model learning, model-free RL, and ultimately continual RL problems, where the benefits best align with the needs of the problem setting. We find that, when compared to traditional continuous representations, world models learned over discrete representations accurately model more of the world with less capacity, and that agents trained with discrete representations learn better policies with less data. In the context of continual RL, these benefits translate into faster adapting agents. Additionally, our analysis suggests that the observed performance improvements can be attributed to the information contained within the latent vectors and potentially the encoding of the discrete representation itself.
[8] AGaLiTe: Approximate Gated Linear Transformers for Online Reinforcement Learning

S. Pramanik, E. Elelimy,M. C. Machado, and A. White

Transactions on Machine Learning Research (TMLR), 2024

@article{pramanik2023recurrent, title = {AGaLiTe: Approximate Gated Linear Transformers for Online Reinforcement Learning}, author = {Subhojeet Pramanik and Esraa Elelimy and Marlos C. Machado and Adam White}, journal = {Transactions on Machine Learning Research (TMLR)}, year = {2024} }

In this paper we investigate transformer architectures designed for partially observable online reinforcement learning. The self-attention mechanism in the transformer architecture is capable of capturing long-range dependencies and it is the main reason behind its effectiveness in processing sequential data. Nevertheless, despite their success, transformers have two significant drawbacks that still limit their applicability in online reinforcement learning: (1) in order to remember all past information, the self-attention mechanism requires access to the whole history to be provided as context. (2) The inference cost in transformers is expensive. In this paper, we introduce recurrent alternatives to the transformer self-attention mechanism that offer context-independent inference cost, leverage long-range dependencies effectively, and performs well in online reinforcement learning task. We quantify the impact of the different components of our architecture in a diagnostic environment and assess performance gains in 2D and 3D pixel-based partially-observable environments (e.g. T-Maze, Mystery Path, Craftax, and Memory Maze). Compared with a state-of-the-art architecture, GTrXL, inference in our approach is at least 40% cheaper while reducing memory use more than 50%. Our approach either performs similarly or better than GTrXL, improving more than 37% upon GTrXL performance in harder tasks.
[9] Proper Laplacian Representation Learning

Diego Gomez, Michael Bowling, and M. C. Machado

International Conference on Learning Representations, 2024

@inproceedings{gomez2024proper, title = {Proper Laplacian Representation Learning}, author = {Diego Gomez and Michael Bowling and Marlos C. Machado}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2024} }

The ability to learn good representations of states is essential for solving large reinforcement learning problems, where exploration, generalization, and transfer are particularly challenging. The Laplacian representation is a promising approach to address these problems by inducing intrinsic rewards for temporally-extended action discovery and reward shaping, and informative state encoding. To obtain the Laplacian representation one needs to compute the eigensystem of the graph Laplacian, which is often approximated through optimization objectives compatible with deep learning approaches. These approximations, however, depend on hyperparameters that are impossible to tune efficiently, converge to arbitrary rotations of the desired eigenvectors, and are unable to accurately recover the corresponding eigenvalues. In this paper we introduce a theoretically sound objective and corresponding optimization algorithm for approximating the Laplacian representation. Our approach naturally recovers both the true eigenvectors and eigenvalues while eliminating the hyperparameter dependence of previous approximations. We provide theoretical guarantees for our method and we show that those results translate empirically into robust learning across multiple environments.

Credit assignment

Description

In the context of reinforcement learning, actions often have delayed consequences, meaning their effects are not immediately observable. A classic example is chess, where the outcome of a game can be attributed to several moves made in the past, including the crucial "winning" or "losing" move. In reinforcement learning, this phenomenon is referred to as the problem of (long-term) credit assignment. Basic temporal-difference (TD) methods assign credit to the immediately taken action, bootstrapping from previous experience to learn long-term dependencies.

However, this process typically requires a large number of repetitions to generate effective behaviors based on rewards. Therefore, the challenge arises: how can we design algorithms that learn faster? Note this is a challenge even if we have access to a good representation and to the "right" data to learn from.

Naturally, we have written papers on this fundamental question in the field of reinforcement learning. For instance, we have explored the estimation of multistep returns, where credit is distributed among multiple past actions according to some eligibility rule. This research involved designing, evaluating, and characterizing various types of eligibility traces, such as true online methods [6] and trajectory-aware methods [1]. Also, we have looked at the benefits of averaging different n-step returns (similar to what eligibility traces do), and we have shown how such an approach leads to lower variance during the learning process [7].

Other approaches we have looked at include how one can use options in planning to accelerate learning [2], and how the successor representation, which can be seen as a hybrid between model-free and model-based reinforcement learning, can also do the same [5]. Finally, we have designed policy gradient methods that are more effective (and more theoretically sound) in assigning credit faster to past observations [3,4].

References

[1] Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning

B. Daley, M. White, C. Amato, and M. C. Machado

In International Conference on Machine Learning (ICML), 2023

@inproceedings{daley2023trajectory-aware, title = {Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning}, author = {Brett Daley and Martha White and Christopher Amato and Marlos C. Machado}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2023} }

Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across λ-values in an off-policy control task.
[2] Reward-Respecting Subtasks for Model-Based Reinforcement Learning

R. Sutton, M. C. Machado, G. Z. Holland, D. Szepesvari, F. Timbers, B. Tanner, and A. White

Artificial Intelligence (AIJ), 2023

@article{sutton2023reward-respecting, title = {Reward-Respecting Subtasks for Model-Based Reinforcement Learning}, author = {Richard S. Sutton and Marlos C. Machado and G. Zacharias Holland and David Szepesvari and Finbarr Timbers and Brian Tanner and Adam White}, journal = {Artificial Intelligence}, year = {2023} }

To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress in state abstraction, but, although the theory of time abstraction has been extensively developed based on the options framework, in practice options have rarely been used in planning. One reason for this is that the space of possible options is immense and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks such as reaching a bottleneck state, or maximizing a sensory signal other than the reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. The subtasks proposed in most previous work ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option stops. We show that options and option models obtained from such reward-respecting subtasks are much more likely to be useful in planning and can be learned online and off-policy using existing learning algorithms. Reward respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how the algorithms for learning values, policies, options, and models can be unified using general value functions.
[3] A General Class of Surrogate Functions for Stable and Efficient Reinforcement Learning [Best paper nominee]

S. Vaswani, O. Bachem, S. Totaro, R. Mueller, S. Garg, M. Geist, M. C. Machado, P. S. Castro, and N. Le Roux

In International Conference on Artificial Intelligence and Statistics (AISTATS) [Oral], 2022

@inproceedings{vaswani2022general, author = {Sharan Vaswani and Olivier Bachem and Simone Totaro and Robert Mueller and Shivam Garg and Matthieu Geist and Marlos C. Machado and Pablo Samuel Castro and Nicolas Le Roux}, title = {A General Class of Surrogate Functions for Stable and Efficient Reinforcement Learning}, booktitle = {International Conference on Artificial Intelligence and Statistics (AISTATS)}, year = {2022} }

Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple bandit problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.
[4] An Operator View of Policy Gradient Methods

D. Ghosh, M. C. Machado, and N. Le Roux

In Neural Information Processing Systems (NeurIPS), 2020

@inproceedings{ghosh2020operator, author = {Dibya Ghosh, Marlos C. Machado, and Nicolas Le Roux}, title = {An Operator View of Policy Gradient Methods}, booktitle = {Neural Information Processing Systems (NeurIPS)}, year = {2020} }

We cast policy gradient methods as the repeated application of two operators: a policy improvement operator $\mathcal{I}$, which maps any policy $\pi$ to a better one $\mathcal{I}\pi$, and a projection operator $\mathcal{P}$, which finds the best approximation of $\mathcal{I}\pi$ in the set of realizable policies. We use this framework to introduce operator-based versions of traditional policy gradient methods such as REINFORCE and PPO, which leads to a better understanding of their original counterparts. We also use the understanding we develop of the role of $\mathcal{I}$ and $\mathcal{P}$ to propose a new global lower bound of the expected return. This new perspective allows us to further bridge the gap between policy-based and value-based methods, showing how REINFORCE and the Bellman optimality operator, for example, can be seen as two sides of the same coin.
[5] Accelerating Learning in Constructive Predictive Frameworks with the Successor Representation

C. Sherstan, M. C. Machado, and P. Pilarski

In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018

@inproceedings{sherstan2018accelerating, author = {Craig Sherstan and Marlos C. Machado and Patrick M. Pilarski}, title = {Accelerating Learning in Constructive Predictive Frameworks with the Successor Representation}, booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, year = {2018} }

We propose using the Successor Representation (SR) to accelerate learning in a constructive knowledge system based on General Value Functions (GVFs). In real-world settings, like robotics for unstructured and dynamic environments, it is impossible to model all meaningful aspects of a system and its environment by hand. Instead, robots must learn and adapt to changes in their environment and task, incrementally constructing models from their own experience. GVFs, taken from the field of reinforcement learning (RL), are a way of modeling the world as predictive questions. One approach to such models proposes a massive network of interconnected and interdependent GVFs, which are incrementally added over time. It is reasonable to expect that new, incrementally added predictions can be learned more swiftly if the learning process leverages knowledge gained from past experience. The SR provides a means of capturing regularities that can be reused across multiple GVFs by separating the dynamics of the world from the prediction targets. As a primary contribution of this work, we show that using the SR can improve sample efficiency and learning speed of GVFs in a continual learning setting where new predictions are incrementally added and learned over time. We analyze our approach in a grid-world and then demonstrate its potential on data from a physical robot arm.
[6] True Online Temporal-Difference Learning

H. van Seijen, A. R. Mahmood, P. Pilarski, M. C. Machado, and R. Sutton

Journal of Machine Learning Research (JMLR), 17(145), pp. 1-40, 2016

@article{seijen2016true, title = {True Online Temporal-Difference Learning}, author = {Harm van Seijen and A. Rupam Mahmood and Patrick M. Pilarski and Marlos C. Machado and Richard S. Sutton}, year = {2016}, volume = {17}, number = {145}, pages = {1--40}, journal = {Journal of Machine Learning Research (JMLR)} }

The temporal-difference methods TD(λ) and Sarsa(λ) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, new versions of these methods were introduced, called true online TD(λ) and true online Sarsa(λ), respectively (van Seijen & Sutton, 2014). Algorithmically, these true online methods only make two small changes to the update rules of the regular methods, and the extra computational cost is negligible in most cases. However, they follow the ideas underlying the forward view much more closely. In particular, they maintain an exact equivalence with the forward view at all times, whereas the traditional versions only approximate it for small step-sizes. We hypothesize that these true online methods not only have better theoretical properties, but also dominate the regular methods empirically. In this article, we put this hypothesis to the test by performing an extensive empirical comparison. Specifically, we compare the performance of true online TD(λ)/Sarsa(λ) with regular TD(λ)/Sarsa(λ) on random MRPs, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment. We use linear function approximation with tabular, binary, and non-binary features. Our results suggest that the true online methods indeed dominate the regular methods. Across all domains/representations the learning speed of the true online methods are often better, but never worse than that of the regular methods. An additional advantage is that no choice between traces has to be made for the true online methods. Besides the empirical results, we provide an in-dept analysis of the theory behind true online temporal-difference learning. In addition, we show that new true online temporal- difference methods can be derived by making changes to the online forward view and then rewriting the update equations.
[7] Compound Returns Reduce Variance in Reinforcement Learning

Brett Daley, Martha White, and M. C. Machado

Under review, 2024

}

Real-world applications

Description

In the field of reinforcement learning, there is often a focus on developing general-purpose algorithms. These algorithms are typically evaluated using standard problems that are well-understood, such as gridworlds, or in problems that challenge our algorithms in some particular dimension, such as generality, as in the Atari 2600 games. However, these are simply proxy problems; Atari 2600 games themselves are not the ultimate goal, for example. The true hope is that success in these domains will translate into solving real-world problems effectively. We find it valuable to venture into tackling these real-world problems because it forces us to confront the assumptions we make in our research and sheds light on any overlooked issues. This process helps us refine the questions we explore in our research.

We are constantly on the lookout for real-world problems that can be addressed using reinforcement learning algorithms, while taking feasibility into account, considering factors such as cost and engineering support. In the past, during my Ph.D., we collaborated on a project that validated the effectiveness of the successor representation in accelerating learning for a robot arm [3]. While working at Google, we designed a deep reinforcement learning algorithm for flying balloons in the stratosphere [2]. This solution was actually deployed in the real world and provided internet access in Kenya for a period of time. Additionally, this project resulted in a patent application (it was done in industry after all). Upon returning to the University of Alberta, we have been collaborating with other Amii fellows on utilizing reinforcement learning in water treatment plants [1].

References

[1] GVFs in the Real World: Making Predictions Online for Water Treatment

M. K. Janjua, H. Shah, M. White, E. Miahi, M. C. Machado, and A. White

Machine Learning,2023

@article{janjua2023gvfs, title = {GVFs in the Real World: Making Predictions Online for Water Treatment}, author = {Muhammad Kamran Janjua and Haseeb Shah and Martha White and Erfan Miahi and Marlos C. Machado and Adam White}, journal = {Machine Learning}, year = {2023} }

In this paper we investigate the use of reinforcement-learning based prediction approaches for a real drinking-water treatment plant. Developing such a prediction system is a critical step on the path to optimizing and automating water treatment. Before that, there are many questions to answer about predictability of the data, suitable neural network architectures, how to overcome partially observability, and more. We first describe this dataset, and highlight challenges with seasonality, nonstationarity, partial observability and heterogeneity across sensors and operation modes of the plant. We then describe General Value Function (GVF) predictions—discounted cumulative sums of observations–and highlight why they might be preferable to classical n-step predictions common in time series prediction. We discuss how to use offline data to appropriately pre-train our temporal difference learning (TD) agents that learn these GVF predictions, including how to select hyperparameters for online fine-tuning in deployment. We find that the TD prediction agent obtains an overall lower normalized mean-squared error than the n-step prediction agent. Finally, we show the importance of learning in deployment, by contrasting to a TD agent trained purely offline with no online updating. This final result is one of the first to motivate the importance of adapting predictions in realtime, for non-stationary high-volume systems in the real-world.
[2] Autonomous Navigation of Stratospheric Balloons using Reinforcement Learning

[Alph. order] M. G. Bellemare, S. Candido, P. S. Castro, J. Gong, M. C. Machado, S. Moitra, S. Ponda, and Z. Wang

Nature, 588, pp. 77-82, 2020

@article{bellemare2020autonomous, title = {Autonomous Navigation of Stratospheric Balloons using Reinforcement Learning}, author = {Marc G. Bellemare and Salvatore Candido and Pablo S. Castro and Jun Gong and Marlos C. Machado and Subhodeep Moitra and Sameera Ponda and Ziyu Wang}, year = {2020}, volume = {588}, pages = {77--82}, journal = {Nature} }

Efficiently navigating a superpressure balloon in the stratosphere requires the integration of a multitude of cues, such as wind speed and solar elevation, and the process is complicated by forecast errors and sparse wind measurements. Coupled with the need to make decisions in real time, these factors rule out the use of conventional control techniques. Here we describe the use of reinforcement learning to create a high-performing flight controller. Our algorithm uses data augmentation and a self-correcting design to overcome the key technical challenge of reinforcement learning from imperfect data, which has proved to be a major obstacle to its application to physical systems8. We deployed our controller to station Loon superpressure balloons at multiple locations across the globe, including a 39-day controlled experiment over the Pacific Ocean. Analyses show that the controller outperforms Loon’s previous algorithm and is robust to the natural diversity in stratospheric winds. These results demonstrate that reinforcement learning is an effective solution to real-world autonomous control problems in which neither conventional methods nor human intervention suffice, offering clues about what may be needed to create artificially intelligent agents that continuously interact with real, dynamic environments.
[3] Accelerating Learning in Constructive Predictive Frameworks with the Successor Representation

C. Sherstan, M. C. Machado, and P. Pilarski

In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018

@inproceedings{sherstan2018accelerating, author = {Craig Sherstan and Marlos C. Machado and Patrick M. Pilarski}, title = {Accelerating Learning in Constructive Predictive Frameworks with the Successor Representation}, booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, year = {2018} }

We propose using the Successor Representation (SR) to accelerate learning in a constructive knowledge system based on General Value Functions (GVFs). In real-world settings, like robotics for unstructured and dynamic environments, it is impossible to model all meaningful aspects of a system and its environment by hand. Instead, robots must learn and adapt to changes in their environment and task, incrementally constructing models from their own experience. GVFs, taken from the field of reinforcement learning (RL), are a way of modeling the world as predictive questions. One approach to such models proposes a massive network of interconnected and interdependent GVFs, which are incrementally added over time. It is reasonable to expect that new, incrementally added predictions can be learned more swiftly if the learning process leverages knowledge gained from past experience. The SR provides a means of capturing regularities that can be reused across multiple GVFs by separating the dynamics of the world from the prediction targets. As a primary contribution of this work, we show that using the SR can improve sample efficiency and learning speed of GVFs in a continual learning setting where new predictions are incrementally added and learned over time. We analyze our approach in a grid-world and then demonstrate its potential on data from a physical robot arm.

Unifying algorithms

Description

Research is often about identifying the problems to tackle as well as choosing specific techniques to explore within those contexts. However, there is more to it—research is also influenced by personal taste and the approach one prefers to take when addressing problems. Personally, I like to adopt a broad perspective in my research, learning about topics that may not be immediately connected to the problems I'm studying. I believe that such exploration enables the development of tools that will prove valuable in the future. One aspect I particularly enjoy is discovering similarities between seemingly unrelated methods, thereby potentially unifying diverse approaches. I find such results elegant and always take great satisfaction in achieving them.

One of our published results that stands out as particularly interesting and beneficial to our research is the equivalence we established between proto-value functions and the eigenvectors of the successor representation [5]. This equivalence allows us to utilize the most appropriate formulation in different scenarios, thereby harnessing the strengths of both worlds. This perspective has proven especially valuable in our work on option discovery. In fact, we have also presented a unifying formulation of options that allows us to talk about different methods using the same terms, effectively unifying several option discovery methods [3]. When talking about representation learning in the face of partial observability, we introduced the term auxiliary inputs, which unifies various techniques used for resolving partial observability in successful real-world reinforcement learning applications [2]. Additionally, when talking about more traditional algorithms, we have introduced an operator view of policy gradient methods that demonstrates their direct relationship with value-based methods, positioning them on opposite ends of the same spectrum [4]. We also have results on eligibility traces. Specifically, we have worked on an exact equivalence between the forward and the backward view of TD(λ) [6], and we have provided a unifying perspective over multiple eligibility trace methods, something that also allowed us to get much stronger convergence results for them [1].

References

[1] Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning

B. Daley, M. White, C. Amato, and M. C. Machado

In International Conference on Machine Learning (ICML), 2023

@inproceedings{daley2023trajectory-aware, title = {Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning}, author = {Brett Daley and Martha White and Christopher Amato and Marlos C. Machado}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2023} }

Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across λ-values in an off-policy control task.
[2] Agent-State Construction with Auxiliary Inputs

R. Y. Tao, A. White, and M. C. Machado

Transactions on Machine Learning Research (TMLR), 2023

@article{tao2023agent-state, title = {Agent-State Construction with Auxiliary Inputs}, author = {Ruo Yu Tao and Adam White and Marlos C. Machado}, journal = {Transactions on Machine Learning Research (TMLR)}, year = {2023} }

In many, if not every realistic sequential decision-making task, the decision-making agent is not able to model the full complexity of the world. The environment is often much larger and more complex than the agent, a setting also known as partial observability. In such settings, the agent must leverage more than just the current sensory inputs; it must construct an agent state that summarizes previous interactions with the world. Currently, a popular approach for tackling this problem is to learn the agent-state function via a recurrent network from the agent's sensory stream as input. Many impressive reinforcement learning applications have instead relied on environment-specific functions to aid the agent's inputs for history summarization. These augmentations are done in multiple ways, from simple approaches like concatenating observations to more complex ones such as uncertainty estimates. Although ubiquitous in the field, these additional inputs, which we term auxiliary inputs, are rarely emphasized, and it is not clear what their role or impact is. In this work we explore this idea further, and relate these auxiliary inputs to prior classic approaches to state construction. We present a series of examples illustrating the different ways of using auxiliary inputs for reinforcement learning. We show that these auxiliary inputs can be used to discriminate between observations that would otherwise be aliased, leading to more expressive features that smoothly interpolate between different states. Finally, we show that this approach is complementary to state-of-the-art methods such as recurrent neural networks and truncated back-propagation through time, and acts as a heuristic that facilitates longer temporal credit assignment, leading to better performance.
[3] Reward-Respecting Subtasks for Model-Based Reinforcement Learning

R. Sutton, M. C. Machado, G. Z. Holland, D. Szepesvari, F. Timbers, B. Tanner, and A. White

Artificial Intelligence (AIJ), 2023

@article{sutton2023reward-respecting, title = {Reward-Respecting Subtasks for Model-Based Reinforcement Learning}, author = {Richard S. Sutton and Marlos C. Machado and G. Zacharias Holland and David Szepesvari and Finbarr Timbers and Brian Tanner and Adam White}, journal = {Artificial Intelligence}, year = {2023} }

To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress in state abstraction, but, although the theory of time abstraction has been extensively developed based on the options framework, in practice options have rarely been used in planning. One reason for this is that the space of possible options is immense and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks such as reaching a bottleneck state, or maximizing a sensory signal other than the reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. The subtasks proposed in most previous work ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option stops. We show that options and option models obtained from such reward-respecting subtasks are much more likely to be useful in planning and can be learned online and off-policy using existing learning algorithms. Reward respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how the algorithms for learning values, policies, options, and models can be unified using general value functions.
[4] An Operator View of Policy Gradient Methods

D. Ghosh, M. C. Machado, and N. Le Roux

In Neural Information Processing Systems (NeurIPS), 2020

@inproceedings{ghosh2020operator, author = {Dibya Ghosh, Marlos C. Machado, and Nicolas Le Roux}, title = {An Operator View of Policy Gradient Methods}, booktitle = {Neural Information Processing Systems (NeurIPS)}, year = {2020} }

We cast policy gradient methods as the repeated application of two operators: a policy improvement operator $\mathcal{I}$, which maps any policy $\pi$ to a better one $\mathcal{I}\pi$, and a projection operator $\mathcal{P}$, which finds the best approximation of $\mathcal{I}\pi$ in the set of realizable policies. We use this framework to introduce operator-based versions of traditional policy gradient methods such as REINFORCE and PPO, which leads to a better understanding of their original counterparts. We also use the understanding we develop of the role of $\mathcal{I}$ and $\mathcal{P}$ to propose a new global lower bound of the expected return. This new perspective allows us to further bridge the gap between policy-based and value-based methods, showing how REINFORCE and the Bellman optimality operator, for example, can be seen as two sides of the same coin.
[5] Eigenoption Discovery through the Deep Successor Representation

M. C. Machado, C. Rosenbaum, X. Guo, M. Liu, G. Tesauro, and M. Campbell

In International Conference on Learning Representations (ICLR), 2018

@inproceedings{machado2018eigenoption, author = {Marlos C. Machado and Clemens Rosenbaum and Xiaoxiao Guo and Miao Liu and Gerald Tesauro and Murray Campbell}, title = {Eigenoption Discovery through the Deep Successor Representation}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2018} }

Options in reinforcement learning allow agents to hierarchically decompose a task into subtasks, having the potential to speed up learning and planning. However, autonomously learning effective sets of options is still a major challenge in the field. In this paper we focus on the recently introduced idea of using representation learning methods to guide the option discovery process. Specifically, we look at eigenoptions, options obtained from representations that encode diffusive information flow in the environment. We extend the existing algorithms for eigenoption discovery to settings with stochastic transitions and in which handcrafted features are not available. We propose an algorithm that discovers eigenoptions while learning non-linear state representations from raw pixels. It exploits recent successes in the deep reinforcement learning literature and the equivalence between proto-value functions and the successor representation. We use traditional tabular domains to provide intuition about our approach and Atari 2600 games to demonstrate its potential.
[6] True Online Temporal-Difference Learning

H. van Seijen, A. R. Mahmood, P. Pilarski, M. C. Machado, and R. Sutton

Journal of Machine Learning Research (JMLR), 17(145), pp. 1-40, 2016

@article{seijen2016true, title = {True Online Temporal-Difference Learning}, author = {Harm van Seijen and A. Rupam Mahmood and Patrick M. Pilarski and Marlos C. Machado and Richard S. Sutton}, year = {2016}, volume = {17}, number = {145}, pages = {1--40}, journal = {Journal of Machine Learning Research (JMLR)} }

The temporal-difference methods TD(λ) and Sarsa(λ) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, new versions of these methods were introduced, called true online TD(λ) and true online Sarsa(λ), respectively (van Seijen & Sutton, 2014). Algorithmically, these true online methods only make two small changes to the update rules of the regular methods, and the extra computational cost is negligible in most cases. However, they follow the ideas underlying the forward view much more closely. In particular, they maintain an exact equivalence with the forward view at all times, whereas the traditional versions only approximate it for small step-sizes. We hypothesize that these true online methods not only have better theoretical properties, but also dominate the regular methods empirically. In this article, we put this hypothesis to the test by performing an extensive empirical comparison. Specifically, we compare the performance of true online TD(λ)/Sarsa(λ) with regular TD(λ)/Sarsa(λ) on random MRPs, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment. We use linear function approximation with tabular, binary, and non-binary features. Our results suggest that the true online methods indeed dominate the regular methods. Across all domains/representations the learning speed of the true online methods are often better, but never worse than that of the regular methods. An additional advantage is that no choice between traces has to be made for the true online methods. Besides the empirical results, we provide an in-dept analysis of the theory behind true online temporal-difference learning. In addition, we show that new true online temporal- difference methods can be derived by making changes to the online forward view and then rewriting the update equations.

Empirical evaluation and best practices

Description

Research is often about identifying the problems to tackle as well as choosing specific techniques to explore within those contexts. However, there is more to it—research is also influenced by personal taste and the approach one prefers to take when addressing problems. Personally, my research has a strong emphasis on empirical investigations, and I place significant emphasis on proper empirical evaluation and best practices for conducting such evaluations. I often find frustration with the current state of the field and the lack of precision in empirical practices. In response, I have written papers that highlight some of the issues I observe.

Some of the published research stemmed from noticing meaningless comparisons across exploration methods in the literature. We conducted rigorous, apples-to-apples comparisons in Atari 2600 games, and we showed that while exploration bonuses led to higher scores in Montezuma's Revenge, they did not provide meaningful advantages over the simpler ϵ-greedy scheme in most of the the other games [1]. A well-known contribution we made, which was also motivated by inadequate empirical practices in the field, was a paper discussing evaluation practices in the Arcade Learning Environment (ALE). Such a discussion motivated the introduction of stochasticity and game modes in the ALE [2].

References

[1] On Bonus Based Exploration Methods In The Arcade Learning Environment

A. A. Taiga, W. Fedus, M. C. Machado, A. Courville, and M. G. Bellemare

In International Conference on Learning Representations (ICLR), 2020

@inproceedings{taiga2020bonus, author = {Adrien Ali Taiga and William Fedus and Marlos C. Machado and Aaron Courville and Marc G. Bellemare}, title = {On Bonus Based Exploration Methods In The Arcade Learning Environment}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2020} }

Research on exploration in reinforcement learning, as applied to Atari 2600 game-playing, has emphasized tackling difficult exploration problems such as Montezuma's Revenge (Bellemare et al., 2016). Recently, bonus-based exploration methods, which explore by augmenting the environment reward, have reached above-human average performance on such domains. In this paper we reassess popular bonus-based exploration methods within a common evaluation framework. We combine Rainbow (Hessel et al., 2018) with different exploration bonuses and evaluate its performance on Montezuma's Revenge, Bellemare et al.'s set of hard of exploration games with sparse rewards, and the whole Atari 2600 suite. We find that while exploration bonuses lead to higher score on Montezuma's Revenge they do not provide meaningful gains over the simpler epsilon-greedy scheme. In fact, we find that methods that perform best on that game often underperform epsilon-greedy on easy exploration Atari 2600 games. We find that our conclusions remain valid even when hyperparameters are tuned for these easy-exploration games. Finally, we find that none of the methods surveyed benefit from additional training samples (1 billion frames, versus Rainbow's 200 million) on Bellemare et al.'s hard exploration games. Our results suggest that recent gains in Montezuma's Revenge may be better attributed to architecture change, rather than better exploration schemes; and that the real pace of progress in exploration research for Atari 2600 games may have been obfuscated by good results on a single domain.
[2] Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents

M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling

Journal of Artificial Intelligence Research (JAIR), 61, pp. 523-562, 2018

@article{machado2018revisiting, title = {Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents}, author = {Marlos C. Machado and Marc G. Bellemare and Erik Talvitie and Joel Veness and Matthew Hausknecht and Michael Bowling}, year = {2018}, volume = {61}, pages = {523--562}, journal = {Journal of Artificial Intelligence Research (JAIR)} }

The Arcade Learning Environment (ALE) is an evaluation platform that poses the challenge of building AI agents with general competency across dozens of Atari 2600 games. It supports a variety of different problem settings and it has been receiving increasing attention from the scientific community, leading to some high-profile success stories such as the much publicized Deep Q-Networks (DQN). In this article we take a big picture look at how the ALE is being used by the research community. We show how diverse the evaluation methodologies in the ALE have become with time, and highlight some key concerns when evaluating agents in the ALE. We use this discussion to present some methodological best practices and provide new benchmark results using these best practices. To further the progress in the field, we introduce a new version of the ALE that supports multiple game modes and provides a form of stochasticity we call sticky actions. We conclude this big picture look by revisiting challenges posed when the ALE was introduced, summarizing the state-of-the-art in various problems and highlighting problems that remain open.

Model-based reinforcement learning

Description

Reinforcement learning solution methods can be categorized into two main types: model-free and model-based approaches. Model-free methods are much more common in the literature, but model-based methods, which consist in methods that first learn about the transition dynamics of the environment and then plan with them, are thought to be more conducive to better exploration, faster adaptation in the face of non-stationarity, and more.

It is important to note that model-based reinforcement learning is an active area of research, and there are relatively few established and effective solutions available. Integrating temporal abstraction, specifically options, into model-based reinforcement learning presents an even more challenging topic. This is precisely the problem we have been focusing on. In this context, options are discovered by posing subsidiary tasks such as reaching a bottleneck state, or maximizing a sensory signal other than the reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. We propose reward-respecting subtasks, which combine the original reward with a bonus based on a specific feature of the state when the option terminates; and we demonstrate that options and option models derived from these reward-respecting subtasks are more likely to be beneficial in the planning process [1].

We have also looked at what types of representations are more conducive to model-learning; that is, what representations make model-learning "easier". We have observed that discrete representations (as in multiple one-hot vectors), such as those used by DreamerV3, seem to be more effective at model learning [2].

References

[1] Reward-Respecting Subtasks for Model-Based Reinforcement Learning

R. Sutton, M. C. Machado, G. Z. Holland, D. Szepesvari, F. Timbers, B. Tanner, and A. White

Artificial Intelligence (AIJ), 2023

@article{sutton2023reward-respecting, title = {Reward-Respecting Subtasks for Model-Based Reinforcement Learning}, author = {Richard S. Sutton and Marlos C. Machado and G. Zacharias Holland and David Szepesvari and Finbarr Timbers and Brian Tanner and Adam White}, journal = {Artificial Intelligence}, year = {2023} }

To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress in state abstraction, but, although the theory of time abstraction has been extensively developed based on the options framework, in practice options have rarely been used in planning. One reason for this is that the space of possible options is immense and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks such as reaching a bottleneck state, or maximizing a sensory signal other than the reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. The subtasks proposed in most previous work ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option stops. We show that options and option models obtained from such reward-respecting subtasks are much more likely to be useful in planning and can be learned online and off-policy using existing learning algorithms. Reward respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how the algorithms for learning values, policies, options, and models can be unified using general value functions.
[2] Harnessing Discrete Representations For Continual Reinforcement Learning

Edan Meyer, Adam White, and M. C. Machado

CoRR abs/2312.01203, 2023

@article{meyer2023harnessing, title = {Harnessing Discrete Representations For Continual Reinforcement Learning}, author = {Edan Meyer and Adam White and Marlos C. Machado}, journal = {CoRR}, volume = {abs/2312.01203}, year = {2023} }

Reinforcement learning (RL) agents make decisions using nothing but observations from the environment, and consequently, heavily rely on the representations of those observations. Though some recent breakthroughs have used vector-based categorical representations of observations, often referred to as discrete representations, there is little work explicitly assessing the significance of such a choice. In this work, we provide a thorough empirical investigation of the advantages of representing observations as vectors of categorical values within the context of reinforcement learning. We perform evaluations on world-model learning, model-free RL, and ultimately continual RL problems, where the benefits best align with the needs of the problem setting. We find that, when compared to traditional continuous representations, world models learned over discrete representations accurately model more of the world with less capacity, and that agents trained with discrete representations learn better policies with less data. In the context of continual RL, these benefits translate into faster adapting agents. Additionally, our analysis suggests that the observed performance improvements can be attributed to the information contained within the latent vectors and potentially the encoding of the discrete representation itself.

Policy optimization

Description

Model-free reinforcement learning solution methods can be broadly divided into two main types: value-based and policy gradient methods. Policy-gradient methods consist in learning a parameterized policy that maps states to actions. They are extremely popular, with examples such as PPO and A3C.

Within this domain, our research has focused on gaining a comprehensive understanding of established concepts in policy gradient methods and developing approaches that are more theoretically grounded. Specifically, we revisited the role of baselines in policy gradient methods, showing how, different from common belief, baselines have a bigger role than variance reduction; with baselines that induce the same variance leading to very different behaviours. Surprisingly, we have also shown that higher variance can actually lead to faster convergence [2]! Additionally, we have introduced an operator view of policy gradient methods, which reveals their direct relationship with value-based methods. This perspective positions policy gradient methods and value-based methods as opposite ends of the same spectrum, shedding new light on their fundamental properties [1]. This first paper did not specify how to design consistent operators. To address this gap, we later proposed a general framework based on functional mirror ascent that provides a theoretically sound approach for designing operators by giving us an entire family of surrogate functions [3].

References

[1] A General Class of Surrogate Functions for Stable and Efficient Reinforcement Learning [Best paper nominee]

S. Vaswani, O. Bachem, S. Totaro, R. Mueller, S. Garg, M. Geist, M. C. Machado, P. S. Castro, and N. Le Roux

In International Conference on Artificial Intelligence and Statistics (AISTATS) [Oral], 2022

@inproceedings{vaswani2022general, author = {Sharan Vaswani and Olivier Bachem and Simone Totaro and Robert Mueller and Shivam Garg and Matthieu Geist and Marlos C. Machado and Pablo Samuel Castro and Nicolas Le Roux}, title = {A General Class of Surrogate Functions for Stable and Efficient Reinforcement Learning}, booktitle = {International Conference on Artificial Intelligence and Statistics (AISTATS)}, year = {2022} }

Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple bandit problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.
[2] Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization

W. Chung*, V. Thomas*, M. C. Machado, and N. Le Roux

In International Conference on Machine Learning (ICML), 2021

@inproceedings{chung2021beyond, author = {Wesley Chung and Valentin Thomas and Marlos C. Machado and Nicolas Le Roux}, title = {Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2021} }

Bandit and reinforcement learning (RL) problems can often be framed as optimization problems where the goal is to maximize average performance while having access only to stochastic estimates of the true gradient. Traditionally, stochastic optimization theory predicts that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates. In this paper we demonstrate that the standard view is too limited for bandit and RL problems. To allow our analysis to be interpreted in light of multi-step MDPs, we focus on techniques derived from stochastic optimization principles (e.g., natural policy gradient and EXP3) and we show that some standard assumptions from optimization theory are violated in these problems. We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics and that seemingly innocuous choices like the baseline can determine whether an algorithm converges. These theoretical findings match our empirical evaluation, which we extend to multi-state MDPs.
[3] An Operator View of Policy Gradient Methods

D. Ghosh, M. C. Machado, and N. Le Roux

In Neural Information Processing Systems (NeurIPS), 2020

@inproceedings{ghosh2020operator, author = {Dibya Ghosh, Marlos C. Machado, and Nicolas Le Roux}, title = {An Operator View of Policy Gradient Methods}, booktitle = {Neural Information Processing Systems (NeurIPS)}, year = {2020} }

We cast policy gradient methods as the repeated application of two operators: a policy improvement operator $\mathcal{I}$, which maps any policy $\pi$ to a better one $\mathcal{I}\pi$, and a projection operator $\mathcal{P}$, which finds the best approximation of $\mathcal{I}\pi$ in the set of realizable policies. We use this framework to introduce operator-based versions of traditional policy gradient methods such as REINFORCE and PPO, which leads to a better understanding of their original counterparts. We also use the understanding we develop of the role of $\mathcal{I}$ and $\mathcal{P}$ to propose a new global lower bound of the expected return. This new perspective allows us to further bridge the gap between policy-based and value-based methods, showing how REINFORCE and the Bellman optimality operator, for example, can be seen as two sides of the same coin.

Computer games

Description

I primarily use games as evaluation testbeds because they are well-specified domains that capture several properties of the problems I want to study. These games are not the ultimate goal, they are simply proxy problems. Personally, this is how I distinguish computer games research to the research done in AI or ML that is evaluated in games: the game itself is the ultimate goal of computer games research. While I do not generally do computer games research, I did delve into this area during my M.Sc. degree several several years ago [2].

During my master's studies, I worked on the problem of player modeling, which involves capturing (learning) players' styles, preferences, and more in order to adapt the game accordingly. Focusing on games like Civilization IV and Counter Strike, we advocated for modeling players as a linear combination of weights [4], and we modeled players with various techniques ranging from piece-wise linear regression [6] to classification-based approaches [3]. When working on this topic we also wrote a survey of the field [5]. Apart from player modeling, we also ventured into Sudoku [7] and, before research in real-time strategy (RTS) games was mainstream, we designed an advice system based on decision trees [1].

References

[1] RTSMate: Towards and Advice System for RTS Games

R. L. F. Cunha, M. C. Machado, and L. Chaimowicz

Computers in Entertainment (CiE), 12(1), pp. 1-20, 2014.

@article{cunha2014rtsmate, title = {RTSMate: Towards and Advice System for RTS Games}, author = {Renato L. de Freitas Cunha and Marlos C. Machado and Luiz Chaimowicz}, year = {2014}, volume = {12}, number = {1}, pages = {1--20}, journal = {Computers in Entertainment (CiE)} }

Real Time Strategy (RTS) games can be very challenging, especially to novice users, who are normally overwhelmed by the dynamic, distributed, and multi-objective structure of these games. In this paper we present RTSMate, an advice system designed to help the player of an RTS game. Using inference mechanisms to reason about the game state and a decision tree to encode its knowledge, RTSMate helps the player by giving him/her tactical and strategical tips about the best actions to be taken according to the current game state, aiming at improving player's performance. This paper describes the main ideas behind the system, its implementation, and the experiments performed using the system in a real game environment. Results show that RTSMate fulfills its objective: most players considered the system useful and were able to improve their performance by using it.
[2] A Methodology for Player Modeling based on Machine Learning

M. C. Machado

M.Sc. Thesis, Universidade Federal de Minas Gerais, 2013

@mscthesis{machado2013methodology, author = {Marlos C. Machado}, title = {A Methodology for Player Modeling based on Machine Learning}, school = {Universidade Federal de Minas Gerais}, year = {2013} }

AI is gradually receiving more attention as a fundamental feature to increase the immersion in digital games. Among the several AI approaches, player modeling is becoming an important one. The main idea is to understand and model the player characteristics and behaviors in order to develop a better AI. In this work, we discuss several aspects of this new field. We proposed a taxonomy to organize the area, discussing several facets of this topic, ranging from implementation decisions up to what a model attempts to describe. We then classify, in our taxonomy, some of the most important works in this field. We also presented a generic approach to deal with player modeling using ML, and we instantiated this approach to model players' preferences in the game Civilization IV. The instantiation of this approach has several steps. We first discuss a generic representation, regardless of what is being modeled, and evaluate it performing experiments with the strategy game Civilization IV. Continuing the instantiation of the proposed approach we evaluated the applicability of using game score information to distinguish different preferences. We presented a characterization of virtual agents in the game, comparing their behavior with their stated preferences. Once we have characterized these agents, we were able to observe that different preferences generate different behaviors, measured by several game indicators. We then tackled the preference modeling problem as a binary classification task, with a supervised learning approach. We compared four different methods, based on different paradigms (SVM, AdaBoost, NaiveBayes and JRip), evaluating them on a set of matches played by different virtual agents. We conclude our work using the learned models to infer human players' preferences. Using some of the evaluated classifiers we obtained accuracies over 60% for most of the inferred preferences.
[3] A Binary Classification Approach for Automatic Preference Modeling of Virtual Agents in Civilization IV

M. C. Machado, G. L. Pappa, and L. Chaimowicz

In International Conference on Computational Intelligence and Games (CIG), 2012

@inproceedings{machado2012binary, author = {Marlos C. Machado and Gisele L. Pappa and Luiz Chaimowicz}, title = {A Binary Classification Approach for Automatic Preference Modeling of Virtual Agents in Civilization IV}, booktitle = {International Conference on Computational Intelligence and Games (CIG)}, year = {2012} }

Player Modeling tries to model players behaviors and characteristics during a game. When these are related to more abstract preferences, the process is normally called Preference Modeling. In this paper we infer Civilization IV's virtual agents preferences with classifiers based on support vector machines. Our vectors contain score indicators from agents gameplay, allowing us to predict preferences based on the indirect observations of actions. We model this task as a binary classification problem, allowing us to make more precise inference. In this sense, we leveraged previous approaches that also used kernel machines but relied on different preference levels. Using binary classification and parameter optimization, our method is able to predict some agents preferences with an accuracy of 100%. Moreover, it is also capable of generalizing to different agents, being able to predict preferences of agents that were not used in the training process.
[4] Characterizing and Modeling Agents in Digital Games

M. C. Machado, G. L. Pappa, and L. Chaimowicz

In Brazilian Symposium on Computer Games and Digital Entertainment (SBGames), 2012

@inproceedings{machado2012characterizing, author = {Marlos C. Machado and Gisele L. Pappa and Luiz Chaimowicz}, title = {Characterizing and Modeling Agents in Digital Games}, booktitle = {Brazilian Symposium on Computer Games and Digital Entertainment (SBGames)}, year = {2012} }

A promising approach in digital games is the possibility of customizing the game according to different demands. Artificial Intelligence algorithms play an important role in this direction, allowing the implementation of different behaviors for game agents. To accomplish this, it is necessary to model these agents in such way their behavior can be easily tunned to address different game features. In this paper, we discuss a generic representation to model virtual agents in digital games. Agents are modeled using a linear combination of different variables, which are used to represent specific game features. We perform experiments with FPS and Strategy games (Counter Strike and Civilization IV, respectively) and results show the effectiveness of this approach in characterizing and modeling agents. We were able to infer agents models by observing matches and also to generate different behaviors varying agent’s models.
[5] Player Modeling: Towards a Common Taxonomy

M. C. Machado, E. P. C. Fantini, and L. Chaimowicz

In International Conference on Computer Games (CGames), 2011

@inproceedings{machado2011player, author = {Marlos C. Machado and Eduardo P. C. Fantini and Luiz Chaimowicz}, title = {Player Modeling: Towards a Common Taxonomy}, booktitle = {International Conference on Computer Games (CGames)}, year = {2011} }

Artificial Intelligence (AI) is gradually receiving more attention as a fundamental feature to increase the immersion in digital games. Among the several AI approaches, Player Modeling is becoming an important one. The main idea is to try to understand and model the player characteristics and behaviors in order to develop a better AI. This paper presents a survey of the field, discussing the main concepts and proposing a taxonomy to better organize them. We also present several game platforms that can be used by player modeling and AI researchers. We believe that compiling this information can be important to the field, specially to new researchers.
[6] Agents Behavior and Preferences Characterization in Civilization IV

M. C. Machado, B. S. L. Rocha, and L. Chaimowicz

In Brazilian Symposium on Computer Games and Digital Entertainment (SBGames), 2011

@inproceedings{machado2011agents, author = {Marlos C. Machado and Bruno S. L. Rocha and Luiz Chaimowicz}, title = {Agents Behavior and Preferences Characterization in Civilization IV}, booktitle = {Brazilian Symposium on Computer Games and Digital Entertainment (SBGames)}, year = {2011} }

Player Modeling is becoming an important feature in Digital Games. It basically consists in understanding and modeling the player characteristics and behaviors during the game and has been mainly used to improve the games artificial intelligence, making games more adaptable to different players. In this paper, we try to characterize the preference of the players using a novel approach in games: we use mathematical regressions to characterize players behavior, looking for functions that best fit these behaviors. Using AI controlled players in Civilization IV as a testbed, this characterization is performed by extracting game data (score and resources, for example) at the end of each turn and generating functions that characterize the data evolution during the game. We were able to obtain models that distinguish the agents preferences showing the effectiveness of this approach.
[7] Combining Metaheuristics and CSP Algorithms to Solve Sudoku

M. C. Machado, and L. Chaimowicz

In Brazilian Symposium on Computer Games and Digital Entertainment (SBGames), 2011

@inproceedings{machado2011combining, author = {Marlos C. Machado and Luiz Chaimowicz}, title = {Combining Metaheuristics and CSP Algorithms to Solve Sudoku}, booktitle = {Brazilian Symposium on Computer Games and Digital Entertainment (SBGames)}, year = {2011} }

Sudoku is a very popular puzzle game that is played by millions of people everyday. In spite of that, it is a NP-Hard problem that can be very difficult to solve depending on the initial conditions of the board. In this paper, we propose the combination of metaheuristics with techniques from the Constraint Satisfaction Problem (CSP) domain that speed up the solution's search process by decreasing its search space and its processing time. Experiments performed with boards of size 3, 4 and 5 show that this approach allows the resolution of a greater number of instances when compared to an initial baseline.