Publications


2024
  1. TMLR-2
    AGaLiTe: Approximate Gated Linear Transformers for Online Reinforcement Learning
    S. Pramanik, E. Elelimy, M. C. Machado, and A. White
    Transactions on Machine Learning Research (TMLR), 2024
    @article{pramanik2023recurrent, title = {AGaLiTe: Approximate Gated Linear Transformers for Online Reinforcement Learning}, author = {Subhojeet Pramanik and Esraa Elelimy and Marlos C. Machado and Adam White}, journal = {Transactions on Machine Learning Research (TMLR)}, year = {2024} }
    In this paper we investigate transformer architectures designed for partially observable online reinforcement learning. The self-attention mechanism in the transformer architecture is capable of capturing long-range dependencies and it is the main reason behind its effectiveness in processing sequential data. Nevertheless, despite their success, transformers have two significant drawbacks that still limit their applicability in online reinforcement learning: (1) in order to remember all past information, the self-attention mechanism requires access to the whole history to be provided as context; (2) the inference cost in transformers is expensive. In this paper, we introduce recurrent alternatives to the transformer self-attention mechanism that offer context-independent inference cost, leverage long-range dependencies effectively, and perform well in online reinforcement learning tasks. We quantify the impact of the different components of our architecture in a diagnostic environment and assess performance gains in 2D and 3D pixel-based partially-observable environments (e.g. T-Maze, Mystery Path, Craftax, and Memory Maze). Compared with a state-of-the-art architecture, GTrXL, inference in our approach is at least 40% cheaper while reducing memory use by more than 50%. Our approach performs similarly to or better than GTrXL, improving by more than 37% upon GTrXL performance in harder tasks.
  2. arXiv-2
    Learning Continually by Spectral Regularization
    A. Lewandowski, S. Kumar, D. Schuurmans, A. Gyorgy, and M. C. Machado
    CoRR abs/2406.06811, 2024
    @article{lewandowski2024learning, title = {Learning Continually by Spectral Regularization}, author = {Alex Lewandowski and Saurabh Kumar and Dale Schuurmans and Andr\'as Gy\"orgy and Marlos C. Machado}, journal = {CoRR}, volume = {abs/2406.06811}, year = {2024} }
    Loss of plasticity is a phenomenon where neural networks become more difficult to train during the course of learning. Continual learning algorithms seek to mitigate this effect by sustaining good predictive performance while maintaining network trainability. We develop new techniques for improving continual learning by first reconsidering how initialization can ensure trainability during early phases of learning. From this perspective, we derive new regularization strategies for continual learning that ensure beneficial initialization properties are better maintained throughout training. In particular, we investigate two new regularization techniques for continual learning: (i) Wasserstein regularization toward the initial weight distribution, which is less restrictive than regularizing toward initial weights; and (ii) regularizing weight matrix singular values, which directly ensures gradient diversity is maintained throughout training. We present an experimental analysis that shows these alternative regularizers can improve continual learning performance across a range of supervised learning tasks and model architectures. The alternative regularizers prove to be less sensitive to hyperparameters while demonstrating better training in individual tasks, sustaining trainability as new tasks arrive, and achieving better generalization performance.
  3. AIJ-2
    Investigating the Properties of Neural Network Representations in Reinforcement Learning
    H. Wang, E. Miahi, M. White, M. C. Machado, Z. Abbas, R. Kumaraswamy, V. Liu, and A. White
    Artificial Intelligence, 2024
    @article{wang2024investigating, title = {Investigating the Properties of Neural Network Representations in Reinforcement Learning}, author = {Han Wang and Erfan Miahi and Martha White and Marlos C. Machado and Zaheer Abbas and Raksha Kumaraswamy and Vincent Liu and Adam White}, journal = {Artificial Intelligence}, year = {2024} }
    In this paper we investigate the properties of representations learned by deep reinforcement learning systems. Much of the earlier work in representation learning for reinforcement learning focused on designing fixed-basis architectures to achieve properties thought to be desirable, such as orthogonality and sparsity. In contrast, the idea behind deep reinforcement learning methods is that the agent designer should not encode representational properties, but rather that the data stream should determine the properties of the representation -- good representations emerge under appropriate training schemes. In this paper we bring these two perspectives together, empirically investigating the properties of representations that support transfer in reinforcement learning. This analysis allows us to provide novel hypotheses regarding the impact of auxiliary tasks in end-to-end training of non-linear reinforcement learning methods. We introduce and measure six representational properties over more than 25 thousand agent-task settings. We consider DQN agents with convolutional networks in a pixel-based navigation environment. We develop a method to better understand why some representations work better for transfer, through a systematic approach varying task similarity and measuring and correlating representation properties with transfer performance.
  4. ICML-5
    Averaging n-step Returns Reduces Variance in Reinforcement Learning
    B. Daley, M. White, and M. C. Machado
    International Conference on Machine Learning, 2024
    @inproceedings{daley2024compound, title = {Averaging n-step Returns Reduces Variance in Reinforcement Learning}, author = {Brett Daley and Martha White and Marlos C. Machado}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2024} }
    Multistep returns, such as n-step returns and λ-returns, are commonly used to improve the sample efficiency of reinforcement learning (RL) methods. The variance of the multistep returns becomes the limiting factor in their length; looking too far into the future increases variance and reverses the benefits of multistep learning. In our work, we demonstrate the ability of compound returns -- weighted averages of n-step returns -- to reduce variance. We prove for the first time that any compound return with the same contraction modulus as a given n-step return has strictly lower variance. We additionally prove that this variance-reduction property improves the finite-sample complexity of temporal-difference learning under linear function approximation. Because general compound returns can be expensive to implement, we introduce two-bootstrap returns which reduce variance while remaining efficient, even when using minibatched experience replay. We conduct experiments showing that two-bootstrap returns can improve the sample efficiency of n-step deep RL agents, with little additional computational cost.
  5. ICLR-5
    Proper Laplacian Representation Learning
    D. Gomez, M. Bowling, and M. C. Machado
    International Conference on Learning Representations, 2024
    @inproceedings{gomez2024proper, title = {Proper Laplacian Representation Learning}, author = {Diego Gomez and Michael Bowling and Marlos C. Machado}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2024} }
    The ability to learn good representations of states is essential for solving large reinforcement learning problems, where exploration, generalization, and transfer are particularly challenging. The Laplacian representation is a promising approach to address these problems by inducing intrinsic rewards for temporally-extended action discovery and reward shaping, and informative state encoding. To obtain the Laplacian representation one needs to compute the eigensystem of the graph Laplacian, which is often approximated through optimization objectives compatible with deep learning approaches. These approximations, however, depend on hyperparameters that are impossible to tune efficiently, converge to arbitrary rotations of the desired eigenvectors, and are unable to accurately recover the corresponding eigenvalues. In this paper we introduce a theoretically sound objective and corresponding optimization algorithm for approximating the Laplacian representation. Our approach naturally recovers both the true eigenvectors and eigenvalues while eliminating the hyperparameter dependence of previous approximations. We provide theoretical guarantees for our method and we show that those results translate empirically into robust learning across multiple environments.
  6. RLC-2
    Harnessing Discrete Representations For Continual Reinforcement Learning
    E. Meyer, A. White, and M. C. Machado
    Reinforcement Learning Conference, 2024
    @inproceedings{meyer2024harnessing, title = {Harnessing Discrete Representations For Continual Reinforcement Learning}, author = {Edan Meyer and Adam White and Marlos C. Machado}, booktitle = {Reinforcement Learning Conference (RLC)}, year = {2024} }
    Reinforcement learning (RL) agents make decisions using nothing but observations from the environment, and consequently, heavily rely on the representations of those observations. Though some recent breakthroughs have used vector-based categorical representations of observations, often referred to as discrete representations, there is little work explicitly assessing the significance of such a choice. In this work, we provide a thorough empirical investigation of the advantages of representing observations as vectors of categorical values within the context of reinforcement learning. We perform evaluations on world-model learning, model-free RL, and ultimately continual RL problems, where the benefits best align with the needs of the problem setting. We find that, when compared to traditional continuous representations, world models learned over discrete representations accurately model more of the world with less capacity, and that agents trained with discrete representations learn better policies with less data. In the context of continual RL, these benefits translate into faster adapting agents. Additionally, our analysis suggests that the observed performance improvements can be attributed to the information contained within the latent vectors and potentially the encoding of the discrete representation itself.
  7. RLC-1
    Demystifying the Recency Heuristic in Temporal-Difference Learning
    B. Daley, M. C. Machado, and M. White
    Reinforcement Learning Conference, 2024
    @inproceedings{daley2024demystifying, title = {Demystifying the Recency Heuristic in Temporal-Difference Learning}, author = {Brett Daley and Marlos C. Machado and Martha White}, booktitle = {Reinforcement Learning Conference (RLC)}, year = {2024} }
    The recency heuristic in reinforcement learning is the assumption that stimuli that occurred closer in time to an acquired reward should be more heavily reinforced. The recency heuristic is one of the key assumptions made by TD(λ), which reinforces recent experiences according to an exponentially decaying weighting. In fact, all other widely used return estimators for TD learning, such as n-step returns, satisfy a weaker (i.e., nonmonotonic) recency heuristic. Why is the recency heuristic effective for temporal credit assignment? What happens when credit is assigned in a way that violates this heuristic? In this paper, we analyze the specific mathematical implications of adopting the recency heuristic in TD learning. We prove that any return estimator satisfying this heuristic: 1) is guaranteed to converge to the correct value function, 2) has a relatively fast contraction rate, and 3) has a longer window of effective credit assignment, yet the same worst-case bias and variance. We also give a counterexample where on-policy, tabular TD methods violating the recency heuristic diverge. Our results offer some of the first theoretical evidence that credit assignment based on the recency heuristic facilitates learning.
2023
  1. arXiv-1
    Curvature Explains Loss of Plasticity
    A. Lewandowski, H. Tanaka, D. Schuurmans, and M. C. Machado
    CoRR abs/2312.00246, 2023
    @article{lewandowski2023curvature, title = {Curvature Explains Loss of Plasticity}, author = {Alex Lewandowski and Haruto Tanaka and Dale Schuurmans and Marlos C. Machado}, journal = {CoRR}, volume = {abs/2312.00246}, year = {2023} }
    Loss of plasticity is a phenomenon in which neural networks lose their ability to learn from new experience. Despite being empirically observed in several problem settings, little is understood about the mechanisms that lead to loss of plasticity. In this paper, we offer a consistent explanation for plasticity loss, based on an assertion that neural networks lose directions of curvature during training and that plasticity loss can be attributed to this reduction in curvature. To support such a claim, we provide a systematic empirical investigation of plasticity loss across several continual supervised learning problems. Our findings illustrate that curvature loss coincides with and sometimes precedes plasticity loss, while also showing that previous explanations are insufficient to explain loss of plasticity in all settings. Lastly, we show that regularizers which mitigate loss of plasticity also preserve curvature, motivating a simple distributional regularizer that proves to be effective across the problem settings considered.
  2. MLJ-1
    GVFs in the Real World: Making Predictions Online for Water Treatment
    M. K. Janjua, H. Shah, M. White, E. Miahi, M. C. Machado, and A. White
    Machine Learning, 2023
    @article{janjua2023gvfs, title = {GVFs in the Real World: Making Predictions Online for Water Treatment}, author = {Muhammad Kamran Janjua and Haseeb Shah and Martha White and Erfan Miahi and Marlos C. Machado and Adam White}, journal = {Machine Learning}, year = {2023} }
    In this paper we investigate the use of reinforcement-learning based prediction approaches for a real drinking-water treatment plant. Developing such a prediction system is a critical step on the path to optimizing and automating water treatment. Before that, there are many questions to answer about predictability of the data, suitable neural network architectures, how to overcome partial observability, and more. We first describe this dataset, and highlight challenges with seasonality, nonstationarity, partial observability and heterogeneity across sensors and operation modes of the plant. We then describe General Value Function (GVF) predictions (discounted cumulative sums of observations) and highlight why they might be preferable to classical n-step predictions common in time series prediction. We discuss how to use offline data to appropriately pre-train our temporal difference learning (TD) agents that learn these GVF predictions, including how to select hyperparameters for online fine-tuning in deployment. We find that the TD prediction agent obtains an overall lower normalized mean-squared error than the n-step prediction agent. Finally, we show the importance of learning in deployment, by contrasting to a TD agent trained purely offline with no online updating. This final result is one of the first to motivate the importance of adapting predictions in real time, for non-stationary high-volume systems in the real world.
  3. AIJ-1
    Reward-Respecting Subtasks for Model-Based Reinforcement Learning
    R. Sutton, M. C. Machado, G. Z. Holland, D. Szepesvari, F. Timbers, B. Tanner, and A. White
    Artificial Intelligence (AIJ), 2023
    @article{sutton2023reward-respecting, title = {Reward-Respecting Subtasks for Model-Based Reinforcement Learning}, author = {Richard S. Sutton and Marlos C. Machado and G. Zacharias Holland and David Szepesvari and Finbarr Timbers and Brian Tanner and Adam White}, journal = {Artificial Intelligence}, year = {2023} }
    To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress in state abstraction, but, although the theory of time abstraction has been extensively developed based on the options framework, in practice options have rarely been used in planning. One reason for this is that the space of possible options is immense and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks such as reaching a bottleneck state, or maximizing a sensory signal other than the reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. The subtasks proposed in most previous work ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option stops. We show that options and option models obtained from such reward-respecting subtasks are much more likely to be useful in planning and can be learned online and off-policy using existing learning algorithms. Reward respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how the algorithms for learning values, policies, options, and models can be unified using general value functions.
  4. ICML-4
    Deep Laplacian-based Options for Temporally-Extended Exploration
    M. Klissarov and M. C. Machado
    In International Conference on Machine Learning (ICML), 2023
    @inproceedings{klissarov2023deep, title = {Deep Laplacian-based Options for Temporally-Extended Exploration}, author = {Martin Klissarov and Marlos C. Machado}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2023} }
    Selecting exploratory actions that generate a rich stream of experience for better learning is a fundamental challenge in reinforcement learning (RL). An approach to tackle this problem consists in selecting actions according to specific policies for an extended period of time, also known as options. A recent line of work to derive such exploratory options builds upon the eigenfunctions of the graph Laplacian. Importantly, until now these methods have been mostly limited to tabular domains where (1) the graph Laplacian matrix was either given or could be fully estimated, (2) performing eigendecomposition on this matrix was computationally tractable, and (3) value functions could be learned exactly. Additionally, these methods required a separate option discovery phase. These assumptions are fundamentally not scalable. In this paper we address these limitations and show how recent results for directly approximating the eigenfunctions of the Laplacian can be leveraged to truly scale up options-based exploration. To do so, we introduce a fully online deep RL algorithm for discovering Laplacian-based options and evaluate our approach on a variety of pixel-based tasks. We compare to several state-of-the-art exploration methods and show that our approach is effective, general, and especially promising in non-stationary settings.
  5. ICML-3
    Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning
    B. Daley, M. White, C. Amato, and M. C. Machado
    In International Conference on Machine Learning (ICML), 2023
    @inproceedings{daley2023trajectory-aware, title = {Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning}, author = {Brett Daley and Martha White and Christopher Amato and Marlos C. Machado}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2023} }
    Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across λ-values in an off-policy control task.
  6. CoLLAs-1
    Loss of Plasticity in Continual Deep Reinforcement Learning
    Z. Abbas, R. Zhao, J. Modayil, A. White, and M. C. Machado
    In Conference on Lifelong Learning Agents (CoLLAs), 2023
    @inproceedings{abbas2023loss, title = {Loss of Plasticity in Continual Deep Reinforcement Learning}, author = {Zaheer Abbas and Rosie Zhao and Joseph Modayil and Adam White and Marlos C. Machado}, booktitle = {Conference on Lifelong Learning Agents (CoLLAs)}, year = {2023} }
    The ability to learn continually is essential in a complex and changing world. In this paper, we characterize the behavior of canonical value-based deep reinforcement learning (RL) approaches under varying degrees of non-stationarity. In particular, we demonstrate that deep RL agents lose their ability to learn good policies when they cycle through a sequence of Atari 2600 games. This phenomenon is alluded to in prior work under various guises -- e.g., loss of plasticity, implicit under-parameterization, primacy bias, and capacity loss. We investigate this phenomenon closely at scale and analyze how the weights, gradients, and activations change over time in several experiments with varying dimensions (e.g., similarity between games, number of games, number of frames per game), with some experiments spanning 50 days and 2 billion environment interactions. Our analysis shows that the activation footprint of the network becomes sparser, contributing to the diminishing gradients. We investigate a remarkably simple mitigation strategy -- Concatenated ReLUs (CReLUs) activation function -- and demonstrate its effectiveness in facilitating continual learning in a changing environment.
  7. TMLR-1
    Agent-State Construction with Auxiliary Inputs
    R. Y. Tao, A. White, and M. C. Machado
    Transactions on Machine Learning Research (TMLR), 2023
    @article{tao2023agent-state, title = {Agent-State Construction with Auxiliary Inputs}, author = {Ruo Yu Tao and Adam White and Marlos C. Machado}, journal = {Transactions on Machine Learning Research (TMLR)}, year = {2023} }
    In many, if not every, realistic sequential decision-making task, the decision-making agent is not able to model the full complexity of the world. The environment is often much larger and more complex than the agent, a setting also known as partial observability. In such settings, the agent must leverage more than just the current sensory inputs; it must construct an agent state that summarizes previous interactions with the world. Currently, a popular approach for tackling this problem is to learn the agent-state function via a recurrent network from the agent's sensory stream as input. Many impressive reinforcement learning applications have instead relied on environment-specific functions to aid the agent's inputs for history summarization. These augmentations are done in multiple ways, from simple approaches like concatenating observations to more complex ones such as uncertainty estimates. Although ubiquitous in the field, these additional inputs, which we term auxiliary inputs, are rarely emphasized, and it is not clear what their role or impact is. In this work we explore this idea further, and relate these auxiliary inputs to prior classic approaches to state construction. We present a series of examples illustrating the different ways of using auxiliary inputs for reinforcement learning. We show that these auxiliary inputs can be used to discriminate between observations that would otherwise be aliased, leading to more expressive features that smoothly interpolate between different states. Finally, we show that this approach is complementary to state-of-the-art methods such as recurrent neural networks and truncated back-propagation through time, and acts as a heuristic that facilitates longer temporal credit assignment, leading to better performance.
  8. JMLR-2
    Temporal Abstraction in Reinforcement Learning with the Successor Representation
    M. C. Machado, A. Barreto, D. Precup, and M. Bowling
    Journal of Machine Learning Research (JMLR), 24, pp. 1-69, 2023
    @article{machado2023temporal, title = {Temporal Abstraction in Reinforcement Learning with the Successor Representation}, author = {Marlos C. Machado and Andre Barreto and Doina Precup and Michael Bowling}, journal = {Journal of Machine Learning Research (JMLR)}, volume = {24}, number = {80}, pages = {1--69}, year = {2023} }
    Reasoning at multiple levels of temporal abstraction is one of the key attributes of intelligence. In reinforcement learning, this is often modeled through temporally extended courses of actions called options. Options allow agents to make predictions and to operate at different levels of abstraction within an environment. Nevertheless, approaches based on the options framework often start with the assumption that a reasonable set of options is known beforehand. When this is not the case, there are no definitive answers for which options one should consider. In this paper, we argue that the successor representation, which encodes states based on the pattern of state visitation that follows them, can be seen as a natural substrate for the discovery and use of temporal abstractions. To support our claim, we take a big picture view of recent results, showing how the successor representation can be used to discover options that facilitate either temporally-extended exploration or planning. We cast these results as instantiations of a general framework for option discovery in which the agent’s representation is used to identify useful options, which are then used to further improve its representation. This results in a virtuous, never-ending, cycle in which both the representation and the options are constantly refined based on each other. Beyond option discovery itself, we also discuss how the successor representation allows us to augment a set of options into a combinatorially large counterpart without additional learning. This is achieved through the combination of previously learned options. Our empirical evaluation focuses on options discovered for temporally-extended exploration and on the use of the successor representation to combine them. Our results shed light on important design decisions involved in the definition of options and demonstrate the synergy of different methods based on the successor representation, such as eigenoptions and the option keyboard.
2022
  1. UAI-1
    Temporal Abstractions-Augmented Temporally Contrastive Learning: An Alternative to the Laplacian in RL
    A. Erraqabi, M. C. Machado, M. Zhao, S. Sukhbaatar, A. Lazaric, L. Denoyer, and Y. Bengio
    In Conference on Uncertainty in Artificial Intelligence (UAI), 2022
    @inproceedings{erraqabi2022temporal, author = {Akram Erraqabi and Marlos C. Machado and Mingde Zhao and Sainbayar Sukhbaatar and Alessandro Lazaric and Ludovic Denoyer and Yoshua Bengio}, title = {Temporal Abstractions-Augmented Temporally Contrastive Learning: An Alternative to the Laplacian in RL}, booktitle = {Conference on Uncertainty in Artificial Intelligence (UAI)}, year = {2022} }
    In reinforcement learning, the graph Laplacian has proved to be a valuable tool in the task-agnostic setting, with applications ranging from skill discovery to reward shaping. Recently, learning the Laplacian representation has been framed as the optimization of a temporally-contrastive objective to overcome its computational limitations in large (or continuous) state spaces. However, this approach requires uniform access to all states in the state space, overlooking the exploration problem that emerges during the representation learning process. In this work, we propose an alternative method that is able to recover, in a non-uniform-prior setting, the expressiveness and the desired properties of the Laplacian representation. We do so by combining the representation learning with a skill-based covering policy, which provides a better training distribution to extend and refine the representation. We also show that a simple augmentation of the representation objective with the learned temporal abstractions improves dynamics-awareness and helps exploration. We find that our method succeeds as an alternative to the Laplacian in the non-uniform setting and scales to challenging continuous control environments. Finally, even if our method is not optimized for skill discovery, the learned skills can successfully solve difficult continuous navigation tasks with sparse rewards, where standard skill discovery approaches are not so effective.
  2. AISTATS-1
    A General Class of Surrogate Functions for Stable and Efficient Reinforcement Learning [Best paper nominee]
    S. Vaswani, O. Bachem, S. Totaro, R. Mueller, S. Garg, M. Geist, M. C. Machado, P. S. Castro, and N. Le Roux
    In International Conference on Artificial Intelligence and Statistics (AISTATS) [Oral], 2022
    @inproceedings{vaswani2022general, author = {Sharan Vaswani and Olivier Bachem and Simone Totaro and Robert Mueller and Shivam Garg and Matthieu Geist and Marlos C. Machado and Pablo Samuel Castro and Nicolas Le Roux}, title = {A General Class of Surrogate Functions for Stable and Efficient Reinforcement Learning}, booktitle = {International Conference on Artificial Intelligence and Statistics (AISTATS)}, year = {2022} }
    Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple bandit problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.
2021
  1. ICML-2
    Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization
    W. Chung*, V. Thomas*, M. C. Machado, and N. Le Roux
    In International Conference on Machine Learning (ICML), 2021
    @inproceedings{chung2021beyond, author = {Wesley Chung and Valentin Thomas and Marlos C. Machado and Nicolas Le Roux}, title = {Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2021} }
    Bandit and reinforcement learning (RL) problems can often be framed as optimization problems where the goal is to maximize average performance while having access only to stochastic estimates of the true gradient. Traditionally, stochastic optimization theory predicts that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates. In this paper we demonstrate that the standard view is too limited for bandit and RL problems. To allow our analysis to be interpreted in light of multi-step MDPs, we focus on techniques derived from stochastic optimization principles (e.g., natural policy gradient and EXP3) and we show that some standard assumptions from optimization theory are violated in these problems. We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics and that seemingly innocuous choices like the baseline can determine whether an algorithm converges. These theoretical findings match our empirical evaluation, which we extend to multi-state MDPs.
  2. ICLR-4
    Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning
    R. Agarwal, M. C. Machado, P. S. Castro, and M. G. Bellemare
    In International Conference on Learning Representations (ICLR) [Spotlight], 2021
    @inproceedings{agarwal2021contrastive, author = {R. Agarwal and Marlos C. Machado and Pablo S. Castro and Marc G. Bellemare}, title = {Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2021} }
    Reinforcement learning methods trained on few environments rarely learn policies that generalize to unseen environments. To improve generalization, we incorporate the inherent sequential structure in reinforcement learning into the representation learning process. This approach is orthogonal to recent approaches, which rarely exploit this structure explicitly. Specifically, we introduce a theoretically motivated policy similarity metric (PSM) for measuring behavioral similarity between states. PSM assigns high similarity to states for which the optimal policies in those states as well as in future states are similar. We also present a contrastive representation learning procedure to embed any state similarity metric, which we instantiate with PSM to obtain policy similarity embeddings (PSEs). We demonstrate that PSEs improve generalization on diverse benchmarks, including LQR with spurious correlations, a jumping task from pixels, and Distracting DM Control Suite.
2020
  1. Nature-1
    Autonomous Navigation of Stratospheric Balloons using Reinforcement Learning
    [Alph. order] M. G. Bellemare, S. Candido, P. S. Castro, J. Gong, M. C. Machado, S. Moitra, S. Ponda, and Z. Wang
    Nature, 588, pp. 77-82, 2020
    @article{bellemare2020autonomous, title = {Autonomous Navigation of Stratospheric Balloons using Reinforcement Learning}, author = {Marc G. Bellemare and Salvatore Candido and Pablo S. Castro and Jun Gong and Marlos C. Machado and Subhodeep Moitra and Sameera Ponda and Ziyu Wang}, year = {2020}, volume = {588}, pages = {77--82}, journal = {Nature} }
    Efficiently navigating a superpressure balloon in the stratosphere requires the integration of a multitude of cues, such as wind speed and solar elevation, and the process is complicated by forecast errors and sparse wind measurements. Coupled with the need to make decisions in real time, these factors rule out the use of conventional control techniques. Here we describe the use of reinforcement learning to create a high-performing flight controller. Our algorithm uses data augmentation and a self-correcting design to overcome the key technical challenge of reinforcement learning from imperfect data, which has proved to be a major obstacle to its application to physical systems. We deployed our controller to station Loon superpressure balloons at multiple locations across the globe, including a 39-day controlled experiment over the Pacific Ocean. Analyses show that the controller outperforms Loon’s previous algorithm and is robust to the natural diversity in stratospheric winds. These results demonstrate that reinforcement learning is an effective solution to real-world autonomous control problems in which neither conventional methods nor human intervention suffice, offering clues about what may be needed to create artificially intelligent agents that continuously interact with real, dynamic environments.
  2. NeurIPS-1
    An Operator View of Policy Gradient Methods
    D. Ghosh, M. C. Machado, and N. Le Roux
    In Neural Information Processing Systems (NeurIPS), 2020
    @inproceedings{ghosh2020operator, author = {Dibya Ghosh and Marlos C. Machado and Nicolas Le Roux}, title = {An Operator View of Policy Gradient Methods}, booktitle = {Neural Information Processing Systems (NeurIPS)}, year = {2020} }
    We cast policy gradient methods as the repeated application of two operators: a policy improvement operator $\mathcal{I}$, which maps any policy $\pi$ to a better one $\mathcal{I}\pi$, and a projection operator $\mathcal{P}$, which finds the best approximation of $\mathcal{I}\pi$ in the set of realizable policies. We use this framework to introduce operator-based versions of traditional policy gradient methods such as REINFORCE and PPO, which leads to a better understanding of their original counterparts. We also use the understanding we develop of the role of $\mathcal{I}$ and $\mathcal{P}$ to propose a new global lower bound of the expected return. This new perspective allows us to further bridge the gap between policy-based and value-based methods, showing how REINFORCE and the Bellman optimality operator, for example, can be seen as two sides of the same coin.
  3. AAAI-1
    Count-Based Exploration with the Successor Representation
    M. C. Machado, M. G. Bellemare, and M. Bowling
    In AAAI Conference on Artificial Intelligence (AAAI), 2020
    @inproceedings{machado2020counts, author = {Marlos C. Machado and Marc G. Bellemare and Michael Bowling}, title = {Count-Based Exploration with the Successor Representation}, booktitle = {AAAI Conference on Artificial Intelligence (AAAI)}, year = {2020} }
    In this paper we introduce a simple approach for exploration in reinforcement learning (RL) that allows us to develop theoretically justified algorithms in the tabular case but that is also extendable to settings where function approximation is required. Our approach is based on the successor representation (SR), which was originally introduced as a representation defining state generalization by the similarity of successor states. Here we show that the norm of the SR, while it is being learned, can be used as a reward bonus to incentivize exploration. In order to better understand this transient behavior of the norm of the SR we introduce the substochastic successor representation (SSR) and we show that it implicitly counts the number of times each state (or feature) has been observed. We use this result to introduce an algorithm that performs as well as some theoretically sample-efficient approaches. Finally, we extend these ideas to a deep RL algorithm and show that it achieves state-of-the-art performance in Atari 2600 games when in a low sample-complexity regime.
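    The idea in this abstract can be illustrated with a minimal tabular sketch. It is not the paper's implementation: the function names, step size, discount, the use of the 1-norm, the inverse form of the bonus, and the small `eps` guard are all illustrative assumptions.

```python
import numpy as np

def sr_td_update(sr, s, s_next, step_size=0.1, gamma=0.95):
    """Apply one TD update to row s of a tabular successor representation."""
    one_hot = np.zeros(sr.shape[0])
    one_hot[s] = 1.0
    sr[s] += step_size * (one_hot + gamma * sr[s_next] - sr[s])
    return sr

def exploration_bonus(sr, s, beta=1.0, eps=1e-8):
    """Assumed bonus form: inversely proportional to the 1-norm of the SR row.

    Rows start at zero, so the norm grows as a state is visited
    and the bonus shrinks accordingly.
    """
    return beta / (np.linalg.norm(sr[s], ord=1) + eps)

# Hypothetical 3-state example: state 0 is visited often, state 2 once.
sr = np.zeros((3, 3))
for _ in range(10):
    sr_td_update(sr, s=0, s_next=1)
sr_td_update(sr, s=2, s_next=1)

print(exploration_bonus(sr, 0), exploration_bonus(sr, 2))
```

    In this toy run the rarely visited state receives the larger bonus, which is the count-like behavior the abstract describes.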
  4. ICLR-3
    Exploration in Reinforcement Learning with Deep Covering Options
    Y. Jinnai, J. W. Park, M. C. Machado, and G. Konidaris
    In International Conference on Learning Representations (ICLR), 2020
    @inproceedings{jinnai2020exploration, author = {Yuu Jinnai and Jee W. Park and Marlos C. Machado and George Konidaris}, title = {Exploration in Reinforcement Learning with Deep Covering Options}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2020} }
    While many option discovery methods have been proposed to accelerate exploration in reinforcement learning, they are often heuristic. Recently, covering options was proposed to discover a set of options that provably reduce the upper bound of the environment's cover time, a measure of the difficulty of exploration. Covering options are computed using the eigenvectors of the graph Laplacian, but they are constrained to tabular tasks and are not applicable to tasks with large or continuous state-spaces. We introduce deep covering options, an online method that extends covering options to large state spaces, automatically discovering task-agnostic options that encourage exploration. We evaluate our method in several challenging sparse-reward domains and we show that our approach identifies less explored regions of the state-space and successfully generates options to visit these regions, substantially improving both the exploration and the total accumulated reward.
  5. ICLR-2
    On Bonus Based Exploration Methods In The Arcade Learning Environment
    A. A. Taiga, W. Fedus, M. C. Machado, A. Courville, and M. G. Bellemare
    In International Conference on Learning Representations (ICLR), 2020
    @inproceedings{taiga2020bonus, author = {Adrien Ali Taiga and William Fedus and Marlos C. Machado and Aaron Courville and Marc G. Bellemare}, title = {On Bonus Based Exploration Methods In The Arcade Learning Environment}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2020} }
    Research on exploration in reinforcement learning, as applied to Atari 2600 game-playing, has emphasized tackling difficult exploration problems such as Montezuma's Revenge (Bellemare et al., 2016). Recently, bonus-based exploration methods, which explore by augmenting the environment reward, have reached above-human average performance on such domains. In this paper we reassess popular bonus-based exploration methods within a common evaluation framework. We combine Rainbow (Hessel et al., 2018) with different exploration bonuses and evaluate its performance on Montezuma's Revenge, Bellemare et al.'s set of hard exploration games with sparse rewards, and the whole Atari 2600 suite. We find that while exploration bonuses lead to a higher score on Montezuma's Revenge, they do not provide meaningful gains over the simpler epsilon-greedy scheme. In fact, we find that methods that perform best on that game often underperform epsilon-greedy on easy exploration Atari 2600 games. We find that our conclusions remain valid even when hyperparameters are tuned for these easy-exploration games. Finally, we find that none of the methods surveyed benefit from additional training samples (1 billion frames, versus Rainbow's 200 million) on Bellemare et al.'s hard exploration games. Our results suggest that recent gains in Montezuma's Revenge may be better attributed to architecture change, rather than better exploration schemes; and that the real pace of progress in exploration research for Atari 2600 games may have been obfuscated by good results on a single domain.
2019
  1. ICML-W-3
    Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment [Best paper award]
    A. A. Taiga, W. Fedus, M. C. Machado, A. Courville, and M. G. Bellemare
    In ICML Workshop on Exploration in Reinforcement Learning, 2019
    Longer version was published at ICLR’20.
    @inproceedings{taiga2019benchmarking, title = {Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment}, author = {Adrien A. Taiga and William Fedus and Marlos C. Machado and Aaron Courville and Marc G. Bellemare}, booktitle = {ICML Workshop on Exploration in Reinforcement Learning}, year = {2019} }
    This paper provides an empirical evaluation of recently developed exploration algorithms within the Arcade Learning Environment (ALE). We study the use of different reward bonuses that incentivize exploration in reinforcement learning. We do so by fixing the learning algorithm used and focusing only on the impact of the different exploration bonuses in the agent's performance. We use Rainbow, the state-of-the-art algorithm for value-based agents, and focus on some of the bonuses proposed in the last few years. We consider the impact these algorithms have on performance within the popular game Montezuma's Revenge, which has gathered a lot of interest from the exploration community, across the set of seven games identified by Bellemare et al. (2016) as challenging for exploration, and easier games where exploration is not an issue. We find that, in our setting, recently developed bonuses do not provide significantly improved performance on Montezuma's Revenge or hard exploration games. We also find that existing bonus-based methods may negatively impact performance on games in which exploration is not an issue and may even perform worse than ϵ-greedy exploration.
  2. PhD-1
    Efficient Exploration in Reinforcement Learning through Time-Based Representations
    M. C. Machado
    Ph.D. Thesis, University of Alberta, 2019
    @phdthesis{machado2019efficient, author = {Marlos C. Machado}, title = {Efficient Exploration in Reinforcement Learning through Time-Based Representations}, school = {University of Alberta}, year = {2019} }
    In the reinforcement learning (RL) problem an agent must learn how to act optimally through trial-and-error interactions with a complex, unknown, stochastic environment. The actions taken by the agent influence not just the immediate reward it observes but also the future states and rewards it will observe, implicitly requiring the agent to deal with the trade-off between short-term and long-term consequences. In this context, the problem of exploration is the problem of selecting appropriate actions to explore the state space to gather information while taking this trade-off into consideration. In this dissertation I advocate that agents' exploration strategy can be guided by the process of representation learning. I support this claim by introducing different exploration approaches for RL algorithms that are applicable to complex environments with sparse rewards. They all use learned time-based representations, state representations that capture the temporal aspect of RL problems, implicitly encoding the temporal proximity of states. The two instantiations of time-based representations I use are proto-value functions (PVFs) and the successor representation (SR). The first approaches I introduce are based on the idea of option-based exploration. Option-based exploration hinges on the assumption that an agent that exhibits purposeful behavior is more likely to visit states that are far from its current state than an agent that randomly selects actions at every time step. I model this purposefulness through options, which, in reinforcement learning, represent temporally extended courses of actions over different time scales. I then introduce algorithms capable of discovering options autonomously through PVFs and the SR. I also introduce count-based exploration approaches, which are based on the idea of keeping state visitation counts to ensure all states (or abstractions of a state) are visited a proper number of times. I show that the norm of the SR, while it is being learned, incorporates state visitation counts and I use this result to introduce RL algorithms that achieve state-of-the-art results in large domains that require function approximation. I evaluate my algorithms in both tabular domains and Atari 2600 games. I use tabular domains such as the 4-room domain, RiverSwim, and SixArms in order to develop a better intuition about the proposed algorithms and to compare the proposed approaches to classic baselines in the field. I use Atari 2600 games to evaluate the scalability and generality of the proposed approaches since the state space of Atari 2600 games is too large, requiring function approximation. I discuss approaches based on linear and non-linear function approximation.
2018
  1. JAIR-1
    Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents
    M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling
    Journal of Artificial Intelligence Research (JAIR), 61, pp. 523-562, 2018
    @article{machado2018revisiting, title = {Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents}, author = {Marlos C. Machado and Marc G. Bellemare and Erik Talvitie and Joel Veness and Matthew Hausknecht and Michael Bowling}, year = {2018}, volume = {61}, pages = {523--562}, journal = {Journal of Artificial Intelligence Research (JAIR)} }
    The Arcade Learning Environment (ALE) is an evaluation platform that poses the challenge of building AI agents with general competency across dozens of Atari 2600 games. It supports a variety of different problem settings and it has been receiving increasing attention from the scientific community, leading to some high-profile success stories such as the much publicized Deep Q-Networks (DQN). In this article we take a big picture look at how the ALE is being used by the research community. We show how diverse the evaluation methodologies in the ALE have become with time, and highlight some key concerns when evaluating agents in the ALE. We use this discussion to present some methodological best practices and provide new benchmark results using these best practices. To further the progress in the field, we introduce a new version of the ALE that supports multiple game modes and provides a form of stochasticity we call sticky actions. We conclude this big picture look by revisiting challenges posed when the ALE was introduced, summarizing the state-of-the-art in various problems and highlighting problems that remain open.
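    As a rough illustration of the sticky-actions idea mentioned in this abstract, the sketch below repeats the previously executed action with some probability instead of the action the agent just chose. The class name, its interface, and the default probability of 0.25 are assumptions made for this example; this is not the ALE's actual code.

```python
import random

class StickyActions:
    """Hypothetical wrapper: with probability repeat_prob, ignore the agent's
    new action and execute the previously executed action again."""

    def __init__(self, repeat_prob=0.25, seed=None):
        self.repeat_prob = repeat_prob
        self.rng = random.Random(seed)
        self.prev_action = None

    def select(self, agent_action):
        # The first step has no previous action, so the agent's choice is used.
        if self.prev_action is not None and self.rng.random() < self.repeat_prob:
            executed = self.prev_action
        else:
            executed = agent_action
        self.prev_action = executed
        return executed

# Usage: filter a sequence of agent actions through the wrapper.
sticky = StickyActions(repeat_prob=0.25, seed=0)
executed = [sticky.select(a) for a in [0, 1, 2, 3, 4]]
print(executed)
```

    Because the executed action is not always the chosen one, an agent cannot succeed simply by replaying a memorized action sequence.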
  2. ICLR-1
    Eigenoption Discovery through the Deep Successor Representation
    M. C. Machado, C. Rosenbaum, X. Guo, M. Liu, G. Tesauro, and M. Campbell
    In International Conference on Learning Representations (ICLR), 2018
    @inproceedings{machado2018eigenoption, author = {Marlos C. Machado and Clemens Rosenbaum and Xiaoxiao Guo and Miao Liu and Gerald Tesauro and Murray Campbell}, title = {Eigenoption Discovery through the Deep Successor Representation}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2018} }
    Options in reinforcement learning allow agents to hierarchically decompose a task into subtasks, having the potential to speed up learning and planning. However, autonomously learning effective sets of options is still a major challenge in the field. In this paper we focus on the recently introduced idea of using representation learning methods to guide the option discovery process. Specifically, we look at eigenoptions, options obtained from representations that encode diffusive information flow in the environment. We extend the existing algorithms for eigenoption discovery to settings with stochastic transitions and in which handcrafted features are not available. We propose an algorithm that discovers eigenoptions while learning non-linear state representations from raw pixels. It exploits recent successes in the deep reinforcement learning literature and the equivalence between proto-value functions and the successor representation. We use traditional tabular domains to provide intuition about our approach and Atari 2600 games to demonstrate its potential.
  3. IROS-1
    Accelerating Learning in Constructive Predictive Frameworks with the Successor Representation
    C. Sherstan, M. C. Machado, and P. Pilarski
    In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018
    @inproceedings{sherstan2018accelerating, author = {Craig Sherstan and Marlos C. Machado and Patrick M. Pilarski}, title = {Accelerating Learning in Constructive Predictive Frameworks with the Successor Representation}, booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, year = {2018} }
    We propose using the Successor Representation (SR) to accelerate learning in a constructive knowledge system based on General Value Functions (GVFs). In real-world settings, like robotics for unstructured and dynamic environments, it is impossible to model all meaningful aspects of a system and its environment by hand. Instead, robots must learn and adapt to changes in their environment and task, incrementally constructing models from their own experience. GVFs, taken from the field of reinforcement learning (RL), are a way of modeling the world as predictive questions. One approach to such models proposes a massive network of interconnected and interdependent GVFs, which are incrementally added over time. It is reasonable to expect that new, incrementally added predictions can be learned more swiftly if the learning process leverages knowledge gained from past experience. The SR provides a means of capturing regularities that can be reused across multiple GVFs by separating the dynamics of the world from the prediction targets. As a primary contribution of this work, we show that using the SR can improve sample efficiency and learning speed of GVFs in a continual learning setting where new predictions are incrementally added and learned over time. We analyze our approach in a grid-world and then demonstrate its potential on data from a physical robot arm.
  4. ICML-W-2
    Count-Based Exploration with the Successor Representation [Best paper award]
    M. C. Machado, M. G. Bellemare, and M. Bowling
    In ICML Workshop on Exploration in Reinforcement Learning, 2018
    Longer version was published at AAAI’20.
    @inproceedings{machado2020counts, author = {Marlos C. Machado and Marc G. Bellemare and Michael Bowling}, title = {Count-Based Exploration with the Successor Representation}, booktitle = {AAAI Conference on Artificial Intelligence (AAAI)}, year = {2020} }
    In this paper we introduce a simple approach for exploration in reinforcement learning (RL) that allows us to develop theoretically justified algorithms in the tabular case but that is also extendable to settings where function approximation is required. Our approach is based on the successor representation (SR), which was originally introduced as a representation defining state generalization by the similarity of successor states. Here we show that the norm of the SR, while it is being learned, can be used as a reward bonus to incentivize exploration. In order to better understand this transient behavior of the norm of the SR we introduce the substochastic successor representation (SSR) and we show that it implicitly counts the number of times each state (or feature) has been observed. We use this result to introduce an algorithm that performs as well as some theoretically sample-efficient approaches. Finally, we extend these ideas to a deep RL algorithm and show that it achieves state-of-the-art performance in Atari 2600 games when in a low sample-complexity regime.
  5. NeurIPS-W-2
    Generalization and Regularization in DQN
    J. Farebrother, M. C. Machado, and M. Bowling
    In NeurIPS Deep Reinforcement Learning Workshop, 2018
    @inproceedings{farebrother2018generalization, title = {Generalization and Regularization in DQN}, author = {Jesse Farebrother and Marlos C. Machado and Michael Bowling}, booktitle = {NeurIPS Deep Reinforcement Learning Workshop}, year = {2018} }
    Deep reinforcement learning algorithms have shown an impressive ability to learn complex control policies in high-dimensional tasks. However, despite the ever-increasing performance on popular benchmarks, policies learned by deep reinforcement learning algorithms can struggle to generalize when evaluated in remarkably similar environments. In this paper we propose a protocol to evaluate generalization in reinforcement learning through different modes of Atari 2600 games. With that protocol we assess the generalization capabilities of DQN, one of the most traditional deep reinforcement learning algorithms, and we provide evidence suggesting that DQN overspecializes to the training environment. We then comprehensively evaluate the impact of dropout and ℓ2 regularization, as well as the impact of reusing learned representations to improve the generalization capabilities of DQN. Despite regularization being largely underutilized in deep reinforcement learning, we show that it can, in fact, help DQN learn more general features. These features can be reused and fine-tuned on similar tasks, considerably improving DQN's sample efficiency.
  6. IJCAI*-1
    Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents (Extended Abstract) [Invited paper]
    M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling
    In International Joint Conference on Artificial Intelligence (IJCAI), 2018
    @inproceedings{machado2018revisiting-ijcai, title = {Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents (Extended Abstract)}, author = {Marlos C. Machado and Marc G. Bellemare and Erik Talvitie and Joel Veness and Matthew Hausknecht and Michael Bowling}, booktitle = {International Joint Conference on Artificial Intelligence (IJCAI)}, year = {2018} }
    The Arcade Learning Environment (ALE) is an evaluation platform that poses the challenge of building AI agents with general competency across dozens of Atari 2600 games. It supports a variety of different problem settings and it has been receiving increasing attention from the scientific community. In this paper we take a big picture look at how the ALE is being used by the research community. We focus on how diverse the evaluation methodologies in the ALE have become and we highlight some key concerns when evaluating agents in this platform. We use this discussion to present what we consider to be the best practices for future evaluations in the ALE. To further the progress in the field, we also introduce a new version of the ALE that supports multiple game modes and provides a form of stochasticity we call sticky actions.
2017
  1. ICML-1
    A Laplacian Framework for Option Discovery in Reinforcement Learning
    M. C. Machado, M. G. Bellemare, and M. Bowling
    In International Conference on Machine Learning (ICML), 2017
    @inproceedings{machado2017laplacian, author = {Marlos C. Machado and Marc G. Bellemare and Michael Bowling}, title = {A Laplacian Framework for Option Discovery in Reinforcement Learning}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2017} }
    Representation learning and option discovery are two of the biggest challenges in reinforcement learning (RL). Proto-value functions (PVFs) are a well-known approach for representation learning in MDPs. In this paper we address the option discovery problem by showing how PVFs implicitly define options. We do it by introducing eigenpurposes, intrinsic reward functions derived from the learned representations. The options discovered from eigenpurposes traverse the principal directions of the state space. They are useful for multiple tasks because they are discovered without taking the environment’s rewards into consideration. Moreover, different options act at different time scales, making them helpful for exploration. We demonstrate features of eigenpurposes in traditional tabular domains as well as in Atari 2600 games.
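    A small sketch of how an eigenpurpose could be computed in a tabular setting follows. It assumes one-hot state features, the combinatorial graph Laplacian L = D - A, and a hypothetical 5-state chain; the function names are illustrative and not taken from the paper's code.

```python
import numpy as np

def chain_laplacian(n):
    """Combinatorial Laplacian L = D - A of an n-state chain (path graph)."""
    adj = np.zeros((n, n))
    for i in range(n - 1):
        adj[i, i + 1] = adj[i + 1, i] = 1.0
    return np.diag(adj.sum(axis=1)) - adj

def eigenpurpose_reward(e, s, s_next):
    """Intrinsic reward e^T (phi(s') - phi(s)); with one-hot features this
    reduces to e[s'] - e[s]."""
    return e[s_next] - e[s]

lap = chain_laplacian(5)
eigvals, eigvecs = np.linalg.eigh(lap)  # eigenvalues in ascending order
e = eigvecs[:, 1]                       # smoothest non-constant eigenvector

# Rewards for stepping right along the chain all share one sign, so an option
# maximizing this reward walks toward one end of the chain.
print([eigenpurpose_reward(e, s, s + 1) for s in range(4)])
```

    The sign of an eigenvector is arbitrary, so both e and -e define an eigenpurpose, each pointing toward a different end of the chain.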
  2. NeurIPS-W-1
    The Eigenoption-Critic Framework
    M. Liu, M. C. Machado, G. Tesauro, and M. Campbell
    In NeurIPS Hierarchical RL Workshop, 2017
    @inproceedings{liu2017eigenoption-critic, title = {The Eigenoption-Critic Framework}, author = {Miao Liu and Marlos C. Machado and Gerald Tesauro and Murray Campbell}, booktitle = {NeurIPS Hierarchical RL Workshop}, year = {2017} }
    Eigenoptions (EOs) have been recently introduced as a promising idea for generating a diverse set of options through the graph Laplacian, having been shown to allow efficient exploration. Despite their initial promising results, several issues in current algorithms limit their application, namely: (1) EO methods require two separate steps (eigenoption discovery and reward maximization) to learn a control policy, which can incur a significant amount of storage and computation; (2) EOs are only defined for problems with discrete state-spaces; and (3) it is not easy to take the environment's reward function into consideration when discovering EOs. To address these issues, we introduce an algorithm termed eigenoption-critic (EOC) based on the Option-critic (OC) framework [Bacon17], a general hierarchical reinforcement learning (RL) algorithm that allows learning the intra-option policies simultaneously with the policy over options. We also propose a generalization of EOC to problems with continuous state-spaces through the Nyström approximation. EOC can also be seen as extending OC to nonstationary settings, where the discovered options are not tailored for a single task.
2016
  1. JMLR-1
    True Online Temporal-Difference Learning
    H. van Seijen, A. R. Mahmood, P. Pilarski, M. C. Machado, and R. Sutton
    Journal of Machine Learning Research (JMLR), 17(145), pp. 1-40, 2016
    @article{seijen2016true, title = {True Online Temporal-Difference Learning}, author = {Harm van Seijen and A. Rupam Mahmood and Patrick M. Pilarski and Marlos C. Machado and Richard S. Sutton}, year = {2016}, volume = {17}, number = {145}, pages = {1--40}, journal = {Journal of Machine Learning Research (JMLR)} }
    The temporal-difference methods TD(λ) and Sarsa(λ) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, new versions of these methods were introduced, called true online TD(λ) and true online Sarsa(λ), respectively (van Seijen & Sutton, 2014). Algorithmically, these true online methods only make two small changes to the update rules of the regular methods, and the extra computational cost is negligible in most cases. However, they follow the ideas underlying the forward view much more closely. In particular, they maintain an exact equivalence with the forward view at all times, whereas the traditional versions only approximate it for small step-sizes. We hypothesize that these true online methods not only have better theoretical properties, but also dominate the regular methods empirically. In this article, we put this hypothesis to the test by performing an extensive empirical comparison. Specifically, we compare the performance of true online TD(λ)/Sarsa(λ) with regular TD(λ)/Sarsa(λ) on random MRPs, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment. We use linear function approximation with tabular, binary, and non-binary features. Our results suggest that the true online methods indeed dominate the regular methods. Across all domains/representations the learning speed of the true online methods is often better, but never worse, than that of the regular methods. An additional advantage is that no choice between traces has to be made for the true online methods. Besides the empirical results, we provide an in-depth analysis of the theory behind true online temporal-difference learning. In addition, we show that new true online temporal-difference methods can be derived by making changes to the online forward view and then rewriting the update equations.
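    For reference, here is a compact sketch of true online TD(λ) with linear function approximation, following the commonly stated form of the update: a modified (dutch-style) trace plus a correction term in the weight update. The function name, the transition format, and the default hyperparameters are assumptions made for this example.

```python
import numpy as np

def true_online_td_episode(transitions, w, alpha=0.1, gamma=0.99, lam=0.9):
    """Run true online TD(lambda) over one episode and return updated weights.

    transitions: list of (x, reward, x_next) tuples of feature vectors, where
    x_next is the zero vector on the terminal transition (assumed format).
    """
    z = np.zeros_like(w)  # eligibility trace
    v_old = 0.0
    for x, reward, x_next in transitions:
        v = w @ x
        v_next = w @ x_next
        delta = reward + gamma * v_next - v
        # Change 1: the trace update includes a term depending on z^T x.
        z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x)) * x
        # Change 2: the weight update includes a correction using v - v_old.
        w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
        v_old = v_next
    return w

# Hypothetical one-state task: a single step with reward 1, then termination.
x = np.array([1.0])
terminal = np.array([0.0])
w = np.zeros(1)
for _ in range(200):
    w = true_online_td_episode([(x, 1.0, terminal)], w)
print(w)
```

    In this one-step task the update reduces to ordinary TD, so the learned value approaches the true return of 1.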
  2. AAMAS-1
    State of the Art Control of Atari Games Using Shallow Reinforcement Learning [Best paper nominee]
    Y. Liang, M. C. Machado, E. Talvitie, and M. Bowling
    In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2016
    @inproceedings{liang16state, author = {Yitao Liang and Marlos C. Machado and Erik Talvitie and Michael H. Bowling}, title = {State of the Art Control of Atari Games Using Shallow Reinforcement Learning}, booktitle = {International Conference on Autonomous Agents and Multiagent Systems (AAMAS)}, year = {2016} }
    The recently introduced Deep Q-Networks (DQN) algorithm has gained attention as one of the first successful combinations of deep neural networks and reinforcement learning. Its promise was demonstrated in the Arcade Learning Environment (ALE), a challenging framework composed of dozens of Atari 2600 games used to evaluate general competency in AI. It achieved dramatically better results than earlier approaches, showing that its ability to learn good representations is quite robust and general. This paper attempts to understand the principles that underlie DQN's impressive performance and to better contextualize its success. We systematically evaluate the importance of key representational biases encoded by DQN's network by proposing simple linear representations that make use of these concepts. Incorporating these characteristics, we obtain a computationally practical feature set that achieves performance competitive with DQN in the ALE. Besides offering insight into the strengths and weaknesses of DQN, we provide a generic representation for the ALE, significantly reducing the burden of learning a representation for each game. Moreover, we also provide a simple, reproducible benchmark for the sake of comparison to future work in the ALE.
  3. AGI-1
    Introspective Agents: Confidence Measures for General Value Functions
    C. Sherstan, A. White, M. C. Machado, and P. Pilarski
    In Conference on Artificial General Intelligence (AGI), 2016
    @inproceedings{sherstan2016introspective, author = {Craig Sherstan and Adam White and Marlos C. Machado and Patrick M. Pilarski}, title = {Introspective Agents: Confidence Measures for General Value Functions}, booktitle = {Conference on Artificial General Intelligence (AGI)}, year = {2016} }
    Agents of general intelligence deployed in real-world scenarios must adapt to ever-changing environmental conditions. While such adaptive agents may leverage engineered knowledge, they will require the capacity to construct and evaluate knowledge themselves from their own experience in a bottom-up, constructivist fashion. This position paper builds on the idea of encoding knowledge as temporally extended predictions through the use of general value functions. Prior work has focused on learning predictions about externally derived signals about a task or environment (e.g. battery level, joint position). Here we advocate that the agent should also predict internally generated signals regarding its own learning process—for example, an agent’s confidence in its learned predictions. Finally, we suggest how such information would be beneficial in creating an introspective agent that is able to learn to make good decisions in a complex, changing world.
  4. ICML-W-1
    Learning Purposeful Behaviour in the Absence of Rewards
    M. C. Machado and M. Bowling
    In ICML Workshop on Abstraction in Reinforcement Learning, 2016
    @inproceedings{machado2016learning, title = {Learning Purposeful Behaviour in the Absence of Rewards}, author = {Marlos C. Machado and Michael Bowling}, booktitle = {ICML Workshop on Abstraction in Reinforcement Learning}, year = {2016} }
    Artificial intelligence is commonly defined as the ability to achieve goals in the world. In the reinforcement learning framework, goals are encoded as reward functions that guide agent behaviour, and the sum of observed rewards provides a notion of progress. However, some domains have no such reward signal, or have a reward signal so sparse as to appear absent. Without reward feedback, agent behaviour is typically random, often dithering aimlessly and lacking intentionality. In this paper we present an algorithm capable of learning purposeful behaviour in the absence of rewards. The algorithm proceeds by constructing temporally extended actions (options), through the identification of purposes that are "just out of reach" of the agent's current behaviour. These purposes establish intrinsic goals for the agent to learn, ultimately resulting in a suite of behaviours that encourage the agent to visit different parts of the state space. Moreover, the approach is particularly suited for settings where rewards are very sparse, and such behaviours can help in the exploration of the environment until reward is observed.
2015
  1. AAAI-W-1
    Domain-Independent Optimistic Initialization for Reinforcement Learning
    M. C. Machado, S. Srinivasan, and M. Bowling
    In AAAI Workshop on Learning for General Competency in Video Games, 2015
    @inproceedings{machado2015domain-independent, title = {Domain-Independent Optimistic Initialization for Reinforcement Learning}, author = {Marlos C. Machado and Sriram Srinivasan and Michael Bowling}, booktitle = {AAAI Workshop on Learning for General Competency in Video Games}, year = {2015} }
    In Reinforcement Learning (RL), it is common to use optimistic initialization of value functions to encourage exploration. However, such an approach generally depends on the domain, viz., the scale of the rewards must be known, and the feature representation must have a constant norm. We present a simple approach that performs optimistic initialization with less dependence on the domain.
  2. AIM-1
    Reports from the 2015 AAAI Workshop Program
    [Alph. order] S. V. Albrecht, J. C. Beck, D. L. Buckeridge, A. Botea, C. Caragea, C.-H. Chi, T. Damoulas, B. N. Dilkina, E. Eaton, P. Fazli, S. Ganzfried, M. T. Lindauer, M. C. Machado, Y. Malitsky, G. Marcus, S. Meijer, F. Rossi, A. Shaban-Nejad, S. Thiébaux, M. M. Veloso, T. Walsh, C. Wang, J. Zhang, and Y. Zheng
    AI Magazine, 36(2), pp. 90-101, 2015
    @article{albrecht2015reports, title = {Reports from the 2015 AAAI Workshop Program}, author = {Stefano V. Albrecht and J. Christopher Beck and David L. Buckeridge and Adi Botea and Cornelia Caragea and Chi-Hung Chi and Theodoros Damoulas and Bistra N. Dilkina and Eric Eaton and Pooyan Fazli and Sam Ganzfried and Marius Thomas Lindauer and Marlos C. Machado and Yuri Malitsky and Gary Marcus and Sebastiaan Meijer and Francesca Rossi and Arash Shaban-Nejad and Sylvie Thiébaux and Manuela M. Veloso and Toby Walsh and Can Wang and Jie Zhang and Yu Zheng}, journal = {AI Magazine}, volume = {36}, number = {2}, pages = {90--101}, year = {2015} }
    AAAI's 2015 Workshop Program was held Sunday and Monday, January 25–26, 2015, at the Hyatt Regency Austin Hotel in Austin, Texas, USA. The AAAI-15 workshop program included 15 workshops covering a wide range of topics in artificial intelligence. Most workshops were held on a single day. The titles of the workshops included AI and Ethics; AI for Cities; AI for Transportation: Advice, Interactivity and Actor Modeling; Algorithm Configuration; Artificial Intelligence Applied to Assistive Technologies and Smart Environments; Beyond the Turing Test; Computational Sustainability; Computer Poker and Imperfect Information; Incentive and Trust in E-Communities; Multiagent Interaction without Prior Coordination; Planning, Search, and Optimization; Scholarly Big Data: AI Perspectives, Challenges, and Ideas; Trajectory-Based Behaviour Analytics; World Wide Web and Public Health Intelligence; Knowledge, Skill, and Behavior Transfer in Autonomous Robots; and Learning for General Competency in Video Games.
Before 2015