Publications

Journal articles

RLJ-5

The Cell Must Go On: Agar.io for Continual Reinforcement Learning

M. A. Mohamed, K. Nekhomiazh, V. Vyas, M. M. Jose, A. Patterson, and M. C. Machado

Reinforcement Learning Journal (RLJ), 2026

@article{mohamed2025cell, title = {The Cell Must Go On: Agar.io for Continual Reinforcement Learning}, author = {Mohamed A. Mohamed and Kateryna Nekhomiazh and Vedant Vyas and Marcos M. Jose and Andrew Patterson and Marlos C. Machado}, journal = {Reinforcement Learning Journal (RLJ)}, year = {2026} }

Continual reinforcement learning (RL) concerns agents that are expected to learn continually, rather than converge to a policy that is then fixed for evaluation. This setting is well-suited to environments that the agent perceives as changing over time, rendering any static policy ineffective. In continual RL, researchers often simulate such changes either by modifying episodic environments to incorporate task shifts during interaction or by designing simulators that explicitly model continual dynamics. However, transforming episodic problems into continual ones primarily captures scenarios involving abrupt changes in the data stream and still relies on episodic structure. Meanwhile, the few simulators explicitly designed for empirical continual RL research are often limited in scope or complexity. In this paper, we introduce AgarCL, a research platform for continual RL that enables agents to progress toward increasingly sophisticated behaviour. AgarCL is based on the game Agar.io, a non-episodic, high-dimensional problem with stochastic, ever-evolving dynamics, continuous actions, and partial observability. We provide benchmark results for DQN, PPO, and SAC on the primary continual RL challenge, as well as across a suite of smaller tasks within AgarCL. These smaller tasks isolate aspects of the full environment and allow us to characterize the distinct challenges posed by different components of the game. We further evaluate three continual learning methods (Shrink and Perturb, ReDo, and Continual Backpropagation) and observe little improvement over standard RL algorithms, suggesting that the challenges posed by AgarCL extend beyond the stability-plasticity dilemma.
RLJ-4

An Analysis of Action-Value Temporal-Difference Methods That Learn State Values

B. Daley*, P. Nagarajan*, M. White, and M. C. Machado

Reinforcement Learning Journal (RLJ), 2025

@article{daley2025analysis, title = {An Analysis of Action-Value Temporal-Difference Methods That Learn State Values}, author = {Brett Daley and Prabhat Nagarajan and Martha White and Marlos C. Machado}, journal = {Reinforcement Learning Journal (RLJ)}, year = {2025} }
RLJ-3

Deep Reinforcement Learning with Gradient Eligibility Traces [Best paper award]

E. Elelimy, B. Daley, A. Patterson, M. C. Machado, A. White, and M. White

Reinforcement Learning Journal (RLJ), 2025

@article{elelimy2025deep, title = {Deep Reinforcement Learning with Gradient Eligibility Traces}, author = {Esraa Elelimy and Brett Daley and Andrew Patterson and Marlos C. Machado and Adam White and Martha White}, journal = {Reinforcement Learning Journal (RLJ)}, year = {2025} }
TMLR-2

AGaLiTe: Approximate Gated Linear Transformers for Online Reinforcement Learning

S. Pramanik, E. Elelimy, M. C. Machado, and A. White

Transactions on Machine Learning Research (TMLR), 2024

@article{pramanik2023recurrent, title = {AGaLiTe: Approximate Gated Linear Transformers for Online Reinforcement Learning}, author = {Subhojeet Pramanik and Esraa Elelimy and Marlos C. Machado and Adam White}, journal = {Transactions on Machine Learning Research (TMLR)}, year = {2024} }

In this paper we investigate transformer architectures designed for partially observable online reinforcement learning. The self-attention mechanism in the transformer architecture is capable of capturing long-range dependencies and it is the main reason behind its effectiveness in processing sequential data. Nevertheless, despite their success, transformers have two significant drawbacks that still limit their applicability in online reinforcement learning: (1) to remember all past information, the self-attention mechanism requires the whole history to be provided as context, and (2) inference in transformers is expensive. In this paper, we introduce recurrent alternatives to the transformer self-attention mechanism that offer context-independent inference cost, leverage long-range dependencies effectively, and perform well in online reinforcement learning tasks. We quantify the impact of the different components of our architecture in a diagnostic environment and assess performance gains in 2D and 3D pixel-based partially observable environments (e.g., T-Maze, Mystery Path, Craftax, and Memory Maze). Compared with a state-of-the-art architecture, GTrXL, inference in our approach is at least 40% cheaper while reducing memory use by more than 50%. Our approach performs similarly to or better than GTrXL, improving upon GTrXL performance by more than 37% in harder tasks.
RLJ-2

Harnessing Discrete Representations For Continual Reinforcement Learning

E. Meyer, A. White, and M. C. Machado

Reinforcement Learning Journal (RLJ), 2024

@article{meyer2024harnessing, title = {Harnessing Discrete Representations For Continual Reinforcement Learning}, author = {Edan Meyer and Adam White and Marlos C. Machado}, journal = {Reinforcement Learning Journal (RLJ)}, volume = {2}, pages = {606--628}, year = {2024} }

Reinforcement learning (RL) agents make decisions using nothing but observations from the environment, and consequently, heavily rely on the representations of those observations. Though some recent breakthroughs have used vector-based categorical representations of observations, often referred to as discrete representations, there is little work explicitly assessing the significance of such a choice. In this work, we provide a thorough empirical investigation of the advantages of representing observations as vectors of categorical values within the context of reinforcement learning. We perform evaluations on world-model learning, model-free RL, and ultimately continual RL problems, where the benefits best align with the needs of the problem setting. We find that, when compared to traditional continuous representations, world models learned over discrete representations accurately model more of the world with less capacity, and that agents trained with discrete representations learn better policies with less data. In the context of continual RL, these benefits translate into faster adapting agents. Additionally, our analysis suggests that the observed performance improvements can be attributed to the information contained within the latent vectors and potentially the encoding of the discrete representation itself.
RLJ-1

Demystifying the Recency Heuristic in Temporal-Difference Learning

B. Daley, M. C. Machado, and M. White

Reinforcement Learning Journal (RLJ), 2024

@article{daley2024demystifying, title = {Demystifying the Recency Heuristic in Temporal-Difference Learning}, author = {Brett Daley and Marlos C. Machado and Martha White}, journal = {Reinforcement Learning Journal (RLJ)}, volume = {3}, pages = {1019--1036}, year = {2024} }

The recency heuristic in reinforcement learning is the assumption that stimuli that occurred closer in time to an acquired reward should be more heavily reinforced. The recency heuristic is one of the key assumptions made by TD(λ), which reinforces recent experiences according to an exponentially decaying weighting. In fact, all other widely used return estimators for TD learning, such as n-step returns, satisfy a weaker (i.e., nonmonotonic) recency heuristic. Why is the recency heuristic effective for temporal credit assignment? What happens when credit is assigned in a way that violates this heuristic? In this paper, we analyze the specific mathematical implications of adopting the recency heuristic in TD learning. We prove that any return estimator satisfying this heuristic: 1) is guaranteed to converge to the correct value function, 2) has a relatively fast contraction rate, and 3) has a longer window of effective credit assignment, yet the same worst-case bias and variance. We also give a counterexample where on-policy, tabular TD methods violating the recency heuristic diverge. Our results offer some of the first theoretical evidence that credit assignment based on the recency heuristic facilitates learning.
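The exponentially decaying weighting mentioned in this abstract can be illustrated with a minimal sketch. It assumes accumulating eligibility traces, under which a state visited k steps before the current TD error receives credit (γλ)^k; the function name `trace_credit` is hypothetical and is not taken from the paper.

```python
def trace_credit(gamma: float, lam: float, horizon: int) -> list[float]:
    """Credit given to a state visited k steps before the current TD error.

    Assumes accumulating eligibility traces, for which the credit is
    (gamma * lam) ** k, with k = 0 .. horizon - 1.
    """
    if not (0.0 <= gamma <= 1.0 and 0.0 <= lam <= 1.0):
        raise ValueError("gamma and lam must lie in [0, 1]")
    return [(gamma * lam) ** k for k in range(horizon)]


weights = trace_credit(gamma=0.99, lam=0.9, horizon=5)

# No weight exceeds the one before it: more recent states get more credit.
is_monotone = all(a >= b for a, b in zip(weights, weights[1:]))
```

The weighting is monotonically non-increasing in k, which is the strict form of the recency heuristic; estimators such as n-step returns weight the first n steps and then cut off, which is the weaker form the abstract refers to.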
AIJ-2

Investigating the Properties of Neural Network Representations in Reinforcement Learning

H. Wang, E. Miahi, M. White, M. C. Machado, Z. Abbas, R. Kumaraswamy, V. Liu, and A. White

Artificial Intelligence, 2024

@article{wang2024investigating, title = {Investigating the Properties of Neural Network Representations in Reinforcement Learning}, author = {Han Wang and Erfan Miahi and Martha White and Marlos C. Machado and Zaheer Abbas and Raksha Kumaraswamy and Vincent Liu and Adam White}, journal = {Artificial Intelligence}, year = {2024} }

In this paper we investigate the properties of representations learned by deep reinforcement learning systems. Much of the earlier work in representation learning for reinforcement learning focused on designing fixed-basis architectures to achieve properties thought to be desirable, such as orthogonality and sparsity. In contrast, the idea behind deep reinforcement learning methods is that the agent designer should not encode representational properties, but rather that the data stream should determine the properties of the representation: good representations emerge under appropriate training schemes. In this paper we bring these two perspectives together, empirically investigating the properties of representations that support transfer in reinforcement learning. This analysis allows us to provide novel hypotheses regarding the impact of auxiliary tasks in end-to-end training of non-linear reinforcement learning methods. We introduce and measure six representational properties over more than 25 thousand agent-task settings. We consider DQN agents with convolutional networks in a pixel-based navigation environment. We develop a method to better understand why some representations work better for transfer, through a systematic approach varying task similarity and measuring and correlating representation properties with transfer performance.
MLJ-1

GVFs in the Real World: Making Predictions Online for Water Treatment

M. K. Janjua, H. Shah, M. White, E. Miahi, M. C. Machado, and A. White

Machine Learning, 2023

@article{janjua2023gvfs, title = {GVFs in the Real World: Making Predictions Online for Water Treatment}, author = {Muhammad Kamran Janjua and Haseeb Shah and Martha White and Erfan Miahi and Marlos C. Machado and Adam White}, journal = {Machine Learning}, year = {2023} }

In this paper we investigate the use of reinforcement-learning-based prediction approaches for a real drinking-water treatment plant. Developing such a prediction system is a critical step on the path to optimizing and automating water treatment. Before that, there are many questions to answer about predictability of the data, suitable neural network architectures, how to overcome partial observability, and more. We first describe this dataset, and highlight challenges with seasonality, nonstationarity, partial observability and heterogeneity across sensors and operation modes of the plant. We then describe General Value Function (GVF) predictions (discounted cumulative sums of observations) and highlight why they might be preferable to classical n-step predictions common in time series prediction. We discuss how to use offline data to appropriately pre-train our temporal difference learning (TD) agents that learn these GVF predictions, including how to select hyperparameters for online fine-tuning in deployment. We find that the TD prediction agent obtains an overall lower normalized mean-squared error than the n-step prediction agent. Finally, we show the importance of learning in deployment, by contrasting to a TD agent trained purely offline with no online updating. This final result is one of the first to motivate the importance of adapting predictions in real time, for non-stationary high-volume systems in the real world.
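As a rough illustration of the two prediction targets contrasted in this abstract, the sketch below computes both from a recorded sensor sequence. It assumes the discounted sum starts at the next observation and is truncated at the end of the sequence; the function names are hypothetical, and the paper's exact target definition may differ.

```python
def discounted_sum_targets(signal, gamma):
    """GVF-style target: G_t = signal[t+1] + gamma * G_{t+1}.

    The sum is truncated at the end of the sequence (an assumption), so
    the final entry has nothing left to accumulate and stays at 0.0.
    """
    targets = [0.0] * len(signal)
    g = 0.0
    # Sweep backwards so each target reuses the one that follows it.
    for t in range(len(signal) - 2, -1, -1):
        g = signal[t + 1] + gamma * g
        targets[t] = g
    return targets


def n_step_targets(signal, n):
    """Classical n-step target: the observation n steps ahead, if it exists."""
    return [signal[t + n] if t + n < len(signal) else None
            for t in range(len(signal))]
```

The discounted target summarizes the whole future of the signal with weights that shrink geometrically, whereas the n-step target looks at a single future point.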
AIJ-1

Reward-Respecting Subtasks for Model-Based Reinforcement Learning

R. Sutton, M. C. Machado, G. Z. Holland, D. Szepesvari, F. Timbers, B. Tanner, and A. White

Artificial Intelligence (AIJ), 2023

@article{sutton2023reward-respecting, title = {Reward-Respecting Subtasks for Model-Based Reinforcement Learning}, author = {Richard S. Sutton and Marlos C. Machado and G. Zacharias Holland and David Szepesvari and Finbarr Timbers and Brian Tanner and Adam White}, journal = {Artificial Intelligence}, year = {2023} }

To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress in state abstraction, but, although the theory of time abstraction has been extensively developed based on the options framework, in practice options have rarely been used in planning. One reason for this is that the space of possible options is immense and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks such as reaching a bottleneck state, or maximizing a sensory signal other than the reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. The subtasks proposed in most previous work ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option stops. We show that options and option models obtained from such reward-respecting subtasks are much more likely to be useful in planning and can be learned online and off-policy using existing learning algorithms. Reward-respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how the algorithms for learning values, policies, options, and models can be unified using general value functions.
TMLR-1

Agent-State Construction with Auxiliary Inputs

R. Y. Tao, A. White, and M. C. Machado

Transactions on Machine Learning Research (TMLR), 2023

@article{tao2023agent-state, title = {Agent-State Construction with Auxiliary Inputs}, author = {Ruo Yu Tao and Adam White and Marlos C. Machado}, journal = {Transactions on Machine Learning Research (TMLR)}, year = {2023} }

In many, if not all, realistic sequential decision-making tasks, the decision-making agent is not able to model the full complexity of the world. The environment is often much larger and more complex than the agent, a setting also known as partial observability. In such settings, the agent must leverage more than just the current sensory inputs; it must construct an agent state that summarizes previous interactions with the world. Currently, a popular approach for tackling this problem is to learn the agent-state function via a recurrent network from the agent's sensory stream as input. Many impressive reinforcement learning applications have instead relied on environment-specific functions to aid the agent's inputs for history summarization. These augmentations are done in multiple ways, from simple approaches like concatenating observations to more complex ones such as uncertainty estimates. Although ubiquitous in the field, these additional inputs, which we term auxiliary inputs, are rarely emphasized, and it is not clear what their role or impact is. In this work we explore this idea further, and relate these auxiliary inputs to prior classic approaches to state construction. We present a series of examples illustrating the different ways of using auxiliary inputs for reinforcement learning. We show that these auxiliary inputs can be used to discriminate between observations that would otherwise be aliased, leading to more expressive features that smoothly interpolate between different states. Finally, we show that this approach is complementary to state-of-the-art methods such as recurrent neural networks and truncated back-propagation through time, and acts as a heuristic that facilitates longer temporal credit assignment, leading to better performance.
JMLR-2

Temporal Abstraction in Reinforcement Learning with the Successor Representation

M. C. Machado, A. Barreto, D. Precup, and M. Bowling

Journal of Machine Learning Research (JMLR), 24, pp. 1-69, 2023

@article{machado2023temporal, title = {Temporal Abstraction in Reinforcement Learning with the Successor Representation}, author = {Marlos C. Machado and Andre Barreto and Doina Precup and Michael Bowling}, journal = {Journal of Machine Learning Research (JMLR)}, volume = {24}, number = {80}, pages = {1--69}, year = {2023} }

Reasoning at multiple levels of temporal abstraction is one of the key attributes of intelligence. In reinforcement learning, this is often modeled through temporally extended courses of actions called options. Options allow agents to make predictions and to operate at different levels of abstraction within an environment. Nevertheless, approaches based on the options framework often start with the assumption that a reasonable set of options is known beforehand. When this is not the case, there are no definitive answers for which options one should consider. In this paper, we argue that the successor representation, which encodes states based on the pattern of state visitation that follows them, can be seen as a natural substrate for the discovery and use of temporal abstractions. To support our claim, we take a big picture view of recent results, showing how the successor representation can be used to discover options that facilitate either temporally-extended exploration or planning. We cast these results as instantiations of a general framework for option discovery in which the agent’s representation is used to identify useful options, which are then used to further improve its representation. This results in a virtuous, never-ending cycle in which both the representation and the options are constantly refined based on each other. Beyond option discovery itself, we also discuss how the successor representation allows us to augment a set of options into a combinatorially large counterpart without additional learning. This is achieved through the combination of previously learned options. Our empirical evaluation focuses on options discovered for temporally-extended exploration and on the use of the successor representation to combine them. Our results shed light on important design decisions involved in the definition of options and demonstrate the synergy of different methods based on the successor representation, such as eigenoptions and the option keyboard.
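For readers unfamiliar with the successor representation, the tabular case has a standard closed form: for a policy-induced transition matrix P and discount γ < 1, the expected discounted visitation counts are (I − γP)⁻¹. The sketch below is only an illustration of that definition, with a hypothetical two-state chain; it is not code from the paper.

```python
import numpy as np


def successor_representation(P, gamma):
    """Tabular SR: entry (s, s') is the expected discounted number of
    visits to s' when starting from s, i.e. sum_t gamma^t * P^t.

    Assumes P is a row-stochastic transition matrix under a fixed policy
    and 0 <= gamma < 1, so that (I - gamma * P) is invertible.
    """
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P)


# Hypothetical two-state chain that deterministically alternates states.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
sr = successor_representation(P, gamma=0.5)
```

Each row of the result sums to 1 / (1 − γ), since the discounted visits are spread over all states.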
Nature-1

Autonomous Navigation of Stratospheric Balloons using Reinforcement Learning

[Alph. order] M. G. Bellemare, S. Candido, P. S. Castro, J. Gong, M. C. Machado, S. Moitra, S. Ponda, and Z. Wang

Nature, 588, pp. 77-82, 2020

@article{bellemare2020autonomous, title = {Autonomous Navigation of Stratospheric Balloons using Reinforcement Learning}, author = {Marc G. Bellemare and Salvatore Candido and Pablo S. Castro and Jun Gong and Marlos C. Machado and Subhodeep Moitra and Sameera Ponda and Ziyu Wang}, year = {2020}, volume = {588}, pages = {77--82}, journal = {Nature} }

Efficiently navigating a superpressure balloon in the stratosphere requires the integration of a multitude of cues, such as wind speed and solar elevation, and the process is complicated by forecast errors and sparse wind measurements. Coupled with the need to make decisions in real time, these factors rule out the use of conventional control techniques. Here we describe the use of reinforcement learning to create a high-performing flight controller. Our algorithm uses data augmentation and a self-correcting design to overcome the key technical challenge of reinforcement learning from imperfect data, which has proved to be a major obstacle to its application to physical systems. We deployed our controller to station Loon superpressure balloons at multiple locations across the globe, including a 39-day controlled experiment over the Pacific Ocean. Analyses show that the controller outperforms Loon’s previous algorithm and is robust to the natural diversity in stratospheric winds. These results demonstrate that reinforcement learning is an effective solution to real-world autonomous control problems in which neither conventional methods nor human intervention suffice, offering clues about what may be needed to create artificially intelligent agents that continuously interact with real, dynamic environments.
JAIR-1

Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents

M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling

Journal of Artificial Intelligence Research (JAIR), 61, pp. 523-562, 2018

@article{machado2018revisiting, title = {Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents}, author = {Marlos C. Machado and Marc G. Bellemare and Erik Talvitie and Joel Veness and Matthew Hausknecht and Michael Bowling}, year = {2018}, volume = {61}, pages = {523--562}, journal = {Journal of Artificial Intelligence Research (JAIR)} }

The Arcade Learning Environment (ALE) is an evaluation platform that poses the challenge of building AI agents with general competency across dozens of Atari 2600 games. It supports a variety of different problem settings and it has been receiving increasing attention from the scientific community, leading to some high-profile success stories such as the much publicized Deep Q-Networks (DQN). In this article we take a big picture look at how the ALE is being used by the research community. We show how diverse the evaluation methodologies in the ALE have become with time, and highlight some key concerns when evaluating agents in the ALE. We use this discussion to present some methodological best practices and provide new benchmark results using these best practices. To further the progress in the field, we introduce a new version of the ALE that supports multiple game modes and provides a form of stochasticity we call sticky actions. We conclude this big picture look by revisiting challenges posed when the ALE was introduced, summarizing the state-of-the-art in various problems and highlighting problems that remain open.
JMLR-1

True Online Temporal-Difference Learning

H. van Seijen, A. R. Mahmood, P. Pilarski, M. C. Machado, and R. Sutton

Journal of Machine Learning Research (JMLR), 17(145), pp. 1-40, 2016

@article{seijen2016true, title = {True Online Temporal-Difference Learning}, author = {Harm van Seijen and A. Rupam Mahmood and Patrick M. Pilarski and Marlos C. Machado and Richard S. Sutton}, year = {2016}, volume = {17}, number = {145}, pages = {1--40}, journal = {Journal of Machine Learning Research (JMLR)} }

The temporal-difference methods TD(λ) and Sarsa(λ) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, new versions of these methods were introduced, called true online TD(λ) and true online Sarsa(λ), respectively (van Seijen & Sutton, 2014). Algorithmically, these true online methods only make two small changes to the update rules of the regular methods, and the extra computational cost is negligible in most cases. However, they follow the ideas underlying the forward view much more closely. In particular, they maintain an exact equivalence with the forward view at all times, whereas the traditional versions only approximate it for small step-sizes. We hypothesize that these true online methods not only have better theoretical properties, but also dominate the regular methods empirically. In this article, we put this hypothesis to the test by performing an extensive empirical comparison. Specifically, we compare the performance of true online TD(λ)/Sarsa(λ) with regular TD(λ)/Sarsa(λ) on random MRPs, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment. We use linear function approximation with tabular, binary, and non-binary features. Our results suggest that the true online methods indeed dominate the regular methods. Across all domains/representations the learning speed of the true online methods is often better, but never worse than that of the regular methods. An additional advantage is that no choice between traces has to be made for the true online methods. Besides the empirical results, we provide an in-depth analysis of the theory behind true online temporal-difference learning. In addition, we show that new true online temporal-difference methods can be derived by making changes to the online forward view and then rewriting the update equations.
CiE-1

RTSMate: Towards an Advice System for RTS Games

R. L. F. Cunha, M. C. Machado, and L. Chaimowicz

Computers in Entertainment (CiE), 12(1), pp. 1-20, 2014

@article{cunha2014rtsmate, title = {RTSMate: Towards an Advice System for RTS Games}, author = {Renato L. de Freitas Cunha and Marlos C. Machado and Luiz Chaimowicz}, year = {2014}, volume = {12}, number = {1}, pages = {1--20}, journal = {Computers in Entertainment (CiE)} }

Real Time Strategy (RTS) games can be very challenging, especially to novice users, who are normally overwhelmed by the dynamic, distributed, and multi-objective structure of these games. In this paper we present RTSMate, an advice system designed to help the player of an RTS game. Using inference mechanisms to reason about the game state and a decision tree to encode its knowledge, RTSMate helps the player by giving them tactical and strategic tips about the best actions to be taken according to the current game state, aiming at improving the player's performance. This paper describes the main ideas behind the system, its implementation, and the experiments performed using the system in a real game environment. Results show that RTSMate fulfills its objective: most players considered the system useful and were able to improve their performance by using it.

Conference papers

ICML-7

Accelerating Q-learning through Efficient Value-sharing across Actions

P. Nagarajan, B. Daley, M. White, and M. C. Machado

International Conference on Machine Learning (ICML) [Spotlight], 2026

@inproceedings{nagarajan2026ib, title = {Accelerating Q-learning through Efficient Value-sharing across Actions}, author = {Prabhat Nagarajan and Brett Daley and Martha White and Marlos C. Machado}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2026} }

Learning action-values efficiently is central to reinforcement learning (RL), as they underpin many control algorithms such as Q-learning. However, action-value learning can be slow, requiring many updates to move values from their initialization, typically near zero, to their true values, which may be far from zero. Moreover, action-value learning algorithms typically update each state–action pair independently, without learning shared value structure across actions within a state. In this paper, we address these inefficiencies by introducing the mean-expansion transformation, which accelerates action-value learning by sharing values across actions within a state and by changing the problem from directly learning potentially large action-values to learning a lower-norm representation of them. In deep RL, this transformation can be applied as a parameter-free modification to Q-network architectures without altering the underlying algorithm. Empirically, we show that it improves DQN's performance in aggregate across 57 Atari games while increasing action gaps and dramatically reducing value overestimation.
ICML-6

Laplacian Representations for Decision-Time Planning

D. Shehmar, M. Schlegel, M. E. Taylor, and M. C. Machado

International Conference on Machine Learning (ICML), 2026

@inproceedings{shehmar2026alps, title = {Laplacian Representations for Decision-Time Planning}, author = {Dikshant Shehmar and Matthew Schlegel and Matthew E. Taylor and Marlos C. Machado}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2026} }

Planning with a learned model remains a key challenge in model-based reinforcement learning (RL). In decision-time planning, state representations are critical as they must support local cost computation while preserving long-horizon structure. In this paper, we show that the Laplacian representation provides an effective latent space for planning by capturing state-space distances at multiple time scales. This representation preserves meaningful distances and naturally decomposes long-horizon problems into subgoals, also mitigating the compounding errors that arise over long prediction horizons. Building on these properties, we introduce ALPS, a hierarchical planning algorithm, and demonstrate that it outperforms commonly used baselines on a selection of offline goal-conditioned RL tasks from OGBench, a benchmark previously dominated by model-free methods.
NeurIPS-3

Reward-Aware Proto-Representations in Reinforcement Learning

H. T. Tse, S. Chandrasekar, and M. C. Machado

In Neural Information Processing Systems (NeurIPS), 2025

@inproceedings{tse2025reward, title = {Reward-Aware Proto-Representations in Reinforcement Learning}, author = {Hon Tik Tse and Siddarth Chandrasekar and Marlos C. Machado}, booktitle = {Neural Information Processing Systems (NeurIPS)}, year = {2025} }

In recent years, the successor representation (SR) has attracted increasing attention in reinforcement learning (RL), and it has been used to address some of its key challenges, such as exploration, credit assignment, and generalization. The SR can be seen as representing the underlying credit assignment structure of the environment by implicitly encoding its induced transition dynamics. However, the SR is reward-agnostic. In this paper, we discuss a similar representation that also takes into account the reward dynamics of the problem. We study the default representation (DR), a recently proposed representation with limited theoretical (and empirical) analysis. Here, we lay some of the theoretical foundation underlying the DR in the tabular case by (1) deriving dynamic programming and (2) temporal-difference methods to learn the DR, (3) characterizing the basis for the vector space of the DR, and (4) formally extending the DR to the function approximation case through default features. Empirically, we analyze the benefits of the DR in many of the settings in which the SR has been applied, including (1) reward shaping, (2) option discovery, (3) exploration, and (4) transfer learning. Our results show that, compared to the SR, the DR gives rise to qualitatively different, reward-aware behaviour and quantitatively better performance in several settings.
NeurIPS-2

The World Is Bigger: A Computationally-Embedded Perspective on the Big World Hypothesis

A. Lewandowski, A. Ramesh, E. Meyer, D. Schuurmans, and M. C. Machado

In Neural Information Processing Systems (NeurIPS) [Spotlight], 2025

@inproceedings{lewandowski2025world, title = {The World Is Bigger: A Computationally-Embedded Perspective on the Big World Hypothesis}, author = {Alex Lewandowski and Aditya A. Ramesh and Edan Meyer and Dale Schuurmans and Marlos C. Machado}, booktitle = {Neural Information Processing Systems (NeurIPS)}, year = {2025} }

Continual learning is often motivated by the idea, known as the big world hypothesis, that the "world is bigger" than the agent. Recent problem formulations capture this idea by explicitly constraining an agent relative to the environment. These constraints lead to solutions in which the agent continually adapts to best use its limited capacity, rather than converging to a fixed solution. However, explicit constraints can be ad hoc, difficult to incorporate, and limiting to the effectiveness of scaling up the agent's capacity. In this paper, we characterize a problem setting in which an agent, regardless of its capacity, is implicitly constrained by being embedded in the environment. In particular, we introduce a computationally-embedded perspective that represents an embedded agent as an automaton simulated within a universal (formal) computer. We prove that such an automaton is implicitly constrained and that it is equivalent to an agent that interacts with a partially observable Markov decision process over a countably infinite state-space. We then propose an objective for this setting, which we call interactivity, that measures an agent's ability to continually adapt its behaviour and to continually learn new predictions. We develop a reinforcement learning algorithm for maximizing interactivity and a synthetic benchmark for experimentation on continual learning. Our results indicate that deep nonlinear networks struggle to sustain interactivity whereas deep linear networks can achieve higher interactivity as capacity increases.
ICLR-8

Plastic Learning with Deep Fourier Features

A. Lewandowski, D. Schuurmans, and M. C. Machado

International Conference on Learning Representations (ICLR), 2025

@inproceedings{lewandowski2025plastic, title = {Plastic Learning with Deep Fourier Features}, author = {Alex Lewandowski and Dale Schuurmans and Marlos C. Machado}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2025} }

Deep neural networks can struggle to learn continually in the face of non-stationarity. This phenomenon is known as loss of plasticity. In this paper, we identify underlying principles that lead to plastic algorithms. In particular, we provide theoretical results showing that shallow linear networks, as well as a special case of deep linear networks, do not suffer from loss of plasticity. We then propose deep Fourier features, which are the concatenation of a sine and cosine in every layer, and we show that this combination provides a dynamic balance between the trainability obtained through linearity and the effectiveness obtained through the nonlinearity of neural networks. Deep networks composed entirely of deep Fourier features are highly trainable and sustain their trainability over the course of learning. Our empirical results show that continual learning performance can be drastically improved by replacing ReLU activations with deep Fourier features. These results hold for different continual learning scenarios (e.g., label noise, class incremental learning, pixel permutations) on all major supervised learning datasets used for continual learning research, such as CIFAR10, CIFAR100, and tiny-ImageNet.
ICLR-7

Learning Continually by Spectral Regularization

A. Lewandowski, M. Bortkiewicz, S. Kumar, A. György, D. Schuurmans, M. Ostaszewski, and M. C. Machado

International Conference on Learning Representations (ICLR), 2025

@inproceedings{lewandowski2025learning, title = {Learning Continually by Spectral Regularization}, author = {Alex Lewandowski and Micha\l Bortkiewicz and Saurabh Kumar and Andr\'as Gy\"orgy and Dale Schuurmans and Mateusz Ostaszewski and Marlos C. Machado}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2025} }

Loss of plasticity is a phenomenon where neural networks can become more difficult to train over the course of learning. Continual learning algorithms seek to mitigate this effect by sustaining good performance while maintaining network trainability. We develop a new technique for improving continual learning inspired by the observation that the singular values of the neural network parameters at initialization are an important factor for trainability during early phases of learning. From this perspective, we derive a new spectral regularizer for continual learning that better sustains these beneficial initialization properties throughout training. In particular, the regularizer keeps the maximum singular value of each layer close to one. Spectral regularization directly ensures that gradient diversity is maintained throughout training, which promotes continual trainability, while minimally interfering with performance in a single task. We present an experimental analysis that shows how the proposed spectral regularizer can sustain trainability and performance across a range of model architectures in continual supervised and reinforcement learning settings. Spectral regularization is less sensitive to hyperparameters while demonstrating better training in individual tasks, sustaining trainability as new tasks arrive, and achieving better generalization performance.
ICLR-6

MaestroMotif: Skill Design from Artificial Intelligence Feedback

M. Klissarov, M. Henaff, R. Raileanu, S. Sodhani, P. Vincent, A. Zhang, P.-L. Bacon, D. Precup, M. C. Machado, and P. D'Oro

International Conference on Learning Representations (ICLR) [Oral], 2025

@inproceedings{klissarov2025maestromotif, title = {{MaestroMotif}: {S}kill Design from Artificial Intelligence Feedback}, author = {Martin Klissarov and Mikael Henaff and Roberta Raileanu and Shagun Sodhani and Pascal Vincent and Amy Zhang and Pierre-Luc Bacon and Doina Precup and Marlos C. Machado and Pierluca D'Oro}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2025} }

Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM's feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM's code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.
ICML-5

Averaging n-step Returns Reduces Variance in Reinforcement Learning

B. Daley, M. White, and M. C. Machado

International Conference on Machine Learning (ICML), 2024

@inproceedings{daley2024compound, title = {Averaging n-step Returns Reduces Variance in Reinforcement Learning}, author = {Brett Daley and Martha White and Marlos C. Machado}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2024} }

Multistep returns, such as n-step returns and λ-returns, are commonly used to improve the sample efficiency of reinforcement learning (RL) methods. The variance of the multistep returns becomes the limiting factor in their length; looking too far into the future increases variance and reverses the benefits of multistep learning. In our work, we demonstrate the ability of compound returns -- weighted averages of n-step returns -- to reduce variance. We prove for the first time that any compound return with the same contraction modulus as a given n-step return has strictly lower variance. We additionally prove that this variance-reduction property improves the finite-sample complexity of temporal-difference learning under linear function approximation. Because general compound returns can be expensive to implement, we introduce two-bootstrap returns which reduce variance while remaining efficient, even when using minibatched experience replay. We conduct experiments showing that two-bootstrap returns can improve the sample efficiency of n-step deep RL agents, with little additional computational cost.
ICLR-5

Proper Laplacian Representation Learning

D. Gomez, M. Bowling, and M. C. Machado

International Conference on Learning Representations (ICLR), 2024

@inproceedings{gomez2024proper, title = {Proper Laplacian Representation Learning}, author = {Diego Gomez and Michael Bowling and Marlos C. Machado}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2024} }

The ability to learn good representations of states is essential for solving large reinforcement learning problems, where exploration, generalization, and transfer are particularly challenging. The Laplacian representation is a promising approach to address these problems by inducing intrinsic rewards for temporally-extended action discovery and reward shaping, and informative state encoding. To obtain the Laplacian representation one needs to compute the eigensystem of the graph Laplacian, which is often approximated through optimization objectives compatible with deep learning approaches. These approximations, however, depend on hyperparameters that are impossible to tune efficiently, converge to arbitrary rotations of the desired eigenvectors, and are unable to accurately recover the corresponding eigenvalues. In this paper we introduce a theoretically sound objective and corresponding optimization algorithm for approximating the Laplacian representation. Our approach naturally recovers both the true eigenvectors and eigenvalues while eliminating the hyperparameter dependence of previous approximations. We provide theoretical guarantees for our method and we show that those results translate empirically into robust learning across multiple environments.
ICML-4

Deep Laplacian-based Options for Temporally-Extended Exploration

M. Klissarov and M. C. Machado

In International Conference on Machine Learning (ICML), 2023

@inproceedings{klissarov2023deep, title = {Deep Laplacian-based Options for Temporally-Extended Exploration}, author = {Martin Klissarov and Marlos C. Machado}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2023} }

Selecting exploratory actions that generate a rich stream of experience for better learning is a fundamental challenge in reinforcement learning (RL). An approach to tackle this problem consists in selecting actions according to specific policies for an extended period of time, also known as options. A recent line of work to derive such exploratory options builds upon the eigenfunctions of the graph Laplacian. Importantly, until now these methods have been mostly limited to tabular domains where (1) the graph Laplacian matrix was either given or could be fully estimated, (2) performing eigendecomposition on this matrix was computationally tractable, and (3) value functions could be learned exactly. Additionally, these methods required a separate option discovery phase. These assumptions are fundamentally not scalable. In this paper we address these limitations and show how recent results for directly approximating the eigenfunctions of the Laplacian can be leveraged to truly scale up options-based exploration. To do so, we introduce a fully online deep RL algorithm for discovering Laplacian-based options and evaluate our approach on a variety of pixel-based tasks. We compare to several state-of-the-art exploration methods and show that our approach is effective, general, and especially promising in non-stationary settings.
ICML-3

Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning

B. Daley, M. White, C. Amato, and M. C. Machado

In International Conference on Machine Learning (ICML), 2023

@inproceedings{daley2023trajectory-aware, title = {Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning}, author = {Brett Daley and Martha White and Christopher Amato and Marlos C. Machado}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2023} }

Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across λ-values in an off-policy control task.
CoLLAs-1

Loss of Plasticity in Continual Deep Reinforcement Learning

Z. Abbas, R. Zhao, J. Modayil, A. White, and M. C. Machado

In Conference on Lifelong Learning Agents (CoLLAs), 2023

@inproceedings{abbas2023loss, title = {Loss of Plasticity in Continual Deep Reinforcement Learning}, author = {Zaheer Abbas and Rosie Zhao and Joseph Modayil and Adam White and Marlos C. Machado}, booktitle = {Conference on Lifelong Learning Agents (CoLLAs)}, year = {2023} }

The ability to learn continually is essential in a complex and changing world. In this paper, we characterize the behavior of canonical value-based deep reinforcement learning (RL) approaches under varying degrees of non-stationarity. In particular, we demonstrate that deep RL agents lose their ability to learn good policies when they cycle through a sequence of Atari 2600 games. This phenomenon is alluded to in prior work under various guises -- e.g., loss of plasticity, implicit under-parameterization, primacy bias, and capacity loss. We investigate this phenomenon closely at scale and analyze how the weights, gradients, and activations change over time in several experiments with varying dimensions (e.g., similarity between games, number of games, number of frames per game), with some experiments spanning 50 days and 2 billion environment interactions. Our analysis shows that the activation footprint of the network becomes sparser, contributing to the diminishing gradients. We investigate a remarkably simple mitigation strategy -- Concatenated ReLUs (CReLUs) activation function -- and demonstrate its effectiveness in facilitating continual learning in a changing environment.
UAI-1

Temporal Abstractions-Augmented Temporally Contrastive Learning: An Alternative to the Laplacian in RL

A. Erraqabi, M. C. Machado, M. Zhao, S. Sukhbaatar, A. Lazaric, L. Denoyer, and Y. Bengio

In Conference on Uncertainty in Artificial Intelligence (UAI), 2022

@inproceedings{erraqabi2022temporal, author = {Akram Erraqabi and Marlos C. Machado and Mingde Zhao and Sainbayar Sukhbaatar and Alessandro Lazaric and Ludovic Denoyer and Yoshua Bengio}, title = {Temporal Abstractions-Augmented Temporally Contrastive Learning: An Alternative to the Laplacian in RL}, booktitle = {Conference on Uncertainty in Artificial Intelligence (UAI)}, year = {2022} }

In reinforcement learning, the graph Laplacian has proved to be a valuable tool in the task-agnostic setting, with applications ranging from skill discovery to reward shaping. Recently, learning the Laplacian representation has been framed as the optimization of a temporally-contrastive objective to overcome its computational limitations in large (or continuous) state spaces. However, this approach requires uniform access to all states in the state space, overlooking the exploration problem that emerges during the representation learning process. In this work, we propose an alternative method that is able to recover, in a non-uniform-prior setting, the expressiveness and the desired properties of the Laplacian representation. We do so by combining the representation learning with a skill-based covering policy, which provides a better training distribution to extend and refine the representation. We also show that a simple augmentation of the representation objective with the learned temporal abstractions improves dynamics-awareness and helps exploration. We find that our method succeeds as an alternative to the Laplacian in the non-uniform setting and scales to challenging continuous control environments. Finally, even if our method is not optimized for skill discovery, the learned skills can successfully solve difficult continuous navigation tasks with sparse rewards, where standard skill discovery approaches are not so effective.
AISTATS-1

A General Class of Surrogate Functions for Stable and Efficient Reinforcement Learning [Best paper nominee]

S. Vaswani, O. Bachem, S. Totaro, R. Mueller, S. Garg, M. Geist, M. C. Machado, P. S. Castro, and N. Le Roux

In International Conference on Artificial Intelligence and Statistics (AISTATS) [Oral], 2022

@inproceedings{vaswani2022general, author = {Sharan Vaswani and Olivier Bachem and Simone Totaro and Robert Mueller and Shivam Garg and Matthieu Geist and Marlos C. Machado and Pablo Samuel Castro and Nicolas Le Roux}, title = {A General Class of Surrogate Functions for Stable and Efficient Reinforcement Learning}, booktitle = {International Conference on Artificial Intelligence and Statistics (AISTATS)}, year = {2022} }

Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple bandit problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.
ICML-2

Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization

W. Chung*, V. Thomas*, M. C. Machado, and N. Le Roux

In International Conference on Machine Learning (ICML), 2021

@inproceedings{chung2021beyond, author = {Wesley Chung and Valentin Thomas and Marlos C. Machado and Nicolas Le Roux}, title = {Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2021} }

Bandit and reinforcement learning (RL) problems can often be framed as optimization problems where the goal is to maximize average performance while having access only to stochastic estimates of the true gradient. Traditionally, stochastic optimization theory predicts that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates. In this paper we demonstrate that the standard view is too limited for bandit and RL problems. To allow our analysis to be interpreted in light of multi-step MDPs, we focus on techniques derived from stochastic optimization principles (e.g., natural policy gradient and EXP3) and we show that some standard assumptions from optimization theory are violated in these problems. We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics and that seemingly innocuous choices like the baseline can determine whether an algorithm converges. These theoretical findings match our empirical evaluation, which we extend to multi-state MDPs.
ICLR-4

Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning

R. Agarwal, M. C. Machado, P. S. Castro, and M. G. Bellemare

In International Conference on Learning Representations (ICLR) [Spotlight], 2021

@inproceedings{agarwal2021contrastive, author = {Rishabh Agarwal and Marlos C. Machado and Pablo S. Castro and Marc G. Bellemare}, title = {Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2021} }

Reinforcement learning methods trained on few environments rarely learn policies that generalize to unseen environments. To improve generalization, we incorporate the inherent sequential structure in reinforcement learning into the representation learning process. This approach is orthogonal to recent approaches, which rarely exploit this structure explicitly. Specifically, we introduce a theoretically motivated policy similarity metric (PSM) for measuring behavioral similarity between states. PSM assigns high similarity to states for which the optimal policies in those states as well as in future states are similar. We also present a contrastive representation learning procedure to embed any state similarity metric, which we instantiate with PSM to obtain policy similarity embeddings (PSEs). We demonstrate that PSEs improve generalization on diverse benchmarks, including LQR with spurious correlations, a jumping task from pixels, and Distracting DM Control Suite.
NeurIPS-1

An Operator View of Policy Gradient Methods

D. Ghosh, M. C. Machado, and N. Le Roux

In Neural Information Processing Systems (NeurIPS), 2020

@inproceedings{ghosh2020operator, author = {Dibya Ghosh and Marlos C. Machado and Nicolas Le Roux}, title = {An Operator View of Policy Gradient Methods}, booktitle = {Neural Information Processing Systems (NeurIPS)}, year = {2020} }

We cast policy gradient methods as the repeated application of two operators: a policy improvement operator $\mathcal{I}$, which maps any policy $\pi$ to a better one $\mathcal{I}\pi$, and a projection operator $\mathcal{P}$, which finds the best approximation of $\mathcal{I}\pi$ in the set of realizable policies. We use this framework to introduce operator-based versions of traditional policy gradient methods such as REINFORCE and PPO, which leads to a better understanding of their original counterparts. We also use the understanding we develop of the role of $\mathcal{I}$ and $\mathcal{P}$ to propose a new global lower bound of the expected return. This new perspective allows us to further bridge the gap between policy-based and value-based methods, showing how REINFORCE and the Bellman optimality operator, for example, can be seen as two sides of the same coin.
AAAI-1

Count-Based Exploration with the Successor Representation

M. C. Machado, M. G. Bellemare, and M. Bowling

In AAAI Conference on Artificial Intelligence (AAAI), 2020

@inproceedings{machado2020counts, author = {Marlos C. Machado and Marc G. Bellemare and Michael Bowling}, title = {Count-Based Exploration with the Successor Representation}, booktitle = {AAAI Conference on Artificial Intelligence (AAAI)}, year = {2020} }

In this paper we introduce a simple approach for exploration in reinforcement learning (RL) that allows us to develop theoretically justified algorithms in the tabular case but that is also extendable to settings where function approximation is required. Our approach is based on the successor representation (SR), which was originally introduced as a representation defining state generalization by the similarity of successor states. Here we show that the norm of the SR, while it is being learned, can be used as a reward bonus to incentivize exploration. In order to better understand this transient behavior of the norm of the SR we introduce the substochastic successor representation (SSR) and we show that it implicitly counts the number of times each state (or feature) has been observed. We use this result to introduce an algorithm that performs as well as some theoretically sample-efficient approaches. Finally, we extend these ideas to a deep RL algorithm and show that it achieves state-of-the-art performance in Atari 2600 games when in a low sample-complexity regime.
ICLR-3

Exploration in Reinforcement Learning with Deep Covering Options

Y. Jinnai, J. W. Park, M. C. Machado, and G. Konidaris

In International Conference on Learning Representations (ICLR), 2020

@inproceedings{jinnai2020exploration, author = {Yuu Jinnai and Jee W. Park and Marlos C. Machado and George Konidaris}, title = {Exploration in Reinforcement Learning with Deep Covering Options}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2020} }

While many option discovery methods have been proposed to accelerate exploration in reinforcement learning, they are often heuristic. Recently, covering options was proposed to discover a set of options that provably reduce the upper bound of the environment's cover time, a measure of the difficulty of exploration. Covering options are computed using the eigenvectors of the graph Laplacian, but they are constrained to tabular tasks and are not applicable to tasks with large or continuous state-spaces. We introduce deep covering options, an online method that extends covering options to large state spaces, automatically discovering task-agnostic options that encourage exploration. We evaluate our method in several challenging sparse-reward domains and we show that our approach identifies less explored regions of the state-space and successfully generates options to visit these regions, substantially improving both the exploration and the total accumulated reward.
ICLR-2

On Bonus Based Exploration Methods In The Arcade Learning Environment

A. A. Taiga, W. Fedus, M. C. Machado, A. Courville, and M. G. Bellemare

In International Conference on Learning Representations (ICLR), 2020

@inproceedings{taiga2020bonus, author = {Adrien Ali Taiga and William Fedus and Marlos C. Machado and Aaron Courville and Marc G. Bellemare}, title = {On Bonus Based Exploration Methods In The Arcade Learning Environment}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2020} }

Research on exploration in reinforcement learning, as applied to Atari 2600 game-playing, has emphasized tackling difficult exploration problems such as Montezuma's Revenge (Bellemare et al., 2016). Recently, bonus-based exploration methods, which explore by augmenting the environment reward, have reached above-human average performance on such domains. In this paper we reassess popular bonus-based exploration methods within a common evaluation framework. We combine Rainbow (Hessel et al., 2018) with different exploration bonuses and evaluate its performance on Montezuma's Revenge, Bellemare et al.'s set of hard exploration games with sparse rewards, and the whole Atari 2600 suite. We find that while exploration bonuses lead to a higher score on Montezuma's Revenge they do not provide meaningful gains over the simpler epsilon-greedy scheme. In fact, we find that methods that perform best on that game often underperform epsilon-greedy on easy exploration Atari 2600 games. We find that our conclusions remain valid even when hyperparameters are tuned for these easy-exploration games. Finally, we find that none of the methods surveyed benefit from additional training samples (1 billion frames, versus Rainbow's 200 million) on Bellemare et al.'s hard exploration games. Our results suggest that recent gains in Montezuma's Revenge may be better attributed to architecture change, rather than better exploration schemes; and that the real pace of progress in exploration research for Atari 2600 games may have been obfuscated by good results on a single domain.
ICLR-1

Eigenoption Discovery through the Deep Successor Representation

M. C. Machado, C. Rosenbaum, X. Guo, M. Liu, G. Tesauro, and M. Campbell

In International Conference on Learning Representations (ICLR), 2018

@inproceedings{machado2018eigenoption, author = {Marlos C. Machado and Clemens Rosenbaum and Xiaoxiao Guo and Miao Liu and Gerald Tesauro and Murray Campbell}, title = {Eigenoption Discovery through the Deep Successor Representation}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2018} }

Options in reinforcement learning allow agents to hierarchically decompose a task into subtasks, having the potential to speed up learning and planning. However, autonomously learning effective sets of options is still a major challenge in the field. In this paper we focus on the recently introduced idea of using representation learning methods to guide the option discovery process. Specifically, we look at eigenoptions, options obtained from representations that encode diffusive information flow in the environment. We extend the existing algorithms for eigenoption discovery to settings with stochastic transitions and in which handcrafted features are not available. We propose an algorithm that discovers eigenoptions while learning non-linear state representations from raw pixels. It exploits recent successes in the deep reinforcement learning literature and the equivalence between proto-value functions and the successor representation. We use traditional tabular domains to provide intuition about our approach and Atari 2600 games to demonstrate its potential.
IROS-1

Accelerating Learning in Constructive Predictive Frameworks with the Successor Representation

C. Sherstan, M. C. Machado, and P. Pilarski

In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018

@inproceedings{sherstan2018accelerating, author = {Craig Sherstan and Marlos C. Machado and Patrick M. Pilarski}, title = {Accelerating Learning in Constructive Predictive Frameworks with the Successor Representation}, booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, year = {2018} }

We propose using the Successor Representation (SR) to accelerate learning in a constructive knowledge system based on General Value Functions (GVFs). In real-world settings, like robotics for unstructured and dynamic environments, it is impossible to model all meaningful aspects of a system and its environment by hand. Instead, robots must learn and adapt to changes in their environment and task, incrementally constructing models from their own experience. GVFs, taken from the field of reinforcement learning (RL), are a way of modeling the world as predictive questions. One approach to such models proposes a massive network of interconnected and interdependent GVFs, which are incrementally added over time. It is reasonable to expect that new, incrementally added predictions can be learned more swiftly if the learning process leverages knowledge gained from past experience. The SR provides a means of capturing regularities that can be reused across multiple GVFs by separating the dynamics of the world from the prediction targets. As a primary contribution of this work, we show that using the SR can improve sample efficiency and learning speed of GVFs in a continual learning setting where new predictions are incrementally added and learned over time. We analyze our approach in a grid-world and then demonstrate its potential on data from a physical robot arm.
ICML-1

A Laplacian Framework for Option Discovery in Reinforcement Learning

M. C. Machado, M. G. Bellemare, and M. Bowling

In International Conference on Machine Learning (ICML), 2017

@inproceedings{machado2017laplacian, author = {Marlos C. Machado and Marc G. Bellemare and Michael Bowling}, title = {A Laplacian Framework for Option Discovery in Reinforcement Learning}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2017} }

Representation learning and option discovery are two of the biggest challenges in reinforcement learning (RL). Proto-value functions (PVFs) are a well-known approach for representation learning in MDPs. In this paper we address the option discovery problem by showing how PVFs implicitly define options. We do it by introducing eigenpurposes, intrinsic reward functions derived from the learned representations. The options discovered from eigenpurposes traverse the principal directions of the state space. They are useful for multiple tasks because they are discovered without taking the environment’s rewards into consideration. Moreover, different options act at different time scales, making them helpful for exploration. We demonstrate features of eigenpurposes in traditional tabular domains as well as in Atari 2600 games.
AAMAS-1

State of the Art Control of Atari Games Using Shallow Reinforcement Learning [Best paper nominee]

Y. Liang, M. C. Machado, E. Talvitie, and M. Bowling

In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2016

@inproceedings{liang16state, author = {Yitao Liang and Marlos C. Machado and Erik Talvitie and Michael H. Bowling}, title = {State of the Art Control of Atari Games Using Shallow Reinforcement Learning}, booktitle = {International Conference on Autonomous Agents and Multiagent Systems (AAMAS)}, year = {2016} }

The recently introduced Deep Q-Networks (DQN) algorithm has gained attention as one of the first successful combinations of deep neural networks and reinforcement learning. Its promise was demonstrated in the Arcade Learning Environment (ALE), a challenging framework composed of dozens of Atari 2600 games used to evaluate general competency in AI. It achieved dramatically better results than earlier approaches, showing that its ability to learn good representations is quite robust and general. This paper attempts to understand the principles that underlie DQN's impressive performance and to better contextualize its success. We systematically evaluate the importance of key representational biases encoded by DQN's network by proposing simple linear representations that make use of these concepts. Incorporating these characteristics, we obtain a computationally practical feature set that achieves performance competitive with DQN in the ALE. Besides offering insight into the strengths and weaknesses of DQN, we provide a generic representation for the ALE, significantly reducing the burden of learning a representation for each game. Moreover, we also provide a simple, reproducible benchmark for the sake of comparison to future work in the ALE.
AGI-1

Introspective Agents: Confidence Measures for General Value Functions

C. Sherstan, A. White, M. C. Machado, and P. Pilarski

In Conference on Artificial General Intelligence (AGI), 2016

@inproceedings{sherstan2016introspective, author = {Craig Sherstan and Adam White and Marlos C. Machado and Patrick M. Pilarski}, title = {Introspective Agents: Confidence Measures for General Value Functions}, booktitle = {Conference on Artificial General Intelligence (AGI)}, year = {2016} }

Agents of general intelligence deployed in real-world scenarios must adapt to ever-changing environmental conditions. While such adaptive agents may leverage engineered knowledge, they will require the capacity to construct and evaluate knowledge themselves from their own experience in a bottom-up, constructivist fashion. This position paper builds on the idea of encoding knowledge as temporally extended predictions through the use of general value functions. Prior work has focused on learning predictions about externally derived signals about a task or environment (e.g. battery level, joint position). Here we advocate that the agent should also predict internally generated signals regarding its own learning process—for example, an agent’s confidence in its learned predictions. Finally, we suggest how such information would be beneficial in creating an introspective agent that is able to learn to make good decisions in a complex, changing world.
CIG-1

A Binary Classification Approach for Automatic Preference Modeling of Virtual Agents in Civilization IV

M. C. Machado, G. L. Pappa, and L. Chaimowicz

In International Conference on Computational Intelligence and Games (CIG), 2012

@inproceedings{machado2012binary, author = {Marlos C. Machado and Gisele L. Pappa and Luiz Chaimowicz}, title = {A Binary Classification Approach for Automatic Preference Modeling of Virtual Agents in Civilization IV}, booktitle = {International Conference on Computational Intelligence and Games (CIG)}, year = {2012} }

Player Modeling tries to model players' behaviors and characteristics during a game. When these are related to more abstract preferences, the process is normally called Preference Modeling. In this paper we infer the preferences of Civilization IV's virtual agents with classifiers based on support vector machines. Our vectors contain score indicators from agents' gameplay, allowing us to predict preferences based on the indirect observations of actions. We model this task as a binary classification problem, allowing us to make more precise inferences. In this sense, we leveraged previous approaches that also used kernel machines but relied on different preference levels. Using binary classification and parameter optimization, our method is able to predict some agents' preferences with an accuracy of 100%. Moreover, it is also capable of generalizing to different agents, being able to predict preferences of agents that were not used in the training process.
SBGames-3

Characterizing and Modeling Agents in Digital Games

M. C. Machado, G. L. Pappa, and L. Chaimowicz

In Brazilian Symposium on Computer Games and Digital Entertainment (SBGames), 2012

@inproceedings{machado2012characterizing, author = {Marlos C. Machado and Gisele L. Pappa and Luiz Chaimowicz}, title = {Characterizing and Modeling Agents in Digital Games}, booktitle = {Brazilian Symposium on Computer Games and Digital Entertainment (SBGames)}, year = {2012} }

A promising approach in digital games is the possibility of customizing the game according to different demands. Artificial Intelligence algorithms play an important role in this direction, allowing the implementation of different behaviors for game agents. To accomplish this, it is necessary to model these agents in such a way that their behavior can be easily tuned to address different game features. In this paper, we discuss a generic representation to model virtual agents in digital games. Agents are modeled using a linear combination of different variables, which are used to represent specific game features. We perform experiments with FPS and Strategy games (Counter Strike and Civilization IV, respectively) and results show the effectiveness of this approach in characterizing and modeling agents. We were able to infer agents' models by observing matches and also to generate different behaviors by varying agents' models.
CGames-1

Player Modeling: Towards a Common Taxonomy

M. C. Machado, E. P. C. Fantini, and L. Chaimowicz

In International Conference on Computer Games (CGames), 2011

@inproceedings{machado2011player, author = {Marlos C. Machado and Eduardo P. C. Fantini and Luiz Chaimowicz}, title = {Player Modeling: Towards a Common Taxonomy}, booktitle = {International Conference on Computer Games (CGames)}, year = {2011} }

Artificial Intelligence (AI) is gradually receiving more attention as a fundamental feature to increase the immersion in digital games. Among the several AI approaches, Player Modeling is becoming an important one. The main idea is to try to understand and model the player characteristics and behaviors in order to develop a better AI. This paper presents a survey of the field, discussing the main concepts and proposing a taxonomy to better organize them. We also present several game platforms that can be used by player modeling and AI researchers. We believe that compiling this information can be important to the field, especially to new researchers.
SBGames-2

Agents Behavior and Preferences Characterization in Civilization IV

M. C. Machado, B. S. L. Rocha, and L. Chaimowicz

In Brazilian Symposium on Computer Games and Digital Entertainment (SBGames), 2011

@inproceedings{machado2011agents, author = {Marlos C. Machado and Bruno S. L. Rocha and Luiz Chaimowicz}, title = {Agents Behavior and Preferences Characterization in Civilization IV}, booktitle = {Brazilian Symposium on Computer Games and Digital Entertainment (SBGames)}, year = {2011} }

Player Modeling is becoming an important feature in Digital Games. It basically consists of understanding and modeling the player characteristics and behaviors during the game and has been mainly used to improve the game's artificial intelligence, making games more adaptable to different players. In this paper, we try to characterize the preferences of the players using a novel approach in games: we use mathematical regressions to characterize players' behavior, looking for functions that best fit these behaviors. Using AI-controlled players in Civilization IV as a testbed, this characterization is performed by extracting game data (score and resources, for example) at the end of each turn and generating functions that characterize the data evolution during the game. We were able to obtain models that distinguish the agents' preferences, showing the effectiveness of this approach.
SBGames-1

Combining Metaheuristics and CSP Algorithms to Solve Sudoku

M. C. Machado, and L. Chaimowicz

In Brazilian Symposium on Computer Games and Digital Entertainment (SBGames), 2011

@inproceedings{machado2011combining, author = {Marlos C. Machado and Luiz Chaimowicz}, title = {Combining Metaheuristics and CSP Algorithms to Solve Sudoku}, booktitle = {Brazilian Symposium on Computer Games and Digital Entertainment (SBGames)}, year = {2011} }

Sudoku is a very popular puzzle game that is played by millions of people every day. In spite of that, it is an NP-hard problem that can be very difficult to solve depending on the initial conditions of the board. In this paper, we propose the combination of metaheuristics with techniques from the Constraint Satisfaction Problem (CSP) domain that speed up the solution's search process by decreasing its search space and its processing time. Experiments performed with boards of size 3, 4 and 5 show that this approach allows the resolution of a greater number of instances when compared to an initial baseline.

Preprints

arXiv-6

The Laplacian Keyboard: Beyond the Linear Span

S. Chandrasekar, and M. C. Machado

CoRR abs/2602.07730, 2026

@article{chandrasekar2026lk, title = {The Laplacian Keyboard: Beyond the Linear Span}, author = {Siddarth Chandrasekar and Marlos C. Machado}, journal = {CoRR}, volume = {abs/2602.07730}, year = {2026} }

Across scientific disciplines, Laplacian eigenvectors serve as a fundamental basis for simplifying complex systems, from signal processing to quantum mechanics. In reinforcement learning (RL), these eigenvectors provide a natural basis for approximating reward functions; however, their use is typically limited to their linear span, which restricts expressivity in complex environments. We introduce the Laplacian Keyboard (LK), a hierarchical framework that goes beyond the linear span. LK constructs a task-agnostic library of options from these eigenvectors, forming a behavior basis guaranteed to contain the optimal policy for any reward within the linear span. A meta-policy learns to stitch these options dynamically, enabling efficient learning of policies outside the original linear constraints. We establish theoretical bounds on zero-shot approximation error and demonstrate empirically that LK surpasses zero-shot solutions while achieving improved sample efficiency compared to standard RL methods.
arXiv-5

DROGO: Default Representation Objective via Graph Optimization in Reinforcement Learning

H. T. Tse, and M. C. Machado

CoRR abs/2602.00403, 2026

@article{tse2026drogo, title = {DROGO: Default Representation Objective via Graph Optimization in Reinforcement Learning}, author = {Hon Tik Tse and Marlos C. Machado}, journal = {CoRR}, volume = {abs/2602.00403}, year = {2026} }

In computational reinforcement learning, the default representation (DR) and its principal eigenvector have been shown to be effective for a wide variety of applications, including reward shaping, count-based exploration, option discovery, and transfer. However, in prior investigations, the eigenvectors of the DR were computed by first approximating the DR matrix, and then performing an eigendecomposition. This procedure is computationally expensive and does not scale to high-dimensional spaces. In this paper, we derive an objective for directly approximating the principal eigenvector of the DR with a neural network. We empirically demonstrate the effectiveness of the objective in a number of environments, and apply the learned eigenvectors for reward shaping.
arXiv-4

Universal Computation is Intrinsic to Language Model Decoding

A. Lewandowski, M. C. Machado, and D. Schuurmans

CoRR abs/2601.08061, 2026

@article{lewandowski2026universal, title = {Universal Computation is Intrinsic to Language Model Decoding}, author = {Alex Lewandowski and Marlos C. Machado and Dale Schuurmans}, journal = {CoRR}, volume = {abs/2601.08061}, year = {2026} }

Language models now provide an interface to express and often solve general problems in natural language, yet their ultimate computational capabilities remain a major topic of scientific debate. Unlike a formal computer, a language model is trained to autoregressively predict successive elements in human-generated text. We prove that chaining a language model's autoregressive output is sufficient to perform universal computation. That is, a language model can simulate the execution of any algorithm on any input. The challenge of eliciting desired computational behaviour can thus be reframed in terms of programmability: the ease of finding a suitable prompt. Strikingly, we demonstrate that even randomly initialized language models are capable of universal computation before training. This implies that training does not give rise to computational expressiveness -- rather, it improves programmability, enabling a natural language interface for accessing these intrinsic capabilities.
bioRxiv-1

Representations in the Hippocampal-Entorhinal System Emerge from Learning Sensory Predictions

D. Gomez, M. Bowling, J. Q. Lee, and M. C. Machado

bioRxiv 2025.10.03.680189, 2025

@article{gomez2025representations, title = {Representations in the Hippocampal-Entorhinal System Emerge from Learning Sensory Predictions}, author = {Diego Gomez and Michael Bowling and J. Quinn Lee and Marlos C. Machado}, journal = {bioRxiv}, volume = {2025.10.03.680189}, year = {2025} }

The hippocampal formation and adjacent parahippocampal areas are central to intelligent behaviour such as memory and navigation. Understanding how systems in the brain generate structured representations from experience remains a fundamental goal in neuroscience. A central open question is whether a single computational principle can account for the diverse neural responses observed across the hippocampal-entorhinal circuit. Existing models often rely on hand-crafted features or specialized learning mechanisms unrelated to sensory observations, and typically express representations of only a small subset of known cell types. Further, representations learned in such models are often not empirically evaluated against neural representations observed in the navigating brain. Here, we introduce a neurobiologically-inspired and robust computational model in which diverse cell types emerge from a single learning objective with minimal hand-engineered assumptions. Our model applies contrastive graph representation learning to transitions between high-dimensional visual observations, constructing a metric space in which temporally adjacent sensory observations are mapped to nearby states. Inspired by the anatomical information flow of the hippocampal-entorhinal system, and anchored in output representations based on neural coding in the entorhinal cortex, the model gives rise to activity resembling place cells, grid cells, boundary vector cells, band cells, corner cells, and conjunctive cells among others. Across varied environments and sensory streams, the framework captures not only diverse neural response patterns but also the functional dependencies between them, mirroring the proposed sequential representational structure observed in the hippocampal-entorhinal system. 
Crucially, place-cell-like features of the model quantitatively reproduce remapping dynamics observed in CA1 of freely moving animals, and afford theoretical explanatory power of existing neurobiologically-informed models. This work thus offers a unified computational model of spatial coding in the hippocampal-entorhinal system and a testable framework for generating mechanistic hypotheses in silico, to be evaluated in vivo.
arXiv-3

Deep Double Q-learning

P. Nagarajan, M. White, and M. C. Machado

CoRR abs/2507.00275, 2025

@article{nagarajan2025double, title = {Deep Double Q-learning}, author = {Prabhat Nagarajan and Martha White and Marlos C. Machado}, journal = {CoRR}, volume = {abs/2507.00275}, year = {2025} }

Double Q-learning is a classical control algorithm that mitigates the maximization bias of Q-learning. To do so, it explicitly trains two independent action-value functions and uses them to decouple action-selection and action-evaluation when computing bootstrap targets. Double DQN adapts target bootstrap decoupling to deep reinforcement learning (RL), but explicitly trains only a single action-value function and does not fully decouple its estimators. Consequently, the two estimators remain correlated, and overestimation persists. In this paper, we introduce Deep Double Q-learning (DDQL), a deep RL algorithm that explicitly trains two Q-functions through Double Q-learning. DDQL stabilizes training through a combination of techniques, including lower replay ratios, longer target network update intervals, and shared layers. Across 57 Atari 2600 games, DDQL improves aggregate performance over Double DQN, outperforming it on 47 games while further reducing overestimation. In addition, we study key design choices when adapting Double Q-learning to deep RL, including the network architecture, replay ratio, and minibatch sampling strategies.
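The decoupling of action selection from action evaluation mentioned above can be sketched with the classical tabular Double Q-learning update. This is only the tabular algorithm, not DDQL itself; the function name, table shapes, and the toy values in the usage lines are hypothetical.

```python
import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, done, alpha, gamma, rng):
    """One tabular Double Q-learning step; updates either QA or QB in place."""
    # Pick at random which table is updated on this step.
    if rng.random() < 0.5:
        select, evaluate = QA, QB
    else:
        select, evaluate = QB, QA
    # The table being updated selects the greedy next action...
    a_star = int(np.argmax(select[s_next]))
    # ...while the other table evaluates that action.
    bootstrap = 0.0 if done else evaluate[s_next, a_star]
    target = r + gamma * bootstrap
    select[s, a] += alpha * (target - select[s, a])

# Toy usage with 3 states and 2 actions (values chosen arbitrarily).
rng = np.random.default_rng(0)
QA = np.zeros((3, 2))
QB = np.zeros((3, 2))
QA[1] = [0.0, 5.0]
QB[1] = [2.0, 1.0]
double_q_update(QA, QB, s=0, a=0, r=1.0, s_next=1, done=False,
                alpha=1.0, gamma=1.0, rng=rng)
```

With these toy values, a standard Q-learning update on `QA` alone would bootstrap from its own maximum of 5.0, whereas the decoupled update bootstraps from the other table's estimate of the selected action.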
arXiv-2

Discovering Temporal Structure: An Overview of Hierarchical Reinforcement Learning

M. Klissarov*, A. Bagaria*, Z. Luo, G. Konidaris, D. Precup, and M. C. Machado

CoRR abs/2506.14045, 2025

@article{klissarov2025discovering, title = {Discovering Temporal Structure: An Overview of Hierarchical Reinforcement Learning}, author = {Martin Klissarov and Akhil Bagaria and Ziyan Luo and George Konidaris and Doina Precup and Marlos C. Machado}, journal = {CoRR}, volume = {abs/2506.14045}, year = {2025} }

Developing agents capable of exploring, planning and learning in complex open-ended environments is a grand challenge in artificial intelligence (AI). Hierarchical reinforcement learning (HRL) offers a promising solution to this challenge by discovering and exploiting the temporal structure within a stream of experience. The strong appeal of the HRL framework has led to a rich and diverse body of literature attempting to discover a useful structure. However, it is still not clear how one might define what constitutes good structure in the first place, or the kind of problems in which identifying it may be helpful. This work aims to identify the benefits of HRL from the perspective of the fundamental challenges in decision-making, as well as highlight its impact on the performance trade-offs of AI agents. Through these benefits, we then cover the families of methods that discover temporal structure in HRL, ranging from learning directly from online experience to offline datasets, to leveraging large language models (LLMs). Finally, we highlight the challenges of temporal structure discovery and the domains that are particularly well-suited for such endeavours.
arXiv-1

Curvature Explains Loss of Plasticity

A. Lewandowski, H. Tanaka, D. Schuurmans, and M. C. Machado

CoRR abs/2312.00246, 2023

@article{lewandowski2023curvature, title = {Curvature Explains Loss of Plasticity}, author = {Alex Lewandowski and Haruto Tanaka and Dale Schuurmans and Marlos C. Machado}, journal = {CoRR}, volume = {abs/2312.00246}, year = {2023} }

Loss of plasticity is a phenomenon in which neural networks lose their ability to learn from new experience. Despite being empirically observed in several problem settings, little is understood about the mechanisms that lead to loss of plasticity. In this paper, we offer a consistent explanation for plasticity loss, based on an assertion that neural networks lose directions of curvature during training and that plasticity loss can be attributed to this reduction in curvature. To support such a claim, we provide a systematic empirical investigation of plasticity loss across several continual supervised learning problems. Our findings illustrate that curvature loss coincides with and sometimes precedes plasticity loss, while also showing that previous explanations are insufficient to explain loss of plasticity in all settings. Lastly, we show that regularizers which mitigate loss of plasticity also preserve curvature, motivating a simple distributional regularizer that proves to be effective across the problem settings considered.

Magazine Articles, Extended Abstracts, and Workshop Papers

ICML-W-4

AI Agent Safety is a Reinforcement Learning Problem

R. McLean, T. E. Lee, M. Mohammedalamen, K. Roice, G. Berseth, P. M. Pilarski, M. C. Machado, A. L. Škopac, and B. Rosman

In ICML Workshop on Agents in the Wild: Safety, Security, and Beyond, 2026

@inproceedings{mclean2026ai, title = {AI Agent Safety is a Reinforcement Learning Problem}, author = {Reginald McLean and Tabitha Edith Lee and Montaser Mohammedalamen and Kevin Roice and Glen Berseth and Patrick M. Pilarski and Marlos C. Machado and Alyssa Lefaivre Škopac and Benjamin Rosman}, booktitle = {ICML Workshop on Agents in the Wild: Safety, Security, and Beyond}, year = {2026} }

With the rapid advancement and deployment of Agentic AI, our scientific understanding of capabilities and limitations has not kept pace, leading to cases where AI agents cause harm. We argue that many of these safety limitations are not novel problems. Instead, the safety challenges currently facing AI agents can be seen as instances of problems the reinforcement learning (RL) community has studied rigorously for decades. The core of this argument concerns the problem formulation of AI agents. AI agents are designed to solve sequential decision-making problems: problems with long-term objectives in which actions have delayed consequences. To model these types of problems, the problem is set up such that the agent receives observations and feedback on its progress, and then takes actions. This is precisely the formulation of the RL problem. In this paper, we formalize the problem equivalence, which we then leverage to argue that AI agent safety is a reinforcement learning problem: the failure modes currently observed in deployed AI agents are structural instances of problems RL has formalized for decades, and the RL safety literature provides principled tools to diagnose and address them. We conclude with a call for deliberate collaboration between the RL and AI agent research communities: AI agent researchers gain access to principled frameworks, while RL researchers gain a class of real-world problems that could expose fundamental gaps in current RL benchmarks and theory.
AAAI*-3

Representation-driven Option Discovery in Reinforcement Learning

M. C. Machado

In AAAI Conference on Artificial Intelligence (AAAI): New Faculty Highlights, 2025

@inproceedings{machado2025representation, title = {Representation-driven Option Discovery in Reinforcement Learning}, author = {Marlos C. Machado}, booktitle = {AAAI Conference on Artificial Intelligence (AAAI): New Faculty Highlights}, year = {2025} }

The ability to reason at multiple levels of temporal abstraction is a fundamental aspect of intelligence. In reinforcement learning (RL), this attribute is often modelled through temporally extended courses of actions called options. In this talk, I will introduce a general framework for option discovery, which uses the agent's representation to discover useful options. By leveraging these options to generate a rich stream of experience, the agent can improve its representations and learn more effectively. This representation-driven option discovery approach creates a virtuous cycle of refinement, continuously improving both the representation and options, and it is particularly effective for problems where agents need to operate at varying levels of abstraction to succeed.
RLC-W-2

Towards An Option Basis To Optimize All Rewards

S. Chandrasekar, and M. C. Machado

In RLC Workshop on Reinforcement Learning Beyond Rewards: Ingredients for Developing Generalist Agents, 2025 [Spotlight]

@inproceedings{chandrasekar2025towards, title = {Towards An Option Basis To Optimize All Rewards}, author = {Siddarth Chandrasekar and Marlos C. Machado}, booktitle = {RLC Workshop on Reinforcement Learning Beyond Rewards: Ingredients for Developing Generalist Agents}, year = {2025} }

The Option Keyboard framework enables efficient behavior generation by composing a set of basis options. However, it remains unclear how to construct a global and compact basis, from scratch, for solving any given task in an environment. In this work, we investigate using the eigenvectors of the graph Laplacian of the environment to form such a basis. The behaviors obtained from such eigenvectors are known as eigenoptions. We empirically demonstrate that a sufficiently large eigenoption basis, combined with Generalized Policy Improvement, can recover near-optimal policies in the goal-reaching tasks we considered. Building on this, we introduce the Laplacian Keyboard, which matches this performance while requiring a substantially smaller set of options. Finally, we briefly outline a method for constructing a universal optimal option basis capable of solving any task within a given environment.
RLC-W-1

A Study of Value-Aware Eigenoptions

H. Kotamreddy, and M. C. Machado

In RLC Workshop on Inductive Biases in Reinforcement Learning, 2025

@inproceedings{kotamreddy2025study, title = {A Study of Value-Aware Eigenoptions}, author = {Harshil Kotamreddy and Marlos C. Machado}, booktitle = {RLC Workshop on Inductive Biases in Reinforcement Learning}, year = {2025} }

Options, which impose an inductive bias toward temporal and hierarchical structure, offer a powerful framework for reinforcement learning (RL). While effective in sequential decision-making, they are often handcrafted rather than learned. Among approaches for discovering options, eigenoptions have shown strong performance in exploration, but their role in credit assignment remains underexplored. In this paper, we investigate whether eigenoptions can accelerate credit assignment in model-free RL, evaluating them in tabular and pixel-based gridworlds. We find that pre-specified eigenoptions aid not only exploration but also credit assignment, whereas online discovery can bias the agent’s experience too strongly and hinder learning. In the context of deep RL, we also propose a method for learning option-values under non-linear function approximation, highlighting the impact of termination conditions on performance. Our findings reveal both the promise and complexity of using eigenoptions, and options more broadly, to simultaneously support credit assignment and exploration in reinforcement learning.
AAAI*-2

Reward-Respecting Subtasks for Model-Based Reinforcement Learning (Abstract Reprint) [Invited abstract]

R. S. Sutton, M. C. Machado, G. Z. Holland, D. Szepesvari, F. Timbers, B. Tanner, and A. White

In AAAI Conference on Artificial Intelligence (AAAI), 2024

@inproceedings{machado2018revisiting-ijcai, title = {Reward-Respecting Subtasks for Model-Based Reinforcement Learning (Abstract Reprint)}, author = {Richard S. Sutton and Marlos C. Machado and G. Zacharias Holland and David Szepesvari and Finbarr Timbers and Brian Tanner and Adam White}, booktitle = {AAAI Conference on Artificial Intelligence (AAAI)}, year = {2024} }

To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress with state abstraction, but temporal abstraction has rarely been used, despite extensively developed theory based on the options framework. One reason for this is that the space of possible options is immense, and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks, such as reaching a bottleneck state or maximizing the cumulative sum of a sensory signal other than reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. In most previous work, the subtasks ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option terminates. We show that option models obtained from such reward-respecting subtasks are much more likely to be useful in planning than eigenoptions, shortest path options based on bottleneck states, or reward-respecting options generated by the option-critic. Reward respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how values, policies, options, and models can all be learned online and off-policy using standard algorithms and general value functions.
NeurIPS-W-1

The Eigenoption-Critic Framework

M. Liu, M. C. Machado, G. Tesauro, and M. Campbell

In NeurIPS Hierarchical RL Workshop, 2017

@inproceedings{liu2017eigenoption-critic, title = {The Eigenoption-Critic Framework}, author = {Miao Liu and Marlos C. Machado and Gerald Tesauro and Murray Campbell}, booktitle = {NeurIPS Hierarchical RL Workshop}, year = {2017} }

Eigenoptions (EOs) have been recently introduced as a promising idea for generating a diverse set of options through the graph Laplacian, having been shown to allow efficient exploration. Despite their initial promising results, a couple of issues in current algorithms limit their application, namely: (1) EO methods require two separate steps (eigenoption discovery and reward maximization) to learn a control policy, which can incur a significant amount of storage and computation; (2) EOs are only defined for problems with discrete state-spaces; and (3) it is not easy to take the environment's reward function into consideration when discovering EOs. To address these issues, we introduce an algorithm termed eigenoption-critic (EOC) based on the Option-critic (OC) framework [Bacon17], a general hierarchical reinforcement learning (RL) algorithm that allows learning the intra-option policies simultaneously with the policy over options. We also propose a generalization of EOC to problems with continuous state-spaces through the Nyström approximation. EOC can also be seen as extending OC to nonstationary settings, where the discovered options are not tailored for a single task.
ICML-W-3

Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment [Best paper award]

A. A. Taiga, W. Fedus, M. C. Machado, A. Courville, and M. G. Bellemare

In ICML Workshop on Exploration in Reinforcement Learning, 2019

Longer version was published at ICLR’20.

@inproceedings{taiga2019benchmarking, title = {Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment}, author = {Adrien A. Taiga and William Fedus and Marlos C. Machado and Aaron Courville and Marc G. Bellemare}, booktitle = {ICML Workshop on Exploration in Reinforcement Learning}, year = {2019} }

This paper provides an empirical evaluation of recently developed exploration algorithms within the Arcade Learning Environment (ALE). We study the use of different reward bonuses that incentivize exploration in reinforcement learning. We do so by fixing the learning algorithm used and focusing only on the impact of the different exploration bonuses on the agent's performance. We use Rainbow, the state-of-the-art algorithm for value-based agents, and focus on some of the bonuses proposed in the last few years. We consider the impact these algorithms have on performance within the popular game Montezuma's Revenge, which has gathered a lot of interest from the exploration community, across the set of seven games identified by Bellemare et al. (2016) as challenging for exploration, and easier games where exploration is not an issue. We find that, in our setting, recently developed bonuses do not provide significantly improved performance on Montezuma's Revenge or hard exploration games. We also find that existing bonus-based methods may negatively impact performance on games in which exploration is not an issue and may even perform worse than ϵ-greedy exploration.
ICML-W-2

Count-Based Exploration with the Successor Representation [Best paper award]

M. C. Machado, M. G. Bellemare, and M. Bowling

In ICML Workshop on Exploration in Reinforcement Learning, 2018

Longer version was published at AAAI’20.

@inproceedings{machado2020counts, author = {Marlos C. Machado and Marc G. Bellemare and Michael Bowling}, title = {Count-Based Exploration with the Successor Representation}, booktitle = {AAAI Conference on Artificial Intelligence (AAAI)}, year = {2020} }

In this paper we introduce a simple approach for exploration in reinforcement learning (RL) that allows us to develop theoretically justified algorithms in the tabular case but that is also extendable to settings where function approximation is required. Our approach is based on the successor representation (SR), which was originally introduced as a representation defining state generalization by the similarity of successor states. Here we show that the norm of the SR, while it is being learned, can be used as a reward bonus to incentivize exploration. In order to better understand this transient behavior of the norm of the SR we introduce the substochastic successor representation (SSR) and we show that it implicitly counts the number of times each state (or feature) has been observed. We use this result to introduce an algorithm that performs as well as some theoretically sample-efficient approaches. Finally, we extend these ideas to a deep RL algorithm and show that it achieves state-of-the-art performance in Atari 2600 games when in a low sample-complexity regime.
NeurIPS-W-2

Generalization and Regularization in DQN

J. Farebrother, M. C. Machado, and M. Bowling

In NeurIPS Deep Reinforcement Learning Workshop, 2018

@inproceedings{farebrother2018generalization, title = {Generalization and Regularization in DQN}, author = {Jesse Farebrother and Marlos C. Machado and Michael Bowling}, booktitle = {NeurIPS Deep Reinforcement Learning Workshop}, year = {2018} }

Deep reinforcement learning algorithms have shown an impressive ability to learn complex control policies in high-dimensional tasks. However, despite the ever-increasing performance on popular benchmarks, policies learned by deep reinforcement learning algorithms can struggle to generalize when evaluated in remarkably similar environments. In this paper we propose a protocol to evaluate generalization in reinforcement learning through different modes of Atari 2600 games. With that protocol we assess the generalization capabilities of DQN, one of the most traditional deep reinforcement learning algorithms, and we provide evidence suggesting that DQN overspecializes to the training environment. We then comprehensively evaluate the impact of dropout and ℓ2 regularization, as well as the impact of reusing learned representations to improve the generalization capabilities of DQN. Despite regularization being largely underutilized in deep reinforcement learning, we show that it can, in fact, help DQN learn more general features. These features can be reused and fine-tuned on similar tasks, considerably improving DQN's sample efficiency.
IJCAI*-1

Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents (Extended Abstract) [Invited paper]

M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling

In International Joint Conference on Artificial Intelligence (IJCAI), 2018

@inproceedings{machado2018revisiting-ijcai, title = {Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents (Extended Abstract)}, author = {Marlos C. Machado and Marc G. Bellemare and Erik Talvitie and Joel Veness and Matthew Hausknecht and Michael Bowling}, booktitle = {International Joint Conference on Artificial Intelligence (IJCAI)}, year = {2018} }

The Arcade Learning Environment (ALE) is an evaluation platform that poses the challenge of building AI agents with general competency across dozens of Atari 2600 games. It supports a variety of different problem settings and it has been receiving increasing attention from the scientific community. In this paper we take a big-picture look at how the ALE is being used by the research community. We focus on how diverse the evaluation methodologies in the ALE have become and we highlight some key concerns when evaluating agents in this platform. We use this discussion to present what we consider to be the best practices for future evaluations in the ALE. To further the progress in the field, we also introduce a new version of the ALE that supports multiple game modes and provides a form of stochasticity we call sticky actions.
NeurIPS-W-1

The Eigenoption-Critic Framework

M. Liu, M. C. Machado, G. Tesauro, and M. Campbell

In NeurIPS Hierarchical RL Workshop, 2017

@inproceedings{liu2017eigenoption-critic, title = {The Eigenoption-Critic Framework}, author = {Miao Liu and Marlos C. Machado and Gerald Tesauro and Murray Campbell}, booktitle = {NeurIPS Hierarchical RL Workshop}, year = {2017} }

Eigenoptions (EOs) have been recently introduced as a promising idea for generating a diverse set of options through the graph Laplacian, having been shown to allow efficient exploration. Despite their initial promising results, three issues in current algorithms limit their application: (1) EO methods require two separate steps (eigenoption discovery and reward maximization) to learn a control policy, which can incur a significant amount of storage and computation; (2) EOs are only defined for problems with discrete state spaces; and (3) it is not easy to take the environment's reward function into consideration when discovering EOs. To address these issues, we introduce an algorithm termed eigenoption-critic (EOC) based on the Option-critic (OC) framework [Bacon17], a general hierarchical reinforcement learning (RL) algorithm that allows learning the intra-option policies simultaneously with the policy over options. We also propose a generalization of EOC to problems with continuous state spaces through the Nyström approximation. EOC can also be seen as extending OC to nonstationary settings, where the discovered options are not tailored for a single task.
ICML-W-1

Learning Purposeful Behaviour in the Absence of Rewards

M. C. Machado and M. Bowling

In ICML Workshop on Abstraction in Reinforcement Learning, 2016

@inproceedings{machado2016learning, title = {Learning Purposeful Behaviour in the Absence of Rewards}, author = {Marlos C. Machado and Michael Bowling}, booktitle = {ICML Workshop on Abstraction in Reinforcement Learning}, year = {2016} }

Artificial intelligence is commonly defined as the ability to achieve goals in the world. In the reinforcement learning framework, goals are encoded as reward functions that guide agent behaviour, and the sum of observed rewards provides a notion of progress. However, some domains have no such reward signal, or have a reward signal so sparse as to appear absent. Without reward feedback, agent behaviour is typically random, often dithering aimlessly and lacking intentionality. In this paper we present an algorithm capable of learning purposeful behaviour in the absence of rewards. The algorithm proceeds by constructing temporally extended actions (options), through the identification of purposes that are "just out of reach" of the agent's current behaviour. These purposes establish intrinsic goals for the agent to learn, ultimately resulting in a suite of behaviours that encourage the agent to visit different parts of the state space. Moreover, the approach is particularly suited for settings where rewards are very sparse, and such behaviours can help in the exploration of the environment until reward is observed.
AAAI-W-1

Domain-Independent Optimistic Initialization for Reinforcement Learning

M. C. Machado, S. Srinivasan, and M. Bowling

In AAAI Workshop on Learning for General Competency in Video Games, 2015

@inproceedings{machado2015domain-independent, title = {Domain-Independent Optimistic Initialization for Reinforcement Learning}, author = {Marlos C. Machado and Sriram Srinivasan and Michael Bowling}, booktitle = {AAAI Workshop on Learning for General Competency in Video Games}, year = {2015} }

In Reinforcement Learning (RL), it is common to use optimistic initialization of value functions to encourage exploration. However, such an approach generally depends on the domain, viz., the scale of the rewards must be known, and the feature representation must have a constant norm. We present a simple approach that performs optimistic initialization with less dependence on the domain.
AIM-1

Reports from the 2015 AAAI Workshop Program

[Alph. order] S. V. Albrecht, J. C. Beck, D. L. Buckeridge, A. Botea, C. Caragea, C.-H. Chi, T. Damoulas, B. N. Dilkina, E. Eaton, P. Fazli, S. Ganzfried, M. T. Lindauer, M. C. Machado, Y. Malitsky, G. Marcus, S. Meijer, F. Rossi, A. Shaban-Nejad, S. Thiébaux, M. M. Veloso, T. Walsh, C. Wang, J. Zhang, and Y. Zheng

AI Magazine, 36(2), pp. 90-101, 2015

@article{albrecht2015reports, title = {Reports from the 2015 AAAI Workshop Program}, author = {Stefano V. Albrecht and J. Christopher Beck and David L. Buckeridge and Adi Botea and Cornelia Caragea and Chi-Hung Chi and Theodoros Damoulas and Bistra N. Dilkina and Eric Eaton and Pooyan Fazli and Sam Ganzfried and Marius Thomas Lindauer and Marlos C. Machado and Yuri Malitsky and Gary Marcus and Sebastiaan Meijer and Francesca Rossi and Arash Shaban-Nejad and Sylvie Thiébaux and Manuela M. Veloso and Toby Walsh and Can Wang and Jie Zhang and Yu Zheng}, journal = {AI Magazine}, volume = {36}, number = {2}, pages = {90-101}, year = {2015} }

The AAAI-15 workshop program was held Sunday and Monday, January 25–26, 2015, at the Hyatt Regency Austin Hotel in Austin, Texas, USA. The program included 15 workshops covering a wide range of topics in artificial intelligence. Most workshops were held on a single day. The titles of the workshops included AI and Ethics; AI for Cities; AI for Transportation: Advice, Interactivity and Actor Modeling; Algorithm Configuration; Artificial Intelligence Applied to Assistive Technologies and Smart Environments; Beyond the Turing Test; Computational Sustainability; Computer Poker and Imperfect Information; Incentive and Trust in E-Communities; Multiagent Interaction without Prior Coordination; Planning, Search, and Optimization; Scholarly Big Data: AI Perspectives, Challenges, and Ideas; Trajectory-Based Behaviour Analytics; World Wide Web and Public Health Intelligence; Knowledge, Skill, and Behavior Transfer in Autonomous Robots; and Learning for General Competency in Video Games.

Theses

PhD-1

Efficient Exploration in Reinforcement Learning through Time-Based Representations

M. C. Machado

Ph.D. Thesis, University of Alberta, 2019

@phdthesis{machado2019efficient, author = {Marlos C. Machado}, title = {Efficient Exploration in Reinforcement Learning through Time-Based Representations}, school = {University of Alberta}, year = {2019} }

In the reinforcement learning (RL) problem an agent must learn how to act optimally through trial-and-error interactions with a complex, unknown, stochastic environment. The actions taken by the agent influence not just the immediate reward it observes but also the future states and rewards it will observe, implicitly requiring the agent to deal with the trade-off between short-term and long-term consequences. In this context, the problem of exploration is the problem of selecting appropriate actions to explore the state space to gather information while taking this trade-off into consideration. In this dissertation I advocate that agents' exploration strategy can be guided by the process of representation learning. I support this claim by introducing different exploration approaches for RL algorithms that are applicable to complex environments with sparse rewards. They all use learned time-based representations, state representations that capture the temporal aspect of RL problems, implicitly encoding the temporal proximity of states. The two instantiations of time-based representations I use are proto-value functions (PVFs) and the successor representation (SR). The first approaches I introduce are based on the idea of option-based exploration. Option-based exploration hinges on the assumption that an agent that exhibits purposeful behavior is more likely to visit states that are far from its current state than an agent that randomly selects actions at every time step. I model this purposefulness through options, which, in reinforcement learning, represent temporally extended courses of actions over different time scales. I then introduce algorithms capable of discovering options autonomously through PVFs and the SR. I also introduce count-based exploration approaches, which are based on the idea of keeping state visitation counts to ensure all states (or abstractions of a state) are visited a proper number of times. I show that the norm of the SR, while it is being learned, incorporates state visitation counts and I use this result to introduce RL algorithms that achieve state-of-the-art results in large domains that require function approximation. I evaluate my algorithms in both tabular domains and Atari 2600 games. I use tabular domains such as the 4-room domain, RiverSwim, and SixArms in order to develop a better intuition about the proposed algorithms and to compare the proposed approaches to classic baselines in the field. I use Atari 2600 games to evaluate the scalability and generality of the proposed approaches since the state space of Atari 2600 games is too large, requiring function approximation. I discuss approaches based on linear and non-linear function approximation.
MSc-1

A Methodology for Player Modeling based on Machine Learning

M. C. Machado

M.Sc. Thesis, Universidade Federal de Minas Gerais, 2013

@mastersthesis{machado2013methodology, author = {Marlos C. Machado}, title = {A Methodology for Player Modeling based on Machine Learning}, school = {Universidade Federal de Minas Gerais}, year = {2013} }

AI is gradually receiving more attention as a fundamental feature to increase the immersion in digital games. Among the several AI approaches, player modeling is becoming an important one. The main idea is to understand and model the player characteristics and behaviors in order to develop a better AI. In this work, we discuss several aspects of this new field. We propose a taxonomy to organize the area, discussing several facets of this topic, ranging from implementation decisions up to what a model attempts to describe. We then classify, in our taxonomy, some of the most important works in this field. We also present a generic approach to deal with player modeling using ML, and we instantiate this approach to model players' preferences in the game Civilization IV. The instantiation of this approach has several steps. We first discuss a generic representation, regardless of what is being modeled, and evaluate it by performing experiments with the strategy game Civilization IV. Continuing the instantiation of the proposed approach, we evaluate the applicability of using game score information to distinguish different preferences. We present a characterization of virtual agents in the game, comparing their behavior with their stated preferences. Having characterized these agents, we observe that different preferences generate different behaviors, measured by several game indicators. We then tackle the preference modeling problem as a binary classification task, with a supervised learning approach. We compare four different methods, based on different paradigms (SVM, AdaBoost, Naive Bayes and JRip), evaluating them on a set of matches played by different virtual agents. We conclude our work by using the learned models to infer human players' preferences. Using some of the evaluated classifiers, we obtain accuracies over 60% for most of the inferred preferences.