Can reinforcement learning model learning across development? Online lifelong learning through adaptive intrinsic motivation
Reinforcement learning is a powerful model of animal learning in brief, controlled experimental conditions, but does not readily explain the development of behavior over an animal’s whole lifetime. In this paper, we describe a framework to address this shortcoming by introducing the single-life reinforcement learning setting to cognitive science. We construct an agent with two learning systems: an extrinsic learner that learns within a single lifetime, and an intrinsic learner that learns across lifetimes, equipping the agent with intrinsic motivation. We show that this model outperforms heuristic benchmarks and recapitulates a transition from exploratory to habit-driven behavior, while allowing the agent to learn an interpretable value function. We formulate a precise definition of intrinsic motivation and discuss the philosophical implications of using reinforcement learning as a model of behavior in the real world.
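As a rough illustration of the two-system idea (not the paper's method), the sketch below runs a single lifetime of tabular Q-learning in which the reward being optimised is the extrinsic reward plus an intrinsic bonus. The paper's intrinsic learner is itself trained across lifetimes; here a simple count-based novelty bonus and a toy chain environment stand in purely for illustration.

```python
# Illustrative sketch: one "lifetime" of tabular Q-learning where the optimised
# reward is extrinsic reward plus an intrinsic bonus. A count-based novelty
# bonus stands in for the paper's learned intrinsic signal.
import numpy as np

class ChainEnv:
    """Toy chain of n states; action 1 moves right, 0 moves left; reward at the right end."""
    def __init__(self, n=10):
        self.n, self.s = n, 0
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = min(self.n - 1, self.s + 1) if a == 1 else max(0, self.s - 1)
        return self.s, float(self.s == self.n - 1)

def single_life_q_learning(env, n_states, n_actions, steps=20_000,
                           alpha=0.1, gamma=0.99, eps=0.1, beta=0.05, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))
    visit_counts = np.zeros(n_states)
    s = env.reset()
    for _ in range(steps):                                # one uninterrupted lifetime
        a = rng.integers(n_actions) if rng.random() < eps else int(q[s].argmax())
        s_next, r_ext = env.step(a)
        visit_counts[s_next] += 1
        r = r_ext + beta / np.sqrt(visit_counts[s_next])  # extrinsic + intrinsic bonus
        q[s, a] += alpha * (r + gamma * q[s_next].max() - q[s, a])
        s = s_next
    return q

q = single_life_q_learning(ChainEnv(), n_states=10, n_actions=2)
print(q.argmax(axis=1))  # learned greedy action per state
```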
Bayesian Exploration Networks
Bayesian reinforcement learning (RL) offers a principled and elegant approach
for sequential decision making under uncertainty. Most notably, Bayesian agents
do not face an exploration/exploitation dilemma, a major pathology of
frequentist methods. A key challenge for Bayesian RL is the computational
complexity of learning Bayes-optimal policies, which is only tractable in toy
domains. In this paper we propose a novel model-free approach to address this
challenge. Rather than modelling uncertainty in high-dimensional state
transition distributions as model-based approaches do, we model uncertainty in
a one-dimensional Bellman operator. Our theoretical analysis reveals that
existing model-free approaches either do not propagate epistemic uncertainty
through the MDP or optimise over a set of contextual policies instead of all
history-conditioned policies. Both approximations yield policies that can be
arbitrarily Bayes-suboptimal. To overcome these issues, we introduce the
Bayesian exploration network (BEN) which uses normalising flows to model both
the aleatoric uncertainty (via density estimation) and epistemic uncertainty
(via variational inference) in the Bellman operator. In the limit of complete
optimisation, BEN learns true Bayes-optimal policies, but like in variational
expectation-maximisation, partial optimisation renders our approach tractable.
Empirical results demonstrate that BEN can learn true Bayes-optimal policies in
tasks where existing model-free approaches fail.
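A minimal sketch of the underlying idea, under strong simplifying assumptions: rather than maintaining a posterior over high-dimensional transition models, uncertainty is kept over the scalar Bellman target b = r + gamma * max_a' Q(s', a'). BEN does this with normalising flows and variational inference; below, a small ensemble of Q-tables stands in for the epistemic posterior purely to show where the one-dimensional uncertainty lives.

```python
# Illustrative sketch only: epistemic uncertainty over the one-dimensional
# Bellman target b = r + gamma * max_a' Q(s', a'), not over a full transition
# model. An ensemble of Q-tables stands in for BEN's variational posterior;
# the paper instead uses normalising flows and variational inference.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99
n_states, n_actions, n_members = 5, 2, 8

# Each ensemble member plays the role of one posterior sample of Q.
ensemble = [rng.normal(size=(n_states, n_actions)) for _ in range(n_members)]

def bellman_target(q_table, r, s_next):
    """Scalar output of the Bellman operator for a single observed transition."""
    return r + gamma * q_table[s_next].max()

def target_samples(r, s_next):
    """Samples of the Bellman target under different posterior members."""
    return np.array([bellman_target(q, r, s_next) for q in ensemble])

samples = target_samples(r=1.0, s_next=3)
print("mean target:", samples.mean(), "epistemic std:", samples.std())
```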
The Danger Of Arrogance: Welfare Equilibria As A Solution To Stackelberg Self-Play In Non-Coincidental Games
The increasing prevalence of multi-agent learning systems in society
necessitates understanding how to learn effective and safe policies in
general-sum multi-agent environments against a variety of opponents, including
self-play. General-sum learning is difficult because of non-stationary
opponents and misaligned incentives. Our first main contribution is to show
that many recent approaches to general-sum learning can be derived as
approximations to Stackelberg strategies, which suggests a framework for
developing new multi-agent learning algorithms. We then define non-coincidental
games as games in which the Stackelberg strategy profile is not a Nash
Equilibrium. This notably includes several canonical matrix games and provides
a normative theory for why existing algorithms fail in self-play in such games.
We address this problem by introducing Welfare Equilibria (WE) as a
generalisation of Stackelberg Strategies, which can recover desirable Nash
Equilibria even in non-coincidental games. Finally, we introduce Welfare
Function Search (WelFuSe) as a practical approach to finding desirable WE
against unknown opponents, which finds more mutually desirable solutions in
self-play, while preserving performance against naive learning opponents.
Comment: 31 pages, 23 figures.
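To make the Stackelberg/Nash distinction concrete, the sketch below computes a pure-strategy Stackelberg commitment in a small matrix game (leader commits, follower best-responds) and checks whether the resulting profile is also a Nash equilibrium; in the paper's terms, the game is non-coincidental when it is not. The payoff matrices are made up for illustration, and mixed-strategy commitment is ignored for simplicity.

```python
# Illustrative sketch: pure-strategy Stackelberg commitment in a 2x2 matrix
# game, plus a check of whether that profile is also a Nash equilibrium.
# The payoffs below are illustrative and not taken from the paper.
import numpy as np

# Row player is the leader, column player is the follower.
leader_payoff = np.array([[2.0, 4.0],
                          [1.0, 3.0]])
follower_payoff = np.array([[1.0, 0.0],
                            [0.0, 1.0]])

def stackelberg_pure(leader_payoff, follower_payoff):
    """Leader commits to a row; follower best-responds; leader picks its best row."""
    best = None
    for i in range(leader_payoff.shape[0]):
        j = int(np.argmax(follower_payoff[i]))        # follower's best response
        if best is None or leader_payoff[i, j] > best[2]:
            best = (i, j, leader_payoff[i, j])
    return best[0], best[1]

def is_nash(i, j, leader_payoff, follower_payoff):
    """No player gains from a unilateral pure-strategy deviation."""
    return (leader_payoff[:, j].max() <= leader_payoff[i, j]
            and follower_payoff[i, :].max() <= follower_payoff[i, j])

i, j = stackelberg_pure(leader_payoff, follower_payoff)
print("Stackelberg profile:", (i, j),
      "also a Nash equilibrium?", is_nash(i, j, leader_payoff, follower_payoff))
# Prints False for these payoffs: in the paper's terminology, a non-coincidental game.
```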
Rethinking Out-of-Distribution Detection for Reinforcement Learning: Advancing Methods for Evaluation and Detection
While reinforcement learning (RL) algorithms have been successfully applied
across numerous sequential decision-making problems, their generalization to
unforeseen testing environments remains a significant concern. In this paper,
we study the problem of out-of-distribution (OOD) detection in RL, which
focuses on identifying situations at test time that RL agents have not
encountered in their training environments. We first propose a clarification of
terminology for OOD detection in RL, which aligns it with the literature from
other machine learning domains. We then present new benchmark scenarios for OOD
detection, which introduce anomalies with temporal autocorrelation into
different components of the agent-environment loop. We argue that such
scenarios have been understudied in the current literature, despite their
relevance to real-world situations. Confirming our theoretical predictions, our
experimental results suggest that state-of-the-art OOD detectors are not able
to identify such anomalies. To address this problem, we propose a novel method
for OOD detection, which we call DEXTER (Detection via Extraction of Time
Series Representations). By treating environment observations as time series
data, DEXTER extracts salient time series features, and then leverages an
ensemble of isolation forest algorithms to detect anomalies. We find that
DEXTER can reliably identify anomalies across benchmark scenarios, exhibiting
superior performance compared to both state-of-the-art OOD detectors and
high-dimensional changepoint detectors adopted from statistics.
Comment: Accepted as a full paper to the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024).
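A hedged sketch in the spirit of DEXTER, not a reproduction of it: windows of observations are treated as multivariate time series, a few simple per-dimension features (mean, standard deviation, lag-1 autocorrelation) are extracted, and an ensemble of scikit-learn isolation forests scores each window. The feature set, window length, and ensemble size are illustrative assumptions; the paper's feature extraction is richer.

```python
# Illustrative sketch: score observation windows with time-series features and
# an ensemble of isolation forests. Features and window length are assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

def window_features(window):
    """window: (T, obs_dim) array -> flat feature vector for the window."""
    feats = []
    for x in window.T:                       # one time series per observation dimension
        lag1 = np.corrcoef(x[:-1], x[1:])[0, 1] if x.std() > 0 else 0.0
        feats.extend([x.mean(), x.std(), lag1])
    return np.array(feats)

def fit_detector_ensemble(train_windows, n_members=5, seed=0):
    X = np.stack([window_features(w) for w in train_windows])
    return [IsolationForest(random_state=seed + k).fit(X) for k in range(n_members)]

def anomaly_score(ensemble, window):
    x = window_features(window).reshape(1, -1)
    # Lower score_samples means more anomalous; average over the ensemble.
    return -float(np.mean([m.score_samples(x) for m in ensemble]))

rng = np.random.default_rng(0)
train = [rng.normal(size=(50, 4)) for _ in range(200)]        # in-distribution windows
ens = fit_detector_ensemble(train)
ood = rng.normal(size=(50, 4)).cumsum(axis=0)                 # temporally autocorrelated anomaly
print(anomaly_score(ens, train[0]), anomaly_score(ens, ood))
```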
Perfectly Secure Steganography Using Minimum Entropy Coupling
Steganography is the practice of encoding secret information into innocuous
content in such a manner that an adversarial third party would not realize that
there is hidden meaning. While this problem has classically been studied in
security literature, recent advances in generative models have led to a shared
interest among security and machine learning researchers in developing scalable
steganography techniques. In this work, we show that a steganography procedure
is perfectly secure under Cachin (1998)'s information-theoretic model of
steganography if and only if it is induced by a coupling. Furthermore, we show
that, among perfectly secure procedures, a procedure maximizes information
throughput if and only if it is induced by a minimum entropy coupling. These
insights yield what are, to the best of our knowledge, the first steganography
algorithms to achieve perfect security guarantees for arbitrary covertext
distributions. To provide empirical validation, we compare a minimum entropy
coupling-based approach to three modern baselines -- arithmetic coding, Meteor,
and adaptive dynamic grouping -- using GPT-2, WaveRNN, and Image Transformer as
communication channels. We find that the minimum entropy coupling-based
approach achieves superior encoding efficiency, despite its stronger security
constraints. In aggregate, these results suggest that it may be natural to view
information-theoretic steganography through the lens of minimum entropy
coupling.
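The coupling primitive itself can be illustrated with the standard greedy approximation to a minimum entropy coupling between a message distribution and a covertext distribution. This is only the coupling step on toy distributions, not the paper's full steganography procedure, and the greedy scheme is an approximation rather than an exact minimum entropy coupling.

```python
# Illustrative sketch: greedy approximate minimum entropy coupling of two
# marginals. Toy distributions only; not the paper's full encoding procedure.
import numpy as np

def greedy_mec(p, q):
    """Greedy approximate min-entropy coupling of marginals p (len m) and q (len n)."""
    p, q = np.array(p, float).copy(), np.array(q, float).copy()
    joint = np.zeros((len(p), len(q)))
    while p.sum() > 1e-12 and q.sum() > 1e-12:
        i, j = int(np.argmax(p)), int(np.argmax(q))
        mass = min(p[i], q[j])                # allocate as much shared mass as possible
        joint[i, j] += mass
        p[i] -= mass
        q[j] -= mass
    return joint

message_dist = [0.5, 0.3, 0.2]            # distribution over secret messages (toy)
covertext_dist = [0.4, 0.3, 0.2, 0.1]     # model's next-token distribution (toy)
joint = greedy_mec(message_dist, covertext_dist)
print(joint)                               # rows sum to the message marginal, columns to the covertext marginal
```

Because the rows and columns of the joint recover the two marginals exactly, sampling covertext through such a coupling leaves the covertext distribution unchanged, which is the property the perfect-security argument relies on.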
Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning
In many real-world settings, a team of agents must coordinate its behaviour
while acting in a decentralised fashion. At the same time, it is often possible
to train the agents in a centralised fashion where global state information is
available and communication constraints are lifted. Learning joint
action-values conditioned on extra state information is an attractive way to
exploit centralised learning, but the best strategy for then extracting
decentralised policies is unclear. Our solution is QMIX, a novel value-based
method that can train decentralised policies in a centralised end-to-end
fashion. QMIX employs a mixing network that estimates joint action-values as a
monotonic combination of per-agent values. We structurally enforce that the
joint-action value is monotonic in the per-agent values, through the use of
non-negative weights in the mixing network, which guarantees consistency
between the centralised and decentralised policies. To evaluate the performance
of QMIX, we propose the StarCraft Multi-Agent Challenge (SMAC) as a new
benchmark for deep multi-agent reinforcement learning. We evaluate QMIX on a
challenging set of SMAC scenarios and show that it significantly outperforms
existing multi-agent reinforcement learning methods.
Comment: Extended version of the ICML 2018 conference paper (arXiv:1803.11485).
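The monotonic mixing mechanism described above can be sketched in a few lines of PyTorch: a hypernetwork conditioned on the global state generates the mixing weights, and taking their absolute value keeps Q_tot monotonic in every per-agent value. Layer sizes and the single hidden layer are illustrative choices rather than the exact published architecture.

```python
# Illustrative sketch of the described mechanism (sizes are arbitrary choices):
# a state-conditioned hypernetwork produces the mixing weights, and their
# absolute value keeps Q_tot monotonic in each per-agent value.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim) -> Q_tot: (batch,)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).unsqueeze(1)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).unsqueeze(1)
        return (torch.bmm(hidden, w2) + b2).view(-1)

mixer = MonotonicMixer(n_agents=3, state_dim=10)
q_tot = mixer(torch.randn(4, 3), torch.randn(4, 10))
print(q_tot.shape)  # torch.Size([4])
```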
Revealing Robust Oil and Gas Company Macro-Strategies using Deep Multi-Agent Reinforcement Learning
The energy transition potentially poses an existential risk for major
international oil companies (IOCs) if they fail to adapt to low-carbon business
models. Projections of energy futures, however, rest on diverging assumptions
about the scale and pace of that transition, causing disagreement among IOC
decision-makers and their stakeholders over what the business model of an
incumbent fossil fuel company should be. In this work, we used deep multi-agent
reinforcement learning to solve an energy systems wargame wherein players
simulate IOC decision-making, including hydrocarbon and low-carbon investment
decisions, dividend policies, and capital structure measures, through an
uncertain energy transition to explore critical and non-linear governance
questions, from leveraged transitions to reserve replacements. Adversarial play
facilitated by state-of-the-art algorithms revealed decision-making strategies
robust to energy transition uncertainty and against multiple IOCs. In all
games, robust strategies emerged in the form of low-carbon business models as a
result of early transition-oriented movement. IOCs adopting such strategies
outperformed business-as-usual and delayed transition strategies regardless of
hydrocarbon demand projections. In addition to maximizing value, these
strategies benefit greater society by contributing substantial amounts of
capital necessary to accelerate the global low-carbon energy transition. Our
findings point towards the need for lenders and investors to effectively
mobilize transition-oriented finance and engage with IOCs to ensure responsible
reallocation of capital towards low-carbon business models that would enable
the emergence of fossil fuel incumbents as future low-carbon leaders.
Equivariant Networks for Zero-Shot Coordination
Successful coordination in Dec-POMDPs requires agents to adopt robust
strategies and interpretable styles of play for their partner. A common failure
mode is symmetry breaking, when agents arbitrarily converge on one out of many
equivalent but mutually incompatible policies. Such cases commonly arise under
partial observability, e.g. waving your right hand vs. your left hand to convey a
covert message. In this paper, we present a novel equivariant network
architecture for use in Dec-POMDPs that effectively leverages environmental
symmetry for improving zero-shot coordination, doing so more effectively than
prior methods. Our method also acts as a "coordination-improvement operator"
for generic, pre-trained policies, and thus may be applied at test-time in
conjunction with any self-play algorithm. We provide theoretical guarantees of
our work and test on the AI benchmark task of Hanabi, where we demonstrate that
our method outperforms other symmetry-aware baselines in zero-shot coordination
and improves the coordination ability of a variety of pre-trained policies. In
particular, we show our method can be used to improve on the state of the art
for zero-shot coordination on the Hanabi benchmark.
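One simple way to obtain an equivariant policy, shown below under strong assumptions, is to symmetrise an arbitrary network by averaging over a small symmetry group: transform the input, apply the network, apply the inverse transformation to the output, and average. The group of feature permutations and the wrapper used here are illustrative; the paper's architecture and its use as a test-time coordination-improvement operator go beyond this.

```python
# Illustrative sketch: making an arbitrary policy network equivariant to a small
# symmetry group by averaging over group transformations ("symmetrisation").
# The permutation group and wrapper are assumptions made for this example.
import itertools
import torch
import torch.nn as nn

class SymmetrizedPolicy(nn.Module):
    def __init__(self, net, n_symbols):
        super().__init__()
        self.net = net
        self.perms = [torch.tensor(p) for p in itertools.permutations(range(n_symbols))]

    def forward(self, obs):
        # obs: (batch, n_symbols) -> logits: (batch, n_symbols)
        logits = 0
        for perm in self.perms:
            inv = torch.argsort(perm)
            logits = logits + self.net(obs[:, perm])[:, inv]  # transform, apply, untransform
        return logits / len(self.perms)

base = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 3))
policy = SymmetrizedPolicy(base, n_symbols=3)
obs = torch.randn(2, 3)
# Equivariance check: permuting the input permutes the output logits identically.
perm = torch.tensor([2, 0, 1])
print(torch.allclose(policy(obs)[:, perm], policy(obs[:, perm]), atol=1e-5))
```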