Learning Curricula in Open-Ended Worlds
Deep reinforcement learning (RL) provides powerful methods for training
optimal sequential decision-making agents. As collecting real-world
interactions can entail additional costs and safety risks, the common paradigm
of sim2real conducts training in a simulator, followed by real-world
deployment. Unfortunately, RL agents easily overfit to the choice of simulated
training environments, and worse still, learning ends when the agent masters
the specific set of simulated environments. In contrast, the real world is
highly open-ended, featuring endlessly evolving environments and challenges,
making such RL approaches unsuitable. Simply randomizing across a large
space of simulated environments is insufficient, as it requires making
arbitrary distributional assumptions, and as the design space grows, it
becomes combinatorially less likely to sample the specific environment
instances that are useful for learning. An ideal learning process
should automatically adapt the training environment to maximize the learning
potential of the agent over an open-ended task space that matches or surpasses
the complexity of the real world. This thesis develops a class of methods
called Unsupervised Environment Design (UED), which seek to enable such
open-ended processes via a principled approach for gradually improving
the robustness and generality of the learning agent. Given a potentially
open-ended environment design space, UED automatically
generates an infinite sequence or curriculum of training environments at the
frontier of the learning agent's capabilities. Through extensive empirical
studies and theoretical arguments founded on minimax-regret decision theory and
game theory, the findings in this thesis show that UED autocurricula can
produce RL agents exhibiting significantly improved robustness and
generalization to previously unseen environment instances. Such autocurricula
are promising paths toward open-ended learning systems that achieve more
general intelligence by continually generating and mastering additional
challenges of their own design.
Comment: PhD dissertation
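The regret-driven curriculum described above can be illustrated with a
minimal sketch (all names and dynamics below are illustrative assumptions,
not the thesis's actual algorithms): a teacher repeatedly proposes the
candidate level on which the student's shortfall relative to an optimal
policy is largest, which keeps training at the frontier of
solvable-but-unmastered levels while assigning zero value to impossible
ones.

    import random

    # Toy model: levels are scalar "difficulties", the student is a scalar
    # "skill", and levels above MAX_SOLVABLE cannot be solved by any
    # policy, so a regret-based teacher assigns them zero learning value.
    MAX_SOLVABLE = 7.0

    def optimal_return(d):
        return 1.0 if d <= MAX_SOLVABLE else 0.0

    def student_return(skill, d):
        # Performance degrades smoothly as difficulty exceeds skill.
        return optimal_return(d) * max(0.0, min(1.0, 1.0 - (d - skill) / 10.0))

    def regret(skill, d):
        # Minimax-regret objective: shortfall relative to an optimal policy.
        return optimal_return(d) - student_return(skill, d)

    def ued_loop(steps=500, n_candidates=16, lr=0.5, seed=0):
        rng = random.Random(seed)
        skill = 0.0
        for _ in range(steps):
            candidates = [rng.uniform(0.0, 10.0) for _ in range(n_candidates)]
            # The teacher proposes the highest-regret level: hard but solvable.
            d = max(candidates, key=lambda c: regret(skill, c))
            # Progress scales with how informative the chosen level is.
            skill += lr * regret(skill, d) * student_return(skill, d)
        return skill

    print(f"final skill: {ued_loop():.2f}")  # climbs toward MAX_SOLVABLE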
Reward-Free Curricula for Training Robust World Models
There has been a recent surge of interest in developing generally-capable
agents that can adapt to new tasks without additional training in the
environment. Learning world models from reward-free exploration is a promising
approach, and enables policies to be trained using imagined experience for new
tasks. However, achieving a general agent requires robustness across different
environments. In this work, we address the novel problem of generating
curricula in the reward-free setting to train robust world models. We consider
robustness in terms of minimax regret over all environment instantiations and
show that the minimax regret can be connected to minimising the maximum error
in the world model across environment instances. This result informs our
algorithm, WAKER: Weighted Acquisition of Knowledge across Environments for
Robustness. WAKER selects environments for data collection based on the
estimated error of the world model for each environment. Our experiments
demonstrate that WAKER outperforms several baselines, resulting in improved
robustness, efficiency, and generalisation.
Comment: ICLR 2024
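The link between minimax regret and maximum world-model error suggests a
simple acquisition rule. Below is a minimal sketch under assumptions of
ours (the environment names, the moving-average error estimate, and
error-proportional sampling are illustrative; WAKER's actual acquisition
functions differ in detail):

    import random

    def select_environment(error_estimates, rng=random):
        """Pick the next environment for data collection with probability
        proportional to the estimated world-model error."""
        envs = list(error_estimates)
        weights = [max(error_estimates[e], 1e-8) for e in envs]  # keep > 0
        return rng.choices(envs, weights=weights, k=1)[0]

    def update_error(error_estimates, env, prediction_loss, decay=0.9):
        # Exponential moving average of the model's loss per environment.
        error_estimates[env] = (decay * error_estimates.get(env, 0.0)
                                + (1.0 - decay) * prediction_loss)

    # Hypothetical usage with stand-in losses:
    errors = {"maze-small": 0.0, "maze-large": 0.0, "car-race": 0.0}
    for _ in range(1000):
        env = select_environment(errors)
        update_error(errors, env, prediction_loss=random.random())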
Prioritized Level Replay
Environments with procedurally generated content serve as important
benchmarks for testing systematic generalization in deep reinforcement
learning. In this setting, each level is an algorithmically created environment
instance with a unique configuration of its factors of variation. Training on a
prespecified subset of levels allows for testing generalization to unseen
levels. What can be learned from a level depends on the current policy, yet
prior work defaults to uniform sampling of training levels independently of the
policy. We introduce Prioritized Level Replay (PLR), a general framework for
selectively sampling the next training level by prioritizing those with higher
estimated learning potential when revisited in the future. We show TD-errors
effectively estimate a level's future learning potential and, when used to
guide the sampling procedure, induce an emergent curriculum of increasingly
difficult levels. By adapting the sampling of training levels, PLR
significantly improves sample efficiency and generalization on Procgen
Benchmark, matching the previous state-of-the-art in test return, and
readily combines with other methods. Combined with the previous leading
method, PLR raises the state-of-the-art to over a 76% improvement in test
return relative to standard RL baselines.
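As a concrete sketch of the prioritization step, the snippet below turns
per-level scores into a rank-based replay distribution, in the spirit of
PLR's scoring of levels by average absolute TD error; the staleness
mixture and replay-versus-new-level decision used by the full method are
omitted, and the scores shown are hypothetical.

    import numpy as np

    def plr_sampling_probs(scores, beta=0.1):
        """Rank-based prioritization: levels whose last visit produced
        higher learning-potential scores get more replay probability."""
        scores = np.asarray(scores, dtype=float)
        ranks = np.empty(len(scores))
        ranks[np.argsort(-scores)] = np.arange(1, len(scores) + 1)  # best -> 1
        weights = (1.0 / ranks) ** (1.0 / beta)
        return weights / weights.sum()

    scores = [0.9, 0.1, 0.5, 0.3]  # hypothetical per-level |TD error| means
    next_level = np.random.choice(len(scores), p=plr_sampling_probs(scores))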
Stabilizing Unsupervised Environment Design with a Learned Adversary
A key challenge in training generally-capable agents is the design of
training tasks that facilitate broad generalization and robustness to
environment variations. This challenge motivates the problem setting of
Unsupervised Environment Design (UED), whereby a student agent trains on an
adaptive distribution of tasks proposed by a teacher agent. A pioneering
approach for UED is PAIRED, which uses reinforcement learning (RL) to train a
teacher policy to design tasks from scratch, making it possible to directly
generate tasks that are adapted to the agent's current capabilities. Despite
its strong theoretical backing, PAIRED suffers from a variety of challenges
that hinder its practical performance. Thus, state-of-the-art methods currently
rely on curation and mutation rather than generation of new tasks. In this
work, we investigate several key shortcomings of PAIRED and propose solutions
for each shortcoming. As a result, we make it possible for PAIRED to match or
exceed state-of-the-art methods, producing robust agents in several established
challenging procedurally-generated environments, including a partially-observed
maze navigation task and a continuous-control car racing environment. We
believe this work motivates a renewed emphasis on UED methods based on learned
models that directly generate challenging environments, potentially unlocking
more open-ended RL training and, as a result, more general agents.
Comment: CoLLAs 2023 - Oral; Second and third authors contributed equally
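For reference, the regret signal that PAIRED's teacher maximizes can be
sketched in a few lines (the episode returns below are hypothetical, and
the surrounding training machinery is omitted):

    def paired_teacher_reward(antagonist_returns, protagonist_returns):
        """PAIRED-style regret estimate: the gap between the best
        antagonist episode return and the protagonist's mean return on
        the level the teacher proposed."""
        return max(antagonist_returns) - (sum(protagonist_returns)
                                          / len(protagonist_returns))

    # Hypothetical per-episode returns on one proposed level:
    regret = paired_teacher_reward([0.8, 0.9, 0.7], [0.2, 0.4, 0.3])  # 0.6
    # The teacher is trained with RL to maximize this signal, while the
    # protagonist learns to minimize it, yielding a minimax-regret game.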
Improving Intrinsic Exploration with Language Abstractions
Reinforcement learning (RL) agents are particularly hard to train when
rewards are sparse. One common solution is to use intrinsic rewards to
encourage agents to explore their environment. However, recent intrinsic
exploration methods often use state-based novelty measures which reward
low-level exploration and may not scale to domains requiring more abstract
skills. Instead, we explore natural language as a general medium for
highlighting relevant abstractions in an environment. Unlike previous work, we
evaluate whether language can improve over existing exploration methods by
directly extending (and comparing to) competitive intrinsic exploration
baselines: AMIGo (Campero et al., 2021) and NovelD (Zhang et al., 2021). These
language-based variants outperform their non-linguistic forms by 47-85% across
13 challenging tasks from the MiniGrid and MiniHack environment suites.
Comment: NeurIPS 2022
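To make the idea concrete, here is a minimal count-based stand-in for a
language-driven exploration bonus (the paper's variants instead extend
AMIGo's learned teacher and NovelD's learned novelty estimates; this stub
only conveys the underlying intuition):

    def language_novelty_bonus(message, counts, scale=1.0):
        """Reward the agent when the environment's language annotation
        is rare; repeated messages earn diminishing bonuses."""
        counts[message] = counts.get(message, 0) + 1
        return scale / counts[message] ** 0.5

    counts = {}
    r1 = language_novelty_bonus("you open the door", counts)  # 1.0
    r2 = language_novelty_bonus("you open the door", counts)  # ~0.71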
ADGym: Design Choices for Deep Anomaly Detection
Deep learning (DL) techniques have recently found success in anomaly
detection (AD) across various fields such as finance, medical services, and
cloud computing. However, most of the current research tends to view deep AD
algorithms as a whole, without dissecting the contributions of individual
design choices like loss functions and network architectures. This view tends
to diminish the value of preliminary steps like data preprocessing, as more
attention is given to newly designed loss functions, network architectures, and
learning paradigms. In this paper, we aim to bridge this gap by asking two key
questions: (i) Which design choices in deep AD methods are crucial for
detecting anomalies? (ii) How can we automatically select the optimal design
choices for a given AD dataset, instead of relying on generic, pre-existing
solutions? To address these questions, we introduce ADGym, a platform
specifically crafted for comprehensive evaluation and automatic selection of AD
design elements in deep methods. Our extensive experiments reveal that relying
solely on existing leading methods is not sufficient. In contrast, models
developed using ADGym significantly surpass current state-of-the-art
techniques.
Comment: NeurIPS 2023. The first three authors contribute equally. Code
available at https://github.com/Minqi824/ADGym
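The kind of search ADGym automates can be sketched as follows (the design
space and scoring function here are hypothetical placeholders, far
simpler than the platform's actual grid of preprocessing, architecture,
loss, and training-paradigm choices):

    from itertools import product

    # Hypothetical design space for a deep anomaly-detection pipeline.
    design_space = {
        "preprocessing": ["none", "minmax", "zscore"],
        "hidden_layers": [1, 2, 3],
        "loss": ["bce", "deviation", "focal"],
    }

    def evaluate(design, dataset):
        # Stand-in score; a real pipeline would train the model specified
        # by `design` on `dataset` and report a metric such as AUC-ROC.
        return (hash((tuple(sorted(design.items())), dataset)) % 1000) / 1000.0

    def best_design(dataset):
        combos = [dict(zip(design_space, values))
                  for values in product(*design_space.values())]
        return max(combos, key=lambda d: evaluate(d, dataset))

    print(best_design("my-ad-dataset"))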
Grounding Aleatoric Uncertainty in Unsupervised Environment Design
Adaptive curricula in reinforcement learning (RL) have proven effective for
producing policies robust to discrepancies between the train and test
environment. Recently, the Unsupervised Environment Design (UED) framework
generalized RL curricula to generating sequences of entire environments,
leading to new methods with robust minimax regret properties. Problematically,
in partially-observable or stochastic settings, optimal policies may depend on
the ground-truth distribution over aleatoric parameters of the environment in
the intended deployment setting, while curriculum learning necessarily shifts
the training distribution. We formalize this phenomenon as curriculum-induced
covariate shift (CICS), and describe how its occurrence in aleatoric parameters
can lead to suboptimal policies. Directly sampling these parameters from the
ground-truth distribution avoids the issue, but thwarts curriculum learning. We
propose SAMPLR, a minimax regret UED method that optimizes the ground-truth
utility function, even when the underlying training data is biased due to CICS.
We prove, and validate on challenging domains, that our approach preserves
optimality under the ground-truth distribution, while promoting robustness
across the full range of environment settings.
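The distinction SAMPLR draws between curriculum-controlled and aleatoric
parameters can be sketched as below (parameter names are illustrative;
SAMPLR's actual mechanism corrects the training signal with fictitious
transitions rather than simply fixing the sampling):

    import random

    def make_level(curriculum_params, aleatoric_sampler, rng=random):
        """The curriculum may freely bias a level's 'free' parameters
        (e.g., maze layout), but aleatoric parameters (e.g., a coin-flip
        outcome) are drawn from their ground-truth distribution, keeping
        the training signal consistent with deployment."""
        level = dict(curriculum_params)              # curriculum-chosen
        level["aleatoric"] = aleatoric_sampler(rng)  # ground-truth draw
        return level

    level = make_level({"wall_density": 0.4},
                       lambda rng: rng.random() < 0.5)  # fair coin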