Efficient Exploration using Model-Based Quality-Diversity with Gradients
Exploration is a key challenge in Reinforcement Learning, especially in
long-horizon, deceptive and sparse-reward environments. For such applications,
population-based approaches have proven effective. Methods such as
Quality-Diversity address this by encouraging novel solutions and producing a
diversity of behaviours. However, these methods are driven either by
undirected sampling (i.e. mutations) or by approximated gradients (i.e.
Evolution Strategies) in the parameter space, which makes them highly
sample-inefficient. In this paper, we propose a model-based Quality-Diversity
approach. It extends existing QD methods to use gradients for efficient
exploitation and to leverage perturbations in imagination for efficient
exploration. Our approach optimizes all members of a population simultaneously
to maintain both performance and diversity efficiently by leveraging the
effectiveness of QD algorithms as good data generators to train deep models. We
demonstrate that it maintains the divergent search capabilities of
population-based approaches on tasks with deceptive rewards while significantly
improving their sample efficiency and the quality of their solutions.
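
To make the idea concrete, below is a minimal, hypothetical sketch of the loop such a method implies: a MAP-Elites-style archive supplies parents, random perturbations are screened cheaply under a learned model ("in imagination"), and a gradient step provides exploitation. The toy `true_eval`, `model_eval`, and `grad_step` functions and the grid discretisation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_eval(theta):
    """Ground-truth environment rollout (expensive). Toy stand-in."""
    fitness = -np.sum(theta ** 2)     # a deceptive task would look different
    behaviour = theta[:2]             # 2-D behaviour descriptor
    return fitness, behaviour

def model_eval(theta):
    """Imagined rollout under the learned model (cheap). Toy stand-in: here
    we reuse the true function; in practice a neural model trained on the
    archive's rollout data would be queried instead."""
    return true_eval(theta)

def grad_step(theta, lr=0.1):
    """Exploitation: gradient ascent on modelled fitness (analytic here;
    a real implementation would backprop through the learned model)."""
    return theta - lr * 2 * theta     # gradient of -||theta||^2

# MAP-Elites-style archive keyed by discretised behaviour descriptor.
archive = {}

def add_to_archive(theta, fitness, behaviour):
    key = tuple(np.round(behaviour, 1))
    if key not in archive or archive[key][0] < fitness:
        archive[key] = (fitness, theta)

theta0 = rng.normal(size=8)
add_to_archive(theta0, *true_eval(theta0))

for _ in range(200):
    # Pick a random elite; branch via perturbations screened in imagination,
    # plus one gradient-based exploitation candidate.
    _, parent = archive[list(archive)[rng.integers(len(archive))]]
    candidates = [parent + 0.2 * rng.normal(size=parent.shape) for _ in range(8)]
    candidates.append(grad_step(parent))
    # Rank candidates cheaply with the model; evaluate only the best for real.
    best = max(candidates, key=lambda t: model_eval(t)[0])
    add_to_archive(best, *true_eval(best))

print(f"archive size: {len(archive)}")
```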
Open-ended Learning in Symmetric Zero-sum Games
Zero-sum games such as chess and poker are, abstractly, functions that
evaluate pairs of agents, for example labeling them 'winner' and 'loser'. If
the game is approximately transitive, then self-play generates sequences of
agents of increasing strength. However, nontransitive games, such as
rock-paper-scissors, can exhibit strategic cycles, and there is no longer a
clear objective -- we want agents to increase in strength, but against whom is
unclear. In this paper, we introduce a geometric framework for formulating
agent objectives in zero-sum games, in order to construct adaptive sequences of
objectives that yield open-ended learning. The framework allows us to reason
about population performance in nontransitive games, and enables the
development of a new algorithm (rectified Nash response, PSRO_rN) that uses
game-theoretic niching to construct diverse populations of effective agents,
producing a stronger set of agents than existing algorithms. We apply PSRO_rN
to two highly nontransitive resource allocation games and find that PSRO_rN
consistently outperforms the existing alternatives.Comment: ICML 2019, final versio
Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents
Evolution strategies (ES) are a family of black-box optimization algorithms
able to train deep neural networks roughly as well as Q-learning and policy
gradient methods on challenging deep reinforcement learning (RL) problems, but
are much faster (e.g. hours vs. days) because they parallelize better. However,
many RL problems require directed exploration because they have reward
functions that are sparse or deceptive (i.e. contain local optima), and it is
unknown how to encourage such exploration with ES. Here we show that algorithms
that have been invented to promote directed exploration in small-scale evolved
neural networks via populations of exploring agents, specifically novelty
search (NS) and quality diversity (QD) algorithms, can be hybridized with ES to
improve its performance on sparse or deceptive deep RL tasks, while retaining
scalability. Our experiments confirm that the resulting algorithms, NS-ES and
the two QD variants NSR-ES and NSRA-ES, avoid local optima encountered by ES
and achieve higher performance on Atari and on simulated robots learning to walk
around a deceptive trap. This paper thus introduces a family of fast, scalable
algorithms for reinforcement learning that are capable of directed exploration.
It also adds this new family of exploration algorithms to the RL toolbox and
raises the interesting possibility that analogous algorithms with multiple
simultaneous paths of exploration might also combine well with existing RL
algorithms outside ES.
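
Below is a minimal single-agent sketch of the blended update, closest in spirit to NSR-ES (the paper maintains a meta-population of agents, and NSRA-ES adapts the blend weight online rather than fixing it). The `behaviour` and `reward` functions are toy stand-ins for real rollouts.

```python
import numpy as np

rng = np.random.default_rng(0)

def behaviour(theta):
    """Behaviour characterisation, e.g. a robot's final (x, y).
    Toy stand-in: simply the first two parameters."""
    return theta[:2]

def reward(theta):
    """Episodic return from a rollout. Toy stand-in."""
    return -np.sum((theta - 3.0) ** 2)

def novelty(b, archive, k=5):
    """Mean distance to the k nearest behaviours in the archive."""
    if not archive:
        return 0.0
    d = np.sort([np.linalg.norm(b - a) for a in archive])
    return float(np.mean(d[:k]))

def nsr_es_step(theta, archive, sigma=0.1, lr=0.02, n=100, w=0.5):
    """One NSR-ES-style update: the ES gradient estimate scores each
    perturbation by a fixed blend of reward and novelty."""
    eps = rng.normal(size=(n, theta.size))
    scores = np.array([
        w * reward(theta + sigma * e) +
        (1 - w) * novelty(behaviour(theta + sigma * e), archive)
        for e in eps
    ])
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    return theta + lr / (n * sigma) * eps.T @ scores

theta = rng.normal(size=10)
archive = []
for _ in range(50):
    theta = nsr_es_step(theta, archive, w=0.5)
    archive.append(behaviour(theta))
print("final reward:", round(reward(theta), 3))
```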
Two-Timescale Learning Using Idiotypic Behaviour Mediation For A Navigating Mobile Robot
A combined Short-Term Learning (STL) and Long-Term Learning (LTL) approach to
solving mobile-robot navigation problems is presented and tested in both the
real and virtual domains. The LTL phase consists of rapid simulations that use
a Genetic Algorithm to derive diverse sets of behaviours, encoded as variable
sets of attributes, and the STL phase is an idiotypic Artificial Immune System.
Results from the LTL phase show that sets of behaviours develop very rapidly,
and significantly greater diversity is obtained when multiple autonomous
populations are used, rather than a single one. The architecture is assessed
under various scenarios, including removal of the LTL phase and switching off
the idiotypic mechanism in the STL phase. The comparisons provide substantial
evidence that the best option is the inclusion of both the LTL phase and the
idiotypic system. In addition, this paper shows that structurally different
environments can be used for the two phases without compromising
transferability.

Comment: 40 pages, 12 tables, Journal of Applied Soft Computing
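
For flavour, here is a toy sketch of idiotypic behaviour mediation in the STL phase, loosely following Farmer-style immune-network dynamics: antibody (behaviour) concentrations rise with antigen (sensor) match and inter-antibody stimulation, and fall with suppression and decay. The matrices, constants, and sign conventions are all illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Antibodies are candidate behaviours (e.g. evolved in the LTL/GA phase);
# an antigen is the current sensor reading.
n_behaviours, n_features = 6, 4
paratopes = rng.random((n_behaviours, n_features))  # what each behaviour matches
epitopes = rng.random((n_behaviours, n_features))   # how each behaviour is seen
M = paratopes @ epitopes.T   # M[i, j]: degree to which antibody i recognises j
conc = np.ones(n_behaviours)                        # antibody concentrations

def idiotypic_select(antigen, conc, b1=0.5, b2=0.3, b3=0.4, decay=0.1):
    """One Farmer-style update: recognising others stimulates, being
    recognised suppresses (one common convention); then pick the behaviour
    with the highest concentration-weighted antigen match."""
    match = paratopes @ antigen
    dconc = conc * (b1 * match
                    + b2 * (M @ conc)      # idiotypic stimulation
                    - b3 * (M.T @ conc)    # idiotypic suppression
                    - decay)
    conc = np.clip(conc + 0.05 * dconc, 1e-3, 10.0)
    return int(np.argmax(conc * match)), conc

antigen = rng.random(n_features)
for _ in range(20):
    choice, conc = idiotypic_select(antigen, conc)
print("selected behaviour:", choice)
```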
Automatic Curriculum Learning For Deep RL: A Short Survey
Automatic Curriculum Learning (ACL) has become a cornerstone of recent
successes in Deep Reinforcement Learning (DRL). These methods shape the learning
trajectories of agents by challenging them with tasks adapted to their
capacities. In recent years, they have been used to improve sample efficiency
and asymptotic performance, to organize exploration, to encourage
generalization or to solve sparse reward problems, among others. The ambition
of this work is dual: 1) to present a compact and accessible introduction to
the Automatic Curriculum Learning literature and 2) to draw a bigger picture of
the current state of the art in ACL to encourage the cross-breeding of existing
concepts and the emergence of new ideas.

Comment: Accepted at IJCAI 2020
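
One concrete recipe from this literature is to sample tasks in proportion to absolute learning progress, so the curriculum tracks tasks the agent is currently improving (or regressing) on. The sketch below is a minimal such sampler; the class name, constants, and update rule are illustrative, not taken from any specific surveyed method.

```python
import numpy as np

rng = np.random.default_rng(0)

class LPCurriculum:
    """Samples tasks proportionally to |learning progress| (ALP),
    with a residual uniform-exploration probability."""

    def __init__(self, n_tasks, eps=0.2):
        self.old = np.zeros(n_tasks)   # running mean return per task
        self.lp = np.zeros(n_tasks)    # |learning progress| estimates
        self.eps = eps

    def sample_task(self):
        if rng.random() < self.eps or self.lp.sum() == 0:
            return int(rng.integers(len(self.lp)))
        return int(rng.choice(len(self.lp), p=self.lp / self.lp.sum()))

    def update(self, task, ret):
        # Exponential moving averages of progress and of the return itself.
        self.lp[task] = 0.9 * self.lp[task] + 0.1 * abs(ret - self.old[task])
        self.old[task] = 0.9 * self.old[task] + 0.1 * ret

cur = LPCurriculum(n_tasks=5)
for step in range(1000):
    task = cur.sample_task()
    # Fake returns that improve over time, faster on harder tasks.
    ret = float(np.tanh(step / 300) * (task + 1)) + rng.normal(0, 0.1)
    cur.update(task, ret)
print("sampling weights:", np.round(cur.lp / cur.lp.sum(), 2))
```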