Perturbation-based exploration methods in deep reinforcement learning
Recent research on structured exploration has emphasized identifying
novel states in the state space and incentivizing the agent to revisit them
through intrinsic reward bonuses. In this study, we question whether the
performance boost demonstrated by these methods is indeed due to the
discovery of structure in the agent's exploratory schedule, or whether the
benefit is largely attributable to the perturbations in the policy and reward
spaces that arise in pursuit of structured exploration. We investigate the
effect of perturbations in the policy and reward spaces on the exploratory
behavior of the agent, and show that the simple acts of perturbing the
policy just before the softmax layer and introducing sporadic reward
bonuses into the domain can greatly enhance exploration in several domains of
the Arcade Learning Environment. In light of these findings, we recommend
benchmarking any enhancements to structured exploration research against the
backdrop of noisy exploration.
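The two perturbations named in the abstract are simple to reproduce. Below is a minimal sketch, not the authors' implementation: Gaussian noise is added to the pre-softmax logits before sampling, and a small bonus is added to the environment reward at random steps. The noise scale, bonus size, and bonus probability are illustrative assumptions.

```python
# Minimal sketch of the two perturbations described above; all hyperparameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def perturbed_action(logits, noise_scale=0.5):
    """Sample an action after adding Gaussian noise to the logits (policy-space perturbation)."""
    noisy = logits + rng.normal(0.0, noise_scale, size=logits.shape)
    probs = softmax(noisy)
    return rng.choice(len(probs), p=probs)

def sporadic_bonus(reward, bonus=1.0, p=0.01):
    """With small probability, add a bonus to the environment reward (reward-space perturbation)."""
    return reward + (bonus if rng.random() < p else 0.0)

# Example: one perturbed decision for a 4-action policy.
logits = np.array([2.0, 0.5, 0.1, -1.0])
action = perturbed_action(logits)
reward = sporadic_bonus(reward=0.0)
print(action, reward)
```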
Novelty Search in representational space for sample efficient exploration
We present a new approach for efficient exploration which leverages a
low-dimensional encoding of the environment learned with a combination of
model-based and model-free objectives. Our approach uses intrinsic rewards that
are based on the distance of nearest neighbors in the low dimensional
representational space to gauge novelty. We then leverage these intrinsic
rewards for sample-efficient exploration with planning routines in
representational space for hard exploration tasks with sparse rewards. One key
element of our approach is the use of information-theoretic principles to shape
our representations so that our novelty reward goes beyond pixel
similarity. We test our approach on a number of maze tasks, as well as a
control problem, and show that our exploration approach is more sample-efficient
than strong baselines.
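A nearest-neighbour novelty reward of the kind described above can be sketched as follows. The encoder producing the low-dimensional codes is assumed to exist already; the latent dimension, the neighbour count k, and the default reward for an empty buffer are illustrative assumptions rather than the paper's settings.

```python
# Minimal sketch of a k-nearest-neighbour novelty reward in a learned latent space.
import numpy as np

def knn_novelty_reward(z, latent_buffer, k=10):
    """Average distance from latent code z to its k nearest neighbours in the buffer."""
    if len(latent_buffer) == 0:
        return 1.0  # everything is novel at the start (assumption)
    dists = np.linalg.norm(np.asarray(latent_buffer) - z, axis=1)
    k = min(k, len(dists))
    return float(np.sort(dists)[:k].mean())

# Example with a hypothetical 8-dimensional latent space.
rng = np.random.default_rng(0)
buffer = [rng.normal(size=8) for _ in range(100)]
z_new = rng.normal(size=8)
print(knn_novelty_reward(z_new, buffer))
```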
Evaluating Agents without Rewards
Reinforcement learning has enabled agents to solve challenging tasks in
unknown environments. However, manually crafting reward functions can be
time-consuming, expensive, and prone to human error. Competing objectives have
been proposed for agents to learn without external supervision, but it has been
unclear how well they reflect task rewards or human behavior. To accelerate the
development of intrinsic objectives, we retrospectively compute potential
objectives on pre-collected datasets of agent behavior, rather than optimizing
them online, and compare them by analyzing their correlations. We study input
entropy, information gain, and empowerment across seven agents, three Atari
games, and the 3D game Minecraft. We find that all three intrinsic objectives
correlate more strongly with a human behavior similarity metric than with task
reward. Moreover, input entropy and information gain correlate more strongly
with human similarity than task reward does, suggesting the use of intrinsic
objectives for designing agents that behave similarly to human players.
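The retrospective evaluation described above can be illustrated with input entropy alone. The sketch below uses a crude histogram entropy estimator and synthetic per-episode data; the estimator, the episode count, and the correlation measure (Pearson) are assumptions for illustration, not the paper's protocol.

```python
# Minimal sketch: compute an intrinsic objective on pre-collected data and correlate it
# with task reward and a human-similarity score. Data and estimator are synthetic stand-ins.
import numpy as np
from scipy.stats import pearsonr

def input_entropy(observations, bins=16):
    """Histogram-based entropy estimate over flattened observations (a crude stand-in)."""
    hist, _ = np.histogram(np.asarray(observations).ravel(), bins=bins, density=True)
    hist = hist[hist > 0]
    hist = hist / hist.sum()
    return float(-(hist * np.log(hist)).sum())

# Hypothetical per-episode data for 50 episodes.
rng = np.random.default_rng(0)
entropies = [input_entropy(rng.normal(size=(100, 4))) for _ in range(50)]
task_reward = rng.normal(size=50)
human_similarity = rng.normal(size=50)
print(pearsonr(entropies, task_reward))
print(pearsonr(entropies, human_similarity))
```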
Decoupled Exploration and Exploitation Policies for Sample-Efficient Reinforcement Learning
Despite the close connection between exploration and sample efficiency, most
state-of-the-art reinforcement learning algorithms include no considerations
for exploration beyond maximizing the entropy of the policy. In this work we
address this seeming missed opportunity. We observe that the most common
formulation of directed exploration in deep RL, known as bonus-based
exploration (BBE), suffers from bias and slow coverage in the few-sample
regime. This causes BBE to be actively detrimental to policy learning in many
control tasks. We show that by decoupling the task policy from the exploration
policy, directed exploration can be highly effective for sample-efficient
continuous control. Our method, Decoupled Exploration and Exploitation Policies
(DEEP), can be combined with any off-policy RL algorithm without modification.
When used in conjunction with soft actor-critic, DEEP incurs no performance
penalty in densely-rewarding environments. On sparse environments, DEEP gives a
several-fold improvement in data efficiency due to better exploration.
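At the level of the training loop, the decoupling reads roughly as sketched below. The environment, replay buffer, and both policy objects are hypothetical interfaces, and the actual DEEP update rules (in particular how the exploration policy is rewarded) are not reproduced here.

```python
# Structural sketch of decoupled exploration/exploitation training; all interfaces are hypothetical.
def train_decoupled(env, explore_policy, task_policy, replay_buffer, steps=10_000):
    obs = env.reset()
    for _ in range(steps):
        action = explore_policy.act(obs)             # exploration policy collects the data
        next_obs, reward, done, _ = env.step(action)
        replay_buffer.add(obs, action, reward, next_obs, done)

        batch = replay_buffer.sample(256)
        explore_policy.update(batch)                 # e.g. trained on an exploration objective
        task_policy.update(batch)                    # trained off-policy on the task reward only

        obs = env.reset() if done else next_obs
    return task_policy
```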
Temporally-Extended ε-Greedy Exploration
Recent work on exploration in reinforcement learning (RL) has led to a series
of increasingly complex solutions to the problem. This increase in complexity
often comes at the expense of generality. Recent empirical studies suggest
that, when applied to a broader set of domains, some sophisticated exploration
methods are outperformed by simpler counterparts, such as ε-greedy. In
this paper we propose an exploration algorithm that retains the simplicity of
ε-greedy while reducing dithering. We build on a simple hypothesis:
the main limitation of ε-greedy exploration is its lack of temporal
persistence, which limits its ability to escape local optima. We propose a
temporally extended form of ε-greedy that simply repeats the sampled
action for a random duration. It turns out that, for many duration
distributions, this suffices to improve exploration on a large set of domains.
Interestingly, a class of distributions inspired by ecological models of animal
foraging behaviour yields particularly strong performance.
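The mechanism is compact enough to sketch directly: with probability ε, an action and a duration are sampled together and the action is repeated for that many steps. The duration distribution below (a zeta/Zipf draw) and the hyperparameter values are assumptions for illustration.

```python
# Minimal sketch of temporally-extended epsilon-greedy action selection.
import numpy as np

class TemporallyExtendedEpsilonGreedy:
    def __init__(self, n_actions, epsilon=0.1, mu=2.0, seed=0):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.mu = mu                       # exponent of the assumed zeta duration distribution
        self.rng = np.random.default_rng(seed)
        self.action = None
        self.steps_left = 0                # remaining steps of the current repeated action

    def act(self, q_values):
        if self.steps_left > 0:            # keep repeating the committed exploratory action
            self.steps_left -= 1
            return self.action
        if self.rng.random() < self.epsilon:
            self.action = int(self.rng.integers(self.n_actions))
            self.steps_left = int(self.rng.zipf(self.mu)) - 1
            return self.action
        return int(np.argmax(q_values))

# Example: act for a few steps with dummy Q-values.
policy = TemporallyExtendedEpsilonGreedy(n_actions=4, epsilon=0.3)
for _ in range(5):
    print(policy.act(np.array([0.1, 0.5, 0.2, 0.0])))
```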
First return, then explore
The promise of reinforcement learning is to solve complex sequential decision
problems by specifying a high-level reward function only. However, RL
algorithms struggle when, as is often the case, simple and intuitive rewards
provide sparse and deceptive feedback. Avoiding these pitfalls requires
thoroughly exploring the environment, but despite substantial investments by
the community, creating algorithms that can do so remains one of the central
challenges of the field. We hypothesize that the main impediment to effective
exploration originates from algorithms forgetting how to reach previously
visited states ("detachment") and from failing to first return to a state
before exploring from it ("derailment"). We introduce Go-Explore, a family of
algorithms that addresses these two challenges directly through the simple
principles of explicitly remembering promising states and first returning to
such states before exploring. Go-Explore solves all heretofore unsolved Atari
games (those for which algorithms could not previously outperform humans when
evaluated following current community standards) and surpasses the state of the
art on all hard-exploration games, with orders of magnitude improvements on the
grand challenges Montezuma's Revenge and Pitfall. We also demonstrate the
practical potential of Go-Explore on a challenging and extremely sparse-reward
robotics task. Additionally, we show that adding a goal-conditioned policy can
further improve Go-Explore's exploration efficiency and enable it to handle
stochasticity throughout training. The striking contrast between the
substantial performance gains from Go-Explore and the simplicity of its
mechanisms suggests that remembering promising states, returning to them, and
exploring from them is a powerful and general approach to exploration, an
insight that may prove critical to the creation of truly intelligent learning
agents.
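The remember-return-explore loop can be sketched at a high level as below. The environment is assumed to expose save_state/restore_state snapshots (one of the return mechanisms Go-Explore uses in deterministic settings); the cell discretisation, the uniform cell selection, and the random exploration are deliberately simplified placeholders rather than the published algorithm.

```python
# High-level sketch of the remember/return/explore loop; the environment interface is hypothetical.
import random

def cell_of(obs):
    """Coarse discretisation of an observation into a cell key (placeholder)."""
    return tuple(round(x, 1) for x in obs)

def go_explore(env, iterations=1000, explore_steps=20, seed=0):
    rng = random.Random(seed)
    obs = env.reset()
    archive = {cell_of(obs): env.save_state()}      # cell -> snapshot that reaches it

    for _ in range(iterations):
        # 1. Select a promising cell (uniformly here; Go-Explore uses smarter heuristics).
        cell = rng.choice(list(archive))
        # 2. Return: restore the snapshot that reaches that cell.
        obs = env.restore_state(archive[cell])
        # 3. Explore from it with random actions, adding newly reached cells to the archive.
        for _ in range(explore_steps):
            obs, reward, done, _ = env.step(env.action_space_sample())
            c = cell_of(obs)
            if c not in archive:
                archive[c] = env.save_state()
            if done:
                break
    return archive
```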