Planning to Explore via Self-Supervised World Models
Reinforcement learning allows solving complex tasks; however, the learning
tends to be task-specific and sample efficiency remains a challenge. We
present Plan2Explore, a self-supervised reinforcement learning agent that
tackles both these challenges through a new approach to self-supervised
exploration and fast adaptation to new tasks, which need not be known during
exploration. During exploration, unlike prior methods which retrospectively
compute the novelty of observations after the agent has already reached them,
our agent acts efficiently by leveraging planning to seek out expected future
novelty. After exploration, the agent quickly adapts to multiple downstream
tasks in a zero-shot or few-shot manner. We evaluate on challenging control
tasks with high-dimensional image inputs. Without any training supervision or
task-specific interaction, Plan2Explore outperforms prior self-supervised
exploration methods, and in fact, almost matches the performance of an oracle
that has access to rewards. Videos and code at
https://ramanans1.github.io/plan2explore/
Comment: Accepted at ICML 2020.
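To make the abstract's "expected future novelty" concrete, here is a minimal sketch of one common way to realize it: disagreement among an ensemble of learned one-step dynamics models serves as the intrinsic reward during planning. The ensemble size, shapes, and linear toy models are illustrative assumptions, not the paper's architecture (Plan2Explore computes disagreement in the latent space of a learned world model).

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_novelty(state_action, ensemble):
    """Intrinsic reward for an imagined state-action: disagreement
    (variance) among an ensemble of learned one-step dynamics models.
    High disagreement means the outcome is still novel to the agent.
    Sketch only: Plan2Explore computes this inside the latent space of
    a learned world model, over imagined trajectories."""
    predictions = np.stack([model(state_action) for model in ensemble])
    return predictions.var(axis=0).mean()

# Toy ensemble of randomly initialized linear "dynamics models"
# (illustrative stand-ins for the paper's learned predictors).
ensemble = [lambda x, W=rng.normal(size=(4, 4)): x @ W for _ in range(5)]
print(expected_novelty(rng.normal(size=4), ensemble))
```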
Randomized Prior Functions for Deep Reinforcement Learning
Dealing with uncertainty is essential for efficient reinforcement learning.
There is a growing literature on uncertainty estimation for deep learning from
fixed datasets, but many of the most popular approaches are poorly-suited to
sequential decision problems. Other methods, such as bootstrap sampling, have
no mechanism for uncertainty that does not come from the observed data. We
highlight why this can be a crucial shortcoming and propose a simple remedy
through the addition of a randomized, untrainable 'prior' network to each
ensemble member. We prove that this approach is efficient with linear
representations, provide simple illustrations of its efficacy with nonlinear
representations, and show that it scales to large-scale problems far better
than previous attempts.
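As a hedged illustration of the proposed remedy, the sketch below pairs a trainable linear Q-function with a fixed random prior of the same shape; the feature dimension, the beta scale, and the TD update are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class PriorEnsembleMember:
    """One ensemble member: a trainable linear Q-function plus a fixed,
    randomly initialized 'prior' of the same shape. The prior injects
    uncertainty that does not come from the observed data; beta scales it.
    (Feature dimension, beta, and the TD update are illustrative.)"""
    def __init__(self, n_features, n_actions, beta=3.0):
        self.w = np.zeros((n_features, n_actions))               # trainable
        self.w_prior = rng.normal(size=(n_features, n_actions))  # frozen
        self.beta = beta

    def q(self, phi):
        # Prediction = trainable part + scaled, untrainable prior.
        return phi @ self.w + self.beta * (phi @ self.w_prior)

    def update(self, phi, action, target, lr=0.1):
        # Gradient step on the trainable weights only; the prior never moves.
        td_error = target - self.q(phi)[action]
        self.w[:, action] += lr * td_error * phi
```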
Projective simulation for classical learning agents: a comprehensive investigation
We study the model of projective simulation (PS), a novel approach to
artificial intelligence based on stochastic processing of episodic memory which
was recently introduced [H.J. Briegel and G. De las Cuevas. Sci. Rep. 2, 400,
(2012)]. Here we provide a detailed analysis of the model and examine its
performance, including its achievable efficiency, its learning times and the
way both properties scale with the problems' dimension. In addition, we situate
the PS agent in different learning scenarios, and study its learning abilities.
A variety of new scenarios is considered, thereby demonstrating the model's
flexibility. Furthermore, to put the PS scheme in context, we compare its
performance with that of Q-learning and learning classifier systems, two
popular models in the field of reinforcement learning. It is shown that PS is a
competitive artificial intelligence model with unique properties and strengths.
Comment: Accepted for publication in New Generation Computing. 23 pages, 23
figures.
Off-Policy Shaping Ensembles in Reinforcement Learning
Recent advances in gradient temporal-difference methods make it possible to
learn multiple value functions off-policy, in parallel, without sacrificing
convergence guarantees or computational efficiency. This opens up new
possibilities for sound ensemble techniques in reinforcement learning. In this
work we propose learning an ensemble of policies related through
potential-based shaping rewards. The ensemble induces a combination policy by
using a voting mechanism on its components. Learning happens in real time, and
we empirically show the combination policy to outperform the individual
policies of the ensemble.
Comment: Full version of the paper to appear in Proc. ECAI 201
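A minimal sketch of the voting mechanism described above, assuming each ensemble member exposes a Q-function over a discrete action set; how the members are trained (off-policy, each shaped with a different potential) is where the paper's actual contribution lies.

```python
import numpy as np

def combination_policy(state, ensemble_q, n_actions):
    """Voting mechanism: each ensemble member casts one vote for its
    greedy action, and the combination policy follows the plurality
    winner. `ensemble_q` is assumed to be a list of callables mapping
    a state to a vector of action values."""
    votes = np.zeros(n_actions)
    for q in ensemble_q:
        votes[int(np.argmax(q(state)))] += 1
    return int(np.argmax(votes))
```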
Deep Exploration via Bootstrapped DQN
Efficient exploration in complex environments remains a major challenge for
reinforcement learning. We propose bootstrapped DQN, a simple algorithm that
explores in a computationally and statistically efficient manner through use of
randomized value functions. Unlike dithering strategies such as epsilon-greedy
exploration, bootstrapped DQN carries out temporally-extended (or deep)
exploration; this can lead to exponentially faster learning. We demonstrate
these benefits in complex stochastic MDPs and in the large-scale Arcade
Learning Environment. Bootstrapped DQN substantially improves learning times
and performance across most Atari games.
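The following tabular sketch illustrates the core mechanism, sampling one head per episode and following it greedily while training each head on its own bootstrapped subsample; the sizes and the Bernoulli masking probability are illustrative, and the paper applies this with deep networks rather than tables.

```python
import numpy as np

rng = np.random.default_rng(0)

class BootstrappedQ:
    """Tabular sketch of bootstrapped exploration: K value heads, one
    sampled per episode and followed greedily, each trained on its own
    bootstrapped subsample of transitions. Sizes and the masking
    probability are illustrative; the paper uses deep networks."""
    def __init__(self, n_states, n_actions, k=10):
        self.q = rng.normal(scale=0.1, size=(k, n_states, n_actions))
        self.k, self.head = k, 0

    def begin_episode(self):
        self.head = rng.integers(self.k)   # commit to one head per episode

    def act(self, s):
        return int(np.argmax(self.q[self.head, s]))

    def update(self, s, a, r, s2, gamma=0.99, lr=0.1, p_mask=0.5):
        for k in range(self.k):
            if rng.random() < p_mask:      # Bernoulli bootstrap mask
                target = r + gamma * self.q[k, s2].max()
                self.q[k, s, a] += lr * (target - self.q[k, s, a])
```

Committing to one head for a whole episode is what makes the exploration temporally extended, in contrast to per-step dithering such as epsilon-greedy.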
UCB Exploration via Q-Ensembles
We show how an ensemble of Q-functions can be leveraged for more
effective exploration in deep reinforcement learning. We build on
well-established algorithms from the bandit setting, and adapt them to the
Q-learning setting. We propose an exploration strategy based on
upper-confidence bounds (UCB). Our experiments show significant gains on the
Atari benchmark.
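A minimal sketch of the UCB-style rule the abstract describes, assuming an ensemble of Q-functions over a discrete action set; the exploration coefficient lam is an assumed hyperparameter, not a value from the paper.

```python
import numpy as np

def ucb_action(state, ensemble_q, lam=1.0):
    """UCB-style action selection over an ensemble of Q-functions:
    choose the action maximizing the ensemble mean plus lam times the
    ensemble standard deviation, mirroring upper-confidence bounds
    from the bandit setting."""
    qs = np.stack([q(state) for q in ensemble_q])   # (members, actions)
    return int(np.argmax(qs.mean(axis=0) + lam * qs.std(axis=0)))
```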
Off-Policy Reward Shaping with Ensembles
Potential-based reward shaping (PBRS) is an effective and popular technique
to speed up reinforcement learning by leveraging domain knowledge. While PBRS
is proven to always preserve optimal policies, its effect on learning speed is
determined by the quality of its potential function, which, in turn, depends on
both the underlying heuristic and the scale. Knowing which heuristic will prove
effective requires testing the options beforehand, and determining the
appropriate scale requires tuning, both of which introduce additional sample
complexity. We formulate a PBRS framework that preserves the gains in learning
speed without incurring this extra sample complexity. For this, we propose to
simultaneously learn
an ensemble of policies, shaped w.r.t. many heuristics and on a range of
scales. The target policy is then obtained by voting. The ensemble needs to be
able to efficiently and reliably learn off-policy: requirements fulfilled by
the recent Horde architecture, which we take as our basis. We demonstrate
empirically that (1) our ensemble policy outperforms both the base policy, and
its single-heuristic components, and (2) an ensemble over a general range of
scales performs at least as well as one with optimally tuned components.
Comment: To be presented at ALA-15. Short version to appear at AAMAS-1
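For reference, potential-based shaping adds F(s, s') = gamma * phi(s') - phi(s) to the environment reward, the form proven to preserve optimal policies. A one-function sketch, with the heuristic supplied as the potential:

```python
def shaped_reward(r, s, s2, potential, gamma=0.99):
    """Potential-based reward shaping: add F(s, s') = gamma * phi(s')
    - phi(s) to the environment reward. Shaping of this form provably
    preserves optimal policies; the domain heuristic and its scale live
    entirely in `potential`."""
    return r + gamma * potential(s2) - potential(s)

# E.g., with a hypothetical distance-to-goal heuristic:
#   potential = lambda s: -abs(s - goal)
```

In the paper's setup, each ensemble member is shaped with a different potential (heuristic and scale) and the members are combined by voting, as in the entry above.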
Learning Gentle Object Manipulation with Curiosity-Driven Deep Reinforcement Learning
Robots must know how to be gentle when they need to interact with fragile
objects, or when the robot itself is prone to wear and tear. We propose an
approach that enables deep reinforcement learning to train policies that are
gentle, both during exploration and task execution. In a reward-based learning
environment, a natural approach involves augmenting the (task) reward with a
penalty for non-gentleness, which can be defined as excessive impact force.
However, augmenting with only this penalty impairs learning: policies get stuck
in a local optimum which avoids all contact with the environment. Prior
research has shown that adding auxiliary tasks or intrinsic rewards can be
beneficial for stabilizing and accelerating learning in sparse-reward domains,
and indeed we find that introducing a surprise-based intrinsic reward does
avoid the no-contact failure case. However, we show that a simple
dynamics-based surprise is not as effective as penalty-based surprise.
Penalty-based surprise, based on predicting forceful contacts, has a further
benefit: it encourages exploration which is contact-rich yet gentle. We
demonstrate the effectiveness of the approach using a complex, tendon-powered
robot hand with tactile sensors. Videos are available at
http://sites.google.com/view/gentlemanipulation
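A sketch of the reward composition the abstract describes: the task reward, minus a penalty for excessive impact force, plus a surprise-based intrinsic bonus that keeps the policy from collapsing to the no-contact local optimum. All coefficients and the force threshold are illustrative assumptions, not the paper's values.

```python
def gentle_reward(task_reward, impact_force, surprise,
                  penalty_scale=1.0, intrinsic_scale=0.1, force_limit=5.0):
    """Compose the learning signal from three parts: the task reward,
    a penalty for impact force beyond a threshold (the non-gentleness
    term), and a surprise-based intrinsic bonus that encourages
    contact-rich yet gentle exploration. Coefficients are illustrative."""
    penalty = penalty_scale * max(0.0, impact_force - force_limit)
    return task_reward - penalty + intrinsic_scale * surprise
```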
Neuronal Circuit Policies
We propose an effective way to create interpretable control agents by
re-purposing the function of a biological neural circuit model to govern
simulated and real-world reinforcement learning (RL) test-beds. We model the
tap-withdrawal (TW) neural circuit of the nematode C. elegans, a circuit
responsible for the worm's reflexive response to external mechanical touch
stimulation, and learn its synaptic and neuronal parameters as a policy for
controlling basic RL tasks. We also autonomously park a real rover robot on a
pre-defined trajectory, by deploying such neuronal circuit policies learned in
a simulated environment. For reconfiguration of the purpose of the TW neural
circuit, we adopt a search-based RL algorithm. We show that our neuronal
policies perform as well as deep neural network policies, with the advantage of
realizing interpretable dynamics at the cell level.
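The abstract's "search-based RL algorithm" can be illustrated, in a very generic form, as random local search over the circuit's parameters with episode return as the objective; this is a sketch of the method family, not the specific algorithm used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_search(episode_return, n_params, iters=1000, step=0.1):
    """Generic search-based RL: perturb the circuit's synaptic/neuronal
    parameters and keep any change that improves episode return.
    `episode_return` maps a parameter vector to a scalar score."""
    theta = rng.normal(size=n_params)
    best = episode_return(theta)
    for _ in range(iters):
        candidate = theta + step * rng.normal(size=n_params)
        score = episode_return(candidate)
        if score > best:
            theta, best = candidate, score
    return theta, best
```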
Scale-invariant temporal history (SITH): optimal slicing of the past in an uncertain world
In both the human brain and any general artificial intelligence (AI), a
representation of the past is necessary to predict the future. However, perfect
storage of all experiences is not feasible. One approach utilized in many
applications, including reward prediction in reinforcement learning, is to
retain recently active features of experience in a buffer. Despite its prior
successes, we show that the fixed-length buffer renders Deep Q-learning
Networks (DQNs) fragile to changes in the scale over which information can be
learned. To enable learning when the relevant temporal scales in the
environment are not known a priori, we draw on recent advances in psychology
and neuroscience suggesting that the brain maintains a compressed representation of
the past. Here we introduce a neurally-plausible, scale-free memory
representation we call Scale-Invariant Temporal History (SITH) for use with
artificial agents. This representation covers an exponentially large period of
time by sacrificing temporal accuracy for events further in the past. We
demonstrate the utility of this representation by comparing the performance of
agents given SITH, buffer, and exponential decay representations in learning to
play video games at different levels of complexity. In these environments, SITH
exhibits better learning performance by storing information for longer
timescales than a fixed-size buffer, and representing this information more
clearly than a set of exponentially decayed features. Finally, we discuss how
the application of SITH, along with other human-inspired models of cognition,
could improve reinforcement and machine learning algorithms in general.
Comment: Preprint submitted to Neural Computation. Update 12/18/2018: revised
based on reviewer comments and resubmitted to Neural Computation on 15 December
2018; restructured introduction and discussion, combined figures, added a
section on SITH parameterization.
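To give a flavor of a compressed, scale-covering history, the sketch below keeps a bank of decaying traces with log-spaced time constants. Note this is closer to the exponential-decay baseline the paper compares against; SITH itself derives a scale-invariant history by inverting a Laplace-transform-like encoding of the past. Treat it purely as an illustration of trading temporal accuracy for horizon.

```python
import numpy as np

class LogSpacedHistory:
    """A bank of decaying traces with log-spaced time constants: recent
    events are represented sharply, distant ones coarsely, covering an
    exponentially long horizon with few units. Illustrative only; see
    the lead-in note on how SITH's actual construction differs."""
    def __init__(self, n_features, n_scales=8, tau_min=1.0, tau_max=1000.0):
        taus = np.geomspace(tau_min, tau_max, n_scales)
        self.decay = np.exp(-1.0 / taus)             # one rate per timescale
        self.trace = np.zeros((n_scales, n_features))

    def step(self, x):
        # Decay every timescale, then write in the current observation.
        self.trace = self.trace * self.decay[:, None] + x[None, :]
        return self.trace.ravel()                    # agent's memory input
```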