Efficient Model-Free Reinforcement Learning Using Gaussian Process
Efficient reinforcement learning usually takes advantage of demonstrations or a
good exploration strategy. By applying posterior sampling in model-free RL
under a Gaussian process (GP) hypothesis, we propose the Gaussian Process
Posterior Sampling Reinforcement Learning (GPPSTD) algorithm for continuous
state spaces, giving theoretical justifications and empirical results. We also
provide theoretical and empirical evidence that demonstrations can lower
expected uncertainty and benefit posterior-sampling exploration. In this way,
we combine the demonstration and exploration processes to achieve more
efficient reinforcement learning.
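To make the posterior-sampling idea concrete, here is a minimal sketch of Thompson-style action selection from a Gaussian process posterior over action-values. The kernel choice, the toy value function, and the discrete candidate-action set are illustrative assumptions; this is not the paper's GPPSTD algorithm.

```python
# Minimal sketch: Thompson-style exploration by sampling action-values
# from a Gaussian-process posterior. Toy setup; not the paper's GPPSTD.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
ACTIONS = np.linspace(-1.0, 1.0, 5)          # small discrete candidate-action set

def true_value(state, action):               # hypothetical environment signal
    return np.cos(3 * state) - (action - 0.5 * state) ** 2

# A few observed (state, action) -> return samples.
X = rng.uniform(-1, 1, size=(30, 2))
y = true_value(X[:, 0], X[:, 1]) + 0.05 * rng.standard_normal(30)

gp = GaussianProcessRegressor(kernel=RBF(0.5) + WhiteKernel(0.01), normalize_y=True)
gp.fit(X, y)

def select_action(state):
    """Sample one function from the GP posterior and act greedily w.r.t. it."""
    candidates = np.column_stack([np.full_like(ACTIONS, state), ACTIONS])
    sampled_q = gp.sample_y(candidates, n_samples=1,
                            random_state=rng.integers(1 << 31)).ravel()
    return ACTIONS[np.argmax(sampled_q)]

print(select_action(0.3))
```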
Deep Exploration via Randomized Value Functions
We study the use of randomized value functions to guide deep exploration in
reinforcement learning. This offers an elegant means for synthesizing
statistically and computationally efficient exploration with common practical
approaches to value function learning. We present several reinforcement
learning algorithms that leverage randomized value functions and demonstrate
their efficacy through computational studies. We also prove a regret bound that
establishes statistical efficiency with a tabular representation.
Deep Exploration via Bootstrapped DQN
Efficient exploration in complex environments remains a major challenge for
reinforcement learning. We propose bootstrapped DQN, a simple algorithm that
explores in a computationally and statistically efficient manner through use of
randomized value functions. Unlike dithering strategies such as epsilon-greedy
exploration, bootstrapped DQN carries out temporally-extended (or deep)
exploration; this can lead to exponentially faster learning. We demonstrate
these benefits in complex stochastic MDPs and in the large-scale Arcade
Learning Environment. Bootstrapped DQN substantially improves learning times
and performance across most Atari games.
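A minimal sketch of the head-sampling idea behind bootstrapped exploration: keep several independently initialized value functions, draw one at the start of each episode, and act greedily with it throughout that episode. The tabular Q-tables, the toy chain environment, and the bootstrap-mask probability are stand-in assumptions, not the paper's DQN-based implementation.

```python
# Minimal sketch of bootstrapped exploration with K randomized value heads.
# Tabular stand-in for illustration only.
import numpy as np

rng = np.random.default_rng(1)
N_STATES, N_ACTIONS, K = 10, 2, 5
ALPHA, GAMMA = 0.1, 0.99

# K independently initialized Q-tables play the role of bootstrap heads.
heads = [rng.normal(scale=0.1, size=(N_STATES, N_ACTIONS)) for _ in range(K)]

def step(state, action):
    """Hypothetical chain environment: action 1 moves right, reward at the end."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

for episode in range(200):
    k = rng.integers(K)                          # sample one head for the whole episode
    q = heads[k]
    state = 0
    for _ in range(50):
        action = int(np.argmax(q[state]))        # greedy w.r.t. the sampled head
        next_state, reward, done = step(state, action)
        # Each head is trained on a random bootstrap mask of the data (assumption).
        for head in heads:
            if rng.random() < 0.8:
                target = reward + (0.0 if done else GAMMA * head[next_state].max())
                head[state, action] += ALPHA * (target - head[state, action])
        state = next_state
        if done:
            break
```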
When Simple Exploration is Sample Efficient: Identifying Sufficient Conditions for Random Exploration to Yield PAC RL Algorithms
Efficient exploration is one of the key challenges for reinforcement learning
(RL) algorithms. Most traditional sample efficiency bounds require strategic
exploration. Recently, many deep RL algorithms with simple heuristic
exploration strategies and few formal guarantees have achieved surprising
success in many domains. These results pose an important question about
understanding these exploration strategies, such as epsilon-greedy, as well as
what characterizes the difficulty of exploration in MDPs. In this work we
propose problem-specific sample complexity bounds for learning with random
walk exploration that rely on several structural properties. We also link our
theoretical results to some empirical benchmark domains, to illustrate whether
our bound gives polynomial sample complexity in these domains and how that
relates to the empirical performance.
Diversity-Driven Exploration Strategy for Deep Reinforcement Learning
Efficient exploration remains a challenging research problem in reinforcement
learning, especially when an environment contains large state spaces, deceptive
local optima, or sparse rewards. To tackle this problem, we present a
diversity-driven approach for exploration, which can be easily combined with
both off- and on-policy reinforcement learning algorithms. We show that by
simply adding a distance measure to the loss function, the proposed methodology
significantly enhances an agent's exploratory behaviors, thus preventing
the policy from being trapped in local optima. We further propose an adaptive
scaling method for stabilizing the learning process. Our experimental results
in Atari 2600 show that our method outperforms baseline approaches in several
tasks in terms of mean scores and exploration efficiency.
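A minimal sketch of how a distance term between the current policy and earlier policies could be added to a loss function to encourage diverse behavior. The KL-based distance, the fixed weighting coefficient, and the placeholder task loss are illustrative assumptions rather than the authors' exact formulation or adaptive scaling method.

```python
# Minimal sketch: augment a policy loss with a diversity term that pushes the
# current policy away from snapshots of recent prior policies.
import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
prior_policies = [nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
                  for _ in range(3)]              # snapshots of earlier policies

def diversity_loss(states, alpha=0.1):
    """Negative mean distance to prior policies: minimizing it increases diversity."""
    log_probs = F.log_softmax(policy(states), dim=-1)
    distances = []
    for old in prior_policies:
        with torch.no_grad():
            old_probs = F.softmax(old(states), dim=-1)
        # KL(old || current) serves as a simple distance measure between policies.
        distances.append(F.kl_div(log_probs, old_probs, reduction="batchmean"))
    return -alpha * torch.stack(distances).mean()

states = torch.randn(16, 4)
# Stand-in for the usual RL objective (policy-gradient or Q-learning loss).
task_loss = -policy(states).gather(1, torch.zeros(16, 1, dtype=torch.long)).mean()
loss = task_loss + diversity_loss(states)
loss.backward()
```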
Leveraging exploration in off-policy algorithms via normalizing flows
The ability to discover approximately optimal policies in domains with sparse
rewards is crucial to applying reinforcement learning (RL) in many real-world
scenarios. Approaches such as neural density models and continuous exploration
(e.g., Go-Explore) have been proposed to maintain the high exploration rate
necessary to find high performing and generalizable policies. Soft
actor-critic (SAC) is another method for improving exploration that aims to
combine efficient learning via off-policy updates while maximizing the policy
entropy. In this work, we extend SAC to a richer class of probability
distributions (e.g., multimodal) through normalizing flows (NF) and show that
this significantly improves performance by accelerating the discovery of good
policies while using much smaller policy representations. Our approach, which
we call SAC-NF, is a simple, efficient, easy-to-implement modification and
improvement to SAC on continuous control baselines such as MuJoCo and PyBullet
Roboschool domains. Finally, SAC-NF does this while being significantly
parameter efficient, using as few as 5.5% of the parameters of an equivalent
SAC model.
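A minimal sketch of how a Gaussian policy sample can be pushed through normalizing-flow layers to represent a richer (e.g., multimodal) action distribution while tracking the log-density correction. The planar-flow layer and dimensions are illustrative assumptions and do not reproduce the SAC-NF architecture.

```python
# Minimal sketch: enrich a Gaussian policy with normalizing-flow layers.
import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    """One planar flow layer: z' = z + u * tanh(w.z + b), with log|det J| tracked."""
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(dim) * 0.1)
        self.w = nn.Parameter(torch.randn(dim) * 0.1)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z, log_prob):
        pre = z @ self.w + self.b                               # (batch,)
        z_new = z + self.u * torch.tanh(pre).unsqueeze(-1)
        psi = (1 - torch.tanh(pre) ** 2).unsqueeze(-1) * self.w
        det = 1 + psi @ self.u
        return z_new, log_prob - torch.log(det.abs() + 1e-8)

# Base Gaussian policy head (a state-conditioned mean/log-std would go here).
dim, batch = 2, 8
base = torch.distributions.Normal(torch.zeros(dim), torch.ones(dim))
z = base.sample((batch,))
log_prob = base.log_prob(z).sum(-1)

flows = nn.ModuleList([PlanarFlow(dim) for _ in range(3)])
for flow in flows:
    z, log_prob = flow(z, log_prob)

action = torch.tanh(z)   # SAC-style squashing would contribute its own log-det term
print(action.shape, log_prob.shape)
```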
Generalization and Exploration via Randomized Value Functions
We propose randomized least-squares value iteration (RLSVI) -- a new
reinforcement learning algorithm designed to explore and generalize efficiently
via linearly parameterized value functions. We explain why versions of
least-squares value iteration that use Boltzmann or epsilon-greedy exploration
can be highly inefficient, and we present computational results that
demonstrate dramatic efficiency gains enjoyed by RLSVI. Further, we establish
an upper bound on the expected regret of RLSVI that demonstrates
near-optimality in a tabula rasa learning context. More broadly, our results
suggest that randomized value functions offer a promising approach to tackling
a critical challenge in reinforcement learning: synthesizing efficient
exploration and effective generalization.
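A minimal sketch of the randomization idea with linearly parameterized values: maintain a Gaussian posterior over per-action weight vectors, sample one weight vector per action, and act greedily under the sample. The prior precision, noise variance, and toy data are assumptions; this omits RLSVI's episode-by-episode backward recursion.

```python
# Minimal sketch of randomized linearly parameterized value functions:
# sample weights from a Bayesian linear-regression posterior and act greedily.
import numpy as np

rng = np.random.default_rng(2)
D, N_ACTIONS = 8, 3
SIGMA2, LAMBDA = 0.1, 1.0        # noise variance and prior precision (assumptions)

# Per-action sufficient statistics for regularized least squares on value targets.
A = [LAMBDA * np.eye(D) for _ in range(N_ACTIONS)]   # precision matrices
b = [np.zeros(D) for _ in range(N_ACTIONS)]

def update(features, action, target):
    A[action] += np.outer(features, features) / SIGMA2
    b[action] += features * target / SIGMA2

def sample_weights():
    """Draw one plausible weight vector per action from the Gaussian posterior."""
    weights = []
    for a in range(N_ACTIONS):
        cov = np.linalg.inv(A[a])
        weights.append(rng.multivariate_normal(cov @ b[a], cov))
    return weights

def select_action(features, weights):
    return int(np.argmax([w @ features for w in weights]))

# Example: update on a few fake transitions, then act with sampled weights.
for _ in range(20):
    phi = rng.standard_normal(D)
    update(phi, rng.integers(N_ACTIONS), rng.standard_normal())
print(select_action(rng.standard_normal(D), sample_weights()))
```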
Learning Efficient and Effective Exploration Policies with Counterfactual Meta Policy
A fundamental issue in reinforcement learning algorithms is the balance
between exploration of the environment and exploitation of information already
obtained by the agent. Especially, exploration has played a critical role for
both the efficiency and efficacy of the learning process. However, existing
exploration approaches involve task-agnostic designs: a strategy that performs
well in one environment may be ill-suited to another. To learn an effective
and efficient exploration policy in an automated manner, we formalize a
feasible metric for measuring the utility of exploration based on
counterfactual reasoning. Based on this metric, we propose an end-to-end
algorithm to learn an exploration policy via meta-learning. We demonstrate
that our method achieves good results compared to previous works on
high-dimensional control tasks in the MuJoCo simulator.
(More) Efficient Reinforcement Learning via Posterior Sampling
Most provably-efficient learning algorithms introduce optimism about
poorly-understood states and actions to encourage exploration. We study an
alternative approach for efficient exploration, posterior sampling for
reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of
known duration. At the start of each episode, PSRL updates a prior distribution
over Markov decision processes and takes one sample from this posterior. PSRL
then follows the policy that is optimal for this sample during the episode. The
algorithm is conceptually simple, computationally efficient and allows an agent
to encode prior knowledge in a natural way. We establish an
$\tilde{O}(\tau S \sqrt{AT})$ bound on the expected regret, where $T$ is time,
$\tau$ is the episode length, and $S$ and $A$ are the cardinalities of the
state and action spaces. This bound is one of the first for an algorithm not
based on optimism, and close to the state of the art for any reinforcement
learning algorithm. We show through simulation that PSRL significantly
outperforms existing algorithms with similar regret bounds.
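A minimal sketch of the episodic loop described above on a small tabular MDP: maintain Dirichlet posteriors over transitions, sample one MDP at the start of each episode, solve it by value iteration, and follow the resulting policy. The point-estimate rewards, discounting, and toy chain environment are simplifying assumptions for illustration.

```python
# Minimal sketch of posterior sampling for RL on a small tabular MDP.
import numpy as np

rng = np.random.default_rng(3)
S, A, HORIZON, GAMMA = 5, 2, 20, 0.95

# Dirichlet pseudo-counts over next states, plus running reward means.
counts = np.ones((S, A, S))
reward_sum = np.zeros((S, A))
visits = np.ones((S, A))

def sample_mdp():
    P = np.stack([[rng.dirichlet(counts[s, a]) for a in range(A)] for s in range(S)])
    R = reward_sum / visits                      # point estimate of mean reward
    return P, R

def solve(P, R, iters=100):
    """Value iteration on the sampled MDP; returns a greedy policy."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + GAMMA * P @ V
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def env_step(s, a):
    """Hypothetical chain environment used only to generate data."""
    s2 = min(s + 1, S - 1) if a == 1 else 0
    return s2, float(s2 == S - 1)

for episode in range(50):
    P, R = sample_mdp()
    policy = solve(P, R)                         # act optimally for this sample
    s = 0
    for _ in range(HORIZON):
        a = policy[s]
        s2, r = env_step(s, a)
        counts[s, a, s2] += 1                    # posterior update for transitions
        reward_sum[s, a] += r
        visits[s, a] += 1
        s = s2
```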
Context-Dependent Upper-Confidence Bounds for Directed Exploration
Directed exploration strategies for reinforcement learning are critical for
learning an optimal policy in a minimal number of interactions with the
environment. Many algorithms use optimism to direct exploration, either through
visitation estimates or upper confidence bounds, as opposed to data-inefficient
strategies like epsilon-greedy that use random, undirected exploration. Most
data-efficient exploration methods require significant computation, typically
relying on a learned model to guide exploration. Least-squares methods have the
potential to provide some of the data-efficiency benefits of model-based
approaches -- because they summarize past interactions -- with the computation
closer to that of model-free approaches. In this work, we provide a novel,
computationally efficient, incremental exploration strategy, leveraging this
property of least-squares temporal difference learning (LSTD). We derive upper
confidence bounds on the action-values learned by LSTD, with context-dependent
(or state-dependent) noise variance. Such context-dependent noise focuses
exploration on a subset of variable states, and allows for reduced exploration
in other states. We empirically demonstrate that our algorithm can converge
more quickly than other incremental exploration strategies using confidence
estimates on action-values.
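A minimal sketch of upper-confidence action selection on linearly parameterized action-values, where the confidence width depends on how well the current context is covered by past features, so poorly covered states get wider bounds. This is a generic LinUCB-style illustration with a Sherman-Morrison update, not the paper's LSTD-based derivation.

```python
# Minimal sketch: context-dependent upper-confidence bounds on linear action-values.
import numpy as np

rng = np.random.default_rng(4)
D, N_ACTIONS, BETA = 6, 3, 1.0           # BETA scales the confidence width (assumption)

A_inv = [np.eye(D) for _ in range(N_ACTIONS)]   # inverse feature covariance per action
theta = [np.zeros(D) for _ in range(N_ACTIONS)]
b = [np.zeros(D) for _ in range(N_ACTIONS)]

def ucb_action(phi):
    scores = []
    for a in range(N_ACTIONS):
        mean = theta[a] @ phi
        width = BETA * np.sqrt(phi @ A_inv[a] @ phi)    # context-dependent bonus
        scores.append(mean + width)
    return int(np.argmax(scores))

def update(phi, a, target):
    """Sherman-Morrison update of the inverse covariance, then least-squares weights."""
    v = A_inv[a] @ phi
    A_inv[a] = A_inv[a] - np.outer(v, v) / (1.0 + phi @ v)
    b[a] += phi * target
    theta[a] = A_inv[a] @ b[a]

# Example: act, observe a (fake) return target, update.
phi = rng.standard_normal(D)
a = ucb_action(phi)
update(phi, a, target=1.0)
```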