    Efficient Model-Free Reinforcement Learning Using Gaussian Process

    Efficient reinforcement learning usually takes advantage of demonstrations or a good exploration strategy. By applying posterior sampling in model-free RL under a Gaussian process (GP) assumption, we propose the Gaussian Process Posterior Sampling Reinforcement Learning (GPPSTD) algorithm for continuous state spaces, giving theoretical justifications and empirical results. We also provide theoretical and empirical evidence that various demonstrations can lower expected uncertainty and benefit posterior-sampling exploration. In this way, we combine demonstration and exploration to achieve more efficient reinforcement learning. Comment: 10 pages
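
    A minimal sketch of the general idea, not the authors' GPPSTD algorithm: fit a GP to (state, action) features and Q-value targets, draw one function from the posterior, and act greedily with respect to that draw. The feature layout, kernel, toy targets, and scikit-learn usage below are illustrative assumptions.

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF

        # Observed (state, action) features and bootstrapped Q-value targets (toy data).
        rng = np.random.default_rng(0)
        X = rng.uniform(-1, 1, size=(50, 3))                     # 2-d state + 1-d action
        y = np.sin(X.sum(axis=1)) + 0.1 * rng.normal(size=50)    # placeholder Q targets

        gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-2)
        gp.fit(X, y)

        def select_action(state, candidate_actions):
            """Sample one Q-function from the GP posterior and act greedily on it."""
            queries = np.array([np.concatenate([state, [a]]) for a in candidate_actions])
            q_sample = gp.sample_y(queries, n_samples=1,
                                   random_state=int(rng.integers(1 << 31))).ravel()
            return candidate_actions[int(np.argmax(q_sample))]

        print(select_action(np.array([0.2, -0.4]), np.linspace(-1, 1, 11)))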

    Deep Exploration via Randomized Value Functions

    We study the use of randomized value functions to guide deep exploration in reinforcement learning. This offers an elegant means for synthesizing statistically and computationally efficient exploration with common practical approaches to value function learning. We present several reinforcement learning algorithms that leverage randomized value functions and demonstrate their efficacy through computational studies. We also prove a regret bound that establishes statistical efficiency with a tabular representation. Comment: Accepted for publication in Journal of Machine Learning Research 2019
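
    As a hedged, tabular toy illustration of the randomized-value-function idea (not any specific algorithm from the paper): perturb an empirical model's rewards with visit-count-scaled Gaussian noise before each episode and act greedily on the value function computed from the perturbed model. The model estimates below are random placeholders.

        import numpy as np

        rng = np.random.default_rng(1)
        S, A, H = 5, 2, 10                               # states, actions, horizon
        P_hat = rng.dirichlet(np.ones(S), size=(S, A))   # assumed transition estimates
        R_hat = rng.uniform(0, 1, size=(S, A))           # assumed reward estimates
        n_visits = np.ones((S, A))                       # visit counts scale the noise

        def randomized_greedy_policy(noise_scale=1.0):
            # Perturb rewards, then run finite-horizon value iteration on the sample.
            R_tilde = R_hat + noise_scale * rng.normal(size=(S, A)) / np.sqrt(n_visits)
            Q = np.zeros((H + 1, S, A))
            for h in range(H - 1, -1, -1):
                Q[h] = R_tilde + P_hat @ Q[h + 1].max(axis=1)
            return Q[0].argmax(axis=1)                   # greedy first-step action per state

        print(randomized_greedy_policy())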

    Deep Exploration via Bootstrapped DQN

    Efficient exploration in complex environments remains a major challenge for reinforcement learning. We propose bootstrapped DQN, a simple algorithm that explores in a computationally and statistically efficient manner through the use of randomized value functions. Unlike dithering strategies such as epsilon-greedy exploration, bootstrapped DQN carries out temporally-extended (or deep) exploration; this can lead to exponentially faster learning. We demonstrate these benefits in complex stochastic MDPs and in the large-scale Arcade Learning Environment. Bootstrapped DQN substantially improves learning times and performance across most Atari games.
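
    Below is a tabular stand-in (hedged: not the paper's deep Q-network implementation) for the bootstrapped-ensemble mechanism: keep K independent Q estimates, use a double-or-nothing bootstrap mask to decide which heads learn from each transition, and follow a single randomly chosen head greedily for a whole episode, with no epsilon dithering. The chain environment and hyperparameters are illustrative.

        import numpy as np

        rng = np.random.default_rng(0)
        S, A, K = 10, 2, 5
        Q_heads = np.zeros((K, S, A))                 # one tabular Q per ensemble head

        def run_episode(env_step, env_reset, alpha=0.1, gamma=0.99, horizon=50):
            k = rng.integers(K)                       # sample one head for the episode
            s = env_reset()
            for _ in range(horizon):
                a = int(Q_heads[k, s].argmax())       # act greedily w.r.t. the sampled head
                s2, r, done = env_step(s, a)
                mask = rng.binomial(1, 0.5, size=K)   # double-or-nothing bootstrap mask
                for j in range(K):
                    if mask[j]:
                        target = r + gamma * Q_heads[j, s2].max() * (not done)
                        Q_heads[j, s, a] += alpha * (target - Q_heads[j, s, a])
                s = s2
                if done:
                    break

        # Toy chain: action 1 moves right, reward only at the rightmost state.
        def env_reset():
            return 0

        def env_step(s, a):
            s2 = min(s + 1, S - 1) if a == 1 else max(s - 1, 0)
            return s2, float(s2 == S - 1), s2 == S - 1

        for _ in range(200):
            run_episode(env_step, env_reset)
        print(Q_heads.mean(axis=0))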

    When Simple Exploration is Sample Efficient: Identifying Sufficient Conditions for Random Exploration to Yield PAC RL Algorithms

    Efficient exploration is one of the key challenges for reinforcement learning (RL) algorithms. Most traditional sample-efficiency bounds require strategic exploration. Recently, many deep RL algorithms with simple heuristic exploration strategies that have few formal guarantees have achieved surprising success in many domains. These results pose an important question about understanding such exploration strategies, for example epsilon-greedy, as well as understanding what characterizes the difficulty of exploration in MDPs. In this work we propose problem-specific sample complexity bounds for Q-learning with random-walk exploration that rely on several structural properties. We also link our theoretical results to some empirical benchmark domains, to illustrate whether our bound gives polynomial sample complexity in these domains and how that relates to empirical performance. Comment: Appeared in the 14th European Workshop on Reinforcement Learning (EWRL), 2018
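
    For concreteness, a minimal version of the setting such bounds target, under an assumed toy chain MDP: off-policy tabular Q-learning whose behaviour policy is a pure random walk (uniform random actions), with no bonus or strategic exploration.

        import numpy as np

        rng = np.random.default_rng(0)
        S, A, gamma, alpha = 6, 2, 0.95, 0.1
        Q = np.zeros((S, A))

        def step(s, a):
            # Chain MDP: action 1 moves right, action 0 moves left; reward at the far end.
            s2 = min(s + 1, S - 1) if a == 1 else max(s - 1, 0)
            return s2, float(s2 == S - 1)

        s = 0
        for t in range(20_000):
            a = int(rng.integers(A))                  # random-walk behaviour policy
            s2, r = step(s, a)
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
            s = 0 if s2 == S - 1 else s2              # restart after reaching the goal

        print(Q.argmax(axis=1))   # greedy policy learned despite undirected exploration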

    Diversity-Driven Exploration Strategy for Deep Reinforcement Learning

    Efficient exploration remains a challenging research problem in reinforcement learning, especially when an environment contains large state spaces, deceptive local optima, or sparse rewards. To tackle this problem, we present a diversity-driven approach to exploration, which can easily be combined with both off- and on-policy reinforcement learning algorithms. We show that by simply adding a distance measure to the loss function, the proposed methodology significantly enhances an agent's exploratory behaviors, thus preventing the policy from being trapped in local optima. We further propose an adaptive scaling method for stabilizing the learning process. Our experimental results on Atari 2600 show that our method outperforms baseline approaches in several tasks in terms of mean scores and exploration efficiency.
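
    A hedged sketch of the loss modification described here, with illustrative choices for the distance measure (mean squared difference between stored and current action distributions) and the adaptive scaling rule; the paper's exact formulation may differ.

        import numpy as np

        def diversity_augmented_loss(base_loss, pi_current, prior_policies, alpha=0.1):
            """base_loss: scalar RL loss; pi_current: (n_states, n_actions) action probs;
            prior_policies: list of arrays with the same shape as pi_current."""
            if not prior_policies:
                return base_loss
            dists = [np.mean((pi_current - pi_old) ** 2) for pi_old in prior_policies]
            return base_loss - alpha * float(np.mean(dists))   # reward being different

        def adapt_alpha(alpha, improved, factor=1.05):
            # One simple adaptive-scaling variant: shrink the diversity weight when
            # performance improves, grow it when learning stalls.
            return alpha / factor if improved else alpha * factor

        pi = np.full((4, 2), 0.5)
        stored = [np.eye(2)[[0, 1, 0, 1]]]            # one previously snapshotted policy
        print(diversity_augmented_loss(1.0, pi, stored))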

    Leveraging exploration in off-policy algorithms via normalizing flows

    The ability to discover approximately optimal policies in domains with sparse rewards is crucial to applying reinforcement learning (RL) in many real-world scenarios. Approaches such as neural density models and continuous exploration (e.g., Go-Explore) have been proposed to maintain the high exploration rate necessary to find high-performing and generalizable policies. Soft actor-critic (SAC) is another method for improving exploration that aims to combine efficient learning via off-policy updates with maximizing the policy entropy. In this work, we extend SAC to a richer class of probability distributions (e.g., multimodal) through normalizing flows (NF) and show that this significantly improves performance by accelerating the discovery of good policies while using much smaller policy representations. Our approach, which we call SAC-NF, is a simple, efficient, easy-to-implement modification and improvement to SAC on continuous control baselines such as MuJoCo and PyBullet Roboschool domains. Finally, SAC-NF achieves this while being significantly more parameter efficient, using as few as 5.5% of the parameters of an equivalent SAC model. Comment: Accepted to the 3rd Conference on Robot Learning (CoRL 2019); Keywords: exploration, soft actor-critic, normalizing flow, off-policy, maximum entropy, reinforcement learning, deceptive reward, sparse reward, inverse autoregressive flow
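
    A minimal sketch of the mechanism, assuming a single planar flow layer on top of a diagonal Gaussian policy (the paper trains richer flows end-to-end within SAC): draw from the base Gaussian, push the draw through the invertible flow, and correct the log-probability with the flow's log-det Jacobian so the entropy term in the SAC objective stays exact.

        import numpy as np

        rng = np.random.default_rng(0)
        dim = 2
        mu, log_std = np.zeros(dim), np.zeros(dim)               # base Gaussian parameters
        u, w, b = 0.5 * np.ones(dim), 0.5 * np.ones(dim), 0.0    # planar flow parameters

        def sample_action():
            z = mu + np.exp(log_std) * rng.normal(size=dim)      # reparameterized sample
            base_logp = -0.5 * np.sum(((z - mu) / np.exp(log_std)) ** 2
                                      + 2 * log_std + np.log(2 * np.pi))
            h = np.tanh(w @ z + b)
            a = z + u * h                                        # planar flow: z -> action
            psi = (1 - h ** 2) * w                               # tanh'(w.z + b) * w
            log_det = np.log(np.abs(1.0 + u @ psi))              # log|det Jacobian|
            return a, base_logp - log_det                        # action and log pi(a)

        action, logp = sample_action()
        print(action, logp)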

    Generalization and Exploration via Randomized Value Functions

    We propose randomized least-squares value iteration (RLSVI) -- a new reinforcement learning algorithm designed to explore and generalize efficiently via linearly parameterized value functions. We explain why versions of least-squares value iteration that use Boltzmann or epsilon-greedy exploration can be highly inefficient, and we present computational results that demonstrate dramatic efficiency gains enjoyed by RLSVI. Further, we establish an upper bound on the expected regret of RLSVI that demonstrates near-optimality in a tabula rasa learning context. More broadly, our results suggest that randomized value functions offer a promising approach to tackling a critical challenge in reinforcement learning: synthesizing efficient exploration and effective generalization. Comment: arXiv admin note: text overlap with arXiv:1307.484
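
    A rough linear-features sketch of one RLSVI iteration (the buffer, feature map, prior and noise scale are placeholder assumptions): form Bellman targets from a previous weight vector, compute the regularized least-squares posterior over value-function weights, and sample the weights instead of using the point estimate, so the induced greedy policy explores.

        import numpy as np

        rng = np.random.default_rng(0)
        d, n_actions, sigma, lam, gamma = 4, 3, 1.0, 1.0, 0.99

        def phi(s, a):
            # Toy state-action feature map (assumed; replace with a real one).
            x = np.zeros(d)
            x[a % d] = 1.0
            x[-1] = s
            return x

        # Assumed replay buffer of transitions (s, a, r, s2) with scalar states.
        buffer = [(rng.uniform(), int(rng.integers(n_actions)), rng.uniform(), rng.uniform())
                  for _ in range(100)]

        def sample_value_function(theta_prev):
            X = np.array([phi(s, a) for s, a, _, _ in buffer])
            y = np.array([r + gamma * max(phi(s2, b) @ theta_prev for b in range(n_actions))
                          for _, _, r, s2 in buffer])
            precision = X.T @ X / sigma**2 + lam * np.eye(d)
            cov = np.linalg.inv(precision)
            mean = cov @ (X.T @ y) / sigma**2
            return rng.multivariate_normal(mean, cov)   # sampled, not point-estimate, weights

        theta = sample_value_function(np.zeros(d))
        print(int(np.argmax([phi(0.5, a) @ theta for a in range(n_actions)])))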

    Learning Efficient and Effective Exploration Policies with Counterfactual Meta Policy

    A fundamental issue in reinforcement learning algorithms is the balance between exploration of the environment and exploitation of information already obtained by the agent. In particular, exploration plays a critical role in both the efficiency and the efficacy of the learning process. However, existing approaches to exploration involve task-agnostic designs that may perform well in one environment but be ill-suited to another. With the aim of learning an effective and efficient exploration policy in an automated manner, we formalize a feasible metric for measuring the utility of exploration based on counterfactual reasoning. Based on this metric, we propose an end-to-end algorithm that learns an exploration policy via meta-learning. We demonstrate that our method achieves good results compared to previous work on high-dimensional control tasks in the MuJoCo simulator.

    (More) Efficient Reinforcement Learning via Posterior Sampling

    Most provably-efficient learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration, posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an $\tilde{O}(\tau S \sqrt{AT})$ bound on the expected regret, where $T$ is time, $\tau$ is the episode length and $S$ and $A$ are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds. Comment: 10 pages
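
    A compact tabular sketch of the PSRL loop described above, under assumed conjugate priors (Dirichlet over transition probabilities and a simple Gaussian-style posterior over mean rewards): sample one MDP from the posterior at the start of each episode, solve it by finite-horizon value iteration, follow the resulting policy for the episode, then update the posterior with the observed transitions.

        import numpy as np

        rng = np.random.default_rng(0)
        S, A, H = 5, 2, 10
        trans_counts = np.ones((S, A, S))          # Dirichlet(1) prior pseudo-counts
        reward_sum = np.zeros((S, A))
        reward_n = np.ones((S, A))

        def sample_and_solve():
            # Draw one MDP from the posterior.
            P = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)]
                          for s in range(S)])
            R = reward_sum / reward_n + rng.normal(size=(S, A)) / np.sqrt(reward_n)
            # Solve the sampled MDP by finite-horizon value iteration.
            Q = np.zeros((H + 1, S, A))
            for h in range(H - 1, -1, -1):
                Q[h] = R + P @ Q[h + 1].max(axis=1)
            return Q[:H].argmax(axis=2)             # policy[h, s] -> action

        def update_posterior(s, a, r, s2):
            trans_counts[s, a, s2] += 1
            reward_sum[s, a] += r
            reward_n[s, a] += 1

        policy = sample_and_solve()
        update_posterior(0, int(policy[0, 0]), 0.0, 1)   # after observing one transition
        print(sample_and_solve()[0])                     # resampled first-step actions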

    Context-Dependent Upper-Confidence Bounds for Directed Exploration

    Directed exploration strategies for reinforcement learning are critical for learning an optimal policy in a minimal number of interactions with the environment. Many algorithms use optimism to direct exploration, either through visitation estimates or upper confidence bounds, as opposed to data-inefficient strategies like epsilon-greedy that use random, undirected exploration. Most data-efficient exploration methods require significant computation, typically relying on a learned model to guide exploration. Least-squares methods have the potential to provide some of the data-efficiency benefits of model-based approaches -- because they summarize past interactions -- with computation closer to that of model-free approaches. In this work, we provide a novel, computationally efficient, incremental exploration strategy, leveraging this property of least-squares temporal difference learning (LSTD). We derive upper confidence bounds on the action-values learned by LSTD, with context-dependent (or state-dependent) noise variance. Such context-dependent noise focuses exploration on a subset of variable states, and allows for reduced exploration in other states. We empirically demonstrate that our algorithm can converge more quickly than other incremental exploration strategies using confidence estimates on action-values. Comment: Neural Information Processing Systems 2018
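
    An illustrative sketch of the general recipe (not necessarily the authors' exact algorithm): maintain incremental LSTD statistics, estimate action-values as phi.theta with theta = A^{-1} b, and add a confidence width computed from a regularized feature Gram matrix so rarely-visited contexts receive larger bonuses. The bonus constant, regularizer and random features are assumptions.

        import numpy as np

        rng = np.random.default_rng(0)
        d, gamma, lam, c = 4, 0.99, 1.0, 1.0
        A = lam * np.eye(d)              # LSTD matrix: sum of phi (phi - gamma * phi')^T
        b = np.zeros(d)
        C = lam * np.eye(d)              # feature Gram matrix used for the confidence width

        def lstd_update(phi_sa, reward, phi_next_sa):
            """Incremental LSTD(0) update for a single transition."""
            global A, b, C
            A += np.outer(phi_sa, phi_sa - gamma * phi_next_sa)
            b += reward * phi_sa
            C += np.outer(phi_sa, phi_sa)

        def ucb_value(phi_sa):
            theta = np.linalg.solve(A, b)                              # LSTD weights
            bonus = c * np.sqrt(phi_sa @ np.linalg.solve(C, phi_sa))   # context-dependent width
            return phi_sa @ theta + bonus

        for _ in range(50):                        # toy transitions with random features
            lstd_update(rng.normal(size=d), rng.uniform(), rng.normal(size=d))
        print(ucb_value(rng.normal(size=d)))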