Search CORE

12,105 research outputs found

Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization

Author: Van Roy Benjamin
Wen Zheng
Publication venue
Publication date: 06/07/2016
Field of study

We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function lies within a given hypothesis class, OCP selects optimal actions over all but at most K episodes, where K is the eluder dimension of the given hypothesis class. We establish further efficiency and asymptotic performance guarantees that apply even if the true value function does not lie in the given hypothesis class, for the special case where the hypothesis class is the span of pre-specified indicator functions over disjoint sets. We also discuss the computational complexity of OCP and present computational results involving two illustrative examples

arXiv.org e-Print Archive

Deep Exploration via Bootstrapped DQN

Author: Blundell Charles
Osband Ian
Pritzel Alexander
Van Roy Benjamin
Publication venue
Publication date: 04/07/2016
Field of study

Efficient exploration in complex environments remains a major challenge for reinforcement learning. We propose bootstrapped DQN, a simple algorithm that explores in a computationally and statistically efficient manner through use of randomized value functions. Unlike dithering strategies such as epsilon-greedy exploration, bootstrapped DQN carries out temporally-extended (or deep) exploration; this can lead to exponentially faster learning. We demonstrate these benefits in complex stochastic MDPs and in the large-scale Arcade Learning Environment. Bootstrapped DQN substantially improves learning times and performance across most Atari games

arXiv.org e-Print Archive

CIRL: Controllable Imitative Reinforcement Learning for Vision-based Self-driving

Author: Liang Xiaodan
Wang Tairui
Xing Eric
Yang Luona
Publication venue
Publication date: 10/07/2018
Field of study

Autonomous urban driving navigation with complex multi-agent dynamics is under-explored due to the difficulty of learning an optimal driving policy. The traditional modular pipeline heavily relies on hand-designed rules and the pre-processing perception system while the supervised learning-based models are limited by the accessibility of extensive human experience. We present a general and principled Controllable Imitative Reinforcement Learning (CIRL) approach which successfully makes the driving agent achieve higher success rates based on only vision inputs in a high-fidelity car simulator. To alleviate the low exploration efficiency for large continuous action space that often prohibits the use of classical RL on challenging real tasks, our CIRL explores over a reasonably constrained action space guided by encoded experiences that imitate human demonstrations, building upon Deep Deterministic Policy Gradient (DDPG). Moreover, we propose to specialize adaptive policies and steering-angle reward designs for different control signals (i.e. follow, straight, turn right, turn left) based on the shared representations to improve the model capability in tackling with diverse cases. Extensive experiments on CARLA driving benchmark demonstrate that CIRL substantially outperforms all previous methods in terms of the percentage of successfully completed episodes on a variety of goal-directed driving tasks. We also show its superior generalization capability in unseen environments. To our knowledge, this is the first successful case of the learned driving policy through reinforcement learning in the high-fidelity simulator, which performs better-than supervised imitation learning.Comment: To appear in ECCV 201

arXiv.org e-Print Archive

Meta reinforcement learning as task inference

Author: Galashov Alexandre
Hasenclever Leonard
Heess Nicolas
Humplik Jan
Ortega Pedro A.
Teh Yee Whye
Publication venue
Publication date: 22/10/2019
Field of study

Humans achieve efficient learning by relying on prior knowledge about the structure of naturally occurring tasks. There is considerable interest in designing reinforcement learning (RL) algorithms with similar properties. This includes proposals to learn the learning algorithm itself, an idea also known as meta learning. One formal interpretation of this idea is as a partially observable multi-task RL problem in which task information is hidden from the agent. Such unknown task problems can be reduced to Markov decision processes (MDPs) by augmenting an agent's observations with an estimate of the belief about the task based on past experience. However estimating the belief state is intractable in most partially-observed MDPs. We propose a method that separately learns the policy and the task belief by taking advantage of various kinds of privileged information. Our approach can be very effective at solving standard meta-RL environments, as well as a complex continuous control environment with sparse rewards and requiring long-term memory

arXiv.org e-Print Archive

How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization

Author: D'Oro Pierluca
Jaśkowski Wojciech
Publication venue
Publication date: 29/04/2020
Field of study

Deterministic-policy actor-critic algorithms for continuous control improve the actor by plugging its actions into the critic and ascending the action-value gradient, which is obtained by chaining the actor's Jacobian matrix with the gradient of the critic w.r.t. input actions. However, instead of gradients, the critic is, typically, only trained to accurately predict expected returns, which, on their own, are useless for policy optimization. In this paper, we propose MAGE, a model-based actor-critic algorithm, grounded in the theory of policy gradients, which explicitly learns the action-value gradient. MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning, leading to a critic tailored for policy improvement. On a set of MuJoCo continuous-control tasks, we demonstrate the efficiency of the algorithm with respect to model-free and model-based state-of-the-art baselines

arXiv.org e-Print Archive

DORA The Explorer: Directed Outreaching Reinforcement Action-Selection

Author: Choshen Leshem
Fox Lior
Loewenstein Yonatan
Publication venue
Publication date: 11/04/2018
Field of study

Exploration is a fundamental aspect of Reinforcement Learning, typically implemented using stochastic action-selection. Exploration, however, can be more efficient if directed toward gaining new world knowledge. Visit-counters have been proven useful both in practice and in theory for directed exploration. However, a major limitation of counters is their locality. While there are a few model-based solutions to this shortcoming, a model-free approach is still missing. We propose

E

-values, a generalization of counters that can be used to evaluate the propagating exploratory value over state-action trajectories. We compare our approach to commonly used RL techniques, and show that using

E

-values improves learning and performance over traditional counters. We also show how our method can be implemented with function approximation to efficiently learn continuous MDPs. We demonstrate this by showing that our approach surpasses state of the art performance in the Freeway Atari 2600 game.Comment: Final version for ICLR 201

arXiv.org e-Print Archive

Episodic Memory Deep Q-Networks

Author: Lin Zichuan
Yang Guangwen
Zhang Lintao
Zhao Tianqi
Publication venue
Publication date: 19/05/2018
Field of study

Reinforcement learning (RL) algorithms have made huge progress in recent years by leveraging the power of deep neural networks (DNN). Despite the success, deep RL algorithms are known to be sample inefficient, often requiring many rounds of interaction with the environments to obtain satisfactory performance. Recently, episodic memory based RL has attracted attention due to its ability to latch on good actions quickly. In this paper, we present a simple yet effective biologically inspired RL algorithm called Episodic Memory Deep Q-Networks (EMDQN), which leverages episodic memory to supervise an agent during training. Experiments show that our proposed method can lead to better sample efficiency and is more likely to find good policies. It only requires 1/5 of the interactions of DQN to achieve many state-of-the-art performances on Atari games, significantly outperforming regular DQN and other episodic memory based RL algorithms.Comment: Accepted by IJCAI 201

arXiv.org e-Print Archive

Deep Exploration via Randomized Value Functions

Author: Osband Ian
Russo Daniel
Van Roy Benjamin
Wen Zheng
Publication venue
Publication date: 23/09/2019
Field of study

We study the use of randomized value functions to guide deep exploration in reinforcement learning. This offers an elegant means for synthesizing statistically and computationally efficient exploration with common practical approaches to value function learning. We present several reinforcement learning algorithms that leverage randomized value functions and demonstrate their efficacy through computational studies. We also prove a regret bound that establishes statistical efficiency with a tabular representation.Comment: Accepted for publication in Journal of Machine Learning Research 201

arXiv.org e-Print Archive

Generalization and Exploration via Randomized Value Functions

Author: Osband Ian
Van Roy Benjamin
Wen Zheng
Publication venue
Publication date: 15/02/2016
Field of study

We propose randomized least-squares value iteration (RLSVI) -- a new reinforcement learning algorithm designed to explore and generalize efficiently via linearly parameterized value functions. We explain why versions of least-squares value iteration that use Boltzmann or epsilon-greedy exploration can be highly inefficient, and we present computational results that demonstrate dramatic efficiency gains enjoyed by RLSVI. Further, we establish an upper bound on the expected regret of RLSVI that demonstrates near-optimality in a tabula rasa learning context. More broadly, our results suggest that randomized value functions offer a promising approach to tackling a critical challenge in reinforcement learning: synthesizing efficient exploration and effective generalization.Comment: arXiv admin note: text overlap with arXiv:1307.484

arXiv.org e-Print Archive

Training Reinforcement Neurocontrollers Using the Polytope Algorithm

Author: Lagaris I. E.
Likas A.
Publication venue
Publication date: 03/12/1998
Field of study

A new training algorithm is presented for delayed reinforcement learning problems that does not assume the existence of a critic model and employs the polytope optimization algorithm to adjust the weights of the action network so that a simple direct measure of the training performance is maximized. Experimental results from the application of the method to the pole balancing problem indicate improved training performance compared with critic-based and genetic reinforcement approaches

arXiv.org e-Print Archive

CiteSeerX