Noisy Networks for Exploration
We introduce NoisyNet, a deep reinforcement learning agent with parametric
noise added to its weights, and show that the induced stochasticity of the
agent's policy can be used to aid efficient exploration. The parameters of the
noise are learned with gradient descent along with the remaining network
weights. NoisyNet is straightforward to implement and adds little computational
overhead. We find that replacing the conventional exploration heuristics for
A3C, DQN and dueling agents (entropy reward and ε-greedy respectively)
with NoisyNet yields substantially higher scores for a wide range of Atari
games, in some cases advancing the agent from sub- to super-human performance.
Comment: ICLR 2018
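The noisy layers the abstract describes fit in a few lines. Below is a minimal sketch of a NoisyNet-style linear layer with factorised Gaussian noise, assuming PyTorch; the factorised-noise scheme and the 0.5 initial sigma follow the commonly published formulation, while layer sizes are left to the caller.

```python
# A minimal sketch of a NoisyNet-style linear layer (factorised Gaussian noise).
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        # Learnable mean and noise-scale parameters for weights and biases;
        # the sigmas are trained by gradient descent with everything else.
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        bound = 1.0 / math.sqrt(in_features)
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(sigma0 / math.sqrt(in_features))
        self.bias_sigma.data.fill_(sigma0 / math.sqrt(in_features))

    @staticmethod
    def _f(x):
        # Factorised-noise transform: f(x) = sign(x) * sqrt(|x|).
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        # Fresh factorised noise each forward pass induces a stochastic policy.
        eps_in = self._f(torch.randn(self.in_features, device=x.device))
        eps_out = self._f(torch.randn(self.out_features, device=x.device))
        weight = self.weight_mu + self.weight_sigma * eps_out.outer(eps_in)
        bias = self.bias_mu + self.bias_sigma * eps_out
        return nn.functional.linear(x, weight, bias)
```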
Variational Bayes: A report on approaches and applications
Deep neural networks have achieved impressive results on a wide variety of
tasks. However, quantifying uncertainty in the network's output is a
challenging task. Bayesian models offer a mathematical framework to reason
about model uncertainty. Variational methods have been used for approximating
intractable integrals that arise in Bayesian inference for neural networks. In
this report, we review the major variational inference concepts pertinent to
Bayesian neural networks and compare various approximation methods used in the
literature. We also discuss applications of variational Bayes in
reinforcement learning and continual learning.
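As a concrete illustration of the variational machinery such a report surveys, here is a minimal mean-field sketch in the spirit of Bayes by Backprop, assuming PyTorch; the unit-Gaussian prior, Gaussian likelihood, and single-sample ELBO estimate are illustrative assumptions, not details from the report.

```python
# Mean-field variational inference over one weight vector: q(w) = N(mu, sigma^2)
# is fit by maximising the ELBO with the reparameterisation trick.
import torch

torch.manual_seed(0)
x, y = torch.randn(64, 3), torch.randn(64, 1)  # toy regression data

mu = torch.zeros(3, 1, requires_grad=True)
rho = torch.full((3, 1), -3.0, requires_grad=True)  # sigma = softplus(rho)
opt = torch.optim.Adam([mu, rho], lr=1e-2)

for step in range(1000):
    sigma = torch.nn.functional.softplus(rho)
    w = mu + sigma * torch.randn_like(mu)      # reparameterised sample from q
    log_lik = -((x @ w - y) ** 2).sum()        # Gaussian likelihood (up to consts)
    # Closed-form KL between q(w) and the N(0, I) prior.
    kl = (sigma**2 + mu**2 - 1).sum() / 2 - torch.log(sigma).sum()
    loss = kl - log_lik                        # negative ELBO
    opt.zero_grad(); loss.backward(); opt.step()
```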
Privileged Information Dropout in Reinforcement Learning
Using privileged information during training can improve the sample
efficiency and performance of machine learning systems. This paradigm has been
applied to reinforcement learning (RL), primarily in the form of distillation
or auxiliary tasks, and less commonly in the form of augmenting the inputs of
agents. In this work, we investigate Privileged Information Dropout (PID) for
achieving the latter, which can be applied equally to value-based and
policy-based RL algorithms. Within a simple partially-observed environment, we
demonstrate that PID outperforms alternatives for leveraging privileged
information, including distillation and auxiliary tasks, and can successfully
utilise different types of privileged information. Finally, we analyse its
effect on the learned representations.
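The abstract does not specify the dropout mechanism, so the following is only a hedged sketch of one plausible reading, assuming PyTorch: per-unit dropout probabilities on a hidden layer are predicted from the privileged input during training, and dropout is disabled at test time when that input is unavailable. All module names and sizes here are hypothetical.

```python
# Hypothetical privileged-information dropout: keep probabilities come from
# a privileged signal seen only during training.
import torch
import torch.nn as nn

class PrivilegedDropoutNet(nn.Module):
    def __init__(self, obs_dim, priv_dim, hidden=64, n_actions=4):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # Maps privileged input to per-unit keep probabilities in (0, 1).
        self.keep_prob = nn.Sequential(nn.Linear(priv_dim, hidden), nn.Sigmoid())
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs, priv=None):
        h = self.body(obs)
        if self.training and priv is not None:
            p = self.keep_prob(priv)
            mask = torch.bernoulli(p)
            h = h * mask / p.clamp(min=1e-6)  # inverted-dropout rescaling
        return self.head(h)
```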
On the Complexity of Exploration in Goal-Driven Navigation
Building agents that can explore their environments intelligently is a
challenging open problem. In this paper, we take a step towards understanding
how a hierarchical design of the agent's policy can affect its exploration
capabilities. First, we design EscapeRoom environments, where the agent must
figure out how to navigate to the exit by accomplishing a number of
intermediate tasks (subgoals), such as finding keys or opening doors.
Our environments are procedurally generated and vary in complexity, which can
be controlled by the number of subgoals and relationships between them. Next,
we propose to measure the complexity of each environment by constructing
dependency graphs between the goals and analytically computing hitting
times of a random walk on the graph. We empirically evaluate Proximal Policy
Optimization (PPO) with sparse and shaped rewards, a variation of policy
sketches, and a hierarchical version of PPO (called HiPPO) akin to h-DQN. We
show that the analytically estimated hitting time in goal dependency graphs
is an informative metric of environment complexity. We conjecture that the
result should hold for environments other than navigation. Finally, we show
that solving environments beyond a certain level of complexity requires
hierarchical approaches.
Comment: Relational Representation Learning Workshop (NIPS 2018)
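The analytic computation described above reduces to solving a small linear system: with target node t, the expected hitting times satisfy h[t] = 0 and h[v] = 1 + mean over neighbours u of h[u]. Below is a minimal NumPy sketch; the example path graph is illustrative and not from the paper.

```python
# Expected hitting times of a simple random walk to a target node,
# computed by solving (I - Q) h = 1 over the non-target nodes.
import numpy as np

def hitting_times(adj, target):
    """adj: symmetric 0/1 adjacency matrix; returns E[steps to reach target]."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    P = adj / deg[:, None]                  # transition matrix of the walk
    keep = [v for v in range(n) if v != target]
    A = np.eye(n - 1) - P[np.ix_(keep, keep)]
    h = np.linalg.solve(A, np.ones(n - 1))
    out = np.zeros(n)
    out[keep] = h
    return out

# Path graph 0-1-2-3: the hitting time to node 3 grows with distance.
adj = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], float)
print(hitting_times(adj, target=3))  # [9. 8. 5. 0.]
```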
Combine PPO with NES to Improve Exploration
We introduce two approaches for combining neural evolution strategy (NES) and
proximal policy optimization (PPO): parameter transfer and parameter space
noise. Parameter transfer initialises a PPO agent with parameters taken from an
NES agent. Parameter-space noise directly adds noise to the PPO agent's
parameters. We demonstrate that PPO can benefit from both methods through
experimental comparisons on discrete-action environments as well as continuous
control tasks.
Comment: 18 pages, 14 figures
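Both ideas admit short sketches. The following assumes PyTorch policies with matching architectures; `nes_policy`, `ppo_policy`, and the noise scale are placeholders, and the paper's exact coupling of the two algorithms may differ.

```python
# Hedged sketches of the two combinations named in the abstract.
import torch

def parameter_transfer(nes_policy, ppo_policy):
    # Initialise the PPO agent from parameters an NES agent has found.
    ppo_policy.load_state_dict(nes_policy.state_dict())

def perturb_parameters(policy, sigma=0.01):
    # Parameter-space noise: perturb the weights directly, so the policy
    # explores consistently within an episode rather than per action.
    with torch.no_grad():
        for p in policy.parameters():
            p.add_(sigma * torch.randn_like(p))
```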
Learning latent state representation for speeding up exploration
Exploration is an extremely challenging problem in reinforcement learning,
especially in high dimensional state and action spaces and when only sparse
rewards are available. Effective representations can indicate which components
of the state are task relevant and thus reduce the dimensionality of the space
to explore. In this work, we take a representation learning viewpoint on
exploration, utilizing prior experience to learn effective latent
representations, which can subsequently indicate which regions to explore.
Prior experience on separate but related tasks helps learn representations of
the state that are effective at predicting instantaneous rewards. These
learned representations can then be used with an entropy-based exploration
method to perform exploration in high-dimensional spaces by effectively
lowering the dimensionality of the search space. We show the
benefits of this representation for meta-exploration in a simulated object
pushing environment.
Comment: 7 pages, 8 figures, workshop
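A hedged sketch of the two-stage recipe, assuming PyTorch: first shape a latent encoder with a reward-prediction loss on prior tasks, then use a particle-based (k-nearest-neighbour) entropy proxy in that latent space as an exploration signal. The kNN proxy and all sizes are my assumptions rather than details from the abstract.

```python
# Stage 1: make the latent reward-predictive. Stage 2: reward novelty in it.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))
reward_head = nn.Linear(8, 1)

def pretrain_step(states, rewards, opt):
    # Shape the latent space so it predicts instantaneous reward.
    loss = ((reward_head(encoder(states)) - rewards) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def entropy_bonus(state, memory, k=5):
    # Distance to the k-th nearest visited latent point is a standard
    # particle-based entropy proxy: far away = novel = worth exploring.
    z = encoder(state)                     # state: (32,), memory: (N, 8)
    dists = torch.cdist(z.unsqueeze(0), memory).squeeze(0)
    return dists.topk(k, largest=False).values[-1]
```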
Mitigation of Policy Manipulation Attacks on Deep Q-Networks with Parameter-Space Noise
Recent developments have established the vulnerability of deep reinforcement
learning to policy manipulation attacks via intentionally perturbed inputs,
known as adversarial examples. In this work, we propose a technique for
mitigation of such attacks based on addition of noise to the parameter space of
deep reinforcement learners during training. We experimentally verify the
effect of parameter-space noise in reducing the transferability of adversarial
examples, and demonstrate the promising performance of this technique in
mitigating the impact of whitebox and blackbox attacks at both test and
training times.
Comment: arXiv admin note: substantial text overlap with arXiv:1701.04143,
arXiv:1712.0934
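A minimal sketch of the training-time mitigation as described, assuming a PyTorch DQN: Gaussian noise is injected into the Q-network's parameters before each TD update. The noise scale and update structure are placeholders; the paper's exact schedule is not given in the abstract.

```python
# Parameter-space noise during DQN training, as a defence against
# transferable adversarial examples.
import torch

def noisy_train_step(q_net, td_loss, batch, opt, sigma=0.01):
    # Inject Gaussian parameter noise, then take the usual TD gradient step.
    with torch.no_grad():
        for p in q_net.parameters():
            p.add_(sigma * torch.randn_like(p))
    loss = td_loss(q_net, batch)   # td_loss is a placeholder loss function
    opt.zero_grad(); loss.backward(); opt.step()
```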
Learning Efficient and Effective Exploration Policies with Counterfactual Meta Policy
A fundamental issue in reinforcement learning algorithms is the balance
between exploration of the environment and exploitation of information already
obtained by the agent. Exploration in particular plays a critical role in both
the efficiency and the efficacy of the learning process. However, existing
approaches to exploration involve task-agnostic designs that may perform well
in one environment but be ill-suited to another. To learn an effective and
efficient exploration policy in an automated manner, we formalise a feasible
metric for measuring the utility of exploration based on counterfactual
reasoning. Building on this metric, we propose an end-to-end algorithm that
learns an exploration policy by meta-learning. We demonstrate that our method
achieves good results compared to previous work on high-dimensional control
tasks in the MuJoCo simulator.
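The abstract leaves the metric unspecified; purely as a hedged illustration, one counterfactual reading is that the utility of an exploratory trajectory is the improvement in evaluated return when that trajectory is included in the policy update versus left out. Every function below is a hypothetical placeholder.

```python
# Hypothetical counterfactual utility of exploration: compare outcomes
# with and without the exploratory trajectory in the update.
def exploration_utility(policy, update, evaluate, data, exploratory_traj):
    with_traj = evaluate(update(policy, data + [exploratory_traj]))
    without_traj = evaluate(update(policy, data))
    return with_traj - without_traj  # counterfactual gain from exploring
```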
Reinforcement Learning with Attention that Works: A Self-Supervised Approach
Attention models have had a significant positive impact on deep learning
across a range of tasks. However, previous attempts at integrating attention
with reinforcement learning have failed to produce significant improvements. We
propose the first combination of self-attention and reinforcement learning that
is capable of producing significant improvements, including new
state-of-the-art results in the Arcade Learning Environment. Unlike the
selective attention
models used in previous attempts, which constrain the attention via
preconceived notions of importance, our implementation utilises the Markovian
properties inherent in the state input. Our method produces a faithful
visualisation of the policy, focusing on the behaviour of the agent. Our
experiments demonstrate that the trained policies use multiple simultaneous
foci of attention, and are able to modulate attention over time to deal with
situations of partial observability.
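The abstract does not spell out the architecture, so the following is only a generic sketch of self-attention over the spatial features of a convolutional encoder, the kind of module such an agent could use, assuming PyTorch; all sizes are illustrative.

```python
# Self-attention over conv feature-map positions; the attention weights can
# also be rendered as a spatial visualisation of where the policy attends.
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, feat):                      # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W, C) spatial tokens
        out, weights = self.attn(tokens, tokens, tokens)
        # `weights` reshaped to (B, H*W, H, W) gives per-query attention maps.
        return out.transpose(1, 2).reshape(b, c, h, w)
```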
Model-Based Action Exploration for Learning Dynamic Motion Skills
Deep reinforcement learning has made great strides in solving challenging
motion control tasks. Recently, there has been significant work on methods for
exploiting the data gathered during training, but there has been less work on
how to best generate the data to learn from. For continuous action domains, the
most common method for generating exploratory actions involves sampling from a
Gaussian distribution centred around the mean action output by a policy.
Although these methods can be quite capable, they do not scale well with the
dimensionality of the action space, and can be dangerous to apply on hardware.
We consider learning a forward dynamics model to predict the result
($s_{t+1}$) of taking a particular action ($a_t$) given a specific observation
of the state ($s_t$). With this model we perform internal look-ahead
predictions of outcomes and seek actions we believe have a reasonable chance of
success. This method alters the exploratory action space, thereby increasing
learning speed and enabling higher-quality solutions to difficult problems,
such as robotic locomotion and juggling.
Comment: 7 pages, 7 figures, conference paper
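A hedged sketch of the look-ahead idea, assuming PyTorch: a learned forward model scores candidate actions before any is executed, and the agent explores with the candidate whose predicted outcome looks best. The scoring function, candidate count, and noise scale are my assumptions, not values from the paper.

```python
# Model-based exploratory action selection via internal look-ahead.
import torch

def model_based_explore(dynamics_model, score_fn, state, mean_action,
                        n_candidates=32, sigma=0.3):
    # Sample candidate actions around the policy mean rather than acting
    # on raw Gaussian noise directly.
    noise = sigma * torch.randn(n_candidates, mean_action.shape[-1])
    actions = mean_action.unsqueeze(0) + noise
    with torch.no_grad():
        states = state.unsqueeze(0).expand(n_candidates, -1)
        predicted_next = dynamics_model(states, actions)  # internal look-ahead
        values = score_fn(predicted_next)  # e.g. a learned value/reward model
    return actions[values.argmax()]        # most promising exploratory action
```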