Noisy Networks for Exploration
We introduce NoisyNet, a deep reinforcement learning agent with parametric
noise added to its weights, and show that the induced stochasticity of the
agent's policy can be used to aid efficient exploration. The parameters of the
noise are learned with gradient descent along with the remaining network
weights. NoisyNet is straightforward to implement and adds little computational
overhead. We find that replacing the conventional exploration heuristics for
A3C, DQN and dueling agents (entropy reward and ε-greedy respectively)
with NoisyNet yields substantially higher scores for a wide range of Atari
games, in some cases advancing the agent from sub- to super-human performance.
Comment: ICLR 2018
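The noisy layers the abstract describes fit in a few lines. Below is a minimal sketch of a NoisyNet-style linear layer with factorised Gaussian noise, assuming PyTorch; the factorised-noise scheme and the 0.5 initial sigma follow the commonly published formulation, while layer sizes are left to the caller.

```python
# A minimal sketch of a NoisyNet-style linear layer (factorised Gaussian noise).
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        # Learnable mean and noise-scale parameters for weights and biases;
        # the sigmas are trained by gradient descent with everything else.
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        bound = 1.0 / math.sqrt(in_features)
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(sigma0 / math.sqrt(in_features))
        self.bias_sigma.data.fill_(sigma0 / math.sqrt(in_features))

    @staticmethod
    def _f(x):
        # Factorised-noise transform: f(x) = sign(x) * sqrt(|x|).
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        # Fresh factorised noise each forward pass induces a stochastic policy.
        eps_in = self._f(torch.randn(self.in_features, device=x.device))
        eps_out = self._f(torch.randn(self.out_features, device=x.device))
        weight = self.weight_mu + self.weight_sigma * eps_out.outer(eps_in)
        bias = self.bias_mu + self.bias_sigma * eps_out
        return nn.functional.linear(x, weight, bias)
```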
Variational Bayes: A report on approaches and applications
Deep neural networks have achieved impressive results on a wide variety of
tasks. However, quantifying uncertainty in the network's output is a
challenging task. Bayesian models offer a mathematical framework to reason
about model uncertainty. Variational methods have been used for approximating
intractable integrals that arise in Bayesian inference for neural networks. In
this report, we review the major variational inference concepts pertinent to
Bayesian neural networks and compare various approximation methods used in the
literature. We also discuss applications of variational Bayes in
reinforcement learning and continual learning.
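As a concrete illustration of the variational machinery such a report surveys, here is a minimal mean-field sketch in the spirit of Bayes by Backprop, assuming PyTorch; the unit-Gaussian prior, Gaussian likelihood, and single-sample ELBO estimate are illustrative assumptions, not details from the report.

```python
# Mean-field variational inference over one weight vector: q(w) = N(mu, sigma^2)
# is fit by maximising the ELBO with the reparameterisation trick.
import torch

torch.manual_seed(0)
x, y = torch.randn(64, 3), torch.randn(64, 1)  # toy regression data

mu = torch.zeros(3, 1, requires_grad=True)
rho = torch.full((3, 1), -3.0, requires_grad=True)  # sigma = softplus(rho)
opt = torch.optim.Adam([mu, rho], lr=1e-2)

for step in range(1000):
    sigma = torch.nn.functional.softplus(rho)
    w = mu + sigma * torch.randn_like(mu)      # reparameterised sample from q
    log_lik = -((x @ w - y) ** 2).sum()        # Gaussian likelihood (up to consts)
    # Closed-form KL between q(w) and the N(0, I) prior.
    kl = (sigma**2 + mu**2 - 1).sum() / 2 - torch.log(sigma).sum()
    loss = kl - log_lik                        # negative ELBO
    opt.zero_grad(); loss.backward(); opt.step()
```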
Privileged Information Dropout in Reinforcement Learning
Using privileged information during training can improve the sample
efficiency and performance of machine learning systems. This paradigm has been
applied to reinforcement learning (RL), primarily in the form of distillation
or auxiliary tasks, and less commonly in the form of augmenting the inputs of
agents. In this work, we investigate Privileged Information Dropout (PID) for
achieving the latter, which can be applied equally to value-based and
policy-based RL algorithms. Within a simple partially-observed environment, we
demonstrate that PID outperforms alternatives for leveraging privileged
information, including distillation and auxiliary tasks, and can successfully
utilise different types of privileged information. Finally, we analyse its
effect on the learned representations.
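The abstract does not specify the dropout mechanism, so the following is only a hedged sketch of one plausible reading, assuming PyTorch: per-unit dropout probabilities on a hidden layer are predicted from the privileged input during training, and dropout is disabled at test time when that input is unavailable. All module names and sizes here are hypothetical.

```python
# Hypothetical privileged-information dropout: keep probabilities come from
# a privileged signal seen only during training.
import torch
import torch.nn as nn

class PrivilegedDropoutNet(nn.Module):
    def __init__(self, obs_dim, priv_dim, hidden=64, n_actions=4):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # Maps privileged input to per-unit keep probabilities in (0, 1).
        self.keep_prob = nn.Sequential(nn.Linear(priv_dim, hidden), nn.Sigmoid())
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs, priv=None):
        h = self.body(obs)
        if self.training and priv is not None:
            p = self.keep_prob(priv)
            mask = torch.bernoulli(p)
            h = h * mask / p.clamp(min=1e-6)  # inverted-dropout rescaling
        return self.head(h)
```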
On the Complexity of Exploration in Goal-Driven Navigation
Building agents that can explore their environments intelligently is a
challenging open problem. In this paper, we take a step towards understanding
how a hierarchical design of the agent's policy can affect its exploration
capabilities. First, we design EscapeRoom environments, where the agent must
figure out how to navigate to the exit by accomplishing a number of
intermediate tasks (subgoals), such as finding keys or opening doors.
Our environments are procedurally generated and vary in complexity, which can
be controlled by the number of subgoals and relationships between them. Next,
we propose to measure the complexity of each environment by constructing
dependency graphs between the goals and analytically computing hitting
times of a random walk on the graph. We empirically evaluate Proximal Policy
Optimization (PPO) with sparse and shaped rewards, a variation of policy
sketches, and a hierarchical version of PPO (called HiPPO) akin to h-DQN. We
show that the analytically estimated hitting time in goal dependency graphs
is an informative metric of environment complexity. We conjecture that the
result should hold for environments other than navigation. Finally, we show
that solving environments beyond a certain level of complexity requires
hierarchical approaches.
Comment: Relational Representation Learning Workshop (NIPS 2018)
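The analytic computation described above reduces to solving a small linear system: with target node t, the expected hitting times satisfy h[t] = 0 and h[v] = 1 + mean over neighbours u of h[u]. Below is a minimal NumPy sketch; the example path graph is illustrative and not from the paper.

```python
# Expected hitting times of a simple random walk to a target node,
# computed by solving (I - Q) h = 1 over the non-target nodes.
import numpy as np

def hitting_times(adj, target):
    """adj: symmetric 0/1 adjacency matrix; returns E[steps to reach target]."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    P = adj / deg[:, None]                  # transition matrix of the walk
    keep = [v for v in range(n) if v != target]
    A = np.eye(n - 1) - P[np.ix_(keep, keep)]
    h = np.linalg.solve(A, np.ones(n - 1))
    out = np.zeros(n)
    out[keep] = h
    return out

# Path graph 0-1-2-3: the hitting time to node 3 grows with distance.
adj = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], float)
print(hitting_times(adj, target=3))  # [9. 8. 5. 0.]
```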
Combine PPO with NES to Improve Exploration
We introduce two approaches for combining neural evolution strategy (NES) and
proximal policy optimization (PPO): parameter transfer and parameter space
noise. Parameter transfer initialises a PPO agent with parameters taken from an
NES agent. Parameter-space noise directly adds noise to the PPO agent's
parameters. We demonstrate that PPO can benefit from both methods through
experimental comparisons on discrete-action environments as well as continuous
control tasks.
Comment: 18 pages, 14 figures
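Both ideas admit short sketches. The following assumes PyTorch policies with matching architectures; `nes_policy`, `ppo_policy`, and the noise scale are placeholders, and the paper's exact coupling of the two algorithms may differ.

```python
# Hedged sketches of the two combinations named in the abstract.
import torch

def parameter_transfer(nes_policy, ppo_policy):
    # Initialise the PPO agent from parameters an NES agent has found.
    ppo_policy.load_state_dict(nes_policy.state_dict())

def perturb_parameters(policy, sigma=0.01):
    # Parameter-space noise: perturb the weights directly, so the policy
    # explores consistently within an episode rather than per action.
    with torch.no_grad():
        for p in policy.parameters():
            p.add_(sigma * torch.randn_like(p))
```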
Learning latent state representation for speeding up exploration
Exploration is an extremely challenging problem in reinforcement learning,
especially in high dimensional state and action spaces and when only sparse
rewards are available. Effective representations can indicate which components
of the state are task relevant and thus reduce the dimensionality of the space
to explore. In this work, we take a representation learning viewpoint on
exploration, utilizing prior experience to learn effective latent
representations, which can subsequently indicate which regions to explore.
Prior experience on separate but related tasks helps learn representations of
the state that are effective at predicting instantaneous rewards. These
learned representations can then be used with an entropy-based exploration
method to perform exploration in high-dimensional spaces by effectively
lowering the dimensionality of the search space. We show the
benefits of this representation for meta-exploration in a simulated object
pushing environment.
Comment: 7 pages, 8 figures, workshop
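A hedged sketch of the two-stage recipe, assuming PyTorch: first shape a latent encoder with a reward-prediction loss on prior tasks, then use a particle-based (k-nearest-neighbour) entropy proxy in that latent space as an exploration signal. The kNN proxy and all sizes are my assumptions rather than details from the abstract.

```python
# Stage 1: make the latent reward-predictive. Stage 2: reward novelty in it.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))
reward_head = nn.Linear(8, 1)

def pretrain_step(states, rewards, opt):
    # Shape the latent space so it predicts instantaneous reward.
    loss = ((reward_head(encoder(states)) - rewards) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def entropy_bonus(state, memory, k=5):
    # Distance to the k-th nearest visited latent point is a standard
    # particle-based entropy proxy: far away = novel = worth exploring.
    z = encoder(state)                     # state: (32,), memory: (N, 8)
    dists = torch.cdist(z.unsqueeze(0), memory).squeeze(0)
    return dists.topk(k, largest=False).values[-1]
```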
Mitigation of Policy Manipulation Attacks on Deep Q-Networks with Parameter-Space Noise
Recent developments have established the vulnerability of deep reinforcement
learning to policy manipulation attacks via intentionally perturbed inputs,
known as adversarial examples. In this work, we propose a technique for
mitigation of such attacks based on addition of noise to the parameter space of
deep reinforcement learners during training. We experimentally verify the
effect of parameter-space noise in reducing the transferability of adversarial
examples, and demonstrate the promising performance of this technique in
mitigating the impact of whitebox and blackbox attacks at both test and
training times.
Comment: arXiv admin note: substantial text overlap with arXiv:1701.04143,
arXiv:1712.0934
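A minimal sketch of the training-time mitigation as described, assuming a PyTorch DQN: Gaussian noise is injected into the Q-network's parameters before each TD update. The noise scale and update structure are placeholders; the paper's exact schedule is not given in the abstract.

```python
# Parameter-space noise during DQN training, as a defence against
# transferable adversarial examples.
import torch

def noisy_train_step(q_net, td_loss, batch, opt, sigma=0.01):
    # Inject Gaussian parameter noise, then take the usual TD gradient step.
    with torch.no_grad():
        for p in q_net.parameters():
            p.add_(sigma * torch.randn_like(p))
    loss = td_loss(q_net, batch)   # td_loss is a placeholder loss function
    opt.zero_grad(); loss.backward(); opt.step()
```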
Learning Efficient and Effective Exploration Policies with Counterfactual Meta Policy
A fundamental issue in reinforcement learning algorithms is the balance
between exploration of the environment and exploitation of information already
obtained by the agent. Exploration in particular plays a critical role in both
the efficiency and the efficacy of the learning process. However, existing
approaches to exploration involve task-agnostic designs that may perform well
in one environment but be ill-suited to another. To learn an effective and
efficient exploration policy in an automated manner, we formalise a feasible
metric for measuring the utility of exploration based on counterfactual
reasoning. Building on this metric, we propose an end-to-end algorithm that
learns an exploration policy by meta-learning. We demonstrate that our method
achieves good results compared to previous work on high-dimensional control
tasks in the MuJoCo simulator.
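The abstract leaves the metric unspecified; purely as a hedged illustration, one counterfactual reading is that the utility of an exploratory trajectory is the improvement in evaluated return when that trajectory is included in the policy update versus left out. Every function below is a hypothetical placeholder.

```python
# Hypothetical counterfactual utility of exploration: compare outcomes
# with and without the exploratory trajectory in the update.
def exploration_utility(policy, update, evaluate, data, exploratory_traj):
    with_traj = evaluate(update(policy, data + [exploratory_traj]))
    without_traj = evaluate(update(policy, data))
    return with_traj - without_traj  # counterfactual gain from exploring
```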
Reinforcement Learning with Attention that Works: A Self-Supervised Approach
Attention models have had a significant positive impact on deep learning
across a range of tasks. However, previous attempts at integrating attention
with reinforcement learning have failed to produce significant improvements. We
propose the first combination of self-attention and reinforcement learning that
is capable of producing significant improvements, including new
state-of-the-art results in the Arcade Learning Environment. Unlike the
selective attention
models used in previous attempts, which constrain the attention via
preconceived notions of importance, our implementation utilises the Markovian
properties inherent in the state input. Our method produces a faithful
visualisation of the policy, focusing on the behaviour of the agent. Our
experiments demonstrate that the trained policies use multiple simultaneous
foci of attention, and are able to modulate attention over time to deal with
situations of partial observability.
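The abstract does not spell out the architecture, so the following is only a generic sketch of self-attention over the spatial features of a convolutional encoder, the kind of module such an agent could use, assuming PyTorch; all sizes are illustrative.

```python
# Self-attention over conv feature-map positions; the attention weights can
# also be rendered as a spatial visualisation of where the policy attends.
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, feat):                      # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W, C) spatial tokens
        out, weights = self.attn(tokens, tokens, tokens)
        # `weights` reshaped to (B, H*W, H, W) gives per-query attention maps.
        return out.transpose(1, 2).reshape(b, c, h, w)
```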
Model-Based Action Exploration for Learning Dynamic Motion Skills
Deep reinforcement learning has made great strides in solving challenging
motion control tasks. Recently, there has been significant work on methods for
exploiting the data gathered during training, but there has been less work on
how to best generate the data to learn from. For continuous action domains, the
most common method for generating exploratory actions involves sampling from a
Gaussian distribution centred around the mean action output by a policy.
Although these methods can be quite capable, they do not scale well with the
dimensionality of the action space, and can be dangerous to apply on hardware.
We consider learning a forward dynamics model to predict the result
($s_{t+1}$) of taking a particular action ($a_t$) given a specific observation
of the state ($s_t$). With this model we perform internal look-ahead
predictions of outcomes and seek actions we believe have a reasonable chance of
success. This method alters the exploratory action space, thereby increasing
learning speed and enabling higher-quality solutions to difficult problems,
such as robotic locomotion and juggling.
Comment: 7 pages, 7 figures, conference paper
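A hedged sketch of the look-ahead idea, assuming PyTorch: a learned forward model scores candidate actions before any is executed, and the agent explores with the candidate whose predicted outcome looks best. The scoring function, candidate count, and noise scale are my assumptions, not values from the paper.

```python
# Model-based exploratory action selection via internal look-ahead.
import torch

def model_based_explore(dynamics_model, score_fn, state, mean_action,
                        n_candidates=32, sigma=0.3):
    # Sample candidate actions around the policy mean rather than acting
    # on raw Gaussian noise directly.
    noise = sigma * torch.randn(n_candidates, mean_action.shape[-1])
    actions = mean_action.unsqueeze(0) + noise
    with torch.no_grad():
        states = state.unsqueeze(0).expand(n_candidates, -1)
        predicted_next = dynamics_model(states, actions)  # internal look-ahead
        values = score_fn(predicted_next)  # e.g. a learned value/reward model
    return actions[values.argmax()]        # most promising exploratory action
```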