3,335 research outputs found

    Planning to Explore via Self-Supervised World Models

    Full text link
    Reinforcement learning allows solving complex tasks; however, learning tends to be task-specific and sample efficiency remains a challenge. We present Plan2Explore, a self-supervised reinforcement learning agent that tackles both of these challenges through a new approach to self-supervised exploration and fast adaptation to new tasks, which need not be known during exploration. During exploration, unlike prior methods which retrospectively compute the novelty of observations after the agent has already reached them, our agent acts efficiently by leveraging planning to seek out expected future novelty. After exploration, the agent quickly adapts to multiple downstream tasks in a zero-shot or few-shot manner. We evaluate on challenging control tasks from high-dimensional image inputs. Without any training supervision or task-specific interaction, Plan2Explore outperforms prior self-supervised exploration methods and, in fact, almost matches the performance of an oracle which has access to rewards. Videos and code at https://ramanans1.github.io/plan2explore/
    Comment: Accepted at ICML 2020. Videos and code at https://ramanans1.github.io/plan2explore
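    The abstract's key idea is seeking expected future novelty rather than scoring novelty after the fact. A minimal sketch of one common way to quantify such novelty, using disagreement across an ensemble of learned one-step predictors, follows; the ensemble size, latent dimensionality, and use of variance as the novelty signal are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def disagreement_reward(ensemble_predictions):
    """Expected-novelty proxy: variance across an ensemble of one-step
    latent predictions (shape [n_models, latent_dim]), averaged over
    latent dimensions. A planner can score imagined trajectories with this."""
    return ensemble_predictions.var(axis=0).mean()

# Illustrative usage: 5 hypothetical ensemble members predicting a 32-d latent state.
preds = np.random.randn(5, 32)
print(disagreement_reward(preds))
```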

    Randomized Prior Functions for Deep Reinforcement Learning

    Full text link
    Dealing with uncertainty is essential for efficient reinforcement learning. There is a growing literature on uncertainty estimation for deep learning from fixed datasets, but many of the most popular approaches are poorly suited to sequential decision problems. Other methods, such as bootstrap sampling, have no mechanism for uncertainty that does not come from the observed data. We highlight why this can be a crucial shortcoming and propose a simple remedy through the addition of a randomized untrainable 'prior' network to each ensemble member. We prove that this approach is efficient with linear representations, provide simple illustrations of its efficacy with nonlinear representations, and show that this approach scales to large-scale problems far better than previous attempts.
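    A minimal sketch of the additive-prior idea described above: each ensemble member is the sum of a trainable network and a frozen, randomly initialized prior network. The tiny networks, the prior scale beta, and the input/output sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(in_dim, out_dim, hidden=32):
    """Tiny random two-layer network; returns its parameters and a forward pass."""
    params = [rng.normal(size=(in_dim, hidden)), np.zeros(hidden),
              rng.normal(size=(hidden, out_dim)), np.zeros(out_dim)]
    def forward(x, p=params):
        h = np.tanh(x @ p[0] + p[1])
        return h @ p[2] + p[3]
    return params, forward

# One ensemble member: a trainable network plus a frozen random 'prior'.
# Only train_params would receive gradient updates; the prior is never trained.
train_params, trainable = make_mlp(4, 2)
_, prior = make_mlp(4, 2)
beta = 3.0  # prior scale; the value here is illustrative

def q_values(obs):
    return trainable(obs) + beta * prior(obs)

print(q_values(np.ones(4)))
```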

    Projective simulation for classical learning agents: a comprehensive investigation

    Full text link
    We study the model of projective simulation (PS), a novel approach to artificial intelligence based on stochastic processing of episodic memory which was recently introduced [H.J. Briegel and G. De las Cuevas, Sci. Rep. 2, 400 (2012)]. Here we provide a detailed analysis of the model and examine its performance, including its achievable efficiency, its learning times, and the way both properties scale with the problem's dimension. In addition, we situate the PS agent in different learning scenarios and study its learning abilities. A variety of new scenarios is considered, thereby demonstrating the model's flexibility. Furthermore, to put the PS scheme in context, we compare its performance with that of Q-learning and learning classifier systems, two popular models in the field of reinforcement learning. It is shown that PS is a competitive artificial intelligence model with unique properties and strengths.
    Comment: Accepted for publication in New Generation Computing. 23 pages, 23 figures
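    For context, a stripped-down sketch of a two-layer PS agent: percept clips hop stochastically to action clips with probabilities proportional to edge weights (h-values), and rewarded hops are reinforced with a damped update. The forgetting rate and the direct percept-to-action layout are illustrative simplifications of the full clip-network model, not the paper's exact formulation.

```python
import numpy as np

class ProjectiveSimulationAgent:
    """Two-layer projective simulation sketch: h-values start at 1, hopping
    probabilities are their normalization, and learning combines forgetting
    toward the baseline with reinforcement of the edge that was just used."""
    def __init__(self, n_percepts, n_actions, gamma=0.01, rng=None):
        self.h = np.ones((n_percepts, n_actions))
        self.gamma = gamma                      # forgetting rate (illustrative)
        self.rng = rng or np.random.default_rng(0)

    def act(self, percept):
        p = self.h[percept] / self.h[percept].sum()
        return int(self.rng.choice(len(p), p=p))

    def learn(self, percept, action, reward):
        self.h -= self.gamma * (self.h - 1.0)   # damp all h-values toward 1
        self.h[percept, action] += reward       # reinforce the used edge
```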

    Off-Policy Shaping Ensembles in Reinforcement Learning

    Full text link
    Recent advances in gradient temporal-difference methods make it possible to learn multiple value functions off-policy in parallel without sacrificing convergence guarantees or computational efficiency. This opens up new possibilities for sound ensemble techniques in reinforcement learning. In this work we propose learning an ensemble of policies related through potential-based shaping rewards. The ensemble induces a combination policy by using a voting mechanism on its components. Learning happens in real time, and we empirically show the combination policy to outperform the individual policies of the ensemble.
    Comment: Full version of the paper to appear in Proc. ECAI 201
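    A minimal sketch of one plausible voting mechanism for the combination policy described above: each shaped learner votes for its greedy action and the plurality winner is executed. The abstract does not specify the voting rule, so plurality voting over tabular Q-values is an assumption for illustration.

```python
import numpy as np
from collections import Counter

def combination_action(q_tables, state):
    """Each ensemble member votes for its greedy action in `state`; the
    combination policy executes the plurality winner (ties broken by order).
    `q_tables` is a list of arrays indexed [state, action]."""
    votes = [int(np.argmax(q[state])) for q in q_tables]
    return Counter(votes).most_common(1)[0][0]
```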

    Deep Exploration via Bootstrapped DQN

    Full text link
    Efficient exploration in complex environments remains a major challenge for reinforcement learning. We propose bootstrapped DQN, a simple algorithm that explores in a computationally and statistically efficient manner through the use of randomized value functions. Unlike dithering strategies such as epsilon-greedy exploration, bootstrapped DQN carries out temporally-extended (or deep) exploration; this can lead to exponentially faster learning. We demonstrate these benefits in complex stochastic MDPs and in the large-scale Arcade Learning Environment. Bootstrapped DQN substantially improves learning times and performance across most Atari games.
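    The core mechanism is sketched below: one value head is sampled at the start of each episode and followed greedily until the episode ends, which yields temporally-extended exploration. Tabular Q-heads stand in for the paper's DQN heads; sizes and initialization are illustrative.

```python
import numpy as np

class BootstrappedQEnsemble:
    """Episode-level head sampling for deep exploration: draw one head per
    episode, act greedily with it throughout that episode."""
    def __init__(self, n_heads, n_states, n_actions, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.heads = self.rng.normal(scale=0.01, size=(n_heads, n_states, n_actions))
        self.active = 0

    def begin_episode(self):
        self.active = int(self.rng.integers(len(self.heads)))

    def act(self, state):
        return int(np.argmax(self.heads[self.active, state]))
```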

    UCB Exploration via Q-Ensembles

    Full text link
    We show how an ensemble of Q*-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well-established algorithms from the bandit setting and adapt them to the Q-learning setting. We propose an exploration strategy based on upper-confidence bounds (UCB). Our experiments show significant gains on the Atari benchmark.
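    A minimal sketch of UCB-style action selection over a Q-ensemble: pick the action whose ensemble mean plus a scaled ensemble standard deviation is largest. The tabular layout and the confidence width lam are illustrative assumptions.

```python
import numpy as np

def ucb_action(q_ensemble, state, lam=1.0):
    """Select the action maximizing mean + lam * std across ensemble members.
    `q_ensemble` has shape [n_members, n_states, n_actions]."""
    q = q_ensemble[:, state, :]                     # [n_members, n_actions]
    return int(np.argmax(q.mean(axis=0) + lam * q.std(axis=0)))
```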

    Off-Policy Reward Shaping with Ensembles

    Full text link
    Potential-based reward shaping (PBRS) is an effective and popular technique to speed up reinforcement learning by leveraging domain knowledge. While PBRS is proven to always preserve optimal policies, its effect on learning speed is determined by the quality of its potential function, which, in turn, depends on both the underlying heuristic and the scale. Knowing which heuristic will prove effective requires testing the options beforehand, and determining the appropriate scale requires tuning, both of which introduce additional sample complexity. We formulate a PBRS framework that improves learning speed without incurring this extra sample complexity. For this, we propose to simultaneously learn an ensemble of policies, shaped w.r.t. many heuristics and on a range of scales. The target policy is then obtained by voting. The ensemble needs to be able to efficiently and reliably learn off-policy: requirements fulfilled by the recent Horde architecture, which we take as our basis. We demonstrate empirically that (1) our ensemble policy outperforms both the base policy and its single-heuristic components, and (2) an ensemble over a general range of scales performs at least as well as one with optimally tuned components.
    Comment: To be presented at ALA-15. Short version to appear at AAMAS-1
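    The shaping construction the abstract relies on is the standard potential-based form F(s, s') = gamma * Phi(s') - Phi(s). A minimal sketch of giving each ensemble member its own scaled copy of a shared heuristic potential follows; the specific scales and gamma are illustrative.

```python
def shaped_rewards(reward, phi_s, phi_next, scales, gamma=0.99):
    """Potential-based shaping: each ensemble member sees the environment
    reward plus c * (gamma * Phi(s') - Phi(s)) for its own scale c."""
    F = gamma * phi_next - phi_s
    return [reward + c * F for c in scales]

# Illustrative usage: one shaped reward stream per off-policy learner.
print(shaped_rewards(0.0, phi_s=1.5, phi_next=2.0, scales=[0.1, 1.0, 10.0]))
```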

    Learning Gentle Object Manipulation with Curiosity-Driven Deep Reinforcement Learning

    Full text link
    Robots must know how to be gentle when they need to interact with fragile objects, or when the robot itself is prone to wear and tear. We propose an approach that enables deep reinforcement learning to train policies that are gentle, both during exploration and task execution. In a reward-based learning environment, a natural approach involves augmenting the (task) reward with a penalty for non-gentleness, which can be defined as excessive impact force. However, augmenting with only this penalty impairs learning: policies get stuck in a local optimum which avoids all contact with the environment. Prior research has shown that combining auxiliary tasks or intrinsic rewards can be beneficial for stabilizing and accelerating learning in sparse-reward domains, and indeed we find that introducing a surprise-based intrinsic reward does avoid the no-contact failure case. However, we show that a simple dynamics-based surprise is not as effective as penalty-based surprise. Penalty-based surprise, based on predicting forceful contacts, has a further benefit: it encourages exploration which is contact-rich yet gentle. We demonstrate the effectiveness of the approach using a complex, tendon-powered robot hand with tactile sensors. Videos are available at http://sites.google.com/view/gentlemanipulation
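    The reward structure described above combines a task reward, a penalty for excessive impact force, and a penalty-based surprise bonus (the error of a learned predictor of forceful contacts). A minimal sketch follows; the force limit, weights, and the use of absolute prediction error as the surprise signal are illustrative assumptions, not the paper's values.

```python
def gentle_reward(task_reward, impact_force, predicted_force,
                  force_limit=5.0, penalty_w=1.0, surprise_w=0.1):
    """Task reward, minus a penalty for force above the limit, plus a bonus
    for surprise about that penalty (encourages contact-rich yet gentle
    exploration)."""
    penalty = max(0.0, impact_force - force_limit)
    predicted_penalty = max(0.0, predicted_force - force_limit)
    surprise = abs(penalty - predicted_penalty)
    return task_reward - penalty_w * penalty + surprise_w * surprise
```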

    Neuronal Circuit Policies

    Full text link
    We propose an effective way to create interpretable control agents by re-purposing the function of a biological neural circuit model to govern simulated and real-world reinforcement learning (RL) test-beds. We model the tap-withdrawal (TW) neural circuit of the nematode C. elegans, a circuit responsible for the worm's reflexive response to external mechanical touch stimulation, and learn its synaptic and neuronal parameters as a policy for controlling basic RL tasks. We also autonomously park a real rover robot on a pre-defined trajectory by deploying such neuronal circuit policies learned in a simulated environment. To re-purpose the TW neural circuit, we adopt a search-based RL algorithm. We show that our neuronal policies perform as well as deep neural network policies, with the advantage of realizing interpretable dynamics at the cell level.
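    Since the abstract only states that a search-based RL algorithm is used, here is a generic hill-climbing random-search sketch of optimizing the circuit's free parameters against episodic return; it is a stand-in for the unspecified search procedure, not the paper's method.

```python
import numpy as np

def random_search(evaluate_return, n_params, iters=200, step=0.1, rng=None):
    """Perturb the policy parameters (e.g., synaptic weights and neuron time
    constants) and keep perturbations that improve episodic return."""
    rng = rng or np.random.default_rng(0)
    params = rng.normal(size=n_params)
    best = evaluate_return(params)
    for _ in range(iters):
        candidate = params + step * rng.normal(size=n_params)
        score = evaluate_return(candidate)
        if score > best:
            params, best = candidate, score
    return params, best
```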

    Scale-invariant temporal history (SITH): optimal slicing of the past in an uncertain world

    Full text link
    In both the human brain and any general artificial intelligence (AI), a representation of the past is necessary to predict the future. However, perfect storage of all experiences is not feasible. One approach utilized in many applications, including reward prediction in reinforcement learning, is to retain recently active features of experience in a buffer. Despite its prior successes, we show that the fixed-length buffer renders Deep Q-learning Networks (DQNs) fragile to changes in the scale over which information can be learned. To enable learning when the relevant temporal scales in the environment are not known *a priori*, recent advances in psychology and neuroscience suggest that the brain maintains a compressed representation of the past. Here we introduce a neurally plausible, scale-free memory representation we call Scale-Invariant Temporal History (SITH) for use with artificial agents. This representation covers an exponentially large period of time by sacrificing temporal accuracy for events further in the past. We demonstrate the utility of this representation by comparing the performance of agents given SITH, buffer, and exponential decay representations in learning to play video games at different levels of complexity. In these environments, SITH exhibits better learning performance by storing information for longer timescales than a fixed-size buffer, and by representing this information more clearly than a set of exponentially decayed features. Finally, we discuss how the application of SITH, along with other human-inspired models of cognition, could improve reinforcement and machine learning algorithms in general.
    Comment: Preprint for submission to Neural Computation. Submitted to Neural Computation. Update 12/18/2018: revised based on reviewer comments, resubmitted to Neural Computation on 15 December 2018. Restructured introduction and discussion, combined figures, added section for SITH parameterization
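    A crude sketch of the coverage idea described above: a bank of leaky traces with log-spaced time constants stores recent events precisely and distant events coarsely, covering exponentially long timescales with a fixed number of features. The actual SITH representation uses a Laplace-domain encoding with an approximate inverse, which this sketch omits; the time constants are illustrative.

```python
import numpy as np

class LogSpacedMemory:
    """Scale-free history sketch: leaky traces with log-spaced time constants."""
    def __init__(self, n_features, taus=(2, 8, 32, 128, 512)):
        self.taus = np.asarray(taus, dtype=float)
        self.traces = np.zeros((len(self.taus), n_features))

    def step(self, features):
        decay = np.exp(-1.0 / self.taus)[:, None]
        self.traces = decay * self.traces + (1.0 - decay) * np.asarray(features)
        return self.traces.ravel()   # compressed history fed to the agent
```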