Adversarially Guided Actor-Critic
Despite definite success in deep reinforcement learning problems, actor-critic algorithms are still confronted with sample inefficiency in complex environments, particularly in tasks where efficient exploration is a bottleneck. These methods consider a policy (the actor) and a value function (the critic) whose respective losses are built using different motivations and approaches. This paper introduces a third protagonist: the adversary. While the adversary mimics the actor by minimizing the KL-divergence between their respective action distributions, the actor, in addition to learning to solve the task, tries to differentiate itself from the adversary's predictions. This novel objective stimulates the actor to follow strategies that could not have been correctly predicted from previous trajectories, making its behavior innovative in tasks where the reward is extremely rare. Our experimental analysis shows that the resulting Adversarially Guided Actor-Critic (AGAC) algorithm leads to more exhaustive exploration. Notably, AGAC outperforms current state-of-the-art methods on a set of various hard-exploration and procedurally-generated tasks.
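As a rough illustration of the idea in the abstract above, the sketch below computes the KL divergence between an actor's and an adversary's action distributions and adds it to the reward as a bonus. It is only a minimal sketch under stated assumptions: the function names, the coefficient c, and the way the bonus is combined with the reward are hypothetical and are not taken from the paper, whose exact losses are not given here.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same action set."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def reward_with_bonus(extrinsic_reward, actor_probs, adversary_probs, c=0.1):
    """Hypothetical shaping (an assumption, not the paper's formula):
    reward the actor for being hard for the adversary to predict."""
    return extrinsic_reward + c * kl_divergence(actor_probs, adversary_probs)


actor = [0.7, 0.2, 0.1]
adversary_accurate = [0.7, 0.2, 0.1]  # adversary predicts the actor exactly
adversary_wrong = [0.1, 0.2, 0.7]     # adversary mispredicts the actor

bonus_when_predicted = reward_with_bonus(0.0, actor, adversary_accurate)
bonus_when_surprising = reward_with_bonus(0.0, actor, adversary_wrong)
```

When the adversary predicts the actor exactly, the bonus vanishes; the further the prediction is from the actor's distribution, the larger the bonus becomes.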
Intrinsic reward driven exploration for deep reinforcement learning
Deep reinforcement learning has become one of the most active research topics in machine learning. In reinforcement learning, an agent interacts with its environment and seeks a policy that maximises the expected cumulative reward. Unfortunately, some environments provide only extremely sparse rewards, so the agent must learn to explore its environment efficiently to find them. Exploration in complex environments is known to be a key challenge of deep reinforcement learning, especially for tasks where rewards are very sparse.
In this thesis, intrinsic-reward-driven exploration strategies are investigated. An agent driven by such an intrinsic reward can explore expeditiously, so as to find the sparse extrinsic rewards provided by the environment. Recently, surprise has been used as an intrinsic reward that encourages systematic and efficient exploration. We first define a novel intrinsic reward function called assorted surprise, and propose the Variational Assorted Surprise Exploration (VASE) algorithm to approximate this assorted surprise in a tractable way, with the help of Bayesian neural networks. We then apply the VASE algorithm to continuous control problems and to large-scale Atari video games. Experimental results show that VASE performs well across these tasks. We then discover that all surprise-based exploration methods lose exploration efficiency in areas where the environmental transition is discontinuous. To solve this problem, we propose the Mutual Information Minimising Exploration (MIME) algorithm. We show that MIME explores as efficiently as surprise-based methods in other areas of the environment, but much better in areas with discontinuous transitions.
Deep reinforcement learning for multi-modal embodied navigation
This work focuses on an Outdoor Micro-Navigation (OMN) task in which the goal is to
navigate to a specified street address using multiple modalities including images, scene-text,
and GPS. This task is a significant challenge to many Blind and Visually Impaired (BVI)
people, which we demonstrate through interviews and market research, and we scope our
problem definition to their needs. To investigate the
feasibility of solving this task with Deep Reinforcement Learning (DRL), we first introduce
two partially observable grid-worlds, Grid-Street and Grid City, containing houses, street
numbers, and navigable regions. In these environments, we train an agent to find specific
houses using local observations under a variety of training procedures. We parameterize
our agent with a neural network and train using reinforcement learning methods. Next, we
introduce the Sidewalk Environment for Visual Navigation (SEVN), which contains panoramic
images with labels for house numbers, doors, and street name signs, and formulations for
several navigation tasks. In SEVN, we train another neural network model using Proximal
Policy Optimization (PPO) to fuse multi-modal observations in the form of variable resolution
images, visible text, and simulated GPS data, and to use this representation to navigate to
goal doors. Our best model used all available modalities and was able to navigate to over 100
goals with an 85% success rate. We found that models with access to only a subset of these
modalities performed significantly worse, supporting the need for a multi-modal approach to
the OMN task. We hope that this thesis provides a foundation for further research into the
creation of agents to assist members of the BVI community to safely navigate the world.
Learning from alternative sources of supervision
With the rise of the internet, data of many varieties, including images, audio, text
and video, are abundant. Unfortunately, data for a very specific task is typically
not abundant unless one is lucky. Typically one
might have only a small amount of labelled data, or only noisy labels, or labels for a
different task, or perhaps a simulator and reward function but no demonstrations, or
even a simulator but no reward function at all. However, arguably no task is truly novel,
and so it is often possible for neural networks to benefit from the abundant data that
is related to the task at hand. This thesis documents three methods for learning from
alternative sources of supervision, as an alternative to the preferable case of simply
having unlimited amounts of direct examples of the task. Firstly, we show how
data from many related tasks can be described with a simple graphical model and
fit using a variational autoencoder, directly modelling and representing the relations
amongst tasks. Secondly, we investigate various forms of prediction-based intrinsic
rewards for agents in a simulator with no extrinsic rewards. Thirdly, we introduce a
novel intrinsic reward and investigate how best to combine it with an extrinsic reward
for best performance.
Learning and planning in videogames via task decomposition
Artificial intelligence (AI) methods have come a long way in tabletop games, with computer programs having now surpassed human experts in the challenging games of chess, Go and heads-up no-limit Texas hold'em. However, a significant simplifying factor in these games is that individual decisions have a relatively large impact on the state of the game. The real world, however, is granular. Human beings are continually presented with new information and are faced with making a multitude of tiny decisions every second. Viewed in these terms, feedback is often sparse, meaning that it only arrives after one has made a great number of decisions. Moreover, in many real-world problems there is a continuous range of actions to choose from, and attaining meaningful feedback from the environment often requires a strong degree of action coordination. Videogames, in which players must likewise contend with granular time scales and continuous action spaces, are in this sense a better proxy for real-world problems, and have thus become regarded by many as the new frontier in games AI. Seemingly, the way in which human players approach granular decision-making in videogames is by decomposing complex tasks into high-level subproblems, thereby allowing them to focus on the "big picture". For example, in Super Mario World, human players seem to look ahead in extended steps, such as climbing a vine or jumping over a pit, rather than planning one frame at a time. Currently though, this type of reasoning does not come easily to machines, leaving many open research problems related to task decomposition. This thesis focuses on three such problems in particular: (1) The challenge of learning subgoals autonomously, so as to lessen the issue of sparse feedback. (2) The challenge of combining discrete planning techniques with extended actions whose durations and effects on the environment are uncertain. 
(3) The questions of when and why it is beneficial to reason over high-level continuous control variables, such as the velocity of a player-controlled ship, rather than over the most low-level actions available. We address these problems via new algorithms and novel experimental design, demonstrating empirically that our algorithms are more efficient than strong baselines that do not leverage task decomposition, and yielding insight into the types of environment where task decomposition is likely to be beneficial.
Ghost In the Grid: Challenges for Reinforcement Learning in Grid World Environments
The current state-of-the-art deep reinforcement learning techniques require agents to gather large amounts of diverse experiences to train effective and general models. In addition, there are also many other factors that have to be taken into consideration: for example, how the agent interacts with its environment; parameter optimization techniques; environment exploration methods; and finally the diversity of environments that is provided to an agent. In this thesis, we investigate several of these factors. Firstly, we introduce Griddly, a high-performance grid-world game engine that provides a state-of-the-art combination of high performance and flexibility. We demonstrate that grid worlds provide a principled and expressive substrate for fundamental research questions in reinforcement learning, whilst filtering out noise inherent in physical systems. We show that although grid-worlds are constructed with simple rules-based mechanics, they can be used to construct complex, open-ended, and procedurally generated environments. We improve upon Griddly with GriddlyJS, a web-based tool for designing and testing grid-world environments for reinforcement learning research. GriddlyJS provides a rich suite of features that assist researchers in a multitude of different learning approaches. To highlight the features of GriddlyJS we present a dataset of 100 complex escape-room puzzle levels. In addition to these complex puzzle levels, we provide human-generated trajectories and a baseline policy that can be run in a web browser. We show that this tooling enables significantly faster research iteration in many sub-fields. We then explore several areas of RL research that are made accessible by the features introduced by Griddly. Firstly, we explore learning grid-world game mechanics using deep neural networks. The neural game engine is introduced, which has competitive performance in terms of sample efficiency and predicting states accurately over long time horizons.
Secondly, conditional action trees are introduced, which describe a method for compactly expressing complex hierarchical action spaces. Expressing hierarchical action spaces as trees leads to action spaces that are additive rather than multiplicative over the factors of the action space. It is shown that these compressed action spaces reduce the required output size of neural networks without compromising performance. This makes the interfaces to complex environments significantly simpler to implement. Finally, we explore the inherent symmetry in common observation spaces, using the concept of geometric deep learning. We show that certain geometric data augmentation methods do not conform to the underlying assumptions in several training algorithms. We provide solutions to these problems in the form of novel regularization functions and demonstrate that these methods fix the underlying assumptions.
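The additive-versus-multiplicative point above can be made concrete with a small arithmetic sketch. The factor names and sizes below (unit, action type, direction) are hypothetical and chosen only for illustration. The sketch compares output-layer sizes and does not implement the conditional action trees described in the thesis.

```python
from math import prod

# Hypothetical action-space factors: which unit, which action type, which direction.
factor_sizes = {"unit": 10, "action_type": 5, "direction": 4}


def flat_output_size(sizes):
    """One network output per joint action: sizes multiply."""
    return prod(sizes.values())


def factored_output_size(sizes):
    """One output per option of each factor: sizes add."""
    return sum(sizes.values())


flat = flat_output_size(factor_sizes)          # 10 * 5 * 4
factored = factored_output_size(factor_sizes)  # 10 + 5 + 4
```

With these assumed sizes the flat encoding needs 200 outputs while the factored one needs 19, and the gap widens as more factors are added.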