199 research outputs found
Deep Reinforcement Learning: A Brief Survey
Deep reinforcement learning (DRL) is poised to revolutionize the field of artificial intelligence (AI) and represents a step toward building autonomous systems with a higher-level understanding of the visual world. Currently, deep learning is enabling reinforcement learning (RL) to scale to problems that were previously intractable, such as learning to play video games directly from pixels. DRL algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of RL, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep RL, including the deep Q-network (DQN), trust region policy optimization (TRPO), and asynchronous advantage actor-critic (A3C). In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via RL. To conclude, we describe several current areas of research within the field.
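As a reminder of the update rule at the heart of DQN-style methods, here is a minimal tabular Q-learning sketch (illustrative only, not code from the survey; DQN replaces the table with a neural network and adds experience replay and a target network):

```python
import numpy as np

# Tabular Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
# State/action counts and learning parameters below are toy values.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9

def q_update(s, a, r, s_next):
    td_target = r + gamma * Q[s_next].max()   # bootstrapped one-step target
    Q[s, a] += alpha * (td_target - Q[s, a])  # move estimate toward the target

q_update(0, 1, 1.0, 2)  # one observed transition with reward 1
```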
Automating Vehicles by Deep Reinforcement Learning using Task Separation with Hill Climbing
Within the context of autonomous driving, a model-based reinforcement learning
algorithm is proposed for the design of neural network-parameterized
controllers. Classical model-based control methods, which include sampling- and
lattice-based algorithms and model predictive control, suffer from the
trade-off between model complexity and computational burden required for the
online solution of expensive optimization or search problems at every short
sampling time. To circumvent this trade-off, a two-step procedure is motivated:
first, a controller is learned during offline training based on an arbitrarily
complicated mathematical system model; then the trained controller is evaluated
online by fast feedforward passes. The contribution of this paper is the
proposition of a simple gradient-free and model-based algorithm for deep
reinforcement learning using task separation with hill climbing (TSHC). In
particular, (i) simultaneous training on separate deterministic tasks with the
purpose of encoding many motion primitives in a neural network, and (ii) the
employment of maximally sparse rewards in combination with virtual velocity
constraints (VVCs) in setpoint proximity are advocated.
Comment: 10 pages, 6 figures, 1 table
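The gradient-free hill-climbing idea can be sketched as follows (a toy stand-in, not the paper's TSHC implementation: perturb the controller parameters, evaluate on all deterministic tasks, and keep a perturbation only if the summed return improves; the tasks and return function here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def episode_return(theta, task):
    # toy deterministic "task": return is higher the closer theta is to the task target
    return -np.sum((theta - task) ** 2)

tasks = [np.array([1.0, -1.0]), np.array([0.5, 0.5])]  # separate deterministic tasks
theta = np.zeros(2)                                    # controller parameters
best = sum(episode_return(theta, t) for t in tasks)

for _ in range(200):
    candidate = theta + 0.1 * rng.standard_normal(2)   # random perturbation
    score = sum(episode_return(candidate, t) for t in tasks)
    if score > best:                                   # hill climbing: accept only improvements
        theta, best = candidate, score
```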
Human-in-the-Loop Methods for Data-Driven and Reinforcement Learning Systems
Recent successes combine reinforcement learning algorithms with deep neural
networks, yet reinforcement learning is still not widely applied to robotics
and real-world scenarios. This can be attributed to the fact that current
state-of-the-art, end-to-end reinforcement learning approaches still require
thousands or millions of data samples to converge to a satisfactory policy and
are subject to catastrophic failures during training. Conversely, in real world
scenarios and after just a few data samples, humans are able to either provide
demonstrations of the task, intervene to prevent catastrophic actions, or
simply evaluate if the policy is performing correctly. This research
investigates how to integrate these human interaction modalities into the
reinforcement learning loop, increasing sample efficiency and enabling
real-time reinforcement learning in robotics and real world scenarios. This
novel theoretical foundation is called Cycle-of-Learning, a reference to how
different human interaction modalities, namely, task demonstration,
intervention, and evaluation, are cycled and combined with reinforcement learning
algorithms. Results presented in this work show that the reward signal that is
learned based upon human interaction accelerates the rate of learning of
reinforcement learning algorithms and that learning from a combination of human
demonstrations and interventions is faster and more sample efficient when
compared to traditional supervised learning algorithms. Finally,
Cycle-of-Learning develops an effective transition from policies learned
using human demonstrations and interventions to reinforcement learning. The
theoretical foundation developed by this research opens new research paths to
human-agent teaming scenarios where autonomous agents are able to learn from
human teammates and adapt to mission performance metrics in real-time and in
real-world scenarios.
Comment: PhD thesis, Aerospace Engineering, Texas A&M (2020). For more
information, see https://vggoecks.com
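One ingredient described above, learning a reward signal from human interaction, can be illustrated with a minimal sketch (not the thesis code; the features, human scores, and linear model are toy assumptions):

```python
import numpy as np

# Fit a linear reward model r(s) = w * s to human evaluations of states,
# then use it to score states the agent visits.
states = np.array([[0.0], [0.5], [1.0]])   # 1-D state features (toy)
human_scores = np.array([0.0, 0.5, 1.0])   # human evaluation of each state

# least-squares fit of the reward model
w, *_ = np.linalg.lstsq(states, human_scores, rcond=None)

def learned_reward(s):
    return float(s @ w)
```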
Learning predictive cognitive maps with spiking neurons during behaviour and replays
The hippocampus has been proposed to encode environments using a representation that contains predictive information about likely future states, called the successor representation. However, it is not clear how such a representation could be learned in the hippocampal circuit. Here, we propose a plasticity rule that can learn this predictive map of the environment using a spiking neural network. We connect this biologically plausible plasticity rule to reinforcement learning, mathematically and numerically showing that it implements the TD-lambda algorithm. By spanning these different levels, we show how our framework naturally encompasses behavioral activity and replays, smoothly moving from rate to temporal coding, and allows learning over behavioral timescales with a plasticity rule acting on a timescale of milliseconds. We discuss how biological parameters such as dwelling times at states, neuronal firing rates and neuromodulation relate to the delay discounting parameter of the TD algorithm, and how they influence the learned representation. We also find that, in agreement with psychological studies and contrary to reinforcement learning theory, the discount factor decreases hyperbolically with time. Finally, our framework suggests a role for replays, in both aiding learning in novel environments and finding shortcut trajectories that were not experienced during behavior, in agreement with experimental data.
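The successor representation learned by a TD rule can be sketched in a few lines (the paper's plasticity rule implements TD(lambda); this toy sketch shows the simpler TD(0) case on a three-state linear track, with illustrative parameters):

```python
import numpy as np

# Successor matrix M[s, s'] ~ expected discounted future visits to s' from s.
n = 3                 # states on a tiny linear track
M = np.eye(n)
alpha, gamma = 0.1, 0.9

def sr_td_update(s, s_next):
    onehot = np.eye(n)[s]
    # TD(0) update toward the bootstrapped target onehot(s) + gamma * M[s_next]
    M[s] += alpha * (onehot + gamma * M[s_next] - M[s])

# repeatedly experience the trajectory 0 -> 1 -> 2
for _ in range(500):
    sr_td_update(0, 1)
    sr_td_update(1, 2)
```

After convergence, row M[0] approaches [1, 0.9, 0.81]: each future state is weighted by the discount raised to the number of steps away, which is the predictive map.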
Sample-Efficient Deep Reinforcement Learning for Continuous Control
Reinforcement learning (RL) is a powerful, generic approach to discovering optimal policies
in complex sequential decision-making problems. Recently, with flexible function approximators such as neural networks, RL has greatly expanded its realm of applications, from
playing computer games with pixel inputs, to mastering the game of Go, to learning parkour
movements by simulated humanoids. However, common RL approaches are known to be
sample-intensive, making them difficult to apply to real-world problems such
as robotics. This thesis makes several contributions toward developing RL algorithms for
learning in the wild, where sample-efficiency and stability are critical. The key contributions
include Normalized Advantage Functions (NAF), extending Q-learning for continuous action problems; Interpolated Policy Gradient (IPG), unifying prior policy gradient algorithm
variants through theoretical analyses on bias and variance; and Temporal Difference Models
(TDM), interpreting a parameterized Q-function as a generalized dynamics model for novel
temporally abstracted model-based planning. Importantly, this thesis highlights that these
algorithms can be seen as bridging gaps between branches of RL – model-based with model-free, and on-policy with off-policy. The proposed algorithms not only achieve substantial
improvements over the prior approaches, but also provide novel perspectives on how to mix
different branches of RL effectively to gain the best of both worlds. NAF has subsequently
been shown to be able to train two 7-DoF robot arms to open doors using only 2.5 hours of
real-world experience, making it one of the first demonstrations of deep RL approaches on
real robots.
- Cambridge-Tuebingen PhD Fellowship in Machine Learning
- Google Focused Research Award
- NSERC
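The NAF idea mentioned above can be sketched concretely (a toy stand-in, not the thesis code: the V, mu, and P networks are replaced by fixed illustrative functions). The advantage is constrained to a quadratic in the action, so the greedy continuous action is available in closed form as mu(s), with no inner-loop search:

```python
import numpy as np

# NAF parameterization: Q(s, a) = V(s) - 0.5 * (a - mu(s))^T P(s) (a - mu(s)).
def V(s):
    return 1.0                         # toy state-value head

def mu(s):
    return np.array([0.3, -0.2]) * s   # toy policy head (greedy action)

def P(s):
    return np.eye(2)                   # positive-definite precision matrix

def Q(s, a):
    d = a - mu(s)
    return V(s) - 0.5 * d @ P(s) @ d   # quadratic advantage is always <= 0

s = 2.0
greedy_action = mu(s)                  # argmax_a Q(s, a), in closed form
```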
Deep Reinforcement Learning based Path-Planning for Multi-Agent Systems in Advection-Diffusion Field Reconstruction Tasks
Many environmental processes can be represented mathematically using spatially and temporally varying partial differential equations. Timely estimation and prediction of processes such as wildfires is critical for disaster management response, but is difficult to accomplish without the availability of a dense network of stationary sensors. In this work, we propose a deep reinforcement learning-based real-time path-planning algorithm for mobile sensor networks traveling in a formation through a spatially and temporally varying advection-diffusion field for the task of field reconstruction. A deep Q-network (DQN) agent is trained on simulated advection-diffusion fields to direct the mobile sensor network to travel along information-rich trajectories. The field measurements made by the mobile sensor network along their trajectories enable identification of field advection parameters, which are required for field reconstruction. A cooperative Kalman filter developed in previous works is employed to receive estimates of the field values and gradients, which are essential for reconstruction as well as for the estimation of the diffusion parameter. A mechanism is provided that encourages exploration in the field domain once a stationary state is reached, which allows the algorithm to identify other information-rich trajectories that may exist in the field, significantly improving reconstruction performance. Two simulation environments of different fidelities are provided to test the feasibility of the proposed algorithm. The low-fidelity simulation environment is used for training of the DQN agent. The high-fidelity simulation environment is based on Robot Operating System (ROS) and simulates real robots. We provide results of running sample test episodes in both environments, which demonstrate the effectiveness and feasibility of the proposed algorithm.
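The kind of field the sensors measure can be sketched with a toy 1-D advection-diffusion step, u_t = -c u_x + D u_xx, discretized with upwind advection and central differences on a periodic domain (illustrative parameters only; not the paper's simulation environments):

```python
import numpy as np

def step(u, c=0.5, D=0.1, dx=1.0, dt=0.5):
    # upwind advection (valid for c > 0) and central diffusion, periodic boundaries
    adv = -c * (u - np.roll(u, 1)) / dx
    diff = D * (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2
    return u + dt * (adv + diff)

u = np.zeros(50)
u[25] = 1.0                 # initial concentration spike
for _ in range(20):
    u = step(u)             # spike advects rightward and diffuses
```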
Deciphering the Firing Patterns of Hippocampal Neurons During Sharp-Wave Ripples
The hippocampus is essential for learning and memory. Neurons in the rat hippocampus selectively fire when the animal is at specific locations - place fields - within an environment. Place fields corresponding to such place cells tile the entire environment, forming a stable spatial map supporting navigation and planning. Remarkably, the same place cells reactivate together outside of their place fields and in coincidence with sharp-wave ripples (SWRs) - dominant electrical field oscillations (150-250 Hz) in the hippocampus. These offline SWR events frequently occur during quiet wake periods in the middle of exploration and the subsequent slow-wave sleep, and are associated with spatial memory performance and stabilization of spatial maps. Therefore, deciphering the firing patterns during these events is essential to understanding offline memory processing. I provide two novel methods to analyze the SWR firing patterns in this dissertation project. The first method uses hidden Markov models (HMM), in which I model the dynamics of neural activity during SWRs in terms of transitions between distinct states of neuronal ensemble activity. This method detects consistent temporal structures over many instances of SWRs and, in contrast to standard approaches, relaxes the dependence on positional data during the behavior to interpret temporal patterns during SWRs. To validate this method, I applied the method to quiet wake SWRs. In a simple spatial memory task in which the animal ran on a linear track or in an open arena, the individual states corresponded to the activation of distinct groups of neurons with inter-state transitions that resembled the animal’s trajectories during the exploration. In other words, this method enabled us to identify the topology and spatial map of the explored environment by dissecting the firings occurring during the quiescence periods’ SWRs.
This result indicated that downstream brain regions may rely only on SWRs to uncover hippocampal code as a substrate for memory processing. I developed a second analysis method based on the principles of Bayesian learning. This method enabled us to track the spatial tunings over the sleep following exploration of an environment by taking neurons’ place fields in the environment as the prior belief and updating it using dynamic ensemble firing patterns unfolding over time. This method introduces a neuronal-ensemble-based approach that calculates tunings to the position encoded by ensemble firings during sleep rather than the animal’s actual position during exploration. When I applied this method to several datasets, I found that during the early slow-wave sleep after an experience, but not during late hours of sleep or sleep before the exploration, the spatial tunings highly resembled the place fields on the track. Furthermore, the fidelity of the spatial tunings to the place fields predicted the place fields’ stability when the animal was re-exposed to the same environment after ~9 h. Moreover, even for neurons with shifted place fields during re-exposure, the spatial tunings during early sleep were predictive of the place fields during the re-exposure. These results indicated that early sleep actively maintains or retunes the place fields of neurons, explaining the representational drift of place fields across multiple exposures.
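The prior-then-update idea behind the second method can be illustrated with a hedged sketch (not the dissertation's actual estimator: a Gaussian prior with standard conjugate updates stands in for it, and all numbers are illustrative). The place field acts as the prior belief about a neuron's preferred position, which positions decoded from ensemble firing during sleep then refine:

```python
import numpy as np

prior_mean, prior_var = 0.4, 0.04   # place-field centre and spread (track in [0, 1])
obs_var = 0.01                      # assumed noise of decoded positions

mean, var = prior_mean, prior_var
for decoded_pos in [0.45, 0.5, 0.48]:   # positions encoded by ensemble firing
    # Gaussian conjugate update: shift the tuning estimate toward the observation
    k = var / (var + obs_var)
    mean = mean + k * (decoded_pos - mean)
    var = (1 - k) * var                 # posterior uncertainty shrinks
```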