Searching for Plannable Domains can Speed up Reinforcement Learning
Reinforcement learning (RL) involves sequential decision making in uncertain
environments. The aim of the decision-making agent is to maximize the benefit
of acting in its environment over an extended period of time. Finding an
optimal policy in RL may be very slow. To speed up learning, one often-used
solution is to integrate planning, as in Sutton's Dyna algorithm or in various
other methods using macro-actions.
Here we suggest separating the plannable, i.e., close-to-deterministic, parts
of the world and focusing planning efforts on this domain. A novel reinforcement
learning method called plannable RL (pRL) is proposed here. pRL builds a simple
model, which is used to search for macro actions. The simplicity of the model
makes planning computationally inexpensive. It is shown that pRL finds an
optimal policy, and that plannable macro actions found by pRL are near-optimal.
In turn, it is unnecessary to try large numbers of macro actions, which enables
fast learning. The utility of pRL is demonstrated through computer simulations.
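As a rough illustration of the idea, a tabular agent can flag a state-action
pair as plannable when a single successor dominates its empirical transition
counts, then chain such steps into a macro action. A minimal Python sketch
under these assumptions (the determinism threshold and the greedy action choice
are illustrative, not the paper's exact algorithm):

    from collections import defaultdict

    class PlannableModel:
        """Tabular sketch: flag near-deterministic (s, a) pairs and chain
        them into macro actions, loosely in the spirit of pRL."""

        def __init__(self, threshold=0.9):
            self.threshold = threshold   # determinism cutoff (assumed value)
            self.counts = defaultdict(lambda: defaultdict(int))

        def observe(self, s, a, s_next):
            self.counts[(s, a)][s_next] += 1   # empirical transition counts

        def is_plannable(self, s, a):
            nxt = self.counts[(s, a)]
            total = sum(nxt.values())
            return total > 0 and max(nxt.values()) / total >= self.threshold

        def likely_next(self, s, a):
            nxt = self.counts[(s, a)]
            return max(nxt, key=nxt.get)   # most frequent successor

        def macro_action(self, s, actions, max_len=5):
            """Greedily chain plannable steps into a macro action."""
            macro = []
            for _ in range(max_len):
                plannable = [a for a in actions if self.is_plannable(s, a)]
                if not plannable:
                    break   # left the near-deterministic region
                a = plannable[0]   # placeholder choice; pRL scores candidates
                macro.append(a)
                s = self.likely_next(s, a)
            return macro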
On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models
This paper addresses the general problem of reinforcement learning (RL) in
partially observable environments. In 2013, our large RL recurrent neural
networks (RNNs) learned from scratch to drive simulated cars from
high-dimensional video input. However, real brains are more powerful in many
ways. In particular, they learn a predictive model of their initially unknown
environment, and somehow use it for abstract (e.g., hierarchical) planning and
reasoning. Guided by algorithmic information theory, we describe RNN-based AIs
(RNNAIs) designed to do the same. Such an RNNAI can be trained on never-ending
sequences of tasks, some of them provided by the user, others invented by the
RNNAI itself in a curious, playful fashion, to improve its RNN-based world
model. Unlike our previous model-building RNN-based RL machines dating back to
1990, the RNNAI learns to actively query its model for abstract reasoning and
planning and decision making, essentially "learning to think." The basic ideas
of this report can be applied to many other cases where one RNN-like system
exploits the algorithmic information content of another. They are taken from a
grant proposal submitted in Fall 2014, and also explain concepts such as
"mirror neurons." Experimental results will be described in separate papers.Comment: 36 pages, 1 figure. arXiv admin note: substantial text overlap with
arXiv:1404.782
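Stripped of detail, the core mechanism is a controller network that learns
what to ask a separate, frozen world-model RNN and how to use the answer. A
hypothetical PyTorch sketch (module names and sizes are invented for
illustration and do not come from the paper):

    import torch
    import torch.nn as nn

    class ControllerWithModelQueries(nn.Module):
        """Sketch of 'learning to think': a controller C learns to send
        query vectors into a frozen world-model RNN M and reads M's
        output back as extra context for its policy."""

        def __init__(self, obs_dim, act_dim, query_dim=16, hidden=64):
            super().__init__()
            self.model_rnn = nn.GRU(query_dim, hidden, batch_first=True)  # M, frozen
            for p in self.model_rnn.parameters():
                p.requires_grad = False
            self.make_query = nn.Linear(obs_dim, query_dim)  # C learns the queries
            self.policy = nn.Linear(obs_dim + hidden, act_dim)

        def forward(self, obs):
            query = self.make_query(obs).unsqueeze(1)   # a single query step
            answer, _ = self.model_rnn(query)           # M's "answer"
            context = torch.cat([obs, answer[:, -1]], dim=-1)
            return self.policy(context)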
Temporal Logic Guided Safe Reinforcement Learning Using Control Barrier Functions
Using reinforcement learning to learn control policies is a challenge when
the task is complex with potentially long horizons. Ensuring adequate but safe
exploration is also crucial for controlling physical systems. In this paper, we
use temporal logic to facilitate specification and learning of complex tasks.
We combine temporal logic with control Lyapunov functions to improve
exploration. We incorporate control barrier functions to safeguard the
exploration and deployment process. We develop a flexible and learnable system
that allows users to specify task objectives and constraints in different forms
and at various levels. The framework is also able to take advantage of known
system dynamics and handle unknown environmental dynamics by integrating
model-free learning with model-based planning.
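The safeguarding step can be pictured as a barrier-function filter that
minimally shifts a proposed action so that the barrier condition still holds.
A scalar Python sketch for control-affine dynamics (this closed-form
projection is a generic CBF filter, not the paper's full formulation):

    def cbf_filter(u_nom, h, dh_f, dh_g, alpha=1.0):
        """Enforce dh/dt = dh_f + dh_g * u >= -alpha * h by shifting the
        nominal action u_nom as little as possible. Here dh_f = Lf h(x)
        and dh_g = Lg h(x) are assumed known for the current state."""
        slack = dh_f + dh_g * u_nom + alpha * h
        if slack >= 0 or dh_g == 0:
            return u_nom   # nominal action already satisfies the barrier
        return u_nom - slack / dh_g   # project onto the constraint boundary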
Multi-Agent Deep Reinforcement Learning for Large-scale Traffic Signal Control
Reinforcement learning (RL) is a promising data-driven approach for adaptive
traffic signal control (ATSC) in complex urban traffic networks, and deep
neural networks further enhance its learning power. However, centralized RL is
infeasible for large-scale ATSC due to the extremely high dimension of the
joint action space. Multi-agent RL (MARL) overcomes the scalability issue by
distributing the global control to each local RL agent, but it introduces new
challenges: now the environment becomes partially observable from the viewpoint
of each local agent due to limited communication among agents. Most existing
studies in MARL focus on designing efficient communication and coordination
among traditional Q-learning agents. This paper presents, for the first time, a
fully scalable and decentralized MARL algorithm for the state-of-the-art deep
RL agent: advantage actor critic (A2C), within the context of ATSC. In
particular, two methods are proposed to stabilize the learning procedure, by
improving the observability and reducing the learning difficulty of each local
agent. The proposed multi-agent A2C is compared against independent A2C and
independent Q-learning algorithms, in both a large synthetic traffic grid and a
large real-world traffic network of Monaco city, under simulated peak-hour
traffic dynamics. Results demonstrate its optimality, robustness, and sample
efficiency over other state-of-the-art decentralized MARL algorithms.
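For reference, the learner each local agent runs is a standard one-step
advantage actor-critic update. A generic PyTorch sketch of the A2C loss
(coefficients are illustrative, and the paper's stabilization methods are
omitted):

    import torch
    import torch.nn.functional as F

    def a2c_loss(logits, values, actions, returns,
                 value_coef=0.5, entropy_coef=0.01):
        """Generic one-step A2C objective: policy gradient weighted by the
        advantage, a value-regression term, and an entropy bonus."""
        advantages = returns - values.detach()   # A = R - V(s)
        log_probs = F.log_softmax(logits, dim=-1)
        chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        policy_loss = -(chosen * advantages).mean()
        value_loss = F.mse_loss(values, returns)
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
        return policy_loss + value_coef * value_loss - entropy_coef * entropy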
Simulation to Scaled City: Zero-Shot Policy Transfer for Traffic Control via Autonomous Vehicles
Using deep reinforcement learning, we train control policies for autonomous
vehicles leading a platoon of vehicles onto a roundabout. Using Flow, a library
for deep reinforcement learning in micro-simulators, we train two policies, one
policy with noise injected into the state and action space and one without any
injected noise. In simulation, the autonomous vehicle learns an emergent
metering behavior for both policies in which it slows to allow for smoother
merging. We then directly transfer this policy without any tuning to the
University of Delaware Scaled Smart City (UDSSC), a 1:25 scale testbed for
connected and automated vehicles. We characterize the performance of both
policies on the scaled city. We show that the noise-free policy winds up
crashing and only occasionally metering. However, the noise-injected policy
consistently performs the metering behavior and remains collision-free,
suggesting that the noise helps with the zero-shot policy transfer.
Additionally, the transferred, noise-injected policy leads to a 5% reduction of
average travel time and a reduction of 22% in maximum travel time in the UDSSC.
Videos of the controllers can be found at
https://sites.google.com/view/iccps-policy-transfer.
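The noise-injection scheme amounts to perturbing observations and actions
during training so that the learned policy tolerates the modeling error it
will meet on hardware. A minimal sketch, assuming a generic env/policy
interface and illustrative noise scales:

    import numpy as np

    def noisy_step(env, policy, state_sigma=0.05, action_sigma=0.05):
        """Training-time noise injection for zero-shot transfer (sketch;
        env and policy are assumed interfaces, sigmas are illustrative)."""
        obs = env.observation()
        obs = obs + np.random.normal(0.0, state_sigma, size=obs.shape)
        action = policy(obs)
        action = action + np.random.normal(0.0, action_sigma, size=action.shape)
        return env.step(action)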
MP3: Movement Primitive-Based (Re-)Planning Policy
We introduce a novel deep reinforcement learning (RL) approach called
Movement Primitive-based Planning Policy (MP3). By integrating movement
primitives (MPs) into the deep RL framework, MP3 enables the generation of
smooth trajectories throughout the whole learning process while effectively
learning from sparse and non-Markovian rewards. Additionally, MP3 maintains the
capability to adapt to changes in the environment during execution. Although
many early successes in robot RL have been achieved by combining RL with MPs,
these approaches are often limited to learning single stroke-based motions,
lacking the ability to adapt to task variations or adjust motions during
execution. Building upon our previous work, which introduced an episode-based
RL method for the non-linear adaptation of MP parameters to different task
variations, this paper extends the approach to incorporating replanning
strategies. This allows adaptation of the MP parameters throughout motion
execution, addressing the lack of online motion adaptation in stochastic
domains requiring feedback. We compared our approach against state-of-the-art
deep RL and RL with MPs methods. The results demonstrated improved performance
in sophisticated, sparse reward settings and in domains requiring replanning. A
video demonstration is available at https://intuitive-robots.github.io/mp3_website/.
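A movement primitive in this sense is a smooth trajectory parameterized by a
small weight vector, which the policy outputs once per episode or re-plans
mid-execution. A minimal NumPy sketch with normalized radial basis functions
(the basis layout is illustrative):

    import numpy as np

    def mp_trajectory(weights, n_steps=100):
        """Sketch of a movement primitive: a trajectory formed as a
        weighted sum of radial basis functions over normalized time."""
        weights = np.asarray(weights)
        t = np.linspace(0.0, 1.0, n_steps)
        centers = np.linspace(0.0, 1.0, len(weights))
        width = 0.5 / len(weights)
        basis = np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))
        basis /= basis.sum(axis=1, keepdims=True)   # normalize per time step
        return basis @ weights   # desired position at each time step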
Deep Reinforcement Learning based Optimal Control of Hot Water Systems
Energy consumption for hot water production is a major draw in high-efficiency
buildings. Optimizing it has typically been approached from a
thermodynamics perspective, decoupled from occupant influence. Furthermore,
optimization usually presupposes existence of a detailed dynamics model for the
hot water system. These assumptions lead to suboptimal energy efficiency in the
real world. In this paper, we present a novel reinforcement learning based
methodology which optimizes hot water production. The proposed methodology is
completely generalizable, and does not require an offline step or human domain
knowledge to build a model for the hot water vessel or the heating element.
Occupant preferences too are learnt on the fly. The proposed system is applied
to a set of 32 houses in the Netherlands where it reduces energy consumption
for hot water production by roughly 20% with no loss of occupant comfort.
Extrapolating, this translates to absolute savings of roughly 200 kWh for a
single household on an annual basis. This performance can be replicated for any
domestic hot water system and optimization objective, given that the fairly
minimal requirements on sensor data are met. With millions of hot water systems
operational worldwide, the proposed framework has the potential to reduce
energy consumption in existing and new systems on a multi-gigawatt-hour scale
in the years to come.
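At its simplest, model-free control of such a system can be pictured as a
tabular Q-learner over a discretized tank temperature, trading heating cost
against comfort. A toy sketch (bins, costs, and the comfort bound are invented
for illustration; the paper's method is more general):

    import numpy as np

    n_bins, n_actions = 20, 2   # temperature bins x {heater off, heater on}
    Q = np.zeros((n_bins, n_actions))
    alpha, gamma, eps = 0.1, 0.95, 0.1

    def reward(temp_bin, action, comfort_bin=12, energy_cost=1.0):
        discomfort = max(0, comfort_bin - temp_bin)   # penalize cold water
        return -energy_cost * action - 5.0 * discomfort

    def act(temp_bin):
        if np.random.rand() < eps:
            return np.random.randint(n_actions)   # explore
        return int(Q[temp_bin].argmax())          # exploit

    def update(s, a, r, s_next):
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])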
Accelerating Reinforcement Learning through Implicit Imitation
Imitation can be viewed as a means of enhancing learning in multiagent
environments. It augments an agent's ability to learn useful behaviors by
making intelligent use of the knowledge implicit in behaviors demonstrated by
cooperative teachers or other more experienced agents. We propose and study a
formal model of implicit imitation that can accelerate reinforcement learning
dramatically in certain cases. Roughly, by observing a mentor, a
reinforcement-learning agent can extract information about its own capabilities
in, and the relative value of, unvisited parts of the state space. We study two
specific instantiations of this model, one in which the learning agent and the
mentor have identical abilities, and one designed to deal with agents and
mentors with different action sets. We illustrate the benefits of implicit
imitation by integrating it with prioritized sweeping, and demonstrating
improved performance and convergence through observation of single and multiple
mentors. Though we make some stringent assumptions regarding observability and
possible interactions, we briefly comment on extensions of the model that relax
these restrictions.
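One way to picture the mechanism: the mentor's observed transitions feed the
same learned model the agent backs its values up through, and a
prioritized-sweeping queue decides which states to update first. A tabular
Python sketch for the homogeneous-action case (the structure is illustrative,
not the paper's exact formulation):

    import heapq
    from collections import defaultdict

    class ImplicitImitationLearner:
        """Sketch: mentor and own experience update one transition model,
        and backups are ordered by a prioritized-sweeping queue."""

        def __init__(self, gamma=0.95, theta=1e-3):
            self.gamma, self.theta = gamma, theta
            self.V = defaultdict(float)
            self.rewards = {}
            self.model = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': n}
            self.queue = []

        def observe(self, s, a, r, s_next):
            """Record a transition; mentor experience enters the same model,
            which is what makes the imitation implicit."""
            self.model[(s, a)][s_next] += 1
            self.rewards[(s, a)] = r
            self._push(s)

        def _q(self, s, a):
            nxt = self.model[(s, a)]
            total = sum(nxt.values())
            exp_v = sum(n * self.V[sp] for sp, n in nxt.items()) / total
            return self.rewards[(s, a)] + self.gamma * exp_v

        def _push(self, s):
            acts = [a for (st, a) in self.model if st == s]
            err = abs(max(self._q(s, a) for a in acts) - self.V[s])
            if err > self.theta:
                heapq.heappush(self.queue, (-err, s))

        def sweep(self, n_backups=10):
            """Back up the highest-error states first."""
            for _ in range(min(n_backups, len(self.queue))):
                _, s = heapq.heappop(self.queue)
                acts = [a for (st, a) in self.model if st == s]
                self.V[s] = max(self._q(s, a) for a in acts)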
TriFinger: An Open-Source Robot for Learning Dexterity
Dexterous object manipulation remains an open problem in robotics, despite
the rapid progress in machine learning during the past decade. We argue that a
hindrance is the high cost of experimentation on real systems, in terms of both
time and money. We address this problem by proposing an open-source robotic
platform which can safely operate without human supervision. The hardware is
inexpensive (about $5,000) yet highly dynamic, robust, and capable of
complex interaction with external objects. The software operates at 1 kHz
and performs safety checks to prevent the hardware from breaking. The
easy-to-use front-end (in C++ and Python) is suitable for real-time control as
well as deep reinforcement learning. In addition, the software framework is
largely robot-agnostic and can hence be used independently of the hardware
proposed herein. Finally, we illustrate the potential of the proposed platform
through a number of experiments, including real-time optimal control, deep
reinforcement learning from scratch, throwing, and writing.
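The shape of such a front-end is a fixed-rate loop that reads the state,
clamps the commanded torques, and applies them. A hypothetical Python sketch
(the robot and policy interfaces below are invented for illustration and are
not the platform's actual API):

    import time

    def control_loop(robot, policy, hz=1000, torque_limit=0.4):
        """Hypothetical 1 kHz control loop with a simple safety clamp,
        in the spirit of the platform's safety checks."""
        dt = 1.0 / hz
        while True:
            t0 = time.perf_counter()
            obs = robot.read_state()                # assumed interface
            torque = policy(obs)
            torque = torque.clip(-torque_limit, torque_limit)  # safety check
            robot.apply_torque(torque)              # assumed interface
            sleep = dt - (time.perf_counter() - t0)
            if sleep > 0:
                time.sleep(sleep)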