3,190 research outputs found
Batch Reinforcement Learning on the Industrial Benchmark: First Experiences
The Particle Swarm Optimization Policy (PSO-P) has been recently introduced
and proven to produce remarkable results on interacting with academic
reinforcement learning benchmarks in an off-policy, batch-based setting. To
further investigate the properties and feasibility on real-world applications,
this paper investigates PSO-P on the so-called Industrial Benchmark (IB), a
novel reinforcement learning (RL) benchmark that aims at being realistic by
including a variety of aspects found in industrial applications, like
continuous state and action spaces, a high dimensional, partially observable
state space, delayed effects, and complex stochasticity. The experimental
results of PSO-P on IB are compared to results of closed-form control policies
derived from the model-based Recurrent Control Neural Network (RCNN) and the
model-free Neural Fitted Q-Iteration (NFQ). Experiments show that PSO-P is not
only of interest for academic benchmarks, but also for real-world industrial
applications, since it also yielded the best performing policy in our IB
setting. Compared to other well established RL techniques, PSO-P produced
outstanding results in performance and robustness, requiring only a relatively
low amount of effort in finding adequate parameters or making complex design
decisions
Deep Reinforcement Learning from Self-Play in Imperfect-Information Games
Many real-world applications can be described as large-scale games of
imperfect information. To deal with these challenging domains, prior work has
focused on computing Nash equilibria in a handcrafted abstraction of the
domain. In this paper we introduce the first scalable end-to-end approach to
learning approximate Nash equilibria without prior domain knowledge. Our method
combines fictitious self-play with deep reinforcement learning. When applied to
Leduc poker, Neural Fictitious Self-Play (NFSP) approached a Nash equilibrium,
whereas common reinforcement learning methods diverged. In Limit Texas Holdem,
a poker game of real-world scale, NFSP learnt a strategy that approached the
performance of state-of-the-art, superhuman algorithms based on significant
domain expertise.Comment: updated version, incorporating conference feedbac
Certified Reinforcement Learning with Logic Guidance
This paper proposes the first model-free Reinforcement Learning (RL)
framework to synthesise policies for unknown, and continuous-state Markov
Decision Processes (MDPs), such that a given linear temporal property is
satisfied. We convert the given property into a Limit Deterministic Buchi
Automaton (LDBA), namely a finite-state machine expressing the property.
Exploiting the structure of the LDBA, we shape a synchronous reward function
on-the-fly, so that an RL algorithm can synthesise a policy resulting in traces
that probabilistically satisfy the linear temporal property. This probability
(certificate) is also calculated in parallel with policy learning when the
state space of the MDP is finite: as such, the RL algorithm produces a policy
that is certified with respect to the property. Under the assumption of finite
state space, theoretical guarantees are provided on the convergence of the RL
algorithm to an optimal policy, maximising the above probability. We also show
that our method produces ''best available'' control policies when the logical
property cannot be satisfied. In the general case of a continuous state space,
we propose a neural network architecture for RL and we empirically show that
the algorithm finds satisfying policies, if there exist such policies. The
performance of the proposed framework is evaluated via a set of numerical
examples and benchmarks, where we observe an improvement of one order of
magnitude in the number of iterations required for the policy synthesis,
compared to existing approaches whenever available.Comment: This article draws from arXiv:1801.08099, arXiv:1809.0782
Multiobjective Reinforcement Learning for Reconfigurable Adaptive Optimal Control of Manufacturing Processes
In industrial applications of adaptive optimal control often multiple
contrary objectives have to be considered. The weights (relative importance) of
the objectives are often not known during the design of the control and can
change with changing production conditions and requirements. In this work a
novel model-free multiobjective reinforcement learning approach for adaptive
optimal control of manufacturing processes is proposed. The approach enables
sample-efficient learning in sequences of control configurations, given by
particular objective weights.Comment: Conference, Preprint, 978-1-5386-5925-0/18/$31.00 \c{opyright} 2018
IEE
Definition and evaluation of model-free coordination of electrical vehicle charging with reinforcement learning
Demand response (DR) becomes critical to manage the charging load of a growing electric vehicle (EV) deployment. Initial DR studies mainly adopt model predictive control, but models are largely uncertain for the EV scenario (e.g., customer behavior). Model-free approaches, based on reinforcement learning (RL), are an attractive alternative. We propose a new Markov decision process (MDP) formulation in the RL framework, to jointly coordinate a set of charging stations. State-of-the-art algorithms either focus on a single EV, or control an aggregate of EVs in multiple steps (e.g., 1) make aggregate load decisions and 2) translate the aggregate decision to individual EVs). In contrast, our RL approach jointly controls the whole set of EVs at once. We contribute a new MDP formulation with a scalable state representation independent of the number of charging stations. Using a batch RL algorithm, fitted -iteration, we learn an optimal charging policy. With simulations using real-world data, we: 1) differentiate settings in training the RL policy (e.g., the time span covered by training data); 2) compare its performance to an oracle all-knowing benchmark (providing an upper performance bound); 3) analyze performance fluctuations throughout a full year; and 4) demonstrate generalization capacity to larger sets of charging stations
- …