Multi-agent deep reinforcement learning with centralized training and decentralized execution for transportation infrastructure management
We present a multi-agent Deep Reinforcement Learning (DRL) framework for
managing large transportation infrastructure systems over their life-cycle.
Life-cycle management of such engineering systems is a computationally
intensive task, requiring appropriate sequential inspection and maintenance
decisions able to reduce long-term risks and costs, while dealing with
different uncertainties and constraints that lie in high-dimensional spaces. To
date, static age- or condition-based maintenance methods and risk-based or
periodic inspection plans have mostly addressed this class of optimization
problems. However, such approaches often suffer from limitations in
optimality, scalability, and the treatment of uncertainty. The optimization
problem in this work
is cast in the framework of constrained Partially Observable Markov Decision
Processes (POMDPs), which provides a comprehensive mathematical basis for
stochastic sequential decision settings with observation uncertainties, risk
considerations, and limited resources. To address significantly large state and
action spaces, a Deep Decentralized Multi-agent Actor-Critic (DDMAC) DRL method
with Centralized Training and Decentralized Execution (CTDE), termed
DDMAC-CTDE, is developed. The performance strengths of the DDMAC-CTDE method are
demonstrated in a generally representative and realistic example application of
an existing transportation network in Virginia, USA. The network includes
several bridge and pavement components with nonstationary degradation,
agency-imposed constraints, and traffic delay and risk considerations. The
proposed DDMAC-CTDE method vastly outperforms traditional management policies
for transportation networks. Overall, the proposed algorithmic framework
provides near-optimal solutions for transportation infrastructure management
under real-world constraints and complexities.
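The CTDE structure named above can be summarized in a few lines: during training a critic conditions on the joint observation of all components, while at execution time each component's actor acts from its local observation alone. Below is a minimal sketch of that pattern; the network sizes and the Actor/CentralCritic classes are illustrative assumptions, not the authors' DDMAC-CTDE implementation.

```python
# Minimal CTDE sketch: centralized critic for training, local actors for execution.
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, N_ACTIONS = 3, 8, 4  # hypothetical problem sizes

class Actor(nn.Module):
    """Decentralized actor: maps one component's local observation to action probs."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, N_ACTIONS), nn.Softmax(dim=-1))

    def forward(self, obs):
        return self.net(obs)

class CentralCritic(nn.Module):
    """Centralized critic: conditions on the joint observation, training only."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(N_AGENTS * OBS_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, joint_obs):
        return self.net(joint_obs)

actors = [Actor() for _ in range(N_AGENTS)]
critic = CentralCritic()

joint_obs = torch.randn(N_AGENTS, OBS_DIM)
# Centralized training: the value estimate sees every component's observation.
value = critic(joint_obs.flatten())
# Decentralized execution: each actor samples from its local observation alone.
actions = [torch.distributions.Categorical(actor(joint_obs[i])).sample()
           for i, actor in enumerate(actors)]
```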
Towards Standardising Reinforcement Learning Approaches for Production Scheduling Problems
Recent years have seen a rise in interest in using machine learning,
particularly reinforcement learning (RL), for production scheduling problems of
varying degrees of complexity. The general approach is to break down the
scheduling problem into a Markov Decision Process (MDP), whereupon a simulation
implementing the MDP is used to train an RL agent. Since existing studies rely
on (sometimes) complex simulations for which the code is unavailable, the
experiments presented are hard, or, in the case of stochastic environments,
impossible to reproduce accurately. Furthermore, there is a vast array of RL
designs to choose from. To make RL methods widely applicable in production
scheduling and to demonstrate their strengths for industry, the
standardisation of model descriptions - covering both the production setup and
the RL design - and of the validation scheme is a prerequisite. Our
contribution is threefold: First, we standardise
the description of production setups used in RL studies based on established
nomenclature. Secondly, we classify RL design choices from existing
publications. Lastly, we propose recommendations for a validation scheme
focusing on reproducibility and sufficient benchmarking.
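As an illustration of what a standardized, reproducible production-setup description could look like, here is a minimal Gym-style environment sketch; the single-machine setup, processing times, and flow-time reward are hypothetical choices, not taken from the paper.

```python
# Hypothetical sketch of a standardized scheduling MDP in a Gym-style interface.
class JobShopEnv:
    """Single-machine dispatching MDP: observation = remaining processing times,
    action = index of the job to run next, reward = negative completion time."""

    def __init__(self, processing_times=(3, 5, 2)):
        self.processing_times = list(processing_times)

    def reset(self):
        self.remaining = list(self.processing_times)
        self.clock = 0
        return tuple(self.remaining)

    def step(self, action):
        assert self.remaining[action] > 0, "job already finished"
        self.clock += self.remaining[action]   # run the chosen job to completion
        self.remaining[action] = 0
        reward = -self.clock                   # penalize accumulated flow time
        done = all(r == 0 for r in self.remaining)
        return tuple(self.remaining), reward, done, {}

# Usage with a naive first-unfinished-job policy; deterministic, so reproducible.
env = JobShopEnv()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    action = next(i for i, r in enumerate(obs) if r > 0)
    obs, reward, done, _ = env.step(action)
    total_reward += reward
print(total_reward)
```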
Patching Neural Barrier Functions Using Hamilton-Jacobi Reachability
Learning-based control algorithms have led to major advances in robotics at
the cost of decreased safety guarantees. Recently, neural networks have also
been used to characterize safety through the use of barrier functions for
complex nonlinear systems. Learned barrier functions approximately encode and
enforce a desired safety constraint through a value function, but do not
provide any formal guarantees. In this paper, we propose a local dynamic
programming (DP) based approach to "patch" an almost-safe learned barrier at
potentially unsafe points in the state space. This algorithm, HJ-Patch, obtains
a novel barrier that provides formal safety guarantees, yet retains the global
structure of the learned barrier. Our local DP based reachability algorithm,
HJ-Patch, updates the barrier function "minimally" at points that both (a)
neighbor the barrier safety boundary and (b) do not satisfy the safety
condition. We view this as a key step to bridging the gap between
learning-based barrier functions and Hamilton-Jacobi reachability analysis,
providing a framework for further integration of these approaches. We
demonstrate that for well-trained barriers we reduce the computational load by
2 orders of magnitude with respect to standard DP-based reachability, and
demonstrate scalability to a 6-dimensional system, which is at the limit of
standard DP-based reachability.
Comment: 8 pages, submitted to IEEE Conference on Decision and Control (CDC), 202
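To make the patching loop concrete, the toy sketch below applies the two conditions named above, (a) neighboring the barrier safety boundary and (b) violating the safety condition, on a 1-D grid, and re-runs a discrete Hamilton-Jacobi backup only at the flagged points. The dynamics, the imperfect "learned" barrier, and all thresholds are hypothetical stand-ins, not the HJ-Patch implementation.

```python
# Toy sketch of local barrier patching via a discrete HJ viability backup.
import numpy as np

xs = np.linspace(-2.0, 2.0, 201)        # state grid
dt, u_max = 0.05, 1.0
l = 1.0 - np.abs(xs)                    # safety margin: safe iff l(x) >= 0
V = l + 0.05 * np.sin(7 * xs)           # imperfect "learned" barrier (stand-in)

def backup(V, i):
    """Discrete HJ backup: best-case control, capped by the safety margin."""
    best = -np.inf
    for u in (-u_max, 0.0, u_max):
        xn = np.clip(xs[i] + u * dt, xs[0], xs[-1])
        best = max(best, np.interp(xn, xs, V))
    return min(l[i], best)

for _ in range(100):                    # iterate until the flagged set is empty
    near_boundary = np.abs(V) < 0.1     # (a) neighbors of the zero level set
    violating = np.array([V[i] > backup(V, i) + 1e-9 for i in range(len(xs))])
    active = near_boundary & violating  # (b) safety condition not satisfied
    if not active.any():
        break
    for i in np.where(active)[0]:       # "minimal" update: only flagged points
        V[i] = backup(V, i)
```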
Process Control and Optimization using Model-based Reinforcement Learning
Thesis (Ph.D.) -- Graduate School of Seoul National University: College of
Engineering, School of Chemical and Biological Engineering, 2020. 2. Jong Min Lee.
Sequential decision making is a crucial technology for plant-wide process
optimization. The dominant numerical approach, forward-in-time direct
optimization, is limited to open-loop solutions and has difficulty accounting
for uncertainty. Dynamic programming complements these limitations, but the
associated functional optimization suffers from the curse of dimensionality,
because the partial differential equation obtained from dynamic programming
must be handled in an infinite-dimensional function space rather than a
finite-dimensional vector space. The sample-based approach to approximating
dynamic programming, referred to as reinforcement learning (RL), can resolve
this issue and is investigated throughout this thesis. Methods that account
for the system model explicitly are of particular interest. Model-based RL is
exploited to solve three representative sequential decision making problems in
process optimization: scheduling, supervisory optimization, and regulatory
control. These problems are formulated as a partially observable Markov
decision process (POMDP), a control-affine state space model, and a general
state space model, respectively, and the associated model-based RL algorithms
are point-based value iteration (PBVI), globalized dual heuristic programming
(GDHP), and differential dynamic programming (DDP).
The contributions for each problem can be summarized as follows: First, for
the scheduling problem, we developed a closed-loop feedback scheme, a solution
form unavailable from the direct optimization method. Second, the regulatory
control problem is tackled by function approximation, which relaxes the
functional optimization to an optimization over a finite-dimensional vector
space. Deep neural networks (DNNs) are utilized as the approximator, and their
advantages as well as a convergence analysis are presented in the thesis.
Finally, for the supervisory optimization problem, we developed a novel
constrained RL framework that uses the primal-dual DDP method. Various
illustrative process examples are presented to validate the developed
model-based RL algorithms and to support the thesis statement that dynamic
programming can be considered a complementary method to direct optimization.
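As a pointer to the primal-dual machinery named in the abstract (and in Section 6.2.1 of the table of contents below), the following is a minimal sketch of the augmented-Lagrangian update pattern underlying constrained methods such as primal-dual DDP: an inner primal minimization of the augmented Lagrangian followed by a dual ascent step on the multiplier. The one-dimensional problem is a hypothetical illustration, not the thesis code.

```python
# Augmented-Lagrangian sketch for: minimize f(x) = (x - 2)^2 s.t. c(x) = x - 1 = 0.
f_grad = lambda x: 2.0 * (x - 2.0)
c = lambda x: x - 1.0

x, lam, rho = 0.0, 0.0, 10.0
for outer in range(20):                     # dual (multiplier) iterations
    for inner in range(200):                # primal minimization of L_rho
        grad = f_grad(x) + lam + rho * c(x) # d/dx [f + lam*c + (rho/2)*c^2]
        x -= 0.01 * grad
    lam += rho * c(x)                       # dual ascent on the multiplier
print(x, lam)  # x -> 1 (feasible); lam -> 2, satisfying f'(x*) + lam = 0
```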
1. Introduction
1.1 Motivation and previous work
1.2 Statement of contributions
1.3 Outline of the thesis
2. Background and preliminaries
2.1 Optimization problem formulation and the principle of optimality
2.1.1 Markov decision process
2.1.2 State space model
2.2 Overview of the developed RL algorithms
2.2.1 Point based value iteration
2.2.2 Globalized dual heuristic programming
2.2.3 Differential dynamic programming
3. A POMDP framework for integrated scheduling of infrastructure maintenance and inspection
3.1 Introduction
3.2 POMDP solution algorithm
3.2.1 General point based value iteration
3.2.2 GapMin algorithm
3.2.3 Receding horizon POMDP
3.3 Problem formulation for infrastructure scheduling
3.3.1 State
3.3.2 Maintenance and inspection actions
3.3.3 State transition function
3.3.4 Cost function
3.3.5 Observation set and observation function
3.3.6 State augmentation
3.4 Illustrative example and simulation result
3.4.1 Structural point for the analysis of a high dimensional belief space
3.4.2 Infinite horizon policy under the natural deterioration process
3.4.3 Receding horizon POMDP
3.4.4 Validation of POMDP policy via Monte Carlo simulation
4. A model-based deep reinforcement learning method applied to finite-horizon optimal control of nonlinear control-affine system
4.1 Introduction
4.2 Function approximation and learning with deep neural networks
4.2.1 GDHP with a function approximator
4.2.2 Stable learning of DNNs
4.2.3 Overall algorithm
4.3 Results and discussions
4.3.1 Example 1: Semi-batch reactor
4.3.2 Example 2: Diffusion-Convection-Reaction (DCR) process
5. Convergence analysis of the model-based deep reinforcement learning for optimal control of nonlinear control-affine system
5.1 Introduction
5.2 Convergence proof of globalized dual heuristic programming (GDHP)
5.3 Function approximation with deep neural networks
5.3.1 Function approximation and gradient descent learning
5.3.2 Forward and backward propagations of DNNs
5.4 Convergence analysis in the deep neural networks space
5.4.1 Lyapunov analysis of the neural network parameter errors
5.4.2 Lyapunov analysis of the closed-loop stability
5.4.3 Overall Lyapunov function
5.5 Simulation results and discussions
5.5.1 System description
5.5.2 Algorithmic settings
5.5.3 Control result
6. Primal-dual differential dynamic programming for constrained dynamic optimization of continuous system
6.1 Introduction
6.2 Primal-dual differential dynamic programming for constrained dynamic optimization
6.2.1 Augmented Lagrangian method
6.2.2 Primal-dual differential dynamic programming algorithm
6.2.3 Overall algorithm
6.3 Results and discussions
7. Concluding remarks
7.1 Summary of the contributions
7.2 Future works
Bibliography
The Edge-of-Reach Problem in Offline Model-Based Reinforcement Learning
Offline reinforcement learning aims to enable agents to be trained from
pre-collected datasets; however, this comes with the added challenge of
estimating the value of behavior not covered in the dataset. Model-based
methods offer a solution by allowing agents to collect additional synthetic
data via rollouts in a learned dynamics model. The prevailing theoretical
understanding is that this can then be viewed as online reinforcement learning
in an approximate dynamics model, and any remaining gap is therefore assumed to
be due to the imperfect dynamics model. Surprisingly, however, we find that if
the learned dynamics model is replaced by the true error-free dynamics,
existing model-based methods completely fail. This reveals a major
misconception. Our subsequent investigation finds that the general procedure
used in model-based algorithms results in the existence of a set of
edge-of-reach states which trigger pathological value overestimation and
collapse in Bellman-based algorithms. We term this the edge-of-reach problem.
Based on this, we fill some gaps in existing theory and also explain how prior
model-based methods are inadvertently addressing the true underlying
edge-of-reach problem. Finally, we propose Reach-Aware Value Learning (RAVL), a
simple and robust method that directly addresses the edge-of-reach problem and
achieves strong performance across both proprioceptive and pixel-based
benchmarks. Code open-sourced at: https://github.com/anyasims/edge-of-reach.
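The "general procedure" the abstract refers to is the standard k-step model rollout loop of offline model-based RL. The toy sketch below (the dynamics model and dataset are hypothetical stand-ins; see the linked repository for the actual code) shows where edge-of-reach states arise: states reached exactly at the rollout horizon appear only as next-states, so they are bootstrapped from but never updated.

```python
# Sketch of k-step model rollouts from dataset states in offline model-based RL.
import numpy as np

rng = np.random.default_rng(0)
dataset_states = rng.normal(size=(1000, 2))   # pre-collected start states
model = lambda s, a: s + 0.1 * a              # stand-in "learned" dynamics
K = 5                                         # rollout horizon

buffer = []                                   # (s, a, s') tuples for Bellman updates
for s0 in dataset_states[:100]:
    s = s0
    for k in range(K):
        a = rng.normal(size=2)
        s_next = model(s, a)
        buffer.append((s, a, s_next))
        s = s_next
# States reached exactly at step K appear only as s' in the buffer: they are
# bootstrapped from but never updated, so value errors there go uncorrected,
# i.e. the edge-of-reach problem, even with an error-free dynamics model.
```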
Better Optimism By Bayes: Adaptive Planning with Rich Models
The computational costs of inference and planning have confined Bayesian
model-based reinforcement learning to one of two dismal fates: powerful
Bayes-adaptive planning but only for simplistic models, or powerful, Bayesian
non-parametric models but using simple, myopic planning strategies such as
Thompson sampling. We ask whether it is feasible and truly beneficial to
combine rich probabilistic models with a closer approximation to fully Bayesian
planning. First, we use a collection of counterexamples to show formal problems
with the over-optimism inherent in Thompson sampling. Then we leverage
state-of-the-art techniques in efficient Bayes-adaptive planning and
non-parametric Bayesian methods to perform qualitatively better than both
existing conventional algorithms and Thompson sampling on two contextual
bandit-like problems.
Comment: 11 pages, 11 figures
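For reference, this is the myopic planning strategy in question: Thompson sampling acts greedily with respect to a single posterior sample instead of computing a Bayes-adaptive value. A minimal Bernoulli-bandit sketch follows; the arm probabilities and priors are toy choices, not the paper's experiments.

```python
# Minimal Thompson sampling for a two-armed Bernoulli bandit.
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.4, 0.6])          # unknown arm success probabilities
alpha, beta = np.ones(2), np.ones(2)   # Beta(1, 1) prior for each arm

for t in range(1000):
    theta = rng.beta(alpha, beta)      # one posterior sample per arm
    arm = int(np.argmax(theta))        # greedy w.r.t. the sample (myopic step)
    reward = rng.random() < true_p[arm]
    alpha[arm] += reward               # conjugate Beta-Bernoulli update
    beta[arm] += 1 - reward
# A Bayes-adaptive planner would instead maximize expected return over the
# posterior-augmented (belief) state, trading off exploration explicitly.
```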
Vehicle Dispatching and Routing of On-Demand Intercity Ride-Pooling Services: A Multi-Agent Hierarchical Reinforcement Learning Approach
The integrated development of city clusters has given rise to an increasing
demand for intercity travel. Intercity ride-pooling service exhibits
considerable potential in upgrading traditional intercity bus services by
implementing demand-responsive enhancements. Nevertheless, its online
operation suffers from inherent complexity due to the coupling of vehicle
resource allocation among cities with pooled-ride vehicle routing. To tackle
these challenges, this study proposes a two-level framework designed to
facilitate online fleet management. Specifically, a novel multi-agent feudal
reinforcement learning model is proposed at the upper level of the framework to
cooperatively assign idle vehicles to different intercity lines, while the
lower level updates the routes of vehicles using an adaptive large neighborhood
search heuristic. Numerical studies based on the realistic dataset of Xiamen
and its surrounding cities in China show that the proposed framework
effectively mitigates supply-demand imbalances and achieves significant
improvements in both the average daily system profit and the order
fulfillment ratio.
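The two-level structure can be sketched as follows: an upper-level policy assigns idle vehicles to intercity lines, and a lower-level adaptive large neighborhood search (ALNS) destroys and repairs each vehicle's route. All names, the 1-D stop encoding, and the destroy/repair operators below are hypothetical placeholders, not the paper's implementation.

```python
# Skeletal two-level loop: upper-level assignment + lower-level ALNS routing.
import random

def upper_level_assign(idle_vehicles, lines, policy):
    """Upper level (feudal-style manager): pick one line per idle vehicle."""
    return {vehicle: policy(vehicle, lines) for vehicle in idle_vehicles}

def alns_route_update(route, iters=100, seed=0):
    """Lower level: remove a few stops (destroy), greedily re-insert (repair)."""
    rng = random.Random(seed)
    cost = lambda r: sum(abs(r[i] - r[i + 1]) for i in range(len(r) - 1))
    best = route[:]
    for _ in range(iters):
        cand = best[:]
        removed = [cand.pop(rng.randrange(len(cand)))
                   for _ in range(min(2, len(cand) - 1))]
        for stop in removed:               # cheapest-insertion repair
            pos = min(range(len(cand) + 1),
                      key=lambda i: cost(cand[:i] + [stop] + cand[i:]))
            cand.insert(pos, stop)
        if cost(cand) < cost(best):        # accept improving candidates only
            best = cand
    return best

# Toy usage: vehicles "v1"/"v2", lines "A"/"B", stops as 1-D coordinates.
assignments = upper_level_assign(["v1", "v2"], ["A", "B"], policy=lambda v, L: L[0])
print(assignments, alns_route_update([0, 5, 2, 8, 1]))
```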
- …