Multi-agent deep reinforcement learning with centralized training and decentralized execution for transportation infrastructure management
We present a multi-agent Deep Reinforcement Learning (DRL) framework for
managing large transportation infrastructure systems over their life-cycle.
Life-cycle management of such engineering systems is a computationally
intensive task, requiring appropriate sequential inspection and maintenance
decisions able to reduce long-term risks and costs, while dealing with
different uncertainties and constraints that lie in high-dimensional spaces. To
date, static age- or condition-based maintenance methods and risk-based or
periodic inspection plans have mostly addressed this class of optimization
problems. However, such approaches often suffer from limitations in
optimality, scalability, and the treatment of uncertainty. The optimization
problem in this work
is cast in the framework of constrained Partially Observable Markov Decision
Processes (POMDPs), which provides a comprehensive mathematical basis for
stochastic sequential decision settings with observation uncertainties, risk
considerations, and limited resources. To address significantly large state and
action spaces, a Deep Decentralized Multi-agent Actor-Critic (DDMAC) DRL method
with Centralized Training and Decentralized Execution (CTDE), termed
DDMAC-CTDE, is developed. The performance strengths of the DDMAC-CTDE method are
demonstrated in a generally representative and realistic example application of
an existing transportation network in Virginia, USA. The network includes
several bridge and pavement components with nonstationary degradation,
agency-imposed constraints, and traffic delay and risk considerations. The
proposed DDMAC-CTDE method vastly outperforms traditional management policies
for transportation networks. Overall, the proposed algorithmic framework
provides near-optimal solutions for transportation infrastructure management
under real-world constraints and complexities.
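The CTDE structure named above can be summarized in a few lines: during training a critic conditions on the joint observation of all components, while at execution time each component's actor acts from its local observation alone. Below is a minimal sketch of that pattern; the network sizes and the Actor/CentralCritic classes are illustrative assumptions, not the authors' DDMAC-CTDE implementation.

```python
# Minimal CTDE sketch: centralized critic for training, local actors for execution.
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, N_ACTIONS = 3, 8, 4  # hypothetical problem sizes

class Actor(nn.Module):
    """Decentralized actor: maps one component's local observation to action probs."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, N_ACTIONS), nn.Softmax(dim=-1))

    def forward(self, obs):
        return self.net(obs)

class CentralCritic(nn.Module):
    """Centralized critic: conditions on the joint observation, training only."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(N_AGENTS * OBS_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, joint_obs):
        return self.net(joint_obs)

actors = [Actor() for _ in range(N_AGENTS)]
critic = CentralCritic()

joint_obs = torch.randn(N_AGENTS, OBS_DIM)
# Centralized training: the value estimate sees every component's observation.
value = critic(joint_obs.flatten())
# Decentralized execution: each actor samples from its local observation alone.
actions = [torch.distributions.Categorical(actor(joint_obs[i])).sample()
           for i, actor in enumerate(actors)]
```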
Towards Standardising Reinforcement Learning Approaches for Production Scheduling Problems
Recent years have seen a rise in interest in using machine learning,
particularly reinforcement learning (RL), for production scheduling problems of
varying degrees of complexity. The general approach is to break down the
scheduling problem into a Markov Decision Process (MDP), whereupon a simulation
implementing the MDP is used to train an RL agent. Since existing studies rely
on (sometimes) complex simulations for which the code is unavailable, the
experiments presented are hard, or, in the case of stochastic environments,
impossible to reproduce accurately. Furthermore, there is a vast array of RL
designs to choose from. To make RL methods widely applicable in production
scheduling and to demonstrate their strengths for industry, the
standardisation of model descriptions - covering both the production setup and
the RL design - and of the validation scheme is a prerequisite. Our
contribution is threefold: First, we standardise
the description of production setups used in RL studies based on established
nomenclature. Secondly, we classify RL design choices from existing
publications. Lastly, we propose recommendations for a validation scheme
focusing on reproducibility and sufficient benchmarking.
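As an illustration of what a standardized, reproducible production-setup description could look like, here is a minimal Gym-style environment sketch; the single-machine setup, processing times, and flow-time reward are hypothetical choices, not taken from the paper.

```python
# Hypothetical sketch of a standardized scheduling MDP in a Gym-style interface.
class JobShopEnv:
    """Single-machine dispatching MDP: observation = remaining processing times,
    action = index of the job to run next, reward = negative completion time."""

    def __init__(self, processing_times=(3, 5, 2)):
        self.processing_times = list(processing_times)

    def reset(self):
        self.remaining = list(self.processing_times)
        self.clock = 0
        return tuple(self.remaining)

    def step(self, action):
        assert self.remaining[action] > 0, "job already finished"
        self.clock += self.remaining[action]   # run the chosen job to completion
        self.remaining[action] = 0
        reward = -self.clock                   # penalize accumulated flow time
        done = all(r == 0 for r in self.remaining)
        return tuple(self.remaining), reward, done, {}

# Usage with a naive first-unfinished-job policy; deterministic, so reproducible.
env = JobShopEnv()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    action = next(i for i, r in enumerate(obs) if r > 0)
    obs, reward, done, _ = env.step(action)
    total_reward += reward
print(total_reward)
```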
Patching Neural Barrier Functions Using Hamilton-Jacobi Reachability
Learning-based control algorithms have led to major advances in robotics at
the cost of decreased safety guarantees. Recently, neural networks have also
been used to characterize safety through the use of barrier functions for
complex nonlinear systems. Learned barrier functions approximately encode and
enforce a desired safety constraint through a value function, but do not
provide any formal guarantees. In this paper, we propose a local dynamic
programming (DP) based approach to "patch" an almost-safe learned barrier at
potentially unsafe points in the state space. This algorithm, HJ-Patch, obtains
a novel barrier that provides formal safety guarantees, yet retains the global
structure of the learned barrier. Our local DP based reachability algorithm,
HJ-Patch, updates the barrier function "minimally" at points that both (a)
neighbor the barrier safety boundary and (b) do not satisfy the safety
condition. We view this as a key step to bridging the gap between
learning-based barrier functions and Hamilton-Jacobi reachability analysis,
providing a framework for further integration of these approaches. We
demonstrate that for well-trained barriers we reduce the computational load by
2 orders of magnitude with respect to standard DP-based reachability, and
demonstrate scalability to a 6-dimensional system, which is at the limit of
standard DP-based reachability.
Comment: 8 pages, submitted to IEEE Conference on Decision and Control (CDC), 202
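To make the patching loop concrete, the toy sketch below applies the two conditions named above, (a) neighboring the barrier safety boundary and (b) violating the safety condition, on a 1-D grid, and re-runs a discrete Hamilton-Jacobi backup only at the flagged points. The dynamics, the imperfect "learned" barrier, and all thresholds are hypothetical stand-ins, not the HJ-Patch implementation.

```python
# Toy sketch of local barrier patching via a discrete HJ viability backup.
import numpy as np

xs = np.linspace(-2.0, 2.0, 201)        # state grid
dt, u_max = 0.05, 1.0
l = 1.0 - np.abs(xs)                    # safety margin: safe iff l(x) >= 0
V = l + 0.05 * np.sin(7 * xs)           # imperfect "learned" barrier (stand-in)

def backup(V, i):
    """Discrete HJ backup: best-case control, capped by the safety margin."""
    best = -np.inf
    for u in (-u_max, 0.0, u_max):
        xn = np.clip(xs[i] + u * dt, xs[0], xs[-1])
        best = max(best, np.interp(xn, xs, V))
    return min(l[i], best)

for _ in range(100):                    # iterate until the flagged set is empty
    near_boundary = np.abs(V) < 0.1     # (a) neighbors of the zero level set
    violating = np.array([V[i] > backup(V, i) + 1e-9 for i in range(len(xs))])
    active = near_boundary & violating  # (b) safety condition not satisfied
    if not active.any():
        break
    for i in np.where(active)[0]:       # "minimal" update: only flagged points
        V[i] = backup(V, i)
```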
Process Control and Optimization using Model-based Reinforcement Learning
Thesis (Ph.D.) -- Graduate School of Seoul National University: College of
Engineering, School of Chemical and Biological Engineering, 2020. 2. Jong Min Lee.
Sequential decision making is a crucial technology for plant-wide process
optimization. The dominant numerical approach, forward-in-time direct
optimization, is limited to open-loop solutions and has difficulty accounting
for uncertainty. Dynamic programming complements these limitations, but the
associated functional optimization suffers from the curse of dimensionality,
because the partial differential equation obtained from dynamic programming
must be handled in an infinite-dimensional function space rather than a
finite-dimensional vector space. The sample-based approach to approximating
dynamic programming, referred to as reinforcement learning (RL), can resolve
this issue and is investigated throughout this thesis. Methods that account
for the system model explicitly are of particular interest. Model-based RL is
exploited to solve three representative sequential decision making problems in
process optimization: scheduling, supervisory optimization, and regulatory
control. These problems are formulated as a partially observable Markov
decision process (POMDP), a control-affine state space model, and a general
state space model, respectively, and the associated model-based RL algorithms
are point-based value iteration (PBVI), globalized dual heuristic programming
(GDHP), and differential dynamic programming (DDP).
The contributions for each problem can be summarized as follows: First, for
the scheduling problem, we developed a closed-loop feedback scheme, a solution
form unavailable from the direct optimization method. Second, the regulatory
control problem is tackled by function approximation, which relaxes the
functional optimization to an optimization over a finite-dimensional vector
space. Deep neural networks (DNNs) are utilized as the approximator, and their
advantages as well as a convergence analysis are presented in the thesis.
Finally, for the supervisory optimization problem, we developed a novel
constrained RL framework that uses the primal-dual DDP method. Various
illustrative process examples are presented to validate the developed
model-based RL algorithms and to support the thesis statement that dynamic
programming can be considered a complementary method to direct optimization.
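As a pointer to the primal-dual machinery named in the abstract (and in Section 6.2.1 of the table of contents below), the following is a minimal sketch of the augmented-Lagrangian update pattern underlying constrained methods such as primal-dual DDP: an inner primal minimization of the augmented Lagrangian followed by a dual ascent step on the multiplier. The one-dimensional problem is a hypothetical illustration, not the thesis code.

```python
# Augmented-Lagrangian sketch for: minimize f(x) = (x - 2)^2 s.t. c(x) = x - 1 = 0.
f_grad = lambda x: 2.0 * (x - 2.0)
c = lambda x: x - 1.0

x, lam, rho = 0.0, 0.0, 10.0
for outer in range(20):                     # dual (multiplier) iterations
    for inner in range(200):                # primal minimization of L_rho
        grad = f_grad(x) + lam + rho * c(x) # d/dx [f + lam*c + (rho/2)*c^2]
        x -= 0.01 * grad
    lam += rho * c(x)                       # dual ascent on the multiplier
print(x, lam)  # x -> 1 (feasible); lam -> 2, satisfying f'(x*) + lam = 0
```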
1. Introduction
1.1 Motivation and previous work
1.2 Statement of contributions
1.3 Outline of the thesis
2. Background and preliminaries
2.1 Optimization problem formulation and the principle of optimality
2.1.1 Markov decision process
2.1.2 State space model
2.2 Overview of the developed RL algorithms
2.2.1 Point based value iteration
2.2.2 Globalized dual heuristic programming
2.2.3 Differential dynamic programming
3. A POMDP framework for integrated scheduling of infrastructure maintenance and inspection
3.1 Introduction
3.2 POMDP solution algorithm
3.2.1 General point based value iteration
3.2.2 GapMin algorithm
3.2.3 Receding horizon POMDP
3.3 Problem formulation for infrastructure scheduling
3.3.1 State
3.3.2 Maintenance and inspection actions
3.3.3 State transition function
3.3.4 Cost function
3.3.5 Observation set and observation function
3.3.6 State augmentation
3.4 Illustrative example and simulation result
3.4.1 Structural point for the analysis of a high dimensional belief space
3.4.2 Infinite horizon policy under the natural deterioration process
3.4.3 Receding horizon POMDP
3.4.4 Validation of POMDP policy via Monte Carlo simulation
4. A model-based deep reinforcement learning method applied to finite-horizon optimal control of nonlinear control-affine system
4.1 Introduction
4.2 Function approximation and learning with deep neural networks
4.2.1 GDHP with a function approximator
4.2.2 Stable learning of DNNs
4.2.3 Overall algorithm
4.3 Results and discussions
4.3.1 Example 1: Semi-batch reactor
4.3.2 Example 2: Diffusion-Convection-Reaction (DCR) process
5. Convergence analysis of the model-based deep reinforcement learning for optimal control of nonlinear control-affine system
5.1 Introduction
5.2 Convergence proof of globalized dual heuristic programming (GDHP)
5.3 Function approximation with deep neural networks
5.3.1 Function approximation and gradient descent learning
5.3.2 Forward and backward propagations of DNNs
5.4 Convergence analysis in the deep neural networks space
5.4.1 Lyapunov analysis of the neural network parameter errors
5.4.2 Lyapunov analysis of the closed-loop stability
5.4.3 Overall Lyapunov function
5.5 Simulation results and discussions
5.5.1 System description
5.5.2 Algorithmic settings
5.5.3 Control result
6. Primal-dual differential dynamic programming for constrained dynamic optimization of continuous system
6.1 Introduction
6.2 Primal-dual differential dynamic programming for constrained dynamic optimization
6.2.1 Augmented Lagrangian method
6.2.2 Primal-dual differential dynamic programming algorithm
6.2.3 Overall algorithm
6.3 Results and discussions
7. Concluding remarks
7.1 Summary of the contributions
7.2 Future works
Bibliography
The Edge-of-Reach Problem in Offline Model-Based Reinforcement Learning
Offline reinforcement learning aims to enable agents to be trained from
pre-collected datasets; however, this comes with the added challenge of
estimating the value of behavior not covered in the dataset. Model-based
methods offer a solution by allowing agents to collect additional synthetic
data via rollouts in a learned dynamics model. The prevailing theoretical
understanding is that this can then be viewed as online reinforcement learning
in an approximate dynamics model, and any remaining gap is therefore assumed to
be due to the imperfect dynamics model. Surprisingly, however, we find that if
the learned dynamics model is replaced by the true error-free dynamics,
existing model-based methods completely fail. This reveals a major
misconception. Our subsequent investigation finds that the general procedure
used in model-based algorithms results in the existence of a set of
edge-of-reach states which trigger pathological value overestimation and
collapse in Bellman-based algorithms. We term this the edge-of-reach problem.
Based on this, we fill some gaps in existing theory and also explain how prior
model-based methods are inadvertently addressing the true underlying
edge-of-reach problem. Finally, we propose Reach-Aware Value Learning (RAVL), a
simple and robust method that directly addresses the edge-of-reach problem and
achieves strong performance across both proprioceptive and pixel-based
benchmarks. Code open-sourced at: https://github.com/anyasims/edge-of-reach.
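The "general procedure" the abstract refers to is the standard k-step model rollout loop of offline model-based RL. The toy sketch below (the dynamics model and dataset are hypothetical stand-ins; see the linked repository for the actual code) shows where edge-of-reach states arise: states reached exactly at the rollout horizon appear only as next-states, so they are bootstrapped from but never updated.

```python
# Sketch of k-step model rollouts from dataset states in offline model-based RL.
import numpy as np

rng = np.random.default_rng(0)
dataset_states = rng.normal(size=(1000, 2))   # pre-collected start states
model = lambda s, a: s + 0.1 * a              # stand-in "learned" dynamics
K = 5                                         # rollout horizon

buffer = []                                   # (s, a, s') tuples for Bellman updates
for s0 in dataset_states[:100]:
    s = s0
    for k in range(K):
        a = rng.normal(size=2)
        s_next = model(s, a)
        buffer.append((s, a, s_next))
        s = s_next
# States reached exactly at step K appear only as s' in the buffer: they are
# bootstrapped from but never updated, so value errors there go uncorrected,
# i.e. the edge-of-reach problem, even with an error-free dynamics model.
```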
Better Optimism By Bayes: Adaptive Planning with Rich Models
The computational costs of inference and planning have confined Bayesian
model-based reinforcement learning to one of two dismal fates: powerful
Bayes-adaptive planning but only for simplistic models, or powerful, Bayesian
non-parametric models but using simple, myopic planning strategies such as
Thompson sampling. We ask whether it is feasible and truly beneficial to
combine rich probabilistic models with a closer approximation to fully Bayesian
planning. First, we use a collection of counterexamples to show formal problems
with the over-optimism inherent in Thompson sampling. Then we leverage
state-of-the-art techniques in efficient Bayes-adaptive planning and
non-parametric Bayesian methods to perform qualitatively better than both
existing conventional algorithms and Thompson sampling on two contextual
bandit-like problems.
Comment: 11 pages, 11 figures
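For reference, this is the myopic planning strategy in question: Thompson sampling acts greedily with respect to a single posterior sample instead of computing a Bayes-adaptive value. A minimal Bernoulli-bandit sketch follows; the arm probabilities and priors are toy choices, not the paper's experiments.

```python
# Minimal Thompson sampling for a two-armed Bernoulli bandit.
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.4, 0.6])          # unknown arm success probabilities
alpha, beta = np.ones(2), np.ones(2)   # Beta(1, 1) prior for each arm

for t in range(1000):
    theta = rng.beta(alpha, beta)      # one posterior sample per arm
    arm = int(np.argmax(theta))        # greedy w.r.t. the sample (myopic step)
    reward = rng.random() < true_p[arm]
    alpha[arm] += reward               # conjugate Beta-Bernoulli update
    beta[arm] += 1 - reward
# A Bayes-adaptive planner would instead maximize expected return over the
# posterior-augmented (belief) state, trading off exploration explicitly.
```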
Vehicle Dispatching and Routing of On-Demand Intercity Ride-Pooling Services: A Multi-Agent Hierarchical Reinforcement Learning Approach
The integrated development of city clusters has given rise to an increasing
demand for intercity travel. Intercity ride-pooling service exhibits
considerable potential in upgrading traditional intercity bus services by
implementing demand-responsive enhancements. Nevertheless, its online
operation suffers from inherent complexity due to the coupling of vehicle
resource allocation among cities with pooled-ride vehicle routing. To tackle
these challenges, this study proposes a two-level framework designed to
facilitate online fleet management. Specifically, a novel multi-agent feudal
reinforcement learning model is proposed at the upper level of the framework to
cooperatively assign idle vehicles to different intercity lines, while the
lower level updates the routes of vehicles using an adaptive large neighborhood
search heuristic. Numerical studies based on the realistic dataset of Xiamen
and its surrounding cities in China show that the proposed framework
effectively mitigates supply-demand imbalances and achieves significant
improvements in both the average daily system profit and the order
fulfillment ratio.
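The two-level structure can be sketched as follows: an upper-level policy assigns idle vehicles to intercity lines, and a lower-level adaptive large neighborhood search (ALNS) destroys and repairs each vehicle's route. All names, the 1-D stop encoding, and the destroy/repair operators below are hypothetical placeholders, not the paper's implementation.

```python
# Skeletal two-level loop: upper-level assignment + lower-level ALNS routing.
import random

def upper_level_assign(idle_vehicles, lines, policy):
    """Upper level (feudal-style manager): pick one line per idle vehicle."""
    return {vehicle: policy(vehicle, lines) for vehicle in idle_vehicles}

def alns_route_update(route, iters=100, seed=0):
    """Lower level: remove a few stops (destroy), greedily re-insert (repair)."""
    rng = random.Random(seed)
    cost = lambda r: sum(abs(r[i] - r[i + 1]) for i in range(len(r) - 1))
    best = route[:]
    for _ in range(iters):
        cand = best[:]
        removed = [cand.pop(rng.randrange(len(cand)))
                   for _ in range(min(2, len(cand) - 1))]
        for stop in removed:               # cheapest-insertion repair
            pos = min(range(len(cand) + 1),
                      key=lambda i: cost(cand[:i] + [stop] + cand[i:]))
            cand.insert(pos, stop)
        if cost(cand) < cost(best):        # accept improving candidates only
            best = cand
    return best

# Toy usage: vehicles "v1"/"v2", lines "A"/"B", stops as 1-D coordinates.
assignments = upper_level_assign(["v1", "v2"], ["A", "B"], policy=lambda v, L: L[0])
print(assignments, alns_route_update([0, 5, 2, 8, 1]))
```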
- …