
    Multi-agent deep reinforcement learning with centralized training and decentralized execution for transportation infrastructure management

    We present a multi-agent Deep Reinforcement Learning (DRL) framework for managing large transportation infrastructure systems over their life-cycle. Life-cycle management of such engineering systems is a computationally intensive task, requiring sequential inspection and maintenance decisions that reduce long-term risks and costs while handling diverse uncertainties and constraints in high-dimensional spaces. To date, this class of optimization problems has mostly been addressed with static age- or condition-based maintenance methods and risk-based or periodic inspection plans, which often suffer from limitations in optimality, scalability, and the treatment of uncertainty. The optimization problem in this work is cast in the framework of constrained Partially Observable Markov Decision Processes (POMDPs), which provides a comprehensive mathematical basis for stochastic sequential decision settings with observation uncertainties, risk considerations, and limited resources. To address significantly large state and action spaces, a Deep Decentralized Multi-agent Actor-Critic (DDMAC) DRL method with Centralized Training and Decentralized Execution (CTDE), termed DDMAC-CTDE, is developed. The performance strengths of the DDMAC-CTDE method are demonstrated in a representative and realistic example application involving an existing transportation network in Virginia, USA. The network includes several bridge and pavement components with nonstationary degradation, agency-imposed constraints, and traffic delay and risk considerations. The proposed DDMAC-CTDE method vastly outperforms traditional management policies for transportation networks. Overall, the proposed algorithmic framework provides near-optimal solutions for transportation infrastructure management under real-world constraints and complexities.
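
    The centralized-training, decentralized-execution (CTDE) structure behind methods such as DDMAC can be made concrete in a few lines: each infrastructure component keeps its own small actor that maps a local belief state to an inspection/maintenance action, while a single centralized critic sees the joint belief during training only. The sketch below is a minimal, hypothetical PyTorch illustration of that structure, not the authors' DDMAC-CTDE implementation; the network sizes and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LocalActor(nn.Module):
    """Per-component policy: maps a local belief vector to action probabilities."""
    def __init__(self, belief_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(belief_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, local_belief):
        return torch.softmax(self.net(local_belief), dim=-1)

class CentralCritic(nn.Module):
    """Centralized value function: sees the concatenated beliefs of all
    components during training; it is not needed at execution time."""
    def __init__(self, joint_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, joint_belief):
        return self.net(joint_belief)

# Decentralized execution: each actor acts on its own component's belief only.
n_components, belief_dim, n_actions = 5, 8, 3
actors = [LocalActor(belief_dim, n_actions) for _ in range(n_components)]
critic = CentralCritic(n_components * belief_dim)

beliefs = torch.rand(n_components, belief_dim)
actions = [torch.multinomial(actor(b), 1) for actor, b in zip(actors, beliefs)]
value = critic(beliefs.flatten())  # used only to form the training signal
```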

    Towards Standardising Reinforcement Learning Approaches for Production Scheduling Problems

    Recent years have seen rising interest in using machine learning, particularly reinforcement learning (RL), for production scheduling problems of varying degrees of complexity. The general approach is to cast the scheduling problem as a Markov Decision Process (MDP), whereupon a simulation implementing the MDP is used to train an RL agent. Since existing studies rely on (sometimes complex) simulations for which the code is unavailable, the experiments presented are hard, and in the case of stochastic environments impossible, to reproduce accurately. Furthermore, there is a vast array of RL designs to choose from. To make RL methods widely applicable in production scheduling and to establish their strengths for industry, standardisation of model descriptions (both production setup and RL design) and of validation schemes is a prerequisite. Our contribution is threefold: first, we standardise the description of production setups used in RL studies based on established nomenclature; second, we classify RL design choices from existing publications; third, we propose recommendations for a validation scheme focusing on reproducibility and sufficient benchmarking.
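
    In practice, casting the scheduling problem as an MDP usually means wrapping it in a standard environment interface so that experiments can be rerun and benchmarked. The sketch below is a deliberately tiny, hypothetical single-machine dispatching environment written against the Gymnasium API; the state encoding, reward (negative tardiness), and job data are illustrative assumptions, not a standardised setup from the paper.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class SingleMachineSchedulingEnv(gym.Env):
    """Toy scheduling MDP: at each step the agent picks one remaining job to
    process next; the reward is the negative tardiness incurred by that job."""
    def __init__(self, processing_times, due_dates):
        super().__init__()
        self.p = np.asarray(processing_times, dtype=float)
        self.d = np.asarray(due_dates, dtype=float)
        self.n = len(self.p)
        self.action_space = spaces.Discrete(self.n)
        # Observation: binary remaining-job mask plus the current time.
        self.observation_space = spaces.Box(0.0, np.inf, shape=(self.n + 1,))

    def _obs(self):
        return np.append(self.remaining, self.t).astype(np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.remaining = np.ones(self.n)
        self.t = 0.0
        return self._obs(), {}

    def step(self, job):
        if self.remaining[job] == 0:      # invalid pick: penalty, no progress
            return self._obs(), -10.0, False, False, {}
        self.t += self.p[job]
        self.remaining[job] = 0
        reward = -max(0.0, self.t - self.d[job])   # negative tardiness
        done = bool(self.remaining.sum() == 0)
        return self._obs(), reward, done, False, {}

env = SingleMachineSchedulingEnv(processing_times=[3, 2, 4], due_dates=[4, 2, 10])
obs, info = env.reset(seed=0)
```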

    Patching Neural Barrier Functions Using Hamilton-Jacobi Reachability

    Learning-based control algorithms have led to major advances in robotics, at the cost of decreased safety guarantees. Recently, neural networks have also been used to characterize safety for complex nonlinear systems through barrier functions. Learned barrier functions approximately encode and enforce a desired safety constraint through a value function, but do not provide any formal guarantees. In this paper, we propose a local dynamic programming (DP) based approach to "patch" an almost-safe learned barrier at potentially unsafe points in the state space. The resulting algorithm, HJ-Patch, yields a barrier that provides formal safety guarantees yet retains the global structure of the learned barrier: it updates the barrier function "minimally", only at points that both (a) neighbor the barrier safety boundary and (b) do not satisfy the safety condition. We view this as a key step toward bridging the gap between learning-based barrier functions and Hamilton-Jacobi reachability analysis, providing a framework for further integration of these approaches. We demonstrate that, for well-trained barriers, we reduce the computational load by two orders of magnitude relative to standard DP-based reachability, and we demonstrate scalability to a 6-dimensional system, which is at the limit of standard DP-based reachability. Comment: 8 pages, submitted to the IEEE Conference on Decision and Control (CDC), 202
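
    The core "patching" idea, stripped of the specific Hamilton-Jacobi machinery, is an active-set sweep over a gridded value function: re-apply the dynamic-programming backup only at points that lie near the zero level set of the learned barrier and violate one-step consistency, leaving the rest of the barrier untouched. The sketch below is a hypothetical NumPy illustration of that idea under a made-up contraction backup; it is not the HJ-Patch algorithm, and the thresholds and backup operator are assumptions.

```python
import numpy as np

def local_patch(V, backup, boundary_eps=0.05, tol=1e-4, max_iters=200):
    """Active-set sweep: re-apply a DP backup only at grid points that
    (a) lie near the zero level set of V and (b) violate V >= backup(V)."""
    V = V.copy()
    for _ in range(max_iters):
        Vb = backup(V)                            # one-step backup everywhere
        near_boundary = np.abs(V) < boundary_eps  # condition (a)
        violating = Vb < V - tol                  # condition (b)
        active = near_boundary & violating
        if not active.any():
            break                                 # barrier is locally consistent
        V[active] = Vb[active]                    # minimal, local update
    return V

# Toy usage on a 1-D grid with a stand-in "learned barrier" and toy backup.
grid = np.linspace(-1.0, 1.0, 101)
V0 = grid**2 - 0.25
toy_backup = lambda V: 0.9 * np.minimum(V, np.roll(V, 1))
V_patched = local_patch(V0, toy_backup)
```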

    λͺ¨λΈκΈ°λ°˜κ°•ν™”ν•™μŠ΅μ„μ΄μš©ν•œκ³΅μ •μ œμ–΄λ°μ΅œμ ν™”

    Thesis (Ph.D.), Seoul National University, Graduate School, College of Engineering, School of Chemical and Biological Engineering, February 2020.
    Sequential decision making is a crucial technology for plant-wide process optimization. While the dominant numerical approach is forward-in-time direct optimization, it yields only open-loop solutions and has difficulty accounting for uncertainty. Dynamic programming complements these limitations, but the associated functional optimization suffers from the curse of dimensionality. The sample-based approach to approximating dynamic programming, referred to as reinforcement learning (RL), can resolve this issue and is investigated throughout this thesis; methods that account for the system model explicitly are of particular interest. Model-based RL is exploited to solve three representative sequential decision making problems: scheduling, supervisory optimization, and regulatory control. The problems are formulated as a partially observable Markov decision process, a control-affine state space model, and a general state space model, respectively, and the associated model-based RL algorithms are point based value iteration (PBVI), globalized dual heuristic programming (GDHP), and differential dynamic programming (DDP). The contribution for each problem is as follows. First, for the scheduling problem, we develop a closed-loop feedback scheme, a form of solution that direct optimization cannot provide and that highlights a strength of RL. Second, the regulatory control problem is tackled with a function approximation method that relaxes the functional optimization of dynamic programming to a finite-dimensional vector space optimization; deep neural networks (DNNs) are used as the approximator, and their advantages together with a convergence analysis are presented. Finally, for the supervisory optimization problem, we develop a novel constrained RL framework based on a primal-dual DDP method. Various illustrative examples are presented to validate the developed model-based RL algorithms and to support the thesis statement that dynamic programming can serve as a complementary method to direct optimization.
    Contents:
    1. Introduction: 1.1 Motivation and previous work; 1.2 Statement of contributions; 1.3 Outline of the thesis
    2. Background and preliminaries: 2.1 Optimization problem formulation and the principle of optimality (2.1.1 Markov decision process; 2.1.2 State space model); 2.2 Overview of the developed RL algorithms (2.2.1 Point based value iteration; 2.2.2 Globalized dual heuristic programming; 2.2.3 Differential dynamic programming)
    3. A POMDP framework for integrated scheduling of infrastructure maintenance and inspection: 3.1 Introduction; 3.2 POMDP solution algorithm (3.2.1 General point based value iteration; 3.2.2 GapMin algorithm; 3.2.3 Receding horizon POMDP); 3.3 Problem formulation for infrastructure scheduling (3.3.1 State; 3.3.2 Maintenance and inspection actions; 3.3.3 State transition function; 3.3.4 Cost function; 3.3.5 Observation set and observation function; 3.3.6 State augmentation); 3.4 Illustrative example and simulation result (3.4.1 Structural point for the analysis of a high dimensional belief space; 3.4.2 Infinite horizon policy under the natural deterioration process; 3.4.3 Receding horizon POMDP; 3.4.4 Validation of POMDP policy via Monte Carlo simulation)
    4. A model-based deep reinforcement learning method applied to finite-horizon optimal control of nonlinear control-affine system: 4.1 Introduction; 4.2 Function approximation and learning with deep neural networks (4.2.1 GDHP with a function approximator; 4.2.2 Stable learning of DNNs; 4.2.3 Overall algorithm); 4.3 Results and discussions (4.3.1 Example 1: Semi-batch reactor; 4.3.2 Example 2: Diffusion-Convection-Reaction (DCR) process)
    5. Convergence analysis of the model-based deep reinforcement learning for optimal control of nonlinear control-affine system: 5.1 Introduction; 5.2 Convergence proof of globalized dual heuristic programming (GDHP); 5.3 Function approximation with deep neural networks (5.3.1 Function approximation and gradient descent learning; 5.3.2 Forward and backward propagations of DNNs); 5.4 Convergence analysis in the deep neural networks space (5.4.1 Lyapunov analysis of the neural network parameter errors; 5.4.2 Lyapunov analysis of the closed-loop stability; 5.4.3 Overall Lyapunov function); 5.5 Simulation results and discussions (5.5.1 System description; 5.5.2 Algorithmic settings; 5.5.3 Control result)
    6. Primal-dual differential dynamic programming for constrained dynamic optimization of continuous system: 6.1 Introduction; 6.2 Primal-dual differential dynamic programming for constrained dynamic optimization (6.2.1 Augmented Lagrangian method; 6.2.2 Primal-dual differential dynamic programming algorithm; 6.2.3 Overall algorithm); 6.3 Results and discussions
    7. Concluding remarks: 7.1 Summary of the contributions; 7.2 Future works
    Bibliography
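
    Of the three solvers named above, differential dynamic programming is the easiest to sketch compactly: the cost-to-go is expanded to second order around a nominal trajectory, and feedforward and feedback gains are read off the Q-function expansion in a backward sweep. The snippet below is a generic, hypothetical backward pass for an unconstrained problem, included only to make the recursion concrete; it is not the primal-dual variant developed in the thesis, and the regularisation scheme is an assumption.

```python
import numpy as np

def ddp_backward_pass(fx, fu, lx, lu, lxx, luu, lux, Vx_T, Vxx_T, reg=1e-6):
    """Generic DDP backward pass.
    fx[t], fu[t]: dynamics Jacobians along the nominal trajectory.
    lx, lu, lxx, luu, lux: stage-cost derivatives; Vx_T, Vxx_T: terminal expansion.
    Returns feedforward terms k[t] and feedback gains K[t]."""
    T = len(fx)
    Vx, Vxx = Vx_T, Vxx_T
    k, K = [None] * T, [None] * T
    for t in reversed(range(T)):
        Qx  = lx[t]  + fx[t].T @ Vx
        Qu  = lu[t]  + fu[t].T @ Vx
        Qxx = lxx[t] + fx[t].T @ Vxx @ fx[t]
        Quu = luu[t] + fu[t].T @ Vxx @ fu[t] + reg * np.eye(len(lu[t]))
        Qux = lux[t] + fu[t].T @ Vxx @ fx[t]
        Quu_inv = np.linalg.inv(Quu)
        k[t] = -Quu_inv @ Qu          # feedforward correction
        K[t] = -Quu_inv @ Qux         # feedback gain
        # Value-function expansion passed to the preceding stage.
        Vx  = Qx  + K[t].T @ Quu @ k[t] + K[t].T @ Qu  + Qux.T @ k[t]
        Vxx = Qxx + K[t].T @ Quu @ K[t] + K[t].T @ Qux + Qux.T @ K[t]
    return k, K
```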

    The Edge-of-Reach Problem in Offline Model-Based Reinforcement Learning

    Offline reinforcement learning aims to enable agents to be trained from pre-collected datasets; however, this comes with the added challenge of estimating the value of behavior not covered in the dataset. Model-based methods offer a solution by allowing agents to collect additional synthetic data via rollouts in a learned dynamics model. The prevailing theoretical understanding is that this can be viewed as online reinforcement learning in an approximate dynamics model, and any remaining performance gap is therefore assumed to be due to the imperfect dynamics model. Surprisingly, however, we find that if the learned dynamics model is replaced by the true, error-free dynamics, existing model-based methods completely fail. This reveals a major misconception. Our subsequent investigation finds that the general procedure used in model-based algorithms gives rise to a set of edge-of-reach states that trigger pathological value overestimation and collapse in Bellman-based algorithms. We term this the edge-of-reach problem. Based on this, we fill gaps in existing theory and explain how prior model-based methods inadvertently address the true underlying edge-of-reach problem. Finally, we propose Reach-Aware Value Learning (RAVL), a simple and robust method that directly addresses the edge-of-reach problem and achieves strong performance across both proprioceptive and pixel-based benchmarks. Code open-sourced at: https://github.com/anyasims/edge-of-reach
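
    The "general procedure" at issue is the standard Dyna/MBPO-style loop: sample start states from the offline dataset, roll the learned model forward for a few steps under the current policy, and mix the resulting synthetic transitions into value learning. The sketch below is a hypothetical illustration of that rollout loop, not the RAVL algorithm; the model and policy interfaces are assumptions. States that can appear only at the final step of such truncated rollouts are never themselves rolled out from, which is the situation the edge-of-reach analysis targets.

```python
import numpy as np

def generate_model_rollouts(model, policy, dataset_states, horizon=5, n_starts=64, rng=None):
    """Roll a learned dynamics model forward for a few steps from states sampled
    out of the offline dataset; return the synthetic transitions collected."""
    rng = rng or np.random.default_rng(0)
    starts = dataset_states[rng.integers(len(dataset_states), size=n_starts)]
    synthetic, s = [], starts
    for t in range(horizon):
        a = policy(s)                  # current policy proposes actions
        s_next, r = model(s, a)        # learned dynamics + reward model
        synthetic.append((s, a, r, s_next))
        s = s_next                     # final-step states are reached but never
                                       # used as rollout starts themselves
    return synthetic

# Toy usage with a made-up linear "learned model" and a random policy (assumptions).
rng = np.random.default_rng(0)
toy_model = lambda s, a: (0.95 * s + 0.1 * a, -np.linalg.norm(a, axis=-1))
toy_policy = lambda s: rng.normal(size=s.shape)
data = generate_model_rollouts(toy_model, toy_policy, rng.normal(size=(1000, 3)))
```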

    Better Optimism By Bayes: Adaptive Planning with Rich Models

    The computational costs of inference and planning have confined Bayesian model-based reinforcement learning to one of two dismal fates: powerful Bayes-adaptive planning, but only for simplistic models; or powerful Bayesian non-parametric models, but paired with simple, myopic planning strategies such as Thompson sampling. We ask whether it is feasible and truly beneficial to combine rich probabilistic models with a closer approximation to fully Bayesian planning. First, we use a collection of counterexamples to show formal problems with the over-optimism inherent in Thompson sampling. Then we leverage state-of-the-art techniques in efficient Bayes-adaptive planning and non-parametric Bayesian methods to perform qualitatively better than both existing conventional algorithms and Thompson sampling on two contextual bandit-like problems. Comment: 11 pages, 11 figures
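
    As a point of reference for the myopic baseline criticised above, Thompson sampling fits in a few lines: draw one model from the posterior, act greedily with respect to that single draw, observe the outcome, and update the posterior. The sketch below is a generic Gaussian multi-armed bandit illustration with assumed priors and noise, not the contextual setup or the Bayes-adaptive planner from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_arms, noise_var = 5, 1.0
true_means = rng.normal(0.0, 1.0, n_arms)   # unknown to the agent

# Gaussian posterior over each arm's mean reward (conjugate updates).
post_mean, post_var = np.zeros(n_arms), np.ones(n_arms)

for t in range(500):
    sampled_means = rng.normal(post_mean, np.sqrt(post_var))  # one posterior draw
    arm = int(np.argmax(sampled_means))                       # act greedily on that draw
    reward = rng.normal(true_means[arm], np.sqrt(noise_var))
    # Conjugate Gaussian update for the pulled arm only.
    precision = 1.0 / post_var[arm] + 1.0 / noise_var
    post_mean[arm] = (post_mean[arm] / post_var[arm] + reward / noise_var) / precision
    post_var[arm] = 1.0 / precision
```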

    Vehicle Dispatching and Routing of On-Demand Intercity Ride-Pooling Services: A Multi-Agent Hierarchical Reinforcement Learning Approach

    The integrated development of city clusters has given rise to increasing demand for intercity travel. Intercity ride-pooling services show considerable potential for upgrading traditional intercity bus services through demand-responsive operation. Nevertheless, their online operation suffers from inherent complexity due to the coupling of vehicle resource allocation among cities with pooled-ride vehicle routing. To tackle these challenges, this study proposes a two-level framework for online fleet management. Specifically, a novel multi-agent feudal reinforcement learning model at the upper level of the framework cooperatively assigns idle vehicles to different intercity lines, while the lower level updates vehicle routes using an adaptive large neighborhood search heuristic. Numerical studies based on a realistic dataset of Xiamen and its surrounding cities in China show that the proposed framework effectively mitigates supply and demand imbalances and achieves significant improvements in both average daily system profit and order fulfillment ratio.
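
    The two-level structure described above reduces to a simple control loop: periodically, the upper level reallocates idle vehicles across intercity lines; continuously, the lower level repairs each affected vehicle's pooled route as new orders arrive. The skeleton below is a hypothetical illustration of that loop; the environment interface, the function names (upper_policy, alns_update_route), and the reassignment cadence are assumptions made for illustration, not the paper's implementation.

```python
def online_fleet_management(env, upper_policy, alns_update_route, horizon_steps, reassign_every=15):
    """Skeleton of a two-level dispatching-and-routing loop.
    upper_policy: learned assignment of idle vehicles to intercity lines
                  (e.g. a feudal multi-agent RL policy).
    alns_update_route: heuristic that re-optimises one vehicle's pooled route."""
    for t in range(horizon_steps):
        if t % reassign_every == 0:
            # Upper level: cooperative assignment of idle vehicles to lines.
            assignment = upper_policy(env.idle_vehicles(), env.line_demand_forecast())
            env.apply_assignment(assignment)
        # Lower level: insert newly arrived orders and repair routes vehicle by vehicle.
        for order in env.new_orders():
            vehicle = env.candidate_vehicle(order)
            vehicle.route = alns_update_route(vehicle.route, order)
        env.advance_one_step()
```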