Deep Learning and Mean-Field Games: A Stochastic Optimal Control Perspective
We provide a rigorous mathematical formulation of Deep Learning (DL) methodologies through an in-depth analysis of the learning procedures characterizing Neural Network (NN) models within the theoretical frameworks of Stochastic Optimal Control (SOC) and Mean-Field Games (MFGs). In particular, we show how the supervised learning approach can be translated into a (stochastic) mean-field optimal control problem by applying the Hamilton–Jacobi–Bellman (HJB) approach and the mean-field Pontryagin maximum principle. Our contribution sheds new light on a possible theoretical connection between mean-field problems and DL, merging heterogeneous approaches and reporting the state of the art within these fields to show how the different perspectives can indeed be fruitfully unified.
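For orientation, one common way the literature writes supervised learning as a (stochastic) mean-field optimal control problem is sketched below. The notation is an assumption made for illustration and is not taken from the paper: $x_t$ is the hidden state with network depth playing the role of time, $\theta_t$ are the trainable weights acting as the control, $\mu$ is the joint distribution of inputs and labels, $\Phi$ is the terminal loss, $L$ is a running regularizer, and $\sigma$ is a noise level ($\sigma = 0$ recovers the deterministic case).

```latex
\begin{aligned}
\inf_{\theta}\; J(\theta)
  &= \mathbb{E}_{(x_0, y_0)\sim\mu}
     \Big[\, \Phi(x_T, y_0) + \int_0^T L(x_t, \theta_t)\,dt \,\Big], \\
\text{s.t.}\quad dx_t
  &= f(x_t, \theta_t)\,dt + \sigma\,dW_t, \qquad t \in [0, T].
\end{aligned}
```

The control $\theta$ is shared across all samples drawn from $\mu$, which is what makes the problem mean-field rather than a family of independent control problems.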
Process Control and Optimization Using Model-Based Reinforcement Learning
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, School of Chemical and Biological Engineering, February 2020. Jong Min Lee (이종민).
Sequential decision making is a core problem in plant-wide process optimization. The dominant numerical method, forward-in-time direct optimization, is limited to open-loop solutions and has difficulty accounting for uncertainty. Dynamic programming overcomes these limitations, but the associated functional optimization suffers from the curse of dimensionality. Reinforcement learning (RL), a sample-based approach to approximating dynamic programming, can resolve this issue and is investigated throughout this thesis. Methods that account for the system model explicitly are of particular interest. Model-based RL is used to solve three representative sequential decision-making problems: scheduling, formulated as a partially observable Markov decision process and solved with point based value iteration (PBVI); regulatory control, formulated with a control-affine state space model and solved with globalized dual heuristic programming (GDHP); and supervisory optimization, formulated with a general state space model and solved with differential dynamic programming (DDP).
The contributions for each problem are as follows. First, for the scheduling problem, we develop a closed-loop feedback scheme, which highlights the strength of the approach compared with direct optimization. Second, the regulatory control problem is tackled with a function approximation method that relaxes the functional optimization to an optimization over a finite-dimensional vector space. Deep neural networks (DNNs) are used as the approximator, and their advantages are discussed together with a convergence analysis. Finally, for the supervisory optimization problem, we develop a novel constrained RL framework that uses the primal-dual DDP method. Various illustrative examples are presented to validate the developed model-based RL algorithms and to support the thesis statement that dynamic programming can be considered a complementary method to direct optimization.
1. Introduction
1.1 Motivation and previous work
1.2 Statement of contributions
1.3 Outline of the thesis
2. Background and preliminaries
2.1 Optimization problem formulation and the principle of optimality
2.1.1 Markov decision process
2.1.2 State space model
2.2 Overview of the developed RL algorithms
2.2.1 Point based value iteration
2.2.2 Globalized dual heuristic programming
2.2.3 Differential dynamic programming
3. A POMDP framework for integrated scheduling of infrastructure maintenance and inspection
3.1 Introduction
3.2 POMDP solution algorithm
3.2.1 General point based value iteration
3.2.2 GapMin algorithm
3.2.3 Receding horizon POMDP
3.3 Problem formulation for infrastructure scheduling
3.3.1 State
3.3.2 Maintenance and inspection actions
3.3.3 State transition function
3.3.4 Cost function
3.3.5 Observation set and observation function
3.3.6 State augmentation
3.4 Illustrative example and simulation result
3.4.1 Structural point for the analysis of a high dimensional belief space
3.4.2 Infinite horizon policy under the natural deterioration process
3.4.3 Receding horizon POMDP
3.4.4 Validation of POMDP policy via Monte Carlo simulation
4. A model-based deep reinforcement learning method applied to finite-horizon optimal control of nonlinear control-affine system
4.1 Introduction
4.2 Function approximation and learning with deep neural networks
4.2.1 GDHP with a function approximator
4.2.2 Stable learning of DNNs
4.2.3 Overall algorithm
4.3 Results and discussions
4.3.1 Example 1: Semi-batch reactor
4.3.2 Example 2: Diffusion-Convection-Reaction (DCR) process
5. Convergence analysis of the model-based deep reinforcement learning for optimal control of nonlinear control-affine system
5.1 Introduction
5.2 Convergence proof of globalized dual heuristic programming (GDHP)
5.3 Function approximation with deep neural networks
5.3.1 Function approximation and gradient descent learning
5.3.2 Forward and backward propagations of DNNs
5.4 Convergence analysis in the deep neural networks space
5.4.1 Lyapunov analysis of the neural network parameter errors
5.4.2 Lyapunov analysis of the closed-loop stability
5.4.3 Overall Lyapunov function
5.5 Simulation results and discussions
5.5.1 System description
5.5.2 Algorithmic settings
5.5.3 Control result
6. Primal-dual differential dynamic programming for constrained dynamic optimization of continuous system
6.1 Introduction
6.2 Primal-dual differential dynamic programming for constrained dynamic optimization
6.2.1 Augmented Lagrangian method
6.2.2 Primal-dual differential dynamic programming algorithm
6.2.3 Overall algorithm
6.3 Results and discussions
7. Concluding remarks
7.1 Summary of the contributions
7.2 Future works
Bibliography
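The outline lists an augmented Lagrangian method as one ingredient of the primal-dual DDP chapter. As a rough illustration of that ingredient only, the sketch below applies a textbook augmented Lagrangian iteration to a hypothetical static toy problem (minimize x^2 subject to 1 - x <= 0). The problem, the function name, and all parameter values are assumptions for illustration; this is not the thesis's algorithm and involves no DDP.

```python
def augmented_lagrangian(rho=10.0, outer=30, inner=200, step=0.05):
    """Toy primal-dual loop for: minimize x**2 subject to g(x) = 1 - x <= 0.

    Augmented Lagrangian for an inequality constraint:
        L(x, lam) = x**2 + (max(0, lam + rho * g(x))**2 - lam**2) / (2 * rho)
    """
    x, lam = 0.0, 0.0
    for _ in range(outer):
        # Primal step: gradient descent on L(., lam) with lam held fixed.
        for _ in range(inner):
            shifted = max(0.0, lam + rho * (1.0 - x))
            grad = 2.0 * x - shifted  # dL/dx, using dg/dx = -1
            x -= step * grad
        # Dual step: multiplier update, projected onto lam >= 0.
        lam = max(0.0, lam + rho * (1.0 - x))
    return x, lam


x_opt, lam_opt = augmented_lagrangian()
print(x_opt, lam_opt)
```

For this toy problem the constrained minimizer is x = 1 with multiplier 2, so the printed values should be close to those numbers.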
Algorithms of data generation for deep learning and feedback design: A survey
17 USC 105 interim-entered record; under review. The article of record as published may be found at https://doi.org/10.1016/j.physd.2021.132955
Recent research reveals that deep learning is an effective way of solving high dimensional Hamilton–Jacobi–Bellman equations. The resulting feedback control law in the form of a neural network is computationally efficient for real-time applications of optimal control. A critical part of this design method is to generate data for training the neural network and validating its accuracy. In this paper, we provide a survey of existing algorithms that can be used to generate data. All the algorithms surveyed in this paper are causality-free, i.e., the solution at a point is computed without using the value of the function at any other points. An illustrative example is given for the optimal feedback design using supervised learning in which the data is generated using causality-free algorithms.
U.S. Naval Research Laborator
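To make the causality-free idea concrete, the sketch below computes the value function and its gradient at individual points of a hypothetical discrete-time scalar linear-quadratic problem. Each point is solved independently by gradient descent on the open-loop controls, with the gradient obtained from a backward costate recursion. The problem data, function names, and step size are assumptions for illustration, and this is not claimed to be one of the surveyed algorithms. A Riccati recursion is included only as a reference to check the samples.

```python
A, B, Q, R, Q_T, N = 1.0, 0.5, 1.0, 1.0, 1.0, 10  # hypothetical toy problem


def rollout(x0, u):
    """Simulate x[k+1] = A*x[k] + B*u[k] from the initial state x0."""
    xs = [x0]
    for uk in u:
        xs.append(A * xs[-1] + B * uk)
    return xs


def cost_and_gradient(x0, u):
    """Return the cost, its gradient w.r.t. u, and the initial costate dJ/dx0."""
    xs = rollout(x0, u)
    cost = sum(Q * x * x + R * v * v for x, v in zip(xs, u)) + Q_T * xs[-1] ** 2
    lam = 2.0 * Q_T * xs[-1]  # costate at the final time
    grad = [0.0] * N
    for k in reversed(range(N)):
        grad[k] = 2.0 * R * u[k] + B * lam  # dJ/du[k]
        lam = 2.0 * Q * xs[k] + A * lam     # backward costate recursion
    return cost, grad, lam


def value_at_point(x0, step=0.03, iters=3000):
    """Solve the open-loop problem from x0 alone; no other sample is used."""
    u = [0.0] * N
    for _ in range(iters):
        _, grad, _ = cost_and_gradient(x0, u)
        u = [uk - step * g for uk, g in zip(u, grad)]
    cost, _, lam0 = cost_and_gradient(x0, u)
    return cost, lam0  # value V(x0) and its gradient dV/dx0


def riccati_p0():
    """Reference solution: V(x0) = P0 * x0**2 from the Riccati recursion."""
    p = Q_T
    for _ in range(N):
        p = Q + A * A * p - (A * B * p) ** 2 / (R + B * B * p)
    return p


# Each row (x0, V, dV/dx0) is one independently generated training sample.
samples = [(x0,) + value_at_point(x0) for x0 in (-1.0, 0.5, 2.0)]
```

Because every sample is computed in isolation, the loop over initial states can be run in parallel, and new points can be added wherever a partially trained network is least accurate.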
Issues on Stability of ADP Feedback Controllers for Dynamical Systems
This paper traces the development of neural-network (NN)-based feedback controllers that are derived from the principle of adaptive/approximate dynamic programming (ADP) and discusses their closed-loop stability. Different versions of NN structures in the literature, which embed mathematical mappings related to solutions of the ADP-formulated problems called “adaptive critics” or “action-critic” networks, are discussed. Distinction between the two classes of ADP applications is pointed out. Furthermore, papers in “model-free” development and model-based neurocontrollers are reviewed in terms of their contributions to stability issues. Recent literature suggests that work in ADP-based feedback controllers with assured stability is growing in diverse forms.
Adaptive Deep Learning for High-Dimensional Hamilton-Jacobi-Bellman Equations
Computing optimal feedback controls for nonlinear systems generally requires
solving Hamilton-Jacobi-Bellman (HJB) equations, which are notoriously
difficult when the state dimension is large. Existing strategies for
high-dimensional problems often rely on specific, restrictive problem
structures, or are valid only locally around some nominal trajectory. In this
paper, we propose a data-driven method to approximate semi-global solutions to
HJB equations for general high-dimensional nonlinear systems and compute
candidate optimal feedback controls in real-time. To accomplish this, we model
solutions to HJB equations with neural networks (NNs) trained on data generated
without discretizing the state space. Training is made more effective and
data-efficient by leveraging the known physics of the problem and using the
partially-trained NN to aid in adaptive data generation. We demonstrate the
effectiveness of our method by learning solutions to HJB equations
corresponding to the attitude control of a six-dimensional nonlinear rigid
body, and nonlinear systems of dimension up to 30 arising from the
stabilization of a Burgers'-type partial differential equation. The trained NNs
are then used for real-time feedback control of these systems.
Comment: Added section on validation error computation. Updated convergence
test formula and associated result
Generalized Policy Iteration for Optimal Control in Continuous Time
This paper proposes the Deep Generalized Policy Iteration (DGPI) algorithm to
find the infinite horizon optimal control policy for general nonlinear
continuous-time systems with known dynamics. Unlike existing adaptive dynamic
programming algorithms for continuous-time systems, DGPI requires neither an
admissible initial policy nor input-affine system dynamics for convergence.
Our algorithm employs the actor-critic architecture to
approximate both policy and value functions with the purpose of iteratively
solving the Hamilton-Jacobi-Bellman equation. Both the policy and value
functions are approximated by deep neural networks. Given an arbitrary initial
policy, the proposed DGPI algorithm can eventually converge to an admissible
policy, and subsequently to an optimal one, for an arbitrary nonlinear system. We also
relax the update termination conditions of both the policy evaluation and
improvement processes, which leads to a faster convergence speed than
conventional Policy Iteration (PI) methods, for the same architecture of
function approximators. We further prove the convergence and optimality of the
algorithm with a thorough Lyapunov analysis, and demonstrate its generality and
efficacy using two detailed numerical examples.
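For context, the sketch below shows conventional policy iteration on a hypothetical scalar continuous-time linear-quadratic problem, the kind of baseline the abstract contrasts DGPI with. Unlike DGPI as described above, this classical scheme does need an admissible (stabilizing) initial gain, and it uses no neural networks. All names and numbers are assumptions for illustration.

```python
import math


def policy_iteration(a, b, q, r, k0, iters=20):
    """Classical policy iteration for dx/dt = a*x + b*u, cost integral of q*x^2 + r*u^2.

    The policy is u = -k*x and the value function is V(x) = p*x^2.
    """
    k = k0
    if a - b * k >= 0.0:
        raise ValueError("initial gain must be admissible (closed loop stable)")
    p = None
    for _ in range(iters):
        # Policy evaluation: solve 0 = q + r*k^2 + 2*p*(a - b*k) for p.
        p = (q + r * k * k) / (2.0 * (b * k - a))
        # Policy improvement: minimize r*u^2 + 2*p*x*b*u over u, giving u = -(b*p/r)*x.
        k = b * p / r
    return p, k


def riccati_solution(a, b, q, r):
    """Positive root of the scalar algebraic Riccati equation 0 = q + 2*a*p - b^2*p^2/r."""
    return r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)


p_pi, k_pi = policy_iteration(a=1.0, b=1.0, q=1.0, r=1.0, k0=3.0)
p_star = riccati_solution(a=1.0, b=1.0, q=1.0, r=1.0)
print(p_pi, p_star)
```

With these values the open-loop system is unstable, so an inadmissible starting gain such as k0 = 0 is rejected, which is the restriction the abstract says DGPI removes.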