83 research outputs found

    Enhancing Robustness of Deep Reinforcement Learning based Semiconductor Packaging Lines Scheduling with Regularized Training

    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ)--์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› :๊ณต๊ณผ๋Œ€ํ•™ ์‚ฐ์—…๊ณตํ•™๊ณผ,2019. 8. ๋ฐ•์ข…ํ—Œ.์ตœ๊ทผ ๊ณ ์„ฑ๋Šฅ ์ „์ž ์ œํ’ˆ์— ๋Œ€ํ•œ ์ˆ˜์š”๊ฐ€ ๋†’์•„์ง€๋ฉด์„œ ๋‹ค์ค‘ ์นฉ ์ œํ’ˆ ์ƒ์‚ฐ์„ ์ค‘์‹ฌ์œผ๋กœ ๋ฐ˜๋„์ฒด ์ œ์กฐ๊ณต์ •์ด ๋ฐœ์ „ํ•˜๊ณ  ์žˆ๋‹ค. ๋‹ค์ค‘ ์นฉ ์ œํ’ˆ์€ ํŒจํ‚ค์ง• ๋ผ์ธ์—์„œ ๊ณต์ •์„ ์—ฌ๋Ÿฌ ๋ฒˆ ๋ฐ˜๋ณตํ•˜๋Š” ์žฌ์œ ์ž…์ด ๋ฐœ์ƒํ•˜๊ฒŒ ๋˜๋ฉฐ, ๊ณต์ • ์„ค๋น„์˜ ์…‹์—… ๊ต์ฒด๊ฐ€ ๋นˆ๋ฒˆํžˆ ์ผ์œผํ‚ค๊ฒŒ ๋œ๋‹ค. ์ด๋Š” ๋ฐ˜๋„์ฒด ํŒจํ‚ค์ง• ๋ผ์ธ์˜ ์Šค์ผ€์ค„๋ง์„ ์–ด๋ ต๊ฒŒ ๋งŒ๋“œ๋Š” ์ฃผ์š”ํ•œ ์š”์†Œ์ด๋‹ค. ๋˜ํ•œ, ๋ฐ˜๋„์ฒด ํŒจํ‚ค์ง• ๋ผ์ธ์€ ์ œ์กฐ๊ณต์ • ๋‚ด,์™ธ์ ์œผ๋กœ ๋‹ค์–‘ํ•œ ๋ณ€๋™ ์‚ฌํ•ญ์— ์˜ํ•ด ์ƒ์‚ฐํ™˜๊ฒฝ์ด ๋นˆ๋ฒˆํžˆ ๋ณ€ํ™”ํ•˜๋ฉฐ, ์ œ์กฐ ํ˜„์žฅ์—์„œ๋Š” ์Šค์ผ€์ค„๋ง์„ ์œ„ํ•ด ์š”๊ตฌ๋˜๋Š” ๊ณ„์‚ฐ ์‹œ๊ฐ„์ด ๋งค์šฐ ์ค‘์š”ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์‹ ์†ํ•œ ์Šค์ผ€์ค„ ๋„์ถœ์ด ์š”๊ตฌ๋œ๋‹ค. ๋ฐ˜๋„์ฒด ํŒจํ‚ค์ง• ๋ผ์ธ์˜ ์Šค์ผ€์ค„๋ง ์—ฐ๊ตฌ๊ฐ€ ํ™œ๋ฐœํ•ด์ง€๋ฉด์„œ ์ „์—ญ ์ตœ์ ํ™”๋ฅผ ๋ชฉํ‘œ๋กœ ํ•˜๋Š” ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ์Šค์ผ€์ค„๋ง ์—ฐ๊ตฌ๊ฐ€ ๋Š˜์–ด๋‚˜๊ณ  ์žˆ๋‹ค. ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ๋ฐ˜๋„์ฒด ํŒจํ‚ค์ง• ๋ผ์ธ ์Šค์ผ€์ค„๋ง ์—ฐ๊ตฌ๋Š” ๊ทธ ํ™œ์šฉ ์ธก๋ฉด์—์„œ ๋‹ค์–‘ํ•œ ์ƒ์‚ฐํ™˜๊ฒฝ ๋ณ€ํ™”์— ๊ฐ•๊ฑดํžˆ ๋Œ€์‘ํ•˜๋ฉฐ, ์งง์€ ์‹œ๊ฐ„ ์•ˆ์— ์ข‹์€ ์Šค์ผ€์ค„์„ ์–ป์„ ์ˆ˜ ์žˆ์–ด์•ผ ํ•œ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ์Šค์ผ€์ค„๋ง ๋ชจ๋ธ์˜ ๊ฐ•๊ฑด์„ฑ ํ™•๋ณด๋ฅผ ๋ชฉํ‘œ๋กœ ํ•œ๋‹ค. ์ƒˆ๋กœ์šด ์ƒ์‚ฐํ™˜๊ฒฝ์ด ํ…Œ์ŠคํŠธ๋กœ ์ฃผ์–ด์กŒ์„ ๋•Œ, ์žฌํ•™์Šต์„ ์ˆ˜ํ–‰ํ•˜์ง€ ์•Š๊ณ  ์„ฑ๋Šฅ์˜ ํฐ ์ €ํ•˜์—†๋Š” ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜ ๋ฐ˜๋„์ฒด ํŒจํ‚ค์ง• ๋ผ์ธ ์Šค์ผ€์ค„๋ง์„ ์œ„ํ•œ ์ •๊ทœํ™” ํ•™์Šต๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ์œ ์—ฐ ์žก์ƒต ์Šค์ผ€์ค„๋ง ๋ฌธ์ œ์— ๊ฐ•ํ™”ํ•™์Šต์„ ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด ์ „์ฒด ๊ณต์ • ์ƒํ™ฉ์„ ๊ณ ๋ คํ•œ ์ƒํƒœ์™€ ํ–‰๋™, ๋ณด์ƒ์„ ์„ค๊ณ„ํ•˜์˜€๊ณ , ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต์˜ ๋Œ€ํ‘œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ธ ์‹ฌ์ธต Q ๋„คํŠธ์›Œํฌ๋ฅผ ์ด์šฉํ•˜์—ฌ ์Šค์ผ€์ค„๋ง ๋ฌธ์ œ๋ฅผ ํ•™์Šตํ•˜์˜€๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ ์ œ์•ˆํ•˜๋Š” ์ •๊ทœํ™” ํ•™์Šต ๊ธฐ๋ฒ•์€ 4๋‹จ๊ณ„๋กœ ๋‚˜๋ˆ„์–ด ๊ฐ ๋‹จ๊ณ„์—์„œ ์—ฌ๋Ÿฌ ์ƒ์‚ฐํ™˜๊ฒฝ ๋ณ€ํ™”๊ฐ€ ๋ฐ˜์˜๋œ ๋ฌธ์ œ์˜ ์ผ๋ฐ˜์„ฑ๊ณผ ๊ฐ ๋ฌธ์ œ์˜ ํŠน์ˆ˜์„ฑ์„ ํ•™์Šตํ•˜๋„๋ก ์„ค๊ณ„ํ•˜์˜€๋‹ค. ์„œ๋กœ ๋‹ค๋ฅธ ๋ณต์žก๋„์˜ ์Šค์ผ€์ค„๋ง ๋ฌธ์ œ๋ฅผ ์ด์šฉํ•˜์—ฌ ์‹คํ—˜์„ ์ง„ํ–‰ํ•˜์˜€์œผ๋ฉฐ, ๋ฃฐ ๊ธฐ๋ฐ˜ ๋ฐ ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ๋‹ค๋ฅธ ์Šค์ผ€์ค„๋ง ๋ชจ๋ธ์— ๋น„ํ•ด ๋Œ€์ฒด์ ์œผ๋กœ ์„ฑ๋Šฅ์˜ ์šฐ์ˆ˜ํ•จ์„ ๊ฒ€์ฆํ•˜์˜€๋‹ค. ๋ณธ ์—ฐ๊ตฌ๋Š” ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ์Šค์ผ€์ค„๋ง ์—ฐ๊ตฌ์—์„œ ๋ชจ๋ธ์˜ ๊ฐ•๊ฑด์„ฑ์— ์—ฐ๊ตฌ์˜ ์ดˆ์ ์„ ๋งž์ถ˜ ์ฒซ ์—ฐ๊ตฌ์ด๋ฉฐ, ๋ณธ ์—ฐ๊ตฌ์˜ ๊ฒฐ๊ณผ๋Š” ์‹ค์ œ ๊ณต์žฅ์—์„œ ์—ฐ๊ตฌ์˜ ํ™œ์šฉ์„ฑ์„ ํ•œ์ธต ๋†’์—ฌ์ค€ ์—ฐ๊ตฌ์ด๋‹ค.As the demand for high-performance electronic devices has increased, the semiconductor manufacturing process is being developed centering on the production of multi-chip products. In multi-chip products, re-entrance occurs by repeating the process several times in the packaging line, and the setup change of equipment is frequently incurred. These are major factors that make the scheduling of the semiconductor packaging line difficult. The production environment frequently changes due to internal and external variabilities. In addition, since the calculation time required for scheduling is very important at the manufacturing site, prompt schedule generation is required. As the research of the semiconductor packaging line scheduling becomes active, the reinforcement learning based scheduling research aiming at the global optimization is increasing. 
In view of the utilization of scheduling research based on reinforcement learning, there is a need for a method capable of reacting to various production environment changes and obtaining a good schedule in a short time. This study aims at obtaining the robustness of the scheduling model based on deep reinforcement learning. We propose a regularzied training method for semiconductor packaging lines scheduling based on deep reinforcement learning without performance degradation and re-training when a new production environment is given as a test data. In order to apply reinforcement learning to flexible job-shop scheduling problem, we designed state, action and reward considering overall process and trained deep Q network which is a representative algorithm of deep reinforcement learning. The regularzied training method proposed in this study is divided into four stages and designed to train the generalities of the problems reflected in various production environment and the specificity of each problem. Experiments were conducted using scheduling problems of different complexity, and it was verified that the performance was superior to other scheduling models based on rule-based and deep reinforcement learning. This study is the first research that focuses on the robustness of the model in the reinforcement learning based scheduling. Moreover, the result of this study enhances the practicality of research in real factory application.์ดˆ๋ก ๋ชฉ์ฐจ ํ‘œ ๋ชฉ์ฐจ ๊ทธ๋ฆผ ๋ชฉ์ฐจ ์ œ 1 ์žฅ ์„œ๋ก  1.1 ์—ฐ๊ตฌ ๋ฐฐ๊ฒฝ ๋ฐ ๋™๊ธฐ 1.2 ์—ฐ๊ตฌ ๋ชฉ์  1.3 ์—ฐ๊ตฌ ๋Œ€์ƒ ์ •์˜ 1.4 ์—ฐ๊ตฌ ๋‚ด์šฉ ๋ฐ ๊ตฌ์„ฑ ์ œ 2 ์žฅ ๋ฐฐ๊ฒฝ์ด๋ก  ๋ฐ ๊ด€๋ จ์—ฐ๊ตฌ 2.1 ๋ฐฐ๊ฒฝ์ด๋ก  2.1.1 ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต 2.1.2 ์ •๊ทœํ™” 2.2 ๊ด€๋ จ์—ฐ๊ตฌ 2.2.1 ๋ฐ˜๋„์ฒด ํŒจํ‚ค์ง• ๋ผ์ธ ์Šค์ผ€์ค„๋ง ์—ฐ๊ตฌ 2.2.2 ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜ ์Šค์ผ€์ค„๋ง ์—ฐ๊ตฌ 2.2.3 ๊ฐ•ํ™”ํ•™์Šต ๊ฐ•๊ฑด์„ฑ ์—ฐ๊ตฌ ์ œ 3 ์žฅ ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜ ๋ฐ˜๋„์ฒด ํŒจํ‚ค์ง• ๋ผ์ธ ์Šค์ผ€์ค„๋ง 3.1 ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜ ์Šค์ผ€์ค„๋ง ์˜์‚ฌ ๊ฒฐ์ • 3.2 ์ƒํƒœ, ํ–‰๋™, ๋ณด์ƒ ์ •์˜ 3.3 ๊ฐ•ํ™”ํ•™์Šต ์—์ด์ „ํŠธ ํ•™์Šต๊ณผ ํ…Œ์ŠคํŠธ 3.3.1 ์‹ฌ์ธต Q ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ 3.3.2 ๊ฐ•ํ™”ํ•™์Šต ์—์ด์ „ํŠธ ํ•™์Šต ๋‹จ๊ณ„ 3.3.3 ๊ฐ•ํ™”ํ•™์Šต ์—์ด์ „ํŠธ ํ…Œ์ŠคํŠธ ๋‹จ๊ณ„ ์ œ 4 ์žฅ ๊ฐ•ํ™”ํ•™์Šต ๊ฐ•๊ฑด์„ฑ ํ™•๋ณด๋ฅผ ์œ„ํ•œ ์ •๊ทœํ™” ํ•™์Šต ๊ธฐ๋ฒ• 4.1 ์ •๊ทœํ™” ํ•™์Šต ๊ฐœ์š” 4.2 ์ •๊ทœํ™” ํ•™์Šต ๊ณผ์ • 4.2.1 ์‹ฌ์ธต Q ๋„คํŠธ์›Œํฌ ํ•™์Šต 4.2.2 Q์ธต ํ•™์Šต 4.2.3 ์ •๊ทœํ™” ๊ฐ€์ค‘์น˜ ํ•™์Šต 4.2.4 ์ƒˆ๋กœ์šด Q์ธต ํ•™์Šต ์ œ 5 ์žฅ ์‹คํ—˜ ๊ฒฐ๊ณผ 5.1 ๋ฐ์ดํ„ฐ์…‹ 5.2 ์‹คํ—˜ ๊ณผ์ • 5.3 ์‹คํ—˜ ์„ธํŒ… 5.3.1 ๊ฐ•ํ™”ํ•™์Šต ์‹คํ—˜ ์„ธํŒ… 5.3.2 ์ •๊ทœํ™” ํ•™์Šต ์‹คํ—˜ ์„ธํŒ… 5.4 ์‹คํ—˜ ๊ฒฐ๊ณผ ์ œ 6 ์žฅ ๊ฒฐ๋ก  6.1 ๊ฒฐ๋ก  6.2 ํ•œ๊ณ„์  ๋ฐ ํ–ฅํ›„ ์—ฐ๊ตฌ ์ฐธ๊ณ ๋ฌธํ—Œ AbstractMaste
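Only the base deep Q-network component of the method can be inferred from the abstract; the four-stage regularized training itself is not specified in enough detail to reproduce. Below is a minimal sketch, in PyTorch, of a Q-network scheduler that scores dispatching actions from a flattened line-status state, together with its one-step TD loss; the layer sizes, state features, and hyperparameters are illustrative assumptions, not values from the thesis.

```python
import torch
import torch.nn as nn

class QScheduler(nn.Module):
    """Maps a flattened line-status vector to one Q-value per dispatching action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # Q(s, a) for each candidate dispatching decision
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def dqn_loss(q_net, target_net, batch, gamma: float = 0.99) -> torch.Tensor:
    """One-step TD loss; `batch` holds (state, action, reward, next_state, done) tensors."""
    s, a, r, s2, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)
```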

    ํ™•๋ฅ ์  ์•ˆ์ „์„ฑ ๊ฒ€์ฆ์„ ์œ„ํ•œ ์•ˆ์ „ ๊ฐ•ํ™”ํ•™์Šต: ๋žดํ‘ธ๋…ธ๋ธŒ ๊ธฐ๋ฐ˜ ๋ฐฉ๋ฒ•๋ก 

    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2020. 8. ์–‘์ธ์ˆœ.Emerging applications in robotic and autonomous systems, such as autonomous driving and robotic surgery, often involve critical safety constraints that must be satisfied even when information about system models is limited. In this regard, we propose a model-free safety specification method that learns the maximal probability of safe operation by carefully combining probabilistic reachability analysis and safe reinforcement learning (RL). Our approach constructs a Lyapunov function with respect to a safe policy to restrain each policy improvement stage. As a result, it yields a sequence of safe policies that determine the range of safe operation, called the safe set, which monotonically expands and gradually converges. We also develop an efficient safe exploration scheme that accelerates the process of identifying the safety of unexamined states. Exploiting the Lyapunov shieding, our method regulates the exploratory policy to avoid dangerous states with high confidence. To handle high-dimensional systems, we further extend our approach to deep RL by introducing a Lagrangian relaxation technique to establish a tractable actor-critic algorithm. The empirical performance of our method is demonstrated through continuous control benchmark problems, such as a reaching task on a planar robot arm.์ž์œจ์ฃผํ–‰, ๋กœ๋ด‡ ์ˆ˜์ˆ  ๋“ฑ ์ž์œจ์‹œ์Šคํ…œ ๋ฐ ๋กœ๋ณดํ‹ฑ์Šค์˜ ๋– ์˜ค๋ฅด๋Š” ์‘์šฉ ๋ถ„์•ผ์˜ ์ ˆ๋Œ€ ๋‹ค์ˆ˜๋Š” ์•ˆ์ „ํ•œ ๋™์ž‘์„ ๋ณด์žฅํ•˜๊ธฐ ์œ„ํ•ด ์ผ์ •ํ•œ ์ œ์•ฝ์„ ํ•„์š”๋กœ ํ•œ๋‹ค. ํŠนํžˆ, ์•ˆ์ „์ œ์•ฝ์€ ์‹œ์Šคํ…œ ๋ชจ๋ธ์— ๋Œ€ํ•ด ์ œํ•œ๋œ ์ •๋ณด๋งŒ ์•Œ๋ ค์ ธ ์žˆ์„ ๋•Œ์—๋„ ๋ณด์žฅ๋˜์–ด์•ผ ํ•œ๋‹ค. ์ด์— ๋”ฐ๋ผ, ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํ™•๋ฅ ์  ๋„๋‹ฌ์„ฑ ๋ถ„์„(probabilistic reachability analysis)๊ณผ ์•ˆ์ „ ๊ฐ•ํ™”ํ•™์Šต(safe reinforcement learning)์„ ๊ฒฐํ•ฉํ•˜์—ฌ ์‹œ์Šคํ…œ์ด ์•ˆ์ „ํ•˜๊ฒŒ ๋™์ž‘ํ•  ํ™•๋ฅ ์˜ ์ตœ๋Œ“๊ฐ’์œผ๋กœ ์ •์˜๋˜๋Š” ์•ˆ์ „ ์‚ฌ์–‘์„ ๋ณ„๋„์˜ ๋ชจ๋ธ ์—†์ด ์ถ”์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•๋ก ์„ ์ œ์•ˆํ•œ๋‹ค. ์šฐ๋ฆฌ์˜ ์ ‘๊ทผ๋ฒ•์€ ๋งค๋ฒˆ ์ •์ฑ…์„ ์ƒˆ๋กœ ๊ตฌํ•˜๋Š” ๊ณผ์ •์—์„œ ๊ทธ ๊ฒฐ๊ณผ๋ฌผ์ด ์•ˆ์ „ํ•จ์— ๋Œ€ํ•œ ๊ธฐ์ค€์„ ์ถฉ์กฑ์‹œํ‚ค๋„๋ก ์ œํ•œ์„ ๊ฑฐ๋Š” ๊ฒƒ์œผ๋กœ, ์ด๋ฅผ ์œ„ํ•ด ์•ˆ์ „ํ•œ ์ •์ฑ…์— ๊ด€ํ•œ ๋žดํ‘ธ๋…ธํ”„ ํ•จ์ˆ˜๋ฅผ ๊ตฌ์ถ•ํ•œ๋‹ค. ๊ทธ ๊ฒฐ๊ณผ๋กœ ์‚ฐ์ถœ๋˜๋Š” ์ผ๋ จ์˜ ์ •์ฑ…์œผ๋กœ๋ถ€ํ„ฐ ์•ˆ์ „ ์ง‘ํ•ฉ(safe set)์ด๋ผ ๋ถˆ๋ฆฌ๋Š” ์•ˆ์ „ํ•œ ๋™์ž‘์ด ๋ณด์žฅ๋˜๋Š” ์˜์—ญ์ด ๊ณ„์‚ฐ๋˜๊ณ , ์ด ์ง‘ํ•ฉ์€ ๋‹จ์กฐ๋กญ๊ฒŒ ํ™•์žฅํ•˜์—ฌ ์ ์ฐจ ์ตœ์ ํ•ด๋กœ ์ˆ˜๋ ดํ•˜๋„๋ก ๋งŒ๋‹ค. ๋˜ํ•œ, ์šฐ๋ฆฌ๋Š” ์กฐ์‚ฌ๋˜์ง€ ์•Š์€ ์ƒํƒœ์˜ ์•ˆ์ „์„ฑ์„ ๋” ๋น ๋ฅด๊ฒŒ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ๋Š” ํšจ์œจ์ ์ธ ์•ˆ์ „ ํƒ์‚ฌ ์ฒด๊ณ„๋ฅผ ๊ฐœ๋ฐœํ•˜์˜€๋‹ค. ๋žดํ‘ธ๋…ธ๋ธŒ ์ฐจํ๋ฅผ ์ด์šฉํ•œ ๊ฒฐ๊ณผ, ์šฐ๋ฆฌ๊ฐ€ ์ œ์•ˆํ•˜๋Š” ํƒํ—˜ ์ •์ฑ…์€ ๋†’์€ ํ™•๋ฅ ๋กœ ์œ„ํ—˜ํ•˜๋‹ค ์—ฌ๊ฒจ์ง€๋Š” ์ƒํƒœ๋ฅผ ํ”ผํ•˜๋„๋ก ์ œํ•œ์ด ๊ฑธ๋ฆฐ๋‹ค. ์—ฌ๊ธฐ์— ๋”ํ•ด ์šฐ๋ฆฌ๋Š” ๊ณ ์ฐจ์› ์‹œ์Šคํ…œ์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์ œ์•ˆํ•œ ๋ฐฉ๋ฒ•์„ ์‹ฌ์ธต๊ฐ•ํ™”ํ•™์Šต์œผ๋กœ ํ™•์žฅํ–ˆ๊ณ , ๊ตฌํ˜„ ๊ฐ€๋Šฅํ•œ ์•กํ„ฐ-ํฌ๋ฆฌํ‹ฑ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ๋ผ๊ทธ๋ž‘์ฃผ ์ด์™„๋ฒ•์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. 
๋”๋ถˆ์–ด ๋ณธ ๋ฐฉ๋ฒ•์˜ ์‹คํšจ์„ฑ์€ ์—ฐ์†์ ์ธ ์ œ์–ด ๋ฒค์น˜๋งˆํฌ์ธ 2์ฐจ์› ํ‰๋ฉด์—์„œ ๋™์ž‘ํ•˜๋Š” 2-DOF ๋กœ๋ด‡ ํŒ”์„ ํ†ตํ•ด ์‹คํ—˜์ ์œผ๋กœ ์ž…์ฆ๋˜์—ˆ๋‹ค.Chapter 1 Introduction 1 Chapter 2 Related work 4 Chapter 3 Background 6 3.1 Probabilistic Reachability and Safety Specifications 6 3.2 Safe Reinforcement Learning 8 Chapter 4 Lyapunov-Based Safe Reinforcement Learning for Safety Specification 10 4.1 Lyapunov Safety Specification 11 4.2 Efficient Safe Exploration 14 4.3 Deep RL Implementation 19 Chapter 5 Simulation Studies 23 5.1 Tabular Q-Learning 25 5.2 Deep RL 27 5.3 Experimental Setup 31 5.3.1 Deep RL Implementation 31 5.3.2 Environments 32 Chapter 6 Conclusion 35 Bibliography 35 ์ดˆ๋ก 41 Acknowledgements 42Maste

    ๊ฐ•ํ™”ํ•™์Šต์„ ํ™œ์šฉํ•œ ๊ณ ์†๋„๋กœ ๊ฐ€๋ณ€์ œํ•œ์†๋„ ๋ฐ ๋žจํ”„๋ฏธํ„ฐ๋ง ์ „๋žต ๊ฐœ๋ฐœ

    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ๊ฑด์„คํ™˜๊ฒฝ๊ณตํ•™๋ถ€, 2022.2. ๊น€๋™๊ทœ.Recently, to resolve societal problems caused by traffic congestion, traffic control strategies have been developed to operate freeways efficiently. The representative strategies to effectively manage freeway flow are variable speed limit (VSL) control and the coordinated ramp metering (RM) strategy. This paper aims to develop a dynamic VSL and RM control algorithm to obtain efficient traffic flow on freeways using deep reinforcement learning (DRL). The traffic control strategies applying the deep deterministic policy gradient (DDPG) algorithm are tested through traffic simulation in the freeway section with multiple VSL and RM controls. The results show that implementing the strategy alleviates the congestion in the on-ramp section and shifts to the overall sections. For most cases, the VSL or RM strategy improves the overall flow rates by reducing the density and improving the average speed of the vehicles. However, VSL or RM control may not be appropriate, particularly at the high level of traffic flow. It is required to introduce the selective application of the integrated control strategies according to the level of traffic flow. It is found that the integrated strategy can be used when including the relationship between each state detector in multiple VSL sections and lanes by applying the adjacency matrix in the neural network layer. The result of this study implies the effectiveness of DRL-based VSL and the RM strategy and the importance of the spatial correlation between the state detectors.์ตœ๊ทผ์—๋Š” ๊ตํ†ตํ˜ผ์žก์œผ๋กœ ์ธํ•œ ์‚ฌํšŒ์  ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๊ณ ์†๋„๋กœ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์šด์˜ํ•˜๊ธฐ ์œ„ํ•œ ๊ตํ†ตํ†ต์ œ ์ „๋žต์ด ๋‹ค์–‘ํ•˜๊ฒŒ ๊ฐœ๋ฐœ๋˜๊ณ  ์žˆ๋‹ค. ๊ณ ์†๋„๋กœ ๊ตํ†ต๋ฅ˜๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ๊ด€๋ฆฌํ•˜๊ธฐ ์œ„ํ•œ ๋Œ€ํ‘œ์ ์ธ ์ „๋žต์œผ๋กœ๋Š” ์ฐจ๋กœ๋ณ„ ์ œํ•œ์†๋„๋ฅผ ๋‹ค๋ฅด๊ฒŒ ์ ์šฉํ•˜๋Š” ๊ฐ€๋ณ€ ์†๋„ ์ œํ•œ(VSL) ์ œ์–ด์™€ ์ง„์ž… ๋žจํ”„์—์„œ ์‹ ํ˜ธ๋ฅผ ํ†ตํ•ด ์ฐจ๋Ÿ‰์„ ํ†ต์ œํ•˜๋Š” ๋žจํ”„ ๋ฏธํ„ฐ๋ง(RM) ์ „๋žต ๋“ฑ์ด ์žˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์˜ ๋ชฉํ‘œ๋Š” ์‹ฌ์ธต ๊ฐ•ํ™” ํ•™์Šต(deep reinforcement learning)์„ ํ™œ์šฉํ•˜์—ฌ ๊ณ ์†๋„๋กœ์˜ ํšจ์œจ์ ์ธ ๊ตํ†ต ํ๋ฆ„์„ ์–ป๊ธฐ ์œ„ํ•ด ๋™์  VSL ๋ฐ RM ์ œ์–ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐœ๋ฐœํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๊ณ ์†๋„๋กœ์˜ ์—ฌ๋Ÿฌ VSL๊ณผ RM ๊ตฌ๊ฐ„์—์„œ ์‹œ๋ฎฌ๋ ˆ์ด์…˜์„ ํ†ตํ•ด ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ค‘ ํ•˜๋‚˜์ธ deep deterministic policy gradient (DDPG) ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ ์šฉํ•œ ๊ตํ†ต๋ฅ˜ ์ œ์–ด ์ „๋žต์„ ๊ฒ€์ฆํ•œ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ, ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜ VSL ๋˜๋Š” RM ์ „๋žต์„ ์ ์šฉํ•˜๋Š” ๊ฒƒ์ด ๋žจํ”„ ์ง„์ž…๋กœ ๊ตฌ๊ฐ„์˜ ํ˜ผ์žก์„ ์™„ํ™”ํ•˜๊ณ  ๋‚˜์•„๊ฐ€ ์ „์ฒด ๊ตฌ๊ฐ„์˜ ํ˜ผ์žก์„ ์ค„์ด๋Š” ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚ฌ๋‹ค. ๋Œ€๋ถ€๋ถ„์˜ ๊ฒฝ์šฐ VSL์ด๋‚˜ RM ์ „๋žต์€ ๋ณธ์„ ๊ณผ ์ง„์ž…๋กœ ๊ตฌ๊ฐ„์˜ ๋ฐ€๋„๋ฅผ ์ค„์ด๊ณ  ์ฐจ๋Ÿ‰์˜ ํ‰๊ท  ํ†ตํ–‰ ์†๋„๋ฅผ ์ฆ๊ฐ€์‹œ์ผœ ์ „์ฒด ๊ตํ†ต ํ๋ฆ„์„ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค. VSL ๋˜๋Š” RM ์ „๋žต๋“ค์€ ๋†’์€ ์ˆ˜์ค€์˜ ๊ตํ†ต๋ฅ˜์—์„œ ์ ์ ˆํ•˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ์–ด ๊ตํ†ต๋ฅ˜ ์ˆ˜์ค€์— ๋”ฐ๋ฅธ ์ „๋žต์˜ ์„ ํƒ์  ๋„์ž…์ด ํ•„์š”ํ•˜๋‹ค. ๋˜ํ•œ ๊ฒ€์ง€๊ธฐ๊ฐ„ ์ง€๋ฆฌ์  ๊ฑฐ๋ฆฌ์™€ ๊ด€๋ จํ•œ ์ธ์ ‘ ํ–‰๋ ฌ์„ ํฌํ•จํ•˜๋Š” graph neural network layer์ด ์—ฌ๋Ÿฌ ์ง€์  ๊ฒ€์ง€๊ธฐ์˜ ๊ณต๊ฐ„์  ์ƒ๊ด€ ๊ด€๊ณ„๋ฅผ ๊ฐ์ง€ํ•˜๋Š” ๋ฐ ์ด์šฉ๋  ์ˆ˜ ์žˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์˜ ๊ฒฐ๊ณผ๋Š” ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜ VSL๊ณผ RM ์ „๋žต ๋„์ž…์˜ ํ•„์š”์„ฑ๊ณผ ์ง€์  ๊ฒ€์ง€๊ธฐ ๊ฐ„์˜ ๊ณต๊ฐ„์  ์ƒ๊ด€๊ด€๊ณ„์˜ ์ค‘์š”์„ฑ์„ ๋ฐ˜์˜ํ•˜๋Š” ์ „๋žต ๋„์ž…์˜ ํšจ๊ณผ๋ฅผ ์‹œ์‚ฌํ•œ๋‹ค.Chapter 1. Introduction 1 Chapter 2. 
Literature Review 4 Chapter 3. Methods 8 3.1. Study Area and the Collection of Data 8 3.2. Simulation Framework 11 3.3. Trip Generation and Route Choice 13 3.4. Deep Deterministic Policy Gradient (DDPG) Algorithm 14 3.5. Graph Convolution Network (GCN) Layer 17 3.6. RL Formulation 18 Chapter 4. Results 20 4.1. VSL and RM 20 4.2. Efficiency according to the flow rate 28 4.3. Effectiveness of the GCN Layer 33 Chapter 5. Conclusion 34 Bibliography 37 Abstract in Korean 44์„
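A minimal sketch of the kind of graph-convolution layer the abstract describes, in which a normalized adjacency matrix (assumed here to be built from detector spacing) mixes each detector's state with its neighbours' before the DDPG actor/critic consumes it. The dimensions and normalization choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DetectorGCNLayer(nn.Module):
    """One graph-convolution layer over detector stations: each detector's features are
    mixed with its neighbours' through a symmetrically normalized adjacency matrix."""
    def __init__(self, in_dim: int, out_dim: int, adjacency: torch.Tensor):
        super().__init__()
        a_hat = adjacency + torch.eye(adjacency.size(0))      # add self-loops
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
        self.register_buffer("a_norm", d_inv_sqrt @ a_hat @ d_inv_sqrt)
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_detectors, in_dim) -> (n_detectors, out_dim)
        return torch.relu(self.a_norm @ self.linear(x))
```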

    ์ŠคํŠธ๋ ˆ์Šค๊ฐ€ ์˜์‚ฌ๊ฒฐ์ •์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ

    Doctoral dissertation -- Seoul National University Graduate School: Department of Psychology, February 2015. Advisor: ์ตœ์ง„์˜.
    When we decide or choose, behavior is generally assumed to be determined by the competitive activity of two neurocognitive control systems. One is the habitual, or model-free, system, which selects actions according to whether they were immediately reinforced; the other is the goal-directed, or model-based, system, which actively uses knowledge and information about the agent's internal state and external environment. Stress has been shown to impair goal-directed behavior and promote habitual behavior, suggesting that it intervenes in the competition between the two control systems. However, systematic research on the specific mechanisms by which stress affects the components of action selection and learning is still lacking. This dissertation closely examines, in two studies, the multifaceted effects of stress on the process and outcome of action selection. Study 1 developed a two-stage reversal-learning decision-making task that dissociates habitual from goal-directed processing and explored how acute stress induced in the laboratory engages these two processes. Healthy undergraduates were randomly assigned to a stress-treatment condition or a non-treatment control condition, and a computational reinforcement-learning model was fitted to participants' task behavior to estimate model-based and model-free behavioral tendencies and learning rates. Comparing task behavior and parameter estimates across conditions, the stress group showed reduced model-based behavior, a stronger model-free tendency in unreinforced situations, and a lower tendency to incorporate new information into choices, i.e., a lower learning rate. Study 2 used functional magnetic resonance imaging (fMRI) to examine the effect of stress on decision-making at the level of neural activity and tested whether the behavioral effect of stress is consistent across treatment intensities or follows the Yerkes-Dodson law. Healthy adults were randomly assigned to a no-stress, a single-stress, or a double-stress condition, and fMRI was acquired while they performed the two-stage reversal-learning task. Relative to the no-stress condition, participants in the single-stress condition showed increased model-based, goal-directed behavior and a reduced model-free tendency, whereas in the double-stress condition, with a higher stress level, model-based behavior was lower than in the single-stress condition. This bidirectional, stress-related change in cognition and behavior was also observed at the level of neural activity: decision-related activation in the medial prefrontal cortex and superior temporal cortex was enhanced or suppressed depending on the level of stress treatment. Activation in these two regions correlated positively with the parameter estimate reflecting the model-based behavioral tendency, and decision-related activation in the medial prefrontal cortex in particular correlated negatively with the index of habitual behavior. Stress treatment also reduced chosen-value-related activation in the right hippocampus, which appeared behaviorally as impaired reversal learning. This dissertation clarifies the cognitive and behavioral mechanisms by which stress promotes habitual behavior in decision-making and shows that the effect of stress, depending on its intensity, influences multiple neurocognitive components of action selection. The results carry clinical implications for the pathological mechanisms of, and interventions for, stress-related maladaptive behaviors such as addiction and compulsion.
    Contents: I. Introduction 1 (1. Stress and the stress response 2; 2. Stress and decision-making 7; 3. Computational models of decision-making 11; 4. Research aims 16); II. Study 1 19 (1. Methods 22; 2. Results 40; 3. Discussion 1 52); III. Study 2 59 (1. Methods 60; 2. Results 74; 3. Discussion 2 95); IV. General Discussion 102; References 106.

    Setup Change Scheduling Under Due-date Constraints Using Deep Reinforcement Learning with Self-supervision

    Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Industrial Engineering and Naval Architecture, August 2021. Advisor: ๋ฐ•์ข…ํ—Œ.
    Setup change scheduling under due-date constraints has attracted much attention from academia and industry due to its practical applications. In a real-world manufacturing system, however, solving the scheduling problem becomes challenging since it is required to address urgent and frequent changes in demand, due-dates of products, and initial machine status. In this thesis, we propose a scheduling framework based on deep reinforcement learning (RL) with self-supervision in which trained neural networks (NNs) are able to solve unseen scheduling problems without re-training even when such changes occur. Specifically, we propose state and action representations whose dimensions are independent of production requirements and due-dates of jobs while accommodating family setups. At the same time, an NN architecture with parameter sharing is utilized to improve training efficiency. Finally, we devise an additional self-supervised loss specific to the scheduling problem for training an NN scheduler that is robust to variations in the numbers of machines and jobs and in the distribution of production plans. We carried out extensive experiments on large-scale datasets that simulate a real-world wafer preparation facility and semiconductor packaging line. The results demonstrate that the proposed method outperforms recent metaheuristics, rule-based methods, and other RL-based methods in terms of schedule quality and the computation time needed to obtain a schedule. Besides, we investigated the individual contributions of the state representation, parameter sharing, and self-supervision to the performance improvements.
    Contents: Chapter 1 Introduction 1 (1.1 Research motivation and background 1; 1.2 Research objectives and contributions 4; 1.3 Thesis organization 6); Chapter 2 Background 7 (2.1 Scheduling under due-date constraints with sequence-dependent setups 7: 2.1.1 Scheduling under due-date constraints 7, 2.1.2 Parallel-machine scheduling with family setups 8, 2.1.3 Job-shop scheduling with setup constraints 9; 2.2 Reinforcement learning based scheduling 12: 2.2.1 Theoretical background 12, 2.2.2 Manufacturing line scheduling with reinforcement learning 13, 2.2.3 Deep reinforcement learning for scheduling problems 15; 2.3 Deep reinforcement learning with self-supervision 19); Chapter 3 Problem definition 22 (3.1 Parallel-machine scheduling problem 22: 3.1.1 Parallel-machine scheduling to minimize tardiness 22, 3.1.2 Mixed-integer programming model 24, 3.1.3 Example process 25; 3.2 Job-shop scheduling problem 26: 3.2.1 Flexible job-shop scheduling to maximize input volume 26, 3.2.2 Example process 27); Chapter 4 Parallel-machine scheduling with self-supervised deep reinforcement learning 31 (4.1 MDP model 31: 4.1.1 Action definition 31, 4.1.2 State representation 32, 4.1.3 Reward definition 37, 4.1.4 State transition 38, 4.1.5 Example 39; 4.2 Neural network training 41: 4.2.1 Network architecture 41, 4.2.2 Loss function 42, 4.2.3 DQN training procedure 43, 4.2.4 DQN evaluation procedure 44; 4.3 Self-supervision for scheduling 46: 4.3.1 Intrinsic reward design 46, 4.3.2 Preference score design for setup scheduling 47; 4.4 DQN training with self-supervision 49: 4.4.1 Self-supervised loss function 49, 4.4.2 Training procedure 50); Chapter 5 Job-shop scheduling with self-supervised deep reinforcement learning 53 (5.1 Scheduling framework 53: 5.1.1 Bottleneck process definition 53, 5.1.2 Dispatching rules 54, 5.1.3 Discrete-event simulator 55, 5.1.4 Scheduler training 56; 5.2 Input policy and self-supervision 58; 5.3 MDP model modification 59: 5.3.1 Action definition 59, 5.3.2 State representation 59, 5.3.3 Reward definition 61); Chapter 6 Experiments and results 62 (6.1 Parallel-machine scheduling 62: 6.1.1 Dataset 62, 6.1.2 Experimental settings 64, 6.1.3 Total tardiness comparison 67, 6.1.4 Comparison by state representation 72; 6.2 Job-shop scheduling 74: 6.2.1 Dataset 74, 6.2.2 Experimental settings 75, 6.2.3 Input volume comparison 77, 6.2.4 Comparison by action definition 80; 6.3 Effects of self-supervision 84: 6.3.1 Dataset 84, 6.3.2 Experimental settings 86, 6.3.3 Effect of self-supervision with and without parameter sharing 87, 6.3.4 Evaluation on datasets different from training 91); Chapter 7 Conclusion and future research directions 96 (7.1 Conclusion 96; 7.2 Future research directions 98); References 100; Abstract 118; Acknowledgements 120.
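A minimal sketch of the two ideas highlighted in the abstract: a shared scorer whose parameter count does not depend on how many machines or jobs appear, and a DQN loss extended with a self-supervised term that pulls Q-values toward a scheduling-specific preference score. All names, dimensions, and the auxiliary weight are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedCandidateScorer(nn.Module):
    """Scores every (machine, job-family) candidate with one shared MLP, so the same
    parameters apply no matter how many machines or jobs appear at test time."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, candidate_feats: torch.Tensor) -> torch.Tensor:
        # candidate_feats: (n_candidates, feat_dim) -> one Q-value per candidate
        return self.scorer(candidate_feats).squeeze(-1)

def loss_with_self_supervision(q_pred: torch.Tensor, td_target: torch.Tensor,
                               preference_score: torch.Tensor,
                               aux_weight: float = 0.1) -> torch.Tensor:
    """DQN TD loss plus an auxiliary self-supervised term that pulls each candidate's
    Q-value toward a scheduling-specific preference score."""
    return F.mse_loss(q_pred, td_target) + aux_weight * F.mse_loss(q_pred, preference_score)
```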

    An integration of neuroscience and computational reinforcement learning

    Doctoral dissertation -- Seoul National University Graduate School: College of Natural Sciences, Department of Brain and Cognitive Sciences, August 2021. Advisor: ๊น€ํƒ์™„.
    Introduction: Habit bias, resulting from imbalanced arbitration between goal-directed and habitual control, is thought to underlie the compulsive symptoms of patients with obsessive-compulsive disorder (OCD). A computational reinforcement learning (RL) model accounts for this arbitration: between the goal-directed (model-based; MB) and habitual (model-free; MF) RL systems, the brain allocates weight to the controller with higher reliability in state or reward prediction. However, it remains unclear whether the impaired arbitration in OCD is attributable to faulty estimation of the reliability of the RL systems and whether the inferior frontal gyrus (IFG) and/or frontopolar cortex (FPC), known to track the reliability signals, underlie this impairment. Methods: A sequential two-choice Markov decision task was used to dissociate the MB and MF learning strategies. Thirty patients with OCD and thirty-one healthy controls (HCs) underwent an fMRI scan while performing the behavioral task. Behaviors of the arbitration process were estimated through a computational model based on RL algorithms. The model parameters and their neural estimates were compared between groups. Regression analyses were conducted to examine whether neural differences explained faulty estimation of the reliability, in addition to compulsion severity, in OCD. Results: Patients with OCD earned less reward and showed higher perseveration than HCs. During MB-favored trials, the uncertainty of prediction based on the MF strategy was lower in patients, which led to higher maximum reliability of the RL systems arbitrating behaviors (i.e., stability of the arbitration) and a higher probability of choosing the MF strategy. The higher stability of the arbitration was associated with hyperactive signals of the lateral orbitofrontal cortex (OFC)/FPC in patients. Patients showed increased connectivity strength between the OFC/FPC and precuneus when choosing an action strategy. On the other hand, the hyperactive IFG signal was inversely associated with the stability of the arbitration and with compulsion severity in patients. Conclusions: It was demonstrated that hyperactive neural arbitrators encoding an excessively stable arbitration, in which MF reliability was predominant, underlay the imbalanced arbitration in OCD. The findings therefore suggest the IFG and FPC as brain biomarkers useful for planning neurocircuit-based treatments for the habit biases and compulsions of OCD.
    Contents: Background 1 (Clinical characteristics of obsessive-compulsive disorder 1; Theoretical models for OCD symptomatology 3; Neurocircuitry mechanisms of OCD 4; Treatment strategies and unsatisfactory responses in patients with OCD 7; Current issues to be addressed in developing neurobiological evidence-based treatments for OCD 8); Chapter 1 Reliability-based competition between model-based and model-free learning strategies in OCD 11 (Introduction 12; Methods 15; Results 26; Discussion 35); Chapter 2 Aberrant neural arbitrators underlying the imbalanced arbitration between decision-making strategies in OCD 37 (Introduction 38; Methods 40; Results 45; Discussion 55); General Discussion 57; References 62; Abstract in Korean 74.
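A minimal sketch of the reliability-based arbitration idea the abstract describes: each system's reliability is tracked from its recent prediction errors, and the probability of handing control to the model-based system grows with its relative reliability. The update rule, decay, and softmax temperature are illustrative assumptions, not the fitted arbitration model.

```python
import numpy as np

def update_reliability(reliability: float, prediction_error: float, decay: float = 0.1) -> float:
    """A reliability signal that rises when recent prediction errors are small."""
    return (1.0 - decay) * reliability + decay * (1.0 - min(abs(prediction_error), 1.0))

def p_model_based(rel_mb: float, rel_mf: float, temperature: float = 5.0) -> float:
    """Probability of handing control to the model-based system, increasing with its
    reliability advantage over the model-free system."""
    return 1.0 / (1.0 + np.exp(-temperature * (rel_mb - rel_mf)))
```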

    ์„ธ๊ทธ๋จผํŠธ ๊ต์ฒด ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ•œ ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ABR ์•Œ๊ณ ๋ฆฌ์ฆ˜

    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2021. 2. ๊น€์ข…๊ถŒ.์ ์‘ํ˜• ๋น„ํŠธ๋ ˆ์ดํŠธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์˜จ๋ผ์ธ ๋น„๋””์˜ค ์„œ๋น„์Šค์˜ ์žฌ์ƒ ํ’ˆ์งˆ, ์ฆ‰ ์‚ฌ์šฉ์ž ์ฒด๊ฐ ํ’ˆ์งˆ์„ ์˜ฌ๋ฆฌ๊ธฐ ์œ„ํ•˜์—ฌ ์‚ฌ์šฉ๋˜๋Š” ๋Œ€ํ‘œ์  ๊ธฐ์ˆ  ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ์ง€๊ธˆ๊นŒ์ง€ ์ ์‘ํ˜• ๋น„ํŠธ๋ ˆ์ดํŠธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‹ค์–‘ํ•œ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ ์‚ฌ์šฉ์ž ์ฒด๊ฐ ํ’ˆ์งˆ์„ ์ตœ์ ํ™”ํ•˜์˜€๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋Œ€๋ถ€๋ถ„์˜ ์ ์‘ํ˜• ๋น„ํŠธ๋ ˆ์ดํŠธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๊ณตํ†ต๋œ ํ•œ๊ณ„์ ์„ ์ง€๋‹Œ๋‹ค. ์‚ฌ์šฉ์ž ์ฒด๊ฐ ํ’ˆ์งˆ์„ ์ตœ์ ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋‹จ์ˆœํžˆ ๋‹ค์Œ์œผ๋กœ ๋‹ค์šด๋กœ๋“œ ํ•ด์•ผํ•˜๋Š” ์„ธ๊ทธ๋จผํŠธ์˜ ๋น„ํŠธ๋ ˆ์ดํŠธ๋งŒ์„ ๊ฒฐ์ •ํ•œ๋‹ค๋Š” ์ ์ด ๊ทธ ํ•œ๊ณ„์ ์œผ๋กœ, ์ด๋Ÿฌํ•œ ์œ ํ˜•์— ์†ํ•˜๋Š” ์ ์‘ํ˜• ๋น„ํŠธ๋ ˆ์ดํŠธ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์€ ๋ณ€ํ™”ํ•˜๋Š” ๋„คํŠธ์›Œํฌ ํ™˜๊ฒฝ์— ๋งž์ถฐ ์•ž์œผ๋กœ ๋‹ค์šด๋กœ๋“œํ•  ์„ธ๊ทธ๋จผํŠธ์˜ ๋น„ํŠธ๋ ˆ์ดํŠธ๋Š” ์ตœ์ ์œผ๋กœ ์กฐ์ •ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ ์ด๋ฏธ ๋‹ค์šด๋กœ๋“œํ•œ ์„ธ๊ทธ๋จผํŠธ์— ๋Œ€ํ•ด์„  ์–ด๋– ํ•œ ์ตœ์ ํ™”๋„ ์ง„ํ–‰ํ•  ์ˆ˜ ์—†๋‹ค. ๊ทธ๋ ‡๊ธฐ์— ์‚ฌ์šฉ์ž์˜ ๋„คํŠธ์›Œํฌ ํ™˜๊ฒฝ์ด ๊ทน๋‹จ์ ์œผ๋กœ ๊ฐœ์„ ๋˜๋”๋ผ๋„ ์ด์— ๋Œ€ํ•œ ํ™œ์šฉ๋„๊ฐ€ ๋–จ์–ด์ง„๋‹ค. ์ด๋Ÿฌํ•œ ํ•œ๊ณ„์ ์„ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด ์šฐ๋ฆฌ๋Š” LAWS ๊ธฐ๋ฒ•, ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ์„ธ๊ทธ๋จผํŠธ ๊ต์ฒด ์ „๋žต์„ ํฌํ•จํ•œ ์ ์‘ํ˜• ๋น„ํŠธ๋ ˆ์ดํŠธ ์•Œ๊ณ ๋ฆฌ์ฆ˜, ์„ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆ ๋ชจ๋ธ์€ ์‚ฌ์šฉ์ž์˜ ๋„คํŠธ์›Œํฌ ํ™˜๊ฒฝ ๋“ฑ์— ๋”ฐ๋ผ์„œ ๋” ๋‚˜์€ ๋น„ํŠธ๋ ˆ์ดํŠธ๋กœ ์„ธ๊ทธ๋จผํŠธ๋ฅผ ๊ต์ฒดํ•  ์ˆ˜ ์žˆ๋‹ค. ์ œ์•ˆ ๊ธฐ๋ฒ•์„ ์‹คํ˜„ํ•˜๊ธฐ ์œ„ํ•ด ์šฐ๋ฆฌ๋Š” ์ƒˆ๋กœ์šด ํ˜•ํƒœ์˜ ๋ฆฌ์›Œ๋“œ๋ฅผ ๋””์ž์ธํ•œ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์ œ์•ˆ ๊ธฐ๋ฒ•์€ ์„ธ๊ทธ๋จผํŠธ ๊ต์ฒด ์ „๋žต์„ ํฌํ•จํ•œ ํ˜•ํƒœ๋กœ ์‚ฌ์šฉ์ž ์ฒด๊ฐ ํ’ˆ์งˆ์„ ์ตœ์ ํ™”ํ•  ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ ์„ธ๊ทธ๋จผํŠธ ๊ต์ฒด ์ „๋žต์„ ํฌํ•จํ•จ์— ๋”ฐ๋ผ ์ฆ๊ฐ€ํ•˜๋Š” ๋ฌธ์ œ์˜ ๋ณต์žก๋„์— ๋Œ€์‘ํ•˜๊ธฐ ์œ„ํ•ด ๊ทœ์น™ ๊ธฐ๋ฐ˜ ํ–‰๋™ ์ œ์•ฝ ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์˜ ํ•™์Šต์„ ์›ํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ์œ ๋„ํ•œ๋‹ค. ์šฐ๋ฆฌ๋Š” ์ตœ์ข…์ ์œผ๋กœ ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ์ ์‘ํ˜• ๋น„ํŠธ๋ ˆ์ดํŠธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ œ์•ˆํ•œ๋‹ค. ๋„คํŠธ์›Œํฌ ํŠธ๋ ˆ์ด์Šค๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์‹ค์‹œํ•œ ์‹คํ—˜์—์„œ๋Š” ์ œ์•ˆ ๊ธฐ๋ฒ•์ด ๊ธฐ์กด์˜ ๊ธฐ๋ฒ•๋“ค์— ๋น„ํ•ด ์‚ฌ์šฉ์ž ์ฒด๊ฐ ํ’ˆ์งˆ์„ 13.1%๊นŒ์ง€ ๊ฐœ์„ ์‹œํ‚ค๋Š” ๊ฒƒ์œผ๋กœ ํ™•์ธ๋๋‹คAdaptive bitrate (ABR) algorithm is one of the representative techniques used to optimize the playback quality of online video services, namely Quality of Experience (QoE). So far, ABR algorithms based on various optimization techniques have optimized QoE. However, most of the ABR algorithms proposed to date have common limitations; the range of options for optimization. Currently, most ABR algorithms only determine the bit rate of the next segment for QoE optimization. This type of ABR algorithm can optimize the bit rate of a segment to be downloaded in the future in a dynamic network environment. However, it is not possible to optimize any segment previously downloaded, so the changed network environment cannot be utilized to the maximum. To overcome this limitation, we propose LAWS, learning based ABR algorithm with segment replacement. LAWS can be replaced with a better bit rate, even for previously downloaded segments, in conditions such as an improved network environment. First for this, we design a novel form of reward for optimization, including segment replacement. Through this, QoE, the optimization objective of the ABR algorithm, can be optimized in the form of segment replacement. 
In addition, we propose a rule-based learning method to solve the challenges arising in the model learning process. We finally propose an ABR algorithm with segment replacement based on deep reinforcement learning. Experiments based on network traces show that the newly proposed technique has a QoE improvement of 13.1% compared to the existing ABR techniques.I. Introduction 1 II. Related Work 4 2.1 DASH 4 2.2 Adaptive BitRate Algorithm 6 III. Motivation and Approach 9 3.1 Motivation 9 3.2 Approach 11 IV. Neural ABR algorithm with Segment Replacement 13 4.1 Action 15 4.2 State 15 4.3 Reward 18 4.4 Rule based learning 26 4.5 Implementation 27 V. Experiments 28 5.1 Experiment Setup 28 5.2 Baselines 29 5.3 Comparison with Existing ABR algorithms 33 5.4 Analyze Replacement Characteristics 35 5.5 Comparison Between Learning Based Algorithms 35 VI. Conclusion 37Maste
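A minimal sketch of how a replacement-aware QoE reward and a rule-based action constraint could look, consistent with the abstract's description. The penalty coefficients, the buffer threshold, and the Action fields are illustrative assumptions, not values from the thesis.

```python
from dataclasses import dataclass

@dataclass
class Action:
    bitrate_kbps: int
    is_replacement: bool   # True = re-download an earlier segment at a higher bitrate

def qoe_reward(bitrate: float, rebuffer_s: float, prev_bitrate: float,
               replaced_gain: float = 0.0,
               rebuf_penalty: float = 4.3, smooth_penalty: float = 1.0) -> float:
    """QoE-style reward: quality, minus rebuffering and bitrate-switch penalties,
    plus the quality gained by replacing an already-downloaded segment."""
    return (bitrate + replaced_gain
            - rebuf_penalty * rebuffer_s
            - smooth_penalty * abs(bitrate - prev_bitrate))

def feasible_actions(actions: list, buffer_s: float, min_buffer_for_replace: float = 8.0) -> list:
    """Rule-based constraint: drop replacement actions when the playback buffer is low,
    so the agent never risks a stall just to re-download an old segment."""
    return [a for a in actions
            if not (a.is_replacement and buffer_s < min_buffer_for_replace)]
```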

    ๊ฐ•ํ™”ํ•™์Šต์„ ์ ์šฉํ•œ ์‹ค์šฉ์ ์ธ ๊ฑด๋ฌผ ์‹œ์Šคํ…œ ์ œ์–ด

    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ๊ฑด์ถ•ํ•™๊ณผ, 2021.8. ์กฐ์„ฑ๊ถŒ.HVAC ๋ฐ ์กฐ๋ช…๊ณผ ๊ฐ™์€ ๊ธฐ์กด ์‹œ์Šคํ…œ๊ณผ ๊ฐ„ํ—์  ์žฌ์ƒ ์—๋„ˆ์ง€, ์—๋„ˆ์ง€ ์ €์žฅ ์‹œ์Šคํ…œ ๋“ฑ๊ณผ ๊ฐ™์€ ์ƒˆ๋กœ์šด ์‹œ์Šคํ…œ์—๋„ ๋Œ€์‘ํ•ด์•ผ ํ•˜๋ฏ€๋กœ ํ˜„๋Œ€ ๊ฑด๋ฌผ ์‹œ์Šคํ…œ ์ œ์–ด๋Š” ๋ณต์žกํ•ด์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด์— ๋”ฐ๋ผ, ๊ฑด๋ฌผ ์‹œ์Šคํ…œ ์ œ์–ด๊ธฐ๋Š” ๊ฑด๋ฌผ์˜ ๋™์  ๊ฑฐ๋™์— ์Šค์Šค๋กœ ์ ์‘ํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•˜๊ณ  ๋‹ค๋ชฉ์  ์ตœ์ ํ™” ๊ฒฐ๊ณผ๋ฅผ ๋ฐ˜์˜ํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•œ๋‹ค. ๊ฐ•ํ™”ํ•™์Šต (reinforcement learning, RL)์„ ์‚ฌ์šฉํ•˜์—ฌ ์ „์ˆ ๋œ ๊ฑด๋ฌผ ์ œ์–ด๊ธฐ์˜ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์€ ๋„๋ฆฌ ์•Œ๋ ค์ ธ ์žˆ์ง€๋งŒ, RL์„ ์‹ค์ œ ๊ฑด๋ฌผ์— ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ํ•ด๊ฒฐํ•ด์•ผ ํ•  ๊ณผ์ œ๋“ค์ด ์žˆ๋‹ค: (1) RL์˜ ์ดˆ๊ธฐ ํ›ˆ๋ จ ๊ธฐ๊ฐ„ ๋™์•ˆ ๋ถˆ์•ˆ์ •ํ•œ ์ œ์–ด๋Š” ์˜ˆ์ƒ์น˜ ๋ชปํ•œ ๋น„์šฉ์„ ์•ผ๊ธฐํ•  ์ˆ˜ ์žˆ๋‹ค. (2) ์—ฌ์ „ํžˆ ๋Œ€๋ถ€๋ถ„์˜ RL ๊ธฐ๋ฐ˜ ์ œ์–ด ์ „๋žต์€ ์ผ์ƒ์  ์‹ค๋ฌด์— ์ ์šฉํ•˜๊ธฐ์—๋Š” ์‹œ์„ค ๊ด€๋ฆฌ์ž ์ž…์žฅ์—์„œ ์ดํ•ดํ•˜๊ธฐ ์–ด๋ ต๊ณ  ์ œ์–ด ์ „๋žต์— ๋Œ€ํ•œ ํ•ด์„์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์—†๋‹ค. RL ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฑด๋ฌผ ์ œ์–ด์— ์ ์šฉํ•œ๋‹ค๋Š” ๊ฒƒ์€ ์˜์‚ฌ๊ฒฐ์ •์˜ ์ฃผ์ฒด๊ฐ€ ์ธ๊ณต์ง€๋Šฅ์ด ๋œ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•œ๋‹ค. ์ด๋•Œ, ๊ฑด๋ฌผ์˜ ์†Œ์œ ์ฃผ์™€ ์šด์˜์ž๋Š” ์ธ๊ณต์ง€๋Šฅ ๊ธฐ๋ฐ˜ ๊ฑด๋ฌผ ์ œ์–ด๊ธฐ์˜ ์˜๋„ ๋ฐ ์˜์‚ฌ๊ฒฐ์ • ๊ณผ์ •์— ๋Œ€ํ•œ ํ•ด์„ ๋ฐ ์ดํ•ด๋ฅผ ํ•  ํ•„์š”๊ฐ€ ์žˆ๋‹ค. ์ฒซ ๋ฒˆ์งธ ๊ณผ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด, RL ์—์ด์ „ํŠธ๋ฅผ ์‚ฌ์ „ ํ•™์Šตํ•˜๊ณ  ์ด๋ฅผ ์œ„ํ•ด ์ƒˆ๋กœ์šด ๊ฐœ๋…์˜ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ๋ชจ๋ธ์ธ ์—ฐํ•ฉ ๋ชจ๋ธ์ด ์ œ์•ˆ๋œ๋‹ค. ์—ฐํ•ฉ ๋ชจ๋ธ์€ ๋นŒ๋”ฉ ์‹œ์Šคํ…œ์„ ๋ฌผ๋ฆฌ์  ์ธ๊ณผ ๊ด€๊ณ„์— ๋”ฐ๋ผ ๋ชจ๋“ˆ๋กœ ๋‚˜๋ˆ„๊ณ  ๊ฐ ๋ชจ๋“ˆ์„ ๋ฐ์ดํ„ฐ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋กœ ๊ฐœ๋ฐœํ•˜์—ฌ ๋นŒ๋”ฉ ์‹œ์Šคํ…œ์— ๋Œ€ํ•œ ์‹œ๋ฎฌ๋ ˆ์ด์…˜์„ ์ˆ˜ํ–‰ํ•˜๋Š” ํ†ตํ•ฉ ๋ฐ์ดํ„ฐ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์ด๋‹ค. ๋Œ€์ƒ ๊ฑด๋ฌผ์˜ ๋ƒ‰๋ฐฉ ์‹œ์Šคํ…œ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ๋ชจ๋ธ์€ 6๊ฐœ์˜ ๋ชจ๋“ˆ๋กœ ๊ตฌ์„ฑ๋˜๊ณ  ๊ฐ ๋ชจ๋“ˆ์€ BEMS์—์„œ ์ˆ˜์ง‘๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐœ๋ฐœ๋œ๋‹ค. ์—ฐํ•ฉ ๋ชจ๋ธ์€ ์ œ1๋ฒ•์น™ ๊ธฐ๋ฐ˜ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ๋ชจ๋ธ์˜ ํ•œ๊ณ„ (์˜ˆ: ์œ„์ƒ ๊ทœ์น™, ๋ชจ๋ธ ๋ณด์ •)๋ฅผ ๊ทน๋ณตํ•  ์ˆ˜ ์žˆ๋‹ค. Deep Q-Network (DQN)์€ ๋ƒ‰๋ฐฉ ์‹œ์Šคํ…œ์˜ ๋™์  ๊ฑฐ๋™์„ ํ•™์Šตํ•˜๊ณ  ๊ฑด๋ฌผ์— ๋ƒ‰๋ฐฉ์„ ๊ณต๊ธ‰ํ•˜๋Š” ๋™์‹œ์— ์—๋„ˆ์ง€ ์‚ฌ์šฉ์„ ์ค„์ผ ์ˆ˜ ์žˆ๋Š” ์ œ์–ด ์ „๋žต์„ ๋ชจ์ƒ‰ํ•˜๋Š” ๋ฐ ์ ์šฉ๋œ๋‹ค. DQN์˜ ์ œ์–ด ์„ฑ๋Šฅ์„ ํ˜„์žฌ ๊ฑด๋ฌผ ์šด์˜์ž๋“ค์ด ์ ์šฉํ•˜๋Š” ๊ธฐ์กด ์ œ์–ด ์„ฑ๋Šฅ๊ณผ ๋น„๊ตํ•จ์œผ๋กœ์จ RL ์ œ์–ด๊ธฐ๊ฐ€ ์‹œ์Šคํ…œ์˜ ์ œ์–ด ํšจ์œจ์„ฑ์„ ํฌ๊ฒŒ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ ์—ฐํ•ฉ ๋ชจ๋ธ์€ ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜ ์ œ์–ด๊ธฐ์˜ ํ•™์Šต์„ ์œ„ํ•œ ๊ฐ€์ƒ ํ™˜๊ฒฝ์„ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ์Œ์„ ์ฆ๋ช…ํ•œ๋‹ค. DQN ์—์ด์ „ํŠธ์˜ ํ•ด์„์„ฑ์„ ๋†’์ด๊ธฐ ์œ„ํ•ด ์˜์‚ฌ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์—์ด์ „ํŠธ์˜ ์˜์‚ฌ๊ฒฐ์ • ํ”„๋กœ์„ธ์Šค์— ๋Œ€ํ•œ ์„ค๋ช…์„ ์ถ”์ถœํ•œ๋‹ค. ์—์ด์ „ํŠธ์—์„œ ์ƒ์„ฑ๋œ ์ƒํƒœ-์ž‘์—… (state-action) ์Œ์ด ์˜์‚ฌ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ํ›ˆ๋ จํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋œ๋‹ค. ์–•์ง€๋งŒ ์‰ฝ๊ฒŒ ํ•ด์„ํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•œ ์‚ฌํ›„ ํ•ด์„์€ ๊ฐ•ํ™” ํ•™์Šต์˜ ํˆฌ๋ช…์„ฑ๊ณผ ํ•ด์„์„ฑ์„ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค. ๋˜ํ•œ ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด๊ฐ€ ๋งŒ๋“  ๋ถ„๋ฅ˜ ๊ฒฐ๊ณผ๋Š” ์ธ๊ณต์ง€๋Šฅ์ด ๋งŒ๋“  ์ œ์–ด ์ „๋žต์„ ๋‹จ์ˆœํ™”์‹œํ‚จ 'If-then' ๊ทœ์น™์„ ๋„์ถœํ•œ๋‹ค. ์ถ”์ถœ๋œ ๊ทœ์น™ (reduced rule) ๊ธฐ๋ฐ˜ ์ œ์–ด์˜ ์„ฑ๋Šฅ๊ณผ DQN ์ œ์–ด๊ธฐ์˜ ์„ฑ๋Šฅ์„ ๋น„๊ตํ•˜์—ฌ ๋‘ ์ œ์–ด๊ธฐ ์‚ฌ์ด์˜ ์—๋„ˆ์ง€ ์ ˆ์•ฝ๋Ÿ‰ ์ฐจ์ด๊ฐ€ 2.8%๋กœ ๋ฏธ๋ฏธํ•จ์„ ๋ณด์ธ๋‹ค. ์ฆ‰, ๊ทœ์น™ ๊ธฐ๋ฐ˜ ์ œ์–ด๊ฐ€ ์ถฉ๋ถ„ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค๋Š” ๊ฒƒ์„ ์ฆ๋ช…ํ•œ๋‹ค. ๋ณธ ์—ฐ๊ตฌ๋Š” ๊ธฐ์ถ• ์‚ฌ๋ฌด์‹ค ๊ฑด๋ฌผ์˜ ๋ƒ‰๋ฐฉ ์ œ์–ด๋ฅผ ์œ„ํ•œ ์„ค๋ช… ๊ฐ€๋Šฅํ•œ RL์˜ ์ ์šฉ ๋ฐฉ์•ˆ์— ๋Œ€ํ•ด ์ˆ˜ํ–‰๋œ๋‹ค. 
์˜์‚ฌ ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ํ›ˆ๋ จ๋œ DQN ์—์ด์ „ํŠธ์— ์ ์šฉํ•œ ๋‹ค์Œ ์ผ๋ จ์˜ ๋‹จ์ˆœํ™”๋œ ์ œ์–ด ๊ทœ์น™์„ ๋„์ถœํ•œ๋‹ค. ์ด ์—ฐ๊ตฌ๋Š” ์„ค๋ช… ๊ฐ€๋Šฅํ•œ ๊ฐ•ํ™”ํ•™์Šต์„ ์ด์šฉํ•œ ์ •๋Ÿ‰ํ™”๋œ ๊ทœ์น™ ๋„์ถœ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ์•ˆํ•˜๊ณ , ๋ณต์žกํ•œ ๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ๋น„๊ตํ•˜์—ฌ ๋‹จ์ˆœํ•˜์ง€๋งŒ ์ •๋Ÿ‰์ ์ธ ํ‰๊ฐ€๊ฐ€ ์ˆ˜ํ–‰๋œ ๊ทœ์น™์ด ์ถฉ๋ถ„ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค„ ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค€๋‹ค. ์ด ์—ฐ๊ตฌ์˜ ์˜์˜๋Š” ๊ฑด๋ฌผ ํ†ต์ œ์— ๋Œ€ํ•œ ์ •๋Ÿ‰์  ํ‰๊ฐ€๋ฅผ ํ†ตํ•ด ๊ทœ์น™์„ ๋„์ถœํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•˜๋Š” ๋ฐ ์žˆ๋‹ค.Building controls are becoming complicated because modern building systems must respond to not only conventional systems like HVAC and lighting, but also to novel systems such as intermittent renewables, energy storage systems, and more. Therefore, the advanced building controllers must balance the trade-off between multiple objectives and automatically adapt to dynamic environment. Although it is widely acknowledged that reinforcement learning (RL) can be beneficially used for better building control, there are several challenges that should be addressed for real life application of RL: (1) unstable and poor control actions during early training period of RL may cause unexpected costs; (2) many RL-based control actions still remain unexplainable for daily practice of facility managers. By applying RL algorithms as artificial intelligences that are the subject of decision-making, owners and operators of buildings need to be reassured about the controllers intentions. To address the first challenge, federated model, a novel concept of simulation model, is proposed for pre-training RL agents. The federated model is an integrated data-driven model that divides a building system into several modules based on physical causality and develops each module into a data-driven model to perform simulations on building systems. A federated model of a complex cooling system of a target building is realized using six modules, each developed using data gathered from BEMS. By developing the federated model, limitations of physics-based simulation models (eg. topology rules, model calibration) are overcome. Deep Q-network (DQN) is applied to learn the dynamics of the cooling system and explore control strategies that can reduce energy use while providing cold for the building. By comparing the control performance of DQN with the performance of baseline control, it is shown that RL controller can significantly enhance control efficiency of the system and the federated model can provide sufficient virtual experience for the controller. To enhance interpretability of the DQN agent, decision tree is used to extract explanation of the decision making process of the agent. State-action pairs generated by the agent is used train a decision tree. Post-hoc interpretation using a shallow but easily interpretable model enhances transparency and interpretability of reinforcement learning. Also, the result of classification made by the decision tree provides If-then rules which are reduced version of control strategies made by the artificial intelligence. The performance of the reduced rule-based control is also compared to the performance of DQN controller. It is demonstrated that the reduced rule is good-enough and the difference in energy savings between the two is marginal, resulting in 2.8%. This study reports the development of explainable RL for cooling control of an existing office building. 
A decision tree is applied to trained DQN agent and then a set of reduced-order control rules are suggested. This study proposes rule reduction framework using explainable reinforcement learning and demonstrates that reduced rules can perform as well as complex reinforcement learning algorithms. The significance of this study lies in proposing how to derive rules with quantitative evaluation for building control.1. Introduction 1 1.1 Control of building systems 1 1.2 Problem Description 2 1.3 Goal 4 1.4 Thesis Outline 5 2. Deep Q-network (DQN) 7 2.1. Summary of reinforcement learning 7 2.1.1 Elements of reinforcement learning 7 2.1.2 Value function 9 2.2. Deep Q-learning 12 2.2.1 Temporal difference (TD) learning and Q-learning 12 2.2.2 Deep Q-learning 14 2.3. Previous works to implement reinforcement learning to existing buildings 16 2.4. Conclusion 19 3. Decision Trees 21 3.1 Summary of decision tree 21 3.2 Classification And Regression Trees (CART) 23 3.3 Interpreting reinforcement learning using decision tree 24 3.4 Conclusion 26 4. Target building and Federated model 27 4.1 Parallel cooling system 27 4.2 Federated model 31 5. Explainable deep Q-network and rule reduction for building control 40 5.1 DQN implementation framework 40 5.2 Control results of DQN 46 5.3 Rule reduction from DQN agent 50 5.4 Discussion 54 6. Conclusion 55 6.1 Summary and conclusion 55 6.2 Future works 57 Reference 58์„
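A minimal sketch of the rule-extraction step the abstract describes: a shallow decision tree is fit to state-action pairs logged from the trained DQN agent, and its branches are printed as if-then rules. The feature names, tree depth, and placeholder data are illustrative assumptions; in practice the states and actions would come from rollouts of the trained agent.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# States visited by the trained DQN agent and the actions it chose there
# (random placeholders here; in practice they come from rollouts of the trained agent).
states = np.random.rand(1000, 3)                # e.g. [outdoor_temp, cooling_load, supply_temp]
actions = np.random.randint(0, 2, size=1000)    # e.g. 0 = keep setpoint, 1 = raise setpoint

# A shallow tree keeps the extracted policy readable for facility managers.
tree = DecisionTreeClassifier(max_depth=3).fit(states, actions)
print(export_text(tree, feature_names=["outdoor_temp", "cooling_load", "supply_temp"]))
```

The depth bound is the knob that trades fidelity to the DQN policy against readability of the extracted rules.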

    Controller Indirect Learning Algorithm Using Experimental Implantation Technique

    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2017. 8. ์ด์ œํฌ.๋ฌผ๋ฆฌ ๊ธฐ๋ฐ˜ ์• ๋‹ˆ๋งค์ด์…˜์ด๋ž€ ๊ฐ€์ƒ์˜ ์บ๋ฆญํ„ฐ๋“ค์ด ๋ฌผ๋ฆฌ ๋ฒ•์น™์˜ ์ง€๋ฐฐ ํ•˜์—์„œ ์›€์ง์ด๋„๋ก ํ•˜๋Š” ๊ฒƒ์œผ๋กœ, ์›€์ง์ž„์— ํ˜„์‹ค์„ฑ์„ ๋ถ€์—ฌํ•จ์œผ๋กœ์จ ๋ณด๋Š” ์‚ฌ๋žŒ๋“ค๋กœ ํ•˜์—ฌ๊ธˆ ์ž์—ฐ์Šค๋Ÿฌ์šด ๋Š๋‚Œ์ด ๋“ค๊ฒŒ ํ•ด์ฃผ๋Š” ๊ธฐ๋ฒ•์ด๋‹ค. ํ˜„์žฌ ๊ฐ€์ƒ ์บ๋ฆญํ„ฐ์˜ ๋™์ž‘์„ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ€์žฅ ๋ณดํŽธ์ ์œผ๋กœ ์ด์šฉ๋˜๊ณ  ์žˆ๋Š” ๋ฐฉ๋ฒ•์€ ๋ชจ์…˜ ์บก์ณ ๊ธฐ๋ฒ•์ธ๋ฐ, ์ด ๋ฐฉ๋ฒ•์€ ํ˜„์‹ค์˜ ์‚ฌ๋žŒ์ด๋‚˜ ๋™๋ฌผ์ด ๋ฐฐ์šฐ๊ฐ€ ๋˜์–ด ์ง์ ‘ ์ดฌ์˜ํ•œ๋‹ค๋Š” ์ ์—์„œ ํ•„์—ฐ์ ์œผ๋กœ ๋ช‡ ๊ฐ€์ง€ ๋ฌผ๋ฆฌ์  ํ•œ๊ณ„๋ฅผ ๊ฐ–๋Š”๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ ๋‘ ๊ฐ€์ง€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ œ์•ˆํ•œ๋‹ค. ๋จผ์ € ์ฒซ ๋ฒˆ์งธ๋Š” ์›ํ•˜๋Š” ๋ฌผ๋ฆฌ ํ™˜๊ฒฝ๊ณผ ๊ฐ€์ƒ ์บ๋ฆญํ„ฐ๊ฐ€ ์žˆ์„ ๋•Œ, ์–ป๊ณ ์ž ํ•˜๋Š” ๋™์ž‘์˜ ์ข…๋ฅ˜์— ๋”ฐ๋ผ ์บ๋ฆญํ„ฐ์˜ ์›€์ง์ž„์— ๋Œ€ํ•œ ๋ณด์ƒ(reward) ์‹œ์Šคํ…œ๋งŒ ์ •ํ•ด์ฃผ๋ฉด ๊ฐ•ํ™”ํ•™์Šต์„ ํ†ตํ•ด ์ฃผ์–ด์ง„ ์กฐ๊ฑด์— ๋งž๋Š” ๋™์ž‘์„ ์ž๋™์œผ๋กœ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๋Š” ์ œ์–ด๊ธฐ๋ฅผ ํ•™์Šต์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ๋‘ ๋ฒˆ์งธ ์ œ์•ˆ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์ฒซ ๋ฒˆ์งธ์— ์ด์–ด์ง€๋Š” ๋‚ด์šฉ์œผ๋กœ, ์ฃผ์–ด์ง„ ํ™˜๊ฒฝ์—์„œ ์ž˜ ํ•™์Šต๋œ ๋™์ž‘ ์ œ์–ด๊ธฐ๋ฅผ ๊ฐ–๊ณ  ์žˆ์„ ๋•Œ, ํ˜•ํƒœ ๋ฐ ๊ตฌ์กฐ๋Š” ๋™์ผํ•˜์ง€๋งŒ ๋‹ค๋ฅธ ๋ฐฉ์‹์œผ๋กœ ํ™˜๊ฒฝ์„ ์ธ์‹ํ•˜๋Š” ๊ฐ€์ƒ ์บ๋ฆญํ„ฐ์˜ ์ œ์–ด๊ธฐ๋ฅผ ๋น ๋ฅด๊ฒŒ ํ•™์Šต์‹œํ‚ด์œผ๋กœ์จ ํ™˜๊ฒฝ ์ธ์‹ ์„ผ์„œ๋ฅผ ์ผ๋ฐ˜ํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ์‹คํ—˜์œผ๋กœ๋Š” ์žฅ์• ๋ฌผ์„ ํ”ผํ•ด ๋ชฉํ‘œ๋ฌผ๋กœ ๋น„ํ–‰ํ•˜๋Š” ๊ฐ€์ƒ ์บ๋ฆญํ„ฐ๋ฅผ ์ด์šฉํ•˜์—ฌ ์ด๋ฏธ ํ•™์Šต๋œ ์ œ์–ด๊ธฐ์˜ ๊ฒฝํ—˜์„ ํ†ตํ•ด ๊ฐ„์ ‘์ ์œผ๋กœ ํ•™์Šต๋œ ์ œ์–ด๊ธฐ์˜ ์„ฑ๋Šฅ์„ ๊ฒ€์ฆํ•˜์˜€๋‹ค.์ œ 1์žฅ ์„œ๋ก  1 ์ œ 2์žฅ ๊ด€๋ จ ์—ฐ๊ตฌ 5 2.1 ๋ฌผ๋ฆฌ ๊ธฐ๋ฐ˜ ์• ๋‹ˆ๋งค์ด์…˜ 5 2.2 ๊ฐ•ํ™”ํ•™์Šต์„ ์ด์šฉํ•œ ์ œ์–ด๊ธฐ ํ•™์Šต 7 ์ œ 3์žฅ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฐœ์š” 9 ์ œ 4์žฅ ์ดˆ๊ธฐ ์ตœ์ ํ™” ๊ถค์  ์ƒ์„ฑ 13 ์ œ 5์žฅ ์ง„ํ™”์  CACLA 17 ์ œ 6์žฅ ๊ฐ„์ ‘ ๊ฒฝํ—˜ ํ•™์Šต 20 ์ œ 7์žฅ ์‹คํ—˜ ๋ฐ ๊ฒฐ๊ณผ 24 ์ฐธ๊ณ ๋ฌธํ—Œ 27 Abstract 32Maste