83 research outputs found

    Enhancing Robustness of Deep Reinforcement Learning based Semiconductor Packaging Lines Scheduling with Regularized Training

    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ)--์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› :๊ณต๊ณผ๋Œ€ํ•™ ์‚ฐ์—…๊ณตํ•™๊ณผ,2019. 8. ๋ฐ•์ข…ํ—Œ.์ตœ๊ทผ ๊ณ ์„ฑ๋Šฅ ์ „์ž ์ œํ’ˆ์— ๋Œ€ํ•œ ์ˆ˜์š”๊ฐ€ ๋†’์•„์ง€๋ฉด์„œ ๋‹ค์ค‘ ์นฉ ์ œํ’ˆ ์ƒ์‚ฐ์„ ์ค‘์‹ฌ์œผ๋กœ ๋ฐ˜๋„์ฒด ์ œ์กฐ๊ณต์ •์ด ๋ฐœ์ „ํ•˜๊ณ  ์žˆ๋‹ค. ๋‹ค์ค‘ ์นฉ ์ œํ’ˆ์€ ํŒจํ‚ค์ง• ๋ผ์ธ์—์„œ ๊ณต์ •์„ ์—ฌ๋Ÿฌ ๋ฒˆ ๋ฐ˜๋ณตํ•˜๋Š” ์žฌ์œ ์ž…์ด ๋ฐœ์ƒํ•˜๊ฒŒ ๋˜๋ฉฐ, ๊ณต์ • ์„ค๋น„์˜ ์…‹์—… ๊ต์ฒด๊ฐ€ ๋นˆ๋ฒˆํžˆ ์ผ์œผํ‚ค๊ฒŒ ๋œ๋‹ค. ์ด๋Š” ๋ฐ˜๋„์ฒด ํŒจํ‚ค์ง• ๋ผ์ธ์˜ ์Šค์ผ€์ค„๋ง์„ ์–ด๋ ต๊ฒŒ ๋งŒ๋“œ๋Š” ์ฃผ์š”ํ•œ ์š”์†Œ์ด๋‹ค. ๋˜ํ•œ, ๋ฐ˜๋„์ฒด ํŒจํ‚ค์ง• ๋ผ์ธ์€ ์ œ์กฐ๊ณต์ • ๋‚ด,์™ธ์ ์œผ๋กœ ๋‹ค์–‘ํ•œ ๋ณ€๋™ ์‚ฌํ•ญ์— ์˜ํ•ด ์ƒ์‚ฐํ™˜๊ฒฝ์ด ๋นˆ๋ฒˆํžˆ ๋ณ€ํ™”ํ•˜๋ฉฐ, ์ œ์กฐ ํ˜„์žฅ์—์„œ๋Š” ์Šค์ผ€์ค„๋ง์„ ์œ„ํ•ด ์š”๊ตฌ๋˜๋Š” ๊ณ„์‚ฐ ์‹œ๊ฐ„์ด ๋งค์šฐ ์ค‘์š”ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์‹ ์†ํ•œ ์Šค์ผ€์ค„ ๋„์ถœ์ด ์š”๊ตฌ๋œ๋‹ค. ๋ฐ˜๋„์ฒด ํŒจํ‚ค์ง• ๋ผ์ธ์˜ ์Šค์ผ€์ค„๋ง ์—ฐ๊ตฌ๊ฐ€ ํ™œ๋ฐœํ•ด์ง€๋ฉด์„œ ์ „์—ญ ์ตœ์ ํ™”๋ฅผ ๋ชฉํ‘œ๋กœ ํ•˜๋Š” ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ์Šค์ผ€์ค„๋ง ์—ฐ๊ตฌ๊ฐ€ ๋Š˜์–ด๋‚˜๊ณ  ์žˆ๋‹ค. ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ๋ฐ˜๋„์ฒด ํŒจํ‚ค์ง• ๋ผ์ธ ์Šค์ผ€์ค„๋ง ์—ฐ๊ตฌ๋Š” ๊ทธ ํ™œ์šฉ ์ธก๋ฉด์—์„œ ๋‹ค์–‘ํ•œ ์ƒ์‚ฐํ™˜๊ฒฝ ๋ณ€ํ™”์— ๊ฐ•๊ฑดํžˆ ๋Œ€์‘ํ•˜๋ฉฐ, ์งง์€ ์‹œ๊ฐ„ ์•ˆ์— ์ข‹์€ ์Šค์ผ€์ค„์„ ์–ป์„ ์ˆ˜ ์žˆ์–ด์•ผ ํ•œ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ์Šค์ผ€์ค„๋ง ๋ชจ๋ธ์˜ ๊ฐ•๊ฑด์„ฑ ํ™•๋ณด๋ฅผ ๋ชฉํ‘œ๋กœ ํ•œ๋‹ค. ์ƒˆ๋กœ์šด ์ƒ์‚ฐํ™˜๊ฒฝ์ด ํ…Œ์ŠคํŠธ๋กœ ์ฃผ์–ด์กŒ์„ ๋•Œ, ์žฌํ•™์Šต์„ ์ˆ˜ํ–‰ํ•˜์ง€ ์•Š๊ณ  ์„ฑ๋Šฅ์˜ ํฐ ์ €ํ•˜์—†๋Š” ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜ ๋ฐ˜๋„์ฒด ํŒจํ‚ค์ง• ๋ผ์ธ ์Šค์ผ€์ค„๋ง์„ ์œ„ํ•œ ์ •๊ทœํ™” ํ•™์Šต๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ์œ ์—ฐ ์žก์ƒต ์Šค์ผ€์ค„๋ง ๋ฌธ์ œ์— ๊ฐ•ํ™”ํ•™์Šต์„ ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด ์ „์ฒด ๊ณต์ • ์ƒํ™ฉ์„ ๊ณ ๋ คํ•œ ์ƒํƒœ์™€ ํ–‰๋™, ๋ณด์ƒ์„ ์„ค๊ณ„ํ•˜์˜€๊ณ , ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต์˜ ๋Œ€ํ‘œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ธ ์‹ฌ์ธต Q ๋„คํŠธ์›Œํฌ๋ฅผ ์ด์šฉํ•˜์—ฌ ์Šค์ผ€์ค„๋ง ๋ฌธ์ œ๋ฅผ ํ•™์Šตํ•˜์˜€๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ ์ œ์•ˆํ•˜๋Š” ์ •๊ทœํ™” ํ•™์Šต ๊ธฐ๋ฒ•์€ 4๋‹จ๊ณ„๋กœ ๋‚˜๋ˆ„์–ด ๊ฐ ๋‹จ๊ณ„์—์„œ ์—ฌ๋Ÿฌ ์ƒ์‚ฐํ™˜๊ฒฝ ๋ณ€ํ™”๊ฐ€ ๋ฐ˜์˜๋œ ๋ฌธ์ œ์˜ ์ผ๋ฐ˜์„ฑ๊ณผ ๊ฐ ๋ฌธ์ œ์˜ ํŠน์ˆ˜์„ฑ์„ ํ•™์Šตํ•˜๋„๋ก ์„ค๊ณ„ํ•˜์˜€๋‹ค. ์„œ๋กœ ๋‹ค๋ฅธ ๋ณต์žก๋„์˜ ์Šค์ผ€์ค„๋ง ๋ฌธ์ œ๋ฅผ ์ด์šฉํ•˜์—ฌ ์‹คํ—˜์„ ์ง„ํ–‰ํ•˜์˜€์œผ๋ฉฐ, ๋ฃฐ ๊ธฐ๋ฐ˜ ๋ฐ ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ๋‹ค๋ฅธ ์Šค์ผ€์ค„๋ง ๋ชจ๋ธ์— ๋น„ํ•ด ๋Œ€์ฒด์ ์œผ๋กœ ์„ฑ๋Šฅ์˜ ์šฐ์ˆ˜ํ•จ์„ ๊ฒ€์ฆํ•˜์˜€๋‹ค. ๋ณธ ์—ฐ๊ตฌ๋Š” ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ์Šค์ผ€์ค„๋ง ์—ฐ๊ตฌ์—์„œ ๋ชจ๋ธ์˜ ๊ฐ•๊ฑด์„ฑ์— ์—ฐ๊ตฌ์˜ ์ดˆ์ ์„ ๋งž์ถ˜ ์ฒซ ์—ฐ๊ตฌ์ด๋ฉฐ, ๋ณธ ์—ฐ๊ตฌ์˜ ๊ฒฐ๊ณผ๋Š” ์‹ค์ œ ๊ณต์žฅ์—์„œ ์—ฐ๊ตฌ์˜ ํ™œ์šฉ์„ฑ์„ ํ•œ์ธต ๋†’์—ฌ์ค€ ์—ฐ๊ตฌ์ด๋‹ค.As the demand for high-performance electronic devices has increased, the semiconductor manufacturing process is being developed centering on the production of multi-chip products. In multi-chip products, re-entrance occurs by repeating the process several times in the packaging line, and the setup change of equipment is frequently incurred. These are major factors that make the scheduling of the semiconductor packaging line difficult. The production environment frequently changes due to internal and external variabilities. In addition, since the calculation time required for scheduling is very important at the manufacturing site, prompt schedule generation is required. As the research of the semiconductor packaging line scheduling becomes active, the reinforcement learning based scheduling research aiming at the global optimization is increasing. 
In view of the utilization of scheduling research based on reinforcement learning, there is a need for a method capable of reacting to various production environment changes and obtaining a good schedule in a short time. This study aims at obtaining the robustness of the scheduling model based on deep reinforcement learning. We propose a regularzied training method for semiconductor packaging lines scheduling based on deep reinforcement learning without performance degradation and re-training when a new production environment is given as a test data. In order to apply reinforcement learning to flexible job-shop scheduling problem, we designed state, action and reward considering overall process and trained deep Q network which is a representative algorithm of deep reinforcement learning. The regularzied training method proposed in this study is divided into four stages and designed to train the generalities of the problems reflected in various production environment and the specificity of each problem. Experiments were conducted using scheduling problems of different complexity, and it was verified that the performance was superior to other scheduling models based on rule-based and deep reinforcement learning. This study is the first research that focuses on the robustness of the model in the reinforcement learning based scheduling. Moreover, the result of this study enhances the practicality of research in real factory application.์ดˆ๋ก ๋ชฉ์ฐจ ํ‘œ ๋ชฉ์ฐจ ๊ทธ๋ฆผ ๋ชฉ์ฐจ ์ œ 1 ์žฅ ์„œ๋ก  1.1 ์—ฐ๊ตฌ ๋ฐฐ๊ฒฝ ๋ฐ ๋™๊ธฐ 1.2 ์—ฐ๊ตฌ ๋ชฉ์  1.3 ์—ฐ๊ตฌ ๋Œ€์ƒ ์ •์˜ 1.4 ์—ฐ๊ตฌ ๋‚ด์šฉ ๋ฐ ๊ตฌ์„ฑ ์ œ 2 ์žฅ ๋ฐฐ๊ฒฝ์ด๋ก  ๋ฐ ๊ด€๋ จ์—ฐ๊ตฌ 2.1 ๋ฐฐ๊ฒฝ์ด๋ก  2.1.1 ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต 2.1.2 ์ •๊ทœํ™” 2.2 ๊ด€๋ จ์—ฐ๊ตฌ 2.2.1 ๋ฐ˜๋„์ฒด ํŒจํ‚ค์ง• ๋ผ์ธ ์Šค์ผ€์ค„๋ง ์—ฐ๊ตฌ 2.2.2 ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜ ์Šค์ผ€์ค„๋ง ์—ฐ๊ตฌ 2.2.3 ๊ฐ•ํ™”ํ•™์Šต ๊ฐ•๊ฑด์„ฑ ์—ฐ๊ตฌ ์ œ 3 ์žฅ ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜ ๋ฐ˜๋„์ฒด ํŒจํ‚ค์ง• ๋ผ์ธ ์Šค์ผ€์ค„๋ง 3.1 ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜ ์Šค์ผ€์ค„๋ง ์˜์‚ฌ ๊ฒฐ์ • 3.2 ์ƒํƒœ, ํ–‰๋™, ๋ณด์ƒ ์ •์˜ 3.3 ๊ฐ•ํ™”ํ•™์Šต ์—์ด์ „ํŠธ ํ•™์Šต๊ณผ ํ…Œ์ŠคํŠธ 3.3.1 ์‹ฌ์ธต Q ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ 3.3.2 ๊ฐ•ํ™”ํ•™์Šต ์—์ด์ „ํŠธ ํ•™์Šต ๋‹จ๊ณ„ 3.3.3 ๊ฐ•ํ™”ํ•™์Šต ์—์ด์ „ํŠธ ํ…Œ์ŠคํŠธ ๋‹จ๊ณ„ ์ œ 4 ์žฅ ๊ฐ•ํ™”ํ•™์Šต ๊ฐ•๊ฑด์„ฑ ํ™•๋ณด๋ฅผ ์œ„ํ•œ ์ •๊ทœํ™” ํ•™์Šต ๊ธฐ๋ฒ• 4.1 ์ •๊ทœํ™” ํ•™์Šต ๊ฐœ์š” 4.2 ์ •๊ทœํ™” ํ•™์Šต ๊ณผ์ • 4.2.1 ์‹ฌ์ธต Q ๋„คํŠธ์›Œํฌ ํ•™์Šต 4.2.2 Q์ธต ํ•™์Šต 4.2.3 ์ •๊ทœํ™” ๊ฐ€์ค‘์น˜ ํ•™์Šต 4.2.4 ์ƒˆ๋กœ์šด Q์ธต ํ•™์Šต ์ œ 5 ์žฅ ์‹คํ—˜ ๊ฒฐ๊ณผ 5.1 ๋ฐ์ดํ„ฐ์…‹ 5.2 ์‹คํ—˜ ๊ณผ์ • 5.3 ์‹คํ—˜ ์„ธํŒ… 5.3.1 ๊ฐ•ํ™”ํ•™์Šต ์‹คํ—˜ ์„ธํŒ… 5.3.2 ์ •๊ทœํ™” ํ•™์Šต ์‹คํ—˜ ์„ธํŒ… 5.4 ์‹คํ—˜ ๊ฒฐ๊ณผ ์ œ 6 ์žฅ ๊ฒฐ๋ก  6.1 ๊ฒฐ๋ก  6.2 ํ•œ๊ณ„์  ๋ฐ ํ–ฅํ›„ ์—ฐ๊ตฌ ์ฐธ๊ณ ๋ฌธํ—Œ AbstractMaste
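Only the base deep Q-network component of the method can be inferred from the abstract; the four-stage regularized training itself is not specified in enough detail to reproduce. Below is a minimal sketch, in PyTorch, of a Q-network scheduler that scores dispatching actions from a flattened line-status state, together with its one-step TD loss; the layer sizes, state features, and hyperparameters are illustrative assumptions, not values from the thesis.

```python
import torch
import torch.nn as nn

class QScheduler(nn.Module):
    """Maps a flattened line-status vector to one Q-value per dispatching action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # Q(s, a) for each candidate dispatching decision
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def dqn_loss(q_net, target_net, batch, gamma: float = 0.99) -> torch.Tensor:
    """One-step TD loss; `batch` holds (state, action, reward, next_state, done) tensors."""
    s, a, r, s2, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)
```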

    ํ™•๋ฅ ์  ์•ˆ์ „์„ฑ ๊ฒ€์ฆ์„ ์œ„ํ•œ ์•ˆ์ „ ๊ฐ•ํ™”ํ•™์Šต: ๋žดํ‘ธ๋…ธ๋ธŒ ๊ธฐ๋ฐ˜ ๋ฐฉ๋ฒ•๋ก 

    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2020. 8. ์–‘์ธ์ˆœ.Emerging applications in robotic and autonomous systems, such as autonomous driving and robotic surgery, often involve critical safety constraints that must be satisfied even when information about system models is limited. In this regard, we propose a model-free safety specification method that learns the maximal probability of safe operation by carefully combining probabilistic reachability analysis and safe reinforcement learning (RL). Our approach constructs a Lyapunov function with respect to a safe policy to restrain each policy improvement stage. As a result, it yields a sequence of safe policies that determine the range of safe operation, called the safe set, which monotonically expands and gradually converges. We also develop an efficient safe exploration scheme that accelerates the process of identifying the safety of unexamined states. Exploiting the Lyapunov shieding, our method regulates the exploratory policy to avoid dangerous states with high confidence. To handle high-dimensional systems, we further extend our approach to deep RL by introducing a Lagrangian relaxation technique to establish a tractable actor-critic algorithm. The empirical performance of our method is demonstrated through continuous control benchmark problems, such as a reaching task on a planar robot arm.์ž์œจ์ฃผํ–‰, ๋กœ๋ด‡ ์ˆ˜์ˆ  ๋“ฑ ์ž์œจ์‹œ์Šคํ…œ ๋ฐ ๋กœ๋ณดํ‹ฑ์Šค์˜ ๋– ์˜ค๋ฅด๋Š” ์‘์šฉ ๋ถ„์•ผ์˜ ์ ˆ๋Œ€ ๋‹ค์ˆ˜๋Š” ์•ˆ์ „ํ•œ ๋™์ž‘์„ ๋ณด์žฅํ•˜๊ธฐ ์œ„ํ•ด ์ผ์ •ํ•œ ์ œ์•ฝ์„ ํ•„์š”๋กœ ํ•œ๋‹ค. ํŠนํžˆ, ์•ˆ์ „์ œ์•ฝ์€ ์‹œ์Šคํ…œ ๋ชจ๋ธ์— ๋Œ€ํ•ด ์ œํ•œ๋œ ์ •๋ณด๋งŒ ์•Œ๋ ค์ ธ ์žˆ์„ ๋•Œ์—๋„ ๋ณด์žฅ๋˜์–ด์•ผ ํ•œ๋‹ค. ์ด์— ๋”ฐ๋ผ, ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํ™•๋ฅ ์  ๋„๋‹ฌ์„ฑ ๋ถ„์„(probabilistic reachability analysis)๊ณผ ์•ˆ์ „ ๊ฐ•ํ™”ํ•™์Šต(safe reinforcement learning)์„ ๊ฒฐํ•ฉํ•˜์—ฌ ์‹œ์Šคํ…œ์ด ์•ˆ์ „ํ•˜๊ฒŒ ๋™์ž‘ํ•  ํ™•๋ฅ ์˜ ์ตœ๋Œ“๊ฐ’์œผ๋กœ ์ •์˜๋˜๋Š” ์•ˆ์ „ ์‚ฌ์–‘์„ ๋ณ„๋„์˜ ๋ชจ๋ธ ์—†์ด ์ถ”์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•๋ก ์„ ์ œ์•ˆํ•œ๋‹ค. ์šฐ๋ฆฌ์˜ ์ ‘๊ทผ๋ฒ•์€ ๋งค๋ฒˆ ์ •์ฑ…์„ ์ƒˆ๋กœ ๊ตฌํ•˜๋Š” ๊ณผ์ •์—์„œ ๊ทธ ๊ฒฐ๊ณผ๋ฌผ์ด ์•ˆ์ „ํ•จ์— ๋Œ€ํ•œ ๊ธฐ์ค€์„ ์ถฉ์กฑ์‹œํ‚ค๋„๋ก ์ œํ•œ์„ ๊ฑฐ๋Š” ๊ฒƒ์œผ๋กœ, ์ด๋ฅผ ์œ„ํ•ด ์•ˆ์ „ํ•œ ์ •์ฑ…์— ๊ด€ํ•œ ๋žดํ‘ธ๋…ธํ”„ ํ•จ์ˆ˜๋ฅผ ๊ตฌ์ถ•ํ•œ๋‹ค. ๊ทธ ๊ฒฐ๊ณผ๋กœ ์‚ฐ์ถœ๋˜๋Š” ์ผ๋ จ์˜ ์ •์ฑ…์œผ๋กœ๋ถ€ํ„ฐ ์•ˆ์ „ ์ง‘ํ•ฉ(safe set)์ด๋ผ ๋ถˆ๋ฆฌ๋Š” ์•ˆ์ „ํ•œ ๋™์ž‘์ด ๋ณด์žฅ๋˜๋Š” ์˜์—ญ์ด ๊ณ„์‚ฐ๋˜๊ณ , ์ด ์ง‘ํ•ฉ์€ ๋‹จ์กฐ๋กญ๊ฒŒ ํ™•์žฅํ•˜์—ฌ ์ ์ฐจ ์ตœ์ ํ•ด๋กœ ์ˆ˜๋ ดํ•˜๋„๋ก ๋งŒ๋‹ค. ๋˜ํ•œ, ์šฐ๋ฆฌ๋Š” ์กฐ์‚ฌ๋˜์ง€ ์•Š์€ ์ƒํƒœ์˜ ์•ˆ์ „์„ฑ์„ ๋” ๋น ๋ฅด๊ฒŒ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ๋Š” ํšจ์œจ์ ์ธ ์•ˆ์ „ ํƒ์‚ฌ ์ฒด๊ณ„๋ฅผ ๊ฐœ๋ฐœํ•˜์˜€๋‹ค. ๋žดํ‘ธ๋…ธ๋ธŒ ์ฐจํ๋ฅผ ์ด์šฉํ•œ ๊ฒฐ๊ณผ, ์šฐ๋ฆฌ๊ฐ€ ์ œ์•ˆํ•˜๋Š” ํƒํ—˜ ์ •์ฑ…์€ ๋†’์€ ํ™•๋ฅ ๋กœ ์œ„ํ—˜ํ•˜๋‹ค ์—ฌ๊ฒจ์ง€๋Š” ์ƒํƒœ๋ฅผ ํ”ผํ•˜๋„๋ก ์ œํ•œ์ด ๊ฑธ๋ฆฐ๋‹ค. ์—ฌ๊ธฐ์— ๋”ํ•ด ์šฐ๋ฆฌ๋Š” ๊ณ ์ฐจ์› ์‹œ์Šคํ…œ์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์ œ์•ˆํ•œ ๋ฐฉ๋ฒ•์„ ์‹ฌ์ธต๊ฐ•ํ™”ํ•™์Šต์œผ๋กœ ํ™•์žฅํ–ˆ๊ณ , ๊ตฌํ˜„ ๊ฐ€๋Šฅํ•œ ์•กํ„ฐ-ํฌ๋ฆฌํ‹ฑ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ๋ผ๊ทธ๋ž‘์ฃผ ์ด์™„๋ฒ•์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. 
๋”๋ถˆ์–ด ๋ณธ ๋ฐฉ๋ฒ•์˜ ์‹คํšจ์„ฑ์€ ์—ฐ์†์ ์ธ ์ œ์–ด ๋ฒค์น˜๋งˆํฌ์ธ 2์ฐจ์› ํ‰๋ฉด์—์„œ ๋™์ž‘ํ•˜๋Š” 2-DOF ๋กœ๋ด‡ ํŒ”์„ ํ†ตํ•ด ์‹คํ—˜์ ์œผ๋กœ ์ž…์ฆ๋˜์—ˆ๋‹ค.Chapter 1 Introduction 1 Chapter 2 Related work 4 Chapter 3 Background 6 3.1 Probabilistic Reachability and Safety Specifications 6 3.2 Safe Reinforcement Learning 8 Chapter 4 Lyapunov-Based Safe Reinforcement Learning for Safety Specification 10 4.1 Lyapunov Safety Specification 11 4.2 Efficient Safe Exploration 14 4.3 Deep RL Implementation 19 Chapter 5 Simulation Studies 23 5.1 Tabular Q-Learning 25 5.2 Deep RL 27 5.3 Experimental Setup 31 5.3.1 Deep RL Implementation 31 5.3.2 Environments 32 Chapter 6 Conclusion 35 Bibliography 35 ์ดˆ๋ก 41 Acknowledgements 42Maste

    ๊ฐ•ํ™”ํ•™์Šต์„ ํ™œ์šฉํ•œ ๊ณ ์†๋„๋กœ ๊ฐ€๋ณ€์ œํ•œ์†๋„ ๋ฐ ๋žจํ”„๋ฏธํ„ฐ๋ง ์ „๋žต ๊ฐœ๋ฐœ

    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ๊ฑด์„คํ™˜๊ฒฝ๊ณตํ•™๋ถ€, 2022.2. ๊น€๋™๊ทœ.Recently, to resolve societal problems caused by traffic congestion, traffic control strategies have been developed to operate freeways efficiently. The representative strategies to effectively manage freeway flow are variable speed limit (VSL) control and the coordinated ramp metering (RM) strategy. This paper aims to develop a dynamic VSL and RM control algorithm to obtain efficient traffic flow on freeways using deep reinforcement learning (DRL). The traffic control strategies applying the deep deterministic policy gradient (DDPG) algorithm are tested through traffic simulation in the freeway section with multiple VSL and RM controls. The results show that implementing the strategy alleviates the congestion in the on-ramp section and shifts to the overall sections. For most cases, the VSL or RM strategy improves the overall flow rates by reducing the density and improving the average speed of the vehicles. However, VSL or RM control may not be appropriate, particularly at the high level of traffic flow. It is required to introduce the selective application of the integrated control strategies according to the level of traffic flow. It is found that the integrated strategy can be used when including the relationship between each state detector in multiple VSL sections and lanes by applying the adjacency matrix in the neural network layer. The result of this study implies the effectiveness of DRL-based VSL and the RM strategy and the importance of the spatial correlation between the state detectors.์ตœ๊ทผ์—๋Š” ๊ตํ†ตํ˜ผ์žก์œผ๋กœ ์ธํ•œ ์‚ฌํšŒ์  ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๊ณ ์†๋„๋กœ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์šด์˜ํ•˜๊ธฐ ์œ„ํ•œ ๊ตํ†ตํ†ต์ œ ์ „๋žต์ด ๋‹ค์–‘ํ•˜๊ฒŒ ๊ฐœ๋ฐœ๋˜๊ณ  ์žˆ๋‹ค. ๊ณ ์†๋„๋กœ ๊ตํ†ต๋ฅ˜๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ๊ด€๋ฆฌํ•˜๊ธฐ ์œ„ํ•œ ๋Œ€ํ‘œ์ ์ธ ์ „๋žต์œผ๋กœ๋Š” ์ฐจ๋กœ๋ณ„ ์ œํ•œ์†๋„๋ฅผ ๋‹ค๋ฅด๊ฒŒ ์ ์šฉํ•˜๋Š” ๊ฐ€๋ณ€ ์†๋„ ์ œํ•œ(VSL) ์ œ์–ด์™€ ์ง„์ž… ๋žจํ”„์—์„œ ์‹ ํ˜ธ๋ฅผ ํ†ตํ•ด ์ฐจ๋Ÿ‰์„ ํ†ต์ œํ•˜๋Š” ๋žจํ”„ ๋ฏธํ„ฐ๋ง(RM) ์ „๋žต ๋“ฑ์ด ์žˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์˜ ๋ชฉํ‘œ๋Š” ์‹ฌ์ธต ๊ฐ•ํ™” ํ•™์Šต(deep reinforcement learning)์„ ํ™œ์šฉํ•˜์—ฌ ๊ณ ์†๋„๋กœ์˜ ํšจ์œจ์ ์ธ ๊ตํ†ต ํ๋ฆ„์„ ์–ป๊ธฐ ์œ„ํ•ด ๋™์  VSL ๋ฐ RM ์ œ์–ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐœ๋ฐœํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๊ณ ์†๋„๋กœ์˜ ์—ฌ๋Ÿฌ VSL๊ณผ RM ๊ตฌ๊ฐ„์—์„œ ์‹œ๋ฎฌ๋ ˆ์ด์…˜์„ ํ†ตํ•ด ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ค‘ ํ•˜๋‚˜์ธ deep deterministic policy gradient (DDPG) ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ ์šฉํ•œ ๊ตํ†ต๋ฅ˜ ์ œ์–ด ์ „๋žต์„ ๊ฒ€์ฆํ•œ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ, ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜ VSL ๋˜๋Š” RM ์ „๋žต์„ ์ ์šฉํ•˜๋Š” ๊ฒƒ์ด ๋žจํ”„ ์ง„์ž…๋กœ ๊ตฌ๊ฐ„์˜ ํ˜ผ์žก์„ ์™„ํ™”ํ•˜๊ณ  ๋‚˜์•„๊ฐ€ ์ „์ฒด ๊ตฌ๊ฐ„์˜ ํ˜ผ์žก์„ ์ค„์ด๋Š” ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚ฌ๋‹ค. ๋Œ€๋ถ€๋ถ„์˜ ๊ฒฝ์šฐ VSL์ด๋‚˜ RM ์ „๋žต์€ ๋ณธ์„ ๊ณผ ์ง„์ž…๋กœ ๊ตฌ๊ฐ„์˜ ๋ฐ€๋„๋ฅผ ์ค„์ด๊ณ  ์ฐจ๋Ÿ‰์˜ ํ‰๊ท  ํ†ตํ–‰ ์†๋„๋ฅผ ์ฆ๊ฐ€์‹œ์ผœ ์ „์ฒด ๊ตํ†ต ํ๋ฆ„์„ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค. VSL ๋˜๋Š” RM ์ „๋žต๋“ค์€ ๋†’์€ ์ˆ˜์ค€์˜ ๊ตํ†ต๋ฅ˜์—์„œ ์ ์ ˆํ•˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ์–ด ๊ตํ†ต๋ฅ˜ ์ˆ˜์ค€์— ๋”ฐ๋ฅธ ์ „๋žต์˜ ์„ ํƒ์  ๋„์ž…์ด ํ•„์š”ํ•˜๋‹ค. ๋˜ํ•œ ๊ฒ€์ง€๊ธฐ๊ฐ„ ์ง€๋ฆฌ์  ๊ฑฐ๋ฆฌ์™€ ๊ด€๋ จํ•œ ์ธ์ ‘ ํ–‰๋ ฌ์„ ํฌํ•จํ•˜๋Š” graph neural network layer์ด ์—ฌ๋Ÿฌ ์ง€์  ๊ฒ€์ง€๊ธฐ์˜ ๊ณต๊ฐ„์  ์ƒ๊ด€ ๊ด€๊ณ„๋ฅผ ๊ฐ์ง€ํ•˜๋Š” ๋ฐ ์ด์šฉ๋  ์ˆ˜ ์žˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์˜ ๊ฒฐ๊ณผ๋Š” ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜ VSL๊ณผ RM ์ „๋žต ๋„์ž…์˜ ํ•„์š”์„ฑ๊ณผ ์ง€์  ๊ฒ€์ง€๊ธฐ ๊ฐ„์˜ ๊ณต๊ฐ„์  ์ƒ๊ด€๊ด€๊ณ„์˜ ์ค‘์š”์„ฑ์„ ๋ฐ˜์˜ํ•˜๋Š” ์ „๋žต ๋„์ž…์˜ ํšจ๊ณผ๋ฅผ ์‹œ์‚ฌํ•œ๋‹ค.Chapter 1. Introduction 1 Chapter 2. 
Literature Review 4 Chapter 3. Methods 8 3.1. Study Area and the Collection of Data 8 3.2. Simulation Framework 11 3.3. Trip Generation and Route Choice 13 3.4. Deep Deterministic Policy Gradient (DDPG) Algorithm 14 3.5. Graph Convolution Network (GCN) Layer 17 3.6. RL Formulation 18 Chapter 4. Results 20 4.1. VSL and RM 20 4.2. Efficiency according to the flow rate 28 4.3. Effectiveness of the GCN Layer 33 Chapter 5. Conclusion 34 Bibliography 37 Abstract in Korean 44์„
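A minimal sketch of the kind of graph-convolution layer the abstract describes, in which a normalized adjacency matrix (assumed here to be built from detector spacing) mixes each detector's state with its neighbours' before the DDPG actor/critic consumes it. The dimensions and normalization choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DetectorGCNLayer(nn.Module):
    """One graph-convolution layer over detector stations: each detector's features are
    mixed with its neighbours' through a symmetrically normalized adjacency matrix."""
    def __init__(self, in_dim: int, out_dim: int, adjacency: torch.Tensor):
        super().__init__()
        a_hat = adjacency + torch.eye(adjacency.size(0))      # add self-loops
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
        self.register_buffer("a_norm", d_inv_sqrt @ a_hat @ d_inv_sqrt)
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_detectors, in_dim) -> (n_detectors, out_dim)
        return torch.relu(self.a_norm @ self.linear(x))
```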

    ์ŠคํŠธ๋ ˆ์Šค๊ฐ€ ์˜์‚ฌ๊ฒฐ์ •์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ

    Doctoral dissertation -- Seoul National University Graduate School: Department of Psychology, February 2015. Advisor: ์ตœ์ง„์˜.
    When we decide or choose, behavior is generally assumed to be determined by the competitive activity of two neurocognitive control systems. One is the habitual, or model-free, system, which selects actions according to whether they were immediately reinforced; the other is the goal-directed, or model-based, system, which actively uses knowledge and information about the agent's internal state and external environment. Stress has been shown to impair goal-directed behavior and promote habitual behavior, suggesting that it intervenes in the competition between the two control systems. However, systematic research on the specific mechanisms by which stress affects the components of action selection and learning is still lacking. This dissertation closely examines, in two studies, the multifaceted effects of stress on the process and outcome of action selection. Study 1 developed a two-stage reversal-learning decision-making task that dissociates habitual from goal-directed processing and explored how acute stress induced in the laboratory engages these two processes. Healthy undergraduates were randomly assigned to a stress-treatment condition or a non-treatment control condition, and a computational reinforcement-learning model was fitted to participants' task behavior to estimate model-based and model-free behavioral tendencies and learning rates. Comparing task behavior and parameter estimates across conditions, the stress group showed reduced model-based behavior, a stronger model-free tendency in unreinforced situations, and a lower tendency to incorporate new information into choices, i.e., a lower learning rate. Study 2 used functional magnetic resonance imaging (fMRI) to examine the effect of stress on decision-making at the level of neural activity and tested whether the behavioral effect of stress is consistent across treatment intensities or follows the Yerkes-Dodson law. Healthy adults were randomly assigned to a no-stress, a single-stress, or a double-stress condition, and fMRI was acquired while they performed the two-stage reversal-learning task. Relative to the no-stress condition, participants in the single-stress condition showed increased model-based, goal-directed behavior and a reduced model-free tendency, whereas in the double-stress condition, with a higher stress level, model-based behavior was lower than in the single-stress condition. This bidirectional, stress-related change in cognition and behavior was also observed at the level of neural activity: decision-related activation in the medial prefrontal cortex and superior temporal cortex was enhanced or suppressed depending on the level of stress treatment. Activation in these two regions correlated positively with the parameter estimate reflecting the model-based behavioral tendency, and decision-related activation in the medial prefrontal cortex in particular correlated negatively with the index of habitual behavior. Stress treatment also reduced chosen-value-related activation in the right hippocampus, which appeared behaviorally as impaired reversal learning. This dissertation clarifies the cognitive and behavioral mechanisms by which stress promotes habitual behavior in decision-making and shows that the effect of stress, depending on its intensity, influences multiple neurocognitive components of action selection. The results carry clinical implications for the pathological mechanisms of, and interventions for, stress-related maladaptive behaviors such as addiction and compulsion.
    Contents: I. Introduction 1 (1. Stress and the stress response 2; 2. Stress and decision-making 7; 3. Computational models of decision-making 11; 4. Research aims 16); II. Study 1 19 (1. Methods 22; 2. Results 40; 3. Discussion 1 52); III. Study 2 59 (1. Methods 60; 2. Results 74; 3. Discussion 2 95); IV. General Discussion 102; References 106.

    Setup Change Scheduling Under Due-date Constraints Using Deep Reinforcement Learning with Self-supervision

    Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Industrial Engineering and Naval Architecture, August 2021. Advisor: ๋ฐ•์ข…ํ—Œ.
    Setup change scheduling under due-date constraints has attracted much attention from academia and industry due to its practical applications. In a real-world manufacturing system, however, solving the scheduling problem becomes challenging since it is required to address urgent and frequent changes in demand, due-dates of products, and initial machine status. In this thesis, we propose a scheduling framework based on deep reinforcement learning (RL) with self-supervision in which trained neural networks (NNs) are able to solve unseen scheduling problems without re-training even when such changes occur. Specifically, we propose state and action representations whose dimensions are independent of production requirements and due-dates of jobs while accommodating family setups. At the same time, an NN architecture with parameter sharing is utilized to improve training efficiency. Finally, we devise an additional self-supervised loss specific to the scheduling problem for training an NN scheduler that is robust to variations in the numbers of machines and jobs and in the distribution of production plans. We carried out extensive experiments on large-scale datasets that simulate a real-world wafer preparation facility and semiconductor packaging line. The results demonstrate that the proposed method outperforms recent metaheuristics, rule-based methods, and other RL-based methods in terms of schedule quality and the computation time needed to obtain a schedule. Besides, we investigated the individual contributions of the state representation, parameter sharing, and self-supervision to the performance improvements.
    Contents: Chapter 1 Introduction 1 (1.1 Research motivation and background 1; 1.2 Research objectives and contributions 4; 1.3 Thesis organization 6); Chapter 2 Background 7 (2.1 Scheduling under due-date constraints with sequence-dependent setups 7: 2.1.1 Scheduling under due-date constraints 7, 2.1.2 Parallel-machine scheduling with family setups 8, 2.1.3 Job-shop scheduling with setup constraints 9; 2.2 Reinforcement learning based scheduling 12: 2.2.1 Theoretical background 12, 2.2.2 Manufacturing line scheduling with reinforcement learning 13, 2.2.3 Deep reinforcement learning for scheduling problems 15; 2.3 Deep reinforcement learning with self-supervision 19); Chapter 3 Problem definition 22 (3.1 Parallel-machine scheduling problem 22: 3.1.1 Parallel-machine scheduling to minimize tardiness 22, 3.1.2 Mixed-integer programming model 24, 3.1.3 Example process 25; 3.2 Job-shop scheduling problem 26: 3.2.1 Flexible job-shop scheduling to maximize input volume 26, 3.2.2 Example process 27); Chapter 4 Parallel-machine scheduling with self-supervised deep reinforcement learning 31 (4.1 MDP model 31: 4.1.1 Action definition 31, 4.1.2 State representation 32, 4.1.3 Reward definition 37, 4.1.4 State transition 38, 4.1.5 Example 39; 4.2 Neural network training 41: 4.2.1 Network architecture 41, 4.2.2 Loss function 42, 4.2.3 DQN training procedure 43, 4.2.4 DQN evaluation procedure 44; 4.3 Self-supervision for scheduling 46: 4.3.1 Intrinsic reward design 46, 4.3.2 Preference score design for setup scheduling 47; 4.4 DQN training with self-supervision 49: 4.4.1 Self-supervised loss function 49, 4.4.2 Training procedure 50); Chapter 5 Job-shop scheduling with self-supervised deep reinforcement learning 53 (5.1 Scheduling framework 53: 5.1.1 Bottleneck process definition 53, 5.1.2 Dispatching rules 54, 5.1.3 Discrete-event simulator 55, 5.1.4 Scheduler training 56; 5.2 Input policy and self-supervision 58; 5.3 MDP model modification 59: 5.3.1 Action definition 59, 5.3.2 State representation 59, 5.3.3 Reward definition 61); Chapter 6 Experiments and results 62 (6.1 Parallel-machine scheduling 62: 6.1.1 Dataset 62, 6.1.2 Experimental settings 64, 6.1.3 Total tardiness comparison 67, 6.1.4 Comparison by state representation 72; 6.2 Job-shop scheduling 74: 6.2.1 Dataset 74, 6.2.2 Experimental settings 75, 6.2.3 Input volume comparison 77, 6.2.4 Comparison by action definition 80; 6.3 Effects of self-supervision 84: 6.3.1 Dataset 84, 6.3.2 Experimental settings 86, 6.3.3 Effect of self-supervision with and without parameter sharing 87, 6.3.4 Evaluation on datasets different from training 91); Chapter 7 Conclusion and future research directions 96 (7.1 Conclusion 96; 7.2 Future research directions 98); References 100; Abstract 118; Acknowledgements 120.
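A minimal sketch of the two ideas highlighted in the abstract: a shared scorer whose parameter count does not depend on how many machines or jobs appear, and a DQN loss extended with a self-supervised term that pulls Q-values toward a scheduling-specific preference score. All names, dimensions, and the auxiliary weight are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedCandidateScorer(nn.Module):
    """Scores every (machine, job-family) candidate with one shared MLP, so the same
    parameters apply no matter how many machines or jobs appear at test time."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, candidate_feats: torch.Tensor) -> torch.Tensor:
        # candidate_feats: (n_candidates, feat_dim) -> one Q-value per candidate
        return self.scorer(candidate_feats).squeeze(-1)

def loss_with_self_supervision(q_pred: torch.Tensor, td_target: torch.Tensor,
                               preference_score: torch.Tensor,
                               aux_weight: float = 0.1) -> torch.Tensor:
    """DQN TD loss plus an auxiliary self-supervised term that pulls each candidate's
    Q-value toward a scheduling-specific preference score."""
    return F.mse_loss(q_pred, td_target) + aux_weight * F.mse_loss(q_pred, preference_score)
```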

    An integration of neuroscience and computational reinforcement learning

    Doctoral dissertation -- Seoul National University Graduate School: College of Natural Sciences, Department of Brain and Cognitive Sciences, August 2021. Advisor: ๊น€ํƒ์™„.
    Introduction: Habit bias, resulting from imbalanced arbitration between goal-directed and habitual control, is thought to underlie the compulsive symptoms of patients with obsessive-compulsive disorder (OCD). A computational reinforcement learning (RL) model accounts for this arbitration: between the goal-directed (model-based; MB) and habitual (model-free; MF) RL systems, the brain allocates weight to the controller with higher reliability in state or reward prediction. However, it remains unclear whether the impaired arbitration in OCD is attributable to faulty estimation of the reliability of the RL systems and whether the inferior frontal gyrus (IFG) and/or frontopolar cortex (FPC), known to track the reliability signals, underlie this impairment. Methods: A sequential two-choice Markov decision task was used to dissociate the MB and MF learning strategies. Thirty patients with OCD and thirty-one healthy controls (HCs) underwent an fMRI scan while performing the behavioral task. Behaviors of the arbitration process were estimated through a computational model based on RL algorithms. The model parameters and their neural estimates were compared between groups. Regression analyses were conducted to examine whether neural differences explained faulty estimation of the reliability, in addition to compulsion severity, in OCD. Results: Patients with OCD earned less reward and showed higher perseveration than HCs. During MB-favored trials, the uncertainty of prediction based on the MF strategy was lower in patients, which led to higher maximum reliability of the RL systems arbitrating behaviors (i.e., stability of the arbitration) and a higher probability of choosing the MF strategy. The higher stability of the arbitration was associated with hyperactive signals of the lateral orbitofrontal cortex (OFC)/FPC in patients. Patients showed increased connectivity strength between the OFC/FPC and precuneus when choosing an action strategy. On the other hand, the hyperactive IFG signal was inversely associated with the stability of the arbitration and with compulsion severity in patients. Conclusions: It was demonstrated that hyperactive neural arbitrators encoding an excessively stable arbitration, in which MF reliability was predominant, underlay the imbalanced arbitration in OCD. The findings therefore suggest the IFG and FPC as brain biomarkers useful for planning neurocircuit-based treatments for the habit biases and compulsions of OCD.
    Contents: Background 1 (Clinical characteristics of obsessive-compulsive disorder 1; Theoretical models for OCD symptomatology 3; Neurocircuitry mechanisms of OCD 4; Treatment strategies and unsatisfactory responses in patients with OCD 7; Current issues to be addressed in developing neurobiological evidence-based treatments for OCD 8); Chapter 1 Reliability-based competition between model-based and model-free learning strategies in OCD 11 (Introduction 12; Methods 15; Results 26; Discussion 35); Chapter 2 Aberrant neural arbitrators underlying the imbalanced arbitration between decision-making strategies in OCD 37 (Introduction 38; Methods 40; Results 45; Discussion 55); General Discussion 57; References 62; Abstract in Korean 74.
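A minimal sketch of the reliability-based arbitration idea the abstract describes: each system's reliability is tracked from its recent prediction errors, and the probability of handing control to the model-based system grows with its relative reliability. The update rule, decay, and softmax temperature are illustrative assumptions, not the fitted arbitration model.

```python
import numpy as np

def update_reliability(reliability: float, prediction_error: float, decay: float = 0.1) -> float:
    """A reliability signal that rises when recent prediction errors are small."""
    return (1.0 - decay) * reliability + decay * (1.0 - min(abs(prediction_error), 1.0))

def p_model_based(rel_mb: float, rel_mf: float, temperature: float = 5.0) -> float:
    """Probability of handing control to the model-based system, increasing with its
    reliability advantage over the model-free system."""
    return 1.0 / (1.0 + np.exp(-temperature * (rel_mb - rel_mf)))
```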

    ์„ธ๊ทธ๋จผํŠธ ๊ต์ฒด ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ•œ ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ABR ์•Œ๊ณ ๋ฆฌ์ฆ˜

    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2021. 2. ๊น€์ข…๊ถŒ.์ ์‘ํ˜• ๋น„ํŠธ๋ ˆ์ดํŠธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์˜จ๋ผ์ธ ๋น„๋””์˜ค ์„œ๋น„์Šค์˜ ์žฌ์ƒ ํ’ˆ์งˆ, ์ฆ‰ ์‚ฌ์šฉ์ž ์ฒด๊ฐ ํ’ˆ์งˆ์„ ์˜ฌ๋ฆฌ๊ธฐ ์œ„ํ•˜์—ฌ ์‚ฌ์šฉ๋˜๋Š” ๋Œ€ํ‘œ์  ๊ธฐ์ˆ  ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ์ง€๊ธˆ๊นŒ์ง€ ์ ์‘ํ˜• ๋น„ํŠธ๋ ˆ์ดํŠธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‹ค์–‘ํ•œ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ ์‚ฌ์šฉ์ž ์ฒด๊ฐ ํ’ˆ์งˆ์„ ์ตœ์ ํ™”ํ•˜์˜€๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋Œ€๋ถ€๋ถ„์˜ ์ ์‘ํ˜• ๋น„ํŠธ๋ ˆ์ดํŠธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๊ณตํ†ต๋œ ํ•œ๊ณ„์ ์„ ์ง€๋‹Œ๋‹ค. ์‚ฌ์šฉ์ž ์ฒด๊ฐ ํ’ˆ์งˆ์„ ์ตœ์ ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋‹จ์ˆœํžˆ ๋‹ค์Œ์œผ๋กœ ๋‹ค์šด๋กœ๋“œ ํ•ด์•ผํ•˜๋Š” ์„ธ๊ทธ๋จผํŠธ์˜ ๋น„ํŠธ๋ ˆ์ดํŠธ๋งŒ์„ ๊ฒฐ์ •ํ•œ๋‹ค๋Š” ์ ์ด ๊ทธ ํ•œ๊ณ„์ ์œผ๋กœ, ์ด๋Ÿฌํ•œ ์œ ํ˜•์— ์†ํ•˜๋Š” ์ ์‘ํ˜• ๋น„ํŠธ๋ ˆ์ดํŠธ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์€ ๋ณ€ํ™”ํ•˜๋Š” ๋„คํŠธ์›Œํฌ ํ™˜๊ฒฝ์— ๋งž์ถฐ ์•ž์œผ๋กœ ๋‹ค์šด๋กœ๋“œํ•  ์„ธ๊ทธ๋จผํŠธ์˜ ๋น„ํŠธ๋ ˆ์ดํŠธ๋Š” ์ตœ์ ์œผ๋กœ ์กฐ์ •ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ ์ด๋ฏธ ๋‹ค์šด๋กœ๋“œํ•œ ์„ธ๊ทธ๋จผํŠธ์— ๋Œ€ํ•ด์„  ์–ด๋– ํ•œ ์ตœ์ ํ™”๋„ ์ง„ํ–‰ํ•  ์ˆ˜ ์—†๋‹ค. ๊ทธ๋ ‡๊ธฐ์— ์‚ฌ์šฉ์ž์˜ ๋„คํŠธ์›Œํฌ ํ™˜๊ฒฝ์ด ๊ทน๋‹จ์ ์œผ๋กœ ๊ฐœ์„ ๋˜๋”๋ผ๋„ ์ด์— ๋Œ€ํ•œ ํ™œ์šฉ๋„๊ฐ€ ๋–จ์–ด์ง„๋‹ค. ์ด๋Ÿฌํ•œ ํ•œ๊ณ„์ ์„ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด ์šฐ๋ฆฌ๋Š” LAWS ๊ธฐ๋ฒ•, ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ์„ธ๊ทธ๋จผํŠธ ๊ต์ฒด ์ „๋žต์„ ํฌํ•จํ•œ ์ ์‘ํ˜• ๋น„ํŠธ๋ ˆ์ดํŠธ ์•Œ๊ณ ๋ฆฌ์ฆ˜, ์„ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆ ๋ชจ๋ธ์€ ์‚ฌ์šฉ์ž์˜ ๋„คํŠธ์›Œํฌ ํ™˜๊ฒฝ ๋“ฑ์— ๋”ฐ๋ผ์„œ ๋” ๋‚˜์€ ๋น„ํŠธ๋ ˆ์ดํŠธ๋กœ ์„ธ๊ทธ๋จผํŠธ๋ฅผ ๊ต์ฒดํ•  ์ˆ˜ ์žˆ๋‹ค. ์ œ์•ˆ ๊ธฐ๋ฒ•์„ ์‹คํ˜„ํ•˜๊ธฐ ์œ„ํ•ด ์šฐ๋ฆฌ๋Š” ์ƒˆ๋กœ์šด ํ˜•ํƒœ์˜ ๋ฆฌ์›Œ๋“œ๋ฅผ ๋””์ž์ธํ•œ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์ œ์•ˆ ๊ธฐ๋ฒ•์€ ์„ธ๊ทธ๋จผํŠธ ๊ต์ฒด ์ „๋žต์„ ํฌํ•จํ•œ ํ˜•ํƒœ๋กœ ์‚ฌ์šฉ์ž ์ฒด๊ฐ ํ’ˆ์งˆ์„ ์ตœ์ ํ™”ํ•  ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ ์„ธ๊ทธ๋จผํŠธ ๊ต์ฒด ์ „๋žต์„ ํฌํ•จํ•จ์— ๋”ฐ๋ผ ์ฆ๊ฐ€ํ•˜๋Š” ๋ฌธ์ œ์˜ ๋ณต์žก๋„์— ๋Œ€์‘ํ•˜๊ธฐ ์œ„ํ•ด ๊ทœ์น™ ๊ธฐ๋ฐ˜ ํ–‰๋™ ์ œ์•ฝ ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์˜ ํ•™์Šต์„ ์›ํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ์œ ๋„ํ•œ๋‹ค. ์šฐ๋ฆฌ๋Š” ์ตœ์ข…์ ์œผ๋กœ ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ์ ์‘ํ˜• ๋น„ํŠธ๋ ˆ์ดํŠธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ œ์•ˆํ•œ๋‹ค. ๋„คํŠธ์›Œํฌ ํŠธ๋ ˆ์ด์Šค๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์‹ค์‹œํ•œ ์‹คํ—˜์—์„œ๋Š” ์ œ์•ˆ ๊ธฐ๋ฒ•์ด ๊ธฐ์กด์˜ ๊ธฐ๋ฒ•๋“ค์— ๋น„ํ•ด ์‚ฌ์šฉ์ž ์ฒด๊ฐ ํ’ˆ์งˆ์„ 13.1%๊นŒ์ง€ ๊ฐœ์„ ์‹œํ‚ค๋Š” ๊ฒƒ์œผ๋กœ ํ™•์ธ๋๋‹คAdaptive bitrate (ABR) algorithm is one of the representative techniques used to optimize the playback quality of online video services, namely Quality of Experience (QoE). So far, ABR algorithms based on various optimization techniques have optimized QoE. However, most of the ABR algorithms proposed to date have common limitations; the range of options for optimization. Currently, most ABR algorithms only determine the bit rate of the next segment for QoE optimization. This type of ABR algorithm can optimize the bit rate of a segment to be downloaded in the future in a dynamic network environment. However, it is not possible to optimize any segment previously downloaded, so the changed network environment cannot be utilized to the maximum. To overcome this limitation, we propose LAWS, learning based ABR algorithm with segment replacement. LAWS can be replaced with a better bit rate, even for previously downloaded segments, in conditions such as an improved network environment. First for this, we design a novel form of reward for optimization, including segment replacement. Through this, QoE, the optimization objective of the ABR algorithm, can be optimized in the form of segment replacement. 
In addition, we propose a rule-based learning method to solve the challenges arising in the model learning process. We finally propose an ABR algorithm with segment replacement based on deep reinforcement learning. Experiments based on network traces show that the newly proposed technique has a QoE improvement of 13.1% compared to the existing ABR techniques.I. Introduction 1 II. Related Work 4 2.1 DASH 4 2.2 Adaptive BitRate Algorithm 6 III. Motivation and Approach 9 3.1 Motivation 9 3.2 Approach 11 IV. Neural ABR algorithm with Segment Replacement 13 4.1 Action 15 4.2 State 15 4.3 Reward 18 4.4 Rule based learning 26 4.5 Implementation 27 V. Experiments 28 5.1 Experiment Setup 28 5.2 Baselines 29 5.3 Comparison with Existing ABR algorithms 33 5.4 Analyze Replacement Characteristics 35 5.5 Comparison Between Learning Based Algorithms 35 VI. Conclusion 37Maste
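A minimal sketch of how a replacement-aware QoE reward and a rule-based action constraint could look, consistent with the abstract's description. The penalty coefficients, the buffer threshold, and the Action fields are illustrative assumptions, not values from the thesis.

```python
from dataclasses import dataclass

@dataclass
class Action:
    bitrate_kbps: int
    is_replacement: bool   # True = re-download an earlier segment at a higher bitrate

def qoe_reward(bitrate: float, rebuffer_s: float, prev_bitrate: float,
               replaced_gain: float = 0.0,
               rebuf_penalty: float = 4.3, smooth_penalty: float = 1.0) -> float:
    """QoE-style reward: quality, minus rebuffering and bitrate-switch penalties,
    plus the quality gained by replacing an already-downloaded segment."""
    return (bitrate + replaced_gain
            - rebuf_penalty * rebuffer_s
            - smooth_penalty * abs(bitrate - prev_bitrate))

def feasible_actions(actions: list, buffer_s: float, min_buffer_for_replace: float = 8.0) -> list:
    """Rule-based constraint: drop replacement actions when the playback buffer is low,
    so the agent never risks a stall just to re-download an old segment."""
    return [a for a in actions
            if not (a.is_replacement and buffer_s < min_buffer_for_replace)]
```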

    ๊ฐ•ํ™”ํ•™์Šต์„ ์ ์šฉํ•œ ์‹ค์šฉ์ ์ธ ๊ฑด๋ฌผ ์‹œ์Šคํ…œ ์ œ์–ด

    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ๊ฑด์ถ•ํ•™๊ณผ, 2021.8. ์กฐ์„ฑ๊ถŒ.HVAC ๋ฐ ์กฐ๋ช…๊ณผ ๊ฐ™์€ ๊ธฐ์กด ์‹œ์Šคํ…œ๊ณผ ๊ฐ„ํ—์  ์žฌ์ƒ ์—๋„ˆ์ง€, ์—๋„ˆ์ง€ ์ €์žฅ ์‹œ์Šคํ…œ ๋“ฑ๊ณผ ๊ฐ™์€ ์ƒˆ๋กœ์šด ์‹œ์Šคํ…œ์—๋„ ๋Œ€์‘ํ•ด์•ผ ํ•˜๋ฏ€๋กœ ํ˜„๋Œ€ ๊ฑด๋ฌผ ์‹œ์Šคํ…œ ์ œ์–ด๋Š” ๋ณต์žกํ•ด์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด์— ๋”ฐ๋ผ, ๊ฑด๋ฌผ ์‹œ์Šคํ…œ ์ œ์–ด๊ธฐ๋Š” ๊ฑด๋ฌผ์˜ ๋™์  ๊ฑฐ๋™์— ์Šค์Šค๋กœ ์ ์‘ํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•˜๊ณ  ๋‹ค๋ชฉ์  ์ตœ์ ํ™” ๊ฒฐ๊ณผ๋ฅผ ๋ฐ˜์˜ํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•œ๋‹ค. ๊ฐ•ํ™”ํ•™์Šต (reinforcement learning, RL)์„ ์‚ฌ์šฉํ•˜์—ฌ ์ „์ˆ ๋œ ๊ฑด๋ฌผ ์ œ์–ด๊ธฐ์˜ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์€ ๋„๋ฆฌ ์•Œ๋ ค์ ธ ์žˆ์ง€๋งŒ, RL์„ ์‹ค์ œ ๊ฑด๋ฌผ์— ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ํ•ด๊ฒฐํ•ด์•ผ ํ•  ๊ณผ์ œ๋“ค์ด ์žˆ๋‹ค: (1) RL์˜ ์ดˆ๊ธฐ ํ›ˆ๋ จ ๊ธฐ๊ฐ„ ๋™์•ˆ ๋ถˆ์•ˆ์ •ํ•œ ์ œ์–ด๋Š” ์˜ˆ์ƒ์น˜ ๋ชปํ•œ ๋น„์šฉ์„ ์•ผ๊ธฐํ•  ์ˆ˜ ์žˆ๋‹ค. (2) ์—ฌ์ „ํžˆ ๋Œ€๋ถ€๋ถ„์˜ RL ๊ธฐ๋ฐ˜ ์ œ์–ด ์ „๋žต์€ ์ผ์ƒ์  ์‹ค๋ฌด์— ์ ์šฉํ•˜๊ธฐ์—๋Š” ์‹œ์„ค ๊ด€๋ฆฌ์ž ์ž…์žฅ์—์„œ ์ดํ•ดํ•˜๊ธฐ ์–ด๋ ต๊ณ  ์ œ์–ด ์ „๋žต์— ๋Œ€ํ•œ ํ•ด์„์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์—†๋‹ค. RL ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฑด๋ฌผ ์ œ์–ด์— ์ ์šฉํ•œ๋‹ค๋Š” ๊ฒƒ์€ ์˜์‚ฌ๊ฒฐ์ •์˜ ์ฃผ์ฒด๊ฐ€ ์ธ๊ณต์ง€๋Šฅ์ด ๋œ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•œ๋‹ค. ์ด๋•Œ, ๊ฑด๋ฌผ์˜ ์†Œ์œ ์ฃผ์™€ ์šด์˜์ž๋Š” ์ธ๊ณต์ง€๋Šฅ ๊ธฐ๋ฐ˜ ๊ฑด๋ฌผ ์ œ์–ด๊ธฐ์˜ ์˜๋„ ๋ฐ ์˜์‚ฌ๊ฒฐ์ • ๊ณผ์ •์— ๋Œ€ํ•œ ํ•ด์„ ๋ฐ ์ดํ•ด๋ฅผ ํ•  ํ•„์š”๊ฐ€ ์žˆ๋‹ค. ์ฒซ ๋ฒˆ์งธ ๊ณผ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด, RL ์—์ด์ „ํŠธ๋ฅผ ์‚ฌ์ „ ํ•™์Šตํ•˜๊ณ  ์ด๋ฅผ ์œ„ํ•ด ์ƒˆ๋กœ์šด ๊ฐœ๋…์˜ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ๋ชจ๋ธ์ธ ์—ฐํ•ฉ ๋ชจ๋ธ์ด ์ œ์•ˆ๋œ๋‹ค. ์—ฐํ•ฉ ๋ชจ๋ธ์€ ๋นŒ๋”ฉ ์‹œ์Šคํ…œ์„ ๋ฌผ๋ฆฌ์  ์ธ๊ณผ ๊ด€๊ณ„์— ๋”ฐ๋ผ ๋ชจ๋“ˆ๋กœ ๋‚˜๋ˆ„๊ณ  ๊ฐ ๋ชจ๋“ˆ์„ ๋ฐ์ดํ„ฐ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋กœ ๊ฐœ๋ฐœํ•˜์—ฌ ๋นŒ๋”ฉ ์‹œ์Šคํ…œ์— ๋Œ€ํ•œ ์‹œ๋ฎฌ๋ ˆ์ด์…˜์„ ์ˆ˜ํ–‰ํ•˜๋Š” ํ†ตํ•ฉ ๋ฐ์ดํ„ฐ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์ด๋‹ค. ๋Œ€์ƒ ๊ฑด๋ฌผ์˜ ๋ƒ‰๋ฐฉ ์‹œ์Šคํ…œ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ๋ชจ๋ธ์€ 6๊ฐœ์˜ ๋ชจ๋“ˆ๋กœ ๊ตฌ์„ฑ๋˜๊ณ  ๊ฐ ๋ชจ๋“ˆ์€ BEMS์—์„œ ์ˆ˜์ง‘๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐœ๋ฐœ๋œ๋‹ค. ์—ฐํ•ฉ ๋ชจ๋ธ์€ ์ œ1๋ฒ•์น™ ๊ธฐ๋ฐ˜ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ๋ชจ๋ธ์˜ ํ•œ๊ณ„ (์˜ˆ: ์œ„์ƒ ๊ทœ์น™, ๋ชจ๋ธ ๋ณด์ •)๋ฅผ ๊ทน๋ณตํ•  ์ˆ˜ ์žˆ๋‹ค. Deep Q-Network (DQN)์€ ๋ƒ‰๋ฐฉ ์‹œ์Šคํ…œ์˜ ๋™์  ๊ฑฐ๋™์„ ํ•™์Šตํ•˜๊ณ  ๊ฑด๋ฌผ์— ๋ƒ‰๋ฐฉ์„ ๊ณต๊ธ‰ํ•˜๋Š” ๋™์‹œ์— ์—๋„ˆ์ง€ ์‚ฌ์šฉ์„ ์ค„์ผ ์ˆ˜ ์žˆ๋Š” ์ œ์–ด ์ „๋žต์„ ๋ชจ์ƒ‰ํ•˜๋Š” ๋ฐ ์ ์šฉ๋œ๋‹ค. DQN์˜ ์ œ์–ด ์„ฑ๋Šฅ์„ ํ˜„์žฌ ๊ฑด๋ฌผ ์šด์˜์ž๋“ค์ด ์ ์šฉํ•˜๋Š” ๊ธฐ์กด ์ œ์–ด ์„ฑ๋Šฅ๊ณผ ๋น„๊ตํ•จ์œผ๋กœ์จ RL ์ œ์–ด๊ธฐ๊ฐ€ ์‹œ์Šคํ…œ์˜ ์ œ์–ด ํšจ์œจ์„ฑ์„ ํฌ๊ฒŒ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ ์—ฐํ•ฉ ๋ชจ๋ธ์€ ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜ ์ œ์–ด๊ธฐ์˜ ํ•™์Šต์„ ์œ„ํ•œ ๊ฐ€์ƒ ํ™˜๊ฒฝ์„ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ์Œ์„ ์ฆ๋ช…ํ•œ๋‹ค. DQN ์—์ด์ „ํŠธ์˜ ํ•ด์„์„ฑ์„ ๋†’์ด๊ธฐ ์œ„ํ•ด ์˜์‚ฌ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์—์ด์ „ํŠธ์˜ ์˜์‚ฌ๊ฒฐ์ • ํ”„๋กœ์„ธ์Šค์— ๋Œ€ํ•œ ์„ค๋ช…์„ ์ถ”์ถœํ•œ๋‹ค. ์—์ด์ „ํŠธ์—์„œ ์ƒ์„ฑ๋œ ์ƒํƒœ-์ž‘์—… (state-action) ์Œ์ด ์˜์‚ฌ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ํ›ˆ๋ จํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋œ๋‹ค. ์–•์ง€๋งŒ ์‰ฝ๊ฒŒ ํ•ด์„ํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•œ ์‚ฌํ›„ ํ•ด์„์€ ๊ฐ•ํ™” ํ•™์Šต์˜ ํˆฌ๋ช…์„ฑ๊ณผ ํ•ด์„์„ฑ์„ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค. ๋˜ํ•œ ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด๊ฐ€ ๋งŒ๋“  ๋ถ„๋ฅ˜ ๊ฒฐ๊ณผ๋Š” ์ธ๊ณต์ง€๋Šฅ์ด ๋งŒ๋“  ์ œ์–ด ์ „๋žต์„ ๋‹จ์ˆœํ™”์‹œํ‚จ 'If-then' ๊ทœ์น™์„ ๋„์ถœํ•œ๋‹ค. ์ถ”์ถœ๋œ ๊ทœ์น™ (reduced rule) ๊ธฐ๋ฐ˜ ์ œ์–ด์˜ ์„ฑ๋Šฅ๊ณผ DQN ์ œ์–ด๊ธฐ์˜ ์„ฑ๋Šฅ์„ ๋น„๊ตํ•˜์—ฌ ๋‘ ์ œ์–ด๊ธฐ ์‚ฌ์ด์˜ ์—๋„ˆ์ง€ ์ ˆ์•ฝ๋Ÿ‰ ์ฐจ์ด๊ฐ€ 2.8%๋กœ ๋ฏธ๋ฏธํ•จ์„ ๋ณด์ธ๋‹ค. ์ฆ‰, ๊ทœ์น™ ๊ธฐ๋ฐ˜ ์ œ์–ด๊ฐ€ ์ถฉ๋ถ„ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค๋Š” ๊ฒƒ์„ ์ฆ๋ช…ํ•œ๋‹ค. ๋ณธ ์—ฐ๊ตฌ๋Š” ๊ธฐ์ถ• ์‚ฌ๋ฌด์‹ค ๊ฑด๋ฌผ์˜ ๋ƒ‰๋ฐฉ ์ œ์–ด๋ฅผ ์œ„ํ•œ ์„ค๋ช… ๊ฐ€๋Šฅํ•œ RL์˜ ์ ์šฉ ๋ฐฉ์•ˆ์— ๋Œ€ํ•ด ์ˆ˜ํ–‰๋œ๋‹ค. 
์˜์‚ฌ ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ํ›ˆ๋ จ๋œ DQN ์—์ด์ „ํŠธ์— ์ ์šฉํ•œ ๋‹ค์Œ ์ผ๋ จ์˜ ๋‹จ์ˆœํ™”๋œ ์ œ์–ด ๊ทœ์น™์„ ๋„์ถœํ•œ๋‹ค. ์ด ์—ฐ๊ตฌ๋Š” ์„ค๋ช… ๊ฐ€๋Šฅํ•œ ๊ฐ•ํ™”ํ•™์Šต์„ ์ด์šฉํ•œ ์ •๋Ÿ‰ํ™”๋œ ๊ทœ์น™ ๋„์ถœ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ์•ˆํ•˜๊ณ , ๋ณต์žกํ•œ ๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ๋น„๊ตํ•˜์—ฌ ๋‹จ์ˆœํ•˜์ง€๋งŒ ์ •๋Ÿ‰์ ์ธ ํ‰๊ฐ€๊ฐ€ ์ˆ˜ํ–‰๋œ ๊ทœ์น™์ด ์ถฉ๋ถ„ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค„ ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค€๋‹ค. ์ด ์—ฐ๊ตฌ์˜ ์˜์˜๋Š” ๊ฑด๋ฌผ ํ†ต์ œ์— ๋Œ€ํ•œ ์ •๋Ÿ‰์  ํ‰๊ฐ€๋ฅผ ํ†ตํ•ด ๊ทœ์น™์„ ๋„์ถœํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•˜๋Š” ๋ฐ ์žˆ๋‹ค.Building controls are becoming complicated because modern building systems must respond to not only conventional systems like HVAC and lighting, but also to novel systems such as intermittent renewables, energy storage systems, and more. Therefore, the advanced building controllers must balance the trade-off between multiple objectives and automatically adapt to dynamic environment. Although it is widely acknowledged that reinforcement learning (RL) can be beneficially used for better building control, there are several challenges that should be addressed for real life application of RL: (1) unstable and poor control actions during early training period of RL may cause unexpected costs; (2) many RL-based control actions still remain unexplainable for daily practice of facility managers. By applying RL algorithms as artificial intelligences that are the subject of decision-making, owners and operators of buildings need to be reassured about the controllers intentions. To address the first challenge, federated model, a novel concept of simulation model, is proposed for pre-training RL agents. The federated model is an integrated data-driven model that divides a building system into several modules based on physical causality and develops each module into a data-driven model to perform simulations on building systems. A federated model of a complex cooling system of a target building is realized using six modules, each developed using data gathered from BEMS. By developing the federated model, limitations of physics-based simulation models (eg. topology rules, model calibration) are overcome. Deep Q-network (DQN) is applied to learn the dynamics of the cooling system and explore control strategies that can reduce energy use while providing cold for the building. By comparing the control performance of DQN with the performance of baseline control, it is shown that RL controller can significantly enhance control efficiency of the system and the federated model can provide sufficient virtual experience for the controller. To enhance interpretability of the DQN agent, decision tree is used to extract explanation of the decision making process of the agent. State-action pairs generated by the agent is used train a decision tree. Post-hoc interpretation using a shallow but easily interpretable model enhances transparency and interpretability of reinforcement learning. Also, the result of classification made by the decision tree provides If-then rules which are reduced version of control strategies made by the artificial intelligence. The performance of the reduced rule-based control is also compared to the performance of DQN controller. It is demonstrated that the reduced rule is good-enough and the difference in energy savings between the two is marginal, resulting in 2.8%. This study reports the development of explainable RL for cooling control of an existing office building. 
A decision tree is applied to trained DQN agent and then a set of reduced-order control rules are suggested. This study proposes rule reduction framework using explainable reinforcement learning and demonstrates that reduced rules can perform as well as complex reinforcement learning algorithms. The significance of this study lies in proposing how to derive rules with quantitative evaluation for building control.1. Introduction 1 1.1 Control of building systems 1 1.2 Problem Description 2 1.3 Goal 4 1.4 Thesis Outline 5 2. Deep Q-network (DQN) 7 2.1. Summary of reinforcement learning 7 2.1.1 Elements of reinforcement learning 7 2.1.2 Value function 9 2.2. Deep Q-learning 12 2.2.1 Temporal difference (TD) learning and Q-learning 12 2.2.2 Deep Q-learning 14 2.3. Previous works to implement reinforcement learning to existing buildings 16 2.4. Conclusion 19 3. Decision Trees 21 3.1 Summary of decision tree 21 3.2 Classification And Regression Trees (CART) 23 3.3 Interpreting reinforcement learning using decision tree 24 3.4 Conclusion 26 4. Target building and Federated model 27 4.1 Parallel cooling system 27 4.2 Federated model 31 5. Explainable deep Q-network and rule reduction for building control 40 5.1 DQN implementation framework 40 5.2 Control results of DQN 46 5.3 Rule reduction from DQN agent 50 5.4 Discussion 54 6. Conclusion 55 6.1 Summary and conclusion 55 6.2 Future works 57 Reference 58์„
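A minimal sketch of the rule-extraction step the abstract describes: a shallow decision tree is fit to state-action pairs logged from the trained DQN agent, and its branches are printed as if-then rules. The feature names, tree depth, and placeholder data are illustrative assumptions; in practice the states and actions would come from rollouts of the trained agent.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# States visited by the trained DQN agent and the actions it chose there
# (random placeholders here; in practice they come from rollouts of the trained agent).
states = np.random.rand(1000, 3)                # e.g. [outdoor_temp, cooling_load, supply_temp]
actions = np.random.randint(0, 2, size=1000)    # e.g. 0 = keep setpoint, 1 = raise setpoint

# A shallow tree keeps the extracted policy readable for facility managers.
tree = DecisionTreeClassifier(max_depth=3).fit(states, actions)
print(export_text(tree, feature_names=["outdoor_temp", "cooling_load", "supply_temp"]))
```

The depth bound is the knob that trades fidelity to the DQN policy against readability of the extracted rules.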

    Controller Indirect Learning Algorithm Using Experimental Implantation Technique

    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2017. 8. ์ด์ œํฌ.๋ฌผ๋ฆฌ ๊ธฐ๋ฐ˜ ์• ๋‹ˆ๋งค์ด์…˜์ด๋ž€ ๊ฐ€์ƒ์˜ ์บ๋ฆญํ„ฐ๋“ค์ด ๋ฌผ๋ฆฌ ๋ฒ•์น™์˜ ์ง€๋ฐฐ ํ•˜์—์„œ ์›€์ง์ด๋„๋ก ํ•˜๋Š” ๊ฒƒ์œผ๋กœ, ์›€์ง์ž„์— ํ˜„์‹ค์„ฑ์„ ๋ถ€์—ฌํ•จ์œผ๋กœ์จ ๋ณด๋Š” ์‚ฌ๋žŒ๋“ค๋กœ ํ•˜์—ฌ๊ธˆ ์ž์—ฐ์Šค๋Ÿฌ์šด ๋Š๋‚Œ์ด ๋“ค๊ฒŒ ํ•ด์ฃผ๋Š” ๊ธฐ๋ฒ•์ด๋‹ค. ํ˜„์žฌ ๊ฐ€์ƒ ์บ๋ฆญํ„ฐ์˜ ๋™์ž‘์„ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ€์žฅ ๋ณดํŽธ์ ์œผ๋กœ ์ด์šฉ๋˜๊ณ  ์žˆ๋Š” ๋ฐฉ๋ฒ•์€ ๋ชจ์…˜ ์บก์ณ ๊ธฐ๋ฒ•์ธ๋ฐ, ์ด ๋ฐฉ๋ฒ•์€ ํ˜„์‹ค์˜ ์‚ฌ๋žŒ์ด๋‚˜ ๋™๋ฌผ์ด ๋ฐฐ์šฐ๊ฐ€ ๋˜์–ด ์ง์ ‘ ์ดฌ์˜ํ•œ๋‹ค๋Š” ์ ์—์„œ ํ•„์—ฐ์ ์œผ๋กœ ๋ช‡ ๊ฐ€์ง€ ๋ฌผ๋ฆฌ์  ํ•œ๊ณ„๋ฅผ ๊ฐ–๋Š”๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ ๋‘ ๊ฐ€์ง€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ œ์•ˆํ•œ๋‹ค. ๋จผ์ € ์ฒซ ๋ฒˆ์งธ๋Š” ์›ํ•˜๋Š” ๋ฌผ๋ฆฌ ํ™˜๊ฒฝ๊ณผ ๊ฐ€์ƒ ์บ๋ฆญํ„ฐ๊ฐ€ ์žˆ์„ ๋•Œ, ์–ป๊ณ ์ž ํ•˜๋Š” ๋™์ž‘์˜ ์ข…๋ฅ˜์— ๋”ฐ๋ผ ์บ๋ฆญํ„ฐ์˜ ์›€์ง์ž„์— ๋Œ€ํ•œ ๋ณด์ƒ(reward) ์‹œ์Šคํ…œ๋งŒ ์ •ํ•ด์ฃผ๋ฉด ๊ฐ•ํ™”ํ•™์Šต์„ ํ†ตํ•ด ์ฃผ์–ด์ง„ ์กฐ๊ฑด์— ๋งž๋Š” ๋™์ž‘์„ ์ž๋™์œผ๋กœ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๋Š” ์ œ์–ด๊ธฐ๋ฅผ ํ•™์Šต์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ๋‘ ๋ฒˆ์งธ ์ œ์•ˆ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์ฒซ ๋ฒˆ์งธ์— ์ด์–ด์ง€๋Š” ๋‚ด์šฉ์œผ๋กœ, ์ฃผ์–ด์ง„ ํ™˜๊ฒฝ์—์„œ ์ž˜ ํ•™์Šต๋œ ๋™์ž‘ ์ œ์–ด๊ธฐ๋ฅผ ๊ฐ–๊ณ  ์žˆ์„ ๋•Œ, ํ˜•ํƒœ ๋ฐ ๊ตฌ์กฐ๋Š” ๋™์ผํ•˜์ง€๋งŒ ๋‹ค๋ฅธ ๋ฐฉ์‹์œผ๋กœ ํ™˜๊ฒฝ์„ ์ธ์‹ํ•˜๋Š” ๊ฐ€์ƒ ์บ๋ฆญํ„ฐ์˜ ์ œ์–ด๊ธฐ๋ฅผ ๋น ๋ฅด๊ฒŒ ํ•™์Šต์‹œํ‚ด์œผ๋กœ์จ ํ™˜๊ฒฝ ์ธ์‹ ์„ผ์„œ๋ฅผ ์ผ๋ฐ˜ํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ์‹คํ—˜์œผ๋กœ๋Š” ์žฅ์• ๋ฌผ์„ ํ”ผํ•ด ๋ชฉํ‘œ๋ฌผ๋กœ ๋น„ํ–‰ํ•˜๋Š” ๊ฐ€์ƒ ์บ๋ฆญํ„ฐ๋ฅผ ์ด์šฉํ•˜์—ฌ ์ด๋ฏธ ํ•™์Šต๋œ ์ œ์–ด๊ธฐ์˜ ๊ฒฝํ—˜์„ ํ†ตํ•ด ๊ฐ„์ ‘์ ์œผ๋กœ ํ•™์Šต๋œ ์ œ์–ด๊ธฐ์˜ ์„ฑ๋Šฅ์„ ๊ฒ€์ฆํ•˜์˜€๋‹ค.์ œ 1์žฅ ์„œ๋ก  1 ์ œ 2์žฅ ๊ด€๋ จ ์—ฐ๊ตฌ 5 2.1 ๋ฌผ๋ฆฌ ๊ธฐ๋ฐ˜ ์• ๋‹ˆ๋งค์ด์…˜ 5 2.2 ๊ฐ•ํ™”ํ•™์Šต์„ ์ด์šฉํ•œ ์ œ์–ด๊ธฐ ํ•™์Šต 7 ์ œ 3์žฅ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฐœ์š” 9 ์ œ 4์žฅ ์ดˆ๊ธฐ ์ตœ์ ํ™” ๊ถค์  ์ƒ์„ฑ 13 ์ œ 5์žฅ ์ง„ํ™”์  CACLA 17 ์ œ 6์žฅ ๊ฐ„์ ‘ ๊ฒฝํ—˜ ํ•™์Šต 20 ์ œ 7์žฅ ์‹คํ—˜ ๋ฐ ๊ฒฐ๊ณผ 24 ์ฐธ๊ณ ๋ฌธํ—Œ 27 Abstract 32Maste