A Study on Mastering Complex Reasoning Abilities with Transformers: Applications to Visual, Conversational, and Mathematical Reasoning
Doctoral dissertation -- Seoul National University Graduate School, College of Engineering, Department of Industrial Engineering, February 2021. Advisor: Sungzoon Cho.

As deep learning models have advanced, research has shifted from simple classification tasks toward sophisticated tasks that require complex reasoning. These tasks demand multiple reasoning steps that resemble human intelligence. Architecture-wise, recurrent neural networks and convolutional neural networks have long been the mainstream models for deep learning, but both suffer from shortcomings inherent in their architectures. The attention-based Transformer is now replacing them owing to its superior architecture and performance. In particular, the Transformer encoder has been studied extensively in natural language processing. However, for the Transformer to be effective on data with distinct structures and characteristics, appropriate adjustments to its architecture are required. In this dissertation, we propose novel architectures based on the Transformer encoder for supervised learning tasks with different data types and characteristics: visual IQ tests, dialogue state tracking, and mathematical question answering. For visual IQ tests, the input is visual and hierarchically structured. To handle this, we propose a hierarchical Transformer encoder over structured representations, a novel neural network architecture that improves both perception and reasoning. The hierarchical arrangement of the Transformer encoders and the architecture of each individual encoder both fit the characteristics of visual IQ test data. Dialogue state tracking requires value prediction for multiple domain-slot pairs. To address this, we propose a dialogue state tracking model that uses a pre-trained language model, that is, a pre-trained Transformer encoder, for domain-slot relationship modeling. We introduce a special token for each domain-slot pair, which enables effective dependency modeling among domain-slot pairs through the pre-trained language encoder. Finally, for mathematical question answering, we propose a method to pre-train a Transformer encoder on the mathematical question answering dataset itself for improved performance. Our pre-training method, Question-Answer Masked Language Modeling, utilizes both the question and the answer text, which suits the mathematical question answering data. Through experiments, we show that each proposed method is effective for its corresponding task and data type.
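The hierarchical arrangement sketched in the abstract can be pictured as two stacked encoders: a lower Transformer encoder summarizes the objects detected within each panel, and an upper encoder reasons across the per-panel summaries. The dimensions, the mean pooling step, and the PyTorch modules below are illustrative assumptions, not the dissertation's actual configuration.

```python
# Minimal sketch of a two-level hierarchy of Transformer encoders (assumed setup).
import torch
import torch.nn as nn

d_model, n_panels, n_objects = 128, 9, 6   # e.g. a 3x3 RPM-style matrix (illustrative)

lower_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
lower = nn.TransformerEncoder(lower_layer, num_layers=2)   # within-panel encoder
upper_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
upper = nn.TransformerEncoder(upper_layer, num_layers=2)   # across-panel encoder

# Structured representation from a perception module: one vector per detected object.
objects = torch.randn(n_panels, n_objects, d_model)

panel_repr = lower(objects).mean(dim=1)        # (n_panels, d_model), one summary per panel
matrix_repr = upper(panel_repr.unsqueeze(0))   # (1, n_panels, d_model), cross-panel reasoning
print(matrix_repr.shape)
```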
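For the dialogue state tracking model, the idea of a dedicated special token per domain-slot pair can be sketched as follows. The token names, the example domain-slot list, and the use of the Hugging Face transformers BERT classes are assumptions made for illustration only, not the dissertation's exact input format.

```python
# Minimal sketch: add one special token per domain-slot pair so a pre-trained
# encoder can model slot-context and slot-slot dependencies in one sequence.
import torch
from transformers import BertTokenizer, BertModel

domain_slots = ["hotel-area", "hotel-pricerange", "train-departure"]  # example pairs
slot_tokens = [f"[{ds.upper()}]" for ds in domain_slots]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": slot_tokens})

model = BertModel.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))  # make room for the new tokens

dialogue = "i need a cheap hotel in the centre ."
# Dialogue context and all domain-slot tokens share one sequence, so self-attention
# can relate every slot to the context and to the other slots.
text = dialogue + " " + " ".join(slot_tokens)
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

# The hidden state at each special-token position would feed a slot-gate /
# slot-value classifier for that domain-slot pair.
slot_ids = tokenizer.convert_tokens_to_ids(slot_tokens)
positions = [(inputs["input_ids"][0] == i).nonzero(as_tuple=True)[0] for i in slot_ids]
slot_vectors = torch.stack([hidden[0, p[0]] for p in positions])
print(slot_vectors.shape)  # (num_domain_slots, hidden_size)
```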
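Question-Answer Masked Language Modeling can likewise be sketched as standard masked language modeling applied to a concatenated question-answer pair, so that masked positions may fall in either segment. The 15% masking rate, the example question, and the BERT masked-LM model are common defaults assumed here, not the dissertation's exact settings.

```python
# Minimal sketch of masked language modeling over a question-answer pair (assumed setup).
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

question = "What is 4 + 17 * 2 ?"
answer = "38"
# Question and answer are encoded as one pair so masks can fall in either segment,
# forcing the encoder to model the answer text jointly with the question text.
enc = tokenizer(question, answer, return_tensors="pt")

input_ids = enc["input_ids"].clone()
labels = input_ids.clone()

special = torch.tensor(
    tokenizer.get_special_tokens_mask(input_ids[0].tolist(), already_has_special_tokens=True)
).bool()
prob = torch.full(input_ids.shape, 0.15)
prob[0, special] = 0.0                      # never mask [CLS] / [SEP]
mask = torch.bernoulli(prob).bool()

labels[~mask] = -100                        # compute loss only on masked positions
input_ids[mask] = tokenizer.mask_token_id   # replace masked tokens with [MASK]

loss = model(input_ids=input_ids, attention_mask=enc["attention_mask"],
             token_type_ids=enc["token_type_ids"], labels=labels).loss
print(float(loss))
```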
Abstract
Contents
List of Tables
List of Figures
Chapter 1  Introduction
Chapter 2  Literature Review
  2.1  Related Works on Transformer
  2.2  Related Works on Visual IQ Tests
    2.2.1  RPM-related studies
    2.2.2  Object Detection-related studies
  2.3  Related Works on Dialogue State Tracking
  2.4  Related Works on Mathematical Question Answering
    2.4.1  Pre-training of Neural Networks
    2.4.2  Language Model Pre-training
    2.4.3  Mathematical Reasoning with Neural Networks
Chapter 3  Hierarchical end-to-end architecture of Transformer encoders for solving visual IQ tests
  3.1  Background
    3.1.1  Perception
    3.1.2  Reasoning
  3.2  Proposed Model
    3.2.1  Perception Module: Object Detection Model
    3.2.2  Reasoning Module: Hierarchical Transformer Encoder
    3.2.3  Contrasting Module and Loss function
  3.3  Experimental results
    3.3.1  Dataset
    3.3.2  Experimental Setup
    3.3.3  Results for Perception Module
    3.3.4  Results for Reasoning Module
    3.3.5  Ablation studies
  3.4  Chapter Summary
Chapter 4  Domain-slot relationship modeling using Transformers for dialogue state tracking
  4.1  Background
  4.2  Proposed Method
    4.2.1  Domain-Slot-Context Encoder
    4.2.2  Slot-gate classifier
    4.2.3  Slot-value classifier
    4.2.4  Total objective function
  4.3  Experimental Results
    4.3.1  Dataset
    4.3.2  Experimental Setup
    4.3.3  Results for the MultiWOZ-2.1 dataset
    4.3.4  Ablation Studies
  4.4  Chapter Summary
Chapter 5  Pre-training of Transformers with Question-Answer Masked Language Modeling for Mathematical Question Answering
  5.1  Background
  5.2  Proposed Method
    5.2.1  Pre-training: Question-Answer Masked Language Modeling
    5.2.2  Fine-tuning: Mathematical Question Answering
  5.3  Experimental Results
    5.3.1  Dataset
    5.3.2  Experimental Setup
    5.3.3  Experimental Results on the Mathematics dataset
  5.4  Chapter Summary
Chapter 6  Conclusion
  6.1  Contributions
  6.2  Future Work
Bibliography
Abstract (in Korean)
Acknowledgements