Multimodal Self-attention Network for Visual Reasoning

Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Industrial Engineering, August 2019. Advisor: Sungzoon Cho.

Visual reasoning is more difficult than visual question answering because it requires sophisticated control of information from the image and the question. Information extracted from one source is used to extract information from the other, and this process alternates. This is natural, since even humans need multiple glimpses of the image and the question to solve a complicated natural-language question that requires multi-step reasoning: information obtained in earlier steps must be carried forward and used in later steps to reach the answer. Because of this difference, results on the two tasks tend not to correlate closely.
In this paper, we propose the Multimodal Self-attention Network (MUSAN) to solve the visual reasoning task. Our model uses the Transformer encoder of [22] to promote intimate interactions between the image and the question at a fine-grained level. MUSAN achieved state-of-the-art performance on the CLEVR dataset from raw pixels, without prior knowledge or a pretrained feature extractor. MUSAN also ranked 8th in the 2019 GQA challenge without using functional or graph information. Attention visualizations show that MUSAN performs stepwise reasoning with its own logic.
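The core idea the abstract describes -- letting image regions and question words interact freely inside one self-attention layer by feeding both modalities to a shared Transformer encoder -- can be illustrated with a minimal sketch. This is not the thesis implementation; the token counts, dimensions, and function names below are illustrative assumptions, and only a single attention head without the feed-forward and pooling layers is shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d_k, seed=0):
    """Single-head scaled dot-product self-attention over a token matrix.

    tokens: (n, d_model) array; returns (output, attention_weights).
    Projection matrices are random here -- in a trained model they are learned.
    """
    rng = np.random.default_rng(seed)
    d_model = tokens.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
                  for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n) attention weights
    return attn @ V, attn

# Hypothetical toy inputs: 9 image-region tokens and 5 question-word tokens,
# already projected to a shared embedding size, then concatenated so that
# attention can mix the two modalities within a single layer.
d_model = 16
image_tokens = np.random.default_rng(1).standard_normal((9, d_model))
question_tokens = np.random.default_rng(2).standard_normal((5, d_model))
multimodal = np.concatenate([image_tokens, question_tokens], axis=0)  # (14, d_model)

out, attn = self_attention(multimodal, d_k=16)
# Every token attends over all 14 positions, so an image region can attend
# to a question word and vice versa -- the cross-modal interaction MUSAN relies on.
assert attn.shape == (14, 14)
assert np.allclose(attn.sum(axis=1), 1.0)
```

Stacking several such layers (with multiple heads and position-wise feed-forward sublayers, as in Chapter 3) is what allows information extracted in one pass to condition extraction in the next, matching the alternating-reasoning picture given above.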
Chapter 1 Introduction 1
1.1 Multimodality 1
1.2 Visual Question Answering 2
1.3 Visual Reasoning 3
Chapter 2 Related Works 5
2.1 Visual Question Answering Models 5
2.1.1 Attention based models 6
2.1.2 Relation based models 9
2.1.3 Module based models 10
Chapter 3 Multimodal Self-Attention Network 14
3.1 Model Architecture 14
3.2 Input Representation 15
3.3 Transformer Encoder 16
3.3.1 Multi-Head Attention layer 17
3.3.2 Position-wise Feed Forward layer 18
3.3.3 Pooling layer 18
Chapter 4 Experiments 20
4.1 CLEVR 20
4.1.1 Dataset 21
4.1.2 Setting 22
4.1.3 Result 22
4.1.4 Analysis 24
4.2 GQA Dataset 29
4.2.1 Dataset 29
4.2.2 Setting 30
4.2.3 Result 31
Chapter 5 Conclusion 32