
    μ‹œκ°μ  좔둠을 μœ„ν•œ λ©€ν‹°λͺ¨λ‹¬ μ…€ν”„μ–΄ν…μ…˜ λ„€νŠΈμ›Œν¬

Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Industrial Engineering, 2019. 8. Advisor: Sungzoon Cho.

Visual reasoning is more difficult than visual question answering because it requires sophisticated control of information from the image and the question. Information extracted from one source is used to extract information from the other, and this process alternates. This is natural, since even humans need multiple glimpses of the image and the question to answer a complicated natural-language question with multi-step reasoning: information obtained in earlier steps must be retained and used in later steps to reach the answer. Because of this difference, results on these two tasks tend not to correlate closely. In this paper, we propose the Multimodal Self-attention Network (MUSAN) for the visual reasoning task. Our model uses the Transformer encoder of [22] to promote intimate interactions between the image and the question at a fine-grained level. MUSAN achieved state-of-the-art performance on the CLEVR dataset from raw pixels, without prior knowledge or a pretrained feature extractor. MUSAN also ranked 8th in the 2019 GQA challenge without using functional-program or graph information. Attention visualizations show that MUSAN performs stepwise reasoning with its own logic.
λ³Έ λͺ¨λΈμ€ [22]κ°€ μ œμ•ˆν•œ 트렌슀포머 인코더λ₯Ό μ‚¬μš©ν•˜μ—¬ 세뢀적 인 μˆ˜μ€€μ—μ„œ 이미지와 질문 κ°„μ˜ κΈ΄λ°€ν•œ μƒν˜Έμž‘μš©μ„ μ΄‰μ§„ν•œλ‹€. MUSAN은 사전 지식 μ΄λ‚˜ 사전 ν›ˆλ ¨λœ 피쳐 μΆ”μΆœκΈ° 없이 μ›μ‹œ ν”½μ…€μ—μ„œ CLEVR λ°μ΄ν„°μ…‹μ˜ 졜고 μ„±λŠ₯을 λ‹¬μ„±ν–ˆλ‹€. 또 2019λ…„ GQA μ±Œλ¦°μ§€μ—μ„œ 문제 생성 ν•¨μˆ˜ μ •λ³΄λ‚˜ κ·Έλž˜ν”„ 정보 없이 8μœ„ λ₯Ό κΈ°λ‘ν–ˆλ‹€. MUSAN의 μ–΄νƒ μ…˜ μ‹œκ°ν™”λŠ” MUSAN이 μžμ‹ μ˜ λ…Όλ¦¬λ‘œ 단계적 좔둠을 μˆ˜ν–‰ν•œλ‹€λŠ” 것을 보여쀀닀.Chapter 1 Introduction 1 1.1 Multimodality 1 1.2 Visual Question Answering 2 1.3 Visual Reasoning 3 Chapter 2 Related Works 5 2.1 Visual Question Answering Models 5 2.1.1 Attention based models 6 2.1.2 Relation based models 9 2.1.3 Module based models 10 Chapter 3 Multimodal Self-Attention Network 14 3.1 Model Architecture 14 3.2 Input Representation 15 3.3 Transformer Encoder 16 3.3.1 Multi-Head Attention layer 17 3.3.2 Position-wise Feed Forward layer 18 3.3.3 Pooling layer 18 Chapter 4 Experiments 20 4.1 CLEVR 20 4.1.1 Dataset 21 4.1.2 Setting 22 4.1.3 Result 22 4.1.4 Analysis 24 4.2 GQA Dataset 29 4.2.1 Dataset 29 4.2.2 Setting 30 4.2.3 Result 31 Chapter 5 Conclusion 32Maste