Multimodal Self-attention Network for Visual Reasoning

Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Industrial Engineering, August 2019. Advisor: Sungzoon Cho.

Visual reasoning is more difficult than visual question answering because it requires sophisticated control of information from the image and the question. Information extracted from one source is used to extract information from the other, and this process alternates. This is natural, since even humans need multiple glimpses of the image and the question to solve a complicated natural-language question that requires multi-step reasoning: information obtained in earlier steps must be carried forward and used in later steps to reach the answer. Because of this difference, results on the two tasks tend not to correlate closely.
In this paper, we propose the Multimodal Self-attention Network (MUSAN) to solve the visual reasoning task. Our model uses the Transformer encoder of [22] to promote intimate interactions between the image and the question at a fine-grained level. MUSAN achieved state-of-the-art performance on the CLEVR dataset from raw pixels, without prior knowledge or a pretrained feature extractor. MUSAN also ranked 8th in the 2019 GQA challenge without using functional or graph information. Attention visualizations show that MUSAN performs stepwise reasoning with its own logic.
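The core idea the abstract describes -- letting image regions and question words interact freely inside one self-attention layer by feeding both modalities to a shared Transformer encoder -- can be illustrated with a minimal sketch. This is not the thesis implementation; the token counts, dimensions, and function names below are illustrative assumptions, and only a single attention head without the feed-forward and pooling layers is shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d_k, seed=0):
    """Single-head scaled dot-product self-attention over a token matrix.

    tokens: (n, d_model) array; returns (output, attention_weights).
    Projection matrices are random here -- in a trained model they are learned.
    """
    rng = np.random.default_rng(seed)
    d_model = tokens.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
                  for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n) attention weights
    return attn @ V, attn

# Hypothetical toy inputs: 9 image-region tokens and 5 question-word tokens,
# already projected to a shared embedding size, then concatenated so that
# attention can mix the two modalities within a single layer.
d_model = 16
image_tokens = np.random.default_rng(1).standard_normal((9, d_model))
question_tokens = np.random.default_rng(2).standard_normal((5, d_model))
multimodal = np.concatenate([image_tokens, question_tokens], axis=0)  # (14, d_model)

out, attn = self_attention(multimodal, d_k=16)
# Every token attends over all 14 positions, so an image region can attend
# to a question word and vice versa -- the cross-modal interaction MUSAN relies on.
assert attn.shape == (14, 14)
assert np.allclose(attn.sum(axis=1), 1.0)
```

Stacking several such layers (with multiple heads and position-wise feed-forward sublayers, as in Chapter 3) is what allows information extracted in one pass to condition extraction in the next, matching the alternating-reasoning picture given above.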
Chapter 1 Introduction 1
1.1 Multimodality 1
1.2 Visual Question Answering 2
1.3 Visual Reasoning 3
Chapter 2 Related Works 5
2.1 Visual Question Answering Models 5
2.1.1 Attention based models 6
2.1.2 Relation based models 9
2.1.3 Module based models 10
Chapter 3 Multimodal Self-Attention Network 14
3.1 Model Architecture 14
3.2 Input Representation 15
3.3 Transformer Encoder 16
3.3.1 Multi-Head Attention layer 17
3.3.2 Position-wise Feed Forward layer 18
3.3.3 Pooling layer 18
Chapter 4 Experiments 20
4.1 CLEVR 20
4.1.1 Dataset 21
4.1.2 Setting 22
4.1.3 Result 22
4.1.4 Analysis 24
4.2 GQA Dataset 29
4.2.1 Dataset 29
4.2.2 Setting 30
4.2.3 Result 31
Chapter 5 Conclusion 32