Data augmentation techniques for the Video Question Answering task
Video Question Answering (VideoQA) is a task that requires a model to analyze and understand both the visual content given by the input video and the textual part given by the question, as well as the interaction between them, in order to produce a meaningful answer. In our work we focus on the Egocentric VideoQA task, which uses first-person videos, because of its potential impact on many different fields, such as social assistance and industrial training. Recently, an Egocentric VideoQA dataset, called EgoVQA, has been released. Given its small size, models tend to overfit quickly. To alleviate this problem, we propose several augmentation techniques which give us a +5.5% improvement in final accuracy over the considered baseline.
Comment: 16 pages, 5 figures; to be published in Egocentric Perception, Interaction and Computing (EPIC) Workshop Proceedings, at ECCV 2020
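The abstract does not name the specific augmentation techniques. As a minimal, purely hypothetical sketch of what frame-level and question-level augmentation of a VideoQA training pair can look like (all names below are illustrative assumptions, not the paper's method):

```python
import random

# Hypothetical illustration only: the abstract above does not specify which
# augmentation techniques the paper uses. This sketches two generic ones
# for VideoQA training pairs (frame dropping, synonym replacement).

def augment_frames(frames, drop_prob=0.1):
    # Temporal augmentation: randomly drop frames so the model sees
    # slightly different samplings of the same clip each epoch.
    kept = [f for f in frames if random.random() > drop_prob]
    return kept if kept else frames  # never return an empty clip

def augment_question(tokens, synonyms):
    # Textual augmentation: swap tokens for synonyms where available.
    # `synonyms` maps a token to a list of interchangeable alternatives.
    return [random.choice(synonyms[t]) if t in synonyms else t
            for t in tokens]

# One augmented (video, question) training pair.
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
tokens = "what is the person holding".split()
pair = (augment_frames(frames),
        augment_question(tokens, {"person": ["man", "woman"]}))
```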
Motion-Appearance Synergistic Networks for Video Question Answering
Video Question Answering is a task which requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize them depending on the question's intentions. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes the outputs of the motion module and the appearance module as input and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN.
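A minimal sketch of the question-guided fusion step described above, assuming simple linear gating over pooled per-modality features (the layer choices and dimensions are assumptions, not the published MASN architecture):

```python
import torch
import torch.nn as nn

class QuestionGuidedFusion(nn.Module):
    # Sketch of question-guided fusion: the question representation decides
    # how much weight the motion vs. appearance features receive.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, 2)  # one score per modality

    def forward(self, motion_feat, appearance_feat, question_feat):
        # All inputs: (batch, dim). Softmax turns the two scores into
        # modality weights that sum to 1 for each example.
        w = torch.softmax(self.gate(question_feat), dim=-1)
        return w[:, 0:1] * motion_feat + w[:, 1:2] * appearance_feat

# Toy usage with random features.
fusion = QuestionGuidedFusion(dim=512)
motion, appearance, question = (torch.randn(4, 512) for _ in range(3))
fused = fusion(motion, appearance, question)  # (4, 512)
```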
Chapter 1 Introduction
Section 1 Background of the Study
Section 2 Contents of the Study
Chapter 2 Background Research
Section 1 Question Answering Models Based on Visual Information
Section 2 Action Classification Models
Section 3 Attention Mechanism
Chapter 3 Motion-Appearance Synergistic Networks
Section 1 Visual and Language Representations
Section 2 Motion and Appearance Modules
Section 3 Motion-Appearance Fusion Module
Section 4 Answer Inference and Objective Function
Chapter 4 Experiments and Results
Section 1 Training Data
Section 2 Training Conditions
Section 3 Comparison with State-of-the-Art Approaches
Section 4 Evaluation of Each Module's Contribution
Section 5 Qualitative Evaluation
Chapter 5 Conclusion and Suggestions
References
Abstract