9 research outputs found

    Data augmentation techniques for the Video Question Answering task

    Video Question Answering (VideoQA) is a task that requires a model to analyze and understand both the visual content given by the input video and the textual part given by the question, as well as the interaction between them, in order to produce a meaningful answer. In our work we focus on the Egocentric VideoQA task, which exploits first-person videos; this task is important because of its potential impact on many different fields, such as social assistance and industrial training. Recently, an Egocentric VideoQA dataset, called EgoVQA, has been released. Given its small size, models tend to overfit quickly. To alleviate this problem, we propose several augmentation techniques which give us a +5.5% improvement in final accuracy over the considered baseline.

    Comment: 16 pages, 5 figures; to be published in Egocentric Perception, Interaction and Computing (EPIC) Workshop Proceedings, at ECCV 2020
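    The abstract does not spell out the individual techniques, so as a rough illustration of the kind of video-side augmentations typically used to fight overfitting on a small VideoQA dataset, here is a minimal PyTorch sketch of temporal jitter and horizontal flipping on a clip tensor. The function names and the (T, C, H, W) layout are assumptions for illustration, not the paper's method.

```python
import torch

def temporal_jitter(clip: torch.Tensor, max_drop: int = 2) -> torch.Tensor:
    """Randomly drop up to max_drop frames, then repeat the last frame so the
    clip keeps its original length. Expects clip with shape (T, C, H, W)."""
    t = clip.shape[0]
    n_drop = int(torch.randint(0, max_drop + 1, (1,)))
    if n_drop == 0:
        return clip
    keep = torch.randperm(t)[: t - n_drop].sort().values  # ordered frame subset
    jittered = clip[keep]
    pad = jittered[-1:].repeat(n_drop, 1, 1, 1)           # re-pad to length T
    return torch.cat([jittered, pad], dim=0)

def horizontal_flip(clip: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Flip every frame left-right with probability p (width is the last dim)."""
    return clip.flip(-1) if torch.rand(1).item() < p else clip
```

    Note that a flip like this is only safe when neither the question nor the answer refers to left/right; in general the QA text would need to be checked before applying it.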

    Motion-Appearance Synergistic Networks for Video Question Answering

    Video Question Answering is a task which requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize them depending on the question's intentions. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes the outputs of the motion module and the appearance module as input and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN.
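    The three-module design described above lends itself to a compact sketch. Below is a minimal, illustrative PyTorch version of the question-guided fusion step only, assuming the motion and appearance modules have each already produced one feature vector; the class name, dimensions, and the linear scorer are assumptions made here for illustration, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class QuestionGuidedFusion(nn.Module):
    """Illustrative sketch of question-guided fusion in the spirit of MASN:
    the question embedding scores each cross-modal stream (motion vs.
    appearance), and the streams are combined with the resulting weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # scores one (question, stream) pair

    def forward(self, motion: torch.Tensor, appearance: torch.Tensor,
                question: torch.Tensor) -> torch.Tensor:
        # motion, appearance, question: (batch, dim)
        streams = torch.stack([motion, appearance], dim=1)     # (B, 2, D)
        scores = self.scorer(streams * question.unsqueeze(1))  # (B, 2, 1)
        weights = torch.softmax(scores, dim=1)                 # weight per stream
        return (weights * streams).sum(dim=1)                  # (B, D) fused feature
```

    The intent matches the abstract's description: an action-oriented question (e.g., "what does the person do after...") can upweight the motion stream, while an appearance-oriented question (e.g., "what color is...") shifts weight to the appearance stream.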