7 research outputs found

    Structured Co-reference Graph Attention for Video-grounded Dialogue

    Full text link
    A video-grounded dialogue system referred to as Structured Co-reference Graph Attention (SCGA) is presented for decoding the answer sequence to a question about a given video while keeping track of the dialogue context. Although recent efforts have made great strides in improving the quality of the response, performance is still far from satisfactory. The two main challenging issues are: (1) how to deduce co-reference among multiple modalities, and (2) how to reason over the rich underlying semantic structure of video with complex spatial and temporal dynamics. To this end, SCGA is based on (1) a Structured Co-reference Resolver that performs dereferencing by building a structured graph over multiple modalities, and (2) a Spatio-temporal Video Reasoner that captures local-to-global dynamics of the video via gradually neighboring graph attention. SCGA makes use of a pointer network to dynamically replicate parts of the question when decoding the answer sequence. The validity of the proposed SCGA is demonstrated on the AVSD@DSTC7 and AVSD@DSTC8 datasets, two challenging video-grounded dialogue benchmarks, and on the TVQA dataset, a large-scale VideoQA benchmark. Our empirical results show that SCGA outperforms other state-of-the-art dialogue systems on both benchmarks, while extensive ablation studies and qualitative analysis reveal the performance gains and improved interpretability.
    Comment: Accepted to AAAI202
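
    As a rough illustration of the pointer-network decoding mentioned in the abstract (the layer names, dimensions, and gating scheme below are assumptions, not SCGA's published code), a single decoding step could mix a generated vocabulary distribution with a copy distribution over the question tokens:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerAnswerDecoderStep(nn.Module):
    """Minimal sketch of one decoding step that can copy question tokens.

    Illustrative only: dimensions and gating are assumptions, not SCGA's
    actual implementation.
    """

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.vocab_proj = nn.Linear(hidden_dim, vocab_size)
        self.copy_gate = nn.Linear(2 * hidden_dim, 1)

    def forward(self, dec_state, question_enc, question_ids):
        # dec_state: (B, H) decoder hidden state
        # question_enc: (B, Lq, H) encoded question tokens
        # question_ids: (B, Lq) vocabulary ids of the question tokens
        attn = F.softmax(
            torch.bmm(question_enc, dec_state.unsqueeze(2)).squeeze(2), dim=-1
        )                                                                # (B, Lq) copy attention
        context = torch.bmm(attn.unsqueeze(1), question_enc).squeeze(1)  # (B, H)
        p_gen = torch.sigmoid(self.copy_gate(torch.cat([dec_state, context], dim=-1)))
        vocab_dist = F.softmax(self.vocab_proj(dec_state), dim=-1)       # (B, V)
        # Scatter the attention mass onto the vocabulary ids of the question tokens.
        copy_dist = torch.zeros_like(vocab_dist).scatter_add(1, question_ids, attn)
        return p_gen * vocab_dist + (1.0 - p_gen) * copy_dist            # final word distribution
```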

    Motion-Appearance Synergistic Networks for Video Question Answering

    Get PDF
    Video Question Answering is a task which requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize them depending on the question's intention. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes the outputs of the motion module and the appearance module as input and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN.
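
    As a rough sketch of the question-guided fusion idea described above (module names, shapes, and the scoring function are assumptions, not MASN's released code), the fused feature could be an attention-weighted sum of the motion and appearance representations, with weights computed from the question:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedFusion(nn.Module):
    """Illustrative question-guided fusion of motion and appearance features."""

    def __init__(self, dim: int):
        super().__init__()
        # Scores each modality-specific feature conditioned on the question.
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, motion_feat, appearance_feat, question_feat):
        # motion_feat, appearance_feat, question_feat: (B, D)
        feats = torch.stack([motion_feat, appearance_feat], dim=1)      # (B, 2, D)
        q = question_feat.unsqueeze(1).expand_as(feats)                 # (B, 2, D)
        logits = self.score(torch.cat([feats, q], dim=-1)).squeeze(-1)  # (B, 2)
        weights = F.softmax(logits, dim=-1)                             # question-dependent weights
        return (weights.unsqueeze(-1) * feats).sum(dim=1)               # (B, D) fused feature
```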

    Video question answering supported by a multi-task learning objective

    Get PDF
    Video Question Answering (VideoQA) concerns the realization of models able to analyze a video and produce a meaningful answer to questions about its visual content. To encode the given question, word embedding techniques are used to compute a representation of the tokens suitable for neural networks. Yet almost all works in the literature use the same technique, although recent advancements in NLP have brought better solutions. This lack of analysis is a major shortcoming. To address it, in this paper we present a twofold contribution concerning question encoding. First, we integrate four of the most popular word embedding techniques into three recent VideoQA architectures and investigate how they influence performance on two public datasets: EgoVQA and PororoQA. Thanks to the learning process, we show that the embeddings carry question type-dependent characteristics. Second, to leverage this result, we propose a simple yet effective multi-task learning protocol which uses an auxiliary task defined on the question types. With the proposed learning strategy, significant improvements are observed in most combinations of network architecture and embedding under analysis.
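
    The multi-task protocol described above pairs the main answering objective with an auxiliary question-type classification task. A minimal sketch of such an objective, assuming a shared question encoding and a weighted sum of cross-entropy losses (the heads, dimensions, and auxiliary weight are illustrative, not the paper's exact setup), might look like this:

```python
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Answer prediction plus an auxiliary question-type classifier (sketch)."""

    def __init__(self, hidden_dim: int, num_answers: int, num_question_types: int):
        super().__init__()
        self.answer_head = nn.Linear(hidden_dim, num_answers)
        self.qtype_head = nn.Linear(hidden_dim, num_question_types)

    def forward(self, fused_feat, question_feat):
        # fused_feat: (B, H) video-question representation; question_feat: (B, H)
        return self.answer_head(fused_feat), self.qtype_head(question_feat)

def multitask_loss(answer_logits, qtype_logits, answer_labels, qtype_labels, aux_weight=0.5):
    # Main VideoQA loss plus a weighted auxiliary question-type loss.
    ce = nn.CrossEntropyLoss()
    return ce(answer_logits, answer_labels) + aux_weight * ce(qtype_logits, qtype_labels)
```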

    Reasoning with Heterogeneous Graph Alignment for Video Question Answering

    No full text
    The dominant video question answering methods are based on fine-grained representations or model-specific attention mechanisms. They usually process the video and the question separately, then feed the representations of the different modalities into late fusion networks. Although these methods use information from one modality to boost the other, they neglect to integrate both inter- and intra-modality correlations in a uniform module. We propose a deep heterogeneous graph alignment network over the video shots and question words. Furthermore, we explore the network architecture in four steps: representation, fusion, alignment, and reasoning. Within our network, inter- and intra-modality information can be aligned and interacted simultaneously over the heterogeneous graph and used for cross-modal reasoning. We evaluate our method on three benchmark datasets and conduct extensive ablation studies on the effectiveness of the network architecture. Experiments show the superiority of the proposed network.
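
    To make the alignment idea concrete, the sketch below treats video shots and question words as one joint node set and applies a single dense attention update over it, so intra- and inter-modality edges are handled in a uniform module. This is an assumption-laden illustration of the abstract's description, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeterogeneousGraphLayer(nn.Module):
    """One attention step over a joint video-word graph (illustrative sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, video_nodes, word_nodes):
        # video_nodes: (B, Nv, D) shot features; word_nodes: (B, Nw, D) word features
        nodes = torch.cat([video_nodes, word_nodes], dim=1)    # (B, Nv+Nw, D) joint node set
        q, k, v = self.query(nodes), self.key(nodes), self.value(nodes)
        # Dense soft adjacency covering intra- and inter-modality edges at once.
        adj = F.softmax(q @ k.transpose(1, 2) / nodes.size(-1) ** 0.5, dim=-1)
        return nodes + adj @ v                                 # residual graph update
```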
