Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers
Given an input video, its associated audio, and a brief caption, the
audio-visual scene-aware dialog (AVSD) task requires an agent to engage in a
question-answer dialog with a human about the audio-visual content. The task
thus poses a challenging multi-modal representation learning and reasoning
scenario, and advances on it could influence several human-machine
interaction applications. To solve this task, we introduce a
semantics-controlled multi-modal shuffled Transformer reasoning framework,
consisting of a sequence of Transformer modules, each taking a modality as
input and producing representations conditioned on the input question. Our
proposed Transformer variant applies a shuffling scheme to its multi-head
outputs, which yields better regularization. To encode fine-grained visual
information, we present a novel dynamic scene graph representation learning
pipeline that consists of an intra-frame reasoning layer producing
spatio-semantic graph representations for every frame, and an inter-frame
aggregation module capturing temporal cues. Our entire pipeline is trained
end-to-end. We present experiments on the benchmark AVSD dataset, both on
answer generation and selection tasks. Our results demonstrate state-of-the-art
performance on all evaluation metrics.
Comment: Accepted at AAAI 2021
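
The abstract does not detail the shuffling scheme; a minimal sketch of one
plausible reading, permuting the per-head attention outputs before the shared
output projection during training, is given below. This is assumed PyTorch
code for illustration, not the authors' implementation; the class name and
shapes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShuffledMultiHeadAttention(nn.Module):
    """Self-attention whose per-head outputs are randomly permuted across
    heads during training, before the shared output projection."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each of q, k, v to (B, num_heads, T, head_dim).
        def split(t):
            return t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        heads = attn @ v  # (B, num_heads, T, head_dim)
        if self.training:
            # Shuffle the head order so each head feeds a different slice of
            # the output projection on each step, acting as a regularizer.
            perm = torch.randperm(self.num_heads, device=x.device)
            heads = heads[:, perm]
        out = heads.transpose(1, 2).reshape(B, T, D)
        return self.out_proj(out)
```

A forward pass such as `ShuffledMultiHeadAttention(256, 8)(torch.randn(2, 10, 256))`
exercises the shuffle when the module is in training mode.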
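The two-stage visual pipeline can likewise be pictured schematically: an
intra-frame graph layer refines object features within each frame, and a
recurrent module aggregates the resulting frame vectors over time. The sketch
below assumes a fully connected object graph and a GRU as the temporal
aggregator; neither choice is specified by the abstract.

```python
import torch
import torch.nn as nn

class IntraFrameGraphLayer(nn.Module):
    """One round of message passing over fully connected object nodes."""

    def __init__(self, d: int):
        super().__init__()
        self.msg = nn.Linear(2 * d, d)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (num_objects, d) object features for a single frame.
        n, d = nodes.shape
        pairs = torch.cat(
            [nodes.unsqueeze(1).expand(n, n, d),
             nodes.unsqueeze(0).expand(n, n, d)], dim=-1)
        messages = torch.relu(self.msg(pairs)).mean(dim=1)
        return nodes + messages  # residual node update

class DynamicSceneGraphEncoder(nn.Module):
    """Per-frame graph reasoning followed by temporal aggregation."""

    def __init__(self, d: int):
        super().__init__()
        self.intra = IntraFrameGraphLayer(d)
        self.inter = nn.GRU(d, d, batch_first=True)

    def forward(self, frames: list) -> torch.Tensor:
        # frames: list of (num_objects, d) tensors, one per video frame.
        frame_vecs = torch.stack([self.intra(f).mean(dim=0) for f in frames])
        out, _ = self.inter(frame_vecs.unsqueeze(0))  # (1, num_frames, d)
        return out.squeeze(0)
```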
Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues
Compared to traditional visual question answering, video-grounded dialogues
require additional reasoning over dialogue context to answer questions in a
multi-turn setting. Previous approaches to video-grounded dialogues mostly use
dialogue context as a simple text input without modelling the inherent
information flows at the turn level. In this paper, we propose a novel
framework of Reasoning Paths in Dialogue Context (PDC). The PDC model
discovers information flows among dialogue turns through a semantic graph
constructed from lexical components in each question and answer. It then learns
to predict reasoning paths over this semantic graph. Our path prediction model
predicts a path from the current turn through past dialogue turns that contain
additional visual cues to answer the current question. Our reasoning model
sequentially processes both visual and textual information through this
reasoning path, and the propagated features are used to generate the answer. Our
experimental results demonstrate the effectiveness of our method and provide
additional insights on how models use semantic dependencies in a dialogue
context to retrieve visual cues.
Comment: Accepted at ICLR (International Conference on Learning Representations) 2021
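
To make the graph-and-path idea concrete, here is a toy sketch. It is
hand-written for illustration and is not the PDC model, which learns its path
predictor; turns are nodes, edges come from shared content words, and a
greedy backward walk stands in for learned path selection.

```python
# Toy semantic graph over dialogue turns, built from lexical overlap.
STOPWORDS = {"the", "a", "an", "is", "it", "he", "of", "in",
             "and", "to", "what", "does"}

def content_words(text: str) -> set:
    return {w.strip("?.,!").lower() for w in text.split()} - STOPWORDS

def build_semantic_graph(turns):
    """Edge weight = number of content words shared with an earlier turn."""
    words = [content_words(t) for t in turns]
    graph = {i: {} for i in range(len(turns))}
    for i in range(len(turns)):
        for j in range(i):
            overlap = len(words[i] & words[j])
            if overlap:
                graph[i][j] = overlap
    return graph

def reasoning_path(graph, current, max_len=3):
    """Greedily follow the strongest lexical link back through the dialogue."""
    path, node = [current], current
    while len(path) < max_len and graph[node]:
        node = max(graph[node], key=graph[node].get)
        path.append(node)
    return path

turns = [
    "What is the man holding?",
    "He is holding a red guitar.",
    "Does he play the guitar later?",
]
print(reasoning_path(build_semantic_graph(turns), current=2))  # [2, 1, 0]
```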