DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog
Visual Dialog is a vision-language task that requires an AI agent to engage
in a conversation with humans grounded in an image. It remains a challenging
task since it requires the agent to fully understand a given question before
making an appropriate response, drawing not only on the textual dialog history
but also on the visually grounded information. Previous models typically
leverage single-hop or single-channel reasoning to handle this complex
multimodal reasoning task, which is intuitively insufficient. In this
paper, we thus propose a novel and more powerful Dual-channel Multi-hop
Reasoning Model for Visual Dialog, named DMRM. DMRM synchronously captures
information from the dialog history and the image to enrich the semantic
representation of the question by exploiting dual-channel reasoning.
Specifically, DMRM maintains a dual channel to obtain the question- and
history-aware image features and the question- and image-aware dialog history
features through a multi-hop reasoning process in each channel. Additionally, we
design an effective multimodal attention mechanism to further enhance the
decoder to generate more accurate responses. Experimental results on the VisDial v0.9 and
v1.0 datasets demonstrate that the proposed model is effective and outperforms
the compared models by a significant margin.
Comment: Accepted at AAAI 2020
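To make the dual-channel multi-hop idea concrete, here is a minimal PyTorch sketch of one reasoning channel, in which a question query attends to a set of context features (image regions in one channel, dialog history in the other) and is refined after each hop. The module names, dimensions, and the additive attention form are illustrative assumptions, not the authors' code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiHopChannel(nn.Module):
        """One reasoning channel (illustrative): a question query attends
        to context features and is refined over several hops."""
        def __init__(self, dim, hops=2):
            super().__init__()
            self.hops = hops
            self.proj_q = nn.Linear(dim, dim)
            self.proj_c = nn.Linear(dim, dim)
            self.score = nn.Linear(dim, 1)
            self.update = nn.Linear(2 * dim, dim)

        def forward(self, q, ctx):
            # q: (B, D) question embedding; ctx: (B, N, D) context features
            # (image regions in one channel, history utterances in the other)
            for _ in range(self.hops):
                joint = torch.tanh(self.proj_q(q).unsqueeze(1) + self.proj_c(ctx))
                alpha = F.softmax(self.score(joint).squeeze(-1), dim=-1)  # (B, N)
                attended = torch.bmm(alpha.unsqueeze(1), ctx).squeeze(1)  # (B, D)
                # refine the query with the attended context for the next hop
                q = torch.tanh(self.update(torch.cat([q, attended], dim=-1)))
            return q

In DMRM's terms, two such channels would run in parallel, one over image features conditioned on the history and one over history features conditioned on the image, and their outputs would enrich the question representation passed to the decoder.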
DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue
Different from the Visual Question Answering task, which requires answering only
one question about an image, Visual Dialogue involves multiple questions that
cover a broad range of visual content related to any objects,
relationships, or semantics. The key challenge in the Visual Dialogue task is thus
to learn a more comprehensive and semantics-rich image representation that can
adaptively attend to the image for varying questions. In this research,
we propose a novel model to depict an image from both visual and semantic
perspectives. Specifically, the visual view helps capture the appearance-level
information, including objects and their relationships, while the semantic view
enables the agent to understand high-level visual semantics from the whole
image to the local regions. Furthermore, on top of such multi-view image
features, we propose a feature selection framework that adaptively
captures question-relevant information hierarchically at a fine-grained level. The
proposed method achieved state-of-the-art results on benchmark Visual Dialogue
datasets. More importantly, by visualizing the gate values we can tell which
modality (visual or semantic) contributes more to answering the current
question, which offers insight into human cognition in Visual Dialogue.
Comment: Accepted by the Thirty-Fourth AAAI Conference on Artificial
Intelligence (AAAI-2020)
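As a rough illustration of the adaptive gating whose values the authors visualize, here is a minimal PyTorch sketch of a question-conditioned gate mixing a visual-view feature with a semantic-view feature. The linear gate and the exact conditioning are assumptions for illustration, not the paper's architecture.

    import torch
    import torch.nn as nn

    class GatedViewFusion(nn.Module):
        """Question-conditioned gate (illustrative) mixing two image views."""
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Linear(3 * dim, dim)

        def forward(self, visual, semantic, question):
            # all inputs: (B, D)
            g = torch.sigmoid(
                self.gate(torch.cat([visual, semantic, question], dim=-1)))
            fused = g * visual + (1.0 - g) * semantic
            return fused, g  # g near 1 favors the visual view, near 0 the semantic

Inspecting g at inference time is what would let one say which view contributed more to answering a given question.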
Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers
Given an input video, its associated audio, and a brief caption, the
audio-visual scene aware dialog (AVSD) task requires an agent to engage in a
question-answer dialog with a human about the audio-visual content. This task
thus poses a challenging multi-modal representation learning and reasoning
scenario, and advances in it could influence several human-machine
interaction applications. To solve this task, we introduce a
semantics-controlled multi-modal shuffled Transformer reasoning framework,
consisting of a sequence of Transformer modules, each taking a modality as
input and producing representations conditioned on the input question. Our
proposed Transformer variant applies a shuffling scheme to its multi-head
outputs, which provides better regularization. To encode fine-grained visual
information, we present a novel dynamic scene graph representation learning
pipeline that consists of an intra-frame reasoning layer producing
spatio-semantic graph representations for every frame, and an inter-frame
aggregation module capturing temporal cues. Our entire pipeline is trained
end-to-end. We present experiments on the benchmark AVSD dataset, on both the
answer generation and answer selection tasks. Our results demonstrate state-of-the-art
performance on all evaluation metrics.
Comment: Accepted at AAAI 2021
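The head-shuffling idea can be sketched in a few lines of PyTorch: a standard multi-head self-attention whose per-head outputs are randomly permuted along the head axis before the output projection during training. The placement of the shuffle and the self-attention-only form are assumptions; the paper's modules are question-conditioned and multi-modal.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ShuffledMHA(nn.Module):
        """Multi-head self-attention whose per-head outputs are randomly
        permuted across heads before the output projection (training only)."""
        def __init__(self, dim, num_heads):
            super().__init__()
            assert dim % num_heads == 0
            self.h, self.d = num_heads, dim // num_heads
            self.qkv = nn.Linear(dim, 3 * dim)
            self.out = nn.Linear(dim, dim)

        def forward(self, x):
            B, T, D = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            q, k, v = (t.view(B, T, self.h, self.d).transpose(1, 2)
                       for t in (q, k, v))
            attn = F.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
            heads = attn @ v                     # (B, h, T, d)
            if self.training:
                perm = torch.randperm(self.h, device=x.device)
                heads = heads[:, perm]           # shuffle the head axis
            return self.out(heads.transpose(1, 2).reshape(B, T, D))

Because the output projection must then work with heads in arbitrary order, it cannot rely on any single head's position, which is the regularizing effect the abstract describes.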