Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering
Video question answering (VideoQA) is challenging because it combines visual
understanding with natural language processing. Most existing approaches ignore
visual appearance-motion information at different temporal scales, and it
remains unclear how to combine the multilevel processing capacity of a deep
learning model with such multiscale information.
Targeting these issues, this paper proposes a novel Multilevel Hierarchical
Network (MHN) with multiscale sampling for VideoQA. MHN comprises two modules,
namely Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning
(PVR). Using multiscale sampling, RMI iteratively interacts the
appearance-motion information at each scale with the question embeddings to
build multilevel question-guided visual representations. Building on these,
PVR uses a shared transformer encoder to infer the visual cues at each level in
parallel, fitting different question types that may rely on visual information
at different levels. Through extensive experiments on three VideoQA datasets,
we demonstrate improved performance over previous state-of-the-art methods and
validate the effectiveness of each component of our method.
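The following is a minimal PyTorch sketch of the multilevel idea described in this abstract: visual features sampled at several temporal scales are each guided by the question through cross-attention, and a single shared transformer encoder then reasons over every level in parallel. All module names, hyperparameters, and the fusion/classification head are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a multilevel, question-guided VideoQA model (assumed design).
import torch
import torch.nn as nn


class MultilevelVideoQA(nn.Module):
    def __init__(self, dim=512, num_scales=3, num_answers=1000):
        super().__init__()
        # One cross-attention block per temporal scale: question tokens attend
        # to appearance-motion features sampled at that scale (RMI-style step).
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, 8, batch_first=True) for _ in range(num_scales)]
        )
        # A single transformer encoder shared across all levels (PVR-style step).
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.classifier = nn.Linear(dim * num_scales, num_answers)

    def forward(self, visual_per_scale, question):
        # visual_per_scale: list of (B, T_s, dim) tensors, one per sampling scale
        # question: (B, L, dim) question token embeddings
        level_reprs = []
        for attn, vis in zip(self.cross_attn, visual_per_scale):
            guided, _ = attn(query=question, key=vis, value=vis)  # question-guided visual cues
            encoded = self.shared_encoder(guided)                 # shared reasoning at this level
            level_reprs.append(encoded.mean(dim=1))               # pool tokens at this level
        return self.classifier(torch.cat(level_reprs, dim=-1))    # fuse levels, predict answer


if __name__ == "__main__":
    model = MultilevelVideoQA()
    frames = [torch.randn(2, t, 512) for t in (4, 8, 16)]  # three temporal scales
    question = torch.randn(2, 12, 512)
    print(model(frames, question).shape)  # (2, 1000)
```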
Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives
Reasoning about causal and temporal event relations in videos is a new frontier
of Video Question Answering (VideoQA). The major stumbling block to achieving
this goal is the semantic gap between language and video, since they lie at
different levels of abstraction. Existing efforts mainly focus on
designing sophisticated architectures while utilizing frame- or object-level
visual representations. In this paper, we reconsider the multi-modal alignment
problem in VideoQA from feature and sample perspectives to achieve better
performance. From the feature perspective, we break down the video into
trajectories and, for the first time, leverage trajectory features in VideoQA
to enhance the alignment between the two modalities. Moreover, we adopt a
heterogeneous graph architecture and design a hierarchical framework to align
both trajectory-level and frame-level visual features with language features.
In addition, we find that VideoQA models depend heavily on language priors and
often neglect visual-language interactions. Thus, from the sample perspective,
we design two effective yet portable training augmentation strategies to
strengthen the cross-modal correspondence ability of our model. Extensive
results show that our method outperforms state-of-the-art models on the
challenging NExT-QA benchmark, demonstrating the effectiveness of the proposed
method.
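A minimal sketch of the hierarchical trajectory/frame-to-language alignment described above, with the heterogeneous graph simplified to dense cross-attention. The question-swapping negative-pair loss is only one plausible reading of the sample-level augmentation strategy; all names and hyperparameters are assumptions rather than the authors' code.

```python
# Sketch: align trajectory- and frame-level features with language (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalAlignment(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.traj_to_frame = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.lang_to_visual = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, traj, frames, question):
        # traj: (B, N_traj, dim) object trajectory features
        # frames: (B, T, dim) frame-level features
        # question: (B, L, dim) language token features
        traj_ctx, _ = self.traj_to_frame(traj, frames, frames)    # trajectories gather frame context
        visual = torch.cat([traj_ctx, frames], dim=1)             # two-level visual node set
        fused, _ = self.lang_to_visual(question, visual, visual)  # align language with both levels
        return self.score(fused.mean(dim=1)).squeeze(-1)          # video-question matching score


def swap_question_loss(model, traj, frames, question):
    """Sample-level augmentation (assumed): mismatched (video, question) pairs as negatives."""
    pos = model(traj, frames, question)
    neg = model(traj, frames, question.roll(shifts=1, dims=0))    # pair each video with another question
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    return F.binary_cross_entropy_with_logits(torch.cat([pos, neg]), labels)
```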
Video Question Answering with Iterative Video-Text Co-Tokenization
Video question answering is a challenging task that requires jointly
understanding the language input, the visual information in individual video
frames, and the temporal information about the events occurring in the video.
In this paper, we propose a novel multi-stream video encoder for video question
answering that uses multiple video inputs and a new video-text iterative
co-tokenization approach to answer a variety of questions related to videos. We
experimentally evaluate the model on several datasets, such as MSRVTT-QA,
MSVD-QA, and IVQA, outperforming the previous state of the art by large margins.
At the same time, our model reduces the required GFLOPs from 150-360 to only 67,
producing a highly efficient video question answering model.
Comment: ECCV 202
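The sketch below illustrates the general idea of iterative video-text co-tokenization: a small set of learned tokens, conditioned on the question, repeatedly summarizes the concatenated multi-stream video features, so the later fusion operates over only a few tokens and stays cheap. The token count, number of iterations, and fusion head are assumptions for illustration, not the paper's exact model.

```python
# Sketch: iterative, text-conditioned compression of video features (assumed design).
import torch
import torch.nn as nn


class CoTokenizer(nn.Module):
    def __init__(self, dim=512, num_tokens=16, num_iters=2, num_answers=1000):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.text_proj = nn.Linear(dim, dim)
        self.compress = nn.MultiheadAttention(dim, 8, batch_first=True)
        fusion_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        self.num_iters = num_iters
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, video_streams, text):
        # video_streams: list of (B, T_i, dim) features from different video inputs
        # text: (B, L, dim) question token embeddings
        video = torch.cat(video_streams, dim=1)                    # merge streams along time
        tokens = self.queries.unsqueeze(0).expand(video.size(0), -1, -1)
        for _ in range(self.num_iters):
            # Condition the compact tokens on the question, then re-read the video:
            # each iteration refines which visual evidence the tokens summarize.
            tokens = tokens + self.text_proj(text.mean(dim=1, keepdim=True))
            tokens, _ = self.compress(tokens, video, video)
        joint = self.fusion(torch.cat([tokens, text], dim=1))      # fusion over few tokens is cheap
        return self.classifier(joint.mean(dim=1))
```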
Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering
Existing visual question answering methods tend to capture the cross-modal
spurious correlations and fail to discover the true causal mechanism that
facilitates reasoning truthfully based on the dominant visual evidence and the
question intention. Additionally, existing methods usually ignore cross-modal
event-level understanding, which requires jointly modeling event temporality,
causality, and dynamics. In this work, we focus on event-level
visual question answering from a new perspective, i.e., cross-modal causal
relational reasoning, by introducing causal intervention methods to discover
the true causal structures for visual and linguistic modalities. Specifically,
we propose a novel event-level visual question answering framework named
Cross-Modal Causal RelatIonal Reasoning (CMCIR), to achieve robust
causality-aware visual-linguistic question answering. To discover cross-modal
causal structures, the Causality-aware Visual-Linguistic Reasoning (CVLR)
module is proposed to collaboratively disentangle the visual and linguistic
spurious correlations via front-door and back-door causal interventions. To
model the fine-grained interactions between linguistic semantics and
spatial-temporal representations, we build a Spatial-Temporal Transformer (STT)
that creates multi-modal co-occurrence interactions between visual and
linguistic content. To adaptively fuse the causality-aware visual and linguistic
features, we introduce a Visual-Linguistic Feature Fusion (VLFF) module that
leverages hierarchical linguistic semantic relations as guidance to learn
global semantic-aware visual-linguistic representations.
Extensive experiments on four event-level datasets demonstrate the superiority
of our CMCIR in discovering visual-linguistic causal structures and achieving
robust event-level visual question answering.
Comment: 17 pages, 9 figures. This work has been submitted to the IEEE for
possible publication. Copyright may be transferred without notice, after which
this version may no longer be accessible. The datasets, code and models are
available at https://github.com/YangLiu9208/CMCI
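As a rough illustration of causal intervention on features, the sketch below implements a simplified back-door-style adjustment: a learned dictionary of confounder prototypes with a learned prior is marginalized out and fused with the input feature, approximating sum_z P(Y|X,z)P(z). The dictionary size, prior parameterization, and fusion layer are assumptions and do not reproduce the CVLR module of CMCIR.

```python
# Sketch of a back-door-style intervention layer over a confounder dictionary (assumed design).
import torch
import torch.nn as nn


class BackdoorIntervention(nn.Module):
    def __init__(self, dim=512, num_confounders=32):
        super().__init__()
        self.confounders = nn.Parameter(torch.randn(num_confounders, dim) * 0.02)
        self.prior_logits = nn.Parameter(torch.zeros(num_confounders))  # learned prior P(z)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):
        # x: (B, dim) visual or linguistic feature to be de-confounded
        prior = self.prior_logits.softmax(dim=0)                   # P(z), independent of the input
        z = (prior.unsqueeze(-1) * self.confounders).sum(dim=0)    # expectation over confounders
        z = z.unsqueeze(0).expand_as(x)
        return self.fuse(torch.cat([x, z], dim=-1))                # feature adjusted by the intervention


if __name__ == "__main__":
    layer = BackdoorIntervention()
    visual = torch.randn(4, 512)
    print(layer(visual).shape)  # (4, 512)
```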