Unifying Event Detection and Captioning as Sequence Generation via Pre-Training
Dense video captioning aims to generate corresponding text descriptions for a
series of events in an untrimmed video, and can be divided into two
sub-tasks: event detection and event captioning. Unlike previous works that
tackle the two sub-tasks separately, recent works have focused on enhancing the
inter-task association between the two sub-tasks. However, designing inter-task
interactions for event detection and captioning is not trivial due to the large
differences in their task-specific solutions. Moreover, previous event detection
methods typically ignore temporal dependencies between events, leading to event
redundancy or inconsistency. To tackle these two defects, in this
paper, we define event detection as a sequence generation task and propose a
unified pre-training and fine-tuning framework to naturally enhance the
inter-task association between event detection and captioning. Since the model
predicts each event with previous events as context, the inter-dependency
between events is fully exploited and thus our model can detect more diverse
and consistent events in the video. Experiments on the ActivityNet dataset show
that our model outperforms the state-of-the-art methods, and can be further
boosted when pre-trained on extra large-scale video-text data. Code is
available at \url{https://github.com/QiQAng/UEDVC}
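The abstract's core idea of casting event detection as sequence generation, with each event predicted conditioned on the events decoded so far, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `predict_next` stands in for the trained model, and `toy_predict` is a purely hypothetical stand-in.

```python
# Hypothetical sketch: event detection framed as autoregressive sequence
# generation. Each (start, end) boundary is decoded conditioned on the
# events generated so far, which lets the model avoid redundant or
# inconsistent events. All names here are illustrative.

def detect_events(predict_next, max_events=10):
    """Autoregressively decode event boundaries.

    predict_next(context) -> (start, end), or None for end-of-sequence.
    context is the list of previously generated events.
    """
    events = []
    for _ in range(max_events):
        nxt = predict_next(events)
        if nxt is None:          # model emitted an end-of-sequence token
            break
        events.append(nxt)
    return events

# Toy "model": emit three back-to-back 5-second segments, then stop.
def toy_predict(context):
    if len(context) >= 3:
        return None
    start = context[-1][1] if context else 0.0
    return (start, start + 5.0)

print(detect_events(toy_predict))  # [(0.0, 5.0), (5.0, 10.0), (10.0, 15.0)]
```

Because each step sees the full prefix of events, overlap and ordering constraints can be enforced in the decoder rather than in a separate post-processing stage.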
Fine-grained Audible Video Description
We explore a new task for audio-visual-language modeling called fine-grained
audible video description (FAVD). It aims to provide detailed textual
descriptions for the given audible videos, including the appearance and spatial
locations of each object, the actions of moving objects, and the sounds in
videos. Existing visual-language modeling tasks often concentrate on visual
cues in videos while undervaluing the language and audio modalities. On the
other hand, FAVD requires not only audio-visual-language modeling skills but
also paragraph-level language generation abilities. We construct the first
fine-grained audible video description benchmark (FAVDBench) to facilitate this
research. For each video clip, we first provide a one-sentence summary of the
video, i.e., the caption, followed by 4-6 sentences describing the visual details
and 1-2 audio-related descriptions at the end. The descriptions are provided in
both English and Chinese. We create two new metrics for this task: an
EntityScore to gauge the completeness of entities in the visual descriptions,
and an AudioScore to assess the audio descriptions. As a preliminary approach
to this task, we propose an audio-visual-language transformer that extends an
existing video captioning model with an additional audio branch. We combine the
masked language modeling and auto-regressive language modeling losses to
optimize our model so that it can produce paragraph-level descriptions. We
demonstrate the effectiveness of our model in audio-visual-language modeling by
evaluating it against the proposed benchmark using both conventional captioning
metrics and our proposed metrics. We further put our benchmark to the test in
video generation models, demonstrating that employing fine-grained video
descriptions can create more intricate videos than using captions. Comment: accepted to CVPR 2023; Xuyang Shen, Dong Li and Jinxing Zhou
contribute equally, code link: github.com/OpenNLPLab/FAVDBench, dataset link:
www.avlbench.opennlplab.c
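The abstract states that a masked language modeling (MLM) loss and an auto-regressive (AR) loss are combined to train the captioner. A minimal sketch of such a combination is below; the real system would compute these terms from transformer logits, whereas here the per-token probabilities are supplied directly to keep the example self-contained, and all names and the equal default weighting are assumptions.

```python
import math

# Illustrative sketch of combining MLM and AR objectives into one loss.
# mlm_probs: model probabilities of the true tokens at masked positions.
# ar_probs:  model probabilities of each next token given its prefix.

def cross_entropy(prob):
    """Negative log-likelihood of a single correct token."""
    return -math.log(prob)

def combined_loss(mlm_probs, ar_probs, mlm_weight=1.0, ar_weight=1.0):
    """Weighted sum of the mean MLM loss and the mean AR loss."""
    mlm = sum(cross_entropy(p) for p in mlm_probs) / len(mlm_probs)
    ar = sum(cross_entropy(p) for p in ar_probs) / len(ar_probs)
    return mlm_weight * mlm + ar_weight * ar

print(round(combined_loss([0.9, 0.8], [0.7]), 3))  # 0.521
```

The MLM term encourages bidirectional understanding of the paragraph, while the AR term trains the left-to-right generation ability needed to produce long descriptions at inference time.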
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Masked visual modeling (MVM) has been recently proven effective for visual
pre-training. While similar reconstructive objectives on video inputs (e.g.,
masked frame modeling) have been explored in video-language (VidL)
pre-training, previous studies fail to find a truly effective MVM strategy that
can largely benefit the downstream performance. In this work, we systematically
examine the potential of MVM in the context of VidL learning. Specifically, we
base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where
the supervision from MVM training can be backpropagated to the video pixel
space. In total, eight different reconstructive targets of MVM are explored,
from low-level pixel values and oriented gradients to high-level depth maps,
optical flow, discrete visual tokens, and latent visual features. We conduct
comprehensive experiments and provide insights into the factors leading to
effective MVM training, resulting in an enhanced model VIOLETv2. Empirically,
we show that VIOLETv2, pre-trained with the MVM objective, achieves notable
improvements on 13 VidL benchmarks, ranging from video question answering and
video captioning to text-to-video retrieval. Comment: CVPR'23; the first two authors contributed equally; code is available
at https://github.com/tsujuifu/pytorch_empirical-mv
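The MVM objectives surveyed above share one shape: mask some video-patch features, reconstruct a target (pixels, gradients, depth, flow, tokens, or latent features) at the masked positions, and penalize the reconstruction error. The following is a minimal regression-style sketch of that shape, not the paper's code; the mean-of-visible-patches predictor is a deliberately trivial stand-in for the transformer.

```python
# Minimal sketch of a masked visual modeling (MVM) loss: reconstruct
# masked patch targets from the visible context and score with MSE.
# Scalars stand in for patch feature vectors; names are illustrative.

def mvm_loss(patches, masked_idx, predict):
    """Mean squared reconstruction error over the masked positions.

    patches:    per-patch target values (e.g. pixel intensities).
    masked_idx: indices hidden from the model.
    predict(visible, i) -> reconstruction of patch i from visible patches.
    """
    visible = {j: p for j, p in enumerate(patches) if j not in masked_idx}
    total = 0.0
    for i in masked_idx:
        pred = predict(visible, i)
        total += (pred - patches[i]) ** 2
    return total / len(masked_idx)

# Trivial stand-in "model": predict the mean of the visible patches.
def mean_predict(visible, i):
    vals = list(visible.values())
    return sum(vals) / len(vals)

print(mvm_loss([1.0, 2.0, 3.0, 4.0, 5.0], [1, 3], mean_predict))  # 1.0
```

Swapping the target values in `patches` (pixels vs. depth vs. flow, etc.) while keeping this loss shape fixed is essentially what lets the paper compare eight reconstructive targets under one framework.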
VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning
Video paragraph captioning aims to generate a multi-sentence description of
an untrimmed video with several temporal event locations in coherent
storytelling. Following the human perception process, where the scene is
effectively understood by decomposing it into visual (e.g. human, animal) and
non-visual components (e.g. action, relations) under the mutual influence of
vision and language, we first propose a visual-linguistic (VL) feature. In the
proposed VL feature, the scene is modeled by three modalities including (i) a
global visual environment; (ii) local visual main agents; (iii) linguistic
scene elements. We then introduce an autoregressive Transformer-in-Transformer
(TinT) to simultaneously capture the semantic coherence of intra- and
inter-event contents within a video. Finally, we present a new VL contrastive
loss function to ensure that the learned embedding features match the
caption semantics. Comprehensive experiments and extensive ablation studies on
ActivityNet Captions and YouCookII datasets show that the proposed
Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms prior
state-of-the-art methods in terms of accuracy and diversity. Comment: Accepted to AAAI 202
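A VL contrastive loss of the kind the abstract describes is commonly instantiated as an InfoNCE-style objective: each video embedding should score highest against its own caption embedding among all captions in a batch. The sketch below shows that standard formulation; it is an assumption that VLTinT's loss takes exactly this form, and the symmetric variant and temperature value are illustrative choices.

```python
import math

# InfoNCE-style contrastive loss sketch: pull matched (video, caption)
# embedding pairs together and push mismatched pairs apart.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(video_embs, caption_embs, temperature=0.1):
    """Mean cross-entropy of picking the matching caption for each video."""
    n = len(video_embs)
    loss = 0.0
    for i in range(n):
        sims = [dot(video_embs[i], c) / temperature for c in caption_embs]
        log_denom = math.log(sum(math.exp(s) for s in sims))
        loss += -(sims[i] - log_denom)  # -log softmax at the true index
    return loss / n
```

With well-aligned embeddings the loss approaches zero; shuffling the caption order (breaking the pairing) drives it up, which is exactly the pressure that keeps learned features matched to caption semantics.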
Towards Multi-modal Explainable Video Understanding
This thesis presents a novel approach to video understanding by emulating human perceptual processes and creating an explainable and coherent storytelling representation of video content. Central to this approach is the development of a Visual-Linguistic (VL) feature for an interpretable video representation and the creation of a Transformer-in-Transformer (TinT) decoder for modeling intra- and inter-event coherence in a video.

Drawing inspiration from the way humans comprehend scenes by breaking them down into visual and non-visual components, the proposed VL feature models a scene through three distinct modalities. These include: (i) a global visual environment, providing a broad contextual understanding of the scene; (ii) local visual main agents, focusing on key elements or entities in the video; and (iii) linguistic scene elements, incorporating semantically relevant language-based information for a comprehensive understanding of the scene. By integrating these multimodal features, the VL representation offers a rich, diverse, and interpretable view of video content, effectively bridging the gap between visual perception and linguistic description.

To ensure the temporal coherence and narrative structure of the video content, we introduce an autoregressive Transformer-in-Transformer (TinT) decoder. The TinT design consists of a nested architecture where the inner transformer models the intra-event coherency, capturing the semantic connections within individual events, while the outer transformer models the inter-event coherency, identifying the relationships and transitions between different events. This dual-layer transformer structure facilitates the generation of accurate and meaningful video descriptions that reflect the chronological and causal links in the video content.

Another crucial aspect of this work is the introduction of a novel VL contrastive loss function.
This function plays an essential role in ensuring that the learned embedding features are semantically consistent with the video captions. By aligning the embeddings with the ground truth captions, the VL contrastive loss function enhances the model's performance and contributes to the quality of the generated descriptions.

The efficacy of our proposed methods is validated through comprehensive experiments on popular video understanding benchmarks. The results demonstrate superior performance in terms of both the accuracy and diversity of the generated captions, highlighting the potential of our approach in advancing the field of video understanding.

In conclusion, this thesis provides a promising pathway toward building explainable video understanding models. By emulating human perception processes, leveraging multimodal features, and incorporating a nested transformer design, we contribute a new perspective to the field, paving the way for more advanced and intuitive video understanding systems in the future.
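The nesting described in the TinT decoder, an inner model for intra-event coherence feeding an outer model for inter-event coherence, can be shown schematically. In this sketch both levels are replaced by trivial pooling/mixing functions purely to expose the two-level structure; the actual design uses transformer layers at each level, and every name here is illustrative.

```python
# Conceptual sketch of the Transformer-in-Transformer (TinT) nesting:
# the inner stage summarizes tokens WITHIN each event, and the outer
# stage mixes information ACROSS the per-event summaries. Mean-pooling
# and averaging stand in for the real transformer layers (illustrative).

def inner_encode(event_tokens):
    """Summarize one event's token features (stand-in for inner transformer)."""
    return sum(event_tokens) / len(event_tokens)

def outer_encode(event_summaries):
    """Mix context across events (stand-in for outer transformer)."""
    ctx = sum(event_summaries) / len(event_summaries)
    return [0.5 * s + 0.5 * ctx for s in event_summaries]

def tint_encode(video_events):
    """Nested encoding: intra-event first, then inter-event."""
    return outer_encode([inner_encode(ev) for ev in video_events])

print(tint_encode([[1.0, 3.0], [5.0, 7.0]]))  # [3.0, 5.0]
```

The key structural point survives the simplification: each event is first understood on its own, and only the compact per-event summaries interact at the outer level, which keeps cross-event modeling tractable for long videos.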