Exploring the Role of Audio in Video Captioning
Recent focus in video captioning has been on designing architectures that can
consume both video and text modalities, and using large-scale video datasets
with text transcripts for pre-training, such as HowTo100M. Though these
approaches have achieved significant improvement, the audio modality is often
ignored in video captioning. In this work, we present an audio-visual
framework, which aims to fully exploit the potential of the audio modality for
captioning. Instead of relying on text transcripts extracted via automatic
speech recognition (ASR), we argue that learning with raw audio signals can be
more beneficial, as audio has additional information including acoustic events,
speaker identity, etc. Our contributions are twofold. First, we observed that
the model overspecializes to the audio modality when pre-training with both
video and audio modality, since the ground truth (i.e., text transcripts) can
be solely predicted using audio. We proposed a Modality Balanced Pre-training
(MBP) loss to mitigate this issue and significantly improve the performance on
downstream tasks. Second, we slice and dice different design choices of the
cross-modal module, which may become an information bottleneck and generate
inferior results. We proposed new local-global fusion mechanisms to improve
information exchange across audio and video. We demonstrate significant
improvements by leveraging the audio modality on four datasets, and even
outperform the state of the art on some metrics without relying on the text
modality as the input.
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Most existing text-video retrieval methods focus on cross-modal matching
between the visual content of videos and textual query sentences. However, in
real-world scenarios, online videos are often accompanied by relevant text
information such as titles, tags, and even subtitles, which can be utilized to
match textual queries. This insight has motivated us to propose a novel
approach to text-video retrieval, where we directly generate associated
captions from videos using zero-shot video captioning with knowledge from
web-scale pre-trained models (e.g., CLIP and GPT-2). Given the generated
captions, a natural question arises: what benefits do they bring to text-video
retrieval? To answer this, we introduce Cap4Video, a new framework that
leverages captions in three ways: i) Input data: video-caption pairs can
augment the training data. ii) Intermediate feature interaction: we perform
cross-modal feature interaction between the video and caption to produce
enhanced video representations. iii) Output score: the Query-Caption matching
branch can complement the original Query-Video matching branch for text-video
retrieval. We conduct comprehensive ablation studies to demonstrate the
effectiveness of our approach. Without any post-processing, Cap4Video achieves
state-of-the-art performance on four standard text-video retrieval benchmarks:
MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). The code is
available at https://github.com/whwu95/Cap4Video.
Comment: Accepted by CVPR 2023. Selected as a Highlight (Top 2.5% of ALL
submissions).
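The third use of captions above (the Query-Caption branch complementing the Query-Video branch at the output-score level) can be illustrated with a minimal sketch. The abstract does not say how the two scores are combined, so the weighted sum, the `caption_weight` parameter, and all function names below are assumptions made for illustration, not the Cap4Video implementation.

```python
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fused_scores(
    query: List[float],
    video_embs: List[List[float]],
    caption_embs: List[List[float]],
    caption_weight: float = 0.5,  # hypothetical mixing weight, not from the paper
) -> List[float]:
    """Combine Query-Video and Query-Caption similarities for each candidate video."""
    if len(video_embs) != len(caption_embs):
        raise ValueError("each video needs exactly one caption embedding")
    scores = []
    for video, caption in zip(video_embs, caption_embs):
        query_video = cosine(query, video)
        query_caption = cosine(query, caption)
        # Assumed fusion rule: a convex combination of the two branch scores.
        scores.append((1.0 - caption_weight) * query_video + caption_weight * query_caption)
    return scores


def rank(scores: List[float]) -> List[int]:
    """Return candidate indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In this sketch, two videos with identical visual similarity to the query are separated by their caption similarity, which is the kind of complementary signal the abstract describes.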
Automatic generation of natural language descriptions of visual data: describing images and videos using recurrent and self-attentive models
Humans are faced with a constant flow of visual stimuli, e.g., from the environment or when looking at social media. In contrast, visually impaired people are often unable to perceive and process this beneficial information, which could help them maneuver through everyday situations and activities. However, audible feedback such as natural language can make them more aware of their surroundings, enabling them to master everyday challenges autonomously. One way to create audible feedback is to produce natural language descriptions for visual data such as still images and then read this text to the person. Moreover, textual descriptions for images can be further utilized for text analysis (e.g., sentiment analysis) and information aggregation. In this work, we investigate different approaches and techniques for the automatic generation of natural language descriptions of visual data such as still images and video clips.
In particular, we look at language models that generate textual descriptions with recurrent neural networks. First, we present a model that generates image captions for scenes that depict interactions between humans and branded products. Here, we focus on the correct identification of the brand name in a multi-task training setting and present two new metrics that allow us to evaluate this requirement. Second, we explore the automatic answering of questions posed for an image. We propose a model that generates answers from scratch instead of predicting an answer from a limited set of possible answers. In comparison to related works, we are therefore able to generate rare answers that are not contained in the pool of frequent answers. Third, we review the automatic generation of doctors' reports for chest X-ray images. We introduce a model that can cope with the dataset bias of medical datasets (i.e., abnormal cases are very rare) and generates reports with a hierarchical recurrent model. We also investigate the correlation between the distinctiveness of the report and the score in traditional metrics and find a discrepancy between good scores and accurate reports.
Then, we examine self-attentive language models that improve computational efficiency and performance over the recurrent models. Specifically, we utilize the Transformer architecture. First, we expand the automatic description generation to the domain of videos where we present a video-to-text (VTT) model that can easily synchronize audio-visual features. With an extensive experimental exploration, we verify the effectiveness of our video-to-text translation pipeline. Finally, we revisit our recurrent models with this self-attentive approach.
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Vision and text have been fully explored in contemporary video-text
foundational models, while other modalities such as audio and subtitles in
videos have not received sufficient attention. In this paper, we set out to
establish connections between multi-modality video tracks (Vision, Audio, and
Subtitle) and Text by exploring an automatically generated
large-scale omni-modality video caption dataset called VAST-27M. Specifically,
we first collect 27 million open-domain video clips and separately train a
vision and an audio captioner to generate vision and audio captions. Then, we
employ an off-the-shelf Large Language Model (LLM) to integrate the generated
captions, together with subtitles and instructional prompts into omni-modality
captions. Based on the proposed VAST-27M dataset, we train an omni-modality
video-text foundational model named VAST, which can perceive and process
vision, audio, and subtitle modalities from video, and better support various
tasks including vision-text, audio-text, and multi-modal video-text tasks
(retrieval, captioning and QA). Extensive experiments have been conducted to
demonstrate the effectiveness of our proposed VAST-27M corpus and VAST
foundation model. VAST achieves 22 new state-of-the-art results on various
cross-modality benchmarks. Code, model and dataset will be released at
https://github.com/TXH-mercury/VAST.
Comment: 23 pages, 5 figures
LGDN: Language-Guided Denoising Network for Video-Language Modeling
Video-language modeling has attracted much attention with the rapid growth of
web videos. Most existing methods assume that the video frames and text
description are semantically correlated, and focus on video-language modeling
at video level. However, this hypothesis often fails for two reasons: (1) With
the rich semantics of video contents, it is difficult to cover all frames with
a single video-level description; (2) A raw video typically has
noisy/meaningless information (e.g., scenery shot, transition or teaser).
Although a number of recent works deploy attention mechanisms to alleviate this
problem, the irrelevant/noisy information still makes it very difficult to
address. To overcome this challenge, we propose an efficient and effective
model, termed Language-Guided Denoising Network (LGDN), for video-language
modeling. Different from most existing methods that utilize all extracted video
frames, LGDN dynamically filters out the misaligned or redundant frames under
the language supervision and obtains only 2--4 salient frames per video for
cross-modal token-level alignment. Extensive experiments on five public
datasets show that our LGDN outperforms the state of the art by large margins.
We also provide a detailed ablation study to reveal the critical importance of
solving the noise issue, in the hope of inspiring future video-language work.
Comment: Accepted by NeurIPS202
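The frame-filtering idea described above (keeping only a few frames that align with the text) can be sketched as a top-k selection. The abstract does not specify how LGDN scores frames, so the cosine-similarity scoring, the fixed `k`, and the function names here are assumptions for illustration only.

```python
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_salient_frames(
    frame_embs: List[List[float]],
    text_emb: List[float],
    k: int = 3,  # the abstract reports 2-4 frames per video; 3 is an arbitrary pick
) -> List[int]:
    """Return indices of the k frames most similar to the text, in temporal order."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Assumed relevance score: similarity between each frame and the sentence embedding.
    scored = [(cosine(frame, text_emb), index) for index, frame in enumerate(frame_embs)]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    # Restore temporal order so downstream alignment sees frames as they occur.
    return sorted(index for _, index in top)
```

Frames with low similarity to the description (for example a transition shot) fall outside the top k and are dropped before any token-level alignment would take place.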
A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval
Every hour, huge amounts of visual content are posted on social media and
user-generated content platforms. To find relevant videos by means of a natural
language query, text-video retrieval methods have received increased attention
over the past few years. Data augmentation techniques were introduced to
increase the performance on unseen test examples by creating new training
samples with the application of semantics-preserving techniques, such as color
space or geometric transformations on images. Yet, these techniques are usually
applied on raw data, leading to more resource-demanding solutions and also
requiring that the raw data be shareable, which may not always be the case,
e.g., due to copyright issues with clips from movies or TV series. To address this
shortcoming, we propose a multimodal data augmentation technique which works in
the feature space and creates new videos and captions by mixing semantically
similar samples. We evaluate our solution on a large-scale public dataset,
EPIC-Kitchens-100, and achieve considerable improvements over a baseline
method and improved state-of-the-art performance; we also perform multiple
ablation studies. We release code and pretrained models on
GitHub at https://github.com/aranciokov/FSMMDA_VideoRetrieval.
Comment: Accepted for presentation at 30th ACM International Conference on
Multimedia (ACM MM)
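Mixing semantically similar samples in feature space, as described above, can be sketched as follows. The abstract does not give the mixing rule or the similarity criterion, so the linear interpolation, the `lam` coefficient, and the use of a shared label to decide similarity are all assumptions for illustration, not the released method.

```python
import random
from typing import Dict, List, Optional, Tuple

# (video feature, caption feature, semantic label); the label is a hypothetical
# stand-in for whatever notion of semantic similarity the real method uses.
Sample = Tuple[List[float], List[float], str]


def mix(a: List[float], b: List[float], lam: float) -> List[float]:
    """Linear interpolation lam * a + (1 - lam) * b, element by element."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def augment(
    samples: List[Sample],
    lam: float = 0.5,  # assumed mixing coefficient
    rng: Optional[random.Random] = None,
) -> List[Sample]:
    """Create one new (video, caption) feature pair per sample that has a same-label partner."""
    rng = rng or random.Random(0)
    by_label: Dict[str, List[int]] = {}
    for index, (_, _, label) in enumerate(samples):
        by_label.setdefault(label, []).append(index)

    new_samples: List[Sample] = []
    for index, (video, caption, label) in enumerate(samples):
        partners = [j for j in by_label[label] if j != index]
        if not partners:
            continue  # no semantically similar sample to mix with
        partner_video, partner_caption, _ = samples[rng.choice(partners)]
        # Mix both modalities with the same coefficient so the pair stays aligned.
        new_samples.append(
            (mix(video, partner_video, lam), mix(caption, partner_caption, lam), label)
        )
    return new_samples
```

Because the mixing operates on precomputed features, a sketch like this never touches raw video, which matches the motivation given in the abstract about resource cost and data shareability.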
Fine-grained Audible Video Description
We explore a new task for audio-visual-language modeling called fine-grained
audible video description (FAVD). It aims to provide detailed textual
descriptions for the given audible videos, including the appearance and spatial
locations of each object, the actions of moving objects, and the sounds in
videos. Existing visual-language modeling tasks often concentrate on visual
cues in videos while undervaluing the language and audio modalities. On the
other hand, FAVD requires not only audio-visual-language modeling skills but
also paragraph-level language generation abilities. We construct the first
fine-grained audible video description benchmark (FAVDBench) to facilitate this
research. For each video clip, we first provide a one-sentence summary of the
video, i.e., the caption, followed by 4-6 sentences describing the visual details
and 1-2 audio-related descriptions at the end. The descriptions are provided in
both English and Chinese. We create two new metrics for this task: an
EntityScore to gauge the completeness of entities in the visual descriptions,
and an AudioScore to assess the audio descriptions. As a preliminary approach
to this task, we propose an audio-visual-language transformer that extends
an existing video captioning model with an additional audio branch. We combine the
masked language modeling and auto-regressive language modeling losses to
optimize our model so that it can produce paragraph-level descriptions. We
illustrate the efficiency of our model in audio-visual-language modeling by
evaluating it against the proposed benchmark using both conventional captioning
metrics and our proposed metrics. We further put our benchmark to the test in
video generation models, demonstrating that employing fine-grained video
descriptions can create more intricate videos than using captions.
Comment: Accepted to CVPR 2023. Xuyang Shen, Dong Li and Jinxing Zhou
contributed equally. Code link: github.com/OpenNLPLab/FAVDBench, dataset link:
www.avlbench.opennlplab.c