22,441 research outputs found
Multi-modal Transformer for Video Retrieval
The task of retrieving video content relevant to natural language queries
plays a critical role in effectively handling internet-scale datasets. Most of
the existing methods for this caption-to-video retrieval problem do not fully
exploit cross-modal cues present in video. Furthermore, they aggregate
per-frame visual features with limited or no temporal information. In this
paper, we present a multi-modal transformer to jointly encode the different
modalities in video, which allows each of them to attend to the others. The
transformer architecture is also leveraged to encode and model the temporal
information. On the natural language side, we investigate the best practices to
jointly optimize the language embedding together with the multi-modal
transformer. This novel framework allows us to establish state-of-the-art
results for video retrieval on three datasets. More details are available at
http://thoth.inrialpes.fr/research/MMT.
Comment: ECCV 2020 (spotlight paper)
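The joint cross-modal encoding described above can be illustrated with a short, self-contained sketch: tokens from several modalities are projected to a shared width and passed through one transformer encoder, so each modality can attend to the others and temporal order is carried by position embeddings. This is a minimal illustration under assumed inputs (pre-extracted per-frame appearance and audio features); all class and parameter names are hypothetical and it is not the authors' MMT implementation (available at the URL above).

```python
import torch
import torch.nn as nn

class MultiModalEncoder(nn.Module):
    """Toy joint encoder: tokens from several modalities share one
    transformer, so each modality can attend to the others.
    Illustrative only; not the released MMT code."""

    def __init__(self, dims, d_model=512, n_heads=8, n_layers=4, max_len=256):
        super().__init__()
        # Project each modality (e.g. appearance, audio) to a common width.
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in dims)
        # Learned modality and temporal-position embeddings.
        self.modality_emb = nn.Embedding(len(dims), d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, feats):
        # feats: list of (batch, time_m, dim_m) tensors, one per modality.
        tokens = []
        for m, (proj, x) in enumerate(zip(self.proj, feats)):
            t = torch.arange(x.size(1), device=x.device)
            tokens.append(proj(x) + self.pos_emb(t) + self.modality_emb.weight[m])
        # Concatenating along time lets cross-modal attention happen
        # inside the shared self-attention layers.
        encoded = self.encoder(torch.cat(tokens, dim=1))
        # Mean-pool into a single video embedding for retrieval.
        return encoded.mean(dim=1)

# Usage: two modalities with 2048-d appearance and 128-d audio features.
video = MultiModalEncoder([2048, 128])
emb = video([torch.randn(2, 30, 2048), torch.randn(2, 30, 128)])
print(emb.shape)  # torch.Size([2, 512])
```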
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
Video-text retrieval has been a crucial and fundamental task in multi-modal
research. The development of video-text retrieval has been considerably
promoted by large-scale multi-modal contrastive pre-training, which primarily
focuses on coarse-grained or fine-grained contrast. However, cross-grained
contrast, which is the contrast between coarse-grained representations and
fine-grained representations, has rarely been explored in prior research.
Compared with fine-grained or coarse-grained contrasts, cross-grained contrast
calculates the correlation between coarse-grained features and each fine-grained
feature, and is able to filter out unnecessary fine-grained features guided
by the coarse-grained feature during similarity calculation, thus improving the
accuracy of retrieval. To this end, this paper presents a novel multi-grained
contrastive model, namely X-CLIP, for video-text retrieval. However, another
challenge lies in the similarity aggregation problem, which aims to aggregate
fine-grained and cross-grained similarity matrices to instance-level
similarity. To address this challenge, we propose the Attention Over Similarity
Matrix (AOSM) module to make the model focus on the contrast between essential
frames and words, thus lowering the impact of unnecessary frames and words on
retrieval results. With multi-grained contrast and the proposed AOSM module,
X-CLIP achieves outstanding performance on five widely-used video-text
retrieval datasets, including MSR-VTT (49.3 R@1), MSVD (50.4 R@1), LSMDC (26.1
R@1), DiDeMo (47.8 R@1) and ActivityNet (46.2 R@1). It outperforms the previous
state-of-the-art by +6.3%, +6.6%, +11.1%, +6.7%, and +3.8% relative improvements on
these benchmarks, demonstrating the superiority of multi-grained contrast and
AOSM.
Comment: 13 pages, 6 figures, ACMMM2
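A minimal sketch of the cross-grained contrast and attention-over-similarity idea described above, assuming pre-computed sentence, word, video, and frame embeddings. The temperature, pooling, and equal weighting of the granularities are illustrative assumptions; this is not the released X-CLIP code.

```python
import torch
import torch.nn.functional as F

def cross_grained_score(sent, words, vid, frames, tau=0.01):
    """Toy cross-grained similarity (illustrative, not the released X-CLIP).

    sent:   (B, D)     coarse text embedding
    words:  (B, W, D)  fine-grained word embeddings
    vid:    (B, D)     coarse video embedding
    frames: (B, F, D)  fine-grained frame embeddings
    Returns a (B, B) text-to-video score matrix.
    """
    sent, vid = F.normalize(sent, dim=-1), F.normalize(vid, dim=-1)
    words, frames = F.normalize(words, dim=-1), F.normalize(frames, dim=-1)

    # Coarse-coarse contrast: sentence vs. video.
    s_vv = sent @ vid.t()                                   # (B, B)

    # Cross-grained contrast: sentence vs. every frame; a softmax over the
    # similarity vector down-weights frames irrelevant to the sentence.
    s_sf = torch.einsum('id,jfd->ijf', sent, frames)        # (B, B, F)
    s_sf = (F.softmax(s_sf / tau, dim=-1) * s_sf).sum(-1)   # (B, B)

    # Cross-grained contrast: every word vs. video, aggregated the same way.
    s_wv = torch.einsum('iwd,jd->ijw', words, vid)           # (B, B, W)
    s_wv = (F.softmax(s_wv / tau, dim=-1) * s_wv).sum(-1)    # (B, B)

    # Average the granularities into one instance-level score.
    return (s_vv + s_sf + s_wv) / 3.0

# Usage with random features: 4 captions vs. 4 videos.
B, W, Fr, D = 4, 12, 8, 256
scores = cross_grained_score(torch.randn(B, D), torch.randn(B, W, D),
                             torch.randn(B, D), torch.randn(B, Fr, D))
print(scores.shape)  # torch.Size([4, 4])
```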
Multi-modal surrogates for retrieving and making sense of videos: is synchronization between the multiple modalities optimal?
Video surrogates can help people quickly make sense of the content of a video before downloading or seeking more detailed information. Visual and audio features of a video are primary information carriers and might become important components of video retrieval and video sense-making. In the past decades, most research and development efforts on video surrogates have focused on visual features of the video, and comparatively little work has been done on audio surrogates and examining their pros and cons in aiding users' retrieval and sense-making of digital videos. Even less work has been done on multi-modal surrogates, where more than one modality is employed for consuming the surrogates, for example, the audio and visual modalities. This research examined the effectiveness of a number of multi-modal surrogates, and investigated whether synchronization between the audio and visual channels is optimal. A user study was conducted to evaluate six different surrogates on a set of six recognition and inference tasks to answer two main research questions: (1) How do automatically-generated multi-modal surrogates compare to manually-generated ones in video retrieval and video sense-making? and (2) Does synchronization between multiple surrogate channels enhance or inhibit video retrieval and video sense-making? Forty-eight participants took part in the study, in which the surrogates were measured on the time participants spent experiencing the surrogates, the time participants spent on the tasks, participants' performance accuracy on the tasks, participants' confidence in their task responses, and participants' subjective ratings of the surrogates. On average, the uncoordinated surrogates were more helpful than the coordinated ones, but the manually-generated surrogates were only more helpful than the automatically-generated ones in terms of task completion time. Participants' subjective ratings were more favorable for the coordinated surrogate C2 (Magic A + V) and the uncoordinated surrogate U1 (Magic A + Storyboard V) with respect to usefulness, usability, enjoyment, and engagement. The post-session questionnaire comments demonstrated participants' preference for the coordinated surrogates, but the comments also revealed the value of having uncoordinated sensory channels.
MuLTI: Efficient Video-and-Language Understanding with MultiWay-Sampler and Multiple Choice Modeling
Video-and-language understanding has a variety of applications in the
industry, such as video question answering, text-video retrieval and
multi-label classification. Existing video-and-language understanding methods
generally adopt heavy multi-modal encoders and feature fusion modules, which
consume large amounts of GPU memory. In particular, they have difficulty dealing
with the dense video frames or long text that are prevalent in industrial
applications. In this paper, we propose MuLTI, a highly accurate and
memory-efficient video-and-language understanding model that achieves efficient
and effective feature fusion through feature sampling and attention modules.
Therefore, MuLTI can handle longer sequences with limited GPU memory. We then
introduce an attention-based adapter to the encoders, which finetunes the
shallow features to improve the model's performance with low GPU memory
consumption. Finally, to further improve the model's performance, we introduce
a new pretraining task named Multiple Choice Modeling to bridge the task gap
between pretraining and downstream tasks and enhance the model's ability to
align the video and the text. Benefiting from the efficient feature fusion
module, the attention-based adapter and the new pretraining task, MuLTI
achieves state-of-the-art performance on multiple datasets. Implementation and
pretrained models will be released.
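The attention-based adapter idea above can be sketched as a small trainable block, with a residual connection, attached to the hidden states of a frozen encoder layer. The particular design below (lightweight self-attention followed by a bottleneck projection) and all names are assumptions for illustration, not MuLTI's released implementation.

```python
import torch
import torch.nn as nn

class AttentionAdapter(nn.Module):
    """Toy attention-based adapter: a small trainable block added to a
    frozen encoder so shallow features can be adapted cheaply.
    Illustrative only; MuLTI's actual adapter design may differ."""

    def __init__(self, d_model=768, n_heads=8, bottleneck=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, seq, d_model) hidden states from a frozen layer.
        h, _ = self.attn(x, x, x)               # lightweight self-attention
        h = self.up(torch.relu(self.down(h)))   # bottleneck projection
        return self.norm(x + h)                 # residual keeps frozen features

# Usage: adapt hidden states produced by a frozen encoder layer.
adapter = AttentionAdapter()
out = adapter(torch.randn(2, 50, 768))
print(out.shape)  # torch.Size([2, 50, 768])
```

Only the adapter parameters need gradients, which keeps the extra GPU memory for finetuning small relative to updating the full encoders.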
Use What You Have: Video Retrieval Using Representations From Collaborative Experts
The rapid growth of video on the internet has made searching for video
content using natural language queries a significant challenge. Human-generated
queries for video datasets `in the wild' vary a lot in terms of degree of
specificity, with some queries describing specific details such as the names of
famous identities, content from speech, or text available on the screen. Our
goal is to condense the multi-modal, extremely high dimensional information
from videos into a single, compact video representation for the task of video
retrieval using free-form text queries, where the degree of specificity is
open-ended.
For this we exploit existing knowledge in the form of pre-trained semantic
embeddings which include 'general' features such as motion, appearance, and
scene features from visual content. We also explore the use of more 'specific'
cues from ASR and OCR which are intermittently available for videos and find
that these signals remain challenging to use effectively for retrieval. We
propose a collaborative experts model to aggregate information from these
different pre-trained experts and assess our approach empirically on five
retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet. Code and
data can be found at www.robots.ox.ac.uk/~vgg/research/collaborative-experts/.
This paper contains a correction to results reported in the previous version.
Comment: This update contains a correction to previously reported results.
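The expert-aggregation idea above can be sketched as a gated combination of pre-extracted expert embeddings (motion, appearance, scene, ASR, OCR, and so on) into one compact video vector. The gating scheme, expert names, and dimensions below are illustrative assumptions, not the released collaborative-experts code (see the URL above).

```python
import torch
import torch.nn as nn

class ExpertAggregator(nn.Module):
    """Toy aggregation of pre-trained 'expert' embeddings into a single
    video vector. A gated sketch in the spirit of collaborative experts;
    not the released implementation."""

    def __init__(self, expert_dims, d_out=512):
        super().__init__()
        # Project each expert's features to a shared output width.
        self.proj = nn.ModuleDict(
            {name: nn.Linear(d, d_out) for name, d in expert_dims.items()})
        # A gate scores each projected expert so less reliable signals
        # (e.g. intermittent OCR/ASR) can be down-weighted before averaging.
        self.gate = nn.Linear(d_out, 1)

    def forward(self, experts):
        # experts: dict name -> (batch, dim) pre-extracted features.
        projected = torch.stack(
            [self.proj[name](x) for name, x in experts.items()], dim=1)
        weights = torch.softmax(self.gate(projected), dim=1)  # (B, E, 1)
        return (weights * projected).sum(dim=1)               # (B, d_out)

# Usage: three hypothetical experts with different feature widths.
agg = ExpertAggregator({'appearance': 2048, 'motion': 1024, 'asr': 300})
video_vec = agg({'appearance': torch.randn(2, 2048),
                 'motion': torch.randn(2, 1024),
                 'asr': torch.randn(2, 300)})
print(video_vec.shape)  # torch.Size([2, 512])
```

A free-form text query would then be embedded into the same space and matched against the aggregated video vectors by cosine similarity.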