Multi-modal Transformer for Video Retrieval
The task of retrieving video content relevant to natural language queries
plays a critical role in effectively handling internet-scale datasets. Most of
the existing methods for this caption-to-video retrieval problem do not fully
exploit cross-modal cues present in video. Furthermore, they aggregate
per-frame visual features with limited or no temporal information. In this
paper, we present a multi-modal transformer to jointly encode the different
modalities in video, which allows each of them to attend to the others. The
transformer architecture is also leveraged to encode and model the temporal
information. On the natural language side, we investigate the best practices to
jointly optimize the language embedding together with the multi-modal
transformer. This novel framework allows us to establish state-of-the-art
results for video retrieval on three datasets. More details are available at
http://thoth.inrialpes.fr/research/MMT.
Comment: ECCV 2020 (spotlight paper)
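For concreteness, the sketch below illustrates the general idea of jointly encoding several per-frame "expert" features with a single transformer so that each modality can attend to the others and to other time steps. It is a minimal illustration, not the authors' implementation; the feature dimensions, layer counts, and mean pooling are placeholder assumptions.

```python
import torch
import torch.nn as nn

class MultiModalTransformer(nn.Module):
    """Sketch: joint encoding of several per-frame video "expert" features.

    Each modality is projected to a shared width, tagged with a learned
    modality embedding plus a temporal (positional) embedding, then all
    tokens pass through one transformer encoder so every modality can
    attend to the others and across time.
    """

    def __init__(self, num_modalities=3, feat_dims=(2048, 1024, 128),
                 d_model=512, n_heads=8, n_layers=4, max_frames=30):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in feat_dims)
        self.modality_emb = nn.Parameter(torch.randn(num_modalities, d_model))
        self.temporal_emb = nn.Parameter(torch.randn(max_frames, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, features):
        # features: list of tensors, one per modality, each (B, T, feat_dim)
        tokens = []
        for m, feats in enumerate(features):
            t = feats.size(1)
            x = self.proj[m](feats)
            x = x + self.modality_emb[m] + self.temporal_emb[:t]
            tokens.append(x)
        tokens = torch.cat(tokens, dim=1)      # (B, sum of T, d_model)
        encoded = self.encoder(tokens)          # cross-modal + temporal attention
        return encoded.mean(dim=1)              # pooled video embedding

# Usage with dummy appearance / motion / audio features for a batch of 2 clips
video = [torch.randn(2, 30, 2048), torch.randn(2, 30, 1024), torch.randn(2, 30, 128)]
emb = MultiModalTransformer()(video)            # (2, 512), compared against a text embedding
```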
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval
Cross-modal Retrieval methods build similarity relations between vision and
language modalities by jointly learning a common representation space. However,
the predictions are often unreliable due to the Aleatoric uncertainty, which is
induced by low-quality data, e.g., corrupt images, fast-paced videos, and
non-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric
Uncertainty Quantification (PAU) framework to provide trustworthy predictions
by quantifying the uncertainty arisen from the inherent data ambiguity.
Concretely, we first construct a set of various learnable prototypes for each
modality to represent the entire semantics subspace. Then Dempster-Shafer
Theory and Subjective Logic Theory are utilized to build an evidential
theoretical framework by associating evidence with Dirichlet Distribution
parameters. The PAU model induces accurate uncertainty and reliable predictions
for cross-modal retrieval. Extensive experiments are performed on four major
benchmark datasets of MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrating the
effectiveness of our method. The code is accessible at
https://github.com/leolee99/PAU.
Comment: Accepted to NeurIPS 2023
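To make the evidential step more concrete, here is a minimal sketch of how non-negative evidence can be mapped to Dirichlet parameters and a scalar uncertainty in the subjective-logic style the abstract refers to. It does not reproduce PAU's prototype construction; the prototype count, the softplus activation, and all tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dirichlet_uncertainty(evidence):
    """Subjective-logic style uncertainty from non-negative evidence.

    evidence: (B, K) tensor, e.g. obtained by matching a query against K
    learnable prototypes of the other modality (the prototype matching
    itself is not reproduced here).
    """
    alpha = evidence + 1.0                                   # Dirichlet parameters
    strength = alpha.sum(dim=-1, keepdim=True)               # total evidence S
    belief = evidence / strength                             # per-class belief masses
    uncertainty = evidence.size(-1) / strength.squeeze(-1)   # u = K / S
    expected_prob = alpha / strength                         # expected categorical distribution
    return belief, uncertainty, expected_prob

# Example: softplus keeps raw similarity scores non-negative before use as evidence
raw_scores = torch.randn(4, 8)                 # 4 queries vs. 8 prototypes (hypothetical)
evidence = F.softplus(raw_scores)
_, u, probs = dirichlet_uncertainty(evidence)
print(u)                                       # higher u => less reliable prediction
```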
No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection
Temporal video grounding (TVG) aims to retrieve the time interval of a
language query from an untrimmed video. A significant challenge in TVG is the
low "Semantic Noise Ratio (SNR)" of untrimmed videos: the lower the SNR, the
worse the performance. Prior works have addressed this challenge using sophisticated techniques.
In this paper, we propose a no-frills TVG model that consists of two core
modules, namely multi-scale neighboring attention and zoom-in boundary
detection. The multi-scale neighboring attention restricts each video token to
aggregating visual context only from its neighbors, so that the most
distinguishing information can be extracted through multi-scale feature
hierarchies despite the high noise ratio. The zoom-in boundary detection then
performs local discrimination among the selected top candidates for
fine-grained grounding adjustment. With an end-to-end training strategy, our
model achieves competitive performance on different TVG benchmarks, while also
offering faster inference and fewer model parameters thanks to its lightweight
architecture.
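As an illustration of the neighboring-attention idea, the sketch below masks standard multi-head attention so that each video token only attends within a local window, with window sizes growing across blocks to form a multi-scale hierarchy. This is a generic local-attention sketch, not the paper's implementation; window sizes, widths, and layer counts are assumptions.

```python
import torch
import torch.nn as nn

def neighbor_mask(seq_len, window):
    """Boolean mask that lets position i attend only to positions j with |i - j| <= window."""
    idx = torch.arange(seq_len)
    dist = (idx[:, None] - idx[None, :]).abs()
    return dist > window                      # True = masked out (disallowed)

class NeighboringAttentionBlock(nn.Module):
    """One local-attention block; stacking blocks with growing windows
    yields a multi-scale hierarchy (window sizes here are illustrative)."""
    def __init__(self, d_model=256, n_heads=4, window=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (B, T, d_model) sequence of video clip tokens
        mask = neighbor_mask(x.size(1), self.window).to(x.device)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return self.norm(x + out)

tokens = torch.randn(2, 64, 256)
hierarchy = [NeighboringAttentionBlock(window=w) for w in (2, 4, 8)]
for block in hierarchy:
    tokens = block(tokens)                    # progressively larger neighborhoods
```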
Exploiting Semantic Role Contextualized Video Features for Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022
In this report, we present our approach for EPIC-KITCHENS-100 Multi-Instance
Retrieval Challenge 2022. We first parse sentences into semantic roles
corresponding to verbs and nouns; then utilize self-attentions to exploit
semantic role contextualized video features along with textual features via
triplet losses in multiple embedding spaces. Our method surpasses the strong
baseline in normalized Discounted Cumulative Gain (nDCG), which better
reflects semantic similarity. Our submission is ranked 3rd for nDCG and
ranked 4th for mAP.
Comment: Ranked joint 3rd place in the Multi-Instance Retrieval Challenge at EPIC@CVPR2022. (v2: ref error is corrected)
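The report does not include implementation details here, so the following is only a generic sketch of the loss side: a triplet margin loss applied independently in several role-specific embedding spaces. The "verb" and "noun" space names, dimensions, and in-batch negative sampling are assumptions, and the semantic-role parsing and self-attention encoders are not reproduced.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on L2-normalized embeddings."""
    anchor, positive, negative = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    pos = (anchor - positive).pow(2).sum(-1)
    neg = (anchor - negative).pow(2).sum(-1)
    return F.relu(pos - neg + margin).mean()

# Hypothetical per-role embeddings: text and video are each projected into a
# "verb" space and a "noun" space, and a triplet loss is summed over the spaces.
B, D = 8, 256
spaces = {"verb": D, "noun": D}
total = 0.0
for name, dim in spaces.items():
    text_emb = torch.randn(B, dim)               # query embedding for this role
    video_pos = torch.randn(B, dim)              # matching video embedding
    video_neg = video_pos[torch.randperm(B)]     # simple in-batch negatives (toy example)
    total = total + triplet_loss(text_emb, video_pos, video_neg)
print(total)
```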
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
We present Perceiver-VL, a vision-and-language framework that efficiently
handles high-dimensional multimodal inputs such as long videos and text.
Powered by the iterative latent cross-attention of Perceiver, our framework
scales with linear complexity, in contrast to the quadratic complexity of
self-attention used in many state-of-the-art transformer-based models. To
further improve the efficiency of our framework, we also study applying
LayerDrop on cross-attention layers and introduce a mixed-stream architecture
for cross-modal retrieval. We evaluate Perceiver-VL on diverse video-text and
image-text benchmarks, where Perceiver-VL achieves the lowest GFLOPs and
latency while maintaining competitive performance. In addition, we also provide
comprehensive analyses of various aspects of our framework, including
pretraining data, scalability of latent size and input size, dropping
cross-attention layers at inference to reduce latency, modality aggregation
strategy, positional encoding, and weight initialization strategy. Our code and
checkpoints are available at: https://github.com/zinengtang/Perceiver_VL
Comment: WACV 2023 (first two authors contributed equally)
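To show why the latent-attention design scales linearly, here is a minimal Perceiver-style sketch in which a fixed set of latent vectors repeatedly cross-attends to a long multimodal input sequence and then self-attends among themselves. It is not Perceiver-VL itself; the latent count, depth, and pooling are placeholder choices, and LayerDrop and the mixed-stream architecture are omitted.

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """Sketch of iterative latent attention: N latent vectors cross-attend to
    a length-L input, so cost grows as O(N*L) per layer instead of O(L^2)."""

    def __init__(self, d_model=256, n_latents=64, n_heads=4, n_layers=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model))
        self.cross = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers))
        self.self_attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers))

    def forward(self, inputs):
        # inputs: (B, L, d_model) concatenated video-patch and text-token features
        z = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        for cross, self_attn in zip(self.cross, self.self_attn):
            z = z + cross(z, inputs, inputs)[0]     # latents attend to inputs: O(N*L)
            z = z + self_attn(z, z, z)[0]           # latents attend to latents: O(N^2)
        return z.mean(dim=1)                        # pooled multimodal embedding

long_input = torch.randn(2, 5000, 256)              # e.g. many frame patches + text tokens
emb = LatentCrossAttention()(long_input)             # (2, 256)
```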
Multimodal Characterization of Emotion within Multimedia Space
Technological advancement and its omnipresent connectivity have pushed humans
past the boundaries and limitations of a computer screen, physical state, or
geographical location. It has provided a wealth of avenues that facilitate
once-inconceivable forms of human-computer interaction, such as audio and
body-language detection. Given the complex modalities of emotion, studying
human-computer interaction becomes vital, as it is the starting point for a
thorough understanding of the emotional state of users and, in the context of
social networks, of the producers of multimodal information. This study first
acknowledges the accuracy of classification found within multimodal emotion
detection systems compared to unimodal solutions. Second, it explores the
characterization of multimedia content produced based on their emotions and the
coherence of emotion in different modalities by utilizing deep learning models
to classify emotion across different modalities.
Comment: 8 pages. Published in the International Conference on Computers and Computation (COMPUTE 2022), November 03-04, 2022, San Francisco, United States
Encoding and Decoding Narratives: Datafication and Alternative Access Models for Audiovisual Archives
Situated in the intersection of audiovisual archives, computational methods,
and immersive interactions, this work probes the increasingly important
accessibility issues from a two-fold approach. Firstly, the work proposes an
ontological data model to handle complex descriptors (metadata, feature
vectors, etc.) with regard to user interactions. Secondly, this work examines
text-to-video retrieval from an implementation perspective by proposing a
classifier-enhanced workflow to deal with complex and hybrid queries and a
training data augmentation workflow to improve performance. This work serves as
the foundation for experimenting with novel public-facing access models to
large audiovisual archives.
Comment: arXiv admin note: substantial text overlap with arXiv:2310.0582
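The abstract does not spell out the classifier-enhanced workflow, so the snippet below is only a hypothetical illustration of the general pattern of handling hybrid queries: a query-type classifier routes each query to, say, a metadata lookup or an embedding-based text-to-video retriever. Every component name here is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class RetrievalResult:
    video_id: str
    score: float

def route_query(query: str,
                classify: Callable[[str], str],
                retrievers: Dict[str, Callable[[str], List[RetrievalResult]]]
                ) -> List[RetrievalResult]:
    """Send a free-text query to the retriever suited to its predicted type
    (e.g. metadata lookup vs. embedding-based text-to-video search)."""
    query_type = classify(query)
    retriever = retrievers.get(query_type, retrievers["semantic"])
    return retriever(query)

# Toy usage with stand-in components.
classify = lambda q: "metadata" if q.lower().startswith("title:") else "semantic"
retrievers = {
    "metadata": lambda q: [RetrievalResult("archive/0001", 1.0)],
    "semantic": lambda q: [RetrievalResult("archive/0042", 0.87)],
}
print(route_query("people dancing at a festival", classify, retrievers))
```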
Sinkhorn Transformations for Single-Query Postprocessing in Text-Video Retrieval
A recent trend in multimodal retrieval is related to postprocessing test set
results via the dual-softmax loss (DSL). While this approach can bring
significant improvements, it usually presumes that an entire matrix of test
samples is available as DSL input. This work introduces a new postprocessing
approach based on Sinkhorn transformations that outperforms DSL. Further, we
propose a new postprocessing setting that does not require access to multiple
test queries. We show that our approach can significantly improve the results
of state-of-the-art models such as CLIP4Clip, BLIP, X-CLIP, and DRL, thus
achieving a new state-of-the-art on several standard text-video retrieval
datasets both with access to the entire test set and in the single-query
setting.
Comment: SIGIR 2023
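For intuition, here is a minimal sketch of Sinkhorn-style postprocessing of a query-by-video similarity matrix: alternating row and column normalization of an exponentiated similarity kernel, analogous in spirit to the dual-softmax loss. The iteration count, temperature, and toy data are assumptions, and the paper's single-query variant is not reproduced.

```python
import torch
import torch.nn.functional as F

def sinkhorn_postprocess(sim, n_iters=10, temperature=0.05):
    """Alternating row/column normalization of a query-by-video similarity
    matrix (generic Sinkhorn iteration, used here only as test-time
    postprocessing of retrieval scores)."""
    K = torch.exp(sim / temperature)            # positive kernel matrix
    for _ in range(n_iters):
        K = K / K.sum(dim=1, keepdim=True)      # normalize over videos (rows)
        K = K / K.sum(dim=0, keepdim=True)      # normalize over queries (columns)
    return K

# Example: rerank with postprocessed scores instead of raw cosine similarities
q = F.normalize(torch.randn(100, 256), dim=-1)  # toy text embeddings
v = F.normalize(torch.randn(100, 256), dim=-1)  # toy video embeddings
sim = q @ v.T                                   # cosine similarities in [-1, 1]
scores = sinkhorn_postprocess(sim)
ranks = scores.argsort(dim=1, descending=True)  # per-query video ranking
```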