Multi-modal Transformer for Video Retrieval
The task of retrieving video content relevant to natural language queries
plays a critical role in effectively handling internet-scale datasets. Most of
the existing methods for this caption-to-video retrieval problem do not fully
exploit cross-modal cues present in video. Furthermore, they aggregate
per-frame visual features with limited or no temporal information. In this
paper, we present a multi-modal transformer to jointly encode the different
modalities in video, which allows each of them to attend to the others. The
transformer architecture is also leveraged to encode and model the temporal
information. On the natural language side, we investigate the best practices to
jointly optimize the language embedding together with the multi-modal
transformer. This novel framework allows us to establish state-of-the-art
results for video retrieval on three datasets. More details are available at
http://thoth.inrialpes.fr/research/MMT.
Comment: ECCV 2020 (spotlight paper)
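For concreteness, the sketch below illustrates the general idea of jointly encoding several per-frame "expert" features with a single transformer so that each modality can attend to the others and to other time steps. It is a minimal illustration, not the authors' implementation; the feature dimensions, layer counts, and mean pooling are placeholder assumptions.

```python
import torch
import torch.nn as nn

class MultiModalTransformer(nn.Module):
    """Sketch: joint encoding of several per-frame video "expert" features.

    Each modality is projected to a shared width, tagged with a learned
    modality embedding plus a temporal (positional) embedding, then all
    tokens pass through one transformer encoder so every modality can
    attend to the others and across time.
    """

    def __init__(self, num_modalities=3, feat_dims=(2048, 1024, 128),
                 d_model=512, n_heads=8, n_layers=4, max_frames=30):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in feat_dims)
        self.modality_emb = nn.Parameter(torch.randn(num_modalities, d_model))
        self.temporal_emb = nn.Parameter(torch.randn(max_frames, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, features):
        # features: list of tensors, one per modality, each (B, T, feat_dim)
        tokens = []
        for m, feats in enumerate(features):
            t = feats.size(1)
            x = self.proj[m](feats)
            x = x + self.modality_emb[m] + self.temporal_emb[:t]
            tokens.append(x)
        tokens = torch.cat(tokens, dim=1)      # (B, sum of T, d_model)
        encoded = self.encoder(tokens)          # cross-modal + temporal attention
        return encoded.mean(dim=1)              # pooled video embedding

# Usage with dummy appearance / motion / audio features for a batch of 2 clips
video = [torch.randn(2, 30, 2048), torch.randn(2, 30, 1024), torch.randn(2, 30, 128)]
emb = MultiModalTransformer()(video)            # (2, 512), compared against a text embedding
```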
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval
Cross-modal Retrieval methods build similarity relations between vision and
language modalities by jointly learning a common representation space. However,
the predictions are often unreliable due to the Aleatoric uncertainty, which is
induced by low-quality data, e.g., corrupt images, fast-paced videos, and
non-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric
Uncertainty Quantification (PAU) framework to provide trustworthy predictions
by quantifying the uncertainty arisen from the inherent data ambiguity.
Concretely, we first construct a set of various learnable prototypes for each
modality to represent the entire semantics subspace. Then Dempster-Shafer
Theory and Subjective Logic Theory are utilized to build an evidential
theoretical framework by associating evidence with Dirichlet Distribution
parameters. The PAU model induces accurate uncertainty and reliable predictions
for cross-modal retrieval. Extensive experiments are performed on four major
benchmark datasets of MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrating the
effectiveness of our method. The code is accessible at
https://github.com/leolee99/PAU.
Comment: Accepted to NeurIPS 2023
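To make the evidential step more concrete, here is a minimal sketch of how non-negative evidence can be mapped to Dirichlet parameters and a scalar uncertainty in the subjective-logic style the abstract refers to. It does not reproduce PAU's prototype construction; the prototype count, the softplus activation, and all tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dirichlet_uncertainty(evidence):
    """Subjective-logic style uncertainty from non-negative evidence.

    evidence: (B, K) tensor, e.g. obtained by matching a query against K
    learnable prototypes of the other modality (the prototype matching
    itself is not reproduced here).
    """
    alpha = evidence + 1.0                                   # Dirichlet parameters
    strength = alpha.sum(dim=-1, keepdim=True)               # total evidence S
    belief = evidence / strength                             # per-class belief masses
    uncertainty = evidence.size(-1) / strength.squeeze(-1)   # u = K / S
    expected_prob = alpha / strength                         # expected categorical distribution
    return belief, uncertainty, expected_prob

# Example: softplus keeps raw similarity scores non-negative before use as evidence
raw_scores = torch.randn(4, 8)                 # 4 queries vs. 8 prototypes (hypothetical)
evidence = F.softplus(raw_scores)
_, u, probs = dirichlet_uncertainty(evidence)
print(u)                                       # higher u => less reliable prediction
```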
No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection
Temporal video grounding (TVG) aims to retrieve the time interval of a
language query from an untrimmed video. A significant challenge in TVG is the
low "Semantic Noise Ratio (SNR)" of untrimmed videos: the lower the SNR, the
worse the performance. Prior works have addressed this challenge using sophisticated techniques.
In this paper, we propose a no-frills TVG model that consists of two core
modules, namely multi-scale neighboring attention and zoom-in boundary
detection. The multi-scale neighboring attention restricts each video token to
aggregating visual context only from its neighbors, so that the most
distinguishing information can be extracted through multi-scale feature
hierarchies despite the high noise ratio. The zoom-in boundary detection then
performs local discrimination among the selected top candidates for
fine-grained grounding adjustment. With an end-to-end training strategy, our
model achieves competitive performance on different TVG benchmarks, while also
offering faster inference and fewer model parameters thanks to its lightweight
architecture.
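As an illustration of the neighboring-attention idea, the sketch below masks standard multi-head attention so that each video token only attends within a local window, with window sizes growing across blocks to form a multi-scale hierarchy. This is a generic local-attention sketch, not the paper's implementation; window sizes, widths, and layer counts are assumptions.

```python
import torch
import torch.nn as nn

def neighbor_mask(seq_len, window):
    """Boolean mask that lets position i attend only to positions j with |i - j| <= window."""
    idx = torch.arange(seq_len)
    dist = (idx[:, None] - idx[None, :]).abs()
    return dist > window                      # True = masked out (disallowed)

class NeighboringAttentionBlock(nn.Module):
    """One local-attention block; stacking blocks with growing windows
    yields a multi-scale hierarchy (window sizes here are illustrative)."""
    def __init__(self, d_model=256, n_heads=4, window=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (B, T, d_model) sequence of video clip tokens
        mask = neighbor_mask(x.size(1), self.window).to(x.device)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return self.norm(x + out)

tokens = torch.randn(2, 64, 256)
hierarchy = [NeighboringAttentionBlock(window=w) for w in (2, 4, 8)]
for block in hierarchy:
    tokens = block(tokens)                    # progressively larger neighborhoods
```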
Exploiting Semantic Role Contextualized Video Features for Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022
In this report, we present our approach for EPIC-KITCHENS-100 Multi-Instance
Retrieval Challenge 2022. We first parse sentences into semantic roles
corresponding to verbs and nouns; then utilize self-attentions to exploit
semantic role contextualized video features along with textual features via
triplet losses in multiple embedding spaces. Our method surpasses the strong
baseline in normalized Discounted Cumulative Gain (nDCG), which better
reflects semantic similarity. Our submission is ranked 3rd for nDCG and
ranked 4th for mAP.
Comment: Ranked joint 3rd place in the Multi-Instance Retrieval Challenge at EPIC@CVPR2022. (v2: ref error is corrected)
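The report does not include implementation details here, so the following is only a generic sketch of the loss side: a triplet margin loss applied independently in several role-specific embedding spaces. The "verb" and "noun" space names, dimensions, and in-batch negative sampling are assumptions, and the semantic-role parsing and self-attention encoders are not reproduced.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on L2-normalized embeddings."""
    anchor, positive, negative = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    pos = (anchor - positive).pow(2).sum(-1)
    neg = (anchor - negative).pow(2).sum(-1)
    return F.relu(pos - neg + margin).mean()

# Hypothetical per-role embeddings: text and video are each projected into a
# "verb" space and a "noun" space, and a triplet loss is summed over the spaces.
B, D = 8, 256
spaces = {"verb": D, "noun": D}
total = 0.0
for name, dim in spaces.items():
    text_emb = torch.randn(B, dim)               # query embedding for this role
    video_pos = torch.randn(B, dim)              # matching video embedding
    video_neg = video_pos[torch.randperm(B)]     # simple in-batch negatives (toy example)
    total = total + triplet_loss(text_emb, video_pos, video_neg)
print(total)
```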
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
We present Perceiver-VL, a vision-and-language framework that efficiently
handles high-dimensional multimodal inputs such as long videos and text.
Powered by the iterative latent cross-attention of Perceiver, our framework
scales with linear complexity, in contrast to the quadratic complexity of
self-attention used in many state-of-the-art transformer-based models. To
further improve the efficiency of our framework, we also study applying
LayerDrop on cross-attention layers and introduce a mixed-stream architecture
for cross-modal retrieval. We evaluate Perceiver-VL on diverse video-text and
image-text benchmarks, where Perceiver-VL achieves the lowest GFLOPs and
latency while maintaining competitive performance. In addition, we also provide
comprehensive analyses of various aspects of our framework, including
pretraining data, scalability of latent size and input size, dropping
cross-attention layers at inference to reduce latency, modality aggregation
strategy, positional encoding, and weight initialization strategy. Our code and
checkpoints are available at: https://github.com/zinengtang/Perceiver_VL
Comment: WACV 2023 (first two authors contributed equally)
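To show why the latent-attention design scales linearly, here is a minimal Perceiver-style sketch in which a fixed set of latent vectors repeatedly cross-attends to a long multimodal input sequence and then self-attends among themselves. It is not Perceiver-VL itself; the latent count, depth, and pooling are placeholder choices, and LayerDrop and the mixed-stream architecture are omitted.

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """Sketch of iterative latent attention: N latent vectors cross-attend to
    a length-L input, so cost grows as O(N*L) per layer instead of O(L^2)."""

    def __init__(self, d_model=256, n_latents=64, n_heads=4, n_layers=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model))
        self.cross = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers))
        self.self_attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers))

    def forward(self, inputs):
        # inputs: (B, L, d_model) concatenated video-patch and text-token features
        z = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        for cross, self_attn in zip(self.cross, self.self_attn):
            z = z + cross(z, inputs, inputs)[0]     # latents attend to inputs: O(N*L)
            z = z + self_attn(z, z, z)[0]           # latents attend to latents: O(N^2)
        return z.mean(dim=1)                        # pooled multimodal embedding

long_input = torch.randn(2, 5000, 256)              # e.g. many frame patches + text tokens
emb = LatentCrossAttention()(long_input)             # (2, 256)
```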
Multimodal Characterization of Emotion within Multimedia Space
Technological advancement and its omnipresent connectivity have pushed humans
past the boundaries and limitations of a computer screen, physical state, or
geographical location. It has provided a wealth of avenues that facilitate
once-inconceivable forms of human-computer interaction, such as audio and
body-language detection. Given the complex modalities of emotion, studying
human-computer interaction becomes vital, as it is the starting point for a
thorough understanding of the emotional state of users and, in the context of
social networks, of the producers of multimodal information. This study first
acknowledges the accuracy of classification found within multimodal emotion
detection systems compared to unimodal solutions. Second, it explores the
characterization of multimedia content produced based on their emotions and the
coherence of emotion in different modalities by utilizing deep learning models
to classify emotion across different modalities.
Comment: 8 pages. Published in the International Conference on Computers and Computation (COMPUTE 2022), November 03-04, 2022, San Francisco, United States
Encoding and Decoding Narratives: Datafication and Alternative Access Models for Audiovisual Archives
Situated in the intersection of audiovisual archives, computational methods,
and immersive interactions, this work probes the increasingly important
accessibility issues from a two-fold approach. Firstly, the work proposes an
ontological data model to handle complex descriptors (metadata, feature
vectors, etc.) with regard to user interactions. Secondly, this work examines
text-to-video retrieval from an implementation perspective by proposing a
classifier-enhanced workflow to deal with complex and hybrid queries and a
training data augmentation workflow to improve performance. This work serves as
the foundation for experimenting with novel public-facing access models to
large audiovisual archives.
Comment: arXiv admin note: substantial text overlap with arXiv:2310.0582
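The abstract does not spell out the classifier-enhanced workflow, so the snippet below is only a hypothetical illustration of the general pattern of handling hybrid queries: a query-type classifier routes each query to, say, a metadata lookup or an embedding-based text-to-video retriever. Every component name here is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class RetrievalResult:
    video_id: str
    score: float

def route_query(query: str,
                classify: Callable[[str], str],
                retrievers: Dict[str, Callable[[str], List[RetrievalResult]]]
                ) -> List[RetrievalResult]:
    """Send a free-text query to the retriever suited to its predicted type
    (e.g. metadata lookup vs. embedding-based text-to-video search)."""
    query_type = classify(query)
    retriever = retrievers.get(query_type, retrievers["semantic"])
    return retriever(query)

# Toy usage with stand-in components.
classify = lambda q: "metadata" if q.lower().startswith("title:") else "semantic"
retrievers = {
    "metadata": lambda q: [RetrievalResult("archive/0001", 1.0)],
    "semantic": lambda q: [RetrievalResult("archive/0042", 0.87)],
}
print(route_query("people dancing at a festival", classify, retrievers))
```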
Sinkhorn Transformations for Single-Query Postprocessing in Text-Video Retrieval
A recent trend in multimodal retrieval is related to postprocessing test set
results via the dual-softmax loss (DSL). While this approach can bring
significant improvements, it usually presumes that an entire matrix of test
samples is available as DSL input. This work introduces a new postprocessing
approach based on Sinkhorn transformations that outperforms DSL. Further, we
propose a new postprocessing setting that does not require access to multiple
test queries. We show that our approach can significantly improve the results
of state-of-the-art models such as CLIP4Clip, BLIP, X-CLIP, and DRL, thus
achieving a new state-of-the-art on several standard text-video retrieval
datasets both with access to the entire test set and in the single-query
setting.
Comment: SIGIR 2023
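For intuition, here is a minimal sketch of Sinkhorn-style postprocessing of a query-by-video similarity matrix: alternating row and column normalization of an exponentiated similarity kernel, analogous in spirit to the dual-softmax loss. The iteration count, temperature, and toy data are assumptions, and the paper's single-query variant is not reproduced.

```python
import torch
import torch.nn.functional as F

def sinkhorn_postprocess(sim, n_iters=10, temperature=0.05):
    """Alternating row/column normalization of a query-by-video similarity
    matrix (generic Sinkhorn iteration, used here only as test-time
    postprocessing of retrieval scores)."""
    K = torch.exp(sim / temperature)            # positive kernel matrix
    for _ in range(n_iters):
        K = K / K.sum(dim=1, keepdim=True)      # normalize over videos (rows)
        K = K / K.sum(dim=0, keepdim=True)      # normalize over queries (columns)
    return K

# Example: rerank with postprocessed scores instead of raw cosine similarities
q = F.normalize(torch.randn(100, 256), dim=-1)  # toy text embeddings
v = F.normalize(torch.randn(100, 256), dim=-1)  # toy video embeddings
sim = q @ v.T                                   # cosine similarities in [-1, 1]
scores = sinkhorn_postprocess(sim)
ranks = scores.argsort(dim=1, descending=True)  # per-query video ranking
```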