1,071 research outputs found

Simple to Complex Cross-modal Learning to Rank

Authors: Luo Minnan; Chang Xiaojun; Li Zhihui; Nie Liqiang; Hauptmann Alexander G.; Zheng Qinghua
Publication date: 01/01/1998

The heterogeneity gap between different modalities poses a significant challenge to multimedia information retrieval. Some studies formalize cross-modal retrieval as a ranking problem and learn a shared multi-modal embedding space in which to measure cross-modality similarity. However, previous methods often build the shared embedding space from linear mapping functions, which may not be expressive enough to reveal more complicated inter-modal correspondences. Additionally, current studies assume that all rankings are of equal importance, so either all rankings are used simultaneously or a small number of rankings are selected at random to train the embedding space at each iteration. Such strategies, however, suffer from outliers and from reduced generalization capability because they do not take the process of human cognition into account. In this paper, we incorporate self-paced learning with diversity into cross-modal learning to rank and learn an optimal multi-modal embedding space based on non-linear mapping functions. This strategy enhances the model's robustness to outliers and achieves better generalization by training the model gradually, from easy rankings by diverse queries to more complex ones. An efficient alternative algorithm is used to solve the proposed problem, with fast convergence in practice. Extensive experimental results on several benchmark datasets indicate that the proposed method achieves significant improvements over the state of the art.

Comment: 14 pages; accepted by Computer Vision and Image Understanding
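
The "easy rankings by diverse queries first" idea in the abstract above can be illustrated with the generic selection rule from self-paced learning with diversity. This is only a sketch under assumptions: the function name `spld_select`, the grouping of rankings by query, and the parameters `lam` (easiness threshold) and `gamma` (diversity weight) are illustrative, and the paper's exact objective and update may differ.

```python
import math

def spld_select(losses_by_group, lam, gamma):
    """Pick easy samples while spreading the picks across groups.

    losses_by_group: {group_id: [loss, ...]} (hypothetical layout, e.g. one
    group per query). Returns {group_id: [0 or 1 per sample]}.
    """
    selected = {}
    for group, losses in losses_by_group.items():
        # Rank the samples of this group from easiest (lowest loss) to hardest.
        order = sorted(range(len(losses)), key=lambda i: losses[i])
        flags = [0] * len(losses)
        for rank, idx in enumerate(order, start=1):
            # The threshold shrinks with rank, so the first pick in a new
            # group is favoured over a further pick in an already used group.
            threshold = lam + gamma / (math.sqrt(rank) + math.sqrt(rank - 1))
            if losses[idx] < threshold:
                flags[idx] = 1
        selected[group] = flags
    return selected

# Made-up losses for two queries; q2 contains only moderately hard rankings.
example_losses = {"q1": [0.1, 0.2, 0.9], "q2": [0.6, 0.7]}
with_diversity = spld_select(example_losses, lam=0.3, gamma=0.4)
without_diversity = spld_select(example_losses, lam=0.3, gamma=0.0)
```

With `gamma=0.0` the rule reduces to plain self-paced learning and nothing from `q2` is selected; with `gamma=0.4` the easiest ranking of `q2` is admitted as well. In a training loop, `lam` would typically be increased over iterations so that harder rankings enter gradually.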

arXiv.org e-Print Archive

Crossref

OPUS - University of Technology Sydney

Wageningen University & Research Publications

Recent Advances in Transfer Learning for Cross-Dataset Visual Recognition: A Problem-Oriented Perspective

Authors: Li Wanqing; Ogunbona Philip; Xu Dong; Zhang Jing
Publication date: 01/01/2019

This paper takes a problem-oriented perspective and presents a comprehensive review of transfer learning methods, both shallow and deep, for cross-dataset visual recognition. Specifically, it categorises cross-dataset recognition into seventeen problems based on a set of carefully chosen data and label attributes. This problem-oriented taxonomy allows us to examine how different transfer learning approaches tackle each problem and how well each problem has been researched to date. The review reveals not only the challenges in transfer learning for visual recognition, but also the problems (e.g. eight of the seventeen) that have scarcely been studied. This survey presents not only an up-to-date technical review for researchers, but also a systematic approach and a reference with which a machine learning practitioner can categorise a real problem and look up a possible solution accordingly.

arXiv.org e-Print Archive

Research Online

Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision

Authors: Arponen Heikki; Bishop Tom E.; Laaksonen Jorma; Langer Tomas; Wang Tzu-Jui Julius
Publication date: 27/10/2022

Weakly-supervised vision-language (V-L) pre-training (W-VLP) aims at learning cross-modal alignment with little or no paired data, such as aligned images and captions. Recent W-VLP methods, which pair visual features with object tags, help achieve performance comparable with some VLP models trained with aligned pairs in various V-L downstream tasks. This, however, is not the case in cross-modal retrieval (XMR). We argue that the learning of such a W-VLP model is curbed and biased by the object tags of limited semantics. We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH), which is trained via weak supervision as a W-VLP model and does not require images paired with captions. WFH generates visual hallucinations from texts, which are then paired with the originally unpaired texts, allowing more diverse interactions across modalities. Empirically, WFH consistently boosts prior W-VLP works, e.g. U-VisualBERT (U-VB), over a variety of V-L tasks, including XMR and Visual Question Answering. Notably, benchmarked with recall@{1,5,10}, it consistently improves U-VB on image-to-text and text-to-image retrieval on two popular datasets, Flickr30K and MSCOCO. It also gains at least 14.5% in cross-dataset generalization tests on these XMR tasks. Moreover, in the other V-L downstream tasks considered, our WFH models are on par with models trained with paired V-L data, revealing the utility of unpaired data. These results demonstrate greater generalization of the proposed W-VLP model with WFH.

Comment: Accepted to WACV'23. Please find supplementary material at https://drive.google.com/file/d/1SmCBGsUgkYLAhmK83RZqY03bq4j3214p/view?usp=sharin
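
The recall@{1,5,10} metric mentioned in the abstract above is commonly computed as the fraction of queries for which at least one correct item appears among the top K retrieved results. The sketch below follows that common definition; the function name `recall_at_k` and the toy data are illustrative assumptions, and the paper's exact evaluation protocol is not given in the abstract.

```python
def recall_at_k(rankings, relevant, k):
    """Fraction of queries with at least one relevant item in the top k.

    rankings: one list of retrieved item ids per query, best match first.
    relevant: one set of correct item ids per query (same order as rankings).
    """
    hits = 0
    for ranking, correct in zip(rankings, relevant):
        # A query counts as a hit if any of its top-k items is correct.
        if any(item in correct for item in ranking[:k]):
            hits += 1
    return hits / len(rankings)

# Toy data: three text queries, each with a ranked list of image ids.
toy_rankings = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
toy_relevant = [{"a"}, {"f"}, {"z"}]

r_at_1 = recall_at_k(toy_rankings, toy_relevant, k=1)  # only the first query hits
r_at_3 = recall_at_k(toy_rankings, toy_relevant, k=3)  # first and second queries hit
```

The same function covers both retrieval directions: for image-to-text retrieval the queries are images and the ranked items are captions, and for text-to-image retrieval the roles are swapped.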

arXiv.org e-Print Archive

Sign language video retrieval with free-form textual queries

Authors: Albanie Samuel; Cardoso Duarte Amanda; Giró Nieto Xavier; Varol Gül
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2022

Systems that can efficiently search collections of sign language videos have been highlighted as a useful application of sign language technology. However, the problem of searching videos beyond individual keywords has received limited attention in the literature. To address this gap, in this work we introduce the task of sign language retrieval with textual queries: given a written query (e.g. a sentence) and a large collection of sign language videos, the objective is to find the signing video that best matches the written query. We propose to tackle this task by learning cross-modal embeddings on the recently introduced large-scale How2Sign dataset of American Sign Language (ASL). We identify that a key bottleneck in the performance of the system is the quality of the sign video embedding, which suffers from a scarcity of labelled training data. We therefore propose SPOT-ALIGN, a framework for interleaving iterative rounds of sign spotting and feature alignment to expand the scope and scale of available training data. We validate the effectiveness of SPOT-ALIGN for learning a robust sign video embedding through improvements in both sign recognition and the proposed video retrieval task.

This work was supported by the project PID2020-117142GB-I00, funded by MCIN/AEI/10.13039/501100011033, ANR project CorVis ANR-21-CE23-0003-01, and gifts from Google and Adobe. AD received support from la Caixa Foundation (ID 100010434), fellowship code LCF/BQ/IN18/11660029.

Peer Reviewed

Sustainable Development Goals::10 - Reduced Inequalities
Sustainable Development Goals::10 - Reduced Inequalities::10.2 - By 2030, empower and promote the social, economic and political inclusion of all, irrespective of age, sex, disability, race, ethnicity, origin, religion or economic or other status

Postprint (author's final draft)
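
The retrieval step described in the abstract above, finding the video that best matches a written query in a shared embedding space, can be sketched as a nearest-neighbour search by cosine similarity. This is a minimal illustration under assumptions: the embeddings are made-up 2-D vectors, the names `cosine` and `rank_videos` are hypothetical, and the text and video encoders that would produce real embeddings are not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_videos(query_embedding, video_embeddings):
    """Return video ids sorted from best to worst match for the query."""
    scores = {vid: cosine(query_embedding, emb)
              for vid, emb in video_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Made-up embeddings standing in for encoder outputs.
toy_videos = {"v1": [1.0, 0.0], "v2": [0.0, 1.0], "v3": [0.7, 0.7]}
toy_query = [0.9, 0.1]
toy_ranking = rank_videos(toy_query, toy_videos)
```

Here `toy_ranking[0]` is the retrieved video; ranking the whole collection rather than returning a single id also makes it straightforward to evaluate with rank-based metrics.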

arXiv.org e-Print Archive

UPCommons. Portal del coneixement obert de la UPC

HAL-Ecole des Ponts ParisTech