
    Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control

    Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors that can be indexed and retrieved efficiently with an inverted index. We explore the application of LSR to the multimodal domain, with a focus on text-image retrieval. While LSR has seen success in text retrieval, its application in multimodal retrieval remains underexplored. Current approaches such as LexLIP and STAIR require complex multi-step training on massive datasets. Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors. We address the issues of high dimension co-activation and semantic deviation with a new training algorithm that uses Bernoulli random variables to control query expansion. Experiments with two dense models (BLIP, ALBEF) and two datasets (MSCOCO, Flickr30k) show that our proposed algorithm effectively reduces co-activation and semantic deviation. Our best-performing sparsified model outperforms state-of-the-art text-image LSR models with shorter training time and lower GPU memory requirements. Our approach offers an effective solution for training LSR retrieval models in multimodal settings. Our code and model checkpoints are available at github.com/thongnt99/lsr-multimodal
    Comment: 17 pages, accepted as a full paper at ECIR 2024
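    A minimal sketch of the idea described above, assuming a SPLADE-style setup: a hypothetical SparseProjectionHead maps a frozen dense embedding onto a vocabulary-sized space and stochastically gates expansion terms with Bernoulli samples during training. The class name, dimensions, and keep_prob parameter are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SparseProjectionHead(nn.Module):
    """Hypothetical sketch: maps a frozen dense embedding (e.g. from BLIP/ALBEF)
    into a vocabulary-sized sparse lexical vector, with a Bernoulli gate that
    stochastically suppresses expansion terms during training."""

    def __init__(self, dense_dim: int = 256, vocab_size: int = 30522, keep_prob: float = 0.5):
        super().__init__()
        self.proj = nn.Linear(dense_dim, vocab_size)
        self.keep_prob = keep_prob  # assumed probability of keeping an expansion term

    def forward(self, dense_vec: torch.Tensor, exact_term_mask: torch.Tensor) -> torch.Tensor:
        # log-saturated ReLU encourages sparse, non-negative term weights
        weights = torch.log1p(torch.relu(self.proj(dense_vec)))
        if self.training:
            # expansion dimensions are those not present in the original caption;
            # gate each with an independent Bernoulli sample to limit expansion
            gate = torch.bernoulli(torch.full_like(weights, self.keep_prob))
            weights = torch.where(exact_term_mask.bool(), weights, weights * gate)
        return weights
```

    At inference time the gate is disabled, so sparsity comes only from the log-saturated activation; the paper's actual expansion-control schedule and loss terms may differ.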

    An information-theoretic framework for semantic-multimedia retrieval

    This article is set in the context of searching text and image repositories by keyword. We develop a unified probabilistic framework for text, image, and combined text and image retrieval that is based on the detection of keywords (concepts) using automated image annotation technology. Our framework is deeply rooted in information theory and lends itself to use with other media types. We estimate a statistical model in a multimodal feature space for each possible query keyword. The key element of our framework is to identify feature space transformations that make the candidate feature spaces comparable in complexity and density. We select the optimal multimodal feature space with a minimum description length criterion from a set of candidate feature spaces that are computed with the average-mutual-information criterion for the text part and hierarchical expectation maximization for the visual part of the data. We evaluate our approach in three retrieval experiments (text-only retrieval, image-only retrieval, and combined text and image retrieval), verify the framework's low computational complexity, and compare with existing state-of-the-art ad-hoc models.
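    The minimum-description-length selection step can be illustrated with a generic two-part MDL criterion over candidate feature spaces, here sketched with diagonal-covariance Gaussian mixtures from scikit-learn. This is a hedged approximation: the function names, the penalty form, and the use of GaussianMixture are assumptions rather than the article's exact criterion.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mdl_score(gmm: GaussianMixture, X: np.ndarray) -> float:
    """Two-part MDL: negative log-likelihood of the data plus a parameter-cost
    penalty of (k/2) * log(N), where k counts free mixture parameters."""
    n, d = X.shape
    # free parameters for a diagonal GMM: means + diagonal covariances + mixing weights
    k = gmm.n_components * (2 * d) + (gmm.n_components - 1)
    neg_log_lik = -gmm.score(X) * n  # score() returns mean log-likelihood per sample
    return neg_log_lik + 0.5 * k * np.log(n)

def select_feature_space(candidates: dict[str, np.ndarray], n_components: int = 8) -> str:
    """Fit a mixture model to each candidate feature space and keep the one
    with the smallest description length."""
    scores = {}
    for name, X in candidates.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag").fit(X)
        scores[name] = mdl_score(gmm, X)
    return min(scores, key=scores.get)
```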

    LiveSketch: Query Perturbations for Guided Sketch-based Visual Search

    LiveSketch is a novel algorithm for searching large image collections using hand-sketched queries. LiveSketch tackles the inherent ambiguity of sketch search by creating visual suggestions that augment the query as it is drawn, making query specification an iterative rather than one-shot process that helps disambiguate users' search intent. Our technical contributions are: a triplet convnet architecture that incorporates an RNN-based variational autoencoder to search for images using vector (stroke-based) queries; real-time clustering to identify likely search intents (and so, targets within the search embedding); and the use of backpropagation from those targets to perturb the input stroke sequence, so suggesting alterations to the query in order to guide the search. We show improvements in accuracy and time-to-task over contemporary baselines using a 67M image corpus.
    Comment: Accepted to CVPR 2019
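    The query-perturbation step, where gradients from a selected search-intent target are backpropagated to alter the query itself, can be sketched roughly as below. The encoder callable, the MSE objective, and the step count are hypothetical stand-ins for the paper's triplet/VAE search embedding.

```python
import torch

def perturb_query(query: torch.Tensor, target: torch.Tensor, encoder,
                  steps: int = 25, lr: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch of gradient-guided query perturbation: nudge the
    (stroke-based) query representation so its embedding moves toward a
    clustered search-intent target, as the abstract describes at a high level."""
    query = query.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([query], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        # distance between the query's embedding and the chosen intent target
        loss = torch.nn.functional.mse_loss(encoder(query), target)
        loss.backward()   # backpropagate through the (frozen) encoder
        optimizer.step()  # update the query itself, not the encoder weights
    return query.detach()
```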

    CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation

    Vision-Language Pretraining (VLP) has shown impressive results on diverse downstream tasks by offline training on large-scale datasets. Given the continually growing nature of real-world data, such an offline training paradigm on ever-expanding data is unsustainable, because models lack the continual learning ability to accumulate knowledge constantly. However, most continual learning studies are limited to uni-modal classification, and existing multi-modal datasets cannot simulate continual non-stationary data stream scenarios. To support the study of Vision-Language Continual Pretraining (VLCP), we first contribute a comprehensive and unified benchmark dataset, P9D, which contains over one million product image-text pairs from 9 industries. The data from each industry serves as an independent task to support continual learning, and it follows a real-world long-tail distribution to simulate pretraining on web data. We comprehensively study the characteristics and challenges of VLCP, and propose a new algorithm: Compatible momentum contrast with Topology Preservation, dubbed CTP. The compatible momentum model absorbs the knowledge of the current- and previous-task models to flexibly update the modal features. Moreover, Topology Preservation transfers the knowledge of embeddings across tasks while preserving the flexibility of feature adjustment. The experimental results demonstrate that our method not only achieves superior performance compared with other baselines but also avoids an expensive training burden. Dataset and codes are available at https://github.com/KevinLight831/CTP.
    Comment: Accepted by ICCV 2023. Code: https://github.com/KevinLight831/CTP
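    The compatible momentum update can be pictured as an exponential moving average whose target blends the current-task model with a frozen previous-task snapshot. This is a speculative sketch: the blending weight alpha, the momentum m, and the parameter-wise form are assumptions, not CTP's actual update rule.

```python
import torch
import torch.nn as nn

def update_compatible_momentum(momentum_model: nn.Module,
                               current_model: nn.Module,
                               previous_task_model: nn.Module,
                               m: float = 0.995,
                               alpha: float = 0.5) -> None:
    """Hypothetical sketch of a momentum update that blends the online model
    for the current task with a frozen snapshot from the previous task, so the
    momentum encoder stays compatible with earlier feature spaces.
    `m` and `alpha` are illustrative hyperparameters, not the paper's values."""
    with torch.no_grad():
        for p_m, p_cur, p_prev in zip(momentum_model.parameters(),
                                      current_model.parameters(),
                                      previous_task_model.parameters()):
            # target parameters mix current-task and previous-task knowledge
            blended = alpha * p_cur.data + (1.0 - alpha) * p_prev.data
            # standard EMA step toward the blended target
            p_m.data.mul_(m).add_(blended, alpha=1.0 - m)
```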