Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control
Learned sparse retrieval (LSR) is a family of neural methods that encode
queries and documents into sparse lexical vectors that can be indexed and
retrieved efficiently with an inverted index. We explore the application of LSR
to the multi-modal domain, with a focus on text-image retrieval. While LSR has
seen success in text retrieval, its application in multimodal retrieval remains
underexplored. Current approaches like LexLIP and STAIR require complex
multi-step training on massive datasets. Our proposed approach efficiently
transforms dense vectors from a frozen dense model into sparse lexical vectors.
We address issues of high dimension co-activation and semantic deviation
through a new training algorithm, using Bernoulli random variables to control
query expansion. Experiments with two dense models (BLIP, ALBEF) and two
datasets (MSCOCO, Flickr30k) show that our proposed algorithm effectively
reduces co-activation and semantic deviation. Our best-performing sparsified
model outperforms state-of-the-art text-image LSR models with a shorter
training time and lower GPU memory requirements. Our approach offers an
effective solution for training LSR retrieval models in multimodal settings.
Our code and model checkpoints are available at
github.com/thongnt99/lsr-multimodal
Comment: 17 pages, accepted as a full paper at ECIR 2024
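A minimal sketch of the core idea described above, under stated assumptions: a dense embedding from a frozen encoder is projected onto a lexical vocabulary, and a Bernoulli gate stochastically suppresses expansion terms during training so that high dimension co-activation is discouraged. Module and parameter names (DenseToSparse, expansion_prob, etc.) are illustrative, not the paper's actual code.

```python
# Illustrative sketch only: project a frozen dense embedding into a sparse
# lexical vector and gate expansion terms with Bernoulli random variables.
import torch
import torch.nn as nn

class DenseToSparse(nn.Module):
    def __init__(self, dense_dim=256, vocab_size=30522, expansion_prob=0.5):
        super().__init__()
        self.proj = nn.Linear(dense_dim, vocab_size)  # dense -> vocabulary logits
        self.expansion_prob = expansion_prob          # chance of keeping an expansion term

    def forward(self, dense_vec, input_term_mask):
        # dense_vec: (batch, dense_dim) from a frozen encoder (e.g. a BLIP/ALBEF tower)
        # input_term_mask: (batch, vocab_size) binary mask of terms present in the query
        weights = torch.relu(self.proj(dense_vec))    # non-negative lexical weights
        if self.training:
            # Bernoulli gate: terms not in the query are kept only with
            # probability expansion_prob, limiting dimension co-activation.
            keep = torch.bernoulli(torch.full_like(weights, self.expansion_prob))
            gate = torch.maximum(input_term_mask, keep)
            weights = weights * gate
        return weights                                # sparse lexical vector

# Usage sketch with random inputs
model = DenseToSparse()
dense = torch.randn(2, 256)
mask = torch.zeros(2, 30522)
mask[:, :10] = 1.0          # pretend the first 10 vocabulary ids occur in the query
sparse = model(dense, mask)
```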
An information-theoretic framework for semantic-multimedia retrieval
This article is set in the context of searching text and image repositories by keyword. We develop a unified probabilistic framework for text, image, and combined text and image retrieval that is based on the detection of keywords (concepts) using automated image annotation technology. Our framework is deeply rooted in information theory and lends itself to use with other media types.
We estimate a statistical model in a multimodal feature space for each possible query keyword. The key element of our framework is to identify feature space transformations that make them comparable in complexity and density. We select the optimal multimodal feature space with a minimum description length criterion from a set of candidate feature spaces that are computed with the average-mutual-information criterion for the text part and hierarchical expectation maximization for the visual part of the data. We evaluate our approach in three retrieval experiments (only text retrieval, only image retrieval, and text combined with image retrieval), verify the framework’s low computational complexity, and compare with existing state-of-the-art ad-hoc models.
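The feature-space selection step can be illustrated with a generic two-part minimum description length score (parameter cost plus data cost under a fitted density model); the helper below is a sketch under that assumption, not the article's exact criterion, and the candidate feature spaces here are arbitrary random projections used only to make the example runnable.

```python
# Generic two-part MDL score for comparing candidate feature spaces:
# description length = parameter cost + data cost under a fitted mixture model.
# Illustrative sketch only, not the article's exact criterion.
import numpy as np
from sklearn.mixture import GaussianMixture

def mdl_score(X, n_components=3):
    """Fit a diagonal Gaussian mixture to X and return its description length (nats)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(X)
    n, d = X.shape
    # free parameters: mixture weights + means + diagonal covariances
    k = (n_components - 1) + n_components * d + n_components * d
    data_cost = -gmm.score(X) * n            # total negative log-likelihood
    model_cost = 0.5 * k * np.log(n)         # BIC-style parameter cost
    return data_cost + model_cost

# Choose the candidate feature space with the smallest description length.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))
candidates = {dim: X @ rng.normal(size=(32, dim)) / np.sqrt(32) for dim in (4, 8, 16)}
best_dim = min(candidates, key=lambda dim: mdl_score(candidates[dim]))
print("selected feature space dimensionality:", best_dim)
```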
LiveSketch: Query Perturbations for Guided Sketch-based Visual Search
LiveSketch is a novel algorithm for searching large image collections using
hand-sketched queries. LiveSketch tackles the inherent ambiguity of sketch
search by creating visual suggestions that augment the query as it is drawn,
making query specification an iterative rather than one-shot process that helps
disambiguate users' search intent. Our technical contributions are: a triplet
convnet architecture that incorporates an RNN based variational autoencoder to
search for images using vector (stroke-based) queries; real-time clustering to
identify likely search intents (and so, targets within the search embedding);
and the use of backpropagation from those targets to perturb the input stroke
sequence, so suggesting alterations to the query in order to guide the search.
We show improvements in accuracy and time-to-task over contemporary baselines
using a 67M image corpus.
Comment: Accepted to CVPR 2019
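The query-perturbation step can be sketched as gradient descent on the query itself, pulling its embedding toward the centroid of a likely search-intent cluster. The toy example below uses a linear stand-in for the encoder and invented names; it is not the LiveSketch architecture, only an illustration of backpropagating into the input.

```python
# Toy sketch: guide a query toward an inferred target by backpropagating
# through a fixed encoder into the query itself. The linear "encoder" and all
# names are assumptions for illustration, not the LiveSketch network.
import torch

torch.manual_seed(0)
encoder = torch.nn.Linear(16, 8)                 # stands in for the triplet convnet + RNN-VAE
query = torch.randn(1, 16, requires_grad=True)   # stands in for the stroke-sequence query
target = torch.randn(1, 8)                       # centroid of a likely search-intent cluster

optimizer = torch.optim.SGD([query], lr=0.1)     # only the query is updated; encoder stays fixed
for step in range(50):
    optimizer.zero_grad()
    # pull the embedded query toward the target cluster centre
    loss = torch.nn.functional.mse_loss(encoder(query), target)
    loss.backward()
    optimizer.step()
# the perturbed `query` is what would be rendered back as suggested strokes
```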
CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation
Vision-Language Pretraining (VLP) has shown impressive results on diverse
downstream tasks by offline training on large-scale datasets. Regarding the
growing nature of real-world data, such an offline training paradigm on
ever-expanding data is unsustainable, because models lack the continual
learning ability to accumulate knowledge constantly. However, most continual
learning studies are limited to uni-modal classification and existing
multi-modal datasets cannot simulate continual non-stationary data stream
scenarios. To support the study of Vision-Language Continual Pretraining
(VLCP), we first contribute a comprehensive and unified benchmark dataset P9D
which contains over one million product image-text pairs from 9 industries. The
data from each industry as an independent task supports continual learning and
conforms to the real-world long-tail nature to simulate pretraining on web
data. We comprehensively study the characteristics and challenges of VLCP, and
propose a new algorithm: Compatible momentum contrast with Topology
Preservation, dubbed CTP. The compatible momentum model absorbs the knowledge
of the current and previous-task models to flexibly update the modal feature.
Moreover, Topology Preservation transfers the knowledge of embedding across
tasks while preserving the flexibility of feature adjustment. The experimental
results demonstrate that our method not only achieves superior performance compared
with other baselines but also does not incur an expensive training burden.
Dataset and code are available at https://github.com/KevinLight831/CTP.
Comment: Accepted by ICCV 2023
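Two ingredients named in the abstract can be sketched as follows, under assumptions: a momentum model updated by an exponential moving average that blends the current and previous-task weights, and a topology-preservation loss that matches pairwise similarity structure between old and new embeddings. Function names and the exact update rule are illustrative, not the released CTP code.

```python
# Illustrative sketch of (1) a momentum model updated from both the current and
# previous-task models and (2) a topology-preservation loss over pairwise
# similarities. Names and update rule are assumptions, not the CTP repository.
import torch
import torch.nn.functional as F

@torch.no_grad()
def compatible_momentum_update(momentum_model, current_model, prev_task_model,
                               m=0.995, prev_weight=0.1):
    """EMA update of the momentum model that also blends in previous-task weights."""
    for p_m, p_c, p_p in zip(momentum_model.parameters(),
                             current_model.parameters(),
                             prev_task_model.parameters()):
        blended = (1 - prev_weight) * p_c + prev_weight * p_p
        p_m.mul_(m).add_(blended, alpha=1 - m)

def topology_preservation_loss(new_emb, old_emb):
    """Keep the pairwise similarity structure of old embeddings in the new space."""
    new_sim = F.normalize(new_emb, dim=-1) @ F.normalize(new_emb, dim=-1).T
    old_sim = F.normalize(old_emb, dim=-1) @ F.normalize(old_emb, dim=-1).T
    return F.mse_loss(new_sim, old_sim.detach())

# Usage sketch with tiny stand-in encoders
cur, prev, mom = (torch.nn.Linear(32, 16) for _ in range(3))
compatible_momentum_update(mom, cur, prev)
x = torch.randn(8, 32)
loss = topology_preservation_loss(cur(x), prev(x))
```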