Image Retrieval: History, Current Approaches, and Promising Framework
Abstract: Today, with the widespread use of global computer networks, the volume of image databases has grown, and retrieving images similar to a given query image has become a serious need. A dynamic and flexible framework can help considerably in designing an image retrieval system with high accuracy. This study investigates and analyzes three well-known current retrieval systems, emphasizing their weaknesses and strengths, and presents a general framework for image retrieval systems. The important issue is that an ideal image retrieval system should be able to automatically extract semantic content and make the images indexing
FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning
Multimodal tasks in the fashion domain have significant potential for
e-commerce, but involve challenging vision-and-language learning problems -
e.g., retrieving a fashion item given a reference image plus text feedback from
a user. Prior works on multimodal fashion tasks have either been limited by the
data in individual benchmarks, or have leveraged generic vision-and-language
pre-training but have not taken advantage of the characteristics of fashion
data. Additionally, these works have mainly been restricted to multimodal
understanding tasks. To address these gaps, we make two key contributions.
First, we propose a novel fashion-specific pre-training framework based on
weakly-supervised triplets constructed from fashion image-text pairs. We show
the triplet-based tasks are an effective addition to standard multimodal
pre-training tasks. Second, we propose a flexible decoder-based model
architecture capable of both fashion retrieval and captioning tasks. Together,
our model design and pre-training approach are competitive on a diverse set of
fashion tasks, including cross-modal retrieval, image retrieval with text
feedback, image captioning, relative image captioning, and multimodal
categorization.
Comment: 14 pages, 4 figures. To appear at Conference on Empirical Methods in
Natural Language Processing (EMNLP) 202