Multi-modal Embedding Fusion-based Recommender
Recommendation systems have recently gained popularity worldwide, with primary use cases in online interaction systems and a significant focus on e-commerce platforms. We have developed a machine learning-based recommendation platform that can easily be applied to almost any domain of items and/or actions. Unlike existing recommendation systems, our platform natively supports multiple types of interaction data together with multiple modalities of metadata. This is achieved through multi-modal fusion of various data representations. We have deployed the platform in multiple e-commerce stores of different kinds, e.g. food and beverages, shoes, fashion items, and telecom operators. Here, we present our system and its flexibility and performance. We also show benchmark results on open datasets that significantly outperform prior state-of-the-art work.
Comment: 7 pages, 8 figures
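To make the fusion idea concrete, here is a minimal sketch, not the authors' implementation: per-modality item embeddings (e.g. text and image metadata) are projected into a shared space, combined into one item representation, and scored against a user embedding. All dimensions and names (W_text, fuse_item, recommend) are hypothetical.

```python
import numpy as np

TEXT_DIM, IMAGE_DIM, FUSED_DIM = 768, 512, 256  # illustrative sizes
rng = np.random.default_rng(0)

# Stand-ins for learned projection matrices, one per metadata modality.
W_text = rng.standard_normal((TEXT_DIM, FUSED_DIM)) / np.sqrt(TEXT_DIM)
W_image = rng.standard_normal((IMAGE_DIM, FUSED_DIM)) / np.sqrt(IMAGE_DIM)

def fuse_item(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    """Project each modality into the shared space and average the results."""
    z = (text_emb @ W_text + image_emb @ W_image) / 2.0
    return z / np.linalg.norm(z)  # unit-normalize for cosine scoring

def recommend(user_emb: np.ndarray, item_embs: np.ndarray, k: int = 5):
    """Rank items by cosine similarity between user and fused item embeddings."""
    return np.argsort(-(item_embs @ user_emb))[:k]

items = np.stack([fuse_item(rng.standard_normal(TEXT_DIM),
                            rng.standard_normal(IMAGE_DIM)) for _ in range(100)])
print(recommend(items[0], items))  # item 0 should rank itself first
```

Averaging projected embeddings is only one fusion choice; concatenation followed by a learned projection, or gated mixing, fits the same interface.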
MMFL-Net: Multi-scale and Multi-granularity Feature Learning for Cross-domain Fashion Retrieval
Instance-level image retrieval in fashion is a challenging problem owing to its increasing importance in real-world visual fashion search. Cross-domain fashion retrieval aims to match unconstrained customer images, used as queries, to photographs provided by retailers; it is a difficult task because of the wide range of consumer-to-shop (C2S) domain discrepancies and because clothing images are vulnerable to various non-rigid deformations. To this end, we propose a novel multi-scale and multi-granularity feature learning network (MMFL-Net), which jointly learns global-local aggregation feature representations of clothing images in a unified framework, aiming to train a cross-domain model for C2S fashion visual similarity. First, a new semantic-spatial feature fusion part is designed to bridge the semantic-spatial gap by applying top-down and bottom-up bidirectional multi-scale feature fusion. Next, a multi-branch deep network architecture is introduced to capture global salient, part-informed, and local detailed information, and to extract robust and discriminative feature embeddings by integrating similarity learning of coarse-to-fine embeddings across multiple granularities. Finally, the improved trihard loss, center loss, and multi-task classification loss are adopted in MMFL-Net to jointly optimize intra-class and inter-class distances, explicitly improving the intra-class compactness and inter-class discriminability of its visual representations. Furthermore, our proposed model also combines a multi-task attribute recognition and classification module with multi-label semantic attributes and product ID labels. Experimental results demonstrate that the proposed MMFL-Net achieves significant improvements over state-of-the-art methods on two datasets, DeepFashion-C2S and Street2Shop.
Comment: 27 pages, 12 figures, published in Multimedia Tools and Applications
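As a rough illustration of such a joint objective, the PyTorch sketch below combines a standard triplet loss (standing in for the paper's improved trihard loss, which uses batch-hard mining), a center loss, and a product-ID classification loss. The loss weights, dimensions, and names are assumptions for illustration, not the paper's settings.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Pulls each embedding toward a learned per-class center (intra-class compactness)."""
    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()

NUM_IDS, DIM = 1000, 256               # hypothetical product-ID count / embedding size
triplet = nn.TripletMarginLoss(margin=0.3)
center = CenterLoss(NUM_IDS, DIM)
id_head = nn.Linear(DIM, NUM_IDS)      # product-ID classification head
xent = nn.CrossEntropyLoss()

def joint_loss(anchor, positive, negative, labels):
    # Weighted sum of metric, center, and classification terms;
    # the weights (1.0 / 0.005 / 1.0) are illustrative only.
    return (triplet(anchor, positive, negative)
            + 0.005 * center(anchor, labels)
            + xent(id_head(anchor), labels))

a, p, n = (torch.randn(8, DIM) for _ in range(3))
print(joint_loss(a, p, n, torch.randint(0, NUM_IDS, (8,))))
```

The triplet term pushes inter-class distances apart, the center term tightens intra-class clusters, and the classification term ties embeddings to product identities, matching the division of labor the abstract describes.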
Similarity Reasoning and Filtration for Image-Text Matching
Image-text matching plays a critical role in bridging vision and language, and great progress has been made by exploiting the global alignment between image and sentence or local alignments between regions and words. However, how to make the most of these alignments to infer more accurate matching scores is still underexplored. In this paper, we propose a novel Similarity Graph Reasoning and Attention Filtration (SGRAF) network for image-text matching. Specifically, vector-based similarity representations are first learned to characterize the local and global alignments in a more comprehensive manner, and then a Similarity Graph Reasoning (SGR) module relying on a graph convolutional neural network is introduced to infer relation-aware similarities over both the local and global alignments. A Similarity Attention Filtration (SAF) module is further developed to integrate these alignments effectively by selectively attending to the significant and representative alignments while casting aside the interference of non-meaningful alignments. We demonstrate the superiority of the proposed method by achieving state-of-the-art performance on the Flickr30K and MSCOCO datasets, and show the good interpretability of the SGR and SAF modules through extensive qualitative experiments and analyses.
Comment: 14 pages, 8 figures, accepted by AAAI 2021
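For intuition, here is a minimal PyTorch sketch of the attention-filtration idea: each alignment yields a similarity vector, a learned gate scores how informative it is, and the final matching score is predicted from the attention-weighted aggregate. The dimensions and single-linear-layer gates are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SAFSketch(nn.Module):
    """Attention filtration over per-alignment similarity vectors."""
    def __init__(self, sim_dim: int = 256):
        super().__init__()
        self.gate = nn.Linear(sim_dim, 1)   # scores each alignment's relevance
        self.score = nn.Linear(sim_dim, 1)  # maps the aggregate to a matching score

    def forward(self, sim_vecs: torch.Tensor) -> torch.Tensor:
        # sim_vecs: (n_alignments, sim_dim) similarity representations
        attn = torch.softmax(self.gate(sim_vecs), dim=0)  # down-weights noisy alignments
        fused = (attn * sim_vecs).sum(dim=0)              # aggregate the kept signal
        return self.score(fused).squeeze(-1)              # scalar image-text similarity

saf = SAFSketch()
print(saf(torch.randn(36, 256)))  # e.g. 36 region-word alignment vectors
```

The SGR module would sit before this step, propagating information among alignment vectors with graph convolutions so that each similarity vector becomes relation-aware before filtration.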
A Strong Baseline for Fashion Retrieval with Person Re-Identification Models
Fashion retrieval is the challenging task of finding an exact match for fashion items contained within an image. Difficulties arise from the fine-grained nature of clothing items and from very large intra-class and inter-class variance. Additionally, query and source images for the task usually come from different domains: street photos and catalogue photos, respectively. Due to these differences, a significant gap in quality, lighting, contrast, background clutter, and item presentation exists between the domains. As a result, fashion retrieval is an active field of research in both academia and industry.
Inspired by recent advances in Person Re-Identification research, we adapt leading ReID models for fashion retrieval tasks. We introduce a simple baseline model for fashion retrieval that significantly outperforms previous state-of-the-art results despite a much simpler architecture. We conduct in-depth experiments on the Street2Shop and DeepFashion datasets and validate our results. Finally, we propose a cross-domain (cross-dataset) evaluation method to test the robustness of fashion retrieval models.
Comment: 33 pages, 14 figures
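To illustrate the evaluation protocol common to such retrieval setups: embed street-photo queries and shop-photo gallery images with the same model, rank the gallery by cosine similarity, and report top-k retrieval accuracy. The sketch below is a generic implementation of that protocol, not the paper's code; names and shapes are assumptions.

```python
import numpy as np

def topk_accuracy(query_embs, query_ids, gallery_embs, gallery_ids, k=20):
    """Fraction of queries whose ground-truth item ID appears among the
    top-k gallery images ranked by cosine similarity (all inputs: np.ndarray)."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    topk = np.argsort(-(q @ g.T), axis=1)[:, :k]  # top-k gallery indices per query
    hits = [(gallery_ids[idx] == qid).any() for idx, qid in zip(topk, query_ids)]
    return float(np.mean(hits))

# Toy usage: 5 queries against a 50-image gallery with 10 item IDs.
rng = np.random.default_rng(0)
q, g = rng.standard_normal((5, 128)), rng.standard_normal((50, 128))
print(topk_accuracy(q, rng.integers(0, 10, 5), g, rng.integers(0, 10, 50)))
```

A cross-domain (cross-dataset) evaluation can reuse the same function: train the embedding model on one dataset and draw both queries and gallery from the other.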