Fine-grained Apparel Classification and Retrieval without rich annotations
The ability to correctly classify and retrieve apparel images has a variety
of applications important to e-commerce, online advertising and internet
search. In this work, we propose a robust framework for fine-grained apparel
classification and for in-shop and cross-domain retrieval that eliminates the
need for rich annotations such as bounding boxes, human joints, or clothing
landmarks, as well as the training of dedicated bounding-box or key-landmark
detectors. Factors such as subtle appearance differences, variations in human poses,
same. Factors such as subtle appearance differences, variations in human poses,
different shooting angles, apparel deformations, and self-occlusion add to the
challenges in classification and retrieval of apparel items. Cross-domain
retrieval is even harder due to the large variation between online shopping
images, which are usually taken with ideal lighting, poses, favorable angles,
and clean backgrounds, and street photos captured by users under complicated
conditions with poor lighting and cluttered scenes. Our framework
uses compact bilinear CNN with tensor sketch algorithm to generate embeddings
that capture local pairwise feature interactions in a translationally invariant
manner. For apparel classification, we pass the feature embeddings through a
softmax classifier, while the in-shop and cross-domain retrieval pipelines use
a triplet-loss based optimization approach, such that squared Euclidean
distance between embeddings measures the dissimilarity between the images.
Unlike previous works that relied on bounding-box, key clothing-landmark, or
human-joint detectors to assist the final deep classifier, the proposed
framework can be trained directly on the provided category labels or on
generated triplets for triplet-loss optimization. Lastly, experimental results
on the DeepFashion fine-grained categorization, in-shop retrieval, and
consumer-to-shop retrieval datasets provide a comparative analysis with
previous work in the domain.
Comment: 14 pages, 6 figures, 3 tables, Submitted to Springer Journal of Applied Intelligence
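For illustration, here is a minimal PyTorch sketch of compact bilinear pooling via the Tensor Sketch algorithm, the technique named above (a generic rendering, not the authors' code; the hash indices h1, h2 and sign vectors s1, s2 are hypothetical inputs sampled once at initialization):

    import torch
    import torch.nn.functional as F

    def count_sketch(x, h, s, d):
        # x: (M, C) feature vectors; h: (C,) random bin indices in [0, d);
        # s: (C,) random +/-1 signs (both fixed after initialization)
        out = x.new_zeros(x.size(0), d)
        out.index_add_(1, h, x * s)  # scatter-add each channel into its hashed bin
        return out

    def compact_bilinear(feat, h1, s1, h2, s2, d=8192):
        # feat: (N, C, H, W) conv feature map -> (N, d) compact bilinear embedding
        n, c, height, width = feat.shape
        x = feat.permute(0, 2, 3, 1).reshape(-1, c)
        # Tensor Sketch: convolve two count sketches via FFT to approximate x (x) x
        p = torch.fft.rfft(count_sketch(x, h1, s1, d), dim=1)
        q = torch.fft.rfft(count_sketch(x, h2, s2, d), dim=1)
        y = torch.fft.irfft(p * q, n=d, dim=1).reshape(n, height * width, d).sum(1)
        y = torch.sign(y) * torch.sqrt(y.abs() + 1e-10)  # signed square root
        return F.normalize(y)  # L2 normalization

The retrieval pipelines would then train such embeddings with a triplet loss so that, as the abstract describes, the squared Euclidean distance between embeddings measures dissimilarity.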
Interpretable Partitioned Embedding for Customized Fashion Outfit Composition
Intelligent fashion outfit composition has become increasingly popular in
recent years. Several deep learning based approaches have recently shown
competitive composition results. However, their unexplainable nature prevents
such approaches from meeting designers', businesses', and consumers' urge to
comprehend the importance of different attributes in an outfit composition. To
realize interpretable and customized fashion outfit compositions, we propose a
partitioned embedding network to learn interpretable representations from
clothing items. The overall network architecture consists of three components:
an auto-encoder module, a supervised attributes module and a multi-independent
module. The auto-encoder module serves to encode all useful information into
the embedding. In the supervised attributes module, multiple attribute labels
are adopted to ensure that different parts of the overall embedding correspond
to different attributes. In the multi-independent module, adversarial
operations are adopted to fulfill the mutually independent constraint. With the
interpretable and partitioned embedding, we then construct an outfit
composition graph and an attribute matching map. Given a specified attribute
description, our model can recommend a ranked list of outfit compositions with
interpretable matching scores. Extensive experiments demonstrate that 1) the
partitioned embedding has unmingled parts corresponding to different
attributes, and 2) outfits recommended by our model are more desirable than
those produced by existing methods.
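As a rough illustration of the partitioned-embedding idea (hypothetical dimensions and attribute counts; the adversarial multi-independent module is omitted for brevity), each slice of the embedding is tied to one attribute by its own classifier head:

    import torch
    import torch.nn as nn

    class PartitionedEmbedding(nn.Module):
        # Splits one embedding into attribute-specific parts, each supervised
        # by its own classifier so the parts stay interpretable.
        def __init__(self, encoder, part_dim=32, attr_classes=(10, 8, 5)):
            super().__init__()
            # encoder is assumed to map images to (N, part_dim * len(attr_classes))
            self.encoder = encoder
            self.heads = nn.ModuleList(
                nn.Linear(part_dim, k) for k in attr_classes)
            self.part_dim = part_dim

        def forward(self, images):
            z = self.encoder(images)
            parts = z.split(self.part_dim, dim=1)   # one slice per attribute
            logits = [head(p) for head, p in zip(self.heads, parts)]
            return parts, logits                    # supervise each logits[i]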
Studio2Shop: from studio photo shoots to fashion articles
Fashion is an increasingly important topic in computer vision, in particular
the so-called street-to-shop task of matching street images with shop images
containing similar fashion items. Solving this problem promises new means of
making fashion searchable and helping shoppers find the articles they are
looking for. This paper focuses on finding pieces of clothing worn by a person
in full-body or half-body images with neutral backgrounds. Such images are
ubiquitous on the web and in fashion blogs, and are typically studio photos; we
refer to this setting as studio-to-shop. Recent advances in computational
fashion include the development of domain-specific numerical representations.
Our model Studio2Shop builds on top of such representations and uses a deep
convolutional network trained to match a query image to the numerical feature
vectors of all the articles annotated in this image. Top-k retrieval
evaluation on test query images shows that the correct items are most often
found within a range that is sufficiently small for building realistic visual
search engines for the studio-to-shop setting.
Comment: 12 pages, 9 figures (Figure 1 has 5 subfigures, Figure 2 has 3 subfigures), 7 tables
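Conceptually, the matching step reduces to scoring one query embedding against the precomputed feature vectors of all annotated articles; a minimal sketch under that assumption (names and tensors are hypothetical, not the paper's architecture):

    import torch

    def rank_articles(query_embedding, article_feats, k=20):
        # query_embedding: (D,) output of the query CNN
        # article_feats: (A, D) precomputed numerical article representations
        scores = article_feats @ query_embedding   # one score per article
        return scores.topk(k).indices              # indices of top-k candidates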
Snap and Find: Deep Discrete Cross-domain Garment Image Retrieval
With the increasing number of online stores, there is a pressing need for
intelligent search systems to understand the item photos snapped by customers
and search against large-scale product databases to find their desired items.
However, it is challenging for conventional retrieval systems to match up the
item photos captured by customers and the ones officially released by stores,
especially for garment images. To bridge the customer- and store-provided
garment photos, existing studies have widely exploited clothing attributes
(e.g., black) and landmarks (e.g., collar) to learn a common embedding space
for garment representations. Unfortunately, they omit the sequential
correlation of attributes and consume large quantities of human labor to label
the landmarks. In this paper, we propose a deep
multi-task cross-domain hashing scheme, termed DMCH, in which cross-domain
embedding and sequential attribute learning are modeled simultaneously.
Sequential attribute learning not only provides the semantic guidance for
embedding, but also generates rich attention on discriminative local details
(e.g., black buttons) of clothing items without requiring extra
landmark labels. This leads to promising performance and a 306x boost in
efficiency when compared with the state-of-the-art models, as demonstrated
through rigorous experiments on two public fashion datasets.
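Because DMCH is a hashing model, retrieval can be carried out in Hamming space; a generic NumPy sketch of binarization and Hamming ranking (assuming already-trained real-valued embeddings, not the paper's exact procedure):

    import numpy as np

    def binarize(embeddings):
        # Real-valued embeddings -> binary codes packed into uint8 for fast XOR
        return np.packbits(embeddings > 0, axis=1)

    def hamming_rank(query_code, gallery_codes, k=10):
        # query_code: (1, B) packed bits; gallery_codes: (G, B) packed bits
        # Popcount of XOR gives the Hamming distance to every gallery item
        dists = np.unpackbits(query_code ^ gallery_codes, axis=1).sum(axis=1)
        return np.argsort(dists)[:k]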
Looking at Outfit to Parse Clothing
This paper extends fully-convolutional neural networks (FCN) for the clothing
parsing problem. Clothing parsing requires higher-level knowledge on clothing
semantics and contextual cues to disambiguate fine-grained categories. We
extend the FCN architecture with a side-branch network, which we refer to as
the outfit encoder, to predict a consistent set of clothing labels and
encourage combinatorial preference, and with a conditional random field (CRF) to explicitly
consider coherent label assignment to the given image. The empirical results
using the Fashionista and CFPD datasets show that our model achieves
state-of-the-art performance in clothing parsing, without additional
supervision during training. We also study the qualitative influence of
annotation on the current clothing parsing benchmarks, using our Web-based tool
for multi-scale pixel-wise annotation and a manual refinement effort on the
Fashionista dataset. Finally, we show that the image representation learned by
the outfit encoder is useful for dress-up image retrieval applications.
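One plausible way for an outfit encoder's label-presence predictions to re-weight the FCN's per-pixel scores (a sketch of the general idea, not necessarily the paper's exact fusion):

    import torch

    def outfit_consistent_logits(pixel_logits, outfit_logits):
        # pixel_logits: (N, L, H, W) per-pixel class scores from the FCN
        # outfit_logits: (N, L) per-image label-presence scores (outfit encoder)
        presence = torch.sigmoid(outfit_logits)  # probability each label appears
        # Down-weight classes the outfit encoder believes are absent
        return pixel_logits + torch.log(presence + 1e-6)[:, :, None, None]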
Learning the Latent "Look": Unsupervised Discovery of a Style-Coherent Embedding from Fashion Images
What defines a visual style? Fashion styles emerge organically from how
people assemble outfits of clothing, making them difficult to pin down with a
computational model. Low-level visual similarity can be too specific to detect
stylistically similar images, while manually crafted style categories can be
too abstract to capture subtle style differences. We propose an unsupervised
approach to learn a style-coherent representation. Our method leverages
probabilistic polylingual topic models based on visual attributes to discover a
set of latent style factors. Given a collection of unlabeled fashion images,
our approach mines for the latent styles, then summarizes outfits by how they
mix those styles. Our approach can organize galleries of outfits by style
without requiring any style labels. Experiments on over 100K images demonstrate
its promise for retrieving, mixing, and summarizing fashion images by their
style.
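As a rough stand-in for the polylingual topic model, plain LDA over bag-of-attribute counts illustrates how latent style mixtures could be mined (hypothetical data; sklearn's LDA is used for brevity):

    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    # Hypothetical input: each row counts the visual attributes detected in one
    # outfit (columns index an attribute vocabulary, e.g. "floral", "denim").
    attribute_counts = np.random.randint(0, 3, size=(1000, 200))

    lda = LatentDirichletAllocation(n_components=15, random_state=0)
    style_mixtures = lda.fit_transform(attribute_counts)  # (1000, 15) proportions

    # Outfits can now be organized or retrieved by their dominant latent style
    dominant_style = style_mixtures.argmax(axis=1)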
Unconstrained Fashion Landmark Detection via Hierarchical Recurrent Transformer Networks
Fashion landmarks are functional key points defined on clothes, such as
corners of neckline, hemline, and cuff. They have been recently introduced as
an effective visual representation for fashion image understanding. However,
detecting fashion landmarks is challenging due to background clutter, human
poses, and scales. To remove these variations, previous works usually assumed
that clothing bounding boxes are provided as additional annotations in training
and test, which are expensive to obtain and inapplicable in practice. This work
addresses unconstrained fashion landmark detection, where clothing bounding
boxes are provided in neither training nor test. To this
end, we present a novel Deep LAndmark Network (DLAN), where bounding boxes and
landmarks are jointly estimated and trained iteratively in an end-to-end
manner. DLAN contains two dedicated modules, including a Selective Dilated
Convolution for handling scale discrepancies, and a Hierarchical Recurrent
Spatial Transformer for handling background clutters. To evaluate DLAN, we
present a large-scale fashion landmark dataset, namely Unconstrained Landmark
Database (ULD), consisting of 30K images. Statistics show that ULD is more
challenging than existing datasets in terms of image scales, background
clutters, and human poses. Extensive experiments demonstrate the effectiveness
of DLAN over the state-of-the-art methods. DLAN also exhibits excellent
generalization across different clothing categories and modalities, making it
extremely suitable for real-world fashion analysis.
Comment: To appear in ACM Multimedia (ACM MM) 2017 as a full research paper. More details at the project page:
http://personal.ie.cuhk.edu.hk/~lz013/projects/UnconstrainedLandmarks.htm
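The spatial-transformer component can be pictured as a differentiable crop of the feature map driven by predicted box parameters; a minimal PyTorch sketch (a generic spatial transformer, not DLAN's full hierarchical recurrent module):

    import torch
    import torch.nn.functional as F

    def crop_with_predicted_box(features, theta):
        # features: (N, C, H, W) conv features; theta: (N, 2, 3) affine params
        # encoding a differentiable crop around the estimated clothing box
        grid = F.affine_grid(theta, features.size(), align_corners=False)
        return F.grid_sample(features, grid, align_corners=False)

    # A pure scale+translation crop: theta = [[sx, 0, tx], [0, sy, ty]]
    theta = torch.tensor([[[0.5, 0.0, 0.1], [0.0, 0.5, -0.2]]])
    feats = torch.randn(1, 64, 32, 32)
    cropped = crop_with_predicted_box(feats, theta)  # (1, 64, 32, 32)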
Fashion Retrieval via Graph Reasoning Networks on a Similarity Pyramid
Matching clothing images from customers and online shopping stores has rich
applications in E-commerce. Existing algorithms encode an image as a global
feature vector and perform retrieval with this global representation. However,
discriminative local information on clothes is submerged in the global
representation, resulting in sub-optimal performance. To address this issue, we
propose a novel Graph Reasoning Network (GRNet) on a Similarity Pyramid, which
learns similarities between a query and a gallery clothing image by using both global
and local representations in multiple scales. The similarity pyramid is
represented by a similarity graph, whose nodes represent similarities
between clothing components at different scales, and the final matching score
is obtained by message passing along edges. In GRNet, graph reasoning is
performed by training a graph convolutional network, which aligns salient
clothing components and improves clothing retrieval. To facilitate future research, we
introduce a new benchmark FindFashion, containing rich annotations of bounding
boxes, views, occlusions, and cropping. Extensive experiments show that GRNet
obtains new state-of-the-art results on two challenging benchmarks, e.g.,
pushing the top-1, top-20, and top-50 accuracies on DeepFashion to 26%, 64%,
and 75% (i.e., 4%, 10%, and 10% absolute improvements), outperforming
competitors by large margins. On FindFashion, GRNet achieves considerable
improvements in all empirical settings.
Comment: ICCV 2019 (oral)
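A single graph-convolution step over similarity nodes, as a bare-bones sketch of the message passing described above (hypothetical shapes; GRNet's actual propagation rule may differ):

    import torch

    def gcn_layer(node_feats, adj, weight):
        # node_feats: (K, D) similarity nodes (one per pair of clothing
        # components at some scale); adj: (K, K) graph adjacency
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        msg = adj @ node_feats / deg        # mean aggregation over neighbors
        return torch.relu(msg @ weight)     # linear transform + nonlinearity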
Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback
Conversational interfaces for the detail-oriented retail fashion domain are
more natural, expressive, and user friendly than classical keyword-based search
interfaces. In this paper, we introduce the Fashion IQ dataset to support and
advance research on interactive fashion image retrieval. Fashion IQ is the
first fashion dataset to provide human-generated captions that distinguish
similar pairs of garment images together with side-information consisting of
real-world product descriptions and derived visual attribute labels for these
images. We provide a detailed analysis of the characteristics of the Fashion IQ
data, and present a transformer-based user simulator and interactive image
retriever that can seamlessly integrate visual attributes with image features,
user feedback, and dialog history, leading to improved performance over the
state of the art in dialog-based image retrieval. We believe that our dataset
will encourage further work on developing more natural and real-world
applicable conversational shopping assistants.
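A minimal sketch of how a transformer could fuse image features with text-feedback tokens into one retrieval query (a hypothetical module, not the paper's architecture):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ComposedRetriever(nn.Module):
        # Fuses an image feature with text-feedback features via self-attention,
        # then scores gallery images by cosine similarity.
        def __init__(self, dim=256, heads=4, layers=2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, layers)

        def forward(self, img_feat, text_feats, gallery):
            # img_feat: (N, D); text_feats: (N, T, D) feedback tokens;
            # gallery: (G, D) candidate image features
            tokens = torch.cat([img_feat[:, None], text_feats], dim=1)
            query = self.encoder(tokens)[:, 0]   # fused query embedding
            return F.normalize(query) @ F.normalize(gallery).T  # (N, G)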
Unified Structured Learning for Simultaneous Human Pose Estimation and Garment Attribute Classification
In this paper, we utilize structured learning to simultaneously address two
intertwined problems: human pose estimation (HPE) and garment attribute
classification (GAC), which are valuable for a variety of computer vision and
multimedia applications. Unlike previous works that usually handle the two
problems separately, our approach aims to produce a jointly optimal estimation
for both HPE and GAC via a unified inference procedure. To this end, we adopt a
preprocessing step to detect potential human parts from each image (i.e., a set
of "candidates") that allows us to have a manageable input space. In this way,
the simultaneous inference of HPE and GAC is converted to a structured learning
problem, where the inputs are the collections of candidate ensembles, the
outputs are the joint labels of human parts and garment attributes, and the
joint feature representation involves various cues such as pose-specific
features, garment-specific features, and cross-task features that encode
correlations between human parts and garment attributes. Furthermore, we
explore the "strong edge" evidence around the potential human parts so as to
derive more powerful representations for oriented human parts. Such evidence
can be seamlessly integrated into our structured learning model as a kind of
energy function, and the learning process can be performed with the standard
structured Support Vector Machine (SVM) algorithm. However, the joint
structure of the two problems is a cyclic graph, which hinders efficient
inference. To resolve this issue, we compute instead approximate optima by
using an iterative procedure, where in each iteration the variables of one
problem are fixed. In this way, satisfactory solutions can be efficiently
computed by dynamic programming. Experimental results on two benchmark datasets
show the state-of-the-art performance of our approach.
Comment: Accepted to IEEE Trans. on Image Processing
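The iterative procedure amounts to alternating optimization: each step fixes one task's variables and solves the other exactly by dynamic programming. A schematic sketch with hypothetical task-specific solvers:

    def alternating_inference(pose_dp, attr_dp, init_attrs, iters=5):
        # pose_dp / attr_dp are hypothetical dynamic-programming solvers for
        # human pose estimation and garment attribute classification.
        attrs = init_attrs
        for _ in range(iters):
            pose = pose_dp(fixed_attrs=attrs)   # best parts given attributes
            attrs = attr_dp(fixed_pose=pose)    # best attributes given parts
        return pose, attrs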