Learning Visual Clothing Style with Heterogeneous Dyadic Co-occurrences
With the rapid proliferation of smart mobile devices, users now take millions
of photos every day. These include large numbers of clothing and accessory
images. We would like to answer questions like "What outfit goes well with this
pair of shoes?" To answer these types of questions, one has to go beyond
learning visual similarity and learn a visual notion of compatibility across
categories. In this paper, we propose a novel learning framework to help answer
these types of questions. The main idea of this framework is to learn a feature
transformation from images of items into a latent space that expresses
compatibility. For the feature transformation, we use a Siamese Convolutional
Neural Network (CNN) architecture, where training examples are pairs of items
that are either compatible or incompatible. We model compatibility based on
co-occurrence in large-scale user behavior data; in particular co-purchase data
from Amazon.com. To learn cross-category fit, we introduce a strategic method
to sample training data, where pairs of items are heterogeneous dyads, i.e.,
the two elements of a pair belong to different high-level categories. While
this approach is applicable to a wide variety of settings, we focus on the
representative problem of learning compatible clothing style. Our results
indicate that the proposed framework is capable of learning semantic
information about visual style and can generate outfits whose items, drawn from
different categories, go well together.
Comment: ICCV 201
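To make the training setup concrete, here is a minimal sketch, assuming a toy catalog, of two ingredients the abstract names: sampling heterogeneous dyads from co-purchase data (positives are co-purchased pairs, negatives are random non-co-purchased pairs, and both items in every pair come from different high-level categories) and scoring them with a Siamese model whose two branches share weights. Everything here is hypothetical: the item ids, categories, feature vectors, the linear `embed` map standing in for a CNN branch, and the margin-based contrastive loss. The abstract does not state the paper's loss function or exact sampling procedure.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy catalog: item id -> high-level category.
CATEGORY: Dict[str, str] = {
    "shoe_1": "shoes", "shoe_2": "shoes",
    "top_1": "tops", "top_2": "tops",
    "pants_1": "bottoms",
}

# Hypothetical co-purchase records, used as the "compatible" signal.
CO_PURCHASED: List[Tuple[str, str]] = [
    ("shoe_1", "top_1"), ("shoe_2", "pants_1"), ("top_2", "pants_1"),
]

# Hypothetical per-item feature vectors (stand-ins for image inputs).
FEATURES: Dict[str, List[float]] = {
    "shoe_1": [1.0, 0.0, 0.2], "shoe_2": [0.9, 0.1, 0.0],
    "top_1": [0.0, 1.0, 0.3], "top_2": [0.1, 0.8, 0.9],
    "pants_1": [0.2, 0.1, 1.0],
}


def sample_heterogeneous_dyads(n_negative: int,
                               rng: random.Random) -> List[Tuple[str, str, int]]:
    """Return (a, b, label) pairs: label 1 = co-purchased, 0 = not.

    Only heterogeneous dyads are kept, i.e. a and b belong to different
    high-level categories. Negatives are drawn with replacement, so
    duplicate negatives are possible."""
    co = {frozenset(p) for p in CO_PURCHASED}
    positives = [(a, b, 1) for a, b in CO_PURCHASED if CATEGORY[a] != CATEGORY[b]]
    items = sorted(CATEGORY)
    negatives: List[Tuple[str, str, int]] = []
    while len(negatives) < n_negative:
        a, b = rng.sample(items, 2)
        if CATEGORY[a] != CATEGORY[b] and frozenset((a, b)) not in co:
            negatives.append((a, b, 0))
    return positives + negatives


def embed(x: Sequence[float], w: Sequence[Sequence[float]]) -> List[float]:
    """Shared linear map; both branches use the SAME weights (Siamese)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def contrastive_loss(fa: Sequence[float], fb: Sequence[float], label: int,
                     margin: float = 1.0) -> float:
    """Pull compatible pairs together; push incompatible ones past `margin`."""
    d = math.dist(fa, fb)
    if label == 1:
        return 0.5 * d * d
    return 0.5 * max(0.0, margin - d) ** 2


if __name__ == "__main__":
    rng = random.Random(0)
    w = [[0.5, -0.2, 0.1], [0.0, 0.7, -0.3]]  # hypothetical 3 -> 2 projection
    batch = sample_heterogeneous_dyads(n_negative=3, rng=rng)
    loss = sum(contrastive_loss(embed(FEATURES[a], w), embed(FEATURES[b], w), y)
               for a, b, y in batch) / len(batch)
    print(f"{len(batch)} dyads, mean contrastive loss = {loss:.4f}")
```

Restricting negatives to cross-category pairs matters for the stated goal: if same-category pairs were allowed, the easiest way to separate positives from negatives would be plain visual similarity, which is exactly what the framework tries to go beyond.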
A deep representation for depth images from synthetic data
Convolutional Neural Networks (CNNs) trained on large scale RGB databases
have become the secret sauce in the majority of recent approaches for object
categorization from RGB-D data. Thanks to colorization techniques, these
methods exploit the filters learned from 2D images to extract meaningful
representations in 2.5D. Still, the perceptual signature of these two kinds of
images is very different, with the first usually strongly characterized by
textures, and the second mostly by silhouettes of objects. Ideally, one would
like to have two CNNs, one for RGB and one for depth, each trained on a
suitable data collection, able to capture the perceptual properties of each
channel for the task at hand. This has not been possible so far, due to the
lack of a suitable depth database. This paper addresses this issue by proposing
to use synthetically generated images rather than collecting a large-scale 2.5D
database by hand. While synthetic images are clearly only a proxy for real data,
they allow one to trade quality for quantity, making it possible to generate a
virtually infinite amount of data. We show that a CNN trained on such a data
collection, using the very same architecture typically used on visual data,
learns very different filters, resulting in depth features that are (a) better
able to characterize the different facets of depth images, and (b) complementary
to those derived from CNNs pre-trained on 2D datasets. Experiments on two
publicly available databases demonstrate the effectiveness of our approach.
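The abstract contrasts two ways of giving depth data to a CNN: colorizing a depth map into three channels so an RGB-pretrained network can consume it, and generating synthetic depth images to train a depth-specific network. A minimal sketch of both ingredients follows, assuming a toy orthographic renderer of a single sphere (a stand-in for a real rendering pipeline, which the abstract does not describe) and a jet-style colormap (one common colorization choice, not necessarily the one used by the methods the abstract refers to). The function names and parameters are hypothetical.

```python
import math
from typing import List, Tuple

RGB = Tuple[float, float, float]


def render_sphere_depth(size: int, radius: float, center_z: float) -> List[List[float]]:
    """Orthographic depth map of a sphere on the optical axis.

    Pixels are mapped to image-plane coordinates in [-1, 1]; pixels that
    miss the sphere get depth math.inf (background)."""
    depth: List[List[float]] = []
    for i in range(size):
        row: List[float] = []
        for j in range(size):
            x = 2.0 * (j + 0.5) / size - 1.0
            y = 2.0 * (i + 0.5) / size - 1.0
            r2 = x * x + y * y
            if r2 <= radius * radius:
                # Front surface of the sphere: z = cz - sqrt(R^2 - x^2 - y^2)
                row.append(center_z - math.sqrt(radius * radius - r2))
            else:
                row.append(math.inf)
        depth.append(row)
    return depth


def _clip01(v: float) -> float:
    return max(0.0, min(1.0, v))


def colorize_depth(depth: List[List[float]],
                   background: RGB = (0.0, 0.0, 0.0)) -> List[List[RGB]]:
    """Map depth to 3 channels with a jet-style colormap (near = blue, far = red).

    Depth is min-max normalized over the finite (foreground) pixels only."""
    finite = [d for row in depth for d in row if math.isfinite(d)]
    lo, hi = min(finite), max(finite)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat scene
    out: List[List[RGB]] = []
    for row in depth:
        out_row: List[RGB] = []
        for d in row:
            if not math.isfinite(d):
                out_row.append(background)
                continue
            t = (d - lo) / span
            out_row.append((_clip01(1.5 - abs(4 * t - 3)),
                            _clip01(1.5 - abs(4 * t - 2)),
                            _clip01(1.5 - abs(4 * t - 1))))
        out.append(out_row)
    return out


if __name__ == "__main__":
    depth = render_sphere_depth(size=8, radius=0.8, center_z=2.0)
    rgb = colorize_depth(depth)
    print(rgb[3][3], rgb[0][0])
```

Normalizing over foreground pixels only keeps the infinite background from collapsing the whole object into a single color. Note that such colorization only reshapes depth to fit filters learned on textured RGB images; the abstract's argument is that filters learned directly from (synthetic) depth data capture silhouette-like structure better.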