Deep Sketch Hashing: Fast Free-hand Sketch-Based Image Retrieval
Free-hand sketch-based image retrieval (SBIR) is a specific cross-view
retrieval task, in which queries are abstract and ambiguous sketches while the
retrieval database consists of natural images. Work in this area mainly
focuses on extracting representative and shared features for sketches and
natural images. However, such features can neither cope well with the geometric
distortion between sketches and images nor remain feasible for large-scale SBIR,
owing to the heavy cost of continuous-valued distance computation. In this paper, we speed up
SBIR by introducing a novel binary coding method, named Deep Sketch
Hashing (DSH), where a semi-heterogeneous deep architecture is proposed and
incorporated into an end-to-end binary coding framework. Specifically, three
convolutional neural networks are utilized to encode free-hand sketches,
natural images and, especially, the auxiliary sketch-tokens which are adopted
as bridges to mitigate the sketch-image geometric distortion. The learned DSH
codes can effectively capture the cross-view similarities as well as the
intrinsic semantic correlations between different categories. To the best of
our knowledge, DSH is the first hashing work specifically designed for
category-level SBIR with an end-to-end deep architecture. The proposed DSH is
comprehensively evaluated on two large-scale datasets of TU-Berlin Extension
and Sketchy, and the experiments consistently show DSH's superior SBIR
accuracies over several state-of-the-art methods, while achieving significantly
reduced retrieval time and memory footprint.
Comment: This paper will appear as a spotlight paper in CVPR 2017.
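As a rough illustration of the hashing idea (not the authors' code; the auxiliary sketch-token branch is omitted and all layer sizes and names are assumptions), the sketch below shows per-domain CNN branches emitting tanh-relaxed binary codes, so that cross-view retrieval reduces to cheap Hamming-style comparisons:

```python
# Minimal sketch of a DSH-style semi-heterogeneous hashing network:
# separate CNN branches for sketches and natural images, with tanh
# outputs relaxing the binary codes during training. Illustrative only.
import torch
import torch.nn as nn

class HashBranch(nn.Module):
    def __init__(self, in_channels: int, code_bits: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.hash_head = nn.Linear(64, code_bits)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # tanh keeps activations in (-1, 1); sign() at inference yields bits
        return torch.tanh(self.hash_head(self.features(x)))

sketch_branch = HashBranch(in_channels=1)  # free-hand sketches (grayscale)
image_branch = HashBranch(in_channels=3)   # natural images (RGB)

sketches = torch.randn(4, 1, 224, 224)
images = torch.randn(4, 3, 224, 224)
# Binarize codes; the dot product of +/-1 codes rises as Hamming distance falls
sketch_codes = torch.sign(sketch_branch(sketches))
image_codes = torch.sign(image_branch(images))
similarity = sketch_codes @ image_codes.t()  # higher = fewer differing bits
```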
Beyond Classification: Latent User Interests Profiling from Visual Contents Analysis
User preference profiling is an important task in modern online social
networks (OSN). With the proliferation of image-centric social platforms, such
as Pinterest, visual contents have become one of the most informative data
streams for understanding user preferences. Traditional approaches usually
treat visual content analysis as a general classification problem where one or
more labels are assigned to each image. Although such an approach simplifies
the process of image analysis, it misses the rich context and visual cues that
play an important role in people's perception of images. In this paper, we
explore the possibilities of learning a user's latent visual preferences
directly from image contents. We propose a distance metric learning method
based on Deep Convolutional Neural Networks (CNN) to directly extract
similarity information from visual contents and use the derived distance metric
to mine individual users' fine-grained visual preferences. Through our
preliminary experiments using data from 5,790 Pinterest users, we show that
even for the images within the same category, each user possesses distinct and
individually-identifiable visual preferences that are consistent over their
lifetime. Our results underscore the untapped potential of fine-grained visual
preference profiling for understanding user preferences.
Comment: 2015 IEEE 15th International Conference on Data Mining Workshop.
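To make the distance-metric-learning step concrete, here is a minimal, hypothetical sketch of triplet-based metric learning over CNN embeddings; the backbone, margin, and shapes are illustrative assumptions, not the paper's configuration:

```python
# Illustrative triplet-loss metric learning: images a given user prefers
# (positives) are pulled toward the anchor, others pushed away.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.backbone(x), dim=1)  # unit-norm embeddings

net = EmbeddingNet()
loss_fn = nn.TripletMarginLoss(margin=0.2)
anchor, positive, negative = (torch.randn(8, 3, 224, 224) for _ in range(3))
loss = loss_fn(net(anchor), net(positive), net(negative))
loss.backward()  # gradients shape the metric around individual preferences
```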
FindVehicle and VehicleFinder: A NER dataset for natural language-based vehicle retrieval and a keyword-based cross-modal vehicle retrieval system
Natural language (NL) based vehicle retrieval is a task aiming to retrieve a
vehicle that is most consistent with a given NL query from among all candidate
vehicles. Because NL queries can be easily obtained, the task has promising
prospects for building interactive intelligent traffic systems (ITS). Current
solutions mainly focus on extracting both text and image features and mapping
them to the same latent space to compare the similarity. However, existing
methods usually use dependency analysis or semantic role-labelling techniques
to find keywords related to vehicle attributes. These techniques may require
substantial pre-processing and post-processing work, and can also extract
the wrong keywords when the NL query is complex. To tackle these problems and
simplify the pipeline, we borrow the idea of named entity recognition (NER) and construct
FindVehicle, a NER dataset in the traffic domain. It has 42.3k labelled NL
descriptions of vehicle tracks, containing information such as the location,
orientation, type and colour of the vehicle. FindVehicle also adopts both
overlapping entities and fine-grained entities to meet further requirements. To
verify its effectiveness, we propose a baseline NL-based vehicle retrieval
model called VehicleFinder. Our experiment shows that by using text encoders
pre-trained on FindVehicle, VehicleFinder achieves 87.7% precision and 89.4%
recall when retrieving a target vehicle by text command on our homemade dataset
based on UA-DETRAC. The time cost of VehicleFinder is 279.35 ms on one ARM v8.2
CPU and 93.72 ms on one RTX A4000 GPU, which is much faster than the
Transformer-based baseline. The dataset is open-sourced at
https://github.com/GuanRunwei/FindVehicle, and the implementation is available
at https://github.com/GuanRunwei/VehicleFinder-CTIM.
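The keyword-based matching idea can be illustrated with a small, purely hypothetical sketch: entities extracted from the NL query (as a NER model trained on FindVehicle might produce them) are matched against per-vehicle attribute predictions. The attribute schema and scoring rule below are assumptions, not the paper's implementation:

```python
# Hypothetical keyword/entity matching for cross-modal vehicle retrieval:
# rank candidates by how many query entities their attributes satisfy.
from dataclasses import dataclass

@dataclass
class Candidate:
    track_id: int
    attributes: dict  # e.g. {"colour": "red", "type": "suv", "orientation": "left"}

def score(query_entities: dict, cand: Candidate) -> float:
    # Fraction of query entities matched by the candidate's attributes.
    hits = sum(cand.attributes.get(k) == v for k, v in query_entities.items())
    return hits / max(len(query_entities), 1)

# Entities as a NER model might extract them from
# "the red SUV heading left near the junction"
query = {"colour": "red", "type": "suv", "orientation": "left"}
candidates = [
    Candidate(0, {"colour": "red", "type": "suv", "orientation": "left"}),
    Candidate(1, {"colour": "blue", "type": "sedan", "orientation": "right"}),
]
best = max(candidates, key=lambda c: score(query, c))
print(best.track_id)  # -> 0
```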
Semantically tied paired cycle consistency for any-shot sketch-based image retrieval
Low-shot sketch-based image retrieval is an emerging task in computer vision, allowing retrieval of natural images relevant to
hand-drawn sketch queries that are rarely seen during the training phase. Related prior works either require aligned sketch-image pairs, which are costly to obtain, or an inefficient memory fusion layer for mapping the visual information to a semantic space.
In this paper, we address any-shot, i.e. zero-shot and few-shot, sketch-based image retrieval (SBIR) tasks, where we introduce
the few-shot setting for SBIR. For solving these tasks, we propose a semantically aligned paired cycle-consistent generative
adversarial network (SEM-PCYC) for any-shot SBIR, where each branch of the generative adversarial network maps the
visual information from sketch and image to a common semantic space via adversarial training. Each of these branches
maintains cycle consistency that only requires supervision at the category level, and avoids the need of aligned sketch-image
pairs. A classification criterion on the generators' outputs ensures that the visual-to-semantic mapping is class-specific.
Furthermore, we propose to combine textual and hierarchical side information via an auto-encoder that selects discriminative
side information within the same end-to-end model. Our results demonstrate a significant boost in any-shot SBIR performance
over the state of the art on the extended versions of the challenging Sketchy, TU-Berlin and QuickDraw datasets.
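A minimal sketch of the cycle-consistency-plus-classification supervision described above, with the adversarial terms omitted for brevity and all dimensions assumed:

```python
# Hedged sketch of the cycle-consistency idea: one generator maps visual
# features to the semantic space, an inverse generator maps back; an L1
# cycle loss plus a category-level classifier supervise the mapping
# without aligned sketch-image pairs. Dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

vis_dim, sem_dim, n_classes = 512, 300, 125
to_sem = nn.Linear(vis_dim, sem_dim)   # visual -> semantic generator
to_vis = nn.Linear(sem_dim, vis_dim)   # semantic -> visual generator
classifier = nn.Linear(sem_dim, n_classes)

x = torch.randn(16, vis_dim)           # sketch (or image) branch features
labels = torch.randint(0, n_classes, (16,))

s = to_sem(x)
cycle_loss = F.l1_loss(to_vis(s), x)   # reconstruct the visual input
cls_loss = F.cross_entropy(classifier(s), labels)  # category-level supervision
(cycle_loss + cls_loss).backward()     # adversarial terms omitted here
```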
Context Embedding Networks
Low dimensional embeddings that capture the main variations of interest in
collections of data are important for many applications. One way to construct
these embeddings is to acquire estimates of similarity from the crowd. However,
similarity is a multi-dimensional concept that varies from individual to
individual. Existing models for learning embeddings from the crowd typically
make simplifying assumptions, such as that all individuals estimate similarity
using the same criteria, that the list of criteria is known in advance, or that
crowd workers are not influenced by the data they see. To overcome these
limitations we introduce Context Embedding Networks (CENs). In addition to
learning interpretable embeddings from images, CENs also model worker biases
for different attributes along with the visual context, i.e., the visual
attributes highlighted by a set of images. Experiments on two noisy crowd
annotated datasets show that modeling both worker bias and visual context
results in more interpretable embeddings compared to existing approaches.
Comment: CVPR 2018 spotlight.
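A toy sketch of the underlying idea, with all shapes and the loss form assumed: a shared embedding is fit to crowd triplet judgments while each worker's similarity criteria are modelled as learned per-dimension weights:

```python
# Toy crowd-triplet embedding with per-worker attribute weights, loosely
# in the spirit of modelling worker bias: worker w re-weights embedding
# dimensions before distances are compared. Assumptions throughout.
import torch
import torch.nn.functional as F

n_items, n_workers, dim = 50, 5, 2
emb = torch.randn(n_items, dim, requires_grad=True)
worker_w = torch.ones(n_workers, dim, requires_grad=True)  # per-worker weights

def triplet_logit(w: int, a: int, p: int, n: int) -> torch.Tensor:
    # Worker w judged item a closer to p than to n.
    scale = worker_w[w].abs()
    d_ap = ((emb[a] - emb[p]) * scale).pow(2).sum()
    d_an = ((emb[a] - emb[n]) * scale).pow(2).sum()
    return d_an - d_ap  # positive when the judgment is satisfied

triplets = [(0, 1, 2, 3), (1, 4, 5, 6)]  # (worker, anchor, pos, neg)
loss = sum(F.softplus(-triplet_logit(*t)) for t in triplets)
loss.backward()  # updates both the embedding and the worker weights
```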
Doodle to Search: Practical Zero-Shot Sketch-based Image Retrieval
In this paper, we investigate the problem of zero-shot sketch-based image
retrieval (ZS-SBIR), where human sketches are used as queries to conduct
retrieval of photos from unseen categories. We advance prior art
by proposing a novel ZS-SBIR scenario that represents a firm step forward in
its practical application. The new setting uniquely recognizes two important
yet often neglected challenges of practical ZS-SBIR, (i) the large domain gap
between amateur sketch and photo, and (ii) the necessity for moving towards
large-scale retrieval. We first contribute to the community a novel ZS-SBIR
dataset, QuickDraw-Extended, that consists of 330,000 sketches and 204,000
photos spanning 110 categories. Highly abstract amateur human sketches
are purposefully sourced to maximize the domain gap, unlike those included in
existing datasets, which can often be semi-photorealistic. We then formulate a
ZS-SBIR framework to jointly model sketches and photos into a common embedding
space. A novel strategy to mine the mutual information among domains is
specifically engineered to alleviate the domain gap. External semantic
knowledge is further embedded to aid semantic transfer. We show that, rather
surprisingly, retrieval performance that significantly outperforms the
state of the art on existing datasets can already be achieved using a
reduced version of our model. We further demonstrate the superior performance
of our full model by comparing with a number of alternatives on the newly
proposed dataset. The new dataset, plus all training and testing code of our
model, will be publicly released to facilitate future research.
Comment: Oral paper in CVPR 2019.
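A minimal illustrative sketch of such a joint-embedding objective, combining a cross-domain triplet loss with a regression onto external semantic embeddings to aid zero-shot transfer; the encoders, dimensions, and margin below are stand-in assumptions:

```python
# Illustrative ZS-SBIR joint embedding: sketch and photo encoders share a
# common space trained with a cross-domain triplet loss, plus regression
# onto external semantic (e.g. word-vector) targets. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, sem_dim = 256, 300
sketch_enc = nn.Linear(512, dim)    # stand-ins for CNN feature encoders
photo_enc = nn.Linear(512, dim)
sem_proj = nn.Linear(dim, sem_dim)  # maps embeddings to the semantic space

sk, ph_pos, ph_neg = (torch.randn(8, 512) for _ in range(3))
word_vec = torch.randn(8, sem_dim)  # class word embeddings (external knowledge)

a, p, n = sketch_enc(sk), photo_enc(ph_pos), photo_enc(ph_neg)
triplet = F.triplet_margin_loss(a, p, n, margin=0.3)  # bridge the domain gap
semantic = F.mse_loss(sem_proj(a), word_vec)          # semantic transfer term
(triplet + semantic).backward()
```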