1,101 research outputs found
Recent Advance in Content-based Image Retrieval: A Literature Survey
The explosive increase and ubiquitous accessibility of visual data on the Web
have led to the prosperity of research activity in image search or retrieval.
Ignoring visual content as a ranking clue, methods that apply text
search techniques to visual retrieval may suffer from inconsistency between the
text words and the visual content. Content-based image retrieval (CBIR), which
makes use of the representation of visual content to identify relevant images,
has attracted sustained attention over the past two decades. Such a problem is
challenging due to the intention gap and the semantic gap problems. Numerous
techniques have been developed for content-based image retrieval in the last
decade. The purpose of this paper is to categorize and evaluate those
algorithms proposed during the period from 2003 to 2016. We conclude with several
promising directions for future research. Comment: 22 pages
Deep Sketch Hashing: Fast Free-hand Sketch-Based Image Retrieval
Free-hand sketch-based image retrieval (SBIR) is a specific cross-view
retrieval task, in which queries are abstract and ambiguous sketches while the
retrieval database is formed with natural images. Work in this area mainly
focuses on extracting representative and shared features for sketches and
natural images. However, these can neither cope well with the geometric
distortion between sketches and images nor be feasible for large-scale SBIR due
to the heavy continuous-valued distance computation. In this paper, we speed up
SBIR by introducing a novel binary coding method, named \textbf{Deep Sketch
Hashing} (DSH), where a semi-heterogeneous deep architecture is proposed and
incorporated into an end-to-end binary coding framework. Specifically, three
convolutional neural networks are utilized to encode free-hand sketches,
natural images and, especially, the auxiliary sketch-tokens which are adopted
as bridges to mitigate the sketch-image geometric distortion. The learned DSH
codes can effectively capture the cross-view similarities as well as the
intrinsic semantic correlations between different categories. To the best of
our knowledge, DSH is the first hashing work specifically designed for
category-level SBIR with an end-to-end deep architecture. The proposed DSH is
comprehensively evaluated on two large-scale datasets of TU-Berlin Extension
and Sketchy, and the experiments consistently show DSH's superior SBIR
accuracies over several state-of-the-art methods, while achieving significantly
reduced retrieval time and memory footprint. Comment: This paper will appear as a spotlight paper in CVPR2017
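The retrieval-speed advantage of binary codes over continuous-valued descriptors comes from ranking by Hamming distance, which needs only bit comparisons. The following sketch illustrates that idea in general; the function name and toy data are illustrative, not taken from the DSH implementation:

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to a query binary code.

    query_code: (n_bits,) array of 0/1
    db_codes:   (n_items, n_bits) array of 0/1
    Returns database indices, nearest first.
    """
    # Hamming distance = number of positions where the bits differ.
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable")

# Toy example: four 8-bit database codes.
db = np.array([[0, 0, 0, 0, 1, 1, 1, 1],
               [1, 1, 1, 1, 0, 0, 0, 0],
               [0, 0, 0, 0, 1, 1, 1, 0],
               [1, 0, 0, 0, 1, 1, 1, 1]])
q = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(hamming_rank(q, db))  # item 0 matches the query exactly
```

Because each comparison is a cheap bitwise operation, this scales to databases where continuous-valued distance computation would be prohibitive, which is the efficiency argument the abstract makes.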
Snap and Find: Deep Discrete Cross-domain Garment Image Retrieval
With the increasing number of online stores, there is a pressing need for
intelligent search systems to understand the item photos snapped by customers
and search against large-scale product databases to find their desired items.
However, it is challenging for conventional retrieval systems to match up the
item photos captured by customers and the ones officially released by stores,
especially for garment images. To bridge the customer- and store-provided
garment photos, existing studies have widely exploited clothing
attributes (\textit{e.g.,} black) and landmarks (\textit{e.g.,} collar) to
learn a common embedding space for garment representations. Unfortunately, they
omit the sequential correlation of attributes and consume a large quantity of
human labor to label the landmarks. In this paper, we propose a deep
multi-task cross-domain hashing termed \textit{DMCH}, in which cross-domain
embedding and sequential attribute learning are modeled simultaneously.
Sequential attribute learning not only provides the semantic guidance for
embedding, but also generates rich attention on discriminative local details
(\textit{e.g.,} black buttons) of clothing items without requiring extra
landmark labels. This leads to promising performance and a 306x boost in
efficiency compared with the state-of-the-art models, which is
demonstrated through rigorous experiments on two public fashion datasets.
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up the cross-modal retrieval, a number of binary representation learning
methods are proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristic of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions. Comment: 20 pages, 11 figures, 9 tables
A Survey on Learning to Hash
Nearest neighbor search is a problem of finding the data points from the
database such that the distances from them to the query point are the smallest.
Learning to hash is one of the major solutions to this problem and has been
widely studied recently. In this paper, we present a comprehensive survey of
the learning to hash algorithms, categorize them according to the manners of
preserving the similarities into: pairwise similarity preserving, multiwise
similarity preserving, implicit similarity preserving, as well as quantization,
and discuss their relations. We separate quantization from pairwise similarity
preserving as the objective function is very different though quantization, as
we show, can be derived from preserving the pairwise similarities. In addition,
we present the evaluation protocols, and the general performance analysis, and
point out that the quantization algorithms perform superiorly in terms of
search accuracy, search time cost, and space cost. Finally, we introduce a few
emerging topics. Comment: To appear in IEEE Transactions on Pattern Analysis
and Machine Intelligence (TPAMI)
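A minimal concrete instance of similarity-preserving hashing is the classic random-hyperplane scheme (often called SimHash), where each bit records which side of a random hyperplane a vector falls on, so vectors with a small angle between them tend to agree on more bits. This is a generic sketch for intuition, not any particular method from the survey:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_hyperplanes(dim, n_bits, rng):
    # One random hyperplane (normal vector) per output bit.
    return rng.standard_normal((n_bits, dim))

def hash_vectors(X, planes):
    # Bit b is 1 when a vector lies on the positive side of hyperplane b.
    return (X @ planes.T > 0).astype(np.uint8)

planes = fit_hyperplanes(dim=16, n_bits=32, rng=rng)
a = rng.standard_normal(16)
b = a + 0.05 * rng.standard_normal(16)   # near-duplicate of a
c = rng.standard_normal(16)              # unrelated vector
ha, hb, hc = hash_vectors(np.stack([a, b, c]), planes)
# Similar vectors share more bits, i.e. have a smaller Hamming distance.
print((ha != hb).sum(), (ha != hc).sum())
```

The probability that a random hyperplane separates two vectors is proportional to the angle between them, which is the sense in which this scheme preserves pairwise (cosine) similarity in Hamming space.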
An Efficient Approach for Geo-Multimedia Cross-Modal Retrieval
Due to the rapid development of mobile Internet techniques, cloud computation
and popularity of online social networking and location-based services, massive
amount of multimedia data with geographical information is generated and
uploaded to the Internet. In this paper, we propose a novel type of cross-modal
multimedia retrieval called geo-multimedia cross-modal retrieval which aims to
search out a set of geo-multimedia objects based on geographical distance
proximity and semantic similarity between different modalities. Previous
studies for cross-modal retrieval and spatial keyword search cannot address
this problem effectively because they do not consider multimedia data with
geo-tags and do not focus on this type of query. In order to address this
problem efficiently, we present, for the first time, the definition of the kNN
geo-multimedia cross-modal query and introduce relevant conceptions such as the
cross-modal semantic representation space. To bridge the semantic gap between
different modalities, we propose a method named cross-modal semantic matching
which contains two important components, i.e., CorrProj and LogsTran, which aim
to construct a common semantic representation space for cross-modal semantic
similarity measurement. Besides, we design a framework based on deep learning
techniques to implement common semantic representation space construction. In
addition, a novel hybrid indexing structure named GMR-Tree combining
geo-multimedia data and R-Tree is presented, and an efficient kNN search
algorithm called GMCMS is designed. Comprehensive experimental evaluation on
real and synthetic datasets clearly demonstrates that our solution outperforms
the state-of-the-art methods. Comment: 27 pages
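The query described here ranks objects by both geographical proximity and cross-modal semantic similarity, which is commonly formalized as a weighted combination of the two criteria. The sketch below illustrates one such scoring function; the weighting scheme, distance cutoff, and data layout are assumptions for illustration, not the paper's GMCMS algorithm:

```python
import math

def geo_semantic_score(query, obj, alpha=0.5, max_dist_km=50.0):
    """Combine normalized geographic proximity with semantic similarity.

    alpha balances the two criteria; both terms lie in [0, 1].
    """
    d = haversine_km(query["loc"], obj["loc"])
    geo = max(0.0, 1.0 - d / max_dist_km)       # 1 at the query point, 0 beyond cutoff
    sem = cosine(query["vec"], obj["vec"])      # similarity in the common semantic space
    return alpha * geo + (1 - alpha) * sem

def haversine_km(p, q):
    # Great-circle distance between two (lat, lon) points in kilometers.
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

# Toy query and one candidate object, each with a location and an embedding.
q = {"loc": (48.8566, 2.3522), "vec": [1.0, 0.0]}
o = {"loc": (48.8606, 2.3376), "vec": [0.9, 0.1]}
print(round(geo_semantic_score(q, o), 3))
```

A kNN query under such a score is what a hybrid index like the GMR-Tree is designed to accelerate: it prunes candidates simultaneously by spatial extent and by similarity bounds instead of scoring every object.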
A Decade Survey of Content Based Image Retrieval using Deep Learning
Content-based image retrieval aims to find similar images in a
large-scale dataset given a query image. Generally, the similarity between
the representative features of the query image and the dataset images is used to
rank the images for retrieval. In the early days, various hand-designed feature
descriptors were investigated based on visual cues such as color,
texture, and shape that represent the images. However, deep learning has
emerged over the past decade as a dominant alternative to hand-designed feature
engineering. It learns the features automatically from the data. This paper presents
a comprehensive survey of deep learning based developments in the past decade
for content based image retrieval. The categorization of existing
state-of-the-art methods from different perspectives is also performed for
greater understanding of the progress. The taxonomy used in this survey covers
different levels of supervision, different networks, different descriptor types, and
different retrieval types. A performance analysis is also performed using the
state-of-the-art methods. The insights are also presented for the benefit of
the researchers to observe the progress and to make the best choices. The
survey presented in this paper will support further research progress in image
retrieval using deep learning.
Exquisitor: Interactive Learning at Large
Increasing scale is a dominant trend in today's multimedia collections, which
especially impacts interactive applications. To facilitate interactive
exploration of large multimedia collections, new approaches are needed that are
capable of learning on the fly new analytic categories based on the visual and
textual content. To facilitate general use on standard desktops, laptops, and
mobile devices, they must furthermore work with limited computing resources. We
present Exquisitor, a highly scalable interactive learning approach, capable of
intelligent exploration of the large-scale YFCC100M image collection with
extremely efficient responses from the interactive classifier. Based on
relevance feedback from the user on previously suggested items, Exquisitor uses
semantic features, extracted from both visual and text attributes, to suggest
relevant media items to the user. Exquisitor builds upon the state of the art
in large-scale data representation, compression and indexing, introducing a
cluster-based retrieval mechanism that facilitates the efficient suggestions.
With Exquisitor, each interaction round over the full YFCC100M collection is
completed in less than 0.3 seconds using a single CPU core. That is 4x less
time, using 16x fewer computational resources, than the most efficient
state-of-the-art method, with a positive impact on result quality. These
results open up many interesting research avenues, both for exploration of
industry-scale media collections and for media exploration on mobile devices.
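A single relevance-feedback round of the kind described can be sketched with a classic Rocchio-style update: the query representation moves toward items the user marked relevant and away from items marked irrelevant, and the collection is re-ranked. This is a generic sketch of the interaction loop, not Exquisitor's cluster-based index or classifier:

```python
import numpy as np

def rocchio_round(query, positives, negatives, alpha=1.0, beta=0.75, gamma=0.25):
    """One Rocchio-style relevance-feedback update of the query vector."""
    q = alpha * query
    if len(positives):
        q = q + beta * np.mean(positives, axis=0)   # pull toward relevant items
    if len(negatives):
        q = q - gamma * np.mean(negatives, axis=0)  # push away from irrelevant items
    return q

def suggest(query, items, k=2):
    """Return indices of the k items scoring highest against the query."""
    scores = items @ query
    return np.argsort(-scores)[:k]

# Toy collection of 2-D feature vectors.
items = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
q0 = np.array([0.5, 0.5])
# The user marks item 0 relevant and item 2 irrelevant.
q1 = rocchio_round(q0, positives=items[[0]], negatives=items[[2]])
print(suggest(q1, items))
```

In a system at YFCC100M scale, the expensive part is the `suggest` step over 100M items; Exquisitor's contribution is making that step fast enough (sub-0.3 s per round on one core) via compressed, cluster-based indexing rather than a full scan.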
Deep Collaborative Discrete Hashing with Semantic-Invariant Structure
Existing deep hashing approaches fail to fully explore semantic correlations
and neglect the effect of linguistic context on visual attention learning,
leading to inferior performance. This paper proposes a dual-stream learning
framework, dubbed Deep Collaborative Discrete Hashing (DCDH), which constructs
a discriminative common discrete space by collaboratively incorporating the
shared and individual semantics deduced from visual features and semantic
labels. Specifically, the context-aware representations are generated by
employing the outer product of visual embeddings and semantic encodings.
Moreover, we reconstruct the labels and introduce the focal loss to take
advantage of frequent and rare concepts. The common binary code space is built
on the joint learning of the visual representations attended by language, the
semantic-invariant structure construction and the label distribution
correction. Extensive experiments demonstrate the superiority of our method.
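The outer-product fusion mentioned above (a bilinear interaction between a visual embedding and a semantic encoding) can be illustrated in a few lines; the dimensions and values here are toy choices, not DCDH's actual configuration:

```python
import numpy as np

def context_aware(visual, semantic):
    """Bilinear (outer-product) fusion of a visual embedding and a
    semantic encoding, flattened into one context-aware vector.

    Every entry is the product of one visual feature with one semantic
    feature, so all pairwise interactions are captured explicitly.
    """
    return np.outer(visual, semantic).ravel()

v = np.array([0.2, 0.8, 0.5])   # toy visual embedding (3-D)
s = np.array([1.0, 0.0])        # toy semantic/label encoding (2-D)
f = context_aware(v, s)
print(f.shape, f)
```

The fused vector has dimension `len(visual) * len(semantic)`, which is why such representations are typically followed by a projection into a compact common space before binarization.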
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents futuristic challenges discussed in the cvpaper.challenge. In
2015 and 2016, we thoroughly studied 1,600+ papers in several
conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
- …