Local Feature Detectors, Descriptors, and Image Representations: A Survey
With the advances in both stable interest region detectors and robust and
distinctive descriptors, local feature-based image or object retrieval has
become a popular research topic. A local feature-based image retrieval
system involves two key processes: local feature extraction and image
representation. The latter aggregates local features into a representation
such as the bag-of-visual-words (BoVW), Fisher vector, or Vector of Locally
Aggregated Descriptors (VLAD) framework. In this paper, we
review local features and image representations for image retrieval. Because
many methods have been proposed in this area, we group them into several
classes and summarize them. In addition, recent deep learning-based
approaches for image retrieval are briefly reviewed.
Comment: 20 pages
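The representations named above can be sketched concretely: a minimal bag-of-visual-words (BoVW) encoder quantizes each local descriptor to its nearest visual word and histograms the assignments. All names and shapes below are illustrative, not taken from the survey.

```python
import numpy as np

def bovw_histogram(descriptors, vocabulary):
    # distance from every descriptor to every visual word
    d = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = d.argmin(axis=1)              # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()              # L1-normalized BoVW histogram

rng = np.random.default_rng(0)
vocab = rng.standard_normal((8, 16))      # 8 visual words, 16-D descriptors
desc = rng.standard_normal((100, 16))     # 100 local descriptors from one image
h = bovw_histogram(desc, vocab)
```

Fisher vectors and VLAD replace the hard count per word with first- (and second-) order residual statistics, which is why they are discussed as separate representation frameworks.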
Composite Quantization
This paper studies the compact coding approach to approximate nearest
neighbor search. We introduce a composite quantization framework. It uses the
composition of several (M) elements, each of which is selected from a
different dictionary, to accurately approximate a D-dimensional vector, thus
yielding accurate search, and represents the data vector by a short code
composed of the indices of the selected elements in the corresponding
dictionaries. Our key contribution lies in introducing a near-orthogonality
constraint, which guarantees search efficiency, as the cost of the
distance computation is reduced from O(D) to O(M) through a distance table
lookup scheme. The resulting approach is called near-orthogonal composite
quantization. We theoretically justify the equivalence between near-orthogonal
composite quantization and minimizing an upper bound of a function formed by
jointly considering the quantization error and the search cost according to a
generalized triangle inequality. We empirically show the efficacy of the
proposed approach over several benchmark datasets. In addition, we demonstrate
superior performance in three other applications: combination with the
inverted multi-index, quantizing the query for mobile search, and
inner-product similarity search.
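The distance-table idea above can be illustrated in a few lines, with a greedy encoder standing in for the paper's joint optimization; every name and size below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, D = 4, 16, 32                      # M dictionaries of K elements in D dimensions
dicts = rng.standard_normal((M, K, D))

def encode(x):
    # greedy stand-in: per dictionary, pick the element best matching the residual
    code, r = [], x.copy()
    for m in range(M):
        k = int(np.argmax(dicts[m] @ r))
        code.append(k)
        r = r - dicts[m, k]
    return code

def decode(code):
    # reconstruction is the sum of one selected element per dictionary
    return sum(dicts[m, code[m]] for m in range(M))

def query_tables(q):
    # M*K inner products precomputed once per query
    return np.einsum('mkd,d->mk', dicts, q)

def table_inner(tables, code):
    # M table lookups (instead of D multiplications) recover <q, decode(code)> exactly
    return sum(tables[m, code[m]] for m in range(M))

x, q = rng.standard_normal(D), rng.standard_normal(D)
code, tables = encode(x), query_tables(q)
```

Inner products decompose over the sum exactly; the paper's near-orthogonality constraint is what lets the full Euclidean distance reduce to the same M-term lookup.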
De-Hashing: Server-Side Context-Aware Feature Reconstruction for Mobile Visual Search
Due to the prevalence of mobile devices, mobile search has become more
convenient than desktop search. Unlike traditional desktop search, mobile
visual search must account for the limited resources on mobile devices
(e.g., bandwidth, computing power, and memory).
State-of-the-art approaches show that the bag-of-words (BoW) model is robust
for image and video retrieval; however, the large vocabulary tree might not
fit in the memory of a mobile device. We observe that recent works mainly
focus on designing compact feature representations on mobile devices for
bandwidth-limited networks (e.g., 3G) and directly adopt feature matching on
remote servers (cloud). However, the compact (binary) representation might fail
to retrieve target objects (images, videos). Based on the hashed binary codes,
we propose a de-hashing process that reconstructs BoW by leveraging the
computing power of remote servers. To mitigate the information loss from binary
codes, we further utilize contextual information (e.g., GPS) to reconstruct a
context-aware BoW for better retrieval results. Experimental results show that
the proposed method achieves retrieval accuracy competitive with BoW while
transmitting only a few bits from mobile devices.
Comment: Accepted for publication in IEEE Transactions on Circuits and Systems
for Video Technology (TCSVT)
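The client-side half of such a pipeline can be sketched with a generic sign-random-projection hash; this is illustrative only, and neither the paper's hashing scheme nor the server-side de-hashing is reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
D, B = 1000, 64                          # BoW dimensionality, bits transmitted
P = rng.standard_normal((B, D))          # projection shared by client and server

def client_hash(bow):
    # compress a BoW vector to B bits before transmission
    return (P @ bow > 0).astype(np.uint8)

bow = rng.random(D)                      # toy BoW vector for one query image
bits = client_hash(bow)
```

The server's task is then to invert this lossy 64-bit mapping, which is where contextual cues such as GPS constrain the reconstruction.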
Recent Advance in Content-based Image Retrieval: A Literature Survey
The explosive increase and ubiquitous accessibility of visual data on the Web
have led to the prosperity of research activity in image search or retrieval.
Because they ignore visual content as a ranking clue, methods that apply text
search techniques to visual retrieval may suffer from inconsistency between
the text words and the visual content. Content-based image retrieval (CBIR), which
makes use of the representation of visual content to identify relevant images,
has attracted sustained attention over the past two decades. Such a problem is
challenging due to the intention gap and the semantic gap problems. Numerous
techniques have been developed for content-based image retrieval in the last
decade. The purpose of this paper is to categorize and evaluate those
algorithms proposed from 2003 to 2016. We conclude with several
promising directions for future research.
Comment: 22 pages
Learning to Index for Nearest Neighbor Search
In this study, we present a novel ranking model based on learning
neighborhood relationships embedded in the index space. Given a query point,
conventional approximate nearest neighbor search calculates the distances to
the cluster centroids, before ranking the clusters from near to far based on
the distances. The data indexed in the top-ranked clusters are retrieved and
treated as the nearest neighbor candidates for the query. However, the
quantization loss between the data points and their cluster centroids
inevitably harms search accuracy. To address this problem, the proposed model ranks clusters
based on their nearest neighbor probabilities rather than the query-centroid
distances. The nearest neighbor probabilities are estimated by employing neural
networks to characterize the neighborhood relationships, i.e., the density
function of nearest neighbors with respect to the query. The proposed
probability-based ranking can replace the conventional distance-based ranking
for finding candidate clusters, and the predicted probability can be used to
determine the data quantity to be retrieved from the candidate cluster. Our
experimental results demonstrated that the proposed ranking model could boost
the search performance effectively on billion-scale datasets.
Comment: This paper was accepted by IEEE Transactions on Pattern Analysis and
Machine Intelligence in March 201
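The conventional distance-based cluster ranking that this model replaces can be sketched as follows; the toy data and names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.standard_normal((1000, 8))
centroids = rng.standard_normal((16, 8))
# index build: assign each point to its nearest centroid
assign = np.argmin(np.linalg.norm(data[:, None] - centroids[None], axis=2), axis=1)

def candidates(query, n_probe=4):
    # rank clusters near-to-far by query-centroid distance, probe the top few
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))
    probed = order[:n_probe]
    return np.where(np.isin(assign, probed))[0]

q = rng.standard_normal(8)
cand = candidates(q)
```

The true nearest neighbor may live in an unprobed cluster; that quantization loss is exactly what motivates ranking clusters by learned nearest-neighbor probabilities instead.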
Joint Maximum Purity Forest with Application to Image Super-Resolution
In this paper, we propose a novel random-forest scheme, namely Joint Maximum
Purity Forest (JMPF), for classification, clustering, and regression tasks. In
the JMPF scheme, the original feature space is transformed into a compactly
pre-clustered feature space, via a trained rotation matrix. The rotation matrix
is obtained through an iterative quantization process, where the input data
belonging to different classes are clustered to the respective vertices of the
new feature space with maximum purity. In the new feature space, orthogonal
hyperplanes, which are employed at the split-nodes of decision trees in random
forests, can tackle the clustering problems effectively. We evaluated our
proposed method on public benchmark datasets for regression and classification
tasks, and experiments showed that JMPF remarkably outperforms other
state-of-the-art random-forest-based approaches. Furthermore, we applied JMPF
to image super-resolution, because the transformed, compact features are more
discriminative for the clustering-regression scheme. Experimental results on
several public benchmark datasets also show that the JMPF-based image
super-resolution scheme is consistently superior to recent state-of-the-art
image super-resolution algorithms.
Comment: 18 pages, 7 figures
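The rotation step described above resembles iterative quantization (ITQ): alternately binarize the rotated data and solve an orthogonal Procrustes problem. A minimal sketch of that mechanism, without JMPF's class-purity objective, follows.

```python
import numpy as np

def itq_rotation(X, iters=20, seed=0):
    # ITQ-style alternation: assign points to hypercube vertices, then
    # refit the orthogonal rotation R via the Procrustes solution
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal init
    for _ in range(iters):
        B = np.sign(X @ R)                             # vertex assignments
        U, _, Vt = np.linalg.svd(X.T @ B)              # orthogonal Procrustes
        R = U @ Vt
    return R

X = np.random.default_rng(1).standard_normal((200, 4))
R = itq_rotation(X)
```

JMPF additionally drives same-class points toward the same vertex, which is what "maximum purity" refers to.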
Fine-tuning CNN Image Retrieval with No Human Annotation
Image descriptors based on activations of Convolutional Neural Networks
(CNNs) have become dominant in image retrieval due to their discriminative
power, compactness of representation, and search efficiency. Training of CNNs,
either from scratch or fine-tuning, requires a large amount of annotated data,
where a high quality of annotation is often crucial. In this work, we propose
to fine-tune CNNs for image retrieval on a large collection of unordered images
in a fully automated manner. Reconstructed 3D models obtained by the
state-of-the-art retrieval and structure-from-motion methods guide the
selection of the training data. We show that both hard-positive and
hard-negative examples, selected by exploiting the geometry and the camera
positions available from the 3D models, enhance the performance of
particular-object retrieval. CNN descriptor whitening discriminatively learned
from the same training data outperforms commonly used PCA whitening. We propose
a novel trainable Generalized-Mean (GeM) pooling layer that generalizes max and
average pooling and show that it boosts retrieval performance. Applying the
proposed method to the VGG network achieves state-of-the-art performance on the
standard benchmarks: Oxford Buildings, Paris, and Holidays datasets.
Comment: TPAMI 2018. arXiv admin note: substantial text overlap with
arXiv:1604.0242
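GeM pooling itself is compact enough to state directly. A sketch, assuming post-ReLU (non-negative) activations:

```python
import numpy as np

def gem_pool(fmap, p=3.0, eps=1e-6):
    # Generalized-Mean pooling over a C x H x W activation map:
    # p = 1 gives average pooling; p -> infinity approaches max pooling
    x = np.clip(fmap, eps, None)
    return (x ** p).mean(axis=(1, 2)) ** (1.0 / p)

fmap = np.random.default_rng(0).random((8, 7, 7))   # toy post-ReLU activations
g_avg = gem_pool(fmap, p=1.0)
g_big = gem_pool(fmap, p=100.0)
```

Because the exponent p is differentiable, it can be learned end-to-end along with the network, which is the point of making the layer trainable.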
Unifying Deep Local and Global Features for Image Search
Image retrieval is the problem of searching an image database for items that
are similar to a query image. To address this task, two main types of image
representations have been studied: global and local image features. In this
work, our key contribution is to unify global and local features into a single
deep model, enabling accurate retrieval with efficient feature extraction. We
refer to the new model as DELG, standing for DEep Local and Global features. We
leverage lessons from recent feature learning work and propose a model that
combines generalized mean pooling for global features and attentive selection
for local features. The entire network can be learned end-to-end by carefully
balancing the gradient flow between two heads -- requiring only image-level
labels. We also introduce an autoencoder-based dimensionality reduction
technique for local features, which is integrated into the model, improving
training efficiency and matching performance. Comprehensive experiments show
that our model achieves state-of-the-art image retrieval on the Revisited
Oxford and Paris datasets, and state-of-the-art single-model instance-level
recognition on the Google Landmarks dataset v2. Code and models are available
at https://github.com/tensorflow/models/tree/master/research/delf.
Comment: ECCV'20 paper
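The attentive selection of local features can be illustrated with a stand-in attention head; all shapes and names here are hypothetical, not DELG's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(3)
feats = rng.standard_normal((49, 128))   # local descriptors from a 7x7 feature map
scores = rng.random(49)                  # per-location attention scores (stand-in head)
keep = np.argsort(scores)[::-1][:10]     # attentive selection: keep top-10 locations
local_feats = feats[keep]
```

In the full model these scores come from a trained attention module, and the kept descriptors are further compressed by the autoencoder before geometric matching.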
Constrained-size Tensorflow Models for YouTube-8M Video Understanding Challenge
This paper presents our 7th-place solution to the second YouTube-8M video
understanding competition, which challenged participants to build a
constrained-size model to classify millions of YouTube videos into thousands of
classes. Our final model consists of four single models aggregated into one
TensorFlow graph. For each single model, we use the same network architecture
as in the winning solution of the first YouTube-8M video understanding
competition, namely Gated NetVLAD. We train the single models separately in
TensorFlow's default float32 precision, then replace the weights with float16
precision and ensemble them in the evaluation and inference stages, achieving a
48.5% compression rate without loss of precision. Our best model achieved
88.324% GAP on the private leaderboard. The code is publicly available at
https://github.com/boliu61/youtube-8m
Comment: Accepted paper at 2018 ECCV Youtube8M workshop:
https://research.google.com/youtube8m/workshop2018
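The float32-to-float16 trick is easy to verify in isolation: pure weight casting halves storage, close to the 48.5% compression reported above, at a small quantization error.

```python
import numpy as np

w32 = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)
w16 = w32.astype(np.float16)             # ship/store weights at half precision
ratio = 1 - w16.nbytes / w32.nbytes      # 0.5 for pure casting
err = np.abs(w32 - w16.astype(np.float32)).max()   # worst-case rounding error
```

float16 carries about 11 bits of mantissa, so for weights of typical magnitude the rounding error is far below the noise level that affects classification scores.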
Application-Driven Near-Data Processing for Similarity Search
Similarity search is key to a variety of applications including
content-based search for images and video, recommendation systems, data
deduplication, natural language processing, computer vision, databases,
computational biology, and computer graphics. At its core, similarity search
manifests as k-nearest neighbors (kNN), a computationally simple primitive
consisting of highly parallel distance calculations and a global top-k sort.
However, kNN is poorly supported by today's architectures because of its high
memory bandwidth requirements.
This paper proposes an application-driven near-data processing accelerator
for similarity search: the Similarity Search Associative Memory (SSAM). By
instantiating compute units close to memory, SSAM benefits from the higher
memory bandwidth and density exposed by emerging memory technologies. We
evaluate the SSAM design down to layout on top of the Micron hybrid memory cube
(HMC), and show that SSAM can achieve up to two orders of magnitude
improvement in area-normalized throughput and energy efficiency over multicore
CPUs; we also show SSAM is faster and more energy efficient than competing GPUs
and FPGAs. Finally, we show that SSAM is also useful for other data-intensive
tasks like kNN index construction, and can be generalized to function
semantically as a high-capacity content-addressable memory.
Comment: 15 pages, 8 figures, 7 tables