Scalable Locality-Sensitive Hashing for Similarity Search in High-Dimensional, Large-Scale Multimedia Datasets
Similarity search is critical for many database applications, including the
increasingly popular online services for Content-Based Multimedia Retrieval
(CBMR). These services, which include image search engines, must handle an
overwhelming volume of data, while keeping low response times. Thus,
scalability is imperative for similarity search in Web-scale applications, but
most existing methods are sequential and target shared-memory machines. Here we
address these issues with a distributed, efficient, and scalable index based on
Locality-Sensitive Hashing (LSH). LSH is one of the most efficient and popular
techniques for similarity search, but its poor referential locality properties
have made its implementation a challenging problem. Our solution is based on a
widely asynchronous dataflow parallelization with a number of optimizations
that include a hierarchical parallelization to decouple indexing and data
storage, locality-aware data partition strategies to reduce message passing,
and multi-probing to limit memory usage. The proposed parallelization attained
an efficiency of 90% in a distributed system with about 800 CPU cores. In
particular, our locality-aware data partition strategy reduced the number of
messages exchanged by 30%. Our parallel LSH was evaluated using the largest
public dataset for similarity search (to the best of our knowledge), consisting
of 128-d SIFT descriptors extracted from Web images. This is two orders of
magnitude larger than the datasets that previous LSH parallelizations could
handle.
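The single-node LSH scheme that such an index builds on can be sketched in a few lines. The class below is an illustrative assumption, not the paper's distributed implementation: it uses the standard p-stable hashing h(v) = floor((a·v + b)/w) over several tables, and re-ranks the union of candidate buckets by true Euclidean distance. All names and parameters are hypothetical.

```python
import numpy as np

class L2LSHIndex:
    """Minimal single-node sketch of an LSH index for Euclidean distance
    (p-stable scheme). Illustrative only, not the paper's distributed code."""

    def __init__(self, dim, n_tables=4, n_bits=8, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w
        self.a = rng.normal(size=(n_tables, n_bits, dim))   # random projections
        self.b = rng.uniform(0, w, size=(n_tables, n_bits)) # random offsets
        self.tables = [dict() for _ in range(n_tables)]
        self.points = []

    def _keys(self, v):
        # h(v) = floor((a.v + b) / w); one integer tuple per hash table
        proj = np.floor((self.a @ v + self.b) / self.w).astype(int)
        return [tuple(row) for row in proj]

    def add(self, v):
        idx = len(self.points)
        self.points.append(np.asarray(v, dtype=float))
        for table, key in zip(self.tables, self._keys(v)):
            table.setdefault(key, []).append(idx)

    def query(self, q, k=5):
        # union of candidates from all tables, then rank by true distance
        cand = set()
        for table, key in zip(self.tables, self._keys(q)):
            cand.update(table.get(key, []))
        return sorted(cand, key=lambda i: np.linalg.norm(self.points[i] - q))[:k]
```

Multi-probing, as mentioned above, would additionally visit neighboring buckets of each key to reduce the number of tables (and thus memory) needed.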
Learning to Hash for Indexing Big Data - A Survey
The explosive growth in big data has attracted much attention in designing
efficient indexing and search methods recently. In many critical applications
such as large-scale search and pattern matching, finding the nearest neighbors
to a query is a fundamental research problem. However, the straightforward
solution using exhaustive comparison is infeasible due to the prohibitive
computational complexity and memory requirement. In response, Approximate
Nearest Neighbor (ANN) search based on hashing techniques has become popular
due to its promising performance in both efficiency and accuracy. Prior
randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore
data-independent hash functions with random projections or permutations.
Although having elegant theoretic guarantees on the search quality in certain
metric spaces, performance of randomized hashing has been shown insufficient in
many real-world applications. As a remedy, new approaches incorporating
data-driven learning methods in development of advanced hash functions have
emerged. Such learning to hash methods exploit information such as data
distributions or class labels when optimizing the hash codes or functions.
Importantly, the learned hash codes are able to preserve the proximity of
neighboring data in the original feature spaces in the hash code spaces. The
goal of this paper is to provide readers with systematic understanding of
insights, pros and cons of the emerging techniques. We provide a comprehensive
survey of the learning to hash framework and representative techniques of
various types, including unsupervised, semi-supervised, and supervised. In
addition, we also summarize recent hashing approaches utilizing the deep
learning models. Finally, we discuss the future direction and trends of
research in this area.
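The data-independent baseline that learning-to-hash methods improve on can be shown concretely; the function names below are illustrative. Sign-of-random-projection hashing produces binary codes whose Hamming distance tracks the angle between vectors (two inputs disagree on a bit with probability proportional to the angle between them); learned methods replace the random matrix W with one optimized on data distributions or labels.

```python
import numpy as np

def random_hyperplane_codes(X, n_bits=32, seed=0):
    """Data-independent LSH baseline for cosine similarity: the sign of
    random projections. Learning-to-hash methods would optimize W on data
    instead of drawing it at random. Names are illustrative."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_bits))
    return (X @ W > 0).astype(np.uint8)  # one n_bits binary code per row

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.count_nonzero(a != b))
```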
Low-density locality-sensitive hashing boosts metagenomic binning
Metagenomic binning is an essential task in analyzing metagenomic sequence
datasets. To analyze structure or function of microbial communities from
environmental samples, metagenomic sequence fragments are assigned to their
taxonomic origins. Although sequence alignment algorithms can readily be used
and usually provide high-resolution alignments and accurate binning results,
the computational cost of such alignment-based methods becomes prohibitive as
metagenomic datasets continue to grow. Alternative composition-based methods,
which exploit sequence composition by profiling local short k-mers in
fragments, are often faster but less accurate than alignment-based methods.
Inspired by the success of linear error-correcting codes in noisy-channel
communication, we introduce Opal, a novel, fast, and accurate composition-based
binning method. It incorporates ideas from Gallager's low-density parity-check
code to design a family of compact and discriminative locality-sensitive
hashing functions that encode long-range compositional dependencies in long
fragments. By incorporating the Gallager LSH functions as features in a simple
linear SVM, Opal provides fast, accurate and robust binning for datasets
consisting of a large number of species, even with mutations and sequencing
errors. Opal not only performs up to two orders of magnitude faster than BWA,
an alignment-based binning method, but also achieves improved binning accuracy
and robustness to sequencing errors. Opal also outperforms models built on
traditional k-mer profiles in terms of robustness and accuracy. Finally, we
demonstrate that we can effectively use Opal in the "coarse search" stage of a
compressive genomics pipeline to identify a much smaller candidate set of
taxonomic origins for a subsequent alignment-based method to analyze, thus
providing metagenomic binning with high scalability, high accuracy and high
resolution.
Comment: RECOMB 2016.
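The low-density hashing idea can be sketched as follows; every parameter and name below is an illustrative assumption, not Opal's actual configuration. Each hash function reads only a few positions out of a long window, in the spirit of a sparse LDPC parity-check row, so the features capture long-range compositional dependencies that contiguous k-mer profiles miss; the resulting vectors would then feed a linear SVM.

```python
import numpy as np

def gallager_style_features(seq, n_hashes=16, span=32, weight=4,
                            n_buckets=256, seed=0):
    """Illustrative low-density LSH features for a DNA fragment: each hash
    reads only `weight` positions of a `span`-wide window and the hashed
    values are accumulated into a bag-of-buckets vector. Hypothetical
    parameters, not Opal's real design."""
    rng = np.random.default_rng(seed)
    base = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    # one sparse position mask per hash function (a "low-density row")
    masks = [np.sort(rng.choice(span, size=weight, replace=False))
             for _ in range(n_hashes)]
    feats = np.zeros(n_hashes * n_buckets)
    digits = np.array([base.get(c, 0) for c in seq])
    for start in range(len(seq) - span + 1):
        window = digits[start:start + span]
        for h, mask in enumerate(masks):
            # read the selected bases as a base-4 number, fold into buckets
            val = 0
            for d in window[mask]:
                val = val * 4 + int(d)
            feats[h * n_buckets + val % n_buckets] += 1
    return feats
```

Because each function touches only `weight` positions, a mutation or sequencing error corrupts few of the hash values, which is what gives this style of feature its robustness.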
Exquisitor: Interactive Learning at Large
Increasing scale is a dominant trend in today's multimedia collections, which
especially impacts interactive applications. To facilitate interactive
exploration of large multimedia collections, new approaches are needed that are
capable of learning on the fly new analytic categories based on the visual and
textual content. To facilitate general use on standard desktops, laptops, and
mobile devices, they must furthermore work with limited computing resources. We
present Exquisitor, a highly scalable interactive learning approach, capable of
intelligent exploration of the large-scale YFCC100M image collection with
extremely efficient responses from the interactive classifier. Based on
relevance feedback from the user on previously suggested items, Exquisitor uses
semantic features, extracted from both visual and text attributes, to suggest
relevant media items to the user. Exquisitor builds upon the state of the art
in large-scale data representation, compression and indexing, introducing a
cluster-based retrieval mechanism that facilitates the efficient suggestions.
With Exquisitor, each interaction round over the full YFCC100M collection is
completed in less than 0.3 seconds using a single CPU core. That is 4x faster,
using 16x smaller computational resources, than the most efficient
state-of-the-art method, with a positive impact on result quality. These
results open up many interesting research avenues, both for exploration of
industry-scale media collections and for media exploration on mobile devices.
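One interaction round of the kind described above can be sketched with a Rocchio-style linear scorer; this is a toy stand-in that ignores Exquisitor's compression and cluster-based index, and all names are illustrative.

```python
import numpy as np

def suggest(features, pos_idx, neg_idx, k=5, exclude=()):
    """One relevance-feedback round: fit a linear scorer from items the
    user marked relevant / non-relevant, then rank the collection by the
    score. Toy illustration, not Exquisitor's actual classifier."""
    # Rocchio-style weights: mean of positives minus mean of negatives
    w = (features[list(pos_idx)].mean(axis=0)
         - features[list(neg_idx)].mean(axis=0))
    scores = features @ w
    seen = set(pos_idx) | set(neg_idx) | set(exclude)
    order = [i for i in np.argsort(-scores) if i not in seen]
    return order[:k]
```

At YFCC100M scale the ranking step would run over cluster representatives rather than all items, which is where the cluster-based retrieval mechanism comes in.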
Online Supervised Hashing for Ever-Growing Datasets
Supervised hashing methods are widely used for nearest neighbor search in
computer vision applications. Most state-of-the-art supervised hashing
approaches employ batch-learners. Unfortunately, batch-learning strategies can
be inefficient when confronted with large training datasets. Moreover, with
batch-learners, it is unclear how to adapt the hash functions as a dataset
continues to grow and diversify over time. Yet, in many practical scenarios the
dataset grows and diversifies; thus, both the hash functions and the indexing
must swiftly accommodate these changes. To address these issues, we propose an
online hashing method that is amenable to changes and expansions of the
datasets. Since it is an online algorithm, our approach offers linear
complexity with the dataset size. Our solution is supervised, in that we
incorporate available label information to preserve the semantic neighborhood.
Such an adaptive hashing method is attractive; but it requires recomputing the
hash table as the hash functions are updated. If the frequency of update is
high, then recomputing the hash table entries may cause inefficiencies in the
system, especially for large indexes. Thus, we also propose a framework to
reduce hash table updates. We compare our method to state-of-the-art solutions
on two benchmarks and demonstrate significant improvements over previous work.
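The tension the abstract describes, between updating hash functions online and the cost of recomputing the hash table, can be sketched as follows. The update rule, loss, and rebuild threshold below are all assumptions for illustration, not the paper's method.

```python
import numpy as np

class OnlineHasher:
    """Toy online supervised hashing: hyperplanes W are updated per
    labelled pair by SGD, and the table of stored codes is rebuilt only
    when codes drift enough -- a stand-in for an update-reduction
    framework. Hypothetical loss, learning rate, and threshold."""

    def __init__(self, dim, n_bits=16, lr=0.1, rebuild_thresh=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(dim, n_bits))
        self.lr = lr
        self.thresh = rebuild_thresh
        self.data = []  # stored feature vectors
        self.codes = np.empty((0, n_bits), dtype=np.uint8)

    def code(self, X):
        return (np.atleast_2d(X) @ self.W > 0).astype(np.uint8)

    def add(self, x):
        self.data.append(np.asarray(x, float))
        self.codes = np.vstack([self.codes, self.code(x)])

    def learn_pair(self, x1, x2, similar):
        # pull projections of similar pairs together, push dissimilar apart
        sign = 1.0 if similar else -1.0
        diff = np.asarray(x1) - np.asarray(x2)
        self.W -= self.lr * sign * np.outer(diff, diff @ self.W)
        self._maybe_rebuild()

    def _maybe_rebuild(self):
        if not self.data:
            return
        fresh = self.code(np.vstack(self.data))
        if np.mean(fresh != self.codes) > self.thresh:
            self.codes = fresh  # recompute the whole table only now
```

Deferring the rebuild until a drift threshold is crossed is one simple way to keep indexing cost linear in the dataset size even under frequent function updates.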
Indexing of CNN Features for Large Scale Image Search
Convolutional neural network (CNN) features give a good description of image
content and usually represent each image with a single global vector. Although
compact compared to local descriptors, they still cannot efficiently handle
large-scale image retrieval due to the linear cost of exhaustive comparison
and storage. To address this issue, we build a simple
but effective indexing framework based on inverted table, which significantly
decreases both the search time and memory usage. In addition, several
strategies are fully investigated under an indexing framework to adapt it to
CNN features and compensate for quantization errors. First, we use multiple
assignment for the query and database images to increase the probability of
relevant images' co-existing in the same Voronoi cells obtained via the
clustering algorithm. Then, we introduce embedding codes to further improve
precision by removing false matches during a search. We demonstrate that by
using hashing schemes to calculate the embedding codes and by changing the
ranking rule, indexing framework speeds can be greatly improved. Extensive
experiments conducted on several unsupervised and supervised benchmarks support
these results and the superiority of the proposed indexing framework. We also
provide a fair comparison among the popular CNN features.
Comment: 21 pages, 9 figures, submitted to Multimedia Tools and Applications
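The multiple-assignment strategy described above can be sketched with a toy inverted table over Voronoi cells; the function names and parameters are illustrative, not the paper's implementation. Probing several nearest cells for the query raises the chance that a relevant image shares a cell with it, at the cost of a longer candidate list.

```python
import numpy as np

def build_inverted_index(centroids, db):
    """Toy inverted table: each database vector is posted under its
    nearest centroid (its Voronoi cell)."""
    index = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(db):
        cell = int(np.argmin(np.linalg.norm(centroids - v, axis=1)))
        index[cell].append(i)
    return index

def query_multi_assign(centroids, index, q, n_probe=3):
    """Multiple assignment at query time: probe the n_probe nearest cells
    instead of just one, to compensate for quantization errors."""
    dists = np.linalg.norm(centroids - q, axis=1)
    cells = np.argsort(dists)[:n_probe]
    return [i for c in cells for i in index[int(c)]]
```

Embedding codes, as described in the abstract, would then be compared within this candidate list to filter false matches before ranking.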
Recent Advance in Content-based Image Retrieval: A Literature Survey
The explosive increase and ubiquitous accessibility of visual data on the Web
have led to the prosperity of research activity in image search or retrieval.
Because they ignore visual content as a ranking cue, text-based search
techniques applied to visual retrieval may suffer from inconsistency between
the query text and the visual content. Content-based image retrieval (CBIR), which
makes use of the representation of visual content to identify relevant images,
has attracted sustained attention over the past two decades. Such a problem is
challenging due to the intention gap and the semantic gap problems. Numerous
techniques have been developed for content-based image retrieval in the last
decade. The purpose of this paper is to categorize and evaluate those
algorithms proposed during the period of 2003 to 2016. We conclude with several
promising directions for future research.
Bloom Filters and Compact Hash Codes for Efficient and Distributed Image Retrieval
This paper presents a novel method for efficient image retrieval, based on a
simple and effective hashing of CNN features and the use of an indexing
structure based on Bloom filters. These filters are used as gatekeepers for the
database of image features, making it possible to skip a query altogether when
the query features are not stored in the database, thus speeding up the query
process without affecting retrieval performance. Thanks to its limited memory
requirements, the system is suitable for mobile applications and distributed
databases, with each filter associated with a portion of the database.
Experimental validation has been performed on three standard image retrieval
datasets, outperforming state-of-the-art hashing methods in terms of precision,
while the proposed indexing method obtains a speedup.
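The gatekeeper role of the Bloom filter can be shown with a minimal implementation; sizes and the double-hashing scheme below are illustrative assumptions. The key property is the absence of false negatives: if the filter says a code is absent, the lookup against that portion of the database can be skipped safely.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter used as a gatekeeper for a database shard:
    a negative answer is always correct, so the shard query can be
    skipped; positives may rarely be false. Illustrative sizes."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)

    def _positions(self, item):
        # derive n_hashes positions from salted SHA-256 digests
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        return all(self.bits[p] for p in self._positions(item))
```

In a distributed setting, one such filter per shard lets a client route a hashed feature query only to the shards that might hold it.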
Random Binary Trees for Approximate Nearest Neighbour Search in Binary Space
Approximate nearest neighbour (ANN) search is one of the most important
problems in computer science fields such as data mining or computer vision. In
this paper, we focus on ANN for high-dimensional binary vectors and we propose
a simple yet powerful search method that uses Random Binary Search Trees
(RBST). We apply our method to a dataset of 1.25M binary local feature
descriptors obtained from a real-life image-based localisation system provided
by Google as a part of Project Tango. An extensive evaluation of our method
against the state-of-the-art variations of Locality Sensitive Hashing (LSH),
namely Uniform LSH and Multi-probe LSH, shows the superiority of our method in
terms of retrieval precision, with a performance boost of over 20%.
Comment: The final publication is available at Springer via
https://doi.org/10.1007/978-3-319-69900-4_6
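A random binary search tree of the general kind described can be sketched as follows; the split rule, leaf size, and names are illustrative assumptions, not the paper's exact construction. Each internal node splits on a randomly chosen bit, a forest of such trees is queried, and the pooled candidates would be re-ranked by Hamming distance.

```python
import random

def build_rbst(items, codes, max_leaf=8, rng=None):
    """Build one random binary tree over binary codes: every internal
    node splits the items on a randomly chosen bit position.
    Illustrative sketch of the RBST idea."""
    rng = rng or random.Random(0)
    if len(items) <= max_leaf:
        return ("leaf", items)
    bit = rng.randrange(len(codes[0]))
    left = [i for i in items if codes[i][bit] == 0]
    right = [i for i in items if codes[i][bit] == 1]
    if not left or not right:  # degenerate split: stop here
        return ("leaf", items)
    return ("node", bit,
            build_rbst(left, codes, max_leaf, rng),
            build_rbst(right, codes, max_leaf, rng))

def tree_candidates(tree, code):
    """Descend by the query's own bits and return the leaf's items."""
    while tree[0] == "node":
        _, bit, left, right = tree
        tree = right if code[bit] else left
    return tree[1]
```

Because each tree only inspects a handful of bits per query, several independent trees can be searched and merged far faster than an exhaustive Hamming scan.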
A Survey on Learning to Hash
Nearest neighbor search is a problem of finding the data points from the
database such that the distances from them to the query point are the smallest.
Learning to hash is one of the major solutions to this problem and has been
widely studied recently. In this paper, we present a comprehensive survey of
the learning to hash algorithms, categorize them according to the manners of
preserving the similarities into: pairwise similarity preserving, multiwise
similarity preserving, implicit similarity preserving, as well as quantization,
and discuss their relations. We separate quantization from pairwise similarity
preserving because its objective function is very different, although
quantization, as we show, can be derived from preserving the pairwise
similarities. In addition, we present the evaluation protocols and a general
performance analysis, and point out that the quantization algorithms perform
best in terms of search accuracy, search time cost, and space cost. Finally,
we introduce a few
emerging topics.
Comment: To appear in IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI)
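The quantization family the survey singles out can be illustrated with a minimal product quantization sketch; this is a generic toy, not any specific algorithm from the survey, and all parameters are illustrative. Each vector is split into sub-vectors, each subspace is k-means clustered, and only centroid ids are stored, which is what gives quantization its strong accuracy-per-byte trade-off.

```python
import numpy as np

def pq_encode(X, n_sub=4, n_cent=16, iters=10, seed=0):
    """Toy product quantization: k-means each of n_sub subspaces and
    store one centroid id per subspace. Illustrative parameters."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    sub = d // n_sub
    codebooks, codes = [], np.zeros((n, n_sub), dtype=np.int32)
    for s in range(n_sub):
        Xs = X[:, s * sub:(s + 1) * sub]
        C = Xs[rng.choice(n, n_cent, replace=False)]  # init from data
        for _ in range(iters):                        # plain Lloyd steps
            d2 = ((Xs[:, None, :] - C[None, :, :]) ** 2).sum(-1)
            assign = d2.argmin(1)
            for c in range(n_cent):
                if (assign == c).any():
                    C[c] = Xs[assign == c].mean(0)
        codes[:, s] = assign
        codebooks.append(C)
    return codebooks, codes

def pq_decode(codebooks, codes):
    """Reconstruct approximate vectors from stored centroid ids."""
    return np.hstack([cb[codes[:, s]] for s, cb in enumerate(codebooks)])
```

With n_sub sub-quantizers of n_cent centroids each, a vector costs only n_sub * log2(n_cent) bits, while distances can still be estimated against the reconstructions.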