2,407 research outputs found
HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces
Nearest neighbor searching of large databases in high-dimensional spaces is
inherently difficult due to the curse of dimensionality. A flavor of
approximation is, therefore, necessary to practically solve the problem of
nearest neighbor search. In this paper, we propose a novel yet simple indexing
scheme, HD-Index, to solve the problem of approximate k-nearest neighbor
queries in massive high-dimensional databases. HD-Index consists of a set of
novel hierarchical structures called RDB-trees built on Hilbert keys of
database objects. The leaves of the RDB-trees store distances of database
objects to reference objects, thereby allowing efficient pruning using distance
filters. In addition to triangular inequality, we also use Ptolemaic inequality
to produce better lower bounds. Experiments on massive (up to billion scale)
high-dimensional (up to 1000+) datasets show that HD-Index is effective,
efficient, and scalable.Comment: PVLDB 11(8):906-919, 201
Online Product Quantization
Approximate nearest neighbor (ANN) search has achieved great success in many
tasks. However, existing popular methods for ANN search, such as hashing and
quantization methods, are designed for static databases only. They cannot
handle well the database with data distribution evolving dynamically, due to
the high computational effort for retraining the model based on the new
database. In this paper, we address the problem by developing an online product
quantization (online PQ) model and incrementally updating the quantization
codebook that accommodates to the incoming streaming data. Moreover, to further
alleviate the issue of large scale computation for the online PQ update, we
design two budget constraints for the model to update partial PQ codebook
instead of all. We derive a loss bound which guarantees the performance of our
online PQ model. Furthermore, we develop an online PQ model over a sliding
window with both data insertion and deletion supported, to reflect the
real-time behaviour of the data. The experiments demonstrate that our online PQ
model is both time-efficient and effective for ANN search in dynamic large
scale databases compared with baseline methods and the idea of partial PQ
codebook update further reduces the update cost.Comment: To appear in IEEE Transactions on Knowledge and Data Engineering
(DOI: 10.1109/TKDE.2018.2817526
Scalable Image Retrieval by Sparse Product Quantization
Fast Approximate Nearest Neighbor (ANN) search technique for high-dimensional
feature indexing and retrieval is the crux of large-scale image retrieval. A
recent promising technique is Product Quantization, which attempts to index
high-dimensional image features by decomposing the feature space into a
Cartesian product of low dimensional subspaces and quantizing each of them
separately. Despite the promising results reported, their quantization approach
follows the typical hard assignment of traditional quantization methods, which
may result in large quantization errors and thus inferior search performance.
Unlike the existing approaches, in this paper, we propose a novel approach
called Sparse Product Quantization (SPQ) to encoding the high-dimensional
feature vectors into sparse representation. We optimize the sparse
representations of the feature vectors by minimizing their quantization errors,
making the resulting representation is essentially close to the original data
in practice. Experiments show that the proposed SPQ technique is not only able
to compress data, but also an effective encoding technique. We obtain
state-of-the-art results for ANN search on four public image datasets and the
promising results of content-based image retrieval further validate the
efficacy of our proposed method.Comment: 12 page
- …