Competitive Quantization for Approximate Nearest Neighbor Search
In this study, we propose a novel vector quantization algorithm for Approximate Nearest Neighbor (ANN) search based on a joint competitive learning strategy, hence called competitive quantization (CompQ). CompQ is a hierarchical algorithm which iteratively minimizes the quantization error by jointly optimizing the codebooks in each layer using a gradient descent approach. An extensive set of experimental results and comparative evaluations shows that CompQ outperforms the state-of-the-art while retaining a comparable computational complexity.
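The layered, jointly optimized codebooks described above can be illustrated with a small sketch. This is a generic residual-quantization step with a gradient-descent nudge on every codeword, not the authors' exact CompQ procedure; the function name, codebook sizes, and learning rate are all illustrative.

```python
import numpy as np

def residual_vq_gradient_step(X, codebooks, lr=0.1):
    """One joint gradient-descent step over layered codebooks.

    X: (n, d) data; codebooks: list of (k, d) arrays, one per layer.
    Illustrative only: assignments are recomputed greedily layer by
    layer, then every codeword is nudged toward the mean of the final
    residuals of the points it encodes. Returns the total squared
    quantization error measured before the update.
    """
    residual = X.copy()
    assignments = []
    for C in codebooks:
        # nearest codeword per point in this layer
        d2 = ((residual[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(1)
        assignments.append(idx)
        residual = residual - C[idx]          # pass residual to the next layer
    # the final residual is the (negative) gradient direction shared by
    # every layer's selected codeword, so nudge each one toward it
    for C, idx in zip(codebooks, assignments):
        for j in range(C.shape[0]):
            mask = idx == j
            if mask.any():
                C[j] += lr * residual[mask].mean(0)
    return float((residual ** 2).sum())
```

Calling the step repeatedly drives the joint quantization error down, which is the behavior the abstract's iterative minimization refers to.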
Spreading vectors for similarity search
Discretizing multi-dimensional data distributions is a fundamental step of
modern indexing methods. State-of-the-art techniques learn parameters of
quantizers on training data for optimal performance, thus adapting quantizers
to the data. In this work, we propose to reverse this paradigm and adapt the
data to the quantizer: we train a neural net whose last layer forms a fixed
parameter-free quantizer, such as pre-defined points of a hyper-sphere. As a
proxy objective, we design and train a neural network that favors uniformity in
the spherical latent space, while preserving the neighborhood structure after
the mapping. We propose a new regularizer derived from the Kozachenko--Leonenko
differential entropy estimator to enforce uniformity and combine it with a
locality-aware triplet loss. Experiments show that our end-to-end approach
outperforms most learned quantization methods, and is competitive with the
state of the art on widely adopted benchmarks. Furthermore, we show that
training without the quantization step results in almost no difference in
accuracy, but yields a generic catalyzer that can be applied with any
subsequent quantizer. Comment: Published at ICLR 201
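The uniformity term mentioned above can be sketched in a few lines. The following is a numpy illustration of a Kozachenko--Leonenko-style penalty on nearest-neighbor distances, not the exact regularizer or training code from the paper; the function name and epsilon are illustrative.

```python
import numpy as np

def koleo_regularizer(Z, eps=1e-8):
    """Kozachenko--Leonenko-style uniformity penalty (numpy sketch).

    Z: (n, d) points, assumed L2-normalized onto the sphere.
    The differential-entropy estimator grows with the log of each
    point's nearest-neighbor distance, so minimizing the negative
    mean log-distance pushes points apart, i.e. toward uniformity.
    """
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude self-distance
    nn = np.sqrt(d2.min(1))               # nearest-neighbor distance per point
    return float(-np.log(nn + eps).mean())
```

Tightly clumped points incur a large penalty and well-spread points a small one, which is why adding this term to a locality-aware loss favors uniform occupation of the sphere.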
Vector Quantization Techniques for Approximate Nearest Neighbor Search on Large-Scale Datasets
The technological developments of the last twenty years are leading the world into a new era. The invention of the internet, mobile phones and smart devices is resulting in an exponential increase in data. As the data grows every day, finding similar patterns or matching samples to a query is no longer a simple task because of its computational costs and storage limitations. Special signal processing techniques are required in order to handle the growth in data, as simply adding more and more computers cannot keep up.

Nearest neighbor search, also known as similarity search, proximity search or near item search, is the problem of finding the item that is nearest or most similar to a query according to a distance or similarity measure. When the reference set is very large, or the distance or similarity calculation is complex, performing the nearest neighbor search can be computationally demanding. Considering today's ever-growing datasets, where the cardinality of samples also keeps increasing, a growing interest in approximate methods has emerged in the research community.

Vector Quantization for Approximate Nearest Neighbor Search (VQ for ANN) has proven to be one of the most efficient and successful methods targeting the aforementioned problem. It proposes to compress vectors into binary strings and to approximate the distances between vectors using look-up tables. With this approach, the approximation of distances is very fast, while the storage space required for the dataset is minimized thanks to the extreme compression levels. The distance approximation performance of VQ for ANN has been shown to be sufficient for retrieval and classification tasks, demonstrating that VQ for ANN techniques can be a good replacement for exact distance calculation methods.

This thesis contributes to the VQ for ANN literature by proposing five advanced techniques, which aim to provide fast and efficient approximate nearest neighbor search on very large-scale datasets.
The proposed methods can be divided into two groups. The first group consists of two techniques which introduce subspace clustering to VQ for ANN; these methods are shown to give state-of-the-art performance in tests on prevalent large-scale benchmarks. The second group consists of three methods which improve on residual vector quantization; these methods are also shown to outperform their predecessors. Apart from these, a sixth contribution of this thesis is a demonstration of VQ for ANN in an image classification application on large-scale datasets. It is shown that a k-NN classifier based on VQ for ANN performs on par with exact k-NN classifiers while requiring much less storage space and computation.
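The look-up-table distance approximation that underlies VQ for ANN can be illustrated with a small product-quantization-style sketch. This is a generic illustration of asymmetric distance computation, not one of the thesis's proposed techniques; the function names and the tiny codebook sizes are arbitrary.

```python
import numpy as np

def pq_encode(X, codebooks):
    """Compress vectors to one codeword index per subspace."""
    m = len(codebooks)
    subs = np.split(X, m, axis=1)
    return np.stack(
        [((s[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
         for s, C in zip(subs, codebooks)], axis=1)   # (n, m) integer codes

def pq_adc(query, codes, codebooks):
    """Asymmetric distances via per-subspace look-up tables.

    One small table of query-to-codeword distances is built per
    subspace; each database vector's distance is then just a sum of
    table entries selected by its stored codes, with no decompression.
    """
    qs = np.split(query, len(codebooks))
    tables = [((C - q) ** 2).sum(1) for q, C in zip(qs, codebooks)]
    return sum(t[codes[:, j]] for j, t in enumerate(tables))
```

When the codebooks happen to reproduce the data exactly, the table-based distances coincide with the exact squared distances; in general they are a close approximation at a fraction of the cost.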
Scalar Quantization as Sparse Least Square Optimization
Quantization can be used to form new vectors/matrices with shared values
close to the original. In recent years, the popularity of scalar quantization
for value-sharing applications has been soaring, as it has found wide utility
in reducing the complexity of neural networks. Existing clustering-based
quantization techniques, while well-developed, have multiple drawbacks,
including dependency on the random seed, empty or
out-of-range clusters, and high time complexity for a large number of
clusters. To overcome these problems, in this paper, the problem of scalar
quantization is examined from a new perspective, namely sparse least square
optimization. Specifically, inspired by the property of sparse least square
regression, several quantization algorithms based on least square are
proposed. In addition, similar schemes with and
regularization are proposed. Furthermore, to compute quantization results with
a given number of values/clusters, the paper designs an iterative method and
a clustering-based method, both of which are built on sparse least squares.
The paper shows that the latter method is mathematically equivalent to an
improved version of the k-means clustering-based quantization algorithm,
although the two algorithms originated from different intuitions. The proposed
algorithms were tested on three types of data, and their computational
performance, including information loss, time consumption, and the
distribution of the values of the sparse vectors, was compared and analyzed.
The paper offers a new perspective from which to probe the area of
quantization, and the proposed algorithms can outperform existing methods,
especially in bit-width reduction scenarios where the required
post-quantization resolution (number of values) is not significantly lower
than the original number.
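The clustering-based baseline the abstract contrasts with can be sketched as one-dimensional k-means (Lloyd) quantization. This illustrates the conventional approach, not the paper's sparse least-square formulation; the quantile initialization is an illustrative choice to dodge the empty-cluster issue mentioned above.

```python
import numpy as np

def scalar_quantize_kmeans(x, k, iters=20):
    """Clustering-based scalar quantization: 1-D Lloyd iterations.

    Returns (levels, quantized), where every value of x is replaced by
    the nearest of k shared levels. This is the k-means-style baseline,
    not the paper's least-square method.
    """
    # initialize levels on evenly spaced quantiles of the data
    levels = np.quantile(x, np.linspace(0, 1, k))
    for _ in range(iters):
        # assign each value to its nearest level, then re-center levels
        idx = np.abs(x[:, None] - levels[None, :]).argmin(1)
        for j in range(k):
            if (idx == j).any():
                levels[j] = x[idx == j].mean()
    return levels, levels[idx]
```

On well-separated data the levels converge to the per-cluster means, and every quantized value is one of the k shared levels.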
Scalable Nearest Neighbor Search with Compact Codes
An important characteristic of the recent decade is the dramatic growth in the use and generation of data. From collections of images, documents and videos, to genetic data, and to network traffic statistics, modern technologies and cheap storage have made it possible to accumulate huge datasets. But how can we effectively
use all this data? The growing sizes of modern datasets make it crucial to develop new algorithms and tools capable of sifting through this data efficiently.

A central computational primitive for analyzing large datasets is the Nearest Neighbor Search problem, in which the goal is to preprocess a set of objects so that later, given a query object, one can find the data object closest to the query. In most situations involving high-dimensional objects, the exhaustive search, which compares the query with every item in the dataset, has a prohibitive cost in both runtime and memory.

This thesis focuses on the design of algorithms and tools for fast and cost-efficient nearest neighbor search. The proposed techniques advocate the use of compressed, discrete codes for representing the neighborhood structure of data in a compact way. Transforming high-dimensional items, such as raw images, into similarity-preserving compact codes has both computational and storage advantages: compact codes can be stored efficiently using only a few bits per data item, and, more importantly, they can be compared extremely fast using bit-wise or look-up-table operators.

Motivated by this view, the present work explores two main research directions: 1) finding mappings that better preserve the given notion of similarity while keeping the codes as compressed as possible, and 2) building efficient data structures that support non-exhaustive search among the compact codes. Our large-scale experimental results, reported on various benchmarks including datasets of up to one billion items, show a boost in retrieval performance in comparison to the state-of-the-art.
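The bit-wise comparison advantage of compact codes can be seen in a tiny sketch: with binary codes stored as plain integers, a Hamming distance is an XOR followed by a population count. This is an illustrative helper, not the thesis's indexing structures.

```python
def hamming_search(query_code, codes):
    """Rank stored binary codes by Hamming distance to a query code.

    Codes are plain Python ints (e.g. 64-bit signatures): XOR exposes
    the differing bits and the popcount tallies them, so each
    comparison costs only a handful of machine operations.
    """
    dists = [bin(query_code ^ c).count("1") for c in codes]
    return sorted(range(len(codes)), key=dists.__getitem__)
```

For example, against the stored codes `[0b1111, 0b0000, 0b1110]` a query of `0b1111` ranks the identical code first, the one-bit-off code second, and the all-zeros code last.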