Searching in one billion vectors: re-rank with source coding
Recent indexing techniques inspired by source coding have been shown to successfully index billions of high-dimensional vectors in memory. In this paper, we propose an approach that re-ranks the neighbor hypotheses obtained by these compressed-domain indexing methods. In contrast to the usual post-verification scheme, which performs exact distance calculation on the short-list of hypotheses, the estimated distances are refined based on short quantization codes, to avoid reading the full vectors from disk. We have released a new public dataset of one billion 128-dimensional vectors and proposed an experimental setup to evaluate high-dimensional indexing algorithms on a realistic scale. Experiments show that our method accurately and efficiently re-ranks the neighbor hypotheses using little memory compared to the full vector representation.
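The re-ranking idea in this abstract can be illustrated with a minimal sketch, under stated assumptions. Each vector is stored as two short codes: a coarse index and an index into a second codebook that quantizes the residual. A short-list is ranked with the coarse estimate, then re-ranked with the refined estimate, without touching the full vectors. The codebooks here are sampled from the data rather than trained, and every function name (`make_codebooks`, `encode`, `decode`, `search`) is hypothetical, not taken from the paper.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(codebook, v):
    """Index of the codebook entry closest to v."""
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], v))

def make_codebooks(db, n_coarse=8, n_refine=16, seed=0):
    """Assumed stand-in for codebook training: sample entries from the data."""
    rng = random.Random(seed)
    coarse_cb = rng.sample(db, n_coarse)
    residuals = [[a - b for a, b in zip(y, coarse_cb[nearest(coarse_cb, y)])]
                 for y in db]
    # Keeping the zero vector in the refinement codebook guarantees that the
    # refined reconstruction is never worse than the coarse one.
    refine_cb = [[0.0] * len(db[0])] + rng.sample(residuals, n_refine - 1)
    return coarse_cb, refine_cb

def encode(db, coarse_cb, refine_cb):
    """Store two short codes per vector: coarse index and residual index."""
    codes = []
    for y in db:
        c = nearest(coarse_cb, y)
        residual = [a - b for a, b in zip(y, coarse_cb[c])]
        codes.append((c, nearest(refine_cb, residual)))
    return codes

def decode(code, coarse_cb, refine_cb, refine=True):
    """Reconstruct a vector from its codes, with or without the refinement."""
    c, r = code
    if not refine:
        return list(coarse_cb[c])
    return [a + b for a, b in zip(coarse_cb[c], refine_cb[r])]

def search(query, codes, coarse_cb, refine_cb, shortlist=10, k=3):
    """Rank by coarse estimates, then re-rank the short-list with refined ones."""
    def estimate(i, refine):
        return sq_dist(query, decode(codes[i], coarse_cb, refine_cb, refine))
    hypotheses = sorted(range(len(codes)),
                        key=lambda i: estimate(i, False))[:shortlist]
    return sorted(hypotheses, key=lambda i: estimate(i, True))[:k]
```

The second stage reads only the stored codes, which is the point the abstract makes against post-verification on full vectors.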
Fast, Compact and Highly Scalable Visual Place Recognition through Sequence-based Matching of Overloaded Representations
Visual place recognition algorithms trade off three key characteristics:
their storage footprint, their computational requirements, and their resultant
performance, often expressed in terms of recall rate. Significant prior work
has investigated highly compact place representations, sub-linear computational
scaling and sub-linear storage scaling techniques, but has always involved a
significant compromise in one or more of these regards, and has only been
demonstrated on relatively small datasets. In this paper we present a novel
place recognition system which enables for the first time the combination of
ultra-compact place representations, near sub-linear storage scaling and
extremely lightweight compute requirements. Our approach exploits the
inherently sequential nature of much spatial data in the robotics domain and
inverts the typical target criteria, through intentionally coarse scalar
quantization-based hashing that leads to more collisions but is resolved by
sequence-based matching. For the first time, we show how effective place
recognition rates can be achieved on a new very large 10 million place dataset,
requiring only 8 bytes of storage per place and 37K unitary operations to
achieve over 50% recall for matching a sequence of 100 frames, where a
conventional state-of-the-art approach both consumes 1300 times more compute
and fails catastrophically. We present analysis investigating the effectiveness
of our hashing overload approach under varying sizes of quantized vector
length, comparison of near miss matches with the actual match selections and
characterise the effect of variance re-scaling of data on quantization.
Comment: 8 pages, 4 figures. Accepted for oral presentation at the 2020 IEEE
International Conference on Robotics and Automation.
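A minimal sketch of the overloaded-hashing idea described above, under stated assumptions: descriptors are quantized very coarsely, so many places collide in the same bucket, and the ambiguity is resolved by letting every frame of a query sequence vote for the database start position it is consistent with. Descriptor values are assumed to lie in [0, 1). The function names and the voting scheme are illustrative and are not claimed to match the paper's system.

```python
from collections import Counter, defaultdict

def hash_key(descriptor, levels=2):
    """Coarse scalar quantization: each value in [0, 1) falls into one of `levels` bins."""
    return tuple(min(max(int(v * levels), 0), levels - 1) for v in descriptor)

def build_index(places, levels=2):
    """Map each hash key to the list of place indices that collide on it."""
    index = defaultdict(list)
    for i, descriptor in enumerate(places):
        index[hash_key(descriptor, levels)].append(i)
    return index

def match_sequence(index, frames, levels=2):
    """Return (start, votes) for the database position best aligned with the query.

    Frame t matching place i is evidence that the sequence starts at i - t,
    so consistent matches accumulate on a single offset.
    """
    votes = Counter()
    for t, frame in enumerate(frames):
        for i in index.get(hash_key(frame, levels), ()):
            votes[i - t] += 1
    if not votes:
        return None, 0
    return votes.most_common(1)[0]
```

A single frame is ambiguous because its bucket holds many places, but a wrong start position rarely stays consistent across a whole sequence.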
Link and code: Fast indexing with graphs and compact regression codes
Similarity search approaches based on graph walks have recently attained
outstanding speed-accuracy trade-offs, setting aside the memory requirements. In
this paper, we revisit these approaches by considering, additionally, the
memory constraint required to index billions of images on a single server. This
leads us to propose a method based both on graph traversal and compact
representations. We encode the indexed vectors using quantization and exploit
the graph structure to refine the similarity estimation.
In essence, our method takes the best of these two worlds: the search
strategy is based on nested graphs, thereby providing high precision with a
relatively small set of comparisons. At the same time it offers a significant
memory compression. As a result, our approach outperforms the state of the art
on operating points considering 64-128 bytes per vector, as demonstrated by our
results on two billion-scale public benchmarks.
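The combination of graph traversal and compact representations can be sketched as follows, under simplifying assumptions. The abstract describes nested graphs and a quantizer that exploits the graph structure. This sketch instead uses a single flat k-nearest-neighbor graph and plain scalar rounding as a stand-in code, and it only shows the core loop: walk the graph greedily while comparing the query against compressed vectors. All names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(v, step=0.25):
    """Stand-in compact code: round every coordinate to a grid of width `step`."""
    return [round(x / step) * step for x in v]

def knn_graph(vectors, k=5):
    """Brute-force k-nearest-neighbor graph (adequate only for small examples)."""
    graph = []
    for i, v in enumerate(vectors):
        others = sorted((j for j in range(len(vectors)) if j != i),
                        key=lambda j: sq_dist(v, vectors[j]))
        graph.append(others[:k])
    return graph

def greedy_search(query, graph, codes, entry=0):
    """Walk to the best neighbor until no neighbor improves the estimated distance."""
    current = entry
    best = sq_dist(query, codes[current])
    while True:
        candidate = min(graph[current], key=lambda j: sq_dist(query, codes[j]))
        d = sq_dist(query, codes[candidate])
        if d >= best:
            return current, best
        current, best = candidate, d
```

The walk terminates because the estimated distance strictly decreases at every step. It returns a local minimum of the estimated distance, which is not necessarily the true nearest neighbor.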
Fast Data Analytics by Learning
Today, we collect a large amount of data, and the volume of data we collect is projected to grow faster than computational power. This rapid growth of data inevitably increases query latencies, and horizontal scaling alone is not sufficient for real-time analytics of big data. Approximate query processing (AQP) speeds up data analytics at the cost of small quality losses in query answers. AQP produces query answers based on synopses of the original data. The synopses are smaller than the original data, so AQP requires less computation and can therefore produce answers more quickly. In AQP, there is a general tradeoff between query latencies and the quality of query answers; obtaining higher-quality answers requires longer query latencies.
In this dissertation, we show that we can speed up approximate query processing without reducing the quality of the query answers by optimizing the synopses using two approaches:
1. Exploiting past computations: We exploit the answers to past queries. This approach relies on the fact that, if two aggregations involve common or correlated values, the aggregated results must also be correlated. We formally capture this idea using a probabilistic distribution function, which is then used to refine the answers to new queries.
2. Building task-aware synopses: By optimizing synopses for a few common types of data analytics, we can produce higher-quality answers (or answers of a given target quality more quickly) for those tasks. We use this approach for constructing synopses optimized for searching and visualizations.
For exploiting past computations and building task-aware synopses, our work incorporates statistical inference and optimization techniques. The contributions in this dissertation resulted in up to 20x speedups for real-world data analytics workloads.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/138598/1/pyongjoo_1.pd
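To make the latency and quality tradeoff of AQP concrete, here is a minimal sketch assuming the simplest kind of synopsis, a uniform random sample. It estimates a filtered SUM and reports an approximate 95% error bound from the normal approximation, ignoring the finite-population correction. This is a generic illustration, not the optimized synopses the dissertation proposes, and the names `build_synopsis` and `approx_sum` are hypothetical.

```python
import math
import random

def build_synopsis(values, sample_size, seed=0):
    """Uniform random sample of the data, kept with the original row count."""
    rng = random.Random(seed)
    return rng.sample(values, sample_size), len(values)

def approx_sum(synopsis, predicate=lambda v: True):
    """Estimate SUM(v) over rows satisfying `predicate`, with a ~95% half-width."""
    sample, n = synopsis
    m = len(sample)
    # Rows failing the predicate contribute zero to the sum.
    contrib = [v if predicate(v) else 0.0 for v in sample]
    mean = sum(contrib) / m
    variance = sum((c - mean) ** 2 for c in contrib) / (m - 1)
    estimate = n * mean
    half_width = 1.96 * n * math.sqrt(variance / m)
    return estimate, half_width
```

A larger sample shrinks the half-width roughly in proportion to the inverse square root of the sample size, but takes longer to scan, which is the tradeoff the abstract describes.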
Searching with expectations
Handling large amounts of data, such as large image databases, requires the use of approximate nearest neighbor search techniques. Recently, Hamming embedding methods such as spectral hashing have addressed the problem of obtaining compact binary codes optimizing the trade-off between the memory usage and the probability of retrieving the true nearest neighbors. In this paper, we formulate the problem of generating compact signatures as a rate-distortion problem. In the spirit of source coding algorithms, we aim at minimizing the reconstruction error on the squared distances with a constraint on the memory usage. The vectors are ranked based on the distance estimates to the query vector. Experiments on image descriptors show a significant improvement over spectral hashing.