1,236 research outputs found
Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment
High-dimensional vector similarity search (HVSS) is gaining prominence as a
powerful tool for various data science and AI applications. As vector data
scales up, in-memory indexes pose a significant challenge due to the
substantial increase in main memory requirements. A potential solution involves
leveraging disk-based implementation, which stores and searches vector data on
high-performance devices like NVMe SSDs. However, implementing HVSS for data
segments proves to be intricate in vector databases where a single machine
comprises multiple segments for system scalability. In this context, each
segment operates with limited memory and disk space, necessitating a delicate
balance between accuracy, efficiency, and space cost. Existing disk-based
methods fall short as they do not holistically address all these requirements
simultaneously. In this paper, we present Starling, an I/O-efficient
disk-resident graph index framework that optimizes data layout and search
strategy within the segment. It has two primary components: (1) a data layout
incorporating an in-memory navigation graph and a reordered disk-based graph
with enhanced locality, reducing the search path length and minimizing disk
bandwidth wastage; and (2) a block search strategy designed to minimize costly
disk I/O operations during vector query execution. Through extensive
experiments, we validate the effectiveness, efficiency, and scalability of
Starling. On a data segment with 2GB memory and 10GB disk capacity, Starling
can accommodate up to 33 million vectors in 128 dimensions, offering HVSS with
over 0.9 average precision and top-10 recall rate, and latency under 1
millisecond. The results showcase Starling's superior performance, exhibiting
43.9 higher throughput with 98% lower query latency compared to
state-of-the-art methods while maintaining the same level of accuracy.Comment: This paper has been accepted by SIGMOD 202
HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces
Nearest neighbor searching of large databases in high-dimensional spaces is
inherently difficult due to the curse of dimensionality. A flavor of
approximation is, therefore, necessary to practically solve the problem of
nearest neighbor search. In this paper, we propose a novel yet simple indexing
scheme, HD-Index, to solve the problem of approximate k-nearest neighbor
queries in massive high-dimensional databases. HD-Index consists of a set of
novel hierarchical structures called RDB-trees built on Hilbert keys of
database objects. The leaves of the RDB-trees store distances of database
objects to reference objects, thereby allowing efficient pruning using distance
filters. In addition to triangular inequality, we also use Ptolemaic inequality
to produce better lower bounds. Experiments on massive (up to billion scale)
high-dimensional (up to 1000+) datasets show that HD-Index is effective,
efficient, and scalable.Comment: PVLDB 11(8):906-919, 201
DiskANN++: Efficient Page-based Search over Isomorphic Mapped Graph Index using Query-sensitivity Entry Vertex
Given a vector dataset and a query vector ,
graph-based Approximate Nearest Neighbor Search (ANNS) aims to build a graph
index and approximately return vectors with minimum distances to
by searching over . The main drawback of graph-based ANNS is
that a graph index would be too large to fit into the memory especially for a
large-scale . To solve this, a Product Quantization (PQ)-based
hybrid method called DiskANN is proposed to store a low-dimensional PQ index in
memory and retain a graph index in SSD, thus reducing memory overhead while
ensuring a high search accuracy. However, it suffers from two I/O issues that
significantly affect the overall efficiency: (1) long routing path from an
entry vertex to the query's neighborhood that results in large number of I/O
requests and (2) redundant I/O requests during the routing process. We propose
an optimized DiskANN++ to overcome above issues. Specifically, for the first
issue, we present a query-sensitive entry vertex selection strategy to replace
DiskANN's static graph-central entry vertex by a dynamically determined entry
vertex that is close to the query. For the second I/O issue, we present an
isomorphic mapping on DiskANN's graph index to optimize the SSD layout and
propose an asynchronously optimized Pagesearch based on the optimized SSD
layout as an alternative to DiskANN's beamsearch. Comprehensive experimental
studies on eight real-world datasets demonstrate our DiskANN++'s superiority on
efficiency. We achieve a notable 1.5 X to 2.2 X improvement on QPS compared to
DiskANN, given the same accuracy constraint.Comment: 15 pages including reference
Survey of Vector Database Management Systems
There are now over 20 commercial vector database management systems (VDBMSs),
all produced within the past five years. But embedding-based retrieval has been
studied for over ten years, and similarity search a staggering half century and
more. Driving this shift from algorithms to systems are new data intensive
applications, notably large language models, that demand vast stores of
unstructured data coupled with reliable, secure, fast, and scalable query
processing capability. A variety of new data management techniques now exist
for addressing these needs, however there is no comprehensive survey to
thoroughly review these techniques and systems. We start by identifying five
main obstacles to vector data management, namely vagueness of semantic
similarity, large size of vectors, high cost of similarity comparison, lack of
natural partitioning that can be used for indexing, and difficulty of
efficiently answering hybrid queries that require both attributes and vectors.
Overcoming these obstacles has led to new approaches to query processing,
storage and indexing, and query optimization and execution. For query
processing, a variety of similarity scores and query types are now well
understood; for storage and indexing, techniques include vector compression,
namely quantization, and partitioning based on randomization, learning
partitioning, and navigable partitioning; for query optimization and execution,
we describe new operators for hybrid queries, as well as techniques for plan
enumeration, plan selection, and hardware accelerated execution. These
techniques lead to a variety of VDBMSs across a spectrum of design and runtime
characteristics, including native systems specialized for vectors and extended
systems that incorporate vector capabilities into existing systems. We then
discuss benchmarks, and finally we outline research challenges and point the
direction for future work.Comment: 25 page
On trip planning queries in spatial databases
In this paper we discuss a new type of query in Spatial Databases, called Trip Planning Query (TPQ). Given a set of points P in space, where each point belongs to a category, and given two points s and e, TPQ asks for the best trip that starts at s, passes through exactly one point from each category, and ends at e. An example of a TPQ is when a user wants to visit a set of different places and at the same time minimize the total travelling cost, e.g. what is the shortest travelling plan for me to visit an automobile shop, a CVS pharmacy outlet, and a Best Buy shop along my trip from A to B? The trip planning query is an extension of the well-known TSP problem and therefore is NP-hard. The difficulty of this query lies in the existence of multiple choices for each category. In this paper, we first study fast approximation algorithms for the trip planning query in a metric space, assuming that the data set fits in main memory, and give the theory analysis of their approximation bounds. Then, the trip planning query is examined for data sets that do not fit in main memory and must be stored on disk. For the disk-resident data, we consider two cases. In one case, we assume that the points are located in Euclidean space and indexed with an Rtree. In the other case, we consider the problem of points that lie on the edges of a spatial network (e.g. road network) and the distance between two points is defined using the shortest distance over the network. Finally, we give an experimental evaluation of the proposed algorithms using synthetic data sets generated on real road networks
On trip planning queries in spatial databases
In this paper we discuss a new type of query in Spatial Databases, called Trip Planning Query (TPQ). Given a set of points P in space, where each point belongs to a category, and given two points s and e, TPQ asks for the best trip that starts at s, passes through exactly one point from each category, and ends at e. An example of a TPQ is when a user wants to visit a set of different places and at the same time minimize the total travelling cost, e.g. what is the shortest travelling plan for me to visit an automobile shop, a CVS pharmacy outlet, and a Best Buy shop along my trip from A to B? The trip planning query is an extension of the well-known TSP problem and therefore is NP-hard. The difficulty of this query lies in the existence of multiple choices for each category. In this paper, we first study fast approximation algorithms for the trip planning query in a metric space, assuming that the data set fits in main memory, and give the theory analysis of their approximation bounds. Then, the trip planning query is examined for data sets that do not fit in main memory and must be stored on disk. For the disk-resident data, we consider two cases. In one case, we assume that the points are located in Euclidean space and indexed with an Rtree. In the other case, we consider the problem of points that lie on the edges of a spatial network (e.g. road network) and the distance between two points is defined using the shortest distance over the network. Finally, we give an experimental evaluation of the proposed algorithms using synthetic data sets generated on real road networks
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
Efficient Computation of K-Nearest Neighbor Graphs for Large High-Dimensional Data Sets on GPU Clusters
The k-Nearest Neighbor Graph (k-NNG) and the related k-Nearest Neighbor (k-NN) methods have a wide variety of applications in areas such as bioinformatics, machine learning, data mining, clustering analysis, and pattern recognition. Our application of interest is manifold embedding. Due to the large dimensionality of the input data (\u3c15k), spatial subdivision based techniques such OBBs, k-d tree, BSP etc., are not viable. The only alternative is the brute-force search, which has two distinct parts. The first finds distances between individual vectors in the corpus based on a pre-defined metric. Given the distance matrix, the second step selects k nearest neighbors for each member of the query data set.
This thesis presents the development and implementation of a distributed exact k-Nearest Neighbor Graph (k-NNG) construction method. The proposed method uses Graphics Processing Units (GPUs) and exploits multiple levels of parallelism for distributed computational systems using GPUs. It is scalable for different cluster sizes, with each compute node in the cluster containing multiple GPUs. The distance computation is formulated as a basic matrix multiplication and reduction operation. The optimized CUBLAS matrix multiplication library is used for this purpose. Various distance metrics such as Euclidian, cosine, and Pearson are supported. For k-NNG construction, two different methods are presented. The first is based on an approach called batch index sorting to build the k-NNG with three sorting operations. This method uses the optimized radix sort implementation in the Thrust library for GPU. The second is an efficient implementation using the latest GPU functionalities of a variant of the quick select algorithm. Overall, the batch index sorting based k-NNG method is approximately 13x faster than a distributed MATLAB implementation. The quick select algorithm itself has a 5x speedup over state-of-the art GPU methods. This has enabled the processing of k-NNG construction on a data set containing 20 million image vectors, each with dimension 15,000, as part of a manifold embedding technique for analyzing the conformations of biomolecules
- …