1,348 research outputs found

HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces

Author: Arora Akhil
Bhattacharya Arnab
Kumar Piyush
Sinha Sakshi
Publication venue: 'VLDB Endowment'
Publication date: 23/04/2018

Nearest neighbor searching of large databases in high-dimensional spaces is inherently difficult due to the curse of dimensionality. Some form of approximation is therefore necessary to solve the nearest neighbor search problem in practice. In this paper, we propose a novel yet simple indexing scheme, HD-Index, to solve the problem of approximate k-nearest neighbor queries in massive high-dimensional databases. HD-Index consists of a set of novel hierarchical structures called RDB-trees built on Hilbert keys of database objects. The leaves of the RDB-trees store distances of database objects to reference objects, thereby allowing efficient pruning using distance filters. In addition to the triangle inequality, we also use the Ptolemaic inequality to produce better lower bounds. Experiments on massive (up to billion scale) high-dimensional (up to 1000+) datasets show that HD-Index is effective, efficient, and scalable.

Comment: PVLDB 11(8):906-919, 201
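The distance filters mentioned in this abstract can be illustrated with a minimal sketch. It assumes Euclidean distance and precomputed distances to a few reference objects; the function names and the data layout are hypothetical and are not taken from the HD-Index paper. Both functions return a lower bound on the true query-to-object distance, so an object can be skipped whenever the bound already exceeds the current k-th best distance.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.dist(a, b)


def triangle_lower_bound(dq, do):
    """Lower bound on d(q, o) from the triangle inequality.

    dq[i] is d(q, r_i) and do[i] is d(o, r_i) for reference object r_i.
    For every reference, |d(q, r_i) - d(o, r_i)| <= d(q, o).
    """
    return max(abs(x - y) for x, y in zip(dq, do))


def ptolemaic_lower_bound(dq, do, dref):
    """Lower bound on d(q, o) from Ptolemy's inequality.

    dref[i][j] is d(r_i, r_j). For every pair of references,
    |d(q, r_i) * d(o, r_j) - d(q, r_j) * d(o, r_i)| / d(r_i, r_j) <= d(q, o).
    This holds in Euclidean space, which is assumed here.
    """
    best = 0.0
    for i, j in combinations(range(len(dq)), 2):
        if dref[i][j] > 0:
            bound = abs(dq[i] * do[j] - dq[j] * do[i]) / dref[i][j]
            best = max(best, bound)
    return best
```

In a filter-and-refine search, the cheaper triangle bound would typically be applied first, and the Ptolemaic bound (quadratic in the number of references) only to the survivors.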

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Approximated and User Steerable tSNE for Progressive Visual Analytics

Author: Eisemann Elmar
Höllt Thomas
Lelieveldt Boudewijn P. F.
Pezzotti Nicola
van der Maaten Laurens
Vilanova Anna
Publication date: 01/01/2016

Progressive Visual Analytics aims at improving the interactivity in existing analytics techniques by means of visualization as well as interaction with intermediate results. One key method for data analysis is dimensionality reduction, for example, to produce 2D embeddings that can be visualized and analyzed efficiently. t-Distributed Stochastic Neighbor Embedding (tSNE) is a well-suited technique for the visualization of several high-dimensional data sets. tSNE can create meaningful intermediate results but suffers from a slow initialization that constrains its application in Progressive Visual Analytics. We introduce a controllable tSNE approximation (A-tSNE), which trades off speed and accuracy, to enable interactive data exploration. We offer real-time visualization techniques, including a density-based solution and a Magic Lens to inspect the degree of approximation. With this feedback, the user can decide on local refinements and steer the approximation level during the analysis. We demonstrate our technique with several datasets, in a real-world research scenario and for the real-time analysis of high-dimensional streams to illustrate its effectiveness for interactive data analysis.

arXiv.org e-Print Archive

Repository TU/e

TU Delft Repository

Pure OAI Repository

Leiden University Scholary Publications

Analysis of approximate nearest neighbor searching with clustered point sets

Author: Maneewongvatana Songrit
Mount David M.
Publication date: 01/01/1999

We present an empirical analysis of data structures for approximate nearest neighbor searching. We compare the well-known optimized kd-tree splitting method against two alternative splitting methods. The first, called the sliding-midpoint method, attempts to balance the goals of producing subdivision cells of bounded aspect ratio while not producing any empty cells. The second, called the minimum-ambiguity method, is a query-based approach. In addition to the data points, it is also given a training set of query points for preprocessing. It employs a simple greedy algorithm to select the splitting plane that minimizes the average amount of ambiguity in the choice of the nearest neighbor for the training points. We provide an empirical analysis comparing these two methods against the optimized kd-tree construction for a number of synthetically generated data and query sets. We demonstrate that for clustered data and query sets, these algorithms can provide significant improvements over the standard kd-tree construction for approximate nearest neighbor searching.

Comment: 20 pages, 8 figures. Presented at ALENEX '99, Baltimore, MD, Jan 15-16, 199
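A single sliding-midpoint split can be sketched as follows. This is an illustration of the general idea only, under the assumption that the cell is cut at the midpoint of its longest side and that the plane slides to the nearest data point when one side would otherwise be empty. The function name and the tuple-based point representation are hypothetical, and tie handling in the paper may differ.

```python
def sliding_midpoint_split(points, lo, hi):
    """Split one kd-tree cell with the sliding-midpoint rule.

    points: non-empty list of coordinate tuples inside the cell.
    lo, hi: lower and upper corners of the cell's bounding box.
    Returns (dim, cut, left, right).
    """
    # Cut orthogonally to the longest side to keep the aspect ratio bounded.
    dim = max(range(len(lo)), key=lambda d: hi[d] - lo[d])
    cut = (lo[dim] + hi[dim]) / 2.0
    coords = [p[dim] for p in points]

    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]

    if not left:
        # All points lie at or beyond the plane: slide it up to the
        # closest point, which then forms the left side.
        cut = min(coords)
        left = [p for p in points if p[dim] <= cut]
        right = [p for p in points if p[dim] > cut]
    elif not right:
        # All points lie below the plane: slide it down to the closest
        # point, which then forms the right side.
        cut = max(coords)
        left = [p for p in points if p[dim] < cut]
        right = [p for p in points if p[dim] >= cut]

    # Degenerate case: if every point shares the same coordinate along
    # dim, one side is still empty and the caller should stop splitting.
    return dim, cut, left, right
```

For example, with three points clustered near the right edge of the box from (0, 0) to (1, 0.5), the midpoint plane at x = 0.5 leaves the left side empty, so the plane slides to the smallest x-coordinate among the points.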

arXiv.org e-Print Archive

CiteSeerX

New Approaches to Similarity Searching in Metric Spaces

Author: Celik Cengiz
Publication date: 24/04/2006

The complex and unstructured nature of many types of data, such as multimedia objects, text documents, and protein sequences, requires the use of similarity search techniques for retrieval of information from databases. One popular approach for similarity searching is mapping database objects into feature vectors, which introduces an undesirable element of indirection into the process. A more direct approach is to define a distance function directly between objects. Typically such a function is taken from a metric space, which satisfies a number of properties, such as the triangle inequality. Index structures that can work for metric spaces have been shown to provide satisfactory performance, and were reported to outperform vector-based counterparts in many applications. Metric spaces also provide a more general framework, and for some domains defining a distance between objects can be accomplished more intuitively than mapping objects to feature vectors. In this thesis we will investigate new efficient methods for similarity searching in metric spaces. We will first show that current solutions to indexing in metric spaces have several drawbacks. Tree-based solutions do not provide the best tradeoffs between construction time and query performance. Tree structures are also difficult to make dynamic without further degrading their performance. There is also a family of flat structures that address some of the deficiencies of tree-based indices, but they introduce their own unique problems in terms of higher construction cost, higher space usage, and extra CPU overhead. In this thesis a new family of flat structures will be introduced, which are very flexible and simple. We will show that dynamic operations can easily be performed, and that they can be customized to work under different performance requirements. They also address many of the general drawbacks of flat structures as outlined above.
A new framework, composite metrics, will also be introduced, which provides a more flexible similarity searching process by allowing several metrics to be combined in one search structure. Two indexing structures will be introduced that can handle similarity queries in this setting, and it will be shown that they provide competitive query performance with respect to data structures for standard metrics.
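To make the idea of a flat metric index concrete, here is a minimal sketch of a generic pivot table with a range query. It is a textbook-style illustration and not one of the structures introduced in the thesis; the function names, the choice of edit distance as the metric, and the pivot selection are all assumptions made for the example. The distance is defined directly on strings, with no feature vectors involved.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (a metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def build_pivot_table(objects, pivots, dist):
    """Flat index: one row of object-to-pivot distances per object."""
    return [[dist(o, p) for p in pivots] for o in objects]


def range_query(query, radius, objects, pivots, table, dist):
    """Return (hits, computed): objects within radius of query, and the
    number of object distances that had to be evaluated."""
    dq = [dist(query, p) for p in pivots]
    hits, computed = [], 0
    for obj, row in zip(objects, table):
        # Triangle inequality: |d(q, p) - d(o, p)| <= d(q, o), so any
        # pivot whose gap exceeds the radius rules the object out.
        if any(abs(a - b) > radius for a, b in zip(dq, row)):
            continue
        computed += 1
        if dist(query, obj) <= radius:
            hits.append(obj)
    return hits, computed


words = ["kitten", "sitting", "mitten", "banana", "bandana", "fitting"]
pivots = ["kitten", "banana"]  # hypothetical pivot choice
table = build_pivot_table(words, pivots, edit_distance)
hits, computed = range_query("bitten", 1, words, pivots, table, edit_distance)
```

The table costs one stored distance per object and pivot, which reflects the space overhead of flat structures that the abstract mentions.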

Digital Repository at the University of Maryland

High-dimensional indexing methods utilizing clustering and dimensionality reduction

Author: Zhang Lijuan
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/2005

The emergence of novel database applications has resulted in the prevalence of a new paradigm for similarity search. These applications include multimedia databases, medical imaging databases, time series databases, DNA and protein sequence databases, and many others. Features of data objects are extracted and transformed into high-dimensional data points. Searching for objects becomes a search on points in the high-dimensional feature space. The dissimilarity between two objects is determined by the distance between two feature vectors. Similarity search is usually implemented as nearest neighbor search in feature vector spaces. The cost of processing k-nearest neighbor (k-NN) queries via a sequential scan increases as the number of objects and the number of features increase. A variety of multi-dimensional index structures have been proposed to improve the efficiency of k-NN query processing, which work well in low-dimensional space but lose their efficiency in high-dimensional space due to the curse of dimensionality. This inefficiency is addressed in this study by Clustering and Singular Value Decomposition - CSVD with indexing, Persistent Main Memory - PMM index, and Stepwise Dimensionality Increasing - SDI-tree index. CSVD is an approximate nearest neighbor search method. The performance of CSVD with indexing is studied and the approximation to the distance in original space is investigated. For a given Normalized Mean Square Error - NMSE, the higher the degree of clustering, the higher the recall. However, more clusters require more disk page accesses. A certain number of clusters can be chosen to achieve a higher recall while maintaining a relatively lower query processing cost.
The Clustering and Indexing using Persistent Main Memory - CIPMM framework is motivated by the following considerations: (a) a significant fraction of index pages are accessed randomly, incurring a high positioning time for each access; (b) disk transfer rate is improving 40% annually, while the improvement in positioning time is only 8%; (c) query processing incurs less CPU time for main-memory-resident than for disk-resident indices. CIPMM aims at reducing the elapsed time for query processing by utilizing sequential, rather than random, disk accesses. A specific instance of the CIPMM framework, CIPOP, indexing using Persistent Ordered Partition - OP-tree, is elaborated and compared with clustering and indexing using the SR-tree, CISR. The results show that CIPOP outperforms CISR, and the higher the dimensionality, the higher the performance gains. The SDI-tree index is motivated by two observations: fanout decreases as dimensionality increases, and shorter vectors reduce cache misses. The index is built by using feature vectors transformed via principal component analysis, resulting in a structure with fewer dimensions at higher levels and an increasing number of dimensions from one level to the next. Dimensions are retained in nonincreasing order of their variance according to a parameter p, which specifies the incremental fraction of variance at each level of the index. Experiments on three datasets have shown that SDI-trees with carefully tuned parameters incur fewer disk accesses than SR-trees and VAMSR-trees and, in addition, incur less CPU time than VA-Files.
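The role of the parameter p in the SDI-tree description can be sketched with a small helper. This is one possible reading, stated as an assumption: level l of the index keeps the smallest number of leading principal components whose cumulative variance fraction reaches l * p, capped at 1. The function name is hypothetical, and the input is assumed to be the per-component variances (with a positive sum) already obtained from principal component analysis.

```python
def dims_per_level(variances, p):
    """Number of leading dimensions retained at each index level.

    variances: per-dimension variances after PCA (any order, sum > 0).
    p: incremental fraction of total variance added per level, 0 < p <= 1.
    Returns a list whose entry l - 1 is the dimension count at level l,
    ending once all dimensions or all of the variance are covered.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ordered = sorted(variances, reverse=True)  # nonincreasing variance
    total = sum(ordered)

    # cumulative[k - 1] = fraction of variance kept by the first k dims.
    cumulative, running = [], 0.0
    for v in ordered:
        running += v
        cumulative.append(running / total)

    levels, level = [], 1
    while True:
        target = min(level * p, 1.0)
        # Smallest k whose cumulative fraction reaches the target
        # (small tolerance guards against floating-point round-off).
        k = next((i + 1 for i, c in enumerate(cumulative)
                  if c >= target - 1e-12), len(ordered))
        levels.append(k)
        if k == len(ordered) or target >= 1.0:
            break
        level += 1
    return levels
```

With variances [4, 3, 2, 1] and p = 0.4, the cumulative fractions are 0.4, 0.7, 0.9, and 1.0, so the three levels would keep 1, 3, and 4 dimensions respectively under this reading.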

Digital Commons @ New Jersey Institute of Technology (NJIT)

Advance of the Access Methods

Author: Ivanova Krassimira
Karastanev Stefan
Markov Krassimir
Mitov Ilia
Publication venue: Institute of Information Theories and Applications FOI ITHEA
Publication date: 01/01/2008

The goal of this paper is to outline the advances in access methods over the last ten years and to review all methods available in the accessible bibliography.

Bulgarian Digital Mathematics Library at IMI-BAS

High-Dimensional Indexing for Video Retrieval

Author: Catalin Calistru
Cristina Ribeiro
Gabriel David
Publication venue: 'IntechOpen'
Publication date: 01/01/2012

IntechOpen

Repositório Aberto da Universidade do Porto