
Indexing Metric Spaces for Exact Similarity Search

Author: Chen Lu
Gao Yunjun
Jensen Christian S.
Li Zheng
Miao Xiaoye
Song Xuan
Zhu Yifan
Publication date: 07/05/2020

With the continued digitalization of societal processes, we are seeing an explosion in available data, commonly referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity, and variety. Many studies address volume or velocity, while far fewer concern variety. Metric space is ideal for addressing variety because it can accommodate any type of data as long as the associated distance notion satisfies the triangle inequality. To accelerate search in metric space, a collection of indexing techniques for metric data has been proposed. However, each existing survey offers only narrow coverage, and no comprehensive empirical study of these techniques exists. We offer a survey of all existing metric indexes that support exact similarity search by i) summarizing all existing partitioning, pruning, and validation techniques used by metric indexes, ii) providing time and storage complexity analyses of index construction, and iii) reporting on a comprehensive empirical comparison of their similarity query processing performance. Empirical comparisons are used to evaluate index performance during search because differences are hard to see from complexity analysis of similarity query processing, and because query performance depends on pruning and validation abilities, which in turn depend on the data distribution. This article aims to reveal the strengths and weaknesses of different indexing techniques in order to offer guidance on selecting an appropriate indexing technique for a given setting and to direct future research on metric indexes.
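The pruning and validation steps mentioned in the abstract can be illustrated with a minimal sketch of a pivot table, one common family of metric indexes. This is not code from the surveyed article: the class name `PivotTable`, the choice of Euclidean distance, and the pivot selection are assumptions made for illustration. The only property the filter relies on is the triangle inequality.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Example metric; any distance satisfying the triangle inequality works."""
    return math.dist(a, b)


class PivotTable:
    """Hypothetical pivot-based index: stores object-to-pivot distances."""

    def __init__(self, objects: Sequence[Point], pivots: Sequence[Point],
                 dist: Callable[[Point, Point], float] = euclidean) -> None:
        if not pivots:
            raise ValueError("at least one pivot is required")
        self.objects = list(objects)
        self.pivots = list(pivots)
        self.dist = dist
        # Precompute d(o, p) for every object o and pivot p.
        self.table = [[dist(o, p) for p in self.pivots] for o in self.objects]

    def range_query(self, q: Point, radius: float) -> Tuple[List[Point], int]:
        """Return all objects within `radius` of q and the number of
        object distances that had to be computed explicitly."""
        q_to_pivots = [self.dist(q, p) for p in self.pivots]
        results: List[Point] = []
        computed = 0
        for o, row in zip(self.objects, self.table):
            # Pruning: |d(q,p) - d(o,p)| <= d(q,o), so a large gap excludes o.
            lower = max(abs(dq - do) for dq, do in zip(q_to_pivots, row))
            if lower > radius:
                continue
            # Validation: d(q,o) <= d(q,p) + d(o,p), so a small sum includes o.
            upper = min(dq + do for dq, do in zip(q_to_pivots, row))
            if upper <= radius:
                results.append(o)
                continue
            # Neither bound decides; fall back to the real distance.
            computed += 1
            if self.dist(q, o) <= radius:
                results.append(o)
        return results, computed
```

How many distance computations the filter saves depends on where the pivots sit relative to the data and the query, which is consistent with the abstract's point that query performance depends on the data distribution.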

arXiv.org e-Print Archive

VBN

Database Similarity Search in Metric Spaces: Limitations and Opportunities

Author: Shen Zeqian
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/01/2004

Generic database similarity search is one of the most challenging problems in current database research. Generic data are not simply structured data with several keys of numeric or alphabetic types, so traditional search algorithms that only check specified fields and keys are not effective. Similarity searches find the objects that are similar to a target using a specified similarity criterion. Tree-structured indexing techniques based on metric spaces are widely used to solve this problem. Existing methods can be divided into two categories: approaches based upon Voronoi partitions and approaches based upon reference points. The latter is the focus of this research. The problem of database similarity search using reference points in metric spaces is formulated, and the key issues are addressed. This research focuses upon two broad sets of open problems: analysis of the limitations of approaches to similarity search using metric spaces, and development of criteria that can be used to evaluate the opportunities for new design methods. The performance limitations of similarity search based on metric spaces are analyzed and proved to be imposed by statistical characteristics of the data collection. A new concept, the range threshold, is defined to evaluate the feasibility of tree-structured indexing techniques based upon reference points in metric spaces. A method to estimate the range threshold is provided, which makes it possible to check the feasibility of this approach for a data set prior to implementation. The opportunities for different approaches are evaluated by criteria based on search efficiency and utility. Comparisons of different Minkowski metrics and of data extraction methods using PCA (principal component analysis) are presented. Search utilities are demonstrated by examples. Several issues related to index tree structure are addressed. Experimental results show that a taller tree yields better performance. All these results indicate that the approaches based upon reference points in metric spaces are promising.
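For readers unfamiliar with the Minkowski family of metrics compared in this work, the following is a minimal sketch of the standard definition, covering the Manhattan (p = 1), Euclidean (p = 2), and Chebyshev (p = infinity) cases. The function name `minkowski` is an assumption for illustration and is not taken from the thesis.

```python
import math
from typing import Sequence


def minkowski(a: Sequence[float], b: Sequence[float], p: float) -> float:
    """Minkowski distance of order p between two equal-length vectors.

    p must be >= 1 for the result to satisfy the triangle inequality.
    Pass math.inf for the Chebyshev (maximum-coordinate) distance.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be >= 1 to define a metric")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)
```

Because each order p >= 1 yields a valid metric, the same reference-point index can be built under any of them, which is what makes a comparison across orders possible.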

University of Tennessee, Knoxville: Trace

CiteSeerX

Efficient Spatial Keyword Search in Trajectory Databases

Author: Cong Gao
Lu Hua
Ooi Beng Chin
Zhang Dongxiang
Zhang Meihui
Publication date: 01/01/2012

An increasing amount of trajectory data is being annotated with text descriptions to better capture the semantics associated with locations. The fusion of spatial locations and text descriptions in trajectories engenders a new type of top-$k$ queries that take into account both aspects. Each trajectory in consideration consists of a sequence of geo-spatial locations associated with text descriptions. Given a user location $\lambda$ and a keyword set $\psi$, a top-$k$ query returns $k$ trajectories whose text descriptions cover the keywords $\psi$ and that have the shortest match distance. To the best of our knowledge, previous research on querying trajectory databases has focused on trajectory data without any text description, and no existing work has studied this kind of top-$k$ queries on trajectories. This paper proposes a novel method for efficiently computing top-$k$ trajectories. The method is developed based on a new hybrid index, the cell-keyword conscious B$^+$-tree, denoted by \cellbtree, which enables us to exploit both text relevance and location proximity to facilitate efficient and effective query processing. The results of our extensive empirical studies with an implementation of the proposed algorithms on BerkeleyDB demonstrate that our proposed methods are capable of achieving excellent performance and good scalability. Comment: 12 page
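To make the query semantics concrete, here is a brute-force baseline sketch, not the paper's index-based method. The abstract does not define the match distance, so this sketch assumes a hypothetical one: for each keyword, take the distance from the user location to the nearest trajectory point whose text contains that keyword, and sum over the keywords. The names `match_distance` and `top_k_trajectories` are likewise illustrative.

```python
import heapq
import math
from typing import Dict, List, Optional, Set, Tuple

Location = Tuple[float, float]
TrajPoint = Tuple[Location, Set[str]]  # (location, words describing it)


def match_distance(traj: List[TrajPoint], user_loc: Location,
                   keywords: Set[str]) -> Optional[float]:
    """Assumed match distance (see lead-in); None if a keyword is not covered."""
    total = 0.0
    for kw in keywords:
        dists = [math.dist(user_loc, loc) for loc, words in traj if kw in words]
        if not dists:
            return None  # trajectory does not cover every keyword
        total += min(dists)
    return total


def top_k_trajectories(trajs: Dict[str, List[TrajPoint]], user_loc: Location,
                       keywords: Set[str], k: int) -> List[Tuple[float, str]]:
    """Scan every trajectory and keep the k smallest assumed match distances."""
    scored = []
    for traj_id, traj in trajs.items():
        d = match_distance(traj, user_loc, keywords)
        if d is not None:
            scored.append((d, traj_id))
    return heapq.nsmallest(k, scored)
```

A full scan like this touches every trajectory for every query. An index that combines text relevance and location proximity, as the abstract describes, is aimed at avoiding that cost.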

arXiv.org e-Print Archive

Roskilde Universitet

VBN

HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces

Author: Arora Akhil
Bhattacharya Arnab
Kumar Piyush
Sinha Sakshi
Publication venue: VLDB Endowment
Publication date: 23/04/2018

Nearest neighbor searching of large databases in high-dimensional spaces is inherently difficult due to the curse of dimensionality. Some degree of approximation is, therefore, necessary to solve the problem of nearest neighbor search in practice. In this paper, we propose a novel yet simple indexing scheme, HD-Index, to solve the problem of approximate k-nearest neighbor queries in massive high-dimensional databases. HD-Index consists of a set of novel hierarchical structures called RDB-trees built on Hilbert keys of database objects. The leaves of the RDB-trees store distances of database objects to reference objects, thereby allowing efficient pruning using distance filters. In addition to the triangle inequality, we also use the Ptolemaic inequality to produce better lower bounds. Experiments on massive (up to billion scale) high-dimensional (up to 1000+) datasets show that HD-Index is effective, efficient, and scalable. Comment: PVLDB 11(8):906-919, 201
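The two distance filters named in the abstract can be sketched as follows. This is an illustration of the standard inequalities, not HD-Index's implementation, and the function names are assumptions. Given a query q, an object o, and reference objects p and s, both functions return a lower bound on d(q, o) using only distances to the reference objects. The Ptolemaic bound is valid in Ptolemaic metric spaces, which include Euclidean space, but not in every metric space.

```python
import math
from typing import Tuple

Point = Tuple[float, ...]


def triangle_lower_bound(d_qp: float, d_op: float) -> float:
    """Triangle inequality: d(q,o) >= |d(q,p) - d(o,p)|."""
    return abs(d_qp - d_op)


def ptolemaic_lower_bound(d_qp: float, d_qs: float, d_op: float,
                          d_os: float, d_ps: float) -> float:
    """Ptolemy's inequality: d(q,o) >= |d(q,p)d(o,s) - d(q,s)d(o,p)| / d(p,s).

    Requires two distinct reference objects, so d(p,s) > 0.
    """
    return abs(d_qp * d_os - d_qs * d_op) / d_ps


def combined_lower_bound(q: Point, o: Point, p: Point, s: Point) -> float:
    """Tightest bound available from both filters with references p and s."""
    d_qp, d_qs = math.dist(q, p), math.dist(q, s)
    d_op, d_os = math.dist(o, p), math.dist(o, s)
    d_ps = math.dist(p, s)
    return max(
        triangle_lower_bound(d_qp, d_op),
        triangle_lower_bound(d_qs, d_os),
        ptolemaic_lower_bound(d_qp, d_qs, d_op, d_os, d_ps),
    )
```

In an index, d(o, p) and d(o, s) would be read from stored values rather than recomputed, so a candidate whose combined bound exceeds the current search radius can be discarded without computing d(q, o).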

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Systematic Combination of Speed-Up Techniques for exact Shortest-Path Queries

Author: Schieferdecker Dennis
Publication date: 16/12/2009

KITopen

Algorithmic linear dimension reduction in the l_1 norm for sparse vectors

Author: Gilbert A. C.
Strauss M. J.
Tropp J. A.
Vershynin R.
Publication date: 18/08/2006

This paper develops a new method for recovering m-sparse signals that is simultaneously uniform and quick. We present a reconstruction algorithm whose run time, O(m log^2(m) log^2(d)), is sublinear in the length d of the signal. The reconstruction error is within a logarithmic factor (in m) of the optimal m-term approximation error in l_1. In particular, the algorithm recovers m-sparse signals perfectly, and noisy signals are recovered with polylogarithmic distortion. Our algorithm makes O(m log^2(d)) measurements, which is within a logarithmic factor of optimal. We also present a small-space implementation of the algorithm. These sketching techniques and the corresponding reconstruction algorithms provide an algorithmic dimension reduction in the l_1 norm. In particular, vectors of support m in dimension d can be linearly embedded into O(m log^2(d)) dimensions with polylogarithmic distortion. We can reconstruct a vector from its low-dimensional sketch in time O(m log^2(m) log^2(d)). Furthermore, this reconstruction is stable and robust under small perturbations.

arXiv.org e-Print Archive

Caltech Authors