Indexability, concentration, and VC theory
The degrading performance of indexing schemes for exact similarity search in high dimensions has long been linked to the concentration of the histograms of distances and other 1-Lipschitz functions. We discuss this observation in the framework of the phenomenon of concentration of measure on high-dimensional structures and the Vapnik-Chervonenkis theory of statistical learning. (17 pages; final submission to J. Discrete Algorithms; an expanded, improved, and corrected version of the SISAP 2010 invited paper.)
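The distance concentration this abstract refers to can be illustrated with a small simulation (an illustrative sketch, not code from the paper): for points drawn uniformly from the unit cube, the spread of pairwise Euclidean distances relative to their mean shrinks as the dimension grows, which is exactly what degrades distance-based indexing.

```python
import math
import random

def mean_std_pairwise_distance(dim, n_points=100, seed=0):
    """Sample points uniformly from [0,1]^dim and return the mean and
    standard deviation of all pairwise Euclidean distances."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = []
    for i in range(n_points):
        for j in range(i + 1, n_points):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])))
            dists.append(d)
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    return mean, math.sqrt(var)

# The relative spread (std/mean) shrinks with dimension: the histogram
# of distances concentrates around its mean.
for dim in (2, 10, 100, 1000):
    mean, std = mean_std_pairwise_distance(dim)
    print(f"dim={dim:5d}  mean={mean:7.3f}  std/mean={std / mean:.3f}")
```

The mean distance grows roughly like the square root of the dimension while the standard deviation stays nearly constant, so the ratio std/mean tends toward zero.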
An Evaluation of Organization Methods for Data Types Commonly Used in the Geographic Domain
This dissertation designed and implemented approaches to assess the suitability of commonly used unsupervised and supervised grouping methods for data types common in the geographic domain. Four types of data were indexed for organization: a full-text data set covering 30 years of cartographic literature, a raster data set of physiographic characteristics of the U.S., a suite of GIS software commands used in hydrologic analysis, and a catalog of cartographic generalization algorithms. Various clustering and classification methods from statistics and machine learning were evaluated for organizing these data types. By systematically applying every organization method to every type of indexed data, this research addresses whether certain indexing strategies influence the effectiveness of the organization methods. Depending on the data set and the indexing method applied, some clustering and classification methods performed better than others.
The experiments of this dissertation demonstrate that, through systematic evaluation and validation of clustering and classification results, recommendations for organizing data can be formulated based on cluster and classification validity indices. Furthermore, systematic evaluation and application of the six clustering and classification methods makes it possible to match an indexing strategy and an organization method for each of the four data sets used in this dissertation.
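One family of cluster validity indices the abstract alludes to can be sketched in a few lines (a minimal illustration, not the dissertation's actual evaluation code): the silhouette coefficient scores a clustering by comparing each point's mean distance to its own cluster against its mean distance to the nearest other cluster.

```python
import math

def silhouette(points, labels, dist=math.dist):
    """Mean silhouette coefficient: for each point, let a be its mean
    distance to its own cluster and b the smallest mean distance to any
    other cluster; the point's score is (b - a) / max(a, b)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = [dist(p, q) for q in clusters[l] if q is not p]
        if not own:
            continue  # singleton clusters contribute no score
        a = sum(own) / len(own)
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Scores near 1 indicate compact, well-separated clusters; scores near 0 or below suggest the grouping does not match the data's structure, which is the kind of signal such indices provide when recommending an organization method.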
Indexing Metric Spaces for Exact Similarity Search
With the continued digitalization of societal processes, we are seeing an explosion in available data, commonly referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity, and variety. Many studies address volume or velocity, while far fewer address variety. Metric spaces are well suited to addressing variety because they can accommodate any type of data as long as the associated distance notion satisfies the triangle inequality. To accelerate search in metric spaces, a collection of indexing techniques for metric data has been proposed. However, existing surveys each offer only narrow coverage, and no comprehensive empirical study of those techniques exists. We offer a survey of all existing metric indexes that support exact similarity search by i) summarizing the partitioning, pruning, and validation techniques used in metric indexes, ii) providing time and storage complexity analyses of index construction, and iii) reporting on a comprehensive empirical comparison of their similarity query processing performance. Empirical comparison is used to evaluate search performance because differences in the complexity analyses of similarity query processing are hard to discern, and query performance depends on pruning and validation abilities that are tied to the data distribution. This article aims to reveal the strengths and weaknesses of different indexing techniques, in order to offer guidance on selecting an appropriate technique for a given setting and to direct future research on metric indexes.
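The triangle-inequality-based pruning that underlies the indexes surveyed here can be sketched with a minimal pivot table (a LAESA-style illustration under assumed names, not any specific index from the survey): precompute distances from every object to a few pivots, then at query time discard objects whose triangle-inequality lower bound already exceeds the query radius.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class PivotTable:
    """Minimal pivot-based metric index: stores d(object, pivot) for a
    few pivots and uses the triangle inequality to prune at query time."""

    def __init__(self, objects, n_pivots=3, dist=euclidean, seed=0):
        rng = random.Random(seed)
        self.dist = dist
        self.objects = list(objects)
        self.pivots = rng.sample(self.objects, n_pivots)
        # table[i][k] = d(objects[i], pivots[k]), computed once at build time
        self.table = [[dist(o, p) for p in self.pivots] for o in self.objects]

    def range_query(self, q, r):
        q_to_pivots = [self.dist(q, p) for p in self.pivots]
        results, computed = [], 0
        for obj, row in zip(self.objects, self.table):
            # Triangle inequality: |d(q, p) - d(obj, p)| <= d(q, obj),
            # so the largest such gap is a valid lower bound on d(q, obj).
            lb = max(abs(qp - op) for qp, op in zip(q_to_pivots, row))
            if lb > r:
                continue  # pruned without computing d(q, obj)
            computed += 1
            if self.dist(q, obj) <= r:
                results.append(obj)
        return results, computed
```

Because the pruning uses only a valid lower bound, the query result is exact: every pruned object provably lies outside the radius, which is the defining property of the exact similarity search indexes the survey covers.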
Interpreting the Curse of Dimensionality from Distance Concentration and Manifold Effect
Characteristics of data such as distribution and heterogeneity become more complex and counterintuitive as the dimensionality increases. This phenomenon is known as the curse of dimensionality: common patterns and relationships (e.g., internal and boundary patterns) that hold in low-dimensional space may be invalid in higher-dimensional space, leading to decreasing performance of regression, classification, and clustering models and algorithms. The curse of dimensionality can be attributed to many causes. In this paper, we first summarize five challenges associated with manipulating high-dimensional data and explain the potential causes of failure in regression, classification, and clustering tasks. We then delve into two major causes of the curse of dimensionality, distance concentration and the manifold effect, through theoretical and empirical analyses. The results demonstrate that nearest neighbor search (NNS) using three typical distance measures, Minkowski distance, Chebyshev distance, and cosine distance, becomes meaningless as the dimensionality increases. Meanwhile, the data incorporates more redundant features, and the variance contribution of principal component analysis (PCA) is skewed toward a few dimensions. By interpreting the causes of the curse of dimensionality, we can better understand the limitations of current models and algorithms and work to improve the performance of data analysis and machine learning tasks in high-dimensional space. (17 pages, 11 figures.)
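The claim that NNS "becomes meaningless" can be made concrete with a small simulation (an illustrative sketch, not an experiment from the paper): the relative contrast between a query's farthest and nearest neighbor, (d_max - d_min) / d_min, collapses toward zero as the dimension grows, so the nearest neighbor is barely closer than any other point.

```python
import math
import random

def relative_contrast(dim, n_points=300, seed=0):
    """For a random query and data set in [0,1]^dim, return the relative
    contrast (d_max - d_min) / d_min of Euclidean distances; it shrinks
    toward 0 as dimensionality grows."""
    rng = random.Random(seed)
    q = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        x = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(q, x))))
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  (d_max - d_min)/d_min = {relative_contrast(dim):.3f}")
```

When this ratio approaches zero, any fixed query radius either captures almost no points or almost all of them, which is one way the abstract's "meaningless" NNS manifests in practice.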