    Counterexamples expose gaps in the proof of time complexity for cover trees introduced in 2006

    This paper is motivated by the k-nearest neighbors search: given an arbitrary metric space and its finite subsets (a reference set R and a query set Q), design a fast algorithm to find all k-nearest neighbors in R for every point q ∈ Q. In 2006, Beygelzimer, Kakade, and Langford introduced cover trees to justify a near-linear time complexity for the neighbor search in the sizes of Q and R. Section 5.3 of Curtin's PhD thesis (2015) pointed out that the proof of this result was wrong. The key step in the original proof attempted to show that the number of iterations can be estimated by multiplying the length of the longest root-to-leaf path in a cover tree by a constant factor. However, this estimate can miss many potential nodes in several branches of a cover tree that should be considered during the neighbor search. The same argument was unfortunately repeated in several subsequent papers that relied on the 2006 cover trees. This paper explicitly constructs challenging datasets that provide counterexamples to the past proofs of time complexity for the cover tree construction, the k-nearest neighbor search presented at ICML 2006, and the dual-tree search algorithm published in NIPS 2009. The corrected near-linear time complexities with extra parameters are proved in a forthcoming paper by using a new compressed cover tree that simplifies the original tree structure.
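    To make the flawed step concrete, here is a minimal Python sketch of the level-by-level descent in the spirit of the 2006 algorithm; the Node class and all names are illustrative assumptions, not the authors' code. The candidate set can survive in several branches at once, which is why bounding the iteration count by a single longest root-to-leaf path fails on the counterexample datasets.

```python
class Node:
    """Illustrative explicit cover tree node (not the authors' structure)."""
    def __init__(self, point, level, children=()):
        self.point = point              # data point stored at this node
        self.level = level              # children lie within distance 2**level
        self.children = list(children)

def nearest(root, q, dist):
    """Level-by-level 1-NN descent in the spirit of ICML 2006 (simplified).

    `candidates` plays the role of the set Q_i in the original paper.  After
    each expansion it can contain surviving nodes from *several* branches;
    the 2006 proof bounded the total work by the depth of one root-to-leaf
    path, which undercounts exactly these parallel branches.
    """
    candidates = {root}
    best = dist(q, root.point)
    while candidates:
        # Expand: all children of the current candidates, one level down.
        children = {c for n in candidates for c in n.children}
        if not children:
            break
        best = min([best] + [dist(q, c.point) for c in children])
        level = max(c.level for c in children)
        # Prune: keep a child only if its subtree (all descendants lie within
        # 2**level + 2**(level-1) + ... < 2**(level+1) of it) could still
        # contain a point closer than the current best.
        candidates = {c for c in children
                      if dist(q, c.point) <= best + 2 ** (level + 1)}
    return best
```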

    Clustering gene expression data with a penalized graph-based metric

    Background: The search for cluster structure in microarray datasets is a base problem for the so-called "-omic" sciences. A difficult problem in clustering is how to handle data with a manifold structure, i.e. data that is not shaped as compact clouds of points but forms arbitrary shapes or paths embedded in a high-dimensional space, as can be the case for some gene expression datasets.

    Results: In this work we introduce the Penalized k-Nearest-Neighbor-Graph (PKNNG) based metric, a new tool for evaluating distances in such cases. The new metric can be used in combination with most clustering algorithms. The PKNNG metric is based on a two-step procedure: first it constructs the k-Nearest-Neighbor-Graph of the dataset of interest using a low k value, and then it adds edges with highly penalized weights to connect the subgraphs produced by the first step. We discuss several possible schemes for connecting the different subgraphs, as well as penalization functions. We show clustering results on several public gene expression datasets and simulated artificial problems to evaluate the behavior of the new metric; a sketch of the two-step construction follows below.

    Conclusions: In all cases the PKNNG metric shows promising clustering results. The use of the PKNNG metric can improve the performance of commonly used pairwise-distance based clustering methods to the level of more advanced algorithms. A great advantage of the new procedure is that researchers do not need to learn a new method; they can simply compute distances with the PKNNG metric and then, for example, use hierarchical clustering to produce an accurate and highly interpretable dendrogram of their high-dimensional data.
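    A minimal sketch of the two-step construction under stated assumptions: the default `penalty` and the closest-pair bridging scheme are illustrative stand-ins (the paper discusses several connection schemes and penalization functions), not the authors' exact implementation.

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

def pknng_distances(X, k=3, penalty=lambda w: 100.0 * w):
    """Two-step PKNNG-style metric (sketch).

    Step 1 builds a deliberately sparse k-NN graph; step 2 reconnects the
    resulting subgraphs through heavily penalized edges.  Pairwise distances
    are then shortest-path lengths in the final graph.
    """
    # Step 1: symmetric k-NN graph with a low k, edge weights = distances.
    A = kneighbors_graph(X, n_neighbors=k, mode="distance")
    G = nx.from_scipy_sparse_array(A.maximum(A.T))

    # Step 2: bridge every pair of connected components through their
    # closest pair of points, with a penalized weight.
    comps = [list(c) for c in nx.connected_components(G)]
    for i in range(len(comps)):
        for j in range(i + 1, len(comps)):
            a, b = min(((a, b) for a in comps[i] for b in comps[j]),
                       key=lambda p: np.linalg.norm(X[p[0]] - X[p[1]]))
            G.add_edge(a, b, weight=penalty(np.linalg.norm(X[a] - X[b])))

    return dict(nx.all_pairs_dijkstra_path_length(G))
```

    The resulting distance dictionary can be fed to any pairwise-distance clustering method, e.g. hierarchical clustering, exactly as the conclusions suggest.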

    Paired compressed cover trees guarantee a near-linear parametrized complexity for all k-nearest neighbors search in an arbitrary metric space

    This paper studies the important problem of finding all k-nearest neighbors to points of a query set Q in another reference set R within any metric space. Our previous work defined compressed cover trees and corrected the key arguments in several past papers for challenging datasets. In 2009 Ram, Lee, March, and Gray attempted to improve the time complexity by using pairs of cover trees on the query and reference sets. In 2015 Curtin with the above co-authors used extra parameters to finally prove a time complexity for k = 1. The current work fills all previous gaps and improves the nearest neighbor search based on pairs of new compressed cover trees. The novel imbalance parameter of paired trees allowed us to prove a better time complexity for any number of neighbors k ≥ 1.
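    The general dual-tree pattern that this complexity analysis concerns can be sketched as follows. This is a generic illustration, not the compressed cover tree algorithm itself; it assumes tree nodes exposing point, radius, points (all points under the node), children, and is_leaf().

```python
def dual_tree_nn(qnode, rnode, dist, best):
    """Generic dual-tree 1-NN traversal (illustrative pattern only).

    `best` maps every query point to its current nearest distance
    (initialize with float('inf')).  Pruning happens once per *pair* of
    nodes, which is the source of the speedup over |Q| independent
    single-tree searches and the quantity that parameters such as the
    imbalance of the paired trees control.
    """
    # Lower bound on d(q, r) for any q under qnode and r under rnode,
    # by the triangle inequality.
    lower = dist(qnode.point, rnode.point) - qnode.radius - rnode.radius
    if lower > max(best[q] for q in qnode.points):
        return  # no query under qnode can improve: prune the whole pair
    if qnode.is_leaf() and rnode.is_leaf():
        for q in qnode.points:
            for r in rnode.points:
                d = dist(q, r)
                if d < best[q]:
                    best[q] = d
        return
    # Recurse on children; a leaf is paired with the other side's children.
    for qc in (qnode.children if not qnode.is_leaf() else [qnode]):
        for rc in (rnode.children if not rnode.is_leaf() else [rnode]):
            dual_tree_nn(qc, rc, dist, best)
```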

    Solving Fréchet Distance Problems by Algebraic Geometric Methods

    We study several polygonal curve problems under the Fréchet distance via algebraic geometric methods. Let $\mathbb{X}_m^d$ and $\mathbb{X}_k^d$ be the spaces of all polygonal curves of $m$ and $k$ vertices in $\mathbb{R}^d$, respectively. We assume that $k \leq m$. Let $\mathcal{R}^d_{k,m}$ be the set of ranges in $\mathbb{X}_m^d$ for all possible metric balls of polygonal curves in $\mathbb{X}_k^d$ under the Fréchet distance. We prove a nearly optimal bound of $O(dk\log(km))$ on the VC dimension of the range space $(\mathbb{X}_m^d, \mathcal{R}_{k,m}^d)$, improving on the previous $O(d^2k^2\log(dkm))$ upper bound and approaching the current $\Omega(dk\log k)$ lower bound. Our upper bound also holds for the weak Fréchet distance. We also obtain exact solutions that are hitherto unknown for curve simplification, range searching, nearest neighbor search, and distance oracles. Comment: To appear at SODA24; corrects some references.
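    For orientation, the Fréchet distance compares two polygonal curves via the best simultaneous traversal of both. The sketch below computes the *discrete* variant by dynamic programming, purely as an illustration of the metric whose balls define the ranges $\mathcal{R}^d_{k,m}$; the paper itself treats the continuous and weak Fréchet distances.

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between curves P (m x d) and Q (k x d).

    dp[i, j] = min over coupled walks ending at (P[i], Q[j]) of the maximum
    pointwise distance seen along the walk.
    """
    m, k = len(P), len(Q)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)  # all pairs
    dp = np.full((m, k), np.inf)
    dp[0, 0] = d[0, 0]
    for i in range(m):
        for j in range(k):
            if i == 0 and j == 0:
                continue
            # Advance on P, on Q, or on both; take the cheapest predecessor.
            prev = min(dp[i - 1, j] if i > 0 else np.inf,
                       dp[i, j - 1] if j > 0 else np.inf,
                       dp[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            dp[i, j] = max(prev, d[i, j])
    return dp[-1, -1]
```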

    Efficient k-nearest neighbor query processing in metric spaces based on precise radius estimation

    Thesis (Master's) -- Bilkent University, Department of Computer Engineering, Institute of Engineering and Science, Ankara, 2009. Includes bibliographical references (leaves 45-47). Author: Can Şardan.
    Similarity searching is an important problem for complex and unstructured data such as images, video, and text documents. One common solution is approximating complex objects by feature vectors. The metric spaces approach, on the other hand, relies solely on a distance function between objects. No information is assumed about the internal structure of the objects, so a more general framework is provided. Methods that use metric spaces have also been shown to perform better, especially on high-dimensional data. A common query type used in similarity searching is the range query, where all the neighbors in a certain area defined by a query object and a radius are retrieved. Another important type, k-nearest neighbor queries, returns the k closest objects to a given query center. These are more difficult to process since the distance of the k-th nearest neighbor varies highly. For that reason, some techniques estimate a radius that will return exactly k objects, reducing the computation to a range query. A major problem with these methods is that multiple passes over the index data are required if the estimate is too low. In this thesis we propose a new framework for k-nearest neighbor search based on radius estimation where only one sequential pass over the index data is required. We accomplish this by caching a short-list of promising candidates. We also propose several algorithms to estimate the query radius which outperform previously proposed methods. We show that our estimates are accurate enough to keep the number of promising objects at acceptable levels.
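    A minimal sketch of the single-pass idea under stated assumptions: `estimate_radius` and `shortlist_factor` are illustrative stand-ins for the thesis's estimators and tuning, not its actual algorithms.

```python
def knn_by_radius(query, k, data, dist, estimate_radius, shortlist_factor=4):
    """k-NN via radius estimation in one sequential pass (sketch).

    Objects within the estimated radius are exact range-query answers; the
    nearest few objects *outside* the radius are cached as a short-list of
    promising candidates, so an underestimated radius can be repaired from
    the cache instead of forcing a second pass over the index.
    """
    r = estimate_radius(query, k)
    inside, shortlist = [], []
    for i, obj in enumerate(data):            # the single sequential pass
        d = dist(query, obj)
        if d <= r:
            inside.append((d, i, obj))
        else:
            shortlist.append((d, i, obj))
            shortlist.sort()                  # keep only the closest misses
            del shortlist[shortlist_factor * k:]
    return [obj for _, _, obj in sorted(inside + shortlist)[:k]]
```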

    Accelerating Nearest Neighbor Search on Manycore Systems

    We develop methods for accelerating metric similarity search that are effective on modern hardware. Our algorithms factor into easily parallelizable components, making them simple to deploy and efficient on multicore CPUs and GPUs. Despite the simple structure of our algorithms, their search performance is provably sublinear in the size of the database, with a factor dependent only on its intrinsic dimensionality. We demonstrate that our methods provide substantial speedups on a range of datasets and hardware platforms. In particular, we present results on a 48-core server machine, on graphics hardware, and on a multicore desktop.
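    The "factor into easily parallelizable components" pattern can be illustrated by a sharded brute-force scan in which each worker reduces its shard independently and the partial results merge with a single min. This is a generic illustration of the map-reduce structure only, not the paper's sublinear algorithm.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def _shard_nn(args):
    """Each worker scans its shard independently; no shared state needed."""
    query, shard, offset = args
    d = np.linalg.norm(shard - query, axis=1)
    i = int(np.argmin(d))
    return d[i], offset + i

def parallel_nn(query, data, workers=8):
    """Brute-force 1-NN as a map over shards plus a trivial min-reduce."""
    shards = np.array_split(data, workers)
    offsets = np.cumsum([0] + [len(s) for s in shards[:-1]])
    tasks = [(query, s, int(o)) for s, o in zip(shards, offsets)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return min(pool.map(_shard_nn, tasks))  # (distance, global index)
```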