    Counterexamples expose gaps in the proof of time complexity for cover trees introduced in 2006

    This paper is motivated by the k-nearest neighbors search: given an arbitrary metric space and its finite subsets (a reference set R and a query set Q), design a fast algorithm to find all k-nearest neighbors in R for every point q ∈ Q. In 2006, Beygelzimer, Kakade, and Langford introduced cover trees to justify a near-linear time complexity for the neighbor search in the sizes of Q and R. Section 5.3 of Curtin's PhD thesis (2015) pointed out that the proof of this result was wrong. The key step in the original proof attempted to show that the number of iterations can be estimated by multiplying the length of the longest root-to-leaf path in a cover tree by a constant factor. However, this estimate can miss many potential nodes in several branches of a cover tree that should be considered during the neighbor search. The same argument was unfortunately repeated in several subsequent papers that relied on the 2006 cover trees. This paper explicitly constructs challenging datasets that provide counterexamples to the past proofs of time complexity for the cover tree construction, the k-nearest neighbor search presented at ICML 2006, and the dual-tree search algorithm published in NIPS 2009. The corrected near-linear time complexities with extra parameters are proved in a forthcoming paper by using a new compressed cover tree that simplifies the original tree structure.
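    To make the flawed step concrete, here is a minimal Python sketch of the level-by-level descent in the spirit of the 2006 algorithm; the Node class and all names are illustrative assumptions, not the authors' code. The candidate set can survive in several branches at once, which is why bounding the iteration count by a single longest root-to-leaf path fails on the counterexample datasets.

```python
class Node:
    """Illustrative explicit cover tree node (not the authors' structure)."""
    def __init__(self, point, level, children=()):
        self.point = point              # data point stored at this node
        self.level = level              # children lie within distance 2**level
        self.children = list(children)

def nearest(root, q, dist):
    """Level-by-level 1-NN descent in the spirit of ICML 2006 (simplified).

    `candidates` plays the role of the set Q_i in the original paper.  After
    each expansion it can contain surviving nodes from *several* branches;
    the 2006 proof bounded the total work by the depth of one root-to-leaf
    path, which undercounts exactly these parallel branches.
    """
    candidates = {root}
    best = dist(q, root.point)
    while candidates:
        # Expand: all children of the current candidates, one level down.
        children = {c for n in candidates for c in n.children}
        if not children:
            break
        best = min([best] + [dist(q, c.point) for c in children])
        level = max(c.level for c in children)
        # Prune: keep a child only if its subtree (all descendants lie within
        # 2**level + 2**(level-1) + ... < 2**(level+1) of it) could still
        # contain a point closer than the current best.
        candidates = {c for c in children
                      if dist(q, c.point) <= best + 2 ** (level + 1)}
    return best
```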

    Clustering gene expression data with a penalized graph-based metric

    Background: The search for cluster structure in microarray datasets is a base problem for the so-called "-omic" sciences. A difficult problem in clustering is how to handle data with a manifold structure, i.e. data that is not shaped as compact clouds of points but forms arbitrary shapes or paths embedded in a high-dimensional space, as can be the case for some gene expression datasets.

    Results: In this work we introduce the Penalized k-Nearest-Neighbor-Graph (PKNNG) based metric, a new tool for evaluating distances in such cases. The new metric can be used in combination with most clustering algorithms. The PKNNG metric is based on a two-step procedure: first it constructs the k-Nearest-Neighbor-Graph of the dataset of interest using a low k value, and then it adds edges with highly penalized weights to connect the subgraphs produced by the first step. We discuss several possible schemes for connecting the different subgraphs, as well as penalization functions. We show clustering results on several public gene expression datasets and simulated artificial problems to evaluate the behavior of the new metric; a sketch of the two-step construction follows below.

    Conclusions: In all cases the PKNNG metric shows promising clustering results. The use of the PKNNG metric can improve the performance of commonly used pairwise-distance based clustering methods to the level of more advanced algorithms. A great advantage of the new procedure is that researchers do not need to learn a new method; they can simply compute distances with the PKNNG metric and then, for example, use hierarchical clustering to produce an accurate and highly interpretable dendrogram of their high-dimensional data.
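    A minimal sketch of the two-step construction under stated assumptions: the default `penalty` and the closest-pair bridging scheme are illustrative stand-ins (the paper discusses several connection schemes and penalization functions), not the authors' exact implementation.

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

def pknng_distances(X, k=3, penalty=lambda w: 100.0 * w):
    """Two-step PKNNG-style metric (sketch).

    Step 1 builds a deliberately sparse k-NN graph; step 2 reconnects the
    resulting subgraphs through heavily penalized edges.  Pairwise distances
    are then shortest-path lengths in the final graph.
    """
    # Step 1: symmetric k-NN graph with a low k, edge weights = distances.
    A = kneighbors_graph(X, n_neighbors=k, mode="distance")
    G = nx.from_scipy_sparse_array(A.maximum(A.T))

    # Step 2: bridge every pair of connected components through their
    # closest pair of points, with a penalized weight.
    comps = [list(c) for c in nx.connected_components(G)]
    for i in range(len(comps)):
        for j in range(i + 1, len(comps)):
            a, b = min(((a, b) for a in comps[i] for b in comps[j]),
                       key=lambda p: np.linalg.norm(X[p[0]] - X[p[1]]))
            G.add_edge(a, b, weight=penalty(np.linalg.norm(X[a] - X[b])))

    return dict(nx.all_pairs_dijkstra_path_length(G))
```

    The resulting distance dictionary can be fed to any pairwise-distance clustering method, e.g. hierarchical clustering, exactly as the conclusions suggest.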

    Paired compressed cover trees guarantee a near-linear parametrized complexity for all k-nearest neighbors search in an arbitrary metric space

    This paper studies the important problem of finding all k-nearest neighbors to points of a query set Q in another reference set R within any metric space. Our previous work defined compressed cover trees and corrected the key arguments in several past papers for challenging datasets. In 2009 Ram, Lee, March, and Gray attempted to improve the time complexity by using pairs of cover trees on the query and reference sets. In 2015 Curtin with the above co-authors used extra parameters to finally prove a time complexity for k = 1. The current work fills all previous gaps and improves the nearest neighbor search based on pairs of new compressed cover trees. The novel imbalance parameter of paired trees allowed us to prove a better time complexity for any number of neighbors k ≥ 1.
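    The general dual-tree pattern that this complexity analysis concerns can be sketched as follows. This is a generic illustration, not the compressed cover tree algorithm itself; it assumes tree nodes exposing point, radius, points (all points under the node), children, and is_leaf().

```python
def dual_tree_nn(qnode, rnode, dist, best):
    """Generic dual-tree 1-NN traversal (illustrative pattern only).

    `best` maps every query point to its current nearest distance
    (initialize with float('inf')).  Pruning happens once per *pair* of
    nodes, which is the source of the speedup over |Q| independent
    single-tree searches and the quantity that parameters such as the
    imbalance of the paired trees control.
    """
    # Lower bound on d(q, r) for any q under qnode and r under rnode,
    # by the triangle inequality.
    lower = dist(qnode.point, rnode.point) - qnode.radius - rnode.radius
    if lower > max(best[q] for q in qnode.points):
        return  # no query under qnode can improve: prune the whole pair
    if qnode.is_leaf() and rnode.is_leaf():
        for q in qnode.points:
            for r in rnode.points:
                d = dist(q, r)
                if d < best[q]:
                    best[q] = d
        return
    # Recurse on children; a leaf is paired with the other side's children.
    for qc in (qnode.children if not qnode.is_leaf() else [qnode]):
        for rc in (rnode.children if not rnode.is_leaf() else [rnode]):
            dual_tree_nn(qc, rc, dist, best)
```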

    Solving Fréchet Distance Problems by Algebraic Geometric Methods

    We study several polygonal curve problems under the Fréchet distance via algebraic geometric methods. Let $\mathbb{X}_m^d$ and $\mathbb{X}_k^d$ be the spaces of all polygonal curves of $m$ and $k$ vertices in $\mathbb{R}^d$, respectively. We assume that $k \leq m$. Let $\mathcal{R}^d_{k,m}$ be the set of ranges in $\mathbb{X}_m^d$ for all possible metric balls of polygonal curves in $\mathbb{X}_k^d$ under the Fréchet distance. We prove a nearly optimal bound of $O(dk\log(km))$ on the VC dimension of the range space $(\mathbb{X}_m^d, \mathcal{R}_{k,m}^d)$, improving on the previous $O(d^2k^2\log(dkm))$ upper bound and approaching the current $\Omega(dk\log k)$ lower bound. Our upper bound also holds for the weak Fréchet distance. We also obtain exact solutions that are hitherto unknown for curve simplification, range searching, nearest neighbor search, and distance oracles. Comment: To appear at SODA24; corrects some references.
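    For orientation, the Fréchet distance compares two polygonal curves via the best simultaneous traversal of both. The sketch below computes the *discrete* variant by dynamic programming, purely as an illustration of the metric whose balls define the ranges $\mathcal{R}^d_{k,m}$; the paper itself treats the continuous and weak Fréchet distances.

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between curves P (m x d) and Q (k x d).

    dp[i, j] = min over coupled walks ending at (P[i], Q[j]) of the maximum
    pointwise distance seen along the walk.
    """
    m, k = len(P), len(Q)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)  # all pairs
    dp = np.full((m, k), np.inf)
    dp[0, 0] = d[0, 0]
    for i in range(m):
        for j in range(k):
            if i == 0 and j == 0:
                continue
            # Advance on P, on Q, or on both; take the cheapest predecessor.
            prev = min(dp[i - 1, j] if i > 0 else np.inf,
                       dp[i, j - 1] if j > 0 else np.inf,
                       dp[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            dp[i, j] = max(prev, d[i, j])
    return dp[-1, -1]
```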

    Efficient k-nearest neighbor query processing in metric spaces based on precise radius estimation

    Thesis (Master's) -- Bilkent University, Department of Computer Engineering, Institute of Engineering and Science, Ankara, 2009. Includes bibliographical references (leaves 45-47). Author: Can Şardan.
    Similarity searching is an important problem for complex and unstructured data such as images, video, and text documents. One common solution is approximating complex objects by feature vectors. The metric spaces approach, on the other hand, relies solely on a distance function between objects. No information is assumed about the internal structure of the objects, so a more general framework is provided. Methods that use metric spaces have also been shown to perform better, especially on high-dimensional data. A common query type used in similarity searching is the range query, where all the neighbors in a certain area defined by a query object and a radius are retrieved. Another important type, k-nearest neighbor queries, returns the k closest objects to a given query center. These are more difficult to process since the distance of the k-th nearest neighbor varies highly. For that reason, some techniques estimate a radius that will return exactly k objects, reducing the computation to a range query. A major problem with these methods is that multiple passes over the index data are required if the estimate is too low. In this thesis we propose a new framework for k-nearest neighbor search based on radius estimation where only one sequential pass over the index data is required. We accomplish this by caching a short-list of promising candidates. We also propose several algorithms to estimate the query radius which outperform previously proposed methods. We show that our estimates are accurate enough to keep the number of promising objects at acceptable levels.
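    A minimal sketch of the single-pass idea under stated assumptions: `estimate_radius` and `shortlist_factor` are illustrative stand-ins for the thesis's estimators and tuning, not its actual algorithms.

```python
def knn_by_radius(query, k, data, dist, estimate_radius, shortlist_factor=4):
    """k-NN via radius estimation in one sequential pass (sketch).

    Objects within the estimated radius are exact range-query answers; the
    nearest few objects *outside* the radius are cached as a short-list of
    promising candidates, so an underestimated radius can be repaired from
    the cache instead of forcing a second pass over the index.
    """
    r = estimate_radius(query, k)
    inside, shortlist = [], []
    for i, obj in enumerate(data):            # the single sequential pass
        d = dist(query, obj)
        if d <= r:
            inside.append((d, i, obj))
        else:
            shortlist.append((d, i, obj))
            shortlist.sort()                  # keep only the closest misses
            del shortlist[shortlist_factor * k:]
    return [obj for _, _, obj in sorted(inside + shortlist)[:k]]
```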

    Accelerating Nearest Neighbor Search on Manycore Systems

    We develop methods for accelerating metric similarity search that are effective on modern hardware. Our algorithms factor into easily parallelizable components, making them simple to deploy and efficient on multicore CPUs and GPUs. Despite the simple structure of our algorithms, their search performance is provably sublinear in the size of the database, with a factor dependent only on its intrinsic dimensionality. We demonstrate that our methods provide substantial speedups on a range of datasets and hardware platforms. In particular, we present results on a 48-core server machine, on graphics hardware, and on a multicore desktop.
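    The "factor into easily parallelizable components" pattern can be illustrated by a sharded brute-force scan in which each worker reduces its shard independently and the partial results merge with a single min. This is a generic illustration of the map-reduce structure only, not the paper's sublinear algorithm.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def _shard_nn(args):
    """Each worker scans its shard independently; no shared state needed."""
    query, shard, offset = args
    d = np.linalg.norm(shard - query, axis=1)
    i = int(np.argmin(d))
    return d[i], offset + i

def parallel_nn(query, data, workers=8):
    """Brute-force 1-NN as a map over shards plus a trivial min-reduce."""
    shards = np.array_split(data, workers)
    offsets = np.cumsum([0] + [len(s) for s in shards[:-1]])
    tasks = [(query, s, int(o)) for s, o in zip(shards, offsets)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return min(pool.map(_shard_nn, tasks))  # (distance, global index)
```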