
    Boundary-Sensitive Approach for Approximate Nearest-Neighbor Classification

    The problem of nearest-neighbor classification is a fundamental technique in machine learning. Given a training set P of n labeled points in ℝ^d and an approximation parameter 0 < ε ≤ 1/2, any unlabeled query point should be classified with the class of any of its ε-approximate nearest neighbors in P. Answering these queries efficiently has been the focus of extensive research, which has mainly proposed techniques tailored to the more general problem of ε-approximate nearest-neighbor search. While the latter can only hope to provide query time and space complexities dependent on n, the problem of nearest-neighbor classification admits other parameters more suitable to its analysis, such as the number k_ε of ε-border points, which describes the complexity of the boundaries between sets of points of different classes. This paper presents a new data structure called the Chromatic AVD. This is the first approach for ε-approximate nearest-neighbor classification whose space and query time complexities depend only on ε, k_ε and d, while being independent of both n and Δ, the spread of P.
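    As a point of reference for the query semantics, here is a minimal brute-force sketch (Python/NumPy; the names P, labels, q and eps are illustrative) of the ε-approximate nearest-neighbor classification rule the abstract defines. It is not the Chromatic AVD: it spends O(nd) time per query, which is exactly the dependence on n the paper's data structure avoids.

```python
import numpy as np

def eps_approx_nn_classify(P, labels, q, eps):
    """Classify q with the label of any eps-approximate nearest neighbor:
    a point p in P with dist(q, p) <= (1 + eps) * d*, where d* is the
    distance from q to its exact nearest neighbor in P."""
    dists = np.linalg.norm(P - q, axis=1)  # O(n d): distance to every training point
    d_star = dists.min()                   # exact nearest-neighbor distance
    # Every point in this set is an admissible answer; the classification
    # rule allows returning the label of ANY of them.
    candidates = np.flatnonzero(dists <= (1.0 + eps) * d_star)
    return labels[candidates[0]]

# Example: two well-separated classes in the plane.
P = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
print(eps_approx_nn_classify(P, labels, np.array([4.8, 5.2]), eps=0.25))  # -> 1
```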

    Faster Clustering via Preprocessing

    We examine the efficiency of clustering a set of points when the encompassing metric space may be preprocessed in advance. In computational problems of this genre, there is a first stage of preprocessing, whose input is a collection of points M; the next stage receives as input a query set Q ⊂ M, and should report a clustering of Q according to some objective, such as 1-median, in which case the answer is a point a ∈ M minimizing ∑_{q∈Q} d_M(a, q). We design fast algorithms that approximately solve such problems under standard clustering objectives like p-center and p-median, when the metric M has low doubling dimension. By leveraging the preprocessing stage, our algorithms achieve query time that is near-linear in the query size n = |Q|, and is (almost) independent of the total number of points m = |M|.
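    For concreteness, the 1-median objective above can be written as a brute-force computation (a Python/NumPy sketch assuming the Euclidean metric; all names are illustrative). The paper's algorithms answer such queries from a preprocessed structure instead of scanning all of M:

```python
import numpy as np

def one_median(M, Q):
    """Return the point a in M minimizing sum_{q in Q} d(a, q),
    evaluated here with the Euclidean metric for concreteness.
    Brute force costs O(|M| * |Q|) distance evaluations; the point of
    preprocessing is to answer such queries in time near-linear in |Q|
    and (almost) independent of |M|."""
    # costs[i] = sum of distances from candidate center M[i] to all of Q
    costs = np.linalg.norm(M[:, None, :] - Q[None, :, :], axis=2).sum(axis=1)
    return M[np.argmin(costs)]

# Example: query a small subset of a preprocessed point set.
M = np.random.default_rng(0).normal(size=(1000, 3))  # preprocessed points
Q = M[:50]                                           # query set Q ⊂ M
center = one_median(M, Q)
```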

    Fully Scalable MPC Algorithms for Clustering in High Dimension

    We design new parallel algorithms for clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model and are fully scalable, meaning that the local memory in each machine may be n^σ for an arbitrarily small fixed σ > 0. Importantly, the local memory may be substantially smaller than the number of clusters k, yet all our algorithms are fast, i.e., run in O(1) rounds. We first devise a fast MPC algorithm for O(1)-approximation of uniform facility location. This is the first fully scalable MPC algorithm that achieves O(1)-approximation for any clustering problem in a general geometric setting; previous algorithms only provide poly(log n)-approximation or apply to restricted inputs, such as low dimension or a small number of clusters k; see, e.g., [Bhaskara and Wijewardena, ICML'18; Cohen-Addad et al., NeurIPS'21; Cohen-Addad et al., ICML'22]. We then build on this facility-location result and devise a fast MPC algorithm that achieves an O(1)-bicriteria approximation for k-Median and for k-Means; namely, it computes (1+ε)k clusters of cost within an O(1/ε²) factor of the optimum for k clusters. A primary technical tool that we introduce, which may be of independent interest, is a new MPC primitive for geometric aggregation, namely computing for every data point a statistic of its approximate neighborhood, for statistics like range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension and is based on consistent hashing (aka sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
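    To illustrate the flavor of the geometric-aggregation primitive, here is a toy sequential sketch of range counting with a randomly shifted uniform grid (Python/NumPy; all names are illustrative). It enumerates the 3^d cells around each point, which is only feasible in low dimension; the consistent-hashing (sparse partition) technique the abstract refers to is what makes this kind of aggregation workable in high dimension and in MPC.

```python
import numpy as np
from collections import defaultdict

def grid_range_count(points, radius, seed=0):
    """Toy neighborhood statistic: for each point, count the points in
    the 3^d grid cells around its own cell, using a randomly shifted
    uniform grid with side length 2 * radius.

    Every true neighbor within `radius` is counted (cell indices of
    such a pair differ by at most 1 per coordinate), at the price of
    also counting some farther points; the count includes the point
    itself."""
    n, d = points.shape
    cell = 2.0 * radius
    shift = np.random.default_rng(seed).uniform(0.0, cell, size=d)
    keys = np.floor((points + shift) / cell).astype(int)

    buckets = defaultdict(int)  # cell index -> number of points hashed to it
    for k in map(tuple, keys):
        buckets[k] += 1

    # All offsets in {-1, 0, 1}^d, i.e. the block of cells around a cell.
    offsets = np.array(np.meshgrid(*([[-1, 0, 1]] * d))).T.reshape(-1, d)
    return np.array([sum(buckets.get(tuple(k + off), 0) for off in offsets)
                     for k in keys])
```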