Boundary-Sensitive Approach for Approximate Nearest-Neighbor Classification
The problem of nearest-neighbor classification is a fundamental technique in machine learning. Given a training set P of n labeled points in ℝ^d, and an approximation parameter 0 < ε ≤ 1/2, any unlabeled query point should be classified with the class of any of its ε-approximate nearest neighbors in P. Answering these queries efficiently has been the focus of extensive research, which has mainly produced techniques tailored to the more general problem of ε-approximate nearest-neighbor search. While the latter can only hope to provide query time and space complexities that depend on n, the problem of nearest-neighbor classification admits other parameters better suited to its analysis. One such parameter is the number k_ε of ε-border points, which captures the complexity of the boundaries between sets of points of different classes.
This paper presents a new data structure called the Chromatic AVD. It is the first approach for ε-approximate nearest-neighbor classification whose space and query time complexities depend only on ε, k_ε and d, while being independent of both n and Δ, the spread of P.
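To make the classification rule concrete, here is a brute-force sketch of ε-approximate nearest-neighbor classification (a minimal illustration of the query semantics only, not of the Chromatic AVD data structure; the function and variable names are ours):

```python
import math

def eps_nn_labels(train, query, eps):
    """Labels of all eps-approximate nearest neighbors of `query`.

    `train` is a list of (point, label) pairs. Any training point whose
    distance to `query` is within a (1 + eps) factor of the true minimum
    is a valid answer, so a correct classifier may return any label in
    this set. Queries near a class boundary (the eps-border region) are
    exactly those where the set contains more than one label.
    """
    dists = [math.dist(p, query) for p, _ in train]
    d_min = min(dists)
    return {label for (p, label), d in zip(train, dists)
            if d <= (1 + eps) * d_min}
```

For a query far from any boundary the set is a single label; near a boundary, multiple labels become admissible, which is why the count k_ε of ε-border points governs the hardness of the problem.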
Faster Clustering via Preprocessing
We examine the efficiency of clustering a set of points when the encompassing metric space may be preprocessed in advance. In computational problems of this genre, there is a first stage of preprocessing, whose input is a collection of points X; the next stage receives as input a query set Q ⊆ X, and should report a clustering of Q according to some objective, such as 1-median, in which case the answer is a point minimizing the sum of its distances to the points of Q.
We design fast algorithms that approximately solve such problems under standard clustering objectives like k-center and k-median, when the metric has low doubling dimension. By leveraging the preprocessing stage, our algorithms achieve query time that is near-linear in the query size |Q|, and (almost) independent of the total number of points |X|.
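The two-stage interface described above can be sketched as follows (an exact brute-force baseline to fix the problem statement, not the paper's algorithm; the class and method names are ours):

```python
import math

class MedianOracle:
    """Two-stage clustering interface: preprocess the point set once,
    then answer 1-median queries on subsets of it.

    This baseline answers each query exactly by scanning all of X, so a
    query costs time proportional to |X| * |Q|. The abstract's point is
    that when the metric has low doubling dimension, preprocessing lets
    an approximate answer be computed in time near-linear in |Q| alone.
    """

    def __init__(self, points):
        # Preprocessing stage: here we merely store X.
        self.points = list(points)

    def one_median(self, query):
        # Query stage: return the stored point minimizing the sum of
        # distances to the query set Q.
        return min(self.points,
                   key=lambda c: sum(math.dist(c, q) for q in query))
```

A preprocessed structure would replace the linear scan in `one_median` with a search over a small candidate set derived from X.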
Fully Scalable MPC Algorithms for Clustering in High Dimension
We design new parallel algorithms for clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model, and are fully scalable, meaning that the local memory in each machine may be n^σ for an arbitrarily small fixed σ > 0. Importantly, the local memory may be substantially smaller than the number of clusters k, yet all our algorithms are fast, i.e., run in O(1) rounds.
We first devise a fast MPC algorithm for O(1)-approximation of uniform facility location. This is the first fully scalable MPC algorithm that achieves O(1)-approximation for any clustering problem in a general geometric setting; previous algorithms only provide poly(log n)-approximation or apply to restricted inputs, like low dimension or a small number of clusters k; e.g., [Bhaskara and Wijewardena, ICML'18; Cohen-Addad et al., NeurIPS'21; Cohen-Addad et al., ICML'22]. We then build on this facility location result and devise a fast MPC algorithm that achieves an O(1)-bicriteria approximation for k-Median and for k-Means, namely, it computes (1+ε)k clusters of cost within an O(1/ε²)-factor of the optimum for k clusters.
A primary technical tool that we introduce, and that may be of independent interest, is a new MPC primitive for geometric aggregation, namely, computing for every data point a statistic of its approximate neighborhood, for statistics like range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension and is based on consistent hashing (aka sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
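The flavor of the geometric-aggregation primitive can be conveyed with a deliberately simplified stand-in: hash points into the cells of a randomly shifted grid and aggregate a statistic per cell (a toy sketch in our own notation, not the paper's consistent-hashing construction):

```python
import random
from collections import Counter

def grid_aggregate(points, cell_width, seed=0):
    """Approximate-neighborhood counting via a randomly shifted grid.

    Each point is hashed to a grid cell, a statistic (here, a point
    count) is aggregated per cell, and every point reads back the
    statistic of its own cell as a proxy for the number of points within
    distance about cell_width. Consistent hashing (sparse partition)
    strengthens this idea so that each point's whole neighborhood is
    covered by few cells even in high dimension, which is what makes the
    primitive usable with small per-machine memory in MPC.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    shift = [rng.uniform(0, cell_width) for _ in range(dim)]

    def cell(p):
        return tuple(int((x + s) // cell_width) for x, s in zip(p, shift))

    counts = Counter(cell(p) for p in points)   # per-cell aggregation
    return [counts[cell(p)] for p in points]    # each point reads its cell
```

In an actual MPC implementation, the per-cell aggregation step is what gets distributed: cells act as keys, so the statistic for each cell can be computed with machine-local memory far below the number of clusters.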