Coresets-Methods and History: A Theoreticians Design Pattern for Approximation and Streaming Algorithms
We present a technical survey on state-of-the-art approaches in data reduction and the coreset framework. These include geometric decompositions, gradient methods, random sampling, sketching and random projections. We further outline their importance for the design of streaming algorithms and give a brief overview of lower-bounding techniques.
Clustering with Faulty Centers
In this paper we introduce and formally study the problem of k-clustering with faulty centers. Specifically, we study the faulty versions of k-center, k-median, and k-means clustering, where centers have some probability of not existing, as opposed to prior work where clients had some probability of not existing. For all three problems we provide fixed parameter tractable algorithms, in the parameters k, d, and ε, that (1+ε)-approximate the minimum expected cost solutions for points in d dimensional Euclidean space. For Faulty k-center we additionally provide a 5-approximation for general metrics. Significantly, all of our algorithms have a small dependence on n. Specifically, our Faulty k-center algorithms have only linear dependence on n, while for our algorithms for Faulty k-median and Faulty k-means the dependence is still only n^(1 + o(1)).
Stability Yields Sublinear Time Algorithms for Geometric Optimization in Machine Learning
In this paper, we study several important geometric optimization problems arising in machine learning. First, we revisit the Minimum Enclosing Ball (MEB) problem in Euclidean space ℝ^d. The problem has been extensively studied before, but real-world machine learning tasks often need to handle large-scale datasets so that we cannot even afford linear time algorithms. Motivated by the recent developments on beyond worst-case analysis, we introduce the notion of stability for MEB, which is natural and easy to understand. Roughly speaking, an instance of MEB is stable, if the radius of the resulting ball cannot be significantly reduced by removing a small fraction of the input points. Under the stability assumption, we present two sampling algorithms for computing radius-approximate MEB with sample complexities independent of the number of input points n. In particular, the second algorithm has the sample complexity even independent of the dimensionality d. We also consider the general case without the stability assumption. We present a hybrid algorithm that can output either a radius-approximate MEB or a covering-approximate MEB, which improves the running time and the number of passes for the previous sublinear MEB algorithms. Further, we extend our proposed notion of stability and design sublinear time algorithms for other geometric optimization problems including MEB with outliers, polytope distance, one-class and two-class linear SVMs (without or with outliers). Our proposed algorithms also work fine for kernels.
Improved Outlier Robust Seeding for k-means
The $k$-means is a popular clustering objective, although it is inherently
non-robust and sensitive to outliers. Its popular seeding or initialization
called $k$-means++ uses $D^2$ sampling and comes with a provable $O(\log k)$
approximation guarantee \cite{AV2007}. However, in the presence of adversarial
noise or outliers, $D^2$ sampling is more likely to pick centers from distant
outliers instead of inlier clusters, and therefore its approximation guarantee
\textit{w.r.t.} the $k$-means solution on inliers does not hold.
Assuming that the outliers constitute a constant fraction of the given data,
we propose a simple variant in the $D^2$ sampling distribution, which makes it
robust to the outliers. Our algorithm runs in $O(ndk)$ time, outputs $O(k)$
clusters, discards marginally more points than the optimal number of outliers,
and comes with a provable $O(1)$ approximation guarantee.
Our algorithm can also be modified to output exactly $k$ clusters instead of
$O(k)$ clusters, while keeping its running time linear in $n$ and $d$. This is
an improvement over previous results for robust $k$-means based on LP
relaxation and rounding \cite{Charikar}, \cite{KrishnaswamyLS18} and
\textit{robust $k$-means++} \cite{DeshpandeKP20}. Our empirical results show
the advantage of our algorithm over $k$-means++~\cite{AV2007}, uniform random
seeding, greedy sampling for $k$-means~\cite{tkmeanspp}, and robust
$k$-means++~\cite{DeshpandeKP20}, on standard real-world and synthetic data
sets used in previous work. Our proposal is easily amenable to scalable,
faster, parallel implementations of $k$-means++ \cite{Bahmani,BachemL017} and
is of independent interest for coreset constructions in the presence of
outliers \cite{feldman2007ptas,langberg2010universal,feldman2011unified}.
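As a point of reference for the abstract above, the following is a minimal sketch of the standard k-means++ seeding step, in which each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far (commonly called D^2 sampling). It is not the robust variant proposed in the paper, whose modified distribution the abstract does not spell out. The function names `sq_dist` and `d2_seeding` and the `seed` parameter are illustrative assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_seeding(points, k, seed=0):
    """Standard k-means++ seeding (not the paper's robust variant).

    Picks the first center uniformly at random, then repeatedly samples a
    new center with probability proportional to the squared distance to
    the nearest center chosen so far.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    # d2[i] holds the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; nothing left to sample.
            break
        nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        # Update nearest-center distances with the newly added center.
        d2 = [min(old, sq_dist(p, nxt)) for old, p in zip(d2, points)]
    return centers


if __name__ == "__main__":
    # Two tight groups plus one far-away point acting as an outlier.
    data = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0), (100.0, 100.0)]
    print(d2_seeding(data, k=3, seed=1))
```

In the toy data, the far-away point carries a very large squared distance and therefore a large sampling weight, which illustrates the outlier sensitivity that the abstract describes.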
On Generalization Bounds for Projective Clustering
Given a set of points, clustering consists of finding a partition of a point
set into $k$ clusters such that the center to which a point is assigned is as
close as possible. Most commonly, centers are points themselves, which leads to
the famous $k$-median and $k$-means objectives. One may also choose centers to
be $j$-dimensional subspaces, which gives rise to subspace clustering. In this
paper, we consider learning bounds for these problems. That is, given a set of
$n$ samples $P$ drawn independently from some unknown, but fixed distribution
$\mathcal{D}$, how quickly does a solution computed on $P$ converge to the
optimal clustering of $\mathcal{D}$? We give several near optimal results. In
particular,
for center-based objectives, we show a convergence rate of
$\tilde{O}\left(\sqrt{k/n}\right)$. This matches the known optimal bounds
of [Fefferman, Mitter, and Narayanan, Journal of the Mathematical Society 2016]
and [Bartlett, Linder, and Lugosi, IEEE Trans. Inf. Theory 1998] for $k$-means
and extends it to other important objectives such as $k$-median.
For subspace clustering with $j$-dimensional subspaces, we show a convergence
rate of $\tilde{O}\left(\sqrt{kj^2/n}\right)$. These are the first
provable bounds for most of these problems. For the specific case of projective
clustering, which generalizes $k$-means, we show a convergence rate of
$\Omega\left(\sqrt{kj/n}\right)$ is necessary, thereby proving that the
bounds from [Fefferman, Mitter, and Narayanan, Journal of the Mathematical
Society 2016] are essentially optimal.
A New Coreset Framework for Clustering
Given a metric space, the $(k,z)$-clustering problem consists of finding $k$
centers such that the sum of the distances raised to the power $z$ of every
point to its closest center is minimized. This encapsulates the famous
$k$-median ($z=1$) and $k$-means ($z=2$) clustering problems. Designing
small-space sketches of the data that approximately preserve the cost of the
solutions, also known as \emph{coresets}, has been an important research
direction over the last 15 years.
In this paper, we present a new, simple coreset framework that simultaneously
improves upon the best known bounds for a large variety of settings, ranging
from Euclidean space, doubling metric, and minor-free metric to the general
metric case.
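The clustering objective described in the abstract above can be sketched in a few lines of Python. This is only an illustration in Euclidean space, not the paper's framework: the function name `clustering_cost`, the exponent parameter `z`, and the optional `weights` argument are assumptions introduced here. The weights reflect the usual weighted-point formulation of a coreset, in which a small weighted set should give nearly the same cost as the full data for every candidate set of centers.

```python
import math


def clustering_cost(points, centers, z, weights=None):
    """Sum over points of weight * (distance to the nearest center) ** z.

    With z = 1 this is the k-median cost and with z = 2 the k-means cost.
    Points and centers are tuples of equal dimension (Euclidean distance).
    """
    if weights is None:
        weights = [1.0] * len(points)
    return sum(
        w * min(math.dist(p, c) for c in centers) ** z
        for p, w in zip(points, weights)
    )


if __name__ == "__main__":
    pts = [(0.0, 0.0), (3.0, 4.0)]
    print(clustering_cost(pts, [(0.0, 0.0)], z=1))  # k-median style cost
    print(clustering_cost(pts, [(0.0, 0.0)], z=2))  # k-means style cost
```

A weighted call such as `clustering_cost([(3.0, 4.0)], centers, 2, weights=[4.0])` gives the same value as four unweighted copies of that point, which is the trivial case of replacing data by a smaller weighted summary.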