9,197 research outputs found
Approximating $k$-Median via Pseudo-Approximation
We present a novel approximation algorithm for $k$-median that achieves an
approximation guarantee of $1+\sqrt{3}+\epsilon$, improving upon the decade-old
ratio of $3+\epsilon$.
Our approach is based on two components, each of which, we believe, is of
independent interest.
First, we show that in order to give an $\alpha$-approximation algorithm for
$k$-median, it is sufficient to give a \emph{pseudo-approximation algorithm}
that finds an $\alpha$-approximate solution by opening $k+O(1)$ facilities.
This is a rather surprising result as there exist instances for which opening
$k+1$ facilities may lead to a significantly smaller cost than if only $k$
facilities were opened.
Second, we give such a pseudo-approximation algorithm with $\alpha = 1+\sqrt{3}+\epsilon$.
Prior to our work, it was not even known whether opening $k+o(k)$ facilities
would help improve the approximation ratio.
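As a toy illustration of the phenomenon described above (this sketch and the instance are ours, not the paper's construction), brute-forcing the $k$-median objective on a tiny one-dimensional instance shows how much one extra facility can help:

from itertools import combinations

def kmedian_cost(points, facilities):
    # Sum of distances from each point to its nearest open facility.
    return sum(min(abs(p - f) for f in facilities) for p in points)

def best_cost(points, k):
    # Brute-force the optimal k-median cost, opening facilities at input points.
    return min(kmedian_cost(points, F) for F in combinations(points, k))

# Three well-separated pairs of nearby points.  With k = 2 facilities one whole
# pair is left ~100 away from any open facility; with k + 1 = 3 facilities every
# pair gets its own facility and the cost collapses.
points = [0.0, 1.0, 100.0, 101.0, 200.0, 201.0]
print(best_cost(points, 2))   # 200.0
print(best_cost(points, 3))   # 3.0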
Training Gaussian Mixture Models at Scale via Coresets
How can we train a statistical mixture model on a massive data set? In this
work we show how to construct coresets for mixtures of Gaussians. A coreset is
a weighted subset of the data, which guarantees that models fitting the coreset
also provide a good fit for the original data set. We show that, perhaps
surprisingly, Gaussian mixtures admit coresets of size polynomial in dimension
and the number of mixture components, while being independent of the data set
size. Hence, one can harness computationally intensive algorithms to compute a
good approximation on a significantly smaller data set. More importantly, such
coresets can be efficiently constructed both in distributed and streaming
settings and do not impose restrictions on the data generating process. Our
results rely on a novel reduction of statistical estimation to problems in
computational geometry and new combinatorial complexity results for mixtures of
Gaussians. Empirical evaluation on several real-world datasets suggests that
our coreset-based approach enables a significant reduction in training time
with negligible approximation error.
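As a minimal sketch of the coreset property stated above (using plain uniform sampling with reweighting rather than the sensitivity-based sampling the paper relies on; the data, model, and variable names below are illustrative assumptions), one can check that a small weighted subset approximates the full-data log-likelihood of a fixed Gaussian mixture:

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Synthetic data from a two-component Gaussian mixture in R^2.
n = 100_000
z = rng.random(n) < 0.3
X = np.where(z[:, None],
             rng.normal([0.0, 0.0], 1.0, size=(n, 2)),
             rng.normal([5.0, 5.0], 2.0, size=(n, 2)))

def gmm_loglik(X, weights, means, covs):
    # Per-point log-likelihood under a Gaussian mixture model.
    comp = np.stack([np.log(w) + multivariate_normal.logpdf(X, m, c)
                     for w, m, c in zip(weights, means, covs)], axis=1)
    return logsumexp(comp, axis=1)

# A fixed candidate model; the coreset guarantee must hold for *all* candidate
# models, here we only spot-check one.
weights = [0.5, 0.5]
means = [np.array([0.2, -0.1]), np.array([4.8, 5.1])]
covs = [np.eye(2), 4.0 * np.eye(2)]

# "Coreset" by uniform sampling: m points, each carrying weight n / m.
# (This is a simplification; the paper's construction uses importance sampling.)
m = 2_000
idx = rng.choice(n, size=m, replace=False)
C, w = X[idx], np.full(m, n / m)

full_ll = gmm_loglik(X, weights, means, covs).sum()
coreset_ll = (w * gmm_loglik(C, weights, means, covs)).sum()
print(full_ll, coreset_ll)   # the weighted sum closely tracks the full sum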
The Hardness of Approximation of Euclidean k-means
The Euclidean $k$-means problem is a classical problem that has been
extensively studied in the theoretical computer science, machine learning and
the computational geometry communities. In this problem, we are given a set of
$n$ points in Euclidean space $\mathbb{R}^d$, and the goal is to choose $k$ centers
in $\mathbb{R}^d$ so that the sum of squared distances of each point to its nearest center
is minimized. The best approximation algorithms for this problem include a
polynomial time constant factor approximation for general $k$ and a
$(1+\epsilon)$-approximation which runs in time polynomial in $n$ but exponential in $k$ and $1/\epsilon$. At
the other extreme, the only known computational complexity result for this
problem is NP-hardness [ADHP'09]. The main difficulty in obtaining hardness
results stems from the Euclidean nature of the problem, and the fact that any
point in $\mathbb{R}^d$ can be a potential center. This gap in understanding left open
the intriguing possibility that the problem might admit a PTAS for all $k$ and $d$.
In this paper we provide the first hardness of approximation for the
Euclidean $k$-means problem. Concretely, we show that there exists a constant
$\epsilon > 0$ such that it is NP-hard to approximate the $k$-means objective
to within a factor of $1+\epsilon$. We show this via an efficient reduction
from the vertex cover problem on triangle-free graphs: given a triangle-free
graph, the goal is to choose the fewest number of vertices which are incident
on all the edges. Additionally, we give a proof that the current best hardness
results for vertex cover can be carried over to triangle-free graphs. To show
this we transform $G$, a known hard vertex cover instance, by taking a graph
product with a suitably chosen graph $H$, and showing that the size of the
(normalized) maximum independent set is almost exactly preserved in the product
graph using a spectral analysis, which might be of independent interest.
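For reference, the objective described in the abstract (the sum of squared Euclidean distances of each point to its nearest of the $k$ chosen centers) can be written as a short sketch of ours:

import numpy as np

def kmeans_cost(points, centers):
    # Euclidean k-means objective: squared distance of each point to its
    # nearest center, summed over all points.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).sum()

# Tiny instance in R^2 with k = 2 candidate centers.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
centers = np.array([[0.0, 0.5], [10.0, 0.5]])
print(kmeans_cost(points, centers))   # 4 * 0.25 = 1.0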
Approximation algorithms for stochastic clustering
We consider stochastic settings for clustering, and develop provably-good
approximation algorithms for a number of these notions. These algorithms yield
better approximation ratios compared to the usual deterministic clustering
setting. Additionally, they offer a number of advantages including clustering
which is fairer and has better long-term behavior for each user. In particular,
they ensure that *every user* is guaranteed to get good service (on average).
We also complement some of these with impossibility results.
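A toy sketch of the per-user guarantee mentioned above (ours, not one of the paper's algorithms): with $n$ well-separated points and $k = n-1$ centers, every deterministic solution leaves some point far from all centers, while a uniform lottery over solutions gives every point a small expected distance.

import numpy as np

# n points spaced far apart on a line; we may open k = n - 1 centers at input points.
n, gap = 5, 100.0
points = np.arange(n) * gap

# The n candidate solutions: leave exactly one point without its own center.
# dist[i, j] = distance of point j to its nearest center when point i is left out.
dist = np.zeros((n, n))
for i in range(n):
    centers = np.delete(points, i)
    dist[i] = np.abs(points[:, None] - centers[None, :]).min(axis=1)

# Deterministic: every single solution leaves some point a distance `gap` away.
print(dist.max(axis=1))    # [100. 100. 100. 100. 100.]

# Stochastic: under a uniform lottery over the n solutions, every point's
# expected distance is gap / n -- each user gets good service on average.
print(dist.mean(axis=0))   # [20. 20. 20. 20. 20.]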
Predictive Entropy Search for Efficient Global Optimization of Black-box Functions
We propose a novel information-theoretic approach for Bayesian optimization
called Predictive Entropy Search (PES). At each iteration, PES selects the next
evaluation point that maximizes the expected information gained with respect to
the global maximum. PES codifies this intractable acquisition function in terms
of the expected reduction in the differential entropy of the predictive
distribution. This reformulation allows PES to obtain approximations that are
both more accurate and efficient than other alternatives such as Entropy Search
(ES). Furthermore, PES can easily perform a fully Bayesian treatment of the
model hyperparameters while ES cannot. We evaluate PES in both synthetic and
real-world applications, including optimization problems in machine learning,
finance, biotechnology, and robotics. We show that the increased accuracy of
PES leads to significant gains in optimization performance.
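The reformulation referred to above is the symmetry of mutual information between the unknown global maximizer $x_\star$ and the next observation; as a sketch of that identity (with $\mathcal{D}_n$ denoting the data observed so far):

$$
\alpha_n(x)
= H\big[p(x_\star \mid \mathcal{D}_n)\big]
  - \mathbb{E}_{p(y \mid \mathcal{D}_n, x)}\Big[H\big[p(x_\star \mid \mathcal{D}_n \cup \{(x, y)\})\big]\Big]
= H\big[p(y \mid \mathcal{D}_n, x)\big]
  - \mathbb{E}_{p(x_\star \mid \mathcal{D}_n)}\Big[H\big[p(y \mid \mathcal{D}_n, x, x_\star)\big]\Big],
$$

so the intractable entropy over $x_\star$ is replaced by differential entropies of the much lower-dimensional predictive distribution over $y$.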