A simple D^2-sampling based PTAS for k-means and other Clustering Problems
Given a set of points $P \subset \mathbb{R}^d$, the $k$-means clustering
problem is to find a set of $k$ {\em centers} $C = \{c_1, \ldots, c_k\}$ such that the objective function $\sum_{x \in P} d(x, C)^2$,
where $d(x, C)$ denotes the distance between $x$ and the closest center in $C$,
is minimized. This is one of the most prominent objective functions that have
been studied with respect to clustering.
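The objective above is straightforward to compute directly; a minimal sketch for one-dimensional points (the function name is illustrative, not from the paper):

```python
def kmeans_cost(points, centers):
    """Sum of squared distances from each point to its closest center
    (the k-means objective), for 1-D points."""
    return sum(min((x - c) ** 2 for c in centers) for x in points)
```

For example, `kmeans_cost([0.0, 2.0], [0.0])` charges the point at 2.0 its squared distance 4.0 to the only center.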
$D^2$-sampling \cite{ArthurV07} is a simple non-uniform sampling technique
for choosing points from a set of points. It works as follows: given a set of
points $P \subseteq \mathbb{R}^d$, the first point is chosen uniformly at
random from $P$. Subsequently, a point from $P$ is chosen as the next sample
with probability proportional to the square of the distance of this point to
the nearest previously sampled point.
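The sampling procedure just described can be sketched as follows for one-dimensional points (an illustrative implementation, not the paper's code):

```python
import random

def d2_sample(points, k):
    """Choose k centers from `points` by D^2-sampling: the first center
    uniformly at random, each subsequent one with probability proportional
    to its squared distance to the nearest center chosen so far."""
    centers = [random.choice(points)]
    while len(centers) < k:
        # squared distance from each point to its nearest current center
        weights = [min((x - c) ** 2 for c in centers) for x in points]
        total = sum(weights)
        if total == 0:  # every point coincides with some center
            centers.append(random.choice(points))
            continue
        # sample an index with probability weights[i] / total
        r = random.uniform(0, total)
        acc = 0.0
        for x, w in zip(points, weights):
            acc += w
            if acc >= r:
                centers.append(x)
                break
    return centers
```

Points far from all current centers carry large weights, so the next sample tends to land in an uncovered cluster.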
$D^2$-sampling has been shown to have nice properties with respect to the
$k$-means clustering problem. Arthur and Vassilvitskii \cite{ArthurV07} show
that $k$ points chosen as centers from $P$ using $D^2$-sampling give an
$O(\log k)$ approximation in expectation. Ailon et al. \cite{AJMonteleoni09}
and Aggarwal et al. \cite{AggarwalDK09} extended the results of \cite{ArthurV07}
to show that $O(k)$ points chosen as centers using $D^2$-sampling give an $O(1)$
approximation to the $k$-means objective function with high probability. In
this paper, we further demonstrate the power of $D^2$-sampling by giving a
simple randomized $(1+\epsilon)$-approximation algorithm that uses
$D^2$-sampling at its core.
BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees
The rising volume of datasets has made training machine learning (ML) models
a major computational cost in the enterprise. Given the iterative nature of
model and parameter tuning, many analysts use a small sample of their entire
data during their initial stage of analysis to make quick decisions (e.g., what
features or hyperparameters to use) and use the entire dataset only in later
stages (i.e., when they have converged to a specific model). This sampling,
however, is performed in an ad hoc fashion. Most practitioners cannot precisely
quantify the effect of sampling on the quality of their model and, ultimately, on
their decision-making process during the tuning phase. Moreover, without
systematic support for sampling operators, many optimization and reuse
opportunities are lost.
In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML
training. BlinkML allows users to make error-computation tradeoffs: instead of
training a model on their full data (i.e., full model), BlinkML can quickly
train an approximate model with quality guarantees using a sample. The quality
guarantees ensure that, with high probability, the approximate model makes the
same predictions as the full model. BlinkML currently supports any ML model
that relies on maximum likelihood estimation (MLE), which includes Generalized
Linear Models (e.g., linear regression, logistic regression, max entropy
classifier, Poisson regression) as well as PPCA (Probabilistic Principal
Component Analysis). Our experiments show that BlinkML can speed up the
training of large-scale ML tasks by 6.26x-629x while guaranteeing the same
predictions, with 95% probability, as the full model. Comment: 22 pages, SIGMOD 201