Coresets for Fuzzy K-Means with Applications
The fuzzy K-means problem is a popular generalization of the well-known K-means problem to soft clusterings. We present the first coresets for fuzzy K-means whose size is linear in the dimension, polynomial in the number of clusters, and poly-logarithmic in the number of points. We show that these coresets can be employed in the computation of a (1+epsilon)-approximation for fuzzy K-means, improving on previously presented results. We further show that our coresets can be maintained in an insertion-only streaming setting, where data points arrive one by one.
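To fix notation for the objective this abstract refers to, here is a minimal sketch of one alternating update of fuzzy k-means (the classic fuzzy c-means step with fuzzifier m). It only illustrates the soft-clustering objective; it is not the paper's coreset construction, and all names are illustrative.

```python
import numpy as np

def fuzzy_kmeans_step(X, centers, m=2.0, eps=1e-12):
    """One alternating update of fuzzy k-means (fuzzifier m > 1).

    Memberships: u[i, j] proportional to d(x_i, c_j)^(-2/(m-1)), normalized over j.
    Centers:     c_j = sum_i u[i, j]^m x_i / sum_i u[i, j]^m.
    Objective:   sum_{i,j} u[i, j]^m ||x_i - c_j||^2.
    """
    # Squared distances from every point to every center (n x k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + eps
    inv = d2 ** (-1.0 / (m - 1.0))
    u = inv / inv.sum(axis=1, keepdims=True)      # soft memberships, rows sum to 1
    w = u ** m
    new_centers = (w.T @ X) / w.sum(axis=0)[:, None]
    cost = float((w * d2).sum())                  # fuzzy k-means objective
    return u, new_centers, cost
```

Iterating this step decreases the objective monotonically, just as Lloyd's algorithm does for hard k-means.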
Fast Color Quantization Using Weighted Sort-Means Clustering
Color quantization is an important operation with numerous applications in
graphics and image processing. Most quantization methods are essentially based
on data clustering algorithms. However, despite its popularity as a general
purpose clustering algorithm, k-means has not received much respect in the
color quantization literature because of its high computational requirements
and sensitivity to initialization. In this paper, a fast color quantization
method based on k-means is presented. The method involves several modifications
to the conventional (batch) k-means algorithm including data reduction, sample
weighting, and the use of triangle inequality to speed up the nearest neighbor
search. Experiments on a diverse set of images demonstrate that, with the
proposed modifications, k-means becomes very competitive with state-of-the-art
color quantization methods in terms of both effectiveness and efficiency.
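The triangle-inequality speedup mentioned above can be illustrated with a minimal Elkan-style pruning sketch (the paper's exact implementation may differ; all names here are illustrative). The key fact: if d(c_best, c_j) >= 2 * d(x, c_best), then d(x, c_j) >= d(c_best, c_j) - d(x, c_best) >= d(x, c_best), so center j cannot be nearer and its distance need not be computed.

```python
import numpy as np

def assign_with_pruning(X, centers):
    """Nearest-center assignment that skips distance computations
    using the triangle inequality on precomputed center-to-center
    distances (an Elkan-style bound)."""
    # Pairwise distances between centers, computed once per assignment pass.
    cc = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    labels = np.empty(len(X), dtype=int)
    skipped = 0
    for i, x in enumerate(X):
        best = 0
        best_d = np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            if cc[best, j] >= 2.0 * best_d:
                skipped += 1          # pruned: c_j provably no closer than c_best
                continue
            d = np.linalg.norm(x - centers[j])
            if d < best_d:
                best, best_d = j, d
        labels[i] = best
    return labels, skipped
```

The assignment is exact: pruned centers are provably not nearer, so the labels match a brute-force nearest-center search while fewer distances are evaluated.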
Coresets for minimum enclosing balls over sliding windows
\emph{Coresets} are important tools to generate concise summaries of massive
datasets for approximate analysis. A coreset is a small subset of points
extracted from the original point set such that certain geometric properties
are preserved with provable guarantees. This paper investigates the problem of
maintaining a coreset to preserve the minimum enclosing ball (MEB) for a
sliding window of points that are continuously updated in a data stream.
Although the problem has been extensively studied in batch and append-only
streaming settings, no efficient sliding-window solution is available yet. In
this work, we first introduce an algorithm, called AOMEB, to build a coreset
for MEB in an append-only stream. AOMEB improves the practical performance of
the state-of-the-art algorithm while having the same approximation ratio.
Furthermore, using AOMEB as a building block, we propose two novel algorithms,
namely SWMEB and SWMEB+, to maintain coresets for MEB over the sliding window
with constant approximation ratios. The proposed algorithms also support
coresets for MEB in a reproducing kernel Hilbert space (RKHS). Finally,
extensive experiments on real-world and synthetic datasets demonstrate that
SWMEB and SWMEB+ achieve speedups of up to four orders of magnitude over the
state-of-the-art batch algorithm while providing coresets for MEB with rather
small errors compared to the optimal ones. (To appear in The 25th ACM SIGKDD
Conference on Knowledge Discovery and Data Mining, KDD '19.)
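The MEB coreset idea underlying AOMEB and SWMEB can be illustrated in the batch setting with the classic Badoiu-Clarkson iteration, sketched below. This is not the paper's algorithm, only a textbook illustration of how a few farthest points certify a (1+eps)-approximate enclosing ball.

```python
import numpy as np

def meb_coreset(X, eps=0.1):
    """Badoiu-Clarkson sketch for a (1+eps)-approximate minimum
    enclosing ball: repeatedly pull the center toward the current
    farthest point; the farthest points collected along the way
    form a coreset of size O(1/eps^2)."""
    c = X[0].astype(float)
    coreset = []
    iters = int(np.ceil(1.0 / eps ** 2))
    for t in range(1, iters + 1):
        far = int(np.argmax(np.linalg.norm(X - c, axis=1)))
        coreset.append(far)
        c = c + (X[far] - c) / (t + 1)   # shrinking step toward farthest point
    radius = float(np.linalg.norm(X - c, axis=1).max())
    return c, radius, sorted(set(coreset))
```

Running the MEB computation on only the returned coreset indices yields a ball whose radius is within a (1+eps) factor of the optimum for the full point set.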
Coresets for Time Series Clustering
We study the problem of constructing coresets for clustering problems with time series data. This problem has gained importance across many fields including biology, medicine, and economics due to the proliferation of sensors for real-time measurement and the rapid drop in storage costs. In particular, we consider the setting where the time series data on N entities is generated from a Gaussian mixture model with autocorrelations over k clusters in R^d. Our main contribution is an algorithm to construct coresets for the maximum likelihood objective for this mixture model. Our algorithm is efficient, and, under a mild assumption on the covariance matrices of the Gaussians, the size of the coreset is independent of the number of entities N and the number of observations for each entity, and depends only polynomially on k, d, and 1/ε, where ε is the error parameter. We empirically assess the performance of our coresets with synthetic data.
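The defining property of a coreset, that a small weighted sample preserves a clustering objective, can be illustrated with a deliberately simple uniform-sampling sketch on a k-means-style cost. The paper's construction targets the maximum-likelihood objective of the autocorrelated mixture model and uses a more careful sampling scheme; the code below is only a generic illustration of the weighted-approximation property, with all names illustrative.

```python
import numpy as np

def kmeans_cost(X, centers, weights=None):
    """Weighted k-means cost: sum_i w_i * min_j ||x_i - c_j||^2."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    w = np.ones(len(X)) if weights is None else weights
    return float(w @ d2)

def uniform_sample_coreset(X, m, rng):
    """Uniform-sampling sketch: m sampled points, each weighted n/m,
    so the weighted cost is an unbiased estimate of the full cost.
    (Coresets with worst-case guarantees use importance/sensitivity
    sampling instead of uniform sampling.)"""
    idx = rng.choice(len(X), size=m, replace=False)
    return X[idx], np.full(m, len(X) / m)
```

On well-clustered data the weighted cost of the sample tracks the full cost closely, which is exactly the approximation a coreset is required to provide for every candidate set of centers.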
- …