A tight lower bound instance for k-means++ in constant dimension
The k-means++ seeding algorithm is one of the most popular algorithms for finding the initial centers when using the k-means heuristic. The algorithm is a simple sampling procedure and can be described as follows: Pick the first center uniformly at random from the given points. For $i > 1$, pick a point to be the $i$-th center with probability proportional to the square of the Euclidean distance of this point to the closest previously chosen center.
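As a concrete illustration, here is a minimal Python sketch of this $D^2$-sampling procedure (the function name and NumPy-based implementation are ours, for illustration only; they are not from the paper):

    import numpy as np

    def kmeanspp_seed(points, k, rng=None):
        # Pick k initial centers from an (n x d) array by D^2-sampling.
        rng = np.random.default_rng(rng)
        n = len(points)
        # First center: uniformly at random from the given points.
        centers = [points[rng.integers(n)]]
        for _ in range(1, k):
            # Squared Euclidean distance from each point to its closest chosen center.
            d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
            # Next center: sampled with probability proportional to these squared distances.
            centers.append(points[rng.choice(n, p=d2 / d2.sum())])
        return np.array(centers)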
The k-means++ seeding algorithm is not only simple and fast but also gives an $O(\log k)$ approximation in expectation, as shown by Arthur and Vassilvitskii. There are datasets on which this seeding algorithm gives an approximation factor of $\Omega(\log k)$ in expectation. However, it is not clear from these results whether the algorithm achieves a good approximation factor with reasonably high probability (say, $1/\mathrm{poly}(k)$). Brunsch and R\"{o}glin gave a dataset where the k-means++ seeding algorithm achieves an $O(\log k)$ approximation ratio with probability that is exponentially small in $k$. However, this and all
other known lower-bound examples are high dimensional. So, an open problem was
to understand the behavior of the algorithm on low dimensional datasets. In
this work, we give a simple two-dimensional dataset on which the seeding algorithm achieves an $O(\log k)$ approximation ratio with probability exponentially small in $k$. This solves open problems posed by Mahajan et al. and by Brunsch and R\"{o}glin.
Comment: To appear in TAMC 2014. arXiv admin note: text overlap with arXiv:1306.420
Fair redistricting is hard
Gerrymandering is a long-standing issue within the U.S. political system, and
it has received scrutiny recently by the U.S. Supreme Court. In this note, we
prove that deciding whether there exists a fair redistricting among legal maps
is NP-hard. To make this precise, we use simplified notions of "legal" and
"fair" that account for desirable traits such as geographic compactness of
districts and sufficient representation of voters. The proof of our result is
inspired by the work of Mahajan, Nimbhorkar, and Varadarajan, which proves that planar k-means is NP-hard.
StreamLearner: Distributed Incremental Machine Learning on Event Streams: Grand Challenge
Today, massive amounts of streaming data from smart devices need to be
analyzed automatically to realize the Internet of Things. The Complex Event
Processing (CEP) paradigm promises low-latency pattern detection on event
streams. However, CEP systems need to be extended with Machine Learning (ML)
capabilities such as online training and inference in order to be able to
detect fuzzy patterns (e.g., outliers) and to improve pattern recognition
accuracy during runtime using incremental model training. In this paper, we
propose a distributed CEP system denoted as StreamLearner for ML-enabled
complex event detection. The proposed programming model and data-parallel
system architecture enable a wide range of real-world applications and allow
for dynamically scaling up and out system resources for low-latency,
high-throughput event processing. We show that the DEBS Grand Challenge 2017
case study (i.e., anomaly detection in smart factories) integrates seamlessly
into the StreamLearner API. Our experiments verify scalability and high event
throughput of StreamLearner.
Comment: Christian Mayer, Ruben Mayer, and Majd Abdo. 2017. StreamLearner: Distributed Incremental Machine Learning on Event Streams: Grand Challenge. In Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems (DEBS '17), 298-30
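To illustrate the kind of online training and inference such an ML-enabled CEP operator performs, here is a minimal Python sketch of an incrementally trained outlier detector for a single event stream (the class and its interface are hypothetical assumptions, not the StreamLearner API):

    class OnlineOutlierDetector:
        # Running z-score outlier detector, updated incrementally per event
        # using Welford's algorithm for mean and variance.
        def __init__(self, threshold=3.0):
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0  # running sum of squared deviations
            self.threshold = threshold

        def process(self, value):
            # Inference: flag the event if it deviates strongly from the model so far.
            is_outlier = False
            if self.n > 1:
                std = (self.m2 / (self.n - 1)) ** 0.5
                if std > 0 and abs(value - self.mean) / std > self.threshold:
                    is_outlier = True
            # Online training: fold the event into the model.
            self.n += 1
            delta = value - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (value - self.mean)
            return is_outlier

    # Usage: flags = [detector.process(e) for e in event_stream]

In a data-parallel deployment, one such detector instance would be maintained per partition (e.g., per machine or sensor), which is what allows scaling out while keeping per-event latency low.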
Clustering processes
The problem of clustering is considered, for the case when each data point is
a sample generated by a stationary ergodic process. We propose a very natural
asymptotic notion of consistency, and show that simple consistent algorithms exist under the most general non-parametric assumptions. The notion of consistency
is as follows: two samples should be put into the same cluster if and only if
they were generated by the same distribution. With this notion of consistency,
clustering generalizes such classical statistical problems as homogeneity
testing and process classification. We show that, for the case of a known
number of clusters, consistency can be achieved under the only assumption that
the joint distribution of the data is stationary ergodic (no parametric or
Markovian assumptions, no assumptions of independence, neither between nor
within the samples). If the number of clusters is unknown, consistency can be
achieved under appropriate assumptions on the mixing rates of the processes (again, with no parametric or independence assumptions). In both cases we give
examples of simple (at most quadratic in each argument) algorithms which are
consistent.
Comment: In proceedings of ICML 2010. arXiv admin note: for version 2 of this article please see: arXiv:1005.0826v
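As an illustration of such an algorithm for the known-k case, here is a simplified Python sketch (our own simplification under the assumption of discrete-valued samples, not the paper's exact construction): compare empirical frequencies of short patterns between samples, then assign clusters by farthest-point traversal, which uses O(kn) distance computations and is at most quadratic in each argument.

    from collections import Counter

    def empirical_dist(x, y, max_len=3):
        # Weighted total-variation distance between empirical pattern frequencies.
        d = 0.0
        for l in range(1, max_len + 1):
            fx = Counter(tuple(x[i:i + l]) for i in range(len(x) - l + 1))
            fy = Counter(tuple(y[i:i + l]) for i in range(len(y) - l + 1))
            nx, ny = sum(fx.values()), sum(fy.values())
            if nx == 0 or ny == 0:
                continue  # sample shorter than l: skip this word length
            tv = sum(abs(fx[p] / nx - fy[p] / ny) for p in set(fx) | set(fy)) / 2
            d += 2.0 ** (-l) * tv
        return d

    def cluster_known_k(samples, k, max_len=3):
        # Farthest-point clustering with a known number of clusters k.
        centers = [0]  # start from an arbitrary sample
        while len(centers) < k:
            # Next center: the sample farthest from all current centers.
            centers.append(max(range(len(samples)),
                               key=lambda i: min(empirical_dist(samples[i], samples[c], max_len)
                                                 for c in centers)))
        # Assign every sample to its nearest center.
        return [min(range(k), key=lambda j: empirical_dist(s, samples[centers[j]], max_len))
                for s in samples]

Since the empirical pattern frequencies of a stationary ergodic process converge to the true ones, samples from the same distribution grow close under this distance while samples from different distributions stay apart, which is what makes such a simple scheme consistent in the asymptotic sense defined above.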