1,417 research outputs found
Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm
Over the past five decades, k-means has become the clustering algorithm of
choice in many application domains primarily due to its simplicity, time/space
efficiency, and invariance to the ordering of the data points. Unfortunately,
the algorithm's sensitivity to the initial selection of the cluster centers
remains its most serious drawback. Numerous initialization methods have
been proposed to address this drawback. Many of these methods, however, have
time complexity superlinear in the number of data points, which makes them
impractical for large data sets. On the other hand, linear methods are often
random and/or sensitive to the ordering of the data points. These methods are
generally unreliable in that the quality of their results is unpredictable.
Therefore, it is common practice to perform multiple runs of such methods and
take the output of the run that produces the best results. Such a practice,
however, greatly increases the computational requirements of the otherwise
highly efficient k-means algorithm. In this chapter, we investigate the
empirical performance of six linear, deterministic (non-random), and
order-invariant k-means initialization methods on a large and diverse
collection of data sets from the UCI Machine Learning Repository. The results
demonstrate that two relatively unknown hierarchical initialization methods due
to Su and Dy outperform the remaining four methods with respect to two
objective effectiveness criteria. In addition, a recent method due to Erisoglu
et al. performs surprisingly poorly.
Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms
(Springer, 2014). arXiv admin note: substantial text overlap with
arXiv:1304.7465, arXiv:1209.196
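As a concrete reference point, the following is a minimal Python sketch of one
deterministic, order-invariant seeding strategy in the spirit of the methods
studied above: the classic maximin rule started from the point nearest the data
mean. It illustrates the general idea only and is not a reproduction of any of
the six methods evaluated in the chapter; ties in the argmin/argmax are broken
by index, which is the only residual order dependence.

    import numpy as np

    def deterministic_maximin_seeding(X, k):
        """Pick k seeds with no randomness: start from the point nearest
        the overall mean, then repeatedly take the point farthest from
        its closest already-chosen seed (the classic maximin rule)."""
        centers = [X[np.argmin(((X - X.mean(axis=0)) ** 2).sum(axis=1))]]
        for _ in range(1, k):
            # Squared distance of every point to its nearest chosen seed.
            d2 = ((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
            centers.append(X[np.argmax(d2)])
        return np.asarray(centers)

    # Usage: deterministic seeds for k-means on a toy data set.
    X = np.random.default_rng(0).normal(size=(500, 2))
    seeds = deterministic_maximin_seeding(X, k=3)

Because the procedure contains no random choices, repeated runs on the same
data produce the same seeds, removing the need for the multiple-runs practice
described above.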
Large Scale Clustering with Variational EM for Gaussian Mixture Models
How can we efficiently find large numbers of clusters in large data sets with
high-dimensional data points? Our aim is to explore the current efficiency and
large-scale limits in fitting a parametric model for clustering to data
distributions. To do so, we combine recent lines of research which have
previously focused on separate specific methods for complexity reduction. We
first show theoretically how the clustering objective of variational EM (which
reduces complexity for many clusters) can be combined with coreset objectives
(which reduce complexity for many data points). Secondly, we realize a concrete
highly efficient iterative procedure which combines and translates the
theoretical complexity gains of truncated variational EM and coresets into a
practical algorithm. For very large scales, the high efficiency of parameter
updates then requires (A) highly efficient coreset construction and (B) highly
efficient initialization procedures (seeding) in order to avoid computational
bottlenecks. Fortunately, very efficient coreset construction has become
available in the form of lightweight coresets, and very efficient
initialization has become available in the form of AFK-MC² seeding. The
resulting algorithm features balanced computational costs across all
constituting components. In applications to standard large-scale benchmarks for
clustering, we investigate the algorithm's efficiency/quality trade-off.
Compared to the best recent approaches, we observe speedups of up to one order
of magnitude, and up to two orders of magnitude compared to the k-means++
baseline. To demonstrate that the observed efficiency enables applications
previously considered infeasible, we cluster the entire, unscaled 80 Million
Tiny Images dataset into up to 32,000 clusters. To the knowledge of the
authors, this represents the largest-scale fit of a parametric data model for
clustering reported so far.
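On the coreset side, the following is a minimal Python sketch of the
lightweight coreset construction of Bachem et al. referenced above: each point
is sampled with probability given by an equal mix of the uniform distribution
and its squared distance to the data mean, and inverse-probability weights keep
the estimated clustering cost unbiased. The parameter names and the
sample-with-replacement formulation are our own simplifications, not the
paper's implementation.

    import numpy as np

    def lightweight_coreset(X, m, rng=None):
        """Sample a weighted coreset of size m: sampling probabilities mix
        the uniform distribution with each point's squared distance to the
        data mean; inverse-probability weights keep the estimated k-means
        cost unbiased."""
        rng = np.random.default_rng(rng)
        n = X.shape[0]
        dist2 = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
        q = 0.5 / n + 0.5 * dist2 / dist2.sum()
        idx = rng.choice(n, size=m, p=q)
        return X[idx], 1.0 / (m * q[idx])

    # Usage: compress 100,000 points into a weighted sample of 1,000.
    X = np.random.default_rng(1).normal(size=(100_000, 16))
    coreset, weights = lightweight_coreset(X, m=1_000, rng=2)

The construction touches each point only twice (once for the mean, once for
the distances), which is what makes it attractive at the scales discussed in
the abstract.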
Clustering Categorical Data: Soft Rounding k-modes
Over the last three decades, researchers have intensively explored various
clustering tools for categorical data analysis. Despite the many proposed
alternatives, the classical k-modes algorithm remains a popular choice
for unsupervised learning of categorical data. Surprisingly, our first insight
is that in a natural generative block model, the k-modes algorithm performs
poorly for a large range of parameters. We remedy this issue by proposing a
soft rounding variant of the k-modes algorithm (SoftModes) and theoretically
prove that our variant addresses the drawbacks of the k-modes algorithm in the
generative model. Finally, we empirically verify that SoftModes performs well
on both synthetic and real-world datasets.
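Since the paper's starting point is the classical k-modes algorithm, a minimal
Python sketch of that baseline may help: Hamming-distance assignment followed
by a per-attribute mode ("hard rounding") update. The soft rounding step that
defines SoftModes is deliberately not reproduced here; roughly, it relaxes the
deterministic mode so that empirical category frequencies are retained, and the
exact rule should be taken from the paper.

    import numpy as np

    def k_modes(X, k, iters=10, rng=0):
        """Minimal hard k-modes baseline. X is an (n, d) array of
        integer-coded categorical attributes."""
        rng = np.random.default_rng(rng)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # Assign each point to the center with the fewest mismatched
            # attributes (Hamming distance).
            dist = (X[:, None, :] != centers[None, :, :]).sum(axis=2)
            labels = dist.argmin(axis=1)
            for j in range(k):
                members = X[labels == j]
                if len(members):
                    # "Hard rounding": each attribute of the center becomes
                    # the most frequent category among the cluster's members.
                    centers[j] = [np.bincount(col).argmax() for col in members.T]
        return centers, labels

    # Usage on toy categorical data with 4 categories per attribute.
    X = np.random.default_rng(1).integers(0, 4, size=(200, 6))
    centers, labels = k_modes(X, k=3)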
Deterministic Clustering in High Dimensional Spaces: Sketches and Approximation
In all state-of-the-art sketching and coreset techniques for clustering, as
well as in the best known fixed-parameter tractable approximation algorithms,
randomness plays a key role. For the classic k-median and k-means problems,
there is no known deterministic dimensionality reduction procedure or coreset
construction that avoids an exponential dependency on the input dimension d,
the precision parameter ε⁻¹, or k. Furthermore, there is no
coreset construction that succeeds with probability 1 and whose size does
not depend on the number of input points, n. This has led researchers in the
area to ask what is the power of randomness for clustering sketches [Feldman,
WIREs Data Mining Knowl. Discov'20]. Similarly, the best approximation ratio
achievable deterministically without a complexity exponential in the dimension
is Ω(1) for both k-median and k-means, even when allowing a
complexity FPT in the number of clusters k. This stands in sharp contrast
with the (1+ε)-approximation achievable in that case, when allowing
randomization.
In this paper, we provide deterministic sketch constructions for
clustering whose size bounds are close to the best-known randomized ones. We
also construct a deterministic algorithm for computing a
(1+ε)-approximation to k-median and k-means in high-dimensional
Euclidean spaces in time 2^(poly(k/ε)) · poly(nd), close to the
best randomized complexity.
Furthermore, our new insights on sketches also yield a randomized coreset
construction that uses uniform sampling and immediately improves over the
recent results of [Braverman et al. FOCS '22] by a factor of k.
Comment: FOCS 2023. Abstract reduced for arXiv requirements.