Warped K-Means: An algorithm to cluster sequentially-distributed data
Many devices generate large amounts of data that follow some sort of sequentiality, e.g.,
motion sensors, e-pens, and eye trackers, and often these data need to be compressed for
classification, storage, and/or retrieval tasks. Traditional clustering algorithms can be used
for this purpose, but unfortunately they do not cope with the sequential information
implicitly embedded in such data. Thus, we revisit the well-known K-means algorithm
and provide a general method to properly cluster sequentially-distributed data. We present
Warped K-Means (WKM), a multi-purpose partitional clustering procedure that minimizes
the sum of squared error criterion, while imposing a hard sequentiality constraint in the
classification step. We illustrate the properties of WKM in three applications, one being
the segmentation and classification of human activity. WKM outperformed five
state-of-the-art clustering techniques at simplifying data trajectories, achieving a
recognition accuracy of nearly 97%, an improvement of around 66% over its peers. Moreover, such an
improvement came with a reduction in the computational cost of more than one order of
magnitude.

This work has been partially supported by the Casmacat (FP7-ICT-2011-7, Project 287576), tranScriptorium (FP7-ICT-2011-9, Project 600707), STraDA (MINECO, TIN2012-37475-C02-01), and ALMPR (GVA, Prometeo/2009/014) projects.

Leiva Torres, L. A.; Vidal, E. (2013). Warped K-Means: an algorithm to cluster sequentially-distributed data. Information Sciences, 237:196-210. https://doi.org/10.1016/j.ins.2013.02.042
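To make the hard-sequentiality idea concrete, the sketch below clusters a sequence into K contiguous segments and greedily shifts the interior segment boundaries while the total sum of squared error (SSE) decreases. It is a minimal illustration of sequentially constrained clustering, not the authors' WKM procedure; the function name sequential_kmeans and all parameter choices are ours.

```python
import numpy as np

def sequential_kmeans(X, K, max_iter=100):
    """Partition a sequence X (T x D array) into K *contiguous* segments,
    greedily moving segment boundaries to reduce the total sum of squared
    error (SSE). Simplified sketch in the spirit of WKM, not the authors'
    exact algorithm.
    """
    T = len(X)
    # Start from K equal-length segments: segment i is X[bounds[i]:bounds[i+1]].
    bounds = [round(i * T / K) for i in range(K + 1)]

    def sse(seg):
        return ((seg - seg.mean(axis=0)) ** 2).sum() if len(seg) else 0.0

    for _ in range(max_iter):
        moved = False
        for i in range(1, K):  # each interior boundary
            best, best_b = None, bounds[i]
            # Try keeping the boundary or shifting it one point left/right,
            # never emptying a segment.
            for b in (bounds[i], bounds[i] - 1, bounds[i] + 1):
                if bounds[i - 1] < b < bounds[i + 1]:
                    cost = sse(X[bounds[i - 1]:b]) + sse(X[b:bounds[i + 1]])
                    if best is None or cost < best:
                        best, best_b = cost, b
            if best_b != bounds[i]:
                bounds[i], moved = best_b, True
        if not moved:
            break  # no single boundary move lowers the SSE

    labels = np.repeat(np.arange(K), np.diff(bounds))
    centers = np.array([X[bounds[i]:bounds[i + 1]].mean(axis=0) for i in range(K)])
    return labels, centers

# Example: segment a noisy 1-D trajectory with three plateaus.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(m, 0.1, 50) for m in (0.0, 1.0, 3.0)])[:, None]
labels, centers = sequential_kmeans(X, K=3)
print(centers.ravel())  # roughly [0.0, 1.0, 3.0]
```

A production version would update segment means and SSEs incrementally when a point changes segments rather than recomputing them from scratch, which is presumably where much of the speed of a method like WKM comes from.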
Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm
Over the past five decades, k-means has become the clustering algorithm of
choice in many application domains primarily due to its simplicity, time/space
efficiency, and invariance to the ordering of the data points. Unfortunately,
the algorithm's sensitivity to the initial selection of the cluster centers
remains its most serious drawback. Numerous initialization methods have
been proposed to address this drawback. Many of these methods, however, have
time complexity superlinear in the number of data points, which makes them
impractical for large data sets. On the other hand, linear methods are often
random and/or sensitive to the ordering of the data points. These methods are
generally unreliable in that the quality of their results is unpredictable.
Therefore, it is common practice to perform multiple runs of such methods and
take the output of the run that produces the best results. Such a practice,
however, greatly increases the computational requirements of the otherwise
highly efficient k-means algorithm. In this chapter, we investigate the
empirical performance of six linear, deterministic (non-random), and
order-invariant k-means initialization methods on a large and diverse
collection of data sets from the UCI Machine Learning Repository. The results
demonstrate that two relatively unknown hierarchical initialization methods due
to Su and Dy outperform the remaining four methods with respect to two
objective effectiveness criteria. In addition, a recent method due to Erisoglu
et al. performs surprisingly poorly.

Comment: 21 pages, 2 figures, 5 tables; Partitional Clustering Algorithms
(Springer, 2014). arXiv admin note: substantial text overlap with
arXiv:1304.7465, arXiv:1209.196
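One of the two Su and Dy methods highlighted above, variance partitioning (often called Var-Part), repeatedly splits the cluster with the largest within-cluster SSE at the mean of its highest-variance dimension, then uses the means of the resulting K cells as the initial centers. The sketch below is a simplified rendering of that idea, deterministic and order-invariant up to ties, and is not the published implementation; the function name var_part_init is ours.

```python
import numpy as np

def var_part_init(X, K):
    """Deterministic, order-invariant K-means seeding in the spirit of
    Su and Dy's variance partitioning (Var-Part). Simplified sketch,
    not the published implementation; assumes K <= number of points.
    """
    clusters = [np.asarray(X, dtype=float)]
    while len(clusters) < K:
        # Pick the cluster with the largest within-cluster SSE.
        sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        c = clusters.pop(int(np.argmax(sses)))
        # Split it at the mean of its highest-variance dimension.
        d = int(np.argmax(c.var(axis=0)))
        mask = c[:, d] <= c[:, d].mean()
        if mask.all() or not mask.any():
            # Degenerate split (e.g., duplicated values): halve by sort order.
            mask = np.zeros(len(c), dtype=bool)
            mask[np.argsort(c[:, d], kind="stable")[: len(c) // 2]] = True
        clusters += [c[mask], c[~mask]]
    # The initial centers are the means of the K resulting cells.
    return np.array([c.mean(axis=0) for c in clusters])

# Example: seed K-means for three well-separated 2-D blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.2, (60, 2)) for m in ((0, 0), (3, 0), (0, 3))])
print(var_part_init(X, K=3))  # one center near each blob mean
```

Because every step depends only on summary statistics of the current cells, the seeding is reproducible across runs and across permutations of the input, which is the property the chapter's comparison focuses on.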
A Framework for Projected Clustering of High Dimensional Data Streams
The data stream problem has been studied extensively in recent years because of the ease with which stream data can be collected. The nature of stream data makes it essential to use algorithms that require only one pass over the data. Recently, single-scan stream analysis methods have been proposed in this context. However