89,021 research outputs found
SOTXTSTREAM: Density-based self-organizing clustering of text streams
A streaming data clustering algorithm is presented building upon the density-based selforganizing stream clustering algorithm SOSTREAM. Many density-based clustering algorithms are limited by their inability to identify clusters with heterogeneous density. SOSTREAM addresses this limitation through the use of local (nearest neighbor-based) density determinations. Additionally, many stream clustering algorithms use a two-phase clustering approach. In the first phase, a micro-clustering solution is maintained online, while in the second phase, the micro-clustering solution is clustered offline to produce a macro solution. By performing self-organization techniques on micro-clusters in the online phase, SOSTREAM is able to maintain a macro clustering solution in a single phase. Leveraging concepts from SOSTREAM, a new density-based self-organizing text stream clustering algorithm, SOTXTSTREAM, is presented that addresses several shortcomings of SOSTREAM. Gains in clustering performance of this new algorithm are demonstrated on several real-world text stream datasets
Classification methods for Hilbert data based on surrogate density
An unsupervised and a supervised classification approaches for Hilbert random
curves are studied. Both rest on the use of a surrogate of the probability
density which is defined, in a distribution-free mixture context, from an
asymptotic factorization of the small-ball probability. That surrogate density
is estimated by a kernel approach from the principal components of the data.
The focus is on the illustration of the classification algorithms and the
computational implications, with particular attention to the tuning of the
parameters involved. Some asymptotic results are sketched. Applications on
simulated and real datasets show how the proposed methods work.Comment: 33 pages, 11 figures, 6 table
Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering
The robust improper maximum likelihood estimator (RIMLE) is a new method for
robust multivariate clustering finding approximately Gaussian clusters. It
maximizes a pseudo-likelihood defined by adding a component with improper
constant density for accommodating outliers to a Gaussian mixture. A special
case of the RIMLE is MLE for multivariate finite Gaussian mixture models. In
this paper we treat existence, consistency, and breakdown theory for the RIMLE
comprehensively. RIMLE's existence is proved under non-smooth covariance matrix
constraints. It is shown that these can be implemented via a computationally
feasible Expectation-Conditional Maximization algorithm.Comment: The title of this paper was originally: "A consistent and breakdown
robust model-based clustering method
A Distributed and Approximated Nearest Neighbors Algorithm for an Efficient Large Scale Mean Shift Clustering
In this paper we target the class of modal clustering methods where clusters
are defined in terms of the local modes of the probability density function
which generates the data. The most well-known modal clustering method is the
k-means clustering. Mean Shift clustering is a generalization of the k-means
clustering which computes arbitrarily shaped clusters as defined as the basins
of attraction to the local modes created by the density gradient ascent paths.
Despite its potential, the Mean Shift approach is a computationally expensive
method for unsupervised learning. Thus, we introduce two contributions aiming
to provide clustering algorithms with a linear time complexity, as opposed to
the quadratic time complexity for the exact Mean Shift clustering. Firstly we
propose a scalable procedure to approximate the density gradient ascent.
Second, our proposed scalable cluster labeling technique is presented. Both
propositions are based on Locality Sensitive Hashing (LSH) to approximate
nearest neighbors. These two techniques may be used for moderate sized
datasets. Furthermore, we show that using our proposed approximations of the
density gradient ascent as a pre-processing step in other clustering methods
can also improve dedicated classification metrics. For the latter, a
distributed implementation, written for the Spark/Scala ecosystem is proposed.
For all these considered clustering methods, we present experimental results
illustrating their labeling accuracy and their potential to solve concrete
problems.Comment: Algorithms are available at
https://github.com/Clustering4Ever/Clustering4Eve
Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs
Laplacian mixture models identify overlapping regions of influence in
unlabeled graph and network data in a scalable and computationally efficient
way, yielding useful low-dimensional representations. By combining Laplacian
eigenspace and finite mixture modeling methods, they provide probabilistic or
fuzzy dimensionality reductions or domain decompositions for a variety of input
data types, including mixture distributions, feature vectors, and graphs or
networks. Provable optimal recovery using the algorithm is analytically shown
for a nontrivial class of cluster graphs. Heuristic approximations for scalable
high-performance implementations are described and empirically tested.
Connections to PageRank and community detection in network analysis demonstrate
the wide applicability of this approach. The origins of fuzzy spectral methods,
beginning with generalized heat or diffusion equations in physics, are reviewed
and summarized. Comparisons to other dimensionality reduction and clustering
methods for challenging unsupervised machine learning problems are also
discussed.Comment: 13 figures, 35 reference
- …