Efficient Sparse Clustering of High-Dimensional Non-spherical Gaussian Mixtures
We consider the problem of clustering data points in high dimensions, i.e.
when the number of data points may be much smaller than the number of
dimensions. Specifically, we consider a Gaussian mixture model (GMM) with
non-spherical Gaussian components, where the clusters are distinguished by only
a few relevant dimensions. The method we propose is a combination of a recent
approach for learning parameters of a Gaussian mixture model and sparse linear
discriminant analysis (LDA). In addition to cluster assignments, the method
returns an estimate of the set of features relevant for clustering. Our results
indicate that the sample complexity of clustering depends on the sparsity of
the relevant feature set, while only scaling logarithmically with the ambient
dimension. Additionally, we require much milder assumptions than existing work
on clustering in high dimensions. In particular, we require neither spherical
clusters nor mean separation along relevant dimensions.
Comment: 11 pages, 1 figure
clValid: An R Package for Cluster Validation
The R package clValid contains functions for validating the results of a clustering analysis. There are three main types of cluster validation measures available, "internal", "stability", and "biological". The user can choose from nine clustering algorithms in existing R packages, including hierarchical, K-means, self-organizing maps (SOM), and model-based clustering. In addition, we provide a function to perform the self-organizing tree algorithm (SOTA) method of clustering. Any combination of validation measures and clustering methods can be requested in a single function call. This allows the user to simultaneously evaluate several clustering algorithms while varying the number of clusters, to help determine the most appropriate method and number of clusters for the dataset of interest. Additionally, the package can automatically make use of the biological information contained in the Gene Ontology (GO) database to calculate the biological validation measures, via the annotation packages available in Bioconductor. The function returns an object of S4 class "clValid", which has summary, plot, print, and additional methods which allow the user to display the optimal validation scores and extract clustering results.
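To make the idea of an "internal" validation measure concrete, here is a minimal Python sketch of the Dunn index, one commonly used internal measure. It is an illustration under assumptions, not clValid's implementation: the function name `dunn_index`, the Euclidean distance, and the toy data are all hypothetical choices made for this example. Higher values indicate compact, well-separated clusters.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    # Largest distance between two points that share a cluster.
    diameter = max(
        (
            math.dist(a, b)
            for members in clusters.values()
            for a, b in combinations(members, 2)
        ),
        default=0.0,
    )
    # Smallest distance between two points in different clusters.
    separation = min(
        math.dist(a, b)
        for first, second in combinations(clusters.values(), 2)
        for a in first
        for b in second
    )
    return separation / diameter if diameter > 0 else math.inf


# Toy data (hypothetical): two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good_labels = [0, 0, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1]   # splits each group across both clusters

print(dunn_index(points, good_labels), dunn_index(points, bad_labels))
```

Scoring several candidate partitions with the same measure and keeping the best one is the basic pattern behind comparing clustering methods and cluster counts.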
Global Optimization strategies for two-mode clustering
Two-mode clustering is a relatively new form of clustering that clusters both rows and columns of a data matrix. To do so, a criterion similar to k-means is optimized. However, it is still unclear which optimization method should be used to perform two-mode clustering, as various methods may lead to non-global optima. This paper reviews and compares several optimization methods for two-mode clustering. Several known algorithms are discussed and a new, fuzzy algorithm is introduced. The meta-heuristics Multistart, Simulated Annealing, and Tabu Search are used in combination with these algorithms. The new, fuzzy algorithm is based on the fuzzy c-means algorithm of Bezdek (1981) and on the Fuzzy Steps approach of Heiser and Groenen (1997) and Groenen and Jajuga (2001) for avoiding local minima. The performance of all methods is compared in a large simulation study. It is found that using a Multistart meta-heuristic in combination with a two-mode k-means algorithm or the fuzzy algorithm often gives the best results. Finally, an empirical data set is used to give a practical example of two-mode clustering.
Keywords: algorithms; fuzzy clustering; multistart; simulated annealing; simulation; tabu search; two-mode clustering
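The combination of a two-mode k-means algorithm with a Multistart meta-heuristic can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the criterion is taken to be the squared error between each entry and the mean of its row-cluster by column-cluster block, empty blocks get a centre of 0.0, and all function names (`block_means`, `two_mode_kmeans`, `multistart`) and the toy matrix are hypothetical.

```python
import random


def block_means(X, rows, cols, P, Q):
    """Mean of the entries in each (row cluster, column cluster) block."""
    sums = [[0.0] * Q for _ in range(P)]
    counts = [[0] * Q for _ in range(P)]
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            sums[rows[i]][cols[j]] += x
            counts[rows[i]][cols[j]] += 1
    # Assumption: an empty block gets centre 0.0.
    return [
        [sums[p][q] / counts[p][q] if counts[p][q] else 0.0 for q in range(Q)]
        for p in range(P)
    ]


def loss(X, rows, cols, V):
    """Sum of squared deviations of entries from their block centre."""
    return sum(
        (x - V[rows[i]][cols[j]]) ** 2
        for i, row in enumerate(X)
        for j, x in enumerate(row)
    )


def two_mode_kmeans(X, P, Q, rng, max_iter=100):
    """One run from a random start; may end in a local optimum."""
    n, m = len(X), len(X[0])
    rows = [rng.randrange(P) for _ in range(n)]
    cols = [rng.randrange(Q) for _ in range(m)]
    V = block_means(X, rows, cols, P, Q)
    current = loss(X, rows, cols, V)
    for _ in range(max_iter):
        # Reassign each row to the row cluster with the smallest error.
        rows = [
            min(range(P), key=lambda p, i=i: sum(
                (X[i][j] - V[p][cols[j]]) ** 2 for j in range(m)))
            for i in range(n)
        ]
        V = block_means(X, rows, cols, P, Q)
        # Reassign each column in the same way.
        cols = [
            min(range(Q), key=lambda q, j=j: sum(
                (X[i][j] - V[rows[i]][q]) ** 2 for i in range(n)))
            for j in range(m)
        ]
        V = block_means(X, rows, cols, P, Q)
        previous, current = current, loss(X, rows, cols, V)
        if previous - current < 1e-12:  # no further improvement
            break
    return current, rows, cols


def multistart(X, P, Q, n_starts=20, seed=0):
    """Repeat the local search from random starts and keep the best run."""
    rng = random.Random(seed)
    runs = (two_mode_kmeans(X, P, Q, rng) for _ in range(n_starts))
    return min(runs, key=lambda run: run[0])


# Toy matrix (hypothetical) with a 2 x 2 block structure.
X = [[1, 1, 9, 9], [1, 1, 9, 9], [5, 5, 3, 3], [5, 5, 3, 3]]
best_loss, best_rows, best_cols = multistart(X, P=2, Q=2)
print(best_loss, best_rows, best_cols)
```

A single run can stall when ties or empty blocks trap the assignments, which is why restarting from several random partitions and keeping the lowest criterion value is a natural safeguard.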
The application of clustering analysis to international private indebtedness
The main goal of this paper is to apply a combination of statistical and connectionist schemes to examine, via clustering analysis, private indebtedness in different countries. Thirty-nine such experiences are used. The relationship between private debts and some macroeconomic variables is discussed in some detail. The clustering performance is improved by taking advantage of specific properties and capacities of each method. The procedures are also applied to a controlled numerical example.