Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm
Over the past five decades, k-means has become the clustering algorithm of
choice in many application domains primarily due to its simplicity, time/space
efficiency, and invariance to the ordering of the data points. Unfortunately,
the algorithm's sensitivity to the initial selection of the cluster centers
remains its most serious drawback. Numerous initialization methods have
been proposed to address this drawback. Many of these methods, however, have
time complexity superlinear in the number of data points, which makes them
impractical for large data sets. On the other hand, linear methods are often
random and/or sensitive to the ordering of the data points. These methods are
generally unreliable in that the quality of their results is unpredictable.
Therefore, it is common practice to perform multiple runs of such methods and
take the output of the run that produces the best results. Such a practice,
however, greatly increases the computational requirements of the otherwise
highly efficient k-means algorithm. In this chapter, we investigate the
empirical performance of six linear, deterministic (non-random), and
order-invariant k-means initialization methods on a large and diverse
collection of data sets from the UCI Machine Learning Repository. The results
demonstrate that two relatively unknown hierarchical initialization methods due
to Su and Dy outperform the remaining four methods with respect to two
objective effectiveness criteria. In addition, a recent method due to Erisoglu
et al. performs surprisingly poorly.
Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms (Springer, 2014). arXiv admin note: substantial text overlap with arXiv:1304.7465, arXiv:1209.196
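The multiple-run practice described above can be sketched as follows. This is a minimal NumPy illustration of restarting Lloyd's algorithm from random initial centers and keeping the lowest-SSE run; it is not one of the six deterministic methods studied in the chapter, and `kmeans`, `best_of_n_runs`, and their parameters are illustrative choices:

```python
import numpy as np

def kmeans(X, k, rng, n_iter=100):
    """One run of Lloyd's k-means from randomly chosen initial centers."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its cluster (keep it if empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    sse = ((X - centers[labels]) ** 2).sum()  # sum of squared errors
    return centers, labels, sse

def best_of_n_runs(X, k, n_runs=10, seed=0):
    """Repeat random initialization and keep the run with the lowest SSE."""
    rng = np.random.default_rng(seed)
    return min((kmeans(X, k, rng) for _ in range(n_runs)),
               key=lambda run: run[2])
```

The cost of this practice is exactly what the abstract criticizes: n_runs full k-means passes instead of one, which a deterministic, order-invariant initializer avoids.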
Unsupervised Classification Using Immune Algorithm
An unsupervised classification algorithm based on the clonal selection principle, named Unsupervised Clonal Selection Classification (UCSC), is proposed in this paper. The proposed algorithm is data-driven and self-adaptive: it adjusts its parameters to the data to make the classification operation as fast as possible. The performance of UCSC is evaluated by comparing it with the well-known k-means algorithm on several artificial and real-life data sets. The experiments show that the proposed UCSC algorithm is more reliable and achieves higher classification precision than traditional classification methods such as k-means.
Projection Pursuit for Exploratory Supervised Classification
In high-dimensional data, one often seeks a few interesting low-dimensional projections that reveal important features of the data. Projection pursuit is a procedure for searching high-dimensional data for interesting low-dimensional projections via the optimization of a criterion function called the projection pursuit index. Very few projection pursuit indices incorporate class or group information in the calculation. Hence, they cannot be adequately applied in supervised classification problems to provide low-dimensional projections revealing class differences in the data. We introduce new indices derived from linear discriminant analysis that can be used for exploratory supervised classification.
Keywords: data mining, exploratory multivariate data analysis, gene expression data, discriminant analysis
Clustering-based collocation for uncertainty propagation with multivariate correlated inputs
In this article, we propose the use of partitioning and clustering methods as an
alternative to Gaussian quadrature for stochastic collocation (SC). The key idea
is to use cluster centers as the nodes for collocation. In this way, we can extend
the use of collocation methods to uncertainty propagation with multivariate,
correlated input. The approach is particularly useful in situations where the
probability distribution of the input is unknown, and only a sample from the
input distribution is available. We examine several clustering methods and
assess their suitability for stochastic collocation numerically using the Genz
test functions as benchmark. The proposed methods work well, most notably
for the challenging case of nonlinearly correlated inputs in higher dimensions.
Tests with input dimension up to 16 are included.
Furthermore, the clustering-based collocation methods are compared to regular
SC with tensor grids of Gaussian quadrature nodes. For 2-dimensional
uncorrelated inputs, regular SC performs better, as should be expected; however,
the clustering-based methods also give only small relative errors. For correlated
2-dimensional inputs, clustering-based collocation outperforms a simple
adapted version of regular SC, where the weights are adjusted to account for
input correlation.
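The key idea — cluster a sample of the correlated input, use the cluster centers as collocation nodes, and weight each node by the fraction of the sample it represents — can be sketched as follows. This is a minimal illustration using plain Lloyd iterations, not the specific clustering methods examined in the article; `clustering_collocation` and its parameters are illustrative names:

```python
import numpy as np

def clustering_collocation(sample, f, n_nodes=8, seed=0, n_iter=50):
    """Estimate E[f(X)] from a sample of correlated inputs: cluster the
    sample, evaluate f only at the cluster centers, and weight each node
    by the fraction of sample points in its cluster."""
    rng = np.random.default_rng(seed)
    centers = sample[rng.choice(len(sample), n_nodes, replace=False)]
    for _ in range(n_iter):  # plain Lloyd iterations on the sample
        d = np.linalg.norm(sample[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(n_nodes):
            if np.any(labels == j):
                centers[j] = sample[labels == j].mean(axis=0)
    # node weight = occupancy fraction of the corresponding cluster
    weights = np.bincount(labels, minlength=n_nodes) / len(sample)
    return weights @ np.array([f(c) for c in centers])
```

Note that no probability density is needed anywhere: the weights come directly from the sample, which is why the approach applies when only a sample of the (possibly correlated) input is available.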
Biclustering via optimal re-ordering of data matrices in systems biology: rigorous methods and comparative studies
Background: The analysis of large-scale data sets via clustering techniques is utilized in a number of applications. Biclustering in particular has emerged as an important problem in the analysis of gene expression data since genes may only jointly respond over a subset of conditions. Biclustering algorithms also have important applications in sample classification where, for instance, tissue samples can be classified as cancerous or normal. Many of the methods for biclustering, and clustering algorithms in general, utilize simplified models or heuristic strategies for identifying the "best" grouping of elements according to some metric and cluster definition and thus result in suboptimal clusters.
Results: In this article, we present a rigorous approach to biclustering, OREO, which is based on the Optimal RE-Ordering of the rows and columns of a data matrix so as to globally minimize the dissimilarity metric. The physical permutations of the rows and columns of the data matrix can be modeled as either a network flow problem or a traveling salesman problem. Cluster boundaries in one dimension are used to partition and re-order the other dimensions of the corresponding submatrices to generate biclusters. The performance of OREO is tested on (a) metabolite concentration data, (b) an image reconstruction matrix, (c) synthetic data with implanted biclusters, and gene expression data for (d) colon cancer data, (e) breast cancer data, as well as (f) yeast segregant data to validate the ability of the proposed method and compare it to existing biclustering and clustering methods.
Conclusion: We demonstrate that this rigorous global optimization method for biclustering produces clusters with more insightful groupings of similar entities, such as genes or metabolites sharing common functions, than other clustering and biclustering algorithms and can reconstruct underlying fundamental patterns in the data for several distinct sets of data matrices arising in important biological applications.
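For intuition, the re-ordering objective — make adjacent rows as similar as possible — can be approximated with a cheap greedy heuristic. The sketch below is not OREO's exact network-flow/TSP global optimization, only a greedy stand-in for the same adjacency objective; `greedy_reorder` and `adjacency_cost` are illustrative names:

```python
import numpy as np

def adjacency_cost(rows, order):
    """Total dissimilarity between consecutive rows in the given order."""
    return sum(np.linalg.norm(rows[a] - rows[b])
               for a, b in zip(order, order[1:]))

def greedy_reorder(rows):
    """Greedy nearest-neighbour ordering of matrix rows: repeatedly append
    the unvisited row most similar to the current last row."""
    n = len(rows)
    dist = np.linalg.norm(rows[:, None] - rows[None], axis=2)
    order, remaining = [0], set(range(1, n))
    while remaining:
        nxt = min(remaining, key=lambda j: dist[order[-1], j])
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Re-ordering one dimension this way groups similar rows into contiguous blocks; repeating the idea on the columns of each block is what yields biclusters.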
A New Generation of Mixture-Model Cluster Analysis with Information Complexity and the Genetic EM Algorithm
In this dissertation, we extend several relatively new developments in statistical model selection and data mining in order to improve one of the workhorse statistical tools - mixture modeling (Pearson, 1894). The traditional mixture model assumes data comes from several populations of Gaussian distributions. Thus, what remains is to determine how many distributions, their population parameters, and the mixing proportions. However, real data often do not fit the restrictions of normality very well. It is likely that data from a single population exhibiting either asymmetrical or nonnormal tail behavior could be erroneously modeled as two populations, resulting in suboptimal decisions. To avoid these pitfalls, we develop the mixture model under a broader distributional assumption by fitting a group of multivariate elliptically-contoured distributions (Anderson and Fang, 1990; Fang et al., 1990). Special cases include the multivariate Gaussian and power exponential distributions, as well as the multivariate generalization of the Student’s T. This gives us the flexibility to model nonnormal tail and peak behavior, though the symmetry restriction still exists. The literature has many examples of research generalizing the Gaussian mixture model to other distributions (Farrell and Mersereau, 2004; Hasselblad, 1966; John, 1970a), but our effort is more general. Further, we generalize the mixture model to be non-parametric, by developing two types of kernel mixture model. First, we generalize the mixture model to use the truly multivariate kernel density estimators (Wand and Jones, 1995). Additionally, we develop the power exponential product kernel mixture model, which allows the density to adjust to the shape of each dimension independently. Because kernel density estimators enforce no functional form, both of these methods can adapt to nonnormal asymmetric, kurtotic, and tail characteristics. 
Over the past two decades or so, evolutionary algorithms have grown in popularity, as they have provided encouraging results in a variety of optimization problems. Several authors have applied the genetic algorithm - a subset of evolutionary algorithms - to mixture modeling, including Bhuyan et al. (1991), Krishna and Murty (1999), and Wicker (2006). These procedures have the benefit that they bypass computational issues that plague the traditional methods. We extend these initialization and optimization methods by combining them with our updated mixture models. Additionally, we "borrow" results from robust estimation theory (Ledoit and Wolf, 2003; Shurygin, 1983; Thomaz, 2004) in order to data-adaptively regularize population covariance matrices. Numerical instability of the covariance matrix can be a significant problem for mixture modeling, since estimation is typically done on a relatively small subset of the observations. We likewise extend various information criteria (Akaike, 1973; Bozdogan, 1994b; Schwarz, 1978) to the elliptically-contoured and kernel mixture models. Information criteria guide model selection and estimation based on various approximations to the Kullback-Leibler divergence. Following Bozdogan (1994a), we use these tools to sequentially select the best mixture model, select the best subset of variables, and detect influential observations - all without making any subjective decisions. Over the course of this research, we developed a full-featured Matlab toolbox (M3) which implements all the new developments in mixture modeling presented in this dissertation. We show results on both simulated and real world datasets.
Keywords: mixture modeling, nonparametric estimation, subset selection, influence detection, evidence-based medical diagnostics, unsupervised classification, robust estimation
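The baseline that the dissertation generalizes — EM estimation of a Gaussian mixture — can be sketched in one dimension. This is a textbook two-component EM, not the elliptically-contoured, kernel, or genetic-EM variants developed in the dissertation; all names and initialization choices are illustrative:

```python
import numpy as np

def em_gmm_1d(x, n_iter=200):
    """EM for a two-component 1-D Gaussian mixture."""
    # crude initialization from the data quantiles
    mu = np.quantile(x, [0.25, 0.75])
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
               / (sigma * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted proportions, means, and variances
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma
```

The dissertation's generalizations replace the Gaussian density in the E-step (elliptically-contoured or kernel densities), regularize the M-step covariance estimates, and drive the restarts with a genetic algorithm instead of the fixed quantile initialization used here.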
Advanced analysis of branch and bound algorithms
When the combination of a combination lock is lost, the lock can only be opened by trying all combinations. In the worst case, the last combination is the right one; with a ten-digit code, however, ten billion possibilities would have to be examined. The so-called 'NP-hard' problems in Marcel Turkensteen's thesis are comparable to this 'combination lock problem': for these problems, too, the number of possibilities is excessively large. The art is therefore to explore the search space in a clever way. The Branch and Bound (BnB) method does this by splitting the search space into smaller subregions. Turkensteen applies the BnB method, among other things, to the traveling salesman problem, in which a shortest route through a set of locations must be determined. In its general form, this problem is still unsolved, and the economic consequences can be large: it is, for example, still not certain whether a route planner sends trucks along optimal routes. This thesis improves current BnB methods mainly by looking not at the cost of a connection, but at the cost increase incurred when a connection is not used: the upper tolerance.
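The branch-and-bound idea — split the search space and prune any subtree whose optimistic lower bound cannot beat the best solution found so far — can be sketched on a tiny TSP instance. This is a generic illustration with a simple cheapest-outgoing-edge bound, not the tolerance-based bounds the thesis develops; `tsp_branch_and_bound` is an illustrative name:

```python
import math

def tsp_branch_and_bound(dist):
    """Exact tour length for a small TSP via branch and bound:
    branch on the next city, prune with an admissible lower bound."""
    n = len(dist)
    best = [math.inf]  # incumbent tour length

    def bound(cost, last, remaining):
        # optimistic completion: every unfinished city (and the current
        # city) still needs at least its cheapest outgoing edge
        extra = sum(min(dist[i][j] for j in range(n) if j != i)
                    for i in remaining | {last})
        return cost + extra

    def branch(path, cost, remaining):
        if not remaining:
            best[0] = min(best[0], cost + dist[path[-1]][0])  # close tour
            return
        if bound(cost, path[-1], remaining) >= best[0]:
            return  # prune: this subtree cannot beat the incumbent
        for nxt in sorted(remaining, key=lambda j: dist[path[-1]][j]):
            branch(path + [nxt], cost + dist[path[-1]][nxt],
                   remaining - {nxt})

    branch([0], 0, set(range(1, n)))
    return best[0]
```

The thesis's contribution fits into the `bound` step: bounds built from upper tolerances (the cost increase of excluding an edge) prune more of the tree than bounds built from raw edge costs.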
Clustering Problems with Constraints: Algorithms and Complexity
This thesis introduces and studies the problem of 1-dimensional bounded clustering: for any fixed p ≥ 1, given reals x1, x2, …, xn and integers k1, k2, …, km, determine the partition (A1, A2, …, Am) of {1, 2, ..., n} with |A1| = k1, |A2| = k2, …, |Am| = km which minimizes
Σ_k Σ_{i ∈ A_k} |x_i − μ_k|^p
where μ_k is the p-centroid of A_k.
First, we prove that the optimal partition is contiguous (String Property): if i, j ∈ A_k and x_i < x_s < x_j, then s ∈ A_k.
As a consequence, we obtain an efficient algorithm for bi-clustering (when p is an integer); however, we show that the general problem is NP-complete, while a relaxed version of it admits a polynomial-time algorithm.
When p is not an integer, we prove that the problem of deciding whether the centroid μ is less than a given integer lies in the Counting Hierarchy CH.
As an application, the relaxed clustering algorithm is used as a step in solving a problem in bioinformatics: the localization of promoter regions in genomic sequences. The results are compared with those obtained through another methodology (MADAP).
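The String Property makes the size-constrained 1-D problem easy to sketch: on sorted data the clusters are contiguous blocks, so only the ordering of the prescribed block sizes remains to be chosen. The sketch below uses the mean as centroid, which is the p-centroid only for p = 2; it is an illustration of the property, not the thesis's algorithm, and all names are illustrative:

```python
import itertools

def contiguous_cost(xs, sizes_order, p=2):
    """Cost of the contiguous partition of sorted xs whose block lengths
    follow sizes_order, using the mean as centroid (exact for p = 2)."""
    cost, start = 0.0, 0
    for k in sizes_order:
        block = xs[start:start + k]
        mu = sum(block) / k
        cost += sum(abs(x - mu) ** p for x in block)
        start += k
    return cost

def bounded_clustering(xs, sizes, p=2):
    """Optimal size-constrained 1-D clustering under the String Property:
    sort the data, then try every ordering of the prescribed block sizes
    and keep the cheapest contiguous partition."""
    xs = sorted(xs)
    return min((contiguous_cost(xs, order, p), order)
               for order in set(itertools.permutations(sizes)))
```

The String Property is what shrinks the search space from all set partitions with the given sizes to the orderings of the size list; for many distinct sizes the latter is still exponential, which is consistent with the NP-completeness of the general problem.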