Robust EM algorithm for model-based curve clustering
Model-based clustering approaches belong to the paradigm of exploratory data
analysis: they rely on finite mixture models to automatically uncover a latent
structure governing the observed data, and they are among the most popular and
successful approaches in cluster analysis. The mixture density estimation is
generally performed by maximizing the observed-data log-likelihood with the
expectation-maximization (EM) algorithm. However, it is well known that the
initialization of the EM algorithm is crucial. In addition, the standard EM
algorithm requires the number of clusters to be known a priori. Solutions have
been proposed in [31, 12] for model-based clustering with Gaussian mixture
models for multivariate data. In this paper we focus on model-based curve
clustering, where the data are curves rather than vectors, based on regression
mixtures. We propose a new robust EM algorithm for clustering curves. We extend
the model-based clustering approach presented in [31] for Gaussian mixture
models to curve clustering by regression mixtures, including polynomial
regression mixtures as well as spline and B-spline regression mixtures. Our
approach handles both the initialization problem and the choice of the optimal
number of clusters as the EM learning proceeds, rather than in a two-stage
scheme. This is achieved by optimizing a penalized log-likelihood criterion. A
simulation study confirms the potential benefit of the proposed algorithm in
terms of robustness to initialization and of finding the actual number of
clusters.
Comment: In Proceedings of the 2013 International Joint Conference on Neural
Networks (IJCNN), Dallas, TX, US
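To make the underlying model concrete, here is a minimal Python sketch of EM
for a polynomial regression mixture over curves sampled on a common grid. It is
an illustration only, not the paper's algorithm: the robust variant additionally
optimizes a penalized log-likelihood and adapts the number of clusters during
learning, both omitted here, and all function and variable names are ours.

```python
import numpy as np

def em_poly_regression_mixture(x, Y, K, degree=3, n_iter=50, seed=0):
    """Minimal EM for a polynomial regression mixture (illustrative sketch).

    x : (m,) common sampling grid; Y : (n, m) observed curves.
    Returns mixing weights, regression coefficients, noise variances,
    and posterior cluster memberships.
    """
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    X = np.vander(x, degree + 1, increasing=True)        # (m, d) design matrix
    pi = np.full(K, 1.0 / K)                             # mixing proportions
    beta = rng.normal(size=(K, degree + 1))              # regression coefficients
    sigma2 = np.ones(K)                                  # per-cluster noise variances
    for _ in range(n_iter):
        # E-step: posterior probability that curve i belongs to cluster k
        log_r = np.empty((n, K))
        for k in range(K):
            resid = Y - X @ beta[k]                      # (n, m) residual curves
            log_r[:, k] = (np.log(pi[k])
                           - 0.5 * m * np.log(2 * np.pi * sigma2[k])
                           - 0.5 * (resid ** 2).sum(axis=1) / sigma2[k])
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted least squares per cluster over all curve points
        for k in range(K):
            w = np.repeat(r[:, k], m)                    # one weight per sample point
            Xs = np.tile(X, (n, 1))
            ys = Y.ravel()
            WX = Xs * w[:, None]
            beta[k] = np.linalg.solve(Xs.T @ WX, WX.T @ ys)
            resid = Y - X @ beta[k]
            sigma2[k] = (r[:, k] @ (resid ** 2).sum(axis=1)) / (m * r[:, k].sum())
        pi = r.mean(axis=0)
    return pi, beta, sigma2, r
```

A practical run would call `em_poly_regression_mixture(x, Y, K)` and assign
each curve to the cluster maximizing its posterior row in `r`.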
Finite mixture regression: A sparse variable selection by model selection for clustering
We consider a finite mixture of Gaussian regressions model for
high-dimensional data, where the number of covariates may be much larger than
the sample size. We propose to estimate the unknown conditional mixture density
by a maximum likelihood estimator, restricted to the relevant variables
selected by an ℓ1-penalized maximum likelihood estimator. We obtain an oracle
inequality satisfied by this estimator for a Jensen-Kullback-Leibler type loss.
Our oracle inequality is deduced from a general model selection theorem for
maximum likelihood estimators over a random model collection, from which we can
also derive the penalty shape of the criterion, depending on the complexity of
the random model collection.
Comment: 20 pages. arXiv admin note: text overlap with arXiv:1103.2021 by
other author
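As a rough illustration of the two-stage Lasso-MLE idea, the sketch below
selects relevant covariates with an ℓ1 penalty and then refits an unpenalized
(maximum likelihood) estimator on the selected support only. For clarity it
uses a single Gaussian regression rather than the paper's K-component mixture;
the synthetic data, the penalty level `alpha=0.2`, and all names are our own
placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p, s = 100, 500, 5                        # n << p high-dimensional regime
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:s] = 2.0                          # only the first s covariates matter
y = X @ beta_true + rng.normal(size=n)

# Stage 1: l1-penalized fit selects the relevant variables (nonzero coefficients).
support = np.flatnonzero(Lasso(alpha=0.2).fit(X, y).coef_)

# Stage 2: unpenalized refit restricted to the selected variables.
mle = LinearRegression().fit(X[:, support], y)
print(support, mle.coef_)
```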
Skewed Factor Models Using Selection Mechanisms
Traditional factor models explicitly or implicitly assume that the factors follow a multivariate normal distribution; that is, only moments up to order two are involved. However, in real data problems the first two moments may fail to explain the factors. Motivated by this, we devise three new skewed factor models, the skew-normal, the skew-t, and the generalized skew-normal factor models, each based on a selection mechanism on the factors. ECME algorithms are adopted to estimate the related parameters for statistical inference. Monte Carlo simulations validate the new models, and we demonstrate the need for skewed factor models using the classic open/closed book exam scores dataset.
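The selection mechanism itself is easy to illustrate. The sketch below
generates data from a skew-normal factor model via the standard stochastic
representation of the skew-normal (a half-normal "selected" component plus an
independent normal one); the ECME estimation side is omitted, and all names are
assumptions of ours rather than the paper's notation.

```python
import numpy as np

def sample_skew_normal_factor_model(n, Lambda, delta, psi, seed=0):
    """Generate data from a skew-normal factor model (illustrative sketch).

    A scalar factor f is skew-normal when f = delta*|u0| + sqrt(1-delta^2)*u1
    with u0, u1 independent N(0, 1); here this is applied coordinatewise.
    Lambda is the (p, q) loading matrix, delta in (-1, 1) controls factor
    skewness, and psi is the (p,) vector of idiosyncratic variances.
    """
    rng = np.random.default_rng(seed)
    p, q = Lambda.shape
    u0 = np.abs(rng.normal(size=(n, q)))              # truncated (selected) component
    u1 = rng.normal(size=(n, q))
    f = delta * u0 + np.sqrt(1.0 - delta ** 2) * u1   # skew-normal factors
    eps = rng.normal(scale=np.sqrt(psi), size=(n, p)) # idiosyncratic errors
    return f @ Lambda.T + eps                         # observed data, shape (n, p)
```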
Multiscale autocorrelation function: a new approach to anisotropy studies
We present a novel catalog-independent method, based on a scale-dependent
approach, to detect anisotropy signatures in the arrival direction distribution
of ultra-high energy cosmic rays (UHECRs). The method provides good
discrimination power for both large and small data sets, even in the presence
of a strong contaminating isotropic background. We present applications to
simulated data sets of events corresponding to plausible scenarios for charged
particles detected by world-wide surface-detector-based observatories over the
last decades.
Comment: 18 pages, 9 figures
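In spirit, a scale-dependent anisotropy statistic can be sketched as a
two-point pair count over a grid of angular scales, compared against isotropic
Monte Carlo skies. The Python sketch below is our own simplified stand-in, not
the authors' exact estimator, and all names are ours.

```python
import numpy as np

def multiscale_pair_excess(dirs, scales_deg, n_iso=200, seed=0):
    """Scale-dependent two-point statistic on the sphere (illustrative sketch).

    dirs : (n, 3) unit vectors of arrival directions.  For each angular scale,
    counts pairs closer than that scale and compares with counts from
    isotropic Monte Carlo sets of the same size, returning the excess in
    units of the isotropic standard deviation.
    """
    rng = np.random.default_rng(seed)
    n = len(dirs)

    def pair_counts(v, cos_thr):
        cosang = np.clip(v @ v.T, -1.0, 1.0)
        iu = np.triu_indices(n, k=1)                 # each pair counted once
        return (cosang[iu][:, None] >= cos_thr[None, :]).sum(axis=0)

    cos_thr = np.cos(np.radians(np.asarray(scales_deg)))
    observed = pair_counts(dirs, cos_thr)
    iso = np.zeros((n_iso, len(cos_thr)))
    for i in range(n_iso):                           # isotropic reference skies
        v = rng.normal(size=(n, 3))
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        iso[i] = pair_counts(v, cos_thr)
    return (observed - iso.mean(axis=0)) / iso.std(axis=0)
```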
Clustering and variable selection for categorical multivariate data
This article investigates unsupervised classification techniques for
categorical multivariate data. The study employs multivariate multinomial
mixture modeling, a type of model particularly applicable to multilocus
genotypic data. A model selection procedure is used to simultaneously select
the number of components and the relevant variables. A non-asymptotic oracle
inequality is obtained, leading to the proposal of a new penalized maximum
likelihood criterion. The selected model proves to be asymptotically consistent
under weak assumptions on the true probability distribution underlying the
observations. The main theoretical result suggests a penalty function defined
up to a multiplicative parameter. In practice, the data-driven calibration of
the penalty function is made possible by slope heuristics. Based on simulated
data, this procedure is found to improve the performance of the selection
procedure relative to classical criteria such as BIC and AIC. The new criterion
provides an answer to the question "Which criterion for which sample size?"
Examples of applications to real datasets are also provided.
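To illustrate the slope-heuristic calibration step, the sketch below estimates
the unknown multiplicative constant from the maximized log-likelihoods of the
most complex fitted models and then applies the usual factor-two rule. All
numbers and names here are illustrative placeholders of ours, not results or
code from the paper.

```python
import numpy as np

# Assume we already fitted a collection of latent class models and recorded,
# for each, its dimension D_m and maximized log-likelihood L_m.  The theory
# gives the penalty shape pen(m) = kappa * D_m up to the unknown constant
# kappa; the slope heuristic estimates the minimal kappa_min as the slope of
# L_m against D_m over the most complex models, then uses 2 * kappa_min.
D = np.array([10, 21, 32, 43, 54, 65, 76])                    # placeholder dimensions
L = np.array([-980, -890, -840, -815, -800, -789, -780.0])    # placeholder log-liks

complex_part = D >= np.median(D)               # restrict to large-dimension models
kappa_min = np.polyfit(D[complex_part], L[complex_part], 1)[0]
crit = L - 2.0 * kappa_min * D                 # penalized criterion to maximize
best = int(np.argmax(crit))
print("selected model index:", best, "dimension:", D[best])
```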
Automatic Clustering with Single Optimal Solution
Determining the optimal number of clusters in a dataset is a challenging task.
Although several methods are available, no algorithm produces a unique
clustering solution. This paper proposes Automatic Merging for Single Optimal
Solution (AMSOS), which aims to generate a unique and nearly optimal clustering
for a given dataset automatically. AMSOS iteratively merges the closest
clusters, validating each merge with a cluster validity measure, to find a
single, nearly optimal clustering for the given data set. Experiments on both
synthetic and real data show that the proposed algorithm finds a single and
nearly optimal clustering structure in terms of the number of clusters,
compactness, and separation.
Comment: 13 pages, 4 tables, 3 figures
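A minimal sketch of the merge-and-validate idea, using Ward linkage for the
merge order and the silhouette index as the validity measure (the paper's exact
merging rule and validity measure may differ; all names are ours):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score

def merge_to_single_solution(X, k_max=10):
    """Iteratively merge the closest clusters, keeping the partition with the
    best validity score (illustrative sketch of the merge-and-validate idea).
    """
    Z = linkage(X, method="ward")              # merge order: closest clusters first
    best_k, best_score, best_labels = None, -np.inf, None
    for k in range(k_max, 1, -1):              # start over-clustered, merge downward
        labels = fcluster(Z, t=k, criterion="maxclust")
        score = silhouette_score(X, labels)    # cluster validity measure
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```

Because every candidate partition comes from the same merge tree and the best
score is kept deterministically, the procedure returns a single solution for a
given dataset, which is the property the paper emphasizes.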