18 research outputs found
Robust EM algorithm for model-based curve clustering
Model-based clustering approaches concern the paradigm of exploratory data
analysis relying on the finite mixture model to automatically find a latent
structure governing observed data. They are one of the most popular and
successful approaches in cluster analysis. The mixture density estimation is
generally performed by maximizing the observed-data log-likelihood by using the
expectation-maximization (EM) algorithm. However, it is well-known that the EM
algorithm initialization is crucial. In addition, the standard EM algorithm
requires the number of clusters to be known a priori. Some solutions have been
provided in [31, 12] for model-based clustering with Gaussian mixture models
for multivariate data. In this paper we focus on model-based curve clustering
approaches, when the data are curves rather than vectorial data, based on
regression mixtures. We propose a new robust EM algorithm for clustering
curves. We extend the model-based clustering approach presented in [31] for
Gaussian mixture models, to the case of curve clustering by regression
mixtures, including polynomial regression mixtures as well as spline or
B-spline regressions mixtures. Our approach both handles the problem of
initialization and the one of choosing the optimal number of clusters as the EM
learning proceeds, rather than in a two-fold scheme. This is achieved by
optimizing a penalized log-likelihood criterion. A simulation study confirms
the potential benefit of the proposed algorithm in terms of robustness
regarding initialization and funding the actual number of clusters.Comment: In Proceedings of the 2013 International Joint Conference on Neural
Networks (IJCNN), 2013, Dallas, TX, US
Clustering analysis to improve total unit weight prediction from CPTu
Accurate estimates of soil unit weight are fundamental for correctly post process CPTu data and making use of Soil Behavior Type-based classification systems. Soil-specific and global regressions have been proposed for this purpose. However, soil-specific correlation might pose a problem of pertinence when applied at new sites. On the other hand, global correlations are easy to apply, but generally carry large systematic uncertainties. In this context, this work proposes a data clustering technique applied to geotechnical database aiming to identify hidden linear trends among dimensionless soil unit weight and normalized CPTu parameter according to some unobservable soil classes. Global correlations are then revisited according to such data subdivision aiming to improve accuracy of soil unit weight prediction while reducing transformation uncertainty. A new probabilistic criterion for soil unit weight prediction is also obtained. The potential benefits of the proposed procedure are illustrated with data from a Llobregat delta site (Spain).Postprint (published version
The discriminative functional mixture model for a comparative analysis of bike sharing systems
Bike sharing systems (BSSs) have become a means of sustainable intermodal
transport and are now proposed in many cities worldwide. Most BSSs also provide
open access to their data, particularly to real-time status reports on their
bike stations. The analysis of the mass of data generated by such systems is of
particular interest to BSS providers to update system structures and policies.
This work was motivated by interest in analyzing and comparing several European
BSSs to identify common operating patterns in BSSs and to propose practical
solutions to avoid potential issues. Our approach relies on the identification
of common patterns between and within systems. To this end, a model-based
clustering method, called FunFEM, for time series (or more generally functional
data) is developed. It is based on a functional mixture model that allows the
clustering of the data in a discriminative functional subspace. This model
presents the advantage in this context to be parsimonious and to allow the
visualization of the clustered systems. Numerical experiments confirm the good
behavior of FunFEM, particularly compared to state-of-the-art methods. The
application of FunFEM to BSS data from JCDecaux and the Transport for London
Initiative allows us to identify 10 general patterns, including pathological
ones, and to propose practical improvement strategies based on the system
comparison. The visualization of the clustered data within the discriminative
subspace turns out to be particularly informative regarding the system
efficiency. The proposed methodology is implemented in a package for the R
software, named funFEM, which is available on the CRAN. The package also
provides a subset of the data analyzed in this work.Comment: Published at http://dx.doi.org/10.1214/15-AOAS861 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Model-Based Co-Clustering of Multivariate Functional Data
International audienceHigh dimensional data clustering is an increasingly interesting topic in the statistical analysis of heterogeneous large-scale data. In this paper, we consider the problem of clustering heterogeneous high-dimensional data where the individuals are described by functional variables which exhibit a dynamical longitudinal structure. We address the issue in the framework of model-based co-clustering and propose the functional latent block model (FLBM). The introduced FLBM model allows to simultaneously cluster a sample of multivariate functions into a finite set of blocks, each block being an association of a cluster over individuals and a cluster over functional variables. Furthermore, the homogeneous set within each block is modeled with a dedicated latent process functional regression model which allows its segmentation according to an underlying dynamical structure. The proposed model allows thus to fully exploit the structure of the data, compared to classical latent block clustering models for continuous non functional data, which ignores the functional structure of the observations. The FLBM can therefore serve for simultaneous co-clustering and segmentation of multivariate non-stationary functions. We propose a variational expectation-maximization (EM) algorithm (VEM-FLBM) to monotonically maximize a variational approximation of the observed-data log-likelihood for the unsupervised inference of the FLBM model
Feature quantization for parsimonious and interpretable predictive models
For regulatory and interpretability reasons, the logistic regression is still widely used by financial institutions to learn the refunding probability of a loan from applicant's historical data. To improve prediction accuracy and interpretability, a preprocessing step quantizing both continuous and categorical data is usually performed: continuous features are discretized by assigning factor levels to intervals and, if numerous, levels of categorical features are grouped. However, a better predictive accuracy can be reached by embedding this quantization estimation step directly into the predictive estimation step itself. By doing so, the predictive loss has to be optimized on a huge and untractable discontinuous quantization set. To overcome this difficulty, we introduce a specific two-step optimization strategy: first, the optimization problem is relaxed by approximating discontinuous quan-tization functions by smooth functions; second, the resulting relaxed optimization problem is solved via a particular neural network and stochas-tic gradient descent. The strategy gives then access to good candidates for the original optimization problem after a straightforward maximum a posteriori procedure to obtain cutpoints. The good performances of this approach, which we call glmdisc, are illustrated on simulated and real data from the UCI library and Crédit Agricole Consumer Finance (a major Eu-ropean historic player in the consumer credit market). The results show that practitioners finally have an automatic all-in-one tool that answers their recurring needs of quantization for predictive tasks
Gaussian Based Visualization of Gaussian and Non-Gaussian Based Clustering
International audienceA generic method is introduced to visualize in a "Gaussian-like way", and onto R 2 , results of Gaussian or non-Gaussian based clustering. The key point is to explicitly force a visualization based on a spherical Gaussian mixture to inherit from the within cluster overlap that is present in the initial clustering mixture. The result is a particularly user-friendly drawing of the clusters, providing any practitioner with an overview of the potentially complex clustering result. An entropic measure provides information about the quality of the drawn overlap compared to the true one in the initial space. The proposed method is illustrated on four real data sets of different types (categorical, mixed, functional and network) and is implemented on the R package ClusVis