Robust EM algorithm for model-based curve clustering
Model-based clustering approaches concern the paradigm of exploratory data
analysis relying on the finite mixture model to automatically find a latent
structure governing observed data. They are one of the most popular and
successful approaches in cluster analysis. The mixture density estimation is
generally performed by maximizing the observed-data log-likelihood by using the
expectation-maximization (EM) algorithm. However, it is well-known that the EM
algorithm initialization is crucial. In addition, the standard EM algorithm
requires the number of clusters to be known a priori. Some solutions have been
provided in [31, 12] for model-based clustering with Gaussian mixture models
for multivariate data. In this paper we focus on model-based curve clustering
approaches, when the data are curves rather than vectorial data, based on
regression mixtures. We propose a new robust EM algorithm for clustering
curves. We extend the model-based clustering approach presented in [31] for
Gaussian mixture models, to the case of curve clustering by regression
mixtures, including polynomial regression mixtures as well as spline or
B-spline regression mixtures. Our approach handles both the problem of
initialization and that of choosing the optimal number of clusters as the EM
learning proceeds, rather than in a two-fold scheme. This is achieved by
optimizing a penalized log-likelihood criterion. A simulation study confirms
the potential benefit of the proposed algorithm in terms of robustness
regarding initialization and finding the actual number of clusters.
Comment: In Proceedings of the 2013 International Joint Conference on Neural
Networks (IJCNN), 2013, Dallas, TX, US
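For orientation, the standard (non-robust) EM procedure that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes straight-line (degree-1 polynomial) regressions, a time grid shared by all curves, a fixed number of clusters, and user-supplied initial parameters, and it omits the penalized log-likelihood criterion. The function names `fit_line` and `em_curve_clustering` are hypothetical.

```python
import math
import random


def fit_line(x, curves, w):
    """Weighted least squares for y = a + b*x, with one weight per curve."""
    m = len(x)
    sw = sum(w)
    mx = sum(x) / m  # all curves share the grid, so the weighted mean of x is its plain mean
    my = sum(wi * sum(c) for wi, c in zip(w, curves)) / (sw * m)
    sxx = sw * sum((xj - mx) ** 2 for xj in x)
    sxy = sum(wi * sum((xj - mx) * (yj - my) for xj, yj in zip(x, c))
              for wi, c in zip(w, curves))
    b = sxy / sxx
    return my - b * mx, b


def em_curve_clustering(x, curves, params, n_iter=30):
    """Plain EM for a mixture of linear regressions over whole curves.

    params holds one (proportion, intercept, slope, variance) tuple per cluster.
    No safeguard against empty clusters is included in this sketch.
    """
    m, n = len(x), len(curves)
    history = []  # observed-data log-likelihood at each iteration
    for _ in range(n_iter):
        # E-step: posterior cluster probabilities per curve, via log-sum-exp.
        resp, loglik = [], 0.0
        for c in curves:
            logp = []
            for pk, a, b, var in params:
                sse = sum((yj - (a + b * xj)) ** 2 for xj, yj in zip(x, c))
                logp.append(math.log(pk)
                            - 0.5 * m * math.log(2 * math.pi * var)
                            - 0.5 * sse / var)
            top = max(logp)
            log_total = top + math.log(sum(math.exp(v - top) for v in logp))
            loglik += log_total
            resp.append([math.exp(v - log_total) for v in logp])
        history.append(loglik)
        # M-step: update proportion, regression line, and noise variance per cluster.
        new_params = []
        for k in range(len(params)):
            w = [r[k] for r in resp]
            a, b = fit_line(x, curves, w)
            sse = sum(wi * sum((yj - (a + b * xj)) ** 2 for xj, yj in zip(x, c))
                      for wi, c in zip(w, curves))
            new_params.append((sum(w) / n, a, b, sse / (m * sum(w))))
        params = new_params
    return params, history


# Synthetic demonstration data (made up for this sketch): two groups of noisy lines.
rng = random.Random(0)
x = [j / 9 for j in range(10)]
curves = [[1 + 2 * xj + rng.gauss(0, 0.1) for xj in x] for _ in range(10)]
curves += [[5 - xj + rng.gauss(0, 0.1) for xj in x] for _ in range(10)]
init = [(0.5, 0.0, 1.0, 1.0), (0.5, 4.0, 0.0, 1.0)]
params, history = em_curve_clustering(x, curves, init)
```

The sketch has to be given both an initial guess (`init`) and a fixed number of clusters, which are the two requirements the abstract says the proposed robust algorithm removes.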
Partial mixture model for tight clustering of gene expression time-course
Background: Tight clustering arose recently from a desire to obtain tighter and potentially more informative clusters in gene expression studies. Scattered genes with relatively loose correlations should be excluded from the clusters. However, in the literature there is little work dedicated to
this area of research. On the other hand, there has been extensive use of maximum likelihood techniques for model parameter estimation. By contrast, the minimum distance estimator has been largely ignored.
Results: In this paper we show the inherent robustness of the minimum distance estimator that makes it a powerful tool for parameter estimation in model-based time-course clustering. To apply minimum distance estimation, a partial mixture model that can naturally incorporate replicate
information and allow scattered genes is formulated. We provide experimental results of simulated data fitting, where the minimum distance estimator demonstrates superior performance to the maximum likelihood estimator. Both biological and statistical validations are conducted on a
simulated dataset and two real gene expression datasets. Our proposed partial regression clustering algorithm scores top in Gene Ontology driven evaluation, in comparison with four other popular clustering algorithms.
Conclusion: For the first time, the partial mixture model is successfully extended to time-course data analysis. The robustness of our partial regression clustering algorithm proves the suitability of the combination of both partial mixture model and minimum distance estimator in this field. We show that tight clustering is not only capable of generating a more profound understanding of the dataset
under study, well in accordance with established biological knowledge, but also presents interesting new hypotheses during interpretation of clustering results. In particular, we provide biological evidence that scattered genes can be relevant and are interesting subjects for study, in contrast to prevailing opinion.
A new approach to cluster analysis: the clustering-function-based method
The purpose of the paper is to present a new statistical approach to hierarchical cluster analysis with n objects measured on p variables. Motivated by the model of multivariate analysis of variance and the method of maximum likelihood, a clustering problem is formulated as a least squares optimization problem, simultaneously solving for both an n-vector of unknown group membership of objects and a linear clustering function. This formulation is shown to be linked to linear regression analysis and Fisher linear discriminant analysis and includes principal component regression for tackling multicollinearity or rank deficiency, polynomial or B-splines regression for handling non-linearity and various variable selection methods to eliminate irrelevant variables from data analysis. Algorithmic issues are investigated by using sign eigenanalysis
Improved Correction of Atmospheric Pressure Data Obtained by Smartphones through Machine Learning
A correction method using machine learning aims to improve the conventional linear regression (LR) based method for correction of atmospheric pressure data obtained by smartphones. The method proposed in this study conducts clustering and regression analysis with time domain classification. Data obtained using smartphones in Gyeonggi-do, one of the most populous provinces in South Korea, which surrounds Seoul and covers about 10,000 km², from July 2014 through December 2014 were classified with respect to time of day (daytime or nighttime) as well as day of the week (weekday or weekend) and the user’s mobility, prior to the expectation-maximization (EM) clustering. Subsequently, the results were analyzed for comparison by applying machine learning methods such as multilayer perceptron (MLP) and support vector regression (SVR). The results showed a mean absolute error (MAE) 26% lower on average when regression analysis was performed through EM clustering compared to that obtained without EM clustering. For machine learning methods, the MAE for SVR was around 31% lower than that for LR and about 19% lower than that for MLP. It is concluded that pressure data from smartphones are as good as the ones from the national automatic weather station (AWS) network.
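The comparisons in this abstract are stated as relative reductions in mean absolute error. A minimal sketch of how such figures are computed is given below; the function names are hypothetical and no values from the study are reproduced.

```python
def mean_absolute_error(predicted, reference):
    """Average of |predicted - reference| over paired readings."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def relative_reduction(baseline_mae, new_mae):
    """Fraction by which new_mae is lower than baseline_mae."""
    return (baseline_mae - new_mae) / baseline_mae
```

Under this definition, a statement such as "26% lower" corresponds to `relative_reduction(mae_without_clustering, mae_with_clustering)` being about 0.26, where both arguments are MAE values measured against the same reference readings.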
Local spatial regression models: a comparative analysis on soil contamination
Spatial data analysis focuses on both attribute and locational information. Local analyses deal with differences across space whereas global analyses deal with similarities across space. This paper presents an experimental comparative study that analyses spatial data with several weighted local regression models. Five local regression models have been developed and their estimation capacities have been evaluated. The experimental studies showed that integrating objective-function-based fuzzy clustering with geostatistics provides accurate and general model structures. In particular, the estimation performance of the model established by combining the extended fuzzy clustering algorithm and the standard regional dependence function is higher than that of the other regression models. Finally, it is suggested that the hybrid regression models developed by combining soft computing and geostatistics could be used in spatial data analysis.