19 research outputs found

    Penalized model-based clustering for three-way data structures

    Get PDF
    Recently, there has been increasing interest in developing statistical methods able to find groups in matrix-valued data. To this end, matrix Gaussian mixture models (MGMM) provide a natural extension of the popular model-based clustering based on Normal mixtures. Unfortunately, the over-parametrization issue already affecting the vector-variate framework is further exacerbated for MGMM, since the number of parameters scales quadratically with both the row and column dimensions. To overcome this limitation, the present paper introduces a sparse model-based clustering approach for three-way data structures. By means of penalized estimation, our methodology shrinks the estimates towards zero, achieving more stable and parsimonious clustering in high-dimensional scenarios. An application to satellite images underlines the benefits of the proposed method.
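    As a rough illustration of the penalized-estimation idea (a generic sketch, not the paper's matrix-variate method), a graphical-lasso penalty can shrink each cluster's precision matrix towards sparsity after an ordinary Gaussian mixture fit. The data, weights, and penalty level below are assumptions for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(0)
# Two synthetic clusters of 5-variate observations
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
               rng.normal(4.0, 1.0, size=(100, 5))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
resp = gmm.predict_proba(X)  # soft cluster responsibilities

precisions = []
for k in range(2):
    w = resp[:, k]
    mu = np.average(X, axis=0, weights=w)
    Xc = X - mu
    # Responsibility-weighted empirical covariance for cluster k
    cov = (w[:, None] * Xc).T @ Xc / w.sum()
    # L1-penalized precision estimate: small partial correlations
    # are shrunk to exactly zero
    _, prec = graphical_lasso(cov, alpha=0.1)
    precisions.append(prec)
```

    In a full penalized EM, this sparse precision update would replace the M-step covariance estimate and iterate with the E-step.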

    Group-Wise Shrinkage Estimation in Penalized Model-Based Clustering

    Get PDF
    Finite Gaussian mixture models provide a powerful and widely employed probabilistic approach for clustering multivariate continuous data. However, the practical usefulness of these models is jeopardized in high-dimensional spaces, where they tend to be over-parameterized. As a consequence, different solutions have been proposed, often relying on matrix decompositions or variable selection strategies. Recently, a methodological link between Gaussian graphical models and finite mixtures has been established, paving the way for penalized model-based clustering in the presence of large precision matrices. Nevertheless, current methodologies implicitly assume similar levels of sparsity across the classes, not accounting for different degrees of association between the variables across groups. We overcome this limitation by deriving group-wise penalty factors, which automatically enforce under- or over-connectivity in the estimated graphs. The approach is entirely data-driven and does not require additional hyper-parameter specification. Analyses on synthetic and real data showcase the validity of our proposal.

    Sparse model-based clustering of three-way data via lasso-type penalties

    Full text link
    Mixtures of matrix Gaussian distributions provide a probabilistic framework for clustering continuous matrix-variate data, which are becoming increasingly prevalent in various fields. Despite its widespread adoption and successful application, this approach suffers from over-parameterization issues, making it less suitable even for matrix-variate data of moderate size. To overcome this drawback, we introduce a sparse model-based clustering approach for three-way data. Our approach assumes that the matrix mixture parameters are sparse and have different degrees of sparsity across clusters, allowing parsimony to be induced in a flexible manner. Estimation of the model relies on the maximization of a penalized likelihood, with specifically tailored group and graphical lasso penalties. These penalties enable the selection of the most informative features for clustering three-way data where variables are recorded over multiple occasions, and allow cluster-specific association structures to be captured. The proposed methodology is tested extensively on synthetic data and its validity is demonstrated in application to time-dependent crime patterns in different US cities.
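    Lasso-type penalties of the kind mentioned here operate through shrinkage (proximal) operators; a minimal generic sketch, not taken from the paper, of the element-wise lasso and group-lasso operators:

```python
import numpy as np

def soft_threshold(x, lam):
    """Element-wise lasso proximal operator: shrinks each entry towards
    zero and sets entries below the threshold exactly to zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def group_soft_threshold(x, lam):
    """Group-lasso proximal operator: shrinks the whole block by its
    Euclidean norm, zeroing the block jointly when the norm is small."""
    norm = np.linalg.norm(x)
    return np.zeros_like(x) if norm <= lam else (1.0 - lam / norm) * x
```

    The group operator is what allows an entire variable (a whole row or column block of coefficients) to be dropped at once, which is how such penalties select informative features.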

    A Latent Shrinkage Position Model for Binary and Count Network Data

    Full text link
    Interactions between actors are frequently represented using a network. The latent position model is widely used for analysing network data, whereby each actor is positioned in a latent space. Inferring the dimension of this space is challenging. Often, for simplicity, two dimensions are used, or model selection criteria are employed to select the dimension, but this requires choosing a criterion and incurs the computational expense of fitting multiple models. Here the latent shrinkage position model (LSPM) is proposed, which intrinsically infers the effective dimension of the latent space. The LSPM employs a Bayesian nonparametric multiplicative truncated gamma process prior that ensures shrinkage of the variance of the latent positions across higher dimensions. Dimensions with non-negligible variance are deemed most useful to describe the observed network, inducing automatic inference on the latent space dimension. While the LSPM is applicable to many network types, logistic and Poisson LSPMs are developed here for binary and count networks respectively. Inference proceeds via a Markov chain Monte Carlo algorithm, where novel surrogate proposal distributions reduce the computational burden. The LSPM's properties are assessed through simulation studies, and its utility is illustrated through application to real network datasets. Open source software assists wider implementation of the LSPM. Comment: 75 pages, 47 figures.
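    The core of a logistic latent position model is that the tie probability between two actors decreases with their distance in the latent space. A minimal sketch of that likelihood (the function names, dimensions, and intercept value here are illustrative assumptions, not taken from the LSPM software):

```python
import numpy as np

def lpm_edge_prob(z, alpha):
    """Logistic latent position model: P(y_ij = 1) = sigmoid(alpha - ||z_i - z_j||)."""
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)  # pairwise distances
    return 1.0 / (1.0 + np.exp(-(alpha - d)))

def log_likelihood(y, z, alpha):
    """Bernoulli log-likelihood over the distinct pairs of a binary adjacency matrix."""
    p = lpm_edge_prob(z, alpha)
    iu = np.triu_indices(len(y), k=1)  # upper triangle: each pair counted once
    return np.sum(y[iu] * np.log(p[iu]) + (1 - y[iu]) * np.log(1 - p[iu]))

rng = np.random.default_rng(1)
z = rng.normal(size=(10, 2))  # 10 actors in a 2-dimensional latent space
alpha = 1.0                   # intercept controlling overall density
p = lpm_edge_prob(z, alpha)
y = (rng.uniform(size=p.shape) < p).astype(int)
y = np.triu(y, 1); y = y + y.T  # symmetrise the simulated adjacency matrix
ll = log_likelihood(y, z, alpha)
```

    The LSPM's shrinkage prior acts on the column variances of `z`, so that columns (dimensions) with negligible variance are effectively switched off.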

    Unobserved classes and extra variables in high-dimensional discriminant analysis

    Full text link
    In supervised classification problems, the test set may contain data points belonging to classes not observed in the learning phase. Moreover, the same units in the test data may be measured on a set of additional variables recorded at a stage subsequent to the collection of the learning sample. In this situation, the classifier built in the learning phase needs to adapt to handle potential unknown classes and the extra dimensions. We introduce a model-based discriminant approach, Dimension-Adaptive Mixture Discriminant Analysis (D-AMDA), which can detect unobserved classes and adapt to the increasing dimensionality. Model estimation is carried out via a fully inductive approach based on an EM algorithm. The method is then embedded in a more general framework for adaptive variable selection and classification suitable for data of large dimensions. A simulation study and an artificial experiment related to the classification of adulterated honey samples are used to validate the ability of the proposed framework to deal with complex situations. Comment: 29 pages, 29 figures.

    Localised spin dimers and structural distortions in the hexagonal perovskite Ba3CaMo2O9

    Get PDF
    Open Access under the ACS OA Agreement. Acknowledgments: We thank the Carnegie Trust for the Universities of Scotland for a PhD Scholarship for S.S. and the U.K. Science and Technology Facilities Council (STFC) for provision of neutron beamtime at the ILL under the experiment code 5-31-2703. Data are available from the ILL at DOI: 10.5291/ILL-DATA.5-31-2703. Peer reviewed. Publisher PDF.

    mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models

    Get PDF
    Finite mixture models are being used increasingly to model a wide variety of random phenomena for clustering, classification and density estimation. mclust is a powerful and popular package which allows modelling of data as a Gaussian finite mixture with different covariance structures and different numbers of mixture components, for a variety of analytical purposes. Recently, version 5 of the package has been made available on CRAN. This updated version adds new covariance structures, dimension reduction capabilities for visualisation, model selection criteria, initialisation strategies for the EM algorithm, and bootstrap-based inference, making it a full-featured R package for data analysis via finite mixture modelling. Science Foundation Ireland.
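    mclust is an R package; as a hedged analogue of its workflow of fitting mixtures over a grid of covariance structures and component counts and selecting by BIC, a sketch using scikit-learn (the data and grid below are illustrative assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters in two dimensions
X = np.vstack([rng.normal(-2.0, 0.5, size=(150, 2)),
               rng.normal(2.0, 0.5, size=(150, 2))])

# Fit mixtures with 1..5 components and two covariance structures,
# then keep the model minimising BIC. Note scikit-learn's BIC is
# "lower is better", whereas mclust reports BIC with the opposite sign.
best = min(
    (GaussianMixture(n_components=g, covariance_type=ct,
                     random_state=0).fit(X)
     for g in range(1, 6) for ct in ("full", "diag")),
    key=lambda m: m.bic(X),
)
```

    mclust's grid is richer (its fourteen covariance parameterisations such as EII, VVV, etc. have no direct scikit-learn equivalent), but the model-selection logic is the same.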

    Variable selection methods for model-based clustering

    No full text
    Model-based clustering is a popular approach for clustering multivariate data which has seen applications in numerous fields. Nowadays, high-dimensional data are increasingly common, and the model-based clustering approach has adapted to deal with the growing dimensionality. In particular, the development of variable selection techniques has received a lot of attention and research effort in recent years. Even for small problems, variable selection has been advocated to facilitate the interpretation of the clustering results. This review provides a summary of the methods developed for variable selection in model-based clustering. Existing R packages implementing the different methods are indicated and illustrated in application to two data analysis examples. Science Foundation Ireland; Insight Research Centre.
