Finite mixtures of matrix-variate Poisson-log normal distributions for three-way count data
Three-way data structures, characterized by three entities, the units, the
variables and the occasions, are frequent in biological studies. In RNA
sequencing, three-way data structures are obtained when high-throughput
transcriptome sequencing data are collected for n genes across p conditions at
r occasions. Matrix-variate distributions offer a natural way to model
three-way data and mixtures of matrix-variate distributions can be used to
cluster three-way data. Clustering of gene expression data is carried out as a
means of discovering gene co-expression networks. In this work, a mixture of
matrix-variate Poisson-log normal distributions is proposed for clustering read
counts from RNA sequencing. By considering the matrix-variate structure, full
information on the conditions and occasions of the RNA sequencing dataset is
simultaneously considered, and the number of covariance parameters to be
estimated is reduced. A Markov chain Monte Carlo expectation-maximization
algorithm is used for parameter estimation and information criteria are used
for model selection. The models are applied to both real and simulated data,
giving favourable clustering results.
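To make the reduction in covariance parameters concrete, the following is a minimal sketch, not the authors' implementation. It assumes each unit is an r x p matrix of counts (occasions in rows, conditions in columns, which is an assumption about orientation), that the latent log-means follow a matrix-variate normal distribution with row covariance Phi and column covariance Omega, and that counts are conditionally Poisson. The function names `cov_param_counts` and `sample_mvpln` are hypothetical.

```python
import numpy as np

def cov_param_counts(r, p):
    """Free covariance parameters for an r x p random matrix.

    Returns (unstructured, matrix_variate). The matrix-variate count ignores
    the one extra constraint needed to fix the scale of the Kronecker product.
    """
    d = r * p
    unstructured = d * (d + 1) // 2                    # full (rp x rp) covariance
    matrix_variate = r * (r + 1) // 2 + p * (p + 1) // 2  # Phi (r x r) and Omega (p x p)
    return unstructured, matrix_variate

def sample_mvpln(n, M, Phi, Omega, rng):
    """Draw n count matrices from a matrix-variate Poisson-log normal model.

    theta = M + A Z B^T with Phi = A A^T and Omega = B B^T is matrix-normal;
    each count is then Poisson with mean exp(theta_ij).
    """
    r, p = M.shape
    A = np.linalg.cholesky(Phi)     # row (r x r) factor
    B = np.linalg.cholesky(Omega)   # column (p x p) factor
    Z = rng.standard_normal((n, r, p))
    theta = M + A @ Z @ B.T         # broadcasts over the n draws
    return rng.poisson(np.exp(theta))

rng = np.random.default_rng(0)
Y = sample_mvpln(5, np.zeros((3, 4)), np.eye(3), np.eye(4), rng)
print(Y.shape)                 # (5, 3, 4)
print(cov_param_counts(3, 4))  # (78, 16)
```

With r = 3 and p = 4, an unstructured covariance over the 12 vectorised entries needs 78 parameters, while the two separate covariance matrices need 16 in total.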
Penalized model-based clustering for three-way data structures
Recently, there has been an increasing interest in developing statistical
methods able to find groups in matrix-valued data. To this end, matrix Gaussian
mixture models (MGMM) provide a natural extension to the popular model-based
clustering based on Normal mixtures. Unfortunately, the overparametrization issue,
already affecting the vector-variate framework, is further exacerbated when it comes
to MGMM, since the number of parameters scales quadratically with both row and
column dimensions. In order to overcome this limitation, the present paper introduces
a sparse model-based clustering approach for three-way data structures. By
means of penalized estimation, our methodology shrinks the estimates towards zero,
achieving more stable and parsimonious clustering in high dimensional scenarios.
An application to satellite images underlines the benefits of the proposed method.
Covariance pattern mixture models for the analysis of multivariate heterogeneous longitudinal data
We propose a novel approach for modeling multivariate longitudinal data in
the presence of unobserved heterogeneity for the analysis of the Health and
Retirement Study (HRS) data. Our proposal can be cast within the framework of
linear mixed models with discrete individual random intercepts; however,
differently from the standard formulation, the proposed Covariance Pattern
Mixture Model (CPMM) does not require the usual local independence assumption.
The model is thus able to simultaneously model the heterogeneity, the
association among the responses and the temporal dependence structure. We focus
on the investigation of temporal patterns related to the cognitive functioning
in retired American respondents. In particular, we aim to understand whether it
can be affected by some individual socio-economic characteristics and whether
it is possible to identify some homogeneous groups of respondents that share a
similar cognitive profile. An accurate description of the detected groups
allows government policy interventions to be appropriately targeted. Results
identify three homogeneous clusters of individuals with specific cognitive
functioning, consistent with the class conditional distribution of the
covariates. The flexibility of CPMM allows for a different contribution of each
regressor on the responses according to group membership. In so doing, the
identified groups receive a global and accurate phenomenological
characterization.
Comment: Published at http://dx.doi.org/10.1214/15-AOAS816 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
Copula-based fuzzy clustering of spatial time series
This paper contributes to the literature on the analysis of spatial time series by presenting a new clustering algorithm called COFUST (COpula-based FUzzy clustering algorithm for Spatial Time series). The underlying idea is to perform a fuzzy Partitioning Around Medoids (PAM) clustering using a copula-based approach to interpret comovements of time series. This generalisation both extends the usual clustering methods for time series based on Pearson’s correlation and captures the uncertainty that arises when assigning units to clusters. Furthermore, its flexibility allows spatial information to be included directly in the algorithm. Our approach is presented and discussed using both simulated and real data, highlighting its main advantages.
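The general idea of fuzzy medoid clustering on a rank-based dissimilarity can be sketched as follows. This is not the COFUST algorithm itself: it assumes Kendall's tau (which depends only on the copula of two series) as a stand-in for the paper's copula-based measure, uses the standard fuzzy medoid membership update with fuzziness parameter m, and leaves out the spatial component entirely. All function names are hypothetical.

```python
import numpy as np

def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length series (no tie correction)."""
    n = len(x)
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    # Every pair appears twice in the full sign matrices, hence n * (n - 1).
    return (sx * sy).sum() / (n * (n - 1))

def tau_dissimilarity(series):
    """Pairwise dissimilarity 1 - tau between the rows of `series`."""
    k = series.shape[0]
    d = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            d[i, j] = d[j, i] = 1.0 - kendall_tau(series[i], series[j])
    return d

def fuzzy_memberships(d, medoids, m=2.0):
    """Membership degrees of every unit in every cluster; rows sum to 1."""
    dm = np.maximum(d[:, medoids], 1e-12)   # floor avoids division by zero
    w = dm ** (-1.0 / (m - 1.0))
    return w / w.sum(axis=1, keepdims=True)

def fuzzy_medoids(d, n_clusters, m=2.0, n_iter=50, seed=0):
    """Alternate membership and medoid updates until the medoids stop changing.

    Simplified: there is no safeguard against two clusters picking the same medoid.
    """
    rng = np.random.default_rng(seed)
    medoids = rng.choice(d.shape[0], size=n_clusters, replace=False)
    for _ in range(n_iter):
        u = fuzzy_memberships(d, medoids, m)
        # New medoid of cluster c: the unit q minimising sum_i u[i, c]^m * d[i, q].
        new = np.argmin((u ** m).T @ d, axis=1)
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, fuzzy_memberships(d, medoids, m)

t = np.arange(5.0)
series = np.vstack([t, t ** 2, -t, -t ** 3])   # two rising, two falling series
medoids, u = fuzzy_medoids(tau_dissimilarity(series), n_clusters=2)
print(u.shape)   # (4, 2)
```

Each row of `u` gives one unit's degrees of membership across clusters, which is how the uncertainty in assigning units to clusters is represented, in contrast with a hard partition.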