Modelling time course gene expression data with finite mixtures of linear additive models
Summary: A model class of finite mixtures of linear additive models is presented. The component-specific parameters in the regression models are estimated using regularized likelihood methods. The advantages of the regularization are that (i) the pre-specified maximum degrees of freedom for the splines are less crucial than in unregularized estimation and (ii) a suitable degree of freedom is selected automatically for each component individually. The performance is evaluated in a simulation study with artificial data as well as on a yeast cell-cycle dataset of gene expression levels over time.
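To illustrate the component-specific estimation idea (without the regularized spline machinery the summary describes), here is a minimal EM sketch for a two-component mixture of simple linear regressions; the data and starting values are invented for the example and nothing here reproduces the paper's method:

```python
import math

def em_mix_linreg(x, y, n_iter=100):
    """EM for a two-component mixture of simple linear regressions.

    Each component k has its own line a_k + b_k * x and noise variance;
    a bare-bones stand-in for a mixture of linear additive models.
    """
    n = len(x)
    # deliberately asymmetric start so the components can separate
    a, b = [min(y), max(y)], [0.0, 0.0]
    var, pi = [1.0, 1.0], [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibilities from Gaussian residual densities
        resp = []
        for xi, yi in zip(x, y):
            dens = [pi[k] / math.sqrt(2 * math.pi * var[k])
                    * math.exp(-(yi - a[k] - b[k] * xi) ** 2 / (2 * var[k]))
                    for k in range(2)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: weighted least squares per component
        for k in range(2):
            w = [r[k] for r in resp]
            sw = sum(w)
            xm = sum(wi * xi for wi, xi in zip(w, x)) / sw
            ym = sum(wi * yi for wi, yi in zip(w, y)) / sw
            sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
            sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
            b[k] = sxy / sxx
            a[k] = ym - b[k] * xm
            var[k] = max(sum(wi * (yi - a[k] - b[k] * xi) ** 2
                             for wi, xi, yi in zip(w, x, y)) / sw, 1e-6)
            pi[k] = sw / n
    return a, b, pi

# two noiseless trends: y = x and y = 3 - x
x = [0.0, 0.25, 0.5, 0.75, 1.0] * 2
y = [xi for xi in x[:5]] + [3 - xi for xi in x[5:]]
a, b, pi = em_mix_linreg(x, y)
```

The E-step soft-assigns each observation to a component from its residual density; the M-step refits each line by responsibility-weighted least squares. Regularization of the component smoothers, as in the paper, would enter the M-step.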
Model-Based Clustering and Classification of Functional Data
The problem of complex data analysis is a central topic of modern statistical
science and learning systems and is becoming of broader interest with the
increasing prevalence of high-dimensional data. The challenge is to develop
statistical models and autonomous algorithms that are able to acquire knowledge
from raw data for exploratory analysis, which can be achieved through
clustering techniques, or to make predictions of future data via classification
(i.e., discriminant analysis) techniques. Latent data models, including mixture
model-based approaches, are among the most popular and successful approaches in
both the unsupervised context (i.e., clustering) and the supervised one (i.e.,
classification or discrimination). Although traditionally tools of multivariate
analysis, they are growing in popularity when considered in the framework of
functional data analysis (FDA). FDA is the data analysis paradigm in which the
individual data units are functions (e.g., curves, surfaces), rather than
simple vectors. In many areas of application, the analyzed data are indeed
often available in the form of discretized values of functions or curves (e.g.,
time series, waveforms) and surfaces (e.g., 2d-images, spatio-temporal data).
This functional aspect of the data adds additional difficulties compared to the
case of a classical multivariate (non-functional) data analysis. We review and
present approaches for model-based clustering and classification of functional
data. We derive well-established statistical models along with efficient
algorithmic tools to address problems regarding the clustering and the
classification of these high-dimensional data, including their heterogeneity,
missing information, and dynamical hidden structure. The presented models and
algorithms are illustrated on real-world functional data analysis problems from
several application areas.
Multilevel modelling for inference of genetic regulatory networks
Time-course experiments with microarrays are often used to study dynamic biological systems and genetic regulatory networks (GRNs) that model how genes influence each other in the cell-level development of organisms. Inference for GRNs provides important insights into fundamental biological processes such as growth and is useful in disease diagnosis and genomic drug design. Due to the experimental design, multilevel data hierarchies are often present in time-course gene expression data. Most existing methods, however, ignore the dependency of the expression measurements over time and the correlation among gene expression profiles. Such independence assumptions are at odds with the regulatory interactions, can overlook important subject effects, and can lead to spurious inference about regulatory networks or mechanisms. In this paper, a multilevel mixed-effects model is adopted to incorporate data hierarchies in the analysis of time-course data, where temporal and subject effects are both assumed to be random. The method starts by clustering the genes, fitting a mixture model within the multilevel random-effects model framework using the expectation-maximization (EM) algorithm. The network of regulatory interactions is then determined by searching for regulatory control elements (activators and inhibitors) shared by the clusters of co-expressed genes, based on time-lagged correlation coefficients. The method is applied to two real time-course datasets from the budding yeast (Saccharomyces cerevisiae) genome. It is shown that the proposed method provides clusters of cell-cycle-regulated genes that are supported by existing gene function annotations, and hence enables inference on regulatory interactions for the genetic network.
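The time-lagged correlation step can be sketched as follows; the function and the toy expression profiles are illustrative assumptions, not the paper's implementation:

```python
import math

def lagged_correlation(x, y, lag):
    """Pearson correlation between x(t) and y(t + lag).

    A strong positive score at some lag > 0 suggests x may act upstream
    of y (e.g. as an activator); a strong negative score may indicate
    inhibition. Toy illustration only.
    """
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    elif lag < 0:
        x, y = x[-lag:], y[:lag]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# a hypothetical regulator whose profile leads the target by one time step
regulator = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4]
target    = [0.5, 0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
best_lag = max(range(-2, 3), key=lambda k: lagged_correlation(regulator, target, k))
```

Scanning a small window of lags and keeping the strongest score, as in the last line, is the usual way such a measure is turned into a candidate regulator-target edge.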
Joint Clustering and Registration of Functional Data
Curve registration and clustering are fundamental tools in the analysis of
functional data. While several methods have been developed and explored for
either task individually, limited work has been done to infer functional
clusters and register curves simultaneously. We propose a hierarchical model
for joint curve clustering and registration. Our proposal combines a Dirichlet
process mixture model for clustering of common shapes, with a reproducing
kernel representation of phase variability for registration. We show how
inference can be carried out applying standard posterior simulation algorithms
and compare our method to several alternatives in both engineered data and a
benchmark analysis of the Berkeley growth data. We conclude our investigation
with an application to time course gene expression data.
Clustering of temporal gene expression data with mixtures of mixed effects models
While time-dependent processes are important to biological functions, methods to leverage temporal information from large data have remained computationally challenging. In temporal gene-expression data, clustering can be used to identify genes with shared function in complex processes. Algorithms like K-means and standard Gaussian mixture models (GMM) fail to account for variability in replicated data or repeated measures over time, and they require a priori assumptions about the number of clusters, so many candidate cluster numbers must be evaluated to select an optimal result. An improved penalized GMM offers a computationally efficient algorithm that optimizes the cluster number and labels simultaneously.
The work presented in this dissertation was motivated by mouse bone-fracture studies aimed at determining patterns of temporal gene expression during bone-healing progression. To this end, an extension of the penalized GMM was proposed that accounts for correlation between replicated data and repeated measures over time by introducing random effects, using a mixture of mixed-effects polynomial regression models fitted with an entropy-penalized EM algorithm (EPEM).
First, the performance of EPEM for different mixed-effects models was assessed in simulation studies and applied to the fracture-healing study. Second, modifications addressing the high computational cost of EPEM were considered, which either clustered subsets of the data determined by the predicted polynomial order (S-EPEM) or used a modified initialization to decrease the initial burden (I-EPEM). Each was compared to EPEM and applied to the fracture-healing study. Lastly, as varied rates of fracture healing were observed in mice with different genetic backgrounds (strains), a new analysis strategy was proposed to compare patterns of temporal gene expression between strains and assessed with simulation studies. Expression profiles for each strain were treated as separate objects to cluster, in order to identify genes assigned to different groups across strains.
We found that the addition of random effects decreased the accuracy of predicted cluster labels compared to K-means, GMM, and fixed-effects EPEM. Polynomial-order optimization with BIC performed with the highest accuracy, and optimization on subspaces obtained with singular value decomposition performed well. Computation time for S-EPEM was much reduced, with a slight decrease in accuracy. I-EPEM was comparable to EPEM, with similar accuracy and a decrease in computation time. Application of the new analysis strategy to the fracture-healing data identified several distinct temporal gene-expression patterns for the different strains.
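The cluster-number selection alluded to above is classically handled by fitting candidate models and comparing BIC. A minimal sketch, where the log-likelihoods and sample size are placeholders rather than values from the study:

```python
import math

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: lower is better."""
    return -2.0 * log_lik + n_params * math.log(n_obs)

# hypothetical fits of 1-D GMMs with K = 1..4 components to n = 200 profiles;
# a 1-D GMM with K components has 3K - 1 free parameters
n = 200
fits = {1: -512.0, 2: -430.5, 3: -428.9, 4: -427.8}  # placeholder log-likelihoods
scores = {k: bic(ll, 3 * k - 1, n) for k, ll in fits.items()}
best_k = min(scores, key=scores.get)
```

The penalty term `n_params * log(n_obs)` grows faster than the likelihood improves once extra components stop explaining real structure, which is what lets the criterion pick a cluster number; the penalized-GMM approach above folds an analogous penalty directly into the EM objective instead.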
Statistical analysis of functional data with complex structures
Longitudinal studies play a salient role in many and various research areas and their relevance
is still increasing. The related methods have become a privileged tool for analyzing the
evolution of a given phenomenon across time. Longitudinal data arise when measurements
for one or more variables are taken at different points of a temporal axis on individuals
involved in the study. A key feature of such type of data is that observations within the
same subject may be correlated. That fundamental characteristic makes longitudinal data
different from other types of data in statistics and motivates specific methodologies. There
have been remarkable developments in that field in the past forty years. Typical analysis of
longitudinal data relies on parametric, non-parametric or semi-parametric models. However,
an important question widely addressed in the analysis of longitudinal data is related to
cluster analysis and concerns the existence of groups or clusters (or homogeneous trajectories),
suggested by the data, not defined a priori, such that individuals in a given cluster
tend to be similar to each other in some sense, and individuals in different clusters tend to be
dissimilar. This thesis aims at contributing to that rapidly expanding field of clustering longitudinal
data. Indeed, an emerging non-parametric methodology for modeling longitudinal
data is based on the functional data analysis approach in which longitudinal trajectories are
viewed as a sample of partially observed functions or curves on some interval where these
functions are often assumed to be smooth. We then propose in the present thesis, a succinct
review of the most commonly used methods to analyze and cluster longitudinal data and
two new model-based functional clustering methods. Indeed, we review most of the typical
longitudinal data analysis models ranging from the parametric models to the semi and non
parametric ones, as well as the recent developments in longitudinal cluster analysis according
to the two main approaches: non-parametric and model-based. The purpose of that review
is to provide a concise, broad and readily accessible overview of longitudinal data analysis
and clustering methods. In the first method developed in this thesis, we use the functional
data analysis approach to propose a very flexible model which combines functional principal
components analysis and clustering to deal with any type of longitudinal data, even if the observations are sparse, irregularly spaced or occur at different time points for each individual.
The functional modeling is based on splines and the main data groups are modeled
as arising from clusters in the space of spline coefficients. The model, based on a mixture
of Student’s t-distributions, is embedded into a Bayesian framework in which maximum a
posteriori estimators are found with the EM algorithm. We develop an approximation of
the marginal log-likelihood (MLL) that allows us to perform an MLL based model selection
and that compares favourably with other popular criteria such as AIC and BIC. In the
second method, we propose a new time-course or longitudinal data analysis framework that
aims at combining functional model-based clustering and the Lasso penalization to identify
groups of individuals with similar patterns. An EM algorithm-based approach is used on a
functional modeling where the individual curves are approximated into a space spanned by a
finite basis of B-splines and the number of clusters is determined by penalizing a mixture of
Student’s t-distributions with unknown degrees of freedom. The Latin Hypercube Sampling
is used to efficiently explore the space of penalization parameters. For both methodologies,
the estimation of the parameters is based on the iterative expectation-maximization (EM)
algorithm
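The Latin hypercube step mentioned above can be sketched as follows; the function and the penalty-parameter ranges are illustrative assumptions, not the thesis code:

```python
import random

def latin_hypercube(n_samples, bounds, seed=0):
    """Latin hypercube sample over a box.

    Each dimension is cut into n_samples equal strata; exactly one point
    falls in each stratum, and the strata are shuffled independently per
    dimension. bounds is a list of (low, high) pairs, e.g. ranges for the
    two Lasso penalty parameters.
    """
    rng = random.Random(seed)
    cols = []
    for low, high in bounds:
        strata = list(range(n_samples))
        rng.shuffle(strata)
        width = (high - low) / n_samples
        # one uniformly random point inside each stratum
        cols.append([low + (s + rng.random()) * width for s in strata])
    return [tuple(col[i] for col in cols) for i in range(n_samples)]

# e.g. explore two penalty parameters, lambda1 in [0, 1] and lambda2 in [0, 10]
grid = latin_hypercube(5, [(0.0, 1.0), (0.0, 10.0)])
```

Compared with a full grid, this covers every marginal range with far fewer model fits, which is why it suits an expensive EM-based objective.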
Clustering time-course data using P-splines and mixed effects mixture models
Mini Dissertation (MCom (Advanced Data Analytics)), University of Pretoria, 2022.
In the field of biology, gene expressions are evaluated over time to study complicated biological processes and
genetic regulatory networks. Because the process is continuous, time-course gene-expression data may be
represented by a continuous function.
This mini dissertation addresses cluster analysis of time-course data in a mixture model framework. To
take into account the time dependency of such time-course data, as well as the degree of error present in
many datasets, the mixed effects model with penalized B-splines is considered. In this mini dissertation the
performance of such a mixed effects model has been studied with regard to the clustering of time-course gene
expression data in a mixture model system. The EM algorithm has been implemented to fit the mixture model
in a mixed effects model structure. For each subject the best linear unbiased smooth estimate of its time-course
trajectory has been calculated and subjects with similar mean curves have been clustered in the same cluster.
Model validation statistics such as the model accuracy and the coefficient of determination (R²) indicate
that the model can cluster simulated data effectively into clusters that differ in either the form of the curves
or the timing of the curves’ peaks. The proposed technique is further demonstrated by clustering time-course
gene expression data consisting of microarray samples from lung tissue of mice exposed to different Influenza
strains at 14 time points.
National Research Foundation, South Africa (SARChI Research Chair: Computational and Methodological Statistics, Grant number 71199).