1,923 research outputs found
Mixtures of Regression Models for Time-Course Gene Expression Data: Evaluation of Initialization and Random Effects
Finite mixture models are routinely applied to time course microarray data.
Due to the complexity and size of this type of data, the choice of good starting values plays
an important role. So far, initialization strategies have only been investigated for data
from a mixture of multivariate normal distributions. In this work, several initialization
procedures are evaluated for mixtures of regression models with and without random
effects in an extensive simulation study on different artificial datasets. Finally, these
procedures are also applied to a real dataset from E. coli.
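For orientation, one common way to write a mixture of regression models with and without random effects is sketched below. The notation (K components, weights \pi_k, design matrices X_i and Z_i, covariance \Psi_k) is an assumption introduced here for illustration and is not taken from the abstract.

```latex
% Without random effects: gene i follows one of K regression components.
f(y_i \mid X_i) = \sum_{k=1}^{K} \pi_k \,
  \mathcal{N}\!\left(y_i ;\; X_i \beta_k,\; \sigma_k^2 I\right),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1 .

% With random effects: within component k,
%   y_i = X_i \beta_k + Z_i b_i + \varepsilon_i,
%   b_i \sim \mathcal{N}(0, \Psi_k), \quad \varepsilon_i \sim \mathcal{N}(0, \sigma_k^2 I),
% so that marginally
f(y_i \mid X_i, Z_i) = \sum_{k=1}^{K} \pi_k \,
  \mathcal{N}\!\left(y_i ;\; X_i \beta_k,\; Z_i \Psi_k Z_i^{\top} + \sigma_k^2 I\right).
```

Such models are usually fitted with the EM algorithm, whose result depends on the starting values, which is why the initialization strategy matters.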
Parsimonious Time Series Clustering
We introduce a parsimonious model-based framework for clustering time course
data. In these applications the computational burden often becomes an issue due
to the number of available observations. The measured time series can also be
very noisy and sparse and a suitable model describing them can be hard to
define. We propose to model the observed measurements by using P-spline
smoothers and to cluster the functional objects as summarized by the optimal
spline coefficients. In principle, this idea can be adopted within all the most
common clustering frameworks. In this work we discuss applications based on a
k-means algorithm. We evaluate the accuracy and the efficiency of our proposal
by simulations and by analyzing Drosophila melanogaster gene expression
data.
Comparison of Clustering Methods for Time Course Genomic Data: Applications to Aging Effects
Time course microarray data provide insight about dynamic biological
processes. While several clustering methods have been proposed for the analysis
of these data structures, comparison and selection of appropriate clustering
methods are seldom discussed. We compared probabilistic based clustering
methods and distance based clustering methods for time course microarray
data. Among probabilistic methods, we considered: smoothing spline clustering
(SSC), also known as model-based functional data analysis (MFDA), functional
clustering models for sparsely sampled data (FCM) and model-based clustering
(MCLUST). Among distance based methods, we considered: weighted gene
co-expression network analysis (WGCNA), clustering with dynamic time warping
distance (DTW) and clustering with autocorrelation based distance (ACF). We
studied these algorithms in both simulated settings and case study data. Our
investigations showed that FCM performed very well when gene curves were short
and sparse. DTW and WGCNA performed well when gene curves were medium or long
( observations). SSC performed very well when there were clusters of gene
curves similar to one another. Overall, ACF performed poorly in these
applications. In terms of computation time, FCM, SSC and DTW were considerably
slower than MCLUST and WGCNA. WGCNA outperformed MCLUST by generating more
accurate and biologically meaningful clustering results. WGCNA and MCLUST are the
best methods among the 6 methods compared, when performance and computation
time are both taken into account. WGCNA outperforms MCLUST, but MCLUST provides
model based inference and uncertainty measures for the clustering results.
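As an illustration of the dynamic time warping distance mentioned above, here is a minimal sketch of textbook DTW with an absolute-difference local cost and no warping-window constraint. The abstract does not say which DTW variant or software was used, so the function name `dtw_distance` and these choices are assumptions made here.

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two numeric sequences.

    Assumption: absolute difference as local cost, no window constraint.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# Two expression profiles with the same shape but different timing.
profile_a = [0.0, 1.0, 2.0]
profile_b = [0.0, 0.0, 1.0, 2.0]
print(dtw_distance(profile_a, profile_b))
```

The pairwise distances computed this way could then be passed to any distance-based clustering routine, such as hierarchical clustering.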
Measuring Cluster Stability for Bayesian Nonparametrics Using the Linear Bootstrap
Clustering procedures typically estimate which data points are clustered
together, a quantity of primary importance in many analyses. Often used as a
preliminary step for dimensionality reduction or to facilitate interpretation,
finding robust and stable clusters is often crucial for appropriate
downstream analysis. In the present work, we consider Bayesian nonparametric
(BNP) models, a particularly popular set of Bayesian models for clustering due
to their flexibility. Because of its complexity, the Bayesian posterior often
cannot be computed exactly, and approximations must be employed. Mean-field
variational Bayes forms a posterior approximation by solving an optimization
problem and is widely used due to its speed. An exact BNP posterior might vary
dramatically when presented with different data. As such, stability and
robustness of the clustering should be assessed.
A popular means of assessing stability is to apply the bootstrap by resampling
the data and rerunning the clustering for each simulated data set. The time cost
is thus often very high, especially for the sort of exploratory analysis
where clustering is typically used. We propose to use a fast and automatic
approximation to the full bootstrap called the "linear bootstrap", which can be
viewed as a local data perturbation. In this work, we demonstrate how to apply this
idea to a data analysis pipeline, consisting of an MFVB approximation to a BNP
clustering posterior of time course gene expression data. We show that using
auto-differentiation tools, the necessary calculations can be done
automatically, and that the linear bootstrap is a fast but approximate
alternative to the bootstrap. Comment: 9 pages, NIPS 2017 Advances in Approximate Bayesian Inference
Workshop
Modelling time course gene expression data with finite mixtures of linear additive models
Summary: A model class of finite mixtures of linear additive models is presented. The component-specific parameters in the regression models are estimated using regularized likelihood methods. The advantages of the regularization are that (i) the pre-specified maximum degrees of freedom for the splines is less crucial than for unregularized estimation and that (ii) for each component individually a suitable degree of freedom is selected in an automatic way. The performance is evaluated in a simulation study with artificial data as well as on a yeast cell cycle dataset of gene expression levels over time.
Joint Clustering and Registration of Functional Data
Curve registration and clustering are fundamental tools in the analysis of
functional data. While several methods have been developed and explored for
either task individually, limited work has been done to infer functional
clusters and register curves simultaneously. We propose a hierarchical model
for joint curve clustering and registration. Our proposal combines a Dirichlet
process mixture model for clustering of common shapes, with a reproducing
kernel representation of phase variability for registration. We show how
inference can be carried out applying standard posterior simulation algorithms
and compare our method to several alternatives in both engineered data and a
benchmark analysis of the Berkeley growth data. We conclude our investigation
with an application to time course gene expression.
M-quantile regression analysis of temporal gene expression data
In this paper, we explore the use of M-regression and M-quantile coefficients to detect statistical differences between temporal curves that belong to different experimental conditions. In particular, we consider the application to temporal gene expression data. Here, the aim is to detect genes whose temporal expression is significantly different across a number of biological conditions. We present a new method to approach this problem. Firstly, the temporal profiles of the genes are modelled by a parametric M-quantile regression model. This model is particularly appealing for small-sample gene expression data, as it is very robust against outliers and it does not make any assumption on the error distribution. Secondly, we further increase the robustness of the method by summarising the M-quantile regression models for a large range of quantile values into an M-quantile coefficient. Finally, we employ a Hotelling T²-test to detect significant differences of the temporal M-quantile profiles across conditions. Simulated data show the increased robustness of M-quantile regression methods over standard regression methods. We conclude by using the method to detect differentially expressed genes from time-course microarray data on muscular dystrophy.
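To make the first step concrete, here is a minimal sketch of M-quantile regression for a straight-line temporal profile, fitted by iteratively reweighted least squares with Huber's influence function. It is an illustration under stated assumptions, not the authors' implementation: the function name `mquantile_line`, the tuning constant c = 1.345, the MAD-based scale estimate and the starting values are all choices made here, and the paper's parametric model may be richer than a line.

```python
from statistics import median


def mquantile_line(t, y, q=0.5, c=1.345, n_iter=100, tol=1e-10):
    """Fit y ~ b0 + b1 * t by M-quantile regression (Huber psi) using IRLS.

    q in (0, 1) is the M-quantile level; c is the Huber tuning constant.
    """
    b0, b1 = median(y), 0.0  # crude starting values (an arbitrary choice)
    for _ in range(n_iter):
        res = [yi - (b0 + b1 * ti) for ti, yi in zip(t, y)]
        # Robust residual scale: MAD / 0.6745, with a fallback if it degenerates.
        med = median(res)
        scale = median([abs(r - med) for r in res]) / 0.6745
        if scale < 1e-12:
            scale = 1.0
        # IRLS weights psi_q(u) / u: asymmetric factor times the Huber weight.
        w = []
        for r in res:
            u = abs(r) / scale
            huber = 1.0 if u <= c else c / u
            w.append(2.0 * (q if r > 0 else 1.0 - q) * huber)
        # Closed-form weighted least squares for a straight line.
        sw = sum(w)
        tbar = sum(wi * ti for wi, ti in zip(w, t)) / sw
        ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (ti - tbar) ** 2 for wi, ti in zip(w, t))
        sxy = sum(wi * (ti - tbar) * (yi - ybar) for wi, ti, yi in zip(w, t, y))
        new_b1 = sxy / sxx
        new_b0 = ybar - new_b1 * tbar
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


# Fit several M-quantile lines to one (made-up) temporal profile.
times = [0, 1, 2, 3, 4, 5, 6, 7]
expr = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.8, 15.3]
for level in (0.25, 0.5, 0.75):
    print(level, mquantile_line(times, expr, q=level))
```

Setting q = 0.5 gives the usual Huber M-regression; other values of q weight positive and negative residuals asymmetrically, and the fits across a range of q are what the abstract summarises into an M-quantile coefficient.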