Variational approximation for mixtures of linear mixed models
Mixtures of linear mixed models (MLMMs) are useful for clustering grouped
data and can be estimated by likelihood maximization through the EM algorithm.
The conventional approach to determining a suitable number of components is to
compare different mixture models using penalized log-likelihood criteria such
as BIC. We propose fitting MLMMs with variational methods which can perform
parameter estimation and model selection simultaneously. A variational
approximation is described where the variational lower bound and parameter
updates are in closed form, allowing fast evaluation. A new variational greedy
algorithm is developed for model selection and learning of the mixture
components. This approach allows an automatic initialization of the algorithm
and returns a plausible number of mixture components automatically. In cases of
weak identifiability of certain model parameters, we use hierarchical centering
to reparametrize the model and show empirically that there is a gain in
efficiency by variational algorithms similar to that in MCMC algorithms.
Related to this, we prove that the approximate rate of convergence of
variational algorithms by Gaussian approximation is equal to that of the
corresponding Gibbs sampler, which suggests that reparametrizations can lead
to improved convergence in variational algorithms as well.
Comment: 36 pages, 5 figures, 2 tables, submitted to JCG
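The abstract above describes simultaneous parameter estimation and model selection for mixture models. As an illustrative sketch only (a plain EM algorithm for a one-dimensional Gaussian mixture with weight-based pruning, not the paper's variational MLMM method; all names are hypothetical), the idea of selecting the number of components during fitting can look like this:

```python
import numpy as np

def em_gmm_prune(x, k=5, prune=1e-2, iters=200, seed=0):
    """EM for a 1-D Gaussian mixture in which components whose mixing
    weight collapses below `prune` are dropped, so a plausible number
    of components emerges during fitting rather than via a separate
    BIC comparison. Illustrative toy, not the paper's algorithm."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)   # init means at data points
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities of each component for each point
        dens = (w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                / np.sqrt(2 * np.pi * var))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form weighted updates
        n_k = r.sum(axis=0)
        w = n_k / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n_k
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k + 1e-8
        # prune near-empty components (a crude stand-in for model selection)
        keep = w > prune
        mu, var, w = mu[keep], var[keep], w[keep] / w[keep].sum()
    return w, mu, var
```

The paper's variational approach goes further: its lower bound and updates remain in closed form, and the greedy algorithm adds rather than only removes components; the sketch only conveys the "fit and select at once" principle.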
Bayesian unsupervised learning with multiple data types
Copyright © 2009 Walter de Gruyter. The final publication is available at www.degruyter.com.
We propose Bayesian generative models for unsupervised learning with two types of data and an assumed dependency of one type of data on the other. We consider two algorithmic approaches, based on a correspondence model where latent variables are shared across datasets. These models indicate the appropriate number of clusters in addition to indicating relevant features in both types of data. We evaluate the model on artificially created data. We then apply the method to a breast cancer dataset consisting of gene expression and microRNA array data derived from the same patients. We assume dependence of gene expression on microRNA expression in this study. The method ranks genes within subtypes which have statistically significant abnormal expression and ranks associated abnormally expressing microRNA. We report a genetic signature for the basal-like subtype of breast cancer found across a number of previous gene expression array studies. Using the two algorithmic approaches we find that this signature also arises from clustering on the microRNA expression data and appears derivative from this data.
Non-parametric Bayesian modelling of digital gene expression data
Next-generation sequencing technologies provide a revolutionary tool for
generating gene expression data. Starting with a fixed RNA sample, they
construct a library of millions of differentially abundant short sequence tags
or "reads", which constitute a fundamentally discrete measure of the level of
gene expression. A common limitation in experiments using these technologies is
the low number or even absence of biological replicates, which complicates the
statistical analysis of digital gene expression data. Analysis of this type of
data has often been based on modified tests originally devised for analysing
microarrays; both these and even de novo methods for the analysis of RNA-seq
data are plagued by the common problem of low replication. We propose a novel,
non-parametric Bayesian approach for the analysis of digital gene expression
data. We begin with a hierarchical model for modelling over-dispersed count
data and a blocked Gibbs sampling algorithm for inferring the posterior
distribution of model parameters conditional on these counts. The algorithm
compensates for the problem of low numbers of biological replicates by
clustering together genes with tag counts that are likely sampled from a common
distribution and using this augmented sample for estimating the parameters of
this distribution. The number of clusters is not decided a priori, but it is
inferred along with the remaining model parameters. We demonstrate the ability
of this approach to model biological data with high fidelity by applying the
algorithm on a public dataset obtained from cancerous and non-cancerous neural
tissues.
Deep mixture of linear mixed models for complex longitudinal data
Mixtures of linear mixed models are widely used for modelling longitudinal
data for which observation times differ between subjects. In typical
applications, temporal trends are described using a basis expansion, with basis
coefficients treated as random effects varying by subject. Additional random
effects can describe variation between mixture components, or other known
sources of variation in complex experimental designs. A key advantage of these
models is that they provide a natural mechanism for clustering, which can be
helpful for interpretation in many applications. Current versions of mixtures
of linear mixed models are not specifically designed for the case where there
are many observations per subject and a complex temporal trend, which requires
a large number of basis functions to capture. In this case, the
subject-specific basis coefficients are a high-dimensional random effects
vector, for which the covariance matrix is hard to specify and estimate,
especially if it varies between mixture components. To address this issue, we
consider the use of recently-developed deep mixture of factor analyzers models
as the prior for the random effects. The resulting deep mixture of linear mixed
models is well-suited to high-dimensional settings, and we describe an
efficient variational inference approach to posterior computation. The efficacy
of the method is demonstrated on both real and simulated data.
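The generative structure described above, where subject-specific basis coefficients act as random effects drawn around mixture-component means, can be sketched as a toy simulation (a cubic polynomial basis and Gaussian random effects; an illustrative stand-in, not the paper's deep mixture-of-factor-analyzers prior, and all parameter values are made up):

```python
import numpy as np

def simulate_mlmm(n_subjects=6, n_obs=40, seed=0):
    """Simulate longitudinal curves from a two-component mixture of
    linear mixed models: each subject draws a mixture label, then
    subject-specific basis coefficients (random effects) around that
    component's mean coefficients, plus observation noise."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_obs)
    B = np.vstack([np.ones_like(t), t, t**2, t**3]).T   # cubic basis
    comp_mean = np.array([[0.0,  2.0, -1.0,  0.5],      # component 0 coefs
                          [1.0, -2.0,  3.0, -1.0]])     # component 1 coefs
    data = []
    for _ in range(n_subjects):
        k = rng.integers(2)                              # mixture label
        beta = comp_mean[k] + 0.1 * rng.standard_normal(4)  # random effects
        y = B @ beta + 0.05 * rng.standard_normal(n_obs)    # noisy curve
        data.append((k, t, y))
    return data
```

In the high-dimensional regime the abstract targets, the four-coefficient vector `beta` becomes long and its component-specific covariance hard to estimate, which is where the deep factor-analyzer prior replaces the simple Gaussian used here.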