8 research outputs found
Sparse Linear Identifiable Multivariate Modeling
In this paper we consider sparse and identifiable linear latent variable
(factor) and linear Bayesian network models for parsimonious analysis of
multivariate data. We propose a computationally efficient method for joint
parameter and model inference, and model comparison. It consists of a fully
Bayesian hierarchy for sparse models using slab and spike priors (two-component
delta-function and continuous mixtures), non-Gaussian latent factors and a
stochastic search over the ordering of the variables. The framework, which we
call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and
bench-marked on artificial and real biological data sets. SLIM is closest in
spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in
inference, Bayesian network structure learning and model comparison.
Experimentally, SLIM performs equally well or better than LiNGAM with
comparable computational complexity. We attribute this mainly to the stochastic
search strategy used, and to parsimony (sparsity and identifiability), which is
an explicit part of the model. We propose two extensions to the basic i.i.d.
linear framework: non-linear dependence on observed variables, called SNIM
(Sparse Non-linear Identifiable Multivariate modeling) and allowing for
correlations between latent variables, called CSLIM (Correlated SLIM), for the
temporal and/or spatial data. The source code and scripts are available from
http://cogsys.imm.dtu.dk/slim/.Comment: 45 pages, 17 figure
Latent protein trees
Unbiased, label-free proteomics is becoming a powerful technique for
measuring protein expression in almost any biological sample. The output of
these measurements after preprocessing is a collection of features and their
associated intensities for each sample. Subsets of features within the data are
from the same peptide, subsets of peptides are from the same protein, and
subsets of proteins are in the same biological pathways, therefore, there is
the potential for very complex and informative correlational structure inherent
in these data. Recent attempts to utilize this data often focus on the
identification of single features that are associated with a particular
phenotype that is relevant to the experiment. However, to date, there have been
no published approaches that directly model what we know to be multiple
different levels of correlation structure. Here we present a hierarchical
Bayesian model which is specifically designed to model such correlation
structure in unbiased, label-free proteomics. This model utilizes partial
identification information from peptide sequencing and database lookup as well
as the observed correlation in the data to appropriately compress features into
latent proteins and to estimate their correlation structure. We demonstrate the
effectiveness of the model using artificial/benchmark data and in the context
of a series of proteomics measurements of blood plasma from a collection of
volunteers who were infected with two different strains of viral influenza.Comment: Published in at http://dx.doi.org/10.1214/13-AOAS639 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org