Probabilistic Latent Variable Models in Statistical Genomics
In this thesis, we propose different probabilistic latent variable models
to identify and capture the hidden structure present in commonly studied
genomics datasets. We first investigate how to correct for unwanted
correlations due to hidden confounding factors in gene expression data.
This is particularly important in expression quantitative trait loci (eQTL)
studies, where the goal is to identify associations between genetic variants
and gene expression levels. We start with a naïve approach, which estimates
the latent factors from the gene expression data alone, ignoring the
genetics, and we show that it leads to a loss of signal in the data. We then
highlight how, thanks to the formulation of our model as a probabilistic
model, it is straightforward to modify it in order to take into account the
specific properties of the data. In particular, we show that in the naïve
approach the latent variables "explain away" the genetic signal, and that
this problem can be avoided by jointly inferring these latent variables
while taking into account the genetic information. We then extend this, so
far additive, model to additionally detect interactions between the latent
variables and the genetic markers. We show that this leads to a better
reconstruction of the latent space and that it helps to dissect latent
variables capturing general confounding factors (such as batch effects) from
those capturing environmental factors involved in genotype-by-environment
interactions. Finally, we investigate the effects of misspecifications of
the noise model in genetic studies, showing how the probabilistic framework
presented so far can be easily extended to automatically infer non-linear
monotonic transformations of the data such that the common assumption of
Gaussian distributed residuals is respected.
Getting started in probabilistic graphical models
Probabilistic graphical models (PGMs) have become a popular tool for
computational analysis of biological data in a variety of domains. But what
exactly are they and how do they work? How can we use PGMs to discover patterns
that are biologically relevant? And to what extent can PGMs help us formulate
new hypotheses that are testable at the bench? This note sketches out some
answers and illustrates the main ideas behind the statistical approach to
biological pattern discovery. Comment: 12 pages, 1 figure
Deep generative modeling for single-cell transcriptomics.
Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells (https://github.com/YosefLab/scVI). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task.
Cancer gene prioritization by integrative analysis of mRNA expression and DNA copy number data: a comparative review
A variety of genome-wide profiling techniques are available to probe
complementary aspects of genome structure and function. Integrative analysis of
heterogeneous data sources can reveal higher-level interactions that cannot be
detected based on individual observations. A standard integration task in
cancer studies is to identify altered genomic regions that induce changes in
the expression of the associated genes based on joint analysis of genome-wide
gene expression and copy number profiling measurements. In this review, we
provide a comparison among various modeling procedures for integrating
genome-wide profiling data of gene copy number and transcriptional alterations
and highlight common approaches to genomic data integration. A transparent
benchmarking procedure is introduced to quantitatively compare the cancer gene
prioritization performance of the alternative methods. The benchmarking
algorithms and data sets are available at http://intcomp.r-forge.r-project.org. Comment: PDF file including supplementary material. 9 pages. Preprint
Composite likelihood inference in a discrete latent variable model for two-way "clustering-by-segmentation" problems
We consider a discrete latent variable model for two-way data arrays, which
allows one to simultaneously produce clusters along one of the data dimensions
(e.g. exchangeable observational units or features) and contiguous groups, or
segments, along the other (e.g. consecutively ordered times or locations). The
model relies on a hidden Markov structure but, given its complexity, cannot be
estimated by full maximum likelihood. We therefore introduce composite
likelihood methodology based on considering different subsets of the data. The
proposed approach is illustrated by simulation, and with an application to
genomic data.