Generalized empirical Bayesian methods for discovery of differential data in high-throughput biology
Motivation:
High-throughput data are now commonplace in biological research. Rapidly changing technologies and applications mean that novel methods for detecting differential behaviour that account for a ‘large P, small n’ setting are required at an increasing rate. The development of such methods is, in general, being done on an ad hoc basis, which requires further development cycles and leads to a lack of standardization between analyses.
Results:
We present here a generalized method for identifying differential behaviour within high-throughput biological data through empirical Bayesian methods. This approach is based on our baySeq algorithm for identification of differential expression in RNA-seq data based on a negative binomial distribution, and in paired data based on a beta-binomial distribution. Here we show how the same empirical Bayesian approach can be applied to any parametric distribution, removing the need for lengthy development of novel methods for differently distributed data. Comparisons with existing methods developed to address specific problems in high-throughput biological data show that these generic methods can achieve equivalent or better performance. A number of enhancements to the basic algorithm are also presented to increase flexibility and reduce computational costs.
Availability and implementation:
The methods are implemented in the R baySeq (v2) package, available on Bioconductor http://www.bioconductor.org/packages/release/bioc/html/baySeq.html.
Contact: [email protected]
Supplementary information:
Supplementary data are available at Bioinformatics online. This work was supported by European Research Council Advanced Investigator Grant ERC-2013-AdG 340642 – TRIBE. This is the author accepted manuscript. The final version is available from Oxford University Press via http://dx.doi.org/10.1093/bioinformatics/btv56
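As a rough illustration of the empirical Bayesian comparison described in the abstract above, the following Python sketch scores one gene under a "no difference" model and a "differential" model, using negative binomial likelihoods averaged over a sample of prior parameter values. It is not the baySeq implementation: the function names, the hand-picked PRIOR_SAMPLE grid (a stand-in for a prior that would be estimated from the data), and the equal prior weight on the two models are all assumptions made for the example.

```python
import math

def nb_logpmf(k, mean, dispersion):
    """Log-probability of count k under a negative binomial with
    variance = mean + dispersion * mean**2 (mean must be > 0)."""
    r = 1.0 / dispersion
    p = r / (r + mean)
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log1p(-p))

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def log_marginal(counts, prior_sample):
    """Approximate the log marginal likelihood of `counts` by averaging
    the likelihood over (mean, dispersion) pairs drawn from the prior."""
    logliks = [sum(nb_logpmf(k, mu, phi) for k in counts)
               for mu, phi in prior_sample]
    return log_mean_exp(logliks)

def posterior_differential(group_a, group_b, prior_sample, prior_de=0.5):
    """Posterior probability that the two groups have different parameters."""
    # Null model: all samples share one (mean, dispersion) pair.
    log_null = (log_marginal(list(group_a) + list(group_b), prior_sample)
                + math.log(1.0 - prior_de))
    # Differential model: each group gets its own, independent pair.
    log_de = (log_marginal(group_a, prior_sample)
              + log_marginal(group_b, prior_sample)
              + math.log(prior_de))
    m = max(log_null, log_de)
    w_null, w_de = math.exp(log_null - m), math.exp(log_de - m)
    return w_de / (w_null + w_de)

# Hypothetical prior sample: a small grid standing in for parameter
# values that an empirical Bayes method would estimate from the data.
PRIOR_SAMPLE = [(mu, phi)
                for mu in (5, 10, 20, 50, 100, 200)
                for phi in (0.05, 0.1, 0.2)]

if __name__ == "__main__":
    print(posterior_differential([10, 12, 9], [100, 110, 95], PRIOR_SAMPLE))
    print(posterior_differential([10, 12, 9], [11, 10, 12], PRIOR_SAMPLE))
```

Swapping nb_logpmf for another parametric log-density leaves the rest of the sketch unchanged, which is the sense in which the same comparison carries over to other distributions.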
Bayesian estimation of Differential Transcript Usage from RNA-seq data
Next generation sequencing allows the identification of genes consisting of
differentially expressed transcripts, a term which usually refers to changes in
the overall expression level. A specific type of differential expression is
differential transcript usage (DTU), which targets changes in the relative
within-gene expression of a transcript. The contribution of this paper is to:
(a) extend to the DTU context the use of cjBitSeq, a previously introduced
Bayesian model originally designed for identifying changes in overall
expression levels, and (b) propose a Bayesian version of DRIMSeq, a frequentist
model for inferring DTU. cjBitSeq is a read-based model and performs fully
Bayesian inference by MCMC sampling on the space of latent states of each
transcript per gene. BayesDRIMSeq is a count-based model and estimates the
Bayes factor of a DTU model against a null model using Laplace's approximation.
The proposed models are benchmarked against the existing ones using a recent
independent simulation study as well as a real RNA-seq dataset. Our results
suggest that the Bayesian methods perform similarly to DRIMSeq in terms of
precision/recall but offer better calibration of the False Discovery Rate.
Comment: Revised version, accepted to Statistical Applications in Genetics and
Molecular Biology
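To make the idea of estimating a Bayes factor with Laplace's approximation concrete, here is a minimal Python sketch on a deliberately simple toy problem: one gene with two transcripts, where the count of the first transcript is binomial and the proportion has a flat prior. This is not the model used by BayesDRIMSeq; the binomial setup, the flat prior, and all function names are assumptions chosen because the exact marginal likelihood is available in closed form for comparison.

```python
import math

def laplace_log_marginal(k, n):
    """Laplace approximation to log of the integral of
    theta**k * (1 - theta)**(n - k) over theta in (0, 1).
    Requires 0 < k < n so that the mode is interior."""
    theta = k / n  # mode of the integrand under a flat prior
    log_peak = k * math.log(theta) + (n - k) * math.log(1.0 - theta)
    # Negative second derivative of the log integrand at the mode.
    curvature = k / theta ** 2 + (n - k) / (1.0 - theta) ** 2
    return log_peak + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(curvature)

def exact_log_marginal(k, n):
    """Exact value of the same integral: log Beta(k + 1, n - k + 1)."""
    return math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)

def log_bayes_factor(k1, n1, k2, n2, log_marginal=laplace_log_marginal):
    """Log Bayes factor of 'each condition has its own proportion'
    against 'both conditions share one proportion'. The binomial
    coefficients are identical in both models and cancel."""
    log_dtu = log_marginal(k1, n1) + log_marginal(k2, n2)
    log_null = log_marginal(k1 + k2, n1 + n2)
    return log_dtu - log_null

if __name__ == "__main__":
    # Transcript 1 counts out of total gene counts in two conditions.
    print(log_bayes_factor(30, 100, 70, 100))
    print(log_bayes_factor(30, 100, 70, 100, exact_log_marginal))
```

A positive log Bayes factor favours the model with condition-specific proportions; passing exact_log_marginal instead of the default shows how close the approximation is for a given read depth.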
THE NUANCES OF STATISTICALLY ANALYZING NEXT-GENERATION SEQUENCING DATA
High-throughput sequencing technologies, in particular next-generation sequencing (NGS) technologies, have emerged as the preferred approach for exploring both gene function and pathway organization. Data from NGS technologies pose new computational and statistical challenges because of their massive size, limited replicate information, large number of genes (high-dimensionality), and discrete form. They are more complex than data from previous high-throughput technologies such as microarrays. In this work we focus on the statistical issues in analyzing and modeling NGS data for selecting genes suitable for further exploration and present a brief review of the relevant statistical methods. We discuss visualization methods to assess the suitability of statistical models for these data, statistical methods for modeling differential gene expression, and methods for checking goodness of fit of the models for NGS data. We also outline areas for further research, especially in the computational, statistical, and visualization aspects of such data.
Multiple locus linkage analysis of genomewide expression in yeast.
With the ability to measure thousands of related phenotypes from a single biological sample, it is now feasible to genetically dissect systems-level biological phenomena. The genetics of transcriptional regulation and protein abundance are likely to be complex, meaning that genetic variation at multiple loci will influence these phenotypes. Several recent studies have investigated the role of genetic variation in transcription by applying traditional linkage analysis methods to genomewide expression data, where each gene expression level was treated as a quantitative trait and analyzed separately from one another. Here, we develop a new, computationally efficient method for simultaneously mapping multiple gene expression quantitative trait loci that directly uses all of the available data. Information shared across gene expression traits is captured in a way that makes minimal assumptions about the statistical properties of the data. The method produces easy-to-interpret measures of statistical significance for both individual loci and the overall joint significance of multiple loci selected for a given expression trait. We apply the new method to a cross between two strains of the budding yeast Saccharomyces cerevisiae, and estimate that at least 37% of all gene expression traits show two simultaneous linkages, where we have allowed for epistatic interactions. Pairs of jointly linking quantitative trait loci are identified with high confidence for 170 gene expression traits, where it is expected that both loci are true positives for at least 153 traits. In addition, we are able to show that epistatic interactions contribute to gene expression variation for at least 14% of all traits. We compare the proposed approach to an exhaustive two-dimensional scan over all pairs of loci. Surprisingly, we demonstrate that an exhaustive two-dimensional scan is less powerful than the sequential search used here. 
In addition, we show that a two-dimensional scan does not truly allow one to test for simultaneous linkage, and the statistical significance measured from this existing method cannot be interpreted among many traits.
Sequential stopping for high-throughput experiments
In high-throughput experiments, the sample size is typically chosen informally. Most formal sample-size calculations depend critically on prior knowledge. We propose a sequential strategy that, by updating knowledge when new data are available, depends less critically on prior assumptions. Experiments are stopped or continued based on the potential benefits in obtaining additional data. The underlying decision-theoretic framework guarantees the design to proceed in a coherent fashion. We propose intuitively appealing, easy-to-implement utility functions. As in most sequential design problems, an exact solution is prohibitive. We propose a simulation-based approximation that uses decision boundaries. We apply the method to RNA-seq, microarray, and reverse-phase protein array studies and show its potential advantages. The approach has been added to the Bioconductor package gaga.
Deep generative modeling for single-cell transcriptomics.
Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells (https://github.com/YosefLab/scVI). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task.
Getting started in probabilistic graphical models
Probabilistic graphical models (PGMs) have become a popular tool for
computational analysis of biological data in a variety of domains. But, what
exactly are they and how do they work? How can we use PGMs to discover patterns
that are biologically relevant? And to what extent can PGMs help us formulate
new hypotheses that are testable at the bench? This note sketches out some
answers and illustrates the main ideas behind the statistical approach to
biological pattern discovery.
Comment: 12 pages, 1 figure