BNP-Seq: Bayesian Nonparametric Differential Expression Analysis of Sequencing Count Data
We perform differential expression analysis of high-throughput sequencing
count data under a Bayesian nonparametric framework, removing sophisticated
ad-hoc pre-processing steps commonly required in existing algorithms. We
propose to use the gamma (beta) negative binomial process, which takes into
account different sequencing depths using sample-specific negative binomial
probability (dispersion) parameters, to detect differentially expressed genes
by comparing the posterior distributions of gene-specific negative binomial
dispersion (probability) parameters. These model parameters are inferred by
borrowing statistical strength across both the genes and samples. Extensive
experiments on both simulated and real-world RNA sequencing count data show
that the proposed differential expression analysis algorithms clearly
outperform previously proposed ones in terms of the areas under both the
receiver operating characteristic and precision-recall curves.
Comment: To appear in Journal of the American Statistical Association
Simulating multiple faceted variability in single cell RNA sequencing.
The abundance of new computational methods for processing and interpreting transcriptomes at a single cell level raises the need for in silico platforms for evaluation and validation. Here, we present SymSim, a simulator that explicitly models the processes that give rise to data observed in single cell RNA-Seq experiments. The components of the SymSim pipeline pertain to the three primary sources of variation in single cell RNA-Seq data: noise intrinsic to the process of transcription, extrinsic variation indicative of different cell states (both discrete and continuous), and technical variation due to low sensitivity and measurement noise and bias. We demonstrate how SymSim can be used for benchmarking methods for clustering, differential expression and trajectory inference, and for examining the effects of various parameters on their performance. We also show how SymSim can be used to evaluate the number of cells required to detect a rare population under various scenarios
Distinguishing low frequency mutations from RT-PCR and sequence errors in viral deep sequencing data
There is a high prevalence of coronary artery disease (CAD) in patients with left bundle branch block (LBBB); however there are many other causes for this electrocardiographic abnormality. Non-invasive assessment of these patients remains difficult, and all commonly used modalities exhibit several drawbacks. This often leads to these patients undergoing invasive coronary angiography which may not have been necessary. In this review, we examine the uses and limitations of commonly performed non-invasive tests for diagnosis of CAD in patients with LBBB
MSIQ: Joint Modeling of Multiple RNA-seq Samples for Accurate Isoform Quantification
Next-generation RNA sequencing (RNA-seq) technology has been widely used to
assess full-length RNA isoform abundance in a high-throughput manner. RNA-seq
data offer insight into gene expression levels and transcriptome structures,
enabling us to better understand the regulation of gene expression and
fundamental biological processes. Accurate isoform quantification from RNA-seq
data is challenging due to the information loss in sequencing experiments. A
recent accumulation of multiple RNA-seq data sets from the same tissue or cell
type provides new opportunities to improve the accuracy of isoform
quantification. However, existing statistical or computational methods for
multiple RNA-seq samples either pool the samples into one sample or assign
equal weights to the samples when estimating isoform abundance. These methods
ignore the possible heterogeneity in the quality of different samples and could
result in biased and unrobust estimates. In this article, we develop a method,
which we call "joint modeling of multiple RNA-seq samples for accurate isoform
quantification" (MSIQ), for more accurate and robust isoform quantification by
integrating multiple RNA-seq samples under a Bayesian framework. Our method
aims to (1) identify a consistent group of samples with homogeneous quality and
(2) improve isoform quantification accuracy by jointly modeling multiple
RNA-seq samples by allowing for higher weights on the consistent group. We show
that MSIQ provides a consistent estimator of isoform abundance, and we
demonstrate the accuracy and effectiveness of MSIQ compared with alternative
methods through simulation studies on D. melanogaster genes. We justify MSIQ's
advantages over existing approaches via application studies on real RNA-seq
data from human embryonic stem cells, brain tissues, and the HepG2 immortalized
cell line
Discrete distributional differential expression (D3E)--a tool for gene expression analysis of single-cell RNA-seq data.
BACKGROUND: The advent of high throughput RNA-seq at the single-cell level has opened up new opportunities to elucidate the heterogeneity of gene expression. One of the most widespread applications of RNA-seq is to identify genes which are differentially expressed between two experimental conditions. RESULTS: We present a discrete, distributional method for differential gene expression (D(3)E), a novel algorithm specifically designed for single-cell RNA-seq data. We use synthetic data to evaluate D(3)E, demonstrating that it can detect changes in expression, even when the mean level remains unchanged. Since D(3)E is based on an analytically tractable stochastic model, it provides additional biological insights by quantifying biologically meaningful properties, such as the average burst size and frequency. We use D(3)E to investigate experimental data, and with the help of the underlying model, we directly test hypotheses about the driving mechanism behind changes in gene expression. CONCLUSION: Evaluation using synthetic data shows that D(3)E performs better than other methods for identifying differentially expressed genes since it is designed to take full advantage of the information available from single-cell RNA-seq experiments. Moreover, the analytical model underlying D(3)E makes it possible to gain additional biological insights
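The claim that expression can change while the mean stays fixed can be illustrated with a small sketch. It assumes a Poisson-beta form of the two-state bursty transcription model, which the abstract above does not name; the function name and the parameter values are hypothetical and chosen only for illustration.

```python
def poisson_beta_moments(alpha, beta, gamma):
    """Mean and variance of X, where p ~ Beta(alpha, beta) and X | p ~ Poisson(gamma * p).

    Assumed model for illustration only; not taken from the abstract.
    """
    mean_p = alpha / (alpha + beta)
    var_p = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    mean = gamma * mean_p
    # Law of total variance: E[Var(X | p)] + Var(E[X | p]).
    variance = mean + gamma ** 2 * var_p
    return mean, variance


# Two hypothetical conditions with the same mean but different switching kinetics.
mean_a, var_a = poisson_beta_moments(alpha=1.0, beta=1.0, gamma=20.0)
mean_b, var_b = poisson_beta_moments(alpha=10.0, beta=10.0, gamma=20.0)

print("condition A: mean", mean_a, "variance", var_a)
print("condition B: mean", mean_b, "variance", var_b)
```

Under these assumptions both conditions share the same mean count, yet the slow-switching condition A has a much larger variance. A test that compares only means would miss this kind of difference, while a distributional comparison could in principle detect it.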
Modelling capture efficiency of single-cell RNA-sequencing data improves inference of transcriptome-wide burst kinetics
Motivation:
Gene expression is characterized by stochastic bursts of transcription that occur at brief and random periods of promoter activity. The kinetics of gene expression burstiness differs across the genome and is dependent on the promoter sequence, among other factors. Single-cell RNA sequencing (scRNA-seq) has made it possible to quantify the cell-to-cell variability in transcription at a global genome-wide level. However, scRNA-seq data are prone to technical variability, including low and variable capture efficiency of transcripts from individual cells.
Results:
Here, we propose a novel mathematical theory for the observed variability in scRNA-seq data. Our method captures burst kinetics and variability in both the cell size and capture efficiency, which allows us to propose several likelihood-based and simulation-based methods for the inference of burst kinetics from scRNA-seq data. Using both synthetic and real data, we show that the simulation-based methods provide an accurate, robust and flexible tool for inferring burst kinetics from scRNA-seq data. In particular, in a supervised manner, a simulation-based inference method based on neural networks proves to be accurate and useful when applied to both allele and nonallele-specific scRNA-seq data.
Availability and implementation:
The code for Neural Network and Approximate Bayesian Computation inference is available at https://github.com/WT215/nnRNA and https://github.com/WT215/Julia_ABC, respectively
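As a rough illustration of what low capture efficiency does to counts, the sketch below treats capture as binomial thinning, where each transcript is retained independently with a fixed probability. This is an assumption made for illustration, not the authors' model or code; the function `capture`, the efficiency of 0.1, and the counts are all hypothetical.

```python
import random


def capture(true_count, efficiency, rng):
    """Binomial thinning: keep each molecule independently with probability `efficiency`."""
    return sum(1 for _ in range(true_count) if rng.random() < efficiency)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
true_counts = [1000, 200, 50, 0]  # hypothetical per-gene transcript counts in one cell
observed = [capture(n, 0.1, rng) for n in true_counts]

print(observed)
```

With an efficiency of 0.1, the observed counts are on average a tenth of the true counts, and lowly expressed genes are often observed as zeros. Inference that ignores this step would tend to confuse technical dropout with genuine transcriptional variability.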
lncRNA-dependent mechanisms of androgen-receptor-regulated gene activation programs.
Although recent studies have indicated roles of long non-coding RNAs (lncRNAs) in physiological aspects of cell-type determination and tissue homeostasis, their potential involvement in regulated gene transcription programs remains rather poorly understood. The androgen receptor regulates a large repertoire of genes central to the identity and behaviour of prostate cancer cells, and functions in a ligand-independent fashion in many prostate cancers when they become hormone refractory after initial androgen deprivation therapy. Here we report that two lncRNAs highly overexpressed in aggressive prostate cancer, PRNCR1 (also known as PCAT8) and PCGEM1, bind successively to the androgen receptor and strongly enhance both ligand-dependent and ligand-independent androgen-receptor-mediated gene activation programs and proliferation in prostate cancer cells. Binding of PRNCR1 to the carboxy-terminally acetylated androgen receptor on enhancers and its association with DOT1L appear to be required for recruitment of the second lncRNA, PCGEM1, to the androgen receptor amino terminus that is methylated by DOT1L. Unexpectedly, recognition of specific protein marks by PCGEM1-recruited pygopus 2 PHD domain enhances selective looping of androgen-receptor-bound enhancers to target gene promoters in these cells. In 'resistant' prostate cancer cells, these overexpressed lncRNAs can interact with, and are required for, the robust activation of both truncated and full-length androgen receptor, causing ligand-independent activation of the androgen receptor transcriptional program and cell proliferation. Conditionally expressed short hairpin RNA targeting these lncRNAs in castration-resistant prostate cancer cell lines strongly suppressed tumour xenograft growth in vivo. Together, these results indicate that these overexpressed lncRNAs can potentially serve as a required component of castration-resistance in prostatic tumours
Modelling capture efficiency of single-cell RNA-sequencing data improves inference of transcriptome-wide burst kinetics
MOTIVATION: Gene expression is characterised by stochastic bursts of transcription that occur at brief and random periods of promoter activity. The kinetics of gene expression burstiness differs across the genome and is dependent on the promoter sequence, among other factors. Single-cell RNA sequencing (scRNA-seq) has made it possible to quantify the cell-to-cell variability in transcription at a global genome-wide level. However, scRNA-seq data is prone to technical variability, including low and variable capture efficiency of transcripts from individual cells. RESULTS: Here, we propose a novel mathematical theory for the observed variability in scRNA-seq data. Our method captures burst kinetics and variability in both the cell size and capture efficiency, which allows us to propose several likelihood-based and simulation-based methods for the inference of burst kinetics from scRNA-seq data. Using both synthetic and real data, we show that the simulation-based methods provide an accurate, robust and flexible tool for inferring burst kinetics from scRNA-seq data. In particular, in a supervised manner, a simulation-based inference method based on neural networks proves to be accurate and useful when applied to both allele and non-allele-specific scRNA-seq data. AVAILABILITY: The code for Neural Network and Approximate Bayesian Computation inference is available at https://github.com/WT215/nnRNA and https://github.com/WT215/Julia_ABC respectively. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online