A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments
Background: High-throughput RNA sequencing (RNA-seq) offers unprecedented power to capture the real dynamics of gene expression. Experimental designs with extensive biological replication present a unique opportunity to exploit this feature and distinguish expression profiles with higher resolution. RNA-seq data analysis methods so far have been mostly applied to data sets with few replicates and their default settings try to provide the best performance under this constraint. These methods are based on two well-known count data distributions: the Poisson and the negative binomial. The way to properly calibrate them with large RNA-seq data sets is not trivial for the non-expert bioinformatics user. Results: Here we show that expression profiles produced by extensively-replicated RNA-seq experiments lead to a rich diversity of count data distributions beyond the Poisson and the negative binomial, such as Poisson-Inverse Gaussian or Pólya-Aeppli, which can be captured by a more general family of count data distributions called the Poisson-Tweedie. The flexibility of the Poisson-Tweedie family enables a direct fitting of emerging features of large expression profiles, such as heavy-tails or zero-inflation, without the need to alter a single configuration parameter. We provide a software package for R called tweeDEseq implementing a new test for differential expression based on the Poisson-Tweedie family. Using simulations on synthetic and real RNA-seq data we show that tweeDEseq yields P-values that are equally or more accurate than competing methods under different configuration parameters. By surveying the tiny fraction of sex-specific gene expression changes in human lymphoblastoid cell lines, we also show that tweeDEseq accurately detects differentially expressed genes in a real large RNA-seq data set with improved performance and reproducibility over the previously compared methodologies.
Finally, we compared the results with those obtained from microarrays in order to check for reproducibility. Conclusions: RNA-seq data with many replicates leads to a handful of count data distributions which can be accurately estimated with the statistical model illustrated in this paper. This method provides a better fit to the underlying biological variability; this may be critical when comparing groups of RNA-seq samples with markedly different count data distributions. The tweeDEseq package forms part of the Bioconductor project and it is available for download at http://www.bioconductor.or
An R Implementation of the Polya-Aeppli Distribution
An efficient implementation of the Polya-Aeppli, or geometric compound
Poisson, distribution in the statistical programming language R is presented.
The implementation is available as the package polyaAeppli and consists of
functions for the mass function, cumulative distribution function, quantile
function and random variate generation with those parameters conventionally
provided for standard univariate probability distributions in the stats package
in R. Comment: 9 pages, 2 figures
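As a rough illustration of what such a mass function and cumulative distribution function compute, the sketch below evaluates the Polya-Aeppli distribution directly from its definition as a Poisson number of geometric summands. It is written in Python rather than R, the names `polya_aeppli_pmf` and `polya_aeppli_cdf` and the parameter names `lam` and `p` are hypothetical, and it does not reproduce the polyaAeppli package's interface or its efficiency.

```python
import math


def polya_aeppli_pmf(y, lam, p):
    """P(Y = y) where Y is the sum of N ~ Poisson(lam) geometric variables
    on {1, 2, ...} with P(X = k) = (1 - p) * p**(k - 1).

    Direct summation, intended for small y only (assumption: 0 <= p < 1).
    """
    if y < 0:
        return 0.0
    if y == 0:
        # No summands at all: the Poisson count is zero.
        return math.exp(-lam)
    total = 0.0
    for n in range(1, y + 1):
        # n geometric summands adding up to y: negative binomial weight.
        total += (
            lam ** n / math.factorial(n)
            * math.comb(y - 1, n - 1)
            * (1 - p) ** n
            * p ** (y - n)
        )
    return math.exp(-lam) * total


def polya_aeppli_cdf(y, lam, p):
    """P(Y <= y), obtained by summing the mass function."""
    return sum(polya_aeppli_pmf(k, lam, p) for k in range(0, y + 1))
```

With `p = 0` every geometric summand equals 1, so the mass function reduces to the ordinary Poisson; the mean of the distribution is `lam / (1 - p)`.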
Error estimates for the analysis of differential expression from RNA-seq count data
Background: A number of algorithms exist for analysing RNA-sequencing data to infer profiles of differential gene expression. Problems inherent in building algorithms around statistical models of overdispersed count data are formidable and frequently lead to non-uniform p-value distributions for null-hypothesis data and to inaccurate estimates of false discovery rates (FDRs). This can lead to an inaccurate measure of significance and loss of power to detect differential expression. Results: We use synthetic and real biological data to assess the ability of several available R packages to accurately estimate FDRs. The packages surveyed are based on statistical models of overdispersed Poisson data and include edgeR, DESeq, DESeq2, PoissonSeq and QuasiSeq. Also tested is an add-on package to edgeR and DESeq which we introduce called Polyfit. Polyfit aims to address the problem of a non-uniform null p-value distribution for two-class datasets by adapting the Storey-Tibshirani procedure. Conclusions: We find that the best performing package, in the sense that it achieves a low FDR which is accurately estimated over the full range of p-values, albeit with a very slow run time, is the QLSpline implementation of QuasiSeq. This finding holds provided the number of biological replicates in each condition is at least 4. The next best performing packages are edgeR and DESeq2. When the number of biological replicates is sufficiently high, and within a range accessible to multiplexed experimental designs, the Polyfit extension improves the performance of DESeq (for approximately 6 or more replicates per condition), making its performance comparable with that of edgeR and DESeq2 in our tests with synthetic data
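For readers unfamiliar with the Storey-Tibshirani procedure mentioned above, the following is a minimal Python sketch of its core idea: estimate the proportion of true nulls from the flat right-hand part of the p-value histogram, then scale the step-up FDR estimate by it. It is not the Polyfit adaptation. The function names are hypothetical, and the single fixed threshold `lam = 0.5` is a simplifying assumption (the original procedure smooths the estimate over a range of thresholds).

```python
def estimate_pi0(pvalues, lam=0.5):
    """Estimate the fraction of true null hypotheses.

    Under the null, p-values are uniform, so the share of p-values above
    `lam` divided by (1 - lam) approximates the null proportion.
    """
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def q_values(pvalues, lam=0.5):
    """Return q-values in the same order as the input p-values."""
    m = len(pvalues)
    pi0 = estimate_pi0(pvalues, lam)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        q[i] = running_min
    return q
```

If the null p-value distribution is not uniform, as the abstract describes, the estimate from `estimate_pi0` is biased, which is the problem the add-on package is said to address.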
Model based heritability scores for high-throughput sequencing data
Supplementary materials. (PDF 1370 KB
A genome-wide, single-cell analysis of vascular smooth muscle cell plasticity
Vascular smooth muscle cells (VSMCs) possess a remarkable capacity to change phenotype in response to injury or inflammation. In healthy arteries, VSMCs exist in a contractile state, but upon vascular inflammation or injury, they can switch into an activated state, in which they downregulate the contractile differentiation markers and show increased migration, proliferation and secretion of proinflammatory cytokines. This process is termed phenotypic switching and can lead to VSMC accumulation within atherosclerotic plaques. Previous observations of clonal expansion of a small number of VSMCs in atherosclerosis suggested that VSMCs were functionally heterogeneous. I hypothesised that functional heterogeneity of VSMCs in disease may originate from VSMC heterogeneity in healthy arteries.
In the first part of this thesis I explored the regional heterogeneity of VSMCs originating from different parts of the mouse aorta, as well as heterogeneity of VSMCs within a vascular bed using single-cell and bulk RNA sequencing. VSMCs originating from the atherosclerosis-prone aortic arch and atherosclerosis-resistant descending thoracic aorta were found to have distinct transcriptional signatures at the single-cell level. Additionally, several disease-relevant genes were observed to be heterogeneously expressed within both vascular beds.
In the second chapter I identified and characterised a rare subset of VSMCs expressing Stem cell antigen 1 (SCA1). Single-cell RNA-seq was combined with VSMC-specific lineage tracing to profile gene expression in individual VSMCs from healthy mouse arteries and to compare SCA1-expressing VSMCs to other cells. SCA1-positive VSMCs were heterogeneous, with many of them expressing low levels of contractile VSMC markers. Additionally, a subset of SCA1-positive VSMCs in healthy arteries expressed transcriptional signatures characteristic of activated VSMCs involved in phenotypic switching.
In the third chapter I investigated the involvement of SCA1-positive VSMCs in phenotypic switching. SCA1 upregulation was found to mark the process of VSMC phenotypic switching following in vitro culture and in vivo vascular injury. Single-cell RNA-seq profiling of VSMCs in atherosclerosis and following vascular injury showed that Ly6a/Sca1-expressing VSMCs were present and expressed transcriptional signatures similar to activated SCA1-positive cells observed in healthy arteries.
Overall the results presented in this thesis highlight the heterogeneous nature of VSMCs in healthy arteries, both regionally and within a vascular bed. I identified a rare subset of SCA1-positive VSMCs with activated transcriptional signatures in healthy arteries. I hypothesised that SCA1-positive VSMCs may be responsible for clonal expansion of VSMCs in atherosclerosis, which would have clinical implications for earlier detection and specific targeting of expanding VSMCs in atherosclerosis in the future. In support of this hypothesis I have shown that Ly6a/Sca1 is upregulated in model systems of VSMC phenotypic switching and that transcriptional signatures of Ly6a/Sca1-expressing VSMCs in mouse atherosclerosis and vascular injury resemble those of healthy activated SCA1-positive VSMCs. BBSRC DTP studentshi
Studies of gene expression in the Parkinson’s disease brain
Parkinson’s disease (PD) is the second most prevalent neurodegenerative disorder, affecting ~1.8% of the population above 65 years. A combination of genetic and environmental factors contributes to the risk of PD, but the molecular mechanisms underlying its aetiology remain largely unaccounted for.
Profiling gene expression in the PD brain can identify molecular processes associated with the pathogenesis and nominate candidate therapeutic targets for further study. Most previous gene expression studies in PD focused on specific hypotheses and were restricted to selected genes of interest, and only a few were performed transcriptome-wide. While in part informative, the results of these studies must be interpreted with caution due to a combination of technical and biological limitations. Factors applying specifically to the study of human bulk brain tissue make it difficult to confidently and accurately determine altered pathways. 1) Bulk brain tissue is composed of multiple cell types, some of which are selectively affected in PD. Variation in cell-type composition across samples introduces noise, while disease-associated changes in the number of neurons and glia introduce systematic gene expression biases between conditions. 2) The complex architecture of neurons complicates sample dissection and can result in variable soma-to-synapse ratios across samples. This variability results in additional noise in expression data since RNA and proteins can undergo axonal transport, with some preferentially localizing to the soma or synapses. Another limitation of previous studies is that gene-level analyses provide only an incomplete perspective on the expression landscape. Regulation at the transcript- and protein-level is often overlooked.
The work of this thesis comprises three alternative approaches of gene expression analyses in the PD brain, aiming to overcome these limitations. We employed RNA-Seq and mass spectrometry in the prefrontal cortex of PD patients and healthy controls and approached these challenges by profiling expression at transcript-, gene- and protein-level. Considering the described aspects of bulk brain tissue, we adjusted for changes in cellular composition, RNA quality and guided functional interpretation with the polarized nature of neurons in mind.
Our results indicate that the frequently reported downregulation of mitochondrial function is partly driven by cellular composition. Adjusting for cell-type bias instead revealed altered pathways related to protein degradation, further strengthening their involvement in disease pathology. Both differential gene and transcript isoform expression showed enrichment for these. Additionally, we nominated genes that exhibit differential transcript usage events, suggesting alternate regulation at the transcript-level. These candidates can be targeted in future studies to identify functional consequences. Finally, we observed discordance between transcriptome and proteome which we concluded reflects alterations in PD proteostasis. Specifically, we identified certain proteasomal subunits central to these regulatory changes, providing us with further evidence for the key role of protein degradation in PD brain. Doktorgradsavhandlin
Goodness-of-Fit Tests and Model Diagnostics for Negative Binomial Regression of RNA Sequencing Data
This work is about assessing model adequacy for negative binomial (NB) regression, particularly (1) assessing the adequacy of the NB assumption, and (2) assessing the appropriateness of models for NB dispersion parameters. Tools for the first are appropriate for NB regression generally; those for the second are primarily intended for RNA sequencing (RNA-Seq) data analysis. The typically small number of biological samples and large number of genes in RNA-Seq analysis motivate us to address the trade-offs between robustness and statistical power using NB regression models. One widely-used power-saving strategy, for example, is to assume some commonalities of NB dispersion parameters across genes via simple models relating them to mean expression rates, and many such models have been proposed. As RNA-Seq analysis is becoming ever more popular, it is appropriate to make more thorough investigations into power and robustness of the resulting methods, and into practical tools for model assessment. In this article, we propose simulation-based statistical tests and diagnostic graphics to address model adequacy. We provide simulated and real data examples to illustrate that our proposed methods are effective for detecting the misspecification of the NB mean-variance relationship as well as judging the adequacy of fit of several NB dispersion models
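The NB mean-variance relationship referred to above can be made concrete with a short Python sketch. It assumes the common parameterization in which the variance equals mu + phi * mu^2 with dispersion phi, and it uses a crude method-of-moments dispersion estimate. The function names are hypothetical, and this is not one of the simulation-based tests or diagnostics proposed in the abstract.

```python
from statistics import mean, variance


def nb_variance(mu, phi):
    """Variance of a negative binomial count with mean mu and dispersion phi.

    Assumes Var = mu + phi * mu**2; phi = 0 recovers the Poisson.
    """
    return mu + phi * mu ** 2


def moment_dispersion(counts):
    """Method-of-moments dispersion estimate from replicate counts.

    Solves sample_variance = mean + phi * mean**2 for phi and truncates
    at zero when the counts show no overdispersion.
    """
    m = mean(counts)
    if m == 0:
        return 0.0
    s2 = variance(counts)  # unbiased sample variance
    return max(0.0, (s2 - m) / m ** 2)
```

With only a handful of replicates per gene such per-gene estimates are very noisy, which is why the dispersion models discussed in the abstract share information across genes.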
Combining DNA Methylation with Deep Learning Improves Sensitivity and Accuracy of Eukaryotic Genome Annotation
Thesis (Ph.D.) - Indiana University, School of Informatics, Computing, and Engineering, 2020. The genome assembly process has significantly decreased in computational complexity since the advent of third-generation long-read technologies. However, genome annotations still require significant manual effort from scientists to produce the trustworthy annotations required for most bioinformatic analyses. Current methods for automatic eukaryotic annotation rely on sequence homology, structure, or repeat detection, and each method requires a separate tool, making the workflow for a final product a complex ensemble. Beyond the nucleotide sequence, one important component of genetic architecture is the presence of epigenetic marks, including DNA methylation. However, no automatic annotation tools currently use this valuable information. As methylation data becomes more widely available from nanopore sequencing technology, tools that take advantage of patterns in this data will be in demand. The goal of this dissertation was to improve the annotation process by developing and training a recurrent neural network (RNN) on trusted annotations to recognize multiple classes of elements from both the reference sequence and DNA methylation. We found that our proposed tool, RNNotate, detected fewer coding elements than GlimmerHMM and Augustus, but those predictions were more often correct. When predicting transposable elements, RNNotate was more accurate than both RepeatMasker and RepeatScout. Additionally, we found that RNNotate was significantly less sensitive when trained and run without DNA methylation, validating our hypothesis. To the best of our knowledge, we are not only the first group to use recurrent neural networks for eukaryotic genome annotation, but we also innovated in the data space by utilizing DNA methylation patterns for prediction
Statistical methods for the analysis of RNA sequencing data
The next generation sequencing technology, RNA-sequencing (RNA-seq), is increasingly popular relative to traditional microarrays in transcriptome analyses. Statistical methods used for gene expression analyses with these two technologies are different because the array-based technology measures intensities using continuous distributions, whereas RNA-seq provides absolute quantification of gene expression using counts of reads. There is a need for reliable statistical methods to exploit the information from the rapidly evolving sequencing technologies, and limited work has been done on expression analysis of time-course RNA-seq data. In this dissertation, we propose a model-based clustering method for identifying gene expression patterns in time-course RNA-seq data. Our approach employs a longitudinal negative binomial mixture model to model the over-dispersed time-course gene count data. We also modify existing common initialization procedures to suit our model-based clustering algorithm. The effectiveness of the proposed methods is assessed using simulated data and is illustrated by real data from time-course genomic experiments. Another common issue in gene expression analysis is the presence of missing values in the datasets. Various treatments of missing values in genomic datasets have been developed but limited work has been done on RNA-seq data. In the current work, we examine the performance of various imputation methods and their impact on the clustering of time-course RNA-seq data. We develop a cluster-based imputation method which is specifically suitable for dealing with missing values in RNA-seq datasets. Simulation studies are provided to assess the performance of the proposed imputation approach