    A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments

    Background: High-throughput RNA sequencing (RNA-seq) offers unprecedented power to capture the real dynamics of gene expression. Experimental designs with extensive biological replication present a unique opportunity to exploit this feature and distinguish expression profiles with higher resolution. RNA-seq data analysis methods so far have been mostly applied to data sets with few replicates and their default settings try to provide the best performance under this constraint. These methods are based on two well-known count data distributions: the Poisson and the negative binomial. The way to properly calibrate them with large RNA-seq data sets is not trivial for the non-expert bioinformatics user. Results: Here we show that expression profiles produced by extensively-replicated RNA-seq experiments lead to a rich diversity of count data distributions beyond the Poisson and the negative binomial, such as Poisson-Inverse Gaussian or Pólya-Aeppli, which can be captured by a more general family of count data distributions called the Poisson-Tweedie. The flexibility of the Poisson-Tweedie family enables a direct fitting of emerging features of large expression profiles, such as heavy-tails or zero-inflation, without the need to alter a single configuration parameter. We provide a software package for R called tweeDEseq implementing a new test for differential expression based on the Poisson-Tweedie family. Using simulations on synthetic and real RNA-seq data we show that tweeDEseq yields P-values that are equally or more accurate than competing methods under different configuration parameters. By surveying the tiny fraction of sex-specific gene expression changes in human lymphoblastoid cell lines, we also show that tweeDEseq accurately detects differentially expressed genes in a real large RNA-seq data set with improved performance and reproducibility over the previously compared methodologies.
Finally, we compared the results with those obtained from microarrays in order to check for reproducibility. Conclusions: RNA-seq data with many replicates leads to a handful of count data distributions which can be accurately estimated with the statistical model illustrated in this paper. This method provides a better fit to the underlying biological variability; this may be critical when comparing groups of RNA-seq samples with markedly different count data distributions. The tweeDEseq package forms part of the Bioconductor project and it is available for download at http://www.bioconductor.or

    An R Implementation of the Polya-Aeppli Distribution

    An efficient implementation of the Polya-Aeppli, or geometric compound Poisson, distribution in the statistical programming language R is presented. The implementation is available as the package polyaAeppli and consists of functions for the mass function, cumulative distribution function, quantile function and random variate generation with those parameters conventionally provided for standard univariate probability distributions in the stats package in R. Comment: 9 pages, 2 figure
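
To make the "geometric compound Poisson" description concrete, the following is a minimal Python sketch of the mass function. It is not the polyaAeppli package's code or API: the function name `polya_aeppli_pmf` and the parameterization (Poisson rate `lam`, geometric parameter `rho`, with each summand taking value k = 1, 2, ... with probability (1 - rho) * rho**(k - 1)) are assumptions chosen for illustration.

```python
import math


def polya_aeppli_pmf(x: int, lam: float, rho: float) -> float:
    """P(X = x) for X = Y_1 + ... + Y_N, N ~ Poisson(lam), Y_i geometric on {1, 2, ...}.

    Assumed parameterization (illustrative, not taken from the package):
    P(Y_i = k) = (1 - rho) * rho**(k - 1), with 0 <= rho < 1 and lam > 0.
    """
    if lam <= 0 or not (0.0 <= rho < 1.0):
        raise ValueError("require lam > 0 and 0 <= rho < 1")
    if x < 0:
        return 0.0
    if x == 0:
        # X = 0 only when the Poisson count N is zero.
        return math.exp(-lam)
    if rho == 0.0:
        # Every summand equals 1, so X is plain Poisson(lam).
        return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
    total = 0.0
    # Condition on N = n: the sum of n geometrics equals x with probability
    # C(x-1, n-1) * (1-rho)**n * rho**(x-n). Terms are built in log space.
    for n in range(1, x + 1):
        log_term = (
            -lam + n * math.log(lam) - math.lgamma(n + 1)
            + math.log(math.comb(x - 1, n - 1))
            + n * math.log1p(-rho) + (x - n) * math.log(rho)
        )
        total += math.exp(log_term)
    return total


if __name__ == "__main__":
    for x in range(5):
        print(x, polya_aeppli_pmf(x, lam=2.0, rho=0.3))
```

Under this parameterization the mean is lam / (1 - rho), which gives a quick sanity check on any implementation. The direct sum costs O(x) per evaluation, so a production implementation would more likely use a recurrence.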

    Error estimates for the analysis of differential expression from RNA-seq count data

    Background: A number of algorithms exist for analysing RNA-sequencing data to infer profiles of differential gene expression. Problems inherent in building algorithms around statistical models of overdispersed count data are formidable and frequently lead to non-uniform p-value distributions for null-hypothesis data and to inaccurate estimates of false discovery rates (FDRs). This can lead to an inaccurate measure of significance and loss of power to detect differential expression. Results: We use synthetic and real biological data to assess the ability of several available R packages to accurately estimate FDRs. The packages surveyed are based on statistical models of overdispersed Poisson data and include edgeR, DESeq, DESeq2, PoissonSeq and QuasiSeq. Also tested is an add-on package to edgeR and DESeq which we introduce called Polyfit. Polyfit aims to address the problem of a non-uniform null p-value distribution for two-class datasets by adapting the Storey-Tibshirani procedure. Conclusions: We find that the best performing package, in the sense that it achieves a low FDR which is accurately estimated over the full range of p-values, albeit with a very slow run time, is the QLSpline implementation of QuasiSeq. This finding holds provided the number of biological replicates in each condition is at least 4. The next best performing packages are edgeR and DESeq2. When the number of biological replicates is sufficiently high, and within a range accessible to multiplexed experimental designs, the Polyfit extension improves the performance of DESeq (for approximately 6 or more replicates per condition), making its performance comparable with that of edgeR and DESeq2 in our tests with synthetic data

    Studies of gene expression in the Parkinson’s disease brain

    Parkinson’s disease (PD) is the second most prevalent neurodegenerative disorder, affecting ~1.8% of the population above 65 years. A combination of genetic and environmental factors contributes to the risk of PD, but the molecular mechanisms underlying its aetiology remain largely unaccounted for. Profiling gene expression in the PD brain can identify molecular processes associated with the pathogenesis and nominate candidate therapeutic targets for further study. Most previous gene expression studies in PD focused on specific hypotheses and were restricted to selected genes of interest, and only a few were performed transcriptome-wide. While in part informative, the results of these studies must be interpreted with caution due to a combination of technical and biological limitations. Factors applying specifically to the study of human bulk brain tissue make it difficult to confidently and accurately determine altered pathways. 1) Bulk brain tissue is composed of multiple cell types, some of which are selectively affected in PD. Variation in cell-type composition across samples introduces noise, while disease-associated changes in the number of neurons and glia introduce systematic gene expression biases between conditions. 2) The complex architecture of neurons complicates sample dissection and can result in variable soma-to-synapses ratios across samples. This variability results in additional noise in expression data since RNA and proteins can undergo axonal transport, with some preferentially localizing to the soma or synapses. Another limitation of previous studies is that gene-level analyses provide only an incomplete perspective on the expression landscape. Regulation at the transcript- and protein-level is often overlooked. The work of this thesis comprises three alternative approaches to gene expression analysis in the PD brain, aiming to overcome these limitations.
We employed RNA-Seq and mass spectrometry in the prefrontal cortex of PD patients and healthy controls and approached these challenges by profiling expression at transcript-, gene- and protein-level. Considering the described aspects of bulk brain tissue, we adjusted for changes in cellular composition and RNA quality, and guided functional interpretation with the polarized nature of neurons in mind. Our results indicate that the frequently reported downregulation of mitochondrial function is partly driven by cellular composition. Adjusting for cell-type bias instead revealed altered pathways related to protein degradation, further strengthening their involvement in disease pathology. Both differential gene and transcript isoform expression showed enrichment for these. Additionally, we nominated genes that exhibit differential transcript usage events, suggesting alternate regulation at the transcript-level. These candidates can be targeted in future studies to identify functional consequences. Finally, we observed discordance between transcriptome and proteome which we concluded reflects alterations in PD proteostasis. Specifically, we identified certain proteasomal subunits central to these regulatory changes, providing us with further evidence for the key role of protein degradation in the PD brain. Doktorgradsavhandlin

    Combining DNA Methylation with Deep Learning Improves Sensitivity and Accuracy of Eukaryotic Genome Annotation

    Thesis (Ph.D.) - Indiana University, School of Informatics, Computing, and Engineering, 2020. The genome assembly process has significantly decreased in computational complexity since the advent of third-generation long-read technologies. However, genome annotations still require significant manual effort from scientists to produce the trustworthy annotations required for most bioinformatic analyses. Current methods for automatic eukaryotic annotation rely on sequence homology, structure, or repeat detection, and each method requires a separate tool, making the workflow for a final product a complex ensemble. Beyond the nucleotide sequence, one important component of genetic architecture is the presence of epigenetic marks, including DNA methylation. However, no automatic annotation tools currently use this valuable information. As methylation data becomes more widely available from nanopore sequencing technology, tools that take advantage of patterns in this data will be in demand. The goal of this dissertation was to improve the annotation process by developing and training a recurrent neural network (RNN) on trusted annotations to recognize multiple classes of elements from both the reference sequence and DNA methylation. We found that our proposed tool, RNNotate, detected fewer coding elements than GlimmerHMM and Augustus, but those predictions were more often correct. When predicting transposable elements, RNNotate was more accurate than both RepeatMasker and RepeatScout. Additionally, we found that RNNotate was significantly less sensitive when trained and run without DNA methylation, validating our hypothesis. To the best of our knowledge, we are not only the first group to use recurrent neural networks for eukaryotic genome annotation, but we also innovated in the data space by utilizing DNA methylation patterns for prediction

    Statistical methods for the analysis of RNA sequencing data

    The next-generation sequencing technology RNA-sequencing (RNA-seq) is increasingly popular relative to traditional microarrays in transcriptome analyses. Statistical methods used for gene expression analyses with these two technologies are different because the array-based technology measures intensities using continuous distributions, whereas RNA-seq provides absolute quantification of gene expression using counts of reads. There is a need for reliable statistical methods to exploit the information from the rapidly evolving sequencing technologies, and limited work has been done on expression analysis of time-course RNA-seq data. In this dissertation, we propose a model-based clustering method for identifying gene expression patterns in time-course RNA-seq data. Our approach employs a longitudinal negative binomial mixture model to model the over-dispersed time-course gene count data. We also modify existing common initialization procedures to suit our model-based clustering algorithm. The effectiveness of the proposed methods is assessed using simulated data and is illustrated by real data from time-course genomic experiments. Another common issue in gene expression analysis is the presence of missing values in the datasets. Various treatments for missing values in genomic datasets have been developed, but limited work has been done on RNA-seq data. In the current work, we examine the performance of various imputation methods and their impact on the clustering of time-course RNA-seq data. We develop a cluster-based imputation method which is specifically suitable for dealing with missing values in RNA-seq datasets. Simulation studies are provided to assess the performance of the proposed imputation approach