
    aFold – using polynomial uncertainty modelling for differential gene expression estimation from RNA sequencing data

    Data normalization and identification of significant differential expression represent crucial steps in RNA-Seq analysis. Many available tools rely on assumptions that are often not met by real data, including the common assumption of symmetrical distribution of up- and down-regulated genes, the presence of only a few differentially expressed genes and/or few outliers. Moreover, the cut-off for selecting significantly differentially expressed genes for further downstream analysis often depends on arbitrary choices

    Statistical power analysis for single-cell RNA-sequencing

    RNA-sequencing (RNA-seq) is an established method to quantify levels of gene expression genome-wide. The recent development of single cell RNA sequencing (scRNA-seq) protocols opens up the possibility to systematically characterize cell transcriptomes and their underlying developmental and regulatory mechanisms. Since the first publication on single-cell transcriptomics a decade ago, hundreds of scRNA-seq datasets from a variety of sources have been released, profiling gene expression of sorted cells, tumors, whole dissociated organs and even complete organisms. Currently, it is also the main tool to systematically characterize human cells within the Human Cell Atlas Project. Given its wide applicability and increasing popularity, many experimental protocols and computational analysis approaches exist for scRNA-seq. However, the technology remains experimentally and computationally challenging. Firstly, single cells contain only minute mRNA amounts that need to be reliably captured and amplified for accurate quantification by sequencing. Importantly, the Polymerase Chain Reaction (PCR) is commonly used for amplification, which might introduce biases and increase technical variation. Secondly, once the sequencing results are obtained, finding the best computational processing pipeline can be a struggle. A number of comparison studies have already been conducted, especially for bulk RNA-seq, but they usually deal with only one aspect of the workflow. Furthermore, the extent to which the conclusions and recommendations of these studies can be transferred to scRNA-seq is unknown. Related to the processing of RNA-sequencing data, we investigate the effect of PCR amplification on differential expression analysis. We find that computational removal of duplicates has either a negligible or a negative impact on specificity and sensitivity of differential expression analysis, and we therefore recommend not removing read duplicates by mapping position. 
In contrast, if duplicates are identified using unique molecular identifiers (UMIs) tagging RNA molecules, both specificity and sensitivity improve. The first integral step of any scRNA-seq experiment is the preparation of sequencing libraries from the cells. We conducted an independent benchmarking study of popular library preparation protocols in terms of detection sensitivity, accuracy and precision using the same mouse embryonic stem cells and exogenous mRNA spike-ins. We recapitulate our previous finding that technical variance is markedly decreased when using UMIs to remove duplicates. In order to assign a monetary value to the detected amounts of technical variance, we developed a simulation framework that enabled us to compare the power to detect differentially expressed genes across the scRNA-seq library preparation protocols. Our experiences during this comparison study led to the development of the sequencing data processing in zUMIs and the simulation framework and power analysis in powsimR. zUMIs is a pipeline for processing scRNA-seq data with flexible choices regarding UMI and cell barcode design. In addition, we showed with powsimR simulations that the inclusion of intronic reads for gene expression quantification increases the power to detect DE genes and added it as a unique feature to zUMIs. In powsimR, we present our simulation framework extending choices concerning data analysis, enabling researchers to assess experimental design and analysis plans of RNA-seq in terms of statistical power. Lastly, we conducted a systematic evaluation of scRNA-seq experimental and analytical pipelines. We found that choices made concerning normalisation and library preparation protocols have the biggest impact on the validity of scRNA-seq DE analysis. Choosing a good scRNA-seq pipeline can have the same impact on detecting a biological signal as quadrupling the cell sample size. 
Taken together, we have established and applied a simulation framework that allowed us to benchmark experimental and computational scRNA-seq protocols and hence inform the experimental design and method choices of this important technology
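
The UMI-based duplicate removal mentioned in this abstract can be illustrated with a minimal sketch. It assumes each aligned read has already been reduced to a (cell barcode, gene, UMI) triple and collapses reads by exact match only; the function name `count_umis` and the example barcodes are hypothetical, and this is not the zUMIs implementation. Production tools may additionally correct for sequencing errors within the UMI, which is omitted here.

```python
from collections import defaultdict


def count_umis(reads):
    """Count distinct UMIs per (cell, gene) pair.

    `reads` is an iterable of (cell_barcode, gene, umi) tuples, one per
    aligned read. Reads that share all three values are treated as PCR
    duplicates of one original molecule and are counted once.
    """
    molecules = defaultdict(set)
    for cell, gene, umi in reads:
        molecules[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in molecules.items()}


# Illustrative input (made-up barcodes and UMIs).
reads = [
    ("cellA", "Actb", "AACGT"),
    ("cellA", "Actb", "AACGT"),  # same cell, gene and UMI: PCR duplicate
    ("cellA", "Actb", "GGTCA"),
    ("cellB", "Actb", "AACGT"),  # same UMI, different cell: kept
]
counts = count_umis(reads)
print(counts)
```

Counting molecules rather than reads in this way is what allows duplicates to be removed without relying on mapping position.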

    Power analysis for RNA sequencing and mass spectrometry-based proteomics data

    RNA-sequencing and mass spectrometry technologies have facilitated differential expression discoveries in transcriptome and proteome studies. However, the determination of sample size to achieve adequate statistical power has been a major challenge in experimental design. The objective of this study is to develop a power analysis tool applicable to both RNA-seq and MS-based proteomics data. The methods proposed in this study are capable of both prospective and retrospective power analyses. In terms of performance, the benchmarking results indicated that the proposed methods can give distinct power estimates for both differentially and equivalently expressed genes or proteins without prior differential expression analysis and other assumptions, such as the expected fraction of differentially expressed features, minimal fold changes and expected mean expressions. Using the proposed methods, researchers can not only evaluate the reliability of their acquired significant results, but also estimate the sufficient sample size for a desired power. The proposed methods in this study were implemented as an R package, which can be freely accessed from the Bioconductor project at http://bioconductor.org/packages/PowerExplorer/

    GENE-Counter: A Computational Pipeline for the Analysis of RNA-Seq Data for Gene Expression Differences

    GENE-counter is a complete Perl-based computational pipeline for analyzing RNA-Sequencing (RNA-Seq) data for differential gene expression. In addition to its use in studying transcriptomes of eukaryotic model organisms, GENE-counter is applicable for prokaryotes and non-model organisms without an available genome reference sequence. For alignments, GENE-counter is configured for CASHX, Bowtie, and BWA, but an end user can use any Sequence Alignment/Map (SAM)-compliant program of preference. To analyze data for differential gene expression, GENE-counter can be run with any one of three statistics packages that are based on variations of the negative binomial distribution. The default method is a new and simple statistical test we developed based on an over-parameterized version of the negative binomial distribution. GENE-counter also includes three different methods for assessing differentially expressed features for enriched gene ontology (GO) terms. Results are transparent and data are systematically stored in a MySQL relational database to facilitate additional analyses as well as quality assessment. We used next generation sequencing to generate a small-scale RNA-Seq dataset derived from the heavily studied defense response of Arabidopsis thaliana and used GENE-counter to process the data. Collectively, the support from analysis of microarrays as well as the observed and substantial overlap in results from each of the three statistics packages demonstrates that GENE-counter is well suited for handling the unique characteristics of small sample sizes and high variability in gene counts

    New Algorithms for Fast and Economic Assembly: Advances in Transcriptome and Genome Assembly

    Great efforts have been devoted to deciphering the sequence composition of the genomes and transcriptomes of diverse organisms. Continuing advances in high-throughput sequencing technologies have led to a decline in associated costs, facilitating a rapid increase in the amount of available genetic data. In particular, genome studies have undergone a fundamental paradigm shift where genome projects are no longer limited by sequencing costs, but rather by computational problems associated with assembly. There is an urgent demand for more efficient and more accurate methods. Most recently, “hybrid” methods that integrate short- and long-read data have been devised to address this need. LazyB is a new, low-cost hybrid genome assembler. It starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph. By design, unitigs are both unique and almost free of assembly errors. As a consequence, only few spurious overlaps are introduced into the graph. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts subgraphs whose global properties approach a disjoint union of paths in multiple steps, utilizing properties of proper interval graphs. A prototype implementation of LazyB, entirely written in Python, not only yields significantly more accurate assemblies of the yeast, fruit fly, and human genomes compared to state-of-the-art pipelines, but also requires much less computational effort. An optimized C++ implementation dubbed MuCHSALSA further significantly reduces resource demands. Advances in RNA-seq have facilitated tremendous insights into the role of both coding and non-coding transcripts. Yet, the complete and accurate annotation of the transcriptomes of even model organisms has remained elusive. 
RNA-seq produces reads significantly shorter than the average distance between related splice events and presents high noise levels and other biases. The computational reconstruction remains a critical bottleneck. Ryūtō implements an extension of common splice graphs facilitating the integration of reads spanning multiple splice sites and paired-end reads bridging distant transcript parts. The decomposition of read coverage patterns is modeled as a minimum-cost flow problem. Using phasing information from multi-splice and paired-end reads, nodes with uncertain connections are decomposed step-wise via Linear Programming. Ryūtō's performance compares favorably with state-of-the-art methods on both simulated and real-life datasets. Despite ongoing research and our own contributions, progress on traditional single sample assembly has brought no major breakthrough. Multi-sample RNA-Seq experiments provide more information which, however, is challenging to utilize due to the large amount of accumulating errors. An extension to Ryūtō enables the reconstruction of consensus transcriptomes from multiple RNA-seq data sets, incorporating consensus calling at low level features. Benchmarks show stable improvements already at 3 replicates. Ryūtō outperforms competing approaches, providing a better and user-adjustable sensitivity-precision trade-off. Ryūtō consistently improves assembly on replicates, demonstrable also when mixing conditions or time series and for differential expression analysis. Ryūtō's approach towards guided assembly is equally unique. 
It allows users to adjust results based on the quality of the guide, even for multi-sample assembly.

    Contents:
    1 Preface: 1.1 Assembly: A vast and fast evolving field; 1.2 Structure of this Work; 1.3 Available
    2 Introduction: 2.1 Mathematical Background; 2.2 High-Throughput Sequencing; 2.3 Assembly; 2.4 Transcriptome Expression
    3 From LazyB to MuCHSALSA - Fast and Cheap Genome Assembly: 3.1 Background; 3.2 Strategy; 3.3 Data preprocessing; 3.4 Processing of the overlap graph; 3.5 Post Processing of the Path Decomposition; 3.6 Benchmarking; 3.7 MuCHSALSA – Moving towards the future
    4 Ryūtō - Versatile, Fast, and Effective Transcript Assembly: 4.1 Background; 4.2 Strategy; 4.3 The Ryūtō core algorithm; 4.4 Improved Multi-sample transcript assembly with Ryūtō
    5 Conclusion & Future Work: 5.1 Discussion and Outlook; 5.2 Summary and Conclusio

    Compositional Data Analysis is necessary for simulating and analyzing RNA-Seq data

    *Seq techniques (e.g. RNA-Seq) generate compositional datasets, i.e. the number of fragments sequenced is not proportional to the total RNA present. Thus, datasets carry only relative information, even though absolute RNA copy numbers are often of interest. Current normalization methods assume most features are not changing, which can lead to misleading conclusions when there are large shifts. However, there are few real datasets and no simulation protocols currently available that can directly benchmark methods when such large shifts occur. We present absSimSeq, an R package that simulates compositional data in the form of RNA-Seq reads. We tested several tools used for RNA-Seq differential analysis: sleuth, DESeq2, edgeR, limma and ALDEx2 (which explicitly takes a compositional approach). For these tools, we compared their standard normalization to either “compositional normalization”, which uses log-ratios to anchor the data on a set of negative control features, or RUVSeq, another tool that directly uses negative control features. We show that common normalizations result in reduced performance with current methods when there is a large change in the total RNA per cell. Performance improves when spike-ins are included and used by a compositional approach, even if the spike-ins have substantial variation. In contrast, RUVSeq, which normalizes count data rather than compositional data, has poor performance. Further, we show that previous criticisms of spike-ins did not take into account the compositional nature of the data. We conclude that absSimSeq can generate more representative datasets for testing performance, and that spike-ins should be more broadly used in a compositional manner to minimize misleading conclusions from differential analyses
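
The idea of anchoring data on negative control features with log-ratios can be sketched as follows. This is an illustrative assumption about what such a normalization might look like, not the code used by absSimSeq or any of the tools named above: the function `log_ratio_to_controls`, the `pseudocount` parameter and the example counts are all hypothetical. Each feature is expressed as its log count minus the mean log count of the control features, i.e. a log-ratio against the geometric mean of the controls.

```python
import math


def log_ratio_to_controls(counts, control_features, pseudocount=0.5):
    """Return log-ratios of every feature against the negative controls.

    `counts` maps feature name -> read count for one sample.
    `control_features` lists features assumed not to change (e.g. spike-ins).
    The reference is the mean of the controls' log counts, which equals the
    log of their geometric mean. `pseudocount` avoids log(0) and is an
    assumption of this sketch.
    """
    log_values = {f: math.log(c + pseudocount) for f, c in counts.items()}
    reference = sum(log_values[f] for f in control_features) / len(control_features)
    return {f: value - reference for f, value in log_values.items()}


# Made-up counts: the second sample is the first one sequenced twice as deeply.
controls = ["spike1", "spike2"]
sample = {"geneA": 100, "geneB": 50, "spike1": 10, "spike2": 40}
deeper = {feature: 2 * count for feature, count in sample.items()}

# With no pseudocount the log-ratios do not depend on sequencing depth.
ratios = log_ratio_to_controls(sample, controls, pseudocount=0.0)
ratios_deeper = log_ratio_to_controls(deeper, controls, pseudocount=0.0)
print(ratios["geneA"], ratios_deeper["geneA"])
```

Because only ratios to the controls are retained, a change in sequencing depth cancels out, while a genuine shift in a gene relative to the controls is preserved.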

    A systematic evaluation of single cell RNA-seq analysis pipelines

    The recent rapid spread of single cell RNA sequencing (scRNA-seq) methods has created a large variety of experimental and computational pipelines for which best practices have not yet been established. Here, we use simulations based on five scRNA-seq library protocols in combination with nine realistic differential expression (DE) setups to systematically evaluate three mapping, four imputation, seven normalisation and four differential expression testing approaches resulting in approximately 3000 pipelines, allowing us to also assess interactions among pipeline steps. We find that choices of normalisation and library preparation protocols have the biggest impact on scRNA-seq analyses. Specifically, we find that library preparation determines the ability to detect symmetric expression differences, while normalisation dominates pipeline performance in asymmetric DE-setups. Finally, we illustrate the importance of informed choices by showing that a good scRNA-seq pipeline can have the same impact on detecting a biological signal as quadrupling the sample size

    Bioinformatics for RNA‐Seq Data Analysis

    While RNA sequencing (RNA‐seq) has become increasingly popular for transcriptome profiling, the analysis of the massive amount of data generated by large‐scale RNA‐seq still remains a challenge. RNA‐seq data analyses typically consist of (1) accurate mapping of millions of short sequencing reads to a reference genome, including the identification of splicing events; (2) quantifying expression levels of genes, transcripts, and exons; (3) differential analysis of gene expression among different biological conditions; and (4) biological interpretation of differentially expressed genes. Despite the fact that multiple algorithms pertinent to basic analyses have been developed, there are still a variety of unresolved questions. In this chapter, we review the main tools and algorithms currently available for RNA‐seq data analyses, and our goal is to help RNA‐seq data analysts to make an informed choice of tools in practical RNA‐seq data analysis. In the meantime, RNA‐seq is evolving rapidly, and newer sequencing technologies are briefly introduced, including stranded RNA‐seq, targeted RNA‐seq, and single‐cell RNA‐seq