
    THE NUANCES OF STATISTICALLY ANALYZING NEXT-GENERATION SEQUENCING DATA

    High-throughput sequencing technologies, in particular next-generation sequencing (NGS) technologies, have emerged as the preferred approach for exploring both gene function and pathway organization. Data from NGS technologies pose new computational and statistical challenges because of their massive size, limited replicate information, large number of genes (high-dimensionality), and discrete form. They are more complex than data from previous high-throughput technologies such as microarrays. In this work we focus on the statistical issues in analyzing and modeling NGS data for selecting genes suitable for further exploration and present a brief review of the relevant statistical methods. We discuss visualization methods to assess the suitability of statistical models for these data, statistical methods for modeling differential gene expression, and methods for checking goodness of fit of the models for NGS data. We also outline areas for further research, especially in the computational, statistical, and visualization aspects of such data.

    Normalizing single-cell RNA sequencing data: challenges and opportunities

    Single-cell transcriptomics is becoming an important component of the molecular biologist's toolkit. A critical step when analyzing data generated using this technology is normalization. However, normalization is typically performed using methods developed for bulk RNA sequencing or even microarray data, and the suitability of these methods for single-cell transcriptomics has not been assessed. Here we discuss commonly used normalization approaches and illustrate how these can produce misleading results. Finally, we present alternative approaches and provide recommendations for single-cell RNA sequencing users.

    Application of miRNA-seq in neuropsychiatry: A methodological perspective

    MiRNAs are emerging as key molecules to study neuropsychiatric diseases. However, despite the large number of methodologies and software for miRNA-seq analyses, there is little supporting literature for researchers in this area. This review focuses on evaluating how miRNA-seq has been used to study neuropsychiatric diseases to date, analyzing both the main findings discovered and the bioinformatics workflows and tools used from a methodological perspective. The objective of this review is two-fold: first, to evaluate current miRNA-seq procedures used in neuropsychiatry; and second, to offer comprehensive information that can serve as a guide to new researchers in bioinformatics. After conducting a systematic search (from 2016 to June 30, 2020) of articles using miRNA-seq in neuropsychiatry, we have seen that it has already been used for different types of studies in three main categories: diagnosis, prognosis, and mechanism. We carefully analyzed the bioinformatics workflows of each study, observing a high degree of variability with respect to the tools and methods used and several methodological complexities that are identified and discussed in this review.
    Funding: Instituto de Salud Carlos III | Ref. PI18/01311; Ministerio de Economía y Competitividad | Ref. RYC2014-15246; Xunta de Galicia | Ref. ED431C2018/55-GR

    Normalization of gene expression data revisited: the three viewpoints of the transcriptome in human skeletal muscle undergoing load-induced hypertrophy and why they matter

    The biological relevance and accuracy of gene expression data depend on the adequacy of data normalization. This is due both to its role in resolving and accounting for technical variation and errors and to its defining role in shaping the viewpoint of biological interpretations. Still, the choice of normalization method is often not explicitly motivated, although this choice may be particularly decisive for conclusions in studies involving pronounced cellular plasticity. In this study, we highlight the consequences of using three fundamentally different modes of normalization for interpreting RNA-seq data from human skeletal muscle undergoing exercise-training-induced growth. Briefly, 25 participants conducted 12 weeks of high-load resistance training. Muscle biopsy specimens were sampled from m. vastus lateralis before training, after two weeks of training (week 2), and after the intervention (week 12), and were subsequently analysed using RNA-seq. Transcript counts were modelled as (1) per-library-size, (2) per-total-RNA, and (3) per-sample-size (per-mg-tissue). Results: Initially, the three modes of transcript modelling led to the identification of three unique sets of stable genes, which displayed differential expression profiles. Specifically, genes showing stable expression across samples in the per-library-size dataset displayed training-associated increases in per-total-RNA and per-sample-size datasets. These gene sets were then used for normalization of the entire dataset, providing transcript abundance estimates corresponding to each of the three biological viewpoints (i.e., per-library-size, per-total-RNA, and per-sample-size). The different normalization modes led to different conclusions, measured as training-associated changes in transcript expression. Briefly, for 27% and 20% of the transcripts, training was associated with changes in expression in per-total-RNA and per-sample-size scenarios, but not in the per-library-size scenario. At week 2, this led to opposite conclusions for 4% of the transcripts between per-library-size and per-sample-size datasets (↑ vs. ↓, respectively). Conclusion: Scientists should be explicit with their choice of normalization strategies and should interpret the results of gene expression analyses with caution. This is particularly important for data sets involving a limited number of genes or involving growing or differentiating cellular models, where the risk of biased conclusions is pronounced.
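
The point that a gene can look stable per library size yet increase per mg of tissue can be illustrated with a toy calculation. This is a minimal sketch and not the study's pipeline: the function names, the toy counts, and the assumption that per-mg abundance equals the gene's library fraction times the RNA yield per mg of tissue are all hypothetical simplifications.

```python
import numpy as np

def counts_per_million(counts):
    """Per-library-size view: scale each sample (column) to one million reads."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=0, keepdims=True) * 1e6

def per_mg_tissue(counts, rna_ng_per_mg):
    """Per-sample-size view (assumed model): library fraction times RNA yield per mg."""
    counts = np.asarray(counts, dtype=float)
    fraction = counts / counts.sum(axis=0, keepdims=True)
    return fraction * np.asarray(rna_ng_per_mg, dtype=float)

# Toy data (hypothetical): rows are genes, columns are [pre-training, post-training].
counts = np.array([[100, 200],
                   [900, 1800]])
# Assumed RNA yield (ng per mg tissue); higher after training.
rna_ng_per_mg = [400.0, 600.0]

cpm = counts_per_million(counts)
per_mg = per_mg_tissue(counts, rna_ng_per_mg)

print("CPM, gene 0:", cpm[0])        # identical before and after
print("per-mg, gene 0:", per_mg[0])  # higher after training
```

Under these assumptions gene 0 is unchanged in the per-library-size view but increases in the per-mg view, purely because the tissue contains more RNA per mg after training.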

    SeqNet: An R Package for Generating Gene-Gene Networks and Simulating RNA-Seq Data

    Gene expression data provide an abundant resource for inferring connections in gene regulatory networks. While methodologies developed for this task have shown success, a challenge remains in comparing the performance among methods. Gold-standard datasets are scarce and limited in use. And while tools for simulating expression data are available, they are not designed to resemble the data obtained from RNA-seq experiments. SeqNet is an R package that provides tools for generating a rich variety of gene network structures and simulating RNA-seq data from them. This produces in silico RNA-seq data for benchmarking and assessing gene network inference methods. The package is available from the Comprehensive R Archive Network at https://CRAN.R-project.org/package=SeqNet and on GitHub at https://github.com/tgrimes/SeqNet

    Compositional Data Analysis is necessary for simulating and analyzing RNA-Seq data

    *Seq techniques (e.g. RNA-Seq) generate compositional datasets, i.e. the number of fragments sequenced is not proportional to the total RNA present. Thus, datasets carry only relative information, even though absolute RNA copy numbers are often of interest. Current normalization methods assume most features are not changing, which can lead to misleading conclusions when there are large shifts. However, there are few real datasets and no simulation protocols currently available that can directly benchmark methods when such large shifts occur. We present absSimSeq, an R package that simulates compositional data in the form of RNA-Seq reads. We tested several tools used for RNA-Seq differential analysis: sleuth, DESeq2, edgeR, limma, and ALDEx2 (which explicitly takes a compositional approach). For these tools, we compared their standard normalization to either “compositional normalization”, which uses log-ratios to anchor the data on a set of negative control features, or RUVSeq, another tool that directly uses negative control features. We show that common normalizations result in reduced performance with current methods when there is a large change in the total RNA per cell. Performance improves when spike-ins are included and used by a compositional approach, even if the spike-ins have substantial variation. In contrast, RUVSeq, which normalizes count data rather than compositional data, has poor performance. Further, we show that previous criticisms of spike-ins did not take into account the compositional nature of the data. We conclude that absSimSeq can generate more representative datasets for testing performance, and that spike-ins should be more broadly used in a compositional manner to minimize misleading conclusions from differential analyses.
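
The idea of anchoring log-ratios on negative control features can be sketched as follows. This is an illustrative assumption and not the absSimSeq API or the authors' implementation: the function name `log_ratio_to_controls`, the pseudocount parameter, and the toy numbers are hypothetical. Each feature is expressed relative to the geometric mean of the control features (for example spike-ins) in the same sample.

```python
import numpy as np

def log_ratio_to_controls(counts, control_idx, pseudocount=0.5):
    """Log-ratio of every feature to the geometric mean of the control features.

    counts: features x samples array; control_idx: row indices of the controls.
    """
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    anchor = logged[control_idx].mean(axis=0, keepdims=True)  # log geometric mean
    return logged - anchor

# Toy absolute amounts (hypothetical): row 0 is a spike-in held constant,
# rows 1 and 2 are genes whose absolute abundance triples in sample 2.
absolute = np.array([[10.0, 10.0],
                     [100.0, 300.0],
                     [100.0, 300.0]])

# Sequencing only observes proportions, here scaled to a fixed depth of 1e6 reads.
observed = absolute / absolute.sum(axis=0, keepdims=True) * 1e6

proportions = observed / observed.sum(axis=0, keepdims=True)
anchored = log_ratio_to_controls(observed, control_idx=[0], pseudocount=0.0)

print("gene 1 proportion:", proportions[1])                # nearly unchanged
print("gene 1 log-ratio change:", anchored[1, 1] - anchored[1, 0])
```

In this toy case the proportion of gene 1 barely moves between samples, while the spike-in-anchored log-ratio recovers the threefold change (a difference of log 3).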

    Statistical Modeling of High-throughput Sequencing Data and Spatially Resolved Transcriptomic Data

    Recent studies have shown that RNA sequencing (RNA-seq) can be used to measure mRNA of sufficient quality extracted from Formalin-Fixed Paraffin-Embedded (FFPE) tissues to provide whole-genome transcriptome analysis. However, little attention has been given to the normalization of FFPE RNA-seq data. In Chapters 1 and 2, we propose a new normalization method, labeled MIXnorm, and its simplified version SMIXnorm, for FFPE RNA-seq data. MIXnorm relies on a two-component mixture model, which models non-expressed genes by zero-inflated Poisson distributions and models expressed genes by truncated normal distributions. To obtain maximum likelihood estimates, we develop a nested EM algorithm, in which closed-form updates are available in each iteration. We evaluate MIXnorm and SMIXnorm through simulations and cancer studies. Recently, spatial molecular profiling technologies have enabled a comprehensive catalog of molecular profiling data together with tissue imaging data with spatial locations. In the context of spatial profiling, the research interest lies in investigating the association between gene expression levels and their spatial locations, i.e., identifying spatially expressed (SE) genes. However, gene expression data from spatial molecular profiling are subject to severe zero-inflation issues. In Chapter 3, we propose a Bayesian Spatial HEAPing model (SHEAP), which aims to accurately recover major spatial patterns underlying the gene expression levels that are partially observed and subject to heaping at zero. An efficient Markov chain Monte Carlo (MCMC) algorithm is developed for Bayesian inference. We evaluate the proposed method through simulation studies and real data applications.
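
For readers unfamiliar with the zero-inflated Poisson distribution mentioned above, a minimal sketch of its probability mass function follows. It shows only the generic textbook distribution, not the MIXnorm mixture or its nested EM algorithm; the function name `zip_pmf` and the parameter names `pi_zero` (extra probability of a structural zero) and `lam` (Poisson mean, assumed positive) are hypothetical.

```python
import math

def zip_pmf(k, pi_zero, lam):
    """P(K = k) for a zero-inflated Poisson with inflation pi_zero and mean lam > 0."""
    if k < 0:
        return 0.0
    # Poisson probability computed on the log scale for numerical stability.
    poisson = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    if k == 0:
        # Zeros arise either from the inflation component or from the Poisson itself.
        return pi_zero + (1.0 - pi_zero) * poisson
    return (1.0 - pi_zero) * poisson

# Example values (hypothetical): 30% structural zeros, Poisson mean of 2.
p_zero = zip_pmf(0, pi_zero=0.3, lam=2.0)
total = sum(zip_pmf(k, pi_zero=0.3, lam=2.0) for k in range(100))

print("P(K = 0):", p_zero)
print("sum over k < 100:", total)
```

The probability of observing zero exceeds the plain Poisson value exp(-lam), which is what makes the distribution a natural candidate for genes with an excess of zero counts.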