
    Models for transcript quantification from RNA-Seq

    RNA-Seq is rapidly becoming the standard technology for transcriptome analysis. Fundamental to many of the applications of RNA-Seq is the quantification problem, which is the accurate measurement of relative transcript abundances from the sequenced reads. We focus on this problem, and review many recently published models that are used to estimate the relative abundances. In addition to describing the models and the different approaches to inference, we also explain how methods are related to each other. A key result is that we show how inference with many of the models results in identical estimates of relative abundances, even though model formulations can be very different. In fact, we are able to show how a single general model captures many of the elements of previously published methods. We also review the applications of RNA-Seq models to differential analysis, and explain why accurate relative transcript abundance estimates are crucial for downstream analyses.
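
As an illustration of the quantification problem, the following is a minimal sketch of one common inference approach, expectation-maximization over ambiguously mapped reads. It is not taken from the reviewed paper: the function names `em_abundances` and `to_relative_abundance`, the read-compatibility input format, and the use of raw transcript lengths in place of effective lengths are all assumptions made for this example.

```python
def em_abundances(compat, n_transcripts, n_iter=200):
    """Estimate the fraction of reads originating from each transcript.

    compat[r] is the list of transcript indices that read r is
    compatible with (hypothetical input format for this sketch).
    """
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        counts = [0.0] * n_transcripts
        # E-step: split each read across its compatible transcripts
        # in proportion to the current estimates.
        for txs in compat:
            total = sum(theta[t] for t in txs)
            for t in txs:
                counts[t] += theta[t] / total
        # M-step: renormalize the expected counts.
        n = sum(counts)
        theta = [c / n for c in counts]
    return theta


def to_relative_abundance(theta, lengths):
    """Convert read fractions to relative transcript abundances by
    dividing by transcript length and renormalizing."""
    per_base = [th / ln for th, ln in zip(theta, lengths)]
    s = sum(per_base)
    return [p / s for p in per_base]


# Six reads unique to transcript 0, two unique to transcript 1,
# and four reads compatible with both.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
theta = em_abundances(reads, n_transcripts=2)
rho = to_relative_abundance(theta, lengths=[1000, 500])
```

Dividing by length matters because a longer transcript yields more reads at the same abundance, so read fractions alone would overstate its relative abundance.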

    Deep generative modeling for single-cell transcriptomics.

    Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells (https://github.com/YosefLab/scVI). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task.

    Methods for Joint Normalization and Comparison of Hi-C data

    The development of chromatin conformation capture technology has opened new avenues of study into the 3D structure and function of the genome. Chromatin structure is known to influence gene regulation, and differences in structure are now emerging as a mechanism of regulation between, e.g., cell differentiation and disease vs. normal states. Hi-C sequencing technology now provides a way to study the 3D interactions of the chromatin over the whole genome. However, like all sequencing technologies, Hi-C suffers from several forms of bias stemming from both the technology and the DNA sequence itself. Several normalization methods have been developed for normalizing individual Hi-C datasets, but little work has been done on developing joint normalization methods for comparing two or more Hi-C datasets. To make full use of Hi-C data, joint normalization and statistical comparison techniques are needed to carry out experiments to identify regions where chromatin structure differs between conditions. We develop methods for the joint normalization and comparison of two Hi-C datasets, which we then extended to more complex experimental designs. Our normalization method is novel in that it makes use of the distance-dependent nature of chromatin interactions. Our modification of the Minus vs. Average (MA) plot to the Minus vs. Distance (MD) plot allows for a nonparametric data-driven normalization technique using loess smoothing. Additionally, we present a simple statistical method using Z-scores for detecting differentially interacting regions between two datasets. Our initial method was published as the Bioconductor R package HiCcompare [http://bioconductor.org/packages/HiCcompare/](http://bioconductor.org/packages/HiCcompare/). We then further extended our normalization and comparison method for use in complex Hi-C experiments with more than two datasets and optional covariates. 
We extended the normalization method to jointly normalize any number of Hi-C datasets by using a cyclic loess procedure on the MD plot. The cyclic loess normalization technique can remove between-dataset biases efficiently and effectively even when several datasets are analyzed at one time. Our comparison method implements a generalized linear model-based approach for comparing complex Hi-C experiments, which may have more than two groups and additional covariates. The extended methods are also available as a Bioconductor R package [http://bioconductor.org/packages/multiHiCcompare/](http://bioconductor.org/packages/multiHiCcompare/). Finally, we demonstrate the use of HiCcompare and multiHiCcompare in several test cases on real data, and compare them to other similar methods (https://doi.org/10.1002/cpbi.76).
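
The idea behind the MD plot can be sketched as follows for two datasets. This is an illustrative simplification, not the HiCcompare implementation: it assumes M is the log2 ratio of the two interaction frequencies and D is the distance in bins, and it replaces the loess fit with a plain per-distance mean of M. The function names `md_normalize` and `z_scores` and the input tuple format are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def md_normalize(pairs):
    """Remove the distance-dependent trend from M = log2(IF2 / IF1).

    pairs: list of (bin_i, bin_j, if1, if2) with positive interaction
    frequencies. A per-distance mean stands in for the loess curve
    described in the abstract (assumption of this sketch).
    Returns a list of (bin_i, bin_j, distance, adjusted_M).
    """
    by_dist = defaultdict(list)
    for i, j, if1, if2 in pairs:
        by_dist[abs(j - i)].append(math.log2(if2 / if1))
    trend = {d: mean(ms) for d, ms in by_dist.items()}

    adjusted = []
    for i, j, if1, if2 in pairs:
        d = abs(j - i)
        adjusted.append((i, j, d, math.log2(if2 / if1) - trend[d]))
    return adjusted


def z_scores(values):
    """Standardize adjusted M values; large |Z| suggests a difference."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


# Dataset 2 is uniformly twice as deep as dataset 1 at distance 1,
# so the adjusted M values at that distance are centered on zero.
pairs = [(0, 1, 10, 20), (1, 2, 8, 16), (2, 3, 5, 10),
         (0, 2, 6, 6), (1, 3, 4, 4)]
adjusted = md_normalize(pairs)
z = z_scores([m for _, _, _, m in adjusted])
```

A smooth fit such as loess borrows strength across neighboring distances, which matters at long range where few interactions are observed; the per-distance mean used here does not.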

    Implementation, adaptation and evaluation of statistical analysis techniques for next generation sequencing data

    Deep sequencing is a new high‐throughput sequencing technology intended to lower the cost of DNA sequencing further than what was previously thought possible using standard methods. Analysis of sequencing data such as SAGE (serial analysis of gene expression) and microarray data has been a popular area of research in recent years. The increasing development of these different technologies and the variety of the data produced has stressed the need for efficient analysis techniques. Various methods for the analysis of sequencing data have been developed in recent years: both SAGE data, which is discrete; and microarray data, which is continuous. These include simple analysis techniques, hierarchical clustering techniques (both Bayesian and Frequentist) and various methods for finding differential expression between groups of samples. These methods range from simple comparison techniques to more complicated computational methods, which attempt to isolate the more subtle dissimilarities in the data. Various analysis techniques are used in this thesis for the analysis of unpublished deep sequencing data. This analysis was approached in three sections. The first was looking at clustering techniques previously developed for SAGE data (the Poisson C / Poisson L algorithm and a Bayesian hierarchical clustering algorithm) and evaluating and adapting these techniques for use on the deep sequencing data. The second was looking at methods to find differentially expressed tags in the dataset. These differentially expressed tags are of interest, as it is believed that finding tags which are significantly up or down regulated across groups of samples could potentially be useful in the treatment of certain diseases. Finally, due to the lack of published data, a simulation study was constructed using various models to simulate the data and assess the techniques mentioned above on data with pre‐defined sample groupings and differentially expressed tags.
The main goals of the simulation study were the validation of the analysis techniques previously discussed and estimation of false positive rates for this type of large, sparse dataset. The Bayesian algorithm yielded surprising results, producing no hierarchy, suggesting no evidence of clustering. However, promising results were obtained for the adapted Poisson C / Poisson L algorithm applied using various models to fit the data and measures of similarity. Further investigation is needed to confirm whether it is suitable for the clustering of deep sequencing data in general, especially where three or more groups of interest occur. From the results of the differential expression analysis it can be deduced that the overdispersed log linear method for the analysis of differential expression is the most reliable, particularly when compared to simple tests such as the 2‐sample t‐test and the Wilcoxon signed rank test. This deduction is based upon the overlap with the results of other methods and the more reasonable number of differentially expressed tags detected, in contrast to those detected using the adapted log ratio method. However, none of this can be confirmed, as no information was known about the tags in either dataset. The success of the Poisson C / Poisson L algorithm on both the Poisson and Truncated Poisson simulated datasets suggests that the method of simulation is acceptable for the assessment of clustering algorithms developed for use on sequencing data. However, evaluation of the differential expression analysis performed on the simulated data indicates that further work is needed on the method of simulation to increase its reliability. The algorithms presented can be adapted for use on any form of discrete data. From the work done here, there is evidence that the adapted Poisson C / Poisson L algorithm is a promising technique for the analysis of deep sequencing data.

    An iteration normalization and test method for differential expression analysis of RNA-seq data

    BACKGROUND: Next generation sequencing technologies are powerful new tools for investigating a wide range of biological and medical questions. Statistical and computational methods are key to analyzing massive and complex sequencing data. In order to derive gene expression measures and compare these measures across samples or libraries, we first need to normalize read counts to adjust for varying sample sequencing depths and other potentially technical effects. RESULTS: In this paper, we develop a normalization method based on iterating median of M-values (IMM) for detecting differentially expressed (DE) genes. Compared to the previous approach, TMM, the IMM method improves the accuracy of DE detection. Simulation studies show that the IMM method outperforms other methods for sample normalization. On real data, we find that the genes detected by IMM but not by TMM are much more accurate than the genes detected by TMM but not by IMM. Among other findings, the gene UNC5C is highly associated with kidney cancer.
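
To make the notion of an M-value concrete, here is a minimal sketch of a single median-of-M-values scaling step between one sample and a reference. The abstract does not describe how IMM iterates, so no iteration is reproduced here; the function name `median_m_scaling` and the choice to skip genes with a zero count are assumptions of this example.

```python
import math
from statistics import median


def median_m_scaling(sample, reference):
    """Return a scaling factor for `sample` relative to `reference`.

    M-values are log2 ratios of depth-normalized counts per gene.
    Genes with a zero count in either library are skipped
    (assumption of this sketch).
    """
    n_s, n_r = sum(sample), sum(reference)
    m_values = [
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    ]
    return 2 ** median(m_values)


# One strongly up-regulated gene inflates the sample's library size.
reference = [100, 100, 100, 100, 100]
sample = [100, 100, 100, 100, 1000]
factor = median_m_scaling(sample, reference)
effective_size = sum(sample) * factor
```

In this example the factor is about 0.357, so the effective library size of the sample comes out near 500, matching the reference. The four unchanged genes are then compared on equal footing rather than appearing down-regulated.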

    A scaling normalization method for differential expression analysis of RNA-seq data

    A novel and empirical method for normalization of RNA-seq data is presented.

    Statistical Methods For Genomic And Transcriptomic Sequencing

    Part 1: High-throughput sequencing of DNA coding regions has become a common way of assaying genomic variation in the study of human diseases. Copy number variation (CNV) is an important type of genomic variation, but CNV profiling from whole-exome sequencing (WES) is challenging due to the high level of biases and artifacts. We propose CODEX, a normalization and CNV calling procedure for WES data. CODEX includes a Poisson latent factor model, which includes terms that specifically remove biases due to GC content, exon capture and amplification efficiency, and latent systemic artifacts. CODEX also includes a Poisson likelihood-based segmentation procedure that explicitly models the count-based WES data. CODEX is compared to existing methods on germline CNV detection in HapMap samples using microarray-based gold standard and is further evaluated on 222 neuroblastoma samples with matched normal, with focus on somatic CNVs within the ATRX gene. Part 2: Cancer is a disease driven by evolutionary selection on somatic genetic and epigenetic alterations. We propose Canopy, a method for inferring the evolutionary phylogeny of a tumor using both somatic copy number alterations and single nucleotide alterations from one or more samples derived from a single patient. Canopy is applied to bulk sequencing datasets of both longitudinal and spatial experimental designs and to a transplantable metastasis model derived from human cancer cell line MDA-MB-231. Canopy successfully identifies cell populations and infers phylogenies that are in concordance with existing knowledge and ground truth. Through simulations, we explore the effects of key parameters on deconvolution accuracy, and compare against existing methods. Part 3: Allele-specific expression is traditionally studied by bulk RNA sequencing, which measures average expression across cells. 
Single-cell RNA sequencing (scRNA-seq) allows the comparison of expression distribution between the two alleles of a diploid organism and thus the characterization of allele-specific bursting. We propose SCALE to analyze genome-wide allele-specific bursting, with adjustment of technical variability. SCALE detects genes exhibiting allelic differences in bursting parameters, and genes whose alleles burst non-independently. We apply SCALE to mouse blastocyst and human fibroblast cells and find that, globally, cis control in gene expression overwhelmingly manifests as differences in burst frequency.

    Droplet scRNA-seq is not zero-inflated

    Potential users of single-cell RNA-sequencing (scRNA-seq) often encounter a choice between high-throughput droplet-based methods and high-sensitivity plate-based methods. There is a widespread belief that scRNA-seq will often fail to generate measurements for some genes from some cells owing to technical molecular inefficiencies. It is believed that this causes data to have an overabundance of zero values compared to what is expected from random sampling and that this effect is particularly pronounced in droplet-based methods. Here I present an investigation of published data for technical controls in droplet-based scRNA-seq experiments that demonstrates that the number of zero values in the data is consistent with common distributional models of molecule sampling counts. Thus, any additional zero values in biological data likely result from biological variation or may reflect variation in gene abundance among cell types or cell states.
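
The comparison described above can be sketched as follows: compute the fraction of zeros a count model predicts for a gene with a given mean, and compare it with the observed fraction. The sketch assumes that the "common distributional models" are the Poisson and the negative binomial with variance mu + phi * mu^2; the function names are hypothetical and the code is not taken from the paper.

```python
import math


def poisson_zero_prob(mu):
    """P(count = 0) under a Poisson model with mean mu."""
    return math.exp(-mu)


def nb_zero_prob(mu, phi):
    """P(count = 0) under a negative binomial with mean mu and
    dispersion phi (variance = mu + phi * mu**2)."""
    if phi == 0:
        return poisson_zero_prob(mu)
    return (1.0 + phi * mu) ** (-1.0 / phi)


def observed_zero_fraction(counts):
    """Fraction of cells in which the gene has a zero count."""
    return sum(1 for c in counts if c == 0) / len(counts)


# Hypothetical counts for one gene across ten cells.
counts = [0, 0, 1, 0, 2, 0, 1, 0, 0, 3]
mu = sum(counts) / len(counts)
observed = observed_zero_fraction(counts)
expected_poisson = poisson_zero_prob(mu)
expected_nb = nb_zero_prob(mu, phi=1.0)
```

For a fixed mean, the negative binomial predicts at least as many zeros as the Poisson, so a gene can show many zeros without any additional zero-inflation component.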