
    Robust identification of local adaptation from allele frequencies

    Comparing allele frequencies among populations that differ in environment has long been a tool for detecting loci involved in local adaptation. However, such analyses are complicated by an imperfect knowledge of population allele frequencies and by neutral correlations of allele frequencies across populations due to shared population history and gene flow. Here we develop a set of methods to robustly test for unusual allele frequency patterns and for correlations between environmental variables and allele frequencies while accounting for these complications, based on a Bayesian model previously implemented in the software Bayenv. Using this model, we calculate a set of 'standardized allele frequencies' that allows investigators to apply tests of their choice to multiple populations while accounting for sampling and for covariance due to population history. We illustrate this first by showing that these standardized frequencies can be used to construct powerful tests for non-parametric correlations with environmental variables, which are also less prone to spurious results caused by outlier populations. We then demonstrate how these standardized allele frequencies can be used to construct a test to detect SNPs that deviate strongly from neutral population structure. This test is conceptually related to FST but should be more powerful because we account for population history. We also extend the model to next-generation sequencing of population pools, which is a cost-efficient way to estimate population allele frequencies but introduces an additional level of sampling noise. The utility of these methods is demonstrated in simulations and by re-analyzing human SNP data from the HGDP populations. An implementation of our method will be available from http://gcbias.org. (27 pages, 7 figures)
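
    As a rough illustration of the approach described above (not the Bayenv implementation itself), the hypothetical Python sketch below standardizes simulated allele frequencies with the Cholesky factor of an assumed neutral covariance matrix and then tests a non-parametric (Spearman) correlation with an environmental variable; the covariance matrix, ancestral frequency and environmental values are all invented for illustration.

```python
# Hypothetical sketch: standardizing allele frequencies against an assumed
# population covariance matrix, then testing a non-parametric (rank)
# correlation with an environmental variable. Not the Bayenv/Bayenv2 code.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

n_pops = 8
omega = 0.02 * (np.eye(n_pops) + 0.5 * np.ones((n_pops, n_pops)))  # assumed neutral covariance
env = np.linspace(-1.0, 1.0, n_pops)                               # assumed environmental variable

# Simulated observed allele frequencies at one (neutral) SNP around an ancestral frequency.
p_anc = 0.3
freqs = np.clip(p_anc + rng.multivariate_normal(np.zeros(n_pops), omega), 0.01, 0.99)

# "Standardized allele frequencies": remove the mean and whiten by the
# Cholesky factor of the covariance, so neutral SNPs look roughly i.i.d.
L = np.linalg.cholesky(omega * p_anc * (1 - p_anc))
x = np.linalg.solve(L, freqs - p_anc)

# A rank correlation of standardized frequencies with the environment is more
# robust to outlier populations than a parametric (Pearson-style) test.
rho, p_value = spearmanr(x, env)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```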

    Detecting mutations in mixed sample sequencing data using empirical Bayes

    We develop statistically based methods to detect single-nucleotide DNA mutations in next-generation sequencing data. Sequencing generates counts of the number of times each base was observed at hundreds of thousands to billions of genome positions in each sample. Using these counts to detect mutations is challenging because mutations may have very low prevalence and sequencing error rates vary dramatically by genome position. The discreteness of sequencing data also creates a difficult multiple testing problem: current false discovery rate methods are designed for continuous data and work poorly, if at all, on discrete data. We show that a simple randomization technique lets us use continuous false discovery rate methods on discrete data. Our approach is a useful way to estimate false discovery rates for any collection of discrete test statistics, and is hence not limited to sequencing data. We then use an empirical Bayes model to capture different sources of variation in sequencing error rates. The resulting method outperforms existing detection approaches on example data sets. Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/, http://dx.doi.org/10.1214/12-AOAS538) by the Institute of Mathematical Statistics (http://www.imstat.org).
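
    The randomization idea can be illustrated with a hypothetical sketch: draw a randomized p-value uniformly between the two attainable tail probabilities of a discrete binomial test, which makes it uniform under the null, and then apply a standard continuous FDR procedure such as Benjamini-Hochberg. The read depths, error rate and spiked-in mutations below are invented, and this is not the paper's empirical Bayes model.

```python
# Illustrative sketch (not the paper's method): randomized p-values turn a
# discrete binomial test into a statistic that is uniform under the null,
# so standard continuous FDR procedures such as Benjamini-Hochberg apply.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)

depth = np.full(10_000, 200)                   # assumed read depth per position
err = 0.005                                    # assumed per-position error rate
counts = rng.binomial(depth, err)              # mismatch counts under the null
counts[:50] += rng.binomial(depth[:50], 0.05)  # a few true low-prevalence mutations

# Randomized p-value: uniform draw between P(X > x) and P(X >= x) under the null.
p_upper = binom.sf(counts - 1, depth, err)     # P(X >= x)
p_lower = binom.sf(counts, depth, err)         # P(X > x)
p_rand = p_lower + rng.uniform(size=counts.size) * (p_upper - p_lower)

# Benjamini-Hochberg step-up procedure on the randomized p-values.
order = np.argsort(p_rand)
m = p_rand.size
thresh = 0.05 * np.arange(1, m + 1) / m
passed = p_rand[order] <= thresh
k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
print(f"positions called at FDR 0.05: {k}")
```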

    Estimating the effective population size from temporal allele frequency changes in experimental evolution

    A.J. and T.T. are members of the Vienna Graduate School of Population Genetics, which is funded by the Austrian Science Fund (FWF, W1225). C.S. is also supported by the European Research Council grant "ArchAdapt," and T.T. is a recipient of a Doctoral Fellowship (DOC) of the Austrian Academy of Sciences. The effective population size (Ne) is a major factor determining allele frequency changes in natural and experimental populations. Temporal methods provide a powerful and simple approach to estimate short-term Ne. They use allele frequency shifts between temporal samples to calculate the standardized variance, which is directly related to Ne. Here we focus on experimental evolution studies that often rely on repeated sequencing of samples in pools (Pool-seq). Pool-seq is cost-effective and often outperforms individual-based sequencing in estimating allele frequencies, but it has atypical sampling properties: in addition to the sampling of individuals, sequencing DNA in pools introduces a second round of sampling, which increases the variance of allele frequency estimates. We propose a new estimator of Ne that relies on allele frequency changes in temporal data and corrects for the variance of both sampling steps. In simulations, we obtain accurate Ne estimates as long as the drift variance is not too small compared to the sampling and sequencing variance. In addition to genome-wide Ne estimates, we extend our method using a recursive partitioning approach to estimate Ne locally along the chromosome. Since the type I error is controlled, our method permits the identification of genomic regions that differ significantly in their Ne estimates. We present an application to Pool-seq data from experimental evolution with Drosophila and provide recommendations for whole-genome data. The estimator is computationally efficient and available as an R package at https://github.com/ThomasTaus/Nest.
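
    A minimal sketch of the underlying temporal idea (ignoring the extra Pool-seq sequencing step that the Nest estimator additionally corrects for): compute the standardized variance of the allele frequency change between two samples, subtract a Waples-style correction for the sampling variance at both time points, and convert the result to Ne. The generation interval, sample sizes and allele frequencies below are simulated assumptions.

```python
# Rough sketch of a temporal Ne estimator (Waples-style, plan II), ignoring
# the extra Pool-seq sequencing noise that the Nest package corrects for.
import numpy as np

rng = np.random.default_rng(2)

t = 10              # generations between the two samples (assumed)
s0, st = 100, 100   # diploid individuals sampled at each time point (assumed)
true_ne = 300       # "true" effective size, used only to simulate drift

# Simulate drift at many unlinked loci, then binomial sampling of individuals.
p0 = rng.uniform(0.1, 0.9, size=5_000)
pt = p0.copy()
for _ in range(t):
    pt = rng.binomial(2 * true_ne, pt) / (2 * true_ne)
p0_obs = rng.binomial(2 * s0, p0) / (2 * s0)
pt_obs = rng.binomial(2 * st, pt) / (2 * st)

# Standardized variance of the allele frequency change.
z = (p0_obs + pt_obs) / 2
fc = np.mean((p0_obs - pt_obs) ** 2 / (z * (1 - z)))

# Subtract the sampling variance of both samples, then convert to Ne.
fc_drift = fc - 1 / (2 * s0) - 1 / (2 * st)
ne_hat = t / (2 * fc_drift)
print(f"estimated Ne ~ {ne_hat:.0f} (true value {true_ne})")
```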

    Improving & applying single-cell RNA sequencing

    The cell is the fundamental building block of life. With the advent of single-cell RNA sequencing (scRNA-seq), we can for the first time assess the transcriptome of many individual cells. This has profound implications for biological and medical questions and is especially important for characterising heterogeneous cell populations and rare cells. However, the technology is technically and computationally challenging, as complementary DNA (cDNA) needs to be generated and amplified from minute amounts of mRNA, and sequenceable libraries need to be efficiently generated from many cells. This requires establishing different protocols, identifying important caveats, benchmarking various methods and improving them where possible. To this end, we analysed amplification bias and its effect on detecting differentially expressed genes in several bulk and single-cell RNA sequencing methods. We found that amplification bias cannot be corrected computationally, but that removing duplicates with unique molecular identifiers improves the power of scRNA-seq considerably, while the effect is negligible for bulk RNA-seq. In the second study, we compared six prominent scRNA-seq protocols, since more and more methods are becoming available but an independent benchmark was lacking. Using the same mouse embryonic stem cells (mESCs) and exogenous mRNA spike-ins as a common reference, we compared these protocols in their sensitivity, accuracy and precision in quantifying mRNA levels. In agreement with our previous study, we find that the precision, i.e. the technical variance, of scRNA-seq methods is driven by amplification bias and drastically reduced when using unique molecular identifiers to remove amplification duplicates. To assess the combined effects of sensitivity and precision, and to compare the cost-efficiency of methods, we compared the power to detect differentially expressed genes among the tested scRNA-seq protocols using a novel simulation framework. We found that some methods are prohibitively inefficient and others show trade-offs depending on the number of cells per sample that need to be analysed. Our study also provides a framework for benchmarking further improvements of scRNA-seq protocols, and we published an improved version of our simulation framework, powsimR. It uniquely recapitulates the specific characteristics of scRNA-seq data to enable streamlined simulations for benchmarking both wet-lab protocols and analysis algorithms. Furthermore, we compiled our experience in processing different types of scRNA-seq data, in particular with barcoded libraries and UMIs, and developed zUMIs, a fast and flexible scRNA-seq data processing software that overcomes shortcomings of existing pipelines. In addition, we used this in-depth characterisation of scRNA-seq technology to optimize an already powerful scRNA-seq protocol even further. According to data generated from exogenous mRNA spike-ins, this new mcSCRB-seq protocol is currently the most sensitive scRNA-seq protocol available.
    Single-cell resolution makes scRNA-seq uniquely suited to the understanding of complex diseases, such as leukemia. In acute lymphoblastic leukemia (ALL), rare chemotherapy-resistant cells persist as minimal residual disease (MRD) and may cause relapse. However, the biological mechanisms of these relapse-inducing cells remain largely unclear because a characterisation of this rare population has been lacking. In order to contribute to the understanding of MRD, we leveraged scRNA-seq to study minimal residual disease cells from ALL. We obtained and characterised rare, chemotherapy-resistant cell populations from primary patients and from patient cells grown in xenograft mouse models. We found that MRD cells are dormant and feature high expression of adhesion molecules in order to persist in the hematopoietic niche. Furthermore, we could show that there is plasticity between resting, resistant MRD cells and cycling, therapy-sensitive cells, indicating that patients could benefit from strategies that release MRD cells from the niche. Importantly, we show that our data derived from xenograft models closely resemble rare primary patient samples. In conclusion, my work of the last years contributes to the development of experimental and computational single-cell RNA sequencing methods, enabling their widespread application to biomedical problems such as leukemia.
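
    The central role of UMIs can be shown with a toy sketch (not the zUMIs pipeline): reads sharing the same cell barcode, gene and UMI are counted as a single molecule, so PCR duplicates no longer inflate expression estimates. The read tuples below are invented.

```python
# Toy illustration of UMI-based counting (not zUMIs): PCR duplicates share the
# same cell barcode, gene and UMI, so they collapse to one observed molecule,
# whereas read-based counting would inflate expression by the amplification bias.
from collections import Counter

# (cell_barcode, gene, umi) for a handful of hypothetical aligned reads.
reads = [
    ("ACGT", "Pou5f1", "AAAC"),
    ("ACGT", "Pou5f1", "AAAC"),  # PCR duplicate of the read above
    ("ACGT", "Pou5f1", "GGTA"),
    ("ACGT", "Sox2",   "CCGT"),
    ("TTAG", "Pou5f1", "AAAC"),  # same UMI in another cell: a distinct molecule
    ("TTAG", "Sox2",   "TGCA"),
    ("TTAG", "Sox2",   "TGCA"),  # PCR duplicate
]

read_counts = Counter((cell, gene) for cell, gene, _ in reads)
umi_counts = Counter((cell, gene) for cell, gene, umi in set(reads))  # deduplicated

print("reads per (cell, gene):", dict(read_counts))
print("UMIs per (cell, gene): ", dict(umi_counts))
```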

    Ironing out the wrinkles in the rare biosphere through improved OTU clustering

    Deep sequencing of PCR amplicon libraries facilitates the detection of low-abundance populations in environmental DNA surveys of complex microbial communities. At the same time, deep sequencing can lead to overestimates of microbial diversity through the generation of low-frequency, error-prone reads. Even with sequencing error rates below 0.005 per nucleotide position, the common method of generating operational taxonomic units (OTUs) by multiple sequence alignment and complete-linkage clustering significantly increases the number of predicted OTUs and inflates richness estimates. We show that a 2% single-linkage preclustering methodology followed by an average-linkage clustering based on pairwise alignments more accurately predicts expected OTUs in both single and pooled template preparations of known taxonomic composition. This new clustering method can reduce the OTU richness in environmental samples by as much as 30–60% but does not reduce the fraction of OTUs in long-tailed rank abundance curves that defines the rare biosphere.
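
    A toy sketch of the two-stage clustering idea (not the authors' pipeline, which uses pairwise alignments and precluster consensus sequences): single-linkage preclustering at a 2% distance threshold absorbs error-prone reads into the abundant sequences they derive from, and average-linkage clustering at a conventional 3% cutoff then defines the OTUs. The sequences and the naive Hamming-fraction distances are made up for illustration.

```python
# Toy two-stage clustering sketch (not the published pipeline): single-linkage
# preclustering at 2% absorbs likely error reads, then average-linkage
# clustering at a conventional 3% cutoff defines OTUs.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

true_a = "ACGT" * 25          # 100 bp "true" sequence
err_a = true_a[:-1] + "G"     # 1% different: a likely sequencing error
true_b = "GTCA" + true_a[4:]  # 4% different: a genuinely distinct taxon
err_b = true_b[:-1] + "G"     # error read derived from the second taxon
reads = [true_a, true_a, err_a, true_b, err_b]

def dist(a, b):
    """Fraction of mismatching positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

n = len(reads)
d = np.array([[dist(reads[i], reads[j]) for j in range(n)] for i in range(n)])
condensed = squareform(d, checks=False)

# Stage 1: single-linkage preclustering at a 2% distance threshold.
pre = fcluster(linkage(condensed, method="single"), t=0.02, criterion="distance")

# Stage 2: average-linkage clustering at 3% (the real workflow applies this to
# precluster consensus sequences rather than the raw reads).
otus = fcluster(linkage(condensed, method="average"), t=0.03, criterion="distance")
print("precluster labels:", pre)
print("OTU labels:       ", otus)
```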

    Statistical power analysis for single-cell RNA-sequencing

    RNA-sequencing (RNA-seq) is an established method to quantify levels of gene expression genome-wide. The recent development of single-cell RNA sequencing (scRNA-seq) protocols opens up the possibility to systematically characterize cell transcriptomes and their underlying developmental and regulatory mechanisms. Since the first publication on single-cell transcriptomics a decade ago, hundreds of scRNA-seq datasets from a variety of sources have been released, profiling gene expression of sorted cells, tumors, whole dissociated organs and even complete organisms. Currently, it is also the main tool to systematically characterize human cells within the Human Cell Atlas Project. Given its wide applicability and increasing popularity, many experimental protocols and computational analysis approaches exist for scRNA-seq. However, the technology remains experimentally and computationally challenging. Firstly, single cells contain only minute mRNA amounts that need to be reliably captured and amplified for accurate quantification by sequencing. Importantly, the polymerase chain reaction (PCR) is commonly used for amplification, which might introduce biases and increase technical variation. Secondly, once the sequencing results are obtained, finding the best computational processing pipeline can be a struggle. A number of comparison studies have already been conducted, especially for bulk RNA-seq, but they usually deal with only one aspect of the workflow. Furthermore, to what extent the conclusions and recommendations of these studies can be transferred to scRNA-seq is unknown. Regarding the processing of RNA-sequencing data, we investigated the effect of PCR amplification on differential expression analysis. We found that computational removal of duplicates has either a negligible or a negative impact on the specificity and sensitivity of differential expression analysis, and we therefore recommend not to remove read duplicates by mapping position. In contrast, if duplicates are identified using unique molecular identifiers (UMIs) tagging RNA molecules, both specificity and sensitivity improve. The first integral step of any scRNA-seq experiment is the preparation of sequencing libraries from the cells. We conducted an independent benchmarking study of popular library preparation protocols in terms of detection sensitivity, accuracy and precision, using the same mouse embryonic stem cells and exogenous mRNA spike-ins. We recapitulated our previous finding that technical variance is markedly decreased when using UMIs to remove duplicates. In order to assign a monetary value to the observed technical variance, we developed a simulation framework that enabled us to compare the power to detect differentially expressed genes across the scRNA-seq library preparation protocols. Our experiences during this comparison study led to the development of the sequencing data processing pipeline zUMIs and the simulation and power analysis framework powsimR. zUMIs is a pipeline for processing scRNA-seq data with flexible choices regarding UMI and cell barcode design. In addition, we showed with powsimR simulations that the inclusion of intronic reads for gene expression quantification increases the power to detect DE genes, and added this as a unique feature to zUMIs. In powsimR, we present our simulation framework with extended choices for data analysis, enabling researchers to assess experimental designs and analysis plans for RNA-seq in terms of statistical power.
    Lastly, we conducted a systematic evaluation of scRNA-seq experimental and analytical pipelines. We found that choices concerning normalisation and library preparation protocols have the biggest impact on the validity of scRNA-seq DE analysis. Choosing a good scRNA-seq pipeline can have the same impact on detecting a biological signal as quadrupling the cell sample size. Taken together, we have established and applied a simulation framework that allowed us to benchmark experimental and computational scRNA-seq protocols and hence inform the experimental design and method choices of this important technology.
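
    A bare-bones example of the kind of power simulation that powsimR formalizes (not its implementation): simulate negative-binomial counts for two groups of cells, spike in a fold change for a subset of genes, test each gene, and report the detection rate at a nominal FDR. The mean, dispersion, fold change, group sizes and the choice of a Mann-Whitney test are illustrative assumptions.

```python
# Bare-bones power simulation in the spirit of powsimR (not its code): draw
# negative-binomial counts for two groups of cells, spike in fold changes for
# some genes, test each gene, and estimate power at a nominal FDR of 5%.
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(3)

n_genes, n_de, n_cells = 2_000, 200, 50      # assumed experiment size
mu, dispersion, fold_change = 5.0, 0.5, 2.0  # assumed NB mean, dispersion, effect

def nb_counts(mean, size):
    """Negative-binomial counts with the given mean and a fixed dispersion."""
    r = 1.0 / dispersion
    return rng.negative_binomial(r, r / (r + mean), size=size)

mu_b = np.full(n_genes, mu)
mu_b[:n_de] *= fold_change  # the first n_de genes are truly differentially expressed

pvals = np.empty(n_genes)
for g in range(n_genes):
    group_a = nb_counts(mu, n_cells)
    group_b = nb_counts(mu_b[g], n_cells)
    pvals[g] = mannwhitneyu(group_a, group_b).pvalue

called = multipletests(pvals, alpha=0.05, method="fdr_bh")[0]
print(f"power among DE genes: {called[:n_de].mean():.2f}")
print(f"false discoveries among non-DE genes: {int(called[n_de:].sum())}")
```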

    Models for transcript quantification from RNA-Seq

    RNA-Seq is rapidly becoming the standard technology for transcriptome analysis. Fundamental to many of the applications of RNA-Seq is the quantification problem, which is the accurate measurement of relative transcript abundances from the sequenced reads. We focus on this problem, and review many recently published models that are used to estimate the relative abundances. In addition to describing the models and the different approaches to inference, we also explain how methods are related to each other. A key result is that we show how inference with many of the models results in identical estimates of relative abundances, even though model formulations can be very different. In fact, we are able to show how a single general model captures many of the elements of previously published methods. We also review the applications of RNA-Seq models to differential analysis, and explain why accurate relative transcript abundance estimates are crucial for downstream analyses.
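
    The common generative model underlying many of these methods can be sketched with a small EM example: each read originates from one of its compatible transcripts with probability proportional to that transcript's read fraction divided by its effective length, and the M-step re-estimates the read fractions from the expected assignments. The compatibility matrix and effective lengths below are hypothetical, and the sketch ignores fragment-length distributions and other refinements discussed in the review.

```python
# Minimal EM sketch for the standard generative model behind RNA-Seq
# quantification (a simplification of the reviewed methods): each read comes
# from one transcript with probability proportional to abundance / length.
import numpy as np

# Hypothetical compatibility matrix: A[i, t] = 1 if read i aligns to transcript t.
A = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)
eff_len = np.array([1000.0, 1500.0, 500.0])  # assumed effective transcript lengths

n_reads, n_tx = A.shape
theta = np.full(n_tx, 1.0 / n_tx)  # fraction of reads from each transcript

for _ in range(200):
    # E-step: posterior probability that each read came from each compatible
    # transcript, proportional to read fraction / effective length.
    w = A * (theta / eff_len)
    r = w / w.sum(axis=1, keepdims=True)
    # M-step: read fractions are the expected share of reads per transcript.
    theta = r.sum(axis=0) / n_reads

# Convert read fractions to relative transcript abundances by length-normalizing.
rho = theta / eff_len
rho /= rho.sum()
print("relative transcript abundances:", np.round(rho, 3))
```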