
    A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments.

    Background: PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data and estimating the PCR duplication rate is important to assess the frequency of such reads. Existing computational methods do not distinguish PCR duplicates from "natural" read duplicates that represent independent DNA fragments and therefore, over-estimate the PCR duplication rate for DNA-seq and RNA-seq experiments. Results: In this paper, we present a computational method to estimate the average PCR duplication rate of high-throughput sequence datasets that accounts for natural read duplicates by leveraging heterozygous variants in an individual genome. Analysis of simulated data and exome sequence data from the 1000 Genomes project demonstrated that our method can accurately estimate the PCR duplication rate on paired-end as well as single-end read datasets which contain a high proportion of natural read duplicates. Further, analysis of exome datasets prepared using the Nextera library preparation method indicated that 45-50% of read duplicates correspond to natural read duplicates likely due to fragmentation bias. Finally, analysis of RNA-seq datasets from individuals in the 1000 Genomes project demonstrated that 70-95% of read duplicates observed in such datasets correspond to natural duplicates sampled from genes with high expression and identified outlier samples with a 2-fold greater PCR duplication rate than other samples. Conclusions: The method described here is a useful tool for estimating the PCR duplication rate of high-throughput sequence datasets and for assessing the fraction of read duplicates that correspond to natural read duplicates. An implementation of the method is available at https://github.com/vibansal/PCRduplicates

    Evaluating the Stormwater Treatment Performance of AbTech Industries Smart Sponge® Plus, Landry, N

    The ability of AbTech’s Smart Sponge® Plus to remove fecal-borne bacteria from stormwater was evaluated in a storm drainage system located in Seabrook, New Hampshire. The Smart Sponge® Plus was installed into a water quality inlet and samples were collected from influent (pre-treatment) and effluent (post-treatment) for analysis of bacterial concentrations and loadings during 15 storm events from September 3, 2003 to May 24, 2004, excluding winter months. The 15 storms included events with a range of rainfall intensities and amounts, as well as accompanying runoff volumes. Flow-weighted composite samples were analyzed for fecal coliforms, Escherichia coli and enterococci to determine if concentrations were lowered as stormwater passed through the Smart Sponge® Plus material. In most cases, bacterial concentrations were reduced within the treatment system, but to varying degrees. The efficiency ratio based on reduction in event mean concentration for each bacterial indicator in the flow was calculated for each storm event. The values ranged most widely for fecal coliforms, whereas the range of ratios was narrower and the values were more consistent for enterococci. The overall load reductions for the bacterial indicators were 50.3% for fecal coliforms, 51.3% for Escherichia coli and 43.2% for enterococci. Relatively consistent pH values were observed in influent and effluent samples. The overall range of pH values was large, ranging from 5.21 units in influent from storm event #11 to 7.64 units in influent from storm event #1. Conductivity values were greater in the effluent in 14 of the 15 storm events, especially in storm events #12 and #13 when effluent conductivities were >50% higher than influent values. Quality assurance/quality control procedures supported the methods and results of the study. Overall, the observed reductions in bacterial concentrations in post-treatment stormwater would still result in discharge of elevated bacterial levels that would continue to limit uses in receiving waters.
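
    The two performance metrics named in this abstract can be sketched as follows. This is a minimal illustration that assumes the commonly used definitions (efficiency ratio as the fractional reduction in event mean concentration for a single storm, and overall load reduction as the fractional reduction in summed loads across storms); the abstract does not give the exact formulas, and all function names and numbers below are hypothetical, not data from the study.

```python
def efficiency_ratio(emc_in, emc_out):
    """Fractional reduction in event mean concentration (EMC) for one storm.

    Assumed definition: (EMC_influent - EMC_effluent) / EMC_influent.
    """
    if emc_in <= 0:
        raise ValueError("influent EMC must be positive")
    return (emc_in - emc_out) / emc_in


def overall_load_reduction(loads_in, loads_out):
    """Fractional reduction in total load summed over all storm events."""
    total_in = sum(loads_in)
    total_out = sum(loads_out)
    if total_in <= 0:
        raise ValueError("total influent load must be positive")
    return (total_in - total_out) / total_in


if __name__ == "__main__":
    # Hypothetical EMCs (counts per 100 mL) for a single storm.
    print(f"efficiency ratio: {efficiency_ratio(2000.0, 900.0):.2f}")

    # Hypothetical per-storm loads (arbitrary units) for three storms.
    loads_in = [10.0, 40.0, 50.0]
    loads_out = [8.0, 22.0, 20.0]
    print(f"overall load reduction: {overall_load_reduction(loads_in, loads_out):.2f}")
```

    A negative efficiency ratio under this definition would indicate that the effluent concentration exceeded the influent concentration for that storm.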

    Statistical power analysis for single-cell RNA-sequencing

    RNA-sequencing (RNA-seq) is an established method to quantify levels of gene expression genome-wide. The recent development of single cell RNA sequencing (scRNA-seq) protocols opens up the possibility to systematically characterize cell transcriptomes and their underlying developmental and regulatory mechanisms. Since the first publication on single-cell transcriptomics a decade ago, hundreds of scRNA-seq datasets from a variety of sources have been released, profiling gene expression of sorted cells, tumors, whole dissociated organs and even complete organisms. Currently, it is also the main tool to systematically characterize human cells within the Human Cell Atlas Project. Given its wide applicability and increasing popularity, many experimental protocols and computational analysis approaches exist for scRNA-seq. However, the technology remains experimentally and computationally challenging. Firstly, single cells contain only minute mRNA amounts that need to be reliably captured and amplified for accurate quantification by sequencing. Importantly, the Polymerase Chain Reaction (PCR) is commonly used for amplification, which might introduce biases and increase technical variation. Secondly, once the sequencing results are obtained, finding the best computational processing pipeline can be a struggle. A number of comparison studies have already been conducted (especially for bulk RNA-seq), but they usually deal with only one aspect of the workflow. Furthermore, the extent to which the conclusions and recommendations of these studies can be transferred to scRNA-seq is unknown. Related to the processing of RNA-sequencing, we investigate the effect of PCR amplification on differential expression analysis. We find that computational removal of duplicates has either a negligible or a negative impact on specificity and sensitivity of differential expression analysis, and we therefore recommend not to remove read duplicates by mapping position. In contrast, if duplicates are identified using unique molecular identifiers (UMIs) tagging RNA molecules, both specificity and sensitivity improve. The first integral step of any scRNA-seq experiment is the preparation of sequencing libraries from the cells. We conducted an independent benchmarking study of popular library preparation protocols in terms of detection sensitivity, accuracy and precision using the same mouse embryonic stem cells and exogenous mRNA spike-ins. We recapitulate our previous finding that technical variance is markedly decreased when using UMIs to remove duplicates. In order to assign a monetary value to the detected amounts of technical variance, we developed a simulation framework that enabled us to compare the power to detect differentially expressed genes across the scRNA-seq library preparation protocols. Our experiences during this comparison study led to the development of the sequencing data processing in zUMIs and the simulation framework and power analysis in powsimR. zUMIs is a pipeline for processing scRNA-seq data with flexible choices regarding UMI and cell barcode design. In addition, we showed with powsimR simulations that the inclusion of intronic reads for gene expression quantification increases the power to detect DE genes and added it as a unique feature to zUMIs. In powsimR, we present our simulation framework extending choices concerning data analysis, enabling researchers to assess experimental design and analysis plans of RNA-seq in terms of statistical power. Lastly, we conducted a systematic evaluation of scRNA-seq experimental and analytical pipelines. We found that choices made concerning normalisation and library preparation protocols have the biggest impact on the validity of scRNA-seq DE analysis. Choosing a good scRNA-seq pipeline can have the same impact on detecting a biological signal as quadrupling the cell sample size. Taken together, we have established and applied a simulation framework that allowed us to benchmark experimental and computational scRNA-seq protocols and hence inform the experimental design and method choices of this important technology.
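
    The contrast drawn in this abstract between removing duplicates by mapping position and by UMI can be illustrated with a toy sketch. It is a simplified illustration only, not the zUMIs implementation: the read records, field names and function names are hypothetical, and real pipelines additionally key on the cell barcode and may correct sequencing errors in UMIs.

```python
from collections import Counter


def count_by_position(reads):
    """Position-based deduplication: keep one read per (gene, mapping position)."""
    unique = {(r["gene"], r["pos"]) for r in reads}
    return Counter(gene for gene, _ in unique)


def count_by_umi(reads):
    """UMI-based deduplication: keep one read per (gene, UMI), i.e. per tagged molecule."""
    unique = {(r["gene"], r["umi"]) for r in reads}
    return Counter(gene for gene, _ in unique)


# Hypothetical reads for one gene in one cell.
reads = [
    {"gene": "A", "pos": 100, "umi": "AAC"},  # original molecule 1
    {"gene": "A", "pos": 100, "umi": "AAC"},  # PCR duplicate of molecule 1
    {"gene": "A", "pos": 100, "umi": "GTT"},  # independent molecule, same position
    {"gene": "A", "pos": 250, "umi": "CCA"},  # independent molecule, other position
]

if __name__ == "__main__":
    print("position-based count:", count_by_position(reads)["A"])
    print("UMI-based count:", count_by_umi(reads)["A"])
```

    In this toy input, position-based removal also discards the independent molecule that happens to share a mapping position, whereas UMI-based removal discards only the PCR copy.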

    Improving & applying single-cell RNA sequencing

    The cell is the fundamental building block of life. With the advent of single-cell RNA sequencing (scRNA-seq), we can for the first time assess the transcriptome of many individual cells. This has profound implications for biological and medical questions and is especially important to characterize heterogeneous cell populations and rare cells. However, the technology is technically and computationally challenging as complementary DNA (cDNA) needs to be generated and amplified from minute amounts of mRNA and sequenceable libraries need to be efficiently generated from many cells. This requires establishing different protocols, identifying important caveats, benchmarking various methods and improving them where possible. To this end, we analysed amplification bias and its effect on detecting differentially expressed genes in several bulk and single-cell RNA sequencing methods. We found that amplification bias cannot be corrected computationally, but that correcting for it improves the power of scRNA-seq considerably, while the effect is negligible for bulk RNA-seq. In the second study, we compared six prominent scRNA-seq protocols, because more and more single-cell RNA-sequencing methods are becoming available but an independent benchmark of methods is lacking. By using the same mouse embryonic stem cells (mESCs) and exogenous mRNA spike-ins as common reference, we compared six important scRNA-seq protocols in their sensitivity, accuracy and precision to quantify mRNA levels. In agreement with our previous study, we find that the precision, i.e. the technical variance, of scRNA-seq methods is driven by amplification bias and drastically reduced when using unique molecular identifiers to remove amplification duplicates. To assess the combined effects of sensitivity and precision and to compare the cost-efficiency of methods, we compared the power to detect differentially expressed genes among the tested scRNA-seq protocols using a novel simulation framework. We find that some methods are prohibitively inefficient and others show trade-offs depending on the number of cells per sample that need to be analysed. Our study also provides a framework for benchmarking further improvements of scRNA-seq protocols, and we published an improved version of our simulation framework powsimR. It uniquely recapitulates the specific characteristics of scRNA-seq data to enable streamlined simulations for benchmarking both wet lab protocols and analysis algorithms. Furthermore, we compiled our experience in processing different types of scRNA-seq data, in particular with barcoded libraries and UMIs, and developed zUMIs, a fast and flexible scRNA-seq data processing software overcoming shortcomings of existing pipelines. In addition, we used the in-depth characterization of scRNA-seq technology to optimize an already powerful scRNA-seq protocol even further. According to data generated from exogenous mRNA spike-ins, this new mcSCRB-seq protocol is currently the most sensitive scRNA-seq protocol available. Single-cell resolution makes scRNA-seq uniquely suited for the understanding of complex diseases, such as leukemia. In acute lymphoblastic leukemia (ALL), rare chemotherapy-resistant cells persist as minimal residual disease (MRD) and may cause relapse. However, biological mechanisms of these relapse-inducing cells remain largely unclear because characterisation of this rare population has been lacking so far. In order to contribute to the understanding of MRD, we leveraged scRNA-seq to study minimal residual disease cells from ALL. We obtained and characterised rare, chemotherapy-resistant cell populations from primary patients and patient cells grown in xenograft mouse models. We found that MRD cells are dormant and feature high expression of adhesion molecules in order to persist in the hematopoietic niche. Furthermore, we could show that there is plasticity between resting, resistant MRD cells and cycling, therapy-sensitive cells, indicating that patients could benefit from strategies that release MRD cells from the niche. Importantly, we show that our data derived from xenograft models closely resemble rare primary patient samples. In conclusion, my work of the last years contributes towards the development of experimental and computational single-cell RNA sequencing methods enabling their widespread application to biomedical problems such as leukemia.

    Extracting News Events from Microblogs

    The Twitter stream has become a large source of information for many people, but the magnitude of tweets and the noisy nature of its content have long made harvesting knowledge from Twitter a challenging task for researchers. Aiming at overcoming some of the main challenges of extracting the hidden information from tweet streams, this work proposes a new approach for real-time detection of news events from the Twitter stream. We divide our approach into three steps. The first step is to use a neural network or deep learning to detect news-relevant tweets from the stream. The second step is to apply a novel streaming data clustering algorithm to the detected news tweets to form news events. The third and final step is to rank the detected events based on the size of the event clusters and growth speed of the tweet frequencies. We evaluate the proposed system on a large, publicly available corpus of annotated news events from Twitter. As part of the evaluation, we compare our approach with a related state-of-the-art solution. Overall, our experiments and user-based evaluation show that our approach to detecting current (real) news events delivers state-of-the-art performance.