17 research outputs found

    Avoidance of Self during CRISPR Immunization

    This is the author accepted manuscript. The final version is available from Elsevier via the DOI in this record. The battle between microbes and their viruses is ancient and ongoing. Clustered regularly interspaced short palindromic repeat (CRISPR) immunity, the first and, to date, only form of adaptive immunity found in prokaryotes, represents a flexible mechanism to recall past infections while also adapting to a changing pathogenic environment. Critical to the role of CRISPR as an adaptive immune mechanism is its capacity for self versus non-self recognition when acquiring novel immune memories. Yet, CRISPR systems vary widely in both how and to what degree they can distinguish foreign from self-derived genetic material. We document known and hypothesized mechanisms that bias the acquisition of immune memory towards non-self targets. We demonstrate that diversity is the rule, with many widespread but no universal mechanisms for self versus non-self recognition. Funding: NSF; Natural Environment Research Council (NERC).

    SAMQA: error classification and validation of high-throughput sequenced read data

    BACKGROUND: Advances in high-throughput sequencing technologies and growth in data sizes have highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data is of a minimum necessary standard for use in downstream analysis. In this paper we present the SAMQA tool to rapidly and robustly identify errors in population-scale sequence data. RESULTS: SAMQA has been used on samples from three separate sets of cancer genome data from The Cancer Genome Atlas (TCGA) project. Using technical standards provided by the SAM specification and biological standards defined by researchers, we have classified errors in these sequence data sets relative to individual reads within a sample. Owing to an observed linearithmic speedup through the use of a high-performance computing (HPC) framework for the majority of tasks, poor-quality data was identified prior to secondary analysis in significantly less time on the HPC framework than when the same data was processed using alternative parallelization strategies on a single server. CONCLUSIONS: The SAMQA toolset validates a minimum set of data quality standards across whole-genome and exome sequences. It is tuned to run on a high-performance computational framework, enabling QA across hundreds of gigabytes of samples regardless of coverage or sample type.

    Calculation of Tajima’s D and other neutrality test statistics from low depth next-generation sequencing data

    BACKGROUND: A number of different statistics are used for detecting natural selection using DNA sequencing data, including statistics that are summaries of the frequency spectrum, such as Tajima’s D. These statistics are now often applied in the analysis of Next Generation Sequencing (NGS) data. However, estimates of frequency spectra from NGS data are strongly affected by low sequencing coverage; the inherent technology-dependent variation in sequencing depth causes systematic differences in the value of the statistic among genomic regions. RESULTS: We have developed an approach that accommodates the uncertainty of the data when calculating site-frequency-based neutrality test statistics. A salient feature of this approach is that it implicitly solves the problems of varying sequencing depth and missing data, and it avoids the need to infer variable sites for the analysis, thereby avoiding ascertainment problems introduced by a SNP discovery process. CONCLUSION: Using an empirical Bayes approach for fast computations, we show that this method produces results for low-coverage NGS data comparable to those achieved when the genotypes are known without uncertainty. We also validate the method in an analysis of data from the 1000 Genomes Project. The method is implemented in a fast framework which enables researchers to perform these neutrality tests on a genome-wide scale.

    Estimation of allele frequency and association mapping using next-generation sequencing data

    BACKGROUND: Estimation of allele frequency is of fundamental importance in population genetic analyses and in association mapping. In most studies using next-generation sequencing, a cost-effective approach is to use medium or low-coverage data (e.g., <15X). However, SNP calling and allele frequency estimation in such studies are associated with substantial statistical uncertainty because of varying coverage and high error rates. RESULTS: We evaluate a new maximum likelihood method for estimating allele frequencies in low and medium coverage next-generation sequencing data. The method is based on integrating over uncertainty in the data for each individual rather than first calling genotypes. This method can be applied to directly test for associations in case/control studies. We use simulations to compare the likelihood method to methods based on genotype calling, and show that the likelihood method outperforms the genotype calling methods in terms of: (1) accuracy of allele frequency estimation, (2) accuracy of the estimation of the distribution of allele frequencies across neutrally evolving sites, and (3) statistical power in association mapping studies. Using real re-sequencing data from 200 individuals obtained from an exon-capture experiment, we show that the patterns observed in the simulations are also found in real data. CONCLUSIONS: Overall, our results suggest that association mapping and estimation of allele frequencies should not be based on genotype calling in low to medium coverage data. Furthermore, if genotype calling methods are used, it is usually better not to filter genotypes based on the call confidence score.

    Interpreting the unculturable majority

    New methods are necessary for the analysis and interpretation of massive amounts of metagenomic data.
