    Quantifying single nucleotide variant detection sensitivity in exome sequencing

    BACKGROUND: The targeted capture and sequencing of genomic regions has rapidly demonstrated its utility in genetic studies. Inherent in this technology is considerable heterogeneity of target coverage and this is expected to systematically impact our sensitivity to detect genuine polymorphisms. To fully interpret the polymorphisms identified in a genetic study it is often essential to both detect polymorphisms and to understand where and with what probability real polymorphisms may have been missed. RESULTS: Using down-sampling of 30 deeply sequenced exomes and a set of gold-standard single nucleotide variant (SNV) genotype calls for each sample, we developed an empirical model relating the read depth at a polymorphic site to the probability of calling the correct genotype at that site. We find that measured sensitivity in SNV detection is substantially worse than that predicted from the naive expectation of sampling from a binomial. This calibrated model allows us to produce single nucleotide resolution SNV sensitivity estimates which can be merged to give summary sensitivity measures for any arbitrary partition of the target sequences (nucleotide, exon, gene, pathway, exome). These metrics are directly comparable between platforms and can be combined between samples to give “power estimates” for an entire study. We estimate a local read depth of 13X is required to detect the alleles and genotype of a heterozygous SNV 95% of the time, but only 3X for a homozygous SNV. At a mean on-target read depth of 20X, commonly used for rare disease exome sequencing studies, we predict 5–15% of heterozygous and 1–4% of homozygous SNVs in the targeted regions will be missed. CONCLUSIONS: Non-reference alleles in the heterozygote state have a high chance of being missed when commonly applied read coverage thresholds are used despite the widely held assumption that there is good polymorphism detection at these coverage levels. 
    Such alleles are likely to be of functional importance in population-based studies of rare diseases, somatic mutations in cancer and explaining the “missing heritability” of quantitative traits.
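
    The "naive expectation of sampling from a binomial" mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each read at a heterozygous site samples either allele with probability 0.5, and that a caller needs at least `min_reads` reads supporting each allele; the function names and the threshold of 2 are hypothetical and not taken from the paper. This is only the naive baseline, not the calibrated empirical model, which the authors report gives substantially lower sensitivity.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def naive_het_sensitivity(depth, min_reads=2, p_alt=0.5):
    """Naive probability of seeing both alleles of a heterozygous SNV.

    Assumption (not from the paper): a genotype call needs at least
    `min_reads` reads for the alternate allele AND at least `min_reads`
    reads for the reference allele at the site.
    """
    # The alternate-allele read count X must satisfy
    # min_reads <= X <= depth - min_reads.
    return sum(
        binom_pmf(k, depth, p_alt)
        for k in range(min_reads, depth - min_reads + 1)
    )


if __name__ == "__main__":
    for depth in (3, 8, 13, 20):
        print(depth, round(naive_het_sensitivity(depth), 4))
```

    Comparing these idealised values with the empirical 13X requirement quoted in the abstract shows how optimistic the binomial baseline is.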

    Methods to study splicing from high-throughput RNA Sequencing data

    The development of novel high-throughput sequencing (HTS) methods for RNA (RNA-Seq) has provided a very powerful means to study splicing under multiple conditions at unprecedented depth. However, the complexity of the information to be analyzed has turned this into a challenging task. In the last few years, a plethora of tools have been developed, allowing researchers to process RNA-Seq data to study the expression of isoforms and splicing events, and their relative changes under different conditions. We provide an overview of the methods available to study splicing from short RNA-Seq data. We group the methods according to the different questions they address: 1) Assignment of the sequencing reads to their likely gene of origin. This is addressed by methods that map reads to the genome and/or to the available gene annotations. 2) Recovering the sequence of splicing events and isoforms. This is addressed by transcript reconstruction and de novo assembly methods. 3) Quantification of events and isoforms. Either after reconstructing transcripts or using an annotation, many methods estimate the expression level or the relative usage of isoforms and/or events. 4) Providing an isoform or event view of differential splicing or expression. These include methods that compare relative event/isoform abundance or isoform expression across two or more conditions. 5) Visualizing splicing regulation. Various tools facilitate the visualization of the RNA-Seq data in the context of alternative splicing. In this review, we do not describe the specific mathematical models behind each method. Our aim is rather to provide an overview that could serve as an entry point for users who need to decide on a suitable tool for a specific analysis. We also attempt to propose a classification of the tools according to the operations they perform, to facilitate the comparison and choice of methods. Comment: 31 pages, 1 figure, 9 tables. Small corrections adde

    Controlling false discovery rates in RNA-Sequencing data

    High throughput sequencing technologies are supplanting microarrays as the preferred technology for detecting and quantifying differential gene expression. The raw data produced by a technique known as RNA-sequencing (RNA-seq) consists of integer counts of reverse transcribed cDNA fragment reads mapped onto each gene or transcript isoform in a reference genome or transcriptome. Many software packages exist for analysing RNA-seq datasets consisting of tables of mapped read counts from biological or technical replicate experiments under two or more conditions, the purpose being to detect which genes are differentially expressed between conditions. Two state-of-the-art packages, DESeq and edgeR, are based on a negative binomial model of read counts. Our tests with simulated data constructed according to the statistical model assumed by these packages reveal that both packages generate a non-uniform p-value spectrum from null-hypothesis data. We demonstrate how specific knowledge of the non-uniformity can be exploited to develop a graphical technique based on the Storey-Tibshirani method for improving estimates of p-values and false discovery rates in databases where differential expression is present. We have developed an add-on package for DESeq and edgeR, called Polyfit, which implements this method, and evaluate its performance against DESeq, edgeR and another recently introduced package, PoissonSeq, using simulated data
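
    As background to the Storey-Tibshirani method named in the abstract above, the following is a minimal sketch of its core idea: estimate the fraction of true nulls (pi0) from the flat right-hand part of the p-value histogram, then scale Benjamini-Hochberg style q-values by it. It is not the Polyfit procedure. The fixed cutoff `lam=0.5` is a simplifying assumption (the original method smooths over a range of cutoffs), and the function names are hypothetical. The estimator relies on null p-values being uniform, which is exactly the property the abstract reports as violated by DESeq and edgeR.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Estimate the proportion of true null hypotheses.

    Under the null, p-values are assumed uniform on [0, 1], so the
    density of p-values above `lam` estimates pi0. `lam` is a fixed
    cutoff chosen here for simplicity (an assumption of this sketch).
    """
    m = len(pvalues)
    if m == 0:
        raise ValueError("need at least one p-value")
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def q_values(pvalues, lam=0.5):
    """Return q-values in the same order as the input p-values."""
    m = len(pvalues)
    pi0 = estimate_pi0(pvalues, lam)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        q[i] = running_min
    return q


if __name__ == "__main__":
    example = [0.01, 0.02, 0.03, 0.04, 0.6, 0.7, 0.8, 0.9]
    print(estimate_pi0(example), q_values(example))
```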

    Models for transcript quantification from RNA-Seq

    RNA-Seq is rapidly becoming the standard technology for transcriptome analysis. Fundamental to many of the applications of RNA-Seq is the quantification problem, which is the accurate measurement of relative transcript abundances from the sequenced reads. We focus on this problem, and review many recently published models that are used to estimate the relative abundances. In addition to describing the models and the different approaches to inference, we also explain how methods are related to each other. A key result is that we show how inference with many of the models results in identical estimates of relative abundances, even though model formulations can be very different. In fact, we are able to show how a single general model captures many of the elements of previously published methods. We also review the applications of RNA-Seq models to differential analysis, and explain why accurate relative transcript abundance estimates are crucial for downstream analyses
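
    The quantification problem described in the abstract above is often approached with expectation-maximisation (EM) over ambiguous read assignments. The sketch below is one such illustration under strong simplifying assumptions and is not the general model of the review: every read is compatible with a known set of transcripts, transcript lengths and sequencing biases are ignored, and the estimated quantity is the fraction of reads originating from each transcript. The function name and the input layout are hypothetical.

```python
def em_read_fractions(compat, n_transcripts, n_iter=200):
    """Estimate per-transcript read fractions with a simple EM loop.

    compat: list of reads, each given as a list of transcript indices
            the read is compatible with (assumed non-empty).
    Lengths are ignored, so the result is a read fraction, not a
    length-normalised transcript abundance.
    """
    if not compat:
        raise ValueError("need at least one read")
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        counts = [0.0] * n_transcripts
        # E-step: split each read across its compatible transcripts
        # in proportion to the current estimates.
        for transcripts in compat:
            total = sum(theta[t] for t in transcripts)
            for t in transcripts:
                counts[t] += theta[t] / total
        # M-step: renormalise the expected counts.
        n = sum(counts)
        theta = [c / n for c in counts]
    return theta


if __name__ == "__main__":
    # 3 reads unique to transcript 0, 1 unique to transcript 1,
    # 4 reads compatible with both.
    reads = [[0]] * 3 + [[1]] * 1 + [[0, 1]] * 4
    print(em_read_fractions(reads, 2))
```

    Turning these read fractions into relative transcript abundances would additionally require dividing by an effective transcript length and renormalising, which this sketch leaves out.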

    Genome-wide mapping reveals single-origin chromosome replication in Leishmania, a eukaryotic microbe

    Background: DNA replication initiates on defined genome sites, termed origins. Origin usage appears to follow common rules in the eukaryotic organisms examined to date: all chromosomes are replicated from multiple origins, which display variations in firing efficiency and are selected from a larger pool of potential origins. To ask if these features of DNA replication are true of all eukaryotes, we describe genome-wide origin mapping in the parasite Leishmania. Results: Origin mapping in Leishmania suggests a striking divergence in origin usage relative to characterized eukaryotes, since each chromosome appears to be replicated from a single origin. By comparing two species of Leishmania, we find evidence that such origin singularity is maintained in the face of chromosome fusion or fission events during evolution. Mapping Leishmania origins suggests that all origins fire with equal efficiency, and that the genomic sites occupied by origins differ from related non-origin sites. Finally, we provide evidence that origin location in Leishmania displays striking conservation with Trypanosoma brucei, despite the latter parasite replicating its chromosomes from multiple, variable strength origins. Conclusions: The demonstration of chromosome replication from a single origin in Leishmania, a microbial eukaryote, has implications for the evolution of origin multiplicity and associated controls, and may explain the pervasive aneuploidy that characterizes Leishmania chromosome architecture

    Hybrid gene misregulation in multiple developing tissues within a recent adaptive radiation of Cyprinodon pupfishes.

    Genetic incompatibilities constitute the final stages of reproductive isolation and speciation, but little is known about incompatibilities that occur within recent adaptive radiations among closely related diverging populations. Crossing divergent species to form hybrids can break up coadapted variation, resulting in genetic incompatibilities within developmental networks shaping divergent adaptive traits. We crossed two closely related sympatric Cyprinodon pupfish species (a dietary generalist and a specialized molluscivore) and measured expression levels in their F1 hybrids to identify regulatory variation underlying the novel craniofacial morphology found in this recent microendemic adaptive radiation. We extracted mRNA from eight-day-old whole-larvae tissue and from craniofacial tissues dissected from 17- to 20-day-old larvae to compare gene expression between a total of seven F1 hybrids and 24 individuals from parental species populations. We found 3.9% of genes differentially expressed between generalists and molluscivores in whole-larvae tissues and 0.6% of genes differentially expressed in craniofacial tissue. We found that 2.1% of genes were misregulated in whole-larvae hybrids whereas 19.1% of genes were misregulated in hybrid craniofacial tissues, after correcting for sequencing biases. We also measured allele specific expression across 15,429 heterozygous sites to identify putative compensatory regulatory mechanisms underlying differential expression between generalists and molluscivores. Together, our results highlight the importance of considering misregulation as an early indicator of genetic incompatibilities in the context of rapidly diverging adaptive radiations and suggest that compensatory regulatory divergence drives hybrid gene misregulation in developing tissues that give rise to novel craniofacial traits

    A new approach to bias correction in RNA-Seq

    Motivation: Quantification of sequence abundance in RNA-Seq experiments is often conflated by protocol-specific sequence bias. The exact sources of the bias are unknown, but may be influenced by polymerase chain reaction amplification, or differing primer affinities and mixtures, for example. The result is decreased accuracy in many applications, such as de novo gene annotation and transcript quantification