2,174 research outputs found

    Model-based quality assessment and base-calling for second-generation sequencing data

    Second-generation sequencing (sec-gen) technology can sequence millions of short fragments of DNA in parallel, and is capable of assembling complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to fully sequence the genomes of approximately 1,200 people. The prospect of comparative analysis at the sequence level of a large number of samples across multiple populations may be achieved within the next five years. These data present unprecedented challenges in statistical analysis. For instance, analysis operates on millions of short nucleotide sequences, or reads (strings of A, C, G, or T characters, between 30 and 100 characters long), which are the result of complex processing of noisy continuous fluorescence intensity measurements known as base-calling. The complexity of the base-calling discretization process results in reads of widely varying quality within and across sequence samples. This variation in processing quality results in infrequent but systematic errors that we have found to mislead downstream analysis of the discretized sequence read data. For instance, a central goal of the 1000 Genomes Project is to quantify across-sample variation at the single-nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Sec-gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequence reads is of utmost importance. In this paper we present a simple model to capture uncertainty arising in the base-calling procedure of the Illumina/Solexa GA platform. Model parameters have a straightforward interpretation in terms of the chemistry of base-calling, allowing for informative and easily interpretable metrics that capture the variability in sequencing quality. Our model provides these informative estimates readily usable in quality assessment tools while significantly improving base-calling performance.
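    The abstract describes calling bases from noisy four-channel fluorescence intensities and attaching interpretable quality metrics. The sketch below is a rough illustration of that idea, not the paper's actual model: it assumes one intensity per channel per cycle, calls the brightest channel, and converts a softmax-style posterior into a Phred-like quality score. The function name and the normalization are illustrative assumptions.

import numpy as np

def call_bases(intensities):
    """Toy base-caller: pick the brightest channel per cycle and attach a
    Phred-style quality derived from a softmax posterior over channels.

    intensities: array of shape (n_cycles, 4), columns ordered A, C, G, T.
    Returns (read, qualities).
    """
    bases = np.array(list("ACGT"))
    # Turn raw channel intensities into a crude per-cycle posterior.
    shifted = intensities - intensities.max(axis=1, keepdims=True)
    post = np.exp(shifted)
    post /= post.sum(axis=1, keepdims=True)

    idx = post.argmax(axis=1)
    p_err = 1.0 - post[np.arange(len(idx)), idx]
    # Phred score: Q = -10 * log10(P(error)), capped to avoid infinities.
    qual = np.minimum(-10.0 * np.log10(np.maximum(p_err, 1e-6)), 60.0)
    return "".join(bases[idx]), qual.round().astype(int).tolist()

# Three cycles of (A, C, G, T) intensities; the ambiguous last cycle
# yields a low quality score.
read, qual = call_bases(np.array([[9.0, 0.5, 0.3, 0.2],
                                  [0.4, 0.6, 8.2, 0.3],
                                  [1.5, 1.4, 0.2, 0.1]]))
print(read, qual)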

    Probabilistic methods for quality improvement in high-throughput sequencing data

    Advances in high-throughput next-generation sequencing (NGS) technologies have enabled the determination of millions of nucleotide sequences in massive parallelism at affordable costs. Many studies have shown increased error rates, relative to Sanger sequencing, in data produced by mainstream next-generation sequencing platforms, and have demonstrated the negative impact of sequencing errors on a wide range of NGS applications. Thus, it is critically important for primary analysis of sequencing data to produce accurate, high-quality nucleotides for downstream bioinformatics pipelines. Two bioinformatics problems are dedicated to the direct removal of sequencing errors: base-calling and error correction. However, existing error correction methods are mostly algorithmic and heuristic. Few methods can address insertion and deletion errors, the dominant error type produced by many platforms. On the other hand, most base-callers do not model the underlying genome structure of the sequencing data, which is necessary for improving base-calling quality, especially in low-quality regions. The sequential application of a base-caller and an error corrector does not fully offset their shortcomings. In recognition of these issues, in this dissertation we propose a probabilistic framework that closely emulates the sequencing-by-synthesis (SBS) process adopted by many NGS platforms. The core idea is to model sequencing data (individual reads, or fluorescence intensities) as independent emissions from a hidden Markov model (HMM), with transition distributions that model local and double-stranded dependence in the genome and emission distributions that model the subtle error characteristics of the sequencers. Deriving from this backbone, we develop three novel methods for improving the data quality of high-throughput sequencing: 1) PREMIER, an accurate probabilistic error corrector for substitution errors in Illumina data; 2) PREMIER-bc, an integrated base-caller and error corrector that significantly improves base-calling quality; and 3) PREMIER-indel, an extended error correction method that addresses substitution, insertion, and deletion errors for SBS-based sequencers with good empirical performance. Our foray into probabilistic methods for base-calling and error correction provides immediate benefits to downstream analyses through increased sequencing data quality and, more importantly, a flexible and fully probabilistic basis for going beyond primary analysis.
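    To make the HMM framing concrete, here is a minimal sketch under strong simplifying assumptions: hidden states are the true bases, transitions follow a first-order genome model, and emissions allow substitution errors only. The parameter values and the Viterbi-based correction are illustrative; this is not PREMIER's actual algorithm.

import numpy as np

BASES = "ACGT"

def viterbi_correct(read, trans, error_rate=0.01):
    """Most likely true base sequence given an observed read, under a toy HMM:
    hidden states are true bases, transitions follow a first-order genome
    model `trans` (4x4, rows sum to 1), and emissions allow substitution
    errors with probability `error_rate`."""
    emit = np.full((4, 4), error_rate / 3.0)
    np.fill_diagonal(emit, 1.0 - error_rate)      # emit[true, observed]

    obs = [BASES.index(b) for b in read]
    n = len(obs)
    log_t, log_e = np.log(trans), np.log(emit)

    dp = np.full((n, 4), -np.inf)                 # best log-prob ending in each state
    back = np.zeros((n, 4), dtype=int)
    dp[0] = np.log(0.25) + log_e[:, obs[0]]
    for i in range(1, n):
        scores = dp[i - 1][:, None] + log_t       # indexed [prev state, current state]
        back[i] = scores.argmax(axis=0)
        dp[i] = scores.max(axis=0) + log_e[:, obs[i]]

    # Trace back the most likely hidden (corrected) sequence.
    path = [int(dp[-1].argmax())]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return "".join(BASES[s] for s in reversed(path))

# Genome model strongly favouring the repeat ACGTACGT...; the isolated
# mismatch in the read below is corrected toward that context.
trans = np.full((4, 4), 0.02)
for prev, nxt in zip("ACGT", "CGTA"):
    trans[BASES.index(prev), BASES.index(nxt)] = 0.94
print(viterbi_correct("ACGTATGTACGT", trans, error_rate=0.05))   # -> ACGTACGTACGT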

    Probabilistic analysis of the human transcriptome with side information

    Understanding the functional organization of genetic information is a major challenge in modern biology. Following the initial publication of the human genome sequence in 2001, advances in high-throughput measurement technologies and efficient sharing of research material through community databases have opened up new views into the study of living organisms and the structure of life. In this thesis, novel computational strategies have been developed to investigate a key functional layer of genetic information, the human transcriptome, which regulates the function of living cells through protein synthesis. The key contributions of the thesis are general exploratory tools for high-throughput data analysis that have provided new insights into cell-biological networks, cancer mechanisms, and other aspects of genome function. A central challenge in functional genomics is that high-dimensional genomic observations are associated with high levels of complex and largely unknown sources of variation. By combining statistical evidence across multiple measurement sources and the wealth of background information in genomic data repositories, it has been possible to resolve some of the uncertainties associated with individual observations and to identify functional mechanisms that could not be detected based on individual measurement sources. Statistical learning and probabilistic models provide a natural framework for such modeling tasks. Open-source implementations of the key methodological contributions have been released to facilitate further adoption of the developed methods by the research community. (Doctoral thesis: 103 pages, 11 figures.)

    Probabilistic base calling of Solexa sequencing data

    BACKGROUND: Solexa/Illumina short-read ultra-high-throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data pose new challenges; currently, a fair proportion of the tags are routinely discarded due to an inability to match them to a reference sequence, thereby reducing the effective throughput of the technology. RESULTS: We propose a novel base-calling algorithm using model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads. CONCLUSION: We show that the method improves genome coverage and the number of usable tags by an average of 15% compared with Solexa's data processing pipeline. An R package is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots.
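    A minimal sketch of the ambiguity-coding idea described above: keep the smallest set of bases whose posterior probability mass reaches a threshold and encode that set as an IUPAC symbol. The threshold, the posterior inputs, and the function name are illustrative assumptions, not the published algorithm or the R package's interface.

# Map a set of plausible bases to its IUPAC ambiguity symbol.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("GC"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def iupac_call(posteriors, keep=0.9):
    """Call each position with the smallest set of bases whose posterior
    probabilities sum to at least `keep`, encoded as an IUPAC symbol.

    posteriors: list of dicts mapping 'A', 'C', 'G', 'T' to probabilities.
    """
    out = []
    for p in posteriors:
        ranked = sorted(p, key=p.get, reverse=True)
        chosen, mass = [], 0.0
        for base in ranked:
            chosen.append(base)
            mass += p[base]
            if mass >= keep:
                break
        out.append(IUPAC[frozenset(chosen)])
    return "".join(out)

# A confident A, then a position where A and G are both plausible -> 'R'.
print(iupac_call([{"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
                  {"A": 0.55, "C": 0.02, "G": 0.40, "T": 0.03}]))  # "AR"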

    Determining and utilizing the quasispecies of the hepatitis B virus in clinical applications

    Chronic hepatitis B caused by infection with the hepatitis B virus (HBV) affects about 240 million people worldwide and is one of the major causes of severe liver cirrhosis and liver cancer. Hepatitis B treatment options have improved dramatically in the last decade. Effective direct-acting antiviral drugs, so-called nucleos(t)ide analogs, and one effective immunomodulatory drug (pegylated interferon alpha-2a) are available presently. Current challenges for treating HBV involve the careful selection of patients who require therapy and the thoughtful choice of the treatment option tailored to each patient individually. Personalized medicine aims to optimize treatment decisions based on the analysis of host factors and virus characteristics. The population of viruses within a host is called the viral quasispecies. This thesis provides statistical methods to infer relevant information about the viral quasispecies of HBV to support treatment decisions. We introduce a new genotyping methodology to identify dual infections, which can help to quantify the risk of interferon therapy failure. We present a method to infer short-range linkage information from Sanger sequencing chromatograms, a method to support treatment adjustment after the development of resistance to nucleos(t)ide analogs. Additionally, we provide the first full-genome analysis of the G-to-A hypermutation patterns of the HBV genome. Hypermutated viral genomes form a subpopulation of the quasispecies caused by proteins of the human innate immune system editing the genome of exogenous viral agents. We show that hypermutation is associated with the natural progression of hepatitis B, but does not correlate with treatment response to interferon.
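    As a rough illustration of how G-to-A enrichment might be quantified (the dissertation's analysis is full-genome and context-aware, which this toy version ignores), the sketch below tallies G-to-A changes at reference G positions against mutations at all other positions and applies a one-sided Fisher's exact test. The function name, table layout, and example sequences are illustrative assumptions.

from scipy.stats import fisher_exact

def hypermutation_test(reference, query):
    """One-sided Fisher's exact test for G-to-A hypermutation in an aligned
    query: are reference G positions read as A more often than non-G
    positions are mutated at all? Toy version without dinucleotide context."""
    g_to_a = g_other = non_g_mut = non_g_conserved = 0
    for ref, obs in zip(reference, query):
        if ref == "G":
            if obs == "A":
                g_to_a += 1
            else:
                g_other += 1          # G conserved, or mutated to C/T
        elif obs == ref:
            non_g_conserved += 1
        else:
            non_g_mut += 1
    table = [[g_to_a, g_other], [non_g_mut, non_g_conserved]]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    return odds_ratio, p_value

# Toy usage on a short aligned pair; returns (odds ratio, one-sided p-value).
print(hypermutation_test("GGAGGTGGCGGA", "AGAGACGACGAA"))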

    Statistical methods for high-throughput genomic data


    Low-frequency variant detection in viral populations using massively parallel sequencing data
