824 research outputs found

    Optimal Assembly for High Throughput Shotgun Sequencing

    We present a framework for the design of optimal assembly algorithms for shotgun sequencing under the criterion of complete reconstruction. We derive a lower bound on the read length and the coverage depth required for reconstruction in terms of the repeat statistics of the genome. Building on earlier works, we design a de Bruijn graph based assembly algorithm which can achieve very close to the lower bound for repeat statistics of a wide range of sequenced genomes, including the GAGE datasets. The results are based on a set of necessary and sufficient conditions on the DNA sequence and the reads for reconstruction. The conditions can be viewed as the shotgun sequencing analogue of Ukkonen-Pevzner's necessary and sufficient conditions for Sequencing by Hybridization.
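
    The kind of repeat statistic such lower bounds depend on can be illustrated with a toy computation. This is a minimal sketch, not the paper's method: the actual conditions involve interleaved and triple repeats, whereas `longest_repeat_length` (a hypothetical helper) only finds the longest exact repeat, which already gives a simple floor on the read length needed.

```python
def longest_repeat_length(s):
    """Length of the longest substring occurring at least twice in s.

    Sort all suffixes, then take the maximum longest-common-prefix of
    adjacent suffixes; O(n^2 log n) worst case, fine for toy genomes.
    """
    suffixes = sorted(s[i:] for i in range(len(s)))
    best = 0
    for a, b in zip(suffixes, suffixes[1:]):
        # longest common prefix of two adjacent sorted suffixes
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        best = max(best, k)
    return best

# "ACGT" occurs at positions 0 and 4, so the longest repeat is 4 long;
# reads shorter than ~5 bases could not bridge this repeat unambiguously.
print(longest_repeat_length("ACGTACGTTT"))
```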

    Sequencing by Hybridization of Long Targets

    Sequencing by Hybridization (SBH) reconstructs an n-long target DNA sequence from its biochemically determined l-long subsequences. In the standard approach, the length of a uniformly random sequence that can be unambiguously reconstructed is limited, due to repetitive subsequences causing reconstruction degeneracies. We present a modified sequencing method that overcomes this limitation without the need for different types of biochemical assays and is robust to errors

    Functional characterization and annotation of trait-associated genomic regions by transcriptome analysis

    In this work, two novel implementations are presented that could assist in the design and data analysis of high-throughput genomic experiments. First, an efficient and flexible tiling probe selection pipeline utilizing a penalized uniqueness score has been implemented, which can be employed in the design of genome tiling tasks of various types and scales. Second, a novel hidden semi-Markov model (HSMM) implementation has been made available within the Bioconductor project, which provides a unified interface for segmenting genomic data in a wide range of research subjects.

    Estimating DNA coverage and abundance in metagenomes using a gamma approximation

    Motivation: Shotgun sequencing generates large numbers of short DNA reads from either an isolated organism or, in the case of metagenomics projects, from the aggregate genome of a microbial community. These reads are then assembled based on overlapping sequences into larger, contiguous sequences (contigs). The feasibility of assembly and the coverage achieved (reads per nucleotide or distinct sequence of nucleotides) depend on several factors: the number of reads sequenced, the read length and the relative abundances of their source genomes in the microbial community. A low coverage suggests that most of the genomic DNA in the sample has not been sequenced, but it is often difficult to estimate either the extent of the uncaptured diversity or the amount of additional sequencing that would be most efficacious. In this work, we regard a metagenome as a population of DNA fragments (bins), each of which may be covered by one or more reads. We employ a gamma distribution to model this bin population due to its flexibility and ease of use. When a gamma approximation can be found that adequately fits the data, we may estimate the number of bins that were not sequenced and that could potentially be revealed by additional sequencing. We evaluate the performance of this model using simulated metagenomes and demonstrate its applicability on three recent metagenomic datasets
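
    Under a gamma model, the per-bin read count is a gamma-mixed Poisson, i.e. negative binomial, so the zero class gives the probability that a bin was never sequenced. A minimal sketch of the resulting unseen-bin estimate, assuming shape and rate parameters have already been fitted to the data (function names are illustrative, not from the paper's software):

```python
def p_zero_reads(shape, rate):
    """P(a bin receives zero reads) when the bin's expected read count is
    Gamma(shape, rate)-distributed and reads arrive as a Poisson process:
    this is the zero class of the resulting negative binomial,
    (rate / (rate + 1))**shape."""
    return (rate / (rate + 1.0)) ** shape

def estimate_total_bins(observed_bins, shape, rate):
    """Scale the observed bin count up by the estimated probability
    that a bin was sequenced at least once."""
    return observed_bins / (1.0 - p_zero_reads(shape, rate))
```

    For example, with shape = rate = 1 half of all bins are expected to receive zero reads, so observing 500 distinct bins suggests roughly 1000 bins in total.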

    Sensitivity of Noninvasive Prenatal Detection of Fetal Aneuploidy from Maternal Plasma Using Shotgun Sequencing Is Limited Only by Counting Statistics

    We recently demonstrated noninvasive detection of fetal aneuploidy by shotgun sequencing cell-free DNA in maternal plasma using a next-generation high-throughput sequencer. However, GC bias introduced by the sequencer placed a practical limit on the sensitivity of aneuploidy detection. In this study, we describe a method to computationally remove GC bias in short read sequencing data by applying a weight to each sequenced read based on local genomic GC content. We show that sensitivity is limited only by counting statistics and that sensitivity can be increased to arbitrary precision in samples containing an arbitrarily small fraction of fetal DNA simply by sequencing more DNA molecules. High throughput shotgun sequencing of maternal plasma DNA should therefore enable noninvasive diagnosis of any type of fetal aneuploidy
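
    The weighting idea can be sketched as follows. This is a simplified illustration, not the authors' code: the bin granularity and the exact weight definition (mean reads per GC bin divided by the read's own bin count) are assumptions.

```python
from collections import Counter

def gc_weights(read_gc_fractions, n_bins=10):
    """Per-read weights that flatten GC bias.

    Each read is assigned to a GC-content bin; its weight is the mean
    reads-per-occupied-bin divided by its own bin's count, so reads from
    over-represented GC bins are down-weighted and under-represented
    bins are up-weighted. Total weight equals the number of reads.
    """
    bins = [min(int(gc * n_bins), n_bins - 1) for gc in read_gc_fractions]
    counts = Counter(bins)
    mean_count = len(bins) / len(counts)
    return [mean_count / counts[b] for b in bins]
```

    A useful sanity check of the design: the weights sum to the read count, so weighting reshapes the GC profile without changing the total amount of evidence used for counting.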

    Detection of copy number variants in sequencing data.

    In this work a program for the detection of CNVs in sequencing data based on depth of coverage (DOC) was implemented in C++ (copyDOC). The single steps in the pipeline, the acquisition of DOC signals in windows, the event calling and the merging, are implemented using generic programming techniques that enable the future integration of other algorithms into the pipeline. Furthermore, a testing environment was implemented, the copySim platform, which is very useful for testing and evaluating different algorithms. CopyDOC was successfully applied to synthetic and real data using constant-sized windows. Dynamic windows, which adapt to the local mappability of the sequence, are implemented in the pipeline but could not be tested in this work. They might be advantageous in datasets that contain uniquely mapped reads. However, CNVs have been shown to be overrepresented in segmental duplications (Nguyen et al. 2006; Cooper et al. 2007), and with a general exclusion of multireads those CNVs might be difficult to ascertain. In the application of copyDOC to a 1000 Genomes dataset, the overlap of predicted variants was considerably higher using multireads compared to uniquely mapped reads. Thus there is a requirement for tools that can handle multireads. Further improvements of copyDOC might address the CNV calling algorithm and the merging step. For example, the program workflow could be tested with a direct comparison of the DOC signals in two datasets via log ratios instead of applying a t-test on the DOC signals in the two datasets. CopyDOC and copySim could be used as a platform for the implementation and evaluation of further CNV detection algorithms
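
    The suggested log-ratio alternative to the t-test could look like the sketch below, assuming per-window DOC values are already computed; the pseudo-count and threshold values are illustrative, not copyDOC's actual parameters.

```python
import math

def cnv_log_ratios(sample_doc, control_doc, pseudo=0.5):
    """Per-window log2 ratios of depth of coverage (sample vs control).

    A pseudo-count guards against zero-coverage windows; windows with
    |log2 ratio| above a threshold are candidate CNV windows.
    """
    return [math.log2((s + pseudo) / (c + pseudo))
            for s, c in zip(sample_doc, control_doc)]

def call_events(log_ratios, threshold=0.8):
    """Merge runs of adjacent above-threshold windows into events,
    returned as (start_window, end_window) index pairs."""
    events, start = [], None
    for i, lr in enumerate(log_ratios):
        if abs(lr) >= threshold:
            if start is None:
                start = i
        elif start is not None:
            events.append((start, i - 1))
            start = None
    if start is not None:
        events.append((start, len(log_ratios) - 1))
    return events
```

    On a toy signal where windows 2 and 3 have four-fold coverage against a flat control, the two windows merge into a single candidate duplication event.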

    Comparing De Novo Genome Assembly: The Long and Short of It

    Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have rekindled a growing interest in the whole-genome sequence assembly (WGSA) problem, thereby inundating the field with a plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to comparative assessments of these assemblers' quality and accuracy. No commonly accepted and standardized method for comparison exists yet. Even worse, widely used metrics to compare the assembled sequences emphasize only size, poorly capturing contig quality and accuracy. This paper addresses these concerns: it highlights common anomalies in assembly accuracy through a rigorous study of several assemblers, compared under both standard metrics (N50, coverage, contig sizes, etc.) and a more comprehensive metric (Feature-Response Curves, FRC) that is introduced here; FRC transparently captures the trade-off between contig quality and contig size. For this purpose, most of the publicly available major sequence assemblers – both for low-coverage long (Sanger) and high-coverage short (Illumina) read technologies – are compared. These assemblers are applied to microbial (Escherichia coli, Brucella, Wolbachia, Staphylococcus, Helicobacter) and partial human genome sequences (Chr. Y), using sequence reads of various read-lengths, coverages, accuracies, and with and without mate-pairs. It is hoped that, based on these evaluations, computational biologists will identify innovative sequence assembly paradigms, bioinformaticists will determine promising approaches for developing “next-generation” assemblers, and biotechnologists will formulate more meaningful design desiderata for sequencing technology platforms. A new software tool for computing the FRC metric has been developed and is available through the AMOS open-source consortium
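
    The core of the Feature-Response Curve can be sketched as: sort contigs by size, then trace cumulative feature (error) count against cumulative genome coverage. This is a simplified reading of the metric for illustration; the tool distributed through AMOS is the reference implementation.

```python
def feature_response_curve(contigs, genome_size):
    """FRC sketch.

    contigs: list of (length, n_features) pairs, where features are
    suspicious regions (misassembly signals) found in each contig.
    Walk the contigs from largest to smallest, accumulating features
    and coverage; each point (cumulative_features, coverage_fraction)
    shows how much genome the 'cleanest large' contigs can cover
    within a given feature budget.
    """
    ordered = sorted(contigs, key=lambda c: c[0], reverse=True)
    curve, feats, covered = [], 0, 0
    for length, n_feat in ordered:
        feats += n_feat
        covered += length
        curve.append((feats, covered / genome_size))
    return curve
```

    An assembler whose curve rises steeply (high coverage at a small feature budget) trades off size against quality better than one that only maximizes N50.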

    Validation of S. pombe sequence assembly by microarray hybridization

    We describe a method to make physical maps of genomes using correlative hybridization patterns of probes to random pools of BACs. We derive thereby an estimated distance between probes, and then use this estimated distance to order probes. To test the method, we used BAC libraries from Schizosaccharomyces pombe. We compared our data to the known sequence assembly, in order to assess accuracy. We demonstrate a small number of significant discrepancies between our method and the map derived by sequence assembly. Some of these discrepancies may arise because genome order within a population is not stable; imposing a linear order on a population may not be biologically meaningful
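
    The intuition behind a correlative-hybridization distance can be illustrated with a simple overlap score, assuming each probe's hit pattern across the random BAC pools is available as a boolean vector. This Jaccard-style form is a sketch only; the paper's actual estimator is statistical, not this formula.

```python
def probe_distance(hits_a, hits_b):
    """Proximity proxy from hybridization patterns over random BAC pools.

    Probes that lie close together on the genome tend to land in the
    same BACs, so a larger overlap of their hit patterns maps to a
    smaller distance (1 - Jaccard similarity). Disjoint patterns give
    the maximum distance of 1.0.
    """
    both = sum(1 for a, b in zip(hits_a, hits_b) if a and b)
    either = sum(1 for a, b in zip(hits_a, hits_b) if a or b)
    return 1.0 - both / either if either else 1.0
```

    Ordering probes then reduces to arranging them so that small pairwise distances correspond to adjacency, which is where the discrepancies against the sequence-assembly map become visible.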