
    Paired-end read length lower bounds for genome re-sequencing

    Next-generation sequencing technology is enabling massive production of high-quality paired-end reads. Many platforms (Illumina Genome Analyzer, Applied Biosystems SOLiD, Helicos HeliScope) are currently able to produce "ultra-short" paired reads of lengths starting at 25 nt. An analysis by Whiteford et al. [1] on sequencing with unpaired reads shows that ultra-short reads theoretically allow whole-genome re-sequencing and de novo assembly of only small eukaryotic genomes. By extending the analysis of Whiteford et al., we investigate to what extent genome re-sequencing is feasible with ultra-short paired reads. We obtain theoretical read-length lower bounds for re-sequencing that are also applicable to paired-end de novo assembly.
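
    As an illustration of the kind of uniqueness analysis that underlies such lower bounds (a minimal Python sketch, not the authors' method; the toy sequence and function name are invented for the example), one can look for the smallest read length k at which every length-k window of a sequence occurs at a unique position:

    # Minimal sketch: find the smallest k for which every k-mer of the sequence
    # is unique, the uniqueness criterion behind read-length lower bounds.
    from collections import Counter

    def min_unique_read_length(genome, k_max=50):
        for k in range(1, k_max + 1):
            counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
            if all(c == 1 for c in counts.values()):
                return k
        return None  # no k <= k_max makes every window unique

    if __name__ == "__main__":
        toy_genome = "ACGTACGGACGTTTACG"           # hypothetical toy sequence
        print(min_unique_read_length(toy_genome))  # -> 5 for this toy sequence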

    Simple tools for assembling and searching high-density picolitre pyrophosphate sequence data

    Background: The advent of pyrophosphate sequencing makes large volumes of sequencing data available at a lower cost than previously possible. However, the short read lengths are difficult to assemble and the large dataset is difficult to handle. During the sequencing of a virus from the tsetse fly, Glossina pallidipes, we found the need for tools to quickly search a set of reads for near-exact text matches. Methods: A set of tools is provided to search a large dataset of pyrophosphate sequence reads under a "live" CD version of Linux on a standard PC, usable by anyone without prior knowledge of Linux and without having to install a Linux setup on the computer. The tools permit short lengths of de novo assembly, checking of existing assembled sequences, selection and display of reads from the dataset, and gathering counts of sequences in the reads. Results: Demonstrations are given of the use of the tools to help with checking an assembly against the fragment dataset; investigating homopolymer lengths, repeat regions and polymorphisms; and resolving inserted bases caused by incomplete chain extension. Conclusion: The additional information contained in a pyrophosphate sequencing dataset beyond a basic assembly is difficult to access due to a lack of tools. The set of simple tools presented here would allow anyone with basic computer skills and a standard PC to access this information.
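
    As a sketch of the kind of near-exact search described above (the published tools are not reproduced here; the function, reads and query below are hypothetical), a simple Python version scans every window of every read and tolerates a fixed number of mismatches:

    # Hypothetical sketch of a near-exact text search over a set of reads.
    def near_exact_hits(reads, query, max_mismatches=1):
        # Yield (read_index, offset) for every window that matches the query
        # with at most max_mismatches substitutions.
        q = len(query)
        for idx, read in enumerate(reads):
            for off in range(len(read) - q + 1):
                window = read[off:off + q]
                if sum(a != b for a, b in zip(window, query)) <= max_mismatches:
                    yield idx, off

    if __name__ == "__main__":
        reads = ["ACGTTTGCA", "TTGCAACGT", "GGGGCCCC"]  # toy reads
        print(list(near_exact_hits(reads, "TTGCA")))    # -> [(0, 4), (1, 0)]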

    QSRA – a quality-value guided de novo short read assembler

    Background: New rapid high-throughput sequencing technologies have sparked the creation of a new class of assembler. Since all high-throughput sequencing platforms incorporate errors in their output, short-read assemblers must be designed to account for this error while utilizing all available data. Results: We have designed and implemented an assembler, the Quality-value guided Short Read Assembler (QSRA), created to take advantage of quality-value scores as a further method of dealing with error. Compared to previously published algorithms, our assembler shows significant improvements not only in speed but also in output quality. Conclusion: QSRA generally produced the highest genomic coverage while being faster than VCAKE. QSRA is extremely competitive in its longest contig and N50/N80 contig lengths, producing results of similar quality to those of EDENA and VELVET. QSRA provides a step closer to the goal of de novo assembly of complex genomes, improving upon the original VCAKE algorithm by drastically reducing runtimes and increasing the viability of the assembly algorithm through further error-handling capabilities.
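
    One way to picture quality-value guidance (an illustrative sketch only, not QSRA's actual implementation; the function name, quality values and threshold below are invented) is to weight each candidate extension base by the Phred qualities of the reads that support it during greedy contig extension:

    # Sketch: choose the next base during greedy extension by summing the Phred
    # qualities supporting each candidate, refusing to extend on weak support.
    from collections import defaultdict

    def choose_next_base(candidates, min_weight=40):
        # candidates: list of (base, phred_quality) pairs from overlapping reads.
        weights = defaultdict(int)
        for base, qual in candidates:
            weights[base] += qual
        base, weight = max(weights.items(), key=lambda kv: kv[1])
        return base if weight >= min_weight else None  # None means stop extending

    if __name__ == "__main__":
        support = [("A", 35), ("A", 30), ("C", 12)]  # hypothetical read support
        print(choose_next_base(support))             # -> 'A'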

    Meraculous: De Novo Genome Assembly with Short Paired-End Reads

    We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (de Bruijn) graph of oligonucleotides with unique high-quality extensions in the dataset, avoiding the explicit error correction step used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.
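
    The traversal idea can be sketched as follows (a rough Python illustration, not the meraculous code; quality filtering is omitted, and the seed, k and toy genome are invented): a contig is extended only through k-mers with exactly one candidate extension, and the walk stops at any branch or dead end.

    # Sketch: conservative walk through a k-mer graph, extending only where the
    # extension is unique, and stopping at branches, dead ends, or cycles.
    def extensions(kmer, kmer_set):
        return [b for b in "ACGT" if kmer[1:] + b in kmer_set]

    def walk(seed, kmer_set):
        contig, kmer, seen = seed, seed, {seed}
        while True:
            ext = extensions(kmer, kmer_set)
            if len(ext) != 1:          # branch or dead end: stop conservatively
                return contig
            kmer = kmer[1:] + ext[0]
            if kmer in seen:           # avoid looping around a cycle
                return contig
            seen.add(kmer)
            contig += ext[0]

    if __name__ == "__main__":
        genome, k = "ACGTTGCA", 4      # toy example
        kmers = {genome[i:i + k] for i in range(len(genome) - k + 1)}
        print(walk("ACGT", kmers))     # -> 'ACGTTGCA'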

    Assembly complexity of prokaryotic genomes using short reads

    Background: De Bruijn graphs are a theoretical framework underlying several modern genome assembly programs, especially those that deal with very short reads. We describe an application of de Bruijn graphs to analyze the global repeat structure of prokaryotic genomes. Results: We provide the first survey of the repeat structure of a large number of genomes. The analysis gives an upper bound on the performance of genome assemblers for de novo reconstruction of genomes across a wide range of read lengths. Further, we demonstrate that the majority of genes in prokaryotic genomes can be reconstructed uniquely using very short reads even if the genomes themselves cannot. The non-reconstructible genes are overwhelmingly related to mobile elements (transposons, IS elements, and prophages). Conclusions: Our results improve upon previous studies on the feasibility of assembly with short reads and provide a comprehensive benchmark against which to compare the performance of the short-read assemblers currently being developed.
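
    A small sketch of the underlying measurement (not the paper's pipeline; the sequence and function are hypothetical): repeated k-mers are the branch points of the de Bruijn graph that limit unique reconstruction, so their fraction as a function of k gives a feel for how read length constrains assembly.

    # Sketch: fraction of k-mers that occur more than once, for several k.
    from collections import Counter

    def repeat_fraction(genome, k):
        counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
        total = sum(counts.values())
        repeated = sum(c for c in counts.values() if c > 1)
        return repeated / total if total else 0.0

    if __name__ == "__main__":
        toy = "ACGTACGTGGACGT"          # hypothetical sequence with a repeat
        for k in (3, 5, 7):
            print(k, round(repeat_fraction(toy, k), 3))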

    Mutation Detection with Next-Generation Resequencing through a Mediator Genome

    The affordability of next-generation sequencing (NGS) is transforming the field of mutation analysis in bacteria. The genetic basis for phenotype alteration can be identified directly by sequencing the entire genome of the mutant and comparing it to the wild-type (WT) genome, thus identifying acquired mutations. A major limitation of this approach is the need for an a priori sequenced reference genome for the WT organism, as the short reads of most current NGS approaches usually prohibit de novo genome assembly. To overcome this limitation we propose a general framework that utilizes the genomes of related organisms as mediators for comparing WT and mutant bacteria. Under this framework, both mutant and WT genomes are sequenced with NGS, and the short sequencing reads are mapped to the mediator genome. Variations between the mutant and the mediator that recur in the WT are ignored, thus pinpointing the differences between the mutant and the WT. To validate this approach we sequenced the genome of Bdellovibrio bacteriovorus 109J, an obligatory bacterial predator, and its prey-independent mutant, and compared both to the mediator species Bdellovibrio bacteriovorus HD100. Although the mutant and the mediator sequences differed in more than 28,000 nucleotide positions, our approach enabled pinpointing the single causative mutation. Experimental validation in 53 additional mutants further established the implicated gene. Our approach extends the applicability of NGS-based mutant analyses beyond the domain of available reference genomes.
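
    The core filtering step of the framework can be sketched in a few lines of Python (positions and calls below are toy data; the real pipeline works from short reads mapped to the mediator rather than from pre-made call dictionaries): variants of the mutant relative to the mediator are kept only if the wild type does not show the same call at the same position.

    # Sketch: keep mutant-vs-mediator variants that are absent from WT-vs-mediator.
    def candidate_mutations(mutant_vs_mediator, wt_vs_mediator):
        # Each argument: {position: alternate_base} called against the mediator.
        return {
            pos: alt
            for pos, alt in mutant_vs_mediator.items()
            if wt_vs_mediator.get(pos) != alt
        }

    if __name__ == "__main__":
        mutant = {101: "A", 2045: "T", 9310: "G"}     # mutant vs. mediator (toy)
        wildtype = {101: "A", 9310: "G"}              # WT vs. mediator (toy)
        print(candidate_mutations(mutant, wildtype))  # -> {2045: 'T'}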

    The stepped wedge trial design: a systematic review

    BACKGROUND: Stepped wedge randomised trial designs involve sequential roll-out of an intervention to participants (individuals or clusters) over a number of time periods. By the end of the study, all participants will have received the intervention, although the order in which participants receive the intervention is determined at random. The design is particularly relevant where it is predicted that the intervention will do more good than harm (making a parallel design, in which certain participants do not receive the intervention, unethical) and/or where, for logistical, practical or financial reasons, it is impossible to deliver the intervention simultaneously to all participants. Stepped wedge designs offer a number of opportunities for data analysis, particularly for modelling the effect of time on the effectiveness of an intervention. This paper presents a review of 12 studies (or protocols) that use (or plan to use) a stepped wedge design. One aim of the review is to highlight the potential of the stepped wedge design, given its infrequent use to date. METHODS: Comprehensive literature review of studies or protocols using a stepped wedge design. Data were extracted from the studies in three categories for subsequent consideration: study information (epidemiology, intervention, number of participants), reasons for using a stepped wedge design, and methods of data analysis. RESULTS: The 12 studies included in this review describe evaluations of a wide range of interventions, across different diseases in different settings. However, the stepped wedge design appears to have found a niche for evaluating interventions in developing countries, specifically those concerned with HIV. There were few consistent motivations for employing a stepped wedge design or methods of data analysis across studies. The methodological descriptions of stepped wedge studies, including methods of randomisation, sample size calculations and methods of analysis, are not always complete. CONCLUSION: While the stepped wedge design offers a number of opportunities for use in future evaluations, a more consistent approach to reporting and data analysis is required.
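
    The design itself can be pictured as an allocation matrix (an illustrative Python sketch, not tied to any of the reviewed studies; the cluster count, step count and even spacing rule are assumptions): each row is a cluster, each column a time period, and clusters cross from control (0) to intervention (1) at randomly ordered steps until all are exposed.

    # Sketch: build a stepped wedge allocation schedule with a random cluster order.
    import random

    def stepped_wedge_schedule(n_clusters, n_steps, seed=0):
        rng = random.Random(seed)
        order = list(range(n_clusters))
        rng.shuffle(order)                            # random order of crossover
        schedule = {}
        for rank, cluster in enumerate(order):
            step = rank * n_steps // n_clusters + 1   # crossover period (1-based)
            schedule[cluster] = [0] * step + [1] * (n_steps + 1 - step)
        return [schedule[c] for c in range(n_clusters)]

    if __name__ == "__main__":
        for row in stepped_wedge_schedule(n_clusters=4, n_steps=4):
            print(row)   # e.g. [0, 1, 1, 1, 1] through [0, 0, 0, 0, 1]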