Simple tools for assembling and searching high-density picolitre pyrophosphate sequence data
Background: The advent of pyrophosphate sequencing makes large volumes of sequencing data available at a lower cost than previously possible. However, the short read lengths are difficult to assemble and the large dataset is difficult to handle. During the sequencing of a virus from the tsetse fly, Glossina pallidipes, we found the need for tools to search quickly a set of reads for near exact text matches. Methods: A set of tools is provided to search a large data set of pyrophosphate sequence reads under a "live" CD version of Linux on a standard PC that can be used by anyone without prior knowledge of Linux and without having to install a Linux setup on the computer. The tools permit short lengths of de novo assembly, checking of existing assembled sequences, selection and display of reads from the data set and gathering counts of sequences in the reads. Results: Demonstrations are given of the use of the tools to help with checking an assembly against the fragment data set; investigating homopolymer lengths, repeat regions and polymorphisms; and resolving inserted bases caused by incomplete chain extension. Conclusion: The additional information contained in a pyrophosphate sequencing data set beyond a basic assembly is difficult to access due to a lack of tools. The set of simple tools presented here would allow anyone with basic computer skills and a standard PC to access this information.
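As a rough illustration of what searching reads for near-exact text matches can look like, the following minimal sketch scans every window of every read and reports those within a small number of substitutions of a query. It is not the authors' toolset: the function names, the `max_mismatches` parameter, and the substitution-only (Hamming) distance are assumptions made for this example. Because pyrophosphate reads tend to err in homopolymer length, a real search would probably also need to tolerate insertions and deletions.

```python
from typing import Iterable, Iterator, Tuple


def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_near_matches(
    reads: Iterable[str], query: str, max_mismatches: int = 1
) -> Iterator[Tuple[int, int, int]]:
    """Yield (read_index, offset, mismatches) for each read window
    that differs from `query` by at most `max_mismatches` substitutions."""
    k = len(query)
    for i, read in enumerate(reads):
        for off in range(len(read) - k + 1):
            d = hamming(read[off:off + k], query)
            if d <= max_mismatches:
                yield i, off, d


# Toy reads, invented for illustration only.
reads = ["ACGTACGTTA", "TTGACCTACG", "GGGGGGGGGG"]
hits = list(find_near_matches(reads, "ACGTT", max_mismatches=1))
```

Here `hits` holds one entry per matching window, so counting sequences in the reads reduces to taking `len(hits)`.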
Illumina mate-paired DNA sequencing-library preparation using Cre-Lox recombination
Standard Illumina mate-paired libraries are constructed from 3- to 5-kb DNA fragments by blunt-end circularization. Sequencing reads that pass through the junction of the two joined ends of a 3- to 5-kb DNA fragment are not easy to identify and pose problems during mapping and de novo assembly. Longer read lengths increase the possibility that a read will cross the junction. To solve this problem, we developed a mate-paired protocol for use with Illumina sequencing technology that uses Cre-Lox recombination instead of blunt-end circularization. In this method, a LoxP sequence is incorporated at the junction site. This sequence allows screening reads for junctions without using a reference genome. Junction reads can be trimmed or split at the junction. Moreover, the location of the LoxP sequence in the reads distinguishes mate-paired reads from spurious paired-end reads. We tested this new method by preparing and sequencing a mate-paired library with an insert size of 3 kb from Saccharomyces cerevisiae. We present an analysis of the library quality statistics and a new bioinformatics tool called DeLoxer that can be used to analyze an Illumina Cre-Lox mate-paired data set. We also demonstrate how the resulting data significantly improve a de novo assembly of the S. cerevisiae genome.
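The idea of screening reads for the junction and splitting them there can be sketched as below. This is not DeLoxer's algorithm: the function name is hypothetical, the match is exact-only, and the default junction is the commonly cited 34-bp wild-type loxP sequence, which is an assumption about what the protocol leaves at the junction. A practical tool would presumably also handle sequencing errors, partial LoxP sequence at read ends, and the reverse-complement orientation.

```python
from typing import Optional, Tuple

# Commonly cited 34-bp wild-type loxP site (assumed junction sequence).
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"


def split_at_junction(
    read: str, junction: str = LOXP
) -> Optional[Tuple[str, str]]:
    """Return the sequence on either side of the first exact junction
    occurrence, or None when the read contains no junction."""
    pos = read.find(junction)
    if pos == -1:
        return None
    return read[:pos], read[pos + len(junction):]


# Toy reads, invented for illustration only.
junction_read = "ACGT" + LOXP + "TTGCA"
plain_read = "ACGTACGTACGT"
left, right = split_at_junction(junction_read)
```

Reads for which the function returns `None` carry no junction and can be passed through unchanged.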
REAPR: a universal tool for genome assembly evaluation.
Methods to reliably assess the accuracy of genome sequence data are lacking. Currently completeness is only described qualitatively and mis-assemblies are overlooked. Here we present REAPR, a tool that precisely identifies errors in genome assemblies without the need for a reference sequence. We have validated REAPR on complete genomes or de novo assemblies from bacteria, malaria and Caenorhabditis elegans, and demonstrate that 86% and 82% of the human and mouse reference genomes are error-free, respectively. When applied to an ongoing genome project, REAPR provides corrected assembly statistics allowing the quantitative comparison of multiple assemblies. REAPR is available at http://www.sanger.ac.uk/resources/software/reapr/
Minimum error correction-based haplotype assembly: considerations for long read data
The single nucleotide polymorphism (SNP) is the most widely studied type of
genetic variation. A haplotype is defined as the sequence of alleles at SNP
sites on each haploid chromosome. Haplotype information is essential in
unravelling the genome-phenotype association. Haplotype assembly is a
well-known approach for reconstructing haplotypes, exploiting reads generated
by DNA sequencing devices. The Minimum Error Correction (MEC) metric is often
used for reconstruction of haplotypes from reads. However, problems with the
MEC metric have been reported. Here, we investigate the MEC approach to
demonstrate that it may result in incorrectly reconstructed haplotypes for
devices that produce error-prone long reads. Specifically, we evaluate this
approach for devices developed by Illumina, Pacific BioSciences and Oxford
Nanopore Technologies. We show that imprecise haplotypes may be reconstructed
with a lower MEC than that of the exact haplotype. The performance of MEC is
explored for different coverage levels and error rates of data. Our simulation
results reveal that in order to avoid incorrect MEC-based haplotypes, a
coverage of 25 is needed for reads generated by Pacific BioSciences RS systems. Comment: 17 pages, 6 figures
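For readers unfamiliar with the metric, a minimal sketch of how an MEC score can be computed is given here. It assumes a diploid genome with biallelic, heterozygous SNP sites encoded as `0`/`1`, so that the second haplotype is the bitwise complement of the first, with `-` marking sites a read does not cover. The function name and the toy fragments are made up for illustration and are not taken from the paper.

```python
from typing import Sequence


def mec(fragments: Sequence[str], hap: str) -> int:
    """Minimum Error Correction score of a haplotype pair (hap, complement).

    Each fragment is assigned to whichever haplotype it disagrees with
    least; the score is the total number of disagreements remaining.
    """
    comp = "".join("1" if a == "0" else "0" for a in hap)
    total = 0
    for frag in fragments:
        # Uncovered sites ('-') never count as mismatches.
        d_hap = sum(f != "-" and f != h for f, h in zip(frag, hap))
        d_comp = sum(f != "-" and f != c for f, c in zip(frag, comp))
        total += min(d_hap, d_comp)
    return total


# Toy fragments over four SNP sites; the last carries one error.
fragments = ["010-", "0101", "101-", "-010", "0111"]
score_a = mec(fragments, "0101")
score_b = mec(fragments, "0111")
```

A haplotype assembler searches for the pair that minimizes this score, which is why a wrong haplotype whose score is below that of the true one is a problem for the approach.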
Models for transcript quantification from RNA-Seq
RNA-Seq is rapidly becoming the standard technology for transcriptome
analysis. Fundamental to many of the applications of RNA-Seq is the
quantification problem, which is the accurate measurement of relative
transcript abundances from the sequenced reads. We focus on this problem, and
review many recently published models that are used to estimate the relative
abundances. In addition to describing the models and the different approaches
to inference, we also explain how methods are related to each other. A key
result is that we show how inference with many of the models results in
identical estimates of relative abundances, even though model formulations can
be very different. In fact, we are able to show how a single general model
captures many of the elements of previously published methods. We also review
the applications of RNA-Seq models to differential analysis, and explain why
accurate relative transcript abundance estimates are crucial for downstream
analyses.
Special features of RAD Sequencing data: implications for genotyping
Restriction site-associated DNA Sequencing (RAD-Seq) is an economical and efficient method for SNP discovery and genotyping. As with other sequencing-by-synthesis methods, RAD-Seq produces stochastic count data and requires sensitive analysis to develop or genotype markers accurately. We show that there are several sources of bias specific to RAD-Seq that are not explicitly addressed by current genotyping tools, namely restriction fragment bias, restriction site heterozygosity and PCR GC content bias. We explore the performance of existing analysis tools given these biases and discuss approaches to limiting or handling biases in RAD-Seq data. While these biases need to be taken seriously, we believe RAD loci affected by them can be excluded or processed with relative ease in most cases and that most RAD loci will be accurately genotyped by existing tools.