1,419 research outputs found

    MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting

    Full text link
    A major challenge in next-generation genome sequencing (NGS) is to assemble massive overlapping short reads that are randomly sampled from DNA fragments. To complete assembling, one needs to finish a fundamental task in many leading assembly algorithms: counting the number of occurrences of k-mers (length-k substrings in sequences). The counting results are critical for many components in assembly (e.g. variants detection and read error correction). For large genomes, the k-mer counting task can easily consume a huge amount of memory, making it impossible for large-scale parallel assembly on commodity servers. In this paper, we develop MSPKmerCounter, a disk-based approach, to efficiently perform k-mer counting for large genomes using a small amount of memory. Our approach is based on a novel technique called Minimum Substring Partitioning (MSP). MSP breaks short reads into multiple disjoint partitions such that each partition can be loaded into memory and processed individually. By leveraging the overlaps among the k-mers derived from the same short read, MSP can achieve astonishing compression ratio so that the I/O cost can be significantly reduced. For the task of k-mer counting, MSPKmerCounter offers a very fast and memory-efficient solution. Experiment results on large real-life short reads data sets demonstrate that MSPKmerCounter can achieve better overall performance than state-of-the-art k-mer counting approaches. MSPKmerCounter is available at http://www.cs.ucsb.edu/~yangli/MSPKmerCounte

    Modeling update for the Thirty Meter Telescope laser guide star dual-conjugate adaptive optics system

    Get PDF
    This paper describes the modeling efforts undertaken in the past couple of years to derive wavefront error (WFE) performance estimates for the Narrow Field Infrared Adaptive Optics System (NFIRAOS), which is the facility laser guide star (LGS) dual-conjugate adaptive optics (AO) system for the Thirty Meter Telescope (TMT). The estimates describe the expected performance of NFIRAOS as a function of seeing on Mauna Kea, zenith angle, and galactic latitude (GL). They have been developed through a combination of integrated AO simulations, side analyses, allocations, lab and lidar experiments

    Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification

    Get PDF
    It is becoming increasingly impractical to indefinitely store raw sequencing data for later processing in an uncompressed state. In this paper, we describe a scalable compressive framework, Read-Quality-Sparsifier (RQS), which substantially outperforms the compression ratio and speed of other de novo quality score compression methods while maintaining SNP-calling accuracy. Surprisingly, RQS also improves the SNP-calling accuracy on a gold-standard, real-life sequencing dataset (NA12878) using a k-mer density profile constructed from 77 other individuals from the 1000 Genomes Project. This improvement in downstream accuracy emerges from the observation that quality score values within NGS datasets are inherently encoded in the k-mer landscape of the genomic sequences. To our knowledge, RQS is the first scalable sequence-based quality compression method that can efficiently compress quality scores of terabyte-sized and larger sequencing datasets. Availability: An implementation of our method, RQS, is available for download at: http://rqs.csail.mit.edu/. © 2014 Springer International Publishing Switzerland. Keywords: RQS; quality score; sparsification; compression; accuracy; variant callingHertz FoundationNational Institutes of Health (U.S.) (R01GM108348

    Real-time turbulence profiling with a pair of laser guide star Shack–Hartmann wavefront sensors for wide-field adaptive optics systems on large to extremely large telescopes

    Get PDF
    Real-time turbulence profiling is necessary to tune tomographic wavefront reconstruction algorithms for wide-field adaptive optics (AO) systems on large to extremely large telescopes, and to perform a variety of image post-processing tasks involving point-spread function reconstruction. This paper describes a computationally efficient and accurate numerical technique inspired by the slope detection and ranging (SLODAR) method to perform this task in real time from properly selected Shack–Hartmann wavefront sensor measurements accumulated over a few hundred frames from a pair of laser guide stars, thus eliminating the need for an additional instrument. The algorithm is introduced, followed by a theoretical influence function analysis illustrating its impulse response to high-resolution turbulence profiles. Finally, its performance is assessed in the context of the Thirty Meter Telescope multi-conjugate adaptive optics system via end-to-end wave optics Monte Carlo simulations

    Reevaluating Assembly Evaluations with Feature Response Curves: GAGE and Assemblathons

    Get PDF
    In just the last decade, a multitude of bio-technologies and software pipelines have emerged to revolutionize genomics. To further their central goal, they aim to accelerate and improve the quality of de novo whole-genome assembly starting from short DNA reads. However, the performance of each of these tools is contingent on the length and quality of the sequencing data, the structure and complexity of the genome sequence, and the resolution and quality of long-range information. Furthermore, in the absence of any metric that captures the most fundamental "features" of a high-quality assembly, there is no obvious recipe for users to select the most desirable assembler/assembly. International competitions such as Assemblathons or GAGE tried to identify the best assembler(s) and their features. Some what circuitously, the only available approach to gauge de novo assemblies and assemblers relies solely on the availability of a high-quality fully assembled reference genome sequence. Still worse, reference-guided evaluations are often both difficult to analyze, leading to conclusions that are difficult to interpret. In this paper, we circumvent many of these issues by relying upon a tool, dubbed FRCbam, which is capable of evaluating de novo assemblies from the read-layouts even when no reference exists. We extend the FRCurve approach to cases where lay-out information may have been obscured, as is true in many deBruijn-graph-based algorithms. As a by-product, FRCurve now expands its applicability to a much wider class of assemblers -- thus, identifying higher-quality members of this group, their inter-relations as well as sensitivity to carefully selected features, with or without the support of a reference sequence or layout for the reads. The paper concludes by reevaluating several recently conducted assembly competitions and the datasets that have resulted from them.Comment: Submitted to PLoS One. Supplementary material available at http://www.nada.kth.se/~vezzi/publications/supplementary.pdf and http://cs.nyu.edu/mishra/PUBLICATIONS/12.supplementaryFRC.pd

    DUDE-Seq: Fast, Flexible, and Robust Denoising for Targeted Amplicon Sequencing

    Full text link
    We consider the correction of errors from nucleotide sequences produced by next-generation targeted amplicon sequencing. The next-generation sequencing (NGS) platforms can provide a great deal of sequencing data thanks to their high throughput, but the associated error rates often tend to be high. Denoising in high-throughput sequencing has thus become a crucial process for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel and effectively corrects substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput targeted amplicon sequencing platforms. Our experimental studies with real and simulated datasets suggest that the proposed DUDE-Seq not only outperforms existing alternatives in terms of error-correction capability and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables its robust application to different sequencing platforms and analysis pipelines by simple updates of the noise model. DUDE-Seq is available at http://data.snu.ac.kr/pub/dude-seq

    SEED: efficient clustering of next-generation sequences.

    Get PDF
    MotivationSimilarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.ResultsHere, we introduce SEED-an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with a linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.AvailabilityThe SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/[email protected] informationSupplementary data are available at Bioinformatics online
    corecore