
    Indexing arbitrary-length k-mers in sequencing reads

    We propose a lightweight data structure for indexing and querying collections of NGS reads in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive with existing algorithms in space usage, query time, or both. The main applications of our index include variant calling, error correction, and analysis of reads from RNA-seq experiments.
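The core operation named above, locating k-mers in a read collection, can be sketched with a plain suffix array over the concatenated reads. This is a simplified stand-in, not the PgSA structure itself, which saves space by merging overlapping reads into a pseudogenome:

```python
# Simplified illustration of suffix-array-based k-mer lookup over
# concatenated reads (NOT the actual PgSA structure).
from bisect import bisect_left, bisect_right

def build_suffix_array(text):
    """Naive O(n^2 log n) suffix array; fine for a small demo."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def locate_kmer(text, sa, kmer):
    """Return all start positions of kmer via binary search on the SA."""
    k = len(kmer)
    # k-prefixes of suffixes, in suffix-array order (still sorted)
    prefixes = [text[i:i + k] for i in sa]
    lo = bisect_left(prefixes, kmer)
    hi = bisect_right(prefixes, kmer)
    return sorted(sa[lo:hi])

reads = ["ACGTAC", "GTACGG"]
text = "#".join(reads)              # '#' separates the reads
sa = build_suffix_array(text)
print(locate_kmer(text, sa, "GTAC"))  # [2, 7]
```

A real pseudogenome is much shorter than this plain concatenation because overlapping reads share sequence, which is where the space savings come from.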

    Metagenome Assembly

    The advent of next-generation sequencing (NGS) technology makes it possible to study metagenomic data extracted and cloned directly from assemblages of micro-organisms. Metagenomic data are diverse in both species and abundance. Because most genome assemblers are designed for single-genome assembly, they do not perform well on metagenomic data. To deal with mixed and non-uniformly distributed metagenomic reads, we developed a novel metagenomic assembler named MetaSAGE, built on the platform of the existing SAGE assembler. MetaSAGE finds contigs in the overlap graph based on minimum-cost flow theory and uses mate-pair information to extract scaffolds from the overlap graph. When it encounters chimeric nodes, MetaSAGE splits them according to the coverage of their edges. MetaSAGE exhibits good performance compared to existing metagenomic assemblers.
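The abstract does not spell out MetaSAGE's splitting criterion, so the following is only a hypothetical sketch of the general idea: a chimeric node is split into copies by pairing incoming and outgoing edges whose coverages match best.

```python
# Hypothetical coverage-based chimeric-node splitting (illustrative
# only; MetaSAGE's exact rule is not given in the abstract).
def split_chimeric(in_edges, out_edges):
    """in_edges/out_edges: lists of (neighbor, coverage) pairs.
    Returns one (in_neighbor, out_neighbor) pairing per split copy."""
    ins = sorted(in_edges, key=lambda e: e[1])
    outs = sorted(out_edges, key=lambda e: e[1])
    # With equal edge counts, pairing by coverage rank minimizes the
    # total coverage mismatch across the split copies.
    return [(u, v) for (u, _), (v, _) in zip(ins, outs)]

pairs = split_chimeric([("a", 30), ("b", 5)], [("c", 6), ("d", 28)])
print(pairs)  # [('b', 'c'), ('a', 'd')]
```

Each returned pair becomes its own node copy, so paths through the former chimeric node no longer mix reads from different genomes.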

    Utilization of Probabilistic Models in Short Read Assembly from Second-Generation Sequencing

    With the advent of cheaper and faster DNA sequencing technologies, assembly methods have changed greatly. Instead of outputting reads that are thousands of base pairs long, new sequencers parallelize the task by producing read lengths between 35 and 400 base pairs. Reconstructing an organism's genome from these millions of reads is a computationally expensive task. Our algorithm solves this problem by organizing and indexing the reads using n-grams, which are short, fixed-length DNA sequences of length n. These n-grams are used to efficiently locate putative read joins, thereby eliminating the need for an exhaustive search over all possible read pairs. Our goal was to develop a novel n-gram method for the assembly of genomes from next-generation sequencers. Specifically, a probabilistic, iterative approach was used to determine the most likely reads to join, through development of a new metric that models the probability of any two arbitrary reads being joined together. Tests were run on simulated short-read data based on randomly created genomes ranging in length from 10,000 to 100,000 nucleotides with 16 to 20x coverage. We were able to successfully re-assemble entire genomes up to 100,000 nucleotides in length.
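The indexing idea described above can be sketched in a few lines: hash every n-gram of every read, then use a read's suffix n-gram to retrieve candidate joins instead of scanning all read pairs. Function names here are illustrative, not taken from the work itself:

```python
# Minimal sketch of n-gram indexing for locating putative read joins.
from collections import defaultdict

def build_ngram_index(reads, n):
    """Map each length-n substring to the set of read ids containing it."""
    index = defaultdict(set)
    for rid, read in enumerate(reads):
        for i in range(len(read) - n + 1):
            index[read[i:i + n]].add(rid)
    return index

def candidate_joins(reads, index, rid, n):
    """Reads whose prefix n-gram matches the suffix n-gram of read rid."""
    suffix = reads[rid][-n:]
    return {c for c in index[suffix]
            if c != rid and reads[c].startswith(suffix)}

reads = ["ACGTTG", "TTGCAA", "GGGCCC"]
idx = build_ngram_index(reads, 3)
print(candidate_joins(reads, idx, 0, 3))  # {1}: suffix 'TTG' matches
```

In the probabilistic scheme described above, each such candidate pair would then be scored by the join-likelihood metric rather than accepted outright.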

    A new algorithm for de novo genome assembly

    The enormous number of short reads produced by next-generation sequencing (NGS) techniques such as Roche/454, Illumina/Solexa, and SOLiD sequencing opened the possibility of de novo genome assembly. Some de novo genome assemblers (e.g., Edena, SGA) use an overlap graph approach to assemble a genome, while others (e.g., ABySS and SOAPdenovo) use a de Bruijn graph approach. Currently, the approaches based on the de Bruijn graph are the most successful, yet their performance is far from being able to assemble entire genomic sequences. We developed a new overlap-graph-based genome assembler called Paired-End Genome ASsembly Using Short-sequences (PEGASUS) for paired-end short reads produced by NGS techniques. PEGASUS uses a minimum-cost network flow approach to predict the copy count of the input reads more precisely than other algorithms. With the help of accurate copy counts and mate-pair support, PEGASUS can accurately unscramble the paths in the overlap graph that correspond to DNA sequences. PEGASUS exhibits performance comparable to, and in many cases better than, the leading genome assemblers.
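To make the minimum-cost flow idea concrete, here is a toy successive-shortest-path implementation. This is a generic textbook routine, not PEGASUS's actual formulation: the intuition is that the required amount of flow models how often reads must be used, and the cheapest feasible routing yields integer copy counts.

```python
# Toy min-cost flow via successive shortest paths (Bellman-Ford),
# unit augmentations for simplicity. Illustrative only.
def min_cost_flow(n, edges, s, t, need):
    """edges: list of [u, v, capacity, cost]. Returns (flow, total_cost)."""
    graph = [[] for _ in range(n)]
    for u, v, cap, cost in edges:
        graph[u].append([v, cap, cost, len(graph[v])])       # forward edge
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])    # residual edge
    flow = cost_total = 0
    while flow < need:
        # Bellman-Ford: cheapest path s -> t in the residual graph
        dist = [float("inf")] * n
        dist[s] = 0
        parent = [None] * n
        for _ in range(n - 1):
            for u in range(n):
                if dist[u] == float("inf"):
                    continue
                for i, (v, cap, cost, _) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        parent[v] = (u, i)
        if dist[t] == float("inf"):
            break                      # no more augmenting paths
        v = t                          # push one unit along the path
        while v != s:
            u, i = parent[v]
            graph[u][i][1] -= 1
            graph[v][graph[u][i][3]][1] += 1
            v = u
        flow += 1
        cost_total += dist[t]
    return flow, cost_total

# Two parallel source->sink routes with costs 1 and 2; routing 2 units
# takes the cheap path first, then the expensive one.
flow, cost = min_cost_flow(4, [[0, 1, 1, 1], [0, 2, 2, 2],
                               [1, 3, 1, 0], [2, 3, 2, 0]], 0, 3, 2)
print(flow, cost)  # 2 3
```

In an assembler, node capacities and costs would be derived from read coverage, and the resulting edge flows are read off as copy counts.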

    High-Performance Computing Frameworks for Large-Scale Genome Assembly

    Genome sequencing technology has witnessed tremendous progress in terms of throughput and cost per base pair, resulting in an explosion in the size of data. Typical de Bruijn graph-based assembly tools demand a lot of processing power and memory and cannot assemble big datasets unless running on a scaled-up server with terabytes of RAM or a scaled-out cluster with several dozen nodes. In the first part of this work, we present a distributed next-generation sequencing (NGS) assembler called Lazer, which achieves both scalability and memory efficiency by using partitioned de Bruijn graphs. By enhancing memory-to-disk swapping and reducing network communication in the cluster, we can assemble large sequences such as human genomes (~400 GB) on just two nodes in 14.5 hours, and also scale up to 128 nodes, finishing in 23 minutes. We also assemble a synthetic wheat genome with 1.1 TB of raw reads on 8 nodes in 18.5 hours and on 128 nodes in 1.25 hours. In the second part, we present a new distributed GPU-accelerated NGS assembler called LaSAGNA, which can assemble large-scale sequence datasets using a single GPU by building string graphs from approximate all-pair overlaps in quasi-linear time. To use the limited memory on GPUs efficiently, LaSAGNA uses a two-level semi-streaming approach from disk through host memory to device memory, with restricted access patterns on both disk and host memory. Using LaSAGNA, we can assemble the human genome dataset on a single NVIDIA K40 GPU in 17 hours, and in a little over 5 hours on an 8-node cluster of NVIDIA K20s. In the third part, we present the first distributed third-generation sequencing (3GS) assembler, which uses a map-reduce computing paradigm and a distributed hash-map, both built on a high-performance networking middleware. Using this assembler, we assembled an Oxford Nanopore human genome dataset (~150 GB) in just over half an hour using 128 nodes, whereas existing 3GS assemblers could not assemble it because of memory and/or time limitations.
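The partitioning idea behind distributed de Bruijn assemblers can be sketched as routing each k-mer to an owner partition by hash, so that no single node (or disk bucket) ever holds the whole graph. Lazer's actual partitioning scheme is not described in the abstract; this shows only the general pattern:

```python
# Illustrative hash-partitioned de Bruijn graph construction.
import zlib
from collections import defaultdict

def partitioned_dbg(reads, k, num_parts):
    """Return one adjacency map per partition; each k-mer node is
    owned by exactly one partition, chosen by a deterministic hash."""
    parts = [defaultdict(set) for _ in range(num_parts)]
    for read in reads:
        for i in range(len(read) - k):
            node, nxt = read[i:i + k], read[i + 1:i + k + 1]
            p = zlib.crc32(node.encode()) % num_parts  # owner partition
            parts[p][node].add(nxt)                    # edge node -> nxt
    return parts

parts = partitioned_dbg(["ACGTAC"], 3, 4)
total_edges = sum(len(v) for part in parts for v in part.values())
print(total_edges)  # 3: ACG->CGT, CGT->GTA, GTA->TAC
```

In a cluster, the hash decides which machine receives each k-mer, so graph construction becomes a shuffle followed by purely local inserts.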

    Development of genomic resources and tools for precision farming of pikeperch through high-throughput sequencing and computational genomics

    This thesis provides the first genomic tools and resources to support the innovative farming, optimal domestication, and adaptation of pikeperch to modern intensive aquaculture systems, including a high-quality chromosome-level assembly, a reference transcriptome, and a gene expression atlas. The pikeperch genome was also used as a reference for comparative genomics and population genetics analyses in domesticated individuals to establish the landscape of genetic variation. These findings lay the foundation for addressing critical issues in genomics-informed pikeperch farming.