2,445 research outputs found

    Parallelization of Mapping Algorithms for Next Generation Sequencing Applications

    Get PDF
    With the advent of next-generation high throughput sequencing instruments, large volumes of short sequence data are generated at an unprecedented rate. Processing and analyzing these massive data requires overcoming several challenges. A particular challenge addressed in this abstract is the mapping of short sequences (reads) to a reference genome by allowing mismatches. This is a significantly time consuming combinatorial problem in many applications including whole-genome resequencing, targeted sequencing, transcriptome/small RNA, DNA methylation and ChiP sequencing, and takes time on the order of days using existing sequential techniques on large scale datasets. In this work, we introduce six parallelization methods each having different scalability characteristics to speedup short sequence mapping. We also address an associated load balancing problem that involves grouping nodes of a tree from different levels. This problem arises due to a trade-off between computational cost and granularity while partitioning the workload. We comparatively present the proposed parallelization methods and give theoretical cost models for each of them. Experimental results on real datasets demonstrate the effectiveness of the methods and indicate that they are successful at reducing the execution time from the order of days to under just a few hours for large datasets. To the best of our knowledge this is the first study on parallelization of short sequence mapping problem

    SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications

    Full text link
    Summary: The Smith Waterman (SW) algorithm, which produces the optimal pairwise alignment between two sequences, is frequently used as a key component of fast heuristic read mapping and variation detection tools, but current implementations are either designed as monolithic protein database searching tools or are embedded into other tools. To facilitate easy integration of the fast Single Instruction Multiple Data (SIMD) SW algorithm into third party software, we wrote a C/C++ library, which extends Farrars Striped SW (SSW) to return alignment information in addition to the optimal SW score. Availability: SSW is available both as a C/C++ software library, as well as a stand alone alignment tool wrapping the librarys functionality at https://github.com/mengyao/Complete- Striped-Smith-Waterman-Library Contact: [email protected]: 3 pages, 2 figure

    Optimizing Splicing Junction Detection in Next Generation Sequencing Data on a Virtual-GRID Infrastructure

    Get PDF
    The new protocol for sequencing the messenger RNA in a cell, named RNA-seq produce millions of short sequence fragments. Next Generation Sequencing technology allows more accurate analysis but increase needs in term of computational resources. This paper describes the optimization of a RNA-seq analysis pipeline devoted to splicing variants detection, aimed at reducing computation time and providing a multi-user/multisample environment. This work brings two main contributions. First, we optimized a well-known algorithm called TopHat by parallelizing some sequential mapping steps. Second, we designed and implemented a hybrid virtual GRID infrastructure allowing to efficiently execute multiple instances of TopHat running on different samples or on behalf of different users, thus optimizing the overall execution time and enabling a flexible multi-user environmen

    Extreme Scale De Novo Metagenome Assembly

    Get PDF
    Metagenome assembly is the process of transforming a set of short, overlapping, and potentially erroneous DNA segments from environmental samples into the accurate representation of the underlying microbiomes's genomes. State-of-the-art tools require big shared memory machines and cannot handle contemporary metagenome datasets that exceed Terabytes in size. In this paper, we introduce the MetaHipMer pipeline, a high-quality and high-performance metagenome assembler that employs an iterative de Bruijn graph approach. MetaHipMer leverages a specialized scaffolding algorithm that produces long scaffolds and accommodates the idiosyncrasies of metagenomes. MetaHipMer is end-to-end parallelized using the Unified Parallel C language and therefore can run seamlessly on shared and distributed-memory systems. Experimental results show that MetaHipMer matches or outperforms the state-of-the-art tools in terms of accuracy. Moreover, MetaHipMer scales efficiently to large concurrencies and is able to assemble previously intractable grand challenge metagenomes. We demonstrate the unprecedented capability of MetaHipMer by computing the first full assembly of the Twitchell Wetlands dataset, consisting of 7.5 billion reads - size 2.6 TBytes.Comment: Accepted to SC1
    corecore