6 research outputs found

    DecGPU: distributed error correction on massively parallel graphics processing units using CUDA and MPI

    Background: Next-generation sequencing technologies have led to the high-throughput production of sequence data (reads) at low cost. However, these reads are significantly shorter and more error-prone than conventional Sanger shotgun reads. This poses a challenge for de novo assembly in terms of assembly quality and scalability for large-scale short read datasets. Results: We present DecGPU, the first parallel and distributed error correction algorithm for high-throughput short reads (HTSRs) using a hybrid combination of CUDA and MPI parallel programming models. DecGPU provides CPU-based and GPU-based versions, where the CPU-based version employs coarse-grained and fine-grained parallelism using the MPI and OpenMP parallel programming models, and the GPU-based version takes advantage of the CUDA and MPI parallel programming models and employs a hybrid CPU+GPU computing model to maximize the performance by overlapping the CPU and GPU computation. The distributed feature of our algorithm makes it feasible and flexible for the error correction of large-scale HTSR datasets. Using simulated and real datasets, our algorithm demonstrates superior performance, in terms of error correction quality and execution speed, to the existing error correction algorithms. Furthermore, when combined with Velvet and ABySS, the resulting DecGPU-Velvet and DecGPU-ABySS assemblers demonstrate the potential of our algorithm to improve de novo assembly quality for de Bruijn graph-based assemblers. Conclusions: DecGPU is publicly available open-source software, written in CUDA C++ and MPI. The experimental results suggest that DecGPU is an effective and feasible error correction algorithm to tackle the flood of short reads produced by next-generation sequencing technologies.

    G-CNV: A GPU-based tool for preparing data to detect CNVs with read-depth methods

    Copy number variations (CNVs) are the most prevalent types of structural variations (SVs) in the human genome and are involved in a wide range of common human diseases. Different computational methods have been devised to detect this type of SV and to study how CNVs are implicated in human diseases. Recently, computational methods based on high-throughput sequencing (HTS) have been increasingly used. The majority of these methods focus on mapping short-read sequences generated from a donor against a reference genome to detect signatures distinctive of CNVs. In particular, read-depth based methods detect CNVs by analyzing genomic regions whose read depth differs significantly from that of other regions. The analysis pipeline of these methods consists of four main stages: (i) data preparation, (ii) data normalization, (iii) CNV region identification, and (iv) copy number estimation. However, available tools do not support most of the operations required at the first two stages of this pipeline. Typically, they start the analysis by building the read-depth signal from pre-processed alignments. Therefore, third-party tools must be used to perform most of the preliminary operations required to build the read-depth signal. These data-intensive operations can be efficiently parallelized on graphics processing units (GPUs). In this article, we present G-CNV, a GPU-based tool devised to perform the common operations required at the first two stages of the analysis pipeline. G-CNV is able to filter low-quality read sequences, to mask low-quality nucleotides, to remove adapter sequences, to remove duplicated read sequences, to map the short reads, to resolve multiple mapping ambiguities, to build the read-depth signal, and to normalize it. G-CNV can be efficiently used as a third-party tool able to prepare data for the subsequent read-depth signal generation and analysis. Moreover, it can also be integrated in CNV detection tools to generate read-depth signals.
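The last two operations named in the abstract above, building the read-depth signal and normalizing it, can be illustrated with a minimal CPU-only sketch. This is not G-CNV's implementation: the function names, the fixed-window binning by read start position, and the median-based normalization are all assumptions chosen for illustration, since the abstract does not say how G-CNV performs these steps.

```python
from statistics import median


def read_depth_signal(read_starts, genome_length, window):
    """Count mapped reads per fixed-size window (hypothetical helper).

    Each read is assigned to the window containing its start position.
    """
    n_windows = -(-genome_length // window)  # ceiling division
    counts = [0] * n_windows
    for start in read_starts:
        if 0 <= start < genome_length:  # ignore reads mapped outside the region
            counts[start // window] += 1
    return counts


def normalize(counts):
    """Divide each window count by the median count (assumed normalization)."""
    m = median(counts)
    if m == 0:
        return [0.0 for _ in counts]
    return [c / m for c in counts]


# Toy data: start positions of ten mapped reads on a 40-base region.
starts = [0, 3, 7, 12, 14, 15, 18, 25, 31, 33]
signal = read_depth_signal(starts, genome_length=40, window=10)
ratios = normalize(signal)
print(signal)
print(ratios)
```

In a read-depth method, windows whose normalized value sits well above or below 1 would be the candidates passed on to the later CNV identification stage.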

    High-Performance Meta-Genomic Gene Identification

    Computational Genomics, or Computational Genetics, refers to the use of computational and statistical analysis for understanding the structure and the function of genetic material in organisms. The primary focus of research in computational genomics in the past three decades has been the understanding of genomes and their functional elements by analyzing biological sequence data. The high demand for low-cost sequencing has driven the development of high-throughput sequencing technologies, known as next-generation sequencing (NGS), that parallelize the sequencing process, producing thousands or millions of sequences concurrently. Moore’s Law is the observation that the number of transistors on integrated circuits doubles approximately every two years; correspondingly, the cost per transistor halves. The cost of DNA sequencing declines much faster, which implies that more new DNA data will be obtained. This large-scale sequence data, produced with high-throughput sequencing technologies, needs to be processed in a time-effective and cost-effective manner. In this dissertation, we present a high-performance meta-genome gene identification framework. This framework includes four modules: filter, alignment, error correction, and gene identification. The following chapters describe the proposed design and evaluation of this pipeline. The most computationally expensive kernel in the framework is the alignment procedure. Thus, the filter module is developed to identify unnecessary alignment operations. Without the filter module, the alignment module requires 1.9 hours to complete all-to-all alignment on a test file of 512,000 sequences with an average sequence length of 750 base pairs, using ten NVIDIA Kepler K20 GPUs. When combined with the filter kernel, the total time is 11.3 minutes. Note that the ideal speedup is nearly 91.4 times when the new alignment kernel is run on ten GPUs (10 × 9.14).
We conclude that accuracy can be achieved at the expense of more resources while operating frequency can still be maintained.
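The timing figures quoted in the abstract above can be checked with a few lines of arithmetic. The sketch below assumes that 9.14 is the per-GPU speedup of the new alignment kernel and that the "ideal" figure assumes perfectly linear scaling across ten GPUs; the abstract does not state the baseline for the 9.14 figure, and the variable names are illustrative.

```python
def speedup(baseline_minutes, optimized_minutes):
    """Ratio of baseline run time to optimized run time."""
    return baseline_minutes / optimized_minutes


# Run times quoted in the abstract, converted to minutes.
unfiltered_minutes = 1.9 * 60   # alignment without the filter module
filtered_minutes = 11.3         # alignment combined with the filter kernel

# Speedup attributable to the filter module, roughly 10x.
filter_speedup = speedup(unfiltered_minutes, filtered_minutes)

# Ideal multi-GPU speedup, assuming linear scaling (an assumption).
per_gpu_speedup = 9.14
gpu_count = 10
ideal_multi_gpu_speedup = per_gpu_speedup * gpu_count

print(f"filter speedup: {filter_speedup:.2f}x")
print(f"ideal speedup on {gpu_count} GPUs: {ideal_multi_gpu_speedup:.1f}x")
```

Under these assumptions the filter module alone accounts for roughly a tenfold reduction in run time, which is separate from the 91.4 figure for the alignment kernel on ten GPUs.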

    Grid and high performance computing applied to bioinformatics

    Recent advances in genome sequencing technologies and modern biological data analysis technologies used in bioinformatics have led to a fast and continuous increase in biological data. The difficulty of managing the huge amounts of data currently available to researchers and the need to have results within a reasonable time have led to the use of distributed and parallel computing infrastructures for their analysis. In this context Grid computing has been successfully used. Grid computing is based on a distributed system which interconnects several computers and/or clusters to access global-scale resources. This infrastructure is flexible, highly scalable and can achieve high performances with data-compute-intensive algorithms. Recently, bioinformatics is exploring new approaches based on the use of hardware accelerators, such as the Graphics Processing Units (GPUs). Initially developed as graphics cards, GPUs have been recently introduced for scientific purposes by reason of their performance per watt and the better cost/performance ratio achieved in terms of throughput and response time compared to other high-performance computing solutions. Although developers must have an in-depth knowledge of GPU programming and hardware to be effective, GPU accelerators have produced a lot of impressive results. The use of high-performance computing infrastructures raises the question of finding a way to parallelize the algorithms while limiting data dependency issues in order to accelerate computations on a massively parallel hardware. In this context, the research activity in this dissertation focused on the assessment and testing of the impact of these innovative high-performance computing technologies on computational biology. In order to achieve high levels of parallelism and, in the final analysis, obtain high performances, some of the bioinformatic algorithms applicable to genome data analysis were selected, analyzed and implemented. 
These algorithms have been highly parallelized and optimized, thus maximizing the GPU hardware resources. The overall results show that the proposed parallel algorithms are highly performant, thus justifying the use of such technology. However, a software infrastructure for workflow management has been devised to provide support in CPU and GPU computation on a distributed GPU-based infrastructure. Moreover, this software infrastructure allows a further coarse-grained data-parallel parallelization on more GPUs. Results show that the proposed application speed-up increases with the increase in the number of GPUs.
