
    Computing Platforms for Big Biological Data Analytics: Perspectives and Challenges.

    The last decade has witnessed an explosion in the amount of available biological sequence data, due to the rapid progress of high-throughput sequencing projects. However, the volume of biological data has become so great that traditional data analysis platforms and methods can no longer meet the need to rapidly perform data analysis tasks in the life sciences. As a result, both biologists and computer scientists face the challenge of gaining profound insight into the deepest biological functions from big biological data, which in turn requires massive computational resources. Therefore, high performance computing (HPC) platforms are urgently needed, along with efficient and scalable algorithms that can take advantage of these platforms. In this paper, we survey the state-of-the-art HPC platforms for big biological data analytics. We first list the characteristics of big biological data and popular computing platforms. Then we provide a taxonomy of different biological data analysis applications and a survey of how they have been mapped onto various computing platforms. After that, we present a case study comparing the efficiency of different computing platforms for handling the classical biological sequence alignment problem. Finally, we discuss the open issues in big biological data analytics.

    DecGPU: distributed error correction on massively parallel graphics processing units using CUDA and MPI

    Background: Next-generation sequencing technologies have led to the high-throughput production of sequence data (reads) at low cost. However, these reads are significantly shorter and more error-prone than conventional Sanger shotgun reads. This poses a challenge for de novo assembly in terms of assembly quality and scalability for large-scale short read datasets. Results: We present DecGPU, the first parallel and distributed error correction algorithm for high-throughput short reads (HTSRs) using a hybrid combination of CUDA and MPI parallel programming models. DecGPU provides CPU-based and GPU-based versions, where the CPU-based version employs coarse-grained and fine-grained parallelism using the MPI and OpenMP parallel programming models, and the GPU-based version takes advantage of the CUDA and MPI parallel programming models and employs a hybrid CPU+GPU computing model to maximize performance by overlapping the CPU and GPU computation. The distributed feature of our algorithm makes it feasible and flexible for the error correction of large-scale HTSR datasets. Using simulated and real datasets, our algorithm demonstrates superior performance, in terms of error correction quality and execution speed, to the existing error correction algorithms. Furthermore, when combined with Velvet and ABySS, the resulting DecGPU-Velvet and DecGPU-ABySS assemblers demonstrate the potential of our algorithm to improve de novo assembly quality for de-Bruijn-graph-based assemblers. Conclusions: DecGPU is publicly available open-source software, written in CUDA C++ and MPI. The experimental results suggest that DecGPU is an effective and feasible error correction algorithm to tackle the flood of short reads produced by next-generation sequencing technologies.
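
    The abstract above describes a coarse-grained distribution of reads across MPI processes, with each process correcting its own chunk (on CPU threads or a GPU). As a rough, hypothetical illustration of that scatter/correct/gather pattern only, and not DecGPU's actual CUDA C++ code, a minimal mpi4py sketch might look like the following; the toy correct_chunk kernel and read set are invented for the example.

        # Requires mpi4py; run with e.g.: mpiexec -n 4 python decgpu_like_sketch.py
        from mpi4py import MPI

        def correct_chunk(reads):
            # Stand-in for the per-rank correction kernel (CPU threads or a GPU in DecGPU);
            # here it merely masks uncalled bases, purely for illustration.
            return [r.replace("N", "A") for r in reads]

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        if rank == 0:
            reads = ["ACGTN", "TTGCA", "NACGT", "GGGTA"]    # toy read set
            chunks = [reads[i::size] for i in range(size)]  # coarse-grained split, one chunk per rank
        else:
            chunks = None

        local = comm.scatter(chunks, root=0)     # distribute chunks across MPI processes
        fixed = correct_chunk(local)             # each rank corrects its reads independently
        merged = comm.gather(fixed, root=0)      # collect corrected chunks on the root rank
        if rank == 0:
            print([r for chunk in merged for r in chunk])

    DecGPU additionally overlaps the per-rank work between CPU and GPU, which this sketch omits.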

    Accelerating pairwise sequence alignment on GPUs using the Wavefront Algorithm

    Advances in genomics and sequencing technologies demand faster and more scalable analysis methods that can process longer sequences with higher accuracy. However, classical pairwise alignment methods based on dynamic programming (DP) impose impractical computational requirements when aligning long and noisy sequences like those produced by PacBio and Nanopore technologies. The recently proposed Wavefront Alignment (WFA) algorithm paves the way for more efficient alignment tools, improving time and memory complexity over previous methods. Notwithstanding the advantages of the WFA algorithm, modern high performance computing (HPC) platforms rely on accelerator-based architectures that exploit parallel computing resources to improve on classical CPUs. Hence, a GPU-enabled implementation of the WFA could exploit the hardware resources of modern GPUs and further accelerate sequence alignment in current genome analysis pipelines. This thesis presents two GPU-accelerated implementations based on the WFA for fast pairwise DNA sequence alignment: eWFA-GPU and WFA-GPU. Our first proposal, eWFA-GPU, computes the exact edit-distance alignment between two short sequences (up to a few thousand bases), taking full advantage of the massive parallel capabilities of modern GPUs. We propose a succinct representation of the alignment data that successfully reduces the overall amount of memory required, allowing the exploitation of the fast on-chip memory of a GPU. Our results show that eWFA-GPU outperforms the edit-distance WFA implementation running on a 20-core machine by 3-9X. Compared to other state-of-the-art tools computing the edit distance, eWFA-GPU is up to 265X faster than CPU tools and up to 56X faster than other GPU-enabled implementations. Our second contribution, the WFA-GPU tool, extends the work of eWFA-GPU to compute the exact gap-affine distance (i.e., a more general alignment problem) between arbitrarily long sequences. In this work, we propose a CPU-GPU co-design capable of performing inter- and intra-sequence parallel alignment of multiple sequences, combining a succinct WFA data representation with an efficient GPU implementation. As a result, we demonstrate that our implementation outperforms the original WFA implementation by 1.5-7.7X when computing the alignment path, and by 2.6-16X when computing only the alignment score. Moreover, compared to other state-of-the-art tools, WFA-GPU is up to 26.7X faster than other GPU implementations and up to four orders of magnitude faster than other CPU implementations.
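
    To make the wavefront idea concrete: for the edit-distance case, the WFA keeps, for each score s and diagonal k, only the furthest-reaching offset, and greedily extends along runs of matching characters. Below is a minimal single-threaded Python sketch of that recurrence; it illustrates the algorithmic idea only and is not the eWFA-GPU/WFA-GPU implementation, which is a massively parallel CUDA code with a succinct data layout.

        def wfa_edit_distance(a: str, b: str) -> int:
            """Exact edit distance via wavefronts of furthest-reaching offsets."""
            n, m = len(a), len(b)
            target_k, target_offset = n - m, n   # diagonal and offset of the end point
            NEG = float("-inf")

            def extend(i, k):
                # Greedily follow the run of matching characters along diagonal k.
                j = i - k
                while i < n and j < m and a[i] == b[j]:
                    i, j = i + 1, j + 1
                return i

            wave = {0: extend(0, 0)}             # score 0: only diagonal 0 is reachable
            score = 0
            while wave.get(target_k, NEG) < target_offset:
                score += 1
                nxt = {}
                for k in range(-score, score + 1):
                    best = max(
                        wave.get(k, NEG) + 1,      # mismatch: consume a[i] and b[j]
                        wave.get(k - 1, NEG) + 1,  # gap in b: consume a[i] only
                        wave.get(k + 1, NEG),      # gap in a: consume b[j] only
                    )
                    if best == NEG or best > n or not (0 <= best - k <= m):
                        continue                   # offset falls outside the DP matrix
                    nxt[k] = extend(best, k)
                wave = nxt
            return score

        print(wfa_edit_distance("kitten", "sitting"))   # -> 3

    Because each wavefront depends only on the previous one and its diagonals are independent, the per-score loop maps naturally onto GPU threads, which is the parallelism the thesis exploits.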

    ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis

    Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures whose states and transitions are annotated with probabilities. These probabilities are subsequently used to compute the similarity score between a sequence and a pHMM graph. The Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these probabilities to optimize and compute similarity scores. However, the Baum-Welch algorithm is computationally intensive, and existing solutions offer either software-only or hardware-only approaches with fixed pHMM designs. We identify an urgent need for a flexible, high-performance, and energy-efficient HW/SW co-design to address the major inefficiencies in the Baum-Welch algorithm for pHMMs. We introduce ApHMM, the first flexible acceleration framework designed to significantly reduce both the computational and energy overheads associated with the Baum-Welch algorithm for pHMMs. ApHMM tackles the major inefficiencies in the Baum-Welch algorithm by 1) designing flexible hardware to accommodate various pHMM designs, 2) exploiting predictable data dependency patterns through on-chip memory with memoization techniques, 3) rapidly filtering out negligible computations using a hardware-based filter, and 4) minimizing redundant computations. ApHMM achieves substantial speedups of 15.55x-260.03x, 1.83x-5.34x, and 27.97x over CPU, GPU, and FPGA implementations of the Baum-Welch algorithm, respectively. ApHMM outperforms state-of-the-art CPU implementations in three key bioinformatics applications, 1) error correction, 2) protein family search, and 3) multiple sequence alignment, by 1.29x-59.94x, 1.03x-1.75x, and 1.03x-1.95x, respectively, while improving their energy efficiency by 64.24x-115.46x, 1.75x, and 1.96x. Comment: Accepted to ACM TAC
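
    Baum-Welch repeatedly evaluates forward and backward recurrences over these probabilities to re-estimate the model parameters; the forward pass below is the kind of recurrence the accelerator targets. This is only a generic textbook sketch in Python/NumPy, with made-up matrices and observations, not ApHMM's hardware design.

        import numpy as np

        def forward_likelihood(obs, A, B, pi):
            # Forward pass: P(observation sequence | HMM). Baum-Welch evaluates this
            # (plus a backward pass) on every iteration to re-estimate A, B, and pi.
            alpha = pi * B[:, obs[0]]              # initialization
            for o in obs[1:]:
                alpha = (alpha @ A) * B[:, o]      # induction over time steps
            return alpha.sum()                     # termination

        # Two hidden states, two observable symbols (0/1); numbers are illustrative only.
        A = np.array([[0.7, 0.3], [0.4, 0.6]])     # state transition probabilities
        B = np.array([[0.9, 0.1], [0.2, 0.8]])     # emission probabilities
        pi = np.array([0.5, 0.5])                  # initial state distribution
        print(forward_likelihood([0, 1, 1], A, B, pi))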

    High Performance Computing for DNA Sequence Alignment and Assembly

    Recent advances in DNA sequencing technology have dramatically increased the scale and scope of DNA sequencing. These data are used for a wide variety of important biological analyses, including genome sequencing, comparative genomics, transcriptome analysis, and personalized medicine, but such analyses are complicated by the volume and complexity of the data involved. Given the massive size of these datasets, computational biology must draw on the advances of high performance computing. Two fundamental computations in computational biology are read alignment and genome assembly. Read alignment maps short DNA sequences to a reference genome to discover conserved and polymorphic regions of the genome. Genome assembly computes the sequence of a genome from many short DNA sequences. Both computations benefit from recent advances in high performance computing to efficiently process the huge datasets involved, including using highly parallel graphics processing units (GPUs) as high performance desktop processors, and using the MapReduce framework coupled with cloud computing to parallelize computation across large compute grids. This dissertation demonstrates how these technologies can be used to accelerate these computations by orders of magnitude, and how they have the potential to make otherwise infeasible computations practical.
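
    The MapReduce framing mentioned above splits read alignment into an embarrassingly parallel map phase (each read is aligned independently, e.g., by seed lookup) and a reduce phase that merges per-read results. The toy sketch below mimics that split with Python's multiprocessing; the reference, seed length, and helper names are invented for illustration and are not the dissertation's actual Hadoop-based code.

        from collections import Counter
        from multiprocessing import Pool

        REF = "ACGTACGTGGTACGTT"   # toy reference genome (hypothetical)
        K = 4                      # seed length for exact-match lookup

        # Seed index: every length-K substring of the reference and its positions.
        INDEX = {}
        for i in range(len(REF) - K + 1):
            INDEX.setdefault(REF[i:i + K], []).append(i)

        def map_read(read):
            # Map step: each read is "aligned" independently by looking up its first seed.
            return INDEX.get(read[:K], [])

        def reduce_coverage(mapped):
            # Reduce step: merge per-read hits into a coverage histogram.
            cov = Counter()
            for positions in mapped:
                cov.update(positions)
            return cov

        if __name__ == "__main__":
            reads = ["ACGTAC", "GGTACG", "TTTTTT"]
            with Pool(2) as pool:                 # the map phase runs in parallel workers
                mapped = pool.map(map_read, reads)
            print(reduce_coverage(mapped))        # merged result, as a reducer would produce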

    Accelerating edit-distance sequence alignment on GPU using the wavefront algorithm

    Sequence alignment remains a fundamental problem with practical applications ranging from pattern recognition to computational biology. Traditional algorithms based on dynamic programming are hard to parallelize, require significant amounts of memory, and fail to scale for large inputs. This work presents eWFA-GPU, a GPU (graphics processing unit)-accelerated tool to compute the exact edit-distance sequence alignment based on the wavefront alignment algorithm (WFA). This approach exploits the similarities between the input sequences to accelerate the alignment process while requiring less memory than other algorithms. Our implementation takes full advantage of the massive parallel capabilities of modern GPUs to accelerate the alignment process. In addition, we propose a succinct representation of the alignment data that successfully reduces the overall amount of memory required, allowing the exploitation of the fast shared memory of a GPU. Our results show that our GPU implementation outperforms the baseline edit-distance WFA implementation running on a 20-core machine by 3-9×. As a result, eWFA-GPU is up to 265× faster than state-of-the-art CPU implementations, and up to 56× faster than state-of-the-art GPU implementations. This work was supported in part by the European Union's Horizon 2020 Framework Program through the DeepHealth Project under Grant 825111; in part by the European Regional Development Fund (ERDF) Operational Program of Catalonia 2014-2020, with a grant of 50% of the total eligible cost, through the Designing RISC-V-based Accelerators for next-generation Computers Project under Grant 001-P-001723; in part by the Ministerio de Ciencia e Innovacion (MCIN)/Agencia Estatal de Investigación (AEI)/10.13039/501100011033 under Contract PID2020-113614RB-C21 and Contract TIN2015-65316-P; and in part by the Generalitat de Catalunya (GenCat)-Departament de Recerca i Universitats (DIUiE) (GRR) under Contract 2017-SGR-313, Contract 2017-SGR-1328, and Contract 2017-SGR-1414. The work of Miquel Moreto was supported in part by the Spanish Ministry of Economy, Industry and Competitiveness through a Ramon y Cajal Fellowship under Grant RYC-2016-21104.

    Gerbil: A Fast and Memory-Efficient k-mer Counter with GPU Support

    A basic task in bioinformatics is the counting of k-mers in genome strings. The k-mer counting problem is to build a histogram of all substrings of length k in a given genome sequence. We present the open source k-mer counting software Gerbil that has been designed for the efficient counting of k-mers for k ≥ 32. Given the technology trend towards long reads of next-generation sequencers, support for large k becomes increasingly important. While existing k-mer counting tools suffer from excessive memory resource consumption or degrading performance for large k, Gerbil is able to efficiently support large k without much loss of performance. Our software implements a two-disk approach. In the first step, DNA reads are loaded from disk and distributed to temporary files that are stored on a working disk. In a second step, the temporary files are read again, split into k-mers, and counted via a hash table approach. In addition, Gerbil can optionally use GPUs to accelerate the counting step. For large k, we outperform state-of-the-art open source k-mer counting tools on large genome data sets. Comment: A short version of this paper will appear in the proceedings of WABI 201
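
    As a reference point for the problem statement above, k-mer counting reduces to a histogram over all length-k substrings. A naive in-memory Python version, illustrative only and nothing like Gerbil's two-disk, multi-threaded, GPU-assisted design, is simply:

        from collections import Counter

        def count_kmers(reads, k):
            # Histogram of every length-k substring (the k-mer counting problem).
            counts = Counter()
            for read in reads:
                for i in range(len(read) - k + 1):
                    counts[read[i:i + k]] += 1
            return counts

        print(count_kmers(["ACGTACGT", "CGTACG"], 4))

    The difficulty Gerbil addresses is that, for genome-scale inputs and large k, such a table no longer fits in memory, hence the partitioning onto temporary files before the hash-table counting step.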

    Objective review of de novo stand-alone error correction methods for NGS data

    The sequencing market has grown steadily over the last few years, with different approaches to reading DNA information, each prone to different types of errors. Multiple studies have demonstrated the impact of sequencing errors on different applications of next-generation sequencing (NGS), making error correction a fundamental initial step. Different methods in the literature use different approaches and fit different types of problems. We analyzed 50 methods divided into five main approaches (k-spectrum, suffix arrays, multiple-sequence alignment, read clustering, and probabilistic models). They are not published as part of a suite (stand-alone) and target raw, unprocessed data without an existing reference genome (de novo). These correctors handle one or more sequencing technologies using the same or different approaches. They face general challenges (sometimes with specific traits for specific technologies) such as repetitive regions, uncalled bases, and ploidy. Even assessing their performance is a challenge in itself, because of the approaches taken by the various authors, the unknown factor (de novo), and the behavior of the third-party tools employed in the benchmarks. This study aims to help researchers in the field advance the state of error correction, educators to have a brief but comprehensive companion, and bioinformaticians to choose the right tool for the right job. © 2016 John Wiley & Sons, Ltd. We want to thank our colleague Eloy Romero Alcale, who provided valuable advice regarding the structure of the document. This work was supported by Generalitat Valenciana [GRISOLIA/2013/013 to A.A.]. Alic, A. S.; Ruzafa, D.; Dopazo, J.; Blanquer Espert, I. (2016). Objective review of de novo stand-alone error correction methods for NGS data. Wiley Interdisciplinary Reviews: Computational Molecular Science, 6(2), 111-146. https://doi.org/10.1002/wcms.1239
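
    Of the five approaches listed above, the k-spectrum family is the most common: k-mers that occur frequently across the dataset are assumed correct ("trusted"), and a read is edited until all of its k-mers are trusted. The deliberately simplified Python sketch below captures that idea; the threshold, names, and single-substitution search are invented for illustration and do not correspond to any specific tool reviewed.

        from collections import Counter

        def kspectrum_correct(reads, k=15, min_count=3):
            # k-mers seen at least min_count times are "trusted"; anything rarer is
            # assumed to contain a sequencing error.
            counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
            trusted = {kmer for kmer, c in counts.items() if c >= min_count}

            def is_solid(read):
                return all(read[i:i + k] in trusted for i in range(len(read) - k + 1))

            corrected = []
            for read in reads:
                fixed = read
                if not is_solid(read):
                    # Try every single-base substitution and keep the first one that
                    # makes all k-mers of the read trusted again.
                    for pos in range(len(read)):
                        for base in "ACGT":
                            cand = read[:pos] + base + read[pos + 1:]
                            if is_solid(cand):
                                fixed = cand
                                break
                        if fixed != read:
                            break
                corrected.append(fixed)
            return corrected

    Real correctors differ mainly in how they pick the threshold, handle multiple errors per read, and keep the k-mer spectrum memory-efficient, which is where much of the benchmarking difficulty discussed in the review comes from.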