163 research outputs found

State-of-the-art in Smith-Waterman Protein Database Search on HPC Platforms

Author: Botella Guillermo
De Giusti Armando Eduardo
García Sánchez Carlos
Naiouf Marcelo
Prieto-Matías Manuel
Rucci Enzo
Wong Ka-Chun
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 07/09/2020

Searching biological sequence databases is a common and repeated task in bioinformatics and molecular biology. The Smith–Waterman algorithm is the most accurate method for this kind of search. Unfortunately, the algorithm is computationally demanding, and the exponential growth of biological data in recent years makes the situation worse. For that reason, the scientific community has made great efforts to accelerate Smith–Waterman biological database searches on a wide variety of hardware platforms. We give a survey of the state of the art in Smith–Waterman protein database search, focusing on four hardware architectures: central processing units, graphics processing units, field programmable gate arrays and Xeon Phi coprocessors. After briefly describing each hardware platform, we analyse the temporal evolution, contributions, limitations, experimental work and results of each implementation. Additionally, as energy efficiency is becoming more important every day, we also survey works on performance and power consumption. Finally, we give our view on the future of Smith–Waterman protein searches, considering the next generations of hardware architectures and their upcoming technologies.

Instituto de Investigación en Informática; Universidad Complutense de Madri

Servicio de Difusión de la Creación Intelectual

Smith-Waterman Acceleration in Multi-GPUs: A Performance per Watt Analysis

Author: Melo Alba
Pérez-Serrano Jesús
Sandes Edans
Ujaldon-Martinez Manuel
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 01/01/2017

Article published in the conference proceedings.

We present a performance per watt analysis of CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences on multi-GPU platforms using the exact Smith-Waterman method. Speed-up factors and energy consumption are monitored at different stages of the algorithm with the goal of identifying advantageous scenarios to maximize acceleration and minimize power consumption. Experimental results using CUDA on a set of GeForce GTX 980 GPUs illustrate their capabilities as high-performance and low-power devices, with an energy cost that becomes more attractive as the number of GPUs increases. Overall, our results demonstrate a good correlation between the performance attained and the extra energy required, even in scenarios where multi-GPUs do not show great scalability.

Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech

Repositorio Institucional Universidad de Málaga

FPGA acceleration of sequence analysis tools in bioinformatics

Author: Mahram Atabak
Publication venue: Boston University
Publication date: 01/01/2013

Thesis (Ph.D.), Boston University

With advances in biotechnology and computing power, biological data are being produced at an exceptional rate. The purpose of this study is to analyze the application of FPGAs to accelerate high-impact production biosequence analysis tools. Compared with other alternatives, FPGAs offer huge compute power, lower power consumption, and reasonable flexibility. BLAST has become the de facto standard in bioinformatic approximate string matching, so its acceleration is of fundamental importance. It is a complex, highly optimized system, consisting of tens of thousands of lines of code and a large number of heuristics. Our idea is to emulate the main phases of its algorithm on an FPGA. Using our FPGA engine, we quickly reduce the size of the database to a small fraction, and then use the original code to process the query. Using a standard FPGA-based system, we achieved a 12x speedup over a highly optimized multithreaded reference code. Multiple Sequence Alignment (MSA), the extension of pairwise sequence alignment to multiple sequences, is critical to solving many biological problems. Previous attempts to accelerate Clustal-W, the most commonly used MSA code, have directly mapped a portion of the code to the FPGA. We use a new approach: we apply prefiltering of the kind commonly used in BLAST to perform the initial all-pairs alignments. This results in a speedup of 80x to 190x over the CPU code (8 cores). The quality is comparable to the original according to a commonly used benchmark suite evaluated with respect to multiple distance metrics. The challenge in FPGA-based acceleration is finding a suitable application mapping. Unfortunately, many software heuristics do not fall into this category, so other methods must be applied. One is restructuring: an entirely new algorithm is applied. Another is to analyze application utilization and develop accuracy/performance tradeoffs. Using our prefiltering approach and novel FPGA programming models, we have achieved significant speedup over reference programs. We have applied approximation, seeding, and filtering to this end. The bulk of this study introduces the pros and cons of these acceleration models for biosequence analysis tools.
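The prefiltering idea in this abstract (shrink the database cheaply, then run the original code on what remains) can be sketched in software as follows. This is a simplified shared k-mer filter written for illustration only; it is not the thesis's FPGA design, and the names `kmers` and `prefilter`, the word length `k` and the threshold `min_hits` are hypothetical.

```python
def kmers(seq, k):
    """Set of all length-k substrings (seed words) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter(query, database, k=3, min_hits=1):
    """Keep only database entries sharing at least min_hits k-mers with the query.

    The surviving entries would then be passed to the full, unmodified
    alignment code. Parameters are illustrative assumptions.
    """
    query_words = kmers(query, k)
    return {
        name: seq
        for name, seq in database.items()
        if len(query_words & kmers(seq, k)) >= min_hits
    }


if __name__ == "__main__":
    db = {"a": "TTACGTT", "b": "GGGGGG"}  # toy data
    print(prefilter("ACGTAC", db))
```

Raising `min_hits` or `k` discards more of the database at the cost of possibly missing weak matches, which is the accuracy/performance tradeoff the abstract mentions.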

Boston University Institutional Repository (OpenBU)

LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment

Author: Buluç Aydın
Ding Nan
Ellis Marquita
Guidi Giulia
Hofmeyr Steven
Oliker Leonid
Santambrogio Marco D.
Yelick Katherine
Zeni Alberto
Publication date: 01/01/2020

Pairwise sequence alignment is one of the most computationally intensive kernels in genomic data analysis, accounting for more than 90% of the runtime for key bioinformatics applications. This method is particularly expensive for third-generation sequences due to the high computational cost of analyzing sequences of length between 1Kb and 1Mb. Given the quadratic overhead of exact pairwise algorithms for long alignments, the community primarily relies on approximate algorithms that search only for high-quality alignments and stop early when one is not found. In this work, we present the first GPU optimization of the popular X-drop alignment algorithm, which we named LOGAN. Results show that our high-performance multi-GPU implementation achieves up to 181.6 GCUPS and speed-ups of up to 6.6x and 30.7x using 1 and 6 NVIDIA Tesla V100 GPUs, respectively, over the state-of-the-art software running on two IBM Power9 processors using 168 CPU threads, with equivalent accuracy. We also demonstrate a 2.3x LOGAN speed-up versus ksw2, a state-of-the-art vectorized algorithm for sequence alignment implemented in minimap2, a long-read mapping software. To highlight the impact of our work on a real-world application, we couple LOGAN with a many-to-many long-read alignment software called BELLA, and demonstrate that our implementation improves the overall BELLA runtime by up to 10.6x. Finally, we adapt the Roofline model for LOGAN and demonstrate that our implementation is near-optimal on the NVIDIA Tesla V100s.
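The early-termination rule behind X-drop can be illustrated with a short Python sketch. Note the simplification: this version extends without gaps, whereas LOGAN implements a gapped X-drop on the GPU, so it shows only the stopping criterion. The function name `xdrop_extend` and the scoring values are assumptions made for this example.

```python
def xdrop_extend(a, b, x, match=1, mismatch=-1):
    """Ungapped extension of two sequences from their first positions.

    Stops as soon as the running score falls more than x below the best
    score seen so far. Returns (best_score, length_at_best_score).
    Scoring values are illustrative assumptions.
    """
    best = score = 0
    best_len = 0
    for i, (ca, cb) in enumerate(zip(a, b), start=1):
        score += match if ca == cb else mismatch
        if score > best:
            best, best_len = score, i
        elif best - score > x:
            break  # X-drop condition: give up on this extension
    return best, best_len


if __name__ == "__main__":
    # Four matches, then a run of mismatches that triggers the drop-off.
    print(xdrop_extend("AAAATTTT", "AAAACCCC", x=2))
```

A larger `x` tolerates longer low-scoring stretches before giving up, trading extra work for sensitivity.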

arXiv.org e-Print Archive

eScholarship - University of California

Acceleration by Inline Cache for Memory-Intensive Algorithms on FPGA via High-Level Synthesis

Author: Arif Arslan
Lavagno Luciano
Lazarescu Mihai Teodor
Ma Liang
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017

Using FPGA-based acceleration of high-performance computing (HPC) applications to reduce energy and power consumption is becoming an interesting option, thanks to the availability of high-level synthesis (HLS) tools that enable fast design cycles. However, obtaining good performance for memory-intensive algorithms, which often exchange large data arrays with external DRAM, still requires time-consuming optimization and good knowledge of hardware design. This article proposes a new design methodology, based on dedicated application- and data array-specific caches. These caches provide most of the benefits that can be achieved by hand-coding optimized DMA-like transfer strategies into the HPC application code, but require only limited manual tuning (basically the selection of architecture and size), are neutral to the target HLS tool and technology (FPGA or ASIC), and do not require changes to application code. We show experimental results obtained on five common memory-intensive algorithms from very diverse domains, namely machine learning, data sorting, and computer vision. We test the cost and performance of our caches against both out-of-the-box code originally optimized for a GPU and manually optimized implementations specifically targeted at FPGAs via HLS. The implementation using our caches achieved an 8X speedup and a 2X energy reduction on average with respect to out-of-the-box models using only simple directive-based optimizations (e.g., pipelining). They also achieved comparable performance with much less design effort when compared with the versions that were manually optimized to achieve efficient memory transfers specifically for an FPGA.

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

A Study on Efficient Application Mapping onto Parallel Computing Accelerators

Author: Dohi Keisuke
Publication date: 20/03/2014

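Nagasaki University doctoral dissertation. Degree number: Doctor (Engineering), Kō No. 3. Date conferred: 20 March 2014 (Heisei 26). Nagasaki University (長崎大学), course-based doctorate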

Nagasaki University's Academic Output SITE: NAOSITE

Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power

Author: Alok Choudhary
Ankit Agrawal
Md Mostofa Ali Patwary
Sanchit Misra
Wei-keng Liao
Yuhong Zhang
Zhiguang Qin
Publication venue: BioMed Central
Publication date: 01/01/2012

Crossref

Springer - Publisher Connector

PubMed Central

CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions

Author: Adrianto Wirawan
Bertil Schmidt
Yongchao Liu
Publication venue: 'Springer Science and Business Media LLC'

Crossref

An energy‐aware performance analysis of SWIMM: Smith–Waterman implementation on Intel's Multicore and Manycore architectures

Author: Botella Juan Guillermo
De Giusti Armando Eduardo
García Sanchez Carlos
Naiouf Marcelo
Prieto-Matias Manuel
Rucci Enzo
Publication date: 07/10/2019

Alignment is essential in many areas, such as biological, chemical and criminal forensics. The well‐known Smith–Waterman (SW) algorithm retrieves the optimal local alignment with quadratic time and space complexity. Several implementations exploit parallel hardware, such as manycores, FPGAs or GPUs, to reduce the alignment effort. In this research, we adapt, develop and tune the SW implementation named SWIMM on a heterogeneous platform based on Intel's Xeon processor and Xeon Phi coprocessor. SWIMM is a free tool available in a public git repository: https://github.com/enzorucci/SWIMM. We efficiently exploit data and thread‐level parallelism, reaching up to 380 GCUPS on the heterogeneous architecture, 350 GCUPS on the isolated Xeon and 50 GCUPS on the Xeon Phi. Although the heterogeneous implementation obtains the best performance, it is also the most energy‐demanding. We therefore also present a trade‐off analysis between performance and power consumption. The greenest configuration is based on an isolated multicore system that exploits the AVX2 instruction set architecture, reaching 1.5 GCUPS/Watt.

Facultad de Informátic
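For readers unfamiliar with the units above, GCUPS (giga cell updates per second) is conventionally the number of dynamic-programming cells computed, query length times total database residues, divided by the runtime in seconds and by 10^9. The sketch below shows that arithmetic together with the GCUPS/Watt ratio. The function names and all input numbers are hypothetical and are not measurements from the paper.

```python
def gcups(query_len, db_residues, seconds):
    """Giga cell updates per second for one query against a database."""
    cells = query_len * db_residues  # one DP cell per residue pair
    return cells / (seconds * 1e9)


def gcups_per_watt(gcups_value, avg_watts):
    """Energy efficiency: throughput divided by average power draw."""
    return gcups_value / avg_watts


if __name__ == "__main__":
    # Hypothetical run: 1000-residue query, 2e9 database residues, 10 s.
    throughput = gcups(1000, 2_000_000_000, 10)
    print(throughput, gcups_per_watt(throughput, 160.0))
```

Comparing configurations on GCUPS/Watt rather than raw GCUPS is what allows a slower but less power-hungry system to be ranked as the greenest one.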