Smith-Waterman Acceleration in Multi-GPUs: A Performance per Watt Analysis
Article published in the conference proceedings. We present a performance-per-watt analysis of CUDAlign 4.0, a parallel strategy for obtaining the optimal alignment of huge DNA sequences on multi-GPU platforms using the exact Smith-Waterman method. Speed-up factors and energy consumption are monitored at different stages of the algorithm, with the goal of identifying scenarios that maximize acceleration and minimize power consumption. Experimental results using CUDA on a set of GeForce GTX 980 GPUs illustrate their capabilities as high-performance, low-power devices, with an energy cost that becomes more attractive as the number of GPUs increases. Overall, our results show a good correlation between the performance attained and the extra energy required, even in scenarios where multi-GPU configurations do not scale well.
Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech
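The exact Smith-Waterman method fills a score matrix with one cell per pair of residues, and throughput is usually reported in cell updates per second (CUPS). The sketch below is a minimal illustration under stated assumptions: it uses a linear gap penalty (CUDAlign itself uses affine gaps), hypothetical default scores, and hypothetical function names. It computes the optimal local score and derives performance per watt from a measured runtime and average power draw.

```python
def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Optimal local alignment score with a linear gap penalty (two-row DP)."""
    prev = [0] * (len(b) + 1)        # row i-1 of the DP matrix H
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)     # H[i][0] = 0
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: never let a cell drop below 0.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def gcups_per_watt(cells: float, seconds: float, avg_watts: float) -> float:
    """Giga cell updates per second divided by average power (GCUPS/W).

    Equivalently, billions of cells computed per joule, since
    energy (J) = avg_watts * seconds.
    """
    gcups = cells / seconds / 1e9
    return gcups / avg_watts
```

For example, a hypothetical comparison of 10^10 cells that finishes in 10 s at an average of 100 W delivers 1 GCUPS and 0.01 GCUPS/W, consuming 1000 J.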
SaLoBa: Maximizing Data Locality and Workload Balance for Fast Sequence Alignment on GPUs
Sequence alignment forms an important backbone of many sequencing applications. A commonly used strategy for sequence alignment is approximate string matching with a two-dimensional dynamic programming approach. Although some prior work has addressed GPU acceleration of sequence alignment, we identify several shortcomings that prevent it from exploiting the full computational capability of modern GPUs. This paper presents SaLoBa, a GPU-accelerated sequence alignment library focused on seed extension. Based on an analysis of previous work with real-world sequencing data, we propose techniques to exploit data locality and improve workload balancing. The experimental results reveal that SaLoBa significantly improves the seed extension kernel compared to state-of-the-art GPU-based methods.
Comment: Published at IPDPS'2
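Seed extension grows an exact-match seed into a longer alignment and stops once the score falls too far below the best value seen so far. The sketch below is a simplified, ungapped X-drop extension in one direction. The match/mismatch scores, the drop threshold `xdrop` and the function name are assumptions for illustration. SaLoBa's GPU kernel performs gapped extension and is not reproduced here.

```python
def xdrop_extend_right(query: str, target: str, qi: int, ti: int,
                       match: int = 1, mismatch: int = -1,
                       xdrop: int = 2) -> tuple[int, int]:
    """Ungapped X-drop extension starting at query[qi] / target[ti].

    Returns (best_score, best_length): the highest running score reached
    and how many positions were consumed to reach it.
    """
    score = best = best_len = 0
    length = 0
    while qi + length < len(query) and ti + length < len(target):
        same = query[qi + length] == target[ti + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif score < best - xdrop:
            break  # dropped more than X below the best score: stop extending
    return best, best_len
```

A bidirectional extender would run the same loop leftwards on reversed prefixes. Because different seeds stop after very different lengths, per-seed work varies, which is one plausible reason why workload balancing matters for this kernel on a GPU.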
Fast reconstruction of 3D volumes from 2D CT projection data with GPUs
Meso-scale finite element (meso-F.E.) modelling of 3D textile composites is a powerful tool that can help determine the mechanical properties and permeability of reinforcements or composites. The quality of meso-F.E. analyses depends on the quality of the initial model. A direct method based on X-ray tomography imaging is introduced to build finite element models from the real geometry of 3D composite reinforcements. The method is particularly suitable for 3D textile reinforcements, whose internal geometries are numerous and complex. An analysis of the image texture is performed. A hyperelastic model developed for fibre bundles is used to simulate the deformation of the 3D reinforcement. © EDP Sciences, 2016
GPU acceleration of Levenshtein distance computation between long strings
Computing the edit distance between very long strings has been hampered by time complexity that is quadratic in the string length. The WFA algorithm reduces the time complexity so that it is quadratic in the edit distance between the strings instead. This work presents a GPU implementation of the WFA algorithm, together with a new optimization that can halve the number of elements to be computed, providing additional performance gains. The implementation makes it possible to compute the edit distance between strings of hundreds of millions of characters. The performance of the algorithm depends on the similarity between the strings. For strings longer than a million characters, the performance is the best ever reported: above one TCUPS for strings with similarities greater than 70%, and above one hundred TCUPS for 99.9% similarity.
This research was supported by the European Union Regional Development Fund (ERDF) within the framework of the ERDF Operational Program of Catalonia 2014–2020 with a grant of 50% of the total cost eligible under the Designing RISC-V based Accelerators for next generation computers project (DRAC) [001-P-001723], in part by the Catalan Government under grant 2017-SGR-1624, and in part by the Spanish Ministry of Science, Innovation and Universities under grant RTI2018-095209-B-C22.
Peer Reviewed. Postprint (published version)
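WFA computes the edit distance by advancing, one score at a time, a wavefront of furthest-reaching points on each diagonal of the DP matrix, so the work grows with the distance rather than with the full matrix size. Below is a minimal sequential sketch of the unit-cost (Levenshtein) case, assuming plain dictionaries stand in for the wavefront arrays. It includes neither the paper's GPU parallelization nor its halving optimization.

```python
def wfa_edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via the wavefront algorithm (WFA).

    Diagonal k = h - v, where v indexes `a` and h indexes `b`; each
    wavefront maps a diagonal to the furthest h offset reachable with score s.
    """
    n, m = len(a), len(b)
    target_k = m - n              # diagonal on which cell (n, m) lies
    wf = {0: 0}                   # the score-0 wavefront starts at (0, 0)
    s = 0
    while True:
        # Extend: matches cost nothing, so slide along each diagonal.
        for k in list(wf):
            h = wf[k]
            v = h - k
            while v < n and h < m and a[v] == b[h]:
                v += 1
                h += 1
            wf[k] = h
        if wf.get(target_k) == m:
            return s
        # Build the score-(s+1) wavefront from the score-s one.
        nxt = {}
        for k in range(min(wf) - 1, max(wf) + 2):
            cands = []
            if k in wf:
                cands.append(wf[k] + 1)      # substitution: v+1, h+1
            if k - 1 in wf:
                cands.append(wf[k - 1] + 1)  # insertion: h+1
            if k + 1 in wf:
                cands.append(wf[k + 1])      # deletion: v+1
            # Keep only offsets that stay inside the DP matrix.
            valid = [h for h in cands if h <= m and 0 <= h - k <= n]
            if valid:
                nxt[k] = max(valid)
        wf = nxt
        s += 1
```

The number of wavefronts equals the distance plus one and each spans at most 2s + 1 diagonals, which is why highly similar strings are processed much faster than dissimilar ones.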
Formalization of block pruning: reducing the number of cells computed in exact biological sequence comparison algorithms
This is a pre-copyedited, author-produced version of an article accepted for publication in The Computer Journal following peer review. The version of record (Edans F O Sandes, George L M Teodoro, Maria Emilia M T Walter, Xavier Martorell, Eduard Ayguade, Alba C M A Melo; Formalization of Block Pruning: Reducing the Number of Cells Computed in Exact Biological Sequence Comparison Algorithms, The Computer Journal, Volume 61, Issue 5, 1 May 2018, Pages 687–713) is available online at The Computer Journal: https://academic.oup.com/comjnl/article-abstract/61/5/687/4539903 and https://doi.org/10.1093/comjnl/bxx090.
Biological sequence comparison algorithms that compute the optimal local and global alignments calculate a dynamic programming (DP) matrix with quadratic time complexity. The DP matrix $H$ is calculated with a recurrence relation in which the value of each cell $H_{i,j}$ is the maximum of the values of cells $H_{i-1,j-1}$, $H_{i-1,j}$ and $H_{i,j-1}$, each increased or decreased by a constant value. Therefore, the difference between the value of the cell $H_{i,j}$ being calculated and the values of its previously computed direct neighbors respects well-defined upper and lower bounds. Using these bounds, we show that it is possible to determine the maximum and minimum values of every cell in $H$ with respect to a given reference cell. We use this result to define a generic pruning method that determines which cells can be pruned (i.e., need not be computed because they will not contribute to the final solution), accelerating the computation while still guaranteeing that the optimal result is produced. The goal of this paper is thus to investigate and formalize properties of the DP matrix in order to estimate and increase the efficiency of the pruning method.
We also show that the pruning efficiency depends mainly on three characteristics: (a) the order in which the cells of $H$ are calculated, (b) the values of the parameters used in the recurrence relation, and (c) the contents of the sequences compared.
Peer Reviewed. Postprint (author's final draft)
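The bound argument can be illustrated at cell granularity. The minimal sketch below uses linear gaps and hypothetical scores rather than the paper's block-level formulation. A cell whose value plus the maximum score still obtainable after it, `match * min(n - i, m - j)`, cannot exceed the best score found so far is zeroed so that it cannot propagate, and the optimal score is unchanged. Block pruning applies this kind of test to whole blocks so that their computation is skipped, which this sequential sketch does not attempt.

```python
def sw_score_pruned(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap: int = -2, prune: bool = True) -> tuple[int, int]:
    """Smith-Waterman score with a per-cell pruning test.

    Returns (best_score, pruned_cells). Assumes match > 0 and
    mismatch, gap <= 0, so match * (remaining diagonal steps) bounds the
    score any alignment can still gain after a cell.
    """
    n, m = len(a), len(b)
    prev = [0] * (m + 1)
    best = 0
    pruned = 0
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, h)
            # Upper bound for any alignment continuing through (i, j):
            # at most min(n - i, m - j) further matches can be appended.
            if prune and h + match * min(n - i, m - j) <= best:
                h = 0           # cannot beat `best`: discard this cell's value
                pruned += 1
            cur[j] = h
        prev = cur
    return best, pruned
```

As the abstract notes, how much gets pruned depends on the sequences: highly similar inputs raise `best` early, so many off-diagonal cells fail the bound.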
Evaluation of alignment-free strategies to determine the pruning factor in parallel sequence comparison
Undergraduate final project (Trabalho de Conclusão de Curso), Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2019.
Sequence comparison is one of the most basic operations in Bioinformatics. Alignment-based exact methods produce optimal results, but they require huge execution times when the sequences are long. Block pruning is an optimization that can considerably reduce execution times if the sequences involved are highly similar. To predict the execution times of comparisons with pruning, a multiple regression formula was proposed. However, this formula requires that the block pruning rate be known before the comparison, which is not the case, since the pruned area is computed during the alignment process. This graduation project evaluates different alignment-free algorithms, which execute much faster than the exact methods, with the goal of generating block pruning values before the alignment itself. The evaluation of various algorithms showed that the ALF-N2 tool (ALignment-Free framework) provided very good results using its N2 algorithm. With ALF-N2, we were able to establish a correlation between block pruning percentages and sequence similarity. Using the proposed strategy, we obtained block pruning rates very close to the real ones, with maximum errors of 7.0% for small sequences and 3.7% for longer ones.
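Alignment-free methods compare word (k-mer) statistics instead of filling a DP matrix, which is why they run in time roughly linear in the sequence lengths. The sketch below shows the classic D2 statistic and a cosine-normalized variant. This is an illustrative assumption, not the N2 algorithm of ALF-N2. Mapping such a similarity score to a predicted block pruning rate would additionally require a regression fitted on measured data, which is not shown.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer (word of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x: str, y: str, k: int) -> int:
    """D2 statistic: sum over shared k-mers w of count_x(w) * count_y(w)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(c * cy[w] for w, c in cx.items())


def d2_cosine(x: str, y: str, k: int) -> float:
    """D2 normalized to [0, 1]: cosine of the two k-mer count vectors."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    dot = sum(c * cy[w] for w, c in cx.items())
    norm = sqrt(sum(c * c for c in cx.values())) * sqrt(sum(c * c for c in cy.values()))
    return dot / norm if norm else 0.0
```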
DynaProg for Scala
Dynamic programming is an algorithmic technique for solving problems that follow Bellman's principle of optimality: optimal solutions depend on optimal solutions to subproblems. The core idea behind dynamic programming is to memoize intermediate results in matrices to avoid recomputing them. Solving a dynamic programming problem consists of two phases: filling one or more matrices with intermediate solutions for subproblems, and reconstructing how the final result was obtained (backtracking). In textbooks, problems are usually described in terms of recurrence relations between matrix elements. Expressing dynamic programming problems as recursive formulae over matrix indices can be difficult and is often error-prone, and the notation does not capture the essence of the underlying problem (for example, aligning two sequences). Moreover, writing correct and efficient parallel implementations requires different skills and often a significant amount of time. In this project, we present DynaProg, a domain-specific language (DSL) embedded in Scala for addressing dynamic programming problems on heterogeneous platforms. DynaProg allows the programmer to write concise programs based on ADP [1], using a pair consisting of a parsing grammar and an algebra; these programs can then be executed either on the CPU or on the GPU. We evaluate the performance of our implementation against existing work and against our own hand-optimized baseline implementations, for both the CPU and GPU versions. Experimental results show that plain Scala has a large overhead and is recommended only for small sequences (≤1024), whereas the generated GPU version is comparable with existing implementations: matrix chain multiplication has the same performance as our hand-optimized version (142% of the execution time of [2]) for a sequence of 4096 matrices, Smith-Waterman is twice as slow as [3] on a pair of sequences of 6144 elements, and RNA folding is on par with [4] (95% of its running time) for sequences of 4096 elements.
[1] Robert Giegerich and Carsten Meyer. Algebraic Dynamic Programming.
[2] Chao-Chin Wu, Jenn-Yang Ke, Heshan Lin and Wu Chun Feng. Optimizing dynamic programming on graphics processing units via adaptive thread-level parallelism.
[3] Edans Flavius de O. Sandes and Alba Cristina M. A. de Melo. Smith-Waterman alignment of huge sequences with GPU in linear space.
[4] Guillaume Rizk and Dominique Lavenier. GPU accelerated RNA folding algorithm.
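Matrix chain multiplication, one of the benchmarks mentioned above, shows both phases: filling the matrices, then backtracking. The following sketch is a textbook bottom-up formulation in plain Python, not DynaProg's grammar-and-algebra DSL and not its generated CPU/GPU code. The input convention for `dims` (matrix Ai has shape dims[i-1] × dims[i]) is an assumption of this sketch.

```python
def matrix_chain_order(dims: list[int]) -> tuple[int, str]:
    """Minimum scalar multiplications for A1 x ... x An, plus a parenthesization.

    Matrix Ai has shape dims[i-1] x dims[i].
    """
    n = len(dims) - 1
    cost = [[0] * (n + 1) for _ in range(n + 1)]   # cost[i][j]: best for Ai..Aj
    split = [[0] * (n + 1) for _ in range(n + 1)]  # split[i][j]: best split point k
    # Phase 1: fill the matrices by increasing chain length.
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            best, best_k = None, i
            for k in range(i, j):
                c = cost[i][k] + cost[k + 1][j] + dims[i - 1] * dims[k] * dims[j]
                if best is None or c < best:
                    best, best_k = c, k
            cost[i][j], split[i][j] = best, best_k

    # Phase 2: backtrack through `split` to rebuild the optimal parenthesization.
    def paren(i: int, j: int) -> str:
        if i == j:
            return f"A{i}"
        k = split[i][j]
        return f"({paren(i, k)}{paren(k + 1, j)})"

    return cost[1][n], paren(1, n)
```

For instance, with dims [10, 30, 5, 60] the sketch returns 4500 multiplications with the grouping ((A1A2)A3).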
MASA-SSE: biological sequence comparison using vector instructions
Undergraduate monograph, Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2015.
Biological sequence comparison is one of the most basic and important operations in Bioinformatics. Exact methods for comparing two biological sequences have quadratic time complexity and, for this reason, parallel solutions are often used to accelerate their execution. The MASA framework [3] is a flexible and customizable parallel solution for biological sequence comparison that supports different hardware and software platforms. It was initially designed for GPU (Graphics Processing Unit) execution, but it now integrates two CPU solutions: MASA-CPU and MASA-OpenMP. These CPU solutions do not use vector instructions and thus miss the opportunity to exploit a high potential for parallelism. This graduation project proposes and evaluates MASA-SSE, a CPU solution that uses Intel's SSE vector instructions and implements Farrar's algorithm [6], which is considered the state of the art for biological sequence comparison with vector instructions. Experimental results obtained by comparing real DNA sequences on two different machines show that MASA-SSE, executing on one thread with vector instructions, outperforms MASA-OpenMP executing with four threads.
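Farrar's method relies on a striped layout: the query is split into `lanes` segments of length `seg_len`, and each SIMD vector packs one position from every segment. As a result, most dependencies between consecutive query positions run between vectors rather than between lanes inside a vector. The sketch below only builds that striped query profile in plain Python. Lists stand in for SSE registers, `score` is an assumed substitution function, and zero padding is an assumption. The full inner loop, including Farrar's lazy-F correction, is not reproduced.

```python
from collections.abc import Callable


def striped_profile(query: str, alphabet: str,
                    score: Callable[[str, str], int],
                    lanes: int) -> dict[str, list[list[int]]]:
    """Build a Farrar-style striped query profile.

    Returns {symbol: [vector_0, ..., vector_{seg_len-1}]}, each vector a list
    of `lanes` scores. Lane `l` of vector `t` holds query position
    t + l * seg_len; positions past the end of the query are padded with 0.
    """
    m = len(query)
    seg_len = (m + lanes - 1) // lanes  # number of vectors per symbol
    profile = {}
    for c in alphabet:
        vectors = []
        for t in range(seg_len):
            vec = []
            for lane in range(lanes):
                pos = t + lane * seg_len
                vec.append(score(query[pos], c) if pos < m else 0)
            vectors.append(vec)
        profile[c] = vectors
    return profile
```

The profile is built once per query, so the inner loop only needs a table lookup per database symbol and one vector add per segment.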