312 research outputs found
Simulating the behavior of the human brain on GPUs
The simulation of the behavior of the human brain is one of the most important challenges in computing today. The main problem consists of finding efficient ways to manipulate and compute the huge volume of data that this kind of simulation needs, using current technology. In this sense, this work is focused on one of the main steps of such a simulation, which consists of computing the voltage on the neurons' morphology. This is carried out using the Hines algorithm and, although this algorithm is the optimal method in terms of number of operations, it needs non-trivial modifications to be efficiently parallelized on GPUs. We propose several optimizations to accelerate this algorithm on GPU-based architectures, exploring the limitations of both the method and the architecture in order to solve a high number of Hines systems (neurons) efficiently. Each of the optimizations is analyzed and described in depth. Two different approaches are studied: one for mono-morphology simulations (batches of neurons with the same shape) and one for multi-morphology simulations (batches of neurons where every neuron has a different shape). In mono-morphology simulations we obtain good performance using just a single kernel to compute all the neurons. However, this turns out to be inefficient for multi-morphology simulations. Unlike the previous scenario, multi-morphology simulations require a much more complex implementation to obtain good performance. In this case, we must execute more than one GPU kernel: in every execution (kernel call) one specific part of the batch of neurons is solved. These parts can be seen as multiple independent tridiagonal systems. Although the present paper is focused on the simulation of the behavior of the human brain, some of these techniques, in particular those related to solving tridiagonal systems, can also be used in multiple oil and gas simulations. Our studies have proven that the optimizations proposed in the present work achieve high performance on computations with a high number of neurons, our GPU implementations being about 4× and 8× faster than the OpenMP multicore implementation (16 cores), using one and two NVIDIA K80 GPUs respectively. It is also important to highlight that these optimizations can continue scaling, even when dealing with a very high number of neurons.
This project has received funding from the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No. 720270 (HBP SGA1), from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P), and from the Departament d'Innovació, Universitats i Empresa de la Generalitat de Catalunya under project MPEXPAR: Models de Programació i Entorns d'Execució Paral·lels (2014-SGR-1051). We thank the support of NVIDIA through the BSC/UPC NVIDIA GPU Center of Excellence, and the European Union's Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie Grant Agreement No. 749516.
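To make the voltage computation concrete, here is a minimal sketch of how a batch of Hines systems can be solved with one GPU thread per neuron. It is not the paper's implementation: a symmetric coupling term, a system-major layout, and a parent array shared by all neurons (the mono-morphology case) are assumed, and all names are illustrative.

```cuda
#include <cuda_runtime.h>

// One thread solves one neuron (one Hines system). A Hines matrix is a
// tridiagonal-like, symmetrically structured matrix over a tree: node i
// is coupled to its parent p[i], with p[i] < i and node 0 as the root.
// For brevity a symmetric coupling term u is assumed; the parent array
// p is shared by all neurons (mono-morphology batch).
__global__ void hines_batch(double* d,        // diagonal, modified in place
                            const double* u,  // coupling to parent
                            double* rhs,      // right-hand side -> solution
                            const int* p,     // parent index of each node
                            int n, int num_neurons) {
    int sys = blockIdx.x * blockDim.x + threadIdx.x;
    if (sys >= num_neurons) return;
    int base = sys * n;  // system-major layout, for clarity only

    // Triangularisation: eliminate each node into its parent, sweeping
    // from the leaves (high indices) towards the root.
    for (int i = n - 1; i > 0; --i) {
        double f = u[base + i] / d[base + i];
        d[base + p[i]]   -= f * u[base + i];
        rhs[base + p[i]] -= f * rhs[base + i];
    }

    // Back-substitution from the root down the tree.
    rhs[base] /= d[base];
    for (int i = 1; i < n; ++i)
        rhs[base + i] = (rhs[base + i] - u[base + i] * rhs[base + p[i]]) / d[base + i];
}
```

For an unbranched morphology (p[i] = i - 1 everywhere) the sweep degenerates into the Thomas algorithm for a plain tridiagonal system, which is why the same machinery generalises to the tridiagonal solvers mentioned above.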
Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs
Many problems in geophysical and atmospheric modelling require the fast
solution of elliptic partial differential equations (PDEs) in "flat" three
dimensional geometries. In particular, an anisotropic elliptic PDE for the
pressure correction has to be solved at every time step in the dynamical core
of many numerical weather prediction models, and equations of a very similar
structure arise in global ocean models, subsurface flow simulations and gas and
oil reservoir modelling. The elliptic solve is often the bottleneck of the
forecast, and an algorithmically optimal method has to be used and implemented
efficiently. Graphics Processing Units have been shown to be highly efficient
for a wide range of applications in scientific computing, and recently
iterative solvers have been parallelised on these architectures. We describe
the GPU implementation and optimisation of a Preconditioned Conjugate Gradient
(PCG) algorithm for the solution of a three dimensional anisotropic elliptic
PDE for the pressure correction in NWP. Our implementation exploits the strong
vertical anisotropy of the elliptic operator in the construction of a suitable
preconditioner. As the algorithm is memory bound, performance can be improved
significantly by reducing the amount of global memory access. We achieve this
by using a matrix-free implementation which does not require explicit storage
of the matrix and instead recalculates the local stencil. Global memory access
can also be reduced by rewriting the algorithm using loop fusion and we show
that this further reduces the runtime on the GPU. We demonstrate the
performance of our matrix-free GPU code by comparing it to a sequential CPU
implementation and to a matrix-explicit GPU code which uses existing libraries.
The absolute performance of the algorithm for different problem sizes is
quantified in terms of floating point throughput and global memory bandwidth.
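As an illustration of the matrix-free idea, the sketch below applies a model anisotropic 7-point operator by recomputing the stencil on the fly instead of loading stored matrix entries, trading a few flops for a large reduction in global memory traffic, the right trade for a memory-bound algorithm. The coefficients (uniform horizontal coupling, a vertical coupling lambda) are illustrative assumptions, not the model's actual discretisation.

```cuda
#include <cuda_runtime.h>

// Matrix-free operator application v = A*u on an nx x ny x nz grid.
// One thread sweeps a whole vertical column, keeping the strongly
// coupled (anisotropic) direction inside a single thread.
__global__ void apply_stencil(const double* __restrict__ u,
                              double* __restrict__ v,
                              int nx, int ny, int nz, double lambda) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= nx || j >= ny) return;

    for (int k = 0; k < nz; ++k) {
        size_t idx = ((size_t)k * ny + j) * nx + i;
        // Stencil recomputed from indices: no matrix is ever stored.
        double s = -(4.0 + 2.0 * lambda) * u[idx];
        if (i > 0)      s += u[idx - 1];
        if (i < nx - 1) s += u[idx + 1];
        if (j > 0)      s += u[idx - nx];
        if (j < ny - 1) s += u[idx + nx];
        if (k > 0)      s += lambda * u[idx - (size_t)nx * ny];
        if (k < nz - 1) s += lambda * u[idx + (size_t)nx * ny];
        v[idx] = s;
    }
}
```

Loop fusion then goes one step further: for example, merging this operator application with the dot product the same PCG iteration needs into a single kernel, so the result vector is not written out and read back between the two steps.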
Parallelization of the ADI method exploring vector computing in GPUs
Integrated master's dissertation in Informatics Engineering.
The 2D convection-diffusion equation is a well-known problem in scientific simulation that is often solved with a direct method, requiring N³ operations for a system of N linear equations. This problem can be solved with a more efficient computational method, known as the alternating direction implicit (ADI) method. It solves a system of N linear equations in 2N steps of N operations each, implemented in two phases: one solving row by row, the other column by column. Each set of N operations is fully independent within each phase, which opens an opportunity for an embarrassingly parallel solution. This method also exploits the way matrices are stored in computer memory, either row-major or column-major, by splitting each iteration in two.
The major bottleneck of this method is solving the systems of linear equations. These systems can be described as tridiagonal matrices, since the elements are always stored on the three main diagonals. Algorithms tailored for tridiagonal matrices can significantly improve performance. These can be sequential (e.g., the Thomas algorithm) or parallel (e.g., cyclic reduction (CR) and parallel cyclic reduction (PCR)).
Current vector extensions in conventional scalar processing units, such as x86-64 and ARM devices, require the vector elements to be in contiguous memory locations to avoid performance penalties. To overcome these limitations in dot products, several approaches are proposed and evaluated in this work, both on general-purpose processing units and on specific accelerators, namely NVIDIA GPUs.
Profiling the code execution on a server based on x86-64 devices showed that the ADI method needs a combination of CPU compute power and memory transfer speed. This is best shown on a server based on the Intel manycore device, KNL, where the algorithm scales until the memory bandwidth is no longer enough to feed all 64 computing cores. A dual-socket server based on 16-core Xeon Skylake processors, with AVX-512 vector support, proved to be a better choice: the algorithm executes in less time and scales better.
The introduction of GPU computing to further improve execution performance (together with other optimisation techniques, namely a different thread scheme and shared memory to speed up the process) showed better results for larger grid sizes (above 32Ki x 32Ki). The CUDA development environment also showed better performance than OpenCL in most cases. The largest difference was with the hybrid CR-PCR, where the OpenCL code displayed a major performance improvement compared to CUDA. But even with this speedup, the best average time for the ADI method across all tested configurations on an NVIDIA GPU was obtained using CUDA on the most recent GPU available (Pascal architecture) with CR as the auxiliary method.
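Since CR and PCR carry much of the performance story here, a compact sketch of PCR may help. This is the textbook formulation, one system per block and one thread per unknown, not the dissertation's tuned code; names and layout are illustrative, and n is assumed to be a power of two small enough for shared memory.

```cuda
#include <cuda_runtime.h>

// Parallel cyclic reduction: one tridiagonal system per block, one
// thread per unknown. After log2(n) reduction steps every equation is
// decoupled and each thread solves its own 1x1 system. Launch as:
//   pcr_block<<<num_systems, n, 4 * n * sizeof(double)>>>(...);
__global__ void pcr_block(const double* a_g, const double* b_g,
                          const double* c_g, const double* d_g,
                          double* x_g, int n) {
    extern __shared__ double smem[];
    double* a = smem;          // sub-diagonal
    double* b = a + n;         // diagonal
    double* c = b + n;         // super-diagonal
    double* d = c + n;         // right-hand side
    int i = threadIdx.x;
    int base = blockIdx.x * n; // this block's system

    a[i] = a_g[base + i]; b[i] = b_g[base + i];
    c[i] = c_g[base + i]; d[i] = d_g[base + i];
    __syncthreads();

    for (int delta = 1; delta < n; delta *= 2) {
        // Eliminate the couplings to rows i - delta and i + delta.
        double na = 0.0, nc = 0.0, nb = b[i], nd = d[i];
        if (i - delta >= 0) {
            double alpha = -a[i] / b[i - delta];
            na  = alpha * a[i - delta];
            nb += alpha * c[i - delta];
            nd += alpha * d[i - delta];
        }
        if (i + delta < n) {
            double gamma = -c[i] / b[i + delta];
            nc  = gamma * c[i + delta];
            nb += gamma * a[i + delta];
            nd += gamma * d[i + delta];
        }
        __syncthreads();  // all reads done before anyone overwrites
        a[i] = na; b[i] = nb; c[i] = nc; d[i] = nd;
        __syncthreads();  // all writes done before the next round reads
    }
    x_g[base + i] = d[i] / b[i];
}
```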
Reducing memory requirements for large size LBM simulations on GPUs
The scientific community, on its never-ending road towards larger and more efficient computational resources, needs implementations that can adapt efficiently to current parallel platforms. Graphics processing units are an appropriate platform that covers some of these demands. This architecture presents high performance with reduced cost and efficient power consumption. However, the memory capacity of these devices is limited, so expensive memory transfers become necessary to deal with big problems. Today, the lattice-Boltzmann method (LBM) is positioned as an efficient approach for Computational Fluid Dynamics simulations. Although this method is particularly amenable to efficient parallelization, it needs considerable memory capacity, which leads to a dramatic fall in performance when dealing with large simulations. In this work, we propose some initiatives to minimize this memory demand, which allows us to execute bigger simulations on the same platform without additional memory transfers, keeping high performance. In particular, we present two new implementations, LBM-Ghost and LBM-Swap, which are analyzed in depth, presenting the pros and cons of each.
This project was funded by the Spanish Ministry of Economy and Competitiveness (MINECO): BCAM Severo Ochoa accreditation SEV-2013-0323, MTM2013-40824, Computación de Altas Prestaciones VII TIN2015-65316-P, by the Basque Excellence Research Center (BERC 2014-2017) program of the Basque Government, and by the Departament d'Innovació, Universitats i Empresa de la Generalitat de Catalunya under project MPEXPAR: Models de Programació i Entorns d'Execució Paral·lels (2014-SGR-1051). We also thank the computing facilities of the Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT) and the NVIDIA GPU Research Center program for the provided resources, as well as the support of NVIDIA through the BSC/UPC NVIDIA GPU Center of Excellence.
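To give a flavour of how memory savings of this kind can be obtained, the sketch below shows the well-known in-place streaming-by-swapping idea for a D2Q9 lattice, which avoids storing a second copy of the distribution functions. This illustrates the general technique in the spirit of LBM-Swap, not the paper's actual implementation; names are illustrative and boundaries are simply skipped.

```cuda
#include <cuda_runtime.h>

// D2Q9 velocity set; direction opp[i] is the opposite of direction i.
__constant__ int cx[9]  = {0, 1, 0, -1,  0, 1, -1, -1,  1};
__constant__ int cy[9]  = {0, 0, 1,  0, -1, 1,  1, -1, -1};
__constant__ int opp[9] = {0, 3, 4,  1,  2, 7,  8,  5,  6};

// In-place streaming: instead of reading from one lattice and writing
// to a second one (doubling the memory footprint), each node exchanges
// its f_i with the f_opp(i) of the downstream neighbour. Each node
// handles one member of each opposite pair (directions 1, 2, 5, 6), so
// every slot is written by exactly one thread and no second array is
// needed. Layout: f[node * 9 + i].
__global__ void stream_swap(double* f, int nx, int ny) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;
    int node = y * nx + x;

    const int dirs[4] = {1, 2, 5, 6};
    for (int k = 0; k < 4; ++k) {
        int i = dirs[k];
        int xn = x + cx[i], yn = y + cy[i];
        if (xn < 0 || xn >= nx || yn < 0 || yn >= ny) continue;  // skip walls
        int nb = yn * nx + xn;
        double tmp = f[node * 9 + i];
        f[node * 9 + i]    = f[nb * 9 + opp[i]];
        f[nb * 9 + opp[i]] = tmp;
    }
}
```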
Parallel prefix operations on heterogeneous platforms
Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01
[Abstract]
Graphics Processing Units (GPUs) offer great advantages in computing performance and energy efficiency, and are a key pillar of high performance computing (HPC). However, this technology is also costly to program, and it has certain problems associated with the portability of codes between different generations of devices. Parallel prefix algorithms, in turn, are a set of regular and widely used parallel algorithms, whose efficiency is crucial in many computer science applications. Although GPUs can accelerate the computation of these algorithms, they can also be a limitation when the parallelism of the GPU architecture is not properly exploited.
This dissertation presents two perspectives. On the one hand, new parallel prefix algorithms are designed that can be implemented under any parallel programming paradigm. On the other hand, a general tuning methodology is proposed that implements parallel prefix algorithms efficiently, in an easy and portable way, on any CUDA GPU architecture, rather than focusing on a particular algorithm or a specific device model. To accomplish this, the methodology identifies the GPU parameters which influence performance and then, following a series of theoretical premises, obtains the optimal values of these parameters depending on the algorithm, the problem size and the GPU architecture. Additionally, this dissertation provides a set of GPU functions composed of modular and reusable blocks of CUDA code, which allow the easy implementation of any parallel prefix algorithm. Depending on the problem size, three approaches are proposed: the first two solve small, medium and large problems on a single GPU, whereas the third deals with extremely large sizes using several GPUs.
Our proposals provide very competitive performance, improving on the existing proposals in the literature for the tested operations: the scan primitive, sorting, and the solution of tridiagonal systems.
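For readers unfamiliar with the scan primitive mentioned above, here is a minimal single-block Hillis-Steele inclusive scan, the textbook building block of parallel prefix computation. It is not code from the thesis's library; names are illustrative and n is assumed to be at most the block size.

```cuda
#include <cuda_runtime.h>

// Inclusive scan (prefix sum) of n <= blockDim.x integers in one block.
// Launch as: inclusive_scan_block<<<1, threads, threads * sizeof(int)>>>.
__global__ void inclusive_scan_block(const int* in, int* out, int n) {
    extern __shared__ int tmp[];
    int tid = threadIdx.x;
    tmp[tid] = (tid < n) ? in[tid] : 0;   // identity padding
    __syncthreads();

    // Hillis-Steele: at step 'offset' each element adds its neighbour
    // 'offset' positions to the left; log2(blockDim.x) steps in total.
    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        int v = (tid >= offset) ? tmp[tid - offset] : 0;
        __syncthreads();   // read before anyone writes
        tmp[tid] += v;
        __syncthreads();   // write before anyone reads again
    }
    if (tid < n) out[tid] = tmp[tid];
}
```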
Simulating the Behaviour of the Human Brain on NVIDIA GPU: cuHinesBatch & cuThomasBatch implementations
Understanding the human brain is one of the challenges of this century. In this work we take a small step towards this objective by presenting a novel data layout to compute the Hines algorithm more efficiently on GPU. A more general tridiagonal solver is also presented.
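The data-layout point can be made concrete with a sketch of a batched Thomas solver in an interleaved layout, where element i of system s lives at position i * m + s (for a batch of m systems), so the threads of a warp touch consecutive addresses at every step of the sweep and the accesses coalesce. This shows the general idea only, with illustrative names, not the actual cuThomasBatch interface.

```cuda
#include <cuda_runtime.h>

// Batched Thomas algorithm, one thread per system, interleaved layout:
// element i of system s is stored at index i * m + s. cp is an n*m
// scratch array for the modified super-diagonal; d is overwritten with
// the solution.
__global__ void thomas_interleaved(const double* a, const double* b,
                                   const double* c, double* d,
                                   double* cp, int n, int m) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= m) return;

    // Forward sweep.
    cp[s] = c[s] / b[s];
    d[s]  = d[s] / b[s];
    for (int i = 1; i < n; ++i) {
        int k = i * m + s, prev = k - m;
        double w = 1.0 / (b[k] - a[k] * cp[prev]);
        cp[k] = c[k] * w;
        d[k]  = (d[k] - a[k] * d[prev]) * w;
    }
    // Backward substitution.
    for (int i = n - 2; i >= 0; --i) {
        int k = i * m + s;
        d[k] -= cp[k] * d[k + m];
    }
}
```

With a system-major layout the same kernel would make each warp read addresses n elements apart, wasting most of every memory transaction; interleaving is what makes the one-thread-per-system approach viable on GPUs.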
Techniques for Autotuning Algorithms on Heterogeneous Platforms
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016), Timisoara, Romania, February 8-11, 2016.
Current GPUs (Graphics Processing Units) can obtain high computational performance in scientific applications. Nevertheless, programmers have to use suitable parallel algorithms for these architectures and have to consider optimization techniques in the implementation in order to achieve that performance. This thesis is focused on designing and implementing parallel prefix algorithms on GPU architectures with little effort. For that, we have developed a highly optimized library called BPLG (Tuning Butterfly Processing Library for GPUs), based on a set of building blocks that make it easy to design well-known algorithms such as the FFT, tridiagonal system solvers, the scan operator, sorting, and signal processing. This library is designed under a tuning methodology based on two stages, identified as GPU resource analysis and operator string manipulation. Specifically, this strategy is focused on a set of parallel prefix algorithms that can be represented according to a set of common permutations of the digits of each of their element indices [4], denoted as Index-Digit (ID) algorithms. So far, the proposed methodology has obtained very good results with respect to state-of-the-art libraries such as CUFFT, CUSPARSE, CUDPP or ModernGPU.
European Cooperation in Science and Technology. COST
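As a taste of the Index-Digit formalism, in which the data movement of a butterfly algorithm is written as a permutation of the digits of each element index, the sketch below performs the bit-reversal permutation of a radix-2 FFT (the digit permutation that reverses all index bits). It is a generic illustration, not BPLG code.

```cuda
#include <cuda_runtime.h>

// Bit-reversal permutation of n = 2^log2n complex values: the output
// index is obtained by reversing the log2n binary digits of the input
// index, the simplest example of an Index-Digit permutation.
__global__ void bit_reverse_permute(const float2* in, float2* out, int log2n) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int n = 1u << log2n;
    if (i >= n) return;
    unsigned int r = __brev(i) >> (32 - log2n);  // reverse the index digits
    out[r] = in[i];
}
```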
Mixed-Precision Numerical Linear Algebra Algorithms: Integer Arithmetic Based LU Factorization and Iterative Refinement for Hermitian Eigenvalue Problem
Mixed-precision algorithms are a class of algorithms that use low precision in part of the computation in order to save time and energy, accepting less accurate computation and communication. These algorithms usually apply iterative refinement to improve the approximate solution obtained in low precision up to the accuracy that would be achieved by doing all the computation in high precision. Driven by the demands of deep learning applications, hardware developments now offer several low-precision formats, including half precision (FP16), Bfloat16, and integer operations for quantized integers, which use integers with a shared scalar to represent a set of equally spaced numbers. As new hardware architectures focus on delivering performance in these formats, mixed-precision algorithms gain more leverage on them and outmatch traditional fixed-precision algorithms. This dissertation consists of two articles. In the first article, we adapt one of the most fundamental algorithms in numerical linear algebra, LU factorization with partial pivoting, to use integer arithmetic. With the goal of obtaining a low-accuracy factorization as the preconditioner of the generalized minimal residual method (GMRES) for solving systems of linear equations, the LU factorization is adapted to use two different fixed-point formats for the matrices L and U. A left-looking variant is also proposed for matrices with unbounded column growth. Finally, GMRES iterative refinement has been shown to work on matrices with condition numbers up to 10000 with the algorithm that uses int16 inputs and an int32 accumulator for the update step. The second article targets symmetric and Hermitian eigenvalue problems. Here we revisit the SICE algorithm of Dongarra et al. By applying the Sherman-Morrison formula to the diagonally shifted tridiagonal systems, we propose an updated SICE-SM algorithm. By incorporating the latest two-stage algorithms from the PLASMA and MAGMA numerical linear algebra software libraries, we achieve up to 3.6x speedup using the mixed-precision eigensolver with the blocked SICE-SM algorithm for iterative refinement, compared with full double-complex-precision solvers for cases where only a portion of the eigenvalues and eigenvectors is requested.
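The shape of such a refinement loop can be shown in a few lines: factorize once in low precision, then repeatedly form the residual in high precision and solve for a correction with the cheap factors. The host-side sketch below uses float/double instead of the article's integer formats, a naive unpivoted LU, and made-up data; it illustrates the loop, not the dissertation's method.

```cuda
#include <cstdio>
#include <vector>

// In-place LU factorization without pivoting, in low (float) precision.
void lu_factor_float(std::vector<float>& A, int n) {
    for (int k = 0; k < n; ++k)
        for (int i = k + 1; i < n; ++i) {
            A[i * n + k] /= A[k * n + k];
            for (int j = k + 1; j < n; ++j)
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
        }
}

// Solve LUx = b using the packed factors; x holds b on entry.
void lu_solve_float(const std::vector<float>& LU, std::vector<float>& x, int n) {
    for (int i = 1; i < n; ++i)                 // forward: Ly = b
        for (int j = 0; j < i; ++j) x[i] -= LU[i * n + j] * x[j];
    for (int i = n - 1; i >= 0; --i) {          // backward: Ux = y
        for (int j = i + 1; j < n; ++j) x[i] -= LU[i * n + j] * x[j];
        x[i] /= LU[i * n + i];
    }
}

int main() {
    const int n = 3;
    std::vector<double> A = {4, 1, 0,  1, 3, 1,  0, 1, 2};  // high-precision copy
    std::vector<double> b = {1, 2, 3}, x(n, 0.0);

    std::vector<float> LU(A.begin(), A.end());  // factorize in low precision
    lu_factor_float(LU, n);

    for (int it = 0; it < 5; ++it) {
        std::vector<double> r = b;              // residual r = b - A*x in double
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) r[i] -= A[i * n + j] * x[j];
        std::vector<float> c(r.begin(), r.end());
        lu_solve_float(LU, c, n);               // correction from the cheap factors
        for (int i = 0; i < n; ++i) x[i] += c[i];  // accumulate in double
    }
    std::printf("x = %.6f %.6f %.6f\n", x[0], x[1], x[2]);
    return 0;
}
```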