312 research outputs found

    Simulating the behavior of the human brain on GPUs

    Simulating the behavior of the human brain is one of the most important challenges in computing today. The main problem is finding efficient ways to manipulate and compute, with current technology, the huge volume of data that such simulations need. This work focuses on one of the main steps of the simulation: computing the voltage on the neurons' morphology. This is done with the Hines algorithm which, although optimal in terms of number of operations, needs non-trivial modifications to be parallelized efficiently on GPUs. We propose several optimizations to accelerate this algorithm on GPU-based architectures, exploring the limitations of both the method and the architecture, in order to solve a large number of Hines systems (neurons) efficiently. Each optimization is analyzed and described in depth. Two approaches are studied: one for mono-morphology simulations (a batch of neurons with the same shape) and one for multi-morphology simulations (a batch in which every neuron has a different shape). In mono-morphology simulations, a single kernel that computes all the neurons performs well. This approach is inefficient for multi-morphology simulations, which need a much more complex implementation that executes more than one GPU kernel. Each kernel call solves one specific part of the batch of neurons, and these parts can be seen as multiple independent tridiagonal systems. Although the paper focuses on simulating the behavior of the human brain, some of these techniques, in particular those for solving tridiagonal systems, can also be used in many oil and gas simulations.
    Our studies show that the proposed optimizations achieve high performance for computations with a large number of neurons: the GPU implementations are about 4× and 8× faster than the OpenMP multicore implementation (16 cores) when using one and two NVIDIA K80 GPUs, respectively. These optimizations also continue to scale when dealing with a very large number of neurons.
    This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 720270 (HBP SGA1), from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P), and from the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d’Execució Parallels (2014-SGR-1051). We thank the support of NVIDIA through the BSC/UPC NVIDIA GPU Center of Excellence, and the European Union’s Horizon 2020 Research and Innovation Program under the Marie Sklodowska-Curie Grant Agreement No. 749516.
    Peer Reviewed. Postprint (published version)
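As a rough illustration of what a single Hines system looks like, the sketch below solves one tree-structured system sequentially in Python. It is not the paper's GPU implementation. The function names (`hines_solve`, `tree_matvec`), the array layout, and the small example tree are assumptions made for this example. The only structural assumption is that every node is numbered after its parent, so a chain (parent[i] = i - 1) reduces to an ordinary tridiagonal system.

```python
def hines_solve(parent, lower, diag, upper, rhs):
    """Solve A x = rhs for a tree-structured (Hines) matrix.

    Node 0 is the root and every node i > 0 satisfies parent[i] < i.
    A[i][i] = diag[i], A[i][parent[i]] = lower[i], A[parent[i]][i] = upper[i].
    """
    n = len(diag)
    d = list(diag)
    b = list(rhs)
    # Backward sweep (leaves to root): eliminate each upper entry
    # by subtracting a multiple of row i from the row of its parent.
    for i in range(n - 1, 0, -1):
        p = parent[i]
        factor = upper[i] / d[i]
        d[p] -= factor * lower[i]
        b[p] -= factor * b[i]
    # Forward sweep (root to leaves): substitute the parent's solution.
    x = [0.0] * n
    x[0] = b[0] / d[0]
    for i in range(1, n):
        x[i] = (b[i] - lower[i] * x[parent[i]]) / d[i]
    return x


def tree_matvec(parent, lower, diag, upper, x):
    """Multiply the same matrix by x; used only to check the solver."""
    y = [diag[i] * x[i] for i in range(len(x))]
    for i in range(1, len(x)):
        p = parent[i]
        y[i] += lower[i] * x[p]
        y[p] += upper[i] * x[i]
    return y


# Hypothetical 6-compartment morphology: node 1 branches into nodes 2 and 4.
parent = [-1, 0, 1, 2, 1, 4]
diag = [4.0, 5.0, 4.0, 3.0, 4.0, 3.0]
lower = [0.0, -1.0, -1.0, -1.0, -1.0, -1.0]
upper = [0.0, -1.0, -1.0, -1.0, -1.0, -1.0]
rhs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

x = hines_solve(parent, lower, diag, upper, rhs)
residual = [r - y for r, y in zip(rhs, tree_matvec(parent, lower, diag, upper, x))]
print(max(abs(r) for r in residual))
```

Both sweeps are inherently sequential within one neuron, which is consistent with the abstract's emphasis on solving many systems at once rather than parallelizing a single one.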

    Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs

    Many problems in geophysical and atmospheric modelling require the fast solution of elliptic partial differential equations (PDEs) in "flat" three-dimensional geometries. In particular, an anisotropic elliptic PDE for the pressure correction has to be solved at every time step in the dynamical core of many numerical weather prediction (NWP) models, and equations of a very similar structure arise in global ocean models, subsurface flow simulations and gas and oil reservoir modelling. The elliptic solve is often the bottleneck of the forecast, so an algorithmically optimal method has to be used and implemented efficiently. Graphics Processing Units have been shown to be highly efficient for a wide range of applications in scientific computing, and recently iterative solvers have been parallelised on these architectures. We describe the GPU implementation and optimisation of a Preconditioned Conjugate Gradient (PCG) algorithm for the solution of a three-dimensional anisotropic elliptic PDE for the pressure correction in NWP. Our implementation exploits the strong vertical anisotropy of the elliptic operator in the construction of a suitable preconditioner. As the algorithm is memory bound, performance can be improved significantly by reducing the amount of global memory access. We achieve this by using a matrix-free implementation which does not require explicit storage of the matrix and instead recalculates the local stencil. Global memory access can also be reduced by rewriting the algorithm using loop fusion, and we show that this further reduces the runtime on the GPU. We demonstrate the performance of our matrix-free GPU code by comparing it to a sequential CPU implementation and to a matrix-explicit GPU code which uses existing libraries. The absolute performance of the algorithm for different problem sizes is quantified in terms of floating point throughput and global memory bandwidth.
    Comment: 18 pages, 7 figures
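The matrix-free idea in the abstract above can be sketched on a toy problem. The Python example below runs preconditioned conjugate gradients on a 2D anisotropic operator whose stencil is recomputed on the fly instead of being stored. Everything specific here is an assumption for illustration: the operator (-u_xx - lam * u_zz with zero Dirichlet boundaries), the names `apply_operator` and `pcg`, and the Jacobi (diagonal) preconditioner, which stands in for the paper's anisotropy-aware preconditioner and is much weaker.

```python
import math


def apply_operator(u, nx, nz, lam):
    """Matrix-free 5-point stencil for -u_xx - lam * u_zz (zero Dirichlet)."""
    out = [0.0] * (nx * nz)
    for i in range(nx):
        for k in range(nz):
            idx = i * nz + k
            val = (2.0 + 2.0 * lam) * u[idx]
            if i > 0:
                val -= u[idx - nz]
            if i < nx - 1:
                val -= u[idx + nz]
            if k > 0:
                val -= lam * u[idx - 1]
            if k < nz - 1:
                val -= lam * u[idx + 1]
            out[idx] = val
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def pcg(b, nx, nz, lam, tol=1e-10, max_iter=500):
    """Preconditioned CG; returns the solution and the iteration count."""
    inv_diag = 1.0 / (2.0 + 2.0 * lam)  # Jacobi preconditioner (placeholder)
    x = [0.0] * len(b)
    r = list(b)
    z = [inv_diag * ri for ri in r]
    p = list(z)
    rz = dot(r, z)
    for iteration in range(max_iter):
        if math.sqrt(dot(r, r)) < tol:
            return x, iteration
        ap = apply_operator(p, nx, nz, lam)
        alpha = rz / dot(p, ap)
        # Fused update: x and r are modified in a single pass over the data.
        for j in range(len(x)):
            x[j] += alpha * p[j]
            r[j] -= alpha * ap[j]
        z = [inv_diag * ri for ri in r]
        rz_new = dot(r, z)
        p = [zj + (rz_new / rz) * pj for zj, pj in zip(z, p)]
        rz = rz_new
    return x, max_iter


nx, nz, lam = 8, 6, 100.0
x_true = [math.sin(0.3 * j) for j in range(nx * nz)]
b = apply_operator(x_true, nx, nz, lam)
x, iterations = pcg(b, nx, nz, lam)
error = max(abs(xi - ti) for xi, ti in zip(x, x_true))
print(iterations, error)
```

The single loop that updates both `x` and `r` is a small instance of the loop fusion mentioned in the abstract: each vector is read and written once per iteration instead of once per update.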

    Parallelization of the ADI method exploring vector computing in GPUs

    Integrated master's dissertation in Informatics Engineering (Dissertação de mestrado integrado em Engenharia Informática).
    The 2D convection-diffusion problem is well known in scientific simulation and is often solved with a direct method for a system of N linear equations, which requires N^3 operations. It can be solved with a more efficient computational method, the alternating direction implicit (ADI) method, which solves the system as 2N solves of N operations each, implemented in two steps: one solves row by row, the other column by column. Within each step the N-operation solves are fully independent, which opens the way to an embarrassingly parallel solution. The method also exploits the way matrices are stored in computer memory, row-major or column-major, by splitting each iteration in two. The major bottleneck of the method is solving the systems of linear equations. These systems can be described as tridiagonal matrices, since all their elements lie on the three main diagonals. Algorithms tailored for tridiagonal matrices can significantly improve performance. These can be sequential (e.g. the Thomas algorithm) or parallel (e.g. cyclic reduction, CR, and parallel cyclic reduction, PCR). Current vector extensions in conventional scalar processing units, such as x86-64 and ARM devices, require the vector elements to be in contiguous memory locations to avoid performance penalties. To overcome these limitations in dot products, several approaches are proposed and evaluated in this work, both on general-purpose processing units and on specific accelerators, namely NVIDIA GPUs. Profiling the code execution on a server based on x86-64 devices showed that the ADI method needs a combination of CPU computation power and memory transfer speed. This is best shown on a server based on the Intel manycore device, KNL, where the algorithm scales until the memory bandwidth is no longer enough to feed all 64 computing cores.
    A dual-socket server based on 16-core Xeon Skylake devices, with AVX-512 vector support, proved to be a better choice: the algorithm executes in less time and scales better. Introducing GPU computing to further improve execution performance (together with other optimisation techniques, namely a different thread scheme and shared memory) showed better results for larger grid sizes (above 32Ki x 32Ki). The CUDA development environment also performed better than OpenCL in most cases. The largest difference was with a hybrid CR-PCR, where the OpenCL code displayed a major performance improvement over CUDA. Even with this speedup, the best average time for the ADI method across all tested configurations on an NVIDIA GPU was obtained with CUDA on the most recent GPU studied (Pascal architecture), with CR as the auxiliary method.
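To make the contrast between the sequential and parallel tridiagonal solvers named in the abstract concrete, the sketch below implements the Thomas algorithm and parallel cyclic reduction (PCR) in plain Python and checks that they agree. These are textbook formulations, not the dissertation's code. The function names and the test system are assumptions, and the inner `for i` loop in `pcr` only simulates what would be one thread per row on a GPU.

```python
def thomas(a, b, c, d):
    """Sequential Thomas algorithm for a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]."""
    n = len(b)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):  # forward elimination: each step needs the previous one
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def pcr(a, b, c, d):
    """Parallel cyclic reduction: about log2(n) steps, all rows independent per step."""
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    a[0] = 0.0
    c[-1] = 0.0
    stride = 1
    while stride < n:
        na, nb, nc, nd = [0.0] * n, list(b), [0.0] * n, list(d)
        for i in range(n):  # on a GPU, one thread per row
            lo, hi = i - stride, i + stride
            if lo >= 0:  # eliminate the coupling to x[i - stride]
                k1 = a[i] / b[lo]
                na[i] = -a[lo] * k1
                nb[i] -= c[lo] * k1
                nd[i] -= d[lo] * k1
            if hi < n:  # eliminate the coupling to x[i + stride]
                k2 = c[i] / b[hi]
                nc[i] = -c[hi] * k2
                nb[i] -= a[hi] * k2
                nd[i] -= d[hi] * k2
        a, b, c, d = na, nb, nc, nd
        stride *= 2
    # Every equation now involves a single unknown.
    return [d[i] / b[i] for i in range(n)]


n = 10
a = [-1.0] * n
b = [4.0] * n
c = [-1.0] * n
d = [float(i + 1) for i in range(n)]
x_thomas = thomas(a, b, c, d)
x_pcr = pcr(a, b, c, d)
difference = max(abs(p - q) for p, q in zip(x_thomas, x_pcr))
print(difference)
```

Thomas does O(n) work in n dependent steps, while PCR does O(n log n) work in about log2(n) steps. That trade-off is one reason hybrid schemes such as the CR-PCR combination mentioned above are attractive on GPUs.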

    Reducing memory requirements for large size LBM simulations on GPUs

    The scientific community, in its never-ending pursuit of larger and more efficient computational resources, needs more efficient implementations that adapt well to current parallel platforms. Graphics processing units are an appropriate platform that covers some of these demands: the architecture offers high performance at reduced cost with efficient power consumption. However, memory capacity on these devices is limited, so expensive memory transfers are necessary to deal with big problems. Today, the lattice-Boltzmann method (LBM) has positioned itself as an efficient approach for Computational Fluid Dynamics simulations. Although this method is particularly amenable to efficient parallelization, it needs considerable memory capacity, which causes a dramatic fall in performance when dealing with large simulations. In this work, we propose some initiatives to minimize this memory demand, which allows us to execute bigger simulations on the same platform without additional memory transfers while keeping high performance. In particular, we present two new implementations, LBM-Ghost and LBM-Swap, which are analyzed in depth, presenting the pros and cons of each.
    This project was funded by the Spanish Ministry of Economy and Competitiveness (MINECO): BCAM Severo Ochoa accreditation SEV-2013-0323, MTM2013-40824, Computación de Altas Prestaciones VII TIN2015-65316-P, by the Basque Excellence Research Center (BERC 2014-2017) program of the Basque Government, and by the Departament d'Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d'Execució Paral·lels (2014-SGR-1051). We also thank the support of the computing facilities of Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT) and NVIDIA GPU Research Center program for the provided resources, as well as the support of NVIDIA through the BSC/UPC NVIDIA GPU Center of Excellence.
    Peer Reviewed. Postprint (author's final draft)

    Parallel prefix operations on heterogeneous platforms

    Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información (Official Doctoral Programme in Information Technology Research). 524V01
    Graphics Processing Units (GPUs) have shown remarkable advantages in computing performance and energy efficiency, representing one of the most promising trends for the near future of high performance computing. However, these devices also bring some programming complexities, and considerable effort is required to provide portability between different generations. Additionally, parallel prefix algorithms are a set of regular and widely used parallel algorithms whose efficiency is crucial in many computer science applications. Although GPUs can accelerate the computation of such algorithms, they can also be a limitation when the algorithms do not match the GPU architecture correctly or do not exploit the GPU parallelism properly. This dissertation presents two different perspectives. On the one hand, new parallel prefix algorithms have been designed for any parallel programming paradigm. On the other hand, a general GPU tuning methodology is proposed to provide an easy and portable mechanism to efficiently implement parallel prefix algorithms on any CUDA GPU architecture, rather than focusing on a particular algorithm or GPU model. To accomplish this goal, the methodology identifies the GPU parameters that influence performance and, following a set of performance premises, obtains convenient values of these parameters depending on the algorithm, the problem size and the GPU architecture.
    Additionally, the provided GPU functions are composed of modular and reusable CUDA blocks of code, which allow the easy implementation of any parallel prefix algorithm. Depending on the size of the dataset, three different approaches are proposed. The first two approaches solve small and medium-to-large datasets on a single GPU, whereas the third approach deals with extremely large datasets in a multi-GPU environment. Our proposals provide very competitive performance, outperforming the state of the art for many parallel prefix operations, such as the scan primitive, sorting and solving tridiagonal systems.
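For readers unfamiliar with the scan primitive mentioned in the abstract above, here is a minimal Python sketch of a classic work-efficient exclusive scan (the up-sweep/down-sweep scheme usually attributed to Blelloch). It is a generic textbook formulation, not one of the thesis's algorithms, and the function name `exclusive_scan` is an assumption. Each inner `for` loop touches disjoint elements, so on a GPU its iterations could run as parallel threads.

```python
def exclusive_scan(values):
    """Exclusive prefix sum via up-sweep and down-sweep over a padded array."""
    n = len(values)
    size = 1
    while size < n:  # pad to the next power of two
        size *= 2
    tree = list(values) + [0] * (size - n)

    # Up-sweep (reduce): build partial sums at the internal nodes.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            tree[i] += tree[i - stride]
        stride *= 2

    # Down-sweep: clear the root, then push prefixes down to the leaves.
    tree[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = tree[i - stride]
            tree[i - stride] = tree[i]
            tree[i] += left
        stride //= 2
    return tree[:n]


data = [3, 1, 7, 0, 4, 1, 6, 3]
result = exclusive_scan(data)
print(result)
```

The scheme performs O(n) additions in about 2 log2(n) parallel steps. How those steps are mapped onto threads, registers and shared memory is the kind of parameter choice the abstract's tuning methodology is concerned with.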

    Simulating the Behaviour of the Human Brain on NVIDIA GPU: cuHinesBatch & cuThomasBatch implementations

    Understanding the human brain is one of the challenges of the century. In this work we take a small step towards this objective by presenting a novel data layout for computing the Hines algorithm more efficiently on GPUs. A more general tridiagonal solver is also presented.

    Techniques for Autotuning Algorithms on Heterogenous Platforms

    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016), Timisoara, Romania, February 8-11, 2016.
    Current GPUs (Graphics Processing Units) can obtain high computational performance in scientific applications. Nevertheless, programmers have to use parallel algorithms suited to these architectures and have to consider optimization techniques in the implementation in order to achieve that performance. This thesis focuses on designing and implementing parallel prefix algorithms on GPU architectures with little effort. To that end, we have developed a highly optimized library called BPLG (Tuning Butterfly Processing Library for GPUs), based on a set of building blocks that make it easy to design well-known algorithms such as FFT, tridiagonal system solvers, the scan operator, sorting or signal processing. The library is designed following a two-stage tuning methodology: GPU resource analysis and operator string manipulation. Specifically, this strategy focuses on a set of parallel prefix algorithms that can be represented by a set of common permutations of the digits of each of their element indices [4], denoted Index-Digit (ID) algorithms. So far, the proposed methodology has obtained very good results with respect to state-of-the-art libraries such as CUFFT, CUSPARSE, CUDPP or ModernGPU.
    European Cooperation in Science and Technology. COS

    Mixed-Precision Numerical Linear Algebra Algorithms: Integer Arithmetic Based LU Factorization and Iterative Refinement for Hermitian Eigenvalue Problem

    Mixed-precision algorithms are a class of algorithms that use low precision in part of the computation in order to save time and energy through less accurate computation and communication. These algorithms usually use iterative refinement to improve the approximate solution obtained in low precision to the accuracy expected from doing all the computation in high precision. Driven by the demand of deep learning applications, hardware developments now offer different low-precision formats, including half precision (FP16), Bfloat16 and integer operations for quantized integers, which use integers with a shared scalar to represent a set of equally spaced numbers. As new hardware architectures focus on performance in these formats, mixed-precision algorithms have more potential to leverage them and outperform traditional fixed-precision algorithms. This dissertation consists of two articles. In the first article, we adapt one of the most fundamental algorithms in numerical linear algebra, LU factorization with partial pivoting, to use integer arithmetic. With the goal of obtaining a low-accuracy factorization as the preconditioner of the generalized minimal residual method (GMRES) for solving systems of linear equations, the LU factorization is adapted to use two different fixed-point formats for the matrices L and U. A left-looking variant is also proposed for matrices with unbounded column growth. Finally, GMRES iterative refinement is shown to work on matrices with condition numbers up to 10000 with the algorithm that uses int16 as input and an int32 accumulator for the update step. The second article targets symmetric and Hermitian eigenvalue problems. In this section we revisit the SICE algorithm from Dongarra et al. By applying the Sherman-Morrison formula to the diagonally shifted tridiagonal systems, we propose an updated SICE-SM algorithm.
    By incorporating the latest two-stage algorithms from the PLASMA and MAGMA software libraries for numerical linear algebra, we achieved up to 3.6x speedup using the mixed-precision eigensolver with the blocked SICE-SM algorithm for iterative refinement, compared with full double complex precision solvers, for cases in which a portion of the eigenvalues and eigenvectors is requested.
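The iterative refinement pattern described in the abstract above (solve cheaply in low precision, then correct using residuals computed in high precision) can be sketched in plain Python. This is only the generic textbook scheme, not the dissertation's integer LU, GMRES-based refinement or SICE-SM algorithm. The names `f32`, `solve_low_precision` and `refine`, the 3x3 test matrix, and the use of `struct` round-trips to emulate single precision are all assumptions made for this example.

```python
import struct


def f32(value):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def solve_low_precision(matrix, rhs):
    """Gaussian elimination with partial pivoting, rounding each result to float32."""
    n = len(rhs)
    a = [[f32(v) for v in row] for row in matrix]
    b = [f32(v) for v in rhs]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = f32(a[row][col] / a[col][col])
            for k in range(col, n):
                a[row][k] = f32(a[row][k] - f32(factor * a[col][k]))
            b[row] = f32(b[row] - f32(factor * b[col]))
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        acc = b[row]
        for k in range(row + 1, n):
            acc = f32(acc - f32(a[row][k] * x[k]))
        x[row] = f32(acc / a[row][row])
    return x


def refine(matrix, rhs, steps=5):
    """Low-precision solves corrected by double-precision residuals."""
    n = len(rhs)
    x = solve_low_precision(matrix, rhs)
    for _ in range(steps):
        # Residual in full double precision.
        r = [rhs[i] - sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]
        # A real code would reuse the stored factors instead of refactorizing.
        correction = solve_low_precision(matrix, r)
        x = [xi + ci for xi, ci in zip(x, correction)]
    return x


matrix = [[4.0, 1.0, 0.0], [1.0, 5.0, 2.0], [0.0, 2.0, 6.0]]
x_true = [1.0, -2.0, 3.0]
rhs = [sum(matrix[i][j] * x_true[j] for j in range(3)) for i in range(3)]
x = refine(matrix, rhs)
error = max(abs(xi - ti) for xi, ti in zip(x, x_true))
print(error)
```

For a well-conditioned matrix like this one, each refinement step shrinks the error by roughly the low-precision accuracy, so a few steps typically recover close to double-precision accuracy while the expensive solve stays in the cheaper format.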