    Reducing memory requirements for large size LBM simulations on GPUs

    The scientific community, in its never-ending pursuit of larger and more efficient computational resources, needs implementations that can adapt efficiently to current parallel platforms. Graphics processing units (GPUs) are an appropriate platform to cover some of these demands: they deliver high performance at reduced cost and with efficient power consumption. However, the memory capacity of these devices is limited, so expensive memory transfers become necessary to deal with big problems. Today, the lattice-Boltzmann method (LBM) has positioned itself as an efficient approach for Computational Fluid Dynamics simulations. Although this method is particularly amenable to efficient parallelization, it requires a considerable memory capacity, which causes a dramatic fall in performance when dealing with large simulations. In this work, we propose several initiatives to minimize this memory demand, which allow us to execute bigger simulations on the same platform without additional memory transfers while keeping high performance. In particular, we present two new implementations, LBM-Ghost and LBM-Swap, which are analyzed in depth, presenting the pros and cons of each. This project was funded by the Spanish Ministry of Economy and Competitiveness (MINECO): BCAM Severo Ochoa accreditation SEV-2013-0323, MTM2013-40824, Computación de Altas Prestaciones VII TIN2015-65316-P; by the Basque Excellence Research Center (BERC 2014-2017) program of the Basque Government; and by the Departament d'Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d'Execució Paral·lels (2014-SGR-1051). We also thank the computing facilities of the Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT) and the NVIDIA GPU Research Center program for the provided resources, as well as the support of NVIDIA through the BSC/UPC NVIDIA GPU Center of Excellence.
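    The abstract does not reproduce the implementations themselves, but the memory pressure they target is easy to see in the standard double-buffered ("two-lattice") LBM update, which keeps two full copies of the distribution functions. Below is a minimal CUDA sketch of that baseline for a D2Q9 lattice with periodic boundaries; the kernel, data layout, and names are illustrative assumptions, not the paper's LBM-Ghost or LBM-Swap code, which exist precisely to remove the second copy.

        // Baseline two-lattice D2Q9 BGK update (illustrative sketch, not the paper's code).
        // Memory cost: 2 * 9 * nx * ny floats -- the duplication LBM-Ghost/LBM-Swap avoid.
        #include <cuda_runtime.h>

        #define Q 9
        __constant__ int cx[Q] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
        __constant__ int cy[Q] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

        // Pull scheme: each cell gathers the values streamed in from its neighbours,
        // applies the BGK collision, and writes to the second lattice copy.
        __global__ void streamCollide(const float *fSrc, float *fDst,
                                      int nx, int ny, float omega)
        {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            if (x >= nx || y >= ny) return;

            float f[Q], rho = 0.f, ux = 0.f, uy = 0.f;
            for (int i = 0; i < Q; ++i) {
                int xs = (x - cx[i] + nx) % nx;   // periodic neighbour the value comes from
                int ys = (y - cy[i] + ny) % ny;
                f[i] = fSrc[i * nx * ny + ys * nx + xs];
                rho += f[i]; ux += cx[i] * f[i]; uy += cy[i] * f[i];
            }
            ux /= rho; uy /= rho;

            const float w[Q] = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                                1.f/36, 1.f/36, 1.f/36, 1.f/36};
            float usq = ux * ux + uy * uy;
            for (int i = 0; i < Q; ++i) {
                float cu = cx[i] * ux + cy[i] * uy;
                float feq = w[i] * rho * (1.f + 3.f * cu + 4.5f * cu * cu - 1.5f * usq);
                fDst[i * nx * ny + y * nx + x] = f[i] + omega * (feq - f[i]);  // BGK collision
            }
        }
        // After each step, the host swaps the fSrc/fDst pointers for the next iteration.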

    cuThomasBatch and cuThomasVBatch, CUDA Routines to compute batch of tridiagonal systems on NVIDIA GPUs

    Solving tridiagonal systems is one of the most computationally expensive parts of many applications, and multiple studies have explored the use of NVIDIA GPUs to accelerate such computation. However, these studies have mainly focused on parallel algorithms, which can efficiently exploit shared memory and are able to saturate the GPU's capacity with a low number of systems, but present poor scalability when dealing with a relatively high number of systems. The gtsvStridedBatch routine in the cuSPARSE NVIDIA package is one such example, and is used as the reference in this article. We propose a new implementation (cuThomasBatch) based on the Thomas algorithm. Unlike other algorithms, the Thomas algorithm is sequential, so a coarse-grained approach is implemented in which one CUDA thread solves a complete tridiagonal system, instead of one CUDA block as in gtsvStridedBatch. To achieve good scalability with this approach, it is necessary to transform the way the inputs are stored in memory to exploit coalescence (contiguous threads access contiguous memory locations). Different variants of this data transformation are explored in detail. We also explore some variants for the case of variable batches, when the systems in the batch have different sizes (cuThomasVBatch). The results of this study prove that the implementations carried out in this work are able to beat the reference code, being up to 5× (in double precision) and 6× (in single precision) faster on the latest NVIDIA GPU architecture, the Pascal P100. This project was funded by the European Union's Horizon 2020 research and innovation programme under grant agreement 720270 (HBP SGA1), by the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P), and by the Departament d'Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d'Execució Paral·lels (2014-SGR-1051). We thank the support of NVIDIA through the BSC/UPC NVIDIA GPU Center of Excellence and the valuable feedback provided by Lung Sheng Chien and Alex Fit-Florea. Antonio J. Peña was cofinanced by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266.
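    To make the coarse-grained strategy concrete, the sketch below implements the Thomas algorithm with one CUDA thread per system and the interleaved layout the abstract describes: element i of system s is stored at index i*numSystems + s, so consecutive threads access consecutive addresses (coalescing). It is a minimal illustration of the technique under those assumptions, not the actual cuThomasBatch source.

        // One CUDA thread solves one tridiagonal system of size n (Thomas algorithm).
        // Inputs use the interleaved layout: element i of system s at i*numSystems + s.
        #include <cuda_runtime.h>

        __global__ void thomasBatch(const double *a,  // sub-diagonal (a[0] unused)
                                    const double *b,  // main diagonal
                                    const double *c,  // super-diagonal
                                    double *d,        // RHS, overwritten with the solution
                                    double *cp,       // scratch: modified super-diagonal
                                    int n, int numSystems)
        {
            int s = blockIdx.x * blockDim.x + threadIdx.x;
            if (s >= numSystems) return;

            // Forward elimination.
            cp[s] = c[s] / b[s];
            d[s]  = d[s] / b[s];
            for (int i = 1; i < n; ++i) {
                int k  = i * numSystems + s;
                int km = k - numSystems;
                double m = 1.0 / (b[k] - a[k] * cp[km]);
                cp[k] = c[k] * m;
                d[k]  = (d[k] - a[k] * d[km]) * m;
            }

            // Back substitution.
            for (int i = n - 2; i >= 0; --i) {
                int k = i * numSystems + s;
                d[k] -= cp[k] * d[k + numSystems];
            }
        }
        // Launch example: thomasBatch<<<(numSystems + 255) / 256, 256>>>(a, b, c, d, cp, n, numSystems);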

    Leveraging the Performance of LBM-HPC for Large Sizes on GPUs using Ghost Cells

    Today, there is a growing demand for larger and more efficient computational resources from the scientific community. The appearance of GPUs for general-purpose computing represented an important advance toward covering that demand. These devices offer an impressive computational capacity at low cost and with efficient power consumption. However, the memory available in these devices is sometimes not enough, making computationally expensive memory transfers from (to) CPU to (from) GPU necessary and causing a dramatic fall in performance. Recently, the Lattice-Boltzmann Method has established itself as an efficient methodology for fluid simulations. Although this method presents some interesting features particularly amenable to efficient exploitation on parallel computers, it requires a considerable memory capacity, which can be an important drawback, in particular on GPUs. In the present paper, we propose a new GPU-based implementation that minimizes such requirements with respect to other state-of-the-art implementations. It allows us to execute problems almost 2× bigger without additional memory transfers, achieving faster executions when dealing with large problems.

    Performance Evaluation of cuDNN Convolution Algorithms on NVIDIA Volta GPUs

    Convolutional neural networks (CNNs) have recently attracted considerable attention due to their outstanding accuracy in applications such as image recognition and natural language processing. While one advantage of CNNs over other types of neural networks is their reduced computational cost, faster execution is still desired for both training and inference. Since convolution operations account for most of the execution time, multiple algorithms were and are being developed with the aim of accelerating this type of operation. However, due to the wide range of convolution parameter configurations used in CNNs and the possible data type representations, it is not straightforward to assess in advance which of the available algorithms will perform best in each particular case. In this paper, we present a performance evaluation of the convolution algorithms provided by cuDNN, the library used by most deep learning frameworks for their GPU operations. In our analysis, we leverage the convolution parameter configurations from widely used CNNs and discuss which algorithms are better suited depending on the convolution parameters, for both 32-bit and 16-bit floating-point (FP) data representations. Our results show that the filter size and the number of inputs are the most significant parameters when selecting a GPU convolution algorithm for 32-bit FP data. For 16-bit FP, leveraging specialized arithmetic units (NVIDIA Tensor Cores) is key to obtaining the best performance. This work was supported by the European Union's Horizon 2020 Research and Innovation Program under Marie Skłodowska-Curie Grant 749516, and in part by Spanish Juan de la Cierva Grant IJCI-2017-33511.
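    Because the best algorithm depends on the convolution configuration, cuDNN exposes an exhaustive-search helper that benchmarks every applicable algorithm for a given setup. The sketch below uses real cuDNN calls to obtain such a per-configuration measurement; the descriptor dimensions are arbitrary assumptions, not the paper's benchmark set. Enabling CUDNN_TENSOR_OP_MATH is what allows the Tensor Core paths highlighted for 16-bit FP.

        // Benchmark the cuDNN forward-convolution algorithms for one configuration
        // (dimensions below are arbitrary assumptions for illustration).
        #include <cudnn.h>
        #include <cstdio>

        int main() {
            cudnnHandle_t h; cudnnCreate(&h);

            cudnnTensorDescriptor_t x, y; cudnnFilterDescriptor_t w;
            cudnnConvolutionDescriptor_t conv;
            cudnnCreateTensorDescriptor(&x); cudnnCreateTensorDescriptor(&y);
            cudnnCreateFilterDescriptor(&w); cudnnCreateConvolutionDescriptor(&conv);

            // Example configuration: batch 32, 64 input channels, 56x56 images, 3x3 filters.
            cudnnSetTensor4dDescriptor(x, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 32, 64, 56, 56);
            cudnnSetFilter4dDescriptor(w, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 128, 64, 3, 3);
            cudnnSetConvolution2dDescriptor(conv, 1, 1, 1, 1, 1, 1,
                                            CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
            cudnnSetConvolutionMathType(conv, CUDNN_TENSOR_OP_MATH);  // allow Tensor Cores

            int n, c, hh, ww;
            cudnnGetConvolution2dForwardOutputDim(conv, x, w, &n, &c, &hh, &ww);
            cudnnSetTensor4dDescriptor(y, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, hh, ww);

            // Time every applicable forward algorithm; results are returned fastest-first.
            cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
            int found = 0;
            cudnnFindConvolutionForwardAlgorithm(h, x, w, conv, y,
                                                 CUDNN_CONVOLUTION_FWD_ALGO_COUNT,
                                                 &found, perf);
            for (int i = 0; i < found; ++i)
                printf("algo %d: %.3f ms, %zu workspace bytes\n",
                       (int)perf[i].algo, perf[i].time, perf[i].memory);

            cudnnDestroyConvolutionDescriptor(conv); cudnnDestroyFilterDescriptor(w);
            cudnnDestroyTensorDescriptor(x); cudnnDestroyTensorDescriptor(y);
            cudnnDestroy(h);
            return 0;
        }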

    A Non-uniform Staggered Cartesian Grid approach for Lattice-Boltzmann method

    We propose a numerical approach based on the Lattice-Boltzmann method (LBM) for dealing with mesh refinement on a non-uniform staggered Cartesian grid. We explain in detail the strategy for mapping the LBM over such geometries. The main benefit of this approach, compared to others, is that it solves all fluid units only once per time step and considerably reduces the complexity of the communication and memory management between different refinement levels. It also maps better onto parallel processors. To validate our method, we analyze several standard test scenarios, reaching satisfactory results with respect to other state-of-the-art methods. The performance evaluation proves that our approach not only provides a simpler and more efficient scheme for dealing with mesh refinement, but also resolves quickly, even in scenarios where it needs to use a higher number of fluid units.

    LBM-HPC - An open-source tool for fluid simulations. Case study: Unified parallel C (UPC-PGAS)

    The main motivation of this work is the evaluation of the Unified Parallel C (UPC) model for Boltzmann-fluid simulations. UPC is one of the current models in the so-called Partitioned Global Address Space paradigm, which attempts to increase the simplicity of codes while achieving better efficiency and scalability. Two different UPC-based implementations, explicit and implicit, are presented and evaluated. We compare the fundamental features of our UPC implementations with another parallel programming model, hybrid MPI-OpenMP. In particular, each of the major steps of any LBM code, i.e., Boundary Conditions, Communication, and the LBM solver, is analyzed.

    Multi-domain grid refinement for lattice-Boltzmann simulations on heterogeneous platforms

    The main contribution of the present work consists of several parallel approaches for grid refinement based on a multi-domain decomposition for lattice-Boltzmann simulations. The proposed method for discretizing the fluid incorporates different regular Cartesian grids with non-homogeneous spatial domains, which need to communicate with each other. Three different parallel approaches are proposed: homogeneous multicore, homogeneous GPU, and heterogeneous multicore-GPU. Although the homogeneous implementations exhibit satisfactory results, the heterogeneous approach achieves up to 30% extra efficiency, in terms of Millions of Fluid Lattice Updates per Second (MFLUPS), by overlapping some of the steps on both architectures, multicore and GPU.
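    The overlap that yields the extra efficiency can be sketched with a CUDA stream: the GPU subdomain is launched asynchronously while the host threads update the CPU subdomain, and the two meet at a synchronization point before exchanging boundary cells. The example below is a minimal sketch of that general pattern; the kernel, sizes, and function names are hypothetical stand-ins, not the paper's implementation.

        // Minimal multicore/GPU overlap pattern (hypothetical names and sizes).
        #include <cuda_runtime.h>

        __global__ void lbmStepGPU(float *f, int nCells) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < nCells) f[i] *= 0.9f;          // stand-in for a real collide-stream update
        }

        void lbmStepCPU(float *f, int nCells) {
            #pragma omp parallel for               // multicore subdomain update
            for (int i = 0; i < nCells; ++i) f[i] *= 0.9f;
        }

        int main() {
            const int nGPU = 1 << 20, nCPU = 1 << 18;
            float *dF; cudaMalloc(&dF, nGPU * sizeof(float));
            cudaMemset(dF, 0, nGPU * sizeof(float));
            float *hF = new float[nCPU]();

            cudaStream_t s; cudaStreamCreate(&s);
            for (int t = 0; t < 100; ++t) {
                // Launch the GPU subdomain asynchronously ...
                lbmStepGPU<<<(nGPU + 255) / 256, 256, 0, s>>>(dF, nGPU);
                // ... and update the CPU subdomain while the GPU works.
                lbmStepCPU(hF, nCPU);
                cudaStreamSynchronize(s);          // meet, then exchange boundary cells (omitted)
            }
            cudaStreamDestroy(s); cudaFree(dF); delete[] hF;
            return 0;
        }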

    Towards HPC-Embedded Case Study: Kalray and Message-Passing on NoC

    Today one of the most important challenges in HPC is the development of computers with low power consumption. In this context, new embedded many-core systems have recently emerged. One of them is Kalray. Unlike other many-core architectures, Kalray is not a co-processor (it is self-hosted). One interesting feature of the Kalray architecture is its Network on Chip (NoC) interconnect. Typically, communication in many-core architectures is carried out via shared memory; in Kalray, however, the processing elements can also communicate via message-passing on the NoC. One of the main motivations of this work is to present the main constraints of dealing with the Kalray architecture. In particular, we focus on memory management and communication, assessing the use of the NoC and of shared memory on Kalray. Unlike shared memory, the implementation of message-passing on the NoC is not transparent from the programmer's point of view. Synchronization among processing elements over the NoC is another of the challenges of the Kalray processor. Although synchronization using message-passing is more complex and time-consuming than using shared memory, we obtain an overall speedup close to 6 when using message-passing on the NoC with respect to the shared-memory approach. Additionally, we have measured the power consumption of both approaches. Despite being faster, the use of the NoC draws about 50% more power in watts than the approach that exploits shared memory. However, the reduction in execution time obtained with the NoC has an important impact on the overall energy consumption as well.