33 research outputs found

    Reproducible Triangular Solvers for High-Performance Computing

    Get PDF
    On modern parallel architectures, floating-point computations may become non-deterministic, and therefore non-reproducible, mainly due to the non-associativity of floating-point operations. We propose an algorithm to solve dense triangular systems by combining the standard parallel triangular solver with our recently introduced multi-level exact summation approach. Finally, we present implementations of the proposed fast reproducible triangular solver and results on recent NVIDIA GPUs.
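As a rough illustration of the idea (not the authors' implementation), a forward-substitution solver becomes reproducible once each row's dot product is computed with a correctly rounded exact sum, because the result then no longer depends on the order of accumulation. Here Python's `math.fsum` stands in for the paper's multi-level summation:

```python
import math

def trsv_lower(L, b):
    """Solve L x = b by forward substitution, L lower-triangular.

    Each row's dot product goes through math.fsum, which returns the
    correctly rounded exact sum, so the answer is independent of the
    order in which the products are accumulated (illustrative stand-in
    for a multi-level exact summation).
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        # exact, order-independent accumulation of -L[i][j] * x[j]
        s = math.fsum(-L[i][j] * x[j] for j in range(i))
        x[i] = (b[i] + s) / L[i][i]
    return x
```

Note that the individual products are still rounded before summation; a fully reproducible solver must also fix how those products are formed, which is where the multi-level scheme of the paper comes in.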

    A Reproducible Accurate Summation Algorithm for High-Performance Computing

    Get PDF
    Floating-point (FP) addition is non-associative, and parallel reductions involving this operation are a serious issue, as noted in the DARPA Exascale Report [1]. Such large summations typically appear within fundamental numerical blocks such as dot products or numerical integrations. Hence, the result may vary from one parallel machine to another, or even from one run to another. These discrepancies worsen on heterogeneous architectures – such as clusters with GPUs or Intel Xeon Phi processors – which combine programming environments that may obey different floating-point models and offer different intermediate precisions or different operators [2,3]. Such non-determinism of floating-point calculations in parallel programs causes validation and debugging issues, and may even lead to deadlocks [4]. The increasing power of current computers enables one to solve more and more complex problems, which in turn leads to a higher number of floating-point operations, each of them potentially introducing a round-off error. Because of round-off error propagation, some problems must be solved with a wider floating-point format. Two approaches exist to perform floating-point addition without incurring round-off errors. The first computes the error committed during rounding using FP expansions, which are based on error-free transformations. An FP expansion represents the result as an unevaluated sum of a fixed number of FP numbers, whose components are ordered in magnitude with minimal overlap so as to cover a wide range of exponents. FP expansions of sizes 2 and 4 are presented in [5] and [6], respectively. The main advantage of this solution is that the expansion can stay in registers during the computation. However, its accuracy is insufficient for summing a large number of FP values or sums with a huge dynamic range, and its cost grows linearly with the size of the expansion.
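The error-free transformation underlying FP expansions fits in a few lines. The sketch below is the classic branch-free TwoSum algorithm (an illustration of the general technique, not the paper's code): it returns both the rounded sum and its exact rounding error, using only ordinary floating-point operations.

```python
def two_sum(a, b):
    """TwoSum: return (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bv = s - a                       # the part of b actually absorbed into s
    e = (a - (s - bv)) + (b - bv)    # exact rounding error of a + b
    return s, e

# 1.0 is lost when naively added to 1e16, but TwoSum recovers it exactly:
# two_sum(1e16, 1.0) -> (1e16, 1.0)
```

The pair `(s, e)` is precisely a size-2 FP expansion: an unevaluated sum of two non-overlapping floats whose mathematical sum equals `a + b` exactly.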
An alternative approach exploits the finite range of representable floating-point numbers by storing every bit of the exact sum in a very long vector of bits (an accumulator). The length of the accumulator is chosen so that every bit of information of the input format can be represented; this covers the range from the minimum to the maximum representable floating-point value, independently of the sign. For instance, Kulisch [7] proposed an accumulator of 4288 bits to handle the accumulation of products of 64-bit IEEE floating-point values. The Kulisch accumulator produces the exact sum of a very large number of floating-point values of arbitrary magnitude. However, for a long period this approach was considered impractical, as it induces a large memory overhead; furthermore, without dedicated hardware support, its performance is limited by indirect memory accesses that make vectorization challenging. We aim at addressing both accuracy and reproducibility in the context of summation, and advocate computing the correctly rounded result of the exact sum. Besides offering strict reproducibility through an unambiguous definition of the expected result, our approach guarantees that the result ha
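The long-accumulator idea is easy to emulate in software with arbitrary-precision integers. The sketch below (an illustration of the principle, not Kulisch's 4288-bit hardware design) accumulates doubles exactly at a fixed point scale and performs a single correctly rounded conversion at the end, so the result is independent of summation order:

```python
from fractions import Fraction

# 2**1074 is the reciprocal of the smallest positive subnormal double,
# so every finite double is an integer multiple of 1/SCALE.
SCALE = 1 << 1074

def long_acc_sum(values):
    acc = 0  # a Python int plays the role of the long accumulator
    for x in values:
        n, d = x.as_integer_ratio()  # d is a power of two dividing SCALE
        acc += n * (SCALE // d)      # exact fixed-point accumulation
    # one correctly rounded conversion back to double
    return float(Fraction(acc, SCALE))
```

With `[1e16, 1.0, -1e16]`, naive left-to-right `sum` loses the 1.0 and returns 0.0, while `long_acc_sum` returns 1.0 in every permutation, exactly the reproducibility property the abstract advocates.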

    Identifying scalar behavior in CUDA kernels

    Get PDF
    We propose a compiler analysis pass for programs expressed in the Single Program, Multiple Data (SPMD) programming model. It statically identifies several kinds of regular patterns that can occur between adjacent threads, including common computations, memory accesses at consecutive locations or at the same location, and uniform control flow. This knowledge can be exploited by SPMD compilers targeting SIMD architectures. We present a compiler pass, developed within the Ocelot framework, that performs this analysis on NVIDIA CUDA programs at the level of the PTX intermediate language. The results are compared with the optima obtained by simulating several sets of CUDA benchmarks.
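A minimal sketch of the kind of analysis involved (hypothetical transfer rules for illustration, not Ocelot's actual pass): represent each value by its stride between adjacent threads, so that stride 0 means uniform, a constant stride means affine in the thread index (e.g. consecutive addresses), and None means divergent:

```python
# stride abstraction: 0 = uniform, int k = affine (base + k*tid), None = divergent
UNIFORM, DIVERGENT = 0, None

def add(a, b):
    """Adding two affine values adds their strides; any divergent
    operand makes the result divergent."""
    return DIVERGENT if a is DIVERGENT or b is DIVERGENT else a + b

def mul_by_uniform(a, c):
    """Multiplying an affine value by a uniform constant c scales its stride."""
    return DIVERGENT if a is DIVERGENT else a * c

tid = 1                                        # threadIdx.x: stride 1 by definition
base_ptr = UNIFORM                             # a kernel parameter: same in all threads
addr = add(mul_by_uniform(tid, 4), base_ptr)   # base + 4*tid -> stride 4
```

A stride of 4 with 4-byte elements means adjacent threads access consecutive locations, which is exactly the pattern a SIMD-targeting compiler can turn into a vector load.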

    Enjeux de conception des architectures GPGPU (unités arithmétiques spécialisées et exploitation de la régularité)

    No full text
    Current Graphics Processing Units (GPUs) offer substantial computing power at low cost. This makes them attractive even for non-graphics calculations, opening the field of General-Purpose computation on GPUs (GPGPU). In this thesis, we develop software techniques to take advantage of the fixed-function arithmetic units that GPUs provide for scientific computations, and consider hardware modifications to execute general-purpose applications more efficiently. More specifically, we identify parallel regularity as an opportunity for improving the efficiency of parallel architectures, and we expose its potential through the simulation of an actual GPU architecture. We then consider two alternatives to take advantage of this regularity. First, we design a dynamic hardware scheme that improves the energy efficiency of datapaths and register files. Second, we present a static compiler analysis that reduces the complexity of the instruction-control hardware in GPUs.

    Path list traversal: a new class of SIMT flow tracking mechanisms

    No full text
    The SIMT execution model implemented in GPUs synchronizes groups of threads to run their common instructions on SIMD units. This model requires hardware or software mechanisms to keep track of control-flow divergence and reconvergence among threads. A new class of such algorithms has been gaining popularity in the literature over the last few years. We present a classification of these techniques based on their common characteristic, namely traversals of the control-flow graph based on lists of paths. We then compare the FPGA implementation cost of path lists and of per-thread program counters within the Simty processor. The sorted list scales significantly better from 8 threads per warp onward.
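The core of a path-list mechanism can be sketched as follows (a toy software model for illustration, not the Simty RTL): each path is a (pc, thread-set) pair, the scheduler always runs the path with the minimum pc, and paths that reach the same pc are merged, which is exactly where reconvergence happens.

```python
def run(program, n_threads):
    """Toy SIMT scheduler tracking divergence with a list of paths.

    program is a list of ops: ("nop",) or ("br_if_even", target), where
    even-numbered threads take the branch. Returns the execution trace
    as (pc, active_threads) pairs.
    """
    paths = [(0, frozenset(range(n_threads)))]
    trace = []
    while paths:
        pc = min(p for p, _ in paths)  # min-pc scheduling policy
        # merging all paths at the same pc is the reconvergence step
        mask = frozenset().union(*[m for p, m in paths if p == pc])
        paths = [(p, m) for p, m in paths if p != pc]
        if pc >= len(program):
            continue  # these threads have exited
        trace.append((pc, sorted(mask)))
        op = program[pc]
        if op[0] == "br_if_even":
            taken = frozenset(t for t in mask if t % 2 == 0)
            if taken:
                paths.append((op[1], taken))
            if mask - taken:
                paths.append((pc + 1, mask - taken))
        else:
            paths.append((pc + 1, mask))
    return trace
```

Running `run([("br_if_even", 3), ("nop",), ("nop",), ("nop",)], 4)` diverges the warp at pc 0 into {0, 2} (jumping to 3) and {1, 3} (falling through), then reconverges all four threads at pc 3, the branch's post-dominator, because the min-pc policy keeps the branched-ahead path waiting until the others catch up.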

    Parcours par liste de chemins : une nouvelle classe de mécanismes de suivi de flot SIMT

    No full text
    The SIMT execution model used in GPUs synchronizes the execution of groups of threads so as to run their common instructions on SIMD units. This model requires hardware or software mechanisms to handle control-flow divergence and reconvergence among threads. A new class of such algorithms has been emerging in the literature over the past few years. We present a classification of these techniques based on their common characteristic, a list-based traversal of the control-flow graph. We compare the FPGA implementation cost of two variants of the Simty processor, one based on such a sorted-list reconvergence mechanism and the other on an arbitration mechanism between program counters. The sorted list scales significantly better from 8 threads per warp onward.