
    Pipelining the Fast Multipole Method over a Runtime System

    Fast Multipole Methods (FMM) are a fundamental operation for the simulation of many physical problems. The high-performance design of such methods usually requires carefully tuning the algorithm for both the targeted physics and the hardware. In this paper, we propose a new approach that achieves high performance across architectures. Our method consists of expressing the FMM algorithm as a task flow and employing a state-of-the-art runtime system, StarPU, to process the tasks on the different processing units. We carefully design the task flow, the mathematical operators, their Central Processing Unit (CPU) and Graphics Processing Unit (GPU) implementations, as well as the scheduling schemes. We compute the potentials and forces of 200 million particles in 48.7 seconds on a homogeneous 160-core SGI Altix UV 100, and of 38 million particles in 13.34 seconds on a heterogeneous 12-core Intel Nehalem processor enhanced with 3 Nvidia M2090 Fermi GPUs.
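    To make the task-flow idea concrete, here is a minimal, hypothetical sketch of how one FMM operator could be declared as a StarPU codelet and submitted as a task. The kernel body and all names (p2p_cpu, p2p_cl) are placeholders rather than the paper's code; only the codelet/task-insert idiom is StarPU's, and exact API details may vary between StarPU versions.

```c
#include <starpu.h>

#define NP 1024 /* particles in one toy block */

/* CPU variant of a hypothetical near-field (P2P) operator. */
static void p2p_cpu(void *buffers[], void *cl_arg)
{
    float *pot = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    (void)cl_arg;
    for (unsigned i = 0; i < n; i++)
        pot[i] += 1.0f; /* placeholder for the real particle interactions */
}

/* A codelet bundles the CPU and (optionally) CUDA variants of one
 * operator; the scheduler picks a processing unit at run time. */
static struct starpu_codelet p2p_cl = {
    .cpu_funcs = { p2p_cpu },
    /* .cuda_funcs = { p2p_cuda },  the GPU variant would go here */
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    float pot[NP] = { 0.0f };
    starpu_data_handle_t h;

    if (starpu_init(NULL) != 0) return 1;
    starpu_vector_data_register(&h, STARPU_MAIN_RAM,
                                (uintptr_t)pot, NP, sizeof(float));

    /* Submission is asynchronous; the task flow is derived from the
     * declared data accesses (here STARPU_RW on h). */
    starpu_task_insert(&p2p_cl, STARPU_RW, h, 0);

    starpu_task_wait_for_all();
    starpu_data_unregister(h);
    starpu_shutdown();
    return 0;
}
```

    Registering a cuda_funcs entry alongside cpu_funcs is what lets a single task flow execute across CPU cores and GPUs under a scheduling policy.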

    StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators

    GPUs have largely entered HPC clusters, as shown by the top entries of the latest TOP500 list. Exploiting such machines is however very challenging, not only because two separate paradigms, MPI and CUDA or OpenCL, must be combined, but also because nodes are heterogeneous and thus require careful load balancing within the nodes themselves. Current paradigms are usually limited to offloading only parts of the computation while leaving CPUs idle, or they require static work partitioning between CPUs and GPUs. To handle single-node architecture heterogeneity, we have previously proposed StarPU, a runtime system capable of dynamically scheduling tasks in an optimized way on such machines. We show here how the task paradigm of StarPU has been combined with MPI communications, and how we extended the task paradigm itself to allow mapping the task graph onto MPI clusters so as to automatically achieve an optimized distributed execution. We show how a sequential-like Cholesky source code can easily be extended into a scalable distributed parallel execution, already exhibiting a speedup of 5 on 6 nodes.
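    A minimal sketch of this extended paradigm, assuming StarPU's documented starpu_mpi_* task-insert idiom (initialization order and details vary across StarPU versions); the kernel and data layout are invented placeholders. The key point is that every rank submits the same sequential-looking task stream, and the runtime maps tasks to ranks and generates the MPI transfers.

```c
#include <mpi.h>
#include <starpu.h>
#include <starpu_mpi.h>

#define NX 1024

/* Trivial placeholder kernel operating on the data block. */
static void scal_cpu(void *buffers[], void *cl_arg)
{
    float *v   = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    (void)cl_arg;
    for (unsigned i = 0; i < n; i++)
        v[i] *= 2.0f;
}

static struct starpu_codelet scal_cl = {
    .cpu_funcs = { scal_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(int argc, char **argv)
{
    int rank;
    static float block[NX];
    starpu_data_handle_t h;

    starpu_init(NULL);
    starpu_mpi_init(&argc, &argv, 1); /* 1: let StarPU initialize MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank registers the same logical piece of data; only the
     * owner provides a real allocation, the others pass -1/NULL. */
    if (rank == 0)
        starpu_vector_data_register(&h, STARPU_MAIN_RAM,
                                    (uintptr_t)block, NX, sizeof(float));
    else
        starpu_vector_data_register(&h, -1, (uintptr_t)NULL,
                                    NX, sizeof(float));
    starpu_mpi_data_register(h, /* MPI tag */ 42, /* owner rank */ 0);

    /* Same sequential-looking submission on every rank: StarPU-MPI
     * decides where the task runs and moves data as needed. */
    starpu_mpi_task_insert(MPI_COMM_WORLD, &scal_cl, STARPU_RW, h, 0);

    starpu_task_wait_for_all();
    starpu_data_unregister(h);
    starpu_mpi_shutdown();
    starpu_shutdown();
    return 0;
}
```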

    Task-based programming for Seismic Imaging: Preliminary Results

    The level of hardware complexity of current supercomputers is forcing the High Performance Computing (HPC) community to reconsider parallel programming paradigms and standards. The high level of hardware abstraction provided by task-based paradigms makes them excellent candidates for writing portable codes that consistently deliver high performance across a wide range of platforms. While this paradigm has proved efficient for dense and sparse linear solvers, it remains to be demonstrated that industrial parallel codes relying on the classical Message Passing Interface (MPI) standard, which accumulate decades of expertise (and countless lines of code), can be revisited and turned into efficient task-based programs. In this paper, we study the applicability of task-based programming to a Reverse Time Migration (RTM) application for seismic imaging. The initial MPI-based application is turned into a task-based code executed on top of the PaRSEC runtime system. Preliminary results show that the approach is competitive with (and potentially even superior to) the original MPI code on a homogeneous multicore node, and that it can exploit complex hardware such as a cache-coherent Non-Uniform Memory Access (ccNUMA) node or an Intel Xeon Phi accelerator much more efficiently.
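    The paper ports RTM onto PaRSEC; as a runtime-neutral illustration of the same decomposition, the sketch below expresses a toy time-stepped stencil (the core pattern of RTM's wave propagation) as one task per (time step, tile) with declared data dependencies, here using OpenMP tasks. The kernel, sizes, and periodic boundary are placeholder simplifications, not the paper's code.

```c
#include <stdio.h>

#define NT 8     /* spatial tiles (periodic domain for simplicity) */
#define TS 128   /* grid points per tile */
#define STEPS 16 /* time steps */

static double bufA[NT][TS], bufB[NT][TS];

/* Placeholder propagation kernel: a 3-point average stands in for the
 * real finite-difference wave-equation stencil used by RTM. */
static void advance(double *out, const double *in,
                    const double *left, const double *right)
{
    for (int k = 0; k < TS; k++) {
        double l = (k > 0)      ? in[k - 1] : left[TS - 1];
        double r = (k < TS - 1) ? in[k + 1] : right[0];
        out[k] = 0.5 * in[k] + 0.25 * (l + r);
    }
}

int main(void)
{
    double (*prev)[TS] = bufA, (*cur)[TS] = bufB;

    for (int i = 0; i < NT; i++)
        for (int k = 0; k < TS; k++)
            bufA[i][k] = (i == NT / 2) ? 1.0 : 0.0; /* initial pulse */

    #pragma omp parallel
    #pragma omp single
    for (int t = 0; t < STEPS; t++) {
        for (int i = 0; i < NT; i++) {
            int il = (i + NT - 1) % NT, ir = (i + 1) % NT;
            double *dst = cur[i];
            const double *src = prev[i], *sl = prev[il], *sr = prev[ir];
            /* One task per (time step, tile); the declared accesses let
             * the runtime overlap independent tiles across time steps. */
            #pragma omp task depend(out: cur[i]) \
                             depend(in: prev[i], prev[il], prev[ir])
            advance(dst, src, sl, sr);
        }
        double (*tmp)[TS] = prev; prev = cur; cur = tmp;
    }

    printf("u[mid][0] after %d steps: %f\n", STEPS, prev[NT / 2][0]);
    return 0;
}
```

    Because dependencies are attached to data rather than to an iteration structure, tile (t+1, i) can start as soon as its three input tiles from step t are done, which is the wavefront parallelism an MPI bulk-synchronous version gives up.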

    Experimenting task-based runtimes on a legacy Computational Fluid Dynamics code with unstructured meshes

    Advances in high performance computing hardware lead to higher levels of parallelism and optimization in scientific applications, and more specifically in computational fluid dynamics codes. To contain the complexity that such architectures bring while still exploiting a good share of the parallelism offered by modern clusters, the task-based approach has recently gained popularity, as it is expected to deliver portability and performance with a relatively simple programming model. In this paper, we present the process of adapting part of Code Saturne, our legacy code at EDF R&D, into a task-based form using the PaRSEC (Parallel Runtime Scheduling and Execution Control) framework. We first show how our main algorithm is recast in a simpler form to remove part of the code's complexity, and then present its task-based implementation. We compare the performance of the various forms of our code and discuss the benefits of task-based runtimes in terms of scalability, ease of incremental deployment in a legacy CFD code, and maintainability.
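    To illustrate the kind of restructuring involved, here is a hypothetical sketch (not Code Saturne code) of an unstructured-mesh update split into per-block tasks plus interface tasks, with write conflicts expressed as data dependencies; OpenMP tasks stand in for the PaRSEC runtime used in the paper, and the connectivity and kernels are toy placeholders.

```c
#include <stdio.h>

#define NB 16          /* mesh blocks after partitioning */
static double acc[NB]; /* per-block accumulator (stands in for cell data) */

/* Placeholder kernels: interior work on one block, and the coupling
 * terms of an interface shared by two blocks. */
static void block_kernel(int b)        { acc[b] += 1.0; }
static void iface_kernel(int a, int b) { acc[a] += 0.5; acc[b] += 0.5; }

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < NB; b++) {
            #pragma omp task depend(inout: acc[b])
            block_kernel(b);
        }
        /* Toy connectivity: block b shares an interface with b+1.  An
         * interface task mutates both adjacent blocks, so it declares
         * inout on both; the runtime serializes only conflicting tasks
         * instead of requiring a global barrier or mesh coloring. */
        for (int b = 0; b + 1 < NB; b++) {
            #pragma omp task depend(inout: acc[b], acc[b + 1])
            iface_kernel(b, b + 1);
        }
    }
    printf("acc[0] = %f\n", acc[0]);
    return 0;
}
```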

    Hierarchical DAG Scheduling for Hybrid Distributed Systems

    Accelerator-enhanced computing platforms have drawn a lot of attention due to their massive peak computational capacity. Despite significant advances in the programming interfaces to such hybrid architectures, traditional programming paradigms struggle to map the resulting multi-dimensional heterogeneity onto the expressed algorithmic parallelism, resulting in sub-optimal effective performance. Task-based programming paradigms can alleviate some of the programming challenges on distributed hybrid many-core architectures. In this paper we take this concept a step further by showing that the potential of task-based programming paradigms can be greatly increased with minimal modification of the underlying runtime combined with the right algorithmic changes. We propose two novel recursive algorithmic variants for one-sided factorizations and describe the changes to the PaRSEC task-scheduling runtime needed to build a framework where the task granularity is dynamically adjusted, adapting the degree of available parallelism and kernel efficiency to runtime conditions. Based on an extensive set of results, we show that for one-sided factorizations, i.e. Cholesky and QR, a carefully written algorithm, supported by an adaptive task-based runtime, is capable of reaching a degree of performance and scalability never achieved before in distributed hybrid environments.
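    The recursive variants and the granularity-adapting runtime changes are the paper's contribution; the flat tiled Cholesky task DAG they build on is standard and can be sketched as follows, with placeholder kernels standing in for the BLAS/LAPACK tile operations (dpotrf, dtrsm, dsyrk, dgemm) and OpenMP tasks standing in for PaRSEC.

```c
#include <stdio.h>

#define NT 4  /* tiles per dimension */
#define TS 64 /* tile size */
static double A[NT][NT][TS * TS];

/* Placeholder tile kernels; a real code calls LAPACK/BLAS on each
 * TS x TS tile. */
static void potrf(double *akk)                   { (void)akk; }
static void trsm(const double *akk, double *aik) { (void)akk; (void)aik; }
static void syrk(const double *aik, double *aii) { (void)aik; (void)aii; }
static void gemm(const double *aik, const double *ajk, double *aij)
{ (void)aik; (void)ajk; (void)aij; }

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; k++) {
        #pragma omp task depend(inout: A[k][k])
        potrf(A[k][k]);
        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
            trsm(A[k][k], A[i][k]);
        }
        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: A[i][k]) depend(inout: A[i][i])
            syrk(A[i][k], A[i][i]);
            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(in: A[i][k], A[j][k]) \
                                 depend(inout: A[i][j])
                gemm(A[i][k], A[j][k], A[i][j]);
            }
        }
    }
    puts("factorization DAG executed");
    return 0;
}
```

    With a fixed TS, such a DAG either starves GPUs (tiles too small) or CPUs (tiles too large); making TS effectively adaptive per task is the granularity problem the paper addresses.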

    Task-based Conjugate-Gradient for multi-GPUs platforms

    Whereas most of today's parallel High Performance Computing (HPC) software is written as highly tuned code that handles low-level details itself, the advent of the manycore era forces the community to consider modular programming paradigms that delegate part of the work to third-party software. This latter approach has been shown to be very productive and efficient with regular algorithms, such as dense linear algebra solvers. In this paper we show that such a model can be efficiently applied to a much more irregular and less compute-intensive algorithm. We illustrate our discussion with the standard unpreconditioned Conjugate Gradient (CG) method, which we carefully express as a task-based algorithm. We use the StarPU runtime system to assess the efficiency of the approach on a computational platform consisting of three NVIDIA Fermi GPUs. We show that a nearly optimal speedup (up to 2.89 relative to a single-GPU execution) can be reached when processing large sparse matrices, and that the performance is portable when the low-level memory transfer mechanism is changed.
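    The paper expresses CG over StarPU; the sketch below conveys the same dataflow decomposition with OpenMP tasks and a dense placeholder operator (the real code would use a sparse matrix and GPU kernels). Each vector or scalar operation of a CG iteration becomes a task whose ordering follows only from its declared data accesses, which is what lets the runtime pipeline iterations.

```c
#include <stdio.h>
#include <math.h>

#define N 64
static double Amat[N][N], x[N], b[N], r[N], p[N], q[N];
static double alpha, beta, rr, rr_new, pq;

/* Dense placeholder operator standing in for the sparse SpMV. */
static void spmv(double y[N], const double v[N])
{
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += Amat[i][j] * v[j];
    }
}

static double dot(const double u[N], const double v[N])
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += u[i] * v[i];
    return s;
}

int main(void)
{
    /* Toy SPD system: diagonally dominant A, b = 1, x0 = 0. */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            Amat[i][j] = (i == j) ? 2.0 : 1.0 / N;
        b[i] = 1.0; x[i] = 0.0; r[i] = b[i]; p[i] = r[i];
    }
    rr = dot(r, r);

    #pragma omp parallel
    #pragma omp single
    for (int it = 0; it < 50; it++) {
        #pragma omp task depend(in: p) depend(out: q)     /* q = A p */
        spmv(q, p);
        #pragma omp task depend(in: p, q) depend(out: pq) /* pq = <p,q> */
        pq = dot(p, q);
        #pragma omp task depend(in: rr, pq) depend(out: alpha)
        alpha = rr / pq;
        #pragma omp task depend(in: alpha, p) depend(inout: x)
        for (int i = 0; i < N; i++) x[i] += alpha * p[i];
        #pragma omp task depend(in: alpha, q) depend(inout: r) \
                         depend(out: rr_new)
        {
            for (int i = 0; i < N; i++) r[i] -= alpha * q[i];
            rr_new = dot(r, r);
        }
        #pragma omp task depend(in: rr_new) depend(inout: rr) \
                         depend(out: beta)
        { beta = rr_new / rr; rr = rr_new; }
        #pragma omp task depend(in: r, beta) depend(inout: p)
        for (int i = 0; i < N; i++) p[i] = r[i] + beta * p[i];
    }

    printf("||r|| after 50 iterations: %g\n", sqrt(rr));
    return 0;
}
```

    A fixed iteration count keeps the sketch simple; a real task-based CG also has to decide where to test convergence without draining the pipeline of submitted tasks.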

    Benchmarking GPUs to tune dense linear algebra
