Search CORE

87 research outputs found

Pipelining the Fast Multipole Method over a Runtime System

Author: Agullo Emmanuel
Bramas Béranger
Coulaud Olivier
Darve Eric
Messner Matthias
Toru Takahashi
Publication venue
Publication date: 01/01/2012
Field of study

Fast Multipole Methods (FMM) are a fundamental operation for the simulation of many physical problems. The high performance design of such methods usually requires to carefully tune the algorithm for both the targeted physics and the hardware. In this paper, we propose a new approach that achieves high performance across architectures. Our method consists of expressing the FMM algorithm as a task flow and employing a state-of-the-art runtime system, StarPU, in order to process the tasks on the different processing units. We carefully design the task flow, the mathematical operators, their Central Processing Unit (CPU) and Graphics Processing Unit (GPU) implementations, as well as scheduling schemes. We compute potentials and forces of 200 million particles in 48.7 seconds on a homogeneous 160 cores SGI Altix UV 100 and of 38 million particles in 13.34 seconds on a heterogeneous 12 cores Intel Nehalem processor enhanced with 3 Nvidia M2090 Fermi GPUs.Comment: No. RR-7981 (2012

arXiv.org e-Print Archive

CiteSeerX

INRIA a CCSD electronic archive server

Geometry-Oblivious FMM for Compressing Dense SPD Matrices

Author: Biros George
Levitt James
Reiz Severin
Yu Chenhan D.
Publication venue
Publication date: 01/07/2017
Field of study

We present GOFMM (geometry-oblivious FMM), a novel method that creates a hierarchical low-rank approximation, "compression," of an arbitrary dense symmetric positive definite (SPD) matrix. For many applications, GOFMM enables an approximate matrix-vector multiplication in

N \log N

or even

N

time, where

N

is the matrix size. Compression requires

N \log N

storage and work. In general, our scheme belongs to the family of hierarchical matrix approximation methods. In particular, it generalizes the fast multipole method (FMM) to a purely algebraic setting by only requiring the ability to sample matrix entries. Neither geometric information (i.e., point coordinates) nor knowledge of how the matrix entries have been generated is required, thus the term "geometry-oblivious." Also, we introduce a shared-memory parallel scheme for hierarchical matrix computations that reduces synchronization barriers. We present results on the Intel Knights Landing and Haswell architectures, and on the NVIDIA Pascal architecture for a variety of matrices.Comment: 13 pages, accepted by SC'1

arXiv.org e-Print Archive

Crossref

Enabling task parallelism for many-core architectures

Author: Atkinson Patrick R
Publication venue
Publication date: 28/09/2021
Field of study

Explore Bristol Research

Task-Based FMM for Multicore Architectures

Author: Agullo Emmanuel
Bramas Bérenger
Coulaud Olivier
Darve Eric
Messner Matthias
Takahashi Toru
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 31/03/2013
Field of study

International audienceFast Multipole Methods (FMM) are a fundamental operation for the simulation of many physical problems. The high performance design of such methods usually requires to carefully tune the algorithm for both the targeted physics and hardware. In this paper, we propose a new approach that achieves high performance across architectures. Our method consists of expressing the FMM algorithm as a task flow and employing a state- of-the-art runtime system, StarPU, to process the tasks on the different computing units. We carefully design the task flow, the mathematical operators, their implementations as well as scheduling schemes. Potentials and forces on 200 million particles are computed in 42.3 seconds on a homogeneous 160 cores SGI Altix UV 100 and good scalability is shown

INRIA a CCSD electronic archive server

NVIDIA Tensor Core Programmability, Performance & Precision

Author: Der Chien Steven Wei
Laure Erwin
Markidis Stefano
Peng Ivy Bo
Vetter Jeffrey S.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 11/03/2018
Field of study

The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called "Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to program NVIDIA Tensor Cores, their performances and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API, CUTLASS, a templated library based on WMMA, and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using of NVIDIA Tensor Cores.Comment: This paper has been accepted by the Eighth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 201

arXiv.org e-Print Archive

Crossref

Task-based FMM for multicore architectures

Author: Agullo Emmanuel
Bramas Bérenger
Coulaud Olivier
Darve Eric
Messner Matthias
Takahashi Toru
Publication venue: HAL CCSD
Publication date: 31/03/2013
Field of study

Preliminary version of a paper to appear in SIAM SISCFast Multipole Methods (FMM) are a fundamental operation for the simulation of many physical problems. The high performance design of such methods usually requires to carefully tune the algorithm for both the targeted physics and hardware. In this paper, we propose a new approach that achieves high performance across architectures. Our method consists of expressing the Fast Multipole Method (FMM) algorithm as a task flow and employing a state-of-the-art runtime system, StarPU, to process the tasks on the different computing units. We carefully design the task flow, the mathematical operators, their implementations as well as scheduling schemes. Potentials and forces on 200 million particles are computed in 48.7 seconds on a homogeneous 160 cores SGI Altix UV 100 and good scalability is shown.Les méthodes multipôles rapides (FMM) constituent une opération fondamentale pour la simulation de nombreux problèmes physiques. Leur mise en œuvre haute performance requiert habituellement d'optimiser attentivement l'algorithme à la fois pour la physique visée et le matériel utilisé. Dans ce papier, nous proposons une nouvelle approche qui atteint une performance élevée et portable. Notre méthode consiste à exprimer l'algorithme FMM comme un flot de tâches et d'employer un moteur d'exécution, StarPU, afin de traiter les tâches sur les différentes unités d'exécution. Nous concevons précisément le flot de tâches, les opérateurs mathématiques, leurs implémentations sur unité centrale de traitement (CPU) ainsi que les schémas d'ordonnancement. Nous calculons les potentiels et forces pour des problèmes avec 200 millions de particules en 48.7 secondes sur une machine homogène SGI Altix UV 100 comportant 160 cœur

INRIA a CCSD electronic archive server

A performance- and energy-oriented extended tuning process for time-step-based scientific applications

Author: Kalinnik Natalia
Kiesel Robert
Rauber Thomas
Richter Marcel
Rünger Gudula
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2021
Field of study

EPub Bayreuth