4 research outputs found

    On the Efficient Evaluation of the Exchange Correlation Potential on Graphics Processing Unit Clusters

    The predominance of Kohn-Sham density functional theory (KS-DFT) for the theoretical treatment of large, experimentally relevant systems in molecular chemistry and materials science relies primarily on the existence of efficient software implementations capable of leveraging the latest advances in modern high-performance computing (HPC). With recent trends in HPC leading towards an increasing reliance on heterogeneous, accelerator-based architectures such as graphics processing units (GPUs), existing code bases must embrace these architectural advances to maintain the high levels of performance that have come to be expected for these methods. In this work, we propose a three-level parallelism scheme for the distributed numerical integration of the exchange-correlation (XC) potential in the Gaussian basis set discretization of the Kohn-Sham equations on large computing clusters with multiple GPUs per compute node. In addition, we propose and demonstrate the efficacy of batched kernels, including batched level-3 BLAS operations, in achieving high levels of performance on the GPU. We demonstrate the performance and scalability of the proposed method, as implemented in the NWChemEx software package, by comparing against the existing scalable CPU XC integration in NWChem.
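
    The batched level-3 BLAS strategy described above maps naturally onto vendor batched GEMM interfaces. The following is a minimal sketch in C using cuBLAS's strided-batched double-precision GEMM; it illustrates the general technique rather than the NWChemEx implementation, and the block sizes, batch count, and data layout are invented placeholders.

        #include <cublas_v2.h>
        #include <cuda_runtime.h>

        int main(void) {
            /* Hypothetical sizes: many small dense blocks, of the kind that
               arise per batch of quadrature points in XC integration. */
            const int m = 32, n = 32, k = 32, batch = 512;
            const long long sA = (long long)m * k;
            const long long sB = (long long)k * n;
            const long long sC = (long long)m * n;

            double *A, *B, *C;
            cudaMalloc((void **)&A, sizeof(double) * sA * batch);
            cudaMalloc((void **)&B, sizeof(double) * sB * batch);
            cudaMalloc((void **)&C, sizeof(double) * sC * batch);
            /* ... fill A and B with per-block data (omitted) ... */

            cublasHandle_t h;
            cublasCreate(&h);
            const double alpha = 1.0, beta = 0.0;
            /* One call launches all `batch` small GEMMs; each product alone
               is far too small to saturate a GPU, but the batch can. */
            cublasDgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                                      &alpha, A, m, sA, B, k, sB,
                                      &beta,  C, m, sC, batch);
            cudaDeviceSynchronize();

            cublasDestroy(h);
            cudaFree(A); cudaFree(B); cudaFree(C);
            return 0;
        }

    The design point batching exploits is that kernel-launch and scheduling overheads are amortized across the whole batch, which is what makes many small, uniform matrix products profitable on a GPU.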

    Task-based parallel programming for scalable algorithms: application to matrix multiplication

    Task-based programming models have succeeded in gaining the interest of the high-performance mathematical software community thanks to the way they relieve part of the burden of developing and implementing distributed-memory parallel algorithms in an efficient and portable manner. On increasingly large and heterogeneous clusters of computers, these models appear as a way to maintain and enhance more complex algorithms. However, task-based programming models lack the flexibility and the features necessary to express, in an elegant and compact way, scalable algorithms that rely on advanced communication patterns. We show that the Sequential Task Flow (STF) paradigm can be extended to write a compact yet efficient and scalable general matrix multiplication (GEMM). This extension required few modifications to the StarPU runtime system. The final implementation is shown to be competitive with state-of-the-art libraries at up to 32,768 cores and may outperform them on some specific problem configurations.
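
    In the Sequential Task Flow paradigm, tasks are submitted in plain sequential program order and the runtime infers dependencies from the declared data-access modes. The sketch below, written in C against StarPU's public API, expresses a blocked GEMM this way; the tile size and tile count are arbitrary placeholders, the per-tile kernel is a naive triple loop, and none of the paper's extensions for advanced communication patterns are shown.

        #include <starpu.h>
        #include <stdint.h>
        #include <stdlib.h>

        /* Per-tile kernel: C_tile += A_tile * B_tile for contiguous,
           row-major NB x NB tiles. */
        static void gemm_tile(void *buffers[], void *cl_arg) {
            (void)cl_arg;
            double *A = (double *)STARPU_MATRIX_GET_PTR(buffers[0]);
            double *B = (double *)STARPU_MATRIX_GET_PTR(buffers[1]);
            double *C = (double *)STARPU_MATRIX_GET_PTR(buffers[2]);
            unsigned nb = STARPU_MATRIX_GET_NX(buffers[0]);
            for (unsigned i = 0; i < nb; i++)
                for (unsigned j = 0; j < nb; j++)
                    for (unsigned l = 0; l < nb; l++)
                        C[i * nb + j] += A[i * nb + l] * B[l * nb + j];
        }

        static struct starpu_codelet gemm_cl = {
            .cpu_funcs = { gemm_tile },
            .nbuffers  = 3,
            .modes     = { STARPU_R, STARPU_R, STARPU_RW },
        };

        int main(void) {
            enum { NB = 256, T = 4 };  /* tile size, tiles per dimension */
            if (starpu_init(NULL) != 0) return 1;

            /* Contiguous backing storage; tile (i,j) starts at offset
               (i*T + j) * NB * NB. */
            double *A = calloc((size_t)T * T * NB * NB, sizeof(double));
            double *B = calloc((size_t)T * T * NB * NB, sizeof(double));
            double *C = calloc((size_t)T * T * NB * NB, sizeof(double));

            starpu_data_handle_t Ah[T][T], Bh[T][T], Ch[T][T];
            for (int i = 0; i < T; i++)
                for (int j = 0; j < T; j++) {
                    size_t off = ((size_t)i * T + j) * NB * NB;
                    starpu_matrix_data_register(&Ah[i][j], STARPU_MAIN_RAM,
                        (uintptr_t)(A + off), NB, NB, NB, sizeof(double));
                    starpu_matrix_data_register(&Bh[i][j], STARPU_MAIN_RAM,
                        (uintptr_t)(B + off), NB, NB, NB, sizeof(double));
                    starpu_matrix_data_register(&Ch[i][j], STARPU_MAIN_RAM,
                        (uintptr_t)(C + off), NB, NB, NB, sizeof(double));
                }

            /* Sequential submission: the runtime extracts parallelism from
               the R/RW modes, so the loop nest reads like the blocked
               algorithm itself. */
            for (int i = 0; i < T; i++)
                for (int j = 0; j < T; j++)
                    for (int l = 0; l < T; l++)
                        starpu_task_insert(&gemm_cl,
                                           STARPU_R,  Ah[i][l],
                                           STARPU_R,  Bh[l][j],
                                           STARPU_RW, Ch[i][j], 0);

            starpu_task_wait_for_all();
            for (int i = 0; i < T; i++)
                for (int j = 0; j < T; j++) {
                    starpu_data_unregister(Ah[i][j]);
                    starpu_data_unregister(Bh[i][j]);
                    starpu_data_unregister(Ch[i][j]);
                }
            starpu_shutdown();
            free(A); free(B); free(C);
            return 0;
        }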

    Fast and Memory Efficient Strassen’s Matrix Multiplication on GPU Cluster

    Prior implementations of Strassen's matrix multiplication algorithm on GPUs traded additional workspace, in the form of global memory or registers, for time. Although Strassen's algorithm offers a reduction in computational complexity compared to the classical algorithm, the memory overhead associated with it limits its practical utility. While there have been past attempts at reducing the memory footprint of Strassen's algorithm by compromising parallelism, no prior implementation, to our knowledge, was able to hide the workspace requirement successfully. This thesis presents an implementation of Strassen's matrix multiplication in CUDA, titled Multi-Stage Memory Efficient Strassen (MSMES), that eliminates additional workspace requirements by reusing and recovering input matrices. MSMES organizes the steps involved in Strassen's algorithm into five stages, where multiple steps in the same stage can be executed in parallel. Two additional stages that allow the recovery of the input matrices are also discussed in the thesis. Unlike previous works, MSMES has no additional memory requirements irrespective of the level of recursion of Strassen's algorithm. Experiments performed with MSMES (with the recovery stages) on NVIDIA Tesla V100 and NVIDIA GTX 1660 Ti GPUs yielded higher compute performance and lower memory requirements than the NVIDIA library function for double-precision matrix multiplication, cublasDgemm. In the multi-GPU adaptation of matrix multiplication, we explored the performance of a Strassen-based and a tile-based global decomposition scheme. We also evaluated the performance of using MSMES and cublasDgemm for the local matrix multiplications under each of the global decomposition schemes. From these experiments, the combination of Strassen-Winograd decomposition with MSMES yielded the highest speedup among all tested combinations.
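
    For context, Strassen's algorithm replaces the eight half-size block products of the classical algorithm with seven, giving the recurrence T(n) = 7T(n/2) + O(n^2) and hence O(n^(log2 7)) ≈ O(n^2.81) work; the extra block additions are exactly what creates the workspace pressure the thesis targets. The following C sketch shows one level of the textbook seven-product formulation with its scratch buffers made explicit; it is illustrative only and does not reproduce MSMES's five-stage, in-place organization or its CUDA kernels.

        #include <stdlib.h>

        /* C = A * B for h x h row-major blocks (naive base case). */
        static void mm(const double *A, const double *B, double *C, int h) {
            for (int i = 0; i < h; i++)
                for (int j = 0; j < h; j++) {
                    double s = 0.0;
                    for (int k = 0; k < h; k++) s += A[i*h + k] * B[k*h + j];
                    C[i*h + j] = s;
                }
        }

        static void add(const double *X, const double *Y, double *Z, int h) {
            for (int i = 0; i < h*h; i++) Z[i] = X[i] + Y[i];
        }

        static void sub(const double *X, const double *Y, double *Z, int h) {
            for (int i = 0; i < h*h; i++) Z[i] = X[i] - Y[i];
        }

        /* Copy the h x h quadrant of an n x n matrix starting at (r, c). */
        static void get_q(const double *M, double *Q, int n, int h, int r, int c) {
            for (int i = 0; i < h; i++)
                for (int j = 0; j < h; j++) Q[i*h + j] = M[(r + i)*n + (c + j)];
        }

        static void put_q(double *M, const double *Q, int n, int h, int r, int c) {
            for (int i = 0; i < h; i++)
                for (int j = 0; j < h; j++) M[(r + i)*n + (c + j)] = Q[i*h + j];
        }

        /* One level of Strassen on n x n row-major matrices (n even):
           seven half-size products instead of the classical eight. */
        void strassen_one_level(const double *A, const double *B, double *C, int n) {
            int h = n / 2;
            size_t q = (size_t)h * h;
            /* 8 quadrants + 7 products + 2 temporaries = 17 scratch blocks. */
            double *w = malloc(17 * q * sizeof(double));
            double *A11 = w,       *A12 = w + q,   *A21 = w + 2*q, *A22 = w + 3*q;
            double *B11 = w + 4*q, *B12 = w + 5*q, *B21 = w + 6*q, *B22 = w + 7*q;
            double *M[7];
            for (int i = 0; i < 7; i++) M[i] = w + (8 + i)*q;
            double *t1 = w + 15*q, *t2 = w + 16*q;

            get_q(A, A11, n, h, 0, 0); get_q(A, A12, n, h, 0, h);
            get_q(A, A21, n, h, h, 0); get_q(A, A22, n, h, h, h);
            get_q(B, B11, n, h, 0, 0); get_q(B, B12, n, h, 0, h);
            get_q(B, B21, n, h, h, 0); get_q(B, B22, n, h, h, h);

            add(A11, A22, t1, h); add(B11, B22, t2, h); mm(t1, t2, M[0], h);  /* M1 */
            add(A21, A22, t1, h); mm(t1, B11, M[1], h);                       /* M2 */
            sub(B12, B22, t2, h); mm(A11, t2, M[2], h);                       /* M3 */
            sub(B21, B11, t2, h); mm(A22, t2, M[3], h);                       /* M4 */
            add(A11, A12, t1, h); mm(t1, B22, M[4], h);                       /* M5 */
            sub(A21, A11, t1, h); add(B11, B12, t2, h); mm(t1, t2, M[5], h);  /* M6 */
            sub(A12, A22, t1, h); add(B21, B22, t2, h); mm(t1, t2, M[6], h);  /* M7 */

            add(M[0], M[3], t1, h); sub(t1, M[4], t1, h); add(t1, M[6], t1, h);
            put_q(C, t1, n, h, 0, 0);                          /* C11 = M1+M4-M5+M7 */
            add(M[2], M[4], t1, h); put_q(C, t1, n, h, 0, h);  /* C12 = M3+M5 */
            add(M[1], M[3], t1, h); put_q(C, t1, n, h, h, 0);  /* C21 = M2+M4 */
            sub(M[0], M[1], t1, h); add(t1, M[2], t1, h); add(t1, M[5], t1, h);
            put_q(C, t1, n, h, h, h);                          /* C22 = M1-M2+M3+M6 */
            free(w);
        }

    Note the seventeen half-size scratch blocks allocated above, and that the cost recurs at every recursion level; eliminating this class of temporaries by reusing and then recovering the input quadrants is the memory saving the abstract claims.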