Although the hardware has dramatically changed in the last few years, nodes of multicore chips augmented by Graphics Processing Units (GPUs) seem to be a trend of major importance. Previous approaches for scheduling dense linear operations on such a complex node led to high performance but at the double cost of not using the potential of all the cores and producing static, non-generic code. In this extended abstract, we present a new approach for scheduling dense linear algebra operations on multicore architectures with GPU accelerators using a dynamic scheduler capable of using the full potential of the node [1]. We underline the benefits both in terms of programmability and performance. We illustrate our approach with a Cholesky factorization relying on cutting-edge GPU and CPU kernels [2], [3], achieving roughly 900 Gflop/s on an eight-core node accelerated with three NVIDIA Tesla GPUs.

Agullo, Emmanuel

Augonnet, Cédric

Dongarra, Jack

Ltaief, Hatem

Namyst, Raymond

Roman, Jean

Thibault, Samuel

Tomov, Stanimire

INRIA a CCSD electronic archive server

HAL Id: inria-00547616
https://hal.inria.fr/inria-00547616
Submitted on 16 Dec 2010

Dynamically scheduled Cholesky factorization on multicore architectures with GPU accelerators

Emmanuel Agullo∗, Cédric Augonnet∗, Jack Dongarra†‡§, Hatem Ltaief†, Raymond Namyst∗, Jean Roman∗, Samuel Thibault∗, Stanimire Tomov†

∗ INRIA – LaBRI – University of Bordeaux
† Department of Electrical Engineering and Computer Science – University of Tennessee, Knoxville
‡ Computer Science and Mathematics Division – Oak Ridge National Laboratory
§ School of Mathematics & School of Computer Science – University of Manchester

{emmanuel.agullo, cedric.augonnet, raymond.namyst, jean.roman, samuel.thibault}@inria.fr
{dongarra, ltaief, tomov}@eecs.utk.edu

To cite this version: Emmanuel Agullo, Cédric Augonnet, Jack Dongarra, Hatem Ltaief, Raymond Namyst, et al. Dynamically scheduled Cholesky factorization on multicore architectures with GPU accelerators. Symposium on Application Accelerators in High Performance Computing (SAAHPC), Jul 2010, Knoxville, United States. inria-00547616

Abstract: Although the hardware has dramatically changed in the last few years, nodes of multicore chips augmented by Graphics Processing Units (GPUs) seem to be a trend of major importance.
Previous approaches for scheduling dense linear operations on such a complex node led to high performance but at the double cost of not using the potential of all the cores and producing static, non-generic code. In this extended abstract, we present a new approach for scheduling dense linear algebra operations on multicore architectures with GPU accelerators using a dynamic scheduler capable of using the full potential of the node [1]. We underline the benefits both in terms of programmability and performance. We illustrate our approach with a Cholesky factorization relying on cutting-edge GPU and CPU kernels [2], [3], achieving roughly 900 Gflop/s on an eight-core node accelerated with three NVIDIA Tesla GPUs.

I. CHOLESKY FACTORIZATION

The Cholesky factorization (or Cholesky decomposition) is mainly used for the numerical solution of linear equations $Ax = b$, where $A$ is symmetric and positive definite. Such systems arise often in physics applications, where $A$ is positive definite due to the nature of the modeled physical phenomenon. This happens frequently in numerical solutions of partial differential equations. The Cholesky factorization of an $n \times n$ real symmetric positive definite matrix $A$ has the form

$$A = LL^T,$$

where $L$ is an $n \times n$ real lower triangular matrix with positive diagonal elements. In LAPACK, the double precision algorithm is implemented by the DPOTRF routine. A single step of the algorithm is implemented by a sequence of calls to the LAPACK and BLAS routines DSYRK, DPOTF2, DGEMM and DTRSM. The tile Cholesky algorithm is identical to the block Cholesky algorithm implemented in LAPACK, except that it processes the matrix by tiles. Otherwise, the exact same operations are applied. The algorithm relies on four basic operations implemented by four computational kernels: DPOTRF, which performs the Cholesky factorization of a diagonal tile; DTRSM, which performs a triangular solve; DSYRK, which executes a symmetric rank-k update; and DGEMM, which performs a matrix multiplication. Figure 1 shows the pseudocode of the Cholesky factorization.

Fig. 1. Pseudocode of the tile Cholesky factorization.

II. THE STARPU FRAMEWORK

StarPU is a runtime system that schedules tasks onto accelerator-based platforms. It is meant to be used as a back-end for, e.g., parallel language compilation environments and high-performance libraries. The two basic principles of StarPU are, first, that tasks can have several implementations, for some or each of the various heterogeneous processing units available in the machine, and second, that transfers of data pieces to these processing units are handled transparently by StarPU.

Thanks to auto-tuning facilities, StarPU transparently predicts execution time and data transfer overhead. This permits StarPU's dynamic scheduler to avoid load imbalance while enforcing data locality to reduce the pressure on the memory subsystem, which is a critical resource for accelerator-based platforms.

What is required to port an application on top of StarPU?
• Register the different data to StarPU (in our case, we register the tiles).
• Create wrappers for the different kernel implementations (e.g., for SGEMM).
• Describe the algorithm as a set of tasks. A StarPU task is defined by a codelet, the list of data that should be accessed by the task, and their access modes.
• Task dependencies are either given explicitly or automatically derived from data dependencies if the different tasks are submitted in an order that corresponds to a valid sequential execution.

Note that not all tasks are needed up front to perform the algorithm, so the application can adapt itself dynamically given performance feedback from StarPU. We can also cap the amount of resources (e.g., memory or number of processing units) required to perform the computations.
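As an illustration only, the task-based description above can be sketched in a few lines of Python. The sketch lists the tasks of a tile Cholesky factorization in a valid sequential order, attaches an access mode to every tile a task touches, and then derives the dependencies from those accesses alone. It is a toy model and does not use StarPU's actual API: the names `Task`, `tile_cholesky_tasks`, `infer_dependencies` and `critical_path_length` are hypothetical, and the right-looking loop order is an assumption, since the pseudocode of Figure 1 is not reproduced in this text.

```python
from collections import namedtuple

# Hypothetical task record: a kernel name plus (tile, access mode) pairs.
# A tile is identified by its (row, column) position in the tile grid.
Task = namedtuple("Task", ["kernel", "accesses"])


def tile_cholesky_tasks(nt):
    """List the tasks of a tile Cholesky on an nt x nt tile grid.

    Assumption: standard right-looking order. Only the lower triangle
    of the tile grid is referenced.
    """
    tasks = []
    for k in range(nt):
        # Factorize the diagonal tile.
        tasks.append(Task("POTRF", [((k, k), "RW")]))
        # Triangular solves on the tiles below the diagonal tile.
        for m in range(k + 1, nt):
            tasks.append(Task("TRSM", [((k, k), "R"), ((m, k), "RW")]))
        # Update the trailing submatrix.
        for n in range(k + 1, nt):
            tasks.append(Task("SYRK", [((n, k), "R"), ((n, n), "RW")]))
            for m in range(n + 1, nt):
                tasks.append(Task(
                    "GEMM",
                    [((m, k), "R"), ((n, k), "R"), ((m, n), "RW")],
                ))
    return tasks


def infer_dependencies(tasks):
    """Derive task dependencies from data accesses in submission order.

    A task depends on the last writer of every tile it accesses, and a
    writer also depends on the readers that came after that last write.
    """
    last_writer = {}   # tile -> index of the last task that wrote it
    readers = {}       # tile -> indices of readers since the last write
    deps = [set() for _ in tasks]
    for t, task in enumerate(tasks):
        for tile, mode in task.accesses:
            if tile in last_writer:
                deps[t].add(last_writer[tile])
            if mode == "RW":
                deps[t].update(readers.get(tile, ()))
        for tile, mode in task.accesses:
            if mode == "RW":
                last_writer[tile] = t
                readers[tile] = []
            else:
                readers.setdefault(tile, []).append(t)
    return deps


def critical_path_length(deps):
    """Number of tasks on the longest dependency chain."""
    depth = []
    for d in deps:
        depth.append(1 + max((depth[p] for p in d), default=0))
    return max(depth, default=0)


if __name__ == "__main__":
    tasks = tile_cholesky_tasks(3)
    deps = infer_dependencies(tasks)
    for t, task in enumerate(tasks):
        print(t, task.kernel, sorted(deps[t]))
    print("critical path:", critical_path_length(deps))
```

For a 3 × 3 tile grid the sketch produces 10 tasks. Tasks with no dependency between them, such as the TRSM tasks of one step, may run concurrently on different processing units, which is the freedom a dynamic scheduler exploits.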
III. PRELIMINARY RESULTS

We briefly present preliminary results obtained for the Cholesky factorization in single precision. The GPU kernels were taken from the MAGMA library [2], and we used kernels from the PLASMA library on the multicore CPUs [3]. The platform used for this experiment is composed of two quad-core Intel Nehalem X5550 CPUs (8 CPU cores total) running at 2.67 GHz with 48 GB of memory divided into two Non-Uniform Memory Access (NUMA) nodes. It is enhanced with three NVIDIA Quadro FX5800 GPUs of 240 cores each (720 GPU cores total) running at 1.3 GHz with 4 GB of GDDR3 per GPU. Using the three GPUs (associated with three CPU cores), the Cholesky factorization achieved 780 Gflop/s (Figure 2), corresponding to a perfect speedup of 3 (Figure 3). Using the five remaining cores increases the performance to 900 Gflop/s. These additional 120 Gflop/s exceed the SGEMM peak of the corresponding cores (100 Gflop/s). Although this result is counterintuitive, it corresponds to a natural property of parallel computing. Indeed, the SPOTRF operation (level-2 BLAS) that is applied to diagonal blocks achieves a very low percentage of the theoretical peak of a GPU. By mainly performing this operation on CPUs (80% of the SPOTRF operations were performed on CPUs), GPUs can be dedicated to operations for which they are very efficient, such as SGEMM (level-3 BLAS).

Fig. 2. Performance for the Cholesky factorization in single precision. [Plot: performance (Gflop/s, 0 to 1000) versus matrix order (5120 to 46080) for 1 GPU, 2 GPUs, 3 GPUs, and 3 GPUs + 5 CPUs; a "4 GB" marker appears on the plot.]

Fig. 3. Speedup for the Cholesky factorization in single precision. [Plot: speedup against one GPU (0.5 to 4) versus matrix order (5120 to 46080) for 1 GPU, 2 GPUs, 3 GPUs, and 3 GPUs + 5 CPUs; a "4 GB" marker appears on the plot.]

IV. SUMMARY AND FUTURE WORK

We have shown that a dynamic scheduler makes it possible to achieve a perfect speedup in the case of a node with multiple GPU accelerators.
We have furthermore shown that the use of supplementary cores allowed a superlinear speedup. We are now extending this work to other one-sided (i.e., QR and LU) and two-sided (e.g., Hessenberg) factorizations. We are also studying the extension of our approach to clusters of accelerator-based platforms.

ACKNOWLEDGMENTS

Research reported here was partially supported by the National Science Foundation, the Department of Energy, Microsoft Research, and NVIDIA. This work has also been supported by the ANR through the COSINUS (PROHMPT ANR-08-COSI-013 project) and CONTINT (MEDIAGPU ANR-09-CORD-025) programs. This work was partially supported by the EU as part of FP7 Project PEPPHER (www.peppher.eu) under grant 248481.

REFERENCES

[1] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. Proceedings of the 15th International Euro-Par Conference, Lecture Notes in Computer Science, 5704:863–874, August 2009.
[2] R. Nath, S. Tomov, and J. Dongarra. Accelerating GPU kernels for dense linear algebra. Proceedings of VECPAR'10, 2010.
[3] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. Journal of Physics: Conference Series, Vol. 180, 2009.
