Efficient implementations of parallel applications on heterogeneous hybrid
architectures require a careful balance between computation and communication
with accelerator devices. Even if most of the communication time can be
overlapped with computation, it is essential to reduce the total volume of
communicated data. The literature therefore abounds with ad hoc methods to
reach that balance, but these are architecture- and application-dependent. We
propose here a generic mechanism to automatically optimize the scheduling
between CPUs and GPUs, and compare two strategies within this mechanism: the
classical Heterogeneous Earliest Finish Time (HEFT) algorithm and our new,
parametrized, Distributed Affinity Dual Approximation algorithm (DADA), which
groups the tasks by affinity before running a fast dual approximation. We ran
experiments on a heterogeneous parallel machine with six CPU cores and eight
NVIDIA Fermi GPUs. Three standard dense linear algebra kernels from the PLASMA
library have been ported on top of the XKaapi runtime, and we report their
performance. The results show that HEFT and DADA both perform well under
various experimental conditions, but that DADA performs better for larger
systems and larger numbers of GPUs and, in most cases, generates much less
data transfer than HEFT to achieve the same performance.
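To make the trade-off between finish time and transferred data concrete, the following is a minimal sketch of the earliest-finish-time selection step used by HEFT-style list schedulers. It is not the XKaapi implementation and not DADA. All names (`Worker`, `eft_schedule`, `transfer_cost`) are hypothetical, tasks are assumed to arrive already sorted by priority, each worker is assumed to have its own memory, every missing data block is assumed to cost the same fixed transfer time, and the overlap of communication with computation is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    kind: str                      # "cpu" or "gpu" (hypothetical labels)
    ready_at: float = 0.0          # time at which the worker becomes idle
    resident: set = field(default_factory=set)  # data blocks already local


def eft_schedule(tasks, workers, transfer_cost):
    """Place each task on the worker giving the earliest finish time.

    tasks: list of (name, {kind: exec_time}, set_of_input_blocks),
           assumed to be already ordered by priority.
    Returns the placement and the number of data blocks transferred.
    """
    placement = {}
    transferred = 0
    for name, cost, inputs in tasks:
        best = None
        for w in workers:
            missing = inputs - w.resident
            # Finish time = idle time + transfers of missing blocks + execution.
            finish = w.ready_at + transfer_cost * len(missing) + cost[w.kind]
            if best is None or finish < best[0]:
                best = (finish, w, missing)
        finish, w, missing = best
        w.ready_at = finish
        w.resident |= inputs       # inputs are now local to the chosen worker
        transferred += len(missing)
        placement[name] = w.name
    return placement, transferred
```

In this sketch a task follows its data only when doing so also minimizes its own finish time; the abstract presents DADA's affinity grouping as the alternative strategy that, in most cases, generates much lower data transfers.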

English

arXiv

International audience

Efficient implementations of parallel applications on heterogeneous hybrid architectures require a careful balance between computations and communications with accelerator devices. Even if most of the communication time can be overlapped by computations, it is essential to reduce the total volume of communicated data. The literature therefore abounds with ad hoc methods to reach that balance, but these are architecture and application dependent. We propose here a generic mechanism to automatically optimize the scheduling between CPUs and GPUs, and compare two strategies within this mechanism: the classical Heterogeneous Earliest Finish Time (HEFT) algorithm and our new, parametrized, Distributed Affinity Dual Approximation algorithm (DADA), which consists in grouping the tasks by affinity before running a fast dual approximation. We ran experiments on a heterogeneous parallel machine with twelve CPU cores and eight NVIDIA Fermi GPUs. Three standard dense linear algebra kernels from the PLASMA library have been ported on top of the XKaapi runtime system. We report their performances. It results that HEFT and DADA perform well for various experimental conditions, but that DADA performs better for larger systems and number of GPUs, and, in most cases, generates much lower data transfers than HEFT to achieve the same performance.

Bleuse, Raphaël

Gautier, Thierry

Lima, João V. F.

Mounié, Grégory

Trystram, Denis

INRIA a CCSD electronic archive server

Scheduling Data Flow Program in XKaapi: A New Affinity Based Algorithm for Heterogeneous Architectures

Hal - Université Grenoble Alpes

Crossref

https://hal.archives-ouvertes.fr/hal-01081629/document
