125 research outputs found
Implementation of the K-Means Algorithm on Heterogeneous Devices: A Use Case Based on an Industrial Dataset
This paper presents and analyzes a heterogeneous implementation of an industrial use case based on K-means that targets symmetric multiprocessing (SMP), GPUs and FPGAs. We present how the application can be optimized from an algorithmic point of view and how this optimization performs on two heterogeneous platforms. The presented implementation relies on the OmpSs programming model, which introduces a simplified pragma-based syntax for the communication between the main processor and the accelerators. Performance improvement can be achieved by the programmer explicitly specifying the data memory accesses or copies. As expected, the newer SMP+GPU system studied is more powerful than the older SMP+FPGA system. However, the latter is enough to fulfill the requirements of our use case, and we show that it uses less energy when considering only the active power of the execution. This work is partially supported by the European Union H2020 project AXIOM (grant agreement n. 645496), HiPEAC (grant agreement n. 687698), and Mont-Blanc (grant agreements n. 288777, 610402 and 671697), the Spanish Government Programa Severo Ochoa (SEV-2015-0493), the Spanish Ministry of Science and Technology (TIN2015-65316-P) and the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels (2014-SGR-1051). Peer Reviewed. Postprint (author's final draft)
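For readers unfamiliar with the algorithm behind the K-means use case above, the following is a minimal sketch of the textbook (Lloyd) iteration in plain Python. It is not the paper's optimized implementation, which the abstract does not describe; the function names `assign`, `update` and `kmeans` are hypothetical. The assignment step treats every point independently, which is presumably why this kind of kernel is a candidate for offloading to an accelerator.

```python
def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        # Squared Euclidean distance from p to every centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its centroid
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        updated = update(points, assign(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)
```

A call such as `kmeans(points, initial_centroids)` returns the final centroids and one cluster index per point; the initial centroids are supplied by the caller, since the abstract does not say how they are chosen.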
C Language Extensions for Hybrid CPU/GPU Programming with StarPU
Modern platforms used for high-performance computing (HPC) include machines
with both general-purpose CPUs, and "accelerators", often in the form of
graphical processing units (GPUs). StarPU is a C library to exploit such
platforms. It provides users with ways to define "tasks" to be executed on CPUs
or GPUs, along with the dependencies among them, and automatically
schedules them over all the available processing units. In doing so, it also
relieves programmers from the need to know the underlying architecture details:
it adapts to the available CPUs and GPUs, and automatically transfers data
between main memory and GPUs as needed. While StarPU's approach is successful
at addressing run-time scheduling issues, being a C library makes for a poor
and error-prone programming interface. This paper presents an effort started in
2011 to promote some of the concepts exported by the library as C language
constructs, by means of an extension of the GCC compiler suite. Our main
contribution is the design and implementation of language extensions that map
to StarPU's task programming paradigm. We argue that the proposed extensions
make it easier to get started with StarPU, eliminate errors that can occur when
using the C library, and help diagnose possible mistakes. We conclude with future
work.
Scheduling data flow program in xkaapi: A new affinity based Algorithm for Heterogeneous Architectures
Efficient implementations of parallel applications on heterogeneous hybrid
architectures require a careful balance between computations and communications
with accelerator devices. Even if most of the communication time can be
overlapped by computations, it is essential to reduce the total volume of
communicated data. The literature therefore abounds with ad-hoc methods to
reach that balance, but these are architecture- and application-dependent. We
propose here a generic mechanism to automatically optimize the scheduling
between CPUs and GPUs, and compare two strategies within this mechanism: the
classical Heterogeneous Earliest Finish Time (HEFT) algorithm and our new,
parametrized, Distributed Affinity Dual Approximation algorithm (DADA), which
consists in grouping the tasks by affinity before running a fast dual
approximation. We ran experiments on a heterogeneous parallel machine with six
CPU cores and eight NVIDIA Fermi GPUs. Three standard dense linear algebra
kernels from the PLASMA library have been ported on top of the Xkaapi runtime.
We report their performance. The results show that HEFT and DADA perform well
under various experimental conditions, but that DADA performs better for larger
systems and numbers of GPUs and, in most cases, generates much less data
transfer than HEFT to achieve the same performance.
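The HEFT strategy mentioned above can be illustrated with a simplified sketch: tasks are ordered by decreasing upward rank, and each one is placed on the processor that gives the earliest finish time. This is not the Xkaapi implementation and does not cover DADA. It assumes strictly positive task costs, a single uniform transfer cost `comm` (a hypothetical simplification of real communication costs), and it appends tasks at the end of each processor's schedule instead of searching for idle slots as full HEFT does.

```python
def heft(costs, succs, comm):
    """Simplified HEFT.

    costs[t][p] is the run time of task t on processor p (assumed > 0).
    succs[t] lists the successors of task t in the task graph.
    comm is a uniform transfer cost paid when a task and one of its
    predecessors run on different processors (illustrative assumption).
    Returns a dict mapping each task to (processor, finish_time).
    """
    tasks = list(costs)
    nproc = len(next(iter(costs.values())))
    preds = {t: [] for t in tasks}
    for t in tasks:
        for s in succs.get(t, []):
            preds[s].append(t)

    rank = {}

    def upward(t):
        # Average cost of t plus the longest remaining path through its successors.
        if t not in rank:
            avg = sum(costs[t]) / nproc
            tail = max((comm + upward(s) for s in succs.get(t, [])), default=0.0)
            rank[t] = avg + tail
        return rank[t]

    # With positive costs, decreasing upward rank is a valid topological order.
    order = sorted(tasks, key=upward, reverse=True)

    ready = [0.0] * nproc  # time at which each processor becomes free
    placed = {}
    for t in order:
        best = None
        for p in range(nproc):
            # Inputs arrive when each predecessor finishes, plus comm if it ran elsewhere.
            arrival = max(
                (placed[q][1] + (comm if placed[q][0] != p else 0.0) for q in preds[t]),
                default=0.0,
            )
            finish = max(ready[p], arrival) + costs[t][p]
            if best is None or finish < best[1]:
                best = (p, finish)
        placed[t] = best
        ready[best[0]] = best[1]
    return placed
```

In this sketch, raising `comm` makes the scheduler keep dependent tasks on the same processor more often, which is the computation versus communication trade-off the abstract refers to.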
A Second-Order Distributed Trotter-Suzuki Solver with a Hybrid Kernel
The Trotter-Suzuki approximation leads to an efficient algorithm for solving
the time-dependent Schrödinger equation. Using existing highly optimized CPU
and GPU kernels, we developed a distributed version of the algorithm that runs
efficiently on a cluster. Our implementation also improves single node
performance, and is able to use multiple GPUs within a node. The scaling is
close to linear using the CPU kernels, whereas the efficiency of GPU kernels
improves with larger matrices. We also introduce a hybrid kernel that
simultaneously uses multicore CPUs and GPUs in a distributed system. This
kernel is shown to be efficient when the matrix would not fit in the GPU
memory. Larger quantum systems scale especially well with a high number of nodes.
The code is available under an open source license. Comment: 11 pages, 10 figures
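As a reminder of what "second-order" refers to in the title above, the standard symmetric Trotter-Suzuki splitting is written out below. The decomposition of the Hamiltonian into two parts, H = A + B, and the choice of units with ħ = 1 are assumptions made for illustration; the abstract does not state which splitting the solver uses.

```latex
% Second-order (symmetric) Trotter-Suzuki splitting, with \hbar = 1
% and an assumed decomposition H = A + B of the Hamiltonian.
e^{-i (A + B)\,\Delta t}
  = e^{-i A\,\Delta t/2}\; e^{-i B\,\Delta t}\; e^{-i A\,\Delta t/2}
  + \mathcal{O}(\Delta t^{3})
```

The local error per step is third order in the time step, so the error accumulated over a fixed simulation time is second order.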
AMA: asynchronous management of accelerators for task-based programming models
Computational science has benefited in recent years from emerging accelerators that increase the performance of scientific simulations, but using these devices complicates the programming task. This paper presents AMA: a set of optimization techniques to efficiently manage multi-accelerator systems. AMA maximizes the overlap of computation and communication in a blocking-free way. We can then use the resulting spare time to do other work while waiting for device operations. Implemented on top of a task-based framework, the experimental evaluation of AMA on a quad-GPU node shows that we reach the performance of a hand-tuned native CUDA code, with the advantage of fully hiding the device management. In addition, we obtain speed-ups of more than 2x in the best cases with respect to the original framework implementation. Peer Reviewed. Postprint (published version)
The AXIOM software layers
The AXIOM project aims at developing a heterogeneous computing board (SMP-FPGA). The software layers developed in the AXIOM project are explained. OmpSs provides an easy way to execute heterogeneous codes on multiple cores. People and objects will soon share the same digital network for information exchange in a world named the age of cyber-physical systems. The general expectation is that people and systems will interact in real time. This puts pressure on system design to support increasing demands on computational power while keeping a low power envelope. Additionally, modular scaling and easy programmability are important for these systems to become widespread. This whole set of expectations imposes scientific and technological challenges that need to be properly addressed. The AXIOM project (Agile, eXtensible, fast I/O Module) will research new hardware/software architectures for cyber-physical systems to meet such expectations. The technical approach aims at solving fundamental problems to enable easy programmability of heterogeneous multi-core multi-board systems. AXIOM proposes the use of the task-based OmpSs programming model, leveraging low-level communication interfaces provided by the hardware. Modular scalability will be possible thanks to a fast interconnect embedded into each module. To this aim, an innovative ARM and FPGA-based board will be designed, with enhanced capabilities for interfacing with the physical world. Its effectiveness will be demonstrated with key scenarios such as Smart Video-Surveillance and Smart Living/Home (domotics). Peer Reviewed. Postprint (author's final draft)
Self-adaptive OmpSs tasks in heterogeneous environments
As new heterogeneous systems and hardware accelerators appear, high performance computers can reach a higher level of computational power. Nevertheless, this does not come for free: the more heterogeneity the system presents, the more complex the programming task becomes in terms of resource management. OmpSs is a task-based programming model and framework focused on the runtime exploitation of parallelism from annotated sequential applications. This paper presents a set of extensions to this framework: we show how the application programmer can expose different specialized versions of tasks (i.e. pieces of specific code targeted and optimized for a particular architecture) and how the system can choose between these versions at runtime to obtain the best performance achievable for the given application. From the results obtained in a multi-GPU system, we prove that our proposal gives flexibility to the application's source code and can potentially increase the application's performance. This work has been supported by the European Commission through the ENCORE project (FP7-248647), the TERAFLUX project (FP7-249013), the TEXT project (FP7-261580), the HiPEAC-3 Network of Excellence (FP7-ICT 287759), the Intel-BSC Exascale Lab collaboration project, the support of the Spanish Ministry of Education (CSD2007-00050 and FPU program), the projects of Computación de Altas Prestaciones V and VI (TIN2007-60625, TIN2012-34557) and the Generalitat de Catalunya (2009-SGR-980). Peer Reviewed. Postprint (author’s final draft)
Selection of Task Implementations in the Nanos++ Runtime
New heterogeneous systems and hardware accelerators can give higher levels of computational power to high performance computers. However, this does not come for free, since the more heterogeneity the system presents, the more complex the programming task becomes in terms of resource utilization.
OmpSs is a task-based programming model and framework focused on the automatic parallelization of sequential applications. We present a set of extensions to this framework: we show how the application programmer can expose different specialized versions of tasks (i.e. pieces of specific code targeted and optimized for a particular architecture) and how the framework will choose between these versions at runtime to obtain the best performance achievable for the given application. From our results, obtained in a multi-GPU system, we can prove that our project gives flexibility to the application's source code and can potentially increase the application’s performance.
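The idea of exposing several versions of a task and letting the runtime pick one can be sketched as follows. This is an illustration only, not the Nanos++ scheduler or the OmpSs syntax: the class `VersionedTask`, its methods and the selection policy (try each version `trials` times, then prefer the lowest mean measured time) are all hypothetical, and `trials` is assumed to be at least 1.

```python
import time


class VersionedTask:
    """Hypothetical helper: several implementations of one task, one chosen per call."""

    def __init__(self, trials=1):
        self.versions = {}  # version name -> callable
        self.timings = {}   # version name -> measured run times in seconds
        self.trials = trials

    def register(self, name, fn):
        """Expose one specialized version of the task (e.g. "cpu" or "gpu")."""
        self.versions[name] = fn
        self.timings[name] = []

    def choose(self):
        """Pick the version to run next."""
        # Learning phase: any version with fewer than `trials` samples goes first.
        for name, samples in self.timings.items():
            if len(samples) < self.trials:
                return name
        # Afterwards: the version with the lowest mean measured time.
        return min(
            self.timings,
            key=lambda n: sum(self.timings[n]) / len(self.timings[n]),
        )

    def __call__(self, *args):
        name = self.choose()
        start = time.perf_counter()
        result = self.versions[name](*args)
        self.timings[name].append(time.perf_counter() - start)
        return result
```

A caller would register, say, a `"cpu"` and a `"gpu"` callable and then invoke the object like an ordinary function; the application code stays the same whichever version ends up running, which is the flexibility the abstract describes.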
On the roles of the programmer, the compiler and the runtime system when programming accelerators in OpenMP
OpenMP includes the accelerator model in its latest 4.0 specification. In this paper we present a partial implementation of this specification in the OmpSs programming model developed at the Barcelona Supercomputing Center, with the aim of identifying what the roles of the programmer, the compiler and the runtime system should be in order to facilitate the asynchronous execution of tasks in architectures with multiple accelerator devices and processors. The design of OmpSs is strongly biased toward delegating most of the decisions to the runtime system, which, based on the task graph built at runtime (depend clauses), is able to schedule tasks in a data-flow way to the available processors and accelerator devices and to orchestrate data transfers and reuse among multiple address spaces. For this reason our implementation is partial, considering from 4.0 only those directives that enable the compiler to generate the so-called “kernels” to be executed on the target device. Several extensions to the current specification are also presented, such as the specification of tasks in “native” CUDA and OpenCL or how to specify the device and data privatization in the target construct. Finally, the paper also discusses some challenges found in code generation and a preliminary performance evaluation with some kernel applications. Peer Reviewed. Postprint (author’s final draft)