Model-based optimization of MPDATA on Intel Xeon Phi through load imbalancing
Load balancing is a widely accepted technique for performance optimization of
scientific applications on parallel architectures. Indeed, balanced
applications do not waste processor cycles waiting at points of
synchronization and data exchange, thereby maximizing the utilization of
processors. In this paper, we challenge the universality of the load-balancing
approach to optimization of the performance of parallel applications. First, we
formulate conditions that should be satisfied by the performance profile of an
application in order for the application to achieve its best performance via
load balancing. Then we use a real-life scientific application, MPDATA, to
demonstrate that its performance profile on a modern parallel architecture,
Intel Xeon Phi, significantly deviates from these conditions. Based on this
observation, we propose a method of performance optimization of scientific
applications through load imbalancing. We also propose an algorithm that finds
the optimal, possibly imbalanced, configuration of a data parallel application
on a set of homogeneous processors. This algorithm uses functional performance
models of the application to find the partitioning that minimizes its
computation time but does not necessarily balance the load of the processors. We
show how to apply this algorithm to the optimization of MPDATA on Intel Xeon Phi.
Experimental results demonstrate that the performance of this carefully
optimized load-balanced application can be further improved by 15% using the
proposed load-imbalancing optimization.
Comment: 10 pages, 9 figures, 3 tables
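
The partitioning idea in the abstract above can be illustrated with a small sketch. It is not the paper's algorithm: it is a plain dynamic program over a hypothetical time function `toy_time`, which stands in for a functional performance model and is deliberately given a slowdown at one problem size. The names `partition` and `toy_time`, and the numbers used, are assumptions made for illustration only.

```python
from typing import Callable, List, Tuple


def partition(n: int, p: int,
              time: Callable[[int], float]) -> Tuple[float, List[int]]:
    """Split n work units over p identical processors so that the
    slowest processor finishes as early as possible."""
    inf = float("inf")
    # best[k][m]: smallest achievable finish time when k processors share m units
    best = [[inf] * (n + 1) for _ in range(p + 1)]
    # choice[k][m]: units given to the k-th processor in that best split
    choice = [[0] * (n + 1) for _ in range(p + 1)]
    best[0][0] = 0.0
    for k in range(1, p + 1):
        for m in range(n + 1):
            for x in range(m + 1):
                cost = max(time(x), best[k - 1][m - x])
                if cost < best[k][m]:
                    best[k][m] = cost
                    choice[k][m] = x
    # Walk back through the recorded choices to recover the distribution.
    parts: List[int] = []
    m = n
    for k in range(p, 0, -1):
        parts.append(choice[k][m])
        m -= choice[k][m]
    return best[p][n], parts


def toy_time(x: int) -> float:
    # Hypothetical model: time grows linearly, except for a 50% slowdown
    # at exactly 5 units (a stand-in for a dip in a real performance profile).
    return x * (1.5 if x == 5 else 1.0)


finish, parts = partition(10, 2, toy_time)
print(finish, parts)
```

With this toy model, the balanced split of 5 and 5 units costs 7.5 time units, while the search settles on an uneven split of 4 and 6 units that finishes in 6.0. The sketch only shows why a non-smooth performance profile can make the balanced configuration suboptimal; real profiles would have to be measured.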
Execution of Compound Multi-Kernel OpenCL Computations in Multi-CPU/Multi-GPU Environments
Current computational systems are heterogeneous by nature, featuring a
combination of CPUs and GPUs. As the latter are becoming an established
platform for high-performance computing, the focus is shifting towards the
seamless programming of these hybrid systems as a whole. The distinct nature of
the architectural and execution models in place raises several challenges, as
the best hardware configuration is behaviour- and workload-dependent. In this
paper, we address the execution of compound, multi-kernel, OpenCL computations
in multi-CPU/multi-GPU environments. We address how these computations may be
efficiently scheduled onto the target hardware, and how the system may adapt
itself to changes in the workload to process and to fluctuations in the CPU's
load. An experimental evaluation attests to the performance gains obtained by the
conjoined use of the CPU and GPU devices, when compared to GPU-only executions,
and also by the use of data-locality optimizations in CPU environments.
Comment: in Concurrency Computat.: Pract. Exper., 201