Model-based optimization of MPDATA on Intel Xeon Phi through load imbalancing
Load balancing is a widely accepted technique for performance optimization of
scientific applications on parallel architectures. Indeed, balanced
applications do not waste processor cycles waiting at points of
synchronization and data exchange, thereby maximizing processor
utilization. In this paper, we challenge the universality of the load-balancing
approach to optimizing the performance of parallel applications. First, we
formulate conditions that should be satisfied by the performance profile of an
application in order for the application to achieve its best performance via
load balancing. Then we use a real-life scientific application, MPDATA, to
demonstrate that its performance profile on a modern parallel architecture,
Intel Xeon Phi, significantly deviates from these conditions. Based on this
observation, we propose a method of performance optimization of scientific
applications through load imbalancing. We also propose an algorithm that finds
the optimal, possibly imbalanced, configuration of a data-parallel application
on a set of homogeneous processors. This algorithm uses functional performance
models of the application to find the partitioning that minimizes its
computation time but does not necessarily balance the load of the processors. We
show how to apply this algorithm to the optimization of MPDATA on Intel Xeon Phi.
Experimental results demonstrate that the performance of this carefully
optimized load-balanced application can be further improved by 15% using the
proposed load-imbalancing optimization.
Comment: 10 pages, 9 figures, 3 tables
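The idea of partitioning by a performance model rather than by equal shares can be illustrated with a small sketch. This is not the algorithm from the paper, whose details the abstract does not give. It is a brute-force dynamic program over a hypothetical table `time_of`, where `time_of[x]` is assumed to be the measured time for one processor to handle `x` work units. The function name and the sample timings are invented for illustration.

```python
def partition_min_time(time_of, n, p):
    """Split n work units over p identical processors, minimizing the
    parallel time max_i time_of[x_i]. time_of is a hypothetical discrete
    performance model: time_of[x] = time for one processor to do x units."""
    inf = float("inf")
    largest = len(time_of) - 1  # largest share the model has data for
    # best[k][m]: minimal parallel time for m units on k processors
    best = [[inf] * (n + 1) for _ in range(p + 1)]
    choice = [[0] * (n + 1) for _ in range(p + 1)]
    best[0][0] = 0.0
    for k in range(1, p + 1):
        for m in range(n + 1):
            for x in range(min(m, largest) + 1):
                t = max(time_of[x], best[k - 1][m - x])
                if t < best[k][m]:
                    best[k][m] = t
                    choice[k][m] = x
    # Walk the recorded choices back to recover the shares.
    shares, m = [], n
    for k in range(p, 0, -1):
        shares.append(choice[k][m])
        m -= choice[k][m]
    return best[p][n], sorted(shares)


# Invented timings with a performance drop at x = 4 (for example, a cache effect).
time_of = [0, 1, 2, 3, 10, 5, 6]
best_time, shares = partition_min_time(time_of, n=8, p=2)
balanced_time = time_of[4]  # the balanced split 4 + 4
```

With these invented timings the balanced split 4 + 4 costs 10 time units, while the imbalanced split 3 + 5 costs 5, which is the kind of effect a non-smooth performance profile can produce.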
Novel Model-based Methods for Performance Optimization of Multithreaded 2D Discrete Fourier Transform on Multicore Processors
In this paper, we use multithreaded fast Fourier transforms provided in three
highly optimized packages, FFTW-2.1.5, FFTW-3.3.7, and Intel MKL FFT, to
present a novel model-based parallel computing technique as a very effective
and portable method for optimization of scientific multithreaded routines for
performance, especially in the current multicore era, where processors have
an abundant number of cores. We propose two optimization methods, PFFT-FPM and
PFFT-FPM-PAD, based on this technique. They compute the 2D-DFT of a complex signal
matrix of size NxN using p abstract processors. Both algorithms take as input
discrete 3D functions of the processors' performance against problem size, and
output the transformed signal matrix. Based on our experiments on a modern
Intel Haswell multicore server consisting of 36 physical cores, the average and
maximum speedups observed for PFFT-FPM using FFTW-3.3.7 are 1.9x and 6.8x,
respectively, and the average and maximum speedups observed using Intel MKL FFT
are 1.3x and 2x, respectively. The average and maximum speedups observed for
PFFT-FPM-PAD using FFTW-3.3.7 are 2x and 9.4x, respectively, and the average and
maximum speedups observed using Intel MKL FFT are 1.4x and 5.9x, respectively.
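As a rough illustration of computing a 2D-DFT with rows split among p abstract processors, the sketch below uses the standard row-column decomposition: 1D transforms of the rows, a transpose, 1D transforms again, and a final transpose. The abstract does not say that PFFT-FPM is structured this way, so treat this as an assumption. The `distribution` argument (rows per abstract processor) is hypothetical and would come from the performance models in a model-based method. A naive 1D DFT stands in for the FFTW or MKL routines, and the chunks run sequentially here.

```python
import cmath


def dft1d(row):
    """Naive O(n^2) 1D DFT, a stand-in for an optimized FFT routine."""
    n = len(row)
    return [
        sum(row[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def row_transforms(matrix, distribution):
    """Transform rows in contiguous chunks, one chunk per abstract processor.
    distribution[i] is the (possibly unequal) number of rows given to
    abstract processor i; chunks are processed one after another here."""
    assert sum(distribution) == len(matrix)
    out, start = [], 0
    for rows in distribution:
        out.extend(dft1d(r) for r in matrix[start:start + rows])
        start += rows
    return out


def dft2d(matrix, distribution):
    """2D DFT of a square matrix via row-column decomposition."""
    step1 = row_transforms(matrix, distribution)
    step2 = row_transforms(transpose(step1), distribution)
    return transpose(step2)


signal = [[1, 2], [3, 4]]
spectrum = dft2d(signal, distribution=[1, 1])
```

The result does not depend on how rows are distributed. Only the running time would, which is the quantity a model-based distribution tries to minimize.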