1,486 research outputs found
Auto-Tuning MPI Collective Operations on Large-Scale Parallel Systems
MPI libraries are widely used in applications of high performance computing. Yet, effective tuning of MPI collectives on large parallel systems is an outstanding challenge. This process often follows a trial-and-error approach and requires expert insights into the subtle interactions between software and the underlying hardware. This paper presents an empirical approach to choose and switch MPI communication algorithms at runtime to optimize the application performance. We achieve this by first modeling offline, through microbenchmarks, to find how the runtime parameters with different message sizes affect the choice of MPI communication algorithms. We then apply the knowledge to automatically optimize new unseen MPI programs. We evaluate our approach by applying it to NPB and HPCC benchmarks on a 384-node computer cluster of the Tianhe-2 supercomputer. Experimental results show that our approach achieves, on average, 22.7% (up to 40.7%) improvement over the default setting
Auto-tuning MPI Collective Operations on Large-Scale Parallel Systems
MPI libraries are widely used in applications of high performance computing. Yet, effective tuning of MPI colletives on large parallel systems is an outstanding challenge. This process often follows a trial-and-error approach and requires expert insights into the subtle interactions between software and the underlying hardware. This paper presents an empirical approach to choose and switch MPI communication algorithms at runtime to optimize the application performance. We achieve this by first modeling offline, through microbenchmarks, to find how the runtime parameters with different message sizes affect the choice of MPI communication algorithms. We then apply the knowledge to automatically optimize new unseen MPI programs. We evaluate our approach by applying it to NPB and HPCC benchmarks on a 384-node computer cluster of the Tianhe-2 supercomputer. Experimental results show that our approach achieves, on average, 22.7% (up to 40.7%) improvement over the default setting
Strong scaling of general-purpose molecular dynamics simulations on GPUs
We describe a highly optimized implementation of MPI domain decomposition in
a GPU-enabled, general-purpose molecular dynamics code, HOOMD-blue (Anderson
and Glotzer, arXiv:1308.5587). Our approach is inspired by a traditional
CPU-based code, LAMMPS (Plimpton, J. Comp. Phys. 117, 1995), but is implemented
within a code that was designed for execution on GPUs from the start (Anderson
et al., J. Comp. Phys. 227, 2008). The software supports short-ranged pair
force and bond force fields and achieves optimal GPU performance using an
autotuning algorithm. We are able to demonstrate equivalent or superior scaling
on up to 3,375 GPUs in Lennard-Jones and dissipative particle dynamics (DPD)
simulations of up to 108 million particles. GPUDirect RDMA capabilities in
recent GPU generations provide better performance in full double precision
calculations. For a representative polymer physics application, HOOMD-blue 1.0
provides an effective GPU vs. CPU node speed-up of 12.5x.Comment: 30 pages, 14 figure
The inherent overlapping in the parallel calculation of the Laplacian
Producción CientíficaA new approach for the parallel computation of the Laplacian in the Fourier domain is presented. This numerical problem inherits the intrinsic sequencing involved in the calculation of any multidimensional Fast Fourier Transform (FFT) where blocking communications assure that its computation is strictly carried out dimension by dimension. Such data dependency vanishes when one considers the Laplacian as the sum of n independent one-dimensional kernels, so that computation and communication can be naturally overlapped with nonblocking communications. Overlapping is demonstrated to be responsible for the speedup figures we obtain when our approach is compared to state-of-the-art parallel multidimensional FFTs.Junta de Castilla León (grant number VA296P18
Solving the Klein-Gordon equation using Fourier spectral methods: A benchmark test for computer performance
The cubic Klein-Gordon equation is a simple but non-trivial partial
differential equation whose numerical solution has the main building blocks
required for the solution of many other partial differential equations. In this
study, the library 2DECOMP&FFT is used in a Fourier spectral scheme to solve
the Klein-Gordon equation and strong scaling of the code is examined on
thirteen different machines for a problem size of 512^3. The results are useful
in assessing likely performance of other parallel fast Fourier transform based
programs for solving partial differential equations. The problem is chosen to
be large enough to solve on a workstation, yet also of interest to solve
quickly on a supercomputer, in particular for parametric studies. Unlike other
high performance computing benchmarks, for this problem size, the time to
solution will not be improved by simply building a bigger supercomputer.Comment: 10 page
Developing performance-portable molecular dynamics kernels in Open CL
This paper investigates the development of a molecular dynamics code that is highly portable between architectures. Using OpenCL, we develop an implementation of Sandia’s miniMD benchmark that achieves good levels of performance across a wide range of hardware: CPUs, discrete GPUs and integrated GPUs.
We demonstrate that the performance bottlenecks of miniMD’s short-range force calculation kernel are the same across these architectures, and detail a number of platform- agnostic optimisations that improve its performance by at least 2x on all hardware considered. Our complete code is shown to be 1.7x faster than the original miniMD, and at most 2x slower than implementations individually hand-tuned for a specific architecture
MDMP: Managed Data Message Passing
MDMP is a new parallel programming approach that aims to provide users with
an easy way to add parallelism to programs, optimise the message passing costs
of traditional scientific simulation algorithms, and enable existing MPI-based
parallel programs to be optimised and extended without requiring the whole code
to be re-written from scratch. MDMP utilises a directives based approach to
enable users to specify what communications should take place in the code, and
then implements those communications for the user in an optimal manner using
both the information provided by the user and data collected from instrumenting
the code and gathering information on the data to be communicated. This work
will present the basic concepts and functionality of MDMP and discuss the
performance that can be achieved using our prototype implementation of MDMP on
some model scientific simulation applications.Comment: Submitted to SC13, 10 pages, 5 figure
- …