394 research outputs found
Scheduling Dynamic OpenMP Applications over Multicore Architectures
International audienceApproaching the theoretical performance of hierarchical multicore machines requires a very careful distribution of threads and data among the underlying non-uniform architecture in order to minimize cache misses and NUMA penalties. While it is acknowledged that OpenMP can enhance the quality of thread scheduling on such architectures in a portable way, by transmitting precious information about the affinities between threads and data to the underlying runtime system, most OpenMP runtime systems are actually unable to efficiently support highly irregular, massively parallel applications on NUMA machines. In this paper, we present a thread scheduling policy suited to the execution of OpenMP programs featuring irregular and massive nested parallelism over hierarchical architectures. Our policy enforces a distribution of threads that maximizes the proximity of threads belonging to the same parallel section, and uses a NUMA-aware work stealing strategy when load balancing is needed. It has been developed as a plug-in to the ForestGOMP OpenMP platform. We demonstrate the efficiency of our approach with a highly irregular recursive OpenMP program resulting from the generic parallelization of a surface reconstruction application. We achieve a speedup of 14 on a 16-core machine with no application-level optimization
Reproducibility, accuracy and performance of the Feltor code and library on parallel computer architectures
Feltor is a modular and free scientific software package. It allows
developing platform independent code that runs on a variety of parallel
computer architectures ranging from laptop CPUs to multi-GPU distributed memory
systems. Feltor consists of both a numerical library and a collection of
application codes built on top of the library. Its main target are two- and
three-dimensional drift- and gyro-fluid simulations with discontinuous Galerkin
methods as the main numerical discretization technique. We observe that
numerical simulations of a recently developed gyro-fluid model produce
non-deterministic results in parallel computations. First, we show how we
restore accuracy and bitwise reproducibility algorithmically and
programmatically. In particular, we adopt an implementation of the exactly
rounded dot product based on long accumulators, which avoids accuracy losses
especially in parallel applications. However, reproducibility and accuracy
alone fail to indicate correct simulation behaviour. In fact, in the physical
model slightly different initial conditions lead to vastly different end
states. This behaviour translates to its numerical representation. Pointwise
convergence, even in principle, becomes impossible for long simulation times.
In a second part, we explore important performance tuning considerations. We
identify latency and memory bandwidth as the main performance indicators of our
routines. Based on these, we propose a parallel performance model that predicts
the execution time of algorithms implemented in Feltor and test our model on a
selection of parallel hardware architectures. We are able to predict the
execution time with a relative error of less than 25% for problem sizes between
0.1 and 1000 MB. Finally, we find that the product of latency and bandwidth
gives a minimum array size per compute node to achieve a scaling efficiency
above 50% (both strong and weak)
Fast Calculation of the Lomb-Scargle Periodogram Using Graphics Processing Units
I introduce a new code for fast calculation of the Lomb-Scargle periodogram,
that leverages the computing power of graphics processing units (GPUs). After
establishing a background to the newly emergent field of GPU computing, I
discuss the code design and narrate key parts of its source. Benchmarking
calculations indicate no significant differences in accuracy compared to an
equivalent CPU-based code. However, the differences in performance are
pronounced; running on a low-end GPU, the code can match 8 CPU cores, and on a
high-end GPU it is faster by a factor approaching thirty. Applications of the
code include analysis of long photometric time series obtained by ongoing
satellite missions and upcoming ground-based monitoring facilities; and
Monte-Carlo simulation of periodogram statistical properties.Comment: Accepted by ApJ. Accompanying program source (updated since
acceptance) can be downloaded from
http://www.astro.wisc.edu/~townsend/resource/download/code/culsp.tar.g
A portable OpenCL-based unstructured edge-based Finite Element Navier-Stokes solver on graphics hardware
The rise of GPUs in modern high-performance systems increases the interest in porting
portion of codes to such hardware. The current paper aims to explore the performance
of a portable state-of-the-art FE solver on GPU accelerators. Performance evaluation is
done by comparing with an existing highly-optimized OpenMP version of the solver.
Code portability is ensured by writing the program using the OpenCL 1.1 specifications,
while performance portability is sought through an optimization step performed
at the beginning of the calculations to find out the optimal parameter set for the solver.
The results show that the new implementation can be several times faster than the
OpenMP version.Preprin
Accelerating moderately stiff chemical kinetics in reactive-flow simulations using GPUs
The chemical kinetics ODEs arising from operator-split reactive-flow
simulations were solved on GPUs using explicit integration algorithms. Nonstiff
chemical kinetics of a hydrogen oxidation mechanism (9 species and 38
irreversible reactions) were computed using the explicit fifth-order
Runge-Kutta-Cash-Karp method, and the GPU-accelerated version performed faster
than single- and six-core CPU versions by factors of 126 and 25, respectively,
for 524,288 ODEs. Moderately stiff kinetics, represented with mechanisms for
hydrogen/carbon-monoxide (13 species and 54 irreversible reactions) and methane
(53 species and 634 irreversible reactions) oxidation, were computed using the
stabilized explicit second-order Runge-Kutta-Chebyshev (RKC) algorithm. The
GPU-based RKC implementation demonstrated an increase in performance of nearly
59 and 10 times, for problem sizes consisting of 262,144 ODEs and larger, than
the single- and six-core CPU-based RKC algorithms using the
hydrogen/carbon-monoxide mechanism. With the methane mechanism, RKC-GPU
performed more than 65 and 11 times faster, for problem sizes consisting of
131,072 ODEs and larger, than the single- and six-core RKC-CPU versions, and up
to 57 times faster than the six-core CPU-based implicit VODE algorithm on
65,536 ODEs. In the presence of more severe stiffness, such as ethylene
oxidation (111 species and 1566 irreversible reactions), RKC-GPU performed more
than 17 times faster than RKC-CPU on six cores for 32,768 ODEs and larger, and
at best 4.5 times faster than VODE on six CPU cores for 65,536 ODEs. With a
larger time step size, RKC-GPU performed at best 2.5 times slower than six-core
VODE for 8192 ODEs and larger. Therefore, the need for developing new
strategies for integrating stiff chemistry on GPUs was discussed.Comment: 27 pages, LaTeX; corrected typos in Appendix equations A.10 and A.1
Multi-core performance studies of a Monte Carlo neutron transport code
Performance results are presented for a multi-threaded version of the OpenMC Monte Carlo neutronics code using OpenMP in the context of nuclear reactor criticality calculations. Our main interest is production computing, and thus we limit our approach to threading strategies that both require reasonable levels of development effort and preserve the code features necessary for robust application to real-world reactor problems. Several approaches are developed and the results compared on several multi-core platforms using a popular reactor physics benchmark. A broad range of performance studies are distilled into a simple, consistent picture of the empirical performance characteristics of reactor Monte Carlo algorithms on current multi-core architectures.United States. Dept. of Energy. Office of Advanced Scientific Computing Research (Contract DEAC02-06CH11357
- …