A hybrid MPI-OpenMP scheme for scalable parallel pseudospectral computations for fluid turbulence
A hybrid scheme that utilizes MPI for distributed memory parallelism and
OpenMP for shared memory parallelism is presented. The work is motivated by the
desire to achieve exceptionally high Reynolds numbers in pseudospectral
computations of fluid turbulence on emerging petascale, high core-count,
massively parallel processing systems. The hybrid implementation derives from
and augments a well-tested scalable MPI-parallelized pseudospectral code. The
hybrid paradigm leads to a new picture for the domain decomposition of the
pseudospectral grids, which is helpful in understanding, among other things,
the 3D transpose of the global data that is necessary for the parallel fast
Fourier transforms that are the central component of the numerical
discretizations. Details of the hybrid implementation are provided, and
performance tests illustrate the utility of the method. It is shown that the
hybrid scheme achieves near ideal scalability up to ~20000 compute cores with a
maximum mean efficiency of 83%. Data are presented that demonstrate how to
choose the optimal number of MPI processes and OpenMP threads in order to
optimize code performance on two different platforms.
Comment: Submitted to Parallel Computin
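As a rough illustration of how a scaling efficiency figure such as the one quoted above can be computed, the sketch below uses the common strong-scaling definition relative to a reference run. The abstract does not state which definition the authors use, so the formula, the function name `parallel_efficiency`, and all timings here are assumptions for illustration only.

```python
def parallel_efficiency(ref_cores, ref_time, cores, time):
    """Strong-scaling efficiency of a run relative to a reference run.

    Assumed definition: E = (T_ref * p_ref) / (T_p * p), so E = 1.0 means
    ideal scaling for a fixed problem size.
    """
    if min(ref_cores, cores) <= 0 or min(ref_time, time) <= 0:
        raise ValueError("core counts and timings must be positive")
    return (ref_time * ref_cores) / (time * cores)


# Hypothetical wall-clock seconds per time step; NOT taken from the paper.
timings = {1024: 8.0, 2048: 4.4, 4096: 2.5}
ref = 1024
efficiencies = {
    cores: parallel_efficiency(ref, timings[ref], cores, t)
    for cores, t in timings.items()
}
mean_efficiency = sum(efficiencies.values()) / len(efficiencies)
```

Repeating such a measurement for several splits of a fixed core count into MPI processes and OpenMP threads is one plausible way to compare configurations on a given platform.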
A Parallel Adaptive P3M code with Hierarchical Particle Reordering
We discuss the design and implementation of HYDRA_OMP, a parallel
implementation of the Smoothed Particle Hydrodynamics-Adaptive P3M (SPH-AP3M)
code HYDRA. The code is designed primarily for conducting cosmological
hydrodynamic simulations and is written in Fortran77+OpenMP. A number of
optimizations for RISC processors and SMP-NUMA architectures have been
implemented, the most important optimization being hierarchical reordering of
particles within chaining cells, which greatly improves data locality thereby
removing the cache misses typically associated with linked lists. Parallel
scaling is good, with a minimum parallel scaling of 73% achieved on 32 nodes
for a variety of modern SMP architectures. We give performance data in terms of
the number of particle updates per second, which is a more useful performance
metric than raw MFlops. A basic version of the code will be made available to
the community in the near future.
Comment: 34 pages, 12 figures, accepted for publication in Computer Physics
Communication
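The data-locality idea mentioned in the abstract above can be sketched as follows: particles are sorted by the chaining cell they fall in, so that each cell's particles sit contiguously in memory and can be found by offset rather than by following a linked list. This is a simplified, single-level sketch and not the HYDRA_OMP implementation (which is Fortran77+OpenMP and hierarchical); the function names, the cubic box, and the uniform cell grid are all assumptions.

```python
def cell_index(pos, box_size, cells_per_dim):
    """Map a 3D position in [0, box_size) to a flattened chaining-cell index."""
    ix, iy, iz = (
        min(int(c / box_size * cells_per_dim), cells_per_dim - 1) for c in pos
    )
    return (ix * cells_per_dim + iy) * cells_per_dim + iz


def reorder_by_cell(positions, box_size, cells_per_dim):
    """Sort particles by chaining cell and return per-cell start offsets.

    Particles of cell c occupy sorted_positions[starts[c]:starts[c + 1]].
    """
    keys = [cell_index(p, box_size, cells_per_dim) for p in positions]
    # sorted() is stable, so particles keep their relative order within a cell.
    order = sorted(range(len(positions)), key=keys.__getitem__)
    sorted_positions = [positions[i] for i in order]

    n_cells = cells_per_dim ** 3
    starts = [0] * (n_cells + 1)
    for k in keys:
        starts[k + 1] += 1          # count particles per cell
    for c in range(n_cells):
        starts[c + 1] += starts[c]  # prefix sum turns counts into offsets
    return sorted_positions, starts


# Hypothetical particle positions in a unit box with 2 cells per dimension.
particles = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.6, 0.1, 0.1), (0.2, 0.2, 0.2)]
sorted_particles, starts = reorder_by_cell(particles, box_size=1.0, cells_per_dim=2)
```

After reordering, a loop over the particles of one cell touches a contiguous slice, which is the property that reduces cache misses compared with pointer chasing.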
Estimating the Potential Speedup of Computer Vision Applications on Embedded Multiprocessors
Computer vision applications constitute one of the key drivers for embedded
multicore architectures. Although the number of available cores is increasing
in new architectures, designing an application to maximize the utilization of
the platform is still a challenge. In this sense, parallel performance
prediction tools can aid developers in understanding the characteristics of an
application and finding the most adequate parallelization strategy. In this
work, we present a method for early parallel performance estimation on embedded
multiprocessors from sequential application traces. We describe its
implementation in Parana, a fast trace-driven simulator targeting OpenMP
applications on the STMicroelectronics' STxP70 Application-Specific
Multiprocessor (ASMP). Results for the FAST key point detector application show
an error margin of less than 10% compared to the reference cycle-approximate
simulator, with lower modeling effort and up to 20x faster execution time.
Comment: Presented at DATE Friday Workshop on Heterogeneous Architectures and
Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241
New Algebraic Formulation of Density Functional Calculation
This article addresses a fundamental problem faced by the ab initio
community: the lack of an effective formalism for the rapid exploration and
exchange of new methods. To rectify this, we introduce a novel, basis-set
independent, matrix-based formulation of generalized density functional
theories which reduces the development, implementation, and dissemination of
new ab initio techniques to the derivation and transcription of a few lines of
algebra. This new framework enables us to concisely demystify the inner
workings of fully functional, highly efficient modern ab initio codes and to
give complete instructions for the construction of such for calculations
employing arbitrary basis sets. Within this framework, we also discuss in full
detail a variety of leading-edge ab initio techniques, minimization algorithms,
and highly efficient computational kernels for use with scalar as well as
shared and distributed-memory supercomputer architectures.
A Parallel Mesh-Adaptive Framework for Hyperbolic Conservation Laws
We report on the development of a computational framework for the parallel,
mesh-adaptive solution of systems of hyperbolic conservation laws like the
time-dependent Euler equations in compressible gas dynamics or
Magneto-Hydrodynamics (MHD) and similar models in plasma physics. Local mesh
refinement is realized by the recursive bisection of grid blocks along each
spatial dimension; the implemented numerical schemes include standard
finite-differences as well as shock-capturing central schemes, both in
connection with Runge-Kutta type integrators. Parallel execution is achieved
through a configurable hybrid of POSIX-multi-threading and MPI-distribution
with dynamic load balancing. One-, two-, and three-dimensional test computations
for the Euler equations have been carried out and show good parallel scaling
behavior. The Racoon framework is currently used to study the formation of
singularities in plasmas and fluids.
Comment: late submissio
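The block-structured refinement described in the abstract above, where a grid block is bisected along each spatial dimension and the process recurses, can be sketched as below. This is only an illustration of the general technique and not code from the Racoon framework; the function names, the refinement criterion, and the level limit are assumptions.

```python
from itertools import product


def bisect_block(lower, upper):
    """Split an axis-aligned block into 2**d children by halving every dimension."""
    mid = [(lo + hi) / 2 for lo, hi in zip(lower, upper)]
    children = []
    for choice in product((0, 1), repeat=len(lower)):
        # choice[k] == 0 selects the lower half along dimension k, 1 the upper half.
        child_lo = tuple(m if c else lo for c, lo, m in zip(choice, lower, mid))
        child_hi = tuple(hi if c else m for c, hi, m in zip(choice, upper, mid))
        children.append((child_lo, child_hi))
    return children


def refine(lower, upper, needs_refinement, max_level, level=0):
    """Recursively bisect blocks flagged by needs_refinement; return leaf blocks."""
    if level < max_level and needs_refinement(lower, upper):
        leaves = []
        for lo, hi in bisect_block(lower, upper):
            leaves.extend(refine(lo, hi, needs_refinement, max_level, level + 1))
        return leaves
    return [(tuple(lower), tuple(upper))]


# Hypothetical criterion: refine any block containing a feature at (0.1, 0.1).
def contains_feature(lower, upper, point=(0.1, 0.1)):
    return all(lo <= x < hi for lo, x, hi in zip(lower, point, upper))


leaves = refine((0.0, 0.0), (1.0, 1.0), contains_feature, max_level=2)
```

In two dimensions each refinement produces four children, in three dimensions eight, so the leaf blocks form a quadtree or octree whose leaves could then be distributed across threads and MPI ranks for load balancing.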
SpECTRE: A Task-based Discontinuous Galerkin Code for Relativistic Astrophysics
We introduce a new relativistic astrophysics code, SpECTRE, that combines a
discontinuous Galerkin method with a task-based parallelism model. SpECTRE's
goal is to achieve more accurate solutions for challenging relativistic
astrophysics problems such as core-collapse supernovae and binary neutron star
mergers. The robustness of the discontinuous Galerkin method allows for the use
of high-resolution shock capturing methods in regions where (relativistic)
shocks are found, while exploiting high-order accuracy in smooth regions. A
task-based parallelism model allows efficient use of the largest supercomputers
for problems with a heterogeneous workload over disparate spatial and temporal
scales. We argue that the locality and algorithmic structure of discontinuous
Galerkin methods will exhibit good scalability within a task-based parallelism
framework. We demonstrate the code on a wide variety of challenging benchmark
problems in (non)-relativistic (magneto)-hydrodynamics. We demonstrate the
code's scalability, including its strong scaling on the NCSA Blue Waters
supercomputer up to the machine's full capacity of 22,380 nodes using 671,400
threads.
Comment: 41 pages, 13 figures, and 7 tables. Ancillary data contains
simulation input file
FullSWOF_Paral: Comparison of two parallelization strategies (MPI and SKELGIS) on a software designed for hydrology applications
In this paper, we compare two approaches for the parallelization of an
existing free software package, FullSWOF 2D
(http://www.univ-orleans.fr/mapmo/soft/FullSWOF/), which solves shallow water
equations for applications in hydrology. Both approaches are based on a domain
decomposition strategy. The first
approach is based on the classical MPI library while the second approach uses
Parallel Algorithmic Skeletons and more precisely a library named SkelGIS
(Skeletons for Geographical Information Systems). The first results presented
in this article show that the two approaches are similar in terms of
performance and scalability. The two implementation strategies are, however, very
different, and we discuss the advantages of each one.
Comment: 27 page