A Parallel Adaptive P3M code with Hierarchical Particle Reordering
We discuss the design and implementation of HYDRA_OMP, a parallel
implementation of the Smoothed Particle Hydrodynamics-Adaptive P3M (SPH-AP3M)
code HYDRA. The code is designed primarily for conducting cosmological
hydrodynamic simulations and is written in Fortran77+OpenMP. A number of
optimizations for RISC processors and SMP-NUMA architectures have been
implemented, the most important optimization being hierarchical reordering of
particles within chaining cells, which greatly improves data locality, thereby
removing the cache misses typically associated with linked lists. Parallel
scaling is good, with a minimum parallel scaling of 73% achieved on 32 nodes
for a variety of modern SMP architectures. We give performance data in terms of
the number of particle updates per second, which is a more useful performance
metric than raw MFlops. A basic version of the code will be made available to
the community in the near future.

Comment: 34 pages, 12 figures, accepted for publication in Computer Physics
Communications
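The hierarchical reordering described above can be sketched in a few lines: sort the particle arrays by chaining-cell index so that particles in the same cell become contiguous in memory, replacing pointer-chasing through linked lists with streaming array access. This is an illustrative sketch, not the HYDRA_OMP implementation; the function name and parameters are ours.

```python
import numpy as np

def reorder_by_chaining_cell(pos, cell_size, ncell):
    """Sort particle arrays so that particles in the same chaining cell
    are contiguous in memory (better cache locality than linked lists)."""
    # Integer cell coordinates of each particle.
    ijk = np.floor(pos / cell_size).astype(int) % ncell
    # Flatten (i, j, k) to a single cell index and stable-sort by it.
    flat = (ijk[:, 0] * ncell + ijk[:, 1]) * ncell + ijk[:, 2]
    order = np.argsort(flat, kind="stable")
    return pos[order], flat[order]

pos = np.random.rand(1000, 3)  # particles in a unit box
sorted_pos, sorted_cells = reorder_by_chaining_cell(pos, cell_size=0.1, ncell=10)
assert np.all(np.diff(sorted_cells) >= 0)  # each cell's particles are contiguous
```

After the sort, a loop over the particles of one cell touches a contiguous slice of memory, which is what removes the cache misses the abstract refers to.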
A fast GPU Monte Carlo Radiative Heat Transfer Implementation for Coupling with Direct Numerical Simulation
We implemented a fast Reciprocal Monte Carlo algorithm to accurately solve
radiative heat transfer in turbulent flows of non-grey participating media that
can be coupled to fully resolved turbulent flows, namely to Direct Numerical
Simulation (DNS). The spectrally varying absorption coefficient is treated in a
narrow-band fashion with a correlated-k distribution. The implementation is
verified with analytical solutions and validated with results from literature
and line-by-line Monte Carlo computations. The method is implemented on GPU
with thorough attention to memory transfer and computational efficiency. The
bottlenecks that dominate the computational expenses are addressed and several
techniques are proposed to optimize the GPU execution. By implementing the
proposed algorithmic accelerations, a speed-up of up to 3 orders of magnitude
can be achieved while maintaining the same accuracy.
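The core Monte Carlo ingredient can be illustrated with a deliberately simplified grey-gas sketch (not the non-grey correlated-k GPU implementation described above): sample photon free paths from the Beer-Lambert distribution and count the rays that traverse a homogeneous slab.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_transmissivity(kappa, length, n_rays=100_000):
    """Monte Carlo estimate of the transmissivity of a homogeneous grey
    slab: the fraction of rays whose sampled absorption distance exceeds
    the slab thickness. The analytical value is exp(-kappa * length)."""
    # Free paths follow an exponential (Beer-Lambert) distribution.
    s = -np.log(rng.random(n_rays)) / kappa
    return np.mean(s > length)

est = mc_transmissivity(kappa=2.0, length=0.5)
# Analytical reference: exp(-1) ~ 0.368; the estimate converges as 1/sqrt(n_rays).
```

The statistical error decays only as the inverse square root of the ray count, which is why algorithmic accelerations and GPU-friendly memory layouts, as in the work above, matter so much in a coupled DNS setting.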
An Efficient Sliding Mesh Interface Method for High-Order Discontinuous Galerkin Schemes
Sliding meshes are a powerful method to treat deformed domains in
computational fluid dynamics, where different parts of the domain are in
relative motion. In this paper, we present an efficient implementation of a
sliding mesh method into a discontinuous Galerkin compressible Navier-Stokes
solver and its application to a large eddy simulation of a 1-1/2 stage turbine.
The method is based on the mortar method and is high-order accurate. It can
handle three-dimensional sliding mesh interfaces with various interface shapes.
For plane interfaces, which are the most common case, conservativity and
free-stream preservation are ensured. We put an emphasis on efficient parallel
implementation. Our implementation generates little computational and storage
overhead. Inter-node communication via MPI in a dynamically changing mesh
topology is reduced to a bare minimum by ensuring a priori knowledge of
communication partners and by data sorting. We provide performance and scaling
results showing the capability of the implementation strategy. Apart from
analytical validation computations and convergence results, we present a
wall-resolved implicit LES of the 1-1/2 stage Aachen turbine test case as a
large-scale practical application example.
Hypercube algorithms on mesh connected multicomputers
A new methodology named CALMANT (CC-cube Algorithms on Meshes and Tori) is proposed for mapping a class of algorithms that we call CC-cube algorithms onto multicomputers with hypercube, mesh, or torus interconnection topology. This methodology is suitable when the initial problem can be expressed as a set of processes that communicate through a hypercube topology (a CC-cube algorithm). Many important algorithms fit the CC-cube type. CALMANT is based on three techniques: (a) standard embedding to assign the processes of the algorithm to the nodes of the mesh multicomputer; (b) communication pipelining to increase the level of communication parallelism inherent in CC-cube algorithms; and (c) optimal message-scheduling algorithms, proposed in this work, that avoid conflicts and thereby minimize communication time. Although CALMANT is proposed for multicomputers with different interconnection network topologies, the paper focuses only on the particular case of meshes.
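Technique (a), the standard embedding, can be sketched with the classic binary-reflected Gray-code construction (an assumption on our part about what "standard embedding" denotes here; the function names are ours, not CALMANT's): label the mesh position (i, j) with the hypercube node gray(i)·2^b + gray(j), so that mesh-adjacent positions map to hypercube nodes differing in exactly one bit.

```python
def gray(n: int) -> int:
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)

def standard_embedding(a_bits: int, b_bits: int):
    """Map each position (i, j) of a 2^a x 2^b mesh to an (a+b)-bit
    hypercube node label so that mesh-adjacent positions correspond to
    hypercube nodes differing in exactly one bit."""
    return {
        (i, j): (gray(i) << b_bits) | gray(j)
        for i in range(1 << a_bits)
        for j in range(1 << b_bits)
    }

m = standard_embedding(2, 2)  # 4x4 mesh, 16-node hypercube
# Mesh neighbours differ in exactly one hypercube dimension:
assert bin(m[(0, 0)] ^ m[(0, 1)]).count("1") == 1
assert bin(m[(0, 0)] ^ m[(1, 0)]).count("1") == 1
```

Under this assignment, a hypercube communication step of a CC-cube algorithm becomes a set of mesh paths, which the message-scheduling technique (c) then orders to avoid link conflicts.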
Numerical wave propagation for the triangular P1DG-P2 finite element pair
Inertia-gravity mode and Rossby mode dispersion properties are examined for
discretisations of the linearized rotating shallow-water equations using the
P1DG-P2 finite element pair on arbitrary triangulations in planar
geometry. A discrete Helmholtz decomposition of the functions in the velocity
space based on potentials taken from the pressure space is used to provide a
complete description of the numerical wave propagation for the discretised
equations. In the f-plane case, this decomposition is used to obtain
decoupled equations for the geostrophic modes, the inertia-gravity modes, and
the inertial oscillations. As has been noticed previously, the geostrophic
modes are steady. The Helmholtz decomposition is used to show that the
resulting inertia-gravity wave equation is third-order accurate in space. In
general the P1DG-P2 finite element pair is second-order accurate, so this leads
to very accurate wave propagation. It is further shown that the only spurious
modes supported by this discretisation are spurious inertial oscillations which
have frequency f, and which do not propagate. The Helmholtz decomposition
also allows a simple derivation of the quasi-geostrophic limit of the
discretised P1DG-P2 equations in the f-plane case, resulting in a
Rossby wave equation which is also third-order accurate.

Comment: Revised version prior to final journal submission
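The discrete Helmholtz decomposition at the heart of the analysis can be written schematically as follows (the notation is ours, for illustration; the paper's own definitions take precedence):

```latex
% Each discrete velocity field u_h is split using two potentials
% \psi_h, \phi_h drawn from the pressure space:
u_h \;=\; \nabla^{\perp}\psi_h \;+\; \nabla\phi_h,
\qquad \nabla^{\perp} \equiv (-\partial_y,\ \partial_x).
% On the f-plane, the divergence-free part \nabla^{\perp}\psi_h carries the
% steady geostrophic modes, while \nabla\phi_h carries the inertia-gravity
% modes; this is what decouples the two wave systems in the analysis.
```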
Vectorization and Parallelization of the Adaptive Mesh Refinement N-body Code
In this paper, we describe our vectorized and parallelized adaptive mesh
refinement (AMR) N-body code with shared time steps, and report its performance
on a Fujitsu VPP5000 vector-parallel supercomputer. Our AMR N-body code puts
hierarchical meshes recursively where higher resolution is required and the
time steps of all particles are the same. The parts that are most difficult
to vectorize are loops that access the mesh data and particle data. We
vectorized such parts by changing the loop structure, so that the innermost
loop steps through the cells instead of the particles in each cell, in other
words, by changing the loop order from the depth-first order to the
breadth-first order. Mass assignment is also vectorizable using this loop order
exchange and splitting the loop into 2^d loops, if the cloud-in-cell
scheme is adopted. Here, d is the number of dimensions. These
vectorization schemes which eliminate the unvectorized loops are applicable to
parallelization of loops for shared-memory multiprocessors. We also
parallelized our code for distributed memory machines. The important part of
parallelization is data decomposition. We sorted the hierarchical mesh data by
the Morton order, or the recursive N-shaped order, level by level and split and
allocated the mesh data to the processors. Particles are allocated to the
processor to which the finest refined cells including the particles are also
assigned. Our timing analysis using the Λ-dominated cold dark matter
simulations shows that our parallel code speeds up almost ideally up to 32
processors, the largest number of processors in our test.

Comment: 21 pages, 16 figures, to be published in PASJ (Vol. 57, No. 5, Oct.
2005)
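The Morton (Z/N-shaped) ordering used above for data decomposition can be sketched in 2-D (the code itself works in 3-D; this simplification and the function name are ours): interleave the bits of the cell coordinates to form a key, and sort cells by that key so that spatially nearby cells of the same level land on the same processor.

```python
def morton2(i: int, j: int) -> int:
    """Interleave the bits of 2-D cell coordinates (i, j) into a Morton
    key; sorting by this key yields the recursive N-shaped traversal."""
    key = 0
    for b in range(32):
        key |= ((i >> b) & 1) << (2 * b + 1)  # i bits -> odd positions
        key |= ((j >> b) & 1) << (2 * b)      # j bits -> even positions
    return key

cells = [(i, j) for i in range(4) for j in range(4)]
cells.sort(key=lambda c: morton2(*c))
# The first four cells form the lower-left 2x2 quadrant of the 4x4 grid.
assert cells[:4] == [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Splitting the Morton-sorted cell list into equal contiguous chunks then gives each processor a compact block of space, which is why particles can simply follow their finest refined cell.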