22,856 research outputs found
A Parallel Adaptive P3M code with Hierarchical Particle Reordering
We discuss the design and implementation of HYDRA_OMP a parallel
implementation of the Smoothed Particle Hydrodynamics-Adaptive P3M (SPH-AP3M)
code HYDRA. The code is designed primarily for conducting cosmological
hydrodynamic simulations and is written in Fortran77+OpenMP. A number of
optimizations for RISC processors and SMP-NUMA architectures have been
implemented, the most important optimization being hierarchical reordering of
particles within chaining cells, which greatly improves data locality thereby
removing the cache misses typically associated with linked lists. Parallel
scaling is good, with a minimum parallel scaling of 73% achieved on 32 nodes
for a variety of modern SMP architectures. We give performance data in terms of
the number of particle updates per second, which is a more useful performance
metric than raw MFlops. A basic version of the code will be made available to
the community in the near future.Comment: 34 pages, 12 figures, accepted for publication in Computer Physics
Communication
A parallel compact-TVD method for compressible fluid dynamics employing shared and distributed-memory paradigms
A novel multi-block compact-TVD finite difference method for the simulation of compressible flows is presented. The method combines distributed and shared-memory paradigms to take advantage of the configuration of modern supercomputers that host many cores per shared-memory node. In our approach a domain decomposition technique is applied to a compact scheme using explicit flux formulas at block interfaces. This method offers great improvement in performance over earlier parallel compact methods that rely on the parallel solution of a linear system. A test case is presented to assess the accuracy and parallel performance of the new method
Software trace cache
We explore the use of compiler optimizations, which optimize the layout of instructions in memory. The target is to enable the code to make better use of the underlying hardware resources regardless of the specific details of the processor/architecture in order to increase fetch performance. The Software Trace Cache (STC) is a code layout algorithm with a broader target than previous layout optimizations. We target not only an improvement in the instruction cache hit rate, but also an increase in the effective fetch width of the fetch engine. The STC algorithm organizes basic blocks into chains trying to make sequentially executed basic blocks reside in consecutive memory positions, then maps the basic block chains in memory to minimize conflict misses in the important sections of the program. We evaluate and analyze in detail the impact of the STC, and code layout optimizations in general, on the three main aspects of fetch performance; the instruction cache hit rate, the effective fetch width, and the branch prediction accuracy. Our results show that layout optimized, codes have some special characteristics that make them more amenable for high-performance instruction fetch. They have a very high rate of not-taken branches and execute long chains of sequential instructions; also, they make very effective use of instruction cache lines, mapping only useful instructions which will execute close in time, increasing both spatial and temporal locality.Peer ReviewedPostprint (published version
High Performance Direct Gravitational N-body Simulations on Graphics Processing Units
We present the results of gravitational direct -body simulations using the
commercial graphics processing units (GPU) NVIDIA Quadro FX1400 and GeForce
8800GTX, and compare the results with GRAPE-6Af special purpose hardware. The
force evaluation of the -body problem was implemented in Cg using the GPU
directly to speed-up the calculations. The integration of the equations of
motions were, running on the host computer, implemented in C using the 4th
order predictor-corrector Hermite integrator with block time steps. We find
that for a large number of particles (N \apgt 10^4) modern graphics
processing units offer an attractive low cost alternative to GRAPE special
purpose hardware. A modern GPU continues to give a relatively flat scaling with
the number of particles, comparable to that of the GRAPE. Using the same time
step criterion the total energy of the -body system was conserved better
than to one in on the GPU, which is only about an order of magnitude
worse than obtained with GRAPE. For N\apgt 10^6 the GeForce 8800GTX was about
20 times faster than the host computer. Though still about an order of
magnitude slower than GRAPE, modern GPU's outperform GRAPE in their low cost,
long mean time between failure and the much larger onboard memory; the
GRAPE-6Af holds at most 256k particles whereas the GeForce 8800GTF can hold 9
million particles in memory.Comment: Submitted to New Astronom
Higher-order CFD and Interface Tracking Methods on Highly-Parallel MPI and GPU systems
A computational investigation of the effects on parallel performance of higher-order accurate schemes was carried out on two different computational systems: a traditional CPU based MPI cluster and a system of four Graphics Processing Units (GPUs) controlled by a single quad-core CPU. The investigation was based on the solution of the level set equations for interface tracking using a High-Order Upstream Central (HOUC) scheme. Different variants of the HOUC scheme were employed together with a 3rd-order TVD Runge-Kutta time integration. An increase in performance of two orders of magnitude was seen when comparing a single CPU core to a single GPU with a greater increase at higher orders of accuracy and at lower precision
- …