Scaling of a Fast Fourier Transform and a pseudo-spectral fluid solver up to 196608 cores
In this paper we present scaling results of an FFT library, FFTK, and a
pseudospectral code, Tarang, on large grids using 65536 cores of a Blue
Gene/P and 196608 cores of a Cray XC40 supercomputer. We observe that
communication dominates computation, more so on the Cray XC40. The
computation time scales nearly ideally as $p^{-1}$ in the number of cores
$p$, while the communication time scales as $p^{-\gamma}$, with $\gamma$
ranging from 0.7 to 0.9 for the Blue Gene/P and from 0.43 to 0.73 for the
Cray XC40. FFTK and the fluid and convection solvers of Tarang exhibit weak
as well as strong scaling nearly up to 196608 cores of the Cray XC40. We
perform a comparative study of the performance on the Blue Gene/P and Cray
XC40 clusters.
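As a minimal illustration of how such an exponent is extracted, the sketch below fits $\gamma$ in $T_{\mathrm{comm}} \sim p^{-\gamma}$ from timing data; the timings, constants, and fitting code are synthetic stand-ins, not the paper's measurements:

```python
import math

# Synthetic communication timings generated from T_comm = C * p^(-gamma)
# with gamma = 0.8 and C = 10.0 (illustrative values, not measurements).
cores = [1024, 2048, 4096, 8192, 16384]
t_comm = [10.0 * p ** (-0.8) for p in cores]

# Least-squares slope of log T versus log p; -slope estimates gamma.
xs = [math.log(p) for p in cores]
ys = [math.log(t) for t in t_comm]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
gamma_est = -slope
print(round(gamma_est, 3))  # 0.8 for this exact power law
```

With real timings the fitted slope would fall in the reported ranges (0.7 to 0.9 on Blue Gene/P, 0.43 to 0.73 on Cray XC40) rather than recover a single exact value.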
An Efficient Particle Tracking Algorithm for Large-Scale Parallel Pseudo-Spectral Simulations of Turbulence
Particle tracking in large-scale numerical simulations of turbulent flows
presents one of the major bottlenecks in parallel performance and scaling
efficiency. Here, we describe a particle tracking algorithm for large-scale
parallel pseudo-spectral simulations of turbulence which scales well up to
billions of tracer particles on modern high-performance computing
architectures. We summarize the standard parallel methods used to solve the
fluid equations in our hybrid MPI/OpenMP implementation. As the main focus, we
describe the implementation of the particle tracking algorithm and document its
computational performance. To address the extensive inter-process communication
required by particle tracking, we introduce a task-based approach to overlap
point-to-point communications with computations, thereby enabling improved
resource utilization. We characterize the computational cost as a function of
the number of particles tracked and compare it with the flow field computation,
showing that the cost of particle tracking is very small for typical
applications.
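The overlap idea described above can be sketched as follows. This is an illustrative stand-in, not the authors' implementation: Python threads take the place of non-blocking MPI point-to-point calls, and `send_particles` and `interpolate` are hypothetical helpers:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_particles(batch):
    """Stand-in for a non-blocking particle exchange (e.g. MPI_Isend/Irecv)."""
    time.sleep(0.05)  # simulated network latency
    return len(batch)

def interpolate(batch):
    """Stand-in for interpolating the velocity field at particle positions."""
    return [x * 0.5 for x in batch]

batches = [[float(i)] * 4 for i in range(8)]
results = []
with ThreadPoolExecutor(max_workers=2) as pool:
    pending = None
    for batch in batches:
        future = pool.submit(send_particles, batch)  # start "communication"
        results.append(interpolate(batch))           # overlap computation
        if pending is not None:
            pending.result()                         # complete earlier exchange
        pending = future
    pending.result()

print(len(results))  # 8
```

The point of the task-based structure is that the processor is never idle while a particle exchange is in flight; it always has an interpolation task to run in the meantime.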
GPU parallelization of a hybrid pseudospectral geophysical turbulence framework using CUDA
An existing hybrid MPI-OpenMP scheme is augmented with a CUDA-based fine-grain parallelization approach for multidimensional distributed Fourier transforms, in a well-characterized pseudospectral fluid turbulence code. Basics of the hybrid scheme are reviewed, and heuristics are provided to show a potential benefit of the CUDA implementation. The method draws heavily on the CUDA runtime library to handle memory management and on the cuFFT library for computing local FFTs. The manner in which the interfaces to these libraries are constructed, and ISO bindings are utilized to facilitate platform portability, is discussed. CUDA streams are implemented to overlap data transfer with cuFFT computation. Testing with a baseline solver demonstrated significant aggregate speed-up over the hybrid MPI-OpenMP solver by offloading to GPUs on an NVLink-based test system. While the batch-streamed approach provided little benefit with NVLink, we saw a performance gain of 30% when tuned for the optimal number of streams on a PCIe-based system. Strong GPU scaling was found to be nearly ideal in all cases. Profiling of the CUDA kernels shows that the transform computation achieves 15% of the attainable peak FLOP rate based on a roofline model for the system. In addition to speed-up measurements for the fiducial solver, we also considered several other solvers with different numbers of transform operations and found that aggregate speed-ups are nearly constant for all solvers.
Authors: Rosenberg, Duane (Colorado State University, Fort Collins, USA); Mininni, Pablo Daniel (Instituto de Física de Buenos Aires, CONICET, Universidad de Buenos Aires, Facultad de Ciencias Exactas y Naturales, Argentina); Reddy, Raghu (Environmental Modeling Center, USA); Pouquet, Annick (University of Colorado at Boulder; National Center for Atmospheric Research, USA)
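The roofline comparison mentioned above has a simple structure: attainable performance is the lesser of the peak FLOP rate and the memory bandwidth times the kernel's arithmetic intensity. The numbers below are assumed for illustration and are not the test system's actual specifications:

```python
# Toy roofline estimate (all hardware numbers are illustrative assumptions).
peak_flops = 7.0e12   # assumed peak compute rate, FLOP/s
bandwidth = 9.0e11    # assumed memory bandwidth, bytes/s
intensity = 2.0       # assumed arithmetic intensity of the FFT kernel, FLOP/byte

# Attainable rate = min(compute ceiling, bandwidth ceiling).
attainable = min(peak_flops, bandwidth * intensity)

# The paper reports the transform achieving ~15% of the attainable peak.
measured = 0.15 * attainable
print(attainable, measured)
```

With these assumed numbers the kernel is bandwidth-bound (the memory ceiling of 1.8 TFLOP/s sits below the 7 TFLOP/s compute peak), which is the typical regime for FFTs.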
The inherent overlapping in the parallel calculation of the Laplacian
A new approach for the parallel computation of the Laplacian in the Fourier domain is presented. This numerical problem inherits the intrinsic sequencing involved in the calculation of any multidimensional Fast Fourier Transform (FFT), where blocking communications ensure that the computation is carried out strictly dimension by dimension. Such data dependency vanishes when one considers the Laplacian as the sum of n independent one-dimensional kernels, so that computation and communication can be naturally overlapped with non-blocking communications. Overlapping is demonstrated to be responsible for the speedup figures we obtain when our approach is compared to state-of-the-art parallel multidimensional FFTs.
Funding: Junta de Castilla y León (grant number VA296P18)
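The decomposition this speedup rests on, namely the Fourier-space Laplacian multiplier $-(k_x^2 + k_y^2)$ splitting into independent per-dimension kernels $-k_x^2$ and $-k_y^2$, can be checked directly. This pure-Python sketch (naive DFT on a small 2-D grid) is illustrative only, not the paper's code:

```python
import cmath
import math

n = 8

def dft2(f):
    """Naive 2-D DFT, O(n^4); fine for n = 8."""
    return [[sum(f[x][y] * cmath.exp(-2j * math.pi * (kx * x + ky * y) / n)
                 for x in range(n) for y in range(n))
             for ky in range(n)] for kx in range(n)]

def wave(k):
    """Signed integer wavenumber for DFT index k."""
    return k if k <= n // 2 else k - n

# A smooth periodic test function on the n x n grid.
f = [[math.sin(2 * math.pi * x / n) + math.cos(2 * math.pi * y / n)
      for y in range(n)] for x in range(n)]
F = dft2(f)

# Full multiplier -(kx^2 + ky^2) applied at once ...
full = [[-(wave(kx) ** 2 + wave(ky) ** 2) * F[kx][ky] for ky in range(n)]
        for kx in range(n)]
# ... versus the sum of the two independent 1-D kernels.
split = [[-wave(kx) ** 2 * F[kx][ky] + -wave(ky) ** 2 * F[kx][ky]
          for ky in range(n)] for kx in range(n)]

err = max(abs(full[i][j] - split[i][j]) for i in range(n) for j in range(n))
print(err < 1e-9)  # True: the two formulations agree
```

Because each one-dimensional kernel touches only one dimension of the data, its transform and communication can proceed independently of the others, which is what makes the non-blocking overlap possible.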
Turbulence in a stably stratified fluid: Onset of global anisotropy as a function of the Richardson number
It is necessary to introduce an external forcing to induce turbulence in a
stably stratified fluid. The Heisenberg eddy viscosity technique should in this
case suffice to calculate a space-time averaged quantity like the global
anisotropy parameter as a function of the Richardson number. We find
analytically that the anisotropy increases linearly with the Richardson number,
with a small quadratic correction. A numerical simulation of the complete
equations shows the linear behaviour.
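Schematically, the stated result can be written as follows, with $c_1$ and $c_2$ constants left undetermined here since this abstract does not report their values:

```latex
A(\mathrm{Ri}) \approx c_1\,\mathrm{Ri} + c_2\,\mathrm{Ri}^2,
\qquad |c_2|\,\mathrm{Ri} \ll c_1,
```

where $A$ is the global anisotropy parameter and $\mathrm{Ri}$ the Richardson number, so the quadratic term is a small correction in the regime studied.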