43 research outputs found
Petascale turbulence simulation using a highly parallel fast multipole method on GPUs
This paper reports large-scale direct numerical simulations of
homogeneous-isotropic fluid turbulence, achieving sustained performance of 1.08
petaflop/s on gpu hardware using single precision. The simulations use a vortex
particle method to solve the Navier-Stokes equations, with a highly parallel
fast multipole method (FMM) as numerical engine, and match the current record
in mesh size for this application, a cube of 4096^3 computational points solved
with a spectral method. The standard numerical approach used in this field is
the pseudo-spectral method, relying on the FFT algorithm as numerical engine.
The particle-based simulations presented in this paper quantitatively match the
kinetic energy spectrum obtained with a pseudo-spectral method, using a trusted
code. In terms of parallel performance, weak scaling results show the fmm-based
vortex method achieving 74% parallel efficiency on 4096 processes (one gpu per
mpi process, 3 gpus per node of the TSUBAME-2.0 system). The FFT-based spectral
method is able to achieve just 14% parallel efficiency on the same number of
mpi processes (using only cpu cores), due to the all-to-all communication
pattern of the FFT algorithm. The calculation time for one time step was 108
seconds for the vortex method and 154 seconds for the spectral method, under
these conditions. Computing with 69 billion particles, this work exceeds by an
order of magnitude the largest vortex method calculations to date
A Tuned and Scalable Fast Multipole Method as a Preeminent Algorithm for Exascale Systems
Among the algorithms that are likely to play a major role in future exascale
computing, the fast multipole method (FMM) appears as a rising star. Our
previous recent work showed scaling of an FMM on GPU clusters, with problem
sizes in the order of billions of unknowns. That work led to an extremely
parallel FMM, scaling to thousands of GPUs or tens of thousands of CPUs. This
paper reports on a a campaign of performance tuning and scalability studies
using multi-core CPUs, on the Kraken supercomputer. All kernels in the FMM were
parallelized using OpenMP, and a test using 10^7 particles randomly distributed
in a cube showed 78% efficiency on 8 threads. Tuning of the
particle-to-particle kernel using SIMD instructions resulted in 4x speed-up of
the overall algorithm on single-core tests with 10^3 - 10^7 particles. Parallel
scalability was studied in both strong and weak scaling. The strong scaling
test used 10^8 particles and resulted in 93% parallel efficiency on 2048
processes for the non-SIMD code and 54% for the SIMD-optimized code (which was
still 2x faster). The weak scaling test used 10^6 particles per process, and
resulted in 72% efficiency on 32,768 processes, with the largest calculation
taking about 40 seconds to evaluate more than 32 billion unknowns. This work
builds up evidence for our view that FMM is poised to play a leading role in
exascale computing, and we end the paper with a discussion of the features that
make it a particularly favorable algorithm for the emerging heterogeneous and
massively parallel architectural landscape
A SVD accelerated kernel-independent fast multipole method and its application to BEM
The kernel-independent fast multipole method (KIFMM) proposed in [1] is of
almost linear complexity. In the original KIFMM the time-consuming M2L
translations are accelerated by FFT. However, when more equivalent points are
used to achieve higher accuracy, the efficiency of the FFT approach tends to be
lower because more auxiliary volume grid points have to be added. In this
paper, all the translations of the KIFMM are accelerated by using the singular
value decomposition (SVD) based on the low-rank property of the translating
matrices. The acceleration of M2L is realized by first transforming the
associated translating matrices into more compact form, and then using low-rank
approximations. By using the transform matrices for M2L, the orders of the
translating matrices in upward and downward passes are also reduced. The
improved KIFMM is then applied to accelerate BEM. The performance of the
proposed algorithms are demonstrated by three examples. Numerical results show
that, compared with the original KIFMM, the present method can reduce about 40%
of the iterating time and 25% of the memory requirement.Comment: 19 pages, 4 figure
Méthode des multipôles rapide à base de tâches pour des clusters de processeurs multicoeurs
Most high-performance, scientific libraries have adopted hybrid parallelization schemes - such as the popular MPI+OpenMP hybridization - to benefit from the capacities of modern distributed-memory machines. While these approaches have shown to achieve high performance, they require a lot of effort to design and maintain sophisticated synchronization/communication strategies. On the other hand, task-based programming paradigms aim at delegating this burden to a runtime system for maximizing productivity. In this article, we assess the potential of task-based fast multipole methods (FMM) on clusters of multicore processors. We propose both a hybrid MPI+task FMM parallelization and a pure task-based parallelization where the MPI communications are implicitly handled by the runtime system. The latter approach yields a very compact code following a sequential task-based programming model. We show that task-based approaches can compete with a hybrid MPI+OpenMP highly optimized code and that furthermore the compact task-based scheme fully matches the performance of the sophisticated, hybrid MPI+task version, ensuring performance while maximizing productivity. We illustrate our discussion with the ScalFMM FMM library and the StarPU runtime system.La plupart des bibliothèques scientifiques très performantes ont adopté des parallélisations hybrides - comme l’approche MPI+OpenMP - pour profiter des capacités des machines modernes à mémoire distribuée. Ces approches permettent d’obtenir de très hautes performances, mais elles nécessitent beaucoup d’efforts pour concevoir et pour maintenir des stratégies de synchronisation/communication sophistiquées. D’un autre côté, les paradigmes de programmation à base de tâches visent à déléguer ce fardeau à un moteur d'exécution pour maximiser la productivité. Dans cet article, nous évaluons le potentiel de la méthode des multipôles rapide (FMM) à base de tâches sur les clusters de processeurs multic\oe{}urs. Nous proposons deux types de parallélisation, une première approche hybride (MPI+Tâche) à base de tâches et d’appels à MPI pour gérer explicitement les communications et la deuxième uniquement à base de tâches où les communications MPI sont implicitement postées par le moteur d'exécution. Cette dernière approche conduit à un code très compact qui suit le modèle de programmation séquentiel à base de tâches. Nous montrons que cette approche rivalise avec le code hybride MPI+OpenMP fortement optimisé et qu'en outre le code compact atteint les performances de la version hybride MPI+Tâche, assurant une très haute performance tout en maximisant la productivité. Nous illustrons notre propos avec la bibliothèque FMM ScalFMM et le moteur d'exécution StarPU