34 research outputs found

Optimized M2L Kernels for the Chebyshev Interpolation based Fast Multipole Method

Authors: Bramas Bérenger, Coulaud Olivier, Darve Eric, Messner Matthias
Publication venue: HAL CCSD
Publication date: 01/01/2012

A fast multipole method (FMM) for asymptotically smooth kernel functions (1/r, 1/r^4, Gauss and Stokes kernels, radial basis functions, etc.) based on a Chebyshev interpolation scheme has been introduced in [Fong et al., 2009]. The method has been extended to oscillatory kernels (e.g., Helmholtz kernel) in [Messner et al., 2012]. Besides its generality, this FMM turns out to be favorable due to its easy implementation and its high performance based on intensive use of highly optimized BLAS libraries. However, one of its bottlenecks is the precomputation of the multipole-to-local (M2L) operator, and its higher number of floating point operations (flops) compared to other FMM formulations. Here, we present several optimizations for that operator, which is known to be the costliest FMM operator. The most efficient ones not only reduce the precomputation time by a factor of up to 340 but also speed up the matrix-vector product. We conclude with comparisons and numerical validations of all presented optimizations.
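As a rough illustration of the Chebyshev interpolation idea summarized in this abstract, the following Python sketch approximates the far-field sum of a 1/r kernel between two well-separated 1D intervals. It is a minimal sketch under stated assumptions, not the authors' implementation: the 1D setting, the single pair of boxes, and the names `cheb_nodes`, `interp_matrix`, `kernel` and `approx_far_field` are all hypothetical choices made for this example. The small dense matrix `K` plays the role of the M2L operator.

```python
import numpy as np

def cheb_nodes(n, a, b):
    """Return the n roots of the Chebyshev polynomial T_n, mapped to [a, b]."""
    k = np.arange(1, n + 1)
    t = np.cos((2 * k - 1) * np.pi / (2 * n))
    return 0.5 * (a + b) + 0.5 * (b - a) * t

def interp_matrix(n, a, b, x):
    """Return S with S[i, m] = 1/n + (2/n) * sum_{k=1}^{n-1} T_k(node_m) T_k(x_i)."""
    k = np.arange(1, n + 1)
    t_nodes = np.cos((2 * k - 1) * np.pi / (2 * n))       # nodes in [-1, 1]
    t_x = np.clip((2 * x - (a + b)) / (b - a), -1.0, 1.0)  # points in [-1, 1]
    orders = np.arange(1, n)                               # polynomial orders 1..n-1
    T_nodes = np.cos(np.outer(orders, np.arccos(t_nodes)))  # shape (n-1, n)
    T_x = np.cos(np.outer(orders, np.arccos(t_x)))          # shape (n-1, len(x))
    return 1.0 / n + (2.0 / n) * (T_x.T @ T_nodes)

def kernel(x, y):
    """Dense 1/r kernel matrix between target points x and source points y."""
    return 1.0 / np.abs(x[:, None] - y[None, :])

def approx_far_field(x, y, q, n, x_box, y_box):
    """Approximate sum_j K(x_i, y_j) q_j using n interpolation nodes per box."""
    Sx = interp_matrix(n, *x_box, x)   # targets  <- target-box nodes
    Sy = interp_matrix(n, *y_box, y)   # sources  <- source-box nodes
    K = kernel(cheb_nodes(n, *x_box), cheb_nodes(n, *y_box))  # node-to-node (M2L-like)
    multipole = Sy.T @ q               # anterpolate charges onto source nodes
    local = K @ multipole              # transfer between the two sets of nodes
    return Sx @ local                  # interpolate the result to the targets

# Example use with hypothetical data: targets in [0, 1], sources in [3, 4].
rng = np.random.default_rng(0)
targets = rng.uniform(0.0, 1.0, 50)
sources = rng.uniform(3.0, 4.0, 60)
charges = rng.uniform(0.1, 1.0, 60)
potential = approx_far_field(targets, sources, charges, 10, (0.0, 1.0), (3.0, 4.0))
```

In this sketch the only kernel-dependent piece is the n-by-n matrix `K`, which is why this family of methods applies to many kernels and why the cost of building and applying that operator matters.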

arXiv.org e-Print Archive

CiteSeerX

INRIA a CCSD electronic archive server

Algorithmic and Code Optimizations of Molecular Dynamics Simulations for Process Engineering (Algorithmische und Code-Optimierungen Molekulardynamiksimulationen für Verfahrenstechnik)

Author: Tchipev Nikola Plamenov
Publication venue: Technische Universität München

The focus of this work lies on implementation improvements and, in particular, node-level performance optimization of the simulation software ls1-mardyn. Through data structure improvements, SIMD vectorization and, especially, OpenMP parallelization, the world’s first simulation of 2*10^13 molecules at over 1 PFLOP/sec was enabled. To allow for long-range interactions, the Fast Multipole Method was introduced to ls1-mardyn. The algorithm was optimized for sequential, shared-memory, and distributed-memory execution on up to 32,768 MPI processes.

High performance BLAS formulation of the multipole-to-local operator in the Fast Multipole Method

Authors: Coulaud Olivier, Fortin Pierre, Roman Jean
Publication venue: 'Elsevier BV'
Publication date: 01/01/2008

The multipole-to-local (M2L) operator is the most time-consuming part of the far field computation in the Fast Multipole Method for the Laplace equation. Its natural expression, though commonly used, does not respect a sharp error bound: we here first prove the correctness of a second expression. We then propose a matrix formulation implemented with BLAS (Basic Linear Algebra Subprograms) routines in order to speed up its computation for these two expressions. We also introduce special data storages in memory to gain greater computational efficiency. This BLAS scheme is finally compared, for uniform distributions, to other M2L improvements such as block FFT, rotations and plane wave expansions. When considering runtime, extra memory storage, numerical stability and common precisions for the Laplace equation, the BLAS version appears as the best one.

INRIA a CCSD electronic archive server

Oskar Bordeaux

The fast multipole method at exascale

Author: Chandramowlishwaran Aparna
Publication venue: Georgia Institute of Technology
Publication date: 13/01/2014

This thesis presents a top-to-bottom analysis of designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme scale systems. We specifically address two key challenges. The first challenge is how to engineer fast code for today’s platforms. We present the first in-depth study of multicore optimizations and tuning for FMM, along with a systematic approach for transforming a conventionally-parallelized FMM into a highly-tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high performance in the face of multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. Using these models, we present results for choosing the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs, if there are no significant changes, could cause it to become communication-bound as early as the year 2015. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design.
To demonstrate the scientific significance of FMM, we present two applications, namely direct simulation of blood, which is a multi-scale multi-physics problem, and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastructure for the direct numerical simulation of blood. It comprises two key algorithmic components, of which FMM is one. We were able to simulate blood flow using Stokesian dynamics on 200,000 cores of Jaguar, a petaflop system, and achieved a sustained performance of 0.7 Petaflop/s. The second application, which we propose as future work in this thesis, is biomolecular electrostatics, where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the large linear system is a dense matrix-vector multiply, which we propose can be calculated using our scalable FMM. We propose to begin with the two-dielectric problem, where the electrostatic field is calculated using two continuum dielectric media, the solvent and the molecule. This is only a first step to solving biologically challenging problems which have more than two dielectric media, ion-exclusion layers, and solvent-filled cavities. Finally, given the difficulty in producing high-performance scalable code, productivity is a key concern. Recently, numerical algorithms are being redesigned to take advantage of the architectural features of emerging multicore processors. These new classes of algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model, called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially ordered by semantic scheduling constraints.
The CnC model is well-suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art multicore systems. Our implementations in CnC were able to match and in some cases even exceed competing vendor-tuned and domain-specific library codes. We combine these two distinct research efforts by expressing FMM in CnC; our approach tries to marry performance with productivity, which will be critical on future systems. Looking forward, we would like to extend this to distributed memory machines, specifically by implementing FMM in the new distributed CnC, distCnC, to express fine-grained parallelism, which would require significant effort in alternative models. (Ph.D. thesis)

Scholarly Materials And Research @ Georgia Tech

Enabling task parallelism for many-core architectures

Author: Atkinson Patrick R
Publication date: 28/09/2021

Explore Bristol Research

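Optimization and parallelization of the boundary element method for the wave equation in the time domain (Optimisation et parallèlisation de la méthode des élements frontières pour l’équation des ondes dans le domaine temporel)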

Author: Bramas Bérenger
Publication venue: HAL CCSD
Publication date: 15/02/2016

The time-domain BEM for the wave equation in acoustics and electromagnetism is used to simulate the propagation of a wave with a discretization in time. It yields several frequency-domain results with one solve. In this thesis, we investigate the implementation of an efficient TD-BEM solver using different approaches. We describe the context of our study and the TD-BEM formulation expressed as a sparse linear system composed of multiple interaction/convolution matrices. This system is naturally computed using the sparse matrix-vector product (SpMV). We work on the limits of the SpMV kernel by looking at the matrix reordering and the behavior of our SpMV kernels using vectorization (SIMD) on CPUs and an advanced blocking layout on Nvidia GPUs. We show that this operator is not appropriate for our problem, and we then propose to reorder the original computation to get a special matrix structure. This new structure is called a slice matrix and is computed with a custom matrix/vector product operator. We present an optimized implementation of this operator on CPUs and Nvidia GPUs for which we describe advanced blocking schemes. The resulting solver is parallelized with a hybrid strategy across heterogeneous nodes and relies on a new heuristic to balance the work among the processing units. Due to the quadratic complexity of this matrix approach, we study the use of the fast multipole method (FMM) for our time-domain BEM solver. We investigate the parallelization of the general FMM algorithm using several paradigms in both shared and distributed memory, and we explain how modern runtime systems are well-suited to express the FMM computation. Finally, we investigate the implementation and the parametrization of an FMM kernel specific to our TD-BEM, and we provide preliminary results.
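For readers unfamiliar with the sparse matrix-vector product (SpMV) that this abstract takes as its baseline, here is a minimal Python sketch of SpMV over the common compressed sparse row (CSR) layout. The CSR layout and the function name `csr_spmv` are assumptions made for illustration; the abstract does not state which storage format the thesis uses, and this is not the slice-matrix operator it proposes.

```python
import numpy as np

def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries of A, listed row by row
    col_idx : column index of each entry in `values`
    row_ptr : row i owns the entries values[row_ptr[i]:row_ptr[i + 1]]
    """
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows, dtype=np.result_type(values, x))
    for i in range(n_rows):
        start, end = row_ptr[i], row_ptr[i + 1]
        # Gather the needed entries of x, then reduce against the row's nonzeros.
        y[i] = np.dot(values[start:end], x[col_idx[start:end]])
    return y

# Hypothetical 3x3 example: A = [[4, 0, 1], [0, 0, 2], [3, 5, 0]]
values = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
col_idx = np.array([0, 2, 2, 0, 1])
row_ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 2.0, 3.0])
y = csr_spmv(values, col_idx, row_ptr, x)
```

The indirect access `x[col_idx[...]]` is the irregular gather that tends to limit SpMV throughput, which is one reason reordering and blocking schemes of the kind mentioned in the abstract are studied.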

Thèses en Ligne

INRIA a CCSD electronic archive server

Theses.fr

X10 for high-performance scientific computing

Author: Milthorpe Joshua John
Publication date: 01/01/2015

High performance computing is a key technology that enables large-scale physical simulation in modern science. While great advances have been made in methods and algorithms for scientific computing, the most commonly used programming models encourage a fragmented view of computation that maps poorly to the underlying computer architecture. Scientific applications typically manifest physical locality, which means that interactions between entities or events that are nearby in space or time are stronger than more distant interactions. Linear-scaling methods exploit physical locality by approximating distant interactions, to reduce computational complexity so that cost is proportional to system size. In these methods, the computation required for each portion of the system is different depending on that portion’s contribution to the overall result. To support productive development, application programmers need programming models that cleanly map aspects of the physical system being simulated to the underlying computer architecture while also supporting the irregular workloads that arise from the fragmentation of a physical system. X10 is a new programming language for high-performance computing that uses the asynchronous partitioned global address space (APGAS) model, which combines explicit representation of locality with asynchronous task parallelism. This thesis argues that the X10 language is well suited to expressing the algorithmic properties of locality and irregular parallelism that are common to many methods for physical simulation. The work reported in this thesis was part of a co-design effort involving researchers at IBM and ANU in which two significant computational chemistry codes were developed in X10, with an aim to improve the expressiveness and performance of the language. The first is a Hartree–Fock electronic structure code, implemented using the novel Resolution of the Coulomb Operator approach. 
The second evaluates electrostatic interactions between point charges, using either the smooth particle mesh Ewald method or the fast multipole method, with the latter used to simulate ion interactions in a Fourier Transform Ion Cyclotron Resonance mass spectrometer. We compare the performance of both X10 applications to state-of-the-art software packages written in other languages. This thesis presents improvements to the X10 language and runtime libraries for managing and visualizing the data locality of parallel tasks, communication using active messages, and efficient implementation of distributed arrays. We evaluate these improvements in the context of computational chemistry application examples. This work demonstrates that X10 can achieve performance comparable to established programming languages when running on a single core. More importantly, X10 programs can achieve high parallel efficiency on a multithreaded architecture, given a divide-and-conquer pattern of parallel tasks and appropriate use of worker-local data. For distributed memory architectures, X10 supports the use of active messages to construct local, asynchronous communication patterns which outperform global, synchronous patterns. Although point-to-point active messages may be implemented efficiently, productive application development also requires collective communications; more work is required to integrate both forms of communication in the X10 language. The exploitation of locality is the key insight in both linear-scaling methods and the APGAS programming model; their combination represents an attractive opportunity for future co-design efforts.

The Australian National University