34 research outputs found
Optimized M2L Kernels for the Chebyshev Interpolation based Fast Multipole Method
A fast multipole method (FMM) for asymptotically smooth kernel functions (1/r, 1/r^4, Gauss and Stokes kernels, radial basis functions, etc.) based on a Chebyshev interpolation scheme has been introduced in [Fong et al., 2009]. The method has been extended to oscillatory kernels (e.g., Helmholtz kernel) in [Messner et al., 2012]. Beside its generality this FMM turns out to be favorable due to its easy implementation and its high performance based on intensive use of highly optimized BLAS libraries. However, one of its bottlenecks is the precomputation of the multiple-to-local (M2L) operator, and its higher number of floating point operations (flops) compared to other FMM formulations. Here, we present several optimizations for that operator, which is known to be the costliest FMM operator. The most efficient ones do not only reduce the precomputation time by a factor up to 340 but they also speed up the matrix-vector product. We conclude with comparisons and numerical validations of all presented optimizations
Algorithmische und Code-Optimierungen Molekulardynamiksimulationen fĂŒr Verfahrenstechnik
The focus of this work lies on implementational improvements and, in particular, node-level performance optimization of the simulation software ls1-mardyn. Through data structure improvements, SIMD vectorization and, especially, OpenMP parallelization, the worldâs first simulation of 2*1013 molecules at over 1 PFLOP/sec was enabled. To allow for long-range interactions, the Fast Multipole Method was introduced to ls1-mardyn. The algorithm was optimized for sequential, shared-memory, and distributed-memory execution on up to 32,768 MPI processes.Der Fokus dieser Arbeit liegt auf Code-Optimierungen und insbesondere Leistungsoptimierung auf Knoten-Ebene fĂŒr die Simulationssoftware ls1-mardyn. Durch verbesserte Datenstrukturen, SIMD-Vektorisierung und vor allem OpenMP-Parallelisierung wurde die weltweit erste Petaflop-Simulation von 2*1013 MolekĂŒlen ermöglicht. Zur Simulation von langreichweitigen Wechselwirkungen wurde die Fast-Multipole-Methode in ls1-mardyn eingefĂŒhrt. Sequenzielle, Shared- und Distributed-Memory-Optimierungen wurden angewandt und erlaubten eine AusfĂŒhrung auf bis zu 32768 MPI-Prozessen
High performance BLAS formulation of the multipole-to-local operator in the Fast Multipole Method
International audienceThe multipole-to-local (M2L) operator is the most time-consuming part of the far field computation in the Fast Multipole Method for Laplace equation. Its natural expression, though commonly used, does not respect a sharp error bound: we here first prove the correctness of a second expression. We then propose a matrix formulation implemented with BLAS (Basic Linear Algebra Subprograms) routines in order to speed up its computation for these two expressions. We also introduce special data storages in memory to gain greater computational efficiency. This BLAS scheme is finally compared, for uniform distributions, to other M2L improvements such as block FFT, rotations and plane wave expansions. When considering runtime, extra memory storage, numerical stability and common precisions for Laplace equation, the BLAS version appears as the best one
The fast multipole method at exascale
This thesis presents a top to bottom analysis on designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N- body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme scale systems.
We specifically address two key challenges. The first challenge is how to engineer fast code for todayâs platforms. We present the first in-depth study of multicore op- timizations and tuning for FMM, along with a systematic approach for transforming a conventionally-parallelized FMM into a highly-tuned one. We introduce novel opti- mizations that significantly improve the within-node scalability of the FMM, thereby enabling high-performance in the face of multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter- node communication costs. Using these models, we present results for choosing the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs, if there are no significant changes could cause it to become communication-bound as early as the year 2015. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of highlevel algorithm-architecture co-design.
To demonstrate the scientific significance of FMM, we present two applications
namely, direct simulation of blood which is a multi-scale multi-physics problem and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastruc- ture for the direct numerical simulation of blood. It comprises of two key algorithmic components of which FMM is one. We were able to simulate blood flow using Stoke- sian dynamics on 200,000 cores of Jaguar, a peta-flop system and achieve a sustained performance of 0.7 Petaflop/s. The second application we propose as future work in this thesis is biomolecular electrostatics where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the large linear system is dense matrix vector multiply which we propose can be calculated using our scalable FMM. We propose to begin with the two dielectric problem where the electrostatic field is cal- culated using two continuum dielectric medium, the solvent and the molecule. This is only a first step to solving biologically challenging problems which have more than two dielectric medium, ion-exclusion layers, and solvent filled cavities.
Finally, given the difficulty in producing high-performance scalable code, productivity is a key concern. Recently, numerical algorithms are being redesigned to take advantage of the architectural features of emerging multicore processors. These new classes of algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model, called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially-ordered by semantic scheduling constraints. The CnC model is well-suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art mul- ticore systems. Our implementations in CnC was able to match and in some cases even exceed competing vendor-tuned and domain specific library codes. We combine these two distinct research efforts by expressing FMM in CnC, our approach tries to marry performance with productivity that will be critical on future systems. Looking forward, we would like to extend this to distributed memory machines, specifically implement FMM in the new distributed CnC, distCnC to express fine-grained paral- lelism which would require significant effort in alternative models.Ph.D
Optimisation et parallĂšlisation de la mĂ©thode des Ă©lements frontiĂšres pour lâĂ©quation des ondes dans le domaine temporel
The time-domain BEM for the wave equation in acoustics and electromagnetism is used to simulatethe propagation of a wave with a discretization in time. It allows to obtain several frequencydomainresults with one solve. In this thesis, we investigate the implementation of an efficientTD-BEM solver using different approaches. We describe the context of our study and the TD-BEMformulation expressed as a sparse linear system composed of multiple interaction/convolutionmatrices. This system is naturally computed using the sparse matrix-vector product (SpMV). Wework on the limits of the SpMV kernel by looking at the matrix reordering and the behavior of ourSpMV kernels using vectorization (SIMD) on CPUs and an advanced blocking-layout on NvidiaGPUs. We show that this operator is not appropriate for our problem, and we then propose toreorder the original computation to get a special matrix structure. This new structure is called aslice matrix and is computed with a custom matrix/vector product operator. We present an optimizedimplementation of this operator on CPUs and Nvidia GPUs for which we describe advancedblocking schemes. The resulting solver is parallelized with a hybrid strategy above heterogeneousnodes and relies on a new heuristic to balance the work among the processing units. Due tothe quadratic complexity of this matrix approach, we study the use of the fast multipole method(FMM) for our time-domain BEM solver. We investigate the parallelization of the general FMMalgorithm using several paradigms in both shared and distributed memory, and we explain howmodern runtime systems are well-suited to express the FMM computation. Finally, we investigatethe implementation and the parametrization of an FMM kernel specific to our TD-BEM, and weprovide preliminary results.La mĂ©thode des Ă©lĂ©ments frontiĂšres pour lâĂ©quation des ondes (BEM) est utilisĂ©e en acoustique eten Ă©lectromagnĂ©tisme pour simuler la propagation dâune onde avec une discrĂ©tisation en temps(TD). Elle permet dâobtenir un rĂ©sultat pour plusieurs frĂ©quences Ă partir dâune seule rĂ©solution.Dans cette thĂšse, nous nous intĂ©ressons Ă lâimplĂ©mentation efficace dâun simulateur TD-BEM sousdiffĂ©rents angles. Nous dĂ©crivons le contexte de notre Ă©tude et la formulation utilisĂ©e qui sâexprimesous la forme dâun systĂšme linĂ©aire composĂ© de plusieurs matrices dâinteractions/convolutions.Ce systĂšme est naturellement calculĂ© en utilisant lâopĂ©rateur matrice/vecteur creux (SpMV). Nousavons travaillĂ© sur la limite du SpMV en Ă©tudiant la permutation des matrices et le comportementde notre implĂ©mentation aidĂ© par la vectorisation sur CPU et avec une approche par bloc surGPU. Nous montrons que cet opĂ©rateur nâest pas appropriĂ© pour notre problĂšme et nous proposonsde changer lâordre de calcul afin dâobtenir une matrice avec une structure particuliĂšre.Cette nouvelle structure est appelĂ©e une matrice tranche et se calcule Ă lâaide dâun opĂ©rateur spĂ©cifique.Nous dĂ©crivons des implĂ©mentations optimisĂ©es sur architectures modernes du calculhaute-performance. Le simulateur rĂ©sultant est parallĂ©lisĂ© avec une approche hybride (mĂ©moirespartagĂ©es/distribuĂ©es) sur des noeuds hĂ©tĂ©rogĂšnes, et se base sur une nouvelle heuristique pourĂ©quilibrer le travail entre les processeurs. Cette approche matricielle a une complexitĂ© quadratiquesi bien que nous avons Ă©tudiĂ© son accĂ©lĂ©ration par la mĂ©thode des multipoles rapides (FMM). Nousavons tout dâabord travaillĂ© sur la parallĂ©lisation de lâalgorithme de la FMM en utilisant diffĂ©rentsparadigmes et nous montrons comment les moteurs dâexĂ©cution sont adaptĂ©s pour relĂącher le potentielde la FMM. Enfin, nous prĂ©sentons des rĂ©sultats prĂ©liminaires dâun simulateur TD-BEMaccĂ©lĂ©rĂ© par FMM
X10 for high-performance scientific computing
High performance computing is a key technology that enables large-scale physical
simulation in modern science. While great advances have been made in methods and
algorithms for scientific computing, the most commonly used programming models
encourage a fragmented view of computation that maps poorly to the underlying
computer architecture.
Scientific applications typically manifest physical locality, which means that interactions
between entities or events that are nearby in space or time are stronger
than more distant interactions. Linear-scaling methods exploit physical locality by approximating
distant interactions, to reduce computational complexity so that cost is
proportional to system size. In these methods, the computation required for each
portion of the system is different depending on that portionâs contribution to the
overall result. To support productive development, application programmers need
programming models that cleanly map aspects of the physical system being simulated
to the underlying computer architecture while also supporting the irregular
workloads that arise from the fragmentation of a physical system.
X10 is a new programming language for high-performance computing that uses
the asynchronous partitioned global address space (APGAS) model, which combines
explicit representation of locality with asynchronous task parallelism. This thesis
argues that the X10 language is well suited to expressing the algorithmic properties
of locality and irregular parallelism that are common to many methods for physical
simulation.
The work reported in this thesis was part of a co-design effort involving researchers
at IBM and ANU in which two significant computational chemistry codes
were developed in X10, with an aim to improve the expressiveness and performance
of the language. The first is a HartreeâFock electronic structure code, implemented
using the novel Resolution of the Coulomb Operator approach. The second evaluates
electrostatic interactions between point charges, using either the smooth particle
mesh Ewald method or the fast multipole method, with the latter used to simulate
ion interactions in a Fourier Transform Ion Cyclotron Resonance mass spectrometer.
We compare the performance of both X10 applications to state-of-the-art software
packages written in other languages.
This thesis presents improvements to the X10 language and runtime libraries for
managing and visualizing the data locality of parallel tasks, communication using
active messages, and efficient implementation of distributed arrays. We evaluate these improvements in the context of computational chemistry application examples.
This work demonstrates that X10 can achieve performance comparable to established
programming languages when running on a single core. More importantly,
X10 programs can achieve high parallel efficiency on a multithreaded architecture,
given a divide-and-conquer pattern parallel tasks and appropriate use of worker-local
data. For distributed memory architectures, X10 supports the use of active messages
to construct local, asynchronous communication patterns which outperform global,
synchronous patterns. Although point-to-point active messages may be implemented
efficiently, productive application development also requires collective communications;
more work is required to integrate both forms of communication in the X10
language. The exploitation of locality is the key insight in both linear-scaling methods and
the APGAS programming model; their combination represents an attractive opportunity
for future co-design efforts