
    Scalable Fast Multipole Methods on Heterogeneous Architecture

    The N-body problem appears in many computational physics simulations. At each time step, the computation involves an all-pairs sum of quadratic complexity, followed by an update of particle positions. This cost makes it impractical to solve such dynamic N-body problems at large scale. To improve this situation, we use both algorithmic and hardware approaches. Our algorithmic approach is the Fast Multipole Method (FMM), a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop; it reduces the quadratic complexity to linear with guaranteed accuracy. Our hardware approach is to use heterogeneous clusters, comprising nodes that contain multi-core CPUs tightly coupled with accelerators such as graphics processing units (GPUs), as our underlying parallel processing hardware, on which efficient implementations require highly non-trivial redesigned algorithms. In this dissertation, we fundamentally reconsider FMM algorithms on heterogeneous architectures to achieve a significant improvement over previous implementations in the literature and to make the algorithm ready for use as a workhorse simulation tool for both time-dependent vortex flow problems and boundary element methods. Our major contributions include: 1. Novel FMM data structures using parallel construction algorithms for dynamic problems. 2. A fast heterogeneous FMM algorithm for both single and multiple computing nodes. 3. Efficient inter-node communication management using fast parallel data structures. 4. A scalable FMM algorithm using a novel Helmholtz decomposition for Vortex Methods (VM). The proposed algorithms handle non-uniform distributions with irregular partition shapes to achieve workload balance, and their MPI-CUDA implementations are highly tuned and demonstrate state-of-the-art performance.
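    To make the complexity claim concrete, here is a minimal, illustrative Python/NumPy sketch (not the dissertation's code) of the core FMM idea: for well-separated groups of particles, the O(N*M) all-pairs sum can be replaced by a single lumped "multipole" evaluation. Real FMM applies this hierarchically on an octree with higher-order expansions and controlled error bounds.

        # Illustrative sketch only: the idea behind FMM's complexity reduction,
        # shown with a single monopole (center-of-mass) approximation.
        import numpy as np

        def direct_potential(targets, sources, charges):
            """O(N*M) all-pairs sum: phi(t) = sum_j q_j / |t - s_j|."""
            d = np.linalg.norm(targets[:, None, :] - sources[None, :, :], axis=-1)
            return (charges / d).sum(axis=1)

        def monopole_potential(targets, sources, charges):
            """O(N + M) far-field approximation: lump all sources into one
            'multipole' at their charge-weighted center."""
            center = (charges[:, None] * sources).sum(axis=0) / charges.sum()
            d = np.linalg.norm(targets - center, axis=1)
            return charges.sum() / d

        rng = np.random.default_rng(0)
        sources = rng.random((1000, 3))          # cluster in the unit cube
        charges = rng.random(1000)
        targets = rng.random((5, 3)) + 10.0      # well separated from the sources

        exact = direct_potential(targets, sources, charges)
        approx = monopole_potential(targets, sources, charges)
        print(np.max(np.abs(exact - approx) / np.abs(exact)))  # small relative error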

    Compiler and Runtime Optimization Techniques for Implementing Scalable Parallel Applications

    A compiler can detect the data dependencies in an application and analyze specific sections of code for parallelization potential. However, these techniques are usually applied at compile time, so they rely on static analysis, which is insufficient for achieving maximum parallelism and the desired application scalability. Such compiler techniques should consider both the static information gathered at compile time and dynamic analysis captured at runtime to generate a safe parallel application. On the other hand, runtime information is often speculative, and relying on it alone does not guarantee maximal parallel performance, so information collected at compile time can significantly improve the performance of runtime techniques. This research achieves that goal by introducing new techniques for both the compiler and the runtime system that enable the two to cooperate, utilizing both static and dynamic analysis information to maximize application parallel performance. In the proposed framework, the compiler can incorporate dynamic runtime methods into its parallelization optimizations, and the runtime system can apply static information in its parallelization methods. The proposed techniques use high-level programming abstractions and machine learning to relieve the programmer of difficult and tedious decisions that can significantly affect program behavior and performance.
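    As a minimal illustration of why static analysis alone is insufficient (a hypothetical sketch, not the framework proposed in this work), consider a scatter-update loop whose safe parallelization depends entirely on runtime data: no compiler can prove the iterations independent without knowing the index contents.

        # Whether this scatter loop is data-parallel depends on runtime data
        # (duplicate indices), so the decision must be made at runtime.
        import numpy as np

        def scatter_add(out, idx, val):
            if len(np.unique(idx)) == len(idx):
                # No index appears twice: iterations are independent,
                # so the fast vectorized (data-parallel) update is safe.
                out[idx] += val
            else:
                # Potential write conflicts: use unbuffered accumulation,
                # which handles repeated indices correctly.
                np.add.at(out, idx, val)
            return out

        out = np.zeros(4)
        print(scatter_add(out.copy(), np.array([0, 1, 2]), np.array([1.0, 2.0, 3.0])))
        print(scatter_add(out.copy(), np.array([1, 1, 2]), np.array([1.0, 2.0, 3.0])))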

    Algorithmic and Code Optimizations of Molecular Dynamics Simulations for Process Engineering

    The focus of this work lies on implementation improvements and, in particular, node-level performance optimization of the simulation software ls1-mardyn. Through data structure improvements, SIMD vectorization and, especially, OpenMP parallelization, the world's first simulation of 2 x 10^13 molecules at over 1 PFLOP/s was enabled. To allow for long-range interactions, the Fast Multipole Method was introduced to ls1-mardyn. The algorithm was optimized for sequential, shared-memory, and distributed-memory execution on up to 32,768 MPI processes.
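    As a loose analogue of the vectorization work described above (ls1-mardyn itself is C++ code using SIMD intrinsics and OpenMP), the following sketch shows the payoff of restructuring a pairwise inner loop into array-wide operations, here for the Lennard-Jones pair energy u(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6):

        # Illustrative analogue only, not ls1-mardyn code: one "SIMD-like"
        # array sweep over all pairs instead of a scalar double loop.
        import numpy as np

        def lj_energy_vectorized(pos, eps=1.0, sigma=1.0):
            """Total Lennard-Jones energy over all distinct pairs."""
            d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
            iu = np.triu_indices(len(pos), k=1)   # visit each pair once
            sr6 = (sigma / d[iu]) ** 6
            return float(np.sum(4.0 * eps * (sr6 * sr6 - sr6)))

        pos = np.random.default_rng(1).random((256, 3)) * 10.0
        print(lj_energy_vectorized(pos))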

    Parallel Fast Multipole Method for Molecular Dynamics

    We report on a parallel version of the Fast Multipole Method (FMM) implemented in the classical molecular dynamics code NAMD (Not Another Molecular Dynamics program). This novel implementation of FMM aims to minimize interprocessor communication by modifying the FMM grid to match the hybrid force and spatial decomposition scheme already present in NAMD. The new implementation has the benefit of replacing all-to-all communication broadcasts with direct communication between nearest neighbors, resulting in a significant reduction in communication compared to earlier attempts to integrate FMM into common molecular dynamics programs. The early performance of FMM is similar to that of the existing electrostatics methods in NAMD. In addition, tests of the stability and accuracy of the FMM algorithm in molecular dynamics, as applied to several common solvated protein structures, are discussed.
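    The communication pattern described, direct exchanges with nearest neighbors instead of global broadcasts, can be sketched with mpi4py (illustrative only; not NAMD's implementation). Here each rank exchanges boundary data only with its two neighbors on a 1-D process ring:

        # Hedged sketch of nearest-neighbor exchange. Run with, e.g.:
        #   mpiexec -n 4 python halo.py
        from mpi4py import MPI
        import numpy as np

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        local = np.full(8, float(rank))          # this rank's boundary data
        left, right = (rank - 1) % size, (rank + 1) % size

        # Each rank talks only to its two neighbors, not to everyone.
        from_left = comm.sendrecv(local, dest=right, source=left)
        from_right = comm.sendrecv(local, dest=left, source=right)
        print(rank, from_left[0], from_right[0])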

    Software for Exascale Computing - SPPEXA 2016-2019

    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG), presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer's series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA's first funding phase, and provides an overview of SPPEXA's contributions towards exascale computing in today's supercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest.

    Designing a high-performance boundary element library with OpenCL and Numba

    The Bempp boundary element library is a well-known library for the simulation of a range of electrostatic, acoustic and electromagnetic problems in homogeneous bounded and unbounded domains. It originally started as a traditional C++ library with a Python interface. Over the last two years we have completely redesigned Bempp as a native Python library, called Bempp-cl, that provides computational backends for OpenCL (using PyOpenCL) and Numba. The OpenCL backend implements kernels for GPUs and CPUs with SIMD optimization. In this paper, we discuss the design of Bempp-cl, provide performance comparisons on different compute devices, and discuss the advantages and disadvantages of OpenCL as compared to Numba.
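    The following is a minimal sketch of the kind of JIT-compiled kernel a Numba backend can dispatch to (illustrative; not Bempp-cl's actual assembly code): a dense Laplace single-layer-style kernel matrix 1/(4*pi*|x - y|) over two point sets, with a thread-parallel outer loop.

        import numpy as np
        from numba import njit, prange

        @njit(parallel=True, fastmath=True)
        def assemble_kernel(targets, sources):
            """Dense kernel matrix K[i, j] = 1 / (4*pi*|t_i - s_j|)."""
            n, m = targets.shape[0], sources.shape[0]
            mat = np.empty((n, m))
            for i in prange(n):                  # threads over target points
                for j in range(m):
                    dx = targets[i, 0] - sources[j, 0]
                    dy = targets[i, 1] - sources[j, 1]
                    dz = targets[i, 2] - sources[j, 2]
                    mat[i, j] = 0.25 / (np.pi * np.sqrt(dx*dx + dy*dy + dz*dz))
            return mat

        pts_a = np.random.default_rng(2).random((200, 3))
        pts_b = np.random.default_rng(3).random((200, 3)) + 2.0  # disjoint sets
        print(assemble_kernel(pts_a, pts_b).shape)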

    Task-based FMM for multicore architectures

    Preliminary version of a paper to appear in SIAM SISC. Fast Multipole Methods (FMM) are a fundamental operation for the simulation of many physical problems. The high-performance design of such methods usually requires carefully tuning the algorithm for both the targeted physics and the hardware. In this paper, we propose a new approach that achieves high performance across architectures. Our method consists of expressing the Fast Multipole Method (FMM) algorithm as a task flow and employing a state-of-the-art runtime system, StarPU, to process the tasks on the different computing units. We carefully design the task flow, the mathematical operators, their implementations, and the scheduling schemes. Potentials and forces on 200 million particles are computed in 48.7 seconds on a homogeneous 160-core SGI Altix UV 100, and good scalability is shown.
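    The task-flow idea can be sketched in Python as a hedged analogue only (StarPU itself is a C runtime that schedules tasks across heterogeneous units): the algorithm is expressed as tasks whose dependency edges, not a fixed loop order, constrain execution, so independent tasks may run concurrently.

        from concurrent.futures import ThreadPoolExecutor

        def p2m(box):            # stand-in for building a box's multipole expansion
            return sum(box)

        def m2m(child_results):  # stand-in for merging children into a parent
            return sum(child_results)

        boxes = [[1, 2], [3, 4], [5, 6], [7, 8]]
        with ThreadPoolExecutor() as pool:
            # The four P2M tasks are independent and may run in parallel.
            multipoles = [pool.submit(p2m, b) for b in boxes]
            # The M2M task depends on its children: waiting on their
            # futures encodes the edges of the task graph.
            root = pool.submit(m2m, [f.result() for f in multipoles])
            print(root.result())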

    Direct N-body Kernels for Multicore Platforms

    We present an inter-architectural comparison of single- and double-precision direct n-body implementations on modern multicore platforms, including those based on Intel Nehalem and AMD Barcelona systems, the Sony-Toshiba-IBM PowerXCell/8i processor, and NVIDIA Tesla C870 and C1060 GPU systems. We compare our implementations across platforms on a variety of proxy measures, including performance, coding complexity, and energy efficiency.
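    For reference, a plain NumPy version of the direct n-body acceleration kernel being compared (an illustrative analogue; the paper's kernels are hand-tuned for each platform), using the usual softened form a_i = sum_j m_j (x_j - x_i) / (|x_j - x_i|^2 + eps^2)^(3/2):

        import numpy as np

        def accelerations(pos, mass, soft=1e-3):
            """All-pairs O(n^2) gravitational accelerations with softening."""
            diff = pos[None, :, :] - pos[:, None, :]         # (n, n, 3)
            r2 = np.sum(diff * diff, axis=-1) + soft * soft  # softened |r|^2
            inv_r3 = r2 ** -1.5
            np.fill_diagonal(inv_r3, 0.0)                    # no self-interaction
            return np.einsum('ij,j,ijk->ik', inv_r3, mass, diff)

        pos = np.random.default_rng(4).random((1024, 3)).astype(np.float32)
        mass = np.ones(1024, dtype=np.float32)
        print(accelerations(pos, mass)[0])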