611 research outputs found

    A scalable H-matrix approach for the solution of boundary integral equations on multi-GPU clusters

    Get PDF
    In this work, we consider the solution of boundary integral equations by means of a scalable hierarchical matrix approach on clusters equipped with graphics hardware, i.e. graphics processing units (GPUs). To this end, we extend our existing single-GPU hierarchical matrix library hmglib such that it is able to scale on many GPUs and such that it can be coupled to arbitrary application codes. Using a model GPU implementation of a boundary element method (BEM) solver, we are able to achieve more than 67 percent relative parallel speed-up going from 128 to 1024 GPUs for a model geometry test case with 1.5 million unknowns and a real-world geometry test case with almost 1.2 million unknowns. On 1024 GPUs of the cluster Titan, it takes less than 6 minutes to solve the 1.5 million unknowns problem, with 5.7 minutes for the setup phase and 20 seconds for the iterative solver. To the best of the authors' knowledge, we here discuss the first fully GPU-based distributed-memory parallel hierarchical matrix Open Source library using the traditional H-matrix format and adaptive cross approximation with an application to BEM problems

    Heterogeneous CPU/GPU co-execution of CFD simulations on the POWER9 architecture: Application to airplane aerodynamics

    Full text link
    High fidelity Computational Fluid Dynamics simulations are generally associated with large computing requirements, which are progressively acute with each new generation of supercomputers. However, significant research efforts are required to unlock the computing power of leading-edge systems, currently referred to as pre-Exascale systems, based on increasingly complex architectures. In this paper, we present the approach implemented in the computational mechanics code Alya. We describe in detail the parallelization strategy implemented to fully exploit the different levels of parallelism, together with a novel co-execution method for the efficient utilization of heterogeneous CPU/GPU architectures. The latter is based on a multi-code co-execution approach with a dynamic load balancing mechanism. The assessment of the performance of all the proposed strategies has been carried out for airplane simulations on the POWER9 architecture accelerated with NVIDIA Volta V100 GPUs

    Designing a high-performance boundary element library with OpenCL and Numba

    Get PDF
    The Bempp boundary element library is a well known library for the simulation of a range of electrostatic, acoustic and electromagnetic problems in homogeneous bounded and unbounded domains. It originally started as a traditional C++ library with a Python interface. Over the last two years we have completely redesigned Bempp as a native Python library, called Bempp-cl, that provides computational backends for OpenCL (using PyOpenCL) and Numba. The OpenCL backend implements kernels for GPUs and CPUs with SIMD optimization. In this paper, we discuss the design of Bempp-cl, provide performance comparisons on different compute devices, and discuss the advantages and disadvantages of OpenCL as compared to Numba

    Productive and efficient computational science through domain-specific abstractions

    Get PDF
    In an ideal world, scientific applications are computationally efficient, maintainable and composable and allow scientists to work very productively. We argue that these goals are achievable for a specific application field by choosing suitable domain-specific abstractions that encapsulate domain knowledge with a high degree of expressiveness. This thesis demonstrates the design and composition of domain-specific abstractions by abstracting the stages a scientist goes through in formulating a problem of numerically solving a partial differential equation. Domain knowledge is used to transform this problem into a different, lower level representation and decompose it into parts which can be solved using existing tools. A system for the portable solution of partial differential equations using the finite element method on unstructured meshes is formulated, in which contributions from different scientific communities are composed to solve sophisticated problems. The concrete implementations of these domain-specific abstractions are Firedrake and PyOP2. Firedrake allows scientists to describe variational forms and discretisations for linear and non-linear finite element problems symbolically, in a notation very close to their mathematical models. PyOP2 abstracts the performance-portable parallel execution of local computations over the mesh on a range of hardware architectures, targeting multi-core CPUs, GPUs and accelerators. Thereby, a separation of concerns is achieved, in which Firedrake encapsulates domain knowledge about the finite element method separately from its efficient parallel execution in PyOP2, which in turn is completely agnostic to the higher abstraction layer. As a consequence of the composability of those abstractions, optimised implementations for different hardware architectures can be automatically generated without any changes to a single high-level source. Performance matches or exceeds what is realistically attainable by hand-written code. Firedrake and PyOP2 are combined to form a tool chain that is demonstrated to be competitive with or faster than available alternatives on a wide range of different finite element problems.Open Acces

    Abstractions and performance optimisations for finite element methods

    Get PDF
    Finding numerical solutions to partial differential equations (PDEs) is an essential task in the discipline of scientific computing. In designing software tools for this task, one of the ultimate goals is to balance the needs for generality, ease to use and high performance. Domain-specific systems based on code generation techniques, such as Firedrake, attempt to address this problem with a design consisting of a hierarchy of abstractions, where the users can specify the mathematical problems via a high-level, descriptive interface, which is progressively lowered through the intermediate abstractions. Well-designed abstraction layers are essential to enable performing code transformations and optimisations robustly and efficiently, generating high-performance code without user intervention. This thesis discusses several topics on the design of the abstraction layers of Firedrake, and presents the benefit of its software architecture by providing examples of various optimising code transformations at the appropriate abstraction layers. In particular, we discuss the advantage of describing the local assembly stage of a finite element solver in an intermediate representation based on symbolic tensor algebra. We successfully lift specific loop optimisations, previously implemented by rewriting ASTs of the local assembly kernels, to this higher-level tensor language, improving the compilation speed and optimisation effectiveness. The global assembly phase involves the application of local assembly kernels on a collection of entities of an unstructured mesh. We redesign the abstraction to express the global assembly loop nests using tools and concepts based on the polyhedral model. This enables us to implement the cross-element vectorisation algorithm that delivers stable vectorisation performance on CPUs automatically. This abstraction also improves the portability of Firedrake, as we demonstrate targeting GPU devices transparently from the same software stack.Open Acces

    PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation

    Full text link
    High-performance computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL, two open-source toolkits that support this technique. In introducing PyCUDA and PyOpenCL, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. The concept of RTCG is simple and easily implemented using existing, robust infrastructure. Nonetheless it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.Comment: Submitted to Parallel Computing, Elsevie
    • …
    corecore