A scalable H-matrix approach for the solution of boundary integral equations on multi-GPU clusters
In this work, we consider the solution of boundary integral equations by
means of a scalable hierarchical matrix approach on clusters equipped with
graphics hardware, i.e. graphics processing units (GPUs). To this end, we
extend our existing single-GPU hierarchical matrix library hmglib such that it
is able to scale on many GPUs and such that it can be coupled to arbitrary
application codes. Using a model GPU implementation of a boundary element
method (BEM) solver, we are able to achieve more than 67 percent relative
parallel speed-up going from 128 to 1024 GPUs for a model geometry test case
with 1.5 million unknowns and a real-world geometry test case with almost 1.2
million unknowns. On 1024 GPUs of the cluster Titan, it takes roughly six
minutes to solve the 1.5-million-unknown problem: 5.7 minutes for the setup
phase and 20 seconds for the iterative solver. To the best of the authors'
knowledge, this is the first fully GPU-based, distributed-memory parallel
hierarchical matrix open-source library using the traditional H-matrix format
and adaptive cross approximation, with an application to BEM problems.
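The adaptive cross approximation mentioned above builds a low-rank factorization of a matrix block from a few of its rows and columns. A minimal sketch follows, fully pivoted for clarity; production H-matrix codes such as hmglib use partially pivoted variants that never form the dense block or the explicit residual:

```python
import numpy as np

def aca(A, tol=1e-8, max_rank=50):
    # Fully pivoted ACA: repeatedly peel a rank-1 "cross" (one row and
    # one column through the largest residual entry) off the residual,
    # until the largest remaining entry drops below tol.
    R = np.array(A, dtype=float)
    U, V = [], []
    for _ in range(max_rank):
        i, j = np.unravel_index(np.argmax(np.abs(R)), R.shape)
        pivot = R[i, j]
        if abs(pivot) < tol:
            break
        u = R[:, j] / pivot        # column of the cross, scaled by the pivot
        v = R[i, :].copy()         # row of the cross
        U.append(u)
        V.append(v)
        R = R - np.outer(u, v)     # residual after removing this cross
    return np.array(U).T, np.array(V)
```

For a block of numerical rank k, the loop terminates after about k crosses, so storage and work scale with k rather than with the block size.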
Preparing sparse solvers for exascale computing.
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing Project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices, where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current successes and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.
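The workhorse kernel inside iterative sparse solvers of this kind is the sparse matrix-vector product. A minimal CSR (compressed sparse row) version in plain Python, with illustrative array names matching the standard CSR layout:

```python
def csr_matvec(data, indices, indptr, x):
    # y = A @ x for A stored in CSR format:
    #   data    - nonzero values, row by row
    #   indices - column index of each nonzero
    #   indptr  - indptr[r]:indptr[r+1] delimits row r's nonzeros
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y
```

Each row's partial sum is independent, which is exactly the concurrency the abstract says must be exposed; the irregular `indices` accesses are the latency that must be hidden.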
Heterogeneous CPU/GPU co-execution of CFD simulations on the POWER9 architecture: Application to airplane aerodynamics
High fidelity Computational Fluid Dynamics simulations are generally
associated with large computing requirements, which grow more acute
with each new generation of supercomputers. However, significant research
efforts are required to unlock the computing power of leading-edge systems,
currently referred to as pre-Exascale systems, based on increasingly complex
architectures. In this paper, we present the approach implemented in the
computational mechanics code Alya. We describe in detail the parallelization
strategy implemented to fully exploit the different levels of parallelism,
together with a novel co-execution method for the efficient utilization of
heterogeneous CPU/GPU architectures. The latter is based on a multi-code
co-execution approach with a dynamic load balancing mechanism. The assessment
of the performance of all the proposed strategies has been carried out for
airplane simulations on the POWER9 architecture accelerated with NVIDIA Volta
V100 GPUs.
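A co-execution scheme with dynamic load balancing typically measures each device's time per iteration and shifts work toward the faster one. A hedged sketch of such a feedback rule (not Alya's actual mechanism; the function name and relaxation factor are our assumptions):

```python
def rebalance(split, t_cpu, t_gpu, relax=0.5):
    # split: current fraction of the mesh elements assigned to the CPU.
    # t_cpu, t_gpu: measured times of the last iteration on each device.
    # Both devices finish together when work is divided in proportion to
    # throughput; move the split part-way toward that target to damp noise.
    r_cpu = split / t_cpu          # CPU throughput (work per second)
    r_gpu = (1.0 - split) / t_gpu  # GPU throughput
    target = r_cpu / (r_cpu + r_gpu)
    return split + relax * (target - split)
```

Calling this once per iteration keeps the partition tracking runtime conditions, such as other jobs sharing the node or mesh regions of uneven cost.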
Designing a high-performance boundary element library with OpenCL and Numba
The Bempp boundary element library is a well-known library for the simulation of a range of electrostatic, acoustic and electromagnetic problems in homogeneous bounded and unbounded domains. It originally started as a traditional C++ library with a Python interface. Over the last two years we have completely redesigned Bempp as a native Python library, called Bempp-cl, that provides computational backends for OpenCL (using PyOpenCL) and Numba. The OpenCL backend implements kernels for GPUs and CPUs with SIMD optimization. In this paper, we discuss the design of Bempp-cl, provide performance comparisons on different compute devices, and discuss the advantages and disadvantages of OpenCL as compared to Numba.
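The pairwise kernel evaluations at the heart of such a boundary element library can be illustrated with the Laplace single-layer potential. A direct O(n m) evaluation in plain Python (function and argument names are our own illustration, not Bempp-cl's API; the backends in the paper run this kind of loop as OpenCL or Numba kernels):

```python
import math

def single_layer(sources, targets, density, weights):
    # u(x) = sum_j w_j * q_j / (4 * pi * |x - y_j|): the Laplace
    # single-layer potential of a discretized density, evaluated for
    # every target point against every source quadrature point.
    out = []
    for x in targets:
        s = 0.0
        for y, q, w in zip(sources, density, weights):
            r = math.dist(x, y)
            s += w * q / (4.0 * math.pi * r)
        out.append(s)
    return out
```

Every (target, source) pair is independent, which is why these kernels map so well onto the SIMD lanes and GPU threads the OpenCL backend targets.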
Productive and efficient computational science through domain-specific abstractions
In an ideal world, scientific applications are computationally efficient,
maintainable, and composable, and allow scientists to work productively. We
argue that these goals are achievable for a specific application field by
choosing suitable domain-specific abstractions that encapsulate domain
knowledge with a high degree of expressiveness.
This thesis demonstrates the design and composition of
domain-specific abstractions by abstracting the stages a scientist goes
through in formulating a problem of numerically solving a partial differential
equation. Domain knowledge is used to transform this problem into a different,
lower level representation and decompose it into parts which can be solved
using existing tools. A system for the portable solution of partial
differential equations using the finite element method on unstructured meshes
is formulated, in which contributions from different scientific communities
are composed to solve sophisticated problems.
The concrete implementations of these domain-specific abstractions are
Firedrake and PyOP2. Firedrake allows scientists to describe variational
forms and discretisations for linear and non-linear finite element problems
symbolically, in a notation very close to their mathematical models. PyOP2
abstracts the performance-portable parallel execution of local computations
over the mesh on a range of hardware architectures, targeting multi-core CPUs,
GPUs and accelerators. Thereby, a separation of concerns is achieved, in which
Firedrake encapsulates domain knowledge about the finite element method
separately from its efficient parallel execution in PyOP2, which in turn is
completely agnostic to the higher abstraction layer.
As a consequence of the composability of those abstractions, optimised
implementations for different hardware architectures can be
automatically generated without any changes to a single high-level
source. Performance matches or exceeds what is realistically attainable by
hand-written code. Firedrake and PyOP2 are combined to form a tool chain that
is demonstrated to be competitive with or faster than available alternatives
on a wide range of different finite element problems.
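The symbolic notation referred to above mirrors the variational formulation directly. For the Poisson problem, for example, the statement a Firedrake user transcribes is:

\[
\text{find } u \in V \text{ such that } \int_\Omega \nabla u \cdot \nabla v \,\mathrm{d}x = \int_\Omega f\, v \,\mathrm{d}x \quad \text{for all } v \in V,
\]

so the high-level source stays close to the mathematics while PyOP2 handles its parallel execution.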
Abstractions and performance optimisations for finite element methods
Finding numerical solutions to partial differential equations (PDEs) is an essential task in the discipline of scientific computing.
In designing software tools for this task, one of the ultimate goals is to balance the needs for generality, ease of use and high performance.
Domain-specific systems based on code generation techniques,
such as Firedrake,
attempt to address this problem with a design consisting of
a hierarchy of abstractions,
where the users can specify the mathematical problems via a high-level,
descriptive interface,
which is progressively lowered through the intermediate abstractions.
Well-designed abstraction layers are essential to enable performing code transformations and optimisations robustly and efficiently,
generating high-performance code without user intervention.
This thesis discusses several topics on the design of the abstraction layers of Firedrake,
and presents the benefit of its software architecture by providing examples of various optimising code transformations at the appropriate abstraction layers.
In particular, we discuss the advantage of describing the local assembly stage of a finite element solver in an intermediate representation based on symbolic tensor algebra.
We successfully lift specific loop optimisations,
previously implemented by rewriting ASTs of the local assembly kernels, to this higher-level tensor language,
improving the compilation speed and optimisation effectiveness.
The global assembly phase involves the application of local assembly kernels on a collection of entities of an unstructured mesh.
We redesign the abstraction to express the global assembly loop nests
using tools and concepts based on the polyhedral model.
This enables us to implement the cross-element vectorisation algorithm that delivers stable vectorisation performance on CPUs automatically.
This abstraction also improves the portability of Firedrake,
as we demonstrate targeting GPU devices transparently from the same software stack.
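The cross-element vectorisation idea can be sketched with NumPy: instead of invoking the local assembly kernel one element at a time, the element index becomes the leading array axis, so the same arithmetic runs across a batch of elements at once and maps naturally onto SIMD lanes. Shapes and names here are illustrative, not Firedrake's generated code:

```python
import numpy as np

def batched_local_mass(detJ, phi, qw):
    # Local mass matrices for a whole batch of elements:
    #   phi:  (nqp, ndof)   basis functions tabulated at quadrature points
    #   qw:   (nqp,)        reference-element quadrature weights
    #   detJ: (nelem, nqp)  Jacobian determinants per element and point
    # Returns (nelem, ndof, ndof); the element loop is the leading axis.
    w = qw[None, :] * detJ                       # (nelem, nqp)
    return np.einsum("eq,qi,qj->eij", w, phi, phi)
```

Because `phi` and `qw` are shared across the batch, only the geometry (`detJ`) varies per element, which is what makes the vectorised form profitable.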
PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation
High-performance computing has recently seen a surge of interest in
heterogeneous systems, with an emphasis on modern Graphics Processing Units
(GPUs). These devices offer tremendous potential for performance and efficiency
in important large-scale applications of computational science. However,
exploiting this potential can be challenging, as one must adapt to the
specialized and rapidly evolving computing environment currently exhibited by
GPUs. One way of addressing this challenge is to embrace better techniques and
develop tools tailored to their needs. This article presents one simple
technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL,
two open-source toolkits that support this technique.
In introducing PyCUDA and PyOpenCL, this article proposes the combination of
a dynamic, high-level scripting language with the massive performance of a GPU
as a compelling two-tiered computing platform, potentially offering significant
performance and productivity advantages over conventional single-tier, static
systems. The concept of RTCG is simple and easily implemented using existing,
robust infrastructure. Nonetheless it is powerful enough to support (and
encourage) the creation of custom application-specific tools by its users. The
premise of the paper is illustrated by a wide range of examples where the
technique has been applied with considerable success.
Comment: Submitted to Parallel Computing, Elsevier.
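The run-time code generation idea is independent of GPUs and can be shown in a few lines of plain Python: build source text with problem parameters baked in as compile-time constants, compile it, and call the result. PyCUDA and PyOpenCL apply the same pattern to CUDA C and OpenCL C source; this sketch and its names are our own, not the toolkits' API:

```python
def specialize_saxpy(n, alpha):
    # Generate a saxpy specialized for a fixed length n and scalar alpha,
    # then compile and return it. With the values embedded as literals,
    # the compiler (here CPython, in PyCUDA the CUDA compiler) can
    # constant-fold and unroll in ways a generic version cannot.
    src = (
        f"def saxpy(x, y):\n"
        f"    return [{alpha} * x[i] + y[i] for i in range({n})]\n"
    )
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["saxpy"]
```

A scripting language makes this pattern cheap: the generated source is an ordinary string, so application-specific tools can assemble kernels from templates at run time.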