Automated cache optimisations of stencil computations for partial differential equations
This thesis focuses on numerical methods that solve partial differential equations.
Our focal point is the finite difference method, which solves partial
differential equations by approximating derivatives with explicit finite differences.
These partial differential equation solvers consist of stencil computations on structured grids.
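As a concrete illustration of approximating derivatives with explicit finite differences, here is a minimal sketch, assuming the 1D heat equation u_t = alpha * u_xx with fixed boundary values. The function name `heat_step` and its parameters are hypothetical and are not taken from the thesis.

```python
def heat_step(u, alpha, dt, dx):
    """Advance u by one explicit time step of u_t = alpha * u_xx.

    Assumption: Dirichlet boundaries, so u[0] and u[-1] keep their values.
    """
    c = alpha * dt / (dx * dx)  # the explicit scheme is stable for c <= 0.5
    new = list(u)
    for i in range(1, len(u) - 1):
        # Three-point stencil: second-order central difference for u_xx.
        new[i] = u[i] + c * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new
```

Each updated point reads only its immediate neighbours, which is the access pattern that the cache optimisations discussed here try to exploit.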
Stencils for computing real-world practical applications are patterns often
characterised by many memory accesses and non-trivial arithmetic expressions
that lead to high computational costs compared to simple stencils used in much prior
proof-of-concept work.
In addition, the loop nests that express stencils on structured grids are often complicated.
This work is motivated by a specific domain of stencil computations in which one challenge is operations that are not aligned to the structured grid ("off-the-grid" operations).
These operations update neighbouring grid points through scatter and gather operations via non-affine memory accesses, such as A[B[i]].
In addition, these practical stencils often involve many computation fields (requiring multiple grid copies to be stored), complex data dependencies and imperfect loop nests.
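The off-the-grid scatter and gather pattern can be sketched as follows, assuming a 1D grid with spacing `dx`, linear interpolation weights, and coordinates that lie strictly inside the grid. The helpers `locate`, `inject` and `sample` are hypothetical names chosen for illustration only.

```python
def locate(x, dx):
    """Return the left neighbour index and linear weight of coordinate x."""
    i = int(x // dx)
    w = x / dx - i
    return i, w


def inject(u, coords, amps, dx):
    """Scatter: add each off-grid source amplitude to its two neighbours."""
    for x, a in zip(coords, amps):
        i, w = locate(x, dx)       # i depends on data, not on the loop index
        u[i] += (1.0 - w) * a      # indirect write, the A[B[i]] pattern
        u[i + 1] += w * a


def sample(u, coords, dx):
    """Gather: interpolate the field at each off-grid receiver position."""
    out = []
    for x in coords:
        i, w = locate(x, dx)
        out.append((1.0 - w) * u[i] + w * u[i + 1])  # indirect read
    return out
```

Because the touched indices are only known at run time, a compiler cannot describe these accesses with affine expressions of the loop variables, which is what makes them hard to tile.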
In this work, we aim to increase the performance of stencil kernel execution.
We study automated cache-memory-dependent optimisations for stencil computations.
This work consists of two core parts with their respective contributions.
The first part of our work tries to reduce the data movement in stencil computations of practical interest.
Data movement is a dominant factor affecting the performance of high-performance computing applications.
It has long been a target of optimisations due to its impact on execution time and energy consumption.
This thesis tries to relieve this cost by applying temporal blocking optimisations, also known as time-tiling, to stencil computations.
Temporal blocking is a well-known technique to enhance data reuse in stencil computations.
However, it is rarely used in practical applications; it appears mostly in theoretical examples that demonstrate its efficacy.
Applying temporal blocking to scientific simulations is more complex.
More specifically, in this work, we focus on the application context of seismic and medical imaging.
In this area, we often encounter scatter and gather operations due to signal sources and receivers at arbitrary locations in the computational domain.
These operations make the application of temporal blocking challenging.
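To illustrate the data-reuse idea behind temporal blocking, the following is a minimal sketch, assuming a 1D three-point diffusion stencil with fixed boundary values and no sources or receivers. It uses overlapped tiling with redundant halo computation, one of the simplest time-tiling variants, and is not the scheme developed in this thesis. The names `step`, `naive` and `time_tiled` and the parameters `tile` and `tb` are hypothetical.

```python
def step(u, c):
    """One explicit time step; the first and last entries are left unchanged."""
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = u[i] + c * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def naive(u, c, nt):
    """Reference: sweep the whole grid once per time step."""
    for _ in range(nt):
        u = step(u, c)
    return u


def time_tiled(u, c, nt, tile, tb):
    """Advance nt steps, processing each spatial tile for up to tb steps at once."""
    n = len(u)
    done = 0
    while done < nt:
        k = min(tb, nt - done)            # time steps in this block
        out = list(u)
        for s in range(0, n, tile):
            e = min(s + tile, n)
            # Load the tile plus a halo of k points per side (stencil radius 1).
            lo, hi = max(0, s - k), min(n, e + k)
            local = u[lo:hi]
            for _ in range(k):
                local = step(local, c)    # stale halo edges creep inward by 1
            out[s:e] = local[s - lo:e - lo]  # after k steps [s, e) is still exact
        u = out
        done += k
    return u
```

For example, `time_tiled(u0, 0.2, 7, 4, 3)` gives the same values as `naive(u0, 0.2, 7)`, while each tile is reused for three time steps before the next tile is loaded. The price is redundant work in the halos, and a scatter or gather operation between time steps would break the assumption that a tile can be advanced independently.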
We present an approach to overcome this challenge and successfully apply temporal blocking.
In the second part of our work, we extend the first part as an automated approach targeting a wide range of simulations modelled with partial differential equations.
Since temporal blocking is error-prone, tedious to apply by hand and highly complex to assimilate theoretically and practically, we are motivated to automate its application and automatically generate code that benefits from it.
We discuss algorithmic approaches and present a generalised compiler pipeline to automate the application of temporal blocking.
These passes are written in the Devito compiler. They are used to accelerate the computation of stencil kernels in areas such as seismic and medical imaging, computational fluid dynamics and machine learning.
Devito (www.devitoproject.org) is a Python package to implement optimised stencil computation (e.g., finite differences, image processing, machine learning) from high-level symbolic problem definitions.
Devito builds on SymPy (www.sympy.org) and employs automated code generation and just-in-time compilation to execute optimised computational kernels on several computer platforms, including CPUs, GPUs, and clusters thereof.
We show how we automate temporal blocking code generation without user intervention and often achieve better time-to-solution.
We enable domain-specific optimisation through compiler passes and offer temporal blocking gains from a high-level symbolic abstraction.
These automated optimisations benefit various computational kernels for solving real-world application problems.
Swept time-space domain decomposition on GPUs and heterogeneous computing systems
Modern scientific and engineering problems often require simulations with a level of resolution difficult to achieve in reasonable amounts of time, even in effectively parallelized programs. Therefore, applications that exploit high performance computing (HPC) systems have become invaluable in academia and industry over the past two decades. Addressing the questions that arise from continual scientific advancement requires hardware and software solutions that supply the necessary throughput to meet demand across scientific disciplines.
The most important development on the hardware side has been the General Purpose Graphics Processing Unit (GPGPU), a class of massively parallel device that now composes a substantial portion of the computational power of the top 500 supercomputers. As these systems grow, barriers to increased performance arise from small costs accumulated over innumerable iterations such as latency, the fixed cost of memory accesses, which becomes significantly larger when access requires communication between two distant CPU processes. This thesis implements and analyzes swept time-space domain decomposition, a communication avoiding scheme for time-stepping stencil codes, for GPGPU and heterogeneous (CPU/GPU) architectures.
The GPGPU program significantly improves the execution time of finite-difference solvers for relatively simple one-dimensional time-stepping partial differential equations (PDEs). The swept decomposition code showed speedups of 2-9x compared with simple GPU domain decompositions and 7-300x compared with parallel CPU versions over a range of problem sizes: 10^3 to 10^6 spatial points. However, for a more sophisticated one-dimensional system of equations discretized with a second-order finite-volume scheme, the swept rule performs 1.2-1.9x worse than a standard implementation for all problem sizes. The program targeting heterogeneous systems with distributed memory patterns performs significantly better on both simple problems, with speedups of 4-18x, and more complex equation systems, with speedups of 1.5-3x, over the range of problem sizes: 10^5 to 10^7 spatial points. This demonstrates the benefit of GPU architecture and the contingent effectiveness of swept time-space decomposition for accelerating explicit PDE solvers on current computational architectures.
Loop Tiling in Large-Scale Stencil Codes at Run-time with OPS
The key common bottleneck in most stencil codes is data movement, and prior
research has shown that improving data locality through optimisations that
schedule across loops do particularly well. However, in many large PDE
applications it is not possible to apply such optimisations through compilers
because there are many options, execution paths and data per grid point, many
dependent on run-time parameters, and the code is distributed across different
compilation units. In this paper, we adapt the data locality improving
optimisation called iteration space slicing for use in large OPS applications
both in shared-memory and distributed-memory systems, relying on run-time
analysis and delayed execution. We evaluate our approach on a number of
applications, observing speedups of 2x on the Cloverleaf 2D/3D proxy
application, which contain 83/141 loops respectively, on the linear
solver TeaLeaf, and on the compressible Navier-Stokes solver
OpenSBLI. We demonstrate strong and weak scalability up to 4608 cores of
CINECA's Marconi supercomputer. We also evaluate our algorithms on Intel's
Knights Landing, demonstrating maintained throughput as the problem size grows
beyond 16GB, and we do scaling studies up to 8704 cores. The approach is
generally applicable to any stencil DSL that provides per-loop data access
information.
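The combination of delayed execution and per-loop access information can be sketched as follows. This is a simplified 1D skewed-tiling illustration, not the OPS API and not the paper's iteration space slicing algorithm. The class `LoopChain`, its method names and the `radius` argument are hypothetical. It assumes that each array is written by at most one queued loop and is read only by later loops.

```python
class LoopChain:
    """Queue stencil loops lazily, then run them tile by tile."""

    def __init__(self, n, halo, tile):
        # Loops iterate over [halo, n - halo); halo must be >= every radius.
        self.n, self.halo, self.tile = n, halo, tile
        self.queue = []

    def par_loop(self, kernel, out, inp, radius):
        # Delayed execution: only record the loop and its read radius.
        self.queue.append((kernel, out, inp, radius))

    def flush(self):
        lo, hi = self.halo, self.n - self.halo
        # shift of a loop = sum of the read radii of all loops queued after it
        shifts, acc = [], 0
        for _, _, _, radius in reversed(self.queue):
            shifts.append(acc)
            acc += radius
        shifts.reverse()
        for s in range(lo, hi, self.tile):
            e = min(s + self.tile, hi)
            for (kernel, out, inp, _), shift in zip(self.queue, shifts):
                # Earlier loops run further ahead so later loops find their inputs.
                a = lo if s == lo else min(s + shift, hi)
                b = hi if e == hi else min(e + shift, hi)
                for i in range(a, b):
                    out[i] = kernel(inp, i)
        self.queue.clear()
```

A caller would queue loops with `par_loop` and trigger execution with `flush`, for example when the data is needed by the host. All queued loops then touch one tile before moving to the next, so intermediate results can stay in cache.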
Cache based optimization of stencil computations : an algorithmic approach
We are witnessing a fundamental paradigm shift in computer design. Memory is becoming ever more hierarchical. Clock frequency is no longer crucial for performance. The on-chip core count is doubling rapidly. The quest for performance is growing. These facts have led to complex computer systems which place high demands on scientific computing problems to achieve high performance.
Stencil computation is a frequent and important kernel that is affected by this complexity. Its importance stems from the wide variety of scientific and engineering applications that use it. The stencil kernel is a nearest-neighbor computation with low arithmetic intensity, thus it usually achieves only a tiny fraction of the peak performance when executed on modern computer systems. Fast on-chip memory modules were introduced as the hardware approach to alleviate the problem.
There are three main approaches to address the problem: cache aware, cache oblivious, and automatic loop transformation approaches. In this thesis, comprehensive cache aware and cache oblivious algorithms to optimize stencil computations on structured rectangular 2D and 3D grids are presented. Our algorithms observe the challenges for high performance in the previous approaches, devise solutions for them, and carefully balance the solution building blocks against each other.
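To make the cache oblivious idea concrete, the following minimal sketch follows the well-known recursive trapezoid decomposition of Frigo and Strumpen for a 1D three-point stencil with fixed boundary values. It is given as background only and is not the algorithm of this thesis. The function names and the diffusion kernel are assumptions.

```python
def walk(u, c, t0, t1, x0, dx0, x1, dx1):
    """Process the space-time trapezoid t0 <= t < t1, whose spatial range
    [x0, x1) moves by dx0 and dx1 points per time step at its two edges."""
    dt = t1 - t0
    if dt == 1:
        src, dst = u[t0 % 2], u[(t0 + 1) % 2]
        for x in range(x0, x1):
            dst[x] = src[x] + c * (src[x - 1] - 2.0 * src[x] + src[x + 1])
    elif dt > 1:
        if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt:
            # Wide trapezoid: space cut along a line of slope -1 through the
            # centre; the left part never depends on the right part.
            xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) // 4
            walk(u, c, t0, t1, x0, dx0, xm, -1)
            walk(u, c, t0, t1, xm, -1, x1, dx1)
        else:
            # Tall trapezoid: time cut, lower half first.
            s = dt // 2
            walk(u, c, t0, t0 + s, x0, dx0, x1, dx1)
            walk(u, c, t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1)


def cache_oblivious(u0, c, nt):
    """Advance u0 by nt steps using two buffers indexed by time parity."""
    u = [list(u0), list(u0)]   # both buffers carry the fixed boundary values
    walk(u, c, 0, nt, 1, 0, len(u0) - 1, 0)
    return u[nt % 2]
```

No tile size appears anywhere: the recursion keeps halving the space-time region until a piece fits in whichever cache level is present. A cache aware scheme would instead pick explicit block sizes from known cache parameters.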
The many-core systems put the scalability of memory access at stake, which has led to hierarchical main memory systems. This adds another locality challenge for performance. We tailor our frameworks to meet the new performance challenge on these architectures. Experiments are performed to evaluate the performance of our frameworks on synthetic as well as real world problems.
A parallel, adaptive discontinuous Galerkin method for hyperbolic problems on unstructured meshes
This thesis is concerned with the parallel, adaptive solution of hyperbolic conservation laws on unstructured meshes.
First, we present novel algorithms for cell-based adaptive mesh refinement (AMR) on unstructured meshes of triangles on graphics processing units (GPUs). Our implementation makes use of improved memory management techniques and a coloring algorithm for avoiding race conditions. The algorithm is entirely implemented on the GPU, with negligible communication between device and host. We show that the overhead of the AMR subroutines is small compared to the high-order solver and that the proportion of total run time spent adaptively refining the mesh decreases with the order of approximation. We apply our code to a number of benchmarks as well as more recently proposed problems for the Euler equations that require extremely high resolution. We present the solution to a shock reflection problem that addresses the von Neumann triple point paradox. We also study the problem of shock disappearance and self-similar diffraction of weak shocks around thin films.
Next, we analyze the stability and accuracy of second-order limiters for the discontinuous Galerkin method on unstructured triangular grids. We derive conditions for a limiter such that the numerical solution preserves second order accuracy and satisfies the local maximum principle. This leads to a new measure of cell size that is approximately twice as large as the radius of the inscribed circle. It is shown with numerical experiments that the resulting bound on the time step is tight. We also consider various combinations of limiting points and limiting neighborhoods and present numerical experiments comparing the accuracy, stability, and efficiency of the corresponding limiters.
We show that the theory for strong stability preserving (SSP) time stepping methods employed with the method of lines-type discretizations of hyperbolic conservation laws may result in overly stringent time step restrictions. We analyze a fully discrete finite volume method with slope reconstruction and a second order SSP Runge-Kutta time integrator to show that the maximum stable time step can be increased over the SSP limit. Numerical examples show that this result extends to two-dimensional problems on triangular meshes.
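For reference, a second-order SSP Runge-Kutta step can be sketched as below. This is a minimal sketch, assuming linear advection u_t + a u_x = 0 with a > 0, periodic boundaries and a first-order upwind discretization. It omits the slope reconstruction analyzed in the thesis, and the function names are hypothetical.

```python
def upwind_rhs(u, a, dx):
    """Spatial operator L(u) for u_t + a u_x = 0 with a > 0 (periodic)."""
    # u[i - 1] wraps around for i = 0, which gives periodic boundaries.
    return [-a * (u[i] - u[i - 1]) / dx for i in range(len(u))]


def ssp_rk2_step(u, dt, rhs):
    """One step of the two-stage, second-order SSP Runge-Kutta method."""
    k1 = rhs(u)
    u1 = [ui + dt * ki for ui, ki in zip(u, k1)]   # forward Euler stage
    k2 = rhs(u1)
    # Convex combination of u and a second forward Euler step from u1.
    return [0.5 * ui + 0.5 * (vi + dt * ki) for ui, vi, ki in zip(u, u1, k2)]
```

Because the update is a convex combination of forward Euler steps, any bound preserved by forward Euler under a given time step restriction is preserved here as well. That restriction is the SSP limit that the analysis above shows can be overly stringent.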
Finally, we propose a moment limiter for the discontinuous Galerkin method applied to hyperbolic conservation laws in two and three dimensions.
The limiter works by finding directions in which the solution coefficients can be separated and limits them independently of one another by comparing to forward and backward reconstructed differences. The limiter has a precomputed stencil of constant size, which provides computational advantages in terms of implementation and runtime. We provide examples that demonstrate stability and second order accuracy of solutions.
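The comparison against forward and backward differences can be illustrated with the classic one-dimensional minmod limiter. This is a simplified analogue on a uniform 1D grid, not the two- and three-dimensional moment limiter proposed in the thesis. The names `minmod` and `limit_slopes` are hypothetical.

```python
def minmod(a, b, c):
    """Return the argument of smallest magnitude if all share a sign, else 0."""
    if a > 0 and b > 0 and c > 0:
        return min(a, b, c)
    if a < 0 and b < 0 and c < 0:
        return max(a, b, c)
    return 0.0


def limit_slopes(avg, slope):
    """Limit each interior cell's slope coefficient against the forward and
    backward differences of the neighbouring cell averages."""
    out = list(slope)
    for i in range(1, len(avg) - 1):
        forward = avg[i + 1] - avg[i]
        backward = avg[i] - avg[i - 1]
        out[i] = minmod(slope[i], forward, backward)
    return out
```

Each cell reads only its two neighbours, so the limiter's stencil has a fixed size that is known in advance.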
Generating and auto-tuning parallel stencil codes
In this thesis, we present a software framework, Patus, which generates high performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and performance), and achieving a high performance on the target platform.
A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation.
The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain specific languages (DSLs) and the auto-tuning methodology.
The Patus stencil specification DSL allows the programmer to express a stencil computation in a concise way independently of hardware architecture-specific details. Thus, it increases the programmer productivity by disburdening her or him of low level programming model issues and of manually applying hardware platform-specific code optimization techniques. The use of domain specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different hardware platforms, i.e., the specification code is portable across hardware architectures. Constructing the language to be geared towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance.
Auto-tuning provides performance and performance portability by automated adaptation of implementation-specific parameters to the characteristics of the hardware on which the code will run. By automating the process of parameter tuning (which essentially amounts to solving an integer programming problem in which the objective function is the number representing the code's performance as a function of the parameter configuration), the system can also be used more productively than if the programmer had to fine-tune the code manually.
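Viewed as a search over integer parameter configurations, auto-tuning can be sketched as follows. This minimal sketch uses plain exhaustive search and is not the search strategy implemented in Patus. The function `autotune`, the `measure` callback and the parameter names `bx` and `by` are hypothetical.

```python
import itertools


def autotune(measure, space):
    """Try every configuration in `space` and return the cheapest one.

    `space` maps parameter names to lists of candidate integer values;
    `measure` maps a configuration dict to a cost such as run time.
    """
    names = sorted(space)
    best_cfg, best_cost = None, float("inf")
    for values in itertools.product(*(space[k] for k in names)):
        cfg = dict(zip(names, values))
        cost = measure(cfg)          # in practice: run and time the kernel
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost
```

A call such as `autotune(run_kernel, {"bx": [8, 16, 32], "by": [4, 8]})` would time a hypothetical `run_kernel` for six block-size combinations. Exhaustive search becomes impractical as the number of parameters grows, which is why heuristic search methods are commonly used instead.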
We show performance results for a variety of stencils, for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment and a seismic application. These examples demonstrate the framework's flexibility and ability to produce high performance code.