Optimizing the Performance of Streaming Numerical Kernels on the IBM Blue Gene/P PowerPC 450 Processor
Several emerging petascale architectures use energy-efficient processors with
vectorized computational units and in-order thread processing. On these
architectures the sustained performance of streaming numerical kernels,
ubiquitous in the solution of partial differential equations, represents a
challenge despite the regularity of memory access. Sophisticated optimization
techniques are required to fully utilize the Central Processing Unit (CPU).
We propose a new method for constructing streaming numerical kernels using a
high-level assembly synthesis and optimization framework. We describe an
implementation of this method in Python targeting the IBM Blue Gene/P
supercomputer's PowerPC 450 core. This paper details the high-level design,
construction, simulation, verification, and analysis of these kernels utilizing
a subset of the CPU's instruction set.
We demonstrate the effectiveness of our approach by implementing several
three-dimensional stencil kernels over a variety of cached memory scenarios and
analyzing the mechanically scheduled variants, including a 27-point stencil
achieving a 1.7x speedup over the best previously published results.
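For orientation, the 27-point stencil named in this abstract can be sketched in plain Python. This is a reference illustration only, not the paper's assembly-synthesis framework or its scheduled kernels; the function name `stencil27` and the four weight parameters are assumptions made for the example, as is the choice to leave boundary points at zero.

```python
def stencil27(u, w_center, w_face, w_edge, w_corner):
    """Apply a 27-point stencil to the interior of a 3D grid of nested lists.

    Hypothetical reference version: each neighbour is weighted by its class
    (centre, face, edge, corner), chosen by how many offsets are nonzero.
    """
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    weights = {0: w_center, 1: w_face, 2: w_edge, 3: w_corner}
    out = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                acc = 0.0
                # Visit the 3 x 3 x 3 neighbourhood: 1 centre, 6 faces,
                # 12 edges and 8 corners, 27 points in total.
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        for dk in (-1, 0, 1):
                            w = weights[abs(di) + abs(dj) + abs(dk)]
                            acc += w * u[i + di][j + dj][k + dk]
                out[i][j][k] = acc
    return out
```

The access pattern is regular and streams through memory, which is why such kernels are usually limited by instruction scheduling and memory bandwidth rather than by arithmetic complexity.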
Conjugate gradient solvers on Intel Xeon Phi and NVIDIA GPUs
Lattice Quantum Chromodynamics simulations typically spend most of the
runtime in inversions of the Fermion Matrix. This part is therefore frequently
optimized for various HPC architectures. Here we compare the performance of the
Intel Xeon Phi to current Kepler-based NVIDIA Tesla GPUs running a conjugate
gradient solver. By exposing more parallelism to the accelerator through
inverting multiple vectors at the same time, we obtain a performance greater
than 300 GFlop/s on both architectures. This more than doubles the performance
of the inversions. We also give a short overview of the Knights Corner
architecture, discuss some details of the implementation and the effort
required to obtain the achieved performance.
Comment: 7 pages, proceedings, presented at 'GPU Computing in High Energy
Physics', September 10-12, 2014, Pisa, Italy
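The idea of solving for several vectors at once can be sketched as follows. This is a minimal pure-Python illustration, not the authors' Xeon Phi or GPU implementation: it assumes a small dense symmetric positive-definite matrix, runs an independent conjugate gradient recurrence per right-hand side, and uses an absolute residual tolerance. The name `cg_multi_rhs` and its parameters are hypothetical.

```python
def cg_multi_rhs(A, B, tol=1e-10, max_iter=200):
    """Solve A x_k = b_k for every right-hand side b_k in B with CG.

    A: symmetric positive-definite matrix as a list of rows (assumption).
    B: list of right-hand-side vectors. Returns the list of solutions.
    """
    n, m = len(A), len(B)
    X = [[0.0] * n for _ in range(m)]
    R = [[float(v) for v in b] for b in B]  # x0 = 0, so r0 = b
    P = [list(r) for r in R]                # initial search directions
    rr = [sum(v * v for v in r) for r in R]
    for _ in range(max_iter):
        if max(rr) <= tol * tol:
            break
        # One sweep over A serves all m vectors: each row is read once
        # and reused for every search direction.
        AP = [[0.0] * n for _ in range(m)]
        for i, row in enumerate(A):
            for k in range(m):
                AP[k][i] = sum(a * p for a, p in zip(row, P[k]))
        for k in range(m):
            if rr[k] <= tol * tol:
                continue  # this system has already converged
            alpha = rr[k] / sum(p * q for p, q in zip(P[k], AP[k]))
            X[k] = [x + alpha * p for x, p in zip(X[k], P[k])]
            R[k] = [r - alpha * q for r, q in zip(R[k], AP[k])]
            rr_new = sum(v * v for v in R[k])
            beta = rr_new / rr[k]
            P[k] = [r + beta * p for r, p in zip(R[k], P[k])]
            rr[k] = rr_new
    return X
```

Reusing each matrix row for all vectors raises the arithmetic done per byte loaded, which is the kind of extra parallelism the abstract describes exposing to the accelerator.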
Abstractions and performance optimisations for finite element methods
Finding numerical solutions to partial differential equations (PDEs) is an essential task in the discipline of scientific computing.
In designing software tools for this task, one of the ultimate goals is to balance the needs for generality, ease of use and high performance.
Domain-specific systems based on code generation techniques,
such as Firedrake,
attempt to address this problem with a design consisting of
a hierarchy of abstractions,
where the users can specify the mathematical problems via a high-level,
descriptive interface,
which is progressively lowered through the intermediate abstractions.
Well-designed abstraction layers are essential for performing code transformations and optimisations robustly and efficiently,
generating high-performance code without user intervention.
This thesis discusses several topics on the design of the abstraction layers of Firedrake,
and presents the benefit of its software architecture by providing examples of various optimising code transformations at the appropriate abstraction layers.
In particular, we discuss the advantage of describing the local assembly stage of a finite element solver in an intermediate representation based on symbolic tensor algebra.
We successfully lift specific loop optimisations,
previously implemented by rewriting ASTs of the local assembly kernels, to this higher-level tensor language,
improving the compilation speed and optimisation effectiveness.
The global assembly phase involves the application of local assembly kernels on a collection of entities of an unstructured mesh.
We redesign the abstraction to express the global assembly loop nests
using tools and concepts based on the polyhedral model.
This enables us to implement the cross-element vectorisation algorithm that delivers stable vectorisation performance on CPUs automatically.
This abstraction also improves the portability of Firedrake,
as we demonstrate targeting GPU devices transparently from the same software stack.
Composable code generation for high order, compatible finite element methods
It has been widely recognised in HPC communities across the world that exploiting modern
computer architectures, including exascale machines, to their full extent requires software
communities to adapt their algorithms. Computational methods with a high ratio of floating point
operations to bandwidth are favourable. For solving partial differential equations, which can model
many physical problems, high order finite element methods can calculate approximations with
high efficiency when a good solver is employed. Matrix-free algorithms solve the corresponding
equations with a high arithmetic intensity. Vectorisation speeds up the operations by applying
one instruction to multiple data elements.
Another recent development for solving partial differential equations is compatible (mimetic) finite
element methods. In particular for geophysical flows, compatible discretisations exhibit the
numerical properties required for accurate approximations. Among others, this has
been recognised by the UK Met Office, whose new dynamical core for weather and climate
forecasting is built on a compatible discretisation. Hybridisation has proven to be an efficient
solution technique for the corresponding equation systems, because it removes some inter-element coupling
and localises expensive operations.
This thesis combines the recent advances on vectorised, matrix-free, high order finite element
methods in the HPC community on the one hand and hybridised, compatible discretisations in
the geophysical community on the other. In previous work, a code generation framework was
developed to support the localised linear algebra required for hybridisation. Here, the framework
is first adapted to support vectorisation and then extended so that the equations can be solved fully
matrix-free. Promising performance results complete the thesis.
Architecture and performance of Devito, a system for automated stencil computation
Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly-optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process -- from mathematical equations down to C++ code -- is performed by the Devito compiler through a series of intermediate representations. Several performance optimizations are introduced, including advanced common sub-expression elimination, tiling and parallelization. Some of these are obtained through well-established stencil optimizers, integrated in the back-end of the Devito compiler. The architecture of the Devito compiler, as well as the performance optimizations that are applied when generating code, are presented. The effectiveness of such performance optimizations is demonstrated using operators drawn from seismic imaging applications.
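Tiling, one of the optimizations this abstract lists, can be illustrated with a hand-written loop nest. The sketch below is not Devito's API and not code that Devito generates; it is a hypothetical pure-Python 5-point Laplacian whose interior is traversed in square tiles, and the name `laplacian_tiled` and the `tile` parameter are assumptions made for the example.

```python
def laplacian_tiled(u, tile=4):
    """5-point Laplacian on the interior of a 2D grid, visited tile by tile.

    The result matches an untiled traversal; only the visiting order
    changes, so that points within a tile are processed close together.
    """
    nx, ny = len(u), len(u[0])
    out = [[0.0] * ny for _ in range(nx)]
    for ii in range(1, nx - 1, tile):        # loops over tile origins
        for jj in range(1, ny - 1, tile):
            # Loops within one tile, clipped at the interior boundary.
            for i in range(ii, min(ii + tile, nx - 1)):
                for j in range(jj, min(jj + tile, ny - 1)):
                    out[i][j] = (u[i - 1][j] + u[i + 1][j]
                                 + u[i][j - 1] + u[i][j + 1]
                                 - 4.0 * u[i][j])
    return out
```

In compiled code the tile size would be chosen so that a tile's working set fits in cache; in this Python sketch the tiling only demonstrates the loop structure.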