406 research outputs found
Efficient parallel 3D computation of the compressible Euler equations with an invariant-domain preserving second-order finite-element scheme
We discuss the efficient implementation of a high-performance second-order
collocation-type finite-element scheme for solving the compressible Euler
equations of gas dynamics on unstructured meshes. The solver is based on the
convex limiting technique introduced by Guermond et al. (SIAM J. Sci. Comput.
40, A3211-A3239, 2018). As such it is invariant-domain preserving, i.e., the
solver maintains important physical invariants and is guaranteed to be stable
without the use of ad-hoc tuning parameters. This stability comes at the
expense of a significantly more involved algorithmic structure that renders
conventional high-performance discretizations challenging. We develop an
algorithmic design that allows SIMD vectorization of the compute kernel,
identify the main ingredients for a good node-level performance, and report
excellent weak and strong scaling of a hybrid thread/MPI parallelization.
Doctor of Philosophy
Sparse matrix codes are found in numerous applications ranging from iterative numerical solvers to graph analytics. Achieving high performance on these codes has, however, been a significant challenge, mainly due to array access indirection, for example, of the form A[B[i]]. Indirect accesses make precise dependence analysis impossible at compile-time, and hence prevent many parallelizing and locality optimizing transformations from being applied. The expert user relies on manually written libraries to tailor the sparse code and data representations best suited to the target architecture from a general sparse matrix representation. However, libraries have limited composability, address very specific optimization strategies, and have to be rewritten as new architectures emerge. In this dissertation, we explore the use of the inspector/executor methodology to accomplish the code and data transformations to tailor high performance sparse matrix representations. We devise and embed abstractions for such inspector/executor transformations within a compiler framework so that they can be composed with a rich set of existing polyhedral compiler transformations to derive complex transformation sequences for high performance. We demonstrate the automatic generation of inspector/executor code, which orchestrates code and data transformations to derive high performance representations for the Sparse Matrix Vector Multiply kernel in particular. We also show how the same transformations may be integrated into sparse matrix and graph applications such as Sparse Matrix Matrix Multiply and Stochastic Gradient Descent, respectively. The specific constraints of these applications, such as problem size and dependence structure, necessitate unique sparse matrix representations that can be realized using our transformations. Computations such as Gauss-Seidel, with loop-carried dependences at the outermost loop, necessitate different strategies for high performance.
Specifically, we organize the computation into level sets or wavefronts of irregular size, such that iterations of a wavefront may be scheduled in parallel but different wavefronts have to be synchronized. We demonstrate automatic code generation of high performance inspectors that do explicit dependence testing and level set construction at runtime, as well as high performance executors, which are the actual parallelized computations. For the above sparse matrix applications, we automatically generate inspector/executor code comparable in performance to manually tuned libraries.
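The level-set idea described in this abstract can be illustrated with a small hand-written sketch. It is not the dissertation's generated code: the function names `build_level_sets` and `forward_solve`, the CSR layout, and the example matrix are all assumptions made for illustration, and the executor runs serially where a real one would process each wavefront in parallel.

```python
def build_level_sets(row_ptr, col_idx):
    """Inspector (hypothetical): group rows of a lower-triangular CSR matrix
    into wavefronts by inspecting the indirection array at runtime."""
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:  # row i needs the already-computed value of row j
                level[i] = max(level[i], level[j] + 1)
    num_levels = max(level) + 1 if n else 0
    wavefronts = [[] for _ in range(num_levels)]
    for i, lv in enumerate(level):
        wavefronts[lv].append(i)
    return wavefronts


def forward_solve(row_ptr, col_idx, vals, b, wavefronts):
    """Executor (hypothetical): solve L x = b one wavefront at a time.
    Rows inside a wavefront are independent, so the inner loop over
    `front` is the part that could run in parallel."""
    x = [0.0] * len(b)
    for front in wavefronts:      # wavefronts must run in order
        for i in front:           # independent iterations
            acc = b[i]
            diag = None
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = vals[k]
                else:
                    acc -= vals[k] * x[j]
            x[i] = acc / diag
    return x


# Illustrative 4x4 lower-triangular matrix in CSR form (made-up data).
row_ptr = [0, 1, 2, 4, 7]
col_idx = [0, 1, 0, 2, 1, 2, 3]
vals = [2, 2, 1, 2, 1, 1, 2]
b = [2, 4, 5, 9]

fronts = build_level_sets(row_ptr, col_idx)
print(fronts)
print(forward_solve(row_ptr, col_idx, vals, b, fronts))
```

The inspector costs one pass over the nonzeros, which is why the approach tends to pay off when the same sparsity pattern is reused across many sweeps.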
Productive and efficient computational science through domain-specific abstractions
In an ideal world, scientific applications are computationally efficient,
maintainable and composable and allow scientists to work very productively. We
argue that these goals are achievable for a specific application field by
choosing suitable domain-specific abstractions that encapsulate domain
knowledge with a high degree of expressiveness.
This thesis demonstrates the design and composition of
domain-specific abstractions by abstracting the stages a scientist goes
through in formulating a problem of numerically solving a partial differential
equation. Domain knowledge is used to transform this problem into a different,
lower level representation and decompose it into parts which can be solved
using existing tools. A system for the portable solution of partial
differential equations using the finite element method on unstructured meshes
is formulated, in which contributions from different scientific communities
are composed to solve sophisticated problems.
The concrete implementations of these domain-specific abstractions are
Firedrake and PyOP2. Firedrake allows scientists to describe variational
forms and discretisations for linear and non-linear finite element problems
symbolically, in a notation very close to their mathematical models. PyOP2
abstracts the performance-portable parallel execution of local computations
over the mesh on a range of hardware architectures, targeting multi-core CPUs,
GPUs and accelerators. Thereby, a separation of concerns is achieved, in which
Firedrake encapsulates domain knowledge about the finite element method
separately from its efficient parallel execution in PyOP2, which in turn is
completely agnostic to the higher abstraction layer.
As a consequence of the composability of those abstractions, optimised
implementations for different hardware architectures can be
automatically generated without any changes to a single high-level
source. Performance matches or exceeds what is realistically attainable by
hand-written code. Firedrake and PyOP2 are combined to form a tool chain that
is demonstrated to be competitive with or faster than available alternatives
on a wide range of different finite element problems.
Doctor of Philosophy
Partial differential equations (PDEs) are widely used in science and engineering to model phenomena such as sound, heat, and electrostatics. In many practical science and engineering applications, the solutions of PDEs require the tessellation of computational domains into unstructured meshes and entail computationally expensive and time-consuming processes. Therefore, efficient and fast PDE solving techniques on unstructured meshes are important in these applications. Relative to CPUs, the faster growth in speed and the greater power efficiency of SIMD streaming processors, such as GPUs, have earned them an increasingly important role in high-performance computing. Combining suitable parallel algorithms and these streaming processors, we can develop very efficient numerical solvers of PDEs. The contributions of this dissertation are twofold: proposal of two general strategies to design efficient PDE solvers on GPUs and the specific applications of these strategies to solve different types of PDEs. Specifically, this dissertation consists of four parts. First, we describe the general strategies, the domain decomposition strategy and the hybrid gathering strategy. Next, we introduce a parallel algorithm for solving the eikonal equation on fully unstructured meshes efficiently. Third, we present the algorithms and data structures necessary to move the entire FEM pipeline to the GPU. Fourth, we propose a parallel algorithm for solving the levelset equation on fully unstructured 2D or 3D meshes or manifolds. This algorithm combines a narrowband scheme with domain decomposition for efficient levelset equation solving.
A Performance-Portable SYCL Implementation of CRK-HACC for Exascale
The first generation of exascale systems will include a variety of machine
architectures, featuring GPUs from multiple vendors. As a result, many
developers are interested in adopting portable programming models to avoid
maintaining multiple versions of their code. It is necessary to document
experiences with such programming models to assist developers in understanding
the advantages and disadvantages of different approaches.
To this end, this paper evaluates the performance portability of a SYCL
implementation of a large-scale cosmology application (CRK-HACC) running on
GPUs from three different vendors: AMD, Intel, and NVIDIA. We detail the
process of migrating the original code from CUDA to SYCL and show that
specializing kernels for specific targets can greatly improve performance
portability without significantly impacting programmer productivity. The SYCL
version of CRK-HACC achieves a performance portability of 0.96 with a code
divergence of almost 0, demonstrating that SYCL is a viable programming model
for performance-portable applications.
Comment: 12 pages, 13 figures, 2023 International Workshop on Performance,
Portability & Productivity in HP
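The abstract reports a performance portability score without defining it. A minimal sketch follows, assuming the widely used harmonic-mean definition (the harmonic mean of an application's efficiency on each platform, and zero if any platform is unsupported). The function name and the efficiency values are hypothetical and are not taken from the paper.

```python
def performance_portability(efficiencies):
    """Harmonic mean of per-platform efficiencies in (0, 1].

    Returns 0.0 if the set is empty or any platform is unsupported
    (represented here, by assumption, as None or a non-positive value).
    """
    values = list(efficiencies.values())
    if not values or any(e is None or e <= 0 for e in values):
        return 0.0
    return len(values) / sum(1.0 / e for e in values)


# Made-up efficiencies for illustration only.
example = {"AMD": 1.0, "Intel": 0.5, "NVIDIA": 1.0}
print(performance_portability(example))
```

Because the harmonic mean is dominated by the smallest value, one poorly performing platform lowers the score more than an arithmetic mean would suggest.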
Recent Techniques for Regularization in Partial Differential Equations and Imaging
Inverse problems model real world phenomena from data, where the data are often noisy and models contain errors. This leads to instabilities, multiple solution vectors and thus ill-posedness. To solve ill-posed inverse problems, regularization is typically used as a penalty function to induce stability and allow for the incorporation of a priori information about the desired solution. In this thesis, high order regularization techniques are developed for image and function reconstruction from noisy or misleading data. Specifically, the incorporation of the Polynomial Annihilation operator allows for the accurate exploitation of the sparse representation of each function in the edge domain.
This dissertation tackles three main problems through the development of novel reconstruction techniques: (i) reconstructing one and two dimensional functions from multiple measurement vectors using variance based joint sparsity when a subset of the measurements contain false and/or misleading information, (ii) approximating discontinuous solutions to hyperbolic partial differential equations by enhancing typical solvers with l1 regularization, and (iii) reducing model assumptions in synthetic aperture radar image formation, specifically for the purpose of speckle reduction and phase error correction. While the common thread tying these problems together is the use of high order regularization, the defining characteristics of each of these problems create unique challenges.
Fast and robust numerical algorithms are also developed so that these problems can be solved efficiently without requiring fine tuning of parameters. Indeed, the numerical experiments presented in this dissertation strongly suggest that the new methodology provides more accurate and robust solutions to a variety of ill-posed inverse problems.
Dissertation/Thesis, Doctoral Dissertation, Mathematics 201
Evaluating the performance of legacy applications on emerging parallel architectures
The gap between a supercomputer's theoretical maximum ("peak") floating-point
performance and that actually achieved by applications has grown wider
over time. Today, a typical scientific application achieves only 5-20% of any
given machine's peak processing capability, and this gap leaves room for significant
improvements in execution times.
This problem is most pronounced for modern "accelerator" architectures:
collections of hundreds of simple, low-clocked cores capable of executing the
same instruction on dozens of pieces of data simultaneously. This is a significant
change from the low number of high-clocked cores found in traditional CPUs,
and effective utilisation of accelerators typically requires extensive code and
algorithmic changes. In many cases, the best way in which to map a parallel
workload to these new architectures is unclear.
The principal focus of the work presented in this thesis is the evaluation
of emerging parallel architectures (specifically, modern CPUs, GPUs and Intel
MIC) for two benchmark codes, the LU benchmark from the NAS Parallel
Benchmark Suite and Sandia's miniMD benchmark, which exhibit complex
parallel behaviours that are representative of many scientific applications. Using
combinations of low-level intrinsic functions, OpenMP, CUDA and MPI, we
demonstrate performance improvements of up to 7x for these workloads.
We also detail a code development methodology that permits application developers
to target multiple architecture types without maintaining completely
separate implementations for each platform. Using OpenCL, we develop performance
portable implementations of the LU and miniMD benchmarks that are
faster than the original codes, and at most 2x slower than versions highly tuned
for particular hardware.
Finally, we demonstrate the importance of evaluating architectures at scale
(as opposed to on single nodes) through performance modelling techniques,
highlighting the problems associated with strong-scaling on emerging accelerator
architectures.
SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator
Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase, sparse dataflow parallelism in the Sparse Matrix-Solve phase and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code-generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms.
We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X (1.4–23X) across a range of non-linear device models and Matrix-Solve by 2.4X (0.6–13X) across various benchmark matrices while delivering a mean combined speedup of 2.8X (0.2–11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm). With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures.
We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (e.g., multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.