GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack.
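GHOST's actual interface is documented with the library itself; as a rough illustration of the "MPI+X" execution model it builds on (not GHOST code and not its API), the following self-contained sketch combines MPI across processes with OpenMP threading within each process for a CRS sparse matrix-vector product:

```cpp
// Illustrative MPI+OpenMP sparse matrix-vector multiply (CRS format).
// This is NOT the GHOST API; it only sketches the "MPI+X" pattern:
// MPI across processes, threads (the "X") within each process.
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstdio>

struct CrsMatrix {              // locally owned matrix rows of this rank
  int nrows;
  std::vector<int> rowptr;      // size nrows+1
  std::vector<int> col;         // global column indices
  std::vector<double> val;
};

// y = A*x for the locally owned rows; x is fully replicated here for
// brevity (a real library would communicate only the needed remote
// entries and overlap communication with computation).
void spmv(const CrsMatrix &A, const std::vector<double> &x,
          std::vector<double> &y) {
#pragma omp parallel for schedule(static)
  for (int i = 0; i < A.nrows; ++i) {
    double tmp = 0.0;
    for (int j = A.rowptr[i]; j < A.rowptr[i + 1]; ++j)
      tmp += A.val[j] * x[A.col[j]];
    y[i] = tmp;
  }
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Each rank owns a 4-row band of a global tridiagonal matrix.
  const int local_n = 4, global_n = local_n * size, offset = rank * local_n;
  CrsMatrix A{local_n, {0}, {}, {}};
  for (int i = 0; i < local_n; ++i) {
    int gi = offset + i;
    for (int gj : {gi - 1, gi, gi + 1})
      if (gj >= 0 && gj < global_n) {
        A.col.push_back(gj);
        A.val.push_back(gj == gi ? 2.0 : -1.0);
      }
    A.rowptr.push_back((int)A.col.size());
  }

  std::vector<double> x(global_n, 1.0), y(local_n, 0.0);
  spmv(A, x, y);

  double local_sum = 0.0, global_sum = 0.0;
  for (double v : y) local_sum += v;
  MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);
  if (rank == 0) std::printf("sum(A*1) = %g\n", global_sum);
  MPI_Finalize();
}
```

Built with an MPI compiler wrapper (e.g. mpicxx -fopenmp) and launched under mpirun, each rank applies its local rows with threads while MPI performs the distributed reduction.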
Algorithms and data structures for matrix-free finite element operators with MPI-parallel sparse multi-vectors
Traditional solution approaches for problems in quantum mechanics scale as
O(N^3), where N is the number of electrons. Various methods have been
proposed to address this issue and obtain linear scaling O(N).
One promising formulation is the direct minimization of energy. Such methods
take advantage of physical localization of the solution, namely that the
solution can be sought in terms of non-orthogonal orbitals with local support.
In this work a numerically efficient implementation of sparse parallel vectors
within the open-source finite element library deal.II is proposed. The main
algorithmic ingredient is the matrix-free evaluation of the Hamiltonian
operator by cell-wise quadrature. Based on an a-priori chosen support for each
vector we develop algorithms and data structures to perform (i) matrix-free
sparse matrix multivector products (SpMM), (ii) the projection of an operator
onto a sparse sub-space (inner products), and (iii) post-multiplication of a
sparse multivector with a square matrix. The node-level performance is analyzed
using a roofline model. Our matrix-free implementation of finite element
operators with sparse multivectors achieves 157 GFlop/s on the Intel Cascade
Lake architecture. Strong and weak scaling results are reported for a typical
benchmark problem using quadratic and quartic finite element bases.
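As a rough, hypothetical illustration of the kind of data structure involved (this is not deal.II's implementation), the sketch below stores each column of a multivector only on its a-priori chosen support and computes the projection inner products of operation (ii), where only overlapping supports contribute:

```cpp
// Minimal sketch (not deal.II's actual data structures) of a "sparse
// multivector": a block of column vectors, each stored only on an a-priori
// chosen set of rows (its support). Shown is the projection B = V^T V
// restricted to overlapping supports.
#include <vector>
#include <cstdio>

struct SparseVector {
  std::vector<int>    rows;   // sorted global row indices in the support
  std::vector<double> vals;   // values on those rows
};

using SparseMultiVector = std::vector<SparseVector>;

// Inner product of two sparse vectors: only rows in both supports contribute.
double dot(const SparseVector &a, const SparseVector &b) {
  double s = 0.0;
  std::size_t i = 0, j = 0;
  while (i < a.rows.size() && j < b.rows.size()) {
    if (a.rows[i] == b.rows[j])      s += a.vals[i++] * b.vals[j++];
    else if (a.rows[i] < b.rows[j])  ++i;
    else                             ++j;
  }
  return s;
}

int main() {
  // Two orbitals with local, partially overlapping supports.
  SparseMultiVector V = {
    {{0, 1, 2}, {1.0, 2.0, 1.0}},
    {{2, 3, 4}, {0.5, 1.5, 0.5}},
  };
  // B(i,j) = <V_i, V_j>; zero whenever supports do not overlap.
  for (std::size_t i = 0; i < V.size(); ++i)
    for (std::size_t j = 0; j < V.size(); ++j)
      std::printf("B(%zu,%zu) = %g\n", i, j, dot(V[i], V[j]));
}
```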
Code Generation for High Performance PDE Solvers on Modern Architectures
Numerical simulation with partial differential equations is an important discipline in high performance computing. Notable application areas include geosciences, fluid dynamics, solid mechanics and electromagnetics. Recent hardware developments have made it increasingly hard to achieve very good performance, due both to a lack of numerical algorithms suited to the hardware and to a lack of efficient implementations of those algorithms.
Modern CPUs require a sufficiently high arithmetic intensity in order to reach their full potential. In this thesis, we use a numerical scheme that is well-suited for this scenario: The Discontinuous Galerkin Finite Element Method on cuboid meshes can be implemented with optimal complexity exploiting the tensor product structure of basis functions and quadrature formulae using a technique called sum factorization. A matrix-free implementation of this scheme significantly lowers the memory footprint of the method and delivers a fully compute-bound algorithm.
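The following minimal sketch (with assumed shapes and dummy basis values, not the generated code produced by the thesis) shows the idea of sum factorization for a 2D tensor-product basis: values at the quadrature points are obtained by two sweeps of 1D contractions instead of one large 2D interpolation:

```cpp
// Sum factorization for a 2D tensor-product basis: applying the 1D
// interpolation matrix along each direction in turn costs O(p^2 q + p q^2)
// instead of O(p^2 q^2) for a naive evaluation of all basis functions at
// all quadrature points.
#include <array>
#include <cstdio>

constexpr int P = 3;  // 1D basis functions per direction
constexpr int Q = 4;  // 1D quadrature points per direction

// A1d[q][p] = value of 1D basis function p at 1D quadrature point q
// (placeholder numbers; a real code tabulates Lagrange polynomials).
using Mat1d = std::array<std::array<double, P>, Q>;

void sum_factorized_eval(const Mat1d &A1d,
                         const double (&coeffs)[P][P],   // u_{ij}
                         double (&values)[Q][Q]) {       // u at quad points
  double tmp[Q][P] = {};          // first sweep: contract the x-direction
  for (int qx = 0; qx < Q; ++qx)
    for (int j = 0; j < P; ++j)
      for (int i = 0; i < P; ++i)
        tmp[qx][j] += A1d[qx][i] * coeffs[i][j];

  for (int qx = 0; qx < Q; ++qx)  // second sweep: contract the y-direction
    for (int qy = 0; qy < Q; ++qy) {
      double s = 0.0;
      for (int j = 0; j < P; ++j)
        s += A1d[qy][j] * tmp[qx][j];
      values[qx][qy] = s;
    }
}

int main() {
  Mat1d A1d = {};
  for (int q = 0; q < Q; ++q)
    for (int p = 0; p < P; ++p)
      A1d[q][p] = 1.0 / (1 + q + p);      // dummy tabulated values
  double coeffs[P][P] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};
  double values[Q][Q];
  sum_factorized_eval(A1d, coeffs, values);
  std::printf("u at first quadrature point: %g\n", values[0][0]);
}
```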
An efficient implementation of this scheme for a modern CPU requires maximum use of the processor’s SIMD units. General purpose compilers are not capable of autovectorizing traditional PDE simulation codes, requiring high performance implementations to explicitly spell out SIMD instructions. With SIMD widths increasing in recent years (reaching their current peak at 512 bits in the Intel Skylake architecture) and programming languages not providing tools to directly target SIMD units, such code suffers from a performance portability issue. This work proposes generative programming as a solution to this issue.
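As a small, hedged illustration of what "explicitly spelling out SIMD instructions" means (this is not code from the thesis or from DUNE), the kernel below writes the same axpy loop once as scalar code and once with AVX-512 intrinsics that process eight doubles per instruction:

```cpp
// Scalar vs. explicitly vectorized axpy. The intrinsic version uses one
// 512-bit fused multiply-add per eight doubles; building it requires
// -mavx512f and running it requires an AVX-512 capable CPU.
#include <immintrin.h>
#include <cstdio>

void axpy_scalar(int n, double a, const double *x, double *y) {
  for (int i = 0; i < n; ++i)
    y[i] += a * x[i];
}

void axpy_avx512(int n, double a, const double *x, double *y) {
  const __m512d va = _mm512_set1_pd(a);
  int i = 0;
  for (; i + 8 <= n; i += 8) {                       // full 512-bit lanes
    __m512d vy = _mm512_loadu_pd(y + i);
    __m512d vx = _mm512_loadu_pd(x + i);
    _mm512_storeu_pd(y + i, _mm512_fmadd_pd(va, vx, vy));
  }
  for (; i < n; ++i)                                 // scalar remainder
    y[i] += a * x[i];
}

int main() {
  alignas(64) double x[16], y[16];
  for (int i = 0; i < 16; ++i) { x[i] = i; y[i] = 1.0; }
  axpy_avx512(16, 2.0, x, y);
  std::printf("y[15] = %g\n", y[15]);                // expect 31
}
```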
To this end, we develop a toolchain that translates a PDE problem expressed in a domain specific language into a piece of machine-dependent, optimized C++ code. This toolchain is embedded into the existing user workflow of the DUNE project, an open source framework for the numerical solution of PDEs. Compared to other such toolchains, special emphasis is put on an intermediate representation that enables performance-oriented transformations. Furthermore, this thesis defines a new class of SIMD vectorization strategies that operate on batches of subkernels within one integration kernel. The space of these vectorization strategies is explored systematically from within the code generator in an autotuning procedure.
We demonstrate the performance of our vectorization strategies and their implementation by providing measurements on the Intel Haswell and Intel Skylake architectures. We present numbers for the diffusion-reaction equation, the Stokes equations and Maxwell’s equations, achieving up to 40% of the machine’s theoretical floating point performance for an application of the DG operator.
FlashX: Massive Data Analysis Using Fast I/O
With the explosion of data and the increasing complexity of data analysis, large-scale
data analysis imposes significant challenges in systems design. While current
research focuses on scaling out to large clusters, these scale-out solutions introduce
a significant amount of overhead. This thesis is motivated by the advance of new
I/O technologies such as flash memory. Instead of scaling out, we explore efficient
system designs in a single commodity machine with a non-uniform memory access
(NUMA) architecture and scale to large datasets by utilizing commodity
solid-state drives
(SSDs). This thesis explores the impact of the new I/O technologies on large-scale
data analysis. Instead of implementing individual data analysis algorithms for SSDs,
we develop a data analysis ecosystem called FlashX to target a large range of data
analysis tasks. FlashX includes three subsystems: SAFS, FlashGraph and FlashMatrix.
SAFS is a user-space filesystem optimized for a large SSD array to deliver
maximal I/O throughput from SSDs. FlashGraph is a general-purpose graph analysis
framework that processes graphs in a semi-external memory fashion, i.e., keeping
vertex state in memory and edges on SSDs, and scales to graphs with billions of
vertices by utilizing SSDs through SAFS. FlashMatrix is a matrix-oriented programming
framework that supports both sparse matrices and dense matrices for general
data analysis. Similar to FlashGraph, it scales matrix operations beyond memory
capacity by utilizing SSDs. We demonstrate that with current I/O technologies,
FlashGraph and FlashMatrix running in (semi-)external memory meet or even
exceed the performance of state-of-the-art in-memory data analysis frameworks
while scaling to massive datasets for a large variety of data analysis tasks.
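The following sketch is not the FlashGraph API; it only illustrates the semi-external-memory idea under an assumed edge-file layout: per-vertex state stays in RAM while edge lists are streamed from a file that would reside on SSD.

```cpp
// Semi-external-memory sketch: vertex state in memory, edges on disk.
// One BFS-like frontier sweep over a sequentially stored adjacency file.
#include <cstdio>
#include <cstdint>
#include <vector>

int main() {
  // Assumed file layout for this sketch: for each vertex v in order,
  //   uint32_t degree, followed by `degree` uint32_t neighbor ids.
  const char *edge_file = "graph.adj";            // hypothetical path
  std::FILE *f = std::fopen(edge_file, "rb");
  if (!f) { std::perror("fopen"); return 1; }

  const std::size_t num_vertices = 1000000;       // in-memory state only
  std::vector<std::uint8_t> visited(num_vertices, 0);
  visited[0] = 1;                                 // source vertex

  std::vector<std::uint32_t> neighbors;
  for (std::size_t v = 0; v < num_vertices; ++v) {
    std::uint32_t degree = 0;
    if (std::fread(&degree, sizeof degree, 1, f) != 1) break;
    neighbors.resize(degree);
    if (std::fread(neighbors.data(), sizeof(std::uint32_t), degree, f)
        != degree) break;
    if (!visited[v]) continue;                    // expand frontier vertices
    for (std::uint32_t u : neighbors)             // mark neighbors as reached
      if (u < num_vertices) visited[u] = 1;
  }
  std::fclose(f);

  std::size_t reached = 0;
  for (std::uint8_t b : visited) reached += b;
  std::printf("vertices reached after one sweep: %zu\n", reached);
}
```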
A Parallel Geometric Multigrid Method for Adaptive Finite Elements
Applications in a variety of scientific disciplines use systems of Partial Differential Equations (PDEs) to model physical phenomena. Numerical solutions to these models are often found using the Finite Element Method (FEM), where the problem is discretized and the solution of a large linear system is required, containing millions or even billions of unknowns. Often, the domain of these solves will contain localized features that require very high resolution of the underlying finite element mesh to resolve accurately, while a mesh with uniform resolution would require far too much computational time and memory to be feasible on a modern machine. Therefore, techniques like adaptive mesh refinement, where one increases the resolution of the mesh only where it is necessary, must be used. Even with adaptive mesh refinement, these systems can still contain far more than a million unknowns (large mantle convection applications like the ones in [90] show simulations with over 600 billion unknowns), and attempting to solve them on a single processing unit is infeasible due to the computational time and memory required. For this reason, any application code aimed at solving large problems must be built using a parallel framework, allowing the concurrent use of multiple processing units to solve a single problem, and the code must exhibit efficient scaling to large numbers of processing units.
Multigrid methods are currently the only known optimal solvers for linear systems arising from discretizations of elliptic boundary value problems. These methods can be represented as an iterative scheme with contraction number less than one, independent of the resolution of the discretization [24, 54, 25, 103], with optimal complexity in the number of unknowns in the system [29]. Geometric multigrid (GMG) methods, where the hierarchy of spaces is defined by linear systems of finite element discretizations on meshes of decreasing resolution, have been shown to be robust for many different problem formulations, giving mesh-independent convergence for highly adaptive meshes [26, 61, 83, 18], but these methods require specific implementations for each type of equation, boundary condition, mesh, etc., required by the specific application. The implementation in a massively parallel environment is not obvious, and research into this topic is far from exhaustive.
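As background for the implementation described next, the following generic textbook sketch (not the parallel deal.II code) shows the recursive structure of a GMG V-cycle on a hierarchy of uniform 1D grids for the Poisson problem: pre-smooth, restrict the residual, recurse on the coarser level, prolongate the correction, post-smooth.

```cpp
// Schematic geometric multigrid V-cycle for -u'' = f on (0,1) with zero
// boundary values, discretized by second-order finite differences on a
// hierarchy of uniform grids. Level l has n = 2^(l+1)-1 interior points.
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

static std::size_t n_of(int l) { return (std::size_t(1) << (l + 1)) - 1; }

static void apply(const Vec &x, Vec &Ax) {            // Ax = A x (1D Laplace)
  const std::size_t n = x.size();
  const double h2 = 1.0 / ((n + 1.0) * (n + 1.0));
  for (std::size_t i = 0; i < n; ++i) {
    double left  = (i > 0)     ? x[i - 1] : 0.0;
    double right = (i + 1 < n) ? x[i + 1] : 0.0;
    Ax[i] = (2.0 * x[i] - left - right) / h2;
  }
}

static void jacobi(Vec &x, const Vec &b, int sweeps) { // damped Jacobi smoother
  const std::size_t n = x.size();
  const double h2 = 1.0 / ((n + 1.0) * (n + 1.0));
  Vec Ax(n);
  for (int s = 0; s < sweeps; ++s) {
    apply(x, Ax);
    for (std::size_t i = 0; i < n; ++i)
      x[i] += 0.667 * 0.5 * h2 * (b[i] - Ax[i]);
  }
}

static void restrict_fw(const Vec &fine, Vec &coarse) { // full weighting
  for (std::size_t j = 0; j < coarse.size(); ++j)
    coarse[j] = 0.25 * fine[2 * j] + 0.5 * fine[2 * j + 1]
              + 0.25 * fine[2 * j + 2];
}

static void prolongate_add(const Vec &coarse, Vec &fine) { // linear interp.
  for (std::size_t j = 0; j < coarse.size(); ++j) {
    fine[2 * j]     += 0.5 * coarse[j];
    fine[2 * j + 1] += coarse[j];
    fine[2 * j + 2] += 0.5 * coarse[j];
  }
}

static void v_cycle(int l, Vec &x, const Vec &b) {
  if (l == 0) { x[0] = b[0] / 8.0; return; }   // exact solve: A = [8] on level 0
  jacobi(x, b, 3);                             // pre-smoothing
  Vec Ax(x.size()), r(x.size());
  apply(x, Ax);
  for (std::size_t i = 0; i < x.size(); ++i) r[i] = b[i] - Ax[i];
  Vec rc(n_of(l - 1)), ec(n_of(l - 1), 0.0);
  restrict_fw(r, rc);
  v_cycle(l - 1, ec, rc);                      // coarse-grid correction
  prolongate_add(ec, x);
  jacobi(x, b, 3);                             // post-smoothing
}

int main() {
  const int L = 6;                              // finest level, 127 unknowns
  Vec x(n_of(L), 0.0), b(n_of(L), 1.0);         // f = 1, zero initial guess
  for (int it = 0; it < 10; ++it) v_cycle(L, x, b);
  std::printf("u(1/2) ~= %g (exact 0.125)\n", x[n_of(L) / 2]);
}
```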
We present an implementation of a massively parallel, adaptive geometric multigrid (GMG) method in the open-source finite element library deal.II [5], and perform extensive tests showing scaling of the V-cycle application on systems with up to 137 billion unknowns run on up to 65,536 processors, demonstrating low communication overhead of the proposed algorithms. We then show the flexibility of the GMG method by applying it to four different PDE systems: the Poisson equation, linear elasticity, advection-diffusion, and the Stokes equations. For the Stokes equations, we implement a fully matrix-free, adaptive, GMG-based solver in the mantle convection code ASPECT [13], and give a comparison to the current matrix-based method used. We show improvements in robustness, parallel scaling, and memory consumption for simulations with up to 27 billion unknowns and 114,688 processors. Finally, we test the performance of IDR(s) methods compared to the FGMRES method currently used in ASPECT, showing the effects of the flexible preconditioning used for the Stokes solves in ASPECT, and demonstrating the possible reduction in memory consumption with IDR(s) and its potential for solving large-scale problems.
Parts of the work in this thesis have been submitted to peer-reviewed journals in the form of two publications ([36] and [34]), and the implementations discussed have been integrated into two open-source codes, deal.II and ASPECT. Through the contributions to deal.II, including a full-length tutorial program, Step-63 [35], the author is listed as a contributing author of the newest deal.II release (see [5]). The implementation in ASPECT is based on work from the author and Timo Heister. The goal of this work is to enable the community of geoscientists using ASPECT to solve larger problems than currently possible. Over the course of this thesis, the author was partially funded by NSF Award OAC-1835452 and by the Computational Infrastructure for Geodynamics initiative (CIG), through the NSF under Awards EAR-0949446 and EAR-1550901 and The University of California -- Davis.