
    GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems

    While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. The library code and several applications are available as open source. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack. Comment: 32 pages, 11 figures.
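
    The "MPI+X" pattern at the core of GHOST can be illustrated with a minimal sketch: each MPI rank owns a block of matrix rows in CRS format, and OpenMP threads (the "X") parallelize the local sparse matrix-vector product. The code below only illustrates that pattern and is not GHOST's actual interface; GHOST's kernels additionally handle heterogeneous devices, blocked (multi-)vectors, and communication/computation overlap.

```cpp
// Minimal MPI+OpenMP ("MPI+X") sparse matrix-vector product sketch, CRS storage.
// Illustrative only -- not the GHOST API.
#include <mpi.h>
#include <omp.h>
#include <vector>

struct CrsMatrix {                 // row block owned by this MPI rank
  std::vector<int>    row_ptr;     // size local_rows + 1
  std::vector<int>    col_idx;     // global column indices
  std::vector<double> val;
};

// y = A * x; x_local holds this rank's slice of the input vector.
// For brevity the full x is gathered on every rank (a real code would only
// exchange the needed halo entries and overlap that with local work).
void spmv(const CrsMatrix& A, const std::vector<double>& x_local,
          const std::vector<int>& counts, const std::vector<int>& displs,
          std::vector<double>& y, MPI_Comm comm) {
  std::vector<double> x(displs.back() + counts.back());
  MPI_Allgatherv(x_local.data(), (int)x_local.size(), MPI_DOUBLE,
                 x.data(), counts.data(), displs.data(), MPI_DOUBLE, comm);

  const int n = (int)A.row_ptr.size() - 1;
  #pragma omp parallel for schedule(static)   // "X" = OpenMP threads per rank
  for (int i = 0; i < n; ++i) {
    double sum = 0.0;
    for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
      sum += A.val[k] * x[A.col_idx[k]];
    y[i] = sum;
  }
}
```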

    Algorithms and data structures for matrix-free finite element operators with MPI-parallel sparse multi-vectors

    Traditional solution approaches for problems in quantum mechanics scale as O(M^3), where M is the number of electrons. Various methods have been proposed to address this issue and obtain linear scaling O(M). One promising formulation is the direct minimization of energy. Such methods take advantage of physical localization of the solution, namely that the solution can be sought in terms of non-orthogonal orbitals with local support. In this work a numerically efficient implementation of sparse parallel vectors within the open-source finite element library deal.II is proposed. The main algorithmic ingredient is the matrix-free evaluation of the Hamiltonian operator by cell-wise quadrature. Based on an a priori chosen support for each vector we develop algorithms and data structures to perform (i) matrix-free sparse matrix-multivector products (SpMM), (ii) the projection of an operator onto a sparse sub-space (inner products), and (iii) post-multiplication of a sparse multivector with a square matrix. The node-level performance is analyzed using a roofline model. Our matrix-free implementation of finite element operators with sparse multivectors achieves a performance of 157 GFlop/s on the Intel Cascade Lake architecture. Strong and weak scaling results are reported for a typical benchmark problem using quadratic and quartic finite element bases. Comment: 29 pages, 12 figures.
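
    The cell-wise, matrix-free evaluation described above can be sketched as a gather-compute-scatter loop over cells. The snippet below uses hypothetical placeholder types (Mesh, Cell, local_apply) rather than the deal.II API; for sparse multivectors, the same loop would only touch the columns whose a priori support overlaps the current cell.

```cpp
// Sketch of matrix-free operator application by cell-wise quadrature.
// All names here (Mesh, Cell, local_apply) are hypothetical placeholders,
// not the deal.II API; they only illustrate the gather-compute-scatter loop
// that replaces an assembled sparse matrix.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Cell {
  std::vector<int> dof_indices;   // global dof numbers with support on this cell
};

struct Mesh {
  std::vector<Cell> cells;
};

// Placeholder local kernel: a real implementation evaluates the operator by
// quadrature on the cell (values/gradients at quadrature points, typically
// sum-factorized); here it just copies so the sketch stays self-contained.
void local_apply(const Cell&, const std::vector<double>& u_local,
                 std::vector<double>& v_local) {
  v_local = u_local;
}

// v = A * u without ever assembling or storing A.
void apply_matrix_free(const Mesh& mesh, const std::vector<double>& u,
                       std::vector<double>& v) {
  std::fill(v.begin(), v.end(), 0.0);
  for (const Cell& cell : mesh.cells) {
    const std::size_t n = cell.dof_indices.size();
    std::vector<double> u_local(n), v_local(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)        // gather local values
      u_local[i] = u[cell.dof_indices[i]];
    local_apply(cell, u_local, v_local);       // cell-wise quadrature kernel
    for (std::size_t i = 0; i < n; ++i)        // scatter / accumulate result
      v[cell.dof_indices[i]] += v_local[i];
  }
}
```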

    Fast Matrix-Free Evaluation of Discontinuous Galerkin Finite Element Operators


    Code Generation for High Performance PDE Solvers on Modern Architectures

    Numerical simulation with partial differential equations is an important discipline in high performance computing. Notable application areas include geosciences, fluid dynamics, solid mechanics and electromagnetics. Recent hardware developments have made it increasingly hard to achieve very good performance. This is due both to a lack of numerical algorithms suited to the hardware and to a lack of efficient implementations of these algorithms. Modern CPUs require a sufficiently high arithmetic intensity in order to unfold their full potential. In this thesis, we use a numerical scheme that is well-suited for this scenario: The Discontinuous Galerkin Finite Element Method on cuboid meshes can be implemented with optimal complexity exploiting the tensor product structure of basis functions and quadrature formulae using a technique called sum factorization. A matrix-free implementation of this scheme significantly lowers the memory footprint of the method and delivers a fully compute-bound algorithm. An efficient implementation of this scheme for a modern CPU requires maximum use of the processor’s SIMD units. General purpose compilers are not capable of autovectorizing traditional PDE simulation codes, requiring high performance implementations to explicitly spell out SIMD instructions. With the SIMD width increasing in recent years (reaching its current peak at 512 bits in the Intel Skylake architecture) and programming languages not providing tools to directly target SIMD units, such code suffers from a performance portability issue. This work proposes generative programming as a solution to this issue. To this end, we develop a toolchain that translates a PDE problem expressed in a domain specific language into a piece of machine-dependent, optimized C++ code. This toolchain is embedded into the existing user workflow of the DUNE project, an open source framework for the numerical solution of PDEs. Compared to other such toolchains, special emphasis is put on an intermediate representation that enables performance-oriented transformations. Furthermore, this thesis defines a new class of SIMD vectorization strategies that operate on batches of subkernels within one integration kernel. The space of these vectorization strategies is explored systematically from within the code generator in an autotuning procedure. We demonstrate the performance of our vectorization strategies and their implementation by providing measurements on the Intel Haswell and Intel Skylake architectures. We present numbers for the diffusion-reaction equation, the Stokes equations and Maxwell’s equations, achieving up to 40% of the machine’s theoretical floating point performance for an application of the DG operator.
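
    As a concrete illustration of sum factorization, the sketch below (plain C++, not the generated DUNE kernels) interpolates the coefficients of a 2D tensor-product basis to the tensor-product quadrature points as two successive 1D contractions. With n basis functions and quadrature points per direction, this costs O(n^3) operations per cell instead of the O(n^4) of a naive evaluation, which is the complexity gain exploited above.

```cpp
// Sum-factorized interpolation of 2D tensor-product coefficients to
// quadrature points: two 1D contractions instead of one full 2D evaluation.
// Generic illustration, not the generated DUNE/PDELab kernels.
#include <vector>

// B1d(q, i): value of 1D basis function i at 1D quadrature point q,
// stored row-major as an (n_q x n_b) matrix.
// coeffs: n_b * n_b coefficients u(i, j), index j fastest.
// Returns values at the n_q * n_q tensor-product quadrature points.
std::vector<double> interpolate_2d(const std::vector<double>& B1d,
                                   int n_q, int n_b,
                                   const std::vector<double>& coeffs) {
  // Pass 1: contract the x-direction index j: tmp(i, qx) = sum_j u(i,j) B(qx,j)
  std::vector<double> tmp(n_b * n_q, 0.0);
  for (int i = 0; i < n_b; ++i)
    for (int qx = 0; qx < n_q; ++qx)
      for (int j = 0; j < n_b; ++j)
        tmp[i * n_q + qx] += coeffs[i * n_b + j] * B1d[qx * n_b + j];

  // Pass 2: contract the y-direction index i: out(qy, qx) = sum_i B(qy,i) tmp(i,qx)
  std::vector<double> out(n_q * n_q, 0.0);
  for (int qy = 0; qy < n_q; ++qy)
    for (int qx = 0; qx < n_q; ++qx)
      for (int i = 0; i < n_b; ++i)
        out[qy * n_q + qx] += B1d[qy * n_b + i] * tmp[i * n_q + qx];

  return out;   // O(n^3) work per cell instead of O(n^4)
}
```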

    FlashX: Massive Data Analysis Using Fast I/O

    With the explosion of data and the increasing complexity of data analysis, large-scale data analysis imposes significant challenges in systems design. While current research focuses on scaling out to large clusters, these scale-out solutions introduce a significant amount of overhead. This thesis is motivated by the advance of new I/O technologies such as flash memory. Instead of scaling out, we explore efficient system designs in a single commodity machine with non-uniform memory architecture (NUMA) and scale to large datasets by utilizing commodity solid-state drives (SSDs). This thesis explores the impact of the new I/O technologies on large-scale data analysis. Instead of implementing individual data analysis algorithms for SSDs, we develop a data analysis ecosystem called FlashX to target a large range of data analysis tasks. FlashX includes three subsystems: SAFS, FlashGraph and FlashMatrix. SAFS is a user-space filesystem optimized for a large SSD array to deliver maximal I/O throughput from SSDs. FlashGraph is a general-purpose graph analysis framework that processes graphs in a semi-external memory fashion, i.e., keeping vertex state in memory and edges on SSDs, and scales to graphs with billions of vertices by utilizing SSDs through SAFS. FlashMatrix is a matrix-oriented programming framework that supports both sparse matrices and dense matrices for general data analysis. Similar to FlashGraph, it scales matrix operations beyond memory capacity by utilizing SSDs. We demonstrate that with current I/O technologies, FlashGraph and FlashMatrix in (semi-)external memory meet or even exceed state-of-the-art in-memory data analysis frameworks while scaling to massive datasets for a large variety of data analysis tasks.
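
    The semi-external-memory idea (vertex state in RAM, edges streamed from SSDs) can be sketched as follows. The flat binary edge-list format and the function below are hypothetical simplifications; the real FlashGraph engine performs asynchronous, page-cached I/O through SAFS rather than synchronous block reads.

```cpp
// Rough sketch of semi-external-memory graph processing: per-vertex state
// lives in RAM, the edge list is streamed from SSD in blocks. The file format
// (a flat binary list of (src, dst) uint32 pairs) is a hypothetical stand-in
// for FlashGraph's SAFS-backed storage.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Edge { std::uint32_t src, dst; };

// One streaming pass that counts in-degrees for all vertices.
std::vector<std::uint64_t> in_degrees(const char* edge_file,
                                      std::size_t num_vertices) {
  std::vector<std::uint64_t> degree(num_vertices, 0);   // in-memory vertex state
  std::vector<Edge> block(1 << 20);                     // ~8 MiB edge buffer

  std::FILE* f = std::fopen(edge_file, "rb");
  if (!f) return degree;
  std::size_t n;
  while ((n = std::fread(block.data(), sizeof(Edge), block.size(), f)) > 0) {
    for (std::size_t i = 0; i < n; ++i)                 // update vertex state
      if (block[i].dst < num_vertices)
        ++degree[block[i].dst];
  }
  std::fclose(f);
  return degree;
}
```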

    A Parallel Geometric Multigrid Method for Adaptive Finite Elements

    Applications in a variety of scientific disciplines use systems of Partial Differential Equations (PDEs) to model physical phenomena. Numerical solutions to these models are often found using the Finite Element Method (FEM), where the problem is discretized and the solution of a large linear system is required, containing millions or even billions of unknowns. Oftentimes, the domain of these solves will contain localized features that require very high resolution of the underlying finite element mesh to resolve accurately, while a mesh with uniform resolution would require far too much computational time and memory to be feasible on a modern machine. Therefore, techniques like adaptive mesh refinement, where one increases the resolution of the mesh only where it is necessary, must be used. Even with adaptive mesh refinement, these systems can still contain far more than a million unknowns (large mantle convection applications like the ones in [90] show simulations with over 600 billion unknowns), and attempting to solve them on a single processing unit is infeasible due to the computational time and memory required. For this reason, any application code aimed at solving large problems must be built on a parallel framework, allowing the concurrent use of multiple processing units to solve a single problem, and the code must exhibit efficient scaling to large numbers of processing units. Multigrid methods are currently the only known optimal solvers for linear systems arising from discretizations of elliptic boundary value problems. These methods can be represented as an iterative scheme with contraction number less than one, independent of the resolution of the discretization [24, 54, 25, 103], and with optimal complexity in the number of unknowns in the system [29]. Geometric multigrid (GMG) methods, where the hierarchy of spaces is defined by linear systems of finite element discretizations on meshes of decreasing resolution, have been shown to be robust for many different problem formulations, giving mesh-independent convergence for highly adaptive meshes [26, 61, 83, 18], but these methods require specific implementations for each type of equation, boundary condition, mesh, etc., required by the specific application. The implementation in a massively parallel environment is not obvious, and research into this topic is far from exhaustive. We present an implementation of a massively parallel, adaptive geometric multigrid (GMG) method in the open-source finite element library deal.II [5], and perform extensive tests showing scaling of the v-cycle application on systems with up to 137 billion unknowns run on up to 65,536 processors, demonstrating the low communication overhead of the proposed algorithms. We then show the flexibility of the GMG by applying the method to four different PDE systems: the Poisson equation, linear elasticity, advection-diffusion, and the Stokes equations. For the Stokes equations, we implement a fully matrix-free, adaptive, GMG-based solver in the mantle convection code ASPECT [13], and give a comparison to the current matrix-based method used. We show improvements in robustness, parallel scaling, and memory consumption for simulations with up to 27 billion unknowns and 114,688 processors.
    Finally, we test the performance of IDR(s) methods compared to the FGMRES method currently used in ASPECT, showing the effects of the flexible preconditioning used for the Stokes solves in ASPECT and demonstrating the possible reduction in memory consumption with IDR(s) and the potential for solving large-scale problems. Parts of the work in this thesis have been submitted to peer-reviewed journals in the form of two publications ([36] and [34]), and the implementations discussed have been integrated into two open-source codes, deal.II and ASPECT. From the contributions to deal.II, including a full-length tutorial program, Step-63 [35], the author is listed as a contributing author to the newest deal.II release (see [5]). The implementation in ASPECT is based on work by the author and Timo Heister. The goal of this work is to enable the community of geoscientists using ASPECT to solve larger problems than currently possible. Over the course of this thesis, the author was partially funded by NSF Award OAC-1835452 and by the Computational Infrastructure for Geodynamics initiative (CIG), through the NSF under Awards EAR-0949446 and EAR-1550901, and The University of California -- Davis.
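
    For readers unfamiliar with the v-cycle referred to above, the following is a minimal recursive sketch (generic C++ with a hypothetical per-level interface, not the deal.II/ASPECT implementation): pre-smooth, restrict the residual to the coarser level, recurse, prolongate the correction, and post-smooth.

```cpp
// Minimal geometric multigrid V-cycle sketch. The Level interface is a
// hypothetical placeholder that hides all deal.II/ASPECT specifics
// (matrix-free operator application, parallel transfer, adaptive-mesh edges).
#include <cstddef>
#include <vector>

struct Level {
  virtual void smooth(std::vector<double>& u, const std::vector<double>& b) const = 0;
  virtual std::vector<double> residual(const std::vector<double>& u,
                                       const std::vector<double>& b) const = 0;
  virtual std::vector<double> restrict_to_coarse(const std::vector<double>& r) const = 0;
  virtual std::vector<double> prolongate(const std::vector<double>& e) const = 0;
  virtual void coarse_solve(std::vector<double>& u, const std::vector<double>& b) const = 0;
  virtual ~Level() = default;
};

// levels[0] is the coarsest grid; l indexes the current level.
void v_cycle(const std::vector<const Level*>& levels, std::size_t l,
             std::vector<double>& u, const std::vector<double>& b) {
  if (l == 0) { levels[0]->coarse_solve(u, b); return; }
  const Level& L = *levels[l];

  L.smooth(u, b);                                        // pre-smoothing
  std::vector<double> r_c = L.restrict_to_coarse(L.residual(u, b));
  std::vector<double> e_c(r_c.size(), 0.0);
  v_cycle(levels, l - 1, e_c, r_c);                      // coarse-grid correction
  std::vector<double> e = L.prolongate(e_c);
  for (std::size_t i = 0; i < u.size(); ++i) u[i] += e[i];
  L.smooth(u, b);                                        // post-smoothing
}
```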