438 research outputs found

    Efficient Explicit Time Stepping of High Order Discontinuous Galerkin Schemes for Waves

    This work presents algorithms for the efficient implementation of discontinuous Galerkin methods with explicit time stepping for acoustic wave propagation on unstructured meshes of quadrilaterals or hexahedra. A crucial step towards efficiency is to evaluate operators in a matrix-free way with sum-factorization kernels. The method allows for general curved geometries and variable coefficients. Temporal discretization is carried out by low-storage explicit Runge-Kutta schemes and the arbitrary derivative (ADER) method. For ADER, we propose a flexible basis change approach that combines cheap face integrals with cell evaluation using collocated nodes and quadrature points. Additionally, a degree reduction for the optimized cell evaluation is presented to decrease the computational cost when evaluating higher order spatial derivatives as required in ADER time stepping. We analyze and compare the performance of state-of-the-art Runge-Kutta schemes and ADER time stepping with the proposed optimizations. ADER involves fewer operations and additionally reaches higher throughput through higher arithmetic intensity, and hence decreases the required computational time significantly. Comparing Runge-Kutta and ADER at their respective CFL stability limits shows ADER to be especially beneficial for higher orders, where the Butcher barrier implies a disproportionately large number of stages. Moreover, vector updates in explicit Runge-Kutta schemes are shown to take a substantial share of the computational time due to their memory intensity.
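
    The sum-factorization idea mentioned in this abstract can be illustrated with a small sketch. It is not taken from the paper: the function names are hypothetical, the example is restricted to two dimensions, and plain Python lists stand in for the vectorized kernels a real implementation would use. The point is only that a tensor-product operator can be applied as a sequence of one-dimensional sweeps rather than as one large dense operator.

```python
# Illustrative sketch only; names and the 2D setting are assumptions,
# not the implementation described in the abstract.

def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(M):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*M)]


def apply_naive(Ax, Ay, U):
    """V[i][j] = sum_{k,l} Ay[i][k] * Ax[j][l] * U[k][l], evaluated directly.

    Costs O(n^4) operations for an n x n coefficient array.
    """
    n = len(U)
    return [[sum(Ay[i][k] * Ax[j][l] * U[k][l]
                 for k in range(n) for l in range(n))
             for j in range(n)]
            for i in range(n)]


def apply_sum_factorized(Ax, Ay, U):
    """Same result as apply_naive, computed as two 1D sweeps in O(n^3)."""
    tmp = matmul(U, transpose(Ax))  # sweep along the second index
    return matmul(Ay, tmp)          # sweep along the first index


if __name__ == "__main__":
    Ax = [[1, 2], [3, 4]]
    Ay = [[0, 1], [1, 1]]
    U = [[1, 0], [2, 5]]
    print(apply_sum_factorized(Ax, Ay, U))
    print(apply_naive(Ax, Ay, U))
```

    In d dimensions with n points per direction, the direct evaluation scales as n^(2d) while the sequence of one-dimensional sweeps scales as d·n^(d+1), which is why the technique matters most at high polynomial degree.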

    A matrix-free high-order discontinuous Galerkin compressible Navier-Stokes solver: A performance comparison of compressible and incompressible formulations for turbulent incompressible flows

    Both compressible and incompressible Navier-Stokes solvers can be used and are used to solve incompressible turbulent flow problems. In the compressible case, the Mach number is then considered as a solver parameter that is set to a small value, M ≈ 0.1, in order to mimic incompressible flows. This strategy is widely used for high-order discontinuous Galerkin discretizations of the compressible Navier-Stokes equations. The present work raises the question regarding the computational efficiency of compressible DG solvers as compared to a genuinely incompressible formulation. Our contributions to the state-of-the-art are twofold: Firstly, we present a high-performance discontinuous Galerkin solver for the compressible Navier-Stokes equations based on a highly efficient matrix-free implementation that targets modern cache-based multicore architectures. The performance results presented in this work focus on the node-level performance and our results suggest that there is great potential for further performance improvements for current state-of-the-art discontinuous Galerkin implementations of the compressible Navier-Stokes equations. Secondly, this compressible Navier-Stokes solver is put into perspective by comparing it to an incompressible DG solver that uses the same matrix-free implementation. We discuss algorithmic differences between both solution strategies and present an in-depth numerical investigation of the performance. The considered benchmark test cases are the three-dimensional Taylor-Green vortex problem as a representative of transitional flows and the turbulent channel flow problem as a representative of wall-bounded turbulent flows.

    Reproducibility, accuracy and performance of the Feltor code and library on parallel computer architectures

    Feltor is a modular and free scientific software package. It allows developing platform independent code that runs on a variety of parallel computer architectures ranging from laptop CPUs to multi-GPU distributed memory systems. Feltor consists of both a numerical library and a collection of application codes built on top of the library. Its main targets are two- and three-dimensional drift- and gyro-fluid simulations with discontinuous Galerkin methods as the main numerical discretization technique. We observe that numerical simulations of a recently developed gyro-fluid model produce non-deterministic results in parallel computations. First, we show how we restore accuracy and bitwise reproducibility algorithmically and programmatically. In particular, we adopt an implementation of the exactly rounded dot product based on long accumulators, which avoids accuracy losses especially in parallel applications. However, reproducibility and accuracy alone fail to indicate correct simulation behaviour. In fact, in the physical model slightly different initial conditions lead to vastly different end states. This behaviour translates to its numerical representation. Pointwise convergence, even in principle, becomes impossible for long simulation times. In a second part, we explore important performance tuning considerations. We identify latency and memory bandwidth as the main performance indicators of our routines. Based on these, we propose a parallel performance model that predicts the execution time of algorithms implemented in Feltor and test our model on a selection of parallel hardware architectures. We are able to predict the execution time with a relative error of less than 25% for problem sizes between 0.1 and 1000 MB. Finally, we find that the product of latency and bandwidth gives a minimum array size per compute node to achieve a scaling efficiency above 50% (both strong and weak).
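
    The closing remark about latency and bandwidth can be made concrete with a small sketch. The abstract does not state the model's formula, so the linear form below (execution time = latency + data size / bandwidth) is an assumption, as are all function names and the hardware numbers. Under that assumption, the fraction of time spent moving data reaches exactly 50% when the array size equals latency times bandwidth.

```python
# Hypothetical latency-bandwidth model; the linear form and all names are
# assumptions for illustration, not Feltor's actual performance model.

def predicted_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Estimated execution time: fixed latency plus data-transfer time."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def efficiency(nbytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the predicted time spent on useful data transfer."""
    transfer = nbytes / bandwidth_bytes_per_s
    return transfer / predicted_time(nbytes, latency_s, bandwidth_bytes_per_s)


def min_array_size(latency_s, bandwidth_bytes_per_s):
    """Array size in bytes at which transfer time equals latency."""
    return latency_s * bandwidth_bytes_per_s


if __name__ == "__main__":
    latency = 10e-6    # 10 microseconds, made-up value
    bandwidth = 100e9  # 100 GB/s, made-up value
    n_half = min_array_size(latency, bandwidth)
    for nbytes in (0.1 * n_half, n_half, 10 * n_half):
        print(f"{nbytes:12.0f} bytes -> efficiency "
              f"{efficiency(nbytes, latency, bandwidth):.2f}")
```

    With these made-up numbers the threshold is about 1 MB per node; arrays well below it are latency-dominated, and arrays well above it approach the bandwidth limit.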

    GPU-accelerated discontinuous Galerkin methods on hybrid meshes

    We present a time-explicit discontinuous Galerkin (DG) solver for the time-domain acoustic wave equation on hybrid meshes containing vertex-mapped hexahedral, wedge, pyramidal and tetrahedral elements. Discretely energy-stable formulations are presented for both Gauss-Legendre and Gauss-Legendre-Lobatto (Spectral Element) nodal bases for the hexahedron. Stable timestep restrictions for hybrid meshes are derived by bounding the spectral radius of the DG operator using order-dependent constants in trace and Markov inequalities. Computational efficiency is achieved under a combination of element-specific kernels (including new quadrature-free operators for the pyramid), multi-rate timestepping, and acceleration using Graphics Processing Units. Comment: Submitted to CMAM.

    Doctor of Philosophy

    Memory access irregularities are a major bottleneck for bandwidth limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lock step may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which requires runtime information to be evaluated. Compile time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms. The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts. Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods. Vectorization widths can have significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing. It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices. Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows for dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented.

    Efficiency of high-performance discontinuous Galerkin spectral element methods for under-resolved turbulent incompressible flows

    The present paper addresses the numerical solution of turbulent flows with high-order discontinuous Galerkin methods for discretizing the incompressible Navier-Stokes equations. The efficiency of high-order methods when applied to under-resolved problems is an open issue in the literature. This topic is carefully investigated in the present work using the example of the 3D Taylor-Green vortex problem. Our implementation is based on a generic high-performance framework for matrix-free evaluation of finite element operators with one of the best realizations currently known. We present a methodology to systematically analyze the efficiency of the incompressible Navier-Stokes solver for high polynomial degrees. Due to the absence of optimal rates of convergence in the under-resolved regime, our results reveal that demonstrating improved efficiency of high-order methods is a challenging task and that optimal computational complexity of solvers, preconditioners, and matrix-free implementations are necessary ingredients to achieve the goal of better solution quality at the same computational costs already for a geometrically simple problem such as the Taylor-Green vortex. Although the analysis is performed for a Cartesian geometry, our approach is generic and can be applied to arbitrary geometries. We present excellent performance numbers on modern, cache-based computer architectures achieving a throughput for operator evaluation of 3e8 up to 1e9 DoFs/sec on one Intel Haswell node with 28 cores. Compared to performance results published within the last 5 years for high-order DG discretizations of the compressible Navier-Stokes equations, our approach reduces computational costs by more than one order of magnitude for the same setup.