438 research outputs found
Efficient Explicit Time Stepping of High Order Discontinuous Galerkin Schemes for Waves
This work presents algorithms for the efficient implementation of
discontinuous Galerkin methods with explicit time stepping for acoustic wave
propagation on unstructured meshes of quadrilaterals or hexahedra. A crucial
step towards efficiency is to evaluate operators in a matrix-free way with
sum-factorization kernels. The method allows for general curved geometries and
variable coefficients. Temporal discretization is carried out by low-storage
explicit Runge-Kutta schemes and the arbitrary derivative (ADER) method. For
ADER, we propose a flexible basis change approach that combines cheap face
integrals with cell evaluation using collocated nodes and quadrature points.
Additionally, a degree reduction for the optimized cell evaluation is presented
to decrease the computational cost when evaluating higher order spatial
derivatives as required in ADER time stepping. We analyze and compare the
performance of state-of-the-art Runge-Kutta schemes and ADER time stepping with
the proposed optimizations. ADER involves fewer operations and additionally
reaches higher throughput by higher arithmetic intensities and hence decreases
the required computational time significantly. Comparison of Runge-Kutta and
ADER at their respective CFL stability limit renders ADER especially beneficial
for higher orders when the Butcher barrier implies an overproportional amount
of stages. Moreover, vector updates in explicit Runge-Kutta schemes are shown
to take a substantial amount of the computational time due to their memory
intensity.
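The sum-factorization idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 2D setting, and the use of one dense 1D matrix `A` in both directions are assumptions made for illustration. The sketch applies a tensor-product operator (the Kronecker product of `A` with itself) to a vector, once by the naive quadruple loop and once by two one-dimensional sweeps, which lowers the cost from O(n^4) to O(n^3) in 2D.

```python
def apply_kron_naive(A, u):
    """Apply kron(A, A) to u (row-major, u[k*n + l]) with O(n^4) work."""
    n = len(A)
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                for l in range(n):
                    s += A[i][k] * A[j][l] * u[k * n + l]
            out[i * n + j] = s
    return out


def apply_sum_factorized(A, u):
    """Same result via two 1D sweeps, O(n^3) work (hypothetical helper)."""
    n = len(A)
    # Sweep 1: contract the second index l with A.
    tmp = [0.0] * (n * n)
    for k in range(n):
        for j in range(n):
            tmp[k * n + j] = sum(A[j][l] * u[k * n + l] for l in range(n))
    # Sweep 2: contract the first index k with A.
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            out[i * n + j] = sum(A[i][k] * tmp[k * n + j] for k in range(n))
    return out


A = [[1.0, 2.0], [3.0, 4.0]]
u = [1.0, 0.0, 0.0, 1.0]
print(apply_kron_naive(A, u))
print(apply_sum_factorized(A, u))
```

Both calls return the same vector. In three dimensions the same trick uses three sweeps and reduces the work from O(n^6) to O(n^4).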
A matrix-free high-order discontinuous Galerkin compressible Navier-Stokes solver: A performance comparison of compressible and incompressible formulations for turbulent incompressible flows
Both compressible and incompressible Navier-Stokes solvers can be used and
are used to solve incompressible turbulent flow problems. In the compressible
case, the Mach number is then considered as a solver parameter that is set to a
small value in order to mimic incompressible flows.
This strategy is widely used for high-order discontinuous Galerkin
discretizations of the compressible Navier-Stokes equations. The present work
raises the question regarding the computational efficiency of compressible DG
solvers as compared to a genuinely incompressible formulation. Our
contributions to the state-of-the-art are twofold: Firstly, we present a
high-performance discontinuous Galerkin solver for the compressible
Navier-Stokes equations based on a highly efficient matrix-free implementation
that targets modern cache-based multicore architectures. The performance
results presented in this work focus on the node-level performance and our
results suggest that there is great potential for further performance
improvements for current state-of-the-art discontinuous Galerkin
implementations of the compressible Navier-Stokes equations. Secondly, this
compressible Navier-Stokes solver is put into perspective by comparing it to an
incompressible DG solver that uses the same matrix-free implementation. We
discuss algorithmic differences between both solution strategies and present an
in-depth numerical investigation of the performance. The considered benchmark
test cases are the three-dimensional Taylor-Green vortex problem as a
representative of transitional flows and the turbulent channel flow problem as
a representative of wall-bounded turbulent flows.
Reproducibility, accuracy and performance of the Feltor code and library on parallel computer architectures
Feltor is a modular and free scientific software package. It allows
developing platform independent code that runs on a variety of parallel
computer architectures ranging from laptop CPUs to multi-GPU distributed memory
systems. Feltor consists of both a numerical library and a collection of
application codes built on top of the library. Its main targets are two- and
three-dimensional drift- and gyro-fluid simulations with discontinuous Galerkin
methods as the main numerical discretization technique. We observe that
numerical simulations of a recently developed gyro-fluid model produce
non-deterministic results in parallel computations. First, we show how we
restore accuracy and bitwise reproducibility algorithmically and
programmatically. In particular, we adopt an implementation of the exactly
rounded dot product based on long accumulators, which avoids accuracy losses
especially in parallel applications. However, reproducibility and accuracy
alone fail to indicate correct simulation behaviour. In fact, in the physical
model slightly different initial conditions lead to vastly different end
states. This behaviour translates to its numerical representation. Pointwise
convergence, even in principle, becomes impossible for long simulation times.
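
The reproducibility issue behind the exactly rounded dot product can be shown with a minimal sketch. It does not use the long-accumulator implementation the abstract refers to. As a stand-in, it accumulates exactly with Python's `fractions.Fraction` and rounds once at the end, which gives the same order-independent result at much higher cost. The function names are hypothetical.

```python
from fractions import Fraction


def naive_dot(x, y):
    """Floating-point dot product; the result depends on summation order."""
    s = 0.0
    for a, b in zip(x, y):
        s += a * b
    return s


def exact_dot(x, y):
    """Accumulate exactly in rational arithmetic, then round once to a float."""
    acc = Fraction(0)
    for a, b in zip(x, y):
        acc += Fraction(a) * Fraction(b)
    return float(acc)


y = [1.0, 1.0, 1.0]
x_order_a = [1e16, 1.0, -1e16]
x_order_b = [1e16, -1e16, 1.0]  # same terms, different order

print(naive_dot(x_order_a, y), naive_dot(x_order_b, y))  # 0.0 1.0
print(exact_dot(x_order_a, y), exact_dot(x_order_b, y))  # 1.0 1.0
```

A parallel reduction changes the summation order from run to run, so the naive variant can differ between runs while the exactly rounded variant cannot.
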
In a second part, we explore important performance tuning considerations. We
identify latency and memory bandwidth as the main performance indicators of our
routines. Based on these, we propose a parallel performance model that predicts
the execution time of algorithms implemented in Feltor and test our model on a
selection of parallel hardware architectures. We are able to predict the
execution time with a relative error of less than 25% for problem sizes between
0.1 and 1000 MB. Finally, we find that the product of latency and bandwidth
gives a minimum array size per compute node to achieve a scaling efficiency
above 50% (both strong and weak).
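The role of the latency-bandwidth product can be made concrete with a minimal sketch. It assumes the simplest possible model, time = latency + size / bandwidth, which is an assumption for illustration and not necessarily the performance model used in the paper. The function names are likewise hypothetical.

```python
def predicted_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Assumed model: fixed latency plus bandwidth-limited transfer time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def efficiency(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the predicted time spent moving data rather than waiting."""
    transfer = size_bytes / bandwidth_bytes_per_s
    return transfer / predicted_time(size_bytes, latency_s, bandwidth_bytes_per_s)


def min_size_for_half_efficiency(latency_s, bandwidth_bytes_per_s):
    """Size at which transfer time equals latency, so efficiency is 50%."""
    return latency_s * bandwidth_bytes_per_s


# Illustrative numbers only, not measurements from the paper.
latency = 0.5      # seconds
bandwidth = 8.0    # bytes per second
size = min_size_for_half_efficiency(latency, bandwidth)
print(size, efficiency(size, latency, bandwidth))
```

Under this model, arrays smaller than latency times bandwidth are dominated by latency, and larger arrays are dominated by bandwidth, which is consistent with the threshold stated in the abstract.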
GPU-accelerated discontinuous Galerkin methods on hybrid meshes
We present a time-explicit discontinuous Galerkin (DG) solver for the
time-domain acoustic wave equation on hybrid meshes containing vertex-mapped
hexahedral, wedge, pyramidal and tetrahedral elements. Discretely energy-stable
formulations are presented for both Gauss-Legendre and Gauss-Legendre-Lobatto
(Spectral Element) nodal bases for the hexahedron. Stable timestep restrictions
for hybrid meshes are derived by bounding the spectral radius of the DG
operator using order-dependent constants in trace and Markov inequalities.
Computational efficiency is achieved under a combination of element-specific
kernels (including new quadrature-free operators for the pyramid), multi-rate
timestepping, and acceleration using Graphics Processing Units.
Comment: Submitted to CMAM
Doctor of Philosophy dissertation
Memory access irregularities are a major bottleneck for bandwidth-limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lockstep may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which requires runtime information to be evaluated. Compile-time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms. The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts. Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods. Vectorization widths can have a significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing. It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices.
Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows for dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented.
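The general idea of in-place updates through controlled memory segmentation can be sketched as follows. This is not the DCSR format from the dissertation, whose layout is not described in the abstract. It is a hypothetical CSR-like structure, here called `SlackCSR`, in which every row owns a fixed-capacity segment so that new entries can be appended without rebuilding or sorting the matrix.

```python
class SlackCSR:
    """CSR-like storage with spare capacity per row (hypothetical sketch)."""

    def __init__(self, n_rows, row_capacity):
        self.cap = row_capacity
        self.cols = [-1] * (n_rows * row_capacity)   # column index per slot
        self.vals = [0.0] * (n_rows * row_capacity)  # value per slot
        self.row_len = [0] * n_rows                  # used slots per row

    def add(self, i, j, v):
        """Add v to entry (i, j) in place, appending a slot if needed."""
        start = i * self.cap
        for p in range(start, start + self.row_len[i]):
            if self.cols[p] == j:
                self.vals[p] += v
                return
        if self.row_len[i] == self.cap:
            # A real dynamic format would allocate a further segment here.
            raise RuntimeError("row segment full")
        p = start + self.row_len[i]
        self.cols[p] = j
        self.vals[p] = v
        self.row_len[i] += 1

    def matvec(self, x):
        """Sparse matrix-vector product over the used slots of each row."""
        y = [0.0] * len(self.row_len)
        for i, n in enumerate(self.row_len):
            start = i * self.cap
            y[i] = sum(self.vals[p] * x[self.cols[p]]
                       for p in range(start, start + n))
        return y


m = SlackCSR(n_rows=2, row_capacity=2)
m.add(0, 0, 2.0)
m.add(1, 1, 3.0)
m.add(0, 1, 1.0)   # streamed in later, no rebuild required
m.add(0, 0, 1.0)   # accumulates into the existing entry
print(m.matvec([1.0, 2.0]))
```

The trade-off in this sketch is unused slack memory in exchange for cheap insertions. How DCSR manages its segments is not stated in the abstract.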
Efficiency of high-performance discontinuous Galerkin spectral element methods for under-resolved turbulent incompressible flows
The present paper addresses the numerical solution of turbulent flows with
high-order discontinuous Galerkin methods for discretizing the incompressible
Navier-Stokes equations. The efficiency of high-order methods when applied to
under-resolved problems is an open issue in the literature. This topic is carefully
investigated in the present work by the example of the 3D Taylor-Green vortex
problem. Our implementation is based on a generic high-performance framework
for matrix-free evaluation of finite element operators with one of the best
realizations currently known. We present a methodology to systematically
analyze the efficiency of the incompressible Navier-Stokes solver for high
polynomial degrees. Due to the absence of optimal rates of convergence in the
under-resolved regime, our results reveal that demonstrating improved
efficiency of high-order methods is a challenging task and that optimal
computational complexity of solvers, preconditioners, and matrix-free
implementations are necessary ingredients to achieve the goal of better
solution quality at the same computational costs already for a geometrically
simple problem such as the Taylor-Green vortex. Although the analysis is
performed for a Cartesian geometry, our approach is generic and can be applied
to arbitrary geometries. We present excellent performance numbers on modern,
cache-based computer architectures achieving a throughput for operator
evaluation of 3e8 up to 1e9 DoFs/sec on one Intel Haswell node with 28 cores.
Compared to performance results published within the last 5 years for
high-order DG discretizations of the compressible Navier-Stokes equations, our
approach reduces computational costs by more than one order of magnitude for
the same setup.