Efficient GPU Implementation of Automatic Differentiation for Computational Fluid Dynamics

Abstract

Many scientific and engineering applications require repeated calculation of derivatives of output functions with respect to input parameters. Automatic Differentiation (AD) is a methodology that automates derivative calculation and can significantly speed up code development. In Computational Fluid Dynamics (CFD), derivatives of flux functions with respect to state variables (the flux Jacobian) are needed for efficient solution of the nonlinear governing equations. AD of the flux function on graphics processing units (GPUs) is challenging because the flux computation involves many intermediate variables, which creates high register pressure, and requires significant memory traffic to store the derivatives. This paper presents a forward-mode AD method based on multivariate dual numbers that addresses these challenges and simultaneously reduces the operation count. The dimension of the multivariate dual numbers is optimized for performance. The flux computations are restructured to minimize the number of temporary variables and reduce the register pressure. For effective utilization of the memory bandwidth, shared memory is used to store the local flux Jacobian: the threads assigned to process a flux differentiation at an edge (dual face) collectively populate the local Jacobian in shared memory, which further reduces the number of temporary variables. The local Jacobian is then written from shared memory to device memory using coalesced stores, another major benefit of the shared-memory approach. During this work, we assessed existing GPU-based forward-mode AD approaches for flux Jacobian computation and found that they perform suboptimally. We demonstrate that our GPU implementation based on multivariate dual numbers of dimension 5 outperforms the other tested implementations, including a hand-differentiated version optimized for the NVIDIA V100 GPU. Our implementation achieves 75% of the peak floating-point throughput and 61% of the peak device bandwidth on the V100.
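
To make the multivariate dual-number approach concrete, the following is a minimal CUDA sketch of a dimension-5 dual type: the value field carries the primal computation while the five derivative slots propagate all partials with respect to the five conserved state variables in a single flux evaluation. The type and member names (Dual5, val, dot, seed) are illustrative assumptions, not the paper's actual implementation.

// Minimal sketch of a multivariate dual number of dimension 5, assuming the
// five conserved state variables of 3-D compressible flow. Names are
// illustrative, not the paper's API.
struct Dual5 {
    double val;     // primal value
    double dot[5];  // partial derivatives w.r.t. the 5 state variables

    __host__ __device__ Dual5(double v = 0.0) : val(v) {
        for (int i = 0; i < 5; ++i) dot[i] = 0.0;
    }
    // Seed the k-th derivative direction: d(x_k)/d(x_k) = 1.
    __host__ __device__ static Dual5 seed(double v, int k) {
        Dual5 d(v);
        d.dot[k] = 1.0;
        return d;
    }
};

// Sum rule: (a + b)' = a' + b'.
__host__ __device__ inline Dual5 operator+(const Dual5& a, const Dual5& b) {
    Dual5 r(a.val + b.val);
    for (int i = 0; i < 5; ++i) r.dot[i] = a.dot[i] + b.dot[i];
    return r;
}

// Product rule: (a * b)' = a' * b + a * b'.
__host__ __device__ inline Dual5 operator*(const Dual5& a, const Dual5& b) {
    Dual5 r(a.val * b.val);
    for (int i = 0; i < 5; ++i) r.dot[i] = a.dot[i] * b.val + a.val * b.dot[i];
    return r;
}

Seeding each of the five state variables in its own derivative slot and evaluating the flux once in Dual5 arithmetic yields the full 5x5 local Jacobian: the dot array of each flux component holds one Jacobian row. Dimension 5 plausibly matches the five conserved variables (density, three momentum components, energy), which would explain why it is the performant choice reported in the abstract.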
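
The shared-memory staging and coalesced write-back described above can be sketched as follows, assuming one thread block per edge and a toy diagonal flux standing in for the real dual-number flux evaluation. The kernel name, the N_VARS macro, and the data layout are assumptions for illustration, not the paper's actual kernel.

#define N_VARS 5

// One thread block processes one edge (dual face): threads collectively
// populate the 5x5 local flux Jacobian in shared memory, then flush it to
// device memory with coalesced stores.
__global__ void fluxJacobianKernel(const double* __restrict__ state,
                                   double* __restrict__ jacobian,
                                   int numEdges)
{
    __shared__ double J[N_VARS * N_VARS];

    int edge = blockIdx.x;
    if (edge >= numEdges) return;  // whole block exits together, sync is safe

    // Each thread owns one derivative direction (one Jacobian column). A toy
    // flux F_row(U) = 0.5 * U[row]^2 stands in for the dual-number flux
    // evaluation; its Jacobian is diagonal: dF_row/dU_col = U[row] * (row == col).
    int col = threadIdx.x;
    if (col < N_VARS) {
        for (int row = 0; row < N_VARS; ++row) {
            double u = state[edge * N_VARS + row];
            J[row * N_VARS + col] = (row == col) ? u : 0.0;
        }
    }
    __syncthreads();

    // Coalesced flush: consecutive threads store to consecutive addresses.
    for (int i = threadIdx.x; i < N_VARS * N_VARS; i += blockDim.x) {
        jacobian[edge * N_VARS * N_VARS + i] = J[i];
    }
}

A launch such as fluxJacobianKernel<<<numEdges, 32>>>(state, jacobian, numEdges) assigns one warp per edge. Because the assembled Jacobian leaves shared memory in contiguous chunks, the stores to device memory coalesce, which is the bandwidth benefit the abstract highlights.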