The computation of large-scale virtual acoustics using the 3D finite difference time domain (FDTD) is prohibitively computationally expensive, especially at high audio sample rates, when using traditional CPUs. In recent years the computer gaming industry has driven the development of extremely powerful Graphics Processing Units (GPUs). Through specialised development and tuning we can exploit the highly parallel GPU architecture to make such FDTD computations feasible.This paper describes the simultaneous use of multiple NVIDIA GPUs to compute schemes containing over a billion grid points. We examine the use of asynchronous halo transfers between cards, to hide the latency involved in transferring data,and overall computation time is considered with respect to variation in the size of the partition layers. As hardware memory poses limitations on the size of the room to be rendered, we also investigate the use of single precision arithmetic.This allows twice the domain space, compared with double precision, but results in phase shifting of the output with possible audible artefacts. Using these techniques, large-scale spaces of several thousand cubic metres can be computed at 44.1kHz in a useable time frame, making their use in room acoustics rendering and auralization applications possible in the near future.
INTRODUCTION
High fidelity virtual room acoustics can be approached through direct numerical simulation of wave propagation in a defined space. Unlike ray-based [1] or image source [2] techniques, this approach seeks to model the entire acoustic field within the simulation domain. Three-dimensional Finite Difference Time Domain (FDTD) schemes can be employed, however at audio sample rates such schemes are extremely computationally expensive [3] , prohibitively so for serial computation. Recent advances in graphics processing unit (GPU) architectures allow for general purpose computation to be performed on these forms of highly parallel hardware. Whilst central processing units (CPUs) may contain a small number of cores, such as four or eight , GPUs contain hundreds of processing cores that can be used to perform parallel computation. Using this architecture, the data independence of FDTD schemes can be leveraged to gain significant acceleration over single-threaded implementations [4] , and this allows large-scale simulations to be computed in time scales that are actually useable for performing research.
For scientific computing, Nvidia's Tesla GPUs are typically used in a workstation or compute node that can be configured with four GPUs connected across the same PCIe bus. This paper examines the simultaneous use of these four-GPU systems to render virtual acoustic simulations. This allows greater acceleration of existing models, or the combined use of all available memory across four GPUs to render large-scale domains containing billions of grid points. Recent versions of the CUDA language facilitate this process [5] , without recourse to MPI programming techniques.
The first section details the FDTD schemes being used, followed by an outline of the CUDA programming model for the simultaneous use of multiple GPUs. We then describe the implementation of the schemes using both non-asynchronous and asynchronous approaches. Finally, we detail experimental testing in terms of floating-point precision and overall computation times for various configurations, including large-scale simulations that use maximum memory.
VIRTUAL ACOUSTICS USING FINITE DIFFERENCE METHOD
The starting point for acoustic FDTD simulations is the 3D wave equation, which in second order form is given by:
Here Ψ is the target acoustical field quantity, c is the wave speed in air, ∇ 2 is the 3D Laplacian. Simple first-order boundary conditions are used, where:
Here n is a unit normal to a wall or obstacle, and β is an absorption coefficient. The standard FDTD discretisation leads to the following update equation, which includes boundary loss terms using a single reflection coefficient,
where w l,m,p is the discrete acoustic field, K is 6 in free space, 5 at a face, 4 at an edge and 3 at a corner, The stability condition for the scheme follows from von Neumann analysis [6] , such that for a given time step T the grid spacing X must satisfy:
The basic scheme can be extended to include the effect of viscosity, which gives a frequency dependent damping [4] . Here α is a viscosity coefficient. This leads to an update equation of the form:
Note that this update uses the nearest neighbours from two time steps ago, thus requiring the use of three data grids. The basic scheme, which only uses the centre point from two time steps ago, can be implemented using only two grids and a read, then overwrite procedure. These systems are referred to as the "2 Grid" and "3 Grid" schemes throughout this paper.
PARALLEL COMPUTING AND USE OF MULTIPLE GPUS IN CUDA
CUDA is Nvidia's programming architecture for implementing highly-threaded GPU code. In a serial implementation of equation 3, loops would be used to iterate over the computation domain, applying the update equation to each grid point. In CUDA, we issues a large number of kernel threads that implement the SIMD (Single Instruction Multiple Data) operation, and these threads are scheduled to execute using a large number of parallel processing cores. The program code contains a mixture of host serial C code, and device CUDA code. Whilst the host uses standard CPU memory, the device has multiple memory types. Global memory is the core data store, and is typical in the range of 3 to 6Gb per GPU. CUDA threads make use of local, register memory, and also have a small amount of shared memory per thread block. They can also communicate directly with the global and (read-only) constant memory. Each type has different access speeds, with global memory being the slowest, and shared and constant being fast. With a four GPU server, we have four instances of this memory model which are independent. The GPUs are connected in a pair-wise manner over the PCIe bus, as shown in figure 1. Version four and above of the CUDA architecture contains functionality that allows multiple GPUs that are connected in such a manner to be used concurrently [5] . Peer-to-peer communication allows data transfer between GPUs that bypasses the host altogether (transferring data from device to host, then back to another device is an expensive operation). This can be combined with the use of multiple streams of execution and asynchronous scheduling to achieve scalable speedups when using multiple GPUs.
IMPLEMENTATION OF THREE-DIMENSIONAL FDTD SCHEMES
This section gives a detailed description of the implementation of the basic 2 Grid FDTD scheme with its first-order boundaries. We start with a single GPU version, then extend this to an initial multiple GPU implementation. This is then developed to include the use of asynchronous data transfers of the partition halos.
Single GPU implementation
The CUDA programming model makes use of threads that are grouped first into blocks, and then into a grid. Both of these objects can be one, two, or three-dimensional in shape. Given a three-dimensional data domain, there are many possible approaches to mapping the threading model over the domain. Prior to the FERMI architecture, the standard approach was to utilise shared memory by issuing threads to cover a 2D layer of data. Each thread itself would then iterate over the final dimension, reusing data and shared memory [7] . Post FERMI, the caching system negates the benefits of this approach, and one can simply issue threads that cover the whole data domain [8] . Three-dimensional threads blocks can be used, for example 32 x 4 x 2, and a three-dimensional thread grid placed over the data. The time loop for the simulation contains a single kernel launch, then the input/output is processed, followed by swapping the pointers to the data grids, as shown in listing 1. The kernel keeps the use of conditional statements to a minimum. A layer of non-updated "ghost" points is used around the data domain, and so line 8 employs a conditional to check for this. A single further conditional is used at line 15, to load the coefficients used at a boundary. The logical expression at line 11 computes boundary position in an efficient manner, without the need for a lengthy IF-ELSEIF statement.
Non-asynchronous implementation using multiple GPUs
In transitioning from a single GPU to the use of four GPUs, the data domain needs to be partitioned. The individual GPUs have discrete memory, and so the 3D data needs to be separated into four segments. A further complication is that the FDTD scheme requires neighbouring points in all dimensions, and so overlap halos will be required. The 3D data itself is decomposed using a row-major alignment for each layer of the Z dimension, with consecutive layers in series. In this format, each layer occupies contiguous memory locations. Thus, the most natural partitioning is across the Z dimension, as shown in figure 2 . The overlap halos are individual Z layers, and so can be transferred as a single contiguous block of memory. Whilst the domain partitioning is straightforward, the CUDA code itself requires many extensions compared to the single GPU case. In terms of the pre-time loop setup code, individual commands such as cudaMalloc( ) become embedded in loops over the four GPUs. A call is made to cudaSetDevice( ) at each iteration, to perform the operation on individual GPUs. Single pointers to device memory become arrays of pointers, and constant memory has to be allocated to each GPU. The halo offset locations have to be calculated as linear positions across memory, and finally the peer-to-peer access has to be initialised. In a non-asynchronous implementation, the time loop operates as follows:
1. Loop over the GPUs, issuing a kernel launch to compute the data on that GPU.
Synchronize all GPUs.
3. Perform peer-to-peer data transfer for overlap halos.
4. Perform input/output.
Synchronize and swap data pointers.
Each GPU computes its data simultaneously, but only when all have completed do we then perform the data transfers of the individual overlap halos.
Implementation using asynchronous data transfers
The above implementation contains an inherent time lag, as the GPUs are idle during the data transfers across the PCIe bus. To eliminate this, we can make use of asynchronous behaviour and streams. The approach used is based on that outlined by Nvidia [9] , but is extended here to operate with the large-scale halo layers that occur in the 3D case. An individual Z layer can contain millions of floating-point values, and six layers have to be transferred between GPUs at each time step. A stream is simply a sequence of CUDA events that occur in series. However, multiple streams can be used so that events can execute in a concurrent and asynchronous manner.
As the FDTD scheme is data-independent at each time step, the overlap halo layers can be computed and the data transfers performed at the same time as the larger interior data segments on each GPU. This is accomplished by using one stream of events for the halos and transfers on each GPU, whilst a second stream is used for the interior. The streams are identified in the kernel launches, as shown in the time loop code detailed in listing 3. Initially, we iterate over the GPUs and launch the kernels required for the halos using stream_halo. Note that GPUs 0 and 3 have a single halo, whilst GPUs 1 and 2 contain two halos. Then the main interior data kernels are launched, using stream_int. The data transfer events are then pushed into stream_halo, which will execute when the halos have been computed. In this manner, the data transfers proceed at the same time as the interior computation is being performed. The GPUs are then synchronized before swapping the data pointers.
EXPERIMENTAL TESTING
Initial testing is performed using data grids containing 100 million points each. At double precision this requires 0.8Gb of data per grid (1.6Gb for the whole 2 Grid simulation), and so allows a comparison to be made between single GPU and four GPU implementation using Tesla C2050 GPUs that have 3Gb of global memory. Using a sample rate of 44.1kHz, the domain size is 244m 3 .
Computation times
The 2 Grid scheme is used to compare computation times for the single GPU, basic (non-asynchronous) four GPU, and asynchronous four GPU implementations. The simulations are computed for 4,410 samples in each case, at 44.1kHz, and for both single and double precision floating-point accuracy. Table 1 shows the resulting times. The data grids are of size Nx : 960 points, Ny : 396 points, and Nz : 264 points. The basic four GPU implementation only achieves a speedup of x 2.5, whilst the asynchronous version gets to x3. The grid sizes for these initial tests contain a very large Z layer (960 x 396 = 380,160 points). As six overlap halos of this size have to be transferred between GPUs at each time step, this is still a limiting factor. To test the effect of the Z layer size, the double precision simulation is performed for decreasing sizes whilst keeping the same overall domain size of 244m 3 , ranging from 380,160 points down to 76,032 points. Figure 3 shows the effect in terms of the dimensions of the space. Table 2 shows the timing results. As the Z layer size decreases we get closer to the x4 scalable speedup. 
Floating-point precision
Single precision floating-point variables require 32 bits of memory compared to double precision which requires 64 bits. So, using single precision we can effectively double the size of the computation domain using the same amount of memory. There is also an additional benefit, as GPUs offer greater peak performance at single precision. However, testing on the 100 million point domain reveals stability issues when running at, or very close to, the Courant limit for the scheme. Figure 4 shows the outputs for a 40,000 time step simulation at 44.1kHz using a DC-blocked audio input and grid spacing set at the Courant limit. The single precision output (blue) is stable initially, but shows phase and amplitude differences compared to the double precision (red). After 30,000 samples, the single precision output begins to diverge, and finally becomes unstable after 40,000 samples. Backing away from the Courant limit by around 0.05% ensures stability in single precision, at the cost of introducing greater dispersion. 
LARGE-SCALE ACOUSTIC SIMULATIONS
Having detailed the efficiency of the asynchronous four GPU implementation, we can now consider the use of maximum available memory to perform large-scale simulations. Nvidia's Tesla GPUs come with various amounts of global memory, and so table 3 shows the maximum simulation sizes for various configurations, at a sample rate of 44.1kHz. Note that the GPUs have less available memory than is actually labelled, for example a 3Gb C2050 has a useable global memory of around 2.8Gb. Whilst the table shows the maximum sizes for both the 2 Grid and 3 Grid schemes, in practice this has to be reduced to allow for storage of audio output arrays and, in the four GPU case, overlap halos of variable size. Four Tesla C2050 GPUs are used for the testing here, each of which has 3Gb of global memory. Thus for the basic 2 Grid scheme at single precision we can compute simulations using 1.4 billion grid points, and a resulting simulation size of 3,350 m 3 . For the 3 Grid scheme including viscosity, the grids contain just under a billion points. Table 4 shows the computation times for maximum memory simulations, running for 44,100 samples at 44.1kHz. 
CONCLUSIONS
The use of asynchronous data transfers and concurrent execution allows multiple GPUs to be used effectively to achieve near-scaleable speedups in three-dimensional FDTD schemes, typically ranging from x3 to x3.5 when using four GPUs, depending on the size of the overlap halos. By using all available memory on a four GPU compute node, we can perform virtual acoustic simulations using billions of grid points. At audio rates such as 44.1kHz, this allows the modelling of large rooms and halls, of several thousand cubic metres. Stability becomes an issue when running at single precision to maximise memory usage. Computing schemes at the Courant limit using single precision can lead to instability over time, although they may appear stable initially. Backing away from the Courant limit with a small increase in the spatial resolution resolves this behaviour.
Computation times for large-scale maximum memory simulations are around forty to fifty minutes per second at 44.1kHz, using four Tesla C2050 GPUs. Initial testing on the latest Kepler architecture GPUs shows a near two-fold speedup over the FERMI Tesla GPUs used here, and so should bring computation times down to under half an hour.
