Recommended form of bibliographic references: Zakirov A.V., Levchenko V.D., Perepelkina A.Yu., Zempo Yasunari. High performance FDTD code implementation for GPGPU supercomputers // Keldysh Institute Preprints. 2016. No. 44. 22 p. doi:10.20948/prepr-2016 
Introduction
Finite Dierence Time Domain (FDTD) method [1] is implemented in numerous codes for simulation of electrodynamics. Among other possible applications, it is used for design of optical devices, complex coatings, nanoantennas. The method is easily extended for simulation of other wave processes, such as acoustics and elasticity waves [2] .
As it comprises the base of many contemporary calculations, the qualitative acceleration of the code performance would aid many aspects of supercomputer research. This paper deals with one way of rethinking the traditional approach to numerical solution of evolutionary Cauchy problem, the Locally Recursive nonLocally Asynchronous (LRnLA) algorithms [3] . Based on this approach, we have developed software for simulation of real optics on General Purpose Graphical Processing Unit (GPGPU) and measured its eciency on one device, and on a multi-GPU supercomputer.
Background
Traditional FDTD implementations are based on algorithms which correspond to von-Neumann model of computations with parallel generalizations by Amdahl (for OpenMP implementation) and Hoare (for MPI implementation). Among the recent parallel models CUDA (OpenCL) should be noted, since these have adequate support of vector level parallelism.
The ideal paradigm of CPU RAM as a main storage of processed data leads to the fact that the majority of numerical scheme implementations use the computational region traversal rules that are layerwise, and linear multidimensional arrays for data structures.
Layerwise synchronization is the most obvious approach: after all eld data in the region is updated by one time step, the calculation for next time step begins.
The immediate consequence is a an essential limitation of the maximum locality of processed data. Since the memory subsystem is hierarchical, when the computation region is upscaled, this leads to performance degradation at times when the data size exceeds cache size.
For parallel implementation the layerwise approach requires layerwise synchronization of parallel processors. This leads to the limitation of upscaling set by communication environment latency, and imbalance of computation nodes'
loads.
These problems result in extremely low eciency of applications that use these methods. For example, among FDTD software, Meep [4] , Lumerical, have eciency that is lower than 1%. Other implementations of FDTD on GPGPU is limited by GPU device memory [5] .
Other space-time approaches
The dierence between LRnLA method and the conventional methods is that the optimization of computations deals not only with layerwise computation, but traces data dependencies in 4D time (iteration) and space domain [68] (see detailed in later sections).
Although this approach is not wide spread, the basics may be traced in works by other authors. Developed since early 1980s, the so-called loop tiling and loop skewing methods result in similar algorithms for simple domain geometry [912] .
The research on loop blocking led to creation of cache-aware and cache-oblivious algorithms [13] , which were used later for stencil computations of partial derivative equations [1416] . In 1D simulation these techniques lead to trapezoidal and diamond blocking of space, with generalizations to 2D and 3D.
Among these approaches LRnLA has the following advantages:
• The approach takes account for the complexity of modern computers. The space-time optimization account for all parallel levels, all levels of memory subsystem.
• The theory is built on the model of the computer and allows a priori quantitative estimates of the performance of method implementation.
• The theory applies to any physics simulation with local dependencies, any amount of dimensions.
The main dierence lies in the approach of building the algorithms. In LRnLA method, the best algorithm is chosen based on the analysis of both the computer system (by creating a model of memory subsystem and parallel levels) and numerical scheme properties (by constructing a dependency graph and tracing the dependencies). It should be noted that some cache-oblivious algorithms coincide with the algorithms of LRnLA family for lower dimensions. 
Algorithm description
We use rooine [18] model to estimate the eciency of the algorithms. Although other models exist [19] , rooine model is the most convenient for the current study. It shows only one level of memory subsystem hierarchy. In GPG-PU the memory subsystem hierarchy is developed less compared to CPU-type processors, so the rooine model appears to be the most suitable for GPGPU.
In it, the peak achievable performance is shown against the operational intensity parameter. Operational intensity is calculated as the ratio of performable operations for the given amount of data to the size of this data. It is an attribute of the algorithm in use. Depending on this parameter, the rooine model subdivides algorithms into two groups: memory bound and compute bound. Naive FDTD method implementation with layerwise synchronization has low operational intensity, and is memory bound.
To increase performance it is necessary to increase operational intensity. It is possible by avoiding stepwise synchronization. This is the main idea of the alternative algorithms for stencil computations (trapezoids, time-slicing and timeskewing [15, 16, 1923] )
LRnLA method suggests tracing the dependencies of the numerical method to nd the optimal computation order. The optimization is based on the hardware model. In this work, the construction of the algorithms and advantages of this method is described on the example of DiamondTorre algorithm.
DiamondTorre algorithm and its CUDA implementation
We describe an algorithm as computation order that follows from the subdivision of 4D X-Y -Z-t computational region. Each mesh point in this space has dependencies on some amount of points on the previous layer. By subdividing the domain we nd that some points are enclosed in a polytopes, the points of which have dependencies on the data from adjacent polytopes. Each polytope corresponds to a procedure of performing computations for all points within it in some arbitrary order. By tracing the dependencies, we nd that some shapes have dependencies (direct or indirect), and some are asynchronous. The dependent shapes are to be processed subsequently, asynchronous ones may be processed in parallel. A set of such shapes constitutes the description of the algorithm. By generalization, one shape also comprises an algorithm. Inside some shape in 4D space we can estimate the operational intensity as amount of points in it (proportional to the amount of necessary operations) divided by the amount of separate spatial grid points in it (projection of the shape to 3D spatial grid,
proportional to the amount of data necessary for operations). into GPU register. To calculate H-eld diamond in it three E-eld diamonds need to be loaded. To calculate E-eld diamond in it three H-eld diamonds (to the right, top, and bottom of it) need to be loaded (right).
al region can be tiled with one shape (except for boundaries). The pictured 3D
shape spans in two spatial coordinates (X and Y) and one time axis. In CUDA implementation, one DiamondTorre is processed by CUDA-block. Grid points along the remaining coordinate axis (Z) are updated using vector parallelism of CUDA-threads in one block. DiamondTorre size is dened by two parameters:
the size of the base DTS, and its height TH.
More detailed explanation of algorithm, its parameters and implementation for wave equation on 1 GPU can be found in [6] . Here the specics of its implementation for FDTD method and many-GPU systems are described.
The correspondence of DiamondTile to components of staggered gird is illustrated on g. 3. This is dened as the basic element for DiamondTile algorithms for FDTD scheme for Maxwell equations with 4th order of approximation and its size is DTS=1 by denition. It consists of two (E-eld and H-eld) diamonds that are shifted along time axis (by half time step), as well as along one coordinate axis (by 1.5 spatial step -half width of the chosen scheme stencil). DiamondTorre consists of TH DiamondTile pairs, shifted in a similar manner against each other.
We refer to the algorithm DiamondTorre as a process of making calculations for all points in the described 4D shape. In GPGPU implementation, DiamondTorre is a CUDA kernel. DiamondTorre's with the same Y-axis position are processed asynchronously by CUDA-blocks.
DiamondTorre base size DTS=1 is optimal in this case. Though higher DTS leads to higher operational intensity, with DTS=2 or more the main limiting Such data structure is chosen to provide coalesced data access for the chosen computation algorithms, and to minimize the amount of elements in data load/store operations.
Calculation window
The important advantage of DiamondTorre algorithm is high performance for large simulation domains, including those, where eld data do not t in device memory. It is easily achieved by updating data in a calculation window, which moves from right to left (according to the gures above). Data load and save to/from global RAM are performed asynchronously with computations. Only the data for one following group of asynchronous DiamondTorres are loaded.
Data of DiamondTorres which are no longer necessary for the current TH update are saved and deleted from device memory.
Performance does not decline in case the computation time of DiamondTorre is longer than the time, necessary for memory copy to/from device. Since computation time increases linearly with TH, and the copy time is constant, with high enough TH host-device transfers are completely concealed. If we do not account for the boundary eects, operational intensity increases linearly with TH.
3.3
Small scale performance tests
The described algorithms are implemented in code, which features not only the basic FDTD stencil computations, but also all the required methods for real physics computations.
• FDTD simulation in 3D spatial domain with 4th order accurate scheme in space;
• Perfectly Matched Layer absorbing boundary conditions;
• Total Field/Scattered Field wave source;
• Complex materials according to Drude, Drude-Lorenz model.
Small scale performance tests were conducted to nd optimal algorithm parameters. We measure performance in Yee cell updates per second (Ys). This number is usually of 10 9 order, so the main unit is GYs. Maximum vector size (N z, also equal to the amount of involved CUDAthreads), for which the register le of Kepler architecture is enough to keep the data of all necessary diamonds, is equal to 384 for double precision. This is why the data throughput is not utilized completely when all 14 SMs are involved (∼ 14 · 384 ≈ 6000 transactions). It is the main reason why the performance is only 90% from peak one. It becomes more than 10 9 GYs for suciently large TH. TH is inevitably smaller near region boundaries, so the performance for real problems is slightly lower (few percent).
Performance results for dierent values of TH parameter are shown on g. 6. Since TH parameter is proportional to the algorithm locality, and the vertical axis is performance (normalized by the maximum achieved with the current parameter set), we may plot the rooine model on the same graph to visually comprehend the limitations. of performance rate on TH is observed in case the data is stored on disk (SSD in this case). The optimal TH becomes signicantly larger (more than 500).
Concurrency on Y axis
Since all DiamondTorre's standing side-by-side along Y axis are asynchronous, where n is an integer number the time step on which the data of the leftmost node exists on.
Fig. 8: Concurrency on X axis
Here is a more detailed explanation of the computation and data transfer algorithm. Each node, which has data that does not include the boundary, processes the following operations sequentially (g. 9, 10):
• Wait to receive overlapping data on the right side from the node on the right (waitR);
• Computing cell updates for the received overlapping data (calcR);
• Non-blocking data transfer to the node on the right (sendR); • Wait to receive overlapping data on the left side from the node on the left (waitL); these are sent by sendR step of the node to the left;
• Computing cell updates for the overlapping data to the left (calcL);
• Non-blocking data transfer to the node on the left (sendL); these are received during the waitR step of the node to the left.
The dependencies between (i − 1)-th, i-th and (i + 1)-th nodes of the described stages are depicted on g. 10.
In case the execution of calcM takes more time than data transfer time (sendR+sendL), data transfer between nodes will be completely hidden by computations. 4 Parallel scaling results on TSUBAME2.5
The parallel performance of the code has been tested on TSUBAME2. • The amount of available node per one run is up to 300 according to the usage conditions (1403 in total).
• Each node has 3 NVIDIA Tesla K20x GPGPUs installed. Their total GDDR5 memory is 3 × 5.625 = 16.875 GB, with 208 GB data throughput each.
• Each node has at least 54 GB, up to 96 GB on several ones. Only a little above 40 GB from it is available for the computation.
• Devices are connected with PCI-e 2.0 with 4 GB/sec throughput in each direction.
• Each node has 120 GB SSD memory.
• Nodes are connected by Inniband QDR interconnect with up to 4GB/sec throughput. 
Rooine estimation
We plot the rooine graph for the system in use (g. 11). The memory subsystem is hierarchical, and LRnLA algorithms are constructed with this hierarchy in mind. This is why we plot rooines for several main memory levels. We also need to account for the size of the level, so the graph is expanded by adding the circles. The size of the circle shows the data size of the memory level. It is positioned on the corner of the corresponding rooine.
Red lines show the rooine for one GPGPU device. Both device memory and node memory may be used, so they both are shown on the rooine. One node (green lines) has 3 devices, so its peak performance is shown to be 3 times higher.
Up to 256 nodes were used in the performance tests, and the whole cluster contains 1403 nodes (blue lines).
The operational intensity of algorithm described in section 3.1 is estimated as 0.57 (red arrow on g. 11). For this algorithm we take account only of the inner memory of the GPU device. This simple estimate is made by taking N t → ∞.
The dierence from the precise number is marginal.
For comparison the operational intensity of the stepwise naive algorithm (no data reuse) is estimated to be 0.23 (black arrow).
When node memory is used the performance is bound by the host-device memory bandwidth (lower red rooine). When the calculation window is used In the second case the eciency becomes lower when the number of nodes rises from 1 to 2. It is caused by the fact that Inniband data throughput is lower than data throughput between devices on one node. The following increase in the number of nodes does not lower the eciency.
In the third case parallel scaling is close to ideal, but is still less than the one for the rst series. The main decrease occurs between 1 and 2 nodes. It is caused by a slight imbalance in node utilization.
Strong scaling
For strong scaling we chose a xed size domain, and by increasing the amount of nodes, the domain is subdivided to more and more parts, that are to be processed concurrently. Three series were performed (g. This dependency on TH is shown on g. 14. Problem size is 450×62208×128 grid cells. For one node computation TH is equal to 15, and increases up to 150 for 8 nodes and higher. For low amount of nodes the speedup is better than linear.
It is caused by optimization of TH which is only possible when enough data is processed on each node.
The nal result is a strong scaling series on a problem with 38400 × 363 × 128 grid points (g. 15). One node the performance is about 30% from the peak performance, since the size along z axis is not optimal. It is not big enough to conceal GDDR5 access latency, since the amount of simultaneous transactions is too low. But the acceleration is up to 40 times. Only with 128 nodes and above data transfers are taking more time than computation, which leads to decrease in computation rate.
It should be noted that one node has little memory size (only about 3 times more than total device memory), and this becomes the reason for acceleration 
Conclusion
The work can be summarized as follows. The FDTD code has been developed, that allows simulation of real optical phenomena. The distinguishing feature of the code is the use of DiamondTorre LRnLA algorithm, which maximizes the performance on one device, and parallel eciency for multi-GPU architectures.
The software was tested on TSUBAME2.5 supercomputer. High computation rate is achieved (more than 1 billion Yee cell updates per second on one device).
The problem size is not limited by device memory. The scaling for ∼ 1000 devices is linear for weak scaling, and saturates at ∼ 100 for strong scaling.
Algorithm parameters (such as TH and problem size) allow not only qualitative, but also quantitative estimates of the performance and parallel scaling. Maximal achieved performance is 0.65·10 12 Yee Cells per second for 3D domain with 0.3·10 12 Yee cells total. For example, such size for wave optics problems corresponds to 1 cubic millimeter domain. This allows a signicant breakthrough in computational nanooptics, by allowing the simulation in domains that were previously too big even for supercomputers. It may be used for simulation of complex optical devices and substrates.
Acknowledgement
The work is supported by Hosei International Fellowship grant, RFBR grant no. 14-01-31483.
