Abstract-Stochastic surface growth models aid in studying properties of universality classes like the Kardar-Parisi-Zhang class. High precision results obtained from large scale computational studies can be transferred to many physical systems. Many properties, such as roughening and some two-time functions can be studied using stochastic cellular automaton (SCA) variants of stochastic models. Here we present a highly efficient SCA implementation of a surface growth model capable of simulating billions of lattice sites on a single GPU. We also provide insight into cases requiring arbitrary random probabilities which are not accessible through bit-vectorization.
I. INTRODUCTION
The roughening and aging of interfaces is important in many processes in physics and technology. Examples are the progression of solidification fronts [1] or flame fronts [2] and the growth of surfaces. The Kardar-Parisi-Zhang (KPZ) equation [3] is a stochastic differential equation devised to model kinetic roughening of surfaces under ballistic deposition of particles. It describes the evolution of the surface height as an interplay of linear growth and processes of diffusion, smoothing the surface, and a local instability, increasing local roughness, with random particle deposition noise:
Here, is the time and x the position on the surface. denotes an uncorrelated Gaussian noise with zero mean. This equation was also found to model different systems like randomly stirred fluids [4] , directed polymers in random media [5] and even magnetic flux lines in superconductors [6] . The concept of aspects of different physical systems exhibiting the same scaling behavior is called universality [7] . Solutions of the KPZ equation define universality classes.
Exact solutions are available only in one-dimension. The growth of two-dimensional surfaces into the third dimension (2 + 1 dimensions) is of special interest due to the experimental relevance. Thus experimental studies are performed to compare empirical surface growth with KPZ universality. [8] On the other hand computational models which fall into this universality class are used to gather insight into general properties of these physical systems.
In order to compare calculations to experiments, large scale simulations, both in time and space are advantageous. A variety of stochastic surface growth models has been devised to accomplish this task. For example the restricted solid-onsolid model [9] or the -mer model [10] , [11] , which is the subject of the present work.
Efficient implementation of stochastic lattice models has been of interest for a very long time, primarily for spin models like the Ising model [12] . The most efficient way is to formulate the model as a stochastic cellular automaton (SCA) [13] . Ever since graphics processing units (GPUs) have been introduced they have also been employed for these tasks [14] , [15] . A very recent study of bit-vectorized Ising spin-glass simulations on GPU can be found in [16] .
The one-dimensional -mer model (roof-top model) [17] has also been studied using a SCA of an asymmetric exclusion process (ASEP) [18] , but for limited system sizes. Here we present a very efficient bit-vectorized SCA implementation of the 2 + 1 dimensional -mer model, which scales to very large systems (∼ 10 10 lattice sites on a single GPU). Dealing with arbitrary update probabilities is also discussed here, which is often not considered in bit-vectorized implementations.
In the following, the 2 + 1 dimensional -mer model, the octahedron model, is introduced. Section II describes our bit vectorized SCA implementation, the performance of which is analyzed in section III. Finally, conclusions are presented in section IV.
A. The Octahedron Model on a Checkerboard
A growing surface can be most efficiently represented by subtracting the linearly growing average height and considering only local height differences, which govern the surface morphology. In the octahedron model these height differences (slopes) are restricted to / = ±1. Updates, corresponding to the deposition or removal of particles on the surface, follow the generalized Kawasaki rules: [10] , [11] (︃ −1 1
where allowed deposition processes are carried out with probability and removals with probability . These rules are an extension of an ASEP in one dimension. [17] Figure 1 illustrates the update processes as exchanges of slopes along the and axes. Periodic boundary conditions (PBC) are applied, that is slopes leaving the system at one edge reenter at the opposite edge.
A time step, or Monte-Carlo step (MCS), requires attempting one update for each lattice site, this constitutes the unit of time in the simulation. To achieve this in a well-defined manner, the SCA first updates the even (black) lattice sites, then the odd (white) ones as defined by the parity of the xor of the coordinates ( ⊕ ) ∧ 1.
The focus of the presented implementation lies on cases where > 0, = 0. Using these parameters, the surface growth will follow KPZ universality exhibiting power-law growth of the surface roughness with a non-trivial exponent. [19] In these cases, the choice of rescales the simulation time non-linearly and affects correlation functions, beyond a simple rescaling of time. If = the resulting surface growth falls in the Edwards-Wilkinson universality class, where the roughness grows logarithmically [10] , [20] .
II. IMPLEMENTATION A. Lattice Encoding
The slopes can be mapped to a binary state / → {0, 1} of which two need to be encoded per lattice site ( − and − ). The slopes to the remaining neighbors can be retrieved from those:
+ ( , ) = − ( + 1, ) and + ( , ) = − ( , + 1) to avoid redundancy. Storing two bits per lattice site and encoding tiles of 4 × 4 in 32-bit words is straightforward. This encoding is illustrated in the left-hand panel of figure 2 . It has the advantage that nearest neighbor (NN) sites, which are required to perform an update may be found in the same word, making it the encoding of choice if lattice sites are selected randomly [19] and vectorization is not an option. For bit-vectorization, the ideal encoding is non-local with respect to the relations of data in configuration space. Instead the bits are grouped in memory according to their role. The SCA defines four roles in total, based on two binary properties: The direction of the slope / and the parity of the lattice site (odd/even). This grouping is illustrated in the right-hand panel of figure 2: All slopes with the same role are stored at consecutive addresses in memory, allowing vectorization with any word size not larger than half the size of the simulation cell /2.
B. Bit-Vectorized GPU Implementation
The kernel of the bit-vectorized GPU implementation is summarized in pseudo-code in figure 3. Care must be taken regarding the parity of rows with respect to the considered sublattice (line 3.2): All slopes belonging to sites in odd rows are perfectly aligned and no shifting is required. In even rows the + slopes, stored at the NN site in direction, are shifted by one bit in memory (compare figure 2, left) .
The implementation treats consecutive words processed by threads of the same block as one effective single instruction multiple data (SIMD) word, with a maximum effective size of eff,max = × ℎ , where is the word size in bits, which is 32 on GPU, giving eff,max = 2 15 . This SIMD-word size can be adjusted to span the simulation cell, as long as the lateral size does not exceed = 2 16 . Larger simulations are rarely required, thus the kernel displayed in figure 3 assumes the effective SIMD word to span the system. The work assigned to thread blocks is distributed along the direction. Buffer regions between blocks or global atomics are not employed since, as long as all blocks update the same sublattice, the non-locality of the encoding of on-site slopes avoids write conflicts in global memory. Each thread updates 32 lattice sites simultaneously. The corresponding slopes ⃗ + are placed in a buffer in shared memory (⃗ shared + ) to facilitate block-wide rotation when updating even rows. The correct set of ⃗ + slopes is compiled in the lines following 3.8. An xor-mask encoding the Kawasaki exchanges to be carried out is used to apply updates in parallel. Again, ⃗ + needs to be treated separately in even rows: The update mask is applied in parts to the corresponding data in shared memory. Atomics could be used to apply , in order to remove the need for synchronization in line 3.21, but this was found to yield lower performance.
The update mask in line 3.14 is generated according to the Kawasaki rules in equation (2), which can be implemented by the following relations:
where ⃗ denotes a word of random bits set with probability . If = 0, the calculation of can be omitted and = . Generation of ⃗ 0.5 is trivial and fastest, provided a good pseudo-random number generator with all bits with uniform distribution. For an arbitrary , it is necessary to generate ⃗ sequentially using random numbers. All = 32 random bits need to be generated. Generating only those random bits which are required for the actually possible updates in a way which minimizes warp-divergence, does not improve performance. Even then the implementation is considerably faster than a non-vectorized version using local encoding.
With the effective SIMD word spanning the simulation cell in one direction, the implementation utilizes global memory perfectly efficiently: All accesses are coalesced and all bits read are required to perform updates. To handle even larger systems ( ≥ 2 17 ), the kernel needs to be changed. The word index , initialized in line 3.3, is iterated with a stride equaling the number of threads per block. The procedure remains unchanged for odd rows, where no slopes need to be shifted. For even rows, the shared buffer ⃗ shared + holds two SIMD words. ⃗ + slopes need to be shifted between neighboring SIMD words, thus the whole of a second SIMD word is cached to ensure keeping global memory accesses coalesced. The buffer is used in a wrap-around fashion (⃗ Good quality pseudo-random numbers are crucial in this implementation, especially in the optimized case = 0.5. While in some cases adequate for Monte-Carlo methods, a linear congruential generator (LCG) is insufficient here: Correlations would severely influence results and produce accumulating errors since the SCA updating procedure is itself correlated and is to be decorrelated by the random acceptance of updates 
5:
if ¬ row then ◁ rotate SIMD word by one bit:
9: SYNCHRONIZETHREADS 10:
11:
end if
13:
◁ compute xor mask for update: 14 :
◁ apply mask: 16 :
17:
if ¬ row then
19:
◁ apply mask to NN slopes, except LSB: 20:
◁ apply mask to LSB of next NN slopes: 23 : . Bit-vectorized GPU kernel using non-local encoding to update one sublattice with given parity ( lat = ( ⊕ ) ∧ 1). It is executed for each sublattice in order to complete one MCS. PBC apply for all coordinates, including indexes in the shared memory buffer ⃗ shrd
id is a short-hand for the thread ID within the thread block. The above directly applies if the thread block (= SIMD word) spans the system in direction. Statements involving ⃗ / − represent two separate operations on ⃗ − and ⃗ − . Load and store operations for parameters and states of random number generators happen before respectively after the presented loop and are omitted for brevity. See text for details. in the first place. Also, LCGs do usually not provide good randomness of single bits. The present implementation uses a small version of the well-known Mersenne Twister [21] , called TinyMT [22] , which was developed by the same authors. Each thread maintains its own, randomly seeded, TinyMT state of 128 bits and a polynomial of 96 bits. The polynomials of all threads are independent of each other. For performance comparisons, a multi-threaded CPU implementation using a word size = 64 bit was created for the case = 0.5, = 0. The usage of SIMD instructions through the vector extensions of GCC [23] did not increase performance. Probably because some operations, like bit shifts across a whole SIMD word, require more operations than with scalar commands, compensating the performance gain from vectorization.
III. PERFORMANCE
The devices used for performance measurements are listed in table I. Tests of the CPU implementation were performed by running the maximum number of hardware threads provided by the hyper-threading capabilities of the platform (12) . This increases the performance by less than one percent over running only one thread per physical core (6) for the bitvectorized implementation. This suggests that the performance of the bit-vectorized implementation on CPU is not significantly limited by memory latency. The gain is about 20% for the local implementation.
The achieved performance of different implementations is presented in figure 4 as the number of update attempts performed per nanosecond. In the course of one MCS, each slope needs to be touched twice, translating into two write and two read accesses to each slope per MCS in the ideal case. This is illustrated by the right axis of the plot associating the performance with the ideally required bandwidth. For = 0.5, = 0, the bit-vectorized implementation performs about 229 updates per nanosecond on a GTX Titan Black GPU, which requires 229 GB/s of slope data to be transferred between global device memory and the GPU. The NVIDIA Profiler nvprof reports an achieved memory bandwidth of about the same value, since the memory transfers required to load and store states and parameters of random number generators negligible. This bandwidth is lower than the maximum bandwidth listed for the device in table I, but it is equal to the bandwidth reported by the bandwithTest utility from the CUDA SDK for pure device-to-device memory copy. At the same time nvprof reports low to medium utilization of arithmetic units. Thus the code is clearly memory-bound. When the lateral system size is increased to = 2 17 , the effective SIMD word does no longer span the simulation cell and memory accesses cannot be done as efficiently anymore, this leads to a performance drop of less than 5% (cyan bar).
In the case of arbitrary , the generation of correctly distributed random bits becomes more expensive. The benchmarks for = 0.95 (green bars in figure 4) show a performance reduced by a factor of four to five, which is still faster than the local implementation for the same case (red bars). Since memory access patterns of the bit-vectorized implementation remain unchanged between the cases = 0.50 and arbitrary , the decrease in performance can only stem from the increased computational load of random number generation, which does not require branches. This means, that the implementation turns from memory-bandwidth-bound to compute bound. Specific choices for the probabilities, which are the sums or differences of fractions 1/2 can be generated at lower computational cost than arbitrary ones, since they can be generated from less than 32 random numbers using the logical "and" and "or" operations.
The newer GTX Titan X (Maxwell-generation) GPU provides a significant speedup over Titan Black for arbitrary , due to increased compute power, but only marginal gain in the case = 0.5, since the available memory bandwidth is almost the same.
The values listed for thermal design power (TDP) in table I may not be perfectly comparable across vendors due to different definitions, but can provide a rough estimate of the power consumption of the device in a steady state under load. The actual power consumption during the benchmarking was not measured. Comparing performance and TDP for the Kepler-generation cards, one can conclude that the lower clocked, compute GPU K80 is more energy efficient running the presented implementations than the gaming card.
The case = = 0.5 was also tested, but not separately optimized, because simulations of this case are of less scientific interest since it falls into the analytically-solved EdwardsWilkinson universality class. However, evaluating additional = 0.5 updates does not significantly affect performance since memory bandwidth remains the main limiting factor.
IV. CONCLUSION
A very efficient SCA implementation of the octahedron model describing surface growth was presented, which is capable of simulating large systems over long simulation times (for example = 2 16 at = 0.5 for 1.4MMCS in under twelve hours using a K80 GPU). This implementation enables large scale studies of the physical aging behavior of the surface growth models under SCA dynamics [24] .
It was shown that the code is memory-bound on GPUs in the special case of = 0.5. When using an arbitrary update probability, where a separate random number is required for each lattice site, the implementations performance drops only by a factor of four to five. Thus it was shown, that by using bit-vectorized updates and an optimized encoding scheme, the performance of the SCA can be increased even without restricting possible applications by the site-updating frequencies. In this case the performance is limited by the performance of random number generation, which might be optimized, depending on the requirements of calculations.
A fair comparison was done between the performance on CPUs and GPUs, since implementations on both architectures are highly optimized. A speedup of up to 14× was found for GPU over a single socket six core CPU.
The program presented here will be made available upon request to the corresponding author.
