Large scale deep neural networks profited from an emerging class of AI accelerators. Although the accelerators are specialized for machine learning, some of their designs are general enough for other computing intensive applications. Cloud TPU, as one of them, offers tremendous computing resources and is easily accessible through TensorFlow by expressing the computation in a graph. In this paper, we leverage this powerful hardware combined with the expressiveness of TensorFlow to simulate the Ising model on a 2-dimensional lattice. We modify the computationally intensive part of the checkerboard algorithm into matrix operations to exploit Cloud TPU's highly efficient matrix unit. In our experiments, we demonstrate that our implementation outperforms the best published benchmarks to our knowledge by 60% in single core and 250% in multiple cores with linear scaling. We also show the performance improvement of using low precision arithmetic-bfloat16 instead of float32-without sacrificing any accuracy.
Introduction
The Ising model [13] , which considers short-range interactions between spin variables on the sites of a d-dimensional lattice, plays an important role in statistical physics as a prototyping system to study the universal behavior of critical phenomena. Many significant breakthroughs in statistical physics are attributed to the study of the model from either its computational or its theoretical perspective. It is well known that the Ising model has no phase transition in 1 dimension; however, it undergoes a second-order phase transition between an ordered and a disordered phase in 2 dimensions or more [5, 17] . The critical temperature T c at which this phase transition occurs on a 2-dimensional square lattice was analytically solved by Lars Onsager [17] , but it is still an open problem in 3 or more dimensions. However, computer simulation offers a powerful alternative to study such systems and determine critical temperatures, thanks to the development of finite scaling theory [3] and availability of increasing computational power. Such an approach ushered a plethora of interdisciplinary applications outside of physics, including bioinformatics [2] , economics [20] and operations research [11, 22] .
Large-scale simulation of systems such as Ising model requires a large amount of high performance computing resources, which are usually available in multi-core computing architectures based on distributed shared memory, or distributed clusters (a.k.a data-centers) with homogeneous or heterogeneous nodes commonly seen in private or commercial clouds. Benefitting from the explosion of machine learning, especially deep learning, commercial clouds provide not only CPUs and GPUs, but also specialized chips such as FPGAs and their in-house processors. The Tensor Processing Unit ("Cloud TPU" or "TPU" for short)-an AI accelerator application-specific integrated circuit (ASIC) developed by Google specifically for neural network machine learninghas received much attention in the machine learning community [15] . Its latest release offers 420 × 10 12 floating-point operations per second (FLOPS) and 128GB of high bandwidth memory (HBM). TPU architecture is accessed via software frontends such as TensorFlow [1] or PyTorch [19] , and can be deployed both for training huge deep neural networks and for performing lowlatency online prediction. [14] reports impressive acceleration of training and online prediction.
Although TPU aims for machine learning, it is nevertheless also readily applicable to scientific computing problems due to its tremendous amount of raw general-purpose computing power. In this paper, we employ TPU under the TensorFlow framework to develop a Markov chain Monte Carlo (MCMC) simulation of the Ising model. We demonstrate that we are able to achieve much better performance compared to the state-of-the-art benchmarks in terms of both speed and scalability. The same code can be run on different platforms, either on a desktop locally or on TPU clusters in cloud for larger problems.
TPU Device Architecture
TPU is a programmable linear algebra accelerator optimized for machine learning workloads. In the current architecture, one TPU unit consists of four TPU chips on a board, and each TPU chip contains two TensorCores. Those TensorCores are treated as independent processors that communicate to each other through a high-bandwidth network. In a larger system, multiple TPU units are packed on a two-dimensional torus interconnect network to form a TPU cluster known as a "TPU Pod" [7] .
The TensorCore, depicted in Figure 1 , is optimized for dense linear algebra computations. It contains distinct classes of computing units, such as a scalar processor, a vector processor, accumulators and matrix units [9, 15] . All those processors are backed by its 16GB High-Bandwidth Memory (HBM). Vectorized operations are handled either in the vector processor directly or forwarded to corresponding extended vector units. Each extended vector unit takes the input operands, performs the corresponding operations, and returns the results back to the vector processor. One of these extended vector units is the matrix unit (MXU), which is capable of performing 128 × 128 multiply-accumulate operations in each cycle [8] . The MXU is the main computing power of the TPU architecture, so it should be exploited as much as possible. While its inputs and outputs are 32-bit floating point values, the MXU performs multiplications at the reduced precision of bfloat16-a 16-bit floating point representation that provides better training and model accuracy than the IEEE half-precision representation.
Figure 1:
One TPU chip consists of two TensorCores. A TensorCore in 3rd generation TPU (TPUv3) consists of a scalar processor, vector processor, and two matrix units. The arrows depict the datapaths across different processors/units and the high-bandwidth memory (HBM), the diagram is borrowed from [9] , here "Core" is the "Tensorcore". Machine learning research shows that many machine learning models can tolerate lower precision arithmetic without degradation of converged accuracy. TPU supports storing values in bfloat16 format as a way to reduce the size of data and allow larger models to fit in memory [10] . For scientific computing, however, low precision is potentially dangerous because the increasing rounding error can introduce significant bias or divergent computation. In the case of the Ising model, while the binary spin values can be encoded in bfloat16 without loss of accuracy, the acceptance ratio and the random numbers used to determine acceptance in an MCMC simulation are more sensitive to reduced precision. However, our experiments show no noticeable differences in accuracy between bfloat16 and float32. By using bfloat16 instead of float32, we are able to simulate larger systems and leverage the MXU more effectively.
We program on TPU through TensorFlow. The flow to run programs on TPU is roughly depicted in Figure 2 . More details are available in TensorFlow's official XLA page 1 . In the first stage, TensorFlow constructs the computation graph and marks the graph for replication. Then the graph is rewritten to be TPU-compatible and compiled to a High Level Optimizer (HLO) program. Next, the Accelerated Linear Algebra (XLA) compiler takes over and converts HLO operations to Low Level Optimizer (LLO) code-effectively "TPU assembly code", which can be readily executed on TPU. While TensorFlow hides the details from programmers, such details have a critical impact on performance and memory usage. For example, according to the performance guide [8] , unlike most other architectures, arrays in TPU are tiled. This entails to padding one dimension to a multiple of 8, and the other dimension to a multiple of 128. XLA performs data layout transformations and data are arranged in memory such that the hardware can efficiently process them. Programs that operate on array sizes undividable by 8 will have sub-optimal performance.
The Ising Model
Mathematically, the Ising spin Hamiltonian is given by
where σ i is a random variable assuming the values of ±1 on sites i = 1, . . . , N of a d-dimensional hypercubic lattice, and ij indicates that sites i and j are nearest neighbors. The first term, where the sum is over pairs of nearest-neighbor sites, represents the interaction energy that favors an ordered ferromagnetic state (if J > 0). The second term, involving the interaction between the applied field and the spin system, is of a paramagnetic character. The configuration probability is given by the Boltzmann distribution with inverse temperature β = (k B T ) −1 :
is the partition function and k B is the Boltzmann constant. For a function f of the spins (observable), we denote f = σ σ σ f (σ σ σ)π(σ σ σ) the expectation (mean value) of f . f can often be difficult to evaluate numerically if there are many states in the system. The Markov Chain Monte Carlo is the most commonly used monte carlo algorithm to calculate statistics on the Ising model. Without loss of generality, in what follows, we assume no external magnetic field, i.e., µ = 0 and J = 1, and the 2D lattice has circular boundary or in other words, is a torus. A given configuration of the lattice (spin values) are represented by matrix σ σ σ.
Checkerboard Algorithm
The Metropolis-Hastings algorithm is the standard algorithm to simulate the Ising model. At each step, it proposes a candidate spin and flips the candidate based on the energy difference and acceptance probability. Closely related to this vanilla version that flips one spin at each step, there is another similar algorithm by flipping non-interacting spins in parallel, i.e., the efficient checkerboard algorithm [21] . [4, 4, 4, 4] tensor, where [l, k, :, :] is the sub-lattice at [l, k] of the grid; on the right, the sub-lattice is zoomed in and the indices of its spin sites are shown; (2) Reorganized checkerboard: one the left, each 4 × 4 sub-lattice is reorganized by 4 "compact" 2 × 2 sub-lattices; on the right, 4 "compact" 2 × 2 sub-lattices are zoomed in and their original indices from the 4 × 4 sub-lattice are shown. In general, such alternate coloring of black and white can be extended to lattices with any dimensions.
Like the checkerboard above ( Figure 3 - (1)), spins in the lattice are colored black and white. The energy difference by flipping a spin of one color is completely described by its 4 neighbors of the opposite color. Thus, by fixing all spins of one color, the spins of the opposite color have no interactions with each other, and can be updated independently using Metropolis-Hastings. This observation leads to a highly parallel algorithm by alternating the 2 sub-routines below:
• Fixing all black spins, flip each white spin based on Metropolis Hastings in parallel.
• Fixing all white spins, flip each black spin based on Metropolis Hastings in parallel. 
where nn(i) is the sum of neighbor spins of i. By conditional probability decomposition, it is
, thus the transition kernel satisfies the stationary distribution. The proof is based on Metropolis-Within-Gibbs Sampler [16] and is included in the supplemental materials.
Computation
The computationally intensive part of the checkerboard algorithm is the computation of the sum of neighbor values for each spin. To leverage the MXU, for a given lattice with size [128 × m, 128 × n] in a TPU core, we divide it into a [m, n] grid of 128 × 128 sub-lattices ( Figure 3-(1) ). Define the kernel matrix K as,
then for a given sub-lattice σ σ σ ij , matmul(σ σ σ ij , K) + matmul(K, σ σ σ ij ) calculates the sum of nearest neighbors for all its internal sites. However, for the boundary sites, half of their nearest neighbors are missing from the sums. Those neighbors are on the boundaries of neighboring sub-lattices and ought to be sliced out and added into the sums. The whole lattice is updated one color at a time.
To fix the spins of one color on the checkerboard, we multiply flip probabilities with a binary mask M , where
In summary, the algorithm to update the lattice given the color c and configuration σ σ σ is given in and their boundary spins are corrected using a similar approach as in Algorithm 1. The complete algorithm is presented in Algorithm 2. According to our experiments, it is about 3x faster than Algorithm 1 and has less memory footprint as it uses less temporary HBM. 
Two Dimensional Ising Model

Correctness
The average magnetization per spin for a given β is defined as,
where N = n 2 for a square lattice with size n. It can be shown that below the critical temperature, i.e., (
, there is spontaneous magnetization-the interaction among spins is sufficiently large to cause neighboring spins to spontaneously align. On the other hand, thermal fluctuations completely eliminate any alignment above the critical temperature. Moreover, at the critical temperature, there is a discontinuity in the first derivative of m(T ) with respect to T c . This discontinuity generates a downward drop in the average magnetization. The sudden loss of spontaneous magnetization above the critical temperature is a signature of phase transition. Besides average magnetization, a more sensitive test of correctness is the Binder parameter [4] , which is given by
namely, the kurtosis of m(T ). It is frequently used to accurately determine phase transition points in numerical simulations of various models.
We verify the correctness of our algorithm and implementation by computing both quantities at various temperatures on different sizes of square lattices. As shown in Figure 4 , the curves of average magnetization, with subtle differences, overlap with each other, and those of Binder parameters cross the critical line almost perfectly. Additionally, we investigate the implications of lower precision, i.e., bfloat16, for the estimation accuracy. In MCMC simulation of Ising model, lower precision has impact on the calculation of acceptance ratio and the random number generation, the bias introduced might accumulate and decrease the overall accuracy. However, in our experiments, all curves generated in bfloat16 and float32, especially those of Binder parameters, are almost the same and sharp turns around critical line are clearly observable. Based on this evidence, we argue that using bfloat16 has negligible impact on Ising model simulation, and in turn it offers two benefits: 1). We are able to simulate a larger lattice on a single TPU core because of smaller memory footprint, and 2). bfloat16 matrix multiplication with 32-bit accumulation is very efficient in MXU, while float32 matrix multiplication is more expensive as several bfoat16 passes are required.
Benchmarks
Preis et al. [21] showed impressive acceleration in their single-core GPU implementation of checkerboard algorithm using the compute unified device architecture (CUDA). By exploiting GPU's large pool of threads, its single-instruction multiple thread (SIMT) unit and memory hierarchy, their algorithm achieved 60x speedup on single GPU core compared to its CPU counterpart. In their follow-up paper [6] , their original algorithm was modified to overcome the memory limitations of 000,000 samples using checkerboard update, where the first 100,000 samples are discarded for burn-in, and the rest 900,000 samples are used for the calculation. The plots of U 4 (T ) for various lattice sizes cross almost perfectly at the critical temperature, which is shown additionally as a black dashed line, and their float32 and bfloat16 versions almost completely match. The plots of m(T ) show vanishing magnetization above critical temperature, but there are subtle differences between float32 and bfloat16 as m(T ) is a less sensitive test. single GPU. The improved algorithm was able to simulate significantly larger systems and reached peak performance of ∼ 7.9774 flips/ns. By combining CUDA with MPI on the CPU level, their distributed algorithm achieved 206 flips/ns on a 800, 000 2 lattice. Besides the work on GPU, another encouraging line of research is the use of field-programmable gate array (FPGA). A recent implementation achieves ∼ 0.6144 flips/ns, see [18] and references therein.
To quantify the performance of our implementation, we run our algorithm on TPUv3 using single core and multiple cores on TPUv3 clusters. As in [6, 18, 21] , we measure the time spent on one sweep update, i.e., one update on all "black" spins plus one update on all "white" ones, and compute the average number of flips per nanosecond by dividing with n 2 .
Single TPU Core
First, we simulate the system on a single TPUv3 core. Our choice of lattice size is compatible with MXU's registers and achieves 100% memory utilization according to our profilings. We can simulate lattice with size up to (656 × 128) 2 = 83, 968 2 , which consumes 96% of the memory. The benchmarks in various sizes from (20 × 128) 2 to (640 × 128) 2 are summarized in Table. 1, the lattice size and flips in nanoseconds increase in tandem, as more computation is spent on matrix multiplication. For comparison, we also implemented the algorithms based on [6, 21] under CUDA 10.1, and using its cuRand package for random number generation and its Thrust package for reductions. To avoid the excessive temporary memory allocation on GPU, we wrote a custom memory allocator to reuse temporary memory. The benchmark we obtained on NVIDIA's Tesla V100, which powered by NVIDIA's latest Volta architecture, is 11.3704 flips/ns. Other older benchmarks published in [6, 18, 21] are also listed for reference. lattice size n 2 flips/ns (20 × 128) 2 8.1920 (40 × 128) 2 9.3623 (80 × 128) 2 12.3362 (160 × 128) 2 12.8266 (320 × 128) 2 12.9056 (640 × 128) 2 12.8783 GPU in [6, 21] 7.9774 Nvidia Tesla V100 11.3704
FPGA in [18] 0.6144 
Linear Scaling on Multiple TPU Cores
The idea to simulate Ising model on TPUv3 clusters is to split the whole lattice into sub-lattices, and exchange their boundary values to calculate the nearest-neighbor sums and update the sublattices in each core independently. The challenges are to handle the synchronization among the cores: 1). Block the update of sub-lattices until all the nearest-neighbor sums are calculated.
2). Block next iteration before the update of all sub-lattices is completed. Fortunately, such synchronization is already implemented in TensorFlow op collective permute, which is used to exchange data by specifying source and target cores. In its current release, TPUs in a TPU Pod are organized into a grid and each TPU core has an associated coordinate. All cores communicate through a specialized high-speed interconnect.
In our experiments, we use smaller sections of a pod called slices [7] , and show that the TPU Pod interconnect makes the overhead of exchanging boundary values between cores negligible. As a result, we observe strict linear scaling of flips/ns to the number of TPU cores, as shown in Table. 2.
In [6] , the communications between GPUs are handled by MPI through the hosts, which is potentially a bottleneck in the simulations. Another interesting comparison is to benchmark their [6] 800, 000 2 ∼ 3000 206 Table 2 : Each core contains a [896 × 128, 448 × 128] sub-lattice. Hence, for a n × n × 2-core cluster, the lattice size is (512 × 128 × n) 2 . Dividing flips/ns by number of cores, the flips per nanosecond per TPUv3 core is roughly 11.4337, compared to 3.2188 per GPU-a 250% speedup.
multi-GPU algorithm using Nvidia NVLink Fabric, which enables the interconnect of 8 GPUs, or Nvidia NVSWITCH that support the interconnect of 16 GPUs [12] . However, as the code for multi-GPU simulation is not available and the limitations on the number of interconnect GPUs, we leave it for our future work.
Conclusion
We implement a TensorFlow version of the two dimensional ferromagnetic Ising model MCMC simulation using Cloud TPU. For the algorithm, we modify the checkerboard algorithm and convert its computationally intensive part to matrix operations. Our algorithm exploits the architecture of TPU to leverage its highly efficient computing units. For the implementation, TensorFlow is employed which interfaces with TPU as a high performance computing device. All of our operations are available through TensorFlow with underlying TPU implementations. Moreover, the APIs to handle synchronization and communications between TPU cores are also exposed through TensorFlow and are easy to use. To demonstrate our algorithm and implementation, we calculate the average magnetization and Binder parameter at various temperatures with different lattice sizes. Our numeric estimates on those size-independent quantities match the theoretical results. We further benchmark the performance, measured by flips per nanoseconds, against different lattice sizes on single TPUv3 core and on TPUv3 clusters. In our experiments, we are able to utilize TPU's computing power and memory efficiently and achieve state-of-the-art performance. In future work, we will study the details of XLA implementation of TensorFlow operators and explore possible optimization, and also perform further Monte Carlo based simulations on variations of the Ising model.
