GPU-based single-cluster algorithm for the simulation of the Ising model by Komura, Yukihiro & Okabe, Yutaka
ar
X
iv
:1
11
0.
08
99
v2
  [
ph
ys
ics
.co
mp
-p
h]
  9
 Ja
n 2
01
2
GPU-based single-cluster algorithm for the simulation
of the Ising model
Yukihiro Komuraa, Yutaka Okabea
aDepartment of Physics, Tokyo Metropolitan University, Hachioji, Tokyo 192-0397, Japan
Abstract
We present the GPU calculation with the common unified device architecture
(CUDA) for the Wolff single-cluster algorithm of the Ising model. Proposing an
algorithm for a quasi-block synchronization, we realize the Wolff single-cluster
Monte Carlo simulation with CUDA. We perform parallel computations for the
newly added spins in the growing cluster. As a result, the GPU calculation
speed for the two-dimensional Ising model at the critical temperature with the
linear size L = 4096 is 5.60 times as fast as the calculation speed on a current
CPU core. For the three-dimensional Ising model with the linear size L = 256,
the GPU calculation speed is 7.90 times as fast as the CPU calculation speed.
The idea of quasi-block synchronization can be used not only in the cluster
algorithm but also in many fields where the synchronization of all threads is
required.
Keywords: Monte Carlo simulation, cluster algorithm, Ising model, parallel
computing, GPU
1. Introduction
The Ising model is a simple and standard model of statistical physics. The
two-dimensional (2D) Ising model, which was exactly solved by Onsager [1],
shows a phase transition at the critical temperature Tc/J = 2.269 · · · . The Ising
model has been studied by using many Monte Carlo techniques, including the
Metropolis algorithm [2], cluster algorithm [3, 4], extended ensemble algorithm
such as the Wang-Landau method [5]. Since the Ising model is simple and
Preprint submitted to Elsevier October 26, 2018
its exact solution is available, it is sometimes used to test a new Monte Carlo
algorithm.
Recently the speed-up of computation with graphics processing unit (GPU)
has captured a lot of attention in many fields including computational biology
and chemistry [6], molecular dynamics simulation of fluids [7, 8], CT image re-
construction [9, 10], finance [11] and much more. GPU has been developed as a
supplementary arithmetic device of image processing. GPU has many stream-
ing multiprocessors (SM). Each SM is composed of eight streaming processors
(SP) with a multi-threaded instruction unit and shared memory. GPU has real-
ized the acceleration of image processing by using many SMs in parallel. Since
GPU is specialized for highly parallel computation, it is designed such that many
transistors are devoted to data processing. The common unified device architec-
ture (CUDA) released by NVIDIA makes it possible to make a general purpose
computing on GPU. CUDA allows us to implement algorithms using standard
C language with CUDA specific extensions. The host program launches a se-
quence of kernels. A kernel is organized as a hierarchy of threads. Threads are
grouped into blocks, and blocks are grouped into a grid. The organization of a
grid is determined when the GPU ”kernel” function is invoked as
kernel function <<< grid, block >>> (arguments); (1)
Here, grid and block are integer variables, which specify the number of blocks
and that of threads, respectively. For simplicity, we have used one-dimensional
representation for the dimensions of grid and block. Threads in a single block
will be executed on a single SM, sharing the data cache, and can synchronize and
share data with threads in the same block. There are several types of memories,
and there is a significant difference in the access speed of memories. The access
speed of global memory, which is accessible from either all the threads or the
host, is much slower than that of shared memory, which is accessible from the
threads in one block. The access speed of register, which is only accessible by
the thread, is nearly equal to that of shared memory.
Preis et al. [12] studied the 2D and three-dimensional (3D) Ising model by
2
using Metropolis algorithm with CUDA. They used a variant of sublattice de-
composition for a parallel computation on GPU. The spins on one sublattice do
not interact with other spins on the same sublattice. Therefore one can update
all spins on a sublattice in parallel when making the Metropolis simulation. As
a result they were able to accelerate 60 times for the 2D Ising model and 35
times for the 3D Ising model compared to a current CPU core. More recently,
the GPU acceleration of the multispin coding of the Ising model was reported
[13].
Metropolis algorithm [2] has flexibility which allows its application to a great
variety of physical systems. However, this algorithm is sometimes plagued by
long autocorrelation time, named critical slowing down. The cluster update
algorithms are efficient for overcoming the problem of critical slowing down.
The autocorrelation time drastically decreases by using the Wolff single-cluster
algorithm [4]. It is highly desirable to apply parallel computations to cluster
algorithms. But only a limited number of attempts have been reported so far
along this line. The message passing interface (MPI) parallelization technique
was employed by Bae et al. [14] for the Wolff algorithm. Quite recently, Kaupuzs
et al. [15] studied the parallelization of the Wolff algorithm by using the open
multiprocessing (OpenMP). For the 3D Ising model with linear size L=1024,
they reached the speed-up about 1.79 times on two processors and about 2.67
times on four processors, as compared to the serial code.
In this paper, we present the GPU calculation with CUDA for the Wolff
single-cluster algorithm of the Ising model. The rest of the paper is organized as
follows. In section 2, we briefly describe the standard way of implementing Wolff
single-cluster algorithm on a CPU. In section 3, we explain the idea of quasi-
block synchronization on GPU. In section 4, we show the idea and techniques
of GPU calculation for the Wolff single-cluster algorithm. In section 5, we
compare the performance of GPU calculation with that of CPU calculation.
The summary and discussion are given in section 6.
3
2. Wolff single-cluster algorithm
Our Hamiltonian is given by
H = −J
∑
<i,j>
SiSj. (2)
Here, J is the coupling and Si is the Ising spin on the lattice site i. The
summation is taken over the nearest neighbor pairs < i, j >. Periodic boundary
conditions are employed.
Wolff proposed a Monte Carlo algorithm in which only a single cluster is
flipped at a time [4]. The spin-update process of the Wolff single-cluster algo-
rithm on a CPU can be formulated as follows [16, 17]:
(i) Choose a site i randomly.
(ii) Look at each of the nearest neighbors j. If Sj is parallel with Si, add
Sj to the cluster with probability p = 1 − e
−2β, where β is the inverse
temperature J/T . Then, flip Si.
(iii) If all the nearest neighbors j have been checked, look at each of the nearest
neighbors k of site j. If Sk is parallel with Sj and it was not a member
of the cluster before, add Sk to the cluster with probability p = 1− e
−2β.
Then, flip Sj .
(iv) For all the added spins Sk’s, repeat step (iii) until no more new bonds are
created.
(v) Go to (i).
Slightly different ways of formulation are possible, for example, for the timing
of the flip of spins. To check the condition that no more new bonds are created
in (iv), we may use two variables ic and in, where ic is the total number of
members of cluster, and in is the total number of sites which have been already
checked. We repeat the step (iii) while the condition in < ic is satisfied. We
note that the check of site j is done sequentially on a standard CPU.
4
3. Quasi-block synchronization
In parallel computing, the best performance is obtained if calculation on
each thread is done independently. At the same time, barrier synchronization
is needed at a certain point. CUDA provides a barrier synchronization function
syncthreads(). When a kernel function calls syncthreads(), all threads in a
block will be held at the calling location until everyone else in the block reaches
the location. But CUDA is not equipped with an inter-block synchronization
function. Since the synchronization for different blocks is required in our GPU
calculation of cluster algorithm, we here propose an idea of a quasi-block syn-
chronization.
In CUDA, the organization of a grid is determined by two special parameters
provided during kernel launch as mentioned in section 1. They are the size of
the grid in terms of numbers of blocks, grid, and the size of each block in
terms of numbers of threads, block. Moreover, all the threads are identified by
two-level unique coordinates, called blockIdx.x and threadIdx.x. We note
that blockIdx.x is the number associated with each block within a grid, and
threadIdx.x refers to the label associated with a thread in a block. Here, we
have used one-dimensional IDs.
Now, we present an idea of synchronization of threads across the blocks. The
program of quasi-block synchronization is as follows:
/* quasi-block synchronization */
if(threadIdx.x == 0){ stopper[blockIdx.x] = n; } (a)
if(threadIdx.x == 0){ stop = 0; } (b)
__syncthreads();
while(stop != 1){
if(threadIdx.x == 0){ stop = 1; } (c)
__syncthreads();
if(stopper[threadIdx.x] < n){ stop = 0; } (d)
__syncthreads();
} // while(stop != 1) loop end
We use a flag variable stop, which is allocated in a shared memory. Only after
stop of every block becomes 1, one gets out of the ”while” loop of quasi-block
synchronization. To check whether each block reaches this location, we use the
5
array stopper[ ], which is allocated in a global memory. As an initial value, we
set stopper[blockIdx.x] as m, which is smaller than n, for all blocks. When
any block reaches this location, stopper[blockIdx.x] is set as n in (a). If there
are stopper[blockIdx.x] which are smaller than n, then stop will be set as 0
for all the blocks in (d). The trick is that one sets stopper[blockIdx.x] in (a)
but one checks stopper[threadIdx.x] in (d). It is natural that we choose the
size of grid, grid, and the size of block, block, as the same value. In this way,
we can realize the inter-block synchronization. Of course, the synchronization
within a block is guaranteed by the synchronization function, syncthreads().
When we execute the quasi-block synchronization consecutively, we use dif-
ferent values for n in (a). There is a possibility that some block reaches the
next gate before some other block gets out of this routine completely. We es-
cape from this problem by increasing n by one. To make a loop for n between
nmin and nmax, we slightly modify the condition (d) at the points of nmin and
nmax.
We here mention the choice of the size of grid and that of block. To design
the organization of a grid, we take account of the size of shared memory and
register, the number of processors which are executed simultaneously within a
block, etc. A warp in CUDA is a group of 32 threads, which is the minimum size
of the data processed in SIMD (Single Instruction Multiple Data) fashion. Thus,
we choose grid and block as 32 in our GPU calculation of cluster algorithm.
Then, the total number of threads in a grid becomes 1024 (= 32 × 32). Since
the maximum number of threads per block is 512 for NVIDIA GeForce GTX
200 series, we could use a single block. However, it is not efficient because the
number of processors executed simultaneously in a block is limited.
Although it is natural to choose grid and block as the same value, we can
remove this restriction for block > grid by imposing a condition such that
if(threadIdx.x < grid)
if(stopper[threadIdx.x] < n){ stop = 0; } (d’)
}
in (d). We should note that there are upper limits for the choice of grid, the
6
total number of blocks, and block, the total number of threads. The upper limit
of block is determined by the maximum number of threads per block, and that
of grid by the number of blocks in one SM and the number of SMs. The number
of blocks in one SM is determined by the number of threads and the amount of
registers and shared memories, and the number of SMs depends on the model
of GPU.
We may employ other methods for inter-block synchronization. One can-
didate is to use atomic operations such as atomicAdd() provided by CUDA.
Atomic operations are performed without interference from any other threads.
Inter-block synchronization is realized by checking that a flag variable in global
memory is accessed from all the blocks. However, it takes time to use atomi-
cAdd() when many blocks try to access to the global memory for flag variable
at the same time. Xiao and Feng [18] proposed two approaches for inter-block
GPU synchronization. They compared the GPU lock-based synchronization us-
ing atomic function with the GPU lock-free synchronization. Our algorithm is
a refined version of the latter, the GPU lock-free synchronization, in the sense
that the number of access to global memories are reduced. They found that
the GPU lock-based synchronization takes more time than the GPU lock-free
synchronization. Another possible method is the master-slave method [19] by
Volkov and Demmel. The time of global memory accesses should be carefully
checked. Further study on inter-block synchronization will be required.
4. GPU calculation of the Wolff single-cluster algorithm
In this section, we describe the GPU calculation for the Wolff single-cluster
algorithm. The sublattice decomposition cannot be used for parallelization in
the case of cluster algorithm. We here perform parallel computation for the
newly added spins Sk in the step (iii) of section 2. This idea is similar to that
by Kaupuzs et al. [15], where these newly added spins in the wave front of
the growing cluster were referred to as wave-front spins. We assign each of
wave-front spins to the thread in the grid.
7
There are several points which ensure that the parallel algorithm works cor-
rectly. We should avoid the situation where different threads try to incorporate
the same spin to the cluster simultaneously. To do this, after checking one di-
rection of nearest neighbors, we perform both thread synchronization and quasi-
block synchronization. Moreover, the newly added spins should be numbered.
In such a numbering, the synchronization after the process of each direction of
neighbors is essential. The numbering and the update of ic, the total number
of members of cluster, are done when the last direction of neighbors is checked
for each spin.
Then, the step (iii) is modified as follows:
(iii’) All the processes are done for each thread representing the wave-front
spin. After checking each direction of nearest neighbors, perform the syn-
chronization. When the last direction of the neighbors is checked, sum up
the total number of newly added spins, and complete the numbering of
the newly added spins.
We here make a remark on the implementation of parallel computation.
The number of wave-front spins is represented by (ic-in) in terms of ic and
in discussed in section 2. If the thread is indexed by
index = blockIdx.x * 32 + threadIdx.x;
and the kernel function is invoked on the condition that index < (ic-in), we
make calculations only for wave-front spins.
Then, the main part of GPU kernel function is designed as
if( index < (ic-in) ){
"check one direction of the neighbors of wave-front spin"
}
"perform synchronization"
...
...
if( index < (ic-in) ){
"check the last direction of the neighbors of wave-front spin"
}
"perform synchronization"
"flip the wave-front spin"
"update the numbering of the newly added spins and ic"
8
As mentioned before, we chose the size of the grid in terms of threads as 1024
(= 32 × 32). Then, actually, the upper limit of the parallelization is 1024. In
other words, the calculation is made in parallel for min((ic-in),1024). Thus,
in, the total number of sites which have been already checked, is updated as in
= in+min((ic-in), 1024).
In making the GPU calculation with CUDA, the proper use of shared mem-
ories, the technique to avoid the warp divergence or the bank conflict when
taking the summation over the shared memories, etc., are very effective for fast
computation. Since the access to global memory is time-consuming, it is better
to reduce the frequency of the access to global memory. For this purpose we
use the following data structure; the information on spin value and that on flag
index to specify whether the site is a member of the cluster or not are put into
one word.
We finally note that we use a linear congruential random generator which
was proposed by Preis et al. [12] when evaluating the transition probability
p = 1− e−2β .
5. Results
We have tested the performance of our code on NVIDIA GeForce GTX 285.
For comparison, we run the code on a current CPU, Intel(R) Xeon(R) CPU
W3520 @ 2.67GHz. Only one core of the CPU is used. For compiler, we have
used gcc 4.2.1 with option -O3.
Since the cluster size is dependent on the temperature, the computational
time changes with temperature. So we compare the GPU computational time
with the CPU computational time at the critical temperature, Tc/J = 2.269
for the 2D Ising model and Tc/J = 4.510 for the 3D Ising model. The average
computational times per a single-cluster update at the critical temperature for
the 2D and 3D Ising model are tabulated in table 1 and table 2, respectively.
There, the time for only spin-update and that including the measurement of en-
ergy and magnetization are given. We show the measured time in units of micro
sec. The linear system sizes (L = Nx = Ny = Nz) are L = 512, 1024, 2048, 4096
9
for the 2D Ising model and L = 64, 128, 256 for the 3D Ising model. We can
see from tables 1 and 2 that the acceleration of our single-cluster algorithm in-
creases as the linear system size grows to a large size. The GPU computational
speed for the 2D Ising model with L = 4096 is 5.60 times as fast as the CPU
computational speed, and that for the 3D Ising model with L = 256 is 7.90
times as fast as the CPU computational speed. This result indicates that the
acceleration rate of our single-cluster algorithm surpasses that of the algorithm
by Kaupuzs et al. [15], which reached the speed-up about 2.67 times for the 3D
Ising model with linear size L=1024. Because of the size of memory for GPU,
that is, 1GB for NVIDIA GeForce GTX 285, the system size is currently limited
to L = 4096 for 2D and L = 256 for 3D. With the increase of the system size
available, much better performance is expected for L=1024 in 3D.
L GPU CPU
update only + measurement update only + measurement
512 7.634 msec 7.677 msec 4.151 msec 4.913 msec
1024 15.80 msec 16.10 msec 20.97 msec 24.19 msec
2048 33.02 msec 34.50 msec 87.59 msec 100.9 msec
4096 81.37 msec 83.95 msec 412.0 msec 470.9 msec
Table 1: Average computational time per a single-cluster update at Tc for the 2D Ising model.
The time for only single-cluster update and that including the measurement of energy and
magnetization are given.
L GPU CPU
update only + measurement update only + measurement
64 1.176 msec 1.378 msec 1.170 msec 1.947 msec
128 2.801 msec 3.483 msec 9.785 msec 15.76 msec
256 13.04 msec 17.45 msec 97.47 msec 137.9 msec
Table 2: Average computational time per a single-cluster update at Tc for the 3D Ising model.
The time for only single-cluster update and that including the measurement of energy and
magnetization are given.
Next, we refer to the temperature dependence of our single-cluster algorithm.
We plot the temperature dependence of the acceleration rate, that is, the ratio
of the CPU computational time to the GPU computational time for the 2D
10
2.26 2.280
4
8
(a) 2DCP
U 
tim
e/
 G
PU
 ti
m
e
+measurement
cluster update only
T/J
4.5 4.550
5
10
(b) 3DCP
U 
tim
e/
 G
PU
 ti
m
e
+measurement
cluster update only
T/J
Figure 1: (a) Temperature dependence of the acceleration rate for GPU computation for the
2D Ising model with L = 4096 and (b) that for the 3D Ising model with L = 256.
Ising model with L = 4096 and that for the 3D Ising model with L = 256 in
figures 1(a) and (b), respectively.
The time for only spin-update and that including the measurement of energy
and magnetization are shown there. We notice from figures 1(a) and (b) that the
acceleration of computational speed increases at the temperatures apart from
the critical temperature. The reason of accelerating the computational speed at
low temperature is that the cluster size grows larger and the parallel computing
becomes more effective. On the other hand, the cluster size becomes smaller at
high temperature, but the efficiency of measurement of energy and magnetiza-
tion in parallel computing becomes prominent. To conclude, our single-cluster
algorithm is effective over a wide range of temperatures.
As an illustration, we plot the Binder ratio
U(T ) = 1−
< M(T )4 >
3 < M(T )2 >2
(3)
of the 2D and 3D Ising model in figures 2(a) and (b), respectively. For the
2D Ising model, we discarded the first 5,000 single-cluster updates and the
next 50,000 single-cluster updates were used for measurement. The first 10,000
single-cluster updates were discarded and the next 100,000 single-cluster were
used for measurement for the 3D Ising model.
11
2.25 2.26 2.27 2.28
0.4
0.5
0.6
0.7
U(
T)
T/J
512
1024
2048
4096
(a) 2D
4.48 4.5 4.52 4.540
0.2
0.4
0.6
U(
T)
T/J
64
128
256
(b) 3D
Figure 2: (a) Binder ratio of the 2D Ising model for L=512, 1024, 2048 and 4096 and (b) that
of the 3D Ising model for L=64, 128 and 256.
6. Summary and discussion
We have formulated a GPU parallel computing of the Wolff single-cluster
Monte Carlo algorithm for the Ising model. We have compared the GPU com-
putational time with the CPU computational time for the 2D and 3D Ising
model. The GPU computational speed for the 2D Ising model with L = 4096 is
5.60 times as fast as the CPU computational speed at the critical temperature,
and that for the 3D Ising model with L = 256 is 7.90 times as fast as the CPU
computational speed. This acceleration rate is much higher than that reported
by Kaupuzs et al. [15].
We have shown that the computational speed is accelerated for GPU over
a wide range of temperatures for larger system sizes. The extension of this
calculation to the Potts model is straightforward. It is also easily extended to
continuous spin models, such as the classical XY model, using the Wolff’s idea of
embedded cluster [4]. The application to quantum cluster algorithm [20, 21, 22]
is highly desirable. It is also interesting to apply this parallel calculation to the
probability-changing cluster algorithm [23].
It is apparent from table 1 and 2 that the calculation with CPU is better for
smaller lattices. For the calculation of larger lattices, a hybrid calculation using
CPU and GPU could be possible; that is, one starts with the calculation using
12
CPU, and then switches to the calculation using GPU when the size of growing
cluster exceeds some value. In our code, one checks (ic-in), the number of
wave-front spins. However, in such a hybrid calculation one needs the transfer
of data between CPU and GPU. We tested the performance of such a hybrid
calculation. In table 3, we compare the average computational time at Tc of
the 2D Ising model with hybrid calculation, and that with only GPU and that
with only CPU. In the hybrid calculation we switch from CPU to GPU when
the number of wave-front spins exceeds 100. For L = 1024 in 2D, the trade-off
between the saving time by the use of CPU for initial stage of growing cluster and
the extra time of data transfer is compatible, that is, the total computational
time is almost the same. However, for much larger lattices the parallelization
in GPU becomes more prominent, and the time for data transfer becomes a
bottleneck. Although a hybrid calculation is not so efficient in this algorithm,
it will be interesting to pursue such an approach in other problems.
L GPU hybrid CPU
512 7.634 msec 6.995 msec 4.151 msec
1024 15.80 msec 16.89 msec 20.97 msec
2048 33.02 msec 40.10 msec 87.59 msec
4096 81.37 msec 111.8 msec 412.0 msec
Table 3: Comparison of average computational time per a single-cluster update at Tc for the
2D Ising model. The time for hybrid calculation is compared with that with only GPU and
that with only CPU.
We here mention the bottleneck of our method associated with the quasi-
block synchronization. This does not mean that the quasi-block synchronization
itself takes long time, but the latency is a problem when all blocks reach the
quasi-block synchronization point. When a new spin is added to the cluster in
the step (iii’) of section 3, our method needs to update a flag variable in global
memory to check whether that site belongs to the cluster or not. Since the
sites of the newly added spins are chosen randomly, this task is not a coalesced
memory access in global memory, and the amount of the task becomes quite
different for each block. However, all blocks must wait for other blocks at the
13
quasi-block synchronization point. Thus, the latency becomes a bottleneck.
If we could equalize the load of the global memory access of each block, the
performance will be improved. But such a trial will be a difficult job.
We here mention about the GPU occupancy, which is available by the CUDA
Visual Profiler. The value of the GPU occupancy of our algorithm is 0.25, which
is lower than 1. It is because we fix the number of threads in a block as 32.
We could bring the GPU occupancy close to 1 by increasing the number of
parallelization, but in our algorithm it also causes extra computational cost due
to the extra warp together with that in the process of summing up the total
number of newly added spins.
After we finished the present work, we came to know that Hawick et al.
[24] also studied the CUDA implementation of the Wolff algorithm. They used
a modified Connected Component Labelling for the assignment of the cluster.
Since the GPU performance of the Wolff update is not so good as that of the
Metropolis update, they put more emphasis on the hybrid implementation of
Metropolis and Wolff updates and the optimal choice of the ratio of both up-
dates. It is interesting to compare their GPU implementation of the Wolff
single-cluster algorithm with that of ours, which will be left to a future work.
We finally make a comment on synchronization. In CUDA, the function for
inter-block synchronization is not natively provided. We have proposed an idea
of a quasi-block synchronization with CUDA, which is effective for our single-
cluster algorithm. This quasi-block synchronization algorithm can be used not
only for a single-cluster Monte Carlo algorithm but also in many fields where
the synchronizations of all threads in all blocks are required.
Acknowledgment
This work was supported by a Grant-in-Aid for Scientific Research from the
Japan Society for the Promotion of Science.
14
References
[1] L. Onsager, Crystal Statistics. I. A Two-Dimensional Model with an Order-
Disorder Transition, Phys. Rev. 65 (1944) 117-149.
[2] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller,
Equation of State Calculations by Fast Computing Machines, J. Chem.
Phys. 21 (1953) 1087-1092.
[3] R.H. Swendsen, J.S. Wang, Nonuniversal critical dynamics in Monte Carlo
simulations, Phys. Rev. Lett. 58 (1987) 86-88.
[4] U. Wolff, Collective Monte Carlo Updating for Spin Systems, Phys. Rev.
Lett. 62 (1989) 361-364.
[5] F. Wang, D.P. Landau, Efficient, Multiple-Range Random Walk Algorithm
to Calculate the Density of States, Phys. Rev. Lett. 86 (2001) 2050-2053.
[6] I.S. Ufimtsev, T.J. Martinez, Quantum Chemistry on Graphical Processing
Units. 1. Strategies for Two-Electron Integral Evaluation, J. Chem. Theory
and Computation 4 (2008) 222-231.
[7] J.A. Anderson, C.D. Lorenz, A. Travesset, General purpose molecular
dynamics simulations fully implemented on graphics processing units, J.
Comp. Phys. 227 (2008) 5342-5359.
[8] A. Sunarso, T. Tsuji, S. Chono, GPU-accelerated molecular dynamics sim-
ulation for study of liquid crystalline flows, J. Comp. Phys. 229 (2010)
5486-5497.
[9] O. Kuttera, R. Shams, N. Navab, Visualization and GPU-accelerated sim-
ulation of medical ultrasound from CT images, Comp. Methods and Pro-
grams in Biomedicine 94 (2009) 250-266.
[10] H. Scherl, B. Keck, M. Kowarschik, J. Hornegger, Fast GPU-Based CT
Reconstruction using the Common Unified Device Architecture (CUDA),
IEEE Nuclear Science Symposium Conference Record 6 (2007) 4464-4466.
15
[11] T. Preis, P. Virnau, W. Paul, J.J. Schneider, Accelerated fluctuation anal-
ysis by graphic cards and complex pattern formation in financial markets,
New J. Phys. 11 (2009) 093024.
[12] T. Preis, P. Virnau, W. Paul, J.J. Schneider, GPU accelerated Monte Carlo
simulation of the 2D and 3D Ising model, J. Comp. Phys. 228 (2009) 4468-
4477.
[13] B. Block, P. Virnau, T. Preis, Multi-GPU accelerated multi-spin Monte
Carlo simulations of the 2D Ising model, Comp. Phys. Comm. 181 (2010)
1549-1556 .
[14] S. Bae, S.H. Ko, P.D. Coddington, Parallel Wolff Cluster Algorithms, Int.
J. Mod. Phys. C 6 (1995) 197-210.
[15] J. Kaupuzs, J. Rimsans, R.V.N. Melnik, Parallelization of the Wolff single-
cluster algorithm, Phys. Rev. E 81 (2010) 026701.
[16] W. Janke, Monte Carlo Simulations of Spin Systems, in Computational
Physics, edited by K. H. Hoffmann and M. Schreiber (Springer, Berlin,
1996), pp. 10-43.
[17] D.P. Landau, K. Binder, A Guide to Monte Carlo Simulations in Statistical
Physics, 2nd edition (Cambridge University Press, Cambridge, 2005).
[18] S. Xiao and W. Feng, Inter-Block GPU Communication via Fast Barrier
Synchronization, Proc. of the IEEE International Parallel and Distributed
Processing Symposium, April 2010, pp. 1-12.
[19] V. Volkov, J.W. Demmel, Benchmarking GPUs to Tune Dense Linear Alge-
bra, Proc. 2008 ACM/IEEE Conference on Supercomputing (IEEE Press,
2008), pp. 1-11.
[20] H.G. Evertz, G. Lana, M. Marcu, Cluster algorithm for vertex models,
Phys. Rev. Lett. 70 (1993) 875-879.
16
[21] N. Kawashima, J.E. Gubernatis, Loop Algorithms for Monte Carlo Simu-
lations of Quantum Spin Systems, Phys. Rev. Lett. 73 (1994) 1295-1298.
[22] B.B. Beard, U.-J. Wiese, Simulations of Discrete Quantum Systems in Con-
tinuous Euclidean Time, Phys. Rev. Lett. 77 (1996) 5130-5133.
[23] Y. Tomita, Y. Okabe, Probability-Changing Cluster Algorithm for Potts
Models, Phys. Rev. Lett. 86 (2001) 572-575.
[24] K. A. Hawick, A. Leist, D. P. Playne, Cluster and Fast-Update Simulations
of Regular and Rewired Lattice Ising Models Using CUDA and Graphical
Processing Units, Technical Report CSTN-104 (2010).
17
