A Study of Single and Multi-device Synchronization Methods in Nvidia
  GPUs by Zhang, Lingqi et al.
A Study of Single and Multi-device
Synchronization Methods in Nvidia GPUs
Lingqi Zhang∗, Mohamed Wahib†§, Haoyu Zhang‡, Satoshi Matsuoka§∗,
∗ Tokyo Institute of Technology, zhang.l.ai@m.titech.ac.jp
† National Institute of Advanced Industrial Science and Technology, mohamed.attia@aist.go.jp
‡ miHoYo Inc, (This work was done while the co-author worked in Tokyo Institute of Technology) lynkzhang@gmail.com
§ RIKEN Center for Computational Science, matsu@acm.org
Abstract—GPUs are playing an increasingly important role
in general-purpose computing. Many algorithms require syn-
chronizations at different levels of granularity in a single GPU.
Additionally, the emergence of dense GPU nodes also calls for
multi-GPU synchronization. Nvidia’s latest CUDA provides a
variety of synchronization methods. Until now, there is no full
understanding of the characteristics of those synchronization
methods. This work explores important undocumented features
and provides an in-depth analysis of the performance consid-
erations and pitfalls of the state-of-art synchronization methods
for Nvidia GPUs. The provided analysis would be useful when
making design choices for applications, libraries, and frameworks
running on single and/or multi-GPU environments. We provide a
case study of the commonly used reduction operator to illustrate
how the knowledge gained in our analysis can be useful. We also
describe our micro-benchmarks and measurement methods.
Index Terms—CUDA Barrier, Synchronization, GPUs
I. INTRODUCTION
GPUs have been playing an increasingly important role in
general-purpose computing. Different scientific areas exploit
the power of GPUs to accelerate science and engineering
applications. Many complex algorithms require different lev-
els of synchronizations, through the use of barriers. Until
recently 1, developers used two methods of synchronization
in CUDA. First, developers made use of CUDA thread block
synchronization to develop complex algorithms [2]. Second,
for applications like Deep Learning (DL), the CPU-side im-
plicit barrier occurring after the kernel launch function is used
for device-wide synchronization [3].
Due to the importance of device-wide synchronization,
several researchers attempted to develop software device-wide
barriers [4], [5]. Liu et al. [6] also proposed a hardware-
software cooperative framework for synchronization. Yet the
increase in complexity and density of GPUs in GPU-based
systems, e.g. Nvidia DGX-2 includes 16 GPUs, call for a
general and high-performance method for devices-wide and
multi-GPU synchronization. Recently Nvidia proposed meth-
ods for synchronizations that spans all levels of granularity
from a small group of threads in a GPU to a multi-GPU device:
warp level, thread block level, and grid level. The grid level
synchronization can be a productive way to perform device-
wide and multi-device level synchronization. This hierarchy of
1Nvidia introduced a hierarchy of synchronization methods (based on
Cooperative Groups(CG)) since CUDA 9.0 [1]
synchronization methods can make GPU programming more
productive. Thus, it is important to study the performance
characteristics of different levels of synchronization methods.
In this paper, we characterize the synchronization methods
in Nvidia GPUs. Specifically, in this work:
• We identify the performance characteristics of different
synchronization methods in Nvidia GPUs.
• We use different implementations of the reduction opera-
tor as a motivating example to demonstrate how to use the
knowledge gained in this study to optimize the reduction
kernel.
• We explore the pitfalls of using several synchronization
instructions.
• We provide our micro-benchmarks used in measure-
ments 2.
II. BACKGROUND
A. CUDA Programming Model
CUDA is a C-like programming model for Nvidia GPUs.
It offers three levels of programming abstractions: thread,
thread block, and grid. Among them, thread is the most basic
programming abstraction. At the hardware side, there is a
hierarchy that maps to the CUDA programming model. Three
different levels of hardware resources exist: ALU, Stream
Multi-Processor (SM), and the GPU. Take the Volta V100 [7]
as an example, a V100 GPU consists of 80 SMs; an SM is
partitioned into 4 processing blocks, each consists of several
ALUs, e.g. 16 FP32 Cores.
A warp in CUDA is a small number of threads executed
together as a working unit in a SIMT fashion. A warp in all
Nvidia GPU generations consists of 32 threads. Inside an SM
in V100 there are 4 warp schedulers corresponding to the 4
partitions inside one SM. CUDA’s runtime will schedule one
thread block to only one SM, and one grid to only one GPU,
though it may occupy several SMs.
Figure 1 shows the details of CUDA programming model,
its corresponding hardware abstraction, and the mapping rela-
tionship between them.
2The source code is available at:
https://github.com/neozhang307/SyncMicrobenchmark
ar
X
iv
:2
00
4.
05
37
1v
1 
 [c
s.D
C]
  1
1 A
pr
 20
20
Thread
Thread…
Block Block Block…
Grid Grid GPU1 GPU2
SM SM…
W
arp Scheduler
8 X FP64
16 X FP32
16 X INT
…
Warp
Programming Model Hardware (V100 as an example)
32 threads
Several blocks map to 1 SM
Normally 1 grid maps to 1 GPU
Fig. 1. CUDA programming model and corresponding hardware structure
B. Related Work
Many efforts have been done to micro-benchmark GPUs.
Volkov et al. [8] benchmarks were partially used to study
kernel launch overhead, manual device-wide barriers, data
transfer, pipeline latency, instructions throughput, and metrics
related to GPU memory system. This knowledge discovered
was then used to tune several dense linear algebra algo-
rithms. Wong et al. [9] proposed the use of more fine-
grained micro-benchmarks to understand the performance of
GPUs, including the behavior of instructions and memory
structure of GPUs. Zhang et al. [10] introduced assembly-
level micro-benchmarks. Recently, Jia et al. use ASM code to
run micro-benchmarks on new Nvidia Tesla GPUs, i.e. V100
and P100 [11]. Several other works mainly focused on the
memory hierarchy of GPUs, e.g. [12]–[14]. Among them, Mei
et al. [14] discovered some cache patterns that were missed by
previous researches. To the authors’ knowledge, none of the
GPU micro-benchmarking efforts focus on CUDA’s hierarchy
of synchronizations.
Volkov et al. [15] also compared kernel launch overhead
and a manually implemented software barrier. Yet they only
tested the overhead of light kernels, which is not practical
for most of the applications. Other efforts analyzed software
synchronization methods by comparing the performance of
implementations of several algorithms with and without their
software synchronization methods [4], [5], [16]. The analysis
works on case-by-case bases and can not be generalized to
different kernels.
III. OVERVIEW OF SYNCHRONIZATION METHODS IN
NVIDIA GPUS
A. Primitive Synchronization Methods in Nvidia GPUs
Starting from CUDA 9.0, Nvidia added the feature of Coop-
erative Groups (CG). This feature is planned to allow scalable
cooperation among groups of threads and provide flexible
parallel decomposition. Coalesced groups and tile groups can
Thread Group
Multi-Grid 
Group Block Group
 Tile 
Group
Coalesced 
Group
Grid Level 
Synchronization
Block Level 
Synchronization Warp Level Synchronization
Synchronization
 Methods
CUDA Groups {
{
sync(); sync();__syncthreads(); sync();API {
Grid 
Group
Fig. 2. Hierarchy of synchronizations in CUDA
be used as a method to decompose thread blocks. Beyond
the level of thread blocks, grid synchronization is proposed
for inter-block synchronization. Multi-grid synchronization is
proposed for inter-GPU synchronization.
In the current version of CUDA (10.0), tile group and
coalesced group only work correctly inside a warp. Analysis
of PTX code shows that those two instructions are transformed
to the warp.sync instruction. Hence, as it stands, we consider
the synchronization capability of those methods to be only
applicable to the warp level.
Figure 2 shows the granularity of cooperative groups and
synchronization in the current version of CUDA.
1) Warp-level Synchronization (Synchronization Inside a
Single GPU): Current CUDA supports two intra-warp syn-
chronization methods, i.e. tile synchronization and the co-
alesced group synchronization corresponding respectively to
the tile group and coalesced group in Figure 2. Previous
versions of CUDA guarantee that all threads inside a warp
process the same instruction at a time. Yet the introduction of
synchronization methods inside a warp plus the fact that each
thread now has its own Program Counter (PC) implies a future
possibility of removing this feature.
2) Block-level Synchronization (Synchronization Inside a
Single GPU): Block-level synchronization corresponds to the
thread block in the programming model. According to CUDA’s
programming guide [1], its function is the same as the classical
synchronization primitive syncthreads().
3) Grid-level Synchronization (Single GPU Synchroniza-
tion): Starting from CUDA 9.0, Nvidia introduced grid group
grid-level synchronization. Grid-level synchronization is a
method to do single GPU synchronization. In order to use
a grid group, cudaLaunchCooperativeKernel() API call is
necessary, in comparison to the traditional kernel launch
(<<<>>>).
4) Multi-Grid Level Synchronization (Multi-GPU Synchro-
nization): CUDA 9.0 also introduced the concept of multi-grid
group. This group is initialized by a kernel launch API: cu-
daLaunchCooperativeKernelMultiDevice(). Synchronizing this
group can act as a way to do multi-GPU synchronization in a
single node.
B. Non-primitive Synchronization
1) Software Barrier for Synchronization: Li etc. [16]
researched fine-grained synchronization. Beyond it, Xiao,
etc. [5] introduced a software device-level synchronization.
The authors limit the number of blocks per SM to only one in
order to avoid deadlocks. Sorensen et al. extended this work by
adding an automatic occupancy discovery protocol to discover
activate warps [4].
2) Implicit Barrier for Synchronization: Before the in-
troduction of grid-level synchronization, the typical way to
introduce a device-wide barrier to a program was to use several
kernels in a single CUDA stream. A stream is a logical queue
that enforces an execution order on the CUDA kernels in the
stream, i.e. the kernels and data movement commands are
executed in the order by which they appeared in the stream.
For example, many DL frameworks, e.g., Chainer [3], use this
method to enforce execution order.
3) Multi-GPU Synchronization: The common way to do
multi-GPU synchronization is to synchronize CPU threads
orchestrating the GPUs. The basic idea is to use one CPU
thread per device (or one MPI rank per device). Additionally,
with the help of the GPUDirect CUDA technology, it is
also possible to implement multi-GPU software barriers using
GPUDirect APIs.
Since we are concerned in this paper with studying general
and intrinsic barrier methods, we would not discuss manu-
ally implementation barriers, including software barriers and
GPUDirect based manually implementations.
IV. SYNCHRONIZATION VIA CPU-SIDE IMPLICIT
BARRIERS
Launching new kernels in a single stream can act as a
device-wide implicit barrier to maintain the order of the
program. Yet launching an additional kernel is not a free lunch:
it will also introduce overheads. This section will inspect the
overhead of traditional launch function, i.e., the <<<>>>
kernel invocation method, and the new launch functions,
i.e. cudaLaunchCooperativeKernel() and cudaLaunchCooper-
ativeKernelMultiDevice() Nvidia introduced from CUDA 9.0
for CG.
To simplify our discussion, this section does not consider
the extra overhead of launching the first kernel. Instead, in all
our measurements we assume a warm-up kernel was already
launched, and we focus our analysis on the behavior of kernels
launched after the warm-up kernel.
Before further discussion in this section, we introduce the
following terms:
• Kernel Execution Latency: Total time spent in executing
the kernel, excluding any overhead for launching the
kernel.
• Launch Overhead: Latency that is not related to kernel
execution.
• Kernel Total Latency: Total latency to run kernels.
TKernel Total Latency = TKernel Execution Latency +
TLaunch Overhead
Figure 3 is our sample code for micro-benchmarks. It also
shows the concept of kernel execution latency and kernel total
latency. Kernel execution latency is controlled by the sleep
instruction. Tkerne totall latency = ((timer3 − timer2) −
(timer2 − timer1))/(5 − 1) here; Elaborate details on the
bench-marking methods are discussed in Section IX-B.
1 g l o b a l vo id n u l l k e r n e l ( ) {
2 / / k e r n e l e x e c u t i o n l a t e n c y i s 10 us h e r e .
3 r e p e a t 1 0 ( asm v o l a t i l e ( ” n a n o s l e e p . u32 1000 ; ” ) ; )
4 }
5 . . .
6 r e c o r d ( t i m e r 1 ) ;
7 r e p e a t 1 ( l a u n c h ( n u l l k e r n e l , l a u n c h p a r a m e t e r s ) ; ) ;
8 c u d a D e v i c e S y n c h r o n i z e ( ) ;
9 r e c o r d ( t i m e r 2 ) ;
10 r e p e a t 5 ( l a u n c h ( n u l l k e r n e l , l a u n c h p a r a m e t e r s ) ; ) ;
11 c u d a D e v i c e S y n c h r o n i z e ( ) ;
12 r e c o r d ( t i m e r 3 ) ;
13 . . .
Fig. 3. Sample code to micro-benchmark implicit barriers for a null (empty)
kernel
TABLE I
LAUNCH OVERHEAD AND NULL KERNEL LATENCY OF DIFFERENT
LAUNCH FUNCTIONS
Null Kernel
Launch Type Launch Overhead Kernel Total Latency
(ns) (ns)
Traditional 1081 8888
Cooperative 1063 10248
Cooperative Multi-Device 1258 10874
In this way, We measured the launch overhead by using the
kernel fusion method. We also test the kernel total latency of
a null kernel for comparison. Table I shows the result.
V. SINGLE GPU SYNCHRONIZATION
In this section, we characterize the performance of warp,
thread block, and grid level synchronization. Warp and block
abstractions exist inside an SM. For warp and block, we used
the micro-benchmark discussed in Section IX-C. Grid is an
inter-SM abstraction, for that, we used the micro-benchmark
discussed in Section IX-D.
For the warp shuffle operation and block synchronization
operation, the throughput is reported by CUDA programming
guide [1] at the granularity of warps and blocks, respectively.
Yet it is possible that the size of a group that performs
synchronization or shuffle would influence the performance
itself. Hence in this work, we consider the group size when
experimenting with warp shuffle and block synchronization.
A. Warp-Level Synchronization
The current CUDA (10.0) supports two kinds of warp
level synchronization: tile group based and coalesced group
based (as seen in Figure 2). Additionally, the CUDA shuffle
operation, which exchanges a register value among threads in
a warp, is an operation that implies a synchronization after it.
We also include the results of the shuffle operation.
Since the size of a synchronization group might influence
the result, we tested every possible group size for both tile
group and coalesced group. The possible tile group sizes are:
1, 2, 4, 8, 16, and 32. The possible coalesced group size ranges
from 1 to 32. Latency is tested by using only 32 threads (a
warp) in a CUDA kernel with one block. The throughput is
tested by iterating every possibility pair of up to 1024 threads
TABLE II
PERFORMANCE OF WARP SYNCHRONIZATION IN A BLOCK
Type Latency Throughput Reference [1]
(group size) cycle (sync/cycle) thread op/cycle
V100 P100 V100 P100 V100 P100
Tile(*) 14 1 0.812 1.774 - -
Shuffle(Tile)(*) 22 31 0.928 0.642 32 32
Coalesced(1-31) 108 1 0.167 1.791 - -
Coalesced(32) 14 1 1.306 1.821 - -
Shuffle(COA)(*) 77 50 0.121 0.166 - -
Block(warp)) 22 218 0.475 0.091 16 32
and up to 64 blocks per SM and recording only the highest
result. Table II shows the result of warp level synchronization.
For tile group synchronization the size of the group influ-
ence neither latency nor throughput. A possible explanation
is that CUDA could be merging all the concurrent tile group
synchronization instructions into a single instruction. For coa-
lesced group synchronization, the group size does not influence
the performance of P100. The group size does, however,
influence the performance of coalesced group in V100. The
performance is the highest when all the threads inside a warp
belong to a single coalesced group. For convenience, because
the group size doesn’t influence the total latency of tile group
synchronization, we only record the throughput in the case of
a group size of 32 in tile group synchronization.
We use the reference throughput of shuffle operation men-
tioned in the CUDA programming guide [1] in Table II. Ap-
parently, the performance of V100 is closer to the theoretical
result in the programming guide. On the other hand, there
seem to be some overheads that influence the throughput of
the shuffle operation in P100.
B. Block-Level Synchronization
We tested every possible group size at the block level, i.e.
starting from 32 to 1024. We find that the throughput of block-
level synchronization is related to the number of active warps
per SM.
Figure 4 shows the relationship between the throughput of
block synchronization divided by warp count (warp sync per
cycle) and the maximum number of activate warps per SM
(as calculated by [7]). When the warp count exceeds the size
of max activate warp per SM, the device is saturated and the
throughput of block synchronization reaches its maximum.
With this observation, we conclude that the performance
of block-level synchronization is related to the warp count
per SM. We further summarize the performance of block
synchronization from a warp’s perspective in Table II.
CUDA’s programming guide [1] reports that the throughput
for syncthreads() (or block-level synchronization) is 16
operations per clock cycle for capability 7.x (V100) and 32
for capability 6.0 (P100). The throughput of V100 is relatively
close to 16 op/cycle. But the result of P100 is far away from
32 op/cycle. To further support our result, the inverse of the
gradient of the points in the up part of Figure 4 can represent
throughput. Obviously, the gradient of block synchronization
0
5
10
15
20
25
30
1 4 16 64 256 1024
La
te
nc
y 
(T
ho
us
an
ds
 o
f C
yc
le
s)
Warp / SM V100 P100
Active Warp/SM
1 2 4 8 16 32 64
Th
ro
ug
hp
ut
 
(W
ar
p 
Sy
nc
h/
cy
cle
)
0.0
0.1
0.2
0.3
0.4
0.5
V100 P100
Fig. 4. Relationship between throughput of block sync (per warp perspective)
(up) and active warp/SM perspective (down)
V100 32 64 128 256 512 1024 P100 32 64 128 256 512 1024
1 1.43 1.45 1.46 1.50 1.71 2.21 1 1.77 1.78 1.79 1.83 1.91 2.26
2 1.81 1.82 1.88 1.99 2.48 3.49 2 2.06 2.07 2.11 2.23 2.65 3.52
4 2.85 2.90 3.07 3.53 4.52 4 3.45 3.50 3.62 4.04 4.90
8 5.07 5.26 5.70 6.71 8 6.53 6.58 7.04 8.39
16 8.52 8.81 10.30 16 12.20 13.46 14.92
32 19.29 24.51 32 31.69 28.42
Thread Per Block
Bl
oc
k 
Pe
r S
M
Fig. 5. Latency (us) of grid synchronization in V100 (left) and P100 (right)
in P100 is larger than V100. So, the throughput of P100 should
not be larger than V100’s.
Admittedly, it is also possible that the performance of block
synchronization in P100 is not ideal due to over-subscription.
Yet the latency of block synchronization in P100 is so large
that it is nearly impossible to find a point at which the
instruction pipeline is saturated while the overhead of over-
subscription is not so severe.
C. Grid-Level Synchronization
Figure 5 shows the heat map of grid synchronization. It
shows that in both V100 and P100 the latency of grid syn-
chronization is more related to the grid dimension (specifically,
block count per SM) than to the block dimension.
No matter how small the grid is, it seems that it is still
slower than the overhead of kernel launch we measured in
Section IV. Single GPU grid synchronization might not bring
about any benefit in performance, in comparison to implicit
barrier methods. Yet we argue that this performance difference
is negligible (at most 2.5us with two blocks/SM) in real
applications. In addition, using the implicit barrier instead
would eliminate the possibility of data reuse in shared memory
and registers.
VI. MULTI-GPU SYNCHRONIZATION METHODS
We consider three ways to do multi-GPU synchronization:
1 # pragma omp p a r a l l e l num ) t h r e a d s ( GPU count ) {
2 u n i t g i d =omp get thread num ( ) ;
3 c u d a S e t D e v i c e ( g i d ) ;
4 . . .
5 k e r n e l<<<>>>() ;
6 c u d a D e v i c e S y n c h r o n i z e ( ) ;
7 # pragma omp b a r r i e r
8 . . .
9 }
Fig. 6. Code example of using CPU threads for synchronization
1 GPU 32 64 128 256 512 1024 2 GPU 32 64 128 256 512 1024
1 1.45 1.41 1.43 1.52 1.80 2.50 1 7.29 7.26 7.34 7.35 7.67 8.44
2 1.72 1.74 1.82 2.10 2.92 4.56 2 7.92 7.91 8.00 8.24 9.00 9.93
4 3.02 3.07 3.33 4.01 5.72 4 10.14 10.19 10.02 10.71 12.17
8 5.42 5.54 6.59 8.48 8 16.35 16.15 17.11 18.84
16 8.84 9.98 12.75 16 29.85 30.83 33.56
32 20.81 26.23 32 62.80 68.05
Thread Per Block
Bl
oc
k 
Pe
r S
M
Fig. 7. Latency (us) of multi-grid synchronization in P100 platform for one
GPU (left) and two GPUs (right)
A. Using Multi-device Launch Function as an Implicit Barrier
When using the multi-device launch function with the
default flag, kernels will not execute until all the previous
operations in all the GPU streams involved have finished
execution [17]. Although this implicit barrier method is not
commonly used, we nonetheless evaluate it to assess if this
method is a valuable alternative. Section IX-B discusses in
detail micro-benchmark we use in this subsection.
B. Using CPU-side Barriers
A common way to make a barrier between GPUs is to
use CPU threads or processes to synchronize different GPUs.
We use openMP to measure the overhead in this case. Each
thread calls the cudaDeviceSynchronize() API to ensure the
asynchronously launched GPU kernels are executed till their
end. In addition, the threads use the openMP barrier API to
synchronize. Figure 6 shows the code example for this kind
of barrier. Finally, we appropriately pin the CPU threads. We
applied the same micro-benchmark discussed in Section IX-B
for this subsection.
C. Using Multi-grid Synchronization
Section IX-D discusses in detail micro-benchmark we use
in this subsection. Figure 7 and Figure 8 show the heat maps
of the latency of multi-grid synchronization in V100 and P100.
Because the inter-connection in the P100 system is PCIe, the
performance is worse than the V100 system that is equipped
with NVLink connection between devices.
We experimented with all 8 GPUs in the DGX-1, we found
that the performance of multi-grid synchronization among 2-
5 GPUs is similar to each other, and the performance of
multi-grid synchronization among 6-8 GPUs are similar to
each other. This behaviour is likely related to the internal
NVLink network structure of DGX-1. From Figures 7 and 8,
we can see that the performance of multi-grid synchronization
is influenced by both the grid dimension and number of active
Thread Per Block
Bl
oc
k 
Pe
r S
M 1 GPU 32 64 128 256 512 1024
1 1.42 1.44 1.56 2.04 3.06 7.34
2 1.81 1.86 2.33 3.34 6.93 18.97
4 2.92 3.37 4.35 7.53 19.10
8 5.32 6.35 9.10 20.68
16 9.66 11.72 24.24
32 20.84 34.04
2 GPU 32 64 128 256 512 1024 5 GPU 32 64 128 256 512 1024
1 6.44 6.46 6.53 6.99 8.05 12.41 1 7.02 7.05 7.15 7.62 8.68 13.32
2 6.77 6.80 7.28 8.32 11.80 24.14 2 7.37 7.44 7.92 9.01 12.72 25.16
4 7.96 8.41 9.46 12.57 24.21 4 8.61 9.14 10.14 13.41 25.23
8 12.47 13.63 16.55 28.03 8 13.19 14.21 17.16 28.71
16 22.48 24.64 37.04 16 23.58 25.61 38.15
32 45.88 58.60 32 48.71 61.66
6 GPU 32 64 128 256 512 1024 8 GPU 32 64 128 256 512 1024
1 18.67 18.66 18.68 19.26 20.28 24.78 1 20.97 21.00 21.10 21.42 22.55 26.93
2 19.03 19.12 19.54 20.54 23.64 35.89 2 21.18 21.41 21.85 22.81 25.98 37.99
4 20.29 20.88 21.80 24.77 36.37 4 22.62 23.04 24.13 27.08 38.60
8 23.39 24.43 27.18 38.93 8 25.98 26.62 29.33 40.86
16 29.27 31.41 44.37 16 32.20 33.67 45.98
32 54.24 69.70 32 58.30 71.90
Fig. 8. Latency (us) of multi-grid synchronization in V100 platform
34.04
58.60 61.66
69.70
71.90
1.26
67.20
9.30 10.60
0
20
40
60
80
0 1 2 3 4 5 6 7 8 9
La
te
nc
y (
us
)
GPU count
1 block/SM, 32 thread/block 1 block/SM, 1024 thread/block
32 block/SM, 64 thread/block Launch Overhead in multi-device launch
Launch Overhead in CPU-side barriers
Fig. 9. Comparison of implicit barriers performance: multi-device launch vs.
CPU-side barriers and multi-grid synchronization across 8 GPUs in DGX-1
warps per SM. With block/SM <= 8 and warp/SM <= 32,
the performance is acceptable. Apart from the case of one
GPU, latency in all cases is no more than 2x slower than the
fastest case (1 block/SM, 32 threads/block) and 2x faster than
the slowest case (32 blocks/SM, 64 threads/block).
D. Comparison
Figure 9 shows the results of all three multi-GPU synchro-
nization methods across 8 GPUs in DGX-1. For simplification,
we only plot the data of three cases of multi-grid synchro-
nization in Figure 9: a) one block/SM, 32 threads/block as the
fastest case, b) 32 blocks/SM, 64 threads/block as the slowest
case, and c) one block/SM, 1024 threads/block as a general
case, which is within the parameters we recommended in the
previous paragraph.
The CPU-side barrier relying on openMP barriers outper-
form implicit barriers in multi-device launch when the GPU
count is larger than two. Also, the overhead of the CPU-
side barrier is relatively steady w.r.t. GPU count. It is worth
mention that this result is relatively close to the kernel total
latency of a null kernel as shown in Table I.
Figure 9 shows two performance drops in multi-grid syn-
chronization. We anticipated that the second drop would be
between 4 GPUs and 5 GPUs, based on the internal network
structure of DGX-1 that groups 4 GPU together. However, we
find no reasons for the performance drop between 5 GPU and
6 GPU.
The figure shows that multi-grid synchronization outper-
forms the multi-device kernel launch function as an implicit
barrier. On the other hand, as long as the program is not
oversubscribed, i.e., no more than 1024 threads per SM, the
performance of multi-grid synchronization is at most 3x slower
than CPU-side barriers. Yet the difference is around 16us,
which is practically not an issue in the situation of 8 GPUs.We
argue that this minor cost should not discourage programmers
from considering the use of multi-grid synchronization in their
algorithms, given the utility provided in terms of simplicity of
programming, and avoiding reliance on third-party libraries
such as openMP or MPI.
VII. CASE STUDY: REDUCTION OPERATOR
We use the reduction operator (summing the elements of
an array) as a case study. Harris et al. [2] is a notable work
that focused on optimizing the reduction operator in CUDA.
They studied several optimization methods and optimized
the operator by optimizing for maximum memory bandwidth
utilization. Additionally, Luitjens et al. [18] introduced the use
of the shuffle primitive in reduction. The optimized reduction
kernels can be found in CUDA SDK samples [19]. There are
other similar optimization strategies [20], [21]. To the best of
the authors’ knowledge, all of the previous strategies didn’t
quantitatively compare different synchronization methods in
different implementations. In this section, we will demonstrate
how to capitalize on the analysis in previous sections to
make a decision between different reduction implementations,
depending on the input size and number of workers involved.
This approach can be applied to optimize any of the previous
reduction implementations and many other code generation
frameworks [22].
In addition to using single GPU synchronization methods in
optimizing for input size, there is a programmability benefit
in using multi-grid synchronization for multi-GPU systems.
In dense system, such as Nvidia DGX-1 and DGX-2, the
peer access feature enables one GPU to access the memory of
another GPU. In this case, multi-grid synchronization provides
an easy way to ensure sequential consistency. We explain this
in detail in section VII-E.
It is important to mention another potential benefit that does
not appear in the case of the reduction kernel. There is a
potential of improving data reuse by the means of replacing
several kernel invocations with a single persistent kernel that
uses multi-grid synchronization. An example of that would be
replacing kernel invocations in iterative stencil methods with a
persistent kernel that includes the time loop inside the kernel.
A. Performance Model
We assume that the throughput is indifferent to the size
of the problem (for any problem size that fully utilizes the
device). We also assume that the cost of synchronization is
the main cost of multi-threading. We can use Equation 2 to
know when to use fewer threads. In this equation, ”basic”
TABLE III
PROJECTED CONCURRENCY OF THE TWO CONFIGURATIONS IN
SECTION VII-B
scenery bandwidth latency concurrency
B/cycle cycle B
V100 P100 V100 P100 V100 P100
1 1 thrd. 0.62 0.43 13.0 18.5 8 8
1 warp 19.6 13.8 13.0 18.5 256 256
2 32 thrd. 19.6 13.8 13.0 18.5 256 256
1024 thrd 215 141 13.0 18.5 2796 2615
1 w h i l e ( i<n ) {sum+= g i d a t a [ i ] ; i += g r o u p s i z e ;}
Fig. 10. Code example of the main instruction in the memory bandwidth
micro-benchmark for proxying the reduction operation
might refer to single thread, single warp, single block, or
single GPU, and ”more” corresponds to more threads, more
warps, more blocks, or multi-GPU. We use Little’s Law [23] to
compute concurrency (Equation 1). To simplify the problem,
we consider Tbasic as the latency in Little’s Law, and Tmore
includes the overhead of synchronization as Equation 3 shows.
From this equation we can imagine three different scenarios:
1) If the input size is not larger than the concurrency of
”basic” threads, using fewer threads would always be
more profitable.
2) If the input size is larger than the concurrency of ”basic”
threads and no larger than the concurrency of ”more”,
we can use Equation 4 to compute the switching point.
3) If the input size is larger than the concurrency of ”more”
threads. We can use Equation 5 to know at which point
we should use fewer threads.
C = T ∗ Thr (1)
Tbasic +
Max(0,N−Cbasic)
Thrbasic
< Tmore +
Max(0,N−Cmore)
Thrmore
(2)
Tmore = Tbasic + Tsync = T + Tsync (3)
Nm < (T + Tsync) ∗ Thrbasic (4)
Nl <
(Tsync)∗Thrmore∗Thrbasic
Thrmore−Thrbasic (5)
∗(T represent Latency;Thr represent Throughput;
C represent concurrency)
B. Micro-benchmark and Basic Prediction
In the case of the GPUs we examine in this paper when
the input size is large enough, the bottleneck of reduction
algorithm is device memory bandwidth. Hence we use a mem-
ory bandwidth micro-benchmark to proxy the performance
of reduction. To make this micro-benchmark an accurate
representation, we add two add instructions to imitate the real
computation in the reduction operation. Figure 10 shows the
main instruction in the micro-benchmark.
Our objective is to identify when to use a single thread, a
single warp barrier, and until when would it be more efficient
TABLE IV
PREDICTING THE SWITCHING POINT BETWEEN TWO CONFIGURATIONS
scenery sync ltc* switch point
cycle B
V100 P100 V100 P100
1 1 warp Nl 110 155 70 70
1 warp Nm - - 76 75
2 1024 thrd Nl 420 2135 9076 32681
1024 thrd Nm - - 8501 29737
∗: 5 times synchronization
1 / / assume t h e d a t a r e s i d e s i n s h a r e d memory
2 f o r ( s t e p = 1 6 ; s t e p >=1; s t e p / = 2 ) {
3 / / o r use t h e s h u f f l e o p e r a t i o n h e r e
4 i f ( t i d + s t e p <32)sm [ t i d ]+=sm [ t i d + s t e p ] ;
5 s y n c h r o n i z e ( ) ;
6 }
Fig. 11. Code example of warp level reduction with synchronization
to use a multi-GPU barrier. Instead of enumerating every
possible case, we only consider two configurations here (and
it can be extended to other cases):
• To use a single thread or single warp barrier
• To use a single block with 1024 threads or with 32 threads
Normally in the two configurations we mentioned, the data
is usually kept in shared memory or cache, so we only measure
shared memory for the following part. Table III shows the
results of bandwidth (throughput), latency and concurrency.
Take the double type as an example (8 Bytes). In this case,
in both configurations, the input size exceeds the concurrency
of both ”basic” and ”more” settings, hence we only need to
take Nl in Equation 5 into consideration. Table IV shows the
results.
Table IV shows that: first, it is better to compute 32 data
points with a warp; second, there would be no benefit to
compute 1024 data points with 1024 threads per block. Our
further experiments show that those predictions are correct.
In addition, another potential overhead caused by synchro-
nization would be that the synchronization would possibly
clear the instruction pipeline. Threads might need additional
time to saturate the pipeline. So the real switching point would
likely be larger than this.
C. Warp Level Reduction
In this subsection, we compare different warp level synchro-
nization methods in the reduction kernel by observing their
behaviour in the current generations of GPUs. Figure 11 shows
our sample code, and Table V shows the result.
As shown in Table V, when using the volatile qualifier for
the input data, the performance of warp level synchronization
is no worse than in the case without the volatile qualifier
(shown as ”tile” in the table). Accordingly, the warp level
synchronization does not have much overhead other than to
ensure memory consistency. We can conclude that warp level
synchronization is no more than a memory fence in the current
version of CUDA. We also observe that the results for using
TABLE V
LATENCY (CYCLES) TO COMPUTE SUM OF 32 VALUES (DOUBLE
PRECISION)
serial nosync volatile tile coa tile coa
* & tile shuffle shuffle
V100 299 89 237 237 237 164 1261
P100 383 112 282 281 251 212 1423
∗result of no synchronization version is incorrect
1 d e v i c e REAL summing ( . . . ) { . . .
2 u i n t i = t h r e a d i d + b l o c k i d * blockdim ;
3 sum =0;
4 w h i l e ( i<n ) {
5 sum+= g i d a t a [ i ] ;
6 i += blockdim * g r i d d i m ;
7 }
8 r e t u r n sum ;
9 }
10 d e v i c e REAL b l o c k r e d u c e ( . . . ) { . . .
11 i = t h r e a d i d ;
12 sum =0;
13 w h i l e ( i<n ) {sum+= t d [ i ] ; i += blockdim ;}
14 / / n i s t h e pre−computed s w i t c h p o i n t
15 t d [ t h r e a d i d ]= sum ;
16 sum =0;
17 b l o c k . sync ( ) ;
18 i f ( warp id ==0)
19 {
20 i = t h r e a d i d ;
21 w h i l e ( i<blockDim ) {sum+= t d [ i ] ; i +=32;}
22 sum = s h u f f l e r e d u c e w a r p ( sum ) ;
23 }
24 r e t u r n sum ;
25 }
Fig. 12. Basic function of device wide reduction
the shuffle operation with the tile group have the lowest
latency.
D. Single GPU Reduction
In this Subsection, we directly apply the knowledge in Sec-
tion VII-B in implementing device-wide reduction. Figure 13
shows the code of reduction with explicit synchronization
and Figure 14 shows the code of reduction with implicit
synchronization for a single GPU.
The widely used GPU C++ library CUB [24] and CUDA
SDK samples [19] include single GPU reduction implementa-
tions, we compare the performance of those implementations
with our implementation.
Figure 15 and Table VI show the results. Our implemen-
tation is comparable to state of art implementations on V100
and is noticeably better on P100. We can learn from Figure 15
that using a CPU-side barrier (”implicit” in the figure) always
outperforms using grid synchronization (”grid sync” in the
figure), though the performance difference is not so decisive.
E. Multi-GPU Reduction
In this section, we use the code in Figure 13 and implicit-
MultiGPU code in Figure 14. Figure 16 shows the results.
Though it is hard to notice, an implicit barrier is always
slightly better than the multi-grid synchronization method.
1 / / works i n bo th s i n g l e and m u l t i GPU
2 g l o b a l vo id Exp l i c i tGPU ( . . . ) { . . .
3 w h i l e ( s t e p . n o t f i n i s h ( ) ) {
4 / / d i r e c t l y s t o r e d a t a i n t h e t a r g e t GPU
5 d e s t [ s t e p ] [ t h r e a d i d ]
6 = summing ( s r c [ s t e p ] [ t h r e a d i d ] , . . . ) ;
7 g r i d . sync ( ) ; / / e x p l i c i t s y n c h r o n i z e ;
8 }
9 i f ( gpu id ==0)
10 {
11 sum= b l o c k r e d u c e ( s r c [ 0 ] [ 0 ] , . . . ) ;
12 i f ( t h r e a d i d ==0)
13 o u t p u t [ t h r e a d i d ]= sum ;
14 }
15 }
Fig. 13. Code example of reduction with explicit device synchronization
1
2 g l o b a l vo id Kerne l1 ( . . . ) { . . .
3 u i n t i = t h r e a d i d + b l o c k i d * blockdim ;
4 sum=summing ( . . . ) ;
5 o u t p u t [ i ]= sum ;
6 . . . }
7 g l o b a l vo id Kerne l2 ( . . . ) { . . .
8 sum= b l o c k r e d u c e ( . . . ) ;
9 i f ( t h r e a d i d ==0)
10 o u t p u t [ t h r e a d i d ]= sum ;
11 . . . }
12
13 / / f o l l o w i n g p a r t s a r e CPU f u n c t i o n s
14 vo id i m p l i c i t S i n g l e G P U ( . . . ) { . . .
15 Kernel1 <<<...>>>(...) ; / / i m p l i c i t s y n c h r o n i z a t i o n
16 Kernel2 <<<...>>>(...) ;
17 . . . }
18
19 vo id i m p l i c i t M u l t i G P U ( ) { . . .
20 # pragma omp f o r num threads ( gpucoun t ) { . . .
21 c u d a D e v i c e S e t ( t i d ) ;
22 Kernel1 <<<...>>>(...) ;
23 / / g a t h e r d a t a t o one GPU t h a t would do t h e
r e m a i n i n g c o m p u t a t i o n .
24 w h i l e ( s t e p . n o t f i n i s h ( ) ) {
25 c u d a D e v i c e S y n c h r o n i z e ( ) ;
26 # pragma omp b a r r i e r ;
27 / / t r a n s f e r d a t a from c u r r e n t GPU t o a n o t h e r GPU
28 t r a n s f e r d a t a ( s r c [ s t e p ] [ t i d ] , d s t [ s t e p ] [ t i d ] ) ;
29 }
30 c u d a D e v i c e S y n c h r o n i z e ( ) ;
31 # pragma omp b a r r i e r ;
32 i f ( t i d ==0) Kernel2 <<<...>>>(...) ;
33 }
34 . . . }
Fig. 14. Code example of reduction with implicit device synchronization
As section IV mentioned, the overhead in cooperative multi-
launch might be the cause of this performance difference.
On the other hand, we want to emphasize here the benefit
of programming. We can easily rewrite implicit barrier code
(Figure 14) into the explicit barrier one (Figure 13), i.e. a
single persistent kernel is required in grid synchronization, and
eliminate the complexity of managing several GPUs with CPU
threads or processes. More importantly, the kernel function
requires no knowledge of the hardware structure.
10
100
1000
10000
0.1 1 10 100 1000 10000
La
te
nc
y (
us
)
Size (MB)
V100
implicit grid sync CUB cuda sample
10
100
1000
10000
0.1 1 10 100 1000
La
te
nc
y (
us
)
Size (MB)
P100
implicit grid sync CUB cuda sample
Fig. 15. Comparison of the performance of single reduction in V100 (up)
and in P100 (down)
TABLE VI
BANDWIDTH (GB/S) OF DIFFERENT REDUCTION METHODS
implicit grid sync CUB CUDA sample theory
V100 865.40 855.59 849.39 852.98 898.05
P100 592.40 590.85 543.96 590.65 732.16
VIII. CONSIDERATIONS OF USING CUDA
SYNCHRONIZATION INSTRUCTIONS
In this study, we identified several cases at which the
synchronization instructions might not work as intended. In
this section, we summarize some of those cases.
A. Synchronization Inside a Warp
In this section, we examine synchronization at the warp
level. To see if a barrier inside a warp is effective on all threads
in the barrier, we run the code in Figure 17. In the ideal case,
the timers in all threads in the warp before the barrier are
smaller than the timers after the sync in every thread. We
test all the synchronization methods. Results show that P100
does not assure all threads inside a warp are blocked at the
barrier (also the shuffle operation does not work correctly in
this code either), which we believe explains why the latency
of warp level synchronization in P100 is as fast as Table
II shows. On the other hand, in V100, we observed the
anticipated behavior (likely due to the fact that in V100 each
thread has its own program counter). Figure 18 shows our
observation when calling tile synchronization. We observed
the same phenomenon when running all other synchronization
instructions in both V100 and P100.
B. Deadlocks in Synchronization of Parts of Thread Groups
In this section, we examine the behaviour of synchronization
with a subset of a thread group: would synchronizing a subset
0
1000
2000
3000
4000
5000
6000
7000
8000
0 1 2 3 4 5 6 7 8 9
Th
ro
ug
hp
ut
(G
B/
s)
GPU Count
mgrid sync CPU-side barrier
Fig. 16. The throughput of reduction on DGX-1
1 i f ( t i d ==0){ t i m e r ( s t a r t ) ; sync ; t i m e r ( end ) ;}
2 e l s e i f ( t i d ==1){ t i m e r ( s t a r t ) ; sync ; t i m e r ( end ) ;}
3 . . .
4 e l s e i f ( t i d ==30){ t i m e r ( s t a r t ) ; sync ; t i m e r ( end ) ;}
5 e l s e { t i m e r ( s t a r t ) ; sync ; t i m e r ( end ) ;}
Fig. 17. Code example to verify synchronization inside a warp
of a group cause a deadlock or not? We implement a test suite
to see what happens when part of a thread group calls the
synchronization function. We test through every granularity
including threads, warps, blocks, and GPUs. As a result, we
observed deadlocks when we synchronize parts of blocks in
grid group, multi-grid group, and when we synchronize parts
of GPUs in a multi-grid group. In summary, one should be
careful, after initializing a grid group or a multi-grid group,
since current CUDA does not support synchronizing sub-
groups inside a grid group.
IX. BENCHMARKING CUDA SYNCHRONIZATION
METHODS
A. Experiments Environment
We use Pascal P100 and Volta V100 cards to conduct
our experiments. We set the application frequency of both
platforms to default. We use the latest stable driver. Table VII
shows the details of the environment.
B. Micro-benchmark for Implicit Barriers
We use the terminologies in Section IV. We do a warm-up
kernel call before every measurement that we don’t report the
results for.
We found that directly using a null kernel would not give a
correct result here. Because at this point the stream pipeline is
not saturated enough: the overhead tested would be larger than
usual. The kernel execution latency needs to be larger than a
certain number. This value is around 5us for a single GPU
and around 250us for 8 GPUs in DGX-1. In order to control
the kernel latency, we use the sleep instruction introduced in
CUDA for Volta platform. We use kernel fusion to unveil the
overhead hidden in kernel latency. The basic assumption here
is that merging the work of multiple argument-less kernels
into one single kernel does not introduce additional launch
overhead, and then the time saved when using kernel fusion
should be equal to the overhead of launching an additional
kernel. From our previous observations, the sleep instruction
0
4
8
12
16
0 8 16 24 32
Cy
cl
es
 (T
ho
us
an
ds
)
thread id
V100
0
2
4
6
8
10
0 8 16 24 32
Cy
cl
es
 (T
ho
us
na
ds
)
P100
start end
Fig. 18. Timer of threads inside a warp when calling tile synchronization in
V100 (left), and in P100 (right) in code sample of Figure 17
TABLE VII
ENVIRONMENT INFORMATION
Platform Default Freq. Driver CUDA
P100 x 2 1189MHz 418.40.04 V10.0.130
V100 x 8(DGX-1) 1312MHz 410.129 V10.0.130
has insignificant overhead and fits well into this assumption. In
this situation, we can compute the overhead with Equation 6.
Since we use the sleep instruction as a tool to analyze
launch overhead, which is only available in Volta Platform
in CUDA, we only conduct experiments on the V100 GPU
for this experiment.
O =
Latencyij−Latencyji
i−j (6)
∗(O represents Overhead; In Latencyij (the left one),
i represents call launch function i times,
j represents launch kernels with j wait unit)
To the best of the authors’ knowledge, Volkov et al. [8] was
the first one measured the overhead of implicit barrier, i.e.
CUDA kernel launch overhead. Xiao et al. [5] additionally
build a model for implicit and explicit barriers. They both
neglect the fact that the launch overhead is far smaller when
kernel execution latency is long enough. When using null
kernels, we tested a launch overhead of around 3us for
traditional launch, which is the same as the best case reported
by Volkov et al. [8].
C. Micro-benchmark for Intra-SM Instructions
We directly use Wong’s [9] method for instruction micro-
benchmarking. Wong’s method relies on the GPU clock. The
basic methodology is to build a chain of dependent operations
to repeat a single instruction enough times to saturate the
instruction pipeline. By using the clock register to record
the begin and end timestamps of the series of operations, it
is possible to average the repetitions to infer the latency of
that instruction. Figure 19 shows an example sample code to
measure the latency of an add instruction.
D. Micro-benchmark for Inter SM Instructions
Jia’s work [11] can work correctly only inside a single
thread, Wong’s work [9] can work correctly only in a single
1 g l o b a l vo id k e r n e l 1 ( ) {
2 s t a r t = c l o c k ( ) ;
3 r e p e a t 2 5 6 ( p=p+q ; q=p+q ) ; / / r e p e a t =512
4 end= c l o c k ( ) ;
5 r e t u r n q ;
6 }
1 g l o b a l vo id k e r n e l 2 ( ) {
2 s t a r t = c l o c k ( ) ;
3 r e p e a t 5 1 2 ( p=p+q ; q=p+q ) ; / / r e p e a t =1024
4 end= c l o c k ( ) ;
5 r e t u r n q ;
6 }
1 c p u c l o c k ( ) ;
2 k e r n e l ( ) ;
3 s y n c d e v i c e ( ) ;
4 c p u c l o c k ( ) ;
Fig. 19. Sample code to measure the latency of the add instruction in GPU
SM. Yet current synchronization instructions might involve
cooperation across different threads, different SMs, and even
different GPUs. As we move to grid level synchronization and
beyond, we need a new method.
In order to test the performance of synchronization beyond a
single SM, a global clock is necessary. In CUDA’s execution
model, a CPU thread launches a kernel and it can call the
DeviceSynchronize() function to block the CPU thread until
the GPU kernel finishes execution. So it is possible to use
the clock in that CPU thread as a global clock to test GPU
instructions. Yet we need to fix two issues before we can use
the CPU clock:
• We need to eliminate any latency not related to the target
instruction
• Account for the relative inaccuracy in the CPU clock
measurement, in comparison to the GPU’s clock mea-
surement.
In order to solve those issues, we need to additionally
introduce two assumptions:
• The measurement of the latency of every instruction
becomes more accurate when the pipeline is saturated
• Additional instructions in a kernel do not increase the
launch overhead of kernel launch
Tinstruction =
Lk1−Lk2
r1−r2 (7)
σk1−k2
r1−r2
=
√∑N
n=1 (
Lk1−Lk2
r1−r2 )
2
−∑Nn=1 (Lk1−Lk2r1−r2 )
2
N−1
= 1r1−r2
√∑
L2k1−Lk1
2
N−1 +
∑
L2k2−Lk2
2
N−1
= 1r1−r2
√
σ2k1 + σ
2
k2
(8)
∗(Lki represents kernel total latency of kernel i;
ri represents repeat times in kernel i)
Under those assumptions, if we increase the repetitions of
instructions in the GPU kernel (in Figure 19), the additional
kernel latency is only related to the additional repeat times of
instructions. In this manner, we are able to avoid unrelated
latency that might come from kernel launch (to get more
accurate measurements). Equation 7 shows how to measure
the instruction latency with this method. (First issue solved)
Standard deviation can be used to represent the uncertainty
in a single measurement [25]. Equation 8 shows the standard
deviation of the instruction tested, and its deduction (the
measurement of kernel 1 and kernel 2 is independent to each
other). And by deduction, if the difference in repeat times is
large enough, the standard deviation of the instruction latency
we seek to measure will be small. (Second issue solved)
In order to verify that the method we proposed in Sec-
tion IX-D matches our assumptions, we use both Wong’s
method and our method to test the single precision add
instruction. Both results show that float-add costs 6 cycles
in P100 and 4 cycles in V100. Those results match the
result in [11]. We can conclude that the inter SM micro-
benchmark method we propose is a reliable measurement tool
that approaches the accuracy of the GPU clock.
We additionally verify that the repeat times of a synchro-
nization instruction itself would not influence the performance
itself in block and grid level. Tile shuffle at warp level also
works as we anticipated. Other warp level synchronization
can be unstable: the latency of the synchronization instruction
might increase suddenly when increasing repeat times. It could
be the case that this warp synchronization relies on a software
implementation. So when repeating an instruction too many
times, instruction cache overflow can occur. We only record
the fastest result for warp level synchronization instructions.
X. CONCLUSION
In this paper, we conduct a detailed study of different
synchronization methods in Nvidia GPUs, ranging from warp
to grid, and from single GPU to multi-GPU.
We find that the performance of block synchronization is
related to the number of warps involved, and the performance
of grid level synchronization is mainly affected by the number
of blocks involved. In addition, the performance of multi-
grid level synchronization depends on the network structure
connecting the GPUs, and the number of active blocks and
warps.
CPU-side implicit barriers generally perform better than
grid level and multi-grid level synchronization. But if the
program size is large enough, the performance difference
would not be so severe, with the added benefit that multi-grid
synchronization simplifies multi-GPU programming.
We use the reduction operator as an example to use the
knowledge we gain from micro-benchmark. We build a per-
formance model to predict where would be the point that
using fewer threads is more profitable. Additionally, using
code samples, we show a possible simple way to do multi-
GPU programming without much performance degradation.
Moreover, with more multi-grid barriers in a kernel, the launch
TABLE VIII
SUMMARY OF OBSERVATIONS
Warp Level Sync Does not work on Pascal;
Shuffle performs better in real code.
Block Sync The number of active warps per SM affects
performance
Grid Sync The number of blocks per SM mainly affects
performance;
Generally, the performance is acceptable if
block/SM <= 2;
Currently, only parts of blocks inside a grid
calling grid level synchronization would cause
deadlock.
Multi-Grid Sync Both the number of blocks per SM and active
warps per SM affect performance;
If thread/SM <= 1024 and block/SM <=
8 the performance is relatively acceptable;
Currently, only parts of grids inside a grid
calling grid level synchronization would cause
deadlock.
Implicit Sync &
CPU Based Sync
Generally, their performance is slightly better
than explicit synchronization when in single
GPU or when the GPU count is large, or when
there is no much synchronization steps;
The issue for CPU Based Sync is programma-
bility, especially in the situation of multi-GPUs.
overhead in multi-device kernel launch would become more
insignificant. Table VIII summarizes the knowledge we gained
from this study.
XI. ACKNOWLEDGMENTS
This work was partially supported by JST-CREST under
Grant Number JPMJCR19F5.
REFERENCES
[1] Nvidia, “Programming guide,” 2019. [Online]. Available: https:
//docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[2] M. Harris et al., “Optimizing parallel reduction in cuda,” Nvidia devel-
oper technology, vol. 2, no. 4, p. 70, 2007.
[3] S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: a next-generation
open source framework for deep learning,” in Proceedings of workshop
on machine learning systems (LearningSys) in the twenty-ninth annual
conference on neural information processing systems (NIPS), vol. 5,
2015, pp. 1–6.
[4] T. Sorensen, A. F. Donaldson, M. Batty, G. Gopalakrishnan, and
Z. Rakamaric´, “Portable inter-workgroup barrier synchronisation for
gpus,” in ACM SIGPLAN Notices, vol. 51, no. 10. ACM, 2016, pp.
39–58.
[5] S. Xiao and W.-c. Feng, “Inter-block gpu communication via fast barrier
synchronization,” in 2010 IEEE International Symposium on Parallel &
Distributed Processing (IPDPS). IEEE, 2010, pp. 1–12.
[6] J. Liu, “Efficient synchronization for gpgpu,” Ph.D. dissertation, Uni-
versity of Pittsburgh, 2018.
[7] NVIDIA, “V100 gpu architecture,” 2017. [On-
line]. Available: https://images.nvidia.com/content/volta-architecture/
pdf/volta-architecture-whitepaper.pdf
[8] V. Volkov and J. W. Demmel, “Benchmarking gpus to tune dense linear
algebra,” in SC’08: Proceedings of the 2008 ACM/IEEE conference on
Supercomputing. IEEE, 2008, pp. 1–11.
[9] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and
A. Moshovos, “Demystifying gpu microarchitecture through
microbenchmarking,” in 2010 IEEE International Symposium on
Performance Analysis of Systems & Software (ISPASS). IEEE, 2010,
pp. 235–246.
[10] X. Zhang, G. Tan, S. Xue, J. Li, K. Zhou, and M. Chen, “Understanding
the gpu microarchitecture to achieve bare-metal performance tuning,”
ACM SIGPLAN Notices, vol. 52, no. 8, pp. 31–43, 2017.
[11] Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, “Dissecting the
nvidia volta gpu architecture via microbenchmarking,” arXiv preprint
arXiv:1804.06826, 2018.
[12] S. S. Baghsorkhi, I. Gelado, M. Delahaye, and W.-m. W. Hwu, “Efficient
performance evaluation of memory hierarchy for highly multithreaded
graphics processors,” in ACM SIGPLAN Notices, vol. 47, no. 8. ACM,
2012, pp. 23–34.
[13] X. Mei, K. Zhao, C. Liu, and X. Chu, “Benchmarking the memory
hierarchy of modern gpus,” in IFIP International Conference on Network
and Parallel Computing. Springer, 2014, pp. 144–156.
[14] X. Mei and X. Chu, “Dissecting gpu memory hierarchy through
microbenchmarking,” IEEE Transactions on Parallel and Distributed
Systems, vol. 28, no. 1, pp. 72–86, 2016.
[15] V. Volkov, “Better performance at lower occupancy,” in Proceedings of
the GPU technology conference, GTC, vol. 10. San Jose, CA, 2010,
p. 16.
[16] A. Li, G.-J. van den Braak, H. Corporaal, and A. Kumar, “Fine-grained
synchronizations and dataflow programming on gpus,” in Proceedings of
the 29th ACM on International Conference on Supercomputing. ACM,
2015, pp. 109–118.
[17] Nvidia, “Nvidia cuda runtime api,” 2019. [Online]. Available:
https://docs.nvidia.com/cuda/cuda-runtime-api/index.html
[18] J. Luitjens, “Faster parallel reductions on kepler,” Parallel Forall.
NVIDIA Corporation. Available at: https://devblogs. nvidia.
com/parallelforall/faster-parallel-reductions-kepler, 2014.
[19] Nvidia, “Nvidia cuda sample,” 2019. [Online]. Available: https:
//docs.nvidia.com/cuda/cuda-samples/index.html
[20] P. J. Martı´n, L. F. Ayuso, R. Torres, and A. Gavilanes, “Algorithmic
strategies for optimizing the parallel reduction primitive in cuda,” in
2012 International Conference on High Performance Computing &
Simulation (HPCS). IEEE, 2012, pp. 511–519.
[21] W. A. R. Jradi, H. Nascimento, and W. S. Martins, “A fast and generic
gpu-based parallel reduction implementation,” in 2018 Symposium on
High Performance Computing Systems (WSCAD). IEEE, 2018, pp.
16–22.
[22] S. G. De Gonzalo, S. Huang, J. Go´mez-Luna, S. Hammond, O. Mutlu,
and W.-m. Hwu, “Automatic generation of warp-level primitives and
atomic instructions for fast and portable parallel reduction on gpus,” in
Proceedings of the 2019 IEEE/ACM International Symposium on Code
Generation and Optimization. IEEE Press, 2019, pp. 73–84.
[23] J. D. Little and S. C. Graves, “Little’s law,” in Building intuition.
Springer, 2008, pp. 81–100.
[24] Nvidia, “Cub library,” 2019. [Online]. Available: https://nvlabs.github.
io/cub
[25] J. Taylor, Introduction to error analysis, the study of uncertainties in
physical measurements. University Science Books, 1997.
