HALCONE : A Hardware-Level Timestamp-based Cache Coherence Scheme for
  Multi-GPU systems by Mojumder, Saiful A. et al.
Saiful A. Mojumder1, Yifan Sun2, Leila Delshadtehrani1, Yenai Ma1, Trinayan Baruah2,
José L. Abellán3, John Kim4, David Kaeli2, Ajay Joshi1
1ECE Department, Boston University; 2ECE Department, Northeastern University;
3CS Department, UCAM; 4School of EE, KAIST;
{msam, delshad, yenai, joshi}@bu.edu, {yifansun, tbaruah, kaeli}@ece.neu.edu,
jlabellan@ucam.edu, jjk12@kaist.edu
ABSTRACT
While multi-GPU (MGPU) systems are extremely popular
for compute-intensive workloads, several inefficiencies in
the memory hierarchy and data movement result in a waste
of GPU resources and difficulties in programming MGPU
systems. First, due to the lack of hardware-level coherence,
the MGPU programming model requires the programmer to
replicate and repeatedly transfer data between the GPUs’
memory. This leads to inefficient use of precious GPU mem-
ory. Second, to maintain coherency across an MGPU sys-
tem, transferring data using low-bandwidth and high-latency
off-chip links leads to degradation in system performance.
Third, since the programmer needs to manually maintain data
coherence, the programming of an MGPU system to max-
imize its throughput is extremely challenging. To address
the above issues, we propose a novel lightweight timestamp-
based coherence protocol, HALCONE, for MGPU systems
and modify the memory hierarchy of the GPUs to support
physically shared memory. HALCONE replaces the Com-
pute Unit (CU) level logical time counters with cache level
logical time counters to reduce coherence traffic. Further-
more, HALCONE introduces a novel timestamp storage unit
(TSU) with no additional performance overhead in the main
memory to perform coherence actions. Our proposed HAL-
CONE protocol maintains the data coherence in the memory
hierarchy of the MGPU with minimal performance overhead
(less than 1%). Using a set of standard MGPU benchmarks,
we observe that a 4-GPU MGPU system with shared memory
and HALCONE performs, on average, 4.6× and 3× better
than a 4-GPU MGPU system with existing RDMA and with
the recently proposed HMG coherence protocol, respectively.
We demonstrate the scalability of HALCONE using different
GPU counts (2, 4, 8, and 16) and different CU counts (32, 48,
and 64 CUs per GPU) for 11 standard benchmarks. Broadly,
HALCONE scales well with both GPU count and CU count.
Furthermore, we stress test our HALCONE protocol using
a custom synthetic benchmark suite to evaluate its impact
on the overall performance. When running our synthetic
benchmark suite, the HALCONE protocol slows down the
execution time by only 16.8% in the worst case.
1. INTRODUCTION
Multi-GPU (MGPU) systems have become an integral part
of cloud services such as Amazon Web Services (AWS) [7],
Microsoft Azure [8], and Google Cloud [1]. In particular,
many deep learning frameworks running on these cloud services
provide support for MGPU execution to significantly
accelerate the long and compute-intensive process of training
deep neural networks. For instance, Goyal et al. [12]
trained ResNet-50 in only 1 hour using 256 GPUs, which
would have otherwise taken more than a week using a single
GPU. MGPU systems are also commonly used for parallelizing
irregular graph applications [5, 28, 36, 37] and facilitating
large-scale simulations in different domains including
physics [41], computational algebra [38], surface metrology [40],
and medicine [13].
Figure 1: Conventional MGPU system. Switch (SW) handles
the remote access requests from one GPU to another
GPU. PCIe or NVLink is used as the off-chip link.
GPU applications are evolving to support ever-larger
datasets and demand data communications not only within a
GPU but also across multiple GPUs in the system. As a result,
the underlying programming model is evolving, and major
vendors such as NVIDIA and AMD have introduced software
abstractions such as shared and global memory spaces that
enable sharing of data among different threads within a GPU.
Furthermore, GPU-to-GPU Remote Direct Memory Access
(RDMA) [22] was introduced so that a GPU can transfer data
directly to/from another GPU, without involving the CPU,
through off-chip links such as NVLink or PCIe as shown
in Figure 1. However, despite continuous advances in inter-
GPU networking technologies, the off-chip link bandwidth is
roughly 3-10× lower than local main memory (MM) band-
width. Thus, data accesses to/from remote GPU memories,
which are essential for MGPU applications, can lead to severe
performance degradation due to the NUMA effect [19, 20].
To highlight the large impact on performance when rely-
ing on state-of-the-art RDMA technology for GPU-to-GPU
communication, we perform an experiment with an SGEMM
kernel from the cuBLAS library [23] that is executed on an
arXiv:2007.04292v1 [cs.AR] 8 Jul 2020
NVIDIA DGX-1 [14] MGPU system. More specifically, we
aim to compare the impact of RDMA on kernel execution
time considering two Volta GPUs (i.e., GPU0 and GPU1),
which are connected by NVLinks that support 50 GB/s per
direction. We first place the matrices in GPU0’s memory
and execute SGEMM. Then, we enable P2P direct access to
leverage RDMA and run the same SGEMM kernel on GPU1
while the matrices reside in GPU0’s memory (we refer to
this as remote). Figure 2 shows the results of the experiment
with different matrix sizes. The kernel using local memory is
12.4× (for a matrix size of 32768×32768) to 2895× (for a
matrix size of 512×512) faster than the kernel using remote
memory. These results clearly show that RDMA is expensive,
and motivate the pressing need for an alternate path to reduce
the significant cost of remote accesses in MGPU systems.
Programming an MGPU system also poses several challenges. In
particular, a program written to run on a single GPU cannot be
easily ported to run on multiple GPUs. The programmer must
be aware of any data sharing, both within a single GPU and
across multiple GPUs. As there is no hardware support for
cache coherency in current MGPU systems, the programmer
must explicitly maintain coherency while developing a GPU
program. This requires manually transferring duplicated data
when needed to different GPUs’ memory, using explicit bar-
riers if previously read data is modified, or in the worst case,
using atomic operations. All of these approaches require
significant effort by the programmer. Moreover, contempo-
rary GPUs only support weak data-race-free (DRF) memory
consistency [15, 30], so the programmer has to ensure there
are no data races during program execution.
To leverage the true capability of MGPU systems and
ease MGPU programming, we need efficient hardware-level
inter-GPU coherency. Extending well-known directory-based
or snooping-based CPU coherence protocols such as MESI,
MOESI, etc. is not suitable for an MGPU system [42]. This
is because the MGPU system with its thousands of parallel
threads per GPU produces a much larger number of simulta-
neous memory requests than CPUs, which translates into a
prohibitively high degree of coherence traffic in these tradi-
tional protocols [31]. A promising solution to alleviate coher-
ence traffic is to use self-invalidation by relying on temporal
coherence [31]. Previous timestamp-based solutions targeting
a single GPU used global time (the TC protocol) [31] or
logical time (the G-TSC protocol) [35] to maintain coherency
across the L1$s and L2$s of a GPU. However, none of the previous
timestamp-based work has addressed coherency issues
in an MGPU system.
To provide support for efficient intra-GPU and inter-GPU
coherency, in this paper we propose HALCONE, a new
timestamp-based hardware-level cache coherence scheme
for tightly-coupled MGPU systems with physically shared
main memory (named MGPU-SM for short)1. HALCONE
provides hardware-level coherency support for inter- and
intra-GPU data sharing using a single-writer-multiple-reader
(SWMR) invariant. HALCONE introduces a new cache-
level logical time counter to reduce Core-to-Cache traffic2
and a novel timestamp storage unit (TSU) to keep track
of cache blocks’ timestamps. We strategically place the
TSU outside the critical path of memory requests, as it is
accessed in parallel with the MM, thereby avoiding any
performance overhead. Although HALCONE is inspired by
G-TSC [35], G-TSC is not suitable as is for either NUMA-
or UMA-based MGPU systems due to the large area
required for storing timestamps and the high performance
overhead due to additional traffic to maintain coherency.
Footnote 1: An MGPU system can be designed by cobbling together a
collection of MGPU nodes and leveraging a message passing layer,
such as MPI [18], to work collectively on a single application. An
alternate approach is to consider an MGPU system that is more
tightly-coupled and uses shared memory (i.e., MGPU-SM) [21].
The former involves more programmer effort to support MGPU
execution, while the latter relies on the hardware and runtime system
to support a version of shared memory. In this paper, we focus on
improving the performance and scalability of an MGPU-SM system.
Figure 2: Kernel time in GPU0 (local) and GPU1 (remote)
on Volta-based NVIDIA DGX-1 system. The x-axis is the matrix
length (512 to 32768) and the y-axis is the time in ms (log scale).
Prior MGPU-based solutions such as CARVE [39] and
HMG [27] minimize RDMA overhead by introducing a simple
VI cache coherence protocol. However, the scalability
of these solutions is limited because of the large amount of
coherency traffic that must be transferred over off-chip links with
low bandwidth and high latency. Moreover, HMG relies on
a scoped memory model which increases the programmer’s
burden [29]. HALCONE is a holistic highly-scalable solu-
tion that reduces coherence traffic and eases programming
of MGPU systems. The main takeaways of our work are:
Design: We are the first to propose a fully coherent MGPU
system with physically shared MM. Our MGPU-SM system
eliminates NUMA accesses to MM, as well as the need to
transfer data back and forth between GPUs. We propose a
novel timestamp-based inter-GPU and intra-GPU coherence
protocol called HALCONE to ensure seamless data sharing
in an MGPU system. HALCONE leverages the concept of
logical timestamp [25] and reduces the overall traffic by replacing
the compute unit (CU)-level timestamp counter with a
cache-level counter. In MGPU-SM, HALCONE uses a new timestamp
storage unit (TSU) that operates in parallel with MM to
avoid any performance overhead.
Evaluation: We evaluate our MGPU-SM system with HAL-
CONE using MGPUSim [32]. For evaluation, we use stan-
dard benchmarks and custom synthetic benchmarks (that
enforce read-write data sharing) to stress test the HALCONE
protocol. Compared to an MGPU system with RDMA, our
MGPU-SM system with HALCONE performs, on average,
4.6× better across the standard benchmarks. Our HALCONE
protocol adds minimal performance overhead (1% on aver-
age) to the standard benchmarks. Compared to the most
recently proposed MGPU system with an HMG coherence
protocol [27], HALCONE performs, on average, 3× better.
Stress tests performed using our synthetic benchmark suite
show that HALCONE can result in up to 16.8% performance
degradation for extreme cases, but it lowers programmer burden.
We also show that an MGPU system with HALCONE
scales well with GPU count and CU count per GPU.
Footnote 2: Request traffic is reduced by up to 41.7% and the response traffic
is reduced by up to 3.1% in the memory hierarchy.
2. BACKGROUND
2.1 Communication and Data Sharing
To share data across GPUs in MGPU systems, a variety of
communication mechanisms are available. A GPU can use
P2P memcpy to copy data to and from the memory of a remote
GPU. This P2P memcpy approach essentially replicates the
data in different memory modules [42]. To avoid this data
replication, P2P direct access (called RDMA in this paper)
can be used where a GPU can access data from a remote GPU
memory without copying it to its MM. However, P2P direct
access leads to high latency remote data accesses, which
causes performance degradation [39].
Current NVIDIA GPUs provide a page fault mechanism
in a virtually unified address space called unified memory
(UM)3. Under UM, the data is initially allocated on the CPU.
When a GPU tries to access this data, it triggers a page fault
and the GPU driver serves the page fault by fetching the
required page from another device. However, this page fault
mechanism is known to introduce serialization in accessing
pages and can hurt GPU performance [4, 16].
2.2 Timestamp Based Coherency
In this section, we briefly explain the operation of the G-TSC
protocol [35], which was proposed to maintain coherency in
a single GPU. The G-TSC protocol assigns a logical timestamp
(warpts) to each CU in the GPU. Table 1 provides
definitions for the terminology used to describe both G-TSC
and our proposed HALCONE protocol.
Read Operation: A read request from a CU contains the
warpts and the address. Each block in the L1$ 4 has a read
timestamp (rts) and write timestamp (wts). If the block is
present and the warpts falls within the range between wts
and rts (i.e. the lease), it is a cache hit. Otherwise, the
read request is treated as an L1$ miss and is forwarded to
the L2$. This read request to L2$ contains the address, the
wts of the block and the warpts. If the wts value for the
request is set to 0, it means a compulsory miss occurred at
L1$, and L2$ must respond with the data and the timestamps.
A non-zero value for wts implies the block exists in L1$, but
the timestamp expired.
Upon receiving a read request, the L2$ checks if the block
exists in the L2$. If it does not exist, L2$ sends a read
request to the MM. If the warpts of the request from L1$
is within the lease for that block in the L2$, the L2$ also
compares the wts from the L1$ request and the wts of the
block in the L2$. If both wts values are the same, it means
that the data was not modified by another CU and simply that
the lease expired in the L1$. In that case, the L2$ extends the
lease by increasing the rts and sends the new rts and wts
values to the L1$. If the wts values do not match, it implies
that the data was modified by a different CU. Hence, L2$
sends both data and new timestamps to the L1$.
Footnote 3: UM provides the abstraction of a virtually unified memory
combining the CPU’s and GPUs’ physical memories.
Footnote 4: Throughout this paper, we use L1$ to refer to the L1 vector cache
unless specified otherwise.
Table 1: Terminology and definitions
Term            Definition
physical time   The wall clock time of an operation.
logical time    The logical counter maintained by a component (e.g., CU
                and cache).
warpts          The current logical time of a CU. In the G-TSC protocol,
                the memory operations are ordered based on the warpts.
cts             The current logical time of a cache. Each cache has a cts
                that is updated based on the last memory operation.
block           An entry containing address, data, and associated
                timestamps in the caches.
wts             The write timestamp of a cache block. It represents the
                logical time when a write operation is visible to the processors.
rts             The read timestamp of a cache block. It represents the
                logical time until which reading the cache block is valid.
lease           The difference between rts and wts. The data in the cache
                block is valid only if cts or warpts is within the lease.
RdLease         Lease assigned to a block after a read operation is executed.
WrLease         Lease assigned to a block after a write operation is executed.
memts           The memory timestamp that represents the logical read
                timestamp that the memory assigns to a cache block.
Write Operation: Write requests are handled using a
write-through policy from L1$ to L2$, and L1$ adopts a no-
write-allocate policy. To result in a write hit in a cache, the
warpts needs to be within the lease of the requested cache
block. Otherwise, it is considered a write miss. When there
is a write hit in the L1$, the data is written in the L1$, but
the access to the data is locked until L2$ is updated and L2$
sends the updated timestamps for the data. It is necessary to
lock access to the block to ensure that the warpts is updated
correctly using the wts value that the L1$ receives from the
L2$. Any discrepancy in updating warpts may result in
an incorrect ordering of memory access requests. If there
is an L1$ write miss, the data is directly sent to the L2$ to
complete the write operation. For an L2$ write miss, the L2$
sends a write request to the MM. If we get a write hit at L2$,
the L2$ writes the data to the block in the L2$ and updates
the timestamps for that block. L2$ then sends the updated
rts and wts values to the L1$.
On both read and write operations, the responses from L1$
to CU contain the wts value from the most recent memory
operation. Based on this wts value, CU updates its warpts.
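The L2$ side of a G-TSC read described above can be sketched in Python. This is an illustration only: the `Block` class, the `l2_read` function, the fixed `lease` value, and the rule used to extend rts are assumptions made for this sketch and are not taken from G-TSC [35]. The sketch also assumes the block is already present in the L2$, so no MM access is modeled.

```python
from dataclasses import dataclass


@dataclass
class Block:
    data: int
    wts: int  # logical time at which the last write became visible
    rts: int  # logical time until which the block may be read


def l2_read(l2, addr, warpts, req_wts, lease=10):
    """Serve an L1$ read request at the L2$.

    Returns (data_or_None, wts, rts). data is None when the L1$ copy is
    still current and only the timestamps are renewed.
    """
    blk = l2[addr]
    # Assumed extension rule: make the lease cover the requester's warpts.
    blk.rts = max(blk.rts, warpts + lease)
    if req_wts != 0 and req_wts == blk.wts:
        # Same version as the L1$ copy: the lease merely expired in the L1$.
        return None, blk.wts, blk.rts
    # req_wts == 0 (compulsory miss) or a wts mismatch (data was modified).
    return blk.data, blk.wts, blk.rts
```

A request that carries a matching non-zero wts gets timestamps only, while a request with wts set to 0 or with a stale wts also gets the data.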
3. HALCONE IN MGPU-SM SYSTEMS
We present the working fundamentals of our proposed
HALCONE using an example MGPU system with 4 GPUs
that share the MM. HALCONE builds on top of the G-TSC
protocol [35], which was designed for intra-GPU coherence
and is not readily applicable to MGPU systems. Maintaining
coherence across multiple GPUs is more challenging because
the L1$s of a GPU can only interact with their own L2$, so we
need to maintain coherence across the L2$s of different
GPUs. A straightforward extension of G-TSC would
be to add timestamps to each block in the MM, which leads to
significant area overhead because we would need space to store
the timestamps of every block of data in the MM. G-TSC
also maintains a logical time counter at the compute
unit level, so timestamps (i.e., warpts) must be sent back
and forth between CUs and L1$s, leading to additional traffic
overhead. As we show later in the paper (Section 3.2), to
maintain coherence in MGPU systems we do not need to
maintain a logical time counter at the CU level; instead,
each L1$ and L2$ can maintain its own counter. This
helps reduce request traffic. In addition, while G-TSC
provides the same lease for both reads and writes, we provide
different lease values for reads and writes. As
we will see in Section 5.4, this improves the exploitation of
temporal locality. To elaborate, each write operation moves
the logical time counter ahead by the write lease value. If we
use the same lease for reads as for writes, each write operation
will lead to self-invalidation of the previously read block.
Figure 3: MGPU with shared main memory.
Figure 4: Transactions between (a) a CU and an L1$ for read operations,
(b) an L1$ and an L2$ for read operations, (c) an L2$ and the MM for read
operations, (d) a CU and an L1$ for write operations, (e) an L1$ and an L2$
for write operations, and (f) an L2$ and the MM for write operations.
3.1 An MGPU System with Shared Memory
Figure 3 shows the logical organization of our target
MGPU-SM system. In this configuration, each CU has a
private L1$ and each GPU has 8 distributed shared L2 banks.
Each L2 bank has a corresponding cache controller (CC). We
use High Bandwidth Memory (HBM) as the MM because
HBM is power- and area-efficient and provides the high
bandwidth required by GPUs [24]. A network provides connectivity
between the CC in the L2$ and the MM controller
(MMC) in the HBM. More details about the network are pro-
vided in Section 4. We assume a 4 GB memory per HBM
stack in this example. Each L2 CC handles 2 GB (the size de-
pends on the number of CCs and the HBM size) of the entire
address space in the HBMs. Thus, all the GPUs are connected
to all the HBM stacks, making the memory space physically
shared across all GPUs. We use a TSU in each HBM to keep
track of the timestamps for the blocks being accessed by the
different L2$s. We provide the detailed operation of the TSU
later in this section.
3.2 HALCONE Protocol
We define the HALCONE protocol using a single-writer-
multiple-reader (SWMR) invariant. Our HALCONE protocol
is based on the G-TSC [35] protocol designed for a single
GPU. The terms used to explain the HALCONE protocol
are defined in Table 1. Unlike the G-TSC protocol, we do
not have a warpts but assign a timestamp cts to each of
the L1$s and L2$s. Each CU has a private L1$, hence the
cts for an L1$ is equivalent to the warpts in the G-TSC
protocol. Managing timestamps at the caches allows us to
reduce timestamp traffic between the L1$ and CU, as well as
between the L1$ and L2$: unlike the G-TSC protocol, which
sends warpts with every request, HALCONE does not need to send
cts with requests and responses to maintain coherence.
The memory operations are ordered based on the
logical time, in particular, cts. If two requests have the same
cts value, the cache uses physical time to order them. The
key idea is that the block is only valid in the cache if the
cts is within the valid lease period. Figure 4 shows the
transactions between CUs, L1$s, L2$s, and MM for read and
write operations. We explain these transactions with the help
of Algorithms 1–5.
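The two rules stated above (a block is usable only while cts is within its lease, and requests with equal cts are ordered by physical time) can be sketched as follows. This is a minimal sketch under stated assumptions: `CacheBlock`, `is_valid`, and `order_requests` are hypothetical names, and the validity test checks only the upper bound `cts <= rts`, mirroring the condition used in Algorithms 1 and 2.

```python
from dataclasses import dataclass


@dataclass
class CacheBlock:
    wts: int  # write timestamp
    rts: int  # read timestamp (end of the lease)


def is_valid(block, cts):
    """A block is usable only while the cache's cts has not passed its rts."""
    return block is not None and cts <= block.rts


def order_requests(requests):
    """Order (cts, physical_time, name) tuples.

    Logical time decides first; physical time breaks ties between
    requests that carry the same cts.
    """
    return sorted(requests, key=lambda r: (r[0], r[1]))
```

For example, a block with lease (wts = 0, rts = 10) is valid at cts = 10 and no longer valid at cts = 11, at which point the cache treats the access as a coherency miss.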
3.2.1 Read Operations
L1$: Figure 4(a) shows the transactions between a CU
and the L1$ for read operations. As shown in Algorithm 1, a
cache hit at L1$ occurs only when there is an address (tag)
match and the current timestamp, cts, is within the lease
period of the cache block. If there is a tag hit, but the cts is
not within the lease period, we fetch the cache block from
L2$ with new rts and wts values. For an L1$ miss, we fetch
the cache block with its rts and wts values from L2$.
L2$: Algorithm 2 shows how read requests are handled by
the L2$. The L2$ hit or miss is similar to that of L1$. Fig-
ure 4(b) shows the transactions between the L1$ and the L2$
for read requests. If there is a cache hit and the lease is valid,
the L2$ sends the cache block, rts, and wts to the L1$. If
there is a cache miss in the L2$, then the L2$ sends a request
to the MM. After fetching the cache block from MM, the L2$
responds to the L1$ with the cache block, rts, and wts. If
there is a tag match, but cts is not within the lease period in
L2$, we re-fetch the data with new timestamps from the MM.
This re-fetching of data ensures coherency in case another
GPU modified the data in the MM. Note that the G-TSC protocol
only fetches renewed timestamps from L2$ if data has not
been modified by another CU. However, such a renewal
requires sending the warpts with each request (which we
eliminated to reduce traffic) and adds more complexity as
HALCONE has a deeper memory hierarchy.
MM: Figure 4(c) shows the transactions to and from the
MM for read requests from the L2$. Algorithm 3 explains
how a read request from the L2$ is handled by the MMC.
The MM tracks the timestamp of each block accessed by the
L2$s of all the GPUs using the TSU. The TSU stores the read
address and the timestamp (memts) of the block, but not the data
itself. memts is used to keep track of the lease of a block sent
to the L2$s. If there is no entry for the requested address
in the TSU (i.e., the block has never been requested by the
L2$s), the TSU adds the address and then updates the memts5 of the
block using the Mrts allocated for the read operation. If
there is already an entry in the TSU for the requested address,
the TSU extends the memts of the entry using the Mrts for
the read operation.
Algorithm 1: Read Request to L1
Initialization: cts = 0;
Fetch RdReqFromCU{Addr};
if Block(Addr) == nil or cts > rts(Block) then
    Send RdReqToL2{Addr};
    Fetch RspFromL2{Data, rts, wts};
    Bwts = max[cts, wts];
    Brts = max[wts + 1, rts];
    Send RspToCU{Block{Data}};
else if cts <= rts(Block) then
    Send RspToCU{Block{Data}};
Algorithm 2: Read Request to L2
Initialization: cts = 0;
Fetch RdReqFromL1{Addr};
if Block(Addr) == nil or cts > rts(Block) then
    Send RdReqToMM{Addr};
    Fetch RspFromMM{Data, Mrts, Mwts};
    Bwts = max[cts, Mwts];
    Brts = max[Mwts + 1, Mrts];
    Send RspToL1{Block{Data, Brts, Bwts}};
else if cts <= rts(Block) then
    Send RspToL1{Block{Data, Brts, Bwts}};
Algorithm 3: Read or Write Request to MM
Initialization: memts = 0;
Fetch ReqFromL2{Addr};
if TSU(Addr) == nil then
    AddEntryToTSU{Addr};
if Req == ReadReq then
    MemtsEntry = memts + RdLease;
    Mrts = MemtsEntry; Mwts = Mrts - RdLease;
else if Req == WriteReq then
    MemtsEntry = memts + WrLease;
    Mrts = MemtsEntry; Mwts = Mrts - WrLease;
Send RspToL2{Block{Data, Mrts, Mwts}};
Algorithm 4: Write Request to L1
Initialization: cts = 0;
Fetch WrReqFromCU{Addr, Data};
if Block(Addr) != nil and cts <= rts(Block) then
    WriteToBlock;
    LockAccessToBlock;
    Send WrReqToL2{Addr, Data};
    Fetch RspFromL2{rts, wts};
    Bwts = max[cts, wts];
    Brts = max[wts + 1, rts];
    cts = max[cts, Bwts];
    UnlockAccessToBlock;
    Send RspToCU{Block{Data}};
else
    Send WrReqToL2{Addr, Data};
    Fetch RspFromL2{Block, rts, wts};
    WriteBlockToCache;
    Bwts = max[cts, wts];
    Brts = max[wts + 1, rts];
    cts = max[cts, Bwts];
    Send RspToCU{Block{Data}};
Algorithm 5: Write Request to L2
Initialization: cts = 0;
Fetch WrReqFromL1{Addr, Data};
if Block(Addr) != nil and cts <= rts(Block) then
    WriteToBlock;
    LockAccessToBlock;
    Send WrReqToMM{Addr, Data};
    Fetch RspFromMM{Mrts, Mwts};
    Bwts = max[cts, Mwts];
    Brts = max[Mwts + 1, Mrts];
    cts = max[cts, Bwts];
    UnlockAccessToBlock;
    Send RspToL1{Block{Data, Brts, Bwts}};
else
    Send WrReqToMM{Addr, Data};
    Fetch RspFromMM{Block, Mrts, Mwts};
    WriteBlockToCache;
    Bwts = max[cts, Mwts];
    Brts = max[Mwts + 1, Mrts];
    cts = max[cts, Bwts];
    Send RspToL1{Block{Data, Brts, Bwts}};
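To make the read path concrete, the following Python sketch chains an L1$, an L2$, and the MM with its TSU, loosely following Algorithms 1 to 3. It is a simplified illustration, not the authors' implementation: the class names, the default `rd_lease` of 10, and the reading of `wts` in the Brts update as the timestamp returned by the lower level are assumptions, and timing, MSHRs, and multiple GPUs are not modeled.

```python
class MainMemory:
    """MM plus TSU. The TSU maps address -> memts and holds no data."""

    def __init__(self, rd_lease=10):
        self.data = {}
        self.tsu = {}
        self.rd_lease = rd_lease
        self.reads = 0  # number of read requests that reached the MM

    def read(self, addr):
        self.reads += 1
        # Add the TSU entry if absent (memts starts at 0), then extend it.
        memts = self.tsu.get(addr, 0) + self.rd_lease
        self.tsu[addr] = memts
        mrts, mwts = memts, memts - self.rd_lease
        return self.data.get(addr, 0), mrts, mwts


class Cache:
    """One cache level (L1$ or L2$) with its own logical clock, cts."""

    def __init__(self, lower):
        self.cts = 0
        self.blocks = {}    # addr -> [data, wts, rts]
        self.lower = lower  # next level: another Cache or a MainMemory

    def read(self, addr):
        blk = self.blocks.get(addr)
        if blk is None or self.cts > blk[2]:
            # Compulsory miss or coherency miss: fetch data and timestamps.
            data, rts, wts = self.lower.read(addr)
            bwts = max(self.cts, wts)
            brts = max(wts + 1, rts)
            blk = self.blocks[addr] = [data, bwts, brts]
        return blk[0], blk[2], blk[1]  # (data, rts, wts)
```

Wiring the levels as `l1 = Cache(Cache(MainMemory()))` and reading the same address twice sends only the first read to the MM; once both cache clocks move past the block's rts, the next read goes back to the MM and the TSU extends memts for that address.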
3.2.2 Write Operations
L1$: Figure 4(d) shows the transactions that take place
for write requests to the L1$. We adopt a write-through
(WT)6 cache policy for both the L1$s and L2$s. Algorithm
4 illustrates how write requests to L1$ are handled. Due to
the WT policy, a write request to L1$ triggers a write request
from L1$ to L2$, irrespective of a cache hit or miss. If the
cts is within the lease, it is a write hit. In case of a write
hit, the data is written immediately to the cache block in
the L1$ and a write request is sent to the L2$. Access to
the block is locked until the L1$ receives a write response,
along with the new timestamps, from the L2$. The access is
locked by adding an entry to the miss-status-holding-register
(MSHR). In the case of a write miss in the L1$, the L1$
sends the request to the L2$. Once the L2$ returns both the
block and its timestamps to the L1$, the L1$ writes data to
the appropriate location and updates its cts.
L2$: Figure 4(e) shows the transactions that take place for
write requests to the L2$s. Algorithm 5 demonstrates how a
write request to the L2$ is serviced. As we are using a WT
policy for the L2$, a write request to the L2$ triggers a write
request from the L2$ to the MM, irrespective of whether the
access is a cache hit or miss. Again, the cache hit and miss
conditions are the same as in the case of the L1$. If the
access is a cache hit, the data is written to the block in the
L2$ and a write request is sent to the MM. The L2$ updates
the timestamp of the cache block using the response that it
receives from the MM. The access to the block is locked until
the write response and the timestamps are received from the
MM. If the L2$ access results in a cache miss, L2$ sends a
write request to the MM. The write request includes the data
and address. The MM sends a response with the block and
updated timestamps. Next, the L2$ issues a write to the block
and updates its timestamps using the response from the MM.
Footnote 5: The read timestamp, Mrts, and the write timestamp, Mwts, are design
parameters for the HALCONE protocol; depending on the implementation,
these values can be statically or dynamically assigned.
Footnote 6: We compared the run time of standard benchmarks using both L2$
WT and L2$ Write-back (WB) policies for an MGPU-SM system
with no coherency. We observed that WT performs better for the
MGPU-SM system. Hence, we implement our coherence protocol
using L2$ WT caches. For details on this, please refer to Section 5.
MM: Figure 4(f) shows the transactions to and from the
MM for a write request from the L2$. Algorithm 3 explains
how a write request from the L2$ is serviced by the MMC.
If there is no matching entry for the requested address in the
TSU, then the TSU adds the address and updates the times-
tamp of the block using the lease for the write operation. If
there is an entry present in the TSU for the requested address,
the MM increases the memts of the entry using the lease for
a write operation.
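The write path can be sketched in the same spirit, loosely following Algorithms 3 to 5. This is a minimal single-threaded sketch under stated assumptions: `mm_write`, `cache_write`, the dictionary layout, and the default `wr_lease` of 5 are hypothetical, block locking is reduced to a comment because nothing runs concurrently here, and the lease values are illustrative rather than a reproduction of the timestamps in Figure 5.

```python
def mm_write(tsu, mem, addr, data, wr_lease=5):
    """MMC side of a write: store the data and extend the TSU entry."""
    mem[addr] = data
    memts = tsu.get(addr, 0) + wr_lease  # a missing entry starts at memts = 0
    tsu[addr] = memts
    return memts, memts - wr_lease  # (Mrts, Mwts)


def cache_write(cache, addr, data, send_down):
    """Write-through at one cache level.

    cache: dict with 'cts' and 'blocks' (addr -> [data, wts, rts]).
    send_down: callable that forwards the write to the next level and
    returns (rts, wts).
    """
    blk = cache["blocks"].get(addr)
    if blk is not None and cache["cts"] <= blk[2]:
        # Write hit: data is written at once; in hardware the block stays
        # locked (via an MSHR entry) until the response arrives.
        blk[0] = data
    # Hit or miss, the write is always forwarded to the next level.
    rts, wts = send_down(addr, data)
    bwts = max(cache["cts"], wts)
    brts = max(wts + 1, rts)
    cache["blocks"][addr] = [data, bwts, brts]
    cache["cts"] = max(cache["cts"], bwts)  # the write advances the clock
    return brts, bwts


# Example wiring: L1$ -> L2$ -> MM, with a TSU entry left by an earlier read.
tsu, mem = {0x200: 7}, {0x200: 2}
l2 = {"cts": 0, "blocks": {}}
l1 = {"cts": 0, "blocks": {}}
to_mm = lambda a, d: mm_write(tsu, mem, a, d)
to_l2 = lambda a, d: cache_write(l2, a, d, to_mm)
result = cache_write(l1, 0x200, 5, to_l2)
```

The point of the sketch is that a write always reaches the MM, and that the returned wts pulls the cts of every cache on the path forward, which is what later invalidates blocks whose rts is older than the new cts.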
3.2.3 Intra-GPU Coherency
We use instructions identical to those described by Tab-
bakh et al. [35] to explain both intra-GPU and inter-GPU
coherency. Here, we first present how intra-GPU coherency
is maintained using our HALCONE protocol. Figure 5(a)
shows the instructions and the sequence of steps for main-
taining intra-GPU coherency. In this example, we have two
compute units, CU0 and CU1. Both CU0 and CU1 belong
to GPU0. Each CU has a private L1$, but the L2$ is shared
between the two CUs. In Figure 5(a), we show two L2$s
for the sake of explanation, but both L2$s are the same L2$.
CU0 executes 3 instructions, I0-1, I0-2, and I0-3, which read
location [X], write to location [Y] and read location [X], re-
spectively. Similarly, CU1 executes 3 instructions, I1-1, I1-2,
and I1-3, which are: read location [Y], write location [X],
and read location [Y], respectively. Both L1$ and L2$ have
initial cts values of 0. 1 to 34 correspond to different
memory events that occur during the execution of the three
[Figure 5 timeline diagrams omitted. Panel (a): message exchanges among CU0, L1, L2, MM, L2, and CU1 for GPU0, CU0 (I0-1: Read X; I0-2: Write Y=5; I0-3: Read X) and GPU0, CU1 (I1-1: Read Y; I1-2: Write X=3; I1-3: Read Y), steps 1 to 34. Panel (b): the same instructions with I1-1 to I1-3 issued by GPU1, CU0, steps 1 to 36.]
Figure 5: The timeline for (a) intra-GPU and (b) inter-GPU coherency. [] represents response traffic in [Data, wts, rts]
or [Data] format, and {} represents the updated cts of a cache. In (a), the two L2$ instances refer to the same physical L2$.
instructions. At step 1, CU0 issues a read to location [X]. This
request misses in the L1$, so at step 2 the L1$ sends a read
request to the L2$. As the request misses in the L2$ as well, the
L2$ sends a read request to the MM at step 3. At step 4, the MM
sends the response to the L2$ with rts and wts values of 10 and
0, respectively (we choose these timestamp values for this
example; our protocol works correctly for any values of rts
and wts). Based on the cache block and the timestamps
received from the MM, at step 5 the L2$ updates its cts and the
block’s rts and wts, and responds to the L1$ with the updated
rts and wts values, along with the cache block. Similarly, the
L1$ updates its cts and the rts and wts values for the cache
block at step 6. CU0 finally receives the data from the L1$ at
step 7. Instruction I1-1 from CU1 issues a read from location
[Y] and follows the same steps as I0-1. CU1 receives
the data through steps 8 to 14. We assume a different lease
(wts = 0, rts = 7) for location [Y] for this example.
CU0 requests a write to location [Y] at step 15. A write
request from a CU is always served by the MM, regardless of
whether it hits in the L1$ or L2$. At step 16, the L1$ of CU0
sends a write request to the L2$. This results in a cache hit in
the L2$, as location [Y] was previously read by CU1 and cts ≤
rts. At step 17, the L2$ sends a write request to the MM for
location [Y]. The MM updates the value and timestamps for
location [Y]. We assume a lease of 5 for write operations
in this example. At step 18, the MM sends the response with
rts = 12 and wts = 8 for the block containing [Y] to the
L2$. At step 19, the L2$ updates the timestamps for [Y], sets
its cts to 8, and sends the updated timestamps to the
L1$ of CU0. At step 20, the L1$ updates the timestamps for
the block containing [Y] and its own cts (= 8). Note
that, for clarity, we do not show the actions to lock and unlock
a block in the diagram. Every write request to a block in the
cache must lock access to the block until a response is received
from the MM. At step 21, CU1 issues a write request (I1-2)
to location [X]. This follows the same steps as I0-2.
The response to the write request is handled in steps
22 to 26. Now, both the L1$ and L2$ of CU1 have a cts value
of 11 after completing the write request to location [X]. At
step 27, CU0 issues a read request for location [X]. At
step 28, the cts value of the L1$ of CU0 is 8 and the block for
location [X] has an rts value of 10. Hence, it is a cache hit in
the L1$. Note that the advantage of using logical timestamps is
that a memory operation can be scheduled in the future by
assigning it a larger wts value. Hence, the previous write to [X]
by CU1 becomes visible to the L1$ of CU0 only later, because
this L1$ has a cts value lower than the wts value assigned to
the block for the write request by CU1 at step 24. Since the cts
of the L1$ of CU0 is smaller than the cts of the L1$ of CU1 at
this point, the read by CU0 is ordered before the write by CU1
in logical time. The data is sent to CU0 by the L1$ at step 29.
At step 30, CU1 sends a request to read location [Y]. This
request results in a coherency miss in the L1$, because the cts
is 11 but the block for location [Y] has an rts of 7. At step 31,
the L1$ sends a read request to the L2$. This request results in
a cache hit in the L2$, since the L2$ has a cts value of 11 and
the block for [Y] has rts = 12 and wts = 8. The execution order
of the instructions in this example is
I0-1 → I1-1 → I0-2 → I0-3 → I1-2 → I1-3.
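The L1$ behavior in this walk-through can be replayed with a few lines of code. The sketch below is illustrative only: the `Cache` class, its method names, and the rule cts = max(cts, wts) on a fill are assumptions that are consistent with the timestamps quoted in the example, and the hit test is the cts ≤ rts condition stated in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Cache:
    """Toy model of one cache: a logical clock plus per-block leases."""
    cts: int = 0
    blocks: dict = field(default_factory=dict)   # addr -> (wts, rts)

    def fill(self, addr, wts, rts):
        """Install a response from the next level; advance cts if needed."""
        self.blocks[addr] = (wts, rts)
        self.cts = max(self.cts, wts)            # assumed cts update rule

    def read_hits(self, addr):
        """A read hits only while the block's lease covers the cache's cts."""
        if addr not in self.blocks:
            return False
        _, rts = self.blocks[addr]
        return self.cts <= rts


l1_cu0, l1_cu1 = Cache(), Cache()

l1_cu0.fill("X", wts=0, rts=10)    # I0-1: CU0 reads X
l1_cu1.fill("Y", wts=0, rts=7)     # I1-1: CU1 reads Y
l1_cu0.fill("Y", wts=8, rts=12)    # I0-2: CU0 writes Y, cts becomes 8
l1_cu1.fill("X", wts=11, rts=15)   # I1-2: CU1 writes X, cts becomes 11

hit_i0_3 = l1_cu0.read_hits("X")   # I0-3: cts 8 <= rts 10
hit_i1_3 = l1_cu1.read_hits("Y")   # I1-3: cts 11 > rts 7
```

Under these assumptions, I0-3 hits in the L1$ of CU0 while I1-3 is a coherency miss in the L1$ of CU1, matching steps 28 and 30.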
3.2.4 Inter-GPU Coherency
In this example, we use the same instructions as in the
previous example for intra-GPU coherency. CU0 of GPU0
executes instructions I0-1, I0-2, and I0-3. However, instruc-
tions I1-1, I1-2, and I1-3 are executed by the CU0 of GPU1
in this example. Thus, we have two different L2$s, one con-
nected to GPU0 and one connected to GPU1. Figure 5(b)
shows the instructions and the sequence of execution for ex-
plaining inter-GPU coherency. The read request from CU0
of GPU0 to read location [X] and read request from CU0 of
GPU1 to read location [Y] follow the exact same steps (steps
1 - 14 ) as in the case of intra-GPU coherency. The write
[Figure 6 diagram omitted. Labels: L2 Cache, MMC, TSU (50 cycles), Main Memory (100 cycles); request path, response path, Ack, lookup path.]
Figure 6: Time Stamp Unit (TSU). The TSU operates
independently and in parallel with the memory access.
request from CU0 of GPU0 at step 15 and the write request from
CU0 of GPU1 at step 21 are also handled in the same manner
as in the case of intra-GPU coherency. The only difference
is that the data for the write of [X] and for the write of [Y]
reside in different L2$s. The read request (I0-3) by CU0
of GPU0 at step 27 still produces a cache hit in the L1$. The read
request issued by CU0 of GPU1 (I1-3) results in a different
set of execution steps. This is because at step 32, there is no
longer an L2$ hit, as the lease (rts = 7, wts = 0) has expired
for cts = 11. Hence, the data for [Y] has to be fetched
from the MM. The MM has the updated value written by
CU0 of GPU0. This value is received by CU0 of GPU1, which
thus becomes coherent with CU0 of GPU0. The execution
order of the instructions in this example is again
I0-1 → I1-1 → I0-2 → I0-3 → I1-2 → I1-3.
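The difference between the two examples comes down to which L2$ observed the write to [Y]. The sketch below contrasts the two L2$ states at the time of I1-3. The dictionary layout and the helper `l2_read_hits` are hypothetical; the timestamp values are the ones quoted in Sections 3.2.3 and 3.2.4.

```python
def l2_read_hits(l2, addr):
    """True when the L2 holds `addr` with an unexpired lease (cts <= rts)."""
    entry = l2["blocks"].get(addr)
    return entry is not None and l2["cts"] <= entry["rts"]


# Intra-GPU case: CU0 and CU1 share one L2$, so the write of Y by CU0
# refreshed the same block that CU1 reads later.
shared_l2 = {
    "cts": 11,
    "blocks": {"Y": {"wts": 8, "rts": 12}, "X": {"wts": 11, "rts": 15}},
}

# Inter-GPU case: the L2$ of GPU1 only saw its own read of Y and its own
# write of X, so its copy of Y still carries the original lease.
gpu1_l2 = {
    "cts": 11,
    "blocks": {"Y": {"wts": 0, "rts": 7}, "X": {"wts": 11, "rts": 15}},
}

intra_hit = l2_read_hits(shared_l2, "Y")   # lease 12 covers cts 11
inter_hit = l2_read_hits(gpu1_l2, "Y")     # lease 7 expired at cts 11
```

In the inter-GPU case the expired lease forces the request to the MM, which is where GPU1 picks up the value written by GPU0.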
3.2.5 TSU Implementation
The TSU is physically placed inside the logic layer of
the HBM stack. We could have chosen to place the TSU in
the DRAM layers, but this would increase memory access
latency. We designed the TSU as an 8-way set associative
cache. The TSU needs to store the memts for all of the blocks
in all the L2$s in the MGPU system. We use 16 bits for each
memts. Since we have 8 distributed L2$ modules in each
GPU, each way of the TSU keeps track of the timestamps of
the cache blocks in one of the L2$ modules. For example,
for an MGPU with a 2MB L2$ per GPU, we need 64KB
of space for the timestamps in the TSU for each GPU. As
TSU logic only searches for the presence of the timestamp
of a block and generates or updates timestamps, we expect the
latency for accessing the TSU to be comparable to an L3$ hit
time of 40 cycles [17]. We conservatively assume a 50-cycle
access latency for the TSU.
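The 64KB figure follows directly from the block count. The short calculation below reproduces it, assuming the 64B cache block size given in Section 3.2.6 and one 16-bit memts per L2$ block; the function and variable names are illustrative.

```python
BLOCK_BYTES = 64    # cache block size stated in Section 3.2.6
MEMTS_BITS = 16     # one memts per tracked block
L2_MODULES = 8      # distributed L2$ modules per GPU


def tsu_bytes_per_gpu(l2_bytes, block_bytes=BLOCK_BYTES, memts_bits=MEMTS_BITS):
    """Storage needed to hold one memts for every block of a GPU's L2$."""
    blocks = l2_bytes // block_bytes
    return blocks * memts_bits // 8


l2_total = 2 * 1024 * 1024                       # 2MB of L2$ per GPU
tsu_kb = tsu_bytes_per_gpu(l2_total) / 1024      # 32768 blocks x 2 bytes
entries_per_way = (l2_total // L2_MODULES) // BLOCK_BYTES  # blocks per L2$ module
```

If each TSU way tracks one 256KB L2$ module, as described above, each way would hold 4096 entries.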
Figure 6 shows the operation of the TSU. A request from
the memory controller is sent to the TSU and the DRAM
layer in parallel. The TSU responds with the timestamp
for the cache block while, in parallel, the DRAM layer
responds with the cache block. Thus, the TSU never impacts
the critical path of the DRAM access, and so does not add any
performance overhead. The eviction of TSU entries is related
to the eviction of L2$ entries. When there is an eviction from
the L2$ of a GPU, the TSU also evicts the timestamp for that
cache block if it is not shared with other GPUs. The TSU
logic determines the block sharers using the memts value (if
the value of memts is within one lease period, it is assumed
to be shared). In case the TSU is full, the TSU evicts the
entry with the lowest memts value.
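The two eviction rules can be sketched as below. This is a simplified, fully associative model rather than the 8-way set-associative design, and the class `TSU`, its methods, and the caller-supplied `is_shared` predicate are hypothetical. In particular, the paper does not spell out the reference point for "within one lease period", so the predicate used in the example (memts no older than one lease before a logical time `NOW`) is only one possible reading.

```python
class TSU:
    """Toy timestamp storage unit: addr -> memts, with a fixed capacity."""

    def __init__(self, capacity, is_shared):
        self.capacity = capacity
        self.is_shared = is_shared   # sharing heuristic supplied by the caller
        self.memts = {}

    def insert(self, addr, memts):
        """Add or update an entry; when full, evict the lowest-memts entry."""
        if addr not in self.memts and len(self.memts) >= self.capacity:
            victim = min(self.memts, key=self.memts.get)
            del self.memts[victim]
        self.memts[addr] = memts

    def on_l2_eviction(self, addr):
        """Drop the timestamp of an evicted L2$ block unless it looks shared."""
        if addr in self.memts and not self.is_shared(addr, self.memts[addr]):
            del self.memts[addr]


NOW, LEASE = 20, 10   # hypothetical logical time and lease length
tsu = TSU(capacity=2, is_shared=lambda addr, memts: memts >= NOW - LEASE)
tsu.insert("A", 5)
tsu.insert("B", 30)
tsu.on_l2_eviction("A")   # memts 5 is stale: treated as unshared and dropped
tsu.on_l2_eviction("B")   # memts 30 is recent: treated as shared and kept
tsu.insert("C", 8)
tsu.insert("D", 40)       # TSU is full: the lowest-memts entry (C) is evicted
```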
3.2.6 Timestamp Design
Table 2: GPU Architecture.
Component      Configuration     Count per GPU
CU             1.0 GHz           32
L1 Vector $    16KB 4-way        32
L1 Scalar $    16KB 4-way        8
L1I$           32KB 4-way        8
L2$            256KB 16-way      8
DRAM           512MB HBM         8
L1 TLB         1 set, 32-way     48
L2 TLB         32 sets, 16-way   1
We use 16-bit fields for each one of the timestamps, rts
and wts. Assuming 64B cache block size, 4B for ACK, 4B
for metadata and 8B address, HALCONE increases the net-
work traffic by 5% and 5.26% for read and write transactions,
respectively. If the timestamp value overflows, instead of
flushing the cache, we simply re-initialize the timestamps to
0. This re-initialization results in a cache miss for one of the
cache blocks. However, given we are using a write-through
policy for writes in both L1$ and L2$, there is no chance of
losing data belonging to the cache block experiencing the
overflow. We just need to do an extra MM access. We need
1 KB of storage per 16 KB L1$ (256 cache blocks) and 128 KB of
storage per L2$ of size 2 MB for holding the read and write
timestamps. For each cache timestamp (cts), we use 64 bits.
For an example GPU with 32 CUs, the GPU requires a total
of 40 cts entries (32 for the 32 private L1$s belonging to
each CU and 8 for the L2$). Hence, we need a total of 320
bytes to represent all the cts values for the entire GPU.
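The overhead and storage numbers in this subsection can be checked with simple arithmetic. The sketch below assumes that the two 16-bit timestamps (4 bytes in total) are the only added payload. The read figure follows from the 80-byte message listed above. For writes, the paper does not state which field is absent, so the code only confirms that 5.26% corresponds to a 76-byte baseline. The L1$ size is taken from Table 2 (16 KB, i.e., 256 blocks of 64B).

```python
BLOCK, ACK, META, ADDR = 64, 4, 4, 8      # message fields in bytes
TS_BYTES = 2 * 16 // 8                    # rts + wts, 16 bits each

read_base = BLOCK + ACK + META + ADDR     # 80-byte baseline read transaction
read_overhead_pct = 100 * TS_BYTES / read_base
write_overhead_pct = 100 * TS_BYTES / 76  # 76-byte baseline implied by 5.26%


def timestamp_bytes(cache_bytes, block_bytes=BLOCK):
    """Storage for one (rts, wts) pair per cache block."""
    return (cache_bytes // block_bytes) * TS_BYTES


l1_ts_bytes = timestamp_bytes(16 * 1024)         # 16 KB L1 vector cache
l2_ts_bytes = timestamp_bytes(2 * 1024 * 1024)   # 2 MB of L2$ per GPU
cts_bytes = (32 + 8) * 64 // 8                   # 40 cts entries of 64 bits
```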
4. EVALUATION METHODOLOGY
In this section, we describe the MGPU system configura-
tions, the simulator, standard application benchmarks, and the
custom synthetic benchmarks (for stress testing HALCONE)
that are used to evaluate different MGPU configurations.
4.1 MGPU System Configurations
Table 2 shows the architecture of each GPU in our MGPU
system. To complete a comprehensive evaluation, we evaluate
five different MGPU configurations7:
1. MGPU system with RDMA (RDMA-WB-NC).
2. MGPU system with RDMA and HMG coherency8
(RDMA-WB-C-HMG).
3. MGPU-SM system, L2$ with WB and no coherency
(SM-WB-NC).
4. MGPU-SM system, L2$ with WT and no coherency
(SM-WT-NC).
5. MGPU-SM system, L2$ with WT and HALCONE
(SM-WT-C-HALCONE).
Figure 1 shows the RDMA configuration for a typical MGPU
system. In this configuration, each GPU’s L1$ is connected to
a switch (SW) that connects to another GPU’s L2$ for RDMA.
Each switch forwards 16 bits per transfer. Switches run at
16 GTransfers/s. Thus, each switch can support a throughput
of 32 GB/s (unidirectional link bandwidth between L2 and
MM), which is the peak unidirectional bandwidth for PCIe
4.0 [11]. In the case of HMG, to maintain coherency, each
GPU’s L2$ is connected to the switch (SW) and the protocol
uses RDMA via L2$. For RDMA connections between L2$
and MM, we use PCIe 4.0 links. For both baseline and HMG
configurations, the MGPUSim simulator faithfully models the
PCIe interconnects. For our MGPU-SM system, we group
together the switches to form a switch complex. Both the L2$s
and the MM are connected to the switch complex. The overall
L2-to-MM bidirectional bandwidth is 256 GB/s, though
each HBM supports an effective communication bandwidth
of 341 GB/s [6]. Hence, in our MGPU-SM evaluation, the
total L2-to-MM bandwidth is limited to 1 TB/s. We carefully
model the queuing latency on the L2-to-MM network, as well
as a fixed 100-cycle latency at the memory controllers (the
number is calibrated using a real GPU with HBM memory).
For our evaluation, we allocate memory by interleaving 4 KB
pages across all the memory modules in the MGPU system.
7To name the MGPU systems we use the following notation: C =
cache-coherence support, NC = No coherency support, WT = L2$
with write-through policy, and WB = L2$ with write-back policy.
8HMG is the most recently proposed solution for efficient HW
cache-coherent support in MGPU systems (HMG is succinctly
described in Sections 1 and 6, and with more details in Section 4).
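Two of the numbers in this subsection are easy to reproduce: the per-switch bandwidth and the mapping of addresses to memory modules under 4 KB page interleaving. The sketch below assumes a plain round-robin page-to-module mapping, which the paper does not specify beyond "interleaving", and a 4-GPU system with 8 HBM modules per GPU (Table 2). Function names are illustrative.

```python
PAGE_BYTES = 4 * 1024


def home_module(addr, num_modules):
    """Memory module holding `addr` under round-robin 4 KB page interleaving."""
    return (addr // PAGE_BYTES) % num_modules


def link_gb_per_s(bits_per_transfer, gtransfers_per_s):
    """Unidirectional link throughput in GB/s."""
    return bits_per_transfer / 8 * gtransfers_per_s


num_modules = 4 * 8                      # 4 GPUs x 8 HBM modules (assumed)
switch_bw = link_gb_per_s(16, 16)        # 16 bits per transfer at 16 GT/s
same_page = home_module(4095, num_modules)
next_page = home_module(4096, num_modules)
wrapped = home_module(num_modules * PAGE_BYTES, num_modules)
```

Consecutive 4 KB pages land on consecutive modules, so a streaming access pattern spreads its traffic across all modules in the system.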
Our evaluation of the RDMA-WB-NC and SM-WB-NC configu-
rations is aimed at exposing the need for our proposed MGPU-
SM cache coherent systems (more details in Section 5). The
SM-WB-NC and SM-WT-NC configurations are used to com-
pare L2$ write-back (WB) policy with L2$ write-through
(WT) policy in a MGPU-SM system. This comparison helps
us learn which write policy is more suitable for L2$ in a
MGPU-SM system. The SM-WT-NC and SM-WT-C-HALCONE
configurations are then compared to determine the overhead
of coherency (we use a WT policy as it provides better per-
formance than WB, as reported in our experiments in Sec-
tion 5). The comparison between configuration RDMA-WB-C-
HMG and SM-WT-C-HALCONE demonstrates the improvement
achieved by our proposed solution over the most optimized
hardware coherence support for MGPU systems (HMG
protocol). Except for HMG, which leverages a scope-based
memory consistency model, we adopt the existing weak memory
consistency model for our evaluation. Nonetheless, our HALCONE
protocol can work as a building block for stricter memory
consistency models.
4.2 Simulation Platform
We use the MGPUSim [32] simulator to model MGPU
systems. MGPUSim has been validated against real AMD
MGPU systems. We modified the simulator and its memory
hierarchy to support HALCONE. After implementing
the HALCONE protocol, we verify the implementation using
unit, integration, and acceptance tests provided with the
simulator. We also modified the simulator to support the
HMG protocol by implementing a hash function that assigns
a home node for a given address, directory support for track-
ing sharers and invalidation support for sending messages to
the sharers as needed.
4.3 Benchmarks
We use standard application GPU benchmarks as well as
synthetic benchmarks to evaluate our HALCONE protocol in
an MGPU-SM system.
4.3.1 Standard Benchmarks
We use a mix of memory-bound and compute-bound bench-
marks, 11 in total (see Table 3), from the Hetero-Mark [34],
PolyBench [26], SHOC [9], and DNNMark [10] benchmark
suites to examine the impact of our HALCONE protocol on
the MGPU-SM system. In addition, these workloads have
large memory footprints and represent a variety of data shar-
ing patterns across different GPUs. More details about the
benchmarks can be found in [32, 33].
Table 3: Standard benchmarks used in this work. Mem-
ory represents the footprint in the GPU memory.
Benchmark (abbr.)                                   Suite        Type     Memory
Advanced Encryption Standard (aes)                  Hetero-Mark  Compute  71 MB
Matrix Transpose and Vector Multiplication (atax)   PolyBench    Memory   64 MB
Breadth First Search (bfs)                          SHOC         Memory   574 MB
BiCGStab Linear Solver (bicg)                       PolyBench    Compute  64 MB
Bitonic Sort (bs)                                   AMDAPPSDK    Memory   67 MB
Finite Impulse Response (fir)                       Hetero-Mark  Memory   67 MB
Floyd Warshall (fws)                                AMDAPPSDK    Memory   32 MB
Matrix Multiplication (mm)                          AMDAPPSDK    Memory   192 MB
Maxpooling (mp)                                     DNNMark      Compute  64 MB
Rectified Linear Unit (rl)                          DNNMark      Memory   67 MB
Simple Convolution (conv)                           AMDAPPSDK    Memory   145 MB
4.3.2 Synthetic Benchmarks
The publicly available benchmark suites mentioned in
Section 4.3.1 have been developed considering the lack of
hardware-level coherency support in GPUs. Hence, these
benchmarks cannot necessarily harness the potential benefit
of the hardware-support for coherency in our MGPU-SM
system. To stress test our HALCONE protocol, we develop
a synthetic benchmark suite called Xtreme. There are three
benchmarks in the Xtreme suite9. All the benchmarks in the
Xtreme suite perform a basic vector operation: C = A+B,
where A and B are floating point vectors. We describe the
basic operation of the Xtreme benchmarks with a simple
example. For each example, we assume the following:
1. There are two GPUs: GPUX and GPUY .
2. Both GPUX and GPUY are equipped with two CUs
each: CUX0, CUX1, and CUY0 and CUY1, respectively.
3. There are three vectors A, B and C that are used to
compute C = A+B using both GPUX and GPUY .
4. Each of the three vectors, A, B and C, is split into 4
slices: A0, A1, A2 and A3; B0, B1, B2 and B3; and C0,
C1, C2, and C3.
5. At the beginning of the program, CUX0 reads A0, B0,
and C0; CUX1 reads A1, B1, and C1; CUY0 reads A2,
B2, and C2; CUY1 reads A3, B3, and C3.
The three Xtreme benchmarks work as follows:
Xtreme1:
1 CUX0 performs C0 = A0 +B0; Similarly, CUX1 operates
on A1, B1 and C1; CUY0 operates on A2, B2 and C2; and
CUY1 operates on A3, B3 and C3.
2 Repeat step 1 10 times.
3 CUX0 performs A0 = C0 +B0; Similarly, CUX1 operates
on A1, B1 and C1; CUY0 operates on A2, B2 and C2; and
CUY1 operates on A3, B3 and C3.
4 Repeat step 3 10 times.
With Xtreme1, we evaluate the impact of consecutive writes
to the same location by a CU. There is no data sharing be-
tween the CUs or the GPUs. When there is a write to any
location, the corresponding cts of the L1$ and L2$ step
ahead and generate read misses. Steps 2 and 4 force co-
herency misses in the caches.
9All the benchmarks in the Xtreme suite perform repeated writes to
and reads from the same location. This extreme behavior is typically
uncommon in regular GPU benchmarks, and so the name Xtreme.
Xtreme2:
1 CUX0 performs C0 = A0 +B0; Similarly, CUX1 operates
on A1, B1 and C1; CUY0 operates on A2, B2 and C2; and
CUY1 operates on A3, B3 and C3.
2 CUX0 performs A1 = C1 +B1;
3 Repeat step 2 10 times.
4 Repeat step 1
With Xtreme2, we stress test HALCONE for intra-GPU
coherency. There is a SWMR invariant dependency between
CUX0 and CUX1: at step 2, CUX0 writes to a location that was
previously read by CUX1. Step 3 forces coherency misses.
Xtreme3:
1 CUX0 performs C0 = A0 +B0; Similarly, CUX1 operates
on A1, B1 and C1; CUY0 operates on A2, B2 and C2; and
CUY1 operates on A3, B3 and C3.
2 CUX0 performs A3 = C3 +B3;
3 Repeat step 2 10 times.
4 Repeat step 1
With Xtreme3, we stress test HALCONE for inter-GPU
coherency. The difference between Xtreme2 and Xtreme3 is
that at step 2 CUX0 writes to a location that was previously read
by CUX1 in Xtreme2 and by CUY1 in Xtreme3.
In our evaluation, we vary vector sizes from 192 KB to 96
MB for A, B and C so that we can examine the impact of
capacity misses at different levels of the memory hierarchy.
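The three access patterns can be summarized by a serial reference model. The sketch below is not the benchmark code: it runs the listed steps in program order on plain Python lists, which gives the values that a coherent MGPU system is expected to produce. The function names, the slice representation, and the tiny one-element slices are all illustrative assumptions.

```python
def add(x, y):
    """Element-wise sum of two equally sized slices."""
    return [a + b for a, b in zip(x, y)]


def run_xtreme(variant, A, B, repeats=10):
    """Serial reference for Xtreme1/2/3. A and B are lists of 4 slices,
    one per CU (CUX0, CUX1, CUY0, CUY1). Returns the final (A, C)."""
    if variant not in (1, 2, 3):
        raise ValueError("variant must be 1, 2, or 3")
    A = [list(s) for s in A]
    C = [[0.0] * len(s) for s in A]

    def own_slices():                      # each CU i: C_i = A_i + B_i
        for i in range(len(A)):
            C[i] = add(A[i], B[i])

    if variant == 1:
        for _ in range(1 + repeats):       # step 1, then repeated 10 times
            own_slices()
        for _ in range(1 + repeats):       # step 3, then repeated 10 times
            for i in range(len(A)):
                A[i] = add(C[i], B[i])
    else:
        target = 1 if variant == 2 else 3  # slice of CUX1 or of CUY1
        own_slices()                       # step 1
        for _ in range(1 + repeats):       # step 2, then repeated 10 times
            A[target] = add(C[target], B[target])   # performed by CUX0
        own_slices()                       # step 4: repeat step 1
    return A, C


A0 = [[1.0], [2.0], [3.0], [4.0]]
B0 = [[10.0], [20.0], [30.0], [40.0]]
a1, c1 = run_xtreme(1, A0, B0)
a2, c2 = run_xtreme(2, A0, B0)
a3, c3 = run_xtreme(3, A0, B0)
```

In Xtreme2 and Xtreme3, the final C slice of the remote CU depends on the value of A written by CUX0, so a stale cached copy of A would yield a wrong result.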
5. EVALUATION
In this section, we present our evaluation of the HALCONE
protocol for both existing standard benchmarks and synthetic
benchmarks. The standard benchmarks have been developed
in accordance with the current MGPU programming model
that assumes no hardware support for coherency and places
the burden of maintaining coherency on the programmer.
Hence, we evaluate these traditional GPU benchmarks to
ensure our HALCONE protocol does not introduce extra
overhead for these legacy cases, where coherency has been
maintained by the GPU programmer. Next, we evaluate the
Xtreme benchmarks, a set of three synthetic benchmarks
that leverage hardware support for coherency to ensure the
correctness of their computations.
5.1 Standard Benchmarks
We compare 5 different MGPU configurations, RDMA-WB-NC
(our baseline), RDMA-WB-C-HMG, SM-WB-NC, SM-WT-NC,
and SM-WT-C-HALCONE, assuming a 4-GPU system. We
use a WrLease of 5 and a RdLease of 10 for this evaluation.
Please refer to Section 5.4 for details on why we choose these
lease values.
Figure 7(a) shows the speed-up for different MGPU con-
figurations, as compared to RDMA-WB-NC. Our evaluation
shows that the RDMA-WB-C-HMG, SM-WB-NC, SM-WT-NC, and
SM-WT-C-HALCONE configurations achieve, on average, a
1.5×, 3.9×, 4.6×, and 4.6× speed-up, respectively, versus
RDMA-WB-NC. There are two reasons why all 3 shared mem-
ory configurations are faster than using RDMA-WB-NC alone.
First, RDMA-WB-NC requires data copy operations between
the CPU and GPUs. Shared memory eliminates this traffic
since the CPU and GPUs share the same memory. Second,
during kernel execution, all of the GPUs are required to use
RDMA to access data residing on other GPUs’ memory for
the baseline. The shared main memory allows sharing of data
across GPUs with no RDMA overhead.
For the compute-bound benchmarks (i.e., aes, atax, bicg,
and mp), all the MGPU-SM configurations achieve lower
(1.2× to 2.0×) speed-up as compared to the speed-up
achieved for the memory-bound benchmarks (3× to 27×).
This is due to the memory-bound benchmarks’ higher reliance
on the high overhead RDMA for shared data access than the
compute-bound benchmarks. Even though RDMA-WB-C-HMG
uses RDMA, this configuration brings the cache blocks from
a remote GPU in its L2$ instead of its L1$ as in the case of
RDMA-WB-NC. Hence, workloads that exploit temporal and
spatial locality (i.e. mm and conv) achieve speed-up up to
18× for RDMA-WB-C-HMG configuration.
If we compare the speed-up of SM-WB-NC and SM-WT-NC,
for all the compute-bound benchmarks, the difference be-
tween a WB L2$ and a WT L2$ is less than 1%. But for
the memory-bound benchmarks, we observe up to 3× bet-
ter performance when employing a WT cache. This lower
performance of WB cache can be explained by inspecting
the L1$ and L2$ transactions. Figures 7(b) and 7(c) show
the normalized10 L2$ and L1$ traffic in terms of number
of L2$ to MM and L1$ to L2$ transactions and responses,
respectively, for both read and write operations. As we
observe in Figure 7(b), and as expected, WB generates, on
average, 22.7% fewer L2$ to MM transactions than WT
across all the benchmarks. However, it is counter-intuitive
that even with fewer L2$ to MM transactions, SM-WB-NC
performs worse than SM-WT-NC. For a read or write miss in
the L2$ with a WB policy, if the miss is a conflict or capacity
miss, the L2$ must first write the evicted block back to the
MM. Only then can the L2$ service the pending read or write
transaction. The L2$ write-back thus becomes a bottleneck
when there are frequent cache evictions.
Note that the benchmarks in our evaluation use large memory
footprints to generate frequent capacity and conflict misses
in the L2$. Additionally, the benchmarks are streaming in
nature. Hence, the benchmarks have frequent cache evictions,
which perform worse with WB than with WT. With a WT
L2$, we do not need to write the data to the MM in case of an
eviction as the updated copy of the data is always available in
the MM. The transactions from L1$ to L2$ remain the same
for both SM-WB-NC and SM-WT-NC across all benchmarks.
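The effect described above can be illustrated with a deliberately simple model of one L2$ miss. The sketch below only counts how many MM accesses are serialized on the miss path and multiplies by the fixed 100-cycle memory controller latency from Section 4.1. It ignores queuing and any overlap between the write-back and the fill, and the function name and the `victim_dirty` flag are hypothetical.

```python
MM_LATENCY = 100   # cycles, fixed memory-controller latency from Section 4.1


def miss_service_cycles(policy, victim_dirty):
    """MM latency serialized on one L2$ miss that evicts a block."""
    if policy == "WB" and victim_dirty:
        return 2 * MM_LATENCY   # write the victim back, then fetch the block
    if policy in ("WB", "WT"):
        return MM_LATENCY       # WT: MM already holds the data, fetch only
    raise ValueError(policy)


wb_dirty = miss_service_cycles("WB", victim_dirty=True)
wb_clean = miss_service_cycles("WB", victim_dirty=False)
wt_any = miss_service_cycles("WT", victim_dirty=True)
```

For streaming workloads that keep evicting recently written blocks, the WB case with a dirty victim dominates, which is in line with the slowdown reported for SM-WB-NC.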
Figure 7(a) also shows that our proposed SM-WT-C-
HALCONE suffers, on average, a 1% performance degradation
as compared to the SM-WT-NC configuration. This slight per-
formance degradation is due to more L1$ transactions being
generated for SM-WT-C-HALCONE as compared to SM-WT-
NC, as seen in Figure 7(c). As explained earlier, the standard
benchmarks do not require any support for coherency and
due to their streaming nature (which means these bench-
marks continuously read and write to different cache blocks)
they suffer capacity and conflict misses instead of coherency
misses. For more details on this, refer to Section 5.3. We con-
clude that our HALCONE protocol is efficient as it causes, on
average, a 1% performance degradation for standard MGPU
benchmarks when compared to an MGPU system with SM-
WT-NC configuration. Moreover, compared to RDMA-WB-NC,
10We use normalized values here due to the wide variations in the
number of L2$ to MM as well as L1$ to L2$, transactions across
the different benchmarks.
[Figure 7 plots omitted. x-axis: Benchmarks. Panel (a): Speed-Up for RDMA-WB-NC, RDMA-WB-C-HMG, SM-WB-NC, SM-WT-NC, and SM-WT-C-HALCONE. Panel (b): Normalized #Trans (L2 to MM) for SM-WT-NC and SM-WT-C-HALCONE. Panel (c): Normalized #Trans (L1 to L2) for SM-WT-NC and SM-WT-C-HALCONE.]
Figure 7: (a) Speed-up for different MGPU systems, normalized versus RDMA-WB-NC. (b) Number of L2$ to MM trans-
actions for SM-WT-NC and SM-WT-C-HALCONE normalized versus the number of transactions for the SM-WB-NC configu-
ration in an MGPU system. (c) Number of L1$ to L2$ transactions for SM-WT-NC and SM-WT-C-HALCONE normalized
versus the number of transactions for the SM-WB-NC configuration in an MGPU system. Mean refers to geometric mean.
an MGPU system with the SM-WT-C-HALCONE configuration
has, on average, 4.6× better performance. In addition, the
SM-WT-C-HALCONE configuration, on average, outperforms
the RDMA-WB-C-HMG configuration by 3×.
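As a quick consistency check, the 3× figure can be related to the two average speed-ups reported above. Assuming the quoted averages are the geometric means shown in Figure 7, and writing S_i for the speed-up of benchmark i over RDMA-WB-NC (notation introduced here for illustration), the ratio of the two means equals the mean of the per-benchmark ratios:

```latex
\frac{\mathrm{GM}_i\left(S_i^{\mathrm{HALCONE}}\right)}
     {\mathrm{GM}_i\left(S_i^{\mathrm{HMG}}\right)}
  = \mathrm{GM}_i\left(\frac{S_i^{\mathrm{HALCONE}}}{S_i^{\mathrm{HMG}}}\right)
  \approx \frac{4.6}{1.5} \approx 3.07
```

This identity holds for geometric means but not for arithmetic means, which is one reason the geometric mean is convenient when comparing normalized results.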
5.2 Scalability Study
We use strong scaling to explore the scalability of the
MGPU-SM system with HALCONE protocol, by varying
both the GPU count and CU count while keeping the size of
the workloads constant.
5.2.1 GPU Count Scalability Study
For this study, we use 32 CUs per GPU as the baseline
comparison point. Figure 8(a) shows the speed-up for GPU
counts of 1, 2, 4, 8 and 16. Here, runtimes are normalized
to that of a single GPU. On average, we achieve a 1.76×,
2.74×, 4.05×, and 5.43× speed-up in comparison to a single
coherent GPU for 2, 4, 8, and 16 GPUs, respectively. Some
of the workloads (i.e., atax, bicg, mp and relu) do not scale
well beyond 4 GPUs because too little computation is available
for each GPU, and so they do not benefit from a larger GPU count.
Nonetheless, the comparison shown in Figure 8(a) confirms
that our proposed HALCONE protocol is scalable and does
not limit the scalability of an MGPU-SM system.
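Since this is a strong-scaling study, the average speed-ups can also be read as parallel efficiency (speed-up divided by GPU count). The short calculation below uses only the averages quoted above; the variable names are illustrative.

```python
# Average speed-up over a single coherent GPU, as reported in the text.
speedups = {2: 1.76, 4: 2.74, 8: 4.05, 16: 5.43}

# Parallel efficiency under strong scaling: speed-up per GPU.
efficiency = {gpus: s / gpus for gpus, s in speedups.items()}

for gpus, eff in efficiency.items():
    print(f"{gpus:2d} GPUs: efficiency {eff:.1%}")
```

The efficiency drops from 88% at 2 GPUs to roughly 34% at 16 GPUs, which matches the observation that several workloads run out of per-GPU computation beyond 4 GPUs.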
5.2.2 CU Count Scalability Study
For this study, we use a 4-GPU system and consider 32, 48
and 64 CUs per GPU (see Figure 8(b) and Figure 8(c)). The
atax, bicg, mp, and relu benchmarks do not scale with CU
count as they do not have sufficient compute intensity. The
bfs and bs benchmarks do not scale as we increase the CU
count because of a L2$ bottleneck. For these benchmarks,
the number of transactions from L2$ to MM for 32 CUs is
the same as the number of transactions for 48 and 64 CUs.
Hence, L2$ queuing and serialization latencies dominate the
runtime, irrespective of the number of CUs. In Figure 7(b)
and Figure 7(c), we have already demonstrated that our HAL-
CONE protocol, on average, introduces only 1% additional
traffic from L2$ to MM and from L1$ to L2$. The bfs and
bs benchmarks suffer from the L2$ bottleneck, even when
the MGPU system lacks coherency. Hence, the HALCONE
protocol itself is not a bottleneck in terms of CU scalability.
The aes, fir, mm and conv benchmarks do not stress the L2$,
even though the number of transactions from L2$ to MM increases
with the CU count, and they have sufficient compute
intensity to take advantage of a higher CU count. Hence, these
benchmarks benefit from a larger number of CUs. On
average, we see 1.12× and 1.24× speed-up as we increase the
CU count from 32 to 48 and 64, respectively.
5.3 Xtreme Benchmarks
As discussed before, the standard MGPU benchmarks
have been developed in accordance with the assumption that
there is no hardware support for coherency and no weak-
consistency programming model assumed for the GPUs.
Hence, we use our synthetic benchmark suite, Xtreme, to
evaluate the impact of our proposed HALCONE protocol for
some of the extreme cases of applications, where we need
coherency to ensure the correctness of the computation. With
Xtreme benchmarks, we evaluate three different scenarios:
1. The data size is small, so there are neither L1$ nor L2$
capacity or conflict misses.
2. The data size is large enough to cause L1$ capacity
and conflict misses, but not large enough to cause L2$
capacity or conflict misses.
3. The data size is large enough to cause both L1$ and
L2$ capacity and conflict misses.
We use MGPU-SM with 4 GPUs for this evaluation. Fig-
ure 9 shows the comparison of speed-up for SM-WT-NC and
SM-WT-C-HALCONE for all three Xtreme benchmarks. The
repeated writes to the same cache location in Xtreme1 cause
the cts of both the L1$s and L2$s to step ahead in logical
time, leading to coherency misses for the data that was read
before. For a vector size of 192 KB, we observe a performance
degradation of 14.3% for SM-WT-C-HALCONE. As
the vector size increases, there are more capacity and con-
flict misses, and eventually capacity and conflict misses far
outnumber coherency misses. The coherency misses can oc-
cur if the lease expires for a cache block. However, if there
are frequent cache evictions because of conflict or capacity
misses, the cache blocks are evicted based on an LRU policy,
even if the lease is valid. Thus, for a vector size of 98304
KB, we observe only a 0.6% performance degradation for SM-
WT-C-HALCONE in comparison to SM-WT-NC. The Xtreme2
benchmark exploits intra-GPU coherency. Xtreme3 requires
inter-GPU coherency among the MGPUs for correctness.
The data dependency in Xtreme2 and Xtreme3 results in
coherency misses when repeated writes are performed. We
observe a performance degradation of up to 12.1% and 16.8%
for Xtreme2 and Xtreme3, respectively. This degradation
decreases as the data size increases due to the corresponding
increase in capacity and conflict misses in L1$s and L2$s.
5.4 Sensitivity to Timestamps
We used (RdLease, WrLease) = (10, 5) for our evaluations.
We examined the impact of using different (RdLease,
WrLease) values of (2, 10), (10, 2), (5, 10), (10, 5), (20, 10),
[Figure 8 plots omitted. x-axis: Benchmarks. Panel (a): Speed-Up for 1, 2, 4, 8, and 16 GPUs. Panel (b): Speed-Up for 48 and 64 CUs. Panel (c): Normalized #Trans (L2 to MM) for 48 and 64 CUs.]
Figure 8: (a) GPU scalability: Speed-up for the SM-WT-C-HALCONE with different #GPUs normalized to that of a single
coherent GPU. (b) and (c) CU scalability: Speed-up and #L2$ transactions for the SM-WT-C-HALCONE with different
#CUs normalized to that of the SM-WT-C-HALCONE with 32 CUs. Mean stands for geometric mean.
[Figure 9 plots omitted. x-axis: Vector Size (KB), with values 192, 1536, 24576, and 98304; y-axis: Speed-Up. Each of panels (a), (b), and (c) compares SM-WT-NC and SM-WT-C-HALCONE.]
Figure 9: Speed-up for the Xtreme benchmarks running
on an MGPU system with SM-WT-C-HALCONE w.r.t. an
MGPU system with SM-WT-NC for different vector sizes:
(a) Xtreme1, (b) Xtreme2 and (c) Xtreme3
and (10, 20) using the coherency-aware Xtreme benchmarks.
We found that if the difference between the WrLease and
the RdLease is increased from 5 to 10, then performance
degrades by up to 3% for the Xtreme benchmarks. So we
need to maintain a smaller difference between RdLease and
WrLease. In terms of the absolute values of RdLease and
WrLease, a large RdLease can help an application
that performs significantly fewer writes than reads.
On the other hand, a smaller RdLease
results in more coherency misses. We choose a smaller
WrLease value than RdLease value based on the assumption
that if a CU or a GPU writes to a cache block, it may write
to the same cache block again in the future. A smaller WrLease,
in turn, keeps consecutive writes to the same block from making
cts too large, which could otherwise cause many coherency misses.
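The link between WrLease and cts growth can be made concrete with a small loop. The sketch assumes the write rule inferred from the example in Section 3.2.3 (wts = memts + 1, memts increased by WrLease, and cts = max(cts, wts) at the writer); this rule and the function name are assumptions, not the paper's stated implementation.

```python
def cts_after_writes(n_writes, wr_lease, memts=0, cts=0):
    """cts of the writing cache after n consecutive writes to one block."""
    for _ in range(n_writes):
        wts = memts + 1          # write is ordered after the current lease
        memts += wr_lease        # lease of the block is extended
        cts = max(cts, wts)      # the writer's clock steps ahead
    return cts


cts_small_lease = cts_after_writes(10, wr_lease=5)
cts_large_lease = cts_after_writes(10, wr_lease=10)
```

Under this model, ten back-to-back writes advance cts to 46 with a WrLease of 5 and to 91 with a WrLease of 10, so a larger WrLease causes previously read blocks (whose rts was set with a fixed RdLease) to expire sooner relative to the cache's clock.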
6. RELATED WORK
MGPU systems have recently been adopted as the computing
platform of choice by a variety of data-intensive applications.
Today there is growing interest in developing optimized
MGPU system architectures and efficient hardware coherence
support to reduce programming complexity.
MGPU System Design: Milic et al. propose a NUMA-
aware multi-socket GPU solution to resolve performance
bottlenecks related to NUMA memory placement in multi-
socket GPUs [19]. The proposed system dynamically adapts
inter-socket link bandwidth and caching policies to avoid
NUMA effects. Our CC-MGPU system completely avoids
the impact of NUMA on performance. Arunkumar et al. [2]
and Ren et al. [27] propose an MCM-GPU, where multiple
GPU modules are integrated in a package to improve energy
efficiency. As in MCM-GPU, our CC-MGPU can take ad-
vantage of novel integration technologies to improve energy
efficiency and performance. Arunkumar et al. [3] also argue
the need to improve inter-GPU communication. We plan to
explore high-bandwidth network architectures for CC-MGPU
systems in the future.
MGPU Coherency: NUMA-Aware multi-socket GPU
[19] maintains inter-GPU coherency by extending SW-based
coherence for L1$s to the L2$s. The resulting coherency
traffic lowers application performance. Similarly, MCM-
GPU [2] leverages the software-based L1$ coherence proto-
col for its L1.5$. The flushing of the caches and coherency
traffic hurt system scalability. Young et al. [39] propose
the CARVE method, where part of a GPU’s memory is used as
a cache for shared remote data and the GPU-VI protocol is
used for coherency. This protocol does not scale well with
an increase in the amount of read-write transactions and false
sharing. Also, the CARVE method can cause performance
degradation for workloads with a large memory footprint, as it
reduces the effective GPU memory space. To reduce coherency
traffic, Singh et al. propose a timestamp-based coherency (TC)
protocol for intra-GPU coherency [31]. As this protocol relies
on a globally synchronized clock across all CUs, maintaining
clock synchronization is a challenging task for large MGPU
systems. To address this, Tabbakh et al. [35] propose a logical
timestamp-based coherence protocol (G-TSC). However, as
discussed in Section 2.2, the G-TSC protocol is designed
for single GPU systems and does not scale well for MGPU
systems. HMG [27] is a recent hardware-managed cache coherence
protocol for distributed L2$s in MCM-GPUs using a
scoped memory consistency model. HMG proposes to extend
a simple VI-like protocol to track sharers in a hierarchical way
that is tailored to the MCM-GPU architecture, and achieves
a cost-effective solution in terms of on-chip area overhead,
inter-GPU coherence traffic reduction and high performance.
This protocol, however, relies on an error-prone scoped memory
consistency model that increases programming complexity.
In contrast, our new timestamp-based coherence protocol, HALCONE,
operates assuming a weak consistency model, which
is currently adopted by modern GPU products, and is able to
outperform HMG by 3× on average (see Section 5.1).
7. CONCLUSION
In this work, we propose a novel MGPU system, where
GPUs physically share the MM. This system eliminates the
programmer’s burden of unnecessary data replication and ex-
pensive remote memory accesses. To ensure seamless sharing
of data across and within multiple GPUs, we propose HALCONE,
a novel timestamp-based coherence protocol. For
standard benchmarks, an MGPU-SM system (that has 4 GPUs
and uses HALCONE) performs, on average, 4.6× faster than
the non-coherent conventional MGPU system with the same number
of GPUs. In addition, compared to a coherent MGPU
system using the state-of-the-art HMG coherence protocol,
an MGPU system that uses HALCONE reports 3× higher
performance. Our scalability study shows that our coherence
protocol scales well in terms of both GPU count and CU
count. We develop synthetic benchmarks that leverage data
sharing to examine the impact of our HALCONE protocol
on performance. For the worst-case scenario in our synthetic
benchmarks, the proposed MGPU-SM with HALCONE suffers
from only a 16.8% performance overhead.
REFERENCES
[1] “Google cloud.” [Online]. Available: https://gsuite.google.com/
[2] A. Arunkumar, E. Bolotin, B. Cho, U. Milic, E. Ebrahimi, O. Villa,
A. Jaleel, C.-J. Wu, and D. Nellans, “Mcm-gpu: Multi-chip-module
gpus for continued performance scalability,” in Proceedings of the
44th International Symposium on Computer Architecture, ser. ISCA
’17, vol. 45, no. 2. ACM, 2017, pp. 320–332.
[3] A. Arunkumar, E. Bolotin, D. Nellans, and C.-J. Wu, “Understanding
the future of energy efficiency in multi-module gpus,” in 2019 IEEE
International Symposium on High Performance Computer Architecture
(HPCA). IEEE, 2019, pp. 519–532.
[4] T. Baruah, Y. Sun, A. T. Dincer, S. A. Mojumder, J. L. Abellán,
Y. Ukidave, A. Joshi, N. Rubin, J. Kim, and D. Kaeli, “Griffin:
Hardware-software support for efficient page migration in multi-gpu
systems,” in 2020 26th IEEE International Symposium on
High-Performance Computer Architecture (HPCA). IEEE, 2020.
[5] S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron, “Pannotia:
Understanding irregular gpgpu graph applications,” in 2013 IEEE
International Symposium on Workload Characterization (IISWC).
IEEE, 2013, pp. 185–195.
[6] J. H. Cho, J. Kim, W. Y. Lee, D. U. Lee, T. K. Kim, H. B. Park,
C. Jeong, M.-J. Park, S. G. Baek, S. Choi et al., “A 1.2 v 64gb 341gb/s
hbm2 stacked dram with spiral point-to-point tsv structure and
improved bank group data control,” in 2018 IEEE International
Solid-State Circuits Conference-(ISSCC). IEEE, 2018, pp. 208–210.
[7] A. E. C. Cloud, “Amazon web services,” Retrieved November, vol. 9,
no. 2011, p. 2011, 2011.
[8] M. Copeland, J. Soh, A. Puca, M. Manning, and D. Gollob, Microsoft
Azure. Springer, 2015.
[9] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth,
K. Spafford, V. Tipparaju, and J. S. Vetter, “The scalable
heterogeneous computing (shoc) benchmark suite,” in Proceedings of
the 3rd Workshop on General-Purpose Computation on Graphics
Processing Units. ACM, 2010, pp. 63–74.
[10] S. Dong and D. Kaeli, “Dnnmark: A deep neural network benchmark
suite for gpus,” in Proceedings of the General Purpose GPUs. ACM,
2017, pp. 63–72.
[11] D. Gonzales, “Pci express 4.0 electrical previews,” in PCI-SIG
Developers Conference, 2015.
[12] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski,
A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch
sgd: training imagenet in 1 hour,” arXiv preprint arXiv:1706.02677,
2017.
[13] A. Hagan, A. Sawant, M. Folkerts, and A. Modiri, “Multi-gpu
configuration of 4d intensity modulated radiation therapy inverse
planning using global optimization,” Physics in Medicine & Biology,
vol. 63, no. 2, p. 025028, 2018.
[14] M. Harris, “Nvidia dgx-1: The fastest deep learning system,” 2017.
[15] B. A. Hechtman and D. J. Sorin, “Exploring memory consistency for
massively-threaded throughput-oriented processors,” in Proceedings of
the 40th Annual International Symposium on Computer Architecture,
2013, pp. 201–212.
[16] H. Kim, J. Sim, P. Gera, R. Hadidi, and H. Kim, “Batch-aware unified
memory management in gpus for irregular workloads,” in Proceedings
of the Twenty-Fifth International Conference on Architectural Support
for Programming Languages and Operating Systems, 2020, pp.
1357–1370.
[17] D. Levinthal, “Performance analysis guide for intel core i7 processor
and intel xeon 5500 processors,” Intel Performance Analysis Guide,
vol. 30, p. 18, 2009.
[18] K. V. Manian, A. A. Ammar, A. Ruhela, C.-H. Chu, H. Subramoni,
and D. K. Panda, “Characterizing CUDA Unified Memory
(UM)-Aware MPI Designs on Modern GPU Architectures,” in
Proceedings of the 12th Workshop on General Purpose Processing
Using GPUs, ser. GPGPU ’19. New York, NY, USA: ACM, 2019,
pp. 43–52.
[19] U. Milic, O. Villa, E. Bolotin, A. Arunkumar, E. Ebrahimi, A. Jaleel,
A. Ramirez, and D. Nellans, “Beyond the socket: Numa-aware gpus,”
in Proceedings of the 50th Annual IEEE/ACM International
Symposium on Microarchitecture. ACM, 2017, pp. 123–135.
[20] S. A. Mojumder, M. S. Louis, Y. Sun, A. K. Ziabari, J. L. Abellán,
J. Kim, D. Kaeli, and A. Joshi, “Profiling dnn workloads on a
volta-based dgx-1 system,” in 2018 IEEE International Symposium on
Workload Characterization (IISWC). IEEE, 2018, pp. 122–133.
[21] NVidia, “Nvidia dgx-1 with tesla v100 system architecture.”
[22] NVIDIA, “NVIDIA Unified Memory,” 2018. [Online]. Available:
http://on-demand.gputechconf.com/gtc/2018/presentation/s8430-
everything-you-need-to-know-about-unified-memory.pdf
[23] C. Nvidia, “Cublas library,” NVIDIA Corporation, Santa Clara,
California, vol. 15, no. 27, p. 31, 2008.
[24] M. O’Connor, “Highlights of the high-bandwidth memory (hbm)
standard,” in Memory Forum Workshop, 2014.
[25] M. Plakal, D. J. Sorin, A. E. Condon, and M. D. Hill, “Lamport clocks:
verifying a directory cache-coherence protocol,” in Proceedings of the
tenth annual ACM symposium on Parallel algorithms and
architectures, 1998, pp. 67–76.
[26] L.-N. Pouchet, “Polybench: The polyhedral benchmark suite,” URL:
http://www.cs.ucla.edu/pouchet/software/polybench, 2012.
[27] X. Ren, D. Lustig, E. Bolotin, A. Jaleel, O. Villa, and D. Nellans,
“Hmg: Extending cache coherence protocols across modern
hierarchical multi-gpu systems,” in 2020 26th IEEE International
Symposium on High-Performance Computer Architecture (HPCA).
IEEE, 2020.
[28] J. Shi, R. Yang, T. Jin, X. Xiao, and Y. Yang, “Realtime top-k
personalized pagerank over large graphs on gpus,” Proceedings of the
VLDB Endowment, vol. 13, no. 1, pp. 15–28, 2019.
[29] M. D. Sinclair, J. Alsop, and S. V. Adve, “Efficient gpu
synchronization without scopes: Saying no to complex consistency
models,” in Proceedings of the 48th International Symposium on
Microarchitecture, 2015, pp. 647–659.
[30] A. Singh, S. Aga, and S. Narayanasamy, “Efficiently enforcing strong
memory ordering in gpus,” in Proceedings of the 48th International
Symposium on Microarchitecture. ACM, 2015, pp. 699–712.
[31] I. Singh, A. Shriraman, W. W. Fung, M. O’Connor, and T. M. Aamodt,
“Cache coherence for gpu architectures,” in 2013 IEEE 19th
International Symposium on High Performance Computer Architecture
(HPCA). IEEE, 2013, pp. 578–590.
[32] Y. Sun, T. Baruah, S. A. Mojumder, S. Dong, X. Gong, S. Treadway,
Y. Bao, S. Hance, C. McCardwell, V. Zhao, H. Barclay, A. K. Ziabari,
Z. Chen, R. Ubal, J. L. Abellán, J. Kim, A. Joshi, and D. Kaeli,
“Mgpusim: Enabling multi-gpu performance modeling and
optimization,” in Proceedings of the 46th International Symposium on
Computer Architecture, ser. ISCA ’19. New York, NY, USA: ACM,
2019, pp. 197–209.
[33] Y. Sun, T. Baruah, S. A. Mojumder, S. Dong, R. Ubal, X. Gong,
S. Treadway, Y. Bao, V. Zhao, J. L. Abellán et al., “Mgsim+ mgmark:
A framework for multi-gpu system research,” arXiv preprint
arXiv:1811.02884, 2018.
[34] Y. Sun, X. Gong, A. K. Ziabari, L. Yu, X. Li, S. Mukherjee,
C. McCardwell, A. Villegas, and D. Kaeli, “Hetero-mark, a
benchmark suite for cpu-gpu collaborative computing,” in 2016 IEEE
International Symposium on Workload Characterization (IISWC).
IEEE, 2016, pp. 1–10.
[35] A. Tabbakh, X. Qian, and M. Annavaram, “G-tsc: Timestamp based
coherence for gpus,” in 2018 IEEE International Symposium on High
Performance Computer Architecture (HPCA). IEEE, 2018, pp.
403–415.
[36] P. Wang, L. Zhang, C. Li, and M. Guo, “Excavating the potential of
gpu for accelerating graph traversal,” in 2019 IEEE International
Parallel and Distributed Processing Symposium (IPDPS). IEEE,
2019, pp. 221–230.
[37] Q. Xu, H. Jeon, and M. Annavaram, “Graph processing on gpus:
Where are the bottlenecks?” in 2014 IEEE International Symposium
on Workload Characterization (IISWC). IEEE, 2014, pp. 140–149.
[38] I. Yamazaki, T. Dong, R. Solcà, S. Tomov, J. Dongarra, and
T. Schulthess, “Tridiagonalization of a dense symmetric matrix on
multiple gpus and its application to symmetric eigenvalue problems,”
Concurrency and computation: Practice and Experience, vol. 26,
no. 16, pp. 2652–2666, 2014.
[39] V. Young, A. Jaleel, E. Bolotin, E. Ebrahimi, D. Nellans, and O. Villa,
“Combining hw/sw mechanisms to improve numa performance of
multi-gpu systems,” in 2018 51st Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO). IEEE, 2018, pp.
339–351.
[40] C.-L. Zhang, Y.-P. Xu, Z.-J. Xu, J. He, J. Wang, and J.-H. Adu, “A
fuzzy neural network based dynamic data allocation model on
heterogeneous multi-gpus for large-scale computations,” International
Journal of Automation and Computing, vol. 15, no. 2, pp. 181–193,
2018.
[41] Y.-L. Zhu, D. Pan, Z.-W. Li, H. Liu, H.-J. Qian, Y. Zhao, Z.-Y. Lu, and
Z.-Y. Sun, “Employing multi-gpu power for molecular dynamics
simulation: an extension of galamost,” Molecular Physics, vol. 116,
no. 7-8, pp. 1065–1077, 2018.
[42] A. K. Ziabari, Y. Sun, Y. Ma, D. Schaa, J. L. Abellán, R. Ubal, J. Kim,
A. Joshi, and D. Kaeli, “Umh: A hardware-based unified memory
hierarchy for systems with multiple discrete gpus,” ACM Trans. Archit.
Code Optim., vol. 13, no. 4, Dec. 2016.
