Cooperative Caching for GPUs by Dublish, Saumay et al.
  
 
 
 
Cooperative Caching for GPUs
Citation for published version:
Dublish, S, Nagarajan, V & Topham, N 2016, 'Cooperative Caching for GPUs' ACM Transactions on
Architecture and Code Optimization, vol. 13, no. 4, 39, pp. 1-25. DOI: 10.1145/3001589
Digital Object Identifier (DOI):
10.1145/3001589
Link:
Link to publication record in Edinburgh Research Explorer
Document Version:
Peer reviewed version
Published In:
ACM Transactions on Architecture and Code Optimization
Cooperative Caching for GPUs
Saumay Dublish, University of Edinburgh
Vijay Nagarajan, University of Edinburgh
Nigel Topham, University of Edinburgh
The rise of general-purpose computing on GPUs has influenced architectural innovation on GPUs. The
introduction of an on-chip cache hierarchy is one such innovation. High L1 miss rates on GPUs, however,
indicate inefficient cache usage due to myriad factors such as cache thrashing and extensive multithreading.
Such high L1 miss rates in turn place high demands on the shared L2 bandwidth. Extensive congestion in
the L2 access path, therefore, results in high memory access latencies. In memory-intensive applications,
these latencies get exposed due to a lack of active compute threads to mask such high latencies.
In this paper, we aim to reduce the pressure on the shared L2 bandwidth, thereby reducing the memory
access latencies that lie in the critical path. We identify significant replication of data among private L1
caches, presenting an opportunity to reuse data among the L1s. We further show how this reuse can be
exploited via an L1 Cooperative Caching Network (CCN), thereby reducing the bandwidth demand on the
L2. In the proposed architecture, we connect the L1 caches with a lightweight ring network to facilitate
inter-core communication of shared data. We show that this technique reduces traffic to the L2 cache by an
average of 29%, freeing up the bandwidth for other accesses. We also show that CCN reduces the average
memory latency by 24%, thereby reducing core stall cycles by 26% on average. This translates into an overall
performance improvement of 14.7% on average (and up to 49%) for applications that exhibit reuse across
L1 caches. In doing so, CCN incurs a nominal area and energy overhead of 1.3% and 2.5% respectively.
Notably, the performance improvement with our proposed CCN compares favourably to the performance
improvement achieved by simply doubling the number of L2 banks by up to 34%.
CCS Concepts: •Computer Systems Organization→ Single Instruction, Multiple Data;
Additional Key Words and Phrases: GPGPU, Bandwidth Bottlenecks, Inter-Core Reuse
ACM Reference Format:
Saumay Dublish, Vijay Nagarajan and Nigel Topham, 2016. Cooperative Caching for GPUs. ACM Trans.
Architec. Code Optim. 13, 4, Article 39 (December 2016), 25 pages.
DOI: 10.1145/3001589
1. INTRODUCTION
Current GPUs are no longer perceived as accelerators solely for graphics workloads,
and now cater to a much broader spectrum of applications. In a short time, GPUs have
proven to be of substantive significance in the world of general-purpose computing. The
massive compute power of GPUs and recent innovations in their architecture [Nvidia
2009][Nvidia 2012] have helped unleash the latent potential of several non-graphical
applications, adding momentum to the rise of general-purpose computing on GPUs
(GPGPU).
Motivated by the pervasive impact of GPUs in the field of general-purpose comput-
ing, manufacturers have introduced configurable on-chip cache hierarchies to their
recent architectures to cater for the locality needs of non-streaming applications. De-
spite performance improvement for certain applications, however, the utilization of
these caches is far from perfect; this is evident from the high cache miss rates seen
on many GPUs. As shown in Figure 1(a), on NVIDIA’s Fermi GPU, general-purpose
applications across a variety of benchmark suites show high L1 miss rates, indicat-
ing that the current cache management techniques are unable to utilize these caches
effectively. As a consequence of high L1 miss rates, pressure on the L2 bandwidth in-
creases thereby increasing the memory access latencies due to congestion in the L2
access path. In our experiments (discussed later in Section 5), we observe that due
to congestion in the L1-L2 interconnect and L2 access queues, L2 accesses take up to
ACM Transactions on Architecture and Code Optimization, Vol. 13, No. 4, Article 39, Publication date: December 2016.
[Figure 1 plot: bar chart of L1-Miss and L1-Replication (%), y-axis from 10 to 100, for the benchmarks b+tree, cfd, hotspot, lud, sradv1, sc, cutcp, tpacf, km, pvr, ss and wc.]
Fig. 1: (a) L1-Miss: L1 cache miss rates (b) L1-Replication: Percentage of L1 misses
cached in remote L1 caches.
2-3× more cycles compared to the normal access latency of L2. In memory intensive
applications, due to lack of active compute threads to overlap such high memory access
latencies, increased latencies to the lower level get exposed and appear in the critical
path [Dublish et al. 2016], reducing system performance.
Goal: In this paper, our goal is to reduce the memory access latencies that cannot
be hidden by multithreading in memory-intensive applications. Since one of the major
reasons for such high latencies is the congestion in the L2 access path (due to the high
number of requests sent to the lower level), we aim to reduce this congestion.
Observation: In streaming applications, cores work on independent data with little
or no overlap in the working dataset. However, in general-purpose applications we
observe a considerable potential for data reuse across different cores. Figure 1(b) shows
that a significant percentage of miss requests generated by L1s is for data already
present on a non-local (or remote) L1 cache. If we can exploit this reuse within the L1s,
duplicate requests to the shared L2 can be potentially eliminated. This would result
in reduced congestion and faster lower level access for the remaining requests.
Proposal: In this paper, we propose a Cooperative Caching Network (CCN) for L1
caches in GPUs to improve the efficiency of the L1 cache hierarchy in filtering the
requests to the L2 cache. In our proposed scheme, we connect the private L1 caches in
a lightweight ring network to facilitate communication of reusable data among the L1
caches. In doing so, we reduce the average memory access latency for the following
two reasons. (1) A fraction of L1 load misses, those with reusable data cached on remote
L1s, can now completely bypass the high-latency access path to L2. They are instead
serviced by the CCN with significantly lower latencies (42 cycles on average in
our experiments) than the L2 roundtrip access latencies, or simply L2 access
latencies (which are ∼300 cycles due to congestion). (2) Cooperatively sharing reusable
data within the L1 caches via the CCN reduces the traffic to the L2 cache. This relieves
the pressure on the interconnect as well as on the L2 access queues, thereby reducing
the L2 access latencies (by 78 cycles on average). Thus, the CCN provides faster access
to L2 for miss requests that do not find a sharer in the CCN.
In effect, our proposed architecture services a portion of L1 misses collaboratively
within the L1 caches with much lower latencies than the L2 access latency. This
leads to less congestion in the L2 access path, thereby accelerating memory response
for requests that do not find a reusable copy in remote L1 caches. However, in the
absence of reuse (such as in streaming applications), unsuccessful probes in the
CCN add overhead to the L1 load misses. In such cases, because congestion is not
reduced, the CCN overhead is not compensated and results in
an overall performance penalty. Therefore, in our final scheme we propose CCN-RT, a
Cooperative Caching Network with Request Throttling. It dynamically adapts to the
coarse grained reuse patterns shown across an entire application, thereby bypassing
the CCN when there is little or no reuse.
In summary, we make the following contributions:
— We provide fresh insight into the inter-core reuse patterns within GPUs by profiling
the communication characteristics over a diverse range of GPGPU applications.
— We propose the CCN, a cooperative caching architecture for GPUs that is cognizant
of the inter-core reuse.
— By servicing reusable requests via the CCN, we reduce the overall bandwidth de-
mand on L2 cache, boosting performance for memory-intensive applications that
show high levels of sharing across L1s.
— With our final proposal CCN-RT, we show an average performance gain of 14.7% for
applications that exhibit reuse, while being benign to applications with no reuse.
— We also reduce the average memory latency by 24%, L1 to L2 traffic by 29% and core
stall cycles by 26%. Our proposal incurs nominal area and energy overheads of 1.3%
and 2.5% respectively.
The remainder of the paper is organized as follows. Section 2 provides an overview
of the baseline architecture for our study and characterizes the workloads. Section 3
investigates the reuse patterns for general purpose applications and assesses the effi-
cacy of cooperative caching in GPUs. Section 4 presents CCN, a Cooperative Caching
Network for L1 caches in GPUs. Section 5 evaluates the architecture and proposed op-
timizations to our baseline proposal. Section 6 compares CCN with alternate solutions.
Section 7 discusses the related work and positions our findings in the current state of
the art, and Section 8 concludes the paper by summarizing the findings and contributions
of this work.
2. BACKGROUND
2.1. CUDA Programming Model
A typical CUDA program consists of data-parallel structures called kernels that are
executed on the GPU. The large number of threads in a kernel is organized into structured
blocks of computation known as thread blocks. Each thread block consists of
several smaller groups of threads called warps, the smallest granularity at which
threads are scheduled in a GPU core.
2.2. Baseline Architecture
As shown in Figure 2, a typical GPU consists of several execution units organized
in a set of highly multithreaded and pipelined cores that are referred to as Streaming
Multiprocessors (SMs; see footnote 1). In this study, we consider a baseline similar to NVIDIA’s Fermi
architecture. Our baseline GPU consists of 15 SMs, each with a 32-lane SIMD unit.
Each core consists of a private L1 data cache, shared memory (scratchpad) and read-
only texture and constant caches. Private caches of a core are backed by a shared L2
cache that has an access latency of around 120 cycles for non-texture accesses. The L1
data caches are non-coherent and employ write-through, no-write-allocate policy. The
baseline parameters are summarized later in Table II.
2.3. Benchmarks
For the purpose of this study, we use CUDA applications from three major general-
purpose benchmark suites, viz., Rodinia (v3.0) [Che et al. 2009], MapReduce [He et al.
2008] and Parboil [Stratton et al. 2012]. We categorize the benchmarks according to
[Footnote 1: In this paper, we use the terms Core and SM interchangeably.]
[Figure 2 diagram: SMs, each with a private L1D cache and shared memory, connected through an interconnect to L2 banks, each paired with a DRAM channel.]
Fig. 2: Baseline GPU architecture.
their sensitivity to the memory hierarchy. Table I lists the benchmarks sorted by the
speedup (PerfX) shown on a perfect memory system that has zero access latency to
lower level memories and infinite bandwidth between memory hierarchies on a Fermi-
like GPU.
A program is said to be memory-intensive if it comprises many threads that issue
long-latency memory operations. The performance of memory-intensive applications
is usually bound by the bandwidth to lower-level memories. This is because a large
number of memory requests are kept waiting in each memory partition due to limited
bandwidth, thereby delaying memory responses and potentially causing the cores to
stall. Therefore, the magnitude of the speedup on a perfect memory system essentially
indicates the severity of the bandwidth problem in a benchmark.
3. NEED FOR COOPERATION
Graphics and general-purpose workloads exhibit different memory access patterns. In
graphics applications, kernels operate on independent data of streaming nature and
therefore, different thread blocks are executed in considerable isolation. On the other
hand, general-purpose applications show varying amounts of reuse within the thread
blocks and also at the boundaries with neighbouring thread blocks. For instance, in
a scientific application such as the computation of Coulombic potential (cutcp), atoms are
organized in a 3D lattice. A sub-group of atoms constitutes a thread block and the entire
lattice is divided into multiple thread blocks. In order to compute the potential
difference on the atoms at the edges and corners of a sub-lattice (or thread block),
the Coulombic potential contributed by atoms from surrounding sub-lattices needs to be
read, which requires sharing and reuse of data among neighbouring thread blocks.
When such thread blocks are scheduled on different cores on a GPU, it results in inter-
core reuse. In current GPUs, reuse across thread blocks on different cores can only be
exploited by localizing the data on the L2 cache and not any closer. But in doing so,
cores have to incur the congestion delays in L1-L2 interconnect, as well as the delays in
the L2 access queues. Thus, for those applications that are bounded by the bandwidth
to the lower level, it degrades the overall performance by clogging the access path to
the L2.
Table I: Benchmark characterization: (a) PerfX: speedup with perfect memory; (b) µRC:
percentage of total L1 load misses that have reusable data on a remote L1.

No. Suite      Benchmark                 Abbr.      Dataset                             PerfX  µRC
1   MapReduce  Matrix Multiplication     mm         768 × 768 data points               9.86   4%
2   MapReduce  Similarity Score          ss         1024 × 256 data points              6.18   28%
3   Rodinia    Computational Fluid       cfd        200000 elements                     6.17   51%
4   MapReduce  Page View Rank            pvr        21 MB                               5.93   32%
5   Rodinia    Stream Cluster            sc         16384 points; 256 dimension         5.49   18%
6   Rodinia    Breadth-First Search      bfs’       1000000 nodes                       5.18   3%
7   Rodinia    Wavelet Transform         dwt2d      1024 × 1024                         4.96   7%
8   Parboil    Lattice-Boltzmann Method  lbm        120 × 120 × 150 data points         4.49   0%
9   MapReduce  K-Means                   km         10000 × 3 data points; 24 clusters  3.85   24%
10  Rodinia    Hybrid Sort               sort       4194304 floating points             3.68   1%
11  Parboil    Breadth-First Search      bfs        8500000 nodes                       3.57   6%
12  Rodinia    Particle Potential        lavaMD     7 × 7 × 7 boxes                     2.81   1%
13  Parboil    2-D Histogram             histo      10000 × 4 dimension                 2.63   1%
14  MapReduce  String Match              sm         4 MB                                2.52   3%
15  Rodinia    Cardiac Myocyte           myocyte    100 instances                       2.38   1%
16  Rodinia    Needleman-Wunsch          nw         2048 × 2048 data points             2.31   8%
17  Rodinia    Graph Traversal           b+tree     10000 nodes                         2.21   25%
18  MapReduce  Inverted Index            ii         28 MB                               2.19   2%
19  Rodinia    Particle Filter           pfloat     128 × 128 × 10                      2.15   8%
20  Rodinia    Tracking Microscopy       leukocyte  176 MB                              1.88   1%
21  MapReduce  Word Count                wc         96 KB                               1.86   54%
22  Parboil    Sum of Absolute Diff.     sad        52 KB vs. 52 KB frame               1.76   3%
23  Rodinia    Speckle Reduction         sradv1     512 × 512 data points               1.74   15%
24  Rodinia    Speckle Reduction         sradv2     2048 × 2048 data points             1.70   16%
25  Parboil    Cartesian Gridding        mri-g      61 MB                               1.49   2%
26  Rodinia    K-Means                   kmeans     204800 data points; 34 features     1.47   0%
27  Rodinia    Matrix Decomposition      lud        2048 × 2048 data points             1.27   28%
28  Parboil    PDE Solver                stencil    512 × 512 × 64 input                1.23   6%
29  Rodinia    Heart Wall Tracking       heartwall  49 MB                               1.19   0%
30  Rodinia    Back Propagation          backprop   65536 input nodes                   1.10   3%
31  Rodinia    Thermal Modeling          hotspot    512 × 512 data points               1.07   29%
32  Parboil    Coulombic Potential       cutcp      96604 atoms                         1.00   78%
33  Parboil    MRI Reconstruction        mri-q      64 × 64 × 64 data points            1.00   0%
34  Parboil    Angular Correlation       tpacf      10391 data points                   1.00   19%
3.1. Inter-core reuse
In order to quantify the degree of temporal and spatial reuse of global data between
thread blocks, we analyse the L1 miss traffic of each core. In Table I, we show the Reuse
Coefficient (µRC), which is the percentage of miss requests received by the L2 cache
from private L1 caches for addresses that reside remotely on at least one L1 cache. We
see a maximum µRC of up to 78% with an average of 14% across all benchmarks. High
µRC for some benchmarks indicates that reuse requests from L1 caches form a large
portion of traffic to L2. It is worth noting that we only consider it as reuse if the load
miss address is cached on a remote L1 at the time of the miss.
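As a rough illustration of the µRC definition above, the following Python sketch counts the L1 load misses whose line is resident in at least one remote L1. It is a minimal sketch under stated assumptions, not the profiling infrastructure used in this study: the function name, the miss-trace format and the static snapshot of cache contents are all hypothetical, whereas the measurement described above checks residency at the time of each miss.

```python
from typing import Dict, Iterable, Set, Tuple


def reuse_coefficient(
    misses: Iterable[Tuple[int, int]],
    l1_contents: Dict[int, Set[int]],
) -> float:
    """Return the percentage of L1 load misses that hit in some remote L1.

    misses:      (core_id, line_address) pairs, one per L1 load miss.
    l1_contents: core_id -> set of line addresses resident in that core's L1.

    Assumption: l1_contents is a fixed snapshot; a real profiler would track
    the contents of every L1 as they change over time.
    """
    total = 0
    reused = 0
    for core, addr in misses:
        total += 1
        # A miss counts as reuse only if a core other than the requester holds the line.
        if any(addr in lines for other, lines in l1_contents.items() if other != core):
            reused += 1
    return 100.0 * reused / total if total else 0.0
```

For example, with contents {0: {0x100, 0x140}, 1: {0x200}, 2: {0x100}} and misses [(1, 0x100), (1, 0x300), (0, 0x200), (2, 0x140)], three of the four misses find a remote copy, so the function returns 75.0.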
In Figure 3 we further characterize the inter-core reuse patterns at the granularity
of each core with every other core, providing deeper insight into the reuse dynamics.
For brevity, we show the set of distinct observed patterns and omit those that replicate
the patterns shown here. The x-axis indicates the cores that incur an L1 load miss and
the y-axis indicates the sharers for that miss. A dense area in the heat map at an (x, y)
coordinate indicates that a high proportion of load miss requests by core-x are cached
by the L1 at core-y. For instance, cutcp shows a prominent reuse of data cached at a
distance of 4 cores from the location of the miss; dwt2d shows a strong reuse between
neighbours; km shows a gradual decline in reuse as we go further from the core; and
tpacf shows considerable levels of reuse across all cores.
[Figure 3 plots: heatmaps for (a) cutcp, (b) dwt2d, (c) km, (d) tpacf, (e) pvr and (f) pfloat, each with requesting cores 0-14 on the x-axis and sharers 0-14 on the y-axis; (g) Reuse Score colour scale from LOW to HIGH.]
Fig. 3: Heatmaps indicating inter-core reuse by cores on the x-axis for data cached on
the cores on the y-axis. Dark spots in the heatmaps indicate high reuse between the
corresponding cores at their x and y coordinates.
3.2. Efficacy of cooperation
We have shown in the previous section that for general-purpose applications, there is
considerable reuse across L1 caches. We refer to load requests that miss in the local L1
but hit in a remote L1 as reuse requests. By removing such reuse requests (also
quantified as µRC) from the pool of total misses going to the L2 cache, we can reduce
the pressure on L2 bandwidth. In order to assess the efficacy of reducing the band-
width demand on the overall performance, we begin by examining the performance
improvement when the reuse requests do not congest the access path to L2. In these
cases, reuse requests are instead serviced cooperatively within the L1s with varying
remote L1 access latencies, or reuse latencies. Since applications with low µRC are
not expected to show any change, we focus on benchmarks with high µRC. Later, we
demonstrate the effect of our final proposal on applications with low or zero µRC as
well.
Figure 4 shows the speedup due to cooperation, and demonstrates a noticeable im-
provement in performance, specifically for memory-intensive applications with high
µRC. For instance, cfd and pvr show performance improvements of up to 73% and 38%
respectively. Both of these applications are severely bounded by the memory band-
width and at the same time exhibit high reuse. On the other hand, despite high reuse
in cutcp and hotspot, there is no significant gain in IPC since bandwidth is not critical
for these benchmarks.
[Figure 4 plot: IPC improvement (%) on the y-axis (-10 to 80) against remote L1 access latency on the x-axis (0 to 600 cycles), for b+tree, cfd, hotspot, lud, sc, cutcp, pvr, ss and AVG; annotated regions: stable reuse latency range, L2 access latency range, exposed latency range.]
Fig. 4: Speedup of cooperation with varying remote L1 access latencies.
Another key observation in this study pertains to the variation of performance as a
function of remote L1 access latency. We observe that the performance improvement in
the region between 0-80 cycles is fairly stable, with the average IPC gain only chang-
ing from 21.5% to 18.8%. This is because in this region, latencies to remote L1s can be
effectively hidden by the multithreading on the cores. Moreover, due to reduced con-
gestion in the L2 access path and due to faster response to reuse requests (compared
to L2 accesses), the average number of active compute threads on a core increases.
This boosts the ability of the cores to further mask the memory access latencies. Due
to these effects, reuse latencies up to 80 cycles are effectively hidden by multithread-
ing and do not determine the execution time. However, on further increasing the reuse
latencies, performance improvement starts to degrade more rapidly. In fact, the IPC
gain returns to nearly 0% when the reuse latencies are varied in the range of L2 ac-
cess latencies (around 300 cycles). This is because latencies for reuse requests get in-
creasingly exposed and can no longer be hidden by multithreading, despite reduced
congestion.
In summary, these initial results indicate that for memory-bound applications, when
there is considerable reuse of data across L1 caches, cooperation among the private L1
caches can result in a considerable speedup (up to 21.5% on average). Notably, the
observed performance improvement is fairly stable in the reuse latency range of 0-80
cycles.
4. COOPERATIVE CACHING
In the previous sections, we observed a potential for cooperative caching on GPUs
and assessed its efficacy. We now propose a cooperative caching framework to use the
private L1 data caches in an aggregate manner. We begin by formalizing the above
discussion and analysing the parameters that contribute to the L2 access latencies
for L1 miss requests. Later, we propose a cooperative caching scheme and discuss the
architectural details.
4.1. Analytical Model
Here we present a simple analytical model to explain the conditions under which reuse
delivers a performance gain. Firstly, in the absence of cooperation between L1s, let lO
be the average memory latency to access the shared L2 cache. Secondly, with coop-
eration between L1s, let hreuse be the fraction of L1 misses that hit in a remote L1
cache. Furthermore, let lreuse be the average hit latency for accesses to remote L1s.
As a consequence of reduced congestion in the L2 access path due to remote L1 hits,
let δcong be the reduction in L2 access latency. And finally, let δoverhead be the cooperation
overhead borne by those requests that do not have a shared copy. Therefore, the
new average memory latency to L2 upon enabling cooperation, lC , can be obtained via
Equation 1.
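\[ l_C = (l_O - \delta_{cong} + \delta_{overhead})\,(1 - h_{reuse}) + l_{reuse}\, h_{reuse} \tag{1} \]
\[ \left.\begin{aligned} l_{reuse} &< l_O \\ \delta_{overhead} &< \delta_{cong} \end{aligned}\right\} \quad \text{Criteria for useful cooperation} \tag{2} \]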
In order to derive gain from cooperative caching, lC must be minimized. Therefore,
remote L1 accesses for reuse requests must take less time than a normal L2 access, i.e.,
lreuse < lO. Additionally, we have already seen in Figure 4 that the maximum gain from
cooperation is sustained in the lower reuse latency range, i.e., lreuse ∈ (0, 80). Finally,
for the remaining L2 accesses, the cooperation overhead must be less than the benefit
obtained from reducing the congestion in the L2 access path, i.e., δoverhead < δcong. A
combination of above conditions will result in a lower average L2 access latency, i.e.,
lC < lO.
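To make the model concrete, the following Python sketch evaluates Equation 1 and the criteria of Equation 2. The function and parameter names are hypothetical. The values lO ≈ 300 cycles, lreuse = 42 cycles and δcong = 78 cycles are the averages quoted in Section 1; hreuse = 0.25 and δoverhead = 30 cycles in the usage note are purely illustrative assumptions, not measured results.

```python
def cooperative_latency(l_o: float, delta_cong: float, delta_overhead: float,
                        h_reuse: float, l_reuse: float) -> float:
    """Equation 1: average memory latency l_C with cooperation enabled.

    Misses without a sharer pay the (decongested) L2 latency plus the CCN
    overhead; misses with a sharer pay only the remote L1 latency.
    """
    return (l_o - delta_cong + delta_overhead) * (1.0 - h_reuse) + l_reuse * h_reuse


def cooperation_is_useful(l_o: float, delta_cong: float,
                          delta_overhead: float, l_reuse: float) -> bool:
    """Equation 2: both criteria for useful cooperation must hold."""
    return l_reuse < l_o and delta_overhead < delta_cong
```

Under these assumptions, cooperative_latency(300, 78, 30, 0.25, 42) gives 199.5 cycles, which is below lO = 300, and cooperation_is_useful(300, 78, 30, 42) is True. With no reuse and no congestion relief (hreuse = 0, δcong = 0), the same overhead yields 330 cycles, which illustrates why streaming applications are penalized.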
How should we go about implementing the cooperative caching framework? Follow-
ing the approach of traditional multicores, a central directory in the L2 cache [Lebeck
and Wood 1995][Acacio et al. 2002][Kaxiras and Keramidas 2010] can be used to store
information about the sharers. However, maintaining a directory as part of the L2 will
not mitigate the existing bandwidth problem in accessing the L2, and instead, will only
worsen it. This is because the additional control and update traffic to the central di-
rectory will further increase the bandwidth demand to the L2 cache. Alternatively, an
approach along the lines of cooperative caching schemes for CPUs [Chang and Sohi
2006][Chang and Sohi 2007][Herrero et al. 2008] may be used. Such schemes aim to
minimize hop latencies to find a sharer and retrieve data using a highly interconnected
network of L1 caches. However, since we have demonstrated that we have considerable
leeway of around 80 cycles to fetch the shared data from a remote L1, such an
aggressive scheme to find a sharer is overkill for GPUs.
Therefore, we propose a lightweight ring-based Cooperative Caching Network. A
ring topology is the lowest-degree network and requires the fewest inter-core
connections. It is also lowest in terms of logical complexity and power consumption,
as all core-to-core connections are near-neighbour and the wires are therefore
short. In addition, all routers in a ring are simple multiplexers, which are more energy
efficient than complex crossbar routers. As we have shown that GPUs can gracefully tolerate
reuse latencies of up to 80 cycles, a ring topology appears to be a cost-effective
solution, as it allows us to trade off higher latencies for simplicity and short wires, i.e.,
lower power consumption and die-area cost.
4.2. Architecture
In our proposed scheme, we facilitate the communication between neighbours by con-
necting the private L1 caches in a ring via our Cooperative Caching Network (CCN).
The CCN comprises two different channels, viz., a request channel and a response channel.
The request channel comprises a network of Request Queues (ReqQ), while the
response channel comprises a network of Response Queues (RespQ). As shown in
Figure 5, each L1 has an independent pair of the aforementioned queues to allow the
cache to participate in cooperative caching. The L1 caches interact with their home
queues via CCN Buffers (CB), which hold the tag and Core-ID for the load misses,
until the CCN is ready to accept a request. A new miss request, upon entering the
local Request Queue, travels around the request channel by hopping on other Request
Queues and probing the different L1 caches on its way. If a remote copy is found on
[Figure 5 diagram: Core-0, Core-1 and Core-2, each with an L1, Shadow Tags (ST), RT, a CCN Buffer (CB0-CB2) and a ReqQ/RespQ pair; the Request Channel and Response Channel link neighbouring cores in a ring (towards and from Core 3 and Core 14); the cores also connect through the interconnect to the L2 banks.]
Fig. 5: Cooperative Caching Network.
one of the nodes, the response from the hit node is sent back to the requesting core
in a similar way by hopping in the reverse direction via the Response Queues at each
core. Note that a remote L1 copy is considered for sharing only if it is not pending on
a cache-fill for the requested data at the time of lookup. In other words, pending-hits
due to outstanding miss requests are not considered for sharing in CCN.
Specifically, upon incurring a load miss for global data, instead of sending the miss
directly to the L2, each core pushes the miss tag information into its CB along with
the Core-ID, where the request waits until the corresponding ReqQ is ready to accept
a new request. Every cycle, the valid entry at the head of each ReqQ looks up the
corresponding L1 cache (unless that is the home core of the request) before hopping on to
the next ReqQ. If the request travels back to the requesting core without a reuse copy,
it is finally sent to L2. However, if a sharer is found, the sharing core enqueues the
response to its RespQ. The response travels back to the requesting core, thereby avoid-
ing an L2 access. If the request queues get full due to congestion, CB eventually stops
accepting new miss requests. In such a scenario, the L1 load misses go directly to L2
until the CCN can start accepting new requests again.
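The request flow described above can be sketched functionally in Python as follows. This is a simplified model under stated assumptions, not the hardware design: it ignores queueing, timing and the CB, the function name and data layout are hypothetical, and the ring direction (increasing core index) is chosen arbitrarily for illustration.

```python
from typing import List, Optional, Set, Tuple


def probe_ring(home: int, addr: int,
               l1_tags: List[Set[int]]) -> Tuple[Optional[int], int]:
    """Walk the request channel from `home`, probing each remote L1 in turn.

    Returns (sharer, hops): the first remote core whose L1 holds `addr`,
    or None if the request returns home without a hit and must go to L2,
    together with the number of request-channel hops taken.
    """
    n = len(l1_tags)
    for hops in range(1, n):
        core = (home + hops) % n      # next ReqQ on the ring (assumed direction)
        if addr in l1_tags[core]:     # remote lookup succeeds: this core is a sharer
            return core, hops
    return None, n                    # full round trip without a sharer
```

On a 15-core ring where only core 4 caches line 0x80, probe_ring(0, 0x80, tags) returns (4, 4), while a request from core 4 itself returns (None, 15), because the home core is never probed and the miss is then sent to L2.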
4.2.1. Prioritization policy for queues. Each queue in the CCN has a corresponding input
multiplexer to select one of the entries out of the two possible input sources. In the
request channel, a ReqQ can either accept a new miss request from the home core via
CB, or a forwarded request from a preceding ReqQ. In our proposal, we prioritize an
older request (from ReqQ) over a new one (from CB). This helps prevent
oversubscription of the CCN to new L1 misses by allowing the previously accepted requests
to pass through, and therefore minimizes the roundtrip overhead (δoverhead) in the CCN for
subscribed requests. Repeated unsuccessful attempts to inject a new request into the
CCN due to this prioritization cause the CB to fill up, which deflects
the L1 misses directly to L2 and allows the CCN to recover from congestion.
In response queues, however, we prioritize a new cache response (from Core) over
an older response (from RespQ). This is because response queue latencies do not con-
tribute to δoverhead but contribute to the reuse latencies lreuse, which has comparatively
more relaxed requirements (shown in Figure 4). More importantly, if the response of a
new remote hit is not accepted by the response queue, the tag entry at the head of the
corresponding ReqQ that caused the hit is not popped, potentially stalling the entire
request network and increasing the δoverhead in the request channel.
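The two prioritization rules can be summarized with the following Python sketch of the input multiplexers. It is an illustrative model only: the function names are hypothetical, each input source is modelled as a deque, and backpressure (a full destination queue) is not modelled.

```python
from collections import deque
from typing import Deque, Optional


def select_request(forwarded: Deque[str], cb: Deque[str]) -> Optional[str]:
    """ReqQ input mux: an older forwarded request wins over a new one from the CB."""
    if forwarded:
        return forwarded.popleft()
    if cb:
        return cb.popleft()
    return None


def select_response(core_resp: Deque[str], forwarded: Deque[str]) -> Optional[str]:
    """RespQ input mux: a new response from the local core wins over a forwarded one."""
    if core_resp:
        return core_resp.popleft()
    if forwarded:
        return forwarded.popleft()
    return None
```

In this model, a new miss waiting in the CB is selected only in cycles when no forwarded request is present, which mirrors how sustained ring traffic eventually fills the CB and deflects misses to L2.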
4.2.2. CCN Memory Consistency. The CCN mechanism conforms to the existing mem-
ory consistency model supported by Fermi. CUDA provides two types of load instruc-
tions [Nvidia 2016] – a normal load cached at L1 (ld.ca) and a direct load to L2 (ld.cg),
bypassing the L1. Due to the write-through, no-write-allocate policy of the L1 cache,
a write causes the matching cache line in L1 to be invalidated thereby causing the
most recent value to reside in L2. However, due to a weak memory model [Alglave
et al. 2015][Nvidia 2014] and absence of coherence in GPUs, an ld.ca accessing L1 on
a different core can return a stale value. Litmus tests in [Alglave et al. 2015] have
also shown that due to weak consistency, an ld.ca load may return a stale value on
the same core as well, even if preceded by an ld.cg to the same address (CoRR). CCN
adopts similar weak memory ordering semantics for ld.ca loads; indeed, an L1 miss
can return a stale value by snooping other cores via CCN, instead of reading the L2
which may have the latest value. However, since a baseline GPU guarantees reading
the most recent value for ld.cg loads, CCN does not intercept such loads and hence does
not further weaken the memory model. In other words, when a programmer uses ld.cg
loads to bypass the L1, the current memory model ensures the most recently written
value is returned – a correctness guarantee also provided by CCN.
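The routing rule described above can be summarized in a few lines. This is only a sketch of where a load is first serviced: the function name and the ccn_enabled flag are hypothetical, and an ld.ca miss sent to the CCN may still end up at L2 after an unsuccessful traversal.

```python
def first_service_point(load_type, l1_hit, ccn_enabled=True):
    """Return where a load is first looked up: 'L1', 'CCN' or 'L2'."""
    if load_type == "ld.cg":
        # Direct load to L2: bypasses the L1 and is never intercepted by the CCN,
        # so it keeps reading the most recent value held in L2.
        return "L2"
    if load_type != "ld.ca":
        raise ValueError("unknown load type: " + load_type)
    if l1_hit:
        return "L1"
    # An ld.ca miss may be serviced by a remote L1 (possibly with a stale value).
    return "CCN" if ccn_enabled else "L2"
```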
4.3. Shadow Tags
Since each L1 now services additional tag lookups for CCN requests, such remote
lookups could affect the performance of local cache accesses. To eliminate the inter-
ference of remote lookups on local requests, we duplicate the tags of the L1 data cache
in a separate set of Shadow Tags (ST) adjacent to each L1. The shadow tags always
contain an identical copy of the L1 tags, which is achieved by always writing tag up-
dates to both sets of tags simultaneously. As a result, concurrent reads at independent
addresses can then take place to L1 tags and shadow tags, from the local core and
remote lookups respectively. Therefore, the shadow tags dissociate the performance
of each local cache from interference of CCN traffic. However, if a shadow tag lookup
succeeds, then the remote access makes a regular L1 access to retrieve the data it
needs. This steals a cycle from the L1 data cache, which is taken into account in our
performance model.
Overhead: For the largest L1 data cache configuration of 48 KB with a 128-byte
line size, we require 24 upper address bits per tag, assuming 40-bit physical
addresses [Nvidia 2009], plus one valid bit. As the L1 data cache is 4-way
set-associative, the shadow tags are arranged as 96 sets of four 25-bit tags in a
96×100-bit single-ported tag memory.
Way 0            Way 1            Way 2            Way 3
V0 Tag0[39:16]   V1 Tag1[39:16]   V2 Tag2[39:16]   V3 Tag3[39:16]
Therefore, the net storage overhead of the shadow tags is 1200 bytes per SM and a
total of 17.5 KB for the 15-core GPU that we consider in our study. Although each
remote access has to be checked in multiple shadow tags, these shadow tag memories
are small and can be constructed from low-leakage, high-density bit-cells without
impacting the overall cycle time of the ring interconnect.
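The storage figures above follow from simple arithmetic, reproduced in the sketch below. The function and parameter names are illustrative; the numeric defaults are the values stated in the text.

```python
def shadow_tag_storage(cache_bytes=48 * 1024, line_bytes=128, ways=4,
                       tag_bits=24, valid_bits=1, cores=15):
    """Return (sets, bits per set, bytes per SM, total KB) for the shadow tags."""
    lines = cache_bytes // line_bytes                 # 384 cache lines
    sets = lines // ways                              # 96 sets
    bits_per_set = ways * (tag_bits + valid_bits)     # four 25-bit entries = 100 bits
    bytes_per_sm = sets * bits_per_set // 8           # 9600 bits = 1200 bytes
    total_kb = cores * bytes_per_sm / 1024            # about 17.6 KB for 15 cores
    return sets, bits_per_set, bytes_per_sm, total_kb
```

The computed total is 18000 bytes (about 17.58 KB), which matches the 17.5 KB quoted above up to rounding.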
4.4. Request Throttler
In order to prevent those cores that do not exhibit any inter-core reuse from congest-
ing the CCN, we introduce a Request Throttler (RT) at each core. The purpose of RT
is to divert load misses directly to the L2 cache, bypassing the CCN, when prior routing of
misses to the CCN proves to be below a threshold level of effectiveness. In order to do this,
each RT periodically samples the CCN performance parameters and at the end of the
sampling period, computes the success rate in routing its load misses to CCN during
the sampling interval. The success rate is determined by the ratio of hits in the CCN
to the total number of requests injected in the CCN by the corresponding L1 cache. If
the success rate is below the threshold, the L1 cache bypasses the CCN until the next
sampling interval, and performs the load miss by sending the request directly to L2
cache. However, the shadow tags of the throttled cores still participate in the lookup
for other requests in the CCN.
To illustrate the working of RT further, we define the sampling interval as tS and the
periodicity of sampling as tP where tS << tP . Therefore, the entire period of execution
is logically divided into multiple epochs of duration tP . We also define Hmin as the
minimum hit rate required in the CCN in order to derive utility out of cooperation.
At the beginning of an epoch of interval tP , each core begins by routing the load
misses to CCN for a fixed sampling duration of tS . During the tS interval, RT collects
the statistics about the number of requests injected in the CCN (Ntotal) and the num-
ber of hits observed for its requests (Nhits). At the end of the sampling duration, RT
computes the hit rate (hreuse) in the CCN, i.e., hreuse = Nhits/Ntotal. If hreuse > Hmin, RT
continues to inject requests in the CCN for the remaining duration of (tP - tS) in the
current epoch. On the other hand, if hreuse < Hmin, RT disables the routing of requests
to CCN for the remaining duration of the epoch. After the current epoch ends, Nhits
and Ntotal are reset and RT repeats the entire process for the new epoch. Therefore,
with the help of RT, we improve the average success rate of sending a load miss to
the CCN by preventing cores that are not working on potentially reusable data from
injecting requests during specific epochs of execution.
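The epoch-based behaviour of the RT can be sketched as a small state machine. This is an illustrative model under stated assumptions, not the simulator code: the class and method names are hypothetical, time is counted in instructions (as in our implementation), and the case hreuse = Hmin, which the description above leaves open, is treated here as throttling.

```python
class RequestThrottler:
    """Per-core throttler: sample for t_s, then decide for the rest of the epoch t_p."""

    def __init__(self, t_s=1_000_000, t_p=10_000_000, h_min=0.05):
        self.t_s, self.t_p, self.h_min = t_s, t_p, h_min
        self._start_epoch()

    def _start_epoch(self):
        self.elapsed = 0        # instructions executed in the current epoch
        self.n_total = 0        # requests injected into the CCN (Ntotal)
        self.n_hits = 0         # requests that hit in a remote L1 (Nhits)
        self.use_ccn = True     # every epoch begins by sampling the CCN
        self.decided = False

    def record(self, hit):
        """Account for one request injected into the CCN and its outcome."""
        self.n_total += 1
        self.n_hits += int(hit)

    def tick(self, instructions=1):
        """Advance time; decide at the end of t_s and reset at the end of t_p."""
        self.elapsed += instructions
        if not self.decided and self.elapsed >= self.t_s:
            h_reuse = self.n_hits / self.n_total if self.n_total else 0.0
            self.use_ccn = h_reuse > self.h_min   # assumption: equality throttles
            self.decided = True
        if self.elapsed >= self.t_p:
            self._start_epoch()
```

A caller would consult use_ccn on every L1 load miss: when it is False, the miss goes directly to L2, while the core's shadow tags continue to serve lookups from other cores.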
4.5. Working example
In this section, we further illustrate the working of CCN. Figure 6 shows the flow of
requests within the CCN. In this example, Core-0 incurs a load miss for global data
in its private L1 cache. In the baseline architecture, this L1 miss would be directly
routed to the L2 cache. However, with our scheme, the miss request can either go to
the CCN or to the L2 cache. RT takes this decision for that particular core on the basis
of the statistics collected over the most recent sampling interval, tS . In this example,
we assume that hreuse for Core-0 and Core-1 suggests healthy reuse (> Hmin) and
therefore, these cores continue to use CCN. However, Core-2 observes a low reuse in
the recent tS interval thereby routing all the requests directly to L2 cache for the
current epoch.
Thus, in order to service the miss at Core-0 via CCN, the tag and Core-ID of the
load request are pushed (step 1) onto the corresponding CCN Buffer, CB0. Based on the
input prioritization policy for ReqQ, the new tag waits in CB0 until it acquires the
priority and is accepted (step 2) by ReqQ-0. Upon reaching the head of ReqQ-0, the miss
request does not perform a lookup in the ST of Core-0, as it is the home core of the miss
request, and is therefore passed directly to ReqQ-1 (step 3). Upon reaching the head of
ReqQ-1, it performs a lookup (step 4) in the ST of Core-1. Assuming it is a hit in Core-1, the
ST receives the cache line from the corresponding L1 cache and enqueues the response
(step 5) in RespQ-1, given that RespQ-1 is not full. On the other hand, if RespQ-1
is full, the response is stalled, thereby preventing the tag at the head of ReqQ-1
from getting popped. Once the response reaches the head of the queue at RespQ-1 and
acquires priority to enter the next queue, it is pushed into RespQ-0 (step 6). Since Core-0 is
the home core of the response, the new entry to RespQ-0 is bypassed to the head of the
queue and the response is serviced (step 7) to the L1 cache of Core-0, hence completing the
request-response cycle.

[Figure 6: block diagram of Core-0, Core-1 and Core-2, each with an L1, shadow tags (ST), a request throttler (RT) and a CCN buffer (CB0 to CB2), connected by ReqQ/RespQ pairs on the CCN ring (from/towards Core 3 and Core 14) and by the interconnect to the L2 banks. Markers 1 to 7 show the request and response flow; Core-2 is throttled because hreuse < Hmin.]
Fig. 6: Working of the Cooperative Caching Network with Request Throttling.
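The request flow of this example can be mimicked by a simple functional walk around the ring. The sketch below ignores queues, timing and the response path; the function name, the representation of each shadow tag array as a Python set, and the return format are assumptions made for illustration only.

```python
def ccn_lookup(home, tag, shadow_tags, throttled=frozenset()):
    """Return (serviced_by, hit_core, lookups) for a load miss at core `home`.

    shadow_tags[i] is the set of tags currently cached in the L1 of core i.
    Throttled cores do not inject requests, but their shadow tags are still searched.
    """
    n = len(shadow_tags)
    if home in throttled:
        return "L2", None, 0              # RT routes the miss directly to L2
    for k in range(1, n):                 # the home core's own ST is skipped
        core = (home + k) % n
        if tag in shadow_tags[core]:
            return "CCN", core, k         # remote hit after k shadow tag lookups
    return "L2", None, n - 1              # unsuccessful traversal falls back to L2
```

For instance, with three cores where only Core-1 holds the requested line, a miss at Core-0 is serviced by Core-1 after a single remote lookup, whereas a miss at a throttled Core-2 never enters the ring.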
5. EVALUATION
In this section, we discuss the implementation of our proposed architecture and
demonstrate the results.
5.1. Implementation
For the purpose of this study, we implement and evaluate two flavours of our proposed
architecture, i.e., CCN-B and CCN-RT. CCN-B is our baseline CCN architecture, which
includes a pair of queues and shadow tags at every node of the network. In CCN-RT,
we add the request throttling feature to the baseline CCN architecture.
Table II(b) summarizes the design parameters for CCN-B and CCN-RT.
In our implementation, we choose the sampling interval and the periodicity of sam-
pling as 1 million and 10 million instructions respectively. This is based on the observa-
tion that most benchmarks show a single-phase sharing across the entire application.
Hence, it allows us to sample for a short duration to get a fairly accurate hint for a
large duration that follows the sampling interval. Further, on the basis of our
sensitivity studies, we select the threshold hit rate (Hmin) as 5%, i.e., the minimum
fraction of CCN requests that must hit in order to derive benefit from cooperative
caching. We also observe in our experiments that small 8-entry Request and Response
Queues provide the best results.
Furthermore, the request and response channels in the CCN are configured to flow in
opposite directions. This is because our experiments show that in such a case, servicing
reuse requests takes an average of 10 hops, compared to a fixed 15 hops when both
channels propagate in the same direction.
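A simple hop-count model helps explain these numbers. It is a sketch under the assumption that a request hits k hops away from its home core on an N-core ring and that queuing is ignored; the function names are illustrative.

```python
def hops_same_direction(n_cores, k):
    """Request travels k hops; the response continues the remaining n_cores - k hops."""
    return k + (n_cores - k)


def hops_opposite_direction(n_cores, k):
    """Request travels k hops; the response returns over the same k hops."""
    return 2 * k
```

In this model the same-direction roundtrip is always N = 15 hops regardless of where the hit occurs, while the opposite-direction roundtrip is 2k hops. The measured average of 10 hops would then correspond to remote hits being found about 5 hops from the requesting core on average.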
5.2. Experimental setup
We model the Cooperative Caching Network on GPGPU-Sim (version 3.2.2) [Bakhoda
et al. 2009] to simulate a Fermi-like GPU with the configuration parameters listed in
Table II(a). For energy and area simulations, we use GPUWattch [Leng et al. 2013], a
McPAT based power model integrated in GPGPU-Sim. All CCN transactions have been
modelled at cycle-by-cycle accuracy in the simulator which includes queuing delays in
the request and response channels, CCN congestion and L1 cycle stealing by shadow
tag accesses. We run all the benchmarks to completion, or until they execute 16 billion
instructions, whichever comes first.

Table II: Configuration parameters for (a) GPGPU-Sim and (b) CCN.

Parameter          Value
(a) GPGPU-Sim
Core               15 SMs, Greedy-then-oldest (GTO) scheduler
Clock frequency    Core @ 1.4 GHz; Interconnect/L2 @ 700 MHz
Threads per SM     1536
Warp width         32
SIMD lane width    32
Registers per SM   32768
Shared Memory      48 KB
L1 Data Cache      16 KB, 128 byte line, 4-way, LRU, write-through, no-write-allocate
L2 Cache           768 KB, 128 byte line, 8-way, LRU, write-back, 12 banks
DRAM               GDDR5 DRAM, 6 channel, 64-bits per channel, 924 MHz
(b) CCN
CCN Buffer         8-entry, 30 bits per entry (26 bits Tag + 4 bits Core-ID)
Request Queue      8-entry, 30 bits per entry
Response Queue     8-entry, ∼128 bytes per entry (cache line + Core-ID)
CCN ring           4-byte request channel; 32-byte response channel; 1.4 GHz
Shadow Tag         1200 bytes size (modelled upon 48 KB L1 data cache)
tS                 1 million instructions
tP                 10 million instructions
Hmin               0.05 (5 percent hits)

[Figure 7, panel (a): Speedup for applications with µRC > 10. Normalized IPC of baseline, CCN-B, CCN-RT and ideal for b+tree, cfd, hotspot, lud, sradv1, sc, cutcp, tpacf, km, pvr, ss, wc and AVG. Panel (b): Percentage improvement in IPC for applications with µRC < 3. CCN-B, CCN-RT and ideal for bfs', heartwall, sort, kmeans, lavaMD, leukocyte, myocyte, histo, lbm, mri-g, mri-q, ii and AVG; one off-scale bar is annotated -11.5.]
Fig. 7: Performance variation with cooperative caching.
5.3. Results
We begin by evaluating the overall performance improvement with our proposed
schemes for benchmarks that exhibit inter-core reuse (µRC > 10). We also show the
neutrality of our scheme for benchmarks with little or no reuse (µRC < 3). Later we
assess the finer parameters for the former set of benchmarks, as applications with
inter-core reuse are the primary motivation for this study. We do not show the
benchmarks that fall between these two ranges (3 ≤ µRC ≤ 10), as the results of the
above categories are good indicators of
the trend in the rest of the benchmarks. We also compare the results of our proposed
schemes, i.e., CCN-B and CCN-RT, against an ideal cooperative caching configuration
that services all of the remote hits with zero latency, without incurring any overheads
of cooperative caching.
5.3.1. Performance. In Figure 7a, we show the speedup with CCN-B and CCN-RT for
applications that exhibit reuse. Over the baseline configuration, we observe an average
improvement of 14.5% with CCN-B and 14.7% with CCN-RT. Memory-bound applications
such as cfd, ss and pvr show higher speedup than non-memory-bound applications, as
they are more sensitive to bandwidth bottlenecks. b+tree shows a higher improvement
than the ideal case due to timing variations in warp scheduling. This aberration is
also caused by a higher number of coalesced hits on cache lines allocated for on-going
remote L1 accesses, which does not occur in the ideal scenario due to the zero-cycle
latency of remote L1 accesses.
We also assess the impact of cooperative caching on applications that show little or
no reuse. For such applications, cooperative caching adds an extra roundtrip overhead
of going through the CCN, because with low µRC most requests end up
going to L2 cache after an unsuccessful traversal in the CCN. In such cases, the Request
Throttler helps in preventing the L1 misses from incurring the CCN overhead when
there is little or no reuse. In Figure 7b, we show that with CCN-B, we see a degradation
of up to 11.5% and an average degradation of 1.7% compared to the baseline GPU.
However, with CCN-RT, the maximum degradation reduces to 1.5% with an overall
average of 0.1%.
5.3.2. L2 cache bandwidth demand. In Figure 8a, we demonstrate the effectiveness of
our proposed technique in mitigating the L2 cache bandwidth bottleneck. On average,
CCN-RT reduces the traffic to L2 cache by 29% compared to the baseline GPU. It is in
close proximity to the ideal-case average of 33% indicating that most of the reuse hits
on remote L1 caches are captured by the proposed architecture. The virtually identical
results for CCN-B and CCN-RT demonstrate that while throttling diverts most of the
non-productive traffic directly to L2 cache, it does not reduce the number of potential
hits in the CCN. If throttling diverted useful reuse requests to the L2 cache, bypassing
the CCN, we would see a smaller reduction in L2 traffic with CCN-RT than with
CCN-B.
5.3.3. Average Memory Latency (AML). In Figure 8b, we see an average reduction of 24%
in AML with our proposed CCN-RT architecture for applications that show reuse. We
observe that cutcp shows the maximum reduction of 65% in AML due to high µRC
of 78%. However, it does not translate into performance gain due to its non-memory-
bound nature.
5.3.4. Core stall cycles. We observed in the above results that the performance im-
proved by mitigating the bandwidth problem (indicated by L2 traffic), and by servicing
the misses in less time (indicated by AML). This is because cores now spend less time
waiting for memory. Therefore, we assess the impact of our proposal on the total num-
ber of cycles for which the cores are stalled. In Figure 8c, we observe a significant
reduction in core stall cycles for memory-bound applications such as cfd and sc, while
no degradation is seen for non-memory-bound applications like cutcp and tpacf. On
average, we reduce the core stall cycles by 26%, which is in close proximity to the ideal
reduction of 28%.
5.3.5. Off-chip memory traffic. In order to dissociate the effects of L2 and off-chip band-
widths on the overall performance gain, we analyse the change in off-chip memory
traffic. As shown in Figure 8d, we see that for most applications there is no visible
difference in the traffic to off-chip memory, indicating that the entire performance im-
provement can be attributed to the mitigation of bandwidth bottleneck between pri-
vate L1s and the shared L2. Therefore, it can be inferred for most benchmarks that in
the baseline architecture without CCN, the reuse requests mostly hit in the L2 cache
thereby only burdening the L2 cache bandwidth with duplicate requests. However, in
sc we notice a reduction in DRAM traffic by 12% with CCN-RT. This indicates that
for sc a significant portion of reuse requests to L2 also misses in the L2 cache, adding
to the DRAM traffic. As a result, upon removing the reuse requests to L2 cache with
the help of CCN in sc, not only the traffic to L2 cache is reduced, but also the traffic
to DRAM is reduced. Therefore, the performance benefit in sc with CCN-RT can
be attributed not only to the mitigation of L2 bandwidth bottleneck, but also to the
mitigation of DRAM bandwidth bottleneck.

[Figure 8, all panels plotted for b+tree, cfd, hotspot, lud, sradv1, sc, cutcp, tpacf, km, pvr, ss, wc and AVG. Panel (a): Percentage reduction in L1 to L2 traffic (CCN-B, CCN-RT, ideal). Panel (b): Normalized average memory latency (baseline, CCN-B, CCN-RT). Panel (c): Normalized core stall cycles (baseline, CCN-B, CCN-RT, ideal). Panel (d): Normalized off-chip memory traffic (baseline, CCN-B, CCN-RT, ideal). Panel (e): Energy dissipation with CCN, normalized (baseline, CCN-RT).]
Fig. 8: Experimental results demonstrating the effect of cooperative caching.
5.3.6. Summary. In the above results, we observed that for applications which exhibit
reuse, we are able to reduce the traffic to L2 cache by 29% while also reducing the aver-
age memory latency by 24%. As a consequence of the above improvements, we reduce
the average core stall cycles by 26% which translates into an average performance
improvement of 14.7%.
5.4. Hardware costs
5.4.1. Area. We use GPUWattch [Leng et al. 2013] to estimate the area of our pro-
posed architecture. We use the existing components in GPUWattch to model the CCN
components, after appropriate scaling wherever necessary. CCN adds an area overhead
of 4.38 mm² for the ring interconnect and the shadow tags (corresponding to the
largest L1 data cache configuration) at 40 nm technology. Other storage units such as
CCN Buffers and Request/Response Queues add another 4.82 mm². This amounts to
an overall increase in die area by 1.3% with respect to baseline processor architecture
area of 700 mm².
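As a quick check of the stated overhead, the two area contributions can be added and expressed relative to the baseline die. The function and parameter names below are illustrative; the default values are the figures quoted above.

```python
def ccn_area_overhead(ring_and_tags_mm2=4.38, buffers_and_queues_mm2=4.82,
                      baseline_mm2=700.0):
    """Return (added area in mm^2, overhead as a percentage of the baseline die)."""
    added = ring_and_tags_mm2 + buffers_and_queues_mm2
    return added, 100.0 * added / baseline_mm2
```

The added area is 9.2 mm², i.e., roughly 1.3% of the 700 mm² baseline.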
[Figure 9, all panels plotted for b+tree, cfd, hotspot, lud, sradv1, sc, cutcp, tpacf, km, pvr, ss, wc and AVG. Panel (a): Speedup with varying L1 cache sizes, IPC normalized to the respective baselines (baseline-16/48, CCN-16L1, CCN-48L1). Panel (b): Speedup with link latencies of 1, 3 and 5 cycles, normalized IPC (baseline, 1-cycle, 3-cycles, 5-cycles). Panel (c): Speedup with varying width of SIMD lanes, IPC normalized to the respective baselines (baseline, ccn-32, ccn-64, ccn-128, ccn-192).]
Fig. 9: Sensitivity analysis.
5.4.2. Energy. With CCN, cores are stalled for fewer cycles, thereby reducing the leak-
age power. In addition, fewer packets require routing at the energy-inefficient crossbar
routers. Also, lower traffic to L2 leads to lower energy consumption by the NoC. However,
the large number of shadow tag lookups for remote cache accesses outweighs these energy
gains, resulting in an average energy overhead of 2.5% (Figure 8e).
5.5. Sensitivity analysis
5.5.1. L1 cache size. As Fermi offers configurable L1 cache sizes of 16 KB and 48 KB,
we analyse the sensitivity of our proposal to L1 cache size. As shown in Figure 9a,
on increasing the L1 cache size to 48 KB, we observe an average IPC gain of 20.6%
with CCN, compared to 14.7% with CCN on 16 KB L1 (over their respective baselines).
The reason is as follows. Although increasing the L1 cache size reduces the
number of capacity/conflict misses, thereby reducing the opportunities to find remote
L1 hits in the CCN, we observe that a larger L1 significantly increases the likelihood of
finding a remote L1 sharer for a compulsory miss. Therefore, because the increase in
utility of CCN for compulsory misses dominates the decrease in utility due to fewer
conflict/capacity misses, we observe a higher improvement in performance with
larger L1s.
5.5.2. Link latency and frequency. In this study, we analyse the performance impact of
interconnect latencies for every hop on the CCN ring. This is done by varying the
core-to-core transfer latency from 1 to 5 cycles (i.e., 15 to 75 cycles for the entire
ring). For a 700 mm² chip, each hop is approximately 3.5 mm of on-chip distance and
therefore, 1 to 5 cycles at 1.4 GHz is a reasonable window to complete the transfer [Beckmann and Wood
2004]. It is worth noting that varying the CCN link latency also captures the effect of
running the CCN ring at a fraction of core frequency. Therefore, this study shows the
performance variation on using the CCN ring at up to 1/5th the core frequency (280
MHz).
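The relation between per-hop latency, full-ring traversal time and the equivalent ring frequency can be written out as below. The constants come from the configuration used in this study; the function names are illustrative.

```python
CORE_CLOCK_MHZ = 1400   # core clock of the simulated GPU
N_CORES = 15            # number of nodes on the CCN ring


def ring_traversal_cycles(hop_latency_cycles):
    """Core cycles needed to traverse all hops of the ring once."""
    return N_CORES * hop_latency_cycles


def equivalent_ring_mhz(hop_latency_cycles):
    """Ring frequency at which one hop per ring cycle gives the same hop latency."""
    return CORE_CLOCK_MHZ / hop_latency_cycles
```

A hop latency of 5 core cycles therefore corresponds to 75 cycles for a full traversal, or to a single-cycle-per-hop ring clocked at 280 MHz.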
In Figure 9b, we see that for most applications, the IPC gain is fairly resilient to
increasing link latencies (or decreasing ring frequencies). For instance, cfd shows a
marginal reduction of 1% when the latency increases from 1 to 5 cycles. A minority of
applications show visible reductions in the gain as link latency increases. For example,
the IPC gain of b+tree drops from 31% to 19%, although it still maintains a modest
overall improvement in performance. On average, we see IPC improvements drop
from 14.7% to 13.6% as latency is increased from 1 to 3 cycles, settling further at 11.2%
when the link latency is increased to 5 cycles. These results indicate that our proposed
scheme is fairly robust to increasing inefficiencies in the ring interconnect (as well as
increasing distance between the neighbouring cores).
5.5.3. SIMD lane width. Each core in Nvidia’s Fermi GPU consists of a 32-lane SIMD
unit, with each lane capable of executing one floating-point or arithmetic instruction per
clock. In this study, we analyse the utility of CCN on increasing the SIMD lane width.
In Figure 9c, we plot the performance gain with CCN-RT on baseline configuration
with varying SIMD lane width of 32 (ccn-32), 64 (ccn-64), 128 (ccn-128) and 192 (ccn-
192), each normalized to their respective baselines. On average, the performance gain
drops modestly from 14.7% to 13.6% on increasing the SIMD lane width from 32 to
64, settling further at 11.4% and 10.2% with SIMD lane widths of 128 and 192 re-
spectively. Although the minor reduction in CCN gain is due to the increased latency
tolerance provided by additional SIMD lanes, cooperative caching continues to provide
considerable benefits for memory-intensive applications. This is due to the fact that by
increasing the SIMD lanes or the compute capability of the cores, only compute-bound
applications are expected to show significant speedups and a higher overlap of mem-
ory latencies with computation. In contrast, memory-intensive applications lack inde-
pendent compute instructions and continue to be bottlenecked by memory resources.
Therefore, additional compute resources provide only limited additional latency
tolerance for memory-intensive applications, due to which cooperative caching
continues to be useful in reducing memory latencies that lie in the critical path.
However, some benchmarks such as lud and km also show a momentary improvement in
performance gain with CCN on increasing the SIMD lane width. We observe that this
is because with wider SIMD lanes, a higher number of threads arrive at the memory
instructions per cycle, issuing a higher number of requests that may exhibit reuse,
thereby amplifying the utility of CCN in reducing the traffic that could lead to even
higher congestion.
5.6. Discussion
In the future, scalability of the CCN can be addressed by a hierarchical
implementation of the proposed ring network [Holliday and Stumm 1994][Ravindran
and Stumm 1997]. A sub-CCN-ring that contains the requesting core can query
other sub-CCN-rings in parallel, thereby decomposing the serial latency
of traversing a large number of cores into concurrent transactions to multiple
rings. In addition, as coherent caches in GPUs are imminent with future
architectures [Martin et al. 2012][Power et al. 2013][Singh et al. 2013], inter-core
communication via CCN can also act as a substrate for implementing cache coherence.
6. COMPARATIVE STUDY
In this section, we perform a quantitative and qualitative comparison of CCN with
alternative techniques that address the bandwidth bottleneck in GPUs.
6.1. Increasing L2 banks
An alternative technique to increase the bandwidth to L2 is to increase the number
of L2 banks. However, increasing the banks only reduces the congestion in the access
path to L2 whereas CCN, in addition to reducing pressure on L2 bandwidth, provides
a significantly faster response for a fraction of miss requests. In our experiments, we
observe that CCN services the reuse requests in 42 cycles (lreuse) on average, for the
29% of misses (hreuse) that hit in the CCN. For the remaining L2 accesses, CCN adds a
roundtrip overhead of 54 cycles (δoverhead). It also reduces the congestion overhead to
L2 by 78 cycles (δcong). Considering that the average L2 access latency without the CCN
is 300 cycles (lO) and substituting the above values in Equation 1, the average L2 access
latency with CCN is computed to be 208 cycles (Equation 3).

lC(CCN) = (300 − 78 + 54) × 0.71 + 42 × 0.29 ≈ 208    (3)
lC(2×) = (300 − 80 + 0) × 1.0 = 220    (4)
lC(CCN/2×) = (300 − 117 + 54) × 0.71 + 42 × 0.29 ≈ 180    (5)

[Figure 10, both panels plotted for b+tree, cfd, hotspot, lud, sradv1, sc, cutcp, tpacf, km, pvr, ss, wc and AVG. Panel (a): Speedup with 2× L2 banks and CCN, normalized IPC (bank12, bank24, bank12/CCN, bank24/CCN). Panel (b): Ideal speedup with L1 cache clusters, normalized IPC (baseline, cluster03, cluster05, CCN); one off-scale bar is annotated 1.73.]
Fig. 10: Comparative study.
However, increasing the L2 banks only reduces δcong (though marginally more than
CCN for some benchmarks), but requires all accesses to go through the L2 access la-
tency, albeit via reduced congestion. Upon substituting corresponding values in Equa-
tion 1, the reduced L2 access latency is computed to be 220 cycles (Equation 4). There-
fore, in Figure 10a, we observe an average IPC improvement of 10.2% upon a 2× in-
crease in L2 banks from 12 to 24. In contrast, CCN implemented with 12-bank L2
configuration shows a higher improvement of 14.7% (with cfd performing 34% better
with CCN than with 2× L2 banks).
Importantly, CCN is partly orthogonal to increasing the banks at L2. This is because,
in addition to reducing the δcong further, CCN adds the benefit of faster access to reuse
requests. The average L2 access latency in Equation 1 for a CCN architecture on a
24 L2 banks configuration is computed to be 180 cycles (Equation 5). In Figure 10a,
our experiments show an average performance improvement of 23.5% with both the
techniques combined.
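The three latency estimates can be reproduced with a single function. The general form below is inferred from the structure of Equations 3 to 5 (a weighted average over misses that go to L2 and misses that hit in the CCN); the function and parameter names are illustrative.

```python
def avg_l2_access_latency(l_o, d_cong, d_overhead, h_reuse, l_reuse):
    """Average latency in cycles seen by an L1 miss.

    l_o        : baseline L2 access latency
    d_cong     : reduction in congestion delay on the L2 path
    d_overhead : CCN roundtrip overhead paid by requests that still go to L2
    h_reuse    : fraction of misses that hit in a remote L1 via the CCN
    l_reuse    : latency of a remote L1 hit
    """
    return (l_o - d_cong + d_overhead) * (1 - h_reuse) + l_reuse * h_reuse


ccn_only = avg_l2_access_latency(300, 78, 54, 0.29, 42)      # Equation 3
banks_2x = avg_l2_access_latency(300, 80, 0, 0.0, 0)         # Equation 4
ccn_and_2x = avg_l2_access_latency(300, 117, 54, 0.29, 42)   # Equation 5
```

Rounded to whole cycles, these evaluate to 208, 220 and 180 cycles respectively, matching the values quoted above.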
With respect to cost, increasing the L2 banks would require a higher number of
ports in the crossbar. As the area of a crossbar increases polynomially with the number
of ports, the area overhead will be significant. Energy demands also increase
significantly, as each router is more complex and needs to arbitrate among a higher
number of nodes. In contrast, CCN only requires simple multiplexers at each router and
scales well with respect to area and energy overheads. Alternatively, increasing the L2
datapath width to provide more L2 bandwidth would also be area intensive, as it entails
increasing the area of the 15×12 core-to-L2 connections in the crossbar, making the
crossbar much bulkier. In contrast, CCN only requires 15 core-to-core connections. As
core-to-L2 connections are typically longer (in addition to being more numerous) than
core-to-core connections in CCN, there is a higher overhead in scaling the former.
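The difference in connection count can be made concrete with a small calculation. This is only a count of point-to-point connections under the configuration above, not an area model; the function names are illustrative.

```python
def crossbar_connections(cores, l2_banks):
    """Core-to-L2 connections in a full crossbar: one per (core, bank) pair."""
    return cores * l2_banks


def ring_connections(cores):
    """Core-to-core connections in a ring: one link to the next neighbour per core."""
    return cores
```

For 15 cores, the crossbar has 180 connections with 12 banks and 360 with 24 banks, whereas the CCN ring has 15 regardless of the number of L2 banks.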
6.2. Sharing Tracker
Tarjan and Skadron [2010] proposed a scheme to exploit reuse within the
private caches by using a Sharing Tracker, a decomposed version of the coherence di-
rectory. It aims to reduce the off-chip memory bandwidth demand by diverting DRAM
accesses to private caches that contain a shared copy.
Although we adopt this intuition to reuse shared copies in private caches, our aim is
to reduce the bandwidth demand to the shared cache (and not the DRAM as in [Tarjan
and Skadron 2010]). This is because, in the current scenario with recent GPU
architectures, exploiting reuse does not reduce off-chip memory traffic (as shown in Figure 8d)
and hence, a common directory in shared cache is not expected to show any benefit
since there are not many off-chip memory accesses that it can avoid. In fact, since
accessing and maintaining the sharing tracker in L2 cache adds to the bandwidth de-
mand to L2 without relieving pressure on off-chip bandwidth, it will only exacerbate
the problem by increasing the L2 access latencies and thereby worsening the IPC with
respect to baseline. For those architectures where off-chip memory traffic is also re-
duced by exploiting sharing within private caches, CCN achieves the same, but in
addition, it also reduces the traffic to L2 (which we have shown to be critical to perfor-
mance), and therefore provides a significant advantage over a directory approach.
6.3. Clustered sharing
Keshtegar et al. [2015] proposed an architecture to enable restricted sharing within
core clusters. However, we have shown in Figure 3 that while some benchmarks show
higher reuse with neighbouring cores, others show a uniform sharing with all cores. In
Figure 10b, we show the ideal performance improvement (with no sharing overheads)
obtained by sharing within cache clusters and we compare it with an ideal case of
CCN (sharing among all cores). We observe an average IPC gain of 4% and 8% with
ideal clusters of 3 and 5 L1s respectively, compared to an average IPC gain of 21%
with ideal CCN. This suggests that for most benchmarks, upon restricting the sharing
within cache clusters, SMs lose out on most of the reuse data.
Moreover, the cluster-based proposal by Keshtegar et al. [2015] employs a mesh-type
network within each cluster, which scales polynomially with the number of cores. Therefore,
we expect the area overhead of clusters to exceed that of the ring-based connections in
CCN, which scale linearly with the number of cores. Furthermore, in current GPUs,
SMs are placed linearly around the central L2 cache [Nvidia 2009][Nvidia 2012];
clusters would therefore require longer wires to connect the far ends of a cluster,
whereas CCN needs only near-neighbour connections.
6.4. Summary
In this section, we have shown that CCN fares well in comparison with alternative
techniques. CCN performs better than simply increasing the number of L2 banks while
also being partly orthogonal to that technique. The sharing tracker is expected to
degrade performance on the baseline architecture, and restricting sharing
to cache clusters significantly reduces the inter-core reuse that can be exploited.
7. RELATED WORK
While sharing across L1 caches is a common occurrence in multiprocessors, as empha-
sized by the prevalent use of sophisticated coherence infrastructure, we derive signif-
icant benefits by exploiting L1 sharing for GPGPU workloads, a property atypical in
GPUs. Additionally, in contrast to earlier works [Yazdanbakhsh et al. 2016][Jog et al.
2016] where only the off-chip memory bandwidth is considered critical to performance,
we identify the criticality of mitigating congestion in the on-chip cache hierarchy between
L1 and L2 cache. In the following subsections, we further discuss several prior
works related to the ideas presented in CCN and cite their key differences.
7.1. Cooperative Caching in CMPs
In the realm of CMPs, Chang and Sohi [2006, 2007]
proposed cooperative caching by adapting the coherence infrastructure. Subsequently,
Herrero et al. [2008] proposed a scalable distributed cooperative caching scheme by re-
designing the coherence engine to provide distributed directories. Both schemes aim to
provide aggressive latency and capacity benefits for on-chip caches in CMPs. However,
since GPUs are relatively more tolerant to latencies, in this paper we address the
problem of bandwidth in GPUs. In addition, a directory-based scheme is not directly
portable to GPUs due to the lack of coherence infrastructure and therefore, our solution
proposes an independent lightweight network.
7.2. Ring Network
Ring topologies have been used extensively in commercial multiprocessors to provide
low-cost inter-core communication. Larrabee [Seiler et al. 2008] employs a bidirectional
ring network for on-chip communication between latency-sensitive CPU cores, coherent
L2 caches and other blocks, with each link being 64 bytes wide (net width of 128
bytes). Xeon Phi [Chrysos, George 2012] also uses bidirectional rings, each consisting
of three independent rings, viz., a 64-byte data block ring for data
transactions, an address/command ring, and an acknowledgement ring for coherence
and flow control messages (net width >128 bytes). In contrast, CCN enables bidirectional
communication between latency-tolerant GPU cores by connecting the incoherent
L1 caches in a ring. Because latency constraints in CCN are relaxed compared to prior
ring interconnects in multiprocessors, the bus width for inter-core transfers is smaller,
with links that are 8 bytes and 32 bytes wide (net width of 40 bytes).
Therefore, our proposal exploits the latency tolerance of multithreaded cores
to provide low-cost inter-core communication through a lightweight ring network.
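As a quick check of the widths quoted above, the short Python sketch below recomputes the net ring widths from the per-link figures in the text. The variable names are illustrative only, and the Xeon Phi ring is omitted because the text gives only a lower bound for it.

```python
# Per-link widths in bytes, taken from the figures quoted in the text.
LARRABEE_LINK_BYTES = 64          # one direction of the bidirectional ring
CCN_LINK_BYTES = (8, 32)          # the two CCN link widths

# Net width is the sum over all links at a ring stop.
larrabee_net = 2 * LARRABEE_LINK_BYTES
ccn_net = sum(CCN_LINK_BYTES)

# Fraction of Larrabee's ring width that CCN uses.
ratio = ccn_net / larrabee_net

print(larrabee_net, ccn_net, ratio)   # 128 40 0.3125
```

Under these figures, the CCN ring is a little under one third as wide as the Larrabee ring.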
Furthermore, Campanoni et al. [2014] proposed a ring cache for HELIX-RC that acts
as a distributed first-level cache, preceding the private L1 cache. Each ring node has
a cache array to cache shared data and satisfies the loads and stores received from its
attached core. To avoid coherence complications, memory addresses are permanently
mapped to the nodes of the ring cache. In contrast, each node in the CCN ring network
comprises a shadow tag array, needed only for lookups and not for storage of
shared data. Subsequent loads to the shared data via CCN are performed directly in
the corresponding L1 caches since there is no separate data array for the ring nodes.
Therefore, the nodes in the CCN ring network are lighter than nodes in the ring cache
proposed in HELIX-RC.
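To make the role of tag-only ring nodes concrete, the following Python sketch models a ring whose nodes hold a copy of their L1 tags and no data. It is a simplified illustration under stated assumptions, not the CCN protocol: the names RingNode and service_miss are hypothetical, the request travels in a single direction, timing is ignored, and a request that finds no matching tag on the ring is assumed to be served by L2.

```python
from dataclasses import dataclass, field


@dataclass
class RingNode:
    """One ring stop: a copy of its L1's tags, with no data array (assumed model)."""
    core_id: int
    shadow_tags: set = field(default_factory=set)


def service_miss(nodes, requester, tag):
    """Walk the ring once from the requester and return (source, hops).

    source is the id of the first core whose shadow tags contain `tag`,
    or the string "L2" if no other node on the ring reports a hit.
    """
    n = len(nodes)
    for hop in range(1, n):                      # visit every other node once
        node = nodes[(requester + hop) % n]
        if tag in node.shadow_tags:              # tag-only lookup, no data read here
            return node.core_id, hop
    return "L2", n - 1                           # full traversal without a hit


# Example: four cores, core 2 caches the line with tag 0x1A0.
nodes = [RingNode(i) for i in range(4)]
nodes[2].shadow_tags.add(0x1A0)

print(service_miss(nodes, 0, 0x1A0))   # (2, 2): found two hops away
print(service_miss(nodes, 0, 0x2B0))   # ('L2', 3): no sharer on the ring
```

In this model the hit only identifies which L1 holds the line; the data itself would then be read from that L1, since the ring nodes store no data.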
7.3. Shadow Tags
Prior proposals such as Piranha [Barroso et al. 2000] and Niagara [Kongetira et al.
2005] have replicated tag structures of the private L1 caches at the shared L2 cache.
Such duplicate L1 tags stored centrally in the L2 cache are typically used to construct
partial sharing information, thereby reducing indirections to the coherence engine.
Duplicate tag structures are also used to reduce redundant write-back traffic to L2
cache from multiple L1s that cache the same shared data. However, in CCN we replicate
the tags adjacent to the corresponding L1 caches and do not complicate the L2
cache control. The shadow tags are used only to prevent deterioration of L1 cache performance due to
remote lookups. Moreover, updates to the shadow tags incur minimal communication
overhead in CCN due to the physical proximity of L1 caches and shadow tags.
7.4. Cache Management
In the field of GPUs, prior proposals such as Sharing Tracker [Tarjan and Skadron
2010] and cluster-based schemes [Keshtegar et al. 2015] (discussed previously in Section 6)
exploit reuse across GPU cores via a central directory and clustered caches,
respectively.
Several other schemes have been proposed for GPUs to improve the effective on-chip
cache capacity, to reduce cache thrashing and to improve locality in L1 & L2 caches.
Rhu et al. [2013] proposed a locality-aware memory hierarchy which adaptively ad-
justs the memory access granularity to prevent over-fetching, providing better off-chip
bandwidth utilization. Beyond adaptive memory access granularity, Li et al.
[2016] proposed a tag-split cache to enable fine storage granularity to improve cache
utilization while keeping a coarse access granularity to avoid excessive cache requests.
Tarjan et al. [2009] proposed a scheme to tolerate memory miss latencies for SIMD
cores by masking out threads in a warp which are waiting on data and allowing other
threads to continue execution, hence utilizing the idle execution slots. Rogers et al.
[2012, 2013] proposed scheduling techniques that are conscious of variations in
cache locality and dynamically alter the scheduling policies to maximize
inter-warp locality on the L1 data cache. Jia et al. [2012] present a taxonomy for mem-
ory access locality and propose a compile-time algorithm to selectively utilize the L1
caches. Narasiman et al. [2011] propose large warp architecture and a two-level warp
scheduling technique to make effective use of resources on GPU while Jog et al. [2013]
propose a thread block aware scheduling policy to improve the cache hit rates of L1
cache. Choi et al. [2012] employ techniques such as write buffering and read bypassing
to reduce DRAM traffic and improve the L2 cache utilization, thereby addressing the
bandwidth constraint between shared memory and DRAM. There has also been work
on cache management policies for heterogeneous CPU-GPU architectures. Yang et al.
[2012] proposed a CPU-assisted prefetching scheme to improve GPU memory latencies
by localizing the data in the LLC, while Lee and Kim [2012] proposed a
TLP-aware cache management policy to effectively utilize the LLC for general-purpose
workloads.
Broadly, the above cache management proposals focus on reducing the miss rate of
independent caches by improving cache utilization. However, in CCN, without reduc-
ing miss rate of independent L1s, we reduce the collective bandwidth demand of L1 on
L2 by diverting some of the misses to remote L1s. Hence, the mentioned techniques
that reduce the miss rate itself are orthogonal to our work. Given the severity of the
memory bottleneck in GPUs (as indicated by the magnitude of PerfX in Table I), no
technique alone solves the entire problem, and hence such orthogonal techniques can
be used in conjunction with CCN.
7.5. Cache bypassing
In order to mitigate the severity of cache thrashing, several cache bypassing techniques
have been proposed. In CPUs, Gaur et al. [2011] proposed a bypass policy to
selectively fill the exclusive last-level cache with evicted cache blocks from the higher
level. Further, Duong et al. [2012] proposed a policy to protect reusable cache lines
from eviction with a dynamically computed Protected Distance, and bypass the miss
requests upon lack of unprotected cache lines in a set.
In GPUs, high multithreading and low on-chip cache capacity per thread present
additional challenges due to severe cache thrashing. Chen et al. [2014] proposed a
dynamic cache management policy that combines L1 cache bypassing and throttling.
In their proposed scheme, warp throttling prevents over-saturation of on-chip cache
resources while cache bypassing prevents cache contention, so that fewer warps need
to be throttled than in standalone warp throttling schemes. Li et al.
[2015] proposed a locality-driven cache bypassing scheme that uses reuse frequency in
a decoupled and extended tag memory to allow allocation in the data memory for only
those cache lines that exhibit high reuse.
In summary, cache bypassing schemes in GPUs improve cache utilization by preserving
hot cache lines with high reuse in the available on-chip caches and bypassing
streaming requests directly to the L2 cache. By preventing the eviction of cache
lines with high reuse, they help eliminate repeated reuse requests from the same
L1 cache to the L2 cache. In our proposed technique, by contrast, we eliminate the reuse
requests from different L1 caches to the L2 cache. In other words, cache bypassing
techniques reduce intra-core reuse requests that access the L2 cache, whereas our
proposed technique reduces inter-core reuse requests that access the L2 cache. Therefore,
we expect our proposal to be complementary to cache bypassing techniques, as the two
reduce mutually exclusive sets of requests to the L2 cache.
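The distinction between intra-core and inter-core reuse requests can be illustrated with a small Python sketch that labels each L1 miss in a request trace. This is a hypothetical model for illustration only: the function name classify_l2_requests, the trace format of (core, line address) pairs, and the rule that a request counts as intra-core whenever the same core has requested the line before are all assumptions, not definitions taken from the evaluation.

```python
def classify_l2_requests(trace):
    """Label each L1-miss request as 'cold', 'intra' or 'inter' and count them.

    trace: iterable of (core_id, line_address) pairs in request order.
    'cold'  - first request to the line by any core.
    'intra' - the same core has requested this line before.
    'inter' - only other cores have requested this line before.
    """
    seen_by = {}                                  # line address -> cores that requested it
    counts = {"cold": 0, "intra": 0, "inter": 0}
    for core, line in trace:
        cores = seen_by.setdefault(line, set())
        if core in cores:
            counts["intra"] += 1                  # reuse by the same core
        elif cores:
            counts["inter"] += 1                  # reuse of a line first fetched elsewhere
        else:
            counts["cold"] += 1                   # compulsory request
        cores.add(core)
    return counts


# Example trace of L1 misses from three cores.
trace = [(0, 0xA), (0, 0xA), (1, 0xA), (1, 0xB), (2, 0xB), (1, 0xB)]
print(classify_l2_requests(trace))   # {'cold': 2, 'intra': 2, 'inter': 2}
```

Each request receives exactly one label, so the 'intra' and 'inter' sets are disjoint by construction, which mirrors the argument that the two classes of techniques target different requests.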
8. CONCLUSION
In this paper, we discuss the inefficiencies in the management of L1 caches in GPUs.
We show that as a consequence of high L1 miss rates, high traffic to L2 cache presents
a bandwidth bottleneck between L1 and L2 resulting in high L2 access latencies. In
memory-intensive applications, multithreading is unable to hide such high latencies,
making this bottleneck critical to performance.
We discover considerable potential for data reuse within the L1 caches. We exploit
this opportunity to reduce the miss traffic to the L2 cache and thereby reduce the
L2 cache bandwidth demand. Therefore, we present a Cooperative Caching Network
which services the L1 load misses cooperatively via a lightweight ring network. We
show that GPUs can tolerate reuse latencies gracefully up to 80 cycles and therefore, a
ring topology appears to be a cost-effective solution, as it allows us to trade off higher
latencies for simplicity and short wires, i.e., lower power consumption and die-area
cost. We also use shadow tag memory, adjacent to each L1 data cache, to decouple the
local L1 cache performance from remote L1 cache tag lookups. For applications that do
not exhibit any inter-core reuse, we detect the lack of sharing at runtime and prevent
the L1 miss requests from incurring the CCN overhead, sending them directly to the
L2 cache. For applications that exhibit reuse, our technique improves the IPC by 14.7%
while being neutral to applications that show little or no reuse. We likewise reduce
the traffic to L2 cache by 29%, and reduce the average memory latency by 24%. As a
result, we reduce the total core stall cycles by 26%. Alongside the above improvements,
CCN presents an area and energy overhead of 1.3% and 2.5% respectively. CCN also
compares favourably with alternative techniques that address the bandwidth issue.
REFERENCES
Manuel E. Acacio, José González, José M. García, and José Duato. 2002. Owner Prediction for Accelerating
Cache-to-cache Transfer Misses in a cc-NUMA Architecture. In Proceedings of the 2002 ACM/IEEE
Conference on Supercomputing (SC ’02). IEEE Computer Society Press, Los Alamitos, CA, USA, 1–12.
http://dl.acm.org/citation.cfm?id=762761.762762
Jade Alglave, Mark Batty, Alastair F. Donaldson, Ganesh Gopalakrishnan, Jeroen Ketema, Daniel Poetzl,
Tyler Sorensen, and John Wickerson. 2015. GPU Concurrency: Weak Behaviours and Programming
Assumptions. In Proceedings of the Twentieth International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS ’15). ACM, New York, NY, USA, 577–591.
DOI:http://dx.doi.org/10.1145/2694344.2694391
Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA
workloads using a detailed GPU simulator. In ISPASS (2009-05-26). IEEE, 163–174.
L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and
B. Verghese. 2000. Piranha: a scalable architecture based on single-chip multiprocessing. In Computer
Architecture, 2000. Proceedings of the 27th International Symposium on. 282–293.
Bradford M. Beckmann and David A. Wood. 2004. Managing Wire Delay in Large Chip-
Multiprocessor Caches. In Proceedings of the 37th Annual IEEE/ACM International Sympo-
sium on Microarchitecture (MICRO 37). IEEE Computer Society, Washington, DC, USA, 319–330.
DOI:http://dx.doi.org/10.1109/MICRO.2004.21
Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy M. Jones, Gu-Yeon Wei, and David Brooks.
2014. HELIX-RC: An Architecture-compiler Co-design for Automatic Parallelization of Irregular Programs.
In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA
’14). IEEE Press, Piscataway, NJ, USA, 217–228. http://dl.acm.org/citation.cfm?id=2665671.2665705
Jichuan Chang and Gurindar S. Sohi. 2006. Cooperative Caching for Chip Multiprocessors. In Proceedings
of the 33rd Annual International Symposium on Computer Architecture (ISCA ’06). IEEE Computer
Society, Washington, DC, USA, 264–276. DOI:http://dx.doi.org/10.1109/ISCA.2006.17
Jichuan Chang and Gurindar S. Sohi. 2007. Cooperative Cache Partitioning for Chip Multiprocessors. In
Proceedings of the 21st Annual International Conference on Supercomputing (ICS ’07). ACM, New York,
NY, USA, 242–252. DOI:http://dx.doi.org/10.1145/1274971.1275005
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin
Skadron. 2009. Rodinia: A Benchmark Suite for Heterogeneous Computing. In Proceedings of the 2009
IEEE International Symposium on Workload Characterization (IISWC ’09). IEEE Computer
Society, Washington, DC, USA, 44–54. DOI:http://dx.doi.org/10.1109/IISWC.2009.5306797
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, and Kevin Skadron. 2008. A
Performance Study of General-purpose Applications on Graphics Processors Using CUDA. J. Parallel
Distrib. Comput. 68, 10 (Oct. 2008), 1370–1380. DOI:http://dx.doi.org/10.1016/j.jpdc.2008.05.014
Xuhao Chen, Li-Wen Chang, Christopher I. Rodrigues, Jie Lv, Zhiying Wang, and Wen-Mei Hwu. 2014.
Adaptive Cache Management for Energy-Efficient GPU Computing. In Proceedings of the 47th An-
nual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society,
Washington, DC, USA, 343–355. DOI:http://dx.doi.org/10.1109/MICRO.2014.11
Hyojin Choi, Jaewoo Ahn, and Wonyong Sung. 2012. Reducing Off-chip Memory Traffic by Selective
Cache Management Scheme in GPGPUs. In Proceedings of the 5th Annual Workshop on General Pur-
pose Processing with Graphics Processing Units (GPGPU-5). ACM, New York, NY, USA, 110–119.
DOI:http://dx.doi.org/10.1145/2159430.2159443
Chrysos, George. 2012. Intel Xeon Phi Coprocessor - The Architecture. Technical Report. Intel Corporation.
https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner
Saumay Dublish, Vijay Nagarajan, and Nigel Topham. 2016. Characterizing Memory Bottlenecks in GPGPU
Workloads. In Proceedings of the 2016 IEEE International Symposium on Workload Characterization
(IISWC ’16). Providence, Rhode Island, USA.
Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero, and Alexander V. Veidenbaum. 2012.
Improving Cache Management Policies Using Dynamic Reuse Distances. In Proceedings of the 2012
45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer
Society, Washington, DC, USA, 389–400. DOI:http://dx.doi.org/10.1109/MICRO.2012.43
Jayesh Gaur, Mainak Chaudhuri, and Sreenivas Subramoney. 2011. Bypass and Insertion Al-
gorithms for Exclusive Last-level Caches. In Proceedings of the 38th Annual Interna-
tional Symposium on Computer Architecture (ISCA ’11). ACM, New York, NY, USA, 81–92.
DOI:http://dx.doi.org/10.1145/2000064.2000075
Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong Wang. 2008. Mars: A MapRe-
duce Framework on Graphics Processors. In Proceedings of the 17th International Conference on Par-
allel Architectures and Compilation Techniques (PACT ’08). ACM, New York, NY, USA, 260–269.
DOI:http://dx.doi.org/10.1145/1454115.1454152
Enric Herrero, José González, and Ramon Canal. 2008. Distributed Cooperative Caching. In Proceedings
of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT ’08).
ACM, New York, NY, USA, 134–143. DOI:http://dx.doi.org/10.1145/1454115.1454136
Mark A. Holliday and Michael Stumm. 1994. Performance Evaluation of Hierarchical Ring-Based Shared
Memory Multiprocessors. IEEE Trans. Computers 43, 1 (1994), 52–67.
Wenhao Jia, Kelly A. Shaw, and Margaret Martonosi. 2012. Characterizing and Improving the Use of
Demand-fetched Caches in GPUs. In Proceedings of the 26th ACM International Conference on Super-
computing (ICS ’12). ACM, New York, NY, USA, 15–24. DOI:http://dx.doi.org/10.1145/2304576.2304582
Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K. Mishra, Mahmut T. Kan-
demir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. OWL: Cooperative Thread Array Aware
Scheduling Techniques for Improving GPGPU Performance. In Proceedings of the Eighteenth Interna-
tional Conference on Architectural Support for Programming Languages and Operating Systems (ASP-
LOS ’13). ACM, New York, NY, USA, 395–406. DOI:http://dx.doi.org/10.1145/2451116.2451158
Adwait Jog, Onur Kayiran, Ashutosh Pattnaik, Mahmut T. Kandemir, Onur Mutlu, Ravishankar
Iyer, and Chita R. Das. 2016. Exploiting Core Criticality for Enhanced GPU Performance.
In Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and
Modeling of Computer Science, Antibes Juan-Les-Pins, France, June 14-18, 2016. 351–363.
DOI:http://dx.doi.org/10.1145/2896377.2901468
Stefanos Kaxiras and Georgios Keramidas. 2010. SARC Coherence: Scaling Directory
Cache Coherence in Performance and Power. IEEE Micro 30, 5 (Sept. 2010), 54–65.
DOI:http://dx.doi.org/10.1109/MM.2010.82
Mohammad Mahdi Keshtegar, Hajar Falahati, and Shaahin Hessabi. 2015. Cluster-based approach for improving
graphics processing unit performance by inter streaming multiprocessors locality. IET Computers
& Digital Techniques (March 2015). http://digital-library.theiet.org/content/journals/10.1049/iet-cdt.2014.0092
P. Kongetira, K. Aingaran, and K. Olukotun. 2005. Niagara: a 32-way multithreaded Sparc processor. IEEE
Micro 25, 2 (March 2005), 21–29. DOI:http://dx.doi.org/10.1109/MM.2005.35
Alvin R. Lebeck and David A. Wood. 1995. Dynamic Self-invalidation: Reducing Coherence
Overhead in Shared-memory Multiprocessors. In Proceedings of the 22nd Annual International
Symposium on Computer Architecture (ISCA ’95). ACM, New York, NY, USA, 48–59.
DOI:http://dx.doi.org/10.1145/223982.223995
Jaekyu Lee and Hyesoon Kim. 2012. TAP: A TLP-aware Cache Management Policy for a CPU-GPU Het-
erogeneous Architecture. In Proceedings of the 2012 IEEE 18th International Symposium on High-
Performance Computer Architecture (HPCA ’12). IEEE Computer Society, Washington, DC, USA, 1–12.
DOI:http://dx.doi.org/10.1109/HPCA.2012.6168947
Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, and
Vijay Janapa Reddi. 2013. GPUWattch: Enabling Energy Optimizations in GPGPUs. In Proceedings of
the 40th Annual International Symposium on Computer Architecture (ISCA ’13). ACM, New York, NY,
USA, 487–498. DOI:http://dx.doi.org/10.1145/2485922.2485964
Chao Li, Shuaiwen Leon Song, Hongwen Dai, Albert Sidelnik, Siva Kumar Sastry Hari, and Huiyang
Zhou. 2015. Locality-Driven Dynamic GPU Cache Bypassing. In Proceedings of the 29th ACM
on International Conference on Supercomputing (ICS ’15). ACM, New York, NY, USA, 67–77.
DOI:http://dx.doi.org/10.1145/2751205.2751237
Lingda Li, Ari B. Hayes, Shuaiwen Leon Song, and Eddy Z. Zhang. 2016. Tag-Split Cache for Efficient
GPGPU Cache Utilization. In Proceedings of the 2016 International Conference on Supercomputing (ICS
’16). ACM, New York, NY, USA, Article 43, 12 pages. DOI:http://dx.doi.org/10.1145/2925426.2926253
Milo M. K. Martin, Mark D. Hill, and Daniel J. Sorin. 2012. Why On-chip Cache Coherence is Here to Stay.
Commun. ACM 55, 7 (July 2012), 78–89. DOI:http://dx.doi.org/10.1145/2209249.2209269
Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N.
Patt. 2011. Improving GPU Performance via Large Warps and Two-level Warp Scheduling. In Proceed-
ings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44). ACM,
New York, NY, USA, 308–317. DOI:http://dx.doi.org/10.1145/2155620.2155656
Nvidia. 2009. NVIDIA’s Next Generation CUDA Compute Architecture: Fermi. Technical Report. Nvidia
Corporation. http://www.nvidia.co.uk/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Nvidia. 2012. NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110.
Technical Report. Nvidia Corporation. http://www.nvidia.co.uk/content/PDF/kepler/
NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
Nvidia. 2014. CUDA by example - Errata, June 2014. http://developer.nvidia.com/
cuda-example-errata-page. (2014).
Nvidia. 2016. Parallel Thread Execution ISA, Version 4.3. (2016). http://docs.nvidia.com/cuda/
parallel-thread-execution
Jason Power, Arkaprava Basu, Junli Gu, Sooraj Puthoor, Bradford M. Beckmann, Mark D. Hill, Steven K.
Reinhardt, and David A. Wood. 2013. Heterogeneous System Coherence for Integrated CPU-GPU Sys-
tems. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO-46). ACM, New York, NY, USA, 457–467. DOI:http://dx.doi.org/10.1145/2540708.2540747
Govindan Ravindran and Michael Stumm. 1997. A Performance Comparison of Hierarchical Ring- and
Mesh-Connected Multiprocessor Networks. In Proceedings of the 3rd IEEE Symposium on High-Performance
Computer Architecture (HPCA ’97). IEEE Computer Society, Washington, DC, USA, 58–.
http://dl.acm.org/citation.cfm?id=548716.822685
Minsoo Rhu, Michael Sullivan, Jingwen Leng, and Mattan Erez. 2013. A Locality-aware Memory Hi-
erarchy for Energy-efficient GPU Architectures. In Proceedings of the 46th Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 86–98.
DOI:http://dx.doi.org/10.1145/2540708.2540717
Timothy G. Rogers, Mike O’Connor, and Tor M. Aamodt. 2012. Cache-Conscious Wavefront
Scheduling. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium
on Microarchitecture (MICRO-45). IEEE Computer Society, Washington, DC, USA, 72–83.
DOI:http://dx.doi.org/10.1109/MICRO.2012.16
Timothy G. Rogers, Mike O’Connor, and Tor M. Aamodt. 2013. Divergence-aware Warp Scheduling. In Pro-
ceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46).
ACM, New York, NY, USA, 99–110. DOI:http://dx.doi.org/10.1145/2540708.2540718
Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen
Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan,
and Pat Hanrahan. 2008. Larrabee: A Many-core x86 Architecture for Visual Computing. In ACM
SIGGRAPH 2008 Papers (SIGGRAPH ’08). ACM, New York, NY, USA, Article 18, 15 pages.
DOI:http://dx.doi.org/10.1145/1399504.1360617
Inderpreet Singh, Arrvindh Shriraman, Wilson W. L. Fung, Mike O’Connor, and Tor M. Aamodt. 2013.
Cache coherence for GPU architectures. 2013 IEEE 20th International Symposium on High Performance
Computer Architecture (HPCA) (2013), 578–590. http://doi.ieeecomputersociety.org/10.1109/HPCA.2013.6522351
John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Liwen Chang, Geng Liu, and Wen-Mei W.
Hwu. 2012. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing.
Technical Report IMPACT-12-01. University of Illinois at Urbana-Champaign, Urbana.
David Tarjan, Jiayuan Meng, and Kevin Skadron. 2009. Increasing Memory Miss Tolerance for SIMD Cores.
In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC
’09). ACM, New York, NY, USA, Article 22, 11 pages. DOI:http://dx.doi.org/10.1145/1654059.1654082
David Tarjan and Kevin Skadron. 2010. The Sharing Tracker: Using Ideas from Cache Coherence Hardware
to Reduce Off-Chip Memory Traffic with Non-Coherent Caches. In Proceedings of the 2010 ACM/IEEE
International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’10).
IEEE Computer Society, Washington, DC, USA, 1–10. DOI:http://dx.doi.org/10.1109/SC.2010.54
Yi Yang, Ping Xiang, Mike Mantor, and Huiyang Zhou. 2012. CPU-assisted GPGPU on Fused CPU-
GPU Architectures. In Proceedings of the 2012 IEEE 18th International Symposium on High-
Performance Computer Architecture (HPCA ’12). IEEE Computer Society, Washington, DC, USA, 1–12.
DOI:http://dx.doi.org/10.1109/HPCA.2012.6168948
A. Yazdanbakhsh, B. Thwaites, H. Esmaeilzadeh, G. Pekhimenko, O. Mutlu, and T. C. Mowry. 2016. Miti-
gating the Memory Bottleneck With Approximate Load Value Prediction. IEEE Design Test 33, 1 (Feb
2016), 32–42. DOI:http://dx.doi.org/10.1109/MDAT.2015.2504899
ACM Transactions on Architecture and Code Optimization, Vol. 13, No. 4, Article 39, Publication date: December 2016.
