In recent years, machine intelligence (MI) applications have emerged as a major driver for the computing industry. Optimizing these workloads is important, but complicated. As memory demands grow and data movement overheads increasingly limit performance, determining the best GPU caching policy to use for a diverse range of MI workloads represents one important challenge. To study this, we evaluate 17 MI applications and characterize their behavior using a range of GPU caching strategies. In our evaluations, we find that the choice of caching policy in GPU caches involves multiple performance trade-offs and interactions, and there is no one-size-fits-all GPU caching policy for MI workloads. Based on detailed simulation results, we motivate and evaluate a set of cache optimizations that consistently match the performance of the best static GPU caching policies.
I. INTRODUCTION
In recent years, MI has emerged as an important driver for the computing industry. This has motivated a large amount of work optimizing hardware for MI, especially for Convolutional Neural Networks (CNNs) (e.g., [13]-[27]). Although these works have led to significant improvements in performance and energy efficiency of CNNs on modern multi-core CPUs, GPUs, and accelerators, it is challenging to analyze how future architectures will perform for these workloads. Here we focus on GPUs, as they are widely used for running MI workloads in numerous domains.
Although many MI systems use large discrete noncoherent GPUs, the emerging trend is to unify the CPU-GPU memory system [37]. Such a system is easier to program and can significantly reduce unnecessary data movement when GPU kernel launches are frequent, as can be the case with many MI workloads. However, implementing efficient coherent caches between CPUs and GPUs remains a significant challenge. GPUs use a coherence strategy which prioritizes memory throughput and scalability, sometimes at the cost of cache reuse. In an effort to better understand the trade-offs of different caching strategies, we evaluate the performance of MI applications with multiple levels of GPU caching enabled using the publicly available AMD gem5 APU simulator [3].
We find that there is no best performing caching policy for all MI workloads. Although caching can significantly improve performance by enabling local data reuse, in some cases it can lead to harmful cache stalls and DRAM row locality disruption. Motivated by these results, we model and evaluate three microarchitectural optimizations which work together to mitigate the caching inefficiencies encountered by MI workloads. The first optimization avoids blocking for cache allocation, which reduces cache stalls. The second optimization applies a state-of-the-art CPU cache rinsing technique [41] to the last-level GPU cache to improve row buffer locality. Finally, we use a PC-based bypass prediction technique [40] to address remaining caching overheads while still caching accesses that can benefit from reuse. Together, these optimizations achieve the benefits of GPU cache reuse while minimizing caching overheads for these important MI workloads.
II. MI BACKGROUND
Although there are many different MI methods, in this work we focus on deep neural networks (DNNs), which are some of the most commonly used MI workloads and are well-supported by the MIOpen library [29]. In state-of-the-art MI kernels, the tiling pattern, work item/work group parallelism, and scratchpad memory usage can vary even for a given layer based on what the framework determines is optimal for the target platform, making it difficult to generalize about the specific memory demands of any class of MI tasks. However, we summarize the high-level memory properties of the layers studied below; a more detailed discussion can be found in an extended version of this work [1].
• Activation: Simple elementwise operations with low reuse.
• Fully connected: High reuse between distant elements, compute intensive.
• Convolutional: Fewer parameters and less computationally intensive than fully connected, but has reuse between adjacent elements. These layers generally dominate DNN execution time.
• Pooling: Low reuse (depending on stride), unbalanced read and write counts.
• Normalization: High reuse (depending on normalization dimension).
• RNN: Similar to fully connected, but weights are common among layers, leading to more potential reuse.
III. CPU-GPU CACHE COHERENCE BACKGROUND
In this work, we explore the costs and benefits of caching in GPU MI workloads by simulating three caching policies that differ in how loads and stores are handled in GPU caches:
• Uncached: Loads and stores bypass all GPU caches.
• CacheR: Loads are cached in L1 and L2, but stores bypass all GPU caches.
• CacheRW: Loads are cached in L1 and L2, stores bypass L1 and are combined in L2.
When load caching is disabled, read requests to the same cache line may be coalesced while the original bypass request is pending, but on a response the data is forwarded without being inserted in the cache. When load caching is enabled (CacheR, CacheRW), the GPU L1 and L2 caches always self-invalidate valid data at synchronization points (e.g., kernel boundaries) [47] [49]. When store caching is enabled (CacheRW), stores still bypass the L1 but they may be delayed and coalesced at the L2 until a flush of all L2 dirty data is triggered at a system-scope synchronization point, at which time they are written back to memory [48].
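The three policies can be summarized as a small lookup from request type to per-level handling. The following is only an illustrative sketch of the descriptions above, not the simulator's implementation; the names `Policy` and `handling` and the action strings are assumptions introduced here.

```python
from enum import Enum


class Policy(Enum):
    UNCACHED = "Uncached"
    CACHE_R = "CacheR"
    CACHE_RW = "CacheRW"


def handling(policy: Policy, op: str) -> dict:
    """Return the (hypothetical) action taken at each GPU cache level.

    op is "load" or "store". Action names are illustrative only.
    """
    if op == "load":
        # Loads allocate in both levels whenever load caching is enabled.
        action = "bypass" if policy is Policy.UNCACHED else "allocate"
        return {"L1": action, "L2": action}
    if op == "store":
        # Stores always bypass the L1; only CacheRW coalesces them at the L2.
        l2 = "coalesce" if policy is Policy.CACHE_RW else "bypass"
        return {"L1": "bypass", "L2": l2}
    raise ValueError(f"unknown operation: {op!r}")
```

For example, `handling(Policy.CACHE_RW, "store")` yields a bypass at the L1 and coalescing at the L2, matching the CacheRW description.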
IV. METHODOLOGY

A. The gem5 Simulator
To analyze MI workloads, we leverage the gem5 simulator [2] [3]. In this work we focus on simulating open-source MI applications that use the MIOpen library [29] on top of the ROCm stack, including workloads from DNNMark [5], DeepBench [6] [7], and MIOpen-benchmark [8]. Details of required application and simulator changes to run our experiments can be found in an extended version of this work [1].
B. System Configuration
Table 1 lists the key system parameters we simulate in gem5. Figure 1 shows the conceptual system design, which includes a 64-CU GPU with two levels of cache [30]. Our simulated GPU CU pipeline is based on AMD's GCN architecture [38] and uses the GCN3 ISA [31].
C. Applications
Table 2 shows the seventeen MI benchmarks that we studied. These MI benchmarks come from several popular MI suites: DNNMark [5], DeepBench [6] [7], and MIOpen-benchmark [8]. We selected these benchmarks because they cover many different types of CNN and RNN layers and full NNs. A more detailed discussion of these applications can be found in an extended version of this work [1].
V. CACHING CHARACTERIZATION
Figure 2 shows the execution time of each caching policy described in Section III for all applications, normalized to Uncached. Figure 3 shows the number of memory accesses that reach the DRAM controller, also normalized to Uncached. Overall, our results show that the best performing caching policy varies widely depending on the available
A. Caching Benefits: Reuse
The main benefit of caching is that it enables cache reuse and therefore lower latency and higher bandwidth access to data. Layers with limited connectivity such as the pooling, convolution, and some normalization layers show limited benefit mainly because reuse is primarily between nearby work items and can be exploited even when caching is disabled. However, for workloads with higher connectivity, where reuse is possible between distant work items (e.g., FwFC, FwBN, FwBwGRU, and FwBwLSTM), we find that read caching can reduce memory demand by up to 93%. When the accesses that experience reuse are critical to performance (RS workloads), this reuse can also reduce execution time by up to 29%. In addition, write caching can further reduce memory demand by up to 71% and execution time by up to 32% for RS workloads which exhibit high potential for write coalescing at L2 such as BwPool and BwBN.
B. Caching Overheads
Although caching improves performance in many cases, for TS workloads it can increase execution time by up to 24%. We find that caching overheads manifest themselves primarily as 1) cache stalls due to added contention for cache resources, and 2) reduced DRAM row locality for requests that have been delayed in the caches. Although omitted here for space, detailed data on these overheads is included in an extended version of this paper [1].
1) Cache Stalls: Cached requests require allocation on a miss, and this may cause stalls if all lines in the set are in a busy state (e.g., waiting for a pending load). In addition, coherence operations can add contention for shared resources such as tag and data arrays (e.g., due to failed cache allocation). High cache stall counts lead to worse execution time for FwAct and BwAct when read caching is enabled. FwPool also experiences high cache stalls, although negative effects here are offset by the added reuse achieved through caching.
2) DRAM Row Locality: Enabling read or write caching adds variability to memory access times through cache stalls described in Section V.B.1) or by delaying stores at the L2 so they can be coalesced (CacheRW). MI applications tend to have regular access patterns, and as a result enabling caching can interfere with this regularity and hurt DRAM row hit rates. In particular, FwPool, FwAct, FwLRN, and BwAct suffer from this effect. Although this effect in FwPool is outweighed by the benefits of cache reuse, for the TS workloads it contributes to a net performance degradation for caching configurations.
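To illustrate why perturbing a regular access stream hurts row locality, the sketch below computes a row-buffer hit rate under simplifying assumptions that are not taken from the paper: a single DRAM bank with an open-row policy, 2 KB rows, and 64-byte lines. The function name `row_hit_rate` is hypothetical.

```python
def row_hit_rate(addresses, row_bytes=2048):
    """Fraction of accesses that hit the currently open row.

    Assumes one bank with an open-row policy (an illustrative simplification).
    """
    if not addresses:
        return 0.0
    open_row = None
    hits = 0
    for addr in addresses:
        row = addr // row_bytes
        if row == open_row:
            hits += 1
        open_row = row  # the accessed row stays open for the next request
    return hits / len(addresses)


# A regular stream of 64-byte lines covering two consecutive rows.
in_order = [i * 64 for i in range(64)]

# The same requests, but with accesses to the two rows interleaved, as might
# happen when some requests are delayed by stalls or write coalescing.
interleaved = [a for pair in zip(in_order[:32], in_order[32:]) for a in pair]

print(row_hit_rate(in_order))     # 0.96875 (62 of 64 accesses hit)
print(row_hit_rate(interleaved))  # 0.0 (every access switches rows)
```

The two streams contain identical requests; only their arrival order at the memory controller differs.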
VI. CACHING OPTIMIZATIONS FOR MI APPLICATIONS
Motivated by the caching overheads we observe in GPU MI workloads, we next describe three potential architectural optimizations and evaluate their effect on performance. All are applied to the most aggressive caching policy, CacheRW, and are compared against the best and worst performing static configurations as measured in Figure 2. Figure 4 reports the normalized execution time for these optimizations.
A. Allocation Bypass
We begin by adapting our caching policies to convert cached requests to bypass requests whenever allocation would require blocking. This allocation bypass optimization is plotted as CacheRW-AB in Figure 4.
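A minimal sketch of the allocation decision on a miss follows. It assumes a simplified set model in which a line is "busy" while its fill is pending and the first non-busy line is chosen as the victim; the names `Line` and `handle_miss` are hypothetical, and a real cache would apply its replacement policy instead.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    tag: Optional[int] = None
    busy: bool = False  # True while a fill for this line is still pending


def handle_miss(cache_set: List[Line], tag: int, allocation_bypass: bool) -> str:
    """Decide what happens to a missing request in one cache set."""
    for line in cache_set:
        if not line.busy:
            # A victim is available: allocate and wait for the fill.
            line.tag = tag
            line.busy = True
            return "allocate"
    # Every line in the set is busy. The baseline blocks until one frees up;
    # with allocation bypass the request is converted to a bypass request.
    return "bypass" if allocation_bypass else "stall"
```

With a two-way set and two fills outstanding, a third miss returns `"stall"` in the baseline and `"bypass"` when `allocation_bypass` is set.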
The non-blocking caching optimization reduces cache stalls per request significantly, but it has a minimal effect on overall performance for most applications. This can be explained by the fact that allocation bypassing does little to reduce the added congestion overhead, and in some cases adds to it by eliminating a throttling effect from the L1 cache level (this increases execution time by 7% for BwPool). The main exception is FwLRN, which sees significant benefits due to improved DRAM row hit ratios.
[Figure 3: Normalized DRAM accesses for the Uncached, CacheR, and CacheRW policies.]
B. Row Locality-Aware Cache Rinsing
Although allocation bypass avoids row locality disruption caused by blocking allocation operations, it does not avoid disruption due to L2 write coalescing. To address this, we next add a row locality-aware cache rinsing scheme based on a method originally proposed for CPUs by Seshadri et al. [41]. This technique adds a dirty block index to the GPU L2 that tracks dirty blocks in each DRAM row. Whenever a dirty block is evicted, a writeback of all other dirty blocks in that row is triggered.
We add this cache rinsing optimization on top of the allocation bypassing optimization, denoted CacheRW-CR in Figure 4. Cache rinsing counteracts caching's detrimental effect on DRAM row locality for affected (mainly TS) workloads, offering DRAM row hit rates that are even higher than those of the best static configuration. For example, as Figure 4 shows, cache rinsing reduces the performance overheads of caching for BwAct and FwAct.
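The rinsing mechanism can be sketched as a map from DRAM row to the set of dirty lines in that row. This is a simplified illustration, not the evaluated design: the class name `RinsingL2`, the 2 KB row and 64-byte line sizes, and the direct address-to-row mapping (no channel or bank interleaving) are all assumptions.

```python
from collections import defaultdict


class RinsingL2:
    """Toy dirty block index: tracks dirty lines per DRAM row."""

    def __init__(self, row_bytes=2048, line_bytes=64):
        self.row_bytes = row_bytes
        self.line_bytes = line_bytes
        self.dirty_by_row = defaultdict(set)  # row number -> dirty line addresses

    def _line(self, addr):
        return addr - addr % self.line_bytes

    def write(self, addr):
        """Record that the line containing addr is dirty in the L2."""
        line = self._line(addr)
        self.dirty_by_row[line // self.row_bytes].add(line)

    def evict_dirty(self, addr):
        """Evict a line; return every writeback this triggers, in address order.

        If the line is dirty, all other dirty lines in the same DRAM row are
        written back with it (the rinse). A clean line triggers nothing.
        """
        line = self._line(addr)
        row = line // self.row_bytes
        if line not in self.dirty_by_row.get(row, ()):
            return []
        return sorted(self.dirty_by_row.pop(row))
```

Because the rinsed writebacks all target one row, they can reach DRAM back to back while that row is open, which is the source of the improved row hit rate.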
C. PC-Based L2 Bypassing
We next attempt to address remaining performance overheads due to caching by predicting whether caching will be beneficial (i.e., whether cache reuse is likely), then dynamically choosing to use cached requests and incur the resulting overheads only when that is the case. Past work has explored this concept for adaptive load bypassing at the L1, proposing a PC-based reuse predictor to avoid cache pollution and more effectively use limited cache space [40]; we instead apply the same PC-based technique to the L2, for both loads and stores, to avoid congestion overheads when reuse is unlikely.
We apply PC-based L2 bypassing on top of the allocation bypassing and cache rinsing optimizations and denote it as CacheRW-PCby in Figure 4 . Overall, it is effective at predicting reuse for MI workloads. For nearly all workloads, the combination of allocation bypassing, cache rinsing, and PC-based bypassing matches or exceeds the performance of the best static cache configuration by selectively incurring cache overheads when they are expected to be beneficial.
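One common way to build a PC-based reuse predictor is a table of saturating counters indexed by the instruction's PC. The sketch below shows that general idea only; the table size, counter width, threshold, index function, and the class name `PCBypassPredictor` are assumptions and do not describe the predictor of [40] or the configuration evaluated here.

```python
class PCBypassPredictor:
    """Toy PC-indexed table of saturating reuse counters."""

    def __init__(self, entries=256, bits=2, threshold=2):
        self.entries = entries
        self.max_count = (1 << bits) - 1
        self.threshold = threshold
        # Start at the threshold so untrained PCs are cached by default.
        self.counters = [threshold] * entries

    def _index(self, pc):
        # Drop the low bits of the PC, then fold into the table.
        return (pc >> 2) % self.entries

    def should_cache(self, pc):
        """Predict reuse: cache the request if the counter is high enough."""
        return self.counters[self._index(pc)] >= self.threshold

    def train(self, pc, reused):
        """Update the counter when a line inserted by this PC leaves the cache."""
        i = self._index(pc)
        if reused:
            self.counters[i] = min(self.max_count, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

A practical design also needs some way to keep observing reuse for PCs whose requests are currently bypassed (for example, by sampling), since bypassed lines otherwise never generate training events; that part is left out of this sketch.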
VII. RELATED WORK
There have been multiple prior efforts to enable efficient caching and coherence in GPUs [47] [44]. Flexible coherence request types [54] and cross-layer coordination of scheduling and memory management [55] have also been shown to improve cache efficiency by matching caching policies to GPU workload demands. In contrast, the primary contribution of this work is to characterize the sources of cache inefficiency in GPU MI workloads and to describe techniques that address the specific caching overheads in this important domain.
Prior work has simulated MI workloads on in-house simulators [32] [34] [35] [36], using analytical models [32] [33], or on discrete GPU simulators [28] [45] [46]. In contrast, this work evaluates MI workloads on a cycle-level, publicly available simulator that does not suffer from the inaccuracies that arise with higher-level ISAs [4].
VIII. CONCLUSIONS AND FUTURE WORK
In this work we find that caching reads and writes has mixed behavior for MI workloads. For some workloads, it improves performance by up to 29% through increased cache reuse, but for others it degrades performance by up to 24% by incurring cache stalls and interfering with DRAM row locality. Based on the detailed performance overheads we identify by running open-source MI workloads on the AMD gem5 APU simulator, we implement and evaluate a set of adaptive GPU cache optimizations in gem5. These optimizations allow us to leverage the benefits of caching when it is helpful while avoiding performance overheads when caching hurts.
IX. ACKNOWLEDGEMENTS
The authors would like to thank Gabe Loh for his feedback and insights. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. 
