This paper summarizes the idea of Memory Divergence Correction (MeDiC), which was published at PACT 2015 [6], and examines the work's significance and future potential. In a modern GPU architecture, all threads within a warp execute the same instruction in lockstep. For a memory instruction, this can lead to memory divergence: the memory requests for some threads are serviced early, while the remaining requests incur long latencies. This divergence stalls the warp, as it cannot execute the next instruction until all requests from the current instruction complete.
Introduction
Graphics Processing Units (GPUs) have enormous parallel processing power to leverage thread-level parallelism. GPU applications are usually broken down into thousands of threads, allowing GPUs to use fine-grained multithreading [128, 136] to prevent GPU cores from stalling due to dependencies and long memory latencies. Ideally, there should always be available threads for GPU cores to continue execution, preventing stalls within the core. GPUs also take advantage of the SIMD (Single Instruction, Multiple Data) execution model [30]. The thousands of threads within a GPU application are clustered into thread blocks, each of which contains multiple smaller bundles (warps) of threads that run concurrently. Each thread in a warp executes the same instruction on a different piece of data. A warp completes an instruction when all threads in the warp complete the instruction.
While many GPGPU applications can tolerate a significant amount of memory latency due to their parallelism and the use of fine-grained multithreading, memory divergence (where the threads of a warp reach a memory instruction, and some of the threads' memory requests take longer to service than others) can significantly increase the stall time of a warp [51, 52, 63, 75, 89, 101, 116, 117, 155]. Because all threads within a warp operate in lockstep due to the SIMD execution model, the warp cannot proceed to the next instruction until the slowest request within the warp completes. Figures 1a and 1b show examples of memory divergence within a warp. Figure 1a shows a mostly-hit warp, where most of the warp's memory accesses hit in the cache (①). Only a single access misses in the cache and must go to main memory (②). As a result, the entire warp is stalled until the much longer cache miss completes. Figure 1b shows a mostly-miss warp, where most of the warp's memory requests miss in the cache (③), resulting in many accesses to main memory. Even though some requests are cache hits (④), these do not benefit the execution time of the warp, since the execution of the warp ends when the slowest thread finishes the instruction.
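The lockstep constraint can be illustrated with a minimal sketch. The two latency values (a 20-cycle cache hit and a 300-cycle main memory access) and the eight-request warps are assumptions chosen for illustration only; they are not measurements from [6].

```python
# Illustrative latencies in cycles; both values are assumptions.
HIT_LATENCY = 20
MISS_LATENCY = 300


def warp_stall_time(hits):
    """Return the stall time of one warp for one memory instruction.

    `hits` holds one boolean per memory request (True = cache hit).
    The warp waits for its slowest request, so the stall is the maximum.
    """
    return max(HIT_LATENCY if h else MISS_LATENCY for h in hits)


mostly_hit = [True] * 7 + [False]       # one miss stalls the whole warp
mostly_miss = [False] * 6 + [True] * 2  # the two hits do not help
all_hit = [True] * 8
all_miss = [False] * 8
```

Under these assumptions, the mostly-hit, mostly-miss, and all-miss warps all stall for the full miss latency, and only the all-hit warp stalls for the hit latency. This is why converting a single miss into a hit helps a mostly-hit warp, while converting the hits of a mostly-miss warp into misses costs it nothing.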
Based on these three observations, we aim to devise a mechanism that has two major goals: (1) convert mostly-hit warps into all-hit warps (warps where all requests hit in the cache, as shown in Figure 1c), and (2) convert mostly-miss warps into all-miss warps (warps where none of the requests hit in the cache, as shown in Figure 1d). As we can see in Figure 1a, the stall time due to memory divergence for the mostly-hit warp can be eliminated by converting only the single cache miss (②) into a hit. Doing so requires additional cache space. If we convert the two cache hits of the mostly-miss warp (Figure 1b, ④) into cache misses, we can allocate the cache space previously used by these hits to the mostly-hit warp, thus converting the mostly-hit warp into an all-hit warp. Though the mostly-miss warp is now an all-miss warp (Figure 1d), it incurs no extra stall penalty, as the warp was already waiting on the other six cache misses to complete.
Figure 1: (a) and (b) show the heterogeneity between mostly-hit and mostly-miss warps, respectively. (c) and (d) show the change in stall time from converting mostly-hit warps into all-hit warps, and mostly-miss warps into all-miss warps, respectively. Reproduced from [6].
Moreover, now that it is an all-miss warp, we can predict that its future memory requests will also not be in the L2 cache. Based on this prediction, we can simply have these requests bypass the cache. By doing so, the requests from the all-miss warp can completely avoid unnecessary L2 access and queuing delays, and enable the use of L2 cache bandwidth and buffer space by warps that benefit from the L2 cache. This decreases the total number of requests going to the L2 cache, thus reducing the queuing latencies for requests from mostly-hit and all-hit warps, as there is less contention.
Observation on GPU Memory Divergence
We make three new key observations about memory divergence (at the shared L2 cache). First, we observe that the degree of memory divergence can differ across warps (as illustrated in Figure 1). This inter-warp heterogeneity affects how well each warp takes advantage of the shared cache. Second, we observe that a warp's memory divergence behavior tends to remain stable for long periods of execution, making it predictable. Third, we observe that requests to the shared cache experience long queuing delays due to the large amount of parallelism in GPGPU programs, which exacerbates the memory divergence problem and slows down GPU execution. Next, we describe each of these observations in detail and motivate our solutions.
Memory Divergence Heterogeneity
There is heterogeneity across warps in the degree of memory divergence experienced by each warp at the shared L2 cache. Figures 1a and 1b show examples of two different types of warps that exhibit different degrees of memory divergence.
We observe that different warps have different amounts of sensitivity to memory latency and cache utilization. We study the cache utilization of a warp by determining its hit ratio, the percentage of memory requests that hit in the cache when the warp issues a single memory instruction. As Figure 2 shows, the warps from each of our three representative GPGPU applications are distributed across all possible ranges of hit ratio, exhibiting significant heterogeneity. To better characterize warp behavior, we break the warps down into the five types shown in Figure 3 based on their hit ratios: all-hit, mostly-hit, balanced, mostly-miss, and all-miss. MeDiC provides two mechanisms, warp-type-aware cache bypassing and a warp-type-aware cache insertion policy, in order to convert mostly-hit warps into all-hit warps, where all requests in the warp hit in the cache, thereby significantly reducing the stall time of mostly-hit warps. This is done at the cost of converting the mostly-miss warps into all-miss warps, but doing so does not increase the stall time of such warps. To speed up uncacheable cache misses from mostly-hit warps, the warp-type-aware memory scheduling policy in MeDiC prioritizes memory requests from mostly-hit warps over memory requests from mostly-miss warps.
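The mapping from hit ratio to warp type can be sketched as follows. The text above states only that the five types are derived from the hit ratio (Figure 3), so the two cut-off values `LOW_THRESHOLD` and `HIGH_THRESHOLD` are hypothetical parameters introduced for this sketch, as are the function names and the 32-request warps in the usage comment.

```python
# Hypothetical cut-offs separating mostly-miss / balanced / mostly-hit.
LOW_THRESHOLD = 0.2
HIGH_THRESHOLD = 0.7


def hit_ratio(hits):
    """Fraction of a warp's requests that hit for one memory instruction."""
    return sum(hits) / len(hits)


def classify_warp(ratio, low=LOW_THRESHOLD, high=HIGH_THRESHOLD):
    """Map a hit ratio in [0, 1] to one of the five warp types."""
    if ratio == 1.0:
        return "all-hit"
    if ratio == 0.0:
        return "all-miss"
    if ratio >= high:
        return "mostly-hit"
    if ratio <= low:
        return "mostly-miss"
    return "balanced"


# Example with 32 requests per warp (an assumed warp size).
one_miss = [True] * 31 + [False]
two_hits = [False] * 30 + [True] * 2
```

With these assumed cut-offs, `classify_warp(hit_ratio(one_miss))` yields a mostly-hit warp and `classify_warp(hit_ratio(two_hits))` yields a mostly-miss warp; only warps with no misses or no hits at all fall into the all-hit and all-miss types.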
Memory Divergence Stability Over Time
A warp tends to retain its memory divergence behavior (e.g., whether it is mostly-hit or mostly-miss) for long periods of execution, and is thus predictable. This is due to the spatial and temporal locality of each thread within the warp. Figure 4 shows a sample of warps from a representative application (i.e., BFS [10]) that shows this predictability. This predictability enables us to perform history-based warp divergence characterization.
High Queuing Latencies at the Shared Cache
Due to the amount of thread parallelism within a GPU, a large number of memory requests can arrive at the L2 cache in a small window of execution time, leading to significant queuing delays. Prior work observes high access latencies for the shared L2 cache within a GPU [126, 127, 142], but does not identify why these latencies are so high. We show that when a large number of requests arrive at the L2 cache, both the limited number of read/write ports and backpressure from cache bank conflicts force many of these requests to queue up for long periods of time. We observe that this queuing latency can sometimes add hundreds of cycles to the cache access latency, and that non-uniform queuing across the different cache banks exacerbates memory divergence. Figure 5 quantifies the magnitude of this queue contention if we set the cache lookup latency at one cycle, for one application, BFS [10]. As shown, a significant number of requests experience tens to hundreds of cycles of queuing delay.
Figure 5: Distribution of per-request queuing latencies for L2 cache requests from BFS (x-axis: queuing time in cycles; y-axis: fraction of L2 requests). Reproduced from [6].
The warp-type-aware bypassing logic in MeDiC helps to alleviate these L2 queuing latencies. By preventing mostly-miss and all-miss warps from accessing the cache, which yields little or no benefit to them, we reduce the access latencies for requests from (1) mostly-hit and all-hit warps, which benefit from the cache much more; and also (2) mostly-miss and all-miss warps themselves; thereby improving the overall performance of all warps and the system.
MeDiC: Memory Divergence Correction
Based on these three new observations, we define three major goals for our new mechanism. We would like to devise a mechanism that (1) converts mostly-hit warps into all-hit warps (warps where all requests hit in the cache, as shown in Figure 1c), (2) converts mostly-miss warps into all-miss warps (warps where none of the requests hit in the cache, as shown in Figure 1d), and (3) reduces L2 cache queuing delay for all warp types. As we can see in Figure 1a, the stall time due to memory divergence for the mostly-hit warp can be eliminated by converting only the single cache miss (Figure 1a, ②) into a cache hit.
To this end, we introduce Memory Divergence Correction (MeDiC), a GPU-specific mechanism that exploits memory divergence heterogeneity across warps at the shared cache and at main memory to improve the overall performance of GPGPU applications. MeDiC consists of three different components, which work together to achieve our three goals: (1) a warp-type-aware cache bypassing mechanism, which prevents requests from mostly-miss and all-miss warps from accessing the shared L2 cache; (2) a warp-type-aware cache insertion policy, which prioritizes requests from mostly-hit and all-hit warps, in order to increase the likelihood that they all become cache hits; and (3) a warp-type-aware memory scheduling mechanism, which prioritizes requests from mostly-hit warps that were not successfully converted to all-hit warps, in order to minimize the stall time due to divergence. These three components are all driven by an online mechanism that can identify the expected memory divergence behavior of each warp.
Figure 6 shows the overall MeDiC mechanism. Counting the identification mechanism, MeDiC has four parts: (①) a warp-type identification mechanism that classifies warps into one of the five warp types described in Section 2.1; (②) a bypass mechanism that bypasses requests from all-miss and mostly-miss warps, reducing the queuing delay in the L2 cache; (③) an insertion policy that helps prevent blocks requested by mostly-hit warps from being evicted from the cache; and (④) a memory scheduler that prioritizes requests from mostly-hit warps, which are more latency sensitive.
Warp Type Identification
In order to take advantage of the memory divergence heterogeneity across warps, we must first add hardware that can identify the divergence behavior of each warp. The key idea is to periodically sample the hit ratio of a warp, and to classify the warp's divergence behavior as one of the five types in Figure 3 based on the observed hit ratio. This information can then be used to drive the warp-type-aware components of MeDiC. In general, warps tend to retain the same memory divergence behavior for long periods of execution. However, there can be some long-term shifts in warp divergence behavior, requiring periodic resampling of the hit ratio to potentially re-evaluate the warp type. Warp type identification through hit ratio sampling requires hardware within the cache to periodically count the number of hits and misses each warp incurs. We append two counters to the metadata stored for each warp, which represent the total number of cache hits and cache accesses for the warp during the sampling interval.
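A software model of the two per-warp counters might look like the sketch below. The class name `HitRatioSampler`, the choice to measure the sampling interval in cache accesses, and the default interval length are all assumptions made for illustration; the text above specifies only that hits and accesses are counted per warp over a sampling interval.

```python
class HitRatioSampler:
    """Models the two per-warp counters: cache hits and cache accesses."""

    def __init__(self, interval=32):
        self.interval = interval  # accesses per sampling interval (assumed unit)
        self.hits = 0
        self.accesses = 0
        self.last_ratio = None    # no sample is available before the first interval ends

    def record(self, hit):
        """Count one cache access and return the most recent sampled hit ratio."""
        self.accesses += 1
        if hit:
            self.hits += 1
        if self.accesses == self.interval:
            # The interval is over: publish the sample and restart both counters.
            self.last_ratio = self.hits / self.accesses
            self.hits = 0
            self.accesses = 0
        return self.last_ratio
```

Resetting both counters at the end of each interval is what lets the sampled ratio follow long-term shifts in a warp's behavior, while the ratio published between resamples stays fixed, matching the observation that divergence behavior is stable over long periods.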
Warp-type-aware Shared Cache Bypassing
Once the warp type is known and a warp generates a request to the L2 cache, our mechanism first decides whether to bypass the cache based on the warp type. The key idea behind warp-type-aware cache bypassing is to convert mostly-miss warps into all-miss warps, as they do not benefit greatly from the few cache hits that they get. By bypassing these requests, we achieve three benefits: (1) bypassed requests can avoid L2 queuing latencies entirely, (2) other requests that do hit in the L2 cache experience shorter queuing delays due to the reduced contention, and (3) space is created in the L2 cache for mostly-hit warps.
The cache bypassing logic must make a simple decision: if an incoming memory request was generated by a mostly-miss or all-miss warp, the request is bypassed directly to DRAM. This is determined by reading the warp type stored in the warp metadata from the warp type identification mechanism. A simple 2-bit demultiplexer can be used to determine whether a request is sent to the L2 bank arbiter, or directly to the DRAM request queue.
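The bypass decision reduces to a lookup on the warp type, which can be sketched in a few lines. The string labels for the warp types and for the two destinations are hypothetical names used only in this sketch.

```python
# Warp types whose requests skip the L2 cache, per the rule stated above.
BYPASS_TYPES = {"mostly-miss", "all-miss"}


def route_request(warp_type):
    """Return the destination of an L2-bound request based on its warp type."""
    if warp_type in BYPASS_TYPES:
        return "dram_request_queue"  # bypass: go directly to DRAM
    return "l2_bank_arbiter"         # normal path through the L2 cache
```

Because the decision depends only on metadata that is already stored per warp, it needs no cache lookup, which is what allows bypassed requests to avoid the L2 queues entirely.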
Warp-type-aware Cache Insertion Policy
Our cache bypassing mechanism frees up space within the L2 cache, which we want to use for the cache misses from mostly-hit warps (to convert the cache miss memory requests into cache hits). However, even with the new bypassing mechanism, other warps (e.g., balanced, mostly-miss) still insert some data into the cache. In order to aid the conversion of mostly-hit warps into all-hit warps, we develop a warp-type-aware cache insertion policy, whose key idea is to ensure that in a given cache set, data from mostly-miss warps are evicted first, while data from mostly-hit warps and all-hit warps are evicted last.
To ensure that a cache block from a mostly-hit warp stays in the cache for as long as possible, we insert the block closer to the MRU position. A cache block requested by a mostly-miss warp is inserted closer to the LRU position, making it more likely to be evicted. To track the warp type associated with these cache blocks, we add two bits of metadata to each cache block, indicating the warp type. These bits are then appended to the replacement policy bits. The bits modify the replacement policy behavior, such that a cache block from a mostly-miss warp is more likely to get evicted than a block from a balanced warp. Similarly, a cache block from a balanced warp is more likely to be evicted than a block from a mostly-hit or all-hit warp.
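The effect of warp-type-dependent insertion can be sketched with a single cache set kept as a recency-ordered list. This is a simplified software model, not the hardware encoding described above: the `INSERT_DEPTH` table, its values, and the `CacheSet` class are assumptions introduced for illustration.

```python
# Assumed insertion depth per warp type: 0.0 = MRU end, 1.0 = LRU end.
INSERT_DEPTH = {
    "all-hit": 0.0,
    "mostly-hit": 0.0,
    "balanced": 0.5,
    "mostly-miss": 1.0,
    "all-miss": 1.0,
}


class CacheSet:
    """One cache set; index 0 of `blocks` is MRU, the last index is LRU."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = []

    def insert(self, tag, warp_type):
        """Insert a block; return the evicted tag, or None if the set had room."""
        victim = None
        if len(self.blocks) == self.ways:
            victim = self.blocks.pop()  # evict from the LRU end
        position = int(INSERT_DEPTH[warp_type] * len(self.blocks))
        self.blocks.insert(position, tag)
        return victim


cache_set = CacheSet(ways=4)
cache_set.insert("A", "mostly-hit")
cache_set.insert("B", "mostly-miss")
cache_set.insert("C", "mostly-hit")
cache_set.insert("D", "balanced")
evicted = cache_set.insert("E", "mostly-hit")
```

In this run, block B from the mostly-miss warp is the first victim even though block A is older, because B was inserted at the LRU end while A was inserted at the MRU end.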
Warp-type-aware Memory Scheduler
Our cache bypassing mechanism and cache insertion policy work to increase the likelihood that all requests from a mostly-hit warp become cache hits, converting the warp into an all-hit warp. However, due to cache conflicts, or due to poor locality, there may still be cases when a mostly-hit warp cannot be fully converted into an all-hit warp, and is therefore unable to avoid stalling due to memory divergence as at least one of its requests has to go to DRAM. In such a case, we want to minimize the amount of time that this warp stalls. To this end, we propose a warp-type-aware memory scheduler that prioritizes the occasional DRAM requests from mostly-hit warps.
The design of our memory scheduler is very simple. Each memory request is tagged with a single bit, which is set if the memory request comes from a mostly-hit warp (or an all-hit warp, in case the warp was mischaracterized). We modify the request queue at the memory controller to contain two different queues, where a high-priority queue contains all requests that have their mostly-hit bit set to one. The low-priority queue contains all other requests, whose mostly-hit bits are set to zero. Each queue uses FR-FCFS [115, 156] as the scheduling policy; however, the scheduler always selects requests from the high-priority queue over requests in the low-priority queue (see footnote 1).
We describe each component of MeDiC in more detail in Sections 4.1, 4.2, 4.3 and 4.4 of our PACT 2015 paper [6].
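The two-queue design can be sketched as follows. The model is deliberately simplified and rests on stated assumptions: a single DRAM bank with one open row, FR-FCFS reduced to "oldest row-buffer hit first, otherwise oldest request", and hypothetical names (`Request`, `TwoQueueScheduler`, `fr_fcfs_pick`).

```python
from collections import namedtuple

# arrival: arrival order; row: DRAM row; mostly_hit: the single tag bit.
Request = namedtuple("Request", ["arrival", "row", "mostly_hit"])


def fr_fcfs_pick(queue, open_row):
    """Simplified FR-FCFS: oldest row-buffer hit if any, else oldest request."""
    row_hits = [r for r in queue if r.row == open_row]
    candidates = row_hits if row_hits else queue
    return min(candidates, key=lambda r: r.arrival)


class TwoQueueScheduler:
    def __init__(self):
        self.high = []  # requests with the mostly-hit bit set
        self.low = []   # all other requests

    def enqueue(self, request):
        (self.high if request.mostly_hit else self.low).append(request)

    def schedule(self, open_row):
        """Remove and return the next request; the high-priority queue goes first."""
        queue = self.high if self.high else self.low
        if not queue:
            return None
        chosen = fr_fcfs_pick(queue, open_row)
        queue.remove(chosen)
        return chosen
```

Keeping two separate queues means the priority decision is a check of whether the high-priority queue is empty, rather than a comparison across all pending requests.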
Methodology
We model our mechanism using GPGPU-Sim 3.2.1 [9]. We modified GPGPU-Sim to accurately model cache bank conflicts, and added the cache bypassing, cache insertion, and memory scheduling mechanisms needed to support MeDiC. We use GPUWattch [76] to evaluate power consumption. We have open sourced our simulator source code at [118]. We evaluate our system across multiple GPGPU applications from the CUDA SDK [102], Rodinia [19], MARS [39], and Lonestar [10] benchmark suites. We report performance results using the harmonic average of the IPC speedup (over the baseline GPU) of each kernel of each application (see footnote 2). Harmonic speedup [28, 85] was shown to reflect the average normalized execution time in multiprogrammed workloads. We calculate energy efficiency for each workload by dividing the IPC by the energy consumed. Section 5 of our PACT 2015 paper [6] provides more detail on our experimental methodology.
Figure 7 shows the performance of MeDiC compared to four GPU cache management mechanisms: the Evicted Address Filter insertion policy [123] (EAF), the PCAL bypassing policy [79] (PCAL), a PC-based cache bypassing policy (PC-Byp), and an idealized random bypassing policy (Rand), over 15 different GPGPU applications from 4 benchmark suites. We also show the performance of each individual component of MeDiC: our warp-type-aware insertion policy (WIP), our warp-type-aware memory scheduling policy (WMS), and our warp-type-aware bypassing policy (WByp).
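The two metrics named in the methodology can be written out as a short sketch. The function names and the sample IPC values in the usage note are assumptions for illustration; they are not data from the evaluation.

```python
def harmonic_speedup(baseline_ipc, new_ipc):
    """Harmonic mean of the per-kernel IPC speedups over the baseline."""
    speedups = [new / base for base, new in zip(baseline_ipc, new_ipc)]
    return len(speedups) / sum(1.0 / s for s in speedups)


def energy_efficiency(ipc, energy):
    """Energy efficiency of a workload: IPC divided by the energy consumed."""
    return ipc / energy
```

For two hypothetical kernels with speedups of 2.0 and 1.0, `harmonic_speedup([1.0, 1.0], [2.0, 1.0])` gives 4/3 rather than the arithmetic mean of 1.5, which shows how the harmonic mean limits the influence of a single kernel with an unusually high speedup.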
Evaluation
We found that each component of MeDiC individually provides significant performance improvement: WIP (32.5%), WMS (30.2%), and WByp (33.6%). MeDiC, which combines all three mechanisms, provides a 41.5% performance improvement over Baseline, on average. MeDiC matches or outperforms its individual components for all benchmarks except BP, where MeDiC has a higher L2 miss rate and lower row buffer locality than WMS and WByp.
Our insertion policy, WIP, outperforms EAF [123] by 12.2%. We observe that the key benefit of WIP is that cache blocks from mostly-miss warps are much more likely to be evicted. In addition, WIP reduces the cache miss rate of several applications. Our memory scheduler, WMS, provides significant performance gains (30.2%) over Baseline, because the memory scheduler prioritizes requests from warps that have a high hit ratio, allowing these warps to become active much sooner than they do in Baseline. Our bypassing mechanism, WByp, provides an average 33.6% performance improvement over Baseline, because it is effective at reducing the L2 queuing latency.
Footnote 1: Using two queues ensures that high-priority requests are not blocked by low-priority requests even when the low-priority queue is full. Two-queue priority also uses simpler logic design than comparator-based priority [5, 132, 133].
Footnote 2: We confirm that for each application, all kernels have similar speedup values, and that aside from SS and PVC, there are no outliers (i.e., no kernel has a much higher speedup than the other kernels). To verify that harmonic speedup is not swayed greatly by these few outliers, we recompute it for SS and PVC without these outliers, and find that the outlier-free speedup is within 1% of the harmonic speedup we report in the paper.
Compared to PCAL [79], WByp provides 12.8% better performance, and full MeDiC provides 21.8% better performance. We observe that while PCAL reduces the amount of cache thrashing, the reduction in thrashing does not directly translate into better performance. We observe that warps in the mostly-miss category sometimes have high reuse, and acquire tokens to access the cache. This causes less cache space to become available for mostly-hit warps, limiting how many of these warps become all-hit. However, when high-reuse warps that possess tokens are mainly in the mostly-hit category (PVC, PVR, SS, and BH), we find that PCAL performs better than WByp.
Compared to Rand, MeDiC performs 6.8% better, because MeDiC is able to make bypassing decisions that do not increase the miss rate significantly. This leads to lower off-chip bandwidth usage under MeDiC than under Rand. Rand increases the cache miss rate by 10% for the kernels of several applications (BP, PVC, PVR, BFS, and MST). We observe that in many cases, MeDiC improves the performance of applications that tend to generate a large number of memory requests, and thus experience substantial queuing latencies.
Compared to PC-Byp, MeDiC performs 12.4% better. We observe that the overhead of tracking the PC becomes significant, and that thrashing occurs as two PCs can hash to the same index, leading to inaccuracies in the bypassing decisions.
We conclude that each component of MeDiC, and the full MeDiC framework, are effective. Note that each component of MeDiC addresses the same problem (i.e., memory divergence of threads within a warp) using different techniques on different parts of the memory hierarchy. For the majority of workloads, one optimization is enough. However, we see that for certain high-intensity workloads (BFS and SSSP), the congestion is so high that we need to attack divergence on multiple fronts. Thus, MeDiC provides better average performance than all of its individual components, especially for such memory-intensive workloads.
We provide the following other evaluation results in our PACT 2015 paper [6]:
• Impact of MeDiC on cache miss rate.
• Impact of MeDiC on queuing latency.
• Impact of MeDiC on row buffer locality.
• Analysis of reuse in GPGPU applications.
• Hardware cost of MeDiC.
Figure 8: Energy efficiency of MeDiC. Adapted from [6].
Related Work
To our knowledge, MeDiC is the first work that identifies inter-warp memory divergence heterogeneity and exploits it to achieve better system performance in GPGPU applications. Our new mechanism consists of warp-type-aware components for cache bypassing, cache insertion, and memory scheduling. We have already provided extensive quantitative and qualitative comparisons to state-of-the-art mechanisms in GPU cache bypassing [79], cache insertion [123], and memory scheduling [115, 156]. In this section, we discuss other related work in these areas.
Hardware-based Cache Bypassing. PCAL is a bypassing mechanism that addresses the cache thrashing problem by throttling the number of threads that time-share the cache at any given time [79]. The key idea of PCAL is to limit the number of threads that get to access the cache. Concurrent work by Li et al. [78] proposes a cache bypassing mechanism that allows only threads with high reuse to utilize the cache. The key idea is to use locality filtering based on the reuse characteristics of GPGPU applications, with only high reuse threads having access to the cache. Xie et al. [146] propose a bypassing mechanism at the thread block level. In their mechanism, the compiler statically marks whether thread blocks prefer caching or bypassing. At runtime, the mechanism dynamically selects a subset of thread blocks to use the cache, to increase cache utilization.
Chen et al. [20, 21] propose a combined warp throttling and bypassing mechanism for the L1 cache based on the cache-conscious warp scheduler [116]. The key idea is to bypass the cache when resource contention is detected. This is done by embedding history information into the L2 tag arrays. The L1 cache uses this information to perform bypassing decisions, and only warps with high reuse are allowed to access the L1 cache. Jia et al. propose an L1 bypassing mechanism [48], whose key idea is to bypass requests when there is an associativity stall. Dai et al. propose a mechanism to bypass the cache based on a model of the cache miss rate [23].
MeDiC differs from these prior cache bypassing works because it uses warp memory divergence heterogeneity for bypassing decisions. We also show (in Section 6.4 of our PACT 2015 paper [6]) that our mechanism implicitly takes reuse information into account.
Software-based Cache Bypassing. Concurrent work by Li et al. [77] proposes a compiler-based technique that performs cache bypassing using a method similar to PCAL [79]. Xie et al. [145] propose a mechanism that allows the compiler to perform cache bypassing for global load instructions. Both of these mechanisms are different from MeDiC in that MeDiC applies bypassing to all loads and stores that utilize the shared cache, without requiring additional characterization at the compiler level. Mekkat et al. [87] propose a bypassing mechanism for when a CPU and a GPU share the last level cache. Their key idea is to bypass GPU cache accesses when CPU applications are cache sensitive, which is not applicable to GPU-only execution.
CPU Cache Bypassing. There are also several other CPU-based cache bypassing techniques. These techniques include using additional buffers to track cache statistics to predict cache blocks that have high utility based on reuse count [18, 27, 32, 50, 55, 81, 144, 152], reuse distance [18, 24, 29, 31, 34, 104, 143, 149], behavior of the cache block [46], or miss rate [22, 88, 120, 137]. As they do not operate on SIMD systems, these mechanisms do not (need to) account for memory divergence heterogeneity when performing bypassing decisions.
Cache Insertion and Replacement Policies. Many works propose different insertion policies for CPU systems (e.g., [44, 45, 54, 110, 112, 123]). We compare our insertion policy against the Evicted-Address Filter (EAF) [123], and show that our policy, which takes advantage of inter-warp divergence heterogeneity, outperforms EAF. Dynamic Insertion Policy (DIP) [44] and Dynamic Re-Reference Interval Prediction (DRRIP) [45] are insertion policies that account for cache thrashing. The downside of these two policies is that they are unable to distinguish between high-reuse and low-reuse blocks in the same thread [123]. The Bi-modal Insertion Policy [110] dynamically characterizes the cache blocks being inserted. None of these works take warp type characteristics or memory divergence behavior into account. Other work proposed prefetch-aware insertion and replacement policies [25, 124, 130]. MeDiC can be combined with such policies.
Memory Scheduling. Yuan et al. propose a GPU interconnect design that rearranges the sequence of memory requests that arrive at each memory channel to reduce the complexity of the GPU memory scheduler [151]. Chatterjee et al. propose a GPU memory scheduler that allows requests from the same warp to be grouped together, in order to reduce the memory divergence across different memory requests within the same warp [17]. Jog et al. propose a GPU memory scheduler that exploits the criticality information of warps in the GPU cores in order to improve the performance of GPGPU applications [49]. Principles of MeDiC can be incorporated into these schedulers.
There are several memory scheduler designs that target systems with CPUs [26, 33, 43, 56, 57, 59, 60, 67, 68, 69, 82, 93, 94, 95, 96, 98, 99, 115, 131, 132, 133, 134, 135, 147, 153] , and heterogeneous compute elements [5, 47, 138] . Memory schedulers for CPUs and heterogeneous systems generally aim to reduce interference across di erent applications.
Improving DRAM. An alternative approach to mitigate memory divergence is to improve the performance of the main memory itself. Previous works propose new DRAM designs that are capable of reducing memory latency in conventional DRAM [1, 2, 3, 4, 11, 12, 13, 14, 15, 16, 35, 36, 37, 38, 40, 41, 42, 53, 58, 61, 70, 71, 72, 73, 74, 83, 86, 92, 97, 100, 103, 109, 119, 121, 122, 125, 129, 141, 154] as well as non-volatile memory [62, 64, 65, 66, 80, 84, 90, 91, 111, 113, 114, 148, 150]. Data compression techniques can increase the effective DRAM bandwidth [105, 106, 107, 108, 140]. All these techniques are orthogonal to MeDiC and can be used to further improve the performance of GPGPU applications.
Other Ways to Handle Memory Divergence. In addition to cache bypassing, cache insertion policy, and memory scheduling, other works propose different methods of decreasing memory divergence [51, 52, 63, 75, 89, 101, 116, 117, 155]. These methods range from thread throttling [51, 52, 63, 116] to warp scheduling [75, 89, 101, 116, 117, 155]. While these methods share our goal of reducing memory divergence, none of them exploit inter-warp heterogeneity and, as a result, are either orthogonal or complementary to our proposal. Our work also makes new observations about memory divergence not covered by these works.
Potential Impact
While memory divergence, the problem that MeDiC addresses, is not new, the key findings of this work are novel and open up potential research topics for the future. We discuss at least three such opportunities and future directions.
Taking Advantage of Memory Divergence Heterogeneity. MeDiC modifies the memory hierarchy to introduce awareness of the memory divergence heterogeneity between different types of warps. There are many other applications that can exploit warp type information. Other resources within the GPU (e.g., GPU cores, warp scheduler) can exploit the memory divergence heterogeneity across different warps to further improve the performance of GPGPU applications. For example, the warp type information can be used by the warp scheduler and thread block scheduler to ensure that they do not schedule warps of the same type to execute at the same time, to limit the amount of cache contention that occurs. Incorporating the warp type information with other techniques, such as assist warps to relieve execution bottlenecks [140], can enable GPUs to utilize resources based on the type of warps the GPU is executing. For example, mostly-hit warps favor a mechanism that provides low memory latency, while mostly-miss warps might favor a mechanism that provides higher off-chip bandwidth. Memory divergence heterogeneity can also be used to assist GPU resource virtualization [139], as virtual resource allocation can take into account the utilization of shared memory resources to determine how much of a particular memory resource to allocate to each thread block.
Warp type information can be used to improve the performance of GPU address translation. Prior works [7, 8] show that address translations that do not hit in a TLB can incur long-latency page table walks, which can affect hundreds of application threads at once. Such long-latency address translations might have a greater impact on warps that are latency sensitive (e.g., mostly-hit and all-hit warps). Thus, warp-type information can be combined with previously-proposed techniques that aim to reduce the overhead of GPU address translation [7, 8] to provide synergistic performance benefits.
We believe the idea of warp-type heterogeneity enables many different mechanisms to customize execution on a GPU to achieve higher performance and energy efficiency. Hence, our PACT 2015 paper [6] paves the way for fine-grained customization of a GPU.
Identifying Long-Latency Threads in a Warp. Our PACT 2015 paper [6] shows how to intelligently reduce the memory latency of threads within a warp in order to reduce the memory divergence problem. However, MeDiC focuses on reducing the stall time of mostly-hit warps. Long-latency threads can still exist in the mostly-hit warps due to other problems such as load balancing at the memory partitions. Additional work on (1) how to identify latency-critical threads within a warp and (2) how to accelerate these specific threads can further improve the performance and energy efficiency of GPGPU applications.
Reducing High Queuing Delays and Memory Contention in the GPU Memory Hierarchy. As shown in our PACT 2015 paper [6], the queuing delay of throughput processors such as GPUs can become a performance bottleneck, as the delay increases the stall time of warps of all types. While the proposed warp-type-aware cache bypassing mechanism in MeDiC aims to reduce the queuing delay, non-uniform memory access patterns can still cause contention at a few L2 cache banks and memory partitions. In future systems, the parallelism of throughput processors is likely to increase further. For example, future GPUs will likely come with a higher number of GPU cores and larger SIMD widths. This is expected to greatly increase the amount of contention and, thus, queuing delay, for many resources. The different components of MeDiC can serve as a starting point for future research on alleviating cache and memory contention in future systems, and can ultimately enable a larger amount of thread-level parallelism. We believe studying the mitigation of high cache and memory contention is very promising for future parallel throughput processors and encourage future work in this area.
Conclusion
Warps from GPGPU applications exhibit heterogeneity in their memory divergence behavior at the shared L2 cache within the GPU. We find that (1) some warps benefit significantly from the cache, while others make poor use of it; (2) such divergence behavior for a warp tends to remain stable for long periods of the warp's execution; and (3) the impact of memory divergence can be amplified by the high queuing latencies at the L2 cache.
We propose Memory Divergence Correction (MeDiC), whose key idea is to identify memory divergence heterogeneity online in hardware and use this information to drive cache management and memory scheduling, by prioritizing warps that take the greatest advantage of the shared cache. To achieve this, MeDiC consists of three warp-type-aware components for (1) cache bypassing, (2) cache insertion, and (3) memory scheduling. MeDiC delivers significant performance and energy improvements over multiple previously proposed policies, and over a state-of-the-art GPU cache management technique. We conclude that exploiting inter-warp heterogeneity is effective, and hope future works explore other ways of improving systems based on this key observation of our work.
