Abstract: This letter quantitatively studies the benefits of inter-warp divergence aware execution on GPUs. To that end, the letter first proposes a novel approach to quantify the inter-warp divergence by measuring the temporal similarity in execution progress of concurrent warps, which we call Warp Progression Similarity (WPS). Based on the WPS metric, this letter proposes a WPS-aware Scheduler (WPSaS) to optimize GPU throughput. The aim is to manage inter-warp divergence to hide memory access latency and minimize resource conflicts and temporal under-utilization in compute units, allowing GPUs to achieve their peak throughput. Our results demonstrate that WPSaS improves throughput by 10 percent with a pronounced reduction in resource conflicts and temporal under-utilization.
INTRODUCTION
GPUs offer high throughput for applications with massively parallel threads. High throughput is achieved by hiding memory access latency across parallel threads: while some threads are waiting for their data or instructions, others can execute. To hide memory access latency, GPUs have dedicated hardware schedulers that make scheduling decisions at warp-level granularity. A warp (also called wavefront in OpenCL) is a group of threads sharing the same Program Counter (PC) and executing in lock-step mode. At every cycle, the scheduler selects a ready warp (one with data, instruction and compute resources available) to execute.
Concurrent warps may execute unevenly, which we call inter-warp divergence. Excessive inter-warp divergence may prevent GPUs from achieving their peak throughput. The major side-effects of inter-warp divergence are: (1) inter-warp conflicts in I$ and D$ when diverged warps compete for the same cache blocks [1], [2], [3] and (2) temporal resource under-utilization, e.g., when completed warps have to wait for uncompleted warps [4]. Fig. 1 illustrates the correlation between inter-warp divergence and conflicts for an example of Mixture-of-Gaussians (MoG) background subtraction [5]. Previous approaches aim to reduce inter-warp conflicts in D$ (e.g., [2], [3]) and temporal under-utilization (e.g., [4], [6]). However, they pay less attention to quantifying the inter-warp divergence and to reducing the inter-warp conflicts in I$. In general, the effect of resource conflicts is more pronounced in I$ due to conflicting fetch requests caused by diverged PCs. Fig. 2 presents the achievable speedup with an infinite I$ for nine selected benchmarks from Parboil [7], Rodinia [8], CUDA-SDK [9] as well as MoG (with 2 KB I$). Many benchmarks gain significant performance when inter-warp conflicts are eliminated. This motivates the need for approaches that manage inter-warp divergence, avoiding I$ conflicts for divergence-sensitive benchmarks.
This letter introduces Warp Progression Similarity (WPS) to quantify the inter-warp divergence based on the temporal similarity in execution progression of concurrent warps. Based on the WPS metric, this letter introduces the WPS-aware Scheduler (WPSaS) to manage and control inter-warp divergence. We anticipate a conceptual trade-off between inter-warp divergence and IPC (shown in Fig. 3) for the divergence-sensitive benchmarks with high inter-warp conflicts in I$. While both extremes prevent GPUs from achieving their peak IPC, we anticipate that maximum IPC is achieved with a medium inter-warp divergence.
This letter continues with Section 2 presenting the WPS metric. Section 3 analyzes the impact of the architecture on WPS, and Section 4 presents our proposed WPSaS mechanism. The exploration and results are based on three divergence-sensitive benchmarks: MoG background subtraction in [5], histogram in Parboil [7], and DXTC in CUDA SDK [9]. All experiments are carried out on GPGPU-Sim [10] (v3.2.2) configured for the Nvidia Fermi architecture (Table 1 lists the simulator parameters).
WARP PROGRESSION SIMILARITY (WPS)
This section introduces and quantifies Warp Progression Similarity as a metric for measuring uneven execution progress of concurrent warps. Warps that execute on a single SM share the same kernel binary code and execute almost the same PC pattern. However, their execution progress may vary over time depending on scheduling decisions as well as the availability of data, instruction and computation resources. Warp Progression Similarity measures the temporal similarity in execution progress of concurrent warps. WPS has an inverse correlation with inter-warp divergence: a higher WPS indicates a lower inter-warp divergence.
Calculating WPS is an offline approach using PC traces of all warps. It consists of two steps: (a) instruction profiling to capture execution traces per warp and (b) temporal similarity calculation using the captured traces. For profiling (a), we choose I$ block granularity to capture the execution trace, as it matches the granularity of instruction fetch in the underlying architecture. Our profiler divides the kernel binary into smaller blocks of 128 bytes each (a typical I$ block size). During simulation on GPGPU-Sim, the profiler records the runtime trace (block number and access duration) of accessed blocks. Fig. 4a highlights one example for three concurrent warps. Note that the progress of warps through the same I$ block may vary, and even the sequence of accesses may be different (due to conditional execution).
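The profiling step (a) can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the names BlockAccess and build_trace and the per-cycle (cycle, pc) sample format are hypothetical and not taken from the profiler used in this letter. Only the 128-byte block size comes from the text.

```python
from dataclasses import dataclass

BLOCK_BYTES = 128  # I$ block size stated in the text


@dataclass
class BlockAccess:
    block: int      # I$ block number, computed as pc // BLOCK_BYTES
    start: int      # first cycle at which the warp was sampled in this block
    duration: int   # cycles spanned in this block, first to last sample inclusive


def build_trace(pc_samples):
    """Collapse (cycle, pc) samples of one warp into a list of block accesses.

    Assumption: samples are ordered by cycle. Consecutive samples that fall in
    the same block are merged into one access; a change of block starts a new one.
    """
    trace = []
    for cycle, pc in pc_samples:
        block = pc // BLOCK_BYTES
        if trace and trace[-1].block == block:
            trace[-1].duration = cycle - trace[-1].start + 1
        else:
            trace.append(BlockAccess(block, cycle, 1))
    return trace
```

In this sketch a warp that returns to an earlier block (for example, in a loop) produces a new entry, so the same block number may appear several times in one trace.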
Temporal similarity calculation (b) uses the traces recorded above and computes the temporal similarity of warps as a value between 0 and 1. It is based on the cycle distance between accesses to I$ blocks. To map distances to temporal similarity values, we use a Gaussian distribution: smaller distances yield a similarity closer to 1 and larger distances yield a similarity closer to 0 (presented in Fig. 4b). The selection of the Gaussian distribution is based on our I$ block lifetime analysis over the benchmarks (with a standard deviation of 400 cycles).
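As a rough illustration of this mapping, the following Python sketch assumes an unnormalized Gaussian centered at zero distance, so that a distance of 0 maps to a similarity of exactly 1. The function name decay and this exact functional form are assumptions; the text only states that a Gaussian with a 400-cycle standard deviation is used.

```python
import math

SIGMA = 400.0  # standard deviation in cycles, as stated in the text


def decay(distance, sigma=SIGMA):
    """Map a cycle distance to a similarity in [0, 1].

    Assumed form: exp(-d^2 / (2 * sigma^2)), i.e., a Gaussian scaled so that
    its peak value at distance 0 is 1.
    """
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))
```

Under this assumption, a distance of one standard deviation (400 cycles) gives a similarity of about 0.61, and a distance of three standard deviations (1,200 cycles) gives about 0.01.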
Algorithm 1 outlines the temporal similarity calculation. For every cycle (t), the function Calc-WPS returns a WPS value (Similarity) reflecting the temporal similarity between all warps at that cycle. First, the algorithm computes the cycle distance between the current use of a block and the time at which it will be used again by the target warp (line 7). Computed distances are normalized through the decay function (line 8) and then added to the cumulative normalized distance value (sum) across all concurrent warps of that cycle. Finally, the algorithm calculates the final WPS, which is the average of the normalized distances across all warps (line 14). Note that numWarp refers to the number of warps launched. Early finishing warps still contribute to WPS, using the last accessed I$ block for the distance calculation.
Algorithm 1. Temporal similarity calculation (function Calc-WPS).
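One possible reading of the calculation described above is sketched below in Python. It is not the authors' listing. Several choices are assumptions: each trace is a list of (block, start, duration) tuples, every ordered pair of distinct warps is compared, the distance is taken to the nearest interval (past or future) in which the target warp occupies the block, and a target warp that never touches the block contributes a similarity of 0. The helper names block_at and distance_to_block are hypothetical.

```python
import math


def decay(distance, sigma=400.0):
    # Assumed unnormalized Gaussian with the 400-cycle deviation from the text.
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))


def block_at(trace, t):
    """Block a warp occupies at cycle t; a finished warp stays on its last block."""
    current = trace[0][0]
    for block, start, _duration in trace:
        if start > t:
            break
        current = block
    return current


def distance_to_block(trace, block, t):
    """Cycles between t and the nearest interval in which the warp uses `block`.

    Returns None if the warp never accesses the block.
    """
    best = None
    for b, start, duration in trace:
        if b != block:
            continue
        end = start + duration - 1
        d = 0 if start <= t <= end else min(abs(t - start), abs(t - end))
        best = d if best is None else min(best, d)
    return best


def calc_wps(traces, t):
    """Average decayed distance over all ordered pairs of distinct warps at cycle t."""
    num_warp = len(traces)
    total, pairs = 0.0, 0
    for i in range(num_warp):
        block = block_at(traces[i], t)  # block currently used by warp i
        for j in range(num_warp):
            if i == j:
                continue
            d = distance_to_block(traces[j], block, t)
            total += 0.0 if d is None else decay(d)
            pairs += 1
    return total / pairs if pairs else 1.0
```

With two warps that follow identical traces, this sketch returns 1.0 at every cycle. If one warp reaches a block about 2,000 cycles later than the other, the similarity seen from the leading warp is close to 0, and the average at that cycle drops to roughly 0.5.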
To evaluate WPS, we applied it to MoG (highlighted in Fig. 5a). Comparing the WPS changes over time in Fig. 5a with the PC values in Fig. 1a, we observe a tight correlation between WPS and inter-warp divergence. For example, cycles with higher variation in PCs, and thus higher inter-warp divergence, also show lower WPS. Fig. 5b also presents WPS for the DXTC benchmark. For both benchmarks, WPS varies significantly during execution. The variation is more pronounced toward the end, hinting at an unbalanced execution.
WPS CONTRIBUTORS
Overall, both the algorithm and the architecture contribute to WPS. Algorithms with frequent data-dependent branches tend to diverge across parallel warps. At the same time, the underlying architecture can cause a further reduction in WPS. Looking at a single SM (highlighted in Fig. 6), the primary WPS contributors are the I$ and D$ (causing uneven access latency to the memory hierarchy) and the warp schedulers in the back-end, which schedule warps for execution. To achieve maximum utilization of the compute resources, GPUs often employ multiple warp schedulers (e.g., two in Fermi and four in Kepler and Maxwell architectures).
For our WPS analysis, we focus on four major contributors: algorithm-intrinsic divergence, uneven latency of instruction/data access, the type of warp scheduler, and the number of warp schedulers. Fig. 7 cumulatively illustrates the effect of the four major contributors for MoG with 28 concurrent warps running on a single SM. Fig. 7a shows the algorithm-intrinsic WPS with infinite I$ and D$ and a strict round-robin scheduling policy (SRR). Fig. 7b presents WPS with bounded caches: 48 KB D$ and 2 KB I$. Fig. 7c presents the effect of different scheduling policies, and Fig. 7d shows the impact of the number of warp schedulers (in this case with Greedy-Then-Oldest (GTO)) on WPS.
The algorithm-intrinsic WPS is high (always above 80 percent) with only small variations due to data-dependent branches in the algorithm. Even with bounded caches (Fig. 7b), WPS drops only slightly due to uneven latency for accessing data and instructions. The scheduler impact (Fig. 7c) varies. Fair policies such as SRR and Loosely Round-Robin (LRR) have a minimal effect on WPS. In contrast, greedy policies such as Two-Level and GTO result in a much lower WPS. Nonetheless, GTO achieves the best performance as it opens more opportunity for hiding memory access latency. Adding a second level GTO scheduler (Fig. 7d) reduces divergence (due to fewer warps per scheduler), yet it increases WPS variations as the schedulers work independently. Among all WPS contributors, the warp scheduler is the most influential. Aggressive warp schedulers, such as GTO, result in higher inter-warp divergence to hide memory access latency for a better IPC. The downside is high resource conflicts and temporal under-utilization, which prevent GPUs from achieving their peak throughput. We have observed significant inter-warp divergence in other benchmarks, including Histogram and DXTC. Novel solutions are required to control and manage inter-warp divergence so as to hide memory access latency while reducing inter-warp conflicts.
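To make the contrast between fair and greedy policies concrete, the Python sketch below models the selection step of LRR and GTO in a toy setting. It is a simplification under stated assumptions, not GPGPU-Sim code: the function names, the ready set, and the age map (lower value means older warp) are hypothetical, and the toy warps never stall.

```python
def select_lrr(ready, last_issued, num_warps):
    """Loose round-robin: pick the next ready warp after the one issued last."""
    for offset in range(1, num_warps + 1):
        w = (last_issued + offset) % num_warps
        if w in ready:
            return w
    return None


def select_gto(ready, last_issued, age):
    """Greedy-then-oldest: keep the last warp while it is ready, else the oldest ready warp."""
    if last_issued in ready:
        return last_issued
    if not ready:
        return None
    return min(ready, key=lambda w: age[w])


def issue_counts(policy, num_warps, cycles):
    """Count issues per warp in a toy run in which every warp is always ready."""
    counts = [0] * num_warps
    ready = set(range(num_warps))
    age = {w: w for w in range(num_warps)}  # warp 0 is assumed to be the oldest
    last = -1  # no warp has issued yet
    for _ in range(cycles):
        if policy == "lrr":
            w = select_lrr(ready, last, num_warps)
        else:
            w = select_gto(ready, last, age)
        counts[w] += 1
        last = w
    return counts
```

In this toy run with three warps and six cycles, LRR issues twice from each warp, while GTO issues all six times from the oldest warp. Real warps stall on memory accesses, so GTO does eventually switch warps, but the sketch shows why greedy policies let warps drift apart in progress.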
WPS-AWARE SCHEDULER (WPSAS)
This section introduces our proposed WPS-aware Scheduler with the aim of demonstrating the potential benefits of taking WPS into account for GPU operation. WPSaS is an extended I-fetch scheduler in the front-end of an SM (illustrated in Fig. 6). Being an I-fetch scheduler, WPSaS can make global decisions across all concurrent warps, independent of the number of back-end schedulers. In essence, WPSaS avoids conflicting instruction fetch requests caused by warps which are too far ahead. Avoiding these requests effectively stalls the execution of ahead warps by the back-end scheduler(s). This results in more even progress across all warps.
Algorithm 2 illustrates WPSaS, which is an extension of a standard round-robin I-fetch scheduler. It has two main functions: (1) WPSaS and (2) DoesThrashICache. (1) WPSaS implements round-robin scheduling with an additional check, DoesThrashICache. (2) DoesThrashICache receives the PC value and the current warp index (cur-idx), and returns the status of the warp. DoesThrashICache returns TRUE for an ahead warp and FALSE otherwise. An ahead warp is a warp with a fetch request that evicts a cache block currently being used by other warps. If DoesThrashICache() returns TRUE (indicating an ahead warp), WPSaS() skips the fetch request.
To detect an ahead warp, DoesThrashICache first calls ICache_Probe(PC) (line 12) to probe the status of the I$ for the new fetch request. If the requested PC is in the cache, ICache_Probe(PC) returns NULL. In case of a cache miss, ICache_Probe(PC) returns the index of the block that would be evicted (the victim block). A warp with a fetch request that results in a victim block is also called a trigger warp. A trigger warp can potentially be an ahead warp. Whether it is an ahead warp depends on the access history of the victim block.
To capture the access history, DoesThrashICache calls the Inter_Warp_Age function (line 14). Inter_Warp_Age returns the number of cycles since the victim block was last accessed by any other warp (victim-blk-age).
A large victim-blk-age is an indicator that the block is less active. Conversely, a short victim-blk-age indicates that the block has been recently used by other warps (an active block). Thr defines the maximum age for being an active block. Hence, an ahead warp is a trigger warp that evicts an active block (i.e., victim-blk-age < Thr).
Algorithm 2. WPS-aware I-fetch scheduler (functions WPSaS and DoesThrashICache).
In a nutshell, DoesThrashICache identifies a non-ahead warp under two conditions: (a) there is no victim block associated with the PC fetch request (ICache_Probe(PC) returns NULL); (b) there is an associated victim block but it is not active (victim-blk-age >= Thr). The open research question is how to tune Thr to achieve maximum IPC. Low values of Thr neutralize the effect of WPSaS (Thr = 0 is equivalent to WPS-unaware scheduling). At the other extreme, high Thr values force warps to execute in lock-step.
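The decision logic described above can be sketched in Python as follows. This is an illustration rather than the listing of Algorithm 2. The direct-mapped ToyICache model, its bookkeeping of last accesses per warp, the treatment of an empty cache line as having no victim, and all lowercase function names are assumptions introduced for the sketch.

```python
BLOCK_BYTES = 128  # I$ block size used elsewhere in the text


class ToyICache:
    """Direct-mapped I$ model (an assumption; the text does not fix the organization)."""

    def __init__(self, num_lines):
        self.lines = [None] * num_lines  # resident block number per cache line
        self.last_access = {}            # block -> {warp index: last access cycle}

    def probe(self, pc):
        """Return None on a hit or an empty line, else the block that would be evicted."""
        block = pc // BLOCK_BYTES
        resident = self.lines[block % len(self.lines)]
        if resident is None or resident == block:
            return None
        return resident

    def inter_warp_age(self, victim, warp, now):
        """Cycles since any warp other than `warp` last accessed the victim block."""
        history = self.last_access.get(victim, {})
        others = [cycle for w, cycle in history.items() if w != warp]
        return now - max(others) if others else float("inf")

    def access(self, pc, warp, now):
        """Install the block for `pc` and record the access by `warp`."""
        block = pc // BLOCK_BYTES
        self.lines[block % len(self.lines)] = block
        self.last_access.setdefault(block, {})[warp] = now


def does_thrash_icache(cache, pc, warp, now, thr):
    """True for an ahead warp: its fetch would evict a block with age below thr."""
    victim = cache.probe(pc)
    if victim is None:
        return False  # condition (a): no victim block
    return cache.inter_warp_age(victim, warp, now) < thr


def wpsas(cache, pcs, last_idx, now, thr):
    """Round-robin over warps, skipping ahead warps; returns the chosen warp or None."""
    n = len(pcs)
    for offset in range(1, n + 1):
        idx = (last_idx + offset) % n
        if not does_thrash_icache(cache, pcs[idx], idx, now, thr):
            cache.access(pcs[idx], idx, now)
            return idx
    return None  # every fetch request was denied in this cycle
```

With thr = 0 the age test can never succeed, so the sketch degenerates to plain round-robin, which matches the statement that Thr = 0 is equivalent to WPS-unaware scheduling. A return value of None models a cycle in which all fetch requests are denied.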
To evaluate WPSaS, we enhance the round-robin I-fetch scheduler of GPGPU-Sim to WPSaS and configure GTO as the back-end policy. To show how WPS impacts performance, we sweep Thr from 0 to 4,000 and measure the resulting WPS and IPC. Fig. 8 plots the measured IPC over WPS for MoG and DXTC. Fig. 8 resembles the conceptual trade-off in Fig. 3. For both applications, WPS generally increases with increasing Thr. However, the IPC reaches its maximum at a WPS of around 0.6, which corresponds to Thr = 1,040 in MoG and Thr = 1,320 in DXTC. IPC drops steeply for WPS above 0.6.
To illustrate the effects of WPSaS on inter-warp divergence and resource conflicts, Fig. 9a plots the PC values for the MoG application for the best found Thr (Thr = 1,040). Comparing WPSaS (Fig. 9a) against the WPS-unaware scheduler (Fig. 1), the execution progress is more even in WPSaS. At the same time, there is enough inter-warp divergence to hide the memory access latency. [...] distributed throughout the execution. The reduction in I$ misses is more pronounced as it is directly affected by WPSaS.
To provide more insight, Fig. 10 plots, for the best found Thr, the IPC, temporal under-utilization, and execution stalls for all three benchmarks (MoG, Histogram, and DXTC). Fig. 10 also compares the results against the WPS-unaware scheduler. Fig. 10a shows a pronounced increase in IPC, 10.6 percent on average (14.1 percent MoG, 9.5 percent Histogram, 8.4 percent DXTC). IPC mainly increases due to less temporal under-utilization and fewer conflicts in I$ and D$. Temporal under-utilization drops across all benchmarks (Fig. 10b); the most pronounced reduction is for DXTC (from 20 to 2 percent). MoG and DXTC reach almost 100 percent utilization. Fig. 10c compares the number of stalls in the back-end, which are caused by a lack of instructions, data, or compute resources. The WPS-aware scheduler significantly reduces the back-end stalls, in particular stalls due to a lack of data or instructions (45 K to 34 K for MoG, 68 K to 56 K for DXTC, and 39 K to 34 K for Histogram). I$ stalls drop significantly because WPSaS denies the fetch requests of ahead warps which would evict useful I$ blocks. On the downside, WPSaS introduces additional scheduling stalls (sched.) when denying fetch requests. MoG and Histogram have minimal sched. stalls. In contrast, DXTC has a larger number of sched. stalls, which hints at a large number of inter-warp conflicts in I$.
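The reported average can be checked with a few lines of Python using the three per-benchmark gains quoted above. The text does not state which mean is used, so the sketch computes both the arithmetic mean of the gains and the geometric mean of the corresponding speedups; which one the letter reports is left open.

```python
import math

# IPC gains in percent, taken from the text.
ipc_gain = {"MoG": 14.1, "Histogram": 9.5, "DXTC": 8.4}

# Arithmetic mean of the percentage gains.
arith = sum(ipc_gain.values()) / len(ipc_gain)

# Geometric mean of the speedup factors, converted back to a percentage gain.
speedups = [1.0 + g / 100.0 for g in ipc_gain.values()]
geo = (math.prod(speedups) ** (1.0 / len(speedups)) - 1.0) * 100.0

print(f"arithmetic mean: {arith:.2f} percent")
print(f"geometric mean:  {geo:.2f} percent")
```

The arithmetic mean comes out at about 10.67 percent and the geometric mean at about 10.64 percent, both in line with the reported 10.6 percent.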
Overall, our proposed WPSaS demonstrates the benefits of inter-warp divergence aware execution in GPUs. For the benchmarks under evaluation, WPSaS results in a 10.6 percent speedup on average, a reduction in temporal under-utilization from 21 percent to less than 8 percent, and a 23 percent reduction in back-end stalls. This letter focuses on defining the WPS metric and highlighting the performance improvement opportunities when taking WPS into account for scheduling. Nonetheless, we anticipate an efficient implementation at the architecture level, as all information required for WPSaS (Algorithm 2) is already available in the front-end stage.
The primary aim of WPSaS is to demonstrate the potential of WPS-aware execution on GPUs. Given the space constraints, this letter focuses only on three I$-sensitive benchmarks. Nonetheless, we have executed all 10 applications listed in Fig. 2, and WPSaS shows some performance gain for each application (no slowdown). As part of our future work, we are working on a more detailed analysis of the WPS impact on a wider set of applications as well as on architecture realizations of WPSaS.
CONCLUSIONS
This letter proposes a new perspective on the efficiency analysis of GPUs by considering the role of inter-warp divergence in GPU execution. As a metric to quantify inter-warp divergence, this letter proposes Warp Progression Similarity to measure the temporal execution similarity of concurrent warps. Based on the WPS metric, this letter introduces WPSaS, a WPS-aware Scheduler, giving an outlook on performance improvement opportunities. WPSaS achieves a 10.6 percent improvement in throughput for three selected benchmarks. We anticipate similar performance improvements for benchmarks which suffer either from inter-warp conflicts in I$ or from high temporal under-utilization. Nonetheless, divergence-aware design principles open many design opportunities as well as challenges, such as finding the balance between inter-warp divergence and inter-warp conflicts in I$ and D$.
