Graphics Processing Units (GPUs) consisting of Streaming Multiprocessors (SMs) achieve high throughput by running a large number of threads and context switching among them to hide execution latencies. The number of thread blocks, and hence the number of threads that can be launched on an SM, depends on the resource usage (e.g., number of registers, amount of shared memory) of the thread blocks. Since the allocation of threads to an SM is at the thread block granularity, some of the resources may not be used up completely and hence will be wasted.
INTRODUCTION
Graphics Processing Units (GPUs) have been effectively used to accelerate large data parallel applications. GPUs consisting of Streaming Multiprocessors (SMs) achieve high throughput by concurrently executing a large number of threads to hide long latencies. The throughput achieved by a GPU depends on the amount of thread level parallelism (TLP) utilized by it. Recent studies [19, 31, 32, 34] focus on improving the throughput of GPUs by exploiting the TLP.
The amount of TLP utilized by a GPU depends on the number of threads resident on it. When an application is launched on a GPU, an execution configuration consisting of the number of thread blocks and the number of threads in a thread block is specified. The number of thread blocks that can actually be launched on an SM depends on the resource requirement, such as the number of registers and the amount of scratchpad memory needed by each thread block. If an SM contains $R$ units of a resource and a thread block requires $R_{tb}$ units to complete its execution, then the SM can launch at most $\lfloor R / R_{tb} \rfloor$ thread blocks, utilizing $R_{tb} \times \lfloor R / R_{tb} \rfloor$ units. The remaining $R \bmod R_{tb}$ units are wasted.
In this paper, we propose a mechanism to share the resources of an SM and launch more thread blocks, effectively reducing resource wastage. In particular, we show how sharing of registers and sharing of scratchpad memory improve the throughput of SMs. It is observed [19] that increasing the number of threads benefits compute-bound applications, but may result in increased L1/L2 cache misses for memory-bound applications, thereby decreasing their performance. To overcome this, we propose an optimization, called Owner Warp First (OWF), that schedules the extra thread blocks and their constituent warps effectively. For the register sharing approach, we further propose two optimizations, viz., Unrolling and Reordering of Register Declarations and Dynamic Warp Execution, which improve register utilization and minimize the number of stall cycles observed by the additional thread blocks, respectively.
We make the following contributions in this work:
1. To utilize the resources of GPUs effectively, we propose a novel resource sharing mechanism that enables launching of more thread blocks per SM.
2. We implemented our approach for two resources, i.e., registers and scratchpad. We propose optimizations to further improve the throughput of applications.
3. We implemented our approach using GPGPU-Sim and evaluated it on 19 applications from the GPGPU-Sim [7], Rodinia [9], CUDA-SDK [2], and Parboil [4] benchmarks.
We observe that 8 of the applications, which underutilize the register resource, show an average improvement of 11% with the register sharing approach. Similarly, 7 applications, which underutilize the scratchpad resource, show an average improvement of 12.5% with scratchpad sharing. The remaining 4 applications, which do not waste any resources, perform comparably to the baseline approach.

The paper is organized as follows: Section 2 describes the background required for our approach. Section 3 motivates the need for sharing. Our approach is presented in Sections 4 and 5. Section 6 discusses hardware overhead for implementing our approach. Section 7 describes the experimental evaluation. Section 8 discusses related work, and Section 9 concludes the paper.
BACKGROUND
A typical NVIDIA GPU [1] consists of a set of Streaming Multiprocessors (SMs), and each multiprocessor has execution units called Stream Processors (SPs). CUDA [1] supports extensions to languages, such as C, to allow programmers to define and invoke parallel functions, called kernels, on a GPU. A kernel is invoked along with an execution configuration of threads that specifies the number of threads per thread block and the number of thread blocks.
The number of thread blocks that can reside on an SM depends on: (a) the number of registers used by a thread block and the number of registers available in the SM, (b) the amount of scratchpad memory used by a thread block and the amount of scratchpad memory available in the SM, (c) the maximum number of threads allowed per SM, and (d) the maximum number of thread blocks allowed per SM.
The threads in a thread block are further divided into sets of 32 consecutive threads, each called a warp. Each SM contains one or more warp schedulers, which schedule a ready warp every cycle from a pool of ready warps. All threads in a warp execute the same instruction. Warp schedulers schedule instructions in order, so when the current instruction of a warp cannot be issued, the warp is not considered to be ready. If no warp can be scheduled in a cycle, then that is a stall cycle. As the number of stall cycles increases, the run time goes up and the throughput decreases. Our approach increases the number of resident thread blocks by utilizing the wasted registers as well as scratchpad memory on each SM, and hence increases the number of resident warps. It also improves warp scheduling to hide long latencies.
MOTIVATION
The problem of resource underutilization occurs in GPUs because resources are allocated at thread block granularity. We analyzed several benchmark applications using the GPGPU-Sim [3] simulator.¹ For applications that are limited by the register resource, we show the number of resident thread blocks per SM (see Figure 1). Similarly, in Figure 1(b) we show the number of resident thread blocks per SM for the applications that are limited by the scratchpad resource, and in Figure 1(d) we show the percentage of scratchpad memory that remains unutilized per SM. Consider the application lavaMD. Each thread block for this benchmark needs 7200 bytes of scratchpad memory. According to the configuration in Table 1, the amount of scratchpad memory available per SM is 16384 bytes, hence an SM can fit 2 thread blocks. This results in 1984 bytes of scratchpad memory per SM remaining unutilized. Similar behavior is observed for other applications as well.

¹ The GPU configuration is described in Table 1. The benchmark details are given in Table 2 and Table 3 (Section 7).
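The lavaMD arithmetic above can be checked with a short sketch. The function name `blocks_and_waste` is hypothetical, and the sketch assumes that the resource in question is the one limiting residency, so the leftover is exactly the remainder of the division.

```python
def blocks_and_waste(units_per_sm, units_per_block):
    """Return (resident thread blocks, unused units) for one SM resource.

    Assumes this resource is the limiting factor for residency.
    """
    blocks = units_per_sm // units_per_block
    wasted = units_per_sm % units_per_block
    return blocks, wasted


# lavaMD scratchpad figures quoted in the text: 7200 bytes per block,
# 16384 bytes per SM.
print(blocks_and_waste(16384, 7200))
```

With the figures quoted for lavaMD, the call yields 2 resident thread blocks and 1984 unused bytes, matching the text.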
Applications that are constrained by their resource requirements may not only have low residency, but also waste resources of GPU. Our proposed approach reduces wastage of registers and scratchpad memory and increases the number of resident thread blocks. Our experiments show that these extra thread blocks help to hide long execution latencies and increase throughput.
RESOURCE SHARING
We can increase the number of thread blocks in an SM by allowing two thread blocks to share resources. For example, consider an application that has thread blocks of size 10 warps (320 threads), and a thread block requires 10K resource units to complete its execution. If an SM has 35K resource units, at most 3 thread blocks can be resident on the SM, leaving 5K resource units unused. In order to reduce the wastage of resources, our approach allocates one more thread block (TB3) in sharing mode with TB2. Instead of allocating 10K resource units separately to each of the thread blocks TB2 and TB3, a total of 15K units for the two blocks is allocated as follows: each of TB2 and TB3 is allocated 5K units exclusively (Private or Unshared Resource), while the remaining 5K units (Shared Resource) are all allocated to whichever of TB2 and TB3 needs any one of these resources first. The other thread block (which did not get the ownership of shared resources), when it needs any of the shared resources, waits till the owner block finishes.
We refer to any two thread blocks as Shared Blocks when they share resources exclusively (for example TB2 and TB3 in Figure 2 (b)), and the warps of such thread blocks as Shared Warps. Thread blocks (warps) that do not participate in sharing are referred to as Unshared Blocks (Unshared Warps). We describe in detail our sharing approach for two types of resources (a) Registers, and (b) Scratchpad.
Register Sharing
The scenario in Figure 2(a) can be improved using our register allocation scheme shown in Figure 2(b), in which we allocate 10K registers to each of the thread blocks TB0 and TB1. The remaining 15K registers are shared between thread blocks TB2 and TB3 such that each pair of warps in these thread blocks is allocated 1.5K registers, as described next. We refer to TB0 and TB1 as unshared thread blocks, and to TB2 and TB3 as shared thread blocks.
Consider the pair of warps W20 and W30 that participate in sharing. We allocate 0.5K registers (private or unshared registers) each to W20 and W30. The remaining 0.5K registers are shared registers that are allocated to these warps together in a shared but exclusive manner, i.e., only one of them can access the pool of shared registers at a time. For example, if warp W20 accesses any of the shared registers first, exclusive access to all the 0.5K shared registers is given to W20, while W30 is prevented from accessing any of those 0.5K shared registers till W20 finishes. This implies that W30 can continue its execution until its first access to any of the 0.5K shared registers, at which point it waits until the shared registers are released. Only after W20 finishes execution can W30 access the shared registers and continue. This way, additional warps make some progress, which helps in hiding execution latencies.
To generalize this idea and to compute the increase in the number of thread blocks, we consider a GPU that provides $R$ registers per SM. Also, consider a thread block that requires $R_{tb}$ registers, where each warp in the thread block requires $R_w$ registers to complete its execution. To increase the number of thread blocks that share registers with other existing thread blocks in the SM, we allocate $R_{tb}(1 + t)$ registers (for any threshold $0 < t < 1$) to each pair of shared thread blocks, instead of allocating $2R_{tb}$ registers to them (in Figure 2(b), $t$ is 0.5). Equivalently, we allocate $R_w(1 + t)$ registers per two warps from these thread blocks (i.e., one warp from each shared thread block in the pair), such that each of these warps can access $R_w t$ unshared registers independently, and they can access the remaining $R_w(1 - t)$ shared registers only when granted access.
We allocate registers to a warp dynamically on its first usage of the registers, and we deallocate them from the register file after the warp has finished its execution, as described in GPGPU-Sim [3]. Every unshared register is allocated as per the request, but the shared registers are allocated only to a warp that has exclusive access. To detect whether a register accessed by a warp is shared or unshared, and to efficiently access it from the register file unit, we modify the existing register file access mechanism as shown in Figure 3. When a warp (WarpId) needs to access a register (RegNo), we first check if the warp is an unshared warp, i.e., if it belongs to an unshared thread block (Figure 3, Step (b)). If it is an unshared warp, it can directly access the register from the register file using a combination of (WarpId, RegNo). If WarpId is a shared warp, the accessed register is an unshared register if RegNo ≤ $R_w t$ (Step (c)). This is because $R_w t$ unshared registers are allocated to each warp. If RegNo > $R_w t$, we treat the register as a shared register. A warp can access an unshared register directly from the register file, but it can access a shared register only when it gets exclusive access by acquiring a lock (Step (e)); otherwise it retries the access in another cycle.²
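The access check described above can be sketched as follows. This is only an illustrative model under stated assumptions: the names `WarpPairLock`, `can_access`, and `release` are hypothetical, `private_count` stands for $R_w t$, and the sketch omits the block-level deadlock-avoidance condition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WarpPairLock:
    """Lock for one pair of shared warps (hypothetical model)."""
    holder: Optional[int] = None  # id of the warp that owns the shared registers


def can_access(warp_id, reg_no, is_shared_warp, private_count, lock):
    """Return True if the register access may proceed in this cycle."""
    if not is_shared_warp:
        return True                 # Step (b): unshared warps access directly
    if reg_no <= private_count:
        return True                 # Step (c): RegNo <= R_w * t is a private register
    if lock.holder is None:
        lock.holder = warp_id       # Step (e): first requester acquires the lock
    return lock.holder == warp_id   # other warp must retry in a later cycle


def release(lock, warp_id):
    """Free the shared registers when the holding warp finishes execution."""
    if lock.holder == warp_id:
        lock.holder = None
```

A warp that receives `False` simply retries later, which mirrors the retry behaviour described in the paragraph above.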
Consider the scenario shown in Figure 4, where two thread blocks TB1 and TB2 are in shared mode. Assume that W2 and W3 have already acquired locks for accessing shared registers. Also, assume that the warps W2 and W3 are waiting for warps W1 and W4, respectively, to arrive at a barrier instruction (__syncthreads()). Now, if warp W1 tries to acquire a lock to access shared registers from W3, and W4 tries to acquire a lock to access shared registers from W2, then a deadlock occurs. To avoid deadlock, we always ensure that if thread blocks TB1 and TB2 share registers, then a warp from TB1 (TB2) can acquire a lock only when either (a) none of the warps from TB2 (TB1) have acquired a lock for the shared registers, or (b) the warps from TB2 (TB1) that have acquired exclusive access to the shared registers have finished their execution. For the above example, if warp W3 has already acquired a lock, W2 cannot acquire a lock, which avoids the deadlock.
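A minimal sketch of the deadlock-avoidance rule, assuming it is enforced at the granularity of a shared pair of thread blocks. The class `SharedBlockPair` and its methods are hypothetical names used only for illustration; the per-warp-pair lock itself is not modelled here.

```python
class SharedBlockPair:
    """Tracks which warps of two shared thread blocks hold shared registers."""

    def __init__(self, block_a, block_b):
        self.blocks = (block_a, block_b)
        # Warps of each block that hold a lock and have not yet finished.
        self.active_holders = {block_a: set(), block_b: set()}

    def try_acquire(self, block, warp):
        """Grant a lock only if no unfinished warp of the other block holds one."""
        other = self.blocks[1] if block == self.blocks[0] else self.blocks[0]
        if self.active_holders[other]:
            return False  # conditions (a) and (b) both fail: must wait
        self.active_holders[block].add(warp)
        return True

    def warp_finished(self, block, warp):
        """A finished warp no longer blocks the other thread block."""
        self.active_holders[block].discard(warp)
```

In the Figure 4 scenario, once W3 (from TB2) holds a lock, a request from W2 (TB1) is refused, so the circular wait cannot form.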
Scratchpad Sharing
Figure 2(c) shows an example of scratchpad sharing, where we consider a GPU that has 35K units of scratchpad memory per SM, and each thread block requires 10K units. To increase the number of resident thread blocks with scratchpad sharing, we allocate 10K units to each of TB0 and TB1; the remaining 15K scratchpad units are allocated together for thread blocks TB2 and TB3 such that each one gets 5K units in private mode and the remaining 5K units are accessed in exclusive mode, i.e., only one thread block can access it at a time.³ Similar to the register sharing approach, we refer to TB0 and TB1 as unshared thread blocks, and to TB2 and TB3 as shared thread blocks.
When a thread from the shared thread block (say TB2) needs to access a memory location from shared scratchpad, it gains an exclusive access by acquiring a lock. As long as TB2 is running, no thread from TB3 can access the shared scratchpad locations and hence the corresponding warps of TB3 will have to wait for TB2 to finish before they can proceed further. But warps of TB3 that do not access the shared scratchpad locations can continue execution.
The implementation to support scratchpad sharing in GPGPU-Sim is shown in Figure 5. The steps for the shared scratchpad access follow rules similar to those for the shared register access and are omitted for brevity. A deadlock can never occur with scratchpad sharing. Consider two thread blocks TB1 and TB2 that share scratchpad. When a warp from a shared thread block (say TB1) acquires a lock, no warp from TB2 is given access to the shared scratchpad region until TB1 finishes its execution. So, only the warps from TB2 that require access to the shared resources wait for TB1 to finish. Warps from TB1 never wait for TB2 to finish. Hence there is no deadlock cycle.

³ Unlike register sharing, we cannot distribute 1.5K scratchpad memory to each pair of warps because any thread within a thread block can access any scratchpad location allocated for that block.

Figure 5: Scratchpad Access Mechanism
Computing the Number of Thread Blocks to be Launched per SM
A naive method of sharing, where each thread block shares resources with some other thread block, may launch more thread blocks than the default (non-sharing) approach. However, the number of thread blocks that make progress (effective thread blocks) per SM can be less than that for non-sharing. For example, consider a scenario where 3 thread blocks are resident per SM without sharing. With naive sharing, it may be possible to have 4 thread blocks resident, such that block 1 shares resources with block 2, and block 3 with block 4. It can happen that blocks 2 and 4 start accessing shared resources, causing blocks 1 and 3 to wait. Effectively, only two thread blocks (blocks 2 and 4) make progress in the naive sharing approach, whereas all 3 blocks can make progress in the non-sharing approach, so naive sharing reduces the throughput. To avoid this, we describe a method to compute the total number of thread blocks (Shared + Unshared) to be launched per SM such that the number of effective thread blocks using the sharing approach is no less than that of the non-sharing approach.

We use the following notations:
1. $R$: Number of units of a resource available per SM,
2. $R_{tb}$: Number of resource units required by a thread block,
3. $S$: Number of pairs of shared thread blocks launched in an SM,
4. $U$: Number of unshared thread blocks, i.e., thread blocks that do not share resources with any other thread block,
5. $M$: Maximum number of thread blocks to be launched in an SM,
6. $t$: Threshold for computing the number of resources that a thread block shares with another thread block.

For a given threshold value $t$ ($0 < t < 1$) we allocate $(1 + t)R_{tb}$ resource units per two shared thread blocks, of which $(1 - t)R_{tb}$ resource units are shared. Without sharing, we can launch up to $\lfloor R / R_{tb} \rfloor$ thread blocks in an SM, and all of them make progress. In our approach, if two thread blocks are launched in sharing mode, at least one of them always makes progress. So, when $S$ shared pairs are launched in an SM, at least $S$ thread blocks always make progress. Also, if $U$ unshared thread blocks are launched in the SM, they always make progress. Therefore, at least $S + U$ thread blocks always make progress with our approach.
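In order to keep the number of effective thread blocks in our approach the same as that of the non-sharing approach, we need the following relation to hold:

$$S + U = \lfloor R / R_{tb} \rfloor \qquad (1)$$

For each shared pair of thread blocks, we allocate $R_{tb}(1 + t)$ resource units, and for each unshared thread block, we allocate $R_{tb}$ resource units. Since the total number of resource units available in the SM is $R$, we have:

$$S \cdot R_{tb}(1 + t) + U \cdot R_{tb} \le R \qquad (2)$$

The total number of thread blocks that can be launched in the sharing approach is equal to the number of unshared thread blocks plus twice the number of shared pairs, i.e.,

$$M = U + 2S \qquad (3)$$

Using Equations 1, 2, and 3 (substituting $U = \lfloor R / R_{tb} \rfloor - S$ into Equations 2 and 3),

$$M = \lfloor R / R_{tb} \rfloor + S, \qquad S \le \frac{R - \lfloor R / R_{tb} \rfloor R_{tb}}{t \, R_{tb}}$$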
Since the actual number of thread blocks that can reside in an SM also depends on other factors, such as (a) the maximum number of resident threads per SM, and (b) the maximum number of resident thread blocks per SM, the number of thread blocks that are launched in an SM by our approach is the minimum of the values obtained using factors (a) and (b), and the value $M$. When the number of thread blocks launched by our approach is more than that of the baseline approach (i.e., $\lfloor R / R_{tb} \rfloor$), we enable our resource sharing approach; otherwise, we launch all the thread blocks in unshared mode.
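The computation of the number of thread blocks to launch can be sketched as below. This is a minimal illustration, not the simulator's code: the function name `sharing_plan` and its parameters are hypothetical, and it assumes the relations stated in the text, namely that the number of effective blocks equals the baseline $\lfloor R / R_{tb} \rfloor$, that each shared pair costs $(1 + t)R_{tb}$ units, and that $M = U + 2S$.

```python
from fractions import Fraction


def sharing_plan(R, R_tb, t, max_blocks_by_threads, max_blocks_per_sm):
    """Return how many blocks to launch per SM and how many shared pairs to form."""
    t = Fraction(str(t))                 # exact arithmetic for thresholds like 0.1
    baseline = R // R_tb                 # blocks launched without sharing
    leftover = R - baseline * R_tb       # units wasted without sharing
    # Each shared pair costs t * R_tb units beyond one unshared block, and the
    # number of pairs cannot exceed the baseline (U must stay non-negative).
    pairs = min(baseline, leftover // (t * R_tb))
    M = baseline + pairs
    launched = min(M, max_blocks_by_threads, max_blocks_per_sm)
    sharing = launched > baseline        # enable sharing only if it adds blocks
    pairs = launched - baseline if sharing else 0
    return {
        "launched": launched,
        "shared_pairs": pairs,
        "unshared": launched - 2 * pairs,
        "sharing": sharing,
    }


# Running example from the text: 35K units per SM, 10K per block, t = 0.5.
print(sharing_plan(35000, 10000, 0.5, 100, 100))
```

For the running example, the sketch launches 4 thread blocks, with one shared pair and two unshared blocks, which agrees with the allocation described for Figure 2(b).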
OPTIMIZATIONS
With the proposed resource sharing approach, each SM has unshared and shared warps, and scheduling these warps plays a very important role in determining the performance of applications. We propose an optimization called "Owner Warp First (OWF)" to schedule these warps effectively. If two thread blocks TBi and TBj are a shared pair, and at least one of the warps of TBi waits for shared resources from TBj, we call TBj as Owner Block, and the warps that belong to TBj are called Owner Warps. TBi is called Non-Owner Block and warps of TBi are called Non-Owner Warps. As soon as the owner thread block finishes its execution, it transfers its ownership to the non-owner thread block (i.e., the non-owner thread block becomes the owner), and a new non-owner thread block gets launched.
Scheduling Owner Warp First (OWF)
A warp scheduler in the SM issues a warp every cycle from a pool of ready warps. With our solution, the warps can be categorized into three types viz., unshared, shared owner and shared non-owner. In register sharing, shared non-owner warps depend on the corresponding shared-owner warps to release registers, before they can make progress. Similarly, with scratchpad sharing, warps from non-owner thread blocks wait for owner thread blocks to complete their execution. Hence scheduling the warps plays a role in improving the performance of applications.
Consider a scenario shown in Figure 6 . Assume that an SM contains 3 warps: unshared (U), shared owner (O), shared non-owner (N) warps, and each warp needs to execute three instructions (I1, I2, and I3) as indicated in the figure. Assume that latency of Mov and Add instruction is 1 cycle, and the latency of Load instruction is 5 cycles. Also, assume that register R1 is an unshared resource and R2, R3 are shared resources. If unshared warp is prioritized over owner warp (shown as Unshared Warp First), the unshared warp executes I1 in the first cycle, and it starts execution of I2 in the 2nd cycle. However, it can not start I3 in the 3rd cycle because register R2 of I3 is dependent on the instruction I2, and I2 takes 5 cycles to complete the execution. If owner warp is prioritized over non-owner warp, it can start execution in the 3rd cycle. The non-owner warp which has the least priority can start its execution I1 at the 5th cycle. However, it can not execute I2 in the 6th cycle because it needs to acquire access to the shared resource R2, which is held by its owner warp. Hence, it waits until the owner warp releases the shared resources (i.e., till the 9th cycle). The non-owner warp can resume its execution in the 10th cycle and can finish in 15 cycles.
To minimize the waiting time of the non-owner warps, we propose an algorithm, Owner Warp First (OWF), that prioritizes warps in the order: shared owner, unshared, and shared non-owner. Giving the highest priority to shared owner warps helps finish them sooner, and hence the dependent shared non-owner warps can make progress. Since non-owner warps depend on their corresponding owner warps for shared resources, giving them low priority helps in hiding stalls when no other types of warps are ready to run. In Figure 6, with the OWF approach, the owner warp can finish sooner, i.e., in 7 cycles. Similarly, the unshared warp, with second priority, can finish in 9 cycles. Since the non-owner warp has low priority, it can start executing I1 in the 5th cycle. It can overlap the execution of I2 with the unshared warp in the 8th cycle because its owner warp has released the shared resources. Further, it can finish the execution in 13 cycles as shown in the figure, thus improving the overall performance.
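The OWF priority order can be sketched as a simple selection over the ready warps. The names `PRIORITY` and `pick_warp` are hypothetical, and breaking ties within a class by list order is an assumption of this sketch rather than something the text specifies.

```python
# Lower number means higher scheduling priority under OWF.
PRIORITY = {"owner": 0, "unshared": 1, "non_owner": 2}


def pick_warp(ready_warps):
    """Pick the next warp to issue from a list of (warp_id, kind) pairs.

    Returns None when no warp is ready (a stall cycle).
    """
    if not ready_warps:
        return None
    # min() keeps the first entry among equals, so list order breaks ties.
    return min(ready_warps, key=lambda warp: PRIORITY[warp[1]])[0]


print(pick_warp([("U", "unshared"), ("N", "non_owner"), ("O", "owner")]))
```

A non-owner warp is therefore issued only in cycles where neither an owner nor an unshared warp is ready.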
To leverage OWF optimization, we launch the additional thread blocks such that all the thread blocks in the SM will be initially arranged according to the order: owner, unshared, non-owner thread blocks. If we launch k thread blocks in addition to N original blocks (id: 0, . . . , N-1), then thread block pairs with ids (0, N), (1, N+1), . . . , (k-1, N+k-1) are launched as shared thread blocks, and thread blocks k to N-1 are launched as unshared thread blocks. Hence thread blocks with ids 0 to k-1 can become owner blocks, thread blocks k to N-1 become unshared blocks, and thread blocks N to N+k-1 become non-owner blocks.
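The launch order described above can be written out as a small sketch. The function name `assign_roles` is hypothetical, and the roles returned are the initial ones, since ownership is only established when a warp first waits for shared resources.

```python
def assign_roles(n_original, k_extra):
    """Return (shared pairs, initial role per block id) for N + k thread blocks."""
    if not 0 <= k_extra <= n_original:
        raise ValueError("each extra block needs a distinct original partner")
    pairs = [(i, n_original + i) for i in range(k_extra)]
    roles = {}
    for block_id in range(n_original + k_extra):
        if block_id < k_extra:
            roles[block_id] = "owner"       # ids 0 .. k-1
        elif block_id < n_original:
            roles[block_id] = "unshared"    # ids k .. N-1
        else:
            roles[block_id] = "non_owner"   # ids N .. N+k-1
    return pairs, roles


print(assign_roles(3, 1))
```

For N = 3 and k = 1, this pairs block 0 with block 3 and leaves blocks 1 and 2 unshared.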
Unrolling and Reordering of Register Declarations
In register sharing, non-owner warps need to wait for owner warps when they try to access shared registers. If the very first instruction issued by a non-owner warp uses a shared register, then the warp has to wait and cannot start its execution until the corresponding owner warp has released the shared register. In order to allow the non-owner warps to execute as many instructions as possible before stalling due to unavailability of shared registers, we unroll and reorder the register declarations. To illustrate this, consider the PTXPlus [3] code shown in Figure 7(a), which is generated by GPGPU-Sim [3] for the sgemm application from the Parboil suite [4]. The first instruction of the code accesses registers p0 and r124, which get the register sequence numbers 31 and 35 according to the declaration. These registers are part of the shared registers for a certain threshold value t. Hence, a non-owner warp has to wait until the registers are released. To delay accessing the shared registers, we unroll and rearrange the order of the register declarations so that p0 and r124 become unshared registers (i.e., they get the register sequence numbers 1 and 3, as shown in Figure 7(b)). Hence the non-owner warps get to execute more instructions before they start accessing shared registers.
To implement this optimization, we converted the assembly code (PTXPlus) produced by GPGPU-Sim into an optimized assembly code. To achieve this, we first find an order of registers according to their first usage. Further, to ensure that unshared registers are used before shared registers, we modify the register declarations so that a register that has been used first is declared first. Finally, we modified the GPGPU simulator to use optimized PTXPlus code for simulating instructions. This optimization can be easily integrated at assembly level using CUDA compiler.
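The reordering step can be illustrated with a toy sketch. The instruction strings below are simplified placeholders and not actual PTXPlus syntax, and the function name `reorder_by_first_use` is hypothetical; the sketch only shows the idea of declaring registers in order of first use.

```python
import re


def reorder_by_first_use(declared, instructions):
    """Return the declared registers sorted by their first use in the code.

    Registers that are never used keep their relative order at the end.
    """
    declared_set = set(declared)
    ordered = []
    for instruction in instructions:
        for token in re.findall(r"\w+", instruction):
            if token in declared_set and token not in ordered:
                ordered.append(token)
    unused = [reg for reg in declared if reg not in ordered]
    return ordered + unused


# Toy input: register names declared in one order, used in another.
declared = ["r0", "r1", "r2", "r124", "p0"]
code = ["set p0 r124 r0", "add r1 r0 r2"]
print(reorder_by_first_use(declared, code))
```

After reordering, the registers touched by the first instruction receive the lowest sequence numbers, so they are more likely to fall within the unshared range.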
Dynamic Warp Execution
A study by Kayiran et al. [19] shows that the performance of memory-bound applications can degrade with an increase in the number of resident thread blocks. Executing additional thread blocks can increase L1/L2 cache misses, which leads to an increase in the stall cycles. In register sharing, the additional warp (non-owner warp) resumes its execution as soon as its corresponding owner warp finishes, while in scratchpad sharing non-owner warps wait until their corresponding owner thread block finishes. In order to reduce the number of additional stalls due to the execution of non-owner warps in register sharing, we propose an optimization that can dynamically enable or disable execution of long latency (memory) instructions issued by the non-owner warps.
To control the execution of memory instructions from the non-owner warps, we monitor the number of stall cycles for each SM. When executing memory instructions from non-owner warps leads to an increase in the number of stalls, we decrease the probability of executing further memory instructions from the non-owner warps. To illustrate this, consider a GPU that has N SMs, all in sharing mode. Our approach disables execution of memory instructions for the non-owner warps only on a specific SM (e.g., SM0). Every other SM, SMi for i ∈ {1 . . . N − 1}, allows execution of memory instructions for the non-owner warps, and compares its stall cycles periodically with the stalls on SM0. If the stalls observed in SMi are more than the stalls appearing in SM0, then the probability of executing memory instructions on SMi from the non-owner warps is decreased by a predetermined value p. If the stalls in SMi are less than those in SM0, then the probability of executing memory instructions on SMi from the non-owner warps is increased by the same value p. Thus, we reduce the number of stall cycles by controlling the execution of memory instructions.
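A minimal sketch of the per-SM probability update, assuming a saturating counter with ten steps so that each step corresponds to p = 0.1 (the value the paper reports for its experiments). The class name `MemIssueThrottle` is hypothetical, and leaving the probability unchanged when the stall counts are equal is an assumption of this sketch.

```python
import random


class MemIssueThrottle:
    """Controls memory-instruction issue for non-owner warps on one SM."""

    def __init__(self, steps=10):
        self.steps = steps
        self.level = steps  # start at probability 1 (all memory instructions allowed)

    @property
    def probability(self):
        return self.level / self.steps

    def update(self, stalls_here, stalls_reference):
        """Compare against the reference SM (SM0) once per monitoring period."""
        if stalls_here > stalls_reference:
            self.level = max(0, self.level - 1)           # saturate at 0
        elif stalls_here < stalls_reference:
            self.level = min(self.steps, self.level + 1)  # saturate at 1

    def allow(self, rng=random):
        """Decide whether a non-owner memory instruction may issue now."""
        return rng.random() < self.probability
```

Storing the counter as an integer level avoids floating-point drift when the probability is adjusted many times.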
After running several experiments, we selected the periodicity of monitoring to be 1000 cycles, which ensures that (a) the monitoring overhead is not high, and (b) a sufficient number of stall cycles is observed. In our experiments, initially all the SMs (except SM0) are allowed to execute all memory instructions, i.e., the probability of executing memory instructions from non-owner warps is 1. Depending on the stall cycles observed for an SMi (i ∈ {1 . . . N − 1}), this probability for SMi is decreased or increased by p = 0.1, but is kept within the interval [0, 1] as a saturating counter.

HARDWARE REQUIREMENT

Figure 8 shows the modified architecture to implement our proposed resource sharing approach. In the additional storage units of Figure 8, the X dimension of each table refers to the number of bits, and the Y dimension refers to the number of entries. There are mainly two changes in the scheduling logic. The first change is that the warp scheduler uses the OWF policy to prioritize warps, using the owner information. The second change is the inclusion of a resource access check. A warp is considered to be ready for issuing only when it can access the required resources (resource access check) and has all its operands available (scoreboard check). The resource access unit follows the resource access mechanism (Figures 3 and 5).

Storage units required for register sharing:
1. Each SM requires a bit (ShSM in Figure 8, in Additional Storage Units corresponding to Register Sharing) to specify whether sharing mode is enabled for it. This bit is set when the number of thread blocks assigned to the SM using resource sharing is more than the default number of thread blocks per SM.
2. Each resident thread block stores its shared thread block id in the ShTB table, shown in the figure. If a thread block is in unshared mode, its corresponding value is set to -1. For T thread blocks, $T \log_2(T + 1)$ bits are required per SM (assuming ids 0 to T-1 for T thread blocks, we can use id T to represent -1).
3. Each warp requires a bit for specifying the owner information, which is stored in the Owner table in the figure. This bit is set only when the warp is an owner warp. Hence, for W warps, W bits are needed.
4. Each warp requires a bit to specify whether it is in sharing mode (shown in the figure). A warp is set to be in sharing mode when its corresponding thread block is in sharing mode. For W warps in an SM, W bits are required. For a warp in shared mode, its corresponding shared warp can be identified using the sharer thread block id of its thread block and its relative position in the thread block.
5. Each pair of shared warps uses a lock variable to access the shared registers exclusively. The lock variable is set to the id of the warp which has gained access to the shared registers. This is maintained in the Lock table in the figure. If an SM has W warps, there can be a maximum of W/2 shared pairs of warps in the SM. Hence, we need a total of $(W/2) \log_2 W$ bits per SM.

Storage units required for scratchpad sharing:
1. Similar to register sharing, the scratchpad sharing approach also requires the ShSM, ShTB, and Owner tables as described above. These tables are shown in Figure 8 in Additional Storage Units corresponding to Scratchpad Sharing.
2. Each pair of shared thread blocks uses a lock variable to access the shared locations exclusively. The lock variable is set to the id of the thread block which has gained access to the shared scratchpad region. If an SM has T thread blocks, there can be a maximum of T/2 shared pairs of thread blocks in the SM. Hence, we need a total of $(T/2) \log_2 T$ bits per SM. Similar to register sharing, these values are maintained in the Lock table, shown in Figure 8 for scratchpad sharing.

The total amount of storage required (in bits) for a GPU with N SMs for implementing register sharing is:

$$\left(1 + T \log_2(T + 1) + 2W + \tfrac{W}{2} \log_2 W\right) \times N$$

and for implementing scratchpad sharing is:

$$\left(1 + T \log_2(T + 1) + W + \tfrac{T}{2} \log_2 T\right) \times N$$

Here each logarithm is rounded up to a whole number of bits. For the architecture shown in Table 1, the additional storage required per SM is 273 bits for register sharing and 93 bits for scratchpad sharing. In addition to storage units, the resource access unit requires two comparator circuits to implement the steps (b) and (c) shown in Figures 3 and 5. Similarly, it requires an arithmetic circuit to set the lock as shown in step (e).
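The per-SM storage figures can be reproduced with a short calculation. The sketch below assumes T = 8 thread blocks and W = 48 warps per SM (1536 resident threads divided by a warp size of 32), which are inferred from the limits quoted in the evaluation rather than read from Table 1, and it assumes each logarithm is rounded up to whole bits. The function names are hypothetical.

```python
from math import ceil, log2


def register_sharing_bits(T, W):
    """Extra bits per SM: ShSM + ShTB + (Owner and sharing-mode bits) + Lock."""
    return 1 + T * ceil(log2(T + 1)) + 2 * W + (W // 2) * ceil(log2(W))


def scratchpad_sharing_bits(T, W):
    """Extra bits per SM: ShSM + ShTB + Owner + Lock (per thread block pair)."""
    return 1 + T * ceil(log2(T + 1)) + W + (T // 2) * ceil(log2(T))


# Assumed configuration: 8 thread blocks and 48 warps per SM.
print(register_sharing_bits(8, 48), scratchpad_sharing_bits(8, 48))
```

Under these assumptions the two functions give 273 and 93 bits per SM, the values reported in the text; multiplying by the number of SMs gives the total for the GPU.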
EXPERIMENTS AND ANALYSIS
We implemented our approach using GPGPU-Sim V3.X [3]. Table 1 shows the baseline architecture used for comparison. We evaluated our approach on several applications from the GPGPU-Sim [7], Rodinia [9], CUDA-SDK [2], and Parboil [4] benchmarks. Depending on the resource requirement of applications, we divided the benchmarks into three sets. Set-1 (Table 2) consists of applications whose number of thread blocks per SM is limited by registers. Set-2 (Table 3) has applications that are limited by scratchpad memory. Set-3 (Table 4) has applications that are limited neither by registers nor by scratchpad memory (i.e., they are limited either by the number of resident threads or the number of resident thread blocks). We choose Set-3 applications to ensure that our approach does not degrade the performance of applications that are limited neither by registers nor by scratchpad memory. For each application in Table 2 and Table 3, we show the names of the kernels used for evaluation and the number of threads per thread block. In Table 2, we report the number of registers per thread for each kernel, which GPGPU-Sim uses to compute the number of resident thread blocks, and in Table 3 we show the amount of scratchpad memory used by each thread block.
We use a threshold value ($t$) to configure the percentage of resource sharing. For example, if each thread block requires $R_{tb}$ units of a resource and we choose $t = 0.1$, then we allocate $1.1 \times R_{tb}$ resource units to each pair of shared thread blocks, which means that 90% of the $R_{tb}$ units are shared between the two thread blocks. So for a given threshold $t$, the percentage of resource sharing is $(1 - t) \times 100$. We analyzed the performance of our approach for each application by varying $t$ and chose $t = 0.1$ (i.e., 90% resource sharing) for our results (for details, see [16]).
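The effect of the threshold can be sketched numerically. The first two functions follow the text directly; the third is an assumption made only for illustration: it searches for the mix of unshared thread blocks and sharing pairs that fits in the resource budget, which may differ from the allocation policy actually implemented. All names and the example kernel (256 threads with 36 registers each) are hypothetical.

```python
def pair_allocation(r_tb: float, t: float) -> float:
    """Resource units allocated to one pair of sharing thread blocks."""
    return (1 + t) * r_tb


def sharing_percentage(t: float) -> float:
    """Percentage of a thread block's resources that is shared."""
    return (1 - t) * 100


def max_blocks_with_sharing(total: float, r_tb: float, t: float,
                            hard_limit: int) -> int:
    """Assumed model: best count over all mixes of unshared blocks and pairs."""
    best = 0
    for pairs in range(hard_limit // 2 + 1):
        left = total - pairs * pair_allocation(r_tb, t)
        if left < 0:
            break
        unshared = int(left // r_tb)
        best = max(best, min(hard_limit, unshared + 2 * pairs))
    return best


r_tb = 256 * 36  # registers per thread block for the hypothetical kernel
print(max_blocks_with_sharing(32 * 1024, r_tb, t=1.0, hard_limit=6))  # 3
print(max_blocks_with_sharing(32 * 1024, r_tb, t=0.1, hard_limit=6))  # 6
```

With $t = 1$ nothing is shared and the count equals the unshared baseline; with $t = 0.1$ the same register file holds twice as many thread blocks in this example, up to the limit on resident threads.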
We measure the performance of our approach using the following metrics, which are reported by GPGPU-Sim [3]:
1. The Number of Resident Thread Blocks: The number of thread blocks that are launched in an SM. We choose this metric to compare the amount of TLP that is present in an SM.
2. Instructions Per Cycle (IPC): The number of instructions that are simulated per core clock cycle. We use it to measure the performance of our modified GPU architecture on the benchmark applications.
3. Simulation Cycles: The number of cycles that a kernel takes to complete its simulation. We use this metric to measure the performance of the benchmark applications on our modified GPU architecture.
4. Pipeline Stall Cycle: A cycle in which no warp can execute an instruction because the execution units are busy. We use this metric to show that our approach helps in hiding long latency instructions.
5. Idle Cycle: A cycle in which no warp is ready to execute its next instruction. We choose this metric to show that the additional thread blocks launched by our approach help in minimizing the cycles in which SMs are idle.
Analysis of Set-1 and Set-2 benchmarks

Increase in the number of thread blocks
Figures 9(a) and (b) show that resource sharing helps in increasing the number of thread blocks launched for the applications. Figure 9(a) compares the effective number of thread blocks launched by the register sharing approach (denoted as Shared-OWF-Unroll-Dyn) with that of the baseline implementation (denoted as Unshared-LRR). For applications MUM, backprop, hotspot, and mri-q, our approach is able to launch 6 thread blocks (i.e., 1536 threads), which is the limit on the number of resident threads per SM. Applications stencil and b+tree launch 3 thread blocks per SM, compared to 2 in the baseline approach. For applications LIB and sgemm, our approach is able to launch 8 thread blocks per SM, which is the limit on the number of resident thread blocks.
In Figure 9(b), we compare the number of resident thread blocks launched by scratchpad sharing (labeled as Shared-OWF) with the baseline approach. For applications CONV1, NW1, and NW2, we launch 8 thread blocks per SM, which is the limit on the number of resident thread blocks.
Performance analysis
Figure 9(c) shows the improvement in IPC with register sharing over the baseline LRR (Loose Round Robin) implementation. We observe that applications show an average improvement of 11% with register sharing. Applications b+tree, hotspot, MUM, and stencil achieve significant speedups of 11.98%, 21.76%, 24.14%, and 23.45% respectively. Similarly, Figure 9(d) shows the improvement in IPC with scratchpad sharing. We observe that applications show an average improvement of 12.5% with scratchpad sharing. CONV2, lavaMD, and SRAD1 achieve speedups of 15.85%, 29.96%, and 25.73% respectively. These applications leverage all our optimizations to perform better. The improvement in IPC for lavaMD is due to two reasons: (1) the number of resident thread blocks launched by our approach is twice that of the baseline approach, and (2) no instruction that uses a scratchpad memory location falls into the shared scratchpad, hence all the additional thread blocks execute instructions without waiting for shared thread blocks.
Though LIB launches 8 thread blocks per SM with register sharing, it improves by only 0.84%. This is due to an increase in L2 cache misses caused by the additional shared blocks. The benchmarks backprop and sgemm achieve modest improvements of 5.82% and 4.06% respectively with register sharing. Similarly, CONV1, NW1, and NW2 show improvements of 4.33%, 5.62%, and 9.03% respectively with scratchpad sharing. mri-q slows down by 0.72% because the additional shared blocks increase L1 cache misses and hence the number of stalls. SRAD2 improves by only 0.1% because a barrier instruction placed next to a shared scratchpad access limits the progress of shared threads that do not access any shared scratchpad location.
In Figures 10(a) and (b), we show the percentage decrease in the number of simulation cycles with register and scratchpad sharing when compared to the baseline approach. Since the number of instructions executed in our approach is the same as in the baseline approach, all the applications that show an improvement in IPC in Figures 9(c) and (d) take fewer simulation cycles to complete their execution with our approach. That is why Figure 9 and Figure 10 show a similar trend.
Effectiveness of optimizations
Figure 11(a) compares the register sharing optimizations with the baseline approach. We first consider register sharing without any optimization under the existing baseline LRR scheduling policy (labeled Shared-LRR-NoOpt). The application hotspot achieves a speedup of 13.65% even without any optimization because the additional thread blocks launched by our approach help in hiding execution latencies. With the register unrolling optimization (labeled Shared-LRR-Unrolled), the improvement rises to 15.18% because register unrolling enables threads to execute more instructions before they start accessing shared registers. When we enable dynamic warp execution (labeled Shared-LRR-Unrolled-Dyn), the improvement is only 14.58% because it limits the execution of memory instructions from non-owner warps. However, when we apply the OWF optimization (labeled Shared-OWF-Unrolled-Dyn), the speedup increases to 21.76%. With the OWF optimization, the priority of non-owner warps is lower than that of the other warps. Hence the memory instructions issued by non-owner warps do not interfere with the other warps, which minimizes the L1/L2 cache misses. We see that b+tree behaves similarly to hotspot in terms of performance gain as the optimizations are varied.
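Because sharing leaves the instruction count unchanged, an IPC improvement converts to a decrease in simulation cycles through cycles = instructions / IPC. The sketch below applies that identity to the hotspot IPC improvements quoted in the text. The function name is hypothetical, and the printed cycle reductions are derived here for illustration rather than read from Figure 10.

```python
def cycle_reduction_percent(ipc_improvement_percent: float) -> float:
    """Percentage decrease in cycles for a given percentage increase in IPC,
    assuming the number of executed instructions is unchanged."""
    speedup = 1 + ipc_improvement_percent / 100
    return (1 - 1 / speedup) * 100


# hotspot IPC improvements quoted in the text, per optimization stage
hotspot = [
    ("Shared-LRR-NoOpt", 13.65),
    ("Shared-LRR-Unrolled", 15.18),
    ("Shared-LRR-Unrolled-Dyn", 14.58),
    ("Shared-OWF-Unrolled-Dyn", 21.76),
]
for label, gain in hotspot:
    print(f"{label}: {cycle_reduction_percent(gain):.2f}% fewer cycles")
```

The cycle reduction is always somewhat smaller than the IPC gain (an IPC improvement of 21.76% corresponds to roughly 17.9% fewer cycles), but the two quantities rise and fall together.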
MUM slows down by 0.15% when we do not use any optimization. We observe that the increase in resident thread blocks leads to an increase in the number of memory instructions issued by non-owner warps, which increases L1 and L2 cache misses. Although the L1/L2 cache misses increase, the other instructions issued by the non-owner warps help in minimizing the stall cycles. With the register unrolling optimization, we see a slight improvement (0.08%). When we apply dynamic warp execution, MUM shows a speedup of 6.45%. From this, we infer that dynamic warp execution reduces the additional stall cycles produced by issuing memory instructions from the non-owner warps. With the OWF optimization, the improvement rises further to 24.14% because of the decrease in interference from non-owner warps.
LIB shows an improvement of 2% using sharing with no optimizations. We observe the same performance with the unrolling optimization because the number of instructions that use unshared registers before the first access to shared registers is exactly the same as without the optimization. With dynamic warp execution, we still observe the same performance, since in this application all the owner warps have completed executing all their instructions before any non-owner warp starts issuing memory instructions. With the OWF optimization, we observe a small degradation because of an increase in the number of stall cycles compared to the LRR policy.
The benchmarks sgemm, backprop, and stencil achieve good improvements only when the OWF optimization is enabled. Since instructions issued by non-owner warps execute with the least priority, they do not interfere with other warps and hence minimize L1/L2 cache misses. We do not see any performance improvement with mri-q because the additional thread blocks increase L1 cache misses with our approach. However, the slowdown is reduced to 0.72% in the presence of all the optimizations.
To summarize, memory-bound applications, like MUM, take advantage of our sharing approach in the presence of the dynamic warp execution and OWF optimizations, whereas compute-bound applications, like hotspot, perform better even without any optimizations and improve further with the OWF optimization.
In Figure 11(b), we show the effect of the OWF optimization on scratchpad sharing. lavaMD shows an improvement of 28% even without any optimization (labeled Shared-LRR-NoOpt). This is because the additional thread blocks do not access any memory location that belongs to the shared scratchpad memory. The CONV1, CONV2, SRAD1, and SRAD2 applications show improvements of 5.68%, 6.21%, 11.1%, and 5.28% respectively without any optimization, which is due to the additional thread blocks that help in hiding latencies.
With the OWF optimization, the improvements of the CONV2, NW1, NW2, and SRAD1 applications rise to 15.85%, 5.62%, 9.03%, and 25.73% respectively. Since the OWF optimization schedules the owner warps efficiently, it helps in minimizing stall cycles and thus improves IPC. lavaMD improves to 30%, since it benefits more from sharing than from the OWF optimization. CONV1 and SRAD2 perform better when no optimization is applied because these applications incur extra cache misses (L1 and L2) and extra stall cycles with the OWF optimization when compared to no optimization.
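The scheduling effect described for OWF can be sketched as a priority ordering over ready warps. The text states only that non-owner warps receive the least priority; placing owner warps ahead of unshared warps is an assumption made here for illustration, as are the `Warp` class, the `kind` labels, and the function name.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    warp_id: int
    kind: str  # "owner", "unshared", or "non_owner" (hypothetical labels)


# Lower value = scheduled earlier. The order of "owner" vs "unshared" is an
# assumption; the text only fixes "non_owner" as the lowest priority.
PRIORITY = {"owner": 0, "unshared": 1, "non_owner": 2}


def owf_order(ready_warps):
    """Return ready warps in issue order; ties keep their original order."""
    return sorted(ready_warps, key=lambda w: PRIORITY[w.kind])


ready = [Warp(0, "non_owner"), Warp(1, "unshared"), Warp(2, "owner"),
         Warp(3, "non_owner"), Warp(4, "owner")]
print([w.warp_id for w in owf_order(ready)])  # [2, 4, 1, 0, 3]
```

Because non-owner warps are considered last, their memory instructions are issued only when no higher-priority warp is ready, which is consistent with the reduced cache interference reported above.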
Comparison with other schedulers
In Figures 12(a) and (b), we show the performance improvement of register sharing and scratchpad sharing, respectively, over the GTO (Greedy Then Oldest) scheduler.
We observe that our approach shows an improvement of up to 3.9% with register sharing and of up to 30% with scratchpad sharing. backprop shows the same number of L2 misses as the baseline GTO, but it has more L1 misses with our approach. In stencil, we observe extra L2 misses with our approach. NW1 and NW2 degrade with our approach because they have fewer stall cycles with the GTO scheduler than with our approach. Further, as shown in Figure 12
Reduction in idle and stall cycles
In Figures 13(a) and (b), we report the percentage decrease in the number of idle cycles and pipeline stall cycles when compared to the baseline approach. We observe that all applications but one show a reduction in the number of idle cycles (up to 99%). This is expected because, as the number of thread blocks increases, the number of instructions that are ready to execute also increases. For MUM, LIB, backprop, hotspot, and stencil, the stall cycles also reduce with register sharing. Similarly, for the CONV2, NW1, NW2, SRAD1, and SRAD2 applications, the number of stall cycles reduces with scratchpad sharing. This indicates that the additional thread blocks launched with our approach hide the long execution latencies more effectively. We observe an increase in the stall cycles for applications b+tree and sgemm. However, since the number of idle cycles has reduced significantly, overall we see a benefit with our approach. For mri-q, the number of stall cycles increases with our approach due to the increase in L1 cache misses. lavaMD shows an increase of 259 stall cycles because the additional threads wait for execution units (SP units) to become ready. For CONV1, we see an increase in the number of stalls with our approach due to L1 cache misses.
Footnote: It shows a 49.5% decrease in idle cycles.
Resource savings
We also compare our approach against an LRR scheduler that uses twice the resources. In Figure 14(a), the baseline approach (labeled as Unshared-LRR-Reg#65536) uses 64K registers, whereas our approach uses only 32K registers. Even though the larger register file increases the number of resident thread blocks in the baseline approach, our approach performs better in 5 out of 8 applications. MUM performs better with our approach even though the number of thread blocks is the same (6) in both approaches, because the dynamic warp execution optimization helps in minimizing the stalls produced by the additional thread blocks. sgemm, b+tree, and LIB perform better with the baseline approach due to an increase in the number of resident thread blocks and hence in the number of active warps. In Figure 14(b), we compare the scratchpad sharing approach, which uses 16K bytes of memory, with the baseline approach, which uses 32K bytes of memory. From the figure we observe that the performance of CONV1, NW1, and NW2 is comparable to that of the baseline approach because our approach can launch the same number of thread blocks as the baseline approach. lavaMD performs better than the baseline approach because sharing helps in minimizing latencies. CONV2, SRAD1, and SRAD2 degrade with our approach because it has fewer resident thread blocks and more stall cycles than the baseline approach.
Analysis of Set-3 benchmarks
The performance of the register sharing and scratchpad sharing approaches for the Set-3 applications (Table 4) is presented in Figures 15(a) and (b) respectively. As discussed earlier, these applications are not limited by the number of available resources but by other factors such as the number of threads or thread blocks. We measure their performance when our approach uses (1) the LRR scheduling policy, (2) the GTO scheduling policy, and (3) the OWF scheduling policy (see footnote 4). From Figures 15(a) and (b), we observe that our proposed resource sharing approach, when used with LRR scheduling (labeled as Shared-LRR-Unroll-Dyn), performs exactly the same as the baseline LRR scheduling (Unshared-LRR). Since the number of thread blocks launched by these applications is not limited by the resources, our approach does not launch any additional thread blocks, and all the thread blocks are in unsharing mode. Hence, it behaves exactly like the baseline approach. Similarly, our approach, when used with the GTO scheduling policy (Shared-GTO-Unroll-Dyn), performs exactly the same as the baseline approach that uses the GTO scheduling policy without sharing (Unshared-GTO). Finally, we observe that with the OWF scheduling policy (shown as Shared-OWF-Unroll-Dyn), the performance of our approach is comparable to that of the Unshared-GTO implementation.
Footnote 4: In OWF optimization, the warps are arranged according to the priorities of
From the results of the Set-1, Set-2, and Set-3 benchmark applications, we can say that if the number of thread blocks launched by an application is limited either by registers or by scratchpad memory (as in Set-1 and Set-2), then the application can leverage our sharing approach to improve its performance. When applications are limited either by the number of resident threads or by the number of thread blocks, our approach does not launch any additional thread blocks, and they perform comparably to the baseline approach.
RELATED WORK
Xiang et al. [32] discussed thread block level resource management. They proposed a hardware solution to launch a partial thread block when there are not enough resources to launch a full thread block. Their solution can have only one partial thread block running. The patented register management [31] uses the concept of virtual registers, which outnumber the actual physical registers, and hence can launch more thread blocks than the physical registers allow. This can be combined with our solution. Yang et al. [34] proposed hardware and software solutions to the problem caused by the allocation and deallocation of shared memory at the thread block granularity. Their solution is complementary to our approach. A compiler-based coordinated register allocation [33] was proposed to improve GPU performance by reducing the register spilling cost.
GPU register file virtualization [17] proposes techniques to share register space across warps. It uses compiler-generated lifetime information to allocate dead registers to another warp. However, it is used for reducing power consumption and not for improving performance. In contrast, our sharing approach focuses on improving the TLP and thereby improving performance. Li et al. [24] proposed a resource virtualization scheme to minimize the underutilization of system resources by sharing the GPU resources among multiprocessors. Their approach achieves speedup by overlapping multiple kernel executions on a virtual GPU. Warped Register File [5] describes a solution to reduce power consumption in the register file by turning off unallocated registers. Gebhart et al. [14] proposed a unified memory for registers, scratchpad, and primary cache, which partitions the resources of an SM according to the application's needs. It requires substantial hardware changes to access the unified storage.
Other techniques to improve GPU performance include reducing cache contention, improving DRAM bandwidth, hiding long latencies, and reducing energy consumption. Rogers et al. [29] propose a cache-conscious wavefront scheduling algorithm which makes use of an intra-wavefront locality detector, focusing on the shared L1 cache. A two-level warp scheduler [27] proposed by Narasiman et al. divides warps into groups and schedules the warps in each group in a round robin manner to hide long latencies more effectively. Sethia et al. [30] proposed a memory-aware warp scheduling approach that prioritizes the memory requests of a single warp when memory saturation occurs. Priority-based cache allocation [23] was proposed to enable high throughput and better resource utilization. Gebhart et al. [13] proposed energy-efficient hierarchical register file storage and a two-level warp scheduler for high throughput processors. OWL [18] proposes techniques to reduce cache contention and improve DRAM bank level parallelism. Lee et al. [20] proposed an alternative thread block scheduling mechanism to improve GPU performance, and also proposed mixed concurrent kernel execution to improve resource utilization and performance. The warp criticality problem [22] has been addressed by scheduling critical warps more frequently than others. Also, Lee et al. [21] proposed a coordinated solution to accelerate the execution of critical warps. Ma et al. [25] proposed an algorithm for shared memory allocation using an integer programming framework. It maximizes performance by maximizing the access to shared memory. Other solutions improve the throughput of GPUs by handling branch and thread divergence. Fung et al. [12] dynamically form warps to minimize branch divergence. The performance of this approach depends on the order in which the warps are issued to the pipeline. Thread block compaction [11] was proposed to reduce divergence by regrouping threads into new warps at the divergent branch. Anantpur et al. [6] proposed a compiler-based control flow linearization technique to handle branch divergence. Similarly, other hardware and software techniques [8, 10, 15, 26, 28] were proposed to handle branch and thread divergence, and these are orthogonal to our approach.
CONCLUSIONS AND FUTURE WORK
We proposed sharing some of the resources of an SM to minimize their wastage by launching additional thread blocks in each SM. To use these additional thread blocks effectively, we proposed optimizations that further help in reducing the stalls produced in the system.
We validated our approach for register sharing and for scratchpad sharing on several applications and showed improvements of up to 24% (11% on average) with register sharing, and of up to 30% (12.5% on average) with scratchpad sharing.
In the future, we plan to incorporate traditional compiler analysis and optimizations into our approach. For example, live range analysis along with instruction reordering can be used to detect and release registers that are not used beyond a certain point. Such registers, if shared, can be used by the warp in the other thread block that is waiting for shared registers. Further, we plan to combine both approaches to improve the performance of applications that are limited by both registers and scratchpad memory. We also plan to extend our work to study the effect on our approach of techniques such as increasing the number of registers per thread, allocating temporary variables into available resources, and applying different cache replacement policies.
