ABSTRACT
Graphics Processing Units (GPUs), which consist of Streaming Multiprocessors (SMs), achieve high throughput by running a large number of threads and context switching among them to hide execution latencies. The amount of thread level parallelism that can be exploited depends on the number of threads resident on each SM. Threads are typically structured into a grid of thread blocks, each containing a large number of threads. The number of thread blocks, and hence the number of threads that can be launched on an SM, depends on the resource usage of the thread blocks, e.g., the number of registers and the amount of shared memory. Since threads are allocated to an SM at thread block granularity, some of the resources may not be used completely and hence are wasted.
INTRODUCTION
Graphics Processing Units (GPUs) have been effectively used to accelerate large data parallel applications. GPUs consisting of Streaming Multiprocessors (SMs) achieve high throughput by concurrently executing a large number of threads to hide long latencies. The throughput achieved by a GPU depends on the amount of thread level parallelism (TLP) it utilizes. Recent studies [16, 22, 23, 24] focus on improving the throughput of GPUs by better exploiting TLP.
The amount of TLP utilized by a GPU depends on the number of threads resident on it. When an application is launched on a GPU, an execution configuration, consisting of the number of thread blocks and the number of threads per thread block, is specified. The number of thread blocks that can actually be launched on an SM depends on the resource requirements, such as the number of registers and the amount of shared memory needed by each thread block. If an SM contains R registers and a thread block requires R_tb registers to complete its execution, then the SM can launch at most ⌊R/R_tb⌋ thread blocks, utilizing R_tb × ⌊R/R_tb⌋ registers. The remaining R mod R_tb registers are wasted. In this paper we propose an approach, called Register Sharing, to launch more thread blocks using the wasted registers and improve the throughput.
It is observed in an earlier study [16] that increasing the number of threads benefits compute bound applications, but may result in increased L1/L2 cache misses for memory bound applications, thereby decreasing their performance.
In order to use the extra thread blocks and their constituent warps effectively, we propose three optimizations, viz., (i) Owner Warp First Scheduling, (ii) Unrolling and Reordering of Register Declarations, and (iii) Dynamic Warp Execution.
We make the following contributions in this work:
1. To utilize the register resources of GPUs effectively, we propose a novel register sharing mechanism that enables launching of more thread blocks per SM.
2. To further improve the throughput of applications, we propose three optimizations.
3. We have implemented our approach using the GPGPU-Sim simulator and evaluated it on several applications from the GPGPU-Sim, Rodinia, and Parboil benchmark suites. We observed a maximum improvement of 24% and an average improvement of 11%.
The rest of the paper is organized as follows: Section 2 describes the background required for our register sharing approach. Section 3 motivates the need for increasing register utilization. Details of our approach are presented in Sections 4 and 5. In Section 6 we discuss the hardware overhead of implementing our approach. Section 7 describes the experimental evaluation. Section 8 discusses related work and Section 9 concludes the paper.
BACKGROUND
A typical NVIDIA GPU [1] consists of a set of Streaming Multiprocessors (SMs), and each multiprocessor has execution units called Stream Processors (SPs). CUDA [1] supports extensions to languages, such as C, that allow programmers to define and invoke parallel functions, called kernels. The number of thread blocks that can reside on an SM depends on: (a) the number of registers used by a thread block and the number of registers available in the SM, (b) the amount of shared memory used by a thread block and the amount of shared memory available in the SM, (c) the maximum number of threads allowed per SM, and (d) the maximum number of thread blocks allowed per SM. The threads in a thread block are further divided into sets of 32 consecutive threads called warps. Each SM contains one or more warp schedulers, each of which schedules a ready warp every cycle from a pool of ready warps. All threads in a warp execute the same instruction.
Warp schedulers issue instructions in order, so when the current instruction of a warp cannot be issued, the warp is not considered ready. If no warp can be scheduled in a cycle, that cycle is a stall cycle. As the number of stall cycles increases, the run time increases and the throughput decreases. Our approach increases the number of resident thread blocks by utilizing the wasted registers on each SM, thus increasing the number of resident warps, and also improves warp scheduling to better hide long latencies.
MOTIVATION
The number of registers available on an SM constrains the number of thread blocks that can reside on it. In this paper, we propose sharing registers to reduce the impact of this constraint.
We analyzed several benchmark applications using the GPGPU-Sim [2] simulator. Figure 1 shows the number of resident thread blocks per SM, and Figure 2 shows the percentage of registers that remain unutilized per SM. Consider the application hotspot. Each thread of this benchmark needs 36 registers, and there are 256 threads in each block, so the number of registers required per thread block is 9216 (36 × 256). According to the configuration in Table 1, the number of registers available on an SM is 32768, so an SM can fit only 3 thread blocks (⌊32768/9216⌋). This leaves 5120 registers per SM unutilized. Similar behavior is observed for other applications as well.
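As an illustrative check of this arithmetic, the following minimal C++ sketch (with the hotspot and Table 1 figures hard-coded) reproduces the resident block count and the number of wasted registers:

    #include <cstdio>

    int main() {
        const int regs_per_thread   = 36;     // hotspot, per Table 2
        const int threads_per_block = 256;
        const int regs_per_sm       = 32768;  // Tesla C2050, per Table 1

        int regs_per_block  = regs_per_thread * threads_per_block; // 9216
        int resident_blocks = regs_per_sm / regs_per_block;        // floor: 3
        int wasted_regs     = regs_per_sm % regs_per_block;        // 5120

        printf("blocks/SM: %d, wasted registers: %d\n",
               resident_blocks, wasted_regs);
        return 0;
    }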
Applications that are constrained by their resource requirements may not only have low residency, but also waste resources of GPU. Our approach uses sharing to reduce the number of unutilized registers in order to increase the number of resident thread blocks. Our experiments show that these extra thread blocks help to hide long execution latencies and increase throughput.
REGISTER SHARING
Register sharing increases the number of thread blocks on an SM by allowing two thread blocks to share registers. We explain the idea with an example. Consider an application with thread blocks of 10 warps (320 threads) each, where each warp requires 1K registers (equivalently, each thread block requires 10K registers) to complete its execution. If an SM has 35K registers, at most 3 thread blocks can be resident on it. Thus 30K registers per SM are utilized, while the remaining 5K registers are wasted. The schematic of this approach (baseline) is shown in Figure 3(a), where thread blocks TB0, TB1, and TB2 are scheduled on an SM.
In order to reduce this wastage of registers, our approach allocates one more thread block (TB3) in sharing mode with TB2. Instead of allocating 10K registers separately to each of TB2 and TB3, a total of 15K registers is allocated for the two blocks. These 15K registers are distributed such that 1.5K registers are assigned to each pair of warps consisting of one warp each from TB2 and TB3 (Figure 3(b)).
Consider the pair of warps W20 and W30, which is allocated a total of 1.5K registers. Our approach allocates 0.5K registers each to W20 and W30; we call these "Unshared Registers". The remaining 0.5K registers, called "Shared Registers", are allocated to the two warps together in an exclusive manner, i.e., only one of them can access the pool of shared registers at a time. For example, if warp W20 accesses the shared registers first, exclusive access to all 0.5K shared registers is given to W20, while W30 is prevented from accessing any of them until W20 finishes. This implies that W30 can continue its execution until its first access to any of the 0.5K shared registers, at which point it busy-waits. Only after W20 finishes execution can W30 access the shared registers and continue. This way, additional warps make some progress, which helps hide execution latencies.
We refer to any two thread blocks as "Shared Blocks" when they share registers exclusively (for example, TB2 and TB3 in Figure 3(b)). To generalize this idea and to compute the increase in the number of thread blocks, consider a GPU that provides R registers per SM, and a thread block that requires R_tb registers, with each warp in the thread block requiring R_w registers to complete its execution. To increase the number of thread blocks by sharing registers with other resident thread blocks on the SM, we allocate R_tb(1 + t) registers (for a threshold 0 < t ≤ 1) to each pair of shared thread blocks, instead of allocating 2R_tb registers to them (e.g., in Figure 3(b), t is 0.5). Equivalently, we allocate R_w(1 + t) registers to each pair of warps (one warp from each thread block of the shared pair), such that each of these warps can access R_w·t unshared registers independently, and they can access the remaining R_w(1 − t) shared registers only when granted exclusive access.
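To make the split concrete, here is a minimal sketch of the per-pair allocation; the struct and function names are ours, not part of the hardware design:

    #include <cstdio>

    // Registers allocated to one shared pair of warps, split into
    // per-warp unshared registers and a common exclusive-access pool.
    struct PairAllocation {
        int unshared_per_warp; // R_w * t, private to each warp
        int shared_pool;       // R_w * (1 - t), lock-protected
    };

    PairAllocation allocate_pair(int regs_per_warp /* R_w */, double t) {
        PairAllocation a;
        a.unshared_per_warp = static_cast<int>(regs_per_warp * t);
        a.shared_pool       = static_cast<int>(regs_per_warp * (1.0 - t));
        // Total for the pair: 2*R_w*t + R_w*(1-t) = R_w*(1+t)
        return a;
    }

    int main() {
        PairAllocation a = allocate_pair(1024, 0.5); // the 1K-register example
        printf("unshared per warp: %d, shared pool: %d\n",
               a.unshared_per_warp, a.shared_pool);  // 512 and 512
        return 0;
    }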
We allocate registers to a warp dynamically on its first use, and we deallocate them from the register file after the warp has finished its execution. Every unshared register is allocated on request, but a shared register is allocated only to the warp that has exclusive access. To categorize a register accessed by a warp as shared or unshared, and to access it efficiently from the register file unit, we modify the existing register file access mechanism as shown in Figure 4. When a warp (WarpId) needs to access a register (RegNo), we first check whether the warp is an unshared warp, i.e., whether it belongs to an unshared thread block (Figure 4, step (b)). If it is an unshared warp, it can directly access the register from the register file using the combination (WarpId, RegNo). If WarpId is a shared warp, the accessed register is an unshared register if RegNo ≤ R_w·t (step (c)), because R_w·t unshared registers are allocated to each warp. If RegNo > R_w·t, we treat the register as a shared register. A warp can access an unshared register directly from the register file, but it can access a shared register only when it gets exclusive access by acquiring a lock (step (e)); otherwise it retries the access in a later cycle.
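The decision sequence of Figure 4 can be summarized in code; this is a sketch of our reading of the figure (the lock interface and names are illustrative, not the actual hardware):

    enum class Access { Direct, LockAcquired, Retry };

    // One lock per shared pair of warps; holds the id of the warp
    // that currently owns the shared pool (-1 if free).
    struct PairLock {
        int holder = -1;
        bool try_acquire(int warp_id) {
            if (holder == -1 || holder == warp_id) { holder = warp_id; return true; }
            return false;
        }
    };

    // Classify a register access per Figure 4. rwt is the number of
    // unshared registers allocated to each shared warp (R_w * t).
    Access access_register(bool is_shared_warp, int reg_no, int rwt,
                           int warp_id, PairLock& lock) {
        if (!is_shared_warp) return Access::Direct;      // step (b)
        if (reg_no <= rwt)   return Access::Direct;      // step (c): unshared
        if (lock.try_acquire(warp_id))
            return Access::LockAcquired;                 // step (e)
        return Access::Retry;  // busy-wait: retry in a later cycle
    }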
Since several shared warps try to acquire locks, a deadlock is possible. Consider the scenario shown in Figure 5, where two thread blocks TB1 and TB2 are in shared mode, with warps W1 and W3 forming one shared pair and warps W2 and W4 another. Assume that warp W1 of TB1 tries to acquire the lock to access its pair's shared registers, but W3 has already acquired it, while W2 has acquired the lock for its own pair. Also assume that warps W2 and W3 are waiting for warps W1 and W4, respectively, to arrive at a barrier instruction (__syncthreads()). Now, if warp W4 tries to acquire the lock held by W2, a deadlock occurs: W4 waits for W2's lock, W2 waits for W1 at the barrier, W1 waits for W3's lock, and W3 waits for W4 at the barrier.
To avoid deadlock, we ensure that if thread blocks TB1 and TB2 share registers, a warp from TB1 can acquire a lock only when either (a) no warp from TB2 has acquired a lock for the shared registers, or (b) every warp from TB2 that has acquired exclusive access to shared registers has finished its execution. Hence, for the scenario in Figure 5, since warp W2 has already acquired a lock, W3 cannot acquire one, breaking the deadlock cycle.
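This rule amounts to a block-level check before any pair lock is granted; a sketch under assumed per-warp bookkeeping (names are ours):

    #include <cstddef>
    #include <vector>

    struct SharedBlockPair {
        // Per-warp state for the two thread blocks sharing registers.
        std::vector<bool> holds_lock_tb1, holds_lock_tb2; // lock held per warp
        std::vector<bool> finished_tb1,   finished_tb2;   // warp completed

        // A warp from TB1 may acquire a lock only if every TB2 warp that
        // holds a lock has already finished (and symmetrically for TB2).
        bool may_acquire(bool from_tb1) const {
            const auto& other_locks = from_tb1 ? holds_lock_tb2 : holds_lock_tb1;
            const auto& other_done  = from_tb1 ? finished_tb2   : finished_tb1;
            for (std::size_t i = 0; i < other_locks.size(); ++i)
                if (other_locks[i] && !other_done[i]) return false;
            return true;
        }
    };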
Computing Number of Thread Blocks to be Launched per SM
Consider a naive method of sharing, where every thread block shares registers with some other thread block. This method may launch more thread blocks than the default (non-sharing) approach. However, the number of effective thread blocks per SM (thread blocks that are guaranteed to make progress) can be less than the number of effective thread blocks launched without sharing. We explain this with an example.
Consider a scenario where 3 thread blocks are resident per SM without sharing. With the naive sharing approach, it may be possible to have 4 thread blocks resident, such that block 1 shares registers with block 2, and block 3 shares registers with block 4. It can happen that blocks 2 and 4 start accessing shared registers, causing blocks 1 and 3 to wait. Effectively, only two thread blocks (blocks 2 and 4) make progress in the naive sharing approach, whereas all 3 blocks can make progress in the non-sharing approach.

Now, we describe a method to compute the total number of thread blocks (shared + unshared) to be launched per SM such that the number of effective thread blocks in our approach is at least the same as in the no-sharing approach. We use the following notation: R is the number of registers per SM, R_tb is the number of registers required by a thread block, S is the number of shared pairs launched per SM, U is the number of unshared thread blocks launched per SM, and M is the total number of thread blocks launched per SM. When the no-sharing mechanism is used, we can launch up to ⌊R/R_tb⌋ thread blocks on an SM, and all of them make progress. In our approach, if two thread blocks are launched in sharing mode, at least one of them always makes progress; so when S shared pairs are launched on an SM, at least S thread blocks always make progress. Also, all U unshared thread blocks launched on the SM always make progress. Therefore, at least S + U thread blocks always make progress with our approach.
In order to keep the number of effective thread blocks in our approach the same as in the no-sharing approach, we need the following relation to hold:

    S + U ≥ ⌊R/R_tb⌋    (1)
For each shared pair of thread blocks, we allocate R_tb(1 + t) registers. Similarly, for each unshared thread block, we allocate R_tb registers. Since the total number of registers available in the SM is R, we have:

    S · R_tb(1 + t) + U · R_tb ≤ R    (2)
The total number of thread blocks that can be launched in the sharing approach equals the number of unshared thread blocks plus twice the number of shared pairs, i.e.,

    M = U + 2S    (3)
From Equations (1) and (2), we can compute the values of U and S: taking Equation (1) with equality and maximizing S gives S = ⌊(R mod R_tb)/(t · R_tb)⌋ and U = ⌊R/R_tb⌋ − S. Finally, the maximum value of M can be computed from Equation (3) as:

    M = ⌊R/R_tb⌋ + ⌊(R mod R_tb)/(t · R_tb)⌋
Since the actual number of thread blocks that can reside on an SM also depends on other factors, namely (a) shared memory, (b) the maximum number of resident threads per SM, and (c) the maximum number of resident thread blocks per SM, the number of thread blocks resident on an SM in our approach is the minimum of the limits imposed by factors (a)-(c) and the value M.
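Combining Equations (1)-(3) with these limits, the launch-count computation can be sketched as follows (our formulation; the limit parameters are assumed to be precomputed from factors (a)-(c)):

    #include <algorithm>

    // Maximum thread blocks launchable per SM with register sharing,
    // per Equations (1)-(3). R: registers per SM, Rtb: registers per
    // thread block, t: sharing threshold (0 < t <= 1).
    int blocks_with_sharing(int R, int Rtb, double t,
                            int smem_limit, int thread_limit, int block_limit) {
        int baseline = R / Rtb;                            // floor(R / Rtb)
        int S = static_cast<int>((R % Rtb) / (t * Rtb));   // shared pairs
        if (S > baseline) S = baseline;                    // keep U non-negative
        int U = baseline - S;                              // unshared blocks
        int M = U + 2 * S;                                 // Equation (3)
        // Residency is also capped by shared memory, threads, and blocks.
        return std::min({M, smem_limit, thread_limit, block_limit});
    }

With the running example (R = 35K, R_tb = 10K, t = 0.5) and generous limits, this yields S = 1, U = 2, and M = 4, matching Figure 3(b).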
OPTIMIZATIONS
With the proposed register sharing approach, each SM has unshared and shared warps, and scheduling these warps plays a very important role in determining the performance of applications. We propose an optimization called "Owner Warp First (OWF)" to schedule these warps effectively. If two thread blocks TBi and TBj are a shared pair, and at least one of the warps of TBi waits for shared registers held by TBj, we call TBj the "Owner Block", and the warps that belong to TBj are called "Owner Warps". Owner thread blocks always have access to the shared registers. TBi is called the "Non-Owner Block", and the warps of TBi are called "Non-Owner Warps". As soon as the owner thread block finishes its execution, it transfers its ownership to the non-owner thread block (i.e., the non-owner thread block becomes the owner), and a new thread block launched in its place becomes the non-owner.
Scheduling Owner Warp First (OWF)
A warp scheduler in the SM issues a warp every cycle from a pool of ready warps. With our solution, the warps can be categorized into three types, viz., unshared, shared owner, and shared non-owner. Shared non-owner warps depend on the corresponding shared owner warps to release registers before they can make progress. In the Owner Warp First algorithm, we prioritize warps in the order: shared owner, unshared, and shared non-owner. Giving the highest priority to shared owner warps helps finish them sooner, so that the dependent shared non-owner warps can make progress. Since shared non-owner warps cannot make much progress before stalling, giving them lower priority than unshared warps means they are used only to hide stalls when no other warps are ready to run. Figure 6 illustrates that scheduling unshared warps with higher priority than shared warps can degrade the performance of an application.
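The OWF priority order can be expressed as a simple sort over the ready pool; a sketch with assumed warp bookkeeping (field names are ours):

    #include <algorithm>
    #include <vector>

    enum class WarpKind { SharedOwner = 0, Unshared = 1, SharedNonOwner = 2 };

    struct Warp {
        int id;          // dynamic warp id
        WarpKind kind;
    };

    // OWF: shared owner warps first, then unshared, then shared non-owner.
    void owf_sort(std::vector<Warp>& ready_warps) {
        std::stable_sort(ready_warps.begin(), ready_warps.end(),
            [](const Warp& a, const Warp& b) {
                if (a.kind != b.kind)
                    return static_cast<int>(a.kind) < static_cast<int>(b.kind);
                return a.id < b.id;   // ties broken by dynamic warp id
            });
    }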
Unrolling and Reordering of Register Declarations
Non-owner warps need to wait for owner warps when they try to access shared registers. If the very first instruction issued by a non-owner warp uses a shared register, then the warp has to wait and cannot start its execution until the corresponding owner warp has released the shared register. To allow non-owner warps to execute as many instructions as possible before stalling due to unavailability of shared registers, we unroll and reorder the register declarations. To illustrate this, consider the PTXPlus [2] code shown in Figure 7(a), which is generated by GPGPU-Sim [2] for the sgemm application from the Parboil suite [3]. The first instruction of the code accesses registers p0 and r124, which get the register sequence numbers 31 and 35 according to the declaration. These registers are part of the shared registers for a certain threshold value t. Hence, a non-owner warp has to wait until the registers are released. To delay accessing the shared registers, we unroll and rearrange the order of the register declarations so that p0 and r124 become unshared registers (i.e., they get the register sequence numbers 1 and 3). This is shown in Figure 7(b). Hence the non-owner warps get to execute more instructions before they start accessing shared registers.
To implement this optimization, we convert the assembly code (PTXPlus) produced by GPGPU-Sim into an optimized assembly code. To achieve this, we first find an order of registers according to their first use. Then, to ensure that the unshared registers are used before the shared registers, we modify the register declarations so that a register that is used first in the assembly code is declared first. Finally, we modified GPGPU-Sim to use the optimized PTXPlus code for simulating instructions. This optimization can be easily integrated at the assembly level using the existing CUDA compiler.
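A sketch of the renumbering step is shown below; it abstracts away the PTXPlus parsing and works on instructions represented as lists of register ids (a simplification of the actual pass):

    #include <unordered_map>
    #include <vector>

    // Assign register sequence numbers by first use, so registers touched
    // early in the kernel fall into the unshared range (RegNo <= R_w * t).
    std::unordered_map<int, int>
    renumber_by_first_use(const std::vector<std::vector<int>>& insts) {
        std::unordered_map<int, int> new_seq; // old register -> new seq. no.
        int next = 1;
        for (const auto& regs : insts)        // instructions in program order
            for (int r : regs)                // registers read/written
                if (!new_seq.count(r))
                    new_seq[r] = next++;
        return new_seq;
    }

For the sgemm example above, p0 and r124 appear in the first instruction, so this pass would assign them low sequence numbers, moving them into the unshared range.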
Dynamic Warp Execution
A recent study [16] by Kayiran et al. shows that the performance of memory bound applications can degrade as the number of resident thread blocks increases. Executing additional thread blocks can increase L1/L2 cache misses, which in turn increases the number of stall cycles. So, in order to reduce the number of stalls in the system, we propose an optimization that can dynamically enable or disable the execution of long latency (memory) instructions issued by the additional blocks.
To control the execution of memory instructions from the additional (non-owner) warps, we monitor the number of stall cycles on each SM. When executing memory instructions from non-owner warps leads to an increase in the number of stalls, we decrease the probability of executing further memory instructions from the non-owner warps. To illustrate this, consider a GPU that has N SMs, all in sharing mode. Our approach disables the execution of memory instructions from non-owner warps on one specific SM (e.g., SM0). Every other SM, SMi for i ∈ {1 . . . N − 1}, allows execution of memory instructions from non-owner warps and periodically compares its stall cycles with the stalls on SM0. If the stalls observed on SMi exceed the stalls on SM0, the probability of executing memory instructions from non-owner warps on SMi is decreased by a predetermined value p. If the stalls on SMi are fewer than those on SM0, this probability is increased by the same value p. Thus, we reduce the number of stall cycles by controlling the execution of memory instructions.
After running several experiments, we selected a monitoring period of 1000 cycles, which ensures that (a) the monitoring overhead is not high, and (b) a sufficient number of stall cycles is observed. In our experiments, all SMs initially execute all memory instructions, i.e., the probability of executing memory instructions from non-owner warps is 1. Depending on the stall cycles observed for an SMi (i ∈ {1 . . . N − 1}), this probability is decreased or increased by p = 0.1, but is kept within the interval [0, 1].
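The monitoring step can be sketched as follows; the per-SM stall counters and state are assumed to be available in the simulator (names are ours):

    #include <algorithm>

    struct SmState {
        double mem_inst_prob = 1.0; // prob. of issuing non-owner memory insts
        long long stalls_this_period = 0;
    };

    // Called every 1000 cycles for SMi (i >= 1). SM0 serves as the
    // reference: its non-owner memory instructions are disabled, and
    // every other SM compares its stall count against SM0's.
    void adjust_probability(SmState& smi, const SmState& sm0, double p = 0.1) {
        if (smi.stalls_this_period > sm0.stalls_this_period)
            smi.mem_inst_prob = std::max(0.0, smi.mem_inst_prob - p);
        else if (smi.stalls_this_period < sm0.stalls_this_period)
            smi.mem_inst_prob = std::min(1.0, smi.mem_inst_prob + p);
        smi.stalls_this_period = 0; // reset for the next monitoring window
    }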
HARDWARE REQUIREMENT
To implement register sharing, we require the following hardware storage:
1. Each SM requires a bit to specify whether it is enabled for sharing mode. This bit will be set when the number of thread blocks assigned to an SM using register sharing is more than the default number of thread blocks per SM.
2. Each thread block stores the id of its sharer thread block. This is set to -1 if the thread block is in unsharing mode. Hence for T thread blocks in an SM, we require T⌈log₂(T + 1)⌉ bits.
3. Each warp requires a bit to specify whether it is in sharing or unsharing mode. Hence for W warps in an SM, W bits are required. For a warp in shared mode, its corresponding shared warp can be identified using the sharer thread block id of its thread block and its relative position in the thread block.
4. Each warp requires a bit for specifying the owner information. This bit is set to 1 if the warp is an owner warp, otherwise it is set to 0. Hence for W warps, W bits are needed.
5. Each pair of shared warps uses a lock variable to access the shared registers exclusively. The lock variable is set to the id of the warp that has gained access to the shared registers. If an SM has W warps, there can be at most W/2 shared pairs of warps in the SM. Hence we need a total of (W/2)⌈log₂ W⌉ bits per SM.
Hence the total amount of storage required (in bits) for implementing our approach on a GPU with N SMs is: (1 + T⌈log₂(T + 1)⌉ + 2W + (W/2)⌈log₂ W⌉) × N. For the GPU configuration shown in Table 1 (Tesla C2050), our approach requires 273 additional bits per SM. Finally, we require two comparator circuits to implement the register access (Figure 4, components labeled (b) and (c)).
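The 273-bit figure can be reproduced from this formula using the Table 1 limits of 8 resident thread blocks and 48 resident warps (1536 threads) per SM; a quick check:

    #include <cmath>
    #include <cstdio>

    int main() {
        const int T = 8;   // max resident thread blocks per SM (Table 1)
        const int W = 48;  // max resident warps per SM (1536 threads / 32)

        int bits = 1                                        // sharing-mode bit
                 + T * (int)std::ceil(std::log2(T + 1))     // sharer block ids
                 + 2 * W                                    // sharing + owner bits
                 + (W / 2) * (int)std::ceil(std::log2(W));  // pair lock variables

        printf("bits per SM: %d\n", bits);                  // prints 273
        return 0;
    }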
EXPERIMENTAL ANALYSIS

Evaluation Methodology
We implemented our approach using GPGPU-Sim V3.X [6]. The baseline architecture used for comparison is shown in Table 1. We experimentally evaluated our approach on several applications from the GPGPU-Sim [6], Rodinia [8], and Parboil [3] benchmark suites. Depending on the resource requirements of the applications, we divided the benchmarks into two sets. Set-1, shown in Table 2, consists of applications for which the number of thread blocks per SM is limited by the register resource. Set-2, shown in Table 3, consists of applications for which the number of thread blocks per SM is limited by factors other than registers, such as (a) shared memory, (b) the maximum number of resident threads, and (c) the maximum number of resident thread blocks. For each application in Tables 2 and 3, we show the names of the kernels used for evaluation and the number of registers per thread for each kernel, which GPGPU-Sim uses to compute the number of resident thread blocks. We also report the number of threads per thread block. We compiled all the applications using CUDA 4.2 and executed them on the inputs provided in the benchmark suites.
For our experiments, we use the threshold value (t) to determine the percentage of register sharing. For example, if each warp requires R_w registers and we choose the threshold value to be 0.1, then we allocate 1.1·R_w registers per two shared warps, which means 90% of the warp registers (R_w) are used as shared registers. So for a given threshold t, the percentage of register sharing is (1 − t) × 100. For all our experimental results, we use a threshold value of 0.1 (i.e., 90% register sharing), unless otherwise specified.
Figure 8: Comparing the number of resident thread blocks with baseline implementation
Experimental Results
We measure the performance of our approach using the number of Instructions executed Per Cycle (IPC), number of stall cycles, and number of idle cycles as defined by the GPGPU-Sim [6] , and we compare it with baseline GPGPU-Sim implementation.
We first show that register sharing helps increase the number of thread blocks launched for the applications in Set-1. In Figure 8, we compare the effective number of thread blocks launched by our approach (denoted Shared-OWF-Unroll-Dyn) with that of the baseline implementation (denoted Unshared-LRR). For the applications MUM, backprop, hotspot, and mri-q, our approach is able to launch 6 thread blocks (i.e., 1536 threads), which is the maximum number of resident threads per SM. Applications stencil and b+tree launch 3 thread blocks per SM, compared to 2 in the baseline approach. For applications LIB and sgemm, our approach is able to launch 8 thread blocks per SM, which is the maximum number of resident thread blocks. Figure 9 shows the improvement in IPC with our approach over the baseline LRR (Loose Round Robin) implementation. We observe that applications b+tree, hotspot, MUM, and stencil achieve significant speedups of 11.98%, 21.76%, 24.14%, and 23.45% respectively. These applications leverage all our optimizations to perform better. Though LIB launches 8 thread blocks per SM when register sharing is enabled, it improves by only 0.84%, due to an increase in L2 cache misses caused by the additional shared blocks. The benchmarks backprop and sgemm achieve modest improvements of 5.82% and 4.06% respectively. The benchmark mri-q slows down by 0.72%, because the additional shared blocks increase L1 cache misses and hence the number of stalls.
In Figure 10, we show the effectiveness of our proposed optimizations by comparing them with the baseline approach. First, we consider register sharing without any optimization, using the existing baseline LRR scheduling policy; for convenience, we label this scheme Shared-LRR-NoOpt in Figure 10. Consider the application hotspot: it achieves a speedup of 13.65% even without any optimization, clearly because the additional thread blocks launched by our approach help hide execution latencies. With the register unrolling optimization (labeled Shared-LRR-Unrolled), we see a further improvement, up to 15.18%, because register unrolling increases the use of unshared registers before shared registers are first accessed; hence the application can execute more instructions before it touches shared registers. When we enable dynamic warp execution (shown as Shared-LRR-Unrolled-Dyn), the improvement drops to 14.58%, because it limits the execution of memory instructions from non-owner warps. However, when we apply the OWF optimization (shown as Shared-OWF-Unrolled-Dyn), the speedup further increases to 21.76%. With OWF, the priority of non-owner warps is lower than that of the other warps, so the memory instructions issued by non-owner warps do not interfere with the other warps, which minimizes L1/L2 cache misses. The application b+tree behaves similarly to hotspot in terms of performance gain as the optimizations are varied.
For the application MUM, when we do not use any optimization, there is a slowdown of 0.15%. We observe that the increase in resident thread blocks leads to an increase in L1 and L2 cache misses caused by memory instructions issued from the non-owner warps. Though L1/L2 cache misses increase, the other instructions issued by the non-owner warps help minimize stall cycles. With the register unrolling optimization, we see a slight improvement (0.08%). When we apply dynamic warp execution, MUM shows a speedup of 6.45%; from this we infer that dynamic warp execution reduces the additional stall cycles produced by memory instructions issued from the non-owner warps. Further, with the OWF optimization, the improvement goes up to 24.14%, because of decreased interference from the non-owner warps.
LIB shows an improvement of 2% with sharing and no optimizations. We observe the same performance even with the unrolling optimization, because the number of instructions that use unshared registers before the first instruction using shared registers is exactly the same as without the optimization. With dynamic warp execution, we still observe the same results, since in this application all the owner warps complete all their instructions before any non-owner warp issues a memory instruction. With the OWF optimization, we observe a small degradation because of an increase in the number of stall cycles compared to the LRR scheduling policy.
Figure 11: Percentage decrease in the number of stall cycles when compared with baseline implementation
The benchmarks sgemm, backprop, and stencil achieve good improvements only when the OWF optimization is enabled. Since instructions issued by non-owner warps execute with the lowest priority, they do not interfere with other warps, minimizing L1/L2 cache misses. We do not see any performance improvement for mri-q, because the additional thread blocks increase L1 cache misses with our approach. However, the slowdown is reduced to 0.72% in the presence of all the optimizations.
To summarize, memory bound applications, like MUM, benefit from our sharing approach in the presence of the dynamic warp execution and OWF optimizations, whereas compute bound applications, like hotspot, perform better even without any optimizations and improve further with the OWF optimization.
In Figure 11, we report the percentage decrease in the number of idle cycles (cycles in which all the available warps have been issued but no warp is ready to execute) and stall cycles (pipeline stalls) compared to the baseline implementation. All applications show a reduction in the number of idle cycles (up to 99%). This is expected because, as the number of thread blocks increases, the number of instructions ready to execute also increases. For the applications MUM, LIB, backprop, hotspot, and stencil, the number of stall cycles also reduces with our approach, indicating that the additional thread blocks launched with our approach hide long execution latencies better. We observe an increase in stall cycles for the applications b+tree and stencil; however, since the number of idle cycles is significantly reduced, we see an overall benefit with our approach. For mri-q, the number of stall cycles increases with our approach due to an increase in the number of L1 cache misses.
In Table 4, we analyze the performance of the Shared-OWF-Unroll-Dyn approach as the amount of sharing varies. We observe that most applications perform best when the amount of sharing is 90%. As shown in Table 5, the number of resident thread blocks increases with the amount of register sharing; these resident thread blocks help hide long latencies and hence achieve high throughput. From Table 4, we also notice that all applications behave the same at 0% and 10% sharing. At these sharing percentages, the number of resident thread blocks with our approach is the same as in the baseline implementation; hence, at run time, our approach launches all thread blocks in unsharing mode, and all warps become unshared warps. In this case, the OWF optimization uses dynamic warp ids to schedule warps and achieves higher performance than the baseline approach.
In Figure 12, we show the performance improvement over the GTO (Greedy Then Old) scheduler. We observe that our approach shows an improvement of up to 3.9%. Further, as shown in Figure 14, we observe an improvement of up to 27.22% in IPC over the two-level scheduling policy.
We also measure the effectiveness of the register sharing mechanism by comparing it with an LRR scheduler that uses twice the number of registers. In Figure 13, the baseline approach (labeled Unshared-LRR-Reg#65536) uses 64K registers, whereas our approach uses only 32K registers. Even with the increased number of registers, and hence the increased number of resident thread blocks in the baseline approach, our approach performs better in 5 out of 8 applications with fewer registers. For example, MUM performs better with our approach even though both approaches launch the same number of thread blocks (6), because the dynamic warp execution optimization helps minimize the stalls produced by the additional thread blocks. The applications sgemm, b+tree, and LIB perform better with the baseline approach due to the increase in the number of resident thread blocks and hence in the number of active warps.
Figure 13: Comparison with LRR that uses twice the number of registers
The performance of our approach for the Set-2 applications (Table 3) is presented in Figure 15. As discussed earlier, these applications are limited not by the number of available registers but by other resources such as the number of threads, shared memory, etc. We measure their performance when our approach uses (1) the LRR scheduling policy, (2) the GTO scheduling policy, and (3) the OWF scheduling policy. (We do not use the two-level scheduling policy because it cannot be directly integrated with our sharing approach.) From Figure 15, we observe that our register sharing approach, when used with LRR scheduling (labeled Shared-LRR-Unroll-Dyn), performs exactly the same as the baseline LRR scheduling (Unshared-LRR). Since the number of thread blocks launched by these applications is not limited by registers, our approach does not launch any additional thread blocks, and all thread blocks are in unsharing mode; hence it behaves exactly like the baseline approach. Similarly, our approach with the GTO scheduling policy (Shared-GTO-Unroll-Dyn) performs exactly the same as the baseline that uses GTO scheduling without sharing (Unshared-GTO). Finally, we observe that with the OWF scheduling policy (shown as Shared-OWF-Unroll-Dyn), our approach is comparable to the Unshared-GTO implementation. With OWF, warps are prioritized in the order owner, unshared, and non-owner; since no additional thread blocks are launched in this case, all thread blocks are in unshared mode, and all the unshared warps are sorted by their dynamic warp id. So the performance of Shared-OWF-Unroll-Dyn is similar to that of the Unshared-GTO implementation.
RELATED WORK
Xiang et al. [23] discussed the problems due to thread-block-level resource management. They classified resource underutilization problems as temporal and spatial: temporal underutilization is caused by differences in the run times of the warps of a thread block, whereas spatial underutilization is caused by the unavailability of enough resources for a complete thread block at kernel launch. They proposed a hardware solution to launch a partial thread block when there are not enough resources to launch a full thread block. Their solution can have only one partial thread block running, whereas our solution can have multiple thread blocks in shared mode and hence provides more opportunities to hide long latencies. Also, their solution to handle warp divergence is complementary to our approach and can be used as and when warps from unshared thread blocks finish. A patented register management scheme [22] uses the concept of virtual registers, which outnumber the actual physical registers, and hence can launch more thread blocks than allowed by the physical registers; this mechanism can be combined with our proposed solution. Yang et al. [24] propose hardware and software solutions to the problem caused by allocation and deallocation of shared memory at thread block granularity. Their solution is complementary to our approach.
Warped Register File [4] describes a solution that reduces power consumption in the register file by turning off unallocated registers. Gebhart et al. [13] proposed a unified memory for registers, scratch, and primary cache, which partitions SM resources as per application need.
There has been a lot of work proposing hardware and software solutions to handle various issues due to branch and thread divergence [5, 7, 9, 10, 11, 14, 17, 19, 20].
Many research papers on warp scheduling have proposed techniques to reduce cache contention, improve DRAM bandwidth, hide long latencies, reduce energy consumption, etc. Rogers et al. [21] propose a cache-conscious wavefront scheduling algorithm that makes use of an intra-wavefront locality detector, focusing on the shared L1 cache. The two-level warp scheduler [18] proposed by Narasiman et al. divides warps into groups and schedules the groups, and the warps within each group, in round-robin fashion to better hide long latencies. Gebhart et al. [12] proposed an energy-efficient hierarchical register file and a two-level warp scheduler for high-throughput processors. OWL [15] proposes various techniques to improve cache contention, DRAM bank-level parallelism, etc.
CONCLUSION AND FUTURE WORK
In this paper we proposed a technique called "Register Sharing" to effectively utilize the register resources of GPUs. Our approach uses otherwise wasted registers to launch additional thread blocks in each SM. These thread blocks help hide long instruction execution latencies, hence improving the throughput of applications. For effective utilization of these additional thread blocks, we proposed three optimizations, which further help reduce the stalls produced in the system. We validated our approach on several applications and showed improvements of up to 24% (11% on average) compared with the baseline implementation.
In the future, we plan to incorporate traditional compiler analyses and optimizations into our approach. For example, live range analysis along with instruction reordering can be used to detect and release registers that are not used beyond a point. Such registers, if shared, can be used by the waiting warp in the other thread block. We also plan to study the effect of various cache replacement policies on register sharing and use it to improve the throughput of memory bound applications.
