GPGPU applications exploit on-chip scratchpad memory available in the Graphics Processing Units (GPUs) to improve performance. The amount of thread level parallelism (TLP) present in the GPU is limited by the number of resident threads, which in turn depends on the availability of scratchpad memory in its streaming multiprocessor (SM). Since the scratchpad memory is allocated at thread block granularity, part of the memory may remain unutilized. In this paper, we propose architectural and compiler optimizations to improve the scratchpad memory utilization. Our approach, called Scratchpad Sharing, addresses scratchpad under-utilization by launching additional thread blocks in each SM. These thread blocks use unutilized scratchpad memory and also share scratchpad memory with other resident blocks. To improve the performance of scratchpad sharing, we propose Owner Warp First (OWF) scheduling that schedules warps from the additional thread blocks effectively. The performance of this approach, however, is limited by the availability of the part of scratchpad memory that is shared among thread blocks.
INTRODUCTION
The throughput achieved by a GPU (Graphics Processing Unit) depends on the amount of thread-level parallelism (TLP) it utilizes. Therefore, improving the TLP of GPUs has been the focus of many recent studies [Yang et al. 2012; Anantpur and Govindarajan 2014; Hayes and Zhang 2014]. The TLP present in a GPU depends on the number of resident threads. A programmer interested in parallelizing an application on a GPU invokes a function, called a kernel, with a configuration consisting of the number of thread blocks and the number of threads in each thread block. The maximum number of thread blocks, and hence the number of threads, that can be launched in a Streaming Multiprocessor (SM) depends on the resources available in it. If an SM has R resources and each thread block requires R tb resources, then ⌊R/R tb ⌋ thread blocks can be launched in each SM. These thread blocks utilize R tb * ⌊R/R tb ⌋ units of the resources present in the SM; the remaining R mod R tb resources are wasted.
Vishwesh Jatala is supported by TCS Ph.D. fellowship. Jayvant Anantpur acknowledges the funding received from Google India Private Limited. This article is an extension of the paper "Improving GPU Performance Through Resource Sharing", in Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC '16). The paper describes a resource sharing technique that makes architectural modifications to improve GPU performance. This work extends the paper by introducing compiler optimizations to leverage the resource sharing approach. Authors' addresses: Vishwesh Jatala, Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur; Jayvant Anantpur, Supercomputer Education and Research Centre (SERC), Indian Institute of Science, Bangalore; Amey Karkare, Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur.
In this paper, we propose an approach, Scratchpad Sharing, that launches additional thread blocks in each SM. These thread blocks help in improving the TLP by utilizing the wasted scratchpad memory and by sharing the scratchpad memory with the other resident thread blocks. We further propose Owner Warp First (OWF), a warp scheduling algorithm that improves performance by effectively scheduling warps from the additional thread blocks.
In our experiments we observed that the performance of scratchpad sharing depends on the availability of the scratchpad memory that is shared between the thread blocks. We have developed a static analysis that helps in allocating scratchpad variables into shared and unshared scratchpad regions such that the shared scratchpad variables are needed only for a short duration. We modified the GPU architecture to include a new hardware instruction (relssp) to release the acquired shared scratchpad memory at run time. When all the threads of a thread block have executed the relssp instruction, the thread block releases its shared scratchpad memory. We describe an algorithm that helps the compiler place the relssp instruction optimally in a kernel, such that the shared scratchpad can be released as early as possible without causing any conflicts among shared thread blocks. These optimizations improve the availability of shared scratchpad memory.
The main contributions of this paper are:
(1) We describe an approach to launch more thread blocks by sharing the scratchpad memory. We further describe a warp scheduling algorithm that improves the performance of GPU applications by effectively using warps from the additional thread blocks.
(2) We present a static analysis to lay out scratchpad variables in order to minimize the shared scratchpad region. We introduce a hardware instruction, relssp, and an algorithm for optimal placement of relssp in the user code to release the shared scratchpad region at the earliest.
(3) We used the GPGPU-Sim [Bakhoda et al. 2009] simulator and the Ocelot [Diamos et al. 2010] compiler framework to implement and evaluate our proposed ideas. On several kernels from various benchmark suites, we achieved an average improvement of 19% and a maximum improvement of 92.17% over the baseline approach.
The rest of the paper is organized as follows: Section 2 describes the background required for our approach. Section 3 motivates the need for scratchpad sharing and presents the details of the approach. Owner Warp First scheduling is described in Section 4. Section 5 presents the need for compiler optimizations. The optimizations themselves are discussed in Section 6. Section 7 analyzes the hardware requirements and the complexity of our approach. Section 8 shows the experimental results. Section 9 discusses related work, and Section 10 concludes the paper.
BACKGROUND
A typical NVIDIA GPU consists of a set of streaming multiprocessors (SMs). Each SM contains execution units called stream processors. A programmer parallelizes an application on a GPU by specifying an execution configuration consisting of the number of thread blocks and the number of threads in each thread block. The number of thread blocks that are actually launched in an SM depends on the resources available in the SM, such as the amount of scratchpad memory and the number of registers. The threads in an SM are grouped into sets of 32 threads, called warps. All the threads in a warp execute the same instruction in SIMD manner. A GPU has one or more warp schedulers, which select warps to issue in each cycle. When no warp can be issued in a cycle, the cycle is said to be a stall cycle.
NVIDIA provides a programming language, CUDA [CUDA 2012], which can be used to write an application to be parallelized on a GPU. The region of a program that is to be parallelized is specified using a function called a kernel. The kernel is invoked with a configuration specifying the number of thread blocks and the number of threads as <<<#ThreadBlocks, #Threads>>>. A variable can be allocated in global memory by invoking the cudaMalloc() function. Similarly, a variable can be allocated in scratchpad memory by specifying the __shared__ keyword inside a kernel function. The latency of accessing a variable in global memory is 400-800 cycles, whereas the latency of accessing scratchpad memory is 20-30x lower than that of global memory [CUDA 2012].
SCRATCHPAD SHARING
Scratchpad memory allocation at thread block level granularity causes scratchpad underutilization. To understand the utilization of scratchpad memory, we analyzed the applications shown in Table I. Figure 1 shows the number of thread blocks that are launched in each SM and Figure 2 shows the percentage of unutilized scratchpad memory for the GPU configuration shown in Table II.
Example 3.1. Consider the application backprop in Table I. It requires 9408 bytes of scratchpad memory to launch a thread block in the SM. According to the GPU configuration shown in Table II, each SM has 16K bytes of scratchpad memory. Hence only 1 thread block can be launched in the SM, which utilizes 9408 bytes of scratchpad memory. The remaining 6976 bytes of scratchpad memory remain unutilized. We observe similar behavior for other applications as well. Hence scratchpad allocation at thread block level granularity not only lowers the number of resident thread blocks but also leaves scratchpad memory underutilized.
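The arithmetic of Example 3.1 can be checked with a short sketch. The function name `residency` is hypothetical and introduced only for illustration; the byte counts are the ones quoted in the text (16K bytes per SM, 9408 bytes per backprop thread block).

```python
def residency(sm_bytes, tb_bytes):
    """Return (resident blocks, bytes used, bytes wasted) when scratchpad
    is allocated at thread block granularity."""
    blocks = sm_bytes // tb_bytes   # floor(R / R_tb)
    used = blocks * tb_bytes        # R_tb * floor(R / R_tb)
    wasted = sm_bytes % tb_bytes    # R mod R_tb
    return blocks, used, wasted

SM_BYTES = 16 * 1024  # 16K bytes of scratchpad per SM, as in the example

backprop = residency(SM_BYTES, 9408)
print(backprop)
```

For backprop this yields one resident thread block with 6976 bytes left unused, matching the example.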
To address this problem, we propose Scratchpad Sharing, an approach that increases the number of resident thread blocks in each SM. These thread blocks use the unutilized scratchpad memory and also share scratchpad memory with other resident thread blocks. This not only reduces scratchpad underutilization but also increases TLP on the SM.
Example 3.2. To improve the performance of backprop, we launch two thread blocks (say TB 0 and TB 1 ), which share the scratchpad memory. Instead of allocating 9408 bytes of scratchpad memory to each of TB 0 and TB 1 , the scratchpad sharing approach allocates a total of 16K bytes of memory to TB 0 and TB 1 together. In this case, 6976 bytes (the unutilized amount in Example 3.1) of scratchpad memory are allocated to each thread block independently (unshared scratchpad), while the remaining 2432 bytes of memory (shared scratchpad) are allocated to the thread block that requires them first. For example, if TB 1 accesses the shared scratchpad memory first, it is allocated all of the shared portion. TB 0 can continue its execution until it requires shared scratchpad memory, at which point it waits. TB 0 resumes its execution once TB 1 finishes or releases the shared scratchpad. Thus, TB 0 can help in hiding the long memory latencies of TB 1 , thereby improving the run-time of the application.
To generalize our idea, consider a GPU that has R units of scratchpad memory per SM, and each thread block requires R tb units of scratchpad memory to complete its execution. Consider a pair TB 0 and TB 1 of shared thread blocks. Instead of allocating R tb units of memory to each of TB 0 and TB 1 , we allocate t × R tb (0 < t < 1) units of scratchpad memory to each of them independently. This is called unshared scratchpad. We further allocate (1 − t) × R tb units of scratchpad memory to the pair as shared scratchpad. Thus, a total of (1 + t) × R tb units of scratchpad memory is allocated for both. TB 0 and TB 1 can access shared scratchpad memory only after acquiring an exclusive lock, in an FCFS manner, to prevent concurrent accesses. Once a shared thread block (say TB 0 ) acquires the lock for shared scratchpad memory, it retains the lock until the end of its execution. The other thread block (TB 1 ) can continue to make progress until it needs to access shared scratchpad memory, at which point it waits until TB 0 releases the shared scratchpad.
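As a minimal sketch of this split, the following computes the shared and total scratchpad for one pair of shared thread blocks. The function name `pair_allocation` is hypothetical, and the unshared size is passed in bytes (it plays the role of t × R tb ) to keep the arithmetic exact.

```python
def pair_allocation(tb_bytes, unshared_bytes):
    """Split the scratchpad demand of one pair of shared thread blocks.

    unshared_bytes plays the role of t * R_tb with 0 < t < 1.
    Returns (shared bytes, total bytes for the pair, t).
    """
    if not 0 < unshared_bytes < tb_bytes:
        raise ValueError("the unshared part must satisfy 0 < t < 1")
    shared_bytes = tb_bytes - unshared_bytes          # (1 - t) * R_tb
    total_bytes = 2 * unshared_bytes + shared_bytes   # (1 + t) * R_tb
    t = unshared_bytes / tb_bytes
    return shared_bytes, total_bytes, t

# Numbers from the backprop example: R_tb = 9408, unshared = 6976 bytes.
shared, total, t = pair_allocation(9408, 6976)
print(shared, total, round(t, 4))
```

With the backprop numbers, the pair shares 2432 bytes and occupies the full 16384 bytes of the SM.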
The naive scratchpad sharing mechanism, where each thread block shares scratchpad memory with another resident thread block, may not provide any benefit over the default (unshared) approach. We also need to guarantee that, in the sharing approach, the number of active thread blocks (those not waiting for shared scratchpad) is no less than the number of thread blocks in the default approach.
Example 3.3. Consider the application DCT3 that requires 2176 bytes of scratchpad memory per thread block. For the given GPU configuration (Table II), 7 thread blocks can be launched in default mode. With scratchpad sharing, it is possible to launch 12 thread blocks (for a certain value of t). Suppose we create 6 pairs of thread blocks where the blocks in each pair share scratchpad. Then, in the worst case, all 12 blocks may request access to the shared portion of scratchpad. This will cause 6 blocks to wait, while only the remaining 6 will make progress. If the shared region is sufficiently large, the application will perform worse with scratchpad sharing.
To make sure that at least 7 thread blocks make progress, our approach creates only 5 pairs of thread blocks that share scratchpad memory; the remaining 2 thread blocks are not involved in sharing. Thus, at most 5 blocks can be waiting during execution.
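The pairing constraint of Example 3.3 can be sketched as follows. This is an illustration of the stated guarantee only (worst-case active blocks must not drop below the default count); the function name `sharing_layout` is hypothetical, and the full computation of shared and unshared thread blocks is the one described in [Jatala et al. 2016], which this sketch does not reproduce.

```python
def sharing_layout(total_blocks, default_blocks):
    """Choose the number of sharing pairs so that, even if one block of
    every pair is waiting, at least default_blocks blocks stay active.

    Returns (pairs, unshared blocks, worst-case active blocks).
    """
    # Each pair can contribute at most one waiting block.
    pairs = min(total_blocks - default_blocks, total_blocks // 2)
    pairs = max(pairs, 0)
    unshared = total_blocks - 2 * pairs
    worst_case_active = total_blocks - pairs
    return pairs, unshared, worst_case_active

# DCT3: 12 blocks with sharing, 7 blocks in default mode.
print(sharing_layout(12, 7))
```

For DCT3 this gives 5 pairs and 2 unshared thread blocks, so 7 blocks can always make progress.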
In our approach, the thread blocks that share the scratchpad memory are referred to as shared thread blocks, the rest are referred to as unshared thread blocks. The computation of number of shared and unshared thread blocks is described in detail in [Jatala et al. 2016] .
To implement our approach, we modify the existing scratchpad access mechanism provided by the GPGPU-Sim [2014] simulator. Figure 3 shows the scratchpad access mechanism that supports scratchpad sharing. When a thread (thread id: ThId) needs to access a scratchpad location (SMemLoc), we first check whether it belongs to an unshared thread block. If it does, it can access the location directly from scratchpad memory (Figure 3, Step (b)). Otherwise, we check whether it accesses an unshared scratchpad location (Step (c)). The thread accesses an unshared scratchpad location if SMemLoc < t × R tb , because we allocate t × R tb units of scratchpad memory to each of the shared thread blocks. Otherwise, we treat the location as a shared scratchpad location. A thread can access an unshared scratchpad location directly; however, it can access a shared scratchpad location only after acquiring the exclusive lock, as shown in Step (e). If the lock cannot be acquired, it retries the access in the next cycle.
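The access checks described above can be modeled in software as a rough sketch. The names `SharedPair` and `try_access` are hypothetical, the lock is modeled as a simple owner field, and the mapping of shared locations to physical addresses is not modeled; this is not the GPGPU-Sim implementation.

```python
class SharedPair:
    """Lock state for one pair of shared thread blocks (None = unlocked)."""
    def __init__(self):
        self.owner = None

def try_access(tb_id, is_shared_tb, loc, unshared_limit, pair):
    """Return True if the access may proceed this cycle, False if the
    thread must retry in the next cycle.

    unshared_limit plays the role of t * R_tb.
    """
    if not is_shared_tb:
        return True                 # unshared thread block: direct access
    if loc < unshared_limit:
        return True                 # unshared scratchpad location
    if pair.owner is None:
        pair.owner = tb_id          # first requester acquires the lock (FCFS)
    return pair.owner == tb_id      # the owner proceeds, the partner retries

pair = SharedPair()
first = try_access(1, True, 7000, 6976, pair)    # TB 1 takes the lock
second = try_access(0, True, 7000, 6976, pair)   # TB 0 must retry
```

In this model the lock stays with its owner until `pair.owner` is reset, which corresponds to the owner finishing or releasing the shared scratchpad.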
OWNER WARP FIRST (OWF) SCHEDULING
In our approach, each SM contains three types of thread blocks: (1) unshared thread blocks, which do not share scratchpad memory with any other thread block, (2) shared thread blocks that own the shared scratchpad memory by holding the exclusive lock (owner thread blocks), and (3) shared thread blocks that do not own the shared scratchpad memory (non-owner thread blocks). We refer to the warps in these thread blocks as 'Unshared warps', 'Owner warps', and 'Non-owner warps' respectively. When an owner thread block finishes its execution, it transfers its ownership to its corresponding non-owner thread block, and the new thread block that is launched becomes the non-owner thread block.
Scheduling these warps in the SM plays an important role in improving the performance of applications. Hence we propose an optimization, Owner Warp First (OWF), that schedules the warps in the following order: (1) Owner warps, (2) Unshared warps, and (3) Non-owner warps. Giving the highest priority to owner warps helps them finish sooner so that the dependent non-owner warps can resume their execution, which can help in hiding long execution latencies.
Figure 4 shows the benefit of giving the first priority to the owner warp rather than to the unshared warp. Consider an SM that has three warps: Unshared (U), Owner (O), and Non-owner (N). Assume that they need to execute 3 instructions I 1 , I 2 , and I 3 as shown in the figure; the latency of the Add and Mov instructions is 1 cycle, and the latency of the Load instruction is 5 cycles. Also, assume that instruction I 2 uses a shared scratchpad memory location. If the unshared warp is given the highest priority (shown as Unshared Warp First in the figure), then it can issue I 1 in the 1st cycle and I 2 in the 2nd cycle. However, it can not issue I 3 in the 3rd cycle since I 3 depends on I 2 for register R 2 , and I 2 takes five cycles to complete its execution. The owner warp can therefore execute I 1 in the 3rd cycle.
Similarly, it can issue I 2 in the fourth cycle. The non-owner with least priority can start issuing I 1 in the 5th cycle, however, it can not issue I 2 in the 6th cycle since I 2 uses a shared scratchpad memory location, and it can access the shared location only after the owner thread block releases the lock, hence it waits until the owner warp finishes execution. Once the owner warp finishes the execution of I 2 and I 3 in the 8th and 9th cycles respectively, the non-owner resumes the execution of I 2 in 10th cycle, and it can subsequently finish in 15 cycles.
If the owner warp is given priority over the unshared warp, it can issue I 1 and I 2 in the 1st and 2nd cycles respectively. Similarly, the unshared warp, with second priority, can issue I 1 and I 2 in the 3rd and 4th cycles. The non-owner warp, with the least priority, can issue I 1 in the 5th cycle, and it then waits for the owner warp to release the shared scratchpad memory. Once the owner warp completes the execution of I 2 and I 3 in the 7th and 8th cycles, the non-owner warp can resume by overlapping the execution of I 2 in the 8th cycle with the unshared warp. Finally, the unshared warp and the non-owner warp finish their execution in the 9th and 13th cycles respectively, which improves the overall performance.
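The OWF priority order from this section can be sketched as a selection among the warps that are ready to issue. The names `WarpKind` and `owf_pick` are hypothetical, and breaking ties by the lowest warp id is an assumption, since the text does not specify the order within a priority class.

```python
from enum import IntEnum

class WarpKind(IntEnum):
    # A smaller value means a higher OWF priority.
    OWNER = 0
    UNSHARED = 1
    NON_OWNER = 2

def owf_pick(ready_warps):
    """Pick the warp to issue from a list of (warp_id, WarpKind) pairs
    that are ready this cycle; return None on a stall cycle."""
    if not ready_warps:
        return None
    # Assumption: ties within a class are broken by the lowest warp id.
    return min(ready_warps, key=lambda w: (w[1], w[0]))[0]

ready = [(3, WarpKind.UNSHARED), (7, WarpKind.OWNER), (1, WarpKind.NON_OWNER)]
print(owf_pick(ready))
```

Here the owner warp (id 7) is chosen even though the other warps have lower ids; with no ready warps the cycle is a stall cycle.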
THE NEED FOR COMPILER OPTIMIZATIONS
In scratchpad sharing, when two thread blocks (say, TB 0 and TB 1 ) are launched in shared mode, only one of them can access the shared scratchpad region at a time. As soon as one thread block, say TB 0 , starts accessing the shared scratchpad region, the other thread block, TB 1 , can not access the shared scratchpad region and hence may have to wait until TB 0 finishes execution.
Example 5.1. Consider the CFG in Figure 5 , which is obtained for SRAD1 benchmark application (Table I ). In the figure, the program point marked L corresponds to the last access to the shared scratchpad. Without compiler assistance, the shared scratchpad region can be released only at the end of the last basic block (Exit node of CFG) even though it is never accessed after L.
To promote the release of shared scratchpad region before the end of kernel execution, we introduce a new hardware instruction (PTX instruction) called relssp. Our proposed compiler optimization can place the relssp instructions in a kernel such that shared scratchpad memory is released as early as possible by each thread.
Example 5.2. Consider the scenario in Figure 6 , where a kernel function declares four equal sized scratchpad variables V 1 to V 4 . The figure also shows the regions of the kernel within which different variables are accessed. If V 1 and V 4 are allocated into shared scratchpad region, then the shared scratchpad region is accessed from program point P 1 to program point P 8 . However, when V 2 and V 3 are allocated to shared scratchpad region, the shared region is accessed for a shorter duration, i.e., from program point P 3 to program point P 6 .
We describe a compile time memory allocation scheme that allocates scratchpad variables into shared and unshared region such that shared scratchpad variables are accessed for shorter durations.
COMPILER OPTIMIZATIONS
To increase the availability of shared scratchpad memory, it is desirable that the access to the shared portion of the scratchpad is restricted to a small region in the program. We have developed a compile-time analysis to allocate scratchpad variables into the shared scratchpad region such that this shared region is minimal in terms of the number of executed instructions. In the presence of loops, where the number of iterations of the loop is not computable at compile time, the number of iterations can be estimated using profiling and user annotations.
To simplify the description of the analysis, we make the following assumptions:
- The control flow graph (CFG) for a function (kernel) has a unique Entry and a unique Exit node.
- There are no critical edges in the CFG. A critical edge is an edge whose source node has more than one successor and whose destination node has more than one predecessor.
- There are no dead definitions for scratchpad variables (i.e., every definition has some use).
The assumptions are not restrictive, as any control flow graph can be converted to the desired form using a preprocessing pass that splits the critical edges and removes dead code.
Minimizing Shared Scratchpad Region
Consider a GPU that uses scratchpad sharing approach such that two thread blocks involved in sharing can share a fraction f < 1 of scratchpad memory. Assume that each SM in the GPU has M bytes of scratchpad memory, the kernel that is to be launched into the SM has N scratchpad variables, and each thread block of the kernel requires M tb bytes of scratchpad memory. We allocate a subset S of scratchpad variables into shared scratchpad region such that (1) The total size of the scratchpad variables in the set S is equal to the size of shared scratchpad (f × M tb ), and (2) The region of access for variables in S is minimal in terms of the number of instructions.
To compute the region of access for S, we define the access range of a variable as follows: a program point π is in the access range of a variable v if (1) there is a definition of v on some path from Entry to π and (2) there is a use of v on some path from π to Exit.
Intuitively, the access range of a variable covers every program point between the variable's first definition and its last use in an execution path. Any other definition on this path does not affect the access range. Note that this is unlike the live range of a variable [Khedker et al. 2009], where the intermediate definitions create holes in the live range. However, the access range for a variable can still contain disjoint regions due to branches in the flow graph.
Similarly, a program point π is in the access range of a set of variables S = {v 1 , ..., v n } if (1) there is a definition of some v i (1 ≤ i ≤ n) on some path from Entry to π and (2) there is a use of some v j (1 ≤ j ≤ n) on some path from π to Exit.
Note that the variables checked for definition (v i ) and for use (v j ) may or may not be the same.
Example 6.1. Consider a kernel whose CFG is shown in Figure 7 . The kernel uses 3 scratchpad variables A, B and C. In the figure, variable A is accessed in the region from basic block BB 1 to basic block BB 4 . The regions of the program where B and C are accessed are also shown.
For the CFG, the start of basic block BB 2 is considered to be in the access range of A because there is a path from Entry to the start of BB 2 that contains the definition of A (in BB 1 ) and there is a path from the start of BB 2 to Exit that contains the use of A (in BB 4 ). In the traditional liveness analysis, A is dead at the start of BB 2 due to the redefinition of A in BB 2 . Consider the set S = {B, C}. Basic block BB 4 is in the access range of S because there is a path from Entry to BB 4 containing the definition of B (in BB 2 ), and there is a path from BB 4 to Exit containing the use of C (in BB 6 ).
To compute the access ranges for a program, we need a forward analysis to find the first definitions of the scratchpad variables, and a backward analysis to find the last uses of the scratchpad variables. We define these analyses formally using the following notations:
- IN(BB) denotes the program point before the first statement of the basic block BB.
- OUT(BB) denotes the program point after the last statement of BB.
- AccIN(S, BB) is true if IN(BB) is in the access range of a set of scratchpad variables S. AccOUT(S, BB) is true if OUT(BB) is in the access range of S.
The data flow equations to compute the information are:
The analysis can be extended easily to compute information at any point inside a basic block. We ignore it for brevity.
We decide whether the access range of a set of scratchpad variables S includes the points IN(BB) and OUT(BB) as:
Example 6.2. Table III shows the program points in the access ranges of scratchpad variables for CFG of Figure 7 . The table also shows the program points in the access ranges of sets of two scratchpad variables each.
Let SV denote the set of all scratchpad variables. For each subset S of SV having a total size equal to the size of shared scratchpad memory, our analysis computes total number of instructions in the access range of S. The subset with minimum instructions in the access range is selected for allocation in the shared scratchpad memory.
Example 6.3. Consider once again the CFG in Figure 7 . For simplicity, assume that all the variables have equal sizes, and each basic block contains the same number of instructions.
Consider a scratchpad sharing approach that can allocate only two of the variables into the shared scratchpad region. From the CFG, and from Table III, it is clear that when A and B are allocated into shared scratchpad memory, the shared region is smaller, compared to when either {B,C} or {A,C} are allocated in the shared region.
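The definition of access range and the subset selection can be sketched at basic-block granularity. The sketch below follows the path-based definition directly rather than the paper's data flow equations. The CFG, the variable names X, Y, Z, their sizes, and the instruction counts are hypothetical (they are not Figure 7), and charging a whole block when either its IN or OUT point lies in the access range is a simplifying assumption.

```python
from itertools import combinations

def access_range_blocks(succ, defs, uses, S):
    """Blocks whose IN or OUT point lies in the access range of set S."""
    pred = {b: [] for b in succ}
    for b, ss in succ.items():
        for s in ss:
            pred[s].append(b)
    def_in = {b: False for b in succ}
    def_out = {b: False for b in succ}
    use_in = {b: False for b in succ}
    use_out = {b: False for b in succ}
    changed = True
    while changed:  # iterate to a fixed point (handles loops as well)
        changed = False
        for b in succ:
            # Forward: some definition of a variable in S reaches this point.
            d_in = any(def_out[p] for p in pred[b])
            d_out = d_in or bool(defs.get(b, set()) & S)
            # Backward: some use of a variable in S lies ahead of this point.
            u_out = any(use_in[s] for s in succ[b])
            u_in = u_out or bool(uses.get(b, set()) & S)
            new = (d_in, d_out, u_in, u_out)
            if new != (def_in[b], def_out[b], use_in[b], use_out[b]):
                def_in[b], def_out[b], use_in[b], use_out[b] = new
                changed = True
    return {b for b in succ
            if (def_in[b] and use_in[b]) or (def_out[b] and use_out[b])}

def pick_shared_set(succ, defs, uses, sizes, instrs, shared_bytes):
    """Try every subset whose total size equals shared_bytes and return
    (cost, subset) with the fewest instructions in its access range."""
    best = None
    for k in range(1, len(sizes) + 1):
        for combo in combinations(sorted(sizes), k):
            if sum(sizes[v] for v in combo) != shared_bytes:
                continue
            blocks = access_range_blocks(succ, defs, uses, set(combo))
            cost = sum(instrs[b] for b in blocks)
            if best is None or cost < best[0]:
                best = (cost, set(combo))
    return best

# Hypothetical straight-line CFG: Entry -> B1 -> B2 -> B3 -> B4 -> Exit.
succ = {"Entry": ["B1"], "B1": ["B2"], "B2": ["B3"],
        "B3": ["B4"], "B4": ["Exit"], "Exit": []}
defs = {"B1": {"X"}, "B2": {"Y"}, "B3": {"Z"}}
uses = {"B2": {"X"}, "B3": {"Y"}, "B4": {"Z"}}
sizes = {"X": 4, "Y": 4, "Z": 4}
instrs = {"Entry": 0, "B1": 5, "B2": 10, "B3": 10, "B4": 20, "Exit": 0}

print(pick_shared_set(succ, defs, uses, sizes, instrs, shared_bytes=8))
```

In this hypothetical kernel the set {X, Y} is selected, because its access range spans only B1 to B3. The exhaustive enumeration mirrors the per-subset computation described above and is practical only for a small number of scratchpad variables.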
Implementation of relssp Instruction
In scratchpad sharing approach, a shared thread block acquires a lock before accessing shared scratchpad region and unlocks it only after finishing its execution. This causes a delay in releasing the shared scratchpad because the thread block holds the scratchpad memory till the end of its execution, even though it has finished accessing shared region.
To minimize the delay in releasing the shared scratchpad, we propose a new instruction, called relssp, in PTX assembly language. The semantics of relssp instruction is to unlock the shared region only when all active threads within a thread block finished executing the shared region. Figure 8 shows the pseudo code for relssp instruction. The RELEASESSP() procedure maintains count, an integer initialized to zero. When an active thread within a thread block executes a relssp instruction, it increments the count value. When all active threads of a thread block execute relssp instruction (Line 5, when count equals ACTIVE THREADS), the shared region is unlocked by invoking UNLOCKSHAREDREGION(). The unlock procedure releases the shared scratchpad region by resetting the lock variable. The execution of relssp by a thread block that does not access shared scratchpad region has no effect.
It is clear that count in Figure 8 has to be a shared variable; hence, a software implementation would need to manage a critical section. The same algorithm, however, can be implemented efficiently as a hardware circuit, as shown in Figure 9. The i th thread within a thread block is associated with an active mask (A i ) and a release bit (R i ). The mask A i is set if the i th thread is active. When this thread executes the relssp instruction, the release bit (R i ) gets set. The shared scratchpad region is unlocked only when all the active threads in a thread block have executed the relssp instruction (the lock bit, i.e., the output of the NAND gate, becomes 0 in Figure 9). In other words, the shared scratchpad region is unlocked if ∀i A i → R i is true.
The placement of relssp instructions in a kernel must satisfy two conditions. Condition 1 (safety): relssp is executed by every active thread of a thread block, and only after the thread has finished accessing the shared scratchpad. Condition 2 (optimality): each thread executes relssp exactly once. Condition 1 ensures that the shared scratchpad is eventually released by a thread block, since the instruction is executed by all the threads of the thread block. It also guarantees that the shared scratchpad is released only after the thread block has completed using it. Condition 2 avoids redundant execution of the relssp instruction.
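The release logic described for Figures 8 and 9 can be modeled in software as a rough sketch. The class name `ThreadBlockRelease` is hypothetical; the model keeps one active bit A i and one release bit R i per thread and unlocks once A i → R i holds for every i. It is a behavioral illustration, not the hardware circuit.

```python
class ThreadBlockRelease:
    """Per-thread-block release state for the shared scratchpad region."""

    def __init__(self, active_mask):
        self.active = list(active_mask)              # A_i
        self.released = [False] * len(self.active)   # R_i
        self.locked = True                           # the block holds the lock

    def relssp(self, i):
        """Thread i executes relssp; return True once the region is unlocked."""
        self.released[i] = True
        # Unlock when A_i -> R_i holds for all i, i.e. (not A_i) or R_i.
        if all((not a) or r for a, r in zip(self.active, self.released)):
            self.locked = False
        return not self.locked

# Four threads, of which thread 2 is inactive.
tb = ThreadBlockRelease([True, True, False, True])
steps = [tb.relssp(0), tb.relssp(1), tb.relssp(3)]
print(steps)
```

The region is unlocked only when the last active thread executes relssp; the inactive thread never needs to.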
In the scratchpad sharing, a thread block releases the shared scratchpad memory after completing its execution, hence it is equivalent to having a relssp instruction placed at the end of the program, which guarantees both the conditions, albeit at the cost of delay in releasing the shared scratchpad. A simple improvement that promotes early release of shared scratchpad memory and ensures both the conditions, is to place the relssp instruction at a basic block BB postdom where BB postdom is a common post dominator of those basic blocks having the last accesses to the shared scratchpad memory along different paths. Further, BB postdom should dominate Exit, i.e., it should be executed in all possible execution paths. As the following example shows, this strategy, though an improvement over placing relssp in Exit, may also result in delaying the release of shared scratchpad memory.
Example 6.4. Consider the CFG shown in Figure 10. Assume that L 1 and L 2 denote the program points that correspond to the last accesses to shared scratchpad memory. Since the relssp instruction is to be executed by all the threads of a thread block, it can be placed at the post dominator of the basic blocks BB 3 and BB 9 , i.e., at the program point marked π in BB 12 , which is visible to all threads. However, this delays the release of the shared scratchpad. A thread that takes a path along BB 9 could execute relssp immediately after its last access to the shared region (shown as OPT 3 in the figure); with this placement, however, it executes relssp only at program point π. Similarly, a thread that takes a path along basic block BB 4 releases the shared scratchpad only at π.
As is clear from the above example, the placement of the relssp instruction has an effect on the availability of shared scratchpad memory. Intuitively, a safely placed relssp instruction at a program point π can be moved to a previous program point π ′ in the same basic block provided the intervening instructions do not access the shared scratchpad. The movement of relssp from a basic block BB to a predecessor BB ′ is possible provided every other successor of BB ′ also allows this movement.
Example 6.5. Figure 11 (a) shows a basic block BB 1 , which has the last access to the shared scratchpad memory at L 1 . In this block, if the relssp instruction can be placed safely at the program point π 1 , then it can be moved to π 2 since there is no access to shared scratchpad memory between π 1 and π 2 . However, it can not be moved to the program point π 3 within the same basic block, because it violates safety (Condition 1).
Consider another scenario, shown in Figure 11(b), in which basic block BB 2 has the last access to shared memory at L 2 , and basic blocks BB 1 , BB 3 , and BB 4 do not access any scratchpad memory. If the relssp instruction can be placed safely at π 4 in BB 4 , then it can be moved to the program points π 5 and π 7 in the basic blocks BB 2 and BB 3 respectively. However, it can not be moved to program point π 6 in BB 2 or π 8 in BB 1 , since that violates Condition 1. Also, the relssp instruction can not be moved from π 7 in BB 3 to π 8 in BB 1 , since the basic block BB 2 , which is a successor of BB 1 , does not allow the relssp instruction to be placed at π 8 .
We now formalize these intuitions into a backward data flow analysis. The notations used are:
- IN(BB) denotes the program point before the first statement of the basic block BB.
- OUT(BB) denotes the program point after the last statement of BB.
- SafeIN(BB) is true if the relssp instruction can be safely placed at IN(BB), and SafeOUT(BB) is true if the relssp instruction can be safely placed at OUT(BB).
- INS π , if true, denotes that relssp will be placed at program point π by the analysis.
The data flow equations are:
The above equations compute the program points where relssp can be placed safely. For a basic block BB, OUT(BB) is an optimal place for the relssp instruction if relssp can be placed safely at OUT(BB) and it can not be moved safely to its previous program point in the basic block, i.e., SafeIN(BB) is false. This is computed as:
Similarly, IN(BB) is an optimal point for the relssp instruction when the instruction can not be moved to the predecessors of BB. This can be computed as:
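The safety analysis and the two insertion rules described in the prose can be sketched at basic-block granularity as follows. This is one possible formulation that is consistent with the text, not a transcription of the paper's equations. The function name `place_relssp` is hypothetical, program points inside a block are not modeled, inserting at IN(Entry) when the kernel never touches the shared scratchpad is an assumption, and the CFG shape used for Figure 11(b) (BB 1 branching to BB 2 and BB 3 , which both lead to BB 4 ) is inferred from the description.

```python
def place_relssp(succ, accesses_shared, entry):
    """Return (blocks with relssp at IN, blocks with relssp at OUT)."""
    pred = {b: [] for b in succ}
    for b, ss in succ.items():
        for s in ss:
            pred[s].append(b)
    # Start optimistic and iterate down to a fixed point (backward analysis).
    safe_in = {b: True for b in succ}
    safe_out = {b: True for b in succ}
    changed = True
    while changed:
        changed = False
        for b in succ:
            # OUT(BB) is safe if relssp is safe at the start of every
            # successor; Exit has no successors, so all(...) is True there.
            out = all(safe_in[s] for s in succ[b])
            # IN(BB) is safe if OUT(BB) is safe and BB does not access
            # the shared scratchpad.
            inn = out and not accesses_shared.get(b, False)
            if (inn, out) != (safe_in[b], safe_out[b]):
                safe_in[b], safe_out[b] = inn, out
                changed = True
    # Insert at OUT(BB) when relssp cannot be hoisted to IN(BB).
    ins_out = {b for b in succ if safe_out[b] and not safe_in[b]}
    # Insert at IN(BB) when relssp cannot be hoisted into the predecessors.
    ins_in = {b for b in succ if safe_in[b] and
              (b == entry or not all(safe_out[p] for p in pred[b]))}
    return ins_in, ins_out

# CFG shape assumed for Figure 11(b); only BB2 accesses shared scratchpad.
succ = {"BB1": ["BB2", "BB3"], "BB2": ["BB4"], "BB3": ["BB4"], "BB4": []}
ins_in, ins_out = place_relssp(succ, {"BB2": True}, entry="BB1")
print(ins_in, ins_out)
```

On this CFG the sketch places relssp at the end of BB 2 and at the start of BB 3 , so every path from BB 1 to BB 4 executes it exactly once, in line with the discussion of π 5 and π 7 .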
Equations (1) and (2) together, along with the absence of critical edges, ensure the optimality condition that each thread executes the relssp instruction exactly once.
REQUIREMENTS FOR SCRATCHPAD SHARING
Hardware Requirements
Figure 12 shows the modified GPU architecture that implements scratchpad sharing. Our approach requires two modifications to the scheduler unit in the SM. The first change is to the warp scheduler, which uses the OWF optimization. The second change is the inclusion of a resource access unit, which follows the scratchpad access mechanism discussed in Section 3. The resource access unit requires the following additional storage:
(1) Each SM requires a bit (shown as ShSM in Figure 12) to indicate whether scratchpad sharing is enabled for it. This bit is set when the number of thread blocks launched using our approach is more than that of the baseline approach.
(2) Every thread block involved in sharing stores the id of its partner thread block in the ShTB table. If a thread block is in unsharing mode, a −1 is stored. For T thread blocks in the SM, we need a total of $T \log_2(T + 1)$ bits.
(3) Each warp requires a bit to specify whether it is an owner warp. For W warps in the SM, W bits are needed.
(4) For each pair of shared thread blocks, a lock variable is needed in order to access shared scratchpad memory. This variable is set to the id of the thread block that acquired the shared scratchpad memory. For T thread blocks, there are at most $\lfloor T/2 \rfloor$ pairs of sharing thread blocks in the SM. This requires $\lfloor T/2 \rfloor \lceil \log_2 T \rceil$ bits in the SM.
If a GPU has N SMs and allows a maximum of T thread blocks and W warps per SM, then the number of additional bits required is $(1 + T \log_2(T + 1) + W + \lfloor T/2 \rfloor \lceil \log_2 T \rceil) \times N$. For the architecture we used for simulation (shown in Table II), the overhead is 209 bits per SM. In addition, each scheduler unit in the SM requires two comparator circuits and one arithmetic circuit to set the lock (see Figure 3).
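The storage formula above can be checked with a short calculation. The following is a minimal sketch, not the authors' tooling: the function names are hypothetical, the partner-id term is rounded up to whole bits (an assumption, since the text writes it without a ceiling), and the inputs T = 16 and W = 96 are illustrative values that are not read from Table II.

```python
def ceil_log2(x: int) -> int:
    """Smallest k with 2**k >= x, for x >= 1 (integer-only, no float rounding)."""
    return (x - 1).bit_length()


def sharing_overhead_bits(T: int, W: int) -> int:
    """Extra bits per SM for scratchpad sharing, following items (1)-(4)."""
    sh_sm = 1                             # (1) sharing-enabled bit (ShSM)
    sh_tb = T * ceil_log2(T + 1)          # (2) partner ids, T block ids plus -1 (rounded up: assumption)
    owner = W                             # (3) one owner bit per warp
    locks = (T // 2) * ceil_log2(T)       # (4) one lock per pair of sharing blocks
    return sh_sm + sh_tb + owner + locks


def total_overhead_bits(T: int, W: int, N: int) -> int:
    """Overhead for a GPU with N SMs."""
    return sharing_overhead_bits(T, W) * N


if __name__ == "__main__":
    # Assumed inputs, chosen for illustration only.
    print(sharing_overhead_bits(16, 96))
```

With these assumed inputs the per-SM total is 1 + 80 + 96 + 32 = 209 bits, which agrees with the figure quoted above; the actual simulated parameters are those listed in Table II.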
Analysis of Compiler Optimizations
The dataflow analyses to compute definitions and usages of scratchpad variables (Section 6.1) are bit-vector data flow analyses [Khedker et al. 2009]. For a kernel with n scratchpad variables and m nodes (basic blocks) in the flow graph, the worst case complexity is $O(n \times m^2)$ (assuming set operations on n-bit-wide vectors take O(n) time). The computation of access ranges for sets of variables may require analyzing all $O(2^n)$ subsets in the worst case, where the largest size of a subset is O(n). Thus, given the usage and definitions at each program point in the kernel, computation of AccIN and AccOUT requires $O(m \times n \times 2^n)$ time. Therefore, the total time complexity is $O(n \times m^2 + m \times n \times 2^n)$. Since the number of scratchpad variables in a kernel function is small (typically, n ≤ 10), the overhead of the analysis is practical.
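To make the bit-vector formulation concrete, the sketch below runs a simplified backward analysis that computes, for every basic block, which scratchpad variables may still be accessed from that point onward. It is a stand-in for illustration only and does not reproduce the equations of Section 6.1; the function name, the CFG encoding, and the example kernel are all assumptions. Each variable is one bit of a Python integer, so set union is a bitwise OR.

```python
def may_access_later(succ, access):
    """Round-robin bit-vector analysis (illustrative, not the paper's equations).

    succ:   dict mapping each block to the list of its successor blocks
    access: dict mapping each block to a bitmask of scratchpad variables it accesses
    Returns (IN, OUT): bitmasks of variables that may be accessed at or after
    the entry (IN) and after the exit (OUT) of each block.
    """
    in_ = {b: 0 for b in succ}
    out = {b: 0 for b in succ}
    changed = True
    while changed:                      # iterate until a fixed point is reached
        changed = False
        for b in succ:
            new_out = 0
            for s in succ[b]:           # OUT[b] = union of IN over successors
                new_out |= in_[s]
            new_in = access[b] | new_out
            if (new_in, new_out) != (in_[b], out[b]):
                in_[b], out[b] = new_in, new_out
                changed = True
    return in_, out


# Hypothetical kernel CFG: variable 0 is bit 0b01, variable 1 is bit 0b10.
succ = {"entry": ["b1"], "b1": ["b2", "b3"], "b2": ["b4"],
        "b3": ["b4"], "b4": ["exit"], "exit": []}
access = {"entry": 0, "b1": 0b01, "b2": 0b10, "b3": 0, "b4": 0, "exit": 0}
IN, OUT = may_access_later(succ, access)
```

In this example no scratchpad variable can be accessed after block b2 or from b3 onward, which is the kind of fact a release instruction such as relssp would rely on.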
Our approach inserts relssp instructions in a CFG such that relssp is executed exactly once along any execution path. In the worst case, all nodes in a CFG (except Entry and Exit blocks) might fall along different paths from Entry to Exit. Hence the worst case number of relssp instructions inserted is O(m).
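The exactly-once guarantee depends on the CFG having no critical edges, that is, no edge from a block with several successors to a block with several predecessors. A minimal sketch of splitting such edges is given below; the function name, the dictionary-based CFG, and the naming of the new blocks are assumptions for illustration, and the sketch assumes at most one edge between any two blocks.

```python
def split_critical_edges(succ):
    """Return a new CFG in which every critical edge passes through a fresh block.

    succ: dict mapping each block name to the list of its successor names.
    An edge (b, s) is critical when b has more than one successor and
    s has more than one predecessor.
    """
    # Count predecessors of every block.
    npred = {b: 0 for b in succ}
    for b, targets in succ.items():
        for s in targets:
            npred[s] = npred.get(s, 0) + 1

    new_succ = {b: list(targets) for b, targets in succ.items()}
    for b, targets in succ.items():
        if len(targets) <= 1:
            continue
        for i, s in enumerate(targets):
            if npred[s] > 1:
                mid = f"{b}_to_{s}"     # new block holding only a jump to s
                new_succ[b][i] = mid
                new_succ[mid] = [s]
    return new_succ


# Hypothetical CFG: the edge a -> c is critical (a branches, c is a join).
cfg = {"entry": ["a"], "a": ["b", "c"], "b": ["c"], "c": ["exit"], "exit": []}
split = split_critical_edges(cfg)
```

Each new block executes a single jump, which corresponds to the extra GOTO instruction counted in the instruction overhead reported in the evaluation.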
EXPERIMENTAL EVALUATION
We implemented the proposed scratchpad sharing approach and integrated the relssp instruction in the GPGPU-Sim V3.x [GPGPU-Sim 2014] simulator. We implemented the compiler optimizations in PTX assembly [PTX 2014] using the Ocelot [Diamos et al. 2010] framework. The baseline architecture that we used for comparing our approach is shown in Table II.
Depending on the amount and the last usage of the shared scratchpad memory by the applications, we divided the benchmark applications into three sets. Set-1 and Set-2 (Table I) consist of applications whose number of resident thread blocks is limited by scratchpad memory. For Set-1, the applications do not access scratchpad memory till towards the end of their execution, while for Set-2, the applications access scratchpad memory till towards the end of their execution. The introduction of the relssp instruction is expected to give benefit over our earlier approach [Jatala et al. 2016] only for Set-1 applications. Set-3 benchmarks (Table IV) consist of applications whose number of thread blocks is not limited by scratchpad memory, but by some other parameter. These are included to show that our approach does not negatively affect the performance of applications that are not limited by scratchpad memory.
For each application in Set-1 and Set-2 benchmarks, Table I shows the kernel that is used for evaluation, the number of scratchpad variables declared in each kernel, the amount of scratchpad memory required for each thread block, and the thread block size. Some applications in Set-1 and Set-2 benchmarks are modified to make sure that the number of thread blocks is limited by scratchpad memory, thus making the scratchpad sharing approach applicable. These changes increase the scratchpad memory requirement per thread block and are shown in Table V. For Set-3 benchmarks, Table IV shows the cause of limitation on the number of thread blocks. The causes include the limit on the number of registers, the maximum limit on the number of resident thread blocks, and the maximum limit on the number of resident threads.
We compiled all the applications using CUDA 4.0 and simulated them using the GPGPU-Sim simulator. We use a threshold (t) to configure the amount of scratchpad sharing. If each thread block requires $R_{tb}$ units of scratchpad memory, then we allocate $R_{tb}(1 + t)$ for each pair of shared thread blocks, of which $R_{tb}(1 - t)$ is shared scratchpad memory. We analyzed the benchmark applications for various threshold values and chose t = 0.1 (i.e., 90% of the scratchpad memory is shared between a pair of thread blocks), which gives the maximum benefit. The details of the experiments to choose t are given in the technical report [Jatala et al. 2015].
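The allocation rule above determines how many thread blocks fit in an SM. The sketch below is a simplified model rather than the authors' launch logic: it assumes that scratchpad memory is the only limiting resource, that the launcher maximizes the number of resident blocks and prefers fewer sharing pairs on a tie, and that the SM has 16384 bytes of scratchpad memory. The per-block requirements of 2176 and 2112 bytes are hypothetical.

```python
def blocks_with_sharing(total, r_tb, t):
    """Return (baseline_blocks, shared_blocks, pairs) under a simplified model.

    total: scratchpad memory per SM
    r_tb:  scratchpad memory required by one thread block
    t:     sharing threshold; a pair of sharing blocks gets r_tb * (1 + t)
    """
    baseline = total // r_tb              # blocks without sharing
    pair_cost = r_tb * (1 + t)            # memory for one pair of sharing blocks
    best_count, best_pairs = baseline, 0
    for pairs in range(baseline + 1):
        remaining = total - pairs * pair_cost
        if remaining < 0:
            break
        singles = int(remaining // r_tb)  # blocks that keep private scratchpad
        count = 2 * pairs + singles
        if count > best_count:            # strict '>' prefers fewer pairs on ties
            best_count, best_pairs = count, pairs
    return baseline, best_count, best_pairs


if __name__ == "__main__":
    print(blocks_with_sharing(16384, 2176, 0.1))
    print(blocks_with_sharing(16384, 2112, 0.1))
```

Under these assumptions a block needing 2176 bytes goes from 7 resident blocks to 12 (5 pairs plus 2 unshared blocks), while one needing 2112 bytes goes from 7 to 14 (7 pairs). Limits on registers, resident threads, or resident blocks are deliberately ignored here.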
We measure the performance of our approach using the following metrics: (1) The number of the resident thread blocks launched in the SMs. This is a measure of the amount of thread level parallelism present in the SMs. (2) The number of instructions executed per shader core clock cycle (IPC). This is a measure of the throughput of the GPU architecture. (3) The number of simulation cycles that an application takes to complete its execution. This is a measure of the performance of the benchmark applications.
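For reference, the metrics above reduce to simple ratios over simulator counters. The helpers below are a sketch with hypothetical names and made-up inputs; they are not part of GPGPU-Sim and only illustrate how an IPC improvement, a cycle reduction, and a geometric-mean improvement can be derived.

```python
import math


def ipc(instructions, cycles):
    """Instructions executed per shader core clock cycle."""
    return instructions / cycles


def percent_improvement(new_ipc, base_ipc):
    """Relative IPC gain of a scheme over the baseline, in percent."""
    return (new_ipc / base_ipc - 1.0) * 100.0


def percent_cycle_reduction(new_cycles, base_cycles):
    """Relative reduction in simulation cycles, in percent."""
    return (1.0 - new_cycles / base_cycles) * 100.0


def geomean_improvement(ratios):
    """Geometric mean of per-application IPC ratios, expressed as a percent gain."""
    mean_log = sum(math.log(r) for r in ratios) / len(ratios)
    return (math.exp(mean_log) - 1.0) * 100.0


if __name__ == "__main__":
    # Made-up per-application IPC ratios (scheme / baseline).
    print(round(geomean_improvement([1.10, 1.25, 0.98]), 2))
```

The geometric mean is used for averaging ratios because a 2x gain on one application and a 0.5x loss on another then cancel out, which an arithmetic mean would not show.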
Analysis of Set-1 and Set-2 Benchmarks
We use Unshared-LRR to denote the baseline unsharing approach, Shared-OWF to denote our scratchpad sharing approach with the OWF scheduler, and Shared-OWF-OPT to denote the scratchpad sharing approach that includes the OWF scheduler and compiler optimizations.
Comparing the Number of Resident Thread Blocks.
Figure 13 shows the number of thread blocks for the three approaches. For applications DCT1 and DCT2, Unshared-LRR launches 7 thread blocks in the SM according to the amount of scratchpad memory required by their thread blocks. Shared-OWF launches 14 thread blocks in the SM, where each of the 7 additional thread blocks shares scratchpad memory with another resident thread block. For DCT3 and DCT4 applications, Unshared-LRR launches 7 thread blocks in the SM, whereas Shared-OWF launches 12 thread blocks in the SM such that the additional 5 thread blocks share scratchpad memory with 5 of the existing thread blocks, while the remaining 2 existing thread blocks in the SM do not share scratchpad memory with any other thread block. For FDTD3d, Shared-OWF launches 2 additional thread blocks in the SM when compared to Unshared-LRR, which share scratchpad memory with 2 other resident thread blocks. For the remaining applications, Unshared-LRR launches 1 thread block, whereas Shared-OWF launches 1 additional thread block in the SM which shares scratchpad memory with the existing thread block. Note that the number of thread blocks launched by Shared-OWF-OPT is exactly the same as that of Shared-OWF. This is expected since the number of additional thread blocks launched by the scratchpad sharing approach depends on two parameters: (1) the amount of scratchpad sharing, and (2) the amount of scratchpad memory required by a thread block; our compiler optimizations do not affect either of these parameters.
Performance Comparison.
Figure 14 compares the performance of Shared-OWF-OPT in terms of the number of instructions executed per cycle (IPC) with that of Unshared-LRR. We observe a maximum improvement of 92.17% and an average (geometric mean) improvement of 19% with Shared-OWF-OPT. The maximum benefit of 92.17% is for heartwall because the additional thread blocks launched by Shared-OWF-OPT do not access the shared scratchpad region. Hence all the additional thread ...
Table VI shows the run-time overhead of inserting the relssp instruction. We report the sum of the number of instructions executed by all threads for Unshared-LRR, Shared-OWF, and Shared-OWF-OPT. We also report the number of threads launched.
From the table, we observe that the number of instructions executed by Unshared-LRR and Shared-OWF is the same. This is because Shared-OWF does not insert the relssp instruction, and hence the input PTX assembly is not altered. Shared-OWF-OPT increases the number of executed instructions as it inserts relssp and, in some cases, a GOTO instruction to split critical edges. For the applications DCT1, DCT2, SRAD1, SRAD2, NW1, and NW2, the number of additionally executed instructions (shown as Difference (SO-U) in the table) is equal to the number of threads because Shared-OWF-OPT inserts only the relssp instruction, and each thread executes relssp exactly once. For FDTD3d, heartwall, histogram, and MC1 applications, the number of additional instructions executed by Shared-OWF-OPT is twice the number of threads. For these applications, each thread executes two additional instructions, i.e., one relssp instruction and one GOTO instruction for splitting a critical edge. For backprop, DCT3, DCT4, and NQU applications, some threads take a path that has two additional instructions (GOTO and relssp), while other threads take a path that has one additional relssp instruction.
Reduction in Simulation Cycles.
Figure 15 shows the effectiveness of Shared-OWF-OPT by comparing the number of simulation cycles with that of Unshared-LRR. We observe a maximum reduction of 47.8% and an average reduction of 15.42% in the number of simulation cycles when compared to Unshared-LRR. Recall that Shared-OWF-OPT causes applications to execute more instructions (Table VI). These extra instructions are also counted while computing the simulation cycles for Shared-OWF-OPT.
Figure 16 shows the effectiveness of our optimizations with scratchpad sharing. We observe that all applications, except FDTD3d and histogram, show some benefit with scratchpad sharing even without any optimizations (shown as Shared-NoOpt in the figure). With OWF scheduling (Shared-OWF), applications improve further because OWF schedules the resident warps in a way that the non-owner warps help in hiding long execution latencies. For our benchmarks, minimizing the shared scratchpad region (shown as Shared-OWF-Reorder) does not have any noticeable impact. This is because (a) most applications declare only a single scratchpad variable (Table I) in their kernel, hence the optimization is not applicable (there is only one possible order of scratchpad variable declarations); and (b) for the remaining applications, the scratchpad declarations are already ordered in the optimal fashion, i.e., the access to the shared scratchpad region is already minimal.
The addition of the relssp instruction at the postdominator and at the optimal places is denoted as Shared-OWF-PostDom and Shared-OWF-OPT respectively. All Set-1 applications improve with either of these optimizations because the relssp instruction helps in releasing the shared scratchpad memory earlier. For backprop and SRAD2 applications, Shared-OWF-PostDom is better than Shared-OWF-OPT. The threads in backprop execute one additional GOTO instruction with Shared-OWF-OPT (Shared-OWF-PostDom does not require critical edge splitting), and SRAD2 has more stall cycles with Shared-OWF-OPT than with Shared-OWF-PostDom. For most of the other benchmarks, Shared-OWF-OPT performs better because it can place the relssp instruction earlier than Shared-OWF-PostDom can, thus releasing the shared scratchpad earlier and allowing for more thread level parallelism.
As expected, Set-2 applications do not show much benefit with Shared-OWF-PostDom or Shared-OWF-OPT since they access shared scratchpad memory till towards the end of their execution.
Figure 17 shows the effect of compiler optimizations by analyzing the progress of shared thread blocks through shared and unshared scratchpad regions. In the figure, NoOpt denotes the default scratchpad sharing mechanism where none of our optimizations are applied on an input kernel. Minimize denotes the scratchpad sharing approach which executes an input kernel having minimum access to the shared scratchpad region. PostDom and OPT use our modified scratchpad sharing approach that executes an input kernel with additional relssp instructions placed at the post dominator and at the optimal places (Section 6.3) respectively. In the figure, we show the percentage of simulation cycles spent in the unshared scratchpad region (before acquiring the shared scratchpad), in the shared scratchpad region, and in the unshared scratchpad region again (after releasing the shared scratchpad).
From Figure 17 we observe that shared thread blocks in all the applications access the unshared scratchpad region before they start accessing shared scratchpad memory. Hence all the shared thread blocks can make some progress without waiting. This progress is the main reason for the improvements seen with the scratchpad sharing approach. Consider the application heartwall, where none of the shared thread blocks accesses shared scratchpad memory. Thus, all the shared thread blocks in the application spend their execution in the unshared scratchpad region, and the compiler optimizations cannot improve the progress of shared thread blocks any further. Minimize does not affect the DCT1, DCT2, DCT3, DCT4, FDTD3d, and histogram applications because the kernels in these applications declare a single scratchpad variable. For the remaining applications, Minimize has the same effect as NoOpt, because the default input PTX kernel already accesses the shared scratchpad variables such that access to the shared scratchpad is minimum. We also observe that the PostDom and OPT approaches improve only those applications that spend considerable simulation cycles in the unshared scratchpad region after the last access to the shared scratchpad region.
Comparison with Different Schedulers.
Figure 18 shows the effect of using different scheduling policies. The performance of the Shared-OWF-OPT approach is compared with the baseline unshared implementation that uses greedy-then-old (GTO) and two-level scheduling policies respectively. We observe that the Shared-OWF-OPT approach shows an average improvement of 17.73% and 18.08% with respect to the unshared GTO and two-level scheduling policies respectively. The application FDTD3d degrades with our approach when compared to the baseline with either GTO scheduling or two-level scheduling since it has more L1 and L2 cache misses with sharing. The application histogram degrades with sharing when compared to the baseline with GTO scheduling because of more L1 misses. However, histogram with sharing performs better than the baseline with the two-level policy.
We now compare the effectiveness of the sharing approaches with Unshared-LRR for different GPGPU-Sim configurations. Figure 19 shows results for a GPU configuration that uses a 48K L1 cache. We observe that the sharing approach shows an average improvement of 14.04% with Shared-OWF and 18.71% with Shared-OWF-OPT over Unshared-LRR. The applications DCT3, DCT4, SRAD1, SRAD2, and FDTD3d improve further under sharing with this configuration since they benefit from the increased L1 cache size. For heartwall, Unshared-LRR benefits more from the increased L1 cache than the sharing approaches do, so its improvement of 86% is lower than that in Figure 14.
Figure 20 shows the performance comparison of the sharing approaches with Unshared-LRR for a GPU configuration that has 48K scratchpad memory and a maximum of 2048 resident threads in the SM. We observe average improvements of 8.62% and 9.21% with the Shared-OWF and Shared-OWF-OPT approaches respectively. Consider the applications DCT1, DCT2, DCT3, and DCT4. With the increase in scratchpad memory, the number of resident thread blocks in the SM for these applications is not limited by the scratchpad memory, hence sharing does not increase the number of resident thread blocks. Also, the compiler optimizations do not insert the relssp instruction into their PTX code since there is no access to a shared scratchpad region. Hence Shared-OWF-OPT behaves exactly the same as Shared-OWF. The improvement in the performance of the sharing approaches over Unshared-LRR is due to the OWF scheduling policy. The OWF scheduler arranges the resident warps according to the owner warps. Since all the warps that are launched using Shared-OWF own their resources (no sharing), they become owner warps. Hence the warps are arranged according to their dynamic warp id, giving the observed benefit. While scratchpad sharing can increase the number of resident thread blocks for SRAD1 and SRAD2, no additional blocks could be launched since the number of resident threads is restricted to 2048.
Figure 21 shows the performance comparison for a GPU configuration that uses 48K shared memory and a maximum of 3072 resident threads. With the number of resident threads increasing from 2048 to 3072, sharing is able to increase the number of resident thread blocks in the SRAD1 and SRAD2 applications, thereby improving the performance.
Figure 22 compares the IPC of Shared-OWF-OPT with that of Unshared-LRR using twice the amount of scratchpad memory on the GPU. We observe that DCT3, DCT4, NQU, and heartwall show improvement with Shared-OWF-OPT over Unshared-LRR even with half the scratchpad memory. This is because sharing helps in increasing the TLP by launching additional thread blocks in each SM. The applications DCT1, DCT2, SRAD1, SRAD2, and MC1 perform comparably with both approaches. For the remaining applications, Unshared-LRR with double the scratchpad memory performs better than sharing since more thread blocks are able to make progress with the former.
Analysis of Set-3 Benchmarks
Performance analysis of Set-3 benchmarks is shown in Figure 23. Recall that the number of thread blocks launched by these applications is not limited by the scratchpad memory. We observe that the performance of the applications with Unshared-LRR, Shared-LRR, and Shared-LRR-OPT is exactly the same. For Set-3 applications, all thread blocks are launched in unsharing mode. Hence Shared-LRR behaves exactly the same as Unshared-LRR. Since these applications do not use any shared scratchpad memory, our compiler optimizations do not insert the relssp instruction in their PTX code. Hence the number of instructions executed by the Shared-LRR-OPT approach is the same as that of Shared-LRR. Similarly, we see that the performance of applications with Unshared-GTO, Shared-GTO, and Shared-GTO-OPT is exactly the same. However, with the OWF optimization, Shared-OWF and Shared-OWF-OPT are comparable to Unshared-GTO because the OWF optimization arranges the resident warps according to the owner. Since all the thread blocks own their scratchpad memory, their warps are sorted according to the dynamic warp id. Hence they perform comparably to Unshared-GTO. The performance with Shared-OWF and Shared-OWF-OPT is the same because the compiler optimizations do not insert any relssp instruction.
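The ordering behaviour described above, owner warps first and ties broken by dynamic warp id, can be sketched as a sort. This is a simplified illustration with hypothetical names (Warp, owf_order); the full OWF policy is defined earlier in the paper and may distinguish more warp categories than the single owner flag assumed here.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    dynamic_id: int   # order in which the warp was launched
    is_owner: bool    # True if the warp's thread block owns its scratchpad memory


def owf_order(warps):
    """Order warps so that owner warps come first, then by dynamic warp id."""
    # 'not w.is_owner' is False (0) for owners, so owners sort ahead of non-owners.
    return sorted(warps, key=lambda w: (not w.is_owner, w.dynamic_id))


# Mixed case: owner warps 0 and 2 are scheduled ahead of non-owner warps 1 and 3.
mixed = [Warp(3, False), Warp(2, True), Warp(1, False), Warp(0, True)]
mixed_ids = [w.dynamic_id for w in owf_order(mixed)]

# No sharing: every warp is an owner, so the order reduces to dynamic warp id.
unshared = [Warp(2, True), Warp(0, True), Warp(1, True)]
unshared_ids = [w.dynamic_id for w in owf_order(unshared)]
```

When no thread block shares scratchpad memory, every warp is an owner and the order depends only on the dynamic warp id, which is consistent with the Set-3 results being comparable to Unshared-GTO.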
RELATED WORK
The resource sharing technique [Jatala et al. 2016], proposed by the authors earlier, improves throughput by reducing register and scratchpad memory underutilization through modifications to the GPU architecture and scheduling algorithm. This work improves on it by introducing compiler optimizations for better layout of scratchpad variables and early release of shared scratchpad. Other related approaches that improve the performance of GPUs are discussed below:
Resource Management in GPUs
Shared memory multiplexing [Yang et al. 2012] addresses the TLP problem caused by limited shared memory by combining two thread blocks into a virtual thread block. The two thread blocks in a virtual block can execute instructions in parallel as long as they do not access shared memory; they become serial when they need to access shared memory. The warp level divergence technique [Xiang et al. 2014] improves the TLP by minimizing register underutilization. It launches one additional partial thread block when there is an insufficient number of registers for an entire thread block. However, the number of warps in the partial thread block is decided by the number of unutilized registers, and the partial thread block does not share registers with any other thread blocks. The unified storage approach [Gebhart et al. 2012] allocates the resources of the SM (such as registers, scratchpad memory, and cache) dynamically as per the application demand. Tarjan and Skadron [2011] use virtual registers to launch more thread blocks. These registers are mapped to the physical registers as per the demand. We can combine our compiler optimizations to promote the early release of registers with this approach. propose a dynamic algorithm to launch the efficient number of thread blocks in an SM. Li et al. [2011] propose a resource virtualization scheme for sharing of GPU resources with multiprocessors. The virtualization layer proposed by them helps in improving the performance by overlapping multiple kernel executions. Our compile-time optimizations can be used with these techniques to improve TLP further.
Compiler Optimizations for Efficient Resource Utilization in GPUs
formulated the problem of scratchpad memory allocation as an integer programming problem, which maximizes scratchpad memory access and minimizes device memory access to improve GPU performance. Their framework can allocate parts of arrays on scratchpad, and also suggest profitable loop transformations. Hayes and Zhang [2014] proposed an on-chip memory allocation scheme for efficient utilization of GPU resources. It aims to alleviate register pressure by spilling registers to scratchpad memory instead of local memory. Xie et al. [2015] proposed a compile time coordinated register allocation scheme to minimize the cost of spilling registers. These schemes do not propose any architectural change to GPUs and are orthogonal to our approach of scratchpad sharing.
Scheduling Techniques for GPUs
The two level warp scheduling algorithm, proposed by Narasiman et al. [2011], forms groups of warps and uses LRR to schedule warps in a group. It also proposes a large warp microarchitecture to minimize resource underutilization. Lee and Wu [2014] hide the long execution latencies by scheduling critical warps more frequently than other warps. This helps in finishing the thread block sooner, thus improving resource utilization. However, it requires knowledge of the critical warps. To address this problem, Lee et al. [2015] proposed a coordinated solution that identifies the critical warps at run time using instructions and stall cycles. Further, they proposed a greedy critical warp scheduling algorithm to accelerate the critical warps in the SMs. OWL provides a scheduling mechanism to reduce cache contention and to improve DRAM bank level parallelism. focus on reducing resource contention by providing a lazy thread block scheduling mechanism. They also proposed a block level CTA scheduling policy that allocates consecutive CTAs to the same SM to exploit cache locality. Their approach can also be integrated with our approach.
Improving GPU Performance through Memory Management
Several other approaches exploit the memory hierarchy to improve the performance of GPU applications. Li et al. [2015b] proposed compiler techniques to efficiently place data onto registers, scratchpad memory, and global memory by analyzing data access patterns. Sethia et al. [2015] proposed a scheduling policy that improves GPU performance by prioritizing the memory requests of a single warp when memory saturation occurs. Li et al. [2015a] provide a mechanism to handle the cache contention problem that occurs due to an increased number of resident threads in an SM. Their approach is an alternative to the earlier proposed thread throttling techniques [Rogers et al. 2012, 2013].
Problems with Warp Divergence
Other techniques improve GPU performance by handling warp divergence. Dynamic warp formation [Fung et al. 2007] addresses the limited thread level parallelism that is present due to branch divergence. It dynamically forms new warps based on the branch target condition. However, the performance of this approach is limited by the warp scheduling policy. Thread block compaction [Fung and Aamodt 2011] addresses a limitation of dynamic warp formation, namely that the newly formed warps may require more memory accesses. Their approach provides a solution by regrouping the new warps at the reconverging points. However, in their solution, warps need to wait for other warps to reach the divergent path. Anantpur and Govindarajan [2014] proposed a linearization technique to avoid the duplicate execution of instructions that occurs due to branch divergence in GPUs. Brunie et al. [2012] and Han and Abdelrahman [2011] provide hardware and software solutions to handle branch divergence in GPUs.
Miscellaneous
Warped pre-execution [Lee et al. 2016] accelerates a single warp by executing independent instructions when the warp is stalled due to a long latency instruction. It improves GPU performance by hiding the long latency cycles more effectively. Baskaran et al. [2008] proposed a compiler framework for optimizing memory access in affine loops. Huo et al. [2010] and Gutierrez et al. [2008] show that several applications are improved by using scratchpad memory instead of global memory.
CONCLUSIONS AND FUTURE WORK
In this paper, we propose architectural changes and compiler optimizations for sharing scratchpad memory effectively to address the underutilization of scratchpad memory in GPUs. Experiments with various benchmarks help us conclude that if the number of resident thread blocks launched by an application is limited by scratchpad availability (Table I), scratchpad sharing (with the compiler optimizations) improves the performance. On the other hand, for applications where the number of thread blocks is not limited by scratchpad availability (Table IV), the hardware changes do not negatively impact the run-time.
In the future, we would like to extend our work to integrate the register sharing approach [Jatala et al. 2016]. Value range analysis techniques [Harrison 1977; Quintao Pereira et al. 2013], typically employed for detecting buffer overflows, can be incorporated in our approach to refine the access ranges of shared scratchpad variables, thus helping to release shared scratchpad even earlier. We also need to study the impact of the hardware changes on power consumption, and find ways to minimize it.
