To support a massive number of parallel thread contexts, Graphics Processing Units (GPUs) use a huge register file, which is responsible for a large fraction of a GPU's total power and area. The conventional belief is that a large register file is inevitable for accommodating more parallel thread contexts, and that technology scaling makes it feasible to incorporate an ever-increasing register file size. In this paper, we demonstrate that the register file need not be large to accommodate more thread contexts. We first characterize the useful lifetime of a register and show that register lifetimes vary drastically across the registers allocated to a kernel. While some registers are alive for the entire duration of the kernel execution, others have a short lifespan. We propose GPU register file virtualization, which allows multiple warps to share physical registers. Since warps may be scheduled for execution at different points in time, we propose to proactively release dead registers from one warp and re-allocate them to a different warp that executes later in time, thereby reducing the needless demand for physical registers. By using register virtualization, we shrink the architected register space to a smaller physical register space. By under-provisioning the physical register file to be smaller than the architected register file, we reduce dynamic and static power consumption. We then develop a new register throttling mechanism to run applications that exceed the size of the under-provisioned register file without any deadlock. Our evaluation shows that, even after halving the architected register file size, applications run successfully with negligible performance overhead when using our proposed GPU register file virtualization.
INTRODUCTION
To enable massive thread-level parallelism, graphics processing units (GPUs) rely on large register files. The register file is the largest SRAM structure on the die and the third most power-hungry structure [31]. In a GPU, each warp has its own set of dedicated architected registers, indexed by the warp id, and each architected register has a corresponding physical register allocated in the register file [34]. Once a register is allocated, it is not released until the cooperative thread array (CTA) that the warp is part of completes its execution [27]. This policy simplifies the register management hardware, but at the cost of significant wasted power and underutilized registers. As power continues to be an impediment to all computing system designs, it is imperative to look at solutions that reduce GPU register file power.
The goal of this paper is to propose a GPU register file management approach that roots out the inefficiencies in register file usage. This work is motivated by our observation that register lifetime, defined as the time from when a new value is written into the register until the last use of that register value, varies by orders of magnitude across the registers in a typical GPU application. The number of live registers at any point in time is far smaller than the total number of registers used in that kernel. Since warps may be scheduled for execution at different points in time, we propose to proactively release dead registers from one warp and reallocate them to a different warp that executes later in time, thereby reducing the needless demand for physical registers. CPU designers grappled with the problem of having too few architected registers and with how to rename away false dependencies to improve instruction-level parallelism. Register renaming was invented to address this concern, and a plethora of enhancements were proposed to improve renaming efficiency [36, 38, 39, 11, 18, 35, 25, 26, 37, 40, 19]. We adapt register renaming to achieve the opposite effect of CPU renaming, namely reducing the number of physical registers without reducing the architected register space. We call this mechanism GPU register file virtualization.
The main contributions of this paper are:
• We propose a new register management mechanism that uses compiler-generated register lifetime information to proactively release registers from one warp and opportunistically re-allocate those registers to other warps. We show that inter-warp register sharing reduces dynamic and static power consumption.
• Since inter-warp register sharing reduces physical register pressure, we exploit it to design a GPU with half the number of physical registers, while transparently allowing applications to use the full architected register space. As the total register file size in a GPU chip is comparable to that of a shared last-level cache in a multi-core CPU [3], halving the register file has significant economic and yield impact [45].
RELATED WORK AND CHALLENGES OF REGISTER RENAMING IN GPU EXECUTION MODEL
Hardware-assisted eager register release in CPUs: Moudgill et al. [40] proposed a hardware-only method to release dead registers early. The scheme detects how many instructions are going to read a particular architected register before that register is redefined. This value is dynamically computed and stored as a counter associated with each physical register. After each use the counter is decremented, and when it reaches zero the physical register is released. Several approaches used this scheme to reduce power [39], improve reliability [18], and implement fast checkpointing [38, 11]. Associating a counter with each physical register has very high overhead in a GPU, since the GPU physical register file is much larger: 64K registers in a Maxwell GPU [6] versus about 400 registers in Intel Haswell [3]. Furthermore, counter-based reclaiming is ineffective in a GPU due to the smaller instruction window size per warp.
Hardware-software cooperative register release in CPUs: Martin et al. [37], Jones et al. [25] and Lo et al. [35] proposed software-hardware cooperative register renaming. Martin et al. [37] use dead value information (DVI) instructions to release the registers that are dead at the boundary of procedure calls. In Jones' approach [25], the compiler identifies single-use registers, which are read only once during the execution, and then marks the last-use instructions of those registers. In Lo's method [35], the compiler marks the instruction that last uses a register value so that the CPU can release the register as soon as that instruction is executed. Given that our work focuses on GPUs, there are new challenges and opportunities when renaming is used in this context. In a CPU, only one path of a diverged flow is executed, while in a GPU warps traverse all the possible flows sequentially. Hence, GPU register release points differ from those in CPUs, which our work handles explicitly. Furthermore, [35] does not consider how to reduce the overhead of the release instructions. In our work, we exploit the fact that warps in a GPU execute the same code segment. Hence, the register release metadata instructions that are shared across the warps can be cached to effectively reduce the fetch and decode overhead of these instructions. We also show how early release of physical registers can be used to design a GPU with fewer physical registers without curtailing thread-level parallelism. We further curtail static power consumption by proposing to gate unused register subarrays.
Other orthogonal approaches to improve register efficiency in CPUs: Previous studies leveraged the properties of data stored in registers to improve register file efficiency. Jourdan et al. [26] exploited value redundancy in the register file to map several logical registers to the same physical register, thereby saving physical register space. Ergin et al. [19] proposed register packing, inspired by the observation that a large percentage of instructions produce narrow-width results. Lozano and Gao [36] used register renaming to avoid unnecessary commits in an out-of-order CPU by checking whether the last-use instruction has already consumed the data.
GPU register file renaming: To the best of our knowledge, there has not been any study that explored the benefits of GPU register file virtualization. An NVIDIA patent [46] proposed a hardware-only dynamic register allocation and deallocation approach. Once a register space is allocated, the space is deallocated when a new value is written to the architected register. Their approach does not use any compiler knowledge or lifetime analysis. By using register lifetime, we can provide more aggressive register release that leads to lower register demand. Moreover, they did not provide any detailed evaluation.
Power efficiency in GPU: To reduce the dynamic power of a GPU register file, Gebhart et al. [20] proposed to use a small register file cache. In a following study [21], the authors enhanced the register file cache by adding two small register files. They store short-lived registers in one of the two small register files and long-lived registers in a large main register file (MRF). They define a strand as a code segment in which all dependencies on long-latency instructions are from operations issued in a previous strand, and they define a register as long-lived if its lifetime spans across strands. In our work, the compiler-generated information indicates when a register is safe to be released. The information is used by the hardware to proactively release registers from one warp and reassign them to a different warp. Rather than relying on a multi-level register file design, we use a traditional one-level register file design. By exploiting warp scheduling differences, we allow regis-
BACKGROUND
Metadata instruction: As power efficiency becomes a top design priority, vendors have developed novel approaches to convey compile-time information to hardware to improve power efficiency. NVIDIA's Kepler architecture conveys compile-time information to the scoreboard logic to track data dependencies. Instead of tracking dependencies at runtime, Kepler relies on the compiler to generate the dependency information, and this information is conveyed to the hardware using metadata instructions that are embedded in the code [43]. A recent study found that one metadata instruction is added per seven instructions [30], and that the format and operation of the metadata instruction is similar to the explicit-dependence lookahead instruction used in the Tera computer system [12]. The information contained in the metadata instruction indicates the number of cycles that the seven following instructions should wait until their dependencies are resolved. These metadata instructions are fetched and decoded to generate control information for upcoming instructions. To pre-process the metadata instruction, the fetch stage in Kepler is partitioned into two separate stages: Sched. info and Select. The Sched. info stage pre-processes the metadata instruction and the Select stage selects an instruction to issue according to the metadata. In this paper, we leverage metadata instructions to interface compiler-generated information with the hardware.
GPU register file underutilization: The GPU register file is not fully utilized in some benchmarks due to limited parallelism or small code footprint. However, what we observed in this study is that even among the allocated registers only a fraction are live registers. We define a live register as a register that stores a value that may be consumed by any of the future instructions until the end of program execution.
Figure 1 shows the fraction of live registers over all the compiler allocated registers for a sample 10K cycle execution window for a representative set of six applications (more experimental details are presented in Section 9). X-axis denotes time in cycles and Y-axis is the fraction of all the allocated registers that are alive at a given time. Except VectorAdd, the remaining five applications barely use half of the allocated registers to carry live data. In case of VectorAdd 100% of the allocated registers are live around 2K cycle mark due to the short program code and relatively small number of registers used. Note that while this data is shown for 10K cycles execution window for clarity, there is no significant change in the fraction of live registers used over the entire application execution window. If one can release the dead registers from one warp and reuse them in another warp then it is possible to build a GPU with a smaller physical register file while transparently allowing the applications to utilize the entire architected register file space.
It is worth noting that when an application is optimally compiled, all of the allocated registers will be alive at some point during the execution. However, because not all registers are alive during the entire execution time, each warp's register demand tends to be lower than the peak demand during a large fraction of execution time. Since a GPU executes multiple warps concurrently, the gap between the accumulated architected register file space demand and the actual live register file size necessary to execute these warps grows significantly. In the next section, we propose a register management scheme that exploits register lifetime information and warp scheduling time differences to reduce peak demand and to improve utilization.
GPU REGISTER LIFETIME
In this section, we analyze GPU register lifetime characteristics. Figure 2(a) shows three representative register usage patterns seen in GPU applications. The pattern is taken from the matrixMul benchmark of the CUDA SDK used in our experimental evaluation. The corresponding assembly code is shown in Figure 3. We captured the lifetime of three registers, r0, r1 and r3. The X-axis represents time, and the Y-axis represents register liveness. A dead value is represented with a Y value of zero and a live register is represented as the next step up on the Y-axis.
In this program r1 is written at the beginning of the program and is read at the end of the program execution. As r1's value is read at the end of the program, the register is alive for the entire program duration. Clearly r1 is a long-lived register. On the other hand, r0 is actively produced and consumed within a loop. Each group of spikes is a loop iteration, and a total of five loop iterations are shown in the figure. On each entry into the loop, r0's value is loaded from global memory and then a series of read-write sequences is performed. This is an example of a register that has multiple lifetimes, each of which is relatively short. In this figure, r3 has the shortest lifetime. It is used only for a short time window at the beginning of the program and at the end of the program. After its last consumption before the loop, r3 is never used within the loop and is redefined only after the loop.
It is worth noting that the register space waste due to short lived registers like r3 cannot be eliminated by compiler optimization because there is a short time window when all three registers are active concurrently. Figure 3 is the compiler generated kernel code from which the register usage patterns shown in Figure 2 are captured. The bidirectional arrows in the figure indicate the lifetimes of the three registers. In two time windows where the code between the offset at 0x90 and 0x108 and between the offset at 0x308 and 0x378 is executed, the three registers are concurrently alive. Therefore, the compiler cannot simply assign r0 and r3 to a single architected register.
As shown in this example, it is unavoidable to have short-lived registers with overlapping lifetimes. Even in a single-threaded application, wasting a single register is an inefficient use of space. But in a GPU context, all the threads' r3 registers have the same lifetime pattern. For instance, matrixMul assigns 6 concurrent CTAs per SM and each CTA is executed by 8 warps. Therefore, a total of 1536 (6 × 8 × 32) copies of register r3 are dead for a significant fraction of the kernel's execution time.
GPU REGISTER FILE VIRTUALIZATION
We propose a new register management method that effectively reduces the wasted register space. The key idea is to share the register space across warps by flexibly mapping architected registers to the physical register space. In GPUs, warps are scheduled to execute the same code at different points in time. For instance, with the two-level scheduler [20] used in our baseline architecture, the scheduling time difference among warps reaches several hundred cycles, because the warps in the pending queue can be scheduled only when the warps in the ready queue encounter long-latency memory operations or pipeline stalls. Therefore, when a register's lifetime ends in one warp, that register space can be allocated to a different warp that is beginning a new register life cycle.
Figure 2(b) shows an example of register reuse. Warp one and three execute the same code but are scheduled in different points in time. Therefore, the short lived register r3 is used by warp one and warp three in different time slots. If warp one releases r3 right after the end of its first lifetime (illustrated as white rectangle), warp three can reuse the space for its own r3 storage.
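The effect of staggered warp scheduling on peak register demand can be illustrated with a small numerical sketch. The per-warp liveness trace below is hypothetical (it is not measured data from the paper), as are the function name `peak_physical_demand` and the fixed scheduling offset between warps. The sketch only assumes that a warp holds physical registers for its live values and nothing else.

```python
def peak_physical_demand(live_trace, num_warps, stagger):
    """Peak number of simultaneously live registers across all warps.

    live_trace[t] is the number of live registers of ONE warp at its local
    step t (hypothetical data). Warp w starts `stagger` steps after warp w-1.
    A warp that has not started or has finished holds no registers.
    """
    horizon = len(live_trace) + stagger * (num_warps - 1)
    peak = 0
    for t in range(horizon):
        total = 0
        for w in range(num_warps):
            local = t - w * stagger
            if 0 <= local < len(live_trace):
                total += live_trace[local]
        peak = max(peak, total)
    return peak


# Hypothetical trace: three registers are live only in short windows.
trace = [1, 3, 3, 1, 1, 2, 2, 1, 1, 3, 3, 1]
num_warps = 4

static_allocation = max(trace) * num_warps          # one slot per architected register
lockstep_peak = peak_physical_demand(trace, num_warps, stagger=0)
staggered_peak = peak_physical_demand(trace, num_warps, stagger=3)
print(static_allocation, lockstep_peak, staggered_peak)
```

With this particular trace, the staggered schedule needs fewer physical registers at its peak than the static per-warp allocation, which is the gap that inter-warp sharing tries to recover. The size of the gap depends entirely on the trace and the offsets chosen.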
To enable register sharing across warps, it is necessary to separate architected registers from the physical register space they occupy. CPUs have used register renaming to avoid false data dependency by mapping an architected register's multiple value instances to distinct physical registers. Thus, renaming in CPUs is typically used to enable a larger physical register file than the architected register file. However, in our approach we use renaming to allow multiple architected registers to share a single physical register. This sharing is essentially what virtualization enables. Hence, we call our approach GPU register file virtualization. We will rely on compile-time lifetime analysis to identify dead registers in the code. In the following section we describe how the register lifetime information is collected by simply extending the existing compiler algorithm and how the information is conveyed to the hardware using the metadata instructions.
COMPILER SUPPORT

Register Lifetime Analysis
Intra-Basic Block: The register management logic has to track register lifetime, and it can release a register only when that register is guaranteed to be dead. We rely on the compiler to statically identify the start and end points of the life cycle of each register. Figure 4 shows five representative code examples that should be considered by the compiler in register lifetime analysis. Each rectangle represents a basic block. In the first scenario, shown in Figure 4(a), an intra-basic block analysis can be done trivially to determine lifetime. Whenever a register is used as a destination operand of an instruction, the previous instruction that uses the register as a source operand can release the register after reading the value. We add one metadata flag bit per instruction operand to indicate whether that register can be released after that read operation. As CUDA instructions have a maximum of three operands, three bits are used per instruction, and these metadata bits are called the per-instruction release flag (pir). When a bit is 1, the corresponding operand storage register can be released after it is read by the current instruction. More details about these metadata bits and their organization are described shortly.
Diverged flows: In the presence of branch divergence, the register release information must be set conservatively. Figure 4(b) and (c) show two scenarios. In both cases, the register can be safely released at the reconvergence point because of the warps' lock-step execution. In the example of Figure 4(b), the two branch paths are traversed by a warp sequentially. If the release information is put in each flow of the branch, the flow that is executed first will release the registers, and the threads within the same warp that execute the other flow may get incorrect results if they use any of the released registers. Unlike in the intra-basic block case, here the register release is not associated with the actual last-use instruction of the register. Instead, it is associated with an instruction that happens to start at the reconvergence point. It is also possible that multiple registers may need to be released at the reconvergence point. Hence, rather than adding metadata to an existing instruction, we add a new per-branch release flag (pbr). The flag contains the list of architected register IDs that can be released at the start of the reconvergence block. The details and overhead are discussed in Section 6.2.
Loop: Figure 4(d) shows a loop where a register produced in one iteration is used in another iteration. In this scenario, there is no way to statically determine the last use, and hence the compiler can release the register only when all iterations are complete. On the other hand, if there is no loop-carried dependence on registers across loop iterations, then it is possible to release the register after the last consumption within the loop body, as shown in Figure 4(e).
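For the intra-basic block case, the last-use marking can be sketched as a single backward liveness pass. This is a minimal illustration under stated assumptions, not the compiler pass used in this work: the instruction representation `(dst, [srcs])`, the function name `release_flags`, and the `live_out` set (registers still needed after the block) are all hypothetical. The sketch conservatively never releases a source that is also the destination of the same instruction, and it does not cover diverged flows or loops, which need the per-branch handling described above.

```python
MAX_OPERANDS = 3  # the text states at most three operands per instruction


def release_flags(block, live_out):
    """Compute per-instruction release bits for one basic block.

    block:    list of (dst, [srcs]) tuples in program order (hypothetical IR).
    live_out: registers whose values are still needed after the block.
    Returns one 3-tuple of bits per instruction; bit j is 1 when source
    operand j holds a value that is dead once this instruction has read it.
    """
    live = set(live_out)                 # registers live AFTER the current instruction
    flags = [None] * len(block)
    for i in range(len(block) - 1, -1, -1):
        dst, srcs = block[i]
        bits = []
        for pos, src in enumerate(srcs):
            dead_after = src not in live and src != dst
            first_occurrence = src not in srcs[:pos]   # flag a repeated operand once
            bits.append(1 if dead_after and first_occurrence else 0)
        bits += [0] * (MAX_OPERANDS - len(bits))       # unused operand slots
        flags[i] = tuple(bits)
        if dst is not None:
            live.discard(dst)            # the old value is not needed before the write
        live.update(srcs)                # sources are live before this instruction
    return flags


# Hypothetical three-instruction block.
block = [
    ("r2", ["r0", "r5"]),   # add r2, r0, r5  (last use of r0)
    ("r3", ["r2", "r5"]),   # mul r3, r2, r5  (last use of r2 and r5)
    ("r0", ["r3", "r1"]),   # add r0, r3, r1  (last use of r3; r1 is live-out)
]
flags = release_flags(block, live_out={"r0", "r1"})
print(flags)
```

In the first instruction the bit for r0 is set and the bit for r5 is clear, because r5 is read again by the next instruction.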
Release Flag Generation
Per-instruction Release Flag: The register lifetime information is generated at compile time and embedded in the code. As mentioned earlier, each instruction has a three-bit per-instruction release flag, where each bit indicates one of the maximum of three source operands that can be released. If the bit is 1, the corresponding operand register can be released after it is read by the instruction. But embedding a 3-bit pir in each instruction requires significant modification on the instruction fetch and cache access logic. To avoid this concern, we use a 64-bit flag-set meta instruction that is present at the beginning of each basic block as shown in Figure 5 (a). The selection of 64-bit flag is to accommodate the fact that CUDA code is already 64-bit aligned.
To keep the metadata instruction compliant with the existing CUDA instruction set, which uses a 10-bit opcode, we simply reserve a 10-bit value as the register release opcode and use the remaining 54 bits to store 18 three-bit flags that cover 18 consecutive instructions within the basic block. If a basic block is longer than 18 instructions, a metadata instruction is inserted every 18 instructions. If the basic block has fewer than 18 instructions, some of the flag bits are simply unused. Note that the 10-bit register release opcode is split into two sets of four and six bits to follow the Fermi instruction encoding format [1, 42].
In the example of Figure 5(a), the pir's first three bits represent the release information for each of the input operands of the first add instruction. Let us assume that, according to the register lifetime analysis, r0 is determined to be dead after the execution of the add instruction. Since r0 is the first input operand, the corresponding pir flag bit is set to one. The second register operand r5 is still alive and hence the corresponding flag is set to zero. There is no third input operand for the add instruction and hence the corresponding bit in the pir is a don't-care.
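The 64-bit layout of a pir (10-bit opcode plus 18 three-bit flags) can be sketched as simple bit packing. Several details here are assumptions made for illustration only: the opcode value `PIR_OPCODE` is hypothetical, the opcode is placed contiguously in the top 10 bits rather than split into four and six bits as in the Fermi encoding, and flags for the first instruction are placed in the most significant flag field.

```python
PIR_OPCODE = 0x3FF       # hypothetical reserved 10-bit opcode value
FLAGS_PER_PIR = 18       # 18 instructions x 3 bits = 54 bits
FLAG_BITS = 54


def pack_pir(flags):
    """Pack up to 18 (bit0, bit1, bit2) release flags into one 64-bit word."""
    if len(flags) > FLAGS_PER_PIR:
        raise ValueError("a single pir covers at most 18 instructions")
    word = PIR_OPCODE << FLAG_BITS
    for i, (a, b, c) in enumerate(flags):
        field = (a << 2) | (b << 1) | c
        word |= field << (FLAG_BITS - 3 * (i + 1))
    return word                           # unused trailing fields stay zero


def unpack_pir(word):
    """Return the 18 three-bit flags stored in a pir word."""
    if word >> FLAG_BITS != PIR_OPCODE:
        raise ValueError("not a pir instruction")
    out = []
    for i in range(FLAGS_PER_PIR):
        field = (word >> (FLAG_BITS - 3 * (i + 1))) & 0b111
        out.append(((field >> 2) & 1, (field >> 1) & 1, field & 1))
    return out


# First instruction releases its first operand (r0 in the Figure 5(a) example).
word = pack_pir([(1, 0, 0), (0, 1, 0)])
decoded = unpack_pir(word)
print(hex(word), decoded[:2])
```

A basic block longer than 18 instructions would call `pack_pir` once per group of 18 instructions, matching the insertion rule described above.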
Per-branch Release Flag: At diverged flows, we perform a conservative release. Registers that are referenced across multiple flows or loop iterations are released only when the diverged flows have reconverged. At the reconvergence point, a pbr is added. As shown in Figure 5(b), the format is similar to that of the pir. The only difference is that every six bits represent the number of a register to release. Note that each thread in Fermi can use up to 63 registers, which can be identified by six bits. A total of nine registers can be covered by a pbr. If more than nine registers are released, more pbrs are added. However, according to our evaluation, the average number of registers released by a pbr is just 2.
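The pbr layout (10-bit opcode plus nine six-bit register numbers) can be sketched in the same way. The text does not say how unused slots are marked, so this sketch assumes a sentinel: register number 63 is treated as "no register", which relies on the assumption that the 63 usable registers are numbered 0 to 62. The opcode value `PBR_OPCODE`, the contiguous opcode placement, and the function names are likewise hypothetical.

```python
PBR_OPCODE = 0x3FE       # hypothetical reserved 10-bit opcode value
REGS_PER_PBR = 9         # 9 registers x 6 bits = 54 bits
NO_REG = 63              # assumed sentinel for an unused slot
FIELD_BITS = 54


def pack_pbrs(regs):
    """Pack a list of architected register ids into as many pbr words as needed."""
    words = []
    for start in range(0, len(regs), REGS_PER_PBR):
        chunk = list(regs[start:start + REGS_PER_PBR])
        chunk += [NO_REG] * (REGS_PER_PBR - len(chunk))
        word = PBR_OPCODE << FIELD_BITS
        for i, reg in enumerate(chunk):
            if not 0 <= reg <= NO_REG:
                raise ValueError("register id must fit in six bits")
            word |= reg << (FIELD_BITS - 6 * (i + 1))
        words.append(word)
    return words


def unpack_pbr(word):
    """Return the register ids that a pbr word asks the hardware to release."""
    if word >> FIELD_BITS != PBR_OPCODE:
        raise ValueError("not a pbr instruction")
    ids = [(word >> (FIELD_BITS - 6 * (i + 1))) & 0x3F for i in range(REGS_PER_PBR)]
    return [r for r in ids if r != NO_REG]


single = pack_pbrs([3, 7])              # typical case: about two registers
double = pack_pbrs(list(range(11)))     # more than nine registers needs two pbrs
print(unpack_pbr(single[0]), [unpack_pbr(w) for w in double])
```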
ARCHITECTURE SUPPORT
Renaming Table
To enable register reallocation, each architected register is mapped to a physical register whenever it is written. When the architected register value is no longer used, the mapping is released. The release point is provided as part of either the pir or the pbr. Once a physical register is released, it is marked as available and can then be remapped to another architected register in the future. The physical register availability marking is explained shortly.
To maintain the mapping information, a register renaming table is added to each SM. Since registers are allocated and released per warp, the renaming table is operated per warp. Each renaming table is indexed by combining warp id and architected register id and contains the corresponding physical register id. In our baseline design, each SM has 128KB register file, which holds a total of 1024 physical registers (32×4-
The 128KB register file in each SM is divided into four banks. Each bank is further divided into eight sub-banks. Each sub-bank provides data to a four-lane SIMT cluster. Thus a single warp instruction (with 32 SIMT lanes) may access all eight sub-banks to read one input operand. Since the CUDA ISA uses a maximum of three operands, each warp instruction may access up to three main register banks. If any of the operands used by an instruction are in the same register bank, a bank conflict occurs. GPU compilers strive to avoid register bank conflicts by distributing the input operands across the four main register banks. Kim et al. [27] noted that the compiler is responsible for reducing register bank conflicts. Lai and Seznec [30] proposed compiler-level optimizations that improve throughput by reducing register bank conflicts. Hence, it is prudent not to ignore the compiler-allocated bank information while renaming. To preserve the compiler-assigned bank information, we restrict register renaming to find a register within the same bank as the one that the compiler intended to assign.
To reflect this restriction, we use four 256-bit physical register availability flag vectors, one 256-bit vector per register bank. In order to rename an architected register, we first identify the bank that the architected register maps to in the absence of renaming, and then restrict the search for an available register to that bank.
Reducing renaming
To determine which registers are candidates for renaming, we calculate the estimated register value lifetime at compile time. The value lifetime can be calculated at compile time by counting the number of instructions between the write point and the next release point in the code. The registers are then sorted by lifetime. Once the registers are sorted, the compiler picks only the top 1024B / (10 bits × #warps per CTA × #max concurrent CTAs × 4B) registers for renaming. The compiler inserts pir and pbr flags to release only these selected registers, while non-renamed architected registers are never released. The renaming-exempted registers are assigned the lowest N register ids, and N is given to the hardware so that the hardware stores mapping information only for the registers with an id higher than N. The lowest N registers are directly mapped to the lowest N physical registers, such that each warp's renaming-exempted registers are mapped to N physical registers starting at register id N × warp id.
Renaming table organization: The renaming table consists of four banks so that the operands can look up their physical register indices concurrently. When there is a bank conflict, the name lookup may be serialized. The pipeline modification to access the renaming table is illustrated in Figure 6. According to our simulation of this register renaming table, the access latency of the optimized 1KB renaming table is 0.22ns. In our evaluations we conservatively assume that this renaming operation cannot be absorbed in the existing pipeline delays, and hence one extra cycle of pipeline latency may be incurred by the renaming process.
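The bank-restricted renaming described in this subsection can be sketched as a mapping table plus one in-use vector per bank. This is an illustrative model under stated assumptions rather than the hardware organization itself: the class and method names are hypothetical, the bank of an architected register is assumed to be `arch_reg % 4` (the actual compiler bank assignment is not specified here), and the first free entry in a bank is chosen. Bits are set on allocation and cleared on release.

```python
NUM_BANKS = 4
REGS_PER_BANK = 256      # four 256-bit vectors cover 1024 physical registers


class RenameTable:
    """Per-SM renaming table indexed by (warp id, architected register id)."""

    def __init__(self, regs_per_bank=REGS_PER_BANK):
        # in_use[bank][i] is True while physical register i of that bank is mapped.
        self.in_use = [[False] * regs_per_bank for _ in range(NUM_BANKS)]
        self.mapping = {}

    @staticmethod
    def bank_of(arch_reg):
        # Assumption: bank the register would occupy without renaming.
        return arch_reg % NUM_BANKS

    def write(self, warp, arch_reg):
        """Map an architected register on a write; return (bank, index) or None."""
        key = (warp, arch_reg)
        if key in self.mapping:
            return self.mapping[key]          # already mapped, reuse the slot
        bank = self.bank_of(arch_reg)
        for idx, busy in enumerate(self.in_use[bank]):
            if not busy:
                self.in_use[bank][idx] = True
                self.mapping[key] = (bank, idx)
                return (bank, idx)
        return None                           # bank is full, the warp must wait

    def release(self, warp, arch_reg):
        """Drop the mapping when a pir or pbr marks the register dead."""
        phys = self.mapping.pop((warp, arch_reg), None)
        if phys is not None:
            bank, idx = phys
            self.in_use[bank][idx] = False


# Tiny configuration (two registers per bank) to show reuse across warps.
table = RenameTable(regs_per_bank=2)
first = table.write(warp=0, arch_reg=4)
second = table.write(warp=1, arch_reg=4)
blocked = table.write(warp=2, arch_reg=4)     # bank 0 is full
table.release(warp=0, arch_reg=4)
reused = table.write(warp=2, arch_reg=4)      # takes warp 0's old slot
print(first, second, blocked, reused)
```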
Flag Instruction Decoding
To provide the register lifetime information, compiler adds release flag metadata instructions. As noted in Section 3, modern GPUs support metadata instruction decoding for different power efficiency considerations. We rely on the same mechanisms to decode the release flag information. The 10-bit opcode is used to first determine if the instruction type is pir or pbr. Once this determination is made the remaining 54 bits are simply used to determine which registers are dead when each of the following 18 instructions complete their execution.
Even though the flag instructions can simply be fetched and decoded as normal metadata instructions, we use a Release Flag Cache to store the 54-bit flags of pir instructions, in order to reduce the potential power overhead of repeatedly fetching flag instructions from the regular instruction cache. Figure 6 shows the interaction between the register renaming table and the release flag cache. The release flag cache is a direct-mapped cache and is shared across the warps. It is indexed by the PC of the pir instruction. Multiple warps that are part of a single CTA all execute the same kernel. Since warps within a CTA are scheduled closely in time by most existing warp schedulers, such as round-robin and two-level schedulers, the sequences of instructions executed by different warps exhibit strong temporal locality. Thus, not every warp needs to maintain an exclusive copy of the same pir. We add release flag cache access logic to the fetch stage that fetches pir instructions from the instruction cache only when there is a miss in the release flag cache. Every cycle the fetch stage probes the release flag cache, and if the PC hits in the release flag cache, the pir instruction is not fetched from the regular instruction cache and the program counter is incremented to fetch the next instruction. If the instruction fetch width is larger than one instruction, the pir may end up being fetched anyway alongside the regular instructions. Nevertheless, the redundantly fetched pir can be easily detected and is not fed to the decoder logic. In this scenario, while the fetch bandwidth is not altered, at least the decoding cost can be avoided.
When a regular instruction is fetched in select stage, the corresponding three bit flag is also fetched from the release flag cache to determine if any of the operand registers are dead after the current instruction reads the register. The release flag cache is maintained as a direct mapped cache and whenever a PC misses in the release flag cache then the instruction is fetched from the regular instruction cache and decoded. At the decode stage if the instruction was determined to be pir then 54 bit release flags are stored in the cache by replacing the existing entry.
To determine the appropriate number of entries in a release flag cache, we measured the dynamic code increase due to pir and pbr instruction fetching while varying the size of the release flag cache. According to our results, a release flag cache of 10 entries is sufficient to capture most of the metadata instruction locality. As each release flag is 54 bits long, the total release flag cache size is 68B.
The benefits of caching pbr are not significant as these instructions do not control the release of individual register operands of other instructions. Rather they simply release the specified registers. When pbr is fetched, the register mapping table is looked up and the mapping information for each released register specified in the pbr is removed and the corresponding bit of the physical register availability flag is cleared.
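The behavior of the release flag cache can be sketched as a small direct-mapped structure indexed by the PC of the pir. The sketch below is illustrative only: the class name, the index function (PC divided by the 8-byte instruction size, modulo the number of entries) and the hit/miss counters are assumptions, and the fetch-stage interaction is reduced to a lookup followed by a fill on a miss.

```python
FLAG_MASK = (1 << 54) - 1


class ReleaseFlagCache:
    """Direct-mapped cache of 54-bit pir flags, shared by all warps."""

    def __init__(self, entries=10):
        self.entries = entries
        self.tags = [None] * entries
        self.flags = [0] * entries
        self.hits = 0
        self.misses = 0

    def _index(self, pc):
        # Assumed index function: 64-bit instructions are 8-byte aligned.
        return (pc // 8) % self.entries

    def lookup(self, pc):
        """Return the cached flags for the pir at `pc`, or None on a miss."""
        i = self._index(pc)
        if self.tags[i] == pc:
            self.hits += 1
            return self.flags[i]
        self.misses += 1
        return None

    def fill(self, pc, flags54):
        """Install flags after the pir was fetched and decoded, replacing the old entry."""
        i = self._index(pc)
        self.tags[i] = pc
        self.flags[i] = flags54 & FLAG_MASK


# Eight warps of one CTA reach the same pir at PC 0x40 one after another.
cache = ReleaseFlagCache()
for warp in range(8):
    if cache.lookup(0x40) is None:
        cache.fill(0x40, 0b100_010)      # hypothetical decoded flag bits
print(cache.hits, cache.misses)

# Storage estimate for the flag payload only: 10 entries x 54 bits.
payload_bytes = 10 * 54 / 8
print(payload_bytes)
```

Only the first warp pays for the fetch and decode of the pir in this example. The payload estimate of 67.5 bytes is consistent with the 68B figure quoted above, and it excludes any tag storage, which the sketch does not attempt to size.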
USE-CASES
Register File Under-provisioning
Using renaming, we explore a GPU design that under-provisions the physical register file to be smaller than the architecturally defined register file size needed to support multiple concurrent thread contexts. In particular, our results show that in many applications nearly half of the registers allocated by the compiler are dead at any given time. Hence, we propose to design a GPU, called GPU-Shrink, with only half as many physical registers as the baseline; namely, GPU-Shrink has only a 64KB register file per SM, compared to the 128KB register file used in our baseline.
Before evaluating the full benefits of GPU-shrink, we first evaluated the power savings from just shrinking the register file. Figure 7 plots the benefits of shrinking the register file in terms of both dynamic and static power. These results were generated using GPUWattch [31], starting with the 128KB banked register file organization as the baseline. Halving the register file reduces dynamic power consumption by 20% and the overall power (leakage and dynamic) by 30%. These results show that shrinking the register file has significant power benefits. However, naively shrinking the register file can lead to significant performance losses. In particular, if the compiler is forced to spill and fill registers to/from memory, the potential power savings will be outweighed by the latency penalties of accessing memory. Furthermore, the naive approach requires applications to be recompiled to use fewer registers. In the following sections we describe how GPU-shrink alleviates these potential disadvantages.
First, the unique aspect of the GPU-shrink design is that it is transparent to the application/compiler layer. The compiler is free to use all the registers in the baseline without any restrictions. The register management hardware simply renames registers onto the reduced physical register file. Hence, the availability vector shrinks from 1024 registers to 512. If the cumulative number of live registers across all the CTAs concurrently running on an SM is less than 512, then there is no application-perceived difference between GPU-shrink and a regular GPU with renaming.
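The 1024 and 512 register counts follow from the register file capacity if one physical register holds one 32-bit value for each of the 32 threads in a warp. Both the warp width and the register width are assumptions here (they are the usual values for this GPU generation but are not stated in this section), and the function name is hypothetical.

```python
WARP_SIZE = 32      # assumption: threads per warp
BYTES_PER_REG = 4   # assumption: 32-bit registers


def warp_registers(rf_bytes):
    """Number of warp-wide physical registers in a register file of rf_bytes."""
    return rf_bytes // (WARP_SIZE * BYTES_PER_REG)


baseline_regs = warp_registers(128 * 1024)  # 128KB baseline register file
shrink_regs = warp_registers(64 * 1024)     # 64KB GPU-shrink register file
```

Under these assumptions the baseline evaluates to 1024 registers and GPU-shrink to 512, matching the availability vector sizes quoted above.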
If the live register demand from all CTAs exceeds 512, no physical registers remain available once all of them are assigned. To guarantee forward progress, the register renaming logic must reserve enough registers to allow at least one CTA to make forward progress. When faced with high register pressure, the warp scheduler throttles the register demand by allowing warps from selected CTAs to finish while holding back warps from other CTAs. We use the following implementation to guarantee progress. The maximum number of registers required for executing a CTA can be obtained from the GPU compiler. For instance, if N is the number of registers needed for each warp and a CTA has M warps, then the maximum number of registers required per CTA is C = N × M. The warp scheduler keeps track of the number of registers already assigned to each CTA using a per-CTA register balance counter; a total of eight counters are needed in our baseline architecture since at most eight CTAs can execute concurrently in an SM. If k_i is the number of registers assigned to CTA_i at a given time, then counter i stores C − k_i, the number of additional registers that CTA_i may still need in the worst case. Before the warp scheduler selects a warp, it checks the number of available physical registers. If that number is greater than the minimum of all C − k_i counter values, the warp is allowed to continue. Otherwise, the scheduler recognizes that the available physical registers may be too few to let even one CTA complete its execution. In this case the scheduler picks the CTA with the minimum register balance counter (breaking ties arbitrarily) and allows only warps from that CTA to execute; as this CTA releases registers, the other CTAs can start issuing again.
Intuitively, the assumption is that a CTA that has already occupied most registers will finish soon or has an opportunity to release more registers than other CTAs.
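The throttling check described above can be sketched in a few lines. This is a minimal software model of the decision only, assuming all CTAs come from the same kernel and therefore share one value of C; the function name and the dictionary representation of the balance counters are hypothetical.

```python
def ctas_allowed_to_issue(free_regs, max_regs_per_cta, assigned):
    """Return the set of CTA ids whose warps may be scheduled.

    free_regs        -- number of currently available physical registers
    max_regs_per_cta -- C = N * M, the worst-case demand of one CTA
    assigned         -- dict mapping CTA id -> k_i (registers it holds now)
    """
    # Register balance counter: worst-case registers each CTA may still need.
    balance = {cta: max_regs_per_cta - k for cta, k in assigned.items()}
    # min() returns the first minimum it sees, i.e. ties are broken arbitrarily.
    tightest = min(balance, key=balance.get)
    if free_regs > balance[tightest]:
        return set(assigned)   # enough headroom: no throttling
    return {tightest}          # throttle: only the closest-to-done CTA runs
```

For example, with C = 200, three CTAs holding 190, 170, and 148 registers, and 4 free registers, the smallest balance is 10, so only the first CTA is allowed to issue until it releases registers.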
The above approach avoids deadlocks except for one extremely rare corner case. CTA-level throttling works effectively unless a kernel assigns only one CTA, and that CTA requires more live registers than are available in GPU-shrink. Even though this case occurs very seldom (none of our benchmark applications encountered it), it is theoretically possible. In this worst-case scenario, we rely on conventional register spilling. When there are not enough registers while running a large CTA, the scheduler automatically issues special spill instructions to a system-reserved memory location. To spill registers, the warp scheduler selects warps in the pending queue. Note that the registers in a warp can be spilled using coalesced memory accesses, where the registers associated with all the threads in a warp are spilled by one memory operation per architected register. While the pending warps' registers are held in memory, the active warps proceed with their execution and release as many registers as possible. Eventually, when more registers have been released than a pending warp requires, the scheduler loads that warp's registers back from memory into physical registers and restarts the warp.
Static Power Reduction
The register lifetime analysis allows us to power gate all the dead registers. We explored a conventional subarray-level power gating approach [32] in the evaluation section. Subarray-level power gating shuts down a whole subarray when there is no active register in the subarray. Figure 8 shows an example of subarray-level power gating. The two large blocks in Figure 8(a) represent two register files, one without register renaming and the other with register renaming enabled, and we show four columns in each block to denote four register banks. The white entries are the active registers and the gray ones are unused register entries. The four horizontal partitions separated by dotted lines show the four subarrays. The left-hand register file shows the distribution of active registers under the default register allocation approach, and the right-hand register file shows register usage under the proposed register renaming. By using the register lifetime information given by the compiler, the number of active registers can be reduced. Then, by using the architected-to-physical register mapping, the active registers are consolidated into fewer subarrays in each bank. Using a single sleep transistor for the entire subarray in Figure 8(b), all three unused subarrays can be power gated. Subarray-level power gating can be enabled by simply changing the register allocation policy. Whenever a new register is allocated, we first search the available register pool within the range of the already active subarrays, so that a new subarray is turned on only when the already active subarrays are filled up.
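The allocation policy in the last two sentences above can be illustrated with a small model of a single bank. The `Bank` class, the subarray and entry counts, and the rule that a subarray counts as powered whenever it holds at least one active register are assumptions made for the sketch; wakeup latency is not modeled.

```python
class Bank:
    """One register bank split into subarrays (illustrative sizes)."""

    def __init__(self, num_subarrays=4, entries_per_subarray=8):
        self.used = [[False] * entries_per_subarray
                     for _ in range(num_subarrays)]

    def is_on(self, s):
        # A subarray must stay powered while it holds any active register.
        return any(self.used[s])

    def allocate(self):
        """Return (subarray, entry), preferring subarrays that are already on."""
        # Stable sort: powered subarrays first, then gated ones in index order.
        order = sorted(range(len(self.used)), key=lambda s: not self.is_on(s))
        for s in order:
            for e, busy in enumerate(self.used[s]):
                if not busy:
                    self.used[s][e] = True
                    return s, e
        return None  # bank is full

    def release(self, s, e):
        self.used[s][e] = False

    def powered_subarrays(self):
        return sum(self.is_on(s) for s in range(len(self.used)))
```

Because released entries in a powered subarray are reused before any gated subarray is woken, active registers stay packed into as few subarrays as possible.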
Register file leakage power with FinFET transition: Leakage power is expected to increase considerably with every technology generation [28]. Researchers have investigated various device-, circuit-, and architecture-level techniques such as devices with multiple threshold voltages [13, 48, 47], power gating [22, 29], and voltage scaling [49, 16, 15] to minimize leakage power consumption. In particular, almost every technology generation has had to introduce an innovative device (e.g., high-K/metal-gate strain-enhanced transistors in 45nm technology) just to maintain a constant I_off/µm [4]; otherwise, it would have been impossible to improve device speed without substantially increasing leakage power. The same goes for the technology transition from 32nm planar MOSFET [44] to 22nm FinFET devices [14]. 22nm FinFET devices improve I_off/I_on (but not significantly) compared to 32nm MOSFET devices, as seen in [14]. The same result is also confirmed in Figure ??, where we use GPUWattch [31] to plot the fraction of leakage power over total GPU chip power when a GPU is designed with 32nm and 22nm MOSFET, and 22nm/16nm/10nm FinFET technologies, normalized to a GPU with 40nm technology. Without the introduction of FinFET devices, a much larger fraction of the power consumed by the 22nm MOSFET GPU would have been leakage power. FinFET brings the leakage power back to the baseline, but the climb continues from the new reset point, as seen in Figure 9. In other words, minimizing leakage power through various circuit- and architecture-level techniques will continue to be important even in current and future (FinFET) technology generations. Lastly, the GPU register file has been responsible for a large fraction of total power in GPUs (e.g., 15% from our estimation and as shown in [31, 33]).
EVALUATION
We used GPGPU-Sim v3.2.1 [2] to evaluate the proposed register renaming. We assumed a GPU with 16 SMs, each with a 128KB register file that is partitioned into four main banks, with eight sub-banks per bank as in Fermi. A two-level warp scheduler is used, with the ready queue size set to six warps. Two schedulers concurrently issue two instructions every cycle. For the compiler, nvcc v4.0 and gcc v4.4.5 are used. The renaming table and register bank power parameters shown in Table 2 are calculated using CACTI v5.3, assuming 40nm technology.
For the workloads, we used 16 applications from the NVIDIA CUDA SDK [5], the Parboil Benchmark Suite [7], and Rodinia [17]. Note that many GPU studies using these benchmark suites [9, 23, 24, 10, 41] also used 10-20 benchmarks similar to those used in this work. They cover a broad range of usage behaviors and provide good insights into the potential benefits of the proposed work, without requiring significant additional simulation resources. The number of CTAs, threads per CTA, registers used per kernel, and the number of concurrent CTAs per SM are listed in Table 1. The value in parentheses in the # Regs/Kernel field is the minimum number of registers required to avoid register spills. The values outside the parentheses in the same field are the register counts that include the address registers and condition registers. We used PTXPlus for register analysis. We modified the ptx parser code in GPGPU-Sim to analyze the register lifetime and insert the two new flag instructions. GPGPU-Sim provides detailed ptx parsing code that includes basic block recognition and control flow analysis. We traced the source and destination operands of each instruction to determine the release points for each register accurately.
Register Size Savings
Figure 10 shows the percentage of reduced register allocations with our proposed approach. We counted the number of physical registers that were actually used (touched at least once) during the renaming process; this is essentially the maximum number of concurrently live registers at any instant in the program execution. We then subtracted this count from the total allocated registers to find the total reduction in register count and plotted it as a fraction of the total registers allocated by the compiler. Register allocation is reduced by up to 44%, and on average 16% of the register space is eliminated from register allocation. Applications with short kernels (such as VectorAdd) saw smaller register savings. There is less chance for the dead registers to be reused by other warps due to the short execution time. Applications with longer execution times derive higher register savings, and hence our approach is particularly appealing for large kernels.
Figure 11: (a) Performance degradation when using half-sized (64KB) register file and (b) Sensitivity on subarray wakeup latency
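The release-point analysis described in this section (tracing the source and destination operands of each instruction) can be sketched for a single basic block as a backward liveness pass. This is a simplified illustration only: it ignores the control flow analysis that the modified GPGPU-Sim parser performs, and the instruction encoding as `(dest, [sources])` tuples and the function name are assumptions.

```python
def release_points(instrs):
    """Map each instruction index to the registers that are dead after it.

    instrs is a list of (dest, srcs) tuples for straight-line code, where
    dest is a register name or None and srcs is a list of register names.
    """
    release = {}
    live = set()  # registers whose current value is read later on
    for i in reversed(range(len(instrs))):
        dest, srcs = instrs[i]
        # A source that is not read again has its last use here.
        dead = {r for r in srcs if r not in live}
        # A destination that is never read can be released right away.
        if dest is not None and dest not in live:
            dead.add(dest)
        release[i] = sorted(dead)
        live.discard(dest)   # the value defined here starts its life at i
        live.update(srcs)    # sources must be live on entry to i
    return release


example = [
    ("r0", []),            # 0: r0 = load
    ("r1", []),            # 1: r1 = load
    ("r2", ["r0", "r1"]),  # 2: r2 = r0 + r1
    ("r3", ["r2", "r1"]),  # 3: r3 = r2 * r1
    (None, ["r3"]),        # 4: store r3
]
```

For `example`, r0 can be released after instruction 2, r1 and r2 after instruction 3, and r3 after the final store.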
Register File Under-provisioning
We compared the performance impact of GPU-shrink with a traditional GPU design that uses only half the number of registers through a simple compiler-enforced register file size reduction. In the traditional GPU with half the number of registers, we force the compiler to use only half of the available registers; whenever the compiler needs more registers, it has to spill some registers to memory and later fill them back. In our baseline configuration, the applications are already maximally optimized to use the minimum number of registers that does not cause any register spill under a 128KB register file. Therefore, for this comparison, some applications are recompiled to use less than 64KB of registers. Figure 11(a) shows the increase in total execution cycles, normalized to the 128KB register file configuration, when GPU-shrink and the compiler-enforced register file shrink (denoted as Compiler spill) are used. Four of the 16 benchmarks do not need any throttling because their register usage does not exceed 64KB. Thus, those applications (VectorAdd, BFS, Gaussian, and LIB) had zero performance overhead. In the other applications, GPU-shrink achieves much better performance than simply relying on the compiler to force spills/fills. By releasing and reusing dead registers, the effective register file demand is reduced, leading to minimal performance overhead. In some applications, performance even improves with GPU-shrink; MUM improved significantly. Further analysis showed that this unexpected behavior occurs because GPU-shrink disperses memory contention by throttling some warps, which improves performance in those applications. Overall, GPU-shrink suffered a 0.58% performance overhead on average, while the compiler spill approach suffered a 73% increase in execution time.
We also evaluated GPU-shrink-40% and GPU-shrink-30%, which use 40% and 30% smaller register files, respectively. Since the 50% shrink already incurs negligible performance overhead, the additional registers available in these two configurations had no impact on execution latency.
We also measured the performance degradation due to subarray wakeup delay. Figure 11(b) shows the average simulation cycles normalized to the case where power gating is not used. We used CACTI-P [32] to measure the wakeup delay for our register file subarray structure. CACTI-P estimated the wakeup delay to be less than one cycle. Nonetheless, for exploration purposes, we used wakeup delays of 1, 3, and 10 cycles. The performance overhead is less than 2% even with a wakeup delay of 10 cycles. The reason for this low performance impact is that the number of subarray wakeup events during program execution was negligibly small compared to the total execution cycles.
Figure 12 shows the register file energy breakdown of three different design options, normalized to a 128KB register file that does not use any renaming. Dynamic and Static are the register file's dynamic and static energy consumption. Renaming Table is the additional energy consumed by the renaming table. Flag Instruction includes the energy consumed by fetching and decoding release flag instructions and by the release flag instruction cache. The fetch/decode energy is measured by GPUWattch. The first bar, labeled 128KB RF w/ PG, shows the total register file energy consumption when the register file uses subarray-level power gating after applying renaming. This bar essentially shows the energy reduction when we use register renaming just to reduce the static power but do not alter the physical register file structure. The second bar (64KB RF) shows the energy savings from cutting the register file in half while using renaming. By halving the register file size, both dynamic and static power can be reduced even without power gating, and hence the average energy saving is even greater than that of the full-size register file with power gating. However, some applications, such as VectorAdd, LUD, Gaussian, and LIB, spend the vast majority of their execution time in code segments that have very few live registers.
Thus, static power savings are significant when using power gating (as shown in 64KB RF w/ PG). Reducing the register file size without any power gating actually leads to a small increase in energy consumption compared to power gating a 128KB register file. But when subarray-level power gating is applied on top of GPU-shrink, as plotted in the third bar, the overall energy savings increase across all the benchmarks. On average, GPU-shrink with register under-provisioning and subarray power gating saved a total of 42% of register file energy.
Static and Dynamic Code Increase
Figure 13 shows the static and dynamic instruction increase when using register renaming, due to the new metadata instructions that were added to the code. The dynamic instruction count was measured as the number of instructions decoded while varying the number of entries in the release flag cache. The integer value next to Dynamic- indicates the number of release flag cache entries used for the evaluation (i.e., Dynamic-5 uses a five-entry release flag cache). As pbr and pir do not issue any instruction to the execution units, the only overhead caused by the two added instructions occurs in the decoder logic. However, as pir is shared across multiple warps, a new pir is fetched and decoded only when it is not in the release flag cache. Therefore, the dynamic instruction increase becomes much smaller than the static instruction growth as more entries are added to the release flag cache. Overall, the increased dynamic code (11% without a release flag cache, as shown by Dynamic-0) is almost entirely eliminated when using a ten-entry release flag cache (0.2% dynamic code increase, as shown by Dynamic-10).
Figure 14: Per SM renaming table size w/o constraints and normalized register saving w/ 1KB constraint
Renaming Table Size
The left-hand chart of Figure 14 shows the renaming table size without constraining the size of the table. Almost all the workloads used in the evaluation can rename their registers using a 1KB renaming table, except three workloads: MUM, Heartwall, and LUD. Our worst-case renaming table size estimation assumes 48 warps, each accessing 63 architected registers, but this is only an upper limit that is not reached. Thus, when we constrain the renaming table size to 1KB, these three benchmarks are forced to exclude a few long-lived registers from the renaming process. The total number of exempted registers is 2 out of 19 in MUM and LUD, and 4 out of 29 in Heartwall. These exempt registers are each assigned a physical register and are never renamed. The right-hand graph of Figure 14 shows the impact of constraining the renaming table size. Since some of the registers were exempt from renaming, a few opportunities to reuse those long-lived registers (after they are dead) were lost. As expected, Heartwall's register saving is reduced the most among them because it cannot rename 13% of its total registers.
Figure 15: Comparisons with hardware-only register renaming [46].
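To give a feel for the worst-case estimate mentioned above (48 warps, each with 63 architected registers), the following sketch computes a renaming table size under the assumption that each entry stores only a physical register index. The entry width is not given in this section, so the log2-based width and both function names are hypothetical; the actual entry format may differ.

```python
import math


def renaming_table_bytes(warps, regs_per_warp, physical_regs):
    """Table size if every (warp, architected reg) pair has its own entry."""
    # Assumption: an entry is just an index into the physical register file.
    bits_per_entry = math.ceil(math.log2(physical_regs))
    return warps * regs_per_warp * bits_per_entry / 8


def entries_that_fit(table_bytes, physical_regs):
    """How many such entries fit in a table of table_bytes."""
    bits_per_entry = math.ceil(math.log2(physical_regs))
    return (table_bytes * 8) // bits_per_entry


worst_case = renaming_table_bytes(48, 63, 1024)  # upper bound, baseline RF
fit_in_1kb = entries_that_fit(1024, 1024)        # capacity of a 1KB table
```

Under this assumption the upper bound is 3780 bytes for 3024 entries, while a 1KB table holds 819 entries, which is consistent with most workloads fitting and a few needing some long-lived registers exempted.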
Comparison With Hardware-Only Approach
We also compared our results with hardware-only register renaming [46]. The hardware-only register renaming approach maps an architected register to a physical register when the architected register is defined. When the architected register is redefined, the mapping is released and the newly defined instance of the architected register is mapped to one of the available physical registers. Though [46] was not aimed at power saving, the approach can also be enhanced to save power by power gating unmapped physical registers. Figure 15(a) shows the register file size reduction of [46] normalized to our proposed approach. In [46], physical registers are released only when the mapped architected register is redefined. Therefore, the architected-to-physical register mapping may persist even after the architected register is dead. Thus, waiting for a register to be redefined is not as effective as our compiler-supported approach, which releases a register as soon as its lifetime ends. Figure 15(b) shows the register file static power reduction of [46] compared to our approach. In benchmarks such as Heartwall, HotSpot, and LUD, the number of register allocations did not decrease with [46], but that approach can still save some static power. Many of the allocated registers may be defined for the first time only late in the program execution, which allows [46] to power gate those registers before their first definition. However, our approach proactively releases and reassigns registers, thereby reducing the total demand on the register file, which saves 2× more static power than [46] when using a 128KB register file. More critically, as [46] does not fundamentally reduce register allocations, it cannot cut the physical register file size without a significant performance penalty due to compiler-generated spills.
Our approach, on the other hand, can incorporate GPU-shrink with compiler support, which enables register file under-provisioning and thereby saves both static and dynamic power.
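The difference between the two release policies can be illustrated on a single warp's straight-line code. The sketch below is a toy comparison under stated assumptions, not a model of either design: the function name and the `(dest, [sources])` instruction encoding are hypothetical, control flow is ignored, and a source released by an instruction is not reused for that same instruction's destination.

```python
def peak_mapped(instrs, release_at_last_use):
    """Peak number of physical registers mapped for one warp.

    release_at_last_use=False models release only on redefinition, in the
    spirit of the hardware-only scheme; True models compiler-marked release
    at the last read.
    """
    n = len(instrs)
    # Backward liveness pass: dead_after[i] = registers never read after i.
    dead_after = [set() for _ in range(n)]
    live = set()
    for i in reversed(range(n)):
        dest, srcs = instrs[i]
        dead_after[i] = {r for r in srcs if r not in live}
        if dest is not None and dest not in live:
            dead_after[i].add(dest)
        live.discard(dest)
        live.update(srcs)

    mapped, peak = set(), 0
    for i, (dest, srcs) in enumerate(instrs):
        if dest is not None:
            # On redefinition the old mapping is swapped for a new one,
            # so the number of mapped registers does not change.
            mapped.add(dest)
        peak = max(peak, len(mapped))
        if release_at_last_use:
            mapped -= dead_after[i]
    return peak


trace = [
    ("r0", []),            # r0 = load
    ("r1", []),            # r1 = load
    ("r2", ["r0", "r1"]),  # r2 = r0 + r1
    ("r3", ["r2", "r1"]),  # r3 = r2 * r1
    (None, ["r3"]),        # store r3
]
```

On `trace`, release on redefinition keeps all four registers mapped at the peak, whereas release at last use peaks at three, since r0 is freed before r3 is defined.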
CONCLUSION
The default GPU register management method allocates a physical register to every architected register and reclaims these registers only at the end of a CTA's execution. We showed that a more efficient register management approach is viable if the compiler can provide lifetime analysis of register usage. The hardware uses the register lifetime information to proactively reclaim registers and then reuse those registers across different warps. With the reduced register demand, we explored how to manage a GPU with just half the number of registers, while still allowing applications to use the entire architected register space during compilation.
ACKNOWLEDGEMENTS
This work was done when Hyeran Jeon was a PhD student at the University of Southern California. This work was supported in part by NSF (0954211 and 1217102) and DARPA (HR0011-12-2-0020). Nam Sung Kim has a financial interest in AMD and Samsung Electronics.
