It is well known that a large fraction of variables are short-lived. This paper proposes a novel approach that exploits this fact to reduce register pressure in pipelined processors with a data-forwarding network. The idea is that the compiler can allocate virtual registers (i.e., placeholders that identify dependences among instructions) to short-lived variables, which do not need to be stored in physical storage locations. As a result, real registers (i.e., registers that physically exist) can be reserved for long-lived variables, which mitigates register pressure and decreases register spills, leading to performance improvement. In this paper, we develop the architectural and compiler support for exploiting virtual registers in statically scheduled processors. Our experimental results show that virtual registers are very effective at reducing register spills and, in many cases, achieve performance close to that of a processor with twice the number of real registers. Our results also indicate that, for some applications, using 24 virtual registers in addition to 8 real registers can attain even higher performance than using 16 real registers without any virtual registers.
INTRODUCTION
The register file is at the top of the memory hierarchy and provides the fastest way for the processor to access data. Registers are typically managed by the compiler, which uses sophisticated register allocation algorithms [Muchnick 1997] to ensure that the most frequently used variables are kept in registers.
and Martonosi 2000]. In a pipelined processor, data can be forwarded from the output of a functional unit to the inputs of the same or other functional units in the subsequent clock cycles. Therefore, if the live range of a variable is within the length of the data-forwarding path, it does not need to be written back to the architectural registers. Moreover, this variable does not even need to be allocated to an architectural register, since a placeholder is sufficient to identify the data-dependence relationship for forwarding the data to the right instructions. In this paper, the placeholders (i.e., tags) in the operand fields of the ISA are called virtual registers. In contrast, the architectural registers that have physical storage locations in the register file are called real registers (or simply registers).
Since a large fraction of variables are short-lived (see Figure 4 for details), this paper proposes to allocate virtual registers (instead of real registers) to the short-lived variables, whose live ranges are smaller than or equal to the maximum length of the data-forwarding path. As a result, the real registers can be used more efficiently for other variables to reduce register spills. Our experimental results indicate that virtual registers are very successful in reducing register spills, leading to higher performance without physically enlarging the register file.
The rest of the paper is organized as follows. We discuss related work in Section 2 and introduce the basics of the data-forwarding network in Section 3. We elaborate on our approach to exploiting virtual registers for reducing the real register pressure in Section 4. The experimental results are given in Section 5. Finally, we conclude this paper in Section 6.
RELATED WORK
There have been several research efforts on exploiting short-lived variables for performance [Lozano and Gao 1995] and energy optimizations [Hu and Martonosi 2000; Sami et al. 2001]. While the work in [Hu and Martonosi 2000; Sami et al. 2001] focused on reducing the register file energy dissipation, the work in [Lozano and Gao 1995] is closest to this study in that both explore short-lived variables to improve register allocation for better performance; however, there are some important distinctions. Lozano and Gao [1995] proposed an architecture extension to avoid the useless commits of the values produced by short-lived variables and a compiler-based approach to avoid assigning short-lived variables to real registers. In their work, the reorder buffer entries are regarded as an extension of the real registers, which the compiler can allocate to short-lived variables to improve the utilization efficiency of real registers. First, this approach does not apply to VLIW processors, which do not have reorder buffers. In contrast, this paper uses virtual registers that do not physically exist to help enhance the allocation of real registers to long-lived variables. Second, our approach can push Lozano and Gao's scheme further: in superscalar processors, certain short-lived variables do not even need to be allocated to reorder buffer entries and can use virtual registers instead. Therefore, the number of reads from and writes to the reorder buffer is reduced, which can decrease the stalls due to reorder buffer saturation and also improve the energy efficiency. Another difference is that in [Lozano and Gao 1995], the short-lived variables are defined as those whose live ranges are smaller than or equal to the size of the reorder buffer (which is typically much larger than three).
By comparison, the short-lived variables in this paper refer to the variables whose live ranges are smaller than or equal to the length of the data-forwarding path (i.e., three in our simulated processor; see Section 3 for details). Yung et al. [1995] proposed register scoreboarding and caching to exploit the locality in register values for windowed-register architectures. Yung's approach, however, still needs to allocate real registers to all the variables, although the short-lived variables stay in the register cache only for a short period of time at runtime. In contrast, our approach is compiler-based and can allocate the limited real registers more efficiently by binding short-lived variables to virtual registers.
Recently, Oehmke et al. [2005] proposed the virtual-context architecture (VCA) to support both multithreading and register windows with significantly fewer registers than a conventional machine. VCA builds on the register renaming logic already found in dynamically scheduled processors, which treats the physical registers as a cache of a much larger memory-mapped logical register space. Therefore, VCA cannot be directly applied to the architectural registers of statically scheduled processors (e.g., VLIW) without dynamic register renaming support; otherwise, the hardware simplicity and scalability may be compromised. Moreover, by applying the virtual register-based approach proposed in this paper in the context of VCA, the register pressure can be further mitigated, potentially improving the performance of VCA. Gonzalez et al. [1998] introduced the concept of virtual-physical registers, which are names used to keep track of dependences. The virtual-physical registers [Gonzalez et al. 1998] and the virtual registers proposed in this paper are similar in the sense that neither uses any storage locations. However, there are some significant differences. First, virtual-physical registers [Gonzalez et al. 1998] are used to delay the allocation of physical registers (not architectural registers) in dynamically scheduled processors, while the virtual registers proposed in this paper can be exploited to mitigate the pressure on architectural registers for any processor, regardless of whether it supports dynamic register renaming. Second, virtual-physical registers [Gonzalez et al. 1998] are invisible to the compiler, while virtual registers can be explicitly managed by the compiler to reduce register spills. Monreal et al. [2004] studied VP-LAER, which is a rename scheme that can allocate physical registers later and release them earlier than traditional rename schemes.
However, VP-LAER targets the dynamic register renaming for superscalar processors, while virtual registers proposed in this paper can reduce the register pressure for VLIW processors without using dynamic register renaming. 
BACKGROUND ON DATA-FORWARDING NETWORK
A data-forwarding (also called bypassing) network is a widely used mechanism to reduce the data hazards of pipelined processors [Patterson and Hennessy 2004]. A data-forwarding network has also been used in VLIW processors, such as the Philips LIFE processor [Kannan et al. 2002]. In this paper, we study a VLIW processor with a five-stage pipeline provided with data-forwarding logic, as shown in Figure 1. The pipeline stages are as follows:
-IF: Instruction Fetch from the I cache.
-ID: Instruction Decode and operand fetch from the Register File (RF).
-EX: Execution.
-MEM: Memory access for load/store operations.
-WB: Write-Back results into the RF.
As can be seen from Figure 1, three forwarding paths (i.e., EX-EX, MEM-EX, and MEM-ID) provide direct links between pairs of stages through the pipeline registers (i.e., the EX/MEM and MEM/WB interstage pipeline registers). Therefore, given a sequence of VLIW instructions I_1 ... I_k ... I_n, the instruction I_k can read its operands from the following instructions:
-I_{k−1}: through the EX-EX forwarding path (when I_k is in the EX stage).
-I_{k−2}: through the MEM-EX forwarding path (when I_k is in the EX stage).
-I_{k−3}: through the MEM-ID forwarding path (when I_k is in the ID stage).
-I_{k−j}: through the RF, where j > 3.
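The mapping from producer distance to operand source described above can be written down as a small lookup. The following is only an illustrative sketch in Python; the names FORWARDING_PATHS and operand_source are hypothetical and are not part of the simulated processor or of Trimaran.

```python
# Illustrative sketch: where I_k reads an operand produced by I_(k-j).
# The table mirrors the three forwarding paths listed in the text; the
# names used here are hypothetical.
FORWARDING_PATHS = {1: "EX-EX", 2: "MEM-EX", 3: "MEM-ID"}

# Length of the data-forwarding path for this pipeline (three here).
FORWARDING_LENGTH = max(FORWARDING_PATHS)


def operand_source(distance: int) -> str:
    """Return the source of an operand produced `distance` instructions earlier."""
    if distance < 1:
        raise ValueError("the producer must precede the consumer")
    # Anything farther away than the longest forwarding path comes from the RF.
    return FORWARDING_PATHS.get(distance, "RF")


if __name__ == "__main__":
    for j in range(1, 6):
        print(f"I_(k-{j}) -> {operand_source(j)}")
```

Under these assumptions, any producer at distance j > 3 is served by the register file, which is exactly the case that requires a real register.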
Problem Formulation
Given an instruction pipeline with a data-forwarding network F, we define the length of the data-forwarding path L(F) to be the maximum number of clock cycles between the producer instruction that forwards the value and the consumer instruction that takes this value. Given a variable v, we define its live range LR(v) to be the number of machine instructions (including nops) in the longest path between its definition points and last-use points, including the instruction at the definition point. We say a variable v is a short-lived variable if its live range is smaller than or equal to the length of the data-forwarding path, i.e., LR(v) ≤ L(F).
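To make the definition concrete, the following Python sketch computes LR(v) for a straight-line schedule and applies the test LR(v) ≤ L(F). It is a simplification under stated assumptions: one operation per instruction slot, a single definition per variable, and no control flow (so the "longest path" is simply the distance from the definition to the last use). The names live_ranges, short_lived, and SCHEDULE are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

# An instruction is (destination, sources); None models a nop.
Instr = Optional[Tuple[Optional[str], Sequence[str]]]


def live_ranges(schedule: List[Instr]) -> Dict[str, int]:
    """LR(v) = index of last use - index of definition (straight-line code only).

    This counts the instructions from the definition up to, but not
    including, the last use, so a consumer directly after its producer
    has a live range of 1.
    """
    def_at: Dict[str, int] = {}
    last_use: Dict[str, int] = {}
    for idx, ins in enumerate(schedule):
        if ins is None:  # nops still occupy a slot and lengthen live ranges
            continue
        dest, srcs = ins
        for s in srcs:
            if s in def_at:  # ignore live-in values defined outside the schedule
                last_use[s] = idx
        if dest is not None:
            def_at.setdefault(dest, idx)  # assumption: one definition per variable
    return {v: last_use[v] - def_at[v] for v in def_at if v in last_use}


def short_lived(schedule: List[Instr], forwarding_length: int = 3) -> Set[str]:
    """Variables whose live range fits within the data-forwarding path."""
    return {v for v, lr in live_ranges(schedule).items() if lr <= forwarding_length}


# Hypothetical example schedule (not taken from the paper's figures).
SCHEDULE: List[Instr] = [
    ("t1", ["a"]),
    ("t2", ["t1"]),
    None,
    ("x", ["t2", "b"]),
    ("y", ["x"]),
    None,
    None,
    None,
    ("z", ["x", "y"]),
]
```

In this example, t1 and t2 have live ranges of 1 and 2 and are short-lived for L(F) = 3, whereas x and y have live ranges of 5 and 4 and would need real registers.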
Given these definitions, the problem to be studied in this paper can be described as follows:
Problem-Reducing Pressure on Real Registers: Given a pipelined machine M with the length of a data-forwarding path L(F ) and a program P , create a register allocation such that the short-lived variables in P will use as few real registers in M as possible to minimize the register spills without violating the program semantics.
OUR SOLUTION
To solve the aforementioned problem, we propose to create virtual registers for short-lived variables. A virtual register is simply a tag that identifies the data dependences among instructions; it does not correspond to a physical storage location in the register file. Since the virtual registers can keep track of the data-dependence relations, the values of short-lived variables can be forwarded to the right consumer instructions without violating the program semantics.
The idea of exploiting virtual registers can be explained by the example code shown in Figure 2, which is extracted from the source code of cjpeg. Figure 2a gives the source code and Figure 2b gives the corresponding assembly code. The live range of each variable is depicted in Figure 3. As can be seen, this code segment includes three short-lived variables, i.e., $tr1, $tr2, and $tdist, since their live ranges are less than or equal to three. Without virtual registers, at least three real registers are needed. However, by exploiting the virtual registers, only two real registers (in addition to one virtual register) are sufficient for this code segment, as can be seen from Figure 3. Therefore, if only two real registers are available for this code segment, utilizing the virtual registers for short-lived variables can reduce the number of register spills, potentially leading to higher performance.
Clearly, the success of our approach depends on how many variables are actually short-lived. If only a small fraction of variables are short-lived, the impact on real register allocation and on the reduction of register spills will be insignificant. Since the length of the forwarding path is three in our five-stage pipelined forwarding architecture, as depicted in Figure 1, only the variables whose live ranges are equal to or less than three will be identified as short-lived, according to our definition of short-lived variables in Subsection 3.1. Figure 4 shows the percentage of short-lived variables for the selected benchmarks from Mediabench [Lee et al. 1997] and SPEC 2000 [spe 2006] (more details of our evaluation methodology can be found in Section 5). As we can see, in all the applications except rasta, more than 50% of the variables are short-lived. On average, 66.2% of variables are short-lived, indicating good opportunities to exploit virtual registers for mitigating real register pressure.
Architectural Support
To support the exploitation of virtual registers, we need a slight extension of the underlying hardware. The assumed architectural support is depicted in Figure 5b. Compared with the original ISA shown in Figure 5a, 1 bit (called the virtual bit) needs to be added to the ISA for each destination and source operand field. By default, the leftmost (virtual) bit is set to 0, and the remaining 5 bits of the operand field are used to specify a real register. When the virtual bit is set to 1, a virtual register (i.e., the tag consisting of all 6 bits) is used to identify the data-dependence relations associated with short-lived variables. The decoder also needs to be extended to recognize the additional virtual bits. Clearly, adding 1 virtual bit to each of the three operand fields of a 32-bit instruction format will increase the instruction width by about 9.4% (3 of 32 bits), which also requires a corresponding increase of the instruction memory size. This overhead can be partly mitigated by exploiting unused instruction encoding bits; however, this will restrict the proposed approach to the subset of ISA operations with unused encoding bits in the instruction format. With the increasing use of 64-bit architectures for high-end microprocessors, this overhead is also expected to be reduced (i.e., 4.7% for adding the same number of virtual registers as real registers without doubling the physical size of the register file). Moreover, for embedded processors that provide multiple ISAs, such as ARM7TDMI [arm 1995], the wider ISA can be exploited to support virtual registers for better performance, while the number of real registers is kept small and fixed. In addition, we add an Rs Inhibit bit and an Rt Inhibit bit to the register file, which, when enabled (i.e., 1), inhibit reading the Rs or Rt register, respectively. As can be seen from Figure 5b, the values of the virtual bits of the Rs field and the Rt field are passed directly to the Rs Inhibit and Rt Inhibit bits, respectively.
Therefore, when a virtual register is used as a source register, it will not be read from the register file. Instead, the source operand will be forwarded from the interstage pipeline registers through the forwarding network. Similarly, the virtual bit of the Rd field is connected to a NOT gate, whose output is then passed to the RegWrite Enable signal, which typically already exists in the register file to enable (1) or disable (0) register write operations. Therefore, if the virtual bit of the Rd field is 1, i.e., a virtual register is used as the destination register, the RegWrite Enable signal will be set to 0, thus disabling writing the value of a virtual register to the register file. It should be noted, however, that we only show the extension of a single RISC-like operation in this figure. A VLIW instruction consisting of multiple operations can be extended in the same manner by adding inhibit bits to each operation.
Since the virtual registers can be encoded in the machine instructions, we do not need to extend the ISA further for compilers to handle virtual registers. For instance, in our base processor with 32 real registers, the virtual register numbers range from 32 to 63 (i.e., 100000_2 to 111111_2). Thus, if a virtual register is needed for a short-lived variable, the compiler can allocate an available register number between 32 and 63 to it, which sets the virtual bit of the corresponding operand field to 1 in the binary code. With our architectural extension, this virtual bit ensures that the virtual register will not be read from or written to the register file, while real register operations are not affected.
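The control logic described in this subsection can be summarized by a short behavioral sketch in Python. It models only the three 6-bit operand fields of a single RISC-like operation that writes a destination register; the function names and the dictionary of signals are hypothetical and serve as a reading aid for Figure 5b, not as a description of the actual decoder.

```python
# Behavioral sketch of the virtual-bit logic, assuming 6-bit operand fields
# whose leftmost bit is the virtual bit (register numbers 32..63 are virtual).
OPERAND_BITS = 6
VIRTUAL_BIT = 1 << (OPERAND_BITS - 1)  # 0b100000


def is_virtual(operand_field: int) -> bool:
    """True when the leftmost bit of the 6-bit operand field is set."""
    if not 0 <= operand_field < (1 << OPERAND_BITS):
        raise ValueError("operand field must fit in 6 bits")
    return bool(operand_field & VIRTUAL_BIT)


def control_signals(rs: int, rt: int, rd: int) -> dict:
    """Derive the register-file control signals from the operand fields.

    Assumes the operation writes a destination register; operations that
    never write the register file are outside this sketch.
    """
    return {
        "rs_inhibit": int(is_virtual(rs)),            # 1: do not read Rs from the RF
        "rt_inhibit": int(is_virtual(rt)),            # 1: do not read Rt from the RF
        "reg_write_enable": int(not is_virtual(rd)),  # NOT gate on Rd's virtual bit
    }


if __name__ == "__main__":
    # Rs = real register 3, Rt = virtual register 35, Rd = virtual register 40.
    print(control_signals(3, 35, 40))
```

With a real Rs and virtual Rt and Rd, only the Rs read reaches the register file; the Rt operand is expected to arrive through the forwarding network, and the result is not written back.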
Compiler Support
With the architectural support for virtual registers, the compiler's job is to modify the register allocation to exploit these virtual registers for short-lived variables while freeing real registers for long-lived variables, as shown in Figure 6. Conceptually, the compiler first performs an analysis to identify short-lived variables (called the short-liveness analysis in this paper), which depends on the length of the data-forwarding path of the processor. Then, the compiler attempts to allocate virtual registers to short-lived variables and real registers to long-lived variables. If no virtual register is available, real registers will also be used for the short-lived variables, i.e., these short-lived variables are treated like long-lived variables. If no real register is available, a variable must be either spilled or split. The strategy to be used is controlled by a parameter in the configuration file. This process is then repeated until all the variables have been bound to virtual or real registers, or spilled to memory.
In order to make use of virtual registers, the first step is to perform compiler analysis to identify the short-lived variables, which is called the short-liveness analysis in this paper. This analysis consists of three steps: (1) perform standard dataflow analysis to identify def-use chains [Muchnick 1997] ...
Our virtual/real register allocation algorithm is built upon the region-based register allocation [Kim 2001] of the Elcor compiler, which is based on Chow and Hennessy's priority-based coloring algorithm [Chow and Hennessy 1984]. The framework of this algorithm [Kim 2001] is shown in Figure 7, which includes five major steps: (1) build the interference graph; (2) simplify the graph; (3) prioritize each live range according to the heuristic function [Chow and Hennessy 1984]; (4) color the live range with the highest priority, and repeat from (4) as long as an uncolored live range remains; and (5) if there is a live range that cannot be colored, reduce the interference of the live range by splitting it into two or more smaller live ranges and repeat from (4). More information about this register-allocation algorithm can be found in Kim's [2001] thesis.
To minimize the modifications to the existing region-based register allocation algorithm, we incorporated our short-liveness analysis and short-lived variable allocation in the binding() function, which is part of step (4) in Figure 7. The pseudocode of this function is provided in Figure 8. In this algorithm, for each variable of a region, if it is a short-lived variable, the compiler tries to allocate a virtual register to it. Otherwise, a real register or spilling has to be used for this variable. The long-lived variables follow the existing register allocation process without touching virtual registers.
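The binding policy described above can be illustrated with a minimal Python sketch. This is not the pseudocode of Figure 8 and not Elcor's implementation: the function signature, the representation of interference as a set of forbidden register numbers, and the use of None to signal "spill or split" are all assumptions made for illustration.

```python
from typing import Iterable, Optional, Set


def binding(
    is_short_lived: bool,
    forbidden: Set[int],
    real_regs: Iterable[int],
    virtual_regs: Iterable[int],
) -> Optional[int]:
    """Pick a register number for one live range, or None if it must be spilled or split.

    `forbidden` holds the register numbers already bound to interfering
    live ranges (a hypothetical stand-in for the interference graph).
    """
    if is_short_lived:
        # Short-lived variables try the virtual registers first.
        for r in virtual_regs:
            if r not in forbidden:
                return r
    # Long-lived variables, and short-lived ones that found no free virtual
    # register, compete for the real registers as usual.
    for r in real_regs:
        if r not in forbidden:
            return r
    return None  # the caller spills or splits, depending on its configuration


if __name__ == "__main__":
    real = range(0, 2)       # two real registers: 0, 1
    virtual = range(32, 34)  # two virtual registers: 32, 33
    print(binding(True, {32}, real, virtual))       # short-lived, 32 taken
    print(binding(True, {32, 33}, real, virtual))   # falls back to a real register
    print(binding(False, {0, 1}, real, virtual))    # long-lived, no real register left
```

The design choice worth noting is the fallback order: a short-lived variable only consumes a real register when every virtual register is taken by an interfering live range, which matches the policy stated in the text.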
Precise Interrupt
Exploiting virtual registers in the pipeline can cause problems for precise interrupts (or exceptions) [Smith and Pleszkun 1988]. Since virtual registers do not have corresponding physical storage, the values associated with virtual registers will be lost after interrupts. Therefore, if an instruction needs to use a virtual register as a source operand, while the value associated with this virtual register is produced by an instruction before the faulting instruction or by the faulting instruction itself, the dependent instruction will not be able to fetch the data correctly. To solve this problem, we propose to store the values in the pipeline registers as part of the processor state when dealing with interrupts or exceptions. We also assume architectural support for NUAL-freeze (nonuniform assumed latency) operations [Schlansker and Rau 2000], which appear not to advance when virtual time is frozen, such as during cache misses or interruptions. The NUAL-freeze operations can be supported by adding snapshot hardware, such as replay buffers [Rudd 1997]. With the support of NUAL-freeze semantics, the operations after exception handling will be executed according to their original virtual time as scheduled by the compiler. Since the contents of the pipeline registers are also saved and restored at exactly the same virtual time as if the interruption had never occurred, the instructions after the faulting instruction can, after the exception handling, receive the value forwarded from the right pipeline registers based on the original virtual timing. In particular, virtual registers can also be used for short-lived variables where a load instruction is between the producer instruction and the consumer instructions. If a cache miss occurs for a load that the compiler estimated to be a hit, the pipeline will be stalled by the interlock employed in the VLIW processor to ensure correct semantics, i.e., the virtual time is frozen.
Since we assume that the operations of the VLIW processor are NUAL-freeze operations, the instruction producing the value is not allowed to produce the value immediately (although it is independent of the load), because the whole pipeline is stalled until the data is returned from the cache. Therefore, when the pipeline is resumed after the load returns, the producer instruction will produce the value and forward it to the consumer instruction(s) correctly as usual.
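The reason the pipeline registers must be part of the saved processor state can be seen in a toy Python model. It is a deliberately simplified sketch: the ProcessorState class, the latch names, and the (tag, value) representation are hypothetical, and the timing aspects handled by NUAL-freeze semantics are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ProcessorState:
    # Real registers: register number -> value.
    reg_file: Dict[int, int] = field(default_factory=dict)
    # Interstage latches: latch name -> (register tag, value in flight).
    pipeline_regs: Dict[str, Tuple[int, int]] = field(default_factory=dict)


def save_state(state: ProcessorState, include_pipeline_regs: bool) -> ProcessorState:
    """Snapshot taken at an interrupt; optionally includes the interstage latches."""
    saved = ProcessorState(reg_file=dict(state.reg_file))
    if include_pipeline_regs:
        saved.pipeline_regs = dict(state.pipeline_regs)
    return saved


def forwarded_value(state: ProcessorState, tag: int) -> Optional[int]:
    """Value a consumer would receive for `tag` from the forwarding network, if any."""
    for latch_tag, value in state.pipeline_regs.values():
        if latch_tag == tag:
            return value
    return None


if __name__ == "__main__":
    # Virtual register 33 holds the value 7, but only in the EX/MEM latch.
    running = ProcessorState(reg_file={1: 10}, pipeline_regs={"EX/MEM": (33, 7)})
    print(forwarded_value(save_state(running, True), 33))   # value survives
    print(forwarded_value(save_state(running, False), 33))  # value is lost
```

When the latches are excluded from the snapshot, the value tagged with virtual register 33 has no other home and cannot be recovered, which is the failure case the proposed state extension is meant to prevent.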
EVALUATION

Experimental Setup
We use simulation to evaluate the proposed virtual/real register allocation algorithm on a VLIW processor by using Trimaran v3.7 [tri 2006]. Trimaran is composed of a front-end compiler IMPACT, a back-end compiler Elcor, an extensible intermediate representation (IR) Rebel, and a cycle-level VLIW/EPIC simulator that is configurable by modifying the machine description file [tri 2006]. The virtual/real register allocation algorithm was implemented in Elcor by modifying the existing region-based coloring algorithm [Kim 2001]. The machine description file was configured to simulate VLIW processors with various numbers of real and virtual registers. The VLIW processor simulated in our experiments consists of two IALUs (integer ALUs), two FPALUs (floating-point ALUs), one LD/ST (load/store) unit, and one branch unit. Other system parameters used for our default setting are provided in Table I. The Trimaran front-end compiler IMPACT used optimization level 4 (O4). The back-end compiler Elcor performed machine-independent optimizations, including dead code elimination, forward copy propagation, local copy propagation, common subexpression elimination, and loop-invariant code removal, as well as machine-dependent optimizations, such as instruction scheduling and register allocation [tri 2006]. The basic block scheduling algorithm was used as the default scheduling algorithm.
Figure 9. Normalized execution cycles using 64 real registers and using 32 virtual registers in addition to 32 real registers, normalized to the execution cycles using 32 real registers.
The benchmark set includes applications drawn from Mediabench [Lee et al. 1997] and SPEC 2000 [spe 2006].
The Effect of Exploiting Virtual Registers
Our first experiment studies the impact of the proposed virtual/real register allocation algorithm on the performance of the VLIW processor with 32 real and 32 virtual registers. The percentage of performance improvement (in terms of the total number of execution cycles) relative to the base processor (i.e., 32 real registers with region-based register allocation [Kim 2001]) is illustrated in Figure 9. As can be seen, the exploitation of 32 virtual registers has different effects on performance for different applications. Five of the Mediabench applications, i.e., cordic, cjpeg, djpeg, des, and unepic, show significant performance increases (i.e., reductions of the execution cycles), which vary from 14.5% for cordic up to 28.8% for cjpeg. On the other hand, two benchmarks, i.e., g721decode and g721encode, are not affected at all by introducing additional virtual registers. The rest of the benchmarks have relatively small performance improvements, ranging from 0.7% for 256.bzip2 to 3.1% for mpeg2dec. The diverse effects of virtual/real register allocation on different benchmarks relate to two factors: (1) the number of register spills that can be eliminated by exploiting virtual registers, and (2) the ratio of the number of cycles saved by the reduced register spills to the overall number of cycles (called the ratio of spill cycles in this paper). In Table II, we give the number of dynamic register spills with and without using virtual registers, as well as the ratio of spill cycles. By comparing the performance improvement in Figure 9 and the data in Table II, we can see that the performance increase of the virtual/real register allocation is closely related to the ratio of spill cycles.
More specifically, the five benchmarks (i.e., cordic, cjpeg, djpeg, des, and unepic) that achieve the best performance results also have the highest ratios of spill cycles, indicating that using virtual registers is an effective way to boost performance for applications that suffer greatly from register spills. In contrast, the ratio of spill cycles of both g721decode and g721encode is 0, since these two applications do not have any register spills, given 32 real registers. In this scenario, there is no need to exploit additional virtual registers and, hence, there is no room for a performance increase by optimizing register allocation. Other benchmarks that suffer from only a limited number of register spills can only be moderately improved by the virtual/real register allocation. Overall, we find that exploiting virtual registers is effective at enhancing performance when register spilling is a severe problem (i.e., the register pressure is high).
Figure 9 also gives the execution cycles with 64 real registers, normalized to those with 32 real registers. As can be seen, the number of execution cycles is reduced by increasing the number of real registers from 32 to 64, except for g721decode and g721encode, for which 32 real registers are adequate. We also find that the execution cycles of 32 real registers plus 32 virtual registers are very close to those of 64 real registers for all the benchmarks. On average, while using 64 real registers can reduce the execution cycles by 9.4% compared with using 32 real registers only, exploiting 32 virtual registers in addition to 32 real registers can reduce the number of execution cycles by 8.3%. It should be noted that these results are under the assumption that both processors (i.e., with 64 or 32 real registers) have the same frequency. Since the clock cycle time is typically a function of the size of the register file [Farkas et al. 1998], if we take this into consideration, the scheme with 32 real plus 32 virtual registers may achieve even higher performance than the scheme with 64 real registers, which may have a slower clock frequency.
Figure 10. Normalized execution cycles using 32 real registers and using 16 virtual registers in addition to 16 real registers, normalized to the execution cycles using 16 real registers.
Sensitivity Results
We have also studied the performance of the VLIW processor with 16 real registers, while all other architectural parameters are fixed. We compare the execution cycles of 16 real registers with those of 32 real registers, as well as those of 16 virtual registers in addition to 16 real registers; the results are shown in Figure 10. As can be seen, increasing the number of real registers from 16 to 32 can lead to a significant performance increase. On average, the number of execution cycles is decreased by 17.9%. Using 16 virtual plus 16 real registers can also achieve much better performance than using 16 real registers only. As can be seen from Figure 10, except for a few benchmarks, such as 256.bzip2, the performance of 16 virtual plus 16 real registers is also close to that of 32 real registers. Compared with the results in Figure 9, however, we can see that the performance gap between 16 virtual plus 16 real registers and 32 real registers is slightly larger than the performance gap between 32 virtual plus 32 real registers and 64 real registers. The reason is twofold. First, with 16 real registers, there are more register spills (as compared to 32 real registers). Second, 16 virtual registers cannot mitigate the register spills as effectively as 32 virtual registers. Nevertheless, 16 additional virtual registers still decrease the number of execution cycles by 12.5%, on average, compared with using 16 real registers only. Note that this performance improvement is achieved without increasing the number of real registers or compromising the clock frequency, indicating the effectiveness of the proposed virtual/real register allocation approach.
In this paper, we also investigate the sensitivity to various virtual/real register combinations, while keeping the total number of registers (including both virtual and real registers) fixed.
Since the simulation is very time consuming (especially for the SPEC 2000 benchmarks), we select two benchmarks (cjpeg and djpeg) from Mediabench and two (164.gzip and 181.mcf) from SPEC 2000 for this study. We fix the total number of real/virtual registers to be 32, while varying the combination of real/virtual registers to be 8/24, 16/16, and 32/0. Correspondingly, each operand field of these three schemes includes 5 bits in total, while the number of virtual bits decreases from 2 to 1 and to 0, respectively. The performance results of these three configurations are given in Figure 11, normalized with respect to the execution cycles of 16 real registers. As expected, using more real registers always leads to higher performance when the total number of real/virtual registers is fixed, simply because real registers can be allocated to both short-lived and long-lived variables. However, an interesting result is that the scheme of 8 real/24 virtual registers can achieve even better performance than the base scheme of 16 real/0 virtual registers for 3 out of the 4 benchmarks (the only exception is 164.gzip). This result indicates that by exploiting virtual registers intelligently, a processor with fewer real registers can achieve better performance than a processor with more real registers but no virtual registers, even when we assume that the clock frequency of the latter is not negatively affected. Therefore, the virtual/real register allocation can be very useful for boosting the performance of processors that are constrained by the number of real registers, by minimizing the register spills.
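The three configurations can be related to the operand-field encoding with a small Python calculation. The sketch assumes that a register number is real only when all of its leading virtual bits are 0; the paper states the bit counts and the resulting 8/24, 16/16, and 32/0 splits, and this encoding rule is one reading that is consistent with them. The function names are hypothetical.

```python
from typing import Tuple


def split_registers(field_bits: int, virtual_bits: int) -> Tuple[int, int]:
    """Return (real, virtual) register counts for one operand field.

    Assumption: a register number is real only if all `virtual_bits`
    leading bits of the field are 0; every other encoding is a virtual tag.
    """
    if not 0 <= virtual_bits <= field_bits:
        raise ValueError("virtual_bits must be between 0 and field_bits")
    total = 1 << field_bits
    real = 1 << (field_bits - virtual_bits)
    return real, total - real


def is_real(number: int, field_bits: int, virtual_bits: int) -> bool:
    """True when the leading virtual bits of `number` are all 0."""
    return (number >> (field_bits - virtual_bits)) == 0


if __name__ == "__main__":
    for v in (2, 1, 0):
        real, virtual = split_registers(5, v)
        print(f"{v} virtual bit(s): {real} real / {virtual} virtual")
```

Under this assumption, a 5-bit field with 2, 1, and 0 virtual bits yields 8/24, 16/16, and 32/0 real/virtual registers, respectively, matching the configurations evaluated here.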
CONCLUSION
This paper proposes a novel scheme to reduce register pressure by exploiting the data-forwarding mechanisms widely used in pipelined processors. We propose to add virtual bits to the operand fields of the ISA to create virtual registers that do not have physical storage locations in the register file. These virtual registers can be used to keep track of the data-dependence relations among instructions, making them suitable for short-lived variables without requiring physical storage in the register file. With the architectural extension to support virtual registers, we modify the region-based register allocation algorithm [Kim 2001] to exploit the virtual registers for the short-lived variables. Our experimental results indicate that virtual registers are very effective at reducing register spills and, in many cases, achieve performance close to that of a processor with twice the number of real registers. Also, our results demonstrate that, for some applications, exploiting 24 virtual registers with 8 real registers can attain even higher performance than using 16 real registers. While this work focuses on studying virtual registers for statically scheduled processors, we plan to apply virtual registers to dynamically scheduled processors to reduce both the register spills and the pressure on register renaming. In addition, we would like to study the effectiveness of virtual registers in windowed-register or multithreaded architectures, where the register pressure is high and using a large number of real registers can be prohibitive.
