An effective thermal management scheme, called active bank switching, for temperature control in the register file of a microprocessor is presented. The idea is to divide the physical register file into two equal-sized banks, and to alternate between the two banks when allocating new registers to the instruction operands. Experimental results show that this periodic active bank switching scheme achieves 3.4℃ of steady-state temperature reduction, with a mere 0.75% average performance penalty.
INTRODUCTION
Peak power dissipation and the resulting temperature rise have become a limiting factor to microprocessor performance and a significant component of its cost. Expensive packaging and heat removal solutions are needed to achieve acceptable substrate and interconnect temperatures in high-performance microprocessors. Current thermal solutions are designed to limit the peak processor power dissipation to ensure its reliable operation under worst-case scenarios. However, the peak processor power and ensuing peak temperature are hardly ever observed. Dynamic thermal management (DTM) has been proposed as a class of microarchitectural solutions and software strategies to achieve the highest processor performance under a peak temperature limit.
Most DTM methods are reactive due to the complex nature of temperature variation in a processor: DTM mechanisms become operational only when a certain triggering temperature is reached. For example, in [1], Skadron et al. introduced a number of DTM methods such as temperature-driven frequency scaling, localized toggling, and computation migration to spare hardware units. The same authors presented a hybrid DTM technique that combines fetch gating and dynamic voltage scaling (DVS) in [2]. Reference [3] described a DTM method based on feedback control theory, which determines the aggressiveness of the DTM response based on the distance of the triggering temperature from the emergency temperature. Recently, reference [4] introduced a predictive DTM method for multimedia applications in which instruction window resizing and switching among active functional blocks were used to achieve the desired temperature control. All of these methods characterize and/or predict the thermal behavior of a processor, typically on a functional block basis, calculate the power density of the functional blocks within a fixed time period, and apply their temperature control policies as needed.
It is known that the register file is the hottest block in a modern microprocessor chip [1] [4]. As such, full-chip DTM methods, such as fetch toggling and instruction cache throttling (where the number of fetched instructions is reduced as needed) [5] [6], have been used to control the register file temperature. A DTM method specifically targeted at temperature control in the register file was presented in [7]. This method, called activity migration, is quite effective, although it incurs a large area overhead.
In this paper, we present a DTM method that targets and effectively reduces the temperature of the register file. Our idea is based on the observation that the register file is not fully utilized over a program's execution, i.e., the lifetimes of registers/operands are short, so that only a rather small number of physical registers need to be active during most CPU cycles. Therefore, by dividing the physical register file into two equal-sized banks (one active bank and one sleep bank) and using these two banks alternately, the temperature of both banks can be reduced while incurring little performance penalty. This is similar to what the authors proposed in [7], except that we do not introduce a redundant register file structure. Instead, we divide the existing register file structure into two banks and alternate between them while monitoring and responding to the register file utilization of the application program. In addition to area savings, our method also avoids a processor-wide performance penalty in the sense of IPC degradation.
ACTIVE BANK SWITCHING BASED DTM

Register File Utilization
Many 32-bit instruction set architectures (ISAs) are designed to have 32 architectural registers, although modern superscalar processors have more than 32 physical registers. This discrepancy is handled by register renaming, which maps architectural registers to physical registers while considering data/control dependencies among the instructions. In practice, not all of the physical registers are in use all the time. In [8], Tran et al. showed that physical register usage is typically in the range of 40% to 60%. This low utilization arises mainly from the dependencies among instructions in the instruction window.
This work was sponsored in part by a grant from the CISE directorate of the National Science Foundation.
Figure 1 shows the utilization of physical registers according to our own simulation data (the simulation methodology will be explained later). Here the x-axis represents the number of physical registers in use as a percentage of the original register file size (64), whereas the y-axis represents the fraction of total execution time, as a percentage, during which that many registers are in use. For example, for mcf, 25% of the physical registers are actually in use during 42% of the execution time. Note that, on average, fewer than half of the physical registers (32) are allocated for about 90% of the time. Figure 2 shows the performance penalty if the register file size is cut in half (from 64 to 32). Notice that for djpeg, even though more than 32 registers are used for 25% of the execution time, the corresponding performance penalty is only 3%. This is because, although a newly dispatched instruction is allocated a physical register, much of the time the instruction cannot be issued and executed right away due to data dependencies among instructions.
Based on the above observation, we propose to divide the register file into two equal-sized banks and use only one bank at a time, i.e., the number of physical registers normally available is one half of the original count, and registers are allocated from one of these two banks. Here we designate the active bank as the primary bank and the other one as the secondary bank. Registers are allocated first from the primary bank; only if the primary bank is full is a register allocated from the secondary bank.
Since only a small number of physical registers are used during most of the execution time of typical programs, the secondary bank is in use for only a relatively short time. When bank switching occurs, there might still be some references to the inactive bank. However, these pending references will be few compared to the number of references to the active bank.
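The allocation policy described above can be illustrated with a minimal Python sketch. The class name `BankedFreeList`, its methods, and the lowest-numbered-first selection rule are hypothetical choices made for illustration only; they are not taken from the simulator used in this work.

```python
class BankedFreeList:
    """Toy free list for a physical register file split into two equal banks.

    Illustrative assumption: registers 0..bank_size-1 form bank 0 and the
    remaining registers form bank 1.
    """

    def __init__(self, bank_size=32):
        self.bank_size = bank_size
        self.free = [
            set(range(b * bank_size, (b + 1) * bank_size)) for b in (0, 1)
        ]
        self.primary = 0  # index of the currently active (primary) bank

    def allocate(self):
        """Return a free register, trying the primary bank first.

        Falls back to the secondary bank only when the primary bank is full,
        and returns None when no register is free in either bank.
        """
        for bank in (self.primary, 1 - self.primary):
            if self.free[bank]:
                reg = min(self.free[bank])  # deterministic choice for the sketch
                self.free[bank].remove(reg)
                return reg
        return None

    def release(self, reg):
        """Return a register to the free set of the bank it belongs to."""
        self.free[reg // self.bank_size].add(reg)

    def switch(self):
        """Swap the roles of the primary and secondary banks."""
        self.primary = 1 - self.primary
```

Registers that are still live in the old primary bank after `switch()` are simply left in place until released, which mirrors the pending references to the inactive bank mentioned above.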
Thermal Zones and Thermal Gradients
We have carried out a detailed analysis of the temperature zones in terms of thermal gradients and classified these zones into two classes: 1) Fast Temperature Rise (FTR) zone: the rising thermal gradient is higher than the falling thermal gradient, i.e., the temperature rises faster than it falls (when the chip is allowed to cool off). 2) Fast Temperature Fall (FTF) zone: the falling thermal gradient is equal to or higher than the rising thermal gradient, i.e., the temperature drops at least as fast as it rises. Note that DTM methods are most effective in the FTF zone. Based on our simulations (cf. Figure 3), the FTF zone lies above the FTR zone. This is fortunate because it means that DTM techniques become more effective as the chip temperature rises.
Figure 3 Thermal Gradients at Different Temperature Levels
Depending on the type of packaging and cooling solutions, the chip's critical temperature (CT, the temperature beyond which the chip may not function correctly or may even be damaged) may lie in either of these two zones. In the absence of any DTM technique, any application program running on a microprocessor chip will give rise to a steady-state temperature (ST) that depends on the program behavior, e.g., in terms of its CPI. The goal of our proposed DTM method is to minimize the chip ST while meeting a performance loss constraint. If the ST lies in the FTF zone, then DTM methods tend to work very well and the new ST of the chip will be significantly lower. Otherwise, DTM techniques are expected to be less effective.
Thermal Model
To mathematically support the periodic active bank switching idea, we use a thermal model developed by Skadron et al. in [3]. Based on this model, the temperature increase in the processor is represented by:
$$\Delta T = \frac{P \cdot R_{th} - T_{old}}{R_{th} \cdot C_{th}} \, \Delta t \qquad (1)$$
where $\Delta t$ is a time interval, $P$ is the average power dissipated in that interval, $R_{th}$ is the thermal resistance, $C_{th}$ is the thermal capacitance, and $T_{old}$ is the temperature at the start of the interval. After a time interval, the new temperature becomes:
$$T_{new} = T_{old} + \Delta T \qquad (2)$$
Let $t_{initial}$ and $t_{final}$ denote two instants of time, and let $\Delta t$ denote their difference. Then the rising thermal gradient with respect to time is:
$$\left. \frac{\Delta T}{\Delta t} \right|_{r} = \frac{P \cdot R_{th} - T_{old}}{R_{th} \cdot C_{th}} \qquad (3)$$
Hence, when the active bank is switched, the new active bank's temperature rises according to equation (3), whereas the other bank experiences a temperature drop. Taking the power dissipated in the inactive bank to be negligible ($P \approx 0$ in equation (1)), this temperature drop follows:
$$\left. \frac{\Delta T}{\Delta t} \right|_{f} = -\frac{T_{old}}{R_{th} \cdot C_{th}} \qquad (4)$$
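As an illustration of how the rising and falling gradients determine the FTR and FTF zones, the following Python sketch evaluates a single lumped RC node. It assumes the form ΔT = (P·R_th − T_old)·Δt / (R_th·C_th) for the temperature update, temperatures measured relative to ambient, and zero power in the idle bank; the function names and the numeric values in the usage note are hypothetical.

```python
def delta_t(power, t_old, r_th, c_th, dt):
    """Temperature change over one interval dt for a lumped RC node.

    Assumption: temperatures are expressed relative to ambient.
    """
    return (power * r_th - t_old) * dt / (r_th * c_th)


def rising_gradient(power, t_old, r_th, c_th):
    """Rate of temperature rise of a bank dissipating `power`."""
    return (power * r_th - t_old) / (r_th * c_th)


def falling_gradient(t_old, r_th, c_th):
    """Rate of temperature change of an idle bank.

    Assumption: the idle bank dissipates no power, so the result is <= 0
    for temperatures at or above ambient.
    """
    return -t_old / (r_th * c_th)


def zone(power, t_old, r_th, c_th):
    """Classify a temperature as 'FTR' or 'FTF'.

    FTR: the temperature rises faster than it falls.
    FTF: the temperature falls at least as fast as it rises.
    """
    rise = rising_gradient(power, t_old, r_th, c_th)
    fall = abs(falling_gradient(t_old, r_th, c_th))
    return "FTR" if rise > fall else "FTF"
```

Under these assumptions the two gradients have equal magnitude at T_old = P·R_th / 2, so with P = 10 and R_th = 2 the call `zone(10.0, 5.0, 2.0, 0.5)` yields `"FTR"` and `zone(10.0, 15.0, 2.0, 0.5)` yields `"FTF"`, consistent with the FTF zone lying above the FTR zone.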
Consider the case where the ST is above the BT (the temperature at the boundary between the FTR and FTF zones), i.e., the ST lies in the FTF zone. Conceptually, we would like to identify a trigger temperature (TT) such that BT ≤ TT ≤ ST and switch between the two banks as soon as the temperature of the active bank is about to exceed the TT. In practice, however, we have found it unnecessary to identify such a trigger temperature. A simple DTM policy in which we switch between the primary and secondary banks regularly (i.e., at fixed time intervals) is sufficient. We have found that a fixed interval of 10M CPU cycles is adequate for our purposes and that the overall reduction in ST is not sensitive to the specific value of this interval.
Note that the actual falling thermal gradient in the sleep bank is smaller in magnitude than that given by equation (4), since some of the registers previously mapped to this bank remain live for some cycles after switching. Similarly, the actual rising thermal gradient in the newly active bank is smaller than that given by equation (3), since some of the registers previously mapped to the sleep bank are still live and are accessed from that bank for some cycles. These effects reduce both gradients somewhat, but the underlying argument remains the same.
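The fixed-interval switching policy can be explored with a toy two-node simulation. The sketch below is a deliberately simplified stand-in, not the simulation flow used in this work: each bank is an independent lumped RC node updated with ΔT = (P·R_th − T_old)·Δt / (R_th·C_th), temperatures are relative to ambient, the idle bank is assumed to dissipate no power, and lingering references to the idle bank are ignored. All names and parameter values are hypothetical.

```python
def simulate_switching(power, r_th, c_th, dt, steps, period):
    """Simulate two banks that swap the active role every `period` steps.

    Returns a list of (active_bank, temp_bank0, temp_bank1) tuples, one per
    time step, with temperatures relative to ambient.
    """
    temps = [0.0, 0.0]  # both banks start at ambient
    active = 0
    history = []
    for step in range(steps):
        # Fixed-interval policy: no trigger temperature is consulted.
        if step > 0 and step % period == 0:
            active = 1 - active
        for bank in (0, 1):
            p = power if bank == active else 0.0  # idle bank: no power (assumption)
            temps[bank] += (p * r_th - temps[bank]) * dt / (r_th * c_th)
        history.append((active, temps[0], temps[1]))
    return history


if __name__ == "__main__":
    hist = simulate_switching(
        power=10.0, r_th=2.0, c_th=0.5, dt=0.01, steps=2000, period=50
    )
    peak = max(max(t0, t1) for _, t0, t1 in hist)
    print(f"peak bank temperature above ambient: {peak:.2f}")
    print(f"always-active steady state (P * R_th): {10.0 * 2.0:.2f}")
```

In this toy model each bank heats while active and cools while idle, so its temperature stays below the value P·R_th that a node would approach if it were active continuously. The sketch does not attempt to reproduce the measured reductions reported in this paper.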
Overhead
The banked structure in the physical register file is expected to need extra control logic, and the renaming logic needs to be changed to allocate new registers from the active bank only. However, these area penalties are much smaller than those of the activity migration method, which duplicates the entire register file. Furthermore, unlike the activity migration method, the periodic active bank switching scheme does not incur a self-induced performance penalty, since we do not need to transfer the contents of registers from one bank to the other. Table 1 reports the architectural configuration that was assumed in our simulations.
EXPERIMENTAL RESULTS
Micro-architecture Simulation Data
Methodology
For the experiments, we combine SimpleScalar [9], Wattch [10], and HotSpot [11]. The temperature data is generated every 50K cycles, and the initial and ambient temperatures are set to 60℃ and 45℃, respectively. For the floor-plan in our thermal simulation, we obtain a 2.6GHz Pentium IV 130nm floor-plan from [15], estimate/extract the area information for each of the functional units, and provide this information to our combined simulator.
Figure 4 Detailed Floor-plan for the Register File
The banked structure requires more detailed information about the register file. Figure 4 (a) shows the 'integer execution core' part of the tagged die photo obtained from [15]. As shown, the actual register file area is roughly half the size given in the original floor-plan. Hence, we halve the original register file area so that our floor-plan matches this more detailed description (cf. Figure 4 (a)). We position this half-sized register file in the center of the original register file area and keep the surrounding area void (cf. Figure 4 (b)). Since the original register file corresponds to 128 registers whereas our experiments use 64, we halve this area again (cf. Figure 4 (c)). For the banked structure, we divide the resulting register file area (cf. Figure 4 (c)) in half once more to represent two banks of 32 registers each (cf. Figure 4 (d)).
Our simulation setup is as follows. During the first 200K cycles of each benchmark run, we obtain a typical power figure for the register file (along with the other functional units). Next, we use these power figures to emulate the thermal simulation without actually simulating the application, by continuously feeding the typical power value to each functional unit. This thermal simulation is carried out in order to find the steady-state temperature of the register file. Once the steady state is found, we resume the actual thermal simulation of the application.
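The warm-up step, in which a constant power value is fed repeatedly until the temperature settles, can be sketched for a single lumped RC node as follows. This is an illustrative stand-in for the full thermal simulator: the function name `warm_up`, the convergence tolerance, and the numeric values in the usage note are assumptions, and temperatures are taken relative to ambient.

```python
def warm_up(power, r_th, c_th, dt, t_init=0.0, tol=1e-6, max_steps=1_000_000):
    """Feed a constant power value until the temperature stops changing.

    Returns (steady_temperature, steps_taken). The temperature update is
    the lumped RC step dT = (P * R_th - T) * dt / (R_th * C_th), so the
    node converges toward P * R_th above ambient.
    """
    t = t_init
    for step in range(max_steps):
        change = (power * r_th - t) * dt / (r_th * c_th)
        t += change
        if abs(change) < tol:  # per-step change is negligible: steady state
            return t, step + 1
    raise RuntimeError("temperature did not converge within max_steps")
```

For example, `warm_up(10.0, 2.0, 0.5, 0.01)` converges to approximately 20, i.e., P·R_th, after which a cycle-accurate run could resume from that temperature.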
For the test applications, we used the SPEC2000INT benchmarks [12] with reference/train input files, Mediabench programs [13], and an MPEG-2 decoder program [14]. The input files for Mediabench are custom made, and the input file for the MPEG-2 decoder program is obtained from [14]; the input files of all programs are shown in Table 3. Each program is compiled with the PISA compiler using the default optimization option. Two Linux machines were used as test platforms: an Intel Pentium IV 2.8GHz with 512MB of memory and an Intel Pentium IV 1.8GHz with 2GB of memory.
Experimental Results
First, we ran each application with a monolithic physical register file of size 64 and recorded the steady-state temperature. Then, we ran the application with a banked register file using active bank switching. In the banked register file, the total number of physical registers is still 64, but they are divided into two banks, each of size 32.
Table 2 shows, in the thermal reduction column, the difference in steady-state temperature between the monolithic and the banked register file. The average steady-state temperature reduction achieved by the active bank switching scheme is 3.4℃. Note the relationship between the steady-state temperature and the IPC of each program: as a program's workload increases, its steady-state temperature increases as well. Figure 5 shows part of the steady-state temperature behavior of the gcc program. The upper thermal curve corresponds to the monolithic register file, and the two lower thermal curves correspond to the two banks of the banked register file. Compared to the upper curve, the program's thermal behavior is preserved in the lower curves, and the periodic active bank switching can be observed between the two lower curves. Note also that the two lower curves lie nearly on top of each other, with very small thermal differences. Each point on the x-axis corresponds to 10M cycles. Table 3 shows the register file utilization in terms of the percentage of total execution cycles spent using 1/4, 1/2, and 3/4 of the register file, respectively. The performance penalties reported correspond to a half-sized (32) register file. Note that the low performance penalty is due to the low utilization of the register file.
Figure 5 An Example of Thermal Behaviors in gcc
CONCLUSION
We presented an effective steady-state temperature reduction method that adopts a banked structure in the register file. In our scheme, only one bank is active at a time, and we periodically switch the active bank between the two available banks. With banking, we achieve a steady-state temperature reduction with a small performance penalty. Our experimental results show that the periodic active bank switching scheme achieves a steady-state temperature reduction of 3.4℃ on average, with an average performance penalty of 0.75% compared with its monolithic counterpart.
