# A Floorplan-Aware Dynamic Inductive Noise Controller for Reliable 2D and 3D Microprocessors

Fayez Mohamood, Michael B. Healy, Sung Kyu Lim, Hsien-Hsin S. Lee

School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332

{fayez, mbhealy, limsk, leehs}@ece.gatech.edu

#### **ABSTRACT**

Power delivery is a growing reliability concern in microprocessors as the industry moves toward feature-rich, power-hungrier designs. To battle ever-increasing power consumption, microprocessor designers and researchers have proposed and applied aggressive power-saving techniques in the form of clock-gating and/or power-gating in order to operate the processor within a given power envelope. However, these techniques often lead to high-frequency current variations, which can stress the power delivery system and jeopardize reliability due to inductive noise $(L\frac{di}{dt})$ in the power supply network. In addition, with the advent of 3D stacked IC technology that facilitates the design of processors with much higher module density, the design of a low-impedance power-delivery network can be a daunting challenge. To counteract these issues, modern microprocessors are designed to operate under the worst-case current assumption by deploying adequate decoupling capacitance. With the lowering of supply voltages and increased leakage power and current consumption, designing a processor for the worst case is becoming less appealing. In this paper, we propose a new dynamic inductive-noise controlling mechanism at the microarchitectural level that limits the on-die current demand within predefined bounds, regardless of the native power and current characteristics of running applications. By dynamically monitoring the access patterns of microarchitectural modules, our mechanism can effectively limit simultaneous switching activity of nearby modules, thereby leveling voltage ringing at local power-pins.
Compared to prior art, our di/dt controller is the first that takes the processor's floorplan as well as its power-pin distribution into account to provide finer-grained control with minimal performance degradation. Based on evaluation results using 2D and 3D floorplans, we show that our techniques can significantly reduce the inductive noise induced by current demand variation and reduce the average current variability by up to 7 times with an average performance overhead of 4.0% (2D floorplan) and 3.8% (3D floorplan).

#### 1. INTRODUCTION

High-performance, power-conscious microprocessors exhibit varying current demands depending on the execution characteristics of a given program. For a high-frequency microprocessor, any abrupt change in current demand (referred to as di/dt) results in high-frequency inductive noise that leads to voltage ringing in the power-supply network, thereby posing a serious circuit reliability issue. This is especially a concern in high-frequency processors, where the supply voltage needs to respond and stabilize to varying current demands without violating stringent timing constraints. In the worst case, overshoot or undershoot in the power supply network can flip data values in the data path, resulting in incorrect computation. To address this reliability issue, processors are often over-designed, typically with an excessive amount of decoupling capacitors (decap) that can guarantee reliable operation under the worst-case current consumption scenario. However, for increasingly complex processors, inserting an excessive amount of decap enlarges the chip area and at the same time exacerbates leakage power. Moreover, worst-case design incurs significant design effort and cost merely to manage the infrequent cases where programs exhibit the maximum level of varying current demands during the course of execution.

Traditional CMOS technology scaling is one cause of high variability in current flow within a processor. As device dimensions keep shrinking, the supply voltage is reduced as well in order to meet the gate-oxide reliability requirement. This lowered supply voltage imposes a smaller absolute noise margin, exacerbating the inductive noise issue. On the other hand, the increasing number of available transistors on chip as well as the pursuit of ever-higher operating frequencies result in more power consumption. To mitigate power consumption and its ensuing thermal management problems, aggressive power-saving techniques such as clock gating and/or power gating have been widely studied and applied. Processors such as the Intel Pentium 4, Pentium M and IBM Power5 [14, 19, 3] use different levels of clock gating to dynamically disable portions of the circuits that do not change state. Meanwhile, the industry has acknowledged the di/dt issue caused by the extensive application of clock-gating and responded with architectural solutions. For instance, the L2 cache in the Power5 processor uses progressive clock-gating in different cache banks to mitigate the di/dt effect [14]. This is also one reason why ideal clock-gating, which limits power dissipation to active modules, is difficult to attain in practical designs.

Conventionally, the worst-case current consumption can be profiled and gauged by exercising power virus programs [10]. These programs are written with one goal in mind: varying the execution behavior from extremely high activity to almost none, inducing drastic fluctuations in current demand that stress the power-delivery network. Such exercises provide an approximation of the maximal supply-voltage overshoot or undershoot conceivable in a design. Designers then allocate the amount of decap needed to manage this worst-case voltage ringing, iterating until a given module is within the noise margin.
The drawback of such a design is that a significant amount of chip area (in the form of decaps) is devoted to covering those infrequent corner cases. For example, roughly 15 to 20% of the die area of the Alpha 21264 is reported to be occupied by decaps [9], and the trend is going up. Note also that these decaps will contribute a considerable amount of leakage power in future deep submicron processors. Clearly, the trade-off in future processor designs can be rather complicated in determining the degree of gating sustainable by a chip, the area overhead, and the additional leakage due to decaps.

To address these shortcomings of the worst-case design methodology, we propose a low-cost di/dt controlling mechanism embedded into the microarchitecture that dynamically limits high-frequency di/dt below a predefined threshold. Our design is intended to be integrated in the early microarchitecture planning stages in order to facilitate the design of a processor for the average-case current consumption scenario. The main contributions of this work include the following.

- We present the use of decay counters as a simple mechanism to monitor the access pattern of each microarchitectural module in order to prevent unsteady self-switching activity.
- We propose a novel microarchitectural technique using a queue-based dynamic di/dt controller to avoid excessive simultaneous (or correlated) gating of modules that share the same local power-pins on the processor's power delivery network.
- To simultaneously avoid performance loss and reduce the high-frequency di/dt effect, we present the integration of *Preemptive ALU Gating* into our queue-based dynamic di/dt controller.
- Finally, to achieve fine-grained di/dt control for large modules, we present an enhancement to perform progressive clock-gating without violating the current demand threshold.

Unlike prior techniques [10, 15, 16, 22, 23, 24, 25, 26] that largely aim at providing chip-level di/dt control, our technique monitors and controls di/dt by leveraging spatial information of modules obtained from a given floorplan and its power-pin distribution. Inductive noise is highly dependent on the chip floorplan, which determines the relative location of functional modules and their distance from the power-pins. Since power-pins are stressed non-uniformly across the power supply network, certain modules are more susceptible to inductive noise than others. Hence, a chip-level solution is too coarse-grained and cannot account for the fact that certain power pins are unaffected by a distant module. For the same reason, such designs are also likely to generate many false alarms, resulting in undesired performance degradation. In contrast, by guaranteeing the prevention of simultaneous gating of modules that share the same power-pins, our proposed technique can accurately limit the current demands to be within designated bounds.

The rest of the paper is organized as follows. We begin with an outline of prior art and its limitations in Section 2. Section 3 describes power delivery issues faced by modern processor designs and the inefficacy of worst-case design. Section 4 describes the design of our dynamic di/dt controller for 2D and 3D stacked IC processors. Section 5 discusses our experimental methodology and setup, followed by the evaluation in Section 6. Finally, Section 7 concludes.

#### 2. RELATED WORK

The microarchitecture community has recently paid notable attention to di/dt issues, due largely to power-saving techniques such as clock-gating, and has proposed solutions to characterize and address them. The work done by Grochowski et al. [10] at Intel was one of the first thrusts to illustrate the criticality of di/dt and propose solutions from the perspective of architects.
Their work showed that applications exhibit current characteristics that vary over a large range, and outlined how a microarchitectural solution with a feedback control mechanism can be used to improve current demands. The objective of their proposed mechanism is to dynamically estimate supply noise violations by performing current-to-voltage calculations on the chip and to throttle instruction fetch or issue upon the detection of a reliability emergency. This work, however, addressed the mid-frequency di/dt problem at the chip level, and the architecture presented is incapable of altering current demands over a small clock-cycle interval (e.g. less than 25 clock cycles). In contrast, our work is targeted at mitigating the high-frequency di/dt issue in order to enable average-case microprocessor design and reduce the on-die decap budget.

In [15, 16], Joseph et al. analyzed power supply response and controlled voltage emergencies in a processor via microarchitectural techniques. They concentrated on the worst-case di/dt issue, which occurs at the resonance frequency of the power supply, and proposed a solution to mitigate the mid-frequency di/dt problem in the range of 50-100 MHz. Similarly, Powell and Vijaykumar also proposed solutions to mitigate the reliability issues caused by resonant frequency. In [26], their technique exploits the resonant behavior of the power supply, whereby the processor current is dynamically altered to a non-resonant frequency to avoid the worst-case voltage drops and spikes. Another technique they proposed, Pipeline Damping [24], throttles microarchitectural activity at the front-end and the back-end to alter the current surges at the resonant frequency.

In contrast to purely hardware-based techniques, Hazelwood and Brooks proposed a hybrid hardware/software solution to address the mid-frequency di/dt issue [11]. Their solution is meant to be an add-on to existing mid-frequency di/dt controllers.
Since purely hardware-based di/dt controller techniques incur a performance impact in the case of an emergency, their work aimed at addressing this deficiency by performing dynamic optimization to alter the program code that induces large di/dt oscillation. They illustrated that existing compiler techniques, e.g., software pipelining, code motion and instruction padding, can modify program behavior that causes di/dt at the resonance frequency while avoiding performance overhead.

It is important to note that all the above-mentioned work tackles the mid-frequency di/dt issue (50-100 MHz) and assumes the availability of adequate on-chip decap to counter high-frequency di/dt effects. In contrast, the focus of our work lies in mitigating high-frequency di/dt, which requires an immediate response, thus rendering the above solutions inappropriate. Our primary objective is to advocate average-case design and curb the ever-growing amount of on-die decap, which not only increases the overall chip area but also aggravates leakage power. Toward this end, Powell and Vijaykumar also proposed a microarchitectural technique called Pipeline Muffling [25], which controls instruction issue and limits the number of resources in use. However, the high-frequency di/dt problem is extremely dependent on the spatial distribution of modules across the floorplan and their distances from the power-pins on a power supply grid. In addition, the high-frequency di/dt problem is not only dependent on a given module's activity, but also correlated to gating events that stress nearby power-pins. Existing works in this area do not account for this fact, which could result in either violated current demand guarantees or many false alarms. Besides this, Tang et al. [31] proposed controlled ramping of Floating Point Units alone by scanning the Instruction Fetch Queue for upcoming instructions. This work has two limitations.
First, it only deals with FPUs. Second, the di/dt effect of an instruction varies depending on how it proceeds through different pipeline stages. In addition, similar to the other works in this area, neither the floorplan nor the power-pins are considered.

Outside the microarchitecture domain, several solutions were proposed at the circuit level to address high-frequency di/dt [6, 21, 32]. The main objective in most of these techniques is to reduce the impedance of the path to individual modules in a processor, minimizing voltage surges and dips. Floorplanning algorithms with the objective of minimizing inductive noise were studied recently [20, 6, 7]. However, these are completely static solutions. Although they can improve the average-case noise problem, the chip must still be guaranteed against worst-case events. This is mainly because static solutions cannot exploit or react to program behavior. It is important to note that our technique is complementary to such circuit solutions that try to improve the average-case voltage swing. An ideal design will involve optimizing a floorplan and its power supply network for the average-case inductive noise and integrating our dynamic controller to prevent the worst-case current demand at a given power pin domain. Since worst-case program behavior is infrequent, our low-overhead technique can trade off nominal performance in order to meet the current demand threshold.

# 3. HIGH-FREQUENCY INDUCTIVE NOISE ISSUES

To address inductive noise issues that result from abrupt changes in the dynamic current demands, designers typically target a low-impedance power delivery network by deploying adequate decoupling capacitors. In order to meet the impedance target across a wide range of frequencies, multistage decoupling capacitors are necessary. High-frequency noise is handled by decaps that are distributed throughout the die.
Medium-frequency decaps are typically placed on the land side of the package, as close as possible to the motherboard, to facilitate the lowest possible impedance. Finally, bulk capacitors on the motherboard address the low-frequency current fluctuations. Since our work targets the high-frequency di/dt issue, this section describes some key aspects that are responsible for exacerbating high-frequency di/dt in future designs.

#### 3.1 Sources of High-Frequency Inductive Noise

General purpose processors run a wide variety of applications; the current profile for each application varies depending on many factors. Generally, applications with high ILP exhibit constant use of all modules across the processor, resulting in less current variability. In contrast, applications that oscillate between high and low levels of activity display a more irregular current profile. With dynamic clock-gating of idle functional units, the abrupt current variation is even more prominent. The current profile also correlates closely to program phases. For instance, a program might consistently perform simple arithmetic operations in a certain phase, leaving little activity in the caches. However, from a fine-grained perspective, even consecutive instructions can vary the current demands substantially if the functional units they exercise are completely different. Although a program may appear to be in a consistent phase with a regular instruction profile for thousands of cycles, minute irregularities between consecutive instructions can still cause an unexpected current surge or dip, resulting in detrimental voltage spikes. Furthermore, the exact same instruction can generate a different current profile due to dynamic effects like cache misses. For instance, a LOAD instruction that hits in the cache and the same instruction that misses the next time will create different module access patterns and clock-gating activity.
Therefore, understanding and exploiting the current profiles of applications requires much finer-grained control.

Microarchitectural modules have non-uniform current demands, and it is critical to create a low-impedance path to modules demanding high current. Similarly, modules that induce high current fluctuation can create a greater burden on the power supply if they are placed close to each other and have a high probability of switching simultaneously. Note that simultaneous switching events in the same direction raise a major issue for power delivery. A floorplan that is resistant to inductive noise tries to generate a well-balanced layout that distributes the current demand more evenly across the power-supply grid [21, 20, 7]. However, floorplanning is a static solution. While a noise-aware floorplan can mitigate di/dt effects to a certain extent, it is still unable to completely eliminate dynamic reliability emergencies due to high-frequency inductive noise.

<sup>1</sup>Current designs target sub-milliohm impedance.

## 3.2 Implications of Inductive Noise in 3D-IC Processors

The emerging 3D-IC technology enables multiple die to be stacked in the vertical direction with high-density interconnect vias. The pitch of these vertical die-to-die vias can be as small as a few microns [8], substantially reducing the communication latency, in particular for the global interconnects. The 3D stacking can be done either through face-to-face (F2F) or face-to-back (F2B) bonding [4]. F2F bonding allows higher-density interconnects since vias are masked and deposited on top of the metal layer using conventional manufacturing technologies. F2B, on the other hand, requires vias to be etched through the backside; however, it allows an arbitrary number of die to be stacked. Both F2F and F2B bonding in 3D-IC technology permit higher-density and faster interconnects, enabling more functionality to be packed in a given area.

For high-performance microprocessor designs, 3D stacking is a promising and appealing solution for the worsening scalability of global communication on die. For instance, future multi-core or tera-scale integration designs can leverage 3D-IC technology to fabricate functional units such as caches, ALUs, or instruction schedulers on distinct layers in order to facilitate uniform access times [18, 27, 28, 29]. The important point to note is that the module density is much higher in a 3D-IC within a smaller die footprint. Since the number of power/ground layers is limited to the shrunk die area that can grow in the third dimension, the design of a well-balanced 3D floorplan becomes even more challenging. Furthermore, the traditional techniques of designing for the worst case are counterproductive in 3D-IC design. This is because over-designing 3D-ICs with decaps that consume considerable chip area will negate the inherent benefits, such as reduced area and lower wire power, provided by 3D-IC technology.

#### 3.3 Quantifying Module Activity

To effectively manage and prevent high-frequency voltage ringing at the microarchitectural level, it is imperative to understand the simultaneous switching behavior among different microarchitectural blocks and their locations relative to each other (i.e. the sharing of power-pins) on a given floorplan. To understand the switching behavior of microprocessor modules, we quantified two metrics, namely the self-switching activity and the correlated or simultaneous-switching activity of modules.

The self-switching measurement quantifies the number of gating occurrences in the processor for a benchmark during the profiling period. Both gating on and gating off are considered likely to cause current fluctuation. The objective of this metric is to isolate the microarchitectural modules with high switching activity. In addition, the intensity of the gating activity also depends on the current consumption of each module.
In other words, even if a module switches less frequently than the others, it can still induce intolerable noise if it draws a significant amount of current. The number of switching events and the current consumption per cycle (called the intensity of the switch) are combined into a single weight. If $sw_i$ represents the raw number of switching events for module $i$ and $I_i$ is the intensity of the switch, then the self-switching factor $\alpha_i$ is given by the following relationship.

$$\alpha_i = sw_i \times I_i \qquad (1)$$

Figure 1: Correlated Switching Matrix

Correlated switching events are gating events in the same direction, i.e. both modules are clock-gated ON or OFF simultaneously. To measure correlation, we capture the inter-cycle gating direction of each module in the profiling process. Each module is then paired with every other module in the processor and checked for simultaneous gating in the same direction. The result is an upper triangular correlation matrix in which each location represents the number of simultaneous gating events encountered. Figure 1 illustrates how correlated switching events are calculated relative to the modules. In the matrix, $X_{ij}$ is the number of raw correlated switches that occurred over the profiling duration and $sw_i$ is the number of self-switching events for module $i$. Note that the correlation metric $X_{ij}$ considers only the two modules in question. The upper bound of 100 indicates a perfect correlation, i.e. both modules $i$ and $j$ switched simultaneously every single time. The forward diagonal of the same matrix represents the self-switching factor, $\alpha_i$, for each module.

Using eight SPEC2000 INT benchmark programs,<sup>2</sup> Table 1 shows the switching correlation as a matrix for the 23 microarchitectural modules considered in our processor model. The (shaded) diagonal of the matrix represents the self-switching factor.
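The two metrics can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-cycle ON/OFF traces, the module names, and the intensity values are hypothetical, and because the exact normalization of Figure 1 is not reproduced in the text, the percentage below uses one assumed form, $100 \cdot X_{ij} / \max(sw_i, sw_j)$, chosen because it reaches 100 only when both modules switched together every single time.

```python
from itertools import combinations

def gating_directions(trace):
    """Inter-cycle gating direction: +1 for OFF->ON, -1 for ON->OFF, 0 for no change."""
    return [int(b) - int(a) for a, b in zip(trace, trace[1:])]

def switching_profile(traces, intensity):
    """Return (sw, alpha, corr) for per-module ON/OFF traces (1 = clocked, 0 = gated)."""
    dirs = {m: gating_directions(t) for m, t in traces.items()}
    # sw_i: raw number of gating events (both ON and OFF transitions count).
    sw = {m: sum(1 for d in ds if d != 0) for m, ds in dirs.items()}
    # Equation (1): alpha_i = sw_i * I_i.
    alpha = {m: sw[m] * intensity[m] for m in traces}
    corr = {}
    for i, j in combinations(list(traces), 2):  # upper triangle only
        # X_ij: cycles in which both modules gate in the same direction.
        x_ij = sum(1 for a, b in zip(dirs[i], dirs[j]) if a != 0 and a == b)
        denom = max(sw[i], sw[j])
        # ASSUMED normalization, not taken from the paper's Figure 1.
        corr[(i, j)] = 100.0 * x_ij / denom if denom else 0.0
    return sw, alpha, corr

# Hypothetical five-cycle traces and intensities, for illustration only.
traces = {"bpred": [0, 1, 1, 0, 1], "btb": [0, 1, 1, 0, 1], "alu": [1, 1, 0, 0, 0]}
intensity = {"bpred": 2, "btb": 1, "alu": 3}
sw, alpha, corr = switching_profile(traces, intensity)
```

In this toy input the branch predictor and BTB always gate together, so their pair scores 100, while the ALU never gates in the same cycle as either of them.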
As observed from Table 1, certain modules switch far more frequently than others. On the other hand, the weights of the modules that are likely to be accessed every cycle (and are therefore mostly turned on), such as the L1 I-Cache and the I-TLB, are lower. Some modules with smaller weights are dormant, e.g. the floating-point register file (Freg), which is accessed only once in a long while. In addition, as expected, the branch predictor and BTB, the I-Cache and I-TLB, and the D-Cache and D-TLB are all highly correlated pairs of modules. It is also observed that the first six ALU modules are highly correlated, since concurrency exists in integer instructions. The design of our high-frequency dynamic di/dt controller is mainly based on this intrinsic switching behavior of microarchitectural modules.

Table 1: Self and Correlated Switching Weights of All Modules

<sup>2</sup>Since correlation profiling was compute intensive, we used the following subset of benchmarks for this motivational data: 256.bzip, 186.crafty, 252.eon, 254.gap, 164.gzip, 181.mcf, 253.perl and 300.twolf.

<sup>3</sup>This is because Table 1, used as a demonstration, only profiles integer benchmark programs.

Our technique addresses the high-frequency inductive noise issues directly caused by clock-gating. Clock-gating is a well-established method for dealing with increasing power concerns and thermal pressure. The main reliability issue associated with clock-gating is that there is no deterministic or predictive way of determining whether it is reliable to gate off modules without inducing hazardous current surges. In addition, conventional clock-gating techniques have no knowledge of adjacent modules or of the extent of correlated clock-gating activity.
Our microarchitectural-level high-frequency di/dt controller is based on these fundamental observations about the clock-gating activity of modules, their correlation with adjacent modules, the module locations in a floorplan, and the power-pin distribution of the chip.

# 4. A FLOORPLAN-AWARE, QUEUE-BASED DYNAMIC DI/DT CONTROLLER

In order to address inductive noise issues due to high switching activity in the processor, we now present the design of our dynamic di/dt controller, which aims to improve the current profile of a processor regardless of program behavior. Our design is easily customizable, enabling a given design to achieve the right balance among dynamic di/dt control, power consumption, and performance overhead. The primary components of the di/dt controller include the following:

- A low-overhead, modular, decay counter-based clock-gating mechanism. The objective of the decay counters is to throttle excessive self-gating activity of modules.
- A floorplan-aware clock-gating queue that selectively disables simultaneous switching of modules in the same direction. The queue-based controller is designed to limit the maximum current surge or dip for a given set of power-pins shared by several modules on the power supply grid.
- Preemptive activation of ALUs through pre-decoding for simultaneous di/dt and performance enhancement.
- An enhancement to the queue controller that enables progressive clock-gating on large modules like L2 cache banks.

#### 4.1 Decay Counter based Clock-Gating

The key to avoiding clock-gating induced noise lies in identifying program phases to see whether it is reliable, at a particular moment, to gate off an entire microarchitecture module. Although certain elaborate techniques can accurately predict module requirement patterns, clock-gating requires low-overhead mechanisms to justify the extra hardware cost [19].
To allow a dynamic clock-gating technique that is low-overhead and yet provides a tunable form of di/dt control, we propose the use of decay counters. By using low-resolution decay counters to monitor module access patterns, we can choose to save power only during long stretches of inactivity. To illustrate this, Figure 2 presents an example that quantifies the fine-grained access patterns of certain processor modules over a small simulation period. The figure shows an example access pattern profile for the branch predictor, the L1 I-Cache, an Integer ALU and the Integer Register File for the bzip benchmark. The 200-cycle interval is shown here to illustrate the potential high-frequency di/dt effects from a fine-grained perspective. Note that the decay counter does not require a specific access pattern, such as the ones presented in the figure, to eliminate unnecessary switching activity. The 200-cycle profile of modules with varying access patterns is merely used to illustrate the significance of employing decay counters in our design.

Typically, it is observed that a module that is inactive for more than 10-12 cycles is likely to remain dormant for an extended period of time. Clearly, there is a threshold cycle count beyond which a module can be gated off reliably with the least likelihood of encountering high-frequency inductive noise. Almost always, when a module has been idle for fewer than 5-10 cycles, it is highly likely to be accessed again within the next few cycles. A decay counter is employed to exploit this behavior by enabling clock-gating only when a minimum turn-off threshold has been exceeded. We use a 4-bit decay counter for each microarchitecture module inside the processor that only permits clock-gating of a module if it has not been accessed during the last 16 cycles.
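A minimal behavioral sketch of such a counter is shown below. The class and method names are hypothetical, and the exact boundary is an assumption: the counter is reset to its maximum (15) on an access, decrements on every idle cycle, and gating is permitted on the idle cycle after it has reached zero, so that a 4-bit counter yields the 16-cycle idle window described above.

```python
class DecayCounter:
    """Behavioral model of a per-module decay counter (names are illustrative)."""

    def __init__(self, bits=4):
        self.max_value = (1 << bits) - 1   # 15 for a 4-bit counter
        self.value = self.max_value

    def tick(self, accessed):
        """Advance one cycle; return True if clock-gating is permitted this cycle."""
        if accessed:
            # Any access resets the counter to its maximum and keeps the module on.
            self.value = self.max_value
            return False
        if self.value == 0:
            # ASSUMPTION: gating is allowed on the idle cycle after reaching zero,
            # i.e. after 16 consecutive idle cycles for a 4-bit counter.
            return True
        self.value -= 1
        return False


counter = DecayCounter(bits=4)
counter.tick(accessed=True)
# 16 consecutive idle cycles: only the last one permits gating.
idle_results = [counter.tick(accessed=False) for _ in range(16)]
```

Widening the counter (the `bits` parameter here) lengthens the idle window before gating, which is the tuning knob between noise control and power savings.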
For any given module, the counter decays unless an access is made to that module, in which case the decay counter is reset to the maximum. The resolution of the decay counter provides the trade-off between high-frequency inductive noise control and power dissipation. A large decay counter will further smooth out current spikes over time, but at the cost of higher average power consumption, because modules will be gated off only after a long interval of inactivity. The opportunity for power saving also depends on the module access pattern. In Figure 2, it can be seen that certain modules like the branch predictor or I-ALU exhibit larger potential for power savings than others that display high activity, like the Integer Register File.

### 4.2 A Floorplan-aware Queue Based di/dt Controller

Even though the decay counter can provide a smoother current profile for each module by eliminating unwanted switching activity, it is inherently incapable of avoiding di/dt issues caused by simultaneous gating of modules that share common power pins. To address this shortcoming, we propose a queue-based controller based on the processor's floorplan and power-pin distribution. In the processor's power-delivery network, a module usually draws more current from spatially closer power pins, in other words, following the path(s) with the lowest impedance. Consequently, adjacent modules will unreliably stress local power-pins if they switch simultaneously in the same direction. Therefore, in order to guarantee a bound on the maximum current ramp at a given time, it is necessary to be able to dynamically alter the simultaneous gating of modules that stress the same power-pin(s).

The proposed queue-based controller is designed to overcome unreliable simultaneous switching of adjacent modules. The salient features of the controller are described as follows.

- A static queue with an entry for each module sharing the same power-pin domain.
Ideally, there will be no more than eight entries in a queue, resulting in a 3-bit module identification number that is local to each queue<sup>4</sup>.
- Every queue entry holds the state of the corresponding module, which indicates either the current state or any requested clock-gating transition event. This requires 2 bits to encode the ON and OFF states as well as the ON→OFF and OFF→ON transitions. The state is used to drive the pre-wired clock-gating signals to the corresponding modules.
- Every queue entry that represents a module also has an associated *integer weight* that is proportional to the current consumed by the corresponding module. We use a two-bit integer to represent one of four different current consumption levels. Since weights are used to compute and check for current demand violations, integer weights are appropriate for faster current demand calculations. Fast calculations are essential for a quick response to high-frequency di/dt.

<sup>4</sup>The number of entries is limited to minimize the performance loss, as explained in Subsection 4.2.

Figure 2: Module Access Patterns

Figure 3: di/dt Controller Architecture

The high-frequency di/dt controller architecture is depicted in Figure 3. The "+" signs on the chip floorplan (left-hand side) indicate the power-pin locations on the power delivery network. For simplicity, we illustrate only four power-pins. The queue-based controller works in the following manner. The decay counters signal a transition event, e.g. ON→OFF, for a given module in the queue. Let $\delta$ be the current demand threshold that is permitted for a given power-pin domain. At any given time, a head pointer points to one single module in the queue. Every cycle, the queue is traversed by a sliding window whose total weight is $\gamma$.
The value of $\gamma$ is the largest sum of weights of consecutive modules that are in the transition states (ON $\rightarrow$ OFF or OFF $\rightarrow$ ON), such that $\gamma \le \delta$. Since integer weights can be negative as well<sup>5</sup>, the sliding window will attempt to permit the maximum allowed transitions without violating the maximum current demand constraint.

To better understand the di/dt queue-controller mechanism, we use an example based on the instantaneous state of the controller shown in Figure 3. Let us assume that the current demand threshold is $\delta=3$. In the figure, ALU-2 and ALU-3 are gated off (indicated by the bold arrow that is the output of the queue controller). Both Bpred and ALU-1 have an activation request, indicated by the OFF $\rightarrow$ ON state. Therefore, the combined weight of the sliding window is $\gamma = 3$.<sup>6</sup> The queue controller will therefore permit both module gating events to occur, since the threshold constraint is not violated in this case. After servicing the transitions, the head pointer will traverse two entries and point to the ALU-2 entry in the queue. In contrast, consider an alternate case where ALU-1 had a higher weight that caused the weight of the sliding window to exceed the current threshold budget. In this case, only the Bpred transition will be serviced by the queue controller, and the head pointer will traverse only one module entry to ALU-1, so that it can be serviced in the next cycle. Furthermore, consider yet another example where ALU-1 requires an ON→OFF transition, which represents a negative weight. In this case, $\gamma=1$, thus still permitting both Bpred and ALU-1 to perform their transitions. Moreover, the sliding-window weight is still below the threshold $\delta$, and the queue controller can potentially gate the next module, ALU-2, if it requires a transition.
These examples illustrate how the sliding window adjusts dynamically based on the worst-case current demand that can be sustained in a given power-pin domain.

The example di/dt queue in Figure 3 shows the modules in descending order of weights. Note that the di/dt controller will enforce the current demand threshold regardless of the order of the modules in the queue. However, the ordering of modules does affect the performance overhead imposed by the design. For instance, clustering modules with high weights together in the queue will create a larger performance overhead, since multiple modules will not be permitted to transition together because they consistently violate the current demand threshold. The ordering of modules in the queue is static and presents a design choice that needs to be made by an architect for a given floorplan.

Note that the queue in our di/dt controller is different from a typical queue structure like the Instruction Fetch Queue, a memory structure allocated at run-time. In contrast, the entries in the di/dt controller queue are pre-wired for each module at the design phase in order to simplify the logic for driving clock-gating signals directly to the modules<sup>7</sup>. Functionally, however, the controller works like a circular queue that traverses as many modules as determined by the sliding window threshold.

<sup>5</sup>OFF→ON is a positive switch, while ON→OFF represents a negative switch.

<sup>6</sup>Please note that in a real implementation, the sliding window will have an upper limit on how many module weights can be computed in a given cycle.

Note that the maximum hardware overhead for each microarchitectural module is merely 11 bits (including the decay counter), which is consistent with a 3-bit module identification number, a 2-bit state, a 2-bit weight, and the 4-bit decay counter used in our experiments. This is rather negligible in terms of additional power dissipated and the extra current drawn by the controller itself.
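The sliding-window servicing described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the hardware implementation. The names `Entry`, `service_cycle`, and `max_window` are hypothetical. The weights of Bpred (2) and ALU-1 (1) are chosen only so that the sums match the worked example above. The window is assumed to stop at the first entry with no pending transition, an idle head entry is assumed to be skipped at one entry per cycle, and a transition is granted when the running sum stays at or below $\delta$, as in the example where a window weight of 3 is permitted with $\delta = 3$.

```python
from dataclasses import dataclass

# Four 2-bit states per queue entry (names are illustrative).
ON, OFF, TURN_ON, TURN_OFF = "ON", "OFF", "OFF->ON", "ON->OFF"


@dataclass
class Entry:
    name: str
    weight: int  # small integer weight, proportional to module current
    state: str   # one of ON, OFF, TURN_ON, TURN_OFF


def signed_weight(entry):
    """OFF->ON adds current demand (positive); ON->OFF removes it (negative)."""
    if entry.state == TURN_ON:
        return entry.weight
    if entry.state == TURN_OFF:
        return -entry.weight
    return 0


def service_cycle(queue, head, delta, max_window=4):
    """Service one cycle; return (names of granted modules, new head index)."""
    granted, gamma, pos = [], 0, head
    for _ in range(min(max_window, len(queue))):
        entry = queue[pos]
        if entry.state not in (TURN_ON, TURN_OFF):
            break  # assumption: the window covers consecutive pending entries only
        if gamma + signed_weight(entry) > delta:
            break  # granting this transition would exceed the threshold
        gamma += signed_weight(entry)
        entry.state = ON if entry.state == TURN_ON else OFF
        granted.append(entry.name)
        pos = (pos + 1) % len(queue)  # circular traversal
    if not granted and queue[head].state in (ON, OFF):
        pos = (head + 1) % len(queue)  # assumption: skip an idle head entry
    return granted, pos


# Worked example: delta = 3, Bpred and ALU-1 request OFF->ON.
queue = [Entry("Bpred", 2, TURN_ON), Entry("ALU-1", 1, TURN_ON),
         Entry("ALU-2", 1, OFF), Entry("ALU-3", 1, OFF)]
granted, head = service_cycle(queue, head=0, delta=3)
print(granted, queue[head].name)  # prints ['Bpred', 'ALU-1'] ALU-2
```

Raising the hypothetical weight of ALU-1 to 2 reproduces the second case in the text: only Bpred is granted and the head pointer advances by a single entry, so ALU-1 is serviced in the next cycle.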
#### 4.3 Preemptive ALU Gating

Preemptive ALU clock-gating through pre-decoding instructions is another technique we propose to prevent unnecessary gating activity. Decay-counter-based clock-gating allows gating events to occur based on the history of module accesses. However, decay counters by themselves cannot predict whether a module will be required in the immediate future by a recently fetched instruction. For instance, it is detrimental to performance if an ALU is about to be gated off due to a saturated decay counter when in fact an incoming ALU instruction has just been fetched. Furthermore, if an ALU instruction is on its way, it makes sense to leave the unit "on" even from a di/dt perspective.

To achieve this goal, we include preemptive turn-on gating of ALU modules by pre-decoding instructions. In a typical RISC ISA, the opcode can be determined by observing the first few bits of the instruction<sup>8</sup>, allowing us to pre-decode this information simultaneously with the instruction fetch. When an ALU instruction has been detected early on, this information is used to override the decay-counter turn-off request. In a CISC ISA, a simple pre-decode might not be easily possible due to variable-length instructions, but even so, other techniques such as storing pre-decode information in the L1 Instruction Cache [2] can be used to achieve this effect.

#### 4.4 Enhanced Progressive Gating of Large Modules

Even though simultaneous gating of multiple modules can be prevented completely by selective gating for a given set of power pins in a power-delivery network, some monolithic modules like the L2 cache can still draw large currents, resulting in unreliable voltage swing. For this reason, certain processors employ progressive gating of large modules like the L2 cache in order to mitigate di/dt effects [14].
However, ad-hoc progressive gating does not prevent other adjacent modules from switching simultaneously and can still result in unreliable di/dt surges. To counteract this issue, our queue-based controller can be used to generate multiple clock-gating domains even for a single monolithic module, simply by giving the module multiple queue entries with smaller weights. For instance, for a banked L2 cache, the queue can contain as many entries as there are banks, each with a proportionally lower weight. Since the queue inherently throttles simultaneous switching activity, it presents a much more effective progressive gating mechanism than current solutions. Thus, the queue-based controller can enable efficient progressive gating of such modules, while maintaining the noise-tolerant current demand threshold through mitigation of simultaneous switching effects.

#### 4.5 Pipeline Design Implications

The employment of any dynamic di/dt controller requires an appropriate performance throttling mechanism to guarantee program correctness even if certain necessary processor components are unavailable when needed. For instance, the instruction scheduler needs to be accurately aware of ALU availability before issuing operations. Integrating a di/dt controller into a conventional architecture therefore requires the pipeline logic to be accurately aware of the clock-gating state of each module, in order to issue operations without affecting correctness. For this reason, it is essential that the di/dt controller not impose impractical design implications on the processor pipeline.

Our queue-based high-frequency di/dt controller can be easily built into a conventional out-of-order pipeline without significant additional complexity. Conventional processor modules are already capable of correctly operating under resource contention.
In the event of resource hazards such as over-subscription of ports in the register file, caches, or load-store queue, the selection logic will appropriately delay certain operations from issuing. As indicated in Figure 3, our queue has static entries and pre-wired logic that indicates the availability of any given module. This makes it efficient to integrate the additional resource availability constraint into the existing selection logic in the pipeline. Since resource availability can be directly interpreted from the output of the queue-based controller, an enhanced pipeline with the di/dt controller merely needs to ensure that the resource availability constraint overrides all conventional hazards for correct functionality.

#### 5. EXPERIMENTAL METHODOLOGY

Because our design leverages spatial information about modules and power pins from a given chip floorplan, we now briefly describe the floorplanning algorithms we employed to create our 2D and 3D-IC floorplans. The specific floorplans we used are independent of running applications, i.e., no profile-guided optimizations were employed in the floorplanning algorithms. The obtained floorplans, along with the predefined power-pin distribution, determined the configuration of the queue entries in the dynamic di/dt controller.

### 5.1 Floorplanning for 2D and 3D-IC Processors

Although the design of our dynamic di/dt controller is general enough to be independent of any floorplan, the queue configuration is determined by module locations relative to power pins. The architecture of the queue-based controller nonetheless applies to any given floorplan to keep di/dt fluctuation within reliable bounds.
To gain more insight into our floorplanning process and the module placement, we briefly describe the basics of the floorplans and provide details on how they were obtained. We assume the same set of microarchitecture modules for our 2D and 3D floorplans. Both floorplans contain 23 modules whose areas are determined by the machine configuration presented in Table 2. Since the procedures for floorplanning a processor onto a 2D and a 3D plane are radically different, the techniques used in each case are described separately. The objective function used in all cases was a weighted combination of wirelength and area.

<sup>7</sup>Since the queue entries are pre-wired to the clock-gating output, it is possible to apply certain heuristics to the order of modules in the queue with asymmetric weights, in order to permit the maximum possible transition at a given time. Such optimizations, however, are out of the scope of this work.

<sup>8</sup>For example, the Alpha and PowerPC ISAs use the first 6 bits for the opcode.

<sup>9</sup>Typically, L2 cache banks are in separate clock-gating domains.

Figure 4: Illustration of our 3D floorplanning. (a) Initial block list, (b) Layer partitioning, (c-d) LP-based 3D slicing floorplan, (e) Floorplan refinement.

The goal of floorplanning is to determine the width, height, and x/y location (for 2D) or x/y/z location (for 3D) of the microarchitectural modules. The objective is to minimize a weighted sum of the maximum module temperature, overall footprint area, and the total length of interconnects connecting the modules. We use the same two-step approach for both 2D and 3D floorplanning: Linear Programming (LP) based floorplan construction followed by Simulated Annealing (SA) [17] based floorplan refinement. The only difference is that in 3D floorplanning, we perform layer partitioning before the LP-based floorplan construction. We describe our 3D floorplanning in what follows.
We first partition the modules into layers (= dies) and then floorplan these layers. The goal during layer partitioning is to minimize the number of inter-layer interconnects, whereas our floorplanning optimizes the temperature, footprint area, and intra-layer wirelength simultaneously. We use the following rules during layer partitioning: (1) we assign a layer to each module such that the number of inter-layer interconnects is minimized; (2) we split pairs of modules that communicate frequently with each other into different layers, with the goal of vertically overlapping them during the subsequent floorplanning step to achieve better performance; (3) we split highly active modules into different layers so that the shorter vertical interconnects connected to these modules help reduce the dynamic power; (4) we separate modules with large area, such as the RUU, into different layers to help minimize the footprint area and reduce the amount of white space.

Our LP-based floorplanning is based on slicing floorplanning extended to handle multiple layers simultaneously. The basic idea is to perform recursive bi-partitioning until each partition contains a single module, as illustrated in Figure 4. In our approach, the slicing operation determines the overall relative location among the modules, while an LP fine-tunes the location and determines the dimensions of the modules. Moreover, we insert each slicing cutline so that it cuts all layers simultaneously (note that in 2D floorplanning we deal with a single layer). Upon each slicing, we perform thermal analysis to obtain new module temperatures. We then use LP to simultaneously optimize the performance and thermal distribution under the target frequency and leakage constraints. Since the layer partitioning has already addressed the inter-layer wire issues, we do not allow modules to move to other layers during the LP floorplanning. Once we obtain a 3D floorplanning solution, we perform a stochastic refinement based on Simulated Annealing.
The basic approach is to perturb the current floorplan by swapping a pair of modules or rotating a module by 90 degrees. If the quality of the perturbed solution improves, we accept this new solution. Otherwise, we rely on the concept of annealing temperature to probabilistically accept a worse-quality solution, as explained in [17]. We compute the initial annealing temperature by setting the probability of accepting bad moves to a low value. This significantly reduces the runtime required for the annealing process and focuses the search on results near the LP-based result, which is assumed to be fairly close to optimal.

Our final floorplans, along with their power-pin locations, are shown in Figure 5. The left-hand side shows the 2D floorplan. The black dots "•" in alternating columns represent the power-pin locations on the power grid.<sup>10</sup> The right-hand side of the figure shows the 3D floorplan that we used. The 3D floorplan uses the same number of modules separated across four different layers.<sup>11</sup>

The basic objective of our dynamic di/dt controller is to reduce the burden placed on power pin(s) by adjacent modules to a reliable level. Therefore, for any given floorplan and power-pin configuration, the design objective of the di/dt controller is to place queues for effective di/dt control in distinct sections of the floorplan. For this work, we divided the floorplan into four quadrants, with each quadrant representing a distinct power-pin domain. Note that certain power pins can be in multiple domains. For instance, in the 2D floorplan, quadrant-based module separation results in 5 power pins per quadrant, because the power pins on the borders of the quadrants exist in multiple domains. The number of distinct power-pin domains is a design choice influenced by the degree of di/dt control that is required. A high number of power-pin domains results in a larger number of queues and finer-grained control.
On the other hand, too few power-pin domains will result in larger queues that hurt performance, because the worst-case transition delay is higher. For both floorplans, a queue was assigned to each quadrant for all the modules placed in it. Since the floorplan determines the queue configuration, different floorplans will have different performance impact as well as distinctive di/dt characteristics.

<sup>10</sup>This is a type of power-pin configuration that certain flip-chip IC designs use.

<sup>11</sup>An exception is the RUU, which is split across multiple layers since it is a module that consumes a large area.

Figure 5: Floorplans. Black dots denote power pin locations.

#### **5.2** Simulation Framework

Our simulation framework is based on SimpleScalar 3.0 and Wattch [5], running the SPEC2000 INT and FP benchmark suites. To understand the access patterns of individual modules that motivated the solution of this work, we included various profiling and instrumentation facilities in our simulator. For the implementation of the dynamic di/dt controller, we extended SimpleScalar/Wattch to incorporate floorplan-aware queue configuration. We also implemented a detailed, floorplan-dependent performance throttling model and queue configuration for studying the performance impact of our technique. The primary simulation parameters are shown in Table 2. The power and current consumption metrics were based on a 5GHz processor developed using a 70nm process technology [1]. Each simulation was fast-forwarded by 4 billion instructions and then simulated for 1 billion instructions. The current signature chosen to evaluate the dynamic di/dt controller was obtained by profiling for the worst-case overall module activity over the entire simulation period. To study the thermal impact of our di/dt controller, we integrated Hotspot 3.0 [30] into our simulators.
Hotspot assumed the same process technology parameters as mentioned earlier. The heat-sink and heat-spreader modules were obtained from the default model, and the initial temperatures were set to 300 kelvins.

| Parameters | Values |
|---------------------|----------------------------------------------------------------------------------------|
| Fetch/Decode width | 8-wide |
| Issue/Commit width | 8-wide |
| Branch predictor | Combining: 16K entry Metatable<br>Bimodal: 16K entries<br>2-Level: 14 bit BHR, 16K entry PHT |
| BTB | 4-way, 4096 sets |
| L1 I- and D-Cache | 16KB 4-Way 64B line |
| I- and D-TLB | 128 Entries |
| L2 Cache | 256KB, 8-way, Unified, 64B line |
| L1/L2 Latency | 1 cycle / 6 cycles |
| Main Memory Latency | 500 cycles |
| LSQ Size | 64 entries |
| RUU Size | 256 entries |
| Functional Units | 8 IntAlu (only 2 can be used for IntMult)<br>4 FPAlu (only 2 can be used for FPMult) |

Table 2: Microarchitecture Parameters

#### 6. **QUANTITATIVE ANALYSIS**

In order to evaluate the effectiveness and overhead of our dynamic di/dt controller under different scenarios, we applied our technique to both 2D and 3D floorplans. The results presented include current profiles on a baseline machine without a di/dt controller versus our technique, and the average current variability across all benchmarks. Since di/dt is a reliability issue, we also quantify any potential reliability impact of our technique in the form of thermal effects. Finally, and most importantly, we present the performance overhead incurred by our dynamic di/dt controller.

#### 6.1 Current Profile of Applications

To demonstrate the effectiveness of our controller in improving high-frequency di/dt effects, we now present the current profile of the whole chip as well as of each queue cluster for the 2D floorplan.
Note that the effectiveness of a di/dt controller is evaluated by observing its effect on the worst-case current profile of a given application, which represents the maximum switching activity of modules. Because the number of current profiles across all benchmark programs is very large, we summarize their representative characteristics using two types of benchmark programs for this specific study. The purpose of this section is to show the effectiveness of our proposed mechanism. To explain the current profiles, we profiled one high-ILP benchmark (164.gzip) and one low-ILP, memory-bound benchmark (181.mcf). The current profiles shown in Figure 6 and Figure 7 were obtained by profiling for the worst-case switching activity during the course of execution. A 4-bit decay counter was used for each module in all experiments.<sup>12</sup> Each graph shows the current profile for both the processor with ideal clock-gating and the decay-counter-based clock-gating mechanism. We also provide close-up versions of the representative, highly active region of each graph for better visibility.

<sup>12</sup>The resolution of the decay counter was based on the motivational data discussed in Section 3.

Figure 6: High ILP Benchmark Current Profile (164.gzip)

Figure 7: Low ILP Benchmark Current Profile (181.mcf)

It can be seen that both gzip and mcf exhibit a repetitive current profile during the worst-case switching period. This is especially prominent in the current profile of mcf, where there is a period of high activity for a few hundred cycles, followed by a stable current profile for approximately 500 cycles. This is due to the well-known cache misses to main memory that occur in mcf, during which most modules are inactive and can be clock-gated off to save dynamic power.
The effectiveness of the di/dt controller in improving the current ramp is evident in the zoomed versions of the graphs. They show that with the decay counter, our system (shown in dashed lines) successfully prevents unnecessary oscillating swings in the current profile and produces a much smoother down-ramp. For gzip in Figure 6, we observe large current variation in the ideal clock-gating scheme due to high activity across all modules. Because modules are never inactive for extended periods of time, there is no significant duration in which reasonable power savings are possible, and the decay counters rarely clock-gate off most modules. The current profile is extremely stable for this reason. In short, the decay-counter-based technique finds the optimal power envelope right above the ideal clock-gating mechanism and allows clock-gating only when there is a significant likelihood that the given modules will not be accessed again soon.<sup>13</sup>

Next, we present the current profile with the integration of the complete queue-based controller for the 2D floorplan. Note that this is the complete controller, which incorporates prevention of simultaneous switching, decay-counter-based feedback for clock-gating, preemptive ALU gating, and progressive gating of L2 cache banks. Figure 8 shows the current profile for all four queues of the 2D floorplan for gzip and mcf. In all cases, it can be observed that the current profile is significantly improved by eliminating excessive switching activity. In addition, it can be observed that both the upward-ramp and downward-ramp effects due to multiple modules in the same power-pin domain (i.e., using the same module queue) are spread out across multiple cycles. This is particularly prominent in the upward ramp of the current with the di/dt controller between cycle 20 and cycle 50 for *Queue 1* in mcf.
For *Queue 3* in mcf, we observe a different trend whereby the di/dt controller ramps up current repeatedly compared to the baseline, which is stable. This is due to the preemptive ALU gating effect that ramps up additional ALUs which are otherwise unused in the baseline clock-gating scheme due to low ILP. We observe a repetitive pattern where ALUs are gated preemptively only to later decay after approximately 20-25 clock cycles. However, these ramps are still spread out over many cycles and do not violate the current demand threshold. In the case of *Queue 4*, although there is a significant current decay towards the end, note that simultaneous gating is prevented even in this case (the slope of the drop is less steep, which is not obvious in the graph due to the scale).

<sup>13</sup>Note that the chip-level current is shown with the decay-counter-based technique alone, which by itself does not prevent simultaneous switching. Large upward ramps are resolved by the queue-based controller.

Figure 8: Queue Controller Current Profile. Panels: Current Profile (164.gzip); (b) Low ILP Benchmark Queue Controller Current Profile (181.mcf).

Figure 9: Current Variability

Figure 10: Thermal Impact of Dynamic di/dt Controller

For gzip, where there is high ILP/switching activity, we notice that the queue-based controller ramps up to the required current levels and does not saturate the decay counters for long enough. For this reason, the queue current profile is almost always stable, except for the few cases where the decay counters decay long enough to enable clock-gating. It is important to note that this does not mean that there is no opportunity for power savings in such a design without di/dt control. The presented phase of gzip is the highest-ILP portion in our simulation, and it is simply not worthwhile to clock-gate elements during this phase because of the di/dt as well as the performance penalty.
Since presenting detailed current profiles is infeasible for all benchmarks, we now present the current variability per cycle for the complete duration of the benchmark execution. Unlike the worst-case profile presented earlier, this metric captures the average variability of current per cycle for both the baseline and the processor with our dynamic di/dt controller. Figure 9 shows the comparison for various SPEC2000 INT and FP benchmark programs. The current variability is calculated by summing the inter-cycle current fluctuations (the absolute value of each swing) over the entire simulation period and dividing by the total number of simulation cycles. It can be observed that the baseline architecture shows a higher degree of current variability across the board. The data show that 186.crafty exhibits the highest variability, whereas 171.swim has the lowest. In any case, regardless of the native current variability, our dynamic di/dt controlling mechanism can significantly mitigate the dynamic oscillating behavior of the current profile of running applications. The di/dt controller pushes the current variability below 0.5 amps/cycle for all the benchmark programs we studied. Note that a traditional power virus will no longer be able to stress the power delivery network in the presence of our di/dt controller.

#### 6.2 Thermal Impact

In typical high-performance processor design, the high-frequency inductive noise issue is handled through the worst-case design method. In contrast, the goal of our technique is to guarantee this reliability while enabling an average-case design, meeting the stringent reliability requirements via dynamic control mechanisms. Therefore, it is critical that our di/dt controller not induce other forms of reliability vulnerability. Since our technique provides fine-grained di/dt control at the expense of increased power consumption, it is necessary to quantify any potential adverse thermal effect due to our technique.
Thermal issues are particularly critical in 3D-IC processors because of their higher power density as well as the greater difficulty of dissipating heat across multiple die layers.

We used Hotspot 3.0 [30] to evaluate the thermal impact of our high-frequency di/dt controller for both the 2D and 3D floorplans. We compared our architecture against the baseline 2D and 3D designs using ideal clock-gating, which represents the scenario of the least power and current consumption. Figure 10 presents the thermal analysis for all 23 modules in our processor model for the SPEC2000 benchmark suite.<sup>14</sup> Overall, we observe nominal thermal impact across all modules for both the 2D and 3D floorplans. The 2D cases show an average temperature increase of 3.15 kelvins over their baseline counterparts. The highest temperature rise (over 5 kelvins) is observed in the L1 Data Cache, Branch Predictor, BTB, and LSQ modules. The majority of the remaining modules exhibit an average temperature increase below 3 kelvins. We also notice a similar average temperature rise, around 3.74 kelvins, for most of the modules in the 3D floorplan, although some modules do experience higher thermal impact. IALU-3 and IALU-5 are the worst, with average temperature rises of 10.11 kelvins and 15.75 kelvins, respectively. We also observe that these modules are located close to each other on the second layer. However, we do not observe a high temperature increase in IALU-7, which is also very close on the same layer. The reason is that none of the benchmark programs exhibits enough parallelism to utilize more than 6 ALUs at the same time, leaving IALU-7 and IALU-8 inactive and clock-gated off for extended periods of time. All the remaining modules in the 3D floorplan exhibit a temperature rise of less than 6.5 kelvins. Note that it is possible to further mitigate these worst-case thermal effects by using a thermal-aware floorplanner such as those described in [12, 13]; however, this is outside the scope of this paper.
(Our floorplans were generated with a goal of minimum total wirelength and die area.) Our thermal analysis results indicate that the integration of the di/dt controller does not pose any large adverse thermal effect to either a 2D or a 3D floorplan.

<sup>14</sup>For presentation purposes, the RUU temperature reported for the 3D floorplan is the average temperature of all RUU partitions.

Figure 11: Performance Degradation of dynamic di/dt controller

#### **6.3** Performance Impact

We now present the performance analysis of our di/dt controlling mechanism for both the 2D and 3D floorplans. Figure 11 shows the normalized IPC for the SPEC2000 INT and FP benchmark suites with the di/dt controller, relative to the baseline machines without any di/dt control. Results for both 2D and 3D floorplans are shown. The 2D-w/Pre and 3D-w/Pre configurations show the queue controller with preemptive ALU gating turned on, in order to differentiate the type of applications that can benefit from pre-decoding ALU instructions. The remaining two bars show the same controller without preemptive ALU gating. Progressive gating in the L2 cache was applied in all cases.

In general, we observe minimal performance degradation for most of the benchmark programs for both the 2D and 3D floorplans. Note that the performance overhead depends on the floorplan because the floorplan affects the queue configuration. A more optimized floorplan will result in a better-balanced queue configuration. However, if the floorplan results in a configuration where one queue carries a significantly larger number of modules than the others, IPC will be adversely affected because the worst-case module activation time is longer. This is the reason that the 3D floorplan shows a slightly lower overhead, about 3.8% on average, compared to the 2D floorplan, which shows an average performance overhead of 4.0%.
The worst performance degradation occurs in 252.eon, at 9.2% and 9.5% for the 2D and 3D floorplans, respectively, for the controllers without preemptive ALU gating. One explanation for the larger performance impact is that the ray-tracing algorithm in eon is ALU intensive. Since modules in the floorplan are asymmetric, both the 2D and 3D floorplans result in locally clustered ALUs that are not evenly distributed across the queues. Highly ALU-intensive applications suffer a performance loss in such cases, since a quick ramp-up of modules takes longer if most of the ALUs are clustered in the same queues. A strong indicator of this is the higher sensitivity that eon shows to preemptive ALU gating compared with the other benchmarks. Most of the other benchmarks exhibit a performance loss of less than 5.7%.

We also observe that preemptive gating of ALUs improves performance for certain benchmark programs such as 252.eon, 254.gap, 253.perl and 168.wupwise. This is due in part to the fact that the 4-bit decay counter consistently saturates for ALUs (turning off the module) right before ALU instructions are issued. It is in these scenarios that preemptive gating provides simultaneous performance and di/dt benefits. The decay counters predict the future likelihood of module access solely from the past activity profile. In contrast, preemptive gating can look ahead and override unnecessary gating that the decay counters themselves cannot prevent, thereby avoiding unnecessary performance loss. The minimal IPC overhead illustrates the practical potential of employing a low-overhead technique to control high-frequency di/dt.

#### 7. CONCLUSION

The exponential increase in current consumption by newer generations of processors, coupled with aggressive power-saving techniques, has exacerbated the high-frequency di/dt issue, forcing designers to lengthen the design time spent on the analysis and implementation of the power delivery network. As long as the current trends in process and performance scaling continue, ad-hoc solutions that mitigate di/dt effects by adding decoupling capacitance will eventually not suffice. Decaps not only occupy considerable chip area but also contribute to the already problematic leakage power. Current microarchitecture-based solutions are inadequate for deep-submicron designs, where high-frequency di/dt is intricately intertwined with the chip floorplan as well as the power-pin distribution. In addition, the high module density facilitated by 3D-IC designs will stress the power delivery network even further, worsening operational reliability due to di/dt. Design afterthoughts in the form of additional decaps also negate advantages, such as reduced area and lower wire power, that the emerging 3D-IC technology inherently provides.

To address the high-frequency di/dt issues and maintain high reliability while alleviating the design cost of creating a low-impedance power delivery network, we propose a dynamic queue-based di/dt controller for both 2D and 3D-IC processors. By using decay counters to limit clock-gating activity based on module access patterns and by using this feedback in a queue-based di/dt controller, we show how bounds on current demand can be guaranteed for modules in the same power-pin domain. In addition, we present a preemptive ALU gating mechanism as a performance enhancement technique and integrate an enhanced progressive gating technique for large modules (L2 cache) into our queue-based control mechanism, without violating current demand thresholds due to simultaneous switching.
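As a rough illustration of how decay-counter gating and the preemptive look-ahead interact, the following Python sketch models a single module. It is a simplified sketch, not the exact hardware design: we assume the counter counts consecutive idle cycles, resets on every access, and gates the module when it saturates. The class name, the `upcoming_access` signal, and the idle and look-ahead lengths are hypothetical, and the wake-up path through the queue controller is not modeled.

```python
class DecayGatedModule:
    """One module gated by an idle-decay counter, with optional look-ahead.

    Assumptions (illustrative only): the counter counts consecutive idle
    cycles, resets on every access, and the module is gated at saturation.
    """

    def __init__(self, counter_bits=4, preemptive=False):
        self.saturation = (1 << counter_bits) - 1  # 15 for a 4-bit counter
        self.preemptive = preemptive
        self.counter = 0
        self.gated = False

    def tick(self, accessed, upcoming_access=False):
        """Advance one cycle; return True if an access found the module gated."""
        wakeup_needed = accessed and self.gated
        if accessed:
            # An access resets the decay counter and turns the module on.
            self.counter = 0
            self.gated = False
        elif self.preemptive and upcoming_access:
            # Look-ahead sees a pending use: hold the counter, do not gate.
            pass
        else:
            self.counter = min(self.counter + 1, self.saturation)
            if self.counter == self.saturation:
                self.gated = True
        return wakeup_needed


def count_wakeups(preemptive, idle_cycles=15, lookahead=3, bursts=4):
    """Count accesses that hit a gated module over repeated idle/access bursts."""
    module = DecayGatedModule(preemptive=preemptive)
    wakeups = 0
    for _ in range(bursts):
        for cycle in range(idle_cycles):
            # Hypothetical pre-decode signal: raised in the last idle cycles.
            upcoming = cycle >= idle_cycles - lookahead
            wakeups += module.tick(accessed=False, upcoming_access=upcoming)
        wakeups += module.tick(accessed=True)
    return wakeups


print(count_wakeups(preemptive=False))  # 4
print(count_wakeups(preemptive=True))   # 0
```

In this toy access pattern the idle gap exactly matches the saturation point, so the plain decay counter gates the module just before every access, while the look-ahead signal suppresses each of those gating events.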
We also explain how the di/dt architecture can be implemented in a conventional out-of-order pipeline in a complexity-effective manner. The experimental results show that our di/dt controller can reduce the current variability of applications by an average of 7x with a mere 4.0% and 3.8% IPC degradation for the 2D and 3D floorplans, respectively. High-frequency di/dt noise will keep worsening as the continuing CMOS scaling trend drives down the operating voltage while simultaneously increasing peak power consumption. Overall, our design provides a realistic microarchitectural approach that can alleviate the effort spent on design afterthoughts and reduce the reliance on extensive decoupling capacitors that consume considerable chip area. Our technique also incurs little performance overhead and has no adverse thermal impact.

#### 8. ACKNOWLEDGMENT

This research was supported by the MARCO C2S2 and GSRC Centers.

#### 9. REFERENCES

- [1] International Technology Roadmap for Semiconductors. 2004.
- [2] T. M. Austin and G. S. Sohi. Zero-cycle Loads: Microarchitecture Support for Reducing Load Latency. In *Proceedings of the 28th annual International Symposium on Microarchitecture*, pages 82–92, 1995.
- [3] B. Bentley. Validating the Intel Pentium4 Microprocessor. In *Proceedings of the 2001 International Conference on Dependable Systems and Networks*, pages 493–500, 2001.
- [4] B. Black, D. Nelson, C. Webb, and N. Samra. 3D Processing Technology and Its Impact on iA32 Microprocessors. In *Proceedings of the 22nd International Conference on Computer Design*, pages 316–318, 2003.
- [5] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A Framework for Architectural-Level Power Analysis and Optimizations. In *Proceedings of the 27th annual International Symposium on Computer Architecture*, 2000.
- [6] H.-M. Chen, L.-D. Huang, I.-M. Liu, and M. D. F. Wong. Simultaneous Power Supply Planning and Noise Avoidance in Floorplan Design.
*IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems*, 24(4):578–587, 2005.
- [7] Y. Chen, K. Roy, and C.-K. Koh. Current Demand Balancing: a Technique for Minimization of Current Surge in High Performance Clock-gated Microprocessors. *IEEE Transactions on Very Large Scale Integration Systems*, 13(1):75–85, 2005.
- [8] S. Das, A. Chandrakasan, and R. Reif. Three-Dimensional Integrated Circuits: Performance, Design Methodology, and CAD Tools. In *Proceedings of the IEEE Annual Symposium on VLSI*, 2003.
- [9] M. K. Gowan, L. L. Biro, and D. B. Jackson. Power Considerations in the Design of the Alpha 21264 Microprocessor. In *Proceedings of the 35th Design Automation Conference*, pages 726–731, 1998.
- [10] E. Grochowski, D. Ayers, and V. Tiwari. Microarchitectural Simulation and Control of di/dt-induced Power Supply Voltage Variation. In *Proceedings of the 8th International Symposium on High-Performance Computer Architecture*, 2002.
- [11] K. Hazelwood and D. Brooks. Eliminating Voltage Emergencies via Microarchitectural Voltage Control Feedback and Dynamic Optimization. In *Proceedings of the 2004 International Symposium on Low power Electronics and Design*, pages 326–331, 2004.
- [12] M. Healy, M. Vittes, M. Ekpanyapong, C. Ballapuram, S. K. Lim, H.-H. S. Lee, and G. H. Loh. Microarchitectural Floorplanning Under Performance and Temperature Tradeoff. In *Proceedings of the Design, Automation and Test in Europe*, pages 1288–1293, 2006.
- [13] M. Healy, M. Vittes, M. Ekpanyapong, C. Ballapuram, S. K. Lim, H.-H. S. Lee, and G. H. Loh. Multi-Objective Microarchitectural Floorplanning For 2D and 3D ICs. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 2006.
- [14] H. Jacobson, P. Bose, Z. Hu, A. Buyuktosunoglu, V. Zyuban, R. Eickemeyer, L. Eisen, J. Griswell, D. Logan, B. Sinharoy, and J. Tendler. Stretching the Limits of Clock-Gating Efficiency in Server-Class Processors.
In *Proceedings of the IEEE Symposium on High-Performance Computer Architecture*, pages 238–242, 2005.
- [15] R. Joseph, D. Brooks, and M. Martonosi. Control Techniques to Eliminate Voltage Emergencies in High Performance Processors. In *Proceedings of the 9th International Symposium on High-Performance Computer Architecture*, 2003.
- [16] R. Joseph, Z. Hu, and M. Martonosi. Wavelet Analysis for Microprocessor Design: Experiences with Wavelet-Based dI/dt Characterization. In *Proceedings of the 10th International Symposium on High Performance Computer Architecture*, 2004.
- [17] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by Simulated Annealing. *Science*, pages 671–680, 1983.
- [18] F. Li, C. Nicopoulos, T. Richardson, Y. Xie, N. Vijaykrishnan, and M. Kandemir. Design and Management of 3D Chip Multiprocessors using Network-in-Memory. In *Proceedings of the International Symposium on Computer Architecture*, 2006.
- [19] H. Li, S. Bhunia, Y. Chen, K. Roy, and T. N. Vijaykumar. DCG: Deterministic Clock-gating for Low-power Microprocessor Design. *IEEE Transactions on VLSI Systems*, 12(3):245–254, 2004.
- [20] F. Mohamood, M. B. Healy, S. K. Lim, and H.-H. S. Lee. Noise-Direct: A Technique for Power Supply Noise Aware Floorplanning Using Microarchitecture Profiling. In *Proceedings of the 12th Asia and South Pacific Design Automation Conference*, 2007.
- [21] M. D. Pant, P. Pant, and D. S. Wills. On-chip Decoupling Capacitor Optimization using Architectural Level Prediction. *IEEE Transactions on VLSI Systems*, 10(3):319–326, 2002.
- [22] M. D. Pant, P. Pant, D. S. Wills, and V. Tiwari. An Architectural Solution for the Inductive Noise Problem due to Clock-gating. In *Proceedings of the International Symposium on Low Power Electronics and Design*, 1999.
- [23] M. D. Pant, P. Pant, D. S. Wills, and V. Tiwari. Inductive Noise Reduction at the Architectural Level. In *Proceedings of the International Conference on VLSI Design*, 2000.
- [24] M. D. Powell and T. N.
Vijaykumar. Pipeline Damping: a Microarchitectural Technique to Reduce Inductive Noise in Supply Voltage. In *Proceedings of the 30th International Symposium on Computer Architecture*, pages 72–83, 2003.
- [25] M. D. Powell and T. N. Vijaykumar. Pipeline muffling and a priori current ramping: architectural techniques to reduce high-frequency inductive noise. In *Proceedings of the 2003 International Symposium on Low Power Electronics and Design*, pages 223–228, 2003.
- [26] M. D. Powell and T. N. Vijaykumar. Exploiting resonant behavior to reduce inductive noise. In *Proceedings of the 31st annual International Symposium on Computer Architecture*, 2004.
- [27] K. Puttaswamy and G. H. Loh. Implementing Caches in a 3D Technology for High Performance Processors. In *Proceedings of the International Conference on Computer Design*, 2005.
- [28] K. Puttaswamy and G. H. Loh. Dynamic Instruction Schedulers in a 3-Dimensional Integration Technology. In *Proceedings of the ACM/IEEE Great Lakes Symposium on VLSI*, 2006.
- [29] K. Puttaswamy and G. H. Loh. The Impact of 3-Dimensional Integration on the Design of Arithmetic Units. In *Proceedings of the International Symposium on Circuits and Systems*, 2006.
- [30] K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy, and D. Tarjan. Temperature-aware Microarchitecture: Modeling and Implementation. *ACM Transactions on Architecture and Code Optimization*, 1(1):94–125, 2004.
- [31] Z. Tang, N. Chang, S. Lin, W. Xie, S. Nakagawa, and L. He. Ramp up/down Floating Point Unit to Reduce Inductive Noise. In *Workshop on Power-Aware Computer Systems*, 2000.
- [32] S. Zhao, C. Koh, and K. Roy. Decoupling Capacitance Allocation and Its Application to Power Supply Noise Aware Floorplanning. *IEEE Transactions on Computer-Aided Design*, pages 81–92, 2002.