Power gating has been widely adopted in multicore designs. 
INTRODUCTION
Scaling MOSFET technology into the sub-100 nm regime has significantly increased leakage power consumption, which makes leakage reduction an indispensable part of nanoscale low-power design. Power gating has emerged as an effective solution for runtime leakage control, and recent literature shows that it is widely applied in multicore designs [1, 2]. Because multicore designs have extremely tight power budgets, unused cores are put into an ultra-low-power mode by disconnecting them from the power grid. The disconnection is performed by power gating devices: footers, headers, or both. Figure 1 illustrates a power gating design [1] in which the footers are arranged in a ring surrounding the core.
The cost of a power mode transition is not negligible. Energy overhead is one important problem [3]. Another is the rush current incurred when power-gated cores are woken up [4]: the charge accumulated during the sleep period is released into the chip's power grid at wake-up, forming a large rush current. In a multicore chip, it is normal for some cores to be active while others are undergoing a power mode transition. The large rush current drawn by a waking core (CoreS in Figure 1) can cause a temporary voltage drop on adjacent active cores (CoreA in Figure 1) and ground bounce across the whole power grid. The general approach to alleviating the rush current problem is to turn on the power gates gradually, limiting the current injected into the power grid. However, this prolongs the power mode transition time. A good wake-up scheme therefore trades off delay overhead against rush current.
The wake-up problem for power gating has been studied extensively. The initial focus was on optimal circuit clustering to minimize wake-up delay under a given rush current constraint. However, DSTN and the shared virtual ground approach have been adopted in most design practice [5]. Since all gates then share a single common virtual ground, circuit clustering no longer alters the wake-up process. Researchers therefore shifted their focus from circuit clustering to power gate partitioning and activation sequencing [6, 7, 4, 8, 9]. Almost all of the previous literature studies the wake-up problem for a single circuit block rather than a full chip. Only recently, Jiang et al. [10] presented a wake-up scheduling algorithm for multiple circuit blocks in a chip under a rush current constraint. However, their approach requires a complicated central control state machine to arrange the wake-up sequence of each block at runtime, which makes the scheme less practical.
Furthermore, very little existing research has analyzed or optimized the physical implementation of the wake-up circuit.
In this sense, this paper is the first study to consider fast and reliable power mode transition at the chip level, with analysis of detailed physical models and implementations. The study uses the recent ring style power gating for multicore designs as an example, and introduces the concepts and physical designs of two novel techniques: current shaping and multi-thread activation. A CAD flow is also presented that automatically designs an optimized power gating control circuit using the two techniques. Both techniques are also applicable to the fast and reliable design of other on-chip power switches, such as body biasing control or V DD hopping.
RELIABLE POWER MODE TRANSITION
The ring style implementation of AMD per-core power gating [1] is illustrated in Figure 1. Figure 2 gives further details on the control circuits of the power gates (footers for ground gating). In particular, the control signal (S c) is routed along with the footers. Repeaters are inserted to speed up the propagation of the control signal around the core. Drivers are required to drive the footers, which are usually designed to be very large to reduce the voltage drop across them. When the central power management unit decides to wake up a power-gated core, it releases a wake-up signal. Assume that this signal arrives at point A in Figure 2. It then propagates in the direction of the arrow and turns on the footers one by one along its way. The control signal returns to A after it has traversed every edge of the core.
The power grid is usually designed to accommodate the average current drawn by the circuit in active mode [11]. Voltage drop on the power grid is caused by two phenomena. The IR drop results from delivering current (I) through the resistive power grid (R). The inductive drop arises because the high di/dt of fast switching acts on the parasitic inductance (L) of the grid, causing an additional voltage drop. The total voltage drop is the sum of both:
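The displayed equation is not present in this copy of the text. A form consistent with the two phenomena just described, with V drop as a symbol assumed here and the equation number inferred from later references, would be:

```latex
V_{drop} = I R + L \frac{di}{dt} \tag{1}
```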
[Figure 2: control circuits of the power gates (footers for ground gating). Labels: Real Ground, Virtual Ground, Drivers, Repeaters, Logic Gates.]
In a multicore design, assume that the core under activation (denoted CoreS) shares a common ground path to the ideal ground with adjacent cores (denoted CoreA). The resistance of this common path is R return. Denote the average current of CoreA as I adj, and the average current of CoreS in active mode as I avg. Assuming a chip-wide voltage drop threshold of αV DD, the voltage drop equation for CoreA and CoreS in active mode is:
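The equation itself is missing here. One plausible reconstruction applies the resistive and inductive terms to the combined current on the shared return path and bounds the result by the threshold. The symbol V active and the use of an inequality are assumptions:

```latex
V_{active} = (I_{adj} + I_{avg})\, R_{return}
           + L \frac{d(I_{adj} + I_{avg})}{dt} \le \alpha V_{DD} \tag{2}
```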
The footers of CoreS are designed to be large to avoid performance loss. When CoreS is in sleep mode, its virtual ground voltage is elevated to a certain value (V ini). When CoreS is woken up, the virtual ground discharges through the large footers. Assume that the footers are designed to incur a voltage drop of αV DD, and that the peak current of CoreS is I peak. Then their equivalent resistance (R F) is:
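The expression follows directly from Ohm's law applied to the stated design target (a drop of αV DD at a current of I peak). The equation number is inferred:

```latex
R_F = \frac{\alpha V_{DD}}{I_{peak}} \tag{3}
```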
Hence when the footers are turned on during wake-up, the worst case rush current incurred (I worst ) is:
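A reconstruction consistent with the surrounding text assumes that the full initial virtual ground voltage V ini appears across the fully-on footers at the start of wake-up:

```latex
I_{worst} = \frac{V_{ini}}{R_F} = \frac{V_{ini}}{\alpha V_{DD}}\, I_{peak} \tag{4}
```

With V ini at 50% of V DD and α at 5%, this gives the factor of 10 quoted in the text.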
Our experiments show that V ini is usually about 50% to 60% of V DD. Assume that α and β are both 5%. Equation 4 then yields a worst case rush current of approximately 10 × I peak. The worst case voltage drop at the common return path of CoreS and CoreA is:
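The original expression is not available. A sketch that keeps the same resistive and inductive structure as the active-mode case, with the CoreS current replaced by I worst, would read as follows. V worst is a symbol introduced here for illustration:

```latex
V_{worst} = (I_{adj} + I_{worst})\, R_{return}
          + L \frac{d(I_{adj} + I_{worst})}{dt} \tag{5}
```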
Subtracting Equation 2 from Equation 5, side by side, gives:
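Assuming both voltage drops have a resistive term in R return and an inductive term in L, and that the I adj contributions cancel, the difference would take the form below. ΔV is a symbol introduced here for the extra drop:

```latex
\Delta V = (I_{worst} - I_{avg})\, R_{return}
         + L \frac{d(I_{worst} - I_{avg})}{dt} \tag{6}
```

The first term vanishes when the rush current equals I avg, and the second is small when the current slope is kept gentle.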
Equation 6 gives the extra voltage drop caused by waking up CoreS in the worst case. Since 10 I peak is much higher than I avg, this worst case voltage drop is unacceptable and is very likely to cause CoreA to malfunction. One way to guarantee the power integrity of CoreA is to confine the rush current of CoreS to at most I avg. The first term on the right side of Equation 6 is then no greater than zero, and the second term is also negligible if the footers are controlled to avoid a sharp current slope. Calimera et al. propose limiting the footer current to I peak [9]. However, we argue that this limit cannot guarantee power integrity. The power grid is usually designed to accommodate the average current consumption, while decaps supply the instantaneous peak current in active mode, so I peak in active mode is provided by the power grid and the decaps together. The wake-up process of a power-gated circuit, however, usually takes multiple clock cycles. The decaps then have no chance to recharge every cycle and become useless once depleted [12]. Since decaps are ineffective at supplying charge during wake-up, using I peak as the rush current constraint is not safe and can lead to an excessive voltage drop.
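The factor of roughly 10 between the worst case rush current and I peak can be checked numerically. The Python sketch below assumes that the footer resistance is αV DD / I peak and that the full V ini appears across the footers at wake-up. The function names and the normalized values of V DD and I peak are illustrative, not taken from the paper.

```python
def footer_resistance(alpha, v_dd, i_peak):
    """Equivalent footer resistance that drops alpha * V_DD at I_peak."""
    return alpha * v_dd / i_peak


def worst_case_rush_current(v_ini, r_f):
    """Rush current if the full initial virtual ground voltage sits across R_F."""
    return v_ini / r_f


# Normalized, hypothetical values (not from the paper's experiments).
V_DD = 1.0      # supply voltage
I_PEAK = 1.0    # peak active current of CoreS
ALPHA = 0.05    # voltage drop budget across the footers (5% of V_DD)

r_f = footer_resistance(ALPHA, V_DD, I_PEAK)
for frac in (0.5, 0.6):  # V_ini as a fraction of V_DD, per the text
    i_worst = worst_case_rush_current(frac * V_DD, r_f)
    print(f"V_ini = {frac:.0%} of V_DD -> I_worst = {i_worst / I_PEAK:.1f} x I_peak")
```

Under these assumptions the ratio depends only on V ini / (αV DD), which is why a rush current limit of I peak is still far from the unconstrained worst case.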
FAST ACTIVATION TECHNIQUES
We analyze an ideal wake-up process in three phases: initial boost, current shaping and multi-thread activation.
Initial Boost
Before I on reaches the limit of I avg, the footers should be turned on as fast as possible. Denote the on-current and equivalent resistance of each footer as i f and r f, respectively. The ideal number of footers (N ideal) allowed to be turned on is I avg / i f. However, because the control signal takes time to propagate, by the time it reaches footer N ideal the virtual ground voltage has already dropped below its original value due to discharge, so I on at that point is also below I avg, as shown in Figure 4. A few more footers therefore need to be turned on to compensate for this current drop. Denote the actual number of footers required for I on to reach I avg as N ini.
To turn on the N ini footers as fast as possible, the repeaters and drivers before footer N ini should be properly sized. Denote the repeater width of stage k as w 
Constraint: Equation 12 calculates the total decrease of the virtual ground voltage and the actual total on-current for N ini footers, and then equates that current to the limit I avg. When the N ini footers are fully turned on, the new virtual ground voltage (V p1) is:
Current Shaping
When I on reaches the limit of I avg, the design goal switches to keeping I on as close to I avg as possible. To prevent the rush current from exceeding the limit, the activation of footers should be slowed down. This can be realized by downsizing the repeaters and drivers. The challenge is how to size them optimally to obtain a constant rush current of I avg. We introduce a current shaping technique to solve this problem.
Group the footers that have been turned on in the initial boost together as Group A, and denote their current as I a. If no more footers are turned on, I a decays exponentially, since CoreS can be roughly considered as a huge capacitance discharging through a single resistor. However, if we continue to open the remaining footers and keep the total I on close to I avg, then the virtual ground voltage (V vg (t)) decreases at a linear rate:
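The linear expression is not reproduced in this copy. Treating CoreS as a lumped capacitance discharged by a constant current I avg from the starting voltage V p1, one consistent form is the following. C S is a symbol introduced here as an assumption:

```latex
V_{vg}(t) = V_{p1} - \frac{I_{avg}}{C_S}\, t
```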
Thus, I a can also be modeled as a linear function of time.
The total time (T all ) for discharging is:
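Assuming that T all denotes the time for a lumped capacitance C S (an assumed symbol) to discharge from V p1 to zero at the constant current I avg, the expression would be:

```latex
T_{all} = \frac{C_S\, V_{p1}}{I_{avg}}
```

This reading agrees with the later statement that I a (t) and I b (t) converge at T all /2, where the virtual ground voltage has fallen to half of V p1.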
To shape a flat rush current curve, we need to construct a new current I b (t) whose slope is opposite to that of I a (t), as shown in Figure 5.a. I b (t) then compensates for the decrease of I a (t), and the total current I on (t) remains constant. I b (t) is shaped by controlling another group of footers, denoted Group B.
As shown in Figure 5.a, when I b (t) converges with I a (t) at time T all /2, Group B is fully turned on, and I b (t) starts to decrease together with I a (t). Since the current of Group B equals the current of Group A once Group B is fully on, the number of footers in Group B must equal the number of footers in Group A (N ini). A further requirement is that the gate voltage switching of Group B takes T all /2 to complete. This is the basic idea of current shaping. Once Groups A and B converge at T all /2, they merge into Group AB with current I ab (t) and start another round of current shaping. This time the current decreases twice as fast, since the number of on-state footers has doubled:
Similarly, we need to construct a new current I c (t) whose slope is opposite to that of I ab (t), as shown in Figure 5.b. I c (t) converges with I ab (t) at time 3/4 T all, and the number of footers in Group C equals the number of footers in Group AB (2N ini). The gate voltage switching of Group C must start at time T all /2 and end at time 3/4 T all. By performing current shaping iteratively in this way, a near-constant rush current can be obtained until the virtual ground voltage decreases to the safe threshold βV DD (β is 5% in our experiments). Figure 5.d illustrates the current of each footer group for a complete current shaping. Ideally, the total number of iterations required before reaching βV DD is 5, because:
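The halving schedule described above can be sketched in Python. The sketch assumes that each round takes half of the remaining discharge time, that each new group contains as many footers as are already on, and that iteration stops once the remaining fraction 2^-k falls below β. That stopping test is an assumption chosen because it reproduces the five iterations stated in the text; the paper's own expression is not available here. All names are illustrative.

```python
def shaping_schedule(n_ini, t_all, beta):
    """Return (iterations, groups) for iterative current shaping.

    Each entry of groups is (footers_in_new_group, start_time, end_time).
    Assumption: the loop stops when the remaining fraction of the
    discharge (2 ** -k after k rounds) drops below beta.
    """
    groups = []
    on_footers = n_ini   # Group A, already on after the initial boost
    remaining = 1.0      # remaining fraction of the discharge
    start = 0.0
    iterations = 0
    while remaining >= beta:
        duration = remaining * t_all / 2   # half of the remaining time
        groups.append((on_footers, start, start + duration))
        start += duration
        on_footers *= 2                    # merged group is twice as large
        remaining /= 2
        iterations += 1
    return iterations, groups


iterations, groups = shaping_schedule(n_ini=10, t_all=1.0, beta=0.05)
print("iterations:", iterations)
for size, t0, t1 in groups:
    print(f"group of {size} footers switches from {t0:.4f} to {t1:.4f} T_all")
```

With β at 5%, the first two groups switch over [0, T all /2] and [T all /2, 3/4 T all], matching Groups B and C in the text.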
Current shaping is physically implemented by tuning the sizes of the drivers and repeaters. Take the current shaping of Group C as an example. The activation of Group C starts at time T all /2, so the control signal should arrive at the first footer of Group C (footer 2N ini + 1) at T all /2. This can be realized by sizing the repeater at stage 2N ini + 1, as shown in Figure 5.e. Group C is fully activated at time 3/4 T all, so the turn-on time for each footer in Group C is 1/4 T all. This can be realized by sizing the leaf node driver of each footer in Group C. The precise sizing algorithms are given in Equations 19 to 25.
Multi-thread Activation
Current shaping addresses the fast discharge problem. In addition to complete discharge, all the footers have to be turned on to ensure normal operation of the core. This turns out to be challenging. With ideal discharging current I avg , the virtual ground voltage is:
To keep the rush current under I avg, the maximum number of footers allowed to be turned on at any time t is:
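The expression for N MAX is missing from this copy. A minimal Python sketch of one plausible model follows: each on-state footer behaves as a resistor r f carrying V vg (t) / r f, and V vg (t) falls linearly from V p1 to zero over T all. Both modeling choices and all numeric values are assumptions for illustration.

```python
def virtual_ground_voltage(t, v_p1, t_all):
    """Assumed linear discharge from v_p1 at t = 0 to zero at t = t_all."""
    return v_p1 * (1.0 - t / t_all)


def n_max(t, i_avg, r_f, v_p1, t_all):
    """Largest footer count whose total current stays within i_avg.

    Valid for 0 <= t < t_all; each footer carries V_vg(t) / r_f.
    """
    per_footer_current = virtual_ground_voltage(t, v_p1, t_all) / r_f
    return i_avg / per_footer_current


# Hypothetical normalized values, not taken from the paper's experiments.
i_avg, r_f, v_p1, t_all = 1.0, 50.0, 0.5, 1.0

for frac in (0.0, 0.5, 0.8, 0.9, 0.95):
    n = n_max(frac * t_all, i_avg, r_f, v_p1, t_all)
    print(f"t = {frac:.2f} T_all -> N_MAX = {n:.0f}")
```

Under this model N MAX grows as 1 / (1 - t / T all). It rises only fivefold over the first 80% of the discharge and then doubles again within the next 10%, which is the late surge the text describes.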
The curve in Figure 6 shows the trend of N MAX. N MAX remains small for 80% of the time and increases drastically only as the discharge approaches its end, which requires a large number of footers to be turned on in a short time. This is challenging to implement because of the slow propagation speed of the control signal. The straight line in Figure 6 shows the number of footers that the control signal has passed at time t. Before Point 1 in Figure 6, the propagation speed is slower than required, so the delay should be minimized during the initial boost. Between Points 1 and 2, the propagation speed is faster than required, so the repeaters and drivers should be downsized during current shaping. After Point 2, the propagation speed is much slower than required.
We propose multi-thread activation to accelerate the propagation of the control signal. Figure 7 illustrates the idea. The control signal propagates from Point A through three threads instead of one. T1 and T2 travel along the edges of the core and are responsible for the initial boost and current shaping. Since the footer activation speed is thereby doubled, the repeater and driver sizing for the initial boost (Equations 7 to 12) and current shaping (Equations 19 to 25) needs to be recalculated. TTmp1 travels across the core to Point B, where it splits into two new threads, T3 and T4. In addition, TTmp1 spawns two child threads, TTmp2 and TTmp3, before it reaches Point B. TTmp2 and TTmp3 extend to Points C and D, where they split into four threads, T5 to T8. At this point the propagation speed is eight times that of a single thread, since 8 threads are propagating concurrently.
Each iteration in Figure 5 requires a different number of footers to be activated within a different time. The required activation speed (number of footers per unit time) in iteration j can be calculated by:
The propagation speed of two threads (T1 and T2) is:
where the denominator represents the minimal repeater delay of one stage. At any iteration j, if s j is smaller than s p, only T1 and T2 are needed to activate footers, and they should be slowed down appropriately to keep the rush current within the limit. Conversely, if s j is larger than s p, more threads should run to speed up activation. As an example, assume that s j equals 3.5 s p at iteration j, and that s j was smaller than s p at iteration j − 1. It would be desirable to have 7 (3.5 × 2) threads running at iteration j. However, our multi-thread activation supports only powers of 2 as thread counts, so the actual number of threads (y j) should be:
In this case, 8 threads are used to meet the required speed s j.
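The rounding rule in this example can be sketched in Python. The sketch assumes that s p is the combined speed of the two edge threads, so y threads provide a speed of y × s p / 2, and that the thread count is the smallest power of two meeting s j. The function name is illustrative, and the paper's own formula for y j is not reproduced here.

```python
def thread_count(s_j, s_p):
    """Smallest power-of-two thread count (at least 2) whose speed reaches s_j.

    s_p is the propagation speed of threads T1 and T2 together, so
    y threads are assumed to deliver y * s_p / 2.
    """
    if s_p <= 0:
        raise ValueError("s_p must be positive")
    threads = 2
    while threads * s_p < 2 * s_j:
        threads *= 2
    return threads


print(thread_count(3.5, 1.0))  # the example from the text: 7 needed, rounded up
print(thread_count(0.5, 1.0))  # T1 and T2 alone are sufficient
```

Doubling in a loop avoids floating-point logarithms, so required speeds that fall exactly on a power of two are not rounded up by mistake.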
EXPERIMENTAL RESULTS
The experiments were conducted using HSPICE with a 32 nm predictive model. We generated 8 cores with sizes ranging from 1.8mm×1.8mm down to 0.14mm×0.14mm. The core circuit is modeled by multiple instances of the C7552 circuit from the ISCAS85 benchmark. We sized the power grid such that the IR drop is 5% of V DD. The footers are sized to incur a voltage drop of 5% of V DD.
We compared three approaches to footer activation: the naive single-thread approach with uniform repeater and driver sizing, the technique presented in [9], and our current shaping with multi-thread activation. We compared the wake-up time, area and power consumption of the three approaches. The area and power figures include the footers, repeaters and drivers. As shown in Table 1, our approach reduces the wake-up time by a factor of 5 to 11 compared with the naive approach, and by a factor of 1.5 to 3 compared with the technique in [9]. The area and power consumption of our technique are also slightly smaller than those of the other two techniques. This is because the leaf node drivers for the footers account for a large portion of the area and power (up to 25% of the total area and 60% of the total power consumption), while our current shaping technique uses very small transistors for the leaf node drivers. The area overhead in Table 1 is the percentage of the overall area relative to the pure footer area.
Figure 8 shows the repeater and footer sizing for the 0.86mm×0.86mm core with our approach. The repeaters passed by the control signal in different iterations are marked with different colors. Sixteen threads are running in the last iteration. Figure 8 also shows the leaf node driver sizing used to implement current shaping. This design achieves a wake-up delay of only 0.341ns. The repeater network in Figure 8 consumes 118% more area than the original repeater ring. However, the drivers account for most of the total area and power. Figure 9 shows the area and power breakdown for the design in Figure 8. The cost of the repeaters is almost negligible, and the reduction in leaf node driver cost achieved by our approach compensates for the extra repeater cost.
CONCLUSION
This paper analyzes the power mode transition problem of power gating at the chip level, taking physical design into consideration. A design rule is proposed for fast wake-up design with guaranteed power integrity. Current shaping and multi-thread activation techniques are proposed and analyzed, and are shown to accelerate the wake-up process significantly.
