A dynamically-controlled power-gated (DCPG) FPGA architecture has recently been proposed to reduce static energy dissipation during idle periods. During a power mode transition from an off state to on state, the wakeup current drawn from power supplies causes a voltage droop on the power distribution network of a device. If not handled appropriately, this current and the associated voltage droop could cause malfunction of the design and/or the device. In DCPG FPGAs, the amount of wakeup current is not known beforehand as the structures of power-gated modules are application dependent; thus, a configurable solution is required to handle wakeup current. In this paper we propose a programmable wakeup architecture for DCPG FPGAs. The proposed solution has two levels: a fixed intra-region level and a configurable inter-region level. The architecture ensures that a power-gated module can be turned on such that the wakeup current constraints are not violated. We study the area and power overheads of the proposed solution. Our results show that the area overhead of the proposed inrush current limiting architecture is less than 2% for a power gating region of size 3x3 or 4x4 tiles, and the leakage power saved is more than 85% in a region of size 4x4 tiles.
INTRODUCTION
Static power dissipation is a major component of the total power consumption in field-programmable gate array (FPGA) devices based on sub-90 nm CMOS technology nodes [2] . A recent white paper from Xilinx shows that even with improved process technology, static power could be as large as dynamic power for 28 nm technology node [11] . This matches the prediction that the effect of static current will increase with continuous technology scaling [21] . The operation of some low-power applications, such as mobile and hand-held devices, is dominated by idle periods with small bursts of activity; this may cause the leakage energy consumed due to static power to surpass that dissipated during activity periods.
Recently, a dynamically-controlled power-gated (DCPG) FPGA architecture has been proposed as a way to reduce idle periods' power consumption [5] . During their idle periods, functional blocks in this architecture could be powered down by using power gating, thus reducing their static power dissipation. Unlike statically-controlled power gating (SCPG) in which the states (on or off) of the different parts in an FPGA device are set at configuration time [18, 8, 23] , DCPG enables run-time control of the power state based on the application's behavior.
An important issue in DCPG architectures is the amount of current that is drawn from the power supply during the wakeup phase, known as the wakeup or inrush current. This current can be large and can cause a temporary IR drop across the power rails. This temporary IR drop, called voltage droop, may cause functional errors due to reduced noise margins and degraded performance by corrupting data storage elements, generating incorrect combinational logic output, or resulting in violations of the timing constraints [13] . Figure 1 shows an illustration of inrush current and the associated voltage droop. As the amount of inrush current increases, the droop on the power grid will increase, resulting in a violation of power integrity constraints if the circuit is not designed appropriately.
In application-specific integrated circuits (ASICs) that employ power gating, the inrush current problem is well understood, and many solutions have been proposed in the literature. These solutions revolve around staggering the wake up phase of the logic to guarantee that voltage droop constraints are not violated [10, 7, 6] . This can be done by chaining the power-on signal to turn on power gates in a specific timing sequence by using appropriate delays. Figure 2 shows an illustration of a typical power-on sequencing solution. As can be seen in the figure, the delay to turn on the next stage in the parallel sleep transistor (ST) chain allows the virtual VDD node to be partially charged, such that when the next ST is turned on, the amount of current will remain within a specified constraint.
In a DCPG FPGA architecture, the problem of handling inrush current is different from that in ASICs. FPGAs are flexible in order to implement a wide range of applications. This means that the structures of power-gated modules that can be mapped to a DCPG FPGA are not known beforehand. The DCPG regions in an FPGA that will be used to implement a specific power-gated module and the amount of wakeup current are not known at fabrication time; therefore, it is not feasible to implement fixed inrush current handling circuitry in such an architecture. Instead, the circuitry must be flexible enough to support a variety of scenarios. To the best of our knowledge, there has been no published work that considers the problem of inrush current in a power-gated FPGA architecture.
In this paper we propose a configurable inrush current handling architecture suitable for DCPG FPGAs. Our proposed architecture contains delay elements to sequence the wakeup phase of a power-gated module implemented in a DCPG FPGA architecture. Although the experienced designer could handle inrush current manually in a DCPG FPGA (as suggested in [5] ), this increases the design overhead, and has disadvantages that our proposed technique overcomes as we discuss in the next section.
Our proposed architecture has two levels. The lower level, which we refer to as the intra-region level, consists of circuitry to wake up a single power gating region (PGR). This circuitry sequences the enabling of a series of sleep transistors, and ensures that the inrush current resulting from powering up a single PGR does not violate the constraints set by the power grid. This level is not configurable; the required delays must be determined when the chip is fabricated. The upper level, which we refer to as the inter-region level, consists of circuitry to sequence the turning on of regions in the same power-gated module to ensure voltage drop constraints are not violated. Since it is not known at fabrication time how big power-gated modules will be or where they will be on the chip, this level must be configurable using static RAM bits that are set when the chip is configured. As described above, this combination of configurable and static wakeup circuitry is unique to FPGAs.
The paper is organized as follows. Section 2 provides background on the sources of voltage drop during the turn-on phase of a power gating architecture, discusses previous work related to handling wakeup current, and provides background on the DCPG FPGA architecture framework assumed in this paper. Section 3 discusses the power grid model used in this paper, and provides an analysis of the effect of inrush current in the DCPG FPGA architecture. Section 4 shows the proposed circuits for handling inrush current and discusses the associated architectural tradeoffs. Section 5 shows the experimental setup and the results for our study. Finally, we conclude the paper in Section 6, and point out directions for future work.
BACKGROUND

Inrush Current in Power-Gated Designs
In power-gated designs, sleep transistors are used as power switches that can be turned off when a functional block is idle to disconnect it from the power supplies. This significantly reduces leakage power dissipation [9] . Assuming a header PMOS switch, as in the DCPG FPGA architecture used in this paper, if a functional block remains in sleep mode for a long time, all internal devices and the virtual VDD node of the block will gradually discharge.
As the functional block is turned on, the sudden charging of all floating nodes in the block causes a large current to be drawn from the power supplies through the sleep transistors. This large surge current causes a voltage drop on the power grid due to IR and L ∂i/∂t drops, leading to functional errors and degraded performance, and may cause reliability problems due to electromigration [12].
IR drop is the voltage drop on the power network metal lines due to their resistance. Usually, the IR drop is analyzed in static power grid analysis techniques, where all metal segments of the power grid are replaced by their equivalent resistances, and the functional blocks are modeled as current sources that draw the maximum average current that can be estimated by power analysis techniques. In this paper, we focus on the IR component of the power grid voltage drop.
The second source of the droop is the inductive voltage drop (L ∂i/∂t). This voltage drop mechanism occurs due to current transients. Typically, inductive voltage drop has a significant effect at the package level of the power distribution network.
Previous Work
There have been some previous works that have proposed the use of power gating in FPGAs.
Bsoul and Wilton suggested handling inrush current manually in their DCPG FPGA architecture [5]. In their approach, the designer creates the power controller such that separate signals are used to wake up every region in a power-gated module; the power controller must be designed to guarantee that only a small amount of the logic in a power-gated module is turned on at a time in order to limit the maximum inrush current. In addition to the complexity that a designer may face in this approach, which is not desirable in FPGAs, the power dissipation of the power controller can increase, offsetting any leakage-energy saving opportunities that may exist in the application. Moreover, the additional signals generated by the power controller will compete with other circuit signals for the FPGA's routing resources, which may negatively affect routability and timing performance, thus affecting energy saving opportunities.
The works in [8, 4, 23, 17] discussed the use of dynamic power modes in FPGAs without addressing the effects of inrush current, and how it can be handled.
Kim et al. proposed reducing glitches on the ground and power rails due to inrush current by dynamically controlling the gate-to-source voltage (Vgs) of sleep transistors [14] . They also proposed daisy-chaining the wakeup of the sleep transistors in a chain of size-increasing sleep transistors.
Howard and Shi proposed reducing inrush current by splitting the chip into logic rows, each powered up by one or few sleep transistors [10] , with a controller to stagger their turn on. They also proposed a two-stage power-on method as an alternative solution. A trickle chain of sleep transistors is turned on to slowly charge the floating nodes, followed by the turn on of the main chain to fully charge the nodes.
Shi and Li proposed a programmable power gating unit [22] . The unit is composed of multiple daisy chains of sleep transistors, and can be configured to select which chains are turned on first to trickle charge the design, and which chains are activated later to fully charge the virtual nodes to VDD.
Calimera et al. presented a power gating reactivation technique based on modulating the size of sleep transistors with delay elements in order to limit the wakeup current [6] . The authors presented an algorithm that can find the optimal sizes of the sleep transistors (STs) in the delay chain for a specific standard cell library.
The above mentioned solutions in [14, 10, 22, 6] are suitable for handling inrush current for designs where the functional blocks are known beforehand. Unlike ASICs, FPGAs are configurable, and they need a solution that is suitable to a wide range of applications.
Dynamically-Controlled Power Gating
The DCPG FPGA architecture [5] is illustrated in Figure 3; the figure shows an FPGA that has an application composed of two power-gated modules, M1 and M2. A power state controller (PSC) is synthesized from the information that describes the behavior of the application, such as the data flow graph (DFG) of the application [4] or any other suitable description. This controller could exploit the idle periods that the modules in the application may experience by turning off the logic in the idle modules, and turning them back on when they exit their idle periods. This requires routing power control signals from the controller to the logic blocks that support power gating.
Figure 4 shows an example of the basic power gating architecture [5]. In this figure, a logic cluster has four input pins and four connection blocks, distributed uniformly on its four sides. Each of the connection blocks can be used either to route an endpoint of a connection to the corresponding input pin, or to route a power control signal that controls the sleep transistor (ST) of the cluster. This scheme allows power control signals to be routed from, say, an on-chip power controller to the target logic clusters in the same way that conventional signals are routed in the original architecture. If a power control signal is to be routed through an input pin, then that input pin is not used as an input to the logic implemented by the cluster. The power state can be either statically set (on/off) or dynamically controlled using the power control signal. This is achieved by properly configuring the 3:1 multiplexer that drives the ST.
The outputs of the connection blocks are fed as inputs to the power gating multiplexer of the logic cluster. This multiplexer selects the input pin that will be used as the power control signal for the cluster and the bordering connection blocks; this signal is labeled PG CNTL1 in the figure. As shown in Figure 4 (b), PG CNTL1 could drive the gate of the sleep transistor to turn it off for low-leakage mode, or to turn it on for normal circuit activity.
The track isolation buffers in a routing channel are shared between the connection blocks of the two neighboring logic clusters; therefore, it is important to not turn the routing channel off if either of the neighboring logic clusters is on. This is ensured in this architecture by ANDing the power ...
This architecture can be extended to larger regions that can be turned off as a unit, thus reducing the area overhead of the required power gating circuitry [5]. In this case, a group of logic clusters and routing channels that are spatially close to each other could be power-gated by the same power control signal. Figure 5 illustrates a power gating region of size 2 (a region size of R means the region has R² tiles). The bordering routing channels can be used as access points for the power control signals. Obviously, other variations of this architecture are possible where a subset of the connection boxes could be used to provide inputs to the power gating multiplexer instead of using all connection boxes.
The DCPG architecture in [5] does not turn off SRAM configuration bits or flip-flops inside the logic clusters in order to retain their values. Moreover, the architecture assumes that switch boxes are always turned on. We follow the same assumptions in this paper. Note that although switch boxes are not turned off, power gating of a logic cluster and its routing channels reduces the leakage power of a tile by more than 40% [5] .
EFFECT OF INRUSH CURRENT
In this section, we will show that the inrush current seen by the baseline DCPG architecture can cause a large voltage droop, motivating our design in Section 4. Our estimation methodology is as follows. We first model a power grid based on estimates of the current drawn from the power supply during normal operation of an FPGA. We then model the impact of the additional current drawn when a region of the DCPG FPGA architecture is turned on, and show that, with the same size power grid, an unacceptable voltage droop occurs. Finally, we show that if we were to increase the size of the power grid to supply the required additional current, the area required by the power grid increases significantly, motivating our alternative approach.
Baseline Power Grid Model
Our model of the power grid is similar to that in [15, 16] . We assume a mesh-like power grid structure [19] , in which each metal layer has alternating VDD and GND lines. The number of lines in each layer is determined based on the width and spacing between the lines. Vias are provided at the intersection of VDD (GND) lines in adjacent layers. This builds a large VDD (GND) net that can provide power for the transistors. Clean VDD (GND) sources (power supplies) are positioned, and evenly distributed in the top metal layer of the power grid.
In order to determine the width of the power and ground lines, and the number of clean power supplies at the top level, we must first estimate the expected current requirements of the device during normal operation of the DCPG FPGA architecture (at times other than when regions of the chip are turning on or off). To do this, we first mapped the twenty largest MCNC circuits to an FPGA with parameters N = 6, I = 16, W = 90, Fs = 3, Fc,in = 0.2, and Fc,out = 0.1. We then used an enhanced version of the Poon power model [20] (modified to better model leakage power based on curve fitting of HSPICE simulation results) to estimate the power dissipated by each design per FPGA tile, averaged over all tiles in the design. We then used the maximum such power per tile across all designs; this quantity is denoted Pmax avg . For the architecture parameters above, we found that Pmax avg = 400 μW (for 45 nm technology).
We then created a parameterized HSPICE model of the power grid. Each metal segment and via in the power grid is modeled as a resistance. FPGA tiles are modeled as independent current sources. These current sources correspond to the current drawn by each FPGA tile during the normal operation (which depends on Pmax avg ).
We assume the connections between the power grid and each tile are distributed in an array of size n × n where n represents the granularity at which we model the power grid at the lowest metal layers. A large value of n would lead to more accurate results at the cost of longer simulation times; in our experiments, we found that n = 2 gives adequate accuracy with reasonable simulation time. This is shown on the right side of Figure 6 , where there are 2x2 current sources per tile. The amount of current drawn by each current source is Isrc = Pmax avg /(VDD × n × n).
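As a quick check of the expression for Isrc, the short Python sketch below evaluates it with the values quoted in the text (Pmax avg = 400 μW and n = 2). The supply voltage of 1 V is an assumption here, consistent with 50 mV being described as 5% of VDD; the function and variable names are illustrative only.

```python
# Per-source current for the tile model: I_src = P_max_avg / (VDD * n * n).
P_MAX_AVG_W = 400e-6  # maximum average power per tile (400 uW, from the text)
VDD_V = 1.0           # assumed supply voltage (50 mV is stated to be 5% of VDD)
N = 2                 # the tile is modeled by an n x n array of current sources


def source_current(p_tile_w: float, vdd_v: float, n: int) -> float:
    """Current drawn by each of the n*n sources that model one tile."""
    return p_tile_w / (vdd_v * n * n)


i_src = source_current(P_MAX_AVG_W, VDD_V, N)
print(f"I_src = {i_src * 1e6:.1f} uA per source")
print(f"total = {i_src * N * N * 1e6:.1f} uA per tile")
```

Under these assumptions each of the four sources in a tile draws 100 μA, for a total of 400 μA per tile.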
We then performed HSPICE simulations, and iteratively adjusted the width of each metal line in the power grid and the number of clean power sources until the IR drop across the power grid was less than 5% of VDD (in our case, this corresponds to 50 mV). This represents the size of the power grid that can supply current to the DCPG architecture during normal operation. Note that this approximation is both pessimistic and optimistic. It is pessimistic because we have assumed the largest current seen by all of our benchmark circuits, and assumed this current is drawn from each tile. It is unlikely that during normal operation, all tiles would draw this maximum current all the time. It is optimistic, because in a real FPGA, the power grid would likely be over-designed (i.e., provisioned to supply more current than the benchmarks might predict). We consider the impact of over-provisioning the power network in Subsection 3.3.
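The iterative sizing loop described above can be illustrated with a toy model. The Python sketch below is not the HSPICE flow used in this work: it assumes a single-layer uniform resistive mesh, ideal VDD pads on a regular lattice, and a simple Gauss-Seidel nodal solve. The sheet resistance, segment length, starting width, grid size, and pad pitch are all hypothetical values chosen only to make the loop iterate.

```python
def max_ir_drop(grid, r_seg, i_node, vdd, pad_pitch, sweeps=500):
    """Worst-case IR drop on a uniform grid x grid resistive mesh.

    Every non-pad node sinks i_node; nodes on the pad lattice are ideal
    VDD sources. The nodal equations are relaxed with Gauss-Seidel sweeps.
    """
    v = [[vdd] * grid for _ in range(grid)]
    for _ in range(sweeps):
        for r in range(grid):
            for c in range(grid):
                if r % pad_pitch == 0 and c % pad_pitch == 0:
                    continue  # clean supply node, held at VDD
                nbrs = [v[rr][cc]
                        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                        if 0 <= rr < grid and 0 <= cc < grid]
                # KCL at this node: sum((v_nbr - v) / r_seg) = i_node
                v[r][c] = (sum(nbrs) - i_node * r_seg) / len(nbrs)
    return vdd - min(min(row) for row in v)


def size_lines(width_um, sheet_ohm_sq, seg_len_um, i_node, vdd,
               grid=9, pad_pitch=4, target_frac=0.05, step=1.1):
    """Widen all lines in 10% steps until the drop is within target_frac * vdd."""
    while True:
        r_seg = sheet_ohm_sq * seg_len_um / width_um  # segment resistance
        drop = max_ir_drop(grid, r_seg, i_node, vdd, pad_pitch)
        if drop <= target_frac * vdd:
            return width_um, drop
        width_um *= step


# Hypothetical starting point; only i_node (100 uA) echoes a value from the text.
width, drop = size_lines(width_um=0.05, sheet_ohm_sq=0.1, seg_len_um=50.0,
                         i_node=100e-6, vdd=1.0)
print(f"line width {width:.3f} um -> worst-case IR drop {drop * 1e3:.1f} mV")
```

Because every segment in this toy mesh has the same resistance, the drop scales inversely with the line width, so the loop always terminates. A real multi-layer grid with vias and a variable number of supplies does not have this simple scaling, which is why simulation-driven iteration is needed.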
Voltage Droop Estimation
The power grid described above was sized to provide adequate current during normal operation. When a region is turned on, there will be additional inrush current, which will cause voltage droop on the power rails.
To estimate the magnitude of the voltage droop, we created a detailed transistor-level model of a single region. We then used HSPICE to determine the amount of current drawn per tile when a region is turned on. We then used the original power grid model from Subsection 3.1, replacing the normal operating current Isrc with this new (larger) current. Using HSPICE, we then measured the maximum voltage droop that occurs on the voltage rails. Figure 7 (a) shows the voltage droop as a function of region size (a region size of R means each region has R² tiles). As can be seen in the figure, for all region sizes, the droop is more than 100 mV, which is twice our target of 50 mV (5% of VDD). This could lead to incorrect operation of the FPGA.
It is interesting to note that the IR drop decreases as the region size increases. This is because the per-tile inrush current is smaller for larger region sizes. This phenomenon happens because as the region size increases, the number of bordering routing channels increases, i.e., the number of RCs pulled out to the region's borders increases. These bordering RCs have their own STs that turn on faster than the region's ST, resulting in a small overlap between their inrush current and the inrush current of the region's ST. Thus, the number of floating nodes per tile that need to be charged when turning on the region's sleep transistor decreases as the region size increases, resulting in a smaller inrush current per tile.
Over-Provisioning the Power Grid
One naive solution to the large voltage droop problem is to over-provision the power grid. To investigate this, we again iteratively adjusted the metal width and the number of clean VDD supplies until HSPICE predictions showed that the voltage droop due to the inrush current is below our 50 mV target. Using an area model, we were able to estimate the area impact of doing this; Figure 7(b) shows the ratio of the area required by the modified power grid to the area of the original power grid, as a function of region size. Clearly, the overhead in doing this is significant. This result motivates the more intelligent inrush current limiting architecture described in the next section.
PROPOSED ARCHITECTURE
In this section, we describe our new power-gating architecture for handling inrush current in a DCPG FPGA. The architecture consists of strategically placed configurable and non-configurable delay elements and sleep transistors that can be used to ensure that inrush current does not violate the constraints set by the power grid. The proposed architecture has two levels: a fixed intra-region level and a configurable inter-region level. Each is described below.
Intra-Region Level
As described in Section 2, a power gating region (PGR) consists of one or more tiles (a tile is a logic block surrounded by routing), and is the smallest unit of granularity that can be turned on or off. The purpose of the intra-region power gating architecture is to limit the amount of current drawn by a single region when it is turned on. By limiting the current drawn by a single region, it becomes possible to turn on multiple regions simultaneously; this tradeoff will be revisited in Subsection 4.3.
The top portion of Figure 8 shows the intra-region power gating architecture. Instead of using a single sleep transistor (ST) for the whole region, a set of parallel STs is used for the region. At a wakeup event, the STs are turned on sequentially in order to limit the inrush current to a set value (I max tile ). Clearly, I max tile must be small enough that voltage droop does not occur on the power rail when a single region is turned on; as will be described in Subsection 4.3, there are tradeoffs that may motivate a significantly smaller value of I max tile .
The value of I max tile is fixed, and is determined at fabrication time. As a result, the delay elements and the sleep transistors (STs) do not need to be configurable. For a given I max tile , the sizes of the delay elements and the STs can be found using the sizing algorithm proposed in [6]. The inputs to this algorithm include the current constraint and a minimum delay value. Unlike [6], which assumes that all delays are created by a set of identical delay elements, we use a library of delay elements. The smallest delay element in our library has a delay of 100 ps. Figure 9 (a) shows an example delay element that consists of a series of transmission gates. Larger delays can be realized by increasing the chain length or by increasing the gate length of individual devices.
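To make the idea of sequenced turn-on concrete, the Python sketch below computes how long to wait before enabling each successive sleep transistor so that the total current never exceeds a limit. It is not the sizing algorithm of [6]: it assumes identical STs modeled as linear on-resistances, a single lumped virtual-VDD capacitance that starts fully discharged, and delays rounded up to multiples of the 100 ps minimum delay element. All numeric values and names are hypothetical.

```python
import math


def stagger_schedule(n_st, r_on, c_virt, vdd, i_max, delay_step=100e-12):
    """Return (delays, peaks) for enabling n_st identical parallel STs one by one.

    delays[k] is the wait between enabling ST k+1 and ST k+2, rounded up to a
    multiple of delay_step; peaks[k] is the current right after ST k+1 turns on.
    """
    if vdd / r_on > i_max:
        raise ValueError("the first ST alone exceeds i_max; it must be weaker")
    droop = vdd  # VDD minus the virtual-VDD voltage (node starts discharged)
    delays, peaks = [], [vdd / r_on]
    for k in range(1, n_st):  # k STs are currently on
        tau = r_on * c_virt / k           # RC time constant with k STs on
        allowed = i_max * r_on / (k + 1)  # droop at which k+1 STs draw i_max
        wait = tau * math.log(droop / allowed) if droop > allowed else 0.0
        wait = math.ceil(wait / delay_step) * delay_step  # whole delay elements
        droop *= math.exp(-wait / tau)    # virtual VDD charges during the wait
        delays.append(wait)
        peaks.append((k + 1) * droop / r_on)
    return delays, peaks


delays, peaks = stagger_schedule(n_st=4, r_on=2000.0, c_virt=1e-12,
                                 vdd=1.0, i_max=0.6e-3)
for k, (d, p) in enumerate(zip([0.0] + delays, peaks), start=1):
    print(f"ST{k}: wait {d * 1e12:.0f} ps, peak current {p * 1e3:.3f} mA")
```

Rounding each delay up rather than to the nearest step keeps the schedule on the safe side: a longer wait lets the virtual VDD node charge further, so the current at the next turn-on can only be lower.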
Note that it is also necessary to turn on the STs for the surrounding routing channels (RCs, which comprise connection blocks and track isolation buffers) of the region since they have their own virtual VDD nodes (see Subsection 2.3). As shown in Figure 8, the output signal of the power gating multiplexer is chained through all RCs to wake up STs that are in sleep mode. Figure 10 shows an example of the first two RCs.
The design of the wakeup circuit in the bordering RCs of a power gating region (PGR) is similar to that for the internal part of the PGR ( [6] is used to size sleep transistors and to insert delay elements). The only difference is the additional 2:1 multiplexer at the output of the last delay element in an RC's power gating circuit to select among the power control signal that was used for the RC, or the delayed output from the last delay element. This multiplexer will ensure that the correct wakeup signal is routed to the next RC in the chain even if the current RC is statically powered on/off.
Although the authors in [6] suggest that their sizing algorithm is optimal, we believe that manual optimizations can be further achieved in our architecture because we have control over the design of delay elements, unlike in [6] where they assume a specific library of cells. However, we leave the exploration of further optimizations as future work.
Inter-Region Level
The circuitry described in the previous subsection ensures that a single region can be turned on without violating the current constraint of the power grid. However, in practice there might be excess current caused by other activity, such as unrelated signals passing through the region (for switch boxes that are not powered down), or from transient signal changes while turning on a region. Furthermore, it is unlikely that the part of a user circuit that will be turned on will be confined to a single architectural region. If multiple regions are turned on simultaneously, then, even with the architecture described in the previous subsection, current constraints might be violated causing a large voltage droop. Since the pattern of which regions will turn on together is specific to the user circuit, it is impossible to design a fixed architecture that is suitable for all application circuits. This is different than the more traditional problem of designing wakeup circuitry for a fixed-function chip (such as an ASIC); our architecture must be flexible enough to work for a wide variety of user circuit scenarios.
Our approach is to provide a Programmable Delay Element (PDE) for each region on the chip. As shown in Figure 8, the PDE is inserted just after the region gating multiplexer that selects which of the region's inputs is used as a power gating signal. Figure 9 (b) shows our implementation of each PDE. Each of the blocks labeled ΔT represents a delay element whose delay is the minimum amount required to wake up a PGR.
The proposed architecture allows the CAD tool to configurably delay the turn on of individual regions, to limit the number of regions that turn on simultaneously. The maximum number of regions that can turn on at a time is dictated by the architecture, and will be denoted Rconcur. If a user's circuit contains a power-gated functional block that occupies R block regions, then the CAD tool can logically divide the block into R block /Rconcur parts. Each part would be configured with a different delay, ensuring that when a wakeup event occurs, no more than Rconcur regions are turned on at a time. The value of Rconcur is dictated by the architecture, and depends on the design of the power grid as well as the value of I max tile ; these tradeoffs will be revisited in Subsection 4.3.
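The partitioning step that the CAD tool performs can be sketched as follows. This Python fragment is only an illustration under stated assumptions: region names, the `pde_taps` parameter (the number of selectable delay settings offered by one PDE), and the in-order grouping policy are all hypothetical, and the number of parts is rounded up when R block is not a multiple of Rconcur.

```python
import math


def assign_pde_steps(region_ids, r_concur, pde_taps):
    """Map each region to a PDE delay step so at most r_concur regions share a step.

    A region assigned step s would be configured to start waking up s * dT
    after the wakeup event, where dT is the time needed to wake up one region.
    """
    if r_concur < 1:
        raise ValueError("r_concur must be at least 1")
    n_steps = math.ceil(len(region_ids) / r_concur)
    if n_steps > pde_taps:
        raise ValueError("module needs more delay steps than the PDE offers; "
                         "a multi-cycle wakeup built from FPGA logic is needed")
    return {rid: i // r_concur for i, rid in enumerate(region_ids)}


regions = [f"PGR{i}" for i in range(10)]  # hypothetical 10-region module
steps = assign_pde_steps(regions, r_concur=4, pde_taps=4)
for rid in regions:
    print(f"{rid}: enable after {steps[rid]} x dT")
```

In this example a ten-region module with Rconcur = 4 is split into three parts of four, four, and two regions. A placement-aware tool might instead group regions so that those waking together are spread across the power grid; that choice is outside the scope of this sketch.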
Note that the size of the multiplexer in the PDE dictates the maximum number of regions that can be turned on due to one wakeup event; if more regions are to be turned on, the CAD tool can create a multi-cycle wakeup circuit out of the general-purpose FPGA logic.
Architectural Tradeoffs
There is a complex set of tradeoffs between the architectural parameters in our wakeup circuitry and the area of the architecture and the power and wakeup time of applications. The value of I max tile , selected when designing the intra-region architecture, determines the sizes of the delay elements. While it may seem desirable to make this as large as possible to minimize the area of the intra-region level circuitry, and to minimize the time to turn on a single region, doing so reduces the achievable value of Rconcur, since each region draws more instantaneous current. A lower value of Rconcur means that more "steps" are required to turn on a large functional block. In addition, a lower value of Rconcur would imply more fingers in the PDE multiplexer, increasing its area and leakage power. The size of the power grid also affects these parameters; a larger power grid would be able to supply more instantaneous current with an acceptable voltage droop, relaxing the requirements of our power gating architecture. The optimization of all of these parameters is a complex problem; in Section 5, we experimentally investigate some of these tradeoffs.
Table 1: Terms related to the DCPG FPGA architecture (excerpt)
E turn on : Energy consumed to turn on a region
E turn off : Energy consumed to turn off a region
Conditions for Energy Savings
Power-gating a functional block is only beneficial if the idle time is longer than a certain threshold. To establish the mathematical model for this constraint, we use the terms in Table 1 that are related to the DCPG FPGA architecture.
If the functional block is not placed in sleep mode during its idle period, then its energy consumption during the idle period using the power gating architecture is:
On the other hand, if the block is placed in sleep mode during its idle period, then the energy consumption during sleep mode and to enter and exit sleep mode is:
where E DPT is the energy consumed during power transitions; that is, during the time of the turn-on phase and during the time of the turn-off phase. During the period of turning off a power-gated module, only Rconcur regions can be turned off simultaneously. Therefore, some of the regions will be dissipating leakage energy until the whole power-gated module is powered down. Similarly, during the power-up phase, some of the regions will dissipate leakage energy until the whole power-gated module is turned on. The leakage energy dissipated by these regions during both the turn-on and turn-off phases of a power-gated module is accounted for in Equation 3.
where m = R block /Rconcur . Therefore, putting a functional block in sleep mode is more energy efficient than not doing so when E idle > E sleep . This can also be written in terms of the block's idle time (t idle ) as:
Note that the analysis above does not consider other parameters that play a role on deciding whether using sleep mode is feasible or not, such as the energy consumed by a power controller. However, such terms can be integrated when dealing with higher level energy models that describe the energy for a complete system.
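The break-even condition E idle > E sleep can be illustrated with a deliberately simplified model. The Python sketch below does not reproduce the equations of this section: it assumes that an idle but powered block dissipates a constant leakage power, that a gated block dissipates a smaller constant residual power, and that all turn-on and turn-off costs (including leakage during staggered transitions) are lumped into a single transition energy. The parameter names and numeric values are hypothetical.

```python
def break_even_idle_time(p_on_idle_w, p_sleep_w, e_transition_j):
    """Shortest idle time for which sleeping saves energy (simplified model).

    Assumed model:  E_idle  = p_on_idle_w * t_idle
                    E_sleep = p_sleep_w * t_idle + e_transition_j
    """
    saving_w = p_on_idle_w - p_sleep_w
    if saving_w <= 0:
        return float("inf")  # gating never pays off if it saves no power
    return e_transition_j / saving_w


def should_sleep(t_idle_s, p_on_idle_w, p_sleep_w, e_transition_j):
    """True when the expected idle period exceeds the break-even time."""
    return t_idle_s > break_even_idle_time(p_on_idle_w, p_sleep_w, e_transition_j)


# Hypothetical block: 200 uW idle leakage, 30 uW when gated, 17 pJ per sleep cycle.
t_be = break_even_idle_time(200e-6, 30e-6, 17e-12)
print(f"break-even idle time: {t_be * 1e9:.0f} ns")
print("sleep for a 1 us idle period?", should_sleep(1e-6, 200e-6, 30e-6, 17e-12))
```

In this simplified view, a smaller Rconcur lengthens the transitions and therefore raises the lumped transition energy, which pushes the break-even idle time upward.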
EXPERIMENTAL SETUP AND RESULTS
In this section we evaluate the proposed inrush current limiting architecture, and investigate some of the architectural tradeoffs identified in Section 4.3.
Experimental Setup
Analysis Settings
We assume 45 nm [24] with VDD = 1 V. Power consumption and duration of power transitions were measured assuming the worst case temperature of 85 °C. In the algorithm that finds the sizes of the sleep transistors in the intra-region architecture, however, we assume a temperature of 25 °C. This is because it is possible that an on transition happens after a long idle period, in which the temperature has gone down. At this temperature, a sleep transistor can deliver current that is larger than at higher temperatures.
Power Grid
We use the methodology explained in Section 3 to build the power grid used in this study. The power grid is assumed for a chip of 3 × 3 mm size (57 × 57 tiles). The power grid has been synthesized manually using M1-M4, with a pitch of 30 μm for the top metal layers. Clean VDD and GND sources were distributed at a spacing of 400 μm.
For the lowest metal layer, which has the current sources that represent the tiles' circuits, we used a pitch equal to the physical width of a tile divided by the number of sources (n = 2) in each metal segment passing over a tile. Thus, there are n × n current sources connected to the lowest metal layer lines over each tile. The physical length of a tile was found by mapping the number of minimum-width transistor areas (MWTA) of a tile to the physical dimensions of the tile, assuming square tiles [3] .
We assume that the maximum allowed voltage drop at a virtual VDD node is 100 mV (10% of VDD). This budget is split into 50 mV of IR drop on the power grid and 50 mV of drop across the sleep transistors.
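The grid geometry and droop budget above reduce to a few lines of arithmetic. The sketch below assumes that the 57 tiles span the full 3 mm die edge; the paper derives the tile size from the MWTA count instead, so the tile width computed here is only an approximation. Variable names are illustrative.

```python
CHIP_EDGE_UM = 3000.0       # 3 mm die edge
TILES_PER_EDGE = 57
SOURCES_PER_SEGMENT = 2     # n in the text
VDD_MV = 1000.0             # VDD = 1 V

# Assumption: tiles fill the die edge-to-edge.
tile_width_um = CHIP_EDGE_UM / TILES_PER_EDGE
lowest_layer_pitch_um = tile_width_um / SOURCES_PER_SEGMENT
sources_per_tile = SOURCES_PER_SEGMENT ** 2  # n x n current sources

# Voltage drop budget at a virtual VDD node: 10% of VDD.
max_drop_mv = 0.10 * VDD_MV
grid_ir_budget_mv = 50.0
sleep_transistor_budget_mv = max_drop_mv - grid_ir_budget_mv

print(f"tile width ~ {tile_width_um:.1f} um")
print(f"lowest-layer pitch ~ {lowest_layer_pitch_um:.1f} um")
print(f"current sources per tile = {sources_per_tile}")
print(f"sleep transistor budget = {sleep_transistor_budget_mv:.0f} mV")
```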
Power Gating Architecture
To size the sleep transistors, we first used HSPICE simulations of the transistor-level model of a power gating region to determine the total effective width of the sleep transistors, and then used the algorithm from [6] to break this into individual sleep transistors as discussed in Section 4. In choosing the total effective width, we assumed a maximum allowable voltage drop of 50 mV on the sleep transistor of a power gating region during normal operation, and a worst-case temperature of 85 °C. Our sizing method requires an estimate of the activity of the nodes in a region; rather than performing extensive power analysis using the Poon power model, we assume an activity of 30% for these nodes. We found that this approximation over-estimates the power and results in larger-than-necessary sleep transistors; however, the overall impact on the results is small. Similar to the architecture in [5] , we assumed that switch boxes and storage elements are not turned off during sleep mode.
FPGA Architecture and Area Model
We use FPGA architecture parameters similar to those used in Section 3: N = 6, I = 16, W = 90, Fs = 3, Fc,in = 0.2, and Fc,out = 0.1. To calculate areas, we used the MWTA model from [3] . Note that in some cases, such as delay elements, we had to increase the resistance of transistors (to achieve larger delays) by increasing their gate length. To account for this in the area model, we first calculated the area of the minimum-sized transistor in the 45 nm technology node, assuming MOSIS scalable CMOS design rules [1] . Then we calculated the area of the transistor as the gate length is increased. We found that increasing the gate length by a factor of l results in a 0.125 × l increase in the area of the transistor. We used the same scaling factor to scale results from the MWTA area model for transistors that have larger-than-minimum gate length.
Results and Discussion
Intra-Region Level
In order to understand the area, timing, and energy overheads of the proposed architecture, we varied the maximum current supported by the power grid in each tile location; this corresponds to different power grid area costs. Figure 11(a) shows the area overhead of the delay elements and 2:1 multiplexers in the intra-region level for different PGR sizes, compared to the area of a tile that has no power gating circuitry. The area of a tile's switch box is not included. Figure 11(b) shows the wakeup time. We can see that as I max tile increases, the area overhead and the wakeup time decrease. This is due to a reduction in the number of stages of delay elements as well as a reduction in the size of each delay element. It is clear from the figures that larger PGR sizes have a smaller area overhead and a smaller wakeup time per tile.
Figure 11(c) shows the energy consumption during power mode transitions (the sum of the turn-on and turn-off energy) for different region sizes. This energy is due to delay elements as well as inrush current. As I max tile increases, the energy due to power mode transitions decreases. It is interesting to note that the energy due to the inrush current dominates the total energy. For example, for I max tile = 400 μA, the transition energy due to inrush current is about 86% of the total energy (graphs not included due to space constraints). Although increasing the wakeup time leads to decreased instantaneous power due to smaller inrush current, the overall energy is increased.
Figure 12 shows the reduction in leakage power of a region achieved by turning off that region for both the baseline architecture (from [5] ) and the proposed architecture described in Section 4, both compared to the leakage in a region in the baseline architecture. A region size R = 4 (4x4 tiles) is assumed in this figure. As can be seen, turning off a region has a dramatic effect on leakage power; however, the savings are smaller for the proposed architecture. This is due to the leakage of the delay elements. We believe, however, that circuit-level optimizations can reduce the leakage power, such as using a larger gate length for devices in buffers that drive sleep transistors and combining the wakeup of multiple routing channels. These optimizations will be investigated as future work.
The extra leakage overhead in the proposed architecture also occurs when the region is in its on-state, as shown in Figure 13 . Again, we anticipate that circuit-level optimizations could reduce this overhead somewhat.
Inter-Region Level
In the architecture assumed in this section, the power grid is sized sufficiently such that, as long as each region draws no more than I max tile , it can supply enough current to turn on all regions on the chip simultaneously, assuming there is no other activity. However, in practice, there will be other transitions on the chip. In the presence of these extra transitions, there is a limit to the number of regions that can be turned on simultaneously. Figure 14(a) shows the IR drop in the power rail due to this extra current, which we denote Iexcess, assuming the worst case in which all power gating regions in a chip are turned on simultaneously. Clearly, the IR drop violates our constraint of 50 mV. Figure 14(b) shows the impact this has on the number of regions that can be turned on simultaneously, Rconcur. As can be seen, as Iexcess increases, the number of regions that can be turned on simultaneously drops significantly.
As described in Section 4.3, there is a relationship between the number of regions that can be turned on at a time, Rconcur, and the size of the PDE in each tile. As Rconcur increases, fewer "steps" are required to turn on a large power-gated functional block, meaning each PDE can be smaller. Figure 15(a) shows this relationship for two example functional block sizes: a small one (ex5p, 109 tiles) and a large one (clma, 775 tiles). These functional block sizes represent what architects may use as a target when designing the intra- and inter-region circuits (i.e., the largest sizes of power-gated functional blocks that can be turned on using only one turn-on event). As Rconcur increases, the required PDE size decreases, which leads to smaller area/power overheads of the inter-region level (Figure 15(b)), but at the cost of power grid sizing or the area/power overheads of the intra-region level.
Minimum T idle to Achieve Energy Saving
Figure 16 shows the minimum idle time (t idle ) that is required for a functional block in order to achieve energy savings when turned off using the proposed architecture. We used the relation in (4) to obtain the results in this figure. The results are reported assuming a region with R = 4 and I max tile = 400 μA. As Rconcur increases, the required PDE size (number of supported delays) decreases, which leads to lower area/power overheads of the inter-region level; this in turn leads to a smaller minimum t idle .
As can be seen in Figure 16 , in the worst case (when Rconcur = 1), the minimum t idle that is required in order to achieve energy savings is about 200-900 ns. These times are relatively small when compared to the idle times that applications, such as those on mobile devices, experience in real life. For example, for an application that runs at 500 MHz, a functional block needs to be idle for at most about 450 cycles (900 ns) in order to achieve energy savings.
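The cycle count quoted above follows directly from the reported idle-time range and the clock frequency. The short sketch below makes the conversion explicit for both ends of the 200-900 ns range; the helper name is illustrative.

```python
def idle_cycles(t_idle_ns: float, clock_mhz: float) -> float:
    """Convert an idle time in nanoseconds to clock cycles.

    One MHz corresponds to 1e-3 cycles per nanosecond, hence the division by 1000.
    """
    return t_idle_ns * clock_mhz / 1000.0


CLOCK_MHZ = 500.0
for t_ns in (200.0, 900.0):
    print(f"{t_ns:.0f} ns at {CLOCK_MHZ:.0f} MHz = {idle_cycles(t_ns, CLOCK_MHZ):.0f} cycles")
```

At 500 MHz the reported range therefore corresponds to roughly 100 to 450 clock cycles.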
CONCLUSION AND FUTURE WORK
Wakeup inrush current can cause large voltage droop on the power distribution network in a chip, leading to a malfunction of the design or the device. In DCPG FPGAs, the problem is different from that in ASICs as the structure of applications and the power-gated modules is not known at fabrication time; thus, a configurable architecture is required to solve this problem.
In this paper, we presented a configurable architecture to limit the wakeup current during turn-on in dynamically-controlled power-gated FPGAs. Our approach has a fixed intra-region level and a configurable inter-region level. Appropriate design of the intra-region level ensures that voltage droop constraints are not violated in a power gating region. By combining the intra-region and inter-region levels, it is possible to provide design-time configurability for the turn-on of multiple regions in a power-gated application.
We investigated different tradeoffs associated with the two levels of the inrush current limiting architecture. We found that the overhead of the proposed intra- and inter-region architecture is small in terms of its area and power.
As future work, we hope to investigate other tradeoffs associated with the proposed architecture, such as the combination of power grid sizing, intra-region level overdesign, and inter-region level flexibility that achieves the best power/area results. Clearly, this is a complex multi-objective optimization problem that depends not only on the architecture parameters, but also on the target application domain.
Another interesting area of future work is the circuit-level optimization that can be performed in order to reduce the power/area overheads and to shorten turn-on times. Circuit-level optimizations of delay elements and the buffers that are inserted between delay elements and sleep transistors, and the possibility of turning on multiple routing channels simultaneously, could lead to smaller area/power overheads and shorter wakeup times, resulting in more energy savings.
Another interesting avenue for future work is the development of CAD tools that can automatically detect opportunities for power gating functional blocks. Today, ASIC designers identify these blocks manually through design intent files, and initially, this is the way we would expect this architecture to be used. However, in the long term, automating this process would significantly simplify system design.
