Abstract: We present two techniques to reduce the power consumption in FPGAs. The first technique uses two supply voltages: timing-critical paths run on normal Vdd, while the non-critical ones save power by using a lower Vdd. Our programmable dual-Vdd architectures and Vdd assignment algorithms provide an average power saving of 61% across the MCNC benchmarks. The second technique targets applications where configuration time is crucial. It uses Asymmetric SRAM (ASRAM) (instead of high-Vt SRAM) cells to implement the configuration memory. Our bit-inversion algorithm further reduces leakage by increasing the number of ASRAM cells that are in their preferred state.
Introduction
Field Programmable Gate Arrays (FPGAs) are rapidly gaining popularity. Compared to ASICs, they provide lower non-recurring costs, shorter time-to-market and greater ease of use. Compared to microprocessors, they provide higher performance, especially for parallel applications. Furthermore, their reconfigurability allows designs to be updated much as software is updated. Recent advances allow portions of the FPGA to be reconfigured while the rest of the device remains active (Xilinx; Altera).
Despite these desirable features, FPGAs are rarely used in power-critical applications because they consume nearly five times more power than ASICs (Schumacher, 2003). The two major components of power consumption in an integrated circuit are dynamic power and leakage. Both components are larger in an FPGA than in an ASIC: leakage because of the extra transistors used to provide programmability, and dynamic power because of the capacitive loads of the programmable switches in the routing tracks.
In this paper, we present two techniques (Gayasen et al., 2004a; Srinivasan et al., 2004) to reduce the power consumption in FPGAs. The first technique reduces both dynamic and leakage powers by using two supply voltages (high: Vddh, low: Vddl). Through the use of supply transistors, circuit blocks on the FPGA can run on either of the two voltages. Extra configuration bits store the state of the supply transistors, which sets the Vdd for every circuit block. A Vdd assignment algorithm assigns values to these configuration bits, such that timing constraints are not violated. We integrated this algorithm with VPR (an FPGA place-and-route tool from Toronto; Betz and Rose, 1997) and obtained a 61% reduction in total FPGA power. For the case when all routing muxes can be individually programmed to either Vddh or Vddl, the FPGA area increases by about 50%. In order to reduce this area penalty, we present another architecture where the voltages of the routing resources are fixed (at either Vddh or Vddl) at the time of fabrication. The router then routes the critical nets through the resources that are on high Vdd, thus maintaining the performance of the design. We observe that this architecture reduces the FPGA energy by 57.3% with an area penalty of only about 20%, and the performance for all the benchmark designs remains within 20% of the original (when all resources are at Vddh). We further reduce the area penalty by controlling the Vdd of multiple logic blocks from the same set of supply transistors.
The second technique aims to reduce the leakage power in the configuration memory of FPGAs without increasing the area or configuration time. Since, in a 90 nm FPGA, configuration memory consumes nearly 38% of total leakage (Tuan and Lai, 2003), reducing its power is important. FPGA manufacturers have recognised the need for reducing leakage in configuration memory.
Some recent FPGAs use high threshold voltage (Vt) transistors for the configuration memory cells to reduce their leakage. This, however, increases their configuration time (because of increase in the latency of the memory cells). For some applications, such as those that use FPGAs as coprocessors to accelerate certain functions, the FPGA needs to be dynamically reconfigured. These applications will greatly benefit from a reduction in configuration time. To achieve this goal, we propose the use of ASRAM cells to store configuration. ASRAM cells are designed to have low leakage when they store their preferred bit (either 0 or 1). Azizi et al. (2002) showed that, compared with a high-Vt SRAM cell, ASRAM reduces the latency by 37%, while providing comparable leakage benefits (reduction by a factor of seven compared to normal SRAM). In order to fully use the polarity bias of ASRAM, we also propose a bit-inversion algorithm to maximise the number of preferred bits in the configuration bit-stream. We implemented this algorithm using JBits (a set of Java APIs available from Xilinx) and tested it on real designs to increase the number of zeros in the Look-Up Table (LUT) configuration bit-stream by 27% on the average. Note that both these techniques preserve the placement and routing of the design and only modify the configuration bits after routing.
The remainder of this paper is organised as follows. Section 2 provides a brief overview of existing work on low power FPGAs. Section 3 describes the dual-Vdd technique in detail, including the experimentation and results. Section 4 presents the second technique of using ASRAM cells to reduce leakage. Section 5 concludes the paper with some suggestions for future work.
Related work
Several researchers have looked at power consumption in FPGAs before. Most of them focused on dynamic power (e.g., George et al., 1999). Kusse and Rabaey (1998) measured the dynamic power of a Xilinx XC4003A FPGA. Shang et al. (2002) analysed the dynamic power consumption in a Virtex-II FPGA. Poon et al. (2002) and Li et al. (2003) evaluated different FPGA architectures for power efficiency. Singh and Marek-Sadowska (2002) presented a routability-driven bottom-up clustering technique for area and power reduction in clustered FPGAs. Kumthekar et al. (1998) proposed a technique to modify the LUT configuration bits to reduce dynamic power.
Leakage in FPGAs has captured interest only very recently. Tuan and Lai (2003) analyse leakage power in Xilinx FPGAs. They conclude that FPGA leakage must be significantly reduced to enable their use in mobile applications. Calhoun et al. (2003) present a fine-grained leakage control scheme using sleep transistors at gate level. Rahman and Polavarapuv (2004) evaluate several low-leakage design techniques for FPGAs and show that using multiple Vt switch blocks reduces leakage significantly. select the polarities of logic signals to reduce active leakage power in FPGAs. Gayasen et al. (2004b) present a region-constrained placement approach to reduce leakage in FPGAs. Chen et al. (2004) present a cut enumeration algorithm targeting low power technology mapping for FPGA architectures with dual supply voltages. Lodi et al. (2005) present several low leakage designs for the FPGA routing switch. Rahman et al. (2005) propose a heterogeneous routing fabric consisting of a mixture of slow (low power) and fast (high power) resources. Detailed experiments help them decide what resources to slow down without affecting performance.
Researchers have previously proposed dual-Vdd techniques for ASICs (Usami and Horowitz, 1995; Takahashi et al., 1998). Li et al. (2004a) first applied the idea to FPGAs. They fixed the voltages of logic blocks and attempted to place the design such that timing-critical blocks used high Vdd. This approach did not provide enough power savings unless some performance degradation was allowed. Therefore, a programmable Vdd FPGA was next proposed (Li et al., 2004b), where the logic blocks could be programmed to run on Vddh or Vddl.
Here, all routing resources remained at Vddh. Furthermore, the emphasis was on reducing dynamic power while keeping the leakage constant. The programmable Vdd idea was first applied to routing resources by Gayasen et al. (2004a) . evaluated several variants of the dual-Vdd architecture. improved the voltage assignment algorithm by formulating it as a linear programming problem. All these approaches require two power supplies and two power grids. To eliminate these overheads, proposed a circuit that utilised the threshold drop across a transistor to locally generate an alternate power supply for every routing switch.
Dual-Vdd FPGA
Reducing the supply voltage (Vdd) is an effective technique for reducing both dynamic and static power. Dynamic power varies quadratically with supply voltage, while both sub-threshold leakage (due to Drain Induced Barrier Lowering, DIBL) and gate leakage vary exponentially. However, reducing the supply voltage negatively affects the circuit performance. Dual-Vdd is a popular technique to reap the benefits of voltage scaling without its performance penalty. The timing-critical blocks in the design operate on the normal Vdd (or Vddh), while non-critical blocks operate on a second supply rail with a lower voltage (or Vddl). While dual-Vdd ICs have been successfully used in low-power ASICs and custom ICs (Takahashi et al., 1998), no commercial FPGA today uses multiple Vdd's for power reduction.
The difficulty of designing a dual-Vdd FPGA is that the optimal Vdd assignment changes from one design to another. Consequently, if logic blocks are statically determined to be operating at low or high Vdd, the placement and routing algorithms need to be modified accordingly (e.g., Li et al., 2004a). However, static assignment of Vdd to the blocks may prevent the ability to reduce power or to meet timing constraints for some designs. In contrast, the use of Vdd-programmability for each block helps to tune the number of high and low Vdd blocks as desired by the application. In this approach, the challenge is in determining the Vdd assignments to each block. The need for Level Converters (LCs) wherever a low-Vdd block drives a high-Vdd block and the associated delay and energy overheads are important considerations when performing these Vdd assignments. Furthermore, positioning of the LCs influences the ability to assign lower Vdd's to the routing blocks.
In our programmable dual-Vdd architecture, the Vdd of a circuit block is selected between Vddh and Vddl by using two high-Vt transistors (supply transistors) connecting the block to the supplies (see Figure 1 ). This circuit was previously used by Li et al. (2004b) . The state (ON/OFF) of each supply transistor is controlled by a configuration bit, which is set by the Vdd assignment algorithm. The configuration bits are set either to connect the block to one of the power supplies or completely disconnect the block from both the power supply lines when the block is unused or idle. We evaluate the effectiveness of different Vdd assignment algorithms and implementation choices for an island-style FPGA architecture designed in 65 nm technology. Our results demonstrate that one of the Vdd assignment techniques provides an average power saving of 61% across different MCNC benchmarks. The remainder of this section is organised as follows. In Section 3.1, we discuss our dual-Vdd FPGA architectures. Section 3.2 describes the experimental methodology we used, and discusses the Vdd assignment algorithms and the power estimation technique. Section 3.3 presents experimental results.
Architecture
We propose two types of dual-Vdd architectures. The first, Fully Programmable (FP), architecture allows all logic blocks (CLBs) and routing resources to be independently programmed as Vddh or Vddl. The second, Logic Programmable (LP), gives that flexibility only for CLBs and fixes the voltages of the routing resources. Both architectures are built on cluster-based island-style FPGAs, with the configuration stored in SRAM cells. The Basic Logic Element (BLE) consists of a 4-input LUT and a flip-flop. Multiple BLEs cluster together to form a CLB (see Figure 2). In both architectures, level conversion takes place only at CLB pins. For this purpose, CLB pins have LCs attached to them. A multiplexer allows the LC to be bypassed if level conversion is not needed at that pin. Placing the LC only at CLB pins reduces the complexity of the routing fabric, and also limits the area and leakage overhead of LCs.
Fully Programmable (FP)
The FP architecture facilitates configurable supply voltage for logic blocks and routing multiplexers. Figure 2(a) shows how the CLB is configured using high-Vt supply transistors to operate at two different voltages.
We experimented with two variants of FP, differing in the placement of the LCs. While the first version places LCs at the output pins of CLBs, the second places them at CLB input pins. Figure 2(a) shows the first case, where only the output pins of a CLB have LCs attached to them. In this case, a net with multiple fanouts operates at high Vdd if any one of the CLBs driven by this net is at high Vdd (since the signal's voltage level does not change in the routing fabric). This limits the number of routing muxes that can operate at low Vdd and therefore is less effective in reducing routing power compared to the case when LCs are attached to CLB input pins. However, the drawback of keeping LCs at input pins of CLBs (apart from area penalty) is that a larger number of LCs are needed, which increases the leakage in logic blocks. Our results support this reasoning, but show that overall leakage is lower for the second case.
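The two LC placements lead to different rules for the voltage of a net. The following is a minimal Python sketch of those rules, not the authors' implementation. It assumes a net carries a single voltage throughout the routing fabric and that an LC is only enabled where a low-Vdd signal must drive a high-Vdd block. The function names `net_vdd` and `lcs_enabled` and the `"H"`/`"L"` encoding are hypothetical.

```python
def net_vdd(driver_vdd, sink_vdds, lc_at):
    """Return the voltage ("H" or "L") at which a net's routing muxes run.

    lc_at == "output": the net must be high if any sink CLB is high,
                       because no conversion happens inside the fabric.
    lc_at == "input":  the net simply follows its driving CLB.
    """
    if lc_at == "input":
        return driver_vdd
    if lc_at == "output":
        return "H" if "H" in sink_vdds else "L"
    raise ValueError("lc_at must be 'input' or 'output'")


def lcs_enabled(driver_vdd, sink_vdds, lc_at):
    """Count the LCs that are actually used (not bypassed) on this net."""
    net = net_vdd(driver_vdd, sink_vdds, lc_at)
    if lc_at == "output":
        # One LC at the driver's output pin, needed only for low -> high.
        return 1 if driver_vdd == "L" and net == "H" else 0
    # One LC per high-Vdd sink pin that receives a low-Vdd signal.
    return sum(1 for s in sink_vdds if net == "L" and s == "H")


# A low-Vdd CLB driving three sinks, one of which is at high Vdd.
sinks = ["L", "H", "L"]
print(net_vdd("L", sinks, "output"), net_vdd("L", sinks, "input"))
```

In this example the whole net is forced to high Vdd when LCs sit at CLB outputs, whereas with LCs at CLB inputs the net stays at low Vdd and only the single high-Vdd sink pin converts.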
Figure 2(b) shows a routing multiplexer (mux) in the FP architecture. The multiplexer's output is connected to a level-restoring buffer to restore the Vt-drop through the NMOS-based multiplexer. Note that the same set of supply transistors controls the voltage of configuration SRAM cells and the level-restoring buffer. Since the configuration SRAM is not timing critical, the supply transistors need to be sized just enough to supply the maximum current needed by the level-restoring buffer.
If a circuit block (CLB or routing mux) is completely unused, then in order to save leakage, it is desirable to completely switch off that block. This is achieved by keeping a separate configuration bit for every supply transistor.
Although this incurs more area overhead, it results in significant leakage savings, since resource utilisation in an FPGA is typically low (Tuan and Lai, 2003; Gayasen et al., 2004b).
Due to the area overhead of LCs and supply transistors (and associated configuration SRAM cells), the dual-Vdd FPGA takes approximately 50% more area than a single-Vdd FPGA.
The majority of leakage in an FPGA occurs in the configuration SRAM cells. Gayasen et al. (2004b) have previously shown that by increasing the threshold voltage of the configuration SRAM, its leakage can be reduced by 98%, while increasing configuration time by 20%. Since configuration time is not critical in most of our target designs, this tradeoff for power savings is reasonable. For applications where configuration time is crucial, we propose the use of Asymmetric SRAM (ASRAM) cells (see Section 4). In order to see the effect of dual-Vdd on power consumption, we have neglected the configuration SRAM leakage both for the single supply design and for the dual supply design (since the reduction of configuration SRAM leakage is achieved by increasing its threshold voltage, and is equally applicable to both single and dual supply designs).
Logic Programmable (LP)
The LP architecture facilitates configurable supply only to logic blocks (see Figure 3). The routing resources run at supplies fixed at the time of device fabrication. The routing switches contain sleep transistors to cut off their power supply when not used.
The FP dual-Vdd FPGA of the previous section results in a large area penalty of about 50%. A key observation is that most of the area is consumed by the routing resources.
By fixing the supply voltages of routing resources, an LP FPGA eliminates the supply transistors and associated configuration SRAM cells in the routing fabric. Instead, we need only one sleep transistor per routing switch. This sleep transistor is controlled by the SRAM cell that controls the state of the routing switch. This more than halves the area cost of supply transistors in the routing fabric. Compared with a single Vdd FPGA, the area penalty for an LP FPGA is close to 20%. This circuit is similar to one of the circuits in , with the difference that in our case the supply voltage could be either Vddh or Vddl while they fixed the supply to Vddh for routing. Every logic block still has its own supply transistors and can be independently programmed to function at Vddh or Vddl. In order to further reduce the area penalty due to these supply transistors, we share the supply transistors among multiple logic blocks. Since all CLBs do not normally draw the maximum currents at the same time, the supply transistor can be sized smaller than the sum of independent supply transistors. Hence, the area overhead of supply transistors is reduced.
Level conversion still occurs only at CLB pins. However, unlike FP, we do not have the flexibility to set the Vdd of nets to match that of logic blocks connected to them. Therefore, we need to allow for level conversion at both input and output pins of CLBs.
The LP architectures are especially suited for low-cost applications with low power requirements.
Level Conversion
LCs have been studied widely ever since multi-Vdd circuits were proposed (Usami and Horowitz, 1995; Puri et al., 2003). The area, delay and power overheads of LCs prohibit random Vdd assignment to logic elements of a circuit. For the present work, we have used the LC circuit shown in Figure 4 and a 65 nm Berkeley Predictive SPICE model (BPTM) to simulate it. For an FPGA architecture where LCs are placed at CLB input pins, four LCs are required per BLE. For a Vddh of 1.1 V and Vddl of 0.9 V, the LC delay is almost 17% of the delay of an LUT, and as much as 41% of the clock-to-Q delay of the flip-flop. This significant delay in the LC prohibits the use of many LCs within a logical path of the circuit. In contrast to delay, power consumption in an LC was observed to be negligible (<1%) compared to a BLE. This allows us to place LCs at all pins of a CLB and still save power.
Methodology
We used VPR and its power model (Betz and Rose, 1997; Poon et al., 2002) for this work. MCNC benchmarks were used to evaluate the dual-Vdd architecture and Vdd assignment algorithms. The architecture of the FP FPGA closely resembles a modern FPGA. The LUT size of 4 and cluster size of 8 LUTs match those of a Xilinx Virtex-II device. The routing channel consists of 200 tracks, with buffered segments of lengths 1, 2, 6 and 'long'. The switch block used a Wilton topology (Wilton, 1997).
For LP, however, we simplified the fabric to resemble the one used by Betz and Rose (1999). The CLB consists of four BLEs. The routing fabric consists of only length-four segments, which have been shown to be the best for area and speed by Betz and Rose (1999). We further changed the switch block topology to Subset. These simplifications made it easier to implement the LP architecture in VPR. A Subset switch block connects only segments of the same type. In an LP FPGA, we wanted no connections from a Vddl routing resource to a Vddh resource because the routing switches did not have any LCs. Using a Subset switch block made it easier to guarantee this (by creating a type for segments at a particular Vdd). This, however, also does not allow connections from Vddh to Vddl routing resources, and therefore, the power savings we report here for LP could be improved. For the purpose of comparison of FP with LP, this restriction is justified because we do not allow such connections for FP either. Furthermore, we chose all segments to be of length four because we did not want nets to solely use longer or shorter wires. Because of the Subset topology, only wires of the same type would connect, and therefore, a length 6 wire will not connect to a length 2 wire (which does not resemble a modern commercial FPGA architecture, such as Virtex-II). Despite these simplifications, we believe our results to be indicative of other segmented routing architectures as well.
Circuit simulations were performed in SPICE using 65 nm BSIM4 device models. Delays of BLE and LC were obtained from these simulations. Power consumption, both static and dynamic, of the LC was also obtained through SPICE simulations. Figure 5 shows the experimental flow. The flow deviates from a normal VPR flow after the place and route stage. We first assign voltage to all CLBs using algorithms that are discussed below, and then estimate power of the design placed and routed on the target dual-Vdd architecture. Assigning voltages after routing makes the timing analysis more accurate, since all the routing delays get incorporated in the timing graph. 
Vdd assignment
In order to be effective, a dual-Vdd scheme requires that paths in the circuit vary in their delays. If all paths are of same delay then all circuit elements will require high Vdd to maintain the performance of the design. Figure 6 shows the distribution of path delays averaged over MCNC benchmarks. We observe that path delays in a circuit vary considerably. Therefore, a dual-Vdd scheme can be expected to reduce the power consumption significantly. Figure 6 also shows the path delays after using our dual-Vdd assignment algorithms. We use the heuristic shown in Algorithm 1 for Vdd assignment. Initially we assign low Vdd to all CLBs in the FPGA and find those paths whose delays become greater than the desired clock time period. We call such paths 'critical'. Those CLBs which do not belong to any of the critical paths can be kept at low voltage without affecting performance of the design. Some of the remaining CLBs and routing muxes need to operate at high-Vdd so that the design's performance target is met. The order in which these CLBs are analysed is crucial for the performance of the heuristic. We define 'criticality' of a CLB as the number of critical paths that pass through this CLB. The CLBs within a path are analysed in decreasing order of their criticalities. We started with CLBs on the most critical path, and proceeded to smaller paths in decreasing order of their delay. Algorithm 1 handles the case when LCs are at CLB inputs. In that case all routing muxes driven by a CLB have the same voltage as the CLB. For the other situation, when LCs are at CLB outputs, the voltage of routing muxes driving a CLB is the same as that of the CLB. In order to enumerate all paths whose delays become larger than the required clock time period, we used the algorithm proposed by Ju and Saleh (1991) . It maintains all paths in a heap data structure with their delays as the keys. 
Each path also maintains all the branch-points in the path in increasing order of their branch-slacks.
We also experimented with a variant (High-to-Low (h2l)) of the above algorithm, in which all the CLBs are initially kept at high Vdd and then some of them are changed to low Vdd (see Algorithm 2). Before changing a CLB to low Vdd, we need to make sure that this will not increase the delay of some other path in the circuit above the desired clock period. The number of low-Vdd blocks using both versions, for Vddh of 1.1 V and Vddl of 0.8 V (for 65 nm technology), is shown in Table 1. For 10 out of 15 designs, the h2l version performs better than Low-to-High (l2h). This happens because, in the case of h2l, when the CLBs on a particular path are being analysed to see whether they can run on low Vdd, the algorithm continues to look at all the other CLBs on the path even after it has failed to change the Vdd of some CLB. In contrast, in the l2h case, the algorithm keeps changing CLBs on a path to high Vdd (in decreasing order of criticality), till the delay of the path is less than the required clock period. This sometimes causes the path's delay to be reduced more than what was necessary.
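The difference between the two heuristics can be illustrated with a simplified Python sketch. This is not the authors' Algorithm 1 or Algorithm 2: it assumes that a path delay is just the sum of per-CLB delays (routing muxes, LC delay and the heap-based path enumeration are ignored), and all names (`path_delay`, `assign_l2h`, `assign_h2l`, the delay tables) are hypothetical.

```python
from collections import Counter


def path_delay(path, vdd, d_high, d_low):
    """Sum per-CLB delays along one path for the current Vdd assignment."""
    return sum(d_high[clb] if vdd[clb] == "H" else d_low[clb] for clb in path)


def assign_l2h(paths, d_high, d_low, t_clk):
    """Low-to-High: start all-low, raise CLBs on paths that miss timing."""
    vdd = {clb: "L" for path in paths for clb in path}
    critical = [p for p in paths if path_delay(p, vdd, d_high, d_low) > t_clk]
    # Criticality = number of critical paths passing through a CLB.
    criticality = Counter(clb for p in critical for clb in p)
    critical.sort(key=lambda p: path_delay(p, vdd, d_high, d_low), reverse=True)
    for path in critical:
        for clb in sorted(path, key=lambda c: -criticality[c]):
            if path_delay(path, vdd, d_high, d_low) <= t_clk:
                break  # this path now meets timing
            vdd[clb] = "H"
    return vdd


def assign_h2l(paths, d_high, d_low, t_clk):
    """High-to-Low: start all-high, lower every CLB that keeps timing met."""
    vdd = {clb: "H" for path in paths for clb in path}
    order = sorted(paths, key=lambda p: path_delay(p, vdd, d_high, d_low),
                   reverse=True)
    for path in order:
        for clb in path:
            if vdd[clb] == "L":
                continue
            vdd[clb] = "L"
            # Revert if any path through this CLB now exceeds the clock period.
            if any(path_delay(q, vdd, d_high, d_low) > t_clk
                   for q in paths if clb in q):
                vdd[clb] = "H"
    return vdd


# One path of three CLBs; delays in arbitrary integer units.
paths = [["a", "b", "c"]]
d_high = {"a": 20, "b": 20, "c": 20}
d_low = {"a": 25, "b": 30, "c": 30}
l2h = assign_l2h(paths, d_high, d_low, t_clk=75)
h2l = assign_h2l(paths, d_high, d_low, t_clk=75)
print(sum(v == "L" for v in l2h.values()), sum(v == "L" for v in h2l.values()))
```

In this toy case l2h raises both `a` and `b`, leaving one CLB at low Vdd and a path delay of 70, while h2l keeps two CLBs at low Vdd and lands exactly on the clock period of 75. This mirrors the overshoot effect described above, although on other inputs the simplified l2h can do as well or better.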
Algorithm 2 Algorithm for Vdd assignment: High-to-Low (assuming LCs at CLB input pins)
For the LP FPGA, the core of the Vdd assignment algorithm remains the same as that for FP. The main differences lie in the way the routing segments are handled. Since their Vdd's are fixed, the assignment algorithm does not assign voltages to them. Additionally, since this architecture allows level conversion at both inputs and outputs of the logic blocks, we modify the assignment algorithm accordingly.
Power estimation
After all logic blocks have been assigned appropriate supply voltages, we estimate the power consumption of the entire FPGA. We concentrate only on the power consumption in the core of the FPGA and do not try to optimise or estimate IO power consumption. Furthermore, we did not estimate the power consumption in the global routing grid used for clock distribution. In order to estimate dynamic power, VPR's power model calculates transition densities at all internal nodes of the FPGA, assuming that all inputs to the FPGA have the same static probability (default: 0.5). Capacitances are estimated from the capacitance values of a MOSFET and that of wires and switches, all of which need to be provided in the architecture file taken by VPR as an input. We used the Berkeley Predictive 65 nm technology parameters for our experimentation.
We modified VPR's dynamic power model to include dual supply voltages. The dynamic power of a circuit element scales by a factor of (Vddl/Vddh)² when its voltage is reduced from Vddh to Vddl. SPICE simulations of an LC provided its energy values for different pairs of Vddh and Vddl. We used these energy values and the transition density at the input of an LC to calculate its dynamic power.
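As a worked illustration of the quadratic scaling, the sketch below computes the dynamic power factor for a given voltage pair and a simple LC power estimate. It is not VPR code. It assumes that LC dynamic power is the energy per transition multiplied by the transition density expressed in transitions per second. The function names and the example LC numbers are hypothetical.

```python
def dynamic_power_scale(vddh: float, vddl: float) -> float:
    """Factor by which dynamic power shrinks when moving from Vddh to Vddl."""
    return (vddl / vddh) ** 2


def lc_dynamic_power(energy_per_transition_j: float,
                     transition_density_hz: float) -> float:
    """Assumed model: power = energy per transition x transitions per second."""
    return energy_per_transition_j * transition_density_hz


# Voltage pair used in the experiments: Vddh = 1.1 V, Vddl = 0.9 V.
scale = dynamic_power_scale(1.1, 0.9)
print(f"dynamic power factor: {scale:.3f}")

# Hypothetical LC numbers, for illustration only (not from the paper).
print(lc_dynamic_power(2e-15, 1e8), "W")
```

For Vddh = 1.1 V and Vddl = 0.9 V the factor is 0.81/1.21, or about 0.67, so a circuit element moved to the low supply dissipates roughly one third less dynamic power.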
VPR includes a basic leakage model, which calculates sub-threshold leakage due to weak inversion. In a 65 nm technology, however, two more effects, namely DIBL and gate leakage, become significant and need to be included in the leakage estimation. We also modified the leakage model to take into account multiple supply voltages and sleep modes. Specifically, the following modifications were made to VPR's leakage estimation.
• Gate leakage and sub-threshold leakage due to DIBL were included in the leakage estimation. In order to estimate leakage of a single MOSFET, we used results from SPICE simulations. 65 nm BSIM4 device models were used. Simulations were performed for various supply voltages to get leakage numbers for different voltages. These numbers were incorporated into the power model of VPR to estimate gate leakage of the entire FPGA.
• We estimated average leakage in a routing multiplexer by halving the worst case leakage, as discussed in Rahman and Polavarapuv (2004). To verify the results, we simulated multiplexers of various sizes and structures and found our leakage estimate to be very close to the SPICE results.
• In the dual-Vdd FPGA, unused logic blocks and routing muxes are kept in a sleep state by switching off both the supply transistors. Circuit simulations in SPICE showed that in sleep mode, leakage of a circuit block reduces to 10% of the original (high Vdd) leakage.
• To estimate LC leakage, we obtained the leakage number for one level converter from SPICE simulations and multiplied this by the number of LCs in the FPGA.
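The leakage bookkeeping in the list above can be sketched as follows. This is a simplified illustration rather than the modified VPR model: it assumes each circuit block has a pre-characterised leakage at Vddh and at Vddl (from SPICE), and every name and number in the example is hypothetical.

```python
SLEEP_FACTOR = 0.10  # sleep-mode leakage is 10% of the high-Vdd leakage


def mux_average_leakage(worst_case_leakage: float) -> float:
    """Average routing-mux leakage, estimated as half the worst case."""
    return 0.5 * worst_case_leakage


def total_leakage(blocks, lc_leakage: float, num_lcs: int) -> float:
    """Sum block leakage by supply state, then add the LC contribution.

    blocks: iterable of (state, leak_at_vddh, leak_at_vddl),
            with state in {"H", "L", "OFF"}.
    """
    total = 0.0
    for state, leak_h, leak_l in blocks:
        if state == "H":
            total += leak_h
        elif state == "L":
            total += leak_l
        elif state == "OFF":
            total += SLEEP_FACTOR * leak_h  # both supply transistors off
        else:
            raise ValueError(f"unknown supply state: {state}")
    # LC leakage: per-LC value multiplied by the number of LCs.
    return total + num_lcs * lc_leakage


# Illustrative numbers only, in arbitrary units.
blocks = [("H", 10.0, 4.0), ("L", 10.0, 4.0), ("OFF", 10.0, 4.0)]
print(total_leakage(blocks, lc_leakage=0.5, num_lcs=4))
```

Because unused blocks contribute only a tenth of their high-Vdd leakage, the total is dominated by the blocks that remain powered whenever utilisation is low.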
Results and analysis
In this section, we first evaluate the FP architecture (Figures 7-10) and then compare it with LP (Figures 11 and 12).
FP Architecture
Power in the dual-Vdd architecture strongly depends on the values of Vddh and Vddl. In order to understand this dependence and to come up with a good voltage choice, we fixed the high Vdd at 1.1 V and varied Vddl from 0.8 V to 1.0 V. Figure 7 shows the power consumption for different Vddl values (using the h2l algorithm, LCs at CLB inputs). Note that for 11 (out of 15) designs, a Vddl value of 0.9 V results in maximum power savings. When Vddl is increased to 1.0 V, although the number of CLBs on low Vdd increases, the total power consumption increases. This happens because the power consumption of the circuit elements at 1.0 V is significantly higher than at 0.9 V. Interestingly, when we reduce Vddl to 0.8 V, power consumption again increases because the number of CLBs and routing muxes on low Vdd becomes too low. Therefore, for all other results in this section, we use a Vddl of 0.9 V. For this case, the average power reduction is close to 61%.
Figure 8 shows the power consumption of the designs for the two algorithms (h2l and l2h) and the two LC placements, at CLB Outputs (LCo) or Inputs (LCi). (h2lLCi denotes the h2l algorithm with LCs at CLB Inputs.) Note that for most designs, the h2l algorithm outperforms the l2h algorithm. This is expected because, as shown in Section 3.2, the h2l algorithm resulted in a larger number of low-Vdd CLBs. Furthermore, the placement of LCs at CLB inputs saves more power (average: 61%) than their placement at outputs (average: 57%). This happens because LC leakage is not large enough to overshadow the gains we get in the routing power by placing LCs at CLB inputs.
Figure 9 shows the static and dynamic power consumption in both logic and routing resources for the different algorithms and LC placements. An important observation is that not all components of power are reduced by the same factor. The reduction in dynamic power is much less than that in leakage.
For example, using h2l algorithm and placing LC at CLB inputs saves 24% dynamic power and 76% leakage power. This can be attributed to two factors. First, in an FPGA since there exist a large number of unused circuit elements, it is possible to reduce the leakage in them by switching them off. Second, leakage varies exponentially with supply voltage, but dynamic power varies only quadratically with supply voltage. Note that leakage in routing resources reduces to less than 17% of the original, because in most designs it is possible to put a large number of routing muxes in sleep state, as they are sparsely used. Another trend to note is that the logic portion of leakage is larger when LCs are placed at CLB Inputs (LCi) than when they are placed at CLB Outputs (LCo). This implies that the larger overall power saving for the LCi case comes entirely from the routing resources. Figure 10 shows what happens when we modify the Vdd assignment algorithm to allow some degradation in the performance of the design. In the figure, a delay value of 110% denotes 10% performance penalty. Note that these delay values may increase after circuit implementation due to the use of supply transistors and due to a possible increase of wire lengths (since total CLB area and consequently inter-CLB distances increase). Using h2lLCi, a 10% decrease in performance increases the average power saving by around 4%, but beyond 20%, the power remains almost constant.
LP architecture
For LP architectures, since we hard-wire the supply voltages of the routing fabric, the critical path delay of the design may be affected. Therefore, we first look at the impact of LP on the delays of all designs. Figure 11 shows both the average and worst-case delays for the benchmark designs. Restricting the maximum increase in delay due to LP to 20%, we chose to keep 50% of the routing resources on low Vdd. Note that the average increase in delay for this architecture is only 3% relative to the FP architecture. The slightly irregular variation in delay is due to the heuristic nature of the router. In this delay comparison, we do not include the increase in delay caused by the resistance of the supply transistors, by the mux at CLB pins that selects between Vddl and the level-converted signal, or by the increase in wire lengths that results from the larger FPGA area. The increase, however, is minimal and is highly dependent on the circuit implementation. For example, Inukai et al. (2000) demonstrate effective supply gating of circuits with a performance penalty of less than 10%. Li et al. (2004b) observed a penalty of 5% for dual-Vdd circuits when they used regular-Vt gate-boosted supply transistors.
We realised that if the FPGA has too many routing resources, it is possible that none of the low voltage resources get used and the delay of the design remains the same as that for single Vdd FPGA (if the router is timing-driven). To avoid such a scenario, we first found the minimum channel width for every design using VPR, and then used 130% of the minimum as the channel width. This is different from the above FP experiments. However, while comparing FP with LP, we used the LP channel width for both architectures. Also note that the CLB here consists of four BLEs instead of the eight in the FP experiments. Figure 12 shows the total FPGA energy (power-delay product) obtained using this architecture for different spatial granularities. h2l-50-2x1 on the x-axis refers to the architecture where 50% of the routing resources are at Vddl and the supply transistors are shared among CLBs in clusters of dimension 2 × 1. We compare energy instead of power because the critical path delays of designs mapped on LP FPGAs are different from those on FP FPGAs. The Vdd assignment algorithm remains h2l for all of them. Compared with FP, LP increases the energy by about 4.1%. The routing energy increases because we do not change their supply voltage. However, the energy used by logic blocks decreases by about 1.5%, because, due to the presence of LCs at both CLB inputs and outputs, we have more flexibility in assigning Vdd's to them. We further observe that the use of 4 × 4 clusters increases the total energy by about 12% (compared with FP).
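The channel-width choice and the energy metric described above can be sketched in a few lines. This is a minimal illustration, not the scripts used in the experiments: the function names are hypothetical, rounding the 130% width up to the next integer is an assumption, and any numbers passed in by a caller are placeholders rather than measured values.

```python
def experiment_channel_width(min_width: int, percent: int = 130) -> int:
    """Channel width set to `percent`% of the minimum routable width.

    Integer arithmetic avoids floating-point surprises; rounding up is an
    assumption, since the text does not state how fractions are handled.
    """
    return (min_width * percent + 99) // 100


def energy(power: float, delay: float) -> float:
    """Energy taken as the power-delay product, as in the text."""
    return power * delay


def relative_energy(power_a: float, delay_a: float,
                    power_b: float, delay_b: float) -> float:
    """Energy of architecture A normalised to architecture B."""
    return energy(power_a, delay_a) / energy(power_b, delay_b)
```

Comparing energy rather than power matters here because an architecture that draws slightly less power but lengthens the critical path can still consume more energy per operation.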
Use of Asymmetric SRAM (ASRAM)
In this section, we present a mechanism to reduce leakage in the configuration memory of FPGAs with minimal overheads. Since configuration memory consumes nearly 38% of total leakage in a 90 nm FPGA (Tuan and Lai, 2003), reducing its power is important.
For most applications, the latency of configuration memory cells does not affect the total performance. Therefore, many FPGAs in sub-100 nm technology use high-Vt cells to reduce SRAM leakage (Xilinx; Altera). However, a growing number of applications rely on dynamic reconfiguration of the FPGA. For those applications, the latency of the configuration memory must be kept low, so making all the SRAM cells high-Vt is not a good choice. To alleviate this problem, we propose the use of ASRAM cells (Azizi et al., 2002). ASRAM cells have a lower leakage when storing a preferred data value. We will use ASRAM-0 to refer to the ASRAM cells whose preferred value is 0 and ASRAM-1 to those with preferred value 1. Since our experiments with several designs show that 87% of the configuration bits are zero, we propose to use ASRAM-0 cells for the configuration bits. LUT configuration bits, however, do not show such a strong preference for any value. We observed that, on average, only 42% of the LUT SRAM bits were '0'. Consequently, we try to maximise the number of zeros in LUT bits by restructuring the logic of the design, as will be explained in Section 4.2. Apart from the power benefits, the improved soft-error resilience is an additional motivation for this approach.
Background
ASRAMs can reduce SRAM leakage by a factor of seven, with only a small increase in the latency (Azizi et al., 2002). Figure 13 shows the ASRAM-0 cell, in which only the transistors that are normally leaking for a stored value of zero are made high Vt. Similarly, in the ASRAM-1 cell, the transistors that leak for a stored value of one are made high Vt. The resultant design is asymmetric with respect to Vt; hence it is called an asymmetric SRAM. When used with a specialised sense amplifier design, the asymmetric cells can reduce the leakage power at the expense of a small performance penalty. ASRAM cells have another advantage: in their preferred state, they are more resilient to radiation-induced bit-flips, known as soft errors (Degalahal, 2003). Lower supply voltages and node capacitances in new technologies have increased the probability of soft errors. Consequently, technology scaling favours ASRAM cells for configuration memory on both counts: leakage and soft errors.
Methodology
The configuration bits that control the routing fabric contain a large number of zeros. Hence, ASRAM-0 cells are naturally suited to store them. In contrast, using ASRAM-0 cells for LUTs is not as well justified, since LUT bits do not show a significant inclination to value 0 or 1. Consequently, in order to use ASRAM-0 for LUT bits, we propose a bit-inversion technique that reduces the number of 1's in LUTs (see Figure 15). If an LUT contains more 1's than 0's, then its bits are flipped (to increase the number of zeros) and the bits in the LUTs that are driven by this LUT are rearranged (to maintain correct functionality) (see Figure 14). Note that this technique prohibits inversion of bits in those LUTs which are connected to the outputs of the design, either directly or through a flip-flop. Further note that such a technique may be employed in a complementary manner to increase the number of 1's in the design as well. We extracted the list of all LUTs in a design from the ASCII description of the implemented design (XDL), which describes only those LUTs that were actually used in the implementation. Therefore, this list could be created in O(n) time, where n is the number of used LUTs. Furthermore, the invert_LUT procedure makes one loop over the entire list of used LUTs, which makes it possible to complete the entire inversion of bits in O(n) steps.
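The bit-inversion step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the invert_LUT procedure or the JBits-based program used in this work: the `Lut` record, the names `invert_luts`, `invert_input` and `evaluate`, and the netlist representation (each LUT lists the nets driving its address bits) are all hypothetical. Flipping a driver's truth table inverts its output, so every driven LUT has its table entries swapped pairwise along the corresponding address bit. This swap is a permutation, so it does not change the number of 1's in the driven LUT, which is why a single pass over the list suffices.

```python
from dataclasses import dataclass


@dataclass
class Lut:
    name: str
    bits: list[int]        # truth table: bits[addr] is the output for address addr
    inputs: list[str]      # inputs[i] names the net driving address bit i
    drives_output: bool = False  # connected to a design output (directly or via FF)


def invert_input(bits: list[int], i: int) -> list[int]:
    """Rearrange a truth table so that it expects input i inverted."""
    return [bits[addr ^ (1 << i)] for addr in range(len(bits))]


def invert_luts(luts: dict[str, Lut]) -> list[str]:
    """Flip every eligible LUT holding more 1's than 0's; return flipped names."""
    # Fanout index: driver name -> list of (sink LUT, address-bit position).
    fanout: dict[str, list[tuple[Lut, int]]] = {}
    for sink in luts.values():
        for i, src in enumerate(sink.inputs):
            fanout.setdefault(src, []).append((sink, i))

    flipped = []
    for lut in luts.values():
        if lut.drives_output or 2 * sum(lut.bits) <= len(lut.bits):
            continue  # inversion prohibited, or no gain in zeros
        lut.bits = [1 - b for b in lut.bits]          # output is now inverted
        for sink, i in fanout.get(lut.name, []):      # compensate in driven LUTs
            sink.bits = invert_input(sink.bits, i)
        flipped.append(lut.name)
    return flipped


def evaluate(luts: dict[str, Lut], primary: dict[str, int], net: str) -> int:
    """Evaluate a net of a purely combinational netlist (used for checking)."""
    if net in primary:
        return primary[net]
    lut = luts[net]
    addr = 0
    for i, src in enumerate(lut.inputs):
        addr |= evaluate(luts, primary, src) << i
    return lut.bits[addr]
```

For instance, a two-input OR LUT (table 0111) feeding an output-connected AND LUT is rewritten as a NOR (table 1000), and the AND table is rearranged so that the design outputs are unchanged for every input combination. With a bounded number of inputs and a bounded fanout per LUT, the work grows linearly with the number of used LUTs.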
Experimentation
In order to evaluate our idea of inverting LUT bits, we selected a set of applications and used the Xilinx Virtex-II FPGA as our target hardware. Most of the applications we chose are available from Xilinx's website as reference designs (Xilinx). We also picked one of the larger designs from the ITC-99 benchmark suite and two academic designs implementing an Adaptive Viterbi Algorithm decoder (Swaminathan et al., 2002). Table 2 provides the number of slices used by each application and the target FPGA device used for implementing it. Note that we chose the smallest possible FPGA device for every application.
Xilinx ISE 6.1 tools were used to synthesise, map, place and route the design. Next, bitgen was used to generate the configuration bitstream file, and the routed design was converted to an ASCII file in XDL format using the xdl tool. The XDL file was used to find which LUTs are used, along with information about partially used LUTs.
Table 2 Characteristics of benchmark designs. Only the used LUTs are considered.
We used APIs provided by JBits (ver 3.0) to obtain detailed information about the implemented design from the configuration bitstream file. We wrote a Java program using the JBits APIs that took the XDL file and the bitstream file as inputs and inverted the LUT bits as described in Section 4.2, along with bit shuffling. We first performed the inversion to maximise zeros in LUT bits; then, to experiment with ASRAM-1 as configuration memory, we also inverted the LUT bits to maximise ones. This program also estimated the number of care bits (used bits) in the implemented design and counted the number of zeros and ones in them, as well as the distribution of these between LUT bits and routing configuration bits.
Results
Table 2 summarises the results. One of the observations from the table is that the majority of the care bits are zeros. Table 2 also shows the percentage of zeros in the configuration bits before and after the application of our bit-inversion technique. The results reflect an average increase of 27% in the number of zeros in the LUTs after the 1-to-0 inversion. Note that this reflects an average improvement of only 3% in the number of zeros considering all the care bits. The reason for such a small improvement is that the bit inversion is applicable only to LUT bits, which comprise about 10-15% of the total care bits. For used LUT bits, note that at one extreme, the xapp288 design shows an increase of 58.6% in the number of 0's after bit inversion, while at the other extreme, avak7 shows an improvement of only 17.6%. This large variance in the improvement occurs due to variations in the number of 0's among designs. Xapp288 has only 24.4% 0's in its LUT bits originally. This enables our bit-inversion technique to flip a large number of 1's to 0's. On the other hand, avak7 has 47.9% 0's in its LUT bits. Due to an almost equal number of 0's and 1's in the design, the bit-inversion technique is less effective. Bit inversion reduces the leakage power in the LUT memory by a factor of 8.5 on average. For all designs, we also estimated the Failure-In-Time (FIT) values, defined as the number of failures in 10^9 hours of operation, and observed an average reduction of 25%.
A further improvement of 4-8% was observed when we applied 1-to-0 bit inversion in the LUT memory bits.
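The relation between the LUT-level and overall gains in zeros reported above can be checked with a short calculation. It assumes that the quoted increases are percentage points of the respective bit populations (an interpretation, not stated explicitly in the text) and uses $s_{\mathrm{LUT}}$, a symbol introduced here, for the share of care bits that are LUT bits.

```latex
\Delta z_{\mathrm{all}} \approx s_{\mathrm{LUT}} \cdot \Delta z_{\mathrm{LUT}},
\qquad s_{\mathrm{LUT}} \in [0.10,\, 0.15],\quad \Delta z_{\mathrm{LUT}} = 27

\Delta z_{\mathrm{all}} \in [0.10 \times 27,\; 0.15 \times 27] = [2.7,\; 4.05]

\text{xapp288: } 24.4 + 58.6 = 83.0, \qquad \text{avak7: } 47.9 + 17.6 = 65.5
```

Under this reading, the resulting range brackets the reported overall improvement of about 3%, and the last line gives the implied post-inversion percentages of zeros in the LUT bits of the two extreme designs.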
Compared with regular SRAM, ASRAM-0 reduced the configuration memory leakage by a factor of 6.03 before the LUT bit-inversion optimisation. After incorporating bit inversion, this number increased to 6.28, reducing the overall FPGA leakage by 32.0%.
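The link between the configuration-memory reduction factors and the overall leakage saving can be reproduced with a short calculation. This is a sketch under two assumptions: the configuration memory accounts for the 38% share of total leakage cited earlier, and leakage outside the configuration memory is unchanged. The function name is hypothetical.

```python
def overall_leakage_saving(config_share: float, reduction_factor: float) -> float:
    """Fraction of total FPGA leakage removed when only the configuration
    memory leakage is divided by `reduction_factor`."""
    return config_share * (1.0 - 1.0 / reduction_factor)


CONFIG_SHARE = 0.38  # configuration memory share of total leakage (cited figure)

saving_before = overall_leakage_saving(CONFIG_SHARE, 6.03)  # ASRAM-0 only
saving_after = overall_leakage_saving(CONFIG_SHARE, 6.28)   # with bit inversion

print(f"before bit inversion: {saving_before:.1%}")
print(f"after bit inversion:  {saving_after:.1%}")
```

Both values come out just under 32%, in line with the overall figure reported above. Because the saving is bounded by the 38% share, further increases in the reduction factor yield diminishing returns at the chip level.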
Conclusion and future work
We presented two types of dual-Vdd FPGA. The FP FPGA reduced the total energy by about 60% on average at the expense of about 50% area penalty. The LP FPGA reduced the total energy by 57.3% with about 20% area increase compared to a single-supply FPGA. LP, however, resulted in an average increase of 3% in the critical path delay over FP.
We also explored different Vdd assignment algorithms and LC placements for FP architecture. Experiments demonstrated that h2l algorithm coupled with placement of LCs at the input pins of CLBs resulted in maximum power savings. The dynamic power was reduced by 24%, while the reduction in static power was close to 76%.
As a technique to reduce configuration SRAM leakage with minimal overheads, we proposed the use of ASRAM-0 cells to store configuration bits. After inverting LUT bits to maximise the number of zeros in the configuration memory, we reduced the configuration leakage by a factor of 6.2.
In the future, the implementation of the LP dual-Vdd architecture can be modified to allow connections from Vddh to Vddl resources in the routing fabric. Further, the routing architecture can be improved to use segments of different lengths.
