ABSTRACT
INTRODUCTION
Reconfigurable technologies have made remarkable progress over the last decade. Commercial FPGAs available today provide a wide range of functionality along with the added benefits of low non-recurring engineering cost and high flexibility. However, the power efficiency of FPGAs has continuously lagged behind these improved capabilities, and several low-power application domains (e.g., mobile applications) restrict the use of FPGAs because of their prohibitive power consumption. In addition, high power dissipation increases packaging and cooling costs and decreases system reliability. Hence, it is extremely important to improve the power efficiency of FPGAs.
CMOS devices have been scaled down for several years to achieve higher performance and logic density, and FPGAs at 90 nm technology are now being developed. Various FPGA manufacturers have roadmaps to use 65 nm technology in the near future 1 . However, with each generation of technology scaling of supply voltage (Vdd), threshold voltage (Vt), channel length, and gate oxide thickness, there is a significant increase in leakage current. A reduction in Vdd is accompanied by a reduction in Vt to compensate for performance penalties, which results in an exponential increase in subthreshold leakage. On the other hand, thinning down gate oxides to improve driving capability leads to a substantial increase in gate leakage. These trends in technology scaling make leakage power a dominant component of total power consumption. Therefore, it is imperative to concentrate on leakage power optimization techniques, which is our aim in this paper. Specifically, we investigate the reduction in logic block leakage power obtained by shutting down unused transistors in the look-up tables (LUTs). This opportunity arises because the flexibility an FPGA offers to target many applications results in a large portion of the logic being left unused. In fact, prior studies show that typically 38% of the logic structures of an FPGA remain unused [1], and leakage power is consumed by both the used and the unused parts. In addition, leakage power is proportional to the total transistor count [2]; consequently, shutting down unused transistors will lead to savings in leakage power. We performed a preliminary investigation of the variance of LUT utilization across several circuits and observed that a significant number of LUTs leave one or more inputs unused. We present details of this analysis in Section 4.
1 Xilinx and IBM have a roadmap to produce chips at 65 nm. Lattice and Fujitsu are discussing the use of Fujitsu's forthcoming 65 nm technology in future Lattice products.
This motivated us to devise a hierarchical LUT structure, where the complexity of the LUT can be reduced incrementally based on the number of inputs required by the logic function it needs to implement. Reduction of LUT complexity is achieved by selectively shutting down the transistors and SRAM cells that are associated with the unused inputs. This complexity reduction is done in hierarchical steps: from a 16-cell SRAM array (4-input effective LUT size) to an 8-cell SRAM array (3-input effective LUT size), from an 8-cell SRAM array to a 4-cell SRAM array (2-input effective LUT size), and so on. There are pros and cons to employing a leakage control technique at the LUT level. The disadvantage is the overhead associated with the sleep transistors that we need to employ to perform Vdd gating. First, these sleep transistors bring some area overhead. However, since FPGA area is predominantly determined by routing area, an increase in logic area will not affect the total chip area significantly. In addition, leakage control techniques specifically target deep submicron technologies, 90 nm and below, and the continued downsizing of feature sizes will further reduce the impact of an increase in logic area. Second, the sleep transistors cause some degradation in LUT delay. The impact on performance in our hierarchical LUT structure is kept minimal. We elaborate on the overhead in Section 4.
The most important advantage of providing leakage control at the LUT level is the fact that we do not affect the packing, placement, and routing stages in any way. Each of these stages performs optimizations to improve important design metrics such as logic utilization, congestion/routability, wirelength, and delay. We do not want to impair their ability in reaching an optimal solution. Additional constraints imposed on packing and/or placement can affect routability adversely leading to infeasible, i.e. unroutable designs, or solutions with increased wirelength. This in turn will increase the interconnect power, which can overshadow the expected improvement in leakage power. Therefore, we assume that we cannot anticipate a priori, which LUTs in a design will allow complexity reduction and where these LUTs will be placed. This is the reason why we aim to provide a leakage reduction technique that is compatible with any placement result. Our specific contributions in this paper are:
(1) We introduce a low-penalty optimization technique to reduce leakage power consumption in FPGA logic blocks by exploiting the variance in LUT utilization across different designs, and
(2) We analyze and evaluate the leakage power gain associated with this optimization technique.
The rest of this paper is organized as follows: Section 2 presents an overview of related work on leakage power optimization in FPGA technology. Section 3 briefly discusses the structures of the basic components used in an FPGA logic block, on which our optimization technique is based. Section 4 describes our approach to optimizing logic block leakage power and presents the statistical information on LUT utilization that motivated it. Section 5 presents our results: leakage power savings based on actual power consumption, as well as savings in leakage power based on the number of transistors that can be shut down. We show that these two estimates are consistent with each other. Section 6 summarizes our conclusions.
RELATED WORK
Leakage power optimization techniques for ASICs have been extensively studied. A detailed study of leakage current mechanisms and leakage reduction techniques for CMOS circuits is presented by Roy et al. [3]. Although a variety of leakage power optimization techniques have been proposed for ASICs and microprocessors in the past [4] [5] [6] [7] [8] [9], reducing leakage power for FPGAs has come into focus only recently. Until then, most power optimization techniques for FPGAs focused primarily on dynamic power reduction. Sheng et al. [10] analyzed dynamic power consumption in the Virtex II FPGA family. Li et al. [11] developed fpgaEVA-LP for power efficiency analysis of LUT-based FPGA architectures. Several techniques for reducing leakage power were proposed in the past year. Gayasen et al. [1] proposed a technique for disabling unused portions of the FPGA through region-constrained placement, employing sleep transistors that control coarse-grain regions of the FPGA. Our technique provides a finer-grain leakage reduction capability. As explained earlier, we intend to avoid placing any constraint on the placement of logic blocks, and we aim to provide a good leakage control solution for an arbitrary placement. In Section 5, we discuss how our hierarchical LUT structures can be tuned to achieve more stringent control over the overhead associated with Vdd gating. Anderson et al. [2] proposed an optimization technique that selects polarities for logic signals at the inputs of LUTs so that they spend the majority of their time in low-leakage states. Our technique can be viewed as complementary to this approach: signal polarity assignment is performed on the active part of the logic blocks, whereas the inactive portions of the LUTs can benefit from our technique. Li et al. [12] proposed a scheme where the SRAM cells use a high Vt, which reduces leakage power by 15× at the cost of a 13% increase in configuration time. However, for deep submicron designs beyond 65 nm, Vt cannot be increased beyond a certain limit, since Vdd would be scaled down significantly.
Rahman and Polavarapuv [13] evaluate several low-leakage design techniques for FPGAs and conclude that multiple-Vt switch blocks are very effective in reducing leakage power dissipation. Our proposed optimization can co-exist with both of these techniques. Calhoun et al. [14] propose a design methodology using Multi-Threshold CMOS gates for leakage reduction and demonstrate the application of this design technique to a reconfigurable architecture. Their approach is a general design technique for CMOS designs, whereas our approach aims to introduce leakage optimization into LUTs without fundamentally changing the LUT design methodology.
FPGA LOGIC BLOCK STRUCTURES
Before we describe our leakage reduction techniques, we briefly discuss the structure of a logic block and the components that are commonly used in our target architecture. Many modern FPGAs use an island-style architecture, which consists of an array of logic blocks, I/O blocks, and programmable routing. The logic and the I/O blocks are connected through a programmable routing fabric. Logic is implemented using look-up tables (LUTs). In essence, a k-input LUT (k-LUT) is a small memory that can implement any function with at most k inputs. A k-LUT is built with 2^k SRAM cells and a 2^k:1 multiplexer, where the SRAM cells are programmed with the truth table of the k-input function the LUT implements. Commercial FPGAs mostly use 4-LUTs, and previous work has shown that 4-LUTs have the highest area efficiency [15]. As we discuss in more detail in Section 5, we have performed our experiments using the VPR tool flow. Our architectural assumptions are closely related to the target architecture used in VPR, and the diagrams depicting the representative architecture are based on descriptions in [16]. The 4-LUT shown in Figure 1 uses 16 SRAM cells, a 16-input pass-transistor-based multiplexer, and a set of buffers. Each SRAM cell consists of 6 minimum-width transistors. The total number of transistors for this LUT is 167 (96 for the SRAM cells, 30 for the multiplexer tree, and 41 for the input buffers and complementers) [16]. A LUT, a flip-flop, and a multiplexer are grouped together to form a logic element as shown in Figure 2. Logic blocks of modern FPGAs, called Configurable Logic Blocks (CLBs), consist of a cluster of logic elements arranged in different hierarchical organizations. For instance, a few LUTs are grouped together to form a slice and several slices are grouped to form a CLB. Input multiplexers connect the inputs of the logic cluster to the inputs of the individual LUTs within the cluster.
In our target architecture we assume that any input to the logic cluster can be routed to any LUT input. 
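As an illustration of the k-LUT organization described above, the following minimal Python sketch models a LUT as 2^k SRAM bits read through a tree of 2:1 selections, and counts the SRAM and multiplexer transistors. The function names are hypothetical, and the closed-form multiplexer count (2^(k+1) - 2) is an assumption that happens to reproduce the 30 pass transistors quoted for the 4-LUT; it is not taken from [16].

```python
def lut_transistor_counts(k):
    """Transistor counts for a k-LUT's SRAM array and pass-transistor mux tree.

    Assumes 6 transistors per SRAM cell and one pass transistor per branch
    of a binary mux tree (2^k + 2^(k-1) + ... + 2 branches in total).
    """
    sram = 6 * 2**k
    mux = 2**(k + 1) - 2
    return sram, mux


def lut_eval(truth_table, inputs):
    """Evaluate a k-LUT: the inputs select one SRAM cell through the mux tree.

    truth_table: list of 2^k bits (the SRAM contents)
    inputs: list of k bits; inputs[0] is the lowest-order select line and
            drives the mux level closest to the SRAM cells
    """
    level = list(truth_table)
    for bit in inputs:
        # Each input halves the number of candidate values (one mux level).
        level = [level[i + bit] for i in range(0, len(level), 2)]
    return level[0]


# A 4-LUT programmed as a 4-input AND: only address 15 holds a 1.
and4 = [0] * 15 + [1]
sram, mux = lut_transistor_counts(4)   # 96 SRAM and 30 mux transistors
total = sram + mux + 41                # plus buffers/complementers: 167
```

With the 41 buffer and complementer transistors added, the sketch arrives at the 167 transistors quoted above for a 4-LUT.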
LEAKAGE POWER OPTIMIZATION
In this section, we will first present a preliminary analysis we performed on a set of benchmarks to assess the variance in LUT utilization across different designs. Next, we will introduce our proposed hierarchical LUT structure.
Variance in LUT Utilization
An analysis of the 20 MCNC benchmarks [17] after technology mapping using Flowmap [18] shows that if a circuit is mapped onto an FPGA containing 4-LUTs, there will be many LUTs that do not use all 4 inputs. Table 1 shows the distribution of 2-, 3-, and 4-input LUTs (in addition to the unused LUTs) needed for each of these 20 MCNC benchmarks. Figure 3 shows the distribution of the 2-, 3-, and 4-input LUTs as a percentage of the total number of LUTs actually present. Figure 3 shows that on average only 53% of the 4-LUTs use all their inputs. Although using 4-LUTs yields a high utilization rate, 47% of the LUTs leave one or more inputs unused. Based on this observation, we propose a technique that saves leakage power. Instead of having LUTs with a fixed number of inputs, we propose a hierarchical look-up table, which can yield LUTs with a varying number of inputs. In this structure we employ Vdd gating to hierarchically cut off the power supply to one half or three quarters of the original 4-input LUT. This in effect selectively yields a 3-input LUT or a 2-input LUT from a 4-input LUT.
Figure 3. Distribution of LUT inputs
The hierarchical LUT structure will employ mechanisms to disable unused parts of the SRAM array and the output multiplexer as well as to deactivate multiplexers associated with unused LUT inputs. In the following subsections we will elaborate on the hierarchical LUT structure.
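The kind of tally behind Figure 3 can be sketched as follows. This is only an illustration: the function name and the toy input list are hypothetical, and the real data come from the Flowmap-mapped MCNC benchmarks in Table 1.

```python
from collections import Counter


def input_distribution(inputs_used_per_lut):
    """Percentage of LUTs using each number of inputs.

    inputs_used_per_lut: one integer per mapped LUT, giving how many of its
    inputs the mapped function actually uses.
    """
    counts = Counter(inputs_used_per_lut)
    total = len(inputs_used_per_lut)
    return {k: 100.0 * counts[k] / total for k in sorted(counts)}


# Toy example (not benchmark data): four LUTs using 4, 4, 3, and 2 inputs.
dist = input_distribution([4, 4, 3, 2])   # {2: 25.0, 3: 25.0, 4: 50.0}
```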
Hierarchical LUT
In this section we first describe how we construct a hierarchical LUT structure. Next, we present an analysis of the incurred overhead by employing the proposed scheme.
Hierarchical reduction of SRAM array
After the LUTs are packed into complex logic clusters we assess how many inputs each LUT will use based on the logic mapped to it. Depending on the number of inputs each LUT uses, a portion of the SRAM array and the associated output multiplexer will be deactivated. 
Figure 4. Implementation of hierarchical 3-input LUT (The LUT complexity can be reduced from 3-LUT configuration to 2-LUT configuration).
For example, if the 3-LUT shown in Figure 4 uses only two of its inputs, then the SRAM cells inside the hashed block marked Block 1 can be shut off, and the LUT input In2 has to be set to 0. In this manner, input In2 controls the pass transistor at the third level of the multiplexer tree and disconnects the upper half of the LUT structure from the active lower half. Similarly, if the 3-LUT uses only one of its inputs, then the cells in Block 2, in addition to those in Block 1, will be shut off. In this case, both In2 and In1 have to be set to 0. If all 3 inputs are unused, then the entire SRAM cell array along with the multiplexer can be shut down. We assume that the LUT inputs are assigned such that the unused inputs are always the higher-order inputs, i.e., only In2 is unused if the 3-LUT uses only two of its 3 inputs. This implies that we can force LUTs to use (and hence not use) specific inputs. Some FPGAs do not allow this, which can reduce the effectiveness of our approach. For our current work we assume that the inputs are interchangeable. In the case of a more restricted architecture, techniques such as reorganizing the SRAM configuration to use specific LUT inputs could be applied. Based on this observation, and since leakage power is proportional to the transistor count [2], we expect to obtain considerable savings in leakage power. The total number of active transistors for 2-, 3-, and 4-LUTs (including only the SRAM cells and the transistors contained in the output multiplexer trees) is shown in Table 2.
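The following Python sketch illustrates why tying the unused higher-order inputs to 0 makes the gated SRAM cells irrelevant to the output. It is a behavioral model only, under the assumption stated above that unused inputs are always the higher-order ones; the function names are hypothetical, and unpowered cells are represented by `None`. The `block_size` formula is an extrapolation from the Block 1 figures (four SRAM cells, seven multiplexer transistors) and is not taken from Table 2.

```python
def lut_eval(truth_table, inputs):
    """Read one SRAM bit through a binary mux tree (inputs[0] is lowest order)."""
    level = list(truth_table)
    for bit in inputs:
        level = [level[i + bit] for i in range(0, len(level), 2)]
    return level[0]


def gate_unused(truth_table, used_inputs):
    """Model Vdd gating: only the first 2^used_inputs cells stay powered."""
    active = 2**used_inputs
    return [bit if i < active else None for i, bit in enumerate(truth_table)]


def block_size(level):
    """(SRAM cells, mux pass transistors) in the block gated when input
    In<level> (level >= 1) is unused and all higher-order inputs are unused.

    Assumption: the block is the upper half of the sub-tree selected by
    In<level>, i.e. 2^level cells and 2^(level+1) - 1 pass transistors.
    """
    return 2**level, 2**(level + 1) - 1


# A 3-LUT holding a 2-input XOR on In0/In1; the upper four cells are unused.
xor_in_3lut = gate_unused([0, 1, 1, 0, 0, 0, 0, 0], used_inputs=2)
outputs = [lut_eval(xor_in_3lut, [a, b, 0]) for b in (0, 1) for a in (0, 1)]
```

With In2 held at 0, every read in this example resolves to a powered cell, so `outputs` never contains `None`.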
Elimination of active input multiplexers
Usually, complex logic clusters are designed such that any input to the logic cluster is accessible by any LUT input pin within the cluster as we described in Section 3. An input multiplexer is used for each LUT input to route any external cluster input to individual LUT inputs. Generally, the number of inputs to a complex cluster is less than the total number of LUT inputs contained in the cluster. In other words, a cluster of four 4-input LUTs has less than 16 inputs. There have been studies to determine the best number of cluster inputs for a given cluster size. For example, for a logic cluster that has four 4-LUTs, 10 inputs is determined to be optimal in terms of logic utilization [16] , which we have used for our experiments. In that case there are sixteen 14:1 input multiplexers, because all the four inputs of each of the four 4-LUTs need one input multiplexer. The multiplexers are 14:1 because each LUT input can come from any of the 10 cluster inputs or 4 LUT outputs within the cluster. Now, if a 4-LUT is reduced to a 3-LUT, then the number of these input multiplexers is reduced by 1 per LUT. In other architecture configurations a logic cluster can have a full set of 16 inputs. Then, there are sixteen 20:1 (16 external inputs to the cluster and 4 additional inputs generated through feedback from the 4 LUTs within the cluster) multiplexers. In such a case if the number of inputs to the LUTs can be decreased, that can add up to even larger savings in leakage power. Figure 5 depicts a CLB with 10 inputs, which contains four 4-LUTs, where two of these are configured as 4-LUTs, one is configured as a 3-LUT, and one is configured as a 2-LUT. The input multiplexers shown in black can be shut down in this case.
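The input multiplexer arithmetic above can be sketched as follows. The function and parameter names are hypothetical; the sketch assumes, as in the text, that every LUT input has its own multiplexer fed by all cluster inputs plus one feedback line per LUT.

```python
def input_mux_summary(cluster_inputs, lut_sizes, used_inputs):
    """Summarize the input multiplexers of one logic cluster.

    cluster_inputs: number of external inputs to the cluster
    lut_sizes:      physical input count of each LUT in the cluster
    used_inputs:    number of inputs each LUT actually uses
    Returns (mux fan-in, total input muxes, muxes that can be shut down).
    """
    fan_in = cluster_inputs + len(lut_sizes)   # external inputs + feedback
    total = sum(lut_sizes)                     # one mux per LUT input
    idle = sum(k - u for k, u in zip(lut_sizes, used_inputs))
    return fan_in, total, idle


# The configuration described for Figure 5: 10 cluster inputs, four 4-LUTs,
# two used as 4-LUTs, one as a 3-LUT, and one as a 2-LUT.
summary = input_mux_summary(10, [4, 4, 4, 4], [4, 4, 3, 2])   # (14, 16, 3)
```

For this configuration the sketch gives sixteen 14:1 multiplexers, three of which are associated with unused LUT inputs.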
Vdd Gating and Overhead Estimation
In this section we discuss how Vdd gating is performed to shut off the SRAM cells associated with the unused LUT inputs, and we discuss the associated overhead. We use an SRAM-controlled sleep transistor that cuts off the power supply for each of the blocks shown in Figure 4. A schematic of the Vdd gating for a group of two SRAM cells is shown in Figure 6, where each SRAM cell structure is shown inside the dashed box. The structure of the SRAM cell that controls the sleep transistor is the same as that of the other SRAM cells, but its details are omitted from the figure for clarity. When a 3-LUT (as shown in Figure 4) uses only two inputs, Block 1 is shut off by using the sleep transistor associated with Block 1. For a 3-LUT, the sleep transistor for Block 1 shuts down power for four SRAM cells and seven transistors in the MUX tree. The SRAM cell associated with the sleep transistor holds a value of 1 or 0 depending on whether the corresponding input is unused or used. Similarly, when the 3-LUT uses only one input, both Block 1 and Block 2 can be shut off using the corresponding sleep transistors. When none of the inputs are used (i.e., the LUT is unused), we shut off all the SRAM cells and the entire MUX tree. This technique uses k SRAM-controlled sleep transistors for a k-LUT, and each such sleep transistor requires 7 transistors (1 for the Vdd gating and 6 for the additional SRAM cell). Therefore, a k-LUT needs an extra k×7 transistors. Based on this, we need 28 extra transistors for a 4-LUT in addition to the 167 transistors needed for a 4-LUT (as shown in Table 2). Adding these sleep transistors will hence increase the logic block area by about 16.8% (28/167). However, since FPGA area is predominantly determined by routing area, an increase in logic area will not affect the total chip area significantly.
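The overhead arithmetic above can be checked with a short Python sketch. The function name is hypothetical; the 7 transistors per sleep control and the 167-transistor baseline are the figures given in the text.

```python
def sleep_overhead(k, base_transistors, per_control=7):
    """Extra transistors and percentage overhead of Vdd gating for a k-LUT.

    per_control: 1 sleep transistor + 6 transistors for its SRAM cell.
    """
    extra = k * per_control
    return extra, 100.0 * extra / base_transistors


extra, percent = sleep_overhead(4, 167)   # 28 extra transistors, ~16.8%
```

Note that this counts transistors only; it treats the sleep transistor as one device and does not model its actual width or layout area.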
EXPERIMENTAL RESULTS
The effectiveness of the proposed leakage reduction technique is evaluated for 1.8V 180nm technology. We have used parameters at this technology due to their immediate availability and we performed power measurements based on this technology. We start by describing our methodology and subsequently present the experimental results.
Methodology
We start by estimating the leakage power of a LUT. For this we have used Power Model [19], an additional module integrated with the Versatile Place and Route tool (VPR) [16]. Since VPR only allows defining architectures with a single LUT type, i.e., a single LUT size, we have estimated the leakage power of LUTs of different sizes in the following manner. We first packed the logic of each benchmark using only one LUT per logic cluster. We repeated this for three different cases: using only 4-LUTs, only 3-LUTs, and only 2-LUTs. Then, for each implementation, we divided the total logic block leakage power by the number of logic blocks used. In this fashion we obtained an average leakage power measure for each LUT size; the results are shown in Table 3. We use these values together with the data presented in Table 1 in Section 4 to estimate the savings in leakage power that can be achieved by using a hierarchical LUT structure. Moreover, since leakage power is proportional to the number of transistors, we also estimated the leakage power savings in terms of the number of transistors that could be shut down.
Results
We begin by presenting the leakage power consumption of a LUT, shown in absolute and normalized form in Table 3. We have used the normalized value of the LUT leakage power to estimate the overall savings in leakage power using our optimization as compared to using all 4-LUTs. The results are shown in Table 4. The second column of Table 4 shows the relative amount of leakage power consumed by all LUTs without any optimization. These values are equal to the number of 4-LUTs used to implement the circuit, since the normalized leakage power for a 4-LUT is 1. The third column shows the relative leakage power when the optimization is applied. If a circuit uses x 2-LUTs, y 3-LUTs, and z 4-LUTs, then the value for the optimized power is obtained as x × 0.53 + y × 0.74 + z × 1.00. As seen from Table 4, savings in logic block leakage power of about 23% are possible with the proposed optimization technique.
Table 5 shows the achieved savings in terms of the number of transistors that could be shut off. We conclude that by using our optimization technique about 26% of the logic block transistors can be shut off, which is consistent with the 23% power savings shown in Table 4. We emphasize that although our results show leakage power savings close to 23% for 180 nm technology, we expect the savings in logic block leakage power to be substantially higher for smaller technologies.
Finally, we may choose to adopt hierarchical LUT structures in a selective manner, i.e., not every single LUT within a logic cluster needs to be hierarchical. We analyzed the distribution of the number of inputs used per LUT across logic clusters. The distribution is presented in Table 6. For this benchmark set we observe that approximately 2 out of every 4 LUTs packed within a logic cluster use all 4 inputs. Similarly, 1 out of 4 LUTs uses 3 inputs. To reduce the overhead associated with the hierarchical LUT structure, a logic cluster can therefore be designed so that only 2 out of its 4 LUTs are hierarchical.
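The normalized estimate described above for Table 4 can be sketched in a few lines of Python. The normalized leakage values 0.53, 0.74, and 1.00 are those quoted in the text; the function name and the LUT counts in the example are hypothetical and are not benchmark data. The sketch also ignores the leakage of the sleep transistors and their control cells.

```python
# Normalized leakage per effective LUT size, as quoted in the text.
NORMALIZED_LEAKAGE = {2: 0.53, 3: 0.74, 4: 1.00}


def leakage_savings(lut_counts):
    """Estimate relative leakage with and without the hierarchical LUT.

    lut_counts: mapping from inputs used (2, 3, or 4) to number of LUTs.
    Returns (baseline, optimized, fractional savings), where the baseline
    treats every used LUT as a full 4-LUT.
    """
    baseline = sum(lut_counts.values()) * NORMALIZED_LEAKAGE[4]
    optimized = sum(n * NORMALIZED_LEAKAGE[k] for k, n in lut_counts.items())
    return baseline, optimized, 1.0 - optimized / baseline


# Hypothetical circuit: 100 2-LUTs, 100 3-LUTs, and 200 4-LUTs.
baseline, optimized, savings = leakage_savings({2: 100, 3: 100, 4: 200})
```

For this made-up mix the estimate is 327 normalized units against a baseline of 400, i.e. a saving of roughly 18%; the 23% reported above depends on the actual per-benchmark distributions in Table 1.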
CONCLUSIONS
The process technology trends in FPGA manufacturing indicate that leakage power will be an increasingly important design concern for future reconfigurable devices. In this paper, we investigated a fine-grain leakage control technique, which relies on the observation that a significant number of logic blocks are underutilized in practice. We addressed this by introducing a hierarchical LUT structure in which, depending on the level of utilization, the complexity of individual LUTs can be incrementally reduced by shutting off unused portions. Variance in LUT utilization can be exploited in other ways as well. One opportunity is to use the unused portions of LUTs to improve reliability, by embedding redundancy into the logic in a systematic fashion.
