In this paper, two packing algorithms for the detection of activity profiles in MTCMOS-based FPGA 
Introduction and Related Work
The scaling of CMOS technology has precipitated an exponential increase in subthreshold leakage currents. It is therefore not surprising that leakage power now constitutes a high percentage of the total chip power. In FPGA applications, the management of leakage power has been overshadowed by performance improvement and dynamic power minimization techniques. As modern FPGAs are implemented in 90 nm CMOS technology, solving the leakage power problem is pivotal to devising power-aware FPGAs. For the FPGA industry to continue competing with high-performance custom VLSI designs in the semiconductor market, or to explore new territories such as wireless personal communication systems (PCSs), the industry must invest in novel techniques to control leakage power dissipation. Although FPGAs provide flexibility in design, their resources are not fully exploited. In fact, the utilization of the logic and switching resources is only approximately 60% and 50%, respectively, of the total FPGA resources. As a result, the unutilized parts of FPGAs cost designers a large amount of inactive leakage power without providing any gainful output. These unutilized resources should therefore be targeted to achieve minimal leakage power. In addition, even the utilized blocks dissipate inactive leakage power during their standby modes.
One technique that has become increasingly popular for mitigating inactive leakage power is employing high-Vth (HVT) sleep transistors (STs) to cut off a low-Vth circuit from the power rails during the standby mode. When the ST is turned off, the circuit leakage is limited to that of the ST. In this technique, the sizing of the ST also determines the speed loss in the active mode because of the added resistance to ground. STs should be able to support the peak current requirements of the logic clusters that they control so that the speed penalty does not exceed 5%. Therefore, by selecting the appropriate ST size, the speed penalty, leakage power, and area overhead can be minimized for the entire circuit. Since the peak currents of the clustered logic blocks control the circuit speed, the method by which the kind and number of blocks are chosen to share one ST is crucial. Ideally, any number of gates can be grouped with one ST, as long as their switching periods are mutually exclusive. However, for gates that switch simultaneously, the number of clustered gates must be limited to ensure that the speed penalty does not exceed 5%.
Over the past few years, a number of ST sizing methodologies have been reported in the literature. In [1] and [2], a single ST was proposed to support the whole circuit. The ST was sized according to the mutually exclusive discharge patterns in [1]. Also, a distributed ST network methodology was suggested in [3] to minimize the total ST area. None of these techniques [1, 2, 3] offers an automated methodology for clustering the logic blocks; they are based either on intuition or on supporting the entire circuit with a global ST. Nor do they account for the routing overhead of clustering, which is a critical issue in nanometer designs. In addition, [4] and [5] discussed techniques that employed STs, but did not address how and which gates should be chosen to share an ST. From the FPGA perspective, a leakage control technique which employed region-constrained placement of STs was developed in [6]. Again, there was no description of how clusters were created.
Consequently, this paper proposes two new packing algorithms for detecting which blocks in FPGA structures exhibit switching correlations, i.e., can be clustered in one activity profile. A flow chart for the new design flow is shown in Figure 1. The first algorithm is a connection-based packing technique, where the proximity of FPGA blocks is considered, whereas the second algorithm is a logic-based packing approach where the Hamming distance between the activities of the different blocks is utilized. Both algorithms are analyzed and applied to a number of FPGA benchmarks for validation. Once the activity profiles are known, STs are connected to the clustered blocks that share similar activity profiles. This connection is performed in the configuration stage of the FPGA. The leakage power saving is finally evaluated for each of the two algorithms. It is important to note that this clustering (packing) stage occurs before the placement stage in FPGAs.
Consequently, by applying the two algorithms and knowing which gates should be clustered, the placer ensures that they are placed close to each other to minimize the routing overhead. The FPGA fabric is thus divided into regions of similar activity, each of which is independently controlled through a local ST. Furthermore, latches are inserted between one MTCMOS circuit and another to ensure that the logic state is retained during the standby mode (when the virtual ground rails float). These latches are inherently located in the FPGA BLEs, and are adopted as the interface between every two MTCMOS blocks. Thus, no extra logic is required for this interface to guarantee data retention. By applying these techniques, the blocks which are unutilized are connected to a common ST and permanently turned off during the configuration of the FPGA. On the other hand, utilized blocks that display similar switching activities (i.e., the same activity profile) are grouped together so that they can be dynamically and collectively turned on or off. In order to limit the overhead of the reconfiguration control circuitry, the ST should not change state too frequently.
Targeted FPGA Architecture
The targeted FPGA architecture is shown in Figure 2. Each BLE consists of a 4-input LUT, a flip-flop, and a 2:1 multiplexer. Several BLEs are grouped together to form a CLB. Inside the CLBs, the BLEs are connected together using the local switching resources. In addition, the CLBs are connected using the global routing resources of the FPGA. Every n CLBs are connected to ground via a high-Vth (HVT) NMOS transistor to reduce leakage current and force the n CLBs into a low-power mode during their inactive periods. The HVT sleep transistor is controlled by a SLEEP signal at its gate. Moreover, in each CLB, the latches are used to retain the values of the BLE outputs when they enter the sleep mode. The CLBs served by one sleep transistor are called a sleep region. The size of the sleep region, i.e., n, is controlled by several factors: the maximum allowable size of the sleep transistor (and hence the maximum peak current this transistor can carry), the maximum allowed performance deterioration due to the sleep transistor, and the maximum permitted ground bounce on the virtual supply lines.
Connection-Based Activity Packing Algorithm (CAP)
The CAP algorithm involves assigning similar activity labels to the BLEs that are expected to have similar activity profiles. Afterwards, the BLEs are clustered to minimize the delays along the critical paths by applying simulated annealing. The CAP algorithm consists of two phases: activity generation and clustering.
Activity Generation
The aim of the activity generation phase of CAP is to give BLEs that share nets similar activity labels. The main reasoning for this approach is that BLEs that share inputs are expected to be active at the same time. Moreover, cascaded BLEs are more likely to have similar activity labels because, as the output of the driving BLE changes, the driven BLE is expected to change state. Hence, this approach assumes a 100% probability of change in the output of a BLE when one of its inputs changes state, thus giving pessimistic results for the activity labeling.
The algorithm begins with the circuit primary inputs and greedily allocates activity regions as it traverses the circuit netlist by means of a simple depth-first graph search. The result is a fast and computationally efficient algorithm. While traversing the circuit netlist, whenever a new BLE is encountered, it is necessary to determine whether to add this BLE to the current activity region or to place it in a new activity region. There are two principal driving costs that need to be considered at each node: the total number of activity regions and the size of the activity region.
The number of activity regions corresponds physically to the total number of sleep signals employed in the design. Increasing the number of activity regions results in increasing the number of sleep signals used, thus causing a power-inefficient implementation, as well as complicating the control circuitry for generating these signals. Hence, the total number of activity regions needs to be minimized as much as possible.
The other driving cost function for the activity generation algorithm is the size of the activity region. Reducing the size of the activity region provides the clustering algorithm with more flexibility to pack only those BLEs that manifest the same activity, not those that merely have close activity profiles. Although this leads to greater leakage savings, the disadvantages of a large number of activity regions once again become an issue. Furthermore, the algorithm must look ahead as each BLE is processed. Because the algorithm is connection-based, the addition of any BLE to the current activity region implies the addition of all of its fan-in and fan-out BLEs. As a result, the number of fan-ins and fan-outs of any BLE should be considered during the process. Consequently, the cost of adding the current BLE to the current activity region is expressed as
where maxCap is the predefined maximum capacity of the activity region, currCap is the current capacity of the activity region, lev_b and lev_a are the minimum numbers of unlabeled logic levels from the BLE to the primary inputs and outputs, respectively, and α is a weighting constant that sets the relative significance of the logic levels before and after the BLE, and hence improves the quality of the final solution. The use of lev_a and lev_b gives the cost function the ability to look around the current BLE and examine which other BLEs are expected to be attracted to the current activity region when the BLE under investigation is placed in it.
By running the algorithm on several benchmarks, it is found that setting maxCap to 1.5 times the longest path from input to output in the circuit provides the best results in terms of power savings. Using a constant value for maxCap, irrespective of the circuit size, gives impractical results. Moreover, increasing maxCap beyond 1.5 times the longest path in the circuit produces excessively large activity regions that are usually not completely filled by the algorithm. On the other hand, decreasing maxCap increases the number of activity regions in the final design.
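The maxCap rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the netlist is taken to be an acyclic fan-out map, path length is counted in blocks (the text does not say whether blocks or edges are counted), and the names `longest_path_blocks`, `max_cap`, and the example netlist are hypothetical.

```python
from functools import lru_cache


def longest_path_blocks(fanout):
    """Number of blocks on the longest path of an acyclic netlist.

    `fanout` maps each block name to the blocks it drives (assumed format).
    Sequential loops are not handled; the graph must be a DAG.
    """
    nodes = set(fanout) | {v for outs in fanout.values() for v in outs}

    @lru_cache(maxsize=None)
    def depth(node):
        # A block with no fan-out ends a path of length one.
        return 1 + max((depth(s) for s in fanout.get(node, ())), default=0)

    return max((depth(n) for n in nodes), default=0)


def max_cap(fanout, factor=1.5):
    """maxCap as `factor` times the longest path; 1.5 is the reported value."""
    return factor * longest_path_blocks(fanout)


# Hypothetical netlist, not taken from any figure in the paper.
netlist = {"A": ["D"], "D": ["B", "E"], "E": ["F"]}
print(longest_path_blocks(netlist), max_cap(netlist))
```

Tying the capacity to the circuit depth, rather than to a constant, is what lets the same setting scale across benchmarks of different sizes.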
Another cost function is maintained to represent the attraction between the BLE and the activity region under consideration, and is expressed as
where m is the number of nets that connect the current sleep region to the BLE under consideration. Hence, the decision of whether or not a certain BLE should be placed in the current activity region is given by:
cost1 + δ × cost2 ≤ 0 ⇒ add to the current activity region
cost1 + δ × cost2 > 0 ⇒ start a new activity region
where δ is a normalization factor. The values of α and δ are determined by trying several values and checking the quality of the solution. In our experiments, a value of 0.5 is selected for α. The value of δ controls which of the cost functions, cost1 or cost2, is given higher priority. A value of -0.1 is adopted for δ in this work, and it proved to produce good results. The reason for choosing a negative value is that cost1 is negative unless the activity region size constraint is violated.
The activity generation phase consists of two stages: exploration and labeling. In the exploration stage, the netlist is converted into a directed graph and traversed by depth-first search. As each node is traversed, two labels are added to it: the number of levels and paths from this node to any of the primary inputs and to any of the primary outputs. In the labeling stage, the graph is again traversed by depth-first search, and the cost of adding each of the nodes connected to the current node to the current activity region (cost1) is computed. Then the cost of not adding them to the activity region (cost2) is computed. The node with the minimum cost1 is selected as the candidate node, its cost1 is compared with its cost2, and a decision is made. This continues until each node in the graph is labeled with its activity region. The pseudocode for the activity generation phase of CAP is given in Figure 3.
In the example of Figure 4, the algorithm begins with node A, then studies its child D and adds it to the activity region. Following that, the logic blocks connected to D, namely B and E, are examined. B is added to the activity region because it has the minimum C2. Afterwards, E is added to the activity region. A new activity region is started from F because the sum of C1 and δC2 is positive. Lastly, C is added to the second activity region.
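The decision step can be sketched as follows. Since the closed forms of cost1 and cost2 are not reproduced here, the sketch takes both as precomputed inputs. It reads the rule as "a positive sum starts a new region", as in the Figure 4 walk-through; treating a sum of exactly zero as "add" is an assumption, and the function names and candidate values are hypothetical.

```python
DELTA = -0.1  # normalization factor reported in the text


def decide(cost1, cost2, delta=DELTA):
    """Return "add" or "new" for one candidate BLE.

    A positive cost1 + delta * cost2 starts a new activity region.
    Treating a sum of exactly zero as "add" is an assumption.
    """
    return "new" if cost1 + delta * cost2 > 0 else "add"


def pick_and_decide(candidates, delta=DELTA):
    """Select the candidate with minimum cost1, then apply the rule.

    `candidates` maps a BLE name to its (cost1, cost2) pair (assumed format).
    """
    name = min(candidates, key=lambda n: candidates[n][0])
    cost1, cost2 = candidates[name]
    return name, decide(cost1, cost2, delta)


# Hypothetical costs for two neighbouring BLEs.
print(pick_and_decide({"B": (-2.0, 3.0), "E": (-1.0, 1.0)}))
```

With δ negative, a strong attraction (large cost2) pushes the sum further below zero, so well-connected BLEs are more likely to join the current region.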
Packing Phase
The packing is performed by applying simulated annealing to the resulting circuit netlist, which consists of the input netlist plus an activity number for each CLB. The packing algorithm obeys the following hard constraints while solving the optimization problem: (i) the number of BLEs must be less than the cluster size, (ii) the number of inputs needed by the BLEs inside the cluster must be less than the number of cluster inputs, (iii) all the BLEs inside a cluster must have the same activity profile, and (iv) the peak current in the cluster must not surpass the maximum allowable current of the ST for a speed penalty of 5%. The packing algorithm uses the same objective function as T-VPack [7], namely to minimize the number of inter-cluster connections that lie on the critical path of the circuit. Unlike [7], however, simulated annealing is applied to the optimization problem. The reason for using simulated annealing is to speed up the solution of the packing problem, since the problem here is more complex than the one in [7] owing to the new hard constraints (iii) and (iv).
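The four hard constraints can be expressed as a feasibility check that an annealing move would have to pass. This is a minimal sketch, not the authors' implementation. It assumes that "less than" means "at most", that nets produced inside the cluster do not count as cluster inputs, and that the cluster peak current is the worst-case sum of the BLE peak currents. The `BLE` record and all field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BLE:
    name: str
    inputs: frozenset    # names of the nets feeding this BLE
    output: str          # name of the net driven by this BLE
    activity: int        # activity region label from the first phase
    peak_current: float  # peak current, arbitrary units


def cluster_is_feasible(bles, cluster_size, cluster_inputs, max_st_current):
    """Check hard constraints (i)-(iv) for one candidate cluster."""
    if not bles:
        return True
    # (i) capacity, read here as "at most cluster_size" (assumption)
    if len(bles) > cluster_size:
        return False
    # (ii) distinct external inputs; internally generated nets are free (assumption)
    internal = {b.output for b in bles}
    external = set().union(*(b.inputs for b in bles)) - internal
    if len(external) > cluster_inputs:
        return False
    # (iii) a single activity profile per cluster
    if len({b.activity for b in bles}) != 1:
        return False
    # (iv) worst case: all BLEs draw their peak current at once (assumption)
    return sum(b.peak_current for b in bles) <= max_st_current


a = BLE("a", frozenset({"i1", "i2"}), "n1", 0, 1.0)
b = BLE("b", frozenset({"n1", "i3"}), "n2", 0, 1.0)
print(cluster_is_feasible([a, b], 4, 10, 5.0))
```

Rejecting infeasible moves outright, rather than penalizing them in the objective, keeps constraints (iii) and (iv) hard, as the text requires.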
Experimental Results
The CAP algorithm is implemented and tested on several benchmarks to assess its capability of using connectivity to generate the activity of the circuit, as well as the power savings due to the use of STs. The experiments are performed on a 900 MHz UltraSPARC III machine with 8 GB of RAM, and the results are summarized in Table 1. The third column in Table 1 lists the number of resulting clusters and the minimum FPGA array that can be used to map the circuit. The power dissipated by each design is calculated using the power model developed in [8]. In each benchmark, the power savings consist of two parts: savings from permanently turning off all the unused clusters, and savings from dynamically turning the used clusters on and off during operation depending on their activity profiles. When the unused clusters are turned off, their standby leakage power dissipation is reduced significantly because of the presence of the sleep transistor in the leakage path. This part of the power saving is therefore calculated by subtracting the standby leakage of a cluster with a sleep transistor from that without one, and multiplying the difference by the total number of unused clusters in the design. Moreover, to calculate the savings due to the dynamic switching of the used clusters, the logic power dissipation per cluster P_d is calculated by
where t_on and t_off are the percentages of time the cluster is on and off, respectively, and P_dyn and P_leak are the dynamic and standby leakage power dissipation of the cluster. When sleep transistors are used, the active leakage dissipation of each cluster is ignored because it is small with respect to the dynamic power dissipation. The power saving is thus the difference between P_d and the logic power dissipation without sleep transistors. From the results in Table 1, it can be deduced that the CAP algorithm achieves an average power saving of 22.5%. Moreover, the minimum power saving attained is more than 10%, except for the s1269 benchmark, which has high switching activity, thus reducing the power savings to merely those of turning off the unused part of the FPGA. In addition, the results for cm150a and cm163a indicate that the power savings in these two cases result only from turning off the used parts of the FPGA during their idle state, as there are no unutilized parts in the FPGA. Furthermore, the execution time of the packing algorithm is almost linear in the circuit size, except for sequential circuits with long cycles, which complicate the processing because a simple depth-first search is used to traverse the netlist graph. This grouping is similar to the worst-case grouping because the algorithm does not incorporate the actual logic function of the circuit and assumes that when the inputs to a BLE change, its output will change, which is not true in all cases. Incorporating the logic function of the BLE can result in a better grouping, but can be computationally expensive. For this reason, the Logic-based Activity Packing (LAP) algorithm is proposed.
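The two components of the power saving can be sketched numerically. The expression for P_d is not reproduced in this text, so the form used below, a time-weighted sum t_on × P_dyn + t_off × P_leak, is an assumption inferred from the definitions of the symbols rather than the paper's stated equation. The sketch also assumes t_off = 1 - t_on. Function names and numbers are hypothetical.

```python
def cluster_logic_power(t_on, p_dyn, p_leak):
    """Assumed form: P_d = t_on * P_dyn + t_off * P_leak, with t_off = 1 - t_on.

    t_on is a fraction in [0, 1]; active leakage is ignored, as in the text.
    """
    if not 0.0 <= t_on <= 1.0:
        raise ValueError("t_on must be a fraction between 0 and 1")
    return t_on * p_dyn + (1.0 - t_on) * p_leak


def unused_cluster_savings(leak_without_st, leak_with_st, n_unused):
    """Savings from permanently turning off the unused clusters."""
    return (leak_without_st - leak_with_st) * n_unused


# Hypothetical values in arbitrary power units.
print(cluster_logic_power(0.25, 8.0, 0.4))
print(unused_cluster_savings(1.0, 0.25, 4))
```

Under this reading, a cluster that is active a quarter of the time pays dynamic power for that quarter and only the reduced standby leakage for the remainder.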
Logic-Based Activity Packing (LAP)
The LAP algorithm represents the activities as binary sequences. The packing is then performed by grouping those BLEs that have similar activity sequences, i.e., the minimum Hamming distance between their activity vectors. In LAP, the circuit topology is ignored for the activity-based packing; instead, the circuit logic function is used to find the clustering that prolongs the off periods of each CLB. This is achieved by exhaustively simulating all the input combinations of the circuit to generate the activity vectors. To explain the algorithm properly, several definitions and notations are first introduced.
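The similarity measure can be sketched directly. The Hamming distance below is the standard definition; the pairwise search is only an illustration of "group the BLEs with the closest activity vectors", not the LAP procedure itself, and the block names and vectors are hypothetical.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("activity vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_pair(vectors):
    """Return the pair of block names whose activity vectors are closest.

    `vectors` maps a block name to its activity vector (assumed format).
    """
    return min(
        combinations(vectors, 2),
        key=lambda pair: hamming_distance(vectors[pair[0]], vectors[pair[1]]),
    )


# Hypothetical activity vectors over eight input combinations.
activity = {
    "X": [1, 1, 1, 1, 0, 0, 0, 0],
    "Y": [1, 1, 1, 0, 0, 0, 0, 0],
    "Z": [0, 0, 0, 0, 1, 1, 1, 1],
}
print(closest_pair(activity))
```

A small distance means two blocks are needed under almost the same input combinations, so sharing an ST costs them few unnecessary wake-ups.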
Activity Vector
Definition 1 (Activity Vector): Given a net x in a circuit netlist, the activity vector A_x of x is defined as follows:
where n is the total number of inputs to the circuit, a_i is a binary variable that is '1' if any of the outputs of the circuit depends on net x for evaluation when the inputs to the circuit are given by the i-th input vector, and T denotes the transpose of the vector. In FPGAs, each BLE has only one output; thus, the activity vector of each net is also the activity vector of the BLE driving that net. Hence, A_x is the activity vector of net x, as well as of the BLE called X, where x is the output of X.
Example 1: For the circuit in Figure 5, blocks F and G must be on to generate the outputs of the circuit f and g, respectively. Consequently, the activity vectors A_f and A_g for blocks F and G, respectively, are given by A_f = [ 1 1 1 1 1 1 1 1 ]^T,
On the other hand, when computing the activity vectors at the inputs of block F, note that block D is used to generate the output signal f only if the input c is '1'. Similarly, block E is used only when c is '0'. Hence, the activity vectors for D and E, when f is evaluated, are represented by
