This paper presents a minimum-area, low-power driven clustering algorithm for coarse-grained, antifuse-based FPGAs under delay constraints. The algorithm accurately predicts the logic replication caused by timing constraints during low-power driven clustering. This technique substantially reduces the amount of duplicated logic, resulting in benefits in area, delay, and power dissipation. First, we build power-delay curves at nodes with the aid of the prediction algorithm. Next, we choose the best clusters based on these curves, starting from the primary outputs and moving backward in the circuit. Experimental results show 16% and 20% reductions in dynamic and leakage power dissipation, respectively, with an 18% area reduction compared to the results of clustering without the replication prediction.
Introduction
FPGAs have become commonplace not only in low-volume designs but also in portable, battery-powered devices that demand power efficiency. These devices, with their smaller form factor and increased performance, continue to define present and future applications. Applications of this type are characterized by higher speed, smaller size, longer battery life, and shorter time to market. Previously, programmable logic devices were not an option for integration in portable devices because they were bulky and consumed too much power. In the past several years, FPGA architectures have improved greatly, and FPGAs now play important roles in portable devices.
Antifuse-based FPGAs are one-time programmable logic devices. The antifuse is initially in a high impedance state and is transformed into a low impedance metal-to-metal link when programmed. Figure 1(b) illustrates the cross-sectional view of the antifuse programming technology. The antifuse element is formed by depositing a high resistance layer (> 1 GΩ) of amorphous silicon above a tungsten via plug that would otherwise bridge the insulation between the two metal layers [1]. Figure 1 shows a coarse-grained, antifuse-based FPGA from QuickLogic, which is the target device in this paper. The FPGA consists of pASIC3 logic cells, interconnects, and antifuse switches. The logic cell has a large number of inputs and multiple outputs in order to increase the logic utilization. This utilization is, however, a strong function of the power and efficacy of the design automation tools.
In a typical flow of FPGA CAD tools, clustering, which follows the technology mapping step, is an important optimization because it maps the target circuit netlist onto an FPGA array. Clustering, therefore, refers to the task of grouping logic gates in the circuit netlist and assigning each group to a configurable logic block in the FPGA array (in the case of our target architecture, this means packing gates into pASIC3 cells). Logic replication, which is often needed to meet the timing constraints, is an indispensable part of the clustering step. Logic replication directly increases the area and power dissipation of the FPGA synthesis solution. This increase is especially pronounced for leakage power, since leakage is a direct function of the size of the logic circuit implementation. The authors in [8] report that 33% logic replication is observed as a result of performance-driven clustering in SRAM-based FPGAs.
In this research, we present a low-power driven clustering algorithm with minimal logic replication for coarse-grained, antifuse-based FPGAs. As stated earlier, in the context of our problem, a cluster refers to a group of circuit nodes that can fit in a pASIC3 logic cell. We use a dynamic programming-based clustering technique where, starting from the circuit inputs and moving toward the circuit outputs, we incrementally generate a set of power-delay curves for all nodes in the network [2] [3]. Each such curve, stored at some intermediate node, describes the set of non-inferior clustering solutions for the subgraph rooted at that node. A critical factor in determining the quality of the clustering solution for a circuit is how accurate and complete the set of power-delay curves is; this strongly depends on the accuracy of the incremental cost (i.e., power dissipation) calculation at each node in the network. We have seen that existing heuristics for this cost calculation, which divide the cost of a multiple fanout node equally among its fanout nodes [2] [4], can result in significant computational errors, thereby degrading the quality of the overall solution. In this paper, we present a new heuristic approach for cost propagation across multiple fanout nodes in a Boolean network that allocates the cost of the logic cone rooted at a multiple fanout node to its fanout nodes proportionately. More precisely, we determine the cost allocated to each fanout node by traversing backward in the circuit to compute the amount of logic replication.
Background
Clustering techniques for SRAM-based FPGAs have been presented in [5] [6] [7] [8], and the clustering problem for coarse-grained, antifuse-based FPGAs has been addressed by a number of researchers [9] [10]. In particular, the authors of [10] presented an area-driven clustering algorithm. They set up a pair of linear equations and calculated the minimum number of required pASIC3 logic cells. Their algorithm, which produced 12% area improvement compared to a commercial tool, did not consider the routing cost. The same authors also presented a performance-driven clustering based on a labeling procedure that generates the minimum number of clusters on the timing critical paths. A slack-time relaxation was used to avoid redundant logic replication without violating the performance constraint. In addition, a random merging was used to cluster closely-placed partially-filled clusters. The algorithm gave about 45% delay improvement compared to a commercial tool. The key limitation of their work is that they used a unit delay model, which is not accurate enough to estimate the delay of a logic design. A delay-optimal clustering for low power was presented in [3]. For optimality, they enumerate all feasible cluster patterns at each gate in the circuit and maintain only the power-optimal solutions at each gate for each arrival time value.
In this research, we follow the same flow as that in [10] . There are four different programmable gate groups (cf. Figure 2 ) inside a pASIC3 logic cell. We call each of these gate groups a base gate. After deriving the base gates, cell generation is performed for each base gate. Cell personalization is done either by assigning constant 1 or 0 to some of the inputs or by connecting (bridging) some of the inputs together. By applying all possible combinations of these two operations to a base gate, many different library cells can be generated. We call the personalized cells "primitive cells". 
Design Flow and Problem Description
A cluster i, denoted by CL i , is defined as a group of circuit nodes that can be realized in a single pASIC3 logic cell without any resource conflicts. The set of nodes that drive nodes in cluster CL i is referred to as its leaf set and denoted by Λ i .
The clustering algorithm comprises two steps: cluster generation and cluster selection. During cluster generation, clusters rooted at nodes in the network are generated and power-delay curves are computed in a postorder traversal of the network, starting from the primary inputs and going toward the primary outputs. During cluster selection, clusters are determined in a preorder traversal from the primary outputs back toward the primary inputs. The design flow can be described as follows:
1. Select the logic cone rooted at a primary output, which has the largest number of un-clustered nodes.
2. Traverse the cone in postorder to create power-delay (PD) curves.
3. Select a power-delay point from the PD curve of a primary output and form a cluster based on the point.
4. Select power-delay points from PD curves at leaf nodes of the previous cluster and do this in preorder until all nodes in the logic cone are clustered.
5. Go to step 1 if any logic cone is not clustered yet.
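The outer loop of this flow (steps 1 and 5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the network is assumed to be a dictionary `fanins` mapping each node to its fanin nodes, the function names `transitive_fanin` and `cone_order` are hypothetical, and steps 2 to 4 are replaced by simply marking the whole cone as clustered.

```python
from typing import Dict, List, Set


def transitive_fanin(root: str, fanins: Dict[str, List[str]]) -> Set[str]:
    """Return all nodes in the logic cone rooted at `root`, including `root`."""
    cone: Set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node not in cone:
            cone.add(node)
            stack.extend(fanins.get(node, []))
    return cone


def cone_order(primary_outputs: List[str], fanins: Dict[str, List[str]]) -> List[str]:
    """Return the order in which logic cones would be clustered.

    Assumption: clustering a cone (steps 2-4) is modeled only by marking
    all of its nodes as clustered; PD curves are not built here.
    """
    clustered: Set[str] = set()
    order: List[str] = []
    remaining = list(primary_outputs)
    while remaining:  # step 5: repeat while some cone is not clustered
        # Step 1: pick the output whose cone has the most un-clustered nodes.
        best = max(
            remaining,
            key=lambda po: len(transitive_fanin(po, fanins) - clustered),
        )
        order.append(best)
        clustered |= transitive_fanin(best, fanins)  # stands in for steps 2-4
        remaining.remove(best)
    return order


# Hypothetical three-output network used only for illustration.
example_fanins = {
    "PO0": ["a", "b"],
    "a": ["x"],
    "b": ["x"],
    "PO1": ["b", "c"],
    "PO2": ["c", "d", "e", "f"],
}
example_order = cone_order(["PO0", "PO1", "PO2"], example_fanins)
```

In this toy network the cone of `PO2` has the most un-clustered nodes and is processed first; the cone of `PO1` then shares node `c` with it, so `PO0` is processed before `PO1`.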
A clustering solution at a node u is characterized by a power-delay point (PD-point), which is a pair {p u , d u }, where d u gives the delay value (i.e., latest signal arrival time) associated with the PD-point, and p u gives the corresponding power dissipation of the clustering solution rooted at node u.
Consider intermediate nodes n i and n j in a Boolean network (a circuit netlist with signal direction specified) where there exists a common multiple fanout node, n k , in their transitive fanin cones. In typical performance-driven clustering, replication of the logic under n k may become necessary to minimize the arrival time at n i and/or n j . An example of this scenario is shown in Figure 3 (a): when finding clustering solutions at nodes n 4 or n 5 , it may become necessary to replicate n 1 , n 2 and n 3 . Assume that there are two possible clustering solutions, CL 2 and CL 3 (CL 4 and CL 5 ), at n 4 (n 5 ).
There is also a single clustering solution CL 1 at node n 3 . The area of each cluster is 1 whereas the delay depends on the topology of the logic mapped to the cluster. We calculate the AD curve of n 4 as follows. For the clustering solution CL 3 , the AD value is (1,0.8) whereas for CL 2 , the area value is 1+1/2=1.5 and the delay is 1. The area cost calculation is done in this way because the cost of cluster CL 1 is divided equally between its two fanout nodes. This generates a new AD value of (1.5,1). Notice however that (1.5,1) is inferior to (1,0.8), and therefore, it will be dropped, resulting in the AD curve of {(1,0.8)} for n 4 .
Similarly, the AD curve of n 5 will be pruned to {(1,0.9)}. However, by dropping the two inferior points from the AD curves of n 4 and n 5 , we force a clustering solution in which three nodes (n 1 , n 2 and n 3 ) must be replicated, as shown in Figure 3 (b) and (c). The overall area cost of this clustering solution is 2 and the worst-case delay is 0.9. Suppose that the required time at nodes n 4 and n 5 is 2. In fact, there is then a better solution in which CL 2 is chosen at n 4 and CL 5 at n 5 (cf. Figure 3(d)). The area cost of this solution is 2, while its worst-case delay is 1.1. However, there is no logic duplication, which means that the utilization of one of the pASIC3 logic cells in the latter solution is much lower, thereby potentially allowing a future packing of extra logic into that pASIC3 cell. The reason that the area cost of the solution given in part (d) is 2 is that CL 5 can be treated as a multiple-output Boolean function providing both the signal that goes out of n 5 and the signal that goes out of n 3 and feeds into cluster CL 2 . Therefore, there is no need to replicate n 1 , n 2 and n 3 to separately generate the signal from n 3 into CL 2 , as would have been the case if the cluster were treated as a single-output Boolean function.
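The pruning step in this example can be reproduced with a short Python sketch. The function name `prune_inferior` and the tuple layout (cost, delay) are assumptions made for illustration; the numbers for n 4 are the ones given in the text, with the cost of CL 1 divided equally between its two fanout nodes.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (cost, delay); cost is area here, power for PD curves


def prune_inferior(points: List[Point]) -> List[Point]:
    """Keep only non-inferior points.

    A point is dropped when some other point is no worse in both cost
    and delay. Points are swept in order of increasing delay, keeping a
    point only if it is strictly cheaper than the last kept point.
    """
    kept: List[Point] = []
    for cost, delay in sorted(points, key=lambda p: (p[1], p[0])):
        if not kept or cost < kept[-1][0]:
            kept.append((cost, delay))
    return kept


# Candidate solutions at n4 from the example above.
cl1_area = 1.0
cl3_point = (1.0, 0.8)                   # CL3 covers the whole cone
cl2_point = (1.0 + cl1_area / 2, 1.0)    # CL2 plus half of CL1 (two fanouts)
n4_curve = prune_inferior([cl2_point, cl3_point])
```

With these values `n4_curve` contains only the CL 3 point, which is exactly how the replication-free combination of CL 2 and CL 5 is lost under equal cost division.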
We have identified the aforesaid problem as a key reason behind a significant increase in the logic replication cost of a mapping solution to pASIC3 arrays. Therefore, in the remainder of this paper, we focus on developing a heuristic solution to calculate the replication cost across multiple-fanout nodes of the circuit during the post-order traversal.
Performance-driven Clustering
In this section, we present a clustering procedure with the accurate calculation of logic replication cost during the forward traversal of the Boolean netlist. 
Cluster generation and power-delay curves
A technology-mapped network consists of primitive cells. In the cluster generation phase, we traverse the network in postorder from the primary inputs to the primary outputs. This ordering ensures that when a node is processed, all of its fanin nodes have already been processed. When constructing the PD curves for some node, n, we first invoke a matching algorithm described in [12] to enumerate all possible cluster matches at that node. For each cluster match, we then calculate its PD value as follows. The (dynamic programming) power value of the cluster is the summation of the (dynamic programming) power values of all its inputs plus the power cost of the cluster itself. Similarly, the (dynamic programming) delay value of the cluster match is the maximum of the (dynamic programming) delay values of its inputs plus the delay through the cluster itself. Figure 4 illustrates the PD curve generation at node n 5 with a cluster CL n . PD curves of leaf set nodes n 1 , n 2 , and n 4 have already been computed. The PD curve for CL n matched at node n 5 is created from the PD curves of the leaf nodes. In the conventional calculation method of [2] [4], the power dissipation at node n for cluster match CL n is calculated as:
where C fo (u) is the capacitance driven by node u, sw u is the transition probability of node u, and fanout(n i ) is the number of fanouts that node n i drives. The arrival time at node n 5 with CL n is simply the maximum of the arrival times from the leaf nodes plus the delay through the cluster.
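The verbal description of the conventional calculation (leaf power divided by fanout count, summed, plus the cluster's own power; maximum leaf arrival time plus the cluster delay) can be sketched as below. This is an illustration only: the function name `conventional_pd` is hypothetical, each leaf is assumed to contribute a single PD point rather than a full curve, and the cluster's own power is passed in as a plain number instead of being derived from the capacitance and transition probability.

```python
from typing import Dict, Tuple

PDPoint = Tuple[float, float]  # (power, delay)


def conventional_pd(
    leaf_points: Dict[str, PDPoint],
    fanout: Dict[str, int],
    cluster_power: float,
    cluster_delay: float,
) -> PDPoint:
    """Combine one PD point per leaf into a PD point for the cluster match.

    Power: each leaf's power is divided equally among its fanouts, then
    summed and added to the cluster's own power (an input assumption here).
    Delay: latest leaf arrival time plus the delay through the cluster.
    """
    power = cluster_power + sum(
        p / fanout[leaf] for leaf, (p, _) in leaf_points.items()
    )
    delay = cluster_delay + max(d for _, d in leaf_points.values())
    return (power, delay)


# Hypothetical values for the leaf set {n1, n2, n4} of a cluster at n5.
leaves = {"n1": (2.0, 1.0), "n2": (4.0, 1.5), "n4": (3.0, 0.5)}
fanouts = {"n1": 1, "n2": 2, "n4": 1}
n5_point = conventional_pd(leaves, fanouts, cluster_power=1.0, cluster_delay=0.5)
```

Here n 2 drives two fanouts, so only half of its power is charged to this cluster match; this equal split is the step that the replication-aware calculation replaces.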
Correct accounting of logic replication
Logic replication may be needed to meet a timing constraint at a node. It occurs when a selected cluster rooted at the node covers nodes that have already been covered by another cluster. Logic replication potentially occurs on the boundary of logic cones associated with primary outputs. We propose an algorithm that estimates the cost of logic duplication by simulating the clustering procedure for each PD point during the postorder traversal. The algorithm assumes that the delay of each PD point at a node is close enough to the required time at the node. Notice that, given the required time at a node, the best PD point has the largest delay that is equal to or less than the required time, and the smallest cost. Therefore, being selected as the best PD point means that the required time at that node is very close to the delay of the PD point, and we use the delay of the PD point as the required time at the node. Under this assumption, and noting that the maximum path delay from a fanin node of a cluster is the largest delay from that fanin node to the root node of the cluster, we can calculate the required times of the fanin nodes of a cluster by subtracting the maximum path delays from the required time at the root node. If the required time of a fanin node that has already been covered by clusters is equal to or larger than the arrival time of that fanin node, there is no logic duplication on the logic cone boundary. This leads to zero cost toward the transitive fanin across the boundary, whereas the typical approach divides the cost by the fanout count of the fanin nodes.

For the pASIC3 mapping problem, there is no "unknown load problem" [4], which often complicates the calculation of the dynamic programming power and delay values in ASIC design flows. This is because the load ahead of a node during the postorder traversal is always the load imposed by another pASIC3 logic cell. Notice that the input pin capacitances of all inputs to a pASIC3 logic cell are the same.
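The timing test described above can be sketched in a few lines of Python. The names `fanin_required_times` and `fanins_needing_duplication` and the dictionary-based inputs are assumptions for illustration; the rule itself (required time of a fanin equals the root's required time minus the maximum path delay, and an already clustered fanin is reused when its required time is equal to or larger than its arrival time) follows the text.

```python
from typing import Dict, List


def fanin_required_times(
    root_required: float, max_path_delay: Dict[str, float]
) -> Dict[str, float]:
    """Required time of each fanin: root required time minus the largest
    delay from that fanin to the cluster root."""
    return {f: root_required - d for f, d in max_path_delay.items()}


def fanins_needing_duplication(
    root_required: float,
    max_path_delay: Dict[str, float],
    clustered_arrival: Dict[str, float],
) -> List[str]:
    """Return already clustered fanins whose existing solution is too slow.

    `clustered_arrival` holds arrival times only for fanins that were
    covered while clustering an earlier logic cone.
    """
    required = fanin_required_times(root_required, max_path_delay)
    return [
        f
        for f, req in required.items()
        if f in clustered_arrival and req < clustered_arrival[f]
    ]


# Hypothetical PD point at node d with fanins b (already clustered) and e.
path_delays = {"b": 0.5, "e": 0.75}
prior_arrivals = {"b": 1.25}
relaxed = fanins_needing_duplication(2.0, path_delays, prior_arrivals)
tight = fanins_needing_duplication(1.5, path_delays, prior_arrivals)
```

With a PD-point delay of 2.0 the existing solution at b is fast enough and nothing is duplicated; with 1.5 the required time at b drops below its arrival time, so b is flagged for duplication.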
If logic duplication is mandatory to meet the timing constraint, we only add the cost caused by the duplicated logic. Notice that the duplication can proceed toward the primary inputs until no timing violation occurs. Assume that logic cone PO 0 has been covered by clusters, and node d has a PD point with nodes b and e as fanin nodes. Figure 5 (a) depicts the case in which no duplication is necessary. Since our approach does not account for the cost of unduplicated nodes, the cost toward the transitive fanin of node b is zero. Therefore, we simply add the cost at node e and the cost of node d to the total cost at node d. On the other hand, if duplication is required as shown in Figure 5(b), the cost caused by the duplicated nodes is added to the total cost. Figure 5 (c) illustrates this notion in detail. An un-clustered logic cone, Φ(PO i ), is defined as the set of un-clustered nodes in the transitive fanin cone of primary output PO i . For the moment, focus only on the solid closed curves and ignore the dashed ones. In Figure 5 (c), Φ(PO 0 ) is clustered first. In our proposed heuristic accounting of the logic replication cost during the postorder traversal of the circuit graph, when calculating the dynamic programming (DP) power cost of CL 5 at n 5 , we divide the DP cost of CL 1 at n 3 by its fanout count inside the logic cone (which is two) and add to this quantity the power cost of CL 5 . Note that when we calculate the DP cost of CL 2 at n 8 , we account for the cost of CL 1 exactly once (1/2 contribution coming from the n 3 →n 5 branch, the other coming from the n 3 →n 4 branch). Suppose that after the preorder traversal of logic cone Φ(PO 0 ), we select a clustering solution in which CL 1 is matched at n 3 while CL 2 is matched at n 8 . Next, we start clustering logic cone Φ(PO 1 ). Consider generating the PD curve at node n 6 (having first processed node n 11 , creating a cluster match of CL 6 at that node).
For cluster CL 4 matched at node n 6 , we need to compute the DP power cost of its fanin nodes n 3 and n 11 . At n 3 (n 11 ), we have the PD curve of all possible clustering solutions rooted there. However, we do not know what specific clustering solution for the cone rooted at n 3 will be used for each PD point at n 6 . This is the key difficulty in the estimation of logic replication cost. Consider two extreme cases: in one case, the CL 1 match at n 3 is used as the best solution for Φ(PO 1 ), resulting in no logic duplication; in the other case, all of the cone under n 3 is replicated since no common signals exist between the best matching solutions of this subcone under Φ(PO 0 ) and Φ(PO 1 ). The way we solve this problem is to calculate the PD curve of CL 4 matched at node n 6 by completely ignoring the fact that cone Φ(PO 0 ) has already been processed and a mapping solution has been obtained. Suppose a PD curve X={x 1 ,x 2 ,…,x m } at node n 6 is generated in this way, where x i =(p i ,d i ). Take any point, say x i , corresponding to a clustering solution with CL 4 matched at n 6 . We assume that d i is the required time at n 6 . We then calculate the required time at the output of n 3 as d i -delay(CL 3 ). If this required time is larger than the arrival time at n 3 coming from the synthesis solution for Φ(PO 0 ), then for the calculation of the DP power cost of cluster CL 4 at n 6 , the DP power cost of the subcone rooted at n 3 is set to zero. Otherwise (i.e., a timing violation will occur if the solution generated for Φ(PO 0 ) is used), we find the optimum clustering solution of the logic subcone rooted at n 3 and use the power cost of this solution toward the calculation of the power cost of cluster CL 4 at n 6 . The dashed enclosed curves show a case in which the subcone rooted at n 3 must be resynthesized in order to meet a timing requirement at PO 1 .
Notice that in case of logic duplication, the duplicated copy of n 3 needs to drive only node n 6 ; therefore, the PD curve at node n 3 must be updated to reflect this change in load. The arrival time of CL 4 becomes the maximum value among arrival times of different input paths. Arrival times through duplicated nodes can be calculated based on arrival times of clustered nodes.
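The cost rule for a cluster whose fanins cross into an already clustered cone can be sketched as follows. This is a simplified illustration, not the authors' procedure: the names `Fanin` and `replication_aware_power` are hypothetical, a single cluster delay is subtracted for every fanin, the reuse test uses "equal to or larger than", and the power of the resynthesized subcone is assumed to be supplied by the caller (the sketch does not recompute it for the reduced load of the duplicated copy).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Fanin:
    subcone_power: float            # power of the best solution rooted at this fanin
    prior_arrival: Optional[float]  # arrival time if clustered in an earlier cone


def replication_aware_power(
    cluster_power: float,
    cluster_delay: float,
    point_delay: float,
    fanins: Dict[str, Fanin],
) -> float:
    """DP power of a cluster match for one PD point.

    The PD-point delay is used as the required time at the root. A fanin
    that was clustered earlier and still meets its required time adds
    zero cost; every other fanin adds the power of its subcone.
    """
    required = point_delay - cluster_delay
    total = cluster_power
    for fanin in fanins.values():
        reusable = fanin.prior_arrival is not None and required >= fanin.prior_arrival
        if not reusable:
            total += fanin.subcone_power
    return total


# Hypothetical numbers for CL4 at n6: n3 was clustered with the first cone,
# n11 belongs only to the cone being processed.
cl4_fanins = {
    "n3": Fanin(subcone_power=3.0, prior_arrival=1.0),
    "n11": Fanin(subcone_power=2.0, prior_arrival=None),
}
power_reuse = replication_aware_power(1.0, 0.5, 2.0, cl4_fanins)
power_duplicate = replication_aware_power(1.0, 0.5, 1.25, cl4_fanins)
```

For the slower PD point the existing solution at n 3 is reused and contributes nothing, while the faster PD point forces the subcone under n 3 to be duplicated and its power to be charged to the cluster at n 6 .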
Accounting for logic replication, the total power dissipation at node n with cluster CL n can be extended from equation (1) and can be given by:
,
where Φ n is the logic cone to which node n belongs, C fo (u) is the capacitance driven by node u, sw u is the transition probability of node u, and fanout(n, Φ n ) is the number of fanouts of node n inside Φ n . Figure 6 gives the pseudocode that accounts for the effect of logic replication. When a node has to select a cluster, the function predict_logic_replication is executed. It first checks whether replication is necessary by checking whether any node in the leaf set has been clustered. In order to compute the required times for nodes in the leaf set, the capacitance of a node is computed as if the node had been clustered. The required times for nodes in the leaf set are computed and passed on to the next level of logic replication prediction in the recursive function predict_cluster_selection.
Cluster selection
After the PD curves for all nodes in the transitive fanin cone of a primary output are computed during the postorder traversal of the circuit, a suitable point on the PD curve of the root node is chosen, given the required time at the root of the logic cone. The cluster for the point at the root is identified and the required times for its inputs are computed. The preorder traversal resumes at its child nodes to satisfy the new required time while minimizing the power dissipation. Our approach is similar to PDMAP presented in [2] .
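Choosing a point on a PD curve for a given required time can be sketched as below. The function name `select_point` is hypothetical; the rule it applies (among points whose delay does not exceed the required time, take the one with the lowest power) follows the description of the best PD point given earlier, and on a non-inferior curve this is also the feasible point with the largest delay.

```python
from typing import List, Optional, Tuple

PDPoint = Tuple[float, float]  # (power, delay)


def select_point(curve: List[PDPoint], required_time: float) -> Optional[PDPoint]:
    """Return the lowest-power point meeting the required time, or None
    if no point on the curve is fast enough."""
    feasible = [p for p in curve if p[1] <= required_time]
    if not feasible:
        return None
    # Ties in power are broken in favor of the larger delay.
    return min(feasible, key=lambda p: (p[0], -p[1]))


# Hypothetical non-inferior PD curve at the root of a logic cone.
root_curve = [(5.0, 1.0), (3.0, 2.0), (2.0, 3.0)]
chosen = select_point(root_curve, required_time=2.5)
infeasible = select_point(root_curve, required_time=0.5)
```

The same function would be applied at each leaf of the chosen cluster during the preorder traversal, using the required times derived for those leaves.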
Implementation and Experimental Results
We have implemented the clustering algorithm on top of SIS [11] and used a 90 nm CMOS process model to estimate the delay and power dissipation of primitive cells. Large combinational circuits were selected from the MCNC91 benchmark suite. We first ran low-power technology mapping using PDMAP [2] and then applied our low-power, minimal logic replication clustering algorithm to the network.
In FPGAs, inter-cluster interconnect capacitance is not negligible. Thus, we use constant values to represent those capacitances in pASIC3 family FPGAs. However, in this research we assumed that the capacitance of intra-cluster interconnect is negligible. The key limitation of the present work is the assumption of a fixed capacitance for inter-cluster interconnections. Inter-cluster interconnect is a major component in FPGAs, and accurately estimating its capacitance in the early stages of the design is crucial to increasing the efficacy of this clustering algorithm. Table I shows the experimental results. Our approach reduced the total number of nodes by 18% on average, resulting in savings in area and power dissipation without any sacrifice of speed. The run time increases due to the repeated invocation of the logic replication predictor for the same node with the same required time.
Conclusion
In this paper, a minimal-area clustering algorithm for low power was proposed. The proposed algorithm builds PD curves for nodes in a network by predicting the amount of logic replication based on the timing constraint. The prediction yields an accurate power dissipation value for each cost point on the curves.
Experimental results indicate that the proposed algorithm generates much less duplicated logic, with less delay and power dissipation, than the traditional cost distribution method. The algorithm achieved an 18% reduction in the total number of nodes, reducing dynamic and leakage power dissipation by 16% and 20%, respectively, without any sacrifice of delay.
