This paper presents a delay optimal FPGA clustering algorithm targeting low power. We assume that the configurable logic blocks of the FPGA can be programmed using either a high supply voltage (high-Vdd) or a low supply voltage (low-Vdd). We carry out the clustering procedure with the guarantee that the delay of the circuit under the general delay model is optimal, while logic blocks on the non-critical paths can be driven by low-Vdd to save power. We explore a set of dual-Vdd combinations to find the best ratio between low-Vdd and high-Vdd to achieve the largest power reduction. Experimental results show that our clustering algorithm achieves an average power saving of 20.3% compared to the clustering result for an FPGA with a single high-Vdd. To our knowledge, this is the first work on dual-Vdd clustering for FPGA architectures.
INTRODUCTION
Reducing power consumption for FPGAs has attracted much attention recently [1, 4, 5, 8, 9, 11, 12, 17]. Meanwhile, performance remains the most important factor for FPGA designs. Since most FPGAs are hierarchical in nature, circuit clustering has become an integral part of the FPGA synthesis flow. It has been shown that cluster-based logic blocks can improve the FPGA performance, area and power [11, 13, 17].
An early work on performance-driven circuit clustering was presented by Lawler et al. in [10]. Given a constraint M on the size of the clusters, Lawler's algorithm produces a delay optimal partitioning of the circuit under the assumption that internal delays within a cluster are zero, and the external delay from one cluster to another is one (unit delay model). Later, Murgai et al. proposed the general delay model [14]. In this model, each gate of the network has a delay; no delay is encountered on an interconnection linking two gates internal to a cluster; and an edge delay is encountered on every interconnection between two different clusters. This model is very powerful and can capture many timing constraints by simple extensions. Rajaraman and Wong derived the first delay optimal clustering algorithm under the general delay model [15]. In [19], Vaishnav and Pedram presented a low-power single-Vdd clustering algorithm with the optimal delay under the general delay model. Their algorithm is power optimal for trees. They enumerated all clustering solutions for a graph and selected a low-power clustering solution from all delay optimal clustering solutions. Both [15] and [19] allowed node duplications, i.e., a gate may be assigned to more than one cluster.

There are a few prior research efforts on clustering for FPGA architectures [2, 3, 13, 17]. The optimization goals were area-delay tradeoff [2], routing track reduction [3], performance improvement [13], and area and power reduction [17]. None of these works guarantees the optimal clustering delay under the general delay model. In [6], a performance-driven multi-level (two-level hierarchy) FPGA clustering algorithm was presented.
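The general delay model described above can be made concrete with a small sketch. It assumes a clustering without node duplication, a single edge delay shared by all inter-cluster connections, and hypothetical names (`fanins`, `gate_delay`, `cluster_of`, `edge_delay`) that do not come from the cited works.

```python
def circuit_delay(fanins, gate_delay, cluster_of, edge_delay):
    """Longest-path delay of a clustered DAG under the general delay model.

    fanins:     dict node -> list of direct fanin nodes (every node is a key)
    gate_delay: dict node -> delay of that gate
    cluster_of: dict node -> cluster id (assumption: no node duplication)
    edge_delay: delay charged on every connection between two clusters
    """
    arrival = {}

    def arrive(v):
        if v not in arrival:
            worst = 0.0
            for u in fanins[v]:
                # Connections inside a cluster are free; crossing costs edge_delay.
                hop = 0.0 if cluster_of[u] == cluster_of[v] else edge_delay
                worst = max(worst, arrive(u) + hop)
            arrival[v] = worst + gate_delay[v]
        return arrival[v]

    return max(arrive(v) for v in fanins)


# Toy network: a and b feed c, and c feeds d.
fanins = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
gate_delay = {v: 1.0 for v in fanins}
packed = circuit_delay(fanins, gate_delay, {"a": 0, "b": 0, "c": 0, "d": 1}, 2.0)
spread = circuit_delay(fanins, gate_delay, {"a": 0, "b": 1, "c": 2, "d": 3}, 2.0)
```

Packing a, b, and c together removes one inter-cluster hop from the longest path, which is why `packed` is smaller than `spread`.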
One of the popular design techniques for power reduction is to lower supply voltage, which results in a quadratic reduction of power dissipation. However, the major drawback is the negative impact on chip performance. A multiple supply voltage design in which a reduction in supply voltage is applied only to non-critical paths can save power without sacrificing performance. Clustered voltage scaling (CVS) was first introduced in [18], where clusters of high-Vdd cells and low-Vdd cells were formed, and the overall performance was maintained. The works in [7, 20] combined CVS with other techniques such as gate sizing and variable supply voltage. The work in [16] assigned variable voltages to functional units at the behavioral synthesis stage. The work in [12] assigned voltage values to logic blocks in an FPGA chip made of predefined dual-Vdd/dual-Vt fabric.
In our work we develop a low-power FPGA clustering algorithm, named DVpack, with consideration of two supply voltages. We guarantee an optimal circuit delay under the general delay model. We impose the constraint that the nodes being packed in a single cluster have to be driven by the same Vdd. We extend the idea of [19] to build delay-power-vdd points to form a solution curve for each node in the network. After the optimal circuit delay is determined, the non-critical paths will be relaxed in order to accommodate low-Vdd clusters to reduce power. Our algorithm is delay and power optimal for trees, and delay optimal for directed acyclic graphs (DAGs). We also show that the complexity of solution curve generation is polynomial in terms of the network depth without the need to reduce data precision.
In Section 2, we provide some definitions and formulate the dual-Vdd FPGA clustering problem. Section 3 introduces our FPGA architecture and power model. Section 4 gives a detailed description of our algorithm. Section 5 presents experimental results, and Section 6 concludes this paper.
DEFINITIONS AND PROBLEM FORMULATION
A Boolean network can be represented by a DAG where each node represents a logic gate, and a directed edge (i, j) exists if the output of gate i is an input of gate j. We use input(v) to denote the set of nodes that are direct fanins of gate v. We use F_v to represent the subgraph of the network that contains v and all of the transitive fanins of v, including primary inputs (PIs). A cluster rooted on a node set R, denoted as C_R, is a subgraph such that any path connecting two arbitrary nodes in C_R lies entirely in C_R. The roots in R are also the outputs of C_R. node(C_R) represents the set of nodes contained in C_R. input(C_R) denotes the set of distinct nodes outside of C_R that supply inputs to the nodes in node(C_R).

The dual-Vdd clustering problem for min-power FPGA (DV-CMF problem) is to cover a given i-bounded Boolean network with K-M-feasible clusters, or equivalently, K-M-clusters. This is done in such a way that the total power consumption is minimized under a dual-supply voltage FPGA architecture model, while the optimal clustering delay is maintained. We assume that the input networks are all 4-bounded, i.e., each node n represents an LUT, where |input(n)| ≤ 4. We set K to 10 and M to 4 in this study (these parameters can change). Therefore, our final clustering solution is a DAG in which each node is a 10-4-cluster, and the edge (C_U, C_V) exists if some node u ∈ U is in input(C_V). The voltages are denoted as V_L for low-Vdd and V_H for high-Vdd.
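A minimal sketch of these definitions follows. It assumes that a K-M-feasible cluster is one with at most K distinct inputs and at most M nodes, a reading inferred from the K-input, M-BLE logic block of the architecture rather than stated in the definitions; the function names are hypothetical.

```python
def cluster_inputs(fanins, nodes):
    """input(C): distinct nodes outside the cluster that feed a node inside it."""
    nodes = set(nodes)
    return {u for v in nodes for u in fanins.get(v, ()) if u not in nodes}


def is_km_feasible(fanins, nodes, K=10, M=4):
    """Assumed feasibility test: at most M nodes and at most K cluster inputs."""
    return len(set(nodes)) <= M and len(cluster_inputs(fanins, nodes)) <= K


# Two LUTs m and r; a, b, c are primary inputs.
fanins = {"m": ["a", "b"], "r": ["m", "c"]}
inputs_mr = cluster_inputs(fanins, {"m", "r"})
```

Here the cluster {m, r} has the three inputs a, b, and c, because the connection from m to r is internal to the cluster.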
ARCHITECTURE AND POWER MODEL
Level Converter and Logic Element
A level converter is required when a V_L device output is to be connected to a V_H device input. Otherwise, excessive leakage power will occur in the V_H device due to large short-circuit current. We use the same level converter presented in [5], where delay and power data of the level converter and the 4-input LUT (4-LUT) with various Vdd settings were obtained through SPICE simulation in a 0.1 µm technology. This work uses these data.

Figure 1 shows a K-input configurable logic block (CLB) containing M basic logic elements (BLEs, each one a 4-LUT). The output of a BLE can be programmed to go through a level converter or bypass it. This gives us the capability to insert a level converter when a V_L BLE drives a V_H BLE in another cluster. We assume that there are pre-fabricated tracks in the routing channels with either V_H or V_L settings. When a V_L BLE is driving the routing interconnects (wires and buffers), we assume that it can use a set of V_L routing tracks. This model is similar to that used in [5]. Our assumption represents the ideal case, which provides an upper bound on the power reduction achievable by clustering for dual-Vdd FPGAs under a timing constraint.
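The level-converter rule can be sketched as a check on inter-cluster connections. The voltage values, cluster names, and function name below are illustrative assumptions only; the sketch simply marks every driving cluster whose output must be routed through its level converter because it feeds a cluster at a higher Vdd.

```python
def converters_needed(cluster_vdd, edges):
    """Return the driver clusters that must enable their output level converter.

    cluster_vdd: dict cluster -> supply voltage (hypothetical values)
    edges:       iterable of (driver_cluster, sink_cluster) pairs
    """
    # A converter is needed only when a low-Vdd output drives a high-Vdd input.
    return {src for src, dst in edges if cluster_vdd[src] < cluster_vdd[dst]}


V_H, V_L = 1.3, 0.8  # illustrative numbers, not taken from the paper
cluster_vdd = {"A": V_H, "B": V_L, "C": V_H}
edges = [("A", "B"), ("B", "C"), ("A", "C")]
needed = converters_needed(cluster_vdd, edges)
```

Only cluster B needs its converter here: a V_H output driving a V_L input, or two clusters at the same voltage, can bypass it.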
Power Model
For each K-M-feasible cluster, the total power of the cluster is:
where S_j is the switching activity of node j in the cluster; x is the number of nodes in the cluster; P_LUT contains both dynamic and static power [5]; P_LUT_static is the static power of an LUT, which is counted when the LUT is not switching; P_inputs is the power consumed on the cluster inputs, which is defined as follows:
where S_i is the switching activity on input i of the cluster, and C_in is the input capacitance of an LUT (including MUXes and local buffers in front of the LUT). P_wire is calculated as follows:
where C_local_wire is the capacitance of the local interconnects inside the cluster driven by node j, and S_o is the switching activity of the cluster output (see footnote 1). All the S values are calculated beforehand. C_net is the estimated output capacitance of wires and buffers contained in the net driven by the output, and P_buf_static is the static power of the buffers contained in the net. C_net varies from gate to gate. We use the wire-load model to obtain a reasonable wire-capacitance estimate before placement and routing. Experimental results show that the estimated power has an excellent correlation with the reported power after placement and routing. Details are omitted due to the page limit.
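One way the terms defined above might combine is sketched below. The exact expressions are assumptions, not the paper's equations: capacitive terms use the textbook dynamic-power form 0.5·S·C·Vdd²·f, LUT power is mixed by activity between P_LUT and P_LUT_static, and switching activities are assumed to lie in [0, 1]. All function and parameter names are hypothetical.

```python
def dynamic_power(activity, cap, vdd, freq):
    """Textbook dynamic power of a switched capacitance (assumed form)."""
    return 0.5 * activity * cap * vdd ** 2 * freq


def cluster_power(node_activity, p_lut, p_lut_static, input_activity, c_in,
                  c_local_wire, s_out, c_net, p_buf_static, vdd, freq):
    """Assumed total power of one single-output cluster."""
    # LUTs: switching LUTs pay P_LUT, idle LUTs pay only static power.
    p_luts = sum(s * p_lut + (1.0 - s) * p_lut_static for s in node_activity)
    # Cluster inputs: each input charges the LUT input capacitance C_in.
    p_inputs = sum(dynamic_power(s, c_in, vdd, freq) for s in input_activity)
    # Wires: local wires driven by each node, plus the output net and its buffers.
    p_local = sum(dynamic_power(s, c, vdd, freq)
                  for s, c in zip(node_activity, c_local_wire))
    p_wire = p_local + dynamic_power(s_out, c_net, vdd, freq) + p_buf_static
    return p_luts + p_inputs + p_wire


# Unit-less toy numbers chosen only to make the arithmetic easy to follow.
total = cluster_power(node_activity=[0.5], p_lut=2.0, p_lut_static=1.0,
                      input_activity=[1.0], c_in=2.0, c_local_wire=[4.0],
                      s_out=0.5, c_net=4.0, p_buf_static=0.5, vdd=1.0, freq=1.0)
```

Because every capacitive term scales with Vdd², evaluating the same cluster at V_L instead of V_H shrinks those terms quadratically, which is the saving the clustering algorithm tries to exploit.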
ALGORITHM DESCRIPTION
Cluster Enumeration
We carry out a cluster enumeration procedure to get all the single-root clusters in the network. Figure 2 shows an example (ignore the dashed lines for now). Consider K = 14 and M = 6, and suppose we are to generate clusters rooted on node t. Following a topological order, the clusters rooted on the predecessors of t, such as r and s, have already been generated. All the clusters on t can be generated from the clusters on r and s, following a dynamic programming approach. For example, we can first retrieve all the clusters of size 2 on r: {m,r}, {n,r}, and {o,r}, and then combine one such cluster with each of the clusters of size 3 on s (adding root node t will make a cluster of size 6 on t). If some node, such as n, appears from both sides, clusters of size 4 on s will be returned and tested for feasibility. This simple technique effectively handles the reconvergent paths in the network. Next, we can try to combine clusters of size 3 on r with clusters of size 2 on s, and so on. Thus, all the feasible clusters can be generated. The number of clusters in the worst case is O(6^(3M+1)) [19] for clustering 4-LUTs. The actual number is small when M is small. When M is large, heuristics such as cluster pruning can be applied, which fit well into our dynamic programming paradigm.

1 We only examine the case where each cluster has just one output here.
2 Interested readers are referred to [5] for a thorough analysis based on a similar wire-load model.

These solution points will form a curve with discrete points if drawn along delay and power axes. After the solution curves are formed for r and s, we can generate solution points for cluster C_t. Figure 3 shows the concept. In general, when we generate the delay-power-vdd curve for a cluster C rooted on v, we first retrieve the solution curves of the nodes in input(C). We then calculate all the valid delay values that can be propagated from input(C) to v through C.
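The dynamic-programming enumeration of single-root clusters can be sketched as follows. This is a simplified exhaustive version under stated assumptions: feasibility means at most M nodes and at most K inputs, sub-clusters are combined as plain set unions (so a node reached from both sides, like n in the example, is counted once), and a final check enforces that no path between two cluster nodes leaves the cluster. The names are hypothetical and no pruning heuristic is applied.

```python
from itertools import product


def enumerate_clusters(fanins, order, K, M):
    """Return dict root -> set of frozensets, each a feasible single-root cluster.

    fanins: dict gate -> list of direct fanins (primary inputs are not keys)
    order:  gates in topological order, fanins before fanouts
    """
    tfi = {}     # transitive fanins of each gate
    sized = {}   # all node sets of size <= M grown at each root
    result = {}

    def inputs(nodes):
        return {u for v in nodes for u in fanins.get(v, ()) if u not in nodes}

    def path_closed(nodes):
        # A cluster input must not depend on any node inside the cluster.
        return all(tfi.get(u, frozenset()).isdisjoint(nodes) for u in inputs(nodes))

    for v in order:
        tfi[v] = frozenset().union(
            *({u} | tfi.get(u, frozenset()) for u in fanins.get(v, ())))
        # Per fanin: leave it outside (empty set) or absorb one of its node sets.
        choices = [[frozenset()] + list(sized.get(u, ())) for u in fanins.get(v, ())]
        sized[v] = set()
        for combo in product(*choices):
            nodes = frozenset({v}).union(*combo)
            if len(nodes) <= M:
                sized[v].add(nodes)
        result[v] = {c for c in sized[v]
                     if len(inputs(c)) <= K and path_closed(c)}
    return result


# Reconvergent toy network: n feeds both r and s, which both feed t.
fanins = {"m": ["a"], "n": ["a", "b"], "r": ["m", "n"], "s": ["n"], "t": ["r", "s"]}
clusters = enumerate_clusters(fanins, ["m", "n", "r", "s", "t"], K=10, M=3)
```

With M = 3, the set {t, s, n} is rejected because the path n, r, t leaves the cluster, while {t, r, s} is accepted with m and n as its inputs.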
Delay-Power-Vdd Curve Generation
Two cases are considered for C, using either V_H or V_L. For each solution point sp_j in the curve of node i, where i ∈ input(C): C (a multiple of d_L). There is no need for D_conv here.
After we go through all the sp_j for each node in input(C), all the valid delay values that could appear for C are collected. However, we need to validate the range of such values, so we calculate the earliest and the latest time a signal may arrive at v for C using V_H. The number of valid delay values in this range determines how many delay-power-vdd points are going to be generated for C. We define the arrival time set as in [19]. It is a set with the cardinality of |input(C)|, with each element corresponding to an arrival time at one input i. After all the points are generated for C, inferior solutions are pruned away. This process repeats for each feasible cluster rooted on v, and then all the solution points are merged together and pruned again.
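The pruning of inferior solutions can be sketched as a Pareto filter over (delay, power, vdd) points. The dominance rule used here is an assumption: a point is treated as inferior when another point has no larger delay and no larger power, regardless of its Vdd. The function name is hypothetical.

```python
def prune_inferior(points):
    """Keep only non-inferior (delay, power, vdd) points, sorted by delay."""
    kept = []
    best_power = float("inf")
    # Visit points by increasing delay; a point survives only if it lowers power.
    for delay, power, vdd in sorted(points, key=lambda p: (p[0], p[1])):
        if power < best_power:
            kept.append((delay, power, vdd))
            best_power = power
    return kept


merged = [(5, 10.0, "H"), (6, 7.0, "L"), (6, 12.0, "H"), (7, 7.5, "L"), (8, 4.0, "L")]
curve = prune_inferior(merged)
```

The surviving points form a staircase in which power strictly decreases as delay increases, which matches the picture of a solution curve drawn along delay and power axes.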
Final Clustering Solution
After cluster enumeration with solution curve generation, the optimal clustering delay can be obtained through the propagated minimum delay values among all the non-inferior solutions. This optimal clustering delay is set as the required time for the design. The critical path is always driven by V_H, and clusters on non-critical paths can be driven by V_L to reduce power when such a delay relaxation will not violate the required time of the design. Clusters rooted on the POs are generated first, and then the inputs of the generated clusters are iteratively processed.
There is one complication because of the involvement of level converters. When we try to select a cluster C on v, C can either be a V_L cluster or a V_H cluster depending on which setting provides larger power savings.
The minimum R(i) and R_lv(i) propagated back among all the fanouts (v is one of them) of i are the final R and R_lv for i. Next, to pick a cluster on i, we go through each delay-power-vdd point of every cluster rooted on i to find the best power solution:
given that the corresponding delay of P_min fulfills the following:
The cluster with P_min is picked for node i, and it uses the voltage V_Pmin. However, if D_Pmin violates the specified required time, the next best P_min is examined until a feasible one is found. This procedure continues until all the PIs are reached.
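The selection step can be sketched as a search for the lowest-power point that still meets the required time. This sketch uses a single required time per node, which is a simplification: the text distinguishes R(i) from R_lv(i), and that distinction is not modeled here. The data layout and function name are hypothetical.

```python
def pick_cluster(curves, required_time):
    """Pick the lowest-power feasible point among all clusters rooted on a node.

    curves: dict cluster_id -> list of (delay, power, vdd) points
    Returns (cluster_id, vdd, delay, power), or None if nothing is feasible.
    """
    candidates = sorted(
        ((power, delay, vdd, cid)
         for cid, points in curves.items()
         for delay, power, vdd in points),
        key=lambda c: c[0])
    # Walk from the best power upward until the delay meets the required time.
    for power, delay, vdd, cid in candidates:
        if delay <= required_time:
            return cid, vdd, delay, power
    return None


curves = {"C1": [(5, 10.0, "H"), (8, 4.0, "L")], "C2": [(6, 7.0, "L")]}
tight = pick_cluster(curves, required_time=7)
loose = pick_cluster(curves, required_time=9)
```

With a required time of 7 the cheapest point (delay 8) is skipped and the next best one is taken; relaxing the required time to 9 lets the V_L point of C1 win.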
Theorem 2:
The presented clustering algorithm generates delay and power optimal solutions for Boolean networks that are trees, and delay optimal low-power solutions for networks that are DAGs, targeting FPGAs with a dual-Vdd architecture.
EXPERIMENTAL RESULTS
We will show the comparison results between the dual-Vdd clustering algorithm and the single-Vdd clustering algorithm to examine how much power savings a dual-Vdd FPGA architecture can achieve through effective circuit clustering. We implement a single-Vdd clustering algorithm, SVpack. SVpack follows the delay and power propagation procedure shown in 
