Abstract
INTRODUCTION
Power reduction is one of the most important optimization targets for modern VLSI design. Power can be optimized across different design stages, from the system-level all the way down to the gate and transistor level. The higher the design level is, the more critical the design decisions are for the quality of the final product. It is reported that system and behavior level power optimization techniques can achieve more than 40% of power reduction [20] . In this study, we focus on behavioral synthesis for optimizing the dynamic power, which is still the dominating power source for many modern VLSI circuits.
The basic problem of behavioral synthesis or high-level synthesis is the mapping of a behavioral description of a circuit into a cycle-accurate RTL design consisting of a datapath and a control unit.
A datapath is composed of three types of components: functional units (e.g., ALUs, multipliers, and shifters), storage units (e.g., registers and memory), and interconnection units (e.g., buses and multiplexers). The control unit is specified as a finite state machine which controls the set of operations for the datapath to perform during every control step (clock cycle). The behavioral synthesis process mainly consists of three tasks: scheduling, allocation, and binding (or module assignment). Scheduling determines when a computational operation will be executed; allocation determines how many instances of resources (functional units, registers, or interconnection units) are needed; binding binds operations, variables, or data transfers to these resources. Binding can also include module selection, which determines the types of the resources to be used (e.g., a carry lookahead adder or a carry-save adder). Behavioral synthesis is a well studied problem [1] [7] [8] [11] [22] .
In general, it has been shown that the code density and simulation time can be improved by 10X and 100X, respectively, when moved to the behavior-level synthesis from RTL synthesis [26] . Such an improvement in efficiency is much needed for design in the deep submicron era.
There are two sources of power consumption: dynamic power and static power. Dynamic power is consumed when signal transitions take place at gate outputs. Static power (also called leakage power) is consumed when the circuit is either active or idle. Dynamic power consumption is calculated as $P_d = 0.5 \times S \times C \times V_{dd}^2 \times f$, where S denotes the switching activity of the circuit, C denotes the effective capacitance, $V_{dd}$ is the supply voltage, and f is the operating frequency. To lower dynamic power, each of these factors can be reduced. For example, using a smaller amount of resources would effectively reduce the C value in the formula (e.g., behavioral synthesis using variable precision arithmetic units [10] ). Some work has utilized variable supply voltages to reduce power (e.g., [5] [14] [23] ), where the basic idea is that operations on the non-critical paths can be driven by a lower supply voltage to save power without suffering an overall performance penalty. In this study, we will focus on reducing the switching activity (SA) for power minimization.
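As a small numerical illustration of the dynamic power formula above, the following sketch evaluates it for two switching activity values. The function name, parameter names, and the numeric values are assumptions chosen only for illustration; they do not come from any measured design.

```python
import math

def dynamic_power(switching_activity: float, capacitance_f: float,
                  vdd_v: float, freq_hz: float) -> float:
    """Evaluate P_d = 0.5 * S * C * Vdd^2 * f (result in watts)."""
    return 0.5 * switching_activity * capacitance_f * vdd_v ** 2 * freq_hz

# Hypothetical operating point: 1 nF effective capacitance, 1.0 V, 100 MHz.
p_high = dynamic_power(0.50, 1e-9, 1.0, 1e8)
p_low = dynamic_power(0.25, 1e-9, 1.0, 1e8)

# With C, Vdd, and f fixed, dynamic power scales linearly with S.
print(math.isclose(p_low, 0.5 * p_high))
```

Because the other three factors are fixed once the resources, supply voltage, and clock are chosen, halving S halves the dynamic power in this model, which is why the binding step targets SA.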
There are many research results that have addressed the problem of minimizing SA through behavioral synthesis. We introduce some representative work next. To address the interactions among the different tasks in behavioral synthesis, there are works that carried out scheduling, allocation, and binding simultaneously for power minimization [3] [6] [21] . Most of these algorithms used iterative approaches, such as the simulated annealing algorithm or the variable depth search algorithm. The advantage of such algorithms is that they can search for a globally optimal solution. The potential drawback is that there is no guarantee that such an optimal solution can be found. In addition, to reduce the runtime complexity of such algorithms, each of the tasks can only be designed using unsophisticated approaches. Other works focused on one or two tasks for optimization so that they could achieve larger gains for these individual tasks. Reference [19] performed scheduling and binding to increase the opportunity for two operations to be bound into the same functional unit (FU) consecutively if they share common operands. It also performed register binding to minimize the input switching of FUs. However, their algorithms were all heuristic in nature and had no guarantee of optimality. The work in [15] presents effective metrics to evaluate the power dissipation of scheduled data-flow graphs (DFGs). It showed that metric evaluation is much faster than performing optimal binding and iterative power improvement, thus enabling fast design space exploration.
Reference [2] worked on register binding with a scheduled DFG. It is the first algorithm that presented an optimal register binding solution for SA reduction working with a DFG. The authors formulated the problem as a minimum cost clique covering of the compatibility graph, and solved it using a max-cost flow algorithm. However, this algorithm only considered the intra-transition SA, and did not consider the inter-transition SA (more details in Section 2). The work in [5] went one step further than [2] and addressed the binding task under multiple supply voltages for DFGs. It developed an optimal algorithm for assigning low $V_{dd}$ to as many operations as possible, under the resource and time constraints, while at the same time minimizing the total SA. This work did not consider the inter-transition SA either. Reference [17] addressed the importance of inter-transition SA and derived a two-step approach to solve the problem. It first used a network flow algorithm to get an initial binding solution, and then it used a post-processing greedy iterative algorithm to make its solution legal.
Recent works have addressed other issues during behavioral synthesis. For example, in [13] , multi-cycle interconnect communication is considered during behavioral synthesis to reduce system latency. Reference [26] proposed a module selection algorithm that combined design-time optimization with post-silicon tuning to maximize performance and power yield with consideration of process variation. The work in [18] considered thermal optimization during resource allocation and binding to reduce the hot spots and cooling cost of the design. In [12] , resource binding for effective soft error tolerance in FPGAs was studied for higher chip reliability. Reference [25] studied behavioral synthesis for digital microfluidic biochips. It targeted next-generation system-on-chip (SOC) designs that are expected to include microfluidic components.
In this study, we present an algorithm that shows that resource binding considering inter-transition SAs for DFGs can be solved optimally. We achieve such a goal by transforming the problem into the shortest path problem in a k-dimensional graph extended from the algorithm in [16] . We derive the complexity of the algorithm and show that it can be solved in polynomial time. The order (or degree) of the polynomial, k, is the number of allocated resources. In general, k is a constant, which is specified by designers to fulfill area/power constraints for the design. The runtime of the algorithm is reasonable when k is small. When k is large, the algorithm's complexity becomes high, especially for large benchmarks. To deal with large k values, we also propose a heuristic that uses a network-flow algorithm, followed by a legalization step as done in [17] . However, our enhanced legalization step differs from [17] in that it uses a bipartite matching algorithm. Experimental results show that this legalization heuristic produces better results than the heuristic in [17] and can obtain the optimal solution for several benchmarks.
The remainder of this paper is organized as follows. Section 2 provides some key definitions and the problem formulation. Section 3 presents both the new heuristic and the optimal algorithm. Section 4 shows experimental results, and Section 5 concludes this paper.
DEFINITIONS AND PROBLEM FORMULATION
In a DFG G = (V, A), set V corresponds to operations and set A corresponds to data transfers between operations. An edge a = (x, y) ∈ A, with x, y ∈ V, indicates that there is a data dependency between operations x and y. Scheduling assigns operations to control steps so that the overall execution latency meets a certain time constraint, and the number of resources used also meets a certain resource constraint. After scheduling, the lifetime of each operation in the DFG is the time during which the operation is active. Now we introduce the concepts of intra-transition SA and inter-transition SA, which were first defined in [17] . In Figure 1 (b), the number of adders allocated is three, which is equal to the maximum number of operations scheduled in a single control step (cstep3 in the example). Suppose the binding solution is path1 = {op1, op3, op5} ⇒ adder1; path2 = {op2, op6, op8} ⇒ adder2; and path3 = {op4, op7} ⇒ adder3. If input vector $PI_1$ arrives at the primary inputs, after four cycles, all the corresponding outputs for the design are computed. Subsequently, input vector $PI_2$ can arrive and go through the same propagation. Take adder1 for example. There is switching on the adder's ports when execution switches from op1 to op3 and then to op5 as $PI_1$ propagates through the design. The SA incurred within this iteration is called the intra-transition SA. When $PI_2$ arrives, the execution switches from op5 back to op1 to process the new vector. The SA incurred across such iteration boundaries is called the inter-transition SA. Figure 1 (b) can model the intra-transition SA by assigning a weight on an edge which represents the SA between the two operations. However, it cannot model the inter-transition SA.
To model the inter-transition SA, the compatibility graph can be manipulated through rotation and duplication introduced in [17] . For example, now, a path {5, 1, 5'} would represent that op1 and op5 are bound into one FU, and such a decision is reached by considering both the inter-transition SA of edge (5, 1) and the intra-transition SA of edge (1, 5'). However, to have a legal solution, every binding path needs to start from one of the nodes v, from the first column of the transformed compatibility graph, and end by the correspondingly duplicated node v' on the last column. We call this constraint the matching node constraint.
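The two kinds of SA and the matching node constraint can be made concrete with a short sketch. The helper names `binding_cost` and `is_legal`, the dictionary layout, and all SA values below are hypothetical; only the operation names and the example paths are taken from the text.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

def binding_cost(ops: List[str], s_intra: Dict[Pair, float],
                 s_inter: Dict[Pair, float]) -> float:
    """Total SA of one FU that executes `ops` in schedule order, iteration after iteration."""
    # Intra-transition SA: consecutive operations within one iteration.
    intra = sum(s_intra[(a, b)] for a, b in zip(ops, ops[1:]))
    # Inter-transition SA: last operation of one iteration back to the first of the next.
    inter = s_inter[(ops[-1], ops[0])]
    return intra + inter

def is_legal(path: List[str]) -> bool:
    """Matching node constraint: a path in the transformed graph must end at v' if it starts at v."""
    return path[-1] == path[0] + "'"

# adder1 executes op1, op3, op5; the SA values are made up for illustration.
s_intra = {("op1", "op3"): 0.25, ("op3", "op5"): 0.5}
s_inter = {("op5", "op1"): 0.125}
adder1_cost = binding_cost(["op1", "op3", "op5"], s_intra, s_inter)

legal = is_legal(["5", "1", "5'"])               # starts at 5, ends at 5'
illegal = is_legal(["6", "8", "2", "4", "7'"])   # starts at 6, ends at 7'
```

Ignoring the `s_inter` term would underestimate the cost of every FU by exactly one edge per iteration, which is the gap that the rotated and duplicated compatibility graph is meant to close.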
Our problem can be simply formulated as follows:
Problem: Low-power binding with inter-transition SA. We name this problem the IT-SA-Bind problem.
Given: (1) A scheduled DFG G(V, A); (2) A set of available functional units R; (3) SA on the intra- and inter-transition edges.
Objective: Bind all operations to functional units under the resource constraints so that the total SA of the functional units is minimized.
ALGORITHM DESCRIPTION
We first introduce our SA estimation method in Section 3.1. We then present a heuristic algorithm in Section 3.2 that offers a new legalization step over what was used in [17] to solve the IT-SA-Bind problem. Finally, we present the optimal algorithm to solve the same problem in Section 3.3.
SA (Switching Activity) Estimation
Our algorithm begins by carrying out a DFG simulation for SA estimation. This is similar to the approach presented in [3] . The difference is that we focus on the intra- and inter-transition SA estimations between any two compatible operations instead of targeting an operation set as done in [3] .
We introduce some related definitions next. For two operations x and y, we define $C_{intra}(x, y)$ as the toggle count between x and y when the FU switches the execution from x to y in the same iteration (intra-transition). Similarly, we define $C_{inter}(x, y)$ as the toggle count between x and y when the FU switches the execution from x to y across two different iterations (inter-transition).
where $D_H(P, Q)$ represents the Hamming distance between bit vectors P and Q. We use an example next to illustrate how equations (1) and (2) are defined.
A simple DFG is given in Figure 2 . At each control step (clock cycle), the primary inputs a and b will take one set of input vectors. For example, at cycle one, the input vectors
shown in the figure. These vectors will be operated upon, and their effects will propagate through the whole DFG and we have I ) = 7. Similar toggle count computations can be carried out for any two compatible operations to estimate the switching activity between these two operations if they are bound together into the same FU.
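Switching activities $S_{intra}$ and $S_{inter}$ are computed as follows:
$$S_{intra}(x, y) = \frac{C_{intra}(x, y)}{2 \times BitWidth}, \qquad S_{inter}(x, y) = \frac{C_{inter}(x, y)}{2 \times BitWidth}$$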
where BitWidth is the bitwidth value of the input port of the FU. Note that in the formula and the example, we only considered the two input ports of an FU. We can perform similar computation for the output port of the FU. The final SA should count both the input-port SA and the output-port SA.
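The toggle counting described in this subsection can be sketched as follows. This is a minimal illustration for a single pair of operand vectors, not the full DFG simulation: the function names are hypothetical, the operand values are made up, and the normalization by the number of ports times the bitwidth is an assumption based on the description of BitWidth and the two input ports above.

```python
from typing import Tuple

def hamming(p: int, q: int) -> int:
    """Hamming distance between two bit vectors stored as non-negative integers."""
    return bin(p ^ q).count("1")

def toggle_count(inputs_x: Tuple[int, int], inputs_y: Tuple[int, int]) -> int:
    """Bits that toggle on the two FU input ports when execution moves from x to y."""
    return sum(hamming(a, b) for a, b in zip(inputs_x, inputs_y))

def switching_activity(count: int, bit_width: int, ports: int = 2) -> float:
    """Assumed normalization: toggles divided by the total number of input bits."""
    return count / (ports * bit_width)

# Hypothetical 4-bit operands (left port, right port) for operations x and y.
x_inputs = (0b1010, 0b0110)
y_inputs = (0b0011, 0b0101)

count = toggle_count(x_inputs, y_inputs)          # 2 toggles per port
sa = switching_activity(count, bit_width=4)
```

The same routine applies to both the intra- and the inter-transition case; the only difference is which iteration the operands of y are taken from. An output-port term could be added in the same way.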
A Heuristic Solution with Network Flow and Bipartite Matching Algorithms
In [17] , the authors added a super source and a super sink into the transformed compatibility graph to generate a network flow graph and solved a min-cost max-flow network problem to reduce the SA.
Since the control step of the max-allocation is moved to the front (Figure 1(c) ), the generation of the flow graph is easier. All the nodes and edges from the transformed compatibility graph are kept in the flow graph, and then the super source node is connected to all the nodes in the front control step and the super sink node is connected to all the nodes in the last column (the duplicated column). Then, k single-capacity flows with a minimum total cost can start from the super source and end at the super sink, where k is the number of nodes in the front column, i.e., the one with the max allocation. This solution will generate k disjoint paths, where each path can be bound into one resource. Since there is no way to enforce the matching node constraint over the network flow solution, the solution may not be legal. For example, it may generate a solution where {6, 8, 2, 4, 7'} in Figure 1 (c) are on a single path to be bound together. This is illegal because the path starts at node 6 but does not end at its duplicate 6', so it does not correspond to a practical binding solution. The authors in [17] then used a greedy algorithm to make the solution legal. It first built a conflict graph that reflects the conflicting relations among the paths that need to be legalized. It then legalized one path at a time, in increasing order of legalization cost, until all the paths were legalized.
We present a new heuristic that would improve the legalization step to solve the IT-SA-Bind problem. We generate an initial solution using the same min-cost max-flow network method as used in [17] . We then formulate the legalization problem into a bipartite matching problem, and legalize all the paths together. Figure 3 demonstrates this idea. Figure 3 (a) shows four disjoint paths generated by the network flow solution, where the top three paths are illegal. To legalize the path starting with a (named as path a), we can either switch edge (a, e) to (a, f) ⇒ {a, f, g, a'}, marked as fix front; or switch edge (k, c') to (k, a') ⇒ {a, e, k, a'}, marked as fix back. Similarly, we can derive the fix scenarios for other paths. As a result, we can build a bipartite graph as shown in Figure 3(b) , where the two selected columns from Figure 3 (a) become the two disjoint sets of the bipartite graph.
The costs on the edges are shown in the figure. For example, bipartite edge (a, g) represents the fix-front scenario and its cost is s(a, f) + s(g, a'), where s(a, f) represents the switching activity of edge (a, f) in Figure 3(a) . Similarly, the weight for bipartite edge (a, k) is s(a, e) + s(k, a'). This provides a way to evaluate which fix scenario would provide a better legalization for path a. However, such a local decision can have a global impact. For example, if we pick (a, k), then (c, k) cannot be used to legalize the path c. Therefore, we should obtain a minimum weight matching for the graph and achieve a global low cost to legalize all the paths. is the schedule length for the DFG, and R is the set of FUs.
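The difference between a local choice and a minimum weight matching can be seen in a small sketch. The node names on the right-hand side, the edge costs, and the function `min_weight_matching` are hypothetical, and the exhaustive search over permutations is used only because the example is tiny; it stands in for a proper bipartite matching algorithm and is not the implementation used in the experiments.

```python
import math
from itertools import permutations
from typing import Dict, List, Optional, Tuple

def min_weight_matching(left: List[str], right: List[str],
                        cost: Dict[Tuple[str, str], float]
                        ) -> Tuple[float, Optional[Dict[str, str]]]:
    """Exhaustively find the cheapest assignment of every left node to a distinct right node."""
    best_cost, best = math.inf, None
    for perm in permutations(right, len(left)):
        # A missing bipartite edge means that fix scenario does not exist.
        total = sum(cost.get((l, r), math.inf) for l, r in zip(left, perm))
        if total < best_cost:
            best_cost, best = total, dict(zip(left, perm))
    return best_cost, best

# Illegal paths a, b, c and candidate fix partners g, h, k (costs are made up).
cost = {("a", "g"): 4, ("a", "k"): 3,
        ("b", "h"): 2, ("b", "g"): 5,
        ("c", "k"): 1, ("c", "h"): 6}

best_cost, best = min_weight_matching(["a", "b", "c"], ["g", "h", "k"], cost)
```

In this example the locally cheapest fix for path a is (a, k) with cost 3, but taking it forces c onto h and b onto g for a total of 14, whereas the matching a-g, b-h, c-k costs 7.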
Optimal Binding with k-Dimensional Graph
An obvious attempt to derive an optimal solution for the IT-SA-Bind problem is to use the multi-commodity network flow algorithm with multiple sources and sinks. However, this solution has, in general, an exponential complexity and is therefore not interesting to us. Instead, we approach this problem through another route: transforming the original problem into finding the shortest path in a k-dimensional graph. The k-dimensional graph was introduced in [16] to find k disjoint paths in a DAG (directed acyclic graph) between a single source s and a single sink t such that the total cost of the paths was minimized. We borrow this concept, extend it to the multi-source/multi-sink scenario based on the transformed compatibility graph, and use it to obtain an optimal legal binding solution. A legal solution would honor the matching node constraint, while at the same time ensuring that all the nodes are bound using k resources, i.e., that the disjoint paths cover all the nodes in the graph (we name this the node coverage constraint).
We are given a transformed compatibility graph (which is a DAG) 
as follows: = t j , we arrange the edges of P 1 , P 2 , …, P k in ascending order of their edge sources to form the
, u 1 (2) → u 2 (2) , …, u 1 = t j . Note that $t_j$ is the duplicated $s_j$ node in $G_T$, i.e., $t_j = s_j'$ (Section 2). Therefore, the matching node constraint can be fulfilled in these k vertex-disjoint $s_j$ to $t_j$ paths.
Furthermore, the total cost of the edges in $P^k$ is equal to the total edge cost of the corresponding k paths in $G_T$. Therefore, if $P^k$ is the shortest path from <s 1 , s 2
Proof:
First, we show that k vertex-disjoint paths can cover all the vertices in $G_T$. The compatibility relation on $V_T$ makes $V_T$ a partially ordered set. (Please refer to [3] for the definition of partially ordered sets.) A subset of $V_T$ that contains the largest number of mutually non-compatible nodes has cardinality k. Dilworth's theorem [8] indicates that a partially ordered set P can be partitioned into k disjoint paths covering all the elements if P contains at least one subset Y, where |Y| = k; every pair of elements in Y is non-compatible; and k is the largest size of any such subset in P. These properties hold exactly in $V_T$.
Because the edge cost of every edge in $G_T$ is defined as a negative value, and for any two shows symmetry for the edges. This is only because $G_T$ in the example is symmetric. In general, the edges in the k-dimensional graph do not need to be symmetric. The shortest path solution $P^2$ contains two disjoint paths for $G_T$. We can simply extract the embedded $G_T$ edges from the shortest path, which are (1 → 4), (2 → 3), (3 → 2'), and (4 → 1'). These edges form two disjoint paths: 1 → 4 → 1' and 2 → 3 → 2'. Figure 4(d) shows the final binding solution from these two disjoint paths. We observe that both the matching node constraint and the node coverage constraint are met.
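The idea of searching over k-tuples of path ends can be sketched in a few lines for a small instance. This is a simplified illustration under stated assumptions and not the construction analyzed in this section: the function name `kdim_binding` and the edge costs are hypothetical, the costs are non-negative, and node coverage is enforced by an explicit check on topological ranks. Each state is a k-tuple holding the current end of every path; the end with the lowest topological rank always moves next, which keeps the paths vertex-disjoint.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]

def kdim_binding(order: List[str], sources: List[str], cost: Dict[Edge, float]
                 ) -> Optional[Tuple[float, List[Tuple[str, ...]]]]:
    """order: topological order of G_T, sources first, duplicated sinks (name + "'") last."""
    k, n = len(sources), len(order)
    rank = {v: i for i, v in enumerate(order)}
    sinks = [s + "'" for s in sources]
    succ: Dict[str, List[Tuple[str, float]]] = {}
    for (u, v), w in cost.items():
        succ.setdefault(u, []).append((v, w))
    # frontier maps a k-tuple of path ends to (best cost so far, the k partial paths)
    frontier = {tuple(sources): (0.0, tuple((s,) for s in sources))}
    for _ in range(n - k):                      # each step leaves exactly one vertex behind
        nxt = {}
        for state, (c, paths) in frontier.items():
            j = min(range(k), key=lambda i: rank[state[i]])   # lowest-rank end moves
            u = state[j]
            for v, w in succ.get(u, []):
                if v in state:
                    continue                    # vertex already held by another path
                if v in sinks and v != sinks[j]:
                    continue                    # matching node constraint
                new_state = state[:j] + (v,) + state[j + 1:]
                if min(rank[x] for x in new_state) != rank[u] + 1:
                    continue                    # node coverage: no vertex may be skipped
                new_paths = paths[:j] + (paths[j] + (v,),) + paths[j + 1:]
                if new_state not in nxt or c + w < nxt[new_state][0]:
                    nxt[new_state] = (c + w, new_paths)
        frontier = nxt
    goal = tuple(sinks)
    if goal not in frontier:
        return None
    total, paths = frontier[goal]
    return total, list(paths)

# Six-node example with k = 2; the edge costs are made up.
order = ["1", "2", "3", "4", "1'", "2'"]
cost = {("1", "3"): 5, ("1", "4"): 2, ("2", "3"): 1, ("2", "4"): 4,
        ("3", "1'"): 3, ("3", "2'"): 2, ("4", "1'"): 1, ("4", "2'"): 6}
total, paths = kdim_binding(order, ["1", "2"], cost)
```

With these costs the search returns the paths 1 → 4 → 1' and 2 → 3 → 2'. The number of states grows as n to the power k, which matches the observation that the approach is practical only when k is small.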
Theorem 2.
The IT-SA-Bind problem can be solved optimally in polynomial time for DFGs.
Therefore, the overall complexity of the algorithm is:
Since k is a constant that is usually specified by the designer to honor area/power constraints, this indicates that the IT-SA-Bind problem is solvable in polynomial time. ■ When k is small, this algorithm has a reasonable runtime, as shown in the experimental results. The algorithm is especially useful for applications where a large resource allocation would not improve performance anyway due to certain node dependencies (i.e., the parallelism in the application is limited). However, when k is large, the runtime of the optimal algorithm may no longer be affordable. In that case, the proposed bipartite-based algorithm in Section 3.2 can be used.
Note that all the algorithms presented in this study can be easily applied to register binding and bus binding for low power, because those problems can be translated into binding with compatibility graphs as well.
EXPERIMENTAL RESULTS
To examine the quality gap between our optimal solution, the new heuristic, and the previous heuristic solution [17] , we implemented all of these algorithms for comparison. The benchmarks we use include binary adder trees, binary multiplier trees, and some benchmarks from [24] , which include several different DCT and DSP algorithms. For a fair comparison, all the algorithms use the same scheduling result with the same resource constraint for each benchmark. Scheduling is done by a list scheduling algorithm. We used a 2.8 GHz Pentium 4 machine running Linux with 2 GB of memory. better than Confl. Figure 5 lists the same comparison results using a bar chart. For some cases, Bipar achieves the optimal solution. We can also observe that, in general, the larger the ratio of n over k, the larger the gain in SA reduction. This is easy to understand because the more legalization opportunities there are, the more optimization gain can be obtained. Bipar are almost comparable. It is not surprising that Kdim uses the largest runtime due to its high computational complexity. However, it is still tolerable. For certain benchmarks, such as adder trees, the power reduction is significant enough to make the runtime increase worthwhile.
CONCLUSIONS AND FUTURE WORK
In this paper we presented algorithms to solve the low-power resource binding problem considering inter-transition switching activities in the design. We proved that such a problem for DFGs can be solved optimally in polynomial time, and we presented such an optimal algorithm in detail. The optimal solution is, on average, 6.7% (up to 20.1% individually) better than a previously published algorithm. We also developed a new heuristic algorithm that, on average, is 4.1% worse when compared to the optimal solution. It is, on average, 2.6% (up to 10.5% individually) better than the previously published algorithm. In the future, we can study ways to prune the k-dimensional graph in terms of its nodes and edges so that it will use significantly less runtime, while still achieving near-optimal solutions.
ACKNOWLEDGEMENT
This work was supported in part by the NSF Grant CCF 07-02501 and the NSF Career Award CCF 07-46608. We used the machines donated by Intel. 
