This paper describes a technique for calculating the switching activity of a set of registers shared by different data values. Based on the assumption that the joint pdf (probability density function) of the primary input random variables is known or that a sufficiently large number of input vectors has been given, the register assignment problem for minimum power consumption is formulated as a minimum cost clique covering of an appropriately defined compatibility graph (which is shown to be transitively orientable). The problem is then solved optimally (in polynomial time) using a max-cost flow algorithm. Experimental results confirm the viability and usefulness of the approach in minimizing power consumption during the register assignment phase of the behavioral synthesis process.
1 Introduction
One driving factor behind the push for low power design is the growing class of personal computing devices, as well as wireless communications and imaging systems, that demand high-speed computations and complex functionalities with low power consumption. Another driving factor is that excessive power consumption is becoming the limiting factor in integrating more transistors on a single chip or on a multiple-chip module. Unless power consumption is dramatically reduced, the resulting heat will limit the feasible packing and performance of VLSI circuits and systems.
The behavioral synthesis process consists of three phases: allocation, assignment and scheduling. These phases determine how many instances of each resource are needed (allocation), on what resource a computational operation will be performed (assignment) and when it will be executed (scheduling). Traditionally, behavioral synthesis aims to minimize the number of resources required to perform a task in a given time or to minimize the execution time for a given set of resources. It has become necessary to develop behavioral synthesis techniques that also account for power dissipation in the circuit. This extends the two-dimensional optimization problem to a third dimension. The three phases of the behavioral synthesis process must thus be modified to produce low power circuits. Unfortunately, power dissipation is a strong function of signal statistics and correlations, and hence is non-deterministic.
Automatic techniques must be developed that minimize the switching activity on globally shared busses and register files, that select low power macros that satisfy the timing constraints, that schedule operations to minimize the switching activity from one cycle step to the next, and so on. This paper considers register assignment for low power. Most high-level synthesis systems perform scheduling of the control and data flow graph (CDFG) before allocation of the registers and modules and synthesis of the interconnect [11] [18] [7], as this approach provides timing information for the allocation and assignment tasks. Other systems perform the resource allocation and binding before scheduling, in order to make more precise timing information available during scheduling [9]. Each approach has its own advantages and shortcomings. The present work assumes that the scheduling of the CDFG has been done and performs the register allocation before the allocation of modules and interconnection.
During register allocation and assignment, data values (arcs in the data flow graph) can share the same physical register if their lifetimes do not overlap. In the past, researchers have proposed various techniques to reduce the total number of registers used. The existing approaches include rule-based [6], greedy or iterative [10], branch and bound [13], linear programming [1], and graph theoretic, as in the Facet system [18], the HAL system [16] and the EASY system [17].
Power consumption of well designed register sets depends mainly on the total switching activity of the registers. In many applications, the data streams which are input to the circuit have certain probability distributions. Various ways of sharing registers among different data values thus produce different switching activities in these registers. This work presents a novel way of calculating this switching activity based on the assumption that the joint pdf (probability density function) of the primary input random variables is known or a sufficiently large number of input vectors has been given. In the latter case, the joint pdf can be obtained by statistical methods. After obtaining the joint pdf of the primary input variables, the pdf of any internal arc (data value) in the data flow graph and the joint pdf of any pair of arcs (data values) in the data flow graph are calculated by a method that will be described in detail in the following sections. The switching activity on a pair of arcs is then formulated in terms of the joint pdf of these arcs, or alternatively, in terms of a function of the joint pdf of all primary input variables.
The lifetime of each arc (data value) in a scheduled data flow graph is the time during which the data value is active (valid) and is defined by an interval. We show that the unoriented compatibility graph for the arcs (data values) in a scheduled data flow graph without cycles and branches is a comparability graph (or transitively orientable graph), which is a perfect graph [5]. This is a very useful property, as many graph problems (e.g. maximum clique, maximum weight k-clique covering, etc.) can be solved in polynomial time for perfect graphs while they are NP-complete for general graphs.
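The transitive orientation can be illustrated with a minimal sketch. It assumes a hypothetical lifetime representation `(birth, death)` with `birth < death`, and it assumes that a register freed at a c-step boundary may be reused by a value born at that same boundary; neither convention is fixed by the text above. An arc is directed from u to v whenever the lifetime of u ends no later than the lifetime of v begins.

```python
def compatibility_arcs(lifetimes):
    """Orient the compatibility graph: (u, v) is an arc when u's lifetime
    ends no later than v's lifetime begins (assumed convention)."""
    return {
        (u, v)
        for u, (_, death_u) in lifetimes.items()
        for v, (birth_v, _) in lifetimes.items()
        if u != v and death_u <= birth_v
    }


def is_transitive(arcs):
    """Check that u->v and v->w always imply u->w."""
    return all(
        (u, w) in arcs
        for (u, v) in arcs
        for (v2, w) in arcs
        if v == v2
    )


# Hypothetical lifetimes: value name -> (birth step, death step)
lifetimes = {"a": (0, 1), "b": (1, 2), "c": (2, 3), "d": (0, 3)}
arcs = compatibility_arcs(lifetimes)
print(sorted(arcs))
print(is_transitive(arcs))
```

In this sketch the value `d` is live during the whole schedule, so it is compatible with nothing and ends up as an isolated vertex.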
Having calculated the switching activity between pairs of arcs that could potentially share the same register, and given the number of registers that are to be used, the register assignment problem for minimum power consumption is formulated as a minimum cost clique covering of the compatibility graph. The problem is then solved optimally (in polynomial time) using a max-cost flow algorithm.
The two problems, calculation of the cross-arc switching activities (which must be performed $O(|E|)$ times, where $|E|$ is the number of edges in the compatibility graph) and power minimization during register assignment, are independent. The calculation of the cross-arc switching activities can be performed by any means. We present one such technique later; other techniques may, however, be used. The power optimization is performed once the cross-arc switching activities are known. The remainder of this paper is organized as follows: Section 2 shows the method to calculate the switching activity between pairs of data values (arcs). Section 3 shows the method to optimize the power consumption of registers in the register allocation phase of behavioral synthesis. Section 4 presents some examples to demonstrate the methodology.
2 Switching Activity Calculation
2.1 Calculation of various pdfs
In many instances, the input data streams are somewhat known, and can thus be described by some probability distribution. (Our proposed method applies not only to well known probability distributions, such as the joint Gaussian distribution, but also to arbitrary probability distributions.) Given a sufficient number of input vectors, it is possible to find the symbolic expressions for the pdfs and the joint pdf of all inputs using statistical methods. For example, one way to do this is to calculate the frequency of occurrence of each vector among the set of input vectors, and then perform interpolation on the sets of discrete points to obtain the symbolic expression of the joint pdf. Alternatively, one can work directly with the input vectors without having to find the symbolic expression of the joint pdf; that is, for a sufficiently large number of input vectors, the frequency of occurrence of each input vector can serve as the value of the joint pdf for that pattern.
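The second alternative, using relative frequencies directly as the joint pdf, can be sketched as follows. The function name `empirical_joint_pmf` and the tuple representation of input vectors are assumptions made for illustration only.

```python
from collections import Counter


def empirical_joint_pmf(vectors):
    """Map each distinct input vector to its relative frequency.

    `vectors` is a sequence of equal-length tuples, one per observed
    input pattern (hypothetical representation).
    """
    if not vectors:
        raise ValueError("at least one input vector is required")
    counts = Counter(tuple(v) for v in vectors)
    total = len(vectors)
    return {pattern: count / total for pattern, count in counts.items()}


# Four observed vectors of two primary inputs (a1, a2)
pmf = empirical_joint_pmf([(1, 2), (1, 2), (3, 4), (0, 0)])
print(pmf)
```

The estimate is only as good as the sample: patterns that never occur in the given vectors receive probability zero.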
If we are given the joint pdf of the input random variables of a data flow graph, then the joint pdf of any pair of values (arcs in the data flow graph) can be calculated [15]. We want to find the joint pdf of any two arcs.
Suppose that the two arcs are $y_1 = u_1(x_1, x_2, \ldots, x_n)$ and $y_2 = u_2(x_1, x_2, \ldots, x_n)$. We can add another $(n-2)$ free functions $y_3, y_4, \ldots, y_n$ and form a system of $n$ equations in $n$ input variables. Let us denote the joint pdf of the $n$ input variables as $\phi(x_1, x_2, \ldots, x_n)$. If the inverse solution $x_1 = w_1(y_1, y_2, \ldots, y_n)$, $x_2 = w_2(y_1, y_2, \ldots, y_n)$, $\ldots$, $x_n = w_n(y_1, y_2, \ldots, y_n)$ can be obtained symbolically, then the joint pdf of $y_1, y_2, \ldots, y_n$, which is denoted by $\phi'(y_1, y_2, \ldots, y_n)$, is:

$$\phi'(y_1, \ldots, y_n) = \phi(w_1, \ldots, w_n)\,\bigl|J^{-1}(y_1, \ldots, y_n)\bigr|, \qquad J^{-1}(y_1, \ldots, y_n) = \det\left[\frac{\partial w_i}{\partial y_j}\right]_{i,j=1}^{n}$$

Once we have $\phi'(y_1, y_2, \ldots, y_n)$, we can calculate the pairwise pdf of $y_1$ and $y_2$, $f_{y_1 y_2}(y_1, y_2)$, as

$$f_{y_1 y_2}(y_1, y_2) = \int \cdots \int \phi'(y_1, y_2, \ldots, y_n)\, dy_3 \cdots dy_n$$

The integration can be performed either symbolically or numerically. Numerical integration over $(n-2)$ variables involves much more computation, but is always possible when symbolic integration over the $(n-2)$ variables is not.
In addition to the pairwise joint pdfs, the pdf of any internal arc is needed to calculate the total switching activity of the set of registers. Suppose the function $y = w(x_1, x_2, \ldots, x_n)$ is some arc (data value) in the data flow graph depending on the $n$ input random variables $x_1, x_2, \ldots, x_n$. The cdf (cumulative distribution function) of the new random variable $y$ is defined as $G(y) = \mathrm{prob}(Y \le y)$, which is equal to $\mathrm{prob}(w(x_1, x_2, \ldots, x_n) \le y)$. This probability can be evaluated as:

$$G(y) = \int \cdots \int_{A} \phi(x_1, x_2, \ldots, x_n)\, dx_1 \cdots dx_n$$

where $\phi(x_1, x_2, \ldots, x_n)$ is the joint pdf of the $n$ input random variables $x_1, x_2, \ldots, x_n$, and $A = \{(x_1, x_2, \ldots, x_n) \mid w(x_1, x_2, \ldots, x_n) \le y\}$. The pdf of $y$ is then obtained as $g(y) = \frac{dG(y)}{dy}$.
2.2 The power consumption model
Switched capacitance refers to the product of the load capacitance and the switching activity of the driver. The power consumption of a register is proportional to the switched capacitance on its input and output (see Fig. 1). Suppose register $R_1$ can be shared between three data values $i$, $j$ and $k$. We assume that an input multiplexor picks the value that is written into $R_1$ while an output demultiplexor dispatches the stored value to its proper destination. Now, $P(R_1) \propto \mathrm{switching}(x)\,(C_{out,Mux} + C_{in,R_1}) + \mathrm{switching}(y)\,(C_{out,R_1} + C_{in,DeMux})$. Since $\mathrm{switching}(x) = \mathrm{switching}(y)$, $P(R_1) \propto \mathrm{switching}(y)\,C_{total}$, where $C_{total}$ is the sum of the four capacitances. Note that $C_{total}$ is fixed for a given library. In any case, minimizing the switching activity at the output of the registers will minimize the power consumption regardless of the specific load seen at the output of the registers. Here we ignore the power consumption internal to the registers and consider only the external power consumption.
In the register allocation phase, if several compatible arcs are assigned to the same register $R$, switching on $R$ will occur whenever one stored data value is replaced by another data value. For example, suppose $X$, $Y$, $Z$ and $W$ are four compatible data values that share register $R$ and the arcs $(X,Y), (Y,Z), (Z,W) \in A$. Suppose that in the beginning, the register was reset to some unknown value. We assume the switching activity from the unknown value to $X$ is some constant value. The chain of data transitions is then $X \to Y \to Z \to W$. If the input variable values are known, then the exact switching activity is calculated as $\mathrm{constant} + H(X,Y) + H(Y,Z) + H(Z,W)$, where $H(i,j)$ is the Hamming distance between the two numbers $i$ and $j$. If, however, the circuit has even one input random variable, the whole system has to be described in a probabilistic way, as described next.
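For the deterministic case, the count reduces to summing Hamming distances over consecutive values in the chain. The sketch below is illustrative only: the function names, the 8-bit default width, and the two's complement encoding of negative numbers are assumptions, since the text leaves the number system open.

```python
def hamming(i, j, bit_width=8):
    """Number of differing bits between i and j in a bit_width-bit
    two's complement representation (assumed encoding)."""
    mask = (1 << bit_width) - 1
    return bin((i ^ j) & mask).count("1")


def chain_switching(values, initial_cost=0, bit_width=8):
    """Exact switching of a register that holds `values` in order.

    `initial_cost` stands for the constant charged for the transition
    from the unknown reset value to the first stored value.
    """
    transitions = zip(values, values[1:])
    return initial_cost + sum(hamming(a, b, bit_width) for a, b in transitions)


# Chain X -> Y -> Z -> W with known values 5, 3, 3, 12
print(chain_switching([5, 3, 3, 12], initial_cost=4))  # 4 + 2 + 0 + 4 = 10
```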
Assume that the $n$ primary input random variables are $a_1, a_2, \ldots, a_n$ and that the set $A = \{(a_1, a_2, \ldots, a_n)\}$ contains all possible combinations of input tuples. Let the set $B = \{(x,y) \mid x = x(a_1, a_2, \ldots, a_n),\ y = y(a_1, a_2, \ldots, a_n),\ \forall (a_1, a_2, \ldots, a_n) \in A\}$. The switching activity between the two consecutive data values $X$ and $Y$ is then given by:

$$\mathrm{switching}(X,Y) = \sum_{(x,y) \in B} f_{xy}(x,y)\, H(x,y) \qquad (1)$$

where the summation is over all possible patterns $(x,y) \in B$, and the function $H(x,y)$ is the Hamming distance between the two numbers $x$ and $y$, which are represented in binary form in a certain number system. Equation (1) requires that the discrete joint pdf of $x$ and $y$ be known.
The method for calculating the joint pdf of two random variables described in Section 2.1 is mainly suitable for the case when the variables in the system are of continuous type. When, however, the precision used to represent the discrete numbers is high enough or the variance of the underlying distribution is not too large, the continuous pdf $g_{xy}(x,y)$ can be used as a good approximation of the discrete pdf $f_{xy}(x,y)$ after being multiplied by the scaling factor $\bigl(\sum_{(x,y) \in B} g_{xy}(x,y)\bigr)^{-1}$. The symbolic computation method is, however, not very practical because it involves finding the symbolic inverse solution of a system of nonlinear equations and symbolic or numerical integration of complicated expressions over a region defined by a combination of inequalities and/or equalities. Fortunately, the same switching activity for a pair of discrete random variables $x$ and $y$ can be obtained much more easily by the following:
$$\mathrm{switching}(X,Y) = \sum_{a_1} \sum_{a_2} \cdots \sum_{a_n} \phi(a_1, a_2, \ldots, a_n)\, H\bigl(x(a_1, a_2, \ldots, a_n),\ y(a_1, a_2, \ldots, a_n)\bigr) \qquad (2)$$

where $\phi(a_1, a_2, \ldots, a_n)$ is the joint pdf of the input variables $a_1, a_2, \ldots, a_n$.
Both equation (1) and equation (2) start from the assumption that the joint pdf $\phi(a_1, a_2, \ldots, a_n)$ is obtained or known. This is a necessary condition in order to precisely calculate the cross-arc switching activities. Furthermore, equation (2) can be used directly once the input vectors are given, without obtaining the symbolic expression for $\phi(a_1, a_2, \ldots, a_n)$. Here we assume that the bit width of a register is finite, so the total number of patterns that can be stored in a register is also finite. If we assume all of the numbers in our system are integers (positive or negative), then the total number of different $(x,y)$ pairs involved in equation (1) is at most $2^{2 \cdot \mathrm{bit\ width}}$. In general, equation (2) involves multidimensional nested summations over intervals of integral values. When the joint pdf of the primary input variables is band-limited (e.g. Gaussian), we can narrow down the interval of summation in each dimension and thereby significantly speed up the computation.
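Equation (2) can be evaluated directly over a discrete joint pdf or over the raw input vectors. The following is a minimal sketch under stated assumptions: the joint pdf is a dictionary from input tuples to probabilities, the arcs `x` and `y` are ordinary Python functions of the primary inputs, and values are compared as 8-bit two's complement words. All names are hypothetical.

```python
def hamming(i, j, bit_width=8):
    """Differing bits between i and j, bit_width-bit two's complement."""
    mask = (1 << bit_width) - 1
    return bin((i ^ j) & mask).count("1")


def expected_switching(joint_pmf, x, y, bit_width=8):
    """Equation (2): probability-weighted Hamming distance between
    arcs x and y, summed over all input tuples."""
    return sum(p * hamming(x(*a), y(*a), bit_width) for a, p in joint_pmf.items())


def switching_from_vectors(vectors, x, y, bit_width=8):
    """Same quantity, using each vector's frequency as its probability."""
    return sum(hamming(x(*a), y(*a), bit_width) for a in vectors) / len(vectors)


# Two hypothetical arcs of a data flow graph with inputs a1, a2
def arc_x(a1, a2):
    return a1 + a2


def arc_y(a1, a2):
    return a1 * a2


pmf = {(1, 2): 0.5, (3, 4): 0.25, (0, 0): 0.25}
print(expected_switching(pmf, arc_x, arc_y))  # 0.5*1 + 0.25*3 + 0.25*0 = 1.25
```

For a band-limited pdf, restricting `joint_pmf` to the tuples with non-negligible probability corresponds to narrowing the interval of summation.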
Let us denote the sets $A = \{(a_1, a_2, \ldots, a_n)\}$, $B = \{(x,y) \mid x = x(a_1, \ldots, a_n),\ y = y(a_1, \ldots, a_n),\ \forall (a_1, \ldots, a_n) \in A\}$, $C = \{(y,z) \mid y = y(a_1, \ldots, a_n),\ z = z(a_1, \ldots, a_n),\ \forall (a_1, \ldots, a_n) \in A\}$, and $D = \{(z,w) \mid z = z(a_1, \ldots, a_n),\ w = w(a_1, \ldots, a_n),\ \forall (a_1, \ldots, a_n) \in A\}$. The switching activity over the transitions $X \to Y \to Z \to W$ is then

$$\sum_{a_1} \sum_{a_2} \cdots \sum_{a_n} \phi(a_1, a_2, \ldots, a_n)\,\bigl(H(x,y) + H(y,z) + H(z,w)\bigr) \qquad (3)$$

The total switching activity for a register can be calculated after the set of variables that share that register is found. Note that the sequence of data transitions is known at that time.

To minimize the total power consumption of the registers, a network $N_G = (s, t, V_n, E_n, C, K)$ is constructed from the compatibility graph $G_0 = G(V,A)$. This construction is similar to the one used in [17] to solve the weighted module allocation problem, which simultaneously minimizes the number of modules and the amount of interconnection needed to connect all modules. Conceptually, $N_G = (s, t, V_n, E_n, C, K)$ is constructed from $G_0 = G(V,A)$ with two extra vertices, the source vertex $s$ and the sink vertex $t$. The additional arcs are the arcs from $s$ to every vertex in $V$ of $G(V,A)$, and from every vertex in $V$ of $G(V,A)$ to $t$. We use the max-cost flow algorithm on $N_G$ to find a maximum cost set of cliques that cover $G_0 = G(V,A)$. The network on which the flow is conducted has the cost function $C$ and the capacities $K$ defined on each arc in $E_n$. Assuming that each register has an unknown value at time $t_0$, we use a constant $sw_0$ to represent $\mathrm{switching}(\mathrm{Unknown}, v)$ for each vertex $v$.
More formally, the network $N_G = (s, t, V_n, E_n, C, K)$ is defined by

$$\ldots\ \sum_{a_n} \phi(a_1, a_2, \ldots, a_n)\, H\bigl(u(a_1, a_2, \ldots, a_n),\ v(a_1, a_2, \ldots, a_n)\bigr) \rfloor \qquad (5)$$

$$w(v,t) = L,\ \forall v \in V; \qquad w(t,s) = L \qquad (6)$$

where $A = \{(a_1, a_2, \ldots, a_n)\}$, $B = \{(u,v) \mid u = u(a_1, \ldots, a_n),\ v = v(a_1, \ldots, a_n),\ \forall (a_1, \ldots, a_n) \in A\}$, and $L = \lfloor \max\{\mathrm{switching}(u,v)\} \rfloor$.

Definition 3.4 [14] Let $N = (s, t, V, E, C, K)$ be a flow network with underlying directed graph $G = (V, E)$, a weighting on the arcs $c_{ij} \in R^{+}$ for every arc $(i,j) \in E$, a capacity $K(e)$ for every arc $e \in E$, and a flow value $v_0 \in R^{+}$. The min-cost flow problem is to find a feasible s-t flow of value $v_0$ that has minimum cost. In the form of an LP:

$$\min\ c^{T} f \quad \text{subject to} \quad A f = v_0 d \ \ (\text{every node}), \qquad f \le b \ \ (\text{every arc}), \qquad f \ge 0 \ \ (\text{every arc})$$

where $A$ is the node-arc incidence matrix, $b$ is the vector of arc capacities, and

$$d_i = \begin{cases} -1 & i = s \\ +1 & i = t \\ 0 & \text{otherwise} \end{cases}$$

Definition 3.5 The maximum cost flow problem is, given a network $N = (s, t, V, E, C, K)$ and a fixed flow value $v_0$, to find the flow that maximizes the total cost.
The easiest method to solve the max-cost flow problem is to negate the cost of each arc in the network, and run the min-cost flow algorithm on the new network [14].
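The negation trick can be sketched in a few lines. The min-cost flow routine below is a plain successive shortest path method with Bellman-Ford, chosen here only because it tolerates the negative costs produced by negation; it is not necessarily the algorithm of [14] or [4]. It assumes the network has no directed cycle of negative cost after negation, which holds for an acyclic network. The arc format `(u, v, capacity, cost)` and all names are hypothetical.

```python
def min_cost_flow(n, arcs, s, t, flow_value):
    """Send flow_value units from s to t at minimum cost.

    arcs: list of (u, v, capacity, cost). Returns (cost, flow per arc).
    """
    to, cap, cost, adj = [], [], [], [[] for _ in range(n)]
    for u, v, c, w in arcs:  # edge 2i is forward, edge 2i+1 is its reverse
        adj[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        adj[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)
    inf = float("inf")
    total, sent = 0, 0
    while sent < flow_value:
        dist, prev = [inf] * n, [-1] * n
        dist[s] = 0
        for _ in range(n - 1):  # Bellman-Ford on the residual graph
            changed = False
            for u in range(n):
                if dist[u] == inf:
                    continue
                for e in adj[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist[to[e]]:
                        dist[to[e]], prev[to[e]] = dist[u] + cost[e], e
                        changed = True
            if not changed:
                break
        if dist[t] == inf:
            raise ValueError("requested flow value is infeasible")
        push, v = flow_value - sent, t
        while v != s:  # bottleneck capacity along the path
            push = min(push, cap[prev[v]])
            v = to[prev[v] ^ 1]
        v = t
        while v != s:  # augment along the path
            cap[prev[v]] -= push
            cap[prev[v] ^ 1] += push
            v = to[prev[v] ^ 1]
        sent += push
        total += push * dist[t]
    return total, [cap[2 * i + 1] for i in range(len(arcs))]


def max_cost_flow(n, arcs, s, t, flow_value):
    """Negate every arc cost, solve min-cost flow, negate the result."""
    negated = [(u, v, c, -w) for u, v, c, w in arcs]
    total, flows = min_cost_flow(n, negated, s, t, flow_value)
    return -total, flows


# Toy network: 0 = s, 3 = t, unit capacities
toy = [(0, 1, 1, 0), (0, 2, 1, 0), (1, 3, 1, 5), (2, 3, 1, 3), (1, 2, 1, 4)]
print(max_cost_flow(4, toy, 0, 3, 1))
print(max_cost_flow(4, toy, 0, 3, 2))
```

With flow value 1 the best path in the toy network is 0, 1, 2, 3 with cost 7; with flow value 2 the two units must use disjoint paths, for a total cost of 8.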
The previous network construction $N'_G$ ensures that the resulting paths are vertex disjoint cliques in $G_0$ (or $G'_0$). When the max-cost flow algorithm is applied on this network, we obtain cliques that maximize the total cost. The flow value on each path is one; this implies that the total cost on each individual path is the sum over all individual arcs on that path according to their topological order in the graph $G_0 = G(V,A)$, where the cost on each arc is a linear function of the "Saved Power".

$$\ldots\ f(e)\,\mathrm{switching}(e)$$
In our specially constructed network, $f(e)$ on every arc $e$ except $(t,s)$ has value either zero or one. The first term in the above, $\sum_{e \in E_n} f(e)$, is a constant ($= 2|V| + k$ for $G_0 = G(V,A)$) among all possible clique coverings that cover all of the vertices in the original graph $G_0$. When we maximize the total cost for a given flow value in $N'_G$, we are indeed minimizing the total power consumption given that the number of registers is equal to this flow value.
Note that the max-cost flow on $N'_G$ always finds a clique covering that covers all of the vertices in the original graph $G_0$ whenever the flow value $|f|$ is larger than or equal to $k_{min}$. $k_{min}$ can be determined by the left edge algorithm [11] or simply by finding the maximum number of arcs that cross any c-step boundary. In most cases, the $k_{min}$ found by the left edge algorithm is equal to the $k_{min}$ for max-cost flow. However, in some pathological cases, the two values are not the same. In that case, a post-processing step is needed [2].
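The second way of determining $k_{min}$, counting the data values that cross each c-step boundary, can be sketched as follows. The lifetime representation is an assumption made for illustration: each value has a pair `(birth, death)` and is taken to cross boundary `b` (the boundary between c-steps `b` and `b + 1`) when `birth <= b < death`.

```python
def min_registers(lifetimes):
    """Maximum number of data values crossing any c-step boundary.

    lifetimes: dict mapping a value name to (birth, death), where the
    value is live across boundaries birth, ..., death - 1 (assumed).
    """
    if not lifetimes:
        return 0
    first = min(birth for birth, _ in lifetimes.values())
    last = max(death for _, death in lifetimes.values())
    return max(
        sum(1 for birth, death in lifetimes.values() if birth <= b < death)
        for b in range(first, last)
    )


# Hypothetical schedule with five data values
lifetimes = {"a": (0, 2), "b": (0, 1), "c": (1, 3), "d": (2, 3), "e": (1, 2)}
print(min_registers(lifetimes))  # boundary 1 is crossed by a, c and e
```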
The time complexity of the max-cost flow algorithm is $O(km^2)$, according to [4], where $m = 2|V| + 2$ for the graph $G_0 = G(V,A)$ and $k$ is the flow value.
Conditional branches can be easily handled in our system by relaxing the conditional data flow graph into several unconditional data flow graphs and performing the above method on the individual relaxed data flow graphs. Due to limited space, we do not present the details here. A detailed exposition is provided in [2].
4 An example
The following example is based on a scheduled data flow graph. We used equation (2) in Section 2.2 to calculate the cross-arc switching activities for every pair of arcs in $G(V,A)$.
The switching activity for any variable $x$ between time $t_0$, when the register is assumed to hold some unknown value, and the time at which the variable gets its first value was taken to be a constant. After calculating the switching activities, we construct the max-cost flow network. The weight on each arc is calculated by equations (4)-(6) in Section 3.
Here we choose $M = 1000$, and so $L = 10837$. The following weights are obtained:
$w(x, x') = L,\ \forall x \in V$; $w(s, x) = 5271,\ \forall x \in V$; the other $w$'s are computed from the cross-arc switching activities. Note that our method finds the minimum power register assignment for the given number of registers.
To demonstrate that the switching activity calculation based on the joint pdf is necessary to obtain a low power register assignment, we performed an experiment where every arc weight in the compatibility graph was set to some constant ($C = 100$) and then ran the max-cost flow for different flow values. For flow value 5, we obtained:
Number of registers is equal to 5, cliques = {{a, h, k}, {b, f, i}, {c, g, j}, {d}, {e}}, actual total switching activity = 80.487882, which is 13.55% worse than the optimum solution. Next, we generated a register assignment solution using Real [11], which finds the minimum number of registers needed (in this case), and obtained the following result:
Number of registers is equal to 5, cliques = {{a, f, i, k}, {b, g, j}, {c, h}, {d}, {e}}, actual total switching activity = 78.471, which is 10.71% worse than the optimum solution. Indeed, among all valid register assignments of a given size, our proposed algorithm finds the one that minimizes the power consumption.
The percentage power reduction increases for larger data flow graphs. For example, we obtained a 22.5% improvement in power (compared to the minimum register count register assignment procedure) on a 7-input data flow graph using similar assumptions about the joint pdf and the data types. Specifically, for the Min-Power Register Assignment, we obtained:

Number of registers is equal to 9, cliques = {{a, i}, {b, h}, {e, j, k}, {l, m}, {n}, {c}, {d}, {f}, {g}} and actual total switching activity = 6.861;

Number of registers is equal to 8, cliques = {{a, l, m}, {b, h}, {e, j, k}, {f, i}, {c}, {d}, {g}, {n}} and actual total switching activity = 7.272;

Number of registers is equal to 7, cliques = {{a, l, m, n}, {d, h}, {e, j, k}, {f, i}, {b}, {c}, {g}} and actual total switching activity = 7.763.

For the Min-Count Register Assignment, we obtained:

Number of registers is equal to 7, cliques = {{a, j, m}, {b, h, k, n}, {c, i, l}, {d}, {e}, {f}, {g}} and actual total switching activity = 10.017.
5 Conclusion
This paper presented a novel way to calculate the switching activity external to a set of registers based on the assumption that the joint pdf of the primary input random variables is known or can be calculated. For a scheduled data flow graph without cycles, the compatibility graph for the register allocation and assignment problem was proven to be a transitively orientable graph. A special network was then constructed from this compatibility graph and the max-cost flow algorithm (a variation of the min-cost flow algorithm) was performed to obtain the minimum power consumption register assignment. Due to the properties of transitively orientable graphs, the time complexity is polynomial.
Our future work will focus on register assignment for pipelined designs and data flow graphs with outer loops.
