This paper presents an integrated approach to data path synthesis that solves three important design problems, namely scheduling, allocation, and hardware partitioning, with power minimization as a key design objective. Based on the rules of thumb introduced in prior work on synthesis for low power, we derive an integer programming formulation of these problems. From this formulation we then develop an efficient algorithm that performs scheduling, allocation, and hardware partitioning simultaneously, so that their combined effects on power consumption are exploited more fully and effectively. Our experimental results show that the algorithm is effective, producing designs with significant savings in power consumption.
INTRODUCTION
Power consumption in VLSI circuits has become an important consideration in circuit design in recent years [1]. In many application domains, low-power circuits are needed to lower packaging and cooling costs and to extend battery life. A number of techniques for reducing internal power dissipation have been proposed for the design of low-power circuits. These techniques often focus on reducing the dominant term in the equation for power dissipation in CMOS digital circuits [1]:

$$P = C_L V_{DD}^2 f_p \qquad (1)$$

where $P$ denotes the power dissipated in charging and discharging the output capacitive load $C_L$, $V_{DD}$ is the supply voltage, and $f_p$ is the output switching frequency. One way to reduce power consumption is to lower the total capacitive load $C_L$. In general, there are different types of [...]

* Based on "An Efficient Data Path Synthesis Algorithm for Behavioral-level Power Optimization" by C. Park, T. Kim [2].
Previous research efforts [3] [4] [5] in high-level synthesis have mostly focused on speed and/or area optimization using a global periodic clock signal and a single type of functional module for each operation. Much of the work on power optimization has focused on the logic level. The power trade-off between different types of adders and multipliers was studied in [6]. In [7], power was minimized by modifying the function of each node in the circuit. The work in [8] employed a re-encoding technique using gated clocks to reduce power in sequential circuits. A high-level synthesis system, HYPER-LP, presented in [9], uses a variety of architectural and arithmetic transformations to optimize power dissipation.
In this paper, based on observations from prior work on synthesis for low power [6, 8, 9, 11], we design a new high-level synthesis algorithm that performs scheduling, allocation, and hardware partitioning in an integrated fashion for low-power design. Specifically, for a given unscheduled data flow graph, we (1) select functional modules from a given general library, (2) schedule the operations on the selected functional modules so that groups of functional modules with similar activity patterns can be deactivated, and (3) allocate registers to the variables and partition the registers so that groups of registers can be deactivated. Our objectives are to minimize the total hardware cost as well as the power consumption. Our algorithm employs the following techniques to reduce power consumption: functional module selection, selective shutdown of functional modules, and selective shutdown of registers.
Functional Module Selection
In general, an operation can be executed on one of several different types of functional modules, which differ in execution time, area, and power consumption. For example, Table I shows that a 32-bit addition can be executed on a ripple-carry adder (RCA) in 20 ns at 22.7 mW or on a carry-lookahead adder (CLA) in 10 ns at 37.3 mW, and a 32-bit multiplication can be executed on a Booth multiplier (BOOTH) in 160 ns at 84.0 mW or on an array multiplier (ARR) in 100 ns at 295.6 mW [10]. To reduce power consumption, using functional modules that consume less power is clearly desirable.
As an example, Figure 1 shows a given unscheduled data flow graph. Suppose the duration of a control step is 120 ns, so that a multiplication takes 2 control steps on BOOTH and 1 control step on ARR. Schedule A in Figure 2 shows a schedule when only ARR is available. Schedule B in Figure 3 [...] Consequently, Schedule B yields a 35% saving in power consumption.
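The control-step counts in this example follow from rounding each module's delay up to a whole number of control steps. The sketch below illustrates that arithmetic with the delay and power figures quoted from Table I; the names `LIBRARY`, `control_steps`, and `candidates` are hypothetical and are not part of the algorithm described in this paper.

```python
import math

# Delay (ns) and power (mW) per module type, as quoted from Table I in the text.
LIBRARY = {
    "RCA":   {"op": "add", "delay_ns": 20,  "power_mw": 22.7},
    "CLA":   {"op": "add", "delay_ns": 10,  "power_mw": 37.3},
    "BOOTH": {"op": "mul", "delay_ns": 160, "power_mw": 84.0},
    "ARR":   {"op": "mul", "delay_ns": 100, "power_mw": 295.6},
}


def control_steps(delay_ns, step_ns):
    """Number of whole control steps a module occupies for one operation."""
    return math.ceil(delay_ns / step_ns)


def candidates(op, step_ns):
    """Modules able to execute `op`, lowest power first, with their step counts."""
    rows = [
        (name, m["power_mw"], control_steps(m["delay_ns"], step_ns))
        for name, m in LIBRARY.items()
        if m["op"] == op
    ]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for name, power, steps in candidates("mul", step_ns=120):
        print(f"{name}: {power} mW, {steps} control step(s)")
```

With a 120 ns control step, the lower-power multiplier occupies two control steps and the higher-power one occupies a single step, which is the trade-off a scheduler has to weigh when selecting modules.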
Selective Shutdown of Functional Modules
Recent studies [11] [...] A more practical approach is to use the same regenerated clock signal for a group of functional modules, so that the number of clock regenerators is reduced and the control logic for clock gating does not become excessively complicated.
In our approach, we use a regenerated clock signal for each type of functional module. Figure 5 shows the clock regenerator logic used in our algorithm. Figure 6 [...]
Selective Shutdown of Registers
In a dynamic register, information is stored in the form of electric charge, which leaks gradually over time. Dynamic registers therefore need to be refreshed periodically, usually in each control step, and power is consumed in both the refresh circuit and the clock signal. However, data stored in a register may become obsolete, and it is unnecessary to refresh a register once the data it holds will no longer be used. As with functional modules, clock gating lets us turn off both the refresh circuit and the clock signal that drives it during control steps in which a register no longer needs to be refreshed. Again, it is not practical for each register to have its own regenerated clock signal. Consequently, we partition the registers into groups and let each group be driven by its own regenerated clock signal.
To partition the registers, we first partition the variables into groups. The variables in each group are then assigned to registers, which together form a register group. The partition is carried out in such a way that the total number of active control steps in the registers is minimized. (An active control step of a register is a control step in which the register contains the value of a live variable. Similarly, an inactive control step is one in which the register contains obsolete data.) Figure 7(a) shows the lifetimes of 9 variables from Schedule B in Figure 3. [...] are 2 registers and 2 (common) inactive control steps. In group P2, there is 1 register and 4 (common) inactive control steps. The total number of active control steps in the registers is 10. In fact, Partition C achieves a 33% saving in register power consumption compared with Partition A.
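The count of 10 active control steps can be reproduced with a short calculation. The sketch below is only an illustration under stated assumptions: it assumes that every register in a group stays clocked except in the group's common inactive steps, and that the schedule spans 6 control steps (a value inferred from the quoted numbers, not stated in the text). The per-register activity sets are hypothetical, chosen only to match the quoted counts of common inactive steps.

```python
def common_inactive_steps(register_activity, total_steps):
    """Control steps in which no register of the group holds a live value."""
    busy = set().union(*register_activity)
    return total_steps - len(busy)


def group_active_steps(register_activity, total_steps):
    """Steps in which the group's clock must run, summed over its registers."""
    inactive = common_inactive_steps(register_activity, total_steps)
    return len(register_activity) * (total_steps - inactive)


TOTAL_STEPS = 6  # assumption inferred from the quoted totals

# Hypothetical activity sets (control steps in which each register is live).
P1 = [{1, 2, 3}, {2, 3, 4}]   # 2 registers, common inactive steps: 5 and 6
P2 = [{1, 2}]                 # 1 register, common inactive steps: 3 to 6

total = group_active_steps(P1, TOTAL_STEPS) + group_active_steps(P2, TOTAL_STEPS)

if __name__ == "__main__":
    print(total)
```

Under these assumptions group P1 contributes 2 x 4 = 8 active steps and group P2 contributes 1 x 2 = 2, which matches the total of 10 reported for Partition C.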
The three techniques mentioned above are closely interrelated. Our algorithm embodies all of them by performing functional module selection, scheduling, allocation, and partitioning simultaneously. We are given a general library that contains several types of functional modules for each type of operation, together with the total number of control steps within which all operations in the data flow graph are to be executed. We are to (1) determine an execution schedule for the operations, (2) select the type of functional module for each operation, (3) partition the variables into groups, (4) allocate functional modules and registers, and (5) determine the regenerated clock signal for each type of functional module and each group of registers. The objective is to minimize the hardware cost, the total power consumption, and the total number of active control steps in functional modules and registers. We propose a polynomial-time approximation algorithm for solving the corresponding integer programming (IP) problem.
At the beginning, we set the value of one of these variables $o_{ijkp}$ to 1. Such a choice may determine the 0-1 values of other 0-1 variables because of the constraints in (1) and (2).
Let us use the example in Figure 8 to illustrate the idea.
We have the following constraints when the total number of control steps is given to be 5.
Constraints of the form (1):
(1) $o_{1111} + o_{1112} + o_{1211} + o_{1212} = 1$
(2) $o_{2211} + o_{2212} + o_{2221} + o_{2222} + o_{2311} + o_{2312} + o_{2321} + o_{2322} + o_{2411} + o_{2412} = 1$
(3) $o_{3211} + o_{3212} + o_{3221} + o_{3222} + o_{3311} + o_{3312} = 1$
(4) $o_{4311} + o_{4312} + o_{4411} + o_{4412} + o_{4511} + o_{4512} = 1$
(5) $o_{5311} + o_{5312} + o_{5321} + o_{5322} + o_{5411} + o_{5412} = 1$
(6) $o_{6411} + o_{6412} + o_{6511} + o_{6512} = 1$
Constraints of the form (2):
(7) $o_{1111} + o_{1112} \ge o_{2211} + o_{2212} + o_{2221} + o_{2222}$
(8) $o_{1111} + o_{1112} \ge o_{3211} + o_{3212} + o_{3221} + o_{3222}$
(9) $o_{2211} + o_{2212} \ge o_{4311} + o_{4312}$
(10) $o_{2211} + o_{2212} + o_{2311} + o_{2312} + o_{2221} + o_{2222} \ge o_{4411} + o_{4412}$
(11) $o_{3211} + o_{3212} \ge o_{5311} + o_{5312} + o_{5321} + o_{5322}$
(12) [...]
[...] (1), we have $o_{1111} = o_{1112} = o_{1212} = 0$.
Since $o_{1111} + o_{1112} = 0$, because of precedence constraints (7) and (8) [...] The variables in the equalities will be assigned equal fractional values. For the example in Figure 8, [...] (9). We then compute the value of the objective function $F$. For the example in Figure 8, we obtain the values of $F$ corresponding to all possible choices of setting one of the variables $o_{ijkp}$ to 1.
(The values of the weighting factors in $F$ are chosen to be $\alpha = 1$, $\beta = 2$, $\gamma = 3$.) On the basis of the value of $F$, one of the variables $o_{ijkp}$ is set to 1. In other words, among all variables $o_{ijkp}$, the one that produces the minimum value of $F$ is set to 1, which, together with the values of the other variables assigned accordingly, constitutes an intermediate solution.
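The propagation and selection steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that constraints (1)-(6) each require the listed variables to sum to 1 and that constraints (7)-(11) require the left-hand sum to be at least the right-hand sum. Constraint (12) and the fractional assignment of the remaining variables are omitted. The names `ASSIGN`, `PRECEDENCE`, `propagate`, and `greedy_choice` are hypothetical, and the `objective` argument stands in for $F$, whose definition is not reproduced here.

```python
# Assumed reading of constraints (1)-(6): exactly one variable per operation is 1.
ASSIGN = [
    ["o1111", "o1112", "o1211", "o1212"],
    ["o2211", "o2212", "o2221", "o2222", "o2311",
     "o2312", "o2321", "o2322", "o2411", "o2412"],
    ["o3211", "o3212", "o3221", "o3222", "o3311", "o3312"],
    ["o4311", "o4312", "o4411", "o4412", "o4511", "o4512"],
    ["o5311", "o5312", "o5321", "o5322", "o5411", "o5412"],
    ["o6411", "o6412", "o6511", "o6512"],
]

# Assumed reading of constraints (7)-(11): sum(lhs) >= sum(rhs).
PRECEDENCE = [
    (["o1111", "o1112"], ["o2211", "o2212", "o2221", "o2222"]),
    (["o1111", "o1112"], ["o3211", "o3212", "o3221", "o3222"]),
    (["o2211", "o2212"], ["o4311", "o4312"]),
    (["o2211", "o2212", "o2311", "o2312", "o2221", "o2222"],
     ["o4411", "o4412"]),
    (["o3211", "o3212"], ["o5311", "o5312", "o5321", "o5322"]),
]


def propagate(chosen):
    """Set `chosen` to 1 and return every 0-1 value that this choice forces."""
    value = {chosen: 1}
    # The other variables of the same operation must be 0.
    for group in ASSIGN:
        if chosen in group:
            for var in group:
                value.setdefault(var, 0)
    # If a left-hand side is known to be all zero, the right-hand side is zero.
    changed = True
    while changed:
        changed = False
        for lhs, rhs in PRECEDENCE:
            if all(value.get(var) == 0 for var in lhs):
                for var in rhs:
                    if var not in value:
                        value[var] = 0
                        changed = True
    return value


def greedy_choice(candidates, objective):
    """Pick the candidate whose forced assignment gives the smallest objective."""
    return min(candidates, key=lambda var: objective(propagate(var)))


if __name__ == "__main__":
    forced = propagate("o1211")
    print(sorted(var for var, val in forced.items() if val == 0))
```

Setting $o_{1211}$ to 1 forces the other variables of operation 1 to 0, and through the precedence constraints it also rules out the earliest start times of the successor operations, while later start times remain undetermined.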
For the example in Figure 8, $o_{6511}$ is set to 1. In this case, we obtain an intermediate solution: [...]
EXPERIMENTAL RESULTS
We tested our program on a number of benchmark examples. The algorithm described in Section 4 was implemented in C and executed on a Sun Sparc20 workstation. Example df.5 is the differential equation from [3] [...] [16] with the given number of control steps set to 8 and 10, respectively. Examples ewf.18 and ewf.20 are the elliptic wave filter from [16] with the given number of control steps set to 18 and 20, respectively. [...] closely. If we minimize only the total hardware cost, we obtain a design that consumes 544 mW in the functional modules and 336 mW in the registers, with a total hardware cost of 88. If we minimize only the power consumption in the functional modules, we obtain a design that consumes only 252 mW in the functional modules and 336 mW in the registers; the total hardware cost rises to 94. If we minimize only the power consumption in the registers, we obtain a design that consumes 288 mW in the functional modules but only 252 mW in the registers, with a total hardware cost of 88. In Power+Hardware, when we take everything into consideration, we obtain a 42.8% saving in power consumption. The hardware cost is 57.
Indeed, as shown in Table III, we achieve up to a 52.6% reduction in power consumption, and 43.4% on average.
To compare the result produced by our approximation algorithm with the optimal result, we would need to solve the IP in Section 3. However, since the formulation is a non-linear integer programming problem, the computational effort is substantial even for small problem instances. Hence, we first generate an integer linear programming (ILP) formulation from the integer programming (IP) formulation in Section 3. Since the objective function $F$ is quadratic, we approximate it by a linear form using Lagrange first-order conditions. We then used the LINDO package on an IBM 3081 to solve the ILP problem. Table IV [...]
CONCLUSIONS
We presented an integrated approach to scheduling, allocation, and hardware partitioning with power consumption as one of the key design objectives. We first proposed an integer programming formulation of the problem, from which we derived an efficient approximation algorithm. Unlike previous low-power approaches in which scheduling and allocation are performed independently, our approach combines scheduling, allocation, and partitioning to exploit their effects on power consumption more effectively. The experimental results confirmed that our algorithm is effective and robust.
