Abstract: We present a dynamic programming technique for solving the multiple supply voltage scheduling problem in both non-pipelined and functionally pipelined data-paths. The scheduling problem refers to the assignment of a supply voltage level (selected from a fixed and known number of voltage levels) to each operation in a data flow graph so as to minimize the average energy consumption for given computation time or throughput constraints or both. The energy model is accurate and accounts for the input pattern dependencies, re-convergent fanout induced dependencies, and the energy cost of level shifters. Experimental results show that using three supply voltage levels on a number of standard benchmarks, an average energy saving of 40.19% (with a computation time constraint of 1.5 times the critical path delay) can be obtained compared to using a single supply voltage level.
I. Introduction
One driving factor behind the push for low power design is the growing class of personal computing devices as well as wireless communications and imaging systems that demand high-speed computations and complex functionalities with low power consumption. Another driving factor is that excessive power consumption has become a limiting factor in integrating more transistors on a single chip. Unless power consumption is dramatically reduced, the resulting heat will limit the feasible packing and performance of VLSI circuits and systems.
The most effective way to reduce power consumption is to lower the supply voltage level for a circuit. Reducing the supply voltage, however, increases the circuit delay. Chandrakasan et al. [1] compensate for the increased delay by shortening critical paths in the data-path using behavioral transformations such as parallelization or pipelining. The resulting circuit consumes lower average power while meeting the global throughput constraint at the cost of increased circuit area.
More recently, the use of multiple supply voltages on the chip has been attracting attention. This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on non-critical paths to use lower voltages (thus reducing the energy consumption). This scheme tends to result in smaller area overhead compared to parallel architectures. There are, however, a number of practical problems that must be overcome before the use of multiple supply voltages becomes prevalent. These problems include routing of multiple supply voltage lines, area/delay overhead of the required level shifters, and lack of design tools and methodologies for multiple supply voltages. The first issue is an important concern which should be considered by any designer who wants to use multiple supply voltages; that is, there is a trade-off between lower energy dissipation and higher routing cost. The remaining issues (level shifter cost and lack of tools) are addressed in this paper. We will show that the area/delay overhead of level shifters is relatively small and will present an effective algorithm for using multiple supply voltages during behavioral synthesis.
In this context, an important problem is to assign a supply voltage level (selected from a finite and known number of supply voltage levels) to each operation in a data flow graph (DFG) and schedule the various operations so as to minimize the energy consumption under given timing constraints. We will refer to this problem as the multiple-voltage scheduling problem, or the MVS problem for short.
In this paper, we tackle the problem in its general form.
We will show that the MVS problem is NP-hard even when only two points exist on the energy-delay curve for each module (these curves may be different from one module to another), and then propose a dynamic programming approach for solving the problem. This algorithm, which has pseudo-polynomial complexity (cf. Section IV-C), produces optimal results for trees, but is suboptimal for general directed acyclic graphs. The dynamic programming technique is then generalized to handle functionally pipelined designs. This is the first time that the use of multiple supply voltages in a functionally pipelined design is considered. We will present a novel revolving schedule for handling these designs.
The paper is organized as follows. In Section II, we summarize related work. In Section III, we describe timing and energy consumption models for non-pipelined designs. In Section IV, we present a dynamic programming approach for solving the multiple-voltage scheduling problem for tree-like DFG's and then for general DFG's. In Section V, we extend the approach to functionally pipelined designs. Experimental results and concluding remarks are provided in Sections VI and VII.
II. Related Problems
The multiple-voltage scheduling problem (MVS) as described above is closely related to the circuit implementation problem as defined in [2]. The latter problem is to minimize the total gate area in a circuit by selecting a gate implementation for each circuit node while meeting a timing constraint. It was shown in [2] that even under a fanout (load) independent delay model, with two implementations per circuit node, equal signal arrival times at inputs, and chain-like circuit structure, the problem of finding a solution that satisfies given bounds on circuit area and signal arrival time is NP-complete. We will show (cf. Section IV) that the MVS problem for minimum energy is also NP-complete. Another similar problem is that of delay constrained technology mapping [3], [4], [5]. Our method for solving multiple voltage scheduling is similar to the method used in [4], [5]. In these works, the authors use dynamic programming to cover a subject graph by a library of pattern graphs with the goal of minimizing area/power while satisfying given timing constraints.
The MVS problem was tackled in [6], where the authors proposed an algorithm for minimizing the energy consumption of a non-pipelined design while meeting the computation time constraint. The authors assume that delay vs. supply voltage curves for all modules in the design library are given and propose an iterative improvement algorithm for solving the problem. The approach is optimal for general directed acyclic graphs. However, the authors make a number of simplistic and rather unrealistic assumptions (e.g., the assumption that the difference of squares of the consecutive voltages on the delay vs. voltage curve is fixed; the independence of the energy consumption of a module from data activity at its inputs; identical latency vs. supply voltage curves for all modules in the circuit, including adders and multipliers). The first assumption enables the authors to reduce the problem of minimizing $\sum_{i \in \mathrm{modules}} E_i$ under a given computation time constraint, where $E_i$ is the energy consumption of module $i$, to that of maximizing $\sum_{i \in \mathrm{modules}} d_i$, where $d_i$ is the delay of module $i$ for the corresponding voltage assignment.
If the assumptions made in [6] do not hold for a given problem instance, then their proposed algorithm will produce a suboptimal solution without any performance guarantee.
Usami and Horowitz [7] proposed a technique to reduce the energy consumption in a circuit by making use of two supply voltage levels. The idea is to operate gates on the critical paths at the higher voltage level and the gates on the non-critical paths at the lower voltage level. In this manner, the energy consumption is minimized without affecting the circuit speed.
Power Profiler [8] primarily uses a genetic search algorithm to solve the multiple voltage scheduling problem. Johnson and Roy presented an ILP based formulation for the multiple voltage scheduling problem for non-pipelined designs in [9]. Both algorithms have exponential worst-case complexity and hence the results are suboptimal for large problem instances where computation time is bounded due to practical considerations. In addition, they do not address conditional branches, nor do they consider functional pipelining. Their energy models do not support input data dependency.
In comparison to previous work, our algorithm is able to find the minimal energy solution for tree-like DFG's under timing constraints, handles general DFG's and functionally pipelined designs, explicitly supports conditional branches, uses an energy model that takes different input data switching activities into consideration, and has pseudo-polynomial time complexity.
III. Energy-delay Curves
We assume there are latches on the inputs of all modules to synchronize the input arrival times, and that no multiple module activations occur per cycle. where operation j is a predecessor of operation i.
B. The energy dissipation model
We present in this section two computational models for energy dissipation at the behavioral level. Our optimization algorithm is, however, independent of the specifics of these energy models. More precisely, any energy macro-model whose parameters depend on the input and/or output activity factors can be used here. This includes, for example, the power macro-model reported in [10].
We assume that the dynamic energy dissipation in a functional unit is given by this equation:
$$E_{FU_i} = V_i^2 \, F_i\left(\alpha^{FU_i}_1, \alpha^{FU_i}_2\right)$$
where $V_i$ is the supply voltage of functional unit $FU_i$, and $\alpha^{FU_i}_1$ and $\alpha^{FU_i}_2$ are the average switching activities on the first and second input operands of $FU_i$, respectively. $F_i$ is a function of $\alpha^{FU_i}_1$ and $\alpha^{FU_i}_2$ and in general may be nonlinear.
We propose two methods to calculate $E_{FU_i}/V_i^2$ given the pairs $(\alpha^i_1, \alpha^i_2)$.
The first method is based on a look-up table; that is, we store energy dissipation values for various $(\alpha_1, \alpha_2)$ combinations and interpolate to calculate the energy value for a given $(\alpha_1, \alpha_2)$ combination which is not found in the table. The accuracy of this method can be made very high by increasing the number of entries in the look-up table.
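The look-up table method can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the text only says that values are interpolated, so the choice of bilinear interpolation, the shared breakpoint grid, and the names `interpolate_energy` and `module_energy` are all assumptions made for the example.

```python
from bisect import bisect_right

def interpolate_energy(grid, table, a1, a2):
    """Bilinear interpolation of a normalized energy table (assumed scheme).

    grid  : sorted activity breakpoints, shared by both operands (assumption)
    table : table[i][j] holds E/V^2 characterized at (grid[i], grid[j])
    """
    def bracket(a):
        # Index of the lower breakpoint of the cell containing a, clamped
        # so that values on the upper boundary still fall into a valid cell.
        i = bisect_right(grid, a) - 1
        return max(0, min(i, len(grid) - 2))

    i, j = bracket(a1), bracket(a2)
    # Fractional position inside the cell along each axis.
    u = (a1 - grid[i]) / (grid[i + 1] - grid[i])
    v = (a2 - grid[j]) / (grid[j + 1] - grid[j])
    return ((1 - u) * (1 - v) * table[i][j]
            + u * (1 - v) * table[i + 1][j]
            + (1 - u) * v * table[i][j + 1]
            + u * v * table[i + 1][j + 1])

def module_energy(grid, table, a1, a2, vdd):
    """Scale the interpolated E/V^2 value by the square of the supply voltage."""
    return vdd ** 2 * interpolate_energy(grid, table, a1, a2)
```

A finer grid reduces the interpolation error at the cost of more characterization runs, which matches the accuracy remark in the text.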
The second method is based on energy macro-modeling using a linear equation with $\alpha_1$ and $\alpha_2$ as random variables. More precisely, we use the least-squares fit to find a plane in the 3-dimensional space that best fits the set of points $(\alpha^i_1, \alpha^i_2, E_{FU_i}/V_i^2)$ for each module $FU_i$. From the least-squares fit, we obtain: the technology and logic style used, and the internal module structure. We obtain the $C_i$ values for every module using gate-level simulation and the least-squares fit. The accuracy of the model can be improved by using more variables.
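A plane fit of this kind can be sketched in a few lines. The parameterization below, $E/V^2 \approx c_1 \alpha_1 + c_2 \alpha_2 + c_0$, is an assumption made for illustration and may differ from the exact form of the macro-model equation; the function name `fit_plane` is likewise hypothetical. The sketch solves the 3x3 normal equations directly and expects at least three non-collinear sample points.

```python
def fit_plane(samples):
    """Least-squares fit of z = c1*a1 + c2*a2 + c0 to (a1, a2, z) samples.

    The coefficient layout is an assumption for this example.
    Returns (c1, c2, c0).
    """
    # Accumulate the normal equations (X^T X) c = X^T z with rows [a1, a2, 1].
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for a1, a2, z in samples:
        row = (a1, a2, 1.0)
        for r in range(3):
            b[r] += row[r] * z
            for c in range(3):
                A[r][c] += row[r] * row[c]

    # Gaussian elimination with partial pivoting on the augmented matrix.
    M = [A[r] + [b[r]] for r in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]

    # Back substitution.
    coef = [0.0] * 3
    for r in (2, 1, 0):
        s = sum(M[r][c] * coef[c] for c in range(r + 1, 3))
        coef[r] = (M[r][3] - s) / M[r][r]
    return tuple(coef)
```

In practice the samples would come from gate-level simulation of each module at several activity pairs, as the text describes.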
To validate our energy model, we present some results for the set of data-path modules used in our library, which are implemented in a 1 μm technology (cf. Table I), using the two methods presented above. Table I presents energy values in pJ at V = 5 volts when the input sequence has average activities of $\alpha_1 = 0.5$, $\alpha_2 = 0.1$ (random data for one operand and biased data for the other operand). It is clear from these results that the table look-up method (with 100 entries) remains accurate over the range of $\alpha$ values, whereas the curve fitting method becomes inaccurate for small $\alpha$.
We have also assumed that the range of $V_i$ is such that the major source of energy consumption is the capacitive charging/discharging; that is, $E_i/V_i^2$ remains constant as $V_i$ is scaled down. This may not be true when static standby current becomes important at very low voltages.
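As a worked illustration of this quadratic scaling (the 3.3 V level is a hypothetical value chosen for the arithmetic, not one taken from the module library):

```latex
\[
\frac{E_i(V')}{E_i(V)} = \left(\frac{V'}{V}\right)^{2},
\qquad
\left(\frac{3.3}{5}\right)^{2} = \frac{10.89}{25} = 0.4356 .
\]
```

Under the stated assumption, a module moved from 5 V to 3.3 V would dissipate roughly 44% of its original dynamic energy per operation, before any level shifter energy is added.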
With this macro-modeling, we can calculate the energy consumption of each module alternative under different supply voltages and switching activities. Note that $\alpha^{FU_i}_1$ and $\alpha^{FU_i}_2$ are calculated by behavioral simulation of the given DFG using the set of user-specified (application-dependent) input vectors.
Let $E_{LS_i}$ be the energy used by level shifter $i$ in the circuit when its input changes once. The energy dissipation (in pJ) in a 16-bit level shifter per voltage level transition is given in Table II (all 16 bits are switching). The propagation delay through a level shifter, taken from [7] for a typical load value, is less than 1 ns, which makes it negligible compared to the propagation delay through the modules (cf. Table IV). Note that at most one level shifter will be used after any module. We can absorb the delay cost (1 ns) of level shifters into the delay of the functional units they follow, because in the module library the minimum module delay is at least 20 times larger than the level shifter delay. Multiplexors are used to route data for non-overlapping operations that sequentially share the same module. From Table I, we can also see that the energy consumed in multiplexors is relatively small compared to the energy dissipation in adders and multipliers. In any case, multiplexors are needed with or without multiple supply voltages.
We assume (and enforce) that each module is active only when it is performing an operation, and is in the sleep mode at all other times. The sleep mode can be achieved by clock gating or by the use of flip-flops with enable/disable.
C. Trade-off curves
We calculate at each node of the DFG a delay function (or delay curve), where each point on that curve relates the accumulated energy consumed in the subtree rooted at that node (or operation) to the output arrival time of the node when a certain module (with a certain supply voltage level and hence delay) is used to perform that operation. Different module alternatives for the same operation give rise to different points on the delay curve. The accumulated energy is the sum of the energy consumed in all modules in that subtree (including the root of that subtree) plus all energy consumed in the necessary level shifters.
The delay function is therefore represented by a set of or- 
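One possible in-memory form of such a delay function is a list of (arrival time, accumulated energy) points from which inferior points have been dropped. The sketch below is an illustration under that assumption; the names `prune` and `lower_bound_merge` and the list-of-tuples representation are hypothetical, and the merge shown is the pointwise-minimum reading of a lower-bound merge.

```python
def prune(points):
    """Keep only non-inferior (arrival_time, energy) points.

    A point is inferior if another point is no later and uses no more
    energy. Sorting by time and tracking the best energy seen so far
    leaves a curve whose energy strictly decreases as arrival time grows.
    """
    kept = []
    best_energy = float("inf")
    for t, e in sorted(points):
        if e < best_energy:
            kept.append((t, e))
            best_energy = e
    return kept

def lower_bound_merge(curve_a, curve_b):
    """Merge the curves of two module alternatives for the same node."""
    return prune(curve_a + curve_b)
```

For instance, merging a fast, high-energy alternative with a slow, low-energy one keeps both points, while any alternative that is both slower and more costly than an existing point disappears.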
IV. The Scheduling Algorithm
We first describe scheduling of DFGs which are trees.
The goal here is to obtain a minimum energy solution that binds the operations in the DFG to modules in the library while satisfying a computation time constraint. It is a simple exercise to formulate this problem as an integer linear programming problem (ILP). However, the ILP formulation does not take advantage of the problem structure and is in general very difficult and inefficient to solve. Instead, we use a dynamic programming approach as described next.
A. Post-order traversal
A post-order traversal of the tree is performed, where for each node n and for each module alternative at n, a new in the common region among all delay functions in order to ensure that the resulting merged function reflects feasible matches at the children of n. Note that the energy consumed in level shifters is computed during the post-order traversal by keeping track of the voltages used in the current node and its children (using Table II and switching activity information). The delay functions for successive module alternatives at the same node n are then merged by applying a lower-bound merge operation on the corresponding delay functions. See [11] for details of the operations.
The delay function addition and merging are performed recursively until the root of the tree is reached. The resulting function is saved in the tree at its corresponding node. Thus each node of the tree will have an associated
Corollary IV.2: If the tree is node-balanced (its height is logarithmic in the number of its leaf nodes), then our dynamic programming algorithm runs in polynomial time.
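The post-order computation can be sketched for a small tree as follows. This is a simplified illustration, not the algorithm of [11]: curves are kept per supply voltage so that level shifter energy can be charged on each edge, the level shifter cost ignores switching activity, and all names (`node_curves`, `best_energy`, the `tree` and `library` layouts) and all numbers are hypothetical.

```python
def prune(points):
    """Drop inferior (arrival, energy) points; result has decreasing energy."""
    kept, best = [], float("inf")
    for t, e in sorted(points):
        if e < best:
            kept.append((t, e))
            best = e
    return kept

def min_energy_by(curve, deadline):
    """Smallest energy on a curve with arrival <= deadline, or None."""
    feasible = [e for t, e in curve if t <= deadline]
    return min(feasible) if feasible else None

def node_curves(node, tree, library, shifter_energy):
    """Return {voltage: pruned curve} for the subtree rooted at node."""
    op, children = tree[node]
    # Post-order: children are solved before the node itself.
    below = [node_curves(c, tree, library, shifter_energy) for c in children]
    result = {}
    for v, delay, energy in library[op]:
        # Each child's curve as seen from voltage v, level shifters included.
        seen = [prune([(t, e + shifter_energy(vc, v))
                       for vc, curve in cc.items() for t, e in curve])
                for cc in below]
        if not seen:                      # leaf operation
            points = [(delay, energy)]
        else:
            points = []
            for ready in sorted({t for curve in seen for t, _ in curve}):
                parts = [min_energy_by(curve, ready) for curve in seen]
                if None not in parts:     # inside the common region
                    points.append((ready + delay, energy + sum(parts)))
        # Lower-bound merge with other alternatives at the same voltage.
        result[v] = prune(result.get(v, []) + points)
    return result

def best_energy(tree, root, library, shifter_energy, deadline):
    """Minimum total energy meeting the deadline, or None if infeasible."""
    curves = node_curves(root, tree, library, shifter_energy)
    options = [min_energy_by(c, deadline) for c in curves.values()]
    options = [e for e in options if e is not None]
    return min(options) if options else None

# Hypothetical example: one multiplier fed by two adders.
tree = {"r": ("mul", ["a", "b"]), "a": ("add", []), "b": ("add", [])}
library = {  # operation -> [(voltage, delay, energy), ...]
    "add": [(5, 10, 4.0), (3, 20, 1.5)],
    "mul": [(5, 30, 20.0), (3, 60, 7.5)],
}
shifter = lambda v_from, v_to: 0.0 if v_from == v_to else 0.5
```

With these made-up numbers, `best_energy(tree, "r", library, shifter, 40)` selects the all-5 V solution with energy 28.0, while relaxing the deadline to 80 allows the all-3 V solution with energy 10.5. Recovering the actual voltage assignment would require storing back-pointers with each point, which the sketch omits.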
D. Extension to general DFG's
The delay functions for nodes of a general DFG are computed by a post-order traversal, as was the case for a tree-like DFG. The key question is how to add up the energy cost of the children of a node during the post-order step.
We have adopted a heuristic whereby the energy value of a multiple fanout point is divided by its fanout count when it is propagated upward in the DFG. This heuristic is also adopted in technology mapping programs such as MIS [13] or ad-mapper [4] and tends to produce good results.
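A minimal sketch of this fanout heuristic, assuming a delay curve is stored as a list of (arrival time, energy) pairs; the function name `share_of_energy` and that representation are assumptions for illustration.

```python
def share_of_energy(curve, fanout):
    """Energy share that one parent sees from a node with the given fanout.

    Arrival times are unchanged; only the accumulated energy is divided,
    so the shared subgraph is counted roughly once across all its parents.
    """
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    return [(t, e / fanout) for t, e in curve]
```

Because each parent optimizes against only its share of the energy, the parents may prefer different points on the shared node's curve, which is one reason the result is no longer guaranteed to be optimal on general DFG's.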
General DFG's contain conditional branches. We use nodes D and J to indicate the distribute and join nodes in order to express the conditional branches. For each D and J pair (which serve as synchronization points), there are two subgraphs which represent the 'true' and 'false' conditions, respectively. We treat the two subgraphs as if they were two simultaneous (parallel) subgraphs and apply the dynamic programming technique on each subgraph, except for the following. During the post-order traversal, when we come to a D node, we do not divide the cost of the subgraph rooted at D by two (in the case of a single branch). Furthermore, when we come to a J node, we weight the cost of each branch by the probability that the branch is taken and then add the weighted branch costs to obtain the cost of the J node.
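The weighting at a join node can be written out as a short worked equation. The symbols $p_T$, $E_T$, $E_F$ and $E_J$ and the numeric values are introduced here only for illustration.

```latex
\[
E_J = p_T \, E_T + (1 - p_T) \, E_F ,
\]
\[
p_T = 0.25,\; E_T = 10\ \mathrm{pJ},\; E_F = 4\ \mathrm{pJ}
\;\Longrightarrow\;
E_J = 0.25 \cdot 10 + 0.75 \cdot 4 = 5.5\ \mathrm{pJ} .
\]
```

Here $p_T$ is the probability that the 'true' branch is taken, and $E_T$ and $E_F$ are the accumulated energies of the two branch subgraphs.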
E. Module sharing after scheduling
It is difficult to account for the possibility of module sharing during dynamic programming. An attempt to consider sharing during the module assignment and scheduling phase will violate the principle of optimality that is the basis for using dynamic programming. This is because the dynamic programming cost at the root of a subtree cannot be determined independently of the rest of the tree (which is not yet mapped), so the optimal solution cannot be obtained by merging optimal solutions for the corresponding subproblems.
After scheduling is completed, a module allocation and binding algorithm is applied whose goal is to exploit the possibility for sharing modules among compatible operations. This algorithm uses conventional techniques to detect operation compatibility and mutual exclusiveness of operations (as in parallel branches).
We use a scheme similar to that of [14] for minimum energy module binding using a max-cost network flow algorithm. Details can be found in [15].
V. Functionally Pipelined Data-path

A. Background
In a functionally pipelined design, several instances of the execution of a data flow graph are overlapped in time. The time domain is discretized into time steps (for a given length of a time step). Unlike structural pipelining, there are no physical (only logical) stages in a functional pipeline. Structural pipelining implies the use of pipelined modules, such as a 4-stage pipelined multiplier. Both functional and structural pipelining aim to increase the throughput of computation. Latency $L$ is defined as the number of time steps between two consecutive pipeline initiations. A control step or c-step is a group of time steps that overlap in time (cf. Fig. 3). For a given latency $L$, c-step $i$ corresponds to time steps $i + (m \times L)$, where $m$ is an integer. We denote the $L$ consecutive c-steps in a pipeline initiation as a frame. When the supply voltage level of a module is lowered, its delay increases and the operation assigned to the module may become multi-cycle. If the voltage is further lowered, for a small pipeline initiation latency $L$, an operation may become multi-frame.
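The relation between time steps, c-steps and operation length can be sketched as below. The 1-based numbering and the function names are assumptions for illustration; the boundary case of a delay exactly equal to $L$ time steps is treated here as multi-cycle, which is also an assumption.

```python
def c_step(time_step, latency):
    """1-based c-step of a 1-based time step: steps i, i+L, i+2L, ... map to i."""
    return (time_step - 1) % latency + 1

def operation_kind(k, latency):
    """Classify an operation whose delay is k time steps.

    Assumed convention: more than one time step is multi-cycle, and more
    than one frame (k > latency) is multi-frame.
    """
    if k <= 1:
        return "single-cycle"
    return "multi-frame" if k > latency else "multi-cycle"
```

With a latency of 3, time steps 1, 4 and 7 all fall into c-step 1, so operations scheduled in those time steps of overlapping pipeline instances compete for the same hardware.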
The computation time $T_{comp}$ of a functionally pipelined data-path is defined as the total time needed to process one data sample. Normally, a functionally pipelined circuit has to meet some throughput and/or computation time constraints. The throughput constraint is often more important than the computation time constraint in a functionally pipelined design. Suppose we are given N input samples to be processed by a functionally pipelined data-path. Let $T_{comp}$ be the
In our problem, the latency $L$ and $t_c$ are assumed to be given. Therefore, when we minimize $(E_{FU} + E_{LS})$, which is the average total energy used by all modules and level shifters per pipeline initiation, we are indeed minimizing the average power dissipation. An algorithm for performing scheduling and allocation for functionally pipelined DFG's is described in [16]. This technique, known as feasible scheduling, deals with single-cycle operations and operations that can be chained together in one c-step, but not with multi-cycle or multi-frame operations.
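The link between energy per initiation and average power can be made explicit with a short relation. It assumes steady-state operation with one pipeline initiation every $L \times t_c$; the symbol $P_{avg}$ and the numeric values are hypothetical.

```latex
\[
P_{avg} \approx \frac{E_{FU} + E_{LS}}{L \, t_c},
\qquad
\frac{300\ \mathrm{pJ}}{2 \times 30\ \mathrm{ns}} = 5\ \mathrm{mW} .
\]
```

Since $L$ and $t_c$ are fixed, the denominator is a constant and minimizing the numerator minimizes $P_{avg}$.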
B. Handling multi-frame operations
Our goal is to obtain a minimum energy functionally pipelined data-path realization while meeting the global throughput constraint (which is described by two parameters, $t_c$ and $L$). Suppose there is a module $M_A$ with delay equal to $k \times t_c$, where $k/L > 1$, which is capable of performing an operation A in the DFG. To sustain the initiation rate of one data sample per $L \times t_c$, we use $\lceil k/L \rceil$ modules for operation A and use a revolving schedule as described next. In the following, we show that the revolving schedule is the best possible schedule in terms of the number of module instances used.
Suppose that we have modules M
Theorem V.2: For any module with delay $k \times t_c$, where $k/L > 1$, $\lceil k/L \rceil$ is the theoretical lower bound on the number of modules that have to be utilized in order to perform the corresponding operation with the pipeline latency of $L$ without creating any resource conflict.
We next discuss how the dynamic programming approach has to be modified for functionally pipelined designs. We consider three cases.
1) Operation delay $k \times t_c$ is larger than $L \times t_c$. As shown before, here we have no choice but to use $\lceil k/L \rceil$ modules to perform the operation without creating any resource conflict while meeting the global throughput constraint. Recall that each module is active only when it is performing an operation; otherwise, it is in the sleep mode. In any time interval, given $t_c$ and $L$, the total number of operations is the same regardless of the number of modules used to execute those operations. The total energy consumption for processing N data samples can be calculated as follows. Let the input vectors feeding a module $M_A$ be denoted by $V_1$, $V_2$, $V_3$, $V_4$, etc. Suppose the corresponding operation becomes multi-frame and thus we need to duplicate the module into $M_{A1}$ and $M_{A2}$. The input sequence feeding $M_{A1}$ is $V_1$, $V_3$, etc., whereas that feeding $M_{A2}$ is now $V_2$, $V_4$, etc. Obviously, the input activities for $M_{A1}$ and $M_{A2}$ are different from that of $M_A$. However, the activities for $M_{A1}$
3) Operation delay $k \times t_c < L \times t_c$. We use one module per operation; however, the module may be shared. We again relegate the sharing issue to a post-processing phase where the scheduling solution obtained by the dynamic programming approach is further modified to increase module sharing (thus reducing the area cost of the design).
is a track which is circular in nature, i.e., the Lth c-step in the current frame comes before the first c-step of the next frame).
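The module count in Theorem V.2 can be illustrated with a small simulation. The sketch assumes that the revolving schedule hands successive pipeline initiations to the module copies in round-robin order, which is consistent with the $V_1, V_3, \ldots$ and $V_2, V_4, \ldots$ input sequences above; the function name, the `modules` override and the tuple layout are hypothetical.

```python
def revolving_schedule(k, latency, samples, modules=None):
    """Assign pipeline initiations to module copies in round-robin order.

    k, latency : operation delay and initiation latency, in time steps
    samples    : number of pipeline initiations to simulate
    modules    : number of copies; defaults to ceil(k / latency)
    Returns (module_count, [(sample, module, start, end), ...]) and raises
    RuntimeError if a copy would have to start before it is free.
    """
    if modules is None:
        modules = -(-k // latency)        # integer ceiling of k / latency
    free_at = [0] * modules
    plan = []
    for n in range(samples):
        start = n * latency               # one initiation every L time steps
        m = n % modules                   # revolving (round-robin) choice
        if free_at[m] > start:
            raise RuntimeError(f"resource conflict on module {m} at step {start}")
        free_at[m] = start + k
        plan.append((n, m, start, start + k))
    return modules, plan
```

For example, with a delay of 5 time steps and a latency of 2, three copies suffice, whereas forcing `modules=2` produces a conflict at the third initiation.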
The exact solution is obtained by the algorithm proposed in [18], which solves the register allocation problem in cyclic data flow graphs by using a multi-commodity flow formulation. Instead, we have adopted a less expensive heuristic for doing module sharing, as described next.
Resource conflicts in a functionally pipelined data-path can be detected in a straightforward manner. See [11] for details.
VI. Experimental Results
We first present the result obtained by our algorithm on a small DAG (not a tree) and the result obtained by exhaustive search. We assume four voltages are available and that all primary inputs carry 5-V signals. The module library is shown in Table III. The energy consumed by the level shifters is shown in Table II. In this example, the length of a c-step is 30 ns and the total computation time constraint is $T_{comp}$ = 700 ns. The results of the dynamic programming algorithm and exhaustive search are shown in Fig. 4. Note that our new method can handle a very large graph (thousands of nodes) in seconds, but the exhaustive search (and the ILP formulation), which can be used to obtain the true optimal solution, can only handle a small example (20 nodes) in a reasonable amount of time. The two solutions obtained are different, but the results show that our solution is only 1% away from the optimal solution.
In the remainder of this section, we present detailed results of our algorithm on a number of standard benchmarks including a Test DFG, AR Filter, Elliptical Wave Filter, Discrete Cosine Transform, Robotic Arm Controller, 2nd-
The columns corresponding to $E^i_{LS}/E_i$ are the percentage of energy consumed in level shifters over the total energy. The results show that although the power consumed in level shifters is not negligible, it is not large either. Note that we can delete level shifters for step-down voltage conversions as described in [7]. In our experiments, however, we inserted the level shifters for both step-up and step-down conversions. Table V shows that an average energy saving of 3.88%, 40.19% and 64.8% is achieved when using 3 supply voltage levels with the total computation time ($T_{comp}$) set to $T_{crit}$ (the longest path delay in the DFG), $1.5\,T_{crit}$ and $2\,T_{crit}$, respectively.
Energy saving for $T_{comp} = T_{crit}$ is very much circuit-dependent. That is, the energy saving is higher in circuits where the number of non-critical nodes is large. For the AR filter circuit, the $E_3/E_1$ ratio is as low as 0.85, while for the FDCT circuit this ratio is 1. The energy saving potential increases substantially when $T_{comp} > T_{crit}$; e.g., the $E_3/E_1$ ratio for $T_{comp} = 1.5\,T_{crit}$ goes down to 0.60 and 0.57 for the AR filter and FDCT circuits, respectively.
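These ratios translate into percentage savings as follows, assuming that $E_3$ and $E_1$ denote the energy obtained with three supply voltage levels and with a single level, respectively:

```latex
\[
\text{saving} = 1 - \frac{E_3}{E_1} :
\qquad
1 - 0.85 = 15\%, \quad
1 - 0.60 = 40\%, \quad
1 - 0.57 = 43\% .
\]
```

The first value is the AR filter at $T_{comp} = T_{crit}$; the last two are the AR filter and FDCT circuits at $T_{comp} = 1.5\,T_{crit}$.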
In the functionally pipelined case, we can achieve lower average energy for a given throughput constraint (which is described by two parameters, $t_c$ and $L$) by using a larger computation time, because a larger $T_{comp}$ will result in a solution that uses lower voltages and thereby lower average energy. Note that the throughput of the functional pipeline remains the same. However, this causes more operations to become multi-cycle or multi-frame operations, which will increase the number of modules used to achieve the same throughput constraint. Thus the computation time constraint indirectly controls the chip area.
VII. Conclusion
We presented a dynamic programming approach for assigning voltage levels to the modules in non-pipelined and functionally pipelined data-paths. The average power consumption can be reduced by using a single lowered supply voltage. If the computation time constraint is violated with only a single lower supply voltage, then pipelining or parallelism on the whole or part of the circuit has to be used to recover performance. Although this is one way of trading chip area for power, the area penalty is generally much higher. With a given computation time constraint, when multiple voltages are used, our algorithm will lower the supply voltages of operations which are not on the critical path while keeping the supply voltages of operations on the critical path at a maximum. The computation time constraint is thus achieved at lower area overhead.
The use of $\lceil k/L \rceil$ modules for a multi-frame operation was necessary to maintain the throughput while reducing the average energy consumption in the data-path. This, however, increases the controller and multiplexor cost. The multiplexor cost can be easily obtained from Table I. An energy cost model for the controller can be developed as a function of the total number of functional units used in the circuit, for example, by assuming that the energy cost of the controller scales with the $\log_2$ of the number of modules of a given type used in the circuit. Therefore, during the post-order traversal, we can add a term reflecting the extra energy consumed in the controller when using $\lceil k/L \rceil$ modules to implement an operation. Thus the accumulated energy consumed at each node in the DFG will include the energy consumed in the controller.
