Abstract-This paper presents resource and latency constrained scheduling algorithms to minimize power/energy consumption when the resources operate at multiple voltages (5 V, 3.3 V, 2.4 V, and 1.5 V). The proposed algorithms are based on efficient distribution of slack among the nodes in the data-flow graph. The distribution procedure tries to implement the minimum energy relation derived using the Lagrange multiplier method in an iterative fashion. Two algorithms are proposed: 1) a low-complexity $O(n^2)$ algorithm and 2) a high-complexity $O(n^2 \log L)$ algorithm, where $n$ is the number of nodes and $L$ is the latency. Experiments with some HLS benchmark examples show that the proposed algorithms achieve significant power/energy reduction. For instance, when the latency constraint is 1.5 times the critical path delay, the average reduction is 39%.
I. INTRODUCTION

POWER considerations have become an increasingly dominant factor in the design of both portable and desktop systems. An effective way to reduce power consumption is to lower the supply voltage level of a circuit. Reducing the supply voltage, however, increases the circuit delay and reduces the throughput. To maintain the throughput, parallelism and/or pipelining has to be incorporated [1]. The resulting circuit consumes lower average power while meeting the global throughput constraint at the cost of increased circuit area. Another way of maintaining the throughput is to use resources operating at multiple voltages [2]-[7]. This has the advantage of allowing modules on the critical paths to be assigned to the highest voltage levels (thus meeting the required timing constraints) while allowing modules on noncritical paths to be assigned to the lower voltages (thus reducing the power consumption). Supporting multiple voltages on chip poses many challenges, e.g., incorporation of multiple power-distribution grids, efficient level converters, etc. However, the viability of this method has been successfully demonstrated in the design of an MPEG-4 video codec in [8].
In this paper, we address the problem of scheduling a data-flow graph (DFG) under resource and latency constraints for the case when the resources operate at multiple voltages. We assume that the resources follow different voltage-delay curves. The scheduling algorithm minimizes power/energy consumption for resources operating at 5 V, 3.3 V, 2.4 V, and 1.5 V. The operating voltages are not restricted to these four values, however: the resources can operate at any voltage for which the corresponding delay is known.
The proposed algorithm operates in two passes. In the first pass, minimum-time, resource-constrained scheduling is done. In the second pass, the difference between the given latency and the time needed by the resource-constrained schedule (obtained in the first pass) is distributed among the nodes in a way that minimizes the total power/energy consumption. The distribution procedure (derived using the Lagrange multiplier method) uses the energy/delay (E/D) ratio of the nodes to distribute the slack. The procedure is implemented by an iterative algorithm in which, in each iteration, an increasing number of resources with high E/D ratio are disabled. The iterations continue until there is a timing violation. Two algorithms are proposed: 1) an $O(n^2)$ algorithm and 2) a more complex $O(n^2 \log L)$ algorithm, where $n$ is the number of nodes and $L$ is the latency. The $O(n^2 \log L)$ algorithm gives a schedule with lower energy than the $O(n^2)$ algorithm. Experiments with some HLS benchmarks (DIFFEQ, AR lattice, EW filter, FIR filter, FFT4, and DCT) show the effectiveness of these approaches in reducing power/energy. When the latency constraint is tight, the average reduction is 17.5%; when the latency constraint is 1.5 times the critical path delay, the average reduction is 39%; and when the latency constraint is two times the critical path delay, the average reduction is 58.5%.
There are several scheduling algorithms for multiple-voltage resources in the literature today [2]-[7]. These algorithms can be classified into i) only latency-constrained (i.e., latency is a hard constraint and resources are minimized) [3], [4], ii) only resource-constrained (i.e., resource is a hard constraint) [5], and iii) latency and resource constrained (i.e., both latency and resource are hard constraints) [6], [7]. While the (only) latency-constrained and (only) resource-constrained algorithms have polynomial or pseudopolynomial time complexity, the latency and resource-constrained algorithms are based on integer linear programming and have (worst case) exponential time complexity. In this paper, we propose a heuristic algorithm for latency and resource-constrained scheduling that produces comparable results with only polynomial time complexity.
The rest of the paper is organized as follows. Section II presents the definitions and the Lagrange formulation. Section III describes the algorithms and illustrates them with examples. Section IV includes the results on some HLS benchmark examples. Section V concludes the paper.
II. PRELIMINARIES
A. Definitions
The input to our algorithm is a data flow graph, a timing constraint, and a resource constraint.
Timing Constraint: This is the time available to execute the operations in the data flow graph. It is also referred to as the latency constraint.
Resource Constraint: This is specified by the number of resources for each type, where each type of resource can only be operated at a specific voltage. The corresponding energy and delay values are also given. Examples of resources include adder/subtractor (denoted by ), multiplier (denoted by ), etc.
In addition, we assume that a level shifter is needed between resource A and resource B only if resource A operates at a lower voltage compared to resource B. The number of level shifters is not user defined. In fact, the proposed algorithm tries to reduce the number of level shifters.
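The level-shifter rule stated above can be illustrated with a minimal sketch. The function name, the edge list, and the voltage assignment below are hypothetical and chosen only for illustration; the rule itself (a shifter is needed only when the producing resource operates at a lower voltage than the consuming resource) is the one assumed in the text.

```python
def count_level_shifters(edges, voltage):
    """Count DFG edges whose source node runs at a lower voltage than its destination."""
    return sum(1 for src, dst in edges if voltage[src] < voltage[dst])


# Hypothetical four-node DFG: a -> c, b -> c, c -> d.
edges = [("a", "c"), ("b", "c"), ("c", "d")]
# Hypothetical voltage assignment (volts) for each node.
voltage = {"a": 5.0, "b": 2.4, "c": 5.0, "d": 3.3}

shifters = count_level_shifters(edges, voltage)
print(shifters)  # only the edge b -> c goes from a lower to a higher voltage
```

An edge from a higher-voltage node to a lower-voltage node (such as c -> d here) needs no shifter under this assumption.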
B. Slack Distribution Using the Lagrange Multiplier Method
In the proposed algorithm, the Lagrange multiplier method is used to distribute the slack among the nodes in the critical path. The relation between the voltages of the nodes in the critical path is derived in the following way.
The delay of a resource is determined by the delay of the gates on the critical path. Let be the delay of gate operating at voltage and let be the load capacitance of gate (1) Then the delay of the resource operating at voltage is equal to sum of the delay of the gates on the critical path (2) where is the sum of the capacitances of the gates on the critical path of resource .
The energy of resource operating at volts is given by (3) where is the average switching activity of the resource, and is the total load capacitance of the resource. Note that is typically much larger than . Our aim is to minimize subject to the time constraint . We use the Lagrange multiplier method to determine the supply voltage of each node . . . We find that is minimum when the following condition is satisfied among the nodes in the critical path Since computing the supply voltage for the above equation can be a computational burden, we simplified the equation. For , this is approximated to
The error as a result of approximation is 0.1% [10] . Since ratio is a constant for each resource, from (2) and (3) 
To satisfy (5), nodes with high (such as wide multipliers) have to be assigned to lower-voltage resources. This is the basis of our slack distribution algorithm.
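For readers who want the intermediate step, the stationarity argument can be sketched as follows. This is an illustrative derivation under stated assumptions, not a transcription of the numbered equations above: the symbols $E_i$, $D_i$, $V_i$, $\alpha_i$, $C_{L,i}$, $C_i$, $k$, $\lambda$, and $T$ are introduced here, the delay model is a first-order one, and the threshold voltage is neglected.

```latex
\begin{align*}
\mathcal{L} &= \sum_i E_i(V_i) + \lambda\Big(\sum_i D_i(V_i) - T\Big), \\
\frac{\partial \mathcal{L}}{\partial V_i} = 0
  \;&\Rightarrow\; \frac{dE_i/dV_i}{dD_i/dV_i} = -\lambda
  \quad \text{for every node } i \text{ on the critical path}, \\
E_i = \alpha_i C_{L,i} V_i^2,\quad D_i \approx \frac{k\,C_i}{V_i}
  \;&\Rightarrow\; \frac{dE_i/dV_i}{dD_i/dV_i}
  = \frac{2E_i/V_i}{-D_i/V_i} = -\frac{2E_i}{D_i}, \\
&\Rightarrow\; \frac{E_i(V_i)}{D_i(V_i)} = \frac{\lambda}{2}
  \quad \text{for all } i .
\end{align*}
```

Under these assumptions the energy/delay ratio is equalized across the critical-path nodes, and since the ratio of a node falls as its voltage is lowered, a node whose ratio is high at a given voltage is the one that should move to a lower voltage.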
III. RESOURCE AND LATENCY-CONSTRAINED SCHEDULING: ALGORITHM AND EXAMPLES
Algorithm Overview: In a nutshell, the proposed resource and latency-constrained algorithm operates in two passes. In the first pass, resource-constrained scheduling is done in a way that minimizes the computation time. In the second pass, the slack between the given latency and the computation time obtained by the scheduling algorithm in the first pass is distributed to the nodes such that the total power/energy consumption is minimized. The Lagrange multiplier method described in Section II-B is used to find the optimal distribution of slack between the nodes.
A. First Pass: Minimum-Time Resource-Constrained Algorithm
The minimum-time resource-constrained algorithm schedules the nodes such that the computation time is minimum. To ensure that a feasible schedule is obtained, it calculates the number of ready nodes in each cycle in a special way. Define:
• : the minimum number of control cycles required to schedule all the nodes of type in the ready set corresponding to the th control cycle • : the number of ready nodes of type that will be scheduled in control cycle to the resources operating at volts. In this scheme, is first calculated and then is used to calculate . Use of results in the ready nodes being scheduled in a way that takes minimum time, thereby increasing the feasibility of the algorithm. Power consumption reduction is not considered in this step of the algorithm.
The proposed algorithm is list based [9] . The nodes in a ready set are prioritized based on the freedom:
nodes with the lowest freedom are chosen among the ready nodes. Then the nodes are scheduled such that if the freedom of a node is low, it is assigned to a high-voltage resource and if the freedom of a node is high, it is assigned to a low-voltage resource. This is implemented in two different ways. Let be the number of available resources of type operating at volts in cycle . Then in the first scheme, nodes with higher freedom are scheduled to the resources with lower voltage, while in the second scheme nodes with lower freedom are scheduled to the resources with higher voltage. The first scheme has the advantage of lower-power consumption since more low-voltage resources are utilized. However, it can result in an infeasible solution when the given latency is tight. The second scheme has higher feasibility of scheduling but possibly higher-power consumption. In our procedure, both the schemes are implemented, and the schedule which is feasible and takes lower time is chosen for the minimum-time algorithm (first pass) and the schedule which is feasible and consumes lower power is chosen for the low power algorithm (second pass).
Finally, if the freedom of a node is higher than the delay of the resource to which it is assigned, the node can be scheduled to a lower voltage resource, if available. While this improves the power consumption, it can result in increased number of level shifters, since the extra freedom allows the node to be scheduled to a lower voltage than its children.
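The two assignment schemes described above can be sketched as follows. This is a simplified illustration, not the paper's Algorithm Schedule: the function name, the node names, and the freedom values are hypothetical, resource delays are ignored, and only a single control cycle with one resource per voltage is modeled.

```python
def assign_by_freedom(freedom, voltages, scheme):
    """Map ready nodes to available voltages for one control cycle.

    freedom  -- dict: ready node -> freedom (slack), hypothetical values
    voltages -- voltages of the resources available in this cycle
    scheme   -- 1: highest-freedom nodes go to the lowest voltages first
                2: lowest-freedom nodes go to the highest voltages first
    """
    # Only as many nodes as there are resources are chosen, lowest freedom first.
    chosen = sorted(freedom, key=freedom.get)[: len(voltages)]
    if scheme == 1:
        nodes = sorted(chosen, key=freedom.get, reverse=True)
        levels = sorted(voltages)
    else:
        nodes = sorted(chosen, key=freedom.get)
        levels = sorted(voltages, reverse=True)
    return dict(zip(nodes, levels))


ready = {"a": 0, "e": 4}            # hypothetical freedoms
available = [5.0, 3.3, 2.4]         # one multiplier per voltage
low_power = assign_by_freedom(ready, available, scheme=1)
safe = assign_by_freedom(ready, available, scheme=2)
print(low_power)
print(safe)
```

With two ready nodes and three multipliers, scheme 1 leaves the 5 V multiplier idle (lower power, higher risk of a timing violation), while scheme 2 leaves the 2.4 V multiplier idle.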
The computation time obtained by application of the minimum-time resource-constrained algorithm is referred to as . Thus if the latency constraint , a feasible solution cannot be obtained.
Algorithm
ii) Start scheduling from the lowest-freedom node to the resources with the highest voltage.
iii) For the minimum time algorithm, choose the schedule [among i) and ii)] that is feasible and results in lower time. For the low-power algorithm, choose the schedule [among i) and ii)] that is feasible and results in lower power.
b) If the delay of the scheduled node is less than the freedom of the node, reschedule the node to the lower available voltage resource.
Calculation of Parameter
: Recall that is the number of ready nodes of type that will be scheduled in control cycle to the resources operating at volts. Define as the number of nodes of type that can be scheduled to the resource of type operating at volts during the period cycle to cycle . (6) (7) So, given , , and can be calculated. Note that these parameters have to be calculated for each resource . This will be explained with the following example.
Example 1: Let the resources consist of a 5 V multiplier, a 3.3 V multiplier, and a 2.4 V multiplier. Thus, . Assume that all the resources are available ( for all ). The delays are related by , , and , where is the delay of a multiplier relative to an adder operating at 5 V. In control cycles, three nodes can be assigned to a 5 V multiplier, one node to a 3.3 V multiplier and one node to a 2.4 V multiplier. Thus . Thus, in three control cycles, a maximum of five multiplication nodes can be scheduled under the given resource constraint. So, given , we can easily calculate . Now consider calculating given . Let . Then under the same resource constraint, can be calculated using (1) and (2) satisfies the above equation, resulting in , and .
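The counting argument of Example 1 can be sketched in code. The delay values below are assumptions made for illustration (they are chosen so that three, one, and one nodes fit in three time units, matching the counts quoted in the example), and the function names are hypothetical; the sketch is one plausible reading of the calculation, not a reproduction of (6) and (7).

```python
def capacity(delays, cycles):
    """Number of nodes that fit in `cycles` time units, one resource per voltage."""
    return sum(cycles // d for d in delays.values())


def min_cycles(delays, ready):
    """Smallest number of time units in which `ready` nodes can all be scheduled."""
    t = 0
    while capacity(delays, t) < ready:
        t += 1
    return t


def split(delays, cycles):
    """How many nodes each voltage level takes within `cycles` time units."""
    return {v: cycles // d for v, d in delays.items()}


# Assumed multiplier delays (in time units) at 5 V, 3.3 V and 2.4 V.
delays = {5.0: 1, 3.3: 2, 2.4: 3}

print(capacity(delays, 3))    # nodes schedulable in three time units
print(min_cycles(delays, 5))  # time units needed for five ready nodes
print(split(delays, 3))       # per-voltage breakdown
```

Computing the capacity for a given time is a direct sum, while the inverse problem (time for a given number of ready nodes) is solved here by a simple linear search, which is adequate because the capacity is nondecreasing in time.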
B. Second Pass: Low-Power Algorithm
At the end of the first pass, if , each node has a nonzero slack. The objective of the low-power algorithm is to distribute the available slack between the nodes (i.e., determine the voltage assignment of the nodes) such that the latency constraint is satisfied and the minimum energy condition (5) is satisfied as much as possible. Recall that the voltage assignment is directly related to the operating voltage of the corresponding hardware resource. The procedure is iterative; in each iteration, an increasing number of resources with high E/D values are disabled and the nodes are scheduled. The order in which the resources are disabled is determined using (5). After each iteration, the minimum energy condition (5) is better satisfied. The iterations continue until there is a timing violation. We motivate our procedure with the help of the following example.
Example 2: Consider a DFG with an addition node being fed to a multiplication node. The resource constraint consists of multipliers and adders operating at 5 V, 3.3 V, 2.4 V, and 1.5 V, and the latency is 17 control cycles. The energy-delay ratio of the multiplier at V is , and that of the adder is . For other values of , is still . From (5), the condition for minimum energy is or (8) where is the voltage that is assigned to the multiplication (addition) node. Our aim is to schedule the nodes without violating the timing constraint such that the minimum energy equation is satisfied as much as possible.
Assume that multipliers and adders operating at 5 V, 3.3 V, and 2.4 V are available. In the beginning, let both the multiplication and the addition node be assigned to the 5 V resources. For this assignment, the computation time is 6, which is less than the latency . Furthermore, for , the left-hand side of (8) is 8, which is larger than the right-hand side of (8), which is 4. In order to satisfy (4) better, we choose to assign the multiplication node to the 3.3 V multiplier. Choosing the 3.3 V multiplier is equivalent to disabling the 5 V multiplier from the set of resources. For this assignment, the left-hand side is 4.6, which is closer to 4 but still higher. The corresponding computation time of 10 is still less than and so we assign the multiplication node to the 2.4 V multiplier. Choosing the 2.4 V multiplier is equivalent to disabling the 5 V and the 3.3 V multipliers from the set of resources. The computation time is now 16 and the left-hand side is 2.8, which is less than the right-hand side. So it is now the addition node that has to be assigned to a lower-voltage resource, namely, a 3.3 V adder. With this assignment, the computation time is 17 and the right-hand side is 2.3, which is closer to the left-hand side. Note that while the minimum energy condition is not satisfied (the left-hand side and right-hand side of (8) are close but not equal), this assignment is the closest to the minimum-energy solution. We cannot lower the voltages any further since that would cause a latency violation. From this example, we see how (8) helps us to determine the order in which the resources should be disabled so that the minimum energy condition is satisfied as much as possible.
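The walk-through of Example 2 can be replayed with a small greedy sketch. Several values are assumptions and are labeled as such: the individual node delays are chosen only so that the chain totals match the computation times quoted in the example (6, 10, 16, 17), the 2.4 V adder entries are extrapolated, and the fallback to the other node when the preferred one cannot be lowered is a simplification introduced here.

```python
# Value of each side of (8) per voltage. Quoted in Example 2, except the
# 2.4 V adder entry, which is an assumed extrapolation.
SIDE = {"mul": {5.0: 8.0, 3.3: 4.6, 2.4: 2.8},
        "add": {5.0: 4.0, 3.3: 2.3, 2.4: 1.4}}
# Assumed node delays (control cycles); only their sums come from the example.
DELAY = {"mul": {5.0: 5, 3.3: 9, 2.4: 15},
         "add": {5.0: 1, 3.3: 2, 2.4: 3}}
LEVELS = [5.0, 3.3, 2.4]  # available voltages, highest first


def next_lower(v):
    """Next lower available voltage, or None if v is already the lowest."""
    i = LEVELS.index(v)
    return LEVELS[i + 1] if i + 1 < len(LEVELS) else None


def chain_time(volt):
    """Computation time of the adder -> multiplier chain."""
    return sum(DELAY[n][volt[n]] for n in volt)


def distribute_slack(latency):
    volt = {"mul": 5.0, "add": 5.0}
    while True:
        # Prefer lowering the node whose side of the condition is larger.
        order = sorted(volt, key=lambda n: SIDE[n][volt[n]], reverse=True)
        for node in order:
            lower = next_lower(volt[node])
            if lower is None:
                continue
            trial = dict(volt, **{node: lower})
            if chain_time(trial) <= latency:
                volt = trial
                break
        else:
            # No node can be lowered without a timing violation.
            return volt, chain_time(volt)


result = distribute_slack(17)
print(result)
```

With a latency of 17 control cycles the sketch ends with the multiplication node at 2.4 V and the addition node at 3.3 V, the assignment reached in the example.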
1) Priority Assignment:
The priority of which resources to disable first is determined using (5) and the energy values of [4] that are quoted in Table I. The priorities are given in Table II. According to this table, the multipliers operating at 5 V have the highest chance of being disabled, followed by multipliers operating at 3.3 V, followed by adders operating at 5 V, followed by adders operating at 3.3 V, etc.
The proposed algorithm disables resources in the order of their priority. For each configuration, it schedules the nodes and checks if its computation time . If it is true, then it disables the resources with the next highest priority and reschedules the nodes. If it is not true ( ), then the specific resource cannot be disabled for all the control cycles. We next describe two algorithms which differ in determining until what control cycle the specific resource can be disabled.
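The priority-driven loop described above can be sketched as follows. The function and resource names are hypothetical, the scheduler is replaced by a lookup table, and the latency of 30 control cycles is an assumed value; the two computation times (16 and 35) are borrowed from Example 3.

```python
def disable_by_priority(priority, latency, schedule_time):
    """Disable resources in priority order until a timing violation occurs.

    Returns the list of resources disabled for all control cycles and the
    first resource whose disabling violates the latency (None if there is none).
    """
    disabled = []
    for resource in priority:
        trial = disabled + [resource]
        if schedule_time(trial) > latency:
            return disabled, resource
        disabled = trial
    return disabled, None


# Priority order as described for Table II (first four entries).
priority = ["mul@5V", "mul@3.3V", "add@5V", "add@3.3V"]
# Stand-in for the list scheduler: computation time for a set of disabled resources.
times = {("mul@5V",): 16, ("mul@5V", "mul@3.3V"): 35}
latency = 30  # assumed value

fully_disabled, partial = disable_by_priority(
    priority, latency, lambda d: times[tuple(d)])
print(fully_disabled, partial)
```

The resource returned second is the one that can be disabled only for part of the schedule; deciding until which control cycle is the job of Algorithms 1 and 2.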
2) Description of Algorithms 1 and 2:
Assume that a feasible solution exists when resources with priorities 1 through are disabled and that there is a timing violation when resources with priorities 1 through are disabled. For the feasible schedule, let be the computation time and let the finish time
Algorithm 1: This is a one-pass algorithm. Here we start scheduling the nodes with the resource with priority disabled, and after an assignment check if the is greater than . If it is greater, a conflict has occurred, and node and the remaining unassigned nodes are scheduled assuming that the resource with priority has not been disabled.
Algorithm 2: This is an iterative algorithm, and consequently more complex compared to Algorithm 1. The procedure consists of first disabling the resource with priority from control cycle 1 to and checking for timing violations. If , then the resource is disabled from control cycle 1 to . If the new , then the resource is disabled from control cycle 1 to , and so on. After steps, where is the number of voltage levels, is the number of resources and is the latency, the algorithm determines which resources to disable such that the computation time is less than but as close to as possible. Thus, Algorithm 2 iteratively schedules the nodes until unused slack is minimum. As a result, its complexity is times larger than that of Algorithm 1. We explain the operation of Algorithm 2 with the help of the following example.
Example 3: Let the latency . Let the computation time when only the 5 V multiplier is disabled be 16, and when both the 5 V and 3.3 V multipliers are disabled be 35. Thus, while the 5 V multiplier can be disabled for the whole time, the 3.3 V multiplier can only be disabled for part of the time. Algorithm 2 determines until what control cycle the 3.3 V multiplier can be disabled in the following way: in the first iteration, the algorithm disables the 3.3 V multiplier from cycle 1 to cycle 8 (= 16/2) and completes the schedule in 24 cycles. Since there is a positive slack, the algorithm disables the 3.3 V multiplier from cycle 1 to cycle 12 (= 8 + 16/4). The computation time is now 31, which is greater than . So the algorithm disables the multiplier from cycle 1 to cycle 10 (= 12 - 16/8) and completes the scheduling in 28 cycles. Finally, the algorithm disables the multiplier from cycle 1 to cycle 11 (= 10 + 16/16) and completes in 29 cycles, which is as close to as possible. The low-power algorithm is summarized as follows.
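The halving search of Example 3 can be sketched as follows. The function name is hypothetical, the scheduler is replaced by a lookup of the four computation times quoted in the example, and the latency of 30 control cycles is an assumed value that is consistent with the example (29 cycles is feasible, 31 is not).

```python
def find_disable_cutoff(feasible_time, latency, schedule_time):
    """Bisection on the last control cycle up to which a resource stays disabled.

    feasible_time -- computation time of the last fully feasible configuration
    schedule_time -- callable: cutoff cycle -> resulting computation time
    Returns the largest feasible cutoff visited and the (cutoff, time) trace.
    """
    cutoff = feasible_time // 2
    step = feasible_time // 4
    best = 0          # 0 means the resource is never disabled
    trace = []
    while True:
        t = schedule_time(cutoff)
        trace.append((cutoff, t))
        if t <= latency:
            best = max(best, cutoff)
            direction = 1      # slack left: disable for longer
        else:
            direction = -1     # timing violation: disable for fewer cycles
        if step == 0:
            break
        cutoff += direction * step
        step //= 2
    return best, trace


# Computation times quoted in Example 3 for each cutoff of the 3.3 V multiplier.
times = {8: 24, 12: 31, 10: 28, 11: 29}
best, trace = find_disable_cutoff(16, 30, times.__getitem__)
print(best, trace)
```

Each iteration reschedules the whole graph once, so the number of extra scheduling passes grows with the logarithm of the search interval rather than with its length.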
Summary:
The proposed resource and latency constrained scheduling algorithm operates in two passes in the following way.
First Pass: Algorithm Schedule (minimum-time version) Second Pass: Algorithm Low Power (invokes Algorithm 1 or 2) We refer to the version when Algorithm 1 is invoked in Algorithm Low Power as Algorithm LC (or the low-complexity algorithm), and the version when Algorithm 2 is invoked as Algorithm HC (or the high-complexity algorithm).
C. Complexity Analysis
In this section, we show that the worst case complexity of Algorithm LC is $O(n^2)$ and of Algorithm HC is $O(n^2 \log L)$, where $L$ is the latency and $n$ is the number of nodes. The complexity of the Algorithm Schedule is dominated by the step where the nodes are ordered with respect to their freedom. Specifically, in this step, nodes are chosen among ready nodes in control cycle . In the worst case, all the nodes are ready in each control cycle and the complexity for that control cycle is . Note that . Thus, in the worst case, the complexity of Algorithm Schedule is . In Algorithm Low Power, the priority table can be searched in passes, where is the number of voltage levels and is the number of resources. Since the number of voltage levels and number of resources are limited, the term can be ignored for asymptotic analysis.
Algorithm LC consists of Algorithm Schedule and Algorithm Low Power with Algorithm 1 invoked and thus has a complexity of $O(n^2)$. Algorithm HC, on the other hand, includes Algorithm Low Power with Algorithm 2 invoked. Since Algorithm 2 requires additional passes to find the exact control cycle where disabling priority changes, the complexity of Algorithm HC is $O(n^2 \log L)$.
D. Illustrative Example
Example 4: The resource constraint is , , , , and for the DFG in Fig. 1. The timing constraint is . Switching activities at all nodes are assumed to be 0.5. Algorithm 1 is invoked in case of timing violation.
First Pass: In control cycle , the ready sets are {a, e} and {b}. For nodes a and e, , , . Node a is assigned to the 5 V multiplier since it has lower freedom, and node e is assigned to the 3.3 V multiplier [Step 3a of Schedule]. , , and node b is assigned to the 5 V adder. However, since the child of node b (node c) will be available at control cycle 6, we assign node b to the lower-voltage resource (2.4 V).
In control cycle , the ready set is {c}. , , and node c is assigned to the 5 V adder. In control cycle , the ready set is {d}. , , and node d is assigned to the 5 V multiplier. In control cycle , the ready set is {f}. , . But since the 5 V multiplier is not available in control cycle 10, no scheduling is done. Node f is assigned to the 5 V multiplier in control cycle 12.
The schedule at the end of the first pass is shown in Fig. 2(a) . . Since , the multipliers operating at 5 V are disabled first (according to the priority table).
. Second Pass: In control cycle , the ready sets are {a, e} and {b}. For multiplication nodes a and e, , ,
. So, node a is assigned to the 3.3 V multiplier since it has lower freedom and node e is assigned to the 2.4 V multiplier.
, , and node b should be assigned to the 5 V adder. However, the child of node b (node c) will be available at control cycle 10. So we assign node b to the lower-voltage resource (2.4 V) [Step 3b of Schedule].
In control cycle , the ready set is {c}. , , and node c is assigned to the 5 V adder. In control cycle , the ready set is {d}. , , and node d is assigned to the 3.3 V multiplier. In control cycle , the ready set is {f}. ,
. Scheduling node f to 3.3 V violates its finish time . Thus, the algorithm shifts back to the previous schedule (no resources are disabled) and node f is assigned to the 5 V multiplier.
The final schedule is given in Fig. 2(b) . Note that the computation finishes in 20 cycles, which is less than .
E. Other Issues 1) Switching Activity Consideration:
Switching activity of the resources depends on the switching probability of the input data and the circuit structure. The energy values of the resources and the priority table for disabling resources assume that the input switching activity is the same for all the resources. However, when the correlation of the multiplexed input data streams is high, the input switching activity is low and the energy consumption of the resource is low. Our analysis in [10] shows that if the switching activity of the resources varies by a factor of 2, then using the average switching activity value results in a 3% error if the voltages are assigned according to the minimum energy equation (derived using the Lagrange multiplier method). Thus, assuming that the switching activity of the resources varies by a factor of less than 2, using the average value introduces an error of at most 3% into our results.
2) Feasibility of the Algorithm: Let be the optimum minimum-computation time (critical-path delay) under the given resource constraint. Finding a feasible solution for is equivalent to finding the optimal solution. The proposed algorithm is a list-based algorithm and does not guarantee an optimal solution. However, the proposed algorithm guarantees a feasible solution if , where or for the benchmarks that we have considered. The results are shown in Table III. Here implies 1 multiplier and 1 adder operating at each of the following voltages: 5 V, 3.3 V, 2.4 V, 1.5 V, and implies two multipliers and two adders operating at the same set of voltages. For this set of benchmarks, the average error is 3.9%. Thus, our list-based algorithm generates a feasible solution almost all the time.
IV. RESULTS
In this section, we present the results obtained by running our algorithm on some high-level synthesis benchmarks (DIFFEQ, AR lattice, EW filter, FIR filter, FFT4, and DCT). We present the results when actual energy consumption values in [4] are used. The switching activity of the nodes is assumed to be 0.5. The results for (i.e., 1 multiplier and 1 adder operating
Table IV also demonstrates how the level-shifter power consumption varies with the latency. If the given latency is tight, the majority of the nodes are assigned to the high-voltage resources to satisfy the timing constraint and consequently, the number of level shifters is low. When the given latency is high, the number of nodes assigned to the lower-voltage resources increases and the number of level shifters increases. The ratio of level-shifter energy to the total energy is 10.1% when the latency is and 11.4% when the latency is for Algorithm HC.
Fig. 3 graphically illustrates the energy reduction when Algorithm LC is used on the benchmarks for the case when and when . Note that since the values of and are different for the two cases, the energy reductions for the two cases should not be directly compared.
Fig. 4 plots the reduction in energy when the latency varies from to for the EW filter in Fig. 4(a), and the average case (the average of DIFFEQ, AR lattice, EW filter, FIR filter, FFT4 and DCT reductions) in Fig. 4(b). Here is the ratio of the energy of the assignment using Algorithm HC to the case when all the resources are assigned to 5 V, and is the ratio of the assignment using Algorithm LC to the case when all the resources are assigned to 5 V. We assume that and that the latency is varied from to , where is 3.9% higher than the critical path length (see Table III). The changes in the energy reduction for the EW filter graph [Fig. 4(a)] occur in steps.
This is very typical of all the benchmarks and occurs because increasing the latency can enable an extra addition or multiplication node to be scheduled to a lower-voltage resource. Similar step changes are not visible in Fig. 4(b) since this is an average energy-reduction plot, in which the step changes have been smoothed out.
From the plots, we see that the energy reduction increases with increase in latency. This is to be expected since an increase in latency facilitates more nodes being assigned to lower voltages. Second, the energy reduction occurs in steps. This is because a reduction occurs only when an additional node can be assigned to a lower-voltage resource as a result of the increase in latency. So, if the increase in latency is less than the increase in the delay of a node (as a result of the lower-voltage assignment), there is no energy reduction. Third, for the same latency, the energy reduction obtained using Algorithm HC is larger than that obtained using Algorithm LC. Thus, the use of a more complex algorithm results in higher energy savings. Fourth, for small values of , where , is very close to . This is expected too, since when the latency is tight, most of the nodes are assigned to high-voltage resources.
V. CONCLUSION
In this paper, we present a new scheduling scheme under resource and latency constraints that minimizes power/energy consumption for the case when the resources operate at multiple voltages. The proposed scheme minimizes the power/energy consumption by distributing the slack among the nodes according to the condition derived using the Lagrange multiplier method. The scheme is implemented using an iterative algorithm in which, in each iteration, an increasing number of resources with high energy-delay ratio are disabled and the nodes are scheduled using a list-based algorithm. We propose two algorithms: 1) a simpler $O(n^2)$ algorithm and 2) a more complex $O(n^2 \log L)$ algorithm, where $L$ is the latency and $n$ is the number of nodes. The average reduction obtained by the more complex algorithm is 17.5% when the latency constraint is tight and is 39% when the latency constraint is 1.5 times the critical-path delay. The results obtained by the simpler algorithm are on the average 9% less compared to those obtained by the more complex algorithm. This is the expected tradeoff between the complexity of the algorithm and power/energy savings.
Ali Manzak received the B.S. degree in electronics
