Abstract-Switching activity and schedule length are the two most important factors that influence the energy consumption of an application executed on a VLIW (Very Long Instruction Word) processor. Considering these two factors together, we propose an instruction-level energy-minimization scheduling technique to reduce the energy consumption of applications on VLIW processors. We first formally prove that this problem is NP-complete. Then three heuristic algorithms, MSAS, MLMSA, and EMSA, are proposed. While switching activity and schedule length are given higher priority in MSAS and MLMSA, respectively, EMSA gives the best result considering both of them. The experimental results show that, on average, EMSA gives a 31.7% reduction in energy compared with the traditional list scheduling approach.
I. INTRODUCTION
In embedded systems, high-performance digital signal processing (DSP) used in image processing, multimedia, wireless security, etc., needs to be performed not only with high data throughput but also with low power consumption. To satisfy the ever-growing performance requirement, high-performance architectures with multiple functional units, such as VLIW (Very Long Instruction Word) architectures, for example, TI TMS320C6K, are now commonly used. Since multiple functional units execute simultaneously in these architectures, power consumption becomes one of the most important issues to be considered alongside performance. Therefore, an efficient scheduling scheme is needed to reduce the energy consumption of an application while guaranteeing the required timing performance. Recent research on various processors [1], [2] shows that the instruction sequence of an application plays an important role in its energy consumption. Thus, new research directions in power optimization have begun to address instruction-level scheduling for reducing energy consumption [3]-[5]. In this paper, we
where $P_{base}$ is the base power needed to support instruction execution, $P_{basic}$ is the basic power to execute a sub-instruction on a functional unit, and $P_{switch}$ is the switching power caused by switching activities between the current sub-instruction and the last sub-instruction executed on the same functional unit (FU). Of these terms, $P_{switch}$ changes with different schedules. Therefore, in order to minimize the energy consumption of an application, schedule length and switching activity both need to be considered in scheduling.
A lot of research effort has been put into this field. In [6], compiler techniques are proposed to reduce power variations, hot-spots, and peak power in scheduling. Though these techniques can efficiently reduce power variations, they do not aim to minimize the total energy consumption. The low-power resource allocation approach [3], [7], [8] attempts to find an allocation for a fixed schedule in such a way that the total switching activities can be reduced. While this approach is effectively applied to resource allocation in high-level synthesis, it may give inferior results for the instruction-level energy-minimization scheduling problem because scheduling and allocation are performed separately. In [5], a two-phase scheduling approach is proposed to optimize transition activity in the instruction bus on a VLIW architecture. Based on an initial schedule, this approach performs low-power resource allocation (called horizontal scheduling) in the first phase and vertically schedules instructions in the second phase to further reduce the switching activities. Since the schedule length is fixed a priori, this approach cannot be directly applied to solve the instruction-level energy-minimization scheduling problem. In [4], several revised list scheduling techniques are proposed to minimize energy based on instruction-level energy models for specific processors. Using similar energy models, several energy-oriented instruction scheduling approaches are presented in [9] and compared with performance-oriented scheduling. However, these techniques are not general enough to be applied to VLIW processors because they target specific architectures.
Several techniques [10]-[12] are proposed to minimize switching activities only. In [10], an instruction scheduling technique, called cold scheduling, is proposed to reduce the switching activities on the control path. In [11], an operand sharing scheduling technique is proposed to schedule the operation nodes with the same operands as closely as possible to reduce the switching activities on the functional units. In [12], a scheduling algorithm for optimizing the coefficients of an FIR filter is proposed to minimize the switching activities on the memory data bus and functional units. Since schedule length is not considered, these techniques may give inferior results for energy minimization.
In recent works [13], [14], the power-efficient scheduling problem is formulated as the Traveling Salesman Problem (TSP) and solved by TSP heuristics when there is one FU. However, formulating a problem as TSP does not necessarily mean that the problem is NP-complete. For example, the problem of sequencing jobs that require common resources on a single machine [15] can be transformed to TSP but is still polynomially solvable. In the literature, it is still unknown whether the instruction-level scheduling problem with minimum switching activities (called the minimum-switching-activities scheduling problem) is solvable in polynomial time or NP-complete.
In our work, we formally prove that the minimum-switching-activities scheduling problem is NP-complete whether or not there are resource constraints. Based on this result, we further prove that the instruction-level energy-minimization problem is NP-complete with or without resource constraints. While the minimum-latency scheduling problem is polynomial-time solvable if there is only one FU or there are no resource constraints, the problem becomes NP-complete when switching activities are considered as a second constraint.
To solve the instruction-level energy-minimization problem on VLIW architectures, three algorithms are proposed. We first propose two algorithms, MSAS and MLMSA, to solve two special cases of the energy-minimization problem. The MSAS algorithm is designed for the case when switching activity plays the most important role in total energy consumption. MLMSA is for the case when schedule length plays the most important role. In both the MSAS and MLMSA algorithms, we consider all nodes in the ready list in each schedule step and use weighted bipartite matching to do scheduling and allocation simultaneously. Then an algorithm, Energy Minimization Scheduling Algorithm (EMSA), is proposed to solve the general instruction-level energy-minimization scheduling problem. In EMSA, we start from an initial schedule and reschedule the nodes to reduce switching activities when relaxing the initial schedule to a given timing constraint. The best schedule is selected from all schedules generated up to that point. The experimental results show significant reductions in energy consumption. When switching activity plays the most important role in total energy consumption, MSAS gives a reduction of 41.2% in switching activity over list scheduling on average. When schedule length plays the most important role in total energy consumption, MLMSA keeps the same schedule length as list scheduling and reduces switching activities by 33.2% on average. For the general case, EMSA gives a 31.7% reduction in energy compared with list scheduling on average.
The remainder of the paper is organized as follows. Section II introduces basic concepts and the energy model. Section III proves that the instruction-level energy-minimization scheduling problem is NP-complete. The algorithms are discussed in Section IV. Experimental results and concluding remarks are provided in Sections V and VI, respectively.
II. BASIC CONCEPTS AND MODELS
In this section, we introduce the basic concepts that will be used in later sections. A motivational example and the energy model are also introduced in this section.
A. DAG and Bit String(u)
A DAG (directed acyclic graph) is a node-weighted graph in which each node represents an operation and an edge between two nodes denotes a dependency relation. In this paper, the operation time of each node in the DAG is assumed to be one time unit.
Assembly code generated by a compiler has to satisfy the dependency relations in the DAG. It consists of long instruction words, and each long instruction word contains multiple sub-instructions. Each sub-instruction specifies the functional unit on which it is going to be executed. The bit switches between consecutive sub-instructions executed on the same functional unit cause the switching power consumption ($P_{switch}$ in equation 1) in the power model introduced in Section I (a detailed discussion of the power model is given in Section II-B). Therefore, we can capture the switching power consumption [...] opcodes. When loading instructions from program memory, the state of the instruction bus may change, which causes switching power consumption.

Based on a series of experiments on a TMS320VC5410 DSP processor, we collected the core current when executing loops of different instructions and instruction combinations, using a method similar to that in [1]. The experimental results show that the core current increases with the Hamming distance between consecutive instructions. Some experimental results are shown in Figure 1. This confirms the correctness of our power model.

The architecture of our experimental VLIW processor is shown in Figure 2(a). It is a simplified model, similar to one cluster of the TI C6000 VLIW processors. [...] and then computed. The DAG representation of the assembly program is shown in Figure 3(d). Since bus capacitances are usually several orders of magnitude higher than those of the internal nodes of a circuit [16], a considerable amount of power can be saved by reducing the switching activities on the buses. Therefore, we only consider reducing the transitions on the instruction bus, data address bus, and data bus in this example. For each node u, Bit String(u) denotes the states of the three buses after executing node u.
It consists of three parts: the opcode on the instruction bus, the operand on the data bus, and the access address on the data address bus, as shown in Figure 4(a). Usually we do not know the inputs at compile time. However, we can predict their possible values or patterns for a specific application. In this example, we assume our prediction for the operand part [...] MSAS algorithm. Assume the initial states of the buses are all 0's. Then the total number of transitions for a schedule is the sum of the bit switches on each bus. The number of transitions for Schedule 1 is 121 and that for Schedule 2 is 66. Since they have the same schedule length, Schedule 2 consumes less energy than Schedule 1 when executed on our experimental VLIW processor. The generated code with long instruction words can be found in the instruction bus column in Figure 4(b) and (c).
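The transition counting described above can be illustrated with a minimal sketch. It assumes each bus state is given as a bit string and that the bus starts at all 0's, as in the text; the function names and the 4-bit example words are hypothetical and are not taken from Figure 4.

```python
def hamming(a: str, b: str) -> int:
    """Number of bit positions where two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def count_transitions(bus_states: list[str]) -> int:
    """Total bit switches on one bus, starting from an all-zero state."""
    if not bus_states:
        return 0
    previous = "0" * len(bus_states[0])
    total = 0
    for state in bus_states:
        total += hamming(previous, state)
        previous = state
    return total


# Hypothetical 4-bit bus contents: the same three words in two orders.
order_a = ["1111", "0000", "1110"]
order_b = ["1111", "1110", "0000"]
print(count_transitions(order_a), count_transitions(order_b))
```

The two orderings carry the same words but produce different transition counts, which is the effect that separates Schedule 1 from Schedule 2 in the example.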
B. Power Cost Function and Energy Model
Our energy model is based on the instruction-level energy estimation framework for VLIW architectures proposed in [17]-[20]. In VLIW processors such as the TI TMS320C620x and TMS320C670x DSP devices, there is no data cache and application programs are loaded into on-chip memory before they are executed [21]; therefore, the influence of cache misses is not considered in our energy model.
As shown in equation 1, the power consumption to execute a long instruction word during a clock cycle consists of three components: $P_{base}$, $P_{basic}$, and $P_{switch}$.

In VLIW processors, $P_{base}$ denotes the power to support basic pipeline execution even when a long instruction word contains only NOPs. $P_{basic}$ denotes the basic power for a functional unit to execute a sub-instruction on a specific datapath even when the inputs do not cause any transition. $P_{switch}$ denotes the switching power caused by switching activities between the current sub-instruction and the last sub-instruction executed on the same FU. The switching power is proportional to the number of transitions, so

$$P_{switch} = G \times WHD(X, Y) \qquad (3)$$

where $G$ is a power coefficient representing the consumed power per transition, and $WHD$ (Weighted Hamming Distance) is a function used to compute the number of transitions between the bit strings $X$ and $Y$ of the current and last sub-instructions. It is defined as:

$$WHD(X, Y) = \sum_{i} w_i \, h(x_i, y_i)$$

where $h(x_i, y_i)$ is 1 if bits $x_i$ and $y_i$ differ and 0 otherwise, and $w_i$ is the weight of a transition. $w_i$ is used to denote the weight for the power consumption caused by one transition on different units.
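As a minimal sketch of the weighted Hamming distance and the switching power it feeds into, assuming bit strings are given as Python strings and that the per-bit weights default to 1 when none are supplied. The names `whd` and `switching_power` and the example weights are illustrative choices, not values from the paper.

```python
def whd(cur: str, last: str, weights=None) -> float:
    """Weighted Hamming distance: sum of the weights of the bits that differ."""
    if len(cur) != len(last):
        raise ValueError("bit strings must have equal length")
    if weights is None:
        weights = [1.0] * len(cur)  # assumption: every transition weighs 1
    if len(weights) != len(cur):
        raise ValueError("one weight per bit position is required")
    return sum(w for w, c, l in zip(weights, cur, last) if c != l)


def switching_power(cur: str, last: str, g: float = 1.0, weights=None) -> float:
    """Switching power as the per-transition coefficient times the distance."""
    return g * whd(cur, last, weights)


print(whd("101", "000"))                      # unit weights
print(whd("101", "000", [2.0, 1.0, 0.5]))     # heavier weight on the first bit
print(switching_power("110", "011", g=0.5))
```

Non-uniform weights make it possible to charge, say, a data bus transition more than an instruction bus transition, which is the role the text gives to the transition weight.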
The energy consumption of an application is the summation of its power consumption over all clock cycles. $P_{switch}$ is the switching power and changes with different schedules. Therefore, schedule length and switching activity need to be considered together in instruction-level scheduling techniques in order to minimize the energy consumption of an application. One example is shown in Figure 5.
In Figure 5(a), a DAG is given, in which the binary string close to each node is the value of Bit String(u).

[Figure 5: schedule tables with columns Step, FU1, and FU2.]
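To make the interaction between schedule length and switching activity concrete, the following sketch sums a per-cycle power over a schedule. It is an illustration under stated assumptions: a schedule is a list of steps with one slot per FU, an empty slot (`None`) costs nothing and leaves the FU state unchanged, and all transition weights are 1. The function name, parameters, and the four bit strings are hypothetical and are not the data of Figure 5.

```python
def hamming(a: str, b: str) -> int:
    """Bit switches between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


def schedule_energy(schedule, p_base, p_basic, g, init_state):
    """Sum over steps of base power plus per-FU basic and switching power."""
    if not schedule:
        return 0.0
    last = [init_state] * len(schedule[0])  # last bit string seen on each FU
    energy = 0.0
    for step in schedule:
        energy += p_base  # paid once per clock cycle
        for fu, bits in enumerate(step):
            if bits is None:  # assumption: idle slot, FU state is kept
                continue
            energy += p_basic + g * hamming(bits, last[fu])
            last[fu] = bits
    return energy


# Two hypothetical 2-FU schedules of the same four bit strings.
s1 = [["111", "000"], ["001", "110"]]
s2 = [["111", "000"], ["110", "001"]]
print(schedule_energy(s1, p_base=1.0, p_basic=0.0, g=1.0, init_state="000"))
print(schedule_energy(s2, p_base=1.0, p_basic=0.0, g=1.0, init_state="000"))
```

Both schedules have two steps, so they pay the same base power; the second one wins only because its FU assignment in step two causes fewer bit switches.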
D. The Optimization Problem
Our optimization problem is defined as follows: 
III. NP-COMPLETENESS
In this section, we prove that our optimization problem is NP-complete. From the energy model in equation (2), we know that the energy consumption of a schedule is related to schedule length and switching activity. To prove that our optimization problem is NP-complete, we use the following method. We first consider two subproblems of our optimization problem: the minimum-latency scheduling problem, which only minimizes schedule length, and the minimum-switching-activities scheduling problem, which only minimizes switching activities. If we can prove that one of these two problems is NP-complete, we can further prove that our optimization problem is NP-complete. It is well known that the minimum-latency scheduling problem is polynomial-time solvable when there is only one functional unit or there are no resource constraints. Therefore, the focus is put on the minimum-switching-activities scheduling problem. As discussed in Section I, it is still unknown in the literature whether the minimum-switching-activities scheduling problem is solvable in polynomial time or NP-complete. Thus, we categorize the problem into two cases, as shown in Theorem 3.1 and Theorem 3.2, and prove them as follows. [...]

Proof: Given an instance of the minimum-switching-activities scheduling problem, it can be directly mapped to an instance of our optimization problem by setting $P_{base} = 0$ in the energy model. Since the reduction can be done in polynomial time, our optimization problem is NP-complete.
IV. THE ALGORITHMS
In this section, we present three algorithms to solve the instruction-level energy-minimization scheduling problem. Two algorithms for two special cases are presented first, and then an algorithm for the general problem is proposed.
A. MSAS Algorithm
1) The Algorithm: Our first algorithm, Minimizing Switching Activities Scheduling (MSAS), is designed to solve a special case of the instruction-level energy-minimization scheduling problem, i.e., the case when the switching activities play the most important role in energy consumption.
When $P_{base}$ is very small compared with $G$ (in equation 3), the energy of a schedule depends mainly on switching activities. For example, when $P_{base}$ equals 0.1 in Figure 5, the schedule length must be reduced by 10 control steps to offset one bit switch. Thus, we need an algorithm that minimizes switching activities as much as possible. On the other hand, considering performance, we also want to minimize schedule length. Hence, the MSAS algorithm minimizes switching activities as its first priority while still considering schedule length. Since most previous works focus on one functional unit, we need algorithms that can take advantage of the multiple functional units of VLIW architectures. The MSAS algorithm is shown in Algorithm IV.1.
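The break-even figure quoted for Figure 5 can be checked with one line of arithmetic. The sketch below assumes a simplified energy expression in which only the base power per control step and the switching power vary between schedules, and it assumes a per-transition coefficient of 1 with unit transition weights, which is what the quoted ratio implies; $L$ (schedule length) and $N_{sw}$ (number of bit switches) are symbols introduced here for illustration.

```latex
\Delta E = P_{base}\,\Delta L + G\,\Delta N_{sw}
\qquad\text{with } P_{base} = 0.1,\; G = 1 \text{ (assumed)}:
\qquad
0.1 \times (-10) + 1 \times (+1) = 0 .
```

Under these assumptions, shortening the schedule by ten steps saves exactly what one additional bit switch costs, so any schedule change that trades fewer than ten steps for one bit switch increases energy.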
Due to the dependencies in a DAG, we can only schedule a node after all its parent nodes have been scheduled. The scheduling problem with switching-activity minimization is how to find a matching between functional units and ready nodes in such a way that the schedule based on this matching minimizes the total switching activities in every scheduling step. This is equivalent to the min-cost weighted bipartite matching problem. Thus, in the MSAS algorithm, we repeatedly create a weighted bipartite graph between the set of functional units and the set of nodes in Ready List, and assign nodes based on the min-cost maximum bipartite matching. In each scheduling step, the weighted bipartite graph [...]. An example is shown in Figure 6.
Given the DAG in Figure 5(a), the scheduling in the first step by the MSAS algorithm is shown in Figure 6 when there are 2 FUs. Ready List of the DAG in the first step is shown in Figure 6(a). Assume that the initial states of the FUs are 000 and $w_i = 1$ in the weighted Hamming distance function $WHD$. A weighted bipartite graph based on the set of FUs and the set of nodes in Ready List is constructed in Figure 6(b). A min-cost maximum bipartite matching is shown in Figure 6 [...]

2) Applying the Algorithm to VLIW architecture: Our generic algorithm can be easily modified to solve the scheduling problem on various VLIW architectures. In this section, we show how to apply MSAS to a VLIW architecture similar to the TI C6000 VLIW processor. In such an architecture, a sub-instruction can be put into any location in the instruction bus; it is the decoder that distributes the sub-instruction to an FU. So we need to change our generic algorithm slightly to handle this case. The VLIW architecture consists of heterogeneous FUs with resource constraints. Thus, the number of sub-instructions of a type executing in a clock cycle cannot exceed the resource bound. For example, since there is only one load/store unit in the experimental VLIW processor in Figure 2, we cannot schedule more than one load/store instruction in the same schedule step. Assume that at most $k$ sub-instructions of a given type can be scheduled in one step; we can sort the edges for this type of sub-instruction in increasing order and schedule these sub-instructions from the first $k$ edges. Then we can remove all sub-instructions of this type from the set of candidate nodes and find the best matching for the remaining FUs and remaining nodes until every FU has been assigned a node. In this way, we can reduce switching activity as much as possible. We also need to add NOP nodes if the number of nodes is less than the number of FUs in each step. Using [...] to denote sub-instructions in the instruction bus and applying the modifications above, Schedule 2 in Figure 4(c) is generated by MSAS. It reduces the number of bit switches from 121 in Schedule 1 to 66.
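The per-step matching idea behind MSAS can be sketched as follows. This is not the paper's Algorithm IV.1: it assumes homogeneous FUs, unit-time nodes, and unit transition weights, and it replaces a polynomial-time min-cost matching algorithm with a brute-force search that is only practical for a handful of FUs. The names `best_matching` and `msas_sketch` and the four-node DAG are hypothetical.

```python
from itertools import combinations, permutations


def hamming(a: str, b: str) -> int:
    """Bit switches between two equal-length bit strings (all weights 1)."""
    return sum(x != y for x, y in zip(a, b))


def best_matching(fu_states, ready):
    """Return {fu_index: node} for a min-cost maximum matching (brute force)."""
    nodes = list(ready)
    k = min(len(fu_states), len(nodes))  # size of a maximum matching
    best_cost, best = None, {}
    for fus in permutations(range(len(fu_states)), k):
        for chosen in combinations(nodes, k):
            cost = sum(hamming(fu_states[f], ready[u])
                       for f, u in zip(fus, chosen))
            if best_cost is None or cost < best_cost:
                best_cost, best = cost, dict(zip(fus, chosen))
    return best


def msas_sketch(bits, parents, n_fus, init_state):
    """Schedule step by step; returns a list of steps with one slot per FU."""
    if n_fus < 1:
        raise ValueError("at least one FU is required")
    fu_states = [init_state] * n_fus
    done, schedule = set(), []
    while len(done) < len(bits):
        # Ready nodes: unscheduled, with every parent done in an earlier step.
        ready = {u: b for u, b in bits.items()
                 if u not in done
                 and all(p in done for p in parents.get(u, ()))}
        if not ready:
            raise ValueError("dependency cycle: no node is ready")
        step = [None] * n_fus
        for fu, node in best_matching(fu_states, ready).items():
            step[fu] = node
            fu_states[fu] = bits[node]  # the FU now holds this bit string
        done.update(n for n in step if n is not None)
        schedule.append(step)
    return schedule


# Hypothetical DAG: C depends on A, D depends on B.
bits = {"A": "111", "B": "000", "C": "110", "D": "001"}
parents = {"C": ["A"], "D": ["B"]}
print(msas_sketch(bits, parents, n_fus=2, init_state="000"))
```

Because scheduling and FU allocation are decided by the same matching, node C follows A on the FU that already holds 111 rather than on the FU holding 000. Resource bounds for heterogeneous FUs, as described for the C6000-like architecture, are not modeled here.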
B. MLMSA Algorithm
Our second algorithm, Minimizing Latency with Minimal Switching Activities Scheduling (MLMSA), is designed to solve another special case of the instruction-level energy-minimization scheduling problem, i.e., the case when $P_{base}$ is very large compared with $G$ in equation 3 (the switching power caused per transition). In this case, the schedule length plays the most important role in energy consumption. Thus, MLMSA is designed to generate a schedule with minimum schedule length as its first priority and then reduce switching activities as much as possible.
MLMSA is similar to MSAS except that we use a different method to assign weights to the edges of the weighted bipartite graph. In order to reduce the schedule length, we define Depth(u), a priority function for each node u in a DAG, to be the longest path from u to a leaf node in the DAG. When constructing the weighted bipartite graph, edge weights are assigned so that nodes with higher priority will be scheduled first. However, for nodes with the same priority, switching-activity minimization is considered. Thus, we minimize the schedule length while considering switching activities.
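A minimal sketch of the priority function and of one possible depth-first edge weight follows. The text does not spell out the weight formula, so the one used here is an assumption: depth is scaled by a constant larger than the largest total switching cost of one step, so that depth always dominates and switching activity only breaks ties. Depth is counted in edges (a leaf has depth 0), which is also an assumption. The names `depths` and `edge_weight` are hypothetical.

```python
from functools import lru_cache


def hamming(a: str, b: str) -> int:
    """Bit switches between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


def depths(children):
    """Longest path, in edges, from every node to a leaf of the DAG."""
    @lru_cache(maxsize=None)
    def depth(u):
        succ = children.get(u, ())
        return 0 if not succ else 1 + max(depth(v) for v in succ)

    nodes = set(children) | {v for vs in children.values() for v in vs}
    return {u: depth(u) for u in nodes}


def edge_weight(depth_u, fu_state, bits_u, n_fus):
    """Min-cost weight: deeper nodes first, fewer bit switches on ties."""
    big = n_fus * len(bits_u) + 1  # exceeds the total switching of one step
    return -depth_u * big + hamming(fu_state, bits_u)


# Hypothetical DAG given as node -> list of children.
children = {"A": ["C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
print(depths(children))
print(edge_weight(2, "000", "111", n_fus=2), edge_weight(1, "000", "000", n_fus=2))
```

With these weights, a min-cost matching prefers the depth-2 node even though it causes three bit switches and the depth-1 node causes none, which mirrors the stated priority of schedule length over switching activity.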
C. EMSA Algorithm
In this section, an algorithm, Energy Minimization Scheduling Algorithm (EMSA), is proposed to solve the general instruction-level energy-minimization scheduling problem. The basic idea is to reschedule nodes to reduce switching activities when relaxing the schedule length of an initial schedule to a given timing constraint, and then select the schedule with minimal energy consumption. In EMSA, we reschedule nodes to new locations, from the bottom up, based on an initial schedule and a timing constraint. Since the timing constraint is equal to or greater than the schedule length of the initial schedule, the nodes in the initial schedule may have some freedom to be moved to new locations such that the switching activities can be reduced. In order to obey the precedence relation, we [...] and compute the best location of each node with two considerations: 1) it causes the greatest reduction in switching activity; 2) it is the closest to the bottom. Then we select the node with the greatest reduction in switching activity among all nodes in the candidate list and reschedule it to its best location. In this way, we can reduce switching activities as much as possible when rescheduling a node.
Since the schedule generated by EMSA is directly related to the initial schedule, it is very important to have a good initial schedule. Both [...]

In our experiments, the opcodes for each node are obtained from the TI TMS320C6000 Instruction Set [24].
The experimental results for MSAS and list scheduling are shown in Table I. Column "SA" lists the switching activities for list scheduling (field "ListS") and [...] Table II.
The experimental results for EMSA are shown in Table III and Table IV. Each benchmark is scheduled with three timing constraints. Among them, the first one is the schedule length obtained by list scheduling. Row "%" in field "EMSA" lists the reduction in energy of EMSA compared with list scheduling (field "ListS"). In each table, the results are shown when the number of FUs equals 3, 4, and 5, respectively.
Through the experimental results in Tables I-IV, we found that our algorithms reduce energy consumption significantly compared with list scheduling. When switching activity plays the most important role in energy consumption, [...]

From the experimental results in Table III and Table IV, we can see that the energy achieved by EMSA decreases as the timing constraint increases for each benchmark. To show the energy variation further, we show the results for the 8-lattice IIR filter when the timing constraint varies from 17 to 25 in Figure 7. EMSA gradually reduces the energy when the timing constraint increases from 17 to 23. After that, the energy cannot be reduced any further. This shows that we cannot gain further energy reduction once the timing constraint reaches a certain point. At this point, where the timing constraint equals 23, EMSA gives a reduction of 47.8% compared with list scheduling.
VI. CONCLUSION
In this paper, we presented instruction-level scheduling techniques to minimize the energy consumption of applications executed on VLIW processors. We studied the two most important factors, switching activity and schedule length, that influence the energy consumption of an application on a VLIW processor during scheduling. We proved that the instruction-level energy-minimization scheduling problem is NP-complete with or without resource constraints.
We proposed three heuristic algorithms, MSAS, MLMSA, and EMSA, to solve the problem. While switching activity and schedule length are given higher priority in MSAS and MLMSA respectively, EMSA gives the best result considering both of them. The experimental results show our algorithms can significantly reduce energy consumption compared with the standard list scheduling.
