The focus of this paper is on minimizing, through instruction scheduling, the variation in power consumed by a VLIW processor during the execution of a target program. The problem is formulated as a mixed-integer program (MIP), and a problem-specific branch-and-bound algorithm has been developed to solve it more efficiently than generic MIP solvers. Simulation results based on the TMS320C6711 VLIW digital signal processor, using benchmarks from Mediabench and Trimaran, show that an average reduction in power variation of over 40% can be achieved without sacrificing the execution speed of these benchmarks. Computational requirements and convergence rates of our algorithm are also analyzed. This paper is an extension of the conference paper [Xiao and Lai 2004]. The mixed-integer program in this paper makes use of a substantially improved set of constraints. The lower-bound estimation of the branch-and-bound algorithm has also been improved. The experimental results are, therefore, completely new.
INTRODUCTION
Very-long instruction word (VLIW) processors are designed for executing programs that exhibit a high degree of instruction-level parallelism (ILP), such as those for multimedia signal processing. They are able to execute several instructions in parallel on separate functional units. The number of instructions being executed simultaneously varies from clock cycle to clock cycle. Since the power consumed by the processor at any given time depends on the number, as well as the type, of instructions being executed, there are potentially large fluctuations in processor supply current. Large fluctuations in supply current lead to an increase in power supply voltage noise, commonly known as the di/dt problem, which may cause timing and logic errors [Smith et al. 1999; Chandrakasan et al. 2000]. For high-performance processors, the problem is more pronounced, because they generally have a larger number of gates and wider datapaths, and operate at higher clock frequencies, leading to larger surge currents within a shorter time period. Furthermore, large current variation is usually correlated with large current spikes that can adversely affect chip temperature [Brooks and Martonosi 2001], to which chip reliability and subthreshold leakage power are exponentially related [Dhodapkar et al. 2000]. Besides, maximum battery life is attained when the variance of the discharge current distribution is minimized [Pedram and Wu 2002]. For the same average current, battery efficiency may change by as much as 25%, depending on the discharge current profile.
Published works on the control of power variation use either the hardware approach [Pant et al. 1999, 2000; Grochowski et al. 2002; Joseph et al. 2003; Vijaykumar 2003, 2004; El-Essawy and Albonesi 2004] or the hybrid hardware/software approach [Hazelwood and Brooks 2004]. For VLIW processors, since the instruction schedules produced by the compiler largely determine the power profile, power variation control can be achieved through instruction scheduling. Two methods have been specifically proposed for minimizing processor power variance. The first one extends the performance-oriented iterative modulo scheduling algorithm by adding power-aware heuristics [Yun and Kim 2001]. The alternative method is to formulate this scheduling problem as a mixed-integer program (MIP) [Yang et al. 2002; Xiao and Lai 2004]. The advantage of the latter approach is that optimal solutions can be guaranteed. However, the computational complexity of algorithms for solving the MIP is generally much higher than that of heuristic ones, particularly if generic MIP solvers are used, as in Yang et al. [2002].
Furthermore, there are three main problems with the resource usage model in Yang et al. [2002] and Xiao and Lai [2004] . First, they assumed that the instructions scheduled onto each functional unit were executed on dedicated resources in a fully pipelined manner. In fact, instructions with multicycle functional unit latency will lock the target functional unit for a number of cycles. No new instruction can be dispatched to that functional unit during this locking period. Second, they did not consider the issue of register utilization. Rescheduling instructions can change variables' lifetimes, which may increase pressure on the registers. Without sufficient registers, the register allocator must insert spill and restore code into the schedule, which will cause extra delay and increase the total energy consumption of the resulting schedule. Third, the sharing of resources, such as read/write ports and shared buses, between the functional units had not been taken into account. As a result, illegal schedules may result.
Another problem with Yang et al. [2002] is that a rather unrealistic instruction-level power model has been used. It assumes that every pipeline stage in a given functional unit consumes the same amount of power. In fact, significant differences exist between the power consumed at different stages of the pipeline, which are reflected by experimentally verified VLIW instruction-level power models, such as the one proposed in Julien et al. [2003]. Schedules that are optimal under such a simplistic power model may be far from optimal in reality.
In this paper, the problem of VLIW instruction scheduling for minimal power variance is solved using the MIP approach. We shall refer to it as the power-balanced scheduling problem. We make use of an accurate VLIW instruction-level power model, which is a modified version of the one proposed in Julien et al. [2003]. Our MIP formulation is based on a more complete resource usage model, which takes into account constraints that have not been considered before: those imposed by multicycle functional unit latency and by limited shared resources, such as registers, read/write ports, and buses. Appropriate data-dependency constraints similar to those used in Leupers and Marwedel [1997] and Wilken et al. [2000] are included to ensure that optimal execution performance of the resulting schedule is maintained. The major contribution of this paper is the problem-specific branch-and-bound algorithm. It solves the MIP much more efficiently than generic solvers, making the technique feasible for production compilers. In particular, the heuristics used to guide the branching and selection processes greatly cut down the search space. The lower-bound estimation process is computationally efficient, which also accelerates the convergence of the branch-and-bound algorithm. Another advantage of the proposed branch-and-bound algorithm is that the peak power can also be flexibly bounded by the branching heuristics. Empirical results show that an average power variation reduction of over 40% can be achieved without degrading the execution performance of the target programs and with a relatively low computational cost.
The rest of this paper is organized as follows. The original VLIW power model and our modifications that allow it to be used for instruction scheduling are presented in Section 2. In Section 3, our MIP formulation of the problem is presented. Our efficient branch-and-bound algorithm is developed in Section 4. Section 5 shows the experimental results based on the C6711 VLIW digital signal processor, using benchmark programs from Mediabench and Trimaran.
INSTRUCTION-LEVEL POWER MODEL
Two instruction-level power models specifically for VLIW architectures can be found in the literature. The first one, presented in Sami et al. [2000, 2002], Bona et al. [2002], Benini et al. [2002], and Zaccaria et al. [2003], has been used to characterize a four-issue VLIW core with a six-stage pipeline. It takes into account factors such as instruction ordering and power consumption in the pipeline stages. Experiments carried out on a set of embedded DSP benchmarks have demonstrated an average error of 4.8% compared to gate-level simulations. The main disadvantage of this model is its complexity, which requires a large number of parameters to be estimated. Since these parameters are normally obtained via power measurements, the number of measurements required is prohibitively large. Furthermore, this complexity also reduces the computational efficiency of the instruction-scheduling algorithm.
• S. Xiao and E. M.-K. Lai
The power model proposed in Julien et al. [2003] , on the other hand, has lower complexity. Furthermore, the accuracy of this model is comparable to the first one with a maximum error between estimation and measurement of 4% for the C62 and 6% for the C67 series of digital signal processors. Another advantage is that the modeling methodology, which is based on algorithmic activities, is generally applicable to any VLIW processor. However, some modifications to this model are needed in order to be suitable for our instruction-scheduling problem. In Section 2.1, we shall briefly review the main components of this power model, followed by a description of our modifications in Section 2.2.
An Algorithmic Activity-Based Power Model
The algorithmic activity-based power modeling methodology is motivated by the observation that the architectural complexity of VLIW processors hides the details of many internal activities. Hence, no significant power consumption difference can be observed between similar types of instructions. For instance, for the C6201 digital signal processor, an addition and a multiplication dissipate about the same amount of power. The same is true for data-transfer instructions between on-chip memories. Modeling of a target VLIW processor involves grouping its architectural components into functional blocks, based on a functional-level concurrent activity analysis. The activity rates of these functional blocks and their interactions are modeled by instruction-level algorithmic parameters. The power cost is estimated from the instruction-level algorithmic activities of the target program.
The instruction-level algorithmic activity-based modeling method in Julien et al. [2003] is illustrated with C62. The model described in this section is based on this processor.
Functional Blocks with Concurrent Activities. The architecture of the target VLIW processor is modeled as two functional blocks: the instruction management unit (IMU) and the processing unit (PU). They model the concurrent activities in the pipeline steps, which can be separated into three stages:
1. The fetch stage. It includes program address generation (PG), program address send (PS), program access ready wait (PW), and program fetch packet receive (PR), which fetch the instructions from program memory.
2. The decode stage. The instructions are first dispatched to the correct unit (DP) and then decoded by the processing unit (DC).
3. The execution stage. The instruction is executed in a variable number of steps.
Figure 1 shows the two functional blocks together with the associated concurrent activities. The first five pipeline steps (PG to DP) are performed by the IMU and the rest by the PU.
Instruction-Level Algorithmic Parameters.
In the C6x family of VLIW processors [Texas Instruments 2000], eight instructions constitute a fetch packet and are fetched at the same time. The execution of the individual instructions in a fetch packet is partially controlled by a bit in each instruction, which determines whether the instruction executes in parallel with another instruction. All instructions that are executed in parallel constitute an execute packet. The remaining instructions are executed in the cycles afterward. The maximum size of an execute packet is equal to the maximum issue width of the processor. Figure 2 shows an example of a fetch packet with three execute packets. The activity rates of the functional blocks and their interactions are modeled by two algorithmic parameters α and β. They are obtained from the compiled code and have a significant impact on the final power consumption. The parallelism rate α indicates the average flow between the fetch stages and the program memory controller in the IMU. It can be defined in terms of the number of fetch packets NFP and execute packets NEP:
$$\alpha = \frac{NFP}{NEP} \tag{1}$$
Since NFP ≤ NEP, α ≤ 1 with α = 1 when parallelism is highest. In the example shown in Figure 2, NFP = 1, NEP = 3, and α = 1/3. The processing rate β between the IMU and the PU represents the utilization rate of the processing units. It is given by
$$\beta = \frac{1}{NPU_{\max}} \cdot \frac{NPU_{tot}}{NEP} \tag{2}$$
where $NPU_{\max}$ is the number of processing units in the processor and $NPU_{tot}$ is the total number of instructions that have been executed on the processing units. Thus, $NPU_{tot}/NEP$ indicates the average number of processing units used per cycle. For the C62 processor, $NPU_{\max} = 8$, and in the example shown in Figure 2, $NPU_{tot} = 7$, since one of the instructions is a NOP (no operation), which does not involve any execution. Thus, we have β = 7/(3 × 8) = 7/24.
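The arithmetic of the Figure 2 example can be checked with a short sketch. It reads Eq. (1) as NFP/NEP and Eq. (2) as NPU_tot/(NEP · NPU_max), which is what the quoted values α = 1/3 and β = 7/24 imply. The mnemonics and the 3 + 3 + 2 split of the fetch packet are hypothetical, since the contents of Figure 2 are not reproduced here.

```python
from fractions import Fraction

NPU_MAX = 8  # number of processing units in the C62, as stated in the text


def activity_rates(fetch_packets):
    """Return (alpha, beta) for a list of fetch packets.

    Each fetch packet is a list of execute packets, and each execute
    packet is a list of mnemonics. NOPs do not occupy a processing unit.
    """
    nfp = len(fetch_packets)
    nep = sum(len(fp) for fp in fetch_packets)
    npu_tot = sum(
        1 for fp in fetch_packets for ep in fp for ins in ep if ins != "NOP"
    )
    alpha = Fraction(nfp, nep)
    beta = Fraction(npu_tot, nep * NPU_MAX)
    return alpha, beta


# Hypothetical fetch packet: 8 instructions, 3 execute packets, one NOP.
example = [[["ADD", "MPY", "LDW"], ["SUB", "ADD", "NOP"], ["MPY", "STW"]]]
alpha, beta = activity_rates(example)
print(alpha, beta)  # 1/3 7/24
```

Exact fractions are used so that the result can be compared directly with the values quoted in the text.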
2.1.3 Power Model Coefficients. The total power $I_{tot}$ consumed by the processor when there are no instruction or data cache misses is proportional to the power consumed by the IMU and the PU. Hence,
where
and e is a constant, which represents the idle power consumed by the processor when it is not fetching or executing any instruction. The power consumption laws that form the power model are able to take into account all the power consumption sources including pipelines and internal memories. If pipeline stalls are considered, then α and β should be replaced by
respectively, where PSR is the pipeline stall rate. If the power consumed by direct memory access (DMA) is included, then Eq. (3) becomes
with
ε is the DMA utilization rate which represents the activity level of the DMA unit.
Model Modifications
In order to perform power-balanced instruction scheduling, the power model must support runtime power profiling. The power model described in Section 2.1 cannot produce such power profiles. Therefore, we propose some modifications, which are described in detail in Sections 2.2.1 to 2.2.4.
Functional Blocks with Concurrent Activities. Table I shows the average power cost of the instruction decode step DC and the various execution steps E1, E2, and E3-E5 with respect to the parallelism rate α. It is obvious that the power costs of these steps are significantly different. More importantly, concurrent activities between internal memory and the processing units in the PU exist only when the executed instructions involve data memory access. In order to perform better runtime power profiling, fine-grained concurrent activity analysis is necessary. We propose that the PU be divided into two separate functional blocks. For those instructions that do not involve memory access, the decode stage can be grouped with the execute steps to form the computation unit (CU). Another functional block, called the data memory unit (DMU), handles the execution steps that access internal data memory. The subdivided functional block grouping is shown in Figure 3.
To make our model applicable to more generic VLIW architectures, the definitions for α and β given by Eqs. (1) and (2) have to be modified. Since the parallelism rate α is reflected by the average width K of the instruction words in the IMU, it can be defined by
$$\alpha = \frac{K}{K_{\max}} \tag{10}$$
where $K_{\max}$ is the issue width of the processor. The utilization rate β can be redefined as
$$\beta = \frac{NPU}{NPU_{\max}} \tag{11}$$
where NPU is the average number of processing units used per cycle, with the same physical meaning as the $NPU_{tot}/NEP$ term in Eq. (2). For the C62 processor, $K_{\max} = NPU_{\max} = 8$. Applying Eq. (10) to the example in Figure 2, we have K = 8/3 and, thus, α = 1/3, which is the same as that given by Eq. (1). Similarly, applying Eq. (11), we have NPU = 7/3 and β = 7/24. With the subdivided functional blocks in Figure 3, an additional algorithmic parameter δ is needed to represent the activity rate of the DMU and its interactions with the CU. Parameter δ, which represents the data memory access rate, can be computed by
$$\delta = \frac{NMU}{NMU_{\max}} \tag{12}$$
where $NMU_{\max}$ is the maximum number of internal data memory accesses that can be executed in a single instruction cycle and NMU is the average number of internal data memory accesses executed per cycle.
Power Model Coefficients.
The expression for the total power $I_{tot}$, previously given by Eq. (3), now becomes
where $I_{IMU}$, $I_{CU}$, and $I_{DMU}$ are the power consumed by the IMU, CU, and DMU, respectively. Now,
where c and d are the model coefficients for the CU and DMU, respectively. The power model coefficients a, c, d, and e in Eqs. (4) and (13-15) can be estimated by measuring the supply current to the processor ($I_{tot}$) in a similar way to Tiwari et al. [1994] and Russell and Jacome [1998]. A set of elementary programs is designed to activate the IMU, CU, and DMU together or separately. For each elementary program, the values of α, β, and δ can be computed as described in Section 2.2.2. Each elementary program is typically placed in an infinite loop in order to obtain a stable current reading. There are two competing constraints on the size of the loop. On one hand, we want to minimize the impact of the branch at the end of the loop by having more instructions within the loop. On the other hand, the loop should not be so large that it causes cache misses, which are undesirable.
Suppose N elementary program experiments are conducted. The power model coefficients a, c, d, and e can then be estimated by linear regression that minimizes the sum of squares of the residuals. In order to cope with the effect of nuisance factors and the imprecision inherent in measurements, a large N is required.
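The regression step can be sketched in a few lines. The sketch assumes the linear form I_tot = a·α + c·β + d·δ + e, which is one reading of the model and not necessarily the exact form of Eqs. (13-15). The "measurements" are synthetic values generated from made-up coefficients, purely to show that ordinary least squares recovers them; no value here comes from the paper.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit_coefficients(samples):
    """Least-squares estimate of (a, c, d, e).

    samples: list of (alpha, beta, delta, measured_current) tuples, one per
    elementary program. Assumed model: I = a*alpha + c*beta + d*delta + e.
    """
    rows = [[al, be, de, 1.0] for al, be, de, _ in samples]
    y = [s[3] for s in samples]
    # Normal equations: (R^T R) theta = R^T y
    rtr = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    rty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(4)]
    return tuple(solve_linear(rtr, rty))


# Synthetic, made-up coefficients and activity rates (not measured data).
TRUE_COEFFS = (200.0, 480.0, 100.0, 50.0)
rates = [(1.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.25, 1.0, 0.5),
         (1.0, 1.0, 1.0), (0.125, 0.125, 0.0), (0.5, 0.25, 0.25)]
a, c, d, e = TRUE_COEFFS
samples = [(al, be, de, a * al + c * be + d * de + e) for al, be, de in rates]
print([round(v, 6) for v in fit_coefficients(samples)])
```

With real measurements the residuals would not vanish, which is why a large N is needed; the elementary programs must also exercise the three blocks in sufficiently different proportions for the normal equations to be well conditioned.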
The physical meaning of the model coefficients can be illustrated through the example programs in Figures is given by
The ($a/K_{\max}$) component is the additional power consumed by the functional block IMU, because the issue width K of the instruction word in Figure 4 is one more than that of the other program. The additional power consumed by the first program is also a result of the number of processing units used per cycle, NPU. Since this difference is one, the corresponding power difference is ($c/NPU_{\max}$).
Power Consumption as a Function of Time.
Our scheduling problem requires the computation of power consumption as a function of time. Let the instruction schedule be
where $W_i$ is the long instruction word fetched at the ith time slot of this schedule and $w_i^j$, 1 ≤ j ≤ k, are the individual instructions that make up $W_i$. Suppose the ith time slot of this schedule corresponds to the PG (program address generation) pipeline step when $W_i$ is fetched. The average power in the ith time slot of this schedule is then the sum of the power of the IMU, the power of the CU, and the power of the DMU in this time slot. It can be expressed mathematically as
According to Eqs. (4) and (10), the power of the IMU in the ith time slot, $I_{IMU}^{(i)}$, is proportional to the parallelism $\alpha^{(i)}$ activated in this time slot. As depicted in Figure 3, in the ith time slot when the instruction word $W_i$ is in its PG pipeline step, $W_{i-1}$ is already in the PS step, $W_{i-2}$ is in the PW step, and so on. Let $K_i$, $K_{i-1}$, $K_{i-2}$, $K_{i-3}$, and $K_{i-4}$ be the widths of the instruction words in time slots i, i − 1, i − 2, i − 3, and i − 4, respectively. We have
where $K^{(i)}$ is the average width of the instruction words in the IMU in the ith time slot.
Similarly, in the ith time slot, instruction word $W_{i-5}$ is in the DC (decode) step in the functional block CU. At the same time, $W_{i-6}$ is in its first execution (E1) step. Even if instructions with multiple execution pipeline steps exist, we do not need to add up those instruction words that are in pipeline steps after E1. This is because, according to Table I, the execution steps after E1 in the CU consume much less power than the steps before. For example, there is no significant difference in power dissipation between an addition and a multiply instruction, because the second step E2, which exists for the multiply only, consumes only 4 mA per execution against 60 mA per execution in the DC or E1 step. According to Eqs. (14) and (11), the power of the CU in the ith time slot, $I_{CU}^{(i)}$, is proportional to the utilization rate $\beta^{(i)}$ activated in this time slot. Hence,
where $NPU_{i-5}$ and $NPU_{i-6}$ are the numbers of instructions at their pipeline steps DC and E1, respectively. Finally, in the ith time slot, if any one of $W_{i-7}$, $W_{i-8}$, and $W_{i-9}$ is involved in the three pipeline steps for data memory access in the functional block DMU (address send, access ready wait, and data receive), we have
where $NMU_{i-7}$, $NMU_{i-8}$, and $NMU_{i-9}$ are the number of internal data memory accesses executed at time slots i, i − 1, and i − 2, respectively.
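The pipeline offsets described above can be turned into a per-slot power profile. The sketch below is one possible reading, not the paper's Eqs. (16-23): it assumes that each block's activity is the plain average over the instruction words currently in that block (five words in the IMU, two in the CU, three in the DMU), that slots before the start of the schedule contribute zero, and that the idle term e is added once per slot. The function name, the default of two memory accesses per cycle, and the coefficient values are hypothetical.

```python
def power_profile(K, NPU, NMU, coeffs, k_max=8, npu_max=8, nmu_max=2):
    """Per-slot power for a schedule, under the assumptions stated above.

    K[i]   : width of the instruction word fetched in slot i
    NPU[i] : number of non-NOP instructions in that word
    NMU[i] : number of internal data memory accesses in that word
    coeffs : (a, c, d, e) model coefficients
    The profile is extended by 9 slots so the pipeline can drain.
    """
    a, c, d, e = coeffs

    def at(seq, j):
        # Words outside the schedule contribute no activity.
        return seq[j] if 0 <= j < len(seq) else 0

    profile = []
    for i in range(len(K) + 9):
        k_avg = sum(at(K, i - s) for s in range(5)) / 5       # PG..DP (IMU)
        npu_avg = (at(NPU, i - 5) + at(NPU, i - 6)) / 2        # DC, E1 (CU)
        nmu_avg = sum(at(NMU, i - s) for s in (7, 8, 9)) / 3   # DMU steps
        profile.append(a * k_avg / k_max
                       + c * npu_avg / npu_max
                       + d * nmu_avg / nmu_max
                       + e)
    return profile


# One full-width word with two memory accesses; made-up coefficients.
print(power_profile([8], [8], [2], (5.0, 2.0, 3.0, 0.0)))
```

Tracing a single instruction word through the profile shows its contribution moving from the IMU term (slots 0 to 4) to the CU term (slots 5 and 6) and then to the DMU term (slots 7 to 9).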
Model Accuracy
In Julien et al. [2003], the instruction-level algorithmic activity-based modeling method was applied to the C62 and the C67 VLIW digital signal processors. These two processors share the same functional block analysis, but the model coefficient values are different. The models were validated using a number of classical digital signal-processing algorithms: a finite impulse response (FIR) filter, a least mean-square (LMS) filter, a discrete wavelet transform with two image sizes, 64 × 64 pixels (DWT1) and 512 × 512 pixels (DWT2), an enhanced full-rate (EFR) vocoder based on the global system for mobile communication (GSM) standard, and an MPEG application. For the C62, the average error between the estimates and the physical measurements was 2.5%, with a maximum error of 4%. The maximum error between the estimates and the physical measurements for the C67 was 6% [Julien et al. 2003].
The idea behind this instruction-level algorithmic activity-based power-modeling method is that the power consumed by a processor is proportional to the power consumed by the functional blocks, which are grouped by concurrent activity analysis. The power consumed by each functional unit is proportional to the degree of parallelism in it. The architectural complexity of VLIW processors hides the details of many internal activities. Our modifications do not degrade the model accuracy, because only the model representation has been changed in order to better profile power consumption over time. The idea behind this model remains the same.
MIXED-INTEGER PROGRAM FORMULATION
The MIP for power-balanced scheduling consists of two main parts: the objective function and the constraints. The objective is to minimize the power variance for the program segment considered. Instruction scheduling is traditionally done using list scheduling on basic blocks or modulo scheduling on loops [Allen and Kennedy 2002]. However, for VLIW processors, techniques that enlarge the instruction scheduling scope are used; here, "region" refers to each program segment considered [Faraboschi et al. 2001; Kathail et al. 2001].
There are three main types of constraints in this MIP. Because of data dependency and resource usage conflicts among the instructions, not all combinations of instructions are allowed in a single very-long instruction word. All legal combinations must, first, satisfy the data-dependency constraints imposed by the data-flow requirements of the target program. At the same time, they must also satisfy the resource constraints, which are imposed by the architecture of the VLIW processor. The third type of constraints ensures that the total execution time of the schedule will not exceed that of the initial schedule. Therefore, if a speed-optimized instruction schedule is used as the initial schedule for the MIP, the power-balanced solution will also be speed-optimized.
The complete mixed-integer program is formally given by P1 below. The total execution time is divided into time slots. The total numbers of time slots and instructions are given by t and n, respectively. A complete instruction schedule X is composed of the set of nonzero binary decision variables $x_i^k$, where $x_i^k$ equals 1 if instruction k is allocated to time slot i; otherwise it is 0.
Each functional unit is capable of executing certain types of instructions. A functional unit is indexed by the unit type and the index of the particular unit within that type. The functional unit latency $L_k$ is the number of cycles for which instruction k engages its functional unit. The number of delay slots $D_l$ associated with an instruction l is the number of cycles required before the result of that instruction, once issued, is available.
For ease of reference, the notations used in the MIP are listed in the Appendix. Detailed discussions on the objective function and the constraints are given in subsequent sections.
P1:
min P (X ) subject to 
VLIW Instruction Scheduling for Minimal Power Variation
• Article 18 / 13
Objective Function
Power consumption in each of the t time slots is obtained through Eqs. (16-23).
The variables in these equations are related to the binary decision variables in the MIP in the following ways. $K_i$, $K_{i-1}$, $K_{i-2}$, $K_{i-3}$, and $K_{i-4}$ are the widths of the instruction words in time slots i, i − 1, i − 2, i − 3, and i − 4, respectively. They can be computed by adding up the binary decision variables for the corresponding time slots.
NPU i−5 and NPU i−6 are the number of instructions decoded in time slots i and i − 1, respectively. They are actually instructions already fetched in time slots i − 5 and i − 6, respectively. Therefore,
NMU i−7 , NMU i−8 , and NMU i−9 are the number of internal data memory accesses executed at time slots i, i − 1, and i − 2, respectively. They relate to the instructions fetched in time slots i − 7, i − 8, and i − 9, respectively. Hence,
where f k = 1, if instruction k involves internal data memory access. Otherwise, it is zero.
Substituting Eqs. (35-37) into Eqs. (16-23), the total power consumption I i in time slot i can be expressed in terms of the binary decision variables x k i .
The average power over the duration of the whole schedule is
An appropriate objective function is therefore given by
This objective function will be most appropriate for smoothing the discharge profile to improve battery efficiency. If it is desirable to bound the absolute peak power, the method discussed in Condition 1 of Section 4.2 could be used. However, it cannot guarantee a bound for the worst possible instantaneous current variation and so it cannot be used to solve the di/dt problem mentioned in Section 1.
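As an illustration of what the objective measures, the sketch below computes the mean and the variance of a per-slot power profile. It assumes that P(X) is the mean squared deviation of the slot powers from their average; the exact normalization used in the paper's objective is not reproduced here, and the two profiles are made-up numbers.

```python
def mean_power(profile):
    """Average power over the whole schedule."""
    return sum(profile) / len(profile)


def power_variance(profile):
    """Mean squared deviation of the per-slot power from its average."""
    m = mean_power(profile)
    return sum((p - m) ** 2 for p in profile) / len(profile)


# Two hypothetical schedules of the same program: same total power,
# different distribution over the four time slots.
spiky = [4.0, 0.0, 4.0, 0.0]
flat = [2.0, 2.0, 2.0, 2.0]

print(mean_power(spiky), power_variance(spiky))  # 2.0 4.0
print(mean_power(flat), power_variance(flat))    # 2.0 0.0
```

Both profiles have the same average, so a scheduler that moves from the first to the second changes only how the power is spread over time.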
Data Flow-Dependency Constraints
Flow dependencies occur when one instruction m uses the result of another instruction l. The time slots where these instructions are scheduled are given by
Register reuse can introduce additional flow dependencies. Antidependencies and output dependencies will not be considered, since they can be addressed by register renaming. Register pressure will be considered through the resource usage constraints.
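A flow-dependency check on a candidate schedule can be sketched as follows. The issue slot of an instruction is recovered from its row of binary decision variables. The separation rule used here, that a consumer m must be issued at least D_l + 1 slots after its producer l, is an assumption based on the definition of delay slots given earlier, not a transcription of the paper's constraint. Slots are numbered from 0, and all names are hypothetical.

```python
def issue_slot(x, k):
    """Slot in which instruction k is issued, given x[k][i] = 1 iff
    instruction k is allocated to time slot i (0-based)."""
    return sum(i * bit for i, bit in enumerate(x[k]))


def flow_violations(x, deps, delay_slots):
    """Return the (l, m) pairs whose consumer m is issued too early.

    deps        : list of (l, m) pairs, meaning m uses the result of l
    delay_slots : delay_slots[l] is D_l for instruction l
    Assumed rule: slot(m) >= slot(l) + D_l + 1.
    """
    return [(l, m) for l, m in deps
            if issue_slot(x, m) < issue_slot(x, l) + delay_slots[l] + 1]


# Three instructions, four slots. Instruction 0 has one delay slot, so
# its consumers must be issued in slot 2 or later.
x = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 1]]
deps = [(0, 1), (0, 2)]
delay_slots = {0: 1, 1: 0, 2: 0}
print(flow_violations(x, deps, delay_slots))  # [(0, 1)]
```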
Resource Usage Constraints
There are four types of resource constraints. They are the constraints on the functional units, registers, shared buses, and read/write ports. We shall describe each of them in detail. 
The second type of constraint on a functional unit is associated with instructions that have multicycle functional unit latency. These instructions occupy the functional unit for a number of cycles. Thus, new instructions cannot be dispatched to that functional unit during this locking period. A functional unit q of type j is locked by an earlier instruction k in the current time slot i, if the following three conditions are satisfied. for integer values of x. 3. Time slot i is in the lock-down period of instruction k, i.e., $L_k$ time slots after k has been issued. Mathematically,
These three conditions can be combined into a single expression
Checking this against each of the n instructions gives us constraint Eq. (28).
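The locking test can be sketched directly on a schedule. The sketch assumes that an instruction issued in slot s with functional unit latency L occupies its unit in slots s through s + L − 1, so that the unit is locked against new dispatches in slots s + 1 through s + L − 1; the exact window in the paper's constraint may differ. The unit names, mnemonics, and latencies are hypothetical.

```python
def unit_locked(issue, unit_of, latency, unit, slot):
    """True if `unit` is still occupied in `slot` by an earlier instruction.

    issue   : dict instruction -> issue slot
    unit_of : dict instruction -> functional unit it executes on
    latency : dict instruction -> functional unit latency L_k
    The issue slot itself is not counted as locked here; two instructions
    competing for one unit in the same slot is a separate constraint.
    """
    return any(unit_of[k] == unit and s < slot < s + latency[k]
               for k, s in issue.items())


issue = {"mpy": 2, "add": 3}
unit_of = {"mpy": "M1", "add": "L1"}
latency = {"mpy": 4, "add": 1}

for slot in range(2, 7):
    print(slot, unit_locked(issue, unit_of, latency, "M1", slot))
```

A single-cycle instruction never locks its unit under this reading, which matches the fully pipelined case assumed in earlier formulations.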
Register Constraints.
A rescheduling of instructions may lengthen a variable's lifetime, leading to increased pressure on registers. Therefore, we need to analyze the lifetime of each register variable to ensure that there are sufficient registers for reallocation in each time slot. We shall assume that 1. all registers are of the same type; 2. every operand of an instruction occupies, at most, one register; 3. the result of an instruction resides in the register used by later instructions.
These three assumptions describe the most common way in which instructions utilize the general registers. Some processors may have special-purpose instructions that make use of special registers. Extra register constraints can be included for the rescheduling of these special instructions. If an operand of an instruction requires more than one register, the number of feasible time slots into which this instruction can be rescheduled will be smaller, since it takes up more registers. Furthermore, if data are allowed to be passed from one instruction to another directly, then these two instructions will have more feasible time slots to be rescheduled into.
The register usage chain can be derived from the dependence relations in the data flow graph. The lifetime of a register variable extends from the time when an instruction defines it until the time when it is last used by other instructions. Consider the data flow graph where only instruction m uses the register variable that instruction l has defined. Since
is equal to 1 if time slot c is in the lifetime of the defined register variable, where D l is the number of delay slots of instruction l and L m is the functional unit latency of instruction m.
Extending Eq. (43) to a set S of instructions that uses this register variable, we have
This is the basis of constraint (29), which ensures that the total number of live register variables in a given time slot does not exceed the total number of registers available.
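Register pressure for a candidate schedule can be estimated by counting live register variables per slot. The sketch assumes that a variable defined by instruction l becomes live at slot(l) + D_l and stays live until the last user m has finished with it, at slot(m) + L_m − 1. This is one plausible reading of the lifetime expression, not a transcription of it, and the example values are made up.

```python
def live_counts(variables, t):
    """Number of live register variables in each of t time slots.

    variables: list of (def_slot, delay_slots, uses), where uses is a list
    of (use_slot, latency) pairs for the instructions reading the variable.
    """
    live = [0] * t
    for def_slot, delay, uses in variables:
        if not uses:
            continue  # a value that is never read needs no lifetime here
        start = def_slot + delay
        end = max(use_slot + lat - 1 for use_slot, lat in uses)
        for c in range(start, min(end, t - 1) + 1):
            live[c] += 1
    return live


def fits_register_file(variables, t, num_registers):
    """True if no slot needs more registers than are available."""
    return max(live_counts(variables, t)) <= num_registers


variables = [
    (0, 0, [(2, 1)]),          # defined in slot 0, read in slot 2
    (1, 0, [(2, 1), (4, 1)]),  # defined in slot 1, last read in slot 4
]
print(live_counts(variables, 6))  # [1, 2, 2, 1, 1, 0]
```

Moving the last reader of a variable to a later slot lengthens the corresponding run of live slots, which is how rescheduling can raise register pressure.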
Shared Bus Constraints.
A shared bus l should not be used by more than one instruction in any time slot i. An instruction k would use shared bus l in time slot i if this instruction is allocated to time slot i, i.e., $x_i^k = 1$, and it can be executed on the qth functional unit of type j that requires shared bus l, i.e., $a^k_{qj} b^l_{kqj} = 1$. Constraint (30) on shared bus usage ensures that the total number of instructions using the same shared bus in a given time slot does not exceed one.
3.3.4 Read/Write Port Constraints. If two functional units share a read register port that forbids parallel access for source operands, then instructions using these functional units cannot be scheduled to the same instruction word if they both require source operands. A similar restriction applies to the write register ports for the destinations of the instructions. An instruction k would use read register port l in time slot i if it is allocated to time slot i, i.e., $x_i^k = 1$, and it can be executed on the qth functional unit of type j, which shares read register port l, i.e., $a^k_{qj} d^l_{qj} = 1$. Thus, we arrive at constraints (31) and (32).
Other Constraints
Constraint (33) guarantees that all execution deadlines must be met. In order to ensure that the execution performance of the target program is not compromised, we can set the total number of time slots as that in the optimal schedule obtained by a performance-oriented compiler.
Since the total execution time of the MIP solution does not exceed that of the initial schedule, the total power consumption of the whole program would remain unchanged. This is because Eq. (3), which is used for computing the total power, is linear. Rescheduling instructions in this way will only result in a redistribution of power among the time slots available. While the power variation changes, the total power computed by Eq. (3) remains the same.
In addition, constraint (34) ensures that each instruction must be issued once and only once.
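The role of the binary decision variables in these two constraints can be illustrated with a small sketch. A schedule given as a list of issue slots is expanded into the matrix x, where x[k][i] = 1 if instruction k is allocated to time slot i. An out-of-range slot stands in for a missed deadline, and the once-and-only-once property is checked row by row. Function names are hypothetical and slots are numbered from 0.

```python
def decision_matrix(issue_slots, t):
    """Build x[k][i] from the issue slot of each instruction.

    Raises ValueError if an instruction falls outside the t available
    slots, i.e., the schedule would exceed the allowed execution time.
    """
    x = [[0] * t for _ in issue_slots]
    for k, slot in enumerate(issue_slots):
        if not 0 <= slot < t:
            raise ValueError(f"instruction {k} misses the deadline")
        x[k][slot] = 1
    return x


def issued_exactly_once(x):
    """Each row of x must contain exactly one 1."""
    return all(sum(row) == 1 for row in x)


def word_widths(x):
    """Width of the instruction word in each slot: column sums of x."""
    return [sum(col) for col in zip(*x)]


x = decision_matrix([0, 0, 2], t=3)
print(x)                       # [[1, 0, 0], [1, 0, 0], [0, 0, 1]]
print(issued_exactly_once(x))  # True
print(word_widths(x))          # [2, 0, 1]
```

The column sums are the instruction word widths, which is the sense in which the widths are obtained by adding up the decision variables of a time slot.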
BRANCH-AND-BOUND ALGORITHM
P1 can be solved using generic MIP solvers, which are much more efficient than exhaustive search. However, they do not have problem-specific knowledge to help reduce the search space and, therefore, are generally inferior to problem-specific algorithms. In this section, we shall describe a branch-and-bound algorithm that we developed for solving P1 efficiently. One more advantage is that the peak power can also be flexibly bounded by one of the heuristics for branching.
Preliminaries
We shall assume that an initial schedule is obtained by using a standard performance-optimized compiler. Therefore, the task is to reschedule instructions for minimal power variation. Starting with the initial schedule $X_1$, a branch-and-bound tree can be constructed using $X_1$ as the root node. Each node of the tree represents a feasible schedule. A branch of the tree connects a parent schedule $X_r$ to a child schedule $X_t$ if $X_t$ is obtained from $X_r$ by rescheduling one single instruction.
For every schedule $X_r$ in the branch-and-bound tree, we define two associated sets of instructions. The first set, denoted $U_r$, consists of the instructions rescheduled along the path from $X_1$ to $X_r$. The second set, denoted $V_r$, consists of all the remaining instructions yet to be rescheduled. A schedule $X_s$ is a successor of $X_r$ if we can obtain $X_s$ from $X_r$ by rescheduling only the instructions in the set $V_r$.
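One possible way to represent a tree node together with its two instruction sets is sketched below. The class name, field names, and the tuple-of-slots encoding are illustrative assumptions rather than the data structures of the actual implementation, and feasibility checks are omitted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    """One schedule X_r in the tree (hypothetical representation)."""
    slots: tuple            # slots[k] = time slot assigned to instruction k
    rescheduled: frozenset  # U_r: instructions rescheduled on the path from the root

    def remaining(self):
        """V_r: the instructions that have not been rescheduled yet."""
        return frozenset(range(len(self.slots))) - self.rescheduled

    def child(self, k, new_slot):
        """Reschedule a single instruction k taken from V_r."""
        if k in self.rescheduled:
            raise ValueError("instruction already rescheduled on this path")
        slots = self.slots[:k] + (new_slot,) + self.slots[k + 1:]
        return Node(slots, self.rescheduled | {k})

def is_successor(node, other):
    """True if `other` keeps every instruction of U_r where `node` placed it."""
    return all(node.slots[k] == other.slots[k] for k in node.rescheduled)

root = Node(slots=(0, 0, 1, 2), rescheduled=frozenset())
c = root.child(1, 1)   # move instruction 1 to time slot 1
```

Keeping the node immutable means a child never alters its parent, so sibling schedules can be generated from the same parent safely.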
Let $I^u_i$ denote the total power consumption in time slot $i$ due to the instructions in $U_r$. $I^u_i$ can be computed in a similar way to Eq. (38), i.e.,
We shall now discuss the branching process, selection rules, and lower-bound estimation of our algorithm in detail.
Branching
Let $X_{CBS}$ denote the current best schedule. For any schedule $X_r$, if the lower bound of the objective function values for all the successors of $X_r$ is not less than the objective function value for $X_{CBS}$ (i.e., $P(X_{CBS})$), then no better schedule exists among the successors of $X_r$. Therefore, there is no need to branch from $X_r$, and $X_r$ can be removed from the leaf schedule pool. Otherwise, we branch from $X_r$.
The branching process for a selected leaf schedule $X_r$ is summarized in Algorithm 1. New schedules are generated from $X_r$ by rescheduling an instruction in the set $V_r$. Higher priority is given to instructions with fewer data dependencies, since these typically have more schedule slack for power balancing. The convergence of the branch-and-bound algorithm should accelerate if this slack is explored earlier.
Suppose instruction $k$ in $V_r$ is to be rescheduled. First, we branch from $X_r$ by rescheduling instruction $k$ to all time slots. The child schedules are checked against constraints (27)-(32). The objective function values of the feasible ones are then computed. For any feasible child schedule $X_f$, if $P(X_f) < P(X_{CBS})$, then $X_f$ becomes the new $X_{CBS}$. Next, $X_r$ is deleted from the leaf schedule pool and the generated feasible child schedules are examined to see which ones should be inserted into this pool. Let $X_s$ be the new schedule after instruction $k$ is rescheduled to time slot $j$. $X_s$ can then be inserted into this pool if it satisfies the following five conditions.

• Condition 1. $I^u_j$ should not be larger than the peak power of the current best schedule. That is,

$$I^u_j \le P_{\max}(X_{CBS}) \qquad (46)$$

where $P_{\max}(X_{CBS})$ is the peak power of the current best schedule. If it is desirable to limit the peak power of the resultant schedule, then we can replace $P_{\max}(X_{CBS})$ in Eq. (46) by a predetermined bound on the peak power. However, doing so may discard potentially optimal solutions.

Algorithm 1 (fragment: final steps only)
    if the depth of $X_r$ < n − 1 then
      /* Otherwise the branching process will reach the bottom of the branch-and-bound tree. */
      Add the feasible child schedules which satisfy the five conditions described in Section 4.2 to the leaf schedule pool;
    end
  end
  Delete $X_r$ from the leaf schedule pool.
• Condition 2. Data dependencies must be satisfied. These dependencies are expressed by the following inequality:

for all instructions $a, b = 1, 2, \ldots, n$ and $G_{ab} > 0$. Here,

Applying Eq. (47) to the new schedule $X_s$ obtained by rescheduling instruction $k$ to time slot $j$, we have

where $\forall a, b \in U_r$ and $G_{ak}, G_{kb} = -1$.

• Condition 3. The time slots where an instruction may be scheduled must satisfy Eq. (47) as well as the deadline constraints (33). For instruction $k$ in the new schedule $X_s$, the range of possible time slots is given by

where $D_F$ is the number of time slots needed to execute instruction $F$, which is defined in Eq. (51).

• Condition 4. The resource constraints, Eqs. (28)-(32), must be satisfied in time slot $j$.

• Condition 5. The lower bound of the objective function values for the successors of $X_s$ must be less than $P(X_{CBS})$.
Selection Rules
Given the current leaf schedule pool, the branch-and-bound algorithm uses the selection rules to choose one leaf schedule for branching. The selection strategy gives higher priority to the leaf schedules with fewer instructions in their $U_r$. Such a selection rule ensures that the instructions with fewer data dependencies are rescheduled earlier. Instructions with fewer data dependencies may have more schedule slack for power balancing, so exploring this slack earlier should accelerate the convergence of the branch-and-bound algorithm.
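The interplay of the leaf schedule pool, the pruning test, and the selection rule can be sketched as a generic skeleton. Everything here is an assumption made for illustration: the function signature, the three caller-supplied callables, the optional node budget, and the toy problem (unit-power instructions, no dependencies, total absolute deviation from the mean as the objective, and a trivial lower bound of zero). Feasibility checks are assumed to happen inside `children`.

```python
import heapq
from itertools import count

def branch_and_bound(root, objective, lower_bound, children, max_nodes=None):
    """Skeleton only: the three callables are supplied by the caller (hypothetical API)."""
    best, best_val = root, objective(root)
    tie = count()                      # tie-breaker so schedules are never compared
    pool = [(0, next(tie), root)]      # leaf schedule pool keyed by |U_r|
    visited = 0
    while pool and (max_nodes is None or visited < max_nodes):
        depth, _, node = heapq.heappop(pool)    # fewest rescheduled instructions first
        visited += 1
        if lower_bound(node, depth) >= best_val:
            continue                            # no successor can beat the current best
        for child in children(node, depth):     # assumed to yield feasible children only
            val = objective(child)
            if val < best_val:
                best, best_val = child, val
            if lower_bound(child, depth + 1) < best_val:
                heapq.heappush(pool, (depth + 1, next(tie), child))
    return best, best_val, visited

# Toy problem: three unit-power instructions, three time slots, no dependencies.
NUM_SLOTS = 3

def toy_objective(slots):
    loads = [slots.count(i) for i in range(NUM_SLOTS)]
    mean = sum(loads) / NUM_SLOTS
    return sum(abs(x - mean) for x in loads)

def toy_children(slots, depth):
    # Instruction number `depth` is the next one to reschedule (fixed order).
    if depth >= len(slots):
        return []
    return [slots[:depth] + (s,) + slots[depth + 1:] for s in range(NUM_SLOTS)]

def toy_lower_bound(slots, depth):
    return 0.0   # trivial but valid: the toy objective is never negative

best, best_val, visited = branch_and_bound(
    (0, 0, 0), toy_objective, toy_lower_bound, toy_children)
print(best, best_val, visited)
```

Keying the heap on the number of rescheduled instructions expands shallow leaf schedules first, which is one reading of the selection rule above. A tighter lower bound than the trivial one used in the toy would prune far more of the pool.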
Lower-Bound Estimation
The efficiency and effectiveness of the branch-and-bound algorithm depend heavily on how accurately the lower bound of the objective function, given a selected schedule $X_r$, can be estimated. There are two basic requirements for the lower-bound estimation algorithm. First, the estimated lower bound should be tight, i.e., its value should not be too far off the optimal value for the successors of $X_r$. This helps to reduce the search space. Second, it should be computationally efficient.
A common approach is to relax the integer constraints on the nonbranched decision variables in the MIP. The optimal solution of the relaxed program then provides a lower-bound estimate. However, a direct application of this approach to P1 is not computationally efficient enough, since it still involves solving the original program albeit without the integer variables. More importantly, this approach does not provide us with a lower bound that is tight enough for our problem.
Our approach is to recognize that at this time we are only interested in obtaining a lower bound of the objective function values instead of an optimal instruction schedule. In any successor of $X_r$, the power consumption at time slot $i$, denoted by $I_i$, is the sum of the power consumption of instructions already rescheduled ($I^u_i$) and the power consumption of the instructions from the set $V_r$. By redistributing the power consumption of instructions in $V_r$, we obtain the minimum of the objective function (40), which is a lower bound of the objective function values for the successors of $X_r$.
The total power consumption of each instruction in $V_r$ is composed of power consumption in each function block, which is spread out over a number of time slots. We shall denote the power consumed by an instruction in a time slot in function blocks IMU, CU, and DMU as $P_{IMU}$, $P_{CU}$, and $P_{DMU}$. For example, we have $P_{IMU}$ = denote the number of units of $P_{IMU}$, $P_{CU}$, and $P_{DMU}$, respectively, in a particular time slot $i$. An alternative mixed-integer program, $P_{LB}$, can be formulated based on these variables. The optimal solution of $P_{LB}$ will provide us with a lower bound of the objective function for the successors of $X_r$. This alternative MIP is specified as follows.
where represents the power of the instructions that have yet to be rescheduled. In solving $P_{LB}$, we are essentially redistributing the various portions of power of those instructions in $V_r$, instead of the whole instructions, to the available time slots. Note that no actual instruction schedules are explicitly obtained.
The advantage of using $P_{LB}$ is that it implicitly ensures that instructions will not occupy fractions of a time slot, because the power of a single instruction in a functional block is subdivided into an integer number of time slots. In contrast, if we simply relax the constraints on the binary decision variables $x^k_i$ of P1 in a conventional way, some instructions will not start at the beginning of a time slot. This leads to a solution with power variation much less than that obtained through $P_{LB}$. Thus, $P_{LB}$ is able to provide us with a much more accurate lower bound. This is stated formally by the following theorem.
THEOREM 4.1. For a given schedule $X_r$, the optimal value of $P(X)$ obtained by the integer program $P_{LB}$ is a lower bound of the objective function $P(X)$ of the mixed-integer program P1 for all successors of $X_r$.
PROOF. For a given schedule $X_r$, let $s_{LB}$ denote the optimal value of $P(X)$ for the integer program $P_{LB}$. We need to prove that for any successor $X_s$ of $X_r$, $s_{LB} \le P(X_s)$.

Let P1′ be the subproblem of P1 with the partial schedule $X_r$. It is an MIP with binary decision variables $x^k_i$ for $k \in V_r$ and $i = 1, \ldots, t$. Let $s_{P1'}$ denote the optimal objective function value of P1′. Then, for any successor $X_s$ of $X_r$,

$$s_{P1'} \le P(X_s). \qquad (58)$$

Now compare $P_{LB}$ with P1′ for $X_r$. The two formulations have the same objective function. However, in $P_{LB}$, the data dependency and resource constraints on the instructions in $V_r$ have been removed. Since $P_{LB}$ is the same problem as P1′ but with fewer constraints, the optimal value of $P_{LB}$ cannot exceed the one obtained by P1′. That is, for $X_r$,

$$s_{LB} \le s_{P1'}. \qquad (59)$$

Therefore, based on Eqs. (58) and (59), $s_{LB} \le P(X_s)$ for any successor $X_s$ of $X_r$.
Furthermore, $P_{LB}$ is a very simple integer program, which can be solved very efficiently. It does not have any data dependency or resource constraints as in P1. Hence, it can be solved by using a simple water-filling algorithm, as shown in Algorithm 2. This algorithm starts with the schedule where each time slot already has power consumption given by $I^u_i$. The power units of the instructions in $V_r$ are placed into the time slots of this schedule by filling the time slots with the lowest power first. Larger power units are chosen first, because it is easier to use the smaller portions to "fill in the gaps" later, so that power variation is minimized.
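The greedy placement just described can be sketched as follows. This is a minimal illustration rather than a reproduction of Algorithm 2: the function names and input values are hypothetical, each power unit is assumed to occupy exactly one time slot, and the total absolute deviation from the mean is only an assumed stand-in for the objective function (40).

```python
import heapq

def water_fill(base_loads, power_units):
    """Place each power unit, largest first, into the currently lowest time slot."""
    heap = [(load, i) for i, load in enumerate(base_loads)]
    heapq.heapify(heap)
    loads = list(base_loads)
    for unit in sorted(power_units, reverse=True):
        load, i = heapq.heappop(heap)       # time slot with the lowest power so far
        loads[i] = load + unit
        heapq.heappush(heap, (loads[i], i))
    return loads

def abs_deviation(loads):
    """Assumed stand-in for objective (40): total absolute deviation from the mean."""
    mean = sum(loads) / len(loads)
    return sum(abs(x - mean) for x in loads)

base = [5.0, 1.0, 2.0]          # hypothetical I^u_i for three time slots
units = [1.0, 3.0, 2.0, 1.0]    # hypothetical power units of the instructions in V_r
filled = water_fill(base, units)
print(filled, abs_deviation(filled))
```

A heap keeps the search for the lowest time slot at logarithmic cost per power unit, so the estimate stays cheap relative to solving a relaxed MIP at every node.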
PERFORMANCE EVALUATION
Texas Instruments' C6711 digital signal processor is used as the target VLIW processor for our experiments. Its detailed resource constraints and instruction set information can be found in Texas Instruments [2000] . Figure 6 illustrates its internal organization. The algorithmic activity based power model for the C6711 described in Section 2.2 is employed.
The Mediabench [Lee et al. 1997] and the Trimaran benchmarks [Chakrapani et al. 2005] are used. The benchmark programs are compiled using the compiler in Code Composer Studio with optimization options "-o3" (optimization enabled at file level) and "-ms0" (speed first, size second). The instruction schedules obtained are used as the initial feasible solution for P1. By setting the total number of available time slots to that of these speed-optimized schedules, the solution provided by P1 will have the same speed performance, but with power variation minimized. In this way, the proposed branch-and-bound algorithm is used as an additional back-end phase in the non-power-aware compiler.
All our computational experiments were conducted on an Intel Pentium 4 personal computer running at 2. are summarized as follows:
1. Reduction in power variation: Overall, our algorithm produces schedules with an average improvement of 46.85% for the Trimaran benchmarks and 41.76% for the Mediabench benchmarks while maintaining the same speed performance as the non-power-balanced schedules (refer to Columns "Fn," "F," and "Imv"). The values under Columns "Fn" and "F" for the Trimaran and Mediabench benchmarks are also compared in Figure 7.
2. Reduction in maximum power deviation: The average reductions in maximum deviation from the mean are 24.10% for the Trimaran benchmarks and 17.31% for the Mediabench benchmarks (refer to Columns "fn," "f," and "Imf").
Figure 8 shows the percentage of rescheduled instructions and the percentage of time slots with rescheduled instructions for each Trimaran benchmark. These percentages do not exactly track the improvement percentages under Column "Imv" in Table II, because the improvement achieved also depends on the power consumption and execution clock cycles of the rescheduled instructions. Figure 9 shows the convergence behavior for a program block with 60 instructions and 39 time slots. As a result of the proposed branching and selection rules, the branch-and-bound algorithm reaches a solution with an objective function value within 4.4% of the global optimum after only 380 leaf schedules have been visited.
Notes to Table II: Fn, power variation defined by (40) of schedules produced by Code Composer; F, power variation defined by (40) of schedules produced by the branch-and-bound algorithm; T, the computation time of the branch-and-bound algorithm; fn, maximum power deviation from the mean for schedules produced by Code Composer; f, maximum power deviation from the mean for schedules produced by the branch-and-bound algorithm; Imv, percentage improvement of "F" over "Fn"; ImvT, the gain per unit time, ImvT = (Fn − F)/T; Imf, percentage improvement of "f" over "fn."

For complex programs, the time taken to reach the optimal solution may be unacceptably long. Given the convergence behavior of the branch-and-bound algorithm, it may be sensible to obtain, instead, a suboptimal solution within a reasonable time. Using the most time-consuming Mediabench benchmarks, "jpeg" and "mpeg2," we set a maximum limit of 300 nodes per instruction block.
The results in Table III show that an average saving of 70.39% in computation time can be achieved at the cost of a reduction in the improvement of power variation by 9.68% (refer to Columns "dImv" and "uT"). This tradeoff between computation time and power variation reduction percentage is also highlighted in Figure 10 for jpeg and mpeg2, respectively.
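The tradeoff columns can be computed as in the sketch below. The formula for ImvT follows the table notes; the exact forms of Imv and uT are assumptions (a percentage relative to Fn and to T, respectively), and the input numbers are hypothetical rather than taken from Table II or Table III.

```python
def improvement_pct(fn, f):
    """Imv: percentage improvement of F over Fn (assumed to be relative to Fn)."""
    return 100.0 * (fn - f) / fn

def gain_per_unit_time(fn, f, t):
    """ImvT = (Fn - F) / T, as given in the table notes."""
    return (fn - f) / t

def tradeoff(fn, f, t, f300, t300):
    """Compare the full run (F, T) with the node-limited run (F300, T300)."""
    imv = improvement_pct(fn, f)
    imv300 = improvement_pct(fn, f300)
    return {
        "dImv": imv - imv300,            # lost improvement, in percentage points
        "uT": 100.0 * (t - t300) / t,    # assumed form of the time-saving percentage
    }

# Hypothetical values for one benchmark.
result = tradeoff(fn=200.0, f=100.0, t=50.0, f300=120.0, t300=10.0)
print(result)
```

Reading dImv and uT side by side shows how much improvement is given up for each unit of computation time saved.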
Notes to Table III: Fn, power variation defined by (40) of the schedules produced by Code Composer; $F_{300}$, power variation defined by (40) of schedules produced by the branch-and-bound algorithm, with a maximum branching limit of 300; $T_{300}$, the computation time for $F_{300}$; $Imv_{300}$, percentage improvement of "$F_{300}$" over "Fn"; $ImvT_{300}$, the gain per unit time, $ImvT_{300} = (Fn − F_{300})/T_{300}$; dImv, the degradation of the best objective function value, computed by dImv = Imv − $Imv_{300}$; uT, percentage saving in computation time, comparing "$T_{300}$" with "T" in Table II.

Fig. 10. Tradeoff between computation time and power variation reduction percentage with maximum subproblem 300.

CONCLUSIONS
Although VLIW is an energy-efficient architecture and VLIW instruction scheduling techniques for performance optimization are adaptable to total energy optimization, instruction schedules that are optimized for speed often exhibit large variation in processor power consumption during the execution of the target program. This paper focuses on minimizing the power variance by rescheduling instructions without compromising execution speed. The problem is formulated as a mixed-integer program. The major contribution of this paper is a branch-and-bound algorithm that can solve this MIP much more efficiently than generic solvers, making the technique more attractive for use in practical compilers. In particular, the heuristics used to guide the branching and selection processes are able to substantially reduce the search space. The lower-bound estimation process is effective and computationally efficient, which also accelerates the convergence of the branch-and-bound algorithm. Furthermore, the peak power can also be flexibly bounded by the heuristics for branching. The results of simulation experiments based on the C6711 VLIW digital signal processor using benchmark programs from Mediabench and Trimaran confirmed the effectiveness and efficiency of our method.
FUTURE WORK
The techniques proposed in this paper are for VLIW instruction scheduling at compile time. During the execution of the program, power variation would be affected by instruction and data cache misses. These factors are not considered in this paper. An extension of the current work is to develop techniques that can be used at compile time to estimate and handle cache misses.
Only simulated results are presented. The problem would be treated more thoroughly if direct physical measurements of runtime power variation on a real system were performed. This requires measuring the instantaneous current from a program execution trace for any benchmark program at a rate of at least the processor clock frequency. The existing setup uses sampled multimeter data for measuring runtime power consumption [Isci and Martonosi 2003]. However, if this setup is used for our purpose, some dynamic measurement techniques have to be refined. There is a need for synchronization between the software application execution, the processor's clock, and the oscilloscope [Muresan and Gebotys 2001], since an accurate power versus time waveform is required in order to determine the position of an instruction or of a set of instructions in the waveform. Such a waveform would give some insight into possible shortcomings of our algorithm and the improvements to be made. The next step is to develop measurement schemes appropriate for our purpose.
APPENDIX NOTATIONS OF THE MIP
• $n$ is the total number of instructions in the given schedule.
• $t$ is the total number of time slots available.
