Abstract-A heterogeneous multiprocessor (HeMP) system consists of several heterogeneous processors, each of which is specially designed to deliver the best energy-saving performance for a particular category of applications. A low-power real-time scheduling algorithm is required to schedule tasks on such a system so as to minimize its energy consumption while completing all tasks by their deadlines. Existing works assume that processor speeds are known a priori and therefore cannot deliver the optimal energy-saving performance. The problem of determining the optimal voltage for each processor to minimize the total energy consumption is called the voltage-setup problem. To the best of our knowledge, this is the first paper to propose an optimal solution for the HeMP single-level voltage-setup problem. We first formulate the problem as a nonlinear generalized assignment problem, which has been proved to be nondeterministic polynomial-time hard (NP-hard). We next develop a pruning-based algorithm to obtain the optimal solution. A heuristic algorithm is also proposed to derive an approximate solution. After the optimal partition is obtained, each processor's speed is determined by its final workload. In our simulations, we model more than two dozen off-the-shelf embedded processors, including ARM processors and TI DSPs. The results show that the pruning-based algorithm reduces the time needed to derive the optimal solution by at least 98%, compared with the exhaustive search. Also, our heuristic algorithm consumes less energy than existing works.
I. INTRODUCTION
A heterogeneous multiprocessor (HeMP) system consists of a set of heterogeneous processors. Each processor may have its own instruction-set architecture, specially designed to provide the best performance for a particular category of applications. The HeMP architecture is commonly adopted by real-time embedded systems, in which each task must complete before its deadline. Examples are embedded control [1]-[3] and multimedia systems [4], [5]. In addition, these systems are usually battery powered. Therefore, the problem of minimizing the energy consumption without missing any deadline has become an important issue in constructing low-power real-time HeMP systems. Many real-time low-power scheduling algorithms have been proposed to reduce a processor's energy consumption by adjusting its supply voltage [6], [7]. Most of them assume that the processor voltages/speeds are known a priori. Accordingly, their schedules may not be optimal in minimizing energy consumption. On the other hand, several other algorithms [8]-[12] have been designed to determine the number of levels and the optimal speed for each level to achieve the minimum energy consumption. This is called the voltage-setup problem [9]-[12], which is crucial because a number of embedded platforms nowadays give designers high flexibility to choose efficient operating points for specific products. Selecting processors' voltages plays an important role in the design process. The single-level voltage-setup problem is solved in [8] for a system where a processor has one speed. The multilevel voltage-setup problem is addressed in [9], [10], and [11] for the case where a processor has multiple speeds. In addition, a voltage schedule for a fixed-priority system is built in [13]. However, all these works [8]-[11], [13] focus on a single-processor system and cannot solve the multiprocessor voltage-setup problem.

Manuscript received February 8, 2008; revised November 7, 2008, February 9, 2009, and July 6, 2009. Current version published October 21, 2009. This work was supported in part by the National Tsing Hua University, Taiwan, under Grant NTHU 98N2436E1, by the National Science Council of Taiwan under Grant NSC 97-2220-E-007-038, and by the Industrial Technology Research Institute, Taiwan, under Grant ITRI 98-EC-17-A-01-01-0838. This paper was recommended by Associate Editor J. Lach.
E. T.-H. Chu is with the Department of Computer Science, National Tsing Hua University, Hsinchu 30013, Taiwan (e-mail: edwardchu@cs.nthu.edu.tw).
T.-Y. Huang is with Microsoft Research, Redmond One Microsoft Way, Redmond, WA 98052 USA (e-mail: huang.taiyi@gmail.com).
Y.-C. Tsai is with MediaTek Inc., Hsinchu 300, Taiwan (e-mail: yuche.tsai@mediatek.com).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2009.2028683
To the best of our knowledge, this is the first paper to propose an optimal solution for the HeMP single-level voltage-setup problem, which is defined as scheduling n periodic real-time tasks on m heterogeneous processors and determining an optimal speed for each processor. This work is motivated by HeMP systems that do not support dynamic voltage scaling (DVS). Some of the reasons why DVS is not supported in these systems include the following.
1) The workload of an application-specific embedded system is usually static and deterministic. Its execution flow is typically a repetition of "sensing → computation → actuation," as described in [1]-[3]. Hence, due to its deterministic workload, DVS is not necessary.
2) Many embedded systems are designed within a small budget. Sometimes, these systems also have constraints on space and size. A commercial dc-dc converter designed for DVS may not fit on the printed circuit board of a small robot, as required in [1]. In addition, this converter costs several times more than an 8-bit microprocessor and significantly reduces the viability of a small-budget embedded system [14].
3) The transition time for a DVS processor may take up to 20 ms [15] when including the latency of synchronizing with other components in the system. Such a delay is not acceptable for certain real-time systems that have strict timing constraints.
We formulate the HeMP single-level voltage-setup problem as a nonlinear generalized-assignment problem (GAP) that is proven to be nondeterministic polynomial-time hard (NP-hard) [16]. Although the problem is NP-hard, it can still be solved optimally for moderate-size problems. In this paper, a pruning-based algorithm (PBA) is proposed to find the optimal solution efficiently. We construct a decision tree to represent all possible task partitions and traverse it to find the optimal solution. The variable-based, energy-based, and speed-based pruning methods are adopted to remove a large share of the task partitions that cannot be optimal. A local optimization algorithm is also proposed to speed up the tree traversal. When PBA finds the optimal partition, each processor's speed is determined by its final workload. Because the problem we address is NP-hard, PBA may require exponential time to find the optimal solution in the worst case. Hence, a polynomial-time algorithm, MinMax-E, is also proposed to acquire an approximate solution.
In our experiments, we model more than 30 off-the-shelf processors, including ARM processors and TI DSPs. Our results show that PBA reduces the time needed to derive the optimal solution by at least 98%, compared with the exhaustive search. In particular, it can solve the voltage-setup problem of four processors and 25 tasks in 8.24 s, while the exhaustive search needs approximately more than three years. We also compare MinMax-E with our previous work kX3+DP [12] and two related works, List [17] and integer linear programming (ILP) [18]. Due to the balanced energy consumption in its initial assignments, MinMax-E consumes less energy than the existing works. The improvement is as much as ten times.
The remaining sections are structured as follows. Section II shows the system model. Sections III-V present three key components of PBA. Section VI presents a heuristic algorithm. Section VII is our performance analysis. Section VIII is the related work, and Section IX gives the conclusion.
II. SYSTEM MODEL AND PROBLEM FORMULATION

A. Energy Model and Task Model
Our HeMP system consists of $m$ heterogeneous processors, $C_1, C_2, \ldots, C_m$. We adopt a commonly used energy model [19]-[21], in which the power consumption of $C_j$ at speed $y_j$, denoted by $P_j(y_j)$, is determined by
$$P_j(y_j) = k_j y_j^3 + q_j \qquad (1)$$
where $k_j y_j^3$ denotes the dynamic power consumption and the constant $q_j$ denotes the static power consumption. The dynamic power consumption of $C_j$ is approximated by $k_j \times V_j^2 \times y_j$, where $V_j$ is $C_j$'s supply voltage and $k_j$ is its adjusted switched capacitance, and its speed $y_j$ is almost linear in $V_j$. Without loss of generality, our model also allows users to set a lower bound and an upper bound on the speed of each processor. We use $h_j$ and $u_j$ to denote the lower and upper bound speeds of $C_j$, respectively.
The task model includes a set of periodic real-time tasks. Let $T$ be the set of $n$ periodic tasks, $\{\tau_1, \tau_2, \ldots, \tau_n\}$. Each periodic task $\tau_i$ is a sequence of jobs released at constant intervals called periods. Moreover, each job of a task must be completed before the next job of that task is released. All tasks are independent and preemptible. A periodic task $\tau_i$ is denoted by $(e_{i,j}, p_i)$, where $e_{i,j}$ is the maximum number of clock cycles needed to execute $\tau_i$ on processor $C_j$ and $p_i$ is $\tau_i$'s period. If $\tau_i$ cannot be executed on $C_j$, we simply set $e_{i,j}$ to $\infty$. Each processor schedules its tasks on an earliest-deadline-first basis. No task migration is allowed at runtime because migrating tasks not only incurs unpredictable overhead but also reduces the schedulability bound of the system. In addition, as pointed out in [22], it takes exponential time to complete the schedulability test of a set of migratable tasks. This task model is commonly adopted in embedded control systems [1]-[3] and embedded multimedia systems [4], [5].
Let $T_j$ be the set of tasks to be executed on $C_j$. We use $l_{i,j} = e_{i,j}/p_i$ to denote the required clock cycles per second for $\tau_i$ to execute on $C_j$. Namely, $l_{i,j}$ is regarded as the workload of $\tau_i$ on $C_j$. The total workload of $C_j$, denoted by $f_j$, is $\sum_{\tau_i \in T_j} l_{i,j}$. To guarantee that all tasks on $C_j$ meet their deadlines, $C_j$ must execute at a speed of $f_j$ or higher. Because $C_j$ has a lower bound speed $h_j$, we define $y_j$ as
$$y_j = \max\{f_j, h_j\}. \qquad (2)$$
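Let $D$ be the hyperperiod of $p_1, p_2, \ldots, p_n$, that is, $D$ is the least common multiple of $p_1, p_2, \ldots, p_n$. For $\tau_i$, the execution time of one job on $C_j$ is $e_{i,j}/y_j$, and $D/p_i$ jobs are released in $D$. When there is no task for a processor to execute, it enters a power-saving mode whose energy consumption is negligible. Hence, the energy consumption of $C_j$ in $D$ to execute $\tau_i$ at speed $y_j$ is
$$\frac{D}{p_i} \cdot \frac{e_{i,j}}{y_j} \cdot P_j(y_j) = D\, l_{i,j} \left( k_j y_j^2 + \frac{q_j}{y_j} \right). \qquad (3)$$
We define $E_j(y_j)$ as the energy consumption of $C_j$ in $D$ to execute all tasks in $T_j$ at speed $y_j$. By (3), we obtain
$$E_j(y_j) = \sum_{\tau_i \in T_j} D\, l_{i,j} \left( k_j y_j^2 + \frac{q_j}{y_j} \right) = D f_j \left( k_j y_j^2 + \frac{q_j}{y_j} \right)$$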
where
B. Problem Formulation
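The single-level HeMP voltage-setup problem is defined as scheduling $n$ tasks on $m$ processors and determining an optimal speed for each processor. The goal is to develop a feasible schedule that minimizes the total energy consumption of the system. Let $P^*$ represent this problem, whose objective is to minimize $\sum_{j=1}^{m} E_j(y_j)$. Because a task can only be assigned to one processor, we use a binary variable $x_{i,j}$ to denote which processor $\tau_i$ is assigned to. We set
$$x_{i,j} = \begin{cases} 1, & \text{if } \tau_i \text{ is assigned to } C_j \\ 0, & \text{otherwise.} \end{cases}$$
Therefore, we can formulate $P^*$ as a nonlinear GAP
$$P^*: \quad \min \sum_{j=1}^{m} E_j(y_j) \qquad (5)$$
$$\text{subject to} \quad \sum_{j=1}^{m} x_{i,j} = 1, \quad i = 1, \ldots, n \qquad (6)$$
$$f_j = \sum_{i=1}^{n} l_{i,j}\, x_{i,j}, \quad j = 1, \ldots, m \qquad (7)$$
$$y_j \le u_j, \quad j = 1, \ldots, m \qquad (8)$$
$$y_j = \max\{f_j, h_j\}, \quad j = 1, \ldots, m. \qquad (9)$$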
Equation (6) denotes that a task can only be assigned to one processor. Equation (7) represents the workload of $C_j$. Equation (8) guarantees that $C_j$'s speed will not exceed its upper bound speed $u_j$. Equation (9) ensures that $C_j$ executes at least at its lower bound speed $h_j$. A nonlinear GAP has been proven to be NP-hard in [16]. Table I lists the definitions and notations used in this paper.

Example 1(a): Consider $P^*$, defined by $n = 3$, $m = 3$, and [...] We first use List [17], a commonly used low-power multiprocessor scheduler, to solve $P^*$. List sequentially assigns each task to the least loaded processor. With these tasks, List assigns $\tau_1$ to $C_1$, $\tau_2$ to $C_2$, and $\tau_3$ to $C_3$. Hence, by (5), its energy consumption is $(9 \cdot 2^3 + 0.3) + (4 \cdot 3^3 + 0.1) + (1 \cdot 2^3 + 0.2) = 188.6$. In contrast, the optimal solution presented in this paper consumes only about 1/8 of the energy of List, as shown in Example 1(g).
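The arithmetic in Example 1(a) can be checked with a short sketch. It assumes, as the example's numbers suggest, that each processor runs exactly at its workload, that energy is counted per unit time as $k_j y_j^3 + q_j$, and that a processor with no task consumes nothing. The helper name `partition_energy` is hypothetical; the values of $k_j$, $q_j$, and the speeds are taken from the worked example in this paper.

```python
def partition_energy(loads, k, q):
    """Energy per unit time of a task partition.

    Assumes every processor runs exactly at its workload (no lower-bound
    speed) and that a processor with zero workload consumes no energy.
    """
    return sum(kj * f**3 + qj for f, kj, qj in zip(loads, k, q) if f > 0)


k = [9.0, 4.0, 1.0]   # adjusted switched capacitance of C1, C2, C3
q = [0.3, 0.1, 0.2]   # static power of C1, C2, C3

# List: tau1 -> C1 (speed 2), tau2 -> C2 (speed 3), tau3 -> C3 (speed 2).
list_energy = partition_energy([2, 3, 2], k, q)

# Optimal partition reported at the end of Example 1: speeds 1, 1, and 2.
best_energy = partition_energy([1, 1, 2], k, q)

print(round(list_energy, 1), round(best_energy, 1))
```

The ratio of the two values is roughly 8.7, which matches the "about 1/8" claim.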
C. Overview of PBA
To solve $P^*$, each $x_{i,j}$ needs to be determined. We construct a decision tree, shown in Fig. 1, based on all $x_{i,j}$'s. This tree has a depth of $n$ levels. The $i$th level shows all possible decisions for $\tau_i$'s assignment. Every node has $m$ children, each of which denotes an assignment of $\tau_i$ to a processor. In total, there are $m^n$ leaf nodes, each of which is a schedule of $n$ tasks on $m$ processors. Hence, it takes $O(m^n)$ time to search the tree exhaustively.
To traverse the tree efficiently, PBA, shown in Algorithm 1, is proposed. Initially, the current best task partition $x^*$ is empty, and its energy consumption $\varepsilon^*$ is infinite (line 2). Let $v_i$ denote a node in the tree. PBA first performs variable-based pruning to remove some binary variables. It next traverses the pruned tree from the root node $v_0$ (line 4). At each node $v_i$, PBA estimates the energy lower bound of each of $v_i$'s children (line 12) and first visits the child with the lowest lower bound (lines 13 to 19). Two pruning rules are used to speed up the tree traversal (line 15). One is energy-based pruning, which prunes a branch whose energy lower bound is higher than the energy of the current best solution. The other is speed-based pruning, which removes a branch when a processor's speed exceeds its maximum speed. When reaching a leaf node, a local optimization algorithm, named MaxReduction, is adopted to find the best task partition $x'$ in the neighborhood of the obtained task partition $x$ (line 24). $x'$ replaces $x^*$ if $x'$ consumes less energy (lines 25 to 27). Finally, when the tree traversal completes, the optimal task partition is in $x^*$ and each processor's speed is determined by (2) (line 5). In the following, we first introduce the three key components of PBA, which are the variable-based pruning, the estimation of a node's energy lower bound, and the local optimization algorithm. We next present a polynomial-time algorithm, named MinMax-E, to derive an approximate solution.

Algorithm 1 PBA (excerpt)
[...]
14: $v_j$ = the node in $D(i)$ whose energy lower bound $E_j$ is the smallest;
15: if ($E_j < E^*$ and no processor in $v_j$ is overloaded) then
[...]
17: end if
[...]
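The traversal described above can be illustrated with a minimal branch-and-bound sketch. It is not the authors' Algorithm 1: the lower bound used here is simply the energy of the partial assignment (valid because adding a task never lowers a processor's energy), whereas PBA solves a convex relaxation, and MaxReduction and variable-based pruning are omitted. Lower-bound speeds are ignored. The function names are hypothetical, and in the sample instance the entries `l[0][2] = 3` and `u[1] = 7` are assumptions, since those values cannot be recovered from the text.

```python
import math


def energy(load, k, q):
    # Energy rate of one processor running exactly at its workload;
    # a processor with no task is assumed to consume nothing.
    return k * load**3 + q if load > 0 else 0.0


def branch_and_bound(l, k, q, u):
    """Best-first depth-first search over task-to-processor assignments.

    l[i][j] is the workload of task i on processor j, u[j] the maximum
    speed of processor j. Returns (minimum energy, assignment tuple).
    """
    n, m = len(l), len(k)
    best = {"energy": math.inf, "assign": None}
    loads = [0.0] * m
    assign = [None] * n

    def total():
        return sum(energy(loads[j], k[j], q[j]) for j in range(m))

    def visit(i):
        if i == n:  # leaf: a complete task partition
            e = total()
            if e < best["energy"]:
                best["energy"], best["assign"] = e, tuple(assign)
            return
        children = []
        for j in range(m):
            if loads[j] + l[i][j] > u[j]:
                continue  # speed-based pruning: processor j would be overloaded
            loads[j] += l[i][j]
            children.append((total(), j))
            loads[j] -= l[i][j]
        for bound, j in sorted(children):  # visit the lowest bound first
            if bound >= best["energy"]:
                break  # energy-based pruning: the remaining children are no better
            loads[j] += l[i][j]
            assign[i] = j
            visit(i + 1)
            loads[j] -= l[i][j]
            assign[i] = None

    visit(0)
    return best["energy"], best["assign"]


# Instance in the spirit of Example 1 (l[0][2] and u[1] are assumed values).
l = [[2, 1, 3], [1, 3, 1], [3, 2, 2]]
k = [9.0, 4.0, 1.0]
q = [0.3, 0.1, 0.2]
u = [7, 7, 3]
best_energy, best_assign = branch_and_bound(l, k, q, u)
print(round(best_energy, 1), best_assign)
```

On this instance the search returns the partition with $\tau_1$ on $C_2$, $\tau_2$ on $C_1$, and $\tau_3$ on $C_3$ (zero-based indices `(1, 0, 2)`), after reaching only two leaves.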
III. VARIABLE-BASED PRUNING
As (5) shows, $E_j$ is a convex function. By assigning $\tau_i$ to $C_j$, we increase $C_j$'s energy consumption $E_j$. Task $\tau_i$ increases $E_j$ by the minimum amount when $T_j$ is empty and by the maximum amount when $T_j$ has every task except $\tau_i$. Let $[A_{i,j}, B_{i,j}]$ denote the range of increase in $E_j$ by assigning $\tau_i$ to $C_j$. By (5), we have
$$A_{i,j} = k_j l_{i,j}^3 + q_j, \qquad B_{i,j} = k_j L_j^3 - k_j (L_j - l_{i,j})^3.$$
In addition, let $L_j$ denote the workload of $C_j$ if all tasks are assigned to $C_j$. We have
$$L_j = \sum_{i=1}^{n} l_{i,j}.$$
For any two processors $C_a$ and $C_b$, if $A_{i,a} > B_{i,b}$, $\tau_i$ will never be assigned to $C_a$, that is, $x_{i,a}$ is guaranteed to be 0 in the solution of $P^*$. We call such a variable a dominated variable, which can be removed from the decision tree. To find all dominated variables of $\tau_i$, we may need to compare any two ranges. Since there are totally C [...]

[...] $(243.3, 1701)$. In addition, we have [...] $x_{3,1}$ is dominated, and $\tau_3$ will never be assigned to $C_1$. Similarly, since $A_{2,2} > B_{2,3}$, $x_{2,2}$ is also dominated and set to 0. We update $(l_{i,j})$ by [...]
By removing dominated variables, the updated decision tree, shown in Fig. 2 , has only 12(= 3 · 2 · 2) leaf nodes, reduced from the original 27 nodes.
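The dominance test can be sketched as follows. The sketch assumes the minimum increase is $A_{i,j} = k_j l_{i,j}^3 + q_j$ (the processor was empty) and the maximum increase is $B_{i,j} = k_j L_j^3 - k_j (L_j - l_{i,j})^3$ (the processor already held every other task), and it ignores lower-bound speeds. These forms reproduce the range $(243.3, 1701)$ quoted in the example. The function names are hypothetical, and the entry `l[0][2] = 3` is an assumed value.

```python
def increase_range(i, j, l, k, q):
    """Return (A, B): the smallest and largest possible increase in the
    energy of processor j caused by assigning task i to it."""
    L_j = sum(row[j] for row in l)             # workload if every task runs on j
    A = k[j] * l[i][j]**3 + q[j]               # processor j was empty
    B = k[j] * (L_j**3 - (L_j - l[i][j])**3)   # processor j held all other tasks
    return A, B


def dominated(i, a, l, k, q):
    """x[i][a] is dominated if some other processor b always costs less."""
    A_a, _ = increase_range(i, a, l, k, q)
    return any(A_a > increase_range(i, b, l, k, q)[1]
               for b in range(len(k)) if b != a)


l = [[2, 1, 3], [1, 3, 1], [3, 2, 2]]   # l[0][2] is an assumed value
k = [9.0, 4.0, 1.0]
q = [0.3, 0.1, 0.2]

A31, B31 = increase_range(2, 0, l, k, q)   # tau3 on C1

# Number of leaves left after removing the dominated variables.
leaves = 1
for i in range(len(l)):
    leaves *= sum(not dominated(i, j, l, k, q) for j in range(len(k)))
print(round(A31, 1), B31, leaves)
```

Under these assumptions the two dominated variables are $x_{3,1}$ and $x_{2,2}$, and the tree shrinks from 27 to 12 leaves.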
IV. ENERGY LOWER BOUND ESTIMATION
PBA prunes a node if its energy lower bound is larger than the energy consumption of the current best task partition. In this section, we formulate the lower bound estimation problem as a convex programming problem and develop an iterative algorithm to solve it.
A. Formulating the Energy Lower Bound of a Node
Let $v_\phi$ be the node under discussion and $T^\phi$ be $v_\phi$'s subtree. To estimate $v_\phi$'s energy lower bound, we relax four constraints. First, we assume that each unassigned task at $v_\phi$ can be split and dispatched to any processor. Second, we assume that the required clock cycles per second of an unassigned task $\tau_i$ are $l_i$, which is given by $\min\{l_{i,j} \mid j = 1, \ldots, m\}$. Third, unassigned tasks will not increase the static energy consumption of any processor. Finally, the speed of $C_j$ is set at its workload. Let $\theta^\phi$ be the set of unassigned tasks at $v_\phi$ and $T_j^\phi$ be the set of tasks assigned to $C_j$ at $v_\phi$. We define $r_j = \sum_{\tau_i \in T_j^\phi} l_{i,j}$ as $C_j$'s current speed, which also represents $C_j$'s current workload. Then, the total workload of any task partition in $T^\phi$ is at least
$$W^\phi = \sum_{j=1}^{m} r_j + \sum_{\tau_i \in \theta^\phi} l_i.$$
The first part of $W^\phi$ is the sum of the processors' current workloads at $v_\phi$. The second part is the lower bound of the unassigned workload.

Our goal is to dispatch $W^\phi$ to the processors so that the total energy consumption is minimum. By (5) and (9), we have [...]

Therefore, we calculate the energy lower bound of $v_\phi$ by solving
$$P^\phi: \quad \min f(y_1, y_2, \ldots, y_m) = \sum_{j=1}^{m} \left( k_j y_j^3 + q_j \right) \qquad (10)$$
$$\text{subject to} \quad \sum_{j=1}^{m} y_j = W^\phi \qquad (11)$$
$$r_j \le y_j \le u_j, \quad j = 1, \ldots, m$$
where $q_j$ is taken as 0 for every processor that has no assigned task at $v_\phi$, $y = (y_1, y_2, \ldots, y_m)^T$, and $(\cdot)^T$ is matrix transpose. The minimum value of $f(y_1, y_2, \ldots, y_m)$, which also represents $v_\phi$'s energy lower bound, cannot exceed the energy consumption of any task partition in $T^\phi$ for four reasons. First, the total workload of any task partition in $T^\phi$ is at least $W^\phi$. Second, tasks cannot be split. Third, unassigned tasks may increase the static energy consumption. Finally, a processor's speed may be faster than the demand speed.

Example 1(c): Let us consider $v_2$ in Fig. 2, where $\tau_1$ is dispatched to $C_2$ and $\theta^2 = \{\tau_2, \tau_3\}$. Since $l_{1,2}$ is 1, each processor's current speed is $r = (r_1, r_2, r_3) = (0, 1, 0)$. The total workload of any task partition in $T^2$ is at least
$$W^2 = (0 + 1 + 0) + (l_2 + l_3) = 1 + (1 + 2) = 4.$$
According to $r$, we have $(q_1, q_2, q_3) = (0, 0.1, 0)$. Therefore, the energy lower bound of $v_2$ is solved by
$$P^2: \quad \min f(y_1, y_2, y_3) = 9y_1^3 + 4y_2^3 + y_3^3 + 0.1 \quad \text{subject to} \quad y_1 + y_2 + y_3 = 4, \quad r_j \le y_j \le u_j.$$
B. Calculating the Energy Lower Bound of a Node
To solve $P^\phi$, each processor's speed $y_j$ needs to be determined. Since $P^\phi$ is a convex programming problem, the reduced gradient method and the convex simplex method are able to solve it [23]. However, both of them require the calculation of a matrix inverse, which induces a heavy computation load when $m$ is large. In addition, the required memory becomes significant when $m$ is large because two slack vectors are introduced to handle each bounded variable $y_j$. Moreover, neither method guarantees solving $P^\phi$ in a finite number of iterations. Therefore, we propose Algorithm 2 to solve $P^\phi$ efficiently. Algorithm 2 starts from an initial solution $y$ (line 1) and revises it iteratively until it becomes $P^\phi$'s optimal solution (lines 2 to 9). The corresponding energy consumption is the energy lower bound of $v_\phi$ (line 10). In the following, we first describe the method to obtain an initial solution. Then, we show how to test whether a given solution is optimal. Finally, the process to revise a nonoptimal solution is given.

1) Obtaining an Initial Solution: Starting from $y = (r_1, r_2, \ldots, r_m)$, we increase the $y_j$ that has the smallest $k_j$. When $y_j$ reaches its maximum speed $u_j$, it is kept at $u_j$. We then increase the $y_j$ that has the second smallest $k_j$. The iterative process continues until $\sum_{j=1}^{m} y_j = W^\phi$.

We continue the previous example. At first, $y = (r_1, r_2, r_3) = (0, 1, 0)$. Since $k_3$ is the smallest, $y_3$ is selected for increasing. We stop when $y_3$ reaches $u_3 = 3$. Because $\sum_{j=1}^{3} y_j = 4$, no further $y_j$ is selected for increasing. The initial solution is $y = (0, 1, 3)$.
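The greedy construction of the initial solution can be sketched as below. The function name `initial_solution` is hypothetical, and the upper bound `u[1] = 7` is an assumed value (it does not affect the result for this example); the other numbers come from the running example.

```python
def initial_solution(r, u, k, W):
    """Start from the current speeds r and raise the speeds of the
    processors with the smallest k first until the speeds sum to W."""
    y = list(r)
    remaining = W - sum(y)          # workload still to be dispatched
    for j in sorted(range(len(y)), key=lambda j: k[j]):
        if remaining <= 0:
            break
        step = min(u[j] - y[j], remaining)   # never exceed the maximum speed
        y[j] += step
        remaining -= step
    return y


y0 = initial_solution(r=[0, 1, 0], u=[7, 7, 3], k=[9, 4, 1], W=4)
print(y0)
```

Only $y_3$ is raised here, because $C_3$ has the smallest $k_j$ and its upper bound of 3 already absorbs the unassigned workload.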
2) Testing the Optimality of a Solution:
[...] $\ldots, u_m)$, $y$ is $P^\phi$'s only feasible solution and indeed $P^\phi$'s optimal solution. Otherwise, we partition $y$ into three parts, which are $N_1 = \{y_j \mid y_j = r_j\}$, $N_2 = \{y_j \mid r_j < y_j < u_j\}$, and $N_3 = \{y_j \mid y_j = u_j\}$. Each $y_j$ in $N_1$ is equal to its minimum speed (current speed), and each $y_j$ in $N_3$ is equal to its maximum speed. Each $y_j$ in $N_2$ is strictly between its minimum and maximum speeds.
Theorem 1: Let y = (y 1 , y 2 , . . . , y m ) be one of P ϕ 's feasible solutions. y is P ϕ 's optimal solution if there exists a variable y a ∈ y that satisfies the following:
Proof: Please refer to the Appendix.

Theorem 1 gives an insight into $P^\phi$'s optimal solution. We regard $3k_j y_j^2 = \partial f / \partial y_j$ as the energy slope of $C_j$. If $y$ is $P^\phi$'s optimal solution, the energy slope of the lowest speed processors ($y_j \in N_1$) must be no smaller than that of the medium-speed ($y_j \in N_2$) and the full-speed processors ($y_j \in N_3$). In addition, all medium-speed processors have the same energy slope.
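The slope ordering described above coincides with the standard Karush-Kuhn-Tucker conditions for minimizing $\sum_j k_j y_j^3$ under a fixed sum and box constraints. The sketch below tests those conditions directly; it is a generic check under that assumption, not the authors' Algorithm 3. The function name `is_optimal` and the bound `u[1] = 7` are assumptions.

```python
def is_optimal(y, r, u, k, tol=1e-9):
    """KKT test for min sum(k_j * y_j**3) s.t. sum(y) fixed, r <= y <= u.

    A point is optimal when no workload can be shifted from a processor
    with a higher energy slope to one with a lower energy slope.
    """
    slope = [3 * kj * yj**2 for kj, yj in zip(k, y)]
    # Slopes of processors whose speed can still be lowered (N2 and N3).
    can_decrease = [s for s, yj, rj in zip(slope, y, r) if yj > rj + tol]
    # Slopes of processors whose speed can still be raised (N1 and N2).
    can_increase = [s for s, yj, uj in zip(slope, y, u) if yj < uj - tol]
    if not can_decrease or not can_increase:
        return True  # no feasible move exists, so y is the only candidate
    return max(can_decrease) <= min(can_increase) + tol


r, u, k = [0, 1, 0], [7, 7, 3], [9, 4, 1]
print(is_optimal([0, 1, 3], r, u, k))                  # initial solution
print(is_optimal([0.75, 1, 2.25], r, u, k))            # after one line search
print(is_optimal([8 / 11, 12 / 11, 24 / 11], r, u, k)) # reported optimum
```

For the running example, only the last point passes: all three slopes equal $1728/121$ there, while at $(0.75, 1, 2.25)$ the slope of $C_2$ is still lower than the others.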
Corollary 2: If y is P ϕ 's optimal solution, y a , mentioned in Theorem 1 is given by
Proof: Please refer to the Appendix.

Algorithm 3, based on the previous observations, is proposed to examine the optimality of $y$. It first selects $y_a$ by Corollary 2 (lines 1 to 5). Next, it adopts Theorem 1 to test $y$'s optimality.

[...] $(y_1, y_2, \ldots, y_m)$ as a point in the $m$-dimensional space. Given $y$ and the adjustment ranges, we have a farthest point $z$ that we can move to. Algorithm 2 finds the point that consumes the minimum energy in $(y, z]$ and updates $y$ to that point (lines 6 to 7). The process repeats until $y$ becomes a candidate solution (lines 4 to 8). If the obtained candidate solution is not optimal, the previous process is repeated (lines 2 to 9). In summary, Algorithm 2 starts from a candidate solution and moves to a better one by adjusting the processors' speeds. The movement stops at $P^\phi$'s optimal solution.

Table II lists the details of Algorithm 2 (lines 2 to 9), which is inspired by Theorem 1. As Table II shows, the processor selection is based on the processors' current energy slopes. We select the processors whose energy slopes are most unbalanced. If [...]

Next, we calculate the adjustment ranges. We take the case of $N_2 \ne \emptyset$ and $o_b < 0$ as an example. As $o_b < 0$, $y_b$ is in $N_1$. Referring to the first condition of Theorem 1, we increase $y_b$ and decrease all $y_j$ in $N_2$ concurrently to approach $P^\phi$'s optimal solution. During the adjustment, we keep all medium-speed processors at the same energy slope while meeting (11). Therefore, each $y_j$ in $N_2$ has a different decreasing rate $\sigma_j / A$, where $\sigma_j = k_1 / k_j$ and $A =$ [...]

Details for the other two cases are in Table II.

TABLE II: IMPLEMENTATION DETAILS OF ALGORITHM 2

Starting from $y$, we search for a point in $(y, z]$ that consumes the lowest energy. The optimal step size $\lambda$ is solved by the line-search problem
$$\min_{0 < \lambda \le 1} f(y + \lambda(z - y))$$
where $f(y + \lambda(z - y))$ is the total energy consumption, shown in (10). Then, $y$ is updated to $y = y + \lambda(z - y)$. We repeat steps (2) and (3) until we reach a candidate solution. If this solution is not optimal, we go back to step (1).
Example 1(f): The initial solution $y$ is $(y_1, y_2, y_3) = (0 \in N_1, 1 \in N_1, 3 \in N_3)$. Since $N_2 = \emptyset$, we select $y_1$ as $y_b$ and $y_3$ as $y_c$ for speed adjustment. As $\Delta = \min\{u_1 - y_1, y_3 - r_3\} = \min\{7 - 0, 3 - 0\} = 3$, the farthest point that we can move to is $z = (y_1 + \Delta, y_2, y_3 - \Delta) = (3, 1, 0)$. Then, we have $f(y + \lambda(z - y)) = f(3\lambda, 1, 3 - 3\lambda) = 216\lambda^3 + 81\lambda^2 - 81\lambda + 31.1$, which is minimized at $\lambda = 1/4$. Therefore, $y$ is updated to $(0.75 \in N_2, 1 \in N_1, 2.25 \in N_2)$. $y$ is a candidate solution ($3k_1 y_1^2$ [...]

[...] $y$ is $P^2$'s optimal solution. The energy lower bound of $v_2$ is $f(8/11, 12/11, 24/11) = 19.1$, which is shown in Fig. 3.
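The line-search step of Example 1(f) can be reproduced numerically. The sketch below uses a ternary search, which is one possible way to minimize a convex function on a segment and is not necessarily how the authors solve the step. The function names are hypothetical; the static terms $(0, 0.1, 0)$ and the points $y$ and $z$ are taken from the example.

```python
def f(y, k, q):
    # Total energy of the relaxed problem: dynamic plus static terms.
    return sum(kj * yj**3 + qj for kj, yj, qj in zip(k, y, q))


def line_search(y, z, k, q, iters=100):
    """Ternary search for the step size in [0, 1] that minimizes
    f(y + lam * (z - y)). This is valid because f is convex along the
    segment when all speeds are nonnegative."""
    def point(lam):
        return [a + lam * (b - a) for a, b in zip(y, z)]

    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(point(m1), k, q) < f(point(m2), k, q):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


k = [9, 4, 1]
q = [0, 0.1, 0]            # static terms at node v2
y = [0, 1, 3]              # initial solution
z = [3, 1, 0]              # farthest reachable point
lam = line_search(y, z, k, q)
y_new = [a + lam * (b - a) for a, b in zip(y, z)]
print(round(lam, 4), [round(v, 4) for v in y_new])
```

The step size agrees with the closed form: setting the derivative $648\lambda^2 + 162\lambda - 81$ to zero gives $\lambda = 1/4$, which moves $y$ to $(0.75, 1, 2.25)$.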
We list the energy lower bound of v 1 and v 3 on the left-hand side of them. Because v 2 has the lowest lower bound among v 1 , v 2 , and v 3 , PBA traverses its subtree first. At v 2 , PBA calculates the energy lower bound of v 4 and v 5 . Node v 5 is visited first because its energy lower bound is less than that of v 4 . When reaching v 6 , we obtain a task partition x = (x 1,2 = 1, x 2,3 = 1, x 3,2 = 1) which consumes 109.3 energy units. PBA next applies MaxReduction to find the best task partition in the neighborhood of x.
Theorem 3: Algorithm 2 guarantees obtaining P ϕ 's optimal solution in finite steps.
Proof: Please refer to the Appendix.
V. LOCAL OPTIMIZATION SEARCH
When obtaining a feasible task partition $x$ at a leaf node, PBA applies MaxReduction to find the task partition that consumes the least energy in the neighborhood of $x$. This neighborhood is composed of all task partitions obtained by reassigning a group of tasks from the most loaded processor. Namely, we intend to improve the current best task partition by searching for the best task partition in this neighborhood, which in turn speeds up the tree traversal. In the following, we state this local optimal search problem in a recursive form and present MaxReduction.
A. Recursive Formula
Let $C_a$ be the most loaded processor, that is, the one that consumes the most energy among the processors. We reassign $\tau_i \in C_a$ to another processor if the new task partition consumes less energy. Let $C_b$ be the target processor. The current workload and speed of $C_a$ are $f_a$ and $y_a$, and those of $C_b$ are $f_b$ and $y_b$. By (5), a smaller energy consumption is achieved by reassigning $\tau_i$ to $C_b$ if and only if [...]

Currently, there are $Z$ tasks assigned to $C_a$. We define $\beta_a$ as a list of these $Z$ tasks, sorted by index in increasing order. $\beta_{a,\gamma}$ is the index of the $\gamma$th task in $\beta_a$. We define $Q$ as the maximum energy reduction after migrating a group of tasks out of $C_a$ under the migration order $\beta_a$. To determine this group of tasks, we further define $M[\gamma, g]$ as the maximum energy reduction after migrating a group of tasks, each of which is one of the first $\gamma$ tasks in $\beta_a$, such that the sum of cycle counts per second of all migrated tasks is less than or equal to $g$. Obviously, we have [...]

In addition, we set $M[0, g] = 0$ for $g = 0$ to $f_a$ because no task is migrated in these cases.

Because each migration changes the workload of both the source and the target processors, we define $H[\gamma, g]$ [...]

where EnergyDelta$(H[\gamma, g], \eta, a)$ denotes the amount of energy reduction by migrating $\tau_\eta$ out of $C_a$ at the workload of [...]
The EnergyDelta is shown in Algorithm 4. We first determine the energy reduction by migrating τ η out of C a (line 2). We next consider C j as the target processor. If migrating τ η to C j will not incur an overflow, we calculate the energy increment of C j (line 6). We examine each migration (line 4 to 8) and return the maximum energy reduction R (line 10). Its time complexity is bounded by O(m).
Algorithm 4 EnergyDelta (excerpt)
[...]
7: $R = \max\{R, \text{Minus} - \text{Plus}\}$
8: end if
9: end for
10: return $R$
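A possible reading of EnergyDelta is sketched below: compute the energy saved on the source processor (Minus), then, for every target that would not overflow, the energy added there (Plus), and keep the largest difference. This is a sketch under stated assumptions rather than the authors' listing: lower-bound speeds are ignored, an empty processor is assumed to consume nothing, the initial value of `R` and the returned target index are choices made here, and the entries `l[0][2] = 3` and `u[1] = 7` are assumed values.

```python
import math


def energy(load, k, q):
    # Energy rate at speed equal to the workload; zero when idle.
    return k * load**3 + q if load > 0 else 0.0


def energy_delta(loads, eta, a, l, k, q, u):
    """Largest energy reduction obtained by moving task eta off processor a,
    given the current workloads, together with the chosen target."""
    minus = (energy(loads[a], k[a], q[a])
             - energy(loads[a] - l[eta][a], k[a], q[a]))
    R, target = -math.inf, None
    for j in range(len(loads)):
        if j == a or loads[j] + l[eta][j] > u[j]:
            continue  # skip the source and any target that would overflow
        plus = (energy(loads[j] + l[eta][j], k[j], q[j])
                - energy(loads[j], k[j], q[j]))
        if minus - plus > R:
            R, target = minus - plus, j
    return R, target


l = [[2, 1, 3], [1, 3, 1], [3, 2, 2]]   # l[0][2] is an assumed value
k = [9.0, 4.0, 1.0]
q = [0.3, 0.1, 0.2]
u = [7, 7, 3]                           # u[1] is an assumed value

# Leaf v6 of the running example: C2 holds tau1 and tau3, C3 holds tau2.
R, target = energy_delta([0, 3, 1], eta=2, a=1, l=l, k=k, q=q, u=u)
print(round(R, 1), target)
```

Moving $\tau_3$ from $C_2$ to $C_3$ saves 78 energy units, which takes the partition from 109.3 to 31.3 as in the example.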
B. Local Optimization Algorithm
Algorithm 5, MaxReduction, is the implementation of (14) to determine the group of tasks that should be migrated out of the most loaded processor $C_a$ for the maximum energy reduction. [...] (14). If it is not selected, we simply make a copy of [...] $g]$ (lines 13 to 14). We next change two entries in $H[\gamma, g]$ by [...]

where $b$ is the target processor determined in EnergyDelta. Then, we determine $Q$ and its associated $g$ (line 20). The group of tasks that results in $Q$ is an optimal solution for maximizing the energy reduction under the migration order of $\beta_a$. This task partition is obtained by backtracking from $M[z, g]$ (line 21).
Algorithm 5 MaxReduction (excerpt)
1: Procedure MaxReduction($x$)
2: $C_a$ = the most loaded processor in the task partition $x$;
[...]
The obtained task partition at leaf node $v_6$ is $x = (x_{1,2} = 1, x_{2,3} = 1, x_{3,2} = 1)$. The current speeds of the processors are $(y_1, y_2, y_3) = (0, 3, 1)$, and the energy consumption is $(E_1, E_2, E_3) = (0, 108.1, 1.2)$. MaxReduction is applied to $C_2$ because it consumes the most energy. Its task list $\beta_2$ is $\{\tau_1, \tau_3\}$, which also represents the migration order.

As Fig. 4 shows, we first set [...]

[...] By referring to the $H$ table, we know that $\tau_3$ is reassigned to $C_3$, and $\tau_1$ is kept at $C_2$. The best task partition in the neighborhood of $x$ is $x' = (x_{1,2} = 1, x_{2,3} = 1, x_{3,3} = 1)$, which consumes 31.3 energy units. $x'$ is the current best solution, and we have $x^* = x'$. As Fig. 3 shows, PBA next visits $v_7$ and applies MaxReduction to its task partition. The best solution in the neighborhood of this task partition is $x' = (x_{1,2} = 1, x_{2,1} = 1, x_{3,3} = 1)$, which becomes $x^*$ since it consumes 21.6 energy units. Then, we go back to $v_4$ and visit $v_8$ and $v_9$. However, no better solution is found. Moreover, $v_3$ and $v_1$ are pruned, and the tree traversal completes. $x^*$ is the optimal solution, in which $(x_{1,2} = 1, x_{2,1} = 1, x_{3,3} = 1)$. By (2), the optimal speeds of $C_1$, $C_2$, and $C_3$ are 1, 1, and 2 Hz, respectively.
VI. HEURISTIC ALGORITHM
PBA is designed to find the optimal solution of the HeMP single-level voltage-setup problem, which is NP-hard. Hence, despite the pruning methods that it adopts, PBA is expected to require significant computation time, $O(m^n)$ in the worst case, for large-size problems. We therefore develop a heuristic algorithm, MinMax-E, to acquire an approximate solution in polynomial time.
Let $C_{i,a}$ be $\tau_i$'s most favorite processor, where $a$ is determined by $\arg\min_{1 \le j \le m} \{k_j l$ [...]
MinMax-E first sorts tasks by Δ i in decreasing order and dispatches them one by one. It intends to minimize the energy consumption of the most energy consuming processor. A task is assigned to a processor that incurs the minimum increase in energy consumption. Then, MinMax-E applies MaxReduction to the initial task partition. In short, MinMax-E first takes O(nm ln m) to sort all Δ i and O(nm) to get the initial partition. Then, it takes O(nmL) to obtain the improved solution. Therefore, its time complexity is O(nmL), if L > m.
VII. EXPERIMENTAL RESULTS
A. Simulation Setup
To demonstrate the effectiveness of our algorithms, we conducted a series of simulations and modeled more than 30 processors, including ARM processors and TI DSPs. The power parameters of each processor are obtained from the official websites of ARM and TI and summarized in Table III. A PC with a 2.4-GHz Intel processor and 2-GB RAM is used for simulation. We evaluate our algorithms on a simulated HeMP system consisting of 2 to 30 processors, each of which is randomly selected from the list of modeled processors. The maximum number of clock cycles for a task to execute on a particular processor, $e_{i,j}$, is independent of the scheduling algorithm. Instead, it depends on how the task is programmed and compiled to fully utilize the strength of that processor. To give a fair evaluation, we use the same set of $e_{i,j}$'s for all algorithms under evaluation. Finally, for each experiment, we run 30 simulations and take the average value for comparison.
B. Computational Results
Two types of platforms are used to study PBA's performance. Processors in Type 1 platforms are all taken either from the TI DSP series or from the ARM series. Type 2 platforms consist of both TI DSP and ARM processors. Thus, Type 2 platforms are more heterogeneous than Type 1 platforms. We generate $l_{i,j}$ randomly in $[1\text{M}, 100\text{M}]$ with uniform distribution. Results are shown in Figs. 5 and 6. Generally speaking, it takes more time to find the optimal solution for Type 1 platforms because the $k_j$ values in Type 1 platforms are more similar to one another than those in Type 2 platforms. This similarity makes energy-based pruning less effective on Type 1 platforms and induces a longer search time. For the execution time required by the exhaustive search, we profile smaller scale problems and use these results to approximate large-scale problems that would execute longer than a few hours. Because the computation for an exhaustive search is well structured, we believe that these are fair approximations. As the Reduce column shows, PBA reduces the search time by at least 98% compared with the exhaustive search. In particular, in Type 2 platforms, PBA solves the problem (4 CPUs, 25 tasks) in 8.24 s, while the exhaustive search takes approximately more than three years. The Visited Nodes column records the average number of nodes evaluated. The All Nodes column shows the number of nodes in the complete decision tree. Over 99.99% of the nodes are removed by our pruning methods.
We further evaluate two well-known nonlinear integer programming solvers, BARON [24] and LINDOGlobal [25] , via the NEOS server [26] , which is composed of Sun Microsystems workstations. As BARON and LINDOGlobal columns show, their performance is much worse than PBA. Moreover, in some cases, LINDOGlobal reports that it cannot find the optimal solution due to the numerical instability.
The CadAvg column is the average number of candidate solutions Algorithm 2 visits before finding P ϕ 's optimal solution, and CadMax is the maximum. Although there are at most 3 m candidate solutions, our results show that at most m candidate solutions are visited.
The ratio of MinMax-E's energy consumption to that of PBA is listed in the MinMax-E column. Generally speaking, the ratio becomes larger when the problem size increases because MinMax-E explores only part of the solution space. However, this ratio stays between 1.00 and 1.31 and approaches 1.00 in a two-processor system. It is fair to say that MinMax-E is capable of delivering reasonably good performance.
C. Effects on Variable-Based Pruning
We investigate the impact of the range of $l_{i,j}$, the number of tasks, and the number of processors on variable-based pruning. The processors are selected from both the ARM and TI DSP series. Let X be the number of possible task partitions before variable pruning and Y the number after variable pruning. The reduction ratio, shown on the vertical axis, is defined as (X − Y)/X. The horizontal axis represents the range of $l_{i,j}$. For example, 10K means that $l_{i,j}$ is in [1, 10K], and 100K means that it is in [1, 100K]. In Fig. 7(a), we fix the number of tasks at 15 and vary the number of processors and the range of $l_{i,j}$. It shows that the reduction ratio becomes larger when the range of $l_{i,j}$ increases. This is because the more the execution time of a task differs from processor to processor, the more opportunities we have to prune dominated variables. Moreover, the reduction ratio becomes larger when the system has more processors. In Fig. 7(b), we fix the number of processors at four. Generally speaking, a system with more tasks has a smaller reduction ratio than one with fewer tasks.
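The reduction ratio (X − Y)/X can be computed directly from the number of processors that remain eligible for each task. The sketch below assumes that, before pruning, every task may be placed on any of the m processors (so X = m^n) and that, after pruning, task i keeps r_i processors (so Y is the product of the r_i). How dominated variables are identified is not reproduced here; the function name and its inputs are hypothetical.

```python
from fractions import Fraction
from math import prod


def reduction_ratio(num_processors: int, surviving: list[int]) -> Fraction:
    """Return (X - Y) / X for variable-based pruning.

    surviving[i] is the assumed number of processors still allowed
    for task i after dominated variables have been removed.
    """
    for r in surviving:
        if not 1 <= r <= num_processors:
            raise ValueError("each task needs between 1 and m processors")
    x = num_processors ** len(surviving)   # partitions before pruning
    y = prod(surviving)                    # partitions after pruning
    return Fraction(x - y, x)
```

For instance, with four processors and two tasks, removing two processors for the first task alone halves the number of partitions, so the ratio is 1/2.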
D. Effects on Local Optimization Search
MaxReduction is applied to improve the task partition obtained at a leaf node. In Fig. 8(a) and (b), R is defined as the ratio of the energy consumption of the current task partition to the energy lower bound of the system. The y-axis, energy reduction, is the fraction of the original energy consumption that MaxReduction saves. The processors are selected from both the ARM and TI DSP series, and $l_{i,j}$ is in [1M, 100M]. The results indicate that the improvement depends on how close the current task partition is to the optimal partition. The larger R is, the more energy reduction is obtained by MaxReduction. The number of tasks and processors also affects the performance of MaxReduction. In Fig. 8(a), the number of tasks is 15. A system with fewer processors obtains a larger energy reduction than one with more processors. This is because the system with two processors has more tasks on each processor and therefore has more migration opportunities. The same result is also obtained in Fig. 8(b), where the number of processors is fixed at four.
E. Effect of Power Ratio
We define the power ratio of a processor as the ratio of its maximum dynamic power to its static power. The power ratio of each processor listed in Table III is less than 0.1. We vary the power ratio of each processor between 0.1 and 0.4 and randomize $l_{i,j}$ in [1M, 100M]. As Fig. 9 shows, the execution time increases when the power ratio becomes larger. This is caused by the assumption that unassigned tasks will not increase the static energy consumption of each processor. For such a lightly loaded system, the static power may dominate the energy consumption, and the estimated lower bound may be too small to prune all unnecessary branches.
F. Performance Comparisons of Heuristic Algorithms
We compare MinMax-E with kX3 + DP, List, and ILP. kX3 + DP is our previous work [12]. List is a commonly used homogeneous low-power scheduling algorithm that dispatches a task to the processor with the least utilization [17]. ILP represents a task-partition algorithm for a HeMP system [18]. To investigate the impact of the initial assignment on the final energy consumption, we apply MaxReduction to the task partitions produced by List and ILP. The enhanced algorithms are named List+ and ILP+. Table IV lists the time complexity of each algorithm.
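To make the List baseline concrete, the following is a minimal sketch of a least-utilization dispatcher. It assumes a homogeneous platform in which each task has a single utilization value, and it dispatches tasks in the order given; the ordering rule and tie-breaking used in [17] may differ. The function name is hypothetical.

```python
def list_partition(task_utils: list[float], num_processors: int):
    """Greedy List-style partition (sketch, not the code from [17]).

    Each task is dispatched to the processor whose accumulated
    utilization is currently the smallest; ties go to the lowest index.
    Returns (assignment, loads), where assignment[i] is the processor
    chosen for task i and loads[j] is the final utilization of processor j.
    """
    loads = [0.0] * num_processors
    assignment = []
    for u in task_utils:
        j = min(range(num_processors), key=loads.__getitem__)
        loads[j] += u
        assignment.append(j)
    return assignment, loads
```

Because such a dispatcher looks only at utilization, it ignores the different power characteristics of heterogeneous processors, which is consistent with the gap between List+ and MinMax-E reported in this section.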
To evaluate the performance of each algorithm, we generate $l_{i,j}$ in [1M, 100M] and select processors from both the TI DSP and ARM series. In Fig. 10, the y-axis is the energy consumption expressed as a percentage of that of List. In Fig. 10(a) and (b), ILP delivers the worst performance since it does not consider the energy consumption of each processor while dispatching tasks. After being combined with MaxReduction, ILP+ and List+ deliver comparable performance. However, the difference between MinMax-E and ILP+ becomes larger when the number of processors increases. This is because the number of energy-unbalanced processors increases in ILP+ when the system includes more processors. In Fig. 10(c) and (d), we remove ILP and ILP+ because both are exponential-time algorithms and are not applicable in these experiments. Owing to the balanced energy consumption in its initial assignments, MinMax-E significantly outperforms List+ and consumes less energy than kX3 + DP. For example, List+ consumes more than ten times as much energy as MinMax-E in (30 CPUs, 250 Tasks).
G. Case Study: Small Mobile Robot
We discuss how our algorithms work for a small mobile robot capable of moving around in irregular terrain and executing simple computational tasks. Examples of small mobile robots include, but are not limited to, soccer robots [1] and electronic pets [27]. The main tasks executed by such a robot are obstacle avoidance, environmental monitoring, localization, and multirobot cooperation. To carry out these tasks efficiently, a mobile robot usually embeds a multiprocessor system to provide concurrent execution while meeting timing constraints, as discussed in [1]-[3] and [27]. Fig. 11 shows a multiprocessor system commonly adopted by small mobile robots. Their tasks are generally categorized into control tasks and communication tasks. Control tasks are well suited to general-purpose processors, while communication tasks are well suited to signal-processing processors. As a result, a HeMP system is often adopted to implement small mobile robots.
These tasks, control and communication, exhibit a common set of features. Their computations are nearly deterministic. In other words, their worst case execution times can be well analyzed and bounded. In addition, their workloads occur regularly with timing constraints. Thus, we can model each task as a real-time periodic task. Because these workloads are deterministic, DVS is expected to contribute less to runtime energy reduction. Furthermore, the extra circuits required for DVS may violate the budget and space constraints often imposed on small mobile robots. In summary, a HeMP single-level voltage-setup problem needs to be solved to minimize the energy consumption of small mobile robots.
We simulate the execution of a small mobile robot consisting of four processors, two of which are ARM940T and the other two of which are ARM9TDMI. ARM940T are general-purpose processors, and ARM9TDMI are signal-centric processors. Fig. 12 shows the performance parameters of these processors. We simulate a realistic workload derived from MiBench [28], a popular benchmark used in embedded-system design, on this HeMP platform. Our workload, listed in Fig. 13, consists of 25 tasks, 16 of which are control tasks. Fig. 13 also gives the clock cycles needed to execute each task on each processor. This information is directly derived from [29]. To prioritize tasks, we assign different periods to them. Critical tasks are executed at shorter periods, while others are executed at longer periods. Fig. 14 shows the performance comparison of different algorithms for executing this workload. The result of the exhaustive search is unavailable because its estimated execution time is over three years. In contrast, PBA obtains the optimal voltage setup within a few minutes, and the optimal task-to-processor assignment is shown in Fig. 13. MinMax-E and our previous work kX3 + DP [12] significantly outperform the other heuristic algorithms and deliver a near-optimal solution. List and List+ achieve the same energy consumption because List cannot reduce energy consumption by migrating tasks from the most loaded processor. The same reason explains why ILP and ILP+ have the same result.
VIII. RELATED WORK
Many real-time low-power scheduling algorithms have been proposed for homogeneous multiprocessor (HoMP) systems [6], [30]. Aydin and Yang [30] partitioned periodic real-time tasks onto a HoMP system by considering the energy constraint. Hsu et al. [31] developed two power-aware scheduling algorithms for real-time tasks executing on multiprocessor systems. All these algorithms focused their discussion on HoMP systems. Because they do not consider that a task may have different execution times on heterogeneous processors, these algorithms cannot be directly applied to HeMP systems.
Hua et al. [7] proposed a low-power algorithm to schedule a set of soft real-time tasks on HeMP systems. Assuming that the available processor speeds and the task partition are known a priori, this algorithm provides a schedule that minimizes energy under these constraints. Hsu et al. [31] addressed this problem for a HeMP system in which each processor has a fixed speed. Again, this algorithm assumes that each processor speed is given. Without this information, their real-time schedulers may not be optimal in reducing energy.
The voltage-setup problem was first formulated in [9] to determine the number of voltage levels and the values at which they should be implemented to deliver the optimal energy-saving performance for a specific application. Aydin et al. [8] proved that the optimal voltage for a one-processor single-level problem is equal to its utilization when the maximum speed is normalized to one. Seo and Dutt [10] and Buss et al. [11] proposed optimal solutions for a one-processor multilevel problem. Liu et al. [4] proposed a framework to choose the voltage ranges for a multiprocessor platform in which the mapping of tasks onto processors is known a priori. Finally, Ito et al. [32], [33] applied integer linear programming to solve some data format conversion problems. Because our problem is a nonlinear integer programming problem, their solutions cannot be applied to it, and no direct comparison is made.
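The one-processor single-level result attributed to Aydin et al. [8] can be sketched as follows. The sketch assumes periodic tasks whose deadlines equal their periods and a scheduler that can use the processor fully, neither of which is stated in this section, and it treats the normalized speed and the normalized voltage interchangeably. The function name and the (cycles, period) task representation are hypothetical.

```python
def required_speed(tasks, f_max_hz: float) -> float:
    """Normalized single-level speed for one processor (sketch).

    tasks is an iterable of (worst_case_cycles, period_seconds).
    The utilization at maximum frequency is returned as the speed,
    with the maximum speed normalized to one.
    """
    utilization = sum(cycles / (period * f_max_hz)
                      for cycles, period in tasks)
    if utilization > 1.0:
        raise ValueError("task set is not schedulable at maximum speed")
    return utilization
```

For example, two tasks needing 2M cycles every 0.1 s and 1M cycles every 0.05 s on a 100-MHz processor each contribute a utilization of 0.2, so the processor could run at 0.4 of its maximum speed under these assumptions.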
IX. CONCLUSION
HeMP systems are adopted by low-power embedded systems to host different categories of applications. Minimizing the energy consumption without missing deadlines has become an important issue in constructing low-power real-time HeMP systems. This paper has presented PBA to obtain the optimal solution to the HeMP single-level voltage-setup problem. The results show that PBA reduces the time needed to derive the optimal solution by at least 98%, compared with the exhaustive search. To the best of our knowledge, this paper is the first to propose an optimal solution to the HeMP single-level voltage-setup problem. In the future, we will extend this work to solve the HeMP multilevel voltage-setup problem.
APPENDIX
Proof of Theorem 1:
Since P ϕ is a convex programming problem, Karush-Kuhn-Tucker (KKT) conditions [23] are necessary and sufficient for a solution to be optimal. Then, P ϕ 's Lagrange function is 
Proof of Theorem 2:
For ease of presentation, we use "inner loop" to represent line 4 to line 8 of Algorithm 2.
We first prove that an improved solution can be found at each iteration of the inner loop, in which we solve (13) . . . We now show that the inner loop can find a candidate solution in m iterations. There are two cases. 1) 0 < λ < 1: It means that $3k_j y_j^2$ is the same for all $y_j \in N_2$. Hence, y + λd is a candidate solution.
