Pipelined computing is a promising paradigm for embedded system design. Designing a power management policy to reduce the power consumption of a pipelined system with nondeterministic workload is, however, nontrivial. In this article, we study the problem of energy minimization for coarse-grained pipelined systems under hard real-time constraints and propose new approaches based on an inverse use of the pay-burst-only-once principle. We formulate the problem by means of the resource demands of individual pipeline stages and propose two new approaches, a quadratic programming-based approach and a fast heuristic, to solve the problem. In the quadratic programming approach, the problem is transformed into a standard quadratic programming problem with box constraints and then solved by a standard quadratic programming solver. Because the problem is NP-hard, the fast heuristic is designed to solve it more efficiently. Our approach is scalable with respect to the number of pipeline stages. Simulation results using real-life applications are presented to demonstrate the effectiveness of our methods.
INTRODUCTION
With increasing requirements for high performance, multicore architectures are believed to be the major solution for future embedded systems. Many real-time applications, especially streaming applications, can be executed on multiple processors simultaneously to achieve parallel processing. When real-time applications are executed on multicore architectures powered by batteries, minimizing the energy consumption is one of the major design goals, because an energy-efficient design will increase the lifetime, increase the reliability, and decrease the heat dissipation of the system. Pipelined computing is a promising paradigm for embedded system design, which can, in principle, provide high throughput and low energy consumption [Carta et al. 2007]. For instance, a streaming application can be split into a sequence of functional blocks that are computed by a pipeline of processors where power-gating techniques can be applied to achieve energy efficiency.
Performance constraints of a streaming application are usually imposed on two principal metrics, that is, throughput and latency. Latency is the main concern for applications such as video/telephone conferencing and automatic pattern recognition, where latency beyond a certain bound is not tolerated. In the case of pipelined real-time systems, the latency of a streaming application can be expressed as the end-to-end deadline within which the application must be processed through the pipeline.
Designing the scheduling policy for the pipeline stages under the requirements of both energy efficiency and timing guarantee is, however, nontrivial. In general, energy efficiency and timing guarantee are conflicting objectives, that is, techniques that reduce the energy consumption of the system will usually pay the price of longer execution time, and vice versa. Previous work on this topic either requires precise timing information of the system [Yu and Prasanna 2002; Xu et al. 2007] or tackles only soft real-time requirements [Javaid et al. 2011b; Carta et al. 2007]. However, precise timing of task arrivals might not be guaranteed in practice. Thus, the previous approaches cannot guarantee the worst-case deadline and cannot be applied to those embedded systems where violating deadlines could be disastrous. Compared to the preceding work, our work tackles a pipelined event stream with nondeterministic workloads in hard real-time systems by an inverted use of the pay-burst-only-once principle for energy efficiency.
This article studies the energy minimization problem of coarse-grained pipelined systems under hard real-time requirements. We consider a streaming application that is split into a sequence of coarse-grained functional blocks which are mapped to a pipeline architecture for processing. The workload of the streaming application is abstracted as an event stream and the event arrivals of the stream are modeled as arrival curves in the interval domain [Le Boudec and Thiran 2001]. The event stream has an end-to-end deadline requirement, that is, the time any event in the stream takes to travel through the pipeline should be no longer than this required deadline. The objective is thereby to find scheduling policies for the individual stages of the pipeline that minimize the energy consumption while guaranteeing the deadline requirement of the event stream.
Intuitively, the problem can be solved by partitioning the end-to-end deadline into sub-deadlines for individual pipeline stages and optimizing the energy consumption based on the partitioned sub-deadlines. However, any partition strategy based on the end-to-end deadline, together with the follow-up optimization method, suffers from counting the burst of the event stream multiple times, which inevitably overestimates the needed resource for every pipeline stage and leads to poor energy saving. A motivation example in Section 4 demonstrates this drawback in detail. Therefore, a more sophisticated method is needed to tackle this problem.
In this article, we develop a new approach to solve the energy minimization problem for pipelined multiprocessor embedded systems while guaranteeing the worst-case end-to-end delay. This article summarizes and extends the results of Chen et al. [2013]. Our idea lies in an inverse use of the well-known pay-burst-only-once principle [Le Boudec and Thiran 2001]. Rather than directly partitioning the end-to-end deadline, we compute for the entire pipeline one service curve that serves as a constraint for the minimal resource demand. The energy minimization problem is then formulated with respect to the individual resource demands of pipeline stages. To solve this problem, we propose two heuristics, that is, a quadratic programming heuristic and a fast heuristic. In the quadratic programming heuristic, the minimization problem is transformed into a standard quadratic programming problem with box constraints and then solved by a standard solver. Observing that the formulated problem is NP-hard, we present a fast heuristic that finds a suboptimal solution by analyzing the properties of the optimal solution and runs with complexity O(mn) (where m and n are the number of stages and the number of sample steps, respectively). For simplicity, we consider power-gating energy minimization and use the periodic dynamic power management of Huang et al. [2009b, 2011a] to reduce the leakage power, that is, to periodically turn the processors of the pipeline on and off. In this work, we compute periodic power management schemes offline, and the fixed $T_{on}/T_{off}$ for the processors of every pipeline stage is applied at runtime. With this approach, we can not only guarantee the overall end-to-end deadline requirement but also retrieve the pay-burst-only-once phenomenon, achieving a significant reduction of the energy consumption. In addition, our methods are scalable with respect to the number of pipeline stages.
The contributions of this article are summarized as follows.
- A new method is developed to solve the energy minimization problem for pipelined multiprocessor embedded systems by inversely using the pay-burst-only-once principle.
- A minimization problem is formulated based on the needed resource of individual stages of the pipeline architecture, and the formulation is transformed into a standard quadratic programming problem with box constraints. The formulated problem is proved NP-hard.
- A quadratic programming heuristic is developed to solve the formulated problem and a formal proof is provided to show the correctness of our approach, that is, the guarantee on the end-to-end deadline requirement.
- A fast heuristic is developed to solve the formulated problem, running with complexity O(mn).
The rest of the article is organized as follows. Section 2 reviews related work in the literature. Section 3 presents basic models and the definition of the studied problem. Section 4 presents the motivation example and Section 5 describes the proposed approach. Experimental evaluation is presented in Section 6, and Section 7 concludes.
RELATED WORK
Pipelined computing is a promising paradigm for embedded system design, which can in principle provide high performance and low energy consumption. Pipelined multiprocessor systems are widely applied as a viable platform for high-performance implementation of multimedia applications [Shee and Parameswaran 2007; Javaid and Parameswaran 2009; Shee et al. 2006; Karkowski and Corporaal 1997]. Energy optimization for pipelined multiprocessor systems is an interesting topic for which a number of techniques have been proposed in the literature. Carta et al. [2007] and Alimonda et al. [2009] proposed a feedback control technique for dynamic voltage/frequency scaling (DVFS) in a pipelined MPSoC architecture with soft real-time constraints, aimed at minimizing energy consumption with throughput guarantees. Each pipelined processor is associated with a dedicated controller that monitors the occupancy level of the queues to determine when to increase or decrease the voltage-frequency levels of the processor. Javaid et al. [2011b] proposed an adaptive pipelined MPSoC architecture and a runtime balancing approach based on workload prediction to achieve energy efficiency. The authors in Javaid et al. [2011a] proposed a dynamic power management scheme for adaptive pipelined MPSoCs. In this work, the duration of idle periods is determined based on future workload prediction and used to select an appropriate power state for the idle processor. However, the prior approaches are under soft real-time constraints and cannot be applied to hard real-time systems. There are also methods [Davare et al. 2007; Hong et al. 2011; de Langen and Juurlink 2006, 2009; Liu et al. 2014; Yu and Prasanna 2002] for hard real-time systems. To guarantee the end-to-end delay, the authors in Liu et al. [2014] studied the problem of minimizing the number of processors required for scheduling end-to-end deadline-constrained streaming applications modeled as CSDF graphs, where the actors of a CSDF are executed as strictly periodic tasks. In Davare et al. [2007], the authors optimized periods for dependent tasks on hard real-time distributed automotive systems in order to meet the end-to-end constraints. In Hong et al. [2011], the authors proposed a distributed approach to assign local deadlines for periodic tasks on distributed systems to meet the end-to-end deadline constraints. To reduce the energy consumption, Yu and Prasanna [2002] presented an integer linear programming (ILP) formulation for the problem of frequency assignment of a set of periodic independent tasks on a heterogeneous multiprocessor system. The authors in de Langen and Juurlink [2006, 2009] proposed leakage-aware scheduling heuristics to reduce the energy consumption by translating real-time applications with periodic tasks to DAGs using the frame-based scheduling paradigm and considering the trade-offs among DVFS, DPM, and the number of processors. These methods, however, require precise timing information such as periodic real-time events. In practice, this precise timing information of task arrivals might not be determined in advance. The nondeterminism in the timing of event arrivals results from two main causes: (a) An event may be triggered by the physical environment, which, in general, cannot be accurately predicted. (b) When a distributed system is considered, an event might be triggered by other events on different processing components, in which variable execution workloads make the prediction of precise information on event arrivals extremely complicated. In the aforesaid research, there is no guarantee that an event will arrive in time.
Therefore, these approaches cannot be applied to guarantee the worst-case deadline in embedded systems where violating deadlines could be disastrous. Unlike previous work, we focus on improving energy efficiency in hard real-time embedded systems while guaranteeing the system satisfies the worst-case deadline constraint.
To model irregular event arrivals, Real-Time Calculus (RTC) [Thiele et al. 2000], which is based on network calculus [Le Boudec and Thiran 2001], can be applied. Specifically, the arrival curve in the RTC models an upper bound and a lower bound of the number of event arrivals or the demand of computation under a specified time interval domain. Considering the DVFS system, Maxiaguine et al. [2005] computed a safe frequency at periodic intervals to prevent buffer overflow of a system. By adopting RTC models, Chen et al. [2009] explored the schedulability for the online DVFS scheduling algorithms proposed in Yao et al. [1995]. Combining optimistic and pessimistic DVFS scheduling, Perathoner et al. [2010] presented an adaptive scheme for the scheduling of arbitrary event streams. When only considering dynamic power management (DPM), Huang et al. [2009b, 2011a] presented an algorithm to find periodic time-driven patterns to turn the processor on and off for energy saving. Online algorithms are proposed in Huang et al. [2009a, 2011b] and Lampka et al. [2011] to adaptively control the power mode of a system, procrastinating the processing of arrived events as long as possible. In one algorithm in Huang et al. [2009a, 2011b], a tight bound of event arrivals is computed based on historical information of event arrivals in the recent past. Instead of using historical information, a dynamic counter technique [Lampka et al. 2011] is used to predict the future workload. Compared to the preceding work, the distinct difference of ours is that we can tackle the correlation of a pipelined event stream by an inverted use of the pay-burst-only-once principle. With this new method, which retrieves the correlation of the same event stream between different pipeline stages, we can compute longer deadlines for each pipeline stage and reduce the overall power consumption of the system.
MODELS AND PROBLEM DEFINITION

Hardware Model
The hardware architecture we have chosen is a simplified one with no shared cache or shared bus among different processing cores. The processing cores are connected in a pipelined fashion via dedicated FIFOs. We consider the system with the pipeline architecture shown in Figure 1(a). Subtasks of a partitioned application are mapped to and executed on different processors. The processors communicate data only through distributed memory units. Each memory unit can be organized as one or several FIFOs. The data communication and synchronization among processors are realized by blocking read and write SW primitives. This kind of hardware architecture has been realized in Nikolov et al. [2008]. As the service curve of each stage can be computed offline for energy efficiency by our proposed approaches, the worst-case FIFO size of each stage can be determined by applying the analysis approach in .
Each processor in the pipelined system has three power consumption modes, namely active, standby, and sleep modes, as shown in Figure 1(b). To serve events, the processor must be in the active mode with power consumption $P_a$. When there is no event to process, the processor can switch to sleep mode with lower power consumption $P_\sigma$. However, switching from sleep mode to active mode causes additional energy and latency penalties, denoted $E_{sw,on}$ and $t_{sw,on}$, respectively. To prevent the processor from frequent mode switches, the processor can stay in standby mode with power consumption $P_s$, which is less than $P_a$ but more than $P_\sigma$, that is, $P_a > P_s > P_\sigma$. Moreover, the mode switch from active (standby) mode to sleep mode causes energy and time overheads, denoted $E_{sw,sleep}$ and $t_{sw,sleep}$, respectively.
Considering the overhead of switching the system from active mode to sleep mode, the system break-even time $T_{BET}$ denotes the minimum time length that the system should stay in sleep mode. If the interval during which the system can stay in sleep mode is smaller than $T_{BET}$, the mode-switch overheads are larger than the energy saving, and switching modes is therefore not worthwhile. The break-even time $T_{BET}$ can be defined as follows:

$$T_{BET} = \max\left(t_{sw}, \frac{E_{sw}}{P_s - P_\sigma}\right)$$

where $t_{sw} = t_{sw,on} + t_{sw,sleep}$ and $E_{sw} = E_{sw,on} + E_{sw,sleep}$.
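As a rough illustration, the break-even time can be computed directly from the mode-switch overheads. The sketch below assumes the definition $T_{BET} = \max(t_{sw}, E_{sw}/(P_s - P_\sigma))$; the function name and the numeric values are hypothetical and serve only as an example.

```python
def break_even_time(t_sw_on, t_sw_sleep, e_sw_on, e_sw_sleep, p_standby, p_sleep):
    """Return T_BET = max(t_sw, E_sw / (P_s - P_sigma)).

    Assumed definition: sleeping pays off only if the sleep interval covers
    both the switching latency and the energy break-even point.
    """
    if p_standby <= p_sleep:
        raise ValueError("standby power must exceed sleep power")
    t_sw = t_sw_on + t_sw_sleep          # total switching latency
    e_sw = e_sw_on + e_sw_sleep          # total switching energy
    return max(t_sw, e_sw / (p_standby - p_sleep))


# Hypothetical values (arbitrary but consistent units).
t_bet = break_even_time(t_sw_on=1.0, t_sw_sleep=1.0,
                        e_sw_on=3.0, e_sw_sleep=1.0,
                        p_standby=2.0, p_sleep=1.0)
print(t_bet)
```

With these numbers the energy term dominates; if the switching latency were larger than the energy break-even point, the latency would determine $T_{BET}$ instead.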
26:6 G. Chen et al.
Energy Model
The analytical processor energy model in Martin et al. [2002], Wang and Mishra [2010], Jejurikar et al. [2004], and de Langen and Juurlink [2009] is adopted in this article, whose accuracy has been verified with SPICE simulation [Martin et al. 2002; Wang and Mishra 2010; de Langen and Juurlink 2009]. The dynamic power consumption of the core on one voltage/frequency level $(V_{dd}, f)$ can be given by

$$P_{dyn} = C_{eff} \cdot V_{dd}^2 \cdot f$$

where $V_{dd}$ is the supply voltage, $f$ the operating frequency, and $C_{eff}$ the effective switching capacitance. The cycle length $t_{cycle}$ is given by a modified alpha power model

$$t_{cycle} = \frac{L_d \cdot K_6}{(V_{dd} - V_{th})^{\alpha}}$$

where $K_6$ is a technology constant and $L_d$ is estimated by the average logic depth of all instructions' critical paths in the processor. The threshold voltage $V_{th}$ is given as

$$V_{th} = V_{th1} - K_1 \cdot V_{dd} - K_2 \cdot V_{bs}$$

where $V_{th1}$, $K_1$, $K_2$ are technology constants and $V_{bs}$ is the body bias voltage. The static power is mainly determined by the subthreshold leakage current $I_{subn}$, the reverse bias junction current $I_j$, and the number of devices in the circuit $L_g$. It can be presented by

$$P_{sta} = L_g \cdot (V_{dd} \cdot I_{subn} + |V_{bs}| \cdot I_j)$$

where the reverse bias junction current $I_j$ is approximated as a constant and the subthreshold leakage current $I_{subn}$ can be determined as

$$I_{subn} = K_3 \cdot e^{K_4 V_{dd}} \cdot e^{K_5 V_{bs}}$$

where $K_3$, $K_4$, and $K_5$ are technology constants. To avoid junction leakage power overriding the gain in lowering $I_{subn}$, $V_{bs}$ should be constrained between 0 and −1V. Thus, the power consumption at active mode and at standby mode, that is, $P_a$ and $P_s$, under one voltage/frequency level $(V_{dd}, f)$ can be respectively computed as

$$P_a = P_{dyn} + P_{sta} + P_{on}, \qquad P_s = P_{sta} + P_{on}$$

where $P_{on}$ is an inherent power needed for keeping the processor on.
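The power model can be evaluated with a few lines of code. The sketch below assumes the commonly cited form of this model (dynamic power $C_{eff} V_{dd}^2 f$, an alpha-power cycle length, a linear threshold-voltage dependence, and exponential subthreshold leakage); all function names and the sample constants are hypothetical and are not taken from the article.

```python
import math


def threshold_voltage(v_dd, v_bs, v_th1, k1, k2):
    """V_th = V_th1 - K1*V_dd - K2*V_bs (assumed form)."""
    return v_th1 - k1 * v_dd - k2 * v_bs


def cycle_length(v_dd, v_th, k6, l_d, alpha):
    """t_cycle = L_d*K6 / (V_dd - V_th)^alpha (assumed alpha-power form)."""
    return l_d * k6 / (v_dd - v_th) ** alpha


def dynamic_power(c_eff, v_dd, f):
    """P_dyn = C_eff * V_dd^2 * f."""
    return c_eff * v_dd ** 2 * f


def static_power(v_dd, v_bs, l_g, i_j, k3, k4, k5):
    """P_sta = L_g * (V_dd*I_subn + |V_bs|*I_j), I_subn = K3*e^(K4*V_dd)*e^(K5*V_bs)."""
    i_subn = k3 * math.exp(k4 * v_dd) * math.exp(k5 * v_bs)
    return l_g * (v_dd * i_subn + abs(v_bs) * i_j)


def mode_powers(p_dyn, p_sta, p_on):
    """Return (P_a, P_s): active and standby power."""
    return p_dyn + p_sta + p_on, p_sta + p_on


# Hypothetical constants, chosen only to exercise the functions.
v_dd, v_bs = 1.0, -0.5
v_th = threshold_voltage(v_dd, v_bs, v_th1=0.3, k1=0.1, k2=0.2)
f = 1.0 / cycle_length(v_dd, v_th, k6=5e-12, l_d=40, alpha=1.5)
p_a, p_s = mode_powers(dynamic_power(4e-10, v_dd, f),
                       static_power(v_dd, v_bs, 4e6, 5e-10, 5e-7, 1.8, 4.2),
                       p_on=0.1)
print(f, p_a, p_s)
```

Because $P_a$ and $P_s$ differ only by the dynamic term in this reading, the gap $P_a - P_s$ is exactly the dynamic power at the chosen operating point.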
Task Model
This article considers streaming applications that can be split into a sequence of tasks. As shown in Figure 1(a), an H.263 decoder is represented as four tasks (i.e., PD1, deQ, IDCT, MC) implemented in a pipelined fashion [Oh and Ha 2002]. To model the workload of the application, the concept of the arrival curve $\alpha(\Delta) = [\alpha^u(\Delta), \alpha^l(\Delta)]$, originating from network calculus [Le Boudec and Thiran 2001], is adopted. $\alpha^u(\Delta)$ and $\alpha^l(\Delta)$ provide the upper and lower bounds on the number of arrival events for the stream $S$ in any time interval $\Delta$. Many other traditional timing models of event streams can be unified in the concept of arrival curves. For example, a periodic event stream with period $p$ can be modeled by a set of step functions, where $\bar{\alpha}^u(\Delta) = \lfloor \Delta / p \rfloor + 1$ and $\bar{\alpha}^l(\Delta) = \lfloor \Delta / p \rfloor$.

For a sporadic event stream with minimal interarrival distance $p$ and maximal interarrival distance $p'$, the upper and lower arrival curves are $\bar{\alpha}^u(\Delta) = \lfloor \Delta / p \rfloor + 1$ and $\bar{\alpha}^l(\Delta) = \lfloor \Delta / p' \rfloor$, respectively. Moreover, a widely used model to specify an arrival curve is the PJD model, where the arrival curve is characterized by period $p$, jitter $j$, and minimal interarrival distance $d$. In the PJD model, the upper arrival curve can be determined as $\bar{\alpha}^u(\Delta) = \min\{\lceil (\Delta + j)/p \rceil, \lceil \Delta / d \rceil\}$. Figure 2 depicts arrival curves for the previous cases.

Analogous to arrival curves that provide an abstract event stream model, a tuple $\beta(\Delta) = [\beta^u(\Delta), \beta^l(\Delta)]$ defines an abstract resource model which provides upper and lower bounds on the available resources in any time interval $\Delta$. For further details, refer to Thiele et al. [2000]. Note that arrival curves are event based, meaning they specify the number of events of the stream in one interval of time, while service curves are based on the amount of computation time. Therefore, the service curve $\beta$ has to be transformed to $\bar{\beta}$ to indicate the number of events of the stream that the processor can process in a specified time interval. Supposing that the execution time of an event is $c$, the transformation of the service curves can be done by

$$\bar{\beta}(\Delta) = \left\lfloor \frac{\beta(\Delta)}{c} \right\rfloor$$

With these definitions, a processor with lower service curve $\bar{\beta}^{Gl}(\Delta)$ is said to satisfy the deadline $D$ for the event stream specified by $\alpha^u(\Delta)$ if the following condition holds.

$$\bar{\beta}^{Gl}(\Delta) \ge \alpha^u(\Delta - D), \quad \forall \Delta \ge 0$$
Note that we adopt the same assumption as Maxiaguine et al. [2005], Huang et al. [2009a, 2009b], Lampka et al. [2011], and Chen et al. [2009] and assume that the worst-case execution time (WCET) of each task can be predefined and considered as system input in this article. As mentioned in the previous section, the hardware architecture that we have chosen is a simplified one with no shared cache or shared bus among different processing cores. In this sense, we can safely treat the WCETs of the running tasks as system inputs.
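The task model can be made concrete with a small numeric check. The sketch below assumes the usual PJD upper arrival curve $\min\{\lceil(\Delta+j)/p\rceil, \lceil\Delta/d\rceil\}$ and the deadline condition $\lfloor \beta^{l}(\Delta)/c \rfloor \ge \alpha^u(\Delta - D)$; the function names, the sampling grid, and the example parameters are hypothetical. A check on a finite grid is only a necessary test, not a proof over all interval lengths.

```python
import math


def pjd_upper(delta, p, j, d):
    """Upper arrival curve of the PJD model (events in any window of length delta)."""
    if delta <= 0:
        return 0
    bound = math.ceil((delta + j) / p)
    if d > 0:  # d == 0 means no minimal inter-arrival distance is specified
        bound = min(bound, math.ceil(delta / d))
    return bound


def events_served(beta_l, delta, c):
    """Event-based lower service curve: floor(beta_l(delta) / c)."""
    return math.floor(beta_l(delta) / c)


def meets_deadline(alpha_u, beta_l, c, deadline, horizon, step):
    """Check events_served(delta) >= alpha_u(delta - D) on a sampled grid."""
    for k in range(int(round(horizon / step)) + 1):
        delta = k * step
        if events_served(beta_l, delta, c) < alpha_u(delta - deadline):
            return False
    return True


# Hypothetical example: periodic stream (p = 10), execution time c = 2,
# and a processor that is always on, so beta_l(delta) = delta.
stream = lambda x: pjd_upper(x, p=10, j=0, d=0)
always_on = lambda x: x
print(meets_deadline(stream, always_on, c=2, deadline=2, horizon=100, step=0.5))
print(meets_deadline(stream, always_on, c=2, deadline=1, horizon=100, step=0.5))
```

The second call fails because a deadline shorter than the execution time can never be met, even by a processor that is never switched off.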
Problem Statement
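This article considers periodic power management [Huang et al. 2009b] that periodically turns a processor on and off. In each period $T = T_{on} + T_{off}$, it switches the processor to active (standby) mode for $T_{on}$ time units, followed by $T_{off}$ time units in sleep mode, as shown in Figure 1. Given a time interval $L$ such that $\frac{L}{T_{on} + T_{off}}$ is an integer, suppose that $\gamma(L)$ is the number of events of event stream $S$ served in $L$. If all the served events finish within $L$, the energy consumption $E(L, T_{on}, T_{off})$ by applying this periodic scheme is

$$E(L, T_{on}, T_{off}) = \frac{L}{T_{on} + T_{off}} \left(E_{sw} + P_s T_{on} + P_\sigma T_{off}\right) + \gamma(L) \cdot c \cdot (P_a - P_s)$$

where $E_{sw}$ is $E_{sw,on} + E_{sw,sleep}$ for brevity. Given a sufficiently large $L$, without changing the scheduling policy, minimizing the energy consumption $E(L, T_{on}, T_{off})$ of a single processor amounts to finding $T_{off}$ and $T_{on}$ such that the average idle power consumption $P(T_{on}, T_{off})$ is minimized.

$$P(T_{on}, T_{off}) = \frac{E_{sw} + P_s T_{on} + P_\sigma T_{off}}{T_{on} + T_{off}} \quad (10)$$

By defining $K = \frac{T_{on}}{T_{on} + T_{off}}$, the average idle power consumption $P$ in (10) can be expressed in terms of $T_{off}$ and $K$ ($0 \le K \le 1$) as follows:

$$P(K, T_{off}) = \left(\frac{E_{sw}}{T_{off}} - (P_s - P_\sigma)\right)(1 - K) + P_s \quad (11)$$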
By analyzing (11), it is obvious that the following properties hold.
According to Properties 1 and 2, when $T_{off} > \frac{E_{sw}}{P_s - P_\sigma}$ holds, the processing unit should turn on as briefly as possible in one period. When $T_{off} \le \frac{E_{sw}}{P_s - P_\sigma}$ holds, the processing unit should stay on all the time with $T_{off} = 0$. In this context, $\frac{E_{sw}}{P_s - P_\sigma}$ can be seen as the break-even time of the processing unit.
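The role of the threshold $E_{sw}/(P_s - P_\sigma)$ can be checked numerically. The sketch below assumes that the average idle power of one period is $(E_{sw} + P_s T_{on} + P_\sigma T_{off})/(T_{on} + T_{off})$ and rewrites it in terms of $K$; the function names and all numbers are hypothetical.

```python
def average_idle_power(t_on, t_off, e_sw, p_s, p_sigma):
    """Average idle power of one period: switch once, stay on t_on, sleep t_off."""
    return (e_sw + p_s * t_on + p_sigma * t_off) / (t_on + t_off)


def average_idle_power_k(k, t_off, e_sw, p_s, p_sigma):
    """Same quantity expressed with K = T_on / (T_on + T_off), 0 <= K < 1, T_off > 0."""
    return (e_sw / t_off - (p_s - p_sigma)) * (1.0 - k) + p_s


e_sw, p_s, p_sigma = 2.0, 1.0, 0.5          # threshold E_sw/(P_s - P_sigma) = 4
# Both forms agree: T_on = 1, T_off = 3 gives K = 0.25.
print(average_idle_power(1.0, 3.0, e_sw, p_s, p_sigma),
      average_idle_power_k(0.25, 3.0, e_sw, p_s, p_sigma))

# T_off above the threshold: a smaller K (shorter on-time) lowers the power.
print(average_idle_power_k(0.2, 8.0, e_sw, p_s, p_sigma),
      average_idle_power_k(0.8, 8.0, e_sw, p_s, p_sigma))

# T_off below the threshold: sleeping costs more than it saves, larger K is better.
print(average_idle_power_k(0.2, 2.0, e_sw, p_s, p_sigma),
      average_idle_power_k(0.8, 2.0, e_sw, p_s, p_sigma))
```

The sign of the factor $E_{sw}/T_{off} - (P_s - P_\sigma)$ decides the direction: it is negative exactly when $T_{off}$ exceeds the threshold.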
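Based on (10), the energy minimization problem of an m-stage pipeline can be formulated as minimizing the function

$$P(\vec{K}, \vec{T}_{off}) = \sum_{i=1}^{m} \left[\left(\frac{E^i_{sw}}{T^i_{off}} - (P^i_s - P^i_\sigma)\right)(1 - K_i) + P^i_s\right] \quad (12)$$

where $K_i = \frac{T^i_{on}}{T^i_{on} + T^i_{off}}$ and the index $i$ denotes the parameters of the $i$-th stage. Now we can define the problem that we study as follows.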
Given a pipelined platform with $m$ stages, an event stream $S$ processed by this pipeline, and an end-to-end deadline requirement $D$, we are to find a set of periodic power management schemes characterized by $T_{on}$ and $T_{off}$ that minimize the average idle power consumption $P$ defined in (12) while guaranteeing that the worst-case end-to-end delay does not exceed $D$.
MOTIVATION EXAMPLE
A phenomenon called pay-burst-only-once is well known and gives a tighter upper bound on the delay when an end-to-end service curve is derived prior to delay computations [Fidler 2003]. When a workload flow with a burst traverses a number of stages in sequence, the effect of the burst of the flow on the end-to-end delay bound is the same as if the flow traversed only one node. The end-to-end delay bound computed with this property can be tighter than the sum of the delay bounds of the individual nodes.
This section presents a motivation example where an event stream passes through a two-stage pipeline with a deadline requirement $D$. For simplicity, arrival curves in the leaky-bucket form and service curves in rate-latency form [Le Boudec and Thiran 2001] are used. In this representation, an arrival curve is modeled as $\alpha(\Delta) = b + r \cdot \Delta$, where $b$ is the burst and $r$ the leaky rate. Correspondingly, a service curve is modeled as $\beta(\Delta) = R \cdot (\Delta - T)$, where $R$ is the service rate and $T$ the delay. A graphical illustration of the example is shown in Figure 3, where $D = 20$, $b = 5$, $r = 0.5$, and $R_1 = R_2 = 1$.
We first inspect the strategy of partitioning the end-to-end deadline and using the partitioned sub-deadlines for the two pipeline stages. For simplicity, we split $D$ equally, that is, $D/2$ for each stage. As shown in Figure 3, this yields $T_1 = D/2 - b/R_1 = 5$ and $T_2 = D/2 - (b + r \cdot T_1)/R_2 = 2.5$. Let us take a close look at this solution. According to the concatenation theorem $\beta_{R_1,T_1} \otimes \beta_{R_2,T_2} = \beta_{\min(R_1,R_2),\,T_1+T_2}$, we get a concatenated service curve $\beta(\Delta) = \Delta - (T_1 + T_2) = \Delta - 7.5$. With this concatenated service curve, the worst-case end-to-end delay under $\beta_1$ and $\beta_2$ is 12.5, which is far stricter than the required $D$. This example indicates that the $\beta_1$ and $\beta_2$ obtained by partitioning the end-to-end deadline are too pessimistic.
The pessimism comes from paying the burst $b/R_1$ again for the second stage of the pipeline as well as the additional delay $\frac{r \cdot T_1}{R_2}$ from the first stage, as the pay-burst-only-once principle points out. These effects accumulate for every stage of the pipeline, leading to even more pessimistic results as the number of pipeline stages increases. In addition, computing the resource demand of each stage requires the lower bound of the output arrival curve from the previous stage. Computing this output curve requires numerical min-plus convolution, which incurs considerable computational and memory overheads. In conclusion, the strategy based on partitioning the end-to-end deadline is not a viable approach, in particular for pipelined systems with many stages.
On the other hand, one can first derive the total concatenated service demand $\beta^{l}_{T}$, in this case $T = D - b/R = 15$ as shown in Figure 3(b). Any partition based on this $T$ will result in smaller but valid service curves for each pipeline stage, as we can always retrieve the original end-to-end deadline by means of the pay-burst-only-once principle. For example, by an equal partition of $T$, both $T_1$ and $T_2$ are 7.5 and $D$ is still preserved. This is the basic idea of our approach, which is presented in the next section.
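The numbers of this example can be reproduced in a few lines. The sketch below uses the standard network-calculus results for a leaky-bucket arrival curve served by a rate-latency server (delay bound $T + b/R$, output burst $b + r \cdot T$, assuming $r \le R$); the helper name is hypothetical.

```python
def latency_budget(deadline, burst, rate):
    """Largest latency T of a rate-latency server R*(delta - T) that still
    meets `deadline` for a leaky-bucket flow with the given burst (r <= R)."""
    return deadline - burst / rate


D, b, r, R1, R2 = 20.0, 5.0, 0.5, 1.0, 1.0

# Strategy 1: split the deadline, D/2 per stage.
T1 = latency_budget(D / 2, b, R1)            # stage 1 pays the burst b
b_out = b + r * T1                           # burst of the flow leaving stage 1
T2 = latency_budget(D / 2, b_out, R2)        # stage 2 pays the (larger) burst again
e2e_partitioned = (T1 + T2) + b / min(R1, R2)  # delay bound of the concatenation

# Strategy 2: derive the total latency budget first, then split it.
T_total = latency_budget(D, b, min(R1, R2))
e2e_inverse = T_total + b / min(R1, R2)

print(T1, T2, e2e_partitioned)   # 5.0 2.5 12.5
print(T_total, e2e_inverse)      # 15.0 20.0
```

The partitioned strategy leaves 7.5 time units of the deadline unused, whereas splitting the total budget of 15 in any way keeps the end-to-end bound at exactly $D$.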
PROPOSED APPROACH
Our approach lies in an inverse use of the pay-burst-only-once principle, as mentioned in the previous section. Rather than directly partitioning the end-to-end deadline, we compute one service curve for the entire pipeline, which serves as a constraint for the minimal resource demand. The energy minimization problem is then formulated with respect to the resource demands for individual pipeline stages. To solve this minimization problem, the formulation is transformed into a quadratic programming form and solved by a 2-phase heuristic.
Without loss of generality, a pipelined system with m heterogeneous stages (m ≥ 2) is considered. The processor of the i stage can provide minimal β Gl i service. Since periodic power management is considered, the minimal service β 
The derivation of Eq. (13) is presented in Lemma A.1 in the appendix section. In addition, to obtain a tight lower bound of the service curve of the entire pipeline, we restrict $T^i_{on}$ to be a multiple of the worst-case execution time $c_i$, that is, $T^i_{on} = n_i c_i$ with $n_i \in \mathbb{N}^+$.
Problem Formulation
Regarding the problem formulation, we first present an approximation approach (see Lemma 5.1) to derive a lower bound of the PPM service curve. By using this approximated curve, we derive the concatenated service curve directly (see Lemma 5.2), which can be used to guarantee the real-time properties (see Theorem 5.3). Then, the energy minimization problem is formulated with respect to the resource demands for individual pipeline stages. Before presenting the formulation, we first state a few basics. By
, we have the following two lemmas.
According to the definition of the min-plus convolution operation , the inequality a + b ≥ a + b , and the inequality Eq. (13), we havē
With the restriction T i on = n i c i , n i ∈ N + and a ≥ a, we have
According to a ≥ a − 1, we have c i ≥ Applying Pay-Burst-Only-Once Principle for Periodic Power Management
According to the rule of min-plus convolution of rate-latency service curve 
Then, we get the right side of the inequality.
LEMMA 5.2.
PROOF. According to the rule of min-plus convolution of rate-latency service curve 
With Lemma 5.2, we state the next theorem. 
PROOF. In Lemma 5.2, the right-hand side of the inequality is a lower bound of
Gl that is the concatenated service curve of the pipeline. With
, the end-to-end delay of the pipeline is no more than D according to the pay-burst-only-once principle. Therefore, the theorem holds.
The left-hand side of inequality Eq. (14) can be considered as a bounded delay function $bdf(\Delta, \rho_0, b_0)$. For the stream $S$ with deadline $D$, a set of minimum bounded delay functions $bdf_{min}(\Delta, \rho, b)$ can be derived by varying $b$ (see Section 5.2). We should therefore find a solution $[\vec{K}, \vec{T}_{off}]$ such that the resulting bounded delay function $bdf(\Delta, \rho_0, b_0)$ is no less than the minimum bounded delay function $bdf_{min}(\Delta, \rho, b)$. We can thus formulate our optimization problem as follows:
where to the average power consumption (10) of each stage.
The advantage of formulation (15) is twofold. First of all, the service curves of individual pipeline stages are the variables of the optimization problem, which, on the one hand, overcomes the problem of paying the burst multiple times while, on the other hand, avoiding the costly computation during the optimization. Second, this formulation allows us to use a more efficient method to analyze the problem as presented in the following sections.
Quadratic Programming Transformation
How to solve the minimization problem (15) is not obvious. The constraints $b$ and $\rho$, indeed, are not fixed values and in addition these two constraints are correlated. For a fixed $b$, the minimum bounded delay function $bdf_{min}(\Delta, \rho, b)$ can be determined by computing $\rho$.
In this article, we conduct the optimization by varying b and computing ρ for every possible b. For a fixed b, we can transform (15) into a quadratic programming problem with box constraints (QPB), as stated in the following lemma.
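One possible way to compute $\rho$ for a fixed $b$ is sketched below. It assumes that the bounded delay function has the form $\max(0, \rho(\Delta - b))$ and must dominate $\alpha^u(\Delta - D)$, that $\alpha^u$ is nondecreasing with $\alpha^u(x) = 0$ for $x \le 0$, and that a finite sampling window is sufficient; the function name and the sampling scheme are hypothetical. Each sample uses the demand at the right edge of a grid cell and the service at its left edge, so the returned rate is safe for every point inside the sampled window, at the price of some pessimism for coarse steps.

```python
def min_rate(alpha_u, deadline, b, span, step):
    """Smallest rate rho (up to sampling pessimism) such that
    rho * (delta - b) >= alpha_u(delta - deadline) for all delta in
    (deadline, deadline + span]."""
    if not 0 <= b < deadline:
        raise ValueError("need 0 <= b < deadline")
    rho = 0.0
    for k in range(1, int(round(span / step)) + 1):
        demand = alpha_u(k * step)                 # demand at the right cell edge
        left_edge = deadline + (k - 1) * step      # service evaluated at the left edge
        rho = max(rho, demand / (left_edge - b))
    return rho


# Leaky-bucket stream of the motivation example: burst 5, rate 0.5, D = 20.
leaky = lambda x: 5.0 + 0.5 * x if x > 0 else 0.0
print(min_rate(leaky, deadline=20.0, b=15.0, span=200.0, step=1.0))
print(min_rate(leaky, deadline=20.0, b=15.0, span=200.0, step=0.01))
```

For this stream the exact answer is a rate of 1 at $b = 15$, matching the motivation example; the sampled values approach it from above as the step shrinks.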
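LEMMA 5.4. The minimization problem in (15) can be transformed as the following quadratic programming problem with box constraints:

$$\underset{x = [x_1 \ldots x_m]}{\text{minimize}} \; x^T Q x \quad \text{subject to} \quad 0 \le x_i \le \sqrt{E^i_{sw}(1 - \rho c_i)}, \; i = 1, \ldots, m, \quad (17)$$

where $Q = A - B$, $A$ is an $m \times m$ matrix of ones, and $B$ is an $m \times m$ diagonal matrix with $i$-th diagonal element $\frac{(b - \sum_{j=1}^{m} c_j)(P^i_s - P^i_\sigma)}{E^i_{sw}}$.

Denote $x^*$ as the optimal solution for the QPB problem in (17); then the optimal solution for (15) can be obtained with

$$T^i_{off} = \frac{\left(b - \sum_{j=1}^{m} c_j\right) x^*_i}{\sum_{j=1}^{m} x^*_j}, \qquad K_i = 1 - \frac{(x^*_i)^2}{E^i_{sw}}.$$

PROOF. With the Cauchy-Bunyakovsky-Schwarz inequality, we get

$$\left(\sum_{i=1}^{m} \frac{E^i_{sw}(1 - K_i)}{T^i_{off}}\right) \left(\sum_{i=1}^{m} T^i_{off}\right) \ge \left(\sum_{i=1}^{m} \sqrt{E^i_{sw}(1 - K_i)}\right)^2.$$

The minimum value of $\sum_{i=1}^{m} \frac{E^i_{sw}(1 - K_i)}{T^i_{off}}$ can be obtained at $\sum_{i=1}^{m} T^i_{off} = b - \sum_{j=1}^{m} c_j$ when the following equation holds.

$$\frac{T^1_{off}}{\sqrt{E^1_{sw}(1 - K_1)}} = \cdots = \frac{T^m_{off}}{\sqrt{E^m_{sw}(1 - K_m)}}$$

Then, omitting the constant term $\sum_{i} P^i_s$, the optimization formulation in (15) can be formulated as

$$\text{minimize} \; \frac{\left(\sum_{i=1}^{m} \sqrt{E^i_{sw}(1 - K_i)}\right)^2}{b - \sum_{j=1}^{m} c_j} - \sum_{i=1}^{m} (P^i_s - P^i_\sigma)(1 - K_i) \quad \text{subject to} \quad \rho c_i \le K_i \le 1.$$

By defining $x_i = \sqrt{E^i_{sw}(1 - K_i)}$ and multiplying the objective by the positive factor $b - \sum_{j=1}^{m} c_j$, formulation (15) can be transformed as the QPB problem in (17).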
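The transformation can be checked numerically. The sketch below assumes the reading $x_i = \sqrt{E^i_{sw}(1 - K_i)}$, off-times that sum to $b - \sum_j c_j$, and off-times proportional to $x_i$; under these assumptions $x^T Q x$ divided by $b - \sum_j c_j$ equals the $K$-dependent part of the average idle power. All function names and numbers are hypothetical.

```python
import math


def build_q(b, c, e_sw, p_s, p_sigma):
    """Q = A - B: all-ones matrix minus a diagonal matrix (see Lemma 5.4)."""
    m, slack = len(c), b - sum(c)
    q = [[1.0] * m for _ in range(m)]
    for i in range(m):
        q[i][i] -= slack * (p_s[i] - p_sigma[i]) / e_sw[i]
    return q


def quad_form(q, x):
    """Evaluate x^T Q x."""
    n = len(x)
    return sum(x[i] * q[i][j] * x[j] for i in range(n) for j in range(n))


def box_upper(rho, c, e_sw):
    """Upper bounds of the box: x_i <= sqrt(E_sw_i * (1 - rho * c_i))."""
    return [math.sqrt(e * (1.0 - rho * ci)) for e, ci in zip(e_sw, c)]


def recover(x, b, c, e_sw):
    """Map a QPB point x back to (K, T_off) under the assumed substitution."""
    slack, total = b - sum(c), sum(x)
    t_off = [slack * xi / total if total > 0 else 0.0 for xi in x]
    k = [1.0 - xi * xi / e for xi, e in zip(x, e_sw)]
    return k, t_off


# Hypothetical two-stage instance.
b, c = 13.0, [1.0, 2.0]
e_sw, p_s, p_sigma = [4.0, 9.0], [1.0, 1.0], [0.5, 0.5]
x = [1.0, 1.5]

k, t_off = recover(x, b, c, e_sw)
direct = sum((e / t - (ps - pg)) * (1.0 - ki)
             for e, t, ps, pg, ki in zip(e_sw, t_off, p_s, p_sigma, k))
via_q = quad_form(build_q(b, c, e_sw, p_s, p_sigma), x) / (b - sum(c))
print(k, t_off, direct, via_q)
```

Both evaluations agree, which is the content of the lemma: once the off-times are chosen optimally for given $K_i$, only the box-constrained quadratic form in $x$ remains to be minimized.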
Note that there is a feasible region for b. To guarantee all the resulting T i off ≥ 0, the bounded delay b should not be less than (18), which can guarantee that all the resulting K i will not exceed 1. In summary, the feasible region of b ∈ [b l , b u ] can be bounded as follows:
Quadratic Programming Heuristic
With the preceding information, we can now present the overall algorithm for the energy minimization problem defined in Section 3.4. Basically, the bounded delay $b$ is scanned with a fixed step within the range $[b_l, b_u]$. For each $b$, we first solve the subproblem (17) with a QPB solver, and then the obtained solution is repaired to fulfill further constraints (this will be explained later on). The pseudocode of the algorithm is depicted in Algorithm 1.
ALGORITHM 1: Quadratic Programming Heuristic
c j holds, the matrix Q in Lemma 5.4 is not positive semi-definite. Thus, QPB is the nonconvex quadratic programming problem which is NP-hard [Jeyakumar et al. 2006 ].
To solve the subproblem (line 3 in Algorithm 1), we apply the existing QPB solver. According to Theorem 5.5, QPB is NP-hard when the scanned bounded delay b is big enough (i.e.,
It is in general difficult to solve the problem optimally. Nevertheless, there are approximation schemes [Fu et al. 1998 ] that can efficiently solve the nonconvex QPB and there are many excellent off-the-shelf software packages [Chen and Burer 2012] available. In this article, the state-of-the-art finite B&B algorithm [Chen and Burer 2012 ] is applied to solve our QPB problem.
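Why the subproblem is nonconvex can be seen on a tiny instance. The sketch below builds $Q = A - B$ for two stages with hypothetical diagonal entries of $B$ and tests midpoint convexity of $f(x) = x^T Q x$ for two points that are assumed to lie inside the box; the helper names are hypothetical. A negative gap means the function lies above its chord at the midpoint, so it is not convex.

```python
def quad_form(q, x):
    """Evaluate x^T Q x."""
    n = len(x)
    return sum(x[i] * q[i][j] * x[j] for i in range(n) for j in range(n))


def convexity_gap(q, x, y):
    """(f(x) + f(y)) / 2 - f((x + y) / 2); a convex f never makes this negative."""
    mid = [(xi + yi) / 2.0 for xi, yi in zip(x, y)]
    return (quad_form(q, x) + quad_form(q, y)) / 2.0 - quad_form(q, mid)


def q_from_diagonal(b_diag):
    """Q = A - B with A all ones and B = diag(b_diag)."""
    m = len(b_diag)
    return [[1.0 - (b_diag[i] if i == j else 0.0) for j in range(m)] for i in range(m)]


x, y = [1.0, 0.0], [0.0, 1.0]

# b larger than the sum of execution times: B has positive diagonal entries.
gap_nonconvex = convexity_gap(q_from_diagonal([1.25, 0.5]), x, y)
# b equal to the sum of execution times: B vanishes and Q = A is positive semidefinite.
gap_convex = convexity_gap(q_from_diagonal([0.0, 0.0]), x, y)
print(gap_nonconvex, gap_convex)
```

For this pair of points the gap equals $-(B_{11} + B_{22})/4$, so any positive diagonal entry of $B$ already breaks convexity, which is why a global solver such as a branch-and-bound method is needed.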
After obtaining a pair of K and T off , the repair phase (line 4 in Algorithm 1) is conducted to fulfill the remaining constraints. This repair scheme is presented in Algorithm 2. First of all, the resulting T , which results in a K i increase and power consumption increase E i (line 18 in Algorithm 2). In the end, the total loss Q should be reassigned to the stage with T i < 0 to reduce the power consumption further (lines 21-32 in Algorithm 2). The reassignment heuristic uses the power increase E i as a metric to decide which stage should be served first. Specifically, the heuristic iterates through all stages that need compensation and, in each iteration, picks the stage with the maximum power increase E i and increases its T i off without causing K i < K i . The reassignment heuristic terminates when there is no loss left to reassign or no stage needs compensation. It is worth noting that the repaired solution is still guaranteed to satisfy the constraints, as stated in Lemma 5.6.
Fast Heuristic
In Section 5.3, we presented a quadratic programming heuristic based on the QPB transformation. According to Theorem 5.5, QPB is NP-hard when the scanned bounded delay b is large enough. If the bounded delay b is scanned in n steps, the heuristic in Section 5.3 may need to solve this NP-hard problem up to n times, which is time consuming. Moreover, in its first optimization step, the quadratic programming heuristic does not consider the break-even time constraint (i.e., that $T^i_{off}$ of pipeline stage i is not smaller than $T^i_{BET}$), which can make the result pessimistic. To overcome these drawbacks, we present a fast heuristic that finds a suboptimal solution in O(mn) time, where m is the number of stages. Unlike the heuristic in Section 5.3, we consider the break-even time constraint in the optimization phase and partition the stage set P into two stage sets according to this constraint, rather than decoupling the break-even time constraint from the optimization. Based on this stage-set partition, we can derive a suboptimal solution as stated in Lemma 5.7.
off . According to the Cauchy-Bunyakovsky-Schwarz inequality, the optimal average power consumption of the stage subset $S_2$ can be determined as (19) when
According to (19), the average power consumption of the stage subset $S_2$ reaches its minimum when
According to Lemma 5.7, the optimal solution can be derived directly once the stage partition P = {S1, S2} is determined. Thus, the optimal solution can be found by exhaustively exploring all possible stage partitions, with complexity $O(2^m)$ for m stages. As the number of stages grows, this complexity increases exponentially. To reduce it, a fast stage partition scheme is proposed in this article. In this scheme, we first greedily put all stages into the stage set
BET } (i.e., we assume all the stages can enter sleep mode). Under this greedy partitioning, we compute the optimal $T_{off}$ according to Lemma 5.7, as described in lines 1 and 2 of Algorithm 3. Then, we assign the stages by checking whether the resulting optimal $T_{off}$ under the greedy partition is greater than $T^i_{BET}$ (see lines 3-9 in Algorithm 3). The feasibility of this partition scheme is guaranteed by Lemma 5.8.
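The assignment step of the greedy scheme can be sketched as follows. This is a minimal illustration rather than a reproduction of Algorithm 3: the function name, the list-based inputs, and the example values are hypothetical, the candidate sleep intervals are assumed to have been computed beforehand according to Lemma 5.7, and the comparison uses "not smaller than", following the statement of the break-even time constraint in this section. S2 is taken to be the set of stages that may sleep, and S1 collects the remaining stages.

```python
def partition_by_break_even(t_off, t_bet):
    """Split stage indices into (s1, s2).

    s2 holds the stages whose candidate sleep interval reaches the
    break-even time; s1 holds every other stage.
    """
    s1, s2 = [], []
    for i, (off, bet) in enumerate(zip(t_off, t_bet)):
        if off >= bet:
            s2.append(i)   # sleeping pays off for stage i
        else:
            s1.append(i)   # sleep interval too short to amortize the switch
    return s1, s2


# Hypothetical candidate sleep intervals and break-even times (ms) for 3 stages.
candidate_t_off = [30.0, 5.0, 12.0]
break_even = [10.0, 10.0, 12.0]
s1, s2 = partition_by_break_even(candidate_t_off, break_even)
print("S1 =", s1, "S2 =", s2)
```

The pass over the stages is linear in the number of stages, in contrast to the exhaustive exploration of all partitions.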
LEMMA 5.8. The stage partition P = {S1, S2} generated by Algorithm 3 is feasible.
PROOF. In Algorithm 3,
For each b, we can first obtain a suboptimal partition by the greedy partition scheme depicted in Algorithm 3, and then the optimal solution under the obtained partition can be determined. The pseudocode of the algorithm is given in Algorithm 4.
ALGORITHM 4: Fast Heuristic
1: for each scanned bounded delay b do
2:   compute ρ by Eqn. (16);
3:   generate the feasible partition S1 and S2 by Algorithm 3;
4:   obtain $K$ and $T_{off}$ according to Lemma 5.7;
5:   repair $K$ and $T_{off}$ by Algorithm 2;
6:   if $P(K, T_{off}) < P_{min}$ then
7:     …
8:     …
9:   end if
10: end for
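The outer loop of Algorithm 4 is a plain scan that keeps the best candidate. A minimal sketch, assuming the per-delay work (computing ρ, partitioning, applying Lemma 5.7, and repairing) is wrapped in a caller-supplied function; `scan_bounded_delay`, `scan_grid`, and the toy cost function are hypothetical names and data used only for illustration:

```python
def scan_grid(lo, hi, step):
    """Scan points lo, lo + step, ... that do not exceed hi."""
    count = int((hi - lo) / step + 1e-9) + 1
    return [lo + k * step for k in range(count)]


def scan_bounded_delay(b_values, evaluate):
    """Return (best_power, best_b, best_solution) over the scanned delays.

    `evaluate(b)` stands in for the per-delay steps of the heuristic and
    must return a pair (average_power, solution).
    """
    best_power, best_b, best_solution = float("inf"), None, None
    for b in b_values:
        power, solution = evaluate(b)
        if power < best_power:  # keep the cheapest candidate seen so far
            best_power, best_b, best_solution = power, b, solution
    return best_power, best_b, best_solution


# Toy stand-in for the real per-delay evaluation (purely illustrative).
def toy_evaluate(b):
    return (b - 12) ** 2 + 3, {"b": b}


print(scan_bounded_delay(scan_grid(0, 20, 5), toy_evaluate))
```

With n scan points and a per-point cost linear in the number of stages m, a loop of this shape runs in O(mn) time.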
PERFORMANCE EVALUATIONS
In this section, we demonstrate the effectiveness of our approach. We compare three approaches: (1) the pay-burst-only-once algorithm based on quadratic programming (PBOOA-QP) presented in Section 5.3; (2) the pay-burst-only-once algorithm based on the fast heuristic (PBOOA-FH) presented in Section 5.4; and (3) the deadline partition algorithm (DPA). DPA partitions the end-to-end deadline into sub-deadlines for individual pipeline stages and explores all the possible deadline partition combinations to find the deadline partition with the minimum energy consumption. For each deadline partition combination, DPA uses the scheme in Huang et al. [2009b] to minimize the energy consumption of the individual pipeline stages and thereby the overall energy consumption. To show the effects of our scheme, we report the average idle power computed by Eq. (10) as well as the computation time of all the schemes. The simulation is implemented in Matlab using the RTC toolbox, and the finite B&B algorithm [Chen and Burer 2012] is used to solve QPB. All results are obtained on a 2.83 GHz processor with 4GB memory.
Simulation Setup
The experiments are conducted based on the classical energy model of a 70nm technology processor in Martin et al. [2002], Wang and Mishra [2010], and Jejurikar et al. [2004], whose accuracy has been verified with SPICE simulation. Table I lists the energy parameters under 70nm technology [Martin et al. 2002; Wang and Mishra 2010; Jejurikar et al. 2004]. According to Jejurikar et al. [2004], executing at $V_{dd} = 0.7$V is more energy efficient than executing at lower voltage levels. To minimize the overall energy consumption of the system, we assume that the processor runs at this critical frequency level whenever it is in the active state. From Wang and Mishra [2010] and Jejurikar et al. [2004], the body bias voltage $V_{bs}$ is obtained as −0.7V. From Jejurikar et al. [2004], $P_{on}$, which is related to the idle power, is obtained as 100mW, and the power consumption in sleep mode $P_\sigma$ is set to 50μW. Also from Jejurikar et al. [2004], we obtain the energy overhead $E_{sw}$ of the state transition as 483μJ. We set the time overhead $t_{sw}$ of the state transition to 10ms. From the energy parameters in Table I and the energy model in Section 3.2, we can calculate the corresponding active power $P_a$ and standby power $P_s$ at voltage level $V_{dd} = 0.7$V. Table II lists all the power parameters used in the experiment.
An event stream is specified by the PJD model with period p, jitter j, and minimal interarrival distance d. It is worth noting that a worst-case execution time c is associated with the service curve of each stage, as stated in Section 3.3. The jitter j and the relative deadline D of the stream are defined as j = ϕ · p and D = γ · p, respectively, and vary with the corresponding factors.
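For concreteness, the arrival curves that the PJD model induces can be sketched as below, using the closed forms commonly used in real-time calculus: the upper curve is min(⌈(Δ + j)/p⌉, ⌈Δ/d⌉) and the lower curve is max(0, ⌊(Δ − j)/p⌋). The function names are hypothetical, times are integers in milliseconds, and d = 10ms is an assumed placeholder value rather than a setting taken from the experiments.

```python
def pjd_upper(delta, p, j, d):
    """Maximum number of events in any time window of length delta."""
    if delta <= 0:
        return 0
    by_period = -(-(delta + j) // p)       # ceil((delta + j) / p)
    if d <= 0:                             # no minimal distance given
        return by_period
    return min(by_period, -(-delta // d))  # also capped by ceil(delta / d)


def pjd_lower(delta, p, j):
    """Minimum number of events in any time window of length delta."""
    if delta <= 0:
        return 0
    return max(0, (delta - j) // p)


def stream_parameters(p, phi, gamma):
    """Jitter j = phi * p and relative deadline D = gamma * p."""
    return phi * p, gamma * p


p = 100                                    # ms, a period of 100ms
j, deadline = stream_parameters(p, 2, 2)   # jitter factor 2, deadline factor 2
print(j, deadline)
print(pjd_upper(5, p, j, d=10))            # short window: limited by d
print(pjd_upper(60, p, j, d=10))           # longer window: limited by p and j
print(pjd_lower(450, p, j))
```

A larger jitter factor raises the upper curve for short windows, which is the burst that each stage's service has to absorb.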
To evaluate the effectiveness of our approach, we conduct experiments with three applications. We collected results for these three applications with the deadline and jitter varying with the corresponding factors γ and ϕ. In the following, we give a brief overview of the three applications. The H.263 decoder application [Oh and Ha 2002] is modeled by four tasks: packet decoding (PD1), an inverse quantization operation (deQ), an inverse DCT operation (IDCT), and motion compensation (MC). The execution time of each subtask in the H.263 decoder application can be found in Oh and Ha [2002]. The activation period of the H.263 decoder application is 100ms, with varying jitter and end-to-end deadline. The MP3 decoder application is implemented in a pipelined fashion [Oh and Ha 2002] and can be split into five tasks: packet decoding (PD2), Huffman decoding (HD), the inverse quantization operation (deQ), an inverse DCT operation (IDCT), and antialiasing (FB). The execution time of each subtask in the MP3 decoder application can be found in Oh and Ha [2002]. The activation period of the MP3 decoder application is 100ms, with varying jitter and end-to-end deadline. Time Delay Equalization (TDE) comes from the GMTI (Ground Moving Target Indicator) application obtained from the StreamIt benchmarks [Thies and Amarasinghe 2010]. The TDE application contains four tasks: FFT reorder, combined DFT, FFT reorder, and combined IDFT. We set the activation period of the TDE application to 30ms.
Simulation Result
We first evaluate how the power consumption of the compared approaches changes as the jitter and deadline vary. Cases of 2-stage and 3-stage pipeline architectures with homogeneous 70nm processors are evaluated. We vary the jitter factor ϕ from 0 to 3 with step 0.5 and the deadline factor γ from 1.5 to 2 with step 0.5. The simulation results of the three approaches are shown in Figure 4. In the figure, each line represents the average energy consumption under the varied jitter factor settings with a fixed deadline factor and task mapping. From these results, we make the following observations: (1) the pay-burst-only-once-based approaches always outperform the deadline partition approach for all settings on both pipeline architectures. We list the average normalized power savings of PBOOA-QP and PBOOA-FH with respect to DPA in Table III; (2) the average idle power consumption of the three approaches increases as the jitter increases, since a bigger jitter requires a longer $T_{on}$ to guarantee the worst-case end-to-end deadline; (3) the average idle power consumption of the three approaches decreases as the end-to-end deadline increases. This is expected because a looser end-to-end deadline requirement can result in a smaller execution time $T_{on}$ and a longer sleep time $T_{off}$; (4) one interesting observation is that the pay-burst-only-once-based approaches achieve more power savings on the 3-stage pipeline than on the 2-stage pipeline for different jitter and deadline settings. This is because DPA pays the burst more times on the 3-stage pipeline than on the 2-stage pipeline, which leads PBOOA-QP and PBOOA-FH to achieve more power savings on the 3-stage pipeline.
Next, we conduct an experiment to show the impact of the time overhead of state transition $t_{sw}$ on the effectiveness of our approaches. An H.263 application with jitter factor ϕ = 0.5 and deadline factor γ = 1 runs on a 3-stage pipeline architecture with homogeneous 70nm processors.
We vary the time overhead of state transition $t_{sw}$ from 5ms to 15ms with a fixed step size of 1ms. Figure 5 illustrates the average power consumption of the three compared approaches. In this figure, we can observe that our approaches find efficient solutions and outperform DPA for all $t_{sw}$ settings. Besides, as $t_{sw}$ increases, the average power consumption of DPA increases faster than that of the pay-burst-only-once-based approaches. This is because DPA generates less idle time than the pay-burst-only-once-based approaches, since it pays the burst multiple times, as we show in Section 4. An increase of $t_{sw}$ reduces the opportunities for turning off the processor, so entering sleep mode becomes more difficult for DPA.
Then, we discuss the impact of the period setting on the effectiveness of the approaches. The MP3 application with jitter factor ϕ = 1 and deadline factor γ = 1.5 runs on a two-stage pipeline architecture with homogeneous 70nm processors, where we vary the period from 70ms to 130ms with a fixed step size of 10ms. Figure 6 illustrates the average power consumption of the three compared approaches under different period settings. From Figure 6, we can see that the pay-burst-only-once-based approaches outperform DPA at all period settings. Furthermore, the average power consumption of all approaches decreases when the period increases. This is expected because a bigger period of the application prolongs the idle intervals.
Finally, we demonstrate the scalability of our approaches. We test them on heterogeneous pipelines of up to 20 stages. The execution times of the subtasks mapped on each stage are randomly generated between 5ms and 15ms. According to the power model presented in Section 3.2, the power profile of each stage is generated by randomly selecting the voltage $V_{dd}$ between 0.5V and 0.8V. The activation period of the event stream is 40ms with jitter factor ϕ = 1. The end-to-end deadline for the test cases with different stage numbers is determined by n · 20, where n is the stage number. The overhead values of state transition $t_{sw}$ and $E_{sw}$ of different stages are randomly selected in [1ms, 5ms] and [400μJ, 800μJ], respectively. Because the deadline partition algorithm may suffer from an explosion of deadline combinations and costly computation, we set the search step to 5 for the three compared approaches. Figure 7 shows the power consumption and computation overhead on different pipeline architectures. From this figure, we make the following observations: (1) as shown in Figure 7(a), the computation overhead of the deadline partition algorithm increases exponentially.
When the stage number exceeds 10, the deadline partition algorithm (DPA) fails to generate results because the time budget of 8 hours expires. For the 9-stage pipeline, DPA takes almost 420 minutes, which is 9182× longer than for the 3-stage pipeline. This is expected because the number of deadline combinations increases exponentially with the stage number. In addition, as the stage number increases, the time for computing the resource demand of each following stage, which requires the lower bound of the output arrival curve from the previous stage, increases. Computing this output curve requires numerical min-plus convolution, which incurs considerable computational and memory overheads; (2) compared to the deadline partition algorithm, the pay-burst-only-once-based approaches are fast, and their computation time increases slowly with the stage number, especially for PBOOA-FH. For the 20-stage pipeline, the PBOOA-QP approach takes 3.7 minutes, 124× more computing time than the 3-stage case. PBOOA-FH takes only 0.08 minutes to generate the result, only 7.5× more than the 3-stage case; (3) in terms of average idle power consumption, the pay-burst-only-once-based approaches are more energy efficient than the deadline partition algorithm. In Figure 7(b), we can see that the PBOOA-QP and PBOOA-FH approaches always outperform DPA for all pipeline architectures, indicating that our approaches are not only faster but also more energy efficient than DPA. Besides, as observed in prior experiments, the gap in power consumption between the deadline partition algorithm and the pay-burst-only-once-based algorithms increases as the stage number increases. This is expected because, as the stage number increases, the number of times DPA must pay the burst also increases.
In contrast, the proposed approaches pay the burst only once, which leads to a tighter end-to-end delay bound and prolongs the idle intervals of the stages for energy efficiency; (4) the PBOOA-FH approach achieves almost identical average idle power consumption to the PBOOA-QP approach with almost 10× speedup. In some cases, PBOOA-FH even achieves more energy savings than PBOOA-QP. This is because, in contrast to the PBOOA-QP approach, PBOOA-FH integrates the break-even time constraints into the optimization phase, which allows it to find better solutions than the PBOOA-QP approach.
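The min-plus convolution mentioned in observation (1) can be illustrated on a unit time grid. The sketch below is a naive implementation, not the one in the RTC toolbox; the function names and the two rate-latency curves are hypothetical examples. Its nested loop evaluates on the order of N² sums for N grid points per curve, which hints at why repeating it stage after stage becomes costly.

```python
def min_plus_convolution(f, g):
    """(f ⊗ g)[t] = min over 0 <= s <= t of f[t - s] + g[s], on a unit grid."""
    n = min(len(f), len(g))
    return [min(f[t - s] + g[s] for s in range(t + 1)) for t in range(n)]


def rate_latency(rate, latency, horizon):
    """Samples of rate * max(0, t - latency) for t = 0, ..., horizon - 1."""
    return [rate * max(0, t - latency) for t in range(horizon)]


# Two hypothetical stages: (rate 2, latency 1) followed by (rate 1, latency 2).
first = rate_latency(2, 1, 12)
second = rate_latency(1, 2, 12)
combined = min_plus_convolution(first, second)

# Standard network-calculus identity: the result is again rate-latency,
# with the smaller rate and the sum of the two latencies.
assert combined == rate_latency(1, 3, 12)
print(combined)
```

Concatenating stages this way yields a single end-to-end service curve, so the burst of the input stream is accounted for once rather than once per stage.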
CONCLUSION
This article presents new approaches to minimize energy consumption for pipelined systems. Targeting streaming applications with nondeterministic workload arrivals under hard real-time constraints, our approaches not only guarantee the original end-to-end deadline requirement but also retrieve the pay-burst-only-once phenomenon, resulting in a significant reduction in both energy consumption and computing overhead. Moreover, our approaches are scalable with respect to the number of pipeline stages. Simulation results demonstrate the effectiveness of our approaches. In the future, we intend to extend our approaches to dynamic voltage-frequency scaling (DVFS) to reduce dynamic power for pipelined systems. Another interesting direction is to target multidimensional issues such as energy and thermal constraints simultaneously. In addition, combining our approaches with the mapping of the application is also worth investigating in future work.
PROOF. According to Huang et al. [2009b], the service curve of periodic power management specified by $T_{on}$ and $T_{off}$ can be represented as Eq. (21).
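As a concrete illustration, the sketch below evaluates the worst-case service guaranteed by a periodic on/off scheme in the max/floor/ceiling form commonly used for such resources, max(⌊Δ/T⌋·T_on, Δ − ⌈Δ/T⌉·T_off) with T = T_on + T_off, alongside an equivalent form obtained by writing Δ = k·T + r. Treating this as the content of Eq. (21) is an assumption, and the function names and integer example values are hypothetical.

```python
def service_lower(delta, t_on, t_off):
    """Worst-case service in any window of length delta (max/floor/ceil form)."""
    if delta <= 0:
        return 0
    period = t_on + t_off
    full_periods = delta // period          # floor(delta / period)
    started_periods = -(-delta // period)   # ceil(delta / period)
    return max(full_periods * t_on, delta - started_periods * t_off)


def service_lower_decomposed(delta, t_on, t_off):
    """Same curve, using delta = k * period + r with 0 <= r < period."""
    if delta <= 0:
        return 0
    k, r = divmod(delta, t_on + t_off)
    return k * t_on + max(0, r - t_off)


# Hypothetical scheme: on for 2 time units, off for 3 (period 5).
for delta in (3, 4, 5, 9):
    print(delta, service_lower(delta, 2, 3))

# Both forms agree on every sampled window length.
assert all(
    service_lower(d, 2, 3) == service_lower_decomposed(d, 2, 3)
    for d in range(60)
)
```

The decomposed form makes the worst case explicit: the window starts at the beginning of an off interval, so no service accrues until r exceeds T_off.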
This proof presents the derivation of Eq. (20), which is used to represent the service curve of periodic power management, to show that Eqs. (21) and (20) are equivalent. According to the definition of the min-plus convolution,
We make some transformations as follows.
Then, we have
As $s \le \Delta$, there are two possibilities for the parameters $r_\Delta$ and $r_s$: (1) when $k_s = k_\Delta$, $r_s \le r_\Delta$ must hold to ensure $s \le \Delta$; (2) when $k_s \le k_\Delta - 1$, there is no constraint between $r_\Delta$ and $r_s$, because $k_s \le k_\Delta - 1$ is sufficient to guarantee $s \le \Delta$.
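The case split can be checked mechanically. Assuming each interval length is decomposed as x = k·T + r with 0 ≤ r < T (the reading of k and r adopted in this sketch, with hypothetical function names and an arbitrary integer T), comparing two lengths reduces to comparing their (k, r) pairs:

```python
def decompose(x, period):
    """Write x as k * period + r with 0 <= r < period and return (k, r)."""
    return divmod(x, period)


def leq_by_cases(s, delta, period):
    """Decide s <= delta using only the decomposed parameters."""
    k_s, r_s = decompose(s, period)
    k_d, r_d = decompose(delta, period)
    if k_s <= k_d - 1:   # fewer full periods: no condition on the remainders
        return True
    if k_s == k_d:       # same number of full periods: remainders decide
        return r_s <= r_d
    return False


# Exhaustive check on a small grid with a hypothetical period T = 5.
assert all(
    leq_by_cases(s, delta, 5) == (s <= delta)
    for s in range(40)
    for delta in range(40)
)
print(leq_by_cases(7, 11, 5), leq_by_cases(9, 6, 5))
```

This is why the infimum over s can be taken separately over the two cases and then combined.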
- When $T_{off} < r_s < T$ holds, we have Eq. (24) by applying Eq. (23).
- When $0 \le r_s \le T_{off}$ holds, we have Eq. (25) by applying Eq. (23).
For the preceding two cases, the infimum of $\beta^{Gl}_{k_s \le k_\Delta - 1}(\Delta)$ for the case $k_s \le k_\Delta - 1$ can be obtained as Eq. (26) by applying Eqs. (24) and (25).
Case 2: $k_s = k_\Delta$. For this case, $r_s \le r_\Delta$ must hold to ensure $s \le \Delta$; thus we have Eq. (27) by applying Eq. (22).
As $r_s$ is constrained by $r_\Delta$, there are two cases for $r_\Delta$.
- $r_\Delta > T_{off}$. For this case, there are two subcases for $r_s$, that is, $0 \le r_s \le T_{off} < r_\Delta < T$ and $T_{off} < r_s \le r_\Delta < T$.
- $0 \le r_s \le T_{off} < r_\Delta < T$. For this case, we have Eq. (30) by applying Eq. (27).
For the prior two cases, the infimum of β 
