Abstract-Energy efficiency is a critical design concern for embedded systems. Dynamic power management (DPM) schemes in Multiprocessor Systems-on-Chip (MPSoCs) have been widely used to exploit the idleness of processors and dynamically reduce energy consumption by putting idle processors into low-power states. In this paper, we explore how to apply dynamic power management adaptively to reduce leakage power consumption for coarse-grained pipelined systems under hard real-time requirements. At each adaptive point, a system transformation is proposed to model the pipelined system with unfinished events as a multi-stream system. By using an extended pay-burst-only-once principle, the service curve for each corresponding stream can be computed as a constraint on the minimal resource demand, and an energy minimization problem can be formulated with respect to the resource demands at each adaptive point. A light-weight heuristic, called the balance workload scheme (BWS), is proposed to solve the minimization problem. Simulation results using real-life applications are presented to demonstrate the effectiveness of our approach.
I. INTRODUCTION
To achieve high performance and energy efficiency, multicore architectures are widespread in modern computer systems. When multi-core architectures are powered by batteries, reducing the power consumption is one of the major design goals, because an energy-efficient design enables slower depletion of batteries and results in lower chip temperatures that improve performance and reliability.
Among multi-core architectures, pipelined multiprocessor architectures are regarded as a promising computing paradigm for embedded system design, which can, in principle, provide high throughput and low energy consumption [2], [11], [12]. In pipelined multiprocessor architectures, processors are connected in a pipelined fashion via shared memory (e.g., FIFOs). A streaming application can be split into a sequence of functional blocks that are computed by a pipeline of processors, where clock/power-gating techniques can be applied to achieve energy efficiency.
Designing the scheduling policy for the pipeline stages under the requirements of both energy efficiency and timing guarantees is non-trivial because the two objectives conflict. Most of the previous work on this topic [26], [4], [12], [2] cannot be applied to hard real-time systems with non-deterministic workloads. To deal with non-deterministic workloads, the current state-of-the-art approach [3] computes periodic time-driven turn-on/off patterns for pipeline stages offline to minimize the leakage power consumption while guaranteeing the end-to-end deadline. However, the offline approach is less energy-efficient than an adaptive approach, because the slack caused by the runtime variability of execution times and event arrivals cannot be exploited by a static approach.
This paper explores how to apply dynamic power management adaptively to reduce leakage power consumption for coarse-grained pipelined systems under hard real-time requirements. We consider a pipeline architecture in which each processor has active, standby, and sleep modes with different power consumptions, and a streaming application with an end-to-end deadline requirement. The streaming application can be split into a sequence of coarse-grained functional blocks that are mapped to the pipeline architecture for processing. The workload of the streaming application is abstracted as an event stream, whose event arrivals are modeled as arrival curves in the interval domain [15], [22]. At each adaptive point, the dynamic power management scheme must decide when and which power state to select for each processor to reduce the energy consumption of the system under the hard real-time constraints. This decision is challenging for three reasons. First, each scheduling decision needs to guarantee the timing requirements of the new events arriving at the first stage as well as of the ongoing events stored in the system. Second, transitions between active mode and sleep mode incur time and energy overheads. Third, the current decision must be consistent with the decision made at the previous adaptive point.
In this paper, we propose an integrated approach to resolve these concerns. The approach adaptively regulates the delay of the processors according to the workload and the current state of the stages, and procrastinates the processing of events as late as possible. Compared to the state-of-the-art work in [3], the adaptive approach can exploit slack by using dynamic counter techniques [14] and by adaptively checking the FIFO state, and can achieve significant energy savings with respect to the offline approach in [3]. The contributions of this paper are summarized as follows:
• We propose a system transformation that models the system with events stored in FIFOs as a multi-stream system, which enables us to analyze the system timing efficiently.
• We extend the pay-burst-only-once principle to the transformed multi-stream system and prove its correctness. By using this extended pay-burst-only-once principle inversely, the service curve for each corresponding stream can be computed as a constraint on the minimal resource demand.
• We derive a formulation of the minimization problem based on the resources needed by the individual stages of the pipeline architecture at each adaptive time instant for scheduling decisions.
• We propose a light-weight scheme, called the balance workload scheme, to find a sub-optimal solution at each adaptive point.
• We conduct simulations using a real-life application to demonstrate the effectiveness of our approach.
The rest of the paper is organized as follows: Section II reviews related work in the literature. Section III presents the basic models and the definition of the studied problem. Section IV describes the motivation and the proposed approach. Experimental evaluation is presented in Section V, and Section VI concludes the paper.
II. RELATED WORK
Pipelined computing is a promising paradigm for embedded system design, which can in principle provide high performance and low energy consumption. Pipelined multiprocessor systems are widely applied as a viable platform for high-performance implementation of multimedia applications [21], [20]. Energy optimization for pipelined multiprocessor systems is an active topic for which a number of techniques have been proposed in the literature. Carta et al. [2] and Alimonda et al. [1] proposed a feedback-control technique for dynamic voltage/frequency scaling (DVFS) in a pipelined MPSoC architecture with soft real-time constraints, aimed at minimizing energy consumption with throughput guarantees. Javaid et al. [12] proposed an adaptive pipelined MPSoC architecture and a run-time balancing approach based on workload prediction to achieve energy efficiency. The authors in [11] proposed a dynamic power management scheme for adaptive pipelined MPSoCs, in which the duration of idle periods is determined based on future workload prediction and is used to select an appropriate power state for the idle processor. However, the above approaches target soft real-time constraints and are not applicable to hard real-time systems.
There are also many works [5], [26], [27] on hard real-time systems. Yang et al. [26] presented an integer linear programming (ILP) formulation for the frequency assignment of a set of periodic independent tasks on a heterogeneous multiprocessor system. Zhang et al. [27] proposed a two-phase framework that integrates task scheduling and voltage selection to minimize the energy consumption of real-time dependent tasks on MPSoCs. However, these methods require precise timing information, such as periodic real-time events. In practice, the precise timing of task arrivals might not be known in advance, and in the above studies there is no guarantee that an event will arrive in time. Therefore, these approaches cannot be applied to guarantee worst-case deadlines in embedded systems where violating deadlines could be disastrous. Unlike previous works, we focus on improving energy efficiency in hard real-time embedded systems while guaranteeing that the system satisfies the worst-case deadline constraint.
To model irregular event arrivals, Real-Time Calculus (RTC) [22], which is based on Network Calculus [15], can be applied to abstract task arrivals into the time interval domain. Considering DVFS systems, Maxiaguine et al. [17] computed a safe frequency at periodic intervals to prevent buffer overflow in a system. By adopting RTC models, Chen et al. [6] explored the schedulability of online DVFS scheduling algorithms. Combining optimistic and pessimistic DVFS scheduling, Perathoner et al. [19] presented an adaptive scheme for the scheduling of arbitrary event streams. Considering only dynamic power management (DPM), Huang et al. [10] presented an algorithm to find periodic time-driven patterns to turn the processor on and off for energy saving. Online algorithms are proposed in [9], [14] to adaptively control the power mode of a system, procrastinating the processing of arrived events as late as possible. In one algorithm in [9], a bound on event arrivals is computed based on historical information about event arrivals in the recent past. Instead of using historical information, the dynamic counter technique [14] is used to predict the future workload. Unfortunately, the above approaches can only be applied to a single processor. In the context of multiprocessor systems, the authors in [3] recently presented an offline approach to compute a set of periodic time-driven turn-on/off patterns for pipelined multiprocessor systems. However, the approach in [3] cannot exploit slack generated at runtime to reduce the energy consumption further, and how to apply dynamic power management at runtime for such systems is not yet clear. In this paper, we present an adaptive approach to determine energy-efficient scheduling at runtime with hard real-time constraints for pipelined multiprocessor systems using the arrival curve model.
III. SYSTEM MODELS AND PROBLEM DEFINITION

A. Hardware Model
We consider a system with the pipeline architecture shown in Fig. 1(a). Sub-tasks of a partitioned application are mapped to and executed on different processors, which are connected via FIFOs. Each processor in the pipelined system has three power modes, namely active, standby, and sleep, as shown in Fig. 1(b). To serve events, the processor must be in active mode with power consumption $P_a$. When there is no event to process, the processor can switch to sleep mode with lower power consumption $P_\sigma$. However, switching from sleep mode to active mode incurs an additional energy and latency penalty, denoted $E_{sw,on}$ and $t_{sw,on}$, respectively. To avoid frequent mode switches, the processor can stay in standby mode with power consumption $P_s$, which is less than $P_a$ but more than $P_\sigma$, i.e., $P_a > P_s > P_\sigma$. Moreover, switching from active (standby) mode to sleep mode incurs energy and time overheads, denoted $E_{sw,sleep}$ and $t_{sw,sleep}$, respectively.
Considering the overhead of switching between active mode and sleep mode, the break-even time $T_{BET}$ denotes the minimum length of time that the system should stay in sleep mode. If the interval during which the system can stay in sleep mode is shorter than $T_{BET}$, the mode-switch overheads exceed the energy saving, so switching modes is not worthwhile. The break-even time $T_{BET}$ can be defined as:
$$T_{BET} = \max\left(t_{sw},\ \frac{E_{sw}}{P_s - P_\sigma}\right) \qquad (1)$$
where $t_{sw} = t_{sw,on} + t_{sw,sleep}$ and $E_{sw} = E_{sw,on} + E_{sw,sleep}$.
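The break-even test can be illustrated with a minimal sketch. It assumes the commonly used form $T_{BET} = \max(t_{sw}, E_{sw}/(P_s - P_\sigma))$; the function and parameter names are hypothetical and chosen only for this illustration.

```python
def break_even_time(t_sw_on, t_sw_sleep, e_sw_on, e_sw_sleep, p_standby, p_sleep):
    """Break-even time, assuming T_BET = max(t_sw, E_sw / (P_s - P_sigma))."""
    if p_standby <= p_sleep:
        raise ValueError("standby power must exceed sleep power")
    t_sw = t_sw_on + t_sw_sleep      # total switching latency
    e_sw = e_sw_on + e_sw_sleep      # total switching energy
    return max(t_sw, e_sw / (p_standby - p_sleep))


def sleep_is_worthwhile(sleep_interval, t_bet):
    """Entering sleep mode pays off only if the interval reaches T_BET."""
    return sleep_interval >= t_bet
```

With consistent units (for example milliseconds, millijoules, and watts), `sleep_is_worthwhile(tau, break_even_time(...))` expresses the check that a candidate sleep interval is long enough to amortize the switching overheads.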
B. Energy Model
The analytical processor energy model in [16], [25], [13] is adopted in this paper; its accuracy has been verified with SPICE simulation. The dynamic power consumption of the core at one voltage/frequency level $(V_{dd}, f)$ is given by:
$$P_{dyn} = C_{eff} \cdot V_{dd}^{2} \cdot f \qquad (2)$$
where $V_{dd}$ is the supply voltage, $f$ is the operating frequency, and $C_{eff}$ is the effective switching capacitance. The cycle length $t_{cycle}$ is given by a modified alpha power model:
$$t_{cycle} = \frac{L_d \cdot K_6}{(V_{dd} - V_{th})^{\alpha}} \qquad (3)$$
where $K_6$ is a technology constant and $L_d$ is estimated by the average logic depth of the critical paths of all instructions in the processor. The threshold voltage $V_{th}$ is given by:
$$V_{th} = V_{th1} - K_1 \cdot V_{dd} - K_2 \cdot V_{bs} \qquad (4)$$
where $V_{th1}$, $K_1$, and $K_2$ are technology constants and $V_{bs}$ is the body bias voltage.
The static power is mainly determined by the subthreshold leakage current $I_{subn}$, the reverse bias junction current $I_j$, and the number of devices in the circuit $L_g$. It can be expressed as:
$$P_{sta} = L_g \cdot \left(V_{dd} \cdot I_{subn} + |V_{bs}| \cdot I_j\right) \qquad (5)$$
where the reverse bias junction current $I_j$ is approximated as a constant and the subthreshold leakage current $I_{subn}$ is determined as:
$$I_{subn} = K_3 \cdot e^{K_4 V_{dd}} \cdot e^{K_5 V_{bs}} \qquad (6)$$
where $K_3$, $K_4$, and $K_5$ are technology constants. To prevent junction leakage power from overriding the gain obtained by lowering $I_{subn}$, $V_{bs}$ should be constrained between 0 and -1 V. Thus, the power consumption in active mode and in standby mode, i.e., $P_a$ and $P_s$, at one voltage/frequency level $(V_{dd}, f)$ can be computed as:
$$P_a = P_{dyn} + P_{sta} + P_{on} \qquad (7)$$
$$P_s = P_{sta} + P_{on} \qquad (8)$$
where $P_{on}$ is the inherent power needed to keep the processor on.
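The power model can be evaluated numerically as in the following sketch. It assumes the widely cited form of this model (dynamic power $C_{eff} V_{dd}^2 f$, exponential subthreshold leakage, alpha-power cycle time); all function names are hypothetical, and the technology constants must be supplied by the user because none are given here.

```python
import math


def threshold_voltage(v_dd, v_bs, v_th1, k1, k2):
    # Assumed form: V_th = V_th1 - K1 * V_dd - K2 * V_bs
    return v_th1 - k1 * v_dd - k2 * v_bs


def cycle_time(v_dd, v_th, l_d, k6, alpha):
    # Assumed alpha-power form: t_cycle = L_d * K6 / (V_dd - V_th)^alpha
    return l_d * k6 / (v_dd - v_th) ** alpha


def dynamic_power(c_eff, v_dd, f):
    # P_dyn = C_eff * V_dd^2 * f
    return c_eff * v_dd ** 2 * f


def static_power(v_dd, v_bs, l_g, i_j, k3, k4, k5):
    # Subthreshold leakage current, then static power over L_g devices
    i_subn = k3 * math.exp(k4 * v_dd) * math.exp(k5 * v_bs)
    return l_g * (v_dd * i_subn + abs(v_bs) * i_j)


def active_and_standby_power(p_dyn, p_sta, p_on):
    # P_a = P_dyn + P_sta + P_on and P_s = P_sta + P_on
    return p_dyn + p_sta + p_on, p_sta + p_on
```

The only difference between the two returned values is the dynamic term, which is why standby mode still pays the full leakage cost and why sleep mode is needed to reduce leakage.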
C. Event Model
This paper considers streaming applications that can be split into a sequence of tasks. As shown in Fig. 1(a), an H.263 decoder is represented as four tasks (i.e., PD1, deQ, IDCT, MC) implemented in a pipelined fashion [18]. To model the workload of the application, the concept of the arrival curve $\alpha(\Delta) = [\alpha^u(\Delta), \alpha^l(\Delta)]$, which originates from Network Calculus [15], is adopted. $\alpha^u(\Delta)$ and $\alpha^l(\Delta)$ provide the upper and lower bounds on the number of events arriving on stream $S$ in any time interval of length $\Delta$. Many traditional timing models of event streams can be unified in the concept of arrival curves. For example, a periodic event stream with period $p$ can be modeled by a pair of step functions, where $\bar{\alpha}^u(\Delta) = \lceil \Delta / p \rceil$ and $\bar{\alpha}^l(\Delta) = \lfloor \Delta / p \rfloor$.
For a sporadic event stream with minimal inter-arrival distance $p$ and maximal inter-arrival distance $p'$, the upper and lower arrival curves are $\bar{\alpha}^u(\Delta) = \lceil \Delta / p \rceil$ and $\bar{\alpha}^l(\Delta) = \lfloor \Delta / p' \rfloor$, respectively.
Moreover, a widely used model to specify an arrival curve is the PJD model, where the arrival curve is characterized by period $p$, jitter $j$, and minimal inter-arrival distance $d$. In the PJD model, the upper arrival curve can be determined as $\bar{\alpha}^u(\Delta) = \min\{\lceil (\Delta + j)/p \rceil, \lceil \Delta / d \rceil\}$. Further details can be found in [22]. Note that arrival curves are event-based, i.e., they specify the number of events of the stream in a time interval, while service curves are based on the amount of computation time. Therefore, a service curve $\beta$ has to be transformed to $\bar{\beta}$ to indicate the number of events of the stream that the processor can process in the specified time interval. Suppose that the execution time of an event is $c$; the transformation of the service curve can then be done by $\bar{\beta} = \lfloor \beta / c \rfloor$.
With these definitions, a processor with lower service curve $\bar{\beta}^{Gl}(\Delta)$ is said to satisfy the deadline $D$ for the event stream specified by $\alpha^u(\Delta)$ if the following condition holds:
$$\bar{\beta}^{Gl}(\Delta) \ge \alpha^u(\Delta - D), \quad \forall \Delta \ge 0 \qquad (9)$$
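The arrival-curve models and the deadline condition can be sketched as follows. The code assumes integer time units and the standard step-function forms of the periodic, sporadic, and PJD models; all function names are hypothetical, and the deadline check only samples integer interval lengths up to a finite horizon, so it is an illustration rather than a proof of schedulability.

```python
def ceil_div(a, b):
    # Integer ceiling division, avoids floating-point rounding
    return -(-a // b)


def periodic_upper(delta, p):
    return ceil_div(delta, p) if delta > 0 else 0


def periodic_lower(delta, p):
    return delta // p if delta > 0 else 0


def sporadic_curves(delta, p_min, p_max):
    # Upper curve from the minimal distance, lower curve from the maximal one
    if delta <= 0:
        return 0, 0
    return ceil_div(delta, p_min), delta // p_max


def pjd_upper(delta, p, j, d):
    # Period p, jitter j, minimal inter-arrival distance d (d > 0 assumed)
    if delta <= 0:
        return 0
    return min(ceil_div(delta + j, p), ceil_div(delta, d))


def events_served(beta, c):
    # Convert computation time beta into whole events of execution time c
    return beta // c


def meets_deadline(alpha_u, beta_bar_l, deadline, horizon):
    # Sampled check of beta_bar_l(delta) >= alpha_u(delta - D)
    return all(beta_bar_l(d) >= alpha_u(d - deadline) for d in range(horizon + 1))
```

For instance, a periodic stream with $p = 10$ served by an always-on processor with $c = 2$ passes the sampled check for $D = 2$, because every event is finished two time units after it arrives.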
D. Problem Statement
This paper explores how to use dynamic power management at runtime to effectively minimize the energy consumption of coarse-grained pipelined multiprocessor systems under hard real-time requirements. Intuitively, energy saving can be achieved by (a) turning as many processors in the pipeline as possible to sleep mode, and (b) keeping each processor in sleep mode as long as possible. However, according to the definition of the break-even time $T_{BET}$, switching from/to the sleep mode is not worthwhile when the sleeping interval is shorter than $T_{BET}$. On the other hand, prolonging the sleep mode might cause current or future events to violate their timing constraints. Thus, we need to decide (a) which processors in the pipeline should be switched to sleep mode, and (b) when to turn them back to active mode to serve events so that the deadline constraints are guaranteed. The problem studied in this paper is therefore defined as follows.
Figure caption: Examples for arrival curves, where (a) periodic events with period $p$, (b) events with minimal inter-arrival distance $p$ and maximal inter-arrival distance $p' = 1.3p$, and (c) events with period $p$, jitter $j = p$, and minimal inter-arrival distance $d = 0.75p$.
IV. PROPOSED APPROACH
This section presents an adaptive approach to reduce the leakage power of pipelined multiprocessor systems, which relies on an inverse use of the extended pay-burst-only-once principle. At each time instant at which a scheduling decision is made (i.e., turning processors on or off), the decision should guarantee the timing requirements of the future event trace as well as of the ongoing events stored in the system. To effectively model a system that contains unfinished events, we propose a novel system transformation, by which the whole system is transformed into a multi-stream system where one stream maintains the event trace in the first stage and the other streams model the unfinished events in the system. Based on this system transformation, we can compute a service curve for each corresponding stream as a constraint on the minimal resource demand by using the extended pay-burst-only-once principle. The energy minimization problem is then formulated with respect to the resource demands of the individual pipeline stages. We propose a light-weight heuristic, called the balance workload scheme (BWS), to solve this minimization problem. In this paper, BWS is invoked periodically to regulate the delay of the stages. Finally, we discuss how to determine the size of the FIFOs between processors.
A. Motivation
In contrast to the work in [3], which computes a set of periodic power management patterns for each processor offline, we propose an online approach to minimize the energy consumption, which adaptively switches the power state of the processors in pipelined multiprocessor systems according to the current workload and the state of the stages. Compared to the static approach presented in [3], our approach can achieve energy savings for the following reasons.
First, execution slack usually occurs due to the difference between the worst-case assumptions made in the offline analysis and the actual online behavior of the system. The offline approach is based on the assumption that each job executes for its worst-case execution time (WCET). However, due to the inherent variability of execution time, most jobs in a real scenario finish earlier than their WCET $c_w$, thus generating execution slack $c_w - c$ (where $c$ denotes the actual execution time).
Second, the offline approach analyzes the real-time system under the assumption that tasks arrive in the worst-case pattern, i.e., according to the upper bound $\alpha^u(\Delta)$ and lower bound $\alpha^l(\Delta)$ on the number of task arrivals for stream $S$ in any time interval $\Delta$. However, this worst-case arrival pattern rarely happens in hard real-time systems. Jobs are released with a variable delay bounded by the arrival curve $\alpha(\Delta)$. For brevity, the slack generated by these variable delays is termed dynamic slack.
The slack mentioned above can be exploited and managed explicitly in our approach, leading to significant energy savings with respect to the offline approach in [3]. By adaptively monitoring the filling level of the FIFOs between processors, our approach can explicitly identify the execution slack. In addition, by using the dynamic counter technique [14], our approach can adaptively predict future event arrivals. This adaptive prediction scheme, which exploits and manages dynamic slack, procrastinates the processing of events as late as possible without violating the timing constraints.
B. Real-time Calculus Routines
1) Service Curve: Without loss of generality, a pipelined system with $m$ heterogeneous stages ($m \ge 2$) is considered. Fig. 3 presents a 3-stage pipeline as an example. At each adaptive time instant, we regulate the delay $\tau_i$ adaptively according to the workload to reduce the leakage power consumption of the pipelined multiprocessor system under hard real-time constraints. For each stage, the service curve at each adaptive point can be modeled as a bounded-delay function, defined as follows:
$$\beta^{G}_i(\Delta) = \max\{0,\ \Delta - \tau_i\} \qquad (10)$$
The transformed service curve $\bar{\beta}^{Gl}_i$ can be approximated as:
2) Demand Curve for Unfinished Events Stored in System:
3) Future Prediction: To exploit the dynamic slack efficiently, we use the dynamic counter technique presented in [14] to conservatively bound the future workload. According to [14], the number of events arriving in the time interval $[t, t + \Delta]$ can be tightly bounded by $\mu(\Delta, t)$, which is determined by dynamic counters. Because dynamic counters are simple, they can be implemented in hardware with negligible overhead [8], [14].
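The bounded-delay service curve of Section IV-B1 can be illustrated for a single stage. The sketch below assumes integer time units, an execution time `c` per event, and an event-based upper bound `alpha_u` supplied by the caller; `max_feasible_delay` is a hypothetical helper that searches a sampled grid and is not the closed-form expression used in the analysis.

```python
def bounded_delay(delta, tau):
    # Service of a stage that sleeps for tau and then runs at full speed
    return max(0, delta - tau)


def events_lower(delta, tau, c):
    # Whole events guaranteed to finish within delta
    return bounded_delay(delta, tau) // c


def max_feasible_delay(alpha_u, c, deadline, horizon):
    """Largest integer tau whose sampled service covers the demand.

    Returns None if even tau = 0 violates the sampled deadline condition.
    """
    best = None
    for tau in range(horizon + 1):
        ok = all(events_lower(d, tau, c) >= alpha_u(d - deadline)
                 for d in range(horizon + 1))
        if not ok:
            break  # feasibility is monotone: a longer delay cannot help
        best = tau
    return best
```

For a periodic bound with period 10, `c = 2`, and `deadline = 8`, the search returns 7 on the integer grid: the stage may stay off for 7 time units and still finish the first event by its deadline.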
C. System Transformation
In this paper, we aim to find an adaptive power management scheme that reduces leakage power consumption for pipelined multiprocessor systems under hard real-time constraints. However, determining the power management decision at each adaptive time instant so that hard real-time constraints are guaranteed for the whole pipeline is not an easy task. We take a 3-stage pipeline system as an example, as shown in Fig. 3. At each time instant, there might be unfinished events stored in the system waiting to be processed. At the same time, the system also needs to process the new events entering the first stage. Thus, the power management decision at each time instant should guarantee the timing requirements of the new event trace in the first stage as well as of the unfinished events stored in the system. In addition, unfinished events stored in the system prevent us from adopting the pay-burst-only-once principle directly. According to [15], [3], an approach that does not use the pay-burst-only-once principle suffers from pessimistic results as well as costly computation.
In this section, we propose a novel system transformation, which enables us to analyze the system timing efficiently. The main idea is to transform the whole system into a multi-stream system in which one stream maintains the event trace in the first stage and the other streams model the workload of the unfinished events. For an $m$-stage system, the unfinished events in stage $p_i$ ($2 \le i \le m$) can be represented by a special leaky-bucket stream $S_i$ with $\alpha(\Delta) = b_i + r_i \cdot \Delta$, where the burst $b_i$ is the number of unfinished events (i.e., the events stored in $FIFO_i$ plus the event being processed on stage $p_i$), denoted by $Q_i + 1$ in Fig. 4, and the leaky rate $r_i$ is 0. The stream $S_1$ in the first stage can be represented as the predicted arrival curve $\mu(\Delta, t)$ plus a burst $Q_1 + 1$ that represents the unfinished events in stage $p_1$. As shown in Fig. 4, at each adaptive time instant, the 3-stage pipeline system can be transformed into a system with 3 streams (i.e., $S_1$, $S_2$, and $S_3$). With this system transformation, we can compute a service curve for each corresponding stream as a constraint on the minimal resource demand by using the extended pay-burst-only-once principle (see Section IV-D).
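The transformation described above can be sketched as a function that turns the FIFO fill levels observed at an adaptive point into one arrival-curve callable per stream. The name `transform_to_streams` and the representation of $\mu(\Delta, t)$ as a one-argument callable (with $t$ fixed) are assumptions made for this illustration; the sketch also assumes, as in the text, that every stage is currently processing one event in addition to the $Q_i$ events in its FIFO.

```python
def transform_to_streams(fifo_levels, mu):
    """Build the arrival curves of streams S_1..S_m.

    fifo_levels[i] is Q_{i+1}, the number of events stored in FIFO_{i+1}.
    mu(delta) is the predicted bound on future arrivals at the first stage.
    """
    streams = []
    for index, q in enumerate(fifo_levels):
        burst = q + 1  # stored events plus the event being processed
        if index == 0:
            # S_1: predicted future arrivals plus the backlog of stage p_1
            streams.append(lambda delta, b=burst: mu(delta) + b)
        else:
            # S_i (i >= 2): leaky bucket with burst Q_i + 1 and rate 0
            streams.append(lambda delta, b=burst: b)
    return streams
```

Each returned callable can then be used as the demand of its stream when checking the deadline constraint for the sub-path from stage $p_i$ to stage $p_m$.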
D. Problem Formulation
After the system transformation, the pipeline at each adaptive point can be represented as an aggregate scheduling system. A new extended pay-burst-only-once principle is then proposed to compute the end-to-end service curve for stream $S_i$ in an aggregate scheduling system. Based on this principle, the energy minimization problem is formulated with respect to the resource demands of the individual pipeline stages at each adaptive point.
In [7], the pay-burst-only-once principle has been extended to compute the end-to-end service curve for stream $S_i$ under aggregate scheduling, but only for the case of FIFO service curve elements and arrival curves constrained by a single leaky bucket. However, the original arrival curve bound $\alpha(\Delta)$ and the predicted arrival curve $\mu(\Delta, t)$ at the adaptive time instant $t$ may not have the form of a leaky bucket in practice. In addition, events in this system are scheduled in a non-preemptive first-come-first-serve manner, not with FIFO service curves. Thus, the approach presented in [7] cannot be applied. In this paper, a new extended pay-burst-only-once principle is proposed to derive the end-to-end service curve for stream $S_i$ in the transformed multi-stream system.
Lem. 1: At one adaptive time instant, the adaptive $m$-stage system can be represented as an $m$-stream system. The end-to-end service curve $\beta_i(\Delta)$ for stream $S_i$ ($i = 1, ..., m$), with service curve elements in the rate-latency form $\beta_{R,T}$ and arrival curves $\alpha(\Delta)$, can be computed by:
where $Q_j$ is the number of events stored in $FIFO_j$ at the adaptive time instant. Proof: According to the derivations in [7], we obtain (14) for stream $S_i$ for the time pairs $t_{j+1} - t_j \ge 0$ in the general case.
where $R^j_i$ denotes the arrival function of stream $S_i$ at stage $j$ (the case $j = m + 1$ indicates the output of the stream), $\mathbb{J}_i$ denotes the set of stages that stream $S_i$ goes through (i.e., the path of stream $S_i$), $\kappa_i$ denotes the set of interfering streams that use the complete path or some part of the path of stream $S_i$, and $\mathbb{J}_{i,k}$ denotes all stages of a sub-path that are passed by both stream $S_i$ and stream $S_k$ ($k > i \ge 1$).
According to the system transformation and the non-preemptive first-come-first-serve schedule, stream $S_i$ is interfered with by stream $S_j$ when $j \ge i + 1$ holds. Thus, we have (15).
In addition, according to the system transformation, stream $S_i$ goes from the $i$th stage $p_i$ to the final stage $p_m$. Thus, we obtain (16) and (17).
At an adaptive point, a stage may have already fetched one event from its FIFO and started to process it. Considering this case, we can derive (18) from (17) with $k \ge i + 1$.
Similar to the derivation in [7], with (15), (16), and (18), the infimum over $t_{j+1} - t_j > 0$, $j \in \mathbb{J}_i$, in (14) can be brought into the form given in (13).
□
Using the approximated lower bound of (11), we can state the following theorem based on Lem. 1.
Thm. 1: At each adaptive time instant t, assume demand curve for future events arrival and unfinished events stored in stage p i in m stage pipeline system under end-to-end deadline D constraint can be defined as µ(∆ − D, t) and α 
where α D Si (∆) is the demand curve for S i , which can be defined as:
Proof: By using the approximated lower bound of (11), the lower bound of the end-to-end service curve $\beta^{Gl}_i(\Delta)$ for stream $S_i$ ($i = 1, ..., m$) can be derived in the form of the right-hand side of (19) according to Lem. 1. For each stream, (19) guarantees that the end-to-end delays of the future events as well as of the stored events are no more than $D$.
□
For each stream $S_i$, (19) needs to be satisfied to guarantee the deadline of both the future events and the stored events. The lower bound of end-to-end service curve β . To satisfy the inequality (19), the maximum bounded delay can be determined as follows.
, ∀∆ ≥ 0} (21)
Thus, the end-to-end deadline constraints can be formulated as (22). The constraint set (22) can be organized in triangular form. For stream $S_i$, the constraint has $m + 1 - i$ variables on the left-hand side of (22).
The right-hand side of (22) is constant. For brevity, in the rest of the paper, the right-hand side of (22) is denoted as $\lambda(S_i)$.
The deadline constraint set (22) in triangular form guarantees the deadline requirements for the pipeline. In the next step, a decision must be made about which processors can switch to sleep state and which cannot. Because of the break-even time $T_{BET}$, the current state of the processors has to be taken into account when making this decision. If a processor is currently in the active or idle state, it can be selected to enter sleep state only when $\tau \ge T_{BET}$ holds. If a processor is currently in sleep state, it can remain in sleep state without this restriction. Therefore, we need to consider the current state of the stages to make the decision.
At an adaptive time instant, the stages can be divided into the active set $\Phi_a$ and the sleep set $\Phi_s$ according to their current state. The active set $\Phi_a$ contains the stages that are in active state (or idle state), and the sleep set $\Phi_s$ contains the stages that are in sleep state. Let the binary variable $e_i$ denote the state switch decision for a stage in the active set $\Phi_a$ at the adaptive time instant: $e_i = 1$ if stage $p_i$ ($p_i \in \Phi_a$) switches from active state to sleep state, and $e_i = 0$ otherwise. The binary variable $e_i$ depends on the break-even time constraints: (a) $T_{BET} \le \tau_i \Rightarrow e_i = 1$; (b) $T_{BET} > \tau_i \Rightarrow e_i = 0$. The sleep interval of an active stage can be formulated as $\hat{\tau}_i = e_i \cdot \tau_i$, which is a quadratic term. For brevity, we call these constraints the active set constraints.
For the sleep set $\Phi_s$, the ongoing sleep interval $\tau^0_i$, i.e., the duration from the point of entering sleep state to the current point, may not yet be greater than $T_{BET}$. Thus, to make the current decision consistent with the former decision, these processors need to stay in sleep mode for at least $\max(T_{BET} - \tau^0_i, 0)$. We call this additional sleep interval the consistency interval $CI_i = \max(T_{BET} - \tau^0_i, 0)$. The consistency constraints can be defined as:
$$\tau_i \ge CI_i, \quad \forall p_i \in \Phi_s \qquad (23)$$
At each adaptive time instant, we greedily maximize the total sleep interval of all processors. The objective function can be formulated as:
The adaptive power management problem can thus be formulated as a sequence of optimization problems at different time instants, each of which maximizes the objective (24) subject to the deadline constraints, the active set constraints, and the consistency constraints.
The above optimization problem contains the quadratic term $\hat{\tau}_i = e_i \cdot \tau_i$ and integer variables. Solving such a mixed integer quadratic programming (MIQP) problem is time-consuming. To address this issue, a light-weight scheme, called the balance workload scheme (BWS), is presented in Section IV-E to maximize the total sleep interval of all processors while guaranteeing the hard real-time constraints.
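The active set and consistency constraints can be expressed in a few lines. The following sketch uses hypothetical function names and plain numbers for the intervals; it does not model the deadline constraints (22), so it only illustrates how the break-even time enters the objective.

```python
def consistency_interval(t_bet, elapsed_sleep):
    # CI_i = max(T_BET - tau0_i, 0) for a stage that is already asleep
    return max(t_bet - elapsed_sleep, 0)


def sleep_decision(tau, t_bet):
    # e_i = 1 only if the available delay reaches the break-even time
    return 1 if tau >= t_bet else 0


def effective_sleep_interval(tau, t_bet):
    # Product e_i * tau_i, the term that makes the problem quadratic
    return sleep_decision(tau, t_bet) * tau


def total_sleep(active_taus, sleep_taus, t_bet):
    # Greedy objective: total sleep time over active and sleeping stages
    return sum(effective_sleep_interval(t, t_bet) for t in active_taus) + sum(sleep_taus)
```

With `t_bet = 5`, an active stage that could be delayed by 4 contributes nothing to the objective, while one that could be delayed by 6 contributes its full interval.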
E. Proposed Heuristic
In this section, we present a fast heuristic, called the balance workload scheme (BWS), to find a sub-optimal solution at each adaptive time instant. At each adaptive time instant, BWS first makes a decision that turns as many processors as possible into sleep mode. Based on this first-step decision, a second-step decision keeps the selected processors sleeping as long as possible.
1) Determine Sleep Stages:
For the stages in the active set Φ a at the current point, we need to determine which stages will enter the sleep state and which stages will stay in the active state, i.e., the active set constraints. For the stages in the sleep set Φ s , the sleep interval should be more than CI i so that the current decision is consistent with the former decision, i.e., the consistency constraints. As the first-step decision, we determine which stages in Φ a should switch to sleep mode. This decision tries to switch as many processors as possible to sleep mode while balancing the workload among processors.
The pseudo code of the algorithm is given in Algo. 1, which determines the sleep stage set π s and the active stage set π a . To satisfy the constraints for the sleep set Φ s in (23), the sleep intervals τ s i for stages in Φ s are initially assigned to CI i and might later be prolonged by Algo. 2. Thus, we set the sleep decision variable e i to 1 and put the sleep set Φ s into π s at this point (Line 2 to Line 5 in Algo. 1). Then, (22) can be updated as (25) .
Note that, for multiple inequalities (25) with the same variables (i.e., the same set Φ a ∩  i ), the constants on the right-hand side of the different inequalities can be merged by the min operation. With this operation, multiple inequalities (25) with the same variables are merged into one inequality. Similar to (22) , the constraint set (25) can also be organized in triangle form (Line 7 in Algo. 1). We denote λ(Φ a , i) as the constant on the right-hand side of the i-variable inequality (25) and Φ a,i as the corresponding stage subset with i variables. For example, assuming a 4-stage pipeline with Φ a = {p 1 , p 3 , p 4 } and Φ s = {p 2 }, the constraint set (25) is organized in triangle form and is represented as follows.
For the constraint (25) with i variables, the number of processors we can turn off is bounded as in (26) . We denote this ith bound as ψ(Φ a , i).
According to the triangle-form constraint set (26), we first find the minimum bound ψ m (Φ a ) and the corresponding index I m (ψ m ) (Line 8 in Algo. 1). Then, the active set Φ a can be divided into two parts, Φ a,Im(ψm) and Φ a − Φ a,Im(ψm) . For the stages in Φ a,Im(ψm) , we need to select ψ(Φ a , I m (ψ m )) stages to turn off. To balance the workload between stages, the backlog Q i is used as the priority for selecting the stages to turn off: the smaller Q i is, the higher the priority of turning off the stage. With this criterion, we select the ψ(Φ a , I m (ψ m )) stages Φ H a,Im(ψm) with higher priority to put into π s and the remaining part Φ a,Im(ψm) − Φ 
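The merging of inequalities that share the same variable set can be sketched in Python as follows. This is only an illustrative sketch: the function name `merge_constraints`, the representation of an inequality as a (stage set, right-hand constant) pair, and the numeric constants are assumptions made here and do not come from the paper.

```python
def merge_constraints(constraints):
    """Merge inequalities over the same stage set by keeping the smallest constant.

    constraints: iterable of (stages, bound) pairs, where `stages` is an
    iterable of stage names and `bound` is the right-hand constant.
    Returns a list of (frozenset, bound) pairs sorted by number of
    variables, which mimics the triangle ordering of the constraint set.
    """
    merged = {}
    for stages, bound in constraints:
        key = frozenset(stages)
        # Several inequalities on the same variables: only the tightest matters.
        merged[key] = min(bound, merged.get(key, bound))
    return sorted(merged.items(), key=lambda kv: (len(kv[0]), sorted(kv[0])))


# Hypothetical constants for the active set {p1, p3, p4}.
example = [
    ({"p1"}, 40.0),
    ({"p1", "p3"}, 90.0),
    ({"p1"}, 35.0),  # same variable set as the first entry
    ({"p1", "p3", "p4"}, 120.0),
]
triangle = merge_constraints(example)
```

Keeping only the minimum constant is safe because an inequality with a larger right-hand side on the same variables is implied by the tighter one.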
π s ← {π s , p i };
5: end for
6: while Φ a ≠ ∅ do
7: Update constant set λ(Φ a , i) and ψ(Φ a , i) according to (25) and (26);
8: Find the minimum bound ψ m (Φ a ) = min(ψ(Φ a , i)) and the corresponding index I m (ψ m );
Update Φ a ← Φ a − Φ a,Im(ψm) ;
12: end while
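The backlog-based priority rule used when choosing which stages to turn off can be illustrated with a short Python sketch. It covers only the selection step, not the whole of Algo. 1. The function name `select_sleep_stages`, the dictionary representation of the backlogs Q i , and the example values are hypothetical.

```python
def select_sleep_stages(candidates, backlog, num_off):
    """Pick the `num_off` stages with the smallest backlog to turn off.

    candidates: stage names in the subset under consideration
    backlog:    dict mapping stage name -> backlog Q_i
    num_off:    bound on how many of these stages may be turned off
    Returns (to_sleep, not_selected), both ordered by increasing backlog.
    """
    # Smaller backlog means higher priority for being turned off.
    ranked = sorted(candidates, key=lambda stage: backlog[stage])
    k = max(0, min(int(num_off), len(ranked)))
    return ranked[:k], ranked[k:]


# Hypothetical backlogs for three active stages; at most two may sleep.
to_sleep, not_selected = select_sleep_stages(
    ["p1", "p3", "p4"], {"p1": 3, "p3": 0, "p4": 1}, num_off=2
)
```

Here the two stages with the least pending work are chosen for sleep, which leaves the stage with the largest backlog running.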
2) Prolong Sleep Interval:
In the next step, we determine the sleep interval for each stage based on the decision {π a , π s } obtained from Algo. 1. For the stages in π a , we directly set the delay τ i to 0 (Line 2 in Algo. 2). For the stages in π s , the delay τ i is represented as τ i = τ (22) can be updated as (27) .
Similar to (25) , the constraint set (27) can also be organized in triangle form. We denote λ e (π s , i) as the constant on the right-hand side of the i-variable inequality (27) Algo. 1, i.e., consistency constraint and break-even constraint. The while loop (Line 11 to Line 16 in Algo. 2) guarantees that the extended delay τ e i conforms to the end-to-end deadline (22) according to the definition of (27) .
□
F. Worst Case FIFOs Size Analysis
To prevent FIFO overflow while the proposed dynamic power management is running, we need to analyze the worst-case FIFO size requirement off-line. This information guides the assignment of shared memory between stages before the balance workload scheme is deployed, so that FIFO overflow is avoided at run-time. For the ith stage in an m-stage pipeline system, the worst-case FIFO size can be obtained by keeping all the other stages on all the time and procrastinating the ith stage as late as possible. According to (21) , the maximum bounded delay can be determined, and then the service curve of the ith stage can also be determined. Using the analysis approach in [24] , the worst-case FIFO size can be determined.
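As a rough illustration of such an off-line analysis, the Python sketch below evaluates a standard Real-Time Calculus backlog bound: the maximum vertical distance between an upper arrival curve and a lower service curve. Whether [24] uses exactly this bound is not confirmed here. The staircase arrival curve, the event-based bounded-delay service curve, and all numeric values (period, jitter, delay, execution time, horizon) are assumptions chosen only for the example.

```python
import math


def backlog_bound(arrival_upper, service_lower, horizon, step=1):
    """Largest value of arrival_upper(t) - service_lower(t) on a time grid.

    arrival_upper: function t -> max number of events arriving in any window t
    service_lower: function t -> min number of events served in any window t
    horizon, step: extent and resolution of the grid (same time unit as t)
    """
    worst = 0
    t = 0
    while t <= horizon:
        worst = max(worst, arrival_upper(t) - service_lower(t))
        t += step
    return worst


# Hypothetical stream: period 100 ms, jitter 150 ms.
def arrival(t):
    return 0 if t <= 0 else math.ceil((t + 150) / 100)


# Hypothetical stage: procrastinated by 60 ms, 40 ms per event afterwards.
def service(t):
    return max(0, math.floor((t - 60) / 40))


fifo_events = backlog_bound(arrival, service, horizon=1000)
```

The grid search is only meaningful when the horizon is long enough for the service curve to overtake the arrival curve, which holds in this example because one event is served every 40 ms while one arrives every 100 ms in the long run.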
V. EXPERIMENTAL EVALUATIONS
This section provides simulation results for the proposed adaptive dynamic power management scheme. The pipeline simulator is implemented in MATLAB using the RTC toolbox from [23] . We implement the balance workload scheme (BWS) in a periodic manner and set the activation period to 5 ms.
A. Simulation Setup
The experiments are based on the classical energy model of a 70 nm technology processor in [16] , [25] , [13] , whose accuracy has been verified with SPICE simulation. Tab. I lists the energy parameters under 70 nm technology [16] , [25] , [13] . According to [13] , executing at V dd = 0.7 V is more energy efficient than executing at lower voltage levels. To minimize the overall energy consumption of the system, we assume that the processor runs at this critical frequency level when it is in the active state. From [25] , [13] , the body bias voltage V bs is obtained as −0.7 V. From [13] , P on , which is related to idle power, is obtained as 100 mW, and the power consumption in sleep mode P σ is set to 50 µW. According to the energy model in Section III-B, we can calculate the corresponding active power P a and stand-by power P s at voltage level V dd = 0.7 V. From [13] , the energy overhead E sw of a state transition is obtained as 483 µJ. We set the time overhead t sw of a state transition to 10 ms. Tab. II lists all the power parameters used in the experiment.
The H.263 decoder shown in Fig. 1(a) is used as the test application. The execution time of each subtask in the H.263 decoder application is taken from [18] . An event stream is specified by the PJD model with period p, jitter j, and minimal inter-arrival distance d. The period p and the jitter j of the H.263 decoder application are set to 100 ms and 150 ms, respectively, while the end-to-end deadline is varied. The relative deadline D of the stream is defined as D = γ · p and varies according to the deadline factor γ.
To compare the impact of the different algorithms, we simulate traces with a time span of 10 s. The traces are generated by the RTC tools [23] and conform to the arrival curve specifications. It is worth noting that a worst-case execution time c w is associated with the transformed service curve, as stated in Section III-C.
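For readers unfamiliar with the PJD model, the following Python sketch evaluates the commonly used staircase arrival curves of a stream with period p, jitter j, and minimal inter-arrival distance d, together with the relative deadline D = γ · p. It is a minimal sketch, not the RTC toolbox implementation. The function names are hypothetical, and the value d = 10 ms is an assumption, since the setup above does not state d.

```python
import math


def pjd_upper(delta, p, j, d):
    """Maximum number of events in any time window of length delta."""
    if delta <= 0:
        return 0
    by_jitter = math.ceil((delta + j) / p)
    # d > 0 limits how closely events may follow each other in a burst.
    by_distance = math.ceil(delta / d) if d > 0 else by_jitter
    return min(by_jitter, by_distance)


def pjd_lower(delta, p, j):
    """Minimum number of events in any time window of length delta."""
    return max(0, math.floor((delta - j) / p))


def relative_deadline(gamma, p):
    """Relative deadline D = gamma * p for deadline factor gamma."""
    return gamma * p


p, j, d = 100, 150, 10  # ms; d is a hypothetical value
burst_in_one_period = pjd_upper(100, p, j, d)
deadlines = [relative_deadline(1 + 0.1 * k, p) for k in range(11)]
```

With j larger than p, several events can fall into a single period-long window, which is why the upper curve evaluated at 100 ms exceeds one event.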
B. Results
First, we show the effectiveness of the proposed adaptive power management scheme (BWS) compared with the periodic power management scheme (PPM) in [3] under different end-to-end deadline constraints. Cases of 2-stage and 3-stage pipeline architectures with homogeneous 70-nm processors are evaluated. To demonstrate how the proposed scheme can effectively exploit dynamic slack, we first remove the variability of the execution time of the tasks by setting the factor α to 1. We vary the deadline factor γ from 1 to 2 with step 0.1. The simulation results are shown in Fig. 5 . From Fig. 5 , we can make the following observations: (1) The overall power consumption of both schemes decreases as the end-to-end deadline increases. This is expected because a looser end-to-end deadline requirement creates more opportunities to enter the sleep state and allows longer sleep times.
(2) The adaptive power management scheme (BWS) outperforms the periodic power management scheme (PPM) on both pipeline architectures. In the case of no execution slack, BWS achieves on average 50% and 35% energy savings on the 2-stage and 3-stage pipelines, respectively. This indicates that our approach can efficiently exploit dynamic slack to achieve energy savings compared with PPM.
Next, we show how the proposed power management scheme can efficiently exploit execution slack. The H.263 application with deadline factor γ = 1 is evaluated on 2-stage and 3-stage pipeline architectures with homogeneous 70-nm processors. We vary the execution time factor α from 0.1 to 1 with step 0.1. The smaller α is, the more variable the execution time of a task is. We randomly generate a different actual execution time for each event and feed these times into the simulator. Fig. 6 shows how the overall power consumption changes as the execution time factor α varies. As shown in Fig. 6 , the power consumption of the proposed approach (BWS) reacts to the variability of the execution time of the tasks. An increase in α, which means that the range of variability of the task execution times shrinks, results in an increase in the overall power consumption of the proposed approach. This is because the filling level of the FIFOs between processors responds to the variability of the execution time of the tasks. By adaptively monitoring the filling level of the FIFOs (see (19) in Thm. 1), our approach can explicitly identify the execution slack and use it to save energy. Compared with the case α = 1, the proposed approach achieves 21% and 13% energy savings at α = 0.5 for the 2-stage and 3-stage pipelines, respectively. In addition, we observe that the periodic power management scheme (PPM) fails to respond to the variability of the execution time on both architectures. Compared with the case α = 1, PPM achieves only 3% and 1% energy savings at α = 0.5 for the 2-stage and 3-stage pipelines, respectively.
Finally, we demonstrate the efficiency of the proposed scheme by reporting the overall power consumption and the computation time for different pipeline architectures. We test our approach on heterogeneous pipelines with up to 10 stages, using the above stream setting, and collect the maximum computation time for each architecture. The worst-case execution times of the subtasks mapped on each stage are randomly generated between 20 ms and 40 ms. We set the execution time factor α to 0.5. The end-to-end deadline for the test case with a given stage number is determined by n · 60, where n is the stage number. Fig. 7 shows the power consumption and the maximum computation time on the different architectures. We can see that the proposed approach (BWS) outperforms the PPM approach. In addition, as shown in the figure, the proposed approach (BWS) requires only a small computation time (less than 0.75 ms), which makes our algorithms applicable online.
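The random generation of actual execution times in this experiment could be reproduced along the following lines. The sketch assumes that each actual execution time is drawn uniformly from [α · c w , c w ]. This choice matches the remark that a smaller α gives more variability, but the exact distribution is not stated above, so it is an assumption. The function name and the seed handling are also hypothetical.

```python
import random


def sample_execution_times(wcet, alpha, n_events, seed=0):
    """Draw one actual execution time per event from [alpha * wcet, wcet].

    wcet:     worst-case execution time of the task
    alpha:    execution time factor in (0, 1]; alpha = 1 removes variability
    n_events: number of events in the trace
    seed:     seed for a private generator, so traces are repeatable
    """
    rng = random.Random(seed)
    return [rng.uniform(alpha * wcet, wcet) for _ in range(n_events)]


# Hypothetical task with a 40 ms worst case, evaluated at alpha = 0.5.
times = sample_execution_times(wcet=40.0, alpha=0.5, n_events=100)
no_slack = sample_execution_times(wcet=40.0, alpha=1.0, n_events=5)
```

Using a seeded private generator keeps the trace identical across the BWS and PPM runs, so both schemes are compared on the same workload.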
VI. CONCLUSION
This paper presents an adaptive power management approach to reduce the leakage power consumption of pipelined systems. Targeting streaming applications with non-deterministic workload arrivals under hard real-time constraints, the proposed approach adaptively regulates the delay of the processors according to the workload while guaranteeing the end-to-end deadline requirement. In addition, the proposed approach can efficiently exploit the slack generated at runtime to achieve energy savings. Simulation results demonstrate the effectiveness of our approach.
