We consider the problem of energy-efficient on-line scheduling for slice-parallel video decoders on multicore systems. We assume that each processor is Dynamic Voltage Frequency Scaling (DVFS) enabled, so that it can independently trade off performance for power while taking the video decoding workload into account. In the past, scheduling and DVFS policies in multicore systems have been formulated heuristically due to the inherent complexity of the on-line multicore scheduling problem. The key contribution of this report is that we rigorously formulate the problem as a Markov decision process (MDP), which simultaneously takes into account the on-line scheduling and per-core DVFS capabilities; the power consumption of the processor cores and caches; and the loss-tolerant and dynamic nature of the video decoder's traffic. In particular, we model the video traffic using a Directed Acyclic Graph (DAG) to capture the precedence constraints among frames in a Group of Pictures (GOP) structure, while also accounting for the fact that frames have different display/decoding deadlines and nondeterministic decoding complexities.
INTRODUCTION
High-quality video decoding imposes unprecedented performance requirements on energy-constrained mobile devices. To address the competing requirements of high performance and energy efficiency, embedded mobile multimedia device manufacturers have recently adopted MPSoC (multiprocessor system-on-chip) architectures whose processors support Dynamic Voltage Frequency Scaling (DVFS) and Dynamic Power Management (DPM). Unfortunately, if such a technique is applied to video decoding applications, it requires retiming delays proportional to the GOP size, which may be arbitrarily large.
All prior research outlined in Table 1 takes into account processing energy, but not the power consumption of the different cache levels in the memory hierarchy. Since multimedia applications are data-access dominated [12], read and write accesses to the cache memories contribute significantly to the overall energy consumption. In summary, although many important advancements have been made, there is still no rigorous multicore scheduling solution that simultaneously considers per-core DVFS capabilities; dynamic processor assignment; the separate power consumption of the processor cores and caches; and loss-tolerant tasks with different complexity distributions, DAG dependency structures (i.e., precedence constraints), and stringent, but soft, real-time constraints. The contributions of this report are as follows:
• We rigorously formulate the multicore scheduling problem as a Markov decision process (MDP) that considers the abovementioned properties of the multicore system and the video decoding application. The MDP enables the system to optimally trade off long-term power and performance, where performance is measured by a Quality of Service (QoS) metric related to the decoder's throughput.
• The MDP solution has complexity that increases exponentially with both the number of processors and the number of frames in a short look-ahead window. To mitigate this complexity, we propose a novel two-level scheduler. The first-level scheduler, which acts in discrete time, determines scheduling and DVFS policies for each frame using frame-level MDPs that account for the coupling between the optimal policies of parent frames and those of their children. The second-level scheduler decides the final frame-to-processor and frequency-to-processor mappings at run-time, ensuring that certain system constraints are satisfied; it also performs slack reclamation [8][10][11] to avoid wasting resources when tasks finish before the first-level scheduler's time quantum is up.
• We validate the proposed algorithm in Matlab using accurate video decoder trace statistics generated from a parallelized H.264 decoder that we implemented on a cycle-accurate MPARM simulator [15] .
The remainder of the report is organized as follows. We introduce the system and application models in Section 2 and formulate the on-line multi-core scheduling problem as an MDP. In Section 3, we propose a lower complexity solution by approximating the original MDP problem with a two-level scheduler. In Section 4, we present our experimental results. We conclude in Section 5.
PROBLEM FORMULATION

Fig. 1. Hardware configuration of our MPARM-based virtual platform.
We consider the problem of energy-efficient slice-parallel video decoding in a time-slotted multicore system, where time is divided into slots of equal duration Δt seconds indexed by t ∈ ℕ. We assume that there are M slave processors, which we index by j ∈ {1, …, M}, and one master processor, as illustrated in Fig. 1. Our problem formulation focuses on scheduling slice decoding tasks on the slave cores. We discuss the master core in more detail in Section 4.
In Section 2.1, we describe seven important video data attributes. In Section 2.2, we propose a sophisticated Markovian traffic model for characterizing video decoding workloads. Importantly, the proposed traffic model accounts for the video data attributes introduced in Section 2.1. In Section 2.3, we use the traffic model to reveal several opportunities for parallel execution of slice decoding tasks. In Sections 2.4, 2.5, and 2.6, we describe the scheduling and frequency actions, the evolution of the video traffic/workload, and the power and Quality of Service (QoS) metrics used in our optimization. In Section 2.7, we formulate the multicore scheduling problem as a Markov decision process (MDP).
Video data attributes
We model the encoded video bitstream as a sequence of compressed data units with different decoding and display deadlines, source-coding dependencies, priorities, and decoding complexity distributions. In this report, we assume that a data unit corresponds to one video slice, which is a subset of a video frame that can be decoded independently of the other slices within the same frame (footnote 1). We denote the k-th frame of the g-th GOP by v_k^g and assume that the slices belonging to frame v have exponentially distributed decoding complexities. The assumption of exponentially distributed complexity is inaccurate; however, it is necessary to make the MDP problem formulation tractable. We briefly discuss why we adopt this assumption in Section 2.4 and examine its impact in Appendix A. Each frame is characterized by the following seven attributes:

1. GOP index: g denotes the GOP to which the frame belongs.

2. Frame type: type(v) ∈ {I, P, B} denotes the frame's prediction type (footnote 2).

3. Number of slices: each frame comprises a fixed number of slices (footnote 3).

4. Arrival time: the first time slot in which the frame can potentially be decoded (i.e., the first time slot in which it appears in the current frame set defined in Section 2.2).
5. Display deadline: t_{k,disp}^g denotes the final time slot in which v_k^g must be decoded so that it can be displayed.

6. Decoding deadline: t_{k,dec}^g denotes the final time slot in which v_k^g must be decoded so that the frames that depend on it can be decoded before their display deadlines. Note that t_{k,dec}^g ≤ t_{k,disp}^g (see Table 2).
7. Dependency: The frames must be decoded in decoding order, which is dictated by the dependencies introduced by predictive coding (e.g., motion compensation). In general, the dependencies among frames can be described by a directed acyclic graph (DAG), denoted by DAG(V, E), with the nodes in V representing frames and the edges in E representing the dependencies among frames. We use the notation v′ ≺ v to indicate that frame v depends on frame v′; therefore, frame v cannot be decoded until frame v′ is decoded (footnote 4).

These attributes are important because they determine which slices can be decoded, how long they will take to decode, when they need to be decoded, and what the penalty is for not decoding them on time. In the next subsection, we propose a Markovian traffic model that captures the above attributes in a structured manner, enabling us to rigorously formulate the multicore scheduling problem as an MDP.

Footnote 1: Because slices within a frame are encoded without exploiting correlations among neighboring slices, there is a trade-off between video rate-distortion performance during encoding (which is better for coarser grained slices) and potential parallelization gains during decoding (which are higher for finer grained slices). This trade-off has been thoroughly discussed in prior work [13]. The focus of this report is on optimally scheduling slices at the decoder side given a bitstream that has already been encoded with slices.

Footnote 2: In a typical hybrid video coder like H.264/AVC or MPEG-2, I, P, and B indicate the type of motion prediction used to exploit temporal correlations between video frames. I-frames are compressed independently of the other frames, P-frames are predicted from previous frames, and B-frames are predicted from previous and future frames.

Footnote 3: For simplicity of exposition, we assume that the bitstream is pre-encoded and that it was encoded using a fixed number of slices per frame. However, our framework can be adapted to account for an encoder that uses a variable number of slices per frame (e.g., by generating slices of approximately equal computational complexity [13] or equal size in bits). If the video has been pre-encoded, then we can assume that the number of slices in each frame is known in advance.

Footnote 4: Note that frames in GOP g+1 do not depend on frames in GOP g; however, frames in GOP g can depend on frames in GOP g+1 (e.g., the last B frames in GOP g may depend on the I frame in GOP g+1).
Markovian Traffic Model
We define a traffic state T_t = (C_t, x_t, r_t) to represent the video data that can potentially be decoded in time slot t. This traffic state comprises three components defined in the following paragraphs: the current frame set C_t ⊂ V, the buffer state x_t, and the dependency state r_t.
In time slot t, we assume that only the frames whose deadlines are within the scheduling time window (STW) can be decoded. We define the current frame set C_t as the set of all frames within the STW.
In words, a frame's arrival time (respectively, display deadline) is the first (respectively, last) time slot in which it appears in the current frame set, and a frame's decoding deadline is the minimum display deadline among its children. Note that the distinction between display and decoding deadlines is important because, even if a frame's decoding deadline is missed, which renders its children undecodable, it is still possible to decode the frame before its display deadline. Fig. 2 illustrates how the current frame sets are defined for a simple IBPB GOP structure, and Table 2 tabulates the decoding and display deadlines for the same GOP structure.

Table 2. Decoding and display deadlines for the GOP structure in Fig. 2.

    Frame (display order):   1    2    3    4    5    6    7    8
    Decoding deadline:       t   t+1  t+1  t+3  t+3  t+5  t+5  t+7
    Display deadline:        t   t+1  t+2  t+3  t+4  t+5  t+6  t+7

Example 1 (Current frame sets): For the GOP structure in Fig. 2, sliding the STW across the GOP produces four unique current frame sets, each containing the frames whose deadlines fall within the window (in our experiments, each of the four sets is active for three consecutive time slots; see Section 4.1).

We define the buffer state x_t = (x_t^v : v ∈ C_t), where x_t^v denotes the number of slices belonging to frame v that have not yet been decoded at the beginning of time slot t. Similarly, we define the dependency state r_t = (r_t^v : v ∈ C_t), where r_t^v = 1 if all of frame v's dependencies are satisfied (i.e., all of its parents are fully decoded) and r_t^v = 0 otherwise.
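To make the deadline attributes concrete, the following sketch recomputes the Table 2 deadlines from the display order and a parent/child map. The frame labels I1, B2, …, B8 and the dependency map are our illustrative assumptions for the IBPB structure; only the deadline rule (a frame's decoding deadline is the minimum display deadline among itself and its children) and the deadline values come from the text and Table 2.

    # Sketch: deadlines for the 8-frame IBPB GOP of Table 2 (labels assumed).
    frames = ["I1", "B2", "P3", "B4", "P5", "B6", "P7", "B8"]   # display order
    display = {v: k for k, v in enumerate(frames)}              # t+0, ..., t+7

    # children[v] = frames predicted from v (assumed IBPB dependency map).
    children = {"I1": ["B2", "P3"], "P3": ["B2", "B4", "P5"],
                "P5": ["B4", "B6", "P7"], "P7": ["B6", "B8"],
                "B2": [], "B4": [], "B6": [], "B8": []}

    def decoding_deadline(v):
        # minimum display deadline among the frame itself and its children
        return min([display[v]] + [display[c] for c in children[v]])

    for v in frames:
        print("%s: decode by t+%d, display at t+%d"
              % (v, decoding_deadline(v), display[v]))
    # Reproduces Table 2: decoding deadlines t, t+1, t+1, t+3, t+3, t+5, t+5, t+7.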
Opportunities for parallelism
Given the current frame sets illustrated in Example 1 and the GOP structure in Fig. 2, several opportunities for parallelism can be identified. First, slices belonging to the same frame can be decoded in parallel. Second, slices belonging to different frames whose dependencies are satisfied, e.g., an I frame (see Fig. 2) and certain B and P frames of the preceding GOP, can be decoded in parallel.

The goal in the proposed framework is to optimally and dynamically map slices to processors, and adapt the processors' frequencies, in order to minimize the average system power consumption subject to a minimum average slice decoding rate. This optimization is made formal in Section 2.7; however, before we can formalize the optimization, we need to define the scheduling and frequency actions, model the evolution of the video traffic/workload over time, and define the power and QoS metrics used in our optimization.
Scheduling actions and processor frequencies
Let y_t^{jv} denote the number of slices belonging to frame v that are scheduled on processor j at time t. For notational convenience, we collect these actions in the scheduling matrix Y_t = [y_t^{jv}]. The scheduling actions must satisfy the following constraints (a minimal feasibility check is sketched below):

• Buffer constraint: ∑_{j=1}^{M} y_t^{jv} ≤ x_t^v. In words, the total number of scheduled slices belonging to frame v cannot exceed the number of slices in frame v's buffer in time slot t.

• Processor constraint: ∑_{v ∈ C_t} y_t^{jv} ≤ 1. In words, no more than one slice can be scheduled on processor j in time slot t.

• Dependency constraint: If r_t^v = 0, then y_t^{jv} = 0 for all j. In words, all of the v-th frame's dependencies must be satisfied before slices belonging to it are scheduled to be decoded.
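For concreteness, the following minimal sketch checks the three constraints for a candidate scheduling matrix; the matrix encoding and the toy numbers are our assumptions.

    import numpy as np

    def feasible(Y, x, r):
        """Y[j, v]: slices of frame v scheduled on processor j;
        x[v]: slices left in frame v's buffer; r[v]: 1 iff frame v's
        dependencies are satisfied."""
        buffer_ok     = np.all(Y.sum(axis=0) <= x)   # buffer constraint
        processor_ok  = np.all(Y.sum(axis=1) <= 1)   # one slice per processor
        dependency_ok = np.all(Y[:, r == 0] == 0)    # no slices for blocked frames
        return bool(buffer_ok and processor_ok and dependency_ok)

    Y = np.array([[1, 0],      # processor 1 decodes a slice of frame 0
                  [0, 0]])     # processor 2 is idle
    print(feasible(Y, x=np.array([3, 2]), r=np.array([1, 0])))   # True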
We assume that each processor can operate at a different frequency in each time slot to trade off processing energy and delay. Let f_t = (f_t^1, f_t^2, …, f_t^M) denote the frequency vector, where f_t^j ∈ F is the speed of the j-th processor in time slot t and F is the set of available operating frequencies. Recall from Section 2.1 that slices belonging to frame v have exponentially distributed decoding complexity (in processor cycles); we denote the corresponding rate parameter by β_v, so that the mean complexity is 1/β_v cycles. Due to the memoryless property of the exponential distribution, if a slice belonging to frame v is scheduled on processor j at time t, then it will finish decoding in time slot t (i.e., within Δt seconds) with probability

1 − exp(−β_v · f_t^j · Δt),

regardless of the number of times it was previously scheduled. In other words, if a slice takes multiple time slots to decode, then the memoryless property implies that it is not necessary to know the number of cycles that were spent decoding the slice in past time slots in order to predict the distribution of the remaining cycles. Hence, assuming exponentially distributed service times greatly reduces the number of states required in our Markovian traffic model (see Appendix A for more details). This is (implicitly) why much prior research on power management using MDPs assumes exponential service times (e.g., [17][18]).
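As a sanity check on the memoryless model, the sketch below evaluates the per-slot completion probability 1 − exp(−β_v · f · Δt) at the platform's operating frequencies; the mean complexity of 2 Mcycles is an arbitrary illustrative value.

    import math

    beta_v = 1.0 / 2.0e6      # 1 / mean slice complexity (assumed 2 Mcycles)
    dt = 1.0 / 90.0           # time slot duration used in the experiments (s)
    for f in (125e6, 166e6, 250e6, 500e6):          # platform frequencies (Hz)
        p = 1.0 - math.exp(-beta_v * f * dt)        # P(slice finishes this slot)
        print("f = %3.0f MHz -> p = %.3f" % (f / 1e6, p))
    # p is identical in every slot the slice is scheduled, regardless of
    # how many cycles were already spent on it (memorylessness).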
We note that, if a slice finishes decoding before the time quantum is up, then we start decoding another slice (from the same frame) during the "slack" time, which is the time between the beginning of the next time quantum and the time that the originally scheduled slice finished decoding. We discuss this in more detail in Section 3.2.
State evolution and system dynamics
To fully characterize the video traffic, we need to understand how the traffic state T_t = (C_t, x_t, r_t), comprising the current frame set C_t, the buffer state x_t, and the dependency state r_t, evolves over time.
The transition of the current frame set from C_t to C_{t+1} is independent of the scheduling action; in fact, as illustrated in Fig. 2, it is deterministic and periodic for a fixed GOP structure, and therefore the sequence of current frame sets {C_t | t ∈ ℕ} can be modeled as a deterministic Markov chain.
Unlike the current frame set transition, the transition of the buffer state from x_t^v to x_{t+1}^v depends on the scheduling and frequency actions. For notational convenience, we let z_t^{jv} ∈ {0, 1} indicate whether a slice belonging to frame v finishes decoding on processor j in time slot t, and we define z_t^v = ∑_{j=1}^{M} z_t^{jv}. Given the completion probability derived in Section 2.4, z_t^{jv} = 1 with probability y_t^{jv} · (1 − exp(−β_v · f_t^j · Δt)). The buffer state then evolves as

x_{t+1}^v = x_t^v − z_t^v.  (1)

The sequence of buffer states {x_t | t ∈ ℕ} can be modeled as a controlled Markov chain. Note that the buffer state for frame v, i.e., x_t^v, is only defined for the time slots in which v belongs to the current frame set. We will refer to this range of times as the lifetime of frame v.
The transition of the dependency state from r_t^v to r_{t+1}^v during frame v's lifetime can be determined as follows:

r_{t+1}^v = 1 if x_{t+1}^{v′} = 0 for all v′ ≺ v, and r_{t+1}^v = 0 otherwise.  (2)

The first line in (2) states that frame v can be decoded in time slot t+1 if all of its parents have been fully decoded; the second line states that it cannot be decoded otherwise, e.g., because a parent missed its decoding deadline. It follows from (2) that the sequence of dependency states {r_t^v | t ∈ ℕ} can be modeled as a controlled Markov chain. Note that, similar to the buffer state, the dependency state is only defined for the lifetime of frame v.
Note that (1) and (2) couple the dynamics of a frame to those of its parents through the scheduling and frequency actions.
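The following sketch simulates one slot of the buffer and dependency dynamics in (1) and (2); the dictionary-based state encoding and helper names are our assumptions, and the decoding-deadline expiry case is omitted for brevity.

    import math, random

    def step(x, r, Y, f, beta, parents, dt=1.0 / 90.0):
        """One slot of (1)-(2). x[v]: undecoded slices; r[v]: dependencies OK;
        Y[j][v]: scheduled slices; f[j]: processor j's frequency (Hz);
        beta[v]: inverse mean complexity; parents[v]: frames v depends on."""
        x_next = dict(x)
        for j, fj in enumerate(f):
            for v in x:
                scheduled = Y[j].get(v, 0)
                if scheduled and random.random() < 1 - math.exp(-beta[v] * fj * dt):
                    x_next[v] -= 1                   # the slice finished this slot
        # (2): a frame is decodable once all of its parents are fully decoded
        r_next = {v: int(all(x_next[p] == 0 for p in parents[v])) for v in x}
        return x_next, r_next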
Power cost and slice decoding rate
The power-frequency function ρ(f_t^j) maps the j-th processor's speed f_t^j to its expected power consumption (watts). We assume that the power-frequency function is strictly convex and increasing in the frequency f and that it is the same for each processor. We also model the expected power consumed by the instruction, data, and L2 caches using a function κ(f_t^j, type(v)), which maps the processor's frequency and the type of the scheduled frame to the expected cache power consumption (watts). Note that different frame types require different cache access patterns, so the cache access power depends on the frame type. The total expected power cost in time slot t is then

c(f_t, Y_t) = ∑_{j=1}^{M} [ ρ(f_t^j) + ∑_{v ∈ C_t} y_t^{jv} · κ(f_t^j, type(v)) ].  (4)

In Section 4.1, we describe how we populate (4) by profiling an H.264 video decoder on our MPARM simulation platform.
We consider the following QoS metric in each time slot t:

q_t^{jv} = y_t^{jv} · (1 − exp(−β_v · f_t^j · Δt)).  (5)

This QoS metric is simply the expected number of slices belonging to frame v that will be decoded on processor j in time slot t. We will refer to (5) as the slice decoding rate for frame v on processor j. For notational simplicity, in the remainder of the report, we will omit the functional dependence of (4) and (5) on the scheduling and frequency actions.
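The sketch below evaluates the power cost (4) and the slice decoding rates (5) for one slot. The cubic power-frequency curve and the per-type cache powers are placeholders, since the real functions are populated by profiling (Section 4.1).

    import math

    rho = lambda f: 20e-3 + 0.5e-27 * f ** 3                      # assumed core power (W)
    kappa = lambda f, t: {"I": 30e-3, "P": 25e-3, "B": 20e-3}[t]  # assumed cache power (W)

    def power_and_qos(schedule, beta, dt=1.0 / 90.0):
        """schedule: per-processor tuples (f_j, frame id or None, frame type)."""
        power = qos = 0.0
        for f, v, ftype in schedule:
            power += rho(f)                                  # core power, eq. (4)
            if v is not None:
                power += kappa(f, ftype)                     # cache power, eq. (4)
                qos += 1 - math.exp(-beta[v] * f * dt)       # decoding rate, eq. (5)
        return power, qos

    print(power_and_qos([(250e6, 0, "I"), (125e6, None, "I")], beta={0: 1 / 2e6}))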
Markov decision process formulation
In this subsection, we formulate the problem of energy-efficient slice-parallel video decoding on M processors. In each time slot t, the objective is to determine the scheduling actions y_t^{jv}, for j ∈ {1, …, M} and v ∈ C_t, and the frequency vector f_t, in order to minimize the total average power consumption subject to a constraint on the average slice decoding rate. The total discounted (footnote 5) average power consumption and slice decoding rate can be expressed as

c̄ = (1 − γ) · E[ ∑_{t=0}^{∞} γ^t · c(f_t, Y_t) ],  (6)

and

q̄ = (1 − γ) · E[ ∑_{t=0}^{∞} γ^t · ∑_{j=1}^{M} ∑_{v ∈ C_t} q_t^{jv} ],  (7)

respectively, where γ ∈ [0, 1) is the discount factor and the expectation is over the sequences of traffic states. The multicore scheduling problem can then be written as

minimize c̄ subject to q̄ ≥ η, and the buffer, processor, and dependency constraints in every time slot,  (8)

where η is the discounted slice decoding rate constraint. Note that it is a trivial extension to maximize the average slice decoding rate under an average power constraint.
The constrained optimization defined in (8) can be formulated as an unconstrained MDP by introducing a Lagrange multiplier λ ∈ ℝ₊ associated with the slice decoding rate constraint. Note that the buffer, processor, and dependency constraints defined in (8) must still hold in every time slot; however, for notational simplicity, we will omit them from our exposition in the remainder of the report. We define the Lagrangian cost function

c_λ(T_t, f_t, Y_t) = c(f_t, Y_t) − λ · ∑_{j=1}^{M} ∑_{v ∈ C_t} q_t^{jv}.  (9)
For a fixed λ, in each time slot t, the unconstrained problem's objective is to determine the frequency vector f_t and the scheduling matrix Y_t in order to minimize the average Lagrangian cost. The discounted average Lagrangian cost can be expressed as

c̄_λ = (1 − γ) · E[ ∑_{t=0}^{∞} γ^t · c_λ(T_t, f_t, Y_t) ].  (10)

Letting p(T′ | T, Y, f) denote the traffic state transition probability function, the problem of minimizing (10) can be mapped to the following dynamic programming equation:

V_λ*(T) = min_{Y, f} { c_λ(T, f, Y) + γ · ∑_{T′} p(T′ | T, Y, f) · V_λ*(T′) },  (11)

which can be solved using the well-known value iteration algorithm [14] as follows:

V_λ^{n+1}(T) = min_{Y, f} { c_λ(T, f, Y) + γ · ∑_{T′} p(T′ | T, Y, f) · V_λ^n(T′) },  (12)

where n is the iteration index, V_λ^0(T) is initialized to 0 for all T, and the iterations repeat until the value function converges.

Footnote 5: In this report, for mathematical convenience, we use discounted averages instead of conventional averages; however, the problem can be formulated using non-discounted averages. We refer the interested reader to [17] for an intuitive justification for using discounted averages.
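For reference, the sketch below implements the value iteration update (12) for a generic finite MDP with the Lagrangian cost (9). The flat state/action encoding is our assumption; the structured traffic states of Section 2 would have to be enumerated into the index s.

    import numpy as np

    def value_iteration(cost, P, gamma=0.95, tol=1e-6):
        """cost[s, a]: Lagrangian cost (power minus lambda times decoding rate);
        P[a][s, s']: transition probabilities; returns V* and a greedy policy."""
        S, A = cost.shape
        V = np.zeros(S)                              # V_lambda^0(T) = 0 for all T
        while True:
            # Q[s, a] = c_lambda(s, a) + gamma * E[ V(T') | T = s, action a ]
            Q = cost + gamma * np.stack([P[a] @ V for a in range(A)], axis=1)
            V_new = Q.min(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q.argmin(axis=1)       # optimal value and policy
            V = V_new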
LOW COMPLEXITY SOLUTION
Unfortunately, solving (12) directly is computationally intractable for two reasons. First, the number of traffic states increases exponentially with the number of frames in the current frame set. Second, the number of joint scheduling and frequency actions increases exponentially with the number of processors: even under the homogeneity assumption (footnote 6), the scheduler must choose among on the order of (2·|F|)^M joint actions.

Footnote 6: The homogeneity assumption means that all processors have the same cost function and the same set of available operating frequencies. It implies that the MDP only needs to determine whether or not a slice is scheduled, and then slices can be greedily assigned to processors. Since at most M slices can be scheduled in each time slot (i.e., one slice per processor), this implies that Y reduces to M binary decisions that must be made jointly.
Clearly, the reason for the exponential growth in the state space (respectively, action space) is that the optimization simultaneously considers the states (respectively, scheduling actions and processor frequencies) of multiple frames. However, carefully studying the optimization objective and constraints defined in (8), it is clear that the only reason these need to be optimized jointly is the processor constraint, which ensures that at most one slice is assigned to each processor in each time slot. Motivated by this weak coupling among tasks, we propose a two-level scheduler to approximately solve (8): the first-level scheduler determines the optimal scheduling actions and processor frequencies for each frame under the assumption that each frame has exclusive access to the M processors; given the results of the first-level scheduler, the second-level scheduler determines the final slice-to-processor and frequency-to-processor mappings.
First-level scheduler
The first-level scheduler computes a value function V^{v,g}(C, x^v, r^v) for every frame in a GOP. This value function depends only on the current frame set C, the frame's buffer state x^v, and the frame's dependency state r^v. Note that the current frame set indicates the remaining lifetime of a frame and describes the connections to its parents and children; hence, the current frame set has a significant impact on the optimal scheduling and DVFS decisions for the frame. To account for the dependencies among frames, we define the v-th frame's value function V^{v,g} in such a way that it includes the values of the frame's children. In this way, frames with many children (e.g., I frames) can account for how their scheduling and frequency decisions impact the future performance of their children. We describe the first-level scheduler in more detail in the remainder of this section.
Frame-level value iteration
The first-level scheduler performs the frame-level value iteration algorithm illustrated in Table 3 to compute the optimal value functions {V^{*,v,g} : v ∈ V}. Similar to the conventional value iteration algorithm [14], the proposed frame-level value iteration algorithm iteratively updates the value functions for every state until a stopping condition is met. However, unlike the conventional value iteration algorithm, the proposed algorithm has multiple coupled value functions that need to be updated. Note that the coupling exists because the value of a frame depends on the values of its children. Due to this coupling, the form of the value function update (lines 5-9 in Table 3) differs from the conventional value iteration algorithm.
If it is not possible to make any decisions for a frame in the current traffic state, then we set the frame's value to 0 in that state. Hence, if a frame is not in the current frame set (i.e., v ∉ C), does not have its dependencies satisfied (i.e., r^v = 0), or is in the current frame set but is already fully decoded (i.e., v ∈ C and x^v = 0), then we set the frame's value to 0 (line 8 in Table 3). The more interesting case is when the frame is in the current frame set, still has undecoded slices, and has its dependencies satisfied (i.e., v ∈ C, x^v > 0, and r^v = 1); in this case, the frame's value is updated according to (13) (line 6 in Table 3).

Table 3. Frame-level value iteration algorithm: initialize all value functions to 0; in each iteration, update every frame's value function in every state (using (13) when v ∈ C, x^v > 0, and r^v = 1, and 0 otherwise); stop when the value functions converge.

Decomposed value iteration update

The frame-level value iterations allow us to eliminate the exponential growth of the state space with respect to the number of frames in the current frame set, but we still have to address the fact that the optimization in (13) (line 6 of Table 3) requires a search over an exponential number of scheduling and frequency vectors. In this subsection, we discuss how to decompose the monolithic update defined in (13) into M stages (hereafter, sub-value iterations), each corresponding to a local scheduling problem on a single processor. These M sub-value iterations are performed sequentially, using the output of the j-th processor's sub-value iteration as the input to the (j−1)-st processor's sub-value iteration. Importantly, decomposing the monolithic update into M sub-value iterations significantly reduces its computational complexity. The decomposition is illustrated in Fig. 4 and described in detail in the remainder of this subsection; in Appendix B, we discuss the complexity of the frame-level value iteration algorithm with the decomposed value iteration update. We now derive the sub-value iterations from (13).
Notice that, in (13), the terms associated with the M-th processor's scalar scheduling action and frequency are independent of the actions of processors 1, …, M−1, given the number of slices decoded on those processors. Consequently, the joint minimization in (13) can be nested so that the M-th processor's action is optimized innermost; this yields (14).
The inner minimization in (14) is the M-th processor's sub-value iteration, whose result we record as the M-th sub-value function.

Sub-value iteration at processor M: the inner minimization in (14) over the M-th processor's scalar scheduling action and frequency, recorded as (15).

The M-th processor's sub-value iteration estimates the value of being in traffic state (C, x^v, r^v) given the number of slices decoded on the other processors. It accounts for (i) the immediate power cost incurred by processor M, (ii) the slice decoding rate achieved on processor M, and (iii) the expected discounted future value of the v-th frame's children. The output of the M-th processor's sub-value iteration is used as input to the (M−1)-st processor's sub-value iteration derived below. These outputs are represented by the rightmost nodes in Fig. 4.
To derive the sub-value iterations at processors j = M−1, …, 2, we proceed in the same way. Since the terms associated with the j-th processor's scalar scheduling action and frequency are independent of the actions of processors 1, …, j−1, substituting (15) into (14) allows us to rewrite (14) as (16). The inner minimization in (16) is the j-th processor's sub-value iteration, which we record as (17).

The j-th processor's sub-value iteration estimates the value of being in traffic state (C, x^v, r^v) given the number of slices that finish decoding on processors 1, …, j−1; it accounts for the immediate power costs and slice decoding rates of processors j, …, M, as well as the expected discounted future value of the v-th frame's children.
The j th processor's sub-value iteration estimates the value of being in traffic state Finally, using the same arguments as above, the sub-value iteration at processor 1 j = is defined as follows: Sub-value iteration at processor 1:
. (18) The output of the first processor's sub-value iteration includes (i) the immediate power costs incurred by all processors, (ii) the slice decoding rate of all processors, (iii) the expected discounted future value of frame v , and (iv) the expected future discounted value of frame v 's children. The output of the first processor's subvalue iteration during iteration n , i.e.
, , : 0, ,
, is used as input to the M th processor's sub-value iteration during iteration ∈ V , then we can determine the optimal action to take in each traffic state, and therefore the optimal policy, by finding the scheduling and frequency vectors that optimize (13) . However, as we discussed earlier, this requires searching over an exponential number of scheduling and frequency vectors.
Fortunately, it turns out that we can use the sub-value iterations proposed in Section 3.1.2 to find an approximately optimal policy. An algorithm for doing this is summarized in Table 4 . In Appendix B, we discuss the complexity of determining the approximately optimal policy.
The key idea behind the algorithm in Table 4 is to find the (scalar) scheduling and frequency actions that optimize the sub-value functions defined in (15), (17), and (18) for each processor. However, there is one complication that must be dealt with before we can do this. Specifically, notice that the sub-value iterations for processors j = 2, …, M require knowledge of the number of slices that finish decoding on processors 1, …, j−1. Unfortunately, we need to select the (scalar) scheduling action and processor frequency on processor j before this number is known. To work around this problem, the algorithm in Table 4 first selects the optimal (scalar) scheduling action and frequency for processor 1. Then, to select the optimal (scalar) scheduling actions and frequencies for processors j = 2, …, M, the algorithm approximates the number of slices that finish decoding on processors 1, …, j−1 by the floor of its expected value (footnote 7), and selects (f^{jv,*}, y^{jv,*}) as the argument that optimizes the j-th processor's sub-value function (Eq. (15) or (17)) given the optimal future value.

Footnote 7: The floor of X, denoted ⌊X⌋, is the largest integer less than or equal to X.

Table 4. Algorithm for determining an approximately optimal policy for each frame: sequentially select each processor's scheduling action and frequency using the sub-value functions (15), (17), and (18), approximating the slices decoded on lower-indexed processors by the floor of the expected value.
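A minimal sketch of the sequential selection step in Table 4 follows. The function subvalue standing in for (15), (17), and (18) is assumed to be available from the offline sub-value iterations, and all other names are illustrative.

    import math

    def select_actions(M, freqs, beta_v, dt, subvalue):
        """Pick (frequency, schedule) per processor, approximating the number
        of slices already decoded on lower-indexed processors by the floor of
        its expected value, as in Table 4."""
        actions, expected_done = [], 0.0
        for j in range(M):
            k = math.floor(expected_done)            # floor of the expectation
            best = min(((f, y) for f in freqs for y in (0, 1)),
                       key=lambda a: subvalue(j, a[0], a[1], k))
            actions.append(best)
            f, y = best
            expected_done += y * (1 - math.exp(-beta_v * f * dt))
        return actions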
Second-level scheduler
Given the optimal policies calculated by the first-level scheduler (i.e., the optimal scheduling and frequency actions for each frame in each traffic state), it is very likely that slices belonging to different frames in the current frame set will want to be scheduled on the same processor in the same time slot, thereby violating the processor constraint defined in (8). To avoid this problem, the second-level scheduler determines the final slice-to-processor and frequency-to-processor mappings using an Earliest Deadline First (EDF) policy. Specifically, among the frames contending for the same processor, the second-level scheduler selects

v* = argmin_{v ∈ C_t} t_dec(v),  (19)

where t_dec(v) is frame v's decoding deadline, and ties are broken randomly.
In addition to ensuring that the processor constraint defined in (8) is satisfied, a key role of the second-level scheduler is to guarantee that, once scheduled on a processor, a slice remains on that processor until it is either completely decoded or it expires. Keeping a slice on one core prevents the system from having to migrate a slice decoding task from one processor to another, which can be expensive in terms of delay, memory bandwidth, and system energy.
Finally, if a slice finishes decoding before the first-level scheduler's time quantum is up, then the secondlevel scheduler will start decoding another slice (from the same frame) during the "slack" time, which is the time between the beginning of the next time quantum and the time that the originally scheduled slice finished decoding. This is analogous to how slack reclamation is commonly used in the power management literature (see, e.g., [8] [10] [11] ). That is, typically, an amount of time is allocated to a task based on its worst-case execution time (analogous to the first-level scheduler's time quantum), and, if the task completes before that time (analogous to it finishing before the first-level scheduler's time quantum is up), then the remaining slack time is used to schedule the next task. If there are no schedulable slices available to use the slack, then, similar to [3] , the second-level scheduler idles the processor at the lowest operating frequency so that we do not waste energy.
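The sketch below illustrates the second-level scheduler's EDF mapping with random tie-breaking; the request format is our assumption, and the no-migration and slack-reclamation behaviors are noted in comments.

    import random

    def edf_map(requests, free_processors):
        """requests: (frame, decoding_deadline, slices_wanted) tuples produced by
        the first-level policies. Processors still decoding a slice are not in
        free_processors (slices never migrate between cores)."""
        random.shuffle(requests)                 # random tie-breaking ...
        requests.sort(key=lambda req: req[1])    # ... then earliest deadline first
        mapping = {}
        for frame, deadline, wanted in requests:
            while wanted > 0 and free_processors:
                mapping[free_processors.pop()] = frame
                wanted -= 1
        # Unassigned processors idle at the lowest frequency; a core whose slice
        # finishes early starts another slice of the same frame (slack time).
        return mapping

    print(edf_map([("P3", 1, 1), ("B2", 1, 1), ("I1", 0, 2)], [0, 1, 2]))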
In Appendix B, we discuss the complexity of the second-level scheduler.
Impact of modeling assumptions on the optimal policy
In this section, we discuss two major assumptions that slightly increase the power consumption of our proposed algorithm relative to the optimal algorithm.
First, at the first-level scheduler (driven by the MDP model), we assume that only one slice can be decoded on each core in each time slot, despite the fact that the second-level scheduler allows additional slices to be processed during the slack time. This may cause the first-level scheduler to be slightly more aggressive in its selection of processor frequencies than it would be with a more accurate (and more complex) MDP model that accounts for multiple slices being scheduled on each processor in each time slot. Consequently, more power will be consumed on average than required by the workload.
Second, we assume that the slices have exponential complexity distributions. This has an interesting impact on the temporal selection of operating frequencies when decoding a slice. Suppose that we have T time slots of duration Δt seconds to decode one video slice with random complexity W ≥ 0 (cycles). Under the exponential complexity model, the probability of decoding the slice in any given time slot is constant conditioned on the operating frequency (i.e., it is independent of how many cycles have been processed in previous time slots). During the first (respectively, last) time slots spent decoding a slice, the exponential model tends to overestimate (respectively, underestimate) the probability of finishing the slice relative to the true probability, and therefore tends to select lower (respectively, higher) operating frequencies than are optimal. Overall, these policies approximately average out in terms of the cycles allocated to decoding the slice (relative to the true optimal policy), but end up using more power than necessary (due to the convexity of the power-frequency function). We discuss this in more detail in Appendix A.
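A small numeric example of the convexity argument: the two schedules below execute the same average number of cycles per slot, but the uneven one (low frequency early, high frequency late, as the exponential model tends to produce) draws more power under an assumed convex cubic power curve.

    rho = lambda f: 0.5e-27 * f ** 3      # assumed convex dynamic power (W)

    even   = [250e6, 250e6]               # steady frequency
    uneven = [125e6, 375e6]               # same average cycles, mis-timed

    for name, sched in (("even", even), ("uneven", uneven)):
        avg_mhz = sum(sched) / len(sched) / 1e6
        avg_mw = sum(rho(f) for f in sched) / len(sched) * 1e3
        print("%-6s avg %3.0f MHz -> avg %5.1f mW" % (name, avg_mhz, avg_mw))
    # even: ~7.8 mW; uneven: ~13.7 mW. Jensen's inequality: E[rho(f)] >= rho(E[f]).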
EXPERIMENTS
In this section, we describe our experimental framework in detail and evaluate our proposed algorithm. We note that we did not have access to a decoder that supports the sophisticated level of slice parallel decoding that our algorithm is designed to exploit. Specifically, the available decoder implementation can decode slices belonging to the same frame in parallel, but it cannot decode slices from different frames in parallel. Because the latter capability is essential to our proposed algorithm, we use Matlab to evaluate it, instead of the MPARM simulator.
Experimental framework
In order to validate our optimized multicore scheduling approach in Matlab, we use accurate profiling statistics generated from a parallelized H.264 decoder executed on a sophisticated multiprocessor virtual platform simulator. Specifically, we have extended and customized the multiprocessor ARM (MPARM) virtual platform simulator [15], which is a complete SystemC simulation environment for MPSoC architectural design and exploration. MPARM provides cycle-accurate and bus signal-accurate simulation for different processors. In our experiments, we have used the ARM9 Instruction Set Simulator as the main core. In addition, we have customized MPARM's DVFS settings, number of cores, memory latencies, and cache sizes in order to accurately calculate and report the energy and power consumption of the cores and the different memory cache levels for our multimedia benchmark (i.e., H.264).
In order to run the H.264 decoder at up to CIF resolution on an ARM9 core, we have created a specific experimental setup. In this experimental platform, we have integrated five ARM9 cores running at a maximum frequency of 500 MHz with per-core DVFS support (125 MHz, 166 MHz, and 250 MHz at 1.07 V, and 500 MHz at 1.6 V). These multiple processing cores replace the co-processing units, namely, the GPU, the DSP, and the hardware acceleration featured in recent MPSoC models. Each of the processing cores has private 32 KB L1 instruction and data caches. Moreover, we have also integrated 512 KB of L2 cache memory that is shared between all the cores and connected to the main memory via an AMBA interconnection bus. The main memory is divided into private memory and shared memory; the private memory is L1 and L2 cacheable, and the shared memory is only L2 cacheable. Synchronization between the different cores is implemented with semaphores. The hardware configuration of our MPARM-based virtual platform is illustrated in Fig. 1 (Section 2).
We have used the real-time operating system RTEMS (Real-Time Executive for Multiprocessor Systems) [16] in order to support multitasking execution on our MPARM multicore experimental platform. Our optimized multicore scheduling framework requires accurate statistics for each task (i.e., slice). Therefore, we have added an API that can raise interrupts from the application layer to the hardware layer to request a statistics record. This API records, at selected parts of the code, the execution time and the power consumption of the CPU, the instruction cache, the data cache, and the L2 cache. All of the statistics related to each task are then stored in a file.
For our multimedia benchmark, we have used the Joint Model (JM) reference software version 17.2 of the H.264 codec. To support simple slice-level parallelism, we modified the H.264 decoder by allocating parts of the data to the shared memory instead of the private memory, such that they are accessible from all the cores. We then implemented our own memory management API (i.e., malloc, calloc, and free) for the MPARM shared memory. Finally, we added a few instructions to the decoder code that indicate which part of the application is running on each core. We divided the decoder into three main tasks. The first task handles the parsing of the input video bitstream into slices and is assigned to the master core (core 1). The second task decodes the slices mapped by the master core; the slave cores process these slices in parallel. The third task applies the de-blocking filter to the decoded slices and is also assigned to the master core. We use the developed API to record statistics for each task. Moreover, the API also provides detailed statistics for each decoded slice, namely, the execution time, the estimated power consumption, the slice index, the frame index, the GOP index, and the assigned core. All of the generated profiling data for each slice is then ported to Matlab and used as input to our algorithm to populate the expected power function and the slice decoding complexity distributions. Since the MPARM experimental platform was only used to generate accurate profiling data, we implemented a simple static scheduling algorithm to map the slices to the slave cores.
To generate our experimental results, we implemented the two-level scheduling algorithm proposed in Section 3 in Matlab. This algorithm, together with the slice-level data traces recorded from MPARM, allowed us to determine on-line scheduling and DVFS policies for the Silent and Foreman sequences (CIF resolution, 30 frames per second, 8 slices per frame) with an IBPB GOP structure as illustrated in Fig. 2. In our Matlab simulations, we assume a time slot duration of 1/90 s, which is one third of the frame period. We divide each GOP into 12 current frame sets to capture the dependencies among frames; these 12 current frame sets are generated from the four unique current frame sets given in Example 1 (Section 2.2) by repeating each for three consecutive time slots. The system, application, and other parameters used in our experiments are given in Table 5. Importantly, although our MDP model assumes that the slice decoding complexities are exponentially distributed, we use the actual slice decoding times from the MPARM simulator when we simulate the scheduling and DVFS policies.
Trade-off between power consumption and Quality of Service
The optimization proposed in (8) allows the system to trade off power consumption and a QoS metric, namely, the slice decoding rate, which is roughly proportional to the frame rate. This trade-off can be made by adapting the Lagrange multiplier λ in the cost function defined in (9). Intuitively, small values of λ lead to scheduling and DVFS policies that favor power conservation over QoS, whereas larger values of λ lead to policies that favor QoS over power conservation. Fig. 5 shows the trade-off between the average power consumption and the average frame rate for the values of λ given in Table 5 and M = 1, 2, 4, and 8 processors. The minimum power consumption per core, which is approximately 20 mW, is due to leakage power; if we were to introduce DPM into our optimization framework, then this minimum power would be significantly lower. Clearly, as λ increases, the QoS improves at the expense of power; as the number of processors increases, less power is required per processor to decode at a given QoS; and, depending on the video source characteristics, the achievable QoS varies for a given power consumption (in this case, Silent achieves a higher QoS than Foreman for the same power consumption because Silent is a lower activity sequence). It is interesting to note that, as the decoded frame rate decreases, having fewer processors results in lower overall power consumption. This is due to the large leakage power incurred by each processor, which, as mentioned before, could be significantly reduced using DPM in addition to DVFS. It is clear from Fig. 5 that the proposed scheduling algorithm exploits the loss-tolerant nature of video decoding tasks to achieve lower decoded frame rates when the energy budget does not allow for full frame rate
decoding. An important question is whether or not the algorithm could do significantly better. In the next subsection, to answer this question, we look at some statistics on which frames miss their deadlines most frequently.

Display deadline miss rates

Fig. 6 shows the display deadline miss rates of I, P, and B frames for the values of λ given in Table 5. The results show that the proposed on-line scheduling and DVFS optimization has a very desirable property: as minimizing power becomes more important (i.e., as λ decreases), B frames are the first to miss their deadlines, followed by P frames, and then I frames. In other words, due to the smart scheduling algorithm, the QoS (i.e., frame rate) decreases slowly with the power consumption. In contrast, a scheduling policy that allows P frames to be lost before B frames, or I frames before P frames, is inherently suboptimal, because a deadline miss by one I or P frame induces deadline misses in dependent frames, adversely impacting the QoS.
Impact of video resolution
The frame size impacts the system performance and power consumption in several important ways. To demonstrate this, we have included simulation results for QCIF resolution videos in Fig. 7 and Fig. 8, which complement the simulation results for CIF resolution sequences in Fig. 5 and Fig. 6.

1. For a fixed number of cores and a fixed frame rate, increasing the video resolution requires higher power consumption. Alternatively, for a fixed number of cores and a fixed power consumption level, increasing the video resolution decreases the frame rate. This is because higher resolution frames have higher decoding complexity (proportional to the resolution).

2. For a fixed frame rate constraint, the number of cores required to achieve the lowest total power consumption increases with the video resolution. For QCIF resolution videos, Fig. 7(d) shows that at frame rates below 15 frames per second (fps) the minimum total power consumption is achieved by 1 core, while above 15 fps the minimum total power consumption is achieved by 2 cores.

3. Higher resolution frames can be partitioned into more slices, enabling more efficient use of the available cores (i.e., higher frame rates and lower power consumption). In the simulation results illustrated in Fig. 7 and Fig. 8, each QCIF resolution frame is partitioned into 4 slices, while in Fig. 5 and Fig. 6, each CIF resolution frame is partitioned into 8 slices. The improved efficiency achieved by the larger number of slices in the CIF resolution video, enabled by improved load balancing, is responsible for the phenomenon described in point 2 above (i.e., the fact that using more cores can actually reduce the total power consumption required to achieve a fixed frame rate).
Experimental comparison
In Fig. 9, we compare our proposed algorithm (with λ = 400) to the so-called Optimum Minimum-Energy Multicore Scheduling algorithm (OPT-MEMS [3]), and to a modification of our algorithm (also with λ = 400) in which we require all processors to operate at the same frequency (i.e., coordinated DVFS). We note that [3] supports both DPM and coordinated DVFS; however, we only compare against the DVFS part to achieve a fair comparison. As in [3], we switch idle processors to the minimum frequency to avoid wasting energy.
OPT-MEMS uses a frame's worst-case execution complexity and its deadline to determine a DVFS schedule that multiplexes between two frequencies in time, in order to execute exactly the worst-case number of cycles before the task's deadline. There are four important limitations of OPT-MEMS. First, OPT-MEMS is myopic because it does not consider the characteristics and requirements of future tasks (e.g., deadlines, complexities, dependencies) when deciding the DVFS schedule for the current task; myopic DVFS schedules are known to be suboptimal [20]. Second, OPT-MEMS does not provide a scheduling technique for allocating tasks to processor cores; instead, it assumes that each task is perfectly divisible among an arbitrary number of cores (i.e., it uses a fluid model). This corresponds to the case of perfect load balancing, which can only be achieved in practice if the number of slices per frame is exactly the number of cores and each slice has exactly the same decoding complexity (footnote 10). Third, OPT-MEMS does not provide a mechanism for scheduling slices belonging to different frames at the same time. This leads to some inefficiency because fully parallelized decoding (which appropriately accounts for frame dependencies) is not possible. Fourth, OPT-MEMS uses coordinated DVFS, i.e., it assumes that all processor cores operate at the same frequency. This leads to inefficiency in practice because tasks cannot be perfectly load balanced.
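Our reading of the two-frequency multiplexing step is sketched below: bracket the ideal speed w/T with the two nearest available frequencies and time-share them so that exactly the worst-case w cycles complete by the deadline. This is an illustrative reconstruction under our own assumptions, not the reference implementation of [3].

    def two_freq_schedule(w, T, freqs):
        """w: worst-case cycles; T: seconds until the deadline;
        freqs: sorted available frequencies (Hz)."""
        ideal = w / T
        assert ideal <= freqs[-1], "deadline infeasible at the highest frequency"
        if ideal <= freqs[0]:
            return [(freqs[0], T)]                   # lowest frequency suffices
        f_lo = max(f for f in freqs if f <= ideal)
        f_hi = min(f for f in freqs if f >= ideal)
        if f_lo == f_hi:
            return [(f_lo, T)]
        t_hi = (w - f_lo * T) / (f_hi - f_lo)        # time spent at f_hi
        return [(f_hi, t_hi), (f_lo, T - t_hi)]

    # e.g. 5 Mcycles in one frame period at the platform's frequencies:
    print(two_freq_schedule(5e6, 1.0 / 30.0, [125e6, 166e6, 250e6, 500e6]))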
As illustrated in Fig. 9(a) and Fig. 9(b), for M = 1 or 2 processors, all algorithms achieve approximately the same frame rates and power consumptions for a given sequence. This is because, even at the highest operating frequency, there are not enough resources to decode all frames. For M = 4 or 8 processors, Fig. 9(a) and Fig. 9(b) show that all algorithms achieve the full frame rate (or very close to it); however, Fig. 9(c) and Fig. 9(d) show that the proposed algorithm achieves lower overall power consumption. For M = 4 cores, the proposed algorithm reduces power by approximately 24% for Foreman and 36% for Silent, relative to OPT-MEMS. For M = 8 cores, the proposed algorithm reduces power by approximately 12% for Foreman and 24% for Silent, relative to OPT-MEMS. The improvements are more modest for M = 8 cores because each core runs at a much lower operating frequency than with M = 4 cores, so there is less opportunity to reduce power consumption. It is noteworthy that the change in power consumption between the OPT-MEMS and coordinated DVFS algorithms is largely due to the MDP-based optimization, whereas the change in power consumption between the coordinated DVFS and proposed algorithms is largely due to independent DVFS frequencies for each core (however, the MDP-based optimization and the gains due to independent DVFS frequencies are not completely separable).

Footnote 10: More precisely, OPT-MEMS defines a speed-up factor S[j] to describe the speed-up achieved (relative to 1 core) when there are j cores available. If a task that takes w cycles on one core is executed in parallel on j cores with a speed-up of S[j], then the task can be executed within w / (j · S[j]) cycles on each core. We assume that S[j] = 1 in our comparison, which implies perfect load balancing.
CONCLUSION
We proposed a Markov decision process based on-line scheduling algorithm for slice-parallel video decoders on multicore systems. Solving for the optimal on-line scheduling and DVFS policy requires complexity that increases exponentially with both the number of processors and the number of frames in the short look-ahead window used by the scheduler. To mitigate this complexity, we proposed a novel two-level scheduler. The first-level scheduler determines scheduling and DVFS policies independently for each frame, and the second-level scheduler decides the final frame-to-processor and frequency-to-processor mappings at run-time, ensuring that certain system constraints are satisfied. We validated the proposed algorithm in Matlab using accurate video decoder trace statistics generated from a parallelized H.264 decoder that we implemented on a cycle-accurate MPARM simulator. Our experimental results indicate that the proposed algorithm effectively trades off power consumption and QoS by ensuring that a limited energy budget is allocated to decoding the most important frames (e.g., I and P frames) before the less important frames (e.g., B frames).
In future work, we plan to integrate the proposed two-level scheduler into the MPARM simulator, first by creating a "hook" between the simulator and Matlab, which will allow us to control the scheduling and DVFS actions at run-time with our Matlab code, and later by actually implementing the two-level scheduler on the master core, which will allow us to measure the impact of the scheduler's overheads on the system's performance. We also plan to integrate DPM into the proposed solution to achieve even lower power consumption.
APPENDIX A: IMPACT OF THE EXPONENTIAL ASSUMPTION
In this appendix, we first describe why we make the exponential assumption. We then provide some analysis to explain the impact of the assumption on the optimal policy. We will refer to Fig. 10 throughout our discussion. 
APPENDIX B: COMPUTATIONAL OVERHEADS
There are four components of the proposed algorithm that incur overheads:
1. Offline: Frame-level value iteration at the first-level scheduler (i.e. Table 3 ) with decomposed value iteration update (see Section 3.1.2).
2. Offline: Determining an approximately optimal policy for each frame (i.e., Table 4).
3. Online: Policy look-up.
4. Online: Second-level scheduler (see Section 3.2).
The first two components are performed offline, so their overheads will not impact the online system performance. The second two components are performed online, but are light-weight so they require little overhead. For completeness, we discuss the computation overheads of each component below:
Offline: frame-level value iteration at the first level scheduler
The proposed frame-level value iteration algorithm requires computing a value function for every frame in the GOP structure. The sub-value iteration at processor M requires searching over (a maximum of) M possible numbers of previously decoded slices, |F| processor frequencies, and two scheduling actions (i.e., 0 or 1 slices scheduled) and, for each possible combination, requires computing an expectation over two possible departures (i.e., 0 or 1 slices decoded) and a sum over the value functions of the frames that are children of frame v (say, U children). The sub-value iterations at processors 1 through M−1 have similar complexity. In the experimental results, we consider a four-frame GOP structure (i.e., the IBPB structure).

Online: Policy look-up

In each time slot, the run-time scheduler looks up the first-level policy π*(C, x, r) = (f*, y*) (i.e., the optimal frequency and scheduling actions for each processor) for all C frames in the current frame set C. Assuming that one table look-up takes O(1) time, then, in each time slot, we incur an overhead of O(C) to access the policy look-up tables for each frame in the current frame set C.
Online: Second-level scheduler
On each of the M processors, the second-level scheduler uses an Earliest Deadline First policy to determine which of the C frames in the current frame set C get scheduled. This incurs a complexity overhead of O(C·M).
