We consider the problem of energy-efficient scheduling for slice-parallel video decoders on multicore systems whose processors support Dynamic Voltage Frequency Scaling (DVFS). We rigorously formulate the problem as a Markov decision process (MDP), which simultaneously considers the on-line scheduling and per-core DVFS capabilities; the power consumption of the processor cores and caches; and the loss-tolerant and dynamic nature of the video decoder. The objective is to minimize long-term power consumption subject to a minimum Quality of Service (QoS) constraint related to the decoder's throughput. We evaluate the proposed scheduling algorithm using traces generated from a cycle-accurate multiprocessor ARM simulator.
INTRODUCTION
Despite improvements in mobile device technology, energy-efficient multicore scheduling for video decoding remains a challenging problem for several reasons. First, video decoding applications have intense and time-varying workloads, with worst-case execution times that are significantly larger than the average case. Second, they have sophisticated dependency structures due to predictive coding. These dependency structures, which can be modeled as directed acyclic graphs (DAGs), not only result in different frames having different priorities, but also make it difficult to balance loads across the cores, which is important for energy efficiency [1]. Finally, they often have stringent delay constraints, but are considered soft real-time applications. In other words, video frames should meet their deadlines, but when they do not, the application quality (e.g. decoded video frame rate) is reduced.
During the last decade, many energy-efficient multicore scheduling algorithms that exploit Dynamic Voltage Frequency Scaling (DVFS [7] ) and/or Dynamic Power Management (DPM [12] ) have been proposed, e.g. [2] [3] [4] [6] [8] . The Largest Task First with Dynamic Power Management (LTF-DPM) algorithm in [3] assumes that frame decoding deadlines are equally spaced in time, and therefore does not support video group of pictures (GOP) structures with B frames; moreover, LTF-DPM will typically have looser deadline constraints than our proposed algorithm because it assigns groups of frames a common "weak" deadline. The Stochastic Scheduling2D algorithm [6] considers a periodic DAG application model that requires a "source" and "sink" node in each period, making the algorithm incompatible with GOP structures where the last B frame in a GOP depends on the I frame in the next GOP (e.g. an IBPB GOP). The Variation Aware Time Budgeting (Var-TB) algorithm in [8] uses a functional partitioning algorithm for parallelizing the video decoder (e.g. pipelining decoder sub-functions such as inverse DCT and motion compensation on different cores). Functional partitioning is known to be suboptimal [13] and parallelization approaches based on data partitioning (e.g. mapping different frames, slices, or macroblocks to different processors) are superior [13] . The so-called SpringS algorithm in [4] uses a task-level software pipelining algorithm called RDAG [5] to transform a periodic dependent task graph (expressed as a DAG) into a set of tasks that can be pipelined on parallel processors. However, if this technique is applied to video decoding applications, it will require retiming delays proportional to the GOP size, which may be large.
There is no solution that simultaneously considers per-core DVFS capabilities; dynamic processor assignment; and loss-tolerant tasks with different complexity distributions, DAG dependency structures, and stringent, but soft real-time, constraints. The contributions of this paper are as follows:
• We rigorously formulate the multicore scheduling problem using a Markov decision process (MDP) that considers the above-mentioned properties. The MDP enables the system to optimally trade off long-term power and performance.
• The MDP solution requires complexity that increases exponentially with both the number of processors and the number of frames in a short look-ahead window. To mitigate this complexity, we propose a novel two-level scheduler. The first level determines scheduling and DVFS policies for each frame using frame-level MDPs, which account for the coupling between the optimal policies of parent and children frames. The second level decides the final frame-to-processor and frequency-to-processor mappings, ensuring that certain system constraints are satisfied.
• We validate the proposed algorithm in Matlab using video decoder trace statistics generated from an H.264/AVC decoder that we implemented on a cycle-accurate multiprocessor ARM (MPARM) simulator [11].
The remainder of the paper is organized as follows. We introduce the system and application models in Section 2 and formulate the on-line multi-core scheduling problem as an MDP. In Section 3, we propose a lower complexity solution by approximating the original MDP problem with a two-level scheduler. In Section 4, we present our experimental results. We conclude in Section 5.
PROBLEM FORMULATION
We consider the problem of energy-efficient slice-parallel video decoding in a time-slotted multicore system, where time is divided into slots of equal duration $\Delta t$ seconds indexed by $t \in \mathbb{N}$. We assume that there are $M$ processors, which we index by $j \in \{1, \ldots, M\}$. In Section 2.1, we describe seven important video data attributes. In Section 2.2, we propose a sophisticated Markovian traffic/workload model that accounts for the video data attributes introduced in Section 2.1. In Sections 2.3, 2.4, and 2.5 we describe the scheduling and frequency actions, the evolution of the video traffic/workload, and the power and Quality of Service (QoS) metrics used in our optimization. In Section 2.6, we formulate the multicore scheduling problem as an MDP.
Video data attributes
We model the encoded video bitstream as a sequence of compressed data units. We assume that a data unit corresponds to one video slice, which is a subset of a video frame that can be decoded independently of other slices within the same frame [9]. We assume that the video is encoded using a fixed, periodic GOP structure that contains $K$ frames and lasts a period of $T$ time slots of duration $\Delta t$. The set of frames within GOP $g \in \mathbb{N}$ is denoted by … $v$ must be decoded so that frames that depend on it can be decoded before their display deadline.
7. Dependency: The frames must be decoded in decoding order, which is dictated by the dependencies introduced by predictive coding (e.g., motion compensation). In general, the dependencies among frames can be described by a DAG, denoted by $\mathrm{DAG} = (V, E)$, with the nodes in $V$ representing frames and the edges in $E$ representing the dependencies among frames. We use the notation …
These attributes determine which slices can be decoded, how long they will take to decode, and when they need to be decoded. In the next subsection, we propose a Markovian traffic model that captures the above attributes, enabling us to rigorously formulate the multicore scheduling problem as an MDP.
Markovian traffic model
We define a traffic state $\mathcal{T}_t = (\mathcal{C}_t, x_t, r_t)$ … is the first (respectively, last) time slot in which it appears in the frame working set, and a frame's decoding deadline is the minimum display deadline of its children. Note that the distinction between display and decoding deadlines is important because, even if a frame's decoding deadline is missed, which renders its children undecodable, it is still possible to decode the frame itself before its display deadline. Fig. 1 illustrates the STW concept for a simple IBPB GOP structure.
We define the buffer state $x_t = (x_t^v \mid v \in \mathcal{C}_t)$, where $x_t^v$ denotes the number of slices of frame $v$ awaiting decoding at time $t$. Finally, the dependency state $r_t = (r_t^v \mid v \in \mathcal{C}_t)$ defines whether or not each frame in the frame working set is decodable in time slot $t$. In particular, …
Scheduling actions and frequencies
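The scheduling action $y_t^{jv}$, which assigns slices of frame $v$ to processor $j$ in time slot $t$, must satisfy the following constraints:
• Buffer constraint: $\sum_{j=1}^{M} y_t^{jv} \le x_t^v$. In words, the total number of scheduled slices belonging to frame $v$ cannot exceed the number of slices in its buffer in time slot $t$.
• Processor constraint: $\sum_{v \in \mathcal{C}_t} y_t^{jv} \le 1$. In words, no more than one slice can be scheduled on processor $j$ in time slot $t$.
• Dependency constraint: If frame $v$ is not decodable in time slot $t$, then $y_t^{jv} = 0$ for all $j$. In words, all of the $v$th frame's dependencies must be satisfied before slices belonging to it are scheduled to be decoded.
We assume that each processor can operate at a different frequency in each time slot to trade off processing energy and delay. Let $f_t = (f_t^1, f_t^2, \ldots, f_t^M)$ denote the frequency vector, where $f_t^j \in \mathcal{F}$ is the speed of the $j$th processor in time slot $t$ and $\mathcal{F}$ is the set of available operating frequencies.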
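The three constraints above can be checked mechanically for a candidate action. The following is a minimal sketch, assuming the action is stored as a dictionary keyed by (processor, frame) pairs; the function name `is_feasible` and the data layout are illustrative assumptions, not part of the paper.

```python
from typing import Dict, Hashable, Tuple

Frame = Hashable


def is_feasible(
    y: Dict[Tuple[int, Frame], int],
    buffer: Dict[Frame, int],
    decodable: Dict[Frame, bool],
    num_processors: int,
) -> bool:
    """Check one time slot's scheduling action against the three constraints.

    y[(j, v)] is the number of slices of frame v scheduled on processor j
    (missing keys mean 0). `buffer` and `decodable` are keyed by the frames
    in the frame working set. All names here are hypothetical.
    """
    processors = range(1, num_processors + 1)
    # Reject actions that name an unknown processor or a frame outside the working set.
    for (j, v), n in y.items():
        if n and (j not in processors or v not in buffer):
            return False
    for v, slices_waiting in buffer.items():
        scheduled = sum(y.get((j, v), 0) for j in processors)
        # Buffer constraint: cannot schedule more slices than are waiting.
        if scheduled > slices_waiting:
            return False
        # Dependency constraint: undecodable frames get no slices scheduled.
        if scheduled > 0 and not decodable.get(v, False):
            return False
    # Processor constraint: at most one slice per processor per slot.
    for j in processors:
        if sum(y.get((j, v), 0) for v in buffer) > 1:
            return False
    return True
```

For example, with two slices of an I frame buffered, scheduling one slice on each of two processors passes, while scheduling a third slice on a third processor fails the buffer constraint.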
State evolution and system dynamics
To fully characterize the video traffic, we need to understand how the traffic state $\mathcal{T}_t = (\mathcal{C}_t, x_t, r_t)$ evolves over time. The transition of the frame working set from $\mathcal{C}_t$ to $\mathcal{C}_{t+1}$ is independent of the scheduling action; in fact, it is deterministic and periodic for a fixed GOP structure, and therefore the sequence $\{\mathcal{C}_t \mid t \in \mathbb{N}\}$ can be modeled as a deterministic Markov chain.
The sequence $\{r_t \mid t \in \mathbb{N}\}$ can be modeled as a controlled Markov chain.
Power cost and slice decoding rate
The power-frequency function $\rho(f_t^j)$ maps the $j$th processor's speed $f_t^j$ to its expected power consumption (watts).
We also consider the expected power consumed by the instruction, data, and L2 caches using a function with arguments $(\cdot, \cdot, \mathrm{type}(v))$ … which is simply the expected number of slices belonging to frame $v$ that will be decoded on processor $j$ in time slot $t$. We will refer to (3) as the slice decoding rate. In the remainder of the paper, we will omit the dependence of (2) and (3) on $\mathrm{type}(v)$.
Markov decision process formulation
In this subsection, we formulate the problem of energy-efficient slice-parallel video decoding on $M$ processors. In each time slot $t$, the objective is to determine the scheduling action $y_t^{jv}$, for all $j \in \{1, 2, \ldots, M\}$ and $v \in \mathcal{C}_t$, and the frequency vector $f_t$, in order to minimize the long-term power consumption subject to a long-term slice decoding rate constraint. The total discounted [12] average power consumption and slice decoding rate can be expressed as … and …, respectively, where … is the discount factor, and the expectation is over the sequence of traffic states $\{\mathcal{T}_t \mid t \in \mathbb{N}\}$.
Stated more formally, the optimization objective and constraints are as follows: $\min$ … subject to …, where $\eta$ is the slice decoding rate constraint. Note that the buffer, processor, and dependency constraints defined in Section 2.3 must hold in every time slot; however, we will omit them from our exposition in the remainder of the paper. Equation (6) …
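For orientation, a constrained discounted problem of the kind described in this subsection is commonly written in the generic form below. This is a sketch under assumed notation, not a reconstruction of the paper's numbered equations: $P_t$ and $R_t$ stand for the total power and total slice decoding rate in slot $t$, $\gamma$ for the discount factor, and $\lambda$ for a Lagrange multiplier; only $\eta$, $y_t$, and $f_t$ are taken from the text.

```latex
\begin{align}
  \bar{P} &= \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} P_t\Big],
  &
  \bar{R} &= \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} R_t\Big],
  \qquad \gamma \in [0, 1), \\
  &\min_{\{y_t,\, f_t\}} \; \bar{P}
  \quad \text{subject to} \quad \bar{R} \ge \eta, \\
  &\mathcal{L}(\lambda) = \bar{P} + \lambda\,(\eta - \bar{R}),
  \qquad \lambda \ge 0 .
\end{align}
```

Under this form, minimizing $\mathcal{L}(\lambda)$ for a fixed $\lambda$ is an unconstrained discounted MDP with per-slot cost $P_t - \lambda R_t$, which matches the "Lagrangian cost" language used later for the first-level scheduler.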
LOW COMPLEXITY SOLUTION
Solving (6) and (7) is computationally intractable because the complexity increases exponentially with the number of frames in the frame working sets and with the number of processors $M$. The reason for the exponential growth in the state space (respectively, action space) is that the optimization simultaneously considers the states (respectively, scheduling actions and processor frequencies) of multiple frames on all of the processor cores. However, the only reason these need to be optimized jointly is the processor constraint, which ensures that only one slice is assigned to each processor in each time slot. Motivated by this weak coupling among tasks, we propose a two-level scheduler to approximately solve (6) and (7). The first-level scheduler determines the optimal scheduling actions and processor frequencies for each frame under the (false) assumption that each frame has exclusive access to the $M$ processors. Given the results of the first-level scheduler, the second-level scheduler determines the final slice-to-processor and frequency-to-processor mappings by resolving conflicts in the first-level scheduling decisions.
First-level scheduler
The first-level scheduler computes a value function for every frame in a GOP, which provides a measure of the expected long-term Lagrangian cost under the optimized scheduling policy. Note that this value function only depends on the frame working set, the frame's buffer state $x^v$, and the frame's dependency state $r^v$, and is independent of the buffer and dependency states of the other frames in the working set. Importantly, the frame working set indicates the remaining lifetime of a frame and describes the connections to its parents and children; hence, it has a significant impact on the optimal scheduling and DVFS decisions for the frame. To account for the dependencies among frames, we define the $v$th frame's value so that it includes the values of its children. In this way, frames with many children (e.g. I frames) can account for how their scheduling and frequency decisions will impact the future performance of their children. We describe the first-level scheduler in more detail in the remainder of this section.
Frame-level value iteration
The first-level scheduler performs the frame-level value iteration algorithm illustrated in Table 1 for each frame $v \in V$. Unlike the conventional value iteration algorithm [10], the proposed algorithm has multiple coupled value functions that need to be updated because the value of a frame depends on the values of its children. Due to this coupling, the form of the value function update (lines 5-9 in Table 1) differs from the conventional value iteration algorithm. If it is not possible to make any decisions for a frame in the current traffic state, then we set the frame's value to 0 in that state. Hence, if a frame is not in the frame working set (i.e. $v \notin \mathcal{C}$) or is already fully decoded (i.e. $v \in \mathcal{C}$ and $x^v = 0$), then we set the frame's value to 0 (line 8 in Table 1). If the frame is in the frame working set, still has undecoded slices, and has its dependencies satisfied (i.e. $v \in \mathcal{C}$, … has not been decoded). In other words, the parent frame's value function is coupled with the children's value functions only if the parent frame gets fully decoded.
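As a point of reference, the sketch below runs conventional value iteration for a single frame in isolation; it does not include the coupling with children's values that distinguishes the proposed algorithm. It assumes a toy model that is not taken from the paper: the state is the number of undecoded slices, one processor either idles or attempts one slice at frequency `f`, the attempt succeeds with probability `success_prob[f]`, and the per-slot Lagrangian cost is power minus `lam` times the expected decoding rate. All names are hypothetical.

```python
from typing import Dict, List, Sequence


def value_iteration(
    max_slices: int,
    freqs: Sequence[int],
    power: Dict[int, float],
    success_prob: Dict[int, float],
    lam: float,
    gamma: float = 0.9,
    tol: float = 1e-9,
) -> List[float]:
    """Conventional value iteration on a toy single-frame MDP (assumed model).

    V[x] is the minimum expected discounted Lagrangian cost with x slices
    left. V[0] stays 0, mirroring the rule that a fully decoded frame has
    value 0.
    """
    V = [0.0] * (max_slices + 1)
    while True:
        new_V = V[:]
        delta = 0.0
        for x in range(1, max_slices + 1):
            # Idle action: no cost now, state unchanged.
            best = gamma * V[x]
            for f in freqs:
                p = success_prob[f]
                # Immediate Lagrangian cost: power minus weighted decoding rate.
                cost = power[f] - lam * p
                q = cost + gamma * (p * V[x - 1] + (1.0 - p) * V[x])
                best = min(best, q)
            new_V[x] = best
            delta = max(delta, abs(best - V[x]))
        V = new_V
        if delta < tol:
            return V
```

With `lam = 0` there is no incentive to decode, so idling is optimal and every value is 0; increasing `lam` makes decoding at some frequency worthwhile and the values become negative.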
Decomposing frame-level value iteration
The frame-level value iterations allow us to eliminate the exponential growth of the state space with respect to the number of frames in the frame working set, but we still have to address the fact that the optimization in (8) … value iterations), each corresponding to a local scheduling problem on a single processor. These $M$ sub-value iterations can be performed iteratively, using the output of the $j$th processor's sub-value iteration as the input to the $(j-1)$st processor's sub-value iteration. Importantly, decomposing the monolithic update into $M$ sub-value iterations significantly reduces the computational complexity of the update. Due to space limitations, we refer the interested reader to [14] for a derivation of the sub-value iterations.
Sub-value iteration at processor $M$: …
The $M$th processor's sub-value iteration estimates the value of being in traffic state $\mathcal{T} = (\mathcal{C}, x, r)$ under the assumption that only processor $M$ exists in the current time slot, while all processors exist thereafter. This value is calculated as the sum of (i) the immediate cost incurred by processor $M$ for processing slices belonging to frame $v$, (ii) the expected discounted future value of frame $v$, and (iii) the expected discounted future value of frame $v$'s children. The output of the $M$th processor's sub-value iteration is used as input to the $(M-1)$st processor's sub-value iteration.
Sub-value iteration at processors $j \in \{2, \ldots, M-1\}$: …
The $j$th processor's sub-value iteration estimates the value of being in traffic state $\mathcal{T} = (\mathcal{C}, x, r)$ under the assumption that only processors $j, \ldots, M$ exist in the current time slot, while all processors exist thereafter. This value is calculated as the sum of the immediate cost incurred by processor $j$ and an expectation over the value calculated by the $(j+1)$st processor's sub-value iteration. The output of the $j$th processor's sub-value iteration is used as input to the $(j-1)$st processor's sub-value iteration.
Sub-value iteration at processor 1: …
The output of the first processor's sub-value iteration includes (i) the immediate power costs incurred by all processors, (ii) the slice decoding rate of all processors, (iii) the expected discounted future value of frame $v$, and (iv) the expected discounted future value of frame $v$'s children. … and frequencies $f^{jv} \in \mathcal{F}$ for each processor $j \in \{1, \ldots, M\}$. Therefore, using the proposed decomposition significantly reduces the optimization complexity.
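The chaining of per-processor minimizations can be sketched as a backward recursion from processor $M$ down to processor 1. The code below is an illustration under assumed simplifications, not a reconstruction of (9), (10), and (11): it tracks only one frame's remaining slices, treats `future_value[x]` as a given stand-in for the discounted future value (including children), assumes each processor's decoding outcome is known before the next stage is evaluated, and uses hypothetical names throughout.

```python
from typing import Dict, List, Sequence


def chained_sub_value_iterations(
    num_processors: int,
    freqs: Sequence[int],
    power: Dict[int, float],
    success_prob: Dict[int, float],
    lam: float,
    future_value: Sequence[float],
) -> List[float]:
    """Return stage-1 values W_1[x] for x = 0..len(future_value)-1 slices.

    Stage j minimizes over processor j's own action only (idle, or decode
    one slice at some frequency) and takes stage j+1's output as its
    continuation value. Stage M's continuation is `future_value`.
    """
    x_max = len(future_value) - 1
    w_next = list(future_value)
    for _j in range(num_processors, 0, -1):
        w_j = [0.0] * (x_max + 1)
        for x in range(x_max + 1):
            best = w_next[x]  # processor idles
            if x >= 1:  # buffer constraint: need a slice to schedule
                for f in freqs:
                    p = success_prob[f]
                    cost = power[f] - lam * p
                    q = cost + p * w_next[x - 1] + (1.0 - p) * w_next[x]
                    best = min(best, q)
            w_j[x] = best
        w_next = w_j  # output of stage j feeds stage j-1
    return w_next
```

Under these assumptions, with $|\mathcal{F}|$ frequencies plus an idle option per processor, a joint minimization would enumerate $(|\mathcal{F}|+1)^M$ action combinations per state, whereas the chain performs $M$ minimizations over $|\mathcal{F}|+1$ options each.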
Finally, at run-time, we determine the approximately optimal actions $y^{jv*}$ and $f^{jv*}$ to take in each state $\mathcal{T} = (\mathcal{C}, x, r)$ by taking the arguments that minimize the right-hand sides of (9), (10), and (11). For complete details on the action selection procedure, we refer the interested reader to [14].
Second-level scheduler
Given the optimal actions calculated by the first-level scheduler, it is likely that slices belonging to different frames in the frame working set will want to be scheduled on the same processor in the same time slot, thereby violating the processor constraint in (6). To avoid this problem, the second-level scheduler determines the final slice-to-processor and frequency-to-processor mappings using an Earliest Deadline First (EDF) policy.
Specifically, processor $j$ is allocated to the frame with the earliest deadline of all of the frames scheduled on processor $j$ (with ties broken randomly). Finally, if a slice finishes decoding before the first-level scheduler's time quantum is up, then the second-level scheduler will start decoding another slice during the "slack" time, which is the time between the moment the originally scheduled slice finished decoding and the beginning of the next time quantum.
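The EDF conflict resolution can be sketched in a few lines. This is a minimal illustration, assuming the first-level scheduler's output is available as a per-processor list of (frame, frequency) requests and that each frame has a single numeric deadline; the function name and data layout are hypothetical, slack-time reuse is not modeled, and ties are broken by frame identifier here rather than randomly so that the result is reproducible.

```python
from typing import Dict, List, Tuple


def resolve_conflicts(
    requests: Dict[int, List[Tuple[str, int]]],
    deadlines: Dict[str, int],
) -> Dict[int, Tuple[str, int]]:
    """Pick one (frame, frequency) request per processor by earliest deadline.

    requests[j] lists the (frame, frequency) pairs that the per-frame
    policies want to run on processor j in the current time slot.
    Processors with no requests are left out of the result.
    """
    final: Dict[int, Tuple[str, int]] = {}
    for proc, wanted in requests.items():
        if not wanted:
            continue
        # Earliest deadline wins; the frame id is a deterministic tie-breaker.
        final[proc] = min(wanted, key=lambda fv: (deadlines[fv[0]], fv[0]))
    return final
```

For instance, if frames `B1` (deadline 5) and `P2` (deadline 3) both request processor 1, the slice from `P2` is mapped to it at the frequency that `P2`'s policy requested.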
EXPERIMENTS
To validate our optimized multi-core scheduling approach in Matlab, we use accurate profiling/statistics generated from an H.264/AVC decoder executed on a sophisticated cycle-accurate and bus signal-accurate MPARM simulator [11] . We implemented the two-level scheduling algorithm proposed in Section 3 in Matlab. This algorithm, together with slice-level data traces recorded from MPARM, allowed us to determine scheduling and DVFS policies for the Silent and Foreman sequences (CIF resolution, 30 frames per second, 8 slices per frame) with an IBPB GOP structure. The relevant parameters used in our experiments are given in Table 2 .
In Fig. 2 , we compare our proposed algorithm to the Optimum Minimum-Energy Multicore Scheduling algorithm (OPT-MEMS [2] ), and to a modification of our algorithm where we require all processors to operate at the same frequency (i.e., coordinated DVFS). We note that OPT-MEMS supports both DPM and coordinated DVFS; however, we only compare against the DVFS part to achieve a fair comparison. (Although DPM can be integrated into our proposed solution, we omitted it here to simplify the exposition.)
OPT-MEMS uses a frame's worst-case execution complexity and its deadline to determine a DVFS schedule that multiplexes between two frequencies in time in order to execute exactly the worst-case number of cycles before the task's deadline. There are four important limitations of OPT-MEMS. First, OPT-MEMS does not consider characteristics and requirements of future tasks (e.g. deadlines, complexities, dependencies) when deciding the DVFS schedule for the current task. Second, OPT-MEMS does not provide a scheduling technique to allocate tasks to processor cores; instead, it assumes that each task is perfectly divisible among an arbitrary number of cores. This corresponds to the case of perfect load balancing, which can only be achieved in practice if the number of slices per frame is exactly the number of cores, and each slice has exactly the same decoding complexity. Third, OPT-MEMS does not provide a mechanism for scheduling slices belonging to different frames at the same time. This leads to some inefficiency because fully parallelized decoding (which accounts for frame dependencies) is not possible. Fourth, OPT-MEMS uses coordinated DVFS. This leads to inefficiency in practice because tasks are not the same size and therefore cannot be perfectly load balanced with a single frequency for all cores.
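The idea of multiplexing two frequencies to execute exactly a given number of cycles by a deadline can be illustrated with a short calculation. The sketch below is a generic construction, not OPT-MEMS itself: it assumes the two frequencies are the available levels immediately below and above the required average speed, and the function name and argument units (cycles, seconds, Hz) are illustrative choices.

```python
from typing import List, Sequence, Tuple


def two_frequency_schedule(
    cycles: float, deadline: float, freqs: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return [(frequency_hz, duration_s), ...] that runs `cycles` by `deadline`.

    Splits the interval between the two available frequencies that bracket
    the required average speed cycles / deadline.
    """
    if cycles <= 0 or deadline <= 0:
        raise ValueError("cycles and deadline must be positive")
    levels = sorted(freqs)
    required = cycles / deadline
    if required > levels[-1]:
        raise ValueError("deadline cannot be met even at the highest frequency")
    if required <= levels[0]:
        # The lowest level already finishes early; the rest of the slot is idle.
        return [(levels[0], cycles / levels[0])]
    f_hi = min(f for f in levels if f >= required)
    f_lo = max(f for f in levels if f <= required)
    if f_hi == f_lo:
        return [(f_lo, deadline)]
    # Solve f_lo * t_lo + f_hi * t_hi = cycles with t_lo + t_hi = deadline.
    t_hi = (cycles - f_lo * deadline) / (f_hi - f_lo)
    return [(f_lo, deadline - t_hi), (f_hi, t_hi)]
```

As a usage example, 3 million cycles with a 10 ms deadline and levels of 200 MHz and 400 MHz require an average of 300 MHz, so the schedule spends 5 ms at each level.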
As illustrated in Fig. 2(a) and Fig. 2(b), for M = 1 or 2 processors, all algorithms achieve approximately the same frame rate and power consumption for a given sequence. This is because, even at the highest operating frequency, there are not enough resources to decode all frames. For M = 4 or 8 processors, Fig. 2(a) and Fig. 2(b) show that all algorithms achieve the full frame rate (or very close to it); however, Fig. 2(c) and Fig. 2(d) show that the proposed algorithm achieves lower overall power consumption. For M = 4 cores, the proposed algorithm reduces power by approximately 24% for Foreman and 36% for Silent, relative to OPT-MEMS. The improvements are more modest for M = 8 cores because each core runs at a much lower operating frequency than with M = 4 cores, so there is less opportunity to reduce power consumption.
CONCLUSION
We propose a Markov decision process based on-line scheduling algorithm for slice-parallel video decoders on multicore systems. To mitigate the complexity of computing the optimal on-line scheduling and DVFS policy, we propose a novel two-level scheduler. The first-level scheduler determines scheduling and DVFS policies independently for each frame, and the second-level scheduler decides the final frame-to-processor and frequency-to-processor mappings at run-time. We validate the proposed algorithm in Matlab using accurate video decoder trace statistics generated from an H.264/AVC decoder that we implemented on a cycle-accurate MPARM simulator.
