Markov Decision Process Based Energy-Efficient Scheduling For Slice-Parallel Video Decoding by Mastronarde, Nicholas et al.
MARKOV DECISION PROCESS BASED ENERGY-EFFICIENT SCHEDULING 
FOR SLICE-PARALLEL VIDEO DECODING  
 
Nicholas Mastronarde*, Karim Kanoun†, David Atienza†, and Mihaela van der Schaar‡ 
 
* Department of EE, University at Buffalo, † Institute of EE, EPFL, ‡ Department of EE, UCLA 
 
 
ABSTRACT 
We consider the problem of energy-efficient scheduling for 
slice-parallel video decoders on multicore systems with Dynamic 
Voltage Frequency Scaling (DVFS) enabled processors. We 
rigorously formulate the problem as a Markov decision process 
(MDP), which simultaneously considers the on-line scheduling 
and per-core DVFS capabilities; the power consumption of the 
processor cores and caches; and the loss tolerant and dynamic 
nature of the video decoder. The objective is to minimize long-
term power consumption subject to a minimum Quality of Service 
(QoS) constraint related to the decoder’s throughput. We evaluate 
the proposed scheduling algorithm using traces generated from a 
cycle-accurate multiprocessor ARM simulator. 
Index Terms— Slice-parallel video decoding, multicore 
scheduling, multicore power management, dynamic voltage 
scaling, Markov decision process. 
1. INTRODUCTION 
Despite improvements in mobile device technology, energy-
efficient multicore scheduling for video decoding remains a 
challenging problem for several reasons. First, video decoding 
applications have intense and time-varying workloads, which have 
worst-case execution times that are significantly larger than the 
average case. Second, they have sophisticated dependency 
structures due to predictive coding. These dependency structures, 
which can be modeled as directed acyclic graphs (DAGs), not 
only result in different frames having different priorities, but also 
make it difficult to balance loads across the cores, which is 
important for energy efficiency [1]. Finally, they often have 
stringent delay constraints, but are considered soft real-time 
applications. In other words, video frames should meet their 
deadlines, but when they do not, the application quality (e.g. 
decoded video frame rate) is reduced. 
During the last decade, many energy-efficient multicore 
scheduling algorithms that exploit Dynamic Voltage Frequency 
Scaling (DVFS [7]) and/or Dynamic Power Management (DPM 
[12]) have been proposed, e.g. [2][3][4][6][8]. The Largest Task 
First with Dynamic Power Management (LTF-DPM) algorithm in 
[3] assumes that frame decoding deadlines are equally spaced in 
time, and therefore does not support video group of pictures 
(GOP) structures with B frames; moreover, LTF-DPM will 
typically have looser deadline constraints than our proposed 
algorithm because it assigns groups of frames a common “weak” 
deadline. The Stochastic Scheduling2D algorithm [6] considers a 
periodic DAG application model that requires a “source” and 
“sink” node in each period, making the algorithm incompatible 
with GOP structures where the last B frame in a GOP depends on 
the I frame in the next GOP (e.g. an IBPB GOP). The Variation 
Aware Time Budgeting (Var-TB) algorithm in [8] uses a 
functional partitioning algorithm for parallelizing the video 
decoder (e.g. pipelining decoder sub-functions such as inverse 
DCT and motion compensation on different cores). Functional 
partitioning is known to be suboptimal [13] and parallelization 
approaches based on data partitioning (e.g. mapping different 
frames, slices, or macroblocks to different processors) are superior 
[13]. The so-called SpringS algorithm in [4] uses a task-level 
software pipelining algorithm called RDAG [5] to transform a 
periodic dependent task graph (expressed as a DAG) into a set of 
tasks that can be pipelined on parallel processors. However, if this 
technique is applied to video decoding applications, it will require 
retiming delays proportional to the GOP size, which may be large.  
There is no solution that simultaneously considers per-core 
DVFS capabilities; dynamic processor assignment; and loss-
tolerant tasks with different complexity distributions, DAG 
dependency structures, and stringent, but soft real-time, 
constraints. The contributions of this paper are as follows: 
• We rigorously formulate the multi-core scheduling problem 
using a Markov decision process (MDP) that considers the 
abovementioned properties. The MDP enables the system to 
optimally trade off long-term power and performance. 
• The MDP solution requires complexity that exponentially 
increases with both the number of processors and the number of 
frames in a short look-ahead window. To mitigate this 
complexity, we propose a novel two-level scheduler. The first-
level determines scheduling and DVFS policies for each frame 
using frame-level MDPs, which account for the coupling between 
the optimal policies of parent and children frames. The second-
level decides the final frame- and frequency-to-processor 
mappings, ensuring that certain system constraints are satisfied.  
• We validate the proposed algorithm in Matlab using video 
decoder trace statistics generated from an H.264/AVC decoder 
that we implemented on a cycle-accurate multiprocessor ARM 
(MPARM) simulator [11]. 
The remainder of the paper is organized as follows. We 
introduce the system and application models in Section 2 and 
formulate the on-line multi-core scheduling problem as an MDP. 
In Section 3, we propose a lower complexity solution by 
approximating the original MDP problem with a two-level 
scheduler. In Section 4, we present our experimental results. We 
conclude in Section 5. 
2. PROBLEM FORMULATION 
We consider the problem of energy-efficient slice-parallel 
video decoding in a time-slotted multicore system, where time is 
divided into slots of (equal) duration $\Delta t$ seconds indexed by 
$t \in \mathbb{N}$. We assume that there are $M$ processors, which we 
index by $j \in \{1, \ldots, M\}$. In Section 2.1, we describe seven important 
video data attributes. In Section 2.2, we propose a sophisticated 
Markovian traffic/workload model that accounts for the video data 
attributes introduced in Section 2.1. In Sections 2.3, 2.4, and 2.5 
we describe the scheduling and frequency actions, the evolution of 
the video traffic/workload, and the power and Quality of Service 
(QoS) metrics used in our optimization. In Section 2.6, we 
formulate the multicore scheduling problem as an MDP. 
2.1. Video data attributes 
We model the encoded video bitstream as a sequence of 
compressed data units. We assume that a data unit corresponds to 
one video slice, which is a subset of a video frame that can be 
decoded independently of other slices within the same frame [9]. 
We assume that the video is encoded using a fixed, periodic, GOP 
structure that contains $K$ frames and lasts a period of $T$ time 
slots of duration $\Delta t$. The set of frames within GOP $g \in \mathbb{N}$ is 
denoted by $\mathcal{V}^g \triangleq \{v_1^g, v_2^g, \ldots, v_K^g\}$ and the set of all frames is 
denoted by $\mathcal{V} \triangleq \cup_{g \in \mathbb{N}} \mathcal{V}^g$. Each frame $v_k^g$ is characterized by 
seven attributes: 
1. Type: Frame $v_k^g$ is an I, P, or B frame. We denote the operator 
extracting the frame type by $\mathrm{type}(v_k^g)$. 
2. Number of slices: Frame $v_k^g$ is composed of 
$l^{v_k^g} \in \{1, \ldots, l^{\max}\}$ slices, where $l^{v_k^g}$ is assumed to be 
fixed for all frames and is determined by the encoder [9].  
3. Decoding complexity: Slices belonging to frame $v_k^g$ have 
decoding complexity $w^{v_k^g}$ cycles. We assume that $w^{v_k^g}$ is an 
i.i.d. random variable conditioned on the frame type. 
4. Arrival time: $t^{v_k^g}$ denotes the earliest time slot $v_k^g$ can be 
decoded (i.e., its arrival time at the scheduler). 
5. Display deadline: $d^{v_k^g,\mathrm{disp}}$ denotes the final time slot in which 
$v_k^g$ must be decoded so it can be displayed. 
6. Decoding deadline: $d^{v_k^g,\mathrm{dec}} \le d^{v_k^g,\mathrm{disp}}$ denotes the final time slot 
in which $v_k^g$ must be decoded so that frames that depend on it 
can be decoded before their display deadline.  
7. Dependency: The frames must be decoded in decoding order, 
which is dictated by the dependencies introduced by predictive 
coding (e.g., motion-compensation). In general, the 
dependencies among frames can be described by a DAG, 
denoted by $\mathrm{DAG} \triangleq \langle \mathcal{V}, \mathcal{E} \rangle$, with the nodes in $\mathcal{V}$ 
representing frames and the edges in $\mathcal{E}$ representing the 
dependencies among frames. We use the notation 
$v_{k'}^{g'} \prec v_k^g$ to indicate that frame $v_k^g$ depends on frame 
$v_{k'}^{g'}$ (i.e., there exists a path directed from $v_{k'}^{g'}$ to $v_k^g$) 
and therefore $v_k^g$ cannot be decoded until $v_{k'}^{g'}$ is decoded. 
We write $(v_{k'}^{g'}, v_k^g) \in \mathcal{E}$ if there is a directed arc emanating 
from frame $v_{k'}^{g'}$ and terminating at frame $v_k^g$, indicating that 
$v_{k'}^{g'}$ is an immediate parent of $v_k^g$. 
These attributes determine which slices can be decoded, how 
long they will take to decode, and when they need to be decoded. In 
the next subsection, we propose a Markovian traffic model that 
captures the above attributes, enabling us to rigorously formulate 
the multicore scheduling problem as an MDP.  
2.2. Markovian traffic model 
We define a traffic state $\mathcal{T}_t = (\mathcal{C}_t, \mathbf{x}_t, \mathbf{r}_t)$ to represent the 
video data that can potentially be decoded in time slot $t$. This 
traffic state comprises the frame working set $\mathcal{C}_t \subset \mathcal{V}$, the buffer 
state $\mathbf{x}_t$, and the dependency state $\mathbf{r}_t$. 
In time slot $t$, we assume that the set of frames whose 
deadlines are within the scheduling time window (STW) 
$[t, t+W]$ can be decoded. We define the frame working set as 
all the frames within the STW, i.e., 
$\mathcal{C}_t = \{v \in \mathcal{V} \mid d^{v,\mathrm{disp}} \in \{t, t+1, \ldots, t+W\}\}$. Because the 
GOP structure is fixed and periodic, $\mathcal{C}_t$ is periodic with some 
period $T$. A frame’s arrival time $t^v$ (respectively, display 
deadline $d^{v,\mathrm{disp}}$) is the first (respectively, last) time slot in which 
it appears in the frame working set, and a frame’s decoding 
deadline $d^{v,\mathrm{dec}}$ is the minimum display deadline of its children.  
Note that the distinction between display and decoding deadlines 
is important because, even if a frame’s decoding deadline is 
missed, which renders its children undecodable, it is still possible 
to decode the frame itself before its display deadline. Fig. 1 
illustrates the STW concept for a simple IBPB GOP structure. 
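As a minimal sketch of the frame working set definition, the following Python snippet collects the frames whose display deadlines fall inside the STW. The deadline layout (frame k of GOP g is displayed in slot g*T + offsets[k-1]), the window length W = 2, and the function names are illustrative assumptions; they are not taken from Fig. 1 or from the experiments.

```python
def working_set(t, W, T=4, offsets=(0, 1, 2, 3)):
    """Return frames (g, k) whose display deadline lies in [t, t + W].

    Frame k of GOP g is assumed to have display deadline
    g * T + offsets[k - 1]; this layout is an illustrative assumption.
    """
    frames = []
    g_min = max(0, (t - max(offsets)) // T)  # earliest GOP that can overlap
    g_max = (t + W) // T                     # latest GOP that can overlap
    for g in range(g_min, g_max + 1):
        for k, off in enumerate(offsets, start=1):
            if t <= g * T + off <= t + W:
                frames.append((g, k))
    return frames


def relative(frames, t, T=4):
    """Re-index GOPs relative to slot t so that periodicity is visible."""
    return [(g - t // T, k) for g, k in frames]
```

Under these assumptions, `relative(working_set(t + 4, 2), t + 4)` equals `relative(working_set(t, 2), t)` for every slot, which is the periodicity of the frame working set noted above.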
We define the buffer state $\mathbf{x}_t = (x_t^v \mid v \in \mathcal{C}_t)$, where $x_t^v$ 
denotes the number of slices of frame $v$ awaiting decoding at 
time $t$. Finally, the dependency state $\mathbf{r}_t \triangleq (r_t^v \mid v \in \mathcal{C}_t)$ defines 
whether or not each frame in the frame working set is decodable 
in time slot $t$. In particular, $r_t^v$ is a binary variable that takes 
value 1 if all of frame $v$’s dependencies are satisfied, i.e., if 
$x_t^u = 0$ for all $u \prec v$, and takes value 0 otherwise.  
2.3. Scheduling actions and frequencies 
Let $y_t^{jv} \in \mathcal{Y} = \{0,1\}$ denote the number of slices belonging 
to frame $v$ that are scheduled on processor $j$ at time $t$. For 
notational convenience, we define $\mathbf{y}_t^j = (y_t^{jv} \mid v \in \mathcal{C}_t)$. There 
are three important constraints on the scheduling actions $y_t^{jv}$ for 
all $j \in \{1, \ldots, M\}$ and $v \in \mathcal{C}_t$: 
 
Fig. 1. Illustrative DAG dependencies for an IBPB GOP structure that contains $K = 4$ frames and lasts a period of $T = 4$ time 
slots of duration $\Delta t = 1/30$ seconds. [Figure not reproduced. Node labels: $v_1^g$, $v_2^g$, $v_3^g$, $v_4^g$, $v_1^{g+1}$, $v_2^{g+1}$, $v_3^{g+1}$, $v_4^{g+1}$, $v_1^{g+2}$; the marked interval is the STW $[t, t+W]$.] 
• Buffer constraint: $\sum_{j=1}^{M} y_t^{jv} \le x_t^v$. In words, the total 
number of scheduled slices belonging to frame $v$ cannot 
exceed the number of slices in its buffer in time slot $t$. 
• Processor constraint: $\sum_{v \in \mathcal{C}_t} y_t^{jv} \le 1$. In words, no more 
than one slice can be scheduled on processor $j$ in time slot $t$. 
• Dependency constraint: If $r_t^v = 0$, then $\sum_{j=1}^{M} y_t^{jv} = 0$. In 
words, all of the $v$th frame’s dependencies must be satisfied 
before slices belonging to it are scheduled to be decoded. 
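The three constraints can be checked mechanically. The sketch below is one way to do so in Python; the data layout (a list of per-processor dictionaries for the scheduling action, and dictionaries keyed by frame for the buffer and dependency states) and the function name are assumptions made for illustration.

```python
def is_feasible(y, x, r):
    """Check the buffer, processor, and dependency constraints.

    y[j][v] in {0, 1}: slices of frame v scheduled on processor j
                       (missing entries count as 0).
    x[v]: number of slices of frame v awaiting decoding.
    r[v]: 1 if all of frame v's dependencies are satisfied, else 0.
    """
    for v in x:
        total = sum(y_j.get(v, 0) for y_j in y)
        if total > x[v]:               # buffer constraint
            return False
        if r[v] == 0 and total != 0:   # dependency constraint
            return False
    for y_j in y:
        if sum(y_j.values()) > 1:      # processor constraint
            return False
    return True
```

For example, with `x = {"I": 2, "P": 8}` and `r = {"I": 1, "P": 0}`, scheduling one slice of the I frame on each of two processors is feasible, whereas scheduling any slice of the P frame is not, because its parent is still being decoded.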
We assume that each processor can operate at a different 
frequency in each time slot to trade off processing energy and 
delay. Let $\mathbf{f}_t = (f_t^1, f_t^2, \ldots, f_t^M) \in \mathcal{F}^M$ denote the frequency 
vector, where $f_t^j \in \mathcal{F}$ is the speed of the $j$th processor in time 
slot $t$ and $\mathcal{F}$ is the set of available operating frequencies.  
2.4. State evolution and system dynamics 
To fully characterize the video traffic, we need to understand 
how the traffic state $\mathcal{T}_t = (\mathcal{C}_t, \mathbf{x}_t, \mathbf{r}_t)$ evolves over time. The 
transition of the frame working set from $\mathcal{C}_t$ to $\mathcal{C}_{t+1}$ is 
independent of the scheduling action; in fact, it is deterministic 
and periodic for a fixed GOP structure, and therefore the sequence 
$\{\mathcal{C}_t \mid t \in \mathbb{N}\}$ can be modeled as a deterministic Markov chain. 
The transition of the buffer state from $x_t^v$ to $x_{t+1}^v$ depends on 
the scheduling actions and processor frequencies. Let 
$z_t^{jv} = z_t^{jv}(f_t^j, y_t^{jv})$ denote the number of slices belonging to 
frame $v$ that finish decoding on processor $j$ at time $t$ given 
frequency $f_t^j$. Note that $z_t^{jv} \le y_t^{jv}$. Let $p^z(z_t^{jv} \mid f_t^j, y_t^{jv})$ denote 
the probability that $z_t^{jv}$ slices are decoded on processor $j$ in time 
slot $t$ given the frequency $f_t^j$ and scheduling action $y_t^{jv}$. 
Before we can write the buffer recursion governing the 
transition from $x_t^v$ to $x_{t+1}^v$, we need to define a partition of the 
frame working set $\mathcal{C}_{t+1}$. The partition divides $\mathcal{C}_{t+1}$ into two sets: 
a set of frames that persist from time $t$ to $t+1$ because they 
have display deadlines $d^{v,\mathrm{disp}} > t$, i.e., $\mathcal{C}_t \cap \mathcal{C}_{t+1}$; and, a set of 
newly arrived frames with arrival times $t^v = t+1$, i.e., 
$\mathcal{C}_{t+1} \setminus \mathcal{C}_t \triangleq \mathcal{C}_{t+1} - \mathcal{C}_t \cap \mathcal{C}_{t+1}$. Based on this partition, $x_{t+1}^v$ can 
be determined from $x_t^v$ as follows 
 
$$x_{t+1}^v = \begin{cases} x_t^v - \sum_{j=1}^{M} z_t^{jv}, & \text{if } v \in \mathcal{C}_t \cap \mathcal{C}_{t+1}, \\ l^v, & \text{if } v \in \mathcal{C}_{t+1} \setminus \mathcal{C}_t. \end{cases}$$ 
(1) 
The sequence $\{x_t^v \mid t \in \mathbb{N}\}$ can be modeled as a controlled 
Markov chain.  
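The buffer recursion in (1) can be sketched as a short Python function. The dictionary-based data layout and the function name are illustrative assumptions; the two branches follow the two cases of (1).

```python
def next_buffer(x_t, z_t, C_next, slices_per_frame):
    """Apply recursion (1) to every frame in the next working set.

    x_t[v]: slices of frame v awaiting decoding in slot t.
    z_t[j][v]: slices of frame v that finished on processor j in slot t
               (missing entries count as 0).
    C_next: frames in the working set at slot t + 1.
    slices_per_frame[v]: l^v, the number of slices in frame v.
    """
    x_next = {}
    for v in C_next:
        if v in x_t:
            # Persisting frame: remove the slices decoded during slot t.
            decoded = sum(z_j.get(v, 0) for z_j in z_t)
            x_next[v] = x_t[v] - decoded
        else:
            # Newly arrived frame: all of its slices await decoding.
            x_next[v] = slices_per_frame[v]
    return x_next
```

Frames that leave the working set (here, any key of `x_t` that is absent from `C_next`) are simply dropped, reflecting that their display deadlines have passed.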
The transition of the dependency state from $r_t^v$ to $r_{t+1}^v$ follows 
intuitively from the definition of dependency: frame $v$ can be 
decoded in time slot $t+1$ (i.e., $r_{t+1}^v = 1$) if and only if all of its 
parents are completely decoded at the end of time slot $t$. It 
follows that the sequence $\{r_t^v \mid t \in \mathbb{N}\}$ can be modeled as a 
controlled Markov chain.  
2.5. Power cost and slice decoding rate 
The power-frequency function $\rho(f_t^j)$ maps the $j$th 
processor's speed $f_t^j$ to its expected power consumption (watts). 
We also consider the expected power consumed by the 
instruction, data, and L2 cache using a function 
$\sigma(f_t^j, y_t^{jv}, \mathrm{type}(v))$ (watts). Thus, the total expected power 
consumed by processor $j$ (and the associated accesses to the 
various caches) at time $t$ can be written as  
$$P(\mathcal{C}_t, f_t^j, \mathbf{y}_t^j) = \rho(f_t^j) + \sum_{v \in \mathcal{C}_t} \sigma(f_t^j, y_t^{jv}, \mathrm{type}(v))$$ (watts). (2) 
We consider the following QoS metric in each time slot $t$: 
$$Q(f_t^j, y_t^{jv}, \mathrm{type}(v)) = \sum_{z_t^{jv} \le y_t^{jv}} z_t^{jv} \, p^z(z_t^{jv} \mid f_t^j, y_t^{jv}),$$ (3) 
which is simply the expected number of slices belonging to frame 
$v$ that will be decoded on processor $j$ in time slot $t$. We will 
refer to (3) as the slice decoding rate. In the remainder of the 
paper, we will omit the dependence of (2) and (3) on $\mathrm{type}(v)$. 
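To make (2) and (3) concrete, here is a small Python sketch. It assumes that a scheduled slice finishes within the slot exactly when its complexity in cycles does not exceed the cycle budget f·Δt, and it estimates that probability from a list of sampled complexities. This completion model, the function names, and the `rho` and `sigma` callables passed in by the caller are assumptions for illustration; the paper obtains its statistics from MPARM traces.

```python
def decode_prob(f_hz, slot_s, complexities):
    """Empirical P(w <= f * dt): chance that one slice finishes in a slot."""
    budget = f_hz * slot_s  # cycles available in one time slot
    return sum(w <= budget for w in complexities) / len(complexities)


def slice_decoding_rate(f_hz, y, slot_s, complexities):
    """Q(f, y) = sum over z <= y of z * p(z | f, y), for binary y."""
    return y * decode_prob(f_hz, slot_s, complexities)


def processor_power(f_hz, y_j, rho, sigma):
    """P = rho(f) + sum over frames of sigma(f, y^{jv}), as in (2).

    y_j[v]: scheduling action for frame v on this processor.
    rho, sigma: caller-supplied power models (hypothetical here).
    """
    return rho(f_hz) + sum(sigma(f_hz, y) for y in y_j.values())
```

With binary scheduling actions, the sum in (3) has a single nonzero term, so the slice decoding rate reduces to the completion probability whenever a slice is scheduled and to zero otherwise.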
2.6. Markov decision process formulation 
In this subsection, we formulate the problem of energy-efficient 
slice-parallel video decoding on $M$ processors. In each time slot 
$t$, the objective is to determine the scheduling action $y_t^{jv}$, for all 
$j \in \{1, 2, \ldots, M\}$ and $v \in \mathcal{C}_t$, and the frequency vector $\mathbf{f}_t$, in 
order to minimize the long-term power consumption subject to a 
long-term slice decoding rate constraint. The total discounted [12] 
average power consumption and slice decoding rate can be 
expressed as 
$$\bar{P} = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \sum_{j=1}^{M} P(\mathcal{C}_t, f_t^j, \mathbf{y}_t^j) \right],$$ and (4) 
$$\bar{Q} = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \sum_{j=1}^{M} \sum_{v \in \mathcal{C}_t} Q(f_t^j, y_t^{jv}) \right],$$ (5) 
respectively, where $\gamma \in [0,1)$ is the discount factor, and the 
expectation is over the sequence of traffic states $\{\mathcal{T}_t \mid t \in \mathbb{N}\}$. 
Stated more formally, the optimization objective and constraints 
are as follows: 
 
$$\min_{\mathbf{f}_t, [y_t^{jv}]_{j,v}, \forall t \in \mathbb{N}} \bar{P} \quad \text{subject to} \quad \bar{Q} \ge \eta$$ (6) 
where $\eta$ is the slice decoding rate constraint. Note that the 
buffer, processor, and dependency constraints defined in Section 
2.3 must hold in every time slot; however, we will omit them from 
our exposition in the remainder of the paper. 
Equation (6) can be formulated as an unconstrained MDP by 
introducing a Lagrange multiplier $\lambda \in \mathbb{R}_+$ associated with the 
slice decoding rate constraint. For a fixed $\lambda$, in each time slot $t$, 
the unconstrained problem’s objective is to determine the 
frequency vector $\mathbf{f}_t$ and scheduling actions $[y_t^{jv}]_{j,v}$, for all 
processors $j \in \{1, \ldots, M\}$ and all frames $v \in \mathcal{C}_t$, that minimize 
the discounted average Lagrangian cost: i.e., 
$$L^{\lambda} = \min_{\mathbf{f}_t, [y_t^{jv}]_{j,v}, \forall t \in \mathbb{N}} \left\{ \bar{P} + \lambda (\eta - \bar{Q}) \right\}.$$ (7) 
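As an intermediate step that may help in reading the value function updates in Section 3, the Lagrangian cost in (7) can be expanded using only the definitions in (4) and (5) and the linearity of expectation. The bar notation for the discounted averages in (4) and (5) is assumed here.

```latex
\bar{P} + \lambda(\eta - \bar{Q})
  = \lambda\eta
  + \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} \sum_{j=1}^{M}
      \left( P(\mathcal{C}_t, f_t^j, \mathbf{y}_t^j)
      - \lambda \sum_{v \in \mathcal{C}_t} Q(f_t^j, y_t^{jv}) \right) \right]
```

Because the term $\lambda\eta$ does not depend on the actions, minimizing (7) for a fixed $\lambda$ amounts to minimizing the discounted sum of per-processor costs of the form power minus $\lambda$ times slice decoding rate.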
3. LOW COMPLEXITY SOLUTION 
Solving (6) and (7) is a computationally intractable problem 
because their complexity increases exponentially with the number 
of frames in the frame working sets and with the number of 
processors M . The reason for the exponential growth in the state 
space (respectively, action space) is that the optimization 
simultaneously considers the states (respectively, scheduling 
actions and processor frequencies) of multiple frames on all of the 
processor cores. However, the only reason these need to be 
optimized jointly is the processor constraint, which ensures that 
only one slice is assigned to each processor in each time slot. 
Motivated by this weak coupling among tasks, we propose a two-
level scheduler to approximately solve (6) and (7): The first-level 
scheduler determines the optimal scheduling actions and 
processor frequencies for each frame under the (false) assumption 
that each frame has exclusive access to the M processors. Given 
the results of the first-level scheduler, the second-level scheduler 
determines the final slice- and frequency-to-processor mappings 
by resolving conflicts in the first-level scheduling decisions.  
3.1. First-level scheduler 
The first-level scheduler computes a value function 
$V^v(\mathcal{C}, x^v, r^v)$ for every frame in a GOP, which provides a 
measure of the expected long-term Lagrangian cost under the 
optimized scheduling policy. Note that this value function only 
depends on the frame working set, the frame’s buffer state $x^v$, 
and the frame’s dependency state $r^v$ and is independent of the 
buffer and dependency states of the other frames in the working 
set. Importantly, the frame working set indicates the remaining 
lifetime of a frame and describes the connections to its parents 
and children; hence, it has a significant impact on the optimal 
scheduling and DVFS decisions for the frame. To account for the 
dependencies among frames, we define the $v$th frame’s value 
function $V^v(\mathcal{C}, x^v, r^v)$ so that it includes the values of its children. 
In this way, frames with many children (e.g. I frames) can account 
for how their scheduling and frequency decisions will impact the 
future performance of their children. We describe the first-level 
scheduler in more detail in the remainder of this section. 
3.1.1. Frame-level value iteration 
The first-level scheduler performs the frame-level value 
iteration algorithm illustrated in Table 1 to compute the optimal 
value functions $\{V^{v,*} : v \in \mathcal{V}^g\}$. Unlike the conventional value 
iteration algorithm [10], the proposed algorithm has multiple 
coupled value functions that need to be updated because the value 
of a frame depends on the values of its children. Due to this 
coupling, the form of the value function update (lines 5-9 in Table 
1) differs from the conventional value iteration algorithm. 
If it is not possible to make any decisions for a frame in the 
current traffic state, then we set the frame’s value to 0 in that 
state. Hence, if a frame is not in the frame working set (i.e., 
$v \notin \mathcal{C}$), does not have its dependencies satisfied (i.e., $r^v = 0$), 
or is already fully decoded (i.e., $v \in \mathcal{C}$ and $x^v = 0$), then we set 
the frame’s value to 0 (line 8 in Table 1). If the frame is in the 
frame working set, still has undecoded slices, and has its 
dependencies satisfied (i.e., $v \in \mathcal{C}$, $x^v > 0$, and $r^v = 1$), then 
the value function update comprises four distinct terms: the power 
consumed by each processor in the current state; the expected 
slice decoding rate on each processor in the current state; the 
expected future value of frame $v$; and the sum of the expected 
future values of the $v$th frame’s children. Note that the expected 
future value of frame $v$ is 0 if $v \notin \mathcal{C}'$; and, the expected future 
values of the child frames are 0 if $r^{u\prime}$ is not 1 (i.e., if the parent 
has not been decoded). In other words, the parent frame’s value 
function is coupled with the children’s value functions only if the 
parent frame gets fully decoded. 
3.1.2. Decomposing frame-level value iteration  
Table 1. Frame-level value iteration algorithm performed by the first-level scheduler. 
1. Initialize: $V_0^{v,\lambda}(\mathcal{C}, x^v, r^v) = 0$ for all $v \in \mathcal{V}^g$, $\mathcal{C}$, $x^v \in \{0, \ldots, l^v\}$, and $r^v \in \{0,1\}$ 
2. Repeat 
3.   $\Delta \leftarrow 0$ 
4.   For each $v \in \mathcal{V}^g$, $\mathcal{C}$, $x^v \in \{0, \ldots, l^v\}$, and $r^v \in \{0,1\}$ 
5.     If $v \in \mathcal{C}$, $x^v > 0$, and $r^v = 1$ (frame $v$ is in the frame working set, has undecoded slices, and has its dependencies satisfied) 
6.       $$V_{n+1}^{v,\lambda}(\mathcal{C}, x^v, r^v) = \min_{\mathbf{f}^{1:M,v}, \mathbf{y}^{1:M,v}} \left\{ \sum_{j=1}^{M} \left[ \rho(f^{jv}) + \sigma(f^{jv}, y^{jv}) - \lambda Q(f^{jv}, y^{jv}) \right] + \gamma \sum_{\mathbf{z}^{1:M,v} \le \mathbf{y}^{1:M,v}} \prod_{j=1}^{M} p^z(z^{jv} \mid f^{jv}, y^{jv}) \left[ V_n^{v,\lambda}\left(\mathcal{C}', x^v - \sum_{j=1}^{M} z^{jv}, r^{v\prime}\right) + \sum_{u \in \mathcal{C}' : v \prec u, \, r^{u\prime} = 1} V_n^{u,\lambda}(\mathcal{C}', l^u, r^{u\prime}) \right] \right\}$$ (8) 
7.     Else 
8.       $V_{n+1}^{v,\lambda}(\mathcal{C}, x^v, r^v) = 0$ 
9.     End 
10.  End 
11.  $\Delta \leftarrow \max\{\Delta, |V_{n+1}^{v,\lambda}(\mathcal{C}, x^v, r^v) - V_n^{v,\lambda}(\mathcal{C}, x^v, r^v)|\}$ and $n \leftarrow n + 1$ 
12. Until $\Delta < \varepsilon$ (a small positive number) 
13. Output: $\{V^{v,*} : v \in \mathcal{V}^g\}$ 

The frame-level value iterations allow us to eliminate the 
exponential growth of the state space with respect to the number 
of frames in the frame working set, but we still have to address the 
fact that the optimization in (8) (line 6 of Table 1) requires a 
search over an exponential number of scheduling and frequency 
actions. In this subsection, we discuss how to decompose the 
monolithic update defined in (8) into $M$ stages (hereafter, 
sub-value iterations), each corresponding to a local scheduling 
problem on a single processor. These $M$ sub-value iterations can 
be performed iteratively, using the output of the $j$th processor’s 
sub-value iteration as the input to the $(j-1)$st processor’s 
sub-value iteration. Importantly, decomposing the monolithic update 
into $M$ sub-value iterations significantly reduces the 
computational complexity of the update. Due to space limitations, 
we refer the interested reader to [14] for a derivation of the 
sub-value iterations. 
Sub-value iteration at processor $M$: 
$$V_n^{M,v,\lambda}(\mathcal{C}, x^v, r^v) = \min_{f^{Mv}, y^{Mv}} \left\{ \rho(f^{Mv}) + \sigma(f^{Mv}, y^{Mv}) - \lambda Q(f^{Mv}, y^{Mv}) + \gamma \, \mathbb{E}_{z^{Mv} \mid f^{Mv}, y^{Mv}} \left[ V_n^{v,\lambda}(\mathcal{C}', x^v - z^{Mv}, r^{v\prime}) + \sum_{u \in \mathcal{C}' : v \prec u, \, r^{u\prime} = 1} V_n^{u,\lambda}(\mathcal{C}', l^u, r^{u\prime}) \right] \right\}$$ (9) 
The $M$th processor’s sub-value iteration estimates the value of 
being in traffic state $\mathcal{T}^v = (\mathcal{C}, x^v, r^v)$ under the assumption that 
only processor $M$ exists in the current time slot, while all 
processors exist thereafter. This value is calculated as the sum of 
(i) the immediate cost incurred by processor $M$ for processing 
slices belonging to frame $v$, (ii) the expected discounted future 
value of frame $v$, and (iii) the expected discounted future value 
of frame $v$’s children. The output of the $M$th processor’s 
sub-value iteration is used as input to the $(M-1)$st processor’s 
sub-value iteration.  
Sub-value iteration at processors $j \in \{2, \ldots, M-1\}$: 
$$V_n^{j,v,\lambda}(\mathcal{C}, x^v, r^v) = \min_{f^{jv}, y^{jv}} \left\{ \rho(f^{jv}) + \sigma(f^{jv}, y^{jv}) - \lambda Q(f^{jv}, y^{jv}) + \mathbb{E}_{z^{jv} \mid f^{jv}, y^{jv}} \left[ V_n^{j+1,v,\lambda}(\mathcal{C}, x^v - z^{jv}, r^v) \right] \right\}.$$ (10) 
The $j$th processor’s sub-value iteration estimates the value of 
being in traffic state $\mathcal{T}^v = (\mathcal{C}, x^v, r^v)$ under the assumption that 
only processors $j, \ldots, M$ exist in the current time slot, while all 
processors exist thereafter. This value is calculated as the sum of 
the immediate cost incurred by processor $j$ and an expectation 
over the value calculated by the $(j+1)$st processor’s sub-value 
iteration. The output of the $j$th processor’s sub-value iteration is 
used as input to the $(j-1)$st processor’s sub-value iteration.  
Sub-value iteration at processor 1: 
$$V_{n+1}^{v,\lambda}(\mathcal{C}, x^v, r^v) = \min_{f^{1v}, y^{1v}} \left\{ \rho(f^{1v}) + \sigma(f^{1v}, y^{1v}) - \lambda Q(f^{1v}, y^{1v}) + \mathbb{E}_{z^{1v} \mid f^{1v}, y^{1v}} \left[ V_n^{2,v,\lambda}(\mathcal{C}, x^v - z^{1v}, r^v) \right] \right\}.$$ (11) 
The output of the first processor’s sub-value iteration includes (i) 
the immediate power costs incurred by all processors, (ii) the slice 
decoding rate of all processors, (iii) the expected discounted 
future value of frame $v$, and (iv) the expected future discounted 
value of frame $v$’s children. $V_{n+1}^{v,\lambda}$ is used as input to the $M$th 
processor’s sub-value iteration during iteration $n+1$.  
Performing the $M$ sub-value iterations for frame $v$ in a single 
traffic state $\mathcal{T}^v = (\mathcal{C}, x^v, r^v)$ only requires a search over the 
(scalar) scheduling actions $y^{jv} \in \{0,1\}$ and frequencies 
$f^{jv} \in \mathcal{F}$ for each processor $j \in \{1, \ldots, M\}$. Therefore, using 
the proposed decomposition significantly reduces the optimization 
complexity.  
Finally, at run-time, we determine the approximately optimal 
actions $(f^{jv,*}, y^{jv,*})$ to take in each state $\mathcal{T}^v = (\mathcal{C}, x^v, r^v)$ by 
taking the arguments that minimize the right-hand sides of (9), 
(10), and (11). For complete details on the action selection 
procedure, we refer the interested reader to [14]. 
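The chained structure of the sub-value iterations can be sketched in Python for a single frame. This is a simplified illustration rather than the authors' implementation: the working set and dependency state are held fixed, the future values of the frame and its children are assumed to be pre-combined into one array `v_next` indexed by the buffer state, and the power models `rho` and `sigma` and the completion probability `p_done` are hypothetical. Only the frequency set is taken from Table 2.

```python
FREQS_HZ = (125e6, 166e6, 250e6, 500e6)  # frequency set from Table 2
F_MAX = max(FREQS_HZ)


def rho(f):
    """Hypothetical power-frequency curve in watts (cubic in f)."""
    return 0.5 * (f / F_MAX) ** 3


def sigma(f, y):
    """Hypothetical cache power in watts, paid only when a slice runs."""
    return 0.1 * y * (f / F_MAX)


def p_done(f):
    """Hypothetical probability that a scheduled slice finishes in the slot."""
    return f / F_MAX


def sub_value_chain(v_next, num_procs, lam, gamma):
    """Run the sub-value iterations for processors M, M-1, ..., 1.

    v_next[x]: next-slot value of holding x undecoded slices (frame v
    plus its children, already combined by the caller).
    Returns the updated value for every buffer state x.
    """
    max_x = len(v_next) - 1
    # Processor M sees the discounted future value, in the spirit of (9).
    cont = [gamma * v for v in v_next]
    for _ in range(num_procs):  # processors M down to 1
        updated = []
        for x in range(max_x + 1):
            best = float("inf")
            for f in FREQS_HZ:
                # A slice can only be scheduled if one is waiting.
                for y in ((0, 1) if x > 0 else (0,)):
                    p = p_done(f) * y  # equals Q(f, y) for binary y
                    cost = rho(f) + sigma(f, y) - lam * p
                    future = p * cont[x - y] + (1 - p) * cont[x]
                    best = min(best, cost + future)
            updated.append(best)
        cont = updated  # output of processor j feeds processor j - 1
    return cont
```

Each stage searches only over one scalar frequency and one binary scheduling action, so the work per buffer state grows linearly with the number of processors instead of exponentially, which is the complexity reduction described above.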
3.2. Second-level scheduler 
Given the optimal actions calculated by the first-level 
scheduler, it is likely that slices belonging to different frames in 
the frame working set will want to be scheduled on the same 
processor in the same time slot, thereby violating the processor 
constraint in (6). To avoid this problem, the second-level 
scheduler determines the final slice- and frequency-to-processor 
mappings using an Earliest Deadline First (EDF) policy. 
Specifically, frame $v^{j,*}$ gets scheduled on processor $j$ at 
frequency $f^{jv,*}$ if $v^{j,*}$ has the minimum decoding deadline 
$d^{v,\mathrm{dec}}$ of all of the frames scheduled on processor $j$ (with ties 
broken randomly). Finally, if a slice finishes decoding before the 
first-level scheduler’s time quantum is up, then the second-level 
scheduler will start decoding another slice during the “slack” time, 
which is the time between the beginning of the next time quantum 
and the time that the originally scheduled slice finished decoding. 
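The EDF conflict resolution step can be sketched as follows. The input layout (a dictionary mapping each processor to the list of frame and frequency pairs proposed by the first-level scheduler) and the function name are assumptions for illustration; the slack-time reuse described above is not modeled.

```python
import random


def resolve_conflicts(requests, dec_deadline, rng=random):
    """Pick one frame per processor by Earliest Deadline First.

    requests[j]: list of (frame, frequency) pairs proposed for processor j.
    dec_deadline[frame]: decoding deadline of the frame.
    Returns {j: (frame, frequency)}; ties are broken at random.
    """
    mapping = {}
    for j, proposals in requests.items():
        if not proposals:
            continue  # nothing was proposed for this processor
        earliest = min(dec_deadline[v] for v, _ in proposals)
        tied = [(v, f) for v, f in proposals if dec_deadline[v] == earliest]
        mapping[j] = rng.choice(tied)
    return mapping
```

For instance, if both a B frame with decoding deadline 5 and an I frame with decoding deadline 3 are proposed for the same processor, the I frame and its proposed frequency are selected.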
4. EXPERIMENTS 
To validate our optimized multi-core scheduling approach in 
Matlab, we use accurate profiling/statistics generated from an 
H.264/AVC decoder executed on a sophisticated cycle-accurate 
and bus signal-accurate MPARM simulator [11]. We implemented 
the two-level scheduling algorithm proposed in Section 3 in 
Matlab. This algorithm, together with slice-level data traces 
recorded from MPARM, allowed us to determine scheduling and 
DVFS policies for the Silent and Foreman sequences (CIF 
resolution, 30 frames per second, 8 slices per frame) with an IBPB 
GOP structure. The relevant parameters used in our experiments 
are given in Table 2. 
In Fig. 2, we compare our proposed algorithm to the Optimum 
Minimum-Energy Multicore Scheduling algorithm (OPT-MEMS 
[2]), and to a modification of our algorithm where we require all 
processors to operate at the same frequency (i.e., coordinated 
DVFS). We note that OPT-MEMS supports both DPM and 
coordinated DVFS; however, we only compare against the DVFS 
part to achieve a fair comparison. (Although DPM can be 
integrated into our proposed solution, we omitted it here to 
simplify the exposition.)  
OPT-MEMS uses a frame’s worst-case execution complexity 
and its deadline to determine a DVFS schedule that multiplexes 
between two frequencies in time in order to execute exactly the 
worst-case number of cycles before the task’s deadline. There are 
four important limitations of OPT-MEMS. First, OPT-MEMS 
does not consider characteristics and requirements of future tasks 
(e.g. deadlines, complexities, dependencies) when deciding the 
DVFS schedule for the current task. Second, OPT-MEMS does 
not provide a scheduling technique to allocate tasks to processor 
cores; instead, it assumes that each task is perfectly divisible 
among an arbitrary number of cores. This corresponds to the case 
of perfect load balancing, which can only be achieved in practice 
if the number of slices per frame is exactly the number of cores, 
and each slice has exactly the same decoding complexity. Third, 
OPT-MEMS does not provide a mechanism for scheduling slices 
belonging to different frames at the same time. This leads to some 
inefficiency because fully parallelized decoding (which accounts 
for frame dependencies) is not possible. Fourth, OPT-MEMS uses 
coordinated DVFS. This leads to inefficiency in practice because 
tasks are not the same size and therefore cannot be perfectly load 
balanced with a single frequency for all cores. 
As illustrated in Fig. 2(a) and Fig. 2(b), for M = 1 or 2 
processors, all algorithms achieve approximately the same frame 
rates and power consumptions for a given sequence. This is 
because, even at the highest operating frequency, there are not 
enough resources to decode all frames. For M = 4 or 8 processors, 
Fig. 2(a) and Fig. 2(b) show that all algorithms achieve the full 
frame rate (or very close to the full frame rate); however, Fig. 2(c) 
and Fig. 2(d) show that the proposed algorithm achieves lower 
overall power consumption. For M = 4 cores, the proposed 
algorithm reduces power by approximately 24% for Foreman and 
36% for Silent, relative to OPT-MEMS. The improvements are 
more modest for M = 8 cores because each core runs at a much 
lower operating frequency than with M = 4 cores, so there is less 
opportunity to reduce power consumption. 
5. CONCLUSION 
We propose a Markov decision process based on-line 
scheduling algorithm for slice-parallel video decoders on 
multicore systems. To mitigate the complexity of solving for the 
optimal on-line scheduling and DVFS policy, we propose a 
novel two-level scheduler. The first-level scheduler determines 
scheduling and DVFS policies independently for each frame and 
the second-level decides the final frame-to-processor and 
frequency-to-processor mappings at run-time. We validate the 
proposed algorithm in Matlab using accurate video decoder trace 
statistics generated from an H.264/AVC decoder that we 
implemented on a cycle-accurate MPARM simulator.  
6. REFERENCES 
[1] H. Aydin and Q. Yang, “Energy-Aware Partitioning for 
Multiprocessor Real-Time Systems,” Proc. of the 17th 
International Symposium on Parallel and Distributed 
Processing (IPDPS '03), Apr. 2003. 
[2] W. Y. Lee, Y. W. Ko, H. Lee, and H. Kim, “Energy-efficient 
scheduling of a real-time task on DVFS-enabled multi-
cores,” Proc. of the 2009 Intl. Conf. on Hybrid Information 
Technology (ICHIT '09), pp. 273-277, 2009. 
[3] Y.-H. Wei, C.-Y. Yang, T.-W. Kuo, S.-H. Hung, and Y.-H. 
Chu, “Energy-efficient real-time scheduling of multimedia 
tasks on multi-core processors,” Proc. of the 2010 ACM 
Symp. on Applied Computing (SAC '10), pp. 258-262, 2010.  
[4] H. Liu, Z. Shao, M. Wang, and P. Chen, “Overhead-Aware 
System-Level Joint Energy and Performance Optimization 
for Streaming Applications on Multiprocessor Systems-on-
Chip,” Proc. of the 2008 Euromicro Conference on Real-
Time Systems (ECRTS '08), pp. 92-101, July 2008.  
[5] C. E. Leiserson and J. B. Saxe, “Retiming synchronous 
circuitry,” Algorithmica, vol. 6, no. 1-6, pp. 5-35, 1991. 
[6] R. Xu, “Energy-aware scheduling for streaming 
applications,” Ph.D. Dissertation: 
http://preview.tinyurl.com/c6uul4m  
[7] P. Pillai and K. G. Shin, “Real-time dynamic voltage scaling 
for low-power embedded operating systems,” SIGOPS Oper. 
Syst. Rev., vol. 35, no. 5, pp. 89-102, Oct. 2001. 
[8] J. Cong and K. Gururaj, “Energy efficient multiprocessor 
task scheduling under input-dependent variation,” Proc. of 
the Conference on Design, Automation and Test in 
Europe (DATE '09), pp. 411-416, 2009. 
[9] M. Roitzsch, “Slice-balancing H.264 video encoding for 
improved scalability of multicore decoding,” Proc. of the 7th 
ACM & IEEE International Conference on Embedded 
Software, pp. 269-278, 2007. 
[10] R. S. Sutton and A. G. Barto, “Reinforcement learning: an 
introduction,” Cambridge, MA:MIT press, 1998. 
[11] L. Benini, D. Bertozzi, A. Bogliolo, F. Menichelli, and M. 
Olivieri, “MPARM: Exploring the Multi-Processor SoC 
Design Space with SystemC,” J. VLSI Signal Process. Syst., 
vol. 41, no. 2, pp. 169-182, Sept. 2005. 
[12] L. Benini, A. Bogliolo, G. A. Paleologo, G. De Micheli, 
“Policy optimization for dynamic power management,” IEEE 
Trans. on Computer-Aided Design of Integrated Circuits and 
Systems, vol. 18, no. 6, June 1999. 
[13] E. B. van der Tol, E. G. Jaspers, R. H. Gelderblom, 
“Mapping of H.264 decoding on a multiprocessor 
architecture,” Proc. of the SPIE (May 2003), pp. 707-718. 
[14] N. Mastronarde, K. Kanoun, D. Atienza, P. Frossard, and M. 
van der Schaar, “Markov decision process based energy-
efficient on-line scheduling for slice-parallel video decoders 
on multicore systems,” IEEE Trans. on Multimedia, vol. 15, 
no. 2, pp. 268-278, Feb. 2013. 
Table 2. Simulation parameters.  
Parameter                    Value(s) 
No. slave cores (M)          1, 2, 4, 8 
Frequency set (F)            {125, 166, 250, 500} MHz 
Sequence                     Foreman (220 frames), Silent (300 frames) 
Resolution                   CIF (352 x 288) 
GOP structure                ‘IBPB’ 
Frame rate                   30 frames per second 
Time slot duration           1/90 s 
No. frame working sets       12 
No. slices per frame         8 
Lagrange multiplier (λ)      400 
 
Fig. 2. Experimental comparisons (a,b) Avg. decoded frame rates 
for Foreman and Silent, respectively. (c,d) Avg. total power 
consumption for Foreman and Silent, respectively.  
 
