University of Pennsylvania

ScholarlyCommons
Departmental Papers (CIS)

Department of Computer & Information Science

11-2015

Cache-Aware Compositional Analysis of Real-Time Multicore
Virtualization Platforms
Meng Xu
University of Pennsylvania, mengxu@cis.upenn.edu

Linh T.X. Phan
University of Pennsylvania, linhphan@cis.upenn.edu

Oleg Sokolsky
University of Pennsylvania, sokolsky@cis.upenn.edu

Sisu Xi
Chenyang Lu



Recommended Citation
Meng Xu, Linh T.X. Phan, Oleg Sokolsky, Sisu Xi, Chenyang Lu, Christopher Gill, and Insup Lee, "Cache-Aware Compositional Analysis of Real-Time Multicore Virtualization Platforms", Real-Time Systems
Journal 51(6), 675-723. November 2015. http://dx.doi.org/10.1007/s11241-015-9223-2

This research was published earlier in http://repository.upenn.edu/cis_papers/770/
This paper is posted at ScholarlyCommons. https://repository.upenn.edu/cis_papers/786

Disciplines
Computer Engineering | Computer Sciences | Digital Communications and Networking | OS and Networks

Cache-Aware Compositional Analysis of Real-Time Multicore
Virtualization Platforms
Meng Xu · Linh Thi Xuan Phan · Oleg Sokolsky ·
Sisu Xi · Chenyang Lu · Christopher Gill · Insup Lee

Abstract Multicore processors are becoming ubiquitous, and it is becoming increasingly common to run
multiple real-time systems on a shared multicore platform. While this trend helps to reduce cost and to increase
performance, it also makes it more challenging to achieve timing guarantees and functional isolation. One
approach to achieving functional isolation is to use virtualization. However, virtualization also introduces
many challenges to the multicore timing analysis; for instance, the overhead due to cache misses becomes
harder to predict, since it depends not only on the direct interference between tasks but also on the indirect
interference between virtual processors and the tasks executing on them.
In this paper, we present a cache-aware compositional analysis technique that can be used to ensure
timing guarantees of components scheduled on a multicore virtualization platform. Our technique improves on
previous multicore compositional analyses by accounting for the cache-related overhead in the components’
interfaces, and it addresses the new virtualization-specific challenges in the overhead analysis. To demonstrate
the utility of our technique, we report results from an extensive evaluation based on randomly generated
workloads.
Keywords Compositional analysis · Interface · Cache-aware · Multicore · Virtualization

1 Introduction
Modern real-time systems are becoming increasingly complex and demanding; at the same time, the microprocessor industry is offering more computation power in the form of an exponentially growing number of cores.
Hence, it is becoming more and more common to run multiple system components on the same multicore
platform, rather than deploying them separately on different processors. This shift towards shared computing
platforms enables system designers to reduce cost and to increase performance; however, it also makes it
significantly more challenging to achieve separation of concerns and to maintain timing guarantees.
One approach to achieving separation of concerns is through virtualization technology. On a virtualization platform, such as Xen [Barham et al., 2003], multiple system components with different functionalities can be deployed in domains (virtual machines) that can each run their own operating system. These domains provide a clean isolation between components, and they preserve the components’ functional behavior. However, existing virtualization platforms are designed to provide good average performance; they are not designed to provide real-time guarantees. To achieve the latter, a virtualization platform would need to ensure that each domain meets its real-time performance requirements. There are ongoing efforts towards this goal, e.g., [Bruns et al., 2010; Crespo et al., 2010; Lee et al., 2012], but they primarily focus on single-core processors.

Meng Xu · Linh Thi Xuan Phan · Insup Lee · Oleg Sokolsky
University of Pennsylvania
E-mail: {mengxu, linhphan, lee, sokolsky}@cis.upenn.edu
Sisu Xi · Chenyang Lu · Christopher Gill
Washington University in St. Louis
E-mail: {xis, cdgill, lu}@cse.wustl.edu

In this paper, we present a framework that can provide timing guarantees for multiple components running
on a shared multicore virtualization platform. Our approach is based on multicore compositional analysis, but it
takes the unique characteristics of virtualization platforms into account. In our approach, each component—i.e.,
a set of tasks and their scheduling policy—is mapped to a domain, which is executed on a set of virtual
processors (VCPUs). The VCPUs of the domains are then scheduled on the underlying physical cores. The
schedulability analysis of the system is compositional: we first abstract each component into an interface that
describes the minimum processing resources needed to ensure that the component is schedulable, and then we
compose the resulting interfaces to derive an interface for the entire system. Based on the system’s interface,
we can compute the minimum number of physical cores that are needed to schedule the system.
A number of compositional analysis techniques for multi-core systems have been developed (for instance, [Baruah and Fisher, 2009; Easwaran et al., 2009; Lipari and Bini, 2010]), but existing theories assume a
somewhat idealized platform in which all overhead is negligible. In practice, the platform overhead—especially
the cost of cache misses—can substantially interfere with the execution of tasks. As a result, the computed
interfaces can underestimate the resource requirements of the tasks within the underlying components. Our
goal is to remove this assumption by accounting for the platform overhead in the interfaces. In this paper, we
focus on cache-related overhead, as it is among the most prominent in the multicore setting.
Cache-aware compositional analysis for multicore virtualization platforms is challenging because virtualization introduces additional overhead that is difficult to predict. For instance, when a VCPU resumes
after being preempted by a higher-priority VCPU, a task executing on it may experience a cache miss, since
its cache blocks may have been evicted from the cache by the tasks that were executing on the preempting
VCPU. Similarly, when a VCPU is migrated to a new core, all its cached code and data remain in the old core;
therefore, if the tasks later access content that was cached before the migration, the new core must load it from
memory rather than from its cache.
Another challenge comes from the cache misses that can occur when a VCPU finishes its budget
and stops its execution. For instance, suppose a VCPU is currently running a task τi that has not finished its
execution when the VCPU finishes its budget, and that τi is migrated to another VCPU of the same domain
that is either idle or executing a lower-priority task τj (if one exists). Then τi can incur a cache miss if the new
VCPU is on a different core, and it can trigger a cache miss in τj when τj resumes. This type of overhead is
difficult to analyze, since it is in general not possible to determine statically when a VCPU finishes its budget
or which task is affected by the VCPU completion.
In this paper, we address the above virtualization-related challenges, and we present a cache-aware compositional analysis for multicore virtualization platforms. Specifically, we make the following contributions:¹
– We present a new supply bound function for the existing multiprocessor periodic resource (MPR) model that is tighter than the original supply bound function proposed in [Easwaran et al., 2009], thus enabling more resource-efficient interfaces for components (Section 3);
– we introduce DMPR, a deterministic extension of the multiprocessor periodic resource model to better represent component interfaces on multicore virtualization platforms (Section 4);
– we present a DMPR-based compositional analysis for systems without cache-related overhead (Section 5);
– we characterize different types of events that cause cache misses in the presence of virtualization (Section 6);
– we propose three methods (BASELINE, TASK-CENTRIC-UB, and MODEL-CENTRIC) to account for the cache-related overhead (Sections 7.1, 7.2 and 8); and
– we analyze the relationship between the proposed cache-related overhead analysis methods, and we develop a cache-aware compositional analysis method based on a hybrid of these methods (Section 9).
1 A preliminary version of this paper has appeared in the Real-Time Systems Symposium (RTSS’13) [Xu et al., 2013].

To demonstrate the applicability and the benefits of our proposed cache-aware analysis, we report results
from an extensive evaluation on randomly generated workloads using simulation as well as by running them
on a realistic platform.

2 System Descriptions
The system we consider consists of multiple real-time components that are scheduled on a multicore virtualization platform, as is illustrated in Fig. 1(a). Each component corresponds to a domain (virtual machine) of the
platform and consists of a set of tasks; these tasks are scheduled on a set of virtual processors (VCPUs) by
the domain’s scheduler. The VCPUs of the domains are then scheduled on the physical cores by the virtual
machine monitor (VMM).
Each task τi within a domain is an explicit-deadline periodic task, defined by τi = (pi, ei, di), where pi is the period, ei is the worst-case execution time (WCET), and di is the relative deadline of τi. We require that 0 < ei ≤ di ≤ pi for all τi.
Each VCPU is characterized by VPj = (Πj, Θj), where Πj is the VCPU’s period and Θj is the resource budget that the VCPU services in every period, with 0 ≤ Θj ≤ Πj. We say that VPj is a full VCPU if Θj = Πj, and a partial VCPU otherwise. We assume that each VCPU is implemented as a periodic server [Sha et al., 1986] with period Πj and maximum budget time Θj. The budget of a VCPU is replenished at the beginning of each period; if the budget is not used when the VCPU is scheduled to run, it is wasted. We assume that each VCPU can execute only one task at a time. As in most real-time scheduling research, we follow the conventional real-time task model, in which each task is a single thread; an extension to parallel task models is an interesting but challenging research direction, which we plan to investigate in our future work.
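The task and VCPU parameters above can be written down as a small data model. The following is a minimal sketch and not part of the original analysis: the class names `Task` and `VCPU`, their field names, and the extra requirement that a VCPU period be strictly positive are illustrative assumptions.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Task:
    """Explicit-deadline periodic task (p, e, d); names are hypothetical."""
    period: Fraction
    wcet: Fraction
    deadline: Fraction

    def __post_init__(self):
        # The model requires 0 < e <= d <= p.
        if not (0 < self.wcet <= self.deadline <= self.period):
            raise ValueError("expected 0 < wcet <= deadline <= period")


@dataclass(frozen=True)
class VCPU:
    """Periodic-server VCPU (period, budget); names are hypothetical."""
    period: Fraction
    budget: Fraction

    def __post_init__(self):
        # 0 <= budget <= period comes from the text; period > 0 is assumed here.
        if self.period <= 0 or not (0 <= self.budget <= self.period):
            raise ValueError("expected period > 0 and 0 <= budget <= period")

    @property
    def is_full(self):
        # A VCPU is full when its budget equals its period, partial otherwise.
        return self.budget == self.period

    @property
    def bandwidth(self):
        return Fraction(self.budget) / Fraction(self.period)


task = Task(period=200, wcet=100, deadline=200)
full_vcpu = VCPU(period=10, budget=10)
partial_vcpu = VCPU(period=10, budget=4)
print(full_vcpu.is_full, partial_vcpu.is_full, partial_vcpu.bandwidth)
```

Exact rational arithmetic (`Fraction`) is used only so that equality tests such as budget = period are not affected by floating-point rounding.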
We assume that all cores are identical and have unit capacity, i.e., each core provides t units of resource (execution time) in any time interval of length t. Each core has a private cache,² all cores share the same memory, and the size of the memory is sufficiently large to ensure that all tasks (from all domains) can reside in memory at the same time, without conflicts.

[Figure 1(a): three domains, each scheduling its tasks under gEDF (Domain 1: τ1, τ2, τ3; Domain 2: τ4, τ5, τ6; Domain 3: τ7, τ8), with interfaces ⟨P1, Q1, m1⟩, ⟨P2, Q2, m2⟩ and ⟨P3, Q3, m3⟩; the VCPUs VP1 to VP5 are scheduled by the VMM on cpu1 to cpu4. Figure 1(b): the VCPUs VP1 to VP5 and the cores cpu1 to cpu4, with gEDF scheduling labels.]

(a) Task and VCPU scheduling. (b) Scheduling of VCPUs.

Fig. 1 Compositional scheduling on a virtualization platform.

2 In this work, we assume that the cores either do not share a cache, or that the shared cache has been partitioned into cache sets that are each accessed exclusively by one core [Kim et al., 2012]. We believe that an extension to shared caches is possible, and we plan to consider it in our future work.

Scheduling of tasks and VCPUs. We consider a hybrid version of the Earliest Deadline First (EDF) strategy. As is shown in Fig. 1, tasks within each domain are scheduled on the domain’s VCPUs under the global EDF (gEDF) [Baruah and Baker, 2008] scheduling policy. The VCPUs of all the domains are then scheduled on the physical cores under a semi-partitioned EDF policy: each full VCPU is pinned (mapped) to a dedicated core, and all the partial VCPUs are scheduled on the remaining cores under gEDF. In the example from Fig. 1(b), VP1 and VP3 are full VCPUs, which are pinned to the physical cores cpu1 and cpu2, respectively. The remaining VCPUs are partial VCPUs, and are therefore scheduled on the remaining cores under gEDF.
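The semi-partitioned mapping described above can be illustrated with a short sketch. It only separates full VCPUs (pinned, one per core) from partial VCPUs (left to a gEDF pool on the remaining cores); it is not a schedulability test. The function name `map_vcpus`, the core naming, and the periods and budgets in the usage example are hypothetical, chosen only so that VP1 and VP3 are full as in Fig. 1(b).

```python
def map_vcpus(vcpus, num_cores):
    """Split VCPUs into pinned full VCPUs and a pool of partial VCPUs.

    vcpus: dict mapping a VCPU name to its (period, budget) pair.
    Returns (pinned, partial, shared_cores), where `pinned` maps each full
    VCPU to a dedicated core and `shared_cores` are the cores left for the
    partial VCPUs, which would be scheduled there under gEDF.
    """
    cores = [f"cpu{i + 1}" for i in range(num_cores)]
    full = [name for name, (period, budget) in vcpus.items() if budget == period]
    partial = [name for name, (period, budget) in vcpus.items() if budget < period]
    if len(full) > num_cores:
        raise ValueError("not enough cores to dedicate one to each full VCPU")
    pinned = dict(zip(full, cores))      # one dedicated core per full VCPU
    shared_cores = cores[len(full):]     # remaining cores form the gEDF pool
    if partial and not shared_cores:
        raise ValueError("no core left for the partial VCPUs")
    return pinned, partial, shared_cores


# Hypothetical parameters: VP1 and VP3 are full, the others are partial.
vcpus = {"VP1": (10, 10), "VP2": (10, 4), "VP3": (20, 20),
         "VP4": (20, 5), "VP5": (10, 3)}
pinned, partial, shared_cores = map_vcpus(vcpus, 4)
print(pinned, partial, shared_cores)
```

With these inputs the full VCPUs VP1 and VP3 are pinned to cpu1 and cpu2, and VP2, VP4 and VP5 share cpu3 and cpu4.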
Cache-related overhead. When two code sections are mapped to the same cache set, one section can evict the
other section’s cache blocks from the cache, which causes a cache miss when the latter resumes. If the two
code sections belong to the same task, this cache miss is an intrinsic cache miss; otherwise, it is an extrinsic
cache miss [Basumallick and Nilsen, 1994]. The overhead due to intrinsic cache misses of a task can typically
be statically analyzed based solely on the task; however, extrinsic cache misses depend on the interference
between tasks during execution. In this paper, we assume that the tasks’ WCETs already include intrinsic
cache-related overhead, and we will focus on the extrinsic cache-related overhead. In the rest of this paper, we
use the term ‘cache’ to refer to ‘extrinsic cache’.
We use $\Delta^{\mathrm{crpmd}}_{\tau_i}$ to denote the maximum time needed to re-load all the useful cache blocks (i.e., cache blocks that will be reused) of a preempted task τi when that task resumes (either on the same core or on a different core).³ Since the overhead for reloading the cache content of a preempted VCPU (i.e., a periodic server) upon its resumption is insignificant compared to the task’s, we will assume here that it is either zero or is already included in the overhead due to cache misses of the running task inside the VCPU.
Objectives. In the above setting, our goal is to develop a cache-aware compositional analysis framework
for the system. This framework consists of two elements: (1) an interface representation that can succinctly
capture the resource requirements of a component (i.e., a domain or the entire system); and (2) an interface
computation method for computing a minimum-bandwidth cache-aware interface of a component (i.e., an
interface with the minimum resource bandwidth that guarantees the schedulability of a component in the
presence of cache-related overhead).
Assumptions. We assume that (1) all VCPUs of each domain j share a single period Πj ; (2) all Πj are
known a priori; and (3) each Πj is available to all domains. These assumptions are important to make the
analysis tractable. Assumption 1 is equivalent to using a time-partitioned approach; we make this assumption
to simplify the cache-aware analysis in Section 8, but it should be easy to extend the analysis to allow different
periods for the VCPUs. Assumption 2 is made to reduce the search space, which is common in existing
work (e.g., [Easwaran et al., 2009]); it can be relaxed by first establishing an upper bound on the optimal
period (i.e., the period of the minimum-bandwidth interface) of each domain j, and then searching for the
optimal period value based on this bound. Finally, Assumption 3 is necessary to determine how often different
events that cause cache-related overhead happen (cf. Section 6), which is crucial for the cache-aware interface
computation in Sections 7 and 8. One approach to relaxing this assumption is to treat the period of the VCPUs
of a domain as an input parameter in the computation of the overhead that another domain experiences. Such a
parameterized interface analysis approach is very general, but making it efficient remains an interesting open
problem for future research. We note, however, that although each assumption can be relaxed, the consequence
of relaxing all three assumptions requires a much deeper investigation.

3 Improvement on Multiprocessor Periodic Resource Model
Recall that, when representing a platform, a resource model specifies the characteristics of the resource supply that is provided by that platform; when representing a component’s interface, it specifies the total resource requirements of the component that must be guaranteed to ensure the component’s schedulability. The resource provided by a resource model R can also be captured by a supply bound function (SBF), denoted by $\mathrm{SBF}_{R}(t)$, that specifies the minimum number of resource units that R provides over any interval of length t.

3 We are aware that using a constant maximum value to bound the cache-miss overhead of a task may be conservative, and extensions to a finer granularity, e.g., using program analysis, may be possible. However, as the first step, we keep this assumption to simplify the analysis in this work, and we defer such extensions to our future work.

In this section, we first describe the existing multiprocessor periodic resource (MPR) model [Shin et al., 2008], which serves as a basis for our proposed resource model for multicore virtualization platforms. We then present a new SBF for the MPR model that improves upon the original SBF given in [Shin et al., 2008], thus enabling tighter MPR-based interfaces for components and more efficient use of resource.

3.1 Background on MPR
An MPR model Γ = ⟨Π̃, Θ̃, m′⟩ specifies that a multiprocessor platform with a number of identical, unit-capacity CPUs provides Θ̃ units of resources in every period of Π̃ time units, with concurrency at most m′ (in other words, at any time instant at most m′ physical processors are allocated to this resource model), where Θ̃ ≤ m′Π̃. Its resource bandwidth is given by Θ̃/Π̃.

[Figure 2: the two worst-case resource supply patterns of the MPR model, Case 1 and Case 2.]

Fig. 2 Worst case resource supply of MPR model.

The worst-case resource supply scenario of the MPR model is shown in Fig. 2 [Easwaran et al., 2009]. Based on this worst-case scenario, the authors in [Easwaran et al., 2009] proposed an SBF that bounds the resource supplied by the MPR model Γ = ⟨Π̃, Θ̃, m′⟩, which is defined as follows:

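$$
\widetilde{\mathrm{SBF}}_{\Gamma}(t) =
\begin{cases}
0, & \text{if } t' < 0 \\
\lfloor t'/\tilde{\Pi} \rfloor \tilde{\Theta} + \max\{0,\; m'x - (m'\tilde{\Pi} - \tilde{\Theta})\}, & \text{if } t' \ge 0 \wedge x \in [1, y] \\
\lfloor t'/\tilde{\Pi} \rfloor \tilde{\Theta} + \max\{0,\; m'x - (m'\tilde{\Pi} - \tilde{\Theta})\} - (m' - \beta), & \text{if } t' \ge 0 \wedge x \notin [1, y]
\end{cases}
\qquad (1)
$$

where α = ⌊Θ̃/m′⌋, β = Θ̃ − m′α, t′ = t − (Π̃ − ⌈Θ̃/m′⌉), x = t′ − Π̃⌊t′/Π̃⌋, and y = Π̃ − ⌊Θ̃/m′⌋.
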

3.2 Improved SBF of the MPR model
We observe that, although the function $\widetilde{\mathrm{SBF}}_{\Gamma}$ given in Eq. (1) is a valid SBF for the MPR model Γ, it is conservative. Specifically, the minimum amount of resource provided by Γ over a time window of length t (see Fig. 2) can be much larger than $\widetilde{\mathrm{SBF}}_{\Gamma}(t)$ when (i) the resource bandwidth of Γ is equal to its maximum concurrency level (i.e., Θ̃/Π̃ = m′), or (ii) x ≤ 1, where x is defined in Eq. (1). We demonstrate these cases using the two examples below.

Example 1 Let Γ1 = ⟨Π̃, Θ̃, m′⟩, where Θ̃ = Π̃m′, and Π̃ and m′ are any two positive integer values. By the definition of the MPR model, Γ1 represents a multiprocessor platform with exactly m′ identical, unit-capacity CPUs that are fully available. In other words, Γ1 provides m′t time units in every t time units. However, according to Eq. (1), we have α = ⌊Θ̃/m′⌋ = Π̃, β = Θ̃ − m′α = 0, t′ = t − (Π̃ − ⌈Θ̃/m′⌉) = t, x = t′ − Π̃⌊t′/Π̃⌋, and y = Π̃ − ⌊Θ̃/m′⌋ = 0. Whenever x ∉ [1, y], for all t′ = t ≥ 0,

$$
\widetilde{\mathrm{SBF}}_{\Gamma_1}(t) = \lfloor t'/\tilde{\Pi} \rfloor \tilde{\Theta} + \max\{0,\; m'x - (m'\tilde{\Pi} - \tilde{\Theta})\} - (m' - \beta) = m't - m'.
$$

As a result, $\widetilde{\mathrm{SBF}}_{\Gamma_1}(t) < m't$ for all t such that x ∉ [1, y].

Example 2 Let Γ2 = ⟨Π̃ = 20, Θ̃ = 181, m′ = 10⟩ and consider t = 21.1. From Eq. (1), we obtain α = 18, β = 1, t′ = t − 1 = 20.1, x = 0.1, and y = 2. Since x ∉ [1, y], we have

$$
\begin{aligned}
\widetilde{\mathrm{SBF}}_{\Gamma_2}(t) &= \lfloor t'/\tilde{\Pi} \rfloor \tilde{\Theta} + \max\{0,\; m'x - (m'\tilde{\Pi} - \tilde{\Theta})\} - (m' - \beta) \\
&= \lfloor 20.1/20 \rfloor \cdot 181 + \max\{0,\; 10 \times 0.1 - (10 \times 20 - 181)\} - (10 - 1) = 172.
\end{aligned}
$$
We rely on the worst-case resource supply scenario of the MPR model shown in Fig. 2 to compute the worst-case resource supply of Γ2 during a time interval of length t. We first compute the worst-case resource supply when t = 21.1 based on Case 1 in Fig. 2:
– t starts at the time point s1;
– During the time interval [s1, s1 + (Π̃ − α − 1)], i.e., [s1, s1 + 1], Γ2 supplies 0 time units;
– During the time interval [s1 + (Π̃ − α − 1), s1 + (Π̃ − α − 1) + Π̃], i.e., [s1 + 1, s1 + 21], Γ2 supplies Θ̃ = 181 time units;
– During the time interval [s1 + (Π̃ − α − 1) + Π̃, s1 + t], i.e., [s1 + 21, s1 + 21.1], Γ2 supplies 0 time units.
Therefore, Γ2 supplies 181 time units during a time interval of length t = 21.1 based on Case 1 in Fig. 2.
Next, we compute the worst-case resource supply when t = 21.1 based on Case 2 in Fig. 2:
– t starts at the time point s2;
– During the interval [s2, s2 + (Π̃ − α)], i.e., [s2, s2 + 2], Γ2 supplies β = 1 time unit;
– During the interval [s2 + (Π̃ − α), s2 + 2(Π̃ − α)], i.e., [s2 + 2, s2 + 4], Γ2 supplies β = 1 time unit;
– During the interval [s2 + 2(Π̃ − α), s2 + t], i.e., [s2 + 4, s2 + 21.1], Γ2 supplies (21.1 − 4) × m′ = 171 time units.
Therefore, Γ2 supplies 1 + 1 + 171 = 173 time units during a time interval of length t = 21.1 based on Case 2 in Fig. 2. Because the two cases in Fig. 2 are the only two possible worst-case scenarios of the MPR resource model [Easwaran et al., 2009], the worst-case resource supply of Γ2 during any time interval of length t = 21.1 is 173 time units. Since $\widetilde{\mathrm{SBF}}_{\Gamma_2}(t) = 172$, the value computed by Eq. (1) under-estimates the actual resource provided by Γ2.
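The arithmetic of Examples 1 and 2 can be re-checked mechanically. The sketch below is an illustrative transcription of Eq. (1) and is not taken from the paper; the function name `sbf_original` and its argument names are assumptions, and exact rationals are used so that t = 21.1 is not rounded.

```python
from fractions import Fraction
from math import ceil, floor


def sbf_original(period, budget, m, t):
    """Original MPR supply bound of Eq. (1) for the interface <period, budget, m>."""
    period, budget, t = Fraction(period), Fraction(budget), Fraction(t)
    alpha = floor(budget / m)                   # alpha = floor(budget / m')
    beta = budget - m * alpha                   # beta = budget - m' * alpha
    t_prime = t - (period - ceil(budget / m))   # t' in Eq. (1)
    if t_prime < 0:
        return Fraction(0)
    whole_periods = floor(t_prime / period)
    x = t_prime - period * whole_periods
    y = period - alpha
    supply = whole_periods * budget + max(Fraction(0), m * x - (m * period - budget))
    if 1 <= x <= y:
        return supply
    return supply - (m - beta)


# Example 2: period 20, budget 181, concurrency 10, t = 21.1.
t = Fraction("21.1")
bound = sbf_original(20, 181, 10, t)

# Case 2 arithmetic of Example 2 only (two slots of beta = 1, then full supply);
# this line restates the hand computation and is not a general formula.
case2 = 1 + 1 + (t - 4) * 10

print(bound, case2)  # prints: 172 173
```

The gap between the two printed values is the one time unit by which Eq. (1) under-estimates the supply in this instance.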
Based on the above observations, we introduce a new SBF that can better bound the resource supply of the
MPR model. This improved SBF is computed based on the worst-case resource supply scenarios shown in
Fig. 2.
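Lemma 1 The amount of resource provided by the MPR model Γ = ⟨Π̃, Θ̃, m′⟩ over any time interval of length t is at least $\mathrm{SBF}_{\Gamma}(t)$, where

$$
\mathrm{SBF}_{\Gamma}(t) =
\begin{cases}
0, & t' < 0 \\
\lfloor t'/\tilde{\Pi} \rfloor \tilde{\Theta} + \max\{0,\; m'x' - (m'\tilde{\Pi} - \tilde{\Theta})\}, & t' \ge 0 \wedge x' \in [1 - \beta/m',\, y] \\
\max\{0,\; \beta\,(t - 2(\tilde{\Pi} - \lfloor \tilde{\Theta}/m' \rfloor))\}, & t' \in [0, 1] \wedge x' \notin [1 - \beta/m',\, y] \\
\lfloor t''/\tilde{\Pi} \rfloor \tilde{\Theta} + \max\{0,\; m'x'' - (m'\tilde{\Pi} - \tilde{\Theta}) - (m' - \beta)\}, & t' \ge 1 \wedge x' \notin [1 - \beta/m',\, y]
\end{cases}
\qquad (2)
$$

where

$$
\alpha = \lfloor \tilde{\Theta}/m' \rfloor; \quad
\beta = \begin{cases} \tilde{\Theta} - m'\alpha, & \tilde{\Theta} \ne \tilde{\Pi} m' \\ m', & \tilde{\Theta} = \tilde{\Pi} m' \end{cases}; \quad
t' = t - (\tilde{\Pi} - \lceil \tilde{\Theta}/m' \rceil); \quad
t'' = t' - 1;
$$

$$
x' = t' - \tilde{\Pi} \lfloor t'/\tilde{\Pi} \rfloor; \quad
x'' = (t'' - \tilde{\Pi} \lfloor t''/\tilde{\Pi} \rfloor) + 1; \quad
y = \tilde{\Pi} - \lfloor \tilde{\Theta}/m' \rfloor.
$$
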
Proof We will prove that the function $\mathrm{SBF}_{\Gamma}(t)$ is a valid SBF of Γ based on the worst-case resource supply patterns of Γ shown in Fig. 2.
Consider the time interval of length t′ (called time interval t′) and the black-out interval (during which the resource supply is zero) in Fig. 2. By definition, x′ is the remaining time of the time interval t′ in the last period of Γ, and y is half the length of the black-out interval plus one. There are four cases of x′, which determine whether $\mathrm{SBF}_{\Gamma}(t)$ corresponds to the resource supply of Γ in Case 1 or Case 2 in Fig. 2:
– x′ ∈ [1, y]: It is easy to show that the value of $\mathrm{SBF}_{\Gamma}(t)$ in Case 1 is no larger than its value in Case 2. Note that if we shift the time interval of length t in Case 1 by one time unit to the left, we obtain the scenario in Case 2. In doing so, $\mathrm{SBF}_{\Gamma}(t)$ will be increased by β time units from the first period but decreased by at most β time units from the last period. Therefore, the pattern in Case 2 supplies at least as much resource as the pattern in Case 1 when x′ ∈ [1, y].
– x′ ∈ [1 − β/m′, 1]: As above, if we shift the time interval of length t in Case 1 by one time unit to the left, we obtain the scenario in Case 2. Recall that x′ is the remaining time of the time interval of length t′ in the last period, x′ ≤ 1 and y ≥ 1. In shifting the time interval of length t, $\mathrm{SBF}_{\Gamma}(t)$ will lose (1 − x′)m′ time units while gaining β time units from the first period. Because x′ ≥ 1 − β/m′, we have β − (1 − x′)m′ ≥ 0. Therefore, $\mathrm{SBF}_{\Gamma}(t)$ gains β − (1 − x′)m′ ≥ 0 time units in transferring the scenario in Case 1 to the scenario in Case 2. Hence, Case 1 is the worst-case scenario when x′ ∈ [1 − β/m′, 1].
– x′ ∈ [0, 1 − β/m′): It is easy to show that Γ supplies less resource in Case 2 than in Case 1 when we shift the time interval of length t of Case 1 to the left by one time unit to get Case 2. Therefore, Case 2 is the worst-case scenario when x′ ∈ [0, 1 − β/m′).
– x′ > y: We can easily show that $\mathrm{SBF}_{\Gamma}(t)$ is no larger in Case 2 than in Case 1. Because x′ > y, when we shift the time interval t of Case 1 to the left by one time unit to get the scenario in Case 2, Γ loses m′ time units from the last period but only gains β time units, where β ≤ m′. Therefore, Case 2 is the worst-case scenario when x′ > y.

From the above, we conclude that Case 1 is the worst-case resource supply scenario when x′ ∈ [1 − β/m′, y], and Case 2 is the worst-case resource supply scenario when x′ ∉ [1 − β/m′, y].
Based on the worst-case resource supply scenario under different conditions above, we can derive Eq. (2) as follows:
– When t′ < 0: It is obvious that $\mathrm{SBF}_{\Gamma}(t) = 0$ because Γ supplies no resource in the black-out interval.
– When t′ ≥ 0 and x′ ∈ [1 − β/m′, y]: Based on the worst-case resource supply scenario in Case 1, Γ has ⌊t′/Π̃⌋ periods and provides Θ̃ time units in each period. Γ has x′ remaining time in the last period, which provides max{0, m′x′ − (m′Π̃ − Θ̃)} time units. Therefore, Γ supplies ⌊t′/Π̃⌋Θ̃ + max{0, m′x′ − (m′Π̃ − Θ̃)} time units during time interval t.
– When t′ ∈ [0, 1] and x′ ∉ [1 − β/m′, y]: Because t′ ∈ [0, 1], t ∈ [Π̃ − ⌈Θ̃/m′⌉, Π̃ − ⌈Θ̃/m′⌉ + 1]. Therefore, t < 2(Π̃ − ⌈Θ̃/m′⌉) + 2, where 2(Π̃ − ⌈Θ̃/m′⌉) is the length of the black-out interval. Hence, the worst-case resource supply of Γ during time interval t is max{0, β(t − 2(Π̃ − ⌊Θ̃/m′⌋))}.
– When t′ > 1 and x′ ∉ [1 − β/m′, y]: The worst-case resource supply scenario is Case 2. Γ has ⌊t″/Π̃⌋ periods and provides Θ̃ time units in each period. Γ supplies max{0, m′x″ − (m′Π̃ − Θ̃) − (m′ − β)} time units during its first and last periods. Therefore, $\mathrm{SBF}_{\Gamma}(t)$ = ⌊t″/Π̃⌋Θ̃ + max{0, m′x″ − (m′Π̃ − Θ̃) − (m′ − β)}.

The lemma follows from the above results. □
It is easy to verify that, under the two scenarios described in Examples 1 and 2, $\mathrm{SBF}_{\Gamma_1}(t)$ and $\mathrm{SBF}_{\Gamma_2}(t)$ correspond to the actual minimum resource that Γ1 and Γ2 provide, respectively. It is also worth noting that, for the scenario described in Example 1, the compositional analysis for the MPR model [Easwaran et al., 2009] is compatible⁴ with the underlying gEDF schedulability test under the improved SBF but not under the original SBF in Eq. (1). In the next example, we further demonstrate the benefits of the improved SBF in terms of resource bandwidth saving.
Example 3 Consider a component C with a taskset τ = {τ1 = · · · = τ4 = (200, 100, 200)} that is scheduled under gEDF, and suppose that the period of the MPR interface of C is fixed to be 40. Following the interface computation method in [Easwaran et al., 2009], the corresponding minimum-bandwidth MPR interfaces, Γ1 and Γ2, of C when using the original SBF in Eq. (1) and when using the improved SBF in Eq. (2) are obtained as follows: Γ1 = ⟨40, 145, 4⟩ and Γ2 = ⟨40, 120, 3⟩. Thus, the MPR interface of C corresponding to the improved SBF can save 145/40 − 120/40 = 0.625 cores compared to the interface corresponding to the original SBF proposed in [Easwaran et al., 2009].
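As a numerical sanity check, the improved bound of Eq. (2) can be transcribed directly into code. This is a sketch under stated assumptions rather than the authors' implementation: the function name `sbf_improved` and its argument names are hypothetical, the four branches are evaluated in the order in which Eq. (2) lists them, and exact rationals avoid rounding at the interval boundaries.

```python
from fractions import Fraction
from math import ceil, floor


def sbf_improved(period, budget, m, t):
    """Improved MPR supply bound of Eq. (2) for the interface <period, budget, m>."""
    period, budget, t = Fraction(period), Fraction(budget), Fraction(t)
    zero = Fraction(0)
    alpha = floor(budget / m)
    # beta = m' when the interface is fully available, else budget - m' * alpha.
    beta = Fraction(m) if budget == period * m else budget - m * alpha
    t1 = t - (period - ceil(budget / m))      # t'
    if t1 < 0:
        return zero
    x1 = t1 - period * floor(t1 / period)     # x'
    y = period - alpha
    slack = m * period - budget               # m' * period - budget
    if 1 - beta / m <= x1 <= y:               # Case 1 is the worst case
        return floor(t1 / period) * budget + max(zero, m * x1 - slack)
    if t1 <= 1:                               # t' in [0, 1], Case 2
        return max(zero, beta * (t - 2 * (period - alpha)))
    t2 = t1 - 1                               # t''
    x2 = t2 - period * floor(t2 / period) + 1 # x''
    return floor(t2 / period) * budget + max(zero, m * x2 - slack - (m - beta))


# Example 2: the improved bound matches the 173 units derived from Fig. 2.
example2 = sbf_improved(20, 181, 10, Fraction("21.1"))

# Example 1 with hypothetical values period = 5, m' = 3: the bound equals m' * t.
example1 = [sbf_improved(5, 15, 3, t) for t in (Fraction(1, 2), 1, Fraction(5, 2), 5, 7)]

# Example 3: bandwidth saved by the interface <40, 120, 3> over <40, 145, 4>.
saved = Fraction(145, 40) - Fraction(120, 40)

print(example2, example1, saved)
```

The last value is 5/8, i.e., the 0.625 cores reported in Example 3.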

4 Deterministic Multiprocessor Periodic Resource Model
In this section, we introduce the deterministic multiprocessor periodic resource (DMPR) model for representing the interfaces. The MPR model described in the previous section is simple and highly flexible because it represents the collective resource requirements of components without fixing the contribution of each processor a priori. However, this flexibility also introduces some extra overhead: it is possible that all processors stop providing resources at the same time, which results in a long worst-case starvation interval (it can be as long as 2(Π̃ − ⌈Θ̃/m′⌉) time units [Easwaran et al., 2009]). Therefore, to ensure schedulability in the worst case, it is necessary to provide more resources than strictly required. However, we can minimize this overhead by restricting the supply pattern of some of the processors. This is a key element of the deterministic MPR that we now propose.
A DMPR model is a deterministic extension of the MPR model, in which all of the processors but one always provide resource with full capacity. It is formally defined as follows.
4 We say that a compositional analysis method is compatible with the underlying component’s schedulability test it uses if
whenever a component C with a taskset τ is deemed schedulable on m cores by the schedulability test, then C is also deemed
schedulable under an interface with bandwidth no larger than m by the compositional analysis method.


Definition 1 A DMPR µ = ⟨Π, Θ, m⟩ specifies a resource that guarantees m full (dedicated) unit-capacity
processors, each of which provides t resource units in any time interval of length t, and one partial processor
that provides Θ resource units in every period of Π time units, where 0 ≤ Θ < Π and m ≥ 0.
By definition, the resource bandwidth of a DMPR µ = ⟨Π, Θ, m⟩ is bw_µ = m + Θ/Π. The total number of
processors of µ is m_µ = m + 1 if Θ > 0, and m_µ = m otherwise.
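Definition 1 maps naturally onto a small record type. The sketch below is illustrative only: the class name `DMPR` and its field names are assumptions, and the validation simply restates the constraints 0 ≤ Θ < Π and m ≥ 0.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DMPR:
    """Hypothetical container for a DMPR interface <period, budget, full>."""
    period: float  # Π
    budget: float  # Θ, budget of the partial processor per period
    full: int      # m, number of full (dedicated) processors

    def __post_init__(self):
        if not (0 <= self.budget < self.period) or self.full < 0:
            raise ValueError("need 0 <= budget < period and full >= 0")

    @property
    def bandwidth(self):
        # bw = m + Θ/Π
        return self.full + self.budget / self.period

    @property
    def processors(self):
        # m + 1 processors if there is a partial processor, m otherwise
        return self.full + 1 if self.budget > 0 else self.full

mu = DMPR(period=6, budget=2.5, full=2)
print(mu.bandwidth, mu.processors)
```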
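[Figure 3: supply over time t of VP1, VP2 and VP3 for the example (Π, Θ, m′) = (6, 2.5, 2); the time axis is
marked at 0, Π, 2Π and 3Π, and a budget of Θ is shown in each period.]

Fig. 3 Worst-case resource supply pattern of µ = ⟨Π, Θ, m⟩.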

Observe that the partial processor of µ is represented by a single-processor periodic resource model
Ω = (Π, Θ) [Shin and Lee, 2003]. (However, it can also be represented by any other single-processor resource
model, such as the EDP model [Easwaran et al., 2007].) Based on this characteristic, we can easily derive the
worst-case supply pattern of µ (shown in Figure 3) and its supply bound function, which is given by the
following lemma:
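Lemma 2 The supply bound function of a DMPR model µ = ⟨Π, Θ, m⟩ is given by:
\[
SBF_\mu(t) =
\begin{cases}
mt, & \text{if } \Theta = 0 \lor (0 \le t \le \Pi - \Theta) \\
mt + y\Theta + \max\{0,\; t - 2(\Pi - \Theta) - y\Pi\}, & \text{otherwise}
\end{cases}
\]
where $y = \left\lfloor \frac{t - (\Pi - \Theta)}{\Pi} \right\rfloor$, for all t > Π − Θ.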

Proof Consider any interval of length t. Since the full processors of µ are always available, µ provides the
minimum resource supply iff the partial processor provides the worst-case supply. Since the partial processor
is a single-processor periodic resource model Ω = (Π, Θ), its minimum resource supply in an interval
of length t is given by [Shin and Lee, 2003]: SBF_Ω(t) = 0 if Θ = 0 or 0 ≤ t ≤ Π − Θ; otherwise,
SBF_Ω(t) = yΘ + max{0, t − 2(Π − Θ) − yΠ}, where y = ⌊(t − (Π − Θ))/Π⌋. In addition, the m full processors
of µ provide a total of mt resource units in any interval of length t. Hence, the minimum resource supply of
µ in an interval of length t is mt + SBF_Ω(t). This proves the lemma. □
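The supply bound function of Lemma 2 can be evaluated directly. The following is a minimal sketch, assuming the interface is passed as three plain numbers; the function name `sbf_dmpr` and its parameter names are hypothetical.

```python
import math

def sbf_dmpr(t, period, budget, full):
    """Supply bound function of a DMPR <period, budget, full> as stated in Lemma 2."""
    if budget == 0 or t <= period - budget:
        # Only the full processors are guaranteed to supply anything.
        return full * t
    y = math.floor((t - (period - budget)) / period)
    partial = y * budget + max(0.0, t - 2 * (period - budget) - y * period)
    return full * t + partial

# Values for the interface of Fig. 3, <6, 2.5, 2>:
for t in (3.5, 7, 8, 9.5):
    print(t, sbf_dmpr(t, 6, 2.5, 2))
```

For this interface the partial processor contributes nothing until t = 2(Π − Θ) = 7, which is the worst-case starvation interval of the partial processor.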
It is easy to show that, when a DMPR µ and an MPR Γ have the same period, bandwidth, and total number
of processors, then SBFµ (t) ≥ SBFΓ (t) for all t ≥ 0, and the worst-case starvation interval of µ is always
shorter than that of Γ .

5 Overhead-free Compositional Analysis
In this section, we present our method for computing the minimum-bandwidth DMPR interface for a component, assuming that the cache-related overhead is negligible. The overhead-aware interface computation
is considered in the next sections. We first recall some key results for components that are scheduled under
gEDF [Easwaran et al., 2009].


5.1 Component schedulability under gEDF
The demand of a task τi in a time interval [a, b] is the amount of computation that must be completed within
[a, b] to ensure that all jobs of τi with deadlines within [a, b] are schedulable. When τi = (pi , ei , di ) is
scheduled under gEDF, its demand in any interval of length t is upper bounded by [Easwaran et al., 2009]:
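\[
\begin{aligned}
dbf_i(t) &= \left\lfloor \frac{t + (p_i - d_i)}{p_i} \right\rfloor e_i + CI_i(t), \text{ where} \\
CI_i(t) &= \min\left\{ e_i,\; \max\left\{ 0,\; t - \left\lfloor \frac{t + (p_i - d_i)}{p_i} \right\rfloor p_i \right\} \right\}.
\end{aligned}
\tag{3}
\]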
In Eq. (3), CIi (t) denotes the maximum carry-in demand of τi in any time interval [a, b] with b − a = t, i.e.,
the maximum demand generated by a job of τi that is released prior to a but has not finished its execution
requirement at time a.
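Eq. (3) translates almost literally into code. This is a minimal sketch, assuming a task is given by its period p, WCET e and relative deadline d; the function names `carry_in` and `dbf` are hypothetical.

```python
import math

def carry_in(t, p, e, d):
    """Maximum carry-in demand CI_i(t) of a task (p, e, d), as in Eq. (3)."""
    return min(e, max(0, t - math.floor((t + (p - d)) / p) * p))

def dbf(t, p, e, d):
    """Upper bound dbf_i(t) on the demand of a task (p, e, d) under gEDF, Eq. (3)."""
    return math.floor((t + (p - d)) / p) * e + carry_in(t, p, e, d)

# Task (200, 100, 200) from Example 3:
print(dbf(200, 200, 100, 200))  # 100
print(dbf(250, 200, 100, 200))  # 150
```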
Consider a component C with a taskset τ = {τ1, ..., τn}, where τi = (pi, ei, di), and suppose the tasks in
C are schedulable under gEDF by a multiprocessor resource with m′ processors. From [Easwaran et al., 2009],
the worst-case demand of C that must be guaranteed to ensure the schedulability of τk in a time interval (a, b],
with b − a = t ≥ dk, is bounded by:
\[
DEM(t, m') = m' e_k + \sum_{\tau_i \in \tau} \hat{I}_{i,2} + \sum_{i \,:\, i \in L_{(m'-1)}} (\bar{I}_{i,2} - \hat{I}_{i,2})
\tag{4}
\]
where
\[
\begin{aligned}
\hat{I}_{i,2} &= \min\{ dbf_i(t) - CI_i(t),\; t - e_k \}, \quad \forall\, i \ne k, \\
\hat{I}_{k,2} &= \min\{ dbf_k(t) - CI_k(t) - e_k,\; t - d_k \}; \\
\bar{I}_{i,2} &= \min\{ dbf_i(t),\; t - e_k \}, \quad \forall\, i \ne k, \\
\bar{I}_{k,2} &= \min\{ dbf_k(t) - e_k,\; t - d_k \};
\end{aligned}
\]
and L_(m′−1) is the set of indices of all tasks τi that have Ī_{i,2} − Î_{i,2} being one of the (m′ − 1) largest such
values for all tasks.⁵ This leads to the following schedulability test for C:
Theorem 1 ([Easwaran et al., 2009]) A component C with a task set τ = {τ1, ..., τn}, where τi = (pi, ei, di),
is schedulable under gEDF by a multiprocessor resource model R with m′ processors in the absence of
overhead if, for each task τk ∈ τ and for all t ≥ dk, DEM(t, m′) ≤ SBF_R(t), where DEM(t, m′) is given
by Eq. (4) and SBF_R(t) gives the minimum total resource supply by R in an interval of length t.
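The demand bound of Eq. (4) can be computed term by term. The sketch below is an illustration under stated assumptions: tasks are (p, e, d) tuples, `k` is the index of the task under analysis, and all function names are hypothetical. It re-implements Eq. (3) locally so that the block is self-contained.

```python
import math

def _jobs(t, p, d):
    return math.floor((t + (p - d)) / p)

def carry_in(t, p, e, d):
    return min(e, max(0, t - _jobs(t, p, d) * p))

def dbf(t, p, e, d):
    return _jobs(t, p, d) * e + carry_in(t, p, e, d)

def dem(t, k, tasks, m_prime):
    """DEM(t, m') of Eq. (4) for task index k and an interval of length t >= d_k."""
    _, e_k, d_k = tasks[k]
    i_hat, i_bar = [], []
    for i, (p, e, d) in enumerate(tasks):
        with_ci = dbf(t, p, e, d)
        without_ci = with_ci - carry_in(t, p, e, d)
        if i == k:
            i_hat.append(min(without_ci - e_k, t - d_k))
            i_bar.append(min(with_ci - e_k, t - d_k))
        else:
            i_hat.append(min(without_ci, t - e_k))
            i_bar.append(min(with_ci, t - e_k))
    # Only the (m' - 1) largest differences contribute carry-in demand.
    diffs = sorted((b - h for b, h in zip(i_bar, i_hat)), reverse=True)
    return m_prime * e_k + sum(i_hat) + sum(diffs[:max(0, m_prime - 1)])

tasks = [(200, 100, 200)] * 4  # taskset of Example 3
print(dem(200, 0, tasks, 4), dem(250, 0, tasks, 4))
```

Theorem 1 then amounts to checking `dem(t, k, tasks, m_prime) <= sbf(t)` for every task index k and every relevant t, for whichever supply bound function the resource model provides.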
5.2 DMPR interface computation
In the absence of cache-related overhead, the minimum resource supply provided by a DMPR model µ =
⟨Π, Θ, m⟩ in any interval of length t is SBF_µ(t), which is given by Lemma 2. Since each domain schedules
its tasks under gEDF, the following theorem follows directly from Theorem 1.
Theorem 2 A domain D with a task set τ = {τ1, ..., τn}, where τi = (pi, ei, di), is schedulable under gEDF
by a DMPR model µ = ⟨Π, Θ, m⟩ if, for each τk ∈ τ and for all t ≥ dk,
\[
DEM(t, m_\mu) \le SBF_\mu(t),
\tag{5}
\]
where m_µ = m + 1 if Θ > 0, and m_µ = m otherwise.
We say that µ is a feasible DMPR for D if it guarantees the schedulability of D according to Theorem 2.
The next theorem derives a bound on the value of t that needs to be checked in Theorem 2.
5 Here, dk and t refer to Dk and Ak + Dk in [Easwaran et al., 2009], respectively.

Theorem 3 If Eq. (5) is violated for some value t, then it must also be violated for a value that satisfies the
condition
\[
t < \frac{C_\Sigma + m_\mu e_k + U + B}{\frac{\Theta}{\Pi} + m - U_T}
\tag{6}
\]
where $C_\Sigma$ is the sum of the $m_\mu - 1$ largest $e_i$; $U = \sum_{i=1}^{n} (p_i - d_i) \frac{e_i}{p_i}$;
$U_T = \sum_{i=1}^{n} \frac{e_i}{p_i}$; and $B = 2 \frac{\Theta}{\Pi} (\Pi - \Theta)$.
Proof The proof follows a similar line to the proof of Theorem 2 in [Easwaran et al., 2009]. Recall that
DEM(t, m_µ) is given by Eq. (4). According to Eq. (4), we have
\[
\hat{I}_{i,2} \le \left\lfloor \frac{t + (p_i - d_i)}{p_i} \right\rfloor e_i
\le \frac{t + (p_i - d_i)}{p_i} e_i
\le t \frac{e_i}{p_i} + \frac{p_i - d_i}{p_i} e_i .
\]
Therefore,
\[
\sum_{i=1}^{n} \hat{I}_{i,2} \le t \sum_{i=1}^{n} \frac{e_i}{p_i} + \sum_{i=1}^{n} \frac{p_i - d_i}{p_i} e_i = t U_T + U.
\]
Because the carry-in workload of τi is no more than ei, we derive
$\sum_{i \,:\, i \in L_{(m_\mu - 1)}} (\bar{I}_{i,2} - \hat{I}_{i,2}) \le C_\Sigma$. Thus,
\[
DEM(t, m_\mu) \le m_\mu e_k + t U_T + U + C_\Sigma .
\]
Further, SBF_µ(t) gives the worst-case resource supply of the DMPR model µ = ⟨Π, Θ, m⟩ over any
interval of length t. Based on Lemma 2, the resource supply of µ is the total resource supply of one partial VCPU
(Π, Θ) and m full VCPUs. From [Shin and Lee, 2003], the resource supply of the partial VCPU (Π, Θ) over
any interval of length t is at least (Θ/Π)(t − 2(Π − Θ)). In addition, the resource supply of m full VCPUs
over any interval of length t is mt. Hence, the resource supply of µ over any interval of length t is at least
mt + (Θ/Π)(t − 2(Π − Θ)). In other words,
\[
SBF_\mu(t) \ge mt + \frac{\Theta}{\Pi} (t - 2(\Pi - \Theta)).
\]
Suppose Eq. (5) is violated, i.e., DEM(t, m_µ) > SBF_µ(t) for some value t. Combining this with the
above results, we obtain
\[
m_\mu e_k + t U_T + U + C_\Sigma > mt + \frac{\Theta}{\Pi} (t - 2(\Pi - \Theta)),
\]
which, when the bandwidth Θ/Π + m exceeds U_T, is equivalent to
\[
t < \frac{C_\Sigma + m_\mu e_k + U + B}{\frac{\Theta}{\Pi} + m - U_T}.
\]
Hence, if Eq. (5) is violated for some value t, then t must satisfy Eq. (6). This proves the theorem. □
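The bound of Eq. (6) limits the range of t that a feasibility check has to scan. The following is a minimal sketch, assuming tasks are (p, e, d) tuples and that the interface bandwidth Θ/Π + m is strictly larger than the total utilization (otherwise the bound is not meaningful); the function name `t_bound` is hypothetical.

```python
def t_bound(tasks, k, period, budget, full):
    """Right-hand side of Eq. (6) for task index k and DMPR <period, budget, full>."""
    m_mu = full + 1 if budget > 0 else full
    e_k = tasks[k][1]
    # C_Sigma: sum of the (m_mu - 1) largest WCETs.
    wcets = sorted((e for _, e, _ in tasks), reverse=True)
    c_sigma = sum(wcets[:max(0, m_mu - 1)])
    u = sum((p - d) * e / p for p, e, d in tasks)
    u_t = sum(e / p for p, e, _ in tasks)
    b = 2 * (budget / period) * (period - budget)
    slack = budget / period + full - u_t
    if slack <= 0:
        raise ValueError("interface bandwidth must exceed the total utilization")
    return (c_sigma + m_mu * e_k + u + b) / slack

tasks = [(200, 100, 200)] * 4
print(t_bound(tasks, 0, period=40, budget=20, full=2))  # 1040.0
```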

The next lemma gives a condition for the minimum-bandwidth DMPR interface with a given period Π.
Lemma 3 A DMPR model µ∗ = ⟨Π, Θ∗, m∗⟩ is the minimum-bandwidth DMPR with period Π that can
guarantee the schedulability of a domain D only if m∗ ≤ m for all DMPR models µ = ⟨Π, Θ, m⟩ that can
guarantee the schedulability of D.
Proof Suppose m∗ > m for some DMPR µ = ⟨Π, Θ, m⟩ that can guarantee the schedulability of D. Then,
m∗ ≥ m + 1 and, hence, bw_µ∗ = m∗ + Θ∗/Π ≥ m + 1 + Θ∗/Π ≥ m + 1. Since Θ < Π, bw_µ = m + Θ/Π < m + 1.
Thus, bw_µ∗ > bw_µ, which implies that µ∗ cannot be the minimum-bandwidth DMPR with period Π. Hence the lemma.
Computing the domains’ interfaces. Let Di be a domain in the system and Πi be its given VCPU period
(cf. Section 2). The minimum-bandwidth interface of Di with period Πi is the minimum-bandwidth DMPR
model µi = ⟨Πi, Θi, mi⟩ that is feasible for Di. To obtain µi, we perform binary search on the number of
full processors m′i, and, for each value m′i, we compute the smallest value of Θ′i such that ⟨Πi, Θ′i, m′i⟩ is
feasible for Di (using Theorem 2).⁶ Then mi is the smallest value of m′i for which a feasible interface is
found, and Θi is the smallest budget Θ′i computed for mi.
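The search for the minimum-bandwidth interface can be sketched as follows. This is a simplified illustration rather than the procedure used in the paper: it scans the number of full processors linearly instead of by binary search, restricts the budget to a grid of width `step`, and takes the feasibility test of Theorem 2 as a caller-supplied predicate `is_feasible` (a hypothetical name).

```python
def min_dmpr_interface(is_feasible, period, m_lo, m_hi, step=1):
    """Return (period, budget, m) with the smallest m in [m_lo, m_hi] that admits a
    feasible budget on the grid {0, step, 2*step, ...}, and the smallest such budget.
    Returns None if no candidate is feasible."""
    for m in range(m_lo, m_hi + 1):
        budget = 0
        while budget < period:
            if is_feasible(period, budget, m):
                return (period, budget, m)
            budget += step
    return None

# Toy predicate standing in for Theorem 2: feasible iff bandwidth >= 2.3.
toy = lambda period, budget, m: m * period + budget >= 23
print(min_dmpr_interface(toy, period=10, m_lo=0, m_hi=4))  # (10, 3, 2)
```

By Lemma 3, the interface with the fewest full processors is the only candidate for minimum bandwidth, which is why the scan can stop at the first m that yields a feasible budget.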
Computing the system’s interface. The interface of the system can be obtained by composing the interfaces
µi of all domains Di in the system under the VMM’s semi-partitioned EDF policy (c.f. Section 2). Let D
denote the number of domains of the platform.
Observe that each interface µi = hΠi , Θi , mi i can be transformed directly into an equivalent set of mi
full VCPUs (with budget Πi and period Πi ) and, if Θi > 0, a partial VCPU with budget Θi and period Πi .
Let C be a component that contains all the partial VCPUs that are transformed from the domains’ interfaces.
Then the VCPUs in C are scheduled together under gEDF, whereas all the full VCPUs are each mapped to a
dedicated core.
Since each partial VCPU in C is implemented as a periodic server, which is essentially a periodic task, we
can compute the minimum-bandwidth DMPR interface µC = ⟨ΠC, ΘC, mC⟩ that is feasible for C by the same
technique used for domains. Combining µC with the full VCPUs of the domains, we can see that the system
must be guaranteed $m_C + \sum_{1 \le i \le D} m_i$ full processors and a partial processor, with budget ΘC and period
ΠC, to ensure the schedulability of the system. The next theorem directly follows from this observation.
Theorem 4 Let µi = ⟨Πi, Θi, mi⟩ be the minimum-bandwidth DMPR interface of domain Di, for all
1 ≤ i ≤ D. Let C be a component with the taskset
τC = {(Πi, Θi, Πi) | 1 ≤ i ≤ D ∧ Θi > 0},
which is scheduled under gEDF. Then the minimum-bandwidth DMPR interface with period ΠC of the system
is given by µsys = ⟨ΠC, ΘC, msys⟩, where µC = ⟨ΠC, ΘC, mC⟩ is a minimum-bandwidth DMPR interface
with period ΠC of C and $m_{sys} = m_C + \sum_{1 \le i \le D} m_i$.
Based on the system’s interface, one can easily derive the schedulability of the system as follows (the
lemma comes directly from the interface’s definition):
Lemma 4 Let M be the number of physical cores of the platform. The system is schedulable if M ≥ msys + 1,
or if M = msys and ΘC = 0, where ⟨ΠC, ΘC, msys⟩ is the minimum-bandwidth DMPR interface of the system.
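The composition step of Theorem 4 and the check of Lemma 4 can be sketched as below. Interfaces are represented as (period, budget, full) tuples, the interface of the partial-VCPU component C is assumed to have been computed separately, and all function names are hypothetical.

```python
def partial_vcpu_taskset(domain_interfaces):
    """Taskset of component C in Theorem 4: one implicit-deadline task (Π, Θ, Π)
    per domain whose interface has a non-zero partial budget."""
    return [(pi, theta, pi) for (pi, theta, _) in domain_interfaces if theta > 0]

def system_full_processors(domain_interfaces, m_c):
    """m_sys = m_C plus the full processors of all domain interfaces."""
    return m_c + sum(m for (_, _, m) in domain_interfaces)

def system_schedulable(num_cores, theta_c, m_sys):
    """Sufficient condition of Lemma 4 on a platform with num_cores physical cores."""
    return num_cores >= m_sys + 1 or (num_cores == m_sys and theta_c == 0)

domains = [(5, 3, 1), (8, 3, 1), (6, 4, 0), (10, 0, 2)]
print(partial_vcpu_taskset(domains))               # [(5, 3, 5), (8, 3, 8), (6, 4, 6)]
print(system_full_processors(domains, m_c=1))      # 5
print(system_schedulable(6, theta_c=2, m_sys=5))   # True
```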
The results obtained above assume that the cache-related overhead is negligible. We will next develop the
analysis in the presence of cache-related overhead.

6 Cache-Related Overhead Scenarios
In this section, we characterize the different events that cause cache-related overhead; this is needed for the
cache-aware analysis in Sections 7 and 8.
Cache-related overhead in a multicore virtualization platform is caused by (1) task preemption within the
same domain, (2) VCPU preemption, and (3) VCPU exhaustion of budget. We discuss each of them in detail
below.
6.1 Event 1: Task-preemption event
Since tasks within a domain are scheduled under gEDF, a newly released higher-priority task preempts a
currently executing lower-priority task of the same domain, if none of the domain’s VCPUs are idle. When
a preempted task resumes its execution, it may experience cache misses: its cache content may have been
evicted from the cache by the preempting task (or tasks with a higher priority than the preempting task, if a
nested preemption occurs), or the task may be resumed on a different VCPU that is running on a different
core, in which case the task’s cache content may not be present in the new core’s cache. Hence the following
definition:
6 Note that the number of full processors is always bounded from below by ⌊Ui⌋, where Ui is the total utilization of the tasks in
Di, and bounded from above by the number of tasks in Di or the number of physical cores of the platform (if given), whichever is smaller.
Definition 2 (Task-preemption event) A task-preemption event of τi is said to occur when a job of another
task τj in the same domain is released and this job can preempt the current job of τi .
Fig. 4 illustrates the worst-case scenario of the overhead caused by a task-preemption event. In the
figure, a preemption event of τ1 happens at time t = 3 when τ3 is released (and preempts τ1 ). Due to this
event, τ1 experiences a cache miss at time t = 5 when it resumes. Since τ1 resumes on a different core,
all the cache blocks it will reuse have to be reloaded into the new core’s cache, which results in cache-related
preemption/migration overhead on τ1 . (Note that the cache content of τ1 is not necessarily reloaded all at once,
but rather during its remaining execution after it has been resumed; however, for ease of exposition, we show
the combined overhead at the beginning of its remaining execution).

Fig. 4 Cache-related overhead of a task-preemption event.

Since gEDF is work-conserving, tasks do not suspend themselves, and each task resumes at most once
after each time it is preempted. Therefore, each task τk experiences the overhead caused by each of its
task-preemption events at most once, and this overhead is bounded from above by $\Delta^{crpmd}_{\tau_k}$.
Lemma 5 A newly released job of τj preempts a job of τi under gEDF only if dj < di.
Proof Suppose dj ≥ di and a newly released job Jj of τj preempts a job Ji of τi. Then, Jj must be released
later than Ji. As a result, the absolute deadline of Jj is later than Ji’s (since dj ≥ di), which contradicts the
assumption that Jj preempts Ji under gEDF. This proves the lemma. □
The maximum number of task-preemption events in each period of τi is given by the next lemma.
Lemma 6 (Number of task-preemption events) The maximum number of task-preemption events of τi
under gEDF during each period of τi, denoted by $N^1_{\tau_i}$, is bounded by
\[
N^1_{\tau_i} \le \sum_{\tau_j \in HP(\tau_i)} \left\lceil \frac{d_i - d_j}{p_j} \right\rceil
\tag{7}
\]
where HP(τi) is the set of tasks τj within the same domain as τi with dj < di.
Proof Let $\tau_i^c$ be the current job of τi in a period of τi, and let $r_i^c$ be its release time. From Lemma 5, only jobs
of a task τj with dj < di and in the same domain can preempt $\tau_i^c$. Further, for each such τj, only the jobs that
are released after $\tau_i^c$ and that have absolute deadlines no later than that of $\tau_i^c$ can preempt $\tau_i^c$. In other words, only
jobs that are released within the interval $(r_i^c, r_i^c + d_i - d_j]$ can preempt $\tau_i^c$. As a result, the maximum number
of task-preemption events of τi under gEDF is no more than $\sum_{\tau_j \in HP(\tau_i)} \left\lceil \frac{d_i - d_j}{p_j} \right\rceil$. □
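The bound of Eq. (7) is straightforward to evaluate. The sketch below assumes the tasks of one domain are given as (p, e, d) tuples; the function name `max_task_preemptions` is hypothetical.

```python
import math

def max_task_preemptions(i, tasks):
    """Upper bound of Eq. (7) on the task-preemption events of task i per period."""
    _, _, d_i = tasks[i]
    return sum(
        math.ceil((d_i - d_j) / p_j)
        for j, (p_j, _, d_j) in enumerate(tasks)
        if j != i and d_j < d_i  # HP(τi): tasks with a shorter relative deadline
    )

# Tasks of domain D2 in Example 4: τ1 (8, 4, 8), τ2 (6, 2, 6), τ3 (10, 1.5, 10).
d2 = [(8, 4, 8), (6, 2, 6), (10, 1.5, 10)]
print([max_task_preemptions(i, d2) for i in range(3)])  # [1, 0, 2]
```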


6.2 VCPU-preemption event
Definition 3 (VCPU-preemption event) A VCPU-preemption event of VPi occurs when VPi is preempted
by a higher-priority VCPU VPj of another domain.
When a VCPU VPi is preempted, the currently running task τl on VPi may migrate to another VCPU
VPk of the same domain and may preempt the currently running task τm on VPk. This can cause the tasks
running on VPk to experience cache-related preemption or migration overhead twice in the worst case, as is
illustrated in the following example.
Example 4 The system consists of three domains D1 -D3 . D1 has VCPUs VP1 (full) and VP2 (partial); D2
has VCPUs VP3 (full) and VP4 (partial); and D3 has one partial VCPU VP5 . The partial VCPUs of the
domains – VP2 (5, 3), VP4 (8, 3) and VP5 (6, 4) – are scheduled under gEDF on cpu1 and cpu2 , as is shown
in Fig. 5(a). In addition, domain D2 consists of three tasks, τ1 (8, 4, 8), τ2 (6, 2, 6) and τ3 (10, 1.5, 10), which
are scheduled under gEDF on its VCPUs (Fig. 5(b)).

(a) Scheduling scenario of VCPUs.

(b) Cache overhead of tasks in D2 .

Fig. 5 Cache overhead due to a VCPU-preemption event.

As is shown in Fig. 5(a), a VCPU-preemption event occurs at time t = 2, when VP4 (of D2 ) is preempted
by VP2 . Observe that, within D2 at this instant, τ2 is running on VP4 and τ1 is running on VP3 . Since τ2 has
an earlier deadline than τ1 , it is migrated to VP3 and preempts τ1 there. Since VP3 is mapped to a different
core from cpu1 , τ2 has to reload its useful cache content to the cache of the new core at t = 2. Further, when
τ1 resumes at time t = 3.5, it has to reload the useful cache blocks that may have been evicted from the cache
by τ2 . Hence, the VCPU-preemption event of VP4 causes overhead for both of the tasks in its domain.
Lemma 7 Each VCPU-preemption event causes at most two tasks to experience a cache miss. Further, the
cache-related overhead it causes is at most $\Delta^{crpmd}_{C} = \max_{\tau_i \in C} \Delta^{crpmd}_{\tau_i}$, where C is the component that has
the preempted VCPU.
Proof At most one task is running on a VCPU at any time. Hence, when a VCPU VPi of C is preempted, at
most one task (τm) on VPi is migrated to another VCPU VPj, and this task preempts at most one task (τl)
on VPj. As a result, at most two tasks (i.e., τm and τl) incur a cache miss because of the VCPU-preemption
event. (Note that τl cannot immediately preempt another task τn because otherwise, τm would have migrated
to the VCPU on which τn is running and preempted τn instead.) Further, since the overhead caused by each
cache miss in C is at most $\Delta^{crpmd}_{C} = \max_{\tau_i \in C} \Delta^{crpmd}_{\tau_i}$, the maximum overhead caused by the resulting cache
misses is at most $2\Delta^{crpmd}_{C}$. □
Since the partial VCPUs are scheduled under gEDF as implicit-deadline tasks (i.e., the task periods are
equal to their relative deadlines), the number of VCPU-preemption events of a partial VCPU VPi during each
VPi ’s period also follows Lemma 6. The next lemma is implied directly from this observation.


Lemma 8 (Number of VCPU-preemption events) Let VPi = (Πi, Θi) for all partial VCPUs VPi of the
domains. Let HP(VPi) be the set of VPj with 0 < Θj < Πj < Πi. Denote by $N^2_{VP_i}$ and $N^2_{VP_i,\tau_k}$ the
maximum number of VCPU-preemption events of VPi during each period of VPi and during each period of
τk inside VPi’s domain, respectively. Then,
\[
N^2_{VP_i} \le \sum_{VP_j \in HP(VP_i)} \left\lceil \frac{\Pi_i - \Pi_j}{\Pi_j} \right\rceil
\tag{8}
\]
\[
N^2_{VP_i,\tau_k} \le \sum_{VP_j \in HP(VP_i)} \left\lceil \frac{p_k}{\Pi_j} \right\rceil .
\tag{9}
\]
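Eqs. (8) and (9) can be evaluated as follows. This is a minimal sketch, assuming the partial VCPUs are given as (period, budget) pairs; the function names are hypothetical.

```python
import math

def hp_vcpus(i, vcpus):
    """HP(VPi): partial VCPUs VPj with 0 < Θj < Πj < Πi."""
    period_i, _ = vcpus[i]
    return [(pj, bj) for j, (pj, bj) in enumerate(vcpus)
            if j != i and 0 < bj < pj < period_i]

def n2_per_vcpu_period(i, vcpus):
    """Eq. (8): VCPU-preemption events of VPi per period of VPi."""
    period_i, _ = vcpus[i]
    return sum(math.ceil((period_i - pj) / pj) for pj, _ in hp_vcpus(i, vcpus))

def n2_per_task_period(i, vcpus, p_k):
    """Eq. (9): VCPU-preemption events of VPi per period p_k of a task in its domain."""
    return sum(math.ceil(p_k / pj) for pj, _ in hp_vcpus(i, vcpus))

# Partial VCPUs of Example 4: VP2 (5, 3), VP4 (8, 3), VP5 (6, 4).
vcpus = [(5, 3), (8, 3), (6, 4)]
print(n2_per_vcpu_period(1, vcpus))       # VP4: 2
print(n2_per_task_period(1, vcpus, 8))    # VP4 and τ1 with p = 8: 4
```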

6.3 VCPU-completion event
Definition 4 (VCPU-completion event) A VCPU-completion event of VPi happens when VPi exhausts its
budget in a period and stops its execution.
As with VCPU-preemption events, each VCPU-completion event causes at most two tasks to experience a
cache miss, as given by Lemma 9.
Lemma 9 Each VCPU-completion event causes at most two tasks to experience a cache miss.
Proof The effect of a VCPU-completion event is very similar to that of a VCPU-preemption event. When
VPi finishes its budget and stops, the running task τm on VPi may migrate to another running VCPU VPj,
and τm may preempt at most one task τl on VPj. Hence, at most two tasks incur a cache miss due to a
VCPU-completion event. □
Lemma 10 (Number of VCPU-completion events) Let $N^3_{VP_i}$ and $N^3_{VP_i,\tau_k}$ be the number of VCPU-completion
events of VPi in each period of VPi and in each period of τk inside VPi’s domain. Then,
\[
N^3_{VP_i} \le 1
\tag{10}
\]
\[
N^3_{VP_i,\tau_k} \le \left\lceil \frac{p_k - \Theta_i}{\Pi_i} \right\rceil + 1
\tag{11}
\]
Proof Eq. (10) holds because VPi completes its budget at most once every period. Further, observe that τk
experiences the worst-case number of VCPU-completion events when (1) its period ends at the same time as
the budget finish time of VPi’s current period, and (2) VPi finishes its budget as soon as possible (i.e., Θi time
units from the beginning of the VCPU’s period) in the current period and as late as possible (i.e., at the end of
the VCPU’s period) in all its preceding periods. Eq. (11) follows directly from this worst-case scenario. □
VCPU-stop event. Since a VCPU stops its execution when its VCPU-completion or VCPU-preemption
event occurs, we define a VCPU-stop event that includes both types of events. That is, a VCPU-stop event
of VPi occurs when VPi stops its execution because its budget is finished or because it is preempted by a
higher-priority VCPU. Since VCPU-stop events include both VCPU-completion events and VCPU-preemption
events, the maximum number of VCPU-stop events of VPi during each VPi’s period, denoted as $N^{stop}_{VP_i}$,
satisfies
\[
N^{stop}_{VP_i} = N^2_{VP_i} + N^3_{VP_i} \le \sum_{VP_j \in HP(VP_i)} \left\lceil \frac{\Pi_i - \Pi_j}{\Pi_j} \right\rceil + 1
\tag{12}
\]
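The counts of Eqs. (11) and (12) can be computed in the same style. The sketch below assumes the periods of the higher-priority partial VCPUs in HP(VPi) have already been collected into a list `hp_periods`; all names are hypothetical.

```python
import math

def n3_per_task_period(p_k, period_i, budget_i):
    """Eq. (11): VCPU-completion events of VPi per period p_k of a task in its domain."""
    return math.ceil((p_k - budget_i) / period_i) + 1

def n_stop_per_vcpu_period(period_i, hp_periods):
    """Eq. (12): VCPU-stop events of VPi per period of VPi (preemptions plus one completion)."""
    n2 = sum(math.ceil((period_i - pj) / pj) for pj in hp_periods)
    n3 = 1  # Eq. (10): the budget is exhausted at most once per period
    return n2 + n3

# VP4 = (8, 3) from Example 4, with HP(VP4) = {VP2 (5, 3), VP5 (6, 4)}:
print(n3_per_task_period(p_k=8, period_i=8, budget_i=3))  # 2
print(n_stop_per_vcpu_period(8, [5, 6]))                  # 3
```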

Overview of the overhead-aware compositional analysis. Based on the above quantification, in the next
two sections we develop two different approaches, task-centric and model-centric, for the overhead-aware
interface computation. Although the interfaces obtained by both approaches are safe and can each be used
independently, we combine them to obtain the interface with the smallest bandwidth as the final result.


7 Task-centric Compositional Analysis
This section introduces two task-centric analysis methods to account for the cache-related overhead in the
interface computation. The first, denoted as BASELINE, accounts for the overhead by inflating the WCET of
every task in the system with the maximum overhead it experiences within each of its periods. The second,
denoted as TASK-CENTRIC-UB, combines the result of the first method with an upper bound on the number
of VCPUs that each domain needs in the presence of cache-related overhead. We describe each method in
detail below.
7.1 BASELINE: Analysis based on WCET-inflation
As was discussed in Section 6, the overhead that a task experiences during its lifetime is composed of the
overhead caused by task-preemption events, VCPU-preemption events and VCPU-completion events. In
addition, when one of the above events occurs, each task τk experiences at most one cache miss overhead and,
hence, a delay of at most $\Delta^{crpmd}_{\tau_k}$. From [Brandenburg, 2011], the cache overhead caused by a task-preemption
event can be accounted for by inflating the higher-priority task τi of the event with the maximum cache
overhead caused by τi. From Lemmas 8 and 10, we conclude that the maximum overhead τk experiences
within each period is
\[
\delta^{crpmd}_{\tau_k} = \max_{\tau_i \in LP(\tau_k)} \{\Delta^{crpmd}_{\tau_i}\} + \Delta^{crpmd}_{\tau_k} \left( N^2_{VP_i,\tau_k} + N^3_{VP_i,\tau_k} \right)
\]
where LP(τk) is the set of tasks τi within the same domain as τk with di > dk, and VPi is the partial VCPU
of the domain of τk. As a result, the worst-case execution time of τk in the presence of cache overhead is at
most
\[
e'_k = e_k + \delta^{crpmd}_{\tau_k}.
\tag{13}
\]
Thus, we can state the following theorem:
Theorem 5 A component with a taskset τ = {τ1, ..., τn}, where τk = (pk, ek, dk), is schedulable under
gEDF by a DMPR model µ in the presence of cache-related overhead if its inflated taskset τ′ = {τ′1, ..., τ′n} is
schedulable under gEDF by µ in the absence of cache-related overhead, where τ′k = (pk, e′k, dk), and e′k is
given by Eq. (13).
Based on Theorem 5, we can compute the DMPR interfaces of the domains and the system by first inflating
the WCET of each task τk in each domain with the overhead $\delta^{crpmd}_{\tau_k}$ and then applying the same method as the
overhead-free interface computation in Section 5.2.⁷
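The WCET inflation of Eq. (13) can be sketched as below. The code assumes tasks are (p, e, d) tuples, `delta[i]` holds the per-task cache overhead bound of task i, the domain has one partial VCPU (period_i, budget_i), and `hp_periods` lists the periods of the VCPUs in HP(VPi). Treating the max over an empty LP(τk) as 0 is an assumption of this sketch; all names are hypothetical.

```python
import math

def baseline_inflated_wcet(k, tasks, delta, period_i, budget_i, hp_periods):
    """Inflated WCET e'_k of Eq. (13) for task index k."""
    p_k, e_k, d_k = tasks[k]
    # Task-preemption term: largest overhead among lower-priority tasks (d_i > d_k).
    lower = [delta[i] for i, (_, _, d) in enumerate(tasks) if d > d_k]
    task_term = max(lower, default=0)
    # VCPU-preemption and VCPU-completion counts per period of τk, Eqs. (9) and (11).
    n2 = sum(math.ceil(p_k / pj) for pj in hp_periods)
    n3 = math.ceil((p_k - budget_i) / period_i) + 1
    return e_k + task_term + delta[k] * (n2 + n3)

# Domain D2 of Example 4 with an assumed overhead of 0.5 per task and VP4 = (8, 3).
d2 = [(8, 4, 8), (6, 2, 6), (10, 1.5, 10)]
delta = [0.5, 0.5, 0.5]
print(baseline_inflated_wcet(0, d2, delta, 8, 3, [5, 6]))  # 7.5
```

The example shows how quickly the per-task accounting grows: every task is charged for every VCPU-stop event that can fall within its period.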
7.2 TASK-CENTRIC-UB: Combination of BASELINE with an upper bound on the number of VCPUs
Recall from Section 6 that VCPU-preemption events and VCPU-completion events happen only when the
component has a partial VCPU. Therefore, the taskset in a component with no partial VCPU experiences only
the cache overhead caused by task-preemption events. Recall that when a task-preemption event happens, the
corresponding lower-priority task τi experiences a cache miss delay of at most $\Delta^{crpmd}_{\tau_i}$. Thus, the maximum
cache overhead that a high-priority task τk causes to any preempted task is $\max_{\tau_i \in LP(\tau_k)} \Delta^{crpmd}_{\tau_i}$, where
LP(τk) is the set of tasks τi within the same domain as τk that have di > dk. As a result, the worst-case
execution time of τk in the presence of cache overhead caused by task-preemption events is at most
\[
e''_k = e_k + \max_{\tau_i \in LP(\tau_k)} \Delta^{crpmd}_{\tau_i},
\tag{14}
\]
where τi ∈ LP(τk) if di > dk. This implies the following lemma:
7 Note that we inflate only the tasks’ WCETs and not the VCPUs’ budgets, since $\delta^{crpmd}_{\tau_k}$ includes the overhead for reloading the
useful cache content of a preempted VCPU when it resumes.

Lemma 11 A component with a taskset τ = {τ1, ..., τn}, where τk = (pk, ek, dk), is schedulable under
gEDF by a DMPR model µ̄ = ⟨Π, 0, m̄⟩ in the presence of cache-related overhead if its inflated taskset
τ″ = {τ″1, ..., τ″n} is schedulable under gEDF by µ″ = ⟨Π, Θ″, m″⟩ in the absence of cache-related
overhead, where τ″k = (pk, e″k, dk), e″k is given by Eq. (14), and m̄ = m″ + ⌈Θ″/Π⌉. Further, the maximum
number of full VCPUs of the interface of the taskset τ in the presence of cache overhead is m̄.
Proof First, observe that the inflated taskset τ″ safely accounts for all the cache overhead experienced by τ.
This is because (1) inflating the worst-case execution time of each task τk with $\max_{\tau_i \in LP(\tau_k)} \Delta^{crpmd}_{\tau_i}$ is safe
to account for the cache overhead delay caused by task-preemption events (as was proven in [Brandenburg,
2011]), and (2) the DMPR model µ̄ has no partial VCPU and thus, τ does not experience any cache overhead
caused by VCPU-preemption events or VCPU-completion events. Further, based on Lemma 2, one can easily
show that the resource supply bound function SBF_µ(t) of a DMPR model µ = ⟨Π, Θ, m⟩ is monotonically
non-decreasing with the budget of µ when the period of µ is fixed. In other words, SBF_µ̄(t) ≥ SBF_µ″(t)
for all t. Combining the above observations, we conclude that τ is schedulable under the resource model µ̄ in the
presence of cache overhead if τ″ is schedulable under the resource model µ″ in the absence of cache overhead.
This proves the first part of the lemma.
Since τ is schedulable under the resource model µ̄ in the presence of cache overhead, the number of full
VCPUs of the overhead-aware interface of τ is always less than or equal to the ceiling of the bandwidth of µ̄,
which is exactly m̄. □
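The two quantities used in Lemma 11 can be computed as follows. This is a minimal sketch, assuming tasks are (p, e, d) tuples and `delta[i]` is the per-task cache overhead bound; taking the max over an empty LP(τk) as 0 is an assumption, and the function names are hypothetical.

```python
import math

def task_preemption_inflated_wcet(k, tasks, delta):
    """Inflated WCET e''_k of Eq. (14): only task-preemption overhead is charged."""
    _, e_k, d_k = tasks[k]
    lower = (delta[i] for i, (_, _, d) in enumerate(tasks) if d > d_k)
    return e_k + max(lower, default=0)

def max_full_vcpus(period, budget2, full2):
    """Upper bound of Lemma 11 on the number of full VCPUs, given the overhead-free
    interface <period, budget2, full2> of the inflated taskset."""
    return full2 + math.ceil(budget2 / period)

print(max_full_vcpus(80, 64, 1))  # 2
```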
Note that the maximum number of full VCPUs given by Lemma 11 can be larger or smaller than the
interface bandwidth computed by the BASELINE method, as is illustrated in the following two examples.
Example 5 Consider a system Sys1 consisting of two domains, C1 and C2, with workloads τC1 = {τ11 =
· · · = τ13 = (100, 40, 100)} and τC2 = {τ21 = · · · = τ23 = (100, 40, 100)}, respectively. Suppose that
Sys1 employs the hybrid EDF scheduling strategy described in Section 2; the periods of DMPR interfaces
of C1, C2 and Sys1 are set to 80, 40 and 20, respectively; and the cache overhead per task is 1. Then, the
DMPR cache-aware interface of C1 computed using the BASELINE method is µC1 = ⟨80, 76, 1⟩, which has a
bandwidth of 1 + 76/80 = 1.95.
In contrast, if we only consider the cache overhead caused by task-preemption events, then the interface of
the system is given by µ″C1 = ⟨80, 64, 1⟩. Based on Lemma 11, the maximum number of full VCPUs of C1 is
1 + ⌈64/80⌉ = 2, and the corresponding DMPR interface is µ̄C1 = ⟨80, 0, 2⟩. Thus, the interface computed by
the BASELINE method has a smaller bandwidth than the maximum number of full VCPUs given by Lemma 11.
Example 6 Consider a system Sys2 that is identical to the system Sys1 in Example 5, except that the cache
overhead for each task is 5 instead of 1. In this case, the cache-aware interface of C1 computed using the
BASELINE method is µ̄C1 = ⟨80, 72, 2⟩, which has a bandwidth of 2 + 72/80 = 2.9. In contrast, if we
consider only the cache overhead caused by task-preemption events, then the interface of the system is given by
µ″C1 = ⟨80, 74, 1⟩. Based on Lemma 11, the maximum number of full VCPUs is 1 + ⌈74/80⌉ = 2. Therefore,
the interface computed by the BASELINE method has a larger bandwidth than the maximum number of full
VCPUs given by Lemma 11.
Since the interface µ̄ given by Lemma 11 does not always have a smaller bandwidth than the interface
computed using the BASELINE method, we combine the two interfaces to derive the minimum-bandwidth
DMPR interface in the presence of overhead, as is given by Theorem 6. The correctness of this theorem is
derived directly from the correctness of Lemma 11 and Theorem 5.
Theorem 6 Let C be a component with a taskset τ = {τ1, ..., τn} that is schedulable by the gEDF scheduler,
where τk = (pk, ek, dk) for all 1 ≤ k ≤ n. Suppose µ′C = ⟨Π, Θ′, m′⟩ is the feasible DMPR interface given
by Theorem 5, and m″ is the maximum number of full VCPUs of C given by Lemma 11. Then, the component
C is schedulable under the DMPR interface µC, where µC = µ′C if m″ > m′ + Θ′/Π, and µC = ⟨Π, 0, m″⟩
otherwise.
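The selection rule of Theorem 6 is a single comparison. The sketch below applies it to the interfaces quoted in Examples 5 and 6; the function name `task_centric_ub_interface` and the tuple layout (period, budget, full) are assumptions.

```python
def task_centric_ub_interface(baseline, m_bar):
    """Theorem 6: keep the BASELINE interface if its bandwidth is below the bound
    m_bar of Lemma 11; otherwise use m_bar full VCPUs and no partial VCPU."""
    period, budget, full = baseline
    if m_bar > full + budget / period:
        return baseline
    return (period, 0, m_bar)

print(task_centric_ub_interface((80, 76, 1), 2))  # Example 5: (80, 76, 1)
print(task_centric_ub_interface((80, 72, 2), 2))  # Example 6: (80, 0, 2)
```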

Interface computation under the TASK-CENTRIC-UB method: Based on the above results, the overhead-aware
interface for a system can be obtained by first computing the interface for each domain using Theorem 6,
and then computing the system’s interface by applying the overhead-free interface computation in Section 5.

7.3 TASK-CENTRIC-UB vs. BASELINE
As was discussed in Section 7.2, the interface of a domain computed by the TASK-CENTRIC-UB method
always has a bandwidth no larger than the bandwidth of the interface computed by the BASELINE method. We
will show that this relationship also holds for the interfaces at the system level. We first define the dominance
relation between any two analysis methods as follows:
Definition 5 A compositional analysis method CSA is said to dominate another compositional analysis
method CSA′ iff for any system S, the interface bandwidth of S when computed using CSA is always less
than or equal to the interface bandwidth of S when computed using CSA′.
Lemma 12 The TASK-CENTRIC-UB method always dominates the BASELINE method.
Proof Consider a system S with D domains, {C1, ..., CD}. Let µCi = ⟨Πi, Θi, mi⟩ and µ′Ci = ⟨Πi, Θ′i, m′i⟩
be the minimum-bandwidth DMPR interfaces of Ci under the TASK-CENTRIC-UB method and the BASELINE
method, respectively. We have the following:
– Under the TASK-CENTRIC-UB method, the system has a set of partial VCPUs, VPpart = {VP1 =
(Π1, Θ1), ..., VPD = (ΠD, ΘD)}, and (m1 + ... + mD) full VCPUs. Based on the analysis in Section 5,
the minimum-bandwidth DMPR interface of S is given by µS = ⟨ΠC, ΘC, mS⟩, where µC =
⟨ΠC, ΘC, mC⟩ is the minimum-bandwidth DMPR interface for VPpart and $m_S = m_C + \sum_{1 \le i \le D} m_i$.
– Under the BASELINE method, the system has a set of partial VCPUs, VP′part = {VP′1 = (Π1, Θ′1),
..., VP′D = (ΠD, Θ′D)}, and (m′1 + ... + m′D) full VCPUs. Therefore, the minimum-bandwidth DMPR
interface of the system is given by µ′S = ⟨ΠC, Θ′C, m′S⟩, where µ′C = ⟨ΠC, Θ′C, m′C⟩ is the minimum-bandwidth
DMPR interface of the partial VCPU set VP′part, and $m'_S = m'_C + \sum_{1 \le i \le D} m'_i$.
From Theorem 6, there are two cases for the relationship between µCi and µ′Ci:
1. Θi = Θ′i and mi = m′i, if the interface bandwidth computed by the BASELINE method is less than or
equal to the maximum number of full VCPUs of Ci given by Lemma 11 (i.e., m′i + Θ′i/Πi ≤ mi + Θ/Π);
2. Θi = 0 and mi ≤ m′i, otherwise.
We can conclude from the above cases that for all partial VCPUs VPi and VP′i computed respectively by the
TASK-CENTRIC-UB method and the BASELINE method, VPi = VP′i, or VPi has budget equal to 0 whereas
VP′i has budget larger than 0. In other words, VPpart ⊆ VP′part.
Because VPpart is only a subset of VP′part, we can derive from Eq. (4) that the resource demand of VPpart
is always less than or equal to the resource demand of VP′part. Therefore, if VP′part is schedulable under the
DMPR interface µ′C, then VPpart is also schedulable under µ′C. Because µC is the bandwidth-optimal DMPR
interface of VPpart, the bandwidth of µC is no larger than the bandwidth of µ′C, i.e.,
$\frac{\Theta_C}{\Pi_C} + m_C \le \frac{\Theta'_C}{\Pi_C} + m'_C$.
In addition, $\sum_{1 \le i \le D} m_i \le \sum_{1 \le i \le D} m'_i$, because mi ≤ m′i. Hence, the bandwidth of µS, which is equal to
$\frac{\Theta_C}{\Pi_C} + m_C + \sum_{1 \le i \le D} m_i$, is no larger than the bandwidth of µ′S, which is
$\frac{\Theta'_C}{\Pi_C} + m'_C + \sum_{1 \le i \le D} m'_i$. This proves the lemma. □

8 Model-centric Compositional Analysis
Recall from Section 6 that each VCPU-stop event (i.e., VCPU-preemption or VCPU-completion event) of
VPi causes at most one cache miss overhead for at most two tasks of the same domain. However, since it is
unknown which two tasks may be affected, the BASELINE method in Section 7 assumes that every task τk of
the same domain is affected by all the VCPU-stop events of VPi (and thus includes all of the corresponding
overheads in the inflated WCET of the task). While this approach is safe, it is very conservative, especially
when the number of tasks or the number of events is high.
In this section, we propose an alternative method, called MODEL - CENTRIC, that avoids the above assumption to minimize the pessimism of the analysis. The idea is to account for the total overhead due to VCPU-stop
events that is incurred by all tasks in a domain, rather than by each task individually. This combined overhead is
the overhead that the domain as a whole experiences due to VCPU-stop events under a given DMPR interface
µ of the domain (since the budget of the partial VCPU of a domain is determined by the domain’s interface).
Therefore, the effective resource supply that a domain receives from a DMPR interface µ in the presence of
VCPU-stop events is the total resource supply that µ provides, less the combined overhead.
8.1 Challenge: Resource parallel supply problem
Based on the overhead scenarios in Section 6, at first it seems possible to account for the overhead of the
VCPU-preemption and VCPU-completion events by inflating the budget of an overhead-free interface with the
cache-related overhead caused by the VCPU-preemption and VCPU-completion events that occur within a
period of the overhead-free interface. However, this interface budget inflation approach is unsafe, due to the
resource parallel supply under multicore interfaces. We illustrate this via the following scenario.
Example 7 Consider a system with a single component C that has a workload $\tau = \{\tau_1 = \tau_2 = (2, 0.1, 2), \tau_3 = (2, 1.81, 2)\}$, which is scheduled under gEDF. We assume that ties are broken based on increasing order of
tasks’ indices, i.e., a task with a smaller index has a higher priority. Suppose the cache overhead for each task
is given by $\Delta^{crpmd}_{\tau_1} = \Delta^{crpmd}_{\tau_2} = 0.05$ and $\Delta^{crpmd}_{\tau_3} = 0.2$. (The time unit is ms.) In this example, we consider
only the cache overhead caused by VCPU-preemption and VCPU-completion events and assume that there are
no other types of overhead.
Based on the overhead-free analysis in Section 5, the taskset τ is schedulable under the DMPR interface
$\mu = \langle 2, 1.01, 2 \rangle$. Since the interface has only one partial VCPU and this partial VCPU is not preempted by
any other (full) VCPUs, the taskset τ in C experiences no VCPU-preemption event. In addition, at most one
VCPU-completion event happens in a period of the DMPR interface µ. Further, based on Section 6, each
VCPU-completion event causes at most two tasks to experience a cache miss. Therefore, the total cache
overhead delay in a DMPR interface’s period is at most $2 \max_{1 \le i \le 3} \{\Delta^{crpmd}_{\tau_i}\} = 0.4$.
Suppose we inflate the budget of the overhead-free DMPR interface µ with the total cache overhead delay
of 0.4. Then, we obtain the DMPR interface $\mu' = \langle 2, 1.41, 2 \rangle$. However, the taskset τ is not schedulable under
$\mu'$, as is illustrated by Fig. 6.
Fig. 6(a) shows the resource supply pattern of $\mu'$, and Fig. 6(b) shows the release and schedule patterns of
the tasks in τ. Here, the tasks τ1, τ2, and τ3 are released at t = 1.01. τ3 migrates from VCPU3 to VCPU2
at t = 1.41 and incurs a delay of $\Delta^{crpmd}_{\tau_3} = 0.2$ time units to reload its cache content (because VCPU3
completes its budget at t = 1.41). After the reload, τ3 keeps running on VCPU2 for its remaining 1.41 time units
of execution and finishes at t = 1.41 + 0.2 + 1.41 = 3.02. Since τ3’s absolute deadline is t = 3.01, τ3 misses its deadline.
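The arithmetic behind this counterexample can be replayed in a few lines. The following Python sketch only retraces the single schedule of Example 7 for τ3; it is not a general gEDF simulator, and all variable names are ours rather than the paper's.

```python
# Replay of the schedule of Example 7 for tau_3 (all times in ms).
# Variable names are hypothetical; this is not a general gEDF simulator.

release = 1.01            # tau_1, tau_2 and tau_3 are all released at t = 1.01
wcet_tau3 = 1.81          # e_3
rel_deadline_tau3 = 2.0   # d_3
crpmd_tau3 = 0.2          # cache reload cost of tau_3

# Naive inflation: at most two tasks pay the largest per-task overhead.
total_overhead = 2 * max(0.05, 0.05, 0.2)
inflated_budget = 1.01 + total_overhead      # budget of mu' = <2, 1.41, 2>

# VCPU3 (partial) completes its budget at t = 1.41, so tau_3 has
# executed for only 1.41 - 1.01 = 0.4 when it has to migrate.
migration_time = inflated_budget
remaining = wcet_tau3 - (migration_time - release)

# On VCPU2, tau_3 first reloads its cache, then runs its remaining work.
finish = migration_time + crpmd_tau3 + remaining
absolute_deadline = release + rel_deadline_tau3
tardiness = finish - absolute_deadline

print(f"finish={finish:.2f} deadline={absolute_deadline:.2f} "
      f"tardiness={tardiness:.2f}")
```

The extra 0.4 units of budget exist in the interface, but τ3 can only consume supply on one VCPU at a time, which is why the finish time still exceeds the deadline.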
The flaw in the cache-aware analysis approach that naively inflates the interface’s budget comes from
the resource parallel supply problem of global multicore scheduling. In the above scenario, when τ3
experiences cache overhead, its worst-case execution time is enlarged and thus, it needs more CPU time to
execute. However, inflating the budget of the interface cannot guarantee that τ3 receives the inflated budget,
e.g., when part of the inflated budget is assigned to a VCPU that supplies resource in parallel with the VCPU
on which τ3 is running. Because τ3 is not a parallel task and cannot execute on two cores at the same time, τ3
does not fully utilize the inflated budget. As a result, although the extra budget is enough to account for the
cache overhead τ3 experiences, the inflated budget is not enough to guarantee the schedulability of the taskset
under the resource model with inflated budget.

[Figure 6: (a) Resource supply scenario, with period 2, full VCPUs vcpu1 and vcpu2, and partial VCPU vcpu3 whose budget of 1.01 is inflated by 0.4 because of cache overhead. (b) Task scheduling scenario under the resource supply of scenario (a), with τ1 = (2, 0.1, 2), τ2 = (2, 0.1, 2), τ3 = (2, 1.81, 2), ∆crpmd = 0.2 and priority τ1 > τ2 > τ3; τ3 misses its deadline with a tardiness of 0.01.]

Fig. 6 Scenario of unsafe analysis of inflating interface’s budget.

It is worth noting that the above overhead-aware analysis based on interface budget inflation is only safe
under the assumption that the resource demand of a taskset is independent of the resource supply of the
interface. However, this assumption is incorrect in the multicore setting: both the resource demand of a taskset
in Eq. (4) and the resource supply of a resource model in Lemma 2 depend on the number of VCPUs of a
component, and they are therefore coupled through the number of VCPUs.
In the next section, we present an alternative approach that explicitly considers the effect of cache overhead
on the SBF of the interface of each VCPU.

8.2 Cache-aware effective resource supply of a DMPR model
We first analyze the effective resource supply of a DMPR model µ, i.e., the supply it provides to a domain in
the presence of the overhead caused by VCPU-stop events. We then combine the results with the overhead
caused by task-preemption events to derive the schedulability and the interface of a domain.
Consider a DMPR interface µ = (Π, Θ, m) of a domain Di , and recall that µ provides one partial VCPU
VPi = (Π, Θ) and m full VCPUs to Di . Then, in the presence of overhead due to VCPU-stop events, the
effective resource supply of µ consists of the effective resource supply of VPi and the effective resource
supply of m full processors. Here, the effective budget (resource) of a VCPU is the budget (resource) that is
used solely to execute the tasks running on the VCPU, rather than to handle the cache misses that are caused
by VCPU-stop events. We quantify each of them below.
For ease of exposition, we say that a VCPU incurs a CRPMD if the task running on the VCPU incurs the
overhead caused by a VCPU-stop event, and we call a time interval [a, b] an overhead interval of a VCPU if
the effective resource the VCPU provides during [a, b] is zero. (Note that the first overhead interval of VPi in
a period cannot start before VPi begins its execution.) Finally, we call [a, b] a black-out interval of a VCPU if
it consists of overhead intervals or intervals during which the VCPU provides no resources.

Effective resource supply of the partial VCPU VPi of µ. Recall that $N^{stop}_{VP_i}$ denotes the maximum number
of VCPU-stop events of VPi during each period Π. The next lemma states a worst-case condition for the
effective resource supply of VPi:
Lemma 13 The worst-case effective resource supply of VPi in each period occurs when VPi has $N^{stop}_{VP_i}$
VCPU-stop events.

Proof Because VPi has a constant budget of Θ in each period Π, the more cache-related overhead it incurs
in a period, the fewer effective resources it can supply to (the actual execution of) the tasks in the domain.
Since the overhead that a domain’s tasks incur in a period of VPi is highest when VPi stops its execution as
many times as possible, the worst-case effective resource supply of VPi in a period occurs when VPi has the
maximum number of VCPU-stop events, which is $N^{stop}_{VP_i}$ events. Hence, the lemma. □
Based on this lemma, we can construct the worst-case scenario during which the effective resource supply
of VPi is minimal, and we can derive the effective supply bound function according to this worst-case scenario.
Lemma 14 The effective resource supply that VPi provides during I is minimal when (1) VPi provides its
budget as early as possible in the current period and as late as possible in the subsequent periods, (2) VPi
has as many VCPU-stop events as possible in each period, and (3) the interval I begins in the current period
of VPi and the total length of the black-out intervals that overlap with I is maximal.
Proof Suppose VPi provides Θ resource units in each of its periods. Denote by ScenarioA the effective resource
supply scenario described in the statement of this lemma, and by ScenarioB the worst-case supply scenario. Further,
denote by $\overline{SBF}^{stop}_{VP_i}(t)$ and $SBF^{stop}_{VP_i}(t)$ the effective resource supply of VPi over any interval of length t in
ScenarioA and ScenarioB, respectively. Then, $\overline{SBF}^{stop}_{VP_i}(t) \ge SBF^{stop}_{VP_i}(t)$. Let the effective resource supply in
each period of VPi in ScenarioB be $\Theta^*$. Because there are at most $N^{stop}_{VP_i}$ cache misses during each period of
VPi, $\Theta^* \ge \Theta - N^{stop}_{VP_i} \Delta^{crpmd}_{VP_i} = \overline{\Theta}^*$, where $\overline{\Theta}^*$ is the effective budget that VPi provides in each period in
ScenarioA. There are two cases:
Case 1) $\Theta \le N^{stop}_{VP_i} \Delta^{crpmd}_{VP_i}$: We have $\overline{SBF}^{stop}_{VP_i}(t) = 0$. Because $SBF^{stop}_{VP_i}(t) \le \overline{SBF}^{stop}_{VP_i}(t)$, VPi can provide at
most $\overline{\Theta}^*$ effective budget in each period under ScenarioB, where $\overline{\Theta}^* = \Theta - N^{stop}_{VP_i} \Delta^{crpmd}_{VP_i}$. In other words,
$\Theta^* \le \overline{\Theta}^*$. Since $\overline{\Theta}^* \le \Theta^*$, we obtain $\Theta^* = \overline{\Theta}^*$.
Case 2) $\Theta > N^{stop}_{VP_i} \Delta^{crpmd}_{VP_i}$: There are five sub-cases, as follows (x and z are as defined in Lemma 15, and the
time points t3, ..., t8 are as shown in Fig. 7):
(a) $t \le x + z$: We have $\overline{SBF}^{stop}_{VP_i}(t) = 0$. Because $SBF^{stop}_{VP_i}(t) \le \overline{SBF}^{stop}_{VP_i}(t)$, VPi in ScenarioB must provide
its budget as early as possible in the current period and as late as possible in the next period (as is shown
in the interval [t3, t5] in ScenarioA), so that it can guarantee that $SBF^{stop}_{VP_i}(t) = 0$. Further, because VPi
must provide at most $\Theta^*$ time units during each period Π, VPi always provides effective resource when t
is enlarged. Therefore, the maximum length of the black-out interval is x + z.
(b) $x + z < t \le x + z + \Theta^*$: Since VPi provides $\Theta^*$ resource units in each period and the whole second
period of ScenarioB overlaps with the interval I, VPi must provide $\Theta^*$ resource units at the end of the
$\Theta^*$ time unit interval of the second period. Thus, ScenarioB is the same as ScenarioA during the interval
[t5, t6].
(c) $x + z + \Theta^* < t \le x + 2z + \Theta^*$: $\overline{SBF}^{stop}_{VP_i}(t) = \overline{\Theta}^*$ and VPi in ScenarioA provides no effective resource
during [t6, t7]. Therefore, VPi in ScenarioB also provides no effective resource during [t6, t7] (since
$SBF^{stop}_{VP_i}(t) \le \overline{\Theta}^*$).
(d) $x + 2z + \Theta^* < t \le x + 2z + 2\Theta^*$: Similar to the sub-case (b) above, VPi in ScenarioB must provide
$\Theta^*$ time units during [t7, t8] (because otherwise, it cannot provide $\Theta^*$ time units in each period).
(e) By repeating the sub-cases (c) and (d), we can prove that VPi in ScenarioB provides no less effective
resource than that in ScenarioA.

From the above, we conclude that ScenarioA is the worst-case effective resource supply scenario of VPi. Hence,
the lemma. □

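Lemma 15 The effective supply bound function of the partial VCPU VPi = (Π, Θ) of a resource model
µ = (Π, Θ, m) of a component C is

$$SBF^{stop}_{VP_i}(t) = \begin{cases} y\Theta^* + \max\{0,\; t - x - y\Pi - z\}, & \text{if } \Theta > N^{stop}_{VP_i} \Delta^{crpmd}_{VP_i} \\ 0, & \text{otherwise} \end{cases} \qquad (15)$$

where $\Delta^{crpmd}_{VP_i} = \max_{\tau_i \in C} \{\Delta^{crpmd}_{\tau_i}\}$, $\Theta^* = \Theta - N^{stop}_{VP_i} \Delta^{crpmd}_{VP_i}$, $x = \Pi - \Delta^{crpmd}_{VP_i} - \Theta^*$, $y = \lfloor \frac{t - x}{\Pi} \rfloor$ and
$z = \Pi - \Theta^*$.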

Proof Let I be any interval of length t. We will prove the lemma based on the worst-case resource supply
scenario given by Lemma 14.
Fig. 7 illustrates the worst-case scenario described in Lemma 14, where I begins at time t3 and the
intervals during which VPi provides effective resources are [t2 , t3 ], [t5 , t6 ] and [t7 , t8 ]:

Fig. 7 Worst-case effective resource supply of VPi = (Π, Θ).

In the figure, the first overhead interval of VPi in a period starts when VPi first begins its execution in
that period. This first overhead interval is caused by the VCPU-completion event of VPi that occurs in the
previous period. Recall from Lemma 13 that the maximum number of VCPU-stop events of VPi in a period
Π is $N^{stop}_{VP_i}$. Further, according to the gEDF scheduling of component C, any task in C may run on the partial
VCPU and experience the cache overhead caused by the VCPU-stop event. Therefore, the maximum overhead
a task in component C experiences due to a VCPU-stop event of VPi is $\Delta^{crpmd}_{VP_i} = \max_{\tau_i \in C} \{\Delta^{crpmd}_{\tau_i}\}$. As a result,
the effective budget is $\Theta^* \ge \Theta - N^{stop}_{VP_i} \Delta^{crpmd}_{VP_i}$. Further, we have:

$t_3 - t_2 \ge \Theta - (N^{stop}_{VP_i} - 1)\Delta^{crpmd}_{VP_i} - (t_2 - t_1) = \Theta^* + \Delta^{crpmd}_{VP_i} - (t_2 - t_1)$;
$x = t_4 - t_3 = (t_4 - t_1) - (t_3 - t_2) - (t_2 - t_1) \le \Pi - \Delta^{crpmd}_{VP_i} - \Theta^*$;
$z = t_7 - t_6 = (t_8 - t_6) - (t_8 - t_7) \le \Pi - \Theta^*$.

Based on this information, we can derive the minimum effective resource supply during the interval I as follows:
if $\Theta \le N^{stop}_{VP_i} \Delta^{crpmd}_{VP_i}$, then $\Theta^* = 0$ and $SBF^{stop}_{VP_i}(t) = 0$; otherwise, $SBF^{stop}_{VP_i}(t) = y\Theta^* + \max\{0, t - x - y\Pi - z\}$.
In addition, $SBF^{stop}_{VP_i}(t)$ is minimal when $\Theta^* = \Theta - N^{stop}_{VP_i} \Delta^{crpmd}_{VP_i}$ and $x = \Pi - \Delta^{crpmd}_{VP_i} - \Theta^*$. Therefore,
Equation 15 gives the minimum effective resource supply of the worst-case effective resource supply scenario
described in Lemma 14. This proves the lemma. □
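To make Eq. (15) concrete, the following Python sketch evaluates the effective supply bound function of a partial VCPU. The function and parameter names are ours. Returning 0 for t < x is an assumption added here: in that range the floor in y would be negative, while the worst-case scenario of Lemma 14 provides no effective supply before x + z.

```python
import math

def sbf_partial(t, period, budget, n_stop, crpmd):
    """Sketch of Eq. (15) for a partial VCPU (period, budget).

    n_stop: maximum number of VCPU-stop events per period (N^stop_VPi).
    crpmd:  largest per-task cache overhead in the component (Delta^crpmd_VPi).
    """
    if budget <= n_stop * crpmd:
        return 0.0                     # "otherwise" branch of Eq. (15)
    eff = budget - n_stop * crpmd      # Theta*, the effective budget
    x = period - crpmd - eff
    z = period - eff
    if t < x:
        return 0.0                     # assumption: clamp where y would be negative
    y = math.floor((t - x) / period)
    return y * eff + max(0.0, t - x - y * period - z)

# Hypothetical parameters: Pi = 10, Theta = 4, two stop events of cost 0.5 each,
# so Theta* = 3, x = 6.5 and z = 7.
for t in (13.5, 15, 16.5, 25):
    print(t, sbf_partial(t, 10, 4, 2, 0.5))
```

With these parameters the supply stays at zero until t = x + z = 13.5, then grows linearly until one effective budget Θ* = 3 has been delivered at t = 16.5.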
Effective resource supply of all m full VCPUs of µ. Similar to the partial-VCPU case, we can also establish
a worst-case condition for the total effective resource supply of the full VCPUs:
Lemma 16 The m full VCPUs provide the worst-case total effective resource supply when they incur $N^{stop}_{VP_i}$
CRPMDs in total during each period Π of the partial VCPU VPi of µ.

Proof Because the total resource supply of m full VCPUs in any interval of length t is always mt, these
VCPUs together provide the least effective resource supply when they incur the maximum number of CRPMDs.
Recall from Section 6 that, when a VCPU-stop event of the partial VCPU VPi of a domain Di occurs, it
causes one CRPMD in a full VCPU of the same domain. Hence, the total number of CRPMDs that these full
VCPUs incur together is the number of VCPU-stop events of the partial VCPU VPi of the same domain. The
lemma then follows from a combination with Lemma 13. □
The next lemma gives the worst-case supply scenarios of m full VCPUs. Fig. 8 illustrates one of the
conditions under this worst-case scenario.

[Figure 8: timeline of a full VCPU VPf that incurs the CRPMDs and of another full VCPU VPj, which is unavailable to the task τk that is reloading its cache.]

Fig. 8 Worst-case resource supply of m full VCPUs of µ.

Lemma 17 The worst-case effective resource supply of m full VCPUs of µ in any interval I of length t occurs
when (1) all the $N^{stop}_{VP_i}$ CRPMDs are experienced by one full VCPU VPf in each period Π of VPi, (2) VPf
incurs the overhead as late as possible in the first period and as early as possible in the rest of the periods of VPi,
(3) the maximum overhead cost of each CRPMD is $\Delta^{crpmd}_{VP_i}$, and (4) the interval I begins when the
first CRPMD occurs in the first period.

Proof We denote by ScenarioA the effective resource supply scenario given by Lemma 17 (see Fig. 8), and
let ScenarioB be a worst-case effective resource supply scenario of the m full VCPUs. Let $x = N^{stop}_{VP_i} \Delta^{crpmd}_{VP_i}$.
We will prove that the m full VCPUs provide no less effective resource in ScenarioB than in ScenarioA with
the following arguments:
1. While a full VCPU VPf is experiencing a CRPMD, the resource provided by any other full VCPU VPj is
unavailable to the task currently running on VPf (since this task cannot execute on more than one VCPU
at any given time). Since it is unknown which exact task in the domain is running on VPf, it is unknown
whether VPj is available to a given task. Hence, we consider VPj as unavailable to every task while VPf
is experiencing the overhead, so as to guarantee the safety of the schedulability analysis. Recall from
Lemma 16 that all m full VCPUs together incur $N^{stop}_{VP_i}$ CRPMDs in each period. The total length of the unavailable
intervals in each period Π is maximized when all these $N^{stop}_{VP_i}$ CRPMDs are incurred by one full VCPU VPf in each period
Π of VPi. Hence, ScenarioB must obey Condition (1).
2. The maximum total length of the unavailable intervals of m full VCPUs in each period is $x = N^{stop}_{VP_i} \Delta^{crpmd}_{VP_i}$.
The maximum black-out interval happens when the unavailable intervals in two periods are consecutive
and the maximum cost of each CRPMD is $\Delta^{crpmd}_{VP_i}$. Therefore, the full VCPU VPf should incur the
overhead as late as possible in the first period and as early as possible in the second period of VPi in order
for the black-out interval to be maximized. In addition, the interval I should begin when the first CRPMD
occurs in the first period. Hence, ScenarioB should obey the conditions (3) and (4), and the m full VCPUs
provide no less effective resource in ScenarioB than in ScenarioA when t ≤ 2x.
3. When x + kΠ < t < 2x + kΠ (k ∈ N), because m full VCPUs must provide m(Π − x) effective
resource units in each period and the interval t has k periods, the m full VCPUs in ScenarioB should
provide at least km(Π − x) effective resource units during a time interval of length t. Because t > x + kΠ,
the m full VCPUs in ScenarioB have already provided km(Π − x) effective resource units during the
interval of length x + kΠ. Therefore, they must provide no effective resource in the remaining time
interval of length t − (x + kΠ) (otherwise, the m full VCPUs would provide more effective resource in
ScenarioB than in ScenarioA). Hence, VPf should incur the overhead as early as possible in all periods
(except for the first period) of VPi. By combining the arguments (2) and (3), we conclude that
ScenarioB must obey Condition (2) and the m full VCPUs provide no less effective resource in ScenarioB
than in ScenarioA when x + kΠ < t < 2x + kΠ.
4. When 2x + kΠ < t < x + (k + 1)Π (k ∈ N), the m full VCPUs in ScenarioB provide no effective
resource during [x + kΠ, 2x + kΠ] according to the argument (3). In addition, the m full VCPUs in
ScenarioB must provide m(Π − x) effective resource units during [x + kΠ, x + (k + 1)Π], i.e., the
(k + 1)th period of VPi, in order to guarantee m(Π − x) effective resource units during the (k + 1)th
period of VPi. Therefore, the m VCPUs in ScenarioB always provide the same effective resource during
[2x + kΠ, x + (k + 1)Π] as in ScenarioA. Hence, they provide no less effective resource in ScenarioB
than in ScenarioA when 2x + kΠ < t < x + (k + 1)Π.
Because the m full VCPUs provide no less effective resource in ScenarioB than in ScenarioA, and
ScenarioB is a worst-case effective resource supply scenario, we conclude that ScenarioA is also a worst-case
effective resource supply scenario of the m full VCPUs. Hence, the lemma. □
The next lemma gives the effective SBF of the m full VCPUs of µ based on the worst-case scenario
described in Lemma 17.
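Lemma 18 The effective resource supply bound function of the m full VCPUs of µ is given by:

$$SBF^{stop}_{VPs}(t) = \begin{cases} m \left( y\Theta' + \max\{0,\; t - y\Pi - 2x\} \right), & \text{if } \Theta \ne 0 \\ mt, & \text{if } \Theta = 0 \end{cases} \qquad (16)$$

where $x = N^{stop}_{VP_i} \Delta^{crpmd}_{VP_i}$, $y = \lfloor \frac{t - x}{\Pi} \rfloor$ and $\Theta' = \Pi - x$.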

Proof The effective resource supply bound function $SBF^{stop}_{VPs}(t)$ of the resource supply scenario given by
Lemma 17 is as follows: when t < 2x, $SBF^{stop}_{VPs}(t) = 0$; when x + kΠ < t < 2x + kΠ, $SBF^{stop}_{VPs}(t) = km(\Pi - x)$;
when 2x + kΠ < t < x + (k + 1)Π, $SBF^{stop}_{VPs}(t) = km(\Pi - x) + m(t - 2x - k\Pi)$.
Equation 16 is derived by rearranging the equations of $SBF^{stop}_{VPs}(t)$. Since the resource supply scenario given
by Lemma 17 is a worst-case scenario, $SBF^{stop}_{VPs}(t)$ is the effective resource supply bound function of the m
full VCPUs of µ. □
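Eq. (16) can be evaluated in the same spirit. The Python sketch below uses names of our own choosing. As before, returning 0 for t < x is an assumption that is consistent with the case t < 2x in the proof, since the floor in y would be negative there.

```python
import math

def sbf_full(t, period, budget, m, n_stop, crpmd):
    """Sketch of Eq. (16) for the m full VCPUs of an interface.

    budget is the budget Theta of the interface's partial VCPU; when it is
    zero there is no partial VCPU and the full VCPUs supply m * t.
    """
    if budget == 0:
        return m * t
    x = n_stop * crpmd                 # unavailable time per period
    if t < x:
        return 0.0                     # assumption: clamp where y would be negative
    y = math.floor((t - x) / period)
    return m * (y * (period - x) + max(0.0, t - y * period - 2 * x))

# Hypothetical parameters: Pi = 10, m = 2, two CRPMDs of cost 0.5 each (x = 1).
for t in (2, 6, 11.5, 15):
    print(t, sbf_full(t, 10, 4, 2, 2, 0.5))
```

The values follow the three ranges in the proof: zero up to t = 2x, then a slope of m, and a plateau of length x at the start of each later period.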
Effective resource supply of a DMPR model. The next lemma gives the effective resource supply that a
DMPR interface µ = (Π, Θ, m) provides to a domain Di after having accounted for the overhead due to
VCPU-stop events. The lemma is a direct consequence of Lemmas 15 and 18.
Lemma 19 The effective resource supply of a DMPR interface $\mu = \langle \Pi, \Theta, m \rangle$ of a domain Di after having
accounted for the overhead due to VCPU-stop events is given by:

$$SBF^{stop}_{\mu}(t) = SBF^{stop}_{VP_i}(t) + SBF^{stop}_{VPs}(t), \quad \forall\, t \ge 0. \qquad (17)$$

Here, $SBF^{stop}_{VP_i}(t)$ is the effective resource supply of the partial VCPU VPi = (Π, Θ), which is given by
Eq. (15), and $SBF^{stop}_{VPs}(t)$ is the effective resource supply of the m full VCPUs of µ, which is given by Eq. (16).
Proof Since the resource supply of a DMPR interface is the total effective resource supply of its partial VCPU
and full VCPUs, the lemma directly follows from the definition of $SBF^{stop}_{VP_i}(t)$ and $SBF^{stop}_{VPs}(t)$. □
Note that, when no partial VCPU exists for interface $\mu = \langle \Pi, 0, m \rangle$, the effective resource supply of µ is
equal to the resource supply of µ, i.e., $SBF^{stop}_{\mu}(t) = mt$.
8.3 DMPR interface computation under MODEL - CENTRIC method
Based on the effective supply function, we can develop the component schedulability test as follows.
Theorem 7 Consider a domain Di with a taskset $\tau = \{\tau_1, ..., \tau_n\}$, where $\tau_k = (p_k, e_k, d_k)$. Let $\tau'' = \{\tau''_1, ..., \tau''_n\}$,
where $\tau''_k = (p_k, e''_k, d_k)$ and $e''_k = e_k + \max_{\tau_i \in LP(\tau_k)} \Delta^{crpmd}_{\tau_i}$ for all 1 ≤ k ≤ n (recall that $LP(\tau_k) = \{\tau_i \mid d_i > d_k\}$). Then, Di is
schedulable under gEDF by a DMPR model µ in the presence of cache-related overhead, if the inflated taskset
$\tau''$ is schedulable under gEDF by the effective resource supply $SBF^{stop}_{\mu}(t)$ in the absence of overhead.
Proof Since $\tau''$ includes the overhead that τ incurs due to task-preemption events, if $SBF^{stop}_{\mu}(t)$ is sufficient
to schedule $\tau''$ assuming negligible overhead, then it is also sufficient to schedule τ in the presence of
task-preemption events. As $SBF^{stop}_{\mu}(t)$ gives the effective supply that µ provides to τ after having accounted for
the overhead due to VCPU-stop events, µ provides sufficient resources to schedule τ in the presence of the
overhead from all types of events. This proves the theorem.
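The WCET inflation used in Theorem 7 can be sketched as follows. This Python fragment is illustrative only: the `Task` type and the `inflate` helper are our own names, and treating the maximum over an empty LP(τk) as zero is an assumption that the theorem statement does not spell out.

```python
from typing import List, NamedTuple

class Task(NamedTuple):
    period: float
    wcet: float
    deadline: float
    crpmd: float      # per-task cache overhead Delta^crpmd

def inflate(tasks: List[Task]) -> List[Task]:
    """Return the taskset with e''_k = e_k + max over LP(tau_k) of Delta^crpmd."""
    inflated = []
    for tk in tasks:
        # LP(tau_k): tasks with a larger relative deadline than tau_k.
        lp_costs = [ti.crpmd for ti in tasks if ti.deadline > tk.deadline]
        extra = max(lp_costs, default=0.0)   # assumption: empty LP adds nothing
        inflated.append(tk._replace(wcet=tk.wcet + extra))
    return inflated

# Hypothetical taskset (times in ms).
tasks = [Task(10, 2, 10, 0.1), Task(20, 5, 20, 0.3), Task(40, 8, 40, 0.2)]
for t in inflate(tasks):
    print(t)
```

The inflated taskset would then be checked against the effective supply of Eq. (17) with the overhead-free gEDF test, which is not reproduced here.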
Based on the above results, we can generate a cache-aware minimum-bandwidth DMPR interface for a
domain in the same manner as in the overhead-free case, except that we use the effective resource supply
and the inflated taskset in the schedulability test. Similarly, the system’s interface can be computed from the
interfaces of the domains in the exact same way as the overhead-free interface computation.

9 Hybrid cache-aware DMPR interface
Recall from Section 7 that the TASK - CENTRIC - UB method always dominates the BASELINE method. However,
neither of these analysis methods dominates the MODEL - CENTRIC method, and vice versa. We demonstrate
this using two example systems, where the TASK - CENTRIC - UB method gives a smaller interface bandwidth in
the first system but a larger interface bandwidth in the second system compared to the interface bandwidth
given by the MODEL - CENTRIC method.
Example 8 Let Sys1 be a system consisting of two domains C1 and C2 that are scheduled under the hybrid
EDF scheduling strategy (cf. Section 2) and that have workloads $\tau_{C_1} = \{\tau_{11} = ... = \tau_{14} = (200, 100, 200)\}$
and $\tau_{C_2} = \{\tau_{21} = \tau_{22} = (200, 100, 200)\}$, respectively. By applying the analysis in Sections 7.2 and 8,
the interfaces of the system under TASK - CENTRIC - UB and under MODEL - CENTRIC are computed to be
$\mu_{Sys_1} = \langle 20, 17, 5 \rangle$ and $\mu'_{Sys_1} = \langle 20, 19, 5 \rangle$, respectively. Thus, the system’s interface under TASK - CENTRIC - UB has a smaller bandwidth than that of the interface computed under MODEL - CENTRIC.
Example 9 Let Sys2 be a system consisting of two domains C1 and C2 that are scheduled under the hybrid EDF
scheduling strategy and that have workloads $\tau_{C_1} = \{\tau_{11} = ... = \tau_{15} = (100, 5, 100)\}$ and $\tau_{C_2} = \{\tau_{21} = ... = \tau_{25} = (100, 5, 100)\}$, respectively. The interfaces of this system under TASK - CENTRIC - UB and under MODEL - CENTRIC
are given by $\mu_{Sys_2} = \langle 20, 0, 4 \rangle$ and $\mu'_{Sys_2} = \langle 20, 14, 3 \rangle$, respectively. Thus, the system’s interface
under TASK - CENTRIC - UB has a larger bandwidth than that of the interface computed under MODEL - CENTRIC.
One can also show that neither MODEL - CENTRIC nor BASELINE dominates the other. For instance,
consider the system Sys1 in Example 8. The interface of the whole system under the BASELINE method is
$\mu''_{Sys_1} = \langle 20, 17, 5 \rangle$, which has a smaller bandwidth than the interface $\mu'_{Sys_1}$ computed using the MODEL - CENTRIC
method. Further, since the TASK - CENTRIC - UB method dominates the BASELINE method but not the
MODEL - CENTRIC method, the BASELINE method also does not dominate the MODEL - CENTRIC method.
From the above observations, we can derive the minimum interface of a component from the ones
computed using the TASK - CENTRIC - UB and MODEL - CENTRIC methods (since TASK - CENTRIC - UB method
always dominates BASELINE), as stated by Theorem 8. The theorem is trivially true, since both interfaces
computed using the TASK - CENTRIC - UB and MODEL - CENTRIC methods are safe. We refer to this analysis as
the HYBRID method.
Theorem 8 (Hybrid cache-aware interface) The minimum cache-aware DMPR interface of a domain Di
(a system S) is the interface that has a smaller resource bandwidth between µtask and µmodel , where µtask and
µmodel are the minimum-bandwidth DMPR interfaces of Di (S) computed using the TASK - CENTRIC - UB and
the MODEL - CENTRIC methods, respectively.
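The selection in Theorem 8 amounts to comparing two bandwidths. The Python sketch below assumes that the bandwidth of an interface ⟨Π, Θ, m⟩ is Θ/Π + m, as used in the proof of Lemma 12; the `DMPR` type, the helper names, and the tie-breaking rule in favour of the first argument are our own choices. The inputs are the interfaces reported in Examples 8 and 9.

```python
from typing import NamedTuple

class DMPR(NamedTuple):
    period: float   # Pi
    budget: float   # Theta
    m: int          # number of full VCPUs

def bandwidth(mu: DMPR) -> float:
    """Bandwidth Theta / Pi + m of a DMPR interface."""
    return mu.budget / mu.period + mu.m

def hybrid(mu_task: DMPR, mu_model: DMPR) -> DMPR:
    """Pick the interface with the smaller bandwidth (ties go to mu_task)."""
    return mu_task if bandwidth(mu_task) <= bandwidth(mu_model) else mu_model

# Example 8: TASK-CENTRIC-UB gives <20, 17, 5>, MODEL-CENTRIC gives <20, 19, 5>.
print(hybrid(DMPR(20, 17, 5), DMPR(20, 19, 5)))
# Example 9: TASK-CENTRIC-UB gives <20, 0, 4>, MODEL-CENTRIC gives <20, 14, 3>.
print(hybrid(DMPR(20, 0, 4), DMPR(20, 14, 3)))
```

For Example 8 the bandwidths are 5.85 and 5.95, so the task-centric interface is kept; for Example 9 they are 4 and 3.7, so the model-centric interface is kept.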
Discussion. We observe that the schedulability analysis under gEDF in the absence of overhead (Theorem 1)
is only a sufficient test, and that its pessimism degree varies significantly with the characteristics of the
taskset. For instance, under the same multiprocessor resource, one taskset with a larger total utilization may be
schedulable while another with a smaller total utilization may not be schedulable. As a result, it is possible that
the overhead-aware interface of a domain (system) may require less resource bandwidth than the overhead-free
interface of the same domain (system).

10 Evaluation
To evaluate the benefits of our proposed interface model and cache-aware compositional analysis, we performed
simulations using randomly generated workloads. We had five main objectives for our evaluation: (1) Determine
how much resource bandwidth the interfaces computed using the improved SBF (Section 3.2) can save
compared to the interfaces computed using the original SBF proposed in [Easwaran et al., 2009]; (2) determine
how much resource bandwidth the DMPR model can save compared to the MPR model; (3) evaluate the
relative performance of the HYBRID method and the BASELINE method; (4) study the impact of task parameters
(e.g., the range of taskset utilization, the distribution of task’s utilization, the period range of tasks) on the
interfaces under the HYBRID and BASELINE methods; and (5) evaluate the performance of the HYBRID analysis
when using a cache overhead value per task and when using the maximum cache overhead value for the entire
system.
10.1 Experimental setup
Key factors. We focus on the following five key factors that can affect the performance of a cache-aware
compositional analysis (see footnote 9):
– Utilization of a task set. Task sets with larger utilizations tend to have a larger number of tasks; thus, each task
tends to experience more cache overhead during its lifetime because there are more other tasks that can
preempt it.
– Distribution of task utilizations. High-utilization tasks are more sensitive to cache overhead and can more
easily become unschedulable because of this overhead than tasks with small utilization.
– Periods of the tasks. If two tasks have the same utilization and experience the same cache overhead, the
task with the smaller period has a higher probability of missing its deadline because of the overhead than
the task with the larger period because the former has a smaller relative deadline. Therefore tasks with
smaller period are more sensitive to cache overhead.
– Number of tasks in a task set. In the BASELINE approach and the task-centric approach from Section 7,
when a VCPU-stop event happens, each task’s worst-case execution time is inflated by the cache overhead
caused by this event, even though at most two tasks actually experience the cache overhead that the event
has caused. Hence, these two approaches will become more and more pessimistic as the number of tasks
increases.
– Cost of cache overhead per event. If the cost of cache overhead increases, tasks will experience longer
delays when task-preemption or VCPU-stop events occur.
Workload. In order to evaluate the impact of the above five factors on the performance of overhead-free
and overhead-aware compositional analysis, we generated a number of synthetic real-time workloads with
randomly generated periodic task sets that span a range of different parameters for each of these factors. Below,
we explain how the parameters were chosen.
We picked the task set utilizations from the interval [0, 24], with increments of 0.2, to be consistent
with the ranges used in [Brandenburg et al., 2011] and [Brandenburg, 2011]. However, we observed that a
smaller interval is sufficient to demonstrate the relative performance of overhead-free and overhead-aware
compositional analysis; hence, we used the range [0, 5], again with increments of 0.2, when evaluating the
impact of the other factors on overhead-aware compositional analysis.
9 We assume that the other factors are the same when we discuss one factor’s impact on the cache-aware analysis.

The tasks’ utilizations were drawn from one of four distributions: one uniform distribution over the range
[0.001, 0.1] and three bimodal distributions; in the latter, the utilization was distributed uniformly over either
[0.1, 0.5) or [0.5, 0.9], with respective probabilities of 8/9 and 1/9 (light), 6/9 and 3/9 (medium), and 4/9 and
5/9 (heavy). These probabilities are consistent with the ones used in [Bastoni et al., 2010] and [Brandenburg,
2011]. The periods of the tasks were drawn from a uniform distribution over one of the following three ranges:
(350ms, 850ms), (550ms, 650ms), and (100ms, 1100ms); all periods were integer. These distributions
are identical to those used in [Lee et al., 2011]. The number of tasks in a task set was chosen from the range [0, 300], with
increments of 20.
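The workload parameters above can be sketched as a small generator. This is only an illustration under stated assumptions: the function names, the rule of adding tasks until the target utilization is reached, and the trimming of the last task are not taken from the paper, which does not describe its generator at this level of detail.

```python
import random

# Probability of drawing from the lower range [0.1, 0.5) for each bimodal
# distribution named in the text; the remainder goes to [0.5, 0.9].
BIMODAL_LOW_PROB = {"light": 8 / 9, "medium": 6 / 9, "heavy": 4 / 9}


def draw_utilization(dist, rng):
    """Draw one task utilization from 'uniform' or a bimodal distribution."""
    if dist == "uniform":
        return rng.uniform(0.001, 0.1)
    if rng.random() < BIMODAL_LOW_PROB[dist]:
        return rng.uniform(0.1, 0.5)
    return rng.uniform(0.5, 0.9)


def generate_taskset(target_util, dist, period_range_ms, rng):
    """Return a list of (period_ms, wcet_ms) pairs.

    Assumed stopping rule (not from the paper): keep adding tasks until the
    target utilization is reached, trimming the last task to fit exactly.
    """
    tasks, total = [], 0.0
    lo, hi = period_range_ms
    while total < target_util - 1e-9:
        u = min(draw_utilization(dist, rng), target_util - total)
        period = rng.randint(lo, hi)  # integer periods, as in the text
        tasks.append((period, u * period))
        total += u
    return tasks


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    taskset = generate_taskset(2.0, "light", (350, 850), rng)
    print(len(taskset), sum(w / p for p, w in taskset))
```

Note that trimming the last task can push its utilization below the lower bound of the chosen distribution; a generator that discards and redraws instead would be an equally plausible choice.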
The cost of cache overhead per event was chosen based on the cache overhead ratio, which we define as
the cache overhead of a task τi divided by the worst-case execution time of τi . We picked the cache overhead
ratio from the range [0, 0.1] with increments of 0.01. This range was chosen based on measurements of the L2
cache miss overhead of tasks on our experimental platform; we found that the cost of missing the L2 private
cache but hitting the L3 shared cache was 0.02ms when the working set size was 256KB (the L2 private
cache size). Because the L3 cache hit latency is very small (less than 100 cycles), the cache overhead per
task-preemption or VCPU-stop event is only 0.02ms. Therefore, the cache overhead ratio was less than 0.02
for any task we measured that had a worst-case execution time of more than 2ms.
Overhead measurements. For our measurements, we used a Dell Precision T3610 six-core workstation
with the RT-Xen 2.0 platform [Xi et al., 2014]; each domain was running LITMUS^RT 2012.3 [Calandrino
et al., 2006]. The scheduler was gEDF in the domains and semi-partitioned EDF in the VMM, as described
in Section 2. We allocated a full-capacity VCPU to one domain and pinned this VCPU to a physical core of
its own; this was done to avoid interference from domain 0 (the administrative domain in RT-Xen), which
was pinned to a different core. We measured the cache overhead of the cache-intensive program ρ as follows.
First, we warmed up the cache by accessing all the cache content of the program; then we used the time stamp
counter to measure the time $l_{hit}$ it takes to access the same content again. Because the cache was warm, $l_{hit}$ is
the cache hit latency of this program. Next, we allocated an array of the same size as the private L2 cache and
loaded this into the same core’s L2 cache in order to pollute the cache content of ρ. Finally, we again accessed
all the cache content of ρ and recorded the cache miss latency $l_{miss}$. The cache overhead of the program ρ per
task-preemption or VCPU-stop event is then $l_{miss} - l_{hit}$.
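The arithmetic behind the per-event overhead and the cache overhead ratio can be written out as follows. This is a minimal sketch: the function names are illustrative, and the two latency values in the usage example are hypothetical numbers chosen only so that their difference matches the 0.02ms overhead reported above.

```python
def cache_overhead_ms(l_miss_ms, l_hit_ms):
    """Per-event cache overhead: miss latency minus hit latency."""
    if l_miss_ms < l_hit_ms:
        raise ValueError("miss latency should not be below hit latency")
    return l_miss_ms - l_hit_ms


def cache_overhead_ratio(overhead_ms, wcet_ms):
    """Cache overhead of a task divided by its worst-case execution time."""
    if wcet_ms <= 0:
        raise ValueError("WCET must be positive")
    return overhead_ms / wcet_ms


if __name__ == "__main__":
    # Hypothetical latencies; only their 0.02ms difference comes from the text.
    overhead = cache_overhead_ms(l_miss_ms=0.025, l_hit_ms=0.005)
    # A task with a 2ms WCET then has a ratio of about 0.01.
    print(overhead, cache_overhead_ratio(overhead, wcet_ms=2.0))
```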

10.2 Overhead-free analysis
We begin with an empirical comparison of the overhead-free analyses. For this purpose, we set up four domains
with harmonic periods, and we randomly generated tasks and uniformly distributed them across the four domains. To be consistent with [Phan et al., 2013], we generated 25 task sets per task set utilization or task set size.
MPR with improved SBF vs. MPR with original SBF. To estimate the impact of the improved SBF, we
generated 625 tasksets with taskset utilizations ranging from 0.1 to 24, with increments of 0.2. The task
utilizations were drawn from the bimodal-light distribution as described earlier; the tasks’ periods were
uniformly distributed across [350ms, 850ms]. For each taskset we generated, we distributed the tasks into one
domain, and we then computed the overhead-free interface of the domain using MPR with the improved SBF,
as well as using the original MPR. Fig. 9(a) shows the average bandwidth savings due to the improved SBF.
We observe that, across all taskset utilizations, MPR with the improved SBF always requires either the same or
less resource bandwidth than MPR with the original SBF. We also observe that MPR with the improved SBF
saves over 0.8 cores when the taskset utilization is larger than 5. Fig. 9(b) and 9(c) show the average resource
bandwidth savings with the other two bi-modal distributions; we observe that, in all three cases, MPR with the
improved SBF consistently outperformed MPR with the original SBF.

Fig. 9 Average resource bandwidth saved: MPR with improved SBF vs. MPR with original SBF. Panels: (a) Bimodal-light distribution, (b) Bimodal-medium distribution, (c) Bimodal-heavy distribution. Axes: task set utilization (x) vs. average resource bandwidth saved (y).

DMPR vs. MPR with the original SBF. To compare DMPR to MPR with the original SBF on the whole
system, we distributed the tasks in each taskset over four domains and we then computed the overhead-free
interface of the whole system using both DMPR and MPR with the original SBF. Fig. 10(a) shows the average
bandwidth savings of DMPR for different taskset utilizations. Our results show that DMPR consistently saves
bandwidth relative to MPR with the original SBF for taskset utilizations up to 16. There are very few data points beyond
this point because we can only compute the average bandwidth savings when both analyses return valid
interfaces for the same taskset; however, for taskset utilizations above 16, MPR generally fails to compute a
valid interface for the system.

Fig. 10 Average resource bandwidth saved: DMPR vs. MPR with original SBF. Panels: (a) Bimodal-light distribution, (b) Bimodal-medium distribution, (c) Bimodal-heavy distribution. Axes: task set utilization (x) vs. average resource bandwidth saved (y).

As shown in Fig. 11(a), the fraction of tasksets with valid interfaces under MPR with the original SBF
decreases with increasing taskset utilization. This is because the original SBF of MPR is pessimistic and
cannot provide $m' t$ time units with interface $\Gamma = \langle \Pi, \Pi m', m' \rangle$. Once the interfaces of the leaf components
(i.e., domains) have been computed, these interfaces are transferred to VCPUs as the workload of the top
component. When some of those VCPUs have utilization 1, the resource demand increases faster than the
resource supply of MPR with the original SBF; hence, MPR cannot find a valid interface. DMPR does not
have this problem because it can always supply $m' t$ time units with bandwidth $m'$; hence, the fraction of
tasksets with valid interfaces is always 1. As Fig. 11(b) and Fig. 11(c) show, the results for the other two
bimodal distributions are similar: DMPR is consistently able to compute interfaces for all tasksets, whereas
MPR with the original SBF finds fewer and fewer interfaces as the taskset utilization increases.
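The two metrics used in this comparison, the average saving over tasksets for which both analyses return a valid interface and the fraction of tasksets with a valid interface, can be sketched as below. The data layout (`None` standing for "no valid interface") and the function names are assumptions made for illustration; the bandwidth values in the usage example are made up.

```python
from typing import Optional, Sequence, Tuple

# (bandwidth under the reference analysis, bandwidth under the compared
# analysis); None means that analysis found no valid interface.
Result = Tuple[Optional[float], Optional[float]]


def average_saving(results: Sequence[Result]) -> Optional[float]:
    """Mean of (reference - compared) over tasksets where both are valid."""
    both = [(a, b) for a, b in results if a is not None and b is not None]
    if not both:
        return None  # no common data point, so no average can be reported
    return sum(a - b for a, b in both) / len(both)


def fraction_valid(bandwidths: Sequence[Optional[float]]) -> float:
    """Fraction of tasksets for which a valid interface was computed."""
    return sum(b is not None for b in bandwidths) / len(bandwidths)


if __name__ == "__main__":
    # Made-up bandwidths (in cores) for three tasksets.
    results = [(4.0, 3.0), (None, 5.0), (6.0, 5.5)]
    print(average_saving(results))
    print(fraction_valid([a for a, _ in results]))
```

Restricting the average to commonly valid tasksets is what makes data points sparse at high utilizations: once the reference analysis stops producing interfaces, there is nothing left to average.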

10.3 Comparison of HYBRID cache-aware analysis vs. BASELINE cache-aware analysis
Next, we compared the performance of the two overhead-aware analysis approaches. For this we used the
same tasksets and system configuration as for the previous experiment, but we additionally computed DMPR
interfaces for each taskset using the respective approach.
Impact of taskset utilization. Fig. 12(a) shows the average resource bandwidth savings of the HYBRID
approach compared to the BASELINE approach for each taskset utilization. We observe that a) HYBRID reduced
the resource bandwidth in all cases, and that b) more and more cores are being saved as the taskset utilization
increases. Note that, as the taskset utilization increases, the interface bandwidth can sometimes decrease. One
reason for this is that the underlying gEDF schedulability test is only sufficient, and is not strictly dependent
on the taskset utilization; in other words, it is possible that a taskset with a high utilization is schedulable but
another with a lower utilization is not. We also observe that, as discussed earlier, the relative performance of
the HYBRID and BASELINE analyses is easy to see even for small taskset utilizations; this is why we only
compare the two overhead-aware analyses for taskset utilizations in [0, 5] instead of the larger [0, 24] range.

Fig. 11 Fraction of taskset with valid interfaces: DMPR vs. MPR with original SBF. Panels: (a) Bimodal-light distribution, (b) Bimodal-medium distribution, (c) Bimodal-heavy distribution. Axes: task set utilization (x) vs. fraction of computable tasksets (y); series: MPR and DMPR.

Fig. 12 Average resource bandwidth saved: HYBRID vs. BASELINE. Panels: (a) Bimodal-light distribution, (b) Bimodal-medium distribution, (c) Bimodal-heavy distribution. Axes: task set utilization (x) vs. average resource bandwidth saved (y).

Fig. 13 Average resource bandwidth saved: HYBRID vs. BASELINE. Panels: (a) Bimodal-light distribution, (b) Bimodal-medium distribution, (c) Bimodal-heavy distribution. Axes: task set size (x) vs. average resource bandwidth saved (y).

Impact of task utilization. Fig. 12(a)-Fig. 12(c) show the average resource bandwidth savings for different
taskset utilizations and each of the three bimodal distributions. We observe that, in all three cases, the HYBRID
approach consistently outperformed the BASELINE approach. Further, as the taskset utilization increases, the
savings also increase and remain steady at approximately one core once the taskset utilization has reached 10.
Impact of taskset size. We investigated the impact of the number of tasks (i.e., the taskset size) on the average
bandwidth savings of the HYBRID approach compared to the BASELINE approach. For this experiment, we
generated a set of tasksets with sizes between 4 and 300, with increments of 20, and with 25 tasksets per size.
As before, we tried each of the three bimodal distributions we discussed in Section 10.1.
Fig. 13(a)-Fig. 13(c) show the average resource bandwidth savings for different taskset sizes with each
of the three bimodal distributions. We observe that a) the HYBRID approach consistently outperforms the
BASELINE approach, and b) the savings increase with the number of tasks. This is expected because the
BASELINE technique inflates the WCET of every task with all the cache-related overhead each task experiences;
hence, its total cache overhead increases with the size of the taskset.

Fig. 14 Average resource bandwidth saved under different ranges of tasks’ periods. Panels: (a) Task period: [100, 1100]ms, (b) Task period: [350, 850]ms, (c) Task period: [550, 650]ms. Axes: task set utilization (x) vs. average resource bandwidth saved (y).

Fig. 15 Average bandwidth saving under different ratios of cache overhead to task WCET. Axes: ratio of cache overhead over WCET of a task (x) vs. average resource bandwidth saved (y).

Fig. 16 Average bandwidth saving of HYBRID with cache overhead per task over HYBRID with maximum cache overhead of system (ratio of overhead over WCET is uniformly distributed in [0, 0.1]). Axes: task set utilization (x) vs. average resource bandwidth saved (y).

Impact of task period distribution. We further investigated the impact of the distribution of tasks’ periods
on the average bandwidth savings of the HYBRID approach compared to the BASELINE approach. For this
experiment, we generated a number of tasksets with taskset utilizations in the range [0, 5] with increments of
0.2, and, as usual, 25 tasksets per taskset utilization. The individual tasks’ utilizations were drawn from the
bimodal-light distribution. For the tasks’ periods, we tried each of the three distributions that were discussed in
Section 10.1. Fig. 14(a)-Fig. 14(c) show the average resource bandwidth savings for the three different distributions
of tasks’ periods; in all three cases, the HYBRID approach consistently outperforms the BASELINE approach.
Impact of cost of cache overhead. We first generated 25 tasksets with taskset utilization 4.9 and uniformly
distributed the tasks of each taskset over four domains with harmonic periods. The tasks’ utilizations were
uniformly distributed in [0.001, 0.1], and their periods were uniformly distributed in [350ms, 850ms]. We
then varied the cache overhead of the tasks in these 25 tasksets, generating one derived set of tasksets for each cache-related
overhead ratio in [0, 0.1], with increments of 0.01. Recall from Section 10.1 that we
define the cache-related overhead ratio of a task τi to be the cost of one cache-related overhead of τi divided
by the worst-case execution time of τi .
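The sweep over cache-related overhead ratios can be sketched as follows: each base taskset is reused unchanged, and only the per-task overhead is rescaled to the chosen ratio. The function names and the tuple layout are illustrative assumptions, and the two tasks in the usage example are made up.

```python
def overhead_ratios(start=0.0, stop=0.1, step=0.01):
    """Ratios from start to stop inclusive, rounded to avoid float drift."""
    n = round((stop - start) / step)
    return [round(start + i * step, 10) for i in range(n + 1)]


def with_overhead(taskset, ratio):
    """Attach a per-event cache overhead of ratio * WCET to every task.

    taskset: list of (period_ms, wcet_ms)
    returns: list of (period_ms, wcet_ms, overhead_ms)
    """
    return [(period, wcet, ratio * wcet) for period, wcet in taskset]


if __name__ == "__main__":
    base = [(500, 10.0), (400, 20.0)]  # made-up tasks for illustration
    for ratio in overhead_ratios():
        print(ratio, with_overhead(base, ratio))
```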

                      Overhead-free MPR    Overhead-free DMPR   HYBRID               BASELINE
                      Theory    RT-Xen     Theory    RT-Xen     Theory    RT-Xen     Theory    RT-Xen
Schedulable           Yes       No         Yes       No         No        No         No        No
Deadline miss ratio   -         78%        -         78%        -         0.07%      -         7%

Table 1 Performance in theory vs. in practice.

Fig. 15 shows the average resource bandwidth savings of the HYBRID approach over the BASELINE
approach for each cache overhead ratio. We observe that the HYBRID approach saves more resources as the
cache-related overhead ratio increases. This is expected because tasks’ utilizations are uniformly distributed
over [0.001, 0.1] and a taskset has more tasks than the number of VCPUs. Since the BASELINE approach
inflates the WCET of every task with all the cache-related overheads any task can experience, its total cache
overhead increases as the cost of one cache-related overhead increases.
Impact of per-task cache overheads. When different tasks can have different costs for cache-related overheads,
it is pessimistic to simply use the largest cache overhead in the system, as we did in [Xu et al., 2013]. To
evaluate the impact of considering cache overheads per task, we generated tasks with different cache-related
overhead ratios, drawn from a uniform distribution over [0, 0.1]. We then calculated the system’s interface
with the HYBRID analysis using the following two approaches: (1) using a per-task cost of cache overheads,
as we did in this work; and (2) using the upper bound on the cache overhead
in the system as the cost for each task, as we did in [Xu et al., 2013].
Fig. 16 shows the average resource bandwidth savings of the HYBRID approach with per-task cache
overheads relative to the more pessimistic approach. We observe that the HYBRID approach with per-task
cache overheads consistently outperformed the pessimistic approach; however, the saving does not increase as
the taskset utilization increases. This is because the TASK-CENTRIC-UB approach only considers the cache
overhead caused by task-preemption events, and each task’s WCET is only inflated with one cache overhead.
Therefore, the pessimistic HYBRID analysis with the system’s maximum cache overhead may have the same
upper-bounded number of full VCPUs as the HYBRID analysis with cache overhead per task. There are two cases: (1) when both
analyses use the upper-bounded number of full VCPUs as the components’ interface, the HYBRID analysis
with per-task cache overheads will have the same interface bandwidth as the pessimistic analysis and thus saves
no resources; however, (2) if both HYBRID analyses choose the interfaces computed by the MODEL-CENTRIC
analysis, the HYBRID analysis with per-task cache overheads will save resources relative to the pessimistic
approach because, every time a cache-related overhead occurs, the pessimistic approach charges a larger
cache overhead.
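Why charging every task the system-wide maximum overhead can never require less than charging per-task overheads can be seen in a small sketch. It assumes, as the text describes for task-preemption events, that each task's WCET is inflated by exactly one cache overhead; the function name, the tuple layout, and the numbers in the usage example are illustrative and not the paper's actual interface computation.

```python
def inflated_utilization(tasks, use_system_max=False):
    """Total utilization after inflating each WCET by one cache overhead.

    tasks: list of (period_ms, wcet_ms, overhead_ms)
    use_system_max: if True, charge every task the largest overhead found
                    in the system instead of its own overhead.
    """
    max_overhead = max((oh for _, _, oh in tasks), default=0.0)
    return sum(
        (wcet + (max_overhead if use_system_max else oh)) / period
        for period, wcet, oh in tasks
    )


if __name__ == "__main__":
    tasks = [(500, 10.0, 0.2), (400, 20.0, 1.0)]  # made-up values
    per_task = inflated_utilization(tasks)
    system_max = inflated_utilization(tasks, use_system_max=True)
    print(per_task, system_max)  # per_task never exceeds system_max
```

The gap between the two totals is an upper bound on what per-task accounting can save at this step; whether it translates into a smaller interface depends on which of the two cases discussed above applies.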
10.4 Performance in theory vs. in practice
We also validated the correctness of the cache-aware interfaces (and the invalidity of the overhead-free
interfaces) in practice. For this experiment, we first computed the domains’ interfaces, and we then ran the
generated tasks on our RT-Xen experimental platform. The periods and budgets of the domains in RT-Xen
were chosen to be those of the respective computed interfaces. We then computed the schedulability and
deadline miss ratios of the tasks, based on the theoretical schedulability test and the measurements on the
RT-Xen platform. Table 1 shows the schedulability and deadline miss ratios of these methods (see footnote 10).
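The deadline miss ratio reported for the platform measurements is a per-job quantity, and can be computed from a job trace as sketched below. The trace format (finish time and absolute deadline per job) and the function name are assumptions made for illustration; they do not describe the actual RT-Xen or LITMUS^RT tracing tools.

```python
def deadline_miss_ratio(jobs):
    """Fraction of jobs that finish after their absolute deadline.

    jobs: iterable of (finish_time, absolute_deadline) in the same time unit.
    """
    jobs = list(jobs)
    if not jobs:
        return 0.0  # no jobs observed, so nothing was missed
    missed = sum(1 for finish, deadline in jobs if finish > deadline)
    return missed / len(jobs)


if __name__ == "__main__":
    # Made-up trace: two of the four jobs finish after their deadlines.
    trace = [(5, 10), (12, 10), (20, 20), (31, 30)]
    print(deadline_miss_ratio(trace))
```

Under this definition, a taskset counts as schedulable in practice only if the ratio is exactly zero.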
We observe that the overhead-free MPR and DMPR interfaces significantly underestimate the tasks’
resource requirements: even though the tasks were claimed to be schedulable by the computed interfaces,
78% of the jobs missed their deadlines. The experimental results also confirm that our cache-aware analysis
correctly estimated the resource requirements of the system in practice: the theory predicted that the tasks
would not be schedulable, and this was confirmed in practice by the nonzero deadline miss ratio, which was
0.07% for the HYBRID approach and 7% for the BASELINE (task-centric) approach. We also observe that the HYBRID
approach had fewer deadline misses than, and thus outperformed, the BASELINE approach.

10 We note that the interfaces given by the HYBRID method and the BASELINE method are the same as the interfaces given by
the cache-aware hybrid analysis method and task-centric analysis method proposed in the conference version [Xu et al., 2013],
respectively.

11 Related Work
Several compositional analysis techniques for multicore platforms have been developed (see e.g., [Baruah
and Fisher, 2009; Easwaran et al., 2009; Leontyev and Anderson, 2008; Lipari and Bini, 2010]) but, unlike
this work, they do not consider the platform overhead. There are also methods that account for cache-related
overhead in multicore schedulability analysis (e.g., [Brandenburg, 2011]), but they cannot be applied to the
virtualization and compositional setting. To the best of our knowledge, the only existing overhead-aware
interface analysis is for uniprocessors [Phan et al., 2013].
Prior work has already extended the multiprocessor resource model in a number of ways. Most notably,
Bini et al. introduced generalizations such as the parallel supply function [Bini et al., 2009], as well as later
refinements. These models capture the resource requirements at each different level of parallelism; thus,
they minimize the interface abstraction overhead that the MPR model incurs. However, they also increase
the complexity of the interface representation and the interface computation. Our work follows a different
approach: instead of adding more information, we make the supply pattern of the resource model more
deterministic. As a result, we can improve the worst-case resource supply of the model without increasing its
complexity. In addition, this approach helps to reduce the platform overhead that arises when these interfaces
are scheduled at the next level.
The semi-partitioned EDF scheduling we use at the VMM level is similar to the strategy proposed for
soft real-time tasks by Leontyev and Anderson [Leontyev and Anderson, 2008], in which the bandwidth
requirement of a container is distributed to a number of dedicated processors as well as a periodic server,
which is globally scheduled onto the remaining processors. The two key differences to our work are that 1) we
use gEDF within the domains, which necessitates a different analysis, and that 2) unlike our work, [Leontyev
and Anderson, 2008] does not consider cache overhead.
There are other lines of cache-related research that benefit our work. For example, results on intrinsic
cache analysis and WCET estimation [Hardy et al., 2009] can be used as an input to our analysis; studies
on cache-related preemption and migration delay [Bastoni et al., 2010] can be used to obtain the per-task
cache overhead value $\Delta^{crpmd}_{\tau_i}$ used in our analysis; and cache-aware scheduling, such as [Guan et al.,
2009], can be used to reduce the additional cache-related overhead in the compositional/virtualization setting.

12 Conclusion
We have presented a cache-aware compositional analysis technique for real-time virtualization multicore
systems. Our technique accounts for the cache overhead in the component interfaces, and thus enables a safe
application of the analysis theories in practice. We have developed three different approaches, BASELINE,
TASK-CENTRIC-UB and MODEL-CENTRIC, for analyzing the cache-related overhead and for testing the
schedulability of components in the presence of cache overhead. We have also introduced an improved supply
bound function for the MPR model and a deterministic extension of the MPR model, which improve the
interface resource efficiency, as well as accompanying overhead-aware interface computation methods. Our
evaluation on synthetic workloads shows that our improved SBF and the DMPR interface model can help
reduce resource bandwidth by a significant factor compared to the MPR model with the existing SBF, and that
a hybrid of TASK-CENTRIC-UB and MODEL-CENTRIC achieves significant resource savings compared to the
BASELINE method (which is based solely on WCET inflation).
Acknowledgements This research was supported in part by the ONR N000141310802, NSF CNS-1329984, NSF CNS-1117185,
NSF ECCS-1135630, and MKE (The Ministry of Knowledge Economy), Korea, under the Global Collaborative R&D program
supervised by the KIAT (M002300089).


References
Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and
Andrew Warfield. Xen and the art of virtualization. In SOSP, 2003.
S. Baruah and N. Fisher. Component-based design in multiprocessor real-time systems. In ICESS, 2009.
Sanjoy Baruah and Theodore Baker. Schedulability analysis of global EDF. Real-Time Systems, 38(3):
223–235, 2008.
Andrea Bastoni, Björn B. Brandenburg, and James H. Anderson. Cache-Related Preemption and Migration
Delays: Empirical Approximation and Impact on Schedulability. In OSPERT, 2010.
Swagato Basumallick and Kelvin Nilsen. Cache issues in real-time systems. In LCTES, 1994.
E. Bini, M. Bertogna, and S. Baruah. Virtual multiprocessor platforms: Specification and use. In RTSS, 2009.
Björn B. Brandenburg. Scheduling and Locking in Multiprocessor Real-Time Operating Systems. PhD thesis,
The University of North Carolina at Chapel Hill, 2011.
Björn B. Brandenburg, Hennadiy Leontyev, and James H. Anderson. An overview of interrupt accounting
techniques for multiprocessor real-time systems. Journal of Systems Architecture, 57(6):638–654, 2011.
F. Bruns, S. Traboulsi, D. Szczesny, E. Gonzalez, Y. Xu, and A. Bilgic. An Evaluation of Microkernel-Based
Virtualization for Embedded Real-Time Systems. In ECRTS, 2010.
John M. Calandrino, Hennadiy Leontyev, Aaron Block, UmaMaheswari C. Devi, and James H. Anderson.
LITMUS^RT: A testbed for empirically comparing real-time multiprocessor schedulers. In RTSS, 2006.
A. Crespo, I. Ripoll, and M. Masmano. Partitioned Embedded Architecture Based on Hypervisor: the XtratuM
Approach. In EDCC, 2010.
Arvind Easwaran, Madhukar Anand, and Insup Lee. Compositional analysis framework using EDP resource
models. In RTSS, 2007.
Arvind Easwaran, Insik Shin, and Insup Lee. Optimal virtual cluster-based multiprocessor scheduling.
Real-Time Systems, 43(1):25–59, 2009.
Nan Guan, Martin Stigge, Wang Yi, and Ge Yu. Cache-aware scheduling and analysis for multicores. In
EMSOFT, 2009.
Damien Hardy, Thomas Piquet, and Isabelle Puaut. Using bypass to tighten WCET estimates for multi-core
processors with shared instruction caches. In RTSS, 2009.
Taesoo Kim, Marcus Peinado, and Gloria Mainar-Ruiz. System-level protection against cache-based side
channel attacks in the cloud. In USENIX Security, 2012.
J. Lee, S. Xi, S. Chen, L. T. X. Phan, C. Gill, I. Lee, C. Lu, and O. Sokolsky. Realizing compositional
scheduling through virtualization. In RTAS, 2012.
Jaewoo Lee, Linh T. X. Phan, Sanjian Chen, Oleg Sokolsky, and Insup Lee. Improving resource utilization for
compositional scheduling using DPRM interfaces. SIGBED Rev., 2011.
H. Leontyev and J. H. Anderson. A hierarchical multiprocessor bandwidth reservation scheme with timing
guarantees. In ECRTS, 2008.
Giuseppe Lipari and Enrico Bini. A framework for hierarchical scheduling on multiprocessors: From
application requirements to run-time allocation. In RTSS, 2010.
Linh T. X. Phan, Meng Xu, Jaewoo Lee, Insup Lee, and Oleg Sokolsky. Overhead-aware compositional
analysis of real-time systems. In RTAS, 2013.
Lui Sha, John P. Lehoczky, and Ragunathan Rajkumar. Solutions for Some Practical Problems in Prioritized
Preemptive Scheduling. In RTSS, 1986.
I. Shin and I. Lee. Periodic resource model for compositional real-time guarantees. In Proc. of the 24th IEEE
Real-Time Systems Symposium (RTSS), Cancun, Mexico, 2003.
Insik Shin, A. Easwaran, and Insup Lee. Hierarchical scheduling framework for virtual clustering of multiprocessors. In ECRTS, 2008.
Sisu Xi, Meng Xu, Chenyang Lu, Linh T. X. Phan, Christopher Gill, Oleg Sokolsky, and Insup Lee. Real-time
multi-core virtual machine scheduling in Xen. In EMSOFT, 2014.
Meng Xu, Linh T. X. Phan, Insup Lee, Oleg Sokolsky, Sisu Xi, Chenyang Lu, and Christopher D. Gill.
Cache-aware compositional analysis of real-time multicore virtualization platforms. In RTSS, 2013.

