Predictable Execution Model: Concept and Implementation by Pellizzoni, Rodolfo et al.
Predictable Execution Model: Concept and Implementation
Rodolfo Pellizzoni†, Emiliano Betti†, Stanley Bak†, Gang Yao‡, John Criswell† and Marco Caccamo†
† University of Illinois at Urbana-Champaign, IL, USA,
{rpelliz2, ebetti, sbak2, criswell, mcaccamo}@illinois.edu
‡ Scuola Superiore Sant’Anna, Italy, g.yao@sssup.it
Abstract
Building safety-critical real-time systems out of inex-
pensive, non-real-time, COTS components is challenging.
Although COTS components generally offer high perfor-
mance, they can occasionally incur significant timing de-
lays. To prevent this, we propose controlling the operating
point of each COTS shared resource (like the cache, mem-
ory, and interconnection buses) to maintain it below its sat-
uration limit. This is necessary because the low-level ar-
biters of these shared resources are not typically designed
to provide real-time guarantees. In this work, we introduce
a novel system execution model, the PRedictable Execution
Model (PREM), which, in contrast to the standard COTS ex-
ecution model, coschedules at a high level all active COTS
components in the system, such as CPU cores and I/O pe-
ripherals. In order to permit predictable, system-wide exe-
cution, we argue that real-time embedded applications need
to be compiled according to a new set of rules dictated by
PREM. To experimentally validate our theory, we developed
a COTS-based PREM testbed and modified the LLVM Com-
piler Infrastructure to produce PREM-compatible executa-
bles.
1. Introduction
Real-time embedded systems are increasingly being built
using commercial-off-the-shelf (COTS) components such
as mass-produced CPUs, peripherals and buses. The overall
performance of mass-produced components is often significantly
higher than that of custom-made systems. For example, a
PCI Express bus [14] can transfer data three orders of mag-
nitude faster than the real-time SAFEbus [8]. However, the
main drawback of using COTS components within a real-
time system is the presence of unpredictable timing anoma-
lies since the individual components are typically designed
paying little or no attention to worst-case timing behavior.
Additionally, modern COTS-based embedded systems in-
clude multiple active components (such as CPU cores and
I/O peripherals) that can independently initiate access to
shared resources, which, in the worst case, cause contention
leading to timing degradation.
Computing precise bounds on timing delays due to con-
tention is difficult. Even though some existing approaches
can produce safe upper bounds, they need to be very pes-
simistic due to the unpredictable behavior of arbiters of
physically shared COTS resources (like caches, memories,
and buses). As a motivating example, we have previously
shown that the computation time of a task can increase lin-
early with the number of suffered cache misses due to con-
tention for access to main memory [16]. In a system with
three active components, a task’s worst case computation
time can nearly triple. To exploit the high average perfor-
mance of COTS components without experiencing the long
delays occasionally suffered by real-time tasks, we need to
control the operating point of each COTS shared resource
and maintain it below saturation limits. This is necessary
because the low-level arbiters of the shared resources are
not typically designed to provide real-time guarantees. This
work aims at showing that this is indeed possible by care-
fully rethinking the execution model of real-time tasks and
by enforcing a high-level coscheduling mechanism among
all active COTS components in the system. Briefly, the key
idea is to coschedule active components so that contention
for accessing COTS shared resources is implicitly resolved
by the high-level coscheduler without relying on low-level,
non-real-time arbiters. Several challenges had to be over-
come to realize the PRedictable Execution Model (PREM):
• Task execution times suffer high variance due to inter-
nal CPU architecture features (caches, pipelines, etc.)
and unknown cache miss patterns. This source of tem-
poral unpredictability forces the designer to make very
pessimistic assumptions when performing schedulabil-
ity analysis. To address this problem, PREM uses a
novel program execution model with three main fea-
tures: (1) jobs are divided into a sequence of non-
preemptive scheduling intervals; (2) some of these
scheduling intervals (named predictable intervals)
are executed predictably and without cache-misses by
prefetching all required data at the beginning of the in-
terval itself; (3) the execution time of predictable in-
tervals is kept constant by monitoring CPU time coun-
ters at run-time.
• I/O peripherals with DMA master capabilities contend
for physically shared resources, including memory and
buses, in an unpredictable manner. To address this
problem, we expand upon our previous work [1]
and introduce hardware to put the COTS I/O subsys-
tem under the discipline of real-time scheduling.
• Low-level COTS arbiters are usually designed to
achieve fairness instead of real-time performance.
To address this problem, we enforce a coschedul-
ing mechanism that serializes arbitration requests
of active components (CPU cores and I/O peripher-
als). During the execution of a task’s predictable in-
terval, a scheduled peripheral can access the bus and
memory without experiencing delays due to cache
misses caused by the task’s execution.
Our PRedictable Execution Model (PREM) can be used
with a high level programming language like C by set-
ting some programming guidelines and by using a modified
compiler to generate predictable executables. The program-
mer provides some information, like beginning and end of
each predictable execution interval, and the compiler gen-
erates programs which perform cache prefetching and en-
force a constant execution time in each predictable inter-
val. In light of the above discussion, we argue that real-
time embedded applications should be compiled according
to a new set of rules dictated by PREM. At the price of mi-
nor additional work by the programmer, the generated ex-
ecutable becomes far more predictable than state-of-the-art
compiled code, and when run with the rest of the PREM
system, shows significantly reduced worst-case execution
time.
The rest of the paper is organized as follows. Section 2
discusses related work. In Section 3 we describe our main
contribution: a co-scheduling mechanism that schedules I/O
interrupt handlers, task memory accesses and I/O peripheral
data transfers in such a way that access to shared COTS re-
sources is serialized achieving zero or negligible contention
during memory accesses. Then, in Sections 4 and 5 we discuss
the challenges in terms of hardware architecture and
code organization that must be met to predictably compile
real-time tasks. Section 6 presents our schedulability analy-
sis. Finally, in Section 7 we detail our prototype testbed, in-
cluding our compiler implementation based on the LLVM
Compiler Infrastructure [9], and provide an experimental
evaluation. We conclude with future work in Section 8.
2. Related Work
Several solutions have been proposed in prior real-time
research to address different sources of unpredictability in
COTS components, including real-time handling of periph-
eral drivers, real-time compilation, and analysis of con-
tention for memory and buses. For peripheral drivers,
Facchinetti et al. [4] proposed using a non-preemptive interrupt
server to better support the reuse of legacy drivers.
Additionally, analysis can be done to model worst-case
temporal interference caused by device drivers [10]. For
real-time compilation, a tight coupling between com-
piler and worst-case execution time (WCET) analyzer
can optimize a program’s WCET [5]. Alternatively, a
compiler-based approach can provide predictable pag-
ing [17]. For analysis of contention for memory and
buses, existing techniques can analyze the maximum de-
lay caused by contention for a shared memory or bus
under various access models [15, 20]. All these works at-
tempt to analyze or control a single resource, and ob-
tain safe bounds that are often highly pessimistic. Instead,
PREM is based on a global coschedule of all relevant sys-
tem resources.
Instead of using COTS components, other researchers
have discussed new architectural solutions that can
greatly increase system predictability by removing sig-
nificant sources of interference. Instead of a standard
cache-based architecture, a real-time scratchpad archi-
tecture can be used to provide predictable access time
to main memory [22]. The Precision Timed (PRET) machine
[3] promises to simultaneously deliver high computational
performance together with cycle-accurate estimation
of program execution time. While our PREM execution
model borrows some ideas from these works,
it exhibits one key difference: our model can be ap-
plied to existing COTS-based systems, without requir-
ing significant architectural redesign. This approach al-
lows PREM to leverage the advantage of the economy of
scale of COTS systems, and support the progressive migra-
tion of legacy systems.
3. System Model
We consider a typical COTS-based real-time embedded
system comprising a CPU, main memory and multiple
DMA peripherals. While in this paper we restrict our
discussion to single-core systems with no hardware multi-
threading, we believe that our predictable execution model
is also applicable to multicore systems. We will present a
predictable execution model for multicore systems as part
of our planned future work. The CPU can implement one or
more cache levels. We focus on the last cache level, which
typically employs a write-back policy. Whenever a task suffers
a cache miss in the last level, the cache controller must
access main memory to fetch the newly referenced cache
line and possibly write back a replaced cache line. Peripherals
are connected to the system through a COTS interconnect
such as PCI or PCIe [14]. DMA peripherals can autonomously
initiate data transfers on the interconnect. We
assume that all data transfers target main memory, that is,
data is always transferred between the peripheral’s internal
buffers and main memory. Therefore, we can treat main
memory as a single resource shared by all peripherals and
by the cache controller.
Figure 1: Real-Time I/O Management System. (Diagram labels: COTS motherboard with CPU, FSB, North Bridge, South Bridge, RAM, ATA disk, PCIe and PCI; three real-time bridges, each attached to a COTS peripheral; peripheral scheduler.)
The CPU executes a set of N real-time periodic tasks
$\Gamma = \{\tau_1, \ldots, \tau_N\}$. Each task can use one or more peripherals
to transfer input or output data to or from main memory.
We model all peripheral activities as a set of M periodic
I/O flows $\Gamma^{I/O} = \{\tau^{I/O}_1, \ldots, \tau^{I/O}_M\}$ with assigned
timing reservations, and we want to schedule them in such
a way that only one flow is transferred at a time. Unfortu-
nately, COTS peripherals do not typically conform to the
described model. As an example, consider a task receiv-
ing input data from a Network Interface Card (NIC). Delays
in the network could easily cause a burst of packets to ar-
rive at the NIC. Since a high-performance COTS NIC is de-
signed to autonomously transfer incoming packets to main
memory as soon as possible, the NIC could potentially re-
quire memory access for significantly longer than its ex-
pected periodic reservation. In [1], we first introduced a so-
lution to this problem consisting of a real-time I/O manage-
ment scheme. A diagram of our proposed architecture is de-
picted in Figure 1. A minimally intrusive hardware device,
called real-time bridge, is interposed between each periph-
eral and the rest of the system. The real-time bridge buffers
all incoming traffic from the peripheral and delivers it pre-
dictably to main memory according to a global I/O sched-
ule. Outgoing traffic is also retrieved from main memory
in a predictable fashion. To maximize responsiveness and
avoid CPU overhead, the I/O schedule is computed by a sep-
arate peripheral scheduler, a hardware device based on the
previously-developed [1] reservation controller, which controls
all real-time bridges. For simplicity, in the rest of our
discussion we assume that each peripheral services a single
task. However, this assumption can be lifted by supporting
peripheral virtualization in real-time bridges. More details
about our hardware/software implementation are provided
in Section 7.
Figure 2: Predictable Interval with constant execution time. (Diagram labels: predictable execution intervals; CPU execution; cache fetches and replacements; peripheral data transfers; memory phase; execution phase.)
Notice that our previously-developed I/O management
system [1] does not solve the problem of memory interfer-
ence between peripherals and CPU tasks. When a typical
real-time task is executed on a COTS CPU, cache misses
are unpredictable, making it difficult to avoid low-level con-
tention for access to main memory. To overcome this is-
sue, we propose a set of compiler and OS techniques that
enable us to predictably schedule all cache misses during
a given portion of a task execution. The code for each
task $\tau_i$ is divided into a set of $N_i$ scheduling intervals
$\{s_{i,1}, \ldots, s_{i,N_i}\}$, which are executed sequentially at run-time.
The timing requirements of $\tau_i$ can be expressed by
a tuple $\{\{e_{i,1}, \ldots, e_{i,N_i}\}, p_i, D_i\}$, where $p_i$, $D_i$ are the period
and relative deadline of the task, with $D_i \le p_i$, and $e_{i,j}$
is the maximum execution time of $s_{i,j}$, assuming that the interval
runs in isolation with no memory interference. A job
can only be preempted by a higher priority job at the end
of a scheduling interval. This ensures that the cache content
cannot be altered by the preempting job during the execution
of an interval. We classify the scheduling intervals
into compatible intervals and predictable intervals.
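The task tuple above maps naturally onto a small data structure. The following is only a sketch of one possible encoding; every type and function name in it (`sched_interval`, `prem_task`, `task_demand`, `task_is_well_formed`) is a hypothetical illustration rather than part of the PREM implementation, and time is kept in unspecified integer units.

```c
#include <assert.h>
#include <stddef.h>

/* All type and function names here are hypothetical illustrations. */
typedef enum { INTERVAL_COMPATIBLE, INTERVAL_PREDICTABLE } interval_kind;

typedef struct {
    interval_kind kind;
    unsigned long e;                 /* e_{i,j}: max execution time in isolation */
} sched_interval;

typedef struct {
    const sched_interval *intervals; /* s_{i,1} .. s_{i,N_i}, executed in order */
    size_t n_intervals;              /* N_i */
    unsigned long period;            /* p_i */
    unsigned long deadline;          /* D_i */
} prem_task;

/* Checks the structural constraints stated in the text: N_i >= 1, D_i <= p_i. */
int task_is_well_formed(const prem_task *t)
{
    assert(t != NULL);
    return t->intervals != NULL && t->n_intervals >= 1 &&
           t->deadline <= t->period;
}

/* Sum of e_{i,j} over all intervals: the job's execution demand in isolation. */
unsigned long task_demand(const prem_task *t)
{
    unsigned long total = 0;
    assert(t != NULL);
    for (size_t j = 0; j < t->n_intervals; j++)
        total += t->intervals[j].e;
    return total;
}
```

Because preemption is only allowed at interval boundaries, a scheduler built on this structure would need to consult the task list only when an interval ends, never in the middle of one.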
Compatible intervals are compiled and executed with-
out any special provisions (they are backwards compatible).
Cache misses can happen at any time during these intervals.
The task code is allowed to perform OS system calls, but
blocking calls must have bounded blocking time. Further-
more, the task can be preempted by interrupt handlers of
associated peripherals. We assume that the maximum execution
time $e_{i,j}$ for a compatible interval can be computed
based on traditional static analysis techniques. However, to
reduce the pessimism in the analysis, we prohibit periph-
eral traffic from being transmitted during a compatible in-
terval. Ideally, there should be a small number of compati-
ble intervals which are kept as short as possible.
Predictable intervals are specially compiled to execute
according to the PREM model shown in Figure 2, and exhibit
three main properties. First, each predictable interval is
divided into two different phases. During the initial memory
phase, the CPU accesses main memory to perform a set of
cache line fetches and replacements. At the end of the memory
phase, all cache lines required during the predictable interval
are available in last level cache. Second, during the subsequent
execution phase, the task performs useful computation without
suffering any last level cache misses. Predictable intervals do
not contain any system calls and cannot be preempted by interrupt
handlers. Hence, the CPU does not perform any external main
memory access during the execution phase. Due to this property,
peripheral traffic can be scheduled during the execution
phase of a predictable interval without causing any contention
for access to main memory. Third, at run-time, we
force the execution time of a predictable interval to be always
equal to $e_{i,j}$. Let $e^{mem}_{i,j}$ be the maximum time required
to complete the memory phase and $e^{exec}_{i,j}$ to complete the
execution phase. Then offline we set $e_{i,j} = e^{mem}_{i,j} + e^{exec}_{i,j}$, and
at run-time, even if the memory phase lasts for less than
$e^{mem}_{i,j}$ time units, the overall interval still completes in exactly
$e_{i,j}$. This property greatly increases task predictability
without affecting CPU worst-case guarantees. In particular,
as we show in Section 6, it ensures that hard real-time guarantees
can be extended to I/O flows.
Figure 3: Example System-Level Predictable Schedule. (Diagram labels: time axis; rows for $\tau_1$, $\tau_2$, $\tau^{I/O}_1$ and $\tau^{I/O}_2$; intervals $s_{1,1}$ to $s_{1,3}$ and $s_{2,1}$ to $s_{2,4}$; input data and output data; legend: compatible interval, memory phase, execution phase, I/O flow.)
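A small worked instance may make the third property concrete. The numbers are illustrative assumptions, not values from the paper: take $e^{mem}_{i,j} = 3$ and $e^{exec}_{i,j} = 5$, and suppose the memory phase of one particular job happens to finish after only 2 time units.

```latex
\begin{align*}
e_{i,j} &= e^{mem}_{i,j} + e^{exec}_{i,j} = 3 + 5 = 8 \\
t_{\text{computation done}} &\le 2 + e^{exec}_{i,j} = 7 \le e_{i,j} \\
t_{\text{interval end}} &= e_{i,j} = 8
\end{align*}
```

In this instance the job finishes its computation by time 7 at the latest and then waits until time 8, so the interval length seen by the rest of the schedule does not depend on how quickly the memory phase ran.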
Figure 3 shows a concrete example of a system-level predictable
schedule for a task set comprising two tasks $\tau_1$, $\tau_2$
together with two I/O flows $\tau^{I/O}_1$, $\tau^{I/O}_2$ which service $\tau_1$ and
$\tau_2$ respectively. Both tasks and I/O flows are scheduled according
to fixed priority, with $\tau_1$ having higher priority than
$\tau_2$ and $\tau^{I/O}_1$ higher priority than $\tau^{I/O}_2$. We set $D_i = p_i$
and assign to each I/O flow the same period and deadline
as its serviced task and a transmission time equal to 4 time
units. As shown in Figure 3 for task $\tau_1$, this means that the
input data for a given job is transmitted in the period before
the job is executed, and the output data is transmitted
in the period after. Task $\tau_1$ has a single predictable interval
of length $e_{1,2} = 4$ while $\tau_2$ has two predictable intervals
of lengths $e_{2,2} = 4$ and $e_{2,3} = 3$. The first and last
intervals of both $\tau_1$ and $\tau_2$ are special compatible intervals.
These intervals are needed to execute the associated periph-
eral driver (including interrupt handlers) and set up the re-
ception and transmission buffers in main memory (i.e. read
and write system calls). More details are provided in Sec-
tion 7. I/O flows can be scheduled both during execution
phases and while the CPU is idle. As we will show in Sec-
tion 6, the described scheme can be modeled as a hierar-
chical scheduling system [21], where the CPU schedule of
predictable intervals supplies available transmission time to
I/O flows. Therefore, existing tests can be reused to check
the schedulability of I/O flows. However, due to the charac-
teristics of predictable intervals, a more complex analysis is
required to derive the supply function.
4. Architectural Constraints and Solutions
Predictable intervals are executed in a radically differ-
ent way compared to the speculative execution model that
COTS components are typically designed to support. In this
section, we detail the challenges and solutions to implement
the PREM execution model on top of a COTS architecture.
4.1. Caching and Prefetch
Our general strategy to implement the memory phase
consists of two steps: (1) we determine the complete set of
memory regions that are accessed during the interval. Each
region is a contiguous area in virtual memory. In general, its
start address can only be determined at run-time, but its size
A is known at compile time. (2) During the memory phase,
we prefetch all cache lines that contain instructions and data
for required regions; most COTS instruction sets include a
prefetch instruction that can be used to load specific cache
lines in last level cache. Step (1) will be detailed in Section
5. Step (2) can be successful only if there is no cache self-eviction,
that is, prefetching a cache line never evicts another
line that has already been accessed (prefetched) during
the same memory phase. In the remainder of this section,
we describe self-eviction prevention.
Figure 4: Cache organization with one memory region. (Diagram labels: virtual mem space and physical mem space with Page 1, Page 2, Page 3; cache way; cache line; line index 0 to 3.)
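Step (2) of the memory phase can be sketched as a loop that issues one prefetch per cache line of a region. This is a minimal illustration, not the code emitted by the PREM compiler: the function name `prefetch_region` is hypothetical, and `__builtin_prefetch` is a GCC/Clang builtin that only hints at a prefetch, so which cache level is actually filled depends on the target architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Issues a prefetch hint for every cache line overlapping
 * [start, start + size). line_size must be a power of two.
 * Returns the number of cache lines touched. */
size_t prefetch_region(const void *start, size_t size, size_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    if (size == 0)
        return 0;

    uintptr_t mask  = ~(uintptr_t)(line_size - 1);
    uintptr_t first = (uintptr_t)start & mask;               /* first line base */
    uintptr_t last  = ((uintptr_t)start + size - 1) & mask;  /* last line base  */
    size_t lines = 0;

    for (uintptr_t a = first; ; a += line_size) {
        /* rw = 0 (read), locality = 3 (keep in all cache levels) */
        __builtin_prefetch((const void *)a, 0, 3);
        lines++;
        if (a == last)
            break;
    }
    return lines;
}
```

Visiting the lines in ascending address order on every invocation is one simple way to obtain a fixed prefetch order across jobs.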
Most COTS CPUs implement the last-level cache as an
N-way set associative cache. Let B be the total size of the
cache and L be the size of each cache line in bytes. Then the
byte size of each of the N cache ways is W = B/N. An associative
set is the set of all cache lines, one for each way,
which have the same index in cache; there are W/L associative
sets. Last level cache is typically physically tagged
and physically indexed, meaning that cache lines are accessed
based on physical memory addresses only. We also
assume that last level cache is not exclusive, that is, when
a cache line is copied to a higher cache level it is not removed
from the last level. Figure 4 shows an example where
L = 4, W = 16 (parameters are chosen to simplify the
discussion and are not representative of typical systems).
The main idea behind our conflict analysis is as follows: we
compute the maximum number of entries in each associative
set that are required to hold the cache lines prefetched
for all memory regions in a scheduling interval. Based on
the cache replacement policy, we then derive a safe lower
bound on the number of entries that can be prefetched in an
associative set without causing any self-eviction.
Consider a memory region of size A. The region will occupy
at most $K = \lceil (A-1)/L \rceil + 1$ cache lines. As shown in Figure 4
for a region with A = 15, K = 5, the worst case is
produced when the region uses a single byte in its first cache
line. Assume now that virtual memory addresses coincide
with physical addresses. Then since the region is contiguous
and there are W/L cache lines in each way, the maximum
number of entries used in any associative set by the
region is $\lceil K/(W/L) \rceil$. For example, the region in Figure 4 requires
two entries in the set with index 0. We then derive
the maximum number of entries for the entire interval by
summing the entries required for each memory region. Un-
fortunately, this is not generally true if the system employs
paged virtual memory. If the page size P is smaller than the
size W of each way, the index of each cache line inside the
cache way is different for virtual and physical addresses. In
the example of Figure 4 with P = 8, the number of entries
for the memory region is increased from 2 to 3. We consider
two solutions: 1) if the system supports it, we can select
a page size that is a multiple of W just for our specific process.
This solution, which we employed in our implementation,
solves the problem because the index in cache for virtual
and physical addresses is the same no matter the page allo-
cation. 2) We use a modified page allocation algorithm in
the OS. Cache-aware allocation algorithms have been pro-
posed in the literature, for example in [11] for cache parti-
tioning. Note that a suitable allocation algorithm could de-
crease the required number of associative entries by control-
ling the allocation in physical memory of multiple regions.
We plan to pursue this solution in our future work.
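The per-region bounds from this discussion are easy to check numerically. The sketch below assumes that cache indices agree for virtual and physical addresses (for instance because the page size is a multiple of W, as in solution 1); the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static size_t ceil_div(size_t a, size_t b)
{
    assert(b != 0);
    return (a + b - 1) / b;
}

/* K = ceil((A - 1) / L) + 1: worst-case cache lines for a region of A bytes. */
size_t max_lines_for_region(size_t region_size, size_t line_size)
{
    assert(region_size > 0);
    return ceil_div(region_size - 1, line_size) + 1;
}

/* ceil(K / (W / L)): worst-case entries the region needs in one associative set. */
size_t max_entries_per_set(size_t region_size, size_t line_size, size_t way_size)
{
    size_t sets = way_size / line_size;   /* W / L associative sets */
    assert(sets > 0);
    return ceil_div(max_lines_for_region(region_size, line_size), sets);
}

/* Bound for a whole interval: sum of the per-region bounds. */
size_t interval_entries_per_set(const size_t *region_sizes, size_t n_regions,
                                size_t line_size, size_t way_size)
{
    size_t total = 0;
    for (size_t r = 0; r < n_regions; r++)
        total += max_entries_per_set(region_sizes[r], line_size, way_size);
    return total;
}
```

With the parameters of Figure 4 (A = 15, L = 4, W = 16), the first function yields K = 5 and the second yields 2 entries, matching the values quoted in the text.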
We now consider the cache replacement policy. We con-
sider four different replacement policies: random, FIFO,
LRU and pseudo-LRU, which cover most implemented
cache architectures. Let Q be the maximum number of en-
tries in any associative set required by the predictable in-
terval. Furthermore, let Q′ be the number of such entries
relative to cache lines that are accessed multiple times dur-
ing the memory phase; in our implementation, this only in-
cludes the cache lines that contain the small amount of in-
structions of the memory phase itself, so Q′ = 1. A de-
tailed analysis of the four replacement policies, based on
the work in [7, 19], is provided in Appendix A; in partic-
ular, based on the replacement policy and possibly Q′, we
show how to compute a lower bound on Q such that no self-eviction
can happen if Q is less than or equal to the derived
bound. It is important to notice that the bound for some
policies depends on the state of the cache before the begin-
ning of the predictable interval. Therefore, we consider two
different cache invalidation models. In the full-invalidation
model, none of the cache lines used by a predictable inter-
val are already available in cache at the beginning of the
memory phase. Furthermore, the replacement policy is not
affected by the presence of invalidated cache lines. As an
example, this model can be enforced by fully invalidating
the cache either at the end or at the beginning of each pre-
dictable interval, which in most architectures has the side
effect of resetting all replacement queues/trees. While this
model is more predictable, several COTS CPUs do not sup-
port it, including our testbed described in Section 7. In the
partial-invalidation model, selected lines in last level cache
can be invalidated, for example to handle data coherency
between the CPU and peripherals in compatible intervals.
Furthermore, the replacement policy always selects inval-
idated cache lines before valid lines, independently of the
state of the queue/tree. Results in Appendix A are summa-
rized by the following theorem.
Theorem 1. If cache lines are prefetched in a consistent order,
then at the end of the memory phase all Q cache lines
in an associative set will be available in cache, requiring at
most Q fetches to be loaded, if Q is at most equal to:
• 1 for random;
• N for FIFO and LRU;
• $\log_2 N + 1$ for pseudo-LRU with partial-invalidation
semantics;
• $N/2^{Q'} + Q'$ if $Q' < \log_2 N$, and $\log_2 N + 1$ otherwise,
for pseudo-LRU with full-invalidation semantics.
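The bounds of Theorem 1 can be collected into a single lookup function. This is a sketch under two assumptions that should be checked against Appendix A: the full-invalidation pseudo-LRU bound is taken to be $N/2^{Q'} + Q'$ for $Q' < \log_2 N$, and N is assumed to be a power of two for the pseudo-LRU cases. All identifiers are hypothetical.

```c
#include <assert.h>

typedef enum {
    POLICY_RANDOM,
    POLICY_FIFO,
    POLICY_LRU,
    POLICY_PLRU_PARTIAL_INVALIDATION,
    POLICY_PLRU_FULL_INVALIDATION
} replacement_policy;

/* log2 of a power of two (assumed for the pseudo-LRU cases). */
static unsigned log2_exact(unsigned n)
{
    unsigned k = 0;
    assert(n != 0 && (n & (n - 1)) == 0);
    while (n > 1) {
        n >>= 1;
        k++;
    }
    return k;
}

/* Largest Q for which Theorem 1 rules out self-eviction.
 * n_ways is N; q_prime is Q' (only used for full-invalidation pseudo-LRU). */
unsigned max_safe_entries(replacement_policy p, unsigned n_ways, unsigned q_prime)
{
    switch (p) {
    case POLICY_RANDOM:
        return 1;
    case POLICY_FIFO:
    case POLICY_LRU:
        return n_ways;
    case POLICY_PLRU_PARTIAL_INVALIDATION:
        return log2_exact(n_ways) + 1;
    case POLICY_PLRU_FULL_INVALIDATION: {
        unsigned lg = log2_exact(n_ways);
        if (q_prime < lg)
            return (n_ways >> q_prime) + q_prime;   /* N / 2^{Q'} + Q' */
        return lg + 1;
    }
    }
    return 0;   /* not reached for a valid policy value */
}

/* An interval needing q entries in its fullest set is safe if q <= bound. */
int interval_is_safe(replacement_policy p, unsigned n_ways,
                     unsigned q_prime, unsigned q)
{
    return q <= max_safe_entries(p, n_ways, q_prime);
}
```

For a 16-way cache with $Q' = 1$, this gives 9 entries under full invalidation but only 5 under partial invalidation.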
4.2. Computing Phase Length
To provide predictable timing guarantees, the maximum
time required for memory phase $e^{mem}_{i,j}$ and for execution
phase $e^{exec}_{i,j}$ must be computed. We assume that an upper
bound to $e^{exec}_{i,j}$ can be derived based on static analysis of
the execution phase. Note that our model only ensures that
all required instructions and data are available in last level
cache; cache misses can still occur in higher cache levels.
Analyses that derive patterns of cache misses for split level
1 data and instruction caches have been proposed in [12,18].
These analyses can be safely executed because according to
our model, level 1 misses during the execution phase will
not require the CPU to perform any external access. $e^{mem}_{i,j}$
depends on the time required to prefetch all accessed memory
regions. Note that since the task can be preempted at
the boundary of scheduling intervals, the cache state is un-
known at the beginning of the memory phase. Hence, each
prefetch could require both a cache line fetch and a re-
placement in last cache level. An analysis to compute up-
per bounds for read/write operations using a COTS DRAM
memory controller is detailed in [13]. Note that since the
number and relative addresses of prefetched cache lines are
known, the analysis could potentially exploit features such
as burst read and parallel bank access that are common in
modern memory controllers.
Finally, for systems implementing paged virtual mem-
ory, we employ the following three assumptions: 1) the
CPU supports hardware Translation Lookaside Buffer
(TLB) management; 2) all pages used by predictable in-
tervals are locked in main memory; 3) the TLB is large
enough to contain all page entries for a predictable inter-
val without suffering any conflict. Under such assump-
tions, each page used in a predictable interval can cause at
most one TLB miss during the memory phase, which re-
quires a number of fetches in main memory equal to at
most the number of levels in the page table.
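The counting argument of this section can be sketched as a function that bounds the number of main-memory operations issued for one region during the memory phase. It only counts operations; turning the count into a time bound requires a memory controller analysis such as [13]. The page count reuses the worst-case alignment argument given earlier for cache lines, which is an assumption of this sketch, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static size_t ceil_div(size_t a, size_t b)
{
    assert(b != 0);
    return (a + b - 1) / b;
}

/* Worst-case number of aligned units (cache lines or pages) that a region of
 * `size` bytes can overlap when its start address is arbitrary. */
static size_t max_units_spanned(size_t size, size_t unit)
{
    assert(size > 0);
    return ceil_div(size - 1, unit) + 1;
}

typedef struct {
    size_t line_size;          /* L, in bytes */
    size_t page_size;          /* P, in bytes */
    size_t page_table_levels;  /* fetches needed to serve one TLB miss */
} mem_params;

/* Upper bound on memory operations for one region:
 *  - every prefetched line may need one fetch plus one write-back of the
 *    line it replaces, since the cache state is unknown;
 *  - every page may cause one TLB miss costing at most
 *    page_table_levels fetches. */
size_t worst_case_memory_ops(size_t region_size, const mem_params *m)
{
    assert(m != NULL);
    size_t lines = max_units_spanned(region_size, m->line_size);
    size_t pages = max_units_spanned(region_size, m->page_size);
    return 2 * lines + pages * m->page_table_levels;
}
```

For a 4096-byte region with 64-byte lines, 4096-byte pages and a four-level page table, the bound is 2 × 65 + 2 × 4 = 138 operations.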
4.3. Interval Length Enforcement
As described in Section 3, each predictable interval is required
to execute for exactly $e_{i,j}$ time units. If $e_{i,j}$ is set
to be at least $e^{mem}_{i,j} + e^{exec}_{i,j}$, the computation in the execution
phase is guaranteed to terminate at or before the end
of the interval. The interval can then enter active wait until
$e_{i,j}$ time units have elapsed since the beginning of its memory
phase. In our implementation, elapsed time is computed
using a CPU performance counter that is directly accessible
by the task; therefore, neither OS nor memory interaction
is required to enforce interval length.
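The active wait can be sketched in portable C. This illustration uses the standard `clock()` function as a stand-in for the CPU performance counter mentioned above; unlike a directly readable counter, `clock()` may involve the OS, so the sketch shows only the control flow. The names `run_fixed_length`, `us_to_ticks` and `sum_body` are hypothetical.

```c
#include <assert.h>
#include <time.h>

typedef void (*interval_body)(void *arg);

/* Converts microseconds to clock() ticks. */
clock_t us_to_ticks(unsigned long us)
{
    return (clock_t)(((double)us * (double)CLOCKS_PER_SEC) / 1e6);
}

/* Runs `body`, then spins until `budget_ticks` have elapsed since the start.
 * Returns the elapsed ticks at exit; a value well above budget_ticks means
 * the body overran its budget. */
clock_t run_fixed_length(interval_body body, void *arg, clock_t budget_ticks)
{
    assert(body != NULL);
    clock_t start = clock();
    assert(start != (clock_t)-1);      /* processor time must be available */

    body(arg);                         /* memory phase + execution phase */

    while (clock() - start < budget_ticks) {
        /* active wait: no blocking call, no yielding */
    }
    return clock() - start;
}

/* Example body: sums a small array, standing in for real computation. */
typedef struct {
    const int *data;
    int n;
    long sum;
} sum_job;

void sum_body(void *arg)
{
    sum_job *job = arg;
    job->sum = 0;
    for (int i = 0; i < job->n; i++)
        job->sum += job->data[i];
}
```

A caller would invoke `run_fixed_length(sum_body, &job, us_to_ticks(2000))` to give the interval a fixed length of roughly 2 ms regardless of how quickly the body completes.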
4.4. Scheduling Synchronization
In our model, peripherals are only allowed to transmit
during a predictable interval’s execution phase or while the
CPU is idle. To compute the peripheral schedule, the periph-
eral scheduler must thus know the status of the CPU sched-
ule. Synchronization can be achieved by connecting the pe-
ripheral scheduler to a peripheral interconnection as shown
in Figure 1. Scheduling messages can then be sent by ei-
ther a task or the OS to the peripheral scheduler. In partic-
ular, at the end of each memory phase the task sends to the
peripheral scheduler the remaining amount of time until the
end of the current predictable interval. Note that propagat-
ing a message through the interconnection takes non-zero
time. Since this time is typically negligible compared to the
size of a scheduling interval¹, we will ignore it in the rest of
our discussion. However, the schedulability analysis of Sec-
tion 6 could be easily modified to take such overhead into
account.
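The value sent at the end of the memory phase follows directly from the interval definition. The sketch below computes it and, as one illustrative way of accounting for the propagation overhead mentioned above, subtracts a fixed message delay; that subtraction is an assumption of this sketch rather than the analysis of Section 6, and the function names are hypothetical.

```c
#include <assert.h>

/* Time left until the end of the predictable interval, measured at the end
 * of the memory phase. The interval always lasts e_mem + e_exec, so a memory
 * phase that finishes early leaves more than e_exec. */
unsigned long remaining_after_memory_phase(unsigned long e_mem,
                                           unsigned long e_exec,
                                           unsigned long mem_elapsed)
{
    assert(mem_elapsed <= e_mem);
    return (e_mem - mem_elapsed) + e_exec;
}

/* Window usable by the peripheral scheduler if the message needs
 * `propagation` time units to arrive (illustrative accounting only). */
unsigned long usable_window(unsigned long remaining, unsigned long propagation)
{
    return remaining > propagation ? remaining - propagation : 0;
}
```

With a memory budget of 400 μs, an execution budget of 600 μs and a memory phase that actually took 350 μs, the task would report 650 μs; with a 1 μs propagation delay the peripheral scheduler could use 649 μs of it.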
Finally, to avoid executing interrupt handlers during pre-
dictable intervals, a peripheral should only raise interrupts
to the CPU during compatible intervals of its serviced task.
As we describe in Section 7, in our I/O management scheme
peripherals raise interrupts through their assigned real-time
bridge. Since the peripheral scheduler communicates with
each real-time bridge, it is used to block interrupt propa-
gation outside the desired compatible intervals. Note that
blocking real-time bridge interrupts to the CPU will not
cause any loss of input data because the real-time bridge
is capable of independently acknowledging the peripheral
and storing all incoming data in the bridge local buffer.
¹ In our implementation we measured an upper bound to the message
propagation time of 1 μs, while we envision scheduling intervals with
a length of 100-1000 μs.
5. Programming Model
Our system supports COTS applications written in standard
high-level languages such as C. Unmodified code can
be executed within one or more compatible intervals. To
create predictable intervals, programmers add source code
annotations as C preprocessor macros. The PREM real-time
compiler creates code for predictable intervals so that it
does not incur cache misses during the execution phase and
the interval itself has a constant execution time.
Due to current limitations of our static code analysis, we
assume that the programmer manually partitions the task
into intervals. In particular, compatible intervals can be han-
dled in the conventional way while each predictable inter-
val is encapsulated into a single function. All code within
the function, and any functions transitively called by this
function is executed as a single predictable interval. To cor-
rectly prefetch all code and data used by a predictable inter-
val, we impose several constraints upon its code:
1. Only scalar and array-based memory accesses should
occur within a predictable interval; there should be no
use of pointer-based data structures.
2. The code can use data structures, in particular arrays,
that are not local to functions in the predictable inter-
vals, e.g. they are allocated either in global memory or
in the heap². However, the programmer should specify
the first and last address that is accessed within the
predictable interval for each non-local data structure.
In general, it is difficult for the compiler to determine
the first and last access to an array within a piece of
code. The compiler needs this information to be able
to load the relevant portion of each array into the last-
level cache.
3. The functions within a predictable interval should not
be called recursively. As described below, the com-
piler will inline callees into callers to make all the
code within an interval contiguous in virtual memory.
Furthermore, no system calls should be made by code
within a predictable interval.
4. Code within a predictable interval may only have direct
calls. This removes the need for pointer analysis
to determine the targets of indirect function calls; such
analysis is usually imprecise and would bloat the code
within the interval.
5. No stack allocations should occur within loops. Since
all variables must be loaded into the cache at function
entry, it must be possible for the compiler to safely
hoist the allocations to the beginning of the function
which initiates the predictable interval.
While these constraints may seem restrictive, some of
the restricted features are rarely used in real-time C code
(e.g., indirect function calls), and the other constraints are
met by many types of functions. We believe that the benefit of
faster, more predictable behavior for program hot-spots
outweighs the restrictions imposed by our programming model.
Furthermore, existing code that is too complex to be compiled into
predictable intervals can still be executed inside compatible
intervals. Therefore, our model permits a smooth transition
for legacy systems.
Footnote 2: Note that data structures in the heap must have been
previously allocated during a compatible interval.
Note that the compiler can be used to verify that all of
the aforementioned restrictions are met. Simple static
analysis can determine whether there is any irregular data
structure usage, indirect function calls, or system calls. During
compilation, the compiler employs several transforms to en-
sure that code marked as being within a predictable inter-
val does not cause a cache miss. First, the compiler inlines
all functions called (either directly or transitively) during
the interval into the top-level function defining the interval.
This ensures that all program analysis is intra-procedural
and that all the code for the interval is contiguous within vir-
tual memory. Second, the compiler can transform the pro-
gram so that all cache misses occur during the memory
phase, which is located at the beginning of the predictable
scheduling interval. To be specific, it inserts code after the
function prologue to prefetch the code and data needed to
execute the interval. Based on the described constraints, this
includes three types of contiguous memory regions: (1) the
code for the function; (2) the actual parameters passed to the
function and the stack frame (which contains local variables
and register spill slots); and (3) the data structures marked
by the programmer as being accessed by the interval. Third,
the compiler inserts code to send scheduling messages to
the peripheral scheduler as will be described in Section 7.
Fourth, the compiler emits code at the end of the predictable
interval to enforce its constant length. In particular, the com-
piler identifies all return instructions within the function and
adds the required code before them. Finally, based on the in-
formation on prefetched memory regions, we assume that
an external tool, such as a static timing analyzer used to
compute maximum phase length in Section 4.2, can check
the absence of cache self-evictions according to the analy-
sis provided in Section 4.1.
6. Schedulability Analysis
PREM allows us to enforce strict timing guarantees for
both CPU tasks and their associated I/O flows. By setting
timing parameters as shown in Figure 3, the task sched-
ule becomes independent of I/O scheduling. Therefore, task
schedulability can be checked using available schedulabil-
ity tests. As an example, assume that tasks are scheduled
according to fixed priority scheduling as in Figure 3. For
a task τi, let $e_i = \sum_{j=1}^{N_i} e_{i,j}$ be the sum of the
execution times of its scheduling intervals, or equivalently, the
execution time of the whole task. Furthermore, let hpi ⊂ Γ be the
set of higher priority tasks than τi, and lpi the set of lower
priority tasks. Since scheduling intervals are executed
non-preemptively, τi can suffer a blocking time due to lower
priority tasks of at most
$B_i = \max_{\tau_l \in lp_i} \max_{j=1 \ldots N_l} e_{l,j}$. The
worst-case response time of τi can then be found [2] as the
fixed point ri of the iteration:
$$r_i^{k+1} = e_i + B_i + \sum_{l \in hp_i} \left\lceil \frac{r_i^{k}}{p_l} \right\rceil e_l, \qquad (1)$$
starting from $r_i^{0} = e_i + B_i$. Task set Γ is schedulable if
$\forall \tau_i : r_i \le D_i$.
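The iteration of Equation 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name response_time, the dictionary layout of a task, and the example task set are assumptions made for the example and are not part of the PREM implementation. Tasks are assumed to be listed in decreasing priority order, and each task lists the lengths of its scheduling intervals.

```python
import math

def response_time(i, tasks):
    """Fixed point of Equation 1 for task i, or None if it exceeds D_i.

    tasks is a list sorted by decreasing priority (an assumption of this
    sketch); each entry has 'intervals' (the e_{i,j}), 'p' and 'D'.
    """
    e = [sum(t["intervals"]) for t in tasks]
    # Blocking term B_i: longest single interval of any lower-priority task.
    b = max((max(t["intervals"]) for t in tasks[i + 1:]), default=0)
    r = e[i] + b
    while r <= tasks[i]["D"]:
        nxt = e[i] + b + sum(math.ceil(r / tasks[l]["p"]) * e[l]
                             for l in range(i))
        if nxt == r:
            return r
        r = nxt
    return None

# Hypothetical task set, time in arbitrary integer units.
example_tasks = [
    {"intervals": [1, 1], "p": 10, "D": 10},
    {"intervals": [2, 1], "p": 20, "D": 20},
    {"intervals": [3, 2], "p": 40, "D": 40},
]
```

Note that even the highest priority task has a response time larger than its execution time here, because it can be blocked by one non-preemptive interval of a lower priority task.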
We now turn our attention to peripheral scheduling. Assume
that each I/O flow $\tau_i^{I/O}$ is characterized by a maximum
transmission time $e_i^{I/O}$ (with no interference in both
main memory and the interconnect), period $p_i^{I/O}$ and relative
deadline $D_i^{I/O}$, where $D_i^{I/O} \le p_i^{I/O}$. The
schedulability analysis for I/O flows is more complex because the
scheduling of data transfers depends on the task schedule.
To solve this issue, we extend the hierarchical scheduling
framework proposed by Shin and Lee in [21]. In this frame-
work, tasks/flows in a child scheduling model execute using
a timing resource provided by a parent scheduling model.
Schedulability for the child model can be tested based on
the supply bound function sbf(t), which represents the min-
imum resource supply provided to the child model in any
interval of time t. In our model, the I/O flow schedule is
the child model and sbf(t) represents the minimum amount
of time in any interval of time t during which the execu-
tion phase of a predictable interval is scheduled or the CPU
is idle. Define the service time bound function tbf(t) as the
pseudo-inverse of sbf(t), that is,
$tbf(t) = \min\{x \mid sbf(x) \ge t\}$. Then if I/O flows are
scheduled according to fixed priority, in [21] it is shown that the
response time $r_i^{I/O}$ of flow $\tau_i^{I/O}$ can be computed
according to the iteration:
$$r_i^{I/O,k+1} = tbf\left( e_i^{I/O} + \sum_{l \in hp_i^{I/O}} \left\lceil \frac{r_i^{I/O,k}}{p_l^{I/O}} \right\rceil e_l^{I/O} \right), \qquad (2)$$
where $hp_i^{I/O}$ has the same meaning as hpi. In the remainder
of this section, we detail how to compute sbf(t).
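To make Equation 2 concrete, the following Python sketch computes tbf(t) by a linear search over integer time and then iterates the response time of an I/O flow. Several elements are assumptions of the sketch rather than part of the analysis in [21]: the names tbf, io_response_time and example_sbf, the integer time base, the search horizon, the starting value tbf(e_i^{I/O}) for the iteration, and the toy supply bound function.

```python
import math

def tbf(sbf, demand, horizon):
    """Pseudo-inverse of sbf on integer time: smallest x with sbf(x) >= demand."""
    for x in range(horizon + 1):
        if sbf(x) >= demand:
            return x
    return None  # demand cannot be served within the search horizon

def io_response_time(i, flows, sbf, horizon=1000):
    """Iterate Equation 2 for flow i; flows are sorted by decreasing priority."""
    r = tbf(sbf, flows[i]["e"], horizon)  # starting point (assumption)
    while r is not None and r <= flows[i]["D"]:
        demand = flows[i]["e"] + sum(
            math.ceil(r / flows[l]["p"]) * flows[l]["e"] for l in range(i))
        nxt = tbf(sbf, demand, horizon)
        if nxt == r:
            return r
        r = nxt
    return None

def example_sbf(t):
    # Toy supply bound: in every 10 time units, no supply is guaranteed
    # during the first 4 units and full supply during the remaining 6.
    return 6 * (t // 10) + max(0, t % 10 - 4)

example_flows = [
    {"e": 2, "p": 20, "D": 20},
    {"e": 3, "p": 40, "D": 40},
]
```

Any non-decreasing sbf(t), such as the one derived in the remainder of this section, could be passed in place of example_sbf.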
For the sake of simplicity, let us initially assume that
tasks are strictly periodic and that the initial activation time
of each task is known. Furthermore, notice that using the
solution described in Section 4, we could enforce interval
lengths not just for predictable intervals, but also for all
compatible intervals. Finally, let h be the hyperperiod of
task set Γ, defined as the least common multiple of all tasks’
periods. Under these assumptions, it is easy to see that if Γ is
feasible, the CPU schedule can be computed offline and re-
peats itself identically with period h after an initial interval
of 2h time units (h time units if all tasks are activated
simultaneously). Therefore, a tight sbf(t) can be computed as the
minimum amount of supply (time during which the CPU is
idle or in execution phase of a predictable interval) during
any interval of time t in the periodic task schedule, start-
ing from the initial two hyperperiods. More formally, let
{t1, . . . , tK} be the set of start times for all scheduling in-
tervals activated in the first two hyperperiods; the following
theorem shows how to compute sbf(t).
Theorem 2. Let sf(t′, t′′) be the amount of supply provided
in the periodic task schedule during interval [t′, t′′]. Then:
$$sbf(t) = \min_{k=1 \ldots K} sf(t_k, t_k + t). \qquad (3)$$
Proof. Let t′, t′′ be two consecutive interval start times in
the periodic schedule, and let $\bar{t}$ be any time instant between
them. We show that
$\forall \bar{t},\ t' \le \bar{t} \le t'' : \min\big(sf(t', t' + t),\ sf(t'', t'' + t)\big) \le sf(\bar{t}, \bar{t} + t)$.
This implies that to compute sbf(t), it suffices to check the start
times of all scheduling intervals.
Assume first that $\bar{t}$ falls inside a compatible interval or
a memory phase. Then the schedule provides no supply in
$[t', \bar{t}]$. Therefore:
$sf(t', t' + t) = sf(\bar{t}, t' + t) \le sf(\bar{t}, \bar{t} + t)$.
Now assume that $\bar{t}$ falls in an execution phase or in an idle
interval before the start time t′′ of the next scheduling
interval. Then $sf(\bar{t}, t'') = t'' - \bar{t}$. Therefore:
$sf(t'', t'' + t) \le sf(t'', t'' + t - (t'' - \bar{t})) + (t'' - \bar{t}) = sf(t'', \bar{t} + t) + sf(\bar{t}, t'') = sf(\bar{t}, \bar{t} + t)$.
To conclude the proof, it suffices to note that since the
schedule is periodic, if t′ ≥ 2h, then sf(t′, t′ + t) =
sf(tk, tk + t) where tk = (t′ mod h) + h.
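Theorem 2 can be illustrated with a short Python sketch. To keep it small, the sketch assumes that the schedule is already in steady state from time 0, so that the supply segments of one hyperperiod simply repeat with period h and only the start times within one hyperperiod need to be checked; the theorem itself considers the start times in the first two hyperperiods. The function names and the example schedule are hypothetical.

```python
def supply_in(a, b, supply, h):
    """Supply provided in [a, b] when the segments in `supply`
    (pairs (start, end) within one hyperperiod) repeat with period h."""
    per_period = sum(end - start for start, end in supply)

    def upto(x):
        # Supply accumulated in [0, x].
        full, rem = divmod(x, h)
        partial = sum(max(0, min(end, rem) - start) for start, end in supply)
        return full * per_period + partial

    return upto(b) - upto(a)

def sbf(t, starts, supply, h):
    """Equation 3: minimum supply over windows of length t that begin
    at the start time of a scheduling interval."""
    return min(supply_in(tk, tk + t, supply, h) for tk in starts)

# Hypothetical hyperperiod of 10 time units:
#   [0, 2)  memory phase of a predictable interval (no supply)
#   [2, 5)  execution phase                         (supply)
#   [5, 7)  compatible interval                     (no supply)
#   [7, 10) CPU idle                                (supply)
H = 10
SUPPLY = [(2, 5), (7, 10)]
STARTS = [0, 5]  # start times of the two scheduling intervals
```

In this example a window of length 4 receives at least 2 units of supply, while a window of length 2 may receive none.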
Unfortunately, the proposed sbf(t) derivation can only
be applied to strictly periodic tasks. If Γ includes any
sporadic task τi with minimum interarrival time pi, the
schedule cannot be computed offline. Therefore, we now
propose an alternative analysis that is independent of the CPU
scheduling algorithm and computes a lower bound $sbf^L(t)$
to sbf(t). Note that since sbf(t) is the minimum supply
in any interval of length t, using Equation 2 with $sbf^L(t)$
instead of sbf(t) will still result in a sufficient schedulability test.
Let $\overline{sbf}(t)$ be the maximum amount of time in which the
CPU is executing either a compatible interval or the memory
phase of a predictable interval in any time window of
length t. Then by definition, $sbf(t) = t - \overline{sbf}(t)$. Our
analysis derives an upper bound $sbf^U(t)$ to $\overline{sbf}(t)$.
Therefore, $sbf^L(t) = t - sbf^U(t)$ is a valid lower bound for
sbf(t). For a given interval size t, we bound $\overline{sbf}(t)$
using the following key idea. For each task τi, we determine the
minimum and maximum amount of time $E_i^{min}(t)$, $E_i^{max}(t)$ that
the task can execute in a time window of length t while meeting
its deadline constraint. Let $t_i$,
$E_i^{min}(t) \le t_i \le E_i^{max}(t)$, be the amount of time that τi
actually executes in the time window. Since the CPU is a single
shared resource, it must hold $\sum_{i=1}^{N} t_i \le t$ independently
of the task scheduling algorithm. Finally, for each task τi, we
define the memory
bound function $mbf_i(t_i)$ to be the maximum amount of time
that the task can spend in compatible intervals and memory
phases, assuming that it executes for $t_i$ time units in the
time window. We can then obtain $sbf^U(t)$ as the maximum
over all feasible $\{t_1, \ldots, t_N\}$ tuples of
$\sum_{i=1}^{N} mbf_i(t_i)$. In other words, we can compute
$sbf^U(t)$ by solving the following optimization problem:
$$sbf^U(t) = \max \sum_{i=1}^{N} mbf_i(t_i), \qquad (4)$$
$$\sum_{i=1}^{N} t_i \le t, \qquad (5)$$
$$\forall i,\ 1 \le i \le N : E_i^{min}(t) \le t_i \le E_i^{max}(t). \qquad (6)$$
$E_i^{max}(t)$ and $E_i^{min}(t)$ can be computed according to the
scenarios shown in Figures 5 and 6, where the notation $(x)^+$ is
used for max(x, 0).
Figure 5: Derivation of $E_i^{max}(t)$, periodic or sporadic task (example with t = 20).
Figure 6: Derivation of $E_i^{min}(t)$, periodic task (example with t = 20).
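Theorem 3. For a periodic or sporadic task:
$$E_i^{max}(t) = \min(t, e_i) + \left(\left\lfloor \frac{t - e_i - (p_i - D_i)}{p_i} \right\rfloor\right)^+ e_i + \Big(\min\big(t - e_i - (p_i - D_i) \bmod p_i,\ e_i\big)\Big)^+. \qquad (7)$$
For a sporadic task, $E_i^{min}(t) = 0$, while for a periodic task:
$$E_i^{min}(t) = \left(\left\lfloor \frac{t - (p_i + D_i - 2e_i)}{p_i} \right\rfloor\right)^+ e_i + \Big(\min\big(t - (p_i + D_i - 2e_i) \bmod p_i,\ e_i\big)\Big)^+. \qquad (8)$$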
Proof. As shown in Figure 5, the relative distance between
successive executions of a task τi is minimized when the
task finishes at its deadline in the first period within the time
window of length t, and as soon as possible (i.e., ei time
units after its activation) for each successive period.
Equation 7 directly follows from this activation pattern,
assuming that the time window coincides with the start time of the
first job of τi. In particular:
• term min(t, ei) represents the execution time for the
first job, which is activated at the beginning of the time
window;
• term $\left(\left\lfloor \frac{t - e_i - (p_i - D_i)}{p_i} \right\rfloor\right)^+ e_i$
represents the execution time of jobs in periods that are fully
contained within the time window; note that the first period
starts ei + (pi − Di) time units after the beginning of the time
window;
• finally, term $\big(\min(t - e_i - (p_i - D_i) \bmod p_i,\ e_i)\big)^+$
represents the execution time for the last job in the time
window.
Similarly, as shown in Figure 6, the relative distance
between successive executions of a periodic task τi is
maximized when the task finishes as soon as possible in its
first period and as late as possible in all successive
periods. Equation 8 follows assuming that the time window
coincides with the finishing time of the first job of τi, noticing
that periodic task activations start (pi − ei) + (Di − ei) =
pi + Di − 2ei time units after the beginning of the time
window. Finally, $E_i^{min}(t)$ is zero for a sporadic task since by
definition the task has no maximum interarrival time.
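Equations 7 and 8 can be evaluated directly, as in the following Python sketch. Two reading choices are assumptions of the sketch: the mod operator is applied to the whole difference (for example, to t − ei − (pi − Di)), and whenever that difference is negative both the floor term and the mod term contribute zero, which is how the $(x)^+$ notation is interpreted here. The function names and the example parameters (ei = 2, pi = 10, Di = 8) are hypothetical.

```python
def e_max(t, e, p, D):
    """Equation 7: maximum execution of a task in a window of length t."""
    x = t - e - (p - D)      # window time left after the first job and gap
    if x < 0:                # both (.)^+ terms vanish
        return min(t, e)
    return min(t, e) + (x // p) * e + min(x % p, e)

def e_min(t, e, p, D, periodic=True):
    """Equation 8 for a periodic task; zero for a sporadic task."""
    if not periodic:
        return 0
    y = t - (p + D - 2 * e)  # window time after the longest idle gap
    if y < 0:
        return 0
    return (y // p) * e + min(y % p, e)
```

For ei = 2, pi = 10, Di = 8 and t = 20, the densest pattern executes in [0, 2], [4, 6] and [14, 16] for a total of 6 time units, while the sparsest periodic pattern executes only in [14, 16], which matches the values returned by the two functions.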
Lemma 4. The optimization problem of Equations 4-6
admits a solution if task set Γ is feasible.
Proof. The optimization problem admits a solution if the set
of ti values that satisfy Equations 5 and 6 is not empty. Note
that by definition $E_i^{max}(t) \ge E_i^{min}(t)$, therefore there is
at least one admissible solution iff
$\sum_{i=1}^{N} E_i^{min}(t) \le t$. Define
$E_i^{minU}(t) = \big(t - (D_i - e_i)\big)^+ \frac{e_i}{p_i}$. Then it
is easy to see that $\forall t, E_i^{min}(t) \le E_i^{minU}(t)$: in
particular, both functions are equal to ei for t = pi + Di − ei and
increase by ei every pi afterwards. Now note that
$E_i^{minU}(t) \le t \frac{e_i}{p_i}$, therefore it also holds:
$\sum_{i=1}^{N} E_i^{min}(t) \le t \sum_{i=1}^{N} \frac{e_i}{p_i}$.
The proof follows by noticing that since Γ is feasible, it must hold
$\sum_{i=1}^{N} \frac{e_i}{p_i} \le 1$.
Figure 7: Derivation of $mbf_i(t_i)$ and $mbf_i^U(t_i)$. The example
schedule has four scheduling intervals (lengths ei,1 to ei,4, start
times t1 to t4) and marks compatible intervals, memory phases and
execution phases; the linear bound shown has αi = 1/2 and δi = 5/2.
mbfi(ti) can be computed using a strategy similar to the
one in Theorem 2. Since ti represents the amount of time
that τi executes, we consider a schedule in which τi is
executed continuously, being activated every ei time units, as
shown in Figure 7. The start time of the first scheduling
interval in the first job of τi is t1 = 0, the start time of the
second scheduling interval is t2 = ei,1, and so on,
until the first interval of the second job, which has start time
equal to ei. Then mbfi(ti) can be computed as the maximum
amount of time that the task spends in a compatible
interval or memory phase in any time window of length ti.
Theorem 5. Let mfi(t′, t′′) be the amount of time that task
τi spends in a compatible interval or memory phase in the
interval [t′, t′′], in the schedule in which τi is executed
continuously. Then:
$$mbf_i(t_i) = \max_{k=1 \ldots N_i} mf_i(t_k, t_k + t_i). \qquad (9)$$
Proof. Note that the schedule in which τi executes
continuously is periodic with period ei. The same strategy as in
Theorem 2 can be used to show that if t′, t′′ are consecutive
interval start times, then ∀t, t′ ≤ t ≤ t′′ :
max(mfi(t′, t′ + ti), mfi(t′′, t′′ + ti)) ≥ mfi(t, t + ti).
As an example, in Figure 7 mbfi(ti) is maximized in the
time window that starts at t4 = 10.
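Equation 9 can be sketched in Python as follows. The representation is an assumption of the sketch: each scheduling interval is a pair (length, mem), where mem is the time at the beginning of the interval that counts as compatible interval or memory phase (mem equals length for a compatible interval). The job is repeated back to back with period ei, as in the schedule described above. The function names and the example task are hypothetical and do not reproduce the values of Figure 7.

```python
def mf(a, b, intervals):
    """Time in [a, b] spent in compatible intervals or memory phases
    when the job's scheduling intervals repeat back to back."""
    e = sum(length for length, _ in intervals)
    mem_per_job = sum(mem for _, mem in intervals)

    def upto(x):
        # Memory-type time accumulated in [0, x].
        full, rem = divmod(x, e)
        total = full * mem_per_job
        start = 0
        for length, mem in intervals:
            total += max(0, min(start + mem, rem) - start)
            start += length
        return total

    return upto(b) - upto(a)

def mbf(ti, intervals):
    """Equation 9: maximum over windows starting at interval start times."""
    starts, s = [], 0
    for length, _ in intervals:
        starts.append(s)
        s += length
    return max(mf(tk, tk + ti, intervals) for tk in starts)

# Hypothetical task with e_i = 10: a predictable interval of length 3
# (memory phase 1), a compatible interval of length 2, and a predictable
# interval of length 5 (memory phase 2).
EXAMPLE = [(3, 1), (2, 2), (5, 2)]
```

For this task the worst window of length 4 starts at the compatible interval and is spent entirely in compatible or memory-phase execution.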
Since mbfi(ti) is a nonlinear function, computing
sbfU (t) according to Equations 4-6 requires solving
a nonlinear optimization problem. To simplify the problem,
we consider a linear upper bound approximation
$mbf_i^U(t_i) = \alpha_i t_i + \delta_i$, as shown in Figure 7, where
$$\alpha_i = \Big( \sum_{\forall s_{i,j}\ \text{compatible}} e_{i,j} + \sum_{\forall s_{i,j}\ \text{predictable}} e_{i,j}^{mem} \Big) / e_i, \qquad (10)$$
and δi is the minimum value such that
$\forall t_i,\ mbf_i^U(t_i) \ge mbf_i(t_i)$. Using $mbf_i^U(t_i)$,
Equations 4-6 can be efficiently solved in O(N) time. Furthermore,
we can show that sbfU (t) can be computed for all relevant
values of t by solving the optimization problem of Equations
4-6 in a finite number of points, and using linear interpolation
to derive the remaining values. Due to the complexity
of the resulting algorithm, the full derivation is
provided in Appendix B.
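With the linear bound, the problem of Equations 4-6 for a single value of t becomes a fractional knapsack: every task first receives its minimum $E_i^{min}(t)$, and the remaining budget is handed out in order of decreasing αi. The Python sketch below follows this idea. It is not the algorithm of Appendix B: it sorts the tasks and therefore runs in O(N log N) rather than O(N), and the function name, the tuple layout and the example values are assumptions.

```python
def sbf_upper(t, tasks):
    """Maximize sum(alpha_i * t_i + delta_i) subject to sum(t_i) <= t and
    e_min_i <= t_i <= e_max_i.

    tasks is a list of (alpha, delta, e_min, e_max) tuples, with e_min and
    e_max already evaluated for the window length t.
    """
    base = sum(lo for _, _, lo, _ in tasks)
    if base > t:
        raise ValueError("no feasible assignment: sum of minima exceeds t")
    budget = t - base
    total = sum(alpha * lo + delta for alpha, delta, lo, _ in tasks)
    # Spend the remaining budget on the tasks with the largest slope first.
    for alpha, _, lo, hi in sorted(tasks, key=lambda task: -task[0]):
        extra = min(hi - lo, budget)
        total += alpha * extra
        budget -= extra
    return total

# Hypothetical two-task example for a window of length t = 10.
EXAMPLE_TASKS = [
    (0.5, 1.0, 0, 6),    # alpha, delta, E_min(t), E_max(t)
    (0.25, 0.5, 2, 8),
]
```

For the example, the result is 5.5, so the corresponding lower bound on the supply is 10 − 5.5 = 4.5 time units.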
7. Evaluation
In order to verify the validity and practicality of PREM,
we implemented the key components of the system. In this
section, we describe our evaluation, first introducing the
new hardware components followed by the corresponding
software driver and OS calibration effort. We then discuss
our compiler implementation and analyse its effectiveness
on a DES benchmark. Finally, using synthetic tasks we mea-
sure the effectiveness of the PREM system as a function of
cache stall time, and show traces of PREM when running
on COTS hardware.
7.1. PREM Hardware Components
When introducing PREM in Section 3, two additional
hardware components were added to the COTS system (re-
call Figure 1). Here, we briefly review our previous work
which describes these hardware components in more de-
tail [1], and describe the additions to these components
which we made to provide the mechanism for PREM ex-
ecution. First we describe the real-time bridge component,
then we describe the peripheral scheduler component.
Our real-time bridge prototype is implemented on an
ML507 FPGA Evaluation Platform, which contains a COTS
TEMAC Ethernet hardware block (the I/O peripheral un-
der control), and a PCIe edge connector (the interface to the
main system). Interposed between these hardware compo-
nents, our real-time bridge uses a System-on-Chip design
running Linux. The real-time bridge is also wired to the
peripheral scheduler, which allows direct communication
between these components. Three wires are used to transmit
information between each real-time bridge and the
peripheral scheduler: data ready, data block, and
interrupt block. The data ready wire is an
output signal sent to the peripheral scheduler which is
asserted whenever the COTS peripheral has buffered data
in the real-time bridge. The peripheral scheduler sends
two signals, data block and interrupt block,
to the real-time bridge. The data block signal is
asserted to block the real-time bridge from transferring data
in DMA mode over the PCIe bus from its local buffer to
main memory or vice versa. The interrupt block signal
is a new signal in the real-time bridge implementation,
added to support PREM, and instructs the real-time bridge
not to raise further interrupts. In previous work [1], we
provide a more detailed description and buffer analysis of
the real-time bridge design, and demonstrate the
component’s effectiveness and efficiency.
The peripheral scheduler is a unique component in the
system, connected directly to each real-time bridge. Unlike
our previous work where the peripherals were scheduled
asynchronously with the CPU [1], PREM requires the CPU
and peripherals to coordinate their access to main memory.
The peripheral scheduler provides a hardware implemen-
tation of the child scheduling model used to schedule pe-
ripherals, with each peripheral given a sporadic server to
schedule its traffic. Meanwhile, the OS real-time scheduler
runs the parent scheduling model for CPU tasks to com-
plete the hierarchical PREM scheduling model described
in Section 6. The peripheral scheduler is connected to the
PCIe bus, and exposes a set of registers accessible from the
main CPU. In the configuration register, constant pa-
rameters such as the maximum cache write-back time are
stored. Writing a value to the yield register indicates that
the CPU will not access main memory for the given amount
of time, and I/O peripherals should be allowed to read from and
write to RAM. The value written to the yield register
contains a 14-bit unsigned integer indicating the number of
microseconds to permit peripheral traffic with main
memory, as described in Section 4. The CPU can also use the
yield register to allow interrupts to be raised by
peripherals. Another 14-bit unsigned integer indicates the number of
microseconds to allow peripheral interrupts, and a 3-bit
interrupt mask selects which peripherals should be allowed to
raise interrupts. In this way, different CPU tasks can
predictably service interrupts from different I/O peripherals.
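As an illustration of the field widths only, the following Python sketch packs the three values into a single word. The bit layout is purely hypothetical: the text above gives the widths of the fields (14, 14 and 3 bits) but not their positions, and it does not state that they share one 32-bit word. The function name pack_yield is likewise an assumption.

```python
def pack_yield(traffic_us, irq_us=0, irq_mask=0):
    """Pack the yield fields into one word.

    Hypothetical layout (not taken from the hardware description):
      bits 0-13   microseconds of permitted peripheral traffic
      bits 14-27  microseconds of permitted peripheral interrupts
      bits 28-30  mask of peripherals allowed to raise interrupts
    """
    fields = (("traffic_us", traffic_us, 14),
              ("irq_us", irq_us, 14),
              ("irq_mask", irq_mask, 3))
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return traffic_us | (irq_us << 14) | (irq_mask << 28)
```

Whatever the real layout, a 14-bit field limits a single yield to 16383 microseconds, which comfortably covers scheduling intervals of 100-1000us.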
7.2. Software Evaluation
The software effort required to implement PREM exe-
cution involves two aspects: (1) creating the drivers which
control the custom PREM hardware discussed in the previ-
ous section, and (2) calibrating the OS to eliminate unde-
sired execution interference. We now discuss these, in or-
der.
The two custom hardware components, the real-time
bridge and the peripheral scheduler, each require a
software driver to be controlled from the main CPU.
Additionally, each peripheral requires a driver running on the
real-time bridge’s CPU to control the COTS peripheral. The
driver for the peripheral scheduler is straightforward, map-
ping the bus addresses corresponding to the exposed regis-
ters to user space where a PREM-compiled process can ac-
cess them. The driver for each real-time bridge is more dif-
ficult, since each unique COTS peripheral requires a unique
driver. However, since in our implementation both the main
CPU and the real-time bridge’s CPU are running Linux
(version 2.6.31), we can reuse existing, thoroughly tested
Linux drivers to drastically reduce the driver creation ef-
fort [1]. The presence of a real-time bridge is not apparent
in user space, and software programs using the COTS pe-
ripherals require no modification.
For our experiments, we use an Intel Q6700 CPU with
a 975X system controller; we set the CPU frequency to
1 GHz, obtaining a measured memory bandwidth of 1.8 Gbytes/s,
to configure the system in line with typical values for
embedded systems. We also disable the speculative CPU HW
prefetcher since it negatively impacts the predictability of
any real-time task. The Q6700 has four CPU cores and each
pair of cores shares a common level 2 (last level) cache.
Each cache is 16-way set associative with a total size of B = 4
Mbytes and a line size of L = 64 bytes. Since we use a
PC platform running a COTS Linux operating system, there
are many potential sources of timing noise, such as
interrupts, kernel threads, and other processes, which must be
removed for our measurements to be meaningful. For this
reason, in order to emulate as closely as possible a typical
uni-processor embedded real-time platform, we divided the 4 cores
into two partitions. The system partition, running on the first
pair of cores, receives all interrupts for non-critical devices
(e.g., the keyboard) and runs all the system activities and
non-real-time processes (e.g., the shell we use to run the
experiments). The real-time partition runs on the second pair of
cores. One core in the real-time partition runs our real-time
tasks together with the drivers for real-time bridges and the
peripheral scheduler; the other core is not used. Note that
the cores of the system partition can still produce a small
amount of unscheduled bus and main memory accesses, or
raise rare inter-processor interrupts (IPI) that cannot be
easily prevented. However, in our experiments we found these
sources of noise to be negligible. Finally, to solve the
paging issue detailed in Section 4, we used a large, 4MB page
size, just for the real-time tasks, using the HugeTLB
feature of the Linux kernel for large page support.
7.3. Compiler Evaluation
We built the PREM real-time compiler prototype using
the LLVM Compiler Infrastructure [9], targeting the com-
pilation of C code. LLVM was extended by writing self-
contained analysis and transformation passes, which were
then loaded into the compiler.
In the current PREM real-time compiler prototype,
we rely on the programmer to partition the task into
predictable and compatible intervals. The partitioning is done
by putting each predictable interval into its own
function. The beginning and end of the scheduling interval
correspond to the entry and exit of the function, and the start of
the execution phase (the end of the memory access phase) is
manually marked by the programmer. We assume that non-local
data accessed during a predictable interval resides in
contiguous memory regions, which can be prefetched by a set of
PREFETCH DATA(start address, size) macros
that must be placed by the programmer during the mem-
ory phase. The implementation of this macro does the
actual prefetching of the data into level 2 cache by prefetch-
ing every cache line in the given range with the i386
prefetcht2 instruction. After the memory phase, the
programmer adds a STARTEXECUTION(wcet) macro
to indicate the beginning of the execution phase. This
macro measures the amount of time remaining in the pre-
dictable interval using the CPU performance counter, and
writes the time remaining to the yield register in the pe-
ripheral scheduler.
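The set of cache lines touched by one data prefetch can be sketched as follows. The Python function below only enumerates line-aligned addresses for a region given as a start address and a size; it does not issue any prefetch instruction, and the name cache_lines is hypothetical. The line size of 64 bytes is the value L of the platform described in Section 7.2.

```python
LINE_SIZE = 64  # bytes, cache line size L of the evaluation platform

def cache_lines(start, size, line=LINE_SIZE):
    """Line-aligned addresses of all cache lines overlapping
    the byte range [start, start + size)."""
    if size <= 0:
        return []
    first = start - start % line
    last_byte = start + size - 1
    last = last_byte - last_byte % line
    return list(range(first, last + line, line))
```

For instance, a 200-byte structure starting at address 100 spans four lines (64, 128, 192 and 256), one more than its size alone would suggest, because the region is not line-aligned.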
All remaining operations needed to transform the inter-
val are performed by a new LLVM function pass. The pass
iterates over all the functions in a compilation unit. When
a function representing a predictable interval is found, the
pass performs code transformation. First, our transform
inlines called functions using preexisting LLVM inlining
functions. This ensures that there is only a single stack
frame and a single segment of code that need to be prefetched into
the cache. Second, our transform inserts code to read the
CPU performance counter at the beginning of the interval
and save the current time. Third, it inserts code to prefetch
the stack frame and function arguments. Bringing the stack
frame into the cache is done by inserting instructions into
the program to fetch the stack pointer and frame pointer.
Code is then inserted to prefetch the memory between the
stack pointer and slightly beyond the frame pointer (to in-
clude function arguments) using the prefetcht2 instruc-
tion. Fourth, the transform prefetches the code of the func-
tion. This is done by transforming the program so that the
function is placed within a unique ELF section. We then
use a linker script to define variables pointing to the begin-
ning and end of this unique ELF section. The compiler then
adds code that prefetches the memory inside the ELF sec-
tion. Finally, the pass identifies all return instructions inside
the predictable interval function and adds a special function
epilog before them. The epilog performs interval length en-
forcement by looping until the performance counter reaches
the worst-case cycle count based on the time value saved
at the beginning of the interval. It may also enable periph-
eral interrupts by writing the worst-case interrupt process-
ing time to the peripheral scheduler’s yield register.
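The interval length enforcement performed by the epilog can be mimicked in a few lines of Python. This is only a behavioural sketch under clear assumptions: it uses the operating system's monotonic clock instead of the CPU performance counter, it is not cycle-accurate, and the function name and parameters are hypothetical.

```python
import time

def run_predictable_interval(body, wcet_ns):
    """Run body() and then busy-wait until wcet_ns nanoseconds have
    elapsed since the start, so the interval has a constant length."""
    start = time.perf_counter_ns()   # time saved at interval entry
    result = body()
    # Epilog: spin until the worst-case length has been reached.
    while time.perf_counter_ns() - start < wcet_ns:
        pass
    return result, time.perf_counter_ns() - start
```

A call such as run_predictable_interval(lambda: 42, 2_000_000) returns the result of the body together with an elapsed time of at least 2 ms, regardless of how quickly the body itself finished.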
To verify the correctness of the PREM real-time com-
piler prototype and to test its applicability, we used LLVM
to compile a DES cypher benchmark. The DES benchmark
was selected because it represents a typical real-time data
flow application. The benchmark comprises one scheduling
interval which encrypts a variable amount of data. We
compiled it as both a predictable and a compatible interval
(i.e., with and without prefetching), and measured the number
of cache misses with a performance counter. Adapting
the interval required no modification to any cypher functions
and a total of 11 PREFETCH DATA macros.

Data size (bytes)   4K    8K    32K   128K   512K   1M
Compatible          138   254   954   3780   15k    31k
Predictable         2     2     4     2      1      81
Table 1: DES benchmark (number of cache misses).

Results are shown in Table 1 in terms of the number of
cache misses suffered in the execution phase of the
predictable interval (after prefetching), and in the entire
compatible interval. Data size is in bytes. The compatible
interval suffers an excessive number of cache misses, which
increases roughly proportionally with the amount of pro-
cessed data. Conversely, the execution phase of the pre-
dictable interval has almost zero cache misses, only suffer-
ing a small increase when large amounts of data are being
processed. The reason the number of cache misses is not
zero is that the Q6700 CPU core used in our experiments
uses a random cache replacement policy, meaning that with
more than one contiguous memory region the probability of
self-eviction is non-zero. In all the following experiments,
we observed that the number of self-evictions is typically so
small that it can be considered negligible.
7.4. WCET Experiments with Synthetic Tasks
In this section, we evaluate the effects of PREM on the
execution time of a task. To quickly explore different ex-
ecution parameters, we developed two synthetic applica-
tions. In our linear access application, each schedul-
ing interval operates on a 256-kilobyte global data structure.
Data is accessed sequentially, and we vary the amount of
computation performed between memory references. The
random access application is similar, except that refer-
ences inside the data structure are nonsequential. For each
application, we measured the execution time after com-
piling the program in two ways: into predictable intervals
which prefetch the accessed memory, and into standard,
compatible intervals. For each type of compilation, we ran
the experiment in two ways, with and without I/O traffic
transmitted by an 8-lane PCIe peripheral with a measured
throughput of 1.2Gbytes/s. In the case of compatible in-
tervals, we transmitted traffic during the entire interval to
mirror the worst case according to the traditional execution
model.
Figures 8 and 9 show the observed worst case execution
time for any scheduling interval as a function of the cache
stall time of the application, averaged over 10 runs. The
cache stall time represents the percentage of time required
to fetch cache lines out of an entire compatible interval,
assuming a fixed (best-case) fetch time based on the
maximum measured main-memory throughput. Only a single
line is shown for predictable intervals because experiments
confirmed that injecting traffic during the execution phase
does not increase execution time. In all cases, the
computation time decreases with an increase in stall time. This is
because stall time is controlled by varying the amount of
computation between memory references. Furthermore,
execution times should not be compared between the two figures
because the two applications execute different code.
Figure 8: random access. Execution time (ms) versus cache stall time (%)
for the compatible/traffic, compatible, and predictable cases.
Figure 9: linear access. Execution time (ms) versus cache stall time (%)
for the compatible/traffic, compatible, and predictable cases.
In the random access case, predictable intervals out-
perform compatible intervals (without peripheral traffic) by
up to 28%, depending on the cache stall time. We believe
this effect is primarily due to the behavior of DRAM main
memory. Specifically, accesses to adjacent addresses can be
served more quickly in burst mode than accesses to random
addresses. Thus, we can decrease the execution time by
loading all the accessed memory into cache, in order, at the
beginning of each predictable interval. Furthermore, note that
transmitting peripheral traffic during a compatible interval
can increase execution time by more than 60% in the worst
case. In Figure 9, predictable intervals perform worse than
compatible intervals (without peripheral traffic). We believe
this is mainly due to out-of-order execution in the Q6700
core. In compatible intervals, while the core performs a
cache fetch, instructions in the pipeline that do not depend
on the fetched data can continue to execute. When perform-
ing linear accesses, fetches require less time and this effect
is magnified. Furthermore, the gain in execution time for
the case with peripheral traffic is decreased: this occurs be-
cause bursting data on the memory bus reduces the amount
of blocking time suffered by a task due to peripheral in-
terference (this effect has been previously analyzed in de-
tail [15]). In practice, we expect the effect of PREM on an
application’s execution time to be between the two figures,
based on the specific memory access pattern.
7.5. System-wide Coscheduling Traces
We now present execution traces of the implemented
system which demonstrate the advantage of the PREM
coscheduling approach. The traces are obtained by using the
peripheral scheduler as a logic analyzer for the various sig-
nals which are being sent to or from the real-time bridges,
data block, data ready, and interrupt block.
Additionally, the peripheral scheduler has a trace register
which allows timestamped trace information to be recorded
with a one microsecond resolution when instructed by the
main CPU, such as at the start and end of an execution inter-
val. An execution trace is shown for a task running the tra-
ditional COTS execution model in Figure 10, and the same
task running within the PREM model is shown in Figure 11
(see footnote 3).
In the first trace (Figure 10), although the execution
is divided into unpreemptable intervals, there is no mem-
ory phase prefetch or constant execution time guarantees.
When the scheduling intervals of task T1 finish executing,
an I/O peripheral begins to access main memory, which may
happen if T1 had written output data into RAM. Task T2
then executes, suffering cache misses that compete for main
memory bandwidth with the I/O peripheral. Due to the cold
cache and peripheral interference, the execution time of T2
grows from 0.5 ms (the execution time with warm cache and
no peripheral I/O) to 2.9 ms as shown in the figure, i.e., to
almost six times its original value.
In the second trace (Figure 11), the system executes according to the PREM execution model, where peripherals only access the bus when permitted by the peripheral scheduler. The predictable interval is divided into a memory phase and an execution phase. Instead of competing for main memory access, task T2 gets contentionless access to main memory during the memory phase. After all to-be-accessed data is loaded into the cache, the execution phase begins, which incurs no cache misses, and the peripheral is allowed to access the data in main memory. The constant execution time for the predictable interval in the PREM execution model is 1.6ms, which is significantly lower than the worst case observed for the unscheduled trace (and is about the same as the execution time of the scheduling interval of T2 in the unscheduled trace with a cold cache and no peripheral traffic).
3 The (compatible) intervals at the end of T1 and at the start of T2 were measured as 0.107ms and 0.009ms, respectively, and have been exaggerated in the figures to be visible.
Figure 10: An unscheduled bus-access trace (without PREM). [Rows: T1, T2, T1 I/O; annotations: scheduling intervals, memory access, 2.9ms.]
Figure 11: A scheduled trace using PREM. [Rows: T1, T2, T1 I/O; annotations: compatible intervals, predictable memory phase, predictable execution phase, memory access, 1.6ms.]
8. Conclusions
We have discussed the concept and implementation of a
novel task execution model, PRedictable Execution Model
(PREM). Our evaluation shows that by enforcing a high-
level co-schedule among CPU tasks and peripherals, PREM
can greatly reduce or outright eliminate low-level con-
tention for shared resource access. We plan to further de-
velop our solution in two main directions. First, we will
study extensions to our compiler infrastructure to lift some
of the more restrictive code assumptions and compile and
test a larger set of benchmarks. Second, it is important to
recall that contention for shared resources becomes more
severe as the number of active components increases. In
particular, worst-case execution time can greatly degrade in
multicore systems [16]. Since PREM can make the system
contentionless, we predict that the benefits of our approach
will become even more significant when applied to multi-
ple processor systems.
References
[1] S. Bak, E. Betti, R. Pellizzoni, M. Caccamo, and L. Sha.
Real-time control of I/O COTS peripherals for embedded
systems. In Proc. of the 30th IEEE Real-Time Systems Sym-
posium, Washington DC, Dec 2009.
[2] G. Buttazzo. Hard Real-Time Computing Systems: Pre-
dictable Scheduling Algorithms and Applications. Kluwer
Academic Publishers, Boston, 1997.
[3] S. A. Edwards and E. A. Lee. The case for the precision
timed (PRET) machine. In DAC ’07: Proc. of the 44th annual
Design Automation Conference, 2007.
[4] T. Facchinetti, G. Buttazzo, M. Marinoni, and G. Guidi.
Non-preemptive interrupt scheduling for safe reuse of legacy
drivers in real-time systems. In ECRTS ’05: Proc. of the 17th
Euromicro Conf. on Real-Time Systems, pages 98–105, 2005.
[5] H. Falk, P. Lokuciejewski, and H. Theiling. Design of a
WCET-aware C compiler. In ESTMED ’06: Proc. of the 2006
IEEE/ACM/IFIP Workshop on Embedded Systems for Real
Time Multimedia, pages 121–126, 2006.
[6] C. Ferdinand, F. Martin, and R. Wilhelm. Applying compiler
techniques to cache behavior prediction. In Proceedings of
the ACM SIGPLAN Workshop on Languages, Compilers and
Tools for Real-Time Systems, 1997.
[7] D. Grund and J. Reineke. Precise and efficient FIFO-
replacement analysis based on static phase detection. In Pro-
ceedings of the 22nd Euromicro Conference on Real-Time
Systems (ECRTS), Brussels, Belgium, July 2010.
[8] K. Hoyme and K. Driscoll. SAFEbus(TM). IEEE Aerospace
Electronics and Systems Magazine, pages 34–39, Mar 1993.
[9] C. Lattner and V. Adve. LLVM: A compilation framework
for lifelong program analysis and transformation. In Proc. of
the International Symposium of Code Generation and Opti-
mization, San Jose, CA, USA, Mar 2004.
[10] M. Lewandowski, M. Stanovich, T. Baker, K. Gopalan, and
A. Wang. Modeling device driver effects in real-time schedu-
lability: Study of a network driver. In Proc. of the 13th IEEE
Real Time Application Symposium, Apr 2007.
[11] J. Liedtke, H. Hartig, and M. Hohmuth. OS-controlled cache
predictability for real-time systems. In Proceedings of the
3rd IEEE Real-Time Technology and Applications Sympo-
sium (RTAS), 1997.
[12] F. Mueller. Timing analysis for instruction caches. Real Time
Systems Journal, 18(2/3):272–282, May 2000.
[13] M. Paolieri, E. Quinones, F. J. Cazorla, and M. Valero. An
analyzable memory controller for hard real-time CMPs. IEEE
Embedded Systems Letters, 1(4), Dec 2009.
[14] PCI SIG. Conventional PCI 3.0, PCI-X 2.0 and PCI-E 2.0
Specifications. http://www.pcisig.com.
[15] R. Pellizzoni and M. Caccamo. Impact of peripheral-processor
interference on WCET analysis of real-time embedded
systems. IEEE Trans. on Computers, 59(3):400–415,
Mar 2010.
[16] R. Pellizzoni, A. Schranzhofer, J.-J. Chen, M. Caccamo, and
L. Thiele. Worst case delay analysis for memory interference
in multicore systems. In Proceedings of Design, Automation
and Test in Europe (DATE), Dresden, Germany, Mar 2010.
[17] I. Puaut and D. Hardy. Predictable paging in real-time sys-
tems: A compiler approach. In ECRTS ’07: Proc. of the
19th Euromicro Conf. on Real-Time Systems, pages 169–178,
2007.
[18] H. Ramaprasad and F. Mueller. Bounding preemption de-
lay within data cache reference patterns for real-time tasks.
In Proc. of the IEEE Real-Time Embedded Technology and
Application Symposium, Apr 2006.
[19] J. Reineke, D. Grund, C. Berg, and R. Wilhelm. Timing
predictability of cache replacement policies. Real-Time Systems,
37(2), 2007.
[20] S. Schliecker, M. Negrean, G. Nicolescu, P. Paulin, and
R. Ernst. Reliable performance analysis of a multicore mul-
tithreaded system-on-chip. In CODES/ISSS, 2008.
[21] I. Shin and I. Lee. Periodic resource model for compositional
real-time guarantees. In Proceedings of the
24th IEEE Real-Time Systems Symposium, Cancun, Mexico,
Dec 2003.
[22] J. Whitham and N. Audsley. Implementing time-predictable
load and store operations. In Proc. of the Intl. Conf. on Em-
bedded Systems (EMSOFT), Grenoble, France, Oct 2009.
A. Analysis of Cache Replacement Policy
We first provide a brief overview of each analyzed policy. Under random policy, whenever a new cache line is loaded in an associative set, the line to be evicted is chosen at random among all N lines in the set. While this policy is implemented in some modern COTS CPUs with very large cache associativity, such as Intel Core architecture, it is clearly ill suited to real-time systems. In FIFO policy, a FIFO queue is associated with each associative set as shown in Figure 12(a). Each newly fetched cache line is inserted at the head of the queue (position q1), while the cache line which was previously at the back of the queue (position qN) is evicted. In the figure, l1, . . . , l4 represent cache lines used in the predictable interval, while dashes represent other cache lines available in cache but unrelated to the predictable interval. Finally, grayed boxes represent invalidated cache lines. The Least Recently Used (LRU) policy also uses a replacement queue, but a cache line is moved to the head of the queue whenever it is accessed. Due to the complexity of implementing LRU, an approximated version of the algorithm known as pseudo-LRU can be employed where N is a power of 2. A binary replacement tree with N − 1 internal nodes and N leaves q1, . . . , qN is constructed as shown in Figure 13 for N = 4. Each internal node encodes a “right” or “left” direction using a single bit of data, while each leaf encodes a fetched cache line. Whenever a replacement decision must be taken, the encoded directions are followed starting from the root to the leaf representing the cache line to be replaced. Furthermore, whenever a cache line is accessed, all directions on the path from the root to the leaf of the accessed cache line are set to be the opposite of the followed path. Figure 13 shows an example where four cache lines are accessed in the order l1, l2, l3, l2, l4, causing a self-eviction.
Figure 12: FIFO policy, example replacement queue. [Panels (a)-(f): state of the replacement queue q1, . . . , q8 after successive “mem. phase”, “fetch” and “inval.” steps.]
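The tree update rules described above can be exercised with a small simulation. The following is a minimal sketch, not the authors' tooling: the class name `PseudoLRUSet`, the heap-style bit layout, and the assumption that all direction bits initially point left and that replacement always follows the tree bits (even when some ways are empty) are choices made for illustration.

```python
class PseudoLRUSet:
    """Illustrative tree pseudo-LRU model of one associative set (N a power of 2)."""

    def __init__(self, n_ways):
        assert n_ways >= 2 and n_ways & (n_ways - 1) == 0
        self.n = n_ways
        # One bit per internal node, heap layout; 0 = "left", 1 = "right".
        # Assumption: every bit initially points left.
        self.bits = [0] * (n_ways - 1)
        self.lines = [None] * n_ways  # leaves q1..qN (index 0 is q1)

    def _victim(self):
        # Follow the encoded directions from the root down to a leaf.
        node, lo, hi = 0, 0, self.n
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo

    def _touch(self, way):
        # Set every bit on the root-to-leaf path to point away from `way`.
        node, lo, hi = 0, 0, self.n
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0
                node, lo = 2 * node + 2, mid

    def access(self, line):
        """Access `line`; return the line it evicted, or None."""
        evicted = None
        if line in self.lines:
            way = self.lines.index(line)
        else:
            way = self._victim()
            evicted = self.lines[way]
            self.lines[way] = line
        self._touch(way)
        return evicted


# Access order from the text, for N = 4.
cache = PseudoLRUSet(4)
evicted = [cache.access(l) for l in ["l1", "l2", "l3", "l2", "l4"]]
print(evicted)      # the last access evicts l1
print(cache.lines)
```

Under these assumptions, the second access to l2 flips the root bit back toward the left subtree, so the fetch of l4 replaces l1 rather than the remaining unrelated line.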
Without loss of generality, in all proofs in this section
we focus on the analysis of a single associative set with Q
cache lines accessed during the predictable interval. Fur-
thermore, let l1, . . . , lQ be the set of Q cache lines in the
order in which they are first accessed during the memory
phase (note that this order can be different from the order in
which the lines are prefetched due to the Q′ lines that are
accessed multiple times). Results for LRU and random are
simple and well-known, see [6] for example; we detail the
bound derivation in the following theorem to provide a bet-
ter understanding of the replacement policies.
Theorem 6. A memory phase will not suffer any cache self-
eviction in an associative set requiring Q entries, if Q is at
most equal to:
• 1: for random replacement policy;
• N : for LRU replacement policy.
Proof. Clearly no self-eviction is possible if Q = 1, since
after the unique cache line is fetched no other cache line can
be accessed in the associative set.
Now consider LRU policy. When a cache line li is first
accessed, it is fetched (if it is not already in cache) and
moved to the beginning of the replacement queue. Since no
cache line outside of l1, . . . , lQ is accessed in the associa-
tive set, when li is first accessed, cache lines l1, . . . , li−1
must be at the head of the queue in some order. But Q ≤ N
implies i−1 ≤ N −1, hence the algorithm always picks an
unrelated line at the back of the queue to be replaced.
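The LRU argument can be checked on small cases with a sketch like the one below. The function name `lru_self_evictions` and the list-based queue are hypothetical; a self-eviction is counted only when the evicted line was already accessed earlier in the same memory phase, following the definition used in the text.

```python
def lru_self_evictions(initial_queue, accesses):
    """Simulate one LRU set; the queue is listed head (most recent) first."""
    queue = list(initial_queue)
    seen = set()          # lines already accessed in this memory phase
    self_evicted = []
    for line in accesses:
        if line in queue:
            queue.remove(line)        # hit: will be moved to the head
        else:
            victim = queue.pop()      # miss: evict the back of the queue
            if victim in seen:
                self_evicted.append(victim)
        queue.insert(0, line)
        seen.add(line)
    return self_evicted, queue


# Q = N = 4, with l2 initially cached at the back of the queue.
ok_evictions, ok_queue = lru_self_evictions(
    ["a", "b", "c", "l2"], ["l1", "l2", "l3", "l4"])

# Q = 5 > N = 4: the fifth fetch must evict a line used by the interval.
bad_evictions, _ = lru_self_evictions(
    ["a", "b", "c", "d"], ["l1", "l2", "l3", "l4", "l5"])
print(ok_evictions, ok_queue, bad_evictions)
```

In the first run l2 is evicted before it is accessed and then reloaded, which does not count as a self-eviction; in the second run the bound Q ≤ N is violated and l1 is lost.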
Figure 13: Pseudo-LRU policy, full-invalidation, Q′ = 1. [Panels (a)-(f) show leaves q1, . . . , q4 in the initial state and after fetching l1, fetching l2, fetching l3, accessing l2 again, and fetching l4: (- - - -), (l1 - - -), (l1 - l2 -), (l1 l3 l2 -), (l1 l3 l2 -), (l4 l3 l2 -).]
Figure 14: Pseudo-LRU policy, partial-invalidation. [Panels (a)-(c) show leaves q1, . . . , q4 in the initial state, after accessing l1, l2, l3, and after fetching l4: (l1 l2 l3 -), (l1 l2 l3 -), (l4 l2 l3 -).]
Note that for random the bound of 1 is tight, since in the
worst-case fetching a second cache line in the associative
set can cause the first cache line to be evicted. Furthermore,
note that if the cache is not invalidated between successive
activations of the same predictable interval, then some of
the l1, . . . , lQ cache lines can still be available in cache. As-
sume that at the beginning of the predictable interval l2 is
available but not l1. Then even with LRU policy, prefetch-
ing l1 can potentially cause l2 to be evicted. According to
our definition, this is not a self-eviction: l2 is evicted be-
fore it is accessed during the memory phase, and is then
reloaded when it is first accessed, thus guaranteeing that all
required lines are available in cache at the end of the mem-
ory phase. In fact, Theorem 6 proves that the bounds for
random and LRU are independent of the state of the cache
before the beginning of the predictable interval.
Unfortunately, the same is not true of FIFO and pseudo-
LRU. Consider FIFO policy. If a cache line li is already
available in cache, then when li is accessed during the mem-
ory phase no modification is performed on the replacement
queue. In some situations this can cause li to be evicted by
another line fetched after li in the memory phase, causing
a self-eviction. Due to this added complexity, when analyz-
ing FIFO and pseudo-LRU we need to consider the inval-
idation model. A detailed analysis of FIFO replacement is
provided in [7]; here we summarize the results and intu-
itions relevant to our system.
Theorem 7 (Directly follows from Lemma 1 in [7]). As-
sume FIFO replacement with full-invalidation semantic.
Then no self-eviction is possible for an associative set with
Q ≤ N .
The analysis for partial-invalidation FIFO is more com-
plex. Figure 12 (analogous to Figure 2 in [7]) depicts a clar-
ifying example, where the “mem. phase”, “fetch” and “in-
val.” labels show the state of the replacement queue after
the memory phase and after some cache lines have been ei-
ther fetched or invalidated outside the predictable interval,
respectively. Note that after Step (e), lines l1, l2 remain in
cache but not l3, l4. Hence, when l3 and l4 are fetched dur-
ing the next memory phase in Step (f), they evict l1 and l2.
The solution is to prefetch l1, . . . , lQ multiple times in the
memory phase, each time in the same order.
Theorem 8 (Directly follows from Theorem 3 in [7]). As-
sume FIFO replacement with partial-invalidation semantic.
If the Q cache lines in an associative set, with Q ≤ N ,
are prefetched at least Q times in the same access-order
l1, . . . , lQ, then all Q cache lines are available in cache at
the end of the memory phase. Furthermore, no more than Q
fetches are required.
Note that while executing the prefetching code Q times
might seem onerous, in practice the majority of the time in
the memory phase is spent fetching and writing back cache
lines, and Theorem 8 ensures that no more than Q fetches
and write backs are required for each associative set.
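The repeated-prefetch strategy of Theorem 8 can be illustrated with a short simulation. This is a sketch under assumptions: the function name `fifo_memory_phase`, the initial queue contents, and the early exit once all lines are present are illustrative choices, not part of the original analysis.

```python
from collections import deque


def fifo_memory_phase(initial_queue, lines, max_passes):
    """Prefetch `lines` in order, repeatedly, on one FIFO set.

    The queue is listed head (q1) first. Returns (passes used or None,
    number of fetches, final queue).
    """
    queue = deque(initial_queue)
    fetches = 0
    for passes in range(1, max_passes + 1):
        for line in lines:
            if line not in queue:
                queue.pop()             # evict the back of the queue (qN)
                queue.appendleft(line)  # insert the new line at the head (q1)
                fetches += 1
            # On a hit, FIFO leaves the replacement queue untouched.
        if all(line in queue for line in lines):
            return passes, fetches, list(queue)
    return None, fetches, list(queue)


# N = Q = 4; l1 and l2 survive from an earlier activation ("x", "y" are unrelated).
lines = ["l1", "l2", "l3", "l4"]
passes, fetches, final_queue = fifo_memory_phase(
    ["x", "l1", "l2", "y"], lines, max_passes=len(lines))
print(passes, fetches, final_queue)
```

In this instance a single pass leaves l2 evicted, while three passes (within the Q = 4 allowed by the theorem) bring all four lines into the set using four fetches in total.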
While LRU and FIFO allow up to N cache lines to be
prefetched with N fetches, the same is not true of pseudo-LRU,
even in the full-invalidation model. Consider again Figure
13, where no relevant lines are available in cache at the be-
ginning of the memory phase but Q′ = 1, meaning that line
l2 can be accessed multiple times. Then it is easy to see
that no more than 3 cache lines can be prefetched without
causing a self-eviction. The key intuition is that up to Step
(d), and with respect to l1, l2, l3, the replacement tree en-
codes the LRU order exactly. However, after l2 is accessed
again at Step (e), the LRU order is lost, and the next replace-
ment causes l1 to be evicted instead of the remaining un-
related cache line. Furthermore, note that the strategy em-
ployed in Theorem 8 does not work in this case, because
l1 has already been fetched during the memory interval be-
fore being evicted. A similar situation can happen even if
Q′ = 0 in the partial-invalidation model, as shown in Fig-
ure 14. Assume that due to partial replacements and cache
accesses, at the beginning of the memory phase the cache
state is as shown in Figure 14(a), with l1, l2 and l3 being
available in cache but not l4. Then after the first three cache
lines are prefetched in order, l1 will be replaced when l4 is
next prefetched.
A lower bound that is independent of Q′ and of the in-
validation model was first proven in [19].
Theorem 9 (Theorem 10 in [19]). Under pseudo-LRU re-
placement, no self-eviction is possible for an associative set
with Q ≤ log2N + 1.
We now show in Theorem 12 that under the full-invalidation
model, a better bound can be obtained based on the value
of Q′. In the theorem, a subtree of a pseudo-LRU replace-
ment tree is a tree rooted at any internal node of the replace-
ment tree. For example, the replacement tree in Figures 13,
14 has three subtrees: the whole tree, which has height 2 and
leaves q1, . . . , q4, and its left and right subtrees with height
1 and leaves q1, q2 and q3, q4, respectively. For each subtree
of height $k$, we compute a lower bound $Lb^p_k$ on the number of cache lines in $l_1, \ldots, l_Q$ that can be allocated in the subtree (either because they are already available in cache at the beginning of the predictable interval or because they are fetched during the memory phase) without causing any self-eviction, assuming that none of the lines were available in cache at the beginning of the predictable interval and that up to $p$ lines can be accessed multiple times during the memory phase. Note that by definition, $Lb^i_k \le Lb^j_k$ if $i \ge j$. Furthermore, $\forall p$, $Lb^p_1 = 2$ by Theorem 9. The following Lemmas 10, 11 prove some important facts on $Lb^p_k$.
Lemma 10. $Lb^0_k = 2Lb^0_{k-1}$.
Proof. Consider once again the left and right subtrees with height $k-1$. By contradiction and without loss of generality, assume that a self-eviction happens in the left subtree; then at least $Lb^0_{k-1} + 1$ cache lines must be allocated in it. Since no relevant cache line is originally available in cache and furthermore no cache line is accessed more than once, it follows that every access results in a fetch and replacement. Furthermore, whenever a cache line is fetched in the left subtree, the root of the height-$k$ subtree is changed to point to the right subtree and vice versa. Hence, to fetch $Lb^0_{k-1} + 1$ lines in the left subtree, at least $Lb^0_{k-1}$ lines must be fetched in the right subtree. Therefore, at least $2Lb^0_{k-1} + 1$ total cache lines must be allocated in the whole height-$k$ subtree to cause a self-eviction, implying that $Lb^0_k = 2Lb^0_{k-1}$ is a valid lower bound.
Note that from Lemma 10 and $Lb^0_1 = 2$ it immediately follows that $Lb^0_k = 2^k$.
Lemma 11. For any subtree of height $k > 1$ with $p > 0$, $Lb^p_k = \min(Lb^{p-1}_{k-1} + 1,\; 2Lb^p_{k-1})$.
Proof. Consider the left and right subtrees with height $k-1$, and assume that out of the $p$ cache lines that can be accessed multiple times, $i$ are allocated in the left subtree and $j$ are allocated in the right subtree, with $i + j = p$. By contradiction and without loss of generality, assume that a self-eviction happens in the left subtree; then at least $Lb^i_{k-1} + 1$ cache lines must be allocated in it. We distinguish two cases.
Case (1): $j > 0$. Then following the same reasoning as in Theorem 9, it is sufficient to allocate a single cache line in the right subtree, resulting in a total of $Lb^i_{k-1} + 2$ lines required to cause a self-eviction; note that this value is minimized by maximizing $i$, i.e. when $j = 1$, $i = p - 1$, resulting in $Lb^{p-1}_{k-1} + 2$ required cache lines.
Case (2): $j = 0$. Then following the same reasoning as in Lemma 10, at least $Lb^i_{k-1}$ cache lines must be allocated in the right subtree, for a total of $2Lb^i_{k-1} + 1 = 2Lb^p_{k-1} + 1$ cache lines.
Combining cases (1) and (2), it follows that no eviction is possible if the number of allocated lines is at most equal to $\min(Lb^{p-1}_{k-1} + 1,\; 2Lb^p_{k-1})$, which concludes the proof.
We can finally prove Theorem 12.
Theorem 12. Under pseudo-LRU replacement with full-invalidation semantic, no self-eviction is possible if:
$$Q \le \begin{cases} \dfrac{N}{2^{Q'}} + Q' & \text{if } Q' < \log_2 N; \\ \log_2 N + 1 & \text{otherwise.} \end{cases}$$
Proof. The proof proceeds by induction on the pair $(p, k)$ ordered by $p$ first, e.g. in the sequence $(0, 1), \ldots, (0, Q), (1, 1), \ldots, (1, Q), \ldots, (Q', 1), \ldots, (Q', Q)$. In particular, we show that the following property holds $\forall p, k$:
$$Lb^p_k = \begin{cases} 2^{k-p} + p & \text{if } p < k; \\ k + 1 & \text{otherwise.} \end{cases}$$
Since a pseudo-LRU replacement tree for an $N$-associative set has height $\log_2 N$, the theorem then follows.
By Lemma 10, the property is verified for $p = 0$ since $Lb^0_k = 2^k$. By Theorem 9, the property is verified for $k = 1$ since $Lb^p_1 = 2$. Therefore, it remains to complete the induction step by showing that the property holds at step $(p, k)$ with $p > 0$, $k > 1$. We do so by applying Lemma 11, assuming that the property also holds at previous steps $(p-1, k-1)$ and $(p, k-1)$. We consider three cases based on the relative value of $p$, $k$.
Case (1): $p < k - 1$. We have to show that $Lb^p_k = 2^{k-p} + p$. Then:
$$Lb^p_k = \min\left(2^{(k-1)-(p-1)} + (p-1) + 1,\; 2\left(2^{(k-1)-p} + p\right)\right) = \min\left(2^{k-p} + p,\; 2^{k-p} + 2p\right) = 2^{k-p} + p.$$
Case (2): $p = k - 1$. We have to show that $Lb^p_k = 2^{k-p} + p = k + 1$. Then:
$$Lb^p_k = \min\left(2^{(k-1)-(p-1)} + (p-1) + 1,\; 2\left((k-1) + 1\right)\right) = \min\left(2^{k-p} + p,\; 2k\right) = k + 1.$$
Case (3): $p \ge k$. We have to show that $Lb^p_k = k + 1$. Then:
$$Lb^p_k = \min\left((k-1) + 1 + 1,\; 2\left((k-1) + 1\right)\right) = \min(k + 1,\; 2k) = k + 1.$$
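As a sanity check on the induction, the recurrence of Lemmas 10 and 11 can be evaluated numerically and compared with the closed form used in the proof. The sketch below uses hypothetical function names (`lb`, `lb_closed`, `max_safe_lines`) and assumes N is a power of 2.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def lb(p, k):
    """Lower bound Lb^p_k computed from the recurrences."""
    if k == 1:
        return 2                              # Lb^p_1 = 2 (Theorem 9)
    if p == 0:
        return 2 * lb(0, k - 1)               # Lemma 10
    return min(lb(p - 1, k - 1) + 1, 2 * lb(p, k - 1))  # Lemma 11


def lb_closed(p, k):
    """Closed form shown by induction in the proof of Theorem 12."""
    return 2 ** (k - p) + p if p < k else k + 1


def max_safe_lines(n_ways, q_prime):
    """Bound of Theorem 12 for an N-way set (N a power of 2)."""
    height = n_ways.bit_length() - 1          # log2(N)
    return lb_closed(q_prime, height)


mismatches = [(p, k) for p in range(9) for k in range(1, 9)
              if lb(p, k) != lb_closed(p, k)]
print(mismatches)                 # expected to be empty
print(max_safe_lines(4, 1))       # the N = 4, Q' = 1 case of Figure 13
```

For N = 4 and Q′ = 1 the bound evaluates to 3, matching the observation that no more than 3 cache lines can be prefetched in the example of Figure 13.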
B. Solving The Optimization Problem
Using $\mathrm{mbf}^U_i(t_i) = \alpha_i t_i + \delta_i$, the optimization problem of Equations 4-6 can be rewritten as follows:
$$\mathrm{sbf}^U(t) = \sum_{i=1}^{N} \left(\delta_i + \alpha_i E^{\min}_i(t)\right) + \max \sum_{i=1}^{N} \alpha_i x_i, \qquad (11)$$
$$\sum_{i=1}^{N} x_i \le t - \sum_{i=1}^{N} E^{\min}_i(t), \qquad (12)$$
$$\forall i, 1 \le i \le N: \; 0 \le x_i \le E^{\max}_i(t) - E^{\min}_i(t), \qquad (13)$$
where we substituted $x_i = t_i - E^{\min}_i(t)$. Without loss of generality, assume that the $N$ tasks are ordered by non-increasing values of $\alpha_i$. Then Algorithm 1 computes the solution $\mathit{val}$ to Equations 4-6. The algorithm first computes the time bound $t - \sum_{i=1}^{N} E^{\min}_i(t)$ on the sum of the variables $x_i$. Then, starting from $x_1$, it assigns to each variable the minimum between the remaining time $t_{rem}$ and the upper constraint of Equation 13. Since the tasks are ordered by non-increasing values of $\alpha_i$, it is trivial to see that Algorithm 1 computes the maximum of Equation 11. Furthermore, the while loop at Lines 5-10 is executed at most $N$ times, and each iteration requires constant time. Hence, the algorithm has complexity $O(N)$.
Algorithm 1 Compute $\mathrm{sbf}^U(t)$ for a given $t$.
1: procedure COMPUTEINSTANT($t, \tau_1, \ldots, \tau_N$ ordered by non-increasing $\alpha_i$)
2:   $t_{rem} := t - \sum_{i=1}^{N} E^{\min}_i(t)$
3:   $\mathit{val} := \sum_{i=1}^{N} \left(\delta_i + \alpha_i E^{\min}_i(t)\right)$
4:   $i := 1$
5:   while $t_{rem} > 0$ and $i \le N$ do
6:     $x_i := \min\left(t_{rem},\; E^{\max}_i(t) - E^{\min}_i(t)\right)$
7:     $t_{rem} := t_{rem} - x_i$
8:     $\mathit{val} := \mathit{val} + \alpha_i x_i$
9:     $i := i + 1$
10:   end while
11:   return $(\mathit{val}, i, \{x_1, \ldots, x_N\})$
12: end procedure
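The greedy assignment of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative rendering, not the authors' implementation: the function name `compute_instant` and its argument layout are assumptions, the inputs `e_min` and `e_max` are taken to be the values of $E^{\min}_i(t)$ and $E^{\max}_i(t)$ already evaluated at the given $t$, and the index bookkeeping of the pseudocode is omitted.

```python
def compute_instant(t, alphas, deltas, e_min, e_max):
    """Greedy solution of Equations 11-13 for one time instant t.

    Tasks must already be ordered by non-increasing alpha.
    Returns (val, x), where x[i] is the extra time given to task i.
    """
    t_rem = t - sum(e_min)                       # budget on the sum of x_i
    val = sum(d + a * lo for a, d, lo in zip(alphas, deltas, e_min))
    x = [0] * len(alphas)
    for i, (a, lo, hi) in enumerate(zip(alphas, e_min, e_max)):
        if t_rem <= 0:
            break
        x[i] = min(t_rem, hi - lo)               # fill the steepest task first
        t_rem -= x[i]
        val += a * x[i]
    return val, x


# Hypothetical three-task instance (values chosen only for illustration).
val, x = compute_instant(
    t=5, alphas=[3, 2, 1], deltas=[1, 1, 1], e_min=[1, 1, 1], e_max=[2, 3, 4])
print(val, x)
```

Because each unit of remaining time is worth $\alpha_i$ when given to task $i$, filling variables in order of non-increasing $\alpha_i$ is the standard greedy argument for this kind of fractional allocation.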
Algorithm 1 computes $\mathrm{sbf}^U(t)$ for a specific value of $t$. To test the schedulability condition of Equation 2 using the upper bound $\mathrm{tbf}^U(t) = \min\{x \mid \mathrm{sbf}^L(x) \ge t\}$, $\mathrm{sbf}^U(t)$ must be computed for all $t$ in the interval $[0, \max_i D^{I/O}_i]$. The reason is as follows. Note that if for any $i$:
$$\mathrm{tbf}^U\left(e^{I/O}_i + \sum_{l \in hp^{I/O}_i} \left\lceil \frac{r^{I/O,k}_i}{p^{I/O}_l} \right\rceil e^{I/O}_l\right) > \max_i D^{I/O}_i, \qquad (14)$$
then the system is not schedulable. By definition of $\mathrm{tbf}^U(t)$, if $e^{I/O}_i + \sum_{l \in hp^{I/O}_i} \left\lceil r^{I/O,k}_i / p^{I/O}_l \right\rceil e^{I/O}_l > \mathrm{sbf}^L(\max_i D^{I/O}_i)$, then Equation 14 is verified. Hence, it suffices to compute $\mathrm{sbf}^L(t)$, $\mathrm{sbf}^U(t)$ in $[0, \max_i D^{I/O}_i]$.
Luckily, we do not need to run Algorithm 1 for all values of $t$ in $[0, \max_i D^{I/O}_i]$. Since functions $E^{\max}_i(t)$, $E^{\min}_i(t)$ are piecewise linear, we can obtain $\mathrm{sbf}^U(t)$ by interpolation of a finite number of values as shown in Algorithm 2. The algorithm returns $\mathrm{sbf}^U(t)$ as a set $\mathit{pset}$ of points $\{\ldots, (t', \mathit{val}'), (t'', \mathit{val}''), \ldots\}$, where for $t' \le t \le t''$, $\mathrm{sbf}^U(t) = \mathit{val}' + (t - t') \frac{\mathit{val}'' - \mathit{val}'}{t'' - t'}$, i.e. $\mathrm{sbf}^U(t)$ is the linear interpolation between $(t', \mathit{val}')$ and $(t'', \mathit{val}'')$. The algorithm first computes in Line 4 the set $t^1, \ldots, t^{M-1}$ of angular points of $E^{\max}_i(t)$, $E^{\min}_i(t)$ in interval $[0, \max_i D^{I/O}_i)$, with $t^M$ being the maximum value of interest $\max_i D^{I/O}_i$; note that this implies that in every open interval $(t^j, t^{j+1})$, each function $E^{\max}_i(t)$, $E^{\min}_i(t)$ has a constant slope equal to either 0 or 1. In particular, function $\mathrm{EMaxAdd}(l, t^j)$ returns the slope of $E^{\max}_l(t)$ in interval $(t^j, t^{j+1})$, while $\mathrm{EMinAdd}(l, t^j)$ returns the slope of $E^{\min}_l(t)$ in interval $(t^j, t^{j+1})$. For each point $t^j$, Algorithm 1 is applied to compute the value $\mathit{val} = \mathrm{sbf}^U(t^j)$, as well as the index $i$ of the last task for which variable $x_i$ has been assigned a non-zero value. $(t^j, \mathit{val})$ is then inserted in $\mathit{pset}$ at Line 11.
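Evaluating the bound from a returned point set is a plain piecewise-linear lookup. The sketch below assumes the points are sorted by time; the function name `sbf_upper` and the sample point set are hypothetical and serve only to illustrate the interpolation formula.

```python
from bisect import bisect_right


def sbf_upper(pset, t):
    """Linearly interpolate a sorted list of (time, value) points at time t."""
    times = [point[0] for point in pset]
    if not times[0] <= t <= times[-1]:
        raise ValueError("t lies outside the computed interval")
    k = bisect_right(times, t) - 1        # last point with time <= t
    if k == len(pset) - 1:
        return pset[k][1]                 # t is exactly the final point
    (t0, v0), (t1, v1) = pset[k], pset[k + 1]
    return v0 + (t - t0) * (v1 - v0) / (t1 - t0)


# Hypothetical point set: rising segment followed by a flat segment.
pset = [(0.0, 0.0), (2.0, 4.0), (5.0, 4.0)]
samples = [sbf_upper(pset, t) for t in (1.0, 3.0, 5.0)]
print(samples)
```

Using `bisect_right` selects the last point at or before t, so repeated time values in the point set (which can arise when a step of length zero is taken) do not cause a division by zero.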
Finally, the algorithm iterates in Lines 10-26, increasing the current time value $t_{cur}$ starting at $t_{cur} = t^j$, until $t_{cur}$ exceeds the next angular point $t^{j+1}$. Note that when Algorithm 1 is run to compute the solution to Equations 11-13 for $t_{cur} = t^j$, variables $x_l$ will be assigned as follows:
$$x_l = E^{\max}_l(t_{cur}) - E^{\min}_l(t_{cur}) \quad \forall l, 1 \le l \le i - 1, \qquad (15)$$
$$x_l = 0 \quad \forall l, i < l \le N, \qquad (16)$$
$$x_i = t^j - \sum_{l=1}^{N} E^{\min}_l(t_{cur}) - \sum_{l \ne i} x_l = t^j - \sum_{l=1}^{i-1} E^{\max}_l(t_{cur}) - \sum_{l=i}^{N} E^{\min}_l(t_{cur}). \qquad (17)$$
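Now consider the solution $\overline{\mathit{val}}$, with variable assignment $\bar{x}_1, \ldots, \bar{x}_N$, computed by Algorithm 1 for time instant $\bar{t} = t_{cur} + \Delta$, where $\Delta$ is a small enough value. Note that as long as $\bar{t} \le t^{j+1}$, then $E^{\max}_l(\bar{t}) = E^{\max}_l(t_{cur}) + \Delta\,\mathrm{EMaxAdd}(l, t_{cur})$ and similarly $E^{\min}_l(\bar{t}) = E^{\min}_l(t_{cur}) + \Delta\,\mathrm{EMinAdd}(l, t_{cur})$. Furthermore, define $\mathit{div} \equiv \sum_{l=1}^{i-1} \mathrm{EMaxAdd}(l, t^j) + \sum_{l=i}^{N} \mathrm{EMinAdd}(l, t^j)$ as computed in Line 12. Then based on Equations 15-17, we obtain:
$$\bar{x}_l = E^{\max}_l(\bar{t}) - E^{\min}_l(\bar{t}) \quad \forall l, 1 \le l \le i - 1, \qquad (18)$$
$$\bar{x}_l = 0 \quad \forall l, i < l \le N, \qquad (19)$$
$$\begin{aligned} \bar{x}_i &= \bar{t} - \sum_{l=1}^{i-1} E^{\max}_l(\bar{t}) - \sum_{l=i}^{N} E^{\min}_l(\bar{t}) \\ &= t^j + \Delta(1 - \mathit{div}) - \sum_{l=1}^{i-1} E^{\max}_l(t_{cur}) - \sum_{l=i}^{N} E^{\min}_l(t_{cur}) \\ &= x_i + \Delta(1 - \mathit{div}), \end{aligned} \qquad (20)$$
$$\begin{aligned} \overline{\mathit{val}} &= \sum_{l=1}^{N} \left(\delta_l + \alpha_l E^{\min}_l(\bar{t})\right) + \sum_{l=1}^{N} \alpha_l \bar{x}_l \\ &= \sum_{l=1}^{N} \delta_l + \sum_{l=1}^{i-1} \alpha_l E^{\max}_l(\bar{t}) + \sum_{l=i}^{N} \alpha_l E^{\min}_l(\bar{t}) + \alpha_i \bar{x}_i \\ &= \left(\sum_{l=1}^{N} \delta_l + \sum_{l=1}^{i-1} \alpha_l E^{\max}_l(t_{cur}) + \sum_{l=i}^{N} \alpha_l E^{\min}_l(t_{cur}) + \alpha_i x_i\right) \\ &\quad + \sum_{l=1}^{i-1} \alpha_l \Delta\,\mathrm{EMaxAdd}(l, t^j) + \sum_{l=i}^{N} \alpha_l \Delta\,\mathrm{EMinAdd}(l, t^j) + \alpha_i \Delta(1 - \mathit{div}) \\ &= \mathit{val} + \sum_{l=1}^{i-1} \alpha_l \Delta\,\mathrm{EMaxAdd}(l, t^j) + \sum_{l=i}^{N} \alpha_l \Delta\,\mathrm{EMinAdd}(l, t^j) + \alpha_i \Delta(1 - \mathit{div}), \end{aligned} \qquad (21)$$
and such solution is valid as long as $\Delta \le t^{j+1} - t_{cur}$ and furthermore $\Delta$ is small enough that it holds: $0 \le \bar{x}_i \le E^{\max}_i(\bar{t}) - E^{\min}_i(\bar{t})$; note that based on Equation 20, the latter condition can be rewritten as:
$$x_i + \Delta(1 - \mathit{div}) \ge 0, \qquad (22)$$
$$x_i + \Delta(1 - \mathit{div}) \le E^{\max}_i(t_{cur}) - E^{\min}_i(t_{cur}) + \left(\mathrm{EMaxAdd}(i, t^j) - \mathrm{EMinAdd}(i, t^j)\right)\Delta. \qquad (23)$$
Since $\bar{x}_i$ can be increasing or decreasing with $\Delta$ based on the value of $\mathit{div}$, we have to consider several different cases. In all cases, note that Equation 21 is linear in $\Delta$, hence as long as the Equation holds in an interval $(t', t'')$, the solution to Equations 11-13 can be obtained by linear interpolation of $(t', \mathit{val}')$, $(t'', \mathit{val}'')$.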
Case (1): $\mathit{div} > 1$ (Lines 14-17). The constraint in Equation 23 is verified for all $\Delta$. Solving Equation 22 yields $\Delta \le x_i/(\mathit{div} - 1)$. We have two subcases (a) and (b). (a) Assume that $x_i/(\mathit{div} - 1) < t^{j+1} - t_{cur}$. Then Equations 20, 21 are valid in the interval $[t_{cur}, t_{cur} + x_i/(\mathit{div} - 1)]$. Note that for $\Delta = x_i/(\mathit{div} - 1)$, $\bar{x}_i = 0$. Furthermore, we have $\alpha_i \Delta (1 - \mathit{div}) = -\alpha_i x_i$. Hence, the new value $\overline{\mathit{val}}$ can be computed as in Line 16, new values for $t_{cur}$, $\mathit{val}$ are assigned as $t_{cur} := t_{cur} + x_i/(\mathit{div} - 1)$, $\mathit{val} := \overline{\mathit{val}}$ and the new point $(t_{cur}, \mathit{val})$ is inserted in Line 11 at the next iteration. Finally, note that at the next iteration, all variables $x_l$ with $l < i - 1$ are still assigned their maximum value $E^{\max}_l(t_{cur}) - E^{\min}_l(t_{cur})$ and all $x_l$, $l \ge i$ are set to 0. Hence, we can apply the same reasoning as in Equations 18-21 after setting $i := i - 1$ as in Line 17. In particular, note that $i$ can never be assigned the invalid value 0. By contradiction, assume that in the current iteration $i = 1$. Then all $x_l = 0$, meaning that $\bar{t} = \sum_{l=1}^{N} E^{\min}_l(\bar{t})$ according to Equation 12; this in turn implies $\sum_{l=1}^{N} \mathrm{EMinAdd}(l, t^j) = 1$, which contradicts $\mathit{div} > 1$. (b) Assume that $x_i/(\mathit{div} - 1) \ge t^{j+1} - t_{cur}$. Then Equations 20, 21 are valid in the interval $[t_{cur}, t^{j+1}]$, the condition of the while loop in Line 10 becomes false and the point $(t^j, \mathit{val})$ is added to $\mathit{pset}$ in the next iteration of the for loop at Line 7.
Case (2): $\mathit{div} = 0$ and $\mathrm{EMaxAdd}(i, t^j) = 0$ (Lines 19-22). Note $\mathit{div} = 0$ implies $\mathrm{EMinAdd}(i, t^j) = 0$. The constraint in Equation 22 is verified for all $\Delta$. Solving Equation 23 yields $\Delta \le E^{\max}_i(t_{cur}) - E^{\min}_i(t_{cur}) - x_i$. Once again we have two subcases. (a) Assume that $E^{\max}_i(t_{cur}) - E^{\min}_i(t_{cur}) - x_i < t^{j+1} - t_{cur}$. Then Equations 20, 21 are valid in the interval $[t_{cur}, t_{cur} + E^{\max}_i(t_{cur}) - E^{\min}_i(t_{cur}) - x_i]$. Note that for $\Delta = E^{\max}_i(t_{cur}) - E^{\min}_i(t_{cur}) - x_i$, $\bar{x}_i$ is set to the maximum value $E^{\max}_i(\bar{t}) - E^{\min}_i(\bar{t})$. Also note that in Equation 21, $\alpha_i \Delta (1 - \mathit{div}) = \alpha_i \Delta$ and $\sum_{l=1}^{i-1} \alpha_l \Delta\,\mathrm{EMaxAdd}(l, t^j) + \sum_{l=i}^{N} \alpha_l \Delta\,\mathrm{EMinAdd}(l, t^j) = 0$, hence the new value $\overline{\mathit{val}}$ can be computed as in Line 21. The same considerations as in Case (1a) then apply except that we set $i := i + 1$ since in the next iteration, all variables $x_l$ with $l \le i$ are assigned their maximum value $E^{\max}_l(t_{cur}) - E^{\min}_l(t_{cur})$ and all $x_l$, $l > i + 1$ are still set to 0. Note that it is possible to set $i$ to the invalid value $N + 1$; this is covered in Case (6). (b) Assume that $E^{\max}_i(t^j) - E^{\min}_i(t^j) - x_i \ge t^{j+1} - t_{cur}$. Then the same considerations as in Case (1b) apply.
Case (3): $\mathit{div} = 1$ and $\mathrm{EMaxAdd}(i, t^j) = 0$, $\mathrm{EMinAdd}(i, t^j) = 1$ (Lines 19-22). As in Case (2), the constraint in Equation 22 is verified for all $\Delta$, while solving Equation 23 yields $\Delta \le E^{\max}_i(t_{cur}) - E^{\min}_i(t_{cur}) - x_i$. Furthermore, note that for $\Delta = E^{\max}_i(t_{cur}) - E^{\min}_i(t_{cur}) - x_i$, $\bar{x}_i$ is again set to the maximum value $E^{\max}_i(\bar{t}) - E^{\min}_i(\bar{t})$ and from Equation 21 we obtain $\overline{\mathit{val}} = \mathit{val} + \alpha_i \Delta$. Hence, this case is equivalent to Case (2a)-(2b) and can be handled by the same Lines 19-22.
Case (4): $\mathit{div} = 0$ and $\mathrm{EMaxAdd}(i, t^j) = 1$ (Line 24). As in Case (2), note $\mathit{div} = 0$ implies $\mathrm{EMinAdd}(i, t^j) = 0$. Then both Equations 22 and 23 are verified for all $\Delta$. Hence, as in Case (1b), Equations 20, 21 are valid in the interval $[t_{cur}, t^{j+1}]$ and the point $(t^j, \mathit{val})$ is added to $\mathit{pset}$ in the next iteration of the for loop.
Case (5): $\mathit{div} = 1$ and either $\mathrm{EMaxAdd}(i, t^j) = 1$ or $\mathrm{EMinAdd}(i, t^j) = 0$ (Line 24). Then again both Equations 22 and 23 are verified for all $\Delta$, hence this case is equivalent to Case (4).
Algorithm 2 Compute $\mathrm{sbf}^U(t)$ in $[0, \max_i D^{I/O}_i]$.
1: procedure COMPUTEINTERVAL($\max_i D^{I/O}_i, \tau_1, \ldots, \tau_N$)
2:   $\forall i, 1 \le i \le N$: compute $\alpha_i, \delta_i$.
3:   Order $\tau_1, \ldots, \tau_N$ by non-increasing values of $\alpha_i$.
4:   Compute $t^1, \ldots, t^{M-1}$ as the set of angular points for any $E^{\max}_i(t)$, $E^{\min}_i(t)$ in $[0, \max_i D^{I/O}_i)$.
5:   $t^M := \max_i D^{I/O}_i$
6:   $\mathit{pset} := \emptyset$
7:   for $j = 1 \ldots M - 1$ do
8:     $(\mathit{val}, i, \{x_1, \ldots, x_N\}) :=$ ComputeInstant($t^j, \tau_1, \ldots, \tau_N$)
9:     $t_{cur} := t^j$
10:     while $t_{cur} < t^{j+1}$ do
11:       add $(t_{cur}, \mathit{val})$ to $\mathit{pset}$
12:       $\mathit{div} := \sum_{l=1}^{i-1} \mathrm{EMaxAdd}(l, t^j) + \sum_{l=i}^{N} \mathrm{EMinAdd}(l, t^j)$
13:       if $\mathit{div} > 1$ then
14:         $\Delta := x_i/(\mathit{div} - 1)$
15:         $t_{cur} := t_{cur} + \Delta$
16:         $\mathit{val} := \mathit{val} + \Delta\left(\sum_{l=1}^{i-1} \alpha_l \mathrm{EMaxAdd}(l, t^j) + \sum_{l=i}^{N} \alpha_l \mathrm{EMinAdd}(l, t^j)\right) - \alpha_i x_i$
17:         $i := i - 1$
18:       else if ($\mathit{div} = 0$ & $\mathrm{EMaxAdd}(i, t^j) = 0$) | ($\mathit{div} = 1$ & $\mathrm{EMaxAdd}(i, t^j) = 0$ & $\mathrm{EMinAdd}(i, t^j) = 1$) then
19:         $\Delta := E^{\max}_i(t_{cur}) - E^{\min}_i(t_{cur}) - x_i$
20:         $t_{cur} := t_{cur} + \Delta$
21:         $\mathit{val} := \mathit{val} + \alpha_i \Delta$
22:         $i := i + 1$
23:       else
24:         break
25:       end if
26:     end while
27:   end for
28:   return $\mathit{pset}$
29: end procedure
Case (6): $i > N$ (Line 24). This can only happen if in the previous iteration either Case (2a) or (3a) is executed with $i = N$. Note that in both cases, in the current iteration $\mathit{div} = \sum_{l=1}^{N} \mathrm{EMaxAdd}(l, t^j) = 0$, hence Line 24 is executed. We show that the assignment where each variable is maximal, i.e. $x_l = E^{\max}_l(\bar{t}) - E^{\min}_l(\bar{t})$, is optimal for all $\bar{t}$ in the interval $[t_{cur}, t^{j+1}]$. Equation 12 is equivalent to: $\sum_{l=1}^{N} \left(E^{\max}_l(\bar{t}) - E^{\min}_l(\bar{t})\right) \le \bar{t} - \sum_{l=1}^{N} E^{\min}_l(\bar{t})$, which is in turn equivalent to: $\sum_{l=1}^{N} E^{\max}_l(\bar{t}) \le \bar{t}$. But since $\mathrm{EMaxAdd}(l, t^j) = 0$ for all $l$ and furthermore at $t_{cur}$ it must hold $\sum_{l=1}^{N} E^{\max}_l(t_{cur}) \le t_{cur}$, Equation 12 is verified for all $\bar{t}$ in $[t_{cur}, t^{j+1}]$. Hence, the assignment $x_l = E^{\max}_l(\bar{t}) - E^{\min}_l(\bar{t})$ must result in the optimum. Computing the new value $\overline{\mathit{val}}$ according to Equation 21 results in $\overline{\mathit{val}} = \mathit{val}$, hence $\mathrm{sbf}^U(t)$ is constant between point $(t_{cur}, \mathit{val})$ and point $(t^j, \mathit{val})$, which is added to $\mathit{pset}$ in the next iteration of the for loop.
We can now prove our main theorem.
Theorem 13. A correct upper bound $\mathrm{sbf}^U(t)$ to $\mathrm{sbf}(t)$ in the interval $[0, \max_i D^{I/O}_i]$ can be computed by linear interpolation of the point set returned by Algorithm 2.
Proof. Based on the discussion above, whenever Algorithm 2 inserts two consecutive points $(t', val')$, $(t'', val'')$ in pset, $sbf^U(t)$ can be computed by linear interpolation of $(t', val')$, $(t'', val'')$ for all t in the interval $[t', t'']$. Since furthermore points are inserted for the beginning and end of the interval $[0, \max_i D^{I/O}_i]$, to conclude the proof it remains to show that the algorithm terminates. This is not obvious, since in the while loop in Lines 10-26, ∆ can be set to 0 and furthermore i can be either incremented or decremented. We show that if i is incremented in the first iteration of the while loop, then it cannot be decremented in further iterations; similarly, if i is decremented in the first iteration, it cannot be incremented afterwards. Since i ≥ 1 by Case (1a), and furthermore the while loop is terminated whenever i > N, it follows that the loop is executed at most N + 1 times.
By contradiction, assume that Lines 14-17 are executed in one iteration, and Lines 19-22 in the following one. Note that if i is incremented or decremented by 1, the value of div can only change by 1. Hence, it must hold div = 2 in the first iteration and div = 1 in the second. However, this implies $EMaxAdd(i, t_j) = 1$ in the second iteration, hence Lines 19-22 cannot be executed. Similarly, assume that Lines 19-22 are executed in one iteration, and Lines 14-17 in the following one; it must hold div = 1 in the first iteration and div = 2 in the second. This implies $EMinAdd(i, t_j) = 0$ in the first iteration, hence Lines 19-22 cannot be executed.
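The linear interpolation used in Theorem 13 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sbf_upper`, the representation of pset as a time-sorted list of `(time, value)` pairs, and the sample points are all assumptions made for the example.

```python
from bisect import bisect_right


def sbf_upper(pset, t):
    """Evaluate a piecewise-linear bound at time t by interpolating pset.

    pset is assumed to be a list of (time, value) pairs sorted by time.
    Repeated time values are tolerated, since Delta may be 0 in Algorithm 2.
    """
    if not pset:
        raise ValueError("pset must contain at least one point")
    times = [point[0] for point in pset]
    if t < times[0] or t > times[-1]:
        raise ValueError("t lies outside the interval covered by pset")

    # Index of the last point whose time is <= t.
    k = bisect_right(times, t) - 1
    if k == len(pset) - 1:
        # t coincides with the final point of the set.
        return pset[k][1]

    # Here times[k] <= t < times[k + 1], so the segment has nonzero length.
    (t0, v0), (t1, v1) = pset[k], pset[k + 1]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical point set, for illustration only.
example_pset = [(0, 0), (2, 0), (4, 2), (6, 2)]
print(sbf_upper(example_pset, 3))
```

Using `bisect_right` selects the last point at a repeated time value, so a zero-length segment is never used as an interpolation interval.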
It remains to discuss the computational complexity of Algorithm 2. Note that in a time window of length t, the number of angular points of $E^{\max}_i(t)$ or $E^{\min}_i(t)$ is no more than $2 + 2\lceil t/p_i \rceil$. Hence, an upper bound to the number of iterations of the for loop at Line 7 can be computed as $N \left( 4 + 4 \left\lceil \frac{\max_i D^{I/O}_i}{\min_i p_i} \right\rceil \right) + 1$, where the 1 term accounts for $t_M = \max_i D^{I/O}_i$. Lines 8-9 can be executed in O(N). Based on the proof of Theorem 13, the while loop at Line 10 is repeated at most N + 1 times. Finally, note that Lines 11-25 can be optimized to run in constant time rather than linear time. This is because the value of i changes by 1 between successive iterations. Hence, the summations in Lines 12 and 16 can be computed from their values at the previous step in constant time. In conclusion, Algorithm 2 has a pseudo-polynomial complexity of $O\left( N^2 \max_i D^{I/O}_i / \min_i p_i \right)$.
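The two ingredients of this bound can be illustrated with a short sketch. The first function evaluates the for-loop iteration bound; the others show how the weighted summation of Line 16 can be updated in constant time when i changes by 1. All function names, the 0-based list layout, and the sample values are assumptions made for the example, not part of Algorithm 2 as published.

```python
import math


def for_loop_iteration_bound(deadlines, periods):
    """Bound N * (4 + 4 * ceil(max D / min p)) + 1 on for-loop iterations."""
    n = len(periods)
    ratio = math.ceil(max(deadlines) / min(periods))
    return n * (4 + 4 * ratio) + 1


def weighted_sum(i, alpha, emax_add, emin_add):
    """Direct O(N) evaluation of the Line 16 summation.

    i is 1-based as in the pseudocode; the lists are 0-based.
    """
    below = sum(alpha[l] * emax_add[l] for l in range(i - 1))
    above = sum(alpha[l] * emin_add[l] for l in range(i - 1, len(alpha)))
    return below + above


def step_up(s, i, alpha, emax_add, emin_add):
    """O(1) update when i becomes i + 1: term l = i switches to EMaxAdd."""
    return s + alpha[i - 1] * (emax_add[i - 1] - emin_add[i - 1])


def step_down(s, i, alpha, emax_add, emin_add):
    """O(1) update when i becomes i - 1: term l = i - 1 switches to EMinAdd."""
    return s + alpha[i - 2] * (emin_add[i - 2] - emax_add[i - 2])


# Hypothetical inputs, for illustration only.
alpha, emax_add, emin_add = [1, 2, 3], [1, 0, 1], [0, 1, 1]
s1 = weighted_sum(1, alpha, emax_add, emin_add)
s2 = step_up(s1, 1, alpha, emax_add, emin_add)
print(s2 == weighted_sum(2, alpha, emax_add, emin_add))
print(for_loop_iteration_bound([10, 20], [5, 8]))
```

Multiplying the O(N) cost per for-loop iteration (Lines 8-9 plus at most N + 1 constant-time while iterations) by the iteration bound gives the stated quadratic dependence on N.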
