Tunable WCET for hard real-time multicore system by Yoon, Man Ki
© 2011 by Man Ki Yoon. All rights reserved.
TUNABLE WCET FOR HARD REAL-TIME MULTICORE SYSTEM
BY
MAN KI YOON
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2011
Urbana, Illinois
Adviser:
Professor Lui Sha
Abstract
In recent years, multicore processors have been receiving a significant amount of attention
from avionic and automotive industries as the demand for high-end real-time applications
drastically increases. However, the unpredictable worst-case timing behavior that mainly
arises from shared resource contention in current multicore architectures has been the biggest
stumbling block for a widespread use of multicores in hard real-time systems. A great deal
of research effort has been devoted to addressing the issue. Among others, the development
of a new multicore architecture has emerged as an attractive solution because it is possible
to eliminate the sources of unpredictable interferences in the first place, or at least to turn
them into predictable ones. Accordingly, this opens a new possibility of system-level
optimizations with multicore-based hard real-time systems. To address this issue, this study
proposes a new perspective of WCET model called tunable WCET, in which the WCET
of a task is partitioned into fixed execution time and tunable delay. Our tunable WCET
model enables WCET-aware shared resource allocation/arbitration by elastically deforming
the tunable delays of tasks. For this, we also propose novel shared bus arbitration and cache
partitioning methods called harmonic round-robin bus scheduling and two-level cache
partitioning. We present a mixed integer linear programming (MILP) formulation as the solution
to the optimization problem of tunable WCETs. Our experimental results show that the
proposed methods can significantly lower overall system utilization.
To mom and dad.
Acknowledgments
First and foremost, I would like to thank my advisor, Professor Lui Sha, who has supported
me throughout my study with his advice, encouragement, and endless patience. Without
his valuable guidance and instruction this work would not have been possible. I would like
to thank Min Young Nam for his sincere help both at and away from school. I would also
like to thank Professor Chang-Gun Lee for guiding me to pursue my study in the field of
Real-Time Systems in the first place.
Finally, I would like to thank Jung Eun, my intellectual and life companion, with whom
I have shared laughter and tears through good times and bad.
Table of Contents
List of Tables
List of Figures
List of Abbreviations
List of Symbols
Chapter 1 Introduction
1.1 Resource Sharing Problem in Multicore System
1.2 Motivating Hard Real-Time Multicore Architecture
1.3 Multicore Architecture and Its Effect on WCET
1.3.1 Intra-core interferences
1.3.2 Inter-core interferences
1.3.3 The scope of this study
1.4 Tunable WCET and Its Optimization
1.5 Organization of This Thesis
Chapter 2 Optimization of Tunable WCET
2.1 Harmonic Round-Robin Bus Arbitration
2.2 Two-Level Cache Partitioning
2.3 System Model
2.4 Tunable WCET
2.5 Problem Description
Chapter 3 Tunable WCET Analysis
3.1 Bus Access Delay $d^B_i$
3.2 Bank Conflict Delay $d^M_i$
3.2.1 Computation of Djk
3.2.2 Computation of $d^s_{j,\varphi}$
Chapter 4 MILP Formulation
4.1 Parameters and Variables
4.1.1 System parameters
4.1.2 Decision variables
4.1.3 Range variables
4.2 Objective Function
4.3 Constraints
4.3.1 Harmonic round-robin
4.3.2 Bus scheduling table
4.3.3 Task to core mapping
4.3.4 Core to bank mapping
4.3.5 Task to column mapping
4.3.6 WCET calculation
4.3.7 Task and core utilization
Chapter 5 Evaluation
5.1 Evaluation Method
5.1.1 Experimental parameters
5.1.2 Experimental groups
5.1.3 Evaluation metric
5.2 Evaluation Result
5.2.1 Impact of core count
5.2.2 Impact of cache accesses intensity
5.2.3 Impact of cache configuration
Chapter 6 Conclusion and Future Work
References
List of Tables
4.1 List of system parameters.
5.1 Experimental parameters.
5.2 Experimental groups.
List of Figures
1.1 Tunable WCET model.
2.1 The impact of bus schedule and task allocation on bus access delay.
2.2 Bank conflict delays with different bus and cache configurations.
2.3 An example of unbounded bank conflict delay.
2.4 Two-level cache partitioning.
2.5 Delay components of proposed tunable WCET model.
3.1 Worst-case execution time of task i on core j.
3.2 Different bus access delays with different bus schedules.
3.3 Task to column mappings.
3.4 Four possible scenarios for the range of task i's columns.
3.5 The worst-case bank access scenario of cores 2, 5, 6, and 8.
3.6 An example busy period $w_{2,17}$.
3.7 Calculation of $w^i_{j,\varphi}$ and $d^s_{j,\varphi}$.
4.1 The assignments of j;s for HRR of (2, 4, 8, 8).
4.2 The number of required banks for different number of required columns.
4.3 Bank sharing constraint.
4.4 The range constraint of mapping from a task to columns.
4.5 An illustration of ws0;s;k for slot s = 14 with $L_M = 2 \cdot L_B$.
5.1 Minimum system utilization with different core counts.
5.2 Minimum system utilization with different cache access intensities.
5.3 Minimum system utilization with different cache configurations.
List of Abbreviations
WCET Worst-Case Execution Time.
TDMA Time Division Multiple Access.
FSB Front-side bus.
HRR Harmonic Round-Robin.
MILP Mixed Integer Linear Programming.
DMA Direct Memory Access.
IMA Integrated Modular Avionics.
List of Symbols
N  Number of tasks.
NC Number of cores.
NB Number of banks of the cache.
NW Number of columns in a bank.
NX Number of columns of the cache.
NXi Number of columns required for task i.
NMi Number of cache accesses of task i.
Tmax Upper-bound of HRR periods.
Tmin Lower-bound of HRR periods.
LB Bus access latency.
LM Bank access latency.
Chapter 1
Introduction
Multicores have been increasingly adopted by chip manufacturers as a solution to scale the
performance beyond the thermal and power walls. Intel has ten-core chips on the market [1],
and Sun has introduced the SPARC T3 processor, which has sixteen hyper-threaded cores [2].
Furthermore, a research chip by Intel has as many as 80 cores that can perform more than
one trillion floating-point operations per second [3]. While most of the current multicore
systems are targeted for servers and desktops, it is also expected that real-time embedded
systems and cyber-physical systems will follow the trend in the near future as the demand
for high-end real-time applications is rapidly growing. For instance, Freescale's QorIQ P4080
processor [4] and the ARM11 MPCore processors [5] are receiving wide attention from avionic
and automotive industries. Moreover, many recent smartphones and tablet PCs have begun
to be equipped with mobile multicore processors [6, 7].
1.1 Resource Sharing Problem in Multicore System
However, one of the major obstacles in using multicore processors for these domains is that
the execution time of applications can vary noticeably depending on how physical resources,
such as cache and system interconnect, are shared and/or contended between co-scheduled
tasks on the system. For example, shared cache is one of the most critical and contended
resources on multicores [8, 9, 10]. If two tasks start to evict each other's cache lines out of the
shared cache to bring in their own data, they will take much longer to complete than when each
of them runs alone or is scheduled together with a non-interfering task with a smaller working
set. In addition to this, the shared bus is another major source of indeterminism in multicore
processors [11, 12, 13]. Similar to the situation of shared cache, if two or more tasks on
different cores try to access the shared bus simultaneously for cache fetches, some tasks will
experience longer delays than others due to the contention. If we consider, furthermore,
I/O traffic injected into the system through DMA (Direct Memory Access), the problem
becomes more serious since the traffic may impose additional delays on the core-initiated
bus accesses.
This unpredictability of the timing behavior of current multicore architectures is a huge
barrier, especially for safety-critical systems in which the predictability of the worst-case
temporal behavior is of primary importance. One solution to this kind of problem is
to develop an analysis method that can precisely estimate the worst-case execution times
of applications in the presence of shared resource contentions [14, 15, 16, 17, 18]. The
assumptions made in the existing analyses are commonly too restrictive, however, and thus
the results are often very pessimistic or not even directly applicable to the current multicore
architectures. The more serious problem is that, as multicore architectures become more
complex, the sources of unpredictability become much more strongly correlated than
before. Hence it becomes harder or even impossible to achieve accurate estimation of WCET
with the existing analyses.
Due to such fundamental limitations of the analytic methods, hardware modification of
multicore systems has emerged as an attractive and viable solution [19, 20, 21, 22]. While
the analytic methods try to analyze interferences caused by resource contentions, the new
multicore architectures focus on eliminating such interferences in the first place for higher
predictability. However, some of the architectures support only certain types of programming
languages [19], and/or require software applications to be modified in order to take
advantage of the hardware modifications [21]. Moreover, some architectures experience poor
average-case performance due to the lack of certain performance-support features such as
multithreading [20]. From this point of view, the multicore architecture proposed by Paolieri
et al. [22], on which this study is based, provides a good architectural foundation for future
hard real-time multicore systems.
1.2 Motivating Hard Real-Time Multicore Architecture
In [22], Paolieri et al. introduced a new hard real-time multicore architecture in which
accesses to shared resources, such as the shared bus or cache, are controlled by hierarchical bus
arbiters (refer to Figure 1 in [22]). The architecture employs round-robin as the shared bus
arbitration policy; all cores are given an equal chance to access the bus. By
the nature of the round-robin policy, the maximum delay that a bus request of a task can suffer
from others is bounded by the total number of hard real-time tasks ready to be executed at
the same time. In addition to this bus access interference, they also analyzed shared cache
interference with regard to two factors, bank access interference and storage interference,
both of which are causes of unpredictable and unanalyzable worst-case timing behavior
of shared caches in multicore systems. The maximum delay due to bank access interferences
for a task is similarly bounded by the number of hard real-time tasks. They also addressed
cache partitioning techniques, which eliminate storage interference by splitting the cache space
into separately assigned pieces: banks or columns.
This architecture does not have the limitations of other multicore architectures. First
of all, since resource contentions are resolved by hardware arbiters, it does not require any
modification of applications' source code, and for a similar reason, it does not impose any
restrictions on programming language or OS. Furthermore, it can properly support
multithreading of applications with the help of Intra-Core Bus Arbiters (ICBAs); thus it does not
compromise system performance in applications' execution times.
1.3 Multicore Architecture and Its Effect on WCET
The WCET of a real-time task is affected by factors at different levels, from software-level features
to architecture-specific low-level system configuration. The former mainly includes application
structure, compiler optimization, task scheduling, and other software- or OS-level factors.
Although these factors affect WCET to some extent even in multicore environments,
such effects are not greatly different from those that can be seen in singlecore architectures.
The latter, i.e., architecture-level features, on the other hand, cause significant differences
in WCET analysis between multicore and singlecore architectures, especially due to the
resource sharing problem of multicore architectures briefly described in Section 1.1. Thus, in
what follows, we will explain what architecture-level factors could affect WCETs of real-time
tasks in multicore systems and then present the scope of this study.
1.3.1 Intra-core interferences
F1. Task scheduling
In a multicore processor, tasks can migrate from one core to another unless the OS or
the processor itself disallows task migration, and thus the task set on a core may
change from time to time. This variable task set makes WCET analysis harder because
the execution of a task can be affected by which other tasks it is co-scheduled with:
different task sets lead to different task schedules, which in turn change the memory
access patterns of tasks. As another example, out-of-order execution techniques,
such as pipelining or speculative execution, may cause intra-core interference due to
register file contentions. Although this is problematic even in singlecore processors,
the problem becomes more serious in a multicore environment since the change of the task
set, and thus of the task schedules, is often unpredictable.
F2. Private cache
Cache thrashing due to cache line contention is one of the major causes of interference
even in private caches (local caches). For example, unless the private cache of a core
is partitioned at task level, it is possible that two different tasks on the core
alternately fetch their instructions or data into the same cache lines, which makes their
WCETs increase. This also happens in singlecore processors and is generally analyzable;
however, as in the above case, the unpredictable variation of task sets makes the
problem more complex in multicore processors. Furthermore, cache coherency in
a multicore processor can also be problematic. For example, it is possible for a task to
experience an unexpected cache miss delay if the cache line was invalidated due to the
cache coherency protocol among the private caches.
1.3.2 Inter-core interferences
F3. On-chip interconnect
F3.1 Interconnect type
Among the many types of interconnect, shared bus and switch fabric are the
most commonly used internal communication networks in multicore architectures.
The primary difference between the two types is that while a shared bus can
transfer only one request at a time, a switch fabric enables two or more requests
to be served at the same time as long as the destinations are different. Thus, in
estimating the WCET of a task, the key factor with a shared bus is how requests
from multiple cores are arbitrated, whereas with a switch fabric we also need to consider
the destination information.
F3.2 Arbitration method
Priority-based arbitration is advantageous in that each request can be prioritized
according to the priority of the task or core that initiated the request. However,
from the perspective of WCET analysis complexity, it is often difficult to
analyze and bound the WCET of a task because doing so requires knowledge of the other
tasks that could delay each request; furthermore, the delays of low-priority
requests can hardly be bounded without a proper starvation prevention mechanism.
With TDMA (Time Division Multiple Access) or round-robin, on the other
hand, it is straightforward to analyze the upper-bound delay that a request could
suffer due to the requests from other cores.
F3.3 I/O traffic
I/O traffic injected from an external device into a system is transferred to
memory through an on-chip interconnect, e.g., the front-side bus (FSB), by DMA. It
commonly interferes with, and thus imposes additional delays on, the core-initiated
requests. Since it is difficult to know the arrival patterns of I/O traffic and its
workload a priori, the WCET analysis becomes even more complex without an
I/O flow control mechanism. This problem is not specific to multicores; however,
it makes the analysis more complex and harder.
F4. Shared last-level cache
F4.1 Multi-banks
Caches are usually partitioned into multiple banks, each of which can handle only
one request at a time. Thus two different cache requests can access the shared
last-level cache at the same time as long as their destination banks are different.
Otherwise, one of the requests could suffer a bank conflict delay if it tries to access
a bank in which the other is being handled.
F4.2 Cache partitioning
Partitioning the shared last-level cache is a commonly used way to reduce storage
interferences that could arise from sharing the same cache columns among
tasks. Task-level partitioning allocates a private subset of cache columns to each
task, and thus the storage interference can be eliminated. With core-level
partitioning, on the other hand, each core has its private subset of cache columns and
allows the tasks on the core to share the columns. While this avoids inter-core
storage interferences, the tasks on the same core could experience intra-core storage
interferences.
F4.3 Latency
If the latency of a cache request, $L_M$, relative to that of an interconnect request, $L_B$,
is low, the effect of cache access conflicts, i.e., bank conflicts, on WCET is
diminished. As an extreme case, if the ratio of $L_M$ to $L_B$ is less than one, the cache
conflict interferences can be totally eliminated. However, the interconnect
is faster than the caches in most processor architectures. In order to reduce
cache conflicts, the request latencies of cache and interconnect therefore need to
be taken into account in cache partitioning and interconnect arbitration.
F5. Off-chip memory
F5.1 Multi-interface
Similar to the case of multi-bank caches, multi-interface off-chip memories enable
memory access parallelism; that is, two memory accesses from two different
cores can be served simultaneously by two separate interfaces. However,
this can be fully utilized only when parallel data transfer is supported by the
interconnect, such as a switch fabric.
F5.2 DMA transfer
A DMA transfer between an I/O device and memory or between two memories can
delay core-initiated memory accesses. It is often difficult to analyze this type of
interference for a reason similar to that of Factor F3.3. To avoid or reduce such
interference, a device-level or interconnect-level I/O flow control mechanism is needed.
1.3.3 The scope of this study
Workload model assumptions
A1. A set of real-time tasks is given and does not change.
A2. Each task either is pre-allocated to a core or can be allocated by an optimization
process.
A3. A task cannot migrate from one core to another.
A4. The cache footprint and the number of cache accesses of a task are pre-profiled by
a static analysis.
A5. The critical path of a task obtained by a static analysis is long enough so that
the variable delays due to bus and cache conflicts do not change the critical path.
A6. The application model assumed in this study is a subclass of numerical real-time
tasks such as signal or image processing applications. Note that such applications
have few branches and their cache footprints do not change much from period to
period. Accordingly, we assume that the tasks under consideration in this study do
not branch during their executions and that the cache footprint that each task uses in
each period is steady throughout its execution. Furthermore, we assume that a static
schedule for the critical paths of the tasks on each core is made.
Architecture-level assumptions
With the above task workload model, we limit ourselves to considering the inter-core
interferences, especially those due to the on-chip interconnect (F3) and the shared last-level cache (F4),
as the primary factors that affect our WCET model. Among others, we directly address the
following architecture-level factors throughout the thesis:
F3. On-chip interconnect
F3.1 Interconnect type
We consider a bus-based interconnect that transfers instructions and data between
cores and the shared last-level cache.
F3.2 Arbitration method
We consider a round-robin based arbitration. In particular, we propose a new
arbitration method called harmonic round-robin scheduling, in which the worst-case
delay of a bus access depends on which core initiates the access. Section 2.1
will describe harmonic round-robin scheduling in detail.
F4. Shared last-level cache
F4.1 Multi-banks
We assume that the system has only one shared last-level multi-bank cache, in
which accesses to different banks can be served simultaneously.
F4.2 Cache partitioning
We assume that the shared last-level cache can be partitioned statically. In
particular, we propose a new partitioning scheme called two-level cache partitioning,
in which the banks are partitioned at core level and each bank is partitioned at
task level. In this scheme, each bank can be shared among cores; however, each
column of the bank is private to a task. Thus storage conflict interference
is totally eliminated. However, bank conflict interference can occur due to bank
sharing. Section 2.2 will describe the two-level cache partitioning scheme in detail.
F4.3 Latency
We assume the cache access latency, $L_M$, is higher than the bus access latency,
$L_B$. Thus our WCET model includes the interference due to cache access conflict,
i.e., bank access conflict, as aforementioned. However, there is no assumption on
the order of magnitude of $L_M / L_B$.
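The two-level partitioning idea under F4.2 can be pictured as a pair of mappings: banks to cores, and columns to tasks. The following is a minimal sketch for illustration only, not the formulation used later in the thesis; the names `bank_cores`, `column_owner`, and `task_core` and all example values are assumptions.

```python
def check_two_level_partition(bank_cores, column_owner, task_core):
    """Raise ValueError if a task owns a column in a bank not given to its core.

    bank_cores:   bank id -> set of cores allowed to use that bank (core level)
    column_owner: (bank id, column id) -> the single task owning that column
                  (task level; a dict key has one value, so columns are private)
    task_core:    task name -> core the task runs on
    """
    for (bank, column), task in column_owner.items():
        core = task_core[task]
        if core not in bank_cores[bank]:
            raise ValueError(
                f"task {task!r} on core {core} uses bank {bank}, "
                "which is not assigned to that core"
            )


def banks_with_possible_conflicts(bank_cores):
    """Banks shared by more than one core, where bank conflicts may occur."""
    return sorted(bank for bank, cores in bank_cores.items() if len(cores) > 1)


# Hypothetical configuration: bank 0 is shared by cores 1 and 2, bank 1 is
# private to core 3. Every column has exactly one owning task.
bank_cores = {0: {1, 2}, 1: {3}}
task_core = {"A": 1, "B": 2, "C": 3}
column_owner = {(0, 0): "A", (0, 1): "A", (0, 2): "B", (1, 0): "C"}

check_two_level_partition(bank_cores, column_owner, task_core)
shared = banks_with_possible_conflicts(bank_cores)
```

In this configuration tasks A and B never evict each other's columns, but both use bank 0, so their accesses to that bank may still delay each other.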
1.4 Tunable WCET and Its Optimization
Paolieri's multicore architecture summarized in Section 1.2 provides a high degree of
predictability of the applications' worst-case execution time with respect to the interference
factors in Subsection 1.3.3; each core has its exclusive time slot to access the bus (Factor
F3) and is given a set of private banks or columns of the shared cache (Factor F4).
Figure 1.1: Tunable WCET model. (Configurations A, B, and C each split the WCET of task i into a fixed execution time and a tunable delay, shown against a bound.)
While this makes WCET analysis much easier by eliminating the potential sources of
resource contentions, one major limitation is that the resources may have limited capacities to
accommodate a given workload. Recall that every application, i.e., task, is assigned to private
cache banks or columns in their architecture. Bank-level partitioning requires as many banks
as the number of tasks in the system. Column-level partitioning may resolve the capacity
problem, however tasks may experience additional delays in accessing the banks. Therefore
a proper partitioning method which can fully utilize the shared cache while minimizing the
bank access interferences is needed.
Another possible improvement is the use of application-aware bus scheduling. As
explained by the interference factor F3.2 in Subsection 1.3.2, round-robin scheduling can
ensure that a bus access has a finite upper-bound on delay by isolating accesses of a task
from others'. Thus it is easy to estimate the worst-case bus access delay. However, pure
round-robin scheduling may be inefficient in that every task has to wait for the same amount
of delay regardless of application characteristics; some memory-bound tasks are highly likely
to access the bus more intensively than others, or one may want to minimize the WCET of
a certain higher-criticality task. If we give more frequent slots to such tasks by lengthening
others' waiting times, we can achieve an enhanced overall efficiency. For example, a task
that is not schedulable with a worst-case bus access delay of 4 may become
schedulable with a reduced bus access delay, say 2. However, it is not always straightforward
to determine a better schedule since shortening one task's delay lengthens the delays of
others.
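To make this trade-off concrete, the sketch below computes, for a cyclic bus slot table, the largest number of slots a core may have to wait between two of its own slots. The function name and both example tables are illustrative assumptions; the second table is one possible schedule that gives core 1 a slot every 2 slots at the expense of cores 3 and 4.

```python
def worst_case_wait_slots(slot_table, core):
    """Largest gap, in slots, between consecutive slots owned by `core`
    in a slot table that repeats cyclically."""
    n = len(slot_table)
    positions = [i for i, owner in enumerate(slot_table) if owner == core]
    if not positions:
        raise ValueError(f"core {core} has no slot in the table")
    gaps = []
    for k, pos in enumerate(positions):
        nxt = positions[(k + 1) % len(positions)]  # wraps around the cycle
        gap = (nxt - pos) % n
        gaps.append(gap if gap else n)  # one slot per cycle means a gap of n
    return max(gaps)


pure_rr = [1, 2, 3, 4]               # every core gets one slot in four
weighted = [1, 2, 1, 3, 1, 2, 1, 4]  # core 1 is favored (assumed example)

rr_waits = {c: worst_case_wait_slots(pure_rr, c) for c in (1, 2, 3, 4)}
weighted_waits = {c: worst_case_wait_slots(weighted, c) for c in (1, 2, 3, 4)}
```

With the weighted table, core 1's worst-case wait drops from 4 slots to 2, while the waits of cores 3 and 4 grow from 4 slots to 8.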
In order to address the above challenges, this study proposes a new perspective of WCET
model called tunable WCET, in which the WCET of a task is partitioned into two parts, fixed
execution time and tunable delay, as shown in Figure 1.1. While most traditional WCET
analyses have focused only on how to precisely estimate WCETs, our tunable WCET model
enables system-level optimization for certain purposes by elastically deforming the configuration
of interference sources. In particular, we focus on the two major sources of inter-core
interference, the on-chip interconnect and the shared last-level cache (see F3 and F4 in Subsection 1.3.2).
In this study, we investigate how different configurations of bus arbitration and cache
partitioning could affect tasks' tunable delays and how such different interference sources are
correlated. In order to achieve this goal, we adopt Paolieri's multicore architecture (refer
to Section 1.2 or [22]) and propose novel bus arbitration and cache partitioning methods
called harmonic round-robin and two-level cache partitioning, respectively. With our harmonic
round-robin bus arbitration policy, we can vary the delays incurred for bus accesses of
different tasks on different cores and hence realize application-aware bus scheduling. Similarly,
our two-level cache partitioning scheme maps banks to cores and columns to tasks in such
a way that delays due to bank access conflicts are minimized with the help of our harmonic
round-robin bus scheduling. One might easily expect that bus scheduling and cache
partitioning, in conjunction with task allocation, are so highly dependent upon each other that it
is not straightforward to find the optimal system configuration for a given set of tasks.
Accordingly, we shall also present a mixed integer linear programming (MILP) formulation
as the solution to the optimization problem of tunable WCETs. As will be seen later in
this thesis, with our proposed methods, the solutions of the optimization problem are always
better than, or at least identical to, those that can be achieved with Paolieri's architecture
in terms of minimum achievable system utilization.
1.5 Organization of This Thesis
The rest of the thesis is organized as follows: In Chapter 2, we introduce our harmonic
round-robin bus arbitration and two-level cache partitioning methods and then formally
define the problem of tunable WCET optimization. In Chapter 3, we describe in detail
the tunable WCET model and its analysis. In Chapter 4, we present the mixed integer
linear programming (MILP) formulation for our tunable WCET optimization problem. In
Chapter 5, we present the experimental results obtained by our MILP optimization. Finally,
Chapter 6 concludes this thesis and discusses potential future work.
Chapter 2
Optimization of Tunable WCET
In this chapter, we first introduce the proposed harmonic round-robin bus arbitration and
two-level cache partitioning methods and then describe how these affect our tunable WCET
model from the perspective of system-level optimization.
2.1 Harmonic Round-Robin Bus Arbitration
As briefly mentioned in Section 1.2, with Paolieri's pure round-robin arbitration policy the
maximum delay that a task can suffer due to bus interference is bounded by the total number
of cores¹. That is, the worst-case scenario occurs when the tasks running on all cores try to
access the shared bus at the same time. Because the maximum number of slots that a task has
to wait for its next available slot equals the number of cores in the system, the upper bound
of the bus interference delay is $N_C \times L_B$, where $N_C$ is the number of cores and
$L_B$ is the bus latency. This means that every task has the same upper bound of bus access
delay regardless of how different their execution characteristics are. Accordingly, a more
memory-intensive or more highly utilized task has to wait for the same amount of delay as
less memory-intensive or less utilized tasks. This is inefficient in that the same bus request
delay affects a given performance metric differently for different tasks. For example, suppose
that task A on core 1 and task B on core 3 suffer the same worst-case bus delay of
$4 \times L_B$, as shown in Figure 2.1a, and that we want to minimize the overall system
utilization. If their periods are both 50 but their total numbers of cache accesses are 500
and 100, respectively, then the contributions of their bus access delays to the system
utilization are

$$u^{bus}_A = \frac{500 \times 4 \times L_B}{50 \times speed_{bus}} \quad \text{and} \quad u^{bus}_B = \frac{100 \times 4 \times L_B}{50 \times speed_{bus}},$$

respectively.

¹ We assume that a bus request must arrive before its designated time slot in order to be
granted.

(a) Pure RR (4,4,4,4). Slot order 1 2 3 4 1 2 3 4 ...; the requests of Task A (core 1) and
Task B (core 3) each wait up to $4 \times L_B$.
(b) HRR (2,4,8,8). Slot order 1 2 1 3 1 2 1 4 ...; the request of Task A (core 1) waits up to
$2 \times L_B$ and that of Task B (core 3) up to $8 \times L_B$.
Figure 2.1: The impact of bus schedule and task allocation on bus access delay.

As another example, suppose now that the period of task A is 10 and the number of its cache
accesses is 100. Then, the contributions of their bus access delays to the system utilization
become

$$u^{bus}_A = \frac{100 \times 4 \times L_B}{10 \times speed_{bus}} \quad \text{and} \quad u^{bus}_B = \frac{100 \times 4 \times L_B}{50 \times speed_{bus}},$$

respectively. In both cases, $u^{bus}_A / u^{bus}_B$ is 5, that is, task A affects the system
utilization 5 times more than task B. Now let us suppose that core 1 is allowed to access the
bus every 2 slots and core 3's period is increased to 8 slots, as shown in Figure 2.1b. Then,
$u^{bus}_A$ and $u^{bus}_B$ become

$$u^{bus}_A = \frac{100 \times 2 \times L_B}{10 \times speed_{bus}} \quad \text{and} \quad u^{bus}_B = \frac{100 \times 8 \times L_B}{50 \times speed_{bus}},$$

respectively. Accordingly, the net contribution, i.e., $u^{bus}_A + u^{bus}_B$, is reduced from

$$\frac{2400 \times L_B}{50 \times speed_{bus}} \quad \text{to} \quad \frac{1800 \times L_B}{50 \times speed_{bus}}.$$
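The arithmetic of this example can be checked with a few lines of Python. The following is
only a minimal sketch: the helper name `bus_utilization` is hypothetical, and the common
factor $L_B / speed_{bus}$ is factored out so that only the numeric coefficients are compared.

```python
from fractions import Fraction


def bus_utilization(accesses, wait_slots, period):
    """Coefficient of L_B / speed_bus in u^bus = accesses * wait_slots * L_B / (period * speed_bus)."""
    return Fraction(accesses * wait_slots, period)


# Second example of the text: task A (period 10) and task B (period 50), 100 accesses each.
pure_rr = bus_utilization(100, 4, 10) + bus_utilization(100, 4, 50)  # both cores wait 4 slots
hrr = bus_utilization(100, 2, 10) + bus_utilization(100, 8, 50)      # core 1: 2 slots, core 3: 8 slots

print(pure_rr, hrr)  # coefficients of L_B / speed_bus under the two schedules
```

Under these assumptions the two sums equal 2400/50 and 1800/50, the coefficients quoted in
the text.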
As can be observed from this example, by giving more frequent slots to cores on which
memory-intensive or highly utilized tasks run, we can lower the overall system utilization. If
the tasks are not pre-assigned to specific cores, we can reduce it further by grouping such
tasks and assigning them to the more prioritized cores. Thus, as a method of core
prioritization in bus scheduling, we propose the Harmonic Round-Robin (HRR) arbitration
policy, which is defined as follows:

$$HRR \triangleq (N_C, T_{min}, T_{max}, T_{RR}, P),$$

where $P$ is a set of HRR periods denoted by $\{T_1, T_2, \cdots, T_{N_C}\}$ with
$T_{min} \le T_j \le T_{max}$ for all $j$, and $T_{RR}$ is the hyper period of $P$, i.e., the
length of one round. The periods of cores harmonize with each other if and only if they
satisfy the following conditions:

$$\forall\, 1 \le j \le N_C - 1:\ \frac{T_{j+1}}{T_j} \in \mathbb{N} \ \text{(positive integer)} \quad \text{and} \quad \sum_{j=1}^{N_C} \frac{1}{T_j} = 1. \tag{2.1}$$

Because HRR periods are bounded by $[T_{min}, T_{max}]$, only a finite number of harmonic sets
can be formed within the given range. The first condition requires the periods to be in
non-decreasing order, $T_1 \le T_2 \le \cdots \le T_{N_C}$, and every $T_{j+1}$ to be a
positive integer multiple of $T_j$. The second condition further requires the scheduling
table to be a complete round-robin. By complete we mean that we can create a scheduling table
in which every core j has a slot every $T_j$ slots. For example, the schedules shown in
Figure 2.1 are complete round-robins. However, (2, 4, 4, 8) is not complete: since
1/2 + 1/4 + 1/4 + 1/8 > 1, there is no way to build such a table, that is, some core j cannot
be guaranteed access to the bus every $T_j$ slots. (2, 4, 8, 16) is not complete either, since
its sum is below 1; however, it can be transformed into a complete one by adjusting $T_4$ from
16 to 8. In fact, any set of periods $\{T_j\}$ that satisfies the first condition of 2.1 with
$\sum_{j=1}^{N_C} 1/T_j < 1$ can likewise be transformed into a complete one. Once a sequence
of HRR periods $(T_1, T_2, \cdots, T_{N_C})$ satisfying these constraints is obtained, we can
create the unique corresponding harmonic round-robin scheduling table of length
$T_{RR} = T_{N_C}$. For example, Figure 2.1b shows an HRR of (2, 4, 8, 8). Note that the pure
round-robin in Figure 2.1a is also an HRR whose periods are (4, 4, 4, 4).
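The two conditions of Equation 2.1 and the construction of a scheduling table can be sketched
as follows. This is an illustrative sketch rather than the procedure used in this thesis: the
function names are hypothetical, and the greedy rule (each core, in order, takes the first
free slot and then every $T_j$-th slot after it) is an assumption that happens to reproduce
the slot orders shown in the figures.

```python
from fractions import Fraction


def is_harmonic(periods):
    """Check both conditions of Equation 2.1 for periods given in core order T_1..T_NC."""
    divisible = all(b % a == 0 for a, b in zip(periods, periods[1:]))
    complete = sum(Fraction(1, t) for t in periods) == 1
    return divisible and complete


def build_table(periods):
    """Build one round of length T_RR = T_NC; entry s holds the core that owns slot s + 1."""
    if not is_harmonic(periods):
        raise ValueError("periods do not satisfy Equation 2.1")
    t_rr = periods[-1]
    table = [None] * t_rr
    for core, t in enumerate(periods, start=1):
        first = table.index(None)          # first slot not yet taken
        for slot in range(first, t_rr, t):
            assert table[slot] is None     # holds for harmonic, non-decreasing periods
            table[slot] = core
    return table


print(build_table([2, 4, 8, 8]))
print(is_harmonic([2, 4, 4, 8]), is_harmonic([2, 4, 8, 16]))
```

For (2, 4, 8, 8) the sketch yields the slot order 1 2 1 3 1 2 1 4 of Figure 2.1b, and it
rejects both (2, 4, 4, 8) and (2, 4, 8, 16).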
With our harmonic round-robin arbitration, the argument of [22] still holds: the maximum bus
access delay of a task can be obtained without knowledge of the other real-time tasks. Also,
due to its regular pattern, the bus access delay analysis is straightforward, as will be
explained in Section 3.1. More importantly, it helps to reduce bank conflicts in conjunction
with our two-level cache partitioning scheme, which will be introduced in the following
section.
2.2 Two-Level Cache Partitioning
As explained by Factor F4 in Section 1.3.2, the shared last-level cache is another major
source of inter-core interference in multicore systems. As a solution to eliminate such
interference, the authors of [22] consider a column-level cache partitioning method called
columnization. In columnization, a task suffers a bank conflict delay if another one is
already accessing the same bank, as shown in Figure 2.2a, which illustrates the worst-case
scenario of bank conflicts among cores 2, 3, and 4 sharing the same bank, k. As a way of
avoiding such interference, we can allocate a set of private banks to each task, which is
called bankization, as explained in the same paper. Because no two tasks can share any bank,
the bank conflict delay disappears entirely. This is restrictive, however, in that the number
of banks of the shared cache must be at least the number of tasks.
However, we can reduce or even eliminate bank conflicts without giving private banks to
every task by taking into account our harmonic round-robin arbitration. An example of such
situations is shown in Figures 2.2a and 2.2b. From the figure we can observe that core 2
and core 4 are able to share a bank without suffering any bank conflict delay since the bank
access request from core 4 begins after core 2 completes its request. However, cores 2 and 3
cannot share the bank without suffering bank conflict delays since their bank requests
overlap. For the same reason, core 3 and core 4 cannot share any bank without suffering the
delays. However, as shown in Figure 2.2b, if we assign another bank k' to core 3, none of
those cores experiences bank conflict delays. Moreover, in an extreme case, by assigning
bank k to cores 1, 3, 5, and 7 and bank k' to cores 2, 4, 6, and 8, we can totally eliminate
bank conflict delays. However, if the number of banks needed by cores 1, 3, 5, and 7 exceeds
one, we need to assign more banks to those cores. Nevertheless, as long as the number of
available banks is sufficient, all tasks can still avoid bank conflict delays.

(a) Pure RR. Cores 2, 3, and 4 share Bank k. Cores 3 and 4 suffer bank conflict delays.
(b) Pure RR. Cores 2 and 4 share Bank k, and Core 3 uses Bank k'. By assigning a different
bank to Core 3, all bank conflict delays are eliminated.
(c) HRR (2,4,24,24,24,24,24,24). Cores 2, 3, and 4 share Bank k. $L_M = 2 \times L_B$. The
bank conflict delays can be eliminated by scheduling the bus with a harmonic round-robin
arbitration.
(d) HRR (2,4,24,24,24,24,24,24). Cores 2, 3, and 4 share Bank k. $L_M = (5/2) \times L_B$.
An increase of $L_M$ causes bank conflict delays to Cores 2, 3, and 4.
Figure 2.2: Bank conflict delays with different bus and cache configurations.
However, one may encounter a situation where no more banks are available for core 3 in
Figure 2.2a. In that case, we can eliminate or reduce the bank conflict delays by scheduling
the requests with a harmonic round-robin schedule, as shown in Figure 2.2c. The harmonic
round-robin schedule in this example enables the bank access requests from cores 2, 3, and 4
to be pipelined so that no bank requests overlap.
Another important factor that influences bank conflict delay and bank sharing is the ratio
of the bank latency, $L_M$, to the bus latency, $L_B$, as explained by Factor F4.3 in
Subsection 1.3.2. To make the point clear, let us consider Figure 2.2c again. Since the bank
latency, which is 4 in this example, is twice the bus latency, i.e., 2, and the slots of the
cores sharing the bank are $2 \times L_B$ apart from each other, there is no bank conflict
delay among them. However, once the bank latency becomes 5, those cores start experiencing
bank conflict delays, as shown in Figure 2.2d. Furthermore, the delays could accumulate even
beyond the first round. In this case, the accumulated delay affects the same slots in the
next round, making each delay increase without bound, which we call the unbounded bank
conflict delay problem (see Figure 2.3 for an example). To prevent such a situation, the net
workload, i.e., the sum of bank access delays including latencies, going to a shared bank
should be less than or equal to the hyper period of the harmonic round-robin schedule. Note
that, in the worst case, each core j generates $T_{RR}/T_j$ bank requests within one hyper
period, $T_{RR}$. Since each request occupies the bank for $L_M$ and one hyper period lasts
$T_{RR} \times L_B$, the condition that prevents such unbounded bank conflict delays for a
shared bank, k, can be expressed as follows:

$$\sum_{\forall \text{ core } j \text{ using bank } k} \frac{L_M}{L_B \times T_j} \le 1. \tag{2.2}$$

Figure 2.3: An example of unbounded bank conflict delay. Bank conflict delays increase
without bounds.
Thus, in the case of Figure 2.3, we should make one of the cores use another bank.
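Equation 2.2 is straightforward to evaluate for a candidate set of cores sharing one bank. The
sketch below is illustrative only: the function names are hypothetical, latencies are assumed
to be integers in a common time unit, and the last call uses a made-up sharing pattern rather
than the exact setting of Figure 2.3.

```python
from fractions import Fraction


def bank_load(sharing_periods, l_m, l_b):
    """Left-hand side of Equation 2.2 for the HRR periods of the cores sharing one bank."""
    return sum(Fraction(l_m, l_b * t) for t in sharing_periods)


def sharing_is_bounded(sharing_periods, l_m, l_b):
    """True if the bank conflict delays on the shared bank cannot grow without bound."""
    return bank_load(sharing_periods, l_m, l_b) <= 1


# Figures 2.2c and 2.2d: cores 2, 3, and 4 of HRR (2,4,24,...) share bank k, L_B = 2.
print(sharing_is_bounded([4, 24, 24], l_m=4, l_b=2))
print(sharing_is_bounded([4, 24, 24], l_m=5, l_b=2))
# Hypothetical case: two cores with periods 2 and 4 share a bank with L_M = 5, L_B = 2.
print(sharing_is_bounded([2, 4], l_m=5, l_b=2))
```

Note that satisfying Equation 2.2 only rules out unbounded growth; as Figure 2.2d shows,
bounded bank conflict delays may still occur.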
Sharing banks among cores may introduce unnecessary bank conflict delays if they are not
properly coordinated, as illustrated in Figure 2.2, and the problem is aggravated when there
are insufficient banks to allocate to tasks. One efficient way to utilize the shared cache,
and thus to minimize bank sharing, is to allocate a contiguous subset of banks to each core
and to allow any two cores to share at most one bank: the leftmost or the rightmost bank
allocated to each core². With this restriction, the memory address mapping can be simplified
and, moreover, the number of shared banks can be reduced. In addition to this core-level
partitioning, we subdivide a bank into several columns and then map each task to a set of
contiguous columns belonging to the banks allocated to the core on which the task resides,
as shown in Figure 2.4. With this task-level subpartitioning, we can eliminate interference
due to cache thrashing among tasks on the same core. Moreover, we can prioritize the tasks
even within a core when allocating their columns. Let us consider the example shown in
Figure 2.4, where bank k is shared between cores i and j. While the columns of task B stretch
from bank k to bank k+1, those of task C are in bank k+1. Since a part of bank k is mapped to
task A, task B may suffer bank conflict delays. However, bank k+1 of core j cannot be shared
with any other core under our core-level partitioning, and hence the cache accesses from
task C are free from interference caused by bank conflicts. We call such banks BCD-free
banks: a set of banks in which tasks using the columns of the banks cannot experience any
bank conflict delays³. Accordingly, a more memory-intensive, highly utilized, or
higher-criticality task can benefit from using the columns in a BCD-free bank.

² This does not mean that a bank can be shared by at most two cores. Instead, three or more
cores can share the same bank as long as they do not cause the unbounded bank conflict delay
problem.
³ Note that a core may not have any BCD-free bank if only one or two banks are allocated to
the core and all of them are shared with other cores. If a core is given three or more banks,
there always exists a BCD-free bank.

Figure 2.4: Two-level cache partitioning. Core i uses banks [k-1, k], and Task A occupies
five columns. Core j uses banks [k, k+2], Task B occupies four columns, and Task C occupies
three columns. Bank k is shared between Cores i and j. Bank k+1 is the BCD-free bank of
Core j.
As has been seen, our two-level cache partitioning method does not rely only on tasks'
cache usages. Instead, both bus configuration and task allocation also affect the decision
of how to map cores to banks and tasks to columns, and thus tasks' bank conflict delays. In
Section 3.2, we will analyze in detail how bank conflict delays are affected by different
configurations of bus, cache partition, and task allocation.
2.3 System Model
We consider a multicore system that consists of $N_C$ homogeneous cores denoted by
$C = \{C_1, C_2, \cdots, C_{N_C}\}$. This system has a unified shared cache $B$ which is
partitioned into $N_B$ banks $\{B_1, B_2, \cdots, B_{N_B}\}$, each of which is again divided
into $N_W$ columns (see Figure 2.4 for an example). Thus the system has a total of
$N_X = N_B \times N_W$ columns, which are denoted by $X = \{X_1, X_2, \cdots, X_{N_X}\}$.
Each core uses a contiguous block of banks; however, no two cores can share two or more
banks. Cache requests from the cores are delivered to the shared cache via a shared bus
arbitrated by the harmonic round-robin policy introduced in Section 2.1 (see Figure 2.1b).

On that system, we assume $N_\tau$ real-time tasks, denoted by
$\{\tau_1, \tau_2, \cdots, \tau_{N_\tau}\}$, that satisfy the workload assumptions presented
in Subsection 1.3.3. Each task $\tau_i$ is assigned to one of the cores and is periodically
executed on that core with the execution time of $e_i$ and the period of $p_i$. Also, each
$\tau_i$ uses a set of cache columns as the place to store/load its instructions and data.
Accordingly, a task $\tau_i$ can be represented by

$$\tau_i = (e_i, p_i, N_i^X, N_i^M),$$

where $N_i^X$ is the number of cache columns that $\tau_i$ would need during its execution and
$N_i^M$ is the total number of cache accesses to the columns. Here we assume that the size of
each cache request is fixed and thus there is no variance in the latencies of bus access,
$L_B$, and bank access, $L_M$. Each task is allowed to own only columns in the banks allocated
to the core on which the task is located. However, no two tasks can share a column even if
they are on the same core, as explained in Section 2.2. With these constraints, we further
assume that there is no task migration, and that the number of required cache columns and
that of cache accesses are pre-profiled by a static analysis.

Figure 2.5: Delay components of the proposed tunable WCET model. The bound $wcet_i$ consists
of the fixed execution time and the tunable delay; the latter comprises bus latency, bus
access delay, bank latency, and bank conflict delay.
2.4 Tunable WCET
Figure 2.5 illustrates the rationale behind the tunable WCET model proposed in this study.
In this model, the worst-case execution time of task $\tau_i$ is partitioned into two parts:
fixed execution time and tunable delay. The fixed execution time of a task is the maximum
time that the task could take to execute instructions over its critical path. The tunable
delay, on the other hand, is the sum of the delays incurred by all of the memory accesses
over the same path, and it varies with the configurations of bus and cache, as described in
Sections 2.1 and 2.2. Since the delay factors that affect a memory operation are the bus and
cache (inter-core interference factors F3 and F4), the tunable part is subdivided into bus
access delay and cache access delay. It should be noted here that the latencies of accesses
to both are fixed. Now let $L$ be the sum of the fixed latencies, $d_i$ be the variable
delays, and $e_i$ be the fixed execution time of task $\tau_i$. Then, without loss of
generality, the worst-case execution time of task $\tau_i$ can be defined as follows:

$$wcet_i = e_i + N_i^M \times (L + d_i), \tag{2.3}$$

where $N_i^M$ is the number of task $\tau_i$'s cache accesses, and
$L = 2 \times L_B + L_M$⁴.
One may argue that the critical path of a task, and thus its $N_i^M$, can change according
to the variable delays. While this is true in general, the analysis of tunable WCET and its
optimization would become significantly more complex if we took a variable critical path
into account. Thus, by the workload model A5 in Subsection 1.3.3, we assume that the critical
path obtained by a static analysis is long enough that the variable delays cannot change
the critical path.
Our tunable delay model enables optimization of bus scheduling and cache partitioning,
as mentioned above. Here the optimization objective can vary; one may want to shorten a
certain application's worst-case execution time, to minimize the number of required cache
banks, and so on. In this study, among various objectives, we focus on the optimization
of overall system utilization. In optimizing such a global metric, it is important to consider
how to utilize limited resources. For example, reducing the bus access delay of a task and/or
allocating private banks to a task may increase the delays of other tasks. Moreover, the
optimization factors are correlated with each other; for example, the shorter a core's HRR
period, the more private banks the core needs to occupy. Accordingly, it is important to
understand how to parameterize the optimization model. In Chapter 3, we will describe
in detail how different configurations of our harmonic round-robin arbitration and two-level
cache partitioning methods affect the tunable WCET of tasks.
2.5 Problem Description
For a given set of $N_\tau$ real-time tasks $\{\tau_1, \tau_2, \cdots, \tau_{N_\tau}\}$,
$N_C$ homogeneous cores $\{C_1, C_2, \cdots, C_{N_C}\}$, and $N_B$ banks
$\{B_1, B_2, \cdots, B_{N_B}\}$ consisting of $N_X$ columns $\{X_1, X_2, \cdots, X_{N_X}\}$,
our problem is to find the optimal task assignments, harmonic round-robin schedule, and
core-to-banks and task-to-columns mappings that minimize the overall system utilization, i.e.,

$$\text{minimize} \quad \sum_{i=1}^{N_\tau} \frac{wcet_i}{p_i}. \tag{2.4}$$

⁴ We assume the bus is full-duplex, as was assumed by [22], which means that requests from
cores to the shared cache do not conflict with the ones fetched from the cache. Therefore,
only the core-to-cache requests can be delayed. Note that the cache-to-core requests are not
shown in any figure of this thesis for simplicity of illustration.
Low system utilization is generally preferred in system development because 1) a less
utilized system can accommodate additional tasks and, conversely, 2) the same set of tasks
can be implemented with lower-speed cores, which can reduce the unit cost of production. In
Chapter 4, we will present our mixed integer linear programming (MILP) formulation for this
optimization problem.
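Equations 2.3 and 2.4 together define the quantity to be minimized, and they can be evaluated
for a candidate configuration as sketched below. The sketch is only illustrative: the `Task`
class, the function names, and all numeric values are hypothetical, and the per-task variable
delays $d_i$ are simply given as inputs here (their derivation is the subject of Chapter 3).

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Task:
    e: int    # fixed execution time e_i
    p: int    # period p_i
    n_x: int  # N_i^X, number of cache columns required
    n_m: int  # N_i^M, number of cache accesses on the critical path


def wcet(task, d, l_b, l_m):
    """Equation 2.3 with the fixed latency L = 2 * L_B + L_M."""
    return task.e + task.n_m * (2 * l_b + l_m + d)


def system_utilization(tasks, delays, l_b, l_m):
    """Equation 2.4: sum of wcet_i / p_i over all tasks."""
    return sum(Fraction(wcet(t, d, l_b, l_m), t.p) for t, d in zip(tasks, delays))


tasks = [Task(e=1000, p=10000, n_x=3, n_m=100), Task(e=2000, p=20000, n_x=2, n_m=50)]
equal = system_utilization(tasks, delays=[8, 8], l_b=2, l_m=4)
tuned = system_utilization(tasks, delays=[4, 16], l_b=2, l_m=4)
print(equal, tuned)
```

In this made-up instance, shortening the delay of the task with more accesses per unit time
lowers the objective even though the other task's delay doubles.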
Chapter 3
Tunable WCET Analysis
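Figure 3.1: Worst-case execution time of task $\tau_i$ on core j. For each request from
core j in the HRR schedule, the bus latency $L_B$ and the bank latency $L_M$ are fixed, while
the bus access delay $d_i^B$ and the bank conflict delay $d_i^M$ are variable.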
As described in Section 2.4, the worst-case execution time of a task is modeled as the sum
of fixed execution time and tunable delay. The latter, in turn, is divided into fixed
latencies and variable delays incurred during bus and bank accesses, as illustrated in
Figure 3.1¹. Now let $d_i^B$ and $d_i^M$ be the upper bounds of the bus access delay and the
bank conflict delay, respectively. Then, Equation 2.3 can be redefined as follows:

$$wcet_i = e_i + N_i^M \times \{L + (d_i^B + d_i^M)\}. \tag{3.1}$$

In this model, only $d_i^B$ and $d_i^M$ vary depending on bus schedules, cache partitions,
and task allocations. Accordingly, in the rest of this chapter, we describe in detail how such
configurations affect the worst-case execution time of a task and how they are correlated.
(a) Pure RR (4,4,4,4). Task $\tau_i$ is allocated to core 1; a request from core 1 waits up
to $4 \times L_B$.
(b) HRR (2,4,8,8). Task $\tau_i$ is allocated to core 1; a request from core 1 waits up to
$2 \times L_B$.
(c) HRR (2,4,8,8). Task $\tau_i$ is allocated to core 3; a request from core 3 waits up to
$8 \times L_B$.
Figure 3.2: Different bus access delays with different bus schedules.
3.1 Bus Access Delay $d_i^B$
The upper bound of a bus access delay is defined as the maximum length of time that a bus
request could take until it is granted. In pure round-robin, as illustrated in Figure 3.2a, a
bus access delay is upper-bounded by $N_C \times L_B$ and is independent of the core to which
the task is allocated². In harmonic round-robin scheduling, on the other hand, task
allocation becomes a delay factor since different cores may have different HRR periods. For
instance, let us suppose that the bus is scheduled by an HRR of (2, 4, 8, 8) (Figures 3.2b
and 3.2c). If task $\tau_i$ is allocated to core 1, a bus access from the task can be delayed
by at most two bus slots, in the case where it has just missed the beginning of core 1's
turn. However, if the task is not on core 1 but on core 3, $d_i^B$ becomes $8 \times L_B$.
Accordingly, the upper bound of the bus access delay of task $\tau_i$ can be expressed by the
following equation:

$$d_i^B = T_j \times L_B, \tag{3.2}$$

where $T_j$ is the HRR period of core j to which task $\tau_i$ is allocated.

¹ Note that cache-to-core requests are not shown.
² Recall that we assume that a bus request must arrive before each turn in order to be
granted.
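Equations 3.1 and 3.2 can be combined to see how the core assignment alone changes the WCET
bound. The following is a minimal sketch under stated assumptions: the function names and the
task parameters ($e_i = 1000$, $N_i^M = 100$) are hypothetical, $L_B = 2$ and $L_M = 4$ are
chosen for illustration, and the bank conflict delay $d_i^M$ is taken as 0.

```python
def bus_access_delay(hrr_periods, core, l_b):
    """Equation 3.2: d_i^B = T_j * L_B for a task on core j (cores are numbered from 1)."""
    return hrr_periods[core - 1] * l_b


def wcet(e, n_m, d_b, d_m, l_b, l_m):
    """Equation 3.1 with the fixed latency L = 2 * L_B + L_M."""
    return e + n_m * ((2 * l_b + l_m) + (d_b + d_m))


HRR = (2, 4, 8, 8)
L_B, L_M = 2, 4
d_core1 = bus_access_delay(HRR, 1, L_B)
d_core3 = bus_access_delay(HRR, 3, L_B)
print(wcet(1000, 100, d_core1, 0, L_B, L_M), wcet(1000, 100, d_core3, 0, L_B, L_M))
```

With these values the same task has a bound of 2200 on core 1 and 3400 on core 3, whereas a
pure round-robin of four cores would give $d_i^B = 4 \times L_B$ on every core.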
3.2 Bank Conflict Delay $d_i^M$
As explained in Section 2.2, a bank conflict delay occurs if a task tries to access a bank that
is already being accessed by another request (see Figure 2.2). Hence a task is highly likely
to suffer such a delay if the core on which it runs shares a bank with many other cores.
Furthermore, if those cores are close to each other in the HRR scheduling table, the chances
become even greater. However, if they are far enough apart from each other, as illustrated in
Figure 2.2c, we can reduce or even eliminate such bank conflict delays. Accordingly, task
allocation, HRR scheduling, and cache partitions should be taken into account simultaneously
in analyzing the upper bound of bank conflict delays.
Now let us suppose that task $\tau_i$ is allocated to core j. From our system model in
Section 2.3, the core uses a contiguous subset of banks. Let us denote the banks as

$$B^j = \{B_s, B_{s+1}, \cdots, B_{s+n_j^B-1}\},$$

where $n_j^B$ is the number of cache banks required by core j, which depends on the total
number of cache columns required by all the tasks on the core. Recall that, again by our
system model, only the leftmost and rightmost banks, i.e., $B_s$ and $B_{s+n_j^B-1}$, can be
shared with others. For simplicity of notation, let us denote $B_{s+n_j^B-1}$ by $B_e$.
It is now important to note that task $\tau_i$ could suffer a bank conflict delay if and only
if the columns that the task occupies are a part of the shared banks. In other words, if all
the columns of the task are in the BCD-free banks of the core, the task can never suffer any
bank conflict delays, as explained in Section 2.2. In this case, $d_i^M$ is always 0.
Otherwise, the upper bound of bank conflict delays could be non-zero, and it depends on which
banks the columns are located in and whether those banks are shared with others. Let us
consider the following cases, which are also shown in Figure 3.3:
Case 1. a part or all of the columns reside in $B_s$, but not in $B_e$,
Case 2. a part or all of the columns reside in $B_e$, but not in $B_s$, and
Case 3. none of the columns reside in either $B_s$ or $B_e$.

Figure 3.3: Task to column mappings. Core j uses banks $B_s$ to $B_e$; $B_s$ and $B_e$ are
shared and the banks between them are non-shared. Task A occupies columns of $B_s$, Task B
only columns of non-shared banks, and Task C columns of $B_e$.
The upper bound of bank conflict delays can be different in each case. For example, in
Figure 3.3, Task B is free from any bank conflict delays since all of its columns are in the
BCD-free banks of Core j (Case 3). On the other hand, Task A and Task C use some of the
columns in banks $B_s$ (Case 1) and $B_e$ (Case 2), respectively, and thus could experience
bank conflict delays. Here different bank accesses of Task A may suffer different delays since
its columns stretch from a shared bank to a non-shared bank. However, we assume that the
target address, and thus the target column index, of an access is not known when analyzing
the WCET of a task. Thus, as long as at least one column of a task is located in a shared
bank, we consider that every access of the task goes to the shared bank. Otherwise, the
analysis would become intractable since we would need to consider which access goes to which
bank and when.

Figure 3.4: Four possible scenarios (Cases 1 to 4) for the range of task $\tau_i$'s columns.
The column boundaries $idx(B_s) \times N_W$ and $(idx(B_e) - 1) \times N_W$ separate the
shared banks $B_s$ and $B_e$ from the non-shared banks $B_{s+1}, \cdots, B_{e-1}$.
Meanwhile, the columns of a task may stretch from $B_s$ to $B_e$, especially if the number of
available banks is small and/or the task requires many columns (not shown in Figure 3.3).
Case 4. the columns stretch from $B_s$ to $B_e$.
In this case, an access from the task could go to $B_s$, to $B_e$, or to one of the
non-shared banks. Thus, as mentioned above, all accesses of the task are assumed to
experience the worst-case delay, which depends on the bank conflict delays of $B_s$ and
$B_e$.
Now let us define $D^j_s$ and $D^j_e$ as the upper bounds of bank conflict delays that a task
could experience by accessing banks $B_s$ and $B_e$, respectively. It is important to note
here that $D^j_s$ and $D^j_e$ are independent of other tasks. That is, any task on core j can
suffer the same amount of $D^j_s$ ($D^j_e$) if at least one column of the task resides in
$B_s$ ($B_e$). To identify whether task $\tau_i$ could access a bank, let us first denote the
cache columns occupied by task $\tau_i$ as

$$X^i = \{X_k, X_{k+1}, \cdots, X_{k+N_i^X-1}\},$$

where $N_i^X$ is the number of columns required by task $\tau_i$. Then, Cases 1 to 4 can be
restated as the following conditional expressions (see also Figure 3.4):
Case 1. $idx(X_k) \le idx(B_s) \times N_W$ and $idx(X_{k+N_i^X-1}) \le (idx(B_e) - 1) \times N_W$,
Case 2. $idx(B_s) \times N_W + 1 \le idx(X_k)$ and $(idx(B_e) - 1) \times N_W + 1 \le idx(X_{k+N_i^X-1})$,
Case 3. $idx(B_s) \times N_W + 1 \le idx(X_k)$ and $idx(X_{k+N_i^X-1}) \le (idx(B_e) - 1) \times N_W$, and
Case 4. $idx(X_k) \le idx(B_s) \times N_W$ and $(idx(B_e) - 1) \times N_W + 1 \le idx(X_{k+N_i^X-1})$,
where $idx$ is a function that returns the index of a bank or a column. If Case 1 holds for
task $\tau_i$, the bank conflict delay that the task could suffer, i.e., $d_i^M$, is
upper-bounded by $D^j_s$. Similarly, $d_i^M$ for Case 2 is $D^j_e$. For Case 3, $d_i^M$ is 0
since all columns of the task are in non-shared banks. Lastly, $d_i^M$ for Case 4 is the
maximum of $D^j_s$ and $D^j_e$. Accordingly, the upper bound of bank conflict delays of task
$\tau_i$ on core j can be expressed by the following single equation:

$$d_i^M = \max(D^j_s, D^j_e). \tag{3.3}$$

Here, $D^j_e$, $D^j_s$, and both $D^j_s$ and $D^j_e$ are 0 for Cases 1 to 3, respectively.
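The four conditional expressions and Equation 3.3 translate directly into code. The sketch
below is illustrative only: the function names are hypothetical, banks and columns are
assumed to be numbered from 1 so that bank b holds columns $(b-1) \times N_W + 1$ to
$b \times N_W$, and the task's columns are assumed to lie within the banks of its core.

```python
def column_case(first_col, n_cols, bank_s, bank_e, n_w):
    """Return the case number (1 to 4) for a task occupying n_cols columns from first_col."""
    last_col = first_col + n_cols - 1
    touches_s = first_col <= bank_s * n_w          # some column lies in B_s
    touches_e = last_col >= (bank_e - 1) * n_w + 1  # some column lies in B_e
    if touches_s and touches_e:
        return 4
    if touches_s:
        return 1
    if touches_e:
        return 2
    return 3


def bank_conflict_delay(case, d_s, d_e):
    """Equation 3.3, with the bound of any shared bank the task does not touch set to 0."""
    if case in (2, 3):
        d_s = 0
    if case in (1, 3):
        d_e = 0
    return max(d_s, d_e)


# Hypothetical core using banks 2..4 with N_W = 4, i.e. columns 5..16.
for first, count in [(5, 5), (10, 3), (13, 3), (7, 8)]:
    case = column_case(first, count, bank_s=2, bank_e=4, n_w=4)
    print(first, count, case, bank_conflict_delay(case, d_s=6, d_e=9))
```

The values 6 and 9 for $D^j_s$ and $D^j_e$ are placeholders; Section 3.2.1 describes how these
bounds are obtained.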
As has been seen, the way the shared cache is partitioned affects only whether a task
would experience bank conflict delays and which bound applies, i.e., $D^j_s$, $D^j_e$, or 0.
In fact, the calculation processes of $D^j_s$ and $D^j_e$ are the same because each depends
only on which other cores share the bank and how they are scheduled on the shared bus. Now
let us denote by $D^j_k$ the upper bound of bank conflict delays that every task on core j
could suffer due to using bank k. As will be described in the following subsection, we can
calculate $D^j_k$ for every pair of core and bank by using the bus schedule, i.e., the HRR
periods of cores, and the core-to-bank mappings.
Figure 3.5: The worst-case bank access scenario of cores 2, 5, 6, and 8. Two consecutive
rounds of the HRR schedule (slots 1 to 24) are shown.
3.2.1 Computation of $D^j_k$
To help understand the analysis presented in this section, let us first consider the example
shown in Figure 3.5. The figure shows a situation where eight cores are scheduled by the
HRR of (4, 4, 12, 12, 12, 12, 12, 12), cores 2, 5, 6, and 8 share bank k, and each of the
cores tries to access the bus at each of its slots, creating the worst-case scenario. We
assume here that the task under analysis, say $\tau_i$, is executing on core 2, and that a
part of its columns resides in bank k. Now suppose that we want to find the upper bound of
the bank conflict delay that task $\tau_i$ can suffer.
We can first observe from the figure that different bank accesses from core 2 may
experience different bank conflict delays. This is because the cores have different HRR
periods; cores 3, 4, and 1 may affect core 2 at slot 6, and similarly cores 5 and 6 may delay
core 2 at slot 10, and so on. Furthermore, it depends on which cores share bank k with core
2. Thus, we need to calculate the delay that task $\tau_i$ could suffer at each slot of
core 2. Now let us call each such delay a slot delay and denote it by

$$d^s_{j,\varphi},$$

where $j$ and $\varphi$ are the indices of the core and the slot, respectively. For example,
in Figure 3.5, $d^s_{2,6}$ is 0, $d^s_{2,14}$ is 4, and so on. Note that if core j does not
use slot $\varphi$, $d^s_{j,\varphi}$ is 0. That is, in the same example, every
$d^s_{2,\varphi}$ is 0 except for $\varphi = 2, 6, 10, 14, 18, 22, \ldots$.
However, it is unnecessary to calculate all slot delays because of the periodicity of
harmonic round-robin scheduling. That is, the following is always valid:

$$d^s_{j,\varphi} = d^s_{j,\varphi + T_{RR}},$$

except for $\varphi = \phi_j$, where $\phi_j$ is the first slot of core j in an HRR table. It
should be noted that we set $d^s_{j,\phi_j}$ to 0 and use it as a reference in finding
$D^j_k$ since the backlog that could delay the slot at $\phi_j$ cannot be known without
evaluating all preceding ones. It is also important to note that the above equality does not
hold if the bank sharing causes the unbounded bank conflict delay problem. In this case, it
is impossible to find an upper bound of the delay.
Assuming that there are no unbounded bank conflict delays, we can find the upper bound of
the slot delays of core j using bank k by considering only the slots in the first two
consecutive rounds starting from $\phi_j$. That is,

$$D^j_k = \max(d^s_{j,\varphi}), \tag{3.4}$$

for all $\varphi = \phi_j, \phi_j + T_j, \ldots, \phi_j + 2 \times T_{RR}$. Recall that
$T_{RR}$ is a positive integer multiple of $T_j$ because all periods are harmonized with each
other.
3.2.2 Computation of $d^s_{j,\varphi}$
Now let us describe how to calculate each slot delay, $d^s_{j,\varphi}$. From Figure 3.5, we
can see that $d^s_{j,\varphi}$ is affected by how much workload, i.e., unfinished bank
accesses, has accumulated up to the point when core j tries to access the shared bank. Here
it is important to take into account that the access at slot $\varphi$ could be delayed not
only by the accesses initiated between $\varphi - T_j$ and $\varphi$, but also by the ones
preceding $\varphi - T_j$. For example, in Figure 3.5, core 2 at slot 14 would not experience
any delay if either core 5 at slot 7 or core 6 at slot 8 did not share, and thus did not
access, the same bank. To model such backlog, let us define $w_{j,\varphi}$ as the time
instant at which the most recent bank access completes. In finding $w_{j,\varphi}$, we need
to compute the longest busy period of bank accesses that could delay the slot under
consideration.

Figure 3.6: An example busy period $w_{2,17}$ (slots 12 to 20).
To help understand how to find $w_{j,\varphi}$, let us first consider the example illustrated
in Figure 3.6. Suppose that we want to compute $w_{2,17}$. The busy period begins from core 2
at slot 13 since there is no backlog of bank accesses at that slot. Thus, the initial busy
period of $w_{2,17}$ is

$$w^0_{2,17} = 13 + L_B + L_M = 16.$$

Since the bank access of core 2 at slot 13 completes at 16, that of core 5, which tries to
access the bank at 15 (= 14 + $L_B$), is delayed by $w^0_{2,17} - (14 + L_B)$, which results
in

$$w^1_{2,17} = 14 + L_B + (w^0_{2,17} - (14 + L_B)) + L_M = w^0_{2,17} + L_M = 18.$$

Likewise, the access of core 6 at the next slot is delayed by the accumulated delays; thus
the busy period becomes

$$w^2_{2,17} = 15 + L_B + (w^1_{2,17} - (15 + L_B)) + L_M = w^1_{2,17} + L_M = 20.$$

Figure 3.7: Calculation of $w^i_{j,\varphi}$ and $d^s_{j,\varphi}$.

The busy period stops growing because core 1 at slot 16 does not access the shared bank;
thus the final value of $w_{2,17}$ is 20. From this example, we can see that the busy
period grows by $L_M$ as long as a new access has to be delayed by backlog. However, a busy
period may be discontinued, namely when it does not reach the time instant of a new bank
access. In this case, a new busy period starts from the new access. Thus, for every new
access to a shared bank at slot $\psi$, the busy period, $w^i_{j,\varphi}$, grows iteratively
from $w^0_{j,\varphi}$ as follows:

$$w^{i+1}_{j,\varphi} = \begin{cases} w^i_{j,\varphi} + L_M & \text{if } \psi + L_B \le w^i_{j,\varphi}, \\ \psi + L_B + L_M & \text{otherwise.} \end{cases} \tag{3.5}$$
34
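Algorithm 1: Calc $w_{j,\varphi}$ $(j, \varphi, k, \mathrm{HRR})$
    $\mathrm{HRR}[\psi]$: index of the core using slot $\psi$ in the given HRR schedule.
    $B_k[C]$: true if core $C$ uses bank $k$, and false otherwise.
    $\psi_j$: the index of the first slot of core $j$ in the given HRR schedule.

    if $\varphi = \psi_j$ then
        $w^0_{j,\varphi} := 0$
        $\psi := 1$
    else
        $w^0_{j,\varphi} := \varphi - T_j + L_B + d^s_{j,\varphi - T_j} + L_M$
        $\psi := \varphi - T_j + 1$    (the access at slot $\varphi - T_j$ is already counted in $w^0_{j,\varphi}$)
    end if

    $i := 0$
    while $\psi < \varphi$ do
        if $B_k[\mathrm{HRR}[\psi]]$ = true then
            if $\psi + L_B \le w^i_{j,\varphi}$ then
                $w^{i+1}_{j,\varphi} \leftarrow w^i_{j,\varphi} + L_M$
            else
                $w^{i+1}_{j,\varphi} \leftarrow \psi + L_B + L_M$
            end if
        else
            $w^{i+1}_{j,\varphi} \leftarrow w^i_{j,\varphi}$
        end if
        $i \leftarrow i + 1$
        $\psi \leftarrow \psi + 1$
    end while
    return $w^i_{j,\varphi}$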
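
The busy-period iteration can be sketched in Python as follows. This is a minimal illustration, not the thesis's implementation: the function name `busy_period`, the dictionary-based schedule, and the latencies $L_B = 1$ and $L_M = 2$ (inferred from the arithmetic of the Figure 3.6 example) are assumptions made for this sketch.

```python
def busy_period(w0, first_slot, phi, hrr, uses_bank, l_b, l_m):
    """Grow the busy period over slots first_slot .. phi-1 (Equation 3.5).

    w0         -- initial busy period (completion time of the earlier access)
    hrr        -- dict mapping a slot index to the core that owns it
    uses_bank  -- set of cores that use the bank under consideration
    """
    w = w0
    for slot in range(first_slot, phi):
        if hrr[slot] not in uses_bank:
            continue                  # no access to this bank: w is unchanged
        if slot + l_b <= w:
            w += l_m                  # the access is delayed by the backlog
        else:
            w = slot + l_b + l_m      # backlog drained: a new busy period starts
    return w


# Slots 13..16 of the Figure 3.6 example; cores 2, 5 and 6 share the bank.
hrr = {13: 2, 14: 5, 15: 6, 16: 1}
L_B, L_M = 1, 2                       # assumed latencies, in slot units
w0 = 13 + L_B + L_M                   # access of core 2 at slot 13
w_2_17 = busy_period(w0, 14, 17, hrr, {2, 5, 6}, L_B, L_M)
print(w_2_17)                         # 20
```

The loop starts at slot 14 because the access at slot 13 is already accounted for in `w0`.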
In the above example, the busy period begins from a slot at which there is no backlog. However, that slot may itself have been delayed by previous bank accesses. To find the exact value of $w_{j,\varphi}$, one may begin from the very first slot of the scheduling table. This is inefficient, however, in that the same procedure is unnecessarily repeated. Instead, we can take $d^s_{j,\varphi - T_j}$ into account in computing $w_{j,\varphi}$, as illustrated in Figure 3.7. That is, it is sufficient to consider only the slots in $[\varphi - T_j, \varphi]$ of core $j$, because the initial value of the busy period can be derived by the following equality:

$$w^0_{j,\varphi} = w_{j,\varphi - T_j} + L_M = \varphi - T_j + L_B + d^s_{j,\varphi - T_j} + L_M. \tag{3.6}$$

Recall that we calculate all slot delays of core $j$, and hence $d^s_{j,\varphi - T_j}$ is available when computing $d^s_{j,\varphi}$. Accordingly, we can find the longest busy period that could delay slot $\varphi$ of core $j$, i.e., $w_{j,\varphi}$, by Algorithm 1.

Finally, a slot delay $d^s_{j,\varphi}$ is non-zero if the busy period, $w_{j,\varphi}$, delays the bank access of slot $\varphi$, and zero otherwise, as illustrated in Figure 3.7. Since the bank access of slot $\varphi$ is issued at $\varphi + L_B$, for a given slot $\varphi$ of core $j$ that shares bank $k$ with others, $d^s_{j,\varphi}$ can be computed as follows:

$$d^s_{j,\varphi} = \max\bigl(w_{j,\varphi} - (\varphi + L_B),\, 0\bigr). \tag{3.7}$$

Note that $d^s_{j,\varphi}$ is lower-bounded by 0.
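
As a check, Equations (3.6) and (3.7) can be applied to the example of Figure 3.6. The values $L_B = 1$ and $L_M = 2$ are inferred from the arithmetic of that example, $d^s_{2,13} = 0$ because there is no backlog at slot 13, and slot 21 is assumed here to be the next slot of core 2 (that is, $T_2 = 4$).

```latex
\begin{align*}
w^0_{2,17} &= (17 - T_2) + L_B + d^s_{2,13} + L_M = 13 + 1 + 0 + 2 = 16,\\
w_{2,17}   &= 20 \quad \text{(two further delayed accesses, each adding } L_M\text{)},\\
d^s_{2,17} &= \max\bigl(w_{2,17} - (17 + L_B),\, 0\bigr) = \max(20 - 18,\, 0) = 2,\\
w^0_{2,21} &= 17 + L_B + d^s_{2,17} + L_M = 22 = w_{2,17} + L_M .
\end{align*}
```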
Chapter 4
MILP Formulation
In this chapter, we present a mixed integer linear programming (MILP) formulation for our tunable WCET optimization problem described in Chapter 2. Our optimization model takes as input a set of real-time tasks, $\Gamma$, which follows the workload model assumptions A1–A6 in Subsection 1.3.3, and a set of system parameters such as the number of cores, $N^C$, and the shared cache model, $N^B$ and $N^X$. For the given input, it finds the optimal task assignments, harmonic round-robin schedule, and mappings of core-to-banks and task-to-columns that minimize the overall system utilization. In the rest of this chapter, we describe how to formulate our proposed harmonic round-robin arbitration and two-level cache partitioning schemes (Chapter 2), and our tunable WCET model (Chapter 3), as a set of MILP constraints.
4.1 Parameters and Variables
4.1.1 System parameters
Table 4.1 shows the list of system parameters given as input.
4.1.2 Decision variables
The following are zero-one variables indicating to which task or core a shared resource is mapped. These and other decision variables not shown here will be explained in detail in the rest of this chapter when needed.
Table 4.1: List of system parameters.

Parameter      Description
$N^\Gamma$     number of tasks
$N^C$          number of cores
$N^B$          number of banks of the cache
$N^W$          number of columns in a bank
$N^X$          number of columns of the cache
$N^X_i$        number of columns required for task $i$
$N^M_i$        number of cache accesses of task $i$
$T^{max}$      upper bound of HRR periods
$T^{min}$      lower bound of HRR periods
$L_B$          bus access latency
$L_M$          bank access latency
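Task to Core Mapping

$$\alpha_{i,j} = \begin{cases} 1 & \text{if task } i \text{ is allocated to core } j, \\ 0 & \text{otherwise.} \end{cases}$$

Core to Cache Bank Mapping

$$\beta_{j,k} = \begin{cases} 1 & \text{if core } j \text{ uses bank } k, \\ 0 & \text{otherwise.} \end{cases}$$

Core to HRR Slot Mapping

$$\sigma_{j,s} = \begin{cases} 1 & \text{if core } j \text{ uses slot } s \text{ of a harmonic round-robin table}, \\ 0 & \text{otherwise.} \end{cases}$$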
4.1.3 Range variables
The following integer variables are used for representing ranges of banks and cache columns assigned to each core and each task, respectively (see Figure 4.4).

Bank:
    $b^j_s$: the index of the leftmost bank allocated to core $j$,
    $b^j_e$: the index of the rightmost bank allocated to core $j$.

Column:
    $x^i_s$: the index of the leftmost column allocated to task $i$,
    $x^i_e$: the index of the rightmost column allocated to task $i$.
4.2 Objective Function
As presented in Section 2.5, the optimization objective we consider in this study is to minimize the overall system utilization, that is,

$$\text{Minimize} \quad \sum_{i=1}^{N^\Gamma} \frac{wcet_i}{p_i}.$$

Note that our optimization model is not restricted to a specific objective. We can, for example, minimize the WCET of a specific task or the number of required banks with the function of $wcet_i$ or $\sum_{j=1}^{N^C} (b^j_e - b^j_s + 1)$, respectively.
4.3 Constraints
4.3.1 Harmonic round-robin
Recall that a bus scheduling is a harmonic round-robin if and only if the HRR periods of the table satisfy Condition (2.1), which can be redefined as the following three conditions:

1. $T^{min} \le T_1 \le T_2 \le \cdots \le T_{N^C} \le T^{max}$,
2. $\forall j \le N^C - 1$: $\frac{T_{j+1}}{T_j} = m_j \in \mathbb{N}$, and
3. $\sum_{j=1}^{N^C} \frac{1}{T_j} = 1$,

where $T_j$ is a positive integer variable representing the HRR period of core $j$. Note that the above conditions are non-linear. However, these can be converted into separable functions with the help of piecewise linear approximation [23], and this does not compromise the accuracy of our optimization model since the constraints consist of only bounded integer values.
By the first condition above, each $T_j$ has to be an integer in the range $[T^{min}, T^{max}]$. This is a selection of a discrete point in the range, thus $T_j$ can be expressed by the following piecewise linear function:

$$T_j = T^{min} \cdot \lambda_{j,1} + (T^{min} + 1) \cdot \lambda_{j,2} + \cdots + T^{max} \cdot \lambda_{j,(T^{max} - T^{min} + 1)}, \tag{4.1}$$

where $\lambda_{j,p}$ is an indicator variable that is 1 if $T_j$ has the value $p + T^{min} - 1$ and 0 otherwise. Equation (4.1) therefore can be formulated as the following constraint:

Constraint 1. For each core $j$,

$$T_j = \sum_{p = T^{min}}^{T^{max}} p \cdot \lambda_{j,(p - T^{min} + 1)}. \tag{4.2}$$

Here only one term of the above equation has to end up having a positive value. Thus the sum of all $\lambda_{j,p}$ should be equal to 1, that is,

Constraint 2. For each core $j$,

$$\sum_{p=1}^{T^{max} - T^{min} + 1} \lambda_{j,p} = 1. \tag{4.3}$$
The second condition requires that each HRR period, $T_{j+1}$, is a positive integer multiple of $T_j$ in order for the periods to be a harmonic set. Note that, by the first condition, the value of $m_j$ cannot be greater than $\frac{T^{max}}{T^{min}}$. Thus, each integer multiple $m_j$ is chosen within the range of 1 to $\frac{T^{max}}{T^{min}}$, and this can be expressed similarly to the above approximation method as follows:

$$m_j = 1 \cdot \lambda'_{j,1} + 2 \cdot \lambda'_{j,2} + \cdots + m^{max} \cdot \lambda'_{j,m^{max}}, \tag{4.4}$$

where $m^{max}$ is a constant whose value is $\frac{T^{max}}{T^{min}}$, and $\lambda'_{j,q}$ is an indicator variable that is 1 if $m_j$ has the value of $q$ and 0 otherwise, and the sum of all $\lambda'_{j,q}$ is equal to 1. This equation therefore can be formulated as the following two constraints:

Constraint 3. For each core $j$,

$$m_j = \sum_{q=1}^{m^{max}} q \cdot \lambda'_{j,q}. \tag{4.5}$$

Constraint 4. For each core $j$,

$$\sum_{q=1}^{m^{max}} \lambda'_{j,q} = 1. \tag{4.6}$$
It should be noted that the above constraints are not directly related to HRR periods; they just determine the ratio of every two adjacent HRR periods. Thus, we now need an additional constraint that relates $m_j$ to $\frac{T_{j+1}}{T_j}$. First of all, from Equation (4.2) we know that if $\lambda_{j,p}$ is 1, the value of $T_j$ is $p + T^{min} - 1$. Also, by Equation (4.5), $m_j = q$ if $\lambda'_{j,q}$ is 1. Therefore, if both $\lambda_{j,p}$ and $\lambda'_{j,q}$ are 1, the value of $T_{j+1}$ has to be equal to

$$T_{j+1} = T_j \cdot m_j = (p + T^{min} - 1) \cdot q,$$

which can be represented as

Constraint 5. For each core $j$, $1 \le p \le T^{max} - T^{min} + 1$, and $1 \le q \le m^{max}$,

$$\text{if } \lambda_{j,p} = 1 \text{ and } \lambda'_{j,q} = 1 \implies \lambda_{j+1,\,(p + T^{min} - 1) \cdot q - T^{min} + 1} = 1. \tag{4.7}$$

Depending on both values of $T_j$ and $m_j$, $T_{j+1}$ may exceed the upper limit of HRR periods, i.e., $T^{max}$. Hence we need another constraint:

Constraint 6. For each core $j$, $1 \le p \le T^{max} - T^{min} + 1$, and $1 \le q \le m^{max}$,

$$\text{if } (p + T^{min} - 1) \cdot q \ge T^{max} + 1 \implies \lambda'_{j,q} = 0. \tag{4.8}$$

For the formulation of conditional constraints, please refer to [23].
The third condition enforces the HRR periods to be a complete round-robin as explained in Section 2.1. In the condition, the value of each term of the summation is not in a linear relation with each HRR period. However, in this case also we can transform it into a separable form by the aforementioned piecewise linear approximation, as each $T_j$ has a discrete integer value. From Equation (4.1) we already know that each $T_j$ is bounded between $T^{min}$ and $T^{max}$, and the indicator variable $\lambda_{j,p}$ has the value of 1 if $T_j$ is $T^{min} + p - 1$. Now let $O_j$ be the reciprocal of $T_j$, i.e., $\frac{1}{T^{min} + p - 1}$. Then, by substituting $O_j$ for $T_j$, and $\frac{1}{T^{min} + x}$ for $T^{min} + x$ in Equation (4.1), $O_j$ can be expressed as follows:

Constraint 7. For each core $j$,

$$O_j = \Bigl(\frac{1}{T^{min}}\Bigr) \cdot \lambda_{j,1} + \Bigl(\frac{1}{T^{min} + 1}\Bigr) \cdot \lambda_{j,2} + \cdots + \Bigl(\frac{1}{T^{max}}\Bigr) \cdot \lambda_{j,(T^{max} - T^{min} + 1)}. \tag{4.9}$$

We can therefore simply substitute $O_j$ for $\frac{1}{T_j}$ in the original condition, which results in

Constraint 8.

$$\sum_{j=1}^{N^C} O_j = 1. \tag{4.10}$$

Figure 4.1: The assignments of $\sigma_{j,s}$ for HRR of (2, 4, 8, 8). The HRR schedule over slots $s = 1, \ldots, 8$ is 1, 2, 1, 3, 1, 2, 1, 4.

In summary, once $\lambda_{j,p}$ is chosen, $T_j$ and its corresponding $O_j$ are determined by Equations (4.2) and (4.9).
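
To make the three HRR conditions concrete, the following Python sketch enumerates by brute force every period tuple that satisfies them. It is an exhaustive check, not the MILP encoding above; the function names and the example values ($N^C = 4$, $T^{min} = 2$, $T^{max} = 8$) are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import combinations_with_replacement


def is_harmonic_round_robin(periods, t_min, t_max):
    """Check the three HRR conditions for a non-decreasing period tuple."""
    in_range = all(t_min <= t <= t_max for t in periods)
    harmonic = all(b % a == 0 for a, b in zip(periods, periods[1:]))
    complete = sum(Fraction(1, t) for t in periods) == 1   # exact arithmetic
    return in_range and harmonic and complete


def enumerate_hrr_periods(n_cores, t_min, t_max):
    """List all tuples T_1 <= ... <= T_NC that form a harmonic round-robin."""
    candidates = combinations_with_replacement(range(t_min, t_max + 1), n_cores)
    return [p for p in candidates if is_harmonic_round_robin(p, t_min, t_max)]


print(enumerate_hrr_periods(4, 2, 8))
# [(2, 4, 8, 8), (2, 6, 6, 6), (3, 3, 6, 6), (4, 4, 4, 4)]
```

The pure round-robin schedule (4, 4, 4, 4) appears as one special case among the harmonic ones.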
4.3.2 Bus scheduling table
As explained in Section 2.1, once a set of HRR periods $\{T_1, T_2, \ldots, T_{N^C}\}$ is given, the corresponding harmonic round-robin schedule can be uniquely determined by the following constraints:

1. a slot can be assigned to only one core,
2. the first slot of core $j$ can appear only after at least one slot is assigned to each of cores $1, \ldots, j - 1$, and
3. slot $s$ is assigned to core $j$ if slot $s - T_j$ is assigned to core $j$.

Recall that $\sigma_{j,s}$ is a decision variable indicating whether core $j$ occupies the $s$-th slot of the scheduling table. Thus, building a harmonic round-robin scheduling table is equivalent to assigning a set of $\sigma_{j,s}$ to each core $j$. An example of $\sigma_{j,s}$ assignment is shown in Figure 4.1.
First of all, the first constraint above can be formulated straightforwardly as follows:

Constraint 9. For each slot $s$,

$$\sum_{j=1}^{N^C} \sigma_{j,s} = 1. \tag{4.11}$$

The second constraint requires that the first slot of core $j$ cannot appear before the $j$-th slot of the table.

Constraint 10. For each core $j$ and each slot $s$,

$$\text{if } s \le j - 1 \implies \sigma_{j,s} = 0. \tag{4.12}$$

By this constraint the sequence is uniquely determined for a given set of HRR periods. In other words, without this constraint, $\langle 1, 3, 1, 2, 1, 4, 1, 2 \rangle$ is a possible schedule table for HRR of (2, 4, 8, 8) instead of the one shown in Figure 4.1.

The periodicity of the slots of core $j$ can be ensured by checking the sum of every $T_j$ consecutive $\sigma_{j,s}$ of the core. In the example of Figure 4.1, the sum of four consecutive $\sigma_{2,s}$ should be equal to 1 as $T_2$ is 4. Accordingly, we need the following constraint:

Constraint 11. For each core $j$, $T^{min} \le p \le T^{max}$, and $1 \le s \le T^{max} - p + 1$,

$$\sum_{t=s}^{s+p-1} \sigma_{j,t} = 1. \tag{4.13}$$
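
The three scheduling rules can also be followed constructively. The sketch below builds one round of the table greedily: each core, in index order, takes the earliest free slot and then every $T_j$-th slot after it. This greedy procedure and the name `build_hrr_table` are assumptions made for illustration; the MILP itself encodes the rules through the $\sigma_{j,s}$ variables rather than by construction.

```python
def build_hrr_table(periods):
    """Build one round of an HRR table for non-decreasing harmonic periods.

    Returns a list whose entry s-1 is the (1-based) core that owns slot s.
    """
    length = max(periods)                 # for harmonic periods, one full round
    table = [None] * length
    for core, period in enumerate(periods, start=1):
        if None not in table:
            raise ValueError("no free slot left for core %d" % core)
        first = table.index(None)         # earliest slot not yet assigned
        for slot in range(first, length, period):
            if table[slot] is not None:
                raise ValueError("periods do not form a harmonic round-robin")
            table[slot] = core
    if None in table:
        raise ValueError("periods leave some slots unassigned")
    return table


print(build_hrr_table((2, 4, 8, 8)))      # [1, 2, 1, 3, 1, 2, 1, 4]
```

For the periods (2, 4, 8, 8) this reproduces the schedule of Figure 4.1.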
4.3.3 Task to core mapping
Every task should be allocated to one of the $N^C$ cores, thus

Constraint 12. For each task $i$,

$$\sum_{j=1}^{N^C} \alpha_{i,j} = 1. \tag{4.14}$$

Figure 4.2: The number of required banks for different numbers of required columns: (a) 1 bank, (b) 1 or 2 banks, (c) $n + 1$ banks, (d) $n + 1$ or $n + 2$ banks.
4.3.4 Core to bank mapping
The minimum number of cache banks required by core $j$, i.e., $n^B_j$, depends on the sum of the columns required by all tasks in the core, which is expressed as

$$n^B_j = \sum_{i=1}^{N^\Gamma} \alpha_{i,j} \cdot \frac{N^X_i}{N^W}, \tag{4.15}$$

where $N^X_i$ and $N^W$ are the number of cache columns that task $i$ requires and that a bank has, respectively. Note that the number of required columns may not be an integer multiple of $N^W$, so $n^B_j$ may be fractional; for example, at least two banks should be allocated to a core that requires 1.2 banks. However, this does not necessarily imply that two banks are sufficient for that core. It may require three banks. Let us consider the following cases, which are illustrated in Figure 4.2:
(a) $n^B_j = \frac{1}{N^W}$: The core requires only one column. In this case, one bank is sufficient.
(b) $n^B_j = \frac{2}{N^W}$: The core requires two columns. These can belong to one bank; however, the columns may also stretch across two banks.
(c) $n^B_j = n + \frac{1}{N^W}$: The core requires $n \cdot N^W + 1$ columns. In this case, $n + 1$ is the necessary and sufficient number of banks. It never happens that the columns stretch across $n + 2$ banks.
(d) $n^B_j = n + \frac{2}{N^W}$: The core requires $n \cdot N^W + 2$ columns. These can fit in $n + 1$ banks; however, the columns may stretch across $n + 2$ banks.
Recall that the range of banks is expressed by $b^j_s$ and $b^j_e$, the indices of the first and last banks allocated to core $j$, and thus the number of banks that core $j$ will use is $b^j_e - b^j_s + 1$. We can therefore formulate the following constraint:

Constraint 13. For each core $j$,

$$n^B_j \le (b^j_e - b^j_s + 1) \le n^B_j + 2 - \frac{2}{N^W}, \tag{4.16}$$

where $n^B_j$ is defined by Equation (4.15). If $n^B_j$ is 0, however, no banks should be allocated to the core. Thus we need another constraint as follows:

Constraint 14. For each core $j$,

$$\text{if } n^B_j = 0 \implies b^j_e = b^j_s = N^B + 1. \tag{4.17}$$
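
The bounds of Constraint 13 can be checked against cases (a) to (d) with a few lines of Python. The helper `bank_count_bounds` is a hypothetical name introduced for this sketch; it evaluates the two sides of (4.16) in integer arithmetic, rounding the lower bound up and the upper bound down because the number of banks is an integer.

```python
def bank_count_bounds(columns, n_w):
    """Return (min, max) number of banks spanned by `columns` contiguous columns.

    With n_B = columns / n_w, Constraint 13 gives
    ceil(n_B) <= banks <= floor(n_B + 2 - 2 / n_w).
    """
    if columns <= 0:
        return (0, 0)          # no columns, no banks (handled by Constraint 14)
    lower = -(-columns // n_w)                   # ceil(columns / n_w)
    upper = (columns + 2 * n_w - 2) // n_w       # floor(n_B + 2 - 2 / n_w)
    return (lower, upper)


N_W = 16                                          # columns per bank (example value)
for cols in (1, 2, N_W + 1, N_W + 2):
    print(cols, bank_count_bounds(cols, N_W))
# 1 (1, 1)
# 2 (1, 2)
# 17 (2, 2)
# 18 (2, 3)
```

The four printed cases correspond to (a), (b), (c) with $n = 1$, and (d) with $n = 1$.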
For the purpose of bank conflict delay calculation, we should relate $b^j_s$ and $b^j_e$ with $\beta_{j,k}$, which indicates whether core $j$ uses bank $k$ or not.

Constraint 15. For each core $j$ and each bank $k$,

$$\text{if } b^j_s \le k \le b^j_e \implies \beta_{j,k} = 1. \tag{4.18}$$

Figure 4.3: Bank sharing constraint ($b^{j'}_e \le b^j_s$ or $b^j_e \le b^{j'}_s$).
In our proposed system model we restrict any two cores to share at most one bank. In addition to this, if two cores share a bank, it should be either the first or the last bank of each core (see Figure 4.3). This constraint can be expressed as follows:

Constraint 16. For each core $1 \le j \le N^C - 1$, and each $j + 1 \le j' \le N^C$,

$$b^{j'}_e \le b^j_s \quad \text{or} \quad b^j_e \le b^{j'}_s. \tag{4.19}$$

For the formulation of logical constraints, please refer to [23].
Finally, as described in Section 2.2, an unbounded bank conflict delay may occur when a bank is shared among too many cores, or among a few cores whose HRR periods are short. The condition that prevents such an unbounded delay is defined by Condition (2.2), and can be restated as follows:

For each bank $k$,

$$\sum_{j=1}^{N^C} \beta_{j,k} \cdot \frac{L_M}{T_j \cdot L_B} \le 1.$$

Recall that we can substitute $O_j$ for $\frac{1}{T_j}$. Thus the above condition results in the following constraint:

Constraint 17. For each bank $k$,

$$\frac{L_M}{L_B} \sum_{j=1}^{N^C} \beta_{j,k} \cdot O_j \le 1. \tag{4.20}$$

The product of two variables, i.e., $\beta_{j,k} \cdot O_j$, is non-linear. Furthermore, it is not directly separable and thus has to be transformed. Please refer to [23] for the details.
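
For a fixed assignment, the boundedness condition is a simple sum that can be evaluated directly. The sketch below does so with exact rational arithmetic; the function name and the example period sets are assumptions for illustration, and the latencies are taken as integers in a common time unit.

```python
from fractions import Fraction


def bank_delay_is_bounded(periods_of_sharers, l_b, l_m):
    """Check sum over the sharing cores of L_M / (T_j * L_B) <= 1 for one bank.

    periods_of_sharers -- HRR periods T_j of the cores that use the bank
    l_b, l_m           -- integer bus and bank latencies in the same unit
    """
    load = sum(Fraction(l_m, t * l_b) for t in periods_of_sharers)
    return load <= 1


L_B, L_M = 1, 2                                    # L_M = 2 * L_B
print(bank_delay_is_bounded([4, 8, 8], L_B, L_M))  # True  (2/4 + 2/8 + 2/8 = 1)
print(bank_delay_is_bounded([2, 4], L_B, L_M))     # False (2/2 + 2/4 > 1)
```

With $L_M = 2 \cdot L_B$, a core whose period is 2 already saturates a bank on its own, so it cannot share that bank with any other core.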
4.3.5 Task to column mapping
Recall that the range of columns occupied by task $i$ is bounded by $x^i_s$ and $x^i_e$, and the required number of columns, $N^X_i$, is given as an input. The mapping of a task to columns can be simply expressed by the following constraint:

Constraint 18. For each task $i$,

$$x^i_e - x^i_s + 1 = N^X_i. \tag{4.21}$$

Furthermore, no column is shared between tasks:

Constraint 19. For each task $1 \le i \le N^\Gamma - 1$, and each $i + 1 \le i' \le N^\Gamma$,

$$x^{i'}_e \le x^i_s - 1 \quad \text{or} \quad x^i_e + 1 \le x^{i'}_s. \tag{4.22}$$
Now an important constraint is that task $i$ residing in core $j$ can only occupy a contiguous subset of columns belonging to the banks allocated to the core. As can be seen in Figure 4.4, the indices of the first column of bank $b^j_s$ and the last column of bank $b^j_e$ are $(b^j_s - 1) \cdot N^W + 1$ and $b^j_e \cdot N^W$, respectively. The following constraint requires that $[x^i_s, x^i_e]$ should be in the range of $[(b^j_s - 1) \cdot N^W + 1,\; b^j_e \cdot N^W]$:

Figure 4.4: The range constraint of mapping from a task to columns.

Constraint 20. For each task $i$ and each core $j$,

$$\text{if } \alpha_{i,j} = 1 \implies (b^j_s - 1) \cdot N^W + 1 \le x^i_s \;\text{ and }\; x^i_e \le b^j_e \cdot N^W, \tag{4.23}$$

where $N^W$ is the number of columns in a bank. This constraint can be transformed into the following two inequalities:

$$(b^j_s - 1) \cdot N^W + 1 - x^i_s \le N^X (1 - \alpha_{i,j}), \tag{4.24}$$

$$b^j_e \cdot N^W - x^i_e \ge (N^W - N^X)(1 - \alpha_{i,j}). \tag{4.25}$$
4.3.6 WCET calculation
By Equation (3.1), the worst-case execution time of task $i$ is formulated as follows:

Constraint 21. For each task $i$,

$$wcet_i = e_i + N^M_i \cdot \{(2 \cdot L_B + L_M) + (d^B_i + d^M_i)\}, \tag{4.26}$$

where $d^B_i$ and $d^M_i$ are the upper bounds of the bus access delay and the bank conflict delay, respectively.

Figure 4.5: An illustration of $w_{s',s,k}$ for slot $s = 14$ with $L_M = 2 \cdot L_B$. The values shown are $w_{1,14,k} = -L_B$, $w_{2,14,k} = 0$, $w_{6,14,k} = 2 \cdot L_B$, $w_{7,14,k} = L_B$, $w_{8,14,k} = 0$, $w_{12,14,k} = 0$, and $w_{13,14,k} = -L_B$.
Bus Access Delay $d^B_i$

As explained in Section 3.1, the upper bound of a bus access delay depends on the core to which the task is allocated. Since the decision variable $\alpha_{i,j}$ is 1 if task $i$ is allocated to core $j$, the following constraint can be used to represent Equation (3.2):

Constraint 22. For each task $i$,

$$d^B_i = L_B \Bigl(\sum_{j=1}^{N^C} \alpha_{i,j} \cdot T_j\Bigr). \tag{4.27}$$
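
The effect of the HRR period on a task's WCET can be illustrated numerically. The sketch below evaluates Equations (4.26) and (4.27) for a single task; the function name is hypothetical, the bank conflict delay is set to zero for simplicity, and the numeric values are example choices within the ranges used later in the evaluation.

```python
def wcet(e_i, n_m, l_b, l_m, t_j, d_m=0.0):
    """WCET of a task per Equation (4.26), with d_B = L_B * T_j (Equation 4.27).

    All times are in nanoseconds; d_m is the bank conflict delay per access.
    """
    d_b = l_b * t_j                       # bus access delay on a core with period T_j
    return e_i + n_m * ((2 * l_b + l_m) + (d_b + d_m))


E_I = 10_000_000      # fixed execution time: 10 ms
N_M = 1_000_000       # number of cache accesses
L_B, L_M = 2.5, 5.0   # latencies in ns, with L_M = 2 * L_B

print(wcet(E_I, N_M, L_B, L_M, t_j=4) / 1e6)   # 30.0 (ms)
print(wcet(E_I, N_M, L_B, L_M, t_j=8) / 1e6)   # 40.0 (ms)
```

Doubling the HRR period of the hosting core adds 10 ms of tunable delay here, while the fixed execution time stays at 10 ms.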
Bank Conflict Delay $d^M_i$

The most crucial part in finding the upper-bound bank conflict delay is the calculation of slot delays. If it is given which core uses which banks, how the banks are shared with others, and how the cores are scheduled by a harmonic round-robin schedule, it is straightforward to compute slot delays with this information, as described in Section 3.2. However, those are variables during optimization, thus the steps presented here are slightly different from the analysis in Section 3.2.

Let us first define $w_{s',s,k}$ as the residual workload generated in the range of $s'$ to $s - 1$ that could delay a bank access of slot $s$ to bank $k$, which can be represented by the following expression:

$$w_{s',s,k} = L_M \sum_{t=s'}^{s-1} \sum_{j=1}^{N^C} \sigma_{j,t} \cdot \beta_{j,k} - L_B (s - s'). \tag{4.28}$$
Recall that $\sigma_{j,t}$ is the decision variable introduced in Subsection 4.3.2 representing whether core $j$ is scheduled at slot $t$ or not. Thus, $\sigma_{j,t} \cdot \beta_{j,k}$ is 1 if and only if core $j$ at slot $t$ uses bank $k$. The rationale behind this expression is that every slot accessing bank $k$ generates $L_M$ of workload, and $L_B$ is consumed by each slot afterward. It is important to note here that not all $w_{s',s,k}$ are exact, because either 1) a busy period may be broken at some slots during $[s', s - 1]$, or 2) a busy period that counts from slot $s'$ may ignore some preceding workload generated at slots $s'' < s'$. However, the exact longest busy period is guaranteed to be taken into account by the above equation if it exists. To help understand the underlying idea of $w_{s',s,k}$, let us consider Figure 4.5 as an example. Suppose here that cores 2, 5, 6, and 8 are sharing bank $k$ and that we want to find $w_{s',14,k}$ for all $1 \le s' \le 13$. As mentioned above, some of the values are incorrect. For example, $w_{2,14,k}$ does not capture that the busy period starting from slot 2 discontinues at slot 5. Also, $w_{7,14,k}$ is calculated assuming there is no backlog at slot 7. The values of $w_{s',14,k}$ for both cases are therefore meaningless. On the other hand, $w_{6,14,k}$ is the exact residual workload that could maximally delay slot 14 of core 2, because it counts from a slot with no backlog and there is no discontinuity in its busy period. In fact, such a busy period is uniquely determined if it exists.
As explained in Subsection 3.2.1, different slots of a core may experience different bank conflict delays, thus we need to calculate $w_{s',s,k}$ for each $s$. Because the first slot index $\psi_j$ of each core is variable with HRR schedules, $w_{s',s,k}$ should be computed from $s' = 1$. However, we do not need to compute the slot delays in the first round, since the delay of each slot in the second round is always greater than or equal to that of the same slot in the first round, as mentioned above. Also, $w_{s',s,k}$ may vary depending on which bank the core at slot $s$ shares with others. Accordingly, each $w_{s',s,k}$ can be computed by the following constraint:

Constraint 22. For each bank $k$, $T^{RR} + 1 \le s \le 2 \cdot T^{RR}$, and $1 \le s' \le s - 1$,

$$w_{s',s,k} = L_M \sum_{t=s'}^{s-1} \sum_{j=1}^{N^C} \sigma_{j,t} \cdot \beta_{j,k} - L_B (s - s'). \tag{4.29}$$
Now let us define by $u_{s,k}$ the worst-case delay that slot $s$ could experience when accessing bank $k$. Since it is the maximum of $w_{s',s,k}$ for all $1 \le s' \le s - 1$, we can represent it by the following constraint:

Constraint 23. For each bank $k$, $T^{RR} + 1 \le s \le 2 \cdot T^{RR}$, and $1 \le s' \le s - 1$,

$$u_{s,k} \ge w_{s',s,k}. \tag{4.30}$$

Note that $u_{s,k}$ should be lower-bounded by 0 because the core at slot $s$ may suffer no delay, especially if it does not use or share bank $k$. For example, in Figure 4.5, while $u_{14,k}$ is $2 \cdot L_B$, $u_{15,k}$ should be 0 since there is no access from slot 15 to bank $k$. Note also that if an access from slot $s$ to bank $k$ could experience any delay, $u_{s,k}$ always results in a positive value.
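
The values quoted for Figure 4.5 can be reproduced with a short Python sketch. The slot-to-core list below is read from the slot row of that figure and should be treated as an assumption, as should the function names; latencies are expressed in units of $L_B$, so $L_M = 2$.

```python
def residual_workload(s_prime, s, table, sharers, l_b, l_m):
    """w_{s',s,k}: L_M per access to bank k in slots s'..s-1, minus L_B per slot."""
    accesses = sum(1 for t in range(s_prime, s) if table[t - 1] in sharers)
    return l_m * accesses - l_b * (s - s_prime)


def worst_case_slot_delay(s, table, sharers, l_b, l_m):
    """u_{s,k}: largest residual workload over all s' < s, lower-bounded by 0."""
    if table[s - 1] not in sharers:
        return 0                            # the core at slot s does not use bank k
    values = [residual_workload(sp, s, table, sharers, l_b, l_m)
              for sp in range(1, s)]
    return max([0] + values)


# Slots 1..15 (a round of 12 slots followed by the start of the next round).
table = [1, 2, 3, 4, 1, 2, 5, 6, 1, 2, 7, 8, 1, 2, 3]
sharers = {2, 5, 6, 8}                      # cores sharing bank k
L_B, L_M = 1, 2

print([residual_workload(sp, 14, table, sharers, L_B, L_M) for sp in (1, 2, 6, 7)])
# [-1, 0, 2, 1]
print(worst_case_slot_delay(14, table, sharers, L_B, L_M))   # 2
```

The maximum is attained at $s' = 6$, the slot from which the uninterrupted busy period starts.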
One can easily think of $u_{s,k}$ as a slot delay, i.e., $d^s_{j,\varphi}$, analyzed in Subsection 3.2.2 (see footnote 1). Now let us define by $z_{j,k}$ the upper bound of slot delays of core $j$ using bank $k$, i.e., $D^j_k$ in Subsection 3.2.1. The following constraint is equivalent to Equation (3.4):

Constraint 24. For each core $j$ and each bank $k$, and $T^{RR} + 1 \le s \le 2 \cdot T^{RR}$,

$$z_{j,k} \ge \sigma_{j,s} \cdot u_{s,k}. \tag{4.31}$$
Footnote 1: The subscript $s$ of $u_{s,k}$ is different from the superscript $s$ of $d^s_{j,\varphi}$. While the former is the index of a slot, the latter is just a notational symbol representing that $d^s$ is a slot delay. The subscript $\varphi$ of $d^s_{j,\varphi}$ is the slot index.
$z_{j,k}$ ends up being a positive value if and only if 1) core $j$ occupies slot $s$ of the scheduling table, 2) core $j$ uses bank $k$, and 3) core $j$ suffers bank conflict delays when accessing the bank. If at least one of the three conditions does not hold, then $z_{j,k}$ is always 0.

As explained in Section 3.2, in order to find $d^M_i$, we need to compute $D^j_s$ and $D^j_e$ first. Recall that the variables $b^j_s$ and $b^j_e$ hold the indices of the leftmost and the rightmost bank of core $j$, respectively; thus the following constraints can be used to find $D^j_s$ and $D^j_e$ from slot delays $z_{j,k}$:

Constraint 25. For each core $j$ and each bank $k$,

$$\text{if } b^j_s = k \implies D^j_s = z_{j,k}. \tag{4.32}$$

Similarly,

Constraint 26. For each core $j$ and each bank $k$,

$$\text{if } b^j_e = k \implies D^j_e = z_{j,k}. \tag{4.33}$$
A task running on core $j$ could experience bank conflict delays if some of its columns are in shared banks, as shown in Figure 3.4. That is, if the columns are in the leftmost (the rightmost) bank of core $j$, the delay could be at most $D^j_s$ ($D^j_e$). However, if the columns stretch across both ends, the upper-bound delay is the maximum of $D^j_s$ and $D^j_e$. Recall now that the column range of a task, say $i$, is bounded by the $x^i_s$ and $x^i_e$ variables (see Figure 4.4). Accordingly, the following constraints can finally find the bank conflict delay that an access from task $i$ on core $j$ can suffer in the worst case:

Constraint 27. For each task $i$, each core $j$, and each bank $k$,

$$\text{if } \alpha_{i,j} = 1 \text{ and } x^i_s \le b^j_s \cdot N^W \implies d^M_i \ge D^j_s. \tag{4.34}$$

Constraint 28. For each task $i$, each core $j$, and each bank $k$,

$$\text{if } \alpha_{i,j} = 1 \text{ and } (b^j_e - 1) \cdot N^W + 1 \le x^i_e \implies d^M_i \ge D^j_e. \tag{4.35}$$

Note here that if $[x^i_s, x^i_e]$ stretches across $[b^j_s, b^j_e]$, $d^M_i$ results in $\max(D^j_s, D^j_e)$ by the above two constraints.
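
For a fixed placement, the smallest delay that satisfies both constraints can be computed directly, which is the value a minimizing solver would settle on. The sketch below assumes that the task's columns already lie inside the core's bank range (Constraint 20); the function name and the numbers in the example are hypothetical.

```python
def bank_conflict_delay(x_s, x_e, b_s, b_e, n_w, d_s, d_e):
    """Smallest d_M satisfying Constraints 27 and 28 for a task on its core.

    x_s, x_e -- first and last column of the task
    b_s, b_e -- first and last bank of the core
    d_s, d_e -- slot-delay bounds of the core's leftmost and rightmost bank
    """
    delay = 0
    if x_s <= b_s * n_w:                  # some column lies in the leftmost bank
        delay = max(delay, d_s)
    if (b_e - 1) * n_w + 1 <= x_e:        # some column lies in the rightmost bank
        delay = max(delay, d_e)
    return delay


# Core owns banks 2..4 of 16 columns each, i.e. columns 17..64.
print(bank_conflict_delay(20, 22, 2, 4, 16, d_s=3, d_e=5))   # 3 (leftmost bank only)
print(bank_conflict_delay(33, 40, 2, 4, 16, d_s=3, d_e=5))   # 0 (middle bank only)
print(bank_conflict_delay(30, 50, 2, 4, 16, d_s=3, d_e=5))   # 5 (spans both ends)
```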
4.3.7 Task and core utilization
If the worst-case execution time of a task exceeds its period, it can never be schedulable. Thus we need to restrict each task's worst-case execution time to be bounded by its period, which can be simply formulated as follows:

Constraint 29. For each task $i$,

$$\frac{wcet_i}{p_i} \le 1. \tag{4.36}$$

Likewise, we need to limit each core utilization to 1, or a specific bound, e.g., Liu and Layland's bound [24].

Constraint 30. For each core $j$,

$$\sum_{i=1}^{N^\Gamma} \frac{wcet_i}{p_i} \cdot \alpha_{i,j} \le 1. \tag{4.37}$$
Chapter 5
Evaluation
In this chapter, we evaluate the proposed tunable WCET optimization model in terms of
the minimum achievable system utilization. For this, we used IBM ILOG CPLEX 12.1 [25]
to solve the optimization problem formulated in Chapter 4.
5.1 Evaluation Method
5.1.1 Experimental parameters
Table 5.1 summarizes the parameters used for the experiments. We consider a multicore system following the model described in Section 2.3. Each system consists of 4, 6, or 8 cores and has a shared cache partitioned into 4 or 8 banks. Each bank is divided into 16 or 32 columns; however, the total number of columns is maintained at 128. The bus access latency is assumed to be 2.5 nanoseconds, and the bank access latency is twice the bus access latency.

With these system parameters, we generate synthetic task sets, each of which is randomly generated as follows: Each set consists of 20, 30, or 40 tasks. The fixed execution time and the period of a task are randomly selected from the ranges of [10 ms, 250 ms] and [500 ms, 10000 ms], respectively. Also, each task can occupy a set of at most 5 cache columns, and accesses the shared cache at least $10^5$ times but at most $10^7$ times. We assume the random task sets are generated according to the workload model A1–A6 in Subsection 1.3.3.
Table 5.1: Experimental parameters.

Parameter      Value
$N^C$          {4, 6, 8} cores
$N^B$          {4, 8} banks
$N^W$          {16, 32} columns
$L_B$          2.5 ns
$L_M$          $2 \cdot L_B$
$N^\Gamma$     {20, 30, 40} tasks
$e_i$          uniform from [10, 250] ms
$p_i$          uniform from [500, 10000] ms
$N^X_i$        uniform from [1, 5] columns
$N^M_i$        uniform from [$10^5$, $10^6 \cdot$ {1, 3, 5, 7, 10}] times
Table 5.2: Experimental groups.

                   PureRR              BFD                     Proposed
Task allocation    flexible            pre-allocated           flexible
Bus schedule       pure round-robin    harmonic round-robin    harmonic round-robin
Cache partition    two-level and flexible (all three methods)
5.1.2 Experimental groups
With these experimental parameters, we compare the following three methods, which are
also summarized in Table 5.2:
PureRR In this method, the bus is scheduled by the pure round-robin arbitration method. That is, every core has the same slot period, i.e., $N^C$, as in [22]. However, task allocations are flexible and the cache is partitioned by our proposed two-level partitioning method.

BFD In order to see how well our proposed method works for the cases in which each task is allocated to a fixed core, we employ the Best-Fit Decreasing heuristic for task allocation. In this method, we first sort the tasks in decreasing order of estimated task utilization, which is defined as

$$\frac{\widehat{wcet}_i}{p_i} = \frac{e_i + N^M_i \cdot \{2 \cdot L_B + L_M + (N^C \cdot L_B)\}}{p_i}.$$

Each task is then allocated in the sorted order to the core which, after accommodating the task, will have the least remaining utilization. One may notice that $\widehat{wcet}_i$ above is similar to Equation (2.3). In fact, we compute the estimated WCET of task $i$ assuming that it does not experience any bank conflict delay and that the bus is scheduled by pure round-robin. It should be noted, however, that the bus schedule may change to a harmonic round-robin during optimization. Also, the cache is partitioned by our two-level partitioning method.
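
The allocation step of BFD can be sketched in Python as follows. This is a generic Best-Fit Decreasing routine written for illustration, not the code used in the experiments: the function name, the unit capacity per core, and the example utilizations are assumptions. The input is the list of estimated utilizations computed with the formula above.

```python
def best_fit_decreasing(utilizations, n_cores, capacity=1.0):
    """Assign tasks to cores by Best-Fit Decreasing on estimated utilization.

    Returns a list `core_of` where core_of[i] is the 0-based core of task i.
    """
    remaining = [capacity] * n_cores
    core_of = [None] * len(utilizations)
    # Visit tasks from the largest estimated utilization to the smallest.
    order = sorted(range(len(utilizations)), key=lambda i: -utilizations[i])
    for i in order:
        u = utilizations[i]
        fitting = [c for c in range(n_cores) if remaining[c] >= u]
        if not fitting:
            raise ValueError("task %d does not fit on any core" % i)
        # Best fit: the core left with the least remaining utilization.
        best = min(fitting, key=lambda c: remaining[c] - u)
        remaining[best] -= u
        core_of[i] = best
    return core_of


print(best_fit_decreasing([0.5, 0.4, 0.3, 0.2], n_cores=2))   # [0, 0, 1, 1]
```

In the example, the second task joins the first on core 0 because that leaves the tightest fit, and the remaining two tasks are pushed to core 1.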
Proposed This is the proposed method in this study. That is, task allocation, bus schedule, and cache partitioning are all flexible and thus are optimized by our proposed methods.

It should be noted that in all the methods, the shared cache is not pre-partitioned because the unbounded bank conflict delay problem may arise with a random or fixed pre-partitioning.

The optimization for the above methods is solved with the MILP formulation in Chapter 4. The only difference is that all $T_j$ have the same value $N^C$ in PureRR, and all $\alpha_{i,j}$ values are pre-assigned in BFD. Thus one can easily expect that the minimum utilization obtained with our proposed method is always less than, or at least equal to, those that can be achieved with PureRR and BFD, since these are more constrained optimization problems.
5.1.3 Evaluation metric
We compare the aforementioned three methods in terms of minimum achievable system utilization as explained in Section 2.5. For this, let us denote the metric by $U_{PureRR}$, $U_{BFD}$, and $U_{Proposed}$ for each method. Note that different task sets may have different baseline system utilizations. Thus, for fair comparison, we normalize $U_{BFD}$ and $U_{Proposed}$ to $U_{PureRR}$ for each task set and then take the average of 20 random sets for each configuration. Also, we compare and present $U_{PureRR} - U_{Proposed}$, $U_{PureRR} - U_{BFD}$, and $U_{BFD} - U_{Proposed}$ in order to illustrate the magnitude of the differences between the methods.
Figure 5.1: Minimum system utilization with different core counts. (a) Minimum utilizations normalized to $U_{PureRR}$ (PureRR, BFD, Proposed). (b) Differences of minimum utilizations ($U_{PureRR} - U_{Proposed}$, $U_{PureRR} - U_{BFD}$, $U_{BFD} - U_{Proposed}$). Horizontal axis: 4 cores with 20 tasks, 6 cores with 30 tasks, 8 cores with 40 tasks.
5.2 Evaluation Result
In this section, we present the evaluation results for the aforementioned methods obtained with different 1) numbers of cores, 2) intensities of cache accesses, and 3) cache configurations.
5.2.1 Impact of core count
Figure 5.1 compares the minimum system utilization as the number of cores increases. In this experiment, the number of banks and the number of columns per bank are 8 and 16, respectively, and the number of cache accesses of each task is randomly chosen from the range of $[10^5, 10^7]$. Also, in order to maintain the average load of the cores, we increase the number of tasks as the core count increases.

As the result in Figure 5.1a shows, our proposed methods achieve a system utilization that is lower than that of PureRR by 10% to 20% on average. We can see that the gap increases with the number of cores and thus of tasks. This can be backed up by the fact that the amount of bank sharing among cores increases with the core and task counts due to the fixed capacity of the shared cache, i.e., the number of banks and columns. Another important factor that contributes to this trend is that with more cores a harmonic round-robin schedule can be more flexible in prioritizing the cores, so that highly-utilized tasks can benefit from being allocated to the cores whose HRR periods are short. Meanwhile, the improvement gap between Proposed and BFD can be explained in a similar manner, that is, pre-assigning tasks to cores prevents further optimization by tightening the constraints on the bus schedule and bank sharing.

One interesting observation from this experiment is that BFD outperforms PureRR. This implies that even if there is no flexibility in assigning tasks to cores, it is likely that the system utilization can be significantly lowered by employing our proposed harmonic round-robin bus scheduling, which in turn enhances the efficiency of two-level cache partitioning as described in Section 2.2.

Figure 5.2: Minimum system utilization with different cache access intensities. (a) Minimum utilizations normalized to $U_{PureRR}$ (PureRR, BFD, Proposed). (b) Differences of minimum utilizations ($U_{PureRR} - U_{Proposed}$, $U_{PureRR} - U_{BFD}$, $U_{BFD} - U_{Proposed}$). Horizontal axis: the upper-limit number of cache accesses (1, 3, 5, 7, and 10 million).
5.2.2 Impact of cache accesses intensity
In order to see how the dierent cache access intensities aect the proposed optimization
model, we perform another experiment as increasing the upper-limit number of cache ac-
cesses. In this experiment, the numbers of cores, tasks, and banks are xed to 8, 40, and
59
8 respectively, and each bank consists of 16 columns. We vary the upper-limit number of
cache accesses from 1 million to 10 million while the lower-limit is xed to 0:1 million. With
these parameters, NMi for each task is chosen randomly between the limits.
As can be seen from Figure 5.2, the utilization improvement of Proposed over PureRR
increases with the upper limit of cache accesses. This can be explained by the underlying
rationale behind our tunable WCET model (Section 2.4): the room for further
optimization grows with the ratio of tunable delay to fixed execution time. Thus, if a
task accesses the bus and banks more intensively than others, adjusting its tunable delay is
more likely to reduce the overall system utilization, by an argument similar to the one given
in the previous discussion. However, this does not necessarily imply that a higher intensity
of cache accesses always leads to lower system utilization, as can be observed from the
above result. We can see from Figure 5.2a that the improvements of Proposed and BFD over
PureRR converge to certain levels (around 20% and 15%, respectively) as the tasks become
more memory-intensive. This is because, as the proportion of tunable delay
grows, the sensitivity of the WCET to changes in the bus schedule and cache partitioning
also increases. Recall that, in our tunable WCET model, a decrease in one task's delay naturally
leads to increases in the delays of the remaining tasks. Therefore, if most tasks are sensitive
to WCET variation, further overall improvement may not be possible because it is more likely
for some tasks or cores to exceed the utilization bound constraints.
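
The sensitivity argument can be illustrated with a toy calculation. The sketch below assumes only that a WCET is the sum of a fixed execution time and a tunable delay, as in the model, and that a change in bus schedule or cache partitioning scales the tunable part by some factor. The function `relative_wcet_change`, the 20% scaling, and the sample values are hypothetical, and the tunable share is expressed relative to the total WCET rather than to the fixed part.

```python
def relative_wcet_change(fixed: float, tunable: float, scale: float) -> float:
    """Relative change in WCET when only the tunable delay is scaled."""
    before = fixed + tunable
    after = fixed + scale * tunable
    return (after - before) / before


# Hypothetical (fixed, tunable) pairs with a growing share of tunable delay.
for fixed, tunable in [(9.0, 1.0), (5.0, 5.0), (1.0, 9.0)]:
    share = tunable / (fixed + tunable)
    gain = relative_wcet_change(fixed, tunable, 0.8)  # delay shrinks by 20%
    loss = relative_wcet_change(fixed, tunable, 1.2)  # delay grows by 20%
    print(f"tunable share {share:.0%}: {gain:+.1%} / {loss:+.1%}")
```

Under these assumptions, a memory-intensive task gains more from a shorter delay, but it also loses more when its delay is lengthened in favor of another task, which is why the improvement levels off.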
5.2.3 Impact of cache configuration
Figure 5.3 shows how different cache configurations affect our optimization model. In this
experiment, the numbers of cores and tasks and the upper limit on cache access counts are
fixed to 8, 40, and 10 million, respectively. With these parameters, we consider two different
cache configurations: 4 banks × 32 columns and 8 banks × 16 columns.
[Figure 5.3: Minimum system utilization with different cache configurations (4 banks × 32
columns/bank and 8 banks × 16 columns/bank) for PureRR, BFD, and Proposed.
(a) Minimum utilizations normalized to U_PureRR; y-axis: average of minimum system
utilization normalized to U_PureRR.
(b) Differences of minimum utilizations (U_PureRR − U_Proposed, U_PureRR − U_BFD,
U_BFD − U_Proposed); y-axis: average of minimum system utilization difference.]
It can be seen from the figure that the utilization improvements of Proposed and BFD
over PureRR with the 8 × 16 structure are slightly higher than those with the 4 × 32 one. Note that
the total number of columns required by all tasks is similar between the two cases. Although
the difference is small, it arises mainly from the different
granularity of core-to-bank mappings. To put it simply, let us suppose that a core requires
35 columns. With the 8 × 16 cache, it is possible that the core is mapped to 3 out of 8 banks.
With the 4 × 32 cache, on the other hand, the core needs at least 2 but possibly 3 out of 4
banks, which is equivalent to 4 or 6 banks of the 8 × 16 cache. Because each core is likely to
take more banks than it actually needs, the amount of bank sharing can increase with the
4 × 32 cache. As an extreme case, if each bank consists of only one column, then no core,
and thus no task, suffers any bank conflict delay, for the same reason. Another factor
that influences the difference is that, as already described in the previous discussions, with
more cores bank conflict delays can further be reduced with the help of harmonic round-robin
scheduling.
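
The arithmetic behind the 35-column example can be checked with a short sketch. It computes only the minimum number of banks a core must be mapped to, that is, the case in which its columns are packed into as few banks as possible; the helper name `banks_needed` is hypothetical.

```python
def banks_needed(columns_required: int, columns_per_bank: int) -> int:
    """Minimum number of banks that can hold the required columns."""
    # Integer ceiling division.
    return -(-columns_required // columns_per_bank)


REQUIRED = 35  # columns required by the core in the example above

for num_banks, cols in [(8, 16), (4, 32)]:
    n = banks_needed(REQUIRED, cols)
    covered = n * cols  # columns spanned by the banks the core occupies
    print(f"{num_banks} x {cols}: at least {n} banks, "
          f"{covered} columns spanned, {covered - REQUIRED} beyond the need, "
          f"{n / num_banks:.1%} of all banks")
```

With the same 128 columns in total, the coarser 4 × 32 layout makes the core span a larger fraction of the banks, leaving fewer banks that other cores can use without sharing.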
Chapter 6
Conclusion and Future Work
In this study, we have proposed a novel perspective on the WCET model called tunable WCET,
which enables system-level optimization for hard real-time multicore systems. In this model,
the WCET of the tasks in a system is no longer dependent upon the system configuration, but
rather decides how to configure the shared bus and cache of the system. As
WCET-aware shared resource arbitration and allocation methods, we have introduced harmonic
round-robin bus scheduling and two-level cache partitioning schemes. With these methods,
the tasks, and even cores, are prioritized in accessing the shared resources. To find
the optimal configuration of the shared bus and cache for a given set of tasks, we have
derived the tunable WCET analysis and formulated its optimization problem. We have then
investigated the impact of several configuration and workload factors on the minimum achievable
system utilization, and the experimental results have shown that our proposed WCET-aware
shared resource configuration schemes can significantly lower system utilization. Although
we have considered only the overall system utilization as the optimization objective, our
model can also be applied to other WCET-related objectives.
There are several directions in which our study could be improved. First,
we will develop an efficient heuristic algorithm to solve the proposed MILP optimization
problem. As one may notice, in our optimization model all the optimization factors,
i.e., task allocation, bus scheduling, and cache partitioning, are strongly coupled with each
other. We will therefore study how to decompose the original problem into multiple
sub-problems and in what order to solve them. Second, we will investigate how to extend our
resource allocation methods to support soft real-time tasks as well. Recall that we have
assumed no two tasks can share a cache column, which will limit the efficient use of a shared
cache if we start to consider soft real-time tasks. One possible solution is to allow soft
real-time tasks to share a designated area of the shared cache and then modify the tunable
WCET model to take the delay due to column sharing into account. In this direction,
how many banks or columns to assign to the soft real-time tasks will be another important
research issue. Lastly, we will study other inter-core interference factors, such as I/O traffic
and off-chip memory (Factors F3.3 and F5 in Subsection 1.3.2), in future work.
References
[1] \Intel Xeon E7 Processors," http://www.intel.com/products/server/processor/xeonE7/
index.htm.
[2] \Oracle Unveils SPARC T3 Processor and SPARC T3 Systems," http://www.oracle.
com/us/corporate/press/173536.
[3] J. Held, J. Bautista, and S. Koehl, \From a Few Cores to Many: A Tera-scale Com-
puting Research Overview," http://download.intel.com/research/platform/terascale/
terascale overview paper.pdf.
[4] \Freescale's P4080: QorIQ P4080 Eight-Core Communications Processors with Data
Path," http://www.freescale.com/webapp/sps/site/prod summary.jsp?code=P4080.
[5] \ARM11MPCore Processor," http://www.arm.com/products/processors/classic/
arm11/arm11-mpcore.php.
[6] \LG Optimus 2X: rst dual-core smartphone launches with Android, 4-
inch display, 1080p video recording," http://www.engadget.com/2010/12/15/
lg-optimus-2x-rst-dual-core-smartphone-launches-with-android/.
[7] \Apple's iPad 2 Makes Dual-Core Mainstream," http://gigaom.com/2011/03/02/
apples-ipad2-makes-dual-core-mainstream/.
[8] A. Fedorova, S. Blagodurov, and S. Zhuravlev, \Managing Contention for Shared Re-
sources on Multicore Processors," Communications of the ACM, vol. 53, no. 2, pp.
49{57, Feb 2010.
[9] S. Zhuravlev, S. Blagodurov, and A. Fedorova, \Addressing Shared Resource Con-
tention in Multicore Processors Via Scheduling," in Proc. of the 15th International
Conference on Architectural Support for Programming Languages and Operating Sys-
tems(ASPLOS'10), 2010.
[10] M. Kandemir, S. P. Muralidhara, S. H. K. Narayanan, Y. Zhang, and O. Ozturk, \Op-
timizing Shared Cache Behavior of Chip Multiprocessors," in Proc. of the 42nd Annual
IEEE/ACM International Symposium on Microarchitecture (Micro'09), 2009, pp. 505{
516.
64
[11] J. Rosen, A. Andrei, P. Eles, and Z. Peng, \Bus Access Optimization for Predictable Im-
plementation of Real-Time Applications on Multiprocessor Systems-on-Chip," in Proc.
of IEEE International Real-Time Systems Symposium (RTSS'07), 2007, pp. 49{60.
[12] S. Chattopadhyay, A. Roychoudhury, and T. Mitra, \Modeling Shared Cache and Bus
in Multi-cores for Timing Analysis," in Proc. of International Workshop on Software
and Compilers for Embedded Systems (SCOPES'10), 2010, pp. 1{10.
[13] R. Pellizzoni, E. Betti, S. Bak, G. Yao, J. Criswell, M. Caccamo, and R. Kegley, \A
Predictable Execution Model for COTS-based Embedded Systems," in Proc. of The 17th
IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS'11),
2011.
[14] J. Yan and W. Zhang, \WCET Analysis for Multi-Core Processors with Shared L2
Instruction Caches," in Proc. of IEEE Real-Time and Embedded Technology and Appli-
cations Symposium (RTAS'08), 2008, pp. 80{89.
[15] W. Zhang and J. Yan, \Accurately Estimating Worst-Case Execution Time for Multi-
core Processors with Shared Direct-Mapped Instruction Caches," in Proc. of the 15th
IEEE International Conference on Embedded and Real-Time Computing Systems and
Applications (RTCSA'09), 2009, pp. 455{463.
[16] Y. Li, V. Suhendra, Y. Liang, T. Mitra, and A. Roychoudhury, \Timing Analysis of
Concurrent Programs Running on Shared Cache Multi-Cores," in Proc. of IEEE Inter-
national Real-Time Systems Symposium (RTSS'09), 2009, pp. 57{67.
[17] A. Schranzhofer, J.-J. Chen, and L. Thiele, \Timing Analysis for TDMA Arbitration
in Resource Sharing Systems," in Proc. of IEEE Real-Time and Embedded Technology
and Applications Symposium (RTAS'10), 2010, pp. 215{224.
[18] R. Pellizzoni, A. Schranzhofer, J.-J. Chen, M. Caccamo, and L. Thiele, \Worst Case De-
lay Analysis for Memory Interference in Multicore Systems," in Proc. of the Conference
on Design, Automation and Test in Europe (DATE'10), 2010, pp. 741 { 746.
[19] M. Schoeberl, \JOP: A Java Optimized Processor," in Proc. of Workshop on Java
Technologies for Real-time and Embedded Systems (JTRES'03), 2003.
[20] A. El-Haj-Mahmoud, A. S. AL-Zawawi, A. Anantaraman, and E. Rotenberg.
[21] B. Lickly, I. Liu, S. Kim, H. D. Patel, S. A. Edwards, and E. A. Lee, \Predictable
Programming on a Precision Timed Architecture," in Proc. of International Conference
on Compilers, Architecture, and Synthesis from Embedded Systems (CASES'08), 2008.
[22] M. Paolieri, E. Qui~nones, F. J. Cazorla, G. Bernat, and M. Valero, \Hardware Support
for WCET Analysis of Hard Real-Time Multicore Systems," in Proc. of IEEE/ACM
International Symposium on Computer Architecture (ISCA'09), 2009, pp. 57{68.
[23] H. P. Williams, Model Building in Mathematical Programming, 4th ed. Wiley, 1999.
65
[24] C. L. Liu and J. W. Layland, \Scheduling Algorithms for Multiprogramming in a Hard
Real-Time Environment," Journal of the ACM, vol. 20, no. 1, pp. 46{61, January 1973.
[25] \IBM ILOG CPLEX Optimizer," http://www-01.ibm.com/software/integration/
optimization/cplex-optimizer.
66
