Predictable inter-core communication for strictly partitioned real-time multi-core systems by Wen, Jen-Yang
© 2019 Jen-Yang Wen





Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science
in the Graduate College of the





The Federal Aviation Administration (FAA) in its recent CAST-32A certification guid-
ance suggests that more than one core in a multi-core chip for safety-critical avionics appli-
cations can be used only if all the inter-core interference channels have been bounded and
accounted for. Researchers have proposed frameworks to partition and regulate access to
shared resources, such that multi-core processors can be deployed for safety-critical avionics
applications. Strict partitioning of memory resources among cores, however, does not allow
inter-core communication.
In this paper, we propose a Communication Core Model (CCM) that implements the
inter-core communication by bounding the amount of inter-core interference in a partitioned
multi-core system. A system level perspective of how to realize such a CCM along with the
implementation details is provided. A formula to derive WCET of the tasks using CCM is
provided. We compare our CCM with Contention-based Communication (CBC) where no
private banking is enforced for any core. The results of the analytical approach using San
Diego Vision Benchmark Suite (SD-VBS) for two models indicate that the CCM shows an
improvement of up to 65 percent when compared to the CBC. Moreover, our experimental
results indicate that the measured WCET of the tasks of SD-VBS is within the bounds
calculated using the proposed analysis.
To I-Han, Renatus, and Ashton, for their love and support.
ACKNOWLEDGMENTS
I would like to express my deepest appreciation to Professor Lui Sha for his patience and
understanding during the time I have spent at UIUC. I would like to thank Mr. Rohan
Tabish, Professors Renato Mancuso, Heechul Yun, Rodolfo Pellizzoni, Marco Caccamo and
Lui Sha for their direct contributions to this thesis. I would also like to thank the supporting
staff from the Computer Science Department who made this thesis possible.
I would also like to thank Dr. Yi-Zong Ou, Dr. Po-Liang Wu and Mr. Jiyang Chen for
their support. I enjoyed sharing the office space with them a lot.




CHAPTER 1 INTRODUCTION
CHAPTER 2 RELATED WORK AND DRAM BACKGROUND
2.1 DRAM Background and Related Work
2.2 Background on Partitioning Shared Resources
CHAPTER 3 SYSTEM MODEL
3.1 Architectural Assumptions
3.2 Resource Allocation To Communication Tasks
3.3 Application Task Model
CHAPTER 4 BOUNDING INTERFERING MEMORY REQUESTS IN THE PROPOSED SYSTEM
4.1 Interference Caused By CC
4.2 Interference Caused By Other ACs to AC under Analysis
4.3 Total Interference Caused to AC under Analysis
CHAPTER 5 RESPONSE TIME ANALYSIS
5.1 Contention Latency
5.2 Latency Computation
CHAPTER 6 IMPLEMENTATION
6.1 Motivating Example
6.2 Implementation Details of TX/RX Buffer for CCM
CHAPTER 7 EVALUATION
7.1 Experimental Setup and Benchmarks
7.2 Task WCET with Different Memory Regulation Budget Assignment
7.3 Throughput of the CT
7.4 CCM and CBC
7.5 Analytical Bound and Measurement
CHAPTER 8 CONCLUSION
REFERENCES
CHAPTER 1: INTRODUCTION
Multi-core processors have been developed by industry to meet ever-growing processing
requirements. These processors provide great average-case performance. However, the
use of multi-core processors for safety-critical avionics applications is strictly regulated by
the FAA [1]. The current restrictions on the use of multi-core processors for safety-critical
avionics applications stem from the fact that multi-core processors have shared resources such
as the shared last level cache (LLC), the available bus bandwidth, and the shared DRAM banks. The contention
on these shared resources by other cores leads to unpredictable behavior of the tasks on the
core under analysis. Researchers in [2] demonstrated that strict partitioning of the shared
resources (LLC, bus bandwidth and DRAM banks) in a multi-core environment is required
to achieve predictable execution of the tasks running on each core. Strict partitioning of the
shared resources has been adopted by the FAA in its recent CAST-32A position paper [1].
Previous research [2, 3, 4, 5, 6, 7, 8] on partitioning shared resources in a multi-core
environment paved the way for the use of multi-core processors in hard real-time
safety-critical applications. However, the solution was incomplete: it had yet to support
inter-core communication. In order to implement inter-core communication, the strict
partitioning assumption on the shared resources needs to be relaxed in a controlled manner.
There are many ways to implement inter-core communication, and the appropriate choice
depends on the amount of data to be shared. For small communication messages, an intuitive
approach is to use a portion of the LLC and avoid accessing the main memory. However,
locking chunks of messages in the LLC requires specific hardware support.
Dealing with small amounts of data and using the cache as a medium for communication
is out of the scope of this paper.
The authors of [9] proposed a platform that integrates cache and DRAM bank
partitioning to guarantee predictable execution time of mixed-criticality tasks in a multi-core
(MC2) system. Although the authors of [10] show how to support inter-processor
communication inside the MC2 framework, no theoretical bound was provided for all the interference
sources of the shared communication medium, which is one of the goals of our paper.
In this paper, we present an approach that implements inter-core communication
through DRAM memory. When inter-core communication is implemented
inside a multi-core system with strictly partitioned shared resources, the
strict partitioning of the DRAM banks required for isolation needs to be relaxed.
When relaxing this assumption, we propose to
minimize the number of interfering cores accessing the same DRAM bank.
Strict partitioning of shared resources in a multi-core framework was originally designed
to support the use of the standard Integrated Modular Avionics (IMA) architecture on each
core. The (single core) IMA architecture uses Time Division Multiplexing Access (TDMA)
to run applications with different criticality in different partitions. Within each partition,
tasks are scheduled by the generalized rate-monotonic algorithm [11]. In the IMA standard,
the zero partition (I/O partition) is used to handle all the I/O and inter-partition message
exchanges. In the Single Core Equivalence (SCE) framework, we consolidate the zero partitions from each core into a
specific core, called the I/O core, to manage the I/O accesses [12]. It is natural to extend the
I/O core architecture to implement inter-core communication using the model shown in
Figure 1.1. We call the proposed model the communication core model (CCM). Under
the proposed model, the I/O core is renamed as the communication core (CC), whereas all
the other cores are named as the application cores (ACs). In CCM, the CC is responsible
for moving I/O data between I/O devices and all the other ACs as well as the inter-core
communication data between ACs.
Using the CCM, one can handle the I/O data from I/O devices using one of the following two
approaches: 1) the CC transfers data between device memory and
its own private bank, and then moves it to/from the private bank of the AC that needs it; or 2) the
CC directly transfers the data between the I/O device and the bank of the AC
that needs it. For simplicity, we consider the second approach, shown as black arrows in
Figure 1.1. This paper focuses on safety-critical hard real-time inter-core communication.
Figure 1.1: Block Diagram
In CCM, the tasks deployed on ACs are allowed to access only the private DRAM banks
assigned to them. The communication task (CT) running on the CC has access to its private
banks and to the banks of all the ACs. The CT is responsible for moving data from
the private bank of one AC to the private bank of another AC to implement inter-core
communication. In particular, we are interested only in the movement of the data between
ACs; the end-to-end response time analysis of the delivery of a communication
message to the final task is beyond the scope of this paper.
The main contributions of this paper include the following:
• A novel CCM that bounds the amount of contention on the DRAM banks for imple-
menting shared memory communication inside the SCE framework is proposed.
• A method for calculating the WCET of a task under the proposed CCM is provided.
• Implementation details of the communication library (CommLib) and a communication
task (CT) based on the proposed CCM are provided.
• An extensive evaluation of the proposed CCM is provided and is compared with CBC
where the inter-core communication is implemented without private banks.
The rest of this paper is organized as follows. Chapter 2 introduces the related work
and background. Chapter 3 introduces the system model and assumptions. Chapter 4
presents how to bound the inter-core memory interference with the proposed CCM. Chapter 5
presents the delay analysis of the proposed system. Chapter 6 describes the details about the
implementation of the proposed library and communication server. Chapter 7 presents the
analytical results of the CCM and the CBC approach as well as provides the measurement
based results for CCM on the P4080 platform. Finally, Chapter 8 concludes this work.
CHAPTER 2: RELATED WORK AND DRAM BACKGROUND
This chapter is divided into two sections. Section 2.1 describes the background related to
the DRAM memory controller and some of the previous works that have been proposed by the
research community to analyze and predict the behavior of the DRAM memory controller.
In particular, we describe the work proposed in [13] that we use as a basis to analyze our
proposed system model. Next, in Section 2.2 we describe the necessary background and
related work required for the understanding of this paper.
2.1 DRAM BACKGROUND AND RELATED WORK
Main memory such as DRAM is a shared resource in a multi-core processor which greatly
affects the overall performance of the system. DRAM memory controllers are designed to
generate DRAM specific commands to access data in the DRAM.
The DRAM controller and the DRAM module are connected through a command/address
bus and a data bus. The DRAM memory is generally organized into a set of ranks; each
rank is divided into multiple banks that can be accessed in parallel provided that no collision
occurs on either bus. Each bank has a two-dimensional array of memory organized into rows
and columns. In order to access the data in a row, an activate (ACT) command must be
issued to load the data in the row buffer. Once loaded, all the read/write memory requests
(CAS) accessing the data in the row buffer will constitute a row hit. However, if a memory
request targets a different row then the current buffer must be written back to the array
with a precharge command (PRE) before the second row can be activated. A request
that hits in the row buffer (open row access) takes much less time than one that
misses in the row buffer (close row access). The minimum time it takes to complete an open row
access and a close row access is device dependent. Once the DDR device for the system is
selected, the timing constraint values can be found in JEDEC standard documents, such
as [14].
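The row-buffer behavior described in this paragraph can be illustrated with a minimal sketch. The class and method names are hypothetical, and the model deliberately ignores all timing constraints; it only lists the DRAM commands that a request to a given row would trigger.

```python
from typing import List, Optional


class Bank:
    """Toy model of one DRAM bank with a single row buffer (no timing)."""

    def __init__(self) -> None:
        self.open_row: Optional[int] = None  # row currently held in the row buffer

    def access(self, row: int) -> List[str]:
        """Return the command sequence needed to serve a request to `row`."""
        if self.open_row == row:
            return ["CAS"]                    # row hit (open row access)
        commands = []
        if self.open_row is not None:
            commands.append("PRE")            # write the current row back first
        commands += ["ACT", "CAS"]            # load the new row, then read/write
        self.open_row = row
        return commands


bank = Bank()
trace = [bank.access(r) for r in (3, 3, 7)]
print(trace)  # [['ACT', 'CAS'], ['CAS'], ['PRE', 'ACT', 'CAS']]
```

The third access shows the close row case: the bank must precharge and activate before the column access, which is why it takes longer than the row hit in the second access.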
Scheduling algorithms in COTS memory controllers have been developed to offer low
average latency and maximize throughput. One of the most common scheduling
algorithms is First-Ready First-Come-First-Serve (FR-FCFS), which
prioritizes: (1) row hits over row conflicts; (2) older commands over newer commands in case
of row conflicts. Another widely used scheduling algorithm is round-robin.
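A minimal sketch of the FR-FCFS priority rule for a single bank is given below. It is an illustration only: the `Request` type and `fr_fcfs_pick` function are hypothetical names, and real controllers arbitrate at the command level under timing constraints that are not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    arrival: int   # arrival order (smaller = older)
    row: int       # DRAM row targeted within the bank


def fr_fcfs_pick(queue: List[Request], open_row: Optional[int]) -> Request:
    """Pick the next request for one bank: row hits first, then the oldest."""
    hits = [r for r in queue if r.row == open_row]
    candidates = hits if hits else queue
    return min(candidates, key=lambda r: r.arrival)


queue = [Request(0, row=5), Request(1, row=2), Request(2, row=2)]
order = []
open_row = 2                       # assume row 2 is currently in the row buffer
while queue:
    nxt = fr_fcfs_pick(queue, open_row)
    queue.remove(nxt)
    open_row = nxt.row
    order.append(nxt.arrival)
print(order)  # [1, 2, 0]
```

The oldest request (arrival 0) is served last because two younger requests hit the open row. This re-ordering effect is what makes worst-case analysis of FR-FCFS controllers difficult.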
In the real-time community, there has been a great effort in analyzing the behavior of
DRAM memory controllers. The complexity of COTS DRAM systems, however, has made
such efforts difficult. To address this complexity, researchers have designed hardware
(HW) real-time DRAM controllers [15, 16, 17, 18, 19, 20, 21] that are easier to analyze. The
problem with these HW real-time DRAM controllers is that they have lower performance
than COTS DRAM controllers. Moreover, these HW DRAM controllers have
yet to be adopted by industry. For this reason, there is a need to analyze
the DRAM memory controllers that accompany modern COTS processors.
To analyze the memory behavior of COTS-based memory controllers, some researchers
have proposed to model the main memory as a black box where each request from the
memory controller takes a specific amount of time and memory requests from other cores are
serviced on a round-robin or first-come first-serve (FCFS) basis. A variety of previous works
addressing main memory as a black box include [22, 23, 24, 25, 26]. The memory model
assumed in these approaches is however not safe for COTS multicore platforms because it
hides critical details necessary to place an upper bound on its timing [27].
Recently, the authors in [13] proposed a new approach for bounding memory interfer-
ence. In their approach, they considered the timing characteristics of major resources in the
DRAM system, including the re-ordering effect of FR-FCFS and the rank/bank/bus tim-
ing constraints. Using this approach, the authors showed that they obtain a tighter upper
bound on the worst-case memory interference delay for a task when it executes in parallel
with other tasks. The presented technique combines two approaches: a request-driven
and a job-driven approach. The request-driven approach focuses on the task's own memory
requests, and the job-driven approach focuses on interfering memory requests during the
task's execution. Combining the two approaches yields a tight upper bound on the worst-case
response time of a task in the presence of memory interference.
2.2 BACKGROUND ON PARTITIONING SHARED RESOURCES
In a multi-core system, there are shared resources such as available bus bandwidth and
DRAM banks. These shared resources can be partitioned among the cores to avoid con-
flicts. Researchers in the real-time community have introduced many OS-based techniques
to regulate access to the shared resources to bound the inter-core interference. For example,
memory regulation techniques such as [3] proposed to regulate the amount of main memory
access by each core in each regulation period to reduce the interference on the memory con-
troller. Other researchers proposed to partition the shared resource to reduce the inter-core
interference channels. Techniques such as [5, 9] partition the DRAM banks in the shared
main memory among cores, while other techniques such as [6, 28, 29] partition the shared
last level cache (LLC) to prevent cache evictions caused by inter-core interference.
In this paper, we use approaches similar to MemGuard [3] and PALLOC [5] to partition
the shared resources in our system. However, the resource partitioning can also be achieved
using the MC2 work in [9, 10].
MemGuard is a memory bandwidth reservation mechanism that is implemented at the
Operating System (OS) layer. The main purpose of this mechanism is to distribute the
bandwidth available from the memory controller equally among all the cores. It works
periodically: for each interval (e.g., 1 ms), a fixed memory budget (Qp) is assigned to each
core. During each period, the hardware performance counter (PMC) on each core measures
the number of memory requests that are generated by the core. The PMC is programmed
to generate an overflow interrupt to the core once its assigned budget has been exhausted.
Upon the reception of the overflow interrupt, MemGuard stalls the core by descheduling
all the tasks. At the beginning of the new period, a new budget assignment takes place and
the previously descheduled tasks are scheduled again. Although MemGuard implements a
heuristic-based reclaiming mechanism, this does not improve the hard real-time
performance of the system; therefore, it is disabled throughout this paper.
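The regulation behavior described here can be mimicked with a small discrete-time simulation. This is a sketch under simplifying assumptions, not MemGuard's kernel code: time is divided into hypothetical "ticks", the function name `regulate` is made up for illustration, and the budget check happens once per tick rather than through an overflow interrupt.

```python
def regulate(accesses_per_tick, budget, ticks_per_period):
    """Return, per tick, whether the core runs (True) or is stalled (False)."""
    running = []
    used = 0
    for tick, demand in enumerate(accesses_per_tick):
        if tick % ticks_per_period == 0:
            used = 0                       # new period: budget is replenished
        if used >= budget:
            running.append(False)          # budget exhausted: core stays stalled
            continue
        used += demand                     # the counter tracks this tick's accesses
        running.append(True)
    return running


# A core that issues 2 requests per tick, with a budget of 4 requests
# per period of 4 ticks, runs for half of every period.
print(regulate([2] * 8, budget=4, ticks_per_period=4))
# [True, True, False, False, True, True, False, False]
```

The faster the core consumes its budget, the earlier in the period it is stalled, which is the effect the response time analysis later has to account for.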
A DRAM memory module contains multiple resources, often referred to as banks, that can
be accessed in parallel. In COTS multicore platforms, banks are typically shared among
all the cores even though programs running on the cores do not share memory space. In
order to partition the banks and assign each bank to a particular core, we rely on PALLOC.
PALLOC allows partitioning of banks to avoid bank sharing among cores, thereby improving
isolation on COTS multicore platforms without requiring any special hardware support.
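PALLOC's implementation is not described in this paper. The following is only a sketch of the general idea behind bank-aware page allocation, assuming that the bank index is encoded in a fixed set of physical address bits at or above the page offset. The bit positions and bank assignments below are hypothetical and platform dependent.

```python
def bank_of(phys_addr: int, bank_bits: tuple) -> int:
    """Compute the bank index from selected physical-address bits."""
    index = 0
    for i, bit in enumerate(bank_bits):
        index |= ((phys_addr >> bit) & 1) << i
    return index


def page_allowed(page_addr: int, core_banks: set, bank_bits: tuple) -> bool:
    """A page may be handed to a core only if it lies in one of the core's banks."""
    return bank_of(page_addr, bank_bits) in core_banks


BANK_BITS = (13, 14, 15)          # hypothetical bit positions, platform dependent
CORE0_BANKS = {0}                 # hypothetical private bank of core 0
CORE1_BANKS = {1}                 # hypothetical private bank of core 1

print(page_allowed(0x2000, CORE1_BANKS, BANK_BITS))  # True: bit 13 set, bank 1
print(page_allowed(0x2000, CORE0_BANKS, BANK_BITS))  # False
```

An allocator that filters free pages with such a predicate never gives two cores pages from the same bank, which is the isolation property the text relies on.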
CHAPTER 3: SYSTEM MODEL
3.1 ARCHITECTURAL ASSUMPTIONS
We assume a standard COTS multi-core processor with n cores. Each core in the system
features out-of-order execution and has a private cache. There is also a last-level cache (LLC)
that is shared among all the cores. The LLC is partitioned and its content is statically locked.
We assume a technique such as the one presented in [30] is implemented, so that the sum
of available local miss status holding registers (MSHRs) in each active core is equal to or
smaller than the number of global MSHRs in the shared LLC. As such, MSHR contention
does not occur. We assume that the HW platform is timing compositional, or that the
experimentally derived worst-case execution time (WCET) is appropriately upper-bounded
to account for the lack of hardware compositionality, as described in [31].
To avoid cache coherency problems, we assume that shared data (messages to be com-
municated) are not cacheable at any cache level. We also assume that the underlying main
memory is a DRAM with B banks. CPUs access main memory through a shared bus. We
also assume that the platform provides a mechanism to measure the number of memory
requests made by each core to the main memory. We assume that the platform is capable
of counting aggregated read and write memory accesses. The arbitration scheme on the
memory controller is assumed to be FR-FCFS. We assume that reads and writes are treated
in the same way at the memory controller [13].
3.2 RESOURCE ALLOCATION TO COMMUNICATION TASKS
In our system, we partition shared resources such as DRAM banks and the available
memory bandwidth. In the proposed CCM, one communication core (CC) is dedicated for
inter-core communication, whereas the rest of the cores are ACs. All ACs are only allowed
to access their dedicated DRAM banks, whereas the CC is capable of accessing all DRAM
banks. A block diagram of such a model is shown in Figure 1.1.
Here, the CC is responsible for copying data from one bank to another bank on behalf
of other ACs. The task responsible for this data movement is the CT running on the CC.
A summary of the system parameters and the values used for the evaluation in Chapter 7 is
provided in Table 3.1. Within each memory regulation period, the CC is capable of accessing
all the banks. There exist at most (n − 1) · (n − 2) communication sequences that need to be
completed in one memory regulation period, assuming all the ACs need to communicate with
each other. For each pair of communicating cores, we assume the CC issues at most tc memory
requests to the sender’s private bank, and at most tc memory requests to the receiver’s private
bank. The total number of memory requests made by the CC to the banks of ACs during one
memory regulation period is represented by Tc = 2 · (n − 1) · (n − 2) · tc. When an AC needs
to access an I/O device buffer, the CC issues at most tio memory requests from the TX buffer in
the sender’s private bank (I/O output), and at most tio memory requests to the RX buffer
in the receiver’s private bank (I/O input). The number of memory transactions required to move data
between device buffers and the private banks of ACs is represented by Tio = 2 · (n − 1) · tio.
In each memory regulation period, after performing Tc + Tio memory transactions to the
private banks of ACs, the CC can use the remaining regulation budget (Qp − Tc − Tio) to execute
tasks that access the CC’s own private banks.
Table 3.1: System Parameters
System Parameters           Symbol   Value
Number of Cores             n        8
Number of ACs               n − 1    7
Memory Regulation Period    P        1 ms
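The budget split described in this section can be written out as a short calculation. The function name is hypothetical, and the values chosen for Qp, tc and tio in the usage example are illustrative assumptions only; just n = 8 comes from Table 3.1.

```python
def cc_budget_split(n: int, q_p: int, t_c: int, t_io: int) -> dict:
    """Split the CC regulation budget Qp into communication, I/O, and leftover."""
    T_c = 2 * (n - 1) * (n - 2) * t_c      # requests to AC banks for inter-core messages
    T_io = 2 * (n - 1) * t_io              # requests to AC banks for I/O data
    leftover = q_p - T_c - T_io            # budget left for the CC's own private banks
    if leftover < 0:
        raise ValueError("Qp is too small for the requested tc and tio")
    return {"Tc": T_c, "Tio": T_io, "leftover": leftover}


# n = 8 as in Table 3.1; Qp, tc and tio are made-up example values.
print(cc_budget_split(n=8, q_p=10000, t_c=10, t_io=20))
# {'Tc': 840, 'Tio': 280, 'leftover': 8880}
```

The check on the leftover budget makes explicit that tc and tio cannot be chosen independently of Qp: the CC is itself regulated, so Tc + Tio must fit within its budget.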
3.3 APPLICATION TASK MODEL
We consider a partitioned, fixed-priority scheduling policy, where each core has a set
Γ of N periodic application tasks, {τ1, ..., τN}, each with a different priority, whereby τ1 has
the highest priority and τN has the lowest priority. Each task τi can be represented as τi =
{Corei, Hi, ei, Ti}, where Corei is the core τi is assigned to, Hi is the worst-case number
of main memory accesses of τi, and ei is the worst-case execution time of the task measured in
isolation, i.e., when no interfering tasks are present and no memory regulation is enforced.
The amount of communication data that a task needs to send to another task is included
in Hi. Ti is the period of the task. The deadline of a task is equal to its period. Table 3.2
summarizes the task parameters.
An application task (AT) is a task deployed on an AC. It accesses only the private DRAM bank assigned to
it, and it shares DRAM banks only with the ATs on the same core and with the CT.
Table 3.2: Task Parameters
Description                                              Symbol
Core to which τi has been assigned                       Corei
Number of main memory requests of any job of task τi     Hi
Solo execution time                                      ei
Period (and deadline)                                    Ti
CHAPTER 4: BOUNDING INTERFERING MEMORY REQUESTS IN THE
PROPOSED SYSTEM
The maximum number of memory requests that each core can issue in a memory regulation
period is Qp. However, as discussed in [13, 5], interfering memory requests that access the
same bank (intra-bank interference) as the task under analysis produce more delay and
lead to a more pessimistic WCET, compared to memory requests that access different banks
(inter-bank interference). In this section, we describe how to bound the interfering intra-bank
(Aintra) and inter-bank memory requests (Ainter) in a memory regulation period according
to the proposed CCM described in Chapter 3.
In order to calculate the total interference caused by the CC and all the ACs to the AC
under analysis during a memory regulation period, we apply the following approach: first,
we calculate the total interference caused by CC; second, we calculate the interference caused
by all the ACs; and finally, we sum the two interferences to get the total interference.
4.1 INTERFERENCE CAUSED BY CC
The amount of inter-bank and intra-bank interference caused by the CC in the CCM can
be summarized as follows:
• The intra-bank interference (Aintra) caused by CC when moving I/O data to (input) or from (output)
the bank under analysis. This intra-bank interference can be accounted as 2 · tio.
• The inter-bank interference (Ainter) caused by CC when moving I/O data to (input) or from (output)
the other (n − 2) ACs. This Ainter can be accounted as 2 · (n − 2) · tio.
• The intra-bank interference (Aintra) caused by CC when moving communication data
to (receive) or from (send) the bank under analysis. This Aintra can be accounted as
2 · (n − 2) · tc.
• The inter-bank interference (Ainter) caused by CC when moving communication data
to (receive) or from (send) the other ACs. This Ainter can be accounted as 2 · (n − 2) · (n − 2) · tc, because
CC accesses each private bank of an AC for at most 2 · (n − 2) · tc transactions, and
there are (n − 2) banks belonging to other ACs that can cause inter-bank interference
to the AC under analysis.
• The inter-bank interference (Ainter) caused by the leftover CC budget after the depletion
of the I/O and communication budget. This can be written as Qp − 2 · (n − 1) ·
tio − 2 · (n − 1) · (n − 2) · tc.
The total intra-bank and inter-bank interference caused by CC can be obtained by sum-
ming the various terms, as expressed in Equation 4.1 and Equation 4.2 below. Note that the
total memory interference caused by CC during a memory regulation interval is the sum of
Equation 4.1 and Equation 4.2 and is equal to the memory regulation budget (Qp).
$$A^{intra}_{CC} = 2 \cdot t_{io} + 2 \cdot (n-2) \cdot t_c \tag{4.1}$$
$$A^{inter}_{CC} = Q_p - 2 \cdot t_{io} - 2 \cdot (n-2) \cdot t_c \tag{4.2}$$
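As a check, Equation 4.2 can be recovered by summing the three inter-bank terms from the list in this section (I/O traffic to the other ACs, communication traffic to the other ACs, and the leftover CC budget):

```latex
\begin{align*}
A^{inter}_{CC}
  &= 2(n-2)\,t_{io} + 2(n-2)^2\,t_c
     + \bigl(Q_p - 2(n-1)\,t_{io} - 2(n-1)(n-2)\,t_c\bigr) \\
  &= Q_p - 2\,t_{io}\bigl[(n-1)-(n-2)\bigr]
         - 2(n-2)\,t_c\bigl[(n-1)-(n-2)\bigr] \\
  &= Q_p - 2\,t_{io} - 2(n-2)\,t_c .
\end{align*}
```

Adding the intra-bank total of Equation 4.1 to this result cancels the tio and tc terms and leaves exactly Qp, which matches the statement that the CC's total interference equals its regulation budget.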
4.2 INTERFERENCE CAUSED BY OTHER ACS TO AC UNDER ANALYSIS
In our proposed model, all the ACs only access their own bank with a memory regulation
budget of Qp. This means that the only interference introduced by all other ACs to the AC
under analysis is inter-bank interference (Ainter).
The total intra-bank and inter-bank interference caused by all the ACs to the AC under
analysis are expressed in Equation 4.3 and Equation 4.4, respectively.
$$A^{intra}_{AC} = 0 \tag{4.3}$$
$$A^{inter}_{AC} = (n-2) \cdot Q_p \tag{4.4}$$
4.3 TOTAL INTERFERENCE CAUSED TO AC UNDER ANALYSIS
To obtain the total intra-bank interference caused by CC and the ACs to the AC under
analysis, we simply add Equation 4.1 and Equation 4.3 to obtain Equation 4.5. Similarly,
the total inter-bank interference is obtained by adding Equations 4.2 and 4.4, which yields
Equation 4.6.
$$A^{intra} = A^{intra}_{CC} + A^{intra}_{AC} = 2 \cdot t_{io} + 2 \cdot (n-2) \cdot t_c \tag{4.5}$$
$$A^{inter} = A^{inter}_{CC} + A^{inter}_{AC} = (n-1) \cdot Q_p - 2 \cdot t_{io} - 2 \cdot (n-2) \cdot t_c \tag{4.6}$$
From Equation 4.5 we can see that the value of Aintra depends on the system parameters
tio and tc and that it is only a fraction of the overall memory regulation budget. This shows
that the proposed CCM indeed reduces the intra-bank memory interference, compared to
the CBC, where we cannot use bank privatization while supporting inter-core communication
in a strictly partitioned system.
In the CBC configuration, inter-core communication is implemented with no bank
privatization. In the worst case, all cores can issue memory requests to the same
bank. This results in a much higher intra-bank interference count, as shown in Equation 4.7
and Equation 4.8.
$$A^{intra}_{CBC} = (n-1) \cdot Q_p \tag{4.7}$$
$$A^{inter}_{CBC} = 0 \tag{4.8}$$
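To make the difference between the two models concrete, the sketch below evaluates Equations 4.5 to 4.8 side by side. The function names are hypothetical, and the values of Qp, tc and tio are illustrative assumptions rather than the settings used in the evaluation.

```python
def ccm_interference(n, q_p, t_c, t_io):
    """Intra-/inter-bank interfering requests per period under CCM (Eq. 4.5, 4.6)."""
    intra = 2 * t_io + 2 * (n - 2) * t_c
    inter = (n - 1) * q_p - intra
    return intra, inter


def cbc_interference(n, q_p):
    """Worst case under CBC: every other core targets the same bank (Eq. 4.7, 4.8)."""
    return (n - 1) * q_p, 0


n, q_p, t_c, t_io = 8, 10000, 10, 20        # example values; only n is from Table 3.1
print(ccm_interference(n, q_p, t_c, t_io))  # (160, 69840)
print(cbc_interference(n, q_p))             # (70000, 0)
```

Both models see the same total number of interfering requests, (n − 1) · Qp. The CCM only changes how that total is split, moving almost all of it from the costly intra-bank category to the cheaper inter-bank category.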
CHAPTER 5: RESPONSE TIME ANALYSIS
The response time of a task or group of tasks in a memory regulated system is inflated
compared to solo execution time because of: 1) memory contention caused by tasks on other
cores; 2) stalls induced by memory regulation. During each memory regulation period, a
core either makes Qp memory accesses, exhausting its budget and being stalled, or
it does not exhaust its full budget. We define a memory regulation period in which a core
exhausts its full Qp budget and is stalled because of regulation as a stall period, whereas
a period in which a core does not utilize its full memory regulation budget is defined as
a contention period. During a regulation period, the faster a core exhausts its Qp budget,
the more regulation delay it suffers. Hence, in the worst case, we can assume that the core
immediately performs Qp memory accesses at the beginning of the period and is stalled for
the entire period (P ).
To compute the response time of a task in our proposed CCM, we first measure
the solo execution time of the task in isolation. The cores in our model are out-of-order;
in the best case, the memory access latency can be hidden from the processor because, in the
absence of data dependencies, the CPU pipeline will not stall waiting for the completion of
a memory load (i.e., the instruction-level parallelism of the task is high).
When measuring the execution time of the task in isolation, it is not known how many
memory requests generated by the task were reordered and overlapped with CPU instructions.
To obtain a safe upper bound on the total response time, one can simply assume that
all memory requests had zero latency when measured solo, whereas, when considering interference
from other cores, every request experiences the full memory latency: either no MSHR is available
or no instruction can be reordered to hide the latency of the memory request, and the request
accesses a close row.
To perform the response time analysis of a task in the proposed CCM, we thus
proceed as follows:
1) Similar to [4], for each task τj, we define a pure computation time cj as the execution
time of the task minus the minimum latency for the Hj memory requests of the task. As
discussed above, the minimum latency is zero, therefore the pure computation time equals
the measured execution time (cj = ej).
2) Then, when considering the tasks that execute in the busy interval, we add an extra
latency term P for each stall period (since, in the worst case, the core is stalled for the whole
period without performing any computation). For memory requests issued in contention periods, we instead
add a latency term that represents the maximum cumulative latency of all such requests
(including the effects of contention). Let us call RL the total latency for stall periods, and
CL the total memory latency (including contention effects) for contention periods. We also
define the total memory latency as ML = RL + CL.
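We can then perform response time analysis [32] of independent periodic tasks by computing
the response time Ri of task τi iteratively. At each iteration, we define:
• $\bar{H}_i = \sum_{\forall j, j \le i} H_j \cdot \lceil R_i / T_j \rceil$ as the total number of memory requests performed by all tasks
on the core under analysis that arrive in the interval Ri (including the task under analysis).
• $\bar{c}_i = \sum_{\forall j, j \le i} c_j \cdot \lceil R_i / T_j \rceil$ as the total computation performed by all tasks on the core under
analysis that arrive in the interval Ri (including the one under analysis).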
We then compute Ri for the next iteration as:
$$R_i \leftarrow P + \bar{c}_i + ML(R_i, \bar{H}_i), \tag{5.1}$$
and continue to iterate until convergence, or Ri > Ti. Note that we are summing the total
computation with the overall latency. We also need to add, however, an extra term P to
account for the fact that the critical instant might start right after the memory regulation
budget has been exhausted (by previous tasks not in the busy interval). The challenge is
how to compute ML at each iteration, which we discuss in Section 5.2. As reported in the
equation, we will show that ML is a function of the length of the busy interval (Ri), and
the total number of memory requests (H̄i).
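The iteration of Equation 5.1 can be sketched as a fixed-point loop. Since ML is only derived in Section 5.2, the sketch takes it as a caller-supplied function; the `Task` type, the function name, the starting value of the iteration, and the placeholder ML used in the example are all assumptions made for illustration. The loop also assumes ML is non-decreasing in both arguments so that the iteration converges.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Task:
    H: int      # worst-case number of main memory requests per job
    c: float    # pure computation time (equal to the solo execution time e)
    T: float    # period (and deadline)


def response_time(tasks: List[Task], i: int, P: float,
                  ML: Callable[[float, int], float]) -> Optional[float]:
    """Iterate R_i <- P + c_bar_i + ML(R_i, H_bar_i); tasks[0] has highest priority.

    Returns the converged response time, or None if R_i exceeds the deadline T_i.
    """
    R = tasks[i].c                      # assumed starting point for the iteration
    while True:
        jobs = [math.ceil(R / t.T) for t in tasks[: i + 1]]
        H_bar = sum(t.H * k for t, k in zip(tasks, jobs))
        c_bar = sum(t.c * k for t, k in zip(tasks, jobs))
        R_next = P + c_bar + ML(R, H_bar)
        if R_next > tasks[i].T:
            return None                 # deadline miss: stop iterating
        if R_next == R:
            return R                    # fixed point reached
        R = R_next


# Times in microseconds; placeholder ML charging 1 us per request (NOT the
# latency function of Section 5.2, just a stand-in to exercise the loop).
tasks = [Task(H=100, c=1000, T=10000), Task(H=200, c=2000, T=20000)]
print(response_time(tasks, i=1, P=1000, ML=lambda R, H: H))  # 4300
```

Passing ML as a parameter mirrors the remark in the text that ML depends on both the busy interval length and the total number of memory requests, and it keeps the loop independent of the particular latency analysis plugged in.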
Figure 5.1 gives an example of the breakdown of a measured WCET of a task in a memory
regulated system. In Figure 5.1 we can see that the measured WCET can be broken down
into 4 stall periods (blue and green blocks) and 5 contention periods (red blocks). Note that
all the periods except the last one must take up an entire memory regulation period; the
task may finish before the end of its last memory regulation
period. Inside a contention period, the task executes and suffers memory latency due to
contention. The first regulation period (blue block) represents the initial stall term due
to the memory regulation budget being already exhausted when the task under analysis is
activated. The sum of the last three stall periods (green blocks) in Figure 5.1 is the RL of
the example task. The sum of all the memory latency (light blue blocks) within each of the
5 contention periods in Figure 5.1 is the CL of the example task. The total memory latency
(ML) is the sum of RL and CL.
Figure 5.1: Breakdown of measured WCET for a generic task with different periods
5.1 CONTENTION LATENCY
Before detailing how to derive the total memory latency (ML), we need to discuss the
contention latency (CL). In general, the contention latency is a function of the number
of memory requests, as well as the DRAM device being used, the behavior of the DRAM
controller, and the employed latency analysis, as discussed in Section 2.1. In this paper,
we adopt the Job-Driven delay analysis proposed in [13]. We discuss only Job-Driven delay
because the Request-Driven delay analysis leads to extremely pessimistic bounds for out-of-
order execution cores [33]. Based on [13], contention latency is a function of three parameters:
the number of memory requests issued by the core under analysis (which we denote with
$J$), the number of interfering memory requests issued by other cores targeting the banks accessed
by the core under analysis ($I^{intra}$), and the number of interfering memory requests issued
by other cores targeting banks that the core under analysis does not access ($I^{inter}$). Thus,
we write $CL(J, I^{intra}, I^{inter})$ to denote an upper bound on the cumulative memory latency
of $J$ requests of the core under analysis. We now show how to derive $CL(J, I^{intra}, I^{inter})$
based on the analysis in [13]. It is important to note, however, that any memory analysis
able to derive such a function could be used instead, without any change to the rest of the
analysis. Hence, our general framework is independent of the specific characteristics of the
DRAM controller being used, as long as an inter-bank request causes less interference than
an intra-bank request.
Based on the observations in [13], the worst-case access latency for a close-row memory
access, due to a row conflict caused by a previous access to the same bank, can be expressed
as $L_{conf}$. Since conflicting accesses to the same bank cannot proceed in parallel, each interfering
intra-bank memory access can cause at most $L_{conf}$ delay to the core under analysis. An
interfering inter-bank memory access can cause at most $L_{inter}$ delay to the core under analysis.
The values of $L_{conf}$ and $L_{inter}$ can be derived from the DRAM
device timing constraints, as discussed in [13].
Theorem 5.1 Assume there are $I^{intra}$ interfering memory accesses to banks that the core
under analysis can access, and there are $I^{inter}$ interfering memory accesses to banks that
the core under analysis cannot access. Then, the time taken by the core under analysis to
perform $J$ memory accesses with an FR-FCFS scheduling algorithm is bounded by:
$$CL(J, I^{intra}, I^{inter}) = (J + I^{intra}) \cdot L_{conf} + I^{inter} \cdot L_{inter}. \qquad (5.2)$$
First we consider the case where $I^{inter} = 0$. The worst-case memory latency for a system
with out-of-order processors occurs when the microarchitecture can neither reorder nor overlap
instructions, so that no memory latency can be hidden from the
processor. As the authors of [13] observed, every intra-bank memory request suffers a worst-case
latency of $L_{conf}$ due to bank conflict; hence, $CL(J, I^{intra}, 0) = J \cdot L_{conf} + I^{intra} \cdot L_{conf}$
is the time it takes for a DRAM bank to serve $J + I^{intra}$ memory requests when all the
consecutive accesses are row conflicts and the memory latency is not hidden by out-of-order
execution.
Then, we consider the case when there are $I^{inter} > 0$ inter-bank memory accesses. Based on
the inter-bank interference delay derived in [13], each interfering inter-bank memory access causes
at most $L_{inter}$ additional delay to a memory transaction accessing another bank because of
the timing constraints defined in the specifications [14]. Therefore, $I^{inter}$ inter-bank memory
accesses create at most $I^{inter} \cdot L_{inter}$ memory delay.
By combining the two cases, we derive Equation 5.2.
The values of $L_{conf}$ and $L_{inter}$ are related to the DRAM device timing parameters. When a
specific DRAM device instance is selected, these values can be treated as constants. Throughout
this paper we use DDR3-1333H as the selected device. Based on the timing constraints
defined in [14], we have $L_{inter} = 37.5$ ns and $L_{conf} = 58.5$ ns. We refer interested readers to [13]
for the details on how to derive the values of $L_{conf}$ and $L_{inter}$ from the DRAM timing constraints.
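As a small worked illustration of Equation 5.2, the bound can be evaluated directly with the two constants quoted above. The function and constant names are chosen here for illustration only, and the request counts in the example are arbitrary.

```python
# Device constants quoted in the text for the selected DRAM device (nanoseconds).
L_CONF_NS = 58.5   # worst-case latency of an intra-bank (row-conflict) access
L_INTER_NS = 37.5  # worst-case delay caused by one inter-bank interfering access


def contention_latency_ns(j: int, i_intra: int, i_inter: int) -> float:
    """Equation 5.2: upper bound on the cumulative latency of j requests."""
    return (j + i_intra) * L_CONF_NS + i_inter * L_INTER_NS


# 10 own requests, 5 intra-bank and 4 inter-bank interfering requests.
bound = contention_latency_ns(10, 5, 4)
print(f"{bound} ns")
```

With these inputs the bound is 15 row-conflict latencies plus 4 inter-bank delays, that is 877.5 ns + 150 ns.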
5.2 LATENCY COMPUTATION
Based on the function $CL(J, I^{intra}, I^{inter})$, we now discuss how to determine $ML(R_i, \bar{H}_i)$.
Given a response time $R_i$, the number of memory regulation periods that the tasks in the
busy interval extend over (excluding the first one, which is fully stalled due to previous tasks)
is equal to $K = \lceil (R_i - P)/P \rceil$. As explained earlier in this chapter, out of these $K$ periods,
some are regulation periods and some are contention periods. Let us call $K^{reg}$ the number of regulation
periods, and $K^{cont}$ the number of contention periods. Since we do not know the number of
such periods that results in the worst-case latency $ML$, we will treat $K^{reg}$ and $K^{cont}$ as
variables and use them to write an optimization problem with the objective of maximizing
$ML$. Clearly, it must hold by definition: $K^{reg} + K^{cont} = K$.
Similarly, we can split the memory requests of the tasks in the busy interval between
requests in regulation periods and requests in contention periods. Let us call $\bar{H}^{reg}_i$ and
$\bar{H}^{cont}_i$ the requests
in regulation and contention periods, respectively. We then also have by definition:
$\bar{H}^{reg}_i + \bar{H}^{cont}_i = \bar{H}_i$. Furthermore, since each regulation period requires $Q_p$ memory requests,
it must also hold: $\bar{H}^{reg}_i = K^{reg} \cdot Q_p$. That implies: $K^{reg} \le \lfloor \bar{H}_i / Q_p \rfloor$, and
$\bar{H}^{cont}_i = \bar{H}_i - K^{reg} \cdot Q_p$.
We can now derive the latency. Given $K^{reg}$ regulation periods, the regulation latency is
simply: $RL(K^{reg}) = K^{reg} \cdot P$. For the contention latency, note that since we have $K^{cont}$
contention periods, there are a total of $I^{intra} = A^{intra} \cdot K^{cont}$ intra-bank and
$I^{inter} = A^{inter} \cdot K^{cont}$ inter-bank interfering memory requests, respectively (based on Equations 4.5 and 4.6 derived
in Chapter 4). We can plug these values into the $CL$ function to obtain the
contention latency: $CL(\bar{H}^{cont}_i, A^{intra} \cdot K^{cont}, A^{inter} \cdot K^{cont})$. Finally, summing the regulation
and contention latencies yields Equation 5.3:
$$\begin{aligned}
ML(R_i, \bar{H}_i) &= RL(K^{reg}) + CL(\bar{H}^{cont}_i, A^{intra} \cdot K^{cont}, A^{inter} \cdot K^{cont}) \\
&= K^{reg} \cdot P + CL(\bar{H}_i - K^{reg} \cdot Q_p, A^{intra} \cdot (K - K^{reg}), A^{inter} \cdot (K - K^{reg}))
\end{aligned} \qquad (5.3)$$
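In summary, at each iteration performed to compute the task response time we need to
solve the following optimization problem to determine $ML$:
Maximize (over variable $K^{reg}$):
$$K^{reg} \cdot P + CL(\bar{H}_i - K^{reg} \cdot Q_p, A^{intra} \cdot (K - K^{reg}), A^{inter} \cdot (K - K^{reg})) \qquad (5.4)$$
Subject to:
$$0 \le K^{reg} \le \min(K, \lfloor \bar{H}_i / Q_p \rfloor) \qquad (5.5)$$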
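For a general $CL$ function, one could try all possible values of $K^{reg}$ subject to the constraint
in Inequality (5.5) and find the one that maximizes Equation 5.4. However, when employing
the $CL$ function of Equation 5.2, the objective in Equation 5.4 can be rewritten as:
$$ML = K^{reg} \cdot \left( P - (Q_p + A^{intra}) \cdot L_{conf} - A^{inter} \cdot L_{inter} \right) + (\bar{H}_i + A^{intra} \cdot K) \cdot L_{conf} + A^{inter} \cdot K \cdot L_{inter}. \qquad (5.6)$$
Hence, if $P - (Q_p + A^{intra}) \cdot L_{conf} - A^{inter} \cdot L_{inter}$ is positive, then $ML$ is maximized by taking
the maximum value of $K^{reg}$ allowed by Inequality (5.5); otherwise, by taking the minimum $K^{reg} = 0$.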
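The optimization above can be checked numerically. The sketch below evaluates the objective of Equation 5.4 with the CL function of Equation 5.2, once by exhaustive search over K^reg and once with the sign test on the coefficient of K^reg. It assumes that K^reg ranges from 0 to min(K, floor(H̄_i / Q_p)); the values of A^intra and A^inter in the example are made up for illustration (the real ones come from Chapter 4), and all times are in nanoseconds.

```python
import math


def objective(k_reg, K, H_bar, P, Q_p, A_intra, A_inter, L_conf, L_inter):
    """Equation 5.4 with the CL function of Equation 5.2 plugged in."""
    k_cont = K - k_reg
    j = H_bar - k_reg * Q_p  # requests left for the contention periods
    return k_reg * P + (j + A_intra * k_cont) * L_conf + A_inter * k_cont * L_inter


def k_bounds(R, H_bar, P, Q_p):
    """Return K and the assumed upper limit on K^reg."""
    K = max(0, math.ceil((R - P) / P))
    return K, min(K, H_bar // Q_p)


def ml_exhaustive(R, H_bar, P, Q_p, A_intra, A_inter, L_conf, L_inter):
    """Try every admissible K^reg and keep the largest latency."""
    K, k_max = k_bounds(R, H_bar, P, Q_p)
    return max(
        objective(k, K, H_bar, P, Q_p, A_intra, A_inter, L_conf, L_inter)
        for k in range(k_max + 1)
    )


def ml_closed_form(R, H_bar, P, Q_p, A_intra, A_inter, L_conf, L_inter):
    """Pick K^reg from the sign of its coefficient in Equation 5.6."""
    K, k_max = k_bounds(R, H_bar, P, Q_p)
    slope = P - (Q_p + A_intra) * L_conf - A_inter * L_inter
    k_reg = k_max if slope > 0 else 0
    return objective(k_reg, K, H_bar, P, Q_p, A_intra, A_inter, L_conf, L_inter)


# Hypothetical parameters: 1 ms period, Qp = 2520, illustrative A values.
params = dict(P=1_000_000, Q_p=2520, A_intra=100, A_inter=200,
              L_conf=58.5, L_inter=37.5)
ml_long_period = ml_closed_form(5_000_000, 10_000, **params)

# A much shorter period makes the coefficient negative, so K^reg = 0 is worst.
short = dict(params, P=100_000)
ml_short_period = ml_closed_form(500_000, 10_000, **short)
```

Because the objective is linear in K^reg, both functions return the same value whenever the coefficient is non-zero, which is why a single comparison can replace the search.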
CHAPTER 6: IMPLEMENTATION
In this chapter, we describe the overall implementation of the proposed CCM. We first
describe an example of how to realize CCM given a multicore and later describe how to
implement the CCM.
6.1 MOTIVATING EXAMPLE
Consider a system with eight cores (n = 8) in which we have a CC and seven ACs. Each
core has its own dedicated bank. Let us assume that the minimum guaranteed bandwidth
provided by the memory controller is computed experimentally using the approach in [3]
and is found to be 1.2GB/s. If we split the bandwidth equally among the cores, then each
core gets 153MB/s. Let us assume that memory regulation is implemented at
a granularity of 1ms. Given that the minimum guaranteed bandwidth of each core is 153MB/s,
each core is assigned a Qp of 2520 memory transactions per memory regulation period, where
one memory transaction is 64 bytes.
Let us assume that Tc = Qp memory transactions are dedicated on the CC to perform
communication. These 2520 memory transactions are divided equally among the
(n−1)·(n−2) = 42 AC pairs, and each transferred transaction costs the CC one read and one write.
This gives a per-pair communication budget of tc = Tc/(2·(n−1)·(n−2)) = 30 memory
transactions, which translates to a data size of 1920 bytes per memory regulation period. By
assigning tc = 30 memory transactions to one AC pair, we can say that during each memory
regulation period the maximum packet size that can be successfully transferred from the bank
of one application core to the bank of another application core is 1920 bytes. In
this example we assume Tc = Qp. However, as we will show in our evaluation, the CC
is responsible for handling OS-related tasks in addition to the communication. Therefore, the
available Tc is always less than Qp.
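The arithmetic of this example can be reproduced with a few lines of Python. The function name is chosen here for illustration; the factor of 2 reflects the read and the write that each transferred transaction requires, as stated in the evaluation chapter.

```python
TRANSACTION_BYTES = 64  # size of one memory transaction, as stated in the text


def per_pair_budget(t_c: int, n: int) -> int:
    """Per-pair communication budget tc = Tc / (2 * (n - 1) * (n - 2))."""
    ordered_ac_pairs = (n - 1) * (n - 2)  # n - 1 ACs, each sending to n - 2 others
    return t_c // (2 * ordered_ac_pairs)  # one read plus one write per transfer


t_c_pair = per_pair_budget(2520, 8)
max_packet_bytes = t_c_pair * TRANSACTION_BYTES
print(t_c_pair, max_packet_bytes)
```

For n = 8 there are 42 ordered AC pairs, so 2520 transactions yield 30 transactions, or 1920 bytes, per pair and per regulation period.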
As depicted in Figure 6.1(a), when a task running on one AC wants to send data to
another task running on a different AC, it writes the data to send (TX) buffer in its private
DRAM bank. In Figure 6.1(a), the TX buffer that stores outgoing messages from ACi
to ACj is named TX_i_j. It should be noted that all the tasks on an AC sending data to
the other receiving (RX) tasks on a particular destination AC would write to the same TX
buffer. For instance, Figure 6.1 shows that AC1 has separate TX buffers to send to different
ACs. The situation is symmetric on the other cores. The main reason for having separate
TX buffers per AC pair is to reflect the fact that we assign tc for each AC pair.
For the receiver task, there is a separate RX buffer for each pair of communicating ATs.
We name the RX buffer that stores the incoming messages from ATk to ATj as RX_k_j.
The data from the TX buffer is copied into the RX buffer of the destination AT on another
AC by the CC, as depicted in Figure 6.1(b). The TX and RX buffers are non-cacheable
to the ACs. In the next section, we provide the details of how the TX and RX buffers
are implemented.
Figure 6.1: Message Flow Diagram
6.2 IMPLEMENTATION DETAILS OF TX/RX BUFFER FOR CCM
The TX/RX buffers are created in the private banks of the ACs using POSIX
shm_open(). It is the responsibility of the CT to create
these buffers as part of the initialization process. The buffers are mapped by
mmap() into the ATs running on the ACs that need to access them. All ATs that need to send
inter-core messages to receiving ATs access
the corresponding TX buffer in their dedicated bank, as shown in Figure 6.1. The receiving
ATs access their local RX buffers to read any data produced by ATs on a different core. In
order for the ATs running on the ACs to access the TX/RX buffers, we have implemented a
shared library, named CommLib.
We assume that there is a system configuration file, provided by the system administrator,
that specifies all the possible inter-core communication channels, message sizes, and periods,
between the ATs in the system. Based on parameters recorded in the system configuration
file, the TX/RX buffers are created and initialized with appropriate size so that the buffers
will never overflow as long as all ATs use the library according to the parameters recorded
in the configuration file. When the CT and the ATs that use the CommLib initialize, they
read the same configuration file to obtain the names of the buffers they interact with, and
store the list of buffers along with other metadata in their own local data structures. The
ATs use CommLib to write/read data to/from the TX/RX buffers. The CT running on CC
has access to all the TX/RX buffers. As discussed in Sec. 3, all the TX and RX buffers are
mapped to be non-cacheable. In our implementation, we make the buffers non-cacheable
by modifying the mmap() system call so that we can make the tasks in our system always
access the TX/RX buffers as non-cacheable.
As described in earlier sections, an inter-core communication budget (tc) is assigned to
each AC pair; therefore, we implemented a TX buffer for each AC pair in our proposed CCM.
The TX buffer is shared by the CT and all the ATs running on the same AC that want to
send data to a specific AC. Hence, access to the shared data structure needs to be protected
to avoid race conditions. To reduce the blocking time of tasks accessing the TX
buffer, we propose the use of two circular buffers, the Message Schedule Queue and
the Outgoing Message Queue, shown in Figure 6.2.
Figure 6.2: Per AC-Pair TX Buffer and Per AT-Pair RX Buffer
Using two circular buffers results in less blocking. In fact, in this case, it is enough to
acquire a mutex only for the amount of time required to update the metadata of the TX
buffer, rather than for the entire duration of a send operation. The Outgoing Message
Queue in Figure 6.2 is used to store the actual TX packet data. The sent data is written
to a free memory location pointed to by the next free entry in the queue (nextFreeBufPtr). The
data written to the nextFreeBufPtr location can be less than or equal to the packet size
supported by our CCM, as described in Figure 6.2.
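send(txTaskID, rxTaskID, txData, size)
    txBufferPtr := findtxBuffer(txTaskID, rxTaskID);
    if txBufferPtr.full() then
        return -1;
    lock(txBufferPtr);
    temp := txBufferPtr.nextFreeBufPtr;
    Increment txBufferPtr.nextFreeBufPtr;
    unlock(txBufferPtr);
    memcpy(temp, txData, size);
    lock(txBufferPtr);
    Write temp, txTaskID, rxTaskID and size at txBufferPtr.wrPtr;
    Increment txBufferPtr.wrPtr;
    unlock(txBufferPtr);
    return size;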
Algorithm 6.1: Pseudo Code for send API
The pseudo code of the send API, which takes txTaskID, rxTaskID, a pointer to txData,
and size, is shown in Algorithm 6.1. Based on the txTaskID and rxTaskID passed to
the send API, an array holding the metadata of all the TX buffers that the
current AT may access is searched to find the
TX buffer (txBufferPtr) to which the data must be written, as shown in line 2 of
Algorithm 6.1. Once the correct TX buffer has been identified, the task tries to acquire the
mutex. Multiple ATs can call send and try to write to the same TX buffer;
therefore, synchronization is required in the form of a mutex lock.
Once the lock has been acquired, the send procedure saves the current nextFreeBufPtr in
the temp variable, increments nextFreeBufPtr, and releases the lock. The sent data is
then copied to the address pointed to by temp (see lines 5 through 9 in Algorithm 6.1). After the
data copy has been completed, the address in temp, along with
other metadata such as txTaskID, rxTaskID and size, has to be stored in the Message
Schedule Queue. The Message Schedule Queue is also shared between all the ATs that
access the same TX buffer. As such, the send procedure acquires a lock on the metadata
of the Message Schedule Queue. The metadata of the Message Schedule Queue are
rdPtr and wrPtr. The only metadata that needs locking as part of the send call is wrPtr.
After the lock has been acquired on the metadata of the Message Schedule Queue, the
temp pointer is written at wrPtr, wrPtr is incremented, and the lock is released
(lines 10 to 13 in Algorithm 6.1). The CT only reads wrPtr to determine whether the queue is
empty; it never updates the value of wrPtr and therefore does not have to acquire the mutex.
Note that in our implementation, the critical sections contain only updates to the shared
pointers. Therefore, the blocking time between ATs due to synchronization is short and is
independent of the packet size. In addition, no synchronization between the CT and the
ATs is required.
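The two-queue send path described above can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the CommLib code: the class `TxBuffer`, its field names, the use of slot indices instead of raw pointers, and the overflow check are all hypothetical. The point it illustrates is that the payload copy happens outside both critical sections.

```python
import threading


class TxBuffer:
    """Hypothetical model of one per-AC-pair TX buffer with its two circular queues."""

    def __init__(self, capacity: int, packet_size: int):
        self.capacity = capacity
        self.packet_size = packet_size
        self.lock = threading.Lock()
        # Outgoing Message Queue: fixed-size payload slots.
        self.outgoing = [bytearray(packet_size) for _ in range(capacity)]
        # Message Schedule Queue: (txTaskID, rxTaskID, slot, size) entries.
        self.schedule = [None] * capacity
        self.next_free = 0  # plays the role of nextFreeBufPtr
        self.wr = 0         # wrPtr, advanced only by senders
        self.rd = 0         # rdPtr, advanced only by the CT


def send(buf: TxBuffer, tx_task_id: int, rx_task_id: int, data: bytes) -> int:
    """Return the number of bytes queued, or -1 if the message cannot be queued."""
    if len(data) > buf.packet_size:
        return -1
    with buf.lock:  # first short critical section: reserve a payload slot
        if buf.next_free - buf.rd >= buf.capacity:  # assumed overflow check
            return -1
        slot = buf.next_free % buf.capacity
        buf.next_free += 1
    # The copy runs without the lock, so blocking does not grow with packet size.
    buf.outgoing[slot][: len(data)] = data
    with buf.lock:  # second short critical section: publish the metadata
        buf.schedule[buf.wr % buf.capacity] = (tx_task_id, rx_task_id, slot, len(data))
        buf.wr += 1
    return len(data)


tx = TxBuffer(capacity=2, packet_size=8)
first = send(tx, 1, 2, b"abc")
second = send(tx, 1, 2, b"de")
third = send(tx, 1, 2, b"f")  # both slots are taken, so this is rejected
```

A real implementation would also have to decide when a slot may be reused if senders publish out of order; the sketch sidesteps this by never reusing a slot before the CT advances `rd`.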
For the RX API, we create per AT-pair RX buffers so that the interference among all the
receiving tasks can be minimized. Each RX buffer is only shared between the CT and a
single receiving task. Therefore, an Incoming Message Queue with a rdPtr and a wrPtr
is implemented. Since only the RX AT updates the rdPtr, whereas the CT only updates the
wrPtr, there is no mutex required at the RX buffer.
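receive(txTaskID, rxTaskID, rxData, size)
    rxBufferPtr := findrxBuffer(txTaskID, rxTaskID);
    if rxBufferPtr.empty() then
        return -1;
    Read the message at rxBufferPtr.rdPtr into rxData;
    Increment rxBufferPtr.rdPtr;
    return number of bytes read;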
Algorithm 6.2: Pseudo Code for receive API
The pseudo code of the receive API is shown in Algorithm 6.2. Similar to the send API, the
receive API searches all the RX buffers linked to this task, as shown in line 2 of Algorithm 6.2.
Upon a match, it checks whether there is new data in the RX buffer by comparing the rdPtr and
wrPtr pointers shown in Figure 6.2. Since we design receive to be non-blocking, if
no new data is found in the receive buffer, the call returns -1, as shown in lines 3 and
4 of Algorithm 6.2. Because each receiving task has its own Incoming Message
Queue, no synchronization is required. When there is an incoming message in the queue,
it is read into the buffer pointed to by
rxData passed to the receive API, the rdPtr of the RX buffer is incremented, and the number of
bytes read is returned (lines 5 through 7 in Algorithm 6.2).
Note that because both send and receive interact only with buffers in the caller AT’s private
bank, no inter-bank memory interference is introduced by these functions.
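The non-blocking receive path can be illustrated with a single-producer, single-consumer ring in Python. The class `RxBuffer` and the helper `ct_deliver` are hypothetical names introduced for this sketch; the property it shows is that the CT writes only `wr` and the receiving AT writes only `rd`, so no mutex appears.

```python
class RxBuffer:
    """Hypothetical per-AT-pair RX buffer: one producer (CT), one consumer (AT)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots = [b""] * capacity
        self.rd = 0  # rdPtr, advanced only by the receiving AT
        self.wr = 0  # wrPtr, advanced only by the CT


def ct_deliver(buf: RxBuffer, payload: bytes) -> bool:
    """CT side: append one message, or report that the queue is full."""
    if buf.wr - buf.rd >= buf.capacity:
        return False
    buf.slots[buf.wr % buf.capacity] = bytes(payload)
    buf.wr += 1
    return True


def receive(buf: RxBuffer, rx_data: bytearray) -> int:
    """AT side: copy the oldest message into rx_data, or return -1 if none is pending."""
    if buf.rd == buf.wr:  # no new data: return immediately instead of blocking
        return -1
    msg = buf.slots[buf.rd % buf.capacity]
    n = min(len(msg), len(rx_data))
    rx_data[:n] = msg[:n]
    buf.rd += 1
    return n


rx = RxBuffer(capacity=2)
out = bytearray(8)
empty_result = receive(rx, out)
ct_deliver(rx, b"hi")
read_result = receive(rx, out)
```

In C on real hardware, the pointer updates would additionally need appropriate memory ordering between the CT and the AT, which this sketch does not model.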
The CT running on the CC has its per-AC-pair communication bandwidth replenished
every memory regulation period (P). In each period, it iterates over all the TX buffers in the private
banks of all the ACs. For each TX buffer, based on the sender and receiver information
contained in the Message Schedule Queue, the CT copies the data
from the Outgoing Message Queue to the Incoming Message Queue of the corresponding
RX buffer in the private DRAM bank of the RX core. When the Message Schedule Queue
is empty, or the communication bandwidth for this particular TX buffer is exhausted, the
CT moves to the next TX buffer. After all the TX buffers have been processed, the CT
sleeps for the rest of the regulation period.
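One regulation period of the CT loop described above might look as follows in Python. The data layout (dictionaries of deques), the function names, and the cost model of one read plus one write per 64-byte transaction are assumptions made for this sketch; metadata accesses are not charged against the budget here.

```python
import math
from collections import deque
from typing import Deque, Dict, Tuple

TRANSACTION_BYTES = 64
Message = Tuple[int, int, bytes]  # (txTaskID, rxTaskID, payload)


def transfer_cost(payload: bytes) -> int:
    """Memory transactions needed to move one payload: a read and a write per 64 bytes."""
    return 2 * math.ceil(len(payload) / TRANSACTION_BYTES)


def ct_run_period(
    tx_buffers: Dict[Tuple[int, int], Deque[Message]],
    rx_buffers: Dict[Tuple[int, int], Deque[bytes]],
    t_c: int,
) -> int:
    """Serve every TX buffer once with a fresh per-pair budget; return transactions used."""
    used = 0
    for queue in tx_buffers.values():
        budget = t_c  # per-AC-pair budget, replenished each period
        # Move on when the queue is empty or the next message no longer fits.
        while queue and transfer_cost(queue[0][2]) <= budget:
            tx_task, rx_task, payload = queue.popleft()
            rx_buffers.setdefault((tx_task, rx_task), deque()).append(payload)
            budget -= transfer_cost(payload)
        used += t_c - budget
    return used  # the CT would now sleep until the next regulation period


# Two messages from AC1 to AC2, nothing queued in the other direction.
tx_bufs = {(1, 2): deque([(10, 20, b"x" * 64), (10, 20, b"y" * 128)]), (2, 1): deque()}
rx_bufs: Dict[Tuple[int, int], Deque[bytes]] = {}
used_first = ct_run_period(tx_bufs, rx_bufs, t_c=4)
pending_after_first = len(tx_bufs[(1, 2)])
used_second = ct_run_period(tx_bufs, rx_bufs, t_c=4)
```

With a per-pair budget of 4 transactions, the first period moves only the 64-byte message; the 128-byte message waits for the next replenishment.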
CHAPTER 7: EVALUATION
This chapter provides a detailed evaluation of our proposed CCM and compares it with the
CBC where no private bank is enforced, as described in Chapter 4. We start by describing
the experimental setup and the benchmarks that we have used for evaluation. We then
analytically show how different memory budget assignments (Qp) impact the WCET. Next,
we evaluate the communication bandwidth of the implemented CT based on our platform.
Using the analysis discussed in Chapter 5, we then compare and show the benefit of CCM
over the CBC approach for the considered benchmarks. Finally, we show that the proposed
analysis for CCM provides a safe WCET bound.
7.1 EXPERIMENTAL SETUP AND BENCHMARKS
Our experimental setup considers the P4080 platform from Freescale that employs eight
Power Architecture e500mc cores operating at frequencies up to 1.5 GHz. Each core in the
system has its dedicated 32KB I/D Level 1 caches and a 128KB Level 2 backside cache. A
2MB shared Level 3 cache is also present. As discussed in Chapter 3, we cannot formally
prove that the considered HW platform is timing compositional; as is the case with most
available COTS platforms, no precise microarchitectural model is available. Therefore, in
this paper, we rely on an experimental evaluation based on measurements to show that the
derived WCET provides a safe bound. If such an architectural model were available, we argue
that the proposed communication scheme and analysis could still be applied after deriving
task parameters (Hi and ei) based on static program analysis [31].
A Linux-3.0.6 operating system that supports resource partitioning is installed on the evaluation
platform. The task under analysis and all the stressing tasks are statically allocated
to cores using sched_setaffinity(). Only one DDR controller is enabled. For the proposed
CCM, PALLOC [5] is enabled and configured so that each AC can only access a single
private DRAM bank, while the CC can access all the DRAM banks. We use MemGuard [23]
to enforce memory regulation on every core in the system, and every core is regulated with
the same memory budget. The memory regulation period is configured to 1ms, and the
memory regulation budget is 2520 memory transactions for each core. The 2520 memory
transactions per MemGuard period correspond to a memory bandwidth of 153MB/s per
core.
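The correspondence between the budget and the bandwidth can be checked with a short calculation. The sketch assumes 64-byte memory transactions, as in the motivating example of Chapter 6, and reports the result both in decimal megabytes and in units of 2^20 bytes; the second reading is an assumption made here because it is the one that reproduces the 153MB/s figure.

```python
def budget_to_bandwidth(q_p: int, period_ms: float = 1.0, transaction_bytes: int = 64):
    """Convert a per-period transaction budget into (MB/s, MiB/s)."""
    bytes_per_second = q_p * transaction_bytes * (1000.0 / period_ms)
    return bytes_per_second / 1e6, bytes_per_second / 2**20


mb_per_s, mib_per_s = budget_to_bandwidth(2520)
print(f"{mb_per_s:.2f} MB/s, {mib_per_s:.2f} MiB/s")
```

Under these assumptions, 2520 transactions per 1ms period amount to 161.28 decimal MB/s, or about 153.8 when counted in units of 2^20 bytes.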
For the proposed CCM, we consider a system with a single CC and 7 ACs. The WCET
is obtained by using the equations derived in Chapter 5. The parameters used to compute
the WCET are listed in Table ??. The worst-case scenario for CCM is when the task under
analysis runs on an AC, while there are 6 interfering ACs, each issuing Qp memory requests
towards its own DRAM bank during every memory regulation period.
Meanwhile, a periodic CT is deployed on the CC; it accesses the private banks of each AC and
generates communication memory traffic of Tc in every MemGuard regulation period. The
CC also uses its remaining memory budget to stress its own bank.
In order to evaluate the system, we use San Diego Vision Benchmark Suite (SD-VBS) [34].
We use the on-chip event processing unit (EPU) provided by P4080 to profile the memory
access counts (Hi) of each task under analysis. We measured the solo run time (ei) and
memory access count of the benchmarks with cif (352x240) input resolution on the evaluation
platform. For these measurements, the memory regulation budget is set to a number that is larger than the
available bandwidth, so no regulation is enforced. The measured parameters are listed in
Table 7.1; each value is the maximum observed over 200 instances on the platform.
Table 7.1: SD-VBS Benchmark Solo Measurements
Benchmark           ei (ms)   Hi        Memory access rate (1/ms)
disparity           318       4448615   13989
localization        244       668       3
mser                44        719914    16362
sift                521       2668107   5121
stitch              293       1588683   5422
svm                 290       214138    738
texture synthesis   25        42342     1694
tracking            176       289821    1647
7.2 TASK WCET WITH DIFFERENT MEMORY REGULATION BUDGET
ASSIGNMENT
As discussed in Chapter 5, in a memory regulated system, the WCET of a task depends
on the memory budget assigned to the core. When the memory budget is small, the task
tends to suffer more memory regulation and less memory contention from the interfering
cores, whereas if the memory budget is large, the task tends to suffer memory contention
from the interfering cores rather than regulation [35]. Depending on the characteristics
of the tasks, the optimal memory budget assignment can differ between tasks.
We analyzed the WCET of the tasks in the SD-VBS benchmark with various memory budget
assignments for the proposed CCM. Figure 7.1 shows the WCET of three selected tasks with
various memory budget assignments. Every Qp assignment is a multiple of 84 because there
are 42 communication pairs among the 7 ACs, and each transaction comprises a read and
a write memory access. In this experiment, we assume Tc = Qp for all the Qp assignments
so that the amount of intra-bank inter-core interference (Aintra) is maximized and thus the
bound is safe. Note that in real-world implementations, Tc = Qp cannot be achieved since
the CC uses some of its memory transactions for local computation and
OS-related overhead. More details are described in the next section.
The inverted bell curves of disparity and tracking show that the WCET of the tasks under
analysis increases rapidly when the assigned memory budget is very small or very large. The
budget assignment that yields the smallest WCET differs between the two tasks:
disparity has the smallest WCET when the memory
budget is around 2520, while tracking has the smallest WCET when the memory
budget is around 1344. localization is a special case in Figure 7.1. It is extremely computation
intensive, with an average memory access rate of 3 accesses per millisecond. The curve shows that
its WCET is smaller when the memory budget is smaller, since it is very unlikely to
exhaust the memory budget and get regulated even with an extremely small memory
budget.
Figure 7.1: WCET with different memory regulation budget assignments
A memory budget assignment that produces relatively small WCETs can be found exper-
imentally. For example, authors in [36] discussed how to obtain better system performance
by assigning uneven memory regulation budgets to different cores. The development of a
near-optimal memory budget assignment algorithm is beyond the scope of this paper.
For our experimental and analytical results, we set the per-core memory regulation budget
(Qp) to 2520, which corresponds to the minimum guaranteed bandwidth used in
previous research [4] and provides a reasonable WCET for the benchmarks we evaluated.
7.3 THROUGHPUT OF THE CT
We consider the system parameters in Section 7.1 for the P4080 platform
with CCM. When assigning Qp = 2520 to the CC in our implementation, some of the memory
transactions are used by the CC to manage OS-related overhead. Table 7.2 summarizes
the distribution of Qp on the CC. With the CT not deployed on the CC, we find that on average
604 memory transactions on the CC within a memory regulation period are used to deal with
OS-related overhead. This indicates that in our implementation the maximum value of Tc
available to the CT is Qp − 604 = 1916. In our evaluation, we pick Tc = 1848 because
it is the largest multiple of 84 that does not exceed the maximum available Tc. Using a
communication budget of 1848, the actual number of memory transactions used to move
data between different pairs of ACs is 1596. This means that around 13.6 percent of the memory
transactions issued by the CT are used in dealing with metadata. The
1596 memory transactions per memory regulation period can move data at a rate of 389 Mbps between all pairs
of ACs.
Table 7.2: Budget Distribution on CC
Total Budget Assignment (Qp)      2520
Average OS Overhead               604
Communication Budget (Tc)         1848
Metadata Overhead (Percentage)    13.6
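The figures in this section can be reproduced with a short calculation. The reading used for the data rate is an assumption made here: each moved 64-byte transaction is taken to cost one read and one write, the period is 1ms, and Mbps is counted in units of 2^20 bits. This is one interpretation that matches the reported 389 Mbps, not a statement of how the number was originally obtained.

```python
TRANSACTION_BYTES = 64
PAIR_GRANULARITY = 84  # 42 AC pairs, one read and one write each


def largest_multiple_at_most(limit: int, step: int) -> int:
    """Largest multiple of step that does not exceed limit."""
    return (limit // step) * step


q_p, os_overhead = 2520, 604
t_c_max = q_p - os_overhead                                # 1916 transactions left for the CT
t_c = largest_multiple_at_most(t_c_max, PAIR_GRANULARITY)  # chosen communication budget

data_transactions = 1596  # transactions that actually move payload (from the text)
metadata_pct = 100 * (t_c - data_transactions) / t_c

# Assumed reading: half of the data transactions are reads, half are writes.
payload_bytes_per_ms = (data_transactions // 2) * TRANSACTION_BYTES
rate_mbit_per_s = payload_bytes_per_ms * 8 * 1000 / 2**20

print(t_c, round(metadata_pct, 1), int(rate_mbit_per_s))
```

Under these assumptions the calculation gives Tc = 1848, a metadata share of about 13.6 percent, and a payload rate of roughly 389 Mbps.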
7.4 CCM AND CBC
In this section, we compare the WCET of tasks deployed on the target P4080 platform
with our proposed CCM versus the one with CBC, which does not employ private banks.
The WCET of tasks in the CCM is obtained by assigning a MemGuard budget of Qp to
all the cores.
For simplicity, all the cores are assigned a budget of Qp = 2520 in CCM. The six interfering
ACs stress their private banks with their assigned budget. Out of the total budget of
Qp = 2520, the CC uses a communication budget of Tc = 1848 to move data between all
the pairs of banks used by the ACs, while the remaining memory budget of Qp − Tc = 672
is used by the CC to access its own private bank. The WCET of the task under analysis is
measured on the seventh AC, which runs different benchmarks from SD-VBS.
For CBC, the WCET is obtained by considering the following worst case. The task under
analysis runs on one AC, while 6 memory-intensive interfering ACs stress the memory, each
with all its memory budget. The ACs are assigned the same memory budget (Qp = 2520)
as in the CCM experiment. The CC is assigned a memory budget of 0 and stays idle. Since
the CC is not required in the CBC scheme, we keep it idle to get a fair comparison between
the two approaches.
Since there is no private bank enforced in the CBC, the worst-case scenario corresponds
to the case in which, during the busy interval, the memory accesses of all the active cores
are issued to the same DRAM bank and all the interfering memory accesses are considered to
cause intra-bank contention delay. From Figure 7.2 we can see that for all the benchmarks, CCM
provides a smaller WCET than CBC, with an average WCET reduction of 56%. For
the localization benchmark, the WCET on CCM is reduced by 65% compared to CBC.
Figure 7.2: WCET of tasks in CBC and CCM
7.5 ANALYTICAL BOUND AND MEASUREMENT
In this section, we show that the proposed WCET bound for CCM is safe for the target
platform. We configure the PALLOC and MemGuard to the parameters as described in Sec-
tion 7.1. For the 6 interfering ACs, we run a memory intensive bandwidth [3] benchmark to
stress the private banks of the ACs. We also deploy a CommTask on the CC to periodically
access the private DRAM banks of all ACs to stress the memory controller with Tc = 1848
communication traffic at every regulation period.
The analytical and measured WCET of CCM, normalized to the solo run time of the SD-VBS
benchmarks, are shown in Figure 7.3. The results in Figure 7.3 show that the analyzed WCET safely
bounds the execution time measured on the platform.
Figure 7.3: Analyzed and measured WCET
CHAPTER 8: CONCLUSION
In this paper, we complete the strictly partitioned multi-core framework by bringing inter-core
communication into the picture. For our evaluation, we considered two communication
models, CBC and CCM. Compared to the CBC, where all the cores can access all
the DRAM banks, the CCM, where at most two cores access any DRAM bank, can help
improve the worst-case system performance. This approach provides tighter upper bounds
on the inter-core interference that can be easily factored into schedulability analysis. The
presented results show the gain of CCM over the CBC. Moreover, our approach
and model give a system-level perspective of how to move networked single-core processors
into a single multi-core architecture without breaking the hard real-time requirements that
need to be met within a single core.
REFERENCES
[1] FAA, “FAA position paper on multi–core processors, CAST-32A,” Federal Aviation
Administration, Tech. Rep.
[2] L. Sha, M. Caccamo, R. Mancuso, J.-E. Kim, M.-K. Yoon, R. Pellizzoni, H. Yun,
R. Kegley, D. Perlman, G. Arundale et al., “Single core equivalent virtual machines for
hard realtime computing on multicore processors,” Tech. Rep., 2014.
[3] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha, “Memguard: Memory band-
width reservation system for efficient performance isolation in multi-core platforms,” in
Real-Time and Embedded Technology and Applications Symposium (RTAS), 2013 IEEE
19th. IEEE, 2013, pp. 55–64.
[4] R. Mancuso, R. Pellizzoni, M. Caccamo, L. Sha, and H. Yun, “WCET(m) estimation
in multi-core systems using single core equivalence,” in Real-Time Systems (ECRTS),
2015 27th Euromicro Conference on, July 2015, pp. 174–183.
[5] H. Yun, R. Mancuso, Z. Wu, and R. Pellizzoni, “PALLOC: DRAM bank-aware memory
allocator for performance isolation on multicore platforms,” in Real-Time and Embedded
Technology and Applications Symposium (RTAS), 2014 IEEE 20th. IEEE, 2014, pp.
155–166.
[6] R. Mancuso, R. Dudko, E. Betti, M. Cesati, M. Caccamo, and R. Pellizzoni, “Real-time
cache management framework for multi-core architectures,” in Real-Time and Embedded
Technology and Applications Symposium (RTAS), 2013 IEEE 19th. IEEE, 2013, pp.
45–54.
[7] H. Kim, A. Kandhalu, and R. Rajkumar, “A coordinated approach for practical os-level
cache management in multi-core real-time systems,” in Real-Time Systems (ECRTS),
2013 25th Euromicro Conference on. IEEE, 2013, pp. 80–89.
[8] L. Liu, Z. Cui, M. Xing, Y. Bao, M. Chen, and C. Wu, “A software memory partition
approach for eliminating bank-level interference in multicore systems,” in Proceedings of
the 21st international conference on Parallel architectures and compilation techniques.
ACM, 2012, pp. 367–376.
[9] N. Kim, B. C. Ward, M. Chisholm, C. Y. Fu, J. H. Anderson, and F. D. Smith, “Attacking
the one-out-of-m multicore problem by combining hardware management with
mixed-criticality provisioning,” in 2016 IEEE Real-Time and Embedded Technology and
Applications Symposium (RTAS), April 2016, pp. 1–12.
[10] M. Chisholm, N. Kim, B. Ward, N. Otterness, J. Anderson, and F. Smith, “Reconciling
the tension between hardware isolation and data sharing in mixed-criticality, multi-
core systems,” in 2016 IEEE International Real-Time Systems Symposium (RTSS’16),
December 2016.
[11] L. Sha, R. Rajkumar, and S. S. Sathaye, “Generalized rate-monotonic scheduling theory:
A framework for developing real-time systems,” Proceedings of the IEEE, vol. 82, no. 1,
pp. 68–82, 1994.
[12] J.-E. Kim, M.-K. Yoon, R. Bradford, and L. Sha, “Integrated modular avionics (IMA)
partition scheduling with conflict-free I/O for multicore avionics systems,” in Computer
Software and Applications Conference (COMPSAC), 2014 IEEE 38th Annual. IEEE,
2014, pp. 321–331.
[13] H. Kim, D. De Niz, B. Andersson, M. Klein, O. Mutlu, and R. Rajkumar, “Bounding
memory interference delay in COTS-based multi-core systems,” in Real-Time and Embedded
Technology and Applications Symposium (RTAS), 2014 IEEE 20th. IEEE, 2014,
pp. 145–154.
[14] JEDEC, “DDR3 SDRAM Specification,” JESD79-3F, 2012.
[15] B. Akesson, K. Goossens, and M. Ringhofer, “Predator: a predictable SDRAM memory
controller,” in Proceedings of the 5th IEEE/ACM international conference on Hard-
ware/software codesign and system synthesis. ACM, 2007, pp. 251–256.
[16] L. Ecco, S. Tobuschat, S. Saidi, and R. Ernst, “A mixed critical memory controller using
bank privatization and fixed priority scheduling,” in Embedded and Real-Time Comput-
ing Systems and Applications (RTCSA), 2014 IEEE 20th International Conference on.
IEEE, 2014, pp. 1–10.
[17] S. Goossens, B. Akesson, and K. Goossens, “Conservative open-page policy for mixed
time-criticality memory controllers,” in Proceedings of the Conference on Design, Au-
tomation and Test in Europe. EDA Consortium, 2013, pp. 525–530.
[18] Y. Krishnapillai, Z. P. Wu, and R. Pellizzoni, “A rank-switching, open-row DRAM controller
for time-predictable systems,” in Real-Time Systems (ECRTS), 2014 26th Euromicro
Conference on. IEEE, 2014, pp. 27–38.
[19] Z. P. Wu, Y. Krish, and R. Pellizzoni, “Worst case analysis of DRAM latency in
multi-requestor systems,” in Real-Time Systems Symposium (RTSS), 2013 IEEE 34th. IEEE,
2013, pp. 372–383.
[20] J. Reineke, I. Liu, H. D. Patel, S. Kim, and E. A. Lee, “PRET DRAM controller: Bank
privatization for predictability and temporal isolation,” in Hardware/Software Code-
sign and System Synthesis (CODES+ ISSS), 2011 Proceedings of the 9th International
Conference on. IEEE, 2011, pp. 99–108.
[21] J. Nowotsch, M. Paulitsch, D. Bühler, H. Theiling, S. Wegener, and M. Schmidt, “Multi-core
interference-sensitive WCET analysis leveraging runtime resource capacity enforcement,”
in Real-Time Systems (ECRTS), 2014 26th Euromicro Conference on. IEEE,
2014, pp. 109–118.
[22] D. Dasari, B. Andersson, V. Nelis, S. M. Petters, A. Easwaran, and J. Lee, “Response
time analysis of COTS-based multicores considering the contention on the shared memory
bus,” in Trust, Security and Privacy in Computing and Communications (TrustCom),
2011 IEEE 10th International Conference on. IEEE, 2011, pp. 1068–1075.
[23] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha, “Memory access control
in multiprocessor for real-time systems with mixed criticality,” in Real-Time Systems
(ECRTS), 2012 24th Euromicro Conference on. IEEE, 2012, pp. 299–308.
[24] R. Pellizzoni, A. Schranzhofer, J.-J. Chen, M. Caccamo, and L. Thiele, “Worst case
delay analysis for memory interference in multicore systems,” in Proceedings of the Con-
ference on Design, Automation and Test in Europe. European Design and Automation
Association, 2010, pp. 741–746.
[25] S. Schliecker, M. Negrean, and R. Ernst, “Bounding the shared resource load for the
performance analysis of multiprocessor systems,” in Proceedings of the conference on
design, automation and test in Europe. European Design and Automation Association,
2010, pp. 759–764.
[26] B. Andersson, A. Easwaran, and J. Lee, “Finding an upper bound on the increase in
execution time due to contention on the memory bus in COTS-based multicore systems,”
ACM SIGBED Review, vol. 7, no. 1, p. 4, 2010.
[27] H. Yun, R. Pellizzoni, and P. K. Valsan, “Parallelism-aware memory interference delay
analysis for COTS multicore systems,” in Real-Time Systems (ECRTS), 2015 27th
Euromicro Conference on. IEEE, 2015, pp. 184–195.
[28] B. C. Ward, J. L. Herman, C. J. Kenna, and J. H. Anderson, “Making shared
caches more predictable on multicore platforms,” in 2013 25th
Euromicro Conference on Real-Time Systems (ECRTS). IEEE, 2013, pp. 157–167.
[29] V. Suhendra and T. Mitra, “Exploring locking & partitioning for predictable shared
caches on multi-cores,” in Design Automation Conference, 2008. DAC 2008. 45th
ACM/IEEE. IEEE, 2008, pp. 300–303.
[30] P. K. Valsan, H. Yun, and F. Farshchi, “Taming non-blocking caches to improve iso-
lation in multicore real-time systems,” in Real-Time and Embedded Technology and
Applications Symposium (RTAS), 2016 IEEE. IEEE, 2016, pp. 1–12.
[31] S. Hahn, M. Jacobs, and J. Reineke, “Enabling compositionality for multicore timing
analysis,” in Proceedings of the 24th international conference on real-time networks and
systems. ACM, 2016, pp. 299–308.
[32] J. Lehoczky, L. Sha, and Y. Ding, “The rate monotonic scheduling algorithm: Exact
characterization and average case behavior,” in Real Time Systems Symposium, 1989.,
Proceedings. IEEE, 1989, pp. 166–171.
[33] R. Pellizzoni and H. Yun, “Memory servers for multicore systems,” in Real-Time and
Embedded Technology and Applications Symposium (RTAS), 2016 IEEE. IEEE, 2016,
pp. 1–12.
[34] S. K. Venkata, I. Ahn, D. Jeon, A. Gupta, C. Louie, S. Garcia, S. Belongie, and M. B.
Taylor, “SD-VBS: The San Diego Vision Benchmark Suite,” in Workload Characterization,
2009. IISWC 2009. IEEE International Symposium on. IEEE, 2009, pp. 55–64.
[35] G. Yao, H. Yun, Z. P. Wu, R. Pellizzoni, M. Caccamo, and L. Sha, “Schedulability anal-
ysis for memory bandwidth regulated multicore real-time systems,” IEEE Transactions
on Computers, vol. 65, no. 2, pp. 601–614, 2016.
[36] R. Mancuso, R. Pellizzoni, N. Tokcan, and M. Caccamo, “WCET derivation under single
core equivalence with explicit memory budget assignment,” in LIPIcs-Leibniz International
Proceedings in Informatics, vol. 76. Schloss Dagstuhl-Leibniz-Zentrum fuer
Informatik, 2017.
