Toward time-predictable execution of multi-task real-time systems by Bui, Bach
  
 
TOWARD TIME-PREDICTABLE EXECUTION OF MULTI-TASK 
REAL-TIME SYSTEMS 
 
 
 
 
BY 
 
BACH DUY BUI 
 
 
 
 
DISSERTATION 
 
Submitted in partial fulfillment of the requirements 
for the degree of Doctor of Philosophy in Computer Science 
in the Graduate College of the 
University of Illinois at Urbana-Champaign, 2012 
 
 
Urbana, Illinois 
 
Doctoral Committee: 
 
Associate Professor Marco Caccamo, Chair and Director of Research 
Professor Lui Sha  
Professor Tarek Abdelzaher  
Professor Sanjoy Baruah, University of North Carolina 
Abstract
Guaranteeing time-predictable execution in real-time systems involves the management of not only
processors but also other supporting components such as cache memory, network on chip (NoC), and memory
controllers. These three components are designed to improve the system's computational throughput by
either bringing data closer to the processors (e.g., cache memory) or maximizing concurrency in moving data
inside the system (e.g., NoC and memory controllers). We observe that these components can be sources of
significant unpredictability in task execution if they are not operated in a deterministic manner. In particular,
our analysis and experiments in [6, 35] show that with the standard cache and memory controller sharing
mechanisms, the execution time of a task may be unpredictably extended by up to 33% and 44%, respectively,
in a single-core processor. We also show that analysis techniques and scheduling algorithms that have been
proposed to account for and/or mitigate this unpredictability often do not adequately address the problem at
hand. As a consequence, those techniques and algorithms can only guarantee real-time execution in systems with
under-utilized shared resources.
In this dissertation, we study the software and hardware infrastructure, optimization techniques and
scheduling algorithms that guarantee predictable execution in real-time systems that use cache memory,
network on chip (NoC), and memory controllers. The main challenge is how to guarantee system pre-
dictability in such a way that maximizes the benefits and the utilization of these components. We achieve
that by carefully analyzing both theoretical and practical assumptions in the use of these components and
deriving novel solutions based on this understanding. For cache memory, we propose the use of software-based
cache partitioning techniques and a real-time optimization method to minimize the system's real-time
utilization. The proposed solution yields better performance because it fully utilizes the available
cache area. For NoC scheduling, we propose novel scheduling algorithms that are designed to cope directly
with the unique assumption on resource sharing in NoC. As will be shown, in practical systems, these
scheduling algorithms can achieve near-optimal performance. For memory controllers, we propose a software
and hardware infrastructure and coscheduling algorithms that are used to control the accesses of
DMA-enabled peripherals to the main memory. The goal is to prevent these accesses from delaying tasks’
execution beyond the worst-case execution time while still maximizing the I/O throughput.
Table of Contents
Chapter 1 Introduction
1.1 Reference and Acknowledgment
Chapter 2 Cache Partitioning Techniques and Optimization for Real-time Systems
2.1 The Impact of Cache Interference
2.2 Cache Partitioning Techniques
2.3 Cache Partitioning Optimization
2.3.1 Terminology and assumptions
2.3.2 Genetic algorithm
2.3.3 Local Optimizer
2.3.4 Mutation Operator
2.3.5 Crossover Operator
2.4 Approximate Utilization Lower Bound
2.5 Evaluation
2.5.1 Methodology
2.5.2 Case Study
2.5.3 Evaluation with Simulated Systems
2.6 Conclusion
Chapter 3 Network-on-Chip Real-time Scheduling
3.1 Related Works
3.2 Real-time NoC Model
3.3 Real-time Scheduling for Ring-topology NoC
3.3.1 Scheduling Algorithms for Acyclic Transactions
3.3.2 Scheduling Algorithms for Cyclic Transaction Sets
3.3.3 Evaluation
3.3.4 Implementation
3.4 Real-time Scheduling for General NoC
3.4.1 Transaction Model
3.4.2 Real-time Scheduling for Acyclic Transaction Sets on General NoC
3.5 Implementation
3.6 Evaluation
3.7 Conclusion
Chapter 4 System Design for Predictable Memory Access
4.1 Peripheral Gate
4.2 I/O traffic and Tasks Coscheduling
4.2.1 Peripheral Load Evaluation
4.2.2 Cache Miss Measurement
4.2.3 I/O-inflicted Delay Analysis
4.2.4 Coscheduling Algorithm
4.3 Experimental Results
4.4 Related Work
4.5 Conclusions
Chapter 5 Conclusion
Bibliography
Chapter 1
Introduction
In the past few decades, the real-time system research community has been very successful in providing solid
theoretical foundation for the development of real-time systems. Various scheduling algorithms and analysis
techniques for guaranteeing the predictable execution of real-time systems have been introduced, which first
deal with a simple task model on single-core processors [26], then with more complex task models on
multi-core processors [4, 8, 52]. However, the development of modern processor architectures has resulted
in many new challenges. Among these challenges is the fact that guaranteeing predictable execution now
goes beyond processor scheduling: other shared resources in computer systems such as cache, Network-
on-Chip (NoC), and the memory controller can significantly affect the execution time of real-time tasks.
Nevertheless, the existing real-time system theory has not paid adequate attention to the effect that these
components have on predictable execution. As shown in Fig. 1.1, in order to increase execution speed
of competing concurrent tasks, besides using multiple processing elements, a typical modern computer
system also uses other supportive components. The following paragraphs describe these components and
the challenges they create with regard to real-time systems.
• Cache: a cache is a small high-speed memory used to buffer data exchanged between the processing
units and the main memory. Having been designed to maximize software development productivity
and software execution speed, the cache is shared between multiple tasks, and most of its operations are
automatic and hidden from the software. As a matter of fact, the sharing of a self-controlled
cache can greatly affect the predictability of real-time systems. We have shown in [6] that in
a typical real-time embedded system, unsupervised cache sharing can increase task execution time by
up to 33% in a single-core processor. This is because the preemptive execution of different tasks causes
Figure 1.1: Computer Systems
cache lines of a task to be invalidated by the execution of another under the common assumption that
cache is globally shared. Due to the random nature of cache access, without a predictable cache-sharing
mechanism, identifying cache interference between tasks often results in very pessimistic
estimates of task worst-case execution time.
• Network-on-Chip (NoC): a NoC is a high-speed interconnect connecting cores with each other and with
peripherals. In order to support an increasing number of cores, the focus of NoC design has been on
improving NoC scalability and reducing transaction latency. This design focus results in NoCs with
high degrees of transaction concurrency and a complex resource-sharing model. The performance of a
real-time system built on processors with a NoC depends both on the scheduling of real-time tasks on
cores and on the scheduling of real-time data transactions on the NoC. While the theory of the former
is mature, that of the latter is not. Scheduling real-time data transactions on a NoC while
maximizing NoC utilization poses new challenges that are different from those of traditional
real-time systems. In particular, although the problem has the form of multiple-resource scheduling, it
cannot be formulated as a multi-processor scheduling problem because a transaction may use multiple
resources, i.e., physical links, at a time.
• Memory Controller: the memory controller is the component that arbitrates main memory accesses
between cores and peripherals. The memory controller is designed to maximize operation concurrency
in the system. In particular, Direct-Memory-Access (DMA)-enabled peripherals can access
the main memory while computation tasks are running on the CPU cores. Like the cache, the
memory controller also operates without software supervision, which means its operations are not
explicitly visible from the software perspective. As a consequence, the effect of the memory controller
on the execution of real-time systems is often hard to control. In [35], we observed that, due to
contention for access to the shared infrastructure between a single-core processor and a DMA-enabled
peripheral, a task’s execution time may be unexpectedly extended by up to 44%. Note that the processor is
stalled while missed cache lines are being reloaded from the main memory. This stall time can further
increase (affecting task execution time) if the memory controller is busy serving memory accesses of
peripherals.
Given the unprecedented challenges raised by cache, NoC, and the memory controller, the major goal of
the research in this dissertation is to provide a sound theoretical framework and system designs to cope with
the unpredictability caused by these components in real-time systems. In the following paragraphs, we
summarize the main contributions of this dissertation.
As mentioned above, managing cache sharing in real-time systems is crucial for system predictability.
However, full software control of cache accesses is not desirable since it would require significant software
and hardware modification. Also, giving software designers full control of cache accesses may be
counterproductive as they would have to pay more attention to low-level operations. We believe a viable
alternative is a cache-sharing mechanism that requires minimal change to existing computer systems, yet provides
more efficient real-time performance. In particular, the sharing mechanism should provide better real-time
predictability than existing systems without compromising software execution speed too much. In
Chapter 2, we analyze cache interference in several practical real-time systems. The analysis shows that
the effective cache usage of real-time tasks is often much smaller than their memory footprint and than the
total cache in the system. However, with the standard cache-sharing mechanism, the cache areas of all tasks may
fully overlap, which results in unnecessarily pessimistic worst-case execution times for the tasks. Given this
analysis, we propose a cache-partitioning model which allows software designers to specify how cache is
shared between real-time tasks. This cache-sharing mechanism can be implemented on existing computer
systems and requires little change in the current software development practice and system architecture. We
also propose an optimization method which identifies how much cache should be assigned to a certain task.
Our simulation results and case studies show a counter-intuitive result: giving tasks less cache but
explicitly controlling how the cache is shared can actually improve the performance of real-time systems in
terms of system utilization.
Real-time scheduling on Network-on-Chip (NoC) is a new and challenging problem, mainly for
two reasons: 1) NoC supports high degrees of transaction concurrency; and 2) the pattern in which transactions
share multiple resources is different from what has been considered in traditional real-time theory. In
particular, a transaction in NoC, which spans multiple physical links, uses multiple resources simultaneously,
while in processor scheduling, a task uses only one processor at a time. Existing works [44, 43, 24, 3] that
extend multi-processor real-time scheduling algorithms and analysis to NoC often do not fully consider
the topological relationships between transactions. Not surprisingly, these proposals can only guarantee
real-time deadlines in systems with low NoC utilization, especially when a transaction overlaps multiple
transactions that do not overlap each other (i.e., non-transitive transaction dependence). This is because
non-transitive dependence is not part of the multi-processor scheduling model. Scheduling algorithms that
achieve higher utilization will obviously have
to take into account the topological dependence between transactions. However, a general algorithm of this
type, i.e., one that works in all cases, is intractable. In Chapter 3, we first consider several special cases of
transaction topologies and propose efficient algorithms for them. We show that the proposed algorithms
have near-optimal performance, which surpasses that of existing works. The better performance of these
algorithms comes precisely from the fact that they are able to exploit the topological dependence between
transactions. These algorithms are then extended to some transaction sets on NoCs with general topologies.
The extension is still an efficient algorithm and has competitive performance.
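The notion of non-transitive transaction dependence can be illustrated with a minimal sketch. It assumes a hypothetical unidirectional ring of six nodes in which link i connects node i to node i+1; the node count, the routing direction, and the names `ring_links` and `overlaps` are illustrative assumptions, not the model of Chapter 3. Each transaction is represented by the set of physical links it occupies, and two transactions conflict when these sets intersect.

```python
from itertools import combinations

def ring_links(src: int, dst: int, n_nodes: int) -> frozenset:
    """Links used by a clockwise transaction from src to dst on a ring.

    Link i is assumed to connect node i to node (i + 1) % n_nodes.
    """
    links = set()
    node = src
    while node != dst:
        links.add(node)
        node = (node + 1) % n_nodes
    return frozenset(links)

def overlaps(a: frozenset, b: frozenset) -> bool:
    """Two transactions conflict if they share at least one physical link."""
    return not a.isdisjoint(b)

# Hypothetical 6-node ring with three transactions.
N = 6
t1 = ring_links(0, 2, N)  # uses links {0, 1}
t2 = ring_links(1, 4, N)  # uses links {1, 2, 3}
t3 = ring_links(3, 5, N)  # uses links {3, 4}

named = [("t1", t1), ("t2", t2), ("t3", t3)]
conflicts = {
    (a, b): overlaps(x, y) for (a, x), (b, y) in combinations(named, 2)
}
print(conflicts)
```

In this example t2 conflicts with both t1 and t3, while t1 and t3 share no link and could be served concurrently. A model that treats the NoC as a pool of identical processors has no way to express this relationship.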
In state-of-the-art computer architectures, to maximize throughput, many active components in
the system, such as hard drives and network cards, are allowed to directly transfer data to or from memory
without the control of software. The memory controller is used to manage these autonomous memory
accesses. This controller operates on a transaction-by-transaction basis; therefore, it can only make locally
optimal decisions. As a result of these sub-optimal operations and the random nature of memory accesses
(i.e., those from cache misses and peripherals), task execution time may be significantly extended under
certain memory access patterns. Offline execution time estimation [35] that takes this effect into account can
be used to predict the worst case. However, the estimate is often very pessimistic and inefficient because
worst-case situations may significantly affect the task execution but do not occur very often. In Chapter
4, we propose a hardware and software system to regulate the memory accesses of PCI-based peripherals.
This system allows us to dictate when peripherals can transfer data in and out of the memory so that they have
minimal effect on task execution. An algorithm is also proposed to co-schedule peripherals and real-time
tasks. The algorithm strives to maximize the I/O traffic under the constraint that the run-alone worst-case
execution time of every task is not violated.
1.1 Reference and Acknowledgment
This work would not have been possible without the help of my advisers and the contributions of my colleagues. I
would like to express special appreciation to my adviser Prof. Marco Caccamo for all of the advice he has
given me. That advice has helped me not only to shape my research direction but also to sharpen my
research skills. Through many rough discussions with Marco, I have been able to recognize my weaknesses
and to improve my critical thinking. I also want to give special thanks to Prof. Lui Sha. His deep
knowledge of computer systems has always been a source of excellent research ideas and of insight into how to
tackle hard problems.
I am also grateful to Prof. Tarek Abdelzaher and Prof. Sanjoy Baruah for serving on my thesis committee
and for their kind comments, which helped make my research complete.
Last but not least, I would like to thank my fellow PhD student Rodolfo Pellizzoni. It has been a great pleasure to
work with him on many of my research projects. His exceptional intelligence has helped me to significantly
improve the quality of my work. His careful reviews and insightful comments were crucial for the success
of our publications.
Note that material used in the dissertation previously appeared in the following publications.
• Bach D. Bui, Rodolfo Pellizzoni, Marco Caccamo, Real-time Scheduling of Concurrent Transactions
in Multi-domain Ring Buses, IEEE Transactions on Computers, August, 2011.
• Bach D. Bui, Rodolfo Pellizzoni, Marco Caccamo, A Slot-based Real-time Scheduling Algorithm for
Concurrent Transactions in NoC, Proceedings of the 17th IEEE International Conference on
Embedded and Real-Time Computing Systems and Applications, 2011.
• Bach D. Bui, Deepti K. Chivukula, Marco Caccamo, and Lui Sha, Real-time Communication for
Multicore Systems with Multi-domain Ring Buses, Proceedings of the 16th IEEE International
Conference on Embedded and Real-Time Computing Systems and Applications, 2010.
• Bach D. Bui, Marco Caccamo, Lui Sha, and Joseph Martinez, Design and Evaluation of a Cache
Partitioned Environment for Real-Time Embedded Systems, Proceedings of the 14th IEEE International
Conference on Embedded and Real-Time Computing Systems and Applications, 2008. (Best paper
award)
• R. Pellizzoni, Bach D. Bui, M. Caccamo, and L. Sha, Coscheduling of CPU and I/O Transactions
in COTS-based Embedded Systems, Proceedings of the 29th IEEE Real-Time Systems Symposium,
2008.
These papers are copyright of the Institute of Electrical and Electronics Engineers (IEEE). Permission to
reprint/republish this material for advertising or promotional purposes or for creating new collective works
for resale or redistribution must be obtained from the IEEE by writing to pubs-permissions@ieee.org, with
the exception of ProQuest Information and Learning, which is permitted to supply single copies of the
dissertation. By choosing to view this material, you agree to all provisions of the copyright laws protecting
it.
Chapter 2
Cache Partitioning Techniques and
Optimization for Real-time Systems
In this chapter, we detail the issue of inter-task cache interference in real-time systems through
experiments and analysis. As will be shown, this interference may inflate the worst-case system
utilization if the cache shared between tasks is not predictably managed. We will then propose and analyze
the use of cache partitioning techniques for managing cache sharing. The proposed technique is evaluated
through case studies and simulated experiments.
2.1 The Impact of Cache Interference
To understand the impact of cache interference in real-time systems, we conducted an experiment with a
real system. In our test-bed we used a DELL machine with an Intel Pentium 4 1.5GHz processor, a memory bus
speed of 400MHz, and a 256KB L2 cache. LinuxRK [32], a real-time kernel developed at CMU, was used as the
test-bed operating system. To measure the effect of multi-task cache interference, we used a fixed priority
scheduler and two processes running at different priorities: 1) the low priority process $\tau_{low}$ has two different
pairs of execution times¹ and periods: (5ms, 10ms) and (10ms, 20ms); 2) the high priority one $\tau_{high}$ has
three different pairs of execution times and periods: (1ms, 2ms), (2ms, 4ms), (5ms, 10ms). For each
experiment, there were two runs: during the first one the high priority task limited the number of operations
involving a memory access; during the second one, the high priority task was run such that it invalidated as
¹ Note that these execution times were measured in an experimental setup with very limited cache interference.
                       $\tau_{high}$ (ms)
$\tau_{low}$ (ms)      (1, 2)     (2, 4)     (5, 10)
(5, 10)                13.65%     6.1%       -
(10, 20)               13.6%      6.15%      2.35%
Table 2.1: Task utilization increment.
many cache lines as possible of the low priority process. The execution time of $\tau_{low}$ was measured during
both runs and $\tau_{low}$’s utilization increment was computed for each experiment. The results when using an
MPEG decoder application as the low priority process are shown in Table 2.1. As can be seen, the task
utilization increment can be as high as 13% even though the system has a small L2 cache of size 256KB. It
is worth noticing that the utilization increment of $\tau_{low}$ is independent of its period (assuming constant task
utilization) while it is roughly inversely proportional to $\tau_{high}$’s period. The results of this simple experiment
are in agreement with the case study and extensive simulations of Section 2.5; in fact, a 13% task utilization
increment can occur even when $\tau_{low}$ is suffering a rather limited number of preemptions due to a higher
priority task.
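The two trends read off Table 2.1 can be checked with a short sketch. The values below are copied from the table; the dictionary layout and the function name `scaled_by_high_period` are illustrative assumptions. If the increment were exactly inversely proportional to the high priority period, the product of the two would be constant.

```python
# Utilization increments (%) of the low priority task, copied from Table 2.1.
# Keys: (low priority period in ms, high priority period in ms).
increment_pct = {
    (10, 2): 13.65, (10, 4): 6.1,
    (20, 2): 13.6, (20, 4): 6.15, (20, 10): 2.35,
}

def scaled_by_high_period(data):
    """Return increment * high priority period for each table entry."""
    return {key: pct * key[1] for key, pct in data.items()}

products = scaled_by_high_period(increment_pct)
for (low_p, high_p), value in sorted(products.items()):
    print(f"low period {low_p} ms, high period {high_p} ms: {value:.1f}")
```

For the reported values the product stays within a fairly narrow band, and entries that differ only in the low priority period are nearly identical, which is consistent with the observations in the text.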
By analyzing a practical real-time system, we can also show the significant increase
in task worst-case execution time (WCET) due to cache interference. Consider the PowerPC processor
MPC7410, a popular embedded processor used in real-time systems, which has a 2MB two-way set-associative
L2 cache (see Table 2.2 for further details). The time taken by the system to reload the whole L2 cache
is about 655µs. This reloading time directly affects task execution time. A typical partition size of an
avionic system compliant with the ARINC-653 standard [1] can be as small as 2ms. Under this scenario,
the execution time increment due to cache interference can be as big as 655µs/2ms ≈ 33%. In general,
the effect of cache interference on task execution time is directly proportional to cache size and CPU clock
frequency and inversely proportional to memory bus speed. It is worth noticing that since CPU speed,
memory bus speed, and cache size are constantly increasing in modern computer architectures, it is unlikely
that this problem will be less severe in the near future.
The above impact analysis has shown that it is crucially important in real-time system design to have
reliable techniques to calculate the worst-case cache interference. Such a technique has been proposed by
Ramaprasad et al. in [39, 38] for fixed-priority scheduling. With the standard cache-sharing mechanism,
this technique assumes that in the worst case the cache areas of all tasks fully overlap. However, embedded system
Processor                                    MPC7410
CPU speed                                    1000 MHz
L2 cache                                     2MB two-way set associative
Memory bus speed                             125 MHz
Instruction/Data L1 miss + L2 hit latency    9/13 CPU-cycles
Memory access latency                        17 memory-cycles / 32 bytes
Table 2.2: PowerPC configuration
applications [14] usually have a smaller memory footprint (i.e., from 50KB to 250KB) than the typical size of the
cache memory (i.e., 1MB to 4MB). Therefore, the cache areas of different tasks do not necessarily overlap.
Furthermore, let us define the cache size of an application to be the smallest size of the cache needed by the
application to achieve its minimum execution time. It has been shown in [14] that the cache size of a task is usually
much smaller than its memory footprint. Our measurements with experimental avionic applications also
support this argument. Table 2.3 shows the memory footprint and cache size of all tasks in our experimental
avionic system. There are eight tasks in total, and all of them have cache sizes smaller than their memory
footprints, in most cases several times smaller. As a consequence, assuming that the cache areas of all tasks fully
overlap results in unnecessarily pessimistic WCET estimation.
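The relationship between footprint and cache size can be checked directly from the values reported in Table 2.3. The sketch below is only an illustration: the 2MB total is taken from Table 2.2 and is an assumption here, since the text does not state that the avionic tasks run on that exact platform.

```python
# (memory footprint in KB, cache size in KB) per task, copied from Table 2.3.
tasks = {
    1: (8000, 508), 2: (33000, 644), 3: (668, 104), 4: (9000, 416),
    5: (280, 28), 6: (140, 60), 7: (28, 8), 8: (230, 216),
}

# Assumption: a 2MB L2 cache as in Table 2.2.
L2_CACHE_KB = 2048

ratios = {tid: footprint / cache for tid, (footprint, cache) in tasks.items()}
total_cache_kb = sum(cache for _, cache in tasks.values())

for tid, ratio in ratios.items():
    print(f"task {tid}: footprint / cache size = {ratio:.1f}")
print(f"sum of cache sizes: {total_cache_kb} KB of {L2_CACHE_KB} KB")
```

Under this assumption the cache sizes of all eight tasks add up to less than the total cache, so non-overlapping partitions for every task would fit, whereas the fully overlapping worst case charges each task for interference it need not suffer.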
In order to improve the system efficiency with respect to cache interference, we propose using cache
partitioning techniques to eliminate or minimize inter-task cache interference. In the next sections,
we will describe two different approaches to partition cache and an optimization method to identify the size
for cache partitions that minimizes system utilization.
2.2 Cache Partitioning Techniques
The cache partitioning techniques can be largely classified into two main types: hardware-based techniques
and software-based ones.
SMART, proposed by Kirk [19], is a typical hardware-based cache partitioning technique. In SMART, the cache
memory is divided into equal-sized small segments and one large segment referred to as the shared pool.
Task    Memory footprint (KB)    Cache size (KB)
1       8000                     508
2       33000                    644
3       668                      104
4       9000                     416
5       280                      28
6       140                      60
7       28                       8
8       230                      216
Table 2.3: Task parameters
The large segment is shared by non-critical tasks while the small ones are dedicated to real-time tasks or
groups of real-time tasks. Hardware-based cache partitioning has the benefit of being transparent to higher
layers, thus requiring little software modification. However, its major disadvantages, such as having only
fixed partition sizes and requiring custom-made hardware, make software-based approaches better choices in
practice.
The idea of software-based cache partitioning was first proposed by Wolfe in [50]. By means
of software, the code and data of a task are logically restricted to only those memory portions that map into the
cache lines assigned to the task. In essence, if the task’s memory footprint is larger than its cache partition, its
code and data must reside in memory blocks that are regularly fragmented throughout the address space. In
[25], Liedtke extended Wolfe’s idea by exploring the use of the operating system (OS) to manage cache memory.
By mapping virtual to physical memory, an operating system determines the physical addresses of a process,
and thus also its cache locations. In contrast, Mueller [30] investigated using compilers to assign
application code to physical memory. During the compilation process, code is broken into blocks, which are
then assigned to appropriate memory portions. Since the code address space is no longer linear, the compiler
has to add branches to skip over the gaps created by code reallocation. Based on the observation that tasks at
the same priority level are scheduled non-preemptively with respect to each other, the author suggested that
all tasks can be accommodated by using a number of partitions that is no more than the number of priority
levels. The main disadvantage of software-based partitioning techniques is that they tend to have larger
partition granularity than the hardware-based ones. For example, a technique that relies on OS support has
a minimum partition size of one page (e.g., 4KB). However, a software-based technique is much easier
to implement in practice and allows much more flexibility in terms of cache partition configuration than a
hardware-based one. Furthermore, since the effective memory footprints of practical embedded applications
usually span multiple kilobytes, the partition granularity problem is often not a significant issue. For these
reasons, in this research, we advocate the use of software-based OS-controlled techniques like the one in [25].
In the following sections, we will focus on optimizing the software-based cache partitioning mechanism to
maximize real-time system schedulability.
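The OS-controlled approach can be sketched as follows. This is a minimal illustration of the idea commonly known as page coloring, assuming a physically indexed cache whose set index is taken from consecutive address bits; it is not the mechanism of [25] itself, and the function names and the `ways` parameter are hypothetical. With `ways=1` the number of page-sized units for a 2MB cache and 4KB pages is 512; for a set-associative cache the number of distinct colors would be smaller.

```python
def num_colors(cache_bytes: int, page_bytes: int, ways: int = 1) -> int:
    """Number of distinct page colors (page-sized cache partition units)."""
    return cache_bytes // (page_bytes * ways)

def page_color(phys_addr: int, cache_bytes: int, page_bytes: int,
               ways: int = 1) -> int:
    """Color of the physical page that contains phys_addr."""
    return (phys_addr // page_bytes) % num_colors(cache_bytes, page_bytes, ways)

def pages_for_partition(colors: set, n_pages: int, cache_bytes: int,
                        page_bytes: int, ways: int = 1) -> list:
    """Physical page frame numbers whose color belongs to the partition."""
    total = num_colors(cache_bytes, page_bytes, ways)
    return [pfn for pfn in range(n_pages) if pfn % total in colors]

CACHE = 2 * 1024 * 1024  # 2MB cache
PAGE = 4 * 1024          # 4KB pages

print(num_colors(CACHE, PAGE))                          # 512
print(pages_for_partition({0, 1}, 1100, CACHE, PAGE))   # [0, 1, 512, 513, 1024, 1025]
```

A task restricted to colors 0 and 1 may only receive the page frames listed by `pages_for_partition`, which shows why its code and data end up in blocks that are regularly spaced across the physical address space.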
2.3 Cache Partitioning Optimization
In this section, we formulate the partitioning problem as an optimization problem (minimizing worst-case
utilization) whose solution is expressed by the size of each cache partition and the assignment of tasks to
partitions. Since the problem is NP-hard, a genetic algorithm is used to find a near-optimal solution, and an
approximate utilization lower bound is presented as a comparison metric for evaluating the
effectiveness of our genetic algorithm.
2.3.1 Terminology and assumptions
We consider a single-processor multi-tasking real-time system $S$ as a pair $S = \langle T, K^{size} \rangle$, where $T$ is a
task set of size $N$: $T = \{\tau_i : i \in [1, N]\}$, and $K^{size}$ is the total number of cache partition units available in the
system. Let $\delta$ be the size of a cache partition unit. Note that the value of $\delta$ depends on the cache partitioning
technique employed: for example, considering an OS-controlled technique, the value of $\delta$ is the page size, e.g.,
4KB. In this case, if the CPU has a 2MB cache, then $K^{size} = 2048\text{KB}/4\text{KB} = 512$ units. Denote as $U_{wc}$ the
worst-case system utilization.
Regarding task parameters, each task $\tau_i$ is characterized by a tuple $\tau_i = \langle p_i, exec^C_i(k), CRT_i(k) \rangle$,
where $p_i$ is the task period, $exec^C_i(k)$ is the cache-aware execution time, $CRT_i(k)$ is the cache reloading time, and
$k$ is an integer index that represents the cache size $k \cdot \delta$. Functions $exec^C_i(k)$ and $CRT_i(k)$ will be formally
defined in the following paragraphs.
Definition 2.3.1 The cache-aware execution time $exec^C_i(k)$ of task $\tau_i$ is the worst-case execution time of the
task when it runs alone in a system with cache size $k \cdot \delta$.
Figure 2.1: $exec^C$ of avionic applications (four panels: task 1, task 2, task 3, task 4; x-axis: cache size in KBytes; y-axis: $exec^C$ in ms)
Many techniques have been proposed for measuring single-task worst-case execution time, including
those that take cache effects into account [48]. In this research, we assume that the function $exec^C_i$ can be
obtained by measuring $\tau_i$’s execution time for different cache sizes using an available tool. Figure 2.1
depicts the $exec^C$ functions of four tasks in an experimental avionic system. According to the figure,
$exec^C_i(k)$ is composed of two sub-intervals. More precisely, the cache-aware execution time
is decreasing in the first sub-interval and becomes almost constant in the second one. Intuitively, we can
explain this phenomenon as follows: when the cache size is smaller than a task’s memory footprint, the
execution time diminishes as the cache size increases due to a reduction of cache misses; on the other hand,
when the cache size is larger than the task’s memory footprint ($k^0_i$), the execution time is no longer cache-size
dependent since the number of cache hits does not change.
Next, we define the cache reloading time function.
Definition 2.3.2 The cache reloading time function $CRT_i(k)$ is the total CPU stall time due to cache misses caused by the preemptions that $\tau_i$ can experience within a period $p_i$.
$CRT_i(k)$ is an overhead that occurs only in multi-tasking systems and takes into account the effect of cache lines invalidated by preempting tasks. In general, $CRT_i(k)$ depends on the maximum number of times that $\tau_i$ can be preempted and the size of the cache partition that $\tau_i$ is using. A good technique to estimate a safe upper bound of $CRT_i(k)$ for fixed-priority scheduling is proposed in [39]. This technique can easily be extended to cyclic executives. In this research, we assume that, given a task set $\mathcal{T}$, a technique such as the one described in [39] can be used to find $CRT_i(k)$ for each task $\tau_i$.
With the problem at hand, two unknown variables need to be determined for each task $\tau_i$:
- $k_i$: the size of the cache partition assigned to task $\tau_i$
- $a_i$: an indicator variable whose value is $a_i = 1$ if task $\tau_i$ uses a private cache partition and 0 otherwise.
For convenience, we call a tuple $[k_i, a_i]$ an assignment for task $\tau_i$, and $\{a_i : \forall i \in [1, N]\}$ an arrangement of tasks. Having defined all elements, it is now possible to compute the WCET of a task $\tau_i$ given an assignment $[k_i, a_i]$:
$$WCET_i(k_i) = exec^C_i(k_i) + (1 - a_i) \cdot CRT_i(k_i) \qquad (2.3.1)$$
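As an illustration of Equation 2.3.1, the following minimal Python sketch computes the WCET of one task for a private and for a shared assignment of the same size. The function name `wcet` and the profile values are hypothetical and chosen only for illustration; they are not taken from the measured avionic tasks.

```python
from typing import Callable

def wcet(exec_c: Callable[[int], float],
         crt: Callable[[int], float],
         k: int,
         private: bool) -> float:
    """Equation 2.3.1: WCET_i(k_i) = exec^C_i(k_i) + (1 - a_i) * CRT_i(k_i)."""
    a = 1 if private else 0
    return exec_c(k) + (1 - a) * crt(k)

# Hypothetical profile of one task, indexed by cache size k (values in ms).
exec_table = {1: 2.0, 2: 1.5, 3: 1.2}
crt_table = {1: 0.4, 2: 0.3, 3: 0.5}

# Same partition size, private versus shared.
private_wcet = wcet(exec_table.__getitem__, crt_table.__getitem__, 2, private=True)
shared_wcet = wcet(exec_table.__getitem__, crt_table.__getitem__, 2, private=False)
```

With a private partition the reloading term vanishes, so only the shared assignment pays the $CRT_i(k_i)$ overhead.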
2.3.2 Genetic algorithm
The cache partitioning problem is now formulated as an optimization problem whose objective is to minimize the worst-case system utilization under the constraints that the sum of all cache partitions cannot exceed $K_{size}$ (the total available cache size) and that all safety-critical tasks (i.e., levels A and B of task criticality) must be assigned a private cache partition (i.e., $a_i = 1$). Note that a less critical task can use either a shared or a private cache partition.
- Problem Statement: Given a real-time system $S = \langle \mathcal{T}, K_{size} \rangle$, $\mathcal{T} = \{\tau_i : i \in [1, N]\}$, and $\forall i \in [1, N] : \tau_i = \langle p_i, exec^C_i(k), CRT_i(k) \rangle$, find a system configuration $C^{opt} = \{[k_i, a_i] : \forall i \in [1, N]\}$ that minimizes the system worst-case utilization $U_{wc}$.
The mathematical definition of the optimization problem follows:
$$\min U_{wc} = \sum_i \frac{WCET_i(k_i)}{p_i} \qquad (2.3.2)$$
subject to
$$\sum_i a_i \cdot k_i + k_{share} \le K_{size}$$
$$\forall i \in [1, N] : 0 < k_i \le K_{size}$$
$$\forall i \in [1, N] : k_{share} = k_i \mid a_i = 0$$
$$\forall i \in [1, N] : a_i = 1 \mid \tau_i \text{ is a safety-critical task}$$
Let $U^{opt}_{wc}$ be the minimum value of the objective function. This problem always has a feasible solution. In general, two questions need to be answered for each task: 1) should the task be placed in a private or in the shared cache partition (except for safety-critical tasks, which are always private), and 2) what is the size of its cache partition? Notice that all tasks assigned to the shared partition have the same cache size in a solution of the optimization problem, and this size is equal to $k_{share}$. This cache partitioning problem is NP-hard, since the knapsack problem can be reduced to it in polynomial time.
In the remainder of this section, we describe the genetic algorithm that is used to solve the optimization problem. The algorithm is based on the GENOCOP framework [51], which has been shown to perform surprisingly well on optimization problems with linear constraints (see footnote 2). The constraint-handling mechanism of GENOCOP is based on specialized operators that transform feasible individuals into other feasible individuals.
Algorithm 1 Cache Partitioning Genetic Algorithm
Input: $S = \{\mathcal{T}, K_{size}\}$
Output: $\{[k_i, a_i] : \forall \tau_i \in \mathcal{T}\}$
1: $g \leftarrow 0$; initialize $P(g)$
2: while $g \le G_{max}$ do
3:     mutate some individuals of $P(g)$
4:     cross over some individuals of $P(g)$
5:     locally optimize $P(g)$
6:     evaluate $P(g)$
7:     $g \leftarrow g + 1$
8:     select $P(g)$ from $P(g - 1)$
Our solution is shown in Algorithm 1. Problem-specific information is employed to design the operators, namely the local optimizer, mutation, and crossover. At each generation $g$, a set of individuals (i.e., population $P(g)$) is processed. Individuals are feasible solutions of the optimization problem. A non-negative vector $X$ of fixed length $N + 1$ is used to represent an individual, where $N$ is the number of tasks and, $\forall i \in [1, N]$, if $X[i] > 0$ then $X[i]$ is the size of the private cache of $\tau_i$, whereas if $X[i] = 0$ then $\tau_i$ uses the shared partition of size $X[N + 1]$. For example, $X = [2, 3, 0, 0, 5]$ means that $\tau_1$ and $\tau_2$ use two private partitions with $k_1 = 2$ and $k_2 = 3$, whereas $\tau_3$ and $\tau_4$ use the shared partition with $k_3 = k_4 = k_{share} = 5$. Individuals are evaluated (line 6) using Equation 2.3.2. The outcome of the evaluation, i.e. $U_{wc}$ of each individual, is then used to select which individuals will be in the next generation. The lower the $U_{wc}$ of an individual, the higher the probability that it will survive. Let $U^h_{wc}$ be the utilization of the output configuration, i.e. the lowest utilization found by the algorithm. The three operators (i.e., local optimizer, mutation, and crossover) are described in the following sections. In the following algorithms, all random variables generated by the instruction random have uniform distribution.
Footnote 2: Notice that the considered optimization problem involves the evaluation of a non-monotonic function $CRT_i(k)$. As a consequence, solvers such as hill climbing or simulated annealing could easily be trapped in a local optimum. Hence, a genetic algorithm was chosen to circumvent this problem.
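The encoding of an individual and its evaluation with Equation 2.3.2 can be sketched in Python as follows. This is only an illustrative sketch: the helper names (`decode`, `is_feasible`, `utilization`) and the toy task profiles are assumptions introduced here, and the feasibility check covers only the capacity constraint and the requirement that a used shared partition has a positive size.

```python
from typing import Callable, List, Sequence, Tuple

# A task is (period p_i, exec^C_i, CRT_i); the two functions take a cache size k.
Task = Tuple[float, Callable[[int], float], Callable[[int], float]]

def decode(x: Sequence[int]) -> List[Tuple[int, int]]:
    """Turn X[1..N+1] into a list of assignments [k_i, a_i]."""
    *private_sizes, k_share = x
    return [(k, 1) if k > 0 else (k_share, 0) for k in private_sizes]

def is_feasible(x: Sequence[int], k_size: int) -> bool:
    """Capacity constraint: private partitions plus shared partition fit in K_size."""
    *private_sizes, k_share = x
    if any(k == 0 for k in private_sizes) and k_share <= 0:
        return False  # a shared task needs a non-empty shared partition
    return sum(private_sizes) + k_share <= k_size

def utilization(x: Sequence[int], tasks: Sequence[Task]) -> float:
    """Equation 2.3.2: sum over tasks of WCET_i(k_i) / p_i."""
    total = 0.0
    for (p, exec_c, crt), (k, a) in zip(tasks, decode(x)):
        total += (exec_c(k) + (1 - a) * crt(k)) / p
    return total

# Hypothetical task set: four identical tasks with toy exec^C and CRT profiles.
toy_tasks: List[Task] = [(10.0, lambda k: 10.0 / k, lambda k: 0.1 * k)] * 4
x_example = [2, 3, 0, 0, 5]  # the individual used as an example in the text
u_example = utilization(x_example, toy_tasks)
```

For the example vector, `decode` yields `[(2, 1), (3, 1), (5, 0), (5, 0)]`, matching the interpretation given in the text.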
2.3.3 Local Optimizer
Since $exec^C_i(k)$ is a non-increasing function, increasing $X[i]$ always results in a smaller or equal utilization of $\tau_i$ and thus a smaller or equal system utilization. This observation leads to the design of a simulated annealing local optimizer, as shown in Algorithm 2. Each individual of population $P(g)$ undergoes local improvement before being evaluated. The original individuals are then replaced by the local optima. This heuristic reduces the search space of the genetic algorithm to only those solutions that are locally optimal.
Algorithm 2 Simulated Annealing Local Optimizer
Input: $X[1..N + 1]$
Output: $X[1..N + 1]$
1: $T \leftarrow T_{max}$
2: while $T > T_{min}$ do
3:     $tries \leftarrow 0$
4:     while $tries < tries_{max}$ do
5:         $X' \leftarrow X$
6:         $i \leftarrow random[1, N]$ such that $X[i] > 0$
7:         $X'[i] \leftarrow X'[i] + random(X'[i], K_{size} - \sum_{j=1, j \ne i}^{N} X'[j])$
8:         $\Delta U \leftarrow U'_{wc} - U_{wc}$
9:         if $\Delta U < 0$ and $random[0, 1] < 1/(1 + e^{\Delta U / T})$ then
10:            $X[i] \leftarrow X'[i]$
11:        $tries \leftarrow tries + 1$
12:    reduce $T$
At each trial, an $X[i]$ is chosen at random (line 6) and its value is increased by a random amount within its feasible range (line 7). The new value is accepted with a probability that depends on the temperature $T$ (line 9). Only variables representing private cache partitions are considered (i.e., $X[i] > 0$, $i \in [1, N]$). Note that enlarging the size of the shared partition, $X[N + 1]$, does not necessarily reduce utilization since it may increase the cache reloading time. Obviously, this heuristic can only find approximate local optima with respect to a certain arrangement of tasks; the global optimum might require a task to use a shared partition instead of a private one (or vice versa) compared with the arrangement handled by the local optimizer. The mutation and crossover operators are designed to allow the algorithm to search beyond local optima.
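A minimal Python sketch of the local optimizer is given below. It follows the structure of Algorithm 2 under several assumptions that are not stated in the text: the increment is drawn uniformly from the currently unused cache units ($K_{size}$ minus the sum of all entries of $X$), "reduce $T$" is implemented as geometric cooling, and the utilization is supplied as a callback. The function name, default parameters, and toy utilization are hypothetical.

```python
import math
import random
from typing import Callable, List, Optional

def local_optimize(x: List[int],
                   k_size: int,
                   utilization: Callable[[List[int]], float],
                   t_max: float = 1.0,
                   t_min: float = 0.01,
                   cooling: float = 0.5,
                   tries_max: int = 20,
                   rng: Optional[random.Random] = None) -> List[int]:
    """Grow private partitions X[1..N]; the last entry (shared size) is untouched."""
    rng = rng or random.Random()
    x = list(x)
    t = t_max
    while t > t_min:
        for _ in range(tries_max):
            private = [i for i in range(len(x) - 1) if x[i] > 0]
            free = k_size - sum(x)  # unused cache units (assumed feasible range)
            if not private or free <= 0:
                return x
            i = rng.choice(private)
            candidate = list(x)
            candidate[i] += rng.randint(1, free)
            delta_u = utilization(candidate) - utilization(x)
            # Acceptance test as written in line 9 of Algorithm 2.
            if delta_u < 0 and rng.random() < 1.0 / (1.0 + math.exp(delta_u / t)):
                x = candidate
        t *= cooling  # "reduce T": geometric cooling is an assumption
    return x

def toy_utilization(x: List[int]) -> float:
    """Hypothetical utilization: decreases as private partitions grow."""
    return sum(1.0 / k for k in x[:-1] if k > 0)

start = [1, 1, 0, 2]
result = local_optimize(start, k_size=10, utilization=toy_utilization,
                        rng=random.Random(0))
```

Because only private entries are enlarged and the capacity is never exceeded, the result stays feasible and keeps the same arrangement of tasks as the input.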
2.3.4 Mutation Operator
Algorithm 2 minimizes the utilization by enlarging the size of private cache partitions; thus the solution is only locally optimal with respect to a certain arrangement of tasks. The mutation operator described in Algorithm 3 helps to search beyond local optima by randomly rearranging tasks into a private or shared cache partition. In other words, it creates a new landscape in which the local optimizer can work.
Algorithm 3 Mutation Operator
Input: $X[1..N + 1]$
Output: $X[1..N + 1]$
1: $flip \leftarrow random\{TAIL, HEAD\}$
2: if $flip = TAIL$ then
3:     $i \leftarrow random[1, N + 1]$
4:     assign $X[i]$ to a random value in its feasible range
5: else
6:     $i \leftarrow random[1, N]$ such that $X[i] > 0$
7:     $X[i] \leftarrow 0$
The operator takes one individual at a time as input. It randomly chooses a task $\tau_i$ and either modifies its private cache size or moves the task into the shared cache by assigning 0 to $X[i]$. The size of the shared cache partition may also be changed. The operator guarantees that the generated individual is feasible.
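The mutation step can be sketched in Python as follows. This is a simplified illustration rather than the exact operator: on TAIL it only resizes entries that are already positive (so the arrangement is unchanged), on HEAD it moves a non-critical private task to the shared partition only if that partition already has a positive size, and the set `critical` of safety-critical task indices is a hypothetical parameter added to keep those tasks private.

```python
import random
from typing import FrozenSet, List, Optional

def mutate(x: List[int],
           k_size: int,
           critical: FrozenSet[int] = frozenset(),
           rng: Optional[random.Random] = None) -> List[int]:
    """Return a mutated copy of X; indices are 0-based, x[-1] is the shared size."""
    rng = rng or random.Random()
    y = list(x)
    n = len(y) - 1  # y[n] is the shared partition size
    if rng.choice(("TAIL", "HEAD")) == "TAIL":
        sized = [i for i in range(n + 1) if y[i] > 0]
        if sized:
            i = rng.choice(sized)
            room = k_size - (sum(y) - y[i])  # largest value entry i may take
            y[i] = rng.randint(1, room)
    else:
        movable = [i for i in range(n) if y[i] > 0 and i not in critical]
        if movable and y[n] > 0:
            y[rng.choice(movable)] = 0  # task now uses the shared partition
    return y

parent = [2, 3, 0, 0, 5]
children = [mutate(parent, k_size=12, critical=frozenset({0}),
                   rng=random.Random(seed)) for seed in range(50)]
```

Each child differs from its parent in at most one entry and still satisfies the capacity constraint, provided the parent does.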
2.3.5 Crossover Operator
The crossover operator (Algorithm 4) aims to transfer good chromosomes of the current generation to the next one. Algorithm 2 finds, for each individual, a local optimum with respect to its arrangement of tasks. After undergoing local optimization, if one individual has lower utilization than another, there is a high probability that the former has a better arrangement. In other words, the genetic advantage of an individual over another is implicitly expressed by its arrangement. Our design of the crossover operator exploits this information.
The operator spawns descendants by randomly combining the arrangements of pairs of predecessors. The algorithm takes two parents as input and produces two children. If $\tau_i$ of one of the parents uses the shared cache (i.e., $X[i] = 0$), the assignment of the child's $\tau_i$ is the assignment of either of the parents' $\tau_i$; otherwise it is the arithmetic crossover (i.e., $b \cdot X_1[i] + (1 - b) \cdot X_2[i]$, where $b$ is a continuous uniform random variable taking values within $[0, 1]$). The operator guarantees that if the parents are feasible then their children are also feasible.
Algorithm 4 Crossover Operator
Input: $X_1[1..N + 1]$, $X_2[1..N + 1]$
Output: $X'_1[1..N + 1]$, $X'_2[1..N + 1]$
1: $b \leftarrow random[0, 1]$
2: for $i = 1$ to $N + 1$ do
3:     if $X_1[i] = 0$ or $X_2[i] = 0$ then
4:         $flip \leftarrow random\{TAIL, HEAD\}$
5:         if $flip = TAIL$ then
6:             $X'_1[i] \leftarrow X_1[i]$
7:             $X'_2[i] \leftarrow X_2[i]$
8:         else
9:             $X'_1[i] \leftarrow X_2[i]$
10:            $X'_2[i] \leftarrow X_1[i]$
11:    else
12:        $X'_1[i] \leftarrow b \cdot X_1[i] + (1 - b) \cdot X_2[i]$
13:        $X'_2[i] \leftarrow b \cdot X_2[i] + (1 - b) \cdot X_1[i]$
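A minimal Python sketch of the crossover operator follows. Since partition sizes are integers, the sketch replaces the real-valued arithmetic crossover with an integer version that truncates the shift between the two parents; this rounding rule is an assumption made here, chosen so that each pair of child genes keeps the same sum as the pair of parent genes. The sketch does not re-check the capacity constraint after swapping genes.

```python
import random
from typing import List, Optional, Tuple

def crossover(x1: List[int], x2: List[int],
              rng: Optional[random.Random] = None) -> Tuple[List[int], List[int]]:
    """Combine two parents gene by gene, following the branches of Algorithm 4."""
    rng = rng or random.Random()
    b = rng.random()  # uniform in [0, 1)
    c1: List[int] = []
    c2: List[int] = []
    for g1, g2 in zip(x1, x2):
        if g1 == 0 or g2 == 0:
            # At least one parent uses the shared partition: keep or swap genes.
            if rng.choice(("TAIL", "HEAD")) == "TAIL":
                c1.append(g1)
                c2.append(g2)
            else:
                c1.append(g2)
                c2.append(g1)
        else:
            # Integer arithmetic crossover: b*g1 + (1-b)*g2 = g2 + b*(g1 - g2).
            shift = int(b * (g1 - g2))  # truncates toward zero
            c1.append(g2 + shift)
            c2.append(g1 - shift)
    return c1, c2

parent_1 = [2, 3, 0, 0, 5]
parent_2 = [4, 0, 1, 0, 3]
offspring = [crossover(parent_1, parent_2, random.Random(seed)) for seed in range(50)]
```

In the arithmetic branch each child gene lies between the two parent genes, so a task that is private in both parents stays private in both children.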
2.4 Approximate Utilization Lower Bound
In this section we derive an approximate utilization lower bound for the cache partitioning problem, which will be used to evaluate our heuristic solution. The bound gives a good comparison metric to evaluate the effectiveness of a given partitioning scheme. It is approximate because it is computed from the functions $exec^C_i(j)$ and $CRT_i(j)$, which in realistic scenarios are derived experimentally and therefore contain some approximation error. A definition of the approximate utilization lower bound $U_b$ follows.
Definition 2.4.1 An approximate utilization lower bound $U_b$ of the cache partitioning problem (defined in Section 2.3.2) is a value that satisfies $U_b \le U^{opt}_{wc}$.
The bound $U_b$ is easily computed starting from an initial phase that assumes unlimited cache size, where every task has a private cache partition as large as its memory footprint. Then, at each step, the total cache size is reduced either by shrinking a private partition or by moving a task from a private partition to the shared one: the decision is made in such a way that the increment of total task utilization is minimized at each step. This technique is similar to Q-RAM [37] (see footnote 3), and the iterative algorithm is executed until the total cache size is reduced to $K_{size}$.
Footnote 3: Q-RAM always starts from a feasible state with minimum resource allocation, while our algorithm starts from an infeasible state by assuming unlimited resource availability.
Although the bound value is not exactly $U^{opt}_{wc}$, as will be shown in Section 2.5, it is still a good measure for the performance of any heuristic solving the analyzed optimization problem. In addition, system designers can use this bound to predict how much utilization can be saved at most when applying any partitioning heuristic. According to Equations 2.3.1 and 2.3.2, the utilization of a task may increase or decrease depending on its cache attributes such as size, private or shared partition, and cache reloading time. The derivation of $U_b$ is based on the notion of two distinct utilization differentials per cache unit: $\Delta^e_i(j)$ and $\Delta^r_i(j)$. $\Delta^e_i(j)$ indicates how a task's utilization varies as a function of the size of its private cache partition, and $\Delta^r_i(j)$ is a normalized index of how the task's utilization varies when switching from private to shared cache. The formal definitions of $\Delta^e_i(j)$ and $\Delta^r_i(j)$ follow.
Definition 2.4.2 The utilization differential $\Delta^e_i(j)$ of task $\tau_i$ is the difference in utilization per cache unit when $\tau_i$'s private cache is reduced from $j$ to $j - 1$ units.
$$\Delta^e_i(j) = \frac{exec^C_i(j - 1) - exec^C_i(j)}{p_i} \qquad (2.4.1)$$
Task $\tau_i$'s utilization may increase when changing from private to shared cache due to the presence of cache reloading time. If $\tau_i$'s private cache size is $j$, then the amount of cache freed when the switch takes place is $j$ units. Thus we have the following definition of the utilization differential $\Delta^r_i(j)$ caused by cache reloading time.
Definition 2.4.3 The utilization differential $\Delta^r_i(j)$ of task $\tau_i$ is the difference in utilization per cache unit caused by cache reloading time when $\tau_i$ is moved from a private partition of size $j$ to a shared partition of the same size.
$$\Delta^r_i(j) = \frac{CRT_i(j)}{j \cdot p_i} \qquad (2.4.2)$$
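The two differentials of Equations 2.4.1 and 2.4.2 translate directly into code. The short Python sketch below uses hypothetical function names and a toy profile (values in ms) purely for illustration.

```python
from typing import Callable

def delta_e(exec_c: Callable[[int], float], period: float, j: int) -> float:
    """Equation 2.4.1: utilization added when a private cache shrinks from j to j-1."""
    return (exec_c(j - 1) - exec_c(j)) / period

def delta_r(crt: Callable[[int], float], period: float, j: int) -> float:
    """Equation 2.4.2: per-unit utilization added by moving from private to shared."""
    return crt(j) / (j * period)

# Hypothetical profile of one task with period 10 ms.
exec_table = {1: 2.0, 2: 1.5, 3: 1.2}
crt_table = {1: 0.4, 2: 0.3, 3: 0.5}

d_e = delta_e(exec_table.__getitem__, 10.0, 2)  # (2.0 - 1.5) / 10
d_r = delta_r(crt_table.__getitem__, 10.0, 2)   # 0.3 / (2 * 10)
```

In this toy case, giving up one private unit costs more utilization per cache unit than switching the whole 2-unit partition to shared.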
Having introduced the basic notions, we now describe the algorithmic method for estimating $U_b$ (Algorithm 5). Since $exec^C_i$ is a non-increasing function, $\tau_i$ has its smallest utilization when its cache is private and has maximal size (i.e., $[k_i = k^0_i, a_i = 1]$). Consequently, the system utilization is minimized when every task has its own private cache of maximum size. That absolute smallest system utilization ($U^{abs}_b$) and the total needed cache size ($K$) are calculated in lines 6 and 7, respectively. This configuration, however, is not feasible when $K > K_{size}$, which holds in most practical systems. Note that after line 8, $U_b = U^{abs}_b$ is a lower bound following Definition 2.4.1. Nevertheless, it is not the tightest bound that can be found in polynomial time. A tighter bound is estimated by the procedure starting from line 11. Essentially, it reduces the value of $K$ toward that of $K_{size}$ (lines 13 and 17). For each unit of cache size taken from $K$, the value of $U_b$ is increased (lines 14 and 18) such that $U_b$ approaches but never exceeds $U^{opt}_{wc}$. This is done by using the smallest values of $\Delta^e$ and $\Delta^r$ to update $U_b$ at each step. The correctness of the procedure is proven in the following paragraphs.
Consider a configuration $C = \{[k_i = k^0_i, a_i = 1] : \forall i \in [1, N]\}$ that uses total cache size $K = \sum_i k^0_i > K_{size}$. There are three basic operations that can be executed to let $C$ converge toward $C^{opt}$:
1. reducing the size of any private partition by 1 unit, thus reducing $K$ by 1 unit and increasing $U_b$ by a value $\Delta^e_i(j)$.
2. moving a task from its private partition of size $j$ to a shared partition of the same size, thus reducing $K$ by $j$ units and increasing $U_b$ by a value $\Delta^r_i(j) \cdot j$.
3. reducing the size of the shared partition by 1 unit, thus reducing $K$ by 1 unit.
Lemma 2.4.1 shows that, since operation 3 is equivalent to a sequence of the other two, only operations 1 and 2 are needed to compute $U_b$.
Lemma 2.4.1 Every sequence of operations used to compute $U_b$ can be converted to a sequence of operations 1 and 2.
Proof.
We only need to prove that any sequence of operation 3 can be represented as a sequence of operations 1 and one operation 2. Assume that the final size of the shared partition is $k_{share}$. We can always reduce the size of any task's private partition to $k_{share}$ using operation 1; then, by applying operation 2, those tasks can be moved to the shared partition and $U_b$ can be computed without using operation 3. $\square$
Using Lemma 2.4.1, we can prove the following theorem, which implies the correctness of Algorithm 5.
Theorem 2.4.1 The output of Algorithm 5 ($U_b$) is smaller than or equal to $U^{opt}_{wc}$.
Proof.
Lemma 2.4.1 proves that every transformation applied to compute $U_b$ is composed of sequences of operations 1 and 2. Consider a task $\tau_i$ currently using a private cache of size $j$: for each operation 1 applied to $\tau_i$, $U_b$ increases by $\Delta^e_i(j)$ and $K$ decreases by 1; for each operation 2 applied to $\tau_i$, $U_b$ increases by $\Delta^r_i(j) \cdot j$ and $K$ decreases by $j$. Since Algorithm 5 uses the smallest value among $\Delta^e$ and $\Delta^r$ to update $U_b$ at each step, after the while loop (when $K = K_{size}$), $U_b \le U^{opt}_{wc}$. The time complexity of the while loop is bounded by $\sum_i k^0_i - K_{size}$. $\square$
Note that $U_b$ is a utilization lower bound and Algorithm 5 does not produce a feasible solution, since it might end up splitting the cache of a task into a private part and a shared one.
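Algorithm 5 $U_b$ Estimation
Input: $S = \{\mathcal{T}, K_{size}\}$
Output: $U_b$
1: for $i = 1$ to $N$ do
2:     for $j = 1$ to $k^0_i$ do
3:         $EXE[\sum_{l=1}^{i-1} k^0_l + j] \leftarrow \Delta^e_i(j)$
4:         $REL[\sum_{l=1}^{i-1} k^0_l + j] \leftarrow \Delta^r_i(j)$
5: sort $EXE$ and $REL$ in decreasing order
6: $U^{abs}_b \leftarrow \sum_i exec^C_i(k^0_i) / p_i$
7: $K \leftarrow \sum_i k^0_i$
8: $U_b \leftarrow U^{abs}_b$
9: $e \leftarrow$ size of $EXE$
10: $r \leftarrow$ size of $REL$
11: while $K > K_{size}$ do
12:     if $EXE[e] < REL[r]$ then
13:         $K \leftarrow K - 1$
14:         $U_b \leftarrow U_b + EXE[e]$
15:         $e \leftarrow e - 1$
16:     else
17:         $K \leftarrow K - \min(K - K_{size}, j)$
18:         $U_b \leftarrow U_b + \min(K - K_{size}, j) \cdot REL[r]$
19:         $r \leftarrow r - 1$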
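The estimation procedure can be sketched in Python as shown below. Two interpretation choices in this sketch are assumptions rather than statements from the text: the partition size $j$ belonging to each $REL$ entry is stored alongside the entry, and the amount of cache freed in the else branch is computed once, before $K$ is updated, and reused for both updates. The task tuples and toy profiles are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A task is (period p_i, exec^C_i, CRT_i, footprint k^0_i).
BoundTask = Tuple[float, Callable[[int], float], Callable[[int], float], int]

def estimate_ub(tasks: Sequence[BoundTask], k_size: int) -> float:
    """Greedy lower-bound estimate in the spirit of Algorithm 5."""
    exe: List[float] = []              # Delta^e values
    rel: List[Tuple[float, int]] = []  # (Delta^r value, partition size j)
    for p, exec_c, crt, k0 in tasks:
        for j in range(1, k0 + 1):
            exe.append((exec_c(j - 1) - exec_c(j)) / p)
            rel.append((crt(j) / (j * p), j))
    exe.sort(reverse=True)  # smallest differential sits at the end
    rel.sort(reverse=True)
    ub = sum(exec_c(k0) / p for p, exec_c, _, k0 in tasks)
    k = sum(k0 for _, _, _, k0 in tasks)
    while k > k_size and (exe or rel):
        if exe and (not rel or exe[-1] < rel[-1][0]):
            k -= 1
            ub += exe.pop()
        else:
            value, j = rel.pop()
            freed = min(k - k_size, j)  # computed before K is updated
            k -= freed
            ub += freed * value
    return ub

# Hypothetical single task: period 10, footprint 2 units.
toy_exec = {0: 4.0, 1: 3.0, 2: 2.5}
toy_crt = {1: 1.0, 2: 2.0}
toy_set: List[BoundTask] = [(10.0, toy_exec.__getitem__, toy_crt.__getitem__, 2)]

ub_unconstrained = estimate_ub(toy_set, k_size=2)  # equals U_b^abs
ub_one_unit = estimate_ub(toy_set, k_size=1)
```

In the toy case, removing one unit costs the smallest $\Delta^e$ entry (0.05), so the bound rises from 0.25 to 0.30.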
2.5 Evaluation
This section evaluates the proposed algorithm using input data taken from real and simulated systems. Although the solution can be applied to systems with any scheduling policy, all the experiments in this section assume a cyclic executive scheduler. This assumption is motivated by the fact that the cyclic executive scheduler is commonly used by the avionics industry due to the high criticality of the developed real-time systems. In our evaluation, we are concerned only with the last level of cache, as it affects system performance the most. However, taking other cache levels into account would not change the performance of the proposed algorithm. We first describe the methodology used to generate input data, then we show the experimental results.
2.5.1 Methodology
This section discusses our method to simulate cache access patterns and to generate functions $exec^C$ and $CRT$. In [49], Thiébaut developed an analytical model of caching behavior. The model is provided in the form of a mathematical function (Equation 2.5.1), assigned to each task, that dictates the cache miss rate for a given cache size $k$. Although it was originally developed for a fully associative cache, it has been verified that the model generates synthetic miss rates that are very similar to the cache behavior of actual traces across the complete range of cache associativities [49].
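$$MissRate(k) = \begin{cases} \dfrac{1 - \frac{k}{A_1}\left(1 - \frac{1}{\theta}\right) - A_2}{1 - A_2} & \text{if } k \le A_1 \\[2ex] \dfrac{\frac{A^{\theta}}{\theta} k^{(1 - \theta)} - A_2}{1 - A_2} & \text{if } A_1 < k \le k^0 \\[2ex] 0 & \text{if } k^0 < k \end{cases} \qquad (2.5.1)$$
$$A_1 = A^{\theta / (\theta - 1)} \qquad (2.5.2)$$
$$A_2 = \frac{A^{\theta}}{\theta} (k^0)^{(1 - \theta)} \qquad (2.5.3)$$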
Cache behavior is a function of three task-dependent parameters: $A$, $\theta$, and $k^0$. $A$ determines the average size of a task's neighborhood (i.e., working set). The higher $A$ is, the larger the task's neighborhood becomes. Notice that a larger working set increases the probability of a cache miss. $\theta$ is the locality parameter, which is inversely proportional to the probability of making large jumps. The probability that a memory access visits a new cache line diminishes as $\theta$ increases. It has been shown statistically that real applications have $\theta$ ranging from 1.5 to 3 [49]. It is worth noticing that Equation 2.5.1 only models capacity cache misses. A complete model also needs to capture the effect of compulsory cache misses (i.e., cold cache misses). In a system with a single task or with non-preemptive tasks, the cache miss rate converges to that of Equation 2.5.1. However, this is not the case in preemptive multi-tasking systems, where compulsory cache misses become significant due to frequent preemptions. Equation 2.5.4, proposed by Dropsho [9], modifies Equation 2.5.1 to also calculate the miss rate occurring during the warm-up phase. The improved model calculates the instantaneous miss rate by using the current number of unique cache entries as the instantaneous effective cache size.
$$InstantRate(m, k) = \begin{cases} MissRate(m) & \text{if } m < k \\ MissRate(k) & \text{otherwise} \end{cases} \qquad (2.5.4)$$
The inverse of the instantaneous miss rate is the average number of memory references required to arrive at the next miss. This information is used to calculate the number of references required to have $M$ misses (Equation 2.5.5).
$$Ref(M, k) = \sum_{m=1}^{M} \frac{1}{InstantRate(m, k)} \qquad (2.5.5)$$
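Equations 2.5.4 and 2.5.5 can be sketched in Python as follows. To keep the example self-contained, the miss-rate model is passed in as a callback, and the toy function `toy_miss_rate` is a hypothetical stand-in for Equation 2.5.1; the sketch also assumes that the miss rate is strictly positive for the sizes used, since otherwise the inverse is undefined.

```python
from typing import Callable

def instant_rate(m: int, k: int, miss_rate: Callable[[int], float]) -> float:
    """Equation 2.5.4: during warm-up the effective cache size is m, then k."""
    return miss_rate(m) if m < k else miss_rate(k)

def refs_for_misses(misses: int, k: int, miss_rate: Callable[[int], float]) -> float:
    """Equation 2.5.5: expected number of references needed to incur M misses."""
    return sum(1.0 / instant_rate(m, k, miss_rate) for m in range(1, misses + 1))

def toy_miss_rate(k: int) -> float:
    """Hypothetical miss-rate curve used only for this example."""
    return 1.0 / (1.0 + k)

# With k = 2: the rates are 1/2, 1/3, 1/3, so 2 + 3 + 3 references are needed.
refs_3_misses = refs_for_misses(3, 2, toy_miss_rate)
```

The first miss is reached quickly because the cache is still cold; once $m$ reaches $k$, every further miss costs the steady-state number of references.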
We are now ready to calculate $exec^C$ and $CRT$. Consider a task $\tau$ that has a total number of memory references $R$; the task's $exec^C$ is calculated according to Equation 2.5.6. Note that in this case we can directly use Equation 2.5.1 to approximate the cache miss rate since, by definition, $exec^C$ is the execution time of $\tau$ when running non-preemptively. Hence, $exec^C$ can be computed as follows:
$$exec^C(k) = (1 - MissRate(k)) \cdot R \cdot HitDelay + MissRate(k) \cdot R \cdot MissDelay \qquad (2.5.6)$$
where $HitDelay$ and $MissDelay$ are the execution times of an instruction when a cache hit or a cache miss occurs, respectively. The values of these constants depend on the adopted platform. Assume now that $\tau$ runs on a multi-tasking system scheduled by a cyclic executive with a time slot of size $s$ (in general, $exec^C$ is longer than $s$): since no assumption is made on the memory access pattern of other tasks, in the worst case $\tau$ has to warm up the cache again at each resumption. In other words, in order to calculate the number of memory references ($Ref$) that take place in each time slot $s$, Equation 2.5.5 must be used, and the value of $M$ is such that the induced execution time in that time slot, $(Ref(M, k) - M) \cdot HitDelay + M \cdot MissDelay$, is equal to $s$. Note that in a multi-tasking system, the number of memory references (i.e., instructions) executed in a time slot is less than what can be executed within the same time slot in a non-preemptive or single-task system, due to the inter-task cache interference problem. Therefore, task $\tau$ would take more than $exec^C$ time to complete its total number of references. By definition, that additional time is captured by $CRT$.
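The two calculations described above can be sketched in Python. The sketch makes one simplifying assumption that is not in the text: instead of solving for the $M$ at which the induced time equals $s$ exactly, it returns the largest integer $M$ whose induced time does not exceed $s$. The miss-rate callback, delays, and slot size are hypothetical values chosen for easy checking.

```python
from typing import Callable, Tuple

def exec_c(k: int, refs: float, hit_delay: float, miss_delay: float,
           miss_rate: Callable[[int], float]) -> float:
    """Equation 2.5.6: non-preemptive execution time for cache size k."""
    mr = miss_rate(k)
    return (1.0 - mr) * refs * hit_delay + mr * refs * miss_delay

def slot_progress(k: int, slot: float, hit_delay: float, miss_delay: float,
                  miss_rate: Callable[[int], float]) -> Tuple[int, float]:
    """Largest M (and Ref(M, k)) that fits in one slot when starting cold."""
    misses, refs = 0, 0.0
    while True:
        m = misses + 1
        rate = miss_rate(m) if m < k else miss_rate(k)  # Equation 2.5.4
        next_refs = refs + 1.0 / rate                   # Equation 2.5.5
        elapsed = (next_refs - m) * hit_delay + m * miss_delay
        if elapsed > slot:
            return misses, refs
        misses, refs = m, next_refs

def toy_miss_rate(k: int) -> float:
    """Hypothetical miss-rate curve used only for this example."""
    return 1.0 / (1.0 + k)

time_alone = exec_c(1, 100, 1.0, 10.0, toy_miss_rate)          # 50 hits, 50 misses
fit_misses, fit_refs = slot_progress(2, 30.0, 1.0, 10.0, toy_miss_rate)
```

Comparing the references completed per slot with the non-preemptive rate gives the extra time that $CRT$ has to account for.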
In summary, to generate simulated input data, we generate a set of task-dependent parameters for each task ($A$, $\theta$, $k^0$), the size of the scheduling time slot $s$, and the total number of the task's memory references. $exec^C$ and $CRT$ of each task are then calculated accordingly using Equations 2.5.5 and 2.5.6. We emphasize that the generated data is only for the purpose of evaluating the performance of Algorithm 1; it does not replace techniques that estimate $exec^C$ and $CRT$. The proposed heuristic can be applied to any practical system whenever information about $exec^C$ and $CRT$ is available by any means.

Table 2.4: Task parameters
Task   Number of references   Memory usage (KB)   Time slot   Cache size (KB)
1      61560                  8000                4           508
2      259023                 33000               5           644
3      76364                  668                 1           104
4      90867                  9000                3           416
5      32544                  280                 2           28
6      6116                   140                 1           60
7      41124                  28                  1           8
8      217675                 230                 6           216
2.5.2 Case Study
In this section, an experimental avionic system with eight tasks is used to verify the effectiveness of the proposed approach. The system parameters are the same as those in Table 2.2. The scheduler runs with a major cycle of six time slots: five tasks have their own time slot and the three others share one time slot. The size of each time slot is shown in column 2 of Table 2.5. The tasks' parameters, including the number of memory references, memory usage, and time slot allocation, are shown in Table 2.4. All tasks have the same period of 16.67 ms. The numbers of memory hits and misses were measured as a function of the cache size for each task. The traces were then used to find the $exec^C$ and $CRT$ functions. $exec^C$ of tasks 1 to 4 is plotted in Figure 2.1. Although some tasks may use a large amount of memory, the range of cache sizes over which a task's $exec^C$ function varies significantly may be smaller. This is because, in most cases, the worst-case execution path accesses only a small part of all the allocated memory. For example, the memory size of task 4 is about 9000 KB, but its $exec^C$ is subject to large variations only at cache sizes smaller than 512 KB. In other words, it is possible to assign to a task an amount of cache smaller than its memory allocation without increasing its execution time much. This suggests that cache partitioning can still be useful in practice even for applications with a large memory footprint.

Table 2.5: Task worst-case utilization
Time slot   Size (ms)   Baseline: slot utilization $U^{share}$ (%)   Heuristic: slot utilization $U^h_{wc}$ (%)
1           2.3         103.8                                        70.4
2           1.2         45.9                                         35.8
3           3.2         48.0                                         42.2
4           1.9         54.8                                         48.9
5           4.4         99.5                                         80.5
6           3.67        100.2                                        77.4
To calculate the $CRT$ function, we use the method discussed in Section 2.5.1. The task parameters, i.e. $A$, $\theta$, and $k^0$, are obtained by fitting $MissRate(k)$ (Equation 2.5.1) to the measured trace. The correctness of this model is supported by the fact that $CRT$ is always smaller than 30% of $exec^C$ (see Section 2.1 for more details). In this case study, the algorithm outputs a partitioned configuration where all tasks use a private cache. Column 5 of Table 2.4 reports the resulting size of each cache partition. As expected, in many cases the partition size is much smaller than the task's memory size.
Table 2.5 reports the time slot utilization $U^{share}$ for the baseline configuration (which uses only a shared cache) and the improved time slot utilization obtained with the proposed heuristic in columns 3 and 4, respectively. The utilization of a time slot is the percentage of the slot duration used by the task(s) assigned to that slot; tasks running in a time slot are not schedulable if the slot utilization is greater than 100%. In this case study, slots 1 and 6 are not schedulable under the baseline configuration, but they are schedulable under the partitioned configuration. The baseline utilization $U^{share}$, the partitioned one $U^h_{wc}$, and the utilization bound $U_b$ are 79.7%, 64.2%, and 61.3%, respectively. The utilization gain is 15.4%, while $U^h_{wc}$ is only 2.9% greater than $U_b$.
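The per-slot schedulability condition stated above (a slot is not schedulable when its utilization exceeds 100%) can be checked directly against the values of Table 2.5. The short Python sketch below does this; the dictionary and function names are hypothetical.

```python
from typing import Dict, List

# Slot utilizations (%) copied from Table 2.5, keyed by time slot.
baseline: Dict[int, float] = {1: 103.8, 2: 45.9, 3: 48.0, 4: 54.8, 5: 99.5, 6: 100.2}
heuristic: Dict[int, float] = {1: 70.4, 2: 35.8, 3: 42.2, 4: 48.9, 5: 80.5, 6: 77.4}

def unschedulable_slots(utilization: Dict[int, float]) -> List[int]:
    """Slots whose utilization exceeds 100% cannot be scheduled."""
    return sorted(slot for slot, u in utilization.items() if u > 100.0)

baseline_failures = unschedulable_slots(baseline)
heuristic_failures = unschedulable_slots(heuristic)
```

The check reproduces the observation in the text: slots 1 and 6 fail under the baseline configuration and no slot fails under the partitioned one.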
2.5.3 Evaluation with Simulated Systems
The same system parameters shown in Table 2.2 are used for simulations. The tasks' parameters are randomly chosen with uniform distribution within the intervals presented in Table 2.6. The ranges of $A$ and $\theta$ are selected based on values given by [49], which also fit well with those used in the case study. All experiments use the range of $k^0$ shown in Table 2.6 except where noted otherwise. $A_1$ is calculated using Equation 2.5.2. Each task has utilization smaller than 100% when it runs with the baseline configuration.

Table 2.6: Simulation parameters
$A$ (KB)    $\theta$     $k^0$ (KB)      Number of memory references
[1, 10]     [1.5, 3]     [$A_1$, 1024]   [$10^3$, $10^6$]

[Figure 2.2: Effect of the three factors. Three panels plot the difference in utilization (%) against (a) total cache size (KB): 512, 1024, 2048; (b) number of tasks: 10, 20, 30, 40; (c) range of time slot size: [1, 3] ms, [1, 5] ms, [1, 10] ms. Plotted series: $U^{share} - U^h_{wc}$, $U^{prop} - U^h_{wc}$, and $U_b - U^h_{wc}$.]

The following simulations describe how different factors (i.e., the total cache size, the number of tasks, and the size of time slots) affect the performance of the proposed algorithm. The performance measures are the average utilization gain, defined as $U^{share} - U^h_{wc}$, and the average value of $U_b - U^h_{wc}$. For comparison purposes, we also estimate the worst-case utilization of proportionally cache-partitioned systems ($U^{prop}$). In such a system, a task is assigned a private cache partition of size proportional to its memory footprint, i.e. $k_i = \frac{k^0_i}{\sum_{j=1}^{N} k^0_j} K_{size}$. The average value of $U^{prop} - U^h_{wc}$ is reported. Simulated system parameters are as follows except where stated otherwise: $N = 10$, $K_{size} = 2$ MB, and time slots ranging from 1 to 3 ms. All results are the average over the outputs of 30 different task sets.
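The proportional baseline used for comparison can be sketched in a few lines of Python. Rounding each share down to a whole number of partition units is an assumption of this sketch (the text does not specify a rounding rule), and the footprints in the example are hypothetical.

```python
from typing import List, Sequence

def proportional_partition(footprints: Sequence[int], k_size: int) -> List[int]:
    """k_i = k0_i / sum_j(k0_j) * K_size, rounded down to whole units."""
    total = sum(footprints)
    return [k0 * k_size // total for k0 in footprints]

# Hypothetical footprints (in partition units) sharing K_size = 512 units.
shares = proportional_partition([100, 300, 600], 512)
```

Because every share is rounded down, the partitions never exceed $K_{size}$ in total, at the cost of possibly leaving a few units unused.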
Effect of the total cache size: in this experiment, we measured systems having cache sizes of 512 KB, 1 MB, and 2 MB. The performance measures plotted in Figure 2.2(a) show that the heuristic performs very well, especially when the total cache size is large, i.e. 2 MB: the gain in utilization is about 15% and the difference between the heuristic utilization and the bound is less than 2%.
Effect of the number of tasks: in this experiment, the number of simulated tasks is 10, 20, 30, and 40. Results are plotted in Figure 2.2(b). Notice that two phenomena occur when the number of tasks increases: 1) the gain in utilization $U^{share} - U^h_{wc}$ is smaller since more tasks are forced to use the shared partition, and 2) the poor performance of proportional cache partitioning is more prominent since there are more tasks running on smaller private cache partitions.
Effect of the size of time slots: in this experiment, the tasks' time slot sizes are randomly chosen with uniform distribution within the following three ranges: 1 to 3 ms, 1 to 5 ms, and 1 to 10 ms. Intuitively, when the time slot size increases, the effect of cache-related preemption delay is reduced since tasks experience fewer preemptions during each execution. Consequently, the utilization gain due to cache partitioning is lower. This behavior is shown in Figure 2.2(c): as expected, the utilization gain $U^{share} - U^h_{wc}$ is reduced to 8% at the largest range, whereas the gap between $U^h_{wc}$ and $U_b$ is almost constant. Notice that time slot variation has no effect on $U^{prop} - U^h_{wc}$.
2.6 Conclusion
In this research, we have shown that cache partitioning can be used to improve system schedulability. This is because, in real-time systems, a schedulability test always has to assume the worst-case inter-task cache interference. This assumption often implies low cache area utilization in practical systems. The proposed cache partitioning optimization is designed to optimize cache area usage so that cache interference is minimized where necessary to preserve real-time guarantees.
Chapter 3
Network-on-Chip Real-time Scheduling
As mentioned in Chapter 1, scheduling real-time transactions in a NoC while maximizing its utilization poses new challenges that are different from those of traditional real-time systems. In particular, although the problem has the form of multiple-resource scheduling, it is not the same as multi-processor scheduling because a transaction may use multiple resources, i.e. physical links, simultaneously. This pattern of resource sharing creates non-transitive dependencies between transactions. For example, while two transactions $\tau_1$ and $\tau_2$ that do not share any links can be transmitted concurrently, they will both be delayed by the transmission of a transaction $\tau_3$ that uses links shared with both $\tau_1$ and $\tau_2$. To the best of our knowledge, this real-time resource sharing problem has not received adequate attention. In related works [43, 44], Shi et al. proposed a method to calculate the worst-case latency of transactions scheduled by a fixed-priority scheduling algorithm. Since these works extend the traditional real-time scheduling paradigm to NoC, they are not able to take full advantage of the parallelism available between non-overlapping transactions.
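The non-transitive dependency in the example above can be made concrete with a small Python sketch in which each transaction is modeled as the set of physical links it occupies. The link names and the `overlaps` helper are hypothetical and serve only to illustrate the relation.

```python
from typing import Set

def overlaps(a: Set[str], b: Set[str]) -> bool:
    """Two transactions conflict when they share at least one physical link."""
    return not a.isdisjoint(b)

# Hypothetical link sets on a small NoC.
tau_1 = {"L1", "L2"}
tau_2 = {"L4", "L5"}
tau_3 = {"L2", "L3", "L4"}

can_run_together = not overlaps(tau_1, tau_2)                     # True
both_blocked_by_3 = overlaps(tau_1, tau_3) and overlaps(tau_2, tau_3)  # True
```

The conflict relation is not transitive: $\tau_1$ conflicts with $\tau_3$ and $\tau_3$ conflicts with $\tau_2$, yet $\tau_1$ and $\tau_2$ can be transmitted concurrently.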
In this research, we propose novel real-time scheduling algorithms specifically designed for NoC. Through an understanding of the dependencies between transactions, the algorithms make dynamic scheduling decisions so that transaction concurrency is maximized. Due to the complexity of the problem, we first focus on NoC with ring topology (Section 3.3). Transaction sets on ring-topology NoC are classified into acyclic and cyclic transaction sets. We propose scheduling algorithms first for the former and then extend them to the latter. We then extend our theory for ring-topology NoC to general NoC. More specifically, in Section 3.4, we propose scheduling algorithms for acyclic transaction sets on general NoC. All proposed algorithms are evaluated through simulated experiments and implementation.
3.1 Related Works
Many of the early works on hard real-time communication [22, 46, 31] focus on communication between computers on single-domain bus networks. In these networks, only one transaction can be transferred on a bus at any time because the bus is shared between all transactions. A system with multiple buses is considered in [13]. However, each bus in that system still has one domain. Since a single-domain bus bears a similarity to single-processor systems, the traditional real-time scheduling theory for single-processor systems [26] is applied or extended to solve the problem in these works. The many-core SoCs in which we are interested have multi-domain buses where non-overlapping transactions can be transferred concurrently. In addition, the number of domains on a bus is determined by the topology of bus transactions.
There has also been significant research focused on real-time communication on multi-domain buses.
Most of these works [43, 44, 3, 24, 28] are concerned with the fixed-priority scheduling paradigm. In
these algorithms, each transaction is given a fixed priority at design time, and a higher-priority transaction $\tau_i$
always preempts its overlapping transactions that have lower priority (the preemption occurs at the flit
level). Most recently, Shi et al. proposed in [43, 44] a solution to optimally assign fixed priorities to real-time
transactions and a method to analyze the worst-case transaction latency (WTL) under a fixed-priority
scheduling algorithm. Although our work makes the same assumption about multi-domain buses, our proposed
scheduling algorithm is based on the dynamic-priority scheduling paradigm: that is, a transaction may be
assigned different priorities at runtime. To the best of our knowledge, our research is the first to do so.
As will be shown in the evaluation section, the performance of our approach on a typical NoC is better than
that of related works. We also note that since our proposed algorithm dynamically computes the schedule,
it has higher runtime overhead than that of the fixed-priority ones. This overhead may adversely affect the
algorithm performance. However, as we analyzed and demonstrated by experiments in Section 3.3.4, this
overhead is, in fact, relatively small.
3.2 Real-time NoC Model
We consider a system comprising multiple Processing Elements (PEs) interconnected by a NoC. The NoC is
composed of routers connected by communication links. Each router has several links directly connecting it
with neighbors. The number of the neighbors of each router depends on the network topology. For example,
a router on a ring-topology NoC has two neighbors, while one on a 2D-topology NoC has four. Links can
be either unidirectional or bidirectional, but each link is single-duplex, i.e., it supports only a single data
transmission at a time. Most NoCs employ full-duplex connections; in our model, they are represented by
two unidirectional links. Figure 3.1 shows an example of a 4 × 5 mesh NoC.
[Figure 3.1: A transaction set in NoC. The figure shows a 4 × 5 mesh of routers RT 1 to RT 20 connected by
physical links, transactions $\tau_1$ to $\tau_{11}$, and PO-sets $D_1$ to $D_7$.]
We assume that real-time applications running on multiple processing elements exchange data through
the NoC. A data transaction is defined as a request made by an application to transfer a certain amount
of data between two processing elements. We consider a scheduling problem where applications request
periodic data transactions, each comprising an infinite sequence of jobs. Each data transaction is transferred
hop-by-hop from the source to the destination. We assume that each data transaction has a fixed route which
consists of routers through which it reaches the destination. Also, when a transaction is transmitted, it uses
all links on its route at the same time. This is the assumption used in NoCs with wormhole switching [7, 10],
which is the most popular switching technique used in NoCs. Note that this assumption does not imply that
the link resource is wasted because, in practice, a packet is split into multiple flits, which are then
sent in a pipelined fashion along its route. The above assumption means that two data transactions overlap and
cannot be transferred concurrently if their routes share a link. However, multiple non-overlapping data
transactions can be sent at the same time.
In this research, we propose the use of a slot-based scheduling model, in which the time line is divided into
consecutive equal slots and transactions are scheduled on contention-free slots. This type of scheduling has
been used to build PFair [4] and BF [52], which are optimal scheduling algorithms for multi-processors.
Although this approach requires synchronization between routers and computation at the end node, it can
significantly reduce the implementation complexity of a real-time NoC because it eliminates the need for buffers
and arbiters at the routers. The slot-based scheduling model has been successfully implemented in the
Aethereal NoC [12], a guaranteed-service NoC developed at Philips Research Laboratories.
3.3 Real-time Scheduling for Ring-topology NoC
In this section, we discuss the scheduling problem for NoC with ring topology. An example of a NoC
of this type is shown in Figure 3.2. In the following, we first introduce the terminology
used throughout this section. Then, we propose a dynamic scheduling algorithm that has
better performance than existing works. The algorithm's performance is evaluated through simulation and
implementation.
Transaction Model
Let routers on the ring-topology NoC be indexed with a unique number in $[1, N^\eta]$, where $N^\eta$ is the number
of routers. We define $T$ as the set of data transactions: $T = \{\tau_i : i \in [1, N]\}$. A data transaction $\tau_i$ is
characterized by a tuple $\tau_i = (e_i, p_i, \eta_i^1, \eta_i^2)$, where $e_i$ is the time that the NoC spends to transmit a job of $\tau_i$
and $p_i$ is the period of $\tau_i$. Each job must complete within its period, i.e., relative deadlines are equal to periods.
$\eta_i^1$ and $\eta_i^2$, where $\eta_i^1 \neq \eta_i^2$, are the indexes of the source and destination routers of the transaction. $\eta_i^1$ and $\eta_i^2$
are called the first and second endpoint of $\tau_i$, respectively. A transaction has two endpoints $\eta_i^1$ and $\eta_i^2$ if its
route uses all consecutive routers from element $\eta_i^1$ to $\eta_i^2$ in the clockwise direction. Transaction $\tau_i$ is said to
go through element $\eta$ if $\eta \neq \eta_i^1$, $\eta \neq \eta_i^2$ and element $\eta$ is on the route of $\tau_i$. The NoC utilization $u_i$ of $\tau_i$ is
calculated as $u_i = e_i / p_i$. We assume that all data transactions arrive at time 0. Let the hyper-period $h$ of $T$ be
the least common multiple of the periods of all transactions in $T$.
Two transactions are said to overlap, and cannot be transferred concurrently, if they use a same link.
Given a data transaction set $T$, we define an overlap indicating function $OV : T \times T \mapsto \{0, 1\}$ where
$OV(\tau_i, \tau_j) = 1$ if $\tau_i$ and $\tau_j$ overlap, and 0 otherwise. Figure 3.2 shows a transaction set where $\tau_1$, $\tau_2$, $\tau_3$,
and $\tau_4$ overlap each other but they do not overlap $\tau_7$.
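The transaction model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Transaction`, the helpers `clockwise_links`, `ov` and `hyper_period`, and the three sample transactions on a 12-router ring are hypothetical and do not come from the dissertation.

```python
from dataclasses import dataclass
from fractions import Fraction
from math import lcm


@dataclass(frozen=True)
class Transaction:
    name: str
    e: int       # time to transmit one job, in slots
    p: int       # period (equal to the relative deadline), in slots
    first: int   # first endpoint, a router index in [1, n_routers]
    second: int  # second endpoint, a router index in [1, n_routers]

    @property
    def utilization(self) -> Fraction:
        return Fraction(self.e, self.p)


def clockwise_links(tx: Transaction, n_routers: int) -> set:
    """Links (k, next) crossed when walking clockwise from the first to the second endpoint."""
    links = set()
    k = tx.first
    while k != tx.second:
        nxt = k % n_routers + 1  # router n_routers wraps around to router 1
        links.add((k, nxt))
        k = nxt
    return links


def ov(a: Transaction, b: Transaction, n_routers: int) -> int:
    """Overlap indicator: 1 if the two routes share at least one link, else 0."""
    return int(bool(clockwise_links(a, n_routers) & clockwise_links(b, n_routers)))


def hyper_period(transactions) -> int:
    return lcm(*(tx.p for tx in transactions))


# Hypothetical sample data on a ring of 12 routers.
A = Transaction("a", e=1, p=4, first=2, second=5)
B = Transaction("b", e=1, p=6, first=4, second=7)
C = Transaction("c", e=2, p=12, first=11, second=2)  # wraps around router 12 -> 1
```

Here `A` and `B` share the link between routers 4 and 5, so they overlap, while `C` uses only the links 11-12, 12-1 and 1-2 and overlaps neither of them.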
A pairwise overlap set (PO-set) $D$ is defined as a maximal subset of $T$ such that $\forall \tau_i, \tau_j \in D$:
$OV(\tau_i, \tau_j) = 1$. For convenience, we consider that a transaction that does not overlap any other transaction
belongs to a PO-set that contains only that transaction. In general, a transaction may belong to more than
one PO-set. Figure 3.2 shows an example of a transaction set with four PO-sets: $D_1 = \{\tau_1, \tau_2, \tau_3, \tau_4\}$,
$D_2 = \{\tau_2, \tau_4, \tau_5\}$, $D_3 = \{\tau_4, \tau_5, \tau_6\}$, $D_4 = \{\tau_7, \tau_8\}$. Let the total number of PO-sets in a transaction set be
$N_D$. Since each PO-set contains at least one element different from those of other PO-sets and transactions
are arranged in a one-dimensional space, $N_D \le N$.
[Figure 3.2: Bus Architecture and Acyclic transaction set. The figure shows a multi-domain ring bus with
routers (bus elements) 1 to 12, bus segments, and transactions $\tau_1$ to $\tau_8$.]
[Figure 3.3: Indexed straight line representation. Element indexes 1 to 13; transactions $\tau_1$ to $\tau_8$.]
A transaction set is said to be acyclic if there exists a router which has no transaction going through it. Otherwise,
the transaction set is cyclic. Note that a cyclic transaction set creates a cycle on the ring-topology
NoC: the cycle comprises several overlapping transactions. Figure 3.2 shows an example of an acyclic
transaction set where elements 1, 7, and 8 have no transaction going through, whereas Figure 3.4 shows an example of a
cyclic transaction set. For ease of identifying the first and second endpoints in the figure, we depict each
transaction $\tau_i$ as an arrow which always points from the first endpoint to the second endpoint of $\tau_i$. The
direction of the arrow does not imply the direction of the transaction.
[Figure 3.4: Cyclic transaction set. The figure shows a multi-domain ring bus with routers 1 to 12 and
transactions $\tau_1$ to $\tau_5$.]
Scheduling Model
We adopt a slot-based, contention-free scheduling model similar to the model used in [4, 52] (footnote 1). In this model,
scheduling decisions are made at integral time values, starting from 0. The real interval between time $t \in \mathbb{N}$ and
time $t + 1$, i.e., $[t, t + 1)$, is called slot $t$. We assume that every transaction's execution time and period are
multiples of the slot length. Hereafter, we use a slot as the time unit unless specified otherwise. A schedule $S$
is defined as a function $S : T \times \mathbb{N} \mapsto \{0, 1\}$ where $S(\tau_i, t) = 1$ if and only if $\tau_i$ is scheduled at slot $t$. A
schedule $S$ is valid if and only if, according to $S$, no transaction is ever scheduled in the same
slot as one or more other transactions that overlap with it.
Given the constraint on overlapping transactions, a necessary condition on the schedulability of a transaction
set can be easily derived as in Theorem 3.3.1.
Theorem 3.3.1 A transaction set $T$ is schedulable only if:
\[ \forall D \subseteq T : u_D = \sum_{\tau_i \in D} u_i \le 1 \tag{3.3.1} \]
Proof.
Since, by definition, no two transactions of a PO-set $D$ can be scheduled concurrently, all transactions of $D$
must be scheduled in sequence. In other words, the transactions of $D$ can be considered to be sharing one
resource. Therefore, Inequality 3.3.1 must be satisfied. □
Let E(k) be a set of all transactions in T that use the link between router k and k + 1. The following
lemma is necessary for later discussion.
1A hardware implementation of this model is presented in[12].
33
Lemma 3.3.1 Given a transaction set T that satisfies the necessary condition, the following inequality
holds. X
8i2E(k)
ui  1
Proof.
Since transactions in E(k) pairwise overlap, there exists D such that E(k)  D. Therefore the lemma is
implied by Theorem 3.3.1. 2
Indexed Straight-line Representation of Acyclic Transaction Sets
For ease of presentation, hereafter we use the indexed straight-line representation described below to model
acyclic transaction sets. Given an acyclic transaction set, we select a router which has no transaction going
through it to be the first element. Then, the routers are indexed in ascending order from 1 to $N^\eta$ in the clockwise
direction, so that the first element's index is 1. Bus elements in Figure 3.2 are indexed following this definition.
Since there are no transactions going through element 1, the overlaps between transactions in the acyclic
transaction set remain the same if we apply the following transformation: 1) let router $N^\eta$, instead of connecting
to router 1, connect to an additional router which is indexed $N^\eta + 1$; 2) change every transaction which
has its second endpoint at 1, i.e., $\tau_i = (e_i, p_i, \eta_i^1, 1)$, to be $\tau_i = (e_i, p_i, \eta_i^1, N^\eta + 1)$. For example, in Figure
3.2, $\tau_7 = (e_7, p_7, 8, 1)$ will be changed to $\tau_7 = (e_7, p_7, 8, 13)$. Since the overlaps between transactions
are still the same after the transformation, a valid schedule of the transformed transaction set is also a valid
schedule of the original transaction set and vice versa. Given this transformation, the acyclic transaction set
can be represented as a set of overlapping line intervals on an indexed straight line, where each line interval
corresponds to a transaction and the straight line is indexed from 1 to $N^\eta + 1$. Figure 3.3 shows the indexed
straight-line representation of the transaction set shown in Figure 3.2. The following properties are obvious
in the straight-line representation of an acyclic transaction set.
Property 3.3.1 For every transaction $\tau_i$, we have $\eta_i^1 < \eta_i^2$.
Property 3.3.2 For every pair of transactions $\tau_i$ and $\tau_j$, $\tau_i$ and $\tau_j$ overlap if and only if $\eta_i^1 < \eta_j^2$ and $\eta_j^1 < \eta_i^2$.
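As an illustration of the straight-line representation, the Python sketch below computes the link sets E(k), derives the PO-sets, and checks the necessary condition of Theorem 3.3.1. Two things are assumptions of this sketch rather than statements from the text. First, the endpoints in `EXAMPLE` are hypothetical: the figure is not reproduced here, so they were chosen only to be consistent with the four PO-sets listed for Figure 3.2 and with the transformed transaction (8, 13). Second, the sketch relies on the observation that pairwise-overlapping intervals on a line always share at least one link, so every PO-set coincides with a maximal E(k).

```python
from fractions import Fraction

# name -> (e, p, first endpoint, second endpoint); endpoints are hypothetical.
EXAMPLE = {
    "t1": (2, 8, 1, 3), "t2": (1, 8, 1, 4), "t3": (2, 8, 2, 3), "t4": (3, 8, 2, 6),
    "t5": (4, 8, 3, 5), "t6": (1, 8, 4, 7), "t7": (4, 8, 8, 13), "t8": (4, 8, 9, 12),
}


def overlap(a, b):
    """Property 3.3.2: intervals overlap iff each one starts before the other ends."""
    return a[2] < b[3] and b[2] < a[3]


def link_set(txs, k):
    """E(k): names of the transactions using the link between element k and k + 1."""
    return frozenset(n for n, (_, _, first, second) in txs.items() if first <= k < second)


def po_sets(txs, last_index):
    """PO-sets as the maximal non-empty link sets (valid on the straight line only)."""
    candidates = {link_set(txs, k) for k in range(1, last_index)} - {frozenset()}
    maximal = [s for s in candidates if not any(s < other for other in candidates)]
    return sorted(maximal, key=sorted)


def satisfies_necessary_condition(txs, last_index):
    """Theorem 3.3.1: the utilization of every PO-set must not exceed 1."""
    return all(
        sum(Fraction(txs[n][0], txs[n][1]) for n in d) <= 1
        for d in po_sets(txs, last_index)
    )


PO_SETS = po_sets(EXAMPLE, last_index=13)
```

With the execution times used later for Figure 3.5 and a common period of 8, every PO-set in this example has utilization exactly 1, so the necessary condition holds with no slack.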
We study the scheduling problem for acyclic transaction sets in Section 3.3.1. We then extend our
solution to cyclic transaction sets in Section 3.3.2.
3.3.1 Scheduling Algorithms for Acyclic Transactions
In this section we present our scheduling algorithms for the proposed real-time transaction sets on the
ring-topology NoC. The discussion is divided into two parts.
First, we propose an algorithm, namely POBase, which schedules every acyclic transaction set whose
transactions have the same period. We will prove that the necessary condition (Theorem 3.3.1) is also
a sufficient condition for a same-period acyclic transaction set to be schedulable by POBase. Therefore,
POBase is optimal (in terms of schedulability) for these transaction sets. The algorithm is pseudo-polynomial.
Second, a scheduling algorithm, namely POGen, is proposed to schedule acyclic transaction sets whose
transactions do not have the same period. POGen, which is built on POBase, is an online algorithm.
POGen can schedule all transaction sets whose PO-set utilizations satisfy the following utilization bound:
\[ \forall D \subseteq T : u_D \le \frac{L - 1}{L}, \tag{3.3.2} \]
where $L$ is defined as the greatest common divisor (GCD) of all transaction periods measured in number of
slots. Note that the bound is low when $L$ is small; for example, it is 0 when transaction periods are mutually
prime numbers ($L = 1$). The bound, however, approaches 1 when $L$ is large. We believe that $L$ is large
in most practical real-time applications [27]. As we will show in the implementation section, with the speed
of the state-of-the-art many-core SoC [2], the practical slot size is about 10 µs to 100 µs (which is also the size
of a time unit in our definition). Meanwhile, the periods in practical real-time applications [27] are usually
measured in millisecond units. Therefore, the smallest value of the GCD of all transaction
periods is 1 ms. Because 1 ms = 10 × 100 µs = 100 × 10 µs, $L$ has practical values ranging
from 10 to 100 slots. This results in a utilization bound between 0.9 and 0.99. We also note that this bound
is only a sufficient bound, and we plan to improve it in future work.
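The arithmetic behind the bound can be checked with a short Python sketch. The function name `pogen_bound` and the sample period lists are hypothetical; periods are expressed in slots.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def pogen_bound(periods_in_slots):
    """Return (L, (L - 1) / L), where L is the GCD of all periods in slots."""
    L = reduce(gcd, periods_in_slots)
    return L, Fraction(L - 1, L)


# Periods of 1, 2 and 5 ms expressed with a 100 us slot and with a 10 us slot.
COARSE = pogen_bound([10, 20, 50])
FINE = pogen_bound([100, 200, 500])
# Mutually prime periods drive L down to 1 and the bound down to 0.
COPRIME = pogen_bound([2, 3, 5])
```

The first two cases give bounds of 9/10 and 99/100, matching the range of 0.9 to 0.99 quoted above, while the last case gives a bound of 0.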
The POBase algorithm
The problem of acyclic same-period transaction set scheduling is similar to the Interval Graph Coloring
Problem (IGCP) [20] (see Chapter 4.1). An interval graph is a graph constructed from a set of intervals on
the real line where each vertex represents an interval and there is an edge between two vertices if the two
corresponding intervals overlap. The IGCP is the problem of assigning a color in a minimum set of colors
to each vertex in the interval graph such that two adjacent vertices do not have the same color. We note that
the IGCP is a special case of our problem and the coloring algorithm in [20] can only handle this special
case. Our proposed algorithm POBase is a new algorithm to solve the problem at hand. POBase is a first-fit
algorithm with respect to a transaction ordering. More specifically, in POBase, the transactions are ordered
ascendingly by their first endpoint (stored in list L). Then, each transaction in L is assigned to the earliest
slots to which no smaller-ordered overlapping transaction has already been assigned (footnote 2). This condition is
enforced by the use of array lastEndpoint (Step 6). lastEndpoint has size equal to the transactions' period
$p$. The initial value of all items of lastEndpoint is 1, which is also the smallest index of the routers. Except
for the initial value, during the algorithm execution, the value of item $t$ of lastEndpoint will be the second
endpoint (Step 8) of the last transaction that has been assigned to slot $t$. Since transactions are assigned
in ascending order of their first endpoints, if condition lastEndpoint[$t$] $\le \eta_i^1$ in Step 6 is satisfied then $\tau_i$
does not overlap with any transaction that has been assigned to $t$ before $\tau_i$. We will formally prove this
statement in Lemma 3.3.3. This proof requires Lemma 3.3.2. Finally, the proof of POBase's optimality will
be shown in Theorem 3.3.2.
[Figure 3.5: An example of the POBase algorithm. Time axis with slots 0 to 8; one row per transaction $\tau_1$ to $\tau_8$.]
Figure 3.5 shows an example of the schedule generated by POBase for the transaction set shown in
Figure 3.2 whose transactions have period equal to 8 and execution times: $e_1 = 2$, $e_2 = 1$, $e_3 = 2$, $e_4 = 3$,
$e_5 = 4$, $e_6 = 1$, $e_7 = 4$, $e_8 = 4$. Consider the schedule of transactions of $D_2 = \{\tau_2, \tau_4, \tau_5\}$. $\tau_5$ is
scheduled in slots $\{0, 1, 3, 4\}$, because its smaller-ordered overlapping transactions $\tau_2$ and $\tau_4$ are scheduled
in slots $\{2, 5, 6, 7\}$.
Footnote 2: The transactions can also be ordered by their second endpoint, with the schedule generated in
descending order of the ordered list.
Algorithm 6 POBase
Input: $T$ such that $\forall \tau_i \in T : p_i = p$
Output: schedule $S$ for period $p$
1: L $\leftarrow$ list of all $\tau_i \in T$ in ascending order of $\eta_i^1$
2: $\forall t \in [0, p)$ : lastEndpoint[$t$] $\leftarrow 1$
3: for each $\tau_i \in$ L do
4:     $r \leftarrow e_i$
5:     for each $t \in [0, p)$ do
6:         if lastEndpoint[$t$] $\le \eta_i^1$ then
7:             $S(\tau_i, t) \leftarrow 1$
8:             lastEndpoint[$t$] $\leftarrow \eta_i^2$
9:             $r \leftarrow r - 1$
10:            if $r = 0$ then
11:                break // complete schedule assignment of $\tau_i$
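A minimal Python sketch of Algorithm 6 follows, with comments mapping each line back to the numbered steps. It is an illustration, not the author's implementation. Assumptions of this sketch: the function and variable names are hypothetical; the endpoints in `EXAMPLE` are hypothetical values chosen to be consistent with the PO-sets listed for Figure 3.2 (only the execution times and the period come from the text); and the check for a completed assignment is moved ahead of the slot test so that a transaction with zero execution time receives no slot, which is a small deviation from the pseudocode.

```python
def po_base(transactions, p):
    """First-fit slot assignment for same-period transactions on the indexed straight line.

    transactions: dict name -> (e, first endpoint, second endpoint)
    p: common period in slots
    Returns a dict name -> list of assigned slots.
    """
    # Step 1: ascending order of first endpoint (sorted() is stable, ties keep input order).
    order = sorted(transactions, key=lambda name: transactions[name][1])
    last_endpoint = [1] * p                      # Step 2
    schedule = {name: [] for name in transactions}
    for name in order:                           # Step 3
        e, first, second = transactions[name]
        remaining = e                            # Step 4
        for t in range(p):                       # Step 5
            if remaining == 0:                   # Steps 10-11, tested first (see lead-in)
                break
            if last_endpoint[t] <= first:        # Step 6
                schedule[name].append(t)         # Step 7
                last_endpoint[t] = second        # Step 8
                remaining -= 1                   # Step 9
    return schedule


def overlaps(a, b):
    """Property 3.3.2 applied to (e, first, second) tuples."""
    return a[1] < b[2] and b[1] < a[2]


def is_valid(transactions, schedule):
    """True if no two overlapping transactions share a slot."""
    names = list(transactions)
    return all(
        not (set(schedule[m]) & set(schedule[n]))
        for i, m in enumerate(names)
        for n in names[i + 1:]
        if overlaps(transactions[m], transactions[n])
    )


# name -> (e, first endpoint, second endpoint); endpoints are hypothetical.
EXAMPLE = {
    "t1": (2, 1, 3), "t2": (1, 1, 4), "t3": (2, 2, 3), "t4": (3, 2, 6),
    "t5": (4, 3, 5), "t6": (1, 4, 7), "t7": (4, 8, 13), "t8": (4, 9, 12),
}
SCHEDULE = po_base(EXAMPLE, p=8)
```

Under these assumed endpoints, `SCHEDULE["t5"]` is `[0, 1, 3, 4]` and the slots of `t2` and `t4` together are 2, 5, 6 and 7, which agrees with the schedule described for Figure 3.5.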
Lemma 3.3.2 At each iteration of Step 6, for every $\tau_j$ which has been assigned to slot $t$, $\eta_j^2 \le$ lastEndpoint[$t$].
Proof.
We prove by induction.
Base case: Consider the first iteration. The lemma holds because no transaction has been assigned yet.
Induction case: Assume that the lemma holds at iteration $k$ where $\tau_j$ is being assigned; we will prove that it
also holds at iteration $k + 1$. Consider slot $t$ to which $\tau_j$ is assigned. Due to the condition at Step 6, we have
lastEndpoint[$t$] $\le \eta_j^1$. Then by the induction assumption, we have: for every $\tau_l$ that has been assigned to $t$
before iteration $k$, $\eta_l^2 \le \eta_j^1$. Furthermore, if $\tau_j$ is assigned to $t$, we have lastEndpoint[$t$] $= \eta_j^2$ after Step 8.
Since by Property 3.3.1 of the indexed straight-line representation, $\eta_j^1 < \eta_j^2$, the lemma holds at iteration
$k + 1$. □
Lemma 3.3.3 Slot $t$ has not been assigned to any overlapping transaction of $\tau_i$ if and only if lastEndpoint[$t$] $\le \eta_i^1$.
Proof.
Necessary condition: We prove this condition by showing that if lastEndpoint[$t$] $> \eta_i^1$ then slot $t$ has been
assigned to a transaction that overlaps $\tau_i$. Since lastEndpoint[$t$] $> \eta_i^1$, there must exist a transaction $\tau_j$ that
has been assigned to $t$ before $\tau_i$ where $\eta_j^2 =$ lastEndpoint[$t$] $> \eta_i^1$. And since $\tau_j$ has been assigned to the
slot before $\tau_i$, $\eta_i^1 \ge \eta_j^1$. Therefore, we have: $\eta_j^2 > \eta_i^1$ and $\eta_i^2 > \eta_j^1$. Then, by Property 3.3.2 of the indexed
straight-line representation, $\tau_i$ and $\tau_j$ overlap.
Sufficient condition: If lastEndpoint[$t$] $\le \eta_i^1$, then by Lemma 3.3.2 we have that: for every $\tau_j$ that has been
assigned to slot $t$ before $\tau_i$, $\eta_j^2 \le$ lastEndpoint[$t$] $\le \eta_i^1$. Therefore, by Property 3.3.2 of the indexed straight-line
representation, $\tau_j$ and $\tau_i$ do not overlap. □
Theorem 3.3.2 If the same-period acyclic transaction set $T$ satisfies the necessary condition 3.3.1, then
POBase generates a feasible schedule for $T$.
Proof.
The generated schedule is valid because a transaction is not scheduled in the same slot with its overlapping
transactions (Lemma 3.3.3). It remains to show that if a transaction set satisfies the necessary condition,
then at the end of the algorithm,
\[ \forall \tau_i : \sum_{x \in [0, p)} S(\tau_i, x) = e_i. \tag{3.3.3} \]
We will prove this by induction.
Base case: Consider the first iteration of the for-loop starting at Step 3. In this iteration, the schedule of $\tau_1$
in L is generated. Since all items of lastEndpoint have value 1, the condition at Step 6 is satisfied for every
$t$. Furthermore, we have $e_1 \le p$. Therefore, at the end of the iteration, Equation 3.3.3 must hold for $\tau_1$, i.e.,
$\sum_{x \in [0, p)} S(\tau_1, x) = e_1$.
Induction case: Assume that after iteration $k$ of the for-loop starting at Step 3, Equation 3.3.3 holds for all
transactions $\{\tau_i : i \in [1, k]\}$. We will prove that Equation 3.3.3 also holds for $\tau_{k+1}$ after iteration $k + 1$.
By contradiction, assume that at the end of iteration $k + 1$, $\sum_{x \in [0, p)} S(\tau_{k+1}, x) < e_{k+1}$. Let $E(\eta_{k+1}^1)$
be the set of transactions that use the link between router $\eta_{k+1}^1$ and router $\eta_{k+1}^1 + 1$. By the way the transactions
are ordered and Property 3.3.2 of the indexed straight-line representation, we have that $\forall \tau_i \in T$, if the schedule
of $\tau_i$ has been generated before $\tau_{k+1}$ and $\tau_i$ overlaps $\tau_{k+1}$, then $\eta_i^1 \le \eta_{k+1}^1 < \eta_i^2$. Therefore $\tau_i \in E(\eta_{k+1}^1)$.
In other words, among all the transactions that overlap with $\tau_{k+1}$, only transactions in $E(\eta_{k+1}^1)$ have their
schedule generated. Therefore, the contradiction assumption occurs only when:
\[ \sum_{\tau_i \in E(\eta_{k+1}^1)} \sum_{x \in [0, p)} S(\tau_i, x) = p. \tag{3.3.4} \]
Since the following is true:
\[ \forall \tau_i \in E(\eta_{k+1}^1) \setminus \{\tau_{k+1}\} : \sum_{x \in [0, p)} S(\tau_i, x) = e_i, \]
by the contradiction assumption and Equation 3.3.4 we have: $\sum_{\tau_i \in E(\eta_{k+1}^1)} e_i > p$. This contradicts
Lemma 3.3.1, which implies that $\sum_{\tau_i \in E(\eta_{k+1}^1)} e_i \le p$. Therefore, at the end of the iteration, Equation 3.3.3
must hold for $\tau_{k+1}$. This completes the proof. □
POBase Analysis: If an efficient sorting algorithm is used at Step 1, the time complexity of this step will
be $O(N \log N)$. Furthermore, Steps 6 to 11 require a constant number of operations. Therefore the time
complexity of POBase to build a schedule of $p$ slots (where $p$ is the common period) for $N$ transactions is
$O(N \cdot \max(\log N, p))$. We note that the time complexity of POBase is pseudo-polynomial, and when
$p \ge N$, it is equivalent to that of PFair [4] and BoundaryFair [52], which take $O(N \cdot p)$ to generate the
schedule of $p$ slots.
The POGen algorithm
In this subsection we propose an online scheduling algorithm (POGen) for acyclic transaction sets whose
transactions do not have the same period. Our proposed solution is inspired by the work in [52], in which
the execution time line is divided into intervals and the number of slots in each interval which is assigned
to a task is proportional to the task's utilization. However, since the work in [52] does not have the transaction
overlap assumption, its proposed algorithms cannot be used for the problem at hand.
In POGen, the execution time line from 0 to the hyper-period $h$, i.e., $[0, h)$, is divided into a set of
consecutive scheduling intervals: $\{int_k = [t_k, t_{k+1}) : k \in \mathbb{N} \wedge 0 \le t_k < t_{k+1} \le h\}$. Let $|int_k| = t_{k+1} - t_k$.
Scheduling intervals must respect two fundamental properties: 1) the arrival time (also deadline) of any
transaction must coincide with the finishing time of a scheduling interval and the start time of the next one;
2) the length of every scheduling interval must be at least $L$, where $L$ is the greatest common divisor
of all transaction periods. As we will show in Theorem 3.3.3, the second property is essential to induce a
feasible utilization bound: intuitively, longer scheduling intervals allow POGen to better approximate the
fluid scheduling model. There are multiple feasible assignments of scheduling intervals that respect the two
properties. For example, Figure 3.7 shows the scheduling intervals induced by the set of three transactions
shown in Figure 3.6, where $\tau_1 = \{e_1 = 1, p_1 = 2, \eta_1^1 = 1, \eta_1^2 = 3\}$, $\tau_2 = \{e_2 = 1, p_2 = 3, \eta_2^1 = 1, \eta_2^2 = 4\}$
and $\tau_3 = \{e_3 = 1, p_3 = 6, \eta_3^1 = 2, \eta_3^2 = 5\}$. In this example, the scheduling intervals are the
intervals between two closest arrival times of any two transactions. Note that by definition of $L$ and since
all transactions arrive at time 0, it follows that the minimum length of scheduling intervals in the example
is indeed $L$. An alternative feasible definition for scheduling intervals consists in assigning $t_k = kL$, i.e.,
all scheduling intervals have fixed length $L$. By definition of $L$, it then follows that the arrival time of
any transaction coincides with the start time $t_k$ of some interval $int_k$. In the rest of this section, we will not
restrict ourselves to any specific interval assignment, instead only assuming that scheduling intervals respect
the two fundamental properties.
[Figure 3.6: Example of three transactions. Element indexes 1 to 10; transactions $\tau_1$, $\tau_2$, $\tau_3$.]
[Figure 3.7: Scheduling intervals on the execution time line. Time axis from 0 to 6; intervals $int_0$ to $int_3$.]
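Both interval assignments described above can be sketched in Python as follows. The function names are hypothetical; the periods 2, 3 and 6 are those of the three-transaction example.

```python
from math import gcd, lcm


def arrival_intervals(periods):
    """Intervals between consecutive distinct arrival times within one hyper-period."""
    h = lcm(*periods)
    points = sorted({k * p for p in periods for k in range(h // p + 1)})
    return list(zip(points, points[1:]))


def fixed_intervals(periods):
    """Alternative assignment t_k = k * L, where L is the GCD of the periods."""
    L, h = gcd(*periods), lcm(*periods)
    return [(t, t + L) for t in range(0, h, L)]


BY_ARRIVAL = arrival_intervals([2, 3, 6])
FIXED = fixed_intervals([2, 3, 6])
```

For this example the arrival-based assignment yields the four intervals [0, 2), [2, 3), [3, 4) and [4, 6), whose minimum length is L = 1, while the fixed assignment yields six intervals of length 1.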
In each scheduling interval $int_k$, each transaction $\tau_i$ is assigned an interval load $l_i^k$, which is the number
of slots in the interval allocated to schedule $\tau_i$. The interval loads of each transaction are calculated such
that at the end of each interval, the transaction's execution approximates its execution in the fluid scheduling
model [15]. The interval load of a PO-set is the sum of the interval loads of its transactions. Given the
interval loads of all transactions in interval $int_k$, POBase is used to generate the schedule of $int_k$.
As shown in the previous subsection, the interval schedule given by POBase will be feasible if and only
if:
\[ \forall D \subseteq T : \sum_{\tau_i \in D} l_i^k \le |int_k|. \]
A schedule of a transaction set, which is generated by POGen, is feasible if it satisfies the following two
conditions:
Condition 3.3.1 for each transaction $\tau_i$, the sum of the interval loads over the transaction period is equal
to $e_i$.
Condition 3.3.2 there is a feasible schedule for every scheduling interval.
In the following paragraphs, we will discuss our solution to identify the scheduling intervals and the interval
loads which induce a feasible schedule.
With regard to the interval loads, we define for each transaction $\tau_i$ and scheduling interval $int_k$ a lag
function:
\[ lag(\tau_i, int_k) = u_i \cdot t_{k+1} - \sum_{x \in [0, t_k)} S(\tau_i, x). \]
The function calculates how much time $\tau_i$ must be executed in interval $int_k$ such that at the end of $int_k$ it
is scheduled according to the fluid scheduling model [15]. We also define for each PO-set $D$ a similar lag
function:
\[ lag(D, int_k) = u_D \cdot t_{k+1} - \sum_{\tau_i \in D} \sum_{x \in [0, t_k)} S(\tau_i, x). \]
The goal of POGen is to generate a feasible load set for each interval $int_k$, that is, a set of transaction
loads that satisfy the following inequalities:
\[ \forall \tau_i \in T : \lfloor lag(\tau_i, int_k) \rfloor \le l_i^k \le \lceil lag(\tau_i, int_k) \rceil, \tag{3.3.5} \]
\[ \forall D \subseteq T : \lfloor lag(D, int_k) \rfloor \le \sum_{\tau_i \in D} l_i^k \le \min(|int_k|, \lceil lag(D, int_k) \rceil). \tag{3.3.6} \]
Inequality 3.3.5 sets conditions on the interval load for each transaction, based on the closest integral values
of the lag functions. Inequality 3.3.6 sets conditions on the total interval load of each PO-set. Note that
the right side of Inequality 3.3.6 guarantees that each PO-set with feasible loads is schedulable in $int_k$
by POBase, that is, Condition 3.3.2 is satisfied for $int_k$. Similarly, if all loads satisfy the lower bounds of
Inequalities 3.3.5, then the generated schedule satisfies Condition 3.3.1. The reason is as follows. Consider the
last scheduling interval of a period of transaction $\tau_i$: $int = [t, a \cdot p_i)$ where $t$ and $a$ are some integers. The
lag function of $\tau_i$ is:
\[ lag(\tau_i, int) = a \cdot u_i \cdot p_i - \sum_{x \in [0, t)} S(\tau_i, x). \]
Since $u_i \cdot p_i = e_i$ is an integer, and so is $S(\tau_i, x)$, $\lfloor lag(\tau_i, int) \rfloor = lag(\tau_i, int)$. That means the total interval
load of $\tau_i$ up to slot $a \cdot p_i$, which is calculated as:
\[ \lfloor lag(\tau_i, int) \rfloor + \sum_{x \in [0, t)} S(\tau_i, x), \]
is equal to $a \cdot e_i$ and satisfies Condition 3.3.1. However, using only the lower bound loads does not guarantee that
Inequality 3.3.6 can be satisfied at the same time. This is also true if only upper bound loads are used. The
following example illustrates this point. Consider again the example of the transaction set in Figures 3.6 and
3.7. If the algorithm runs with interval loads set to their lower bounds, then the sets of interval
loads in the four intervals $int_0$, $int_1$, $int_2$, $int_3$ are respectively $\{l_1^0 = 1, l_2^0 = 0, l_3^0 = 0\}$, $\{l_1^1 = 0, l_2^1 = 1, l_3^1 = 0\}$,
$\{l_1^2 = 1, l_2^2 = 0, l_3^2 = 0\}$, $\{l_1^3 = 1, l_2^3 = 1, l_3^3 = 1\}$. This means the schedule in $int_3$ is not feasible because
its total load is 3 while $|int_3| = 2$. If, instead, only the upper bound loads are used, then the schedule
of interval $int_0$ is also not feasible because the total load in this interval is 3. An algorithm that generates
feasible schedules must use a combination of these values, and computing this combination is not trivial.
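The example can be replayed with a short Python sketch that evaluates the lag function with exact rational arithmetic and always picks either its floor or its ceiling. The function name `bound_loads` and the data layout are hypothetical; the sketch also assumes that every assigned load is fully scheduled within its interval, so that the slots scheduled so far equal the sum of the earlier loads.

```python
from fractions import Fraction
from math import ceil, floor

TXS = {"t1": (1, 2), "t2": (1, 3), "t3": (1, 6)}  # name -> (e, p) from the example
INTERVALS = [(0, 2), (2, 3), (3, 4), (4, 6)]      # int_0 .. int_3


def bound_loads(txs, intervals, pick):
    """Loads per interval when every load is pick(lag), with pick = floor or ceil."""
    done = {name: 0 for name in txs}  # slots scheduled before the current interval
    history = []
    for _, t_next in intervals:
        loads = {
            name: pick(Fraction(e, p) * t_next - done[name])
            for name, (e, p) in txs.items()
        }
        history.append(loads)
        for name, load in loads.items():
            done[name] += load
    return history


LOWER = bound_loads(TXS, INTERVALS, floor)
UPPER = bound_loads(TXS, INTERVALS, ceil)
```

Since the three transactions pairwise overlap, they form a single PO-set, and the total load of an interval must not exceed its length. The lower-bound run violates this in the last interval (load 3, length 2) and the upper-bound run violates it in the first interval (load 3, length 2).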
POGen achieves this by iteratively computing a feasible load set for each scheduling interval. It
is an online algorithm which is invoked at the beginning of each interval and generates the schedule for that
interval. In Lemma 3.3.4, we first show that the following inequalities initially hold at the beginning of the
first interval $int_0$:
\[ \forall D \subseteq T : \lfloor lag(D, int_k) \rfloor \le |int_k|, \tag{3.3.7} \]
\[ \forall D \subseteq T : \sum_{\tau_i \in D} \sum_{x \in [0, t_k)} S(\tau_i, x) \ge \lfloor u_D \cdot t_k \rfloor. \tag{3.3.8} \]
A feasible load set is then computed in Step 1 of POGen by the GenerateLoad procedure, which is assumed
to satisfy the following proposition:
Proposition 3.3.1 Assume that all PO-sets satisfy the utilization bound in Inequality 3.3.2. If Inequalities
3.3.7 and 3.3.8 hold for $int_k$, then GenerateLoad computes a feasible load set for $int_k$.
Given a feasible load set for interval $int_0$, Lemma 3.3.4 guarantees that Inequalities 3.3.7 and 3.3.8 again hold
for $int_1$. Hence, in the next execution of POGen at $t_1$, GenerateLoad can compute a feasible load set for $int_1$,
and so on for all scheduling intervals in the hyper-period. Since a feasible load set is obtained
for all scheduling intervals, Condition 3.3.1 and Condition 3.3.2 are satisfied and thus POGen generates a feasible
schedule of $T$. In the next subsection, we will prove that GenerateLoad indeed satisfies Proposition 3.3.1.
Algorithm 7 POGen
Input: transaction set $T$, interval $int_k$
Output: schedule $S$ for $int_k$
1: $\{l_i^k : \forall i \in [1, N]\} \leftarrow$ GenerateLoad($T$, $int_k$)
2: $T' \leftarrow \{(l_i^k, |int_k|, \eta_i^1, \eta_i^2) : \forall i \in [1, N]\}$
3: $S$ for $int_k \leftarrow$ POBase($T'$)
Lemma 3.3.4 If GenerateLoad satisfies Proposition 3.3.1, then Inequalities 3.3.7 and 3.3.8 hold for every
scheduling interval.
Proof.
We prove by induction.
Base step: Consider the first scheduling interval $int_0 = [0, t_1)$. Inequalities 3.3.7 for this interval hold
because
\[ \forall D \subseteq T : \lfloor lag(D, int_0) \rfloor = \lfloor u_D \cdot t_1 \rfloor \le |int_0|, \]
and Inequalities 3.3.8 hold because
\[ \forall D \subseteq T : \sum_{\tau_i \in D} \sum_{x \in [0, 0)} S(\tau_i, x) = 0 = \lfloor u_D \cdot 0 \rfloor. \]
Induction step: Assume that Inequalities 3.3.7 and 3.3.8 hold in every scheduling interval up to $int_k$. We prove
that Inequalities 3.3.7 and 3.3.8 also hold before the execution of GenerateLoad at interval $int_{k+1}$. Since
Inequalities 3.3.7 and 3.3.8 are satisfied at interval $int_k$, GenerateLoad generates a feasible load set and POBase
generates a feasible schedule for the interval. Therefore after Step 3, we have:
\[ \forall D \subseteq T : \sum_{\tau_i \in D} \sum_{x \in [t_k, t_{k+1})} S(\tau_i, x) = \sum_{\tau_i \in D} l_i^k. \]
Then by the left side of Inequalities 3.3.6, we obtain the following, which proves that Inequalities 3.3.8 hold
for $int_{k+1}$:
\[
\begin{aligned}
\forall D \subseteq T : \sum_{\tau_i \in D} \sum_{x \in [0, t_{k+1})} S(\tau_i, x)
&= \sum_{\tau_i \in D} \sum_{x \in [0, t_k)} S(\tau_i, x) + \sum_{\tau_i \in D} l_i^k \\
&\ge \sum_{\tau_i \in D} \sum_{x \in [0, t_k)} S(\tau_i, x) + \left\lfloor u_D \cdot t_{k+1} - \sum_{\tau_i \in D} \sum_{x \in [0, t_k)} S(\tau_i, x) \right\rfloor \\
&= \lfloor u_D \cdot t_{k+1} \rfloor.
\end{aligned}
\]
Now consider Inequalities 3.3.7. Notice that since $S(\tau_i, x)$ is an integer, we have:
\[ \forall D \subseteq T : \lfloor lag(D, int_{k+1}) \rfloor = \lfloor u_D \cdot t_{k+2} \rfloor - \sum_{\tau_i \in D} \sum_{x \in [0, t_{k+1})} S(\tau_i, x). \]
Since Inequalities 3.3.8 hold for $int_{k+1}$, Inequalities 3.3.7 also hold because (the last inequality holds
because of Inequality 3.3.2):
\[
\begin{aligned}
\forall D \subseteq T : \lfloor lag(D, int_{k+1}) \rfloor
&= \lfloor u_D \cdot t_{k+2} \rfloor - \sum_{\tau_i \in D} \sum_{x \in [0, t_{k+1})} S(\tau_i, x) \\
&\le \lfloor u_D \cdot t_{k+2} \rfloor - \lfloor u_D \cdot t_{k+1} \rfloor \\
&\le \lceil u_D \cdot (t_{k+2} - t_{k+1}) \rceil \le |int_{k+1}|.
\end{aligned}
\]
This completes the proof. □
The GenerateLoad procedure
As mentioned above, procedure GenerateLoad searches for a feasible load set for each scheduling interval. Two questions have to be answered: (Q1) is there a feasible load set? (Q2) is there an efficient algorithm to find it? We will show that the problem at hand is equivalent to the problem of Circulations in flow Graphs with Demands and Lower bounds (CGDL) [20] where all the demands are zero. A flow graph is a directed graph where each edge has a capacity and there is a flow going through each edge in its specified direction. The magnitude (a.k.a. value) of a flow must not exceed the capacity of the edge it goes through. If an edge also has a lower bound, then the flow value must be at least this bound. Furthermore, each vertex may have a demand d. The values of the flows must satisfy the conservation constraint, that is: the total value of the flows entering a vertex must be d units higher than that of the flows exiting the same vertex. The CGDL problem is the problem of finding a set of flows that satisfies all the constraints (a.k.a. feasible flows) in a given graph. It has been proved in [20] that the Ford-Fulkerson algorithm can be used to generate a set of feasible flows in a CGDL problem if one exists. More specifically, if the graph has integer edge capacities, lower bounds and vertex demands, then the generated feasible flows are also integers. The Ford-Fulkerson algorithm is a well-known algorithm in graph theory which was originally designed to solve the maximum-flow problem [20].
Having proved that the problem of searching for a feasible load set is equivalent to the CGDL problem,
we will then prove that if the utilization of each PO-set is smaller than the utilization bound expressed by
Inequalities 3.3.2, there always exists a feasible solution therefore answering Q1. Then, since the Ford-
Fulkerson algorithm [20] can be used to solve the problem, Q2 is also answered.
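As a rough illustration of how such a feasibility check can be carried out, the following Python sketch uses the textbook reduction from a circulation problem with lower bounds and zero demands to a maximum-flow problem, solved here with a breadth-first variant of Ford-Fulkerson (Edmonds-Karp). The function name feasible_circulation, the edge tuple layout (tail, head, lower bound, capacity) and the small example graphs are assumptions made for this example; they are not taken from GenerateLoad, and every edge here has both endpoints.

```python
from collections import deque


def feasible_circulation(num_vertices, edges):
    """Return one integer flow per edge, or None if no feasible circulation exists.

    edges: list of (tail, head, lower, cap) with integer bounds, all demands zero.
    Hypothetical helper for illustration only.
    """
    source, sink = num_vertices, num_vertices + 1
    graph = [[] for _ in range(num_vertices + 2)]  # vertex -> list of arc ids
    to, residual = [], []                           # arc a and a ^ 1 are a forward/reverse pair

    def add_arc(u, v, c):
        graph[u].append(len(to)); to.append(v); residual.append(c)
        graph[v].append(len(to)); to.append(u); residual.append(0)

    # Send the lower bound on every edge up front and record the imbalance it creates.
    excess = [0] * num_vertices
    for u, v, lower, cap in edges:
        add_arc(u, v, cap - lower)   # k-th edge gets arc id 2 * k
        excess[v] += lower
        excess[u] -= lower

    required = 0
    for v, x in enumerate(excess):
        if x > 0:
            add_arc(source, v, x)
            required += x
        elif x < 0:
            add_arc(v, sink, -x)

    pushed = 0
    while True:
        pred = {source: None}        # vertex -> arc used to reach it
        queue = deque([source])
        while queue and sink not in pred:
            u = queue.popleft()
            for a in graph[u]:
                if residual[a] > 0 and to[a] not in pred:
                    pred[to[a]] = a
                    queue.append(to[a])
        if sink not in pred:
            break
        path, v = [], sink
        while pred[v] is not None:   # walk back from sink to source
            path.append(pred[v])
            v = to[pred[v] ^ 1]
        bottleneck = min(residual[a] for a in path)
        for a in path:
            residual[a] -= bottleneck
            residual[a ^ 1] += bottleneck
        pushed += bottleneck

    if pushed != required:           # some lower bound cannot be balanced
        return None
    # Flow on the k-th edge = lower bound + what was pushed on its forward arc.
    return [lower + residual[2 * k + 1] for k, (_, _, lower, _) in enumerate(edges)]
```

Because all bounds are integers and every augmentation pushes an integer amount, the returned flows are integers, which is the property the argument above relies on.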
In the following, we intuitively describe the construction of a directed graph from the input of GenerateLoad. Each vertex of the constructed graph represents a PO-set $D_j$. For each vertex, we define a PO-set edge $g^D_j$ whose real-valued flow $f^D_j$ represents the interval load of the corresponding PO-set. An integer-valued lower bound $b^D_j$ and an integer-valued capacity $c^D_j$ are defined for each of the PO-set edges such that Inequalities 3.3.6 are imposed on their flow values, i.e.:
\[
\forall D_j \subseteq T : b^D_j \le f^D_j \le c^D_j, \tag{3.3.9}
\]
where
\[
b^D_j = \lfloor lag(D_j, int^k) \rfloor, \qquad
c^D_j = \min(|int^k|, \lceil lag(D_j, int^k) \rceil).
\]
Furthermore, for each transaction $\tau_i$, a transaction edge is defined whose real-valued flow $f_i$ represents the interval load of the corresponding transaction. A lower bound $b_i$ and a capacity $c_i$ are defined for each of the transaction edges such that Inequalities 3.3.5 are imposed on their flow values:
\[
\forall \tau_i \in T : b_i = \lfloor lag(\tau_i, int^k) \rfloor \le f_i \le c_i = \lceil lag(\tau_i, int^k) \rceil. \tag{3.3.10}
\]
The flow of a transaction edge entering a vertex represents the contribution of the corresponding transaction's interval load to the corresponding PO-set's interval load. The endpoints and the direction of each edge are defined in such a way that the values of the flows in and out of a vertex preserve the relationship between the interval load of the corresponding PO-set and that of its transactions. The graph has a feasible circulation flow which represents a feasible load set.
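The edge bounds of Inequalities 3.3.9 and 3.3.10 can be sketched numerically as follows. This is a minimal example that assumes the lag of a transaction is defined analogously to the lag of a PO-set used later in the proof of Lemma 3.3.7 (utilization times $t^{k+1}$ minus the slots already scheduled in $[0, t^k)$); the function names and the sample numbers are hypothetical. Exact rationals are used so that floors and ceilings are not affected by rounding.

```python
from fractions import Fraction
from math import ceil, floor


def lag(utilization, t_next, slots_used):
    """lag = u * t^{k+1} - (number of slots already scheduled in [0, t^k))."""
    return utilization * t_next - slots_used


def transaction_edge_bounds(u_i, t_next, slots_used):
    """(b_i, c_i) as in Inequality 3.3.10: floor and ceiling of the lag."""
    value = lag(u_i, t_next, slots_used)
    return floor(value), ceil(value)


def po_set_edge_bounds(u_d, t_next, slots_used, interval_len):
    """(b^D_j, c^D_j) as in Inequality 3.3.9: the capacity is also capped by |int^k|."""
    value = lag(u_d, t_next, slots_used)
    return floor(value), min(interval_len, ceil(value))


# Hypothetical numbers: first interval of length 5, nothing scheduled yet.
print(transaction_edge_bounds(Fraction(3, 10), 5, 0))   # lag = 3/2
print(po_set_edge_bounds(Fraction(7, 10), 5, 0, 5))     # lag = 7/2
```

In both cases the capacity exceeds the lower bound by at most one, which is what keeps the later complexity analysis simple.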
The following definition is necessary for the graph construction. Let the index PO-set order of a transaction set $T$ be an ordered list of all PO-sets in $T$, sorted by the smallest destination element among the transactions of each PO-set: a PO-set $D_j$ with a smaller minimum has a smaller index. Ties are broken arbitrarily. Since each PO-set has only one such minimum value, the order is well-defined. The transaction set in Figure 3.2 has the index PO-set order $\{D_j : j \in [1, 4]\}$ where $D_1 = \{\tau_1, \tau_2, \tau_3, \tau_4\}$, $D_2 = \{\tau_2, \tau_4, \tau_5\}$, $D_3 = \{\tau_4, \tau_5, \tau_6\}$, $D_4 = \{\tau_7, \tau_8\}$. Figure 3.8 shows the graph G constructed from the transaction set in Figure 3.2. Transaction edges are represented by solid lines while PO-set edges are represented by dotted lines.
Graph construction: let us define a tuple G = (V, E) as follows:
• For each PO-set $D_j$ in the index PO-set order, define a vertex $v_j$.
• For each PO-set $D_j$ in the index PO-set order, define a directed edge $g^D_j$ with capacity $c^D_j = \min(|int^k|, \lceil lag(D_j, int^k) \rceil)$ and lower bound $b^D_j = \lfloor lag(D_j, int^k) \rfloor$. Let $g^D_j$ be a PO-set edge.
• For each transaction $\tau_i$, define a directed edge $g_i$ with capacity $c_i = \lceil lag(\tau_i, int^k) \rceil$ and lower bound $b_i = \lfloor lag(\tau_i, int^k) \rfloor$. Let $g_i$ be a transaction edge.
• $\{g_i : \tau_i \in D_1\}$ are edges that enter $v_1$; $g^D_1$ is an edge that exits $v_1$.
• $\forall j : 1 < j \le N_D$, $\{g_i : \tau_i \in D_j \setminus D_{j-1}\}$ and $g^D_{j-1}$ are edges that enter $v_j$; $\{g_i : \tau_i \in D_{j-1} \setminus D_j\}$ and $g^D_j$ are edges that exit $v_j$. This construction step deals with the situation where two PO-sets $D_{j-1}$, $D_j$ share some transactions. More specifically, to preserve the relationship between the interval loads of the PO-sets and those of their transactions, the transaction edge of a transaction common to two PO-sets would have to enter the two corresponding vertices $v_{j-1}$, $v_j$. Since in a qualified graph each directed edge can enter at most one vertex, this situation must be avoided. This can be accomplished by representing the interval loads of the common transactions on the second PO-set ($v_j$) as the interval load of the first PO-set (i.e., $g^D_{j-1}$ enters $v_j$) minus the interval load of the transactions that are only in the first set (i.e., $\{g_i : \tau_i \in D_{j-1} \setminus D_j\}$ exit $v_j$). Lemma 3.3.5 will detail the proof of this argument.
• $V = \{v_j : j \in [1, N_D]\}$
• $E = \{g_i : i \in [1, N]\} \cup \{g^D_j : j \in [1, N_D]\}$
[Graph drawing: vertices $v_1$ to $v_4$, transaction edges $g_1$ to $g_8$, PO-set edges $g^D_1$ to $g^D_4$; legend: transaction edge, PO-set edge]
Figure 3.8: Constructed graph G
Finally, the graph flow is subject to the flow conservation constraint [20], in which, given a vertex, the sum of the flow values entering it minus the sum of the flow values exiting it is zero. As a graph construction example, consider PO-set $D_2$. Vertex $v_2$ has an output PO-set edge $g^D_2$ which represents the interval load of $D_2$. Since $D_1$ has $\tau_2$ and $\tau_4$ in common with $D_2$ but not $\tau_1$ and $\tau_3$, $v_2$ has an input PO-set edge $g^D_1$ which represents the interval load of $D_1$ and two output transaction edges $g_1$ and $g_3$ that represent the interval loads of $\tau_1$ and $\tau_3$, respectively. Furthermore, note that edges $g_2$ and $g_4$ for the common transactions $\tau_2$ and $\tau_4$ enter $v_1$, not $v_2$. $g_2$ exits $v_3$, but $g_4$ exits $v_4$ because $\tau_4$ also belongs to $D_3$. Finally, $v_2$ has an input transaction edge $g_5$ that represents the interval load of $\tau_5$. Since $\tau_5$ is in $D_2$ and $D_3$ but not $D_1$ and $D_4$, $g_5$ exits $v_4$. Lemma 3.3.5 shows that G is indeed a directed graph.
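The construction rules can be replayed on the PO-sets of the Figure 3.2 example with a short Python sketch. The function build_graph and the edge naming scheme ("g5", "gD2") are assumptions made for illustration; each edge is stored as a pair (vertex it exits, vertex it enters), with None marking a missing endpoint.

```python
def build_graph(po_sets):
    """Apply the construction rules to PO-sets given in index PO-set order.

    po_sets: list of sets of transaction indices (D_1, D_2, ...).
    Returns {edge name: (exit vertex, enter vertex)} with 1-based vertex indices.
    Hypothetical helper for illustration only.
    """
    n = len(po_sets)
    enters, exits, edges = {}, {}, {}
    for j in range(1, n + 1):
        current = po_sets[j - 1]
        previous = po_sets[j - 2] if j > 1 else set()
        for i in current - previous:        # g_i enters v_j
            assert i not in enters, "a transaction edge may enter only one vertex"
            enters[i] = j
        for i in previous - current:        # g_i exits v_j
            assert i not in exits, "a transaction edge may exit only one vertex"
            exits[i] = j
        # PO-set edge g^D_j exits v_j and enters v_{j+1} when it exists.
        edges[f"gD{j}"] = (j, j + 1 if j < n else None)
    for i, head in enters.items():
        edges[f"g{i}"] = (exits.get(i), head)
    return edges


# PO-sets of the example transaction set (Figure 3.2).
example = [{1, 2, 3, 4}, {2, 4, 5}, {4, 5, 6}, {7, 8}]
for name, endpoints in sorted(build_graph(example).items()):
    print(name, endpoints)
```

For the example, the sketch places $g_5$ as entering $v_2$ and exiting $v_4$, and $g_2$ as entering $v_1$ and exiting $v_3$, in agreement with the description above.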
Lemma 3.3.5 G is a directed graph.
Proof.
Since every edge of G is directed, it remains to show that each edge has only one or two endpoints. There is one edge defined for each PO-set and one edge defined for each transaction.
For each PO-set $D_j$, the PO-set edge $g^D_j$ exits only $v_j$. In addition, $g^D_j$ enters only $v_{j+1}$ when $j < N_D$. Therefore each PO-set edge exits exactly one vertex and enters at most one vertex.
By the index PO-set ordering, if $\tau_i \in D_j \setminus D_{j-1}$, then $\tau_i \notin D_k \setminus D_{k-1}$ where $j < k \le N_D$. Therefore, the elements of the following set are disjoint: $A = \{\{g_i : \tau_i \in D_1\}, \{g_i : \tau_i \in D_j \setminus D_{j-1}\} : j \in (1, N_D]\}$. By definition, A contains the transaction edges of G that enter some vertices. Also, the union of the elements of A is $\{g_i : \tau_i \in T\}$. Therefore, each transaction edge enters exactly one vertex.
By a similar proving technique, we can show that each transaction edge exits at most one vertex. Due to space constraints, we skip the detailed proof. In conclusion, every edge of G has at most two endpoints and is directed. □
It remains to show that GenerateLoad satisfies Proposition 3.3.1 and therefore POGen generates feasible schedules for all transaction sets that satisfy the utilization bound of Inequalities 3.3.2. For simplicity of exposition, we split the proof into multiple lemmas. First, Lemma 3.3.6 proves an important property of graph G regarding the flow values. Then, this property is used in Lemma 3.3.7 to prove that graph G has a feasible flow set if Inequalities 3.3.7 and 3.3.8 are satisfied for interval $int^k$ and, furthermore, all PO-sets satisfy a utilization constraint based on $|int^k|$. Note that we know from [20] that if graph G has a feasible flow set, then it has an integral feasible flow set (i.e., the values of all flows in the set are integers) which can be found by the Ford-Fulkerson algorithm [20]. Therefore, to complete the proof, we have to prove that a feasible load set can be derived from an integral feasible flow set of G (Lemma 3.3.8). Finally, we show that the utilization bound of Inequalities 3.3.2 implies the utilization bound used in Lemma 3.3.7. Hence, Proposition 3.3.1 holds.
Lemma 3.3.6 A flow in graph G satisfies the flow conservation constraint at every vertex $v_j$ if and only if the following equalities hold for every PO-set $D_j$:
\[
\sum_{\tau_i \in D_j} f_i = f^D_j. \tag{3.3.11}
\]
Proof.
We prove this by induction over the ordered set of vertices.
Base case: By the graph construction rules regarding the edges that enter and exit $v_1$, the flow conservation constraint holds at vertex $v_1$ if and only if:
\[
\sum_{\tau_i \in D_1} f_i = f^D_1.
\]
Induction case: Assume that the lemma holds for every PO-set from $D_1$ to $D_{j-1}$. We will prove that the lemma also holds for $D_j$. By the graph construction rules regarding the edges that enter and exit $v_j$ and the definition of the flow conservation constraint at vertex $v_j$, we have
\[
\sum_{\tau_i \in D_j \setminus D_{j-1}} f_i = \sum_{\tau_i \in D_{j-1} \setminus D_j} f_i + f^D_j - f^D_{j-1}.
\]
Since
\[
\sum_{\tau_i \in D_j} f_i = \sum_{\tau_i \in D_j \setminus D_{j-1}} f_i + \sum_{\tau_i \in D_j \cap D_{j-1}} f_i,
\]
the flow conservation constraint holds at vertex $v_j$ if and only if
\[
\begin{aligned}
\sum_{\tau_i \in D_j} f_i
&= \sum_{\tau_i \in D_{j-1} \setminus D_j} f_i + f^D_j - f^D_{j-1} + \sum_{\tau_i \in D_j \cap D_{j-1}} f_i \\
&= \sum_{\tau_i \in D_{j-1}} f_i + f^D_j - f^D_{j-1}.
\end{aligned}
\]
Finally, by the induction hypothesis, the above equation is equivalent to
\[
\sum_{\tau_i \in D_j} f_i = f^D_{j-1} + f^D_j - f^D_{j-1} = f^D_j.
\]
This completes the proof. □
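A quick numerical check of Lemma 3.3.6 on the example PO-sets may help. The sketch below assumes the construction rules stated earlier (edges of $D_j \setminus D_{j-1}$ and $g^D_{j-1}$ enter $v_j$; edges of $D_{j-1} \setminus D_j$ and $g^D_j$ exit $v_j$); the function name and the flow values are made up for illustration.

```python
def conservation_holds(po_sets, f, f_po):
    """Check flow conservation at every vertex v_j of the constructed graph.

    po_sets: PO-sets in index PO-set order (list of sets of transaction indices)
    f:       {transaction index: flow on its transaction edge}
    f_po:    list of flows on the PO-set edges, f_po[j] for po_sets[j]
    Hypothetical helper for illustration only.
    """
    for j, current in enumerate(po_sets):
        previous = po_sets[j - 1] if j > 0 else set()
        inflow = sum(f[i] for i in current - previous)
        inflow += f_po[j - 1] if j > 0 else 0
        outflow = sum(f[i] for i in previous - current) + f_po[j]
        if inflow != outflow:
            return False
    return True


po_sets = [{1, 2, 3, 4}, {2, 4, 5}, {4, 5, 6}, {7, 8}]
f = {1: 1, 2: 0, 3: 2, 4: 1, 5: 0, 6: 1, 7: 1, 8: 0}   # arbitrary integer loads

# Equation 3.3.11: each PO-set edge carries the sum of its transactions' flows.
f_po = [sum(f[i] for i in d) for d in po_sets]
print(conservation_holds(po_sets, f, f_po))              # conservation holds

# Breaking Equation 3.3.11 for one PO-set breaks conservation as well.
broken = list(f_po)
broken[1] += 1
print(conservation_holds(po_sets, f, broken))
```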
Lemma 3.3.7 There exists a feasible flow set in graph G if Inequalities 3.3.7 and 3.3.8 are satisfied for interval $int^k$ and, furthermore, the PO-set utilizations satisfy the following condition:
\[
\forall D_j \subseteq T : u_{D_j} \le \frac{|int^k| - 1}{|int^k|} \tag{3.3.12}
\]
Proof.
First note that Inequalities 3.3.7 are necessary for the edge constraints on each PO-set edge (Inequality 3.3.9) to be satisfied. Let us construct a flow set as follows.
\[
\forall \tau_i \in T : f_i = lag(\tau_i, int^k), \qquad
\forall D_j \subseteq T : f^D_j = lag(D_j, int^k)
\]
We have to prove that the constructed flow set satisfies the edge constraints and the flow conservation constraints. Given the constructed flow, it is easy to verify that the edge constraints of each transaction edge (Inequality 3.3.10) and the left-side edge constraints of each PO-set edge (Inequality 3.3.9) are satisfied. The right-side edge constraints of each PO-set edge are satisfied because, by the definition of the lag function and by Inequalities 3.3.8, before the execution of GenerateLoad for interval $int^k$ we have the following:
\[
\begin{aligned}
lag(D_j, int^k) &= u_{D_j} \cdot t^{k+1} - \sum_{\tau_i \in D_j} \sum_{x \in [0, t^k)} S(\tau_i, x) \\
&\le u_{D_j} \cdot t^{k+1} - \lfloor u_{D_j} \cdot t^k \rfloor \\
&< u_{D_j} \cdot t^{k+1} - u_{D_j} \cdot t^k + 1.
\end{aligned}
\]
Now, by Inequalities 3.3.12, the following holds:
\[
lag(D_j, int^k) < u_{D_j} \cdot t^{k+1} - u_{D_j} \cdot t^k + 1 \le |int^k|.
\]
It remains to verify that the flow conservation constraint is honored at every vertex. Since the constructed flow set satisfies Equation 3.3.11, the sufficient condition of Lemma 3.3.6 proves this statement. □
Lemma 3.3.8 If there is an integral feasible flow set in graph G, then there is a feasible load set where $\forall \tau_i \in T : l_i^k = f_i$.
Proof.
Given an integral feasible flow set, let $l_i^k = f_i$ for all $\tau_i \in T$. The following inequality holds thanks to Inequalities 3.3.10.
\[
\forall \tau_i \in T : \lfloor lag(\tau_i, int^k) \rfloor \le l_i^k \le \lceil lag(\tau_i, int^k) \rceil
\]
Thus the interval loads satisfy Inequality 3.3.5. We now have to prove that the interval loads also satisfy Inequality 3.3.6. By the necessary condition of Lemma 3.3.6, we have $\sum_{\tau_i \in D_j} f_i = f^D_j$. Then, since $\sum_{\tau_i \in D_j} l_i^k = \sum_{\tau_i \in D_j} f_i$ and $f^D_j$ is subject to the PO-set edge constraints in Inequality 3.3.9, Inequality 3.3.6 is satisfied. □
Theorem 3.3.3 The acyclic transaction set T is schedulable by POGen if:
\[
\forall D_j \subseteq T : u_{D_j} \le \frac{L - 1}{L}.
\]
Proof.
Since $L \le \min_k(|int^k|)$, Inequalities 3.3.12 hold. Assume that Inequalities 3.3.7 and 3.3.8 hold for a specific interval $int^k$. Then, by Lemma 3.3.7 and [20], the constructed graph G has an integral feasible flow set. Hence, by Lemma 3.3.8, algorithm GenerateLoad computes a feasible load set, which proves Proposition 3.3.1. Since, furthermore, according to Lemma 3.3.4, Inequalities 3.3.7 and 3.3.8 hold for every interval $int^k$, it follows that Inequalities 3.3.5 and 3.3.6, and therefore feasibility Conditions 1 and 2, also hold for every interval. This concludes the proof. □
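The sufficient test of Theorem 3.3.3 is simple to evaluate. The following sketch assumes that the PO-sets and the per-transaction utilizations are already known and that the utilization of a PO-set is the sum of the utilizations of its transactions; the function name and the sample data are hypothetical.

```python
from fractions import Fraction


def passes_pogen_bound(po_sets, utilization, L):
    """Sufficient schedulability test: every PO-set utilization is at most (L - 1) / L.

    po_sets:     iterable of sets of transaction indices
    utilization: {transaction index: Fraction}
    L:           minimum scheduling interval length in slots
    Hypothetical helper for illustration only.
    """
    bound = Fraction(L - 1, L)
    return all(sum(utilization[i] for i in d) <= bound for d in po_sets)


po_sets = [{1, 2}, {2, 3}]
utilization = {1: Fraction(1, 4), 2: Fraction(1, 2), 3: Fraction(1, 5)}
print(passes_pogen_bound(po_sets, utilization, L=5))   # bound is 4/5
print(passes_pogen_bound(po_sets, utilization, L=2))   # bound is 1/2
```

The bound approaches 1 as L grows, so longer scheduling intervals admit higher PO-set utilizations at the cost of a coarser schedule.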
Algorithm analysis: Since $\forall g_i \in E : c_i - b_i \le 1$ and $\forall g^D_j \in E : c^D_j - b^D_j \le 1$, the total gap between capacities and lower bounds satisfies
\[
\sum_{g_i \in E} (c_i - b_i) + \sum_{g^D_j \in E} (c^D_j - b^D_j) \le N + N_D \le 2N.
\]
The time complexity of the Ford-Fulkerson algorithm in finding a feasible circulation in graph G is $O(|E| \cdot f_{max})$, where $f_{max}$ is the maximum flow value of a graph derived from G and $f_{max}$ is at most the sum above (see [20] for details). Since this sum is at most $2N$, the time complexity of GenerateLoad is $O(N^2)$. Finally, since the time complexity of POBase is $O(N \cdot \max(\log(N), |int^k|))$, the time complexity of POGen to generate the schedule for $|int^k|$ slots is $O(N \cdot \max(N, \max_k |int^k|))$. The worst-case time complexity to generate the schedule for a hyper-period is $O(N^2 \cdot h)$, which occurs when every scheduling interval has size 1.
3.3.2 Scheduling Algorithms for Cyclic Transaction Sets
The cyclic transaction set scheduling problem is NP-complete because the special case where all transmission times are the same and all periods are equal is equivalent to the Circular-Arc Coloring Problem (CACP) [20] (Chapter 10.3). A Circular-Arc is a set of overlapping arcs that create a cycle. The CACP is the problem of assigning to each arc a color from a minimum set of colors such that two overlapping arcs have different colors. The CACP has been shown to be NP-complete [20] (Chapter 10.3). In this section we propose a heuristic algorithm for this problem. The proposed solution uses the transaction buffer at a router to transform a cyclic transaction set into an acyclic one such that the latter's schedule can be used to execute the former. More specifically, we select a router and split each transaction $\tau_i$ that goes through this router into two pseudo transactions $\tau'_i$ and $\tau''_i$. $\tau'_i$ transfers the data of $\tau_i$ from the source of $\tau_i$ to the selected router, and $\tau''_i$ transfers the data stored in the router by $\tau'_i$ to the destination of $\tau_i$. We say that $\tau'_i$ and $\tau''_i$ feasibly transfer the data of $\tau_i$ if the data of every job of $\tau_i$ is transferred to its destination before its deadline. The new transaction set is acyclic since no transaction goes through the selected router. However, there is a precedence constraint between the pseudo transactions: if $\tau''_i$ transfers data in slot t, then that data must have been stored in the router by $\tau'_i$ before t. Since this constraint is not an assumption of transaction sets that can be scheduled by POGen, $\tau'_i$ and $\tau''_i$ may not feasibly transfer the data of $\tau_i$. More specifically, note that POBase does not, in general, guarantee any order of slots used by two different transactions. As a result, the POBase procedure in POGen may generate a schedule where $\tau''_i$ uses only slots that are before those of $\tau'_i$. Therefore, some execution slots of $\tau''_i$ (at least the ones in the first interval) are wasted since there is not yet any data in the router to transfer. Figure 3.9 shows an example of this situation where in $int^0$ all the execution slots of $\tau''_i$ are wasted (because $\tau'_i$ only starts executing at time 2). Note, however, that in the next interval $\tau''_i$ can transfer what has been transferred by $\tau'_i$ in the previous interval, and so on. This essentially means that if a period $p_i$ has m scheduling intervals, we "waste" one out of m intervals both for $\tau'_i$ (which must complete transferring data before the last interval within the period) and for $\tau''_i$ (which may only transfer data from the second interval within the period). To guarantee that both $\tau'_i$ and $\tau''_i$ can effectively transfer all data in $m - 1$ intervals instead of m, we have to inflate the utilization of the two transactions. This, in effect, means that the execution time of the two transactions within each period is higher than that of $\tau_i$. The question now is: what is the minimum necessary increment in the execution time? We address this question in Lemma 3.3.10 after we formally describe the problem in the next paragraphs. We also note that our solution only works well when $p_i$ is relatively larger than the size of every interval (as shown in Equation 3.3.13).
As described above, we replace each transaction $\tau_i \in T$ that goes through the selected router, where $\tau_i$ has transmission time $e_i$, period $p_i$, a source element and a destination element, with two pseudo transactions (p-transactions) $\tau'_i$ and $\tau''_i$. Both p-transactions have transmission time $e_i^+$ and period $p_i$, with $e_i^+ > e_i$; $\tau'_i$ goes from the source of $\tau_i$ to the router, and $\tau''_i$ goes from the router to the destination of $\tau_i$. $\tau'_i$ and $\tau''_i$ have the same utilization $u_i^+ = e_i^+ / p_i > u_i$. $\tau_i$ is called the original transaction (o-transaction) of $\tau'_i$ and $\tau''_i$. The new transaction set is called the pseudo transaction set, denoted by $T'$. The following (work-conserving) rule is applied to the schedules of a p-transaction.
Rule 3.3.1 A p-transaction always transfers data of the current job of its o-transaction in slot t if there is data of the job stored in the source element of the p-transaction at time t.
Note that the execution of a p-transaction in slot t can transfer data of the current job of its o-transaction only if the data has already been stored in its source element before time t; otherwise the execution does nothing. In the former case, we say that the execution slot of the p-transaction is loaded, and in the latter that it is empty. Note that the execution slots of $\tau'_i$ are always loaded until it has transferred all the data of the current job of $\tau_i$. This is because when a job of $\tau'_i$ is ready, a job of $\tau_i$ is also ready and all data of the job of $\tau_i$ has been stored in the source of $\tau_i$. Therefore, the statement is true by Rule 3.3.1. Figure 3.9 shows an example of the schedule of $\tau'_i$ and $\tau''_i$ with the given transaction parameters and with every scheduling interval having size 5. Consider $int^0$, in which $\tau'_i$ has 2 execution slots and $\tau''_i$ has 3 execution slots (these numbers are determined by function GenerateLoad). Since $\tau'_i$ is scheduled in slots 2 and 3 (this schedule is determined by POBase), there is no data of $\tau_i$ stored in the router before time t = 3. Therefore, the 3 execution slots of $\tau''_i$ in $int^0$ are empty.
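The loaded/empty distinction can be replayed with a small simulation. The sketch below is only an illustration under stated assumptions: one unit of data is moved per execution slot, data moved by the first p-transaction in slot t is available at the router from time t + 1, and the slot assignments after the first interval are made up (only the first interval, with slots 2 and 3 for the first p-transaction and three earlier slots for the second, follows the example above). The function name classify_slots is hypothetical.

```python
def classify_slots(first_slots, second_slots, job_size):
    """Label each execution slot of the second p-transaction as 'loaded' or 'empty'.

    first_slots:  slots in which the first p-transaction (source -> router) executes
    second_slots: slots in which the second p-transaction (router -> destination) executes
    job_size:     amount of data of the current job, in slots (e_i)
    Hypothetical helper for illustration only.
    """
    first_slots, second_slots = set(first_slots), set(second_slots)
    at_source, at_router, arriving = job_size, 0, 0
    labels = {}
    for t in range(max(first_slots | second_slots) + 1):
        at_router += arriving          # data moved in slot t - 1 is stored by time t
        arriving = 0
        if t in second_slots:
            if at_router > 0:          # Rule 3.3.1: transfer whenever data is available
                at_router -= 1
                labels[t] = "loaded"
            else:
                labels[t] = "empty"
        if t in first_slots and at_source > 0:
            at_source -= 1
            arriving = 1
    return labels


labels = classify_slots(first_slots=[2, 3], second_slots=[0, 1, 2, 5, 6, 7], job_size=8)
for slot in sorted(labels):
    print(slot, labels[slot])
```

In this run, the slots of the second p-transaction in the first interval are all empty, and the data moved in slots 2 and 3 is only forwarded in the following interval.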
We say that a p-transaction is effective in execution slot t when either of the following cases occurs: 1) the execution slot is loaded, or 2) the execution slot is empty and the p-transaction has transferred all data of the current job of its o-transaction at time t. Note that $\tau'_i$ is always effective in all slots because its execution slots are always loaded until it transfers all the data of the current job of $\tau_i$. However, that is not the case for $\tau''_i$. In Figure 3.9, $\tau''_i$ is not effective in slots 0, 1, 2, and 12. In these slots, there is still data of the current job of $\tau_i$ stored at the source of $\tau_i$ but there is no data of the job stored at the router.
[Timeline: time axis from 0 to 25 with interval boundaries $t^0$ to $t^5$ and intervals $int^0$ to $int^4$; rows $\tau'_i$ and $\tau''_i$; legend: empty, loaded]
Figure 3.9: Schedule of p-transactions where $\forall k : |int^k| = 5$, $\tau_i$ has $e_i = 8$ and $p_i = 25$, and $\tau'_i$ and $\tau''_i$ both have $e_i^+ = 12$ and period 25.
The following lemma is obvious due to Rule 3.3.1.
Lemma 3.3.9 Consider job j of $\tau_i$ which is ready at t and has deadline at $t + p_i$, and consider scheduling interval $int^k$ where $t \le t^k < t^{k+1} \le t + p_i$. If $\tau''_i$ is not effective in one of its execution slots in $int^k$, then, at $t^{k+1}$, the amount of data of j that has been transferred by $\tau''_i$ must be at least equivalent to $\lfloor u_i^+ \cdot t^k \rfloor - u_i^+ \cdot t$ slots.
Proof.
By Rule 3.3.1, if $\tau''_i$ is not effective in $int^k$, then by $t^{k+1}$, $\tau''_i$ must have transferred all the data of j which had been transferred by $\tau'_i$ until $t^k$, and there must be a portion of data of j still being stored at the source of $\tau_i$ after $t^k$. This also means that all execution slots of $\tau'_i$ between t and $t^k$ are loaded. By POGen, the number of execution slots of $\tau'_i$ between t (i.e., when job j is ready) and $t^k$ is $\lfloor u_i^+ \cdot t^k \rfloor - u_i^+ \cdot t$. Since $\tau'_i$ is always loaded in these slots, the amount of data of j that has been transferred by $\tau'_i$ until $t^k$ is equivalent to at least $\lfloor u_i^+ \cdot t^k \rfloor - u_i^+ \cdot t$ slots. This completes the proof. □
The following lemma describes how to determine $e_i^+$ such that $\tau_i$ is schedulable and $u_i^+$ is minimized.
Lemma 3.3.10 Every job of $\tau_i$ completes before its deadline if $T'$ is schedulable by POGen and
\[
\forall int^k : e_i^+ = \left\lfloor \frac{e_i + 1}{\alpha} \right\rfloor, \tag{3.3.13}
\]
where $\alpha = \min_k (1 - |int^k| / p_i)$.
Proof.
Since $T'$ is schedulable by POGen, jobs of $\tau'_i$ and $\tau''_i$ complete before their deadlines, which are the same as the deadline of the corresponding job of $\tau_i$. Thus, to complete the proof, we only need to show that the jobs of $\tau'_i$ and $\tau''_i$ together transfer all the data of $\tau_i$ during their execution.
Consider job j of $\tau_i$ which is ready at t and has deadline at $t + p_i$. Let $int^k$, where $t \le t^k < t^{k+1} \le t + p_i$, be the last scheduling interval in which $\tau''_i$ is not effective in at least one slot. We have:
• By POGen, the maximum number of execution slots of $\tau''_i$ within $[t, t^{k+1})$ is $\lceil u_i^+ \cdot t^{k+1} \rceil - u_i^+ \cdot t$. Therefore, the smallest number of execution slots of $\tau''_i$ within $[t^{k+1}, t + p_i)$ is $A = e_i^+ - \lceil u_i^+ \cdot t^{k+1} \rceil + u_i^+ \cdot t$.
• By Lemma 3.3.9, the amount of data of j that $\tau''_i$ needs to transfer after $t^{k+1}$ is equivalent to at most $B = e_i - \lfloor u_i^+ \cdot t^k \rfloor + u_i^+ \cdot t$ slots.
Since $\tau''_i$ is always effective within $[t^{k+1}, t + p_i)$, all data of $\tau_i$ will be transferred to its destination before the deadline if
\[
\begin{aligned}
A \ge B &\Leftrightarrow e_i^+ - \lceil u_i^+ \cdot t^{k+1} \rceil + u_i^+ \cdot t \ge e_i - \lfloor u_i^+ \cdot t^k \rfloor + u_i^+ \cdot t \\
&\Leftrightarrow e_i^+ - e_i \ge \lceil u_i^+ \cdot t^{k+1} \rceil - \lfloor u_i^+ \cdot t^k \rfloor.
\end{aligned} \tag{3.3.14}
\]
Since
\[
\lceil u_i^+ \cdot t^{k+1} \rceil - \lfloor u_i^+ \cdot t^k \rfloor \le \lceil u_i^+ \cdot t^{k+1} \rceil - \lceil u_i^+ \cdot t^k \rceil + 1 \le \lceil u_i^+ \cdot |int^k| \rceil + 1,
\]
if the following inequality is true then Inequality 3.3.14 is also true:
\[
e_i^+ - e_i \ge \lceil u_i^+ \cdot |int^k| \rceil + 1 \tag{3.3.15}
\]
Since $e_i^+$ is an integer, Inequality 3.3.15 is equivalent to
\[
\lceil e_i^+ \cdot \alpha \rceil \ge e_i + 1. \tag{3.3.16}
\]
We complete the proof by showing in the following that if $e_i^+$ satisfies 3.3.13 then 3.3.16 is true:
\[
\lceil e_i^+ \cdot \alpha \rceil
= \left\lceil \left\lfloor \frac{e_i + 1}{\alpha} \right\rfloor \cdot \alpha \right\rceil
\ge \left\lceil \left( \frac{e_i + 1}{\alpha} - 1 \right) \cdot \alpha \right\rceil
= \lceil (e_i + 1) - \alpha \rceil
= e_i + 1.
\]
The last equation in the above derivation comes from the fact that $e_i$ is an integer and $\alpha < 1$. □
Note that in Equation 3.3.13, when $|int^k|$ approaches $p_i$, $e_i^+$ becomes larger and the proposed algorithm will be less likely to successfully schedule the transaction set. Despite this drawback, as we show through simulations (see Section 3.3.3), the proposed algorithm still performs significantly better than existing works on randomly-generated cyclic transaction sets.
The scheduling algorithm for a cyclic transaction set T, namely cPOGen, is summarized as follows.
Algorithm cPOGen:
• Step 1: Find a router such that the derived pseudo transaction set $T'$, in which the execution times of the p-transactions are calculated by Equation 3.3.13, passes the sufficient utilization bound test of POGen.
• Step 2: Schedule $T'$ using POGen with every scheduling interval of length L, and schedule T accordingly following Rule 3.3.1.
Algorithm analysis: Note that Step 1 needs to be executed only once, offline. This step can be implemented as: 1) visit each router that has transactions going through it (time complexity $O(B)$, where B is the number of routers); 2) at each visit, for each transaction $\tau_i$ going through the router, calculate $e_i^+$ (time complexity $O(N)$) and compare the new utilization of each PO-set with the bound (time complexity $O(N)$). Hence, the time complexity of Step 1 is $O(N^2 \cdot B)$. Step 2 is an execution of POGen and therefore has the same time complexity as POGen.
3.3.3 Evaluation
Most of the previous related works [43, 44, 3, 24, 28] have focused on the Fixed-priority Scheduling Algorithm (FPA). These works deal with methods for schedulability analysis and priority assignment. More specifically, Shi et al. have recently proposed in [43] a branch-and-bound algorithm that searches for a feasible priority set for a transaction set. If a feasible priority set exists, then the transaction set is guaranteed to be schedulable under the worst-case transaction latency (WTL) analysis proposed in [44]. The works in [44, 43] are the state of the art.
In this section, we compare the performance of POGen/cPOGen on a ring-topology NoC with the solution proposed in [44, 43]. The analyzed performance metric is the percentage of random transaction sets which are schedulable under POGen/cPOGen and FPA. Under POGen, an acyclic transaction set is schedulable if it passes the utilization bound test of Theorem 3.3.3. Meanwhile, under cPOGen, a cyclic transaction set is schedulable if its pseudo transaction set is schedulable. A transaction set is schedulable under FPA if it has a feasible priority set generated by the algorithm in [43].
[Plot: x-axis Maximum PO-set Utilization (0.2 to 1.0); y-axis Acceptance Rate (0.1 to 1.0); curves: EDF, FPA, POGen (L=5), POGen (L=10), POGen (L=20), POGen (L=50), POGen (L=100)]
Figure 3.10: Acceptance rate of acyclic transaction sets with various maximum PO-set utilization
In our experiments, we used three controlled parameters: 1) the maximum PO-set utilization of a transaction set³; 2) the size of transaction sets; and 3) the number of bus elements. The transactions' sources and destinations are randomly selected from the set of bus elements. Meanwhile, the transactions' utilization, transmission time and period are generated as follows. Given a maximum PO-set utilization $u_{max}$, the utilization of transaction $\tau_i$ is initially generated according to the uniform distribution algorithm in [5] such that the utilizations of all PO-sets are no larger than $u_{max}$. The transmission time $e_i$ is generated as a uniformly-distributed random number in the range of 1 to 100 slots. The period $p_i$ is then determined as $\lceil e_i / (u_i \cdot L) \rceil \cdot L$. Finally, given the pair $\{e_i, p_i\}$, we recalculate $u_i$ to be $e_i / p_i$.
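The period rounding step can be written out as a short sketch. The function name round_period and the sample values are assumptions for illustration; exact rationals are used so that the ceiling is not affected by floating-point rounding.

```python
from fractions import Fraction
from math import ceil


def round_period(e, u, L):
    """Round the period up to a multiple of L and recompute the utilization.

    e: transmission time in slots, u: initially generated utilization (Fraction),
    L: scheduling interval length in slots.
    Returns (period, recalculated utilization).
    Hypothetical helper for illustration only.
    """
    period = ceil(Fraction(e) / (u * L)) * L
    return period, Fraction(e, period)


# Hypothetical draw: e_i = 8 slots, initial utilization 1/3, L = 5 slots.
period, utilization = round_period(8, Fraction(1, 3), 5)
print(period, utilization)
```

Because the period is rounded up, the recalculated utilization never exceeds the initially generated one, so the PO-set utilizations stay within the chosen maximum.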
The following graphs depict the average acceptance rate of POGen/cPOGen and FPA over 1000 different random transaction sets; in each experiment, two controlled variables are kept constant while the other one is varied. We report the acceptance rates of the algorithms on acyclic and cyclic transaction sets separately.
Figures 3.10 and 3.11 show the average acceptance rates of the algorithms on acyclic and cyclic transaction sets, respectively (when L is 5, 10, 20, 50, or 100), under various maximum PO-set utilizations. In these experiments, the size of transaction sets and the number of bus elements are set at 20 and 10, respectively. For comparison purposes, we also include in these figures the acceptance rates of Earliest-Deadline-First (EDF), a well-known scheduling algorithm that can be used when the parallel execution of non-overlapping transactions is not allowed. It can be seen that in most cases the acceptance rate of POGen/cPOGen is better than that of FPA, especially when the maximum PO-set utilization is high and L > 5. The better performance of POGen/cPOGen comes from the fact that the WTL analysis in [44] does not always take advantage of the parallelism between non-overlapping transactions. For example, consider the transaction set shown in Figure 3.2. Assume $\tau_3$ and $\tau_5$ have higher priority than $\tau_4$. According to the WTL analysis in [44], the interference of transactions $\tau_3$ and $\tau_5$ on the execution of $\tau_4$ is calculated as if all transactions were using a single shared resource. However, POGen/cPOGen allows $\tau_3$ and $\tau_5$ to be executed in parallel as shown in Figure 3.5. In other words, the acceptance rate of FPA is reduced when the number of PO-sets that contain the same transaction increases. Let PcT denote the number of PO-sets that contain the same transaction. We also notice from Figures 3.10 and 3.11 that the acceptance rate of cyclic transaction sets is lower than that of acyclic ones. This is true both for our proposed algorithms and for FPA: with the former, the reduction is because of the utilization inflation we used to transform the transactions; with the latter, the reduction is because of the higher PcT when transaction sets are cyclic. On average, the maximum reduction of both POGen/cPOGen and FPA is about 10%, and it is higher when the maximum utilization is higher. The figures also show (in the difference in performance of EDF and POGen/cPOGen) that allowing parallelism between transactions dramatically increases the bus utilization.
³ It is equivalent to "the maximum link utilization" in [43].
[Plot: x-axis Maximum PO-set Utilization (0.2 to 1.0); y-axis Acceptance Rate (0.1 to 1.0); curves: EDF, FPA, cPOGen (L=5), cPOGen (L=10), cPOGen (L=20), cPOGen (L=50), cPOGen (L=100)]
Figure 3.11: Acceptance rate of cyclic transaction sets with various maximum PO-set utilization
[Plot: x-axis Transaction Set Size (5 to 30); axes Acceptance Rate (0.1 to 1.0) and Number of PO-sets (0 to 8); bars: Average of Maximum PcT; curves: FPA with acyclic sets, FPA with cyclic sets, POGen with acyclic sets (L=10), POGen with cyclic sets (L=10)]
Figure 3.12: Acceptance rate with various size of transaction sets (maximum PO-set utilization is 0.95)
Figure 3.12 shows the acceptance rate when the maximum PO-set utilization is 0.95, the number of bus elements is 10, and the size of transaction sets varies. The results for acyclic and cyclic transaction sets are shown with solid and dashed lines, respectively. We also draw in Figure 3.12 a bar graph which shows the average (over all transaction sets) of the maximum PcT in each set. The performance of POGen/cPOGen is better than that of FPA, especially when the transaction sets are larger. The reason is that, when the number of bus elements is fixed, the larger the transaction set, the bigger the maximum PcT (as shown in the bar graph). As a consequence, FPA suffers more from the effect described in the previous paragraph. The performance of POGen/cPOGen also degrades when the number of transactions is higher because there are more transaction sets that do not meet the utilization bound. The reductions, however, are smaller than those of FPA.
The same reason explains the better performance of POGen/cPOGen in Figure 3.13, which shows the acceptance rate of POGen/cPOGen (with L = 10) and FPA when the number of bus elements is varied. In these experiments, the maximum PO-set utilization is set at 0.95 and the size of transaction sets is fixed at 20. When the number of bus elements increases, there are more long transactions, which may belong to a higher number of distinct PO-sets, as shown by the bar graph.
[Plot: x-axis Bus Length (6 to 16); axes Acceptance Rate (0.1 to 1.0) and Number of PO-sets (0 to 7); bars: Average of Maximum PcT; curves: FPA with acyclic sets, FPA with cyclic sets, POGen with acyclic sets (L=10), POGen with cyclic sets (L=10)]
Figure 3.13: Acceptance rate with various bus length (maximum PO-set utilization is 0.95)
3.3.4 Implementation
The main drawback of POGen is that its overhead, in general, is higher than that of fixed-priority algorithms. This overhead heavily depends on specific implementations and platforms. To demonstrate the applicability of POGen, in this section we discuss an implementation of POGen on a Cell Broadband Engine processor [7] and our measurement of its execution time overhead.
The Cell processor has 1 PowerPC Processing Element (PPE) and 8 Synergistic Processing Elements (SPEs), each of which is an element on the Cell ring bus. There are also 3 additional bus elements: a memory controller and two I/O controllers. POGen is implemented to run on SPEs as an online algorithm. It is invoked at the beginning of each scheduling interval by a timer-interrupt handler and generates the schedule of all transactions in that interval. Then, if the generated schedule has $S(\tau_i, t) = 1$, a slot scheduler will transfer data of $\tau_i$ in slot t using Direct-Memory-Access commands. The slot scheduler is also invoked by a timer-interrupt handler. Note that since POGen is only executed once at the arrival time of each transaction, the number of interrupts that invoke POGen is the same as that of fixed-priority schedulers. The execution time of the former, however, may be larger than that of the latter. In our implementation (shown later), the execution time of POGen is typically no bigger than 3% in utilization. We also note that the execution time of the slot scheduler must be accounted for in the overall overhead of POGen. Since the slot scheduler is simply a table-lookup function, we expect that this scheduler can be implemented in hardware as part of the router. If so, its overhead will be negligible and will not affect the processing elements. Our software implementation of the slot scheduler shows that it adds about 1% to the total utilization. For comparison, the number of interrupts incurred by POGen is equivalent to that of the Boundary-Fair algorithm [52] (which is significantly smaller than that of PFair [4]).
POGen execution time was measured under four slot sizes: 10 µs, 20 µs, 50 µs, and 100 µs. We assume that the period of every transaction is a multiple of 1 ms, which is also the smallest possible scheduling interval. We also selected the size of every scheduling interval to be equal to the GCD of all periods, which happened to be 1 ms in all generated transaction sets. Given these slot sizes and the GCD of all periods, the corresponding values of L, measured in number of slots, are 100, 50, 20, and 10. We generated transaction sets of various sizes using the same methodology discussed in Section 3.3.3. The sizes of the transaction sets are 10, 20, 30, and 40.
Figure 3.14 shows the execution time of POGen in ms under various conditions. This execution time also includes the latency of the timer-interrupt handler that invokes POGen. It can be seen that the algorithm overhead increases with L. However, the algorithm overhead is no more than 0.03 ms even when L = 100 slots. Since each scheduling interval is 1 ms, under the given conditions, the maximum algorithm overhead is no more than 3% of the scheduling interval size.
Our measurement also shows that the execution time of the slot scheduler is no more than 0.125 µs. In other words, if the slot size is 10 µs, the overhead is no more than 1.25% of the slot size. The overhead is smaller when the slot size is bigger.
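The arithmetic behind these figures can be reproduced with a short sketch. It only restates the bounds quoted in this section (1 ms scheduling interval, 0.03 ms POGen bound, 0.125 µs slot-scheduler bound); the constant and function names are illustrative and are not part of the measured implementation.

```python
# Upper bounds quoted in the text, expressed in microseconds.
INTERVAL_US = 1000.0        # scheduling interval = GCD of all periods = 1 ms
POGEN_MAX_US = 30.0         # bound on POGen execution time (0.03 ms)
SLOT_SCHED_MAX_US = 0.125   # bound on slot-scheduler execution time


def slots_per_interval(slot_us):
    """Value of L: number of slots in one scheduling interval."""
    return round(INTERVAL_US / slot_us)


def overhead_percent(cost_us, window_us):
    """Cost expressed as a percentage of the window it is charged against."""
    return 100.0 * cost_us / window_us


SLOT_SIZES_US = [10, 20, 50, 100]
L_VALUES = [slots_per_interval(s) for s in SLOT_SIZES_US]
POGEN_PCT = overhead_percent(POGEN_MAX_US, INTERVAL_US)
SLOT_PCT = {s: overhead_percent(SLOT_SCHED_MAX_US, s) for s in SLOT_SIZES_US}
```

With these inputs, `L_VALUES` lists the four values of L used in the experiments, and `SLOT_PCT` shrinks as the slot size grows, matching the last remark of the paragraph.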
3.4 Real-time Scheduling for General NoC
In Section 3.3, we introduced two scheduling algorithms for real-time transactions on a ring-topology NoC: Algorithm POBase is designed to schedule transactions whose periods are the same, whereas Algorithm POGen is designed for general transaction sets. The latter is built upon two algorithms, POBase and GenerateLoad. Although the general framework of POGen can be used for transaction sets on a general NoC, Algorithms POBase and GenerateLoad cannot be reused. The reason is that these two algorithms rely on the fact that, for each transaction set on a ring-topology NoC, there exists a total order defined solely by the transactions' endpoints. This total order does not exist in a general NoC. Therefore, a different, more general transaction order must be defined. As with ring-topology NoC, we classify transaction sets on general NoC into acyclic and cyclic transaction sets. This classification is a generalization of the classification introduced in Section 3.3. In this research, we focus only on the scheduling problem of acyclic transaction sets on general NoC. As shown later, transaction sets of this type often appear in practical applications.
[Figure 3.14: Average execution time of POGen. Execution time (ms, axis from 0.000 to 0.025) for L = 10, 20, 50, and 100 slots, with one series for each of N = 10, 20, 30, and 40.]
[Figure 3.15: A transaction set in NoC. Routers $RT_1$ to $RT_{20}$ joined by physical links, transactions $\tau_1$ to $\tau_{11}$, and PO-sets $D_1$ to $D_7$.]
3.4.1 Transaction Model
In this section, we introduce the transaction model used in general NoC. This model is a generalization of the model introduced in Section 3.3. We define $T$ as the set of all data transactions: $T = \{\tau_i : i = [1, N]\}$. A data transaction $\tau_i$ is characterized by a tuple $\tau_i = (e_i, p_i, R_i)$ where $e_i$ is the time required to transfer each job of $\tau_i$; $p_i$ is the period of $\tau_i$; and $R_i$ is the fixed route of $\tau_i$ through the network, that is, the set of links that the transaction traverses. As an example, the route of transaction $\tau_4$ in Figure 3.15 is $R_4 = \{RT_2 \rightarrow RT_3 \rightarrow RT_8 \rightarrow RT_{13} \rightarrow RT_{14}\}$, where $\rightarrow$ represents a physical link. Each job of $\tau_i$ must complete within its period, i.e. relative deadlines are equal to periods. The network utilization $u_i$ of $\tau_i$ is calculated as $u_i = e_i / p_i$. We assume that all data transactions arrive at time 0. Let the hyper-period $h$ of $T$ be the least common multiple of the periods of all transactions in $T$. Finally, two transactions $\tau_i$ and $\tau_j$ are said to overlap if their routes have any link in common, that is, $R_i \cap R_j \neq \emptyset$. Given a data transaction set $T$, we define an overlap indicating function $OV : T \times T \mapsto \{0, 1\}$ where $OV(\tau_i, \tau_j) = 1$ if $\tau_i$ and $\tau_j$ overlap, and 0 otherwise.
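The model above can be illustrated with a minimal sketch. It assumes that a route is stored as a set of directed links between router names. Only the route of $\tau_4$ comes from the text; the other two routes and all function names are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def route_links(routers):
    """Set of directed links (a, b) traversed by a path of routers."""
    return set(zip(routers, routers[1:]))


def ov(route_i, route_j):
    """Overlap indicator OV: 1 if the two routes share a link, else 0."""
    return 1 if route_i & route_j else 0


def utilization(e, p):
    """Network utilization u_i = e_i / p_i, kept exact."""
    return Fraction(e, p)


def hyper_period(periods):
    """Least common multiple of all periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


R4 = route_links(["RT2", "RT3", "RT8", "RT13", "RT14"])  # route of tau_4 (from the text)
RA = route_links(["RT1", "RT2", "RT3"])                  # hypothetical route
RB = route_links(["RT6", "RT7", "RT8"])                  # hypothetical route
```

Here `RA` overlaps `R4` because both use the link from RT2 to RT3, whereas `RB` shares router RT8 with `R4` but no link, so the two do not overlap under this link-based definition.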
PO-sets and incident tree: A pairwise overlap set (PO-set) $D_j$ is defined as a maximal set of overlapping transactions, i.e. a subset $D$ of $T$ that satisfies the following two conditions: 1) $\forall \tau_i, \tau_j \in D : OV(\tau_i, \tau_j) = 1$; and 2) if $\tau_k \notin D$ then $\exists \tau_i \in D : OV(\tau_i, \tau_k) = 0$. By definition, every two transactions that overlap each other must belong to at least one common PO-set. For convenience, we consider that a non-overlapping transaction belongs to a PO-set that contains only that transaction. Let $N_D$ be the total number of PO-sets of $T$. In general, a transaction may belong to more than one PO-set. As an example, the transaction set in Figure 3.15 comprises seven PO-sets: $D_1 = \{\tau_1, \tau_2, \tau_3, \tau_4\}$, $D_2 = \{\tau_1, \tau_5\}$, $D_3 = \{\tau_4, \tau_6\}$, $D_4 = \{\tau_4, \tau_7\}$, $D_5 = \{\tau_3, \tau_8, \tau_9\}$, $D_6 = \{\tau_9, \tau_{10}\}$, $D_7 = \{\tau_3, \tau_8, \tau_{11}\}$. Two PO-sets $D_i$, $D_j$ are said to be connected if $D_i \cap D_j \neq \emptyset$.
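Since a PO-set is a maximal set of pairwise overlapping transactions, PO-sets can be viewed as the maximal cliques of the graph whose edges are the overlapping pairs. The sketch below is a consistency check rather than the author's procedure: it derives the overlapping pairs from the seven PO-sets listed above and recovers the same sets with the standard Bron-Kerbosch enumeration. All names are illustrative.

```python
from itertools import combinations

# PO-sets of the Figure 3.15 example, as listed in the text (transaction indexes).
PO_SETS_FROM_TEXT = [{1, 2, 3, 4}, {1, 5}, {4, 6}, {4, 7},
                     {3, 8, 9}, {9, 10}, {3, 8, 11}]


def overlap_graph(po_sets):
    """Adjacency map: two transactions are adjacent if they share a PO-set."""
    adj = {i: set() for d in po_sets for i in d}
    for d in po_sets:
        for a, b in combinations(d, 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of all maximal cliques of the graph."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(set(r))      # r cannot be extended: it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}               # v has been fully explored
            x = x | {v}

    expand(set(), set(adj), set())
    return found


RECOVERED = maximal_cliques(overlap_graph(PO_SETS_FROM_TEXT))
```

A transaction with no neighbours forms a clique of size one, which matches the convention that a non-overlapping transaction belongs to a PO-set of its own.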
The PO-graph of $T$ is defined as the incident graph of the PO-sets of $T$, i.e. the graph whose vertices represent PO-sets and in which there is an edge between two vertices if the two corresponding PO-sets are connected. Hereafter we shall use a PO-set and its corresponding vertex in the PO-graph interchangeably. We assume that a PO-graph is a connected graph. The reason is that if the graph is disconnected, its components do not interfere with each other in terms of scheduling and therefore we can deal with them as separate transaction sets. Figure 3.16(a) shows the PO-graph of the transaction set shown in Figure 3.15.
[Figure 3.16: PO-graph and PO-tree of $T$ shown in Figure 3.15. (a) PO-graph; (b) PO-tree. Both panels have vertices $D_1$ to $D_7$.]
Let PO-tree denote a spanning tree of the PO-graph and let PO-set $D_r$ be the root of the PO-tree. Figure 3.16(b) shows a PO-tree, rooted at $D_1$, of the PO-graph in Figure 3.16(a). With respect to a specific PO-tree, PO-set $D_i$ is an ancestor of PO-set $D_j$, denoted by $D_i \prec D_j$, if $D_i$ is on the path from the root $D_r$ to $D_j$ on the PO-tree. Let $ancestors(D_j)$ be the set of all ancestors of $D_j$. For convenience, we use notation $D_i \preceq D_j$ to indicate that $D_i \in ancestors(D_j) \cup \{D_j\}$. Furthermore, $D_i$ is the parent of $D_j$, denoted by $D_i = parent(D_j)$, if and only if $D_i$ is the immediate ancestor of $D_j$. Note that in a tree, a non-root node has a unique parent. Let $children(D_i)$ be the set of all PO-sets whose parent is $D_i$. As an example, in Figure 3.16(b), $ancestors(D_6) = \{D_1, D_5\}$, $D_5 = parent(D_6)$, and $children(D_5) = \{D_6, D_7\}$. We define the height of tree node $D_i$, denoted by $height(D_i)$, as follows: if $children(D_i) = \emptyset$ then $height(D_i) = 0$, otherwise $height(D_i) = 1 + \max_{D_j \in children(D_i)} height(D_j)$.
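These tree notions can be sketched with a parent map. The map below is an assumption read off the examples in this chapter (the children of $D_1$ and $D_5$ are stated in the text, and $D_3$ is taken to be the parent of $D_4$ because the text uses the chain $D_1$, $D_3$, $D_4$); the function names are illustrative.

```python
# Assumed PO-tree of Figure 3.16(b): PO-set index -> index of its parent.
PARENT = {1: None, 2: 1, 3: 1, 4: 3, 5: 1, 6: 5, 7: 5}


def children(i):
    """Indexes of all PO-sets whose parent is D_i."""
    return sorted(c for c, p in PARENT.items() if p == i)


def ancestors(i):
    """Indexes of all PO-sets on the path from the root to D_i, excluding D_i."""
    found = set()
    p = PARENT[i]
    while p is not None:
        found.add(p)
        p = PARENT[p]
    return found


def height(i):
    """0 for a leaf, otherwise 1 + the maximum height among the children."""
    kids = children(i)
    return 0 if not kids else 1 + max(height(c) for c in kids)
```

For instance, `ancestors(6)` and `children(5)` reproduce the example given in the paragraph above.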
In Section 3.3, we proved that there exists a category of transaction sets in uni-dimensional buses, called cyclic transaction sets, for which the scheduling problem is NP-complete. Since uni-dimensional buses are a special case of NoC, the same result applies in our case. Here, we extend the definition of acyclic/cyclic transaction sets in Section 3.3 to encompass transaction sets on multi-dimensional NoC. We say that a NoC transaction set is acyclic if it has a PO-tree that satisfies Properties 3.4.1 and 3.4.2, and it is cyclic otherwise.
Property 3.4.1 If $D_l \preceq D_m \preceq D_n$ and $\tau_i \in D_l$ and $\tau_i \in D_n$, then $\tau_i \in D_m$.
Property 3.4.2 If $D_l \not\preceq D_m$ and $D_m \not\preceq D_l$, then $D_l \cap D_m = \emptyset$.
Note that a transaction set that is cyclic or acyclic by the definition in Section 3.3 is also cyclic or acyclic, respectively, by the current definition. In this research, we focus only on acyclic transaction sets for two reasons: 1) as we discuss later, many real-time applications exhibit data-flow topologies that are acyclic; 2) for this instance of the problem, we can derive an efficient and near-optimum solution in practical settings. We believe that having a good solution for this specific problem will provide a good theoretical foundation for solving the general NP-complete problem. Hereafter, we use PO-tree to denote the spanning tree that satisfies the aforementioned properties. The transaction set shown in Figure 3.15 is acyclic because, given its PO-tree shown in Figure 3.16(b), all transactions and PO-sets satisfy Properties 3.4.1 and 3.4.2. For example, since $D_1 \preceq D_3 \preceq D_4$, any transaction that is in both $D_1$ and $D_4$ (e.g. $\tau_4$) is also in $D_3$ (Property 3.4.1), and since $D_2 \not\preceq D_3$ and $D_3 \not\preceq D_2$, $D_2 \cap D_3 = \emptyset$ (Property 3.4.2).
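The two properties can be checked mechanically for a given spanning tree. The brute-force sketch below is one possible check, not the author's procedure; it assumes the PO-sets of Figure 3.15 and a parent map inferred from the examples in the text, and it also tries a hypothetical alternative tree that violates Property 3.4.2.

```python
from itertools import product

PO_SETS = {1: {1, 2, 3, 4}, 2: {1, 5}, 3: {4, 6}, 4: {4, 7},
           5: {3, 8, 9}, 6: {9, 10}, 7: {3, 8, 11}}
PARENT = {1: None, 2: 1, 3: 1, 4: 3, 5: 1, 6: 5, 7: 5}       # assumed PO-tree
BAD_PARENT = {1: None, 2: 1, 3: 1, 4: 1, 5: 1, 6: 5, 7: 5}   # hypothetical: D4 under D1


def preceq(parent, a, b):
    """True if D_a is D_b itself or one of its ancestors."""
    while b is not None:
        if a == b:
            return True
        b = parent[b]
    return False


def property_341(po_sets, parent):
    """Shared transactions of D_l and D_n appear in every D_m between them."""
    return all(po_sets[l] & po_sets[n] <= po_sets[m]
               for l, m, n in product(po_sets, repeat=3)
               if preceq(parent, l, m) and preceq(parent, m, n))


def property_342(po_sets, parent):
    """PO-sets that are not on a common root-to-leaf path are disjoint."""
    return all(not (po_sets[l] & po_sets[m])
               for l, m in product(po_sets, repeat=2)
               if not preceq(parent, l, m) and not preceq(parent, m, l))


def is_valid_po_tree(po_sets, parent):
    return property_341(po_sets, parent) and property_342(po_sets, parent)
```

The hypothetical tree fails because $D_3$ and $D_4$ become unrelated siblings while still sharing $\tau_4$. A transaction set is acyclic as soon as one spanning tree passes the check.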
For ease of presentation, we define the following index scheme for the PO-sets in the PO-tree: select any Depth-First-Search (DFS) traversal of the PO-tree. Then each PO-set is indexed by a unique number from 1 to $N_D$ in the order in which it is visited in the selected DFS. PO-sets in Figure 3.16(b) are indexed following this definition. Note that, with this index scheme, if $D_l \preceq D_m$ then $l \le m$, but $l \le m$ does not always imply $D_l \preceq D_m$ (see $D_2$ and $D_3$ in Figure 3.16). The following property, however, is true.
Property 3.4.3 If $l \le m \le n$ and $D_l \preceq D_n$, then $D_l \preceq D_m$.
Proof.
Assume by contradiction that $D_l \not\preceq D_m$. If $D_m \preceq D_l$, then $m < l$. Otherwise, if $D_m \not\preceq D_l$, then either $m < l$ or $m > n$. Both cases contradict the property's assumption. □
Given an indexed collection of PO-sets, let $maxid_i$ and $minid_i$ denote the indexes with maximum and minimum value, respectively, among all PO-sets to which $\tau_i$ belongs. Also denote these PO-sets as $maxPO_i$ and $minPO_i$, respectively. As an example, $maxid_3 = 7$, $minid_3 = 1$, $maxPO_3 = D_7$, $minPO_3 = D_1$. We have the following property.
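The index scheme and the derived quantities can be sketched as follows. The tree uses hypothetical node labels so that the indexes are produced by the DFS itself rather than assumed; its shape and the PO-set memberships follow the running example, and the function names are illustrative.

```python
# PO-tree of the running example with hypothetical labels: node -> ordered children.
TREE = {"root": ["x", "y", "z"], "y": ["y1"], "z": ["z1", "z2"]}
# Transactions (by index) contained in each PO-set.
MEMBERS = {"root": {1, 2, 3, 4}, "x": {1, 5}, "y": {4, 6}, "y1": {4, 7},
           "z": {3, 8, 9}, "z1": {9, 10}, "z2": {3, 8, 11}}


def dfs_index(tree, root):
    """Number the nodes 1..N_D in the order a pre-order DFS visits them."""
    index, stack = {}, [root]
    while stack:
        node = stack.pop()
        index[node] = len(index) + 1
        stack.extend(reversed(tree.get(node, [])))  # visit children left to right
    return index


INDEX = dfs_index(TREE, "root")


def minid(i):
    """Smallest index among the PO-sets that contain tau_i."""
    return min(INDEX[n] for n, d in MEMBERS.items() if i in d)


def maxid(i):
    """Largest index among the PO-sets that contain tau_i."""
    return max(INDEX[n] for n, d in MEMBERS.items() if i in d)
```

With this traversal the labels receive the indexes 1 to 7 in the order used for $D_1$ to $D_7$, and `minid(3)` and `maxid(3)` agree with the example above.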
Property 3.4.4 For every $D_l$ which contains $\tau_i$, $minPO_i \preceq D_l$.
Proof.
Assume by contradiction that there exists $D_l$ where $\tau_i \in D_l$ and $minPO_i \not\preceq D_l$. Since, by definition, $minid_i < l$, we have $D_l \not\preceq minPO_i$. Then, by Property 3.4.2, $D_l$ cannot share $\tau_i$ with $minPO_i$, which contradicts the property's assumption. □
Finally, we have the following lemma, which will be used for calculating the complexity of the proposed algorithms.
Lemma 3.4.1 If $T$ is acyclic, then its number of PO-sets $N_D \le N$.
Proof.
By the defined PO-set index scheme and by Properties 3.4.1 and 3.4.2, transactions in $D_j \setminus parent(D_j)$ do not belong to $\bigcup_{i \in [1, j-1]} D_i$. Therefore the sets in collection $A = \{D_j \setminus parent(D_j) : j \in [1, N_D]\}$ are mutually disjoint. Each of these sets is also non-empty, since a PO-set is maximal and thus cannot be contained in its parent. Since the size of $T$ is $N$, by the pigeonhole principle, we have $|A| = N_D \le N$. □
Schedulability Necessary Condition
Since by definition, no two transactions of a PO-set $D_l$ can be scheduled concurrently, all transactions of $D_l$ must be scheduled in sequence. In other words, the transactions of $D_l$ can be considered to be sharing one resource. This results in a necessary condition on the schedulability of a transaction set, shown in Theorem 3.4.1.
Theorem 3.4.1 A transaction set $T$ is schedulable only if:
$$\forall D_j \subseteq T : u_{D_j} = \sum_{\tau_i \in D_j} u_i \le 1. \quad (3.4.1)$$
Motivating Applications for Acyclic Transaction Sets
Many real-time applications are in the form of data-flow processing applications [27, 47] where data may be processed through multiple consecutive stages. An example is the multipurpose status display application on an avionic system [27], which shows the status of all aircraft avionics devices. A task periodically gets data from I/O devices such as radars, then processes the data before sending information to a display task, which is in charge of displaying useful information to the pilots. Another example is an application that processes transmission flows in 3G base stations [47]. In the down-link direction, the input flows are the data flow and the control flow. These flows are processed separately through several stages and then merged back into one flow. The merged flow then goes through several additional processing stages before being sent to the output port. A popular programming model for this type of application in SoC is the streaming model [40, 47], in which each processing stage is executed in one processing element. Processing elements are arranged in either a serial or a parallel pipeline. Data is transferred between processing stages through the NoC. If the streaming programming model is used, it is usually easy to come up with an allocation of stages to processing elements that results in acyclic transactions.
[Figure 3.17: A data-stream processing application. Router grid with PO-sets $D_1$ to $D_4$, together with the induced PO-graph and PO-tree.]
Figure 3.17 shows an acyclic transaction set created by a data-flow processing application: input flows are processed in parallel by processing elements at $\{RT_1, RT_6\}$ and $\{RT_{11}, RT_{16}\}$, then merged at $RT_2$ and processed by the processing element at this router; the merged flow continues through two additional processing stages at $RT_3$ and $RT_4$. The induced PO-graph has the following PO-set nodes: $D_1$, $D_2$, $D_3$, $D_4$. In Figure 3.17, we represent each PO-set with a dashed circle, where the transactions that cut through a circle are the transactions in the corresponding PO-set. The PO-graph and a PO-tree (rooted at $D_1$) of this transaction set are also shown in Figure 3.17. The transaction set is acyclic because it satisfies Properties 3.4.1 and 3.4.2. For example, the transaction which is in both $D_1$ and $D_4$ is also in $D_2$ (Property 3.4.1), and $D_3 \cap D_4 = \emptyset$ (Property 3.4.2).
3.4.2 Real-time Scheduling for Acyclic Transaction Sets on General NoC
As mentioned above, we use the same scheduling framework POGen proposed in Section 3.3 for the problem at hand. However, the component algorithms of POGen, namely POBase and GenerateLoad, need to be designed anew to accommodate the generalized transaction model. Also note that, unlike the algorithms in Section 3.3, the newly proposed algorithms are polynomial.
We proved in Section 3.3 that the schedule generated by executing POGen iteratively over all scheduling intervals is feasible if the following two conditions are true.
Condition 3.4.1 POBase is optimal for a transaction set whose transactions' periods are the same, meaning that the schedule generated by POBase is feasible if the transaction set satisfies the necessary condition in Theorem 3.4.1.
Condition 3.4.2 GenerateLoad generates a feasible load set $\{l_i^k : \forall \tau_i \in T\}$ for interval $int_k$ if its inputs satisfy the following inequalities:
$$\forall D_j \subseteq T : \lfloor lag(D_j, int_k) \rfloor \le |int_k| \quad (3.4.2)$$
$$\sum_{x \in [0, t_k)} S(\tau_i, x) \ge \lfloor u_i \cdot t_k \rfloor \quad (3.4.3)$$
In the following sections, we propose algorithms POBase and GenerateLoad for acyclic transaction sets on NoC that satisfy Conditions 3.4.1 and 3.4.2. To differentiate them from the algorithms in Section 3.3, we name the new algorithms POBaseNoC and GenerateLoadNoC. The proposed POBaseNoC satisfies Condition 3.4.1 because it feasibly schedules any same-period acyclic NoC transaction set that satisfies the necessary condition of Theorem 3.4.1. Furthermore, the proposed GenerateLoadNoC satisfies Condition 3.4.2 if all PO-sets satisfy the following utilization bound:
$$\forall D_j \subseteq T : u_{D_j} \le \frac{L - 1}{L}, \quad (3.4.4)$$
where $L$ is defined as the greatest common divisor (GCD) of all transaction periods and is measured in terms of the basic slot size. Although the utilization bound is only sufficient, it approaches 1 when $L$ is large. The value of $L$ in a real system is a function of both the slot size and the transaction periods. As we discuss in more detail in Section 3.5, we believe that in many modern NoCs $L$ has practical values ranging from 10 to 100 slots. Since the proposed GenerateLoadNoC imposes a stricter requirement on the transaction set than POBaseNoC does, Inequality 3.4.4 becomes a sufficient utilization bound of POGen.
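To make the bound concrete, the short sketch below evaluates $(L - 1)/L$ for the values of $L$ used in the experiments of Section 3.3.4. The function name is illustrative.

```python
from fractions import Fraction


def utilization_bound(L):
    """Right-hand side of Inequality 3.4.4 for an interval of L slots."""
    return Fraction(L - 1, L)


BOUNDS = {L: utilization_bound(L) for L in (10, 20, 50, 100)}
```

Over the practical range of 10 to 100 slots the bound rises from 9/10 to 99/100, which is the sense in which it approaches 1 for large $L$.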
The POBaseNoC algorithm
Algorithm 8 POBaseNoC
Input: $T$, in which all transactions have the same period $p$
 1: $L \leftarrow$ list of all $\tau_i \in T$ in ascending order of $minid_i$
 2: add $d = \{0, p, \{null\}\}$ to $Dur$
 3: for each $\tau_i \in L$ do
 4:     $r \leftarrow e_i$
 5:     for each $d \in Dur$ do
 6:         $M \leftarrow \{\tau_l \in d.tlist : maxid_l \ge minid_i\}$
 7:         $\tau_j \leftarrow \arg\min_{\tau_l \in M}(maxid_l)$
 8:         // check if duration $d$ is valid for $\tau_i$
 9:         if $\tau_j$ = None or ($\tau_j \neq$ None and $OV(\tau_j, \tau_i) = 0$) then
10:             if $r \ge d.t_2 - d.t_1$ then
11:                 $d.tlist \leftarrow d.tlist \cup \{\tau_i\}$
12:                 add $\{d.t_1, d.t_2\}$ to $Sched(\tau_i)$
13:                 $r \leftarrow r - (d.t_2 - d.t_1)$
14:             else
15:                 replace $d \in Dur$ with $d_1 = \{d.t_1, d.t_1 + r, d.tlist \cup \{\tau_i\}\}$ and $d_2 = \{d.t_1 + r, d.t_2, d.tlist\}$
16:                 add $\{d_1.t_1, d_1.t_2\}$ to $Sched(\tau_i)$
17:                 $r \leftarrow 0$
18:                 break
The goal of POBaseNoC (Algorithm 8) is to find a feasible schedule for a transaction set which satisfies the necessary condition in Theorem 3.4.1 and in which all transactions have the same period $p$. The proposed POBaseNoC is executed as follows: each transaction, in ascending order of its $minid$, is assigned to the earliest valid slots (transactions with the same $minid$ are ordered by their indexes). Slot $t$ is valid for transaction $\tau_i$ if $t$ has not been assigned to transactions overlapping with $\tau_i$. In the next paragraphs, we discuss an efficient implementation of POBaseNoC and prove its correctness and optimality.
In POBaseNoC, in order to find valid slots for a transaction, we scan through a list of durations (Line 5) (instead of scanning through all slots as in Algorithm 6). Each duration represents a set of consecutive slots to which some non-overlapping transactions have been assigned. A transaction $\tau_i$ can only be assigned to slots in duration $d$ if transactions that have been assigned to $d$ do not overlap with $\tau_i$. We call these durations the valid durations of $\tau_i$.
In POBaseNoC, $Dur$ stores the list of durations. Each duration $d$ is represented as a tuple $\{d.t_1, d.t_2, d.tlist\}$ where all and only slots in $[d.t_1, d.t_2)$ are slots of $d$. The size of $d$ is the number of slots of $d$. $d.tlist$ is the list of transactions that have been assigned to $d$. Transaction $\tau_i$ is said to be assigned to $d$ if and only if $\tau_i$ is assigned to all slots in $d$. The algorithm guarantees that a transaction is assigned to all slots of a duration or none of them. The first duration appended to $Dur$ is $\{0, p, \{null\}\}$ where $\{null\}$ denotes an empty list. A duration is updated or split into two durations after an assignment of a transaction completes. $Sched(\tau_i)$ stores the list of pairs $\{t_1, t_2\}$ where all slots in $[t_1, t_2)$ are assigned to $\tau_i$, or equivalently $\forall t \in [t_1, t_2) : S(\tau_i, t) = 1$. Consider when transaction $\tau_i$ is being assigned. The value of variable $r$ is the remaining amount of transmission time of $\tau_i$ to be assigned. The algorithm checks if duration $d$ is valid for $\tau_i$ using Lines 6 to 9 (we describe these steps in the next paragraphs). If $d$ is valid then: 1) if the remaining transmission time $r$ of $\tau_i$ is no smaller than the size of $d$, $\tau_i$ is assigned to all slots of $d$ (Line 12). In this case, $d$ is updated by adding $\tau_i$ to $d.tlist$ (Line 11); 2) otherwise, $d$ is split into two durations, $d_1 = \{d.t_1, d.t_1 + r, d.tlist \cup \{\tau_i\}\}$ and $d_2 = \{d.t_1 + r, d.t_2, d.tlist\}$ (Line 15), and $\tau_i$ is assigned to $d_1$ (Line 16).
Lines 6 to 9 are used to check if $d$ is valid for $\tau_i$ (i.e. if there is no transaction in $d.tlist$ overlapping $\tau_i$). To do this, it suffices to check $\tau_i$ against only one transaction in $d.tlist$, namely $\tau_j = \arg\min_{\tau_l \in M}(maxid_l)$ where $M = \{\tau_l \in d.tlist : maxid_l \ge minid_i\}$ (see Lemma 3.4.3). This implementation results in a more time-efficient algorithm, as will be shown later.
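The procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the measured implementation: the PO-sets and transmission times are those of the running example of this chapter, $OV$ is derived from PO-set membership, `d.tlist` is a plain list scanned linearly instead of a queue sorted by $maxid$, and the scan over durations stops as soon as $r$ reaches 0. All Python names are illustrative.

```python
# PO-sets of the running example, keyed by DFS index (values are transaction indexes).
PO_SETS = {1: {1, 2, 3, 4}, 2: {1, 5}, 3: {4, 6}, 4: {4, 7},
           5: {3, 8, 9}, 6: {9, 10}, 7: {3, 8, 11}}
# Transmission times e_i of the example; every period equals P.
E = {1: 2, 2: 2, 3: 3, 4: 1, 5: 6, 6: 7, 7: 7, 8: 1, 9: 4, 10: 4, 11: 4}
P = 8

MINID = {i: min(l for l, d in PO_SETS.items() if i in d) for i in E}
MAXID = {i: max(l for l, d in PO_SETS.items() if i in d) for i in E}


def ov(i, j):
    """Overlap indicator: 1 if tau_i and tau_j share at least one PO-set."""
    return int(any(i in d and j in d for d in PO_SETS.values()))


def po_base_noc(e, p):
    """Sketch of Algorithm 8; returns Sched as {i: [(t1, t2), ...]}."""
    dur = [[0, p, []]]                                   # Line 2: one empty duration
    sched = {i: [] for i in e}
    for i in sorted(e, key=lambda x: (MINID[x], x)):     # Lines 1 and 3
        r = e[i]                                         # Line 4
        k = 0
        while k < len(dur) and r > 0:                    # Line 5, stops once r == 0
            t1, t2, tlist = dur[k]
            m = [l for l in tlist if MAXID[l] >= MINID[i]]   # Line 6
            j = min(m, key=MAXID.get) if m else None         # Line 7
            if j is None or ov(j, i) == 0:               # Line 9: d is valid
                if r >= t2 - t1:                         # Line 10: take all of d
                    tlist.append(i)                      # Line 11
                    sched[i].append((t1, t2))            # Line 12
                    r -= t2 - t1                         # Line 13
                else:                                    # Lines 15 to 17: split d
                    dur[k:k + 1] = [[t1, t1 + r, tlist + [i]],
                                    [t1 + r, t2, list(tlist)]]
                    sched[i].append((t1, t1 + r))
                    r = 0
            k += 1
    return sched


SCHED = po_base_noc(E, P)
```

On this input every transaction receives exactly $e_i$ slots and no two overlapping transactions share a slot. The linear scan of `tlist` keeps the sketch short at the cost of the logarithmic lookup that a sorted queue would provide.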
Figure 3.18 shows a schedule generated by POBaseNoC for the transaction set shown in Figure 3.15, where the DFS traversal is $D_1, D_2, D_3, D_4, D_5, D_6, D_7$ and the tuple $(e_i, p_i)$ of each transaction is given as follows: $\tau_1 : (2, 8)$; $\tau_2 : (2, 8)$; $\tau_3 : (3, 8)$; $\tau_4 : (1, 8)$; $\tau_5 : (6, 8)$; $\tau_6 : (7, 8)$; $\tau_7 : (7, 8)$; $\tau_8 : (1, 8)$; $\tau_9 : (4, 8)$; $\tau_{10} : (4, 8)$; $\tau_{11} : (4, 8)$. Given the transactions' parameters, all PO-sets have utilization 100%. The schedule of each PO-set is depicted by a set of contiguous rectangles on the horizontal time line. Each rectangle represents an execution of a transaction. A valid transaction ordering in Line 3 is $\tau_1, \tau_2, \tau_3, \tau_4, \tau_5, \tau_6, \tau_7, \tau_8, \tau_9, \tau_{10}, \tau_{11}$ (i.e. ascending order of the transactions' $minid$). Consider when $\tau_3$ is being allocated. At this point, since $\tau_1$ and $\tau_2$ have been allocated, $Dur$ has three durations $d_1 = \{0, 2, \{\tau_1\}\}$, $d_2 = \{2, 4, \{\tau_2\}\}$, $d_3 = \{4, 8, \{null\}\}$. Since only $d_3$ is valid, POBaseNoC assigns $\tau_3$ to slots in $[4, 7)$ and splits $d_3$ into two durations, $\{4, 7, \{\tau_3\}\}$ and $\{7, 8, \{null\}\}$.
[Figure 3.18: Schedule generated by POBaseNoC. One row per PO-set $D_1$ to $D_7$ over the time line $t = 0$ to 8, showing the slots assigned to $\tau_1$ to $\tau_{11}$.]
The correctness of POBaseNoC is proved in Lemmas 3.4.3 and 3.4.4, which show that Lines 6 to 9 guarantee that every transaction is assigned only to valid durations, which contain only valid slots. The proof of these lemmas requires Lemma 3.4.2. The optimality of the algorithm is then shown in Theorem 3.4.2.
Lemma 3.4.2 Consider transactions $\tau_i$ and $\tau_j$. If transaction $\tau_j$ has been allocated before $\tau_i$ and $OV(\tau_i, \tau_j) = 1$, then $\tau_j \in minPO_i$.
Proof.
Since $OV(\tau_i, \tau_j) = 1$, there exists $D_l$ where $\tau_i, \tau_j \in D_l$. Assume by contradiction that $\tau_j \notin minPO_i$, i.e. $D_l \neq minPO_i$. Since $\tau_j$ has been allocated before $\tau_i$, by the algorithm operation and the definition of $minid_i$, we have $minid_j \le minid_i < l$. By Properties 3.4.3 and 3.4.4, we have $minPO_j \preceq minPO_i \preceq D_l$. Then, by Property 3.4.1, we have $\tau_j \in minPO_i$, which contradicts the assumption. □
Lemma 3.4.3 If at Line 3, each duration $d$ has $d.tlist$ containing only non-overlapping transactions, then transaction $\tau_i$ will be allocated into only its valid durations.
Proof.
Consider a duration $d$ and let $M = \{\tau_l \in d.tlist : maxid_l \ge minid_i\}$ and $\tau_j = \arg\min_{\tau_l \in M}(maxid_l)$. We prove this lemma by showing that at Line 6, if there exists a transaction $\tau_k$ in $d.tlist$ that overlaps $\tau_i$ (thus $d$ is invalid) then $\tau_j \equiv \tau_k$ (thus the if-condition at Line 9 is false). Note that $\tau_k$ does not overlap $\tau_i$ when $maxid_k < minid_i$, because all PO-sets containing $\tau_k$ are different from those containing $\tau_i$. Hence, we can assume that $\tau_k \in M$. Assume by contradiction that $\tau_j \not\equiv \tau_k$.
We will first prove that the propositions A, B, and C are true, where $A \equiv (minPO_k \preceq minPO_i)$, $B \equiv (minPO_j \preceq minPO_i)$, $C \equiv (minPO_k \preceq maxPO_j)$. For A, since $\tau_k \in M$ and $\tau_k$ has been allocated before $\tau_i$, we have $minid_k \le minid_i \le maxid_k$. Then, since $minPO_k \preceq maxPO_k$ (Property 3.4.4), by Property 3.4.3, A is true. We can prove B true in a similar way by replacing $\tau_k$ with $\tau_j$. For C, since $\tau_j \in M$ and $\tau_k$ is allocated before $\tau_i$, we have $minid_k \le minid_i \le maxid_j$. Together with the definition of $\tau_j$, we have $minid_k \le maxid_j \le maxid_k$. Then, by Property 3.4.3, C is true.
Due to the tree structure of the PO-tree, from A and B, either proposition D or E is true, where $D \equiv (minPO_j \preceq minPO_k)$ and $E \equiv (minPO_k \preceq minPO_j)$. If D is true, then together with C, F is true, where $F \equiv (minPO_j \preceq minPO_k \preceq maxPO_j)$. By Property 3.4.1, F implies that $\tau_j \in minPO_k$. This, in turn, implies that $\tau_j$ overlaps $\tau_k$, which contradicts the lemma's assumption. If E is true, then together with B, G is true, where $G \equiv (minPO_k \preceq minPO_j \preceq minPO_i)$. Since $\tau_k \in minPO_k$ and $\tau_k \in minPO_i$ (Lemma 3.4.2), by Property 3.4.1, G implies that $\tau_k \in minPO_j$. This, in turn, implies that $\tau_j$ overlaps $\tau_k$ which, again, contradicts the lemma's assumption. Therefore $\tau_j \equiv \tau_k$, which means $d$ is only invalid for $\tau_i$ when $\tau_i$ overlaps $\tau_j$; otherwise $d$ is valid and $\tau_i$ will be allocated into slots of $d$ (Lines 10 to 18). □
Lemma 3.4.4 Each transaction is allocated into only its valid durations.
Proof.
Since, at the first iteration of Line 3, the assumption of Lemma 3.4.3 is true (because $d.tlist = null$), by Lemma 3.4.3 and the operation from Line 10 to 18, $\tau_1$ is allocated into valid durations and, at the end of this iteration, every duration $d$ has $d.tlist$ containing only non-overlapping transactions. Hence, the assumption of Lemma 3.4.3 is true again at the second iteration of Line 3. The proof is then complete by induction. □
Theorem 3.4.2 POBaseNoC is optimal for same-period acyclic transaction sets.
Proof.
We will show that if the transaction set satisfies the necessary condition in Theorem 3.4.1, then at the end of the algorithm, the number of valid slots assigned to each transaction $\tau_i$ is $e_i$. Consider when transaction $\tau_i$ is being allocated. By Lemma 3.4.2, all transactions that are allocated before $\tau_i$ and overlap $\tau_i$ must belong to $minPO_i$. Therefore the invalid slots of $\tau_i$ must all have been assigned to transactions in $minPO_i$. Assume by contradiction that at the end of the algorithm, the number of valid slots assigned to $\tau_i$ is smaller than $e_i$. Since the algorithm guarantees that a transaction can only be assigned to all slots of a duration or none of them (Lines 12 and 16), all and only valid slots of a transaction are contained in its valid durations. Therefore, the contradiction assumption is true only if at Line 4, the number of valid slots of $\tau_i$ is smaller than $e_i$. This implies that $p - \sum_{\tau_j \in minPO_i \setminus \{\tau_i\}} e_j < e_i$. This contradicts the condition in Theorem 3.4.1, which implies that $\sum_{\tau_j \in minPO_i} e_j \le p$. In other words, there are always enough valid slots to allocate $\tau_i$. Since this is also true for all other transactions, the proof is complete. □
POBaseNoC Analysis: Since a new duration is added only when a transaction assignment is complete, i.e. $r = 0$ (Line 15), the maximum number of durations in $Dur$ is $N$. Therefore, the body of the for-loops between Lines 3 and 18 is executed at most $N^2$ times. Since we can implement $d.tlist$ as a sorted queue based on $maxid$, the operation to look up $\tau_j$ at Lines 6 and 7 takes $O(\log N)$ time, and the operation to insert a new item into the ordered list $d.tlist$ at Lines 11 and 15 also takes $O(\log N)$ time. Furthermore, since the rest of the code between Lines 6 and 18 can be implemented with a constant number of operations, the time complexity of POBaseNoC is $O(N^2 \log N)$. Finally, since there are at most $N$ durations, we need at most $O(N)$ space to store the schedule of each transaction.
The GenerateLoadNoC procedure
As mentioned in Section 3.4.2, the second condition for POGen to work is that procedure GenerateLoadNoC can generate a feasible load set $\{l_i^k : \forall \tau_i \in T\}$ for interval $int_k$ when its inputs satisfy Inequalities 3.4.2 and 3.4.3. A feasible load set is one that satisfies Inequalities 3.3.5 and 3.3.6. In this section, we present a version of GenerateLoadNoC. There are two questions that have to be answered: (1) is there a feasible load set? (2) is there an efficient algorithm to find it? We will show that the problem of finding a feasible load set is equivalent to the problem of Circulations in Graphs with Demands and Lower Bounds [20] where the demand at every vertex is 0. This is the problem of finding a feasible circulation flow in a directed graph $G$ in which each edge has a capacity and a lower bound. Furthermore, we will prove that if the utilization of each PO-set does not exceed the utilization bound expressed by Inequality 3.4.4, there always exists a feasible solution, thereby answering Question 1. Then, since the Ford-Fulkerson algorithm [20] can be used to solve the problem, Question 2 is also answered.
In the following, we will describe the construction of the directed graph G from the input of
GenerateLoadNoC. Graph G is constructed such that: 1) the value of the flow on an edge represents the
interval load of a transaction or the total interval load of a PO-set. Flow values are subject to the same upper
bounds and lower bounds of the loads they represent (as in Inequalities 3.3.5 and 3.3.6); 2) the equality
between the sum of the input-flow values and the sum of the output-flow values at a vertex represents the
equality between the total interval load of some PO-sets and the sum of the interval loads of the transactions
in these PO-sets. The construction guarantees that the value of a feasible flow on an edge is also a feasible
interval load of the transaction or the PO-set it represents.
Graph construction: we define a tuple $G = (V, E)$ as follows:
Vertices of $G$:
• For each PO-set $D_l$, define a vertex $v_l$.
• The set of vertices of $G$ is $V = \{v^*, v_l : l \in [1, N_D]\}$ where $v^*$ is an additional vertex.
Edges and flow values of $G$:
• For each PO-set $D_l$, define a directed edge $g_l^D$ which is called a PO-set edge. Furthermore, define for each edge $g_l^D$ two integer constants $c_l^D$ and $b_l^D$, which are called the capacity and the lower bound of edge $g_l^D$, where $c_l^D = \min(|int_k|, \lceil lag(D_l, int_k) \rceil)$ and $b_l^D = \lfloor lag(D_l, int_k) \rfloor$. Finally, define for each edge $g_l^D$ a real variable $x_l^D$ which is called the flow value of the edge. The flow value is subject to the constraints in Inequality 3.4.5 (which is the same as Inequality 3.3.6):
$$\forall D_l \subseteq T : b_l^D \le x_l^D \le c_l^D. \quad (3.4.5)$$
• For each transaction $\tau_i$, define a directed edge $g_i$ which is called a transaction edge. Furthermore, define for each edge $g_i$ two integer constants $c_i$ and $b_i$, which are called the capacity and the lower bound of edge $g_i$, where $c_i = \lceil lag(\tau_i, int_k) \rceil$ and $b_i = \lfloor lag(\tau_i, int_k) \rfloor$. Finally, define for each edge $g_i$ a real variable $x_i$ which is called the flow value of the edge. The flow value is subject to the constraints in Inequality 3.4.6 (which is the same as Inequality 3.3.5):
$$\forall \tau_i \in T : b_i \le x_i \le c_i. \quad (3.4.6)$$
• The set of edges of $G$ is $E = \{g_i : i \in [1, N]\} \cup \{g_l^D : l \in [1, N_D]\}$. The total number of edges is $|E| = N + N_D$.
Rules for directing edges:
• The set of edges that enter $v^*$ is $Enter^* = \{g_i : \tau_i \in D_1\}$; the set of edges that exit $v^*$ is $\{g_1^D\}$.
• The edges that enter $v_l$ are PO-set edge $g_l^D$ and the transaction edges representing transactions that are in children of $D_l$ but not in $D_l$, i.e. $Enter_l = \bigcup_{D_m \in children(D_l)} \{g_i : \tau_i \in D_m \setminus D_l\}$.
• The edges that exit $v_l$ are PO-set edges $\{g_m^D : D_m \in children(D_l)\}$ and the transaction edges representing transactions that are in $D_l$ but not in children of $D_l$, i.e. $Exit_l = \{g_i : \tau_i \in D_l \setminus \bigcup_{D_m \in children(D_l)} D_m\}$.
Flow conservation constraint:
• The flow values in graph $G$ are subject to the flow conservation constraint [20], in which, given a vertex, the sum of the flow values entering it minus the sum of the flow values exiting it is zero.
Figure 3.19 shows graph $G$ constructed from the transaction set shown in Figure 3.15. Consider vertex $v_1$. Edges that enter $v_1$ are $g_1^D$ and $Enter_1 = \{g_5, g_6, g_8, g_9\}$. Edges in $Enter_1$ are edges that represent transactions which belong to $children(D_1) = \{D_2, D_3, D_5\}$ but do not belong to $D_1$. Furthermore, edges that exit $v_1$ are $\{g_2^D, g_3^D, g_5^D\}$ and $Exit_1 = \{g_2\}$. As we will prove later, this construction guarantees that the equality between the sum of the input-flow values and the sum of the output-flow values at $v_1$ (due to the flow conservation constraint) represents the equality between the sum of interval loads of transactions in every child of $D_1$ and the total interval load of its children.
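The edge-direction rules can be sketched for the running example as follows. The parent map is an assumption inferred from the examples in the text, and the function names are illustrative. The last helper checks flow conservation for the particular flow in which every PO-set edge carries the sum of the flows of its transactions, using arbitrary illustrative values for the transaction flows.

```python
PO_SETS = {1: {1, 2, 3, 4}, 2: {1, 5}, 3: {4, 6}, 4: {4, 7},
           5: {3, 8, 9}, 6: {9, 10}, 7: {3, 8, 11}}
PARENT = {1: None, 2: 1, 3: 1, 4: 3, 5: 1, 6: 5, 7: 5}   # assumed PO-tree


def children(parent, l):
    return [m for m, p in parent.items() if p == l]


def root_of(parent):
    return next(l for l, p in parent.items() if p is None)


def enter_sets(po_sets, parent):
    """Transaction edges entering v* (key 'v*') and each v_l (key l)."""
    enter = {"v*": set(po_sets[root_of(parent)])}
    for l in po_sets:
        enter[l] = set()
        for m in children(parent, l):
            enter[l] |= po_sets[m] - po_sets[l]
    return enter


def exit_sets(po_sets, parent):
    """Transaction edges exiting each v_l; none exit v*."""
    exits = {}
    for l in po_sets:
        covered = set()
        for m in children(parent, l):
            covered |= po_sets[m]
        exits[l] = po_sets[l] - covered
    return exits


def conserved(po_sets, parent, x):
    """Flow conservation when x_l^D is the sum of x_i over the transactions of D_l."""
    xd = {l: sum(x[i] for i in d) for l, d in po_sets.items()}
    enter, exits = enter_sets(po_sets, parent), exit_sets(po_sets, parent)
    if sum(x[i] for i in enter["v*"]) != xd[root_of(parent)]:
        return False
    for l in po_sets:
        inflow = xd[l] + sum(x[i] for i in enter[l])
        outflow = (sum(xd[m] for m in children(parent, l))
                   + sum(x[i] for i in exits[l]))
        if inflow != outflow:
            return False
    return True


ENTER = enter_sets(PO_SETS, PARENT)
EXIT = exit_sets(PO_SETS, PARENT)
```

In this example each transaction edge appears in exactly one entering set and exactly one exiting set, so every edge has a single head and a single tail, and the total number of edges is $N + N_D = 18$.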
We will prove in Lemma 3.4.7 that graph $G$ is indeed a directed graph in which every edge is directed and has two endpoints. The proof will use Lemmas 3.4.5 and 3.4.6.
Lemma 3.4.5 For every $D_l, D_m : D_l \neq D_m$, the following is true:
$$\big(D_l \setminus parent(D_l)\big) \cap \big(D_m \setminus parent(D_m)\big) = \emptyset. \quad (3.4.7)$$
[Figure 3.19: Graph $G$ of $T$ shown in Figure 3.15. Vertices $v^*$ and $v_1$ to $v_7$, PO-set edges $g_1^D$ to $g_7^D$, and transaction edges $g_1$ to $g_{11}$.]
Proof.
Since the lemma is trivial when $D_l = parent(D_m)$ or $D_m = parent(D_l)$, we assume that these conditions are false. If $D_l \not\preceq D_m$ and $D_m \not\preceq D_l$, then the lemma is true because of Property 3.4.2. If $D_l \preceq D_m$, since $D_l \neq parent(D_m)$, we have $D_l \preceq parent(D_m) \preceq D_m$. By Property 3.4.1, for every $\tau_i$ where $\tau_i \in D_l$ and $\tau_i \in D_m$ we have $\tau_i \in parent(D_m)$. Therefore $D_l \cap \big(D_m \setminus parent(D_m)\big) = \emptyset$, which proves the lemma. Finally, using a similar technique and interchanging $D_l$ with $D_m$, we can also prove the lemma when $D_m \preceq D_l$. □
Lemma 3.4.6 For every $D_l, D_m : D_l \neq D_m$, the following is true:
$$\Big(D_l \setminus \bigcup_{D_n \in children(D_l)} D_n\Big) \cap \Big(D_m \setminus \bigcup_{D_n \in children(D_m)} D_n\Big) = \emptyset. \quad (3.4.8)$$
Proof.
Since the lemma is trivial when $D_l = parent(D_m)$ or $D_m = parent(D_l)$, we assume that these conditions are false. If $D_l \not\preceq D_m$ and $D_m \not\preceq D_l$, then the lemma is true because of Property 3.4.2. If $D_l \preceq D_m$, since $D_l \neq parent(D_m)$, there exists $D_n \in children(D_l)$ where $D_l \preceq D_n \preceq D_m$. By Property 3.4.1, for every $\tau_i$ where $\tau_i \in D_l$ and $\tau_i \in D_m$ we have $\tau_i \in D_n$. Therefore $\big(D_l \setminus D_n\big) \cap D_m = \emptyset$, which proves the lemma. Finally, using a similar technique and interchanging $D_l$ with $D_m$, we can also prove the lemma when $D_m \preceq D_l$. □
Lemma 3.4.7 G is a directed graph where $|E| \le 2N$.
Proof.
Since every edge of G is directed, it remains to show that each edge enters one and only one vertex and exits
one and only one vertex. Note that there is one edge defined for each PO-set and one edge defined for each
transaction.
Consider PO-set edges: By the rules of directing edges, PO-set edge $g^D_1$ exits only vertex $v^*$.
Furthermore, if $D_l = parent(D_m)$, the PO-set edge $g^D_m$ exits vertex $v_l$. Since each $D_m$ with
$D_m \ne D_1$ has one and only one parent, the PO-set edge $g^D_m$ exits one and only one vertex. Therefore,
each PO-set edge exits one and only one vertex. In addition, each PO-set edge $g^D_l$ enters only vertex
$v_l$. Therefore each PO-set edge enters one and only one vertex.
Consider transaction edges: By the rules of directing edges, the set of transaction edges that enter $v^*$
is $Enter^*$ and the set that enter $v_l$ is $Enter_l$. By Lemma 3.4.5, the collection of sets
$\{Enter^*, Enter_l : \forall l\}$ is pairwise disjoint. Therefore each transaction edge $g_i$ appears in at
most one of the sets in the collection. Furthermore, since
$\bigcup_{D_m \in T} \big(D_m \setminus parent(D_m)\big) = T$, each transaction edge $g_i$ must appear in at
least one of the sets in the collection $\{Enter^*, Enter_l : \forall l\}$. In other words, each transaction
edge $g_i$ enters one and only one vertex.
By the rules of directing edges, the set of transaction edges that exit $v^*$ is empty and the set that exit
$v_l$ is $Exit_l$. By Lemma 3.4.6, the collection of sets $\{Exit_l : \forall l\}$ is pairwise disjoint.
Therefore each transaction edge $g_i$ appears in at most one of the sets in the collection. Furthermore, since
$\bigcup_{D_l \in T} \big(D_l \setminus \bigcup_{D_m \in children(D_l)} D_m\big) = T$, each transaction edge
$g_i$ must appear in at least one of the sets in the collection $\{Exit_l : \forall l\}$. In other words,
each transaction edge $g_i$ exits one and only one vertex.
Now we prove that $|E| \le 2N$. Since each transaction edge exits one and only one vertex and each
vertex except $v^*$ has at least one transaction edge exiting it, we have $|V \setminus \{v^*\}| \le N$.
Since by definition $|V \setminus \{v^*\}| = N_D$ and $|E| = N + N_D$, we have $|E| \le 2N$. $\square$
It remains to show that GenerateLoadNoC honors Condition 3.4.2. For simplicity of exposition, we
split the proof into multiple lemmas. First, Lemma 3.4.8 proves an important property of graph G regarding
the flow values. Then, this property is used in Lemma 3.4.9 to prove that graph G has a feasible flow
if Inequalities 3.4.2 and 3.4.3 are satisfied for interval $int_k$ and, furthermore, all PO-sets satisfy the
utilization bound shown in Inequality 3.4.4. Note that we know from [20] that if graph G has a feasible flow,
then it has an integral feasible flow, which can be found by the Ford-Fulkerson algorithm [20]. Therefore, to
complete the proof, we have to prove that a feasible load set can be derived from an integral feasible flow
of G (Lemma 3.4.10).
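The search for an integral feasible flow mentioned above can be illustrated with a generic sketch. The code below is not GenerateLoadNoC. It is a minimal feasible-circulation check, assuming every edge carries an integer lower and upper bound, and it uses the standard reduction to a maximum-flow problem solved with breadth-first augmenting paths (the Edmonds-Karp variant of Ford-Fulkerson). The function name `feasible_circulation`, the vertex numbering and the edge tuple layout are assumptions made for illustration only.

```python
from collections import deque

def feasible_circulation(n, edges):
    """Return one integral flow value per edge, or None if no circulation exists.

    n     -- number of vertices, numbered 0..n-1
    edges -- list of (tail, head, lower, upper) with integer bounds, lower <= upper
    """
    source, sink = n, n + 1
    adj = [[] for _ in range(n + 2)]   # adj[v] = indices of arcs leaving v
    head, cap = [], []                 # arc k and arc k ^ 1 form a forward/backward pair

    def add_arc(u, v, c):
        adj[u].append(len(head)); head.append(v); cap.append(c)
        adj[v].append(len(head)); head.append(u); cap.append(0)

    # Treat the lower bounds as already routed; only the slack remains as capacity.
    excess = [0] * n
    for u, v, lo, hi in edges:
        add_arc(u, v, hi - lo)         # forward arc of edge k has index 2*k
        excess[v] += lo
        excess[u] -= lo

    demand = 0
    for v in range(n):
        if excess[v] > 0:
            add_arc(source, v, excess[v])
            demand += excess[v]
        elif excess[v] < 0:
            add_arc(v, sink, -excess[v])

    pushed = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        via = {source: None}           # vertex -> arc used to reach it
        queue = deque([source])
        while queue and sink not in via:
            x = queue.popleft()
            for k in adj[x]:
                if cap[k] > 0 and head[k] not in via:
                    via[head[k]] = k
                    queue.append(head[k])
        if sink not in via:
            break
        # Walk back from the sink, find the bottleneck, and augment.
        path, v = [], sink
        while via[v] is not None:
            path.append(via[v])
            v = head[via[v] ^ 1]
        bottleneck = min(cap[k] for k in path)
        for k in path:
            cap[k] -= bottleneck
            cap[k ^ 1] += bottleneck
        pushed += bottleneck

    if pushed != demand:
        return None                    # some lower bound cannot be honored
    # Flow on edge k = lower bound + amount pushed on its forward arc,
    # which is stored as the capacity of the paired backward arc.
    return [lo + cap[2 * k + 1] for k, (_, _, lo, _) in enumerate(edges)]


edges = [(0, 1, 1, 2), (1, 2, 0, 1), (1, 0, 0, 1), (2, 0, 1, 1)]
print(feasible_circulation(3, edges))
```

Because all bounds are integers, every augmentation moves an integer amount, so the returned flow is integral whenever a feasible flow exists.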
Lemma 3.4.8 A flow in graph G honors the flow conservation constraint at every vertex $v_l$ if and only if the
following equality holds for every PO-set $D_l$:
$$\sum_{i \in D_l} x_i = x^D_l. \tag{3.4.9}$$
Proof.
We prove the lemma by induction on the height of $D_l$.
Basis case: Consider $v_l$ where $height(D_l) = 0$, or equivalently $children(D_l) = \emptyset$. By the rules
of directing edges, the set of edges that enter $v_l$ is $\{g^D_l\}$ and the set of edges that exit $v_l$ is
$\{g_i : i \in D_l\}$. Therefore, the flow conservation constraint holds at vertex $v_l$ if and only if
$\sum_{i \in D_l} x_i = x^D_l$.
Induction case: Assume that the hypothesis is true for all $D_l$ with $height(D_l) \le H - 1$. We will prove
that the hypothesis is also true for all $D_l$ with $height(D_l) = H$. By the definition of tree node height,
the induction hypothesis implies that for every $D_m \in children(D_l)$, $v_m$ honors the flow conservation
constraint if and only if:
$$\sum_{i \in D_m} x_i = x^D_m. \tag{3.4.10}$$
By the rules of directing edges and Lemma 3.4.5, the total value of flows that enter vertex $v_l$ is:
$$x^{enter}_l = \sum_{D_m \in children(D_l)} \; \sum_{i \in D_m \setminus D_l} x_i + x^D_l,$$
and the total value of flows that exit vertex $v_l$ is:
$$x^{exit}_l = \sum_{i \in D_l \setminus \bigcup_{D_m \in children(D_l)} D_m} x_i + \sum_{D_m \in children(D_l)} x^D_m.$$
To complete the induction step, it remains to show that $x^{enter}_l = x^{exit}_l$ if and only if Equation
3.4.9 holds. Note that by Equation 3.4.10, $x^{enter}_l = x^{exit}_l$ is equivalent to:
$$\begin{aligned}
x^D_l &= - \sum_{D_m \in children(D_l)} \; \sum_{i \in D_m \setminus D_l} x_i
        + \sum_{i \in D_l \setminus \bigcup_{D_m \in children(D_l)} D_m} x_i
        + \sum_{D_m \in children(D_l)} \; \sum_{i \in D_m} x_i \\
      &= \sum_{D_m \in children(D_l)} \; \sum_{i \in D_m \cap D_l} x_i
        + \sum_{i \in D_l \setminus \bigcup_{D_m \in children(D_l)} D_m} x_i
\end{aligned} \tag{3.4.11}$$
Finally, since by Property 3.4.2 the PO-sets in $children(D_l)$ are pairwise disjoint, Equation 3.4.11 is
equivalent to:
$$x^D_l = \sum_{i \in D_l \cap \bigcup_{D_m \in children(D_l)} D_m} x_i
        + \sum_{i \in D_l \setminus \bigcup_{D_m \in children(D_l)} D_m} x_i
        = \sum_{i \in D_l} x_i.$$
This completes the proof. $\square$
Lemma 3.4.9 There exists a feasible flow in graph G if Inequalities 3.4.2, 3.4.3 are satisfied for interval
intk and all PO-sets satisfy the utilization bound in Inequality 3.4.4.
Proof.
First note that Inequalities 3.4.2 are necessary for the edge constraints on each PO-set edge (Inequality 3.4.5)
to be satisfied. Let us construct a flow as follows:
$$\forall i \in T : x_i = lag(i, int_k)$$
$$\forall D_l \subseteq T : x^D_l = lag(D_l, int_k)$$
We have to prove that the constructed flow satisfies the edge constraints and the flow conservation
constraints. Given the constructed flow, it is easy to verify that the edge constraints of each transaction edge
(Inequality 3.4.6) and the left-side edge constraints of each PO-set edge (Inequality 3.4.5) are satisfied. The
right-side edge constraints of each PO-set edge are satisfied because, by the definition of the lag function and
by Inequalities 3.4.3, before the execution of GenerateLoadNoC for interval $int_k$ we have the following:
$$\begin{aligned}
lag(D_l, int_k) &= u^D_l \cdot t_{k+1} - \sum_{i \in D_l} \; \sum_{x \in [0, t_k)} S(i, x) \\
                &\le u^D_l \cdot t_{k+1} - \lfloor u^D_l \cdot t_k \rfloor \\
                &< u^D_l \cdot t_{k+1} - u^D_l \cdot t_k + 1.
\end{aligned} \tag{3.4.12}$$
Since $L \le \min_k |int_k|$, we have $\frac{L - 1}{L} \le \frac{|int_k| - 1}{|int_k|}$. Hence, by Inequality
3.4.4, we have $u^D_l \le \frac{|int_k| - 1}{|int_k|}$. Applying this inequality to Inequality 3.4.12, we have:
$$lag(D_l, int_k) < u^D_l \cdot t_{k+1} - u^D_l \cdot t_k + 1 \le |int_k|.$$
Since furthermore $lag(D_l, int_k) \le \lceil lag(D_l, int_k) \rceil$, it follows that
$lag(D_l, int_k) \le \min(|int_k|, \lceil lag(D_l, int_k) \rceil)$.
It remains to verify that the flow conservation constraint is honored at every vertex. Since the constructed
flow satisfies Equation 3.4.9, the sufficient condition of Lemma 3.4.8 proves this statement. $\square$
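The step from Inequality 3.4.12 to the bound by $|int_k|$ can be spelled out as follows, assuming (as the notation suggests, but as is not restated in this section) that interval $int_k$ spans $[t_k, t_{k+1})$, so that $|int_k| = t_{k+1} - t_k$:

```latex
\begin{align*}
u^D_l \le \frac{|int_k| - 1}{|int_k|}
  &\;\Longrightarrow\; u^D_l \cdot |int_k| \le |int_k| - 1 \\
  &\;\Longrightarrow\; u^D_l \cdot t_{k+1} - u^D_l \cdot t_k + 1
     = u^D_l \cdot |int_k| + 1 \le |int_k| .
\end{align*}
```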
Lemma 3.4.10 If there is an integral feasible flow in graph G, then there is a feasible load set where
$$\forall i \in T : l^k_i = x_i. \tag{3.4.13}$$
Proof.
We have to prove that the load set where $\forall i \in T : l^k_i = x_i$ satisfies Inequalities 3.3.5 and
3.3.6. By the transaction edge constraints in Inequality 3.4.6, the following inequality holds:
$$\forall i \in T : \lfloor lag(i, int_k) \rfloor \le l^k_i \le \lceil lag(i, int_k) \rceil.$$
Thus the interval loads satisfy Inequality 3.3.5.
Furthermore, by the necessary condition of Lemma 3.4.8, we have:
$$\forall D_l \in T : \sum_{i \in D_l} x_i = x^D_l. \tag{3.4.14}$$
Then, since $\sum_{i \in D_l} l^k_i = \sum_{i \in D_l} x_i$ and $x^D_l$ is subject to the PO-set edge
constraints in Inequality 3.4.5, Inequality 3.3.6 is satisfied. $\square$
Algorithm analysis: Since $c_i - b_i \le 1$ for every $g_i \in E$, $c^D_l - b^D_l \le 1$ for every
$g^D_l \in E$, and $N_D \le N$, we have
$\sum_{g_i \in E} (c_i - b_i) + \sum_{g^D_l \in E} (c^D_l - b^D_l) \le 2N$. The time complexity of the
Ford-Fulkerson algorithm in finding a feasible circulation in graph G is $O(|E| f_{max})$, where $f_{max}$ is
the maximum flow value of a graph derived from G and $f_{max}$ is at most the sum above (see [20] for
details). Since that sum is at most $2N$ and $|E| \le 2N$, the time complexity of GenerateLoadNoC is $O(N^2)$.
Finally, since the time complexity of POBaseNoC is $O(N^2 \log N)$, the time complexity of POGen to generate
the schedule for each scheduling interval is $O(N^2 \log N)$.
The sufficient utilization bound of POGen: As shown in Lemma 3.4.9, GenerateLoadNoC induces a
sufficient utilization bound shown in Inequality 3.4.4. Since this bound is stricter than that of POBaseNoC,
it becomes a sufficient utilization bound of POGen.
3.5 Implementation
In this section, we discuss some practical factors in the implementation of POGen in real systems and show
the experimental measurement of the algorithm execution overheads.
An important parameter in the operation of POGen is the size of L in number of slots, because it dictates
POGen’s sufficient utilization bound (shown in Inequality 3.4.4). Note that the real-time value of L
is determined by the applications. As shown in [27], typical real-time applications have periods that are
multiples of milliseconds. Therefore, we believe it can be assumed that the greatest common divisor of
all transaction periods is at least 1 ms. Meanwhile, a typical modern SoC bus [2, 10, 12] has a bus clock
frequency no smaller than 100 MHz. Therefore, it is reasonable to select a slot size no bigger than
10 us (equivalent to 1000 bus cycles), which results in a value of L no smaller than 100 slots.
The utilization bound, therefore, will be 0.99. It is worth noting that the size of a slot does not affect
the algorithm overhead, because the time complexity of POGen depends only on the number of transactions N.
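The arithmetic in this paragraph can be reproduced with a short sketch. It assumes that Inequality 3.4.4 has the form $u \le (L - 1)/L$, which is consistent with the 0.99 figure quoted for L = 100 but is not restated in this section; the function names and the sample periods are illustrative only.

```python
from functools import reduce
from math import gcd

def slots_per_interval(periods_us, slot_us):
    """L in slots: the greatest common divisor of all periods divided by the slot size."""
    g = reduce(gcd, periods_us)
    if g % slot_us != 0:
        raise ValueError("slot size must divide the GCD of the periods")
    return g // slot_us

def utilization_bound(L):
    """Assumed form of the sufficient bound: (L - 1) / L."""
    return (L - 1) / L

def slot_cycles(slot_us, bus_mhz):
    """Number of bus cycles in one slot (1 MHz = 1 cycle per microsecond)."""
    return slot_us * bus_mhz

L = slots_per_interval([1000, 2000, 5000], slot_us=10)   # periods of 1, 2 and 5 ms
print(L, utilization_bound(L), slot_cycles(10, 100))
```

A smaller slot raises L and therefore the bound, at the cost of more slots to schedule per interval.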
Implementing our scheduling framework requires two components: 1) POGen is executed on each processing
element at the beginning of each interval to generate the interval’s schedule, and 2) a slot scheduler
transmits a given transaction in its assigned slots according to the generated interval schedule. The
implementation of the slot scheduler depends on the NoC architecture. In a software-controlled NoC such as in
the CellBE processor [2], the slot scheduler can be implemented in software on each processing element by
using an interrupt handler to trigger the scheduler at the beginning of each duration; since there are at most
N durations, on each processing element there are at most N interrupts within each scheduling interval. In
a custom NoC, the slot scheduler could be implemented in the router connected to the processing element.
In order to measure the POGen overhead in a real SoC system, we implemented POGen on an IBM
CellBE processor platform [7]. We selected the Cell processor because it represents a typical modern SoC
whose components are interconnected by a software-controllable NoC. We generated random transaction
sets using the same method described in Section 3.6. Table 3.1 shows the average and maximum execution
time of POGen for various transaction set sizes N. Given a smallest scheduling interval of 1 ms, the
overhead is less than 1.5% of the scheduling interval size.

N              10     20     30     40
Average (us)   4.5    6.4    9.5    12.1
Max (us)       6.9    7.7    11.2   13.8
Table 3.1: POGen execution overhead
3.6 Evaluation
Most previous related works [43, 44, 3, 24, 28] have focused on the Fixed-priority Scheduling Algorithm
(FPA). These works deal with methods for schedulability analysis and priority assignment. More
specifically, Shi et al. recently proposed in [43] a branch-and-bound algorithm that searches for a feasible
priority set for a transaction set. If a feasible priority set exists, then the transaction set is guaranteed
to be schedulable under the worst-case transaction latency (WTL) analysis proposed in [44]. To the best of
our knowledge, the works in [44, 43] are the state of the art.
In this section, we compare the performance of POGen with the solution proposed in [44, 43], assuming
acyclic transaction sets (note that the algorithms in [44, 43] can also be used for cyclic transaction sets).
The performance metric is the acceptance rate of POGen and the FPA, where the acceptance rate is calculated
as the number of schedulable transaction sets over the total number of generated transaction sets. A
transaction set is schedulable under POGen if it passes the utilization bound test shown in Inequality 3.4.4,
whereas it is schedulable under the FPA if it has a feasible priority set generated by the algorithm in [43].
Figure 3.20: Acceptance rate versus PO-set utilizations (x-axis: maximum PO-set utilization, 0.5 to 1.0;
y-axis: acceptance rate, 0.1 to 1.0; curves: FPA and POGen with L = 10, 20, 50, 100)
To generate random transaction sets for the experiments, we used parameters and methods similar to those
used in [43], except that we are only concerned with acyclic transaction sets. The transactions’ sources and
destinations are randomly generated on a square 2D mesh and transactions are routed using dimension-order
X-Y routing. Like [43], we use the maximum PO-set utilization of a transaction set (which is "the
maximum link utilization" in [43]) as a controlled variable. Given a maximum PO-set utilization $u_{max}$,
the utilization of transactions is generated according to the uniform distribution algorithm in [5] such that
the utilizations of all PO-sets are no larger than $u_{max}$. The transmission time $e_i$ of transaction $i$
is a uniformly distributed random number in the range from 1 to 1024 slots. The period $p_i$ is then
determined as a multiple of L using the following formula: $\lceil e_i / (u_i \cdot L) \rceil \cdot L$. Given
a pair $\{e_i, p_i\}$, we recalculate $u_i$ to be $e_i / p_i$.
Following the discussion in Section 3.3.4, we generated transaction sets with L equal to 10, 20, 50, and 100
slots. In the following experiments, 1000 transaction sets are generated at each measurement point.
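The period rule described above can be sketched as follows. This is a minimal illustration for a single transaction, assuming a target utilization is already given; the function name `make_transaction`, the fixed seed and the target value 0.3 are illustrative and not taken from the experiments.

```python
import math
import random

def make_transaction(u_target, L, rng, max_slots=1024):
    """Draw (e_i, p_i, u_i) for one transaction with target utilization u_target."""
    e = rng.randint(1, max_slots)              # transmission time in slots, uniform in [1, 1024]
    p = math.ceil(e / (u_target * L)) * L      # period rounded up to a multiple of L
    return e, p, e / p                         # utilization recomputed from the pair (e, p)

rng = random.Random(0)                         # fixed seed, for repeatability only
for L in (10, 20, 50, 100):
    e, p, u = make_transaction(0.3, L, rng)
    print(f"L={L:3d}  e={e:4d}  p={p:5d}  u={u:.3f}")
```

Because the period is rounded up, the recomputed utilization does not exceed the target (up to floating-point rounding), so the PO-set utilizations stay within the chosen maximum.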
Figure 3.20 shows the acceptance rates of FPA and POGen for various maximum PO-set utilizations.
The size of the NoC is 10x10 and each transaction set has 20 transactions. For POGen, we report the results
with L equal to 10, 20, 50, and 100. The better performance of POGen comes from the fact that the WTL analysis
in [44] does not take advantage of the parallelism between non-overlapping transactions. For example,
consider the transaction set shown in Figure 3.1. Assume transactions 6 and 7 have higher priority than
transaction 4. According to the WTL analysis in [44], the interference of transactions 6 and 7 on the
execution of transaction 4 is calculated as if all transactions were using a single shared resource. However,
POGen allows transactions 6 and 7 to be executed in parallel, as shown in Figure 3.18.
Figure 3.21 shows the acceptance rate of FPA and POGen for various transaction set sizes and NoC
sizes. We report the results where L equals 10 slots and the maximum PO-set utilization is 0.95. In most
cases, the acceptance rate of POGen is higher than that of FPA, especially when the transaction set size is
larger and the NoC size is smaller. The reason is that in these situations there are more transaction overlaps.
Therefore, FPA suffers more from the effect described in the previous paragraph.
Figure 3.21: Acceptance rate versus NoC sizes and N
3.7 Conclusion
In this research, we investigated the problem of real-time communication scheduling on multi-core
processor buses. The assumptions that this scheduling problem makes about the resource sharing pattern
differ from those of traditional real-time problems. In particular, transactions on multi-core processor buses
may depend on each other in an intransitive manner. We proposed a novel scheduling framework that is
able to dynamically schedule transactions by taking into account the intransitive dependence between them.
Compared to related works, our algorithms achieve much higher bus utilization. Our research has also
opened up a new scheduling paradigm that is worth further investigation.
Chapter 4
System Design for Predictable Memory Access
As described in Chapter 1, real-time system design faces significant challenges in handling, in a
predictable manner, the interaction between the CPU cache and peripherals while they are accessing the shared
main memory. In modern computer architectures, peripherals are connected to the system through a
peripheral interconnection such as the PCI bus; peripherals with bus master capabilities (DMA) can directly
initiate read/write transactions toward either other peripherals or the main memory. Bus master mode is
essential to avoid overloading the processor, especially in the case of fast I/O interfaces that could
otherwise produce millions of interrupts per second. However, since the memory is a shared resource in the
system, peripheral transactions can interfere with cache line fetches produced by the CPU memory controller
whenever a task experiences a cache miss. This interaction can slow down task execution tremendously: our
experiments in Section 4.3 show that task execution time in the presence of heavy I/O load increases by up
to 44%.
In general, two types of solutions can be applied to this problem. The first is to account for
the effect of all peripheral traffic in the worst case execution time (WCET) of each task. In this case, we
need an analysis that can compute the increase in WCET given design-time bounds on peripheral traffic.
The problem with this solution is that, as already mentioned, the WCET increment can be very large, up to
44%. The second, which we propose in this research, is to coschedule CPU tasks and I/O transactions:
assuming we find a way to control when peripherals are allowed to transmit, we can create a bus transmission
schedule and synchronize it with the CPU task schedule. We can then formulate our coscheduling objective
as follows: maximize the traffic transmitted by each peripheral, while guaranteeing that each task meets its
real-time deadline.
Figure 4.1: Peripheral Gate State Machine (states: allow_state, with REQO# = REQ#, and block_state, with
REQO# = 1; allow_state moves to block_state when BLOCK=1 & REQ#=1 and otherwise stays, i.e. when
BLOCK=0 | REQ#=0; block_state stays while BLOCK=1 and returns to allow_state when BLOCK=0)
This research has two main contributions. First, in Section 4.1 we introduce the design of a device for
the PCI/PCI-X bus, called a peripheral gate (or p-gate for short), that allows us to control peripheral access
to the bus. The implemented p-gate is compatible with existing computer architectures: no modification to
either the peripheral or the motherboard is required. Second, we propose in Section 4.2 a novel coscheduling
method to coordinate tasks’ execution and peripheral operations. More specifically, we divide each task into
a series of superblocks; each superblock can include branches and loops, but superblocks must be executed
in sequence. By running the task, the CPU can collect information on the number of cache misses in each
superblock. We can then compute a safe WCET bound by determining a worst case arrival pattern of cache
misses in each superblock. The coscheduling heuristic uses the WCET analysis and the run-time information
provided by the OS to compute the available task slack, and it dynamically opens the p-gates when it is safe
to do so. In Section 4.3, we show that our heuristic performs well compared to the best possible run-time,
adaptive and predictive algorithm.
4.1 Peripheral Gate
In this section, we first provide a brief overview of the Peripheral Component Interconnect (PCI)
standard and then describe our p-gate implementation. PCI is the current standard family of architectures for
motherboard-peripheral interconnection in the personal computer market; it is also widely popular in the
embedded domain [34]. The standard can be divided into two parts: a logical specification, which details how
the CPU configures and accesses peripherals through the system controller, and a physical specification,
which details how peripherals are connected to and communicate with the motherboard. Several widely
different physical specifications have been published; here we focus on the PCI/PCI-X physical specification,
which uses a shared bus architecture with support for multiple bus segments connected by bridges. To gain
access to the shared bus, each peripheral must first obtain permission from the bus segment arbiter using a
standard handshake with two point-to-point, active-low wires, REQ# and GNT#. The peripheral first lowers
REQ# to signal a request for the bus, and the arbiter grants permission by lowering GNT#. The peripheral
then waits for the bus to become free and starts a data transfer (also called a bus transaction). The handshake
finishes after the peripheral and the arbiter raise REQ# and GNT#, respectively, in succession; if the
peripheral wants to initiate another transaction, it must reacquire the grant.
Figure 4.2: Peripheral Gate Schematic (signals: REQ#, BLOCK, CLK, RST, REQO#)
We implemented the p-gate based on a PCI extender card, a debug card that is interposed between
the peripheral card and the motherboard and provides easy access to all signals. We modified the card to
intercept the REQ# signal and to control it based on an input block signal coming from the reservation
controller. The main idea is to force REQ# to remain high whenever block is active; in this way, the
peripheral is not able to get the grant from the arbiter and thus cannot transmit. The actual implementation
is more complex: if block is raised while REQ# is active low, we could violate the PCI specification by
immediately deactivating REQ#. Instead, we must allow the current request to finish and only then block
all further requests. A corresponding synchronous state machine is shown in Figure 4.1 and an optimized
schematic in Figure 4.2, where REQ# is the input from the peripheral and REQO# is the controlled output
to the arbiter. Our implementation uses discrete components: two positive-edge-triggered D flip-flops, two
NOR gates and an inverting tri-state buffer. The output buffer is required by the specification to set the
output to high impedance whenever the bus is reset. We measured a worst case propagation delay for the
circuit of 7 ns, which allowed us to run the bus at a frequency up to 66 MHz.
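The gating behavior described above can be illustrated with a small behavioral model of the state machine in Figure 4.1. This is a software sketch only, not the flip-flop and NOR-gate circuit: it assumes that REQO# depends only on the current state and on REQ# (as the state labels suggest), and the function names `pgate_step` and `simulate` are illustrative.

```python
ALLOW, BLOCKED = "allow_state", "block_state"

def pgate_step(state, block, req_n):
    """Advance the p-gate by one clock edge.

    block -- 1 when the reservation controller wants the peripheral blocked
    req_n -- REQ# from the peripheral (active low: 0 means "requesting")
    Returns (next_state, reqo_n), where reqo_n is REQO# as seen by the arbiter.
    """
    if state == ALLOW:
        reqo_n = req_n                      # pass the request through unchanged
        # Start blocking only once no request is in flight (REQ# is high).
        next_state = BLOCKED if (block == 1 and req_n == 1) else ALLOW
    else:
        reqo_n = 1                          # hide any request from the arbiter
        next_state = BLOCKED if block == 1 else ALLOW
    return next_state, reqo_n

def simulate(inputs, state=ALLOW):
    """Feed a sequence of (block, req_n) samples and collect the REQO# outputs."""
    outputs = []
    for block, req_n in inputs:
        state, reqo_n = pgate_step(state, block, req_n)
        outputs.append(reqo_n)
    return outputs

trace = [(0, 1), (0, 0), (1, 0), (1, 1), (1, 0), (0, 0), (0, 0)]
print(simulate(trace))
```

In this trace, the request that is already active when block rises (third sample) is still forwarded, whereas the request issued while the gate is blocked (fifth sample) is held back until block is released.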
The reservation controller outputs a block signal for each p-gate in the system. We implemented a
prototype reservation controller based on a Xilinx ML505 board. The board is connected to the system
using a PCI-E motherboard slot, and uses a Virtex-5 FPGA to implement a custom peripheral. All registers
used by the peripheral are memory mapped; a PCI driver is used to map the registers into the CPU virtual
memory space, hence tasks running in user mode can communicate with the peripheral by performing memory
reads/writes. The reservation controller can run in two different modes: in data acquisition mode (see
Section 4.2.2) it simply collects statistics about the task execution while keeping all p-gates closed. In
execution mode (see Section 4.2.4) it runs the coscheduling algorithm and dynamically controls the p-gates.
This solution moves as much computation as possible into hardware on the FPGA, thus minimizing the overall
CPU overhead.
4.2 I/O traffic and Tasks Coscheduling
As mentioned above, there are two types of approaches for handling the interference of I/O traffic on task
execution. The first one is to estimate the worst-case I/O interference. This estimation requires as inputs
the cache access function and the I/O load function. If these functions are coarse, the estimation will be very
pessimistic. However, obtaining a precise cache access function is very hard: both running the task on real
hardware and using static analysis [39] only provide imprecise information, i.e. the number of cache misses
in an interval. In this research we propose an alternative solution: we combine the worst-case
interference analysis with the regulation of I/O traffic. The end result is a coscheduling algorithm that can
efficiently guarantee tasks’ deadlines while maximizing the I/O throughput. In our solution, we assume that
at compile time a control flow graph can be derived for the task, comprised of a series of S superblocks
$\{s_1, \ldots, s_S\}$. Each superblock can include branches and loops, but superblocks must be executed in
sequence. For each $s_i$, we measure the worst case execution time $wcet_i$ without peripheral interference and
the worst case number of cache misses $CM_i$, either through static analysis or by making use of CPU
self-measures. We can then obtain a safe bound on task delay in the following way: for each superblock $s_i$,
we consider the worst case pattern of $CM_i$ cache misses in an interval of length $wcet_i$, i.e. the pattern
that results in the highest possible delay. We will then use $wcet_i$ and $CM_i$ to derive adaptive algorithms
which decide when to open the p-gate to maximize the I/O throughput. In the following, we first detail how to
obtain the described measurements in a concrete setting and then we provide our algorithms.
Figure 4.3: Measured Peripheral Load. (x-axis: time (ns), 0 to 10 x 10^7; y-axis: time (ns), 0 to 6 x 10^5;
curves: Peripheral Load Bound, Cumulative Bus Time)
4.2.1 Peripheral Load Evaluation
The peripheral load function can be obtained in two ways. If the peripheral is an I/O interface and the node
is part of a distributed system using a real-time communication protocol, then a bound on the peripheral
activity can be derived analytically. Otherwise, we propose the following testing methodology for systems
where an analytical bound is not available (while testing can fail to reveal the real worst case, we argue
that it is nevertheless an accepted and commonly used methodology in the industry). A trace of activity for a
PCI/PCI-X peripheral can be gathered by monitoring the bus with a logic analyzer. For example, Figure 4.3
shows the first 100 ms of a measured trace for a 100 Mb/s Ethernet network card in terms of the cumulative
time taken by peripheral transactions on a 32-bit, 33 MHz PCI bus segment; the whole recorded trace consisted
of 1000 transactions. We developed a simple algorithm that computes the peripheral load function E(t) from a
trace in time quadratic in the number of bus transactions. The algorithm performs a double iteration over all
transactions, computing at each step the amount of peripheral traffic in an interval between the beginning of
any transaction and the end of any other successive transaction. The computed values are inserted into a list
ordered by interval length, and all non-maximal values are culled. Figure 4.3 shows the resulting E(t)
function in the interval [0, 100 ms]. If multiple traces are recorded, then an upper bound can be computed by
merging the computed load functions for each trace and again removing all non-maximal values. Finally, note
that the computed E(t) expresses a load bound for the bus segment on which the peripheral is located. In the
case of the PCI/PCI-X architecture, the segment is connected to the memory controller through one or multiple
bridges, each of which has buffering capabilities. In this case, a safe bound on the generated memory
controller load can be obtained by adding a term B/C to the computed E(t) function, where B is the sum of the
sizes of all traversed buffers and C is the speed of the memory controller.
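The double iteration described above can be sketched as follows. This is one possible reading of the procedure, not the original implementation: it assumes a trace is a time-ordered list of non-overlapping (start, end) pairs in nanoseconds, and it interprets "non-maximal" as "no more traffic than some interval of shorter or equal length". All function names and the sample values are illustrative.

```python
def cull(points):
    """Keep only (length, busy) points not dominated by a shorter-or-equal interval."""
    kept, best = [], -1
    for length, busy in sorted(points, key=lambda p: (p[0], -p[1])):
        if busy > best:
            kept.append((length, busy))
            best = busy
    return kept

def load_points(trace):
    """Quadratic scan: traffic between the start of transaction i and the end of transaction j >= i."""
    points = []
    for i in range(len(trace)):
        busy = 0
        for j in range(i, len(trace)):
            busy += trace[j][1] - trace[j][0]                # time the bus is held by transaction j
            points.append((trace[j][1] - trace[i][0], busy))
    return cull(points)

def merge_traces(*point_lists):
    """Combine the load points of several traces and cull again."""
    return cull([p for pts in point_lists for p in pts])

def add_bridge_term(points, buffer_size, controller_speed):
    """Add the constant B/C term that accounts for buffering in the traversed bridges."""
    extra = buffer_size / controller_speed
    return [(length, busy + extra) for length, busy in points]

trace = [(0, 10), (20, 30), (100, 110)]
print(load_points(trace))
```

The result is a staircase of breakpoints; how E(t) is evaluated between two breakpoints is left open here, since the text does not specify it.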
4.2.2 Cache Miss Measurement
We devised a testing methodology to experimentally obtain the worst case execution time (WCET) and worst
case number of cache misses for each superblock. Our implementation uses the Intel Core Microarchitec-
ture architectural performance counters [17], but other CPU architectures such as IBM PowerPC provide
similar support for CPU self-measures. Support was added by modifying the Linux/RK kernel [33]. The
Core Microarchitecture specifies support for three architectural performance counters, each of which can
be configured to count a variety of internal events. In particular, we used Counter 0 to count the number
of elapsed CPU clock cycles and Counter 1 to count the number of level-2 cache misses. To accurately
measure task execution without the effects of OS overhead, we configured both counters to be active only
when the CPU is executing in user mode. Finally, we allowed reading the counter values from user mode
with the rdpmc instruction (the counters can still be written and configured only in kernel mode) to reduce
measurement overhead.
Counters are read inside each task by adding the checkpoint code in Figure 4.4 at the end of each
superblock. The cpuid instruction inserts a synchronization barrier, i.e. it makes sure that all instructions
fetched before cpuid are completed before the counters are read; this is required to cope with out-of-order
execution. The counter values are then sent to the reservation controller running in data acquisition mode;
this ensures that no write to system memory is performed at the checkpoint, which could cause an additional
cache miss. The reservation controller determines the execution time and number of cache misses in each
superblock by computing the difference between the values obtained in successive checkpoints. After the task
has finished, the computed values are read back from the reservation controller, and $wcet_i$ and $CM_i$ can
be determined as the worst case over several task runs.
cpuid; //synchronization barrier
mov ECX, 0000_0000H;
rdpmc; //read Counter 0
//move value from DL:EAX to reservation controller
mov [RESCON_COUNTER0_H], DL;
mov [RESCON_COUNTER0_L], EAX;
mov ECX, 0000_0001H;
rdpmc; //read Counter 1
mov [RESCON_COUNTER1_H], DL;
mov [RESCON_COUNTER1_L], EAX;
Figure 4.4: Checkpoint Assembler Code.
Note that performance counters are not task-specific, so we had to modify the kernel to support reading the
counters in a multitasking environment. We added two new fields to the task descriptor, counter_extime and
counter_cmisses, to store the counter information. When a task is created, the fields are set to zero. When
a task is preempted, the kernel first reads the counter values and saves them in the preempted task’s
descriptor. Then, it writes the values in the preempting task’s descriptor back into the counters. Finally,
the kernel writes the id of the preempting task in a register of the reservation controller, so that the
controller can correctly associate the received information with the running task.
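The bookkeeping performed on the checkpoint readings can be sketched as follows. This is an illustrative model, not the controller logic: it assumes each run is a list of cumulative (cycles, L2 misses) readings whose first entry is the value at job start (zero for a freshly created task), and the function names are hypothetical.

```python
def superblock_deltas(readings):
    """Differences between successive checkpoint readings: one (cycles, misses) pair per superblock."""
    return [(c1 - c0, m1 - m0)
            for (c0, m0), (c1, m1) in zip(readings, readings[1:])]

def worst_case(runs):
    """Per-superblock maxima (wcet_i, CM_i) over several runs of the same task."""
    per_run = [superblock_deltas(r) for r in runs]
    n_blocks = len(per_run[0])
    return [(max(d[k][0] for d in per_run), max(d[k][1] for d in per_run))
            for k in range(n_blocks)]

runs = [
    [(0, 0), (100, 5), (250, 7)],     # run 1: readings at job start and after s_1, s_2
    [(0, 0), (120, 4), (260, 10)],    # run 2
]
print(worst_case(runs))
```

Taking the two maxima independently is conservative: the worst execution time and the worst miss count of a superblock need not occur in the same run.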
We implemented a compiler pass using the LLVM compiler infrastructure [21] to automatically add
checkpoint code to the task. In the current implementation, the designer must manually identify the
superblock boundaries by selecting an initial and a final basic block for each superblock. The choice involves
a tradeoff, as smaller superblocks provide better information and tighter WCET bounds, but at the price of
increased measurement overhead.
4.2.3 I/O-inflicted Delay Analysis
Given load function E(t) and values $wcet_i$, $CM_i$ for superblock $s_i$, we use the analysis provided in
[35] to determine the worst case delay $D(s_i)$ suffered by all cache misses in $s_i$. The key idea is that a
worst case pattern can be produced by "spreading out" the cache misses over the length $wcet_i$ of the
superblock.
4.2.4 Coscheduling Algorithm
It is important to note that even when the analysis in [35] is tight, it is rare that at run-time a task will
suffer a delay equal to the bound: a particular pattern for both cache misses and peripheral transactions is
required to produce the worst case. As such, accounting for the worst case delay inflicted by all peripherals
in the computation time budget of each task can lead to a large waste of resources. Motivated by this
observation, we propose an alternative solution based on a run-time adaptive algorithm which leverages our
described architecture, composed of software checkpointing, peripheral gates and the reservation controller.
The idea is to assign to a task the minimal time budget of $\sum_{1 \le i \le S} wcet_i$, and then to monitor
the actual execution of the task and open the p-gates whenever possible. At run-time, information on the
execution time consumed by the current job is sent to the reservation controller at each checkpoint. The
controller uses this information to determine the actual execution time $e_i$ of superblock $s_i$ for the
current job. The accumulated slack time after superblock $s_i$ can then be computed as
$\sum_{1 \le j \le i} (wcet_j - e_j)$; the slack time is the maximum delay that the task can suffer while
still meeting its computation time budget. We can then design a coscheduling algorithm that strives to
maximize the amount of time that the p-gates are open under the constraint that the slack can never become
negative. Our proposal is to integrate both the analysis and the coscheduling techniques in
a mixed-criticality system. Inspired by the avionic domain, we consider two types of guaranteed real-time
tasks: safety critical tasks like flight control, which have stringent delay and verification requirements, and
mission critical tasks, which are still hard real-time but have lower criticality. We propose to schedule
safety critical tasks by blocking all I/O traffic except that from peripherals used by the task, and to
account for the delay in the time budget. For mission critical tasks we instead use coscheduling. Finally, we
also assume that the system runs best effort or soft real-time tasks for which all p-gates are opened.
Algorithm 9 is our main coscheduling heuristic. For simplicity, we describe the algorithm for a single
controlled task and a single peripheral, but it can be easily extended to a multitasking environment with
multiple p-gates. Communication from the task to the reservation controller triggers the algorithm at the
beginning of each job and at each checkpoint. The algorithm maintains two variables: $i$ is the index of
the last executed superblock, and $slack$ represents the accumulated slack. At the end of each superblock
$s_i$, the algorithm first recomputes the slack and then performs a check: if the slack is at least equal to the
maximum delay $D(s_{i+1})$, then the p-gate is opened because we are sure that the slack will be non-negative
after the next superblock $s_{i+1}$ is executed. Otherwise, the p-gate is kept closed.
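The checkpoint logic of the adaptive heuristic can be sketched in Python as follows. This is a minimal model under stated assumptions, not the reservation controller implementation: the class name `AdaptiveController`, its fields, and the boolean `gate_open` standing in for OpenGate()/CloseGate() are hypothetical. The behavior after the last superblock (gate closed) is also an assumption, since the heuristic only defines the check against the next superblock.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AdaptiveController:
    """Illustrative model of the adaptive heuristic for one task and one p-gate."""
    wcet: List[float]       # wcet[k]: worst-case time of superblock k+1 with no traffic
    max_delay: List[float]  # max_delay[k]: worst-case delay bound D for superblock k+1
    slack: float = 0.0
    i: int = 0              # number of superblocks completed in the current job
    gate_open: bool = False

    def job_start(self) -> None:
        self.slack = 0.0
        self.i = 0
        self.gate_open = False  # the first superblock always runs with the gate closed

    def checkpoint(self, e: float) -> None:
        """Called when a superblock finishes after running for e time units."""
        self.i += 1
        self.slack += self.wcet[self.i - 1] - e
        if self.i < len(self.wcet):
            # Open only if the slack covers the worst-case delay of the next superblock.
            self.gate_open = self.max_delay[self.i] <= self.slack
        else:
            # Assumption: nothing is left to check after the last superblock.
            self.gate_open = False


ctrl = AdaptiveController(wcet=[10.0, 8.0, 12.0], max_delay=[3.0, 2.0, 4.0])
ctrl.job_start()
ctrl.checkpoint(7.0)  # slack becomes 3.0, which covers the next bound of 2.0
```

Because the gate is opened only when the slack covers the full worst-case delay of the next superblock, the slack in this model cannot become negative as long as each measured time stays within the worst-case time plus the delay bound.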
Algorithm 9 Adaptive Algorithm
JobStart() {
    slack := 0
    i := 0
    CloseGate()
}
Checkpoint(e) {
    i := i + 1
    slack := slack + wcet_i - e
    if D(s_{i+1}) ≤ slack then
        OpenGate()
    else
        CloseGate()
}

The limitation of Algorithm 9 is that it greedily "allocates" all slack to the next superblock by immediately
opening the p-gate. This can lead to a suboptimal allocation, as superblock $s_{i+1}$ could be short and
have a lot of cache misses while superblock $s_{i+2}$ could be longer with very few cache misses. If we have
additional information on the task, we can potentially do better using a predictive heuristic. In particular,
Algorithm 10 assumes that the average case computation time $avg_i$ and average case delay $D^{avg}(s_i)$ for
each superblock $s_i$ are known. The algorithm keeps track of the predicted slack, i.e. the total slack assuming
that all future superblocks will execute for $e_j = avg_j$. We can then compute a strategy that maximizes the
amount of time that the p-gate is opened by allocating the predicted slack among all future superblocks: if
we decide to open the p-gate during $s_j$, we consume an amount of slack equal to $D^{avg}(s_j)$ and the p-gate
is opened for $avg_j + D^{avg}(s_j)$ time units. It is easy to see that this allocation problem is equivalent to the
KNAPSACK problem [18], which is well known to be NP-hard. We therefore use a sub-optimal polynomial
time greedy solver: off-line, we order all superblocks by non-increasing values of
$\frac{avg_j + D^{avg}(s_j)}{D^{avg}(s_j)}$. At run-time, we perform the allocation by iterating through the list,
ignoring all superblocks already executed. When the iteration reaches the next superblock $s_{i+1}$, the p-gate
is opened if the remaining predicted slack is greater than or equal to $D^{avg}(s_{i+1})$.
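The greedy allocation described above can be sketched in Python as follows. This is an illustrative model only: the names `greedy_order` and `PredictiveController`, the 0-based indexing, and the example numbers are hypothetical, and the average delays are assumed to be strictly positive so that the ordering ratio is defined.

```python
from typing import List, Sequence


def greedy_order(avg: Sequence[float], d_avg: Sequence[float]) -> List[int]:
    """Off-line step: 0-based superblock indices by non-increasing
    (avg + d_avg) / d_avg, i.e. gate-open time gained per unit of slack spent."""
    return sorted(range(len(avg)),
                  key=lambda j: (avg[j] + d_avg[j]) / d_avg[j],
                  reverse=True)


class PredictiveController:
    def __init__(self, wcet, avg, d_max, d_avg) -> None:
        self.wcet, self.avg, self.d_max, self.d_avg = wcet, avg, d_max, d_avg
        self.order = greedy_order(avg, d_avg)
        self.job_start()

    def job_start(self) -> None:
        self.slack = 0.0
        # Predicted slack: what remains if every superblock runs for its average time.
        self.pslack = sum(w - a for w, a in zip(self.wcet, self.avg))
        self.i = 0
        self.gate_open = False

    def checkpoint(self, e: float) -> None:
        self.i += 1
        done = self.i - 1                  # superblock that just finished
        self.slack += self.wcet[done] - e
        self.pslack += self.avg[done] - e
        nxt = self.i                       # next superblock to execute
        self.gate_open = False
        if nxt >= len(self.wcet):          # assumption: no decision after the last one
            return
        tmp = self.pslack
        for k in self.order:
            if k > nxt and self.d_avg[k] <= tmp:
                tmp -= self.d_avg[k]       # reserve slack for a more profitable block
            if k == nxt:
                self.gate_open = (self.d_max[nxt] <= self.slack
                                  and self.d_avg[nxt] <= tmp)
                return


# Short, miss-heavy second superblock followed by a long third superblock.
ctrl = PredictiveController(wcet=[10.0, 2.5, 20.5], avg=[8.0, 2.0, 20.0],
                            d_max=[2.0, 2.5, 3.5], d_avg=[1.0, 2.0, 3.0])
ctrl.checkpoint(7.0)   # gate stays closed: predicted slack is reserved for block 3
ctrl.checkpoint(2.0)   # gate opens for the long third superblock
```

In this example the accumulated slack after the first superblock (3.0) would already cover the worst-case delay of the second (2.5), so a purely slack-based check would open the gate there; the predictive rule instead keeps the slack for the longer third superblock.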
Algorithm 10 Predictive Algorithm
JobStart() {
    slack := 0
    pslack := \sum_{k=1}^{S} (wcet_k - avg_k)
    i := 0
    CloseGate()
}
Checkpoint(e) {
    i := i + 1
    slack := slack + wcet_i - e
    tmp := pslack := pslack + avg_i - e
    for all k in ORDERED_LIST((avg_j + D^{avg}(s_j)) / D^{avg}(s_j)) do
        if k > i + 1 ∧ D^{avg}(s_k) ≤ tmp then
            tmp := tmp - D^{avg}(s_k)
        if k == i + 1 then
            if D(s_{i+1}) ≤ slack ∧ D^{avg}(s_{i+1}) ≤ tmp then
                OpenGate()
            else
                CloseGate()
            return
}

Note that Algorithm 10 is not the only possible predictive algorithm; in fact, no on-line algorithm can be
optimal, since any optimal algorithm must know exactly the computation time of future superblocks, i.e.
it must be clairvoyant. However, for the sake of comparison it is interesting to compute an upper bound on
the best possible performance of any on-line predictive algorithm. Assume that for a specific run, $e_i$ is the
execution time of $s_i$ assuming that the p-gate is closed, and $\bar{e}_i$ is the execution time assuming that the p-gate
is opened. An upper bound can be computed by solving the following integer linear programming problem:

$$\max \sum_{i=1}^{S} x_i \bar{e}_i \tag{4.2.1}$$
$$\forall i, 1 \le i \le S : \; x_i D(s_i) \le \sum_{j=1}^{i-1} \left( wcet_j - (1 - x_j) e_j - x_j \bar{e}_j \right) \tag{4.2.2}$$
$$\forall i, 1 \le i \le S : \; x_i \in \{0, 1\}, \tag{4.2.3}$$

where $\{x_1, \ldots, x_S\}$ are indicator variables (i.e., $x_i = 1$ if the p-gate is opened during $s_i$). Equation 4.2.1
maximizes the open time, while Equation 4.2.2 expresses the slack constraint.
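For small instances, the bound can be illustrated without an ILP solver by enumerating every assignment of the indicator variables. The sketch below is an assumption-laden stand-in for a real solver: the function name `offline_bound` and the example numbers are hypothetical, and the enumeration is exponential in the number of superblocks, so it is only meant to make the objective and the slack constraint concrete.

```python
from itertools import product
from typing import Sequence, Tuple


def offline_bound(wcet: Sequence[float], e_closed: Sequence[float],
                  e_open: Sequence[float], d_max: Sequence[float]
                  ) -> Tuple[float, Tuple[int, ...]]:
    """Return (maximum gate-open time, gate decisions) by exhaustive search."""
    best_time, best_x = 0.0, tuple(0 for _ in wcet)
    for x in product((0, 1), repeat=len(wcet)):
        slack, feasible = 0.0, True
        for i, xi in enumerate(x):
            # Slack constraint: slack accumulated before superblock i must cover
            # x_i * D(s_i); with x_i = 0 it must simply be non-negative.
            required = d_max[i] if xi else 0.0
            if required > slack:
                feasible = False
                break
            slack += wcet[i] - (e_open[i] if xi else e_closed[i])
        if feasible:
            # Objective: total execution time spent with the gate open.
            open_time = sum(eo for xi, eo in zip(x, e_open) if xi)
            if open_time > best_time:
                best_time, best_x = open_time, x
    return best_time, best_x


bound, decisions = offline_bound(wcet=[10.0, 2.5, 20.5],
                                 e_closed=[7.0, 2.0, 20.0],
                                 e_open=[8.0, 3.5, 22.5],
                                 d_max=[2.0, 2.5, 3.5])
print(bound, decisions)  # 22.5 (0, 0, 1)
```

The first superblock can never be opened here because no slack has accumulated before it, which mirrors the empty sum on the right-hand side of the slack constraint for the first index.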
4.3 Experimental Results
To validate our architecture, we performed experiments on a platform comprising an Intel Core2 CPU
and an Intel 975X system controller. Using a PC platform allowed us easy access to all PC slots; however,
to derive meaningful measures we changed the memory controller clock frequency, obtaining a speed of
900 MHz for the CPU and a theoretical bandwidth of 2.4 Gbyte/s for the memory controller, which is in line
with typical values for embedded platforms.
We first performed an experiment to evaluate the maximum delay incurred by a task due to peripheral
interference. To obtain repeatable measures, we implemented a custom traffic generator peripheral for the
PCI-X bus based on a Xilinx ML455 board. The peripheral periodically initiates write transactions to main
memory, and both the period and transactions length can be configured to produce a load up to the maximum
of 1 Gbyte/s supported by PCI-X. We then designed a task to maximize cache stall time. The task allocates
a memory buffer of double the size of the CPU level 2 cache, and then cyclically reads from the buffer,
one word for each 128-byte cache line; a cache miss is thus generated for each memory read. We first ran
the task without using the traffic generator and measured an execution time of 48.73 ms and 580,227 cache
misses. Using a memory benchmark, we evaluated a main memory throughput of $C = 1.8$ Gbyte/s, which
is slightly lower than the theoretical memory controller speed. Since 128 bytes must be transferred for each
cache miss, the task is actually stalled for $\frac{580{,}227 \cdot 128}{1.8 \times 10^{9}}\,\text{s} = 41.26$ ms, resulting in the desired high cache stall
time of 84.67%. We then ran the task again, enabling the traffic generator with maximum load, and we
measured an average increase in computation time of 43.85%.

SLACKONLY   ADAPTIVE   PREDICTIVE   BOUND
4.89%       31.21%     36.65%       40.85%
Table 4.1: Benchmark Results.
It is important to note that the measured value is an average case delay, since obtaining the worst case
pattern over more than 500,000 fetches is improbable. Our analysis is able to compute the worst case delay,
but we need additional information on the system controller, in particular the type of arbitration used by
the memory controller and the maximum length values $L$ and $L'$. These data are typically available for
components used in embedded systems; see [29] for example. However, in the PC market manufacturers are
often wary of revealing precise details for fear of losing their competitive edge. We therefore used the following
experimental methodology to obtain the required information: we first guessed values for $L$, $L'$ and the
arbitration type, and we built a simulator to predict average case delay for a variety of settings. We then
performed extensive experiments and compared the measured values with the predictions to determine if
our guesses were acceptable.
Experimental results are shown in Figure 4.5, where each point is an average over 5 runs. We measured
the percentage increase in computation time for the aforementioned task varying the offered peripheral load
and length of peripheral transactions. Note that for small lengths we are not able to significantly load the
bus, as the period of the traffic generator is constrained by PCI-related overhead; hence, some points in
Figure 4.5(a) cannot be generated on the bus and we show them as zero values. All results are within 5%
of our predictions, assuming round-robin arbitration and $L = L' = 32\,\text{bytes}/C = 17.8$ ns, which means
that each fetch is broken down into 4 data transfers on the memory controller. Unexpectedly, we found
that delay is constant over 70% load. Investigation of the PCI-X bus using a logic analyzer revealed that
it is an issue of the PCI bridge, which is not fast enough to buffer all peripheral data. We also performed
additional experiments varying the cache stall time of the task; this can be achieved by inserting a variable
number of instructions between each successive cache line read. The obtained wcet increases also matched
our simulation results within a small deviation.
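The stall-time and transfer-length figures quoted in this section follow from simple arithmetic on the measured values. The short Python check below reproduces them; the constant names are illustrative only.

```python
CACHE_LINE_BYTES = 128   # bytes transferred per cache miss
TRANSFER_BYTES = 32      # bytes per data transfer on the memory controller
MISSES = 580_227         # cache misses measured without peripheral traffic
THROUGHPUT = 1.8e9       # measured main memory throughput C, in bytes per second
EXEC_TIME_MS = 48.73     # measured execution time without peripheral traffic

stall_ms = MISSES * CACHE_LINE_BYTES / THROUGHPUT * 1e3
stall_fraction = stall_ms / EXEC_TIME_MS
transfer_ns = TRANSFER_BYTES / THROUGHPUT * 1e9
transfers_per_fetch = CACHE_LINE_BYTES // TRANSFER_BYTES

print(f"stall time: {stall_ms:.2f} ms")               # stall time: 41.26 ms
print(f"stall fraction: {stall_fraction:.2%}")        # stall fraction: 84.67%
print(f"transfer length: {transfer_ns:.1f} ns")       # transfer length: 17.8 ns
print(f"transfers per fetch: {transfers_per_fetch}")  # transfers per fetch: 4
```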
Figure 4.5: Experimental Results. (a) Measured Delay (axes: transaction length in log2(#bytes), % throughput of PCI-X bus, % increase in computation time). (b) Synthetic Tasks, ADAPTIVE (axes: % wcet over avg increase, % cache stall time, % switch open time). (c) Synthetic Tasks, algorithm ratio (axes: % wcet over avg increase, % cache stall time, % OPTPREDICT over ADAPTIVE increase).
Figure 4.6: Synthetic Tasks, ADAPTIVE, $\alpha = 0.1$ (axes: % comp. time variation, % cache stall time, % p-gate open time).
We then evaluated the performance of the described coscheduling algorithms on our platform. We chose
an MPEG decoder [11] as our benchmark for two reasons: it is both a memory and I/O intensive application,
and it is representative of the type of video computation that is becoming increasingly important for mission
control in avionic systems. We collected average and worst case statistics on a test video clip after placing
multiple checkpoints for each frame; the MPEG decoder is run as a periodic task, with 20 superblocks in
each period. To mirror the behavior of a real application and increase the number of cache misses, we
also ran a higher priority task that preempts the MPEG decoder every 1 ms and replaces its cache content.
Results averaged over 50 runs are shown in Table 4.1 in terms of the percentage of time that the p-gate
is opened in the task period. In the table, SLACKONLY represents a baseline solution where the p-gate
is kept closed while the task is executing and is opened after the task has finished for its remaining time
budget $\sum_{1 \le i \le S} (wcet_i - e_i)$; ADAPTIVE is Algorithm 9; PREDICTIVE is Algorithm 10; BOUND is the
bound computed by solving the ILP problem of Equations 4.2.1-4.2.3. Note that since BOUND is not
implementable at run-time, we computed the bound offline using measured values of computation times and
number of cache misses. We can see that SLACKONLY tends to perform very poorly; ADAPTIVE is within
30% of BOUND and PREDICTIVE is roughly in between the two, which seems to suggest that prediction
offers limited improvement.
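The gaps between the columns of Table 4.1 can be made explicit with a short calculation. The Python sketch below uses only the four values reported in the table; the helper names `gain_over` and `shortfall_from` are hypothetical, and the choice of reference value (the weaker algorithm or the stronger one) is an assumption that changes the percentage obtained.

```python
results = {"SLACKONLY": 4.89, "ADAPTIVE": 31.21, "PREDICTIVE": 36.65, "BOUND": 40.85}


def gain_over(better: str, worse: str) -> float:
    """Relative improvement of `better` over `worse`, in percent of `worse`."""
    return (results[better] - results[worse]) / results[worse] * 100.0


def shortfall_from(reference: str, other: str) -> float:
    """How far `other` falls below `reference`, in percent of `reference`."""
    return (results[reference] - results[other]) / results[reference] * 100.0


print(f"BOUND over ADAPTIVE:      {gain_over('BOUND', 'ADAPTIVE'):.1f}%")       # 30.9%
print(f"PREDICTIVE over ADAPTIVE: {gain_over('PREDICTIVE', 'ADAPTIVE'):.1f}%")  # 17.4%
print(f"BOUND over PREDICTIVE:    {gain_over('BOUND', 'PREDICTIVE'):.1f}%")     # 11.5%
print(f"ADAPTIVE below BOUND:     {shortfall_from('BOUND', 'ADAPTIVE'):.1f}%")  # 23.6%
```

Measured against ADAPTIVE the gap to BOUND is about 31%, while measured against BOUND it is about 24%; both readings are consistent with the rough "within 30%" summary.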
To check whether the obtained results hold for more general settings, we also performed extensive
simulations on synthetic tasks, each composed of 20 superblocks, varying three parameters $\alpha$, $\beta$, $\gamma$. For
each task and each superblock $s_i$, we first generated the average computation time $avg_i$ from a uniform
distribution with constant mean and coefficient of variation $\alpha$, and the average cache stall time $stall_i$ from a
uniform distribution with mean $\beta$ and coefficient of variation $\alpha$. We then simulated 10 task runs by extracting
for each run and each superblock a computation time $e_i = avg_i(1 + \delta)$ and a number of cache misses equal
to $stall_i(1 + \delta)$, where $\delta$ is extracted from a uniform distribution with mean 0 and maximum value $\gamma$. Note
that this implies $wcet_i = avg_i(1 + \gamma)$, i.e. $\gamma$ is the increase in computation time between the average and
the worst case.
Figure 4.7: Synthetic Tasks, ADAPTIVE, $\alpha = 0.2$ (axes: % comp. time variation, % cache stall time, % improvement of BOUND over ADAPTIVE).
Figure 4.8: Synthetic Tasks, ADAPTIVE, $\alpha = 0.4$ (axes: comp. time variation, cache stall time, % p-gate open time).
Figure 4.9: Synthetic Tasks, algorithm ratio, $\alpha = 0.1$ (axes: % comp. time variation, % cache stall time, % improvement of BOUND over ADAPTIVE).
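The synthetic workload generation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical; `alpha`, `beta` and `gamma` stand for the coefficient of variation of the superblock parameters, the mean cache stall time, and the maximum relative deviation from the average; the mean superblock length of 1.0 is arbitrary; and the same random deviation is applied to the computation time and to the cache stall quantity of a superblock, which is one reading of the description.

```python
import math
import random
from typing import List, Tuple


def uniform_with_cv(rng: random.Random, mean: float, cv: float) -> float:
    """Draw from a uniform distribution with the given mean and coefficient of variation.

    A uniform distribution on [m - h, m + h] has standard deviation h / sqrt(3),
    so a coefficient of variation cv corresponds to h = sqrt(3) * cv * m.
    """
    half_width = math.sqrt(3.0) * cv * mean
    return rng.uniform(mean - half_width, mean + half_width)


def make_task(rng: random.Random, alpha: float, beta: float,
              n_blocks: int = 20, mean_avg: float = 1.0
              ) -> Tuple[List[float], List[float]]:
    """Per-superblock average computation times and average cache stall values."""
    avg = [uniform_with_cv(rng, mean_avg, alpha) for _ in range(n_blocks)]
    stall = [uniform_with_cv(rng, beta, alpha) for _ in range(n_blocks)]
    return avg, stall


def make_run(rng: random.Random, avg: List[float], stall: List[float],
             gamma: float) -> Tuple[List[float], List[float]]:
    """One simulated job: scale each superblock by (1 + delta), delta in [-gamma, gamma]."""
    e, misses = [], []
    for a, s in zip(avg, stall):
        delta = rng.uniform(-gamma, gamma)
        e.append(a * (1.0 + delta))
        misses.append(s * (1.0 + delta))
    return e, misses


def slack_only_fraction(avg: List[float], e: List[float], gamma: float) -> float:
    """Fraction of the budget left over when the gate is closed for the whole job."""
    wcet = [a * (1.0 + gamma) for a in avg]
    return sum(w - x for w, x in zip(wcet, e)) / sum(wcet)


rng = random.Random(0)
avg, stall = make_task(rng, alpha=0.2, beta=0.3)
e, misses = make_run(rng, avg, stall, gamma=0.2)
print(slack_only_fraction(avg, e, gamma=0.2))
```

Averaged over many runs, `slack_only_fraction` approaches `gamma / (1 + gamma)`, because the deviation has zero mean and the budget of each superblock is its average time scaled by `1 + gamma`.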
We focus on results for the ADAPTIVE and BOUND cases: Figures 4.6, 4.7 and 4.8 show the value of
ADAPTIVE for values of $\alpha$ equal to 0.1, 0.2 and 0.4 respectively. Figures 4.9, 4.10 and 4.11 show the competitive
ratio $\frac{\text{BOUND} - \text{ADAPTIVE}}{\text{ADAPTIVE}}$, again for $\alpha = 0.1, 0.2, 0.4$. All points are averages over 10 tasks (100 runs total
of the simulator). We varied the average cache stall time $\beta$ in $[0.05, 0.4]$ and the computation time
variation $\gamma$ in $[0.05, 0.3]$; axis direction is inverted between the two sets of figures for easier
visualization. Note that the definition of $\gamma$ implies SLACKONLY $= \frac{\gamma}{1 + \gamma}$ for all cases. First note that the obtained
results are very close for the different values of $\alpha$, which seems to indicate that neither of the tested algorithms
is very sensitive to variations in superblock size. The second main observation is that the performance of
the algorithms depends on the difference between $\beta$ and $\gamma$. For $\gamma > \beta$, both algorithms can open the p-gate
almost all the time because the high wcet variability forces us to over-provision the computation time budget
of the task; however, note that a coscheduling algorithm is still needed to guarantee safety, as there are times
where the p-gate must be closed to ensure that the task meets its deadline. In the case $\gamma < \beta$, which is
representative of more predictable, but memory intensive real-time tasks, the fraction of time the p-gate can
be opened decreases as the delay $D(s_i)$ becomes significant compared to $wcet_i - avg_i$. The performance
of ADAPTIVE degrades more rapidly than BOUND, but it remains within a competitive ratio of 18%, which
compares even more favorably than in the MPEG case.
Figure 4.10: Synthetic Tasks, algorithm ratio, $\alpha = 0.2$ (axes: % comp. time variation, % cache stall time, % improvement of BOUND over ADAPTIVE).
Figure 4.11: Synthetic Tasks, algorithm ratio, $\alpha = 0.4$ (axes: % comp. time variation, % cache stall time, % improvement of BOUND over ADAPTIVE).
4.4 Related Work
Apart from [36], to the best of our knowledge the only works that study the impact of I/O load on real-time
scheduling are [42] and [16]. [42] uses a PCI-based testbed similar to ours, but its empirical approach cannot
derive safe wcet bounds. [16] uses an analytical approach but it assumes highly predictable cycle-stealing
bus arbitration, which is not true of commodity systems.
Two other research areas are related to our work. First of all, peripheral activities impose an additional
overhead on the CPU: device driver execution. Techniques to account for such overhead have been described
in [23, 45] for network cards and hard disks based on experimentally-derived bounds. Second, there is
another potential source of interference at either the cache or the memory controller level: other CPUs. A
methodology to compute cache access delay in multiprocessor systems has been proposed in [41] based
on static analysis. However, we argue that this problem domain is essentially different from ours because
synchronizing task schedule across multiple processors is easier than synchronizing CPU and peripheral
execution.
4.5 Conclusions
The effects of the conflicts in accessing memory between I/O devices and processors cannot be ignored: our
experiments using typical settings for an embedded system reveal that interference at the memory controller
can increase the computation time of a task by almost half. In this research, we proposed a hardware and
software system to cope with this effect. The system controls peripheral activities using a peripheral
gate and a coscheduling algorithm that dynamically allows or disallows peripheral transmissions, making sure
that the task computation time does not exceed its WCET without traffic. Our experiments show that even
a simple adaptive coscheduling heuristic can greatly improve the amount of allowed traffic compared to the
baseline approach of blocking all peripherals while the task is executing. More complex predictive heuristics
can do even better, but our experiments revealed that the room for improvement is somewhat limited.
Chapter 5
Conclusion
In this research, we have investigated the issue of guaranteeing predictable task execution on current
computer architectures, with a focus on the issues involving cache, memory controller and Network-on-Chip.
These are supportive components that may inflict significant unpredictability on tasks’ execution time but
have not received adequate attention in previous research. Each of these components has its own
characteristics and affects task execution in different ways. For this reason, we have analyzed each component
separately and proposed suitable solutions that we believe to be both practical and sound in theory.
As discussed in Chapter 2, uncontrolled use of cache may cause a significant increase in tasks’ worst-case
execution time. In practice, however, this increase is unnecessary and can be avoided if cache sharing is
properly managed. Our proposed cache-partitioning optimization technique strives to eliminate inter-task
cache interference when possible. As a result, it significantly improves system utilization. Although
our technique is designed for single-core processors, it can also be used on multi-core processors whose
cores share a cache memory because the technique partitions cache on a per-task basis. However, on multi-core
processors that have no inter-core shared cache, a cache partition algorithm should be designed together
with a task allocation algorithm, since in these processors cache partition and task allocation are
inter-dependent. This problem is left open for future work.
The new resource sharing paradigm appearing in NoC has opened up a challenging scheduling problem.
In particular, the non-transitive dependence between transactions in NoC can render the existing scheduling
algorithms inefficient. By taking into account this new dependence pattern, we proposed a novel scheduling
framework that, under practical conditions, is able to guarantee transaction deadlines with very high NoC
utilization. The novelty of our solutions comes from our use of graph theory tools to take advantage of
the topological relationship between transactions. Although our current proposal only works on some
popular transaction topologies, I believe it has provided deep insights into the problem at hand. We recognize,
however, that more work needs to be done in order to make the framework applicable to all cases. The
general solution may be in the form of an approximation algorithm that borrows the insights from this
research. Another interesting future direction is to apply our framework to the scheduling of real-time tasks
on parallel-computing systems. A task running on these systems may have the same resource usage pattern
as a transaction on NoC, i.e. a task may use multiple processors simultaneously.
As with cache, main memory sharing can also be a source of unpredictability in task execution.
Accounting for this unpredictability by worst-case analysis alone results in a system that is significantly
underutilized. To tackle this issue, we proposed a software and hardware system and a co-scheduling technique
that regulate the peripheral traffic to the main memory such that tasks are guaranteed not to
execute longer than their run-alone WCET.
I believe that the research in this dissertation has addressed some of the key issues in guaranteeing
real-time task execution predictability and has made important contributions to real-time scheduling theory. With
the continuing advance in computer architecture technology, building safe and sound real-time systems will
continue to demand significant research efforts. I hope this dissertation will be considered an important
building block of these ongoing efforts.
Bibliography
[1] Aeronautical Radio Inc. ARINC 653 Specification. http://www.arinc.com/.
[2] T. W. Ainsworth and T. M. Pinkston. Characterizing the Cell EIB on-chip network. IEEE Micro,
27(5):6–14, 2007.
[3] Shobana Balakrishnan and Füsun Özgüner. A priority-driven flow control mechanism for real-time
traffic in multiprocessor networks. IEEE Trans. Parallel Distrib. Syst., 9(7):664–678, 1998.
[4] S. K. Baruah, N. K. Cohen, C. G. Plaxton, and D. A. Varvel. Proportionate progress: a notion of
fairness in resource allocation. In STOC ’93: Proceedings of the twenty-fifth annual ACM symposium
on Theory of computing, pages 345–354, New York, NY, USA, 1993. ACM.
[5] E. Bini and G. C. Buttazzo. Measuring the performance of schedulability tests. Real-Time Syst.,
30(1-2), 2005.
[6] B. D. Bui, M. Caccamo, L. Sha, and J. Martinez. Impact of cache partitioning on multi-tasking real
time embedded systems. In Proceedings of the 2008 14th IEEE International Conference on Embedded
and Real-Time Computing Systems and Applications, pages 101–110, Washington, DC, USA, 2008.
IEEE Computer Society.
[7] T. Chen, R. Raghavan, J. Dale, and E. Iwata. Cell Broadband Engine architecture and its first imple-
mentation: A performance view. IBM Research, 2005.
[8] Hyeonjoong Cho, Binoy Ravindran, and E. Douglas Jensen. An optimal real-time scheduling algorithm
for multiprocessors. Real-Time Systems Symposium, IEEE International, 0:101–110, 2006.
[9] S. Dropsho. Comparing caching techniques for multitasking real-time systems. Technical Report
UM-CS-1997-065, Amherst, MA, USA, 1997.
[10] J. Howard et al. A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS. In IEEE
International Solid-State Circuits Conference, 2010.
[11] FFMPEG project. libavcodec multimedia library. http://ffmpeg.mplayerhq.hu/.
[12] K. Goossens, J. Dielissen, and A. Radulescu. Aethereal network on chip: concepts, architectures, and
implementations. Design Test of Computers, IEEE, 22(5):414 – 421, 2005.
[13] S. Gopalakrishnan, L. Sha, and M. Caccamo. Hard real-time communication in bus-based networks. In
RTSS ’04: Proceedings of the 25th IEEE International Real-Time Systems Symposium, pages 405–414,
Washington, DC, USA, 2004. IEEE Computer Society.
[14] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. Mibench:
A free, commercially representative embedded benchmark suite. In Proceedings of the Workload
Characterization. WWC-4. IEEE International Workshop, 2001.
[15] P. Holman and J. H. Anderson. Adapting pfair scheduling for symmetric multiprocessors. J. Embedded
Comput., 1(4):543–564, 2005.
[16] Tay-Yi Huang, Jane W. S. Liu, and Jen-Yao Chung. Allowing cycle-stealing direct memory access I/O
concurrent with hard-real-time programs. In Int. Conf. on Parallel and Distributed Systems, Tokyo,
1996.
[17] Intel Corp. Intel 64 and IA-32 Architectures Software Developer’s Manual, February 2008.
[18] H. Kellerer, U. Pferschy, and D. Pisinger. Knapsack Problems. Springer, 2004.
[19] D. B. Kirk and J. K. Strosnider. SMART (strategic memory allocation for real-time) cache design using
the MIPS R3000. In Proceedings of the 11th IEEE Real-Time Systems Symposium, 1990.
[20] J. Kleinberg and E. Tardos. Algorithm Design. Addison-Wesley Longman Publishing Co., Inc., Boston,
MA, USA, 2005.
[21] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis and trans-
formation. In Proc. of the 2004 International Symposium on Code Generation and Optimization
(CGO’04), Palo Alto, California, March 2004.
[22] J. P. Lehoczky and L. Sha. Performance of real-time bus scheduling algorithms. SIGMETRICS Per-
form. Eval. Rev., 14(1):44–53, 1986.
[23] M. Lewandowski, M. Stanovich, T. Baker, K. Gopalan, and A. Wang. Modeling device driver effects
in real-time schedulability: Study of a network driver. In Proceedings of the 13th IEEE Real Time
Application Symposium, Apr 2007.
[24] J. Li and M. W. Mutka. Real-time virtual channel flow control. J. Parallel Distrib. Comput., 32(1):49–
65, 1996.
[25] J. Liedtke, H. Härtig, and M. Hohmuth. OS-controlled cache predictability for real-time systems. In
Proceedings of the 3rd IEEE Real-Time Technology and Applications Symposium, 1997.
[26] C. L. Liu and James W. Layland. Scheduling algorithms for multiprogramming in a hard-real-time
environment. J. ACM, 20(1):46–61, 1973.
[27] C. D. Locke, D. R. Vogel, L. Lucas, and J. B. Goodenough. Generic avionics software specification.
Technical Report CMU/SEI-90-TR-8, 1990.
[28] Z. Lu, A. Jantsch, and I. Sander. Feasibility analysis of messages for on-chip networks using worm-
hole routing. In ASP-DAC ’05: Proceedings of the 2005 Asia and South Pacific Design Automation
Conference, pages 960–964, New York, NY, USA, 2005. ACM.
[29] Marvell. Discovery II PowerPC System Controller MV64360 Specifications. http://www.
marvell.com/.
[30] F. Mueller. Compiler support for software-based cache partitioning. In Proceedings of the ACM
Workshop on Languages, Compilers, and Tools for Real-Time Systems, 1995.
[31] M. D. Natale and A. Meschi. Scheduling messages with earliest deadline techniques. Real-Time Syst.,
20(3):255–285, 2001.
[32] S. Oikawa and R. Rajkumar. Linux/RK: A portable resource kernel in Linux. In Proceedings of the 19th
IEEE Real-Time Systems Symposium, 1998.
[33] S. Oikawa and R. Rajkumar. Linux/RK: a portable resource kernel in linux. In Proceedings of the 19th
IEEE Real-Time System Symposium, Madrid, Spain, December 1998.
[34] PCI SIG. Conventional PCI 3.0, PCI-X 2.0 and PCI-E 2.0 Specifications. http://www.pcisig.
com.
[35] R. Pellizzoni, B. D. Bui, M. Caccamo, and L. Sha. Coscheduling of CPU and I/O transactions in COTS-based
embedded systems. In Proceedings of the 29th IEEE Real-Time Systems Symposium, 2008.
[36] R. Pellizzoni and M. Caccamo. Towards the predictable integration of real-time COTS based systems.
In Proc. of the 28th IEEE Real-Time System Symposium, Dec 2007.
[37] R. Rajkumar, C. Lee, J. P. Lehoczky, and D. P. Siewiorek. Practical solutions for QoS-based resource
allocation. In Proceedings of the 19th IEEE Real-Time Systems Symposium, 1998.
[38] H. Ramaprasad and F. Mueller. Bounding worst-case data cache behavior by analytically deriving
cache reference patterns. In Proceedings of the 11th IEEE Real Time on Embedded Technology and
Applications Symposium, 2005.
[39] H. Ramaprasad and F. Mueller. Bounding preemption delay within data cache reference patterns for
real-time tasks. In Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applica-
tions Symposium, 2006.
[40] IBM Research. Cell BE programming tutorial. IBM, 2007.
[41] J. Rosen, A. Andrei, P. Eles, and Z. Peng. Bus access optimization for predictable implementation of
real-time applications on multiprocessor systems-on-chip. In Proc. of the 28th IEEE Real-Time System
Symposium, December 2007.
[42] S. Schönberg. Impact of PCI-bus load on applications in a PC architecture. In Proceedings of the 24th
IEEE International Real-Time Systems Symposium, Cancun, Mexico, Dec 2003.
[43] Z. Shi and A. Burns. Priority assignment for real-time wormhole communication in on-chip networks.
In RTSS ’08: Proceedings of the 2008 Real-Time Systems Symposium, pages 421–430, Washington,
DC, USA, 2008. IEEE Computer Society.
[44] Z. Shi and A. Burns. Real-time communication analysis for on-chip networks with wormhole switch-
ing. In NOCS ’08: Proceedings of the Second ACM/IEEE International Symposium on Networks-on-
Chip, pages 161–170, Washington, DC, USA, 2008. IEEE Computer Society.
[45] M. Stanovich, T. Baker, and A. Wang. Throttling on-disk schedulers to meet soft-real-time require-
ments. In Proc. of the 14th IEEE RTAS, St. Louis, Missouri, April 2008.
[46] K. Tindell, A. Burns, and A.Wellings. Analysis of hard real-time communications. Real-Time Systems,
9:147–171, 1995.
[47] D. Wiklund and D. Liu. Design mapping, and simulations of a 3G WCDMA/FDD basestation using
Network-on-Chip. System-on-Chip for Real-Time Applications, International Workshop on, 0:252–
256, 2005.
[48] R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti, S. Thesing, D. Whalley, G. Bernat, C. Ferdinand,
R. Heckmann, T. Mitra, F. Mueller, I. Puaut, P. Puschner, J. Staschulat, and P. Stenström. The worst-case
execution time problem - overview of methods and survey of tools. Technical report, 2007.
[49] J. L. Wolf, H. S. Stone, and D. Thiébaut. Synthetic traces for trace-driven simulation of cache memories.
IEEE Trans. Comput., 41(4):388–410, 1992.
[50] A. Wolfe. Software-based cache partitioning for real-time applications. J. Comput. Softw. Eng.,
2(3):315–327, 1994.
[51] Z. Michalewicz and D. B. Fogel. How to Solve It: Modern Heuristics. Springer, December 2004.
[52] D. Zhu, D. Mossé, and R. Melhem. Multiple-resource periodic scheduling problem: How much fairness
is necessary. In 24th IEEE International Real-Time Systems Symposium, 2003.
