Instruction Cache Optimizations in Embedded Real-Time Systems by DING HUPING
INSTRUCTION CACHE OPTIMIZATIONS IN
EMBEDDED REAL-TIME SYSTEMS
DING HUPING
(B.Eng., Harbin Institute of Technology)
A THESIS SUBMITTED
FOR THE DEGREE OF DOCTOR OF
PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE




First of all, my gratitude goes to my Ph.D. advisor, Prof. Tulika Mitra. I
thank her for her persistent and generous guidance on my research. She is full
of wisdom, and I have benefited a lot from her insightful comments and advice.
I would also like to thank her for her patience and encouragement during my
study, especially in times of difficulty. She also offered me a research
assistant position in the last year of my study. Without her help, this thesis
would not have been possible.
I would like to thank my thesis committee members for their time and
valuable comments.
I would like to express my sincere gratitude to Prof. Wong Weng-Fai for his
guidance in the early stage of my Ph.D. study. He is generous and kind, and he
helped me a lot. I am also grateful to Dr. Liang Yun of Peking University for
our research collaborations. I collaborated with him on most of my research
work, and it has been a great pleasure to work with him.
I also thank my friends and lab mates, Sudipta Chattopadhyay, Wang Chundong,
Qi Dawei, Chen Jie, Chen Liang, Mihai Pricopi and Thannirmalai Somu
Muthukaruppan, for their help in the research work and the fun in daily life.
I also give my sincere gratitude to my girlfriend Fu Qinqin, the beautiful
and thoughtful girl, for being with me for over four years. She brought me
happiness during my Ph.D. study and encourages me to pursue my dreams. I thank
her for her patience and great love.
I also want to thank my parents and my little sister. They have always been
supportive of me in pursuing my dreams. I thank them for their support,
encouragement and great love.
The work presented in this thesis was partially supported by Singapore Min-






List of Publications
List of Tables
List of Figures
1 Introduction
1.1 Embedded Real-time Systems
1.2 Cache Modeling and Optimization
1.2.1 Cache in Uni-Processor
1.2.2 Shared Cache in Multi-core Processors
1.3 Research Aims
1.4 Thesis Contributions
1.5 Thesis Organization
2 Background
2.1 Cache
2.2 Cache Locking
2.3 Worst-case Execution Time Computation
2.3.1 Micro-architectural Modeling
2.3.2 Program Path Analysis
3 Literature Review
3.1 Cache Analysis in Uni-processor
3.1.1 Intra-task Cache Conflict Analysis
3.1.2 Inter-task Cache Interference Analysis
3.2 Cache Analysis in Multi-core
3.3 Cache Locking
3.3.1 Cache Locking for Single Task
3.3.2 Cache Locking in Multitasking
3.4 Memory Optimizations in Multi-core Processors
3.5 Other Optimizations for Worst-case Performance
3.5.1 Cache Partitioning
3.5.2 Code Layout Optimization
3.5.3 Scratchpad Memory
4 Partial Cache Locking for Single Task
4.1 Overview
4.2 Motivating Example
4.3 Cache Modeling
4.3.1 Cache States
4.4 Partial Cache Locking Algorithms
4.4.1 Optimal solution with concrete cache states
4.4.2 Heuristic with abstract cache states
4.5 Experimental Evaluation
4.5.1 Experimental Setup
4.5.2 Partial Cache Locking vs. Static Analysis
4.5.3 Partial versus Full Cache Locking
4.5.4 Impact of Different Associativity
4.5.5 Impact of Different Block Sizes
4.5.6 Optimal vs. Heuristic Approach
4.5.7 Percentage of Lines Locked
4.6 Discussion
4.7 Summary
5 Partial Cache Locking for Multitasking
5.1 Overview
5.2 Motivating Example
5.2.1 WCET Comparison of Various Locking Schemes
5.2.2 Scheduling Results of RMS
5.3 System Model
5.4 Framework Overview
5.5 WCET and CRPD Analysis
5.5.1 Intra-Task WCET
5.5.2 Inter-Task CRPD
5.6 Locking Algorithm for Multitasking
5.6.1 Cost-benefit analysis within a task
5.6.2 Cost-benefit analysis of other tasks
5.6.3 Memory block selection strategy
5.6.4 Integrated Locking + Analysis Algorithms
5.7 Experimental Evaluation
5.7.1 Experiments Setup
5.7.2 CPU Utilization Comparison
5.7.3 Response Time Speed-up
5.7.4 CPU Utilization Breakdown
5.7.5 Unlocked Cache Space
5.7.6 Runtime of Our Approach
5.8 Discussion
5.9 Summary
6 Dynamic Cache Locking
6.1 Overview
6.2 Motivating Example
6.3 Cache Modeling and Locking
6.3.1 Cache Modeling
6.3.2 Cache Locking Mechanism
6.4 Dynamic Cache Locking Algorithm
6.4.1 Framework Overview
6.4.2 WCET Analysis
6.4.3 Resilience Analysis
6.4.4 Locking Slot Analysis
6.4.5 Memory Block Selection
6.4.6 Complexity Analysis
6.5 Experimental Evaluation
6.5.1 Experimental Setup
6.5.2 Comparison with Static Approaches
6.5.3 Comparison with Region-based Approach
6.5.4 Runtime of Different Methods
6.6 Discussion
6.7 Summary
7 Cache Locking for Shared Cache Multi-core Processors
7.1 Overview
7.2 Motivating Example for Task Mapping
7.3 Task Model and System Architecture
7.4 Task Mapping Framework Overview
7.5 Components of the Task Mapping Framework
7.5.1 Intra-Task Cache Analysis
7.5.2 WCRT Estimation
7.5.3 ILP Formulation for Task Mapping
7.6 Cache Locking in Multi-core Processors
7.6.1 Locking Mechanisms
7.6.2 Locking Algorithm for Multi-core Processors
7.7 Experimental Evaluation
7.7.1 Experimental Setup
7.7.2 DEBIE Case Study
7.7.3 Synthetic Task Graphs
7.7.4 Impact of Different Number of Cores
7.7.5 L1 Block Size vs. L2 Block Size
7.8 Discussion
7.9 Summary
8 Conclusion
8.1 Thesis Contribution




Applications in embedded real-time systems are required to meet their timing
constraints. A deadline miss in a hard real-time system can have catastrophic
effects. Thus, the worst-case performance of an application plays an important
role in the schedulability of hard real-time systems. However, micro-
architectural features such as caches make worst-case timing analysis
intractable.
Caches are widely employed in modern embedded real-time systems. They
bridge the performance gap between the fast CPU and the slow off-chip memory.
However, they also introduce timing unpredictability, as it is not known
statically whether a memory block is in the cache or in the main memory.
Existing approaches to this unpredictability usually employ static cache
analysis or cache locking. Cache analysis statically models the cache
behavior. However, its results may be inaccurate because of conservative
estimation. Cache locking fills the entire cache with selected memory blocks,
locks them, and thus guarantees predictable timing. Nevertheless, such an
aggressive locking technique may have a negative impact on the execution time,
as the unlocked memory blocks cannot reside in the cache and exploit their
locality.
In this thesis, we propose a partial cache locking technique to optimize the
worst-case performance of embedded real-time systems. Partial cache locking
locks only a part of the cache space, while the rest of the cache remains free
and can be used by the unlocked memory blocks to exploit their cache locality.
Static cache analysis is therefore still required for the unlocked cache space,
while the locked cache contents are selected through accurate cost-benefit
analysis. By integrating static cache analysis and cache locking, our partial
cache locking approach achieves the best of both techniques.
We first explore cache optimization in uni-processors. We propose static
partial instruction cache locking for a single task to minimize the WCET
(Worst-case Execution Time), where intra-task cache conflicts are carefully
handled. An optimal approach based on concrete cache state analysis and a
time-efficient heuristic method based on abstract cache analysis are developed
to select the cache contents. Substantial improvement in WCET is achieved
compared to the state-of-the-art static cache analysis approach and the full
cache locking method. We extend our approach to multitasking real-time
systems, where both intra-task cache conflicts and inter-task interference are
considered. Our approach takes the global effects on all tasks into account
and selects the memory blocks that are most beneficial for improving the
schedulability/utilization. Subsequently, we explore dynamic cache locking for
a single task. We propose a loop-based dynamic partial cache locking approach
to minimize the WCET. Compared to static cache locking, our approach better
captures the dynamic program behavior. An ILP (Integer Linear Programming)
formulation with global optimization is developed to allocate the amount of
locked cache space for each loop, and the most beneficial memory blocks are
selected to fill this space.
Finally, we apply partial cache locking to multi-core processors with a
shared cache, where the inter-core cache interference from concurrently
executing tasks must also be carefully handled. Prior to cache locking, an
ILP-based task mapping approach is proposed to optimize the WCRT (Worst-case
Response Time) of multitasking applications. Based on the generated task
mapping, we lock memory blocks in the private L1 cache, which reduces not only
the number of L1 cache misses but also the number of accesses to the L2 cache.
Experimental evaluation shows that cache locking further improves the WCRT of
multitasking applications.
In summary, this thesis proposes and studies partial instruction cache
locking in the context of different architectures and system models in
embedded real-time systems. The worst-case performance of the applications is
greatly improved compared to the existing approaches.
List of Publications
• WCET-Centric Partial Instruction Cache Locking. Huping Ding, Yun
Liang and Tulika Mitra. In Proceedings of the 49th Annual Design
Automation Conference (DAC ’12), June 2012.
• Timing Analysis of Concurrent Programs Running on Shared Cache
Multi-cores. Yun Liang, Huping Ding, Tulika Mitra, Abhik Roychoudhury,
Yan Li, Vivy Suhendra. Real-Time Systems Journal, Volume 48, Issue
6, 2012.
• Shared Cache Aware Task Mapping for WCRT Minimization. Huping
Ding, Yun Liang and Tulika Mitra. In Proceedings of the 18th Asia and South
Pacific Design Automation Conference (ASP-DAC ’13), January 2013.
• Integrated Instruction Cache Analysis and Locking in Multitasking
Real-time Systems. Huping Ding, Yun Liang and Tulika Mitra. In Proceedings
of the 50th Annual Design Automation Conference (DAC ’13), June 2013.
• WCET-Centric Dynamic Instruction Cache Locking. Huping Ding, Yun
Liang and Tulika Mitra. In Proceedings of Design, Automation and Test
in Europe (DATE ’14), March 2014.
List of Tables
1.1 A Case study for ndes
4.1 Characteristic of benchmarks.
4.2 Analysis time of different algorithms.
4.3 Percentage of lines locked in cache (cache: 4-way set associative,
32-byte block).
5.1 Characteristics of task sets
5.2 Runtime of our approach
6.1 WCET analysis for the motivating example.
6.2 Memory block sets for N1 computation.
6.3 Cost-benefit analysis for N1 computation.
6.4 Characteristic of benchmarks
6.5 Runtime of different approaches
7.1 Code size of the tasks from DEBIE benchmark.
7.2 Code size of WCET benchmarks used as tasks in synthetic task graphs.
7.3 Runtime of our task mapping approach and the optimal (exhaustive
enumeration) task mapping approach.
List of Figures
1.1 An example of full cache locking.
1.2 An example of partial cache locking.
2.1 Memory hierarchy in a processor.
2.2 Cache architecture.
2.3 Worst-case Execution Time of a task.
2.4 Update function and join function for must analysis.
2.5 Update function and join function for may analysis.
2.6 Update function and join function for persistence analysis.
3.1 An example for inter-task cache interference and CRPD.
3.2 Scratchpad memory.
4.1 Advantage of partial cache locking over full cache locking and
cache modeling with no locking. The program consists of four
loops. The first loop contains two paths (P0 and P1) and the
other three loops contain only one path. The loop iteration
counts appear on the back edges.
4.2 Concrete cache states and abstract cache states.
4.3 Trampoline mechanism.
4.4 WCET improvement of partial cache locking (optimal and heuristic
solution) over static cache analysis with no locking (cache:
4-way set associative, 32-byte block).
4.5 WCET improvement of partial cache locking (optimal and heuristic
solution) over Falk et al.’s method (cache: 4-way set associative,
32-byte block).
4.6 WCET improvement of partial cache locking over static cache
analysis (no locking) for direct mapped cache, 32-byte block.
4.7 WCET improvement of partial cache locking over static cache
analysis (no locking) for 2-way set-associative cache, 32-byte
block.
4.8 WCET improvement of partial cache locking over Falk et al.’s
method (full locking) for direct mapped cache, 32-byte block.
4.9 WCET improvement of partial cache locking over Falk et al.’s
method (full locking) for 2-way set-associative cache, 32-byte
block.
4.10 WCET improvement of partial cache locking over static cache
analysis (no locking) for 2-way set-associative cache, 64-byte
block.
4.11 WCET improvement of partial cache locking over Falk et al.’s
method (full locking) for 2-way set-associative cache, 64-byte
block.
5.1 An example of PD-locking.
5.2 An example of ASRV-locking.
5.3 An example of our approach.
5.4 Motivating example.
5.5 WCET path of T1 and T2.
5.6 Framework for Locking + Analysis approach.
5.7 WCET and CRPD Analysis.
5.8 Utilization comparison of different approaches.
5.9 Response time speed-up.
5.10 Utilization breakdown for medium-2KB.
5.11 Percentage of unlocked cache lines with our approach.
6.1 An example of our loop-based dynamic cache locking approach.
6.2 Motivating example for dynamic cache locking.
6.3 Effect of different locking positions.
6.4 Framework of dynamic cache locking.
6.5 Complete ILP formulation.
6.6 ILP formulation for the motivating example.
6.7 Comparison between loop-based dynamic locking and static
approaches.
6.8 Comparison between loop-based and region-based dynamic locking.
7.1 Multi-core architecture with shared L2 cache.
7.2 Overall framework for cache locking in multi-core processors.
7.3 Motivating example.
7.4 Task Mapping Framework.
7.5 Illustration of the iterative WCRT analysis modeling shared cache.
7.6 Cache locking framework
7.7 Cache locking granularity
7.8 Task graph for DEBIE benchmark.
7.9 Synthetic task graphs with WCET benchmarks as tasks.
7.10 Improvement in WCRT due to task mapping and cache locking
for DEBIE benchmark.
7.11 Improvement in WCRT due to task mapping and cache locking
for synthetic task graphs (4-core).
7.12 Improvement in WCRT due to task mapping and cache locking




1.1 Embedded Real-time Systems
Embedded systems are ubiquitous nowadays, not only in avionics but also in
daily life: automobiles, washing machines, microwave ovens, mobile phones and
so on. In contrast to general-purpose computer systems, such as personal
computers, which serve varied needs (e.g., word processing, web browsing and
games), embedded systems are application-specific. An embedded system runs a
specific application and performs a dedicated function throughout its
lifetime. Thus, an important characteristic of embedded systems is that the
applications running on the processing engines are known in advance. This
creates many opportunities for optimization, as the optimization can target
specific applications. Generally, embedded systems can be customized or
optimized from both hardware and software perspectives to improve performance,
power consumption, cost, reliability and so on.
Apart from being application-specific, embedded systems are often subject to
real-time constraints, such as timing constraints. Under a timing constraint,
an embedded system is required not only to produce correct results but also to
respond within a bounded time, in order to guarantee quality of service (QoS)
or proper functioning. In other words, applications on embedded real-time
systems need to complete before their deadlines, whereas general-purpose
computer systems impose no such timing constraint. Real-time systems can be
classified into two types: soft real-time systems and hard real-time systems.
In soft real-time systems, the timing constraint is elastic. Missing a
deadline only results in a loss of QoS, not in system failure. Thus, a
deadline can be missed occasionally while the results remain acceptable. An
MP3 player is an example of a soft real-time system, where occasional frame
loss is tolerable. In hard real-time systems, deadlines are strict.
Applications are mission-critical and must never miss their deadlines. A
deadline miss in a hard real-time system leads to system failure and may have
disastrous consequences. Therefore, all applications must be successfully
scheduled in hard real-time systems. A well-known example of a hard real-time
system is the anti-lock braking system (ABS) in automobiles. The brakes must
be released within a time constraint to prevent the wheels from locking.
Otherwise, the automobile may skid, and an accident may follow.
Because of this critical timing constraint, significant research effort has
been invested in hard real-time systems to guarantee the schedulability of
tasks and the proper functioning of systems. A task is schedulable in a
real-time system when its worst-case response time (WCRT) does not exceed its
deadline, where the WCRT of a task is the maximum time elapsed from its
release to its completion. Detailed WCRT computation or schedulability
analysis depends on the scheduling policy, such as earliest deadline first
(EDF) [29] or rate monotonic scheduling (RMS) [71]. Nevertheless, regardless
of the scheduling policy, several basic timing factors must be taken into
account in WCRT computation or schedulability analysis, including the
worst-case execution time (WCET), the context switching cost and so on. WCET
is the maximum execution time of a task over all possible inputs on a specific
architecture when the task runs without interruption. Commercial tools (e.g.,
aiT [8]) as well as open-source tools (e.g., Chronos [59]) are available for
WCET analysis [109]. However, the WCET of a task is usually not equal to its
WCRT, as tasks interact and interfere with one another in multitasking
real-time systems. Therefore, besides the WCET, there are additional delays in
execution time, such as the context switching cost. These delays must also be
carefully considered to ensure safety in hard real-time systems.
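The schedulability condition (WCRT no larger than the deadline) can be made
concrete with the classical response-time recurrence for fixed-priority
preemptive scheduling, R = C + sum over higher-priority tasks j of
ceil(R / T_j) * (C_j + delta). The Python sketch below is illustrative only:
the task set, the function name response_time and the per-preemption cost
delta (a stand-in for context switching and similar delays) are assumptions,
not values or code from this thesis. Deadlines are assumed equal to periods,
and priorities follow rate monotonic order.

```python
def response_time(wcet, deadline, higher_priority, preemption_cost=0):
    """Fixed-point iteration of R = C + sum(ceil(R / T_j) * (C_j + cost)).

    higher_priority is a list of (wcet, period) pairs with integer values.
    Returns the worst-case response time, or None if it exceeds the deadline.
    """
    r = wcet
    while True:
        interference = sum(
            -(-r // period) * (c + preemption_cost)   # integer ceil(r / period)
            for c, period in higher_priority
        )
        r_next = wcet + interference
        if r_next > deadline:
            return None          # task is not schedulable
        if r_next == r:
            return r             # fixed point reached
        r = r_next

# Hypothetical task set: (WCET, period), shortest period first.
tasks = [(1, 4), (2, 6), (3, 12)]
for i, (c, t) in enumerate(tasks):
    print(i, response_time(c, t, tasks[:i]),
          response_time(c, t, tasks[:i], preemption_cost=1))
```

In this made-up task set the lowest-priority task is schedulable when only
WCETs are counted, but it misses its deadline once one extra time unit is
charged per preemption, which is why such delays cannot be ignored.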
To perform worst-case timing analysis for tasks in embedded real-time
systems, program path analysis is required, and the WCET is computed along the
longest path. Micro-architectural modeling is also required. Instruction
execution in the micro-architecture determines the basic timing effects, such
as the memory access latency and the execution latency in the functional
units. Modern processors in embedded real-time systems feature special
hardware components, such as caches and branch predictors. These components
significantly improve the average-case performance of the processors [50].
However, they also introduce timing unpredictability in real-time systems, due
to cache misses, control dependencies, data dependencies and so on [93]. For
instance, with a cache, it is not known statically whether a memory block is
in the cache or in the main memory, which makes the memory access latency
unpredictable. Therefore, worst-case timing analysis in hard real-time systems
requires careful modeling of these components.
1.2 Cache Modeling and Optimization
The memory system plays an important role in computer systems, as it greatly
influences performance. However, memory speed becomes a bottleneck because of
the performance gap between the fast CPU and the slow off-chip memory.
Supplying all data directly from the main memory would significantly degrade
performance, as the main memory is orders of magnitude slower than the CPU.
The cache comes to the rescue. It is a special on-chip memory located between
the fast CPU and the slow off-chip memory, and its speed is close to that of
the CPU. The cache holds copies of data from the main memory and provides a
fast memory access mechanism. In a processor with a cache, a memory access
first goes to the cache instead of the main memory. As most memory accesses
hit in the cache in the average case [24], the cache greatly speeds up program
execution and thus bridges the performance gap between the fast CPU and the
slow off-chip memory.
The instruction cache is widely employed in modern embedded real-time
systems. It stores copies of instructions and speeds up instruction fetch. The
instruction cache is accessed by the CPU almost every cycle, and it
significantly influences the average-case performance of the processor.
Moreover, the instruction cache consumes a large part of the processor's power
[19]. In embedded real-time systems, the instruction cache introduces timing
unpredictability [102], as mentioned earlier. Thus, it greatly affects the
worst-case performance [16, 49, 66]. In this thesis, we focus on the
optimization of the instruction cache. More specifically, we optimize the
instruction cache for worst-case performance in hard real-time systems. We
target not only the cache in uni-processors but also the shared cache in
multi-core processors.
1.2.1 Cache in Uni-Processor
In a uni-processor, at most one task is executing on the processor at any
point in time. Therefore, a task can use the cache exclusively during its
execution. However, it still suffers from both intra-task cache conflicts and
inter-task cache interference. For a task T, loading a memory block m1 ∈ T
into the cache may evict another memory block m2 ∈ T. Later accesses to the
evicted memory block m2 then result in cache misses because of this intra-task
cache conflict in T. In preemptive multitasking real-time systems, multiple
tasks are scheduled on the same processor, and task preemption causes
inter-task interference in the cache. When an active task T is preempted by a
higher-priority task T′, the cache contents of T may be replaced by T′. When
task T resumes execution, it needs to reload the memory blocks that were
evicted by T′ and will be reused later. Such inter-task interference therefore
adds a delay to the execution time (the reloading cost of memory blocks). This
delay is called the cache-related preemption delay (CRPD), and it must be
considered in the schedulability analysis. As a result of intra-task cache
conflicts and inter-task cache interference, the cache behavior is unknown,
leading to unpredictable timing in embedded real-time systems. Many approaches
have been proposed to deal with this timing unpredictability, including static
cache analysis and cache locking.
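Both effects can be reproduced with a small cache simulation. The sketch
below is illustrative only: it assumes an LRU replacement policy, memory
blocks identified by integer block numbers, and set index = block number
modulo the number of sets. The class LRUCache, the helper count_misses and
the access traces are hypothetical and do not come from this thesis.

```python
from collections import OrderedDict

class LRUCache:
    """Set-associative cache with LRU replacement, indexed by block number."""

    def __init__(self, num_sets, assoc):
        self.num_sets = num_sets
        self.assoc = assoc
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block):
        """Access one memory block; return True on a hit, False on a miss."""
        lines = self.sets[block % self.num_sets]
        if block in lines:
            lines.move_to_end(block)      # mark as most recently used
            return True
        if len(lines) == self.assoc:
            lines.popitem(last=False)     # evict the least recently used block
        lines[block] = None
        return False

def count_misses(cache, trace):
    return sum(not cache.access(block) for block in trace)

# Intra-task conflict: blocks 0 and 4 of the same task share set 0 of a
# direct-mapped cache, so they keep evicting each other.
conflict_misses = count_misses(LRUCache(num_sets=4, assoc=1), [0, 4] * 3)

# Inter-task interference: T warms the cache, T' preempts and touches
# blocks 4 and 5, then T resumes and must reload what T' evicted.
cache = LRUCache(num_sets=4, assoc=1)
loop_body = [0, 1, 2, 3]
count_misses(cache, loop_body)                   # first iteration (cold misses)
count_misses(cache, [4, 5])                      # preempting task T'
reload_misses = count_misses(cache, loop_body)   # misses caused by preemption

print(conflict_misses, reload_misses)            # prints "6 2"
```

The two reload misses, multiplied by the miss penalty, are the CRPD of this
particular preemption. Without the preemption, the second iteration of the
loop body would hit on every access.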
Static Cache Analysis. Static cache analysis statically analyzes the program
and models the cache in order to capture the cache behavior of the program.
It is commonly used to model intra-task cache conflicts and estimate the
WCET of a task [65, 101, 81]. Memory accesses are classified as cache hits or
cache misses based on the results of static analysis. The WCET of the task is
then estimated by integrating program path analysis with the hit/miss
classification. Static cache analysis is also employed to capture inter-task
cache interference in multitasking real-time systems [56, 103, 82, 54]. Static
cache analysis can accurately identify deterministic memory access patterns,
and thus it is widely adopted in real-time systems to bound the execution
time. However, the results of static analysis may be inaccurate when the
control flow of a program is complex. In such circumstances, many memory
accesses cannot be deterministically classified. Because of the safety-critical
nature of hard real-time systems, conservative estimation is usually adopted.
For example, when a memory access can be classified neither as a cache hit nor
as a cache miss, it is in most cases conservatively assumed to be a cache
miss. Because of such conservative classification, the timing may be
overestimated.
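The source of this overestimation can be seen in a small sketch of a "must"
analysis for a single LRU cache set, written in the usual abstract
interpretation style. It is a simplified illustration under stated
assumptions, not the analysis implemented in this thesis: an abstract state
maps each block that is guaranteed to be cached to an upper bound on its LRU
age, all blocks are assumed to map to the same set, and the names
must_update, must_join and classify are hypothetical.

```python
def must_update(state, block, assoc):
    """Abstract LRU update: return the must-state after accessing block."""
    old_age = state.get(block, assoc)   # absent blocks count as oldest
    new_state = {}
    for other, age in state.items():
        if other == block:
            continue
        if age < old_age:               # only younger blocks grow older
            age += 1
        if age < assoc:                 # age >= assoc: possibly evicted
            new_state[other] = age
    new_state[block] = 0                # accessed block is the youngest
    return new_state

def must_join(a, b):
    """Control-flow merge: keep blocks present on both paths, oldest age."""
    return {blk: max(a[blk], b[blk]) for blk in a.keys() & b.keys()}

def classify(state, block):
    """Always-hit if guaranteed cached, otherwise treated as a miss."""
    return "always-hit" if block in state else "not-classified"

ASSOC = 2
path_a = must_update(must_update({}, "a", ASSOC), "b", ASSOC)  # a then b
path_c = must_update(must_update({}, "a", ASSOC), "c", ASSOC)  # a then c
merged = must_join(path_a, path_c)

print(merged)                    # {'a': 1}
print(classify(merged, "a"))     # always-hit
print(classify(merged, "b"))     # not-classified
```

After the merge, an access to b is charged as a miss even though it would hit
whenever the first path was taken. The WCET bound stays safe, but it may be
larger than any execution time that can actually occur.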
Figure 1.1: An example of full cache locking (a 4-way set-associative cache;
the legend marks locked cache lines).
Cache Locking Cache locking is another approach to tackle the timing un-
predictability problem. Cache locking is a software controlled technique that
is employed in many commercial processors [6, 2, 1, 5, 7, 4]. Once a memory
block is locked into the cache, it cannot be evicted by the cache replacement
policies until it is unlocked. When the entire cache is locked, all accesses to the
locked memory blocks are cache hits, while accesses to the unlocked memory
blocks result in cache misses, as shown in Figure 1.1. In this case, the timing
is predictable, and no static analysis is required. Cache locking technique is
also used to improve the worst-case performance in embedded real-time sys-
tems [87, 15, 86, 23, 38, 72, 84, 14, 74]. Static full locking in instruction cache
is applied in [38, 72, 84], in order to improve the WCET for single task. The
memory blocks that significantly contribute to the WCET are selected, and the
entire cache is locked. However, when the cache size is small, full cache locking
may have negative impact on the overall WCET, as most of the memory blocks
cannot reside in the cache and need to be loaded from the main memory. Cache
locking is also employed in multitasking real-time systems [87, 23, 14]. As the
cache is used for locking and no free space is left in the cache, CRPD analysis is
completely eliminated, and the timing is predictable. In [87] and [23], the cache
is statically shared in space among tasks via cache locking, and the performance
is thus limited by the cache size. In contrast, in [14] the cache is dynamically
shared in a time-multiplexed style among tasks through cache locking. However,
cache re-locking is required at each preemption, and the re-locking cost may
greatly affect the timing of the tasks. Dynamic instruction cache locking is also
proposed to optimize WCET [15, 86, 74]. A program is partitioned into regions,
and each region has a corresponding locking state. However, region-based ap-
proaches are usually coarse-grained and may not accurately capture the dynamic
cache behavior of program. Meanwhile, all these approaches employ full cache
locking, which may lead to negative impact on the overall WCET, as we have
discussed.
1.2.2 Shared Cache in Multi-core Processors
Recently, both embedded systems and general-purpose computing systems have
made the irreversible transition toward multi-core processors due to thermal and
power constraints. The performance of an application can be greatly improved
by partitioning the computation among multiple tasks and executing them in
parallel on different cores. Multi-core systems, however, introduce additional
challenges for the WCET analysis. More concretely, the shared resources in the
multi-core architecture, such as the cache, suffer from interference among the
tasks concurrently executing on different cores. Therefore, the WCET of a task
cannot be determined in isolation; we have to take into account the interference
or conflicts for shared resources from the tasks simultaneously executing on
other cores.
Generally, in a multi-core processor with a shared cache, concurrently executing
tasks interfere with each other in the shared cache. That is, a memory block
in the shared cache may be evicted by the memory blocks of tasks simultane-
ously executing on other cores, which results in additional delay. Static cache
analysis technique is employed to model the shared cache behavior [112, 62,
47], where the inter-core cache interference in shared cache contributes a lot to
the timing of the tasks in embedded multi-core processors. Hardy et al. [47]
reduce the inter-core interference in the shared cache through bypassing static
single usage blocks from the shared caches via compile time analysis. In [96]
and [75], cache partitioning is employed in the shared cache to eliminate inter-
core cache interference. However, cache partitioning may limit the shared cache
performance, as each task can only use a portion of the shared cache.
1.3 Research Aims
As we have mentioned, state-of-the-art approaches dealing with the timing
unpredictability of caches usually employ static cache analysis or cache locking.
Static cache analysis analyzes the program and models the cache. However,
conservative estimation is usually applied when the cache behavior cannot
be deterministically classified. Thus, it may overestimate the execution time
and produce inaccurate results, especially when the control flow is complex.
On the other hand, existing cache locking approaches lock the entire cache. As
the cache is fully locked, static analysis is not required and the cache behavior
is predictable. However, such aggressive methods may have negative impact on
the overall timing, since all unlocked memory contents should be provided from
the main memory directly.
In this thesis, we aim to optimize the instruction cache in embedded real-
time systems, in order to improve the worst-case performance of applications
and guarantee the schedulability of hard real-time systems. We synergistically
combine static cache analysis and cache locking techniques and propose a partial
cache locking approach to achieve the best of these two methods. In our
study, we only lock a portion of the cache, while the free cache space is used
by the unlocked memory blocks to exploit their cache locality, as shown in
Figure 1.2. Therefore, static cache analysis is still required for the unlocked
cache space. Meanwhile, the locked cache contents are selected through accu-
rate cost-benefit analysis. Our fine-grained approach optimizes the worst-case





Figure 1.2: An example of partial cache locking.
We present an example to show the superiority of our partial cache locking,
compared to the state-of-the-art approaches. We take the program ndes from the
MRTC benchmark suite [46]. Its binary code size is 6,352 bytes. We assume
a uni-processor with only one level of instruction cache. The instruction cache
is 4-way set-associative with 32-byte block size. Its capacity is 2KB, and thus
there are altogether 64 lines in the cache. We set the cache hit latency to be 1
cycle, while the cache miss penalty is 30 cycles. We analyze the WCET of ndes
through three techniques, static cache analysis [101], full cache locking [38]
and our partial cache locking approach. The results are shown in Table 1.1.
As can be observed, full cache locking locks the entire cache, but it produces
the worst WCET. The cache size is 2KB, while the program size is more than
6KB. Thus, most of the instructions cannot reside in the cache with full locking,
and there is high access latency to these unlocked instructions, leading to long
execution time. Our partial cache locking technique only locks a part of the
cache, while the rest of the cache can still be used by the unlocked instructions.
We select the most beneficial memory blocks towards minimizing the WCET
to lock, based on static cache analysis. Thus, our technique outperforms both
static cache analysis and full cache locking.
Table 1.1: A case study for ndes
Methods                  WCET (cycles)   # of locked lines
Static cache analysis    227,749         -
Full cache locking       591,757         64
Partial cache locking    141,213         14
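The arithmetic behind this case study can be checked with a short sketch. The cache parameters and WCET figures are copied from the text and Table 1.1; the number of code blocks assumes the ndes binary starts at a block-aligned address, which the text does not state, and the helper name `reduction` is illustrative only.

```python
import math

# Parameters taken from the case study in the text.
CODE_SIZE_BYTES = 6352
BLOCK_SIZE_BYTES = 32
ASSOCIATIVITY = 4
CACHE_SIZE_BYTES = 2 * 1024

# WCET figures (cycles) copied from Table 1.1.
WCET = {
    "static cache analysis": 227_749,
    "full cache locking": 591_757,
    "partial cache locking": 141_213,
}

# Cache geometry: 2KB / 32B = 64 lines, arranged as 16 sets of 4 ways.
cache_lines = CACHE_SIZE_BYTES // BLOCK_SIZE_BYTES
cache_sets = cache_lines // ASSOCIATIVITY

# Assumption: the code is block-aligned, so it spans ceil(6352 / 32) blocks.
code_blocks = math.ceil(CODE_SIZE_BYTES / BLOCK_SIZE_BYTES)

# With full locking at most `cache_lines` blocks can be locked; every access
# to one of the remaining blocks pays the miss penalty.
blocks_left_unlocked = code_blocks - cache_lines


def reduction(baseline, improved):
    """Relative WCET reduction of `improved` with respect to `baseline`."""
    return 1.0 - improved / baseline


if __name__ == "__main__":
    print("cache lines:", cache_lines, "sets:", cache_sets)
    print("code blocks:", code_blocks, "never cached under full locking:",
          blocks_left_unlocked)
    partial = WCET["partial cache locking"]
    for name in ("static cache analysis", "full cache locking"):
        print(f"partial locking vs {name}: "
              f"{reduction(WCET[name], partial):.1%} lower WCET")
```

Under these assumptions, roughly two thirds of the code blocks can never be cached under full locking, which is consistent with the explanation given above for its high WCET.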
In this thesis, we perform cache locking in both uni-processors and multi-core
processors. We study static cache locking for a single task as well as for
multitasking in uni-processors. We also extend our approach to dynamic cache
locking for a single task. Finally, we consider cache optimizations in multi-core
processors with a shared cache.
1.4 Thesis Contributions
In this thesis, we perform post-compilation instruction cache optimizations via
partial cache locking in embedded real-time systems. We select the locked con-
tents based on a static analysis of the program binary executable. We make the
following contributions in this thesis.
• We propose a static partial cache locking approach to optimize the WCET
(Worst-case Execution Time) for single task in real-time systems. Lock-
ing a memory block in the cache has both locking benefit and locking
cost on the overall WCET of the task, as accesses to the locked mem-
ory block are cache hits while locking a memory block reduces the free
space in the cache. We judiciously select the locked contents through ac-
curate cache modeling that determines the impact of the decision on the
program WCET. An optimal approach based on concrete cache states as
well as a heuristic approach based on abstract cache states are proposed.
Meanwhile, worst-case path change is carefully considered. Experimental
results show that our approaches substantially improve the WCET com-
pared to both the static cache analysis approach and full cache locking.
• We extend static partial cache locking for single task to multitasking in
uni-processors, in order to improve the schedulability/utilization of real-
time systems. In our approach, each task statically locks a portion of
the cache, while there is still unlocked cache space that is shared by all
tasks in a time-multiplexed style. Locking a memory block in multitask-
ing real-time systems influences both WCET and CRPD (Cache-related
Preemption Delay), and has global effects on all the tasks. We develop an
accurate cost-benefit analysis that captures the overall locking effects, and
iteratively select the most beneficial memory block to lock. Evaluation
results indicate that our method outperforms state-of-the-art static cache
analysis and cache locking approaches in multitasking real-time systems.
• We also extend static partial cache locking to dynamic cache locking for
a single task. We propose a flexible loop-based dynamic cache locking
approach. We not only select the memory blocks to be locked but also
the locking points (e.g., loop level). We judiciously allow memory blocks
from the same loop to be locked at different program points with consid-
eration to global optimization of the WCET. We design a constraint-based
approach that incorporates a global view to decide on the number of lock-
ing slots at each loop entry point and then select the memory blocks to be
locked for each loop. Experimental evaluation with real-time benchmarks
shows that our dynamic cache locking approach achieves substantial im-
provement of WCET compared to prior techniques.
• We also perform partial cache locking in multi-core processors with shared
cache. Prior to cache locking optimization, a task mapping approach is
first proposed to improve the WCRT (Worst-case Response Time). We
demonstrate the importance of shared cache modeling in task mapping.
An ILP (Integer Linear Programming) formulation method is used to ob-
tain the task mapping solution. Our task mapping approach not only max-
imizes the workload balancing but also minimizes the inter-core interfer-
ence in shared cache. Partial cache locking approach is later employed
based on the task mapping technique to further improve the WCRT of
multitasking applications. Memory blocks are locked at the private L1
cache for each task, which not only reduces the number of L1 cache
misses, but also minimizes the number of L2 cache accesses. Experi-
mental evaluation with real-world application and synthetic task graphs
indicates that we achieve significant minimization on WCRT with both
task mapping and cache locking techniques.
1.5 Thesis Organization
In this chapter, we have introduced the motivation and contributions of our
study. The rest of the thesis is organized as follows. Chapter 2 lays out the
foundation of our research work in this thesis, including cache architecture,
cache locking technique, and WCET computation. Chapter 3 reviews the tech-
niques related to the cache optimizations for worst-case performance. Chapter 4
presents the static partial cache locking mechanism that attempts to improve the
WCET for a single task in real-time systems. Chapter 5 extends the static partial
cache locking work in Chapter 4 to multitasking real-time systems, in order to
improve the schedulability/utilization. Chapter 6 further extends static partial
cache locking to dynamic cache locking for the sake of improving the WCET
for single task in real-time systems. Chapter 7 presents the cache locking work
in multi-core processors with shared cache. Finally Chapter 8 summarizes the




In this chapter, we look into the details of the background for our study,
including cache memory, cache locking technique, and worst-case execution time
computation.
2.1 Cache
Cache is a special on-chip memory between the fast CPU and the slow off-
chip memory, as shown in Figure 2.1. It is usually implemented with SRAM
(Static Random Access Memory). SRAM is more expensive but much faster
than DRAM (Dynamic Random Access Memory), which is usually used to im-
plement the main memory. Cache stores the copies of frequently and recently
used data from the main memory, and its speed is close to that of the CPU. In a
processor with cache, a memory access will first resort to the cache, instead of
the main memory. If the data accessed is present in the cache, it is a cache hit,
which results in a low memory access latency. Otherwise, it is a cache miss, and
the corresponding memory access latency is high. Due to the temporal and spa-
tial locality of memory accesses, most of the memory accesses are serviced by
the cache. Temporal locality defines the characteristic that a referenced memory
location is likely to be reused in the near future; while spatial locality describes
a phenomenon that the nearby memory locations of a recently accessed memory
location will be referenced in the near future with high probability. So, with a
small high-speed cache, the price of memory hierarchy remains at the level of
main memory, while the speed of memory access is close to that of the cache.
Figure 2.1: Memory hierarchy in a processor.
Cache design involves a few parameters. The unit of data or instruction
transfer between the cache and main memory is called cache line (block). We
define cache line (block) size as L. A cache is divided into K sets. Given
a memory block with address addr, the block is mapped to exactly one cache
set (addr modulo K). In each set, there are A cache lines, which defines the
associativity of the cache. Then, the capacity of the cache is L×K ×A. When
A is equal to 1, the cache is called direct-mapped cache. Otherwise, it is called
set-associative cache. When K is equal to 1 for a set-associative cache, it is
called fully associative cache. The replacement policies of cache define the
cache content updating mechanisms, e.g., LRU (Least Recently Used) and FIFO
(First In First Out). For example, when a new memory block is brought into the
cache, the LRU replacement policy will evict the memory block that is least
recently used to make room for the new memory block.
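As a minimal sketch of the parameters L, K and A and of LRU replacement, the following Python model simulates a set-associative cache. The class and method names are hypothetical, and the model assumes that accesses are given as byte addresses, so that the block number (address divided by L) is what is taken modulo K.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy LRU cache with line size L, K sets and associativity A."""

    def __init__(self, line_size, num_sets, associativity):
        self.L = line_size
        self.K = num_sets
        self.A = associativity
        # One ordered dict per set: least recently used block first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    @property
    def capacity(self):
        # Capacity in bytes, L x K x A as in the text.
        return self.L * self.K * self.A

    def access(self, byte_address):
        """Return True on a cache hit, False on a cache miss."""
        block = byte_address // self.L      # memory block number
        cache_set = self.sets[block % self.K]
        if block in cache_set:
            cache_set.move_to_end(block)    # block becomes most recently used
            return True
        if len(cache_set) == self.A:
            cache_set.popitem(last=False)   # evict the least recently used block
        cache_set[block] = None
        return False


if __name__ == "__main__":
    # A = 1 gives a direct-mapped cache, K = 1 a fully associative one.
    cache = SetAssociativeCache(line_size=32, num_sets=16, associativity=4)
    print(cache.capacity)                   # 2048 bytes
    print([cache.access(a) for a in (0, 4, 32, 0)])
```

In the usage example, the access to address 4 hits because it falls into the same 32-byte block as address 0, which illustrates spatial locality; the final access to address 0 hits because of temporal locality.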
Figure 2.2 illustrates the cache architecture. For each cache line, there is
a valid bit to indicate the status of the datum. If the bit is not set, there is no
valid data in the cache line. The tag in the cache line represents the address
of the data from the main memory, while data from the corresponding address
is stored in the line. Memory address from the main memory is used to index
into the cache to check the data availability, and it is divided into three parts, as
shown in Figure 2.2. The index determines the cache set where the data may be
stored, while block offset represents the offset in the cache block. When there
is a memory access to the cache, tag comparison is performed in the cache set
indicated by index. If the tag matches and the data is valid, the memory access is
a cache hit. In this case, the data is fetched and provided to the CPU. Otherwise,
it is a cache miss, and data must be loaded from the next level of memory, thus
leading to higher memory access latency. The contents in the corresponding
cache set will also be updated with the cache replacement policy.
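The lookup just described can be sketched as follows. The function names are hypothetical, and the sketch assumes that both the line size and the number of sets are powers of two, so that the three address fields are plain bit fields.

```python
def split_address(address, line_size, num_sets):
    """Split a memory address into (tag, index, block offset)."""
    for value in (line_size, num_sets):
        if value <= 0 or value & (value - 1):
            raise ValueError("line_size and num_sets must be powers of two")
    offset_bits = line_size.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    offset = address & (line_size - 1)                 # position inside the block
    index = (address >> offset_bits) & (num_sets - 1)  # selects the cache set
    tag = address >> (offset_bits + index_bits)        # remaining upper bits
    return tag, index, offset


def is_hit(cache_set, tag):
    """A set is a list of (valid, tag) pairs; hit if a valid line matches."""
    return any(valid and line_tag == tag for valid, line_tag in cache_set)


if __name__ == "__main__":
    tag, index, offset = split_address(0x1234, line_size=32, num_sets=16)
    print(tag, index, offset)                          # 9 1 20
    print(is_hit([(True, 9), (False, 3)], tag))        # True
```

A line whose valid bit is not set never produces a hit, even if its stored tag happens to match.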















Figure 2.2: Cache architecture.
2.2 Cache Locking
Cache locking is a software controlled technique that selects and stores a sub-
set of the memory blocks in the cache. Modern embedded processors feature
cache locking technique to improve the performance or timing predictability.
Many commercial processors are equipped with a cache locking mechanism, e.g., Intel
XScale [6], ARM 940T [2], ARM 920T [1], IDT 79RC64574/RC64575 [5], Blackfin
5xx [7] and IBM PowerPC 440 [4]. Once a memory block is loaded and
locked in the cache, it cannot be evicted by the cache replacement policies until
it is unlocked. All accesses to the locked memory blocks are cache hits, while
all accesses to the unlocked memory blocks are cache misses when the entire
cache is used for locking. Usually, the locked memory contents are decided
statically before execution, and locking/unlocking routines are used to perform
the locking/unlocking operations [1, 2].
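The effect of locking on a single cache set can be illustrated with a small model. This is a sketch only: the class is hypothetical and does not correspond to the locking interface of any of the processors cited above. It assumes LRU replacement among the unlocked lines and treats an unlocked block as an ordinary most recently used line.

```python
from collections import OrderedDict


class LockableCacheSet:
    """One cache set in which some lines can be locked (toy model)."""

    def __init__(self, associativity):
        self.associativity = associativity
        self.locked = set()            # blocks that cannot be evicted
        self.unlocked = OrderedDict()  # LRU order, oldest block first

    def _free_lines(self):
        return self.associativity - len(self.locked)

    def lock(self, block):
        if block in self.locked:
            return
        if self._free_lines() == 0:
            raise ValueError("every line of this set is already locked")
        self.unlocked.pop(block, None)
        self.locked.add(block)
        # Locking a line takes space away from the unlocked blocks.
        while len(self.unlocked) > self._free_lines():
            self.unlocked.popitem(last=False)

    def unlock(self, block):
        # Assumption: an unlocked block stays cached as the newest line.
        if block in self.locked:
            self.locked.remove(block)
            self.unlocked[block] = None

    def access(self, block):
        """Return True on a hit, False on a miss."""
        if block in self.locked:
            return True                # locked blocks always hit
        if block in self.unlocked:
            self.unlocked.move_to_end(block)
            return True
        if self._free_lines() == 0:
            return False               # fully locked set: always a miss
        if len(self.unlocked) == self._free_lines():
            self.unlocked.popitem(last=False)
        self.unlocked[block] = None
        return False
```

When every line of the set is locked, each access to an unlocked block misses, which matches the full-locking behavior described above; when some lines are left free, unlocked blocks can still hit in them.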
There are two types of cache locking from the perspective of locking granularity:
way locking and line locking. With way locking, cache locking is performed
at the granularity of cache ways. When a cache way is locked, the line
belonging to that way is locked in every cache set. Way locking is employed
in [1] and [2], among others. In contrast, line locking allows a different number
of cache lines to be locked in different cache sets. Thus, compared to way
locking, line locking is more flexible and fine-grained. Line locking is used
in [6] and [5], among others.
Cache locking can also be classified into static cache locking and dynamic
cache locking. With static cache locking, memory blocks are locked at the be-
ginning of execution. The locked memory contents of a task remain unchanged
throughout the execution. Most of the cache locking approaches employ static
cache locking. With dynamic cache locking, the locked memory contents vary
during the execution of a task, in order to capture the dynamic program behav-
ior. Usually a program is partitioned into different regions in dynamic cache
locking. The locked contents are adjusted based on the change of the regions
during execution. To adjust the locked contents, cache locking routines are usu-
ally required at the reloading points of program in dynamic cache locking.
2.3 Worst-case Execution Time Computation
Worst-case execution time (WCET) bears significant importance in schedula-
bility analysis of real-time systems. It is one of the fundamental elements to
compute the worst-case response time (WCRT). The WCET of a task is the
maximum execution time of this task under a particular architecture across all
possible inputs, as shown in Figure 2.3. It indicates an upper bound of the exe-
cution time for a task. The longest feasible path in terms of execution time leads
to the WCET of a program. Thus, to obtain the actual WCET, testing all pos-
sible inputs and enumerating all possible paths under a particular architecture
are required. Obviously, such an approach is infeasible for most programs,
as the number of program paths may explode due to the existence of branches
and loops. In this circumstance, an estimated WCET is used to bound the actual
WCET of a task in real-time systems, and the gap between the actual WCET
and estimated WCET is known as the tightness of WCET estimation. To tighten
WCET estimation, many techniques are proposed, such as infeasible path de-
tection [98]. To obtain precise WCET estimation of a task, micro-architectural




Figure 2.3: Worst-case Execution Time of a task.
2.3.1 Micro-architectural Modeling
Micro-architectural modeling captures the timing effects of the underlying com-
ponents, including pipeline [55, 90, 61], cache [60, 101], branch predictor [30,
60], etc. In the architectural modeling, due to the interaction among differ-
ent underlying components, there is a counter-intuitive behavior called timing
anomaly [78, 107] in timing analysis. Timing anomaly indicates a phenomenon
that local WCET may not lead to global WCET. In other words, accumulation
of local WCET may result in underestimation of the global WCET of a task.
Thus, timing anomaly should be carefully handled in timing analysis.
Instruction cache modeling attracts a lot of attention in micro-architectural
modeling. One of the most well-known approaches for instruction cache modeling
is abstract interpretation [101]. This method is also used in the modeling
of multi-level caches [48] and shared caches [62]. In the abstract interpretation
approach, abstract cache states are defined at each program point to represent the
possible cache behavior. Three types of cache analysis are performed on the
abstract cache states: must analysis, may analysis and persistence analysis. As
the abstract interpretation approach for cache analysis bears crucial importance in
our study, we present the details of these three types of analysis.
We assume a set-associative cache with LRU cache replacement policy. The
associativity of the cache is A. As a memory block can be mapped to only
one cache set (see Section 2.1), different cache sets are independent and can
be modeled independently. Thus, we only describe the modeling technique for
one cache set, while the same modeling technique can be repeated for the other
cache sets. The abstract cache state is defined as follows, where M denotes the
set of memory blocks mapped to the cache set.
Definition 1 (Abstract Cache State) An abstract cache
state a is a vector $\langle a[0], \ldots, a[A-1] \rangle$ of length A where $a[i] \in 2^{M}$.
For a task T , the abstract cache state at each program point is obtained
through a fixed-point computation based on the control flow of the program.
The initial abstract cache state is set to be empty. Each time a memory block
is referenced, the abstract cache state is updated with an update function.
When several paths in the program merge at a program point, a join function is
employed to obtain the new abstract cache state.
The update function and join function for must analysis under LRU cache
replacement policy are shown in Figure 2.4. Updatemust(a,m) updates abstract
cache state a when there is an access to memory block m. Obviously, m will be
the youngest memory block in the new abstract cache state after accessing m.
However, the memory blocks that are younger than m will be aged by 1 when
m is in a, and all memory blocks in a will be aged by 1 after accessing m when
m is not in the cache. Joinmust(a1, a2) joins abstract cache states a1 and a2 to
generate the new abstract cache state a, and max(x, y) returns the maximum
number between x and y. A memory block m remains in the new abstract cache
state, only when it is present in both the abstract cache states a1 and a2 before
joining. Meanwhile, the maximal age in a1 and a2 is adopted as the new age of
m. At a given program point, must analysis captures the memory blocks that
are guaranteed to be in the cache. Thus, accesses to the memory blocks in the
abstract cache states of must analysis are cache hits.
$$
\mathit{Update}_{must}(a, m) =
\begin{cases}
a & \text{if } m \in a[0] \\
a' \text{ where } a'[0] = \{m\};\ a'[i] = a[i-1],\ 1 \le i \le k-1;\ a'[k] = a[k-1] \cup (a[k] \setminus \{m\});\ a'[i] = a[i],\ k+1 \le i < A & \text{if } \exists\, 1 \le k < A \text{ s.t. } m \in a[k] \\
a' \text{ where } a'[0] = \{m\};\ a'[i] = a[i-1],\ 1 \le i < A & \text{otherwise}
\end{cases}
$$
$$
\mathit{Join}_{must}(a_1, a_2) = a, \text{ where } a[i] = \{\, m \mid \exists\, 0 \le x < A \text{ and } 0 \le y < A,\ m \in a_1[x] \wedge m \in a_2[y] \wedge i = \max(x, y) \,\},\quad 0 \le i < A.
$$
Figure 2.4: Update function and join function for must analysis.
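The two functions of Figure 2.4 can be transcribed into Python as a sketch. An abstract cache state is represented here as a list of A sets indexed by age; this representation and the function names are illustrative choices, not taken from any analysis tool.

```python
def age_of(a, m):
    """Return the age of block m in abstract state a, or None if absent."""
    for age, blocks in enumerate(a):
        if m in blocks:
            return age
    return None


def update_must(a, m):
    """Must-analysis update for one cache set under LRU."""
    A = len(a)
    if m in a[0]:
        return [set(s) for s in a]           # m is already the youngest block
    k = age_of(a, m)
    new = [set() for _ in range(A)]
    new[0] = {m}
    if k is not None:
        for i in range(1, k):
            new[i] = set(a[i - 1])           # blocks younger than m age by 1
        new[k] = a[k - 1] | (a[k] - {m})
        for i in range(k + 1, A):
            new[i] = set(a[i])               # blocks older than m are unchanged
    else:
        for i in range(1, A):
            new[i] = set(a[i - 1])           # every block ages; the oldest leave
    return new


def join_must(a1, a2):
    """Keep only blocks present in both states, at their maximal age."""
    new = [set() for _ in range(len(a1))]
    for x, blocks in enumerate(a1):
        for m in blocks:
            y = age_of(a2, m)
            if y is not None:
                new[max(x, y)].add(m)
    return new


def always_hit(a, m):
    """An access to m is a guaranteed hit if m is in the must state."""
    return age_of(a, m) is not None
```

For a 2-way set, joining the states `[{"x"}, {"y"}]` and `[{"z"}, {"x"}]` keeps only `x`, at age 1, because `y` and `z` are not guaranteed to be cached on both incoming paths.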
We also present the update function and join function for may analysis under
LRU cache replacement policy, as shown in Figure 2.5. Updatemay(a,m) up-
dates abstract cache state a when there is an access to memory block m. m will
also be the youngest in the cache after accessing m. However, memory blocks
that are not older than m will be aged by 1 after accessing m when m is in a,
and the age of all memory blocks will be increased by 1 after accessing m if m
is not in the abstract cache state a. Joinmay(a1, a2) joins abstract cache states
a1 and a2 to generate the new abstract cache state a, and min(x, y) returns the
minimum number between x and y. Memory block m will be present in the
new abstract cache state when m appears in any of the abstract cache states a1
and a2. Meanwhile, the younger age is adopted as the new age of m when m
is present in both a1 and a2. At a given program point, may analysis captures
the memory blocks that may be in the cache. In other words, memory blocks
that are never in the cache will not be present in the abstract cache states of may




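$$
\mathit{Update}_{may}(a, m) =
\begin{cases}
a' \text{ where } a'[0] = \{m\};\ a'[i] = a[i-1],\ 1 \le i \le k;\ a'[k+1] = a[k+1] \cup (a[k] \setminus \{m\});\ a'[i] = a[i],\ k+2 \le i < A & \text{if } \exists\, 0 \le k < A \text{ s.t. } m \in a[k] \\
a' \text{ where } a'[0] = \{m\};\ a'[i] = a[i-1],\ 1 \le i < A & \text{otherwise}
\end{cases}
$$
$$
\begin{aligned}
\mathit{Join}_{may}(a_1, a_2) = a, \text{ where } a[i] = {} & \{\, m \mid \exists\, 0 \le x < A \text{ and } 0 \le y < A,\ m \in a_1[x] \wedge m \in a_2[y] \wedge i = \min(x, y) \,\} \\
& \cup\ \{\, m \mid m \in a_1[i] \wedge m \notin a_2[y],\ \forall\, 0 \le y < A \,\} \\
& \cup\ \{\, m \mid m \in a_2[i] \wedge m \notin a_1[x],\ \forall\, 0 \le x < A \,\},\quad 0 \le i < A.
\end{aligned}
$$
Figure 2.5: Update function and join function for may analysis.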
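A corresponding sketch for Figure 2.5 is given below, again with an abstract cache state represented as a list of A sets indexed by age. The representation and names are illustrative. When m sits in the oldest position (k = A - 1), the index k + 1 falls outside the cache, so the other blocks of that position are simply dropped.

```python
def age_of(a, m):
    """Return the age of block m in abstract state a, or None if absent."""
    for age, blocks in enumerate(a):
        if m in blocks:
            return age
    return None


def update_may(a, m):
    """May-analysis update for one cache set under LRU."""
    A = len(a)
    k = age_of(a, m)
    new = [set() for _ in range(A)]
    new[0] = {m}
    if k is not None:
        for i in range(1, k + 1):
            new[i] = set(a[i - 1])            # blocks younger than m age by 1
        if k + 1 < A:
            # Blocks that shared m's age are aged as well.
            new[k + 1] = a[k + 1] | (a[k] - {m})
        for i in range(k + 2, A):
            new[i] = set(a[i])                # older blocks are unchanged
    else:
        for i in range(1, A):
            new[i] = set(a[i - 1])            # every block ages by 1
    return new


def join_may(a1, a2):
    """Keep blocks present in either state, at their minimal age."""
    new = [set() for _ in range(len(a1))]
    for m in set().union(*a1, *a2):
        ages = [x for x in (age_of(a1, m), age_of(a2, m)) if x is not None]
        new[min(ages)].add(m)
    return new


def always_miss(a, m):
    """An access to m is a guaranteed miss if m is absent from the may state."""
    return age_of(a, m) is None
```

Unlike the must state, a may state can hold several blocks at the same age, since each age is only a lower bound on the real age of the block.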
The update function and join function of traditional persistence analysis in
[101] are illustrated in Figure 2.6. An additional virtual cache line (cache line
A in Figure 2.6) is introduced to hold the memory blocks that are evicted from
the cache. Persistence analysis updates the abstract cache states similarly to
that of must analysis. The main difference is that the memory blocks in cache
line A will not be aged. Meanwhile, when the age of m is 0 in a, the other
memory blocks in the same cache line will be aged by 1 after accessing m.
The join function of persistence analysis is similar to that of may analysis. The
difference is that the maximal age is adopted as the new age of m when m is
present in both a1 and a2. At a program point, persistence analysis determines
the memory blocks that may miss at the first access but will never be evicted
once loaded into the cache. A memory block is not persistent if it is present in
the virtual cache line.
Recently, both Cullmann [32] and Huynh et al. [51] identify a safety issue in
the traditional persistence analysis [101]. Memory accesses may be improperly
classified as persistent with the traditional persistence analysis, and the timing
may be underestimated. Cullmann [32] enhances the persistence analysis with
may analysis. Huynh et al. [51] propose a concept called younger set. The
younger set of a memory block m contains the memory blocks that may be
younger than m during the analysis. Thus, younger set is used to bound the




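$$
\mathit{Update}_{persist}(a, m) =
\begin{cases}
a' \text{ where } a'[0] = \{m\};\ a'[1] = (a[0] \setminus \{m\}) \cup a[1];\ a'[i] = a[i],\ 2 \le i \le A & \text{if } m \in a[0] \\
a' \text{ where } a'[0] = \{m\};\ a'[i] = a[i-1],\ 1 \le i \le k-1;\ a'[k] = a[k-1] \cup (a[k] \setminus \{m\});\ a'[i] = a[i],\ k+1 \le i \le A & \text{if } \exists\, 1 \le k < A \text{ s.t. } m \in a[k] \\
a' \text{ where } a'[0] = \{m\};\ a'[i] = a[i-1],\ 1 \le i < A;\ a'[A] = a[A-1] \cup (a[A] \setminus \{m\}) & \text{otherwise}
\end{cases}
$$
$$
\begin{aligned}
\mathit{Join}_{persist}(a_1, a_2) = a, \text{ where } a[i] = {} & \{\, m \mid \exists\, 0 \le x \le A \text{ and } 0 \le y \le A,\ m \in a_1[x] \wedge m \in a_2[y] \wedge i = \max(x, y) \,\} \\
& \cup\ \{\, m \mid m \in a_1[i] \wedge m \notin a_2[y],\ \forall\, 0 \le y \le A \,\} \\
& \cup\ \{\, m \mid m \in a_2[i] \wedge m \notin a_1[x],\ \forall\, 0 \le x \le A \,\},\quad 0 \le i \le A.
\end{aligned}
$$
Figure 2.6: Update function and join function for persistence analysis.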
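The functions of Figure 2.6 can be sketched in the same style. The state is a list of A + 1 sets, where the last entry plays the role of the virtual cache line A. The code only mirrors the traditional formulation as written in the figure; it does not include the corrections of [32] or [51] mentioned above, and all names are illustrative.

```python
def age_of(a, m):
    """Return the age of block m in abstract state a, or None if absent."""
    for age, blocks in enumerate(a):
        if m in blocks:
            return age
    return None


def update_persist(a, m):
    """Traditional persistence update; a has A + 1 entries."""
    A = len(a) - 1                            # index A is the virtual line
    new = [set() for _ in range(A + 1)]
    new[0] = {m}
    if m in a[0]:
        new[1] = (a[0] - {m}) | a[1]          # blocks sharing age 0 are aged
        for i in range(2, A + 1):
            new[i] = set(a[i])
        return new
    k = next((i for i in range(1, A) if m in a[i]), None)
    if k is not None:
        for i in range(1, k):
            new[i] = set(a[i - 1])
        new[k] = a[k - 1] | (a[k] - {m})
        for i in range(k + 1, A + 1):
            new[i] = set(a[i])                # includes the virtual line
    else:
        for i in range(1, A):
            new[i] = set(a[i - 1])
        # Evicted blocks are collected in the virtual line and never age.
        new[A] = a[A - 1] | (a[A] - {m})
    return new


def join_persist(a1, a2):
    """Keep blocks present in either state, at their maximal age."""
    new = [set() for _ in range(len(a1))]
    for m in set().union(*a1, *a2):
        ages = [x for x in (age_of(a1, m), age_of(a2, m)) if x is not None]
        new[max(ages)].add(m)
    return new


def is_persistent(a, m):
    """m is persistent if it is tracked and not in the virtual line."""
    age = age_of(a, m)
    return age is not None and age < len(a) - 1
```

With A = 2, accessing three distinct blocks in sequence pushes the first one into the virtual line, so it is no longer classified as persistent.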
2.3.2 Program Path Analysis
There are generally three approaches to program path analysis for computing
the WCET of a task: the tree-based method, the path-based method and the implicit
path enumeration approach. The tree-based method is also known as timing schema [92,
83]. It associates each node with the corresponding estimated time which is
derived from the timing rules on the statements of the program. A bottom-up
search on the syntax tree of the program is used to calculate the timing. Path-
based method explicitly searches for the path with the longest execution time, in
order to obtain the WCET [49]. Due to the explicit path enumeration, additional
information can be integrated during analysis, such as infeasible path information.
Thus, it usually produces precise results. However, the path-based method
suffers from scalability issues. To reduce the complexity of the path-based method,
path searching on acyclic fragments (e.g., loop bodies) is employed [94, 98]. The
implicit path enumeration approach is implemented with an integer linear programming
(ILP) formulation. Control flows of the program are represented by linear
constraints and equations in ILP formulation [63]. Other constraints, such as
loop bounds and infeasible path information can be also included in the for-
mulation to facilitate or to improve the precision of WCET estimation. The
objective of the ILP formulation is to maximize the overall WCET, where the
execution time and execution frequency of each basic block are included. This
ILP problem can be solved with an ILP solver, such as IBM CPLEX [3]. The
solution of ILP problem captures the quantitative value of overall WCET as well
as the execution frequency of basic blocks and control flow edges. As the so-
lution does not explicitly identify the program path in the worst-case scenario,
this ILP-based method is called implicit path enumeration approach. ILP-based
implicit path enumeration method is employed in many existing WCET analysis
tools, such as Chronos [59] and aiT [8].
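The idea of implicit path enumeration can be illustrated on a toy example. Everything in the sketch below is hypothetical: the six-block control flow graph, the block costs and the loop bound are invented for illustration, and exhaustive enumeration of edge counts stands in for the ILP solver that a real tool would use. The unknowns are the execution counts of the edges; flow conservation and a loop bound act as the linear constraints, and the objective is the sum of block cost times block execution count.

```python
from itertools import product

# Hypothetical CFG: b0 -> b1 (loop header) -> b2 | b3 -> b4 -> b1 (back edge)
# or b4 -> b5 (exit). COST[i] is the assumed worst-case cost of block i.
COST = {0: 5, 1: 2, 2: 20, 3: 8, 4: 3, 5: 1}
EDGES = [(0, 1), (1, 2), (1, 3), (2, 4), (3, 4), (4, 1), (4, 5)]
ENTRY, EXIT = 0, 5
MAX_BACK_EDGES = 3               # assumed loop bound per loop entry
MAX_COUNT = MAX_BACK_EDGES + 1   # no edge can execute more often than this


def block_counts(edge_count):
    """Return block execution counts if flow is conserved, else None."""
    counts = {}
    for b in COST:
        inflow = sum(c for (_, dst), c in edge_count.items() if dst == b)
        outflow = sum(c for (src, _), c in edge_count.items() if src == b)
        if b == ENTRY:
            if outflow != 1:     # the entry block executes exactly once
                return None
            counts[b] = 1
        elif b == EXIT:
            counts[b] = inflow
        else:
            if inflow != outflow:
                return None
            counts[b] = inflow
    return counts


def wcet_by_enumeration():
    """Maximize sum(COST[b] * n_b) over all feasible edge counts."""
    best_value, best_counts = -1, None
    for values in product(range(MAX_COUNT + 1), repeat=len(EDGES)):
        edge_count = dict(zip(EDGES, values))
        # Loop bound: back edge taken at most MAX_BACK_EDGES times per entry.
        if edge_count[(4, 1)] > MAX_BACK_EDGES * edge_count[(0, 1)]:
            continue
        counts = block_counts(edge_count)
        if counts is None:
            continue
        value = sum(COST[b] * n for b, n in counts.items())
        if value > best_value:
            best_value, best_counts = value, counts
    return best_value, best_counts


if __name__ == "__main__":
    print(wcet_by_enumeration())
```

The result gives only the execution counts of the blocks, not an explicit path, which is why the approach is called implicit. For this toy graph the maximum is reached when the loop body runs four times through the more expensive branch b2.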
We present the detailed ILP formulation for the implicit path enumeration
approach. Suppose there is a task T with N basic blocks {b0, b1, ... , bN−1 }.
We use B to represent the set of basic blocks in task T . Then, in implicit path
enumeration approach, the WCET of the task is expressed as follows.
$$\sum_{b_i \in B} C_i \times n_i \tag{2.1}$$
where Ci is the worst-case execution time of basic block bi and ni represents its
corresponding execution count. Ci can be obtained through micro-architectural
modeling and program analysis, and thus it is constant. Therefore, we focus on
the constraints on ni. Obviously, the execution count of the entry basic block of
T is 1. Suppose b0 is the entry basic block, then we have
n0 = 1 (2.2)
On the other hand, for a basic block bi ∈ B, the execution count summation for
all its incoming edges must be equal to the execution count of the basic block:
$$\sum_{e_{x,i} \in IN_i} d_{x,i} = n_i$$
where INi is the set of incoming edges of the basic block bi, ex,i represents the
incoming edge from basic block bx ∈ B, and dx,i is the execution count of the
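corresponding edge. Similarly, the execution count summation for all outgoing
edges must be equal to the execution count of the basic block:
$$\sum_{e_{i,y} \in OUT_i} d_{i,y} = n_i$$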
where OUTi is the set of outgoing edges of the basic block bi, ei,y represents
the outgoing edge to basic block by ∈ B, and di,y is the execution count of
the corresponding edge. The objective function in the ILP formulation of the









In this chapter, we present an overview of the existing research works on mem-
ory optimization in embedded real-time systems. We first show the related re-
search works on static cache analysis in uni-processors. Then, the cache analy-
sis and optimizations in multi-core processors with shared cache are presented.
Later, cache locking techniques in embedded real-time systems are presented.
At last, we review other memory optimization approaches that improve the
worst-case performance.
3.1 Cache Analysis in Uni-processor
We introduce the existing cache analysis approaches that target both intra-task
cache conflict and inter-task cache interference in uni-processors.
3.1.1 Intra-task Cache Conflict Analysis
Cache makes the worst-case timing analysis in real-time systems challenging,
as the timing is unpredictable due to the cache. Conservatively assuming that all
memory accesses are cache misses will significantly overestimate the timing in
real-time systems. Ferdinand et al. [41] and Theiling et al. [101] perform cache
analysis via abstract interpretation approach [31]. Abstract cache states are de-
fined at each program point to represent the possible cache behavior, and virtual
inlining and virtual unrolling (VIVU) technique [79] is also utilized. The details
of abstract interpretation approach are presented in Section 2.3.1 of Chapter 2.
Based on the resulting abstract cache states at each program point, memory ac-
cesses are classified into always hit, always miss, persistent and non-classified.
Memory access classification is integrated with program path analysis to es-
timate the WCET. Hardy and Puaut [48] extend the analysis to non-inclusive
multi-level instruction cache with abstract interpretation. The memory access
classification at a particular cache level l is used as the input for the analysis in
the next level of cache l+1. Based on the memory access classification at cache
level l, the memory references are categorized into three types: never (accesses
that are never performed at cache level l + 1), always (accesses that are always
performed at cache level l + 1), and uncertain (accesses that cannot be guaranteed
to be either never or always). For uncertain memory references, both the cases
of accessing and not accessing cache level l + 1 should be considered in the
analysis. Ballabriga
and Casse [17] propose multi-level persistence analysis. In their work, persis-
tence analysis is performed for each loop. Compared to the global persistence
analysis in [101], their persistence analysis captures the local program behavior
and produces more accurate results. Cullmann [32] identifies a problem that
may underestimate the timing in the traditional persistence analysis [101], as
we have mentioned in the previous chapter. In the traditional persistence anal-
ysis, memory accesses may be improperly classified as persistent. The author
employs may analysis to enhance the persistence analysis, in order to guarantee
safe timing estimation. Mueller [81] proposes static cache simulation that inte-
grates abstract cache states analysis and data flow analysis for precise memory
access classification. Li et al. [64] present an effective approach to model the
direct-mapped instruction cache. A cache conflict graph is constructed to cap-
ture the program behavior in the cache. Based on the cache conflict graph, linear
constraints are derived, which are used in the ILP formulation of the implicit
path enumeration approach. They extend the analysis of direct-mapped instruc-
tion cache to set-associative instruction cache, data caches and unified caches in
[65]. Thomas and Stenström [77] adopt a symbolic execution method to perform
the cache analysis.
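The propagation of classifications from cache level l to level l + 1 described above can be illustrated with a small sketch. It assumes a rule in which an access that always hits at level l never reaches level l + 1, an access that always misses at level l (and is itself always performed there) always reaches level l + 1, and everything else is uncertain. The function name, the string labels and the treatment of persistent and non-classified accesses as uncertain are assumptions of this sketch, not details given in the text.

```python
# Sketch only: labels and function name are illustrative assumptions.
ALWAYS, NEVER, UNCERTAIN = "always", "never", "uncertain"

def next_level_access(chmc, access=ALWAYS):
    """Classify whether a reference is performed at cache level l + 1.

    chmc   -- hit/miss classification at level l: "AH" (always hit),
              "AM" (always miss), "PS" (persistent) or "NC" (non-classified)
    access -- whether the reference is performed at level l itself
    """
    if access == NEVER or chmc == "AH":
        return NEVER        # filtered out before reaching level l + 1
    if access == ALWAYS and chmc == "AM":
        return ALWAYS       # every execution goes on to level l + 1
    return UNCERTAIN        # both cases must be analyzed (assumed for PS/NC)

# Hypothetical level-1 classifications for four references.
l1 = {"r1": "AH", "r2": "AM", "r3": "NC", "r4": "PS"}
l2_access = {ref: next_level_access(c) for ref, c in l1.items()}
```

Here only r2 is guaranteed to access the next level, r1 is guaranteed not to, and r3 and r4 must be analyzed under both possibilities.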
Compared to the instruction cache, the analysis of data cache is much more
complicated, as data reference address analysis is required. An instruction may
access different data locations under different contexts. White et al. [108]
calculate virtual addresses of data references. With the virtual addresses, data
references are categorized via a static cache simulator. The abstract interpretation
approach [31] is also used to model the data cache [42, 91, 57]. Ferdinand
and Wilhelm [42] employ persistence analysis while Sen and Srikant [91] adopt
must analysis. Lesage et al. [57] extend the work in [48] to multi-level set-
associative data caches with abstract interpretation. However, abstract cache
state analysis approaches for data caches usually suffer from high overestimation.
There are also data cache modeling approaches based on access pattern analy-
sis. Ghosh et al. [44] propose the cache miss equation (CME) framework to analyze
cache behavior. They adopt the concept of reuse vector [110] and generate the
CMEs, i.e., a set of Diophantine equations. These CMEs are used to perform the
cache hit/miss classification. Chatterjee et al. [25] analyze the cache behavior
of nested loops using the Presburger arithmetic formalism. However, the com-
putational complexity of the modeling is super-exponential in their approach.
More recently, Huynh et al. [51] propose a new approach for data cache analy-
sis with persistence analysis. It is a combination of abstract interpretation-based
approach and access pattern-based method. The concept of temporal scope is
central to their approach. For a data memory block m accessed by an
instruction in a loop lp, the temporal scope defines the closed loop iteration interval
[lw, up] in which the memory block could be accessed. Two memory blocks mapped
to the same cache set in loop lp will not conflict with each other if they have
disjoint temporal scopes. Multi-level persistence analysis based on the temporal
scopes of the memory references is performed to obtain the classification of
memory accesses to the data cache.
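The disjointness test on temporal scopes can be sketched as a simple interval check. The `Scope` type and the function name are hypothetical; the sketch only captures the rule stated above (same cache set and overlapping closed intervals [lw, up]), not the full analysis of [51].

```python
from typing import NamedTuple

class Scope(NamedTuple):
    lw: int   # first loop iteration in which the block may be accessed
    up: int   # last loop iteration in which the block may be accessed

def may_conflict(set_a, scope_a, set_b, scope_b):
    """Two blocks can conflict only if they map to the same cache set
    and their closed iteration intervals [lw, up] overlap."""
    same_set = set_a == set_b
    overlap = scope_a.lw <= scope_b.up and scope_b.lw <= scope_a.up
    return same_set and overlap

# Blocks in cache set 3 touched in iterations 0..9 and 10..19 never meet.
disjoint = may_conflict(3, Scope(0, 9), 3, Scope(10, 19))
# Sharing iteration 10 is enough for a potential conflict.
touching = may_conflict(3, Scope(0, 10), 3, Scope(10, 19))
```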
3.1.2 Inter-task Cache Interference Analysis
In multitasking real-time systems, multiple tasks are scheduled on the same
processor. As we have mentioned in Chapter 1, there is inter-task cache inter-
ference when preemption happens, and additional delay called CRPD (Cache-
related Preemption Delay) is incurred. CRPD of a task is the reloading cost of
the useful memory blocks that are evicted by the preempting task. As CRPD is
important in the schedulability analysis of the tasks, the inter-task interference
should be carefully modeled in real-time systems.
We present an example to show the inter-task cache interference and CRPD,
as illustrated in Figure 3.1. Suppose we have two tasks T and T′, where T′ has higher
priority than T. Figure 3.1(a) presents the control flows of these two tasks. All
the memory blocks m1, m2, m′1 and m′2 are mapped to the same cache set. The
numbers on the loop back edges are the corresponding loop bounds. We assume
the cache is 2-way set-associative. Thus, if there is no interference from the
other task, T will only have two cold misses in its first iteration, and all memory
accesses in the remaining iterations are cache hits. Figure 3.1(b) shows the scheduling
of the tasks T and T′. Task T starts execution first. We assume T executes for
5 iterations in its loop, and then T′ is ready. As T′ has higher priority, T′ will
preempt T, and the cache state at the preemption point is shown in the figure
(m1 and m2 are present in the cache). During the execution of T′, T′ loads its
own memory blocks into the cache. Thus, after T′ finishes execution, m1 and
m2 are replaced with m′1 and m′2. Then, T resumes execution. T
needs to reuse m1 and m2, while m′1 and m′2 are present in the cache due to the
interference from T′. In this case, T will reload the memory blocks m1 and
m2.
Figure 3.1: An example for inter-task cache interference and CRPD.
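The scenario of Figure 3.1 can be replayed with a small LRU simulation of the single 2-way cache set. This is only a sketch: the loop bound of T (10 iterations) is an assumed value, T′ is assumed to touch each of its two blocks at least once, and the class and function names are hypothetical.

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with LRU replacement."""
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()          # least recently used entry first

    def access(self, block):
        """Return True on a hit, False on a miss."""
        if block in self.lines:
            self.lines.move_to_end(block)
            return True
        if len(self.lines) == self.ways:
            self.lines.popitem(last=False)  # evict the least recently used block
        self.lines[block] = None
        return False

def misses_of_T(preempt_after, iterations=10):
    """Count the misses of task T. T' preempts T after `preempt_after`
    iterations (None means no preemption). The bound 10 is assumed."""
    cache, misses = LRUSet(ways=2), 0
    for i in range(iterations):
        if i == preempt_after:
            for block in ("m1'", "m2'"):    # T' loads its own blocks
                cache.access(block)
        for block in ("m1", "m2"):
            misses += not cache.access(block)
    return misses

extra_misses = misses_of_T(preempt_after=5) - misses_of_T(preempt_after=None)
```

Without preemption T incurs only its two cold misses; with the preemption it incurs two additional misses to reload m1 and m2, and the reloading cost of these two blocks is the CRPD in this example.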
Lee et al. [56] propose the concept of UCB (Useful Cache Blocks) for the pre-
empted task. The UCBs at a program point are the memory blocks that may be
cached at this point and may be reused after this point without being evicted.
The CRPD is thus bounded by the maximum number of UCBs at a program
point. Altmeyer and Maiza Burguière [10] enhance CRPD computation via a re-
definition of UCB. In their method, a UCB at a program point cannot be a cache
miss in the WCET analysis. That is, the UCB must always be in the cache
from this program point to the possible reuse point in the program. Tomiyama
and Dutt [103] bound the CRPD by analyzing the preempting task. They use the
memory blocks accessed in the preempting task to bound the CRPD imposed on
the preempted task, and program path information is used to prevent pessimistic
results. The memory blocks used by the preempting task are known as ECBs
(Evicting Cache Blocks). Following this work, most of the CRPD estimation
approaches consider the effect of both preempted and preempting tasks (UCB
and ECB) [82, 100, 95, 52, 54]. Among these approaches, Negi et al.'s method
[82] adopts concrete cache states analysis. At each program point, they com-
pute the reaching cache states (RCS) and live cache states (LCS), which leads
to a fine-grained analysis of UCB and ECB. Thus, their approach produces ac-
curate CRPD estimates. However, their analysis has high time complexity and
is restricted to direct-mapped cache. Staschulat and Ernst [95] propose a more
scalable CRPD analysis approach compared to the method in [82], while the
precision is retained. In their work, the number of cache states is bounded by
merging similar cache states. Altmeyer et al. [9] also tighten the CRPD for
set-associative caches by resilience analysis. The resilience of a UCB defines
the maximum number of allowed memory accesses from the preempting task
before it can be evicted. A UCB contributes to the CRPD only when the number
of ECBs exceeds its resilience. More recently, Kleinsorge et al. [54]
synergistically combine the methods in [82] and [9], and they achieve the best
of these two approaches.
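A simple way to combine UCB and ECB information can be sketched as follows. This is a deliberately conservative illustration and not the algorithm of any of the cited works: it assumes that any access of the preempting task to a cache set may evict all useful blocks in that set, and that reloading one block has a constant cost. All names and numbers are hypothetical.

```python
def crpd_bound(ucb_per_set, ecb_sets, ways, reload_cost):
    """Conservative CRPD bound at one preemption point.

    ucb_per_set -- {cache set index: number of useful cache blocks in it}
    ecb_sets    -- cache set indices touched by the preempting task
    ways        -- associativity; a set cannot hold more useful blocks
    reload_cost -- cycles to reload one memory block (assumed constant)
    """
    at_risk = sum(min(count, ways)
                  for s, count in ucb_per_set.items()
                  if s in ecb_sets)
    return at_risk * reload_cost

ucb = {0: 2, 1: 1, 3: 2}     # useful blocks of the preempted task, per set
ecb = {0, 2, 3}              # sets used by the preempting task
bound = crpd_bound(ucb, ecb, ways=2, reload_cost=10)
```

Only sets 0 and 3 hold useful blocks and are also touched by the preempting task, so four blocks are at risk. A resilience-based analysis as in [9] could tighten this further by keeping UCBs that survive a limited number of evicting accesses.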
In summary, many existing cache modeling methods aim to capture the pro-
gram behavior in the cache through static analysis. However, static cache anal-
ysis may fail to deterministically identify the memory access behavior, such as
non-classified memory accesses in the abstract interpretation approach. Due
to the safety-critical nature of the hard real-time systems, conservative estima-
tion is usually adopted, which may lead to great overestimation of timing in
real-time systems. In this thesis, we adopt cache locking to improve the timing
predictability and worst-case performance of the cache, which will be shown in
detail in Chapters 4, 5, 6 and 7.
3.2 Cache Analysis in Multi-core
As multi-core processors become widely used in real-time systems due to
thermal and power constraints, significant research efforts have been invested
into this area. The resources shared among different cores, such as the shared cache
and the shared bus, make the analysis in multi-core processors more challenging
compared to that of uni-processors.
In multi-core processors, the inter-core contention in the shared cache makes
timing analysis even more difficult. Yan and Zhang [112, 113] account for inter-
core cache contention by detecting accesses across cores that are mapped to
the same set in the shared cache. However, the lifetimes of the tasks are not
considered in their work. In other words, any two tasks on different cores are
considered as interfering with each other. Therefore, their approach produces
pessimistic WCET estimates. Hardy et al. [47] filter out static single usage
blocks from the shared caches, and only blocks statically known to be reused
are cached. This bypass strategy reduces the pollution in the shared
caches and thus tightens the WCET estimates for multi-core processors with shared
instruction caches. The lifetimes of the tasks are again not considered in their
work. Li et al. [62] present a shared cache modeling approach based on abstract
interpretation. The lifetimes of the tasks are carefully studied in their approach.
Two tasks on different cores are considered as interfering with each other only
when there is no dependence between them and their lifetimes overlap. An
optimization for set associative caches is also developed, which improves the
estimation accuracy. Lesage et al. [58] extend the work in [47] to shared data
caches in multi-core processors. A bypass strategy is also used in their work to
reduce the inter-core interference in the shared data cache. Apart from the analysis
of shared caches, there are also many research works on modeling the shared
bus, in order to bound the execution time [28, 53, 26].
As can be observed, existing analysis approaches focus on modeling the
inter-core cache interference in shared cache of multi-core processors. This
inter-core cache interference results in additional cache misses in shared cache
and contributes to increased timing. In Chapter 7, we try to reduce the inter-core
interference in shared cache and improve the worst-case response time (WCRT)
of multitasking applications.
3.3 Cache Locking
As we have mentioned, cache locking is used in many commercial processors [6,
2, 1, 7, 5, 4]. Cache locking is employed for timing predictability in real-time
systems. By carefully selecting the memory blocks to lock, it can also improve
the execution time of the tasks. In multitasking real-time systems, it can also
be used to improve the schedulability.
3.3.1 Cache Locking for Single Task
Falk et al. [38] perform cache locking by taking into account the change of
worst-case path and achieve better WCET reduction. Their greedy algorithm
computes the worst-case path and selects the procedure with maximum WCET
reduction for locking. This process continues until the cache is fully locked.
Liu et al. [72] present an optimal solution to minimize WCET via cache locking.
However, their approach is optimal on the premise that the cache is fully locked.
It may not be optimal towards minimizing WCET. More importantly, they do not
consider the cache mapping function at all in the locking algorithm. They simply
assume that a memory block can be locked in any cache set. After locking
decisions are taken, they have to use code placement/layout techniques [45, 69]
that force the locked memory blocks to be mapped to the appropriate cache
sets. This can lead to serious code size blowup, which has not been addressed.
Plazar et al. [84] select the memory blocks to lock in the instruction cache via an ILP
(Integer Linear Programming) formulation, in order to improve the
WCET. The control flow and cache size are modeled as constraints in the ILP
formulation. The objective function of the ILP formulation is to minimize the
overall WCET.
These approaches employ static full cache locking to achieve WCET mini-
mization for a single task. That is, the cache contents are locked at the beginning
of program execution and the entire cache is locked. As we have mentioned,
such aggressive full cache locking may have a negative impact on the overall tim-
ing, even though the timing is predictable. In Chapter 4, we propose partial
cache locking that combines cache locking with static cache analysis to achieve
a better WCET for a single task.
Arnaud and Puaut [15] propose a region-based dynamic instruction cache
locking approach. They partition the program into independent regions. The
initial regions are set to the basic blocks of the program. The execution frequency of
each basic block is obtained through profiling. An exploration of re-
gions via merging and inlining operations is used to search for the best program par-
titioning for WCET minimization. Static full cache locking is applied for each
region separately based on the execution frequencies of basic blocks. When
a region is entered, its corresponding locked cache content is loaded into the
cache. Puaut [86] also adopts region-based dynamic cache locking. However,
loops and procedures are selected as the locking regions, instead of region ex-
ploration in [15]. Liu et al. [74] choose the branch nodes in their EFT (Execution
Flow Tree) as the candidate reloading points to adjust the locked cache state. A
trade-off between swapping cost and locking benefit is carried out to select the
reloading points and corresponding contents to lock. Vera et al. [104] perform
region-based dynamic cache locking for the data cache. Locking and unlocking
instructions are inserted at the path merging points of the program to load and
release the locked contents.
All these dynamic locking approaches [15, 86, 74, 104] adopt full cache
locking, which may have negative impact on overall WCET as we have men-
tioned. [15] and [104] rigidly partition the program into independent regions
and assign each region a static locking state. Even though inlining allows pro-
motion of locked content within a region to its caller, the entire locked content
has to be promoted. In [86], cache locking is performed at the granularity of ba-
sic blocks instead of memory blocks. That is, an entire basic block is selected
for locking each time, which is coarse-grained. Meanwhile, a simple heuristic approach
without global optimization is used to select the locked contents for each re-
gion. The approach in [74] allows locking at different program points only for
branching; that is, there is only one locking point for single-path regions as in
nested loops. Meanwhile, similar to [72], the approach in [74] does not consider
cache mapping function during locking decisions and requires post-locking code
placement to force the locked memory blocks to the corresponding cache sets.
In Chapter 6, we present a new dynamic instruction cache locking approach, in order to
improve the WCET for the tasks. Memory blocks are locked at the entry points
of loops and unlocked at the corresponding exit points of loops. The locking
slots at each loop are determined through global optimization, and we select the
most beneficial memory blocks to fill the locking slots.
3.3.2 Cache Locking in Multitasking
In the context of multitasking real-time systems, Puaut and Decotigny [87] em-
ploy static cache locking for the tasks. Two content selection algorithms have
been proposed in their work to minimize the CPU utilization and inter-task in-
terferences, respectively. The profiles on the worst-case path are used for the
memory block selection. Each task locks a portion of the cache, while all the
tasks together lock the entire cache. Obviously, the performance of this ap-
proach is limited by the cache size. Campoy et al. [23] develop static locking
solutions in multitasking real-time systems using genetic algorithms, in order
to minimize system utilization. Neither [87] nor [23] models the change in the
worst-case path after locking a memory block. Aparicio et al. [14] propose a
cache locking solution among the tasks based on ILP formulation. The cache is
shared by the tasks in a time-multiplexed style through cache locking, and each
task has exclusive access to the cache during execution. Thus, there is dynamic
locking behavior among the tasks. However, when preemption happens, the pre-
empted task must reload and lock its cache contents when it resumes execution.
The re-locking cost may contribute a lot to the timing of the preempted task. In
Chapter 5, we integrate cache locking and cache analysis in multitasking real-
time systems. This approach avoids the disadvantages of existing cache locking
approaches and improves the schedulability/utilization in multitasking real-time
systems.
Apart from improving the worst-case performance in real-time systems, cache
locking is also used to improve the average-case execution time [11, 70, 73].
3.4 Memory Optimizations in Multi-core Processors
Task mapping on multi-core systems is a well-studied problem. In this thesis, we
restrict our discussion to task mapping approaches that consider cache behavior.
Anderson et al. [13] propose and evaluate a cache-aware Pfair-based scheduling
scheme for multi-core real-time systems. Tasks that generate a large number of
L2 cache misses are discouraged from being co-scheduled to ensure that the real-time
constraints are met. On the other hand, in [12], Anderson and Calandrino propose a
scheduling scheme encouraging threads that are cooperative and share a common
working set to be co-scheduled for effective use of shared resources in the L2
cache. In [22], Calandrino et al. integrate the approaches presented in [13] and
[12].
Fedorova et al. [40] present an operating system scheduling algorithm, CASC
(Cache-Aware Scheduling). CASC achieves contention reduction in the L2
cache by co-scheduling threads that have a large footprint in the L2 cache with
threads that have a small footprint in the L2 cache. Meanwhile, threads with a low
L2 cache footprint are assigned higher priority. In [39], Fedorova et al. propose a
cache-fair scheduling algorithm to address the unfair cache sharing problem.
Liu et al. [75] aim to minimize WCET on multiprocessor system-on-chip
(MPSoC) through task assignment. They adopt cache locking in the L1 cache
for predictable WCET estimation. Tasks are assigned based on the results of L1
cache locking. Then, they perform L2 cache partitioning in order to minimize
the total WCET. Finally, adjustments to task assignment and L2 cache partition-
ing are applied to further improve the WCET. Suhendra and Mitra [96] partition
the shared cache based on cores or tasks. Both static cache locking and dynamic
cache locking are explored for predictable shared cache in multi-core processor.
In Chapter 7, we propose a task mapping approach to reduce the WCRT
(Worst-case Response Time) in multi-core real-time systems. Our approach dif-
fers from previous work in that we integrate task mapping with detailed shared
L2 cache modeling in multi-core systems for worst-case response time mini-
mization. Unlike [75], we do not partition the L2 cache among different cores.
Rather we allow conflicts among accesses from different cores in the shared
cache. We take into account the impact of this conflict in our task mapping ap-
proach. The co-scheduling approaches [13, 12] try to identify tasks that should
or should not be scheduled together so as to improve the shared cache behavior.
They model the cache behavior of the tasks at a very high level. We also employ
partial cache locking in Chapter 7 based on the resultant task mapping. Our
cache locking approach in multi-core processors further improves the WCRT.
3.5 Other Optimizations for Worst-case Performance
3.5.1 Cache Partitioning
Cache partitioning is a mechanism that partitions the cache into isolated portions.
With cache partitioning, each portion of the cache is privately used by a task or
a core. Thus, cache partitioning can be used to eliminate the inter-task cache
interference in multitasking real-time systems or inter-core cache interference
in multi-core processors with shared cache. As the inter-task or inter-core cache
interference makes the timing analysis more complicated, the execution time in
real-time systems is more predictable with cache partitioning.
Sasinowski and Strosnider [89] optimally partition the cache via a dynamic
programming approach, in order to minimize the task utilization. Mueller [80]
proposes a compiler-assisted software-based cache partitioning approach. The
cache is partitioned into distinct portions for tasks based on task priority and task
size. As each task is restricted to use only a portion of the cache, code/data trans-
formation and data reference adjustment are thus required to map the code/data
to the particular cache space. Discussion regarding the influence of cache parti-
tioning is also presented. Plazar et al. [85] also propose a software-based cache
partitioning method. Compared to [80], their approach is WCET-driven. They
partition the cache using an ILP formulation to obtain the lowest overall system
WCET. As mentioned earlier, Suhendra and Mitra [96] combine cache parti-
tioning and cache locking in the shared cache of multi-core processors for pre-
dictability. Liu et al. [75] also partition the shared L2 cache in multi-core pro-
cessors to guarantee predictability, while cache locking is applied for each
task to have precise WCET estimation.
3.5.2 Code Layout Optimization
Due to the address mapping function in the cache, a memory block can only be
mapped to one cache set. In this case, the mapping function also determines
the memory blocks mapped to the same cache set, and thus the corresponding
cache conflicts in the cache set. Therefore, by carefully modifying the code lay-
out, cache conflicts can be eased or even completely eliminated, which usually
results in better execution time. This is known as code positioning or code lay-
out change. Generally, code positioning can be performed at the granularity of
basic blocks, procedures or tasks.
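The effect of the mapping function on conflicts can be sketched for a direct-mapped cache, where two code regions conflict whenever they place a block in the same cache set. The line size, number of sets, region names and addresses below are made-up illustration values, and the helper names are hypothetical.

```python
from collections import defaultdict

def cache_set(address, line_size, num_sets):
    """Cache set index of the memory block that contains `address`."""
    return (address // line_size) % num_sets

def conflicting_sets(regions, line_size=16, num_sets=4):
    """Cache sets used by more than one code region.

    regions -- {name: (start address, size in bytes)}
    """
    users = defaultdict(set)
    for name, (start, size) in regions.items():
        for address in range(start, start + size, line_size):
            users[cache_set(address, line_size, num_sets)].add(name)
        # also cover the block holding the last byte of the region
        users[cache_set(start + size - 1, line_size, num_sets)].add(name)
    return sorted(s for s, names in users.items() if len(names) > 1)

# Original layout: f and g are 64 bytes apart and collide in sets 0 and 1.
before = conflicting_sets({"f": (0, 32), "g": (64, 32)})
# Moving g by 32 bytes places it in sets 2 and 3, removing the conflicts.
after = conflicting_sets({"f": (0, 32), "g": (96, 32)})
```

In a set-associative cache, sharing a set does not necessarily cause evictions, so this check would over-approximate the conflicts there.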
Zhao et al. [114] try to reduce the WCET through code positioning. They tar-
get the control flow in the worst-case path and attempt to make basic blocks con-
tiguous. However, their approach does not take into account the cache behavior.
Lokuciejewski et al. [76] propose a WCET optimization approach via procedure
placement. Their approach is based on the procedure call graph of the program.
Two procedures with high calling frequency are placed contiguously. Falk and
Kotthaus [37] extend the work in [76] based on a more fine-grained cache conflict
graph. The edge weight in the conflict graph considers the corresponding cache
misses on the worst-case path. They first perform code positioning for each
procedure at the basic block level. Later, a global procedure placement is car-
ried out for the task. Gebhard and Altmeyer [43] perform code layout changes
at task level in multitasking applications with preemptive scheduling. Their
approach minimizes the inter-task cache interference and improves the overall
cache performance. Compared to the methods in [114, 76, 37] that require code
modification inside the task, [43] only needs to adjust the start address of each
task.
3.5.3 Scratchpad Memory
Scratchpad memories are small on-chip memories that are mapped into the ad-
dress space of the processor, as shown in Figure 3.2. A part of the code/data
are located in the scratchpad memory, while the rest stays in the main memory.
Low access latency is associated with the memory accesses to the code/data in
the scratchpad memory. As it has neither tag memory nor comparators, scratch-
pad memory is more energy-efficient and cost-effective than a normal cache [18].
Figure 3.2: Scratchpad memory.
That is, the access latency to the code/data is known. This special characteris-
tic makes the memory access behavior predictable, which is critical for the hard
real-time systems. The disadvantage of scratchpad memory is that it requires ad-
ditional effort to modify the applications, in order to use the scratchpad memory.
Usually, compiler-assisted approaches are employed to allocate the code/data to
the scratchpad memory.
Puaut and Pais [88] partition the program into different regions. At the cor-
responding reloading point of each region, the memory content is loaded into
the scratchpad memory. Falk and Kleinsorge [36] allocate program code to
scratchpad memory with ILP formulation. They claim that their approach is
optimal for WCET reduction. As a portion of code is allocated into scratchpad
memory, control flow should be carefully adjusted, and additional penalty is in-
curred in their method. Wu et al. [111] also use scratchpad memory to optimize
the WCET in hard real-time systems. In their approach, the scratchpad memory
allocation problem is transformed into a graph problem based on the control flow
of the program. A code selection algorithm with polynomial-time complexity is
proposed for non-nested loops in a fully pipelined processor.
Suhendra et al. [97] propose a scratchpad memory allocation approach for
program data. Their purpose is to minimize the WCET, and three solutions are
proposed, including ILP formulation, branch and bound, and a time-efficient
heuristic method. Deverge and Puaut [33] apply dynamic scratchpad memory
allocation of data to improve the WCET, in contrast to the static method in [97].
In other words, the contents in the scratchpad memory vary with the program
execution. Thus, it may better capture the program behavior. Wan et al. [106]
attempt to minimize the WCET by allocating the data to scratchpad memory
with a graph coloring approach. Their approach is based on the observation that
data with disjoint live ranges can be allocated at the same location.
Suhendra et al. [99] investigate scratchpad memory allocation for multitask-
ing applications, in order to improve the WCRT (Worst-case Response Time).
The interference among the tasks is carefully handled, and two tasks with dis-
joint lifetimes can occupy the same space in scratchpad memory. An iterative al-
location algorithm is used to monotonically reduce the WCRT. Chattopadhyay
and Roychoudhury [27] develop a compile-time scratchpad allocation frame-
work for multi-processor platforms, where the processors virtually share on-
chip scratchpad space and external memory is accessed through a shared bus.
They adopt a static bus schedule scheme (Time Division Multiple Access) which
is incorporated into the scratchpad allocation method. Overall WCRT is significantly
reduced by appropriate content selection and overlay optimization (variables
share the same scratchpad space due to disjoint lifetimes). Verma et al. [105]
propose a hybrid approach for scratchpad memory allocation in multiprocess
systems. Each process is allocated a disjoint region while the remaining portion is
shared by all processes. Their approach aims to minimize the energy consump-
tion, and real-time scheduling is not considered in their work.
Chapter 4
Partial Cache Locking for Single Task
In this chapter, we present the partial cache locking approach for a single task to
optimize the worst-case execution time (WCET).
4.1 Overview
As mentioned in Chapter 1, static cache analysis and cache locking are usu-
ally employed to deal with timing unpredictability of cache in hard real-time
systems. Recently, a heuristic [38] and an optimal solution [72] have been pro-
posed to minimize the WCET via static instruction cache locking. These ex-
isting techniques make an implicit but important decision of locking the entire
cache. This crucial decision arises from the assumption that instruction cache
modeling for WCET analysis is quite imprecise. By employing full cache lock-
ing, [38, 72] can completely bypass cache modeling in WCET analysis phase
and thereby achieve tight WCET estimation. Indeed, as these techniques are
oblivious to cache modeling, they assume the worst-case behavior with empty
cache (where all the accesses are serviced from main memory) as the reference
point and improve upon it through locking of memory blocks along the WCET
path. In this context, it is guaranteed that locking the entire cache will provide
maximum WCET reduction compared to the baseline empty cache. In other
words, the cache locking problem becomes equivalent to the scratchpad mem-
ory allocation problem [36, 97].
In this chapter, we argue (and experimentally validate) that aggressive full
cache locking as proposed in [38, 72] may have substantial negative impact
on WCET reduction. State-of-the-art instruction cache modeling techniques
for WCET analysis are quite mature. Most memory accesses can thus be suc-
cessfully classified as hit/miss through WCET analysis techniques. Consider
a memory block m originally classified as a cache hit in a normal cache through
static WCET analysis. Suppose m is not selected for locking under the full cache
locking scenario. Then m does not have any opportunity to reside in the cache and all its
accesses incur cache misses. Now consider an alternative scenario where par-
tial cache locking is employed. Again m is not selected for locking. However,
as the cache has some unlocked lines, m may still be brought into the cache at
runtime and the cache misses can be avoided.
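The argument can be checked on a single cache set with a small simulation. This is a sketch under stated assumptions: locked blocks are assumed to be preloaded so that they always hit, the unlocked lines use LRU replacement, and the block names a and b and the loop of 10 accesses to m are hypothetical.

```python
from collections import OrderedDict

def count_misses(trace, ways, locked):
    """Misses for one LRU cache set in which `locked` blocks are pinned.

    Locked blocks are assumed to be preloaded, so they always hit; the
    remaining ways - len(locked) lines behave as an ordinary LRU set.
    """
    free_ways = ways - len(locked)
    lru, misses = OrderedDict(), 0
    for block in trace:
        if block in locked:
            continue                      # guaranteed hit
        if block in lru:
            lru.move_to_end(block)        # hit in an unlocked line
            continue
        misses += 1
        if free_ways == 0:
            continue                      # no unlocked line to hold the block
        if len(lru) == free_ways:
            lru.popitem(last=False)       # evict the least recently used block
        lru[block] = None
    return misses

trace = ["m"] * 10                        # a loop that fetches block m 10 times
full = count_misses(trace, ways=2, locked={"a", "b"})
partial = count_misses(trace, ways=2, locked={"a"})
unlocked = count_misses(trace, ways=2, locked=set())
```

With both ways locked, every access to m misses; with one way left unlocked, m misses only once, exactly as in the unlocked cache.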
Our partial locking mechanism integrates cache locking with cache model-
ing. We model the cache content at all program points and select the memory
blocks for locking based on the cache state and their impact on the WCET. In
particular, we use the concrete cache states or the abstract cache states to model
the cache content. Concrete cache state captures the exact path behavior while
abstract cache state is a compact representation that merges multiple concrete
cache states together. For concrete cache states, we use an integer linear program-
ming (ILP) approach to optimally select the memory blocks for locking. As no
cache locking and full cache locking are just two extreme instances of partial
cache locking, partial locking is guaranteed to be equivalent to or better than
them. To improve the efficiency, we also propose a heuristic partial locking
strategy based on abstract cache states. Experimental results show that our par-
tial cache locking techniques improve WCET substantially for a large number
of embedded benchmark applications.
4.2 Motivating Example
We illustrate the benefit of partial cache locking over full cache locking with a
concrete example shown in Figure 4.1. The program consists of four loops and
we assume that all the memory blocks are mapped to the same cache set in a
2-way set associative cache.
Cache modeling with no locking: Let us first estimate the WCET via cache
modeling with no locking. We adopt Theiling et al.’s approach for cache mod-
eling in WCET analysis. Theiling et al. [101] model the cache states at all
program points. All the memory blocks in the first loop (m0, m1, m2) are cache
misses in the worst case because alternate execution of the two program paths
(P0 and P1) can lead to mutual eviction of the blocks. Thus, program path P1
with 2 cache misses is the worst-case path in the first loop. For the other three
loops, each memory block (m3, m4, m5) incurs
a cold miss and the subsequent accesses are cache hits via persistence analysis
or virtual unrolling [79, 101]. Therefore, cache modeling estimates 23 cache
misses in the worst case: 20 misses for the first loop and 3 misses for the other
loops.
[Figure 4.1 panels: cache modeling with no locking, WCET: 20 + 3 = 23 misses; full locking, WCET: 20 + 80 = 100 misses; partial locking.]
Figure 4.1: Advantage of partial cache locking over full cache locking and cache
modeling with no locking. The program consists of four loops. The first loop
contains two paths (P0 and P1) and the other three loops contain only one path.
The loop iteration counts appear on the back edges.
Full cache locking: Existing cache locking techniques [38, 72] first build the




5 ) assuming that all the accesses are
serviced from the main memory (i.e., there is no cache). Now memory blocks
are selected for locking along the worst-case path so as to improve the WCET
until the cache is fully locked. Both cache locking techniques [38, 72] model
the fact that the WCET path may change after locking some memory blocks. For
this example, the heuristic [38] and the optimal [72] approach return the same
solution. m3 and m4 are chosen to be locked as they contribute most towards
WCET reduction. After locking, we get 100 cache misses in total in the worst
case — 20 misses in the first loop and 80 misses in the last loop. Thus, cache
locking performs worse than cache modeling in this example.
Partial cache locking: Our partial locking technique can determine that it is
beneficial to keep one cache line free so that accesses to m3, m4, and m5 can
be cache hits after the first cold miss. It only chooses to lock m1 or m2 in the
cache. Thus we get 13 cache misses in the worst case — 10 misses in the first
loop and 3 cold misses for the other loops. Thus partial cache locking improves
upon cache modeling and full locking.
From the example above, we first observe that full locking techniques [38,
72] are not guaranteed to perform better than cache modeling (with no locking)
especially when some memory accesses can be easily classified as cache hits
(m3, m4, m5 in our example). Locking these memory blocks with deterministic
access pattern does not yield any benefit. On the other hand, if the cache is
fully locked and these memory blocks with deterministic access pattern are not
chosen for locking, it can have serious impact on the WCET.
4.3 Cache Modeling
The details of the cache have been introduced in Section 2.1 of Chapter 2. In
this chapter, we assume the cache line (block) size is L, and there are K sets in
the cache. Meanwhile, the cache associativity is A, and LRU (Least Recently
Used) cache replacement policy is employed.
Given a memory block m, it is mapped to only one cache set. Thus, the
different cache sets are independent and can be modeled independently. In the
following, we describe our modeling technique for one cache set. The same
modeling techniques can be repeated for other cache sets. We use M to de-
note the set of memory blocks mapped to a cache set and use ⊥ to indicate the
absence of any memory block in a cache line.
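As a minimal sketch of why each memory block belongs to exactly one cache set, the following assumes the conventional modulo placement (block number modulo the number of sets). The function names and the example addresses are illustrative and not taken from the thesis.

```python
def cache_set_index(address: int, line_size: int, num_sets: int) -> int:
    """Return the index of the cache set that holds the given byte address.

    Assumes the conventional modulo mapping: block number mod K.
    """
    block_number = address // line_size  # memory block containing the address
    return block_number % num_sets


def blocks_per_set(addresses, line_size: int, num_sets: int):
    """Group the distinct memory blocks by the cache set they map to."""
    groups = {s: set() for s in range(num_sets)}
    for addr in addresses:
        block = addr // line_size
        groups[block % num_sets].add(block)
    return groups


# Example with L = 32 bytes and K = 4 sets (values chosen for illustration).
example_groups = blocks_per_set([0, 4, 128, 132, 32], line_size=32, num_sets=4)
```

Each entry of `example_groups` plays the role of the set M for one cache set, which is why the sets can be analyzed independently.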
4.3.1 Cache States
Definition 2 (Concrete Cache State) A concrete cache state c is a vector
〈c[0], ..., c[A− 1]〉 of length A where c[i] ∈M ∪{⊥}. If c[i] = m, then m is the
ith most recently used memory block in the cache set. We also define a special
concrete cache state c⊥ = 〈⊥, ...,⊥〉 called the empty concrete cache state.
Definition 3 (Concrete Cache State Hit) Given a concrete cache state c and a
memory access m ∈ M,
$$c\_hit(c, m) = \begin{cases} 1 & \text{if } \exists i\ (0 \le i \le A-1) \text{ s.t. } c[i] = m \\ 0 & \text{otherwise} \end{cases} \tag{4.1}$$
We use Ω to denote the set of all possible concrete cache states of a program.
Note that a program point can be reached via multiple paths and these paths may
lead to different concrete cache states. We use P to denote the set of all possible
concrete cache states at a program point, i.e., P ∈ 2^Ω. We can easily compute
P at each program point through static program analysis as shown in [68].
Given the set of all possible cache states P at a program point and a memory
access m ∈ M,
$$p\_hit(P, m) = \begin{cases} 1 & \text{if } \forall c \in P:\ c\_hit(c, m) = 1 \\ 0 & \text{otherwise} \end{cases} \tag{4.2}$$
That is, an access m is a hit at a program point with the set of all possible
concrete cache states P if and only if m is hit in all the concrete cache states of
P .
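The two hit predicates can be sketched directly from Equations 4.1 and 4.2. The tuple representation of a cache state, the `EMPTY` marker standing for ⊥, and the `lru_access` helper are assumptions made for this illustration only.

```python
from typing import Iterable, Optional, Tuple

EMPTY = None  # stands for the symbol ⊥ (no memory block in the line)
CacheState = Tuple[Optional[str], ...]  # index 0 = most recently used


def lru_access(c: CacheState, m: str) -> CacheState:
    """Return the concrete cache state after accessing m under LRU."""
    others = [b for b in c if b is not EMPTY and b != m]
    new = [m] + others                    # m becomes the youngest block
    new += [EMPTY] * (len(c) - len(new))  # pad unused lines with ⊥
    return tuple(new[:len(c)])            # the oldest block falls out if full


def c_hit(c: CacheState, m: str) -> int:
    """Equation 4.1: 1 if m is present in the concrete state c."""
    return 1 if m in c else 0


def p_hit(P: Iterable[CacheState], m: str) -> int:
    """Equation 4.2: 1 only if m hits in every concrete state of P."""
    return int(all(c_hit(c, m) == 1 for c in P))


# Two paths reach the same program point in a 2-way cache set.
empty: CacheState = (EMPTY, EMPTY)
after_p0 = lru_access(lru_access(empty, "m0"), "m1")  # path accesses m0, m1
after_p1 = lru_access(lru_access(empty, "m0"), "m2")  # path accesses m0, m2
P = {after_p0, after_p1}
```

Here `m0` survives on both paths and is a guaranteed hit, whereas `m1` is present on only one path and therefore cannot be classified as a hit.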
Maintaining the set of all possible cache states may not be feasible (and scal-
able) for large programs with complex control flows where a program point can
potentially have hundreds or even thousands of cache states. Thus we also em-
ploy abstract interpretation to compute the abstract cache state at every program
point [101]. An abstract cache state is derived by joining all possible concrete
cache states at a program point. The abstract cache state is defined similarly to
that in Section 2.3.1 of Chapter 2.
An abstract cache state maps a cache line to a set of memory blocks. Must
analysis and may analysis [101] are usually employed to compute abstract cache
states for WCET analysis. Given a program point, must analysis determines
the set of memory blocks that are guaranteed to be present in the cache, while
may analysis determines the set of memory blocks that may be in the cache; blocks outside this set are never in the cache.
Must analysis uses abstract cache states where the position of a memory block
is an upper bound of its age. In may analysis, the lower bound of the age of
a memory block is used as its position in the abstract cache state, in order to
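capture the set of all memory blocks that may be in the cache. Figure 4.2 shows
concrete cache states and the abstract cache states obtained from them under must and may analysis.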




























must analysis may analysis
Figure 4.2: Concrete cache states and abstract cache states.
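A minimal sketch of how two cache states are joined under must and may analysis, following the standard abstract-interpretation join functions (intersection with maximal age for must, union with minimal age for may). Representing a state as a dictionary from block name to age is an assumption made for brevity.

```python
from typing import Dict

AbstractState = Dict[str, int]  # memory block -> age bound (0 = youngest)


def join_must(a: AbstractState, b: AbstractState) -> AbstractState:
    """Keep only blocks present in both states, with the larger (upper-bound) age."""
    return {m: max(a[m], b[m]) for m in a.keys() & b.keys()}


def join_may(a: AbstractState, b: AbstractState) -> AbstractState:
    """Keep blocks present in either state, with the smaller (lower-bound) age."""
    joined = dict(a)
    for m, age in b.items():
        joined[m] = min(age, joined.get(m, age))
    return joined


# States reached along two different paths in a 2-way cache set.
state_a = {"m1": 0, "m0": 1}
state_b = {"m2": 0, "m0": 1}
must_state = join_must(state_a, state_b)
may_state = join_may(state_a, state_b)
```

In this example only `m0` is guaranteed to be cached (must), while `m0`, `m1` and `m2` may all be cached (may).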
4.4 Partial Cache Locking Algorithms
In this chapter, we consider static cache locking, where the selected memory
blocks are locked into the cache before the program starts execution and remain
unchanged throughout the execution. Furthermore, we consider a line locking
mechanism, where a different number of lines can be locked in different cache
sets. As discussed before, for our purposes, we can treat each cache set independently
because the memory blocks mapped to different cache sets do not
interfere. Each cache set can be considered as a fully associative cache containing
A lines, where A is the associativity. Once a memory block is locked in a
cache line, it cannot be evicted from the cache. The remaining unlocked lines
in the cache set serve as a fully associative cache with reduced capacity.
Note that the mapping of instructions to the cache sets depends on the code
memory layout. Inserting additional code for cache locking may tamper with this
layout. To avoid this problem, we use the trampoline approach [20]. The extra
code to fetch and lock the memory blocks in the cache is inserted at the end of
the program as a trampoline. We leave a dummy NOP instruction at the entry



















Figure 4.3: Trampoline mechanism.
The main challenge is in selecting the memory blocks for locking so as to
minimize the WCET. In the following, we propose two solutions. The first one is
an optimal solution employing Integer Linear Programming (ILP) formulation
based on concrete cache states and the second one is a heuristic approach based
on abstract cache states.
4.4.1 Optimal solution with concrete cache states
The set of concrete cache states at any program point captures the exact set of
cache states resulting from all possible program paths. Based on this accurate
set of cache states, we formulate an ILP problem to optimally select the memory
blocks for partial locking. In the following, we first show the ILP formulation
for a loop and then extend it to the whole program.
ILP Formulation for Loop
We represent the loop body as a Directed Acyclic Graph (DAG). Each DAG is
associated with a unique source and sink node. We compute the set of possible
concrete cache states P at any point of the program through static program
analysis [68]. Given the set of all possible cache states P and a memory block
access m, p hit(P ,m) determines whether the access is a cache hit or miss
before locking. Next, we proceed to determine the cache access behavior of m
after locking.
For each memory block m, we define a 0-1 decision variable L_m, which
indicates whether m is locked in the cache. Thus,
$$L_m \in \{0, 1\} \tag{4.3}$$
There are only A (associativity) cache lines available for locking in each cache
set. Thus, for each cache set i,
$$\sum_{m \in M_i} L_m \le A \tag{4.4}$$
where M_i is the set of memory blocks mapped to cache set i.
The accesses to the locked memory blocks are cache hits. Let Locki denote
the set of memory blocks locked in cache set i. For an unlocked memory block
m mapped to cache set i (m ∈ Mi,m /∈ Locki), its access can be classified as
hit or miss depending on the concrete cache states P at that program point and
Locki.
For a concrete cache state c ∈ P, we define age^c_m as the age of the memory
block m in c, where age^c_m = 0 (age^c_m = A − 1) if m is the most (least) recently
accessed memory block in c. If m ∉ c (c_hit(c, m) = 0), then age^c_m = A. Thus,
0 ≤ age^c_m ≤ A. If m ∈ M_i and m ∉ Lock_i, then given a concrete cache state c,
the access to m is a cache hit if
$$age^c_m + \sum_{m' \in Lock_i \,\wedge\, age^c_{m'} > age^c_m} L_{m'} < A \tag{4.5}$$
In other words, if a locked memory block m′ ∈ Lock_i is younger than m in
the cache state c, then locking m′ does not change the hit classification of m.
However, if m′ ∈ Lock_i is older than m in cache state c (i.e., age^c_{m′} > age^c_m),
then locking m′ essentially increases the age of m by 1. If age^c_m plus the number of such
older locked memory blocks reaches the associativity A, then m becomes
a cache miss due to locking.
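The hit condition of Equation 4.5 can be evaluated for a fixed locking decision as in the sketch below. The tuple representation of the concrete state (index = age, `None` for ⊥) and the function names are assumptions for illustration; `m` is assumed not to be locked itself.

```python
from typing import Iterable, Optional, Tuple

CacheState = Tuple[Optional[str], ...]  # index = age, None = empty line


def age(c: CacheState, m: str) -> int:
    """Age of m in c; equals the associativity A when m is absent."""
    return c.index(m) if m in c else len(c)


def hit_after_locking(c: CacheState, m: str, locked: Iterable[str]) -> bool:
    """Equation 4.5 for an unlocked block m and a given set of locked blocks."""
    assoc = len(c)
    # Only locked blocks that are older than m take a line away from m.
    older_locked = sum(1 for lm in locked if age(c, lm) > age(c, m))
    return age(c, m) + older_locked < assoc


# 4-way set holding m0 (youngest), m1, m2 and one empty line.
c = ("m0", "m1", "m2", None)
```

With this state, locking one absent block still leaves `m2` a hit (2 + 1 < 4), locking two absent blocks turns it into a miss (2 + 2 = 4), and locking the younger blocks `m0` and `m1` does not affect it.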
We define a 0-1 variable hcm, which specifies whether m is a cache hit in c













However, the above equation is not linear. We substitute it with the equivalent








Lm′ − agecm + U − U × hcm > 0 (4.8)
where U is a large constant (U ≥ A).
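As a sketch of how such an indicator can be tied to the hit condition of Equation 4.5 with a large constant U, one possible pair of linear constraints is given below. This is an illustrative big-M construction written under the assumption that all quantities are integers; it is not claimed to be the exact form used in the formulation.

```latex
\begin{align*}
age^c_m + \sum_{m' \in Lock_i \,\wedge\, age^c_{m'} > age^c_m} L_{m'} - A + U \times h^c_m &\ge 0 \\
A - \sum_{m' \in Lock_i \,\wedge\, age^c_{m'} > age^c_m} L_{m'} - age^c_m + U - U \times h^c_m &> 0
\end{align*}
```

With h^c_m = 1 the second constraint reduces to the hit condition of Equation 4.5, and with h^c_m = 0 the first constraint forces the left-hand side of Equation 4.5 to be at least A, i.e., a miss. Choosing U ≥ A keeps whichever constraint is inactive satisfied.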
The set of concrete cache states P at a program point usually contains more
than one concrete cache state (|P| > 1). A memory block access m is guaranteed
to be a cache hit if and only if it is a cache hit for every concrete cache state c ∈ P.











We linearize the above equation as follows.
$$\sum_{c \in P} h^c_m - h^P_m \le |P| - 1 \tag{4.10}$$
$$\sum_{c \in P} h^c_m - |P| \times h^P_m \ge 0 \tag{4.11}$$
Finally, for each memory block access m, we define a 0-1 decision variable
hitm, which specifies whether m is cache hit or miss after locking. Locked
memory blocks are guaranteed to be cache hits. On the other hand, for an un-
locked memory block m, we rely on its corresponding cache state P to deter-
mine the cache behavior.
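$$hit_m = \begin{cases} 1 & \text{if } L_m = 1 \\ h^P_m & \text{otherwise} \end{cases} \tag{4.12}$$
We linearize the above equation as follows.
$$hit_m \ge L_m, \quad hit_m \ge h^P_m \quad \text{and} \quad hit_m \le L_m + h^P_m \tag{4.13}$$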
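A small brute-force check can confirm that the linear constraints behave as intended for 0-1 variables: Constraints 4.10 and 4.11 force h^P_m to be the logical AND of the per-state indicators, and Constraint 4.13 forces hit_m to follow Equation 4.12. The function names are illustrative.

```python
from itertools import product


def and_constraints_ok(h_values, h_P: int) -> bool:
    """Constraints 4.10 and 4.11 for one assignment of the 0-1 variables."""
    n, s = len(h_values), sum(h_values)
    return (s - h_P <= n - 1) and (s - n * h_P >= 0)


def hit_constraints_ok(L: int, h_P: int, hit: int) -> bool:
    """Constraint 4.13 for one assignment of the 0-1 variables."""
    return hit >= L and hit >= h_P and hit <= L + h_P


def linearizations_are_exact(n_states: int = 3) -> bool:
    """Enumerate all 0-1 assignments and compare with the intended semantics."""
    for hs in product((0, 1), repeat=n_states):
        feasible = [hp for hp in (0, 1) if and_constraints_ok(hs, hp)]
        if feasible != [int(all(hs))]:  # must equal the AND of all h values
            return False
    for L, hp in product((0, 1), repeat=2):
        feasible = [hit for hit in (0, 1) if hit_constraints_ok(L, hp, hit)]
        expected = 1 if L == 1 else hp  # Equation 4.12
        if feasible != [expected]:
            return False
    return True
```

For every assignment exactly one value of the dependent variable is feasible, so the solver has no freedom to misclassify an access.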





$$(miss\_lat - (miss\_lat - hit\_lat) \times hit_m) \tag{4.14}$$
where miss_lat and hit_lat are the cache miss penalty and cache hit latency,
respectively.
We also define a variable W_B for each basic block B in the loop, which
represents the latency of the worst-case path rooted at basic block B in the DAG
after cache locking. Then
$$W_B = \max_{B' \in imsucc(B)} \{W_{B'} + T_B\} \tag{4.15}$$
where imsucc(B) is the set of immediate successors of B in the DAG. Therefore,
for any outgoing edge from node B to node B′ (B → B′) in the DAG, we have
the following constraint
$$W_B \ge W_{B'} + T_B \tag{4.16}$$
Since there is no outgoing edge for the sink node of the loop, it is defined specially
as
$$W_{sink} = T_{sink} \tag{4.17}$$
Obviously, Wsrc will capture the latency of the worst-case acyclic path in the
DAG (src is the source node of DAG). Let lb be the loop bound of this loop
(maximum number of iterations of this loop). Then, Wsrc × lb is the WCET
of this loop after cache locking. Thus, the optimal cache locking result for this
loop can be obtained by minimizing Wsrc × lb (the objective function of ILP
formulation).
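For a fixed hit/miss classification, the recurrence of Equations 4.15 to 4.17 can be evaluated directly, as in the sketch below. In the ILP the W values are variables constrained by inequalities; here they are simply computed. The graph encoding, the block latencies and the loop bound are hypothetical values chosen for the example.

```python
from functools import lru_cache
from typing import Dict, List


def loop_wcet(succ: Dict[str, List[str]], T: Dict[str, int],
              src: str, loop_bound: int) -> int:
    """Return W_src * lb for a loop body given as a DAG.

    succ maps a basic block to its immediate successors; T maps a basic
    block to its latency under the chosen locking decision.
    """

    @lru_cache(maxsize=None)
    def W(b: str) -> int:
        nxt = succ.get(b, [])
        if not nxt:                     # sink node: W_sink = T_sink
            return T[b]
        return T[b] + max(W(n) for n in nxt)

    return W(src) * loop_bound


# Loop body with two alternative paths between src and sink.
succ = {"src": ["P0", "P1"], "P0": ["sink"], "P1": ["sink"]}
T = {"src": 1, "P0": 31, "P1": 60, "sink": 1}
example_wcet = loop_wcet(succ, T, "src", loop_bound=10)
```

The worst-case path goes through `P1` (1 + 60 + 1 = 62 cycles per iteration), giving 620 cycles for 10 iterations.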
Extension to the Whole Program
In the previous section, we present an ILP formulation to obtain the optimal
cache locking for a loop. In order to obtain the ILP formulation for the whole
program, we are required to start from the innermost loops of the program.
We first generate the ILP formulation for the innermost loops, and then each
innermost loop is treated as a dummy basic block of the outer loop. Therefore,
we can construct the ILP formulation for the next level of loop. We continue this
way until we reach the outermost loop in the program. Clearly, W_entry represents
the WCET of the whole program under cache locking, where entry denotes the
entry node of the program. Finally, the locking overhead (e.g., the execution of the
locking instructions) is included in the WCET of the whole program.
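The bottom-up treatment of nested loops can be sketched as a recursive evaluation in which each inner loop is collapsed into a single node whose latency is its own WCET. The `Loop` data structure, the node latencies and the loop bounds are assumptions made for this example; the whole program is modeled as a loop with bound 1, and the locking overhead is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Loop:
    bound: int                              # maximum number of iterations
    succ: Dict[str, List[str]]              # DAG of the loop body
    nodes: Dict[str, Union[int, "Loop"]]    # latency, or a nested loop
    src: str                                # source node of the DAG


def wcet(loop: Loop) -> int:
    """WCET of a loop, treating every inner loop as a dummy basic block."""
    cost = {n: wcet(v) if isinstance(v, Loop) else v
            for n, v in loop.nodes.items()}
    memo: Dict[str, int] = {}

    def W(b: str) -> int:
        if b not in memo:
            nxt = loop.succ.get(b, [])
            memo[b] = cost[b] + (max(W(n) for n in nxt) if nxt else 0)
        return memo[b]

    return loop.bound * W(loop.src)


inner = Loop(bound=10, succ={"a": ["b"]}, nodes={"a": 2, "b": 3}, src="a")
outer = Loop(bound=5, succ={"h": ["inner"], "inner": ["t"]},
             nodes={"h": 1, "inner": inner, "t": 1}, src="h")
program = Loop(bound=1, succ={"entry": ["outer"]},
               nodes={"entry": 4, "outer": outer}, src="entry")
```

Here the inner loop contributes 50 cycles per outer iteration, the outer loop 5 × 52 = 260 cycles, and the program 264 cycles in total.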
4.4.2 Heuristic with abstract cache states
In the previous section, we develop an optimal ILP formulation using concrete
cache states. However, programs with complex control flow may have hun-
dreds or even thousands of cache states at a program point. For such programs,
maintaining all possible concrete cache states may not be feasible. Also, the ILP
formulation may take very long to reach a solution, especially for larger programs
and larger associativity. Thus, we propose a heuristic approach based on abstract
cache states. An abstract cache state is a more compact representation compared
to the set of concrete cache states.
We first perform WCET analysis with cache modeling based on abstract
interpretation [101]. Then we can easily determine cache hit/miss classification
for each memory access based on the abstract cache states. As a by-product of
the WCET analysis, we obtain the abstract cache states under must analysis at
all program points. Meanwhile, we also collect the execution frequency of each
basic block along the worst-case path. Then we iteratively lock some memory
blocks on the worst-case path to improve the WCET.
Suppose memory block m is on the worst-case path. Let lat_m be the access
latency of m according to the hit/miss classification in WCET analysis, and f_m
be its execution frequency along the worst-case path. By locking memory block
m, all accesses to m will be cache hits. Therefore, we define the benefit of
locking m as
$$benefit_m = (lat_m - hit\_lat) \times f_m \tag{4.18}$$
where hit_lat is the cache hit latency. Thus, locking a memory block guaranteed
to be a hit before locking does not give any benefit.
On the other hand, locking memory block m in cache may have negative
impact for the memory blocks mapped to the same set as the associativity for
this set is reduced by 1. Similar to concrete cache state, we define the age of
a memory block m in abstract must cache state C as ageCm. When m ∈ C,
0 ≤ ageCm ≤ A− 1, where A is the associativity. Otherwise, we set its age to A.
Suppose we choose to lock memory block m in the cache and benefit_m >
0. In other words, m is not in the abstract must cache state before locking. Then,
locking m will downgrade the memory block m′ from cache hit to cache miss if
age^C_{m′} = A − 1. Note that the associativity A here refers to the current associativity
of the set. That is, A refers to the original associativity of the cache minus
the number of memory blocks locked in the set so far. Therefore, we define the
cost of locking m as
$$cost_m = \sum_{m' \in M_i \,\wedge\, age^C_{m'} = A-1} (miss\_lat - hit\_lat) \times f_{m'} \tag{4.19}$$
where as before m ∈ M_i. Then, the overall gain of locking m is
$$gain_m = benefit_m - cost_m \tag{4.20}$$
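The cost-benefit estimate can be sketched as follows. For brevity the sketch assumes a single must-age per memory block in the affected cache set, whereas the analysis keeps abstract states per program point; the latencies (1 and 30 cycles), the frequencies and the block names are illustrative values.

```python
from typing import Dict


def benefit(lat_m: int, hit_lat: int, f_m: int) -> int:
    """Equation 4.18: latency saved by turning all accesses to m into hits."""
    return (lat_m - hit_lat) * f_m


def cost(m: str, must_age: Dict[str, int], freq: Dict[str, int],
         assoc: int, miss_lat: int, hit_lat: int) -> int:
    """Penalty for blocks that sit in the oldest line (age = assoc - 1)
    of the must state and are evicted once one more line is locked."""
    return sum((miss_lat - hit_lat) * freq[other]
               for other, a in must_age.items()
               if other != m and a == assoc - 1)


def gain(m: str, lat_m: int, must_age: Dict[str, int], freq: Dict[str, int],
         assoc: int, miss_lat: int = 30, hit_lat: int = 1) -> int:
    """Equation 4.20: benefit minus cost."""
    return (benefit(lat_m, hit_lat, freq[m])
            - cost(m, must_age, freq, assoc, miss_lat, hit_lat))


freq = {"m1": 10, "m3": 20}
# m1 is a miss before locking (latency 30); m3 is a guaranteed hit.
gain_when_m3_oldest = gain("m1", 30, {"m3": 1}, freq, assoc=2)
gain_when_m3_young = gain("m1", 30, {"m3": 0}, freq, assoc=2)
```

When `m3` occupies the oldest line, locking `m1` saves 290 cycles but costs 580, so the gain is negative and `m1` would not be selected; when `m3` is young enough to survive, the full benefit of 290 cycles remains.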
We compare different memory blocks in terms of their gain and select the
most beneficial memory block m to be locked. However, gainm may not be the
actual WCET reduction because the worst-case path may change after locking
m. Thus, we update cache state for instructions mapped to the affected cache set
and perform WCET analysis again to obtain the exact WCET after locking m.
If the WCET is actually reduced, we lock m in the cache. We continually select
memory blocks for locking until either there is no actual WCET improvement
after locking any memory block or there is no gain in the cost-benefit analysis
for any memory block m (i.e., gain_m ≤ 0). Finally, the locking overhead is
included.
Algorithm 1 presents the details of our heuristic approach. The input to the
algorithm is the cache configuration cfg and the binary executable prog. First,
we perform cache modeling based on abstract interpretation [101] for this binary
executable (line 3). The output of this analysis is the set of abstract cache states at
each program point. Next we perform WCET analysis of the binary executable
(line 4) where memory accesses are categorized into always hit, always miss,
and unclassified based on abstract cache states. The wcet obtained in this step is
the original WCET obtained through static cache analysis with no cache locking.
Now, we iteratively select the most beneficial memory block for locking into
Algorithm 1: Heuristic with abstract cache states
Input: Cache configuration cfg and binary executable prog
Output: Set of locked memory blocks lock_set and WCET after locking wcet
1  begin
2    stop_locking := false; lock_set := ∅;
3    analyze_abstract_cache_states(prog, cfg);
4    wcet := analyze_wcet();
5    while (!stop_locking) do
6      /* select candidate memory block to lock */
       cnd := null; gain_cnd := 0;
7      foreach m ∈ M do
8        Suppose m is mapped to cache set s;
9        Let assoc be the current associativity of s;
10       if (m ∉ lock_set ∧ assoc > 0) then
11         benefit_m := calculate_benefit();
12         cost_m := calculate_cost(assoc);
13         gain_m := benefit_m − cost_m;
14         if gain_m > gain_cnd then
15           cnd := m; gain_cnd := gain_m;
16     if cnd ≠ null then
17       lock_to_cache(cnd);
18       /* update cache states for affected cache set */
         update_cache_state(prog, cfg, cnd);
19       new_wcet := analyze_wcet();
20       if new_wcet < wcet then
21         wcet := new_wcet;
22         lock_set := lock_set ∪ {cnd};
23         update associativity for the affected cache set;
24       else
25         stop_locking := true;
26     else
27       stop_locking := true;
the cache. Let M be the set of all memory blocks. We perform cost-benefit analysis
for each memory block m ∈ M where m is not yet locked (m ∉ lock_set)
and the cache set s to which m is mapped still has some unlocked cache lines
(assoc > 0). We gain benefit from locking m if m was not guaranteed to be a hit
after static cache analysis. However, there is a cost associated with locking m.
The other memory blocks mapped to cache set s but not yet locked will have one
fewer cache line available in the cache set s. As discussed earlier, some of these
blocks now may incur cache miss (even though their accesses were hits under
static cache analysis) depending on their relative age with respect to the age of
m in cache set s. The additional latency incurred due to these cache misses will
contribute to the cost of locking m. The difference between benefit and cost of
locking m is the gain. We identify the memory block cnd with maximum gain.
If we cannot identify any memory block with positive gain, then the lock-
ing algorithm terminates. Note that the cost-benefit analysis is approximate in
nature because it depends on the frequency of memory accesses along the worst-
case path before locking memory block cnd. After locking memory block cnd,
the worst-case path may change. So we update the abstract cache states for the
cache set to which cnd is mapped and repeat WCET analysis with these new abstract
cache states. If the new WCET is indeed lower than the previous WCET,
then we add the memory block cnd to lock_set. We also need to decrease the
associativity of the corresponding cache set. If the actual WCET after locking
cnd is not lower than the previous WCET, then we terminate the algorithm.
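The control flow of the heuristic can be sketched in a few lines, with the cache analysis and WCET analysis abstracted behind two caller-supplied functions. Both callbacks, the toy WCET table and the block names are hypothetical; in particular the toy gain below is exact, whereas the gain in the heuristic is only an estimate along the current worst-case path.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple


def greedy_partial_lock(
    blocks: List[str],
    set_of: Dict[str, int],            # memory block -> cache set
    assoc: Dict[int, int],             # cache set -> number of lines
    estimate_gain: Callable[[str, FrozenSet[str]], float],
    analyze_wcet: Callable[[FrozenSet[str]], float],
) -> Tuple[Set[str], float]:
    locked: Set[str] = set()
    free = dict(assoc)                 # current associativity per set
    wcet = analyze_wcet(frozenset(locked))
    while True:
        # Select the unlocked block with the largest positive gain.
        cnd, best_gain = None, 0.0
        for m in blocks:
            if m in locked or free[set_of[m]] <= 0:
                continue
            g = estimate_gain(m, frozenset(locked))
            if g > best_gain:
                cnd, best_gain = m, g
        if cnd is None:
            break                      # no block with positive gain
        # Re-run the WCET analysis to confirm the estimated improvement.
        new_wcet = analyze_wcet(frozenset(locked | {cnd}))
        if new_wcet >= wcet:
            break                      # no actual improvement: stop
        wcet = new_wcet
        locked.add(cnd)
        free[set_of[cnd]] -= 1
    return locked, wcet


# Toy WCET model for one 2-way set: locking only "a" is best,
# while locking both lines hurts the remaining block.
TOY_WCET = {frozenset(): 100, frozenset({"a"}): 70,
            frozenset({"b"}): 85, frozenset({"a", "b"}): 90}


def toy_wcet(locked: FrozenSet[str]) -> float:
    return TOY_WCET.get(locked, 100)


def toy_gain(m: str, locked: FrozenSet[str]) -> float:
    return toy_wcet(locked) - toy_wcet(locked | {m})


result = greedy_partial_lock(["a", "b", "c"], {"a": 0, "b": 0, "c": 0},
                             {0: 2}, toy_gain, toy_wcet)
```

In the toy model the loop locks `a`, finds no further block with positive gain, and stops with one line left unlocked.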
Complexity Analysis We analyze the computational complexity of the heuristic
approach. For the WCET analysis, the abstract cache state analysis is a
fixed-point data flow analysis. Thus, the complexity depends on the cache configuration
and program control flow. We assume its complexity is O(w) for
the analysis on each cache set. For the cost-benefit analysis, the maximum
number of memory blocks we need to consider is M, where M is the total
number of memory blocks in the program. So, each time we select a memory
block to lock, the complexity is O(M) + O(w). The number of locked
memory blocks is bounded by A × K, where A is the associativity and K is
the number of cache sets. Thus, the final complexity of the heuristic approach is
O(A×K ×M) + O(A×K × w).
Table 4.1: Characteristics of benchmarks.












In this section, we present the experimental evaluation of partial cache lock-
ing. We compare both the optimal and the heuristic solutions with static cache
analysis [101] and the full cache locking approach proposed by Falk et al. [38].
4.5.1 Experimental Setup
We use the benchmarks from the MRTC benchmark suite [46] as shown in Table 4.1.
We compile our benchmarks for the SimpleScalar PISA (Portable ISA)
instruction set [21], a MIPS-like instruction set architecture, with a gcc cross-compiler.
The control flow graphs of these benchmarks are extracted and provided
as input to our cache locking analysis. Our framework is built on top of
the open-source WCET analysis tool Chronos [59]. The binary code size of
each program is shown in the second column of Table 4.1. We perform all the
experiments on a 2.53 GHz Intel Xeon CPU with 24 GB of memory. IBM CPLEX is
used as the ILP solver [3].
We assume only one level of instruction cache in the architecture. In other
words, an instruction access is either a cache hit or it has to be fetched from memory.
The cache hit latency is 1 cycle, while a cache miss takes 30 cycles. As we
are modeling the instruction cache, we assume a simple in-order processor with
unit latency for all data memory references.
4.5.2 Partial Cache Locking vs. Static Analysis
Figure 4.4 shows the WCET improvement of partial cache locking over static
analysis with no locking based on abstract interpretation [101]. The instruction
cache is 4-way set associative with block size of 32 bytes, and its capacity is
varied from 256B to 2KB.
Our partial cache locking technique significantly improves the WCET over
static analysis with no locking for many benchmarks (e.g., cnt, crc and qurt) for
different cache sizes. However, some benchmarks show limited improvement of
WCET via partial cache locking, especially when the cache size is small. This
is mainly due to the fact that locking memory blocks destroys the deterministic
access pattern for some unlocked blocks and the locking cost is high. Therefore,
our partial locking technique decides not to lock these memory blocks and the
result of partial locking is close to that of static analysis.
For most of the benchmarks, the improvement increases as the cache size
increases, because there is more space for locking and more memory blocks
can be locked into the cache. However, for some benchmarks, the improve-
ment decreases as cache size increases, for example fir. For fir, when the
cache size increases, more memory accesses become deterministic, which can
be successfully identified by static cache analysis. Thus, cache locking may not
help to improve the WCET much compared to static cache analysis. Overall,
more WCET improvement is observed as the cache size increases, as more space
is available for locking memory blocks. On average, improvements of 16%, 16%,
23% and 30% are achieved for the 256B, 512B, 1KB and 2KB caches, respectively.
4.5.3 Partial versus Full Cache Locking
There exist two typical full cache locking techniques as mentioned earlier [72,
38]. Even though Liu et al. [72] show that their approach can achieve better
WCET reduction compared to [38], it has several limitations. Liu et al. do
not consider the cache mapping function in the locking algorithm. They simply
assume that any memory block can be locked in any cache set (as if the cache
is a scratchpad memory). After locking decisions are made, they employ code
placement techniques that force the locked memory blocks to be mapped to the
appropriate cache sets. This can lead to code size blowup, which has not been
addressed in their work.
Thus we decide to compare our partial locking results with Falk et al.’s
method [38], as neither approach requires any subsequent code placement
technique. We choose memory blocks as locking granularity instead of proce-





















(a) Cache size: 256B (b) Cache size: 512B
(c) Cache size: 1KB
Figure 4.4: WCET improvement of partial cache locking (optimal and heuristic
solution) over static cache analysis with no locking (cache: 4-way set associa-
tive, 32-byte block).
of granularity does not change the greedy heuristic algorithm proposed in [38].
The instruction cache is 4-way associative with 32-byte blocks and capacity
varying from 256B to 2KB.
The WCET improvement of partial cache locking over Falk et al.’s method
is shown in Figure 4.5. Both optimal and heuristic partial locking approaches
outperform Falk et al.’s method for different cache sizes. Our partial cache
locking techniques usually lock part of the cache. Thus, after locking, there
are still some cache lines left for the unlocked memory blocks to exploit their
locality of accesses. However, in Falk et al.’s method, the cache is fully locked
and all the accesses to the unlocked memory blocks are cache misses. On
average, partial cache locking improves the WCET by 64%, 61%, 45% and 34%
over full cache locking for 256B, 512B, 1KB and 2KB caches, respectively.
When the cache is large enough to hold the entire program, all the memory
blocks can be locked to achieve the minimum WCET. In that scenario, partial
and full cache locking obtain identical solutions for some benchmarks (e.g., cnt,
crc, fir, and matmult).
(b) Cache size: 512B























































































Figure 4.5: WCET improvement of partial cache locking (optimal and heuristic
solution) over Falk et al.’s method (cache: 4-way set associative, 32-byte block).
4.5.4 Impact of Different Associativity
In this subsection, we evaluate our partial cache locking for different cache associativity
values. The 4-way cache associativity results have been presented.
Here we show the results of direct mapped and 2-way set associative caches,
while the block size remains constant at 32 bytes. Figures 4.6 and 4.7 present the
improvement of partial cache locking over static cache analysis with no locking
for direct mapped cache and 2-way set associative cache, respectively. Figures 4.8
and 4.9 present the improvement over full locking (Falk et al.’s method)
for different cache associativity.
It is observed that the WCET improvement for the direct mapped cache is not as
good as that for the 2-way and 4-way set associative caches, especially when the cache
size is small. For a direct mapped cache, there is only one cache line available in
each set. Locking a memory block in a cache set implies that all the accesses to
the other memory blocks in the cache set will be cache misses. Thus our partial
cache locking method decides not to lock any memory block for most of the
benchmarks. Therefore, the partial cache locking results are similar to those of
static cache analysis with no locking, especially when the cache size is small.
2-way set associative caches provide more opportunities for partial cache locking.














































































(a) Cache size:256B (b) Cache size:512B
(c) Cache size:1KB (d) Cache size:2KB
Figure 4.6: WCET improvement of partial cache locking over static cache analysis
(no locking) for direct mapped cache, 32-byte block.













































































(a) Cache size:256B (b) Cache size:512B
(c) Cache size:1KB (d) Cache size:2KB
Figure 4.7: WCET improvement of partial cache locking over static cache anal-
ysis (no locking) for 2-way set-associative cache, 32-byte block.
(a) Cache size:256B (b) Cache size:512B





















































































Figure 4.8: WCET improvement of partial cache locking over Falk et al.’s
method (full locking) for direct mapped cache, 32-byte block.
(a) Cache size:256B (b) Cache size:512B





















































































Figure 4.9: WCET improvement of partial cache locking over Falk et al.’s
method (full locking) for 2-way set-associative cache, 32-byte block.
Finally, partial cache locking outperforms full locking (Falk et al.’s method) for
different associativity.
4.5.5 Impact of Different Block Sizes
In this subsection, we evaluate our partial cache locking for different block sizes.
In the previous sections, we presented results for a 32-byte block size. Here we evaluate
the benefits of partial cache locking for a 64-byte block size. Figures 4.10
and 4.11 present the WCET improvement of partial cache locking over static
cache analysis with no locking and over full locking (Falk et al.’s method), respectively.
As shown, our partial locking still achieves significant improvement.











































































Figure 4.10: WCET improvement of partial cache locking over static cache
analysis (no locking) for 2-way set-associative cache, 64-byte block.
4.5.6 Optimal vs. Heuristic Approach
As shown in Figures 4.4 and 4.5, our heuristic approach obtains nearly the same
results as the optimal solution. Table 4.2 presents the average analysis time
of the different algorithms for all the benchmarks. Clearly, our heuristic approach
produces results comparable to the optimal solution while being more efficient in
analysis time.
(a) Cache size:256B (b) Cache size:512B





















































































Figure 4.11: WCET improvement of partial cache locking over Falk et al.’s
method (full locking) for 2-way set-associative cache, 64-byte block.
Table 4.2: Analysis time of different algorithms.
Benchmark    Optimal (sec)    Heuristic (sec)    Speedup
adpcm               313.37               1.28        245
cnt                   0.43               0.05          9
compress            145.61               0.33        441
crc                   1.44               0.10         14
edn                   1.07               0.16          7
fir                   0.10               0.02          5
jfdctint              0.35               0.06          6
matmult               0.37               0.07          5
minver              114.20               0.35        326
qurt                  1.20               0.13          9
4.5.7 Percentage of Lines Locked
Table 4.3: Percentage of lines locked in cache (cache: 4-way set associative,
32-byte block).
             512B cache (%)           1KB cache (%)
Benchmark    optimal    heuristic     optimal    heuristic
adpcm          25.00        25.00       56.25        56.25
cnt            25.00        25.00       65.63        75.00
compress       43.75        43.75       59.38        68.75
crc            68.75        68.75       71.88        71.88
edn            18.75        18.75       12.50        37.50
fir            75.00        75.00       40.63        87.50
jfdctint       50.00        75.00       75.00        75.00
matmult        18.75        18.75       40.63        46.88
minver         12.50        18.75       25.00        28.13
qurt           81.25        75.00       75.00        75.00
As mentioned before, the main strength of partial cache locking lies in the
fact that cache lines are locked judiciously after performing careful cost-benefit
analysis. If it is beneficial to keep a cache line unlocked so that multiple memory
blocks can benefit from it, partial cache locking can identify such situations. In
this subsection, we present the cache locking solutions derived by our partial
cache locking mechanisms (optimal and heuristic) when the cache is 4-way set
associative with 32-byte block size, and its capacity is varied from 512B to 1KB.
Table 4.3 presents the percentage of lines locked in the cache for different cache
configurations. As we can observe, for all the benchmarks, our partial cache
locking algorithms (optimal and heuristic) lock only a fraction of the cache lines.
The percentage of lines locked is generally lower when the cache size is small, as
the unlocked memory blocks need the remaining cache lines. As the cache size
increases, partial cache locking chooses to lock a higher percentage of cache lines.
These results clearly confirm that partial cache locking is indeed important to
minimize WCET compared to the two extreme ends of the spectrum of choices,
namely, full cache locking and no cache locking.
4.6 Discussion
The optimal approach models all possible concrete cache states at each program
point and employs an ILP formulation. However, this increases the complexity of the
method. When the cache size, program size or program control flow complexity
increases, the optimal approach may not be scalable. The heuristic approach
employs abstract cache state analysis and a greedy selection algorithm. Abstract
cache states are more compact representations than concrete cache states, and
the greedy selection algorithm is time-efficient. However, the heuristic approach
may occasionally not achieve the optimal results.
4.7 Summary
In this chapter, we propose partial cache locking for WCET reduction. We have
proposed an optimal partial locking solution based on concrete cache states as
well as a heuristic approach based on abstract cache states. Our partial cache
locking significantly reduces the WCET compared to the static cache analysis
and the state-of-the-art cache locking techniques that fully lock the cache. Our
heuristic achieves comparable WCET reduction to the optimal solution but it is
more efficient in terms of runtime.
Chapter 5
Partial Cache Locking for Multitasking
In this chapter, we extend the single-task cache locking of Chapter 4 to
multi-tasking real-time systems in order to improve schedulability and utilization.
5.1 Overview
In multi-tasking systems, there exist two different locking approaches. Puaut
and Decotigny [87] propose space sharing of the entire cache by locking a por-
tion of the cache per task, as shown in Figure 5.1. There are three tasks, T1, T2
and T3. Each of them locks a part of the cache. We call it PD-Locking approach
following the last names of the authors. The advantage of this approach is that
the cache content remains unchanged throughout the execution of the tasks. The
downside is that each task has access to only a fraction of the cache. Aparicio et
al. [14] observe this limitation and introduce time-multiplexed sharing of the
cache through locking, called the ASRV-Locking approach in the rest of the
chapter. Figure 5.2 presents an example of ASRV-Locking. Two tasks T1 and T2
are scheduled, and T2 has higher priority than T1. In this approach, a task has
exclusive access to the entire cache during execution and locks the cache with
its own memory blocks leading to improved WCET per task compared to PD-
Locking approach. However, when a task resumes execution after preemption, it
has to reload the locked cache blocks leading to significantly higher (but fixed)
preemption cost. Note that both approaches can bypass CRPD analysis as PD-
Locking does not require cache reloading at preemption, while ASRV-Locking
has fixed cache re-loading/locking cost at preemption. Because the entire cache









Figure 5.3: An example of our approach.
We propose a non-traditional approach that judiciously combines cache lock-
ing with cache modeling in preemptive multi-tasking real-time systems and
overcomes the space limitation of PD-Locking and cache reloading cost at pre-
emption for ASRV-Locking. Similar to PD-Locking, we adopt space sharing by
statically locking a portion of the cache per task but with a crucial difference.
We leave a portion of the cache unlocked and let the tasks take advantage of this
unlocked portion during execution through normal cache replacement policy.
That is, the locked portion of the cache is statically shared, while the unlocked
portion is time-multiplexed among the tasks. This relaxes the space constraint
for each task and eliminates cache re-loading at preemption. However, WCET
and CRPD analysis are required for the unlocked portion of the cache. An ex-
ample of our approach is illustrated in Figure 5.3.
Why do we want to give up the predictability offered through full cache
locking, but not embrace static cache analysis all the way? We believe cache
locking comes to the rescue when cache analysis is unable to conclusively classify
a memory access as hit or miss. At the same time, advancement in both WCET
and CRPD analysis ensures that we can analyze the cache behavior quite
precisely for most of the memory blocks. These memory blocks with predictable
access patterns can reside in the unlocked portion of the cache, providing
improved performance. Indeed, it is shown in Chapter 4 that, for a single task, a
partially locked cache yields a lower WCET than both static analysis
and full cache locking.
While we developed partial cache locking solution for a single task in Chap-
ter 4, it is challenging to make a locking decision for multi-tasking systems.
First, the change in execution time of a task impacts the schedulability of the
other tasks. Thus, we need to adopt a global optimization approach rather than
a local per task approach. Second, the unlocked portion of the cache requires
CRPD analysis. We propose an algorithm that employs accurate cost-benefit
analysis to capture the impact of locking a memory block on the WCET, CRPD,
and schedulability of all the tasks and thereby makes an informed decision to
choose the appropriate memory blocks for locking. We perform detailed ex-
perimental evaluation to validate the improved schedulability results of our ap-
proach.
5.2 Motivating Example
We illustrate the benefit of our integrated cache analysis and locking using the
example in Figure 5.4.
(a) WCET of various locking schemes (cache hit latency: 1 cycle)

Scheme                                  C1 (WCET of T1)                  C2 (WCET of T2)
Static Analysis (No Locking)            6 misses = 12 cycles             3 misses + 9 hits = 15 cycles
PD-Locking (Static Locking)             6 misses = 12 cycles             4 misses + 8 hits = 16 cycles
ASRV-Locking (Dynamic Locking)          2 misses + 4 hits = 8 cycles     4 misses + 8 hits = 16 cycles
Locking + Analysis (Partial Locking)    4 misses + 2 hits = 10 cycles    3 misses + 9 hits = 15 cycles

(b) Scheduling of tasks with RMS (CRPD = 2 cycles; ASRV-Locking cache load at
preemption = 4 cycles)

Figure 5.4: Motivating example.
5.2.1 WCET Comparison of Various Locking Schemes.
Figure 5.4(a) compares the WCET under various locking schemes. We assume
two tasks T1 and T2 with periods 24 cycles and 38 cycles. Cache hit latency






6) where mi represent a memory block, as shown in
Figure 5.5. The numbers on the loop back edges are the corresponding loop
bounds in Figure 5.5. We assume all the memory blocks are mapped to the
same cache set in a 2-way set associative cache. To simplify discussion, we
also assume that timing is solely determined by instruction cache effects and the












Figure 5.5: WCET path of T1 and T2.
Static Analysis estimates the WCET via cache modeling with no locking [101].
For T1, all the accesses are misses because 3 memory blocks compete for 2 cache
blocks, resulting in 6 cache misses and WCET C1 = 12 cycles. T2 has
3 cold misses while the remaining accesses are hits, resulting in C2 = 15 cycles.
PD-Locking statically locks the entire cache with memory blocks from T2 by
selecting the memory blocks with the highest access frequency/period [87]. Thus,
all accesses of T1 miss, while for T2 it depends on which blocks are locked.
ASRV-Locking employs dynamic locking where each task has exclusive access to the
entire cache. The WCET of T1 is reduced, but not the WCET of T2. Finally, our
Locking + Analysis judiciously chooses to lock only one block of T1 and leaves
the other cache block unlocked so that T2's accesses are still hits after the cold
misses.
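The WCET values in Figure 5.4(a) follow from simple hit/miss accounting. The sketch below reproduces them in Python, assuming a hit latency of 1 cycle (stated in the figure) and a miss latency of 2 cycles (inferred here from "6 misses = 12 cycles"); the function and variable names are illustrative only.

```python
HIT_LATENCY = 1   # cycles, as stated in Figure 5.4
MISS_LATENCY = 2  # cycles, inferred from "6 misses = 12 cycles" (assumption)

def wcet(misses, hits):
    """WCET when timing is determined solely by instruction cache behaviour."""
    return misses * MISS_LATENCY + hits * HIT_LATENCY

# (misses, hits) along the WCET path of T1 and of T2 for each scheme
schemes = {
    "Static Analysis":    ((6, 0), (3, 9)),
    "PD-Locking":         ((6, 0), (4, 8)),
    "ASRV-Locking":       ((2, 4), (4, 8)),
    "Locking + Analysis": ((4, 2), (3, 9)),
}

# Map each scheme to the pair (C1, C2)
wcets = {name: (wcet(*t1), wcet(*t2)) for name, (t1, t2) in schemes.items()}
```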
5.2.2 Scheduling Results of RMS
Let us assume that the reloading overhead for ASRV-Locking is 4 cycles and the
CRPD for Static Analysis and our Locking + Analysis is 2 cycles. The execution
of the tasks over the hyper-period is shown in Figure 5.4(b). The tasks are
scheduled with the Rate Monotonic Scheduling (RMS) policy, where the task with
the shortest period (T1) has the highest priority. TX,Y represents the Y-th instance
of task TX.
In Static Analysis, T1 with higher priority starts execution first, and its first
instance T1,1 is supposed to finish execution at cycle 12 (WCET is 12 cycles
under static analysis). Then, the first instance of T2, T2,1, is scheduled to exe-
cute. However, it is preempted by the second instance of T1 (T1,2) at cycle 24
before finishing execution. The execution of T1,2 lasts for 12 cycles and fin-
ishes execution at cycle 36. T2,1 resumes execution and the CRPD lasts for 2
cycles. However, the deadline of T2,1 is reached at cycle 38. Thus, T2,1 misses
its deadline.
In PD-Locking, T1 has the same WCET as in Static Analysis, while T2
has a larger WCET. Thus, the schedule is essentially the same as
in Static Analysis, leading to a deadline miss of T2,1.
In ASRV-Locking, 4 cycles are required to load and lock the cache contents
before execution for each instance of the tasks. Thus, T1,1 finishes execution
at cycle 12, and then T2,1 is scheduled. At cycle 24, T2,1 is preempted by T1,2
before finishing execution, and another 12 cycles are spent on T1,2. Thus, at
cycle 36, T2,1 resumes execution, and 4 cycles are required to relock the cache
contents. However, its deadline is reached at cycle 38, resulting in a deadline
miss of T2,1.
In our Locking + Analysis, T1,1 finishes execution at cycle 10 and T2,1 is
then scheduled. T2,1 is again preempted by T1,2 at cycle 24, at which point it
needs only one more cycle to finish execution. The execution of T1,2 lasts for 10
cycles, and then T2,1 resumes execution at cycle 34. It spends 2 cycles on the
CRPD and another cycle on execution. Thus, T2,1 finishes at cycle 37
and meets its deadline in our Locking + Analysis approach.
Static Analysis and PD-Locking fail to meet the deadline due to their
longer WCET. In ASRV-Locking, each task locks the entire cache and thus
has a lower WCET. However, every time a new task instance starts execution
or a preempted task resumes execution, we need to reload and lock the cache
with a 4-cycle penalty. The additional cache reloading overhead makes T2 miss its
first deadline. Due to time-multiplexing of the unlocked cache space among all
the tasks and the lower preemption cost, our Locking + Analysis solution has a
lower combined WCET and preemption delay than Static Analysis, PD-Locking and
ASRV-Locking. This
enables both T1 and T2 to meet their deadlines. A comparison with real task sets
will be presented later in the experimental evaluation section.
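The timelines described in this subsection can be replayed with a few lines of Python. This is a minimal sketch under the assumptions of the example: T1 has period 24 and runs first, T2,1 is preempted at most once (by T1,2), and the deadline of T2,1 is 38. The helper name `t21_finish` and its parameters are hypothetical and introduced only for illustration.

```python
def t21_finish(c1, c2, load, resume_penalty, p1=24):
    """Completion time of T2,1 when T1 (period p1) preempts it at most once.

    load:           cycles to load/lock the cache before a task instance starts
    resume_penalty: cycles paid when T2,1 resumes (CRPD or cache reload)
    """
    t = load + c1               # T1,1 runs first (higher priority)
    work = load + c2            # demand of T2,1, including its initial load
    ran = min(work, p1 - t)     # T2,1 runs until done or until T1,2 arrives
    work -= ran
    if work == 0:
        return t + ran
    t = p1 + load + c1          # T1,2 preempts at p1 and runs to completion
    return t + resume_penalty + work

DEADLINE_T2 = 38
finish = {
    "Static Analysis":    t21_finish(12, 15, load=0, resume_penalty=2),
    "PD-Locking":         t21_finish(12, 16, load=0, resume_penalty=0),
    "ASRV-Locking":       t21_finish(8, 16, load=4, resume_penalty=4),
    "Locking + Analysis": t21_finish(10, 15, load=0, resume_penalty=2),
}
meets = {name: f <= DEADLINE_T2 for name, f in finish.items()}
```

Only the Locking + Analysis configuration completes T2,1 before cycle 38 in this replay, which matches the discussion above.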
5.3 System Model
In this section, we present the basic models of caches and tasks.
Cache Model We define the cache as in Section 4.3 of Chapter 4. The cache
line (block) size is L, the number of cache sets is K and the associativity is A.
We assume the LRU (Least Recently Used) cache replacement policy.
A memory block m can be mapped to only one cache set (m modulo K).
To simplify the following discussion, we assume there is only one cache set, as
different cache sets do not interfere with each other. However, our locking
algorithm works with multiple cache sets. We use M to represent the set of
memory blocks mapped to a cache set. We also use ⊥ to indicate the absence of
any memory block in a cache line. Concrete cache states and abstract cache states
are also used in this work. We define the abstract cache state as in
Section 2.3.1 of Chapter 2 and the concrete cache state as in
Section 4.3 of Chapter 4.
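To make the cache model concrete, the following Python sketch models a single A-way LRU cache set and the set mapping m modulo K. The class and function names are illustrative, `None` plays the role of ⊥, and the cyclic access pattern in the usage lines is an assumed pattern, not one taken from the thesis.

```python
EMPTY = None  # plays the role of ⊥ (no memory block in the line)

class LRUSet:
    """One A-way cache set; index 0 holds the most recently used block."""

    def __init__(self, assoc):
        self.lines = [EMPTY] * assoc

    def access(self, block):
        """Access a block; return True on a hit and False on a miss."""
        hit = block in self.lines
        if hit:
            self.lines.remove(block)
        else:
            self.lines.pop()      # evict the LRU line (or drop an empty one)
        self.lines.insert(0, block)
        return hit

def cache_set_index(block, num_sets):
    """A memory block m maps to exactly one set: m modulo K."""
    return block % num_sets

# Three blocks cycling through a 2-way set: every access misses.
cache_set = LRUSet(assoc=2)
outcomes = [cache_set.access(b) for b in ["m1", "m2", "m3", "m1", "m2", "m3"]]
```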
Task Model We assume a preemptive multi-tasking real-time system running
on uni-processor with a set of N independent periodic tasks T = {T1, ..., TN}.
For each task Ti, we use Pi to represent its period and Ci to represent its WCET.
We assume the deadline Di = Pi. The Ci value is obtained by performing intra-
task WCET analysis for Ti. In other words, the WCET analysis is performed
in isolation per task. In processors with caches, we also need to account for
the delay due to preemption: the CRPD and the context switching cost. For
each task Ti, we use ∆i to denote the delay due to preemption. Let U be the
total processor utilization for the task set. A necessary condition for a feasible
schedule is
\[ U = \sum_{i=1}^{N} \frac{C_i + \Delta_i}{P_i} \le 1 \tag{5.1} \]
The delay due to preemption is
\[ \Delta_i = \sum_{T_j \in pt(T_i)} \left( CRPD(T_i, T_j) + CSC \right) \times n(T_i, T_j) \tag{5.2} \]
where pt(Ti) is the set of tasks that may preempt Ti, CRPD(Ti, Tj) is the CRPD
of Ti imposed by Tj in one preemption, n(Ti, Tj) is the bound on the number
of preemptions of Ti imposed by Tj, and CSC represents the context switching
cost.
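As an illustration of how the preemption delay enters the utilization, the sketch below assumes that the utilization is the sum of (Ci + ∆i)/Pi over all tasks and that ∆i accumulates (CRPD + CSC) × n over the tasks that may preempt Ti. The dictionary-based data layout and the numeric values (including a preemption bound of 1) are assumptions made for this example.

```python
def preemption_delay(i, preempters, crpd, n, csc):
    """Delta_i: sum over preempting tasks j of (CRPD(i, j) + CSC) * n(i, j)."""
    return sum((crpd[i, j] + csc) * n[i, j] for j in preempters[i])

def utilization(wcet, period, preempters, crpd, n, csc=0):
    """Total utilization with the preemption delay added to each WCET."""
    return sum(
        (wcet[i] + preemption_delay(i, preempters, crpd, n, csc)) / period[i]
        for i in wcet
    )

# Illustrative values loosely based on the motivating example
wcet = {"T1": 10, "T2": 15}
period = {"T1": 24, "T2": 38}
preempters = {"T1": [], "T2": ["T1"]}      # only T1 may preempt T2
crpd = {("T2", "T1"): 2}
n = {("T2", "T1"): 1}                      # assumed preemption bound
u = utilization(wcet, period, preempters, crpd, n)
```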
EDF Scheduling Earliest Deadline First (EDF) is a dynamic priority based
scheduling policy. The priority of a task is determined by its deadline. At any
time instance, EDF chooses the ready task with the closest deadline for
execution. For EDF, Equation 5.1 (U ≤ 1) is both a sufficient and a necessary
condition for a feasible schedule. The task set that may preempt T consists of all
the tasks that may have an earlier deadline than T [52].
RMS Scheduling Rate Monotonic Scheduler (RMS) is a static priority based
scheduling policy. The priority of a task is determined statically by its period.
Task Ti has higher priority than task Tj if Pi < Pj . Therefore, the set of tasks
that may preempt T is the set of tasks with higher priority. Unlike EDF, U ≤ 1
is not a sufficient condition for feasible schedule with RMS. There exists no
polynomial time schedulability test for RMS. An iterative method is employed
to estimate the response time of each task and compare it against the deadline:
\[ S_i^{n+1} = C_i + \sum_{T_j \in hp(T_i)} \left\lceil \frac{S_i^n}{P_j} \right\rceil \left( C_j + CRPD(T_i, T_j) + CSC \right) \tag{5.3} \]
where S_i^n is the response time of Ti in the n-th iteration, and hp(Ti) represents
the set of tasks that have higher priority than Ti.
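The iterative response time test can be sketched as a fixed-point loop. The code below assumes the standard recurrence S^{n+1} = Ci + Σ ⌈S^n / Pj⌉ × (Cj + CRPD(Ti, Tj) + CSC) over the higher priority tasks, starts from S^0 = Ci, and stops once the value stabilises or exceeds a limit (the period by default). The starting value, the stopping rule and the task parameters are assumptions chosen for illustration.

```python
import math

def response_time(i, wcet, period, hp, crpd, csc=0, limit=None):
    """Iterate the response time recurrence for task i until it converges.

    Returns the fixed point, or the first value exceeding `limit`
    (the period of task i by default), which signals a deadline miss.
    """
    limit = period[i] if limit is None else limit
    s = wcet[i]
    while True:
        nxt = wcet[i] + sum(
            math.ceil(s / period[j]) * (wcet[j] + crpd[i, j] + csc)
            for j in hp[i]
        )
        if nxt == s or nxt > limit:
            return nxt
        s = nxt

# Illustrative two-task set (values are made up for this example)
wcet = {"T1": 2, "T2": 4}
period = {"T1": 10, "T2": 20}
hp = {"T1": [], "T2": ["T1"]}
crpd = {("T2", "T1"): 1}
r1 = response_time("T1", wcet, period, hp, crpd)
r2 = response_time("T2", wcet, period, hp, crpd)
```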
5.4 Framework Overview
We first provide an overview of our integrated analysis and locking approach.
We propose to statically lock a part of the cache per task, while a part is left
unlocked to be used by all the tasks. The locked cache space is spatially shared
among the tasks, while the unlocked cache space is temporally shared by all the
tasks.
According to Equation 5.1, the execution time of a task depends on both
intra-task WCET and inter-task CRPD. The locked memory blocks do
not incur any CRPD, as they cannot be evicted from the cache, and their impact
on the WCET can be easily determined. However, for the unlocked memory
blocks, we still need to perform static analysis for both intra-task WCET and













Figure 5.6: Framework for Locking + Analysis approach.
Figure 5.6 illustrates the flow of our Locking + Analysis approach. We first
perform intra-task WCET analysis with abstract interpretation [101]. Mean-
while, we also perform inter-task CRPD analysis [54]. Then, for each memory
block in the task set, a cost-benefit analysis on WCET and CRPD is carried out
for cache locking. This cost-benefit analysis captures the impact of locking a
memory block on the WCET, CRPD, and schedulability of all the tasks. Based
on this cost-benefit analysis, we choose the most beneficial memory block to
lock. We perform intra-task WCET analysis and inter-task CRPD analysis again after
locking this memory block. We call it light WCET & CRPD analysis, because
it avoids some unnecessary cache analysis compared to the full-fledged WCET
& CRPD analysis. If either the schedulability or the utilization improves, we
continue to lock other memory blocks. Otherwise, the iterative process stops











Figure 5.7: WCET and CRPD Analysis.
5.5 WCET and CRPD Analysis
In the following, we present a brief description of the static analysis techniques
for intra-task WCET and inter-task CRPD estimation (see Figure 5.7). This
analysis ignores cache locking. But this background material is required to ap-
preciate the WCET and CRPD estimation in the presence of cache locking for
our cost-benefit analysis presented in the next section.
5.5.1 Intra-Task WCET
As shown in Figure 5.7, intra-task WCET analysis involves three steps. First, it
performs abstract interpretation — must and may analysis — based on abstract
cache states [101]. Must analysis determines the set of memory blocks that are
guaranteed to be present in the cache at a program point. May analysis captures
the set of memory blocks that may be present in the cache at a program point.
Next, the memory blocks are classified based on the must and may analysis. The
memory blocks present in the abstract cache states of must analysis are classi-
fied as always hits; the memory blocks not present in the abstract cache states of
may analysis are classified as always misses; the remaining memory blocks be-
long to non-classified category, i.e., they are assumed to be cache misses during




A preempted task T incurs CRPD as some of its useful memory blocks are
evicted from the cache by the preempting task T ′. The “useful” memory blocks
are those memory blocks that have been loaded into the cache before preemption
and may be accessed again after preemption. Thus, the key to CRPD analysis is
to determine the useful cache blocks of the preempted task and verify whether
they could survive after the preemption.
Recently, Kleinsorge et al. [54] proposed a CRPD estimation approach for
set associative caches that combines techniques for direct mapped caches [82]
with resilience analysis for set associative caches [9]. Concrete cache states
are fundamental to inter-task CRPD analysis [54, 82]. Given a program point,
it may be reached via multiple program paths, which leads to multiple con-
crete cache states. Thus, in general, it is infeasible to maintain all the possible
concrete cache states for large programs with complex control flow. Inter-task
CRPD analysis aims to identify the cache states with the largest number of use-
ful memory blocks and thus higher preemption delay. The subsumed cache
states (with lower number of useful memory blocks) can be safely removed.
Hence, it is feasible to use concrete cache states for inter-task CRPD analysis as
shown in [54, 82]. We adopt the technique in [54] to estimate the CRPD for a
single preemption. The approach in [54] depends on the computation of UCB
(useful cache blocks) and ECB (evicting cache blocks). The resilience of a UCB
defines the maximum number of allowed memory accesses from the preempting
task before it can be evicted. Figure 5.7 shows the four steps of CRPD analysis
in [54]. The details of these four steps are shown as follows.
RCS and LCS Computation The useful memory blocks are computed using
two different types of cache states: the reaching cache states (RCS) and live
cache states (LCS). At a program point p, RCSp is the set of possible cache
states when p is reached via any incoming program path. Conversely, at a pro-
gram point p, LCSp represents the set of possible cache states via any outgoing
program path from p. The cache states RCS and LCS can be computed via
forward/backward fix-point data flow analysis [54, 82].
UCB and ECB Computation We use UCBp to denote the set of useful mem-
ory blocks at program point p. UCBp is computed as
UCBp = {c ∩ c′|c ∈ RCSp, c′ ∈ LCSp} (5.4)
where c ∩ c′ is defined as
c ∩ c′ = {b|∃ 0 ≤ j < A s.t. c[j] = b, ∃ 0 ≤ k < A s.t. c′[k] = b} (5.5)
Evicting cache block (ECB) captures the memory blocks that may be accessed
during the execution of the preempting task. Thus
ECB = RCSexit (5.6)
where exit is the exit point of the preempting task.
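Equations 5.4 and 5.5 can be read as a pairwise intersection over reaching and live cache states. A small Python sketch follows, assuming each concrete cache state of a set is a tuple of length A with `None` standing for ⊥; ignoring ⊥ in the intersection is a choice made here, and the function names are illustrative.

```python
from itertools import product

def intersect(c, c_prime):
    """Blocks present in both concrete cache states (Equation 5.5)."""
    return frozenset(b for b in c if b is not None and b in c_prime)

def ucb(rcs_p, lcs_p):
    """Equation 5.4: one set of useful blocks per (RCS, LCS) state pair."""
    return {intersect(c, cp) for c, cp in product(rcs_p, lcs_p)}

# Two reaching states and one live state for a 2-way set
rcs_p = {("m1", "m2"), ("m2", "m3")}
lcs_p = {("m1", "m2")}
useful = ucb(rcs_p, lcs_p)
```

For the preempting task, the evicting blocks are simply the reaching cache states at its exit point (Equation 5.6), so no separate helper is needed for them.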
Cache Block Resilience Given a useful memory block m at a program point
p, its survival upon preemption depends on its resilience to the preempting task.
We define its resilience resmp as the maximum number of allowed memory ac-
cesses from the preempting task before m can be evicted and is computed as
follows. We define the distance of useful memory block m at program point p
distancemp =
{





where age↓m and age↑m denote the maximum age of m in RCSp and LCSp,
respectively. Then, the resilience is defined as
resmp = (A− 1)− distancemp (5.8)
CRPD Computation We can now bound the CRPD based on the UCB of
preempted task T and ECB of preempting task T′. Let UCBp be the set of
useful cache blocks at a program point p of the preempted task T and ECB be the
set of evicting cache blocks of the preempting task T′. For any u ∈ UCBp and
e ∈ ECB
CRPDp(u,e) = |u \ {m|resmp ≥ |e|}| × CRT (5.9)
where CRT is the reloading overhead of one memory block. Then, the CRPD
at this program point p is the maximum among all the possible combinations of
u and e:
\[ CRPD_p = \max_{u \in UCB_p,\ e \in ECB} CRPD_p(u, e) \tag{5.10} \]
The CRPD for this preemption is the maximum CRPD over all the program
points. That is
\[ CRPD(T, T') = \max_{p \in PP} CRPD_p \tag{5.11} \]
where PP is the set of program points of the preempted task T .
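The CRPD bound of Equations 5.9 to 5.11 can be sketched as nested maxima. In the Python sketch below the resilience values are taken as given inputs, each u and e is represented as a set of blocks, and a block counts as evicted when its resilience is smaller than |e|; the data layout and names are assumptions for illustration.

```python
def crpd_at_point(ucb_p, ecb, resilience, crt):
    """Worst case over all (u, e) pairs at one program point."""
    worst = 0
    for u in ucb_p:
        for e in ecb:
            # Blocks whose resilience is at least |e| survive the preemption.
            evicted = [m for m in u if resilience[m] < len(e)]
            worst = max(worst, len(evicted) * crt)
    return worst

def crpd(points, ecb, crt):
    """Maximum over the program points of the preempted task.

    points: list of (UCB_p, resilience_p) pairs, one per program point.
    """
    return max(crpd_at_point(u, ecb, res, crt) for u, res in points)

# One program point with useful blocks m1 (resilience 1) and m2 (resilience 0)
points = [([frozenset({"m1", "m2"})], {"m1": 1, "m2": 0})]
small = crpd(points, [frozenset({"x"})], crt=2)        # one evicting block
large = crpd(points, [frozenset({"x", "y"})], crt=2)   # two evicting blocks
```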
During the execution of task T , it is possible that higher priority task T ′
preempts lower priority task T multiple times. The preemption bound imposed
by T ′ on T is denoted as n(T, T ′) in Equation 5.2. n(T, T ′) depends on the
scheduling policy. For EDF scheduling policy, we use the approach in [52] to
bound n(T, T ′); for RMS scheduling, n(T, T ′) is a by-product of the response
time computation as shown in [71] and Equation 5.3.
5.6 Locking Algorithm for Multitasking
As we have mentioned, existing locking techniques for multi-tasking systems
allocate the entire cache for locking [14, 87]. Such locking techniques eliminate
the CRPD analysis at the expense of poor performance. Thus, it is relatively
easy to compute the memory blocks to be locked. In our approach, in contrast, there
is a complex interplay between cache locking and its impact on schedulability
analysis. When a memory block of task T is locked in the cache, T generally
benefits from the locking. However, it also takes away valuable cache space
from the remaining tasks and also changes their CRPD. Any exact locking al-
gorithm for our approach will have exponential complexity. Thus we design an
efficient heuristic to decide on the memory blocks to be locked.
As noted earlier, we first perform intra-task WCET analysis and inter-task
CRPD for each task in the task set (see Figure 5.6) when no memory block
is locked. Then we compute the processor utilization and response time for
each task. As a by-product of intra-task WCET analysis, we have the abstract
cache states (must and may) at each program point. We also collect the memory
blocks along the WCET path and their execution frequencies for each task. Sim-
ilarly, we record the worst-case preemption point and the corresponding UCB
and ECB for each task during CRPD analysis. We design an iterative solution
to select the memory blocks for locking. In each iteration, we choose the most
beneficial memory block for locking. The benefit of locking a memory block is
defined differently for different scheduling policies (see Section 5.6.3). We stop
this process when there is no benefit due to locking and the remaining cache
space is left unlocked.
The cost and benefit of locking is based on the following observations.
Given a memory block m ∈ T , if m is locked, then all the accesses to m are
cache hits. But as cache size is reduced, it might have negative impact on the
other memory blocks mapped to the same cache set for all the tasks including
T . For task T , its intra-task WCET might be improved if the benefit of locking
m is greater than the cost on other memory blocks. However, for other tasks,
there is no positive effect on their intra-task WCET. Finally, the CRPDs for all
the tasks are usually reduced as the effective cache size is reduced after locking
m. In the following, we show how to estimate the cost and benefit of locking
memory block m. We assume m ∈ T and m is mapped to cache set s.
5.6.1 Cost-benefit analysis within a task
We only consider the memory blocks of task T along the WCET path for locking
as locking the other memory blocks has no benefit. Let m be a memory block
along the WCET path of T and fm be the execution frequency of m along the
WCET path of T. We use latm to denote the access latency of memory block
m; latm is determined by the classification (cache hit or cache miss) of memory
block m in must/may analysis. We use lathit and latmiss to represent the cache
hit and miss latency, respectively. Then, the benefit of locking m on the WCET
of T is
\[ wcet\_benefit^T_m = (lat_m - lat_{hit}) \times f_m \tag{5.12} \]
However, locking m may also have negative impact on the other memory
blocks of T mapped to the same cache set s as the number of cache blocks in set
s is now reduced by one. Let C be the abstract cache state for must analysis of
set s. If m ∈ C , m is classified as cache hit before cache locking, thus locking
m does not evict any other memory block from the cache and wcet costTm = 0.
However, if m /∈ C, locking m will evict out the memory block m′ with age
A − 1 in C from the cache, which results in cache miss for the accesses of m′.
In this case, the cost of locking m is
\[ wcet\_cost^T_m = \sum_{(m' \in M_s) \wedge (age^C_{m'} = A-1)} (lat_{miss} - lat_{hit}) \times f_{m'} \tag{5.13} \]
where Ms is the set of memory blocks mapped to set s in T , and fm′ indicates
the execution frequency of m′ along the worst-case path. Therefore the WCET
gain of T by locking m is
\[ wcet\_gain^T_m = wcet\_benefit^T_m - wcet\_cost^T_m \tag{5.14} \]
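The intra-task WCET gain of Equations 5.12 to 5.14 can be computed directly from the must state of set s. The following Python sketch assumes the must state is given as a dictionary from block to age, and uses hit and miss latencies of 1 and 2 cycles as illustrative defaults; all names and values are hypothetical.

```python
def wcet_gain(m, freq, lat, must_age, assoc, lat_hit=1, lat_miss=2):
    """WCET gain of locking block m for its own task.

    freq:     execution frequency of each block along the WCET path
    lat:      current access latency of each block (hit or miss latency)
    must_age: block -> age in the must state C of the cache set of m
    """
    benefit = (lat[m] - lat_hit) * freq[m]            # Equation 5.12
    if m in must_age:
        cost = 0          # m is already a guaranteed hit; nothing is evicted
    else:
        # Blocks with age A - 1 fall out of the shrunken set (Equation 5.13)
        cost = sum(
            (lat_miss - lat_hit) * freq[b]
            for b, age in must_age.items()
            if age == assoc - 1
        )
    return benefit - cost                              # Equation 5.14

# 2-way set: m2 and m3 are guaranteed hits, m1 currently misses
freq = {"m1": 5, "m2": 3, "m3": 2}
lat = {"m1": 2, "m2": 1, "m3": 1}
must_age = {"m2": 0, "m3": 1}
gain_m1 = wcet_gain("m1", freq, lat, must_age, assoc=2)
```

Here locking m1 saves 5 cycles on m1 but turns the 2 accesses to m3 into misses, so the net gain is smaller than the raw benefit.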
Apart from the influence on the intra-task WCET of T , locking m may also
affect the CRPD of T . We assume T is preempted by another task T ′. Ob-
viously, locking m will not generate any new useful cache block because the
cache size is reduced. As mentioned before, we record the UCB, ECB and the
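preemption point that lead to the worst-case CRPD for this preemption. Suppose
$M^u_s$ is the set of useful memory blocks of T mapped to set s at this preemption
point, and $M^{u'}_s$ ($M^{u'}_s \subset M^u_s$) is the set of blocks that contribute
to the CRPD before locking
m. To model the effect of locking m, we update ECB of set s and the resilience
of any block in $M^u_s$. With the new ECB and updated resilience, we can obtain
$M^{u''}_s \subset M^u_s$, the new set of blocks that contribute to the CRPD after
locking m. The CRPD gain of T for a preemption by T′ is then
\[ crpd\_gain^{TT'}_m = (|M^{u'}_s| - |M^{u''}_s|) \times (lat_{miss} - lat_{hit}) \tag{5.15} \]
and the total CRPD gain of T by locking m is
\[ crpd\_gain^T_m = \sum_{T' \in pt(T)} crpd\_gain^{TT'}_m \times n(T, T') \tag{5.16} \]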
where pt(T ) is the set of tasks that may preempt T and n(T, T ′) is the bound on
the number of preemptions of T imposed by T ′. Finally the overall execution
time gain of T by locking m is
\[ time\_gain^T_m = wcet\_gain^T_m + crpd\_gain^T_m \tag{5.17} \]
5.6.2 Cost-benefit analysis of other tasks
Let T′ ≠ T and m′ ∈ T′ be a memory block along the WCET path of T′. We
assume m′ and m are mapped to the same cache set s and m′ is in the abstract
cache state of must analysis C. If the age of m′ is A − 1, then locking m will
evict m′ out of the cache. Thus, locking m has a negative impact on the WCET of
T′:
\[ wcet\_cost^{T'}_m = \sum_{(m' \in M'_s) \wedge (age^C_{m'} = A-1)} (lat_{miss} - lat_{hit}) \times f_{m'} \tag{5.18} \]
where M′s is the set of memory blocks mapped to set s in T′ and fm′ is the
execution frequency of m′ along the WCET path.
The CRPD gain of T′ by locking m, $crpd\_gain^{T'}_m$, can be obtained via the
same approach as in Section 5.6.1. Thus, the overall execution time gain of task
T′ by locking m is
\[ time\_gain^{T'}_m = crpd\_gain^{T'}_m - wcet\_cost^{T'}_m \tag{5.19} \]
5.6.3 Memory block selection strategy
We design different memory block selection strategies for EDF and RMS schedul-
ing policies.
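EDF Scheduling Equation 5.1 is a sufficient and necessary condition for a feasible
schedule. Thus we select the memory blocks based on their impact on total
utilization. The utilization gain of locking a memory block m ∈ T is
\[ util\_gain_m = \frac{time\_gain^T_m}{P} + \sum_{T' \in \mathcal{T} \setminus \{T\}} \frac{time\_gain^{T'}_m}{P'} \tag{5.20} \]
where P is the period of task T, P′ is the period of task T′ and $\mathcal{T}$ is the task set.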
The utilization gain of locking a memory block is used as a metric to select the
memory blocks for locking. In each iteration, we select the memory block with
maximum utilization gain over all memory blocks in the task set.
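The selection step for EDF can be sketched as follows, assuming the utilization gain of a candidate block is the sum over all tasks of the execution time gain divided by the task period (as in lines 18-19 of Algorithm 3). The function names and the candidate values are illustrative assumptions.

```python
def utilization_gain(time_gain, period):
    """Sum over all tasks of time_gain / period for one candidate block."""
    return sum(time_gain[t] / period[t] for t in time_gain)

def pick_block(candidates, period):
    """Return (block, gain) with the largest positive gain, or (None, 0.0)."""
    best, best_gain = None, 0.0
    for m, time_gain in candidates.items():
        g = utilization_gain(time_gain, period)
        if g > best_gain:
            best, best_gain = m, g
    return best, best_gain

period = {"T1": 24, "T2": 38}
# Per-task execution time gain (in cycles) of locking each candidate block
candidates = {
    "m1": {"T1": 2, "T2": 0},    # helps T1, neutral for T2
    "m4": {"T1": -1, "T2": 1},   # helps T2 but costs T1 more in utilization
}
block, gain = pick_block(candidates, period)
```

Note that m4 is rejected even though it shortens T2, because the cycle it costs T1 weighs more once divided by the shorter period.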
RMS Scheduling Utilization (Equation 5.1) is not a sufficient condition for
feasible schedule in RMS. Thus, for RMS, we first need to ensure the schedula-
bility of the task set. For each task, its response time can be computed using the
iterative method provided by Equation 5.3. We focus on the tasks with response
time greater than their deadline, and among them try to optimize the response
time of the task with highest priority first. Based on Equation 5.3, in order to
improve the response time of a task T, we can either reduce the execution time
of T, or reduce the execution time of the tasks with higher priority than T. So,
when we try to lock a memory block m ∈ T, the corresponding response time
gain of T is
\[ rsp\_gain^T_m = wcet\_gain^T_m + \sum_{T' \in hp(T)} (crpd\_gain^{TT'}_m - wcet\_cost^{T'}_m) \times n(T, T') \tag{5.21} \]
where hp(T) is the set of tasks with higher priority than T. When we try to
lock a memory block m′ of a task T′ with higher priority than T, the corresponding
response time gain of T is
\[ rsp\_gain^T_{m'} = (crpd\_gain^{TT'}_{m'} + wcet\_gain^{T'}_{m'}) \times n(T, T') - wcet\_cost^T_{m'} + \sum_{T'' \in hp(T) \setminus \{T'\}} (crpd\_gain^{TT''}_{m'} - wcet\_cost^{T''}_{m'}) \times n(T, T'') \tag{5.22} \]
where T′′ is a task with higher priority than T and T′′ ≠ T′, and n(T, T′′)
represents the bound on the number of preemptions imposed on T by T′′. The WCET
and CRPD gains are different for T′ and T′′ through locking of m′, but both
of them contribute to the response time gain of T. Thus, $rsp\_gain^T_{m'}$ includes
both of them. Because m′ ∈ T′, $wcet\_gain^{T'}_{m'}$ is obtained via the approach
in Section 5.6.1. Meanwhile, m′ ∉ T and m′ ∉ T′′; thus, $wcet\_cost^T_{m'}$ and
$wcet\_cost^{T''}_{m'}$ are computed via the approach in Section 5.6.2.
$crpd\_gain^{TT'}_{m'}$ and $crpd\_gain^{TT''}_{m'}$ can be obtained via the same
approach as in Section 5.6.1.
We select the memory block with the maximum response time gain to lock,
while at the same time we ensure the utilization gain of this block to be non-
negative. After all the tasks are schedulable, we apply the same method used for
EDF scheduling to further minimize the utilization. For both scheduling poli-
cies, after each iteration, we recompute the abstract cache states of set s where
the selected memory block m is mapped to, and then recompute the WCET.
Similarly, the cache states for CRPD computation at each program point are also
updated. We then recompute the UCB and ECB for each task in task set and ob-
tain the new CRPD. Based on the new WCET and CRPD, we derive the metric
value. If there is improvement, we continue to lock. The iterative approach
stops only when all the memory blocks are locked or there is no improvement
after locking any memory block.
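The iterative selection described above can be summarised as a generic greedy loop. The Python sketch below is a simplified skeleton rather than the full algorithm: the cost-benefit estimate and the re-analysis step are passed in as hypothetical callbacks (`evaluate` and `metric`), the metric is assumed to be "lower is better" (e.g. utilization), and the check for fully locked cache sets is omitted.

```python
def greedy_lock(candidates, evaluate, metric):
    """Lock blocks one at a time while the re-analysed metric improves.

    candidates:          iterable of lockable memory blocks
    evaluate(locked, m): estimated gain of additionally locking m
    metric(locked):      re-analysed metric for a set of locked blocks
    """
    locked = set()
    current = metric(locked)
    while True:
        free = [m for m in candidates if m not in locked]
        if not free:
            break                         # every block is locked
        best = max(free, key=lambda m: evaluate(locked, m))
        if evaluate(locked, best) <= 0:
            break                         # no candidate promises a benefit
        new = metric(locked | {best})
        if new >= current:
            break                         # estimated gain did not materialise
        locked.add(best)
        current = new
    return locked, current

# Toy instance: each block reduces utilization by a fixed, made-up amount
savings = {"a": 0.2, "b": 0.1, "c": -0.05}
locked, util = greedy_lock(
    savings,
    evaluate=lambda locked, m: savings[m],
    metric=lambda locked: 1.0 - sum(savings[m] for m in locked),
)
```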
5.6.4 Integrated Locking + Analysis Algorithms
In this section, we present the detailed Locking + Analysis algorithms used in
our approach, including the cost-benefit analysis algorithm, the utilization
optimization algorithm and the schedulability improvement algorithm. The details
are as follows.
Cost-benefit Analysis Algorithm Algorithm 2 presents the detailed cost-benefit
analysis by locking a memory block m. For each task Ti in the task set T , if m
belongs to Ti, then there may be WCET benefit for Ti by locking m as all the
accesses to m are cache hits after locking (line 5). On the other hand, locking m
also impacts the other memory blocks mapped to the same cache set in Ti as the
Algorithm 2: Cost-benefit analysis on WCET and CRPD
Input: Task set T = {T1, T2, ..., TN}, cache configuration config and
       candidate memory block m
Output: WCET gain wcet_gain^{Ti}_m and CRPD gain crpd_gain^{Ti}_m by
        locking memory block m for each task Ti ∈ T
1  begin
2    foreach Ti ∈ T do
3      Suppose Mi is the set of memory blocks of Ti;
4      if m ∈ Mi then
5        wcet_benefit^{Ti}_m = wcet_benefit_self();
6        wcet_cost^{Ti}_m = wcet_cost_self();
7      else
8        wcet_benefit^{Ti}_m = 0;
9        wcet_cost^{Ti}_m = wcet_cost_others();
10     wcet_gain^{Ti}_m = wcet_benefit^{Ti}_m - wcet_cost^{Ti}_m;
11     crpd_gain^{Ti}_m = crpd_cost_benefit_analysis();
12     time_gain^{Ti}_m = wcet_gain^{Ti}_m + crpd_gain^{Ti}_m;
effective cache size is reduced after locking m, which may lead to more cache
misses. Therefore we compute the cost of locking m for Ti (line 6). If m does
not belong to Ti, there is no benefit for Ti from locking m. Thus we
only calculate the cost for Ti of locking m (lines 8-9). The WCET gain for Ti
considers both the WCET benefit and the WCET cost (line 10). Locking memory
block m also impacts the CRPD, as the effective cache size is reduced after
locking. Thus, we also compute the corresponding cost and benefit of CRPD
for task Ti by locking m (line 11). Finally, the overall execution time gain for
task Ti includes both the WCET gain and the CRPD gain (line 12).
Utilization Optimization Algorithm Algorithm 3 shows the details of uti-
lization optimization. We first perform one round of WCET and CRPD analysis
for each task Ti ∈ T (line 3-9). For each task Ti, we perform abstract cache
state analysis and compute the WCET (line 4-5). We also perform RCS and
LCS analysis for each task Ti (line 6). Based on the RCS and LCS analysis
results, we calculate the UCB and ECB, as well as the resilience for each useful
cache block (line 7-8). Then we do the CRPD analysis for the task set (line 9).
With the CRPD and WCET, the initial utilization of the task set is then
computed (line 10). Later, we iteratively select memory blocks with the maximum
utilization gain to lock. For each candidate memory block m, we first check
whether it is locked or not (line 16). Meanwhile, we check whether the corre-
sponding cache set is fully locked or not (line 16). If m has been locked or there
Algorithm 3: Utilization Optimization for EDF and RMS
Input: Task set T = {T1, T2, ..., TN} and cache configuration config
Output: Set of locked memory blocks lock_set and utilization after
        locking util
1 begin
2   stop_locking := false; lock_set := null;
3   foreach Ti ∈ T do
4     abstract_cache_states_analysis(Ti, config);
5     wcet_analysis();
6     rcs_lcs_analysis(Ti, config);
7     ucb_ecb_computation();
8     resilience_computation();
9   crpd_analysis(T);
10  util = utilization_computation(T);
11  while (!stop_locking) do
12    mblk := null; util_gain_mblk := 0;
13    foreach Ti ∈ T do
14      foreach m ∈ Mi do
15        Suppose m is mapped to cache set s;
16        if m ∉ lock_set ∧ !is_fully_locked(s) then
17          cost_benefit_analysis(T, config, m);
18          foreach Ti ∈ T do
19            util_gain^{Ti}_m = time_gain^{Ti}_m / Pi;
20          util_gain_m = Σ_{Ti ∈ T} util_gain^{Ti}_m;
21          if util_gain_m > util_gain_mblk then
22            util_gain_mblk = util_gain_m;
23            mblk = m;
24    if mblk ≠ null then
25      lock_to_cache(mblk);
26      foreach Ti ∈ T do
27        update_abstract_cache_state(Ti, mblk, config);
28        wcet_analysis();
29        update_rcs_lcs(Ti, mblk, config);
30        ucb_ecb_computation();
31        resilience_computation();
32      crpd_analysis(T);
33      new_util = utilization_computation(T);
34      if new_util < util then
35        util = new_util;
36        lock_set := lock_set ∪ {mblk};
37      else
38        stop_locking := true;
39    else
40      stop_locking := true;
is no free space in the corresponding cache set that m is mapped to, we skip m
and try other candidates. When we find a memory block m that can be locked,
we first perform the cost-benefit analysis on WCET and CRPD by using Algo-
rithm 2 (line 17). Then, we calculate the utilization gain for each task in the
task set (line 18-19). The total utilization gain for the entire task set by locking
m is the sum of the utilization gains of all the tasks in the task set (line 20). We
compare m with the candidate memory block mblk that currently has the most
utilization gain (line 21). If m has more utilization gain than mblk, we update
mblk with m (line 22-23). We continue to do this until all candidate memory
blocks are considered. If we find no memory block with positive utilization
gain, this algorithm will terminate (line 39-40). Otherwise, we will end up with
a memory block mblk that has the maximum utilization gain. We lock mblk
into the cache (line 25). For each task, we update the abstract cache states in the
cache set that mblk is mapped to, and recompute the WCET (line 27-28). We also
update the RCS and LCS for this particular cache set, and recompute the UCB,
ECB and resilience (line 29-31). After the resilience for each useful cache block
is updated, we perform the CRPD analysis again to get the new CRPD (line 32).
Based on the new WCET and CRPD, we obtain the new utilization of the task
set (line 33). If there is improvement on utilization of the task set, we update the
utilization and add mblk to the set of locked memory blocks, and continue to
lock other memory blocks (line 35-36). Otherwise we stop locking and obtain
the final results (line 38).
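The select-then-verify structure of Algorithm 3 can be illustrated with a simplified sketch. It is only an outline under stated assumptions: `estimate_gain` stands in for the cost-benefit analysis of Algorithm 2 followed by the summation on line 20, `utilization` stands in for the full WCET and CRPD re-analysis, and the check for fully locked cache sets is omitted. All three names are hypothetical.

```python
from typing import Callable, FrozenSet, Hashable, List, Sequence, Tuple


def greedy_lock(
    candidates: Sequence[Hashable],
    estimate_gain: Callable[[Hashable, FrozenSet], float],
    utilization: Callable[[FrozenSet], float],
) -> Tuple[List[Hashable], float]:
    """Greedily lock the block with the largest estimated utilization gain."""
    locked: List[Hashable] = []
    util = utilization(frozenset(locked))
    while True:
        # Selection step: find the unlocked candidate with the largest positive gain.
        best, best_gain = None, 0.0
        for m in candidates:
            if m in locked:
                continue
            gain = estimate_gain(m, frozenset(locked))
            if gain > best_gain:
                best, best_gain = m, gain
        if best is None:  # no candidate with positive gain: terminate
            break
        # Verification step: re-analyse and keep the block only if utilization improves.
        new_util = utilization(frozenset(locked + [best]))
        if new_util >= util:
            break
        locked.append(best)
        util = new_util
    return locked, util
```

The verification step mirrors lines 33-38: the estimated gain guides the choice, but a block is only added to the locked set when the recomputed utilization actually decreases.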
Schedulability Improvement Algorithm. Algorithm 4 presents the detailed
approach to improve schedulability for RMS. For a task set T in RMS, since
Equation 5.1 is not a sufficient condition for feasible schedule in RMS, we
should first check the schedulability of T based on the response time of each
task. Therefore, we also need to perform one round of WCET and CRPD
analysis first for the task set as we did in Algorithm 3 (line 3-9). Then, apart from
computing the utilization for the task set (line 10), we also need to calculate the
corresponding response time for each task in the task set (line 11). We check
the schedulability by comparing response time of each task with its deadline.
If all tasks meet their deadline, we set the boolean variable is_sch to true (line
13). In this case, we stop locking for improving schedulability, and continue to
optimize utilization with Algorithm 3 (line 14-15). Otherwise, we choose the
highest priority task T among the tasks that do not meet their deadline, and try
to improve its response time first (line 16). Based on Equation 5.3, the response
time of T is mainly determined by T and the tasks that can preempt T . Thus, we
Algorithm 4: Schedulability Improvement for RMS
Input: Task set T = {T1, T2, ..., TN} and cache configuration config
Output: Set of locked memory blocks lock_set and utilization after locking util
1 begin
2   stop_locking := false; lock_set := null;
3   foreach Ti ∈ T do
4     abstract_cache_states_analysis(Ti, config);
5     wcet_analysis();
6     rcs_lcs_analysis(Ti, config);
7     ucb_ecb_computation();
8     resilience_computation();
9   crpd_analysis(T);
10  util = utilization_computation(T);
11  response_time_computation(T);
12  while (!stop_locking) do
13    is_sch = check_schedulability(T);
14    if is_sch == true then
15      break;
16    Suppose T is the task with highest priority that cannot be scheduled;
17    rsp_gain^{T}_mblk := 0; mblk := null;
18    Suppose hp(T) is the set of tasks with higher priority than T;
19    foreach Ti ∈ {T} ∪ hp(T) do
20      foreach m ∈ Mi do
21        Suppose m is mapped to cache set s;
22        if m ∉ lock_set ∧ !is_fully_locked(s) then
23          cost_benefit_analysis(T, config, m);
24          foreach Ti ∈ T do
25            util_gain^{Ti}_m = time_gain^{Ti}_m / Pi;
26          util_gain_m = Σ_{Ti ∈ T} util_gain^{Ti}_m;
27          rsp_gain^{T}_m = response_time_gain();
28          if rsp_gain^{T}_m > rsp_gain^{T}_mblk ∧ util_gain_m >= 0 then
29            rsp_gain^{T}_mblk = rsp_gain^{T}_m;
30            mblk = m;
31    if mblk ≠ null then
32      lock_to_cache(mblk);
33      foreach Ti ∈ T do
34        update_abstract_cache_state(Ti, mblk, config);
35        wcet_analysis();
36        update_rcs_lcs(Ti, mblk, config);
37        ucb_ecb_computation();
38        resilience_computation();
39      crpd_analysis(T);
40      new_util = utilization_computation(T);
41      response_time_computation(T);
42      if new_rsp_T < rsp_T ∧ new_util <= util then
43        lock_set := lock_set ∪ {mblk};
44      else
45        stop_locking := true;
46    else
47      stop_locking := true;
only consider locking memory blocks that belong to T or to tasks that can preempt T
(line 18-19). For such a memory block m,
if it is not locked and there is free space in the corresponding cache set, we
carry out its cost and benefit analysis (line 22-23). After that, we perform uti-
lization gain analysis by locking m as we did in Algorithm 3 (line 24-26). Apart
from the utilization gain, we also compute the response time gain by locking
m (line 27). We compare the response time gain between m and mblk that
currently has the most response time gain. If m has higher response time gain
than mblk and the utilization gain of m is not negative, we update mblk with
m (line 28-30). We continue to do this until all candidate memory blocks are
considered. If there is no suitable memory block to lock, we stop locking (line
47). Otherwise, we select the memory block mblk with the maximum response time
gain on T to lock (line 32). Then, we recompute the new utilization as we do
in Algorithm 3, as well as the new response time (line 33-41). If utilization of
the task set does not become worse and there is response time improvement on
T , we add mblk to the set of locked memory blocks and continue to check the
schedulability for the task set (line 43). Otherwise, we stop locking (line 45).
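The schedulability check and the choice of the task to improve can be sketched as below. Equation 5.3 is not reproduced in this part of the text, so the sketch assumes the classical response-time recurrence for fixed-priority scheduling, R = C + sum over higher-priority tasks of ceil(R / Pj) * Cj, with any CRPD and context-switch cost already folded into each execution time; the thesis's equation may account for CRPD differently. The function names and the tuple layout are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# (execution time C, period P); the deadline is assumed equal to the period.
Task = Tuple[int, int]


def response_time(tasks: Sequence[Task], i: int) -> Optional[int]:
    """Fixed-point iteration for task i; tasks are sorted by decreasing priority."""
    c_i, deadline = tasks[i]
    r = c_i
    while r <= deadline:
        # Interference from every higher-priority task released within [0, r).
        nxt = c_i + sum(-(-r // p) * c for c, p in tasks[:i])
        if nxt == r:
            return r  # converged within the deadline
        r = nxt
    return None  # response time exceeds the deadline


def first_unschedulable(tasks: Sequence[Task]) -> Optional[int]:
    """Index of the highest-priority task that misses its deadline, if any."""
    for i in range(len(tasks)):
        if response_time(tasks, i) is None:
            return i
    return None
```

For the made-up task set [(1, 4), (2, 6), (3, 12)] the lowest-priority task converges to a response time of 10, so every task is schedulable; for [(2, 4), (3, 6), (3, 12)] the second task is the one Algorithm 4 would try to improve first.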
5.7 Experimental Evaluation
In this section, we quantitatively compare our approach with static analysis [101,
54], ASRV-Locking [14], and PD-Locking [87].
5.7.1 Experiments Setup
We use task sets similar to those used in [87, 14]. The task sets are shown in Table 5.1.
They contain one small and one medium task set. All the tasks are from MRTC
benchmark suite [46]. We assume the deadline of a task is equal to its period.
Our framework is built on top of the open-source WCET analysis tool Chronos
[59]. All the tasks are compiled with gcc cross-compiler for an ARM-like in-
struction set [21].
We assume there is only one level of instruction cache. Instruction hit la-
tency is 1 cycle, while the cache miss latency is 30 cycles. The locking routine
is stored in non-cacheable memory and it uses five instructions to load and lock
a memory block [6, 2]. Thus, the cost of locking a memory block is 150 cycles.
The cache is 4-way set-associative with block size of 32 bytes. We also assume
each context switch takes 1,000 cycles per preemption for all the approaches.
For a fair comparison, we assume there is no line buffer for Aparicio et al.’s
approach [14].
Table 5.1: Characteristics of task sets











5.7.2 CPU Utilization Comparison
Figure 5.8 (a) and (b) present the utilization comparison of different approaches
under EDF and RMS scheduling. small-X KB (medium-X KB) denotes small
(medium) task set with cache size of X KB. As shown, our integrated cache
analysis and locking substantially improves the utilization irrespective of task
set size, scheduling policy, and cache size. PD-Locking has high utilization
when the cache size is small. For PD-Locking, the locked memory blocks for
each task are very limited and most of the memory accesses are serviced from
main memory instead of cache. As a result, the WCET of the tasks and utiliza-
tion of the task set are high. In the medium task set, the utilization of ASRV-
Locking is also high. First, the code size for tasks in medium task set is large.
Thus, there are still many unlocked memory blocks. Second, the period of task
qurt is much smaller than the other tasks, and these tasks suffer many preemp-
tions from qurt. Thus, the re-locking cost also contributes a lot to the utilization.
5.7.3 Response Time Speed-up
We compare the different approaches using response-time speedup metric pro-




It is calculated for the lowest priority task and indicates the slack available in the
schedule. Thus, a speedup greater than or equal to 1 implies that the task set is
schedulable. Figure 5.9 shows the response time speed-up for the task sets with
[Figure 5.8, panel (a): Utilization with EDF; legend: Static analysis, Ours, ASRV-Locking, PD-Locking]
Figure 5.8: Utilization comparison of different approaches.
varying cache size. Clearly, with our approach, the tasks with lowest priority
are always schedulable. However, with ASRV-Locking and PD-Locking, the lowest priority
task in the medium task set is not schedulable in most of the cases.
5.7.4 CPU Utilization Breakdown
Figure 5.10 details the contribution to the utilization by WCET, CRPD and re-
locking overhead, respectively for the medium task set with 2KB cache size
under RMS scheduling policy. Compared to static analysis, our approach either
significantly reduces the WCET (qurt) or nearly eliminates the CRPD (minver,
jfdctint and fdct). For ASRV-Locking, in contrast, we observe a large contribution to
utilization due to re-locking overhead (jfdctint and fdct). Finally, the WCET
using ASRV-Locking and PD-Locking are usually large, because the unlocked
























[Figure 5.9 legend: Static analysis, Ours, ASRV-Locking, PD-Locking]
Figure 5.9: Response time speed-up.


















Figure 5.10: Utilization breakdown for medium-2KB.
5.7.5 Unlocked Cache Space
Figure 5.11 (a) and (b) show the percentage of the unlocked cache lines of our
approach under EDF and RMS, respectively. The percentage of the unlocked
cache space depends on the cache size and the scheduling policy. As shown,
with our approach, there is a portion of cache space left unlocked for all the
settings. The unlocked cache space can be used by all the tasks in the task
set. We also notice that the percentage of unlocked cache lines of 2KB cache is
smaller than that of 1KB and 4KB cache. When the cache is small, our approach
decides to lock only a small portion of the cache. It is because locking more
memory blocks may have significant negative impact on the WCET of the tasks.
On the other hand, when the cache is big, more memory blocks can be classified
as cache hits and locking those memory blocks has no benefit. Thus, our approach
decides to lock only a small portion of a big cache.
Compared to the results in Section 4.5.7, the percentage of locked cache
lines is generally smaller in multitasking real-time systems. As locking a mem-
ory block has global effect on all tasks in multitasking real-time systems, lock-



















































(a) Percentage of unlocked space with EDF
(b) Percentage of unlocked space with RMS
Figure 5.11: Percentage of unlocked cache lines with our approach.
5.7.6 Runtime of Our Approach
Table 5.2 presents the runtime of our approach under both EDF and RMS schedul-
ing policy with different cache sizes. We perform all the experiments on 2.53GHz
Intel Xeon CPU with 24GB memory. The overall runtime depends on the num-
ber of locked memory blocks, WCET analysis and CRPD analysis. We notice
that for the small task set, the runtime of 2KB cache is higher than that of 4KB
cache. In small task set, the code size of crc is about 2KB. When the cache
size is set to 2KB, crc has complicated RCS and LCS analysis that leads to long
CRPD computation time.
Table 5.2: Runtime of our approach










Our approach is a trade-off between the predictability and worst-case perfor-
mance. To improve the worst-case performance, a portion of the cache is locked
with memory blocks, while static analysis for WCET and CRPD computation is
required for the unlocked cache space. Meanwhile, locking a memory block has
global effect on all tasks, and both the WCET and CRPD are affected by cache
locking. These factors add complexity to the problem.
5.9 Summary
In this chapter, we present an approach that integrates instruction cache analysis
and locking in multitasking preemptive real-time systems. A portion of the
cache is locked by the tasks in the task set, while the remaining portion is used
by all the tasks. We propose an algorithm based on accurate cost-benefit analysis
to select the appropriate memory contents to lock. Experimental results show
that our approach outperforms previous techniques that either time-multiplexes




In this chapter, we extend the static partial cache locking in Chapter 4 to
dynamic cache locking for a single task in embedded real-time systems.
6.1 Overview
In Chapters 4 and 5, we explored static cache locking for single-task and
multitasking real-time systems, where the memory blocks are locked at the
beginning of execution. However, the drawback of static cache locking can
manifest for large programs executing on small caches. As the locked content
remains unchanged throughout execution, there is limited scope for optimiza-
tion. In this context, dynamic instruction cache locking that adjusts the locked
contents at runtime can further improve the WCET [15, 74]. The basic idea is
to partition the program into appropriate regions and select memory blocks for
locking in each region [15]. As the program execution moves from one region
to another, the memory blocks for the new region are locked in the cache. The
downside of this approach is the rigid partitioning that does not allow selective
locking of different memory blocks from the same region at different program
points (see the motivating example in Figure 6.2).
Liu et al. [74] extend the region-based approach and propose a swapping-
based method. The locked cache states are adjusted at the branching nodes of the
Execution Flow Tree (EFT) — which captures the control flow in the program
— by judiciously swapping in content from the nodes in the taken branch of
the EFT and swapping out the memory blocks from the non-taken branch of the
EFT. The swapping potentially allows memory blocks from a node to be locked
at different levels. However, this technique is applied only in the regions with
branching nodes and not in the single-path regions of the EFT such as nested
loops. Moreover, in the context of nested loops, the repeated swapping in and
swapping out would render it infeasible to lock memory blocks from the outer
loop.
In this chapter, we propose a loop-based dynamic instruction cache locking
approach to optimize the WCET. We focus on the loops, in particular nested
loops, as they contribute the most to the program execution time. As the locking
routines are usually stored in non-cacheable memory [6, 2], the locking cost is
quite high and needs to be offset through repeated access to the locked memory
blocks in the program. This leads to memory blocks within loops as natural
candidates for locking. We also lock a memory block at the entry point of a loop
and unlock it at the corresponding exit point of the loop. This policy ensures that
locking and unlocking costs are incurred before and after the execution of the
loop.
Figure 6.1 presents an example of our loop-based dynamic cache locking
approach. There are two loops in the program, lp1 is the outer loop and lp2 is the
inner loop. Suppose memory block m1 is locked at lp1 and memory block m2
is locked at lp2. The cache is 2-way set-associative, and m1 and m2 are mapped
to the same cache set. We show our dynamic cache locking in this example step
by step. First, before execution, there is no memory block locked, as shown
in Figure 6.1(a). When we are going to enter lp1, we lock m1 into the cache,
which is shown in Figure 6.1(b). Similarly, each time we are going to enter lp2,
we need to lock m2 into the cache, as illustrated in Figure 6.1(c). Figure 6.1(d)
presents the case when we exit lp2, and m2 is unlocked at this time. As lp2 is
nested in lp1, locking and unlocking of m2 repeats in lp1. Finally, when we exit
lp1, we unlock m1, as shown in Figure 6.1(e). At this time, no memory block is locked.



















Figure 6.1: An example of our loop-based dynamic cache locking approach.
Our approach differs from prior techniques along two important dimensions.
First, both [15] and [74] assume full cache locking for each region, i.e., the
unlocked memory blocks do not have any access to the cache. In contrast, we
adopt partial cache locking for each region.
More importantly, we carefully select not only the memory blocks that can
be locked but also the program points where they should be locked. In particular, a
memory block m from an innermost loop L may be locked either at the level
of loop L or any of its enclosing outer loops. Moreover, this decision is taken
independently for each memory block in loop L. That is, memory blocks m and
m′ from the same loop L can potentially be locked at different loop levels. This
selective promotion of memory blocks to different loop levels for locking is a
key contribution of our approach (see example in Figure 6.2(c)).
We should point out that the selective promotion of memory blocks to dif-
ferent loop levels for locking is enabled by our decision to employ partial cache
locking. In our approach, the set of locked memory blocks during the inner
loop execution is a superset of the locked memory blocks during the outer loop
execution. This would not be beneficial with full cache locking because the
cache is locked entirely. But with partial cache locking, the unlocked memory
blocks in the outer loop can still enjoy the free cache space leading to improved
performance.
The challenge is to select the memory blocks and their locking points. We
develop a constraint-based approach to first determine the number of memory
blocks to be locked at each loop to minimize the WCET. In this process, we
exploit the concept of resilience sets to quickly and accurately estimate cost-
benefit tradeoff for cache locking. This is followed by a memory block selection
phase that identifies the actual memory blocks to be locked for each available
locking slot.
6.2 Motivating Example
We illustrate the benefit of our loop-based dynamic cache locking approach in
Figure 6.2 and Table 6.1, by comparing with static analysis approach without
locking and the region-based approach [15]. For simplicity, we use single-path
program here, but our loop-based approach can be applied to the general cases
where loops are on different branches. Figure 6.2(a) shows the original control
flow. There are three loops: lp1, lp2, lp3 where lp2 and lp3 are nested in lp1. The
numbers on the loop back edges are the corresponding loop bounds. We assume
a 2-way set associative cache with LRU replacement policy. The latency of
cache hit is 1 cycle and cache miss penalty is 30 cycles. We assume 150 cycles
overhead to lock and unlock a memory block because the locking/unlocking
routines involve multiple instructions to lock/unlock each memory block and
they are kept in the un-cacheable region of the program memory [6] (In [6],
Xscale uses 4 instructions to lock a memory block, and we assume there is
another instruction to unlock it). We also assume that all the memory blocks are

















































Figure 6.2: Motivating example for dynamic cache locking.














Approach               Memory block   Hits   Misses   Lockings   Cycles   WCET (cycles)
No cache locking       m1             190    10       0          490      18,490
                       m2             0      200      0          6,000
                       m3             0      200      0          6,000
                       m4             0      200      0          6,000
Region-based approach  m1             200    0        10         1,700    11,100
                       m2             200    0        10         1,700
                       m3             200    0        10         1,700
                       m4             0      200      0          6,000
Loop-based approach    m1             190    10       0          490      8,540
                       m2             200    0        1          350
                       m3             200    0        10         1,700
                       m4             0      200      0          6,000
No Cache Locking: We adopt static cache analysis to estimate the WCET
with the control flow shown in Figure 6.2(a), and the result is shown in Ta-
ble 6.1. There is no conflict for m1 in loop lp2, while all the other memory
blocks conflict with m1 in lp1. Thus, m1 is cache hit inside lp2, while it is classified
as cache miss in lp1. So, in total, it incurs 190 hits and 10 misses. For m2,
m3 and m4, they always conflict inside lp3. As the associativity is only 2, each
of them incurs 200 cache misses.
Region-based approach: In the region-based approach, the program is par-
titioned into two regions, as shown in Figure 6.2(b). The memory blocks with
highest execution frequencies are chosen to be locked. Thus, in region 1, m1
is locked; while m2 and m3 are locked in region 2. So, all accesses to m1, m2
and m3 are cache hits, while all accesses to m4 are cache misses. When the
flow enters a region, the region must load its locked memory blocks. Thus, each
locked memory block (m1, m2, m3) is locked 10 times as shown in Table 6.1.
Loop-based approach: With the loop-based approach, m2 is promoted to be
locked at the loop entry of the outer loop lp1, while we lock m3 at the loop entry
of inner loop lp3 as shown in Figure 6.2(c). The flexibility that memory blocks
can be locked at different loop levels enables the loop-based approach to
outperform the region-based approach. In loop lp2, as only m2 is locked, there is
still a cache line that can be used by m1. Thus, m1 only suffers 10 cache misses.
While in lp3, both m2 and m3 are locked. So, all accesses to m2 and m3 are hits
and accesses to m4 are cache misses. m2 needs to be locked only once, while
m3 needs to be locked 10 times, which is equivalent to the loop bound of the
outer loop lp1.
As shown in Table 6.1, the loop-based approach achieves better WCET com-
pared to both static cache analysis and region-based approach. As the loop-
based approach locks m2 at the outermost loop, its locking cost is substantially
reduced. This gain in locking cost could have been offset by the fact that m1 is
locked in region-based approach but not in loop-based approach. Partial cache
locking comes to the rescue here as m1 can still benefit from caching and incurs
only 10 misses.
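The cycle counts in Table 6.1 follow directly from the stated latencies (1 cycle per hit, 30 cycles per miss, 150 cycles per lock/unlock). The short check below recomputes them from the hit, miss and locking counts of the motivating example; the function and variable names are only illustrative.

```python
HIT_CYCLES, MISS_CYCLES, LOCK_CYCLES = 1, 30, 150


def cycles(hits: int, misses: int, lockings: int) -> int:
    """Cycles attributed to one memory block."""
    return hits * HIT_CYCLES + misses * MISS_CYCLES + lockings * LOCK_CYCLES


# (hits, misses, lockings) for m1..m4 under each approach, taken from Table 6.1.
approaches = {
    "no_locking":   [(190, 10, 0), (0, 200, 0), (0, 200, 0), (0, 200, 0)],
    "region_based": [(200, 0, 10), (200, 0, 10), (200, 0, 10), (0, 200, 0)],
    "loop_based":   [(190, 10, 0), (200, 0, 1), (200, 0, 10), (0, 200, 0)],
}

totals = {name: sum(cycles(*row) for row in rows) for name, rows in approaches.items()}
print(totals)  # {'no_locking': 18490, 'region_based': 11100, 'loop_based': 8540}
```

The difference between the region-based and loop-based totals comes from m2 (1,700 versus 350 cycles, because it is locked once instead of 10 times) and m1 (1,700 versus 490 cycles, because it stays unlocked but still hits in the free cache line).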
6.3 Cache Modeling and Locking
We define the cache as in Section 4.3 of Chapter 4. We set cache
line (block) size as L, while the number of cache sets is K and associativity
is A. We assume LRU (Least Recently Used) cache replacement policy and
uni-processor with only one level of cache.
6.3.1 Cache Modeling
Given a memory block m, it can be mapped to only one cache set (m modulo K)
and will not interfere with the blocks mapped to other cache sets. Thus the cache
sets are independent and can be modeled separately. To simplify the discussion
and explanation, we will restrict our cache modeling to one cache set. We use
M to denote the set of memory blocks mapped to cache set s. In addition, we
use ⊥ to indicate the absence of any memory block in a cache line.
Abstract cache state analysis is performed in this chapter. Thus, we define
abstract cache state similarly to that in Section 2.3.1 of Chapter 2. Abstract
cache state maps cache lines (blocks) to sets of memory blocks. It is a precise
and compact representation of the cache behavior.
Definition 4 (Age in Abstract Cache State) The age of memory block m in an
abstract cache state a is defined as
\[
\mathit{age}^{a}_{m} =
\begin{cases}
i & \text{if } \exists i\ (0 \le i \le A-1) \text{ s.t. } m \in a[i] \\
A & \text{otherwise}
\end{cases}
\tag{6.1}
\]
Definition 5 (Younger/Older Memory Block) For two memory blocks m and
m′ in abstract cache state a, we define m is younger than m′ if
$\mathit{age}^{a}_{m} \le \mathit{age}^{a}_{m'}$. Otherwise m is older than m′.
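Definitions 4 and 5 can be made concrete with a small sketch. It assumes, for illustration only, that an abstract cache state for one cache set is stored as a list of A sets of block names indexed by age, matching the a[i] notation above, and that memory blocks are identified by integers for the set mapping. The function names are hypothetical.

```python
from typing import List, Set


def cache_set_index(block_id: int, num_sets: int) -> int:
    """Cache set that a memory block maps to (m modulo K)."""
    return block_id % num_sets


def age(state: List[Set[str]], m: str) -> int:
    """Age of m in abstract cache state `state`; A (= len(state)) if m is absent."""
    for i, line in enumerate(state):
        if m in line:
            return i
    return len(state)


def is_younger(state: List[Set[str]], m: str, other: str) -> bool:
    """Definition 5: m is younger than `other` if its age is not larger."""
    return age(state, m) <= age(state, other)


# Example for a 2-way cache set: m1 has age 0, m2 and m3 share age 1, m4 is absent.
state = [{"m1"}, {"m2", "m3"}]
```

In this example `age(state, "m4")` evaluates to 2, the associativity, which is how Definition 4 encodes a block that is not in the cache.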
6.3.2 Cache Locking Mechanism
In this chapter, we consider dynamic instruction cache locking for the sake of
WCET minimization. As we have mentioned, our approach is based on partial
cache locking. Thus, static cache analysis is still required for the unlocked
portion of the cache. We also adopt line locking mechanism as it is more flexible
and fine-grained compared to way locking.
Our dynamic cache locking approach is based on loops, that is, we lock a
memory block at the entry of a loop and unlock it at the corresponding exit of
the loop. For simplicity, in the rest of the chapter, saying that a memory block m is locked
at a loop L means that m is locked at the entry of L and unlocked at the exit
of L. We also define L as the effective locking region of m. A memory block
can be locked at any loop that contains it. Thus, a memory block in nested loops
may have multiple candidate locking points.
To lock/unlock memory blocks, we need to call the locking routine before
the loop entry and the unlocking routine after the loop exit. We adopt the
trampolines approach proposed in [20]. That is, we insert instructions that call
locking/unlocking routines into the program. For each loop, we first leave a dummy
NOP instruction at the entry point and at the exit point before we decide on
cache locking. If we decide to lock memory blocks at this loop, the NOP in-
struction at the entry gets replaced by a call to the locking routine, while a call
to the unlocking routine substitutes for the NOP instruction at the exit point.
This ensures that the code layout is not impacted by the locking decisions.
As a loop may have multiple exits, all these loop exits are handled similarly.
For the exits whose target is not the following basic block, a jump instruction
is required apart from the unlock instruction. All locking/unlocking routines
are placed at the end of the program. As the locking/unlocking routines are
stored in non-cacheable memory, they do not affect the cache contents of the
program during execution. Meanwhile, the number of instructions inserted into
the program is limited; and thus their effect on code size is usually negligible.
6.4 Dynamic Cache Locking Algorithm
In this section, we present our loop-based dynamic cache locking approach.
The loop-based approach requires global optimization to select the memory
blocks and the corresponding locking points, making the problem challenging.
For a memory block in a nested loop, it has several candidate locking points.
Different locking points lead to different locking costs for a memory block.
Figure 6.3 shows the locking effects at different loop levels in nested loops. In
this example, loop lp1 is the outer loop while loop lp2 is the inner loop, and
their loop bounds are both 10. In Figure 6.3(a), the locking point is at lp2, while
memory block is locked at lp1 in Figure 6.3(b). When we try to lock a memory
block m, obviously the locking benefit is the same at both locking points. How-
ever, the locking costs may be different. In Figure 6.3(a), locking m only affects
the memory blocks in lp2 (effective region), but the locking/unlocking routines
execute 10 times (loop bound of lp1). In Figure 6.3(b), effective region is en-
larged to lp1, and the memory blocks in region a and region b are also affected.
However, execution frequency of the locking/unlocking routine is only 1. Thus,
apart from cost-benefit analysis to select the memory blocks for locking, locking
points are also important as they affect the locking cost.
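The dependence of the locking overhead on the locking point can be illustrated numerically. The sketch assumes the 150-cycle lock/unlock overhead used in the motivating example of Section 6.2 and takes the execution count of the locking routine to be the product of the bounds of all loops that strictly enclose the locking loop; the helper name is hypothetical.

```python
from math import prod
from typing import Sequence

LOCK_UNLOCK_CYCLES = 150  # per lock/unlock pair, as assumed in Section 6.2


def locking_overhead(enclosing_bounds: Sequence[int],
                     cost: int = LOCK_UNLOCK_CYCLES) -> int:
    """Total locking overhead when the locking loop is entered once per
    iteration of every enclosing loop."""
    return prod(enclosing_bounds) * cost


print(locking_overhead([10]))  # locking at lp2, enclosed by lp1 (bound 10): 1500
print(locking_overhead([]))    # locking at the outermost loop lp1: 150
```

Promoting the block from lp2 to lp1 cuts the overhead by a factor equal to the bound of lp1, at the price of a larger effective locking region.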
Deciding on the memory blocks to lock and their locking points is a chal-
lenging problem. Selecting memory blocks for cache locking itself requires both
locking benefit and locking cost analysis, as we did in Chapter 4 and 5. Now

















Figure 6.3: Effect of different locking positions.
for each memory block. To tackle the complexity, we present a constraint-based
approach that solves the problem in a novel way.
We observe that locking/unlocking memory blocks at entry/exit point of a
loop essentially isolates the effect of each nested loop from others. Thus we
can analyze and optimize each nested loop independently. For each nested loop,
the question now is to identify the loop level at which a memory block should
be locked. Instead of first identifying the memory blocks to be locked per loop
level, we investigate the number of cache lines that can be locked at each loop
level to minimize the WCET. This process gives us a set of constraints that can
be solved using Integer Linear Programming (ILP) to determine the number of
locking slots at each loop level so as to minimize the WCET of the nested loop.
Next, we select the most beneficial memory blocks to fill the locking slots. This
two-step process ensures that we obtain a good quality solution with reasonable
compilation time.
6.4.1 Framework Overview
Figure 6.4 illustrates the flow of our dynamic cache locking approach. First, we
perform WCET analysis with abstract interpretation [101] for the entire cache
and obtain an initial WCET. Our locking content selection algorithm proceeds
for each cache set independently. For each cache set, we perform resilience
analysis for the memory blocks mapped to this cache set. The resilience of a
memory block m captures the number of memory blocks (excluding itself) that
can be locked before the access to m changes from cache hit to cache miss.
Based on the resilience analysis of memory blocks, we perform a locking slot













Figure 6.4: Framework of dynamic cache locking.
that should be locked at each loop, in order to improve the WCET. After the
number of locking slots is fixed for each loop, we select the most beneficial
memory blocks to fill the locking slots. Later, the abstract cache states for this
cache set are re-computed, and the new WCET after locking the cache set is
calculated. We call this lightweight WCET analysis, as the abstract cache states
analysis is restricted to the particular cache set being analyzed. When all the
cache sets are analyzed, we obtain the final WCET after locking using a com-
plete analysis. We detail the dynamic cache locking approach in the following
section.
6.4.2 WCET Analysis
In our approach, the WCET analysis involves three steps: abstract cache states
analysis, memory access classification and WCET computation.
First, we perform abstract cache state analysis via abstract interpretation
[101]. Three types of analysis are carried out: must analysis, may analysis and
persistence analysis. Some contemporary works attempt to improve the tra-
ditional persistence analysis [17, 51, 32]. Ballabriga and Casse [17] propose
multi-level persistence analysis to improve the accuracy of WCET estimation.
Both Huynh et al. [51] and Cullmann [32] detect and fix a safety issue that
may underestimate the WCET in the traditional persistence analysis [101]. We
adopt the multi-level persistence analysis technique [17] and use the Younger
Set approach in [51] to fix the safety problem in persistence analysis. Based on
the abstract cache states, we classify the accesses to memory blocks into four
categories: Always Hit, Always Miss, Persistent, and Non-Classified. With the
memory access classification and the cache hit/miss latency, we estimate the
WCET for each basic block. We calculate the program WCET via the implicit
path enumeration method [63] with ILP (Integer Linear Programming) formu-
lation. As a by-product of the WCET analysis, we collect the memory blocks
along the worst-case path and their corresponding execution frequencies, as well
as the abstract cache states at each program point.
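The classification step above can be illustrated with a minimal sketch. It assumes the usual precedence of the three analyses (must, then may, then persistence) and that a Persistent access misses at most once; the names `AccessClass`, `classify` and `worst_case_cycles` are hypothetical, and the latencies of 1 and 30 cycles are taken as assumptions from the experimental setup.

```python
from enum import Enum


class AccessClass(Enum):
    ALWAYS_HIT = "AH"
    ALWAYS_MISS = "AM"
    PERSISTENT = "PS"
    NON_CLASSIFIED = "NC"


def classify(in_must: bool, in_may: bool, persistent: bool) -> AccessClass:
    """Map the results of must, may and persistence analysis to a category."""
    if in_must:            # guaranteed to be in the cache
        return AccessClass.ALWAYS_HIT
    if not in_may:         # guaranteed not to be in the cache
        return AccessClass.ALWAYS_MISS
    if persistent:         # once loaded, never evicted within its scope
        return AccessClass.PERSISTENT
    return AccessClass.NON_CLASSIFIED


def worst_case_cycles(cls: AccessClass, executions: int,
                      hit_latency: int = 1, miss_latency: int = 30) -> int:
    """Worst-case access latency of one memory block (assumed latencies)."""
    if executions <= 0:
        return 0
    if cls is AccessClass.ALWAYS_HIT:
        return executions * hit_latency
    if cls is AccessClass.PERSISTENT:
        # Assumption: at most one miss, all remaining accesses hit.
        return miss_latency + (executions - 1) * hit_latency
    # Always Miss and Non-Classified are both treated as misses.
    return executions * miss_latency
```

For instance, a Persistent block executed 10 times contributes 30 + 9 = 39 cycles under these assumptions, whereas an Always Miss block contributes 300 cycles.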
6.4.3 Resilience Analysis
In cache locking, when we lock a memory block m, all accesses to m become
cache hits after locking. However, locking m also reduces the free cache space
in the cache set. That is, the age of the memory blocks that used to be younger
than m is increased by 1. Thus, some memory blocks that used to be present in
the cache may get evicted out after cache locking. In order to capture the influ-
ence of cache locking, we define resilience for each memory block, a concept
similar to that in [9].
Definition 6 (Resilience) The resilience of a memory block m in cache set s at
a program point p is the maximum number of older memory blocks that can be
locked in s before m is evicted out.
In our case, we only need to compute the resilience for the memory blocks
that are classified as Always Hit or Persistent. For the memory block m that can
be classified as Always Hit or Persistent, its resilience indicates the maximum
number of older memory blocks that can be locked before m can no longer be
classified as Always Hit or Persistent. Thus, based on the abstract cache states of
must analysis and persistence analysis, we calculate the resilience of a memory
block m at the program point p as follows.
$$res^p_m = \begin{cases} A - 1 - age^a_m & \text{if } 0 \le age^a_m \le A - 1 \\ -1 & \text{otherwise} \end{cases} \tag{6.2}$$
where a is the abstract cache state at program point p, $age^a_m$ is the corresponding
age of m in a, and A is the cache associativity. We use the resilience value of
−1 to indicate that a memory block is not in the cache. For a memory block m
with non-negative resilience, we also define the set of younger memory blocks
of m as $ys_m$.
$$ys_m = \{ m' \mid res^p_m \ge 0 \land age^a_{m'} \le age^a_m \} \tag{6.3}$$
where a is the abstract cache state. We collect the younger memory blocks of
m, as locking a memory block $m' \in ys_m$ will not affect the age of m.
With the resilience analysis, we now classify the memory blocks into A + 1
resilience sets. We use $S^i$ ($-1 \le i \le A - 1$) to denote the set of memory blocks
with resilience value of i. In particular, $S^{-1}$ represents the set
of memory blocks that are classified as Always Miss or Non-Classified. Note
that a memory block may have different classifications under different contexts.
Thus, we may find the same memory block in different resilience sets.
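As an illustration of Equations 6.2 and 6.3 and of the partition into A + 1 resilience sets, the following sketch works on a single abstract cache state. The representation (a dictionary from block name to age, with `None` for a block that is not in the cache) and the function names are assumptions made for this example only.

```python
def resilience(age, assoc):
    """Eq. 6.2: resilience of a block with the given age (None = not cached)."""
    if age is not None and 0 <= age <= assoc - 1:
        return assoc - 1 - age
    return -1


def younger_set(block, ages, assoc):
    """Eq. 6.3: cached blocks whose age does not exceed that of `block`."""
    if resilience(ages.get(block), assoc) < 0:
        return set()
    limit = ages[block]
    return {other for other, age in ages.items()
            if age is not None and age <= limit}


def resilience_sets(ages, assoc):
    """Partition the blocks into the A + 1 sets S^{-1}, S^0, ..., S^{A-1}."""
    sets = {r: set() for r in range(-1, assoc)}
    for block, age in ages.items():
        sets[resilience(age, assoc)].add(block)
    return sets


# Hypothetical abstract cache state of one set in a 4-way cache.
ages = {"a": 0, "b": 1, "c": 3, "d": None}
sets = resilience_sets(ages, assoc=4)
```

In this hypothetical state, block a has resilience 3, b has resilience 2, c has resilience 0 and d falls into the set with resilience -1. Note that, as written in Equation 6.3, the younger set of a block contains the block itself.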
6.4.4 Locking Slot Analysis
The locking slot analysis determines the number of memory blocks that should
be locked at each loop level in order to minimize the overall WCET.
We assume there are N loops in the program, $LP = \{lp_1, lp_2, \ldots, lp_N\}$,
where LP is the set of loops. For a cache set s, we define the number of locking
slots for the loop $lp_i \in LP$ as $n_i$. We use a function $gain(lp_i, n_i)$ to represent
the locking gain on WCET by having $n_i$ locking slots at $lp_i$. Thus, the total
locking gain is
$$\sum_{lp_i \in LP} gain(lp_i, n_i) \tag{6.4}$$
Hence, to minimize the overall WCET, our purpose is to decide the value of ni
for each loop lpi ∈ LP in order to maximize the total locking gain.
As we have mentioned earlier, a global optimization is required for the
loop-based dynamic cache locking. Thus, independent computation of ni and
gain(lpi, ni) for each lpi ∈ LP does not help in solving the problem. We pro-
pose an ILP formulation approach to obtain ni for each lpi ∈ LP . We first
derive the constraints on ni. Then we approximate gain(lpi, ni), the locking
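gain by having $n_i$ slots at loop $lp_i$. The objective of the ILP formulation is to
maximize the total locking gain:
$$\text{maximize} \sum_{lp_i \in LP} gain(lp_i, n_i) \tag{6.5}$$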
Constraints in Local Loop
Suppose Ni is the number of memory blocks locked at loop lpi ∈ LP , when
static cache locking is independently applied to lpi. Thus Ni represents the
number of memory blocks locked at lpi that results in maximum locking gain
due to static locking of lpi. That is, locking more memory blocks may have
negative impact on the locking gain at lpi. So, in our case, we should make ni
bounded by Ni, in order to have better locking gain.
$$n_i \le N_i \tag{6.6}$$
To obtain Ni for lpi, we perform static locking cost-benefit analysis in lpi and
iteratively select the most beneficial memory blocks to lock as discussed below
and shown in Algorithm 5.
Algorithm 5: Locking slot bound computation
Input: Cache associativity A and loop lp_i
Output: N_i and gain_{m_k}
begin
    stop_locking := false;
    k := 1;
    while (!stop_locking) do
        m_k := null;
        gain_{m_k} := 0;
        foreach m ∈ S^{-1}_i do
            benefit_m := calculate_benefit(lp_i);
            cost_m := calculate_cost(lp_i);
            gain_m := benefit_m − cost_m;
            if gain_m > gain_{m_k} then
                m_k := m;
                gain_{m_k} := gain_m;
        if m_k ≠ null then
            update_memory_block_sets(lp_i);
            k := k + 1;
            A := A − 1;
            if A = 0 then
                stop_locking := true;
        else
            stop_locking := true;
    N_i := k − 1;
Recall that the resilience analysis partitions the memory blocks into A + 1
sets. Thus, for loop $lp_i$, its memory blocks can also be classified into A + 1 sets:
$\{S^{-1}_i, S^0_i, \ldots, S^{A-1}_i\}$. When $0 \le x \le A - 1$, $S^x_i$ indicates the set of memory
blocks whose resilience is x in loop $lp_i$, while $S^{-1}_i$ contains all memory blocks
that are classified as Always Miss or Non-Classified in loop $lp_i$. Clearly, there
is locking benefit only when we lock the memory blocks in $S^{-1}_i$. Thus, for a
memory block $m \in S^{-1}_i$, its locking benefit can be defined as follows.
$$benefit_m = (LAT_{miss} - LAT_{hit}) \times freq_m \tag{6.7}$$
where $LAT_{miss}$ is the access latency for a cache miss, $LAT_{hit}$ is the cache hit
latency, and $freq_m$ is the execution frequency of m on the worst-case path.
However, locking m also incurs penalty, as the number of free cache lines in
cache set s is reduced by 1, which may result in the eviction of memory blocks
with resilience of 0.
$$cost'_m = \sum_{m' \in S^0_i \,\land\, m \notin ys_{m'}} \big( (LAT_{miss} - LAT_{hit}) \times freq_{m'} \big) \tag{6.8}$$
where $cost'_m$ represents the locking cost due to free cache space reduction, $m'$
is a memory block with resilience of 0, and $ys_{m'}$ is the set of younger memory
blocks of $m'$. As we have mentioned, when $m \in ys_{m'}$, there is no impact on $m'$
by locking m. Apart from the cost incurred by memory block eviction, locking
m also requires the execution of locking/unlocking routines. We use $cost''_m$ to
represent this type of locking cost.
$$cost''_m = PENALTY \times freq_r \tag{6.9}$$
where PENALTY is a constant that indicates the penalty to execute the
locking/unlocking routines for one memory block, and $freq_r$ is the total locking
frequency of memory block m at $lp_i$ on the worst-case path. Therefore, we
have the total locking cost
$$cost_m = cost'_m + cost''_m \tag{6.10}$$
In this case, we can easily obtain the gain of locking m.
$$gain_m = benefit_m - cost_m \tag{6.11}$$
We perform the analysis for all the memory blocks in $S^{-1}_i$, and select the memory
block with the maximum locking gain to lock. Each time we lock a memory
block, we update the resilience sets, as the number of free cache lines is reduced
by 1 and the resilience of memory blocks changes.
More concretely, let $m_k$ be the kth ($1 \le k \le N_i$) memory block selected
to be locked at $lp_i$. For a memory block m, if $m_k$ is in the set of its younger
memory blocks, locking $m_k$ will not impact the age of m. Thus, its resilience
remains unchanged. Otherwise, the resilience is reduced by 1. Therefore, we
subdivide $S^x_i$ ($0 \le x \le A - k$) into two subsets $S^{x'}_i$ and $S^{x''}_i$, where $S^{x'}_i$ is the
set of memory blocks whose resilience remains unchanged and $S^{x''}_i$ is the set of
memory blocks having their resilience reduced by 1.
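$$S^{x'}_i = \{ m \mid m \in S^x_i \land m \ne m_k \land m_k \in ys_m \} \tag{6.12}$$
$$S^{x''}_i = \{ m \mid m \in S^x_i \land m \ne m_k \land m_k \notin ys_m \} \tag{6.13}$$
The resilience sets are then updated as follows.
$$S^x_i = \begin{cases} S^{x'}_i \cup S^{(x+1)''}_i & \text{if } 0 \le x < A - k \\ S^{x'}_i & \text{if } x = A - k \\ (S^{-1}_i \cup S^{0''}_i) \setminus \{m_k\} & \text{if } x = -1 \end{cases} \tag{6.14}$$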
We also record the individual locking gain for the kth memory block locked at
loop $lp_i$ in static locking analysis as $gain^k_i$.
$$gain^k_i = gain_{m_k} \tag{6.15}$$
This value will be used to approximate the locking gain at $lp_i$ in dynamic cache
locking.
We continue to lock memory blocks with the updated memory block sets
until there is no locking gain for any memory block or the cache set s is fully
locked. In this case, we obtain the value of $N_i$, the maximum number of memory
blocks that can be locked at $lp_i$ when static locking is applied.
We use the program in the motivating example (Figure 6.2) to show the computation
of the local loop constraints with respect to loop $lp_1$. Table 6.2 shows
the resilience sets for different iterations, where the numbers in parentheses are
the corresponding execution frequencies of the memory block under a particular
context. As we have mentioned, a memory block may be found in different sets
based on the contexts. In Table 6.2, $m_1^{lp_1}$ indicates $m_1$ in the context of $lp_1$,
while $m_1^{lp_2}$ denotes $m_1$ in the context of $lp_2$. The cost-benefit analysis results
are presented in Table 6.3. In the first iteration, as there is no memory block in
$S^0_1$, the penalty is only in executing the locking routines. We select $m_2$ to lock
and update the memory block sets as shown in the 2nd iteration of Table 6.2,
where $m_1$ is now in $S^0_1$. Thus, in the 2nd iteration, the cost for $m_3$ or $m_4$ should
also take the eviction of $m_1$ into account, as shown in Table 6.3. As the locking
gain is the same for $m_1$, $m_3$ and $m_4$, we randomly lock one and the cache set
is fully locked. Thus, we obtain $N_1 = 2$ at $lp_1$. With a similar approach, we also
compute $N_2 = 0$ and $N_3 = 2$ at $lp_2$ and $lp_3$, respectively.
Table 6.3: Cost-benefit analysis for N1 computation.
Iteration  Quantity          m1    m2     m3     m4
1st        benefit (cycles)  290   5,800  5,800  5,800
1st        cost (cycles)     150   150    150    150
1st        gain (cycles)     140   5,650  5,650  5,650
2nd        benefit (cycles)  290   N/A    5,800  5,800
2nd        cost (cycles)     150   N/A    5,660  5,660
2nd        gain (cycles)     140   N/A    140    140
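The greedy computation of $N_i$ and the numbers in Table 6.3 can be retraced with the following sketch. It is not the implementation used in the thesis. The associativity of 2, the frequencies 10, 200 and 190, and the initial resilience of $m_1$ in the context of $lp_2$ are not stated in this section; they are back-derived from Table 6.3 by dividing its entries by $LAT_{miss} - LAT_{hit} = 29$ cycles and should be read as assumptions, as should the names `Context`, `locking_gain` and `slot_bound`.

```python
from dataclasses import dataclass, replace

LAT_MISS, LAT_HIT, PENALTY = 30, 1, 150  # cycles (assumed values)


@dataclass
class Context:
    block: str                       # memory block name
    resilience: int                  # -1: Always Miss or Non-Classified
    freq: int                        # execution frequency on the worst-case path
    younger: frozenset = frozenset() # names of the blocks in ys_m
    lock_freq: int = 1               # freq_r in Eq. 6.9


def locking_gain(m, contexts):
    delta = LAT_MISS - LAT_HIT
    benefit = delta * m.freq                                   # Eq. 6.7
    eviction = sum(delta * v.freq for v in contexts            # Eq. 6.8
                   if v.resilience == 0 and m.block not in v.younger)
    routine = PENALTY * m.lock_freq                            # Eq. 6.9
    return benefit - (eviction + routine)                      # Eqs. 6.10, 6.11


def slot_bound(contexts, assoc):
    """Greedy loop in the spirit of Algorithm 5; returns (N_i, per-slot gains)."""
    contexts = [replace(c) for c in contexts]  # work on copies
    gains = []
    while len(gains) < assoc:
        candidates = [c for c in contexts if c.resilience == -1]
        best = max(candidates, key=lambda c: locking_gain(c, contexts),
                   default=None)
        if best is None or locking_gain(best, contexts) <= 0:
            break
        gains.append((best.block, locking_gain(best, contexts)))
        survivors = []
        for c in contexts:
            if c.block == best.block:
                continue  # the locked block always hits from now on
            if c.resilience >= 0 and best.block not in c.younger:
                c.resilience -= 1  # one free cache line fewer
            survivors.append(c)
        contexts = survivors
    return len(gains), gains


example = [
    Context("m1", -1, 10),                      # m1 in the context of lp1
    Context("m2", -1, 200),
    Context("m3", -1, 200),
    Context("m4", -1, 200),
    Context("m1", 1, 190, frozenset({"m1"})),   # m1 in the context of lp2
]
print(slot_bound(example, assoc=2))  # -> (2, [('m2', 5650), ('m1', 140)])
```

With these assumed inputs the sketch locks $m_2$ first (gain 5,650) and then one of the blocks with gain 140, which agrees with $N_1 = 2$ and with the gains listed in Table 6.3. Ties are broken by input order here, whereas the text breaks them randomly.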
Accumulated Constraints
In the loop-based dynamic locking, the memory blocks locked in the outer loops
will be brought into their corresponding inner loops. For a global optimization,
we also need to bound the accumulated locking slots at each loop in the program.
For $lp_i \in LP$, we use $OL_i$ to indicate the set of outer loops of $lp_i$, while $IL_i$
represents the set of inner loops of $lp_i$. Suppose loop $lp_j \in LP$ is an outer loop
of $lp_i$, that is, $lp_j \in OL_i$. We define a loop set $LP_{i,j}$ as the set of loops between
outer loop $lp_j$ and inner loop $lp_i$ (inclusive).
$$LP_{i,j} = \{ lp_y \mid y = i \lor y = j \lor (lp_y \in OL_i \land lp_y \in IL_j) \} \tag{6.16}$$
Therefore, the accumulated number of locked memory blocks starting from $lp_j$
to $lp_i$ is
$$acc_{i,j} = \sum_{lp_y \in LP_{i,j}} n_y \tag{6.17}$$
We can also bound $acc_{i,j}$ with $N_i$, the maximum number of memory blocks
that can be locked at $lp_i$ under static locking analysis. However, $N_i$ is too
restrictive for $acc_{i,j}$, as $N_i$ only considers the locking benefit of memory blocks
in $lp_i$, while the accumulated memory blocks locked at $lp_i$ can potentially come
from the outer loops $LP_{i,j} \setminus \{lp_i\}$. Therefore, an appropriate bound on $acc_{i,j}$
should consider maximum locking benefit and minimum locking cost. That is,
we should consider the locking benefit of memory blocks in $lp_j$ while only the
locking cost of memory blocks in $lp_i$ is taken into account. We define such a
bound as $N_{i,j}$, the number of memory blocks that can be locked at $lp_j$ from the
perspective of $lp_i$ in static locking analysis. The computation of $N_{i,j}$ is similar
to that of $N_i$. The main difference is that we choose the memory blocks from
the effective region $lp_j$ ($S^{-1}_j$) for locking, while only the negative impact on $lp_i$
is considered. Meanwhile, we update the resilience sets for both $lp_i$ and $lp_j$.
Therefore, we have
$$acc_{i,j} \le N_{i,j} \tag{6.18}$$
We study the $N_{2,1}$ computation for the program in the motivating example
(Figure 6.2). As there is only eviction cost from $lp_2$ when we compute $N_1$ in
the previous section, we get the same value as $N_1$. That is, $N_{2,1} = 2$. Obviously,
$N_2 = 0$ is too restrictive for $acc_{2,1} = n_1 + n_2$. Similarly, we have $N_{3,1} = 2$.
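To make the loop set $LP_{i,j}$ and the accumulated slot count concrete, the sketch below derives both from a loop nesting tree. The `parent` map, the function names, and the nesting used in the example (both $lp_2$ and $lp_3$ directly inside $lp_1$) are assumptions for illustration; the actual loop structure is given by Figure 6.2.

```python
def outer_loops(loop, parent):
    """OL_i: all loops enclosing `loop`, from the closest one outwards."""
    result = []
    while parent.get(loop) is not None:
        loop = parent[loop]
        result.append(loop)
    return result


def loops_between(inner, outer, parent):
    """LP_{i,j} (Eq. 6.16): the nesting chain from `inner` up to `outer`."""
    chain = [inner] + outer_loops(inner, parent)
    if outer not in chain:
        raise ValueError(f"{outer} does not enclose {inner}")
    return chain[: chain.index(outer) + 1]


def accumulated_slots(inner, outer, parent, slots):
    """acc_{i,j}: total number of slots allocated along LP_{i,j}."""
    return sum(slots[lp] for lp in loops_between(inner, outer, parent))


# Assumed nesting: lp2 and lp3 are both nested directly inside lp1.
parent = {"lp1": None, "lp2": "lp1", "lp3": "lp1"}
slots = {"lp1": 1, "lp2": 0, "lp3": 1}
acc_bound = {("lp2", "lp1"): 2, ("lp3", "lp1"): 2}   # N_{2,1}, N_{3,1}

ok = all(accumulated_slots(i, j, parent, slots) <= bound
         for (i, j), bound in acc_bound.items())
```

Under the assumed nesting, the allocation $n_1 = 1$, $n_2 = 0$, $n_3 = 1$ gives $acc_{2,1} = 1$ and $acc_{3,1} = 2$, so both accumulated constraints hold.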
Locking Gain Approximation
The accurate locking gain at lpi when ni slots are allocated is not available until
the locked memory blocks and the locking slots are fixed for each loop. Thus,
we use the locking gain for each memory block while computing Ni for lpi to
approximate the locking gain gain(lpi, ni). Recall that Ni is a bound for ni.
So, in the best case, $N_i$ slots are allocated at $lp_i$ in dynamic cache locking. We
define a 0-1 binary variable $B^k_i$ to indicate whether the kth ($1 \le k \le N_i$) slot is
allocated at $lp_i$ in dynamic cache locking. Clearly, until the kth slot is allocated,
its subsequent slot cannot be allocated. Thus, we have
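$$B^k_i \ge B^{k+1}_i, \quad \text{where } 1 \le k \le N_i - 1 \tag{6.19}$$
The number of locking slots allocated at $lp_i$ is then
$$n_i = \sum_{1 \le k \le N_i} B^k_i \tag{6.20}$$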
When we lock a memory block in the kth slot at $lp_i$, we use the locking gain of
the kth memory block in computing $N_i$ to approximate its locking gain. When
the kth slot is allocated, the locking gain is $gain^k_i$, otherwise we have no locking
gain. So the locking gain at the kth slot in dynamic locking is
$$gain^k_i \times B^k_i \tag{6.21}$$
The total gain function $gain(lp_i, n_i)$ can be approximated as
$$\sum_{1 \le k \le N_i} gain^k_i \times B^k_i \tag{6.22}$$
Equations 6.6, 6.18, 6.19, 6.20 and 6.22 are the constraints in the ILP
formulation. The objective of this ILP formulation is to maximize
$\sum_{lp_i \in LP} gain(lp_i, n_i)$.
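The complete ILP formulation for locking slot analysis is shown in Figure 6.5.
Maximize    $\sum_{lp_i \in LP} gain(lp_i, n_i)$
Subject to  $n_i \le N_i$, ($lp_i \in LP$)
            $acc_{i,j} \le N_{i,j}$, ($lp_i \in LP$, $lp_j \in LP$, $lp_j \in OL_i$)
            $acc_{i,j} = \sum_{lp_y \in LP_{i,j}} n_y$, ($lp_i \in LP$, $lp_j \in OL_i$)
            $B^k_i \ge B^{k+1}_i$, ($lp_i \in LP$, $1 \le k \le N_i - 1$)
            $n_i = \sum_{1 \le k \le N_i} B^k_i$, ($lp_i \in LP$)
            $gain(lp_i, n_i) = \sum_{1 \le k \le N_i} gain^k_i \times B^k_i$, ($lp_i \in LP$)
Figure 6.5: Complete ILP formulation.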
With the program in the motivating example (Figure 6.2), we now can have
an ILP formulation as shown in Figure 6.6. By solving this problem, we obtain
n1 = 1, n2 = 0 and n3 = 1.
Maximize    $gain(lp_1, n_1) + gain(lp_2, n_2) + gain(lp_3, n_3)$
Subject to  $n_1 \le 2$; $n_2 \le 0$; $n_3 \le 2$
            [...]
            $gain(lp_1, n_1) = 5650 \times B^1_1 + 140 \times B^2_1$
            $gain(lp_2, n_2) = 0$
            $gain(lp_3, n_3) = 4300 \times B^1_3 + 4300 \times B^2_3$
Figure 6.6: ILP formulation for the motivating example.
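Because the example is tiny, its solution can be checked by exhaustive enumeration instead of an ILP solver. The sketch below is such a check, not the CPLEX model used in the thesis. It assumes that the accumulated constraints are $n_1 + n_2 \le N_{2,1} = 2$ and $n_1 + n_3 \le N_{3,1} = 2$ (that is, $lp_2$ and $lp_3$ are each nested directly in $lp_1$), and that allocating $n_i$ slots sets the first $n_i$ binary variables $B^k_i$ to 1, as implied by Equation 6.19.

```python
from itertools import product

# Per-slot gains gain^k_i and local bounds N_i, taken from Figure 6.6.
GAINS = {"lp1": [5650, 140], "lp2": [], "lp3": [4300, 4300]}
LOCAL_BOUND = {"lp1": 2, "lp2": 0, "lp3": 2}
# Assumed accumulated constraints: (loops on the chain, bound N_{i,j}).
ACC_CONSTRAINTS = [(("lp2", "lp1"), 2), (("lp3", "lp1"), 2)]


def total_gain(slots):
    # With B^k_i >= B^{k+1}_i, n_i slots select the first n_i gains.
    return sum(sum(GAINS[lp][:n]) for lp, n in slots.items())


def feasible(slots):
    return all(sum(slots[lp] for lp in chain) <= bound
               for chain, bound in ACC_CONSTRAINTS)


def solve():
    loops = list(LOCAL_BOUND)
    best = None
    for counts in product(*(range(LOCAL_BOUND[lp] + 1) for lp in loops)):
        slots = dict(zip(loops, counts))
        if feasible(slots) and (best is None
                                or total_gain(slots) > total_gain(best)):
            best = slots
    return best


print(solve())  # -> {'lp1': 1, 'lp2': 0, 'lp3': 1}
```

The enumeration returns $n_1 = 1$, $n_2 = 0$ and $n_3 = 1$ with a total gain of 9,950 cycles, matching the solution reported above. Since $n_2 \le 0$ forces $n_2 = 0$, the result would be the same if $lp_3$ were nested inside $lp_2$.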
6.4.5 Memory Block Selection
Next, we select the beneficial memory blocks to fill the locking slots determined
in the previous subsection for each loop. To simplify the computation, we assume
that the locking cost, which includes the cost of memory block eviction and
locking/unlocking routine execution, is fixed once the locking slots are determined.
Therefore, we only focus on maximizing the locking benefit. Although the ben-
efit of locking memory blocks can be determined through cost-benefit analysis,
the important issue here is to select the locking point. For a memory block m
from loop lp, it can be either locked at loop lp or its outer loops. Obviously, the
locking point of a memory block affects the locking point of the other memory
blocks.
Let us take the program in Figure 6.3 for example. Let us assume there is
one locking slot at both lp1 and lp2, and we want to lock memory block m. If
we lock m at the inner loop lp2, the rest of the memory blocks in lp1 compete
for the locking slot at lp1. On the other hand, if we lock m at lp1, the rest of
memory blocks in lp2 compete for the locking slot at lp2. In other words, the
memory blocks in region a and region b can no longer be locked.
We choose to fill the locking slots from the innermost loop to the outermost
loop in the program. For each loop, we try to use up all the locking slots de-
termined in the previous subsection in order to maximize the WCET reduction.
Without loss of generality, suppose $lp_i \in LP$ is the innermost loop with available
locking slots. With the cost-benefit analysis in Section 6.4.4, we select the
memory block $m \in S^{-1}_i$ with the maximum locking gain to fill a slot at $lp_i$.
Later, we update the resilience sets for lpi and the outer loops of lpi with the
approach in Section 6.4.4. We continue to fill the slots until all the slots at lpi
are filled with memory blocks. Then, we mark lpi as filled and move to its outer
loops with available locking slots. The process terminates when all the locking
slots are filled. The details are shown in Algorithm 6. In the end, we perform
abstract cache states analysis for the cache set, and update the WCET and the
worst-case path information.
With the program in the motivating example (Figure 6.2), we first fill the slot
at $lp_3$. As the locking gains of $m_2$, $m_3$ and $m_4$ are the same at $lp_3$, we randomly
fill it with $m_3$. Then, we move to the outer loop $lp_1$ and lock $m_2$ in a similar
way.
When all the cache sets are analyzed for dynamic locking, we obtain the
memory blocks locked at each loop and the optimized WCET after cache locking.
When there is no improvement on WCET compared to static analysis, no
dynamic cache locking will be applied. Meanwhile, no instruction is inserted
before loop entry and after loop exit. Thus, in the worst case, our dynamic cache
locking produces the same results as static analysis.
Algorithm 6: Memory block selection
Input: Loop set LP
Output: Locked memory block set M_i for each lp_i ∈ LP
begin
    /* lp_innermost is the innermost loop in LP */
    /* lp_outermost is the outermost loop in LP */
    for lp_i = lp_innermost → lp_outermost do
        if n_i ≤ 0 then
            continue;
        M_i := ∅;
        for k = 1 → n_i do
            m_k := cost_benefit_analysis(lp_i);
            M_i := M_i ∪ {m_k};
            update_memory_block_sets(lp_i);
            /* OL_i is the set of outer loops of lp_i */
            foreach lp_j ∈ OL_i do
                update_memory_block_sets(lp_j);
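The innermost-to-outermost order of slot filling can be sketched as follows. This is a deliberate simplification of Algorithm 6: the per-block gains are fixed inputs rather than being recomputed by the cost-benefit analysis, and updating the memory block sets is reduced to removing a locked block from the candidates of the loop and of its outer loops. The function name, the data layout and the gain values (taken from Table 6.3 and Figure 6.6) are assumptions for illustration.

```python
def select_blocks(loop_order, outer_loops, slots, candidate_gain):
    """Fill the locking slots from the innermost to the outermost loop.

    loop_order     -- loops listed innermost first
    outer_loops    -- loop -> list of its enclosing loops
    slots          -- loop -> number of locking slots n_i
    candidate_gain -- loop -> {block: locking gain} (fixed, simplified)
    """
    remaining = {lp: dict(g) for lp, g in candidate_gain.items()}
    locked = {lp: [] for lp in loop_order}
    for lp in loop_order:
        for _ in range(slots.get(lp, 0)):
            if not remaining[lp]:
                break
            # Highest gain first; ties are broken by block name.
            block = max(sorted(remaining[lp]), key=remaining[lp].get)
            locked[lp].append(block)
            # Simplified update: the block is no longer a candidate here
            # or in any enclosing loop.
            for affected in [lp] + outer_loops[lp]:
                remaining[affected].pop(block, None)
    return locked


loop_order = ["lp3", "lp2", "lp1"]
outer = {"lp1": [], "lp2": ["lp1"], "lp3": ["lp1"]}   # assumed nesting
slots = {"lp1": 1, "lp2": 0, "lp3": 1}
gains = {
    "lp1": {"m1": 140, "m2": 5650, "m3": 5650, "m4": 5650},
    "lp2": {},
    "lp3": {"m2": 4300, "m3": 4300, "m4": 4300},
}
result = select_blocks(loop_order, outer, slots, gains)
```

With name-based tie-breaking, this sketch locks $m_2$ at $lp_3$ and $m_3$ at $lp_1$. The text breaks the same ties randomly and reports $m_3$ at $lp_3$ and $m_2$ at $lp_1$; the locking gain is identical in both cases.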
6.4.6 Complexity Analysis
We analyze the computational complexity of our four-step approach on dynamic
cache locking. For the WCET analysis, the abstract cache states analysis is a
fixed-point data flow analysis. Thus, the complexity depends on the cache
configuration and program control flow. As it is a mature analysis approach, we
assume its complexity is O(w) for the analysis on each cache set.
In resilience analysis, we classify memory blocks into different resilience
sets. Obviously, the complexity is O(M), where M is the total number of memory
blocks. In locking slot analysis, we first compute the local loop constraints
for each loop, as shown in Algorithm 5. As it performs cost-benefit analysis
for all memory blocks in the loop, and at most A memory blocks can be
locked, the complexity is O(M × A), where A is the cache associativity.
As we perform the computation for the N loops, we have O(N × M × A)
for the local loop constraints computation. As we have mentioned, the accumulated
constraints computation also adopts Algorithm 5, and each computation takes
O(M × A). The computation frequency for each loop is bounded by the number of its
outer loops. For example, the maximum computation frequency for the innermost
loop is N − 1, because its maximum number of outer loops is N − 1.
Summed over all loops, the computation frequency is at most
(N − 1) + (N − 2) + ... + 1 = N(N − 1)/2. Therefore, the complexity for accumulated
constraints computation is $O(N^2 \times M \times A)$. The locking slot computation given
the constraints has $A^N$ complexity because at each loop we have A choices. For the
memory block selection step, we fill the locking slots for each loop. When we select the most
beneficial memory block, we need to do the cost-benefit analysis for all the
memory blocks in a loop, and the maximum number of slots for a loop is A. So, the
complexity is O(A × M × N). Thus, for the locking analysis of each cache set, the
complexity is $O(M) + O(A^N) + O(A \times M \times N) + O(N^2 \times M \times A) + O(w)$. We
simplify it to $O(A^N) + O(N^2 \times M \times A) + O(w)$. Suppose the number of cache
sets is K; finally we have $O(K \times A^N) + O(K \times N^2 \times M \times A) + O(K \times w)$.
6.5 Experimental Evaluation
We now present an experimental evaluation of our loop-based dynamic cache
locking. We compare our approach with static cache locking proposed in Chap-
ter 4, static cache analysis [101, 17], and existing region-based dynamic cache
locking approach [15].
6.5.1 Experimental Setup
We use the benchmarks from the MRTC benchmark suite [46] as shown in Table 6.4,
where the original code size is the program size without any change
in the benchmarks. For loop-based dynamic cache locking, we assume the call
instructions to locking/unlocking routines are inserted into the programs, and
the corresponding code sizes of these benchmarks after instruction insertion are
shown in the 3rd column of Table 6.4. As can be observed, the code size increment
is small (at most 4.90%). The number of loops and nested loops for each benchmark is
presented in the last column of Table 6.4. We compile the benchmarks with the gcc
cross-compiler for the SimpleScalar PISA (Portable ISA) instruction set [21].
The runtimes of our analysis algorithms are obtained on a 2.53GHz Intel Xeon
CPU with 24GB memory. IBM CPLEX is used as the ILP solver to obtain both
the WCET and the locking slots.
As we are modeling the instruction cache, we assume a simple in-order processor
with unit latency for all data memory references. Also, we consider
architectures without timing anomalies caused by interactions between caches and
other architecture features. We assume a 4-way set-associative cache with a block
size of 32 bytes. The cache hit latency is 1 cycle and the cache miss penalty is
30 cycles. We assume a 150-cycle overhead to lock and unlock a memory block
because the locking routine involves multiple instructions to lock each memory
block and it is kept in the un-cacheable region of the program memory.
Table 6.4
Benchmark  Original code size  Code size after insertion  Increment (%)  Loops (nested loops)
adpcm      11,000              11,248                     2.25           15 (10)
cnt        1,648               1,712                      3.88           4 (4)
crc        2,048               2,096                      2.34           3 (0)
edn        7,296               7,472                      2.41           11 (7)
fdct       5,176               5,208                      0.62           2 (0)
jfdctint   5,520               5,568                      0.87           3 (0)
matmult    1,632               1,712                      4.90           5 (5)
minver     6,256               6,536                      4.48           17 (16)
ndes       6,352               6,544                      3.02           12 (8)
st         2,248               2,312                      2.85           4 (0)
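The fourth column of the table can be recomputed from the two code sizes, which suggests that it reports the relative code size increase in percent. The short check below is only an illustration; the dictionary layout and the function name are assumptions.

```python
# (original size, size after inserting the locking/unlocking calls)
CODE_SIZES = {
    "adpcm": (11000, 11248), "cnt": (1648, 1712), "crc": (2048, 2096),
    "edn": (7296, 7472), "fdct": (5176, 5208), "jfdctint": (5520, 5568),
    "matmult": (1632, 1712), "minver": (6256, 6536), "ndes": (6352, 6544),
    "st": (2248, 2312),
}
# Values listed in the fourth column of the table.
REPORTED = {
    "adpcm": 2.25, "cnt": 3.88, "crc": 2.34, "edn": 2.41, "fdct": 0.62,
    "jfdctint": 0.87, "matmult": 4.90, "minver": 4.48, "ndes": 3.02,
    "st": 2.85,
}


def increment_percent(original, instrumented):
    """Relative code size increase in percent, rounded to two decimals."""
    return round(100.0 * (instrumented - original) / original, 2)


computed = {name: increment_percent(*sizes)
            for name, sizes in CODE_SIZES.items()}
```

All ten recomputed values agree with the listed ones, and the largest increase is 4.90% for matmult.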
6.5.2 Comparison with Static Approaches
We first compare our loop-based dynamic cache locking approach with the static
approaches, that is, static cache analysis and static cache locking. We perform
static analysis with abstract interpretation [101, 17], while we adopt the heuris-
tic approach for partial cache locking in Chapter 4 to obtain the static locking
results. We use the static analysis results as the baseline, and normalize the
results of static locking and our dynamic locking approach. We consider two
different cache sizes. The results are shown in Figure 6.7.
In Figure 6.7(a), our loop-based dynamic locking outperforms static analysis
for all the benchmarks, and the improvement is up to 40% (cnt and matmult).
When compared with static locking, our approach wins in most of the cases
except for crc and fdct. For crc, static locking performs well as memory blocks
that mostly affect the WCET are locked, while cache locking does not help
much in fdct. Furthermore, we usually pay some extra cost due to code size
increase. In this case, we have worse WCET for crc and fdct. On average,
static locking improves the WCET by 13%, while our dynamic cache locking
approach improves the WCET by 23%.
In Figure 6.7(b), as the cache size increases, more memory accesses can
be classified as cache hits, and the improvement via cache locking decreases
accordingly. In this case, more benchmarks produce worse dynamic locking
results than static locking. Nevertheless, our approach still has 15% improvement
on average compared to static analysis,
while it is just 7% for static locking. For both cache configurations, the benchmarks
with nested loops (edn, minver and ndes) always produce good results
compared to both static analysis and static locking. As we have mentioned,
in nested loops, memory blocks can be flexibly locked at different loop levels,
which makes dynamic cache locking quite advantageous and leads to better
WCET. For adpcm, our approach achieves small improvement as most of its
loops are small and most of its memory accesses are classified as cache hits.
Figure 6.7: Comparison between loop-based dynamic locking and static approaches.
6.5.3 Comparison with Region-based Approach
We also compare our loop-based dynamic locking approach with the existing
dynamic cache locking approaches. There are two main dynamic locking approaches
for instruction cache: [15] and [74]. The approach in [74] does not
consider the cache mapping function. That is, they assume that a memory block
can be locked in any cache set, similar to scratchpad memory allocation.
Code placement/layout change is required after the
analysis in order to map locked memory blocks to the corresponding cache sets.
This significantly increases the code size. Therefore, we implement the region-based
dynamic cache locking method [15] for comparison. The method in [15] assumes that
there is no modification to the program due to locking. The locking and unlocking
of a memory block is handled by raising an exception. Thus, for a fair
comparison, we assume no code change due to locking for both loop-based and
region-based approaches.
Figure 6.8: Comparison between loop-based and region-based dynamic locking.
Figure 6.8 shows the comparison results under different cache sizes, where
static analysis results are used as the baseline. For most of the benchmarks, our
results are much better than those of the region-based dynamic locking method.
For the region-based method, memory blocks can only be locked at the beginning
of a region, which does not provide fine-grained flexibility of selectively
locking memory blocks from the same region at different program points. The
approach in [15] also uses full cache locking that may prevent the unlocked memory
blocks from exploiting their locality. However, there are also exceptions, e.g.,
matmult in Figure 6.8(b). When the cache size is 10% of the average task size, most of the
frequently executed memory blocks in matmult can fit into the cache in the
region-based approach, whereas the approximations used in the loop-based approach
do not allocate all the cache lines for locking.
Table 6.5
Benchmark  Static analysis  Static locking  Loop-based dynamic locking  Region-based dynamic locking
adpcm      4.17             10.97           9.46                        246.08
cnt        0.02             0.04            0.09                        6.24
crc        0.04             0.26            0.13                        7.03
edn        0.15             0.39            0.40                        21.49
fdct       0.03             0.29            0.11                        0.70
jfdctint   0.03             0.17            0.11                        1.07
matmult    0.02             0.05            0.12                        5.88
minver     0.29             0.65            0.85                        62.00
ndes       0.24             0.48            0.64                        35.18
st         0.03             0.08            0.13                        9.91
6.5.4 Runtime of Different Methods
We present the runtime for static analysis, static locking, loop-based dynamic
locking and region-based dynamic locking when the cache size is 10% of the
average task size, as shown in Table 6.5. The runtime of our loop-based ap-
proach is close to that of static approaches, while region-based method takes
more time. Thus our approach is efficient.
6.6 Discussion
Our loop-based dynamic cache locking approach can further improve the WCET,
and it is more flexible than the existing approaches. However, dynamic cache
locking approaches may not be necessary when the cache size is large or pro-
gram size is small. Our approach considers the global optimization, but we still
have approximations when we compute the locking gain. This may compromise
the results, as we have mentioned in the evaluation section (for matmult, when
the cache size is 10% of the average task size).
6.7 Summary
In this chapter, we propose a loop-based dynamic locking approach for instruc-
tion caches to minimize the WCET. We accurately capture the locking cost and
benefit through resilience analysis. A global optimization method allocates the
locking slots across loop levels and the most beneficial memory blocks are se-
lected to fill the slots. Our approach substantially reduces the WCET compared
to static analysis, static locking and region-based dynamic cache locking. Meanwhile,
the runtime of our approach is comparable to that of the static approaches, and our
approach is more efficient than the region-based dynamic cache locking method.
Chapter 7
Cache Locking for Shared Cache
Multi-core Processors
In Chapters 4, 5 and 6, we focused on cache optimization in uni-processors. In
this chapter, we perform partial cache locking in multi-core processors with
a shared cache.
7.1 Overview
As mentioned in Chapter 1, multi-core processors are beginning to be widely
used in real-time systems, and the shared resources in multi-core systems in-
troduce additional challenges for the timing analysis. In particular, multi-core
systems often employ shared L2 cache (see Figure 7.1). The presence of this
shared resource requires the modeling of inter-core cache conflicts. For exam-
ple, consider a memory block m accessed by a task t in the shared L2 cache of
a multi-core system. If task t is allowed to use the L2 cache exclusively, then
static cache analysis may determine that access of m will be a guaranteed L2
cache hit. However, in reality, memory accesses from the tasks running on other
cores concurrently may conflict with m and evict m from the L2 cache. Thus,
the access to m may become an L2 cache miss with a longer memory
access latency, leading to an increased WCET of task t. Thus, the conflicts
in the L2 cache can affect the WCET of the tasks, which in turn, can impact
the worst-case response time (WCRT) of the application. WCRT is the latest
completion time of any task in a multitasking application.
Figure 7.1: Multi-core architecture with shared L2 cache.
Existing methods guarantee the timing predictability of shared cache multi-core
processors through a combination of cache locking and cache partitioning [96, ...].
Cache partitioning eliminates the inter-task or inter-core cache interference in
the shared cache. Then,
full cache locking is applied after cache partitioning. By integrating cache
partitioning and full cache locking, static cache analysis is not required, and the
timing is predictable. However, cache partitioning may limit the performance
of the shared cache, as each core/task can only use a fraction of the shared cache.
Furthermore, aggressive full cache locking may have negative impact on the
overall timing as we have discussed in Chapters 1 and 4.
On the other hand, there exist several static analysis efforts in WCRT estima-
tion for multi-core architectures with shared L2 cache [112, 62, 48]. However,
static analysis approaches may not produce accurate results, as we have men-
tioned earlier. Moreover, these WCRT estimation techniques focus on modeling
the shared cache conflicts and assume pre-determined task to core mapping.
That is, the task mapping phase is completely agnostic to the shared cache ef-
fects. However, task mapping significantly influences the set of tasks that ex-
ecute in parallel on different cores and hence the amount of conflicts in the
shared L2 cache. These shared cache conflicts, in turn, impact the WCET of
the tasks and eventually the workload balance. Clearly, decoupled task mapping
and shared cache modeling solution leads to sub-optimal WCRT for the entire
system.
In this chapter, we apply partial cache locking in multi-core processors with
shared cache, in order to improve the WCRT of multitasking applications. A
two-step framework is proposed, as shown in Figure 7.2. Prior to cache locking
optimization, we first propose a shared cache aware task mapping solution to
minimize the WCRT. Our task mapping approach considers the workload balance
among the cores and the shared L2 cache conflicts in an integral fashion,
leading to significantly improved WCRT. Our cache locking approach further
improves the WCRT based on the resultant task mapping. We statically lock
memory blocks in the private L1 cache for each task, which not only reduces
the number of L1 cache misses but also minimizes the number of accesses to
the shared L2 cache. Such two-step optimization substantially improves the
WCRT of multitasking applications.
Figure 7.2: Overall framework for cache locking in multi-core processors.
7.2 Motivating Example for Task Mapping
In this section, we present a motivating example for the task mapping approach.
Figure 7.3 shows the impact of task mapping on the WCRT for a small task
graph consisting of five tasks as shown in Figure 7.3(a). It executes on a 2-core
architecture with 256 bytes of L1 cache for each core and a 2KB
L2 cache shared among the two cores. The two cores are homogeneous. For
this simple example, we can exhaustively enumerate all the 16 possible task
mappings. For each task mapping, we show in Figure 7.3(b), the WCRT with
shared L2 cache modeling as will be described in Section 7.5 and the WCRT
without shared L2 cache modeling (i.e., assume cache miss (hit) for each L2
access in the worst (best) case). Clearly, the WCRT critically depends on the
task mapping. As expected, the estimated WCRT with L2 cache modeling is
lower than WCRT without L2 cache modeling. What is also interesting is that
the trends across different task mappings are very different with and without
L2 cache modeling. For example, the last two task mappings (#15 and #16)
yield the minimum WCRT with L2 cache modeling. But these particular task
mappings provide the worst WCRT without L2 cache modeling. In a decoupled
task mapping and shared cache modeling approach, task mappings #2 and #9,
which have the minimum WCRT without L2 cache modeling, will be selected. An
integrated task mapping and shared cache modeling approach, on the other hand,
is able to identify task mappings #15 and #16 as the best solutions. Hence, it
is imperative to design a shared cache aware task mapping solution.

Figure 7.3: Motivating example. (a) Task graph. (b) Impact of task mapping on
WCRT with and w/o L2 cache modeling; the plot marks the optimal WCRT mappings
w/o L2 cache modeling and the optimal WCRT mappings with L2 cache modeling.
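The count of 16 task mappings in this example can be reproduced with a short enumeration. The sketch below assumes that two mappings that differ only by swapping the two homogeneous cores are counted once (so 2^5 / 2 = 16); this counting convention is our reading of the example and is not stated explicitly in the text. The function name `count_mappings` is illustrative only.

```python
from itertools import product


def count_mappings(num_tasks: int, num_cores: int) -> int:
    """Count task-to-core mappings up to renaming of identical cores."""
    canonical_forms = set()
    for assignment in product(range(num_cores), repeat=num_tasks):
        relabel = {}
        # Rename cores in order of first use, so that swapping two
        # homogeneous cores yields the same canonical tuple.
        canonical = tuple(relabel.setdefault(c, len(relabel)) for c in assignment)
        canonical_forms.add(canonical)
    return len(canonical_forms)


print(count_mappings(5, 2))  # prints 16
```

The count grows exponentially with the number of tasks, which is why exhaustive enumeration is only practical for small task graphs such as this one.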
It is challenging to design an integrated task mapping and shared cache mod-
eling solution due to their interdependency. That is, task mapping influences the
amount of conflicts in the shared cache, while the cache conflicts change the ex-
ecution time of the tasks leading to load imbalance among the cores, which calls
for a different task mapping. We resort to an integer linear programming (ILP)
formulation to model this inter-dependency between shared cache modeling and
task mapping.
7.3 Task Model and System Architecture
In this section, we introduce the details of the application model and the system
architecture.
We represent our multi-tasking application as a task graph. A task graph is
simply a directed acyclic graph, where the nodes of the graph denote the tasks
and the edges denote the dependencies between tasks. More formally, a task
graph T consists of M tasks T = {t1, t2, . . . , tM}. For each task ti, pred(ti)
denotes the set of predecessors of task ti. Thus, task ti is only ready to start
execution after the completion of all the tasks in pred(ti). A pair of tasks ti and
tj can execute in parallel on different cores if there is no dependency relation
between them. As for the execution of the tasks mapped on each individual core,
we assume that the tasks are executed in a non-preemptive fashion. In other
words, once a task starts execution, it will continue until completion without
preemption from other tasks.
The multi-core architecture with a shared L2 cache that we model is shown
in Figure 7.1. There are N homogeneous cores {p1, p2, . . . , pN} in the system
P. Each core pi has its own private L1 cache, while the L2 cache is shared
among all the cores. We consider set associative caches with the Least Recently
Used (LRU) cache replacement policy. We assume that no timing anomalies are
caused by the interaction between caches and other architectural features during
WCET analysis. Note that timing anomalies would not affect the task mapping
solution; only the WCET analysis component would need to be modified. The L2
cache block size is assumed to be larger than or equal to the L1 cache block
size. We analyze non-inclusive multi-level caches [48], where the following
properties hold:
• A memory reference is searched in the L2 cache if and only if it is a miss
in L1 cache.
• For every miss at level L, the requested memory block is loaded into the
cache at level L.
Finally, we focus on instruction memory hierarchy in this work. We do not
model the data cache.
7.4 Task Mapping Framework Overview
Prior to cache locking optimization, we first generate an appropriate task
mapping for the multitasking application. Our shared cache aware task mapping
framework consists of three phases, as shown in Figure 7.4: intra-task cache
analysis, task mapping with shared cache modeling, and iterative WCRT
computation.

Figure 7.4: Task Mapping Framework. Recoverable panel labels: (a) Intra-Task
Cache Analysis; (c) Iterative WCRT Computation, with L2 Cache Conflict Analysis.
In the intra-task cache analysis phase, we perform a static cache analysis for
each individual task in isolation. We start with the L1 cache. As the L1 cache
is a private cache and the tasks assigned to the same core execute in a non-
preemptive fashion, we can safely analyze L1 cache independently for each task.
The output of L1 cache analysis is the classification of L1 cache accesses
(e.g., hit or miss). The accesses that are guaranteed to be L1 cache hits are
filtered out from L2 cache analysis. Then, we perform L2 cache analysis for the
unfiltered accesses (e.g., L1 cache misses). At this point, we assume there is no interference
from the tasks executing on the other cores for the L2 cache. This interference
is modeled in later stages. Therefore, the intra-task cache analysis is identical
to the multi-level non-inclusive instruction cache analysis proposed in [48].
In the task mapping phase, our goal is to derive a task mapping solution
that balances the workload among the cores and minimizes the shared L2 cache
interference so as to minimize the WCRT of the task graph. It is challenging to
achieve this goal due to the following reasons. First, the search space for task
mapping itself is quite large. The number of possible task mappings is exponen-
tial in terms of the number of tasks, all of which need to be evaluated. Second,
given a task, its execution lifetime depends on the shared L2 cache conflicts
caused by the tasks executing in parallel on the other cores, which in turn de-
pends on the task mapping. The inter-dependency between task mapping and
task execution time (due to the shared cache) introduces significant complexity
to the problem.
Clearly, an optimal task mapping solution requires shared cache modeling for
WCRT estimation. Recently, Li et al. [62] proposed an iterative analysis
framework that accurately estimates the WCRT of multi-tasking programs running
on multi-core processors with a shared cache. The iterative solution is based on the
key observation that two tasks running on different cores do not conflict if they
have disjoint execution lifetime. Given a task mapping, we start with the worst-
case task interference (i.e., a task conflicts with all the other tasks mapped to dif-
ferent cores if they do not have dependencies) and iteratively improve the task
execution lifetime and the task interference. When the task interference does
not change, the iterative process terminates and returns the estimated WCRT.
The termination of this iterative process is guaranteed [62].
Figure 7.5 illustrates the iterative process using the task graph in Figure 7.3(a).
The five tasks are mapped to two cores as shown in Figure 7.5. The analy-
sis starts with the worst-case task interference as shown in Figure 7.5(a). Then,
task execution lifetime is determined using WCRT estimation (described in Sec-
tion 7.5.2). Figure 7.5(b) visualizes the execution lifetime for all the tasks,
where the duration of task execution is shown as a horizontal bar. Based on
this updated task execution lifetime, we observe that it is impossible for ndes
and compress to conflict as they have disjoint lifetime. Thus, the task interfer-
ence graph is refined as shown in Figure 7.5(c). Due to the reduction of task
interference, the execution time of ndes and compress decreases, and the new
task execution lifetime is calculated, as shown in Figure 7.5(d). We observe that
it is impossible for crc and compress to conflict as they have disjoint lifetime.
Therefore, we refine the task interference graph, as illustrated in Figure 7.5(e).
This process continues until there is no change to the task interference graph.
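The refinement loop illustrated above can be sketched as follows. This is a simplified illustration of the idea, not the analysis of [62]: `wcet_fn` is a hypothetical callback standing in for the shared cache analysis, BCET values are kept fixed, independent tasks on the same core are always treated as peers, and all task names and numbers in the usage example are invented.

```python
def others(task, pairs):
    """Tasks paired with `task` in a set of frozenset({a, b}) pairs."""
    return {u for p in pairs if task in p for u in p if u != task}


def iterative_wcrt(order, pred, core, bcet, wcet_fn, independent):
    """order: tasks in topological order; pred: task -> predecessors;
    core: task -> core index; independent: pairs with no dependency."""
    # Worst-case start: every independent pair on different cores interferes.
    intf = {p for p in independent if len({core[t] for t in p}) == 2}
    # Simplification: independent tasks on the same core are always peers.
    peers = independent - intf
    while True:
        wcet = {t: wcet_fn(t, others(t, intf)) for t in order}
        er, ef, lr, lf = {}, {}, {}, {}
        for t in order:
            er[t] = max((ef[u] for u in pred[t]), default=0)
            ef[t] = er[t] + bcet[t]
            lr[t] = max((lf[u] for u in pred[t]), default=0)
            lf[t] = lr[t] + sum(wcet[u] for u in others(t, peers)) + wcet[t]

        def overlaps(pair):
            a, b = tuple(pair)
            # Lifetimes [er, lf] of a and b overlap.
            return er[a] < lf[b] and er[b] < lf[a]

        refined = {p for p in intf if overlaps(p)}
        if refined == intf:  # fixed point: interference no longer changes
            return max(lf.values()), intf
        intf = refined


# Hypothetical three-task example: C depends on A; B runs on the other core.
base = {"A": 10, "B": 3, "C": 5}
wcrt, intf = iterative_wcrt(
    order=["A", "B", "C"],
    pred={"A": [], "B": [], "C": ["A"]},
    core={"A": 0, "B": 1, "C": 0},
    bcet={"A": 10, "B": 2, "C": 5},
    wcet_fn=lambda t, interferers: base[t] + 2 * len(interferers),
    independent={frozenset("AB"), frozenset("BC")},
)
print(wcrt)  # prints 17
```

In this toy run, the pair (B, C) is dropped after the first round because B finishes before C can start, which lowers the WCET of both tasks in the second round. Since the interference set can only shrink, the loop terminates.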
The shared cache modeling technique [62] produces accurate WCRT for a
given task mapping. Thus, exhaustively enumerating all task mappings for the
WCRT estimation framework can help us find the optimal task mapping. However,
this may not be computationally feasible, as the number of possible task
mappings is exponential in the number of tasks and the iterative WCRT
estimation may have a long runtime, especially for large task graphs. Moreover,
as noted above, the inter-dependency between task mapping and task execution
time (due to the shared cache) adds significant complexity to the problem.
Instead, we propose an integer linear programming (ILP) formulation that
integrates the dual objective of maximizing workload balance and minimizing
shared cache conflicts for task mapping. We consider the impact of task mapping
on shared cache interference and consequently on the WCRT. In order to
make the modeling computationally tractable, we approximate the new WCET of
each task without modeling the possible change of the worst-case path within
the task. This introduces sub-optimality in our solution. However, as we will
show in the experimental evaluation, the task mapping returned by our approach
can achieve the optimal result in practice. Finally, as the WCRT estimated by
our ILP formulation for the chosen task mapping may not be accurate, we call
the iterative framework developed in [62] with the chosen task mapping to
derive the actual WCRT. In the next section, we introduce our shared cache
aware task mapping in detail.

Figure 7.5: Illustration of the iterative WCRT analysis modeling shared cache.
(a) Initial interference graph based on task graph. (b) Task lifetimes
determined in first round of analysis. (c) Interference graph after first round
of analysis. (d) Task lifetimes determined in second round of analysis.
(e) Interference graph after second round of analysis.
7.5 Components of the Task Mapping Framework
Our shared cache aware task mapping framework consists of three phases: intra-
task cache analysis, task mapping with shared cache modeling, and iterative
WCRT computation. In this section, we first provide a quick overview of the
intra-task cache analysis and the WCRT computation. Then, we present the
details of our ILP formulation for the task mapping problem.
7.5.1 Intra-Task Cache Analysis
We employ the multi-level non-inclusive cache analysis proposed in [48]. Each
task in the task graph is analyzed independently. Similar analysis is performed
for each cache level (L1 and L2 cache) separately based on abstract interpreta-
tion [101]. Virtual unrolling is also applied [79, 101]. For each level of cache,
we perform must and may analysis. Each analysis generates an abstract cache
state at each program point. The abstract cache state of must analysis contains
the memory blocks that are guaranteed to be in the cache, while the abstract
cache state of may analysis identifies the memory blocks that may be present in
the cache at that particular program point. A memory access can be classified
into the following categories based on the two abstract cache states (must and
may):
• Always Hit (AH): The memory block is present in the abstract cache state
of must analysis and hence its references will always result in cache hits.
• Always Miss (AM): The memory block is not present in the abstract
cache state of may analysis and hence its references are guaranteed to
be cache misses.
• Non-Classified (NC): The memory block cannot be classified as either
always miss or always hit.
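The three categories above follow directly from membership in the two abstract cache states. The sketch below assumes, purely for illustration, that each abstract cache state at a program point is available as a plain set of memory block identifiers; the function name `classify` is ours.

```python
def classify(block, must_state, may_state):
    """Classify one access from the must/may abstract cache states."""
    if block in must_state:
        return "AH"  # guaranteed to be in the cache: Always Hit
    if block not in may_state:
        return "AM"  # cannot be in the cache: Always Miss
    return "NC"      # may or may not be cached: Non-Classified


must = {"m1"}
may = {"m1", "m2"}
print([classify(b, must, may) for b in ("m1", "m2", "m3")])
# prints ['AH', 'NC', 'AM']
```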
Once the memory blocks have been classified at L1 cache level, we proceed
to analyze them at L2 cache level. Note that the memory references that are L1
cache hits will not reach the shared L2 cache. Therefore, we need to eliminate
these references from further consideration by applying a filter function [48].
The intra-task L2 cache analysis is identical to L1 cache analysis for the un-
filtered accesses. The reader may refer to [48] for further details of intra-task
cache analysis.
7.5.2 WCRT Estimation
We employ the WCRT estimation framework modeling shared cache conflicts in
multi-cores [62]. The intra-task cache analysis classifies each possible L2 cache
access as Always Hit, Always Miss, or Non-Classified. Due to the possible L2
cache conflicts from tasks concurrently executing on other cores, the L2 cache
access classification may change. We will describe this modeling in detail in our
ILP formulation in the next subsection. Once the classification for each memory
reference is known, we can determine the access latency in the best case and
the worst case. These access latencies are plugged into the timing analysis to
estimate the best-case execution time (BCET) and the worst-case execution time
(WCET) of each task.
For each task t, $EarliestReady_t$, $EarliestFinish_t$, $LatestReady_t$ and
$LatestFinish_t$ are used to represent its execution interval. $EarliestReady_t$
($LatestReady_t$) represents the earliest (latest) time when all the predecessors
of task t have completed execution. $EarliestFinish_t$ ($LatestFinish_t$)
represents the earliest (latest) time when task t completes its execution. The
time interval $[EarliestReady_t, LatestFinish_t]$ indicates the lifetime of task t.
Two tasks t and t′ interfere with each other in the L2 cache only when they
are mapped to different cores and their lifetimes overlap. Two tasks t and t′ are
called peers when they are mapped to the same core and their lifetimes overlap.
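Two tasks with dependency between them can neither interfere with each other
nor be peers. The execution interval of each task t is computed as follows,
where a maximum over an empty set of predecessors is taken to be 0.
\[ EarliestReady_t = \max_{u \in pred(t)} EarliestFinish_u \tag{7.1} \]
\[ EarliestFinish_t = EarliestReady_t + BCET_t \tag{7.2} \]
\[ LatestReady_t = \max_{u \in pred(t)} LatestFinish_u \tag{7.3} \]
\[ LatestFinish_t = LatestReady_t + \sum_{t' \in peer(t)} WCET_{t'} + WCET_t \tag{7.4} \]
where pred(t) is the set of predecessors of task t and peer(t) is the set of peers
of task t. The WCRT of the application is
\[ WCRT = \max_{t \in T} LatestFinish_t \tag{7.5} \]
where T is the set of tasks.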
7.5.3 ILP Formulation for Task Mapping
First, we define a 0-1 decision variable $M_{ij}$, which indicates whether task ti
is mapped to core pj.
\[ M_{ij} \in \{0, 1\}, \text{ where } 0 < i \le M \text{ and } 0 < j \le N. \tag{7.6} \]
Each task can only be mapped to one core. Thus,
\[ \sum_{0 < j \le N} M_{ij} = 1 \tag{7.7} \]
In the following, we present the ILP formulation for the task interference
and peer relationship, shared cache modeling and WCRT computation.
Task Interference and Peer Relationship
For tasks ti and tj, we define a 0-1 decision variable $S_{ij.k}$ to indicate whether
tasks ti and tj are both mapped to core pk.
\[ S_{ij.k} = \begin{cases} 1 & \text{if } M_{ik} = 1 \text{ and } M_{jk} = 1 \\ 0 & \text{otherwise} \end{cases} \tag{7.8} \]
We linearize the above equation as follows.
\[ M_{ik} + M_{jk} - S_{ij.k} \le 1 \tag{7.9} \]
\[ M_{ik} + M_{jk} - 2 \times S_{ij.k} \ge 0 \tag{7.10} \]
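That constraints (7.9) and (7.10) force the same-core indicator to equal the logical AND of the two mapping variables can be checked by brute force over all 0-1 assignments. The helper names below are illustrative only.

```python
from itertools import product


def feasible_s(m_ik, m_jk):
    """Values of the 0-1 same-core indicator allowed by (7.9) and (7.10)."""
    return [
        s for s in (0, 1)
        if m_ik + m_jk - s <= 1          # (7.9): both mapped to pk forces s = 1
        and m_ik + m_jk - 2 * s >= 0     # (7.10): s = 1 requires both mapped to pk
    ]


def linearization_matches_and():
    """True if, for every 0-1 input, the only feasible value is m_ik AND m_jk."""
    return all(
        feasible_s(m_ik, m_jk) == [m_ik & m_jk]
        for m_ik, m_jk in product((0, 1), repeat=2)
    )


print(linearization_matches_and())  # prints True
```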
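For tasks ti and tj, we define another 0-1 decision variable $S_{ij}$ to indicate
whether tasks ti and tj are mapped to the same core.
\[ S_{ij} = \begin{cases} 1 & \text{if } \exists k \text{ s.t. } S_{ij.k} = 1 \\ 0 & \text{otherwise} \end{cases} \tag{7.11} \]
We linearize the above equation as follows.
\[ \sum_{0 < k \le N} S_{ij.k} - S_{ij} \ge 0 \tag{7.12} \]
\[ \sum_{0 < k \le N} S_{ij.k} - N \times S_{ij} \le 0 \tag{7.13} \]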
As previously mentioned, we use the iterative process in [62] to derive the
WCRT in the presence of shared cache conflicts. To avoid this fixed-point
computation in the ILP formulation, we assume that two tasks interfere with each
other in the shared cache if there is no dependency between them and they are
mapped to different cores. In order to model the interference relationship
between two tasks ti and tj, we define a 0-1 decision variable $intf_{ij}$.
Similarly, a 0-1 decision variable $peer_{ij}$ is introduced to represent the peer
relationship between tasks ti and tj. If tasks ti and tj have a dependency
between them, their execution lifetimes will never overlap. Therefore, they will
neither interfere with each other nor be peers. Thus,
\[ intf_{ij} = 0 \text{ and } peer_{ij} = 0 \tag{7.14} \]
If there is no dependency between tasks ti and tj, then we assume they interfere
with each other when mapped to different cores and are peers when mapped to
the same core. Thus,
\[ intf_{ij} = \begin{cases} 1 & \text{if } S_{ij} = 0 \\ 0 & \text{otherwise} \end{cases} \tag{7.15} \]
\[ peer_{ij} = \begin{cases} 1 & \text{if } S_{ij} = 1 \\ 0 & \text{otherwise} \end{cases} \tag{7.16} \]
We linearize the above equations as follows.
\[ intf_{ij} = 1 - S_{ij} \text{ and } peer_{ij} = S_{ij} \tag{7.17} \]
Shared Cache Modeling
Recall that we first perform intra-task cache analysis for each task before task
mapping, where the interference in shared L2 cache is not considered. For a task
ti, we define its initial WCET as Wi, which is computed based on the hit/miss
classification of memory accesses after intra-task cache analysis. Wi may be
lower than the actual WCET, as it does not consider the L2 cache conflicts. As a
by-product, we also obtain the age in the L2 abstract cache state of must analysis
for all the memory accesses classified as L2 hits. Meanwhile, we also compute
the WCET path for each task and collect the execution frequency of each basic
block along the worst-case path.
For each task, detailed path modeling can help us obtain an accurate WCET
estimate. However, it introduces a large number of variables to the ILP formu-
lation, which leads to a long solving time, especially in the presence of complex
control flow in the tasks. We ignore WCET path changes within a task in our
ILP modeling for faster solving time, even though this may lead to a sub-optimal
choice of task mapping. When the L2 cache conflicts are considered, some of
the L2 hits may be downgraded to L2 misses. We combine this extra penalty with
the initial WCET to approximate the new WCET. Our experimental evaluation
confirms that this approximation can still produce optimal or near-optimal task
mappings.
Let us define $M_i$ as the set of memory blocks classified as L2 hits in the
worst-case path of task ti, as mentioned above. For each memory block $m \in M_i$,
its new hit/miss classification depends on the interference from the other tasks.
Suppose m is mapped to set s in the L2 cache, and let $age_m$ be the age of m in
the abstract cache state of must analysis in the L2 cache, where
$0 \le age_m < A$ if the access to m is classified as Always Hit. Then, the
classification of m changes to Non-Classified if
\[ \left( age_m + \sum_{0 < j \le M} (conf^s_j \times intf_{ij}) \right) \ge A \tag{7.18} \]
where $conf^s_j$ is the number of memory blocks of task tj that are mapped to
cache set s (in the L2 cache) and are accessed in the L2 cache. The memory
blocks from task tj can conflict with m only if $intf_{ij} = 1$. The conflicts from
other tasks can increase the age of m. Therefore, when the total number of
conflicts from other tasks added to $age_m$ reaches the associativity of the L2
cache (A), the access to m becomes Non-Classified.
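We define a 0-1 variable $C_m$ to indicate whether there is any change in the
classification of memory reference m due to conflicts in the L2 cache. If m is an
L2 cache hit in the intra-task cache analysis but is downgraded to Non-Classified
after the L2 cache conflicts are considered, then $C_m = 1$; otherwise $C_m = 0$.
\[ C_m = \begin{cases} 1 & \text{if } age_m + \sum_{0 < j \le M} (conf^s_j \times intf_{ij}) \ge A \\ 0 & \text{otherwise} \end{cases} \tag{7.19} \]
We linearize the above equation as follows.
\[ A - age_m - \sum_{0 < j \le M} (conf^s_j \times intf_{ij}) + C \times C_m > 0 \tag{7.20} \]
where C is a large constant. The extra penalty due to the conflicts in the L2
cache is
\[ penalty = \sum_{m \in M_i} ((l2\_miss\_lat - l2\_hit\_lat) \times f_m \times C_m) \tag{7.21} \]
where $f_m$ is the execution frequency of memory block m in the worst-case
path, and $l2\_hit\_lat$ and $l2\_miss\_lat$ are the L2 hit latency and L2 miss
penalty, respectively. Finally, the new WCET estimate of task ti, $WCET_i$, is
calculated as follows.
\[ WCET_i = W_i + penalty \tag{7.22} \]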
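The approximation in Equations 7.18, 7.21 and 7.22 can be illustrated with a small worked example. The sketch below evaluates the downgrade test and the resulting penalty for a fixed interference pattern instead of solving the ILP; the data layout (tuples of age, L2 set index and frequency, and a per-set dictionary of conflict counts) and all numbers are hypothetical.

```python
def downgraded(age_m, conf_in_set, intf_row, assoc):
    """1 if age_m plus the conflicts from interfering tasks reaches A (7.18)."""
    conflicts = sum(count * intf_row[task] for task, count in conf_in_set.items())
    return 1 if age_m + conflicts >= assoc else 0


def approx_wcet(w_i, l2_hit_blocks, conf, intf_row, assoc, l2_hit_lat, l2_miss_lat):
    """Initial WCET plus the penalty of downgraded L2 hits (7.21, 7.22)."""
    penalty = 0
    for age_m, set_index, f_m in l2_hit_blocks:
        c_m = downgraded(age_m, conf.get(set_index, {}), intf_row, assoc)
        penalty += (l2_miss_lat - l2_hit_lat) * f_m * c_m
    return w_i + penalty


# Hypothetical task t1 with two L2 hits: (age, L2 set, frequency).
blocks = [(1, 0, 20), (3, 1, 5)]
# conf[s][j]: number of blocks of task j mapped to L2 set s.
conf = {0: {2: 2, 3: 1}, 1: {2: 1}}
# Task 2 interferes with t1; task 3 does not.
intf_row = {2: 1, 3: 0}

print(approx_wcet(1000, blocks, conf, intf_row, assoc=4,
                  l2_hit_lat=10, l2_miss_lat=100))  # prints 1450
```

Here the first block keeps its hit classification (1 + 2 < 4), while the second one is downgraded (3 + 1 >= 4) and contributes (100 - 10) x 5 = 450 cycles.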
WCRT Computation
For a task t, as previously described, we define four variables to represent its
lifetime: $EarliestReady_t$, $EarliestFinish_t$, $LatestReady_t$, and
$LatestFinish_t$. As $EarliestReady_t$ and $EarliestFinish_t$ are constant across
different task mappings, we concentrate on the computation of $LatestReady_t$ and
$LatestFinish_t$ in this section. According to Equation 7.3, for each task
$t_j \in pred(t_i)$, we have
\[ LatestReady_{t_i} \ge LatestFinish_{t_j} \tag{7.23} \]
According to Equation 7.4, the peers of task ti may delay the start time of ti.
Therefore, we have to consider the delay incurred by ti's peers when calculating
$LatestFinish_{t_i}$. We define $pd_{ij}$ as the delay imposed on ti by task tj:
\[ pd_{ij} = \begin{cases} WCET_j & \text{if } peer_{ij} = 1 \\ 0 & \text{otherwise} \end{cases} \tag{7.24} \]
We substitute it with equivalent linear constraints as follows.
\[ pd_{ij} \ge 0 \tag{7.25} \]
\[ pd_{ij} - C \times peer_{ij} \le 0 \tag{7.26} \]
\[ pd_{ij} - WCET_j + C - C \times peer_{ij} \ge 0 \tag{7.27} \]
\[ pd_{ij} - WCET_j - C + C \times peer_{ij} \le 0 \tag{7.28} \]
where C is a large constant as before. Thus,
\[ LatestFinish_{t_i} = LatestReady_{t_i} + \sum_{0 < j \le M} pd_{ij} + WCET_i \tag{7.29} \]
Finally, our objective is to minimize the WCRT of the entire application.
According to Equation 7.5, we also introduce the following constraint for each
task $t \in T$:
\[ WCRT \ge LatestFinish_t \tag{7.30} \]
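To make the objective concrete, the sketch below evaluates the same conservative model by exhaustive enumeration rather than with an ILP solver, so it is only usable for tiny task graphs. It is an illustration under stated assumptions: `wcet_fn` is a hypothetical stand-in for the shared cache model of Equation 7.22, `dependent` must contain every (possibly transitive) dependency pair, and the example numbers are invented.

```python
from itertools import product


def conservative_wcrt(mapping, order, pred, dependent, wcet_fn):
    """WCRT under the ILP's assumptions: independent tasks interfere when on
    different cores and are peers when on the same core."""
    def free(a, b):
        return a != b and frozenset((a, b)) not in dependent

    wcet = {
        t: wcet_fn(t, [u for u in order if free(t, u) and mapping[u] != mapping[t]])
        for t in order
    }
    latest_finish = {}
    for t in order:  # `order` must be a topological order of the task graph
        latest_ready = max((latest_finish[u] for u in pred[t]), default=0)
        peer_delay = sum(wcet[u] for u in order
                         if free(t, u) and mapping[u] == mapping[t])
        latest_finish[t] = latest_ready + peer_delay + wcet[t]
    return max(latest_finish.values())


def best_mapping(order, num_cores, pred, dependent, wcet_fn):
    """Exhaustively pick the mapping with the smallest conservative WCRT."""
    candidates = (dict(zip(order, cores))
                  for cores in product(range(num_cores), repeat=len(order)))
    return min(candidates,
               key=lambda m: conservative_wcrt(m, order, pred, dependent, wcet_fn))


# Hypothetical graph: tasks 0 and 1 are independent, task 2 depends on both.
order = [0, 1, 2]
pred = {0: [], 1: [], 2: [0, 1]}
dependent = {frozenset((0, 2)), frozenset((1, 2))}
base = {0: 10, 1: 10, 2: 5}


def cheap(t, interferers):
    return base[t] + 4 * len(interferers)    # mild cache interference


def costly(t, interferers):
    return base[t] + 12 * len(interferers)   # heavy cache interference


for fn in (cheap, costly):
    m = best_mapping(order, 2, pred, dependent, fn)
    print(m[0] != m[1], conservative_wcrt(m, order, pred, dependent, fn))
# prints: True 19, then False 25
```

With mild interference the best mapping runs tasks 0 and 1 in parallel on different cores; with heavy interference it serializes them on one core, which is the trade-off between load balance and shared cache conflicts that the formulation captures.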
7.6 Cache Locking in Multi-core Processors
After obtaining the task mapping with our ILP formulation approach, we per-
form partial cache locking to improve the WCRT. In this section, we present
the details of our cache locking approach in multi-core processors with shared
cache.
7.6.1 Locking Mechanisms
As shown in Figure 7.1, each core has its private L1 cache and multiple cores
share the same L2 cache. Thus we have the option of locking memory blocks
either in the private L1 caches or the shared L2 cache. Locking the memory
blocks into L1 caches certainly helps to improve the WCET of the current task
(e.g., by locking memory blocks that cause a lot of cache misses on the WCET
path in a task). In addition, more cache hits in the L1 cache imply fewer L2 cache
accesses. Thus, the tasks running on the other cores may benefit as well due to
the reduced L2 cache conflicts. It is also possible to lock memory blocks into
the shared L2 cache. However, as L2 cache has longer latency compared to L1
cache, the WCET reduction is much less for the current task. More importantly,
as the L2 cache size gets reduced after locking, the tasks on the other core might
suffer considerably. Thus, we choose to lock memory blocks only in the L1
cache.
As line locking is a finer-grained and more flexible locking mechanism than
way locking, we adopt line locking, as we did in our cache locking work for
uniprocessors. Furthermore, we consider static instruction locking for each task
in the task graph. In other words, the memory blocks are locked in the cache at
the beginning of the execution of a task and remain locked throughout the
execution of the task. The implication is that we need to pay for the time to
load and lock the instructions into the cache before the execution of each task
in the task graph. Because tasks are non-preemptive, a task can exclusively lock
and use the cache during its execution².
7.6.2 Locking Algorithm for Multi-core Processors
The critical question that we need to answer for cache locking is how to select
the memory blocks that should be locked. We need to perform a cost-benefit
analysis to identify the most beneficial memory blocks to lock in the cache.
To identify the beneficial memory blocks, we first perform one round of
cache and WCRT analysis with the approach proposed in [62] and the task
mapping obtained earlier. Then, we collect profiles including the abstract cache
states at the L1 and L2 caches, the task interference graph, the memory access
latency of each memory block and the execution frequency of each memory block
on the WCET path. This information is used in our cache locking modeling.

² In [67], cache locking is performed at the beginning of the task graph. Thus, tasks mapped
to the same core spatially share the locked space in the L1 cache, which is similar to the static
locking in Chapter 5.
For memory block m, let $lat_m$ be its worst-case access latency according
to its classification in the L1 and L2 caches, and let $f_m$ be its execution
frequency on the WCET path. By locking memory block m into the L1 cache, all
the accesses to m will be cache hits; thus the WCET of the current task may be
improved. We define the benefit for the current task as
\[ self\_benefit_m = (lat_m - hit_{L1}) \times f_m \]
where $hit_{L1}$ is the hit latency of the private L1 cache.
Meanwhile, locking m into the private L1 cache of the core eliminates all
the accesses of m to the L2 cache. This may lead to reduced L2 cache conflicts
for the tasks running on other cores with memory blocks mapped to the same
L2 cache set as m.
Let us assume m belongs to task T running on core p and it is mapped to
cache set C in the L2 cache before locking. Then, let conf(m) be the set of memory
blocks belonging to the tasks running on other cores (not p) that can potentially
access the cache set C in the L2 cache. In the shared L2 cache analysis in [62],
for a memory block $m' \in conf(m)$, we convert its access classification
from “Always Hit” to “Non-Classified” if $|conf(m')| \ge A_{L2} - age_{L2}(m')$,
where $age_{L2}(m')$ is the age of m′ in the abstract cache state of L2 cache must
analysis and $A_{L2}$ is the associativity of the L2 cache. By locking m, we reduce
the L2 cache conflicts for the tasks running on other cores. Thus we might be
able to avoid the conversion of some “Always Hit” references to “Non-Classified”
due to conflicts. If memory block m′ is converted from L2 “Non-Classified” back
to L2 “Always Hit”, then the WCET reduction is
\[ L2\_benefit_{m'} = f_{m'} \times (miss_{L2} - hit_{L2}) \tag{7.31} \]
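where $hit_{L2}$ represents the L2 cache hit latency and $miss_{L2}$ is the L2 cache
miss penalty. Summing over the memory blocks that regain their “Always Hit”
classification, the benefit for the tasks on other cores is
\[ other\_benefit_m = \sum_{m' \in conf(m),\; m' \text{ restored to Always Hit}} L2\_benefit_{m'} \tag{7.32} \]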
On the other hand, locking memory block m may have a negative impact on
the memory blocks mapped to the same set in the private L1 cache of the same
task, as the associativity of the private L1 cache is reduced through locking.
Let same(m) be the set of memory blocks mapped to the same cache set as m
in the private L1 cache of the task. From the L1 abstract cache states of
must analysis, we can easily find the age of these memory blocks. If m is
classified as a cache hit before cache locking, then locking m does not evict any
other memory block from the cache, and $cost_m = 0$. On the other hand, if the
age of a conflicting memory block $m' \in same(m)$ is $A_{L1} - 1$, where $A_{L1}$ is
the L1 cache associativity, then m′ will be converted to an L1 miss after locking
m. In the worst case, m′ will also be classified as a “Non-Classified” L2 access.
So,
\[ cost_m = \sum_{m' \in same(m) \,\wedge\, age_{L1}(m') = A_{L1} - 1} (miss_{L2} - hit_{L1}) \times f_{m'} \tag{7.33} \]
where $age_{L1}(m')$ is the age of m′ in the abstract cache state of L1 cache must
analysis. Then, we define the overall benefit as
\[ gain_m = self\_benefit_m + other\_benefit_m - cost_m \tag{7.34} \]
Note that we use $gain_m$ to evaluate and compare the benefit of locking
different memory blocks, so that we can quickly identify beneficial memory
blocks for locking. $gain_m$ may not be the actual WCRT reduction, as both the
BCET and the WCET path may change after cache locking. Thus, the actual
WCET reduction may be more or less than what we predict. Also, the task
interference graph may change due to the change of BCET and WCET values.
In practice, however, we find that $gain_m$ is a good metric to evaluate the
benefit of locking different memory blocks.
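As a worked instance of Equation 7.34, the sketch below combines the three terms for a single candidate block. The latencies follow the experimental setup of this chapter (L1 hit 1 cycle, L2 hit 10 cycles, L2 miss penalty 100 cycles); the frequencies and the choice of affected blocks are hypothetical.

```python
def self_benefit(lat_m, hit_l1, f_m):
    """Cycles saved in the current task by turning accesses to m into L1 hits."""
    return (lat_m - hit_l1) * f_m


def l2_benefit(f_other, miss_l2, hit_l2):
    """Cycles saved for one block on another core restored to Always Hit (7.31)."""
    return f_other * (miss_l2 - hit_l2)


def locking_cost(evicted_freqs, miss_l2, hit_l1):
    """Worst-case cycles lost by L1 blocks that become misses after locking m."""
    return sum((miss_l2 - hit_l1) * f for f in evicted_freqs)


def gain(self_b, other_b, cost):
    """Overall benefit of locking m (7.34)."""
    return self_b + other_b - cost


HIT_L1, HIT_L2, MISS_L2 = 1, 10, 100

s = self_benefit(lat_m=100, hit_l1=HIT_L1, f_m=10)   # 990
o = l2_benefit(f_other=5, miss_l2=MISS_L2, hit_l2=HIT_L2)  # 450
c = locking_cost([3], MISS_L2, HIT_L1)               # 297
print(gain(s, o, c))  # prints 1143
```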
The overall cache locking framework is shown in Figure 7.6. We first per-
form the cache and WCRT analysis before the iterative process. Then, in each
iteration, we compute the gainm for all the unlocked memory blocks so far in
the task graph. If a cache set is fully locked for a task, we will not consider its
memory blocks mapped to that cache set. Then, we select the memory block
with the maximum gainm for locking. We break the ties arbitrarily. After that,
we perform cache and WCRT analysis to derive the precise WCRT after cache
locking. If WCRT is improved, we continue to lock; otherwise we stop the
process.
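The iterative selection described above can be sketched as a greedy loop. All three callbacks are hypothetical placeholders: `gain_of` stands for the gain metric, `lockable` for the check that the block's cache set is not yet fully locked, and `analyse_wcrt` for the full cache and WCRT analysis of [62].

```python
def greedy_locking(candidates, gain_of, lockable, analyse_wcrt):
    """Lock the block with the largest gain until the WCRT stops improving."""
    locked = []
    best_wcrt = analyse_wcrt(locked)      # analysis before any locking
    while True:
        pool = [m for m in candidates
                if m not in locked and lockable(m, locked)]
        if not pool:
            break
        pick = max(pool, key=lambda m: gain_of(m, locked))  # ties: arbitrary
        wcrt = analyse_wcrt(locked + [pick])  # precise WCRT after locking pick
        if wcrt >= best_wcrt:             # no improvement: stop
            break
        locked.append(pick)
        best_wcrt = wcrt
    return locked, best_wcrt


# Toy stand-ins: each block changes the WCRT by a fixed, invented amount.
toy_gain = {"a": 5, "b": 3, "c": -1}
result = greedy_locking(
    candidates=["a", "b", "c"],
    gain_of=lambda m, locked: toy_gain[m],
    lockable=lambda m, locked: True,
    analyse_wcrt=lambda locked: 100 - sum(toy_gain[m] for m in locked),
)
print(result)  # prints (['a', 'b'], 92)
```

Because the gain metric is only an estimate, the loop relies on the re-run analysis, not on the predicted gain, to decide whether a lock is kept.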
Figure 7.6: Cache locking framework. Recoverable box label: Pick and lock a
beneficial memory block.

Cache Locking Granularity: So far, we assumed that the L1 and L2 caches
have identical block sizes. However, in reality the block size of the L2 cache
can be greater than or equal to the block size of the L1 cache. We can choose to
lock either at L1 or at L2 block granularity. Figure 7.7 shows the differences
between the two locking granularities. In this example, the L2 block size is
assumed to be twice the L1 block size. m1 and m2 are two consecutive memory
blocks in the L1 cache, and both of them correspond to memory block m in the L2
cache. If we choose to lock at L2 memory block granularity, then we have to lock
both m1 and m2 in the L1 cache simultaneously. More importantly, the references
to memory block m will not access the L2 cache any more. However, we cannot
guarantee this if we choose to lock at L1 memory block granularity. For
example, if we choose to lock m1 into the L1 cache, there might still be accesses
to m at the L2 cache level due to misses of m2 in the L1 cache. Thus, the L2
cache conflicts are not reduced. So, if we choose to lock at L1 granularity, we
do not include the benefit from other cores ($other\_benefit_m$) in the final
$gain_m$.

Figure 7.7: Cache locking granularity. (a) Cache locking at L1 block
granularity. (b) Cache locking at L2 block granularity.
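The relationship between the two granularities can be made concrete with block addresses. The sketch below lists the L1 blocks that must be locked together when locking at L2 block granularity; the helper names and the example address are illustrative, and the block sizes match the 32-byte L1 and 64-byte L2 blocks used in the experiments.

```python
def l1_blocks_in_l2_block(l2_block_addr, l1_block_size, l2_block_size):
    """Start addresses of the L1 blocks covered by one L2 block."""
    assert l2_block_size % l1_block_size == 0, "L2 block must be a multiple of L1 block"
    assert l2_block_addr % l2_block_size == 0, "address must be L2-block aligned"
    return list(range(l2_block_addr, l2_block_addr + l2_block_size, l1_block_size))


def counts_other_benefit(granularity):
    """Only locking whole L2 blocks guarantees that m never reaches the L2 cache."""
    return granularity == "L2"


# One 64-byte L2 block m corresponds to two 32-byte L1 blocks m1 and m2.
print([hex(a) for a in l1_blocks_in_l2_block(0x400, 32, 64)])
# prints ['0x400', '0x420']
```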
7.7 Experimental Evaluation
In this section, we present the experimental evaluation of our cache locking ap-
proach in multi-core processors. We first evaluate the task mapping method. Our
approach models conflicts in the shared cache during task mapping. We com-
pare the solution obtained via our task mapping approach with that of exhaustive
enumeration (which produces the optimal solution) and traditional approaches
that are agnostic to shared cache conflicts and solely focus on load balancing.
Later, we present the cache locking results based on the resultant task mapping,
which show a further improvement in WCRT.
7.7.1 Experimental Setup
We evaluate our task mapping and cache locking approach with both real-world
and synthetic benchmarks. We first perform a case study with a real-world em-
bedded benchmark DEBIE-I DPU Software [35], an in-situ space debris moni-
toring instrument developed by Space Systems Finland Ltd. We manually create
a task graph corresponding to DEBIE benchmark by identifying the compute-
intensive kernels of the benchmark and the dependencies among them, as shown
in Figure 7.8. The task graph consists of 12 tasks. These tasks have different
code sizes varying from 448 bytes to 23,288 bytes, as shown in Table 7.1.
We further validate our approach by creating synthetic task graphs using
TGFF [34]. However, we use real WCET benchmark kernels from the MRTC
benchmark suite [46] as tasks for these synthetic task graphs. The code sizes
of these benchmarks are listed in Table 7.2. We create nine synthetic task
graphs with different numbers of tasks. The details of the synthetic task graphs
are presented in Figure 7.9.

Figure 7.8: Task graph for DEBIE benchmark.

Table 7.1: Code size of the tasks from DEBIE benchmark.

Figure 7.9: Synthetic task graphs with WCET benchmarks as tasks.

Table 7.2: Code size of WCET benchmarks used as tasks in synthetic task
graphs.
We compile the source code corresponding to our tasks with the gcc
cross-compiler for the SimpleScalar PISA (Portable ISA) instruction set
architecture [21]. The cache analysis phase is built on top of the open-source
WCET analysis tool Chronos [59]. We perform all the experiments on a 2.53GHz
Intel Xeon CPU with 24GB memory and use IBM CPLEX as the ILP solver [3].
We assume our target architecture has four cores and two levels of instruc-
tion caches, as shown in Figure 7.1. The hit latency for L1 cache is 1 cycle. The
hit latency for L2 cache is 10 cycles, while its miss penalty is 100 cycles. As we
are modeling the instruction cache, we assume a simple in-order processor with
unit-latency for all data memory references.
7.7.2 DEBIE Case Study
For this case study application, we assume a 4-core processor. The L1 cache size
is 1K bytes, with 2-way set associativity and 32-byte block size. The L2 cache is
4-way set associative with a block size of 64 bytes, and its capacity is 16K bytes.
The results are illustrated in Figure 7.10.
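For reference, the number of cache sets implied by this configuration follows from the standard relation sets = capacity / (associativity x block size). The helper below is only a worked calculation of that relation.

```python
def num_sets(capacity_bytes, associativity, block_bytes):
    """Number of sets of a set-associative cache."""
    return capacity_bytes // (associativity * block_bytes)


print(num_sets(1024, 2, 32))       # L1: prints 16
print(num_sets(16 * 1024, 4, 64))  # L2: prints 64
```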
The first three bars in Figure 7.10 show the WCRT results of different task
mapping approaches. Our approach performs integrated task mapping
and shared cache modeling as presented in Section 7.5. Note that the ILP
formulation in our approach generates a task mapping that is expected to
minimize the WCRT. However, the ILP formulation includes some approximations in
the task-level WCET analysis to keep the ILP solving time tractable. The task
mapping generated by the ILP is given to the iterative WCRT estimation
framework [62], and we report this WCRT estimate.

Figure 7.10: Improvement in WCRT due to task mapping and cache locking for
DEBIE benchmark.
We obtain the optimal task mapping solution via an exhaustive search that
tests all the possible task mappings and invokes the iterative WCRT
analysis [62] to estimate the WCRT for each mapping. Obviously, given the
exponential number of task mappings and the long runtime of the WCRT
analysis [62], this approach is computationally infeasible for large task graphs.
We also compare our approach with the traditional task mapping approach that
is agnostic to the shared cache (task mapping w/o L2 cache modeling). We
exhaustively test all possible task mappings and invoke the iterative WCRT
analysis [62] without L2 cache modeling. Then, all task mappings leading to minimal
WCRT are collected. The task mappings generated this way are presented as in-
puts to the iterative WCRT analysis technique with L2 cache modeling [62] and
we report their average WCRT estimate. For example, in Figure 7.3, the tradi-
tional approach will generate the task mappings #2 and #9 because these map-
pings have the smallest WCRT without L2 cache modeling (shortest red bar).
We report the average WCRT corresponding to these mappings with L2 cache
modeling (i.e., the average of green bars corresponding to mappings #2 and #9).
The bar Task mapping w/o L2 cache modeling shows the average WCRT results
of this approach agnostic to shared cache conflicts.
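The way the shared cache agnostic baseline is scored can be sketched as below.
The function name and the numbers in the usage example are invented for
illustration; only the procedure (select all mappings that are minimal without
L2 modeling, then average their WCRT with L2 modeling) follows the description
above.

```python
from typing import Dict, Hashable


def agnostic_baseline_wcrt(
    wcrt_without_l2: Dict[Hashable, float],
    wcrt_with_l2: Dict[Hashable, float],
) -> float:
    """Average WCRT (with L2 modeling) of the mappings that look best
    when the L2 cache is ignored."""
    best = min(wcrt_without_l2.values())
    # Exact comparison is adequate for a sketch; real estimates may
    # call for a tolerance when detecting ties.
    winners = [m for m, w in wcrt_without_l2.items() if w == best]
    return sum(wcrt_with_l2[m] for m in winners) / len(winners)


if __name__ == "__main__":
    # Hypothetical values: mappings "#2" and "#9" tie without L2 modeling.
    without_l2 = {"#1": 120.0, "#2": 100.0, "#9": 100.0}
    with_l2 = {"#1": 150.0, "#2": 180.0, "#9": 140.0}
    print(agnostic_baseline_wcrt(without_l2, with_l2))
```

In this invented example the baseline reports the mean of the two tied mappings,
even though a mapping that looked worse without L2 modeling has a lower WCRT
once shared cache conflicts are taken into account.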
As can be observed, our approach achieves a significant reduction in WCRT
(27%) compared to the traditional approach agnostic to shared cache conflicts.
Moreover, the task mapping generated by our approach achieves the optimal
WCRT.
The last two bars in Figure 7.10 present the results of cache locking applied
on top of the task mapping obtained by our ILP formulation. The WCRT
with cache locking at the granularity of L1 block size and L2 block size are
shown, respectively. Cache locking further reduces the WCRT beyond what
task mapping alone achieves. In this case study, we achieve
49% and 45% improvement in WCRT at the granularity of L1 block size and
L2 block size, respectively, compared to our task mapping result.
7.7.3 Synthetic Task Graphs
We consider a 4-core processor. L1 cache is 2-way set associative with block
size of 32 bytes, and its capacity is 512 bytes. L2 cache is 4-way set associative
with 64-byte block size, and its size is 4K bytes. We use smaller cache size to

Figure 7.11: Improvement in WCRT due to task mapping and cache locking for
synthetic task graphs (4-core).
The results are shown in Figure 7.11, where the first three bars present the
task mapping results and the remaining two bars show the cache locking results for
each task graph. We normalize all the results, using the optimal task mapping
result returned by the exhaustive enumeration approach as the baseline.
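The normalization and the reported percentages can be expressed as simple
ratios. The sketch below assumes that a "reduction" is measured relative to the
approach being compared against; this formula, the function names and the
numbers in the usage example are assumptions for illustration, not values from
the experiments.

```python
from typing import Dict


def normalize(wcrt: Dict[str, float], baseline_key: str = "optimal") -> Dict[str, float]:
    """Divide every WCRT by the baseline (optimal mapping) WCRT."""
    baseline = wcrt[baseline_key]
    return {name: value / baseline for name, value in wcrt.items()}


def reduction_percent(reference: float, improved: float) -> float:
    """Relative WCRT reduction of `improved` with respect to `reference`."""
    return 100.0 * (reference - improved) / reference


if __name__ == "__main__":
    # Hypothetical WCRT values in cycles.
    results = {"optimal": 200.0, "ours": 200.0, "agnostic": 250.0}
    print(normalize(results))
    print(reduction_percent(results["agnostic"], results["ours"]))
```

A normalized value of 1.0 for our approach means that it matches the optimal
mapping; values above 1.0 quantify the gap to the optimum.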
We first report the WCRT reduction with our task mapping approach compared
to the L2 cache agnostic task mapping approach. Clearly, our approach
generates task mappings that lead to the optimal WCRT for all the task graphs.
Furthermore, the task mappings generated by our approach are superior to the task
mappings produced without L2 cache modeling. Our approach considers both
workload balancing and interference in the L2 cache, whereas only workload
balancing is taken into account when the L2 cache is not modeled. Therefore, our
approach produces better task mappings that lead to a larger reduction in WCRT.
We achieve an average 27% reduction in WCRT compared to the approach that
is agnostic to shared cache conflicts. We even achieve a reduction of 72%
for task graph 3, which underlines the importance of considering shared cache
conflicts in task mapping.
Cache locking further improves the WCRT of all task graphs based on our
task mapping approach. Our cache locking approach reduces the number of
cache misses in private L1 cache as well as the number of cache accesses in
shared L2 cache. We achieve on average 16% and 19% improvement in WCRT
at the granularity of L1 block size and L2 block size, respectively, compared to
the WCRT produced by our task mapping approach.
Table 7.3: Runtime of our task mapping approach and the optimal (exhaustive
enumeration) task mapping approach.

Task graph   Number of tasks   Our approach (s)   Exhaustive enumeration (s)
1            6                     7.05                      6.06
2            7                     5.29                      1.23
3            8                     4.51                      4.35
4            9                     3.08                     16.55
5            10                   13.39                     68.39
6            11                   18.66                    252.60
7            12                  471.84                  1,090.29
8            13                  253.64                  3,660.15
9            14                  566.94                 22,238.00
The runtime for our task mapping approach and the exhaustive enumeration
approach is shown in Table 7.3. The exhaustive enumeration approach becomes
computationally infeasible as the number of tasks increases. As can be seen in
the table, its runtime increases exponentially with the number of tasks,
whereas the runtime of our approach stays within 10 minutes for all the
task graphs.
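The comparison in Table 7.3 can be checked with a few lines of arithmetic. The
sketch below assumes that the four columns are the task graph index, the number
of tasks, the runtime of our approach and the runtime of exhaustive enumeration,
with runtimes in seconds; this reading of the columns is an assumption inferred
from the caption and from the 10-minute remark above.

```python
# Rows of Table 7.3: (task graph, number of tasks, ours, exhaustive).
# Column meaning and the unit (seconds) are assumed, see the lead-in.
TABLE_7_3 = [
    (1, 6, 7.05, 6.06),
    (2, 7, 5.29, 1.23),
    (3, 8, 4.51, 4.35),
    (4, 9, 3.08, 16.55),
    (5, 10, 13.39, 68.39),
    (6, 11, 18.66, 252.60),
    (7, 12, 471.84, 1090.29),
    (8, 13, 253.64, 3660.15),
    (9, 14, 566.94, 22238.00),
]

# Ratio of exhaustive runtime to the runtime of our approach per task graph.
speedups = {graph: exhaustive / ours for graph, _, ours, exhaustive in TABLE_7_3}

# Largest runtime of our approach over all task graphs.
max_ours = max(ours for _, _, ours, _ in TABLE_7_3)

if __name__ == "__main__":
    for graph, ratio in speedups.items():
        print(f"task graph {graph}: {ratio:.2f}x")
    print(f"largest runtime of our approach: {max_ours:.2f}")
```

Under this reading, exhaustive enumeration is still competitive for the three
smallest task graphs, while for the largest one (14 tasks) it takes roughly 39
times longer than our approach.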
7.7.4 Impact of Different Number of Cores
In this section, we consider a 2-core processor instead of a 4-core processor, and
evaluate the task mapping and cache locking results for synthetic task graphs.
L1 cache is 2-way set associative with block size of 32 bytes, and its capacity
is 512 bytes. L2 cache is 4-way set associative with 64-byte block size, and its
size is 4K bytes.
The results with the 2-core processor are shown in Figure 7.12. As before, the
first three bars present the task mapping results and the remaining two bars show
the cache locking results for each task graph. We use the optimal task mapping
result as the baseline and normalize the other results.
As can be observed, our task mapping approach generates the optimal task
mapping for most of the task graphs. Compared to the approach that is agnostic
to shared cache conflicts, we obtain an average 19% reduction in WCRT. Cache
locking further improves the WCRT based on the resultant task mapping. On
average, 14% and 15% improvement in WCRT are achieved at the granularity
of L1 block size and L2 block size, respectively, compared to the WCRT
produced by our task mapping approach.

Figure 7.12: Improvement in WCRT due to task mapping and cache locking for
synthetic task graphs (2-core).
7.7.5 L1 Block Size vs. L2 Block Size
As shown in Figures 7.10 and 7.11, locking at L2 block size granularity produces
better results than locking at L1 block size granularity for some task graphs, e.g.,
task graph 4 and task graph 8 in Figure 7.11. In other cases, locking at L1
block size granularity outperforms locking at L2 block size granularity, e.g.,
for the DEBIE benchmark. Once a memory block is locked at L2 block size
granularity, accesses to the corresponding L2 memory block in the L2 cache are
completely eliminated, which reduces the interference in the shared L2 cache.
On the other hand, locking at L1 block size granularity is more fine-grained
and flexible than locking at L2 block size granularity.
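The relation between the two granularities can be illustrated with the block
sizes used in the experiments (32-byte L1 blocks, 64-byte L2 blocks). The helper
names below are hypothetical, and the sketch only captures the address
arithmetic, not the locking algorithm itself.

```python
from typing import List

L1_BLOCK_SIZE = 32  # bytes, as in the experimental setup
L2_BLOCK_SIZE = 64  # bytes, as in the experimental setup


def l1_blocks_covered(l2_block: int) -> List[int]:
    """Indices of the L1-sized blocks contained in one L2-sized block."""
    ratio = L2_BLOCK_SIZE // L1_BLOCK_SIZE
    return [l2_block * ratio + i for i in range(ratio)]


def l2_block_of(l1_block: int) -> int:
    """Index of the L2-sized block that contains a given L1-sized block."""
    return (l1_block * L1_BLOCK_SIZE) // L2_BLOCK_SIZE


if __name__ == "__main__":
    print(l1_blocks_covered(3))            # L1 blocks inside L2 block 3
    print(l2_block_of(6), l2_block_of(7))  # both belong to the same L2 block
```

Locking L2 block 3 therefore pins both L1 blocks 6 and 7, so no fetch from that
L2 block reaches the shared cache, at the price of consuming two L1 lines.
Locking at L1 granularity may pin only block 6, which uses less locked space but
leaves block 7 free to access the L2 cache.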
7.8 Discussion
A two-step framework is proposed to improve the WCRT for multitasking appli-
cations in multi-core processors with shared cache. However, we only consider
homogeneous multi-core processors, and the tasks are assumed to execute in a
non-preemptive fashion. In the task mapping step, we make approximations in
the interference modeling and the WCET computation modeling, in order to reduce
the complexity of the ILP formulation. Thus, we may occasionally not achieve
the best task mapping.
7.9 Summary
In this chapter, we perform cache locking in multi-core processors with shared
cache. Prior to cache locking, a cache aware task mapping approach is pro-
posed to minimize the WCRT of concurrent tasks. Caches are modeled through
abstract interpretation and an ILP formulation approach is employed for task
mapping. Both the cache conflicts in the L2 cache and the workload balance are
considered in our approach. Cache locking further minimizes the WCRT using
the resultant task mapping. We statically lock the memory blocks in private L1
cache for each task. Both L1 block size granularity and L2 block size granular-
ity are explored. Our cache locking approach not only reduces the number of
cache misses in private L1 cache but also minimizes the number of accesses to
shared L2 cache. Experimental results with both synthetic task graphs and real-
world benchmarks show that both the task mapping approach and the cache locking
technique substantially improve the WCRT. Our task mapping approach returns
the best task mapping in most of the cases, and it is more efficient in runtime
compared to an exhaustive enumeration approach that can produce optimal so-






Timing constraints are an important feature of embedded real-time systems.
Applications in real-time systems are required to meet their deadlines in order to
guarantee proper functioning. Worst-case performance thus becomes a crucial
metric in the schedulability analysis of real-time systems.
In this thesis, we study cache optimizations by proposing partial cache lock-
ing in embedded real-time systems, in order to improve the worst-case perfor-
mance of applications. Our partial cache locking integrates static cache analysis
and cache locking. With partial cache locking, only a portion of the cache is
locked with memory blocks, while the free cache space can still be used by the
unlocked memory blocks to exploit their cache locality. Accurate cost-benefit
analysis is performed based on static cache analysis, in order to select the most
beneficial memory blocks that can minimize the worst-case execution time. Our
partial cache locking achieves the best of both static cache analysis and cache
locking approach.
Our partial cache locking is studied in different architectures and system
models in embedded real-time systems.
In uni-processors, static partial instruction cache locking is first developed
for a single task, in order to improve the WCET. We carefully model the intra-task
cache conflict, as well as the cost and benefit of locking a memory block. An
optimal approach based on concrete cache states and a time-efficient heuristic
method based on abstract cache states are proposed to select the memory blocks
that are most beneficial for improving the WCET.
Then, we extend partial cache locking to multitasking real-time systems,
and both intra-task and inter-task cache conflicts are carefully considered. With
our approach, each task may lock a portion of the cache, while there is still
unlocked cache space that is shared by all the tasks in a time-multiplexed style.
As the cache is shared by all the tasks, locking a memory block has a global
effect and can impact both WCET and CRPD. We propose a greedy selection
method that iteratively selects the memory blocks while accounting for the
global effect of cache locking. Our approach improves the schedulability/utilization
for both RMS (Rate Monotonic Scheduling) and EDF (Earliest Deadline First)
scheduling policies.
Static partial cache locking is also extended to dynamic cache locking in
this thesis, in order to further improve the WCET for a single task. Compared
to the region-based approaches that partition the program into different regions, we
propose a flexible loop-based dynamic cache locking approach. Our approach
locks the memory blocks at the entry point of a loop and unlocks them at the
corresponding exit point. Memory blocks from the same loop can be locked at
different program points with consideration to global optimization of the WCET.
Thus, we not only select the memory blocks to be locked, but also decide the
locking points where they should be locked.
Finally, we study partial cache locking in multi-core processors with
shared cache. Inter-core cache conflict is considered due to the existence of
the shared cache. A two-step framework is proposed to minimize the WCRT. Prior
to cache locking, a task mapping method is adopted to minimize the WCRT. The
task mapping approach considers both the workload balance and shared cache
conflicts. Based on the resultant task mapping, we further improve the WCRT
via our partial cache locking approach.
8.2 Future Directions
The data cache, which provides fast access to program data, is as important as
the instruction cache in embedded real-time systems. Although we only study
instruction cache optimizations for embedded real-time systems in this thesis, our
techniques can be extended to data cache. Our partial cache locking technique
relies on static cache analysis to select the beneficial memory blocks to lock.
Recently, Huynh et al. [51] proposed a scope-aware static data cache analysis
method based on persistence analysis. They adopt the data address analysis
technique proposed in [108]. With the data address analysis framework and
the scope-aware abstract cache state analysis, we believe that our partial cache
locking technique can be easily extended to data cache.
As we have mentioned, cache locking can reduce cache conflicts. However,
cache locking cannot completely eliminate them. Suppose
there are three memory blocks m1, m2 and m3 in a loop, and they are mapped to
the same cache set of a 2-way set-associative cache. Clearly, these three memory
blocks conflict in the cache set. When we lock one of them, all accesses to
the locked memory block are cache hits, while the other two memory blocks
can still conflict with each other. Recently, compiler-assisted code positioning
approaches have been proposed to optimize the WCET [76, 37]. These methods
change the layout of the program code, which may completely eliminate some of
the cache conflicts. Thus, we believe that a combined approach of partial cache
locking and code positioning may produce better results.
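The three-block example can be reproduced with a small simulation of a single
cache set. The sketch assumes LRU replacement among the unlocked ways and that
locked blocks are preloaded before the loop starts; both assumptions and the
function name are ours, chosen only to illustrate the residual conflict.

```python
from collections import OrderedDict
from typing import Dict, Iterable, Sequence


def simulate_set(trace: Sequence[str], ways: int,
                 locked: Iterable[str] = ()) -> Dict[str, int]:
    """Count hits per memory block for one cache set.

    Locked blocks are assumed preloaded, so every access to them hits.
    The remaining ways are managed with LRU replacement (an assumption).
    """
    locked_blocks = set(locked)
    free_ways = ways - len(locked_blocks)
    lru: "OrderedDict[str, bool]" = OrderedDict()  # least recently used first
    hits: Dict[str, int] = {}
    for block in trace:
        hits.setdefault(block, 0)
        if block in locked_blocks:
            hits[block] += 1
        elif block in lru:
            hits[block] += 1
            lru.move_to_end(block)  # mark as most recently used
        elif free_ways > 0:
            if len(lru) == free_ways:
                lru.popitem(last=False)  # evict the least recently used block
            lru[block] = True
    return hits


if __name__ == "__main__":
    loop_trace = ["m1", "m2", "m3"] * 10  # ten loop iterations
    print(simulate_set(loop_trace, ways=2))
    print(simulate_set(loop_trace, ways=2, locked=["m1"]))
```

Without locking, the three blocks evict each other in the 2-way set and none of
them ever hits. With m1 locked, all ten accesses to m1 hit, but m2 and m3 keep
evicting each other in the single remaining way, which is the residual conflict
that code positioning could remove by mapping them to different sets.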
In this thesis, we have a trade-off between the performance and predictabil-




[1] ARM920T technical reference manual. http://infocenter.arm.
com.
[2] ARM940T technical reference manual. http://infocenter.arm.
com.
[3] IBM ILOG CPLEX Optimizer. http://www-01.ibm.com/
software/commerce/optimization/cplex-optimizer.
[4] PowerPC 440 embedded core. https://www-01.ibm.com/
chips/techlib/techlib.nsf/products/PowerPC_440_Embedded_Core.
[5] IDT 79RC64574/RC64575 user reference manual, Mar. 2000. Integrated
Device Technology.
[6] Intel XScale core developers manual. http://www.intel.com/
design/intelxscale, Jan. 2004.
[7] ADSP-BF53x/BF56x blackfin processor programming reference, Feb.
2006. Analog Devices, Inc.
[8] AbsInt. aiT: worst-case execution time analyzers. http://www.
absint.com/ait/.
[9] S. Altmeyer, C. Maiza, and J. Reineke. Resilience analysis: tightening
the crpd bound for set-associative caches. In Proceedings of the ACM
SIGPLAN/SIGBED 2010 conference on Languages, compilers, and tools
for embedded systems, LCTES ’10, pages 153–162, 2010.
[10] S. Altmeyer and C. Maiza Burguière. Cache-related preemption delay via
useful cache blocks: Survey and redefinition. J. Syst. Archit., 57(7):707–
719, Aug. 2011.
[11] K. Anand and R. Barua. Instruction cache locking inside a binary
rewriter. In Proceedings of the 2009 international conference on Compil-
ers, architecture, and synthesis for embedded systems, CASES ’09, pages
185–194, 2009.
[12] J. H. Anderson and J. M. Calandrino. Parallel task scheduling on multi-
core platforms. SIGBED Rev., 3(1):1–6, Jan. 2006.
[13] J. H. Anderson, J. M. Calandrino, and U. C. Devi. Real-time scheduling
on multicore platforms. In Proceedings of the 12th IEEE Real-Time and
Embedded Technology and Applications Symposium, RTAS ’06, pages
179–190, 2006.
[14] L. C. Aparicio, J. Segarra, C. Rodríguez, and V. Viñals. Improving the
WCET computation in the presence of a lockable instruction cache in
multitasking real-time systems. J. Syst. Archit., 57(7):695–706, Aug.
2011.
[15] A. Arnaud and I. Puaut. Dynamic instruction cache locking in hard real-
time systems. In Proceedings of the 14th International Conference on
Real-Time and Network Systems, RTNS ’06, 2006.
[16] R. Arnold, F. Mueller, D. Whalley, and M. Harmon. Bounding worst-
case instruction cache performance. In Proceedings of the 15th IEEE
Real-Time Systems Symposium, RTSS ’94, pages 172–181, 1994.
[17] C. Ballabriga and H. Casse. Improving the first-miss computation in set-
associative instruction caches. In Proceedings of the 2008 Euromicro
Conference on Real-Time Systems, ECRTS ’08, pages 341–350, 2008.
[18] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel.
Scratchpad memory: design alternative for cache on-chip memory in em-
bedded systems. In Proceedings of the 10th international symposium on
Hardware/software codesign, CODES ’02, pages 73–78, 2002.
[19] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for
architectural-level power analysis and optimizations. In Proceedings
of the 27th annual International Symposium on Computer Architecture,
ISCA ’00, pages 83–94, 2000.
[20] B. Buck and J. K. Hollingsworth. An api for runtime code patching. Int.
J. High Perform. Comput. Appl., 14(4):317–329, Nov. 2000.
[21] D. Burger and T. M. Austin. The simplescalar tool set, version 2.0.
SIGARCH Comput. Archit. News, 25(3):13–25, June 1997.
[22] J. M. Calandrino and J. H. Anderson. Cache-aware real-time scheduling
on multicore platforms: Heuristics and a case study. In Proceedings of the
2008 Euromicro Conference on Real-Time Systems, ECRTS ’08, pages
299–308, 2008.
[23] A. M. Campoy, I. Puaut, A. P. Ivars, and J. V. B. Mataix. Cache contents
selection for statically-locked instruction caches: An algorithm compar-
ison. In Proceedings of the 17th Euromicro Conference on Real-Time
Systems, ECRTS ’05, pages 49–56, 2005.
[24] J. F. Cantin and M. D. Hill. Cache performance for SPEC CPU2000
benchmarks. http://www.cs.wisc.edu/multifacet/misc/
spec2000cache-data, May 2003.
[25] S. Chatterjee, E. Parker, P. J. Hanlon, and A. R. Lebeck. Exact analysis of
the cache behavior of nested loops. In Proceedings of the ACM SIGPLAN
2001 conference on Programming language design and implementation,
PLDI ’01, pages 286–297, 2001.
[26] S. Chattopadhyay, C. L. Kee, A. Roychoudhury, T. Kelter, P. Marwedel,
and H. Falk. A unified WCET analysis framework for multi-core plat-
forms. In Proceedings of the IEEE 18th Real Time and Embedded Tech-
nology and Applications Symposium, RTAS ’12, pages 99–108, 2012.
[27] S. Chattopadhyay and A. Roychoudhury. Static bus schedule aware
scratchpad allocation in multiprocessors. In Proceedings of the 2011
SIGPLAN/SIGBED conference on Languages, compilers and tools for
embedded systems, LCTES ’11, pages 11–20, 2011.
[28] S. Chattopadhyay, A. Roychoudhury, and T. Mitra. Modeling shared
cache and bus in multi-cores for timing analysis. In Proceedings of the
13th International Workshop on Software & Compilers for Embedded
Systems, SCOPES ’10, pages 6:1–6:10, 2010.
[29] H. Chetto and M. Chetto. Some results of the earliest deadline scheduling
algorithm. IEEE Trans. Softw. Eng., 15(10):1261–1269, Oct. 1989.
[30] A. Colin and I. Puaut. Worst case execution time analysis for a processor
with branch prediction. Real-Time Syst., 18(2/3):249–274, May 2000.
[31] P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model
for static analysis of programs by construction or approximation of fix-
points. In Proceedings of the 4th ACM SIGACT-SIGPLAN symposium on
Principles of programming languages, POPL ’77, pages 238–252, 1977.
[32] C. Cullmann. Cache persistence analysis: Theory and practice. ACM
Trans. Embed. Comput. Syst., 12(1s):40:1–40:25, Mar. 2013.
[33] J.-F. Deverge and I. Puaut. WCET-directed dynamic scratchpad memory
allocation of data. In Proceedings of the 19th Euromicro Conference on
Real-Time Systems, ECRTS ’07, pages 179–190, 2007.
[34] R. P. Dick, D. L. Rhodes, and W. Wolf. TGFF: task graphs for free.
In Proceedings of the 6th international workshop on Hardware/software
codesign, CODES/CASHE ’98, pages 97–101, 1998.
[35] European Space Agency. DEBIE – First standard space de-
bris monitoring instrument. http://gate.etamax.de/edid/
publicaccess/debie1.php, 2008.
[36] H. Falk and J. C. Kleinsorge. Optimal static WCET-aware scratchpad
allocation of program code. In Proceedings of the 46th Annual Design
Automation Conference, DAC ’09, pages 732–737, 2009.
[37] H. Falk and H. Kotthaus. WCET-driven cache-aware code positioning. In
Proceedings of the 14th international conference on Compilers, architec-
tures and synthesis for embedded systems, CASES ’11, pages 145–154,
2011.
[38] H. Falk, S. Plazar, and H. Theiling. Compile-time decided instruction
cache locking using worst-case execution paths. In Proceedings of the
5th IEEE/ACM international conference on Hardware/software codesign
and system synthesis, CODES+ISSS ’07, pages 143–148, 2007.
[39] A. Fedorova, M. Seltzer, and M. D. Smith. Cache-fair thread scheduling
for multicore processors. Technical Report TR-17-06, Harvard Univer-
sity, 2006.
[40] A. Fedorova, M. Seltzer, M. D. Smith, and C. Small. CASC: A
cache-aware scheduling algorithm for multithreaded chip multiproces-
sors. Technical Report TR-2005-0142, Sun Labs, 2005.
[41] C. Ferdinand, F. Martin, R. Wilhelm, and M. Alt. Cache behavior predic-
tion by abstract interpretation. Sci. Comput. Program., 35(2-3):163–189,
Nov. 1999.
[42] C. Ferdinand and R. Wilhelm. On predicting data cache behavior for
real-time systems. In Proceedings of the ACM SIGPLAN Workshop on
Languages, Compilers, and Tools for Embedded Systems, LCTES ’98,
pages 16–30, 1998.
[43] G. Gebhard and S. Altmeyer. Optimal task placement to improve cache
performance. In Proceedings of the 7th ACM & IEEE international con-
ference on Embedded software, EMSOFT ’07, pages 259–268, 2007.
[44] S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: a com-
piler framework for analyzing and tuning memory behavior. ACM Trans.
Program. Lang. Syst., 21(4):703–746, July 1999.
[45] C. Guillon, F. Rastello, T. Bidault, and F. Bouchez. Procedure placement
using temporal-ordering information: Dealing with code size expansion.
J. Embedded Comput., 1(4):437–459, Dec. 2005.
[46] J. Gustafsson, A. Betts, A. Ermedahl, and B. Lisper. The Mälardalen
WCET benchmarks - past, present and future. In Proceedings of the 10th
International Workshop on Worst-Case Execution Time Analysis, WCET
’11, pages 136–146, 2010.
[47] D. Hardy, T. Piquet, and I. Puaut. Using bypass to tighten WCET esti-
mates for multi-core processors with shared instruction caches. In Pro-
ceedings of the 30th IEEE Real-Time Systems Symposium, RTSS ’09,
pages 68–77, 2009.
[48] D. Hardy and I. Puaut. WCET analysis of multi-level non-inclusive set-
associative instruction caches. In Proceedings of the 29th Real-Time Sys-
tems Symposium, RTSS ’08, pages 456–466, 2008.
[49] C. A. Healy, R. D. Arnold, F. Mueller, M. G. Harmon, and D. B. Whal-
ley. Bounding pipeline and instruction cache performance. IEEE Trans.
Comput., 48(1):53–70, Jan. 1999.
[50] J. L. Hennessy and D. A. Patterson. Computer Architecture, Fourth Edi-
tion: A Quantitative Approach. Morgan Kaufmann Publishers Inc., 2006.
[51] B. K. Huynh, L. Ju, and A. Roychoudhury. Scope-aware data cache
analysis for WCET estimation. In Proceedings of the 17th IEEE Real-
Time and Embedded Technology and Applications Symposium, RTAS
’11, pages 203–212, 2011.
[52] L. Ju, S. Chakraborty, and A. Roychoudhury. Accounting for cache-
related preemption delay in dynamic priority schedulability analysis. In
Proceedings of the conference on Design, automation and test in Europe,
DATE ’07, pages 1623–1628, 2007.
[53] T. Kelter, H. Falk, P. Marwedel, S. Chattopadhyay, and A. Roychoudhury.
Bus-aware multicore WCET analysis through TDMA offset bounds. In
Proceedings of the 23rd Euromicro Conference on Real-Time Systems,
ECRTS ’11, pages 3–12, 2011.
[54] J. C. Kleinsorge, H. Falk, and P. Marwedel. A synergetic approach to
accurate analysis of cache-related preemption delay. In Proceedings of
the 9th ACM international conference on Embedded software, EMSOFT
’11, pages 329–338, 2011.
[55] M. Langenbach, S. Thesing, and R. Heckmann. Pipeline modeling for
timing analysis. In Proceedings of the 9th International Symposium on
Static Analysis, SAS ’02, pages 294–309, 2002.
[56] C.-G. Lee, J. Hahn, Y.-M. Seo, S. L. Min, R. Ha, S. Hong, C. Y. Park,
M. Lee, and C. S. Kim. Analysis of cache-related preemption delay in
fixed-priority preemptive scheduling. IEEE Trans. Comput., 47(6):700–
713, June 1998.
[57] B. Lesage, D. Hardy, and I. Puaut. WCET analysis of multi-level set-
associative data caches. In Proceedings of the 9th Intl. Workshop on
Worst-Case Execution Time WCET Analysis, WCET ’09, 2009.
[58] B. Lesage, D. Hardy, and I. Puaut. Shared data cache conflicts reduction
for WCET computation in multi-core architectures. In Proceedings of
the 18th International Conference on Real-Time and Network Systems,
RTNS ’10, 2010.
[59] X. Li, Y. Liang, T. Mitra, and A. Roychoudhury. Chronos: A timing
analyzer for embedded software. Sci. Comput. Program., 69(1-3):56–67,
Dec. 2007.
[60] X. Li, T. Mitra, and A. Roychoudhury. Modeling control speculation for
timing analysis. Real-Time Syst., 29(1):27–58, Jan. 2005.
[61] X. Li, A. Roychoudhury, and T. Mitra. Modeling out-of-order processors
for WCET analysis. Real-Time Syst., 34(3):195–227, Nov. 2006.
[62] Y. Li, V. Suhendra, Y. Liang, T. Mitra, and A. Roychoudhury. Timing
analysis of concurrent programs running on shared cache multi-cores. In
Proceedings of the 30th IEEE Real-Time Systems Symposium, RTSS ’09,
pages 57–67, 2009.
[63] Y.-T. S. Li and S. Malik. Performance analysis of embedded software
using implicit path enumeration. In Proceedings of the 32nd annual
ACM/IEEE Design Automation Conference, DAC ’95, pages 456–461,
1995.
[64] Y.-T. S. Li, S. Malik, and A. Wolfe. Efficient microarchitecture modeling
and path analysis for real-time software. In Proceedings of the 16th IEEE
Real-Time Systems Symposium, RTSS ’95, pages 298–, 1995.
[65] Y.-T. S. Li, S. Malik, and A. Wolfe. Cache modeling for real-time soft-
ware: beyond direct mapped instruction caches. In Proceedings of the
17th IEEE Real-Time Systems Symposium, RTSS ’96, pages 254–, 1996.
[66] Y.-T. S. Li, S. Malik, and A. Wolfe. Performance estimation of embed-
ded software with instruction cache modeling. ACM Trans. Des. Autom.
Electron. Syst., 4(3):257–279, July 1999.
[67] Y. Liang, H. Ding, T. Mitra, A. Roychoudhury, Y. Li, and V. Suhendra.
Timing analysis of concurrent programs running on shared cache multi-
cores. Real-Time Syst., 48(6):638–680, Nov. 2012.
[68] Y. Liang and T. Mitra. Cache modeling in probabilistic execution time
analysis. In Proceedings of the 45th annual Design Automation Confer-
ence, DAC ’08, pages 319–324, 2008.
[69] Y. Liang and T. Mitra. Improved procedure placement for set associative
caches. In Proceedings of the 2010 international conference on Compil-
ers, architectures and synthesis for embedded systems, CASES ’10, pages
147–156, 2010.
[70] Y. Liang and T. Mitra. Instruction cache locking using temporal reuse
profile. In Proceedings of the 47th Design Automation Conference, DAC
’10, pages 344–349, 2010.
[71] C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
ming in a hard-real-time environment. J. ACM, 20(1):46–61, Jan. 1973.
[72] T. Liu, M. Li, and C. J. Xue. Minimizing WCET for real-time embedded
systems via static instruction cache locking. In Proceedings of the 15th
IEEE Symposium on Real-Time and Embedded Technology and Applica-
tions, RTAS ’09, pages 35–44, 2009.
[73] T. Liu, M. Li, and C. J. Xue. Instruction cache locking for embedded
systems using probability profile. J. Signal Process. Syst., 69(2):173–
188, Nov. 2012.
[74] T. Liu, M. Li, and C. J. Xue. Instruction cache locking for multi-task real-
time embedded systems. Real-Time Syst., 48(2):166–197, Mar. 2012.
[75] T. Liu, Y. Zhao, M. Li, and C. J. Xue. Task assignment with cache parti-
tioning and locking for WCET minimization on MPSoC. In Proceedings
of the 39th International Conference on Parallel Processing, ICPP ’10,
pages 573–582, 2010.
[76] P. Lokuciejewski, H. Falk, and P. Marwedel. WCET-driven cache-based
procedure positioning optimizations. In Proceedings of the 2008 Euromi-
cro Conference on Real-Time Systems, ECRTS ’08, pages 321–330, 2008.
[77] T. Lundqvist and P. Stenström. An integrated path and timing analysis
method based on cycle-level symbolic execution. Real-Time Syst., 17(2-
3):183–207, Dec. 1999.
[78] T. Lundqvist and P. Stenström. Timing anomalies in dynamically sched-
uled microprocessors. In Proceedings of the 20th IEEE Real-Time Sys-
tems Symposium, RTSS ’99, pages 12–, 1999.
[79] F. Martin, M. Alt, R. Wilhelm, and C. Ferdinand. Analysis of loops. In
Proceedings of the 7th International Conference on Compiler Construc-
tion, CC ’98, pages 80–94, 1998.
[80] F. Mueller. Compiler support for software-based cache partitioning. In
Proceedings of the ACM SIGPLAN 1995 workshop on Languages, com-
pilers, & tools for real-time systems, LCTES ’95, pages 125–133, 1995.
[81] F. Mueller. Timing analysis for instruction caches. Real-Time Syst.,
18(2/3):217–247, May 2000.
[82] H. S. Negi, T. Mitra, and A. Roychoudhury. Accurate estima-
tion of cache-related preemption delay. In Proceedings of the 1st
IEEE/ACM/IFIP international conference on Hardware/software code-
sign and system synthesis, CODES+ISSS ’03, pages 201–206, 2003.
[83] C. Y. Park. Predicting program execution times by analyzing static and
dynamic program paths. Real-Time Syst., 5(1):31–62, Mar. 1993.
[84] S. Plazar, J. C. Kleinsorge, P. Marwedel, and H. Falk. WCET-aware static
locking of instruction caches. In Proceedings of the 10th International
Symposium on Code Generation and Optimization, CGO ’12, pages 44–
52, 2012.
[85] S. Plazar, P. Lokuciejewski, and P. Marwedel. WCET-aware software
based cache partitioning for multi-task real-time systems. In Proceed-
ings of the 9th Intl. Workshop on Worst-Case Execution Time Analysis,
WCET’09, 2009.
[86] I. Puaut. WCET-centric software-controlled instruction caches for hard
real-time systems. In Proceedings of the 18th Euromicro Conference on
Real-Time Systems, ECRTS ’06, pages 217–226, 2006.
[87] I. Puaut and D. Decotigny. Low-complexity algorithms for static cache
locking in multitasking hard real-time systems. In Proceedings of the
23rd IEEE Real-Time Systems Symposium, RTSS ’02, pages 114–, 2002.
[88] I. Puaut and C. Pais. Scratchpad memories vs locked caches in hard real-
time systems: a quantitative comparison. In Proceedings of the confer-
ence on Design, automation and test in Europe, DATE ’07, pages 1484–
1489, 2007.
[89] J. E. Sasinowski and J. K. Strosnider. A dynamic programming algo-
rithm for cache memory partitioning for real-time systems. IEEE Trans.
Comput., 42(8):997–1001, Aug. 1993.
[90] J. Schneider and C. Ferdinand. Pipeline behavior prediction for super-
scalar processors by abstract interpretation. In Proceedings of the ACM
SIGPLAN 1999 workshop on Languages, compilers, and tools for em-
bedded systems, LCTES ’99, pages 35–44, 1999.
[91] R. Sen and Y. N. Srikant. WCET estimation for executables in the pres-
ence of data caches. In Proceedings of the 7th ACM & IEEE international
conference on Embedded software, EMSOFT ’07, pages 203–212, 2007.
[92] A. C. Shaw. Reasoning about time in higher-level language software.
IEEE Trans. Softw. Eng., 15(7):875–889, July 1989.
[93] Y. N. Srikant and P. Shankar. The Compiler Design Handbook: Opti-
mizations and Machine Code Generation, Second Edition. CRC Press,
Inc., 2nd edition, 2007.
[94] F. Stappert, A. Ermedahl, and J. Engblom. Efficient longest executable
path search for programs with complex flows and pipeline effects. In Pro-
ceedings of the 2001 international conference on Compilers, architecture,
and synthesis for embedded systems, CASES ’01, pages 132–140, 2001.
[95] J. Staschulat and R. Ernst. Scalable precision cache analysis for real-time
software. ACM Trans. Embed. Comput. Syst., 6(4), Sept. 2007.
[96] V. Suhendra and T. Mitra. Exploring locking & partitioning for pre-
dictable shared caches on multi-cores. In Proceedings of the 45th annual
Design Automation Conference, DAC ’08, pages 300–303, 2008.
[97] V. Suhendra, T. Mitra, A. Roychoudhury, and T. Chen. WCET centric
data allocation to scratchpad memory. In Proceedings of the 26th IEEE
International Real-Time Systems Symposium, RTSS ’05, pages 223–232,
2005.
[98] V. Suhendra, T. Mitra, A. Roychoudhury, and T. Chen. Efficient detection
and exploitation of infeasible paths for software timing analysis. In Pro-
ceedings of the 43rd annual Design Automation Conference, DAC ’06,
pages 358–363, 2006.
[99] V. Suhendra, A. Roychoudhury, and T. Mitra. Scratchpad allocation
for concurrent embedded software. ACM Trans. Program. Lang. Syst.,
32(4):13:1–13:47, Apr. 2010.
[100] Y. Tan and V. Mooney. Integrated intra- and inter-task cache analysis for
preemptive multi-tasking real-time systems. In Proceedings of the 8th
International Workshop, SCOPES 2004, in: Lecture Notes in Computer
Science, LNCS 3199, SCOPES ’04, pages 182–199, 2004.
[101] H. Theiling, C. Ferdinand, and R. Wilhelm. Fast and precise WCET
prediction by separated cache and path analyses. Real-Time Syst.,
18(2/3):157–179, May 2000.
[102] L. Thiele and R. Wilhelm. Design for timing predictability. Real-Time
Syst., 28(2-3):157–177, Nov. 2004.
[103] H. Tomiyama and N. D. Dutt. Program path analysis to bound cache-
related preemption delay in preemptive real-time systems. In Proceed-
ings of the 8th International Workshop on Hardware/Software Codesign,
CODES ’00, pages 67–71, 2000.
[104] X. Vera, B. Lisper, and J. Xue. Data cache locking for tight timing calcu-
lations. ACM Trans. Embed. Comput. Syst., 7(1):4:1–4:38, Dec. 2007.
[105] M. Verma, K. Petzold, L. Wehmeyer, H. Falk, and P. Marwedel. Scratch-
pad sharing strategies for multiprocess embedded systems: a first ap-
proach. In The 3rd Workshop on Embedded Systems for Real-Time Mul-
timedia, ESTIMedia ’05, pages 115–120, 2005.
[106] Q. Wan, H. Wu, and J. Xue. WCET-aware data selection and al-
location for scratchpad memory. In Proceedings of the 13th ACM
SIGPLAN/SIGBED International Conference on Languages, Compilers,
Tools and Theory for Embedded Systems, LCTES ’12, pages 41–50,
2012.
[107] I. Wenzel, R. Kirner, P. Puschner, and B. Rieder. Principles of timing
anomalies in superscalar processors. In Proceedings of the 5th Interna-
tional Conference on Quality Software, QSIC ’05, pages 295–306, 2005.
[108] R. T. White, F. Mueller, C. Healy, D. Whalley, and M. Harmon. Timing
analysis for data and wrap-around fill caches. Real-Time Syst., 17(2-
3):209–233, Dec. 1999.
[109] R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti, S. Thesing, D. Whalley,
G. Bernat, C. Ferdinand, R. Heckmann, T. Mitra, F. Mueller, I. Puaut,
P. Puschner, J. Staschulat, and P. Stenström. The worst-case
execution-time problem - overview of methods and survey of tools. ACM Trans.
Embed. Comput. Syst., 7(3):36:1–36:53, May 2008.
[110] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In
Proceedings of the ACM SIGPLAN 1991 conference on Programming
language design and implementation, PLDI ’91, pages 30–44, 1991.
[111] H. Wu, J. Xue, and S. Parameswaran. Optimal WCET-aware code se-
lection for scratchpad memory. In Proceedings of the 10th ACM inter-
national conference on Embedded software, EMSOFT ’10, pages 59–68,
2010.
[112] J. Yan and W. Zhang. WCET analysis for multi-core processors with
shared L2 instruction caches. In Proceedings of the 14th IEEE
Real-Time and Embedded Technology and Applications Symposium, RTAS
’08, pages 80–89, 2008.
[113] W. Zhang and J. Yan. Accurately estimating worst-case execution time
for multi-core processors with shared direct-mapped instruction caches.
In Proceedings of the 15th IEEE International Conference on Embedded
and Real-Time Computing Systems and Applications, RTCSA ’09, pages
455–463, 2009.
[114] W. Zhao, D. Whalley, C. Healy, and F. Mueller. WCET code position-
ing. In Proceedings of the 25th IEEE International Real-Time Systems
Symposium, RTSS ’04, pages 81–91, 2004.
