Exploring Hybrid SPM-Cache Architectures to Improve Performance and Energy Efficiency for Real-time Computing by Wu, Lan
Virginia Commonwealth University
VCU Scholars Compass
Theses and Dissertations Graduate School
2013
© The Author
Downloaded from
http://scholarscompass.vcu.edu/etd/3280
Exploring Hybrid SPM-Cache Architectures to Improve Performance and Energy
Efficiency for Real-time Computing
A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy at Virginia Commonwealth University.
by
Lan Wu
B.S., University of Science and Technology of China, China, 2004
M.S., North China Institute of Computing Technology, China, 2007
Director: Dr. Wei Zhang,
Associate Professor, Department of Electrical and Computer Engineering
Virginia Commonwealth University
Richmond, Virginia
December, 2013
ACKNOWLEDGMENTS
I would like to thank Dr. Wei Zhang for his invaluable assistance and
insights, which guided my Ph.D. research and this dissertation over the last five years.
My sincere thanks also go to the members of my graduate committee for
their patience and understanding during the considerable time and effort that went
into the production of this dissertation.
Last but not least, many thanks go to my family from the bottom of my
heart.
TABLE OF CONTENTS
Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
2 Background
2.1 Real-Time Computing
2.2 On-chip Memory
3 A Model Checking Based Approach to Bounding Worst-Case Execution Time for Multicore Processors
3.1 Chapter Overview
3.2 Related Work
3.2.1 Prior Work in WCET Analysis
3.2.2 Prior Work in Model Checking
3.3 Background
3.3.1 The Assumed Dual-Core Processor with a Shared L2 Instruction Cache
3.3.2 Assumptions
3.3.3 SPIN and PROMELA
3.3.4 Static Analysis for WCET
3.4 Model Checking Based WCET Analysis For Multicore Processors
3.4.1 Model Single-Core Systems with SPIN
3.4.2 Model the Dual-Core Processor with SPIN
3.4.3 Improvement of the Previous Model
3.4.4 Architectural Parameters Impacting the Performance of Verification
3.5 Evaluation Methodology
3.6 Experimental Results
3.6.1 Comparing the Performance Before and After Model Simplification
3.6.2 Comparing the Model Checking Based Approach with Simulation and Static Analysis
3.6.3 Sensitivity to Cache Configurations
3.6.4 Limitation of the Model Checking Based Method
3.7 Conclusion
4 Exploiting Hybrid SPM-Cache Architectures to Reduce Energy Consumption for Embedded Computing
4.1 Chapter Overview
4.2 Motivation
4.3 Background on Hybrid SPM-Caches
4.3.1 Instruction Hybrid and Data Hybrid Architectures
4.3.2 Additional Hybrid Architectures
4.4 Energy Models
4.5 Evaluation Methodology
4.6 Experimental Results
4.6.1 Energy Results
4.6.2 Energy-Delay Product Results
4.6.3 Sensitivity Study on SPM and Cache Partitioning
4.6.4 Sensitivity Study on SPM and Cache Sizes
4.7 Related Work
4.8 Conclusions
5 Reducing Worst-Case Execution Time of Hybrid SPM-Caches
5.1 Chapter Overview
5.2 Background on Hybrid SPM-Caches
5.3 SPM Allocation for Hybrid SPM-Caches
5.3.1 Frequency-based SPM Allocation
5.3.2 Longest-Path based Allocation
5.3.3 Hybrid SPM-Cache Allocation
5.3.4 Enhanced Hybrid SPM-Cache Allocation
5.3.5 WCET Analysis of Hybrid SPM-Caches
5.4 Evaluation Methodology
5.5 Experimental Results
5.5.1 Safety and Accuracy of WCET Analysis
5.5.2 WCET Results of Different SPM Allocation Algorithms
5.5.3 Average-Case Performance Results
5.5.4 Sensitivity Study
5.6 Related Work
5.7 Conclusions
6 Cache-Aware SPM Allocation for Maximizing Performance on Hybrid SPM-Cache Architecture
6.1 Chapter Overview
6.2 Related Works
6.3 Basic SPM Allocation Algorithms
6.3.1 Frequency-based SPM Allocation
6.3.2 Hybrid SPM-Cache Allocation
6.4 Stack Distance Based SPM Allocation Algorithms
6.4.1 Stack Distance
6.4.2 Stack Distance Analysis for The HSC Architecture
6.4.3 Stack Distance Based SPM Allocation Algorithms
6.4.4 Side Effects of Basic Block Based Allocation
6.4.5 An Example To Compare HSA and GSDA
6.5 Evaluation Methodology
6.6 Experimental Results
6.6.1 Performance of Different Algorithms
6.6.2 Sensitivity to the Cache Size
6.6.3 Sensitivity to the Block Size
6.6.4 Running Time Under Configuration I and II
6.7 Conclusions
7 Cache-Aware SPM Allocation for Maximizing Energy Efficiency on Hybrid SPM-Cache Architecture
7.1 Chapter Overview
7.2 Stack Distance Based SPM Allocation Algorithms for Energy
7.2.1 Stack Distance Analysis for HSC on Energy Consumption
7.2.2 Exploit Cache Stack Distance to Improve SPM Allocation
7.3 Evaluation Methodology
7.4 Experimental Results
7.4.1 Memory Energy Consumption
7.4.2 Performance Results
7.4.3 EDP Results
7.5 Conclusions
8 Conclusion Remarks
8.1 Future Work
References
LIST OF TABLES
3.1 Explanations for symbols used in Listing 3.3.
3.2 Benchmark description.
3.3 Configuration of the dual-core memory hierarchy.
3.4 WCET of a single-core.
3.5 Comparison of the performance before and after using the simplified model.
3.6 Comparing the worst-case execution time, L1 miss ratio and L2 miss ratio of bs in core 1 among model checking, simulation and static analysis based approaches. (W: WCET, RL1: L1 miss ratio, RL2: L2 miss ratio, $W_{MC}/W_{SIMU}$: MC/SIMU WCET ratio, $W_{MC}/W_{STAT}$: MC/STAT WCET ratio)
3.7 Comparing the worst-case execution time, L1 miss ratio and L2 miss ratio of the benchmark in core 2 among model checking, simulation and static analysis based approaches. (W: WCET, RL1: L1 miss ratio, RL2: L2 miss ratio, $W_{MC}/W_{SIMU}$: MC/SIMU WCET ratio, $W_{MC}/W_{STAT}$: MC/STAT WCET ratio)
3.8 Comparing the worst-case execution time, L1 miss ratio and L2 miss ratio of qsort in core 1 among model checking, simulation and static analysis based approaches. (W: WCET, RL1: L1 miss ratio, RL2: L2 miss ratio, $W_{MC}/W_{SIMU}$: MC/SIMU WCET ratio, $W_{MC}/W_{STAT}$: MC/STAT WCET ratio)
3.9 Comparing the worst-case execution time, L1 miss ratio and L2 miss ratio of the benchmark in core 2 among model checking, simulation and static analysis based approaches. (W: WCET, RL1: L1 miss ratio, RL2: L2 miss ratio, $W_{MC}/W_{SIMU}$: MC/SIMU WCET ratio, $W_{MC}/W_{STAT}$: MC/STAT WCET ratio)
3.10 Compare the timing of model checking and static analysis approach (in seconds).
3.11 Configuration I of the Dual-core Chip Memory Hierarchy.
4.1 All the hybrid on-chip memories studied.
4.2 Salient characteristics of benchmarks.
5.1 Four SPM allocation algorithms studied in this chapter.
5.2 Benchmarks used in our experiments.
6.1 The SPM and cache parameters used in the example.
6.2 The execution sequence of the basic blocks.
6.3 The memory access trace before allocation. M=13. (A: instruction address, BA: block address, SI: set index, SD: stack distance, M: number of cache misses)
6.4 The number of cache misses of the line blocks before the SPM allocation. (LB: line block, INSTR: instruction, M: number of cache misses)
6.5 The memory access trace and cache misses after the SPM allocation by the HSA. (A: instruction address, BA: block address, SI: set index, SD: stack distance)
6.6 Checking the cache misses for each line block to identify the first candidate by the GSDA-based SPM allocation. (A: instruction address, BA: block address, SI: set index, SD: stack distance, M: number of cache misses).
6.7 Checking the cache misses for each line block to identify the second candidate by the GSDA based SPM allocation. (A: instruction address, BA: block address, SI: set index, SD: stack distance, M: number of cache misses).
6.8 Three memory configurations in our experiments.
6.9 General information of all benchmarks
6.10 The cache misses of all 4 allocation algorithms in default configuration.
6.11 The running time (in msec) of all 4 allocation algorithms in default configuration.
6.12 The cache misses of all 4 allocation algorithms with the Configuration I.
6.13 The cache misses of all 4 allocation algorithms with the Configuration II.
6.14 The allocation time (in msec) of all 4 allocation algorithms in Configuration I.
6.15 The allocation time (in msec) of all 4 allocation algorithms in configuration II.
6.16 The compression ratio of OSDA model during verification.
7.1 The symbols used in the equations.
LIST OF FIGURES
2.1 The WCET estimation
3.1 (a) A normal dual-core with a shared L2 cache; (b) a dual-core with a shared L2 instruction cache where the L1 data caches (i.e., dL1*) are perfect, that is, there are no L1 data cache misses.
3.2 The control flow graph of the motivating example.
3.3 Control flow graphs of four different cases. Note the last three cases shared the same control flow graph though they differ in the L2 cache access conditions.
3.4 Design flow for the model checking based dual-core WCET analysis.
3.5 Comparing the normalized WCET of three different cache models.
3.6 Feasibility of WCET analysis according to the total number of conflicting instructions modeled and subset size of L2 cache used by these instructions.
4.1 Baseline caches or SPMs only architectures.
4.2 The comparison of on-chip memory energy consumption between the IC-DC and IS-DS architectures, which is normalized to that of the IS-DS architecture.
4.3 The comparison of total energy consumption between the IC-DC and IS-DS architectures, which is normalized to that of the IS-DS architecture.
4.4 The comparison of the performance (i.e., the total number of execution cycles) between the IC-DC and IS-DS architectures, which is normalized to that of the IS-DS architecture.
4.5 Three hybrid SPM-Cache architectures.
4.6 Energy evaluation framework.
4.7 The comparison of on-chip memory and total energy consumption among all 9 architectures, which is normalized to the on-chip and total energy respectively of the IS-DS architecture.
4.8 The comparison of performance and EDP among all 9 architectures, which are normalized to execution cycles and EDP respectively of the IS-DS architecture.
4.9 The comparison of on-chip memory and total energy consumption among the IS-DS, the IC-DC, and the IH-DC, IC-DH, and IH-DH architectures with two different SPM and cache partitions, which is normalized to the on-chip memory and total energy consumption respectively of the IS-DS architecture.
4.10 The comparison of on-chip memory and total energy consumption of the IC-DC and IS-DS architectures with their total size varying from 8KB to 16KB, 32KB, and 64KB, which is normalized to the on-chip memory and total energy consumption respectively of the 8KB IS-DS architecture.
4.11 The comparison of on-chip memory and total energy consumption of the IH-DC architectures with their total size varying from 8KB to 16KB, 32KB, and 64KB, which is normalized to the on-chip memory and total energy consumption respectively of the 8KB IS-DS architecture.
4.12 The comparison of on-chip memory and total energy consumption of the IC-DH architectures with their total size varying from 8KB to 16KB, 32KB, and 64KB, which is normalized to the on-chip memory and total energy consumption respectively of the 8KB IS-DS architecture.
4.13 The comparison of on-chip memory and total energy consumption of the IH-DH architectures with their total size varying from 8KB to 16KB, 32KB, and 64KB, which is normalized to the on-chip memory and total energy consumption respectively of the 8KB IS-DS architecture.
5.1 The hybrid SPM-Cache system architecture.
5.2 High-level overview of our evaluation framework.
5.3 Comparing the estimated WCET and the simulated WCET for all the four SPM allocation algorithms, which is normalized to the simulated WCET.
5.4 Comparing the WCET of different SPM allocation algorithms with the default configuration, which is normalized to the WCET of the FSA algorithm.
5.5 Comparing the ACET of different SPM allocation algorithms with the default configuration, which is normalized to the ACET of the FSA algorithm.
5.6 Comparing the WCET of different SPM allocation algorithms for the hybrid SPM-Cache with a 96B SPM and a 32B cache, which is normalized to the WCET of the FSA algorithm.
5.7 Comparing the ACET of different SPM allocation algorithms for the hybrid SPM-Cache with a 96B SPM and a 32B cache, which is normalized to the ACET of the FSA algorithm.
5.8 Comparing the WCET of different SPM allocation algorithms for the hybrid SPM-Cache with a 32B SPM and a 128B cache, which is normalized to the WCET of the FSA algorithm.
5.9 Comparing the ACET of different SPM allocation algorithms for the hybrid SPM-Cache with a 32B SPM and a 128B cache, which is normalized to the ACET of the FSA algorithm.
6.1 The example of side effect of basic block based allocation.
6.2 The control flow graph of the example code segment.
6.3 The performance of all 4 algorithms in default configuration, which is normalized to the total number of execution cycles of the FSA.
6.4 The performance of all 4 algorithms with Configuration I, which is normalized to the total execution cycles of the FSA under Configuration I.
6.5 The performance of all 4 algorithms with Configuration II, which is normalized to the total execution cycles of the FSA under Configuration II.
7.1 The memory energy of all the allocation algorithms in default configuration (normalized to OSDA).
7.2 The total energy of all the allocation algorithms in default configuration (normalized to FSA).
7.3 Compare the EDP of all the allocation algorithms in default configuration (normalized to FSA).
Abstract
Real-time computing is not just fast computing but time-predictable
computing. Many tasks in safety-critical embedded real-time systems have hard
real-time characteristics. Failure to meet deadlines may result in the loss of life or
in large damages. Knowing the Worst-Case Execution Time (WCET) is important for
the reliability and correct functional behavior of the system.
As multi-core processors are increasingly adopted in industry, it has become
a great challenge to accurately bound the WCET for
real-time systems running on multi-core chips. This is particularly true because of
the inter-thread interferences in accessing shared resources on multi-cores, such as
shared L2 caches, which can significantly affect the performance but are very
difficult to estimate statically. We propose an approach to analyzing the
WCET for multi-core processors with shared L2
instruction caches by using a model checking based method. Our experiments
indicate that, compared to the static analysis technique based on extended ILP
(Integer Linear Programming), our approach improves the tightness of WCET
estimation by more than 31.1% for the benchmarks we studied. However, due to the
inherent complexity of multi-core timing analysis and the state explosion problem,
the model checking based approach currently can only work with small real-time
kernels for dual-core processors.
At the same time, improving the average-case performance and energy
efficiency is also important for real-time systems. Recently, Hybrid
SPM-Cache (HSC) architectures, which combine caches and Scratch-Pad Memories
(SPMs), have been increasingly used in commercial processors and research
prototypes. Our research explores the HSC architectures for real-time systems to
reconcile time predictability, performance, and energy consumption. We study the
energy dissipation of a number of HSC architectures that combine caches and
SPMs without increasing the total on-chip memory size.
Our experimental results indicate that, with the equivalent total on-chip memory
size, several hybrid SPM-Cache architectures are more energy-efficient than either
pure software-controlled SPMs or pure hardware-controlled caches. In particular,
using the hybrid SPM-Cache to store both instructions and data can achieve the
best energy efficiency.
However, the SPM allocation for the HSC architecture must be aware of the
cache performance to harness the full potential of the HSC architecture. First, we
propose and evaluate four SPM allocation strategies of different complexities to
reduce the WCET for hybrid SPM-Caches. These algorithms differ in whether
they cooperate with the cache and whether they are aware of the WCET. Our evaluation
shows that the cache-aware and WCET-oriented SPM allocation reduces the WCET
the most, with minimal or even positive impact on the average-case
execution time (ACET).
Moreover, we explore four SPM allocation algorithms to maximize
performance on the HSC architecture, including three heuristic-based algorithms
and an optimal algorithm based on model checking. Our experiments indicate that
the Greedy Stack Distance based Allocation (GSDA) can run efficiently while
achieving performance either the same as or close to the optimal results obtained by the
Optimal Stack Distance based Allocation (OSDA).
Last but not least, we extend the two stack distance based allocation
algorithms to GSDA-E and OSDA-E to minimize the energy consumption of the
HSC architecture. Our experimental results show that the GSDA-E can
reduce energy consumption to a level equal or close to the optimal results attained by the
OSDA-E, while achieving performance close to that of the OSDA and GSDA.
The detailed implementation and a discussion of the experimental results are presented in
this dissertation.
CHAPTER 1
INTRODUCTION
Real-time systems have been widely used in our society, especially for
safety-critical systems such as automobile and aircraft controllers. Besides
performance, time predictability is also critical to real-time systems, because
missing deadlines may either lead to disastrous consequences or severely degrade the quality of
service. The Worst-Case Execution Time (WCET) of an application must
be calculated to determine whether its deadline can always be met. Moreover, cache
memories have been widely used in modern processors to effectively bridge the
speed gap between the fast processor and the slow memory to achieve good
average-case performance. However, cache performance is heavily dependent on the
history of memory accesses and the cache placement and replacement algorithms,
making it hard to accurately predict the worst-case execution time. When threads
run concurrently on different cores in a multicore platform, the shared cache
memory can significantly impact the execution time of each concurrent thread and
complicate WCET analysis.
Scratch-Pad Memory (SPM) is an alternative on-chip memory to the cache,
which has been increasingly used in embedded processors due to its energy and
area efficiency. It is time-predictable because the allocation is controlled by
software and the latency to access data from the SPM is fixed. However, SPMs
generally are not adaptive to runtime instruction and data access patterns, and
thus may lead to inferior average-case performance. Processors that employ caches
or SPMs alone can only benefit either the average-case performance or the time
predictability, not both.
A hybrid cache and SPM model has also been used in some prototype or
commercial processors such as TRIPS [1], ARM1136JF-S [2], and Nvidia Fermi [3].
Recent studies show that a hybrid SPM-Cache can greatly improve the
performance [4], energy efficiency [5] and time-predictability [6], all of which are
potentially beneficial to embedded systems, including hard real-time systems.
However, the traditional SPM allocation, including both static and dynamic
allocation, mainly focuses on the SPM alone. These cache-unaware SPM allocation
algorithms are unlikely to harness the full potential of the hybrid SPM and cache.
To use the aggregate SPM and cache space more efficiently, we believe the SPM
allocation for the hybrid SPM-Cache architecture must be aware of the cache
performance to maximally optimize the worst-case execution time, the average
performance and the energy consumption.
Motivated by these challenges, the rest of this dissertation is organized as
follows.
Chapter 2 provides the background knowledge.
Chapter 3 focuses on studying a model checking based approach to safely
and accurately estimate inter-thread cache interferences and the WCET of real-time
tasks running on multicore processors. Our approach is built on top of Metzner's
single-core analysis method [7], which is extended to model inter-thread
interferences in a dual-core processor with a shared L2 instruction cache. We
exploit program control flow information to derive basic block automation (BBA),
based on which a PROMELA process [8] is generated to model each concurrent
thread as well as its accesses to the shared cache. We use the SPIN model checker
[8] to prove an upper bound on the execution time and then use a binary search
algorithm to compute the WCET for real-time tasks running on multicore
processors. Our experiments demonstrate that the model checking based method
indeed improves the tightness of WCET analysis, as compared to the
state-of-the-art static analysis approach [9], although the state explosion problem
currently limits its applicability to small real-time kernels for computers with
constrained physical memories.
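The "prove a bound, then binary search" step can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the function name `tightest_bound` and the oracle `holds` are hypothetical, and `holds(b)` merely stands in for one model checker run that answers whether the execution time can never exceed `b` cycles. The sketch assumes the oracle is monotone, that is, once a bound is proven, every larger bound is also valid.

```python
from typing import Callable, List, Tuple


def tightest_bound(holds: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the smallest b in [lo, hi] for which holds(b) is True.

    Assumption: holds is monotone (False below the true worst case,
    True at and above it) and holds(hi) is True.
    """
    if not holds(hi):
        raise ValueError("hi is not a valid upper bound")
    while lo < hi:
        mid = (lo + hi) // 2
        if holds(mid):
            hi = mid        # mid is a proven bound, try to tighten it
        else:
            lo = mid + 1    # a longer execution exists, raise the floor
    return lo


def make_oracle(true_wcet: int) -> Tuple[Callable[[int], bool], List[int]]:
    """Hypothetical stand-in for a verification run; records each query."""
    calls: List[int] = []

    def holds(bound: int) -> bool:
        calls.append(bound)
        return true_wcet <= bound

    return holds, calls
```

Each oracle call corresponds to one complete verification run, so the logarithmic number of queries matters: a search range of about a thousand cycles needs roughly ten runs rather than a thousand.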
Chapter 4 builds upon the prior work in [6] on the performance and
time predictability of hybrid SPM-Cache architectures. We systematically study
the energy consumption behaviors of 7 different hybrid SPM-Cache architectures,
which is expected to provide important insights on energy-efficient on-chip
memory design for embedded processors. We demonstrate that the hybrid on-chip
memory architectures can also make better tradeoffs between performance and
energy consumption, making them a very attractive design option for real-time and
embedded systems.
Chapter 5, Chapter 6 and Chapter 7 study the SPM allocation algorithms for
the hybrid SPM-Cache architectures. First, Chapter 5 explores four different SPM
allocation algorithms to reduce the WCET for the hybrid SPM-Cache architecture.
The first one is the Frequency based SPM Allocation (FSA), which is not aware of
the cache or the WCET and is used as the baseline. The Longest Path based
Allocation (LPA) is WCET-aware but cache-unaware. The other two algorithms are
both cache-aware, but the Hybrid SPM-Cache Allocation (HSA) allocates the basic
blocks according to the results of cache analysis based on Abstract
Interpretation (AI) [10] and does not consider the WCET. The Enhanced Hybrid
SPM-Cache Allocation (EHSA) algorithm is a WCET-oriented and cache-aware
SPM allocation algorithm, and our experimental results indicate that the EHSA
algorithm can outperform the other three algorithms in reducing the WCET for the hybrid
SPM-Cache with little or even positive impact on the average-case performance.
In Chapter 6, we design and comparatively evaluate four different SPM
allocation algorithms to optimize the execution time. The baseline
allocation algorithm is still the Frequency based SPM Allocation (FSA). The other
three algorithms are all cache-aware, but exploit cache information in different
ways. The Hybrid SPM-Cache Allocation (HSA) is different from that of Chapter
5 in that it exploits cache profiling information: it tries to allocate the memory
objects with the most cache misses into the SPM. The remaining two algorithms
are both based on the Stack Distance Analysis (SDA) [11], [12]. The Greedy Stack
Distance based Allocation (GSDA) is a greedy algorithm, whereas the Optimal
Stack Distance based Allocation (OSDA) is an optimal algorithm that uses model
checking. Our experiments show that all three cache-aware algorithms attain
better performance than the FSA algorithm. In particular, the HSA and the
GSDA improve the performance by 9% and 11%, respectively, compared to the
FSA. The OSDA always achieves the best performance, but requires significantly
more memory space and longer running time and may not be scalable for larger
benchmarks. The GSDA can achieve performance either the same as or very close
to that of the OSDA.
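To give a rough feel for the quantity that the stack distance based algorithms build on, the sketch below computes stack distances for a block trace. It assumes the common LRU convention (the distance of an access is the number of distinct blocks touched since the previous access to the same block) and a single fully associative set; the function names are hypothetical, and the definition actually used for the HSC architecture is the one given in Chapter 6.

```python
from typing import Hashable, List, Optional, Sequence


def stack_distances(trace: Sequence[Hashable]) -> List[Optional[int]]:
    """Stack distance per access; None marks the first access to a block."""
    stack: List[Hashable] = []            # most recently used block is last
    distances: List[Optional[int]] = []
    for block in trace:
        if block in stack:
            position = stack.index(block)
            # Blocks above `position` were touched since the last access.
            distances.append(len(stack) - 1 - position)
            stack.pop(position)
        else:
            distances.append(None)
        stack.append(block)
    return distances


def lru_misses(trace: Sequence[Hashable], ways: int) -> int:
    """Misses of a fully associative LRU cache holding `ways` blocks.

    Under the assumed convention, an access hits exactly when its stack
    distance is smaller than the number of ways.
    """
    return sum(1 for d in stack_distances(trace) if d is None or d >= ways)
```

Under these assumptions, removing a block from the trace (for example because it was placed in the SPM) shortens the stack distances of the blocks it used to separate, which is the kind of cache side effect a cache-aware allocator can take into account.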
Moreover, in Chapter 7 we extend the GSDA and OSDA to reduce the energy
consumption; the resulting algorithms are called the Greedy Stack Distance based
Allocation for Energy (GSDA-E) and the Optimal Stack Distance based Allocation
for Energy (OSDA-E). We evaluate them together with the four different SPM
allocation algorithms from Chapter 6, and find that the GSDA-E can reduce
energy consumption to a level equal or close to the optimal results attained by the OSDA-E,
while achieving performance close to that of the OSDA and the GSDA.
Finally, Chapter 8 concludes this dissertation.
8
CHAPTER 2
BACKGROUND
In this chapter, background information is provided for the topics covered in
this dissertation.
2.1 REAL-TIME COMPUTING
Real-time computing is the study of hardware and software systems that are
subject to a real-time constraint. The tasks running on a real-time system
usually have strict time constraints, which are referred to as "deadlines". There are
two kinds of deadlines: hard deadlines and soft deadlines. Missing a hard deadline
can cause a total system failure or disastrous consequences, while missing a soft
deadline may degrade the quality of service of the system because the result
becomes less useful. Therefore, the timing correctness of hard real-time
systems should be guaranteed by a safe and accurate estimation of the worst-case
execution time (WCET), which is defined as an upper bound on the execution
times of real-time tasks. The WCET estimation is illustrated in Figure 2.1.
Figure 2.1. The WCET estimation
2.2 ON-CHIP MEMORY
To boost performance, cache memories have been widely used in modern
processors to effectively bridge the speed gap between the fast processor and
the slow memory, achieving better average-case performance and reducing the
energy consumed by accessing the main memory. A cache stores previously used
data to shorten the access time when the same data are requested in the
future. If the requested data are found in the cache, a cache hit occurs;
otherwise, a cache miss happens. The performance of the cache memory
depends heavily on the history of memory accesses, as well as on the cache
placement and replacement algorithms.
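To make the hit/miss terminology concrete, the following minimal Python sketch simulates a direct-mapped cache. The cache geometry (4 lines of 16 bytes), the function name simulate_direct_mapped, and the address trace are illustrative assumptions and are not parameters taken from this dissertation.

```python
def simulate_direct_mapped(addresses, num_lines=4, line_size=16):
    """Return (hits, misses) for a byte-address trace on a direct-mapped cache.

    Assumed, illustrative geometry: num_lines cache lines of line_size bytes.
    """
    tags = [None] * num_lines          # one stored tag per cache line
    hits = misses = 0
    for addr in addresses:
        block = addr // line_size      # memory block holding this address
        index = block % num_lines      # placement: a block maps to exactly one line
        tag = block // num_lines
        if tags[index] == tag:
            hits += 1                  # data requested earlier is still cached
        else:
            misses += 1                # cold miss or conflict miss
            tags[index] = tag          # replacement: the old block is evicted
    return hits, misses


# The third access (address 64) evicts the block holding address 0,
# so the later access to address 0 misses again.
trace = [0, 4, 64, 0, 16, 20]
print(simulate_direct_mapped(trace))
```

The same address can therefore hit or miss depending on which accesses preceded it, which is the history dependence noted above.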
Scratch-Pad Memories (SPMs) [13] are also on-chip memories based on
SRAM, which can be used to store instructions, data, or both. Unlike caches, which
are controlled by hardware, the mapping of program and data elements into the
SPM is usually performed either by the user or by the compiler. This leads to
statically predictable memory access time, which is desirable for real-time systems.
Moreover, since an SPM does not need tag arrays, it is generally more
energy- and area-efficient than a cache of the same size. On the other hand,
since SPMs are totally controlled by software, they are generally less adaptable to
various instruction/data access patterns that depend on runtime inputs.
Also, because SPM allocation is done statically, SPMs generally cannot
dynamically reuse the limited on-chip SPM space as efficiently as caches. A
number of commercial processors employing scratch-pad memory are already
available on the market, such as the Motorola MPC500 [14] and ARMv6 [15].
CHAPTER 3
A MODEL CHECKING BASED APPROACH TO BOUNDING
WORST-CASE EXECUTION TIME FOR MULTICORE
PROCESSORS
3.1 CHAPTER OVERVIEW
In the last two decades, worst-case execution time (WCET) analysis has
been extensively studied, primarily for single-core processors [16]. However, the
wide use of multicore systems, driven by technology advancement and concerns
about power and heat dissipation, introduces new challenges to WCET analysis.
Specifically, in a multicore platform, threads running concurrently on different
cores may interfere with each other in accessing shared resources, such as shared
buses, caches, or memory modules, which can significantly impact the execution
time of each concurrent thread and complicate WCET analysis. Nevertheless, to
safely employ multicore chips in real-time cyber-physical systems, especially
hard real-time systems, it is necessary to derive safe and precise timing
guarantees through WCET analysis, which must take into account the impact of
inter-thread interferences.
In this chapter, we study a model checking based approach to
safely and accurately estimate inter-thread cache interferences and the WCET for
real-time tasks running on multicore processors. While our approach
can be applied generally to any multicore processor, as a first step towards using
model checking for multicore timing analysis, we focus on analyzing a dual-core
processor. Our approach is built on top of Metzner's single-core analysis method [7],
which we extend to model inter-thread interferences in a dual-core processor with
a shared L2 instruction cache. We exploit program control flow information to
derive a basic block automaton (BBA), based on which a PROMELA process [8] is
generated to model each concurrent thread as well as its accesses to the shared
cache. We use the SPIN model checker [8] to prove an upper bound of the execution
time and then use a binary search algorithm to compute the WCET for real-time
tasks running on multicore processors.
While there have recently been a few studies on bounding the worst-case
inter-thread cache interferences for multicore processors [17, 18, 19, 20, 9], the
novelty of our approach lies in analyzing the WCET for
multicore systems by using model checking technology. To the best of our
knowledge, this work is the first effort to apply model checking technology to
multicore WCET analysis, which has become increasingly important in this
multicore era, considering the benefits of multicore computing, such as high
throughput and better energy efficiency. We have also introduced several
techniques that reduce the memory consumption of the model checker without
compromising the quality of analysis by exploiting domain-specific
information. Our experiments demonstrate that the model checking based method
indeed improves the tightness of WCET analysis, as compared to the
state-of-the-art static analysis approach [9], although the state explosion problem
currently limits its applicability to small real-time kernels on computers with
constrained physical memories.
The rest of the chapter is organized as follows. Section 3.2 discusses the
related work. Section 3.3 describes background information, and Section 3.4
introduces the model checking based WCET analysis for multicore processors. The
evaluation methodology is given in Section 3.5, and the experimental results are
presented in Section 3.6. Finally, we draw conclusions in Section 3.7.
3.2 RELATED WORK
3.2.1 Prior Work in WCET Analysis
WCET analysis has been studied intensively in the last two decades. A good
review of the state of the art can be found in Wilhelm et al. [16]. Most of the
research efforts on WCET analysis have focused on single-core processors
[21, 22, 23, 24, 25]. However, these techniques cannot be applied to estimate the
WCET for multicore processors, because they do not consider the possible
inter-core interferences caused by concurrent threads accessing resources shared
among different cores.
In contrast, there have been relatively few efforts to study WCET analysis for
multicore processors. Stohr et al. [19] proposed a measurement-based approach to
bounding the worst-case access time for multicore platforms. However, this approach
may be unsafe because it is generally impossible to exhaust all the
possible paths with various inputs for concurrent threads. Rosen et al. [17] studied
the implicit bus traffic due to cache misses by different processors. However, they
did not investigate the challenging problem of inter-thread cache interferences on a
multicore chip, which is crucial for accurately bounding the worst-case
performance of multicore processors. Recently, Yan and Zhang proposed a control
flow based approach [20] and an enhanced approach [9] to deriving the WCET for
multicore processors with shared instruction caches. However, due to the
conservative nature of static timing analysis and the lack of actual runtime
information, the tightness (or accuracy) of the static analysis is not guaranteed. In
contrast to all these existing studies, we propose a model checking based method
to safely and accurately bound the WCET for multicore processors.
3.2.2 Prior Work in Model Checking
Pioneering work in model checking was done by Clarke et al. in the early
1980s [26]. They developed a method for checking models of
system designs where the specification is given by a temporal logic formula. The
first such temporal logic, linear-time propositional temporal logic, was proposed
by Pnueli [27] to specify and reason about the behaviors of computer systems.
Temporal logic has proved to be useful for specifying concurrent systems, and
many variants of temporal logic have been proposed in the literature [28]. Alur
and Dill first proposed the model of timed automata [29]. Both logic formulas and
automata can be used to specify the model of a system. Model checking techniques
have been applied not only to finite state systems but also to real-time systems
[30]. In practice, a real-time system is usually described as a set of process-timed
automata, each representing the behavior of an autonomous process [30].
In this chapter, we do not intend to make a comprehensive survey of all the
related work in model checking. Instead, we focus on related work that uses
model checking technology to address the WCET analysis problem. In
this area, Metzner [7] first showed that model checking could be used to compute
the WCET for single-core systems. Lv et al. [31] compared the performance of
WCET analysis techniques using static path analysis and model checking for
uniprocessors. Wilhelm [32] compared model checking, integer linear programming
(ILP), and a combination of abstract interpretation (AI) with ILP to determine the
WCET and argued that AI+ILP is a better approach. Recently, Huber and
Schoeberl [33] compared ILP and model checking based WCET analysis for Java
uniprocessors and found that model checking is fast enough for local analysis and
small applications, leading them to suggest combining model checking with ILP to
attain tight WCET results with reasonable analysis time. Mohalik et al. [34]
applied the UPPAAL model checker to bound the end-to-end latency in real-time
systems with clock drifts. However, due to the inherent high complexity of WCET
analysis for multicore systems, to the best of our knowledge, there is no prior work
that applies model checking technology to the WCET analysis problem
for multicore systems.
3.3 BACKGROUND
3.3.1 The Assumed Dual-Core Processor with a Shared L2 Instruction
Cache
In a multicore processor, each core typically has private L1 instruction and
data caches. The L2 (and/or L3) caches can be shared or private. While private
L2 caches are more time predictable in the sense that there are no inter-core L2
cache conflicts, they suffer from other deficiencies. First, each core with a private
L2 cache can only exploit a separate and limited cache space. Due to the great
impact of the L2 cache hit rate on the performance of multicore processors [35],
private L2 caches may have worse performance than a shared L2 cache with the
same total size, because each core with a shared L2 cache can make use of the
aggregate L2 cache space more effectively. Second, separate L2 caches
increase the cache synchronization and coherency cost. Moreover, a shared L2
cache architecture makes it easier for multiple cooperative threads to share
instructions and data, which is more expensive with separate L2 caches.
Therefore, we study the WCET analysis of multicore processors with shared
L2 caches (by contrast, the WCET analysis for multicore chips with private L2
caches is a less challenging problem).
Although our study focuses on the WCET analysis for a dual-core
processor with a shared L2 cache, our approach could also be applied to
multicore processors with a higher number of cores (e.g., a quad-core chip). Figure
3.1(a) shows a typical dual-core processor where each core has private L1
instruction and data caches and shares a unified L2 cache.
Figure 3.1. (a) A normal dual-core with a shared L2 cache; (b)
a dual-core with a shared L2 instruction cache where the L1
data caches (i.e., dL1*) are perfect, that is, there are no L1 data
cache misses.
3.3.2 Assumptions
We assume that the worst-case execution time of a single core can be
handled by existing WCET analysis techniques [16]. We also assume that the
shared bus and memory of the multicore processor are time predictable, which
actually can be supported by recently proposed techniques, such as the
interference-aware bus arbiter [36] and the predictable SDRAM memory controller
[37], respectively. Furthermore, we assume a perfect data cache so that we can
concentrate on examining the inter-thread interferences of instruction accesses. It
should be noted that this last assumption is consistent with the recent multicore
WCET analysis work [9], which will be compared with the proposed model
checking based approach. Specifically, the assumed dual-core architecture is
depicted in Figure 3.1(b), where each core has its own L1 instruction cache and a
perfect L1 data cache (i.e., dL1*) and shares the L2 cache.
Also, we assume that either two real-time threads (RTs) or a real-time
thread and a non-real-time thread (NRT) are running simultaneously on these two
cores. Our goal is to safely and accurately estimate the maximum inter-core L2
cache interferences, based on which the WCET of each RT can be calculated.
Moreover, we assume these two concurrently running threads are totally
independent of each other, that is, they do not share any data or instructions
and they do not communicate with each other or need to be synchronized.
Consequently, cache accesses from each thread can interleave with another thread
in any order, making it challenging to compute the worst-case inter-thread cache
interferences and WCET. To focus on WCET analysis, our study does not consider
real-time scheduling for multicore processors. However, we believe the WCET
results obtained for each real-time task running on a multicore platform can
provide a basis for schedulability analysis and are crucial to designing multicore
real-time schedulers for reducing inter-thread cache interferences.
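The effect of arbitrary interleaving on the shared cache can be illustrated with a small, hedged Python sketch. It enumerates every order-preserving interleaving of two threads' L2 access sequences on a tiny direct-mapped shared cache and reports the largest number of misses thread 1 can suffer. The function names, the two-line cache, and the block numbers are hypothetical; this brute-force enumeration only illustrates the search space that a model checker explores and is not the analysis method of this chapter.

```python
def interleavings(a, b):
    """Yield every order-preserving merge of sequences a and b."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def misses_of_thread(schedule, thread, num_lines):
    """Count shared-cache misses suffered by one thread under a schedule.

    schedule is a list of (thread_id, block) pairs. The shared cache is a
    direct-mapped array of num_lines entries (an illustrative assumption).
    Entries are tagged with the thread id because the threads share no code.
    """
    cache = [None] * num_lines
    misses = 0
    for tid, block in schedule:
        line = block % num_lines
        if cache[line] != (tid, block):
            if tid == thread:
                misses += 1
            cache[line] = (tid, block)   # whatever was in the line is evicted
    return misses


def worst_case_misses(trace1, trace2, num_lines=2):
    """Maximum misses of thread 1 over all interleavings with thread 2."""
    t1 = [(1, b) for b in trace1]
    t2 = [(2, b) for b in trace2]
    return max(misses_of_thread(s, 1, num_lines) for s in interleavings(t1, t2))


# Thread 1 reuses block 0; thread 2's block 2 maps to the same cache line.
print(worst_case_misses([0, 0, 0], []))      # thread 1 running alone
print(worst_case_misses([0, 0, 0], [2, 2]))  # worst interleaving with thread 2
```

The number of interleavings grows combinatorially with the trace lengths, which is one way to see why state explosion becomes the limiting factor for this kind of exhaustive exploration.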
3.3.3 SPIN and PROMELA
In this chapter, we use the SPIN model checker [8] to perform WCET
analysis on the assumed dual-core architecture. SPIN is a tool developed at Bell Labs
in the original Unix group of the Computing Sciences Research Center, starting in
1980. It can be used for the formal verification of distributed software systems.
SPIN verification models describe the interactions of different processes using
rendezvous message passing, asynchronous message passing through buffered
channels, shared variables, or any combination of these; internal sequential
computations are abstracted as much as possible in order to verify the
correctness of the temporal specification.
Models in SPIN are written in the verification language
PROMELA (Process Meta Language), and the correctness properties to be checked
are expressed in standard linear temporal logic (LTL). Instead of synchronous control in
hardware systems, SPIN focuses on asynchronous control in software systems,
which distinguishes it from other well-known approaches to model checking [38].
PROMELA is the verification modeling language of the SPIN system.
PROMELA programs consist of processes, message channels, and variables.
Processes are global objects, and message channels and variables can be declared
either globally or locally within a process. Processes specify behavior; channels and
global variables define the environment in which the processes run. Each statement
in a PROMELA process is atomic, and the processes execute concurrently, in an
interleaved and non-deterministic manner. It is worth noting that in SPIN, atomic
means that each statement is executed without interleaving with other
processes; interleaving means that statements of different processes do not occur
at the same time; and non-deterministic means that each process may have several
different possible actions, of which only one is chosen non-deterministically.
3.3.4 Static Analysis for WCET
Yan and Zhang's previous work proposed a static analysis approach for
WCET using extended ILP (EILP) [9], which we use for comparison to evaluate
our model checking approach. Yan and Zhang adopt the same
assumed dual-core processor as we do in Section 3.3.1. The basic idea of this
approach is to find the maximum impact of inter-thread instruction
interference on the WCET. For two co-running threads on different cores, the
shared L2 cache line accessed by one thread may also be requested by the other
thread, and this causes additional L2 cache misses for the former thread. As a
result, the execution time of the former thread may be longer than when it runs by
itself. Yan and Zhang first build a formula in EILP that describes the WCET
of a multicore task as the sum of the computation time of each basic
block, the total cache hit latency, and the total cache miss latency, including both
intra-thread and inter-thread misses. Then structural constraints are constructed by examining the
CFG, and functionality constraints are derived by bounding the loops and other
path information. Moreover, inter-thread constraints, which are the key factors in
the WCET computation for multicore processors, are provided by considering the possible
interferences from another core. All of them are fed into an ILP solver
to obtain the maximum value of the WCET formula.
More details of this work can be found in [9].
3.4 MODEL CHECKING BASED WCET ANALYSIS FOR
MULTICORE PROCESSORS
3.4.1 Model Single-Core Systems with SPIN
For single-core WCET analysis, we adopt Metzner's method [7] with several
extensions. First, we analyze the input program to build its control flow graph
(CFG). Second, we define the formal semantics of constructing an automaton from
the CFG, which is called the basic block automaton (BBA). Third, concrete models
described in PROMELA are generated, based on which upper bounds of the
execution time can be proved. Last, we use a binary search algorithm to find
the WCET, which is described in detail in Section 3.4.2.
Listing 3.1. The C++ source code for a motivating example
    #include <iostream>
    using namespace std;

    int main(int argc, char* argv[])
    {
        for (int i = 1; i <= 10; i++)
        {
            if (argc % 2 == 0)   // even when the remainder is zero
                cout << "argc_is_even" << endl;
            else
                cout << "argc_is_odd" << endl;
            i++;
        }
        return 0;
    }
Figure 3.2. The control flow graph of the motivating example.
Listing 3.1 gives a specific example of a simple C++ program; Figure 3.2 shows
its control flow graph; and Listing 3.2 illustrates the SPIN model for this example,
which is automatically generated given the BBA of the program. In this example,
an integer variable lpc1 is used as the loop counter, and another variable wcet is
used to record the elapsed time. The proctype BBA() implements the BBA by
simulating all the state transitions and performing the corresponding actions.
Each block introduced by a label 'Si:' represents a state of the BBA. The
statements wrapped in atomic implement the semantics of all possible transitions
enabled in the current state. The init process is the initial process of the SPIN
model; it initializes variables and starts all the other user-defined proctypes.
The never claim in Listing 3.2 implements the LTL property □p.¹ If
BOUND is no smaller than the actual WCET, □p evaluates to true, and SPIN exits
with 0; otherwise, □p is violated, and SPIN writes the counterexample into the
trace file and exits with 1. A successful run of SPIN proves an upper bound of the
execution time, based on which the WCET can be derived by using a binary search
algorithm (more details can be seen in Algorithm 1).
¹ □ denotes the always (or globally) operator in linear temporal logic.
Listing 3.2. The SPIN model of the motivating example
    /* execution_time1 .. execution_time6 and BOUND are constants that
       must be defined for the concrete program being analyzed */
    int wcet, lpc1;

    proctype BBA()
    {
    S1: atomic {
            wcet = wcet + execution_time1; goto S2
        }
    S2: atomic {
            wcet = wcet + execution_time2;
            lpc1++;
            if              /* both branches are possible */
            :: goto S3
            :: goto S4
            fi
        }
    S3: atomic {
            wcet = wcet + execution_time3; goto S5
        }
    S4: atomic {
            wcet = wcet + execution_time4; goto S5
        }
    S5: atomic {
            wcet = wcet + execution_time5;
            if
            :: lpc1 < 10 -> goto S2
            :: else -> goto S6
            fi
        }
    S6: atomic {
            wcet = wcet + execution_time6
        }
    }

    init {
        atomic {
            wcet = 0; lpc1 = 0; run BBA()
        }
    }

    #define p (wcet <= BOUND)

    never {    /* accepts a run on which p is violated, i.e., the negation of []p */
    T0_init:
        if
        :: (!(p)) -> goto accept_all
        :: else -> goto T0_init
        fi;
    accept_all:
        skip
    }
3.4.2 Model the Dual-Core Processor with SPIN
We propose the model checking based WCET analysis approach for the
dual-core processor with the following major steps. First, our approach models
the thread running on each core with a PROMELA process, as shown in Listing
3.3, which describes the cache and memory access behavior of every instruction
running on each core. The symbols used in Listing 3.3 are defined in Table 3.1.
Second, our approach models both the L1 and L2 instruction caches as
two-dimensional arrays in which each row represents a cache line and each element
represents an instruction. After each access to the L1 instruction cache, or to the L2
cache in case of an L1 miss, our approach calculates the mapped cache line
and identifies the corresponding L1 or L2 array element. Our approach then
updates this particular row of the L1/L2 array that holds the mapped instruction
with the instruction number associated with the core number (i.e., either 1 or 2),
which uniquely identifies this cache access. Based on this instruction identification
number, our approach can then determine the L1/L2 hit or miss for each
instruction access, including the L2 miss caused by an inter-thread cache
interference. Finally, our approach uses another instruction per basic block to
update the latency of executing this block (assuming cache hits) and add the
accumulated cache miss penalties (including both L1 and L2 misses) onto one of
the two global variables, wcet1 and wcet2, using Equation 3.1, as shown in Listing 3.3.
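\[
\mathit{wcet}_n = \mathit{wcet}_n + \mathit{execution\_bb}_i
  + \mathit{num\_bb}_i \times \mathit{L1hit\_latency}
  + \mathit{l1missofcore}_n \times \mathit{L1miss\_latency}
  + \mathit{l2missofcore}_n \times \mathit{L2miss\_latency}
\tag{3.1}
\]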
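As a hedged numerical illustration of this update, the Python sketch below adds the cost of one basic block to a running total, weighting L1 misses by the L1 miss latency as in the last statement of bb1 in Listing 3.3. The function name block_cost, the assumption that the miss counts are per-block values, and the latencies of 1, 10, and 100 cycles are all hypothetical choices made only for the example.

```python
def block_cost(execution_bb, num_bb, l1_misses, l2_misses,
               l1hit_latency, l1miss_latency, l2miss_latency):
    """Cost added to a core's accumulated time for one basic block.

    execution_bb: execution time of the block assuming cache hits
    num_bb:       number of instructions in the block
    l1_misses:    L1 misses counted for this block (assumed per-block)
    l2_misses:    L2 misses counted for this block (assumed per-block)
    """
    return (execution_bb
            + num_bb * l1hit_latency
            + l1_misses * l1miss_latency
            + l2_misses * l2miss_latency)


# Hypothetical block: 4 instructions, 5 cycles of execution, 2 L1 misses,
# one of which also misses in the L2 cache.
wcet1 = 0
wcet1 += block_cost(execution_bb=5, num_bb=4, l1_misses=2, l2_misses=1,
                    l1hit_latency=1, l1miss_latency=10, l2miss_latency=100)
print(wcet1)  # 5 + 4*1 + 2*10 + 1*100 = 129
```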
In our approach, the LTL formulas □(wcet1 < BOUND1) and
□(wcet2 < BOUND2)² are used to verify that the WCET of core 1 is less than
BOUND1 and the WCET of core 2 is less than BOUND2. The actual WCET value
is then calculated by changing the values of BOUND1 and BOUND2 until the LTL
formulas □(wcet1 < BOUND1) ∧ ¬□(wcet1 < BOUND1 − 1) and
□(wcet2 < BOUND2) ∧ ¬□(wcet2 < BOUND2 − 1) hold. A binary search
algorithm can be used to efficiently find this tight WCET value automatically,
which is depicted in Algorithm 1.
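The search can be sketched in Python as follows. This is a minimal sketch in the spirit of Algorithm 1, with the bound update performed inside the loop; the callable holds is a hypothetical stand-in for one model-checker run that verifies the property "always wcet <= bound", and it is assumed to be monotone (false below the WCET, true from the WCET upward). No real SPIN invocation is shown.

```python
def find_wcet(holds, lower, upper):
    """Return the smallest bound in (lower, upper] for which holds(bound) is true.

    Preconditions (assumed): holds(lower) is false and holds(upper) is true,
    i.e. lower underestimates the WCET and upper is a safe bound.
    """
    while lower < upper - 1:
        middle = (lower + upper) // 2
        if holds(middle):
            upper = middle   # middle is a proven upper bound: tighten it
        else:
            lower = middle   # middle is too small: a counterexample exists
    return upper


# Stand-in oracle with a made-up WCET of 137 cycles, counting oracle calls.
calls = []

def fake_model_checker(bound, true_wcet=137):
    calls.append(bound)
    return true_wcet <= bound

print(find_wcet(fake_model_checker, 0, 1000), len(calls))
```

Each call to holds corresponds to one full verification run, so the logarithmic number of iterations matters far more here than in an ordinary in-memory search.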
² Note that BOUND1 and BOUND2 are constants that represent the possible upper bounds of
execution time for thread 1 and thread 2, respectively.
Listing 3.3. The PROMELA model for the dual-core processor
    int wcet1, wcet2;

    proctype core1()
    {
        int l1missofcore1 = 0; int l2missofcore1 = 0;
    bb1: for (i = 1; i <= the number of instructions in bb1; i++)
        {
            atomic {
                /* run instruction i */
                if (there is an l1 miss)
                {
                    l1missofcore1++;
                    update l1[row(map(i))];
                    if (there is an l2 miss)
                    {
                        l2missofcore1++;
                        update l2[row(map(i))];
                    }
                } /* end of if (there is an l1 miss) */
            } /* end of atomic */
        } /* end of for */
        wcet1 = wcet1 + execution_bb1 + num_bb1 * L1hit_latency
                + l1missofcore1 * L1miss_latency + l2missofcore1 * L2miss_latency;
        goto bb2;
    bb2: ...
        ...
    }

    proctype core2()
    {
        /* Operations similar to those in core1 */
    }

    init {
        atomic {
            initialize the L1 and L2 arrays;
            wcet1 = 0; wcet2 = 0;
            run core1(); run core2();
        }
    }
3.4.3 Improvement of the Previous Model
Due to the inherent complexity of concurrent thread interactions on multicore
platforms and the well-known state explosion problem of model checking, we
exploit domain-specific knowledge to mitigate the state explosion without
compromising the safety or tightness of the proposed WCET analysis approach.
The details of these techniques are described in the following two subsections.

Symbols           Explanation
limissofcorei     total number of Li misses in core i (i = 1, 2)
row(map(i))       the cache line that instruction i is mapped to
execution_bbi     execution time of basic block i
num_bbi           total number of instructions in basic block i
L1hit_latency     latency of an L1 hit
L1miss_latency    latency of an L1 miss
L2miss_latency    latency of an L2 miss
Table 3.1. Explanations for symbols used in Listing 3.3.

Algorithm 1 Finding the WCET using binary search
begin
    set the lower and upper bounds of the binary search;
    while lower_bound < upper_bound − 1 do
        middle = (lower_bound + upper_bound) / 2;
        check the property □(WCET <= middle);
        if □(WCET <= middle) is satisfied then
            upper_bound = middle;
        else
            lower_bound = middle;
        end if
    end while
    return upper_bound
end
Simplify the Model
First of all, we intend to minimize the number of instructions modeled in each core
without affecting the WCET analysis. To achieve this, the WCET of the
application when a single core is used can be calculated separately by using the
SPIN model of the single-core processor, as described in Section 3.4.1. The L1-hit,
L1-miss, L2-hit, and L2-miss instructions for a single core can be easily identified
in the simulation mode of SPIN. Similar to the model in Listing 3.3, we have
already modeled the calculation of whether each instruction is a cache miss or
hit, and this information can be collected when we simulate the model on
the fly. We then assign the cost of each basic block as the WCET of that block in
a single-core system. After categorizing cache accesses into L1/L2 hits or misses,
our approach only needs to model the L2-hit instructions in each core to derive the
possible inter-thread L2 cache interferences and to use one instruction to calculate
the cost of each basic block. This is because the inter-thread cache interferences
can only happen in the shared L2 cache (note that the L1 cache is private to each
core). Also, we do not need to model L2 misses, because they cannot be further
degraded by the interferences from another thread. Therefore, by only modeling
the L2-hit instructions from each thread, we can significantly simplify the model
and reduce the number of states checked. At the same time, the analysis of
worst-case inter-thread interferences is not affected.
For example, assume a simple benchmark with seven basic blocks and 100
dynamic instructions in which only 11 instructions are identified as L2 hits. Before
the aforementioned simplification, we have to model all 100 instructions. After
simplification, however, only 11 L2-hit instructions and seven instructions for
calculating the costs of the seven basic blocks need to be modeled. As a result, the
state explosion problem can be alleviated.
Another benefit of only modeling L2-hit instructions is that we do not need
to use two arrays to represent L1 and L2 caches, respectively. Instead, we can use
a single array to model the L2 cache only, whose size only needs to be equal to the
number of L2-hit instructions. Moreover, because all these cache accesses are
guaranteed to be L2 hits, each element in this array now only needs to store a
single bit, indicating whether this L2 cache line is used by core 1 or core 2.
Compared with the original method that has to use an instruction number in
addition to another bit indicating the core number for uniquely identifying each
cache access, this optimization can greatly shrink the size of each state the verifier
generates, thus further mitigating the state explosion problem.
Additional Optimizations for Simplification
In addition to the preceding optimizations, we find that it is possible to
further simplify the model without compromising the quality of analysis by
exploiting control flow information. Specifically, by analyzing the CFG of each
real-time application, we find that more instructions do not need to be modeled,
which are discussed next based on different cases.
For loops. If the instructions in a loop are all always-hit³ or first-miss⁴
instructions [39], then we only need to use two PROMELA instructions to calculate
the execution time of this loop, instead of using N PROMELA instructions, where
N is the number of loop iterations. Specifically, we can use only one PROMELA
instruction to compute the execution time of the first iteration and another
instruction to compute the execution time of the second iteration multiplied by N−1.
This is because except for the first iteration, all the remaining loop iterations have
the same cache behavior and, thus, the same execution latency. As a result of this
optimization, N−2 PROMELA instructions can be saved for this loop.
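The arithmetic behind this loop rule can be checked with a short Python sketch. The function names and the cycle counts are hypothetical; the sketch only compares the two-term closed form with a per-iteration sum under the stated assumption that iterations 2 to N behave identically.

```python
def loop_cost(first_iteration, later_iteration, n_iterations):
    """Cost of a loop whose iterations 2..N all have the same latency."""
    if n_iterations < 1:
        return 0
    return first_iteration + (n_iterations - 1) * later_iteration


def loop_cost_unrolled(first_iteration, later_iteration, n_iterations):
    """Reference: add one cost per iteration, as an unrolled model would."""
    return sum(first_iteration if k == 0 else later_iteration
               for k in range(n_iterations))


# Hypothetical loop: the first iteration pays the cache misses (120 cycles),
# the nine remaining iterations hit in the cache (20 cycles each).
print(loop_cost(120, 20, 10))           # 120 + 9 * 20 = 300
print(loop_cost_unrolled(120, 20, 10))  # same total from ten separate terms
```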
³ An always-hit instruction is an instruction that is guaranteed to hit in the cache. If there is an
L1 miss instruction in a loop and if it is an L2 hit in a single-core case, we calculate its cache set
number and the conflicting cache set due to L2 accesses from other core(s) in the dual-core case.
If another core will not use this set during its execution time, then this instruction is classified as
an always-hit instruction.
⁴ A first-miss instruction in a loop is an instruction that misses for the first access and then
hits for all the remaining accesses. If there is an L1-miss instruction in a loop that is also an L2
miss the first time but hits in the remaining loop iterations, and if the other core will not
use the conflicting set during its execution time, this instruction is classified as a first-miss
instruction.
For single path and multi-path blocks. There are four different cases, as
shown in Figure 3.3 and described next.
Figure 3.3. Control flow graphs of four different cases. Note the
last three cases share the same control flow graph though they
differ in the L2 cache access conditions.
• Case 1. If there is no L2-hit instruction in block 1, then bb1 can be deleted
from the PROMELA model. However, to maintain the correctness and
accuracy of analysis, Equation 3.2 needs to be used to adjust the cost of bb2.
In other words, these two blocks can be aggregated into one unit for
calculating the total execution time if bb1 does not have any L2-hit
instructions.
\[ \mathit{bb2\_cost} = \mathit{bb2\_cost} + \mathit{bb1\_cost} \tag{3.2} \]
• Case 2. If there is no L2-hit instruction in block 1, then bb1 can be
deleted from the PROMELA model. Equations 3.2 and 3.3 are used to adjust
the costs of bb2 and bb3, respectively.
\[ \mathit{bb3\_cost} = \mathit{bb3\_cost} + \mathit{bb1\_cost} \tag{3.3} \]
• Case 3. If there is no L2-hit instruction in blocks 2 and 3, then both blocks
can be deleted from the PROMELA model. However, Equation 3.4 needs to
be used to adjust the cost of bb4 by adding the maximum cost of bb2 and
bb3.
\[ \mathit{bb4\_cost} = \mathit{bb4\_cost} + \max(\mathit{bb2\_cost}, \mathit{bb3\_cost}) \tag{3.4} \]
• Case 4. If block 2 (or 3) has n L2-hit instructions and it satisfies Equation
3.5, then block 2 (or 3) can be deleted from the PROMELA model without
impacting the WCET calculation.
\[ n \times \mathit{L2\_miss\_penalty} + \mathit{bb2(3)\_cost} < \mathit{bb3(2)\_cost} \tag{3.5} \]
It should be noted that the aforementioned four cases can be combined and
applied to optimize multiple blocks, which can further reduce the number of
instructions modeled and, hence, mitigate the state explosion.
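The four cases can be sketched in Python on a dictionary of block costs. This is an illustrative sketch only: the helper names, the diamond-shaped example (bb1 branching to bb2 and bb3, which join at bb4), and all numbers are hypothetical, and the blocks being folded are assumed to contain no L2-hit instructions, as the cases require.

```python
def fold_into_successors(cost, block, successors):
    """Cases 1 and 2: delete a block and add its cost to each successor
    (Equations 3.2 and 3.3)."""
    removed = cost.pop(block)
    for successor in successors:
        cost[successor] += removed


def fold_branches_into_join(cost, branches, join):
    """Case 3: delete both branch blocks and add the larger of their costs
    to the join block (Equation 3.4)."""
    cost[join] += max(cost.pop(branch) for branch in branches)


def branch_is_dominated(n_l2_hits, l2_miss_penalty, branch_cost, other_cost):
    """Case 4: even if all n L2 hits of this branch turned into misses, the
    branch stays cheaper than the other one (Equation 3.5)."""
    return n_l2_hits * l2_miss_penalty + branch_cost < other_cost


# Worst-case path before folding: bb1 + max(bb2, bb3) + bb4 = 4 + 10 + 3 = 17.
cost = {"bb1": 4, "bb2": 10, "bb3": 7, "bb4": 3}
fold_into_successors(cost, "bb1", ["bb2", "bb3"])       # Case 2
fold_branches_into_join(cost, ["bb2", "bb3"], "bb4")    # Case 3
print(cost)  # a single block remains, carrying the same worst-case cost

print(branch_is_dominated(2, 100, 10, 500))  # 210 < 500, branch can be dropped
```

Applying Case 2 and then Case 3 leaves one block whose cost equals the worst-case path cost of the original diamond, which is the sense in which the cases can be combined.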
3.4.4 Architectural Parameters Impacting the Performance of
Verification
We find that three architectural parameters mainly
influence the performance of verification: (1) the total physical memory
size of the machine SPIN runs on, (2) the number of instructions executing on
each core, and (3) the number of L2 cache lines that the L2-hit instructions access
(note that the interfering instructions may not access all the L2 cache lines). First,
if the total physical memory of the machine on which SPIN runs is large, there is
more memory for SPIN to store states and, thus, larger problems can be solved.
However, the physical memory cannot be enlarged without limit in practice.
Therefore, we have to simplify the model of the problem to reduce the number of
states to ensure that the verification problem can be reasonably solved by an
actual machine. Second, the number of instructions executing on each core is also
a critical parameter affecting the performance. Fortunately, this number can be
largely reduced by deep analysis of the application, as explained in Section 3.4.3.
Finally, the number of L2 cache lines used by the L2-hit instructions in each core
can hardly be reduced, because it is determined by the cache configuration and the
mapping method from memory to cache, which are often fixed for a given cache.
3.5 EVALUATION METHODOLOGY
Our approach makes use of an extended Trimaran [40] to extract useful
program information for each concurrent thread. Trimaran is an integrated
compilation and performance monitoring infrastructure. An overview of our design
flow is illustrated in Figure 3.4. The starting point of our analysis is C programs.
We use an extended Trimaran framework to generate basic block information
which is stored in IR (intermediate representation) files. According to the
benchmark information and the CFG, PROMELA programs are constructed and
run in SPIN to either prove an upper bound of execution time or reject an
underestimated value of WCET. We then use the binary search algorithm
described in Algorithm 1 to compute the WCET value.
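The search over candidate WCET values can be sketched as a standard binary search. The block below is a generic illustration, not a reproduction of Algorithm 1: the oracle `is_upper_bound` is a hypothetical stand-in for one SPIN run that either proves a candidate bound or rejects it, and it is assumed to be monotone (once a bound is proven, every larger value is also a valid bound).

```python
def binary_search_wcet(is_upper_bound, low, high):
    """Return the smallest t in [low, high] for which is_upper_bound(t) holds.

    Assumptions: is_upper_bound is monotone and is_upper_bound(high) is True,
    e.g. high comes from a safe but pessimistic static analysis.
    """
    while low < high:
        mid = (low + high) // 2
        if is_upper_bound(mid):
            high = mid       # mid is a proven bound; try to tighten it
        else:
            low = mid + 1    # mid underestimates the WCET
    return low


# Toy oracle: pretend the true WCET is 2119 cycles.
estimate = binary_search_wcet(lambda t: t >= 2119, low=1356, high=3077)
```

Each oracle call corresponds to one verification run, so the number of runs grows only logarithmically with the width of the initial interval.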
Figure 3.4. Design flow for the model checking based dual-core WCET analysis.
The benchmarks are selected from Malardalen WCET benchmarks [41] and
MediaBench [42], which are listed in Table 3.2. The cache configuration of the
baseline dual-core processor simulated is listed in Table 3.3. It should be noted
that while our approach can be generally applied to set-associative caches, this
chapter focuses on evaluating direct-mapped caches for reducing the number of
states modeled and alleviating the state explosion problem. All our experiments
are conducted on an Intel processor with 1GB memory. The WCET of each
benchmark running on a single-core system is given in Table 3.4.

Benchmark    Number of Insts   Source       Description
bs           31                Malardalen   Binary search
fibcall      20                Malardalen   Iterative Fibonacci calculation
insertsort   70                Malardalen   Insertion sort on a reversed array
jfdctint     219               Malardalen   Discrete-cosine transformation
ludcmp       247               Malardalen   Read ten values, output half to LCD
matmul       70                Malardalen   Multiplication of two 20x20 matrices
minver       127               Malardalen   Inversion of floating point matrix
qsort        199               Malardalen   Non-recursive quick sort algorithm
qurt         97                Malardalen   Root computation of quadratic equations
select       168               Malardalen   Select the Nth largest number
cordic       768               Mediabench   Rotating complex numbers over the real field
Table 3.2. Benchmark description.

           Size(B)   Bsize(B)   Assoc   Latency
L1 cache   16K       8          1       10
L2 cache   512K      16         1       100
Table 3.3. Configuration of the dual-core memory hierarchy.
Benchmark    L1 miss   L2 miss   L1 miss ratio   L2 miss ratio   WCET
bs           19        11        0.268           0.579           1356
fibcall      12        8         0.075           0.667           1329
insertsort   12        8         0.012           0.667           3105
jfdctint     110       58        0.058           0.527           7242
ludcmp       137       83        0.053           0.606           11351
matmul       39        23        0.017           0.590           3262
minver       68        39        0.078           0.574           4864
qsort        108       63        0.080           0.583           8480
qurt         49        26        0.505           0.531           3169
select       94        52        0.078           0.553           7083
cordic       412       220       0.0001          0.534           2441687
Table 3.4. WCET of a single-core.
3.6 EXPERIMENTAL RESULTS
3.6.1 Comparing the Performance Before and After Model
Simplification
To evaluate the possible advantages of the simplified model described in
Section 2.4.3, we run bs on core 1 and another benchmark concurrently on core 2.
The results of bs are given in Table 3.5. Note that the verifier in this experiment
can only use a physical memory of 1GB. When bs is running simultaneously with
itself or fibcall, we observe the same WCET results both with and without using
the simplified model. However, without using the simplified model, the memory
usage is significantly larger because more instructions need to be modeled, and the
state-vector size is also much larger. When bs is running with two other larger
benchmarks, that is, insertsort and ludcmp, however, the memory consumption is
beyond 1GB without using the simplified model. In comparison, by using the
simplified model, the memory usage is below 3MB, and the verifier can return the
WCET results correctly. Therefore, in the following experiments, we will always
use the simplified model to mitigate the state explosion problem.

             instructions modeled   state-vector size(B)   memory usage(B)         WCET
Benchmark    before    after        before    after        before      after       before   after
bs           62        27           560       40           123.985M    2.589M      2119     2119
fibcall      51        13           560       36           19.883M     2.501M      1553     1553
insertsort   101       29           1028      40           beyond 1G   2.989M      -        4105
ludcmp       278       28           3956      40           beyond 1G   2.989M      -        14280
Table 3.5. Comparison of the performance before and after using
the simplified model.
3.6.2 Comparing the Model Checking Based Approach with
Simulation and Static Analysis
To compare the effectiveness of the model checking based method (MC) with
simulation (SM) and static analysis (SA) [9], we run bs on core 1 and another
benchmark on core 2, which is selected from the 11 benchmarks shown in Table
3.2, including bs itself. Table 3.6 and Table 3.7 show the WCET, L1-miss ratio,
and L2-miss ratio of bs on core 1 and of the other benchmark on core 2 for each
method, respectively.
As can be seen from both Table 3.6 and Table 3.7, the model checking based
approach always obtains larger WCET values than simulation (otherwise, it would
be unsafe) but tighter WCET values than static analysis. On average, the model
checking based approach improves the tightness of WCET estimation by at least
31.1%, as compared to the static analysis. We also find that the static analysis
always reports the same WCET as well as L1- and L2-miss rates for bs, regardless
of which benchmark is chosen to run on the second core. This is because bs is a
very small benchmark which only contains 31 instructions, as can be seen from
Table 3.2. As a result, all its L2 hits become L2 misses when running concurrently
with most other benchmarks, except for a smaller benchmark, such as fibcall. We
observe similar insensitivity for most benchmarks with the model checking
based approach. However, for fibcall, the model checking based approach returns
a smaller WCET, while the static analysis approach still conservatively reports the
same WCET as when bs is running with other larger benchmarks. This again indicates
that the model checking based analysis is more accurate and thus can return
tighter WCET results.
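The safety and tightness comparison can be expressed as a small check. The sketch below is illustrative only: the function name and the returned dictionary are hypothetical, and the two sample rows are the fibcall and qsort WCET values reported in Table 3.7.

```python
def compare_wcet(w_mc, w_sim, w_sa):
    """Compare a model checking WCET with simulation and static analysis."""
    return {
        # A safe estimate must not be below any observed execution time.
        "safe": w_mc >= w_sim,
        # A tighter estimate does not exceed the static analysis bound.
        "tighter_than_sa": w_mc <= w_sa,
        "mc_over_sim": round(w_mc / w_sim, 3),
        "mc_over_sa": round(w_mc / w_sa, 3),
    }


# WCET values (MC, simulation, static analysis) taken from Table 3.7.
fibcall = compare_wcet(1553, 1253, 2657)
qsort = compare_wcet(10626, 8680, 12579)
```

A ratio above 1 against simulation together with a ratio below 1 against static analysis is the pattern described above: safe, yet less pessimistic than the static bound.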
Since bs is among the smallest benchmarks, we also choose a larger
benchmark, qsort, to run concurrently with another benchmark and then
compare its estimated WCET, L1-miss rate, and L2-miss rate among the model
checking, simulation, and static analysis based methods. The results of core 1 and
core 2 are shown in Table 3.8 and Table 3.9, respectively, where N indicates that the
experiment cannot be finished. We find that the model checking based approach
can finish three out of the four pairs, the exception being qsort and ludcmp. The reason
is that ludcmp is a relatively large benchmark with more L2 hits and
interfering instructions to be modeled, leading to a state explosion that exceeds the
1GB memory limit. In contrast, both the simulation and static analysis based
approaches can finish all these experiments. However, we find that for the three
pairs that the model checker can successfully complete, the model checking based
approach is more sensitive to different programs and returns more accurate WCET
results than the static analysis.
Moreover, we compare the analysis time of the model checking approach and
the static analysis approach in Table 3.10. From this comparison, we can see
that the average analysis time of static analysis is shorter than that of the model checking
approach. However, the model checking approach for all the benchmarks we
studied can be finished within 20 seconds, which is acceptable.

             model checking          simulation              static analysis
core 2       W      RL1    RL2       W      RL1    RL2       W      RL1    RL2      W_MC/W_SIMU   W_MC/W_STAT
bs           2119   0.267  1.000     1456   0.268  0.632     3077   0.543  0.500    1.388         0.689
fibcall      1856   0.267  0.842     1456   0.268  0.632     3077   0.543  0.500    1.215         0.603
insertsort   2119   0.267  1.000     1456   0.268  0.632     3077   0.543  0.500    1.388         0.689
jfdctint     2119   0.267  1.000     1456   0.268  0.632     3077   0.543  0.500    1.388         0.689
ludcmp       2119   0.267  1.000     1456   0.268  0.632     3077   0.543  0.500    1.388         0.689
matmul       2119   0.267  1.000     1456   0.268  0.632     3077   0.543  0.500    1.388         0.689
minver       2119   0.267  1.000     1456   0.268  0.632     3077   0.543  0.500    1.388         0.689
qsort        2119   0.267  1.000     1456   0.268  0.632     3077   0.543  0.500    1.388         0.689
qurt         2119   0.267  1.000     1456   0.268  0.632     3077   0.543  0.500    1.388         0.689
select       2119   0.267  1.000     1456   0.268  0.632     3077   0.543  0.500    1.388         0.689
cordic       2119   0.267  1.000     1456   0.268  0.632     3077   0.543  0.500    1.388         0.689
Average                                                                             1.372         0.681
Table 3.6. Comparing the worst-case execution time, L1 miss ratio
and L2 miss ratio of bs in core 1 among model checking, simulation
and static analysis based approaches. (W: WCET, RL1: L1 miss ratio,
RL2: L2 miss ratio, W_MC/W_SIMU: MC/SIMU WCET ratio,
W_MC/W_STAT: MC/STAT WCET ratio)
             model checking            simulation                static analysis
core 2       W        RL1     RL2      W        RL1     RL2      W        RL1     RL2      W_MC/W_SIMU   W_MC/W_STAT
bs           2119     0.267   1.000    1456     0.268   0.632    3077     0.543   0.500    1.388         0.689
fibcall      1553     0.075   1.000    1253     0.075   0.750    2657     1.000   0.512    1.239         0.584
insertsort   4105     0.012   1.000    3305     0.012   0.833    4619     0.084   0.523    1.242         0.889
jfdctint     8561     0.058   0.627    7809     0.058   0.555    13284    1.000   0.502    1.096         0.645
ludcmp       14280    0.053   0.672    11451    0.053   0.613    17943    0.098   0.567    1.247         0.796
matmul       4958     0.017   0.795    4258     0.017   0.667    5886     0.078   0.686    1.164         0.842
minver       6247     0.078   0.706    5250     0.078   0.601    9778     0.253   0.549    1.190         0.639
qsort        10626    0.080   0.657    8680     0.080   0.602    12579    0.123   0.523    1.224         0.845
qurt         4266     0.505   0.735    3269     0.505   0.551    7743     0.119   0.578    1.305         0.551
select       9390     0.078   0.670    7283     0.078   0.574    11086    0.124   0.500    1.289         0.847
cordic       2442787  0.0001  0.561    2441987  0.0001  0.541    2656780  0.0002  0.521    1.001         0.919
Average                                                                                    1.217         0.751
Table 3.7. Comparing the worst-case execution time, L1 miss
ratio and L2 miss ratio of the benchmark in core 2 among model
checking, simulation and static analysis based approaches. (W:
WCET, RL1: L1 miss ratio, RL2: L2 miss ratio, W_MC/W_SIMU:
MC/SIMU WCET ratio, W_MC/W_STAT: MC/STAT WCET ratio)
             model checking          simulation              static analysis
core 2       W      RL1    RL2       W      RL1    RL2       W      RL1    RL2      W_MC/W_SIMU   W_MC/W_STAT
bs           10626  0.080  0.657     8680   0.080  0.602     12579  0.123  0.523    1.224         0.845
fibcall      10426  0.080  0.639     8580   0.080  0.592     12579  0.123  0.523    1.215         0.829
insertsort   11226  0.080  0.713     8780   0.080  0.611     12579  0.123  0.523    1.279         0.892
ludcmp       N      N      N         8780   0.080  0.611     13431  0.131  0.502    N             N
Average                                                                             1.239         0.855
Table 3.8. Comparing the worst-case execution time, L1 miss ratio
and L2 miss ratio of qsort in core 1 among model checking,
simulation and static analysis based approaches. (W: WCET,
RL1: L1 miss ratio, RL2: L2 miss ratio, W_MC/W_SIMU: MC/SIMU
WCET ratio, W_MC/W_STAT: MC/STAT WCET ratio)
             model checking          simulation              static analysis
core 2       W      RL1    RL2       W      RL1    RL2       W      RL1    RL2      W_MC/W_SIMU   W_MC/W_STAT
bs           2119   0.267  1.000     1456   0.268  0.632     3077   0.543  0.500    1.388         0.689
fibcall      1553   0.075  1.000     1253   0.075  0.75      2657   0.043  0.512    1.239         0.584
insertsort   N      N      N         3405   0.012  0.917     4519   0.084  0.509    N             N
ludcmp       N      N      N         11651  0.053  0.628     21490  0.124  0.502    N             N
Average                                                                             1.314         0.637
Table 3.9. Comparing the worst-case execution time, L1 miss
ratio and L2 miss ratio of the benchmark in core 2 among model
checking, simulation and static analysis based approaches. (W:
WCET, RL1: L1 miss ratio, RL2: L2 miss ratio, W_MC/W_SIMU:
MC/SIMU WCET ratio, W_MC/W_STAT: MC/STAT WCET ratio)
             bs on core 1       bs on core 2
Benchmark    MC      SA         MC      SA
bs           10      1.32       10      1.32
fibcall      10      1.3        9       1.5
insertsort   11      1.62       10      1.35
jfdctint     11      2.34       10      1.92
ludcmp       11      3.49       13      6.75
matmul       11      2.29       11      1.99
minver       11      4.69       12      9.84
qsort        11      2.26       12      4.13
qurt         11      4.31       12      16.7
select       11      2.36       16      3.12
cordic       11      28.8       19      40.6
Average      10.82   4.98       12.18   8.11
Table 3.10. Comparison of the analysis time of the model checking and static
analysis approaches (in seconds).
                      Size(B)   Bsize(B)   Assoc   Latency
Model I     L1 cache  16K       8          1       10
            L2 cache  512K      16         1       100
Model II    L1 cache  8K        8          1       10
            L2 cache  128K      16         1       100
Model III   L1 cache  16K       16         1       10
            L2 cache  512K      64         1       100
Table 3.11. Configuration I of the Dual-core Chip Memory Hierarchy.
3.6.3 Sensitivity to Cache Configurations
To study the effectiveness of the model checking based approach under different
cache configurations in addition to the base cache configuration (i.e., Model I)
shown in Table 3.3, we also choose two other cache models, as listed in Table
3.11. For Model II, we reduce the size of the L1 cache to 8KB and the size of the
L2 cache to 128KB, and the block size of each cache is kept the same as in Model I.
For Model III, we increase the L1 block size to 16B and the L2 block size to 64B,
while keeping the L1 and L2 cache sizes the same as in Model I. Figure 3.5 compares
the WCETs of the three different methods in these three models, normalized
to the simulation results. Each normalized WCET is listed for each benchmark
on the x-axis, which runs on core 1, and core 2 always runs bs. We observe that in
all three cache configurations, the model checking based approach returns
a tighter WCET than the static analysis based approach, indicating its superiority
in terms of the accuracy of analysis.
Figure 3.5. Comparing the normalized WCET of three different cache models.
3.6.4 Limitation of the Model Checking Based Method
Due to the state explosion problem of the model checker, even with the
simplified model we find that the memory consumption can easily exceed 1GB for
larger benchmarks. To explore the limit of the model checking based method, we
systematically form 121 pairs from the 11 real-time benchmarks and run all these
pairs on the dual-core processor. In all these experiments, we use a machine with
1GB memory. We observe that out of the 121 pairs, only 41 pairs are solvable,
while for all other pairs the memory consumption exceeds 1GB, and the model
checker cannot finish execution successfully.
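The pair count can be reproduced with a short enumeration. The sketch below assumes that the 121 pairs are ordered (core 1, core 2) combinations of the 11 benchmarks in Table 3.2, including each benchmark paired with itself; this reading follows from 11 x 11 = 121 but is not stated explicitly, and the helper name is hypothetical.

```python
from itertools import product

BENCHMARKS = [
    "bs", "fibcall", "insertsort", "jfdctint", "ludcmp", "matmul",
    "minver", "qsort", "qurt", "select", "cordic",
]


def all_pairs(benchmarks):
    """Ordered (core 1, core 2) pairs, self-pairs included."""
    return list(product(benchmarks, repeat=2))


pairs = all_pairs(BENCHMARKS)
# 41 solvable pairs is the count reported in the text above.
solvable_fraction = 41 / len(pairs)
```

Under this assumption roughly one third of the pairs can be verified within the 1GB limit.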
Figure 3.6 shows the feasibility of WCET analysis according to the total
number of conflicting instructions modeled on the dual-core system and the subset
size of the L2 cache lines these instructions access. As can be seen, the WCET
value can only be computed if the total number of L2 cache lines used by the
interleaving instructions does not exceed 20 and the total number of
instructions modeled is no more than 49. In most cases, the fewer instructions a
benchmark has, the fewer accesses to the L2 cache it will make and the fewer
interfering instructions we need to model. This will also result in smaller subset
sizes of L2 cache lines used by the conflicting instructions, and these kinds of
benchmarks are more likely to be solved by the model checking based approach.
Therefore, while our experiments indicate that the model checking based method
can improve the accuracy of WCET analysis, even with the simplified model we
developed, the state explosion can still be the major limiting factor to the
potential wide application of this method in worst-case multicore timing analysis.
By comparison, the static analysis approach can finish all the experiments and
thus has wider applicability. Nevertheless, while real-time applications can be
large, the segments that require hard real-time constraints are often small.
Therefore, we believe that even in its current form, the model checking based
approach can be used to analyze small hard real-time kernels to get tight analysis
results. On the other hand, other approaches, such as the static timing analysis
technique [9], may be used to analyze other parts of the code to strike a balance
between the tightness of analysis and the computation cost.
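The feasibility limits reported with Figure 3.6 can be used as a quick pre-check before launching the model checker. The sketch below is only a screening heuristic: the two thresholds are the empirical limits observed on the 1GB machine, they are necessary rather than sufficient conditions, and the function and constant names are hypothetical.

```python
MAX_L2_LINES = 20      # L2 cache lines used by the interleaving instructions
MAX_INSTRUCTIONS = 49  # total instructions modeled on both cores


def may_be_solvable(l2_lines_used, instructions_modeled):
    """Return False when a pair falls outside the limits observed for 1GB."""
    return (l2_lines_used <= MAX_L2_LINES
            and instructions_modeled <= MAX_INSTRUCTIONS)
```

A pair rejected by this check could be routed to static analysis directly, while a pair that passes may still run out of memory.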
Figure 3.6. Feasibility of WCET analysis according to the total
number of conflicting instructions modeled and subset size of L2
cache used by these instructions.
3.7 CONCLUSION
This chapter presents a model checking based approach to bounding the
worst-case performance of a multicore processor with shared L2 instruction caches.
To alleviate the state explosion problem, we propose several techniques for
reducing the memory consumption without compromising the quality of WCET
analysis. Our experimental results show that the model checking based approach is
safe and improves the tightness of WCET estimation as compared to the static
analysis approach [9]. However, due to the inherent complexity of multicore
WCET analysis, the state explosion problem, and the physical memory constraint,
this approach currently can only solve small benchmarks, while larger benchmarks
with more interfering instructions will cause out-of-memory faults. Nevertheless, it is
possible to combine the model checking based method with the static analysis to
benefit larger real-time applications.
In our future work, we would like to seamlessly integrate static analysis with
the model checking based method to attain safe and tight WCET results with
much smaller memory consumption and less computation time. Our idea is to
exploit static analysis to identify possible worst-case paths and inter-thread
interferences and then use the model checker to verify a subset of worst-case
scenarios with much reduced state space. Also, we intend to study the
applicability and scalability of this integrated approach to data caches and
set-associative caches and possibly for processors with more than two cores.
CHAPTER 4
EXPLOITING HYBRID SPM-CACHE ARCHITECTURES TO
REDUCE ENERGY CONSUMPTION FOR EMBEDDED
COMPUTING
4.1 CHAPTER OVERVIEW
Energy consumption has become the primary concern for microprocessor
design, which is particularly crucial for battery-operated embedded systems.
Cache memories have been widely used in modern processors to effectively bridge
the speed gap between the fast processor and the slow memory to achieve better
average-case performance and to reduce the energy consumption of accessing the
main memory. However, the cache performance is heavily dependent on the
history of memory accesses, as well as the cache placement and replacement
algorithms, making it hard to accurately predict the worst-case execution time
(WCET) [16]. For this reason, in many hard real-time and safety-critical systems,
designers may simply choose to not use caches.
An alternative to the cache is the Scratch-Pad Memory (SPM) [13], which
has been increasingly used in embedded processors such as ARMv6 and Motorola
MCORE. The SPMs are also on-chip memories based on SRAM, which can be used
to store instructions, data, or both. Unlike caches that are controlled by hardware,
the mapping of program and data elements into the SPM is usually performed
either by the user or the compiler. This leads to statically predictable memory
access time, which is desirable for real-time systems. Moreover, since an SPM does
not need to use tag arrays, it is generally more energy- and area-efficient than a
cache of the same size. On the other hand, since SPMs are totally controlled by
software, they are generally less adaptable to various instruction/data access
patterns that are dependent on runtime inputs. Also, because SPM allocation is
done statically, the SPMs generally cannot dynamically reuse the limited on-chip
SPM space as efficiently as the caches (see Footnote 1). Both factors may have negative
impacts on the performance and total energy consumption of pure SPMs.
The recent work [43] studies hybrid SPM-Cache architectures by combining
SPMs and caches to achieve both time predictability and high performance, which
can widely benefit a variety of real-time and non-real-time applications. A
hybrid SPM-Cache, instead of using a single cache (or SPM) with size N,
employs an SPM with size M (M < N) and a cache with size N − M in parallel.
Such a hybrid SPM-Cache architecture can be used to store either instructions or
data, which is called the Instruction Hybrid (IH) architecture or the Data Hybrid (DH)
architecture, respectively. The hybrid SPM-Cache relies on the compiler to allocate
a fraction of "profitable" instructions or data to the SPM until it is full, while the
rest of the instructions or data are stored in main memory, which can use the cache to
exploit the temporal and spatial locality for improving performance.
While the prior work in [43] has quantitatively studied both the performance
and time predictability of hybrid SPM-Caches and found their superiority over
pure caches or pure SPMs in making better tradeoffs between performance and
time predictability, the implication of hybrid SPM-Caches on
energy consumption is not clear. Compared to a pure SPM, the cache part of the hybrid
SPM-Cache may consume more energy per access. Also, compared to a pure
cache, the SPM part of the hybrid SPM-Cache may not reuse its space efficiently,
potentially leading to more time- and energy-consuming accesses to the main
memory. Therefore, the energy consumption of the hybrid SPM-Caches is not
guaranteed to be better than that of the traditional pure caches or pure SPMs. In this
chapter, we systematically study the energy consumption behaviors of 7
different hybrid SPM-Cache architectures, which is expected to provide important
insights on energy-efficient on-chip memory design for embedded processors.
Footnote 1: Dynamic SPM allocation algorithms exist; however, these algorithms generally still need to
statically determine which instruction or data objects need to be swapped in or out at runtime,
which may not perfectly match the actual runtime instruction/data access patterns.
Figure 4.1. Baseline caches or SPMs only architectures.
4.2 MOTIVATION
We first evaluate the energy consumption of two baseline architectures,
a pure cache and a pure SPM, which are shown in Figure 4.1. The first
baseline architecture employs only an instruction cache (IC) and a data cache
(DC), and thus is referred to as the IC-DC architecture in this dissertation. The
other baseline architecture contains only an instruction SPM (IS) and a data SPM
(DS), and thus is called the IS-DS architecture. The benchmarks are selected from
the MediaBench suite [42], and we use Trimaran [40] to simulate a Very Long
Instruction Word (VLIW) processor with a one-level cache or SPM. The
experiments are conducted by following the evaluation methodology and
configurations detailed in Section 4.5.
Figure 4.2 compares the on-chip memory energy consumption between the
IC-DC and the IS-DS architectures, which is normalized to that of the IS-DS
architecture. As expected, the pure SPMs are more energy-efficient than the pure
caches for all the benchmarks, because the SPMs do not have the tag arrays and
consume less energy per access. The IC-DC consumes at least 20% more on-chip
memory energy than the IS-DS, with an average of 28.9% more on-chip memory
energy dissipation.
However, on-chip memory energy is only part of the total energy. Figure 4.3
compares the total energy consumption between the IC-DC and the IS-DS
architectures, which is normalized to that of the IS-DS architecture. Unlike the
on-chip memory energy, we find that the total energy consumption results vary
across benchmarks. For cjpeg, mesatexgen, and mpeg2dec, the IC-DC
architecture actually consumes less total energy than the IS-DS architecture, while
for the rest of the benchmarks, the IS-DS is still more energy-efficient than the IC-DC.
The reason is that for cjpeg, mesatexgen, and mpeg2dec, the IC-DC
significantly reduces the total execution time compared to the IS-DS, as can be seen in
Figure 4.4. The energy saved by the performance improvement of the IC-DC is
more than the energy added by accessing the caches instead of the SPMs, thus
leading to less total energy dissipation. For the other four benchmarks, while the
IC-DC can still reduce the total execution time, the amount of reduction is less
significant. In other words, the energy saving due to reduced execution time by the
IC-DC is not large enough to compensate for the increased energy consumption
caused by accessing the cache instead of the SPM. As a result, the IS-DS consumes
less total energy than the IC-DC for these benchmarks.
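The tradeoff described above can be made concrete with a deliberately simplified toy model. Everything in the sketch below is an assumption for illustration: it treats the total energy as on-chip memory energy plus a term proportional to the execution time, and all numbers are hypothetical units rather than measurements from this work.

```python
def total_energy(accesses, energy_per_access, cycles, energy_per_cycle):
    """On-chip memory energy plus a time-proportional term for the rest
    of the system (processor and main memory)."""
    return accesses * energy_per_access + cycles * energy_per_cycle


# Hypothetical integer energy units: the cache costs more per access.
SPM_ACCESS, CACHE_ACCESS, PER_CYCLE = 10, 13, 5

spm_only = total_energy(1000, SPM_ACCESS, 5000, PER_CYCLE)
# Case A: the cache shortens execution a lot (5000 -> 4000 cycles).
cache_big_speedup = total_energy(1000, CACHE_ACCESS, 4000, PER_CYCLE)
# Case B: the cache shortens execution only slightly (5000 -> 4800 cycles).
cache_small_speedup = total_energy(1000, CACHE_ACCESS, 4800, PER_CYCLE)
```

In Case A the cache wins on total energy despite its higher per-access cost, while in Case B the SPM wins, mirroring the two groups of benchmarks discussed above.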
Therefore, in terms of the total energy consumption, neither pure caches nor
pure SPMs are always better. The total energy consumption depends on many
factors, such as the total execution time, the number of accesses to different types
of on-chip memories, and the energy efficiency of different types of on-chip
memories. In a hybrid SPM-Cache, given that a pure SPM is more energy-efficient
per access, whereas the cache is likely to conserve total energy by
reducing the total execution time, simply putting SPMs and caches together is not
guaranteed to result in more energy-efficient computing than pure SPMs or pure
caches. Therefore, it is worthwhile to quantitatively assess the energy consumption
behavior of different SPM-Caches to understand their implications on energy. The
hybrid SPM-Cache can become an attractive design option only if it can reduce
the total energy dissipation as compared to both the pure cache and the pure SPM
of the equivalent size.
Figure 4.2. The comparison of on-chip memory energy consumption
between the IC-DC and IS-DS architectures, which is
normalized to that of the IS-DS architecture.
Figure 4.3. The comparison of total energy consumption between
the IC-DC and IS-DS architectures, which is normalized
to that of the IS-DS architecture.
Figure 4.4. The comparison of the performance (i.e., the total
number of execution cycles) between the IC-DC and IS-DS
architectures, which is normalized to that of the IS-DS architecture.
Figure 4.5. Three hybrid SPM-Cache architectures.
4.3 BACKGROUND ON HYBRID SPM-CACHES
4.3.1 Instruction Hybrid and Data Hybrid Architectures
Since both caches and SPMs have their own advantages and disadvantages, it
would be desirable to combine their advantages while avoiding their respective
disadvantages. Recently we have witnessed an increasing number of studies [4, 5]
on hybrid on-chip memory architectures that place caches and SPMs together to
cooperatively improve performance and/or energy efficiency, which are termed
hybrid SPM-Caches in this dissertation. A hybrid SPM and cache model has
also been used in some prototype or commercial processors such as TRIPS [1],
ARM1136JF-S [2], and Nvidia Fermi [3]. This work is based on the hybrid
SPM-Cache architectures studied in [43], in which caches and SPMs are placed
on-chip in parallel to achieve both high performance and time predictability.
Figure 4.5 shows three such hybrid SPM-Cache architectures. The first
architecture has a hybrid SPM-Cache for storing instructions and a regular data
cache, which is named the IH-DC architecture. The second one has a regular
instruction cache and a hybrid SPM-Cache for data, which is called the IC-DH
architecture. The third one employs hybrid SPM-Caches for both instructions and
data, which is referred to as the IH-DH architecture.
In the hybrid architecture, the SPM is mapped into an address range disjoint
from the off-chip main memory, but it is connected to the same address and data
buses as the cache. We assume virtual memory system support as described in
[44]. The instructions or data stored in the SPM are mapped to adjacent physical
addresses. Therefore, an access goes to the SPM if its physical address (PA) lies
within the SPM address range, which is determined by comparing the PA with the
SPM base register.
The instructions and/or data are assigned to the SPMs by software. Thus after
SPM allocation, an instruction or data can be stored either in the SPM or in the
off-chip memory. In the latter case, the instruction or data can be accessed by the
processor through the small instruction or data cache within the hybrid
SPM-Cache architecture, which can exploit the temporal and spatial locality
dynamically for improving the average-case performance.
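The address-range test that steers an access to the SPM or to the cache can be sketched as follows. This is an illustration only: the text mentions an SPM base register, while the explicit size parameter, the function names, and the example addresses are assumptions introduced here.

```python
def is_spm_access(pa, spm_base, spm_size):
    """True if the physical address lies in [spm_base, spm_base + spm_size)."""
    return spm_base <= pa < spm_base + spm_size


def route_access(pa, spm_base, spm_size):
    """Addresses outside the SPM range go through the cache to off-chip memory."""
    return "spm" if is_spm_access(pa, spm_base, spm_size) else "cache"


# Hypothetical layout: a 4KB SPM mapped at 0x10000000.
SPM_BASE, SPM_SIZE = 0x1000_0000, 4096
target = route_access(0x1000_0040, SPM_BASE, SPM_SIZE)
```

Because the SPM range is disjoint from the off-chip address range, a single range comparison is enough and no tag lookup is needed for SPM accesses.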
There have been many studies on efficient SPM allocation algorithms to
improve either the average-case performance [45, 46, 47, 48, 49, 50] or the WCET
[51, 52, 53, 54, 55]. In this work, we implement a static SPM allocation algorithm
for both instructions and data by exploiting profiling information. More advanced
SPM allocation algorithms, including dynamic SPM allocation or optimal SPM
allocation, may be used to exploit the SPM space more efficiently; however, these
algorithms are more complex and generally are not scalable to larger benchmarks.
Also, our experiments show that even a simple heuristic frequency-based SPM
allocation can already make hybrid SPM-Caches achieve very good energy and
performance results. We will leave it as our future work to investigate other SPM
allocation strategies to further enhance the energy efficiency of SPM-Caches.

           D-Cache   D-Hybrid   D-SPM
I-Cache    IC-DC     IC-DH      IC-DS
I-Hybrid   IH-DC     IH-DH      IH-DS
I-SPM      IS-DC     IS-DH      IS-DS
Table 4.1. All the hybrid on-chip memories studied.
In our SPM allocation method, the instructions are assigned into the
instruction SPM in the unit of a basic block. All the basic blocks are sorted in the
descending order based on their weights (i.e. the number of times each basic block
is accessed). If a basic block has a larger weight and the total size of the
instructions in it is less than or equal to the remaining size of the instruction SPM,
its instructions will be assigned into the instruction SPM earlier. Similarly the
data objects are assigned into the data SPM by the compiler in the descending
order of the number of accesses, subject to the capacity of the data SPM.
Algorithm 2 describes our SPM allocation method in detail, where the memory
object is a basic block in the case of the instruction SPM and a data object in the
case of the data SPM. The algorithm ends when all the memory objects have been
checked or when there is no available space left in the SPM. After the initial sort,
the allocation loop is linear in the number of memory objects to be checked.
Algorithm 2 SPM Allocation
input: the list of the memory objects MOList and the empty SPM
output: the SPM with the memory objects assigned
begin
  Sort_By_Frequency_Descending_Order(MOList)
  MO = MOList.head
  while MO is not null do
    if SPM.avail_size > 0 then
      if MO.size <= SPM.avail_size then
        assign MO into SPM
        SPM.avail_size = SPM.avail_size - MO.size
      end if
      MO = MO.next
    else
      break
    end if
  end while
end
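Algorithm 2 can be rendered as a short runnable sketch. The Python version below follows the same greedy steps, but the class name, field names, and the example objects are hypothetical and are not taken from the Trimaran-based implementation.

```python
from dataclasses import dataclass


@dataclass
class MemoryObject:
    name: str
    size: int       # bytes occupied in the SPM
    frequency: int  # profiled access count (the weight)


def allocate_spm(objects, spm_size):
    """Greedy frequency-based allocation: visit objects from most to least
    frequently accessed and assign every object that still fits."""
    avail = spm_size
    assigned = []
    for mo in sorted(objects, key=lambda o: o.frequency, reverse=True):
        if avail <= 0:
            break  # SPM is full; no need to check the remaining objects
        if mo.size <= avail:
            assigned.append(mo.name)
            avail -= mo.size
        # An object that does not fit is skipped; smaller ones may still fit.
    return assigned, avail


# Hypothetical memory objects and a 100-byte SPM.
objs = [
    MemoryObject("A", 64, 100),
    MemoryObject("B", 128, 90),
    MemoryObject("C", 32, 50),
    MemoryObject("D", 16, 10),
]
placed, left = allocate_spm(objs, 100)
```

Note that, as in Algorithm 2, an object that is too large does not stop the scan, so a less frequently accessed but smaller object can still be placed.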
4.3.2 Additional Hybrid Architectures
In addition to the proposed hybrid SPM-Caches, there are also other types of
hybrid on-chip memory architectures, for example using a cache for instructions
and an SPM for data. Generally, depending on the use of a cache, an SPM, or a
hybrid SPM-Cache for storing either instructions or data, there are 9
different combinations in total, as shown in Table 4.1. Among these 9 different
architectures, two are homogeneous: IC-DC is the traditional cache-only
architecture, and IS-DS is the traditional SPM-only architecture, which are two
extremes. Besides the three hybrid SPM-Caches depicted in Figure 4.5, the other
four hybrid architectures include Instruction Cache and Data SPM (IC-DS),
Instruction SPM and Data Cache (IS-DC), Instruction Hybrid and Data SPM
(IH-DS), and Instruction SPM and Data Hybrid (IS-DH). The first two use a
cache or an SPM to store either instructions or data but not both. The latter two
involve the hybrid SPM-Cache, in addition to a regular SPM, to store either
instructions or data.
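The nine combinations in Table 4.1 can be generated mechanically, which also makes the split into two homogeneous and seven hybrid architectures explicit. The helper names below are hypothetical; only the architecture labels come from the table.

```python
from itertools import product

# C = cache, H = hybrid SPM-Cache, S = SPM (for instructions and for data).
KINDS = {"C": "cache", "H": "hybrid SPM-Cache", "S": "SPM"}


def architectures():
    """All instruction/data on-chip memory combinations, e.g. 'IH-DC'."""
    return [f"I{i}-D{d}" for i, d in product(KINDS, repeat=2)]


def is_homogeneous(name):
    """Only the pure-cache and pure-SPM designs count as homogeneous."""
    return name in ("IC-DC", "IS-DS")


archs = architectures()
hybrids = [a for a in archs if not is_homogeneous(a)]
```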
4.4 ENERGY MODELS
The main components in a cache include the decoder, the tag memory array,
the tag column multiplexers, the tag sense amplifiers, the tag comparators, the tag
output drivers, the data memory array, the data column multiplexers, the data
sense amplifiers, and the data output drivers, while the SPM only needs the
decoding and the column circuitry logic. Thus, the SPM is essentially more
energy-efficient than the cache of the same size.
Based on the cache components, Kamble and Ghose proposed an analytical
energy dissipation model for low-power caches in [56], which has been widely
used in cache energy estimation research. As Equation 4.1 shows, the total
energy dissipated by a cache can be expressed as the sum of four components:
bit-line dissipation, word-line dissipation, dissipation in output lines,
and dissipation in input lines. The energy model of an SPM can largely reuse
this equation but needs to remove the tag bits from the calculation of
E_{bit} and E_{word}. Also, the SPM energy estimation does not need to consider
the address output in the calculation of E_{output}, due to the direct
connection between SPMs and the processor. We have adopted Kamble and Ghose's
model [56] and calculated the energy consumption of both SPMs and caches by
using CACTI [57]. The total energy consumption, including the on-chip memory
energy, the processor energy, and the main memory energy, is computed by using
the energy consumption evaluation tool EPIC-Explorer [58].

E_{dissipation} = E_{bit} + E_{word} + E_{output} + E_{ainput}    (4.1)
4.5 EVALUATION METHODOLOGY
We use the Trimaran compiler/simulator framework [40] to implement and
evaluate all nine on-chip memory architectures on a VLIW processor.
The baseline processor has 2 integer ALUs, 2 floating-point ALUs, 1 branch
predictor, 1 load/store unit, and one level of on-chip memory. Our energy
estimation is based on EPIC-Explorer [58]. The evaluation framework is shown in
Figure 4.6. The on-chip memory energy for SPM-Caches consists of cache energy
and SPM energy. The total energy consumption includes both the processor
(including the on-chip memory) and the main memory energy consumption.

Figure 4.6. Energy evaluation framework.

Benchmark    Description                                         Code Size (bytes)  Data Size (bytes)
cjpeg        JPEG image compression                              50960              135565
djpeg        JPEG image decompression                            46060              26508
epic         An image compression program                        19608              329611
mesamipmap   OpenGL graphics clone: using mipmap quadrilateral   71240              39397
mesatexgen   OpenGL graphics clone: texture mapping              98792              45074
mpeg2dec     MPEG digital compressed format decoding             30252              389669
rasta        A program for speech recognition                    55384              132369
Table 4.2. Salient characteristics of benchmarks.

By default, we use a 16KB on-chip memory, which can be an SPM, a cache,
or a hybrid SPM-Cache. The parameters of the cache include: 32B block size,
4-way set-associative, and LRU replacement policy. We assume both a cache hit
and an SPM access take 1 cycle and a memory access takes 20 cycles. We do not
use any L2 cache in the experiments.
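To make the timing assumptions concrete, the following sketch adds up the cycles spent on memory accesses under the latencies stated above. It is an illustration only: the function name and the access counts passed to it are hypothetical, and we assume that a cache miss costs exactly one main-memory access, which the simulator may model differently.

```python
SPM_CYCLES = 1      # SPM access latency (from the text)
HIT_CYCLES = 1      # cache hit latency (from the text)
MEM_CYCLES = 20     # main-memory access latency (from the text)


def memory_cycles(spm_accesses, cache_hits, cache_misses):
    """Total cycles spent on memory accesses.

    Assumption: a cache miss costs exactly MEM_CYCLES; whether the
    simulator also charges the 1-cycle lookup on a miss is not stated.
    """
    return (spm_accesses * SPM_CYCLES
            + cache_hits * HIT_CYCLES
            + cache_misses * MEM_CYCLES)
```

Under this model, a single miss costs as much as 20 SPM accesses or cache hits, which is why reducing main-memory accesses matters for both execution time and total energy.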
The benchmarks are selected from MediaBench [42] (also referred to as media
benchmarks in this dissertation). The salient characteristics of all benchmarks
are shown in Table 4.2.

(a) the on-chip energy consumption (b) the total energy consumption
Figure 4.7. The comparison of on-chip memory and total energy
consumption among all 9 architectures, normalized to the on-chip
memory and total energy, respectively, of the IS-DS architecture.

4.6 EXPERIMENTAL RESULTS
4.6.1 Energy Results
Figure 4.7 (a) compares the on-chip memory energy consumption of all
nine architectures, normalized to that of the IS-DS. We observe that all
seven hybrid architectures consume less on-chip memory energy than
the IC-DC. Among them, the IH-DC, the IH-DH, and the IH-DS consume much
less energy than the IS-DS, while the rest of the hybrid
architectures consume on-chip memory energy either more than or close to that of
the IS-DS. This is because in a hybrid SPM-Cache such as an IH or DH, the small
SPM is more energy-efficient to access than the larger pure SPM, i.e., the IS or
DS. Similarly, the small cache in the IH or DH is more energy-efficient than the
larger pure cache, i.e., the IC or DC. Since instructions are accessed every
clock cycle, the instruction access energy dominates the total on-chip memory
energy dissipation. Consequently, the hybrid architectures that employ the IH,
namely the IH-DC, the IH-DH, and the IH-DS, all consume much less on-chip memory
energy than the IS-DS. Moreover, since the DH is more energy-efficient than
either the DS or the DC, the IH-DH is the best among these three. Also, the
IH-DS is superior to the IH-DC because the DS is more energy-efficient than the
DC. On average, the IH-DH consumes 45.7% and 74.7% less on-chip memory energy
than the IS-DS and the IC-DC, respectively.

(a) normalized execution cycles (b) normalized Energy-Delay Product (EDP)
Figure 4.8. The comparison of performance and EDP among all
9 architectures, normalized to the execution cycles and
EDP, respectively, of the IS-DS architecture.

Our next experiment compares the total energy consumption of all
nine architectures; the results, normalized to the IS-DS, are shown in Figure 4.7
(b). As we can see, on average, the IC-DC and most hybrid SPM-Caches consume
less total energy than the IS-DS architecture, though this varies across
benchmarks. This trend is quite different from that of the on-chip memory energy
consumption, where the IS-DS consumes less on-chip memory energy than the
IC-DC and a few other hybrid architectures such as the IC-DH, IC-DS, and IS-DC.
The reason is that although an SPM is more energy-efficient per access than a
cache of the equivalent size, the significant performance improvement from using
the cache or hybrid SPM-Caches can lead to a large energy reduction. The
normalized execution cycles of these nine architectures are shown in Figure 4.8
(a). As we can see, all the hybrid SPM-Caches, as well as the pure cache
architecture, achieve performance either better than or close to the IS-DS
architecture, because the SPM is totally controlled by software and cannot
dynamically reuse its space as efficiently (see footnote 2). As a result, for
those benchmarks whose performance can be significantly improved by the IC-DC or
other hybrid SPM-Caches, the total energy consumption can be reduced.
From Figure 4.7 (b), we also observe that the three hybrid architectures that
are efficient in on-chip memory energy, namely the IH-DC, the IH-DH, and the
IH-DS, all consume less total energy than the IC-DC for most benchmarks. Among
these three, the IH-DH is the most energy-efficient: it consumes 22% and 16%
less total energy than the IS-DS and the IC-DC, respectively. The IH-DC is
the second best. It consumes less total energy than the IH-DS, because the DC in
the IH-DC can reduce the execution time better than the DS in the IH-DS. We also
find that two other hybrid architectures, i.e., the IC-DS and the IS-DC, actually
consume 3.4% and 2.6% more total energy than the IC-DC. This is because
the pure SPMs used in these two architectures lead to more accesses to the main
memory and longer execution time, thus increasing the total energy consumption.

Footnote 2: Enhancing the SPM allocation algorithm, for example by exploiting
dynamic SPM allocation, may alleviate this problem, which will be studied in our
future work. However, in general, compiler-based allocation still has the
fundamental limit that it does not have perfect knowledge of runtime
instruction/data access patterns, which may vary with runtime inputs.

4.6.2 Energy-Delay Product Results
Figure 4.8 (b) compares the Energy-Delay Product (EDP) of these
architectures, normalized to the EDP of the IS-DS. Again, we find that
the IH-DH architecture has lower EDP than the IS-DS, the IC-DC, and every other
hybrid SPM-Cache. On average, the IH-DH reduces the EDP by 38.1% and 16.4%
compared to the IS-DS and the IC-DC, respectively, indicating that the
IH-DH is superior when both energy consumption and performance are considered.
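For reference, the EDP metric and its normalization can be sketched as below. The function names are ours, and the numbers in the usage example are hypothetical placeholders rather than measurements from our experiments.

```python
def edp(energy, cycles):
    """Energy-delay product: energy multiplied by execution time."""
    return energy * cycles


def normalized_edp(energy, cycles, base_energy, base_cycles):
    """EDP of one architecture relative to a baseline such as the IS-DS."""
    return edp(energy, cycles) / edp(base_energy, base_cycles)


if __name__ == "__main__":
    # Hypothetical numbers, not measurements from the dissertation.
    ratio = normalized_edp(energy=0.8, cycles=0.9,
                           base_energy=1.0, base_cycles=1.0)
    print(f"normalized EDP = {ratio:.2f}")
```

Because the EDP multiplies energy by delay, an architecture that saves energy but runs slower can still end up with a larger EDP than a faster one, which is the situation discussed for the IH-DS and the IC-DC below.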
Among other hybrid on-chip memory architectures, both the IH-DC and the
IC-DH can achieve EDP less than the IC-DC for all the benchmarks, because the
IH (DH) is more energy-efficient than the IC (DC). Between the IH-DC and the
IC-DH, the IH-DC can reduce the EDP by 4.7% more than the IC-DH, because
the IH-DC is more efficient in reducing the instruction access energy
consumption, which dominates the total energy consumption for the media
benchmarks.
The other four hybrid architectures, i.e., the IC-DS, the IH-DS, the IS-DC,
and the IS-DH, on average, still have much smaller EDP than the IS-DS, because
compared to the pure SPM, the cache or the hybrid SPM-Cache used in those
architectures can help lower the total energy consumption by reducing the
execution time. However, on average, these four architectures have larger EDP
than the IC-DC, because the pure instruction or data SPM used in these
architectures leads to longer execution time and hence more total energy
consumption than a pure instruction or data cache. Among these four
architectures, both the IH-DS and the IS-DH have lower EDP values than
the IC-DS and the IS-DC, indicating that the IH (DH) strikes a better
balance between performance and energy consumption than the IC (DC) of an
equivalent size. Additionally, while the IH-DS consumes less energy than the
IC-DC, the performance of the IH-DS is worse than that of the IC-DC, leading to
a larger EDP.
In summary, considering both energy consumption and performance, we
believe that the IH-DH, the IH-DC, and the IC-DH are the top three hybrid
on-chip memory architectures for making good tradeoffs between performance and
energy dissipation. Among them, the IH-DH is the best, and the IH-DC is better
than the IC-DH because instruction accesses, not data accesses, dominate both
the execution time and energy dissipation for these benchmarks.
4.6.3 Sensitivity Study on SPM and Cache Partitioning
In the sensitivity study, we focus on the three hybrid
architectures that achieve better EDP than the IC-DC: the IH-DC,
the IC-DH, and the IH-DH. For each of these
three hybrid SPM-Caches, we try two different partitions between the cache and
the SPM, while keeping the total hybrid SPM-Cache size fixed (i.e., 16KB by
default). Generally, for an N-byte hybrid SPM-Cache i partitioned into an
M-byte cache and an (N-M)-byte SPM, we refer to it as the i-M scheme. For
example, a 16KB IH-DC architecture with a 4KB instruction cache and a 12KB
instruction SPM is denoted as IH-DC-4K (note that the DC, i.e., the data
cache, has the default size of 16KB). As the cache simulator requires that the
cache size be a power of 2, for a total SPM-Cache size of 16KB, we can only
try a 4KB cache with a 12KB SPM, and an 8KB cache with an 8KB SPM (see
footnote 3).
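The naming scheme and the power-of-2 constraint can be illustrated with a short sketch. The function names and the min_cache_kb parameter are hypothetical; the lower bound of 4KB simply mirrors our choice not to evaluate the 2KB-cache split.

```python
def is_power_of_two(n):
    return n > 0 and n & (n - 1) == 0


def partitions(arch, total_kb, min_cache_kb=4):
    """List (scheme name, cache KB, SPM KB) for a hybrid of total_kb.

    min_cache_kb=4 is an assumption that mirrors the text's decision to
    skip the 2KB-cache/14KB-SPM split as too unbalanced.
    """
    result = []
    cache_kb = min_cache_kb
    while cache_kb < total_kb:          # leave some space for the SPM
        if is_power_of_two(cache_kb):
            result.append((f"{arch}-{cache_kb}K", cache_kb, total_kb - cache_kb))
        cache_kb *= 2
    return result


if __name__ == "__main__":
    for scheme in partitions("IH-DC", 16):
        print(scheme)
```

For a 16KB IH-DC this yields only the IH-DC-4K and IH-DC-8K schemes, the two partitions studied in this section.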
Figure 4.9 (a) shows the on-chip memory energy consumption of the IS-DS,
the IC-DC, and the IH-DC, IC-DH, and IH-DH architectures with two different
partitions, normalized to that of the IS-DS architecture. As we can see,
all three hybrid SPM-Caches with either partition consume less on-chip memory
energy than the IC-DC, and both the IH-DC and the
IH-DH consume much less on-chip memory energy than the IS-DS as well.
However, we also find that different partitions between the SPM and cache can
significantly impact the on-chip memory energy dissipation. In general, for all
three hybrid on-chip memory architectures with a 16KB total size, an even
partition, i.e., an 8KB SPM with an 8KB cache, is more energy-efficient
than an uneven partition, i.e., a 12KB SPM and a 4KB cache. For example, on
average, the IH-DC-8K consumes 21.1% less on-chip memory energy than the
IH-DC-4K; the IC-DH-8K consumes 2.4% less on-chip memory energy than the
IC-DH-4K; and the IH-DH-8K consumes 23.5% less on-chip memory energy than
the IH-DH-4K. This is because in the hybrid SPM-Cache, the number of
accesses to the SPM is much larger than the number of accesses to the cache in
both partitions, and a larger SPM consumes more energy per access than a smaller
SPM. However, we also observe that the pure cache without any SPM, i.e., the
IC-DC architecture, is not as energy-efficient as these hybrid SPM-Caches,
because a larger cache yields diminishing returns in cache miss reduction, while
significantly increasing the energy consumption per access as compared to a
smaller cache or SPM.

Footnote 3: We did not choose a 2KB cache and a 14KB SPM because the partition
is too unbalanced. While we attempted to try a 12KB cache with a 4KB SPM, our
cache simulator could not simulate a 12KB cache correctly.

Figure 4.9 (b) compares the total energy consumption of the IS-DS, the
IC-DC, and the IH-DC, IC-DH, and IH-DH architectures with two different
partitions, normalized to that of the IS-DS architecture. Similar to the
trend of the on-chip memory energy consumption, we find that an 8KB SPM with an
8KB cache is more energy-efficient than a 12KB SPM with a 4KB cache for the
IH-DC, IC-DH, and IH-DH architectures. On average, the IH-DC-8K consumes
4.2% less total energy than the IH-DC-4K; the IC-DH-8K consumes 0.5% less total
energy than the IC-DH-4K; and the IH-DH-8K consumes 4.7% less total energy than
the IH-DH-4K. Therefore, for the benchmarks we studied, an even partition of the
size between the SPM and the cache is more energy-efficient than the unbalanced
partition.
4.6.4 Sensitivity Study on SPM and Cache Sizes
We also study the impact of different SPM and cache sizes on the energy
consumption of various on-chip memory architectures. Figure 4.10 (a) compares
the on-chip memory energy consumption of the IS-DS and the IC-DC
architectures with the total size varying from 8KB to 16KB, 32KB, and 64KB,
normalized to the on-chip memory energy consumption of the 8KB IS-DS
architecture. As expected, a larger SPM or cache leads to more on-chip memory
energy dissipation, and at all four sizes, the IS-DS is more energy-efficient
than the IC-DC of the same size.

(a) the on-chip memory energy consumption (b) the total energy consumption
Figure 4.9. The comparison of on-chip memory and total energy
consumption among the IS-DS, the IC-DC, and the IH-DC, IC-DH,
and IH-DH architectures with two different SPM and cache
partitions, normalized to the on-chip memory and total
energy consumption, respectively, of the IS-DS architecture.

(a) the on-chip memory energy consumption (b) the total energy consumption
Figure 4.10. The comparison of on-chip memory and total energy
consumption of the IC-DC and IS-DS architectures with
their total size varying from 8KB to 16KB, 32KB, and 64KB,
normalized to the on-chip memory and total energy
consumption, respectively, of the 8KB IS-DS architecture.

However, the IC-DC can potentially reduce the total execution time, which
may reduce the total energy consumption. As depicted in Figure 4.10 (b), the
IC-DC consumes less total energy than the IS-DS when the total size is 8KB or
16KB. However, as the total size increases to 32KB and 64KB, the IS-DS actually
consumes less total energy. This is because increasing the size of the IC-DC
leads to diminishing performance improvement, while consuming much more energy
for cache accesses. Overall, we find that the 8KB IC-DC is the most
energy-efficient among the IS-DS and IC-DC architectures of the sizes studied.
Figure 4.11 (a) shows the on-chip memory energy consumption of the IH-DC
architectures with the total size varying from 8KB to 16KB, 32KB, and 64KB,
which is normalized to that of the 8KB IS-DS architecture. The 8KB IH-DC
consumes less on-chip memory energy than the 8KB IS-DS. As the size of the
IH-DC increases, it consumes more on-chip memory energy.
The total energy consumption of the IH-DC with different sizes is shown in
Figure 4.11 (b). As the total size increases from 8KB to 16KB, the total energy
consumption decreases for most benchmarks due to the reduction of the total
execution time. However, when the size further increases to 32KB and 64KB, the
total energy consumption becomes larger, because of the increased energy
consumption per access to larger caches and SPMs and the diminishing return of
performance improvement. On average, the 16KB IH-DC consumes 14%, 9.1%,
and 30% less total energy than the 8KB, 32KB, and 64KB IH-DCs, respectively.

(a) the on-chip memory energy consumption (b) the total energy consumption
Figure 4.11. The comparison of on-chip memory and total energy
consumption of the IH-DC architectures with their total
size varying from 8KB to 16KB, 32KB, and 64KB, normalized
to the on-chip memory and total energy consumption,
respectively, of the 8KB IS-DS architecture.

The energy efficiency of the IC-DH is also heavily dependent on its size. As
we can see from Figure 4.12 (a), the IC-DH consumes more on-chip energy as the
size increases. The total energy consumption of the IC-DH also becomes larger as
the size increases for most benchmarks, as can be seen from Figure 4.12 (b). On
average, the 8KB IC-DH consumes 26.9%, 6.3%, 23.1%, and 59.3% less total
energy than the 8KB IS-DS and the 16KB, 32KB, and 64KB IC-DHs, respectively.
Figure 4.13 (a) shows the on-chip memory energy consumption of the IH-DH
architectures with different sizes, normalized to that of the 8KB IS-DS.
As we can see, compared to the 8KB IS-DS, the 8KB IH-DH consumes much less
on-chip memory energy. However, as the size of the IH-DH increases, the on-chip
memory energy also increases. This is because a larger SPM and a larger cache in
the IH-DH consume more energy per access, both of which lead to more on-chip
energy dissipation.

(a) the on-chip memory energy consumption (b) the total energy consumption
Figure 4.12. The comparison of on-chip memory and total energy
consumption of the IC-DH architectures with their total
size varying from 8KB to 16KB, 32KB, and 64KB, normalized
to the on-chip memory and total energy consumption,
respectively, of the 8KB IS-DS architecture.

(a) the on-chip memory energy consumption (b) the total energy consumption
Figure 4.13. The comparison of on-chip memory and total energy
consumption of the IH-DH architectures with their total
size varying from 8KB to 16KB, 32KB, and 64KB, normalized
to the on-chip memory and total energy consumption,
respectively, of the 8KB IS-DS architecture.

Figure 4.13 (b) shows the total energy consumption of the IH-DH
architectures with different sizes. Although a larger IH-DH generally results in
more on-chip memory energy, the total energy consumption may not increase
linearly with the size, because a larger IH-DH can lead to better performance.
On average, we find that the 8KB IH-DH consumes the least total energy, which is
1.3%, 8.9%, and 26% less than the total energy consumption of the 16KB, 32KB,
and 64KB IH-DHs, respectively. The reason is that for media benchmarks, the
8KB IH-DH can already achieve very good performance by reducing the number of
accesses to main memory (i.e., cache misses), and further increasing the size
leads to diminishing performance returns but much larger energy consumption per
access to the on-chip SPM-Caches. Therefore, for embedded systems, it is
important to profile the applications to find the best configuration for hybrid
SPM-Caches to minimize the total energy consumption.
4.7 RELATED WORK
Previous studies on SPM mainly treated it as an alternative to the cache
memory for achieving time predictability or improving performance and energy
efficiency. Steinke et al. [50] developed a compiler-based method to assign
program and data objects into the SPM to reduce the dynamic energy consumption.
Kandemir et al. [59] and Chen et al. [60] studied compiler-based approaches to
reduce leakage energy of SPMs. Several SPM allocation algorithms have also been
proposed to improve the average-case performance [46, 47, 48, 49] or WCET
[51, 52, 53, 54, 55]. However, all these research efforts generally focused on
pure SPMs, not on hybrid SPM-Caches.
There are only a few research efforts to combine the SPM with the cache. In
particular, Wang et al. [61] proposed a method to remap portions of data
segments into SPM space to reduce cache conflict misses, and they also
introduced an SPM controller with a tightly coupled DMA to minimize the swapping
overhead of dynamic SPM allocation. This work, however, is limited to the IC-DH
architecture, whereas in this chapter, we have explored 7 different hybrid
SPM-Cache architectures and comparatively evaluated their energy consumption
behaviors. In particular, our work indicates that using the hybrid SPM-Cache for
instructions rather than data is more effective at reducing the total energy
consumption and EDP for the MediaBench benchmarks we studied.
Xue et al. [62] proposed a hybrid SPM consisting of Non-Volatile Memory
(NVM) and SRAM to achieve energy efficiency. In contrast, our study is focused
on hybrid architectures that combine an SPM with a cache based on the same
SRAM technology.
Panda et al. [45] investigated partitioning scalar and array variables into
SPM and data cache to minimize the execution time for embedded applications.
Verma et al. [63] studied an SPM allocation technique based on instruction cache
behavior to reduce the energy consumption. Recently, Cong et al. [5] proposed an
adaptive hybrid cache by reconfiguring a part of the cache as software-managed
SPM to improve both performance and energy efficiency. Kang et al. [4]
introduced a synergetic memory allocation method to exploit SPM to reduce data
cache pollution.
Compared to all these studies, which basically use SPMs to boost the
performance and/or energy efficiency of caches, the hybrid SPM-Cache
architectures proposed in this dissertation treat both SPMs and caches equally,
though for different objectives. More specifically, the hybrid architectures in
our study rely on the SPM to ensure a basic level of time predictability [43],
while using caches to improve the average-case performance or energy efficiency
by exploiting the access locality of instructions and data that are not stored
in the SPM. In this work, we do not change the SPM allocation used in [43], in
order to preserve the time predictability that can be achieved by the SPM.
Moreover, prior efforts only study a limited hybrid model that combines a cache
and an SPM for either instructions or data only. In contrast, in this work, we
have systematically and comparatively evaluated all seven hybrid on-chip memory
architectures to understand their implications for energy consumption.
4.8 CONCLUSIONS
While cache memories are usually effective at improving the average-case
performance, they are harmful to time predictability. In contrast, SPMs are
time-predictable and more energy-efficient per access, but generally are less
adaptive to runtime instruction/data access patterns and may result in inferior
performance. Building upon the prior work in [43], which studied the performance
and time predictability of hybrid SPM-Cache architectures, this chapter
investigates the energy consumption of seven different SPM-Caches. We find that
all seven hybrid on-chip memory architectures consume less energy than the pure
SPM-based architecture. Three hybrid SPM-Cache architectures, namely the IH-DC,
the IH-DH, and the IH-DS, consume less total energy than the
IC-DC. Considering both energy consumption and performance, the IC-DH,
IH-DC, and IH-DH achieve a lower energy-delay product than both the pure
cache-based and SPM-based architectures.
Among all the hybrid on-chip memory architectures, our evaluation indicates
that the IH-DH architecture is the best in terms of both total energy
consumption and EDP. More specifically, on average, the IH-DH architecture
reduces the total energy consumption by 22% and 16%, and reduces the EDP by
38.1% and 16.4%, compared to the IS-DS and the IC-DC, respectively. Therefore,
in addition to reconciling performance and time predictability as revealed in
[43], our study demonstrates that the hybrid on-chip memory architectures, in
particular the IH-DH, can also make better tradeoffs between performance and
energy consumption, making them a very attractive design option for real-time
and embedded systems.
CHAPTER 5
REDUCING WORST-CASE EXECUTION TIME OF HYBRID
SPM-CACHES
5.1 CHAPTER OVERVIEW
Cache memories have been widely used in modern processors to effectively
bridge the speed gap between the fast processor and the slow memory to achieve
good average-case performance. However, cache performance is heavily dependent
on the history of memory accesses and the cache placement and replacement
algorithms, making it hard to accurately predict the worst-case execution time
(WCET). In contrast, Scratch-Pad Memories (SPMs) are time-predictable because
the allocation is controlled by software and the latency to access data from the
SPM is fixed. However, SPMs generally are not adaptive to runtime instruction
and data access patterns, and thus may lead to inferior average-case
performance.
A cache memory or an SPM alone can only benefit performance or time
predictability respectively, not both. For real-time systems, it is attractive
to enhance both performance and time predictability. Recently we have witnessed
an increasing number of studies [4, 5] on hybrid on-chip memory architectures
that place caches and SPMs together to cooperatively improve performance and/or
energy efficiency, which are termed hybrid SPM-Caches in this chapter. A hybrid
cache and SPM model has also been used in some prototype or commercial
processors such as TRIPS [1], ARM1136JF-S [2], and Nvidia Fermi [3]. Recent
studies show that a hybrid SPM-Cache can greatly improve the performance [4] or
energy efficiency [5], both of which are potentially beneficial to embedded
systems, including hard real-time systems.
However, for hard real-time systems to safely and reliably exploit hybrid
SPM-Cache architectures, it is crucial to be able to predict the worst-case
execution time for real-time tasks running on processors with hybrid SPM-Caches.
A safely and accurately estimated WCET value provides the basis for
schedulability analysis. Moreover, it is attractive to optimize (i.e., reduce)
the WCET for those systems. The reduced WCET of a task gives more flexibility to
the real-time scheduler and hence may enable the system to meet stringent timing
constraints that would otherwise be impossible to meet. In addition, reducing
the WCET of a task may allow the embedded processor to use a lower clock rate or
to place itself into a low-power mode during idle periods (while meeting the
deadlines) to save energy. However, prior studies on hybrid SPM-Caches
[4, 5, 45, 63] mainly focus on improving performance and/or energy efficiency,
and the impacts of these techniques on predicting and reducing WCET are
uncertain.
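As a simple illustration of why a smaller WCET permits a lower clock rate, the sketch below computes the lowest frequency at which a task still meets its deadline. The function name is hypothetical, and the sketch assumes that the WCET in cycles does not itself change with the clock frequency, which is a simplification when the main-memory latency is fixed in absolute time.

```python
def min_clock_hz(wcet_cycles, deadline_s):
    """Lowest clock frequency at which wcet_cycles still fit in the deadline.

    Assumption: the cycle count is independent of the clock frequency.
    """
    if deadline_s <= 0:
        raise ValueError("deadline must be positive")
    return wcet_cycles / deadline_s


if __name__ == "__main__":
    # Hypothetical task: halving the WCET halves the required frequency.
    print(min_clock_hz(2_000_000, 0.5))
    print(min_clock_hz(1_000_000, 0.5))
```

Under this assumption the required frequency scales linearly with the WCET, so any reduction in WCET translates directly into headroom for frequency scaling or idle-time power savings.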
To benefit hard real-time systems, this chapter explores four different SPM
allocation algorithms to reduce WCET for the hybrid SPM-Cache architecture.
Compared to existing allocation algorithms for pure SPMs
[45, 46, 47, 48, 49, 50, 51, 52, 54, 55], in a hybrid SPM-Cache architecture,
the existence of the parallel cache requires an intelligent SPM allocation
algorithm to consider the cache effects, particularly their impact on WCET, so
that both the SPM and the cache can effectively and cooperatively reduce the
WCET for programs running on the hybrid SPM-Cache architecture. To this end, we
have developed a WCET-oriented and cache-aware SPM allocation algorithm, called
the Enhanced Hybrid SPM-Cache Allocation (EHSA) algorithm. Our experimental
results indicate that the EHSA algorithm outperforms the other three algorithms
in reducing WCET for the hybrid SPM-Cache, with little or even positive impact
on the average-case performance.
The main contributions of this chapter include the following.
• We propose a WCET-oriented and cache-aware SPM allocation algorithm for
  the hybrid SPM-Cache architecture. To the best of our knowledge, this is the
  first work to study how to exploit both cache and SPM to cooperatively
  minimize WCET for hard real-time systems.
• We have also explored three other SPM allocation algorithms for the hybrid
  SPM-Cache with different complexity and effectiveness, including the
  Frequency-based SPM Allocation (FSA), the Hybrid SPM-Cache Allocation
  (HSA), and the Longest Path based Allocation (LPA).
• We have implemented and evaluated all four SPM allocation algorithms.
  Our experiments indicate that the EHSA algorithm outperforms the other
  allocation algorithms in reducing WCET. Interestingly, we also observe that
  the EHSA algorithm can even achieve better average-case execution time
  (ACET) than the other SPM allocation algorithms for many benchmarks with
  various SPM and cache configurations.
The rest of this chapter is organized as follows. Section 5.2 introduces the
hybrid SPM-Cache architecture. Section 5.3 presents different SPM allocation
algorithms designed for the hybrid SPM-Cache architecture. The evaluation
methodology is given in Section 5.4, and the experimental results are shown in
Section 5.5. We discuss related work in Section 5.6, and conclude in
Section 5.7.
5.2 BACKGROUND ON HYBRID SPM-CACHES
The SPM is a small, high-speed on-chip SRAM memory that is mapped into
an address space disjoint from the off-chip memory, as shown in Figure 5.1 (a).
Both the SPM and the cache can be accessed in a single processor cycle, which is
much faster than an access to the off-chip memory. However, while the SPM can
guarantee a single-cycle access time, the latency of a cache access depends on
whether the access is a hit or a miss. Also, caches are usually controlled by
hardware, whereas SPMs are managed by software.
A hybrid SPM-Cache is an on-chip memory architecture that places a cache
and an SPM in parallel to store instructions and/or data. A hybrid SPM-Cache
for storing instructions is depicted in Figure 5.1 (b). When the CPU accesses an
instruction, the interface circuitry of the SPM determines whether or not the
referenced memory address maps into the address space of the SPM. If that is the
case, it issues the signal S_HIT and takes control of the address and data
buses. If not, this instruction will be accessed through the regular memory
hierarchy. Specifically, if the instruction hits in the cache, then the signal
C_HIT is generated and this instruction is directly passed to the processor;
otherwise, this instruction needs to be fetched from the main memory.
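The lookup order described above can be sketched as a toy model. This is not the simulator used in our experiments: the class name, the "MISS" label, the 32-byte block size, and the 20-cycle miss penalty (borrowed from the setup in Chapter 4) are assumptions, and the cache is modeled as an unbounded set of blocks with no replacement policy.

```python
MEM_LATENCY = 20   # assumed miss penalty in cycles (value used in Chapter 4)


class HybridSpmCache:
    """Toy model of the lookup order only; the cache is a plain set of
    block addresses with no capacity limit or replacement policy."""

    def __init__(self, spm_base, spm_size, block_size=32):
        self.spm_base = spm_base
        self.spm_size = spm_size
        self.block_size = block_size
        self.cached_blocks = set()

    def fetch(self, addr):
        """Return (signal, latency in cycles) for one instruction fetch."""
        if self.spm_base <= addr < self.spm_base + self.spm_size:
            return "S_HIT", 1                    # address maps into the SPM
        block = addr // self.block_size
        if block in self.cached_blocks:
            return "C_HIT", 1                    # served by the cache
        self.cached_blocks.add(block)            # fill the block on a miss
        return "MISS", MEM_LATENCY               # fetched from main memory


if __name__ == "__main__":
    mem = HybridSpmCache(spm_base=0, spm_size=1024)
    for address in (16, 4096, 4100):
        print(address, mem.fetch(address))
```

The key property the sketch captures is that an address inside the SPM range never consults the cache, so its latency is fixed regardless of the access history.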
In a hybrid SPM-Cache architecture, a certain fraction of instructions and/or
data can be loaded into the SPM by software, subject to the available SPM space.
After SPM allocation, instructions and/or data that are not stored in the SPM
can be accessed through the traditional memory hierarchy (i.e., different levels
of caches and main memory), which can exploit the temporal and spatial locality
dynamically to improve the average-case performance. The existence of a small
SPM in the SPM-Cache architecture can reduce the number of conflict misses
relative to a pure cache. Also, accesses to the instructions/data stored in the
SPM can be more energy-efficient than accesses to the cache. Therefore, compared
to a pure cache-based architecture, the hybrid SPM-Cache architecture can
potentially combine the advantages of both caches and SPMs to achieve better
performance, energy efficiency, and/or time predictability to benefit a wide
range of applications.
A hybrid SPM-Cache also has significant advantages over a pure SPM. Since
an SPM is controlled by software, the instructions/data stored in the SPM may
not match the actual instructions/data accessed at runtime, especially if the
program has paths that are dependent on runtime inputs. In contrast, a hybrid
SPM-Cache can leverage the cache part to dynamically reuse its space for the
instructions/data that are not stored in the SPM. While there are schemes to
load instructions/data dynamically into the SPM at runtime [64], the decisions
are still made statically (i.e., at compilation time), which is unlikely to
match the performance of the hardware-controlled cache in terms of exploiting
runtime instruction/data locality. In fact, the recent work in [43] has shown
that using the hybrid SPM-Caches for both instructions and data can lead to
better time predictability than the pure cache, and better performance than the
pure SPM. Therefore, the hybrid SPM-Cache architecture is an attractive on-chip
memory design option for embedded processors.
However, it is still a new and challenging problem to exploit both the cache
and the SPM collaboratively in the hybrid architecture to minimize WCET to
benefit hard real-time systems or a mix of hard, soft, and non-real-time tasks. As
the first step toward exploiting hybrid SPM-Caches deterministically for hard
real-time systems, our study focuses on an instruction SPM-Cache as
depicted in Figure 5.1(b), in which both the SPM and the cache are used for
storing instructions only. We plan to explore hybrid SPM-Caches for data accesses
in our future work.
5.3 SPM ALLOCATION FOR HYBRID SPM-CACHES
The SPM allocation algorithms can be either ACET-oriented or WCET-oriented. The
ACET-oriented algorithms are not aware of the worst-case path (WC-path), thus
their effectiveness in reducing the WCET is not ensured. In contrast, the
WCET-oriented SPM allocation algorithms specifically target the worst-case path,
which can reduce the WCET more effectively, though this may have a negative impact
on ACET. Prior works on WCET-oriented SPM allocation [51, 52, 54, 55],
however, are designed for pure SPMs. Since these algorithms are not aware of the
parallel cache in the hybrid SPM-Cache architecture, they may lead to suboptimal
results. In this chapter, we systematically explore four different SPM allocation
algorithms as shown in Table 5.1. These algorithms are classified based on two
criteria: 1) whether or not the algorithm is aware of the cache, and 2) whether or
not the algorithm is aware of the WCET. For all four SPM allocation
algorithms, we propose to allocate SPM space at the Basic Block (BB) granularity.
Our compiler will generate bookkeeping instructions after SPM allocation to
ensure the correct transfer of control between instructions stored in the SPM and
the main memory [53].
Figure 5.1. The hybrid SPM-Cache system architecture. (a) Division of address
space between SRAM (i.e., SPM) and DRAM (i.e., main memory). (b) Hybrid
SPM-Cache architecture for instructions, where I-Cache denotes the instruction
cache, and I-SPM denotes the instruction SPM.
                cache-unaware   cache-aware
WCET-unaware    FSA             HSA
WCET-aware      LPA             EHSA
Table 5.1. Four SPM allocation algorithms studied in this chapter.
As we can see in Table 5.1, the Frequency-based SPM Allocation (or FSA) is
a traditional SPM allocation algorithm that is aware of neither the cache nor
WCET. The hybrid SPM-Cache Allocation (or HSA) is aware of the cache, but is
not WCET-oriented. By comparison, the Longest-Path based Allocation (or LPA)
is WCET-oriented, but is not cache-aware. The Enhanced HSA (or EHSA) is both
WCET-oriented and cache-aware. The details of these algorithms are presented in
the following subsections respectively.
5.3.1 Frequency-based SPM Allocation
Since our study focuses on the hybrid SPM-Cache for instructions,
we propose to allocate SPM space at the Basic Block (BB) granularity to avoid
generating too many bookkeeping instructions, which may degrade instruction
locality and performance.
The FSA algorithm allocates SPM space based on the access frequency of
each basic block, extracted from the profiling of simulated traces. Specifically, the
heuristic is that the basic blocks are stored into the SPM in decreasing
order of their frequencies, until the SPM is full or there is not enough space left to
hold another basic block. This straightforward algorithm is described in
Algorithm 3. The FSA algorithm only needs to check each basic block once and its
complexity is dominated by sorting all the basic blocks based on their access
frequencies. Therefore the time complexity of FSA is O(N log(N)), where N is the
number of basic blocks in the given program.
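The greedy heuristic described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in our framework: the `BasicBlock` record, the function name `fsa_allocate`, and the sample profile are all hypothetical, and the profiling step is assumed to have already produced the frequencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicBlock:
    name: str
    size: int       # code size in bytes
    frequency: int  # execution count from profiling (assumed given)


def fsa_allocate(blocks, spm_size):
    """Greedy frequency-based allocation; returns names of blocks placed in the SPM."""
    allocations = []
    free = spm_size
    # Sorting dominates the cost, giving O(N log N) overall.
    for block in sorted(blocks, key=lambda b: b.frequency, reverse=True):
        if free == 0:           # SPM is full
            break
        if block.size <= free:  # blocks that do not fit are skipped
            allocations.append(block.name)
            free -= block.size
    return allocations


# Hypothetical profile: four blocks competing for a 64-byte SPM.
blocks = [
    BasicBlock("B0", 16, 1),
    BasicBlock("B1", 24, 100),
    BasicBlock("B2", 48, 90),
    BasicBlock("B3", 32, 50),
]
print(fsa_allocate(blocks, 64))  # ['B1', 'B3']
```

In this example B2 is hotter than B3 but no longer fits once B1 has been placed, so the next block that fits is taken instead.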
Algorithm 3 Frequency-based SPM Allocation Algorithm
1: begin
2: allocations ← empty;
3: run simulation and get the frequency of each basic block;
4: sort basic blocks by frequency from the highest to lowest in f_list;
5: while isNotFull(SPM) && isNotEmpty(f_list) do
6:   get the first basic block B from f_list;
7:   if sizeof(B) <= sizeof(available SPM space) then
8:     allocate B into SPM;
9:   end if
10:   remove B from f_list;
11: end while
12: return allocations
13: end
5.3.2 Longest-Path based Allocation
The longest-path based allocation algorithm is a greedy approach that targets
SPM allocation on the worst-case path. We adopt the approach used in [65]. The
algorithm first constructs weighted Directed Acyclic Graphs (DAGs) from the
Control Flow Graph (CFG) of the program, based on which it identifies the
longest path of the DAG. The algorithm then sorts all the basic blocks on the
longest path according to their execution frequencies from the highest to the
lowest, which are obtained through profiling. The basic block on the longest path
with a larger frequency will be allocated into the SPM first. Algorithm 4 describes
the LPA allocation algorithm. The time complexity of this algorithm is
O(N + E) + O(N log N), where E is the number of edges in the CFG and
O(N + E) is the complexity to find the longest path.
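The two phases of the LPA (finding the longest path of the DAG, then allocating blocks on that path by frequency) can be sketched as below. This is an illustrative sketch under stated assumptions: the DAG is given as a successor map with one worst-case cost per basic block, and the names `longest_path`, `lpa_allocate`, and the diamond-shaped example are hypothetical rather than taken from [65].

```python
from graphlib import TopologicalSorter


def longest_path(succs, cost):
    """Return the most expensive path in a DAG (node -> successors, node -> cost)."""
    preds = {n: set() for n in succs}
    for node, outs in succs.items():
        for s in outs:
            preds.setdefault(s, set()).add(node)
    order = list(TopologicalSorter(preds).static_order())  # O(N + E)
    best = {n: cost[n] for n in order}   # best[n]: cost of heaviest path ending at n
    prev = {n: None for n in order}
    for node in order:
        for s in succs.get(node, ()):
            if best[node] + cost[s] > best[s]:
                best[s] = best[node] + cost[s]
                prev[s] = node
    node = max(order, key=best.get)
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


def lpa_allocate(succs, cost, size, freq, spm_size):
    """Allocate blocks on the longest path, hottest first, until the SPM is full."""
    allocations, free = [], spm_size
    for b in sorted(longest_path(succs, cost), key=lambda n: freq[n], reverse=True):
        if size[b] <= free:
            allocations.append(b)
            free -= size[b]
    return allocations


# Hypothetical if-then-else CFG; "else" is hot but not on the longest path.
succs = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
cost = {"entry": 10, "then": 50, "else": 20, "exit": 5}
size = {"entry": 16, "then": 40, "else": 16, "exit": 16}
freq = {"entry": 100, "then": 2, "else": 98, "exit": 100}
print(longest_path(succs, cost))                    # ['entry', 'then', 'exit']
print(lpa_allocate(succs, cost, size, freq, 64))    # ['entry', 'exit']
```

Note that the frequently executed block "else" is never considered because it lies off the longest path, which is exactly where the LPA and the FSA can diverge on multiple-path programs.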
Algorithm 4 Longest-Path based Allocation Algorithm
1: begin
2: allocations ← empty;
3: calculate the longest path P;
4: sort BBs on P by frequency in decreasing order in lf_list;
5: while isNotFull(SPM) && isNotEmpty(lf_list) do
6:   get the first basic block B from lf_list;
7:   if sizeof(B) <= sizeof(available SPM space) then
8:     allocate B into SPM;
9:   end if
10:   remove B from lf_list;
11: end while
12: return allocations
13: end
5.3.3 Hybrid SPM-Cache Allocation
Neither the FSA nor the LPA considers the cache that is in parallel with the
SPM in the hybrid SPM-Cache. In contrast, the hybrid SPM-Cache allocation
algorithm is designed to take into account both the SPM and the cache to
cooperatively reduce WCET. We use Abstract Interpretation (AI) to do the cache
analysis [10], which can classify memory references into three classes: Always
Hit (AH), Always Miss (AM), and Not Classified (NC). The basic idea of the HSA
algorithm is to only allocate the basic blocks with more AM and/or NC instructions
into the SPM, while leaving basic blocks with more AH instructions in the cache.
This is because for basic blocks with more AH instructions, their worst-case
performance in the cache is already guaranteed to be very good. Therefore, the
SPM space can be saved for other basic blocks to improve the WCET or
performance more efficiently.
The HSA algorithm begins by doing static cache analysis, based on which all
the basic blocks can be classified and some of them can be put into two lists: the
AM list and the NC list. The AM list stores the basic blocks with Always Miss
instructions, and those basic blocks are sorted in descending order of their
numbers of AM instructions. For the basic blocks with the same number of AM
instructions, the HSA algorithm sorts them in ascending order of their
numbers of Always Hit instructions. If two basic blocks have the same number
of AM and AH instructions, they are then sorted in descending order of their
numbers of NC instructions (if these are also equal, it does not matter which basic
block is allocated first). The NC list is used to store basic blocks with NC instructions but
no AM instructions. For the basic blocks with the same number of NC
instructions, the HSA algorithm sorts them in ascending order of their
numbers of AH instructions.
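The ordering rules above map naturally onto composite sort keys. The sketch below is illustrative only: the `BlockClass` record and the per-block counts are hypothetical, and the primary ordering of the NC list by descending NC count is an assumption made by analogy with the AM list, since the text only states the tie-breaking rule.

```python
from typing import NamedTuple


class BlockClass(NamedTuple):
    name: str
    am: int  # number of Always Miss instructions
    ah: int  # number of Always Hit instructions
    nc: int  # number of Not Classified instructions


def build_lists(blocks):
    """Split classified blocks into the AM list and the NC list, sorted as in HSA."""
    # AM list: more AM first, then fewer AH, then more NC.
    am_list = sorted((b for b in blocks if b.am > 0),
                     key=lambda b: (-b.am, b.ah, -b.nc))
    # NC list: blocks with NC but no AM; more NC first (assumed), then fewer AH.
    nc_list = sorted((b for b in blocks if b.am == 0 and b.nc > 0),
                     key=lambda b: (-b.nc, b.ah))
    return am_list, nc_list


# Hypothetical classification results from the cache analysis.
blocks = [
    BlockClass("A", am=3, ah=1, nc=0),
    BlockClass("B", am=3, ah=0, nc=2),
    BlockClass("C", am=5, ah=4, nc=0),
    BlockClass("D", am=0, ah=2, nc=4),
    BlockClass("E", am=0, ah=1, nc=4),
    BlockClass("F", am=0, ah=6, nc=0),  # only AH: left to the cache
    BlockClass("G", am=3, ah=0, nc=5),
]
am_list, nc_list = build_lists(blocks)
print([b.name for b in am_list])  # ['C', 'G', 'B', 'A']
print([b.name for b in nc_list])  # ['E', 'D']
```

Block F contains only Always Hit instructions, so it appears in neither list and stays in the cache.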
Algorithm 5 gives the pseudo-code of the HSA allocation algorithm. After
getting the AM list and NC list (lines 5-6), the algorithm selects and allocates the
first available basic block from the AM list if its block size does not exceed the
available SPM size (lines 8-13). In either case, the algorithm then removes the
block from the AM list and marks it (lines 14-15), so a block that is too large to
be stored into the SPM is bypassed.
After each SPM allocation, the instruction reference classification may change.
For example, if a conflicting instruction is stored into the SPM, an AM instruction
in the cache may become AH. Thus, after allocating SPM space to each basic
block, the algorithm re-invokes the cache timing analysis to categorize the
instruction accesses and then allocates the remaining SPM space by bypassing the
marked blocks (lines 8-16). Note that by setting b_allocate to TRUE (at line
12), the innermost while loop will exit, so the cache analysis and sorting logic in
the outermost while loop will be used again. If there is no basic block with AM
instructions left, the algorithm then checks the NC list and allocates SPM space to
blocks on that list in the same way as it does for the AM list (lines 17-25). The
algorithm stops if the SPM is full or if there is no unmarked basic block. In
Algorithm 5, lines 26-28 are used to avoid endless loops.
The time complexity of HSA is heavily dependent on the complexity of static
cache timing analysis (see line 5). The cache analysis based on abstract
interpretation has been shown to run efficiently [10]; let $T_{cache}$ denote its complexity.
In each allocation iteration, the HSA needs to conduct the cache analysis and then
sort the basic blocks by either the number of AM instructions or the number of
NC instructions. Therefore, the time complexity of HSA becomes
$O(N_s \times (T_{cache} + N \log N))$, where $N_s$ is the size of the SPM (see footnote 1).
1 It should be noted that the HSA algorithm is generally independent of any specific cache
timing analysis approach such as the abstract interpretation based analysis [10]. Thus enhancing
the efficiency of cache timing analysis can also improve the efficiency of the HSA algorithm.
Algorithm 5 Hybrid SPM-Cache Allocation Algorithm
1: begin
2: allocations ← empty;
3: availableBB ← total number of basic blocks;
4: while isNotFull(SPM) && availableBB > 0 do
5:   do cache analysis;
6:   sort basic blocks into AM list and NC list;
7:   b_allocate ← FALSE;
8:   while isNotEmpty(AM list) && b_allocate == FALSE do
9:     get the first available basic block Bm from AM list;
10:     if sizeof(Bm) <= sizeof(available SPM space) then
11:       allocate Bm into SPM;
12:       b_allocate ← TRUE;
13:     end if
14:     remove Bm from the AM list and mark it;
15:     availableBB--;
16:   end while
17:   while isNotEmpty(NC list) && b_allocate == FALSE do
18:     get the first available basic block Bn from NC list;
19:     if sizeof(Bn) <= sizeof(available SPM space) then
20:       allocate Bn into SPM;
21:       b_allocate ← TRUE;
22:     end if
23:     remove Bn from the NC list and mark it;
24:     availableBB--;
25:   end while
26:   if b_allocate == FALSE then
27:     use FSA to allocate one BB from the basic blocks left;
28:   end if
29: end while
30: return allocations
31: end
5.3.4 Enhanced Hybrid SPM-Cache Allocation
While the HSA algorithm is aware of the cache, it is not aware of the
WC-path. Therefore, some AM instructions stored in the SPM
may have no impact on the WCET at all. To further decrease the WCET
specifically, we propose to enhance the HSA by considering the worst-case path
during SPM allocation; the resulting algorithm is called EHSA. In this algorithm,
after finding each candidate basic block to be allocated into the SPM, it performs
WCET analysis to check whether this block is on the worst-case path. If the attempted
SPM allocation leads to a WCET reduction, then this allocation is confirmed.
Otherwise, the allocation is undone (because this block is not on the WC-path) and
the algorithm continues to find the next available allocation. Algorithm 6
describes the EHSA allocation algorithm in detail. Compared to the pseudo-code
of the HSA depicted in Algorithm 5, the EHSA algorithm inserts logic to conduct
WCET analysis (line 12 and line 28 respectively) and to check the potential
impact of SPM allocation on WCET (line 13 and line 29 respectively). The rest of
the logic is similar to the HSA algorithm.
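The confirm-or-undo step that distinguishes the EHSA from the HSA can be illustrated with a small sketch. Everything here is a simplifying assumption: the real WCET analysis is ILP-based, whereas `toy_wcet` simply takes the maximum cost over two explicitly enumerated paths, and the per-block costs, sizes, and candidate order are hypothetical.

```python
# Hypothetical two-path program and per-block worst-case costs (in cycles).
PATHS = [("entry", "then", "exit"), ("entry", "else", "exit")]
CACHE_COST = {"entry": 10, "then": 50, "else": 20, "exit": 5}
SPM_COST = {"entry": 4, "then": 30, "else": 5, "exit": 2}


def toy_wcet(allocated):
    """Stand-in for the WCET analyzer: the most expensive enumerated path."""
    return max(
        sum(SPM_COST[b] if b in allocated else CACHE_COST[b] for b in path)
        for path in PATHS
    )


def try_allocate(candidates, sizes, spm_free, allocated, wcet_of):
    """Return the first candidate that fits and lowers the WCET, else None."""
    current = wcet_of(allocated)
    for block in candidates:
        if sizes[block] > spm_free:
            continue  # too large: bypass this block
        if wcet_of(allocated | {block}) < current:
            return block  # allocation confirmed
        # otherwise the trial allocation is undone and the next block is tried
    return None


sizes = {"entry": 16, "then": 16, "else": 16, "exit": 16}
# Candidate order as it might come from the AM list (assumed).
choice = try_allocate(["else", "then", "entry"], sizes, 16, set(), toy_wcet)
print(choice)  # then
```

Placing "else" in the SPM leaves the WCET at 65 cycles because the worst-case path runs through "then", so that trial is rejected; placing "then" lowers the WCET to 45 cycles and is confirmed.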
The EHSA algorithm uses WCET analysis to make SPM allocation WCET-aware.
Since our WCET analysis is based on ILP, its complexity is determined by the ILP
solver, which is based on the simplex algorithm [66] and has a complexity of
$O(2^N)$ (see footnote 2).
2 However, for small-scale problems, it can achieve expected time polynomial in n, d, and
$1/\sigma$ [67], denoted as $f(n, d, 1/\sigma)$, where n is the size of the problem, d is the dimension, and $\sigma$
is the standard deviation. Therefore, the complexity of EHSA for small-scale problems can be
$O(N_s \times (T_{cache} + N \log N) + N_s \times f(n, d, 1/\sigma))$, where $T_{cache}$ denotes the complexity of the cache analysis.
Algorithm 6 Enhanced Hybrid SPM-Cache Allocation
1: begin
2: allocations ← empty;
3: availableBB ← total number of basic blocks;
4: get the initial WCET value from WCET analysis;
5: while isNotFull(SPM) && availableBB > 0 do
6:   do cache analysis;
7:   sort basic blocks into AM list and NC list;
8:   b_allocate ← FALSE;
9:   while isNotEmpty(AM list) && b_allocate == FALSE do
10:     get the first available basic block Bm from AM list;
11:     if sizeof(Bm) <= sizeof(available SPM space) then
12:       do WCET analysis with this allocation;
13:       if WCET is reduced then
14:         allocate Bm into SPM;
15:         b_allocate ← TRUE;
16:         remove Bm from the AM list and mark it;
17:         availableBB--;
18:       end if
19:       remove Bm from the AM list;
20:     else
21:       remove Bm from the AM list and mark it;
22:       availableBB--;
23:     end if
24:   end while
25:   while isNotEmpty(NC list) && b_allocate == FALSE do
26:     get the first available basic block Bn from NC list;
27:     if sizeof(Bn) <= sizeof(available SPM space) then
28:       do WCET analysis with this allocation;
29:       if WCET is reduced then
30:         allocate Bn into SPM;
31:         b_allocate ← TRUE;
32:         remove Bn from the NC list and mark it;
33:         availableBB--;
34:       end if
35:       remove Bn from the NC list;
36:     else
37:       remove Bn from the NC list and mark it;
38:       availableBB--;
39:     end if
40:   end while
41:   if b_allocate == FALSE then
42:     use FSA to allocate one BB from the basic blocks left;
43:   end if
44: end while
45: return allocations
46: end
5.3.5 WCET Analysis of Hybrid SPM-Caches
To conduct WCET analysis for the hybrid SPM-Cache architecture, we
extend the ILP-based method proposed by Li et al. [68]. We use ILP to calculate
the maximum value of the total execution time (i.e., the objective function) under
three types of linear constraints: structural constraints, functional constraints, and
cache constraints. The structural constraints are derived from the program's CFG,
and the functional constraints are provided by the loop bounds and other path
information, both of which are the same as those in [68]. However, when we
build the cache constraints from the cache conflict graph [68], we do not consider
the basic blocks allocated into the SPM. Therefore, the cache timing analysis for
the hybrid SPM-Cache is actually less complex than the analysis for a pure cache,
not only because the cache in the hybrid SPM-Cache is typically smaller, but also
because fewer instructions need to be stored into the cache and be modeled in the
cache conflict graph. An ILP solver is then used to solve all the ILP equations and
inequalities to compute the WCET based on the objective function. More
information about the ILP-based WCET analysis can be found in [68].
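For orientation, the overall shape of such an ILP can be written down in a simplified form. The notation below is an assumption introduced for illustration and is not taken from [68]: $S$ is the set of basic blocks allocated to the SPM, $x_i$ is the execution count of basic block $i$, $d_e$ is the traversal count of CFG edge $e$, and $c_i^{spm}$, $c_i^{hit}$, $c_i^{miss}$ are the costs of block $i$ when it is fetched from the SPM, hits in the cache, or misses in the cache. The actual method in [68] models cache behavior at a finer granularity than whole basic blocks.

```latex
\begin{align*}
\text{maximize}\quad
  & \sum_{i \in S} c_i^{spm}\, x_i
    + \sum_{i \notin S} \left( c_i^{hit}\, x_i^{hit} + c_i^{miss}\, x_i^{miss} \right) \\
\text{subject to}\quad
  & x_i = \sum_{e \in \mathrm{in}(i)} d_e = \sum_{e \in \mathrm{out}(i)} d_e
    && \text{(structural constraints)} \\
  & x_i = x_i^{hit} + x_i^{miss}, \quad i \notin S
    && \text{(cache constraints, linked via the cache conflict graph)} \\
  & \text{loop bounds and other path information}
    && \text{(functional constraints)}
\end{align*}
```

In this simplified view, blocks in $S$ contribute a fixed cost per execution and introduce no hit/miss variables, which is one way to see why the cache constraints shrink as more blocks are moved into the SPM.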
5.4 EVALUATION METHODOLOGY
Our evaluation framework is depicted in Figure 5.2. In this framework, we
first use the Trimaran compiler [40] to compile a benchmark into the intermediate
representation (IR) targeting a Very Long Instruction Word (VLIW) processor
that is supported by Trimaran. Based on the IR information, we can construct the
control flow graph that will be used by both the WCET analyzer and the SPM
allocator. The WCET analyzer generates all the structural constraints, functional
constraints, and cache constraints with the on-chip memory configuration and SPM
allocation information. A commercial ILP solver, CPLEX [69], is used to solve the
ILP problem to compute the estimated WCET. All four allocation algorithms
are implemented in the SPM allocator. In the case of the EHSA, the SPM allocator
needs to call the WCET analyzer to make the SPM allocation WCET-aware. In
addition, we use the Trimaran simulator to report the average-case performance.
Figure 5.2. High-level overview of our evaluation framework.
Benchmark   Path Info        Description                                                Code Size (Bytes)
crc         single path      cyclic redundancy check computation on 40 bytes of data   520
edn         single path      finite impulse response (FIR) filter calculations         3452
matmult     single path      matrix multiplication of two 20 x 20 matrices             480
ndes        multiple paths   complex embedded code                                      3452
cnt         multiple paths   counts non-negative numbers in a matrix                    408
fir         multiple paths   a 700 items long sample impulse response filter            356
Table 5.2. Benchmarks used in our experiments.
We select 6 real-time benchmarks from the Mälardalen WCET benchmark suite
[41] for the experiments. The salient characteristics of all benchmarks are shown in
Table 5.2. It should be noted that the Mälardalen benchmark suite includes both
single-path and multiple-path programs. In our evaluation, the first three
benchmarks are single-path programs, while the last three benchmarks have
multiple paths.
In our experiments, the baseline processor has 2 integer ALUs, 2 float ALUs,
1 branch predictor, 1 load/store unit, and 1-level on-chip memory. To focus on the
instruction on-chip memory, the data cache is assumed to be perfect. Since the
WCET benchmarks are typically very small, as can be seen in Table 5.2, we need
to use small configurations for both the cache and the SPM (see footnote 3). By default, we
assume a 64B instruction cache and a 64B SPM. The SPM takes 1 clock cycle to
access. The parameters of the cache include: 16B block size, direct-mapped, and
LRU replacement policy. A cache hit takes 1 cycle and a memory access takes 20
cycles.
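To make the default configuration concrete, the following sketch models a 64B direct-mapped cache with 16B blocks (four lines) and counts fetch cycles for a short address trace. It is a toy model under stated assumptions, not our simulator: a miss is charged the 20-cycle memory latency with nothing added, the SPM is modeled as address ranges served in 1 cycle, and the function names and the trace are hypothetical.

```python
CACHE_SIZE = 64   # bytes
BLOCK_SIZE = 16   # bytes
NUM_LINES = CACHE_SIZE // BLOCK_SIZE  # 4 lines, direct-mapped
HIT_CYCLES = 1
MISS_CYCLES = 20  # assumed total cost of a miss (memory access latency)
SPM_CYCLES = 1


def cache_line(addr):
    """Index of the cache line that an instruction address maps to."""
    return (addr // BLOCK_SIZE) % NUM_LINES


def fetch_cycles(trace, spm_ranges=()):
    """Total fetch cycles for an address trace; spm_ranges are [lo, hi) byte ranges."""
    tags = [None] * NUM_LINES
    cycles = 0
    for addr in trace:
        if any(lo <= addr < hi for lo, hi in spm_ranges):
            cycles += SPM_CYCLES
            continue
        line, tag = cache_line(addr), addr // CACHE_SIZE
        if tags[line] == tag:
            cycles += HIT_CYCLES
        else:
            tags[line] = tag  # evict whatever occupied this line
            cycles += MISS_CYCLES
    return cycles


# Two instructions 64 bytes apart map to the same line and evict each other.
trace = [0, 64] * 4
print(cache_line(0), cache_line(64))                  # 0 0
print(fetch_cycles(trace))                            # 160
print(fetch_cycles(trace, spm_ranges=[(64, 80)]))     # 27
```

Moving one of the two conflicting blocks into the SPM removes the conflict misses for the other block as well, which is the effect the cache-aware allocation algorithms try to exploit.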
3 It should be noted that this is not uncommon in the research of WCET analysis; otherwise,
the benchmarks can easily all fit into a regular cache with several kilobytes.
5.5 EXPERIMENTAL RESULTS
5.5.1 Safety and Accuracy of WCET Analysis
We first evaluate the safety and accuracy of our WCET analysis for the
hybrid SPM-Cache architecture with the four different SPM allocation algorithms.
Figure 5.3 shows the estimated WCET reported by the developed WCET
analyzer, which is normalized to the simulated WCET obtained through simulation on
different inputs. As we can see, for all the SPM allocation algorithms, the
estimated WCET is greater than the simulated WCET, indicating that our WCET
analysis can safely estimate the upper bound of the execution time.
For single-path benchmarks, we find that the estimated WCET is very close
to the simulated WCET. The slight overestimation is mainly due to the
conservative assumption used in static cache analysis. For example, the unclassified
cache accesses, which are assumed to be misses in the worst case, may actually be
hits during the simulation. For multiple-path programs, however, we observe larger
differences between the estimated WCET and the simulated WCET. One reason is
still the conservative cache analysis, which can become worse for programs
with more paths. Another possible reason is that for a multiple-path program, it
becomes much harder to simulate (or observe) the actual worst-case path by
limited simulation, unless we can exhaust all possible paths for all inputs in our
simulation, which is prohibitively expensive in computation. On average, the
estimated WCET is 21.3%, 19.1%, 15.9%, and 16.1% larger than the corresponding
simulated WCET for the FSA, HSA, LPA, and EHSA algorithms respectively.
Figure 5.3. Comparing the estimated WCET and the simulated
WCET for all the four SPM allocation algorithms, which is normalized
to the simulated WCET.
In particular, we find our WCET analyzer is more accurate for the HSA,
LPA, and EHSA algorithms as compared to the FSA. This is because all
three of these SPM allocation algorithms can reduce WCET by exploiting the worst-case
path information and/or the worst-case cache performance information. The
overestimation of the WCET analysis is also reduced by improving the worst-case
cache performance and by reducing the execution time on the worst-case path.
Given the conservative nature of worst-case execution time analysis, and the
difficulty of obtaining the actual WCET through simulation for multiple-path
programs, we believe our WCET analysis approach is reasonably accurate.
5.5.2 WCET Results of Dierent SPM Allocation Algorithms
Figure 5.4 compares the WCET of the four dierent SPM allocation
algorithms, which is normalized to the WCET of the FSA algorithm. As we can
see, all the three other algorithms can achieve better (i.e. smaller) WCET than
the FSA. Among these three algorithms, the HSA can only reduce WCET slightly
as compared to the FSA. This is because the HSA is not aware of the WC-path.
While the HSA can exploit the cache analysis information, the basic blocks with
large Always Miss numbers but not on the worst-case path may be selected to use
the SPM, which do not help the WCET reduction.
The LPA can reduce the WCET more than the HSA. For single-path
benchmarks, i.e., crc, edn, and matmult, the LPA achieves the same WCET as the
FSA, because there is only a single path in those programs, which is also the
longest path. For benchmarks with multiple paths, including ndes, cnt, and fir,
the LPA is much better than both the HSA and the FSA, because the LPA can
focus on allocating the basic blocks on the WC-path into the SPM to effectively
reduce the WCET.
As expected, the EHSA is the best among all these four algorithms, because
it is not only WCET-oriented, but also cache-aware. The EHSA is especially
effective for the multiple-path benchmarks. For example, the EHSA can reduce the
WCET of ndes by 8.1% as compared to the FSA. For the single-path benchmarks,
the EHSA can still achieve better WCET by exploiting the cache analysis
information. On average, the EHSA reduces the WCET of the FSA by 5.4% for all
the benchmarks.
Figure 5.4. Comparing the WCET of different SPM allocation
algorithms with the default configuration, which is normalized
to the WCET of the FSA algorithm.
5.5.3 Average-Case Performance Results
Figure 5.5 gives the ACET of the four different SPM allocation algorithms,
which is normalized to the ACET of the FSA algorithm. We find that for the
single-path benchmarks, the FSA and the LPA have exactly the same ACET,
because the longest path is the same as the average-case path in this case. For
single-path benchmarks, we also observe that both the HSA and the EHSA can actually
achieve better ACET. The reason is that both the HSA and the EHSA can reduce
the worst-case cache misses by exploiting the SPM, which also reduces the
average-case cache misses for the single-path programs.
For the multiple-path benchmarks, however, we find that the FSA can
actually achieve slightly better ACET than the other three algorithms. This is
because the FSA algorithm allocates SPM space based on the access frequency of
each basic block, which is obtained through profiling. Consequently, the basic
blocks selected by the FSA are likely to be on the average-case path in our
simulation, thus benefiting ACET. By comparison, both the LPA and the EHSA
exploit WC-path information, and the HSA uses the worst-case cache timing
analysis information, none of which is guaranteed to improve ACET.
Nevertheless, as can be seen from Figure 5.5, the average-case performance
degradation by these three algorithms is not significant. On average, for the
multiple-path benchmarks, the HSA, LPA, and EHSA degrade the ACET of the
FSA by only 2.3%, 3.0%, and 1.8% respectively.
Figure 5.5. Comparing the ACET of different SPM allocation
algorithms with the default configuration, which is normalized
to the ACET of the FSA algorithm.
5.5.4 Sensitivity Study
Since the average-case performance of the cache and the SPM can be
dependent on their sizes, we also conduct experiments to study the sensitivity of
different SPM allocation algorithms with respect to various SPM and cache sizes,
while trying to keep their total size fixed. Figure 5.6 compares the WCET of
different SPM allocation algorithms for the hybrid SPM-Cache with a 96B SPM
and a 32B cache, which is normalized to the WCET of the FSA algorithm with the
same SPM-Cache configuration. As we can see, with a smaller cache and a larger
SPM, the HSA can reduce the WCET more significantly as compared to a 64B
SPM and a 64B cache. This is because there are likely more capacity and conflict
misses in a smaller cache. Therefore, keeping the Always Hit instructions in the
cache can help improve the worst-case cache performance. For multiple-path
benchmarks, the LPA can achieve even better (i.e., smaller) WCET than the HSA
because the LPA is WCET-oriented. The EHSA is superior to both the HSA and
the LPA for all benchmarks. For ndes, the EHSA can reduce the WCET by 10.1%
as compared to the FSA. On average, the EHSA can reduce the WCET of the
FSA by 9.4% for multiple-path benchmarks and by 7.3% for all benchmarks,
indicating the effectiveness of this approach in reducing WCET.
Figure 5.6. Comparing the WCET of different SPM allocation
algorithms for the hybrid SPM-Cache with a 96B SPM and a
32B cache, which is normalized to the WCET of the FSA algorithm.
Figure 5.7 compares the ACET of different SPM allocation algorithms for
the hybrid SPM-Cache with a 96B SPM and a 32B cache. Compared to the 64B
SPM and the 64B cache configuration, we find that on average, the HSA and the
EHSA now achieve ACETs that are 4.3% and 4.6% better than that of the
FSA, respectively. A possible reason is that with a smaller cache and a larger SPM, more AM
and NC instructions will be allocated into the SPM, which are more likely to be on
the average-case path for the multiple-path programs and are definitely on the
average-case path for the single-path benchmarks. Actually for the latter, we
observe up to 9.1% ACET reduction. One exception is the benchmark fir, for
which the ACET of the HSA and the EHSA is 4.1% and 3.4% worse, respectively,
than that of the FSA. For this benchmark, we find many instructions allocated
into the SPM are not on the simulated average-case path, leading to worse ACET.
Also, we find that compared to the FSA, the LPA leads to the same ACET for
single-path benchmarks, while its effect on ACET for multiple-path benchmarks
varies, depending on how many selected blocks are on both the WC-path and the
average-case simulated path.
Figure 5.7. Comparing the ACET of different SPM allocation
algorithms for the hybrid SPM-Cache with a 96B SPM and a
32B cache, which is normalized to the ACET of the FSA algorithm.
We also run experiments with a smaller SPM and a larger cache. More
specifically, we use a 32B SPM and a 128B cache (see footnote 4). Figure 5.8 shows the WCET of
different SPM allocation algorithms for the hybrid SPM-Cache with a 32B SPM
and a 128B cache. For such a relatively small SPM and a large cache, we find that
while the HSA, LPA, and EHSA algorithms can still achieve better WCET than
the FSA, the amount of WCET reduction is smaller. This is because, with a
smaller SPM, fewer basic blocks can be stored into the SPM, thus there is less
room for an SPM allocation algorithm to improve WCET. Also, with a larger
cache, there are fewer cache misses, thus limiting the benefit of cache-aware SPM
allocation.
4 We have attempted to use a 32B SPM and a 96B cache to keep the total size of the SPM-Cache
fixed. However, our cache simulator requires that the cache size must be a power of 2. Thus we
increase the cache size to 128B.
Figure 5.8. Comparing the WCET of different SPM allocation
algorithms for the hybrid SPM-Cache with a 32B SPM and a
128B cache, which is normalized to the WCET of the FSA algorithm.
Figure 5.9 presents the ACET of different SPM allocation algorithms for the
hybrid SPM-Cache with a 32B SPM and a 128B cache, which is normalized to the
ACET of the FSA algorithm with the same SPM-Cache configuration. Again, we
observe that for single-path programs, the LPA and the FSA have the same
ACET, and both the HSA and the EHSA can attain better ACET than the FSA.
However, the amount of ACET reduction by both the HSA and the EHSA is
smaller as compared to the ACET reduction for the 96B SPM and 32B cache.
This is similar to the trend of WCET reduction we have observed. The reason is
that as the SPM size decreases and the cache size increases, there are fewer cache
misses and less SPM space to reduce possible cache misses.
Figure 5.9. Comparing the ACET of different SPM allocation
algorithms for the hybrid SPM-Cache with a 32B SPM and a
128B cache, which is normalized to the ACET of the FSA algorithm.
5.6 RELATED WORK
Several researchers have explored hybrid models consisting of both cache
memory and SPM. Panda et al. [45] investigated partitioning scalar and array
variables into SPM and data cache to minimize the execution time for embedded
applications. Verma et al. [63] studied an instruction cache behavior based SPM
allocation technique to reduce the energy consumption. Cong et al. [5] proposed
an adaptive hybrid cache by reconfiguring a part of the cache as software-managed
SPM to improve both performance and energy efficiency. Kang et al. [4]
introduced a synergetic memory allocation method to exploit SPM to reduce data
cache pollution. All these prior studies have focused on improving performance
and/or energy efficiency. In contrast, this chapter investigates how to reduce the
WCET of hybrid SPM-Caches.
There have also been many studies to exploit SPMs for better performance,
energy efficiency, or time predictability. For example, a number of SPM allocation
algorithms have been proposed to improve the average-case performance
[46, 47, 48, 49], energy efficiency [50], or WCET [51, 52, 53, 54, 55]. However, to
the best of our knowledge, all the prior studies on WCET-oriented SPM allocation
have focused on pure SPMs, which may lead to suboptimal results because they
cannot deterministically and cooperatively leverage the cache in the hybrid
SPM-Cache architecture to minimize WCET. To the best of our knowledge, the
EHSA approach proposed in this chapter is the first work to study
WCET-oriented and cache-aware SPM allocation.
The WCET-oriented and cache-aware SPM allocation is based on prior work
on WCET analysis [68], especially static timing analysis for caches based on
Abstract Interpretation [10]. There are several other methods for cache timing
analysis, for example the static cache simulation techniques [70, 71]. Also, a
number of researchers have examined WCET-aware compiler optimizations. While
this dissertation does not intend to provide a complete survey of related work in this
area, the studies close to this work include WCET-driven code positioning [72, 73],
procedure positioning [74], and memory content selection to benefit from the
instruction cache [75]. However, all these prior studies are focused on the
instruction cache and are not aware of the instruction SPM that is available in
the hybrid SPM-Cache architecture.
5.7 CONCLUSIONS
In this chapter, we have explored four SPM allocation algorithms that differ
by whether or not they are aware of the WCET and/or the cache. The FSA
algorithm allocates SPM space based on the access frequency of each basic block
from profiling, whereas the LPA attempts to allocate basic blocks with high access
frequencies on the WC-path. Both the HSA and the EHSA algorithms can exploit
the worst-case cache analysis information; however, the EHSA ensures that only
basic blocks on the WC-path are allocated to the SPM. We have also extended the
ILP-based timing analysis method [68] to predict the WCET for the hybrid
SPM-Cache architecture, and our experiments indicate that the developed WCET
analyzer is safe and reasonably accurate.
We have implemented all four SPM allocation algorithms in our
evaluation framework based on the Trimaran compiler/simulator infrastructure [40].
Our evaluation indicates that the EHSA algorithm, which is both WCET-oriented
and cache-aware, achieves the best WCET for all benchmarks under all
SPM-Cache configurations we have evaluated. The EHSA is especially
effective at reducing WCET with a smaller cache and a larger SPM. While the EHSA
may degrade the average-case performance for some multiple-path
benchmarks, its impact is insignificant. In fact, on average, the EHSA leads to
better ACET than the FSA in our sensitivity study. Therefore, exploiting the
cache and the SPM cooperatively is important for the hybrid SPM-Cache to
enhance WCET, and its impact on ACET can be either positive or insignificant.
Also, our experiments show that the LPA algorithm outperforms both the
FSA and the HSA in reducing WCET for most multiple-path programs.
Compared to the EHSA, the LPA algorithm has significantly lower time complexity
and can run much faster in practice. On the other hand, the HSA algorithm can
reduce the WCET for both single-path and multiple-path programs as compared
to the LPA, although the time complexity of the HSA is much higher due to the
invocation of cache timing analysis. Among these four algorithms, the FSA has the
lowest time complexity and can reduce the ACET more effectively for some
multiple-path benchmarks under certain SPM-Cache configurations, but it is the
least effective algorithm for reducing WCET.
In our future work, we would like to explore the hybrid SPM-Caches for
storing data as well. Moreover, to use the SPM space more efficiently, we plan to
study an optimal SPM allocation algorithm by using integer linear programming
and explore dynamic SPM allocation algorithms for the hybrid SPM-Cache
architecture for minimizing WCET without significantly impacting performance
and energy consumption.
CHAPTER 6
CACHE-AWARE SPM ALLOCATION FOR MAXIMIZING
PERFORMANCE ON HYBRID SPM-CACHE ARCHITECTURE
6.1 CHAPTER OVERVIEW
To address the growing gap between CPU and main memory performance,
cache memories have been widely used in modern processors. Cache memories are
based on the principle of making the common case fast. However, there is no
guarantee that the cache can also benefit the worst-case execution time (WCET),
which is crucial for hard real-time systems. Actually, the cache performance is
heavily dependent on the history of memory accesses, as well as the cache
placement and replacement algorithms, making it hard to accurately predict the
worst-case execution time. The WCET analysis for data caches is even harder,
because the addresses of data accesses to the heap may not be predicted statically.
Scratch-Pad Memory is an alternative on-chip memory to the cache. SPM is
also a small on-chip memory based on fast SRAM, but is directly and explicitly
managed at the software level, either by the compiler or by the developer. Due to
its area and energy efficiency, SPM has been increasingly used in embedded
processors such as ARMv6, Motorola MCORE, Nvidia's PhysX PPU (Physical
Processing Unit) and the Cell multiprocessor jointly developed by IBM, Sony, and
Toshiba. The SPM is particularly useful for real-time systems, because the SPM
allocation is controlled by software and the latency to access the SPM is fixed,
both of which can be statically predicted. However, SPMs generally are not
adaptive to runtime instruction and data access patterns, and thus may lead to
inferior average-case performance.
Recently, there has been an increasing number of studies on hybrid on-chip
memory architectures that place caches and SPMs together to cooperatively
improve performance, energy efficiency, or time predictability. Cong et al. [5]
proposed an adaptive hybrid cache by reconfiguring a part of the cache as
software-managed SPM to improve both performance and energy efficiency. Kang
et al. [4] introduced a synergetic memory allocation method to exploit the SPM to
reduce data cache pollution. Zhang et al. [6] studied a hybrid on-chip memory
architecture that can leverage the SPM to achieve time predictability while
exploiting the cache to improve the average-case performance.
The HSC models have also been used in some prototypes or commercial
processors such as TRIPS [1], ARM1136JF-S [2], and Nvidia Fermi [3]. For
example, in the Nvidia Fermi architecture, the L1 on-chip SRAM memory is
configurable to support both shared memory (i.e. SPM) and caching of local and
global memory operations.
The hybrid SPM-Cache architecture brings new challenges and opportunities
to further enhance the performance and energy efficiency of the on-chip memory.
Traditionally, the SPM allocation, including both static and dynamic allocation,
mainly focuses on the SPM alone. These cache-unaware SPM allocation
algorithms are unlikely to harness the full potential of the hybrid SPM and cache.
To use the aggregate SPM and cache space more efficiently, we believe the SPM
allocation for the hybrid SPM-Cache architecture must be aware of the cache
performance to maximally optimize the execution time or energy consumption.
To this end, we design and comparatively evaluate four different SPM allocation
algorithms. The first one is the Frequency-based SPM Allocation (FSA), which is
not aware of the cache and is used as the baseline. The other three algorithms are
all cache-aware, but exploit cache information in different ways. The Hybrid
SPM-Cache Allocation (HSA) exploits cache profiling information. It tries to
allocate the memory objects with the largest numbers of cache misses into the SPM. The
remaining two algorithms are both based on the Stack Distance Analysis (SDA)
[11, 12]. The Greedy Stack Distance based Allocation (GSDA) is a greedy
algorithm, whereas the Optimal Stack Distance based Allocation (OSDA) is an
optimal algorithm that uses model checking. More details of these algorithms are
given in the rest of the chapter. As the first step toward exploiting the tight
interaction between SPM allocation and cache performance in the HSC
architecture, our study focuses on an instruction SPM-Cache as depicted
in Figure 5.1(b). In the instruction HSC, both the SPM and the cache are used for
storing instructions only. We plan to explore hybrid SPM-Caches for data accesses
in our future work.
This chapter makes three main contributions as follows.
• First, we propose a novel unified HSC analysis framework based on stack
distance, in which the SPM is treated as additional "virtual" ways for the
cache and thus can be included in the stack distance analysis.
• Second, we develop a heuristic-based GSDA algorithm with polynomial time
complexity and an optimal OSDA algorithm based on model checking, both
of which are built upon the unified stack distance analysis framework.
• Third, we have implemented and compared all four SPM allocation
algorithms, and find that all three cache-aware algorithms outperform
the FSA algorithm. In particular, the HSA and the GSDA
improve the performance by 9% and 11% respectively as compared to the
FSA. The OSDA always achieves the best performance, but requires
significantly more memory space and longer running time and may not be
scalable for larger benchmarks. The GSDA can achieve performance either
the same as or very close to that of the OSDA.
6.2 RELATED WORKS
To use the SPM efficiently, researchers have studied SPM allocation
extensively. The existing approaches can be classified into two classes: static
allocation [45, 46, 48, 49, 50, 52] and dynamic allocation
[47, 51, 54, 55, 76, 77, 78, 79, 80]. In static allocation, once an instruction or data
item is loaded into the SPM, its space cannot be allocated to other instructions or data.
By comparison, in dynamic allocation, the SPM space can be reused by other
instructions or data under the compiler's control. While the dynamic SPM
allocation can use the SPM space more efficiently, transferring instructions or data
from the main memory to the SPM takes time and energy, which must be
considered by the dynamic allocation algorithms. All these prior studies on SPM
allocation, however, have focused on pure SPMs, which may lead to suboptimal
results for HSCs because they cannot cooperatively leverage the cache in parallel.
Several researchers have also explored hybrid models consisting of both cache
memory and SPM. Panda et al. [45] investigated partitioning scalar and array
variables into the SPM and the data cache to minimize the execution time for
embedded applications. Kang et al. [4] introduced a synergetic memory allocation
method to exploit the SPM to reduce data cache pollution for real-time tasks. In
contrast, our study focuses on allocating SPM space to store instructions for the
hybrid SPM-Cache architecture.
Verma et al. [63] studied an instruction cache behavior based SPM allocation
technique to reduce the energy consumption. Their approach was based on the
cache conflict graph, which used profiling information (i.e. weights) to get the
number of conflicting misses for different instructions. They then proposed an
Integer Linear Programming (ILP) based solution to minimize the number of the
conflicting edges in the conflict graph and the overall energy consumption of the
system. However, approximation is used to linearize the problem so that it is
solvable by the ILP solver.
Despite the breadth of existing studies, our work differs from all the
investigations above in the following two aspects.
First, prior studies on cache-aware SPM allocation [4, 45, 63] examine cache
and SPM separately. The SPM allocation is done after cache performance profiling
and analysis, with the goal of minimizing different kinds of cache misses for different
objective functions. While this separation of concerns reduces the complexity, it
does not consider the impact of SPM allocation on the original cache profiling or
analysis. For example, suppose instructions A and B conflict with each other.
While allocating B into the SPM may reduce the conflicting misses for A, it may
change another instruction, say C, from a hit to a miss due to the loss of spatial
locality if both C and B are stored in the same cache block. Also, if A conflicts
with multiple instructions, including B, then allocating B into the SPM alone may
not reduce A's number of misses, but it may have a positive impact later if the
other conflicting instructions are also stored into the SPM. Since each SPM
allocation may affect the
cache hit/miss estimation, we develop a novel stack distance based analysis
framework for HSC by treating the SPM as additional "virtual" ways for the
cache, enabling us to study the interactions between SPM allocation and cache
performance at a finer granularity.
Second, we have studied both heuristic and optimal cache-aware algorithms
in this work. The first heuristic algorithm HSA allocates SPM space after cache
profiling. The second heuristic algorithm GSDA is based on the SDA framework,
which can take into account the impact of each SPM allocation on the cache
performance and thus can achieve better performance. The optimal algorithm
OSDA leverages model checking. While the OSDA in general demands much more
memory space and execution time, it provides the basis to assess the effectiveness
of the heuristic based algorithms.
6.3 BASIC SPM ALLOCATION ALGORITHMS
We first develop two basic SPM allocation algorithms: the Frequency-based
SPM Allocation algorithm and the Hybrid SPM-Cache Allocation algorithm.
While the former is cache-unaware, the latter is cache-aware. Both algorithms are
heuristic based, and can be implemented efficiently.
6.3.1 Frequency-based SPM Allocation
The FSA is based on the access frequency of each memory object of a given
program. The heuristic is that the memory objects are stored into the SPM based
on the decreasing order of their access frequencies, until the SPM is full or there is
not enough space left to hold a memory object. This straightforward algorithm
is described in Algorithm 7.
The FSA algorithm only needs to check each memory object once and its
complexity is dominated by sorting all the memory objects based on their access
frequencies. Therefore the time complexity of FSA is O(N log(N)), where N is the
number of memory objects in the given program.
6.3.2 Hybrid SPM-Cache Allocation
The HSA is a cache-aware SPM allocation algorithm designed for the hybrid
SPM-Cache architecture. The idea is to allocate the memory objects with larger
cache misses into the SPM in order to reduce the total number of cache misses.
Algorithm 7 FSA Algorithm
begin
  allocations ← empty;
  run simulation and get the frequency of each memory object;
  sort memory objects by frequency from high to low in f_list;
  while isNotFull(SPM) && isNotEmpty(f_list) do
    get the first memory object B from f_list;
    if sizeof(B) <= sizeof(available SPM space) then
      allocate B into SPM;
    end if
    remove B from f_list;
  end while
  return allocations
end
We first simulate the benchmark and get the number of cache misses of each
memory object without the SPM. Then we allocate the memory objects in the
descending order of their numbers of cache misses. Algorithm 8 describes the HSA
allocation algorithm.
The HSA algorithm only needs to check each memory object once and its
complexity is dominated by sorting all the memory objects according to their
numbers of cache misses. Thus the time complexity of the HSA is also
O(N log(N)).
Algorithm 8 HSA Algorithm
begin
  allocations ← empty;
  profile to get the number of cache misses of each memory object;
  sort the memory objects by number of cache misses from high to low in m_list;
  while isNotFull(SPM) && isNotEmpty(m_list) do
    get the first memory object B from m_list;
    if sizeof(B) <= sizeof(available SPM space) then
      allocate B into SPM;
    end if
    remove B from m_list;
  end while
  return allocations
end
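Algorithms 7 and 8 share the same sort-and-fill loop and differ only in the weight used for sorting, so both can be sketched with a single routine. The Python sketch below is illustrative only: the function name greedy_fill and the dictionary-based inputs are assumptions made for this example and are not taken from the dissertation's implementation. The HSA weights are the per-line-block miss counts listed in Table 6.4; the FSA profile is hypothetical.

```python
from typing import Dict, List


def greedy_fill(weights: Dict[str, int], sizes: Dict[str, int],
                spm_size: int) -> List[str]:
    """Fill the SPM with memory objects in descending order of weight.

    weights: access frequency (FSA) or cache-miss count (HSA) per object.
    sizes:   size of each object, in the same unit as spm_size.
    """
    allocations: List[str] = []
    free = spm_size
    # sorted() is stable, so objects with equal weight keep their input order.
    for obj in sorted(weights, key=lambda o: weights[o], reverse=True):
        if free == 0:
            break  # the SPM is full
        if sizes[obj] <= free:
            allocations.append(obj)
            free -= sizes[obj]
    return allocations


# HSA on the miss counts of Table 6.4 (2-word line blocks, 4-word SPM).
misses = {"LB0": 3, "LB2": 2, "LB3": 2, "LB4": 3, "LB5": 3}
sizes = {name: 2 for name in misses}
print(greedy_fill(misses, sizes, spm_size=4))   # ['LB0', 'LB4']

# FSA on a hypothetical frequency profile with objects of different sizes.
freq = {"A": 50, "B": 90, "C": 10}
print(greedy_fill(freq, {"A": 2, "B": 3, "C": 1}, spm_size=4))  # ['B', 'C']
```

In the second call, object A is skipped because it no longer fits after B is placed, while the smaller object C still does, which mirrors the size check inside the loop of both algorithms.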
6.4 STACK DISTANCE BASED SPM ALLOCATION
ALGORITHMS
6.4.1 Stack Distance
Stack distance [11, 12] has been widely used in cache performance analysis.
The stack distance of a memory access can be defined as the number of accesses to
unique addresses made since the last reference to the requested data [11]. The
stack distance has an interesting property: in a d-way associative LRU cache, a
reference with stack distance $s < d$ will hit, and a reference with stack distance
$s \ge d$ will miss.
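The hit/miss property can be checked on a small trace. The following Python sketch computes LRU stack distances for the accesses that map to a single cache set; the function names and the example trace are assumptions made for illustration, and a first-time access is given an infinite distance.

```python
from math import inf
from typing import Hashable, List


def stack_distances(trace: List[Hashable]) -> List[float]:
    """LRU stack distance of every access in a trace for ONE cache set.

    The distance is the number of distinct addresses touched since the
    previous access to the same address; a first-time access gets inf.
    """
    stack: List[Hashable] = []  # most recently used address first
    result: List[float] = []
    for addr in trace:
        if addr in stack:
            depth = stack.index(addr)  # distinct addresses above it
            stack.pop(depth)
        else:
            depth = inf
        result.append(depth)
        stack.insert(0, addr)
    return result


def is_hit(s: float, d: int) -> bool:
    """Hit/miss rule for a d-way LRU set: hit if and only if s < d."""
    return s < d


trace = ["a", "b", "a", "c", "b", "a"]
dist = stack_distances(trace)
print(dist)                          # [inf, inf, 1, inf, 2, 2]
print([is_hit(s, 2) for s in dist])  # [False, False, True, False, False, False]
```

With d = 2, only the third access hits: one other address was touched since the previous access to "a", so "a" is still resident in a 2-way LRU set.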
6.4.2 Stack Distance Analysis for The HSC Architecture
While the stack distance is very useful to analyze the cache performance, it is
not directly applicable to the hybrid SPM-Cache architecture, as the instructions
stored into the SPM will disrupt the cache behavior. Therefore, we propose to
treat the SPM as additional "virtual" ways of the cache to enable unified analysis
for the HSC. Given an SPM with N words and a d-way set-associative cache, we
initially model the hybrid SPM-Cache as a virtually (N + d)-way set-associative
cache. This is because up to N words can be stored into the SPM, in addition to
the d ways in the cache, to reduce the conflict misses. Thus stack distance based
analysis can treat the HSC as a virtually (N + d)-way set-associative cache.
However, unlike a regular (N + d)-way set-associative cache, after allocating each
word from the SPM, the set associativity is reduced by 1, until all the SPM space
is allocated and the cache becomes a regular d-way set-associative cache.
For each memory access $a_i$, we define the binary variable $x_i$ as follows:
\[
x_i =
\begin{cases}
0, & \text{if } a_i \text{ is in the SPM} \\
1, & \text{if } a_i \text{ is not in the SPM}
\end{cases}
\tag{6.1}
\]
We use the binary variable $m_i$ to indicate whether the memory access is a
cache miss or not.
\[
m_i =
\begin{cases}
0, & \text{if } a_i \text{ is in the SPM or is a cache hit} \\
1, & \text{if } a_i \text{ is not in the SPM and is a cache miss}
\end{cases}
\tag{6.2}
\]
Then the total number of cache misses $M$ of the benchmark after the SPM
allocation can be calculated by Equation 6.3:
\[
M = \sum_{\text{all } a_i} x_i \cdot m_i \tag{6.3}
\]
The stack distance of each memory access can be calculated according to the
trace of all memory accesses. Assume that the stack distance of $a_i$ is $s_i$ before the
SPM allocation. If $a_i$ is in the SPM after the SPM allocation, then $x_i = 0$. If $a_i$ is
not in the SPM after the SPM allocation, then $x_i = 1$ and the stack distance of $a_i$
becomes $s'_i$. If $s'_i \ge d$, then memory access $a_i$ is a miss and $m_i = 1$.
In the hybrid SPM-Cache, $s'_i$ is dependent on $s_i$. Specifically, if $s_i < d$, we
definitely have $s'_i < d$, so $a_i$ is a cache hit, and $m_i = 0$. If $d + N \le s_i < \infty$,
because we can only allocate up to N instructions into the SPM, $s'_i \ge d$. Thus $a_i$ is
a miss, and $m_i = 1$. Also, if $s_i = \infty$, we must have $s'_i = \infty$, so $a_i$ is still a miss, and
$m_i = 1$. However, if $d \le s_i < d + N$, then $a_i$ may become a hit or a miss,
depending on how many of the interfering instructions are allocated into the SPM.
For example, if enough of the interfering instructions (up to N) are stored into the
SPM that $s'_i < d$, then $a_i$ becomes a hit.
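The case analysis above can be written as a small classification routine. This is a sketch under the assumption that the access itself stays in the cache (x_i = 1); the function name classify and its string labels are hypothetical and serve only to illustrate the three ranges of the pre-allocation stack distance.

```python
from math import inf


def classify(s: float, d: int, n: int) -> str:
    """Classify an access by its pre-allocation stack distance s.

    d is the cache associativity and n is the SPM capacity (N in the text).
    """
    if s < d:
        return "always hit"           # s' <= s < d
    if s == inf or s >= d + n:
        return "always miss"          # at most n interferers can be removed
    return "depends on allocation"    # d <= s < d + n


# Direct-mapped cache (d = 1) with room for N = 2 objects in the SPM.
for s in (0, 1, 2, 3, inf):
    print(s, classify(s, d=1, n=2))
```

Only the accesses in the middle range need to be re-evaluated when candidate allocations are compared, which is what the second sum of Equation 6.4 expresses.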
Based on the discussion above, Equation 6.3 can be transformed into
Equation 6.4.
\[
M = \sum_{s_i \ge d+N} x_i + \sum_{d \le s_i < d+N} x_i \cdot m_i \tag{6.4}
\]
Since $m_i = 0$ if $s'_i < d$ and $m_i = 1$ if $s'_i \ge d$, we have
\[
m_i = U(s'_i - d) \tag{6.5}
\]
where $U$ is the unit step function:
\[
U(x) =
\begin{cases}
0, & \text{if } x < 0 \\
1, & \text{if } x \ge 0
\end{cases}
\tag{6.6}
\]
$s'_i$ can be written as:
\[
s'_i = \sum_{i - s_i \le j < i} x_j \tag{6.7}
\]
In the above equation, $j$ ranges over the distinct memory accesses between $a_i$ and
the most recent previous access to the same address as $a_i$.
Combining Equations 6.4, 6.5 and 6.7, the total number of cache misses $M$ of
the benchmark after the SPM allocation can be described by the following
equation:
\[
M = \sum_{s_i \ge d+N} x_i + \sum_{d \le s_i < d+N} x_i \cdot U\Big(\sum_{i - s_i \le j < i} x_j - d\Big) \tag{6.8}
\]
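As an illustration of Equations 6.7 and 6.8, the following Python sketch counts the misses for a given allocation by recomputing $s'_i$ for every access that stays in the cache. It is a sketch under stated assumptions rather than the dissertation's analyzer: the function name is hypothetical, allocation is done per cache block, and a block is assumed to map to set (block address mod number of sets), which matches the SI row of Table 6.3. The trace is the BA row of that table.

```python
from typing import Dict, List, Set


def misses_after_allocation(trace: List[int], in_spm: Set[int],
                            d: int, num_sets: int) -> int:
    """Count cache misses M for a block-address trace and an SPM allocation.

    For every access a_i outside the SPM (x_i = 1), s'_i is the number of
    distinct non-SPM blocks of the same set touched since the previous
    access to the same block; a_i misses when s'_i >= d or on first use.
    """
    misses = 0
    last_seen: Dict[int, int] = {}   # block -> index of its previous access
    for i, block in enumerate(trace):
        if block in in_spm:          # x_i = 0: served by the SPM
            continue
        prev = last_seen.get(block)
        last_seen[block] = i
        if prev is None:             # first use, s_i is infinite
            misses += 1
            continue
        interferers = {
            b for b in trace[prev + 1:i]
            if b not in in_spm and b % num_sets == block % num_sets
        }
        if len(interferers) >= d:    # U(s'_i - d) = 1
            misses += 1
    return misses


# Block-address (BA) row of Table 6.3; direct-mapped cache with 2 sets.
trace = [0, 0, 5, 2, 2, 3, 3, 4, 0, 5, 2, 2, 3, 3, 4, 0, 4, 5]
print(misses_after_allocation(trace, set(), d=1, num_sets=2))   # 13
print(misses_after_allocation(trace, {0, 4}, d=1, num_sets=2))  # 6
print(misses_after_allocation(trace, {0, 3}, d=1, num_sets=2))  # 5
```

The three results agree with the miss counts reported in Section 6.4.5 for no allocation, the HSA allocation, and the GSDA allocation.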
6.4.3 Stack Distance Based SPM Allocation Algorithms
Based on the unified stack distance analysis framework, we propose two SPM
allocation algorithms to minimize the total number of cache misses.
A Greedy Stack Distance Based SPM Allocation
To get the best performance for the HSC, our goal is to minimize the total
number of cache misses M in Equation 6.8. The problem is a
satisfiability (SAT) problem of the propositional calculus, which has been proved
to be NP-complete [81]. We first design a heuristic algorithm called the
Greedy Stack Distance based SPM Allocation algorithm, which is described in
Algorithm 9.
Algorithm 9 GSDA Algorithm
begin
  allocations ← empty;
  run simulation to get the memory access trace;
  calculate the stack distance for the memory accesses;
  while isNotFull(SPM) do
    M ← MAX_INT;
    B ← NULL;
    for each unallocated memory object b do
      allocate b into SPM;
      calculate the number of cache misses m using Equation 6.8;
      if m < M then
        M ← m;
        B ← b;
      end if
      unallocate b from the SPM;
    end for
    allocate B into the SPM;
    update the stack distance for the memory accesses;
  end while
  return allocations
end
The algorithm is a greedy one. In each iteration, this algorithm always tries
to find the memory object that can minimize the current cache misses and allocate
it into the SPM, and this process is repeated until the SPM is full. The complexity
of GSDA is $O(N \cdot N_i \cdot N_s)$, where $N$ is the number of memory objects, $N_i$ is the
number of instructions in the memory access trace of the given program, and $N_s$ is
the number of memory objects the SPM can hold.
An Optimal Stack Distance Based SPM Allocation Algorithm
Although the problem is computationally expensive to solve, an optimal
solution may still be derived by using model checking for small
benchmarks and small SPMs. The optimal results can be used as a basis to assess
how close the heuristic-based algorithms come to the optimum. In this chapter, we
use the SPIN model checker [8] to exhaustively and automatically check the developed
model to find the optimal solution. Like the other cache-aware algorithms developed
in this chapter, the SPM allocation in OSDA is also based on the cache line block
granularity (see Subsection 6.4.4 for details). We describe our OSDA allocation
model by using PROMELA (i.e. the verification modeling language of the SPIN
system), which is shown in Listing 6.1.
Listing 6.1. The SPIN Model for OSDA
 1 bit arraylb[n];
 2 int iAvailableSPM;
 3 int iCacheMiss;
 4 proctype allocation(){
 5   /*lbi */
 6   atomic{
 7     if
 8     :: 1 -> arraylb[i]=0;
 9        iAvailableSPM = iAvailableSPM - sizeof(lbi)
10     :: 1 -> arraylb[i]=1
11     fi;
12     if
13     :: iAvailableSPM == 0 -> goto endofallocation
14     :: iAvailableSPM < 0 -> arraylb[i]=1;
15        goto endofallocation
16     :: iAvailableSPM > 0 -> skip
17     fi;
18   }
19
20 endofallocation: skip;
21   d_step{
22     iCacheMiss = \sum_{s_i \ge d+N} arraylb[i]
23       + \sum_{d \le s_i < d+N} arraylb[i] \cdot [(\sum_{i - s_i \le j < i} arraylb[j] - d) \ge 0];
24     assert(iCacheMiss > TEST_VAL);
25   }
26 }
27 init{
28   atomic{
29     int i=0;
30     for(i: 0 .. n-1){
31       arraylb[i]=1
32     }
33     iAvailableSPM=M;
34     iCacheMiss=0;
35     run allocation();
36   }
37 }
For each line block of the given program, we generate an atomic allocation
sequence. The atomic sequence contains an if statement; because either
path of the if statement (i.e. whether allocating this line block into the SPM or
not) is executable, SPIN will choose one of them nondeterministically.
The variable iAvailableSPM is used to ensure that the total
size of the allocated blocks will not exceed the total SPM size. After the
allocation, the d_step statement (Lines 21-23) uses Equation 6.8 to calculate the
number of cache misses (i.e., iCacheMiss) for this allocation. An assert statement
(Line 24) is used to verify if there exists an allocation that can make iCacheMiss
less than or equal to the test value. The model checker will evaluate the assertion
as a part of its search of the state space. If an error is reported in the verification
stage, it indicates there is an allocation with the number of cache misses less than
or equal to the test value. In this case, we can check another value less than the
test value, until no error is reported. Then the allocation found for the last test
value that reports an error is the optimal allocation we are looking for. Therefore,
a binary search can be used
to find the allocation with the minimum value of iCacheMiss.
Algorithm 10 describes the OSDA allocation algorithm.
Algorithm 10 OSDA Algorithm
begin
  allocations ← empty;
  run simulation to get the memory access trace;
  calculate the stack distance for the memory accesses;
  upper_bound of TEST_VAL ← result of GSDA;
  lower_bound of TEST_VAL ← 0;
  while lower_bound < upper_bound do
    middle = (lower_bound + upper_bound)/2;
    TEST_VAL = middle;
    verify the model by SPIN;
    if no error is reported then
      lower_bound ← middle + 1;
    else
      upper_bound ← middle;
    end if
  end while
  return the allocation of upper_bound;
end
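The binary search over TEST_VAL can be sketched in Python as follows. This is not the SPIN-based implementation: the function check is a brute-force stand-in for one SPIN verification run, all names are hypothetical, line blocks are assumed to be equally sized so that the SPM capacity is a block count, and the miss counter simulates the direct-mapped, 2-set cache of the example in Section 6.4.5 on the BA row of Table 6.3. The sketch moves the lower bound to middle + 1 after a run with no error, so the search interval shrinks in every iteration.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Optional, Sequence, Tuple

Allocation = FrozenSet[int]
TRACE = [0, 0, 5, 2, 2, 3, 3, 4, 0, 5, 2, 2, 3, 3, 4, 0, 4, 5]  # Table 6.3, BA


def misses(alloc: Allocation) -> int:
    """Misses of a direct-mapped, 2-set cache when blocks in alloc are in the SPM."""
    resident: Dict[int, int] = {}    # set index -> block currently cached
    count = 0
    for block in TRACE:
        if block in alloc:
            continue
        index = block % 2
        if resident.get(index) != block:
            count += 1
            resident[index] = block
    return count


def check(blocks: Sequence[int], capacity: int,
          cost: Callable[[Allocation], int], test_val: int) -> Optional[Allocation]:
    """Stand-in for one SPIN run: return an allocation with cost <= test_val
    (the 'error reported' case) or None when no such allocation exists."""
    for k in range(capacity + 1):
        for combo in combinations(blocks, k):
            alloc = frozenset(combo)
            if cost(alloc) <= test_val:
                return alloc
    return None


def osda_search(blocks: Sequence[int], capacity: int,
                cost: Callable[[Allocation], int],
                upper: int, upper_alloc: Allocation) -> Tuple[int, Allocation]:
    """Binary search for the minimum cost, starting from a known achievable cost."""
    lower, best = 0, upper_alloc
    while lower < upper:
        middle = (lower + upper) // 2
        witness = check(blocks, capacity, cost, middle)
        if witness is None:
            lower = middle + 1       # no allocation reaches middle
        else:
            upper, best = middle, witness
    return upper, best


# Any achievable miss count is a valid starting upper bound; here the
# count without any allocation (13) is used instead of the GSDA result.
best_cost, best_alloc = osda_search([0, 2, 3, 4, 5], 2, misses, 13, frozenset())
print(best_cost, sorted(best_alloc))   # 5 [0, 3]
```

On this small example the exhaustive search confirms that 5 misses is the minimum for an SPM holding two line blocks, the same count that the GSDA reaches in Section 6.4.5.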
It is worth noting that to get the minimum number of cache misses, SPIN
must exhaustively search all the possible allocations (the whole state space). More
specifically, a state of a program is a set of values of its variables and location
counters. In our allocation model, the state can be described by the state vector
(arraylb[n], iAvailableSPM, iCacheMiss, i), where i is the location counter of
the program. A computation of a program is a sequence of states beginning with
the initial state and continuing with the states that occur as each statement is
executed. The verifier systematically checks that the correctness
specifications hold in all possible computations, which involves executing the
program and backtracking over each choice of the next statement to execute the
program nondeterministically.
To optimize the verification, we need to build a minimal allocation model in
SPIN. We only declare necessary variables, and the types of the variables are as
narrow as possible. We take advantage of the atomic and d_step statements to
avoid intermediate states and to execute the statements efficiently where possible.
Moreover, we only use the assert statement to check the allocation result and to verify the
correctness of the model. However, an array whose elements are of type bit or bool
is stored as an array of type byte (i.e., arraylb) in SPIN's implementation [82]. To
reduce the memory consumption of the state vector, we use multiple bit type
variables instead of the array of bytes in our implementation if the number of the
variables is less than 256. However, for the benchmarks that need more than 256
variables in the model, we still need to use arraylb because of the SPIN compiler
constraint.
Despite our efforts to reduce the memory consumption, the limitation of the
OSDA is state explosion. The whole state space includes all the possible
allocations. Suppose all the line block sizes are equal, the SPM can hold m line
blocks, and there are n line blocks for a given program; then the total number of
allocations is $C_n^m + C_n^{m-1} + \dots + C_n^0$. When the problem size increases, the state space
grows very fast and can quickly reach the upper limit of the physical memory
that is available in our experimental computer. Although SPIN provides some
techniques such as hash table, partial order reduction, collapse compression and
minimal automaton to reduce the memory consumption, they come at the cost of
much longer execution time. As a result, the optimal results for large problems
may not be practically solvable with limited resources (i.e., time and memory).
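The growth of the number of allocations can be checked numerically. The short Python sketch below evaluates the sum of binomial coefficients given above; the function name and the sample problem sizes are chosen only for illustration.

```python
from math import comb


def allocation_count(n: int, m: int) -> int:
    """Number of ways to place at most m of n equally sized line blocks in the SPM."""
    return sum(comb(n, k) for k in range(min(m, n) + 1))


print(allocation_count(5, 2))   # 16 = C(5,0) + C(5,1) + C(5,2)
for n in (20, 40, 80):
    print(n, allocation_count(n, 8))
```

Even with the SPM capacity fixed at 8 line blocks, the count rises steeply as the number of line blocks in the program grows, which is the state explosion discussed above.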
6.4.4 Side Effects of Basic Block Based Allocation
Many prior studies on pure SPM allocation for instructions are based on
basic blocks. However, we find that in the context of the HSC architecture, basic
block based allocation may obscure the impact of SPM
allocation on cache performance, which may lead to suboptimal results. As the
instructions of a program are typically placed contiguously in the main memory
and they are fetched into the cache in the unit of a cache line (also called line
block in this dissertation), it is possible that the instructions from two consecutive
basic blocks are fetched into the same cache line. Typically, this can happen with
the instructions at the end of a preceding basic block and the instructions at the
beginning of the following basic block. In this case, allocating the preceding basic
block into the SPM can affect the cache hits for the instructions in the following
basic block.
For example, as shown in Figure 6.1, a program segment contains three basic
blocks. For simplicity, assume the cache has two cache lines and each cache line
can hold two instructions, and there is an SPM whose size equals the size of the
cache. Inst 3 from BB1 and Inst 4 from BB2 are fetched into Line 2 of the
cache, and Inst 3 is placed at the head of Line 2. Before the SPM is used, the
number of cache misses is 4, which are caused by the accesses to Inst 1, Inst 3, Inst
5 and Inst 7 respectively. If the instructions are allocated into the SPM in the unit
of a basic block, and BB1 and BB3 are allocated into the SPM, then two cache
misses from Inst 1 and Inst 3 are eliminated. However, the access to Inst 4 now turns
into a cache miss, due to the elimination of spatial locality after Inst 3 is stored
into the SPM. So the number of cache misses after SPM allocation is 3. In contrast,
if we allocate the SPM space based on the line block granularity, Inst 1 through Inst 4
will all be stored into the SPM. Thus, the total number of cache misses after SPM
allocation becomes 2 (i.e. Inst 5 and Inst 7 are still misses but there is no new
miss).
To avoid this kind of side effect of basic block based allocation, we adopt line
blocks as the allocation unit in SPM allocation for all three cache-aware
algorithms: HSA, GSDA and OSDA. Also, we assume virtual memory support.
After SPM allocation, the virtual-to-physical address mapping is updated to ensure
the correct execution.
Figure 6.1. The example of side effect of basic block based allocation.
6.4.5 An Example To Compare HSA and GSDA
Compared to the HSA, which only considers cache performance before SPM
allocation, the unified SDA framework for the HSC enables both the GSDA and
the OSDA to take into account the impact of allocating each SPM line block on
the cache performance. The updated cache performance provides more accurate
information to guide the allocation of the next SPM line block, which can
result in better SPM allocation and higher cache performance. To illustrate the
differences between the HSA and the GSDA, we provide an example whose control
flow graph is shown in Figure 6.2. There are 6 basic blocks (from BB0 to BB5),
and BB3 has a back-edge to BB1, which iterates 2 times. Therefore, the basic
blocks in this loop may have higher access frequencies than other basic blocks.
The configuration parameters of on-chip memories in the example are listed
in Table 6.1. The SPM and cache have the same size, i.e., 4 words, and the line
block size is 2 words. The execution order of basic blocks is listed in Table 6.2, and
the memory address trace before the SPM allocation is shown in Table 6.3.
The stack distance for each memory access before the SPM allocation is
calculated by using the cache block address (BA) instead of the instruction memory
address (A) to preserve the spatial locality. One stack is maintained for each cache set
to calculate the stack distance s. In this example, there are 2 cache sets, so the set
index (SI) is either 0 or 1. Based on SDA, if $s \ge d$ (cache associativity), the
memory access results in a cache miss. So the total number of cache misses before
the SPM allocation is 13. Table 6.4 provides the 6 line blocks of this program;
line block 1 is not executed and will not be considered in the allocation. The
numbers of cache misses of each line block before the SPM allocation are
calculated and shown in Table 6.4 as well.
Figure 6.2. The control flow graph of the example code segment.
If the HSA algorithm is used, the line blocks 0 and 4 will be stored into the
SPM because they are the first two from the list of line blocks sorted in the
descending order of the number of cache misses. The cache accesses after the
allocation are shown in Table 6.5 (note the accesses to the SPM are not shown
here), and the total number of cache misses becomes 6.
Parameter               Value
instruction size        1
SPM size                4
cache size              4
cache line size         2
number of cache lines   2
associativity (d)       1
Table 6.1. The SPM and cache parameters used in the example (sizes in words).

BB   0  1  5  3  1  5  3  4
Table 6.2. The execution sequence of the basic blocks.

A    0  1 11  4  5  6  7  8  1 11  4  5  6  7  8  1  9 10
BA   0  0  5  2  2  3  3  4  0  5  2  2  3  3  4  0  4  5
SI   0  0  1  0  0  1  1  0  0  1  0  0  1  1  0  0  0  1
SD   1  0  1  1  0  1  0  1  2  1  2  0  1  0  2  2  1  1
Table 6.3. The memory access trace before allocation. M=13.
(A: instruction address, BA: block address, SI: set index, SD:
stack distance, M: number of cache misses)

LB      0      1      2      3      4      5
INSTR   0, 1   2, 3   4, 5   6, 7   8, 9   10, 11
M       3      0      2      2      3      3
Table 6.4. The number of cache misses of the line blocks before
the SPM allocation. (LB: line block, INSTR: instruction, M:
number of cache misses)

A    11  4  5  6  7 11  4  5  6  7 10
BA    5  2  2  3  3  5  2  2  3  3  5
SI    1  0  0  1  1  1  0  0  1  1  1
SD    1  1  0  1  0  1  0  0  1  0  1
h/m   m  m  h  m  h  m  h  h  m  h  m
Table 6.5. The memory access trace and cache misses after the
SPM allocation by the HSA. (A: instruction address, BA: block
address, SI: set index, SD: stack distance)

If the GSDA algorithm is used instead, all the line blocks are checked to find
the first candidate as shown in Table 6.6. We can calculate the stack distance
directly by using Equation 6.8 and get the number of cache misses for each case.
We choose line block 0 as the first candidate to minimize the current number of
cache misses. Then all the remaining line blocks are checked to find the second
candidate, which is line block 3 as shown in Table 6.7. So the total number of
cache misses after the SPM allocation by the GSDA is 5, which is smaller than
that of the HSA.
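The comparison in this example can be reproduced with a short simulation. The Python sketch below is illustrative only: it recounts misses by simulating the direct-mapped, 2-set cache on the block-address trace of Table 6.3 rather than by evaluating Equation 6.8, and it breaks ties in favor of the lowest line block number, which is an assumption that matches the choices reported in the text. All function and constant names are hypothetical.

```python
from typing import Dict, List, Set

# Block-address (BA) row of Table 6.3; block b maps to cache set b % 2.
TRACE = [0, 0, 5, 2, 2, 3, 3, 4, 0, 5, 2, 2, 3, 3, 4, 0, 4, 5]
NUM_SETS = 2
SPM_BLOCKS = 2   # 4-word SPM holding 2-word line blocks


def simulate(in_spm: Set[int]) -> Dict[int, int]:
    """Per-block miss counts of the direct-mapped cache, skipping SPM blocks."""
    resident: Dict[int, int] = {}    # set index -> block currently cached
    misses: Dict[int, int] = {}
    for block in TRACE:
        if block in in_spm:
            continue
        index = block % NUM_SETS
        if resident.get(index) != block:
            misses[block] = misses.get(block, 0) + 1
            resident[index] = block
    return misses


def total(in_spm: Set[int]) -> int:
    return sum(simulate(in_spm).values())


def hsa() -> Set[int]:
    """Pick the blocks with the most misses in a single profile taken up front."""
    profile = simulate(set())
    ranked = sorted(profile, key=lambda b: (-profile[b], b))
    return set(ranked[:SPM_BLOCKS])


def gsda() -> Set[int]:
    """Pick one block at a time, re-evaluating the misses after every choice."""
    chosen: Set[int] = set()
    candidates: List[int] = sorted(set(TRACE))
    while len(chosen) < SPM_BLOCKS:
        best = min((b for b in candidates if b not in chosen),
                   key=lambda b: total(chosen | {b}))
        chosen.add(best)
    return chosen


print(sorted(hsa()), total(hsa()))     # [0, 4] 6
print(sorted(gsda()), total(gsda()))   # [0, 3] 5
```

The HSA ranks line blocks once, so it cannot see that removing line block 3 leaves set 1 with a single block, whereas the GSDA recomputes the miss count after line block 0 has been placed and therefore finds that choice.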
6.5 EVALUATION METHODOLOGY
We use the Trimaran compiler [40] to compile the benchmarks into the binary
code for the target processor. The address information obtained from the compiler
is used in the stack distance analyzer. The SPM allocator conducts the SPM
allocation based on different algorithms and the SPIN model checker is used for
the OSDA algorithm. The Trimaran simulator is used to simulate and report the
performance of each benchmark. The model checking experiments are executed on
a machine with an Intel Core i7 2.8GHz CPU and 16GB memory.
In our experiments, the baseline processor has 2 integer ALUs, 2 float ALUs,
1 branch predictor, 1 load/store unit, and 1-level on-chip memory. In our default
configuration, to focus on the hybrid instruction SPM-Cache, the data cache is 128
bytes without any data SPM. The instruction HSC includes a 64 byte instruction
SPM and a 64 byte instruction cache. The parameters of the caches include: 32
byte block size, direct-mapped, and LRU replacement policy. A cache hit takes 1
cycle and a main memory access takes 20 cycles. We also use two other
configurations to do the sensitivity experiments, all of which are shown in Table
128
(a) Allocate line block 0 into the SPM. M=9.
A 11 4 5 6 7 8 11 4 5 6 7 8 9 10
BA 5 2 2 3 3 4 5 2 2 3 3 4 4 5
SI 1 0 0 1 1 0 1 0 0 1 1 0 0 1
SD 1 1 0 1 0 1 1 1 0 1 0 1 0 1
(b) Allocate line block 2 into the SPM. M=11.
A 0 1 11 6 7 8 1 11 6 7 8 1 9 10
BA 0 0 5 3 3 4 0 5 3 3 4 0 4 5
SI 0 0 1 1 1 0 0 1 1 1 0 0 0 1
SD 1 0 1 1 0 1 1 1 1 0 1 1 1 1
(c) After allocate line block 3 into the SPM. M=9.
A 0 1 11 4 5 8 1 11 4 5 8 1 9 10
BA 0 0 5 2 2 4 0 5 2 2 4 0 4 5
SI 0 0 1 0 0 0 0 1 0 0 0 0 0 1
SD 1 0 1 1 0 1 2 0 2 0 2 2 1 0
(d) Allocate line block 4 into the SPM. M=9.
A 0 1 11 4 5 8 1 11 4 5 8 1 9 10
BA 0 0 5 2 2 4 0 5 2 2 4 0 4 5
SI 0 0 1 0 0 0 0 1 0 0 0 0 0 1
SD 1 0 1 1 0 1 2 0 2 0 2 2 1 0
(e) Allocate line block 5 into the SPM. M=9.
A 0 1 4 5 6 7 8 1 4 5 6 7 8 1 9
BA 0 0 2 2 3 3 4 0 2 2 3 3 4 0 4
SI 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0
SD 1 0 1 0 1 0 1 2 2 0 0 0 2 2 1
Table 6.6. Checking the cache misses for each line block to
identify the rst candidate by the GSDA-based SPM allocation.
(A: instruction address, BA: block address, SI: set index, SD:
stack distance, M: number of cache misses).
129
(a) Allocate line block 2 into the SPM.
M=6.
A 11 6 7 8 11 6 7 8 9 10
BA 5 3 3 4 5 3 3 4 4 5
SI 1 1 1 0 1 1 1 0 0 1
SD 1 1 0 1 1 1 0 0 0 1
(b) Allocate line block 3 into the SPM.
M=5.
A 11 4 5 8 11 4 5 8 9 10
BA 5 2 2 4 5 2 2 4 4 5
SI 1 0 0 0 1 0 0 0 0 1
SD 1 1 0 1 0 1 0 1 0 0
(c) Allocate line block 4 into the SPM.
M=6.
A 11 4 5 6 7 11 4 5 6 7 10
BA 5 2 2 3 3 5 2 2 3 3 5
SI 1 0 0 1 1 1 0 0 1 1 1
SD 1 1 0 1 0 1 0 0 1 0 1
(d) Allocate line block 5 into the SPM.
M=5.
A 11 4 5 8 11 4 5 8 9 10
BA 5 2 2 4 5 2 2 4 4 5
SI 1 0 0 0 1 0 0 0 0 1
SD 1 1 0 1 0 1 0 1 0 0
Table 6.7. Checking the cache misses for each line block to iden-
tify the second candidate by the GSDA based SPM allocation.
(A: instruction address, BA: block address, SI: set index, SD:
stack distance, M: number of cache misses).
Conguration I-SPM I-Cache D-Cache
default 64B 64B, 32B cache line 128B, 32B cache line
conguration I 128B 128B, 32B cache line 256B, 32B cache line
conguration II 64B 64B, 16B cache line 128B, 16B cache line
Table 6.8. Three memory congurations in our experiments.
130
Benchmark Description Code size (bytes) Total exe cycles
crc cyclic redundancy check computation on 40 bytes of data 664 65314
edn nite impulse response (FIR) lter calculations 13504 162944
lms lms adaptive signal enhancement 2136 1015329
matmult matrix multiplication of two 20 20 matrices 480 395755
ndes complex embedded code 3580 336728
statemate automatically generated code 10476 8829
Table 6.9. General information of all benchmarks
6.8.
We randomly select 6 real-time benchmarks from Malardalen WCET
benchmark suit [41] for the experiments. The salient characteristics of all
benchmarks are shown in Table 6.9.
6.6 EXPERIMENTAL RESULTS
6.6.1 Performance of Different Algorithms
Table 6.10 shows the cache misses of all 4 allocation algorithms for all the
benchmarks in our default configuration. As we can see, by using the cache-aware
SPM allocation algorithms at line block granularity, the instruction cache misses of
all three cache-aware algorithms are decreased compared to the baseline FSA.
Among the three cache-aware algorithms, both the GSDA and the OSDA
reduce the cache misses more than the HSA for all the benchmarks. We also
observe that for all benchmarks except statemate, the GSDA achieves the same
results as the OSDA, which are optimal. Even for the benchmark statemate, the
result of the GSDA is only about 1% worse than that of the OSDA.
Benchmark   FSA     HSA     GSDA    OSDA
crc         2111    1902    1363    1363
edn         4784    3134    2883    2883
lms         24833   19088   18711   18711
matmult     1712    1293    932     932
ndes        9759    9448    9236    9236
statemate   199     197     197     195
Table 6.10. The cache misses of all 4 allocation algorithms in
default configuration.
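The percentages quoted for Table 6.10 can be rechecked with a few lines of
Python. This is only an arithmetic sketch over the numbers in the table; the
helper names `reduction` and `gap_to_optimal` are hypothetical, and the gap is
expressed relative to the OSDA result, which is an assumption about how the
"about 1%" figure was obtained.

```python
# Cache misses copied from Table 6.10 (default configuration).
MISSES = {
    "crc":       {"FSA": 2111,  "HSA": 1902,  "GSDA": 1363,  "OSDA": 1363},
    "edn":       {"FSA": 4784,  "HSA": 3134,  "GSDA": 2883,  "OSDA": 2883},
    "lms":       {"FSA": 24833, "HSA": 19088, "GSDA": 18711, "OSDA": 18711},
    "matmult":   {"FSA": 1712,  "HSA": 1293,  "GSDA": 932,   "OSDA": 932},
    "ndes":      {"FSA": 9759,  "HSA": 9448,  "GSDA": 9236,  "OSDA": 9236},
    "statemate": {"FSA": 199,   "HSA": 197,   "GSDA": 197,   "OSDA": 195},
}


def reduction(benchmark, algorithm):
    """Percentage of FSA cache misses removed by the given algorithm."""
    row = MISSES[benchmark]
    return 100.0 * (row["FSA"] - row[algorithm]) / row["FSA"]


def gap_to_optimal(benchmark):
    """Percentage by which the GSDA miss count exceeds the OSDA miss count."""
    row = MISSES[benchmark]
    return 100.0 * (row["GSDA"] - row["OSDA"]) / row["OSDA"]


for name in MISSES:
    print(f"{name:10s} HSA {reduction(name, 'HSA'):5.1f}%  "
          f"GSDA {reduction(name, 'GSDA'):5.1f}%  "
          f"gap to OSDA {gap_to_optimal(name):4.2f}%")
```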
Figure 6.3 compares the overall performance of different algorithms in the
default configuration, which is normalized to the total number of execution cycles
of the FSA. We find that the HSA and the GSDA improve the overall
performance by 9% and 11% respectively on average. The OSDA achieves the best
performance for all benchmarks. However, the GSDA achieves the same
performance as the OSDA for all the benchmarks except the benchmark
statemate, for which it is only 0.5% worse than the OSDA.
Figure 6.3. The performance of all 4 algorithms in default
configuration, which is normalized to the total number of execution
cycles of the FSA.
Although the GSDA and the OSDA can reduce more cache misses and achieve
better performance than both the FSA and the HSA, it should be noted that both
the GSDA and the OSDA have higher time complexity and thus can take much
longer to compute. Table 6.11 presents the running time of SPM allocation
for each algorithm. Since our SPM allocation algorithms are all static, the
allocation is performed only once, and the results of the allocation are used
many times in the benchmark execution on the HSC platforms. Therefore, it is
worthwhile for a designer to spend more time offline to find a better allocation
that improves the run-time performance of the hybrid SPM-Cache system. For most
benchmarks, both the FSA and the HSA can finish allocation within 1 millisecond,
while the GSDA and the OSDA take significantly longer.
It should be noted that the running time of the OSDA varies with the
upper bound chosen for TEST_VAL, as listed in Table 6.11. OSDA(1) is only
the time to perform one round of verification by checking all possible allocations.
The actual allocation time can be O(lg n) times the verification time in the
worst case, where n is the upper bound set for TEST_VAL in Algorithm 10. In
case there is no reference result that can be used from other allocations, we can
simply use the number of cache misses without the SPM as the initial TEST_VAL
(OSDA(M) in Table 6.11). Otherwise, we can set the initial TEST_VAL to the
result of the FSA (OSDA(F) in Table 6.11), the HSA (OSDA(H) in Table 6.11), or
the GSDA (OSDA(G) in Table 6.11). In practice, since we may already have the
cache miss result from running the GSDA first, TEST_VAL can be set to the
result of the GSDA minus 1. Because the GSDA finds optimal or near-optimal
results in most cases, the OSDA may need only one round of verification
instead of the binary search. As a result, the execution time of the OSDA can be
greatly reduced for small benchmarks. However, as the benchmark size increases,
the running time of the OSDA increases much faster than that of the GSDA. For
example, the running time of the OSDA for one round of verification for edn is
about 3230 times longer than that for crc, while it is only about 55 times longer
for the GSDA.
Benchmark   FSA     HSA     GSDA   OSDA(1)   OSDA(F)     OSDA(H)     OSDA(G)   OSDA(M)
crc         0.586   0.605   63     1         11.586      11.605      74        13
edn         1.922   0.812   3461   3230      38761.922   38760.812   42221     38760
lms         0.686   0.602   1815   50        750.686     750.602     2565      750
matmult     0.669   0.599   205    1         11.669      10.599      215       14
ndes        0.770   0.790   896    410       5330.77     5330.79     6636      5740
statemate   0.750   0.726   40     40        320.75      320.726     360       280
average     0.897   0.689   1080   622       7531.231    7530.856    8678.5    7592.833
Table 6.11. The running time (in msec) of all 4 allocation
algorithms in default configuration.
6.6.2 Sensitivity to the Cache Size
For the sensitivity study, we run two more groups of experiments. First, the size
of both the SPM and the cache is increased to 128 bytes while keeping the same
cache line size as the default configuration, which gives Configuration I. Second,
the cache line size is decreased to 16 bytes while using the same size of the SPM
and the cache as the default configuration, which gives Configuration II.
Both Configurations I and II increase the number of line blocks that the SPM
can hold, thus increasing the complexity of the allocation problem.
Table 6.12 gives the number of cache misses for different algorithms in
Configuration I. Due to the increase of the SPM and cache size, the number of
cache misses decreases as compared to that of the default configuration. With a
larger SPM and cache size, we observe that all three cache-aware algorithms
still lead to far fewer cache misses than the FSA algorithm, and both the GSDA
and the OSDA are superior to the HSA. For all benchmarks, the GSDA
achieves results either the same as or very close to those of the OSDA. For the
three benchmarks on which the GSDA does not achieve the optimal results, i.e., edn,
ndes and statemate, the GSDA is only 0.1%, 3.7% and 1.1% worse than the
OSDA respectively.
Benchmark FSA HSA GSDA OSDA
crc 26 19 17 17
edn 1896 1695 1598 1596
lms 21320 19060 9122 9122
matmult 55 15 11 11
ndes 7919 7250 7031 6770
statemate 195 193 193 191
Table 6.12. The cache misses of all 4 allocation algorithms with
Configuration I.
Figure 6.4 shows the overall performance of all four algorithms in
Configuration I, which is normalized to the total execution cycles of the FSA
under Configuration I. On average, the HSA, the GSDA, and the OSDA
improve the overall performance of the FSA by 2.8%, 7.7%, and 8.4% respectively,
indicating the effectiveness of cache-aware SPM allocation.
Figure 6.4. The performance of all 4 algorithms with Configuration I,
which is normalized to the total execution cycles of the
FSA under Configuration I.
6.6.3 Sensitivity to the Block Size
The numbers of cache misses of different algorithms in Configuration II are
presented in Table 6.13. As the block size decreases, the number of cache misses
increases for most benchmarks due to the reduced spatial locality. In
Configuration II, the HSA and the GSDA decrease the cache misses by
22.3% and 28.5% respectively on average as compared to the FSA. For all the
benchmarks whose optimal results can be obtained by the OSDA, the GSDA achieves
results either the same as or very close to those of the OSDA.
Figure 6.5 compares the performance of all four algorithms in
Configuration II, which is normalized to the total execution cycles of the FSA
under Configuration II. The HSA and the GSDA improve the overall
performance of the FSA by 9.6% and 10.5% respectively on average. The GSDA
attains the optimal performance for the benchmarks crc, edn, lms and ndes. For
matmult and statemate, the GSDA is only 0.1% and 0.15% worse than the OSDA.
Benchmark FSA HSA GSDA OSDA
crc 1386 1384 1381 1381
edn 8402 5201 4699 4699
lms 42065 31087 27435 27435
matmult 2642 1063 1063 1044
ndes 16236 16153 15855 15855
statemate 388 386 386 385
Table 6.13. The cache misses of all 4 allocation algorithms with
Configuration II.
Figure 6.5. The performance of all 4 algorithms with Configuration II,
which is normalized to the total execution cycles of the
FSA under Configuration II.
6.6.4 Running Time Under Configurations I and II
We have also compared the running time of all SPM allocation algorithms for
both Configurations I and II, which is shown in Table 6.14 and Table 6.15
respectively. It can be observed that the allocation time of the FSA and the HSA
does not vary much across configurations. The allocation time of the
GSDA in Configurations I and II is increased by 1.8 and 4.1 times respectively on
average as compared to that of the default configuration, due to the increased
number of line blocks in the SPM. However, for the OSDA, the running time
increases dramatically. Taking the benchmark lms for example, the verification
time for the default configuration is only 50ms, while it becomes 292 times longer
in Configuration I and 8520 times longer in Configuration II. This is because in the
OSDA algorithm, as the problem size grows, the number of states explodes, thus
requiring deeper compression to solve the problem. Table 6.16 shows the
compression ratio of the memory usage for all configurations. The higher the
compression ratio, the longer it may take to finish the verification.
Benchmark   FSA     HSA     GSDA       OSDA(1)
crc         0.745   0.562   95         10
edn         0.859   0.852   6409       6.04E+08
lms         0.851   0.937   3388       14600
matmult     0.788   0.687   391        10
ndes        0.801   0.715   1611       3480000
statemate   0.82    0.773   60         87800
average     0.811   0.754   1992.333   1.01E+08
Table 6.14. The allocation time (in msec) of all 4 allocation
algorithms in Configuration I.
Benchmark   FSA     HSA     GSDA     OSDA(1)
crc         0.608   0.618   175      190
edn         0.757   0.792   14455    1.71E+09
lms         0.681   0.726   7361     426000
matmult     0.693   0.677   754      160
ndes        0.749   0.829   3527     39600000
statemate   0.732   0.799   160      23800000
average     0.703   0.740   4405.3   2.96E+08
Table 6.15. The allocation time (in msec) of all 4 allocation
algorithms in Configuration II.
Benchmark   default   configuration I   configuration II
crc         None      0.77%             15.56%
edn         2.32%     99.99%            99.99%
lms         None      13.99%            38.42%
matmult     None      0.62%             15.23%
ndes        10.57%    97.61%            99.71%
statemate   5.3%      11.14%            99.99%
Table 6.16. The compression ratio of the OSDA model during verification.
6.7 CONCLUSIONS
In this chapter, we develop 4 SPM allocation algorithms for the HSC
architecture: FSA, HSA, GSDA and OSDA. While the FSA is cache-unaware, the
other three algorithms are aware of the cache, and can therefore reduce more
cache misses and achieve much better performance than the FSA.
Both the GSDA and the OSDA are based on the unified stack distance
analysis framework, which can consider the interaction between the SPM
allocation and the cache performance. The OSDA is an optimal algorithm based
on model checking; however, it may take significantly more memory and longer
time to run as compared to other algorithms. The GSDA is a greedy algorithm,
which can run efficiently and achieve optimal or near-optimal results for most
benchmarks. Overall, we believe the GSDA is a good SPM allocation algorithm to
harness the full potential of the HSC architecture efficiently.
In our future work, we would like to explore both heuristic-based and
optimal cache-aware SPM allocation algorithms for data accesses as well. Also, we
plan to study SDA-based dynamic SPM allocation for the HSC architecture.
CHAPTER 7
CACHE-AWARE SPM ALLOCATION FOR MAXIMIZING ENERGY
EFFICIENCY ON HYBRID SPM-CACHE ARCHITECTURE
7.1 CHAPTER OVERVIEW
The traditional SPM allocation algorithms, including both static and
dynamic allocation, mainly focus on the SPM alone and are cache-unaware. They
are unlikely to harness the full potential of the hybrid SPM and cache. We also
believe that the SPM allocation for the hybrid SPM-Cache architecture must be
aware of the cache performance to maximally optimize the energy consumption.
In this chapter, we design two energy-oriented algorithms: the Greedy Stack
Distance based Allocation for Energy (GSDA-E) and the Optimal Stack Distance
based Allocation for Energy (OSDA-E), which are extended from the GSDA and
the OSDA in chapter 6. We also comparatively evaluate all 6 different SPM
allocation algorithms, including the 4 SPM allocation algorithms in chapter 6. The
first one is the Frequency based SPM Allocation (FSA), which is not aware of the
cache and is used as the baseline. The other five algorithms are all cache-aware,
but exploit cache information in different ways. The Hybrid SPM-Cache
Allocation (HSA) exploits cache profiling information. It tries to allocate the
memory objects with the largest cache misses into the SPM. The remaining four
algorithms are all based on the Stack Distance Analysis (SDA) [11], [12], including
two performance-oriented algorithms, i.e., the Greedy Stack Distance based
Allocation (GSDA) and the Optimal Stack Distance based Allocation (OSDA),
and two energy-oriented algorithms, i.e., the Greedy Stack Distance based
Allocation for Energy (GSDA-E) and the Optimal Stack Distance based Allocation
for Energy (OSDA-E). The GSDA and the GSDA-E are greedy algorithms, whereas
the OSDA and the OSDA-E are optimal algorithms that use model checking. More
details of these algorithms are given in the rest of the chapter.
As the first step to exploiting the tight interaction between SPM allocation
and cache energy in the HSC architecture, our study focuses on an
instruction SPM-Cache only, as in chapter 6. In the instruction HSC, both
the SPM and the cache are used for storing instructions only. We plan to explore
hybrid SPM-Caches for data accesses in our future work.
This chapter makes the following main contributions.
• We develop the heuristic-based GSDA-E algorithm with polynomial time
complexity and the optimal OSDA-E algorithm based on model checking,
both of which are built upon the unified stack distance analysis framework we
proposed in chapter 6.
• We have implemented and compared all six SPM allocation algorithms,
and find that the GSDA-E can reduce the energy to a level either the same as
or close to the optimal results attained by the OSDA-E, while achieving
performance close to that of the OSDA and the GSDA.
symbol    meaning
d         associativity
N         the maximal number of memory objects stored in the SPM
A         the total number of memory accesses
a_i       a memory access
x_i       binary: a memory access is in the SPM (0) or not (1)
m_i       binary: a memory access is a hit (0) or a miss (1)
M         total number of cache misses
e_h       cache energy consumption per access
e_s       SPM energy consumption per access
e_M       main memory energy consumption per access
e_m       e_h + e_M
E_mem     energy consumption of the memory subsystem
E_total   total energy consumption of the processor and memory
Table 7.1. The symbols used in the equations.
7.2 STACK DISTANCE BASED SPM ALLOCATION
ALGORITHMS FOR ENERGY
7.2.1 Stack Distance Analysis for HSC on Energy Consumption
We list all the symbols used in this chapter in Table 7.1.
When exploring the energy consumption of the memory accesses, i.e., $E_{mem}$,
we have Equation 7.1,
\[ E_{mem} = \sum_{\forall a_i} e_i \tag{7.1} \]
in which $e_i$ is the energy consumption of memory access $a_i$, and
\[ e_i = \begin{cases}
e_s, & \text{if } a_i \text{ is in SPM} \\
e_h, & \text{if } a_i \text{ hits in cache} \\
e_m, & \text{if } a_i \text{ misses in cache}
\end{cases} \tag{7.2} \]
Therefore, the total energy consumption of the memory accesses can be
calculated as follows:
\[ E_{mem} = \sum_{\forall a_i} \left[ (1 - x_i) \cdot e_s + x_i \cdot (1 - m_i) \cdot e_h + x_i \cdot m_i \cdot e_m \right] \tag{7.3} \]
Combining with Equation 7.4 derived from chapter 6, Equation 7.3 can be written
as Equation 7.5, where $A$ is the total number of memory accesses, and $e_M$ is
the main memory energy consumption per access.
\[ M = \sum_{s_i \ge d+N} x_i + \sum_{d \le s_i < d+N} x_i \cdot U\Bigl( \sum_{i - s_i \le j < i} x_j - d \Bigr) \tag{7.4} \]
\[ \begin{aligned}
E_{mem} &= A \cdot e_s + (e_h - e_s) \sum_{\forall a_i} x_i + (e_m - e_h) \cdot M \\
&= A \cdot e_s + (e_h - e_s) \sum_{\forall a_i} x_i
 + e_M \cdot \Bigl( \sum_{s_i \ge d+N} x_i + \sum_{d \le s_i < d+N} x_i \cdot U\Bigl( \sum_{i - s_i \le j < i} x_j - d \Bigr) \Bigr)
\end{aligned} \tag{7.5} \]
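The step from Equation 7.3 to Equation 7.5 can be spelled out term by term. The
sketch below assumes that the total number of cache misses is
$M = \sum_{\forall a_i} x_i m_i$, i.e., that a miss is only counted for an
access that goes to the cache; this reading follows from the definitions of
$x_i$ and $m_i$ in Table 7.1 but is not stated explicitly in the text.

```latex
\begin{align*}
E_{mem} &= \sum_{\forall a_i} \bigl[ (1 - x_i)\, e_s + x_i (1 - m_i)\, e_h + x_i m_i\, e_m \bigr] \\
        &= e_s \sum_{\forall a_i} 1 \;-\; e_s \sum_{\forall a_i} x_i
           \;+\; e_h \sum_{\forall a_i} x_i \;-\; e_h \sum_{\forall a_i} x_i m_i
           \;+\; e_m \sum_{\forall a_i} x_i m_i \\
        &= A\, e_s + (e_h - e_s) \sum_{\forall a_i} x_i + (e_m - e_h)\, M \\
        &= A\, e_s + (e_h - e_s) \sum_{\forall a_i} x_i + e_M\, M ,
           \qquad \text{since } e_m = e_h + e_M .
\end{align*}
```

Substituting Equation 7.4 for $M$ in the last line gives the second form of
Equation 7.5.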
7.2.2 Exploit Cache Stack Distance to Improve SPM Allocation
A Greedy Stack Distance Based SPM Allocation Algorithm for Energy
(GSDA-E)
Similar to the GSDA, we design a heuristic algorithm called the Greedy Stack
Distance Based SPM Allocation Algorithm for Energy (GSDA-E), which minimizes
the memory energy using Equation 7.5 and is described in Algorithm 11. As with
the GSDA, the complexity of the GSDA-E is O(N x Ni x Ns), where N is the number
of memory objects, Ni is the number of instructions in the memory access trace
of the given program, and Ns is the number of memory objects the SPM can hold.
Algorithm 11 GSDA-E Allocation Algorithm
begin
  allocations ← empty;
  run simulation to get the memory access trace;
  calculate the stack distance for the memory accesses;
  while isnotfull(SPM) do
    Emem ← MAX_INT;
    B ← NULL;
    for each unallocated memory unit b do
      allocate b into the SPM;
      calculate the total memory energy emem using Equation (7.5);
      if emem < Emem then
        Emem ← emem;
        B ← b;
      end if
      deallocate b from the SPM;
    end for
    allocate B to the SPM and add B to allocations;
    update the stack distance for the memory accesses;
  end while
  return allocations
end
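The loop of Algorithm 11 can be sketched in Python on the small trace of
Table 6.3. This is a minimal illustration, not the implementation evaluated in
this chapter: the per-access energies `E_S`, `E_H` and `E_MAIN` are made-up
placeholder values (not Cacti estimates), the cache geometry is the one inferred
from Table 6.3, the miss count is recomputed from scratch for each candidate
rather than updated incrementally, and all function names are hypothetical.

```python
from collections import defaultdict

TRACE = [0, 1, 11, 4, 5, 6, 7, 8, 1, 11, 4, 5, 6, 7, 8, 1, 9, 10]  # Table 6.3
LINE_SIZE, NUM_SETS, ASSOC = 2, 2, 1   # geometry assumed from Table 6.3
E_S, E_H, E_MAIN = 1.0, 2.0, 20.0      # placeholder energies for e_s, e_h, e_M


def cache_misses(blocks):
    """Count misses of a block-address stream via per-set LRU stack distance."""
    stacks = defaultdict(list)           # set index -> blocks, most recent first
    misses = 0
    for block in blocks:
        stack = stacks[block % NUM_SETS]
        if block in stack:
            if stack.index(block) >= ASSOC:
                misses += 1
            stack.remove(block)
        else:
            misses += 1                  # a first access is treated as a miss
        stack.insert(0, block)
    return misses


def memory_energy(trace, spm_blocks):
    """Equation 7.5: A*e_s + (e_h - e_s)*(number of cache accesses) + e_M*M."""
    cached = [a // LINE_SIZE for a in trace if a // LINE_SIZE not in spm_blocks]
    return (len(trace) * E_S
            + (E_H - E_S) * len(cached)
            + E_MAIN * cache_misses(cached))


def gsda_e(trace, spm_capacity):
    """Greedily add the line block that minimises the memory energy."""
    allocations = []
    candidates = sorted({a // LINE_SIZE for a in trace})
    while len(allocations) < spm_capacity:
        best, best_energy = None, float("inf")
        for block in candidates:
            if block in allocations:
                continue
            energy = memory_energy(trace, set(allocations) | {block})
            if energy < best_energy:
                best, best_energy = block, energy
        if best is None:                 # nothing left to allocate
            break
        allocations.append(best)
    return allocations


print(memory_energy(TRACE, set()))       # energy without any SPM allocation
print(gsda_e(TRACE, 2))                  # line blocks chosen for a 2-block SPM
```

With these placeholder energies the main memory term dominates, so the sketch
picks the same line blocks as the miss-driven greedy selection; a different
ratio between the SPM and cache energies can change the choice.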
An Optimal Stack Distance Based SPM Allocation Algorithm for
Energy Optimization (OSDA-E)
An optimal solution, the OSDA-E, can also be designed by using model checking
to optimally reduce the energy consumption for the HSC. Algorithm 12, which is
similar to the one used by the OSDA, is used by the OSDA-E. The allocation
model of the OSDA-E in PROMELA (the verification modeling language of
the SPIN system) is shown in Listing 7.1.
Listing 7.1. The SPIN Model for OSDA-E
bit arraylb[n];
int iAvailableSPM;
int iEnergy;
proctype allocation(){
  /* lb_i */
  atomic{
    if
    :: 1 -> arraylb[i]=0;
            iAvailableSPM = iAvailableSPM - sizeof(lbi)
    :: 1 -> arraylb[i]=1
    fi;
    if
    :: iAvailableSPM == 0 -> goto endofallocation
    :: iAvailableSPM < 0 -> arraylb[i]=1;
                            goto endofallocation
    :: iAvailableSPM > 0 -> skip
    fi;
  }

endofallocation: skip;
  d_step{
    iEnergy = A*es + (eh - es) * Σ_{∀ai} arraylb[i]
            + eM * ( Σ_{si ≥ d+N} arraylb[i]
                   + Σ_{d ≤ si < d+N} arraylb[i] * [(Σ_{i-si ≤ j < i} arraylb[j] - d) ≥ 0] );
    assert(iEnergy > TEST_VAL);
  }
}
init{
  atomic{
    int i=0;
    for(i: 0 .. n-1){
      arraylb[i]=1
    }
    iAvailableSPM=M;
    iEnergy=0;
    run allocation();
  }
}
7.3 EVALUATION METHODOLOGY
We use the Trimaran compiler [40] to implement the proposed SPM allocation
algorithms. In our experiments, the baseline processor has 2 integer ALUs, 2
floating point ALUs, 1 branch predictor, 1 load/store unit, and 1-level on-chip
memory. The instruction on-chip memories include a 64B SPM and a 64B cache.
The parameters of the caches include: 32 byte block size, direct-mapped, and
LRU replacement policy. A cache hit takes 1 cycle and a main memory access
takes 20 cycles. We use Cacti [57] to estimate the SPM and cache energy
consumption. We randomly select 6 real-time benchmarks from the Malardalen WCET
benchmark suite [41] for the experiments.
Algorithm 12 OSDA-E Allocation Algorithm
begin
  allocations ← empty;
  run simulation to get the memory access trace;
  calculate the stack distance for the memory accesses;
  the upper bound of TEST_VAL ← result of the GSDA-E;
  the lower bound of TEST_VAL ← 0;
  while lower bound < upper bound do
    middle = (lower bound + upper bound)/2;
    TEST_VAL = middle;
    verify the model by SPIN;
    if no error report then
      lower bound ← middle + 1;
    else
      upper bound ← middle;
    end if
  end while
  return the allocation of upper bound;
end
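The binary search over TEST_VAL can be sketched in Python as follows. This is
only an illustration of the search loop: the SPIN verification run is replaced
by a brute-force check over an explicit table of candidate allocations, the
costs in `COSTS` are made-up integers rather than measured energies, and the
function name `search_optimum` is hypothetical. An "error report" corresponds
to SPIN finding an allocation that violates `assert(iEnergy > TEST_VAL)`, i.e.,
an allocation whose cost is at most TEST_VAL.

```python
# Hypothetical integer costs of four candidate allocations (placeholders).
COSTS = {
    ("LB0", "LB2"): 148,
    ("LB0", "LB3"): 128,
    ("LB0", "LB4"): 149,
    ("LB0", "LB5"): 129,
}


def search_optimum(costs, upper_bound, lower_bound=0):
    """Binary search for the smallest bound that some allocation can meet.

    upper_bound must be achievable (for example a greedy result), so that
    at least one allocation has a cost no larger than it.
    """
    rounds = 0
    while lower_bound < upper_bound:
        middle = (lower_bound + upper_bound) // 2
        rounds += 1
        # Stand-in for one verification round with TEST_VAL = middle.
        error_reported = any(cost <= middle for cost in costs.values())
        if error_reported:
            upper_bound = middle          # a better or equal allocation exists
        else:
            lower_bound = middle + 1      # every allocation costs more than middle
    # Retrieve an allocation that meets the final bound.
    allocation = next((a for a, c in costs.items() if c <= upper_bound), None)
    return upper_bound, allocation, rounds


optimum, allocation, rounds = search_optimum(COSTS, upper_bound=149)
print(optimum, allocation, rounds)
```

The loop halves the interval in every round, so the number of verification
rounds grows with the logarithm of the initial upper bound, which is why a
tight starting bound from the greedy algorithm shortens the search.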
7.4 EXPERIMENTAL RESULTS
7.4.1 Memory Energy Consumption
We compare the memory energy of all 6 allocation algorithms in Figure
7.1, which is normalized to the memory energy consumption of the FSA. The
OSDA-E has the lowest memory energy dissipation for all 6 benchmarks since
it is the optimal energy-oriented allocation. We also find that the OSDA-E and the
GSDA-E both reduce the memory energy consumption more than the other four
algorithms. For all the benchmarks, the GSDA-E obtains results that are the same
as or very close to those of the OSDA-E. The average energy consumption of the
GSDA-E is only 0.3% higher than that of the OSDA-E, which shows that the
GSDA-E is an effective heuristic algorithm.
Figure 7.1. The memory energy of all the allocation algorithms
in default configuration (normalized to OSDA).
For the benchmark lms, the energy consumption of the GSDA-E is 2% higher
than that of the OSDA-E. This is because in lms, there are a number of line blocks
that have close access frequencies and similar numbers of cache misses, making the
GSDA-E less effective at choosing the globally optimal line blocks for minimizing
memory energy. However, for all the other 5 benchmarks, we find that the
GSDA-E obtains exactly the same allocation as the OSDA-E.
We also observe that the memory energy consumption of the OSDA and the
GSDA is almost the same because most of the GSDA allocations are the same
as those of the OSDA. Compared to both the OSDA and the GSDA, on average, the
OSDA-E and the GSDA-E reduce the memory energy consumption by 10% and 9.7%
respectively, indicating that the SPM allocation optimized for performance does
not necessarily produce the best energy dissipation for the memory subsystem.
This is mainly because in the HSC, although accesses to the SPM and the cache
have the same latency, they consume different amounts of energy, which can only
be effectively optimized by the energy-oriented algorithms.
Figure 7.1 also indicates that the HSA has the largest memory energy
consumption. This is because the HSA can neither minimize the number of cache
misses compared to the performance-oriented algorithms, nor maximize the
number of SPM accesses compared to the energy-oriented algorithms. On average,
the memory energy consumption of the HSA is 3.7% higher than that of the
OSDA, and 15% higher than that of the OSDA-E. The FSA, however, can reduce
the energy consumption compared to the OSDA and the GSDA by having more
SPM accesses, because it tries to put the most frequently accessed instructions
into the SPM. However, the memory energy consumption of the FSA is still 2%
worse than that of the OSDA-E on average because of the energy dissipation
caused by more cache misses.
7.4.2 Performance Results
Figure 7.2 compares the performance of all the allocation algorithms, which
is normalized to the execution time of the FSA. Compared to the FSA, all the
other five algorithms lead to better performance because they are all cache-aware.
Among all the algorithms, the OSDA has the best performance as it is the optimal
algorithm to minimize the execution time. The GSDA achieves performance
either the same as or very close to that of the OSDA, indicating its effectiveness
in reducing the total execution time. While both the OSDA-E and the GSDA-E aim at
optimizing memory energy consumption, they also result in better performance
than both the FSA and the HSA because the energy-oriented allocation also
requires reducing the number of cache misses and increasing the number of SPM
accesses, both of which benefit performance as well.
Figure 7.2. The total energy of all the allocation algorithms in
default configuration (normalized to FSA).
7.4.3 EDP Results
Figure 7.3 compares the Energy-Delay Product (EDP) for all the SPM
allocation algorithms. It can be seen that the EDP of the OSDA-E is the smallest
on average because it achieves optimal energy consumption with near-optimal
performance results. The EDP of the GSDA-E, on average, is only 0.1% larger
than that of the OSDA-E, indicating its effectiveness in improving both energy
consumption and performance. While the performance of the OSDA and the
GSDA is close to that of the OSDA-E and the GSDA-E, the EDPs of the OSDA and the
GSDA are much larger than those of the OSDA-E and the GSDA-E on average
because both the OSDA and the GSDA consume much more memory energy.
Also, the EDPs of the FSA and the HSA are worse than those of the OSDA-E and the
GSDA-E on average because they have both longer execution time and higher
memory energy consumption.
Figure 7.3. Comparison of the EDP of all the allocation algorithms
in default configuration (normalized to FSA).
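For reference, the EDP metric and its normalization to a baseline can be written
as a small Python sketch. The energy and cycle values below are placeholders
chosen only to exercise the formula; they are not measurements from this work,
and the helper names `edp` and `normalize` are hypothetical.

```python
def edp(energy, cycles):
    """Energy-delay product: energy multiplied by execution time."""
    return energy * cycles


def normalize(values, baseline):
    """Divide every value by the value of the chosen baseline entry."""
    base = values[baseline]
    return {name: value / base for name, value in values.items()}


# Placeholder (energy, cycles) pairs, NOT results from the experiments.
SAMPLES = {
    "FSA":    (100.0, 1000),
    "GSDA":   (105.0, 900),
    "GSDA-E": (95.0, 910),
}

edps = {name: edp(energy, cycles) for name, (energy, cycles) in SAMPLES.items()}
for name, value in normalize(edps, "FSA").items():
    print(f"{name:7s} normalized EDP = {value:.4f}")
```

Because the EDP multiplies the two quantities, an allocation that is slightly
slower can still have a lower EDP when it saves proportionally more energy.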
7.5 CONCLUSIONS
In this chapter, we extend the GSDA and the OSDA from chapter 6 to
optimize the energy consumption of the hybrid SPM-Cache architecture. We
propose two energy-oriented allocation algorithms based on cache stack distance
analysis, and find that in general the cache-aware SPM allocation can lead to better
performance and/or energy consumption than the cache-oblivious SPM allocation
algorithm (i.e., the FSA). Also, we discover that for the HSC architecture, the
energy-oriented algorithms can lead to better EDP than the performance-oriented
algorithms. In particular, our experiments indicate that the GSDA-E can reduce
the energy consumption to a level either the same as or close to the optimal
results attained by the OSDA-E, while achieving performance close to the optimal
results obtained by the OSDA.
CHAPTER 8
CONCLUSION REMARKS
This dissertation explores hybrid SPM-Cache architectures to improve
the performance and energy efficiency of real-time systems by addressing the
following questions:
• How can we precisely estimate the WCET of real-time applications?
• How can hybrid SPM-Cache architectures improve energy efficiency in addition
to the performance and the time predictability of real-time systems?
• How can we improve the WCET of real-time applications on hybrid
SPM-Cache architectures by SPM allocation algorithms?
• How can we design SPM allocation algorithms to optimize the execution
time of real-time applications on hybrid SPM-Cache architectures?
• How can we design SPM allocation algorithms to reduce the energy
consumption of real-time applications on hybrid SPM-Cache
architectures?
Chapter 3 proposes a model checking based approach to bounding the
worst-case performance of a multicore processor with shared L2 instruction caches.
To alleviate the state explosion problem, we propose several techniques for
reducing the memory consumption without compromising the quality of WCET
analysis. Our experimental results show that the model checking based approach is
safe and improves the tightness of WCET estimation as compared to the static
analysis approach [9]. However, due to the inherent complexity of multicore
WCET analysis, the state explosion problem, and the physical memory constraint,
this approach currently can only solve small benchmarks, while larger benchmarks
with more interfering instructions cause out-of-memory faults. Nevertheless, it is
possible to combine the model checking based method with static analysis to
benefit larger real-time applications.
Building upon the prior work in [6], which studied the performance and
time predictability of hybrid SPM-cache architectures, Chapter 4 investigates the
energy consumption of seven different SPM-caches. We find that all seven hybrid
on-chip memory architectures consume less energy than the pure SPM based
architecture. Three hybrid SPM-cache architectures, including the IH-DC, the
IH-DH, and the IH-DS, reduce the total energy consumption compared to the IC-DC.
By considering both energy consumption and performance, the IC-DH, IH-DC,
and IH-DH achieve an energy-delay product lower than that of both the pure
cache-based and SPM-based architectures. Among all the hybrid on-chip memory
architectures, our evaluation indicates that the IH-DH architecture is the best in
terms of both total energy consumption and EDP. More specifically, on average, the
IH-DH architecture reduces the total energy consumption by 22% and 16%,
and reduces the EDP by 38.1% and 16.4% as compared to that of the IS-DS and
the IC-DC respectively. Therefore, in addition to reconciling performance and time
predictability as revealed in [6], we demonstrate that the hybrid on-chip memory
architectures, in particular the IH-DH, can also make better tradeoffs between
performance and energy consumption, making it a very attractive design option
for real-time and embedded systems.
Chapter 5 explores four SPM allocation algorithms that differ by
whether or not they are aware of the WCET and/or the cache. The FSA
algorithm allocates SPM space based on the access frequency of each basic block
from profiling, whereas the LPA attempts to allocate basic blocks with high access
frequencies on the WC-path. Both the HSA and the EHSA algorithms can exploit
the worst-case cache analysis information; however, the EHSA ensures that only
basic blocks on the WC-path are allocated to the SPM. We have also extended the
ILP-based timing analysis method [9] to predict the WCET for the hybrid
SPM-cache architecture, and our experiments indicate that the developed WCET
analyzer is safe and reasonably accurate. Our evaluation indicates that the EHSA
algorithm, which is both WCET-oriented and cache-aware, achieves the best
WCET for all benchmarks under all SPM-cache configurations we have evaluated.
The EHSA is especially effective in reducing the WCET with a smaller cache and a
larger SPM. While the EHSA may lead to degradation of the average-case
performance for some multiple-path benchmarks, its impact is insignificant.
To improve the performance, Chapter 6 develops four SPM allocation
algorithms for the HSC architecture: FSA, HSA, GSDA, and OSDA. While the
FSA is cache-unaware, the other three algorithms are aware of the cache and can
therefore eliminate more cache misses and achieve much better performance than
the FSA. Both the GSDA and the OSDA are based on the unified stack distance
analysis
framework, which can consider the interaction between the SPM allocation and
the cache performance. The OSDA is an optimal algorithm based on model
checking; however, it may take significantly more memory and a longer time to
run as compared to the other algorithms. The GSDA is a greedy algorithm, which
can run efficiently and achieve optimal or near-optimal results for most
benchmarks. Overall, we believe the GSDA is a good SPM allocation algorithm to
harness the full potential of the HSC architecture efficiently.
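To make the notion of stack distance concrete, the sketch below computes LRU stack distances over a block trace and counts misses for a fully associative LRU cache. It is a textbook illustration, not the unified analysis framework of Chapter 6. The function names and the trace are hypothetical, and the sketch assumes that accesses to blocks placed in the SPM bypass the cache entirely.

```python
def stack_distances(trace):
    """LRU stack distance of each access; None marks a first-time access."""
    stack = []  # most recently used block is at the end
    distances = []
    for block in trace:
        if block in stack:
            pos = stack.index(block)
            # Number of distinct blocks touched since the previous use.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(block)
    return distances


def lru_misses(trace, capacity):
    """Misses in a fully associative LRU cache holding `capacity` blocks."""
    return sum(1 for d in stack_distances(trace) if d is None or d >= capacity)


def misses_with_spm(trace, spm_blocks, capacity):
    """Assumption: accesses to blocks held in the SPM never reach the cache."""
    cache_trace = [b for b in trace if b not in spm_blocks]
    return lru_misses(cache_trace, capacity)


trace = ["A", "B", "C", "A", "B", "D", "A"]  # hypothetical block trace
print(stack_distances(trace))
print(lru_misses(trace, capacity=2))
print(misses_with_spm(trace, {"A"}, capacity=2))
```

Moving a block into the SPM removes its accesses from the cache trace, which shortens the stack distances of the remaining blocks. In the sketch, placing A in the SPM turns the second access to B from a miss into a hit. A cache-aware allocation algorithm has to account for this interaction.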
Last but not least, Chapter 7 extends the GSDA and the OSDA into the GSDA-E
and the OSDA-E, which aim to reduce the energy consumption. We evaluate them
together with the four SPM allocation algorithms from Chapter 6, and find that,
in general, the cache-aware SPM allocation can lead to better performance and/or
energy consumption than the cache-oblivious SPM allocation algorithm (i.e., the
FSA). Also, we discover that for the HSC architecture, the energy-oriented
algorithms can lead to better EDP than the performance-oriented algorithms. In
particular, our experiments indicate that the GSDA-E achieves energy consumption
either the same as or close to the optimal results attained by the OSDA-E, while
achieving performance close to the optimal results obtained by the OSDA.
8.1 FUTURE WORK
In our future work, we would like to seamlessly integrate static analysis with
the model checking based method to attain safe and tight WCET results with
much smaller memory consumption and less computation time. Moreover, we
would like to explore the hybrid SPM-caches for storing data as well. We plan to
explore both heuristic-based and optimal cache-aware SPM allocation algorithms
for data accesses. Also, we plan to study SDA-based dynamic SPM
allocation for the HSC architecture. In addition, we intend to extend the hybrid
SPM-cache architecture to multi-level on-chip memories and multicore platforms.
REFERENCES
[1] K. Sankaralingam et al., "Exploiting ILP, TLP, and DLP with the
polymorphous TRIPS architecture," Proc. Computer Architecture, 2003.
Proceedings. 30th Annual International Symposium on. IEEE, 2003.
[2] ARM1136JF-S and ARM1136J-S Technical Reference Manual, 2013.
[3] FERMI Compute Architecture White Paper.
http://www.nvidia.com/object/fermi-architecture.html, 2013.
[4] S. Kang and A. Dean, "Leveraging both data cache and scratchpad memory
through synergetic data allocation," Proc. Real-Time and Embedded
Technology and Applications Symposium (RTAS), 2012 IEEE 18th. IEEE,
2012.
[5] J. Cong et al., "An Energy-Efficient Adaptive Hybrid Cache," Proc. Low
Power Electronics and Design (ISLPED) 2011 International Symposium on.
IEEE, 2011.
[6] W. Zhang and Y. Ding, "Hybrid SPM-cache architectures to achieve high
time predictability and performance," Proc. Application-Specific Systems,
Architectures and Processors (ASAP), 2013 IEEE 24th International
Conference on. IEEE, 2013.
[7] A. Metzner, "Why model checking can improve wcet analysis," Proc. 16th
International Conference on Computer Aided Verification. Lecture Notes in
Computer Science, vol. 3114, Springer-Verlag, Berlin Heidelberg, 2004.
[8] SPIN. Homepage of spin. http://spinroot.com/spin/whatispin.html, 2013.
[9] J. Yan and W. Zhang, "Accurately estimating worst-case inter-thread cache
interferences and WCET for multicore processors," Proc. IEEE International
Conference on Embedded and Real-Time Computing Systems and
Applications (RTCSA). 455-463, 2009.
[10] M. Alt et al., "Cache behavior prediction by abstract interpretation," Static
Analysis. Springer Berlin Heidelberg, 1996. 52-66.
[11] K. Beyls and E. D'Hollander, "Reuse Distance as a Metric for Cache
Behavior," Proc. IASTED Conference on Parallel and Distributed
Computing and Systems. Vol. 14. 2001.
[12] C. Cascaval and D. A. Padua, "Estimating cache misses and locality using
stack distance," Proc. 17th annual international conference on
Supercomputing. ACM, 2003.
[13] R. Banakar et al., "Scratchpad memory: design alternative for cache on-chip
memory in embedded systems," Proc. 10th international symposium on
Hardware/software codesign. ACM, 2002.
[14] MPC500 32-bit MCU Family. Motorola/Freescale, Revised July 2002.
http://www.freescale.com/files/microcontrollers/doc/fact
sheet/MPC500FACT.pdf.
[15] D. Brash, "The ARM architecture Version 6 (ARMv6)," ARM Ltd.,
January 2002. White Paper.
[16] R. Wilhelm et al., "The worst case execution time problem - Overview of
methods and survey of tools," ACM Transactions on Embedded Computing
Systems (TECS) 7.3 (2008): 36, 2008.
[17] J. Rosen et al., "Bus access optimization for predictable implementation of
real-time applications on multiprocessor systems-on-chip," Proc. 28th IEEE
International Real-Time Systems Symposium (RTSS). 49-60, 2007.
[18] S. Schliecker et al., "Reliable performance analysis of a multicore
multithreaded system-on-chip," Proc. International Conference on
Hardware/Software Codesign and System Synthesis (CODES+ISSS). 161-166,
2008.
[19] J. Stohr et al., "Bounding worst-case access times in modern multiprocessor
systems," Proc. 17th Euromicro Conference on Real-Time Systems
(ECRTS05). 189-198.
[20] J. Yan and W. Zhang, "WCET analysis for multicore processors with shared
instruction caches," Proc. IEEE Real-Time and Embedded Technology and
Applications Symposium (RTAS), 2008. 80-89.
[21] C. Healy et al., "Integrating the timing analysis of pipelining and instruction
caching," Proc. Real-Time Systems Symposium, 1995. IEEE, 1995.
[22] Y. S. Li and S. Malik, "Performance analysis of embedded software using
implicit path enumeration," Proc. 32nd ACM/IEEE Design Automation
Conference. 456-461, 1995.
[23] Y. S. Li and S. Malik, "Cache modeling and path analysis for real-time
software," Proc. 17th IEEE Real Time Systems Symposium. 254-264, 1996.
[24] G. Ottosson and M. Sjodin, "Worst-case execution time analysis for modern
hardware architectures," Proc. ACM SIGPLAN Workshop on Languages,
1997.
[25] F. Stappert et al., "Efficient longest executable path search for programs
with complex flows and pipeline effects," Proc. International Conference on
Compilers, Architecture, and Synthesis for Embedded Systems. 132-140, 2001.
[26] E. M. Clarke et al., "Model checking and abstraction," ACM Transactions
on Programming Languages and Systems (TOPLAS) 16.5 (1994): 1512-1542.
[27] A. Pnueli, "The temporal logic of programs," Proc. 18th Annual IEEE
Symposium on Foundations of Computer Science. 46-57, 1977.
[28] E. M. Clarke and B. H. Schlingloff, Model Checking, Handbook of Automated
Reasoning, MIT Press, Cambridge, MA, 2001.
[29] R. Alur and D. Dill, "Automata for modeling real-time systems," Proc.
Automata, languages and programming. Springer Berlin Heidelberg, 1990.
322-335.
[30] F. Wang, "Formal verification of timed systems: A survey and perspective,"
Proc. IEEE 92, 8, 1283-1305, 2004.
[31] M. Lv et al., "Performance comparison of techniques on static path analysis
of wcet," Proc. IEEE/IFIP International Conference on Embedded and
Ubiquitous Computing. 104-111, 2008.
[32] R. Wilhelm, "Why ai+ilp is good for wcet, but mc is not, nor ilp alone," In
Verification, Model Checking and Abstract Interpretation (VMCAI). Lecture
Notes in Computer Science, vol. 2937, Springer, Berlin, 2003.
[33] B. Huber and M. Schoeberl, "Comparison of implicit path enumeration and
model checking based wcet analysis," Proc. 9th International Workshop on
Worst-Case Execution Time (WCET) Analysis, 2009.
[34] S. Mohalik et al., "Model checking based analysis of end-to-end latency in
embedded real-time systems with clock drifts," Proc. IEEE/ACM Design
Automation Conference (DAC). 296-299, 2008.
[35] C. Liu et al., "Organizing the last line of defense before hitting the memory
wall for cmps," Proc. 10th International Symposium on High Performance
Computer Architecture. 176-185, 2004.
[36] M. Paolieri et al., "Hardware support for wcet analysis of hard real-time
multicore systems," Proc. 36th Annual International Symposium on
Computer Architecture (ISCA09). 57-68, 2009.
[37] B. Akesson et al., "Predator: A predictable sdram memory controller," Proc.
5th IEEE/ACM international conference on Hardware/software codesign and
system synthesis. ACM, 2007.
[38] G. Holzmann, "The model checker spin," Software Engineering, IEEE
Transactions on 23.5 (1997): 279-295.
[39] R. Arnold et al., "Bounding worst-case instruction cache performance,"
Proc. Real-Time Systems Symposium, 1994. IEEE, 1994.
[40] TRIMARAN. Homepage of trimaran. http://www.trimaran.org/, 2013.
[41] Malardalen WCET research group, "Malardalen wcet benchmark suite,"
http://www.mrtc.mdh.se/projects/wcet.
[42] C. Lee et al., "MediaBench: A Tool for Evaluating and Synthesizing
Multimedia and Communication Systems," Proc. 30th International
Symposium of Microarchitecture (MICRO), 1997.
[43] Y. Ding and W. Zhang, "Hybrid SPM-Cache Architectures to Achieve High
Time Predictability and Performance," Technical Report, Department of
Electrical and Computer Engineering, Virginia Commonwealth University,
2012.
[44] B. Egger et al., "Scratchpad Memory Management for Portable Systems
with a Memory Management Unit," Proc. 6th ACM & IEEE International
conference on Embedded software. ACM, 2006.
[45] P. Panda et al., "Efficient utilization of scratch-pad memory in embedded
processor applications," Proc. 1997 European conference on Design and Test.
IEEE Computer Society, 1997.
[46] O. Avissar and R. Barua, "An optimal memory allocation scheme for
scratchpad-based embedded systems," ACM Transactions on Embedded
Computing Systems (TECS), Nov, 2002.
[47] S. Udayakumaran and R. Barua, "Compiler-decided dynamic memory
allocation for scratch-pad based embedded systems," Proc. 2003
international conference on Compilers, architecture and synthesis for
embedded systems. ACM, 2003.
[48] N. Nguyen et al., "Scratch-pad memory allocation without compiler support
for java applications," Proc. 2007 international conference on Compilers,
architecture, and synthesis for embedded systems. ACM, 2007.
[49] M. Kandemir et al., "Compiler-directed scratch pad memory optimization
for embedded multiprocessors," Very Large Scale Integration (VLSI)
Systems, IEEE Transactions on 12.3 (2004): 281-287.
[50] S. Steinke et al., "Assigning program and data objects to scratchpad for
energy reduction," Proc. Design, Automation and Test in Europe Conference
and Exhibition (DATE), IEEE, 2002.
[51] J. F. Deverge and I. Puaut, "WCET-directed dynamic scratchpad memory
allocation of data," Proc. Real-Time Systems, 2007. ECRTS'07. 19th
Euromicro Conference on. IEEE, 2007.
[52] V. Suhendra et al., "WCET centric data allocation to scratchpad memory,"
Proc. Real-Time Systems Symposium, 2005. RTSS 2005. 26th IEEE
International. IEEE, 2005.
[53] H. Falk and J. Kleinsorge, "Optimal Static WCET-aware Scratch-pad
Allocation of Program Code," Proc. 46th Annual Design Automation
Conference. ACM, 2009.
[54] J. Whitham and N. Audsley, "Studying the applicability of the scratchpad
memory management unit," Proc. Real-Time and Embedded Technology and
Applications Symposium (RTAS), 2010 16th IEEE. IEEE, 2010.
[55] H. Wu et al., "Optimal WCET-aware code selection for scratchpad memory,"
Proc. 10th ACM international conference on Embedded software. ACM, 2010.
[56] M. Kamble and K. Ghose, "Analytical Energy Dissipation Models For Low
Power Cache," Proc. Low Power Electronics and Design. 1997 International
Symposium on. IEEE, 1997.
[57] CACTI. Homepage of CACTI. http://www.cacti.net/.
[58] G. Ascia et al., "VLIW-Explorer: A Parameterized VLIW-based Platform
Framework for Design Space Exploration," Proc. 1st Workshop on Embedded
Systems for Real-Time Multimedia, 2008.
[59] M. Kandemir et al., "Banked scratchpad memory management for reducing
leakage energy consumption," Proc. 2004 IEEE/ACM International
conference on Computer-aided design. IEEE Computer Society, 2004.
[60] G. Chen and M. Kandemir, "Dataflow analysis for energy-efficient
scratch-pad memory management," Proc. Low Power Electronics and
Design, 2005. ISLPED'05. 2005 International Symposium on. IEEE, 2005.
[61] H. Wang et al., "Energy-oriented dynamic SPM allocation based on
Time-Slotted Cache Conflict Graph," Proc. Conference on Design,
Automation and Test in Europe. European Design and Automation
Association, 2010.
[62] C. Xue et al., "Towards energy efficient hybrid on-chip Scratch Pad Memory
with non-volatile memory," Proc. Design, Automation and Test in Europe
Conference and Exhibition (DATE), IEEE, 2011.
[63] M. Verma et al., "Cache-aware scratchpad allocation algorithm," Proc.
conference on Design, automation and test in Europe-Volume 2. IEEE
Computer Society, 2004.
[64] N. Nguyen et al., "Memory Allocation for Embedded Systems with a
Compile-Time-Unknown Scratchpad Size," Proc. 2005 international
conference on Compilers, architectures and synthesis for embedded systems.
ACM, 2005.
[65] Q. Wan et al., "WCET-Aware Data Selection and Allocation for Scratchpad
Memory," Proc. 13th ACM SIGPLAN/SIGBED International Conference on
Languages (LCTES '12), 2012.
[66] N. Megiddo, "On the complexity of linear programming," Advances in
economic theory (1987): 225-268.
[67] D. Spielman and S. Teng, "Smoothed analysis of algorithms: Why the
simplex algorithm usually takes polynomial time," Proc. 33rd annual ACM
symposium on Theory of computing, 2001.
[68] S. Li et al., "Performance estimation of embedded software with instruction
cache modeling," Proc. 1995 IEEE/ACM international conference on
Computer-aided design, 1995.
[69] CPLEX. Homepage of cplex. http://www.ilog.com/products/cplex, 2013.
[70] R. Arnold et al., "Bounding Worst-case Instruction Cache Performance,"
Proc. Real-Time Systems Symposium (RTSS), 1994.
[71] F. Mueller, "Generalizing timing predictions to set-associative caches," Proc.
Euromicro Workshop on Real-Time Systems, 1997.
[72] W. Zhao et al., "WCET code positioning," Proc. IEEE Real-Time Systems
Symposium (RTSS), 2004.
[73] H. Falk and H. Kotthaus, "WCET-driven Cache-aware Code Positioning,"
Proc. 14th international conference on Compilers, architectures and synthesis
for embedded systems. ACM, 2011.
[74] P. Lokuciejewski et al., "WCET-driven cache-based procedure positioning
optimizations," Proc. Real-Time Systems, 2008. (ECRTS'08). Euromicro
Conference on. IEEE, 2008.
[75] S. Plazar et al., "WCET-driven Cache-aware Memory Content Selection,"
Proc. Object/Component/Service-Oriented Real-Time Distributed Computing
(ISORC), 2010 13th IEEE International Symposium on. IEEE, 2010.
[76] I. Puaut and C. Pais, "Scratchpad memories vs locked caches in hard
real-time systems: a quantitative comparison," Proc. Design, Automation
and Test in Europe Conference and Exhibition, 2007. DATE'07. IEEE, 2007.
[77] M. Kandemir et al., "Dynamic management of scratch-pad memory space,"
Proc. Design Automation Conference, 2001. IEEE, 2001.
[78] L. Li et al., "Memory coloring: A compiler approach for scratchpad memory
management," Proc. Parallel Architectures and Compilation Techniques,
2005. PACT 2005. 14th International Conference on. IEEE, 2005.
[79] B. Egger et al., "Dynamic scratchpad memory management for code in
portable systems with an mmu," ACM Transactions on Embedded
Computing Systems (TECS) 7.2 (2008): 11. 2008.
[80] M. Verma and P. Marwedel, "Overlay techniques for scratchpad memories in
low power embedded processors," IEEE Transactions on Very Large Scale
Integration (VLSI) Systems 4, 8, August 2006.
[81] S. Cook, "The complexity of theorem-proving procedures," Proc. 3rd annual
ACM symposium on Theory of computing. ACM, 1971.
[82] M. Ben-Ari, "Principles of the spin model checker," Springer, 2008.
