TRISHUL: A Single-pass Optimal Two-level Inclusive Data Cache Hierarchy
  Selection Process for Real-time MPSoCs by Haque, Mohammad Shihabul et al.
Mohammad Shihabul Haque, Akash Kumar, Yajun Ha, Qiang Wu and Shaobo Luo 
Department of Electrical and Computer Engineering, National University of Singapore 
Email: {elemsh, akash, elehy, elewuqia, shaobo.luo}@nus.edu.sg 
 
 
Abstract— Hitherto discovered approaches analyze the execu- 
tion time of a real-time application on all the possible cache hi- 
erarchy setups to find the application specific optimal two-level 
inclusive data cache hierarchy to reduce cost, space and energy 
consumption while satisfying the time deadline in real-time Multi- 
Processor Systems on Chip (MPSoC). These brute-force like ap- 
proaches can take years to complete. Alternatively, application’s 
memory access trace driven crude estimation methods can find a 
cache hierarchy quickly by compromising the accuracy of results. 
In this article, for the first time, we propose a fast and accurate 
application’s trace driven approach to find the optimal real-time 
application specific two-level inclusive data cache hierarchy. Our 
proposed approach “TRISHUL” predicts the optimal cache hi- 
erarchy performance first and then utilizes that information to 
find the optimal cache hierarchy quickly. For the application traces
analyzed, TRISHUL suggests cache hierarchies whose shared cache is up to
128 times smaller, and reaches its decision up to 7 times faster, than the
state-of-the-art crude trace driven two-level inclusive cache hierarchy
selection approach.
I. INTRODUCTION 
Guaranteed execution time and performance in real-time 
computer applications allow planning the efficient use of the 
application as well as other related tasks. Due to this fact, 
from saving lives in hospitals to compressing images on digital 
cameras, real-time applications can be found everywhere. To 
satisfy the performance and time critical nature in the real-time 
applications, use of MPSoCs with multi-level cache hierarchy 
on real-time systems is growing day by day. By keeping data 
handy to the processors, cache memory hierarchy hides the 
latency of slow memory transactions. However, if the cache
configurations¹ in the cache hierarchy are not chosen appropriately, the
consequences can be catastrophic: the completion time deadline may be
exceeded, and cost, space and energy consumption are adversely affected [2].

As a data cache hierarchy can have single or multiple cache memories in
each level and one cache memory can influence the others' cache hits/misses
(inter-influencing), analyzing the given application's execution time on
all possible cache hierarchy configurations² is a mandatory step in
deciding the optimal application specific data cache hierarchy for
real-time MPSoCs. However, consider the cache hierarchy of Figure 1
(collected from [14]) to understand the problem with analysis time.
Figure 1 depicts a two-level inclusive data cache hierarchy (Harvard
Architecture) widely used on contemporary MPSoCs [8, 3, 11, 16]. In
Figure 1, the processor cores include private caches, and data is loaded
into the shared cache before it is loaded into them. The private caches
search for data in the shared cache before memory. Therefore, the shared
cache contains a superset of the private caches' contents. See [22] for
inclusive cache hierarchy details. If each cache memory has ten possible
configurations and fifteen seconds are taken on average to find the
execution time of an application on one cache hierarchy configuration, it
will take eighteen days of continuous analysis to find the optimal cache
hierarchy, unless a speedup mechanism is used.

An application's total execution time, as well as the time spent on
instruction/data memory operations, can be calculated quite accurately
from the number of cache hits and misses [9]. Therefore, finding the
number of cache hits and misses in each level of the data cache hierarchy
is enough to find the most appropriate application specific data cache
hierarchy. If the range of cache hits and misses for each level in the
data cache hierarchy (the required cache hierarchy performance, CHP) that
satisfies the allowable data memory operation time (WCDMOT) of the
real-time application is known, the search for the most appropriate cache
hierarchy can be shortened by pruning the infeasible cache hierarchy
configurations. Even though the maximum allowable WCDMOT can be calculated
using worst-case timing analysis [20, 17], to the best of our knowledge,
no proposal has ever been made to estimate/predict the required
performance of a multi-level/two-level inclusive data cache hierarchy in
real-time MPSoCs. Moreover, no significantly fast method is known to find
the optimal two-level inclusive data cache hierarchy in real-time MPSoCs.

¹ Combination of cache parameters such as number of cache sets (set size),
number of storage locations in each set (associativity), capacity of each
storage location (cache line/block size), etc.
² A cache hierarchy setup with a specific configuration per cache memory.

Fig. 1. Two-level Cache Hierarchy in MPSoC Architecture (collected
from [14])
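The exhaustive-search estimate given in the introduction can be reproduced with a short calculation. The sketch below is only illustrative: it assumes the hierarchy of Figure 1 contains five caches (four private caches plus one shared cache), which the text does not state explicitly, and the function and parameter names are ours.

```python
# Back-of-the-envelope cost of exhaustive cache hierarchy exploration.
# Assumption: five caches in total (four private + one shared, as drawn
# in Figure 1); ten configurations per cache; fifteen seconds per run.

SECONDS_PER_DAY = 24 * 60 * 60


def exhaustive_search_days(num_caches, configs_per_cache, seconds_per_run):
    """Days needed to evaluate every combination of per-cache configurations."""
    hierarchy_configs = configs_per_cache ** num_caches
    return hierarchy_configs * seconds_per_run / SECONDS_PER_DAY


days = exhaustive_search_days(num_caches=5, configs_per_cache=10, seconds_per_run=15)
print(f"{days:.1f} days of continuous analysis")
```

Under these assumptions the result is a little over seventeen days, in line with the roughly eighteen days quoted. Each additional cache multiplies the cost by the number of configurations per cache, which is why pruning the design space matters.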
In this article, for the first time, we present a fast and ac- 
curate application’s memory access trace driven process to se- 
lect the optimal two-level inclusive data cache hierarchy for 
real-time MPSoCs. Our proposed cache hierarchy selection 
process “TRISHUL” (Time Restricted Interconnected Simula- 
tion of Hierarchical-cache Utility Library) finds the smallest 
storage capacity configurations, to save cost and space-energy 
consumption while meeting the time deadline, for each cache 
memory in the cache hierarchy. Our target architecture is the 
one presented in Figure 1. TRISHUL first predicts the required CHP with a
novel approach. CHP is then used to reduce the cache hierarchy design
space by pruning the infeasible cache hierarchy configurations, which
saves a significant amount of time. To analyze each cache memory's
behavior with minimal memory consumption and without affecting the
accuracy of analysis, TRISHUL adopts the "single-pass" technique (details
in Section II) through a layered approach. Another unique feature of
TRISHUL is that, when a cache hierarchy has already been selected for an
application but the WCDMOT is later reduced, the new optimal cache
hierarchy can be found with minimal cache simulation. Due to all these
features, TRISHUL can find the optimal cache hierarchy in similar or less
time than DIMSim [14], the state-of-the-art application trace driven crude
method for selecting a two-level inclusive data cache hierarchy in
real-time MPSoCs. TRISHUL is up to 7 times faster than DIMSim, and the
shared caches suggested by TRISHUL can be up to 128 times smaller than
DIMSim's suggestions for the application traces presented in this article.
Note that TRISHUL finds the optimal hierarchy among all those cache
hierarchies that have the same block size/cache line size in a particular
level. The article is written assuming that all cache configurations have
a fixed block size.

978-1-4673-3030-5/13/$31.00 ©2013 IEEE

The rest of the paper is structured as follows:  Section II 
discusses the related works, Section III explains TRISHUL’s 
working policy and implementations, Section IV discusses the 
results and analyzes the TRISHUL suggested cache hierar- 
chies’ optimality and Section V concludes the paper. 
II. RELATED WORK 
The worst case execution time of an application and the 
maximum number of main memory accesses estimated using 
worst-case timing analysis [4, 5] serve as the required CHP to 
select a single application specific cache memory. Even though 
real-time systems are usually application specific [18, 9] and 
the maximum number of main memory accesses acceptable for 
the WCDMOT can be estimated using the worst-case timing 
analysis, inter-influencing cache memories in the multi-level 
data cache hierarchy do not allow required CHP to be ex- 
tracted from the number of memory accesses. No other meth- 
ods are known either to predict the required CHP for real-time 
application specific two-level inclusive data cache hierarchy. 
A single application specific cache memory is selected by evaluating the
application's execution time on a large group of cache configurations. For
this purpose, three types of application memory access trace driven cache
behavior simulation approaches are popular due to their speed compared to
cycle accurate simulators or instruction set simulators. In the first
type, compressed trace simulation, redundant information is pruned to
compress the memory access trace [13, 19]. In the second type, parallel
simulation, cache configurations are simulated in parallel on parallel
hardware to reduce the overall simulation time [1, 15]. In contrast to
parallel simulation, the third type, single-pass simulation, uses one
processing unit as efficiently as possible [12, 9]. In single-pass
simulation, several cache configurations are simulated by reading the
application's memory access trace once. To mimic the hardware behavior
with as little state as possible, cache configurations are represented
mainly by four cache parameters: (i) set size (S), (ii) associativity (A),
(iii) cache block/line size (B) and (iv) replacement policy. To simulate
all the cache configurations quickly and accurately, several additional
mechanisms (such as inclusion properties [7], intersection properties [6],
etc.) are also applied in single-pass simulation. Single-pass simulation
can be deployed together with compressed trace simulation and/or parallel
simulation.

Due to these advantages, attempts have been made to adopt single-pass
simulation techniques to select an appropriate multi-level cache
hierarchy. Two proposals made by Wei Zang et al. [22, 23] are the latest
of these attempts. However, Zang's approaches are limited to two-level
exclusive cache hierarchies³ only. Cache coherency is not considered in
Zang's approaches; hence, they are not usable in MPSoCs where data sharing
raises coherency issues. Zang's approaches are also restricted in terms of
usability, as they require the first level cache to use the Least Recently
Used (LRU) replacement policy and the second level cache to use the
First-In-First-Out (FIFO) replacement policy (note that TRISHUL allows
different replacement policies in shared and private caches).

To the best of our knowledge, only one approach, DIMSim, has adopted
single-pass simulation so far to make a crude selection of a two-level
inclusive data cache hierarchy in real-time MPSoCs. To handle coherency,
DIMSim first finds a shared cache that can satisfy the WCDMOT alone and,
on top of that, a private level configuration to cover system overheads.
As a result, the size of the shared cache is always much larger than
required. Moreover, due to the addition of private caches, traffic to the
shared cache is reduced, which reduces the memory operation time further.
As system overheads are dynamic and not predictable, adding a private
cache per processor to handle an unpredictable amount of system overhead
is impractical and can cause excessively large private caches. Most
importantly, shared caches cannot be found using DIMSim if the WCDMOT is
not large enough to be satisfied by a single cache (Section IV details
this problem with experiment results).

³ In an exclusive cache hierarchy, requested content is loaded into the
private caches directly from memory, and the shared cache stores the
content evicted from the private caches.

Fig. 2. TRISHUL Cache Hierarchy Selection Flow

III. TRISHUL
In search of the optimal real-time application specific two-level
inclusive data cache hierarchy, TRISHUL deploys three major components:
(a) Cache Hierarchy Performance Predictor (CHPP), (b) Single-pass Private
Cache Simulator (SPCS), and (c) Single-pass Shared Cache Simulator (SSCS).
Figure 2 depicts the work flow of these components. The target real-time
application's memory access trace is prepared beforehand. To generate the
trace, memory accesses are observed and captured at the memory controller
while the real-time MPSoC (without caches) is executing the application as
communicating tasks or multiple applications. After that, data accesses
from the trace are extracted and annotated; so that, for every
[Elements of the Figure 2 diagram: pre-processing of the trace (shared
memory based cache-less MPSoC simulator; read shared memory trace; extract
memory trace for data access); inputs: worst case data memory operation
time (WCDMOT) and data access times in caches and memory; STAGE-1: CHPP;
STAGE-2: SPCS (Private_level_config 1 ... n; select private cache
configurations); STAGE-3: secondary trace generation; STAGE-4: SSCS
(Cache_Config 1 ... n; select shared cache configuration).]
Fig. 3. SPCS Data Structures: (a) Schematic of Lookup Table,
(b) Simulation tree
[Elements of the Figure 3 diagram: (a) a look-up table arranged in two
sets, whose entries hold a memory block address (10010 and 11010 in set 0,
10011 and 11011 in set 1), a bit array per set size, and pointers to
look-up table entries; (b) a simulation tree with one level per set size
(indices 0 and 1; 00, 10, 01 and 11; 000 to 111) and an associativity list
for each processor's private cache.]

private cache. To reduce entry search time, look-up table en- 
tries are arranged into sets and entries are sorted on their keys 
to facilitate binary search. Memory blocks are mapped to look- 
up table sets just like cache sets. 
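The set-organized, key-sorted look-up table described above can be sketched as follows. This is a minimal illustration rather than the SPCS implementation: the class and method names are hypothetical, the record payload is a placeholder, and the set index is assumed to come from the low-order bits of the block address, as it would in a cache whose number of sets is a power of two. The sample addresses and bit arrays are the ones legible in Figure 3(a).

```python
import bisect


class LookupTable:
    """Toy look-up table: entries are grouped into sets and kept sorted by key."""

    def __init__(self, num_sets):
        # num_sets is assumed to be a power of two, as for cache sets.
        self.num_sets = num_sets
        self.keys = [[] for _ in range(num_sets)]     # sorted block addresses per set
        self.records = [[] for _ in range(num_sets)]  # payloads aligned with keys

    def _set_index(self, block_address):
        # Same mapping a cache uses: low-order bits of the block address.
        return block_address & (self.num_sets - 1)

    def find(self, block_address):
        """Binary-search the set the block maps to; return its record or None."""
        s = self._set_index(block_address)
        pos = bisect.bisect_left(self.keys[s], block_address)
        if pos < len(self.keys[s]) and self.keys[s][pos] == block_address:
            return self.records[s][pos]
        return None

    def insert(self, block_address, record):
        """Insert (or overwrite) while keeping the set sorted on its keys."""
        s = self._set_index(block_address)
        pos = bisect.bisect_left(self.keys[s], block_address)
        if pos < len(self.keys[s]) and self.keys[s][pos] == block_address:
            self.records[s][pos] = record
        else:
            self.keys[s].insert(pos, block_address)
            self.records[s].insert(pos, record)


lt = LookupTable(num_sets=2)
lt.insert(0b11010, {"bit_array": "101"})
lt.insert(0b10010, {"bit_array": "011"})
lt.insert(0b10011, {"bit_array": "111"})
print(lt.find(0b10010), lt.find(0b11011))
```

Splitting the table into sets shortens each sorted list, so a search touches only the entries that could conflict with the requested block.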
 
 
Algorithm 1: SPCSEvaluation(RequestedAddress (RA),
RequestingProcessor (N), MissLimit (TAS))

    LT = look-up table;
    AN = the associativity list for the Nth processor;
    if (RA is not found in LT) then
        Record one cache miss for all the configurations of processor N's
            private cache;
        Exclude the tree level L from simulation whose total number of
            misses is greater than TAS;
        Place RA in LT and place a pointer to RA's location in LT in all
            the configurations for processor N's private caches in the
            simulation tree;
    else
        Select the tree level L = 0 (smallest cache set size S = 2^L) in
            the tree;
        while 2^L is not larger than the largest set size do
            if (Lth level is not excluded from simulation) then
                if (Write Operation) then
                    if (RA found in AN) then
                        For set size 2^L, update/invalidate bit arrays for
                            RA in LT for the processors I where I ≠ N,
                            I = 1, 2, 3, ..., last processor;
                    else
                        Record a cache miss for processor N's
                            configuration with set size 2^L;
                        Place RA in AN and update the LT record;
                        Update/invalidate bit arrays for RA in LT for all
                            AI where I ≠ N, I = 1, 2, 3, ..., last
                            processor number;
                else
                    if (Not found in AN or invalid) then
                        Record a cache miss for processor N's
                            configuration with set size 2^L;
                        Place RA in AN and update the LT record;
                Exclude the tree level L from simulation whose total
                    number of misses is greater than TAS;
            L = L + 1;

By reading the trace entry for memory block (RA) once, SPCS determines
cache misses in all the private level configurations by utilizing
Algorithm 1. Just by reading the bit arrays for RA, SPCS identifies the
appropriate processor's cache in each private level configuration that has
not stored RA. For these caches, SPCS records misses and then updates the
look-up table and simulation tree to reflect the after-miss scenario. To
record the number of sequential data blocks served from the private level
(TAP'), SPCS just counts the number of different cycles in which a data
access occurred. From the bit arrays, SPCS also knows quickly which
processors' caches have to be updated/invalidated when a particular
processor updates shared data (coherency handling).

After simulation, all the private level configurations' observed TAP and
TAS are substituted in Equation 1 to find the largest value for TAM below
the CHPP predicted TAM (we call this fine tuned value TAM'). The private
level configuration that generates the largest TAM' is selected for shared
cache generation, and its TAM' is used as the miss limit in shared cache
simulation.

A. Single-pass Shared Cache Simulator (SSCS)
Role: Finding the optimal shared cache configuration by reading the
secondary trace file only once.
Details: SSCS simulates multiple configurations of one shared cache
memory. SSCS is actually the simulator of [6] without any intersection
property deployed and modified to accommodate the use of TAM'. The look-up
table and simulation tree are also used by SSCS to represent shared cache
configurations. However, one look-up table and its associated simulation
tree are generated to simulate cache configurations with varying S and A.
The look-up table and the simulation tree in SSCS look exactly the same as
in SPCS; however, the bits in the look-up table bit arrays provide data
availability information for cache configurations with the same S and B
but with different A. For example, when the look-up table in Figure 3(a)
is used for SSCS, the bit array associated with memory address 10010 for
S = 1 indicates that the memory block content is absent in the shared
cache configurations with S = 1 and A = 1 and 2, given the three options
1, 2 and 4 for the value of A. SSCS adds three lists containing 1, 2 and 4
nodes, representing A = 1, 2 and 4 respectively, to each tree node. That
means the top level of the tree in Figure 3(b) represents the fixed cache
line sized shared cache configurations with S = 2 and A = 1, 2 and 4.
After simulation, the shared cache configuration with the largest number
of memory accesses (TAM'') is selected for final design.
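The idea of evaluating many FIFO cache configurations while reading the trace once, and of abandoning a configuration as soon as it exceeds the miss limit, can be illustrated with the naive sketch below. It is not the SSCS data structure: there is no look-up table, bit array or simulation tree, every live configuration is simply updated per access, and the function name and the tiny trace are hypothetical.

```python
from collections import deque


def single_pass_fifo_misses(trace, set_sizes, associativities, miss_limit=None):
    """Count misses of every (S, A) FIFO cache configuration in one trace pass.

    trace holds block addresses (the block size is fixed). A configuration
    whose miss count exceeds miss_limit is dropped and reported as None.
    """
    # One FIFO queue per set for every configuration still being simulated.
    caches = {(s, a): [deque() for _ in range(s)]
              for s in set_sizes for a in associativities}
    misses = {cfg: 0 for cfg in caches}

    for block in trace:                       # the trace is read exactly once
        for (s, a) in list(caches):           # copy: entries may be deleted
            fifo = caches[(s, a)][block % s]
            if block in fifo:
                continue                      # hit: FIFO order is unchanged
            misses[(s, a)] += 1
            if len(fifo) == a:
                fifo.popleft()                # evict the oldest block
            fifo.append(block)
            if miss_limit is not None and misses[(s, a)] > miss_limit:
                del caches[(s, a)]            # infeasible: stop simulating it
                misses[(s, a)] = None
    return misses


result = single_pass_fifo_misses([0, 1, 2, 0, 1, 2], [1, 2], [1, 2, 4], miss_limit=4)
feasible = {cfg: m for cfg, m in result.items() if m is not None}
print(feasible)
```

Among the feasible configurations, the selection rule in the text would pick the one with the largest miss count that stays within the limit, since that corresponds to the least storage.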
From here on, we use n apostrophes (') after TAP and TAS but n + 1
apostrophes after TAM to indicate the observed number of sequential
accesses in the private level, the shared level and main memory in the
nth cache hierarchy configuration.
By now, readers may be wondering, when TRISHUL selects a private level
configuration X with TAS' misses and a shared cache configuration with
TAM'' misses:
1. Is there any smaller private level configuration with TAS'' > TAS' and
a larger shared cache configuration with TAM''' ≤ TAM'' that still
satisfies WCDMOT?
2. If no shared cache is found for X, can a larger private level
configuration with TAS'' < TAS' have a shared cache that satisfies
WCDMOT?
3. How can the optimal cache hierarchy be selected with minimal simulation
when an application's WCDMOT reduces?
TABLE I
EFFICIENCY OF TRISHUL OVER DIMSIM

Trace     WCDMOT  TRISHUL   TRISHUL   DIMSim    AMT    TRISHUL   DIMSim
          (sec)   private   shared    shared    (sec)  decision  decision
                  config    config    config           in (sec)  in (sec)

JPEG
barbara   1.00    (8X2)     (1X2)     (8X16)    0.96   1700      1832
          0.40    (4X16)    (1X2)     N/A       0.40   361       N/A
          0.15    (16X16)   (64X16)   N/A       0.15   281       N/A
criss     1.00    (8X2)     (1X2)     (8X16)    0.96   1699      1758
          0.40    (4X16)    (1X2)     N/A       0.40   345       N/A
          0.15    (16X16)   (64X16)   N/A       0.15   280       N/A
graph     1.00    (8X2)     (1X2)     (8X16)    0.96   1792      1752
          0.40    (4X16)    (1X2)     N/A       0.40   354       N/A
          0.15    (16X16)   (128X8)   N/A       0.15   283       N/A
lena      1.00    (8X2)     (1X2)     (8X16)    0.96   1761      1735
          0.40    (4X16)    (1X2)     N/A       0.40   344       N/A
          0.15    (16X16)   (64X16)   N/A       0.15   281       N/A
photo1    1.00    (8X2)     (1X2)     (8X16)    0.96   1769      1722
          0.40    (4X16)    (1X2)     N/A       0.40   346       N/A
          0.15    (16X16)   (128X8)   N/A       0.15   281       N/A
photo2    1.00    (8X2)     (1X2)     (8X16)    0.96   1772      1751
          0.40    (4X16)    (1X2)     N/A       0.40   346       N/A
          0.15    (16X16)   (128X8)   N/A       0.15   281       N/A

H264
Bluesky   1.00    (2X8)     (8X8)     (8X4)     0.99   2336      1526
          0.75    (8X4)     (1X2)     (16X16)   0.61   525       1525
          0.40    (2X16)    (64X16)   N/A       0.39   511       N/A
river     1.00    (2X8)     (8X8)     (8X4)     0.99   2145      1541
          0.75    (8X4)     (1X2)     (16X16)   0.61   524       1472
          0.40    (2X16)    (64X16)   N/A       0.38   640       N/A
station   1.00    (2X8)     (8X8)     (8X4)     0.99   2255      1506
          0.75    (8X4)     (1X2)     (16X16)   0.61   600       1422
          0.40    (2X16)    (64X16)   N/A       0.38   478       N/A
pedest.   1.00    (2X8)     (8X8)     (8X4)     0.99   2262      1464
          0.75    (8X4)     (1X2)     (16X16)   0.61   529       1436
          0.40    (2X16)    (64X16)   N/A       0.38   696       N/A
tractor   1.00    (2X8)     (8X8)     (8X4)     0.99   2255      1507
          0.75    (8X4)     (1X2)     (16X16)   0.61   523       1449
          0.40    (2X16)    (64X16)   N/A       0.39   706       N/A

Let’s answer all these questions in the following section. 
IV. EXPERIMENT AND RESULTS 
DIMSim showed that a crude estimation of a real-time application specific
two-level inclusive data cache hierarchy reduces the design space
exploration time from years to minutes. Therefore, our experiment is set
up to find out whether TRISHUL can find the optimal cache hierarchy within
similar or less time than DIMSim. We implement TRISHUL in the C language
and re-implement DIMSim following the guidelines provided in [14].
We build a six core cache-less multiprocessor system using the Tensilica
tool set [21]. Like DIMSim, we execute a JPEG encoder and an H264 encoder
(only the motion estimation kernel) to generate traces for different image
and video benchmarks. Both applications are partitioned into multiple
communicating/sharing tasks which are mapped on separate processors. Data
sharing is performed only through the shared cache.
For simulation, we execute TRISHUL and DIMSim on a machine with a dual
core Opteron64 2GHz processor, 8GB of main memory and 1MByte of shared L2
cache. In our experiment, each private cache or shared cache has 75
possible configurations where S = 1 to 16384, A = 1, 2, 4, 8, 16,
B = 4Bytes and the replacement policy is FIFO. We used TP = 1 ns,
TS = 4 ns, and TM = 15 ns (based on the Xtensa processor [21]), assuming
that all the applications are mapped on a 1GHz processor with one clock
cycle private cache latency.
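The experiment parameters can be made concrete with the sketch below. It enumerates the 75 configurations per cache (assuming S takes power-of-two values from 1 to 16384) and evaluates the timing model that the derivation in this section relies on, namely that the data memory operation time is (TAP × TP) + (TAS × TS) + (TAM × TM). The function names and the access counts in the usage lines are hypothetical and are not taken from the experiments.

```python
# Latencies quoted in the experiment setup, in nanoseconds.
T_P_NS, T_S_NS, T_M_NS = 1, 4, 15            # private cache, shared cache, memory

# Assumption: set sizes are the powers of two from 1 to 16384.
SET_SIZES = [2 ** k for k in range(15)]
ASSOCIATIVITIES = [1, 2, 4, 8, 16]
configs = [(s, a) for s in SET_SIZES for a in ASSOCIATIVITIES]


def data_memory_time_sec(tap, tas, tam):
    """Time spent on sequential accesses served at each level, in seconds."""
    return (tap * T_P_NS + tas * T_S_NS + tam * T_M_NS) * 1e-9


def max_memory_accesses(wcdmot_sec, tap, tas):
    """Largest TAM that still meets WCDMOT, or -1 if TAP and TAS alone exceed it."""
    budget_ns = round(wcdmot_sec * 1e9) - tap * T_P_NS - tas * T_S_NS
    return budget_ns // T_M_NS if budget_ns >= 0 else -1


# Hypothetical access counts, for illustration only.
tap, tas, tam = 100_000_000, 50_000_000, 10_000_000
print(len(configs), data_memory_time_sec(tap, tas, tam))
print(max_memory_accesses(1.0, tap, tas))
```

With these made-up counts the hierarchy would spend 0.45 s on data memory operations, and a 1.0 s WCDMOT would leave room for at most 46,666,666 main memory accesses.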
Table I presents the experiment results for TRISHUL and DIMSim. Column 1
lists the six JPEG traces and five H264 traces. Column 2 presents the
generous (1.0sec), regular (0.40sec for JPEG and 0.75sec for H264) and
stingy (0.15sec for JPEG and 0.40sec for H264) WCDMOT calculated using
[10] for every trace file. Column 3 presents the configuration of each
cache in the private level selected by TRISHUL. Columns 4 and 5 present
the shared cache configurations selected by TRISHUL and DIMSim
respectively. No private level cache configuration is presented for
DIMSim, as no practical private cache selection criterion is provided in
[14]. Column 6 presents the actual data operation time (AMT) of the
TRISHUL selected cache hierarchy. Column 7 presents the total time to
select an optimal cache hierarchy in TRISHUL. The last column presents the
time to select a shared cache only in DIMSim. For example, for JPEG
barbara and WCDMOT = 1.0sec, TRISHUL selected a private level
configuration with each cache containing (S = 8, A = 2 and B = 4Bytes).
The shared cache configurations selected by TRISHUL and DIMSim contain
(S = 1, A = 2 and B = 4Bytes) and (S = 8, A = 16 and B = 4Bytes)
respectively. The TRISHUL selected cache hierarchy has an AMT of 0.96sec.
For this case, TRISHUL selected the entire cache hierarchy in
approximately 28min. On the other hand, DIMSim took almost 31min just to
select a shared cache. Note that in this example the entire cache
hierarchy selected by TRISHUL is only 396Bytes, whereas DIMSim's shared
cache alone is 512Bytes. The results reveal that the TRISHUL selected
shared cache can be 128 times smaller or 2 times bigger in size compared
to DIMSim's shared cache.

Readers may wonder why the DIMSim suggested shared cache is sometimes
smaller than the TRISHUL suggested shared cache (e.g., bluesky with
WCDMOT = 1.0sec). In TRISHUL, private cache misses generated in parallel
are sequentialized and then searched in the shared cache. This ordering
process is random, and depending on the ordering, cache misses in the
shared cache may increase. In DIMSim, no parallel access is considered.
Therefore, if the trace file has the most optimized ordering of accesses,
DIMSim may produce smaller shared caches than TRISHUL.

In Figure 4, the CHPP predicted values for TAP, TAS and TAM' and their
corresponding observed values TAP', TAS' and TAM'' in the TRISHUL selected
cache hierarchies are presented in groups for generous, regular and stingy
WCDMOT. In each group, the order of the trace files is the same as in
Table I: the leftmost bar pair indicates the predicted and observed values
for barbara and the rightmost bar pair represents the tractor trace.
Figure 4(b) shows that the accuracy of TAS with respect to TAS' ranges
from 64% to 96%. Similarly, Figure 4(c) shows that the accuracy of TAM'
with respect to TAM'' ranges from 74.55% to 99.95%. The minimum value of
sequential private cache accesses (TAP) is also very accurately predicted
by CHPP (see Figure 4(a)). Due to the accurate predictions, TRISHUL can
select a cache hierarchy without simulating all the possible
configurations for each cache. For each JPEG trace, TRISHUL simulated
between 38 and 54 of the 75 private level configurations. For every H264
trace, the number of private level configurations simulated is between 33
and 53. TRISHUL simulated 27-60 and 30-45 shared cache configurations for
JPEG and H264 respectively. Moreover, the AMT of the cache hierarchy
selected by TRISHUL for each trace file closely satisfies the given
WCDMOT.
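The size comparisons quoted for the shared caches follow directly from the (S X A) notation of Table I. The sketch below assumes that cache size means data storage only, that is S × A × B with B = 4 bytes, and ignores tags and valid bits; the helper name is ours.

```python
BLOCK_BYTES = 4  # fixed block size B used in the experiments


def cache_bytes(config, block_bytes=BLOCK_BYTES):
    """Data storage of a cache written as '(SXA)' in Table I (sets x ways)."""
    sets, ways = (int(v) for v in config.upper().strip("()").split("X"))
    return sets * ways * block_bytes


# DIMSim's shared cache for JPEG barbara at WCDMOT = 1.0 sec.
dimsim_barbara = cache_bytes("(8X16)")

# H264 traces at WCDMOT = 0.75 sec: DIMSim (16X16) versus TRISHUL (1X2).
smaller_ratio = cache_bytes("(16X16)") / cache_bytes("(1X2)")

# H264 traces at WCDMOT = 1.0 sec: TRISHUL (8X8) versus DIMSim (8X4).
bigger_ratio = cache_bytes("(8X8)") / cache_bytes("(8X4)")

print(dimsim_barbara, smaller_ratio, bigger_ratio)
```

Under this assumption the three values are 512 bytes, 128 and 2, matching the 512Bytes, "128 times smaller" and "2 times bigger" figures in the text.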
Results show that TRISHUL is quite efficient in finding a cache hierarchy
for the given criteria. However, to prove that TRISHUL chooses the optimal
cache hierarchy, we have to answer Questions 1 and 2. To find the answers,
let us analyze an example. Consider two cache hierarchies: ‘H1’ with
private level configuration ‘C1’ and ‘H2’ with private level configuration
‘C2’, where ‘C2’ is bigger than ‘C1’. For a trace file and fixed B, the
number of misses in ‘C2’ is TAS'', which will always be smaller than the
number of misses in ‘C1’ (TAS'). For a fixed B and trace file, the total
number of sequential accesses to the private level (TAP) does not change
when the cache hierarchy configuration is changed (see Figure 4(a)).
Therefore,
1. Answer for Question 1: if both ‘H1’ and ‘H2’ could satisfy WCDMOT with
an equal number of memory accesses ($TAM'' = TAM'''$), it means
$(TAP' \times TP) + (TAS' \times TS) + (TAM'' \times TM) = (TAP'' \times TP) + (TAS'' \times TS) + (TAM''' \times TM)$.
As $TAP' = TAP''$ and $TAM'' = TAM'''$, $TAS'$ must be equal to $TAS''$.
Otherwise WCDMOT cannot be satisfied. So, ‘H1’ cannot satisfy WCDMOT when
$TAS' > TAS''$ and $TAM'' = TAM'''$. Even if ‘H1’ could satisfy WCDMOT
with $TAS' > TAS''$ and $TAM'' < TAM'''$, it would not be optimal, because
fewer memory accesses mean more storage capacity in the cache hierarchy
(more space and energy consumption, besides being costly).
2. Answer for Question 2: In this case, the TRISHUL selected private level
configuration is ‘C1’, which cannot satisfy WCDMOT with ‘H1’. So, for
‘H1’,
$WCDMOT - (TAP' \times TP) < (TAS' \times TS) + (TAM'' \times TM)$.
If ‘H2’ could satisfy WCDMOT,
$WCDMOT - (TAP'' \times TP) = (TAS'' \times TS) + (TAM''' \times TM)$.
As $TAP' = TAP''$, it means
$(TAS'' \times TS) + (TAM''' \times TM) < (TAS' \times TS) + (TAM'' \times TM)$,
or
$TS \times (TAS' - TAS'') > TM \times (TAM''' - TAM'')$.
As $TAS' > TAS''$, to have a positive value of $(TAM''' - TAM'')$, $TAM'''$
must be larger than $TAM''$. But in reality, $TAM'''$ cannot be larger
than $TAM''$ when $TAS' > TAS''$, because to satisfy WCDMOT with ‘H2’,
$TAM'''$ has to be less than $TAM''$. So, the answer for Question 2 is
“No”.
So, TRISHUL selects the optimal cache hierarchy if one exists.

Fig. 4. Predicted vs. Observed Values in TRISHUL: (a) TAP vs. TAP',
(b) TAS vs. TAS', (c) TAM' vs. TAM''

From the last two columns of Table I, it can be seen that TRISHUL and
DIMSim spent comparable time to select a cache hierarchy and a shared
cache respectively for WCDMOT = 1.0sec. However, DIMSim failed to make any
decision for WCDMOT < 1.0sec in any JPEG trace and for WCDMOT < 0.75sec in
any H264 trace. The reason is that DIMSim first selects a shared cache
that alone can satisfy the given WCDMOT, and it is impossible to satisfy
WCDMOT < 1.0sec (for JPEG) or WCDMOT < 0.75sec (for H264) with any single
cache memory in any configuration simulated in our experiment. On the
other hand, TRISHUL saves a huge amount of time for WCDMOT < 1.0sec. The
reason is that, when private level configurations are simulated for
WCDMOT = 1.0sec, SPCS records the results for every private level
configuration that does not exceed the CHPP given TAS. When the WCDMOT
reduces, the required TAS value decreases and can only be satisfied by a
larger private level configuration. As all the larger private level
configurations' TAS' values are recorded for the trace file in the SPCS
result for WCDMOT = 1.0sec, the appropriate private level configuration
can be selected without further simulation for any WCDMOT < 1.0sec with
the help of CHPP. Once the private level configuration is selected, the
shared cache can be selected with the help of SSCS. This is the answer to
Question 3. When DIMSim cannot find a shared cache for WCDMOT < 1.0sec in
JPEG, the solution for WCDMOT = 1.0sec has to be used. The same goes for
H264. For example, for WCDMOT = 0.4sec and bluesky, TRISHUL took around
8min to decide on a cache hierarchy. But for DIMSim, the solution for
WCDMOT = 0.75sec has to be used. So, DIMSim's decision time is 25min
(3 times slower than TRISHUL). In this way, TRISHUL can be up to 7 times
faster than DIMSim for the traces analyzed in Table I.
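The speedup figures can be checked against the decision times in Table I. The sketch below copies four values from the table; pairing the stingy JPEG barbara case with DIMSim's WCDMOT = 1.0sec fallback is our reading of how the "up to 7 times" figure arises and should be treated as an assumption.

```python
# Decision times in seconds, copied from Table I.
TRISHUL_BLUESKY_040 = 511    # TRISHUL, H264 bluesky, WCDMOT = 0.40 sec
DIMSIM_BLUESKY_075 = 1525    # DIMSim fallback: its WCDMOT = 0.75 sec solution
TRISHUL_BARBARA_015 = 281    # TRISHUL, JPEG barbara, WCDMOT = 0.15 sec
DIMSIM_BARBARA_100 = 1832    # DIMSim fallback: its WCDMOT = 1.0 sec solution


def speedup(dimsim_sec, trishul_sec):
    """How many times longer DIMSim takes than TRISHUL."""
    return dimsim_sec / trishul_sec


bluesky_speedup = speedup(DIMSIM_BLUESKY_075, TRISHUL_BLUESKY_040)
barbara_speedup = speedup(DIMSIM_BARBARA_100, TRISHUL_BARBARA_015)
print(f"bluesky: {bluesky_speedup:.1f}x, barbara: {barbara_speedup:.1f}x")
print(f"{TRISHUL_BLUESKY_040 / 60:.1f} min vs {DIMSIM_BLUESKY_075 / 60:.1f} min")
```

The bluesky case reproduces the roughly 8min versus 25min comparison (about 3 times), and the barbara case comes out at about 6.5 times, which is of the same order as the stated upper bound.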
V. CONCLUSION 
In this article, we present an application trace driven process to select
the optimal two-level inclusive data cache hierarchy for real-time MPSoCs.
The method, TRISHUL, introduces a novel mechanism to find the required
cache hierarchy performance without analyzing/simulating any cache memory
behavior. TRISHUL can select an optimal cache hierarchy within the time
that the available trace driven two-level inclusive data cache hierarchy
selectors need to select a single shared cache.
REFERENCES 
[1] L. Barriga and R. Ayani. Parallel cache simulation on multiprocessor workstations. ICPP 1993, volume 1, pages 171–174, 1993.
[2] Y. Cai, M. T. Schmitz, A. Ejlali, B. M. Al-hashimi, and S. M. Reddy. Cache size selection for performance, energy and reliability of time-constrained systems. ASP-DAC, 2006.
[3] J. Casazza. Intel core i7-800 processor series and intel core i5-700 processor series based on intel microarchitecture (nehalem). Intel White Paper, Intel Corporation, USA, 2009.
[4] R. Chapman. Worst-case timing analysis via finding longest paths in spark ada basic-path graphs, 1994.
[5] M. Corti and T. R. Gross. Approximation of the worst-case execution time using structural analysis. EMSOFT, 2004.
[6] M. S. Haque, J. Peddersen, and S. Parameswaran. Ciparsim: Cache intersection property assisted rapid single-pass fifo cache simulation technique. ICCAD, 2011.
[7] M. S. Haque, A. Janapsatya, and S. Parameswaran. Susesim: a fast simulation strategy to find optimal l1 cache configuration for embedded systems. CODES+ISSS, 2009.
[8] L. Hsu, R. Iyer, S. Makineni, S. Reinhardt, and D. Newell. Exploring the cache design space for large scale CMPs. SIGARCH Comput. Archit. News, 33:24–33, November 2005.
[9] A. Janapsatya, A. Ignjatovic, and S. Parameswaran. Finding optimal l1 cache configuration for embedded systems. ASP-DAC, 2006.
[10] S. K. Kim, S. L. Min, and R. Ha. Efficient worst case timing analysis of data caching. RTAS, 1996.
[11] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: a 32-way multithreaded Sparc processor. Micro, IEEE, 25(2):21–29, March-April 2005.
[12] X. Li, H. S. Negi, T. Mitra, and A. Roychoudhury. Design space exploration of caches using compressed traces. ICS, 2004.
[13] A. Milenković and M. Milenković. An efficient single-pass trace compression technique utilizing instruction streams. ACM Trans. Model. Comput. Simul., 17(1):2, 2007.
[14] M. S. Haque, Roshan G. Ragel, A. J. Ambrose, and S. Parameswaran. Dimsim: A rapid two-level cache simulation approach for deadline-based mpsocs. In Technical Report: 201218, CSE, UNSW, 2012.
[15] D. M. Nicol, A. G. Greenberg, and B. D. Lubachevsky. Massively parallel algorithms for trace-driven cache simulations. IEEE Trans. Parallel Distrib. Syst., 5(8):849–859, 1994.
[16] B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner. POWER5 system microarchitecture. IBM Journal of Research and Development, 49(4.5):505–521, July 2005.
[17] J. Staschulat and R. Ernst. Worst case timing analysis of input dependent data cache behavior. ECRTS, 2006.
[18] R. A. Sugumar and S. G. Abraham. Set-associative cache simulation using generalized binomial trees. ACM Trans. Comput. Syst., pages 32–56, 1995.
[19] W.-H. Wang and J.-L. Baer. Efficient trace-driven simulation method for cache performance analysis. In SIGMETRICS, 1990.
[20] R. T. White, F. Mueller, C. A. Healy, D. B. Whalley, and M. G. Harmon. Timing analysis for data caches and set-associative caches. RTAS, 1997.
[21] Xtensa. Xtensa lx2 product brief. www.tensilica.com.
[22] W. Zang and A. Gordon-Ross. T-spacs: a two-level single-pass cache simulation methodology. ASPDAC, 2011.
[23] W. Zang and A. Gordon-Ross. A single-pass cache simulation methodology for two-level unified caches. ISPASS, 2012.
