Abstract-Cache tuning has been shown to achieve considerable energy savings, and methods have been proposed for tuning the cache for standalone embedded applications. However, with the increasing complexity of modern day embedded applications, RTOS based multitasking systems are fast becoming the norm. Therefore, there exists a need for techniques to tune the cache for multitasking systems. In this paper we present a framework for energy centric tuning of the instruction cache for embedded multitasking systems. Our framework is built upon a formal model for characterizing multitasking systems and is suitable for fast instruction cache tuning using loop profiling. We validate our proposed techniques by applying them to tune a predictor based filter cache hierarchy, a common solution for low power embedded systems. For all the multitasking programs tested, our techniques are able to successfully predict configurations that are optimal or near-optimal. The proposed methods are also able to achieve speed-ups of up to an order of magnitude compared to exhaustive design space exploration techniques.
I. INTRODUCTION
With the advent of mobile and hand held devices, energy centric design space exploration is emerging as an active area of research. Of the different components contributing to the power consumption of an embedded system, cache memories in general, and the instruction cache in particular, have been shown to consume a significant portion [1]. Based on these findings, configurable cache memories were proposed in [2]. The underlying idea behind configurable cache memories was that they could be tuned (by modifying characteristics like cache size, block size, associativity, etc.) in an application specific manner, leading to savings in energy and/or area without trading off too much performance.
Choosing the appropriate cache parameters for an application, however, requires exhaustive design space exploration. This entails simulating the application multiple times and can be prohibitively time consuming even for small applications. A workaround to this problem has been instruction cache tuning based on loop profiling. It has been shown to achieve significant speed-up in the tuning process by tuning the cache based on results from a one-time loop profiling of the application [3] [4] [5].
While such loop profiling techniques work well for standalone programs, there has been no work done to extend them to multitasking systems running on a real-time operating system. The challenge in applying loop profiling algorithms to tuning multitasking systems lies in the inherent pseudo-random nature of control flow. While the user tasks and system calls have a somewhat deterministic control flow, invocation of interrupt or trap service routines upsets this determinism and introduces randomness in the system. As a result, there is a need for extending current loop profiling and cache tuning techniques for rapid and effective cache tuning of multitasking systems.
The rest of the paper is organized as follows. In section II, we present background on loop profiling techniques for cache tuning. The cache tuning heuristics for the predictor based filter cache hierarchy are presented in section III. We propose our cache tuning framework for multitasking applications in section IV. We discuss the results from our experiments in section V and conclude the paper in section VI.
II. BACKGROUND
Loop profiling is the process of extracting information about loops in the application. This information is usually extracted from the control flow graph of the application. Several algorithms have been proposed to find loops in an application, such as Tarjan's algorithm [6] and the Dominator-Join algorithm [7]. These algorithms find widespread use in optimizing compilers for loop related transformations.
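As an illustration of loop extraction from a control flow graph, the sketch below finds natural loops by locating back edges with a textbook iterative dominator computation. It is not an implementation of the algorithms in [6] or [7]; the adjacency-dict graph representation and the function names are assumptions made for this example.

```python
def dominators(cfg, entry):
    """Iterative dataflow: dom[n] = {n} | intersection of dom[p] over predecessors p.

    cfg maps every basic block to a list of its successors (hypothetical format);
    every successor must also appear as a key.
    """
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for src, succs in cfg.items():
        for dst in succs:
            preds[dst].add(src)

    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom, preds


def natural_loops(cfg, entry):
    """Return {(tail, header): body} for every back edge tail -> header."""
    dom, preds = dominators(cfg, entry)
    loops = {}
    for tail, succs in cfg.items():
        for header in succs:
            if header in dom[tail]:  # back edge: the target dominates the source
                body = {header, tail}
                stack = [tail] if tail != header else []
                while stack:  # walk predecessors backwards until the header
                    node = stack.pop()
                    for p in preds[node]:
                        if p not in body:
                            body.add(p)
                            stack.append(p)
                loops[(tail, header)] = body
    return loops


# Hypothetical four-block function with one loop between B and C.
example_cfg = {"A": ["B"], "B": ["C"], "C": ["B", "D"], "D": []}
print(natural_loops(example_cfg, "A"))
```

Weighting each block in a loop body by its size and execution count would then give the loop size and frequency figures that loop profiling relies on.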
The size and frequency of loops in an application have a direct bearing on the cache miss rates for the application. This has prompted recent research on application of loop profiling for the rapid estimation of instruction cache hit rates and subsequent cache tuning [3] [8] . Loop profiling techniques rely on a one time simulation of the application to obtain its weighted call graph and weighted control flow graph of all its functions. This information is then utilized to extract information about the size and frequency of loops in the application and for subsequent cache tuning. The popularity of this technique stems from the fact that weighted control flow graphs and call graphs are fairly quick and easy to obtain for an application using instrumentation tools like gcov, gprof etc.
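To make the link between loop characteristics and miss rates concrete, the following is a deliberately simple sketch and not the estimation method of [3] or [8]. It assumes a fully associative LRU cache, loops that are swept sequentially, a 32B line and 4B instructions; all names and parameters are hypothetical. Under those assumptions a loop that fits in the cache incurs only cold misses, while a loop larger than the cache misses on every line in every iteration.

```python
import math


def estimate_loop_misses(loop_bytes, iterations, cache_bytes, line_bytes=32):
    """Estimated instruction cache misses for one loop (simplified model)."""
    lines = math.ceil(loop_bytes / line_bytes)
    if loop_bytes <= cache_bytes:
        return lines               # cold misses only; later iterations hit
    return lines * iterations      # cyclic sweep larger than an LRU cache


def estimate_miss_rate(loops, cache_bytes, line_bytes=32, instr_bytes=4):
    """loops: iterable of (loop_bytes, iterations) pairs from a loop profile."""
    misses = 0
    fetches = 0
    for loop_bytes, iterations in loops:
        misses += estimate_loop_misses(loop_bytes, iterations, cache_bytes, line_bytes)
        fetches += (loop_bytes // instr_bytes) * iterations
    return misses / fetches if fetches else 0.0


# Hypothetical profile: a 256B loop and a 2kB loop, 100 iterations each, 1kB cache.
print(estimate_miss_rate([(256, 100), (2048, 100)], cache_bytes=1024))
```

The model ignores loop alignment, nesting and set conflicts, so it should be read only as an indication of why a single profile can be reused across many candidate cache sizes.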
III. TUNING HEURISTICS FOR PREDICTOR BASED FILTER CACHE
To test our profiling techniques for multitasking cache tuning, we chose the predictor based filter cache hierarchy as a test case. This is a common solution for low power cache hierarchies in the embedded systems domain. In this section, we give an overview of the predictor based filter cache hierarchy and discuss energy-centric heuristics for tuning it.
A. Predictor based filter cache hierarchy
The predictor based filter cache hierarchy, shown in figure 1, consists of a filter cache in conjunction with an L1 instruction cache [9] [10] [5]. The filter cache is a tiny auxiliary instruction cache whose purpose is to hold the instructions of the many loops inherent in embedded applications. While these tiny loops execute, access to the small filter cache reduces the energy expended in the cache hierarchy. However, the filter cache could be a potential liability for larger loops, as they cannot be contained in the small cache. This is where the predictor becomes useful. Simply put, its role is to minimize accesses to the filter cache when the requisite instruction is not expected to be there. The predictor in a filter cache hierarchy works in two stages. The first stage operates at the cache line level. It assumes that the address of the next instruction to be accessed will be the current value of the program counter plus the instruction size. If the current instruction and the next instruction map to the same cache line, the instruction access is directed to the filter cache. If they do not map to the same cache line, a pattern predictor is accessed to decide whether the next access will be from the L1 cache or the filter cache [9]. The prediction process is shown in figure 2.
The important thing to note here is that when an instruction is accessed from the L1 cache, a full filter cache line containing the instruction is moved to the filter cache too. Subsequent accesses to instructions from the same line are served from the filter cache instead of the L1 cache. So in an ideal case, the predictor can sense that the instruction will be a miss in the filter cache and direct it to the L1 cache. The proportion of the instructions directed to the L1 cache will therefore be the application wide miss rate for the filter cache.
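The two-stage decision described above can be sketched as follows. The 4B instruction size, the 32B filter cache line and the `pattern_predictor` callable (which stands in for the pattern predictor of [9], whose internals are not described here) are assumptions made for this example.

```python
def same_cache_line(pc, instr_bytes=4, line_bytes=32):
    """Stage 1: does the sequential next instruction fall in the current line?"""
    next_pc = pc + instr_bytes
    return pc // line_bytes == next_pc // line_bytes


def choose_cache(pc, pattern_predictor, instr_bytes=4, line_bytes=32):
    """Return "filter" or "L1" for the next instruction fetch.

    pattern_predictor is a hypothetical callable that takes the predicted next
    address and returns True when the filter cache is expected to hold it.
    """
    if same_cache_line(pc, instr_bytes, line_bytes):
        return "filter"
    # Stage 2: the line boundary is crossed, so consult the pattern predictor.
    next_pc = pc + instr_bytes
    return "filter" if pattern_predictor(next_pc) else "L1"


# 0x1000 and 0x1004 share a 32B line; 0x101C and 0x1020 do not.
print(choose_cache(0x1000, lambda addr: False))
print(choose_cache(0x101C, lambda addr: False))
```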
B. Instruction cache hierarchy tuning heuristics
Energy consumption is a prime issue in modern day embedded systems. Therefore, our objective in cache tuning is to find an instruction cache hierarchy for a given application that reduces the energy consumption of the cache hierarchy without trading off too much performance. The performance clause is important because a smaller cache may save energy per access but ultimately leads to poorer performance due to the many misses that may occur while accessing it. To meet our tuning objective, we used a yardstick that captures both the energy saved and the performance sacrificed to achieve it: the energy-delay product.
For calculating the energy-delay product of the filter cache and the L1 cache in the predictor based instruction cache hierarchy, we look at the following properties of the cache.
• Cache Access Time: This is the average amount of time spent per cache reference (assuming it to be a hit) to retrieve the requisite instruction from the cache. In the following equations, this is represented as AT_FC for the filter cache and AT_L1 for the L1 cache.
• Cache Access Energy: This is the average amount of energy spent per cache reference (assuming it to be a hit) to retrieve the requisite instruction from the cache. In the following equations, this is represented as AE_FC for the filter cache and AE_L1 for the L1 cache.
• Cache Miss Rate: This is the application wide miss rate for a given cache size. In the following equations, it is represented as MR_FC for the filter cache and MR_L1 for the L1 cache.
• Average Memory Access Time: An ideal predictor can identify whether a reference will miss in the filter cache and direct it to the L1 cache, bypassing the filter cache altogether. Thus, when the instruction hits in the filter cache, we incur the latency of the filter cache, and when it would miss in the filter cache, we incur the latency of the L1 cache. In a predictorless filter cache hierarchy, the latency on a miss would be the access time of the filter cache plus the access time of the L1 cache. The average memory access times for the filter cache and the L1 cache are calculated as shown in equations 1 and 2. In this set of equations, AMAT_FC is the average memory access time for the filter cache and AMAT_L1 is the average memory access time for the L1 cache. HR_FC is the application-wide hit rate for the filter cache.
• Average Memory Access Energy: The calculation of average memory access energy in a predictor based filter cache hierarchy differs from that of the average memory access time. When an instruction is accessed from the L1 cache, the line is transferred to the filter cache and subsequent accesses to the same line are made from the filter cache. So, for every access to the L1 cache, the filter cache is also accessed once. The average memory access energies for the filter cache and the L1 cache are calculated as shown in equations 3 and 4. In this set of equations, AMAE_FC is the average memory access energy for the filter cache and AMAE_L1 is the average memory access energy for the L1 cache.
• Energy-Delay Product: The energy-delay product is the product of the average memory access time and the average memory access energy, and is calculated as shown in equations 5 and 6.
In our exploration algorithm, we first find the L1 cache size that gives the lowest energy-delay product, min(EDP_L1), in the set of L1 cache sizes that we explore. Once we know the optimal (lowest EDP) L1 cache size, we know the values of AMAE_L1 and AMAT_L1. Then, we repeat the same process to find the optimal filter cache size.
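The two-step exploration can be sketched as below. The numbered equations are not reproduced in this text, so the formulas in the code are one reading of the property definitions above rather than the paper's equations: an L1 miss is assumed to cost one L2 access in time and energy, a filter cache miss is assumed to cost the L1 average, and each L1 access is assumed to add one filter cache access in energy. The class, function and parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheOption:
    size_bytes: int
    access_time: float    # AT: time per hit
    access_energy: float  # AE: energy per hit
    miss_rate: float      # MR: application-wide, in [0, 1]


def l1_metrics(l1, l2_time, l2_energy):
    """Assumed AMAT_L1 and AMAE_L1: hit cost plus miss rate times L2 cost."""
    amat = l1.access_time + l1.miss_rate * l2_time
    amae = l1.access_energy + l1.miss_rate * l2_energy
    return amat, amae


def fc_metrics(fc, amat_l1, amae_l1):
    """Assumed AMAT_FC and AMAE_FC with an ideal predictor."""
    hit_rate = 1.0 - fc.miss_rate  # HR_FC
    # Time: a predicted miss bypasses the filter cache and pays only the L1 path.
    amat = hit_rate * fc.access_time + fc.miss_rate * amat_l1
    # Energy: each L1 access also fills (accesses) the filter cache once.
    amae = hit_rate * fc.access_energy + fc.miss_rate * (amae_l1 + fc.access_energy)
    return amat, amae


def edp(amat, amae):
    return amat * amae


def tune(l1_options, fc_options, l2_time, l2_energy):
    """Pick the lowest-EDP L1 first, then the lowest-EDP filter cache."""
    best_l1 = min(l1_options, key=lambda c: edp(*l1_metrics(c, l2_time, l2_energy)))
    amat_l1, amae_l1 = l1_metrics(best_l1, l2_time, l2_energy)
    best_fc = min(fc_options, key=lambda c: edp(*fc_metrics(c, amat_l1, amae_l1)))
    return best_fc, best_l1
```

In practice the access times and energies would come from a tool such as CACTI and the miss rates from simulation or loop profiling; any values passed to `tune` in an experiment with this sketch are the caller's own.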
IV. INSTRUCTION CACHE TUNING FRAMEWORK FOR MULTITASKING SYSTEMS
In this section, we discuss our cache tuning framework for multitasking applications. We first address the issues in extending the current cache tuning techniques for standalone applications to multitasking systems. We then present our approach to profiling multitasking applications for cache tuning. For the rest of the discussion, we use the following abbreviations: UT for user task, SC for system call, ISR for interrupt service routine and TSR for trap service routine.
A. Issues in instruction cache tuning for multitasking systems
There are two issues that need to be addressed before extending existing instruction cache tuning techniques developed for stand-alone applications to multitasking systems.
Firstly, system calls are tightly bound to user tasks and are often invoked from within loops in user tasks. Any method for loop profiling and subsequent cache tuning of multitasking systems needs to be sensitive to the interaction between user level tasks and system calls when estimating loop sizes for user level tasks.
The second issue that needs to be addressed is code interference in the instruction cache. Figure 3 shows the execution footprint of a typical multitasking application. The solid arrows represent invocation of system calls from user tasks and the return paths. This control flow is deterministic. The dotted arrows represent indeterminate or sporadic flow of control to an ISR/TSR and back. Two kinds of code interference can happen in such executions of multitasking applications: intrinsic and extrinsic. Intrinsic interference happens because of cache line conflicts between the instructions of a task and the system calls invoked by it. This is shown in the lighter circles in figure 3. Extrinsic interference, on the other hand, happens due to cache line conflicts between two different tasks or between an ISR/TSR and a task. Misses due to intrinsic interference can be determined by stand-alone loop profiling techniques. However, misses due to extrinsic interference are indeterminate because of the sporadic nature of the interrupts.
While the first issue can be addressed by binding and profiling the UT and SC together, the second issue requires careful consideration. In the next subsection, we conduct some empirical experiments to estimate the effect of extrinsic interference in multitasking applications on the cache tuning process.
B. Effect of extrinsic interference on tuning process
To measure the effect of code mixing, or extrinsic interference, on the cache tuning process, we conducted two sets of experiments. In the first experiment, we simulated multitasking applications to obtain the instruction cache miss rates and the energy-delay product of the entire system. We obtained these measurements for a large range of cache sizes (1kB to 32kB). In the second experiment, we simulated the same applications again, but this time we obtained the instruction cache miss rates of each stand-alone TSR/ISR and UT/SC invoked as a part of the running system. We then summed these figures over all TSR/ISR and UT/SC that were invoked. As expected, there was a decrease in the number of misses compared to the first experiment, since the second experiment ignored extrinsic interference. Due to the lowered number of misses, the energy-delay product for the second experiment was also found to be lower than in the first experiment.
Figure 4 shows the miss rate curve and the energy-delay product for two of the eight multitasking programs tested. The difference in the miss rates, and hence in the energy-delay product, because of extrinsic interference is, however, found to be negligible. This conclusion is further validated in table I. The average difference in instruction cache miss rate is only 0.02% to 0.08% of all instructions executed. Consequently, as shown in table I, the difference in energy-delay product is also found to be negligible. These results can be explained by the observation that misses due to extrinsic interference happen only occasionally: when the thread of execution switches from a UT/SC to an ISR/TSR or vice versa. As execution proceeds after the switch, these transient misses become negligible compared to the misses incurred due to cache line conflicts among instructions of the same UT/SC or ISR/TSR.
Fig. 3. Extrinsic cache interference in multitasking execution
C. Overview of the multitasking cache tuner
As discussed in the preceding subsection, the effect of extrinsic interference on the EDP, and hence on cache tuning, is minimal. In view of this, loop profiling, estimation of the cache miss rates from the loop characteristics and subsequent cache tuning for multitasking applications can be simplified as follows. Profiling the multitasking system can be done separately for each UT/SC and each ISR/TSR, as shown in figure 5. For profiling purposes, we can use the techniques described in [4]. The estimates for the number of misses can then be added up to get an estimate of the number of misses for the entire system. The number of misses thus obtained can be input to the tuning heuristics to obtain an estimate of the optimal cache for the multitasking program.
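The aggregation step can be sketched as follows. The component names and the (misses, fetches) tuple layout are assumptions for illustration; the per-component miss estimates would come from whichever profiling technique is used.

```python
def system_miss_rate(components):
    """Combine isolated per-component estimates into a system-wide miss rate.

    components maps a component name (a UT/SC pair or an ISR/TSR) to a tuple
    (estimated_misses, instruction_fetches) obtained by profiling it in isolation.
    Extrinsic interference between components is ignored, as argued above.
    """
    total_misses = sum(misses for misses, _ in components.values())
    total_fetches = sum(fetches for _, fetches in components.values())
    return total_misses / total_fetches if total_fetches else 0.0


# Hypothetical profile of one user task with its system calls and one ISR.
profile = {"task_a/SC": (100, 10_000), "timer_isr": (20, 2_000)}
print(system_miss_rate(profile))
```

Evaluating this once per candidate cache size yields the miss rates that the EDP-based tuning heuristics take as input.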
V. RESULTS AND DISCUSSION
In this section, we present our results on cache tuning for multitasking applications. We first discuss our experimental setup and then go on to compare design space exploration using our proposed techniques with design space exploration using exhaustive simulation.
A. Experimental Setup
We ran the benchmark programs from the MiBench [11] and MediaBench [12] benchmark suites on the M5 simulator [13]. The M5 simulator simulated a single issue processor with the DEC Alpha instruction set. The operating system running on the simulator was Linux. We assumed a unified L2 cache of 512kB with a 32B block size and 4-way set associativity. The selection space for the design space exploration and The measurements of access energy for the different cache configurations were obtained using the CACTI tool [14].
B. Comparison of proposed techniques with exhaustive simulation
The proposed EDP estimation techniques closely follow the results obtained through exhaustive simulation of the instruction cache hierarchy. Figure 6 shows the offsets in EDP for different cache sizes relative to the optimal cache size. It is evident from the figures that, for both the multitasking programs shown, the estimated curve closely follows the EDP offset curve obtained through exhaustive simulation. Table III shows the filter cache and L1 cache sizes chosen through exhaustive simulation (sim) and through the estimation heuristics (est). The exhaustive simulation approach is always able to identify the optimal configuration (the one with the least EDP) because it resorts to brute force search. However, our estimation heuristics are also able to identify the optimal configuration or a near-optimal one in all cases. The EDP of the configurations chosen by the two techniques is shown in figure 7 next to the EDP of a base L1 cache of 32kB with 4-way set associativity and a 32B line size, which is the norm in many processors. As is evident from the figure, the EDP of the configuration obtained through the estimation techniques is optimal in almost all cases and near-optimal for the rest.
While the estimation technique has been shown to identify optimal or near-optimal cache configurations in almost all cases, the real value of the proposition comes through in the runtime of the algorithm and its scalability.
Multitasking benchmark         Config(est)    Config(sim)
dijk + patr + sha              (1kB,32kB)     (1kB,32kB)
rawd + cjpeg + sha             (512B,4kB)     (512B,4kB)
epic + unepic + rawc           (256B,1kB)     (512B,1kB)
djpeg + rawc + sha             (512B,4kB)     (1kB,4kB)
sha + cjpeg + rawc             (512B,4kB)     (512B,4kB)
rawc + rawd + cjpeg + djpeg    (1kB,4kB)      (1kB,4kB)
patr + dijk + rawc + sha       (1kB,32kB)     (1kB,32kB)
dijk + sha + rawc + rawd       (256B,4kB)     (512B,4kB)

Table IV shows the time required to run the estimation algorithms as compared to an exhaustive simulation. As can be seen, a significant amount of time is saved if the design space exploration is done using the estimation heuristics. For some of the multitasking programs, the estimation techniques are able to achieve a speedup of up to an order of magnitude.
Another huge advantage of such an approach is its scalability. The time complexity of the exhaustive simulation techniques is O(n) where n is the number of configurations tested. On the other hand, the time complexity of the estimation heuristics is constant irrespective of the number of cache sizes that the estimation is required for. This constant time is the time taken to simulate the application, build its call graph and control flow graph and to estimate the EDP for different cache sizes through a one-time loop profiling.
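The scaling argument can be written down as a small cost model. It is a sketch only: it assumes every exhaustive run costs the same simulation time and that the per-configuration estimation cost after profiling is negligible, and the function and parameter names are hypothetical.

```python
def exhaustive_time(num_configs, sim_time):
    """O(n): one full simulation per candidate configuration."""
    return num_configs * sim_time


def estimation_time(profile_time):
    """Constant: one profiling run, independent of the number of configurations."""
    return profile_time


def speedup(num_configs, sim_time, profile_time):
    return exhaustive_time(num_configs, sim_time) / estimation_time(profile_time)
```

Under this model the speedup grows linearly with the number of configurations explored, provided the one-time profiling run costs about as much as a single simulation.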
VI. CONCLUSION
In this paper, we proposed a formal method for modelling embedded multitasking systems for the purpose of loop profiling and subsequent cache tuning. We divided the system execution into the execution of the deterministic user tasks/system calls (UT/SC) and the sporadic interrupt service routines/trap service routines (ISR/TSR). We showed that the interference in the cache between the code segments of these two components was too low to affect the cache tuning process. Based on this observation, we proposed a framework to profile these components in isolation while the system is simulated. The isolated profiling results were then collated to predict optimal cache configurations for the entire system. Our proposed framework was shown to achieve significant speed-up compared to exhaustive cache simulations while still predicting optimal or near-optimal cache sizes for all multitasking programs tested. The proposed techniques are also highly scalable because they can estimate the energy-delay product for any range of cache sizes after a one-time profiling.
