Abstract: The memory hierarchy subsystem has a significant impact on the performance and energy consumption of an embedded system. Methods which increase the hit ratio of the cache hierarchy will typically enhance performance and reduce the embedded system's total energy consumption. This is mainly due to fewer cache-to-memory bus transactions, fewer main memory accesses and fewer processor waiting cycles. A heuristic approach is presented to reduce the total number of cache misses by carefully relocating selected sections of the application's software code within the main memory, thus reducing conflict misses in the cache hierarchy. The method requires no hardware modifications, i.e. it is a software-only approach. For the first time such a method is applied to large program traces, and the miss rates and corresponding energy savings are observed while varying cache size, line size and associativity. Relocating the code consistently produces superior performance on direct-mapped cache. Direct-mapped caches are smaller in silicon area than caches of the same size with higher associativity, cost less in energy per access, and have faster access times; a direct-mapped instruction cache with code relocation is therefore recommended for performance-oriented embedded systems. A maximum cache miss rate reduction from 71% down to less than 1% is achieved, with energy reductions of up to 63%, at the cost of only a small increase in main memory size.
Introduction
It is well known that the memory subsystem of any computing system has a significant impact on the performance and power of the whole system. As far as general-purpose microprocessors are concerned, from one processor generation to another, the on-chip cache is increasing significantly in size [1]. Often, on-chip memory will account for the largest share of transistors, larger than the processor core. This is well-invested chip area since any access to the main memory is time consuming; depending on the memory subsystem, more than 100 cycles might be necessary when the main memory is accessed after a cache miss occurs, whereas the L1/L2 caches have a 1-2 cycle access time. The access cycle time is far lower for cache due to the smaller distances between processor and L1/L2 cache compared with the distance between processor and main memory, and also the smaller effective capacitance of the L1/L2 cache because of its size, silicon and memory technology.
Hardwarewise, there are many means to increase cache hit ratio, such as increasing cache size, modifying cache line sizes and cache policy (this might work when specific characteristics of an application are exploited), and introducing tiny caches for loops [2]. Other recent hardware modification techniques to improve the instruction memory and/or power performance are the inclusion of scratchpad memory [3, 4], the addition of extra levels of cache [5-7], increasing bus widths [8, 9], and shutting down sections of cache to improve energy efficiency [10]. In addition, there are compiler-based approaches for reducing the number of cache accesses by optimising the executable software. These software modification methods, called code placement or code relocation, involve careful rearrangement of instructions within the main memory. Code relocation is performed so that when these instructions are brought into cache they cause minimal conflict misses [11-17].
This paper is closely related to the code relocation methodology. Relocating the code in the main memory is a technique that is applied after the code is compiled. Relocation is unaware of the semantics of the code that is going to be relocated and thus any method using this principle is orthogonal to any compiler technique i.e. it can be applied in addition to compilation-based approaches and might further increase cache hit ratio.
In this paper we apply code relocation to a variety of applications. For the first time we analyse the effect of varying the cache size, associativity, and line size on both the performance and energy consumption. Previous work looked either only at the performance benefit of different cache associativities and line sizes on small programs [11-15, 17] or only at performance and energy gains for direct-mapped cache [16]. Throughout this paper, cache refers to instruction cache unless otherwise stated.
Contribution and characteristics of our approach
Our approach for code relocation uses a simple though effective heuristic that shows its usefulness over a broad range of cache parameters (cache size, line size, and associativity) with significant improvements for any application size/domain. This is achieved by supplying cache parameter information, and reducing the search space by eliminating program paths that are rarely or never executed.
Compared with general (non-code-relocation) approaches, our approach distinguishes itself in two ways:
. no additional hardware is necessary
. the approach is orthogonal to compiler-based memory optimisation strategies. It is applied after the compilation phase and yields additional (non-semantic-sensitive) improvements, whereas compiler-based approaches optimise with regard to the code semantics only. Thus our approach complements compiler-based approaches.
The basic target architecture considered in this paper is as follows: a single processor core that has exclusive access to the caches, a single unified memory, an instruction cache, a data cache, and possibly a custom hardware unit (without direct cache access). The cache can either be an L1 or L2 cache.
Related work
Hwu et al. [11] effectively reduced the cache miss rates in a compiler called IMPACT-1 by function inline expansion, trace selection, function layout and global layout. Function inline expansion replaces frequently executed function calls with the bodies of the called functions. Trace selection groups the basic blocks that are often executed together, which reduces the compulsory and conflict misses. Function layout places the most important descendant after the function entrance, thus reducing the conflict misses. Global layout places functions that are executed together in close proximity to each other. McFarling [14, 15] analysed functions such that the dependencies amongst functions were exposed and exploited to reduce misses. Chow [12] reduced cache conflicts by sorting functions by their execution frequencies and then grouping functions together. All of these methods worked at the function level.
The first of the global methods was proposed in [17] by Tomiyama and Yasuura, where an ILP formulation was applied to two differing methods to reduce the cache misses. The first method applied trace selection, which reduced the cache size but increased memory size. The second method was a refined method and applied trace selection, trace merging and trace placement to reduce the total misses. The ILP formulation reduced the speed of application and is not a suitable method for large program optimisations. Performance estimation of such caches has also been reported in the literature [18] . Kirovski et al. [19] use the frequency of execution to reduce cache misses in a system to be synthesised.
In [20, 21] the authors used a scheme where half the cache was assigned to high-priority tasks and the other half to non-high-priority tasks. This method was applied to instruction cache optimisation in multi-processor systems by Li and Wolf [22], though they later abandoned it [23] in favour of random placement of instructions in memory and doubling of cache sizes until deadlines were met.
In [16], code is reordered by examining the execution frequency of basic blocks and placing code segments with high execution frequency next to each other within the cache. Their methodology works only for a cache line size of a single instruction and a direct-mapped cache. Our methodology differs in that it can be applied to configurations with a cache line size greater than one instruction and a cache associativity greater than one.
A different code placement method is known as branch alignment [24]. Branch alignment methods place the most likely follower of an instruction at successive locations within the memory. Other approaches that are relevant to our work stem from the area of system-level power estimation and optimisation. Dave et al. [25] introduce a task-level codesign methodology that optimises for power consumption and performance. The influence of caches is not taken into consideration. The procedure for task allocation is based on estimations of the average power consumption of a processing element. The approach described by Hong et al. [26] uses a multiple-voltage power supply to minimise system power consumption. Another system-level power estimation approach, which focuses on peripheral cores within SOCs, is described by Givargis et al. [27]. Their hybrid approach uses gate-level power data that is obtained once and propagates it to an executable specification to speed up power estimation. Simunic et al. [28] simulated the power consumption of an ARM processor plus a cache hierarchy and a main memory using a cycle-accurate approach. Lajolo et al. [29] have conducted research on a cycle-based cosimulation environment for power estimation.
REMcode relocation strategy
The code relocation methodology is shown in Fig. 1 . The inputs are cache parameters and application parameters. The REMcode strategy consists of a cache allocation algorithm and a memory allocation algorithm. The results are a list of basic blocks and their new locations in the memory. Given the new memory locations of each basic block, it is then possible to recompile the executable and adjust the branching destinations of all the relocated codes. 
Assumptions and definitions
where M_loc is the memory location.
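The mapping from a memory location to a cache location can be illustrated with the usual modulo rule. The sketch below is an assumption rather than the paper's exact formulation: the function name `cache_location` and its parameters are hypothetical, and the modulus is taken to be the cache size divided by the associativity, which agrees with how N and N_A are used in the memory allocation section.

```python
def cache_location(m_loc: int, cache_size: int, associativity: int = 1) -> int:
    """Byte offset inside one cache way that memory address m_loc maps to.

    Assumes a conventional modulo-indexed cache; cache_size and
    associativity are illustrative parameters, not the paper's notation.
    """
    way_size = cache_size // associativity  # bytes covered by one way
    return m_loc % way_size


# A block starting at address 100 in a 64-byte direct-mapped cache
# lands at offset 36; in a 2-way cache of the same size, at offset 4.
print(cache_location(100, 64), cache_location(100, 64, 2))
```

Two memory locations conflict in a direct-mapped cache exactly when this function returns the same offset for both, which is the situation the relocation tries to avoid for blocks of the same loop.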
The following assumptions are also made:
. For RISC machines each instruction is of the same width, i.e. I_W is constant. This assumption of fixed instruction size is made to simplify the reallocation of code in memory. For machines with varied instruction widths it is still possible to implement our algorithm, with careful allocation of instructions into the memory to ensure that instructions are aligned with each other in memory.
. Cache line size N_L ≥ instruction width I_W.
. The size of the cache is known a priori. This assumption is made to allow for greater optimisation. However, if a number of processors with differing sizes of cache have to be serviced by the same binary source it is possible to ship several binaries, and the suitable one can be applied to the particular embedded system.
. Each basic block is smaller than the cache size divided by the cache associativity. This assumption is typically fulfilled. However, if the basic block is too large it can always be broken into smaller basic blocks.
. The methodology targets the low-power embedded system, where the cache size is tailored to suit the application and silicon area restricts the use of large cache sizes.
Problem statement
Let the size of the cache associated with processor P be equal to N. The basic blocks to be placed in the cache be Note that while multiple loops can share a basic block, the set of basic blocks within a loop will be unique to that loop. Some basic blocks will not belong to any of the loops. The problem is to arrange the basic blocks of each of the loops (and other basic blocks which do not belong to any of the loops) in main memory so that it can be brought into the cache such that the conflict misses are reduced, and the total main memory used is minimised. This problem is known to be NP-complete [16] .
REMcode cache allocation
The cache allocation algorithm finds a location in cache for each basic block so as to minimise total conflict misses. If several basic blocks belonging to the same loop overlap in cache (i.e. occupy the same cache addresses), cache misses will occur repeatedly.
Cache allocation process:
We illustrate the cache allocation process using a sample program for allocation into a 64-byte cache, Fig. 2a. Each vertex represents one basic block. The directed edges indicate the traversal of the program and the number beside each edge indicates the number of times each path is executed. In addition to the program flow graph shown, cache parameters and application parameters are given, Table 1. Column 1 gives the basic block name, column 2 the execution frequency of the basic blocks, and column 3 the effective execution frequency (calculated in step 1 of the cache allocation algorithm). Column 4 shows the size of each basic block, and column 5 shows the original location of each basic block. Column 6 shows where each basic block maps into cache using (1). For this example, we assume a direct-mapped cache. Inputs to the cache allocation algorithm are the program flow graph (such as the one given in Fig. 2a), cache parameters, and the application parameters shown in Table 1, columns 1, 2, 4, 5, and 6. The cache allocation algorithm is displayed in Fig. 3.
Step 1: Basic block B shown in Fig. 2a has a path to itself that is executed 9,500 times. This loop back path of a basic block B will never cause a conflict (given that our cache is always larger than any single basic block). The first step adjusts for this loop back path. Basic block B's effective execution frequency is reduced to 500 and shown in column 3 of Table 1 . The effective execution frequency is the number of times a cache conflict can occur.
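The adjustment in step 1 amounts to discarding self-loop traversals when counting how often a block is entered. A minimal sketch follows; the edge counts into B other than the 9,500 self-loop traversals are hypothetical values chosen so that the result matches the 500 quoted above, and `effective_frequency` is an illustrative name.

```python
def effective_frequency(block: str, edge_counts: dict) -> int:
    """Number of times `block` is entered from a different block.

    edge_counts maps (source, destination) to a traversal count. The
    self-loop edge (block, block) is ignored because re-executing a block
    that already sits in cache cannot cause a conflict miss.
    """
    return sum(
        count
        for (src, dst), count in edge_counts.items()
        if dst == block and src != block
    )


# Hypothetical counts: only the 9,500 self-loop traversals come from the text.
edges = {("A", "B"): 1, ("E", "B"): 499, ("B", "B"): 9500}
print(effective_frequency("B", edges))
```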
Step 2 identifies the loops within the application. In this work the loops are identified by searching for backward paths in the program trace. In the example given, the following loops have been identified: BCDEB, CDC, CGHJC, CGIJC, GHJG, GIJG, BCGHJCDEB, BCGIJCDEB, BCGHJGIJCDEB, BCGHJGIJCDEB, BCDCDE.
Step 3 separates the program into groups of dependent basic blocks (GDBB), created by partitioning the program into smaller groups of basic blocks. If one basic block belongs to two or more separate loops, we place all basic blocks of those loops into a single GDBB. For example, in Fig. 2a, basic block A will never cause a conflict miss with basic blocks 'B, C, D, E, G, H, I, J' or with basic block F. Thus we group A into one GDBB, 'B, C, D, E, G, H, I, J' into another, and finally F into a third GDBB (Fig. 2b). Basic blocks in separate GDBBs will not cause a conflict miss. Therefore we can consider each GDBB separately, greatly simplifying the cache allocation problem.
Step 4 identifies dominant loops (DL) within a GDBB. A DL is any instruction group with a maximum effective execution frequency greater than the threshold percentage T_L. Basic blocks identified as part of a DL are removed from the GDBB. We set T_L to 5%. In this example only the loop CGHJ is above this threshold. Each DL is then added to the list of GDBBs and treated as another GDBB, which further simplifies the cache allocation problem. Even though separate consideration of the new GDBBs (previously DLs) does not guarantee zero conflict misses, the effect on the total miss rate is negligible because of the small threshold percentage T_L chosen.
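The grouping in step 3 behaves like finding connected components: loops that share a basic block end up in the same GDBB. One way to sketch this, assuming a union-find structure (our choice for illustration, not necessarily the authors' implementation), is shown below with the first six loops from step 2.

```python
def group_dependent_blocks(blocks, loops):
    """Partition basic blocks into GDBBs: blocks linked through shared loops."""
    parent = {b: b for b in blocks}

    def find(b):
        # Follow parent links to the representative, halving paths on the way.
        while parent[b] != b:
            parent[b] = parent[parent[b]]
            b = parent[b]
        return b

    for loop in loops:
        root = find(loop[0])
        for b in loop[1:]:
            parent[find(b)] = root  # merge this block's group into the loop's

    groups = {}
    for b in blocks:
        groups.setdefault(find(b), set()).add(b)
    return sorted(groups.values(), key=sorted)


loops = ["BCDEB", "CDC", "CGHJC", "CGIJC", "GHJG", "GIJG"]
gdbbs = group_dependent_blocks("ABCDEFGHIJ", loops)
print(gdbbs)
```

Blocks A and F belong to no loop, so each forms its own group, matching the three GDBBs described in the text.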
Step 5 ranks basic blocks within each GDBB in descending order of execution frequency. If two basic blocks have execution frequencies within the same order of magnitude, the larger one is ranked higher than the smaller one (this reduces capacity misses). In the example, basic blocks CGHJ within the GDBB identified in the previous step will be ordered as CHGJ.
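The ranking rule can be read as a two-level sort key. The sketch below is one interpretation, assuming that "same order of magnitude" means the same power of ten; the frequencies and the sizes of H, G and J are hypothetical (only the 64-byte size of C is stated in the worked example).

```python
import math


def rank_blocks(blocks: dict) -> list:
    """Order block names by frequency magnitude, then by size, both descending.

    blocks maps a name to (effective_frequency, size_in_bytes).
    """
    def key(name):
        freq, size = blocks[name]
        magnitude = math.floor(math.log10(freq)) if freq > 0 else -1
        # Within one power of ten, the larger block wins; frequency breaks ties.
        return (-magnitude, -size, -freq)

    return sorted(blocks, key=key)


# Hypothetical values, chosen so that all four blocks share one magnitude.
example = {"C": (5000, 64), "G": (4000, 12), "H": (3000, 24), "J": (2000, 12)}
print(rank_blocks(example))
```

With these assumed values the order is C, H, G, J, which is the ordering reported for the example.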
Step 6 allocates individual instructions into a location in cache. Consider each GDBB separately. The basic blocks within each GDBB are taken from the ordered list given in step 5 (in the given order). These basic blocks (only whole basic blocks are allowed) are allocated to the cache from the lowest address to the highest until the cache is completely filled or there are no remaining basic blocks within that GDBB, Fig. 4a .
If there are any remaining basic blocks, we find the largest basic block from the remaining basic blocks in the ordered list. This large basic block is allocated so that it starts at address A_ls and ends at the last address A_s of the cache, Fig. 4b. After this we take the next largest basic block and allocate it starting at address A_ls. If an unallocated basic block can be found which fills the space (below the basic block we just allocated and above the last cache address A_s), we allocate that basic block to the available space, Fig. 4c. We keep doing this until no more basic blocks can fit into the remaining available space. We then take the next largest unallocated basic block, start it at address A_ls, and repeat the process until all basic blocks are allocated. We do this step for all GDBBs. The complexity of this algorithm is O(S + B^2), where B is the number of basic blocks and S is the size of the program trace. For a cache with associativity greater than one, the algorithm is modified to allow the allocation of code in every one of the ways in the associative cache. This allows basic blocks to overlap in multiple identical cache locations on different sets.
Worked example:
Consider only the list C, H, G, and J for allocation into a 64-byte direct-mapped cache. Basic block C is allocated first. Since block C is 64 bytes long, it occupies the entire cache as seen in Fig. 4a. Allocation of basic block H will force part of basic block C to be replaced from the cache. The allocations of the other two basic blocks, G and J, will force further replacement of part of basic block C and/or H from the cache. To minimise the total cache misses it is best to allow basic blocks H, G, and J to overlap each other, so that only a minimal portion of basic block C is replaced from the cache. Basic block H is allocated to replace a portion of basic block C as shown in Fig. 4b. The start of basic block H is then marked as A_ls, which indicates the location for overlapping the remaining unallocated basic blocks, G and J. The next step is to allocate basic block G at location A_ls and basic block J at location A_ls + (size of basic block G), as shown in Fig. 4c. Figure 4 shows the resulting allocation of basic blocks CHGJ into the cache. The output of the cache allocation algorithm gives the location at which each basic block should be within the cache; this is given in column 7 of Table 1.
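The placement described in step 6 and in the worked example can be sketched for a direct-mapped cache as follows. This is a simplified reading of the prose, not a transcription of Fig. 3: the function name, the stop-at-first-misfit rule in the first phase, and the sizes of H, G and J are assumptions (only C's 64 bytes is given).

```python
def allocate_in_cache(ordered_blocks, sizes, cache_size):
    """Return {block: cache offset} for one GDBB in a direct-mapped cache.

    Assumes every block is no larger than the cache, as the paper requires.
    """
    placement = {}
    pending = list(ordered_blocks)
    cursor = 0
    # Phase 1: pack whole blocks from the lowest address while they fit.
    while pending and cursor + sizes[pending[0]] <= cache_size:
        name = pending.pop(0)
        placement[name] = cursor
        cursor += sizes[name]
    # Phase 2: overlap the remaining blocks at the top end of the cache.
    while pending:
        largest = max(pending, key=lambda n: sizes[n])
        pending.remove(largest)
        a_ls = cache_size - sizes[largest]  # block ends at the last address
        placement[largest] = a_ls
        cursor = a_ls
        # Stack further blocks from a_ls upward while they still fit.
        for name in sorted(pending, key=lambda n: -sizes[n]):
            if cursor + sizes[name] <= cache_size:
                placement[name] = cursor
                cursor += sizes[name]
                pending.remove(name)
    return placement


sizes = {"C": 64, "H": 24, "G": 12, "J": 12}  # H, G, J sizes are hypothetical
print(allocate_in_cache(["C", "H", "G", "J"], sizes, 64))
```

Under these assumed sizes C fills the cache, H occupies the last 24 bytes, and G and J share that same window, so only that portion of C is ever evicted.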
Memory allocation
The memory allocation algorithm is used to map basic blocks into memory. Inputs to the algorithm are the application parameters shown in columns 1, 3 and 4 of Table 1, and the output of the cache allocation algorithm shown in column 7 of Table 1. The memory allocation algorithm is given in Fig. 5. Figure 6 shows an example of how basic blocks are mapped directly into memory. In this figure only the allocation of basic blocks A, C, G, H, and J from the example program of Fig. 2a is shown. Basic block A is allocated to the start of memory because it indicates the start of the program. Basic blocks C, G, H, and J are allocated such that when they are loaded into cache, they overlap as shown in the figure. Looking at the allocation of basic blocks C, G, H, and J, note that they will not map directly to the cache location predicted in column 7 of Table 1 but will map to cache as indicated in column 8 of Table 1. This is acceptable because the overlap in cache between blocks within the same GDBB is as predicted by the cache allocation algorithm. Thus if a basic block is mapped to the location from t_x to e_x in cache, the basic block can be placed in memory in any one of the address ranges from t_x + Z_r + i × N to e_x + Z_r + i × N, where i is a positive integer, Z_r is the cache offset, and N is the total cache size divided by the cache associativity. This introduction of an offset value Z_r allows a reduction in the size of the total memory needed for the system.
The memory algorithm starts by allocating the first basic block at the start of the memory (i.e. the start of the program). In the example it is basic block A. It cannot be moved from its initial location, since changing the location of A would direct the calling program to the wrong location. The next step is to sort the GDBBs in descending order of size (i.e. the total size of all the basic blocks within each). Then, for each GDBB in the ordered list, the first basic block in the group is allocated to the first available contiguous location in memory that is large enough to accommodate it. The offset value Z_r is then calculated for the current GDBB based on the equation shown in the algorithm in Fig. 5. Subsequent basic blocks in the same GDBB are allocated by finding free space in memory between t_x + Z_r + i × N and e_x + Z_r + i × N. Results of the memory allocation algorithm for the sample program are shown in column 9 of Table 1. The runtime complexity of the memory allocation algorithm is O(BM), where B is the number of basic blocks and M is the size of the memory.
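The memory allocation steps can be sketched as a first-fit search over the candidate addresses t_x + Z_r + i × N. The code below is an illustrative reading of the description, not the authors' implementation: `allocate_memory`, the interval list, the 16-byte entry block, and the cache offsets of H, G and J are assumptions, and i is allowed to start at 0.

```python
def allocate_memory(entry_size, gdbbs, way_size):
    """Return {block: memory address}; the entry block stays at address 0.

    gdbbs is a list of groups, each a list of (name, t_x, size) where t_x is
    the cache offset chosen by cache allocation. way_size plays the role of N.
    """
    occupied = [(0, entry_size)]  # half-open [start, end) intervals
    layout = {}

    def is_free(start, size):
        return start >= 0 and all(
            start + size <= s or start >= e for s, e in occupied
        )

    def first_fit(size):
        addr = 0
        for s, e in sorted(occupied):
            if addr + size <= s:
                break
            addr = max(addr, e)
        return addr

    # Largest groups first, as in the description.
    for group in sorted(gdbbs, key=lambda g: -sum(sz for _, _, sz in g)):
        name, t_x, size = group[0]
        m_x = first_fit(size)
        z_r = (m_x % way_size) - t_x  # offset shared by the whole group
        occupied.append((m_x, m_x + size))
        layout[name] = m_x
        for name, t_x, size in group[1:]:
            i = 0
            while not is_free(t_x + z_r + i * way_size, size):
                i += 1
            addr = t_x + z_r + i * way_size
            occupied.append((addr, addr + size))
            layout[name] = addr
    return layout


group = [("C", 0, 64), ("H", 40, 24), ("G", 40, 12), ("J", 52, 12)]
print(allocate_memory(16, [group], 64))
```

With these assumed inputs the blocks land at addresses 16, 120, 184 and 196. Every address minus Z_r (16 here) is congruent to the block's cache offset modulo 64, so the overlap pattern chosen by cache allocation is preserved even though the absolute cache positions are rotated, and the gaps between blocks account for the increase in main memory size.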
Experimental procedure and discussions
The experimental setup is shown in Fig. 7 . Application programs written in C were run on a SPARC machine (Sun UltraSparc with four processors, 400 MHz, 2GB, Solaris 2.6). Instruction traces of the applications are obtained using QPT [30] . A custom profiler was used to gather application parameters (shown in Fig. 7) .
Step 2 reads the cache parameters. Cache allocation and memory allocation algorithms are applied to design the new instruction code layout and the corresponding new program trace. These two execution traces are applied as two separate inputs in step 3. The cache simulator, dineroIV [31], is deployed to simulate and calculate the cache miss rates, whereas the memory calculator is a custom program calculating the memory size increase. An analytical power model estimates the energy consumption. The output gives the cache miss rate comparison of both the original and the modified program trace, the increase in memory size due to the code relocation algorithm, and the percentage energy savings.
Fig. 5 Memory allocation algorithm:
  Allocate first basic block to start of memory
  Set Z_r to be the size of the first basic block
  Order each GDBB in descending order of the total size
  For each basic block b_i in GDBB {
    Repeat until each basic block is allocated to memory {
      Allocate first basic block in the first available contiguous memory block (M_x to M_Y), which will hold the basic blocks
      Calculate Z_r = (M_x mod N/N_A) − t_x, where N_A is the cache associativity
      While basic block not allocated do {
        If memory locations t_x + Z_r + i × N to e_x + Z_r + i × N are free then
          Map basic block to address t_x + Z_r + i × N to e_x + Z_r + i × N
        Else
          i++
      }
      Remove basic block from unallocated list
    }
  }
Six different benchmarks were used to validate our approach. The trace size is shown in column 1 of Table 2, and the program size in bytes in column 2. The experiment was conducted for varying associativity (1, 2, and 4), instruction cache sizes from 64 to 2048 bytes, and cache line sizes from 4 to 8 bytes. Larger instruction caches were not evaluated because most results already show a cache miss rate of less than 1% for a cache size of 2048 bytes. The data cache size was fixed at 1024 bytes and the least recently used (LRU) replacement policy was used in the experiment. The runtimes of the cache allocation algorithm and the memory allocation algorithm are only a few seconds in all the experiments. Table 2 shows the results of the six benchmarks for a direct-mapped cache with a cache line size of 4 bytes. Column 3 shows the instruction cache size in bytes. Columns 4 and 5 show the number of instruction misses and the miss rate of the original code layout. The misses of the relocated code are given in column 6 and its miss rate in column 7. Columns 8, 9, and 10 show the energy consumption of the original code, the modified code, and the percentage energy savings, respectively. Column 11 shows the percentage main memory size increase. Table 3 shows the cache miss rates in columns 2 to 5 (when reading Table 3, the column titled 'cache size' is column 1) for the compress, mpeg, and trick1 applications for differing cache line sizes (other results are not presented owing to lack of space). The results in Table 3 are divided by application into three parts (compress, mpeg, and trick1).
Effect on system performance
Reduction in cache miss rates will improve the system's performance. Cache miss rates are shown in Table 2. The best cache miss rate results for each application are shown in Fig. 8b (before and after code relocation). The results in Table 2 show that using a smaller cache size with code relocation can provide similar or better cache miss rates compared with using larger cache sizes with the original code placement, as shown in [16, 17]. The results show that cache miss rates can be reduced from 71.17% to 0.04% for trick1.
Cache miss rates for varying cache line sizes are shown in Table 3 for three applications. Comparing columns '2 and 3', and '4 and 5' of Table 3 , we see that code relocation does reduce miss rates in most cases. However, as we go to larger line sizes with greater associativities, cache miss rates do not always reduce with code relocation (for example in Table 3 , trick1 with associativity of two and line size of 4 and 8 for 512 byte cache). Since the cache replacement policy dictates which of the sets are replaced, and we have no control over the replacement, the effectiveness of code relocation is diminished.
Larger cache line sizes can decrease the cache miss rate and cache access time but will increase the cache penalty since more bytes will have to be brought in to cache for every miss. Cache penalty is the cost (both time and energy) of bringing elements from the memory into the cache. In calculating the CPU time (using the equation given in [17] ), it is assumed that the memory bus width is fixed at 32 bits (4 bytes). Thus a cache line size of 8 bytes requires two memory accesses for each cache miss (other memory types such as RDRAM can access memory faster for subsequent data bytes, but this is beyond the scope of this paper). Column 6 of Table 3 gives the % speedup of applications with line size of 4 bytes. Column 7 of Table 3 shows the % speedup of code relocation for cache line size of 4 bytes against cache line size of 8 bytes. The positive value for all the values in column 7 of Table 3 indicates that 
Impact on energy consumption
Energy estimation (see columns 8-10 of Table 2) shows a reduction in energy consumption due to reduced conflict cache misses. Energy is estimated using analytical power models [32] for the I-cache, D-cache, buses, and main memory. CPU energy is estimated using an instruction set simulator (ISS). Results of energy estimation for direct-mapped caches are shown in columns 8-10 of Table 2. The best energy saving for each application is shown in Fig. 8a. The results in Table 2 show that an energy reduction is obtained whenever lower instruction cache miss rates are observed.
Effect on area of memory hierarchy
Higher cache associativity (columns 2-5 of Table 3: each column contains figures for associativities 1, 2 and 4) can decrease cache miss rates but increases the complexity of the cache architecture, increasing cache access times, cache area (silicon) and energy/access [33]. The increase in main memory due to the code relocation algorithm is shown in column 11 of Table 2. The average memory size increase is 13% if we exclude the mpeg application. The mpeg application was written for a general purpose computer system, though our system assumes a single application embedded system. Our trace generator does not reflect the system instructions that were executed and therefore had a large number of short segments of code which were overlapping. This was due to one system call with changing pointers as arguments. Since the size of the loop is larger than the cache size, the overlapping of instructions in cache causes the memory area to be large.
Memory is usually available in sizes which are powers of two, 2^n (1 K, 2 K, 4 K etc.), and therefore it is possible that for most applications there is no need for extra memory, since usually there is sufficient unused main memory available to accommodate this increase of 13%.
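The headroom argument can be checked with a few lines of arithmetic. The sketch below assumes memory parts come only in power-of-two sizes and uses hypothetical program sizes; `memory_part_size` and `fits_without_upgrade` are illustrative names.

```python
import math


def memory_part_size(required_bytes: int) -> int:
    """Smallest power-of-two memory size that holds required_bytes."""
    capacity = 1
    while capacity < required_bytes:
        capacity *= 2
    return capacity


def fits_without_upgrade(program_bytes: int, growth: float = 0.13) -> bool:
    """True if the relocated program still fits in the original memory part."""
    relocated = math.ceil(program_bytes * (1 + growth))
    return memory_part_size(relocated) == memory_part_size(program_bytes)


# Hypothetical sizes: 3000 bytes grows to about 3390 and still fits in 4 K,
# whereas 3800 bytes grows past 4096 and would need the next size up.
print(fits_without_upgrade(3000), fits_without_upgrade(3800))
```

The second case shows the limit of the argument: a program that already nearly fills its memory part can be pushed over the boundary by the relocation overhead.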
Summary of results
Experimental results show that code relocation can increase performance of the system and decrease its energy consumption. For cases with very small cache sizes, it is seen that in most cases code relocation has very little or no effect on the cache miss rates. This is because most of the cache misses are due to capacity misses. At the other end of the problem, where cache sizes are very large, code relocation also shows little or no effect. This is due to most loop sizes being smaller than the cache and the whole loops are able to fit within the cache concurrently.
Results for the trick1 application show that the cache miss rate is reduced from 71.17% down to 0.04% with 1024 bytes of cache. Analysing the statistical information for the trick1 application reveals that 99.98% of the time it is executing a single loop (the size of this loop is below 1024 bytes). The code relocation method increases the spatial locality of the code, resulting in reduced cache miss rates.
In the case of the tsp application, for any cache size the code relocation method shows little reduction in cache miss rates (even an increase in cache miss rates is seen with cache size of 1024 and 2048 bytes). The statistical information for tsp application shows that a single big loop of size larger than 7000 bytes is executed 99% of the time. Hence any cache size up to 2048 bytes will not contain the whole loop in the cache.
One of the most interesting observations is that the direct mapped cache with code relocation will always perform better than caches with greater associativity (with or without code relocation). The use of a higher associativity cache exploits temporal locality in programs, while code relocation increases the spatial locality of the code. Thus the two methods through different means result in improved performance. Since direct-mapped caches have smaller cache areas (silicon) and faster access times, any embedded code requiring high performance should use direct-mapped cache with code relocation. The unwanted side-effect of code relocation is the increased main memory size.
Conclusions
A heuristic code placement algorithm for minimising conflict cache misses has been presented. For the first time, the effect of code placement algorithms on performance and energy consumption has been investigated for varying cache sizes, cache associativities and cache line sizes. We have also identified that code placement methods increase the spatial locality of code. Results from experiments have shown that the cache miss rate can be reduced from 71% to less than 1%, with energy consumption savings of up to 63%.
It is seen that a direct-mapped cache with code relocation always performs better (lower miss rates) than caches with higher associativity (with or without code relocation). Since direct-mapped caches are faster, smaller in silicon area, and consume less energy/access than caches with higher associativities, we recommend the use of direct-mapped caches with code relocation for performance-oriented embedded systems.
