The increasing use of microprocessor cores in embedded systems as well as mobile and portable devices creates an opportunity for customizing the cache subsystem for improved performance. In traditional cache design, the index portion of the memory address bus consists of the K least significant bits, where K = log_2 D and D is the depth of the cache. However, in devices where the application set is known and characterized (e.g., systems that execute a fixed application set) there is an opportunity to improve cache performance by choosing a near-optimal set of bits used as index into the cache. This technique does not add any overhead in terms of area or delay. In this article, we present an efficient heuristic algorithm for selecting K index bits for improved cache performance. We show the feasibility of our algorithm by applying it to a large number of embedded system applications as well as the integer SPEC CPU 2000 benchmarks. Specifically, for data traces, we show up to 45% reduction in cache misses. Likewise, for instruction traces, we show up to 31% reduction in cache misses. When a unified data/instruction cache architecture is considered, our results show an average improvement of 14.5% for the Powerstone benchmarks and an average improvement of 15.2% for the SPEC'00 benchmarks.
INTRODUCTION
The growing demand for embedded computing platforms, mobile systems, handheld devices, and dedicated servers, coupled with shrinking time-to-market windows, is leading to new core-based system-on-a-chip (SOC) architectures [ITRS 2005; Kozyrakis and Patterson 1998; Vahid and Givargis 1999]. Specifically, microprocessor cores (a.k.a., embedded processors) are playing an increasing role in such systems' design [Wong et al. 2004]. This is primarily due to the fact that microprocessors are easy to program using well-evolved programming languages and compiler tool chains, provide a high degree of functional flexibility,
• Tony Givargis allow for short product design cycles, and ultimately result in low engineering and unit costs. However, due to continued increase in functional complexity of these systems and devices, the performance of such embedded processors is becoming a vital design concern.
The use of data and instruction caches has been a major factor in improving processing speed of today's microprocessors. Generally, a well-tuned cache hierarchy and organization can reduce the time overhead of fetching instruction and data from main memory, which in most cases resides off-chip, requiring power-costly communication over the off-chip system bus. Specifically, if a cache is designed to reduce the total number of cache misses during the execution of an application, the bit-traffic to and from the off-chip memory will be reduced. Since the capacitive load of a communication bus crossing the chip boundary is relatively large, a reduction in bit-traffic over such a bus will yield a reduction in power consumption and overall energy use of the application.
Consequently, in embedded, mobile, and handheld devices, optimization of the processor cache hierarchy has received a lot of attention from the research community [Balasubramonian et al. 2000; C.L and Despain 1995; Malik et al. 2000; Petrov and Orailoglu 2001; Suzuki et al. 1998]. This is due in part to the large performance gains obtained by tuning caches to the application set of these systems. The cache parameters explored by researchers include the size of a cache line (a.k.a., cache block), the degree of associativity, the total cache size, and control policies such as write-back and replacement procedures. These techniques typically improve cache performance, in terms of miss reduction, at the expense of area, clock latency, or energy.
In this work, we propose a zero-cost technique for improving cache performance (i.e., reducing misses). Our technique involves selecting a near-optimal set of bits used as index into the cache. In traditional cache design, the index portion of the memory address bus consists of the K least significant bits, where K = log_2 D and D is the depth of the cache [Patterson and Hennessy 1997]. In general, any of the address bits can be used for indexing. In our technique, we assume that the processor and cache cores are black-box entities to be integrated on a single SOC. However, we do assume that the integration of cores, more specifically, the routing of the address bus wires, is flexible, as is commonly the case in core-based SOC design, as well as board-based design.
We pictorially depict the idea of near-optimal cache indexing by showing the traditional approach, Figure 1 (a), versus our approach, Figure 1 (b) . Here, we have a 16-bit processor core connected to a 1K cache core, which in turn is connected to 64K of memory. In Figure 1 (a), the least significant address bit is used for the byte-offset calculation (assuming the cache is organized with each line being two bytes wide). The next nine least significant bits are used for cache indexing and the remaining bits are used for tag comparison. In Figure 1 (b), we have swapped bits seven and ten in order to achieve, say, a near-optimal cache indexing. Note that the reverse of the indexing scheme is performed on the cache-to-memory side in order to preserve functional correctness.
The intuition behind our work is simple. We exploit the fact that, for embedded applications (where the execution pattern is mostly constant), there is an opportunity to use alternate index bits to achieve better cache performance without increasing cache size or degree of associativity.
The problem of cache indexing is one of hashing. In traditional cache design, reference A maps to cache location L, using the hash function shown in Eq. (1):

$$L = A \bmod D \tag{1}$$

In Eq. (1), D is the depth of the cache. In general, data can be mapped onto a cache using the generic hash function shown in Eq. (2):

$$L = H(A) \tag{2}$$
In Eq. (2), H is an arbitrary hash function. While it may be possible to compute a perfect hash function, given the cache organization and a trace file, in this work we focus on a special class of hash functions, namely those that have zero cost overhead (e.g., zero delay, area, power, etc.). In other words, we focus on the class of hash functions that only swap the address bits.
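As a minimal sketch of the two mappings, assuming addresses are plain integers and ignoring the byte offset, the traditional scheme and a bit-selecting hash could be written as follows. The function names and the `bits` parameter are illustrative and do not come from the original text.

```python
def traditional_index(addr: int, depth: int) -> int:
    """Traditional mapping: the K least significant bits, i.e. addr mod depth."""
    return addr % depth


def selected_index(addr: int, bits: list[int]) -> int:
    """Zero-cost mapping: gather an arbitrary choice of address bits.

    bits[0] supplies the least significant index bit, bits[1] the next, etc.
    (this ordering convention is an assumption of the sketch).
    """
    index = 0
    for position, bit in enumerate(bits):
        index |= ((addr >> bit) & 1) << position
    return index
```

Selecting bits 0..K-1 in order reproduces the traditional mapping, so the traditional scheme is one member of the class considered here.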
The remainder of this article is organized as follows: In Section 2, we summarize related work. In Section 3, we formulate the problem and give our heuristic solution. In Section 4, we state our experiments. In Section 5, we state our concluding remarks and define some possible future directions along the lines of the proposed work.
PREVIOUS WORK
While the problem of hashing is a well-studied topic in the general area of computer science [Corman et al. 2001] , we are unaware of any direct research related to a hardware approach to processor cache indexing, as stated in this work. However, there are a number of compiler (software) techniques that aim at achieving similar goals as ours. In this section we briefly elaborate on some of these techniques.
Rivera and Tseng [1997] present two data-only compiler transformations to eliminate conflict misses. One of these transformations is to modify variable base addresses. The other transformation is to pad inner array dimensions. Unlike compiler transformations that restructure the application code, these two techniques modify the application data layout in order to improve cache performance.

Panda et al. [1997] present a data alignment technique that pads arrays to improve program performance through minimizing cache conflict misses. They also describe algorithms for selecting tile sizes for maximizing data cache utilization, and computing pad sizes for eliminating self-interference conflicts in the chosen tiles.

Abella et al. [2002] exploit the effectiveness of the memory hierarchy by means of program transformations (code segment), such as padding, to reduce conflict misses. They present an approach to perform near-optimal code segment padding for a system with multilevel caches by analyzing programs, detecting conflict misses (by means of cache miss equations), and using a genetic algorithm to compute the transformations.

Huang et al. [2003] propose loop (code) and data tiling for improving data locality in partial-differential equation (PDE) solvers running on single-processor systems with a memory hierarchy. They combine loop (code) tiling with array layout transformation in such a way that a significant number of cache misses that would otherwise be present are eliminated. They compare their results to nine existing loop tiling algorithms and show that their technique delivers impressive performance speedups (faster by factors of 1.55-2.62) and smooth performance curves across a range of problem sizes on representative machine architectures.

Ding and Kennedy [1999] explore ways to achieve better data reuse of dynamic applications by improving both code and data locality in irregular programs.
They demonstrate that runtime program transformations can substantially improve code and data locality despite the added complexity and cost. They utilize a compiler to automate such transformations, eliminating much of the associated runtime overhead.

Grun et al. [2000, 2001] describe a memory-aware compiler approach that exploits efficient memory access modes by extracting accurate timing information, allowing the compiler's scheduler to perform global code reordering to better hide the latency of memory operations. In this context, memory access modes include page mode, burst mode, and pipelined access. In their work, they also assume a more application-specific, tuned memory subsystem that may include one or more caches and memories organized to meet low power or high performance demands in addition to the traditional memory/cache hierarchy. Further, they present a compiler technique that, in the presence of caches, actively manages cache misses and performs global miss traffic optimizations to better hide the latency of the memory operations.
Software-based cache-sensitive compiler techniques, such as those outlined above, use code motion, data realignment, and data padding as mechanisms to improve cache performance. With respect to our approach, these techniques: (1) have a limited degree of freedom in moving code or data (e.g., a code segment can only be moved at the basic block boundary, or data objects may only be moved in a non-overlapping way), (2) may incur added cost (e.g., consecutive basic block segments, if split, would require an added jump instruction, or padded arrays may require added index computation overhead), (3) have a limited degree of freedom in padding arrays and structures (e.g., a structure object can only be padded at the element boundary to comply with the source programming language semantics, or multidimensional arrays usually are padded at the row or column boundary but not both), (4) are targeted for compute-intensive loops or nested loops operating on large arrays, and (5) have a limited view of the overall software (e.g., do not consider multiple concurrent tasks or interleaved operating system activity). We note that our approach can be applied over and beyond these software-based compiler techniques for added benefit.
OPTIMAL CACHE INDEXING
In this section, we first formulate the problem of optimal cache indexing. Then, we show that the problem of optimal cache indexing belongs to the class NP-complete. Last, we provide a heuristic that is efficient in running time and produces good results when applied in practice.
Problem Formulation
Optimal cache indexing is the problem of selecting K bits among all address bits of a processor for indexing into the cache. Specifically, let us assume that a processor has an M-bit bus and is connected to a cache of size S bytes that is A-way set associative and has a line size of L bytes. K can be computed as shown in Eq. (3):

$$K = \log_2 \frac{S}{L \times A} \tag{3}$$

Here, the term S/(L × A) gives the depth D of the cache (i.e., the number of rows). Note that K is the number of bits used by the row decoder of the cache. Since there are a total of M address bits, we can potentially use any combination of size K for cache indexing. The number of combinations, thus, is computed as shown in Eq. (4):

$$\binom{M}{K} = \frac{M!}{K!\,(M-K)!} \tag{4}$$
The problem is to find the one combination that reduces cache misses for a fixed application set. Specifically, we assume that a trace of memory references, corresponding to the application set, is available and is the input to our problem. We note that a trace of memory references will be different during different runs of a program, if the input to the program changes. For embedded systems, we assume a representative trace to be one that is obtained by concatenating traces obtained from different runs of the program using typical input data. Our technique can be applied to the representative trace in order to arrive at an indexing scheme that is ideal for the kinds of input the device may receive.
In an exhaustive approach, one can find an optimal set of index bits by enumerating all possible combinations, integrating the processor and cache accordingly, and simulating the application trace while keeping track of the one combination resulting in minimum misses. Such an approach is clearly not tractable, as the number of combinations is very large for all interesting cases. For example, assume a 32-bit processor connected to an 8192-byte two-way set associative cache with a line size of four bytes. Using Eq. (3), we compute K = 10. Further, using Eq. (4), we compute the number of possible cache index sets to be over 64 million.
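The arithmetic in this example can be checked with a short sketch. The function names are illustrative, and the cache depth is assumed to be a power of two.

```python
import math


def index_bit_count(size_bytes: int, line_bytes: int, ways: int) -> int:
    """K = log2(S / (L * A)), assuming the depth is a power of two."""
    depth = size_bytes // (line_bytes * ways)
    return depth.bit_length() - 1


def index_set_count(address_bits: int, k: int) -> int:
    """Number of ways to choose K index bits out of M address bits."""
    return math.comb(address_bits, k)


k = index_bit_count(size_bytes=8192, line_bytes=4, ways=2)
print(k, index_set_count(32, k))
```

For the 32-bit example this gives K = 10 and 64,512,240 candidate index sets, each of which would need its own trace simulation in an exhaustive search.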
NP-Completeness
We show that the problem of optimal cache indexing belongs to the NP-complete class of problems. Our proof is by reduction from the set cover NP-complete problem [Gurari 1989]. In the set cover problem, we are given a set of sets S over some universe U and a number k. We output yes if some k-sized subfamily C of S has the same union as U and no otherwise. Recall that in the optimal cache indexing problem, we have a sequence of memory addresses T, a number k, and a number m. We output yes if it is possible to choose k of the memory address bits so that a direct-mapped cache of size 2^k causes m cache misses on the sequence T and no otherwise. These two problems are defined as follows:
Set Cover Instance:
Input: A set of sets S over a universe U
Input: An integer k
Question: Does there exist a k-sized subfamily C of S whose union is U?

Optimal Cache Indexing Instance:
Input: A sequence of memory addresses T
Input: Integers k and m
Question: Do there exist k index bits in a direct-mapped 2^k cache causing m misses on T?

THEOREM 1. Optimal cache indexing ∈ class NP-complete.
PROOF. In our proof strategy, we first show that the problem of optimal cache indexing belongs to the NP class of problems, that is, optimal cache indexing ∈ NP. Then, we show that the problem of set cover is reducible to the problem of optimal cache indexing in linear time, that is, set cover ≤_P optimal cache indexing.

To show that optimal cache indexing ∈ NP, we nondeterministically select k bits as the cache index set, configure a 2^k direct-mapped cache simulator accordingly, and simulate the sequence of memory addresses T. If the number of cache misses is m, we halt and output yes; otherwise, we halt and output no.
To show that optimal cache indexing ∈ NP-hard, we show that set cover ≤_P optimal cache indexing. The reduction is as follows. Let S = {S_1, S_2, ..., S_n} be the set of sets (over a universe U) from the set cover instance. From this set cover instance, we construct an instance of the optimal cache indexing problem in which the memory addresses are n bits wide. Furthermore, each bit of a memory address corresponds to one of the members of S (i.e., the least significant bit of a memory address corresponds to S_1, the next least significant bit of a memory address corresponds to S_2, and so on). For each set member x in the set cover instance (i.e., x ∈ U), we create a sequence of three addresses a_x → b_x → a_x, where a_x and b_x differ in exactly those bit positions corresponding to sets containing x. We choose these addresses so that all addresses a_x, a_y, b_x, and b_y are distinct for distinct x ∈ U and y ∈ U. The overall sequence of addresses (i.e., T) is formed by concatenating together all these triples. Finally, we choose k to be the same as that from the set cover instance (i.e., the number of subsets) and m = 2/3 × |T| (i.e., m = 2 × the number of triples). This completes the reduction. When the optimal cache indexing instance is simulated, each triple incurs either two cache misses (the first a_x and b_x) or three (also the second a_x). We get two misses if and only if the k chosen address bits include one corresponding to a set that covers x, so that a_x and b_x land in different cache entries. Therefore, there exists a set of k index bits causing m misses if and only if there exists a size-k set cover.

Table I. Constructing a Sequence of Memory Addresses (i.e., T). Columns: x ∈ U, Address. Recoverable entries: a_1 = "0000", ..., "0011".
Let us give a simple example to highlight the reduction outlined in our proof. Consider the following instance of the set cover problem.
Set Cover Instance:
Input: U = {1, 2, 3, 4} Input: S = {S 1 , S 2 , S 3 , S 4 } where S 1 = {1, 3}, S 2 = {2}, S 3 = {3, 4}, and
In accordance with the reduction outlined in the proof, we create an instance of the optimal cache indexing problem as follows. For the reduction, we choose the address width to be 4 bits (i.e., same as |S|). For each member x ∈ U , we construct a sequence of three addresses a x → b x → a x , as shown in Table I by flipping the bit positions corresponding to sets where x is a member. For example, since x = 3 is a member of S 1 and S 3 , we need to make b 3 different from a 3 in exactly bit positions zero and two. We repeat this for all members of U and obtain a trace T of size 12. We set m = 2/3 × |T | = 8. When done, we arrive at the following instance of the optimal cache indexing problem.
Optimal Cache Indexing Instance:
Input: T as shown in Table I
Input: k = 2 and m = 8
Question: Do there exist k = 2 index bits in a direct-mapped 2^2 cache causing m = 8 misses on T?
Let us now select bits zero and one as the index into our cache. Here, we note that for the triples corresponding to x = 1, 2, 3 we get exactly two misses each. In other words, a 1 /a 2 /a 3 will be a miss, b 1 /b 2 /b 3 will be a miss, but the second occurrence of a 1 /a 2 /a 3 will be a hit (this is because at least one of the selected index bits will differentiate between the a and b values). However, for the triples corresponding to x = 4 we get exactly three misses (this is because the selected index bits will not differentiate between a 4 and b 4 ). Thus, in this case, the total misses will be m = 9. In fact, the only way to arrive at m = 8 would be to select a pair of index bits that differentiate between all pairs of a and b. In other words, cover the universe U in the set cover problem.
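The reduction can be exercised with a small simulation. The sketch below is illustrative only: the helper names are hypothetical, the addresses are picked greedily rather than taken from Table I, and the fourth set, written here as {2, 4}, is an assumption made so that a size-2 cover exists; it is not taken from the original instance.

```python
from itertools import combinations


def build_trace(universe, sets):
    """Concatenate triples a_x, b_x, a_x; bit i of an address stands for sets[i]."""
    used, trace = set(), []
    for x in universe:
        # Bits to flip: one per set that contains x (assumed non-empty).
        mask = sum(1 << i for i, s in enumerate(sets) if x in s)
        # Smallest a_x such that a_x and b_x = a_x XOR mask are both unused.
        a = next(v for v in range(1 << len(sets))
                 if v not in used and (v ^ mask) not in used)
        b = a ^ mask
        used.update((a, b))
        trace += [a, b, a]
    return trace


def misses(trace, bits):
    """Count misses in a direct-mapped cache indexed by the given address bits."""
    cache, count = {}, 0
    for addr in trace:
        index = tuple((addr >> b) & 1 for b in bits)
        if cache.get(index) != addr:
            cache[index] = addr
            count += 1
    return count


universe = [1, 2, 3, 4]
sets = [{1, 3}, {2}, {3, 4}, {2, 4}]  # last set is an assumption of this sketch
trace = build_trace(universe, sets)
for pair in combinations(range(4), 2):
    print(pair, misses(trace, pair))
```

Under this assumed instance, bits zero and one give nine misses, as in the text, and only the bit pair corresponding to a cover of U reaches m = 8.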
Heuristic Algorithm
Since the problem of optimal cache indexing is NP-complete, we give a heuristic algorithm that is efficient and performs well when executed on a large number of typical embedded applications. The first step of the algorithm is simply reading a trace into memory. We denote the size of the trace as N. The next step is to reduce the trace to the unique references, whose number is denoted as N′, where N′ ≤ N. We next describe the remaining parts of the algorithm.
For each bit in our address space, we compute a corresponding quality measure. This quality measure is a real number in the range of zero to one. A quality measure of zero indicates that the bit, if used as an index into a cache of depth two (i.e., a cache with two addressable data locations), would be a poor choice, as it would place all the references into a single location in the cache, thus causing the largest number of conflicts. On the other hand, a quality measure of one indicates that the bit, if used as an index into a cache of depth two, would be a good choice, as it would split the references equally between the two cache locations, causing the smallest number of conflicts. We compute the quality measure Q_i for address bit A_i by taking the ratio of zeros and ones along the A_i column, as shown in Eq. (5):

$$Q_i = \frac{\min(Z_i, O_i)}{\max(Z_i, O_i)} \tag{5}$$

In Eq. (5), Z_i denotes the number of references having the value zero at address bit A_i and O_i denotes the number of references having the value one at address bit A_i. We further illustrate the concept of quality measure with a running example. Consider the stripped trace shown in Table II. Applying Eq. (5) to our running example, we compute the quality measures shown in Table III.

As a particular example, Eq. (6) illustrates the computation of the quality measure Q_4, corresponding to the address bit A_4. Note that Z_4 = 7 is obtained by counting the zeros along the A_4 column of Table II and O_4 = 3 is obtained by counting the ones along the A_4 column of Table II.

$$Q_4 = \frac{\min(Z_4, O_4)}{\max(Z_4, O_4)} = \frac{\min(7, 3)}{\max(7, 3)} = \frac{3}{7} \tag{6}$$
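A minimal sketch of the quality measure follows, assuming the ratio is taken as the smaller count over the larger so that it lies between zero and one. The `quality` function and the ten sample references are hypothetical; the references are not the trace of Table II and were chosen only so that bit 4 has seven zeros and three ones.

```python
from fractions import Fraction


def quality(refs, bit):
    """Ratio of the smaller to the larger of the zero/one counts at one bit."""
    if not refs:
        return Fraction(0)
    ones = sum((r >> bit) & 1 for r in refs)
    zeros = len(refs) - ones
    return Fraction(min(zeros, ones), max(zeros, ones))


# Hypothetical unique references: bit 4 is 0 in seven of them and 1 in three.
refs = set(range(7)) | {16, 17, 18}
print(quality(refs, 4))
```

Exact fractions are used instead of floats so that values can be compared against hand-computed ratios.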
Next, for each pair of bits in our address space, we compute a corresponding correlation measure. This correlation measure is a real number in the range of zero to one. A correlation measure of zero indicates that a pair of address bits split the unique references in exactly the same way. A correlation measure of one indicates that a pair of address bits split the unique references in completely different ways. To illustrate further, consider Figure 2, which refers to the trace of Table II. Note that according to our quality measure, both A_0 and A_2 are ideal indices to use in a cache of depth two (i.e., a cache with two addressable data locations). Now consider the case where we have a cache of depth four (i.e., a cache with four addressable data locations), thus needing a pair of indices. If we use A_0 and A_2, the trace would be split into the four cache locations as shown in Figure 2(c). Note that even though the cache has four addressable data locations, only two slots receive the references, and the other two slots remain unused. The reason for this is that A_0 and A_2 are correlated. From looking at the trace, we can see that A_2 is simply the complement of A_0. In such a case, we would have a correlation measure of zero. We compute the correlation measure C_{i,j} for address bits A_i and A_j as shown in Eq. (7):

$$C_{i,j} = \frac{\min(E_{i,j}, D_{i,j})}{\max(E_{i,j}, D_{i,j})} \tag{7}$$

In Eq. (7), E_{i,j} denotes the number of references having identical values at address bits A_i and A_j. Likewise, D_{i,j} denotes the number of references having different values at address bits A_i and A_j. Applying Eq. (7) to our running example, we compute the correlation measures shown in Table IV. As a particular example, Eq. (8) illustrates the computation of the correlation measure C_{2,3}, corresponding to the address bits A_2 and A_3. Note that E_{2,3} = 4 is obtained by counting the number of references that have the same values along the A_2 and A_3 columns of Table II. Likewise, D_{2,3} = 6 is obtained by counting the number of references that have different values along the A_2 and A_3 columns of Table II.

$$C_{2,3} = \frac{\min(E_{2,3}, D_{2,3})}{\max(E_{2,3}, D_{2,3})} = \frac{\min(4, 6)}{\max(4, 6)} = \frac{2}{3} \tag{8}$$
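The correlation measure can be sketched in the same way, assuming it is the smaller of the equal/different counts over the larger. The `correlation` function and the four sample references are hypothetical; they were picked so that bit 2 is the complement of bit 0, mirroring the situation described for A_0 and A_2.

```python
from fractions import Fraction


def correlation(refs, i, j):
    """Ratio of the smaller to the larger of the equal/different counts."""
    if not refs:
        return Fraction(0)
    differ = sum(((r >> i) ^ (r >> j)) & 1 for r in refs)
    equal = len(refs) - differ
    return Fraction(min(equal, differ), max(equal, differ))


# Hypothetical unique references in which bit 2 is the complement of bit 0.
refs = {0b001, 0b100, 0b011, 0b110}
print(correlation(refs, 0, 2), correlation(refs, 0, 1))
```

Both a bit paired with its complement and a bit paired with itself score zero, which is what discourages the greedy step from picking redundant index bits.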
In the last step of the algorithm, we use the quality measures along with the correlation measures to compute the near-optimal index ordering as shown in Algorithm 1.
Algorithm 1
    for i ← 0 to M − 1 do
        best ← index of the largest quality measure
        for j ← 0 to M − 1 do
            Q_j ← Q_j * C_{best,j}
        end for
        print A_best
    end for

Algorithm 1 repeatedly selects an address bit with the highest corresponding quality measure and then updates the quality measures using the correlation measures. For the running example shown in Table II and the quality/correlation measures computed in Table III and Table IV, Algorithm 1 first selects A_0 as the best index bit (ties are broken in an arbitrary manner) and updates the quality measures Q_{M−1}, ..., Q_0 by multiplying with C_{0,M−1}, C_{0,M−2}, ..., C_{0,0} to obtain a new set of quality measures, as illustrated in Table V. Next, having the largest quality measure, A_3 is selected, and the quality measures are updated again, and so on. On termination, Algorithm 1 prints A_0, A_3, A_5, A_4, A_1, and A_2, in that order. This ordering defines a near-optimal solution to the problem of cache indexing. Subsequently, to build a cache of depth two (i.e., a cache with two addressable data locations) we choose A_0. To build a cache of depth four (i.e., one with four addressable data locations) we choose A_0 and A_3, and so on.

Table V (updated quality measures Q_j × C_{0,j}; recoverable entries): 3/7 × 2/3 = 6/21; 1 × 2/3 = 2/3; 1 × 0 = 0; 2/3 × 1/9 = 2/27; 1 × 0 = 0.
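The greedy ordering can be sketched end to end as follows. This is an illustrative reading of the heuristic rather than the author's implementation: both measures are assumed to be min/max ratios, already selected bits are excluded from later rounds, and ties go to the lowest bit number (the text only says ties are broken arbitrarily). The trace in the example is hypothetical, not the one from Table II.

```python
from fractions import Fraction


def ratio(a, b):
    """min/max ratio shared by the quality and correlation measures."""
    return Fraction(min(a, b), max(a, b)) if max(a, b) else Fraction(0)


def index_order(trace, m):
    """Return the m address bits ordered from best to worst index bit."""
    refs = set(trace)  # only unique references are considered
    n = len(refs)
    ones = [sum((r >> i) & 1 for r in refs) for i in range(m)]
    q = [ratio(n - ones[i], ones[i]) for i in range(m)]
    c = [[Fraction(0)] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            differ = sum(((r >> i) ^ (r >> j)) & 1 for r in refs)
            c[i][j] = ratio(n - differ, differ)
    order = []
    for _ in range(m):
        # Assumption: choose among unselected bits; ties go to the lowest bit.
        candidates = [b for b in range(m) if b not in order]
        best = max(candidates, key=lambda b: (q[b], -b))
        order.append(best)
        q = [q[j] * c[best][j] for j in range(m)]
    return order


# Hypothetical trace in which bit 1 is always the complement of bit 0.
print(index_order([1, 2, 5, 6, 1, 2], m=3))
```

In this trace every bit has a perfect quality measure, yet bit 1 is ranked last because it carries no information beyond bit 0.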
Time Complexity and Further Remarks
In terms of running time complexity, our heuristic takes O(N × log_2 N) to execute, where N is the size of the original memory trace. This time complexity is computed as follows. Reading the trace takes O(N), as the length of the original trace is N. Reducing the trace down to only the unique references involves what amounts to sorting the trace and thus takes O(N × log_2 N). Computing the quality and correlation measures takes O(N′), where N′ ≤ N is the number of unique references, as a single pass over the unique references is needed to compute these values. The final phase (i.e., Algorithm 1) takes O(M^2), where M is the size of the processor address space (i.e., the width of the address bus). Since in most instances M is a small integer like 16, 32, or 64, we assume it to be constant. Therefore, Algorithm 1 executes in O(1) time. Thus, the overall heuristic algorithm executes in time O(N × log_2 N).
In an implementation of the above-mentioned heuristic, the entire memory trace does not need to be loaded into memory prior to stripping it down to the unique references. Instead, a set data structure can be used to track unique references during a single pass over the original trace. Thus, the space complexity of our heuristic is bounded by O(N′), where N′ ≤ N is the number of unique references. It is reasonable to assume that in most cases N′ ≪ N. As a result, the effective execution time of our heuristic is linear with respect to the size of the trace. We note that our heuristic only looks at the address patterns and not the sequence or frequency of memory reference occurrences. Furthermore, our heuristic is by no means optimal, but it performs well and runs very efficiently, as supported by our experimental results. In fact, our experiments with alternate heuristics that take into account the frequency of occurrence of memory references or the sequence of memory references have resulted in marginal (and sometimes negative) performance improvements when compared to the presented indexing scheme.

Table VI (Powerstone, data traces; recoverable rows)
Benchmark   Trace size   Unique references   Index bits (best first)
...         ...          ...                 ...,5,8,9,10,4,7,11,16,19
v42         649168       23942               6,9,4,7,5,8,10,11,12,13
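The single-pass reduction to unique references can be sketched as below. The trace format, one hexadecimal address per line, is an assumption made for the sketch, as is the function name.

```python
def unique_references(trace_lines):
    """Collect unique addresses in one pass over an iterable of text lines.

    Memory use grows with the number of unique references, not with the
    length of the trace. Each non-empty line is assumed to hold one
    hexadecimal address (a hypothetical trace format).
    """
    seen = set()
    for line in trace_lines:
        line = line.strip()
        if line:
            seen.add(int(line, 16))
    return seen


print(sorted(unique_references(["0x10\n", "0x14\n", "0x10\n", "\n"])))
```

Because the function accepts any iterable, an open file object can be passed directly, so the full trace never has to reside in memory.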
It is important to note that our heuristic will converge to a random selection of index bits in the unlikely scenario where each memory location is accessed at least once. In such cases, one can limit the trace to the segment corresponding to the time critical regions of the code.
EXPERIMENTS
For experiments, we have used the Powerstone-embedded benchmarks [Malik et al. 2000; PowerStone 1999] 
Benchmark Results
We have compiled and executed each benchmark application on a MIPS R3000 simulator, instrumented to output memory reference traces for both instruction and data accesses. We have run the traces through our heuristic algorithm to obtain improved cache index mappings. Results for data caches are shown in Table VI and Table VII for Powerstone and SPEC'00 benchmarks, respectively. Results for instruction caches are shown in Table VIII and Table IX for Powerstone and SPEC'00 benchmarks, respectively. The last column of these tables shows the computed, near-optimal index mapping. The values are ordered from left to right (i.e., the leftmost number denotes the best address bit, the second leftmost number denotes the second best address bit, etc.).

Table VII (SPEC'00, data traces)
Benchmark   Trace size   Unique references   Index bits (best first)
...         ...          ...4M               17,8,14,19,18,28,16,22,23,7
crafty      70.2G        1.94M               8,9,10,15,16,20,11,19,21,14
eon         38.8G        0.559M              22,10,14,11,2,6,7,9,5,19
gap         80.9G        67.3M               20,19,21,12,11,17,6,8,25,15
gcc         25G          161M                18,24,6,13,25,8,20,3,7,15
gzip        24.8G        89.8M               16,20,8,18,23,26,14,13,2,3
mcf         23.1G        198M                16,27,7,28,12,17,14,4,21,5
parser      191G         38.2M               18,10,17,16,6,7,4,8,25,13
perlbmk     18.6G        77.2M               19,18,6,28,7,11,3,22,20,13
twolf       112G         5.73M               17,24,25,6,16,23,5,9,11,3
vortex      48.2G        76.2M               7,25,23,11,16,3,26,12,28,22
vpr         37.1G        51.7M               18,14,7,12,11,26,25,10,22,4

Table VIII (Powerstone, instruction traces)
Benchmark   Trace size   Unique references   Index bits (best first)
...         ...          ...                 ...,3,8,5,7,4,6,9,12,10
bcnt        1337         115                 2,3,4,5,6,7,8,11,9,0
blit        22244        149                 2,3,4,5,10,7,8,9,11,12
compress    137832       731                 3,2,7,4,11,5,8,6,10,9
crc         37084        176                 2,3,4,6,11,7,9,10,12,8
des         121648       570                 2,3,7,4,5,8,12,9,10,11
engine      409936       244                 2,3,4,5,7,10,8,6,11,12
fir         15645        327                 7,2,3,8,4,5,6,9,11,12
g3fax       1127387      220                 2,4,3,6,5,8,7,9,12,13
jpeg        4594120      623                 2,3,5,4,8,6,7,13,14,10
pocsag      47840        560                 2,6,3,5,4,10,9,8,7,11
qurt        1044         179                 2,3,5,4,8,6,10,9,7,11
ucbqsort    219710       321                 2,3,5,4,6,12,13,8,7,10
v42         2441985      656                 2,3,8,12,13,5,6,4,7,9

Table IX (SPEC'00, instruction traces)
Benchmark   Trace size   Unique references   Index bits (best first)
...         ...          ...00487M           7,8,9,10,13,14,15,16,12,6
crafty      192G         0.16M               12,13,14,15,18,19,20,21,5,6
eon         80.6G        0.206M              18,19,20,21,2,3,4,5,6,12
gap         214G         0.123M              3,4,5,6,13,14,15,16,11,12
gcc         46.1G        0.986M              18,19,20,21,14,15,16,17,12,13
gzip        844G         0.00486M            5,6,7,8,2,3,4,11,12,13
mcf         61.9G        0.0475M             9,10,11,12,8,13,14,15,16,7
parser      547G         0.105M              9,10,11,12,16,17,18,19,5,6
perlbmk     41.1G        0.328M              2,3,4,5,17,18,19,20,6,7
twolf       346G         0.177M              16,17,18,19,2,3,4,5,6,9
vortex      119G         0.358M              17,18,19,20,4,5,6,7,8,9
vpr         84.3G        0.156M              18,19,20,21,2,3,4,5,15,16
We have simulated the traces under three typical cache organization schemes: configuration A with 4-Kb, direct mapped, and 4-byte line; configuration B with 8-Kb, 2-way, and 8-byte line; and configuration C with 16-Kb, 4-way, and 16-byte line. For each of the three cache configurations, we have measured the number of misses when traditional (T) cache indexing is used as well as when the proposed (i.e., improved) (P) cache indexing is used. Results for data caches are shown in Table X and Table XI for Powerstone and SPEC'00 benchmarks, respectively. Results for instruction caches are shown in Table XII and Table XIII for Powerstone and SPEC'00 benchmarks, respectively.
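A miss-counting simulator for this kind of comparison might look like the sketch below. Several points are assumptions rather than details from the experiments: LRU replacement, cache sizes read as kilobytes (4096, 8192, and 16384 bytes), power-of-two geometry, and index bits that lie above the byte offset. The function name and parameters are hypothetical.

```python
from collections import OrderedDict


def count_misses(trace, size_bytes, ways, line_bytes, index_bits=None):
    """Count misses in a set-associative cache with LRU replacement (assumed).

    index_bits lists the address bit positions used for indexing; None selects
    the traditional scheme (the bits directly above the byte offset).
    """
    depth = size_bytes // (line_bytes * ways)
    k = depth.bit_length() - 1
    offset = line_bytes.bit_length() - 1
    if index_bits is None:
        index_bits = list(range(offset, offset + k))
    sets = {}
    misses = 0
    for addr in trace:
        block = addr >> offset  # the full line address serves as the tag
        index = tuple((addr >> b) & 1 for b in index_bits[:k])
        lines = sets.setdefault(index, OrderedDict())
        if block in lines:
            lines.move_to_end(block)  # refresh LRU position on a hit
        else:
            misses += 1
            lines[block] = True
            if len(lines) > ways:
                lines.popitem(last=False)  # evict the least recently used line
    return misses


# Two addresses that collide under traditional indexing in configuration A.
trace = [0x0000, 0x1000] * 4
print(count_misses(trace, 4096, 1, 4))
print(count_misses(trace, 4096, 1, 4, index_bits=[12, 3, 4, 5, 6, 7, 8, 9, 10, 11]))
```

In this toy trace, replacing index bit 2 with bit 12 separates the two conflicting addresses, which is the effect the proposed indexing exploits on real traces.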
On average, for the data traces, the improved cache indexing achieved 23%, 19%, and 14% reduction in cache misses for cache configurations A, B, and C, respectively, as shown in Figure 3 and Figure 4. In some cases, the reduction in misses was up to 45% for data traces. On average, for the instruction traces, the improved cache indexing achieved 14%, 10%, and 7.7% reduction in cache misses for cache configurations A, B, and C, respectively, as shown in Figure 5 and Figure 6. In some cases, the reduction in misses was up to 31% for instruction traces. For smaller caches, or larger application benchmarks, a larger reduction was observed. Moreover, the technique benefited data caches more than instruction caches. In Table XIV and Table XV, we have summarized all the data reported thus far. Specifically, we have averaged the number of instruction and data cache misses among all three configurations (i.e., configurations A, B, and C). Moreover, we have used the Wattch simulator [Tiwari and Martonosi 2000] to obtain performance and energy figures. As predicted, our results show that both performance (i.e., execution time) and power are improved when a near-optimal cache indexing scheme is used.
Effects of Input Data on Near-Optimal Cache Indexing
In the next set of experiments, we have measured the effects of input data variation (i.e., variation in the data fed to the benchmark applications) on data access pattern, control flow, and the resulting impact on the near-optimal cache indexing scheme. For these experiments, we have simulated the traces under a cache configuration with 4 Kb, direct mapped, and 4-byte line. For each benchmark, we have selected four input data sets I1, I2, I3, and I4. (Input data set I1 corresponds to the default input data set used in our earlier experiments.) Results for data cache misses are shown in Table XVI and Table XVII for Powerstone and SPEC'00 benchmarks, respectively. Results for instruction cache misses are shown in Table XVIII and Table XIX for Powerstone and SPEC'00 benchmarks, respectively. The results are summarized as follows. For each benchmark, we selected the near-optimal cache indexing scheme computed using the default input data set I1. Then, without changing the near-optimal cache indexing scheme, we ran the benchmarks using the additional input data sets I2, I3, and I4. The presented results show the percent increase or decrease in cache misses relative to the default input data set I1.
The results show that for the embedded applications (i.e., most of the Powerstone benchmarks) the effect of input data variation on the near-optimal cache indexing is minimal. For the desktop applications (i.e., most of the SPEC'00 benchmarks), the effect of input data variation on the near-optimal cache indexing is more pronounced. As a particular example, gcc behaved very differently in both control and data cache performance across the different input data sets, which consisted of compiling different C/C++ programs. On the other hand, jpeg showed little variation in either control or data cache performance regardless of the input data set, which consisted of decoding different jpeg images.
Near-Optimal Cache Indexing Applied to Unified Caches
In the next set of experiments, we have applied our near-optimal cache indexing scheme to unified caches. For these experiments, we have simulated the traces under an 8 Kb, direct-mapped cache configuration with a 4-byte line. This particular cache configuration corresponds to a cache that is as large as the combined instruction and data caches of our earlier experiments (configuration A) on split cache architectures. Results are summarized in Table XX and Table XXI for Powerstone and SPEC'00 benchmarks, respectively. The tables provide the number of cache misses under a traditional (T) indexing scheme, the number of cache misses under an improved (P) indexing scheme, and the percent difference (using the traditional (T) indexing scheme as the reference point).
Our results show an average improvement of 14.5% for the Powerstone benchmarks and an average improvement of 15.2% for the SPEC'00 benchmarks.
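As a worked illustration of the quantities above, the index width of the unified cache and the percent difference can be written as follows. This is a sketch under stated assumptions: "8 Kb" is read as 8192 bytes, the symbols $M_T$ and $M_P$ (miss counts under the traditional and improved schemes) and $\Delta$ are introduced here only for illustration, and the sign convention (positive meaning fewer misses) is assumed rather than taken from the tables.

```latex
\[
D = \frac{8192\ \text{bytes}}{4\ \text{bytes per line}} = 2048,
\qquad
K = \log_2 D = 11
\]
\[
\Delta = \frac{M_T - M_P}{M_T} \times 100\%
\]
```

Under these assumptions, a benchmark for which the heuristic finds no better indexing than the traditional one has $M_P = M_T$ and therefore $\Delta = 0$.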
Final Remarks
We conclude this section by making some final remarks about our experiments and the obtained results.
- Not in all cases did our heuristic algorithm obtain an indexing scheme that was better (or significantly better) than the traditional indexing scheme. For example, Figure 3 to Figure 6, as well as Table XX and Table XXI, show a few instances where the miss reduction was reported to be zero. In such cases, we were able to apply an alternate search heuristic (based on a simulated annealing greedy algorithm) that was able to obtain some indexing scheme better than the traditional one. However, the run time of the search heuristic was impractical (e.g., taking many days to run on smaller benchmarks and much longer on larger benchmarks). Therefore, we conclude that the near-optimal cache indexing technique is always feasible; however, our heuristic could be improved further.
- Based on our results, the average improvement was better for the split data/instruction cache architecture than for the unified cache architecture. However, in both architectures the improved indexing performed significantly better than the traditional indexing scheme.
- The particular input data used when running a benchmark application to obtain a trace file can have an impact on the final indexing scheme that is selected. However, for most embedded benchmarks (i.e., those from Powerstone), a cache indexing scheme obtained from a particular input data set performed well when the application was simulated using alternate input data sets, as shown in Table XVI and Table XVIII. For desktop applications (i.e., those from SPEC'00), in contrast, the results showed a larger deviation from the near-optimal solution, as shown in Table XVII and Table XIX.
- In terms of absolute performance and energy, as reported in Table XIV and Table XV, we note that significant improvements are achievable by applying our technique.
CONCLUSION
We have proposed a zero-cost technique for improving cache performance in embedded systems as well as mobile and portable general-purpose devices that execute a known application set. Our technique involves selecting a near-optimal set of bits used for indexing into the cache. While an optimal selection of index bits is shown to be NP-complete, we have provided an efficient heuristic algorithm for computing a near-optimal indexing scheme. This heuristic algorithm runs in polynomial time and produces good results, as demonstrated by experiments on a large number of Powerstone and SPEC'00 benchmarks. Specifically, for data traces, our technique achieves up to 45% reduction in cache misses. Likewise, for instruction traces, our technique achieves up to 31% reduction in cache misses. When a unified data/instruction cache architecture is considered, our results show an average improvement of 14.5% for the Powerstone benchmarks and an average improvement of 15.2% for the SPEC'00 benchmarks.
In the future, we plan to investigate cache indexing schemes that may incur some constrained cost and analyze the tradeoffs in terms of improved hit rate versus increased cache access time. For example, we may consider cache indexing schemes that use one, two, or n-level logic. We also plan to investigate a dynamic approach to cache indexing by using a reprogrammable crossbar along the processor/cache and cache/memory buses, enabling on-the-fly swapping of cache address wires by a task or an operating system.
