Most modern microprocessors employ one or two levels of on-chip caches in order to improve performance. These caches are typically implemented with static RAM cells and often occupy a large portion of the chip area. Not surprisingly, these caches often consume a significant amount of power. In many applications, such as portable devices, low power is more important than performance. We propose to trade performance for power consumption by filtering cache references through an unusually small L1 cache. An L2 cache, which is similar in size and structure to a typical L1 cache, is positioned behind the filter cache and serves to reduce the performance loss. Experimental results across a wide range of embedded applications show that the filter cache results in improved memory system energy efficiency. For example, a direct mapped 256-byte filter cache achieves a 58% power reduction while reducing performance by 21%, corresponding to a 51% reduction in the energy-delay product over a conventional design.
Introduction
In order to mask latency, and thus improve performance, most microprocessors have one or two levels of on-chip caches. For simple single-issue microprocessors, the access time of the first level cache is often on the critical path [1]. Even though these caches consist of densely packed MOS transistors, the area assigned to on-chip caches can be a significant fraction of the entire IC area. Figure 1 shows a die photo of the StrongARM processor from DEC, which is dominated by cache memory. The power dissipated by the on-chip caches is often a significant part of the power dissipated by the entire microprocessor. Table 1 compares key characteristics for two modern embedded RISC microprocessors, the StrongARM 110 [2] and a PowerPC from IBM [3]. For both chips, the on-chip caches are either the largest or the second largest power-consuming block. This trend will likely continue as embedded processors become more sophisticated and provide higher performance.
Figure 1: Die of DEC StrongARM
Caches clearly present one of the most attractive targets for power reduction. Power reduction in caches can be achieved through several means: semiconductor process improvements, memory cell redesign, voltage reduction, and optimized cache structures. Our focus here is on cache structures, which is where architects can have the largest impact.
The motivation for our research is the increasing use of embedded systems in multimedia and communication applications. Many techniques have been discovered that lead to increased performance as well as reduced power consumption. For example, improvements in cache organizations for high performance commercial processors also serve to reduce traffic on high capacitance buses. This phenomenon clearly helps to save power, although the primary goal is improved performance. We believe that low power researchers should be open to sacrificing some performance for power savings; our experience is that, without this perspective, the best power saving ideas have often been discovered only in the chase for high performance. Many approaches can be used to reduce power by sacrificing arbitrary amounts of performance, e.g., avoiding pipelining or fully associative caches. Furthermore, by introducing an arbitrary reduction in the clock rate, we can achieve an arbitrary reduction in the power consumed by CMOS circuits. Clearly, power alone is not a good design metric; it must be evaluated along with some concern for performance. We have therefore adopted the Energy-Delay metric, which has been used in evaluating CAD tools and circuit designs. We propose the use of a first level cache that is very small relative to conventional designs. This cache has reduced power dissipation relative to a traditional cache architecture, albeit at the expense of a decreased hit ratio. Our hypothesis is that the decrease in power consumption will compensate for the loss in performance, resulting in a reduced Energy-Delay product. The small L1 cache is called a filter cache in order to distinguish it from traditional caches that are designed solely for performance. However, while the design decisions differ from those of a traditional cache, the basic structure is the same.
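To make the metric concrete, the following Python sketch computes the Energy-Delay product for two designs, using the headline figures from the abstract as hypothetical normalized inputs rather than measured data:

# Energy-Delay comparison for two designs, normalized to the base machine.
# The inputs are illustrative, taken from the headline figures above.

def energy_delay(energy, delay):
    """Energy-Delay product; lower is better."""
    return energy * delay

base = energy_delay(energy=1.00, delay=1.00)   # conventional cache hierarchy
filt = energy_delay(energy=0.42, delay=1.21)   # 58% less energy, 21% more delay

print(f"ED ratio vs. base: {filt / base:.2f}")  # ~0.51: roughly half the base ED product

The point of the metric is visible in the arithmetic: a 21% slowdown is more than paid for by a 58% energy reduction, so the product still falls to about half.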
The basic filter cache organization is illustrated in Figure 2, where it is compared to a traditional memory organization. Power consumption estimates for the caches are calculated using a power model presented later in the paper. The power consumption of the main memory accounts only for the bus capacitance, and ignores the power consumed in the memory chips. The L1 cache is likely to have the same characteristics in both systems. However, with the new design, the L1 cache is accessed only as a consequence of a miss in the filter cache; otherwise it is not cycled and remains in standby mode. Thus, although the L1 cache has a similar design in both cases, it requires an additional clock cycle for access when placed behind the filter cache. Because the filter cache is smaller than the L1 cache, it will generally have a faster access time. While this may present an opportunity to increase the processor clock, it would necessarily increase the latency of access to the L1 cache. We have not investigated this option, and will only compare systems with equal clock frequencies and two-cycle access for the L1 cache backing the filter cache.
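A minimal behavioral sketch of this access path, in Python; the sizes, timing, and fill policy below are illustrative assumptions for exposition, not the simulator used for the results in this paper:

# Behavioral sketch of the filter cache access path: the L1 is cycled
# only on a filter cache miss. All sizes and latencies are assumptions.

FC_LINES, FC_LINE_BYTES = 8, 32     # 256-byte direct mapped filter cache
HIT_CYCLES, MISS_CYCLES = 1, 2      # assumed: filter probe overlaps L1 cycle 1

filter_cache = [None] * FC_LINES    # tag store; None marks an invalid line

def access(addr):
    """Return the cycle cost of one reference, filling the filter cache on a miss."""
    line_addr = addr // FC_LINE_BYTES
    index = line_addr % FC_LINES
    if filter_cache[index] == line_addr:
        return HIT_CYCLES            # hit: the L1 stays in standby and is not cycled
    filter_cache[index] = line_addr  # miss: the line is fetched from the two-cycle L1
    return MISS_CYCLES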
1.1 Overview
We use the following approach to evaluate the energy dissipation in the cache of an embedded processor. The cache power is mainly a function of the capacitance of the memory array and the number of access transitions. We determined the actual number of transitions through detailed cycle-level simulation.
The capacitance values were obtained from published results for a 0.8 µm technology. The analysis can be repeated, with the same resulting trends, using values from a more modern process if available. These parameters were applied to an analytical model in order to determine the power dissipation of the filter caches and the backing L1 caches.
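The conversion from transition counts to energy follows the standard CMOS dynamic power relation; a minimal sketch, with placeholder capacitance and voltage values rather than the published 0.8 µm numbers:

# Dynamic energy of a CMOS node is E = 0.5 * C * Vdd^2 per transition,
# so the array energy is that quantity times the simulated transition count.

VDD = 3.3            # supply voltage in volts (placeholder value)
C_BITLINE = 1.0e-12  # effective bitline capacitance in farads (placeholder value)

def array_energy(num_transitions, capacitance=C_BITLINE, vdd=VDD):
    """Energy in joules dissipated by the given number of transitions."""
    return 0.5 * capacitance * vdd**2 * num_transitions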
The remainder of the paper is organized as follows. The next section presents the previous work that is relevant to our problem and approach. Section 3 presents the experimental methods used. The workload used for evaluation is discussed in Section 4. Section 5 presents the experimental results and discusses effective filter cache design. The paper concludes with a discussion and summary.
Previous Work
The related previous work can be divided into three major areas: low power processor evaluation, cache modeling, and low power cache structures.
Gonzalez and Horowitz investigated various methods for evaluating low power processor designs [4], and they presented some of the first arguments concerning power and performance to the architecture community. Much of this discussion follows the structure of earlier arguments in the CAD community. Based on this work, we have adopted the Energy-Delay metric for system evaluation. Prior work on cache power modeling provides the models and capacitance values we use for power estimation. Other proposed low power cache structures, such as block buffering and MDM, are conceptually similar to the filter cache. Block buffering can be viewed as a degenerate case of the filter cache, and our results indicate that a significant benefit can be realized by larger structures. Compared to MDM, the filter cache achieves better Energy-Delay results because of its improved performance. Furthermore, the filter cache can be turned off when higher performance is needed, which is not possible with MDM.
Experimental Methods
The base machine model used here is an embedded processor executing the HPPA instruction set. The system is designed to be roughly comparable to the DEC StrongARM 110 in terms of system resources [10]: 16 KB instruction and data caches, a single issue processor core, no aggressive branch prediction, and no L2 cache. Applications were compiled and executed using the IMPACT toolset [11]. This approach allows us to experiment with various cache structures and generate accurate clock cycle counts for execution time. These counts are used directly as the Delay term in the Energy-Delay measures. The Energy, Delay, and Energy-Delay measures were determined for the base processor executing each application in the experimental workload, and subsequently used to evaluate alternative filter cache designs.
The cache power models are based on previously published work on cache power modeling. N_hit and N_miss represent the raw numbers of hits and misses to the first level cache. The term N_addr counts the average number of address line transitions seen by the first level cache from the CPU, assuming that half of the address lines switch during each memory request. T represents the number of tag bits, while C is the number of control bits stored with each cache line. M is the degree of associativity, and L is the line size measured in bytes. From these characteristics of the cache organization, it is straightforward to count the number of precharged bit line transitions, designated N_bp. With W denoting the average number of bits that switch on each write operation, the total numbers of bit reads (N_br) and bit writes follow in the same manner.
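The following sketch shows this bookkeeping in code. The formulas are plausible instantiations consistent with the parameter definitions above, not the paper's exact equations, and the example widths are placeholders:

# Transition accounting for a cache array, using the symbols from the
# text: T tag bits, C control bits per line, M ways, L bytes per line.
# These formulas are assumed instantiations, not the paper's exact model.

def bits_cycled_per_access(T, C, M, L):
    """Data, tag, and control bits touched in all M ways on one access."""
    return (8 * L + T + C) * M

def n_bp(n_hit, n_miss, T, C, M, L):
    """N_bp: precharged bit line transitions across all accesses (assumed form)."""
    return (n_hit + n_miss) * bits_cycled_per_access(T, C, M, L)

def n_bw(n_writes, W):
    """N_bw: total bits switched by writes, with W switching bits per write."""
    return n_writes * W

# Example: a 256-byte direct mapped cache with 32-byte lines (placeholder widths).
print(n_bp(n_hit=900, n_miss=100, T=24, C=2, M=1, L=32))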
Benchmarks
There currently exists a significant void with regard to effective benchmarks for embedded systems. While a number of industrial and academic efforts have been proposed, to date there has been little progress toward a suite of representative programs and workloads. Part of the problem is that the field of embedded systems covers a wide range of computing systems; it is difficult to imagine a benchmark suite that would be useful to the designers of both fax machines and cellular phones, given the drastically different uses of these products. This unfortunate state of affairs is best reflected by the use of the Dhrystone benchmark, and the derivative metric Dhrystones per milliwatt. For the purposes of this paper we have adopted the MediaBench benchmark suite [12]. These benchmarks encompass most of the media applications in use today. Table 2 summarizes the codes, provides a brief description of each, and indicates the number of instructions simulated.
Experimental Results
For our base case, we have assumed a one level cache hierarchy with split instruction and data caches, each with a capacity of 16 KB. These are direct mapped caches with a line size of 32 bytes. The filter cache machines have a two level cache hierarchy consisting of instruction and data filter caches and a unified direct mapped 32 KB L1 cache. The L1 used in conjunction with the filter cache thus has the same total size as the combined L1 caches of the base machine. We considered two small filter cache sizes: 128 and 256 bytes. The line size was varied between 8 and 32 bytes, and both direct mapped and fully associative caches were considered. The instruction and data filter caches were simulated with the same configuration in order to reduce the number of design points to be evaluated. The hardware cost of these structures is small relative to the base case L1 cache.
Figure 3: Power consumption relative to the base machine, by cache size and associativity
The results for each filter cache design are presented relative to the base machine.
Power consumption for the MediaBench applications is shown in Table 3 for the 128-byte filter caches and Table 4 for the 256-byte configurations. Figure 3 summarizes the most important aspects. Each design point is identified by its cache size, line size, and degree of associativity; for example, a 128-byte fully associative filter cache with 16-byte lines is identified as 128/16-F.
First, associativity tends to have a strong impact on power consumption. For example, the 128-byte fully associative cache with 16-byte lines consumes almost as much power as the split 16 KB direct mapped caches used for the base case. Associativity increases the amount of data and control information read out of the cache arrays, thus consuming more power. The variation between the 128-byte and 256-byte fully associative caches is due to the fixed line sizes: the 256-byte fully associative cache has twice as many lines as the 128-byte filter cache, and thus consumes approximately twice as much power. The improved hit rate does not fully compensate for the increased energy per reference. Second, the power consumption of the 128-byte and 256-byte direct mapped caches is reasonably similar. These caches have similar hit rates, and thus are approximately equally effective at filtering out memory references.
Finally, for the direct mapped cases, increased line size tends to increase power consumption. Again, for these caches, we tend to see hit rates that are fairly close. However, each reference to a 32-byte direct mapped line reads out four times as much data as the 8-byte case, thus discharging four times as many bitlines.
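The bitline argument can be made concrete with a back-of-the-envelope count of the bits read per reference; the tag and control widths below are illustrative assumptions, not values from the paper:

# Bits read per reference for different filter cache shapes, assuming a
# 22-bit tag and 2 control bits per line (illustrative widths).
T_BITS, C_BITS = 22, 2

def bits_per_access(line_bytes, ways):
    """Data plus tag and control bits read out of all ways on one access."""
    return (8 * line_bytes + T_BITS + C_BITS) * ways

print(bits_per_access(32, 1) / bits_per_access(8, 1))    # ~3.2x overall (4x in data bits)
print(bits_per_access(16, 16) / bits_per_access(16, 1))  # 16x for a fully associative read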
The performance impact of the filter cache is shown in Table 5 for the 128-byte caches and Table 6 for the 256-byte caches. Somewhat surprisingly, increased associativity often reduces performance for these applications by a small amount. This appears to result from the combination of uncommonly small caches and the correspondingly small number of cached lines, as the impact is generally smaller for the larger cache size. For the caches considered here, the line sizes are still small enough that the system sees a significant benefit from the effect of instruction prefetch. Figure 4 illustrates this point for the 256-byte direct mapped filter caches. While the delay is always greater than the base case (i.e., the performance is lower), the longer line sizes show reduced performance loss and significantly less variability. Although the shortest line size approaches the performance of the longer lines for several applications, in general it appears to be a poor choice for performance.
The resulting Energy-Delay measures are shown in Table 7 and Table 8 for the 128-byte and 256-byte cases, respectively. We first consider the impact of full associativity. Full associativity performs worse than the base case in all but three instances. These particular points correspond to the smallest cache size and the largest line size, and thus have the smallest degree of associativity. Even for these cases (epic, pgpdecode, and pgpencode), the Energy-Delay improvement over the base case is very small. Given these results, we do not consider the fully associative caches further. Figure 5 summarizes the Energy-Delay for the direct mapped caches with 16-byte lines. For many cases, the benefits of the larger cache are relatively small, and on average the 256-byte filter cache has only a 16% advantage. However, the reduction in Energy-Delay is much flatter across applications for the larger cache, suggesting that the additional size is likely to be worth the investment.

Figure 6: Energy-Delay product ratio vs. line size
Having selected the 256-byte direct mapped filter caches for closer examination, Figure 6 highlights the impact of varying the line size. The conclusions are less clear here, because the conflicting factors that drive Energy-Delay yield no single piece of design advice. As the line size grows, the delay tends to decrease while the power consumption tends to increase. Based on the mean Energy-Delay, one would conclude that the 16-byte line size is the best choice. This seems natural, as it balances the competing pulls of energy and delay. Additionally, this line size shows the least variance in Energy-Delay across applications, and thus provides the most consistent benefits.
Discussion
Several topics remain that do not fall neatly into any other section of this paper. While this work introduces the filter cache and establishes its effectiveness, a significant amount of further work is required to better analyze and quantify the design space. In particular, we are moving forward with further simulation to explore the effectiveness of a small degree of associativity. A sound analytical model is also needed, one that captures the basic physical phenomena and allows rapid exploration of the design space without exhaustive simulation.
The filter cache design presented here uses a backing cache with an access time typical of existing embedded L1 caches. Two design opportunities clearly exist: the backing cache can retain a single-cycle access, or it can move to a multiple-cycle access while the processor clock is increased to match the shorter access time of the filter cache. It is interesting to imagine a design that leaves the backing cache with a single-cycle access. The filter cache can then be turned off, returning the backing cache to the L1 position with a single-cycle access time. By controlling the filter cache with a processor mode bit, the system can switch between high-speed operation and low Energy-Delay operation as system demands vary.
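One way to picture the mode-bit mechanism is the hypothetical control sketch below; the structure and latencies are illustrative assumptions, not a description of an implemented design:

# Hypothetical mode-bit control: disabling the filter cache returns the
# backing cache to the L1 position with single-cycle access.

class MemoryFrontEnd:
    def __init__(self):
        self.filter_enabled = True        # processor mode bit

    def set_low_power_mode(self, enabled):
        """True selects the filter cache path; False selects raw L1 speed."""
        self.filter_enabled = enabled

    def access_cycles(self, filter_hit):
        """Cycle cost of one reference under the current mode."""
        if not self.filter_enabled:
            return 1                      # backing cache serves as a single-cycle L1
        return 1 if filter_hit else 2     # filter hit, or miss served by the backing cache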
Conclusions
In spite of the increased commercial interest in sophisticated embedded processors, very little work has been conducted to improve the Energy-Delay performance of embedded processor systems through architecture. This paper develops and evaluates the filter cache, a small memory that trades performance for reduced power in order to optimize Energy-Delay. The filter cache has been shown to provide an average Energy-Delay reduction of 51% across a set of 19 multimedia and communications applications.
