Abstract-The cache memory plays a crucial role in the performance of any processor. The cache memory (SRAM), especially the on-chip cache, is 3-4 times faster than the main memory (DRAM). It can vastly improve the processor performance and speed. The cache also consumes much less energy than the main memory, which leads to a large power saving that is very important for embedded applications. In today's processors, although the cache memory reduces the energy consumption of the processor, the energy consumption of the on-chip cache accounts for almost 40% of the total energy consumption of the processor. In this paper, we propose a cache architecture for the instruction cache that is a modification of the hotspot architecture. Our proposed architecture consists of a small filter cache in parallel with the hotspot cache, between the L1 cache and the main memory. The small filter cache holds the code that was not captured by the hotspot cache. We also propose a prediction mechanism to steer the memory access to either the hotspot cache, the filter cache, or the L1 cache. Our design has both a faster access time and less energy consumption compared to both the filter cache and the hotspot cache architectures. We use the Mibench and Mediabench benchmarks, together with the Simplescalar simulator, to evaluate the performance of our proposed architecture and compare it with the filter cache and the hotspot cache architectures. The simulation results show that our design outperforms both the filter cache and the hotspot cache in both the average memory access time and the energy consumption.

I. INTRODUCTION

The processor speed is, and has been for a long time, improving at a much higher rate than the main memory access time. That has led to a big, and still growing, gap between the processor speed and the main memory speed. Cache, especially on-chip cache, is considered to be the main solution for the processor-memory gap. By placing a small cache memory on the chip between the CPU and the main memory, and having a good cache management scheme, we can reduce the memory access time to 1 or 2 cycles in most processors. On-chip cache is considered to be standard on almost all types of processors.

Another reason for the popularity of caches is the growing demand for mobile (battery-operated) devices that include one or more processors. Cache is ideal for such applications since it consumes much less energy than the main memory. Although the SRAM technology used for cache requires more energy per bit access than the DRAM used in the main memory, the small size of the cache and the fact that it is on chip result in much less power consumption per access than the main memory. Even for more powerful desktop computers, power consumption is a very important factor in the design process since it affects both the reliability of the system and its price, because high-power processors need sophisticated cooling systems.

For modern processor systems, more than half of the chip area, and more than half of the transistor count on the chip, is dedicated to the cache. That means the power dissipation in the cache is an important part of the overall power consumption in modern processors. It also means that reducing the cache power consumption is a very important objective for today's processor designers.

Energy consumption in the cache can be reduced using three different techniques. The first is at the physical (VLSI) level. In this approach, the memory is designed with reduced power consumption in mind. This can be achieved by reducing the voltage levels, reducing the capacitance, or reducing the switching frequency. Second, at the compiler level, some techniques are used to fully utilize every data element that was brought into the cache before replacing it (this may require some transformations at the source code level). The third approach is at the cache management level. This approach includes using different associativity, different levels/types of cache, prediction schemes, and different replacement policies. It is usually referred to as cache architecture. In this paper we concentrate on the third approach, the cache architecture level.

In this paper, we introduce a new instruction cache architecture that reduces the average memory response time as well as the power consumption. Detailed simulation using the Simplescalar simulator with the Mibench and Mediabench benchmarks shows that our architecture has a better response time and less power consumption than the hotspot architecture and the filter cache architecture.

The organization of the paper is as follows: Section 2 presents a brief overview of the previous attempts to reduce memory access time and/or the cache power consumption.
In Section 3 we present our architecture and the prediction mechanism. Section 4 presents our simulation results (both energy consumption and average cache access time) for our architecture and compares it with two similar architectures (filter cache and hotspot cache). Section 5 is a conclusion.

1-4244-0155-0/06/$20.00 ©2006 IEEE

With the increasing size of the on-chip cache, and with more than one level of cache on chip, power dissipation in the cache has grown. Coupled with the fact that caches are implemented using SRAM instead of the more power-thrifty DRAM technology for speed purposes, addressing [...] three improvements in order to reduce power. First, they used multiple line buffers, which are checked in parallel with the tag check in the cache; if the data is found in one of the line buffers, the cache access is aborted. Second, they divided the data array into sub-banks, thus saving bit line energy. Finally, they used bit line segmentation for a further power saving. They compared their design to a conventional cache with no line buffer, and showed a large energy reduction.

In [14], the authors proposed a small, energy-efficient filter cache and used the spatial hit/miss pattern to predict the next access, in order to minimize the total cache energy as well as the total cache access delay. Their design resulted in an energy-delay saving of 7%.

Way prediction and selective direct mapping were used in [12] in order to reduce the L1 cache dynamic energy without degrading the cache access time. Their objective was to reduce the energy wasted in accessing all the ways in a set associative cache. Since the data is found in at most one way, they predicted the access way and accessed it as a direct mapped cache. Their technique achieved the energy-delay of a sequential access while maintaining the performance level of a parallel access.

In [15], the authors proposed the hotspot cache in order to reduce energy consumption in embedded systems. In their design, they proposed a small filter cache (known as the hotspot cache) between the L1 cache and the CPU. Their goal is to capture loops in the hotspot cache, whose access requires much less energy than the much bigger L1 cache. They modified the Branch Target Buffer (BTB) in order to determine which loops will be loaded into the hotspot cache. Their results show up to 52% energy reduction in cache access using the Mediabench suite without performance degradation.

The authors in [1] proposed a highly associative cache using a CAM design. Since CAM requires high energy consumption, they used a last-used prediction technique in order to reduce energy using a 32-way set associative cache; they showed 30-40% [...] the location of the next reference, the cache is accessed as a direct mapped cache instead of a set associative cache. A direct mapped cache consumes less energy since there is only one access to the tag.

Victim cache was proposed in [9] to reduce the delay-energy product and the delay-energy-area product. They used a comparison between the tag high/low order bits and the access high/low order bits in order to quickly detect most of the misses in the filter cache and direct the access to the L1 cache. Their proposed scheme resulted in an average saving of 8.6% for energy and 3.8% for execution time.

Cluster miss prediction was used in [2] in combination with prefetch on miss in order to minimize the miss rate for ready CPU cores where the designer does not have complete access to the cache configuration. Their simulation shows a reduction in the miss rate.

III. PROPOSED ARCHITECTURE

In this section we start with a brief description of both the filter cache and the hotspot cache, followed by the motivation for our work, showing some of the shortcomings of the hotspot cache and why we propose to modify it. Then, we present and discuss our proposed architecture.

A. Filter Cache

Filter Cache, introduced in [8], adds a small cache (L0 cache, usually 512 bytes) in front of the level-1 cache. The main idea is to capture the most recently accessed instructions to avoid accessing the level-1 cache. For each memory access, the filter cache is accessed first and the L1 cache is accessed only if the filter cache misses. Filter caches usually result in energy saving, but increase the average cache access time. Filter cache is shown in Fig-1.

Fig. 2. Hotspot Cache

The block detection mechanism is incorporated into the Branch Target Buffer (BTB) and is shown in Fig-3. Each entry in the BTB is augmented with a valid bit, an execution counter, a Hot-Block Flag, and a Prev-Hot Flag. The execution counter is used to identify hot blocks (the counters in the BTB are used to count the frequency of a taken branch, or how many times a specific loop was executed). A hot block is detected when this execution counter reaches a certain threshold. The [...]

* Loops whose number of iterations was less than the threshold were not marked as hotspots and were never moved into the hotspot cache.
* All the hotspot-cacheable code below the threshold value was accessed from the L1 cache.
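The detection rule above can be illustrated with a minimal Python sketch. It is not the BTB logic of [15]: the threshold value, the class and function names, and the choice to count one taken branch per loop iteration are assumptions made for illustration, and the Prev-Hot flag is carried but never updated because its rule is not described here. The sketch shows why a loop that iterates fewer times than the threshold is served entirely by the L1 cache.

```python
from dataclasses import dataclass, field

# Hypothetical threshold; the text does not give a value.
HOT_THRESHOLD = 16


@dataclass
class BTBEntry:
    """One BTB entry holding the extra fields named in the text."""
    valid: bool = True
    exec_counter: int = 0
    hot_block: bool = False
    prev_hot: bool = False  # kept for completeness; not updated in this sketch


@dataclass
class HotBlockDetector:
    threshold: int = HOT_THRESHOLD
    entries: dict = field(default_factory=dict)  # branch address -> BTBEntry

    def taken_branch(self, branch_addr: int) -> bool:
        """Record one taken branch and report whether its block is now hot."""
        entry = self.entries.setdefault(branch_addr, BTBEntry())
        if not entry.hot_block:
            entry.exec_counter += 1
            if entry.exec_counter >= self.threshold:
                entry.hot_block = True
        return entry.hot_block


def iterations_served_by_l1(loop_iterations: int,
                            threshold: int = HOT_THRESHOLD) -> int:
    """Count loop iterations fetched from L1 because the block is not yet hot."""
    detector = HotBlockDetector(threshold)
    served_by_l1 = 0
    for _ in range(loop_iterations):
        if not detector.taken_branch(0x400):  # 0x400 is an arbitrary address
            served_by_l1 += 1
    return served_by_l1
```

With the assumed threshold of 16, a 10-iteration loop is fetched from L1 on every iteration, while a 100-iteration loop goes to L1 only for its first 15 iterations.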
[...] the hotspot cache due to the above-mentioned reasons. As we can see, for some applications in communication, video, and voice, the hotspot cache captures loop iterations quite effectively, but not so for other types. For example, for applications of the Data, Image, and Mp3 types, up to 36% of the loop iterations were handled by the L1 cache.

[...] Mp3 and data applications, low energy consumption can still be achieved by using a block buffer. It was reported in [15] that for such applications, if the hotspot cache is used without a block buffer, the energy consumption is even higher than that of the Filter Cache. We can see from Table I that, for such applications, up to 36% of iterations were handled by the relatively high-energy L1 cache.

We now propose our scheme, which enjoys a faster memory access time than the hotspot cache and less energy consumption than the Filter Cache. In our scheme, we did not augment the L1 cache with a block buffer for the reasons mentioned above (either longer cycle time or high energy consumption). Our proposed architecture is as follows: there are 2 parallel caches between the L1 cache and the CPU, the hotspot cache, [...] branch is taken. If that counter reaches a threshold value, the block is assumed to be a hot block and is moved to the hotspot cache. While the block is in the hotspot cache, it is monitored to be sure that at least 50% of the references are from the hotspot cache. If the ratio falls below that, the block is considered a cold block and the search for another hot block [...]

[...] programs are almost identical to the ones reported here. Table-II shows the programs we reported on in this paper. [...] [7] to conduct our experiments. We have modified Simplescalar to simulate the Filter Cache, the Hotspot cache, and our proposed [...] architecture with the energy consumption of the filter cache and the hotspot cache. Figure-6 shows the energy consumption of some representative programs in the Mibench and Mediabench benchmarks normalized to the baseline architecture (assuming that the baseline architecture energy consumption is 1), which [...] proposed architecture and compares it with the other two architectures. Note that in this table, hotspot cache* means the hotspot architecture where the line buffer is accessed in a separate cycle in order not to increase the cycle time. We can see that our proposed architecture consumes less energy than the Filter cache, and almost the same energy as the hotspot cache. As we will see, the delay and the off-chip memory access of our proposed architecture are better than those of the hotspot and filter caches.
Fig. 6.

The filter cache and the hotspot cache significantly reduce the energy consumption by avoiding energy-expensive level-1 cache accesses. Both of them have a performance overhead. Table-IV shows the normalized delay for the filter cache and the hotspot cache. On average, for the simulated applications, up to 10% and 6% performance degradation was observed for the Filter Cache and the HotSpot Cache respectively, compared to the baseline architecture. Using our proposed scheme we reduced the performance overhead to only 2%. As our scheme doesn't use a line buffer between the level-1 cache and the L0 cache, we avoid [...]

For communication and data applications, such as crc32 and epic, using a 256-byte filter cache [...] cache. As with energy, for both data and communication applications the delay is lower when using the modified hotspot cache with a 256-byte filter cache, whereas for all other applications the modified hotspot cache with a 256-byte filter cache is actually worse than the original hotspot cache. For the modified hotspot cache with a 512-byte filter cache, the delay for all programs across the various types of applications is better than with all other schemes. Fig-8 shows the average memory access time in cycles per memory access for the different architectures.

Accessing the off-chip memory is expensive, both in terms of energy consumption and delay. Off-chip memory could be the main memory or an off-chip second-level cache (L2 cache). A direct-mapped cache, although comparatively fast to access, can suffer from thrashing. Thrashing occurs when two memory lines map to the same line in the cache and repeatedly evict each other. Thrashing can cause performance degradation, as most of the time is spent moving data between memory and the caches. [...] consumption. Fig-12 shows the total (on-chip and off-chip access) energy consumption of the three architectures normalized to the direct-mapped cache.

In this paper, we proposed a new cache architecture for the instruction cache to minimize both the average cache access time and the energy consumption. Our proposed architecture combines the low miss rate of the hotspot cache architecture and the low energy of the filter cache architecture. Our simulation, using Simplescalar, Mediabench, and Mibench, shows a reduction in both the average memory access time and the cache energy consumption compared to both the hotspot architecture and the filter cache architecture.

REFERENCES
