Abstract-Two of the most important factors in the design
I. INTRODUCTION
Processor speed is advancing at a much higher rate than memory access speed. Cache memory is generally considered the only solution for bridging the gap between processor speed and memory speed. By placing a small cache memory between the CPU and the main memory, memory access time can be reduced to one or two cycles. Another reason for the popularity of cache memory is that a cache requires less energy per access than the main memory. With the increasing demand for portable (battery-operated) devices, energy consumption has become a very important factor in the design of modern processors.
In today's processors, cache memory takes up a very large portion of the CPU area, and an even larger portion of the transistor count. A large cache requires a longer access time and consumes more energy because of the increased bit-line and word-line capacitance [11]. Adding more cache to improve performance is therefore not the ideal solution. What is needed instead is a cache architecture that performs well in terms of both average memory access time and energy consumption.
In this paper, we propose a new cache architecture, the Predictive Line Buffer, which consists of the cache and a single line buffer between the cache and the CPU. Adding a line buffer (or even a small filter cache) is not a new concept. The main contribution of this paper is the prediction mechanism, in which the CPU predicts whether the next access will be served from the line buffer or from the main cache, thus saving at least one cycle whenever the access would have missed in the line buffer. Using the SimpleScalar simulator [9], the CACTI power simulator [8], and three different benchmark suites, we show that our proposed architecture has a faster memory access time and lower energy consumption than a single line buffer without prediction, the filter cache, the hotspot cache, and the way-halting cache.
The paper is organized as follows. Section II presents some recent attempts to reduce energy consumption and memory access time. Section III introduces our proposed architecture. Section IV explains the simulation setup, presents the simulation results, and compares them with the above-mentioned architectures. Section V gives the conclusion and future work.
II. PREVIOUS WORKS
There have been many attempts to improve cache performance (average memory access time) and to decrease the energy consumption of the cache. In this section, we briefly discuss some recent attempts to reduce energy consumption and average memory access time.
Jouppi [4] showed how to use a small fully associative cache and prefetching to improve the performance of a direct-mapped cache without paying the price of a fully associative cache. The authors of [10] showed how to tune the filter cache to the needs of a particular application in order to save energy.
The hotspot cache was introduced in [12]. Frequently executed loops are detected (frequently executed in this context means that the loop is executed more than a specific number of times, which the authors refer to as the threshold) and placed in the hotspot cache. This architecture consumes less energy than a regular filter cache. The authors also used a single line buffer between the hotspot cache and the main cache (in this paper, we use the terms line buffer and block buffer interchangeably). However, their analysis does not state whether the line buffer is accessed in parallel with the L1 cache or sequentially. If the line buffer is accessed in parallel, the power saving, which is the main reason for using a line buffer, is negated. If the access is sequential, a miss in the line buffer adds one cycle. In our simulation, we assumed sequential access in order to maintain the energy saving, which is the main reason for the hotspot cache.
The way-halting cache [13] predicts which way to access in a set-associative cache. The predicted way is accessed first. If the prediction is correct, energy is saved by not accessing all the ways; otherwise, the rest of the ways are accessed.
In [1] the authors proposed a variable-sized block cache. Their scheme identifies basic blocks (the block tail is a backward branch and the block head is the target of that backward branch) and maps each basic block to a variable-size cache block. They successfully addressed the problem of instruction overlap among traces that was present in the trace cache [7]. In [5] the authors investigated the energy dissipation in the bit array and the memory peripheral interface circuits. Taking these parameters into consideration, they optimized the performance and power dissipation of the cache. Different techniques for reducing static energy consumption in multiprocessor caches were compared in [3]. Finally, [2] proposed code compression techniques, in which instructions are accessed from the cache in compressed form, as an energy-saving method for embedded processors.
III. PREDICTIVE LINE BUFFER
In this paper, we assume a cache memory (L1 cache) with a single block buffer. We propose a mechanism to predict whether the next access will be served from the block buffer or from the cache. If the prediction is correct, we save one clock cycle (the miss in the block buffer) together with the energy consumed in accessing the block buffer.
We assume the existence of a Branch Target Buffer (BTB). The BTB is a small cache that holds the PC addresses of branch instructions and their target addresses. The BTB is accessed prior to accessing the next instruction in order to decide whether the next instruction should be fetched sequentially or from the target address.
A. Conventional Line Buffer
In a conventional block buffer, a single cache line is placed in front of the L1 cache to capture the spatial locality of the program. Once a word is accessed, the line containing that word is transferred to the block buffer. The next access, if sequential, is served from the block buffer instead of the cache. The energy required to access the block buffer is much less than the energy required to access the cache. If an instruction misses in the block buffer, an extra clock cycle is required to fetch it from L1. Accessing the cache in the same cycle as the block buffer instead increases the clock cycle time and hurts processor performance.
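The behavior described above can be illustrated with a minimal sketch of a sequentially accessed single-line buffer. It is not the simulator used in this paper: the function name and the trace format are hypothetical, and every L1 access is assumed to hit in one cycle so that only the line buffer effect is visible.

```python
def conventional_line_buffer(addresses, line_size=32):
    """Count cycles and line buffer misses for a sequentially accessed buffer.

    Assumptions of this sketch: a single-line buffer, one cycle for a buffer
    hit, one extra cycle to reach L1 on a buffer miss, and L1 always hits.
    """
    buffered_line = None  # line number currently held in the buffer
    cycles = 0
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line == buffered_line:
            cycles += 1           # served from the line buffer
        else:
            cycles += 2           # buffer probe, then the L1 access
            misses += 1
            buffered_line = line  # the accessed line is moved into the buffer
    return cycles, misses


# Sixteen sequential 4-byte fetches span two 32-byte lines.
cycles, misses = conventional_line_buffer(range(0, 64, 4))
print(cycles, misses)
```

In this purely sequential trace only the first fetch of each line misses; branches that leave the current line add further two-cycle accesses.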
B. Predictive Line Buffer
To avoid the performance overhead of the conventional block buffer, we propose a new scheme. The key difference is that our scheme uses prediction to choose between the block buffer and the L1 cache, so that only one of them is accessed. The predictive block buffer architecture is shown in Fig. 1. By dynamically steering fetches between the block buffer and the L1 cache, we avoid the extra-cycle overhead and therefore improve both energy consumption and average cache access time.
C. Implementation Details
The prediction is performed by the instruction fetch mode controller, which is used during the IF (instruction fetch) stage to fetch from either the L1 cache or the line buffer. The prediction mechanism works as follows. After an instruction is fetched, the processor checks the BTB to decide whether the instruction is a branch and, if so, whether the branch is predicted taken. If the instruction is a branch that the BTB predicts as taken, or if the instruction is not a branch but its address is on the boundary of the line buffer, the prediction mode is set to L1 cache; otherwise, the prediction mode is set to line buffer. Fig. 2 shows a flow diagram of our prediction algorithm.
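The decision rule described above can be sketched in a few lines. This is an illustrative reading of the text rather than the actual controller logic: the names `BTBResult` and `predict_next_fetch` are hypothetical, the 4-byte instruction size is an assumption, and "on the boundary" is interpreted here as the last instruction slot of the buffered line.

```python
from dataclasses import dataclass

LINE_SIZE = 32   # bytes per line, as in the experimental setup
INSTR_SIZE = 4   # bytes per instruction; an assumption of this sketch


@dataclass
class BTBResult:
    is_branch: bool        # the BTB identifies the instruction as a branch
    predicted_taken: bool  # the branch is predicted taken


def predict_next_fetch(pc: int, btb: BTBResult) -> str:
    """Return the predicted source, "L1" or "LINE_BUFFER", of the next fetch."""
    # A branch predicted taken: the next fetch is steered to the L1 cache.
    if btb.is_branch and btb.predicted_taken:
        return "L1"
    # Last slot of the line: the sequential successor lies in the next line.
    # The text states this rule for non-branch instructions; applying it to
    # not-taken branches as well is an assumption of this sketch.
    if pc % LINE_SIZE == LINE_SIZE - INSTR_SIZE:
        return "L1"
    # Otherwise the next instruction is expected in the line buffer.
    return "LINE_BUFFER"


print(predict_next_fetch(0x1000, BTBResult(False, False)))  # LINE_BUFFER
print(predict_next_fetch(0x101C, BTBResult(False, False)))  # L1
```

The rule needs only the BTB outcome and the low-order bits of the PC, which is consistent with the claim that the mechanism adds only minor logic.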
IV. EXPERIMENTAL RESULTS
In this section, we compare the energy and performance (average memory access time) of the predictive block buffer with the conventional block buffer, filter cache, hotspot cache, and way-halting cache, using direct-mapped and set-associative caches. We consider three metrics in our evaluation. The first is the miss ratio in the block buffer, which is meaningful only for the conventional block buffer and the predictive block buffer. The second and third are energy consumption and average memory access time. Note that the way-halting cache is used only with set-associative caches, since it has no meaning for a direct-mapped cache.
A. Experimental methodology
We used the SimpleScalar toolset along with the CACTI 3.2 power simulator to conduct our experiments. We modified SimpleScalar to simulate the filter cache, block buffer, way-halting cache, and predictive block buffer. Our base architecture uses a 16KB cache (direct-mapped or 4-way set-associative) with a 32-byte line size. Our block buffer is also 32 bytes. For the filter cache and the hotspot cache, we used 512 bytes. The BTB is 4-way set-associative with 64 sets and uses a 2-level branch predictor [6]. Energy consumption is evaluated using a 0.35 µm process technology. Table I shows the energy consumption per access for the block buffer, L0 cache, and L1 cache. For the hotspot cache, a value of 16 is used as the candidate threshold for promoting hot blocks to the hotspot filter cache, as suggested in [12]. For the way-halting cache, we used a value of 4 as the bit width of the halt tag array [13]. For benchmarks, we used the SPEC2000, MediaBench, and MiBench suites, which cover a wide variety of applications in both embedded systems and general-purpose computing. Each application ran for up to 500 million instructions using the dataset supplied with the benchmark. These three suites contain a large number of applications; because of space limitations, we report results for a subset of them.
B. Line Buffer Miss Ratio
In this section, we consider the line buffer miss ratio. A miss in the line buffer costs one extra cycle to go to the L1 cache. This cycle could be saved by fetching the word from the line buffer and the L1 cache simultaneously. However, the main reason for using the line buffer is that it consumes less energy than the L1 cache, and accessing the L1 cache in parallel with the line buffer defeats that purpose. Table II shows the line buffer miss ratio for the conventional line buffer and the predictive line buffer with a direct-mapped cache, while Table III shows the same results for the 4-way set-associative cache. Although the line buffer decreases energy consumption, it has a high miss rate. For example, since the average miss rate for the programs in Tables II and III is 30%, at least 30% of memory accesses require 2 cycles. Thus, although a line buffer consumes less energy than a regular cache, its delay performance is not very good. These two tables show that our prediction mechanism is very accurate: on average, the miss rate in the line buffer drops from 30% to less than 1%. Moreover, because our prediction depends on the BTB, advances in branch prediction technology will help improve the performance of our prediction mechanism and, consequently, the performance of our proposed cache architecture.
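The link between the line buffer miss ratio and the fetch delay can be made explicit with a short worked calculation. The function below is a hypothetical helper, assuming a one-cycle access on a correct prediction or buffer hit, one extra cycle on a line buffer miss, and no L1 misses.

```python
def average_fetch_cycles(line_buffer_miss_ratio: float,
                         miss_penalty_cycles: int = 1) -> float:
    """Average cycles per fetch, ignoring L1 misses (assumption of this sketch)."""
    return 1.0 + line_buffer_miss_ratio * miss_penalty_cycles


# Average miss ratios quoted in the text: 30% without and under 1% with prediction.
print(f"{average_fetch_cycles(0.30):.2f}")
print(f"{average_fetch_cycles(0.01):.2f}")
```

Under these assumptions, a 30% miss ratio corresponds to about 1.30 cycles per fetch, while a miss ratio below 1% keeps the average within about 1.01 cycles.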
C. Energy
In this section we compare the energy consumption of the predictive line buffer with that of the conventional line buffer, filter cache, hotspot cache, and way-halting cache. We assumed two different configurations for the L1 cache: direct-mapped and 4-way set-associative. We did not include the way-halting cache for the direct-mapped L1 cache, since way-halting works only with set-associative caches. For the hotspot cache, we assumed that the block buffer is not accessed in parallel with the L1 cache. The energy results are normalized to a 16KB cache. Fig. 3 shows the energy consumption of the predictive line buffer, line buffer, filter cache, and hotspot cache, assuming a direct-mapped L1 cache. The conventional line buffer significantly reduces energy consumption compared with other techniques such as the hotspot cache or the filter cache. That is expected, since the main advantage of the line buffer is energy reduction. Table I shows that an access to the line buffer consumes less than 10% of the energy required to access the direct-mapped L1 cache (less than 5% for the 4-way set-associative cache) and almost 20% of the energy required to access the L0 (filter) cache. However, the main drawback of the line buffer is its delay: since the average miss rate in our simulation is 30%, 30% of the instructions require 2 cycles to reach the L1 cache. Fig. 4 shows similar energy results for the previously mentioned architectures and the way-halting cache, assuming a 4-way set-associative L1 cache. Table IV shows the average energy consumption of the predictive line buffer architecture for both direct-mapped and set-associative L1 caches over the different programs in the three benchmark suites, and compares it with the hotspot cache, way-halting cache, filter cache, and line buffer.
From Figs. 3 and 4, we can see that our proposed architecture has the lowest energy consumption and, as we will see in the next section, it does not sacrifice speed in order to save energy. This is achieved by exploiting the energy-saving characteristics of a line buffer without paying the high miss ratio overhead of the conventional block buffer.
We can also notice that for some applications, such as mpeg2 encode and unepic, the hotspot cache has the lowest energy consumption. The difference between the two is that the predictive line buffer holds only a single line, and can therefore capture at most one small loop, whereas the hotspot cache can hold more than one line. If a program alternates between two small loops, the hotspot cache will consume less energy than the predictive line buffer.
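The intuition behind this result can be sketched with a simple per-fetch energy model. This is not the CACTI-based evaluation used in the paper: the function names are hypothetical, the energies are normalized to one L1 access, the line buffer energy of 0.1 is an illustrative value at the upper bound quoted from Table I for the direct-mapped case, and line-fill and predictor energy are ignored.

```python
def conventional_energy(lb_miss_ratio: float, e_lb: float, e_l1: float) -> float:
    """Every fetch probes the line buffer; misses also access L1."""
    return e_lb + lb_miss_ratio * e_l1


def predictive_energy(frac_predicted_lb: float, lb_miss_when_predicted: float,
                      e_lb: float, e_l1: float) -> float:
    """Each fetch goes to exactly one structure unless the prediction fails."""
    frac_predicted_l1 = 1.0 - frac_predicted_lb
    # Fetches predicted to hit the buffer pay for L1 only on a misprediction.
    lb_part = frac_predicted_lb * (e_lb + lb_miss_when_predicted * e_l1)
    return frac_predicted_l1 * e_l1 + lb_part


E_L1 = 1.0  # normalized energy of one L1 access
E_LB = 0.1  # illustrative value, not a measured figure

# Illustrative inputs: 30% buffer misses without prediction; with prediction,
# 70% of fetches steered to the buffer and 1% of those mispredicted.
conv = conventional_energy(0.30, E_LB, E_L1)
pred = predictive_energy(0.70, 0.01, E_LB, E_L1)
print(f"{conv:.3f} {pred:.3f}")
```

With these illustrative numbers the predictive scheme spends slightly less energy per fetch, because fetches steered directly to L1 skip the line buffer probe that the conventional scheme always performs.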
D. Delay
In this section we show that the predictive block buffer does not sacrifice performance for lower energy consumption. We compare the delay (average memory access time) of the predictive block buffer with that of the filter cache, the hotspot cache, and the conventional block buffer. The delay for the four architectures using a conventional direct-mapped cache is shown in Fig. 5. The delay for the predictive line buffer is close to 1 cycle per memory access for all benchmarks. The conventional block buffer, on the other hand, increases the delay by up to 30%. For the predictive line buffer, the worst performance is for g721-encode, at up to 1.2 cycles per memory access. These results show that the predictive line buffer scheme reduces energy consumption without sacrificing performance, with a delay comparable to that of the hotspot and filter caches. Table V shows the average normalized delay for the hotspot cache, filter cache, way-halting cache, line buffer, and predictive block buffer using a direct-mapped cache as the base L1 cache.
For the predictive line buffer, using a set-associative cache as the L1 cache does not increase the delay. Fig. 6 shows the normalized delay for the various schemes when using a 4-way set-associative cache. The averages for all the schemes are shown in Table V. Predictive block buffer scheme is almost close to ideal with
V. CONCLUSION
In this paper, we proposed a new cache architecture for embedded systems that consumes less energy and has better delay performance than many recently proposed caches. Our proposed architecture consists of an L1 cache and a single line buffer. We use prediction to access either the line buffer or the L1 cache. The architecture assumes the existence of a BTB and adds minor logic for the prediction mechanism. Simulation results using SimpleScalar and CACTI 3.2 with the SPEC2000, MediaBench, and MiBench benchmarks show that our proposed architecture has lower energy consumption and better average memory access time than many recently proposed caches.
Fig. 5. Normalized delay for filter cache, conventional block buffer, predictive block buffer, and hotspot cache using a direct-mapped L1 cache.
Fig. 6. Normalized delay for filter cache, conventional block buffer, predictive block buffer, and hotspot cache using a 4-way set-associative L1 cache.
