Abstract: Filter cache (FC) is an auxiliary cache much smaller than the main cache. The FC is closest in hierarchy to the instruction fetch unit and it must be small in size to achieve energy-efficient realisations. A pattern prediction scheme is adapted to maximise energy savings in the FC hierarchy. The pattern prediction mechanism proposed relies on the spatial hit or miss pattern of the instruction access stream over previous FC line accesses. Unlike existing techniques, whose mispredictions are predominantly incorrect hit predictions, the proposed approach aims to minimise these, thereby reducing the associated performance and power penalties. Simulation results on an extensive set of multimedia benchmarks are presented as proof of its efficacy. The prediction technique results in energy-delay savings of up to 6.8% over the NFPT predictor, which has been proposed in the past as the preferred prediction scheme for FC structures. Investigations demonstrate that the performance of the proposed prediction scheme is comparable with, and in most cases better than, that based on NFPT. Unlike NFPT, the proposed prediction technique lends itself well to efficient VLSI implementation, making it the preferred choice for energy-aware implementations.
Introduction
Advances in VLSI technology have dramatically increased processor speed and DRAM capacity. However, the disparate growth rates of processor and main memory performance have introduced a large and still growing performance gap. Through improvements in both clock speed and instruction level parallelism (ILP), processor performance has been improving at a rate of 60% per year [1]. On the other hand, the access time for DRAMs has been improving at a rate of less than 10% per year [1]. The inability of memory systems to cope with faster processor speeds results in poor total system performance in spite of higher processor performance.
Cache memories have long played an important role in bridging the performance gap between high-speed processors and low-speed off-chip main memory. Much research has been done to improve cache performance and many high-performance cache architectures have been proposed [1]. However, the processor-memory performance gap is still increasing.
The growing mobile market requires not only high performance but also low energy consumption. One of the uncompromising requirements of portable computing is energy efficiency, particularly due to the finite nature of battery life. With deep submicron (DSM) process technology and high clock speeds, power consumption is becoming a major design issue even in cases where unlimited energy is available to power the device. In high-performance architectures, such as server processors where an unlimited energy source is available, heat dissipation becomes a concern due to the high rate of energy consumption in the devices. Thus, there exists a need to realise energy-efficient high-performance memory architectures for all types of processors.
To alleviate the performance gap between memory systems and processors, the trend is to invest the growing transistor density in increasing the cache capacity. Raising the cache capacity reduces the frequency of off-chip accesses due to improved cache hit rates. This approach is useful in that the energy dissipated for driving external I/O pins is reduced. However, it increases the on-chip energy consumption due to frequent accesses to the larger cache memory. For instance, the on-chip cache for the StrongARM SA-110 consumes approximately 43% of the total chip power [2].
Though future process technologies will allow bigger instruction caches to be fabricated on-chip, energy consumption and access time are likely to continue to be the limiting factors. This inability to achieve notable energy reduction using current techniques highlights the need for a memory hierarchy design that maximises the instruction hit rate, while tackling the access time penalty and dynamic energy.
It is well understood that the dynamic or switching energy dissipation in the instruction cache is typically higher than that in the data cache. For instance, in the x86 architecture, simulation with SPECInt92 reveals that only 34% of the instructions access the data cache [1]. Moreover, 70 to 90% of the power consumption in CMOS technology is attributed to switching activity [3]. In this paper, we propose a predictor-based hierarchy to reduce the overall energy-delay product in the instruction cache.
Various small auxiliary cache hierarchies have been proposed to tackle the problem of energy dissipation in the instruction cache hierarchy. These auxiliary caches utilise the immediate spatial and temporal locality within the small loops of the application. Line buffer, loop cache [4] and FC [5] are examples of such cache memories, which operate in slightly different ways. These mechanisms exploit the energy savings possible when a small cache is placed between the execution core and the main cache. FC has been shown to achieve 58% power saving at the cost of 21% performance loss. The power saving is achieved by accessing a small energy-efficient cache more often than the main cache. In doing so, it takes advantage of the spatial and temporal locality exhibited by the instruction stream. Though the energy-delay product reduction achieved by the FC is significant, the performance degradation due to the sequential nature of the accesses to the FC and the main cache may not be acceptable. It has been reported that having a predictor-based parallel access path between the processor core and the FC or main cache [6] could limit the performance penalty to acceptable levels.
Incorporation of the predictor leads to four operating scenarios. The following considers the latency tradeoff for a prediction-based hierarchy.
(i) Predicted FC Actual FC: The prediction was to access FC, and the instruction is found in the FC. In this case the power savings expected with FC is obtained and at the same time, there is no performance penalty.
(ii) Predicted L1 Actual L1: The prediction was to access the L1 cache, and the instruction required is absent in the FC. In this case, since the predictor has correctly predicted L1, the performance will be equal to the case where FC is not present.
(iii) Predicted FC Actual L1: In the case of an incorrect hit-prediction in FC, performance is sacrificed as the data is now fetched from L1 cache or from a memory block that is further away in the hierarchy.
(iv) Predicted L1 Actual FC: In the case of an incorrect miss-prediction in the FC, where the data is actually in the FC but is predicted otherwise, possible savings in energy and increased performance are sacrificed.
Ideally, the predictor should encounter no instances of cases (iii) and (iv). A predictor capable of this is termed an ideal predictor in our discussion. For an ideal predictor the average number of memory access cycles is equal to that of the base case where the FC is not present, and the access energy savings achieved are better than in the case with a (hierarchical) FC. In the Section that follows, we discuss the existing next-fetch-prediction-table-based predictor prior to introducing the new branch-prediction-based scheme for the FC-based instruction cache hierarchy.
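The four scenarios can be made concrete with a small bookkeeping sketch. The cycle costs below are hypothetical values chosen only for illustration and are not taken from our experiments; the function names are likewise illustrative.

```python
from collections import Counter

# Hypothetical cycle costs (assumptions for illustration only).
FC_HIT_CYCLES = 1       # instruction served by the filter cache
L1_ACCESS_CYCLES = 1    # instruction served directly by the L1 cache
MISPREDICT_PENALTY = 1  # extra cycle when the FC is probed first and misses


def classify(predicted_fc: bool, in_fc: bool) -> str:
    """Map one fetch onto scenarios (i)-(iv)."""
    if predicted_fc and in_fc:
        return "i"    # predicted FC, actual FC: energy saved, no delay
    if not predicted_fc and not in_fc:
        return "ii"   # predicted L1, actual L1: same as having no FC
    if predicted_fc and not in_fc:
        return "iii"  # predicted FC, actual L1: extra latency
    return "iv"       # predicted L1, actual FC: energy saving lost


def cycles(scenario: str) -> int:
    """Access cycles charged to a fetch under the assumed costs."""
    if scenario == "i":
        return FC_HIT_CYCLES
    if scenario == "iii":
        return MISPREDICT_PENALTY + L1_ACCESS_CYCLES
    return L1_ACCESS_CYCLES  # scenarios (ii) and (iv) go straight to L1


def summarise(fetches):
    """fetches: iterable of (predicted_fc, in_fc) pairs."""
    counts = Counter(classify(p, a) for p, a in fetches)
    total_cycles = sum(cycles(s) * n for s, n in counts.items())
    return counts, total_cycles
```

Under these assumed costs only scenario (iii) adds delay, while scenario (iv) costs energy rather than cycles, which is why the two kinds of misprediction are treated separately in the evaluation.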
Prediction algorithms

Next-fetch-prediction table
The next-fetch-prediction table (NFPT) [6, 7] predictor, introduced for predictive access to the FC, was shown to achieve energy savings of 31.5% for the instruction cache with SPEC95 benchmarks on a wide-issue superscalar processor while limiting the performance degradation to 1.3%. The NFPT predictor predicts, in the previous cycle, whether or not the instruction to be fetched will be present in the FC. Based on this prediction, the fetch unit will attempt to fetch the instruction either from the FC or from the L1 cache.
The prediction algorithm works as follows:
If the tag of the current fetch address is equal to the tag of the predicted next fetch address based on the previous control path (i.e. based on the next fetch entry in the NFPT for the corresponding cache line), then a hit in the FC is predicted. Otherwise the main cache is accessed.
The rationale is as follows:
If the control path falls within a small loop, the tags of the current fetch address and the predicted next fetch would be equal, and only if it is part of a small loop would it remain in the cache until the next iteration, or else it would have been flushed out. The drawbacks of the NFPT and the motivation for devising an alternative approach follow.
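A minimal sketch of this rule is given below. It is a simplified reading of the description above, not the implementation of [6, 7]: the table organisation (one full next-fetch address per FC line), the address split and all names are assumptions made for illustration.

```python
LINE_BYTES = 16  # FC line size (assumed, matching a 16-byte-line FC)
FC_LINES = 16    # 256-byte direct-mapped FC (assumed)


def line_index(addr: int) -> int:
    """FC line that an address maps to."""
    return (addr // LINE_BYTES) % FC_LINES


def tag(addr: int) -> int:
    """Address bits above the line index."""
    return addr // (LINE_BYTES * FC_LINES)


class NFPT:
    """One next-fetch entry per FC line (hypothetical organisation)."""

    def __init__(self):
        self.next_fetch = [None] * FC_LINES

    def predict_fc_hit(self, cur_addr: int) -> bool:
        # Predict an FC hit only if the recorded next fetch address
        # carries the same tag as the current fetch address.
        predicted = self.next_fetch[line_index(cur_addr)]
        return predicted is not None and tag(predicted) == tag(cur_addr)

    def update(self, cur_addr: int, actual_next_addr: int) -> None:
        # Record the control path actually taken from this line.
        self.next_fetch[line_index(cur_addr)] = actual_next_addr
```

In this sketch, a transfer of control to an address with a different tag makes `predict_fc_hit` return False regardless of whether the target line is resident in the FC.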
2.1.3 Motivation for alternative prediction scheme:
Though the NFPT has been shown to be a useful prediction technique, it is ineffective for loops whose function calls disrupt the tag dependency on which the prediction relies. This is because, even if the FC can hold all the instruction lines, the NFPT predictor will predict a miss, as the tags will be different. Moreover, the NFPT requires that tags be maintained in a table, which can be costly in VLSI. Hence, there exists a need to identify a suitable predictor whose hardware complexity is not dictated by the size of the FC. The need to minimise the energy consumption and improve prediction accuracy has been the primary motivation to experiment with the two-level binary predictor, commonly used for branch prediction, to predict the hit/miss patterns of instruction line accesses. This is described as follows.
Pattern history based predictor
The two-level adaptive predictor has been researched extensively for branch prediction and numerous variants of it have been proposed [8]. Studies have shown which predictors and configurations best predict the branches in a given set of benchmarks. We adapt one such predictor, the two-level adaptive branch predictor, to predict a hit in the FC. These binary predictors have found widespread application, as explained in [9]. In our setup, we use the 'GAg' [8] scheme due to its higher prediction accuracy and simplicity.
The two-level predictor consists of a shift register (SR) to record the hit/miss patterns of instruction line accesses and a pattern history table (PHT) consisting of 2-bit saturating counters. These counter values are used to predict a hit or miss for the next instruction access. When the previous N hit/miss outcomes are known, indexing the PHT with the SR value allows us to predict whether the next access is going to be a hit or a miss. The size of the PHT (2^N entries) is determined by the size of the SR (N bits) used. The finite state machine (FSM) shown in Fig. 1 depicts the prediction process. Figure 2 illustrates the algorithm employed in the prediction process. Drawing from the insight provided by branch prediction research, we assume that the next instruction fetch address will be the current fetch address plus '4' (here each instruction is 32 bits wide and we assume that most branches are not taken, so that the next instruction address is the current instruction address plus four bytes). We access the predictor only when cache line changes are encountered. Hence, we predict a hit in the FC if the cache line being accessed is unchanged. In the event of a change, we access the PHT entry corresponding to the SR value, and the prediction of a hit or a miss in the FC is made depending on whether the PHT value reads above or below the preset threshold. For a two-bit saturating counter, a threshold value of 2 is used. When the outcome of the prediction is known, the SR and PHT values are updated accordingly, as in the normal branch predictor scheme.
Fig. 1 Finite state machine of 2-bit saturating counter
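The prediction process described above can be sketched in software as follows. This is an illustrative model rather than the simulator code: the class and method names are hypothetical, counters are assumed to start at zero, a counter value at or above the threshold is assumed to mean "predict hit", and the line-change test compares the fetch address against the last line seen instead of explicitly forming the current address plus four.

```python
class PatternPredictor:
    """GAg-style hit/miss predictor: one global SR indexing one PHT."""

    def __init__(self, sr_bits=5, line_bytes=16, threshold=2):
        self.sr_bits = sr_bits
        self.line_bytes = line_bytes
        self.threshold = threshold
        self.sr = 0                      # last N hit/miss outcomes
        self.pht = [0] * (1 << sr_bits)  # 2^N two-bit counters (start value assumed)
        self.last_line = None            # last instruction line accessed

    def predict(self, pc: int) -> bool:
        """Return True to fetch from the FC, False to fetch from L1."""
        if pc // self.line_bytes == self.last_line:
            return True  # line unchanged: predict hit, PHT not consulted
        return self.pht[self.sr] >= self.threshold

    def update(self, pc: int, hit: bool) -> None:
        """Train on the actual FC outcome; only line changes are recorded."""
        line = pc // self.line_bytes
        if line == self.last_line:
            return
        counter = self.pht[self.sr]
        # 2-bit saturating counter: count up on a hit, down on a miss.
        self.pht[self.sr] = min(3, counter + 1) if hit else max(0, counter - 1)
        # Shift the outcome into the SR, keeping only the last N bits.
        self.sr = ((self.sr << 1) | int(hit)) & ((1 << self.sr_bits) - 1)
        self.last_line = line
```

With the default `sr_bits=5` the PHT has 32 entries. Note that the storage required depends only on N, not on the number of FC lines.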
Experimental setup
The SimpleScalar [10] tool set with the ARM instruction set architecture (ISA) was used for these experiments. To measure the per-access energy, CACTI version 3.0 [11] was used with parameters set to 0.18 µm technology. The processor model simulated was a single-issue processor typically found in general-purpose embedded processors, where instructions are fetched from the cache hierarchy every clock cycle; an instruction cache miss results in a pipeline bubble. Also, the first-level instruction access is in the critical instruction access path. A 32-kbyte, four-way set-associative L1 cache with 32-byte line size, organised as two sub-banks, was used in all the experiments as the base case so that the setup is compatible with the instruction caches used in high-end embedded microprocessors. A set of benchmarks from MediaBench [12] and MiBench [13] was selected as they characterise the working sets of embedded applications for high-performance embedded processors. All the applications were compiled with gcc 2.93 and were run to completion. The conventional filter cache (CON), NFPT (the NFPT predictor was simulated with a 4-bit next fetch address as provided in the literature) and the pattern predictor cache configurations were implemented with the SimpleScalar tool set and simulated. All the FC configurations used in these experiments have a 16-byte line size and are direct mapped.
Experimental results
Our experimental results are grouped into two parts:
Prediction accuracy: Prediction accuracy has the highest impact on the energy savings and the average memory access latency. In this Section, the prediction accuracy of the pattern prediction algorithm is compared with that of NFPT for 256-byte and 512-byte FC sizes (Section 4.1.1). Then, the ability of the pattern predictor to cater for the varying requirements of the application, in an attempt to improve the prediction accuracy, is investigated (Section 4.1.2).
Energy-delay product: The energy-delay product provides a unified quantification of the effectiveness of the architecture. In this Section, the factors affecting the energy-delay product measure are first discussed (Sections 4.2.1-4.2.3). In the discussion that follows, we show that the energy-delay product of a pattern-prediction-based FC is lower than that based on NFPT (Section 4.2.4). The effect of FC size on the energy-delay product is then presented (Section 4.2.5).
4.1 Prediction accuracy
4.1.1 Prediction accuracy of NFPT against pattern predictor:
The pattern predictor with a 32-entry PHT was employed at first for 256 bytes and 512 bytes of FC memory, as suggested in [5, 6]. This metric compares the prediction performance of the two prediction algorithms. As seen from Fig. 3, for the 256-byte FC, the pattern predictor always results in higher prediction accuracy than the NFPT predictor. For the 512-byte FC the pattern predictor results in higher prediction accuracy except for ADPCM decoder, ISPELL and SUSAN, for which the performance of the pattern predictor is comparable with that of NFPT. However, an anomaly is also visible in the form of degraded performance with increased FC size for the pattern predictor with some benchmarks (e.g. GS-SCRIPT DEC and JPEG-DEC). The relationship between FC size, PHT size and prediction accuracy is further investigated in the following Section.
4.1.2 PHT size on prediction accuracy of pattern predictor:
Having established the overall superiority of the pattern predictor over NFPT, we investigate the effect of varying PHT size on prediction accuracy. Figure 4 shows that, in general, the prediction accuracy increases as we increase the PHT size. It is also evident that larger caches benefit more from a bigger PHT than smaller caches. This is not so for benchmarks with a smaller footprint (e.g. ADPCM). For such cases it is noteworthy that an improvement in the prediction accuracy was observed when a smaller cache size and SR length are employed. For the 256-byte FC, g721DECODE, ISPELL, PATRICIA and QSORT show that the prediction accuracy improves with the increase in the PHT size, albeit marginally. Moreover, we observe a marked improvement in the prediction accuracy when the FC size is increased from 256 to 512 bytes. Based on our findings, we can safely assume that the prediction accuracy with 8-entry and 32-entry PHTs is generally comparable with, and in most cases better than, that obtained using the NFPT prediction scheme. We choose a PHT size of 32 to generate all the results in the subsequent Sections.
4.1.3 Hardware implementation cost comparison: Figure 5 shows the simplified implementation of the algorithm shown in Fig. 2. It employs an AND gate to detect whether the current fetch address and the predicted next fetch address (PC + 4) map to the same cache line. Based on this, the next fetch prediction source is selected. The register 'last line' stores the last instruction line address accessed. While fetching a new instruction, the PC is compared with the 'last line' to detect a line change in the instruction cache. In the event that an instruction line change is detected, the SR and PHT are updated. The NFPT implementation is explained in [6]. In the NFPT implementation, the NFPT table size is equal to NCL × 4, where NCL is the number of instruction cache lines. In addition, it requires a register to hold the last line and a 4-bit comparator.
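The storage arithmetic implied by these table sizes can be sketched as follows. The sketch assumes that NCL counts FC lines (16-byte lines), that each NFPT entry is 4 bits, that each PHT entry is a 2-bit counter and that a 2^N-entry PHT is paired with an N-bit SR. It counts raw storage bits only and ignores the 'last line' register, comparators and control logic, so it is not a substitute for synthesised area figures; the function names are illustrative.

```python
def nfpt_table_bits(fc_bytes: int, line_bytes: int = 16,
                    bits_per_entry: int = 4) -> int:
    """NFPT table storage: NCL x 4 bits, with NCL assumed to be FC lines."""
    ncl = fc_bytes // line_bytes
    return ncl * bits_per_entry


def pattern_predictor_bits(pht_entries: int, counter_bits: int = 2) -> int:
    """PHT counters plus the N-bit shift register (2^N == pht_entries)."""
    sr_bits = pht_entries.bit_length() - 1
    return pht_entries * counter_bits + sr_bits


for fc in (256, 512):
    print(fc, "byte FC, NFPT table bits:", nfpt_table_bits(fc))
for entries in (8, 32):
    print(entries, "entry PHT, predictor bits:", pattern_predictor_bits(entries))
```

Under these assumptions the NFPT storage doubles when the FC size doubles, whereas the pattern predictor storage is fixed by the chosen PHT size.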
As an initial estimate of relative hardware cost, the two predictors were synthesised using the Synopsys behavioural compiler on the 0.35 micron Avant! Libra Passport library. The architectures were described using SystemC [14] and compiled to gates using the CoCentric SystemC compiler [15]. Though the area measures produced by hardware synthesis from SystemC are not optimal, they serve to illustrate the relative hardware cost of the two predictors. Table 1 shows the hardware resource requirements for implementing NFPT and the pattern predictor for various cache sizes and configurations. It is clear from Table 1 that an application-centric selection criterion for determining the PHT size may provide for an area-efficient implementation of the pattern predictor. Unlike the NFPT, which is dependent on FC size, the pattern predictor is application-dependent. This distinct feature of the pattern predictor provides a mechanism to propose application-centric configurations that could be tailored to satisfy area and energy-delay product trade-offs.
4.2 Energy-delay product
4.2.1 Filter cache hit rate: Figure 6 compares the filter-cache hit rates (FCHR) [6] calculated as the percentage of actual hits to the FC. This metric evaluates the NFPT. This is despite the fact that the pattern predictor results in higher overall prediction accuracy (i.e. predictions for both hits and misses) for these benchmarks. This is because most of the incorrect predictions by the pattern predictor tend to be FC miss predictions, whereas most of the NFPT's incorrect predictions are FC hit predictions. As discussed in Section 2.1, incorrect FC hit predictions will degrade the performance and the incorrect FC miss predictions will result in reduced access energy savings.
4.2.2 Filter cache performance index:
The main motivation for incorporating a predictor is to improve the overall performance of the conventional hierarchical organisation. We introduce another metric called the filter-cache performance index (FCPI), which is the ratio of correct FC predictions to the total number of correct and incorrect predictions towards the FC. FCPI can be used to evaluate the performance improvements in the presence of the FC, as incorrect predictions towards the FC will lead to pipeline bubbles. An ideal predictor will have an FCPI of one. Figure 7 compares the FCPI for pattern prediction with that for NFPT. An improvement in FCPI is noted when the pattern predictor is combined with the 256-byte FC. The same could be said for the combination of the 512-byte FC and the pattern predictor, except for the case with QSORT.
4.2.3 Normalised delay:
Figure 8 compares the normalised delay or performance [5], which is the normalised average access cycle time of instructions. Here the normalised delay is obtained by restricting our analysis to FC and L1 accesses only. This normalised delay will be one if an ideal predictor is employed. It can be seen that the combination of the 256-byte FC and the pattern predictor results in better performance for all the benchmarks. The same could be said for the combination of the 512-byte FC and the pattern predictor, except for the case with QSORT. These observations confirm that the pattern predictor exhibits superiority over NFPT.
4.2.4 Energy-delay product:
Incorrect predictions towards both the FC and L1 affect the energy-delay measures. This is because incorrect predictions towards the FC incur additional access cycles, while incorrect accesses to L1 require additional access energy due to the notable disparity between FC and L1 sizes. Figure 9 compares the normalised energy-delay product [5] for 256-byte and 512-byte FCs with both NFPT and the pattern predictor. The configuration with the 256-byte FC and the pattern predictor results in a better normalised energy-delay product for all the benchmarks. The same could be said for the combination of the 512-byte FC and the pattern predictor, except for ISPELL and SUSAN. However, the difference in energy-delay product with ISPELL and SUSAN is negligible, as seen from Fig. 9.
4.2.5 Optimal FC size for energy-delay product reduction:
Figure 10 shows the normalised energy-delay products of the pattern predictor and NFPT for various FC sizes. Here, an energy-delay product value of 1 refers to the lowest energy-delay product among the set. It is evident that the optimal energy-delay product for a given benchmark is governed by the FC size. This implies that the size of the FC can be tailored for a given application.
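Two of the quantities used in this Section can be written down directly as a short sketch. FCPI follows its definition as the ratio of correct FC predictions to all predictions towards the FC, and the normalisation divides each energy-delay product by the lowest value in the set so that the best configuration reads 1. The function names are illustrative, and any inputs are hypothetical placeholders rather than measured results.

```python
def fcpi(correct_fc: int, incorrect_fc: int) -> float:
    """Filter-cache performance index; an ideal predictor gives 1.0."""
    total = correct_fc + incorrect_fc
    # With no predictions towards the FC the ratio is undefined;
    # 0.0 is returned here purely as a convention of this sketch.
    return correct_fc / total if total else 0.0


def normalise_to_best(edp_by_fc_size: dict) -> dict:
    """Scale energy-delay products so the lowest value in the set is 1."""
    best = min(edp_by_fc_size.values())
    return {size: edp / best for size, edp in edp_by_fc_size.items()}


def best_fc_size(edp_by_fc_size: dict) -> int:
    """FC size with the lowest energy-delay product for one benchmark."""
    return min(edp_by_fc_size, key=edp_by_fc_size.get)
```

Applying `best_fc_size` per benchmark is one simple way to express the application-specific choice of FC size discussed above.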
Conclusions
The pattern prediction scheme has been successfully incorporated into the filter-cache memory hierarchy. It has been demonstrated that the proposed prediction technique is more energy-efficient than the NFPT prediction scheme. Our investigations confirm that, when compared with NFPT implementations, the proposed predictor leads to an average energy-delay product reduction of 3.39% and 2.39% with 256-byte and 512-byte filter caches, respectively. Unlike the NFPT predictor, the pattern prediction approach can be configured independently of filter-cache size to maximise the prediction accuracy. For example, for a typical filter cache of size 256 bytes, a relatively small PHT (8/32-entry table) is sufficient to contain the patterns exhibited by instruction lines. Moreover, the fact that the pattern predictor can be efficiently realised in VLSI further justifies its incorporation in filter-cache structures. Finally, we have demonstrated that, for a given configuration of the pattern predictor and application, an optimal size for the FC can be proposed by examining the normalised energy-delay product.
