The authors present a translation lookaside buffer (TLB) system with low power consumption for embedded processors. The proposed TLB is constructed as multiple banks, each with an associated block buffer and a corresponding comparator. Either the block buffer or the main bank is selectively accessed on the basis of two bits in the tag buffer. Dynamic power savings are achieved by reducing the number of entries accessed in parallel, as a result of using the tag buffer as a filtering mechanism. The performance overhead of the proposed TLB is negligible compared with other hierarchical TLB structures. For example, the two-cycle overhead of the proposed TLB is only ~1%, as compared with 5% overhead for a filter (micro)-TLB and 14% overhead for a banked-TLB with block buffering. The authors show that the average hit ratios of the block buffers and the main banks of the proposed TLB are 94% and 6%, respectively. Dynamic power is reduced by ~93% with respect to a fully associative TLB, 87% with respect to a filter-TLB and 60% relative to a banked-TLB with block buffering. Therefore, significant power savings are achieved with only a small performance degradation.
Introduction
Recent embedded processors support a virtual memory system through a hardware memory management unit (MMU) that translates virtual addresses to physical addresses. Moreover, those processors are widely used for multimedia and communication applications, which require high-speed computing capability, high memory bandwidth and effective memory hierarchy support. Power consumption is also a major factor in designing high performance embedded processors. In general, reducing power consumption architecturally in the memory system has a more significant impact than many other techniques at the gate/circuit level.
The translation lookaside buffer (TLB) is an on-chip cache that records page table entries for recently used virtual to physical address translations [1]. If the necessary translation information exists in the TLB, the system can translate a given virtual address to its corresponding physical address without accessing the page table. If the translation information is not found in the TLB, an expensive lookup in the page table has to be initiated and the TLB has to be updated with this information.
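The hit and miss paths just described can be sketched as follows. This is a minimal illustration rather than the design evaluated in this paper: it assumes 4 kB pages, and the names `translate`, `tlb` and `page_table` are hypothetical, with plain dictionaries standing in for the hardware TLB and the page table.

```python
PAGE_SHIFT = 12  # assumption: 4 kB pages, so 12 offset bits


def translate(vaddr, tlb, page_table):
    """Return (physical address, hit flag) for a virtual address.

    tlb and page_table are dicts mapping VPN -> PPN (hypothetical stand-ins).
    """
    vpn = vaddr >> PAGE_SHIFT
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    hit = vpn in tlb
    if not hit:
        # TLB miss: consult the page table (the expensive path) and refill.
        tlb[vpn] = page_table[vpn]
    return (tlb[vpn] << PAGE_SHIFT) | offset, hit
```

The first reference to a page takes the miss path and refills the TLB; a second reference to the same page is then served from the TLB alone.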
Measured data [2, 3] from commercial processors such as Intel's StrongARM and Hitachi's SH-3 indicate that as much as 17% of on-chip power is consumed in the TLBs and the trend is increasing. In the case of the ARM920T [4], the power consumption by the data TLB and instruction TLB are, respectively, ~5% and 4% of the total power consumption of the entire processor. For comparison, the power consumption of the data cache is ~19% and the instruction cache consumes ~24%. Although the physical size of a TLB is small, compared with a cache memory, it accounts for a significant fraction of the total power consumption. A TLB is typically organised as a fully associative structure that is accessed for every instruction and data fetch. Since the TLB's content addressable memory (CAM) and register file (RF) data portions are constructed as dynamic circuits, they consume a large amount of power. Because TLB circuits are generally on a critical timing path, power savings must not be obtained at the expense of increased circuit delay.
Conventional methods for reducing TLB power consumption are to make it hold fewer entries, apply a filtering or a block buffering mechanism, and utilise a bank structure [5-7]. Decreasing the number of TLB entries degrades performance. The filter-TLB mechanism, where a very small TLB is located above the conventional L1 TLB, causes a performance degradation because it increases the number of two-cycle accesses. Block buffering can be viewed as an approach similar to the filter mechanism. However, accessing the block buffer should be completed during one cycle for modern microprocessors with high clock frequency.
Our TLB structure supports low dynamic power consumption for embedded processors. We use selective block buffering as another type of filtering mechanism that offers low power consumption via a simple hardware control mechanism. That is, either a block buffer or a main TLB bank can be selectively accessed using simple logic. The average accuracy of these operations is ~99% with respect to the total number of addresses generated by the CPU. Our scheme divides both the block buffering structure and the main TLB structure into four banks to reduce power consumption. We show that dynamic power is saved by accessing either the main bank or the block buffer selectively and by decreasing the number of fully associative TLB entries to be accessed in parallel.
Simulation results show that the average memory access time of the proposed TLB is almost equal to that of a conventional fully associative TLB. However, the dynamic power consumption of the proposed TLB is ~93% less than the fully associative TLB, 87% less than a filter-TLB and 50% less than a banked-TLB with block buffering. Also, the Energy × Delay product for the TLB is reduced by ~93%, 88% and 60% compared with a fully associative TLB, a filter-TLB and a banked-TLB with block buffering, respectively.
Related work
In both modern high performance microprocessors and embedded processors, the on-chip TLB is split into separate instruction and data TLBs (e.g. StrongARM, MIPS, Alpha, PowerPC and UltraSPARC) [8, 9]. The miss ratio in a TLB tends to be very small because each entry refers to a page of memory. However, a TLB miss is accompanied by a long handling latency, i.e. on the order of tens to hundreds of cycles. Therefore, a fully associative TLB is typically used to obtain lower miss rates, but full associativity is very costly in terms of power consumption. To reduce power consumption, the total number of TLB entry tags accessed in parallel should be fewer than 64 or 128 [1]. However, higher performance can be achieved if more TLB entries are provided. One method for dealing with these conflicting goals is to divide the entire TLB space into several sub-TLBs so that the number of tags accessed together can be reduced to fewer than 32 or 64 [7]. This bank-TLB [7] consumes less power than a fully associative TLB because only a portion of the CAM entries are activated on each access.
The filter (micro)-TLB is a hierarchical structure, where a very small TLB is located above the conventional L1 TLB [5] . In terms of power consumption, a filter-TLB turns out to be effective when combined with the instruction TLB due to its low miss ratio, but for the data TLB, the performance degradation of the filter-TLB, compared with a fully associative TLB, becomes significant. To overcome this weakness in the filter-TLB, the banked-filter TLB [10] was proposed, which is configured as a 2-way banked filter TLB and a 2-way banked main TLB. Banking decreases power consumption by reducing the number of entries that are accessed, and effectively increases the TLB space by adding the filter space while avoiding the inclusion property.
Ghose and Kamble [6] proposed to overcome the weakness of the conventional filter cache with a line-buffering cache. Cache access latency is greater on a miss in the filter cache, but line buffering [6, 11, 12] reduces latency by accessing a 4-entry fully associative cache in parallel with the normal cache. The line buffering cache thus achieves high performance but its power consumption is not as low as the filter cache, because it probes the line buffers in parallel with accessing the main cache. The dynamic power consumed per cache access can be specified as the power to drive line buffers, e.g. four set-number comparators, one tag-part comparator, 4×1 and 2×1 MUXs, as well as the precharge power of the tag and data arrays in the main cache, and the decoding power of the main cache. From this understanding of the power model, we propose a new block buffering technique, called selective block buffering, which makes up for the weaknesses in the original block buffering technique by avoiding the overhead of parallel access to the main TLB.
The difference-bit cache [13] used a detection mechanism to select one way of a two-way set associative cache. This cache has a diff-index and a diff-value, which are used to determine the enable signal for selecting one of the two ways. To do this, the pairs (diff-index, diff-value) are stored in a Diff memory of size S × r, where S is the number of sets of the two-way associative cache and the value r depends on the code used to represent diff-index. If t is the number of bits of the tag, the value of r is t + 1. Consequently, this mechanism suffers from high hardware cost and greater complexity. In addition, the tag bit selection, crossbar and enable signal generation incur high delay. In contrast, because the proposed mechanism overlaps its comparison with the bank decoder time, the delay introduced by the additional comparison gates is effectively hidden in comparison with a monolithic TLB. The proposed structure also offers low hardware cost and a simple hardware control mechanism.
A mostly no machine (MNM) [14] is accessed either in parallel with the level 1 cache or only after a level 1 cache miss. Accessing both the level 1 cache and the MNM in parallel incurs high power consumption because the level 1 cache consumes power even when the access hits in the MNM. Conversely, accessing the MNM only after a level 1 cache miss incurs a long access time.
Other TLB studies for low power consumption address memory cell redesign, such as modifying the CAM cell [2], using a low power RAM [15] and voltage reduction [16]. The work by Juan et al. [2] proposes modifying the CAM cell by adding another transistor in the discharge path. With the modified cell, the control line can be used to precharge the match line without pulling the bit lines to zero. This method suffers from high hardware cost, greater complexity and lower performance. The work by Itoh et al. [15] proposes low-power circuit design techniques, such as pulsed word-line and sense circuitry. With these schemes, the access circuitry is enabled only long enough to ensure reliable reading and writing of memory cells. Work by Liu and Svensson [16] proposes a method of supporting a lower supply voltage in designing memory systems. Supply voltage is one of the most important parameters controlling CMOS power consumption. Finally, virtually tagged caches [17] may be used in embedded processors. For example, the StrongARM SA-1100 [18] is designed as a high-performance low-power processor for embedded applications. This processor is based on a virtually indexed and virtually tagged cache [17]. The TLB is thus accessed only if a cache miss occurs. While using a virtual cache achieves lower TLB power consumption for CPU traffic to memory, it requires reverse translation for all DMA I/O traffic to check for invalidation. Conventional general purpose processors bridge this problem using multiple levels of cache, with the virtual-to-physical translation taking place between the L1 and L2 caches. For various reasons (e.g. cost, power, size), embedded processors typically employ just one level of cache, and map it physically to reduce the overhead for I/O-intensive applications.
Selective block buffering retains the advantages of the earlier mechanisms and also overcomes the weaknesses of the conventional filter and buffering mechanisms. If the main TLB structure is a banked memory, a sub-bank decoder is required, which adds to the latency. Our scheme uses this extra delay by overlapping it with a simple check of two bits of the tag. Using a two-bit comparison during sub-bank decoding avoids the need to always access the full set of tags in parallel with the normal L1 TLB as was done in the original line buffering mechanism [3]. Another advantage is that we do not have to access both the block buffer and the main TLB in the same cycle, which is becoming increasingly difficult to achieve as clock rates have increased. In effect, our approach pipelines a partial check of the tag and the main TLB access, which makes it more compatible with higher frequency designs. Furthermore, the partial tag check enables our TLB to avoid a significant fraction of the two-cycle accesses that were seen in earlier designs, which reduces the performance penalty incurred. To summarise, selective block buffering is a simple mechanism with low hardware cost and greater compatibility with high-frequency designs that preserves the traditional physical cache mapping of embedded processors. Our experimental results show that it succeeds in achieving higher performance and lower power consumption.
Selective block buffering TLB
In this Section we present the operational model of our selective block buffering TLB. It should be kept in mind that the goal of this design is to reduce dynamic power consumption while retaining both high performance and a memory hierarchy that is suited to embedded processor designs.
The tag memory space in a fully associative TLB is implemented using a group of content addressable memories (CAMs), which have additional transistors that enable the memory cells to perform parallel comparisons of all the tag entries with a tag from a memory reference. If the tag in any one entry is matched with the input tag placed on the bit lines, its corresponding match line remains high, all other match lines are pulled low, and the selected match line activates the associated word line of the SRAM. Thus its corresponding PTE (page table entry) information is read out from the data array. The structure of the fully associative TLB precludes the need for any external comparison logic or multiplexers, but its access time is longer than that of other organisations because the tag comparison cannot be simultaneously performed with reading the data from SRAM. In addition, for each access to the CAM, all match lines must be precharged high, and all match lines that do not produce a match signal must then be discharged. These precharge and discharge operations are responsible for a significant fraction of the TLB's energy dissipation. Even so, with 0.13 µm technology, the fully associative structure has a shorter access time and lower power consumption than a 4-way set-associative structure for memories in the range of sizes that are typical of a TLB.
With fully associative TLB structures, power consumption tends to increase abruptly as the number of TLB entries increases beyond 64 [1] . To keep power consumption low, the total number of TLB tags that are compared at once should clearly be smaller than 64, and preferably smaller than 32. However, higher performance can be achieved if more TLB entries are provided. We would like to allow as many entries as possible, while keeping the number of tags accessed together to fewer than 64. This is done by dividing the entire TLB space into separately accessed sub-TLBs. In our preliminary exploration of the design space, we simulated several configurations and determined that the most effective number of sub-TLBs for our benchmarks is four. Figure 1 illustrates the organisation of the selective block buffering TLB with its dynamic searching operation. As shown in Fig. 1 , the selective block buffering TLB is constructed as four banks, and each bank consists of a main TLB and its associated block tag buffer and block data buffer. We refer to the combination of the block tag buffer and the block data buffer as simply the block buffer. A block buffer is located above its associated main TLB in the hierarchy. To reduce power consumption, the two low-order address bits of the tag for any given VPN (virtual page number) are used to select a bank. The tag buffer associated with each bank stores a tag value for the most recently accessed VPN belonging to its corresponding bank module.
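The bank selection described above can be sketched as follows. This is an illustration only: it assumes that the "two low-order address bits" are bits 0 and 1 of the VPN, and the helper names `bank_of` and `bank_histogram` are hypothetical.

```python
NUM_BANKS = 4  # four sub-TLBs, as in the organisation described above


def bank_of(vpn):
    """Select a bank from the two low-order bits of the VPN (assumed bits 0-1)."""
    return vpn & (NUM_BANKS - 1)


def bank_histogram(vpns):
    """Count how many of the given VPNs map to each bank."""
    counts = [0] * NUM_BANKS
    for vpn in vpns:
        counts[bank_of(vpn)] += 1
    return counts
```

For a run of consecutive pages, for example `bank_histogram(range(0x400, 0x408))`, each bank receives the same number of pages, so only one quarter of the tags need to be searched for any single access.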
A two-bit comparator compares two bits of the VPN tag in the buffer with two bits of a newly generated VPN. This comparator consists of two XORs and one NAND gate. According to [19], when it is based on 0.13 µm technology, internal gate delay times of inverter, NAND and XOR are about 0.011 ns, 0.016 ns and 0.03 ns, respectively. Therefore, a two-bit decoder takes about 0.03 ns and a two-bit comparator takes about 0.05 ns. However, according to the results of CACTI simulation [20], the access time for a fully associative TLB with 128 entries is 1.18 ns and a TLB with 32 entries (e.g. a four-bank structure) is 1.12 ns. Access time for the tag comparison in a fully associative 64-entry TLB is 1.15 ns and for a 16-entry TLB (e.g. four bank structure) it is 1.09 ns. In all of these cases, the difference in access time between the monolithic TLB and the sub-bank TLB is over 0.06 ns. Therefore, the delay introduced by the additional comparison gates is effectively hidden in comparison with a monolithic TLB. However, we must also compare our access time with block buffering and filtering, which are also able to reduce access time. As we will show, the buffer hit ratio of the proposed TLB is > 90%, and the simplicity of the buffer enables it to respond faster than other schemes. Therefore, the proposed TLB can achieve a faster overall access time in spite of a small additional delay that is incurred for 10% of accesses.
Fig. 1 Selective block buffering TLB
In our simulations, the particular two bits that are used for the comparison are the fourth and fifth low-order bits of the VPN. If more bits are used for the comparison, then higher accuracy can be achieved, but overheads, such as the comparison time and hardware cost, then increase. The selective block buffering TLB is designed so that four two-bit comparators can operate in parallel for fast access. The two-bit comparison takes place during the bank selection period and thus can be almost completely hidden.
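The delay argument above can be checked with a short calculation. The gate delays and CACTI access times are the figures quoted in the text; the critical paths (one inverter plus one NAND for the decoder, one XOR followed by the NAND for the comparator) are our assumption about how the quoted totals were obtained, and the helper `hidden` is hypothetical.

```python
# Gate delays in ns quoted in the text for the 0.13 um process.
INV, NAND, XOR = 0.011, 0.016, 0.030

# Assumed critical paths (not stated explicitly in the text).
decoder_delay = INV + NAND      # 0.027 ns, quoted as about 0.03 ns
comparator_delay = XOR + NAND   # 0.046 ns, quoted as about 0.05 ns

# Access-time gap between monolithic and four-bank TLBs (CACTI figures).
slack_128 = 1.18 - 1.12   # 128 entries versus 32 entries per bank
slack_64 = 1.15 - 1.09    # 64 entries versus 16 entries per bank


def hidden(extra_delay, slack):
    """True if the extra gate delay fits within the banking slack."""
    return extra_delay <= slack
```

Under these assumptions, `hidden(comparator_delay, slack_128)` and `hidden(comparator_delay, slack_64)` both hold, which is consistent with the claim that the comparison delay is absorbed by the shorter access time of a sub-bank.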
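A minimal sketch of the two-bit filter follows, assuming that the "fourth and fifth low-order bits" are bits 3 and 4 of the VPN when counting from zero. The function name `two_bit_match` and the constants are illustrative, not part of the original design.

```python
CMP_SHIFT = 3    # assumption: fourth and fifth low-order bits = bits 3 and 4
CMP_MASK = 0b11  # two bits are compared


def two_bit_match(buffered_vpn, new_vpn):
    """Mimic two XORs feeding one combining gate: True when both bits agree."""
    diff = (buffered_vpn ^ new_vpn) >> CMP_SHIFT
    return (diff & CMP_MASK) == 0
```

A mismatch proves that the two VPNs differ, so the tag buffer can be bypassed safely. A match is only a prediction: two different VPNs can agree in these two bits, which is the source of the occasional two-cycle access.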
The different cases for the operational model are explained as follows.
Hit in two-bit comparator for a chosen bank module
When the CPU generates a virtual address, a subset of the address bits is used to select one of the four bank modules. That is, the two low-order address bits of the tag for any given VPN are used to select a main bank. These two bits indicate a 16 kB page boundary. If a 4 kB page is accessed, the probability of accessing the remaining three pages within that 16 kB group of pages is high. Therefore, using the two low-order address bits to select a bank can ensure that the sequential VPN accesses within the 16 kB group of pages are well balanced among the four sub-banks of the TLB. If a hit occurs at the two-bit comparator in the enabled bank module, then the block tag buffer is enabled and compared for a match of the entire tag field. If the VPN in the tag buffer and the newly generated VPN are identical, the PPN (physical page number) stored in the corresponding block data buffer is sent to the cache and compared with the tag bits of the cache. If the VPN in the tag buffer differs from the generated VPN, the cache tag comparison is squashed at the tag buffer and its corresponding TLB sub-bank is accessed for a match during the next cycle. Also during that cycle, the block tag buffer is updated with the generated VPN in order to store the most recently referenced VPN. If a requested page is found in the TLB sub-bank, its action is the same as for a conventional TLB hit. If the requested page misses in the sub-bank, the OS invokes its miss-handling service.
Miss in two-bit comparator for a chosen bank module
If a miss occurs at the two-bit comparator, it means that the VPN is definitely not in the tag buffer. Thus, the tag buffer comparison can be skipped. Instead, the corresponding TLB sub-bank is immediately searched in the first cycle, and the tag buffer is simultaneously updated with the new VPN. The combination of a rapid, very-low-power test for the most recent VPN, with the ability to switch to sub-bank search without delay in most cases, results in significant power savings and minimal loss of performance. As we show in the following Section, there are enough accesses to the most recent VPN to justify the use of the tag buffer for power reduction, and the number of two-cycle accesses is sufficiently minimised by the two-bit comparison that performance is only slightly reduced.
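The two cases above can be combined into a small behavioural model. This is a sketch under stated assumptions, not the simulator used in this work: the bank is selected by VPN bits 0 and 1, the comparator uses bits 3 and 4, the banks are unbounded sets with no replacement, and the class name `SelectiveTLB` and the outcome labels are hypothetical. The returned cycle count covers the TLB search only and excludes the miss penalty.

```python
NUM_BANKS = 4
CMP_SHIFT, CMP_MASK = 3, 0b11  # assumed comparator bit positions


class SelectiveTLB:
    """Toy model of the selective block buffering lookup (not cycle-accurate)."""

    def __init__(self):
        self.tag_buffer = [None] * NUM_BANKS             # latest VPN per bank
        self.banks = [set() for _ in range(NUM_BANKS)]   # unbounded, for brevity

    def lookup(self, vpn):
        """Return (outcome, search cycles) for one access."""
        b = vpn & (NUM_BANKS - 1)
        buffered = self.tag_buffer[b]
        two_bit_hit = (buffered is not None and
                       (((buffered ^ vpn) >> CMP_SHIFT) & CMP_MASK) == 0)
        if two_bit_hit and buffered == vpn:
            return "buffer_hit", 1           # served by the block buffer
        # A wrong prediction costs one extra cycle before the bank search.
        cycles = 2 if two_bit_hit else 1
        self.tag_buffer[b] = vpn             # remember the most recent VPN
        if vpn in self.banks[b]:
            return "bank_hit", cycles
        self.banks[b].add(vpn)               # refill once the miss is handled
        return "tlb_miss", cycles
```

Only the case in which the two bits match but the full tag does not takes two cycles; every other access completes its search in one cycle, either in the block buffer or directly in the sub-bank.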
In general, the LRU (least recently used) replacement policy produces the best miss rates since it minimises conflicts. Unfortunately, the cost of implementing this policy in hardware is high (in the context of an embedded processor), so we have used the FIFO (first-in-first-out) replacement policy for evaluating the proposed TLB. The flow chart for the proposed selective block buffering TLB management is shown in Fig. 2 and follows the cases described above.
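To make the replacement policy concrete, a FIFO-managed bank can be sketched as follows. The class name `FifoBank`, its methods and the default capacity of 16 entries are assumptions chosen for illustration; they are not taken from the simulator used in this work.

```python
from collections import deque


class FifoBank:
    """Fully associative bank with FIFO replacement (capacity is an assumption)."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.order = deque()   # VPNs in arrival order, oldest on the left
        self.entries = {}      # VPN -> PPN

    def lookup(self, vpn):
        # A hit does not reorder anything; this is what separates FIFO from LRU.
        return self.entries.get(vpn)

    def insert(self, vpn, ppn):
        """Insert a translation and return the evicted VPN, if any."""
        if vpn in self.entries:
            self.entries[vpn] = ppn
            return None
        victim = None
        if len(self.order) == self.capacity:
            victim = self.order.popleft()   # evict the oldest entry
            del self.entries[victim]
        self.order.append(vpn)
        self.entries[vpn] = ppn
        return victim
```

Because hits never update the ordering, the hardware needs only an insertion pointer per bank rather than per-entry age information, which is the cost saving that motivates FIFO here.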
Performance evaluation
Our simulation environment, performance metrics and power consumption analysis are presented in this Section. The benchmarks used in the trace-driven simulation are taken from MiBench [21]. Four performance metrics, i.e. miss ratio, average memory access time, power consumption and Energy-Delay product, are used to evaluate and compare the proposed TLB system with other approaches. Only data references are collected and used for the simulation. The Dinero IV [11] and CACTI simulators [12, 20] were modified to simulate the proposed TLB system. The basic parameters for the simulation are presented in Table 1. These parameters are based on the values used for common 32-bit embedded processors (e.g. Hitachi SH4 or ARM920T).
Accuracy and overhead of selective searching operation
Many preliminary simulations were performed to explore the design space and establish the parameters of the design. For example, the proposed TLB uses two particular bits for initially checking whether the VPN in the tag buffer and the generated VPN are potentially the same. Our simulations showed that when the low order fourth and fifth bits of any given VPN are compared, it provides the most significant gain with the least overhead. Because of the number of variations explored, we do not present simulations of the different configurations, but instead focus on simulations that enable analysis of the performance and power saving that we achieve in comparison to prior research.
The one aspect of our design that has the potential to cause a loss of performance is that an incorrect prediction by the two-bit comparator can add an extra cycle to the TLB search. We refer to this as two-cycle search overhead, and it is shown in Fig. 3. In this Figure, two other TLB structures that are also subject to two-cycle overhead are compared with our design. The first one is a filter-TLB, constructed as a small TLB of 4 entries and an L1 TLB with 64 entries. The second one is a plain 4-way banked-TLB structure with four associated block buffers. Figure 3 shows that the percentages of two-cycle accesses for the filter-TLB, the bank-TLB with block buffering and our TLB turn out to be 5%, 14% and 1%, respectively. Thus, according to the simulation results, our scheme achieves the least overhead in comparison with the other hierarchical TLB structures. Figure 4 shows the percentage of the TLB hits that were found in the tag buffers versus the main banks in our design. The tag buffers account for over 90% of the hits in most benchmarks.
Clearly, significant amounts of power can be saved by avoiding access to the TLB sub-bank 90% of the time, and instead using the much lower power block buffer logic. Because the two-cycle overhead of our design turns out to be negligible, compared with other hierarchical structures, we also avoid the pitfall of giving up performance in order to save power.
Miss ratio and average memory access time
In this Section we compare three different TLB structures in terms of miss ratio and average memory access time. Generally, the more meaningful measure to evaluate the performance of any given memory-hierarchy is the average memory access time:
$$\text{average memory access time} = \text{hit time} + \text{miss rate} \times \text{miss penalty} \qquad (1)$$
Here, hit time is the time to process a hit in the TLB and miss penalty is the additional time for miss handling. We do not consider page faults. CACTI circuit simulations [12, 20] of the fully associative TLB, the small TLB of the filter-TLB, and the banks of a banked TLB show that accessing the fully associative TLB takes more than a single cycle, while the other structures can be accessed in one cycle. However, for our performance evaluations, we simply assume that all of these structures operate in one cycle. Figures 5 and 6 show the average miss ratio and the average memory access time, respectively, for our design and the conventional TLB structures. In Fig. 5, most of the TLB structures can be seen to have similar average miss ratios. However, in terms of the average memory access time, the filter-TLB and the bank-TLB with block buffering show greater performance degradation due to a large number of two-cycle accesses.
Fig. 3 Two-cycle access overhead
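The effect of two-cycle accesses on the average memory access time expression above can be illustrated with a short calculation. The two-cycle fractions are those reported for Fig. 3; the miss rate and miss penalty are hypothetical placeholders, and treating a two-cycle access as exactly one extra cycle of hit time is our simplifying assumption.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit time + miss rate * penalty."""
    return hit_time + miss_rate * miss_penalty


def effective_hit_time(two_cycle_fraction, base_cycles=1):
    """Assumption: a two-cycle access adds exactly one cycle to the hit time."""
    return base_cycles + two_cycle_fraction


# Two-cycle fractions reported in the text; the miss rate and penalty are
# hypothetical values chosen only to make the example concrete.
TWO_CYCLE = {"filter": 0.05, "banked_block_buffer": 0.14, "proposed": 0.01}
MISS_RATE, MISS_PENALTY = 0.01, 30

results = {name: amat(effective_hit_time(f), MISS_RATE, MISS_PENALTY)
           for name, f in TWO_CYCLE.items()}
```

With identical miss behaviour, the ordering of the three structures is determined entirely by their two-cycle fractions, which is why similar miss ratios can still lead to different average access times.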
Comparison of TLB power consumption
In this Section we examine the impact of our architecture-level technique for reducing power consumption. Voltage scaling and specialised circuit techniques have been the main strategies for low power design, and these will remain important in the future. Unfortunately, these techniques alone are not sufficient for optimal power reduction. That is, higher-level strategies for reducing power consumption are increasingly crucial. In addition to lower-level circuit techniques, architecture techniques can have a major role in creating power-efficient computer systems. In this paper, our focus is on applying our approach to reducing power in the L1 data TLB, where it works especially well. Of course, the same mechanism can be adapted for instruction TLBs as well as data and instruction caches, but we leave the exploration of its effectiveness in these other applications for future research. Because all of the entries are searched with every memory access in a monolithic fully associative TLB, one might expect that it would be the worst structure in terms of power consumption, but this is true mainly when it has 128 entries or more, and to some extent with 64 entries. Figure 7 shows the energy dissipation for a TLB access, for various TLB configurations.
The fully associative TLB has less power consumption than a set associative TLB when the number of entries is small. This is because the set associative TLB is constructed with more sense amplifiers than the fully associative TLB and these have high power consumption. For a 128-entry fully associative TLB, the energy dissipated at the match line and the bit lines in the CAM reaches the point that it consumes significantly more power than a 2-way set associative TLB configuration.
The overall energy dissipation in the TLB can be divided into two parts, i.e. internal energy dissipation and external energy dissipation. The internal energy dissipation is the energy dissipation within the TLB system when the TLB is accessed. External energy dissipation includes driving the I/O pads for off-chip memory access and searching the data cache for the required page table entry. First, we evaluate power consumption for various TLB configurations using the CACTI simulator, which can calculate access times, cycle times and power consumption for many types of hardware caches [22-24].
The CACTI simulator was modified for TLB simulation in several ways. First, the number of bits allocated to a TLB entry is not variable but fixed by the PTE (page table entry) size. Throughout this research, the PTE size was assumed to be 4 bytes. Second, in the cache, the length of the offset field within an address is determined by the size of a cache block, but in the TLB, a predefined page size determines the length of the page offset field for a given virtual address. In the simulation, it is assumed that the page size is 4 kbytes but that the tag array has sufficient tag width to support a small page size of 1 kbyte. Additionally there is one valid bit and an 8-bit extension address for each set in the tag array. Finally, CACTI could not originally simulate small caches with fewer than eight sets because its decoder architecture is based on a 3-to-8 decoder block. Thus, we modified the decoder architecture to simulate a 2-to-4 decoder block, which enables use of a 4-entry TLB. Our results are based on 0.13 µm technology parameters, assuming a 1.3 V supply voltage. Table 2 shows the energy dissipation for each event corresponding to a TLB access. For a fully associative configuration, most of the power is consumed in the decode stage, where the tag comparison is performed. The significant difference in power consumption between a TLB with 16 entries and one with 32 entries comes from the growth in power consumed by the match line and bit lines in the CAM. Each entry of Table 2 shows the energy dissipation for a TLB read hit, a TLB read miss, and a TLB write.
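The entry widths implied by these modifications can be worked out as follows. This is an illustrative calculation only: it assumes a 32-bit virtual address (in line with the 32-bit embedded processors targeted here) and that the whole VPN is stored as the tag in a monolithic fully associative TLB. The function `tag_entry_bits` and its parameters are hypothetical.

```python
def tag_entry_bits(va_bits=32, min_page_bytes=1024, valid_bits=1, ext_bits=8):
    """Bits per tag-array entry for a fully associative TLB (illustrative)."""
    offset_bits = min_page_bytes.bit_length() - 1   # log2 for a power of two
    vpn_bits = va_bits - offset_bits                # tag must cover the VPN
    return vpn_bits + valid_bits + ext_bits


PTE_BITS = 4 * 8   # 4-byte page table entry held in the data array
```

Under these assumptions a tag entry sized for 1 kbyte pages is 22 + 1 + 8 = 31 bits wide, compared with 29 bits if only 4 kbyte pages had to be supported, while each data entry holds a 32-bit PTE.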
The average power consumption of the fully associative TLB is given by

$$P_{avg} = N_{hit} \times P_{hit} + N_{miss} \times P_{miss} \qquad (2)$$

where $N_{hit}$ and $N_{miss}$ are the ratios of hits and misses in the TLB or small buffer, and $P_{hit}$ and $P_{miss}$ are the power required to process a hit and a miss, respectively. $P_{miss}$ can be calculated as follows:

$$P_{miss} = P_{CAM} + P_{write} + P_{off} \qquad (3)$$

where $P_{CAM}$ is the power dissipated by all the entries when the tag part of the TLB is accessed, and $P_{write}$ is the power dissipated by the data memory and tag memory to update an entry on a miss. $P_{off}$ is the power dissipated by the cache and pads when a TLB miss occurs. Then $P_{off}$ can be calculated as follows:
where $P_{cache\_acc}$ is the power used to access a cache block, $M_{cache\_miss}$ is the cache miss ratio, $P_{cache\_write}$ is the power for a cache write operation on a cache miss, and $P_{pad}$ is the power dissipated at the on-chip pad slot. $P_{pad}$ can be calculated as follows [22-24]:
where $W_{data}$ and $W_{addr}$ are the numbers of bits for the data sent/returned and the address sent to the lower level of memory on a miss request, respectively. The capacitive load for off-chip destinations is assumed to be 20 pF [24]. Also, a 16 kB direct-mapped data cache with 32-byte block size is assumed, where the values of $W_{data}$ and $W_{addr}$ are also 32 bits. The basic parameters for the simulation are summarised in Table 3.
Figure 8 presents the dynamic energy dissipation of the different TLB structures compared to our design for the same set of benchmarks. The power consumption data for the selective block buffering TLB are obtained by considering all possible cases, such as the power consumed by the comparators, an additional multiplexer, and so on. These values are obtained from the CACTI model indirectly. The Figure shows that a filter-TLB or a banked-TLB with an associative block buffer is a good structure in terms of power consumption because the hit ratio at the block buffer exceeds 90%. However, their performance degradation tends to be significant because their small buffers are always compared before accessing the main TLB. In our design, the tag buffer is only compared when there is a hit in the two-bit comparators. As shown in this Figure, dynamic power consumption in the proposed TLB can be reduced by ~93% compared with a fully associative TLB, 87% with respect to a filter-TLB and 50% compared with a banked-TLB with block buffering.
Figure 9 shows the Energy-Delay product for the TLB alone. This metric provides a basis to identify a specific TLB configuration that offers the best balance of both energy and performance. Simulation results show that the Energy-Delay metric for the TLB is reduced by ~93%, 88% and 60% compared with a monolithic fully associative TLB, a filter-TLB and a banked-TLB with block buffering, respectively.
In conclusion, the proposed selective block buffering TLB offers the best result in terms of both performance and power consumption among all of the approaches compared.
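As a consistency check on the reported reductions, the delay ratio implied by each pair of energy and Energy-Delay figures can be derived from the definition of the product. The percentages below are the rounded values reported in the text, so the resulting ratios are indicative only; the function name `implied_delay_ratio` is hypothetical.

```python
def implied_delay_ratio(energy_reduction, ed_reduction):
    """Delay ratio (proposed / baseline) implied by two reported reductions.

    Uses ED_ratio = E_ratio * D_ratio, hence D_ratio = ED_ratio / E_ratio.
    """
    return (1.0 - ed_reduction) / (1.0 - energy_reduction)


# (energy reduction, Energy-Delay reduction) against each baseline.
REPORTED = {
    "fully_associative": (0.93, 0.93),
    "filter": (0.87, 0.88),
    "banked_block_buffer": (0.50, 0.60),
}
ratios = {k: implied_delay_ratio(e, ed) for k, (e, ed) in REPORTED.items()}
```

The implied ratios are about 1.0 against the fully associative TLB, about 0.92 against the filter-TLB and about 0.8 against the banked-TLB with block buffering. This agrees with the observations that the proposed TLB matches the access time of the fully associative TLB and that the other two structures lose time to two-cycle accesses.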
Comparison of total energy dissipation
The preceding Section shows only the dynamic power consumption and Energy-Delay product for the TLB itself. Here we examine the power savings with respect to the entire processor. As we noted in the Introduction, overall TLB power consumption varies from 9% (for the ARM TLB) to 17% (for all TLBs in a processor such as the StrongARM).
We used the SimpleScalar ARM power model [25, 26] with the ARM-ISA for our total-chip power evaluation. Basic processor parameters for the simulation are presented in Table 4. As processor technology advances, TLB size may grow into this range. Figures 10 and 11 show the normalised energy dissipation and normalised Energy-Delay product for the total chip configured with different TLB structures. Simulation results show that energy dissipation of the total chip is reduced by ~8% as a result of reducing dynamic power consumption in the TLB by 90-95%, which is just as we would expect, given the percentage of the chip's power that is dissipated in the simulated ARM TLB. The Energy × Delay product tends to show a similar effect. If the approach can also be adapted to caches, then the potential improvement is even greater.
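The expected whole-chip saving can be estimated directly from the TLB's share of chip power. The 9% share is the combined ARM920T data and instruction TLB figure quoted in the Introduction and the 90-95% range is the TLB reduction reported above; the helper `chip_energy_saving` is hypothetical and assumes that nothing other than the TLB energy changes.

```python
def chip_energy_saving(tlb_share, tlb_reduction):
    """Fraction of total chip energy saved when only the TLB energy shrinks."""
    return tlb_share * tlb_reduction


low = chip_energy_saving(0.09, 0.90)    # about 8.1% of chip energy
high = chip_energy_saving(0.09, 0.95)   # about 8.6% of chip energy
```

Both bounds round to roughly 8%, which matches the simulated total-chip reduction.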
Conclusion
To achieve high performance, recent TLB research for embedded processors tends to support many page entries via large TLB sizes, but in fully associative TLBs, all the entries are searched for every memory access. Because of this, they would be among the worst structures in terms of power consumption. When they have more than 64 or 128 entries, their power consumption is especially high. Therefore, to attain low power consumption, a banked-TLB was designed that divides one fully associative TLB space into four smaller, fully associative TLBs. To further reduce power consumption, a selective searching mechanism is applied in the proposed TLB to compensate for the weaknesses of the filter-TLB and the simple block-buffering TLB. The amount of power saved by the proposed TLB strongly depends on the filtering effect of the two-bit comparison that quickly selects between searching a main bank or its buffer. This selection avoids the need to search the buffer on every access, thereby saving power. It also reduces the frequency of two-cycle accesses, which reduces the performance penalty incurred by previous low-power designs. We showed that the average hit ratios of the block buffers and the main banks of the proposed TLB are 94% and 6%, respectively. Simulation results show that the average memory access time of the proposed TLB is almost equal to that of a conventional fully associative TLB. However, the dynamic power consumption of the proposed TLB is ~93% less than the fully associative TLB, 87% less than a filter-TLB and 50% less than a banked-TLB with block buffering. Thus, the Energy-Delay metric for our TLB is reduced by ~93%, 88% and 60% compared with a fully associative TLB, a filter-TLB and a banked-TLB with block buffering, respectively. Also, we show that energy dissipation for the whole chip is reduced by ~8% as a result of applying our technique to both TLBs. The Energy-Delay product tends to show a similar effect.
However, if the proposed mechanism is also adapted to the caches, the potential power savings could be much more significant.
References

