The most recently used (MRU) cache is a set-associative cache organization aimed at implementing associativity higher than 2. However, its access time is increased because the MRU information must be fetched before the sequential MRU (SMRU) cache can be accessed. In this paper, focusing on the SMRU cache with subblock placement, we propose an MRU cache scheme that separates the valid bits from the data memory and uses them to avoid unnecessary accesses to the memory banks. With this approach, the probability of front hits increases, which significantly improves the average access time relative to the SMRU cache without valid-bit assistant search, especially for large associativity and small subblock size.
Introduction
In the past decade, the speed of processors has increased rapidly due to the development of VLSI technology; however, the speed of main memory (DRAM) has not improved at the same rate. 1 Therefore, the cache memory plays an important role in reducing the speed gap between processor and main memory. Among cache organizations, a direct-mapped cache has a fast hit access time because the CPU can read data directly from the data bank without waiting for the tag check. However, the direct-mapped cache has a higher miss rate. On the other hand, a set-associative cache has a lower miss rate because a block from main memory can map into any cache block of one fixed set, but it needs higher hardware complexity and suffers from a longer hit access time. 2 To reduce the cache
MRU Caches
The MRU cache is a set-associative cache organization aimed at implementing associativity higher than 2. It uses a memory table to record which block is the MRU block for each set in the cache. When a set is referenced, the probability of finding the correct block location in this set on the first probe is very high. 10
In addition, the MRU cache has the following advantages: 11
(1) The MRU cache can easily implement the least recently used (LRU) replacement policy without much hardware.
(2) No swap operation is required for the MRU cache, whereas the HR-cache and the CA-cache require swapping hardware to ensure that the MRU block is always kept in the major location after each reference.
However, the MRU cache also has some disadvantages: 12
(1) The access cycle is lengthened because the MRU information must be fetched before accessing the MRU cache.
(2) In some extreme cases, such as when long sequences of consecutive addresses are referenced, a longer average access time may occur if a set contains more than two blocks.
Two MRU schemes have been proposed in previous research. One is the SMRU cache proposed by Kessler et al., 13 which probes each block in one set serially according to the search order stored in the MRU table. The other is the parallel MRU cache proposed by Chang et al., 14 which fetches the MRU table at the same time as it accesses the tag banks and data banks. If the first hit does not occur (MRU miss), the cache system falls back on the result of the tag check to select the correct block data. In this paper, we focus on the SMRU cache with subblock placement; the following subsections discuss the conventional SMRU cache and subblock placement.
Sequential MRU cache
Kessler's scheme 13 uses a sequential search to find the desired block in a set according to the content of the MRU table. The architecture of the SMRU cache is shown in Fig. 1. In the SMRU cache, both the tag memory and the data memory are a single bank, and only one comparator is required. The MRU table stores, for each set, the bits that represent the block numbers in search order, from MRU to LRU. For example, in a four-way set-associative MRU cache, if the MRU block list for one set is "01001110", the search order of the locations is 1, 0, 3, and 2. For each probe, the MRU block bits indicating the block location to be examined are taken from the MRU block list. These bits are then combined with the set bits of the main memory address to form the effective address for accessing the tag bank and data bank. Because a true LRU replacement policy is used, the cache system can maintain the MRU block list for each set. The operations of the SMRU cache are described as follows: 15
(1) When a set of the cache is referenced, the cache system fetches the MRU table to obtain the MRU block list, and the block bits of this list are then used to form the address of the tag bank and data bank. This operation must precede the access to the tag bank and data bank.
(2) According to the first block bits taken from the MRU block list, the MRU block location is probed first.
(3) The cache system checks the tag of the selected block location. If the first hit occurs, the block data are read out from the data bank as in a direct-mapped cache; however, two access cycles are required for the first probe.
(4) If no first hit occurs, the cache system continues to check the remaining blocks in this set, selecting the next block to probe from the MRU block list, until all tags of this set have been examined.
(5) When a miss occurs, the cache system takes more cycles to refill a new block from the lower-level memory and perform the replacement operation.
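The search order and the sequential probe described above can be illustrated with a minimal Python sketch. The function names, the tag values, and the list-based representation are assumptions made for illustration only; they are not part of Kessler's scheme, and the cycle counts of the real hardware are not modeled.

```python
def decode_mru_list(bits: str, ways: int) -> list[int]:
    """Split an MRU block list into block numbers, MRU first."""
    width = max(1, (ways - 1).bit_length())   # bits per block number
    if len(bits) != width * ways:
        raise ValueError("MRU block list has the wrong length")
    return [int(bits[i:i + width], 2) for i in range(0, len(bits), width)]


def smru_lookup(tags: list[int], order: list[int], wanted_tag: int):
    """Probe the blocks one by one in MRU order.

    Returns (block, probes); block is None on a miss.
    """
    for probes, block in enumerate(order, start=1):
        if tags[block] == wanted_tag:
            return block, probes
    return None, len(order)


def touch(order: list[int], block: int) -> list[int]:
    """True-LRU update: move the referenced block to the MRU position."""
    return [block] + [b for b in order if b != block]


order = decode_mru_list("01001110", ways=4)
print(order)                        # [1, 0, 3, 2]
tags = [0xA, 0xB, 0xC, 0xD]         # hypothetical tags of blocks 0..3
block, probes = smru_lookup(tags, order, wanted_tag=0xC)
print(block, probes)                # 2 4
print(touch(order, block))          # [2, 1, 0, 3]
```

In this sketch the desired block is the LRU block of the set, so all four locations are probed before the hit is found.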
Because of the sequential search and the prefetch of the MRU table, the SMRU cache has a longer average access time than other cache schemes designed to improve the conventional set-associative cache. However, the SMRU cache with high associativity can be used as the second-level cache in a two-level multiprocessor cache architecture to reduce memory interconnection traffic. 13 Moreover, because only one comparator is required, this cache scheme is well suited to the low-cost implementation of second-level cache chips with high associativity.
Subblock placement
Increasing the block size reduces the tag memory size for an on-chip cache design; 16 however, a large block size incurs a large miss penalty. Subblock placement, 3 which refills only a part of the block into the cache when a miss occurs, is a common approach to reducing the miss penalty. In this cache scheme (shown in Fig. 2), each data block is divided into several subblocks, and each subblock has a corresponding valid bit that indicates whether the subblock is present in the cache. Therefore, for a set-associative cache with subblock placement, each access must check the corresponding valid bits of all ways in addition to the tags of all ways. In this paper, subblock placement is applied to the SMRU cache.
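As an illustration of subblock placement, the following minimal Python sketch models one block with per-subblock valid bits. The class name, the block and subblock sizes, and the rule that a tag change invalidates all subblocks of the block are assumptions for illustration; the text above does not specify them.

```python
class Block:
    """One cache block divided into subblocks, each with its own valid bit."""

    def __init__(self, subblocks: int):
        self.tag = None
        self.valid = [False] * subblocks

    def hit(self, tag: int, sub: int) -> bool:
        # A hit needs both a tag match and a valid subblock.
        return self.tag == tag and self.valid[sub]

    def refill(self, tag: int, sub: int) -> None:
        if self.tag != tag:
            # Assumption: a new tag invalidates the old block's subblocks.
            self.tag = tag
            self.valid = [False] * len(self.valid)
        self.valid[sub] = True      # only the requested subblock is fetched


BLOCK_BYTES, SUBBLOCK_BYTES = 32, 8         # hypothetical sizes
block = Block(BLOCK_BYTES // SUBBLOCK_BYTES)
block.refill(tag=0x1F, sub=2)
print(block.hit(0x1F, 2), block.hit(0x1F, 0))   # True False
```

Only one subblock is transferred per miss, which is why the miss penalty shrinks while the tag memory stays the size required for the full block.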
The Proposed MRU Cache
In the design of sequential caches, a short probe time and a large number of front hits are the two important factors in achieving a low average access time. When an SMRU cache employs subblock placement to reduce its miss penalty, the valid bits can also be used to eliminate unnecessary probes in advance for each cache access, so that hits that would originally occur late in the search order become front hits. Based on this idea, a new SMRU cache with valid-bit assistant search (called the SMRU-V cache) is proposed to reduce the average access time.
Valid-bit assistant search
In the conventional SMRU cache, the search always proceeds from the MRU block to the LRU block one by one. Even if the subblock being probed is not present (i.e., its valid bit = "0"), the cache must finish checking the present block before probing the next one, so this probe is redundant. In our proposed SMRU-V cache, the search order is the same as that of the conventional SMRU cache. However, the valid bits of the accessed subblock for the different ways in one set are loaded into the control circuit as soon as the cache is accessed, and they are used to decide which subblocks need to be examined during the search process. For an n-way SMRU-V cache, the valid-bit assistant search algorithm is shown in Fig. 3, and Fig. 4 illustrates the two search approaches for the SMRU cache and the SMRU-V cache, respectively, using a four-way cache as an example. Suppose the search order is block 1, block 0, block 3, and block 2, and the hit block is block 2. For the SMRU cache, regardless of the contents of the valid bits, four probes are needed and the hit is the fourth hit. With the valid-bit assistant search algorithm, however, the proposed SMRU-V cache turns the original fourth hit on block 2 into the second hit, and thus requires only two probes. Consequently, for a cache with a small subblock size, this search algorithm achieves more front hits and removes many unnecessary probes of subblocks whose valid bits are "0" on a cache hit, and it also reduces the number of probes when a cache miss occurs.
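The difference between the two search approaches can be sketched in Python as follows. This is a minimal software model, not the algorithm of Fig. 3 itself; the function names, the tag values, and in particular the valid-bit pattern (blocks 1 and 2 valid, blocks 0 and 3 invalid) are assumptions chosen so that the example reproduces the probe counts stated above.

```python
def probe_order(order, valid, assisted):
    """Blocks that are actually probed, in MRU-to-LRU order."""
    if not assisted:
        return list(order)                     # SMRU: probe every block
    return [b for b in order if valid[b]]      # SMRU-V: skip invalid subblocks


def search(order, valid, tags, wanted_tag, assisted):
    """Return (block, probes); block is None on a miss."""
    probed = probe_order(order, valid, assisted)
    for probes, block in enumerate(probed, start=1):
        if tags[block] == wanted_tag and valid[block]:
            return block, probes
    return None, len(probed)


order = [1, 0, 3, 2]                    # search order from the MRU table
valid = [False, True, True, False]      # assumed valid bits of blocks 0..3
tags = [0xA, 0xB, 0xC, 0xD]             # hypothetical tags of blocks 0..3

print(search(order, valid, tags, 0xC, assisted=False))   # (2, 4)
print(search(order, valid, tags, 0xC, assisted=True))    # (2, 2)
print(search(order, valid, tags, 0xE, assisted=True))    # (None, 2)
```

The last call shows the miss case: with the assumed valid bits, a miss is detected after two probes instead of four.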
Architecture
The architecture of the SMRU-V cache, shown in Fig. 5, modifies only the data memory organization of the original SMRU cache: the valid bits of all subblocks are separated from the data memory bank and organized as a single n-bit valid-bit bank, in which each bit is the valid bit of the accessed subblock for one way. For each set, the bit order runs from the MRU way (MSB) to the LRU way (LSB). When the cache is referred to, all memory banks including the MRU 

Operations
The main difference between the operations of the SMRU cache and those of the proposed SMRU-V cache lies in the search process. The operations of the SMRU-V cache are described as follows:
(1) When a set of the cache is referenced, the cache system fetches the MRU table and the valid-bit bank, and the control circuit takes from the MRU block list the first MRU block bits whose corresponding valid bit is "1" to form the address of the tag bank and data bank for the first MRU block.
(2) The cache system checks the tag of the first MRU block location selected by the first MRU block bits. Simultaneously, these bits are also used to speculatively select the data of the first MRU block location.
(3) If the first hit occurs, the desired block data are read out directly from one of the n data banks, as in a direct-mapped cache; however, two access cycles are required for the first probe.
(4) If the first hit does not occur, the control circuit selects the next MRU block from the MRU block list in order, considering only blocks whose valid bits are "1", and checks the remaining blocks in this set until all of these tags have been examined. If a hit is found, the last selected MRU block bits are used to select the desired block data.
(5) When a miss occurs, the cache system takes more cycles to refill a new subblock from the lower-level memory and perform the replacement operation. At the same time, the valid-bit bank is updated.
In our proposed cache architecture, owing to the valid-bit assistant search and the support of the LRU replacement algorithm, most of the additional front hits are first hits. Therefore, the first hit rate is higher than that of the SMRU cache without valid-bit assistant search.
Overheads
In the proposed SMRU-V cache, the valid-bit bank is merely separated from the original data memory, so this scheme adds no further memory device to an SMRU cache with subblock placement. Furthermore, because the MRU table and the valid-bit bank are accessed concurrently in the first cycle, there is almost no extra access time. To implement the valid-bit assistant search, the control logic (for example, for a four-way cache) does require extra hardware components in place of the two-bit binary counter used in the control logic of the conventional SMRU cache; it can be implemented as the circuit shown in Fig. 6. 17 In this circuit, at each search clock the priority encoder selects the desired MRU block bits according to the valid bits that are "1", in order from the MRU block to the LRU block, and the decoder clears the priority input that has just been searched after each clock. The additional delay relative to the SMRU cache is caused by this search decision circuit. It can be neglected because it amounts to only two levels of logic gate propagation (through the priority encoder), which is small compared with the access time of the memory banks. Therefore, without much additional delay, the proposed SMRU-V cache retains the low-cost implementation of the conventional SMRU cache.
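The behavior of the search decision circuit can be mimicked in software. The sketch below is a behavioral model only, under the assumption that the valid bits are held as an n-bit integer ordered from the MRU way (MSB) to the LRU way (LSB); the function names and the example values are hypothetical, and no timing of the actual circuit in Fig. 6 is modeled.

```python
def priority_encode(pending: int, n: int):
    """Position of the highest-priority set bit (0 = MRU way), or None."""
    for pos in range(n):
        if pending & (1 << (n - 1 - pos)):
            return pos
    return None


def search_sequence(valid_bits: int, mru_order: list[int]) -> list[int]:
    """Block numbers selected on successive search clocks."""
    n = len(mru_order)
    pending = valid_bits
    selected = []
    while (pos := priority_encode(pending, n)) is not None:
        selected.append(mru_order[pos])
        pending &= ~(1 << (n - 1 - pos))   # decoder clears the searched input
    return selected


# Valid bits 1001 in MRU-to-LRU order, with search order 1, 0, 3, 2:
print(search_sequence(0b1001, [1, 0, 3, 2]))   # [1, 2]
```

Each loop iteration corresponds to one search clock: the encoder picks the next valid way, and clearing its input makes the following valid way the highest priority on the next clock.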
Comparison
An alternative sequential cache scheme, the sequential multicolumn cache (SMC cache), uses a multiple MRU block technique to provide a ring search order with multiple entry points for different references to the same set. 1, 12 For example, if the major location is 2, the search order for a four-way cache is location 2, location 3, location 0, and location 1. To guarantee that the MRU block always remains in the major location after each reference, this cache scheme requires swapping hardware and an index table, unlike the SMRU cache, which needs no swap operation. The index table stores a group of bit vectors indicating the other selected locations of each way in one set, and thus its memory size is S × n × n for an n-way cache, where S denotes the number of sets.
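The ring search order and the index table size can be checked with a short Python sketch. The function names and the set count of 256 are hypothetical, and the size is assumed to be expressed in bits, which the text does not state explicitly.

```python
def ring_search_order(major: int, ways: int) -> list[int]:
    """Ring search order that starts at the major location."""
    return [(major + i) % ways for i in range(ways)]


def index_table_size(sets: int, ways: int) -> int:
    """Index table size S x n x n (assumed to be in bits)."""
    return sets * ways * ways


print(ring_search_order(2, 4))     # [2, 3, 0, 1]
print(index_table_size(256, 4))    # 4096
```

Because the size grows with the square of the associativity, the index table becomes a noticeable cost at the high associativities considered in this paper.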
The main benefit of the SMC cache is that it needs only one cycle to probe the MRU block in the major location. However, the swap operation also requires one cycle on a cache miss or a nonfirst hit. Because of the ring search order, the nonfirst hits are distributed roughly evenly over the remaining search positions, which makes the access time of nonfirst hits larger than that of the SMRU-V cache, where more hits occur at the front. From the hardware view, because the SMC cache requires a control mechanism to fetch the corresponding bit vector from the index table upon a reference, maintain the index table when a cache miss
at the fixed associativity = 32, the average access time of the SMRU cache decreases at first due to the reduction of miss penalty. However, the miss rate increases and the first hit rate decreases as the subblock size decreases, and thus the average access time increases again until the subblock size = 4 bytes. When the subblock size is less than 4 bytes, the average access time increases only slowly because the reduction of miss penalty becomes more pronounced. In contrast to this varying trend of the SMRU cache, the average access time of the proposed SMRU-V cache is always less than that of the SMRU cache as the subblock size decreases. In realistic applications, a very small subblock size is not practical, and thus the commonly used subblock size is about 4 bytes or 8 bytes. 
Improvement in average access time
The simulation results shown in Fig. 9 indicate the improvement over the SMRU cache in average access time of the SMRU-V cache, where the improved rate in terms of the average access time (IMR TAS ) is defined as
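A form consistent with this description, assuming the usual relative-reduction definition and writing T_SMRU and T_SMRU-V for the average access times of the two caches (these symbols are introduced here for illustration and are not taken from the original), would be:

```latex
\[
\mathrm{IMR}_{TAS} = \frac{T_{\mathrm{SMRU}} - T_{\mathrm{SMRU\text{-}V}}}{T_{\mathrm{SMRU}}} \times 100\%
\]
```

Under this reading, a positive value means that the SMRU-V cache has the shorter average access time.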
Consequently, except for the subblock size = 32 bytes, which means that no subblock placement is used, the IMR TAS of the SMRU-V cache increases as the associativity increases at a fixed subblock size, or as the subblock size decreases at a fixed associativity. When the associativity > 4, the IMR TAS increases significantly. In particular, for the associativity = 32 and the subblock size = 4 bytes, the IMR TAS reaches about 40%.
Conclusions
In this paper, a valid-bit assistant search algorithm is proposed to improve the average access time of the conventional SMRU cache with subblock placement. Without adding much hardware to the proposed MRU cache scheme, the valid-bit assistance eliminates many unnecessary probes and yields more first hits. The simulation results show that the improvement in average access time reaches about 25% on average when the associativity > 4 and the subblock size = 4 bytes; even at the associativity = 4, the proposed SMRU-V cache still achieves an improvement of about 7%. Therefore, the proposed SMRU-V cache is best suited to large associativity and small subblock size, where the improvement in average access time is most significant. Moreover, being a second-level cache,
