By studying the behaviour of general programmes, it was observed that over 50% of bytes in a data cache are zero-valued. To reduce this waste of zero-valued spaces in a data cache, an overlapped cache scheme, which allows one cache line entry to hold up to two cache lines, is proposed. Experimental results show that, for SPEC2000 benchmarks, the proposed design reduces cache misses by 29% on average over a conventional direct-mapped cache.
Introduction: Because the cache hit rate is crucial to the performance of computing systems, several intelligent cache designs with low miss rates have been proposed. A victim cache (VC) [1] was proposed to improve the performance of a direct-mapped cache (DMC). A DMC with a fully associative victim cache of a few entries (generally, 1-8 lines) can greatly reduce cache misses by holding cache lines evicted from the DMC. However, owing to its fully associative structure, a VC cannot easily be scaled up in size. A frequent value cache (FVC) [2] is another approach that improves cache performance by augmenting a DMC with a small frequent value-centric cache, which is dedicated to holding only frequently occurring values in encoded form. Although this approach reduces cache misses significantly, it requires run-time profiling, so the logic circuit that finds frequently accessed values is expensive to implement. By profiling SPEC2000 benchmark programmes, we observed that over 50% of bytes in a data cache are zero-valued, and found that the majority of word values in a data cache can be represented in a half word or less. Based on these cache characteristics, we propose a novel cache design, the overlapped cache (OVLPC), which enables up to two cache lines to be represented by one cache line entry in order to reduce cache misses and increase space usage.
OVLPC operation: To hold two cache lines in one cache line entry, we implemented hardware logic that compresses a word value when it is written to the cache and restores it when it is read. When values are written to the cache, zero-valued half words are removed unless the effective size of the value is bigger than a half word. For example, the effective size of the hexadecimal value 0x000007CA is at most 12 bits (three hexadecimal digits); thus, only the half word value 0x07CA is written to the cache. At compression, OVLPC stores the effective size information of the corresponding word as a flag that is used at restoration. To reduce the waste of zero-valued space in a data cache effectively, OVLPC needs an overlapping technique that avoids placing the least significant bytes of two different cache lines in the same position. To achieve this overlapping simply, in this Letter we switch the byte order between little-endian and big-endian at each cache miss. To illustrate these operations, Fig. 1 shows how the contents of one cache line entry change when a sequence of three operations is applied to it. In this example, we assume that the size of a cache line is 8 bytes (i.e. two words). The cache line that occupied the entry before the most recent update is called the previous line, whereas the cache line that occupies it after the update is called the recent line. Since the byte ordering is switched from little-endian to big-endian and vice versa, as mentioned above, the previous and the recent lines have opposite byte orders at any time. The proposed scheme augments a typical cache with VTAG, FLAG L/B, and R. TAG holds the tag value of the recent line, and VTAG holds that of the previous line. FLAG L describes word values stored in little-endian form, whereas FLAG B describes word values stored in big-endian form.
Both flags, FLAG L and FLAG B, can take four different kinds of values: 00 for a zero-valued word, 01 for a value whose effective size is equal to or smaller than a half word, 10 for an invalid word, and 11 for a value whose effective size is bigger than a half word. R is used for indicating a stored form of the recent line: 0 for little-endian and 1 for big-endian. In the following, the notation 'Line ABC' is used for indicating a cache line whose tag value is 'ABC'.
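The compaction and restoration of a single word can be sketched in software as follows. This is only an illustrative model, not the hardware logic of the Letter: the function names (classify, store, load) are hypothetical, and the byte placement is an assumption, namely that a little-endian half word occupies bytes 0-1 of a 4-byte slot and a big-endian half word occupies bytes 2-3, so that the least significant bytes of two lines never share a position.

```python
# Flag codes as listed in the text.
ZERO, HALF, INVALID, FULL = 0b00, 0b01, 0b10, 0b11

def classify(word: int) -> int:
    """Return the 2-bit flag for a 32-bit word (never INVALID)."""
    word &= 0xFFFFFFFF
    if word == 0:
        return ZERO
    return HALF if word <= 0xFFFF else FULL

def store(slot: bytearray, word: int, big_endian: bool) -> int:
    """Write `word` into a 4-byte slot in compacted form; return its flag.

    Assumed layout: little-endian half words use bytes 0-1 of the slot,
    big-endian half words use bytes 2-3. A zero word writes nothing.
    """
    word &= 0xFFFFFFFF
    flag = classify(word)
    order = "big" if big_endian else "little"
    if flag == FULL:
        slot[0:4] = word.to_bytes(4, order)
    elif flag == HALF:
        half = word.to_bytes(2, order)
        if big_endian:
            slot[2:4] = half
        else:
            slot[0:2] = half
    return flag

def load(slot, flag: int, big_endian: bool):
    """Restore a 32-bit word from a slot, or None if the flag is INVALID."""
    if flag == INVALID:
        return None
    if flag == ZERO:
        return 0
    order = "big" if big_endian else "little"
    if flag == FULL:
        return int.from_bytes(slot[0:4], order)
    half = slot[2:4] if big_endian else slot[0:2]
    return int.from_bytes(half, order)

# Two half-word values from different lines sharing one 4-byte slot.
slot = bytearray(4)
flag_l = store(slot, 0x000007CA, big_endian=False)  # occupies bytes 0-1
flag_b = store(slot, 0x00001234, big_endian=True)   # occupies bytes 2-3
print(slot.hex(), hex(load(slot, flag_l, False)), hex(load(slot, flag_b, True)))
```

Under the assumed layout, both values remain readable from the same slot because each was compacted to a half word; a word flagged 11 would instead occupy all four bytes.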
Assume that Line A59 consists of two words as in Fig. 1a. The byte ordering of this cache line at the start is little-endian; thus, the R value is set to zero and the corresponding flags are stored in FLAG L. Because the effective size of the first word is smaller than a half word, the value 01 is written to the first entry of FLAG L. Likewise, the value 11 is written to the second entry because the effective size of the second word is bigger than a half word. After a cache miss caused by operation α in Fig. 1, the R value is switched. Because the second word of Line A59 is overwritten, that word must be invalidated by changing the second entry of FLAG L to 10. Line A59 thus becomes the previous line and its tag value is moved to VTAG (Fig. 1b). Owing to a write miss on the previous line (operation β), Line A59 is fetched again and becomes the recent line (Fig. 1c). Note that OVLPC only considers writes to the recent line: we greatly simplify the OVLPC architecture by sacrificing the write opportunity on the previous line, i.e. our scheme always results in a miss when a write operation occurs on the previous line. A write hit occurs with operation γ. Because the effective size of the incoming word is 4 bytes, the first word of the previous line is destroyed; namely, the first entry of FLAG B is set to 10 and that of FLAG L is set to 11 (Fig. 1d).
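The flag updates in this walk-through can be traced with a small model of one cache line entry. The sketch below tracks only TAG, VTAG, R, and the flags. The class and method names, the second tag 'B20', and all word values are hypothetical, since the contents of Fig. 1 are not reproduced here. The rule that a previous-line word survives when the recent-line word is zero is also an assumption; the text only states the cases in which a word is overwritten.

```python
ZERO, HALF, INVALID, FULL = 0b00, 0b01, 0b10, 0b11

def flag_of(word):
    """2-bit flag of a 32-bit word, using the codes given in the text."""
    word &= 0xFFFFFFFF
    if word == 0:
        return ZERO
    return HALF if word <= 0xFFFF else FULL

def survives(prev_flag, recent_flag):
    """Can a previous-line word stay readable beside the recent-line word?"""
    if prev_flag == INVALID:
        return False
    if prev_flag == ZERO or recent_flag == ZERO:
        return True  # assumption: a zero word occupies no bytes in the slot
    return prev_flag == HALF and recent_flag == HALF

class Entry:
    """One OVLPC cache line entry, modelling tags, R, and flags only."""

    def __init__(self, words=2):
        self.tag = None
        self.vtag = None
        self.r = 1  # chosen so that the first fill is little-endian (R = 0)
        self.flag = [[INVALID] * words, [INVALID] * words]  # [FLAG L, FLAG B]

    def _update_previous(self, i):
        prev = self.flag[1 - self.r]
        if not survives(prev[i], self.flag[self.r][i]):
            prev[i] = INVALID

    def fill(self, tag, line):
        """Cache miss: switch R, demote the recent line, store the new line."""
        self.vtag, self.tag = self.tag, tag
        self.r ^= 1
        self.flag[self.r] = [flag_of(w) for w in line]
        for i in range(len(line)):
            self._update_previous(i)

    def write_hit(self, i, word):
        """Write to word i of the recent line; may destroy the previous word."""
        self.flag[self.r][i] = flag_of(word)
        self._update_previous(i)

e = Entry()
e.fill("A59", [0x000007CA, 0x12345678])  # initial state: FLAG L = 01, 11
e.fill("B20", [0x00000011, 0x00000022])  # miss: second entry of FLAG L -> 10
e.fill("A59", [0x000007CA, 0x12345678])  # write miss refetches Line A59
e.write_hit(0, 0xDEADBEEF)               # 4-byte write: FLAG B[0] -> 10
print(e.tag, e.vtag, e.r, e.flag)
```

The final state has the first entry of FLAG L at 11 and the first entry of FLAG B at 10, which agrees with the last step described in the text.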
OVLPC has three types of cache hit: (1) read hit on the recent line, (2) write hit on the recent line, and (3) read hit on the previous line, whereas a conventional cache has only the first two types. A read of a value on the previous line in OVLPC therefore hits unless the corresponding flag is invalid (10).
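The three hit types can be expressed as a small classification routine. This is a sketch under assumptions: the entry is modelled as a plain dictionary whose field names (tag, vtag, r, flag) are hypothetical, with flag[0] standing for FLAG L and flag[1] for FLAG B.

```python
INVALID = 0b10  # flag code for an invalid word, as given in the text

def lookup(entry, tag, word_index, is_write):
    """Classify one access to a single OVLPC cache line entry."""
    if tag == entry["tag"]:
        # Types 1 and 2: the recent line serves both reads and writes.
        return "write hit (recent)" if is_write else "read hit (recent)"
    if tag == entry["vtag"] and not is_write:
        # Type 3: the previous line serves reads only, and only for words
        # whose flag (kept in the opposite byte-order form) is still valid.
        prev_flags = entry["flag"][1 - entry["r"]]
        if prev_flags[word_index] != INVALID:
            return "read hit (previous)"
    # Writes to the previous line and reads of invalidated words both miss.
    return "miss"

# Hypothetical entry: recent line A59 is little-endian (R = 0); the previous
# line B20 keeps a valid first word and an invalidated second word.
entry = {"tag": "A59", "vtag": "B20", "r": 0,
         "flag": [[0b01, 0b11], [0b01, 0b10]]}
print(lookup(entry, "A59", 1, is_write=True))    # write hit (recent)
print(lookup(entry, "B20", 0, is_write=False))   # read hit (previous)
print(lookup(entry, "B20", 1, is_write=False))   # miss
print(lookup(entry, "B20", 0, is_write=True))    # miss
```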
Overheads: To hold two partial cache lines in one cache line entry, additional SRAM cells are needed to store the VTAG, R, and FLAG L/B values of each cache line. Depending on the cache configuration, about 16-20 bits per line are needed for VTAG, and only 1 bit per line is needed for R. FLAG L/B, on the other hand, requires 4 bits for each word (2 bits each for FLAG L and FLAG B). Altogether, OVLPC has a storage overhead of about 15-20% over DMC, depending on the cache configuration.
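As a rough check of these figures, the per-line storage can be computed for one configuration. The sketch below makes assumptions that are not stated in the Letter: a 32-bit address, 4-byte words, and a baseline line consisting of data, TAG, and one valid bit. The function name is hypothetical.

```python
def ovlpc_storage_overhead(cache_bytes, line_bytes, addr_bits=32, word_bytes=4):
    """Return (tag_bits, extra_bits_per_line, overhead_ratio) for OVLPC."""
    lines = cache_bytes // line_bytes
    offset_bits = (line_bytes - 1).bit_length()
    index_bits = (lines - 1).bit_length()
    tag_bits = addr_bits - index_bits - offset_bits
    # Baseline DMC line: data bits + TAG + one valid bit (assumed).
    baseline_bits = line_bytes * 8 + tag_bits + 1
    # OVLPC additions: VTAG (same width as TAG), R, and 4 flag bits per word.
    extra_bits = tag_bits + 1 + 4 * (line_bytes // word_bytes)
    return tag_bits, extra_bits, extra_bits / baseline_bits

tag_bits, extra_bits, ratio = ovlpc_storage_overhead(16 * 1024, 32)
print(tag_bits, extra_bits, f"{ratio:.1%}")
```

For a 16 KB cache with 32-byte lines this gives an 18-bit VTAG and 51 additional bits per line, roughly 18.5% of the assumed 275-bit baseline line, which falls inside the 15-20% range quoted above.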
To consider the effect of delay overheads, we measured the access time of each component of the cache memory using the CACTI tool [3], implemented a logic circuit for word control (compaction and restoration), and then measured the mean latency of this logic using WiscStat [4], under 0.18 μm CMOS technology and with a cache line size of 32 bytes. Based on this, the cache access time of OVLPC compared with that of DMC is shown in Table 1. The results show that OVLPC has an access latency overhead of about 9-15% over DMC. Because large caches typically have longer bit lines and word lines, and hence longer access times, the access times of a DMC and an OVLPC that achieve a similar cache miss rate are expected to be close to each other. Moreover, because the majority of real conventional caches have multicycle (generally, 2-4 cycles) access latency to read or write data [5], a slight increase in access time can be divided across cycles and hidden more easily.
Experimental results: To evaluate the performance of the proposed cache scheme, we performed experiments using the SimpleScalar simulator [6]. We simulated ten popular benchmark programmes of SPEC2000.
In these experiments, we used a 16 KB direct-mapped data cache with a cache line size of 32 bytes as the baseline cache. We compared our scheme with the baseline DMC, VC and FVC. For VC, a small 256-byte fully associative cache was employed. For FVC, we profiled each benchmark programme over 100 000 instructions in order to find 15 frequently accessed values. The cache miss rate normalised to that of DMC is shown in Fig. 2.
Conclusions: We have introduced the OVLPC scheme, which allows one cache line entry to hold up to two cache lines by utilising zero-valued spaces. Our scheme shows that reducing the waste of zero-valued spaces in a data cache can significantly reduce cache misses. The proposed scheme can also be used for an instruction cache if instructions are compressed to within a half word. Applying OVLPC to the instruction cache with an instruction compression scheme, and more detailed studies such as applying the proposed scheme to set-associative caches, remain as future work.
