We investigate the feasibility of using instruction compression at some level in a multi-level memory hierarchy to increase memory system performance. Compression effectively increases the memory size and the line size, reducing the miss rate at the expense of increased access latency due to decompression delays.
Introduction
Multi-level memory hierarchies [3] [5] [7] are the standard way to reduce average memory access time in a cost-effective manner.
The average access time of a cache is a function of its hit time, miss rate, and miss penalty. We can reduce the miss rate of a cache either by making the cache bigger or by making the program smaller. The latter can be done in two ways.
1. Use an instruction set [2, 11] with a higher code density. Unfortunately, designing an instruction set is a complicated issue, as it affects many areas of the system, including processor decoding complexity and memory traffic. Also, from a commercial standpoint, a new instruction set is undesirable because it will not be compatible with previous designs.
2. Compress the instructions. The advantage of this approach is that it can be used with any processor, preserving backward instruction-set compatibility. The disadvantage is that extra time and hardware are needed for decompression.
In this paper, we investigate improving system performance via the second approach, namely by compressing instructions in a multi-level memory hierarchy. Our approach is transparent to the processor, which sees normal instructions. We require extra hardware for runtime decompression and address translation. We do not consider compressing data, only instructions. Recently, a major computer maker announced plans for a fast decompression engine for similar purposes [13].
This paper is organized as follows. In Section 2, we illustrate our memory model using compression.
In Section 3, we derive formulas showing when compression is advantageous given various parameters and show that the use of compression is feasible now for today's fastest processors. We discuss the additional hardware needed for our method in Section 4. Finally, in Section 5 we evaluate several different compression schemes.
Memory Hierarchy Model
We show a memory hierarchy with and without compression in Figure 1. The memory contents before, or upstream of, level $i$ (i.e. closer to the CPU) are the same for both approaches, so that the processor sees the same binary instruction stream in both approaches. The memory contents of all levels after level $i$ are compressed in the compression approach. There is no architectural difference between these two approaches other than the decompression hardware. Compression at level $i$ denotes that the decompression is done between levels $i-1$ and $i$. In the compression approach, the compiler (or compression software) creates an executable with compressed instructions; at runtime, decompression hardware in the memory system restores the original instructions. We define the compression ratio as the increase in effective memory size due to compression.
$$C \equiv \text{Compression ratio} \equiv \frac{\text{Original size}}{\text{Compressed size}}$$
We use the following definitions in the rest of this paper. A compressed level is a memory level that contains compressed code; a normal or uncompressed level contains uncompressed code. In comparing a memory system using compression versus a normal memory system, we make the following assumptions.
Both systems use the same processor and the same memory organization (memory size, associativity, block/line size, and replacement policy) except the compression approach has extra decompression hardware.
The effect of memory misalignment caused by compression is neglected. The misalignment penalty can be minimized by adding hardware, and it can be considered as part of the decompression delay.
The inclusion principle holds: the contents of level $i$ are always in level $j$ for all $i < j \le n$, where $n$ is the number of levels in the memory hierarchy.
Tradeoff Between Compression Ratio and Average Memory Access Time
Although compression increases the effective memory size, it also introduces a decoding penalty. We use a simple equation to parameterize the relationship between the compression ratio and the reduction in miss rate. We use this relationship to determine the change in average access time when using compression at the $i$-th level memory and the effect on adjacent memory levels. Finally, we show that the increases in effective memory size and line size both contribute to reducing the miss rate. In the following analysis, a subscript denotes the memory level and a superscript denotes the compression or non-compression approach, e.g. $m^{c}$ versus $m^{nc}$.
A Model for Miss Ratio versus Effective Memory Size
We assume the global miss rate at memory level $i$ changes as the (effective) memory size raised to the power $\log_2 p$, as described by Equation 1.
$$M_i^{c} = M_i^{nc} \times C^{\log_2 p} \qquad (1)$$
where $M_i^{c}$ = global miss rate at level $i$ with the new memory size and new line size, and $M_i^{nc}$ = global miss rate at level $i$ with the original memory size and line size.
The global miss rate at memory level $i$, $M_i$, is defined as
$$M_i = m_1 m_2 \cdots m_i = \prod_{j=1}^{i} m_j$$
where $m_i$ is the local miss ratio at the $i$-th level memory:
$$m_i = \frac{\text{Misses in the } i\text{-th level}}{\text{Memory accesses to the } i\text{-th level}}$$
In Equation 1, the parameter $p$ indicates how much the miss rate decreases when both the memory size and the line size are doubled. For example, $p = 0.3$ means that doubling the memory size and line size will reduce the miss rate to 3/10 of its previous value.
Note that $0 \le p \le 1$ and $C > 1$.
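The miss-rate model of Equation 1 can be tried out numerically. The following is a minimal sketch, assuming the reading $M_i^{c} = M_i^{nc} \times C^{\log_2 p}$; the function name and the sample miss rate of 0.10 are hypothetical and chosen only for illustration. The sketch requires $p > 0$ so that $\log_2 p$ is defined.

```python
import math


def compressed_miss_rate(miss_rate_nc: float, c: float, p: float) -> float:
    """Global miss rate with compression, per Equation 1.

    miss_rate_nc: global miss rate without compression (M_i^nc)
    c:            compression ratio C (original size / compressed size)
    p:            factor by which the miss rate shrinks when memory size
                  and line size are both doubled
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1] for log2(p) to be defined")
    if c < 1.0:
        raise ValueError("compression ratio C must be at least 1")
    return miss_rate_nc * c ** math.log2(p)


if __name__ == "__main__":
    # Hypothetical uncompressed miss rate of 10%.
    # With C = 2 the miss rate is scaled by exactly p.
    print(compressed_miss_rate(0.10, 2.0, 0.3))
    # With C = 1.5 the reduction is smaller.
    print(compressed_miss_rate(0.10, 1.5, 0.3))
```

Because $C^{\log_2 p} = p^{\log_2 C}$, the same function can be read as applying the factor $p$ once per doubling of the effective size.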
We used trace-driven simulation to empirically estimate the value of $p$ for various programs. We used four programs. For these programs, we found that most $p$ values ranged from 0.4 to 0.7, as shown in Figures 2-3, which was somewhat lower than we had expected. The reason $p$ is so low is that compression increases both the effective memory size and the effective line size. Figures 2 and 3 show $p$ values for line sizes of 32 and 2048 bytes, respectively. In Figure 3, we model main memory, with the 2048-byte line size corresponding to a memory page. We observe that $p$ values are constant for each application: because the memory sizes are much larger than the number of distinct addresses in the traces, the only misses occur at startup. Thus, the increased line size from compression accounts entirely for the reduction in miss rate.
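One way to turn two measured miss rates into an estimate of $p$ is to invert Equation 1. This is only a sketch under that assumption, not the simulation procedure used in the paper; the function name and the sample miss rates are hypothetical.

```python
import math


def estimate_p(miss_rate_before: float, miss_rate_after: float,
               size_ratio: float) -> float:
    """Estimate p from miss rates measured at two configurations.

    size_ratio is the factor by which memory size and line size were
    both scaled between the two measurements (plays the role of C).
    Inverting M_after = M_before * size_ratio**log2(p) gives
    log2(p) = log(M_after / M_before) / log(size_ratio).
    """
    if size_ratio <= 1.0:
        raise ValueError("size_ratio must be greater than 1")
    if miss_rate_before <= 0.0 or miss_rate_after <= 0.0:
        raise ValueError("miss rates must be positive")
    exponent = math.log(miss_rate_after / miss_rate_before) / math.log(size_ratio)
    return 2.0 ** exponent


if __name__ == "__main__":
    # Hypothetical measurements: doubling the sizes halves the miss rate.
    print(estimate_p(0.08, 0.04, 2.0))
```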
Evaluation of Systems With or Without Compression
In this section, we analyze the average memory access time both with and without considering block transfer time. First, we ignore block transfer time, assuming early restart and out-of-order fetch [3, pg. 458]. We then consider block transfer time. In both cases, we give formulas for the average memory access time as a function of $C$, $p$, and the decompression time, $d$.
We use effective memory access time as our performance metric to evaluate different memory systems.
Since we examine the instruction stream only, compression has no effect on the access time for data reads and writes. Hereafter, the analysis concentrates on the time for instruction fetches.
Average Memory Access Time without Block Transfer Time
The effective access time at the $i$-th level memory, $t_i$, is defined as
$$t_i = h_i + m_i \times P_i \qquad (2)$$
where $h_i$ = the access time to the $i$-th level memory when it is a hit; $P_i = t_{i+1}$ = the penalty incurred when the access to the $i$-th level is a miss, which is the effective access time at level $i+1$; and $m_i$ = the local miss ratio at level $i$.
The effective memory access time of a system is $t_1$.
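The recursive definition of the effective access time can be evaluated from the last level back to level 1. The sketch below assumes $t_i = h_i + m_i \times t_{i+1}$ with a last level that never misses; the function name and the hit times and miss ratios in the example are hypothetical.

```python
def effective_access_time(hit_times, local_miss_ratios):
    """Return t_1 for a memory hierarchy.

    hit_times[k] is h_(k+1) and local_miss_ratios[k] is m_(k+1).
    The last level is expected to have a miss ratio of 0, so its
    effective access time is just its hit time.
    """
    if len(hit_times) != len(local_miss_ratios):
        raise ValueError("need one miss ratio per level")
    t_next = 0.0  # effective access time of the level below the current one
    for h, m in zip(reversed(hit_times), reversed(local_miss_ratios)):
        t_next = h + m * t_next
    return t_next


if __name__ == "__main__":
    # Hypothetical three-level hierarchy, times in CPU cycles.
    print(effective_access_time([1, 10, 100], [0.05, 0.2, 0.0]))
```

In this example $t_3 = 100$, $t_2 = 10 + 0.2 \times 100 = 30$, and $t_1 = 1 + 0.05 \times 30 = 2.5$ cycles.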
In an $n$-level memory system, $t_1$ can be determined by expanding Equation 2 recursively down to level $i$:
$$t_1 = \sum_{j=1}^{i-1} M_{j-1} h_j + M_{i-1} t_i, \qquad M_0 = 1.$$
We seek the conditions for when using compression is favorable. Because $M_{i-1} > 0$ and $M_{i-1}$ is independent of compression, the only difference between the approaches is $t_i^{nc}$ and $t_i^{c}$. Equation 2 indicates that $t_i$ is a function of $h_i$, $m_i$, and $t_{i+1}$, giving a recursive dependence down to level $n$. We use the following lemmas [10] to simplify the recurrence relation.
Lemma 1: $m_j^{c} = m_j^{nc}$, for $i+1 \le j \le n$.
Lemma 2: $t_j^{c} = t_j^{nc}$, for $i+1 \le j \le n$.
From Lemmas 1 and 2, the difference between $t_i^{nc}$ and $t_i^{c}$ relies only on (i) the compression ratio at the $i$-th level, (ii) the miss ratio at the $i$-th level for the compression and non-compression approaches, and (iii) the access delay at the $i$-th level introduced by the decompression hardware.
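The consequence of the two lemmas can be made explicit with a short derivation. This is a sketch under stated assumptions rather than the paper's own derivation: we assume the decompression delay $d$ is added to every access served by level $i$, we write $\Delta t = t_i^{c} - t_i^{nc}$, and we use the facts that $M_{i-1}$ is unchanged by compression (so Equation 1 applies directly to the local miss ratio $m_i$) and that $P_i = t_{i+1}$ is the same in both approaches by Lemma 2.

```latex
\begin{align*}
t_i^{nc} &= h_i + m_i^{nc} P_i, \\
t_i^{c}  &= h_i + d + m_i^{c} P_i,
  \qquad m_i^{c} = m_i^{nc}\, C^{\log_2 p} = m_i^{nc}\, p^{\log_2 C}, \\
\Delta t &= t_i^{c} - t_i^{nc}
          = d - m_i^{nc} P_i \left(1 - p^{\log_2 C}\right), \\
\Delta t < 0 &\iff d < m_i^{nc} P_i \left(1 - p^{\log_2 C}\right).
\end{align*}
```

Under these assumptions, compression improves the average access time exactly when the decompression delay is smaller than the miss-penalty time saved at level $i$.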
Note that the memory access time and miss rate of level $j$, for all $j > i$, have no effect on $\Delta t$. We define the ... is doubled.
Case studies of memory systems
Compression at the L1 cache is not feasible because of the limited decompression time allowed [10]. At the other extreme, compression at secondary storage cannot improve memory performance because we cannot reduce the miss rate, since accesses to this level never miss.
A general analysis. Figure 4 shows the maximum allowed decompression delay if compression is to be effective. Points on the curve show where compression neither helps nor hurts the average memory access time for various $p$ values and for $m \times P = 300$ CPU cycles. Points below the curve favor compression. As Section 5 shows, we can obtain $C \approx 1.5$, so that the maximum allowed decompression delay is about 60 CPU cycles. As $d_{max}$ is proportional to $m \times P$, we can calculate where compression is advantageous for other values of $m \times P$.
Figure 4: The maximum allowed decompression delay, where $P$ = miss penalty to access level $i+1$, $m$ = miss rate at level $i$, $p$ = reduction ratio, and $m \times P = 300$ CPU cycles.
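The break-even curve can be approximated with a few lines of code. The sketch below assumes the break-even delay is $d_{max} = m \times P \times (1 - p^{\log_2 C})$, which is our reading of the bound quoted in the text; the function name is hypothetical.

```python
import math


def max_decompression_delay(m_times_p: float, c: float, p: float) -> float:
    """Largest decompression delay (in CPU cycles) at which compression
    still breaks even, assuming d_max = (m x P) * (1 - p**log2(C)).

    m_times_p: local miss ratio at level i times the miss penalty, in cycles
    c:         compression ratio C
    p:         miss-rate reduction factor per doubling of size
    """
    if c < 1.0:
        raise ValueError("compression ratio C must be at least 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return m_times_p * (1.0 - p ** math.log2(c))


if __name__ == "__main__":
    # m x P = 300 cycles and C = 1.5, for p at both ends of the measured range.
    for p in (0.4, 0.7):
        print(p, round(max_decompression_delay(300, 1.5, p), 1))
```

For $p = 0.7$ this gives a little under 60 cycles, in line with the figure quoted above. Since the result scales linearly with $m \times P$, other operating points follow by multiplication.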
We empirically estimated $m \times P$ by measuring local miss ratios using trace-driven simulations. The miss ratios and memory organization are shown in Table 2. To calculate the average access time and miss penalty of each memory level, we assumed a 200 MHz processor with a disk access time of 5 ms to 15 ms and a memory system with parameters similar to those in [3], as shown in Table 3. In this design, we observe that the $m \times P$ values for the second-level cache range from impractically small (1-3 cycles) to moderate (20-56 
Average Memory Access Time with Block Transfer Time
The effective access time at the $i$-th level memory, $t_i$, is still defined as $t_i = h_i + m_i \times P_i$. Every term in this equation remains the same except $P_i$, which is now defined as $P_i = t_{i+1} + z_{i+1}$, where $t_{i+1}$ is the latency to obtain the first data from level $i+1$ and $z_{i+1}$ is the time to transfer a block from level $i+1$ to level $i$. It can be shown that the decompression delay $d$ must satisfy the bound $d < m_i^{nc} P_i (1 - p^{\log_2 C})$ for increased memory performance. Clearly, the delay time allowed for decompression is increased when block transfer time cannot be hidden by mechanisms such as early restart and out-of-order fetch.
Design of Memory Systems Using Compression
Our compression method requires additional hardware for two reasons: runtime instruction decompression and translation of uncompressed addresses to compressed addresses. For any compression algorithm, we refer to the mapping from normal symbols to compressed symbols and vice versa as the codebook. We maintain an address mapping table and a codebook for each process, as shown in Figure 5.
The address translation problem occurs because the compressed instruction stream does not preserve a linear addressing space. Thus, on a branch to (uncompressed) address A, where do we find A among the compressed instructions? The address mapping table (AMT) contains an index into the compressed instructions for each cache index. For example, if the L2 cache has a line size of 256 bytes and addresses are 32 bits (4 bytes) wide, then the AMT would contain a 32-bit index into the compressed program for every cache line, or 256 bytes of code. In this case, the AMT would require 4/256 = 1.6% of the original program size. Figure 5 shows that the hardware leaves the data stream intact. Decompression hardware also tracks whether a program is in compressed form. A selective bypassing capability allows the system to run uncompressed programs. In our example, we have assumed all caches are virtually addressed. The decompression hardware stores a copy of the current codebook. The operating system stores the codebook and the address mapping table in a non-swappable region in main memory. On a context switch, the operating system must reload the decode table with the appropriate codebook and AMT.
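The role of the AMT can be illustrated with a small software model. This is a sketch only: the paper describes a hardware table, and the function names, the list-based table, and the compressed line sizes in the example are all hypothetical.

```python
def build_amt(compressed_line_sizes):
    """Return the AMT: for each uncompressed cache line, the byte offset
    at which its compressed form starts in the compressed program."""
    amt = []
    position = 0
    for size in compressed_line_sizes:
        amt.append(position)
        position += size
    return amt


def translate(amt, address, line_size=256):
    """Map an uncompressed address to (start of its compressed line,
    byte offset inside the line after decompression)."""
    line_index, offset_in_line = divmod(address, line_size)
    return amt[line_index], offset_in_line


def amt_overhead(entry_bytes=4, line_size=256):
    """AMT size as a fraction of the original program size."""
    return entry_bytes / line_size


if __name__ == "__main__":
    # Hypothetical program: three 256-byte lines compressed to these sizes.
    amt = build_amt([170, 160, 180])
    print(amt)                    # start offsets of the compressed lines
    print(translate(amt, 0x210))  # branch target in the third line
    print(amt_overhead())         # 4 / 256
```

In this model a branch target is found by decompressing the whole line that starts at the returned offset and then indexing into it, which is why one table entry per line is sufficient.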
Both the address mapping table and codebook require space in main memory and must be saved with the compressed program in the file system. The space overhead of the codebook and the AMT will reduce the effective compression ratio.
Using compression at main memory, assuming 32-bit (4-byte) addresses and a line size of at least 128 bytes, the overhead would be no more than 4/128 ≈ 3%. For $C = 1.5$, the compression ratio would be reduced to about 1.4.
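The effect of table overhead on the compression ratio can be checked with a short calculation. The sketch assumes the overhead is expressed as a fraction of the original program size and is simply added to the compressed size; the function name is hypothetical, and codebook space is not modeled.

```python
def effective_compression_ratio(c: float, overhead_fraction: float) -> float:
    """Compression ratio after adding table overhead.

    c:                 raw compression ratio (original / compressed)
    overhead_fraction: table size as a fraction of the original size
    """
    if c <= 0.0:
        raise ValueError("compression ratio must be positive")
    compressed_fraction = 1.0 / c  # compressed size relative to the original
    return 1.0 / (compressed_fraction + overhead_fraction)


if __name__ == "__main__":
    # 4-byte AMT entry per 128-byte line, raw ratio C = 1.5.
    print(round(effective_compression_ratio(1.5, 4 / 128), 2))
```

With these inputs the compressed program plus AMT occupies about 70% of the original size, which corresponds to the reduced ratio of roughly 1.4 quoted above.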
Compression Methods and Results
In this section, we compare the compression ratios of several compression methods on various Unix executables. We ran different compression algorithms on the text segment from various executables. We measured compression ratios on executables for a SUN SPARCstation running SUN-OS 4.1.1.
After some experimentation, we found that independent compression of the different fields of a machine instruction performed well. We broke down each instruction by its fields (opcode, operand, jump displacement, immediate value, etc.) [8] and compressed each field. For example, an opcode "LD", a register "%31", and an immediate value "#4095" all belong to different fields. Each instruction uses only some of the fields. We compressed each field independently in all of the following strategies except LZW. In the compressed instruction stream, each field is an index into the appropriate f-cache. In the event the field value is not in the f-cache, we use a special index (say 0) followed by the actual (uncompressed) value. Thus, the most frequently used (MFU) instructions are represented by f-cache indices, and all others result in f-cache misses.
For each field, we tried different f-cache sizes (always powers of two) and we selected the size providing the best compression.
The MFU method is ideal for use in a memory system, as the decompression hardware is always "in sync" because the MFU f-caches are fixed. MFU lends itself to a straightforward implementation of the decompression hardware.
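The f-cache scheme for a single field can be sketched in software as follows. This is an illustrative model, not the paper's hardware design: the function names, the tuple-based token format, and the sample opcode stream are hypothetical. Index 0 is reserved as the escape index, so an f-cache of a given size holds one fewer value than its size.

```python
from collections import Counter

ESCAPE = 0  # reserved index: the raw (uncompressed) value follows


def build_f_cache(values, size):
    """Return the size - 1 most frequently used values of one field.
    size is the number of index codes, including the escape index."""
    return [value for value, _ in Counter(values).most_common(size - 1)]


def encode_field(values, f_cache):
    """Encode each value as (index,) on an f-cache hit, or
    (ESCAPE, raw_value) on an f-cache miss."""
    index_of = {value: i + 1 for i, value in enumerate(f_cache)}
    return [(index_of[v],) if v in index_of else (ESCAPE, v) for v in values]


def decode_field(tokens, f_cache):
    """Invert encode_field using the same fixed f-cache."""
    return [tok[1] if tok[0] == ESCAPE else f_cache[tok[0] - 1] for tok in tokens]


if __name__ == "__main__":
    # Hypothetical opcode field stream.
    opcodes = ["ld", "st", "ld", "add", "ld", "st", "add", "sethi"]
    f_cache = build_f_cache(opcodes, size=4)
    tokens = encode_field(opcodes, f_cache)
    assert decode_field(tokens, f_cache) == opcodes
    print(f_cache)
    print(tokens)
```

Because the f-cache is built once and never updated during decoding, a decoder can start at any instruction boundary, which is the "in sync" property the text relies on.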
Compression bound: We calculated the entropy for each field, which gives a theoretical upper bound for compression schemes that compress each field independently. The entropy for the entire instruction is the sum of the entropies for each field. Note that by adding the space for a Huffman encoding tree, we get the Huffman bound.
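The per-field entropy calculation can be sketched as below. The function names and the sample field streams are hypothetical, and the sketch uses the standard zeroth-order (Shannon) entropy of each field's value distribution, which we assume is what the bound refers to.

```python
import math
from collections import Counter


def field_entropy_bits(values):
    """Shannon entropy, in bits per occurrence, of one field's values."""
    total = len(values)
    if total == 0:
        return 0.0
    counts = Counter(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def instruction_entropy_bits(fields):
    """Sum of the per-field entropies; fields maps a field name to the
    list of values observed for that field."""
    return sum(field_entropy_bits(values) for values in fields.values())


if __name__ == "__main__":
    # Hypothetical field streams for four instructions.
    fields = {
        "opcode": ["ld", "st", "ld", "st"],
        "register": ["%o0", "%o0", "%o0", "%o0"],
    }
    print(instruction_entropy_bits(fields))
```

Dividing the original instruction width in bits by this sum gives the corresponding bound on the compression ratio, provided every instruction carries every field; instructions that use only some fields would need per-field occurrence counts instead.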
Lempel-Ziv-Welch (LZW) [12]. We also measured the popular LZW algorithm used by the UNIX utility compress. LZW is unsuitable for our purposes, as the codebook, a string table, will be out of sync when the program branches. We measured LZW results only as a comparison point.
Table 4 shows compression ratios of various files on a SUN4. The original size of the text (code) segment is listed. The size for MFU compression includes the space for the preloaded f-cache values. For smaller programs, the overhead due to the preloaded f-caches significantly decreased the compression ratio. For larger programs, MFU had a compression ratio of roughly 150%, including the space for the f-cache codebook.
Results of Compression Methods
The size for static Huffman coding includes the size of the Huffman tree. The compression bound gives the projected best possible compression. The majority of the difference between Huffman encoding and the compression bound is due to the Huffman tree, which amounts to roughly 1/3 of the compression bound size. The compression ratio of Huffman encoding is always greater than that of the simple MFU encoding.
We also listed the compression ratio of the LZW compression method. For small to medium size programs, Huffman encoding performs slightly better than LZW encoding. For large programs, LZW usually gives better compression ratios.
Conclusion
We have analyzed the tradeoffs from using compression in a memory system on the average system access time. If a compression ratio of ≈ 1.5 can be achieved, compression is feasible at main memory for computers of today. We have found that the benefit from compression is quite sensitive to the miss ratio and miss penalty at the level of compression.
We have proposed a memory system design to handle instruction decompression and address translation, and have suggested OS support for this particular design. This design is capable of running compressed and uncompressed programs. This capability provides a way to utilize compression when it improves memory performance.
We have also measured the compression ratios of several different compression techniques. A simple compression method using an f-cache of MFU values achieved compression ratios of 150%. A static Huffman encoding gives even better compression ratios. With miss penalties increasing in future systems, we believe using compression in the memory system will only become more viable as time progresses.
