The Feasibility of Using Compression to Increase Memory System Performance by Wang, Jenlong & Quong, Russell W.
Purdue University
Purdue e-Pubs
ECE Technical Reports Electrical and Computer Engineering
11-1-1993
Jenlong Wang
Purdue University School of Electrical Engineering
Russell W. Quong
Purdue University School of Electrical Engineering
Wang, Jenlong and Quong, Russell W., "The Feasibility of Using Compression to Increase Memory System Performance" (1993). ECE
Technical Reports. Paper 246.
http://docs.lib.purdue.edu/ecetr/246
TR-EE 93-37 
NOVEMBER 1993 
Jenlong Wang and Russell W. Quong 
School of Electrical Engineering 
Purdue University 
West Lafayette, IN 47907 
{wangj, quong}@ecn.purdue.edu
November 9, 1993 
Abstract
We investigate the feasibility of using instruction compression at some level in a multi-level memory
hierarchy to increase memory system performance. For example, compressing at main memory means
that main memory and the file system would contain compressed instructions, but upstream caches would
see normal uncompressed instructions. Compression effectively increases the memory size and the block
size, reducing the miss rate at the expense of increased access latency due to decompression delays. We
present a simple compression scheme using the most frequently used symbols and evaluate it with several
other compression schemes. On a SPARC processor, our scheme obtained a compression ratio of 150% for
most programs.
We analytically evaluate the impact of compression on the average memory access time for various
memory systems and compression approaches. Our results show that the feasibility of using compression
is sensitive to the miss rates and miss penalties at the point of compression and, to a lesser extent, the
amount of compression possible. For high performance workstations of today, compression already shows
promise; as miss penalties increase in the future, compression will only become more feasible.
Keywords: Memory system performance, multi-level memory system, cache, data compression. 
1 Introduction 
As the ratio of processor speeds to memory speeds continues to rise, design of faster memory systems has
become a crucial part of computer systems design. Multi-level memory hierarchies [HI~~][PRHH~~][PRHHSS][SHL~~]
are the standard way to reduce average memory access time in a cost-effective manner. A memory hierarchy
uses one or more levels of cache between the processor and the main memory to reduce the average memory
access time. Fast, small upstream caches match the processor's speed, while larger, downstream caches
reduce traffic to slower main memory.
The average access time of a cache is a function of its hit time, miss rate, and miss penalty. We can reduce
the miss rate of a cache either by making the cache bigger or by making the program smaller. The latter can be
done in two ways.
1. Use an instruction set [FLMM~~, W~F87] with a higher code density. Unfortunately, designing an
instruction set is a complicated issue as it affects many areas of the system, including the processor
decoding complexity and memory traffic. Also, from a commercial standpoint, a new instruction set is
undesirable because it will not be compatible with previous designs.
2. Compress the instruction stream. The advantage of this approach is that it can be used with any
processor, so that backward instruction-set compatibility can be maintained if desired. A small amount of
additional hardware is needed.
We consider the second approach in this paper. Namely, we investigate improving system performance
by compressing instructions in a multi-level memory hierarchy. Our approach is transparent to the processor
in that it sees normal instructions. We require extra hardware for runtime decompression and address
translation. Use of compression will also reduce executable sizes on disk; however, we are not concerned with
this side effect. Finally, we do not consider compressing data, only instructions.
This paper is organized as follows. In Section 2, we illustrate our memory model using compression. In
Section 3, we derive formulas showing when compression is advantageous given various parameters and show
that the use of compression is feasible now for today's fastest processors. We discuss the additional hardware
needed for our method in Section 4. Finally, in Section 5 we evaluate several different compression schemes.
2 Memory Hierarchy Model
In this section, we describe the memory hierarchy used in our study. Figure 1 shows the memory hierarchy
models with and without compression. The memory contents before or upstream of level i (i.e., closer to the
CPU) are the same for both approaches, so that the processor sees the same set of input symbols in both
approaches. The memory contents of all levels after level i are also compressed in the compression approach.
There is no architectural difference between these two approaches other than the decompression hardware.
Compression at level i denotes that the decompression is done between levels i - 1 and i, as in Figure 1. In
the compression approach, the compiler (or compression software) creates an executable with compressed
instructions; at runtime, decompression hardware in the memory system restores the original instructions.
To be feasible, we must be able to build a fast hardware realization of the decompression algorithm.
[Figure 1 diagram (not reproduced): two panels, "Conventional Memory Hierarchy" and "Memory Hierarchy with Compression at Level i". Each panel shows the (i-1)-th, i-th, (i+1)-th, ..., n-th level memories, ordered from faster but smaller to slower but larger. In the second panel, hardware decompression sits between levels i-1 and i, and the effective sizes after compression are indicated for the compressed levels.]
Figure 1: Memory hierarchy models for the compression and non-compression approaches.
We define the compression ratio as the increase in effective memory size due to compression.
Thus, if the compressed instructions are 1/4 the size of the original, the compression ratio is four, because
the same memory can hold four times as much information.
$$\text{Compression ratio} = \frac{\text{Original size}}{\text{Compressed size}}$$
The effectiveness of compression at a particular level depends on the following factors.
1. The capacity and miss rate of that level. 
2. The miss penalty of that level. 
3. The increase in miss penalty due to hardware decompression. 
4. The compression ratio and the effectiveness of compression in reducing the miss rate. 
For a given design space, these factors are interdependent, because as memory size increases, access time
increases, but miss rate decreases. The compression ratio dictates the effective capacity increase, and the
compression/decompression methods impose various decoding delays.
We use the following definitions in the rest of this paper. A compressed level is a memory level that
contains compressed code; a normal or uncompressed level contains uncompressed code. In comparing a
memory system using compression versus a normal memory system, we make the following assumptions.
Both systems use the same processor and the same memory organization, except that the compression
approach has extra decompression hardware. The same memory size, associativity, block/line size, and
replacement policy are used in both systems.
The effect of memory misalignment caused by compression is neglected. The misalignment penalty can
be minimized by adding hardware, and it can be considered as part of the decompression delay.
The inclusion principle holds: The contents of level i are always in level j for all i < j ≤ n, where n is
the number of levels in the memory hierarchy.
Uniformity condition: Compression changes the code density of all program parts equally, independent
of how often they are executed. The uniformity condition is not true for individual instructions but is
shown to be approximately true for extended basic blocks [S~89].
An important question to ask is "At which level should we decompress the program?" We answer this
question in the next section, where we model the memory hierarchy and quantitatively predict the benefits
of using compression.
3 Tradeoff Between Compression Ratio and Average Memory 
Access Time 
Although compression increases the effective memory size, it also introduces a decoding penalty. Based on
empirical data, we use a simple equation to parameterize the relationship between the compression ratio and the
reduction in miss rate. We use this relationship to determine the change in average access time when using
compression at the i-th level memory and the effect on adjacent memory levels.
3.1 A Model for Miss Ratio versus Effective Memory Size
We assume the global miss rate at memory level i changes as the (effective) memory size raised to the power
lg p, as described by Equation 1. Alternatively, Equation 1 shows the miss rate is reduced by the compression
ratio raised to the power lg p.
$$m_i' = C^{\lg p}\, m_i = p^{\lg C}\, m_i \tag{1}$$
where $m_i'$ = miss rate at level i for the new memory size and new block size.
$m_i$ = miss rate at level i for the original memory size.
C = original size / compressed size = size increase ratio.
$\lg x = \log_2 x$.
p = reduction ratio = $m_i'/m_i$ where both the new size and the new block size are twice
the original values.
In Equation 1, the parameter p indicates how much the miss rate decreases when both the memory size
and the line size are doubled. For example, p = 0.3 means that doubling the memory size and block size
will reduce the miss rate to 3/10 of its previous value. Note that 0 ≤ p ≤ 1 and C > 1. For a fixed C, as
p decreases the new miss rate decreases as well. For a fixed p, as C increases the new miss rate decreases.
Therefore, smaller p values (and larger C values) are "good" in that the miss rate decreases quickly.
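As a rough illustration of the miss-rate model, the following Python sketch evaluates Equation 1 read as m' = m * p^(lg C), which is the form implied by the definitions above. The function name and argument names are ours, not the paper's, and the sample numbers are illustrative only.

```python
import math


def compressed_miss_rate(miss_rate: float, compression_ratio: float, p: float) -> float:
    """Miss rate after compression, assuming m' = m * p ** lg(C) (Equation 1)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("reduction ratio p must lie in [0, 1]")
    if compression_ratio < 1.0:
        raise ValueError("compression ratio C must be at least 1")
    return miss_rate * p ** math.log2(compression_ratio)


if __name__ == "__main__":
    # Doubling the effective size (C = 2) with p = 0.3 leaves 3/10 of the misses.
    print(compressed_miss_rate(0.10, 2.0, 0.3))
    # C = 1 (no compression) leaves the miss rate unchanged.
    print(compressed_miss_rate(0.10, 1.0, 0.3))
```

Because p^(lg C) equals C^(lg p), the same function also expresses the "memory size raised to the power lg p" reading of the model.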
We used trace-driven simulation to empirically estimate the value of p for various programs. Table 1
describes the four ATUM cache traces [AGSH~~] and two other traces, spice and cc1, that we used. Using a
cache simulator, we gathered the instruction miss information for different cache and block sizes. The value
of p for a 4k cache with an 8-byte block size is denoted by $p_{4k/8}$ and is given by $p_{4k/8} = m_{8k/16} / m_{4k/8}$, where $m_{8k/16}$
is the miss rate of an 8k cache with a block size of 16 bytes. We define the other p values similarly. Note
that an important effect of compressing instructions is that the effective line size increases by C also. E.g.,
if C = 4, a 32-byte line in a compressed cache has an effective line size of 128 bytes.
Table 1: Traces used to evaluate reduction ratio.
For these programs, we found most p values ranged from 0.4 to 0.7 as shown in Figures 2-4, which
was somewhat lower than we had expected. The reason p is so low is that compression increases both the
effective memory size and the effective line size. Figure 2 shows p values for different cache sizes and a line
size of 8 bytes; Figures 3 and 4 show p values for line sizes of 32 and 2048 bytes respectively. In Figure 4,
we model main memory with the 2048-byte block size corresponding to a memory page. We observe that the p
values are constant for each application, as the only misses occur at startup, because the memory sizes are much
larger than the number of distinct addresses in the traces. Thus, the increased line size from compression
accounts entirely for the reduction in miss rate.
[Figure 2 plot (not reproduced): "Range of p for block size = 8 bytes (Direct-Mapped)"; x-axis: cache size (K bytes), 2 to 512; y-axis: p; one curve per trace.]
Figure 2: Range of p values in different applications.
In this section, we analyze the average memory access time both with and without considering block transfer
time. First, we ignore block transfer time, assuming early restart and out-of-order fetch [HP90, page 458].
We then consider block transfer time. In both cases, we give formulas for the average memory access time
as a function of C, p and the decompression time d.
We use effective memory access time as our performance metric to evaluate different memory systems.
Since we examine the instruction stream only, compression has no effect on the access time for data reads and
writes. Hereafter, the analysis concentrates on the time for instruction fetches. In the following analysis, a
subscript denotes the memory level and a superscript denotes a compression or a non-compression approach,
e.g. $m^{c}$ versus $m^{nc}$.
3.2.1 Average Memory Access Time without Block Transfer Time
[Figure 3 plot (not reproduced): "Range of p for block size = 32 bytes (Direct-Mapped)"; x-axis: cache size (K bytes); y-axis: p; one curve per trace.]
Figure 3: Range of p values in different applications.
[Figure 4 plot (not reproduced): "Range of p for block size = 2048 bytes (Fully-Associative)"; x-axis: memory size (M bytes); y-axis: p; traces 1: cc1, 2: spice, 3: dec0, 4: fora, 5: forf, 6: lisp.]
Figure 4: Range of p values for different applications.
The effective access time at the i-th level memory, $t_i$, is defined as
$$t_i = h_i + m_i P_i \tag{2}$$
where $h_i$ = the access time to the i-th level memory when it is a hit, on a miss from the (i-1)-th level.
$P_i = t_{i+1}$ = the penalty incurred when the access to the i-th level is a miss
= the effective access time at level i + 1.
$m_i$ = probability of a miss at the i-th level memory = local miss ratio at level i:
$$m_i = \frac{\text{Misses in the i-th level}}{\text{Memory accesses to the i-th level}}$$
The global miss rate at memory level i, $M_i$, is defined as:
$$M_i = \frac{\text{Misses at the i-th level}}{\text{Memory accesses generated by the CPU}} = \prod_{l=1}^{i} m_l$$
The effective memory access time of a system is $t_1$. In an n-level memory system, $t_1$ can be determined
by
$$t_1 = h_1 + m_1 \bigl(h_2 + m_2 (h_3 + \cdots + m_{n-1} t_n)\bigr)$$
As there is no miss at the last memory level, $m_n = 0$. Hence, $t_n = h_n$. If we define $M_0 = m_0 = 1$ = the
miss rate at the CPU, then
$$t_1 = \sum_{j=1}^{n} M_{j-1} h_j = \sum_{j=1}^{n} \Bigl(\prod_{l=0}^{j-1} m_l\Bigr) h_j \tag{3}$$
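The recurrence for the effective access time (each level's time is its hit time plus its local miss rate times the next level's time, with the last level never missing) and its expanded form as a sum of hit times weighted by global miss rates can be checked against each other numerically. The Python sketch below is ours; the function names and the three-level sample hierarchy are hypothetical and not taken from the paper.

```python
def access_time_recursive(hit_times, local_miss_rates):
    """t_1 from t_i = h_i + m_i * t_{i+1}, starting at t_n = h_n (since m_n = 0)."""
    t = hit_times[-1]
    # Walk from level n-1 back to level 1.
    for h, m in zip(reversed(hit_times[:-1]), reversed(local_miss_rates[:-1])):
        t = h + m * t
    return t


def access_time_sum(hit_times, local_miss_rates):
    """t_1 as the sum over levels j of M_{j-1} * h_j, with M_0 = 1."""
    total = 0.0
    global_miss = 1.0  # M_0: every CPU access reaches level 1
    for h, m in zip(hit_times, local_miss_rates):
        total += global_miss * h
        global_miss *= m  # M_j = M_{j-1} * m_j
    return total


if __name__ == "__main__":
    hits = [1.0, 10.0, 100.0]   # hypothetical hit times in cycles, levels 1..3
    misses = [0.1, 0.2, 0.0]    # hypothetical local miss rates; last level never misses
    print(access_time_recursive(hits, misses))
    print(access_time_sum(hits, misses))
```

Both functions give about 4 cycles for the sample hierarchy (1 + 0.1 * 10 + 0.1 * 0.2 * 100).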
When compression is performed at the i-th memory level, levels closer to the CPU are unaffected, i.e., the
miss rate and hit time of all levels before i are not affected. Hence,
$$m_j^{c} = m_j^{nc}, \quad h_j^{c} = h_j^{nc}, \quad 1 \le j \le i-1$$
where c denotes the compression approach and nc denotes the non-compression approach. As the only
hardware difference between these two approaches is the decompression hardware between levels i - 1
and i, there is no added access delay for levels > i. Hence, for levels after i, the hit time is not changed:
$$h_j^{c} = h_j^{nc}, \quad i < j \le n$$
Let $\Delta t = t_1^{nc} - t_1^{c}$ = the time savings using compression. Thus, the compression approach is advantageous
only when
$$\Delta t = t_1^{nc} - t_1^{c} > 0$$
Using Equation 3 and expanding until level i, we can derive the condition
$$\Delta t = M_{i-1}\,\bigl(t_i^{nc} - t_i^{c}\bigr) > 0$$
for when using compression is favorable. Because $M_{i-1} > 0$ and $M_{i-1}$ is independent of compression, the
only difference between the approaches is $t_i^{nc}$ and $t_i^{c}$. Equation 2 indicates that $t_i$ is a function of $h_i$, $m_i$, and
$t_{i+1}$, giving a recursive dependence down to level n. We use the following lemmas to simplify the recurrence
relation.
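The following lemmas prove that compression does not change the local miss rates and the effective
access time of compressed memory levels after level i. We then derive a tradeoff condition to judge when
the compression approach gives better performance than the non-compression counterpart.
Applying Equation 1 to the global miss rate, we obtain
$$M_j^{c} = p^{\lg C} M_j^{nc}, \quad i \le j \le n$$
Lemma 1: $m_j^{c} = m_j^{nc}$, for $i+1 \le j \le n$.
Proof: By induction on j from i + 1 to n.
Basis:
Since $M_j^{c} = p^{\lg C} M_j^{nc}$ for all j such that $i \le j \le n$, and $M_j = \prod_{l=1}^{j} m_l$,
$$m_{i+1}^{c} = \frac{M_{i+1}^{c}}{M_i^{c}} = \frac{p^{\lg C} M_{i+1}^{nc}}{p^{\lg C} M_i^{nc}} = m_{i+1}^{nc}$$
Hence, $m_{i+1}^{c} = m_{i+1}^{nc}$.
Hypothesis: Assume that $m_j^{c} = m_j^{nc}$ for all $i+1 \le j \le k$.
Induction:
$$m_{k+1}^{c} = \frac{M_{k+1}^{c}}{M_k^{c}} = \frac{p^{\lg C} M_{k+1}^{nc}}{p^{\lg C} M_k^{nc}} = m_{k+1}^{nc}$$
Hence, $m_{k+1}^{c} = m_{k+1}^{nc}$. Therefore, $m_j^{c} = m_j^{nc}$ for all j such that $i+1 \le j \le n$. ∎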
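Lemma 2: $t_j^{c} = t_j^{nc}$, for $i+1 \le j \le n$.
Proof: (by induction)
Basis:
$t_n^{c} = t_n^{nc} = h_n$, because $m_n^{c} = m_n^{nc} = 0$.
Hypothesis: Assume that $t_j^{c} = t_j^{nc}$ for all $k \le j \le n$.
Induction: Recall that
$$t_{k-1}^{c} = h_{k-1}^{c} + m_{k-1}^{c}\, t_k^{c}, \quad \text{and}$$
$$t_{k-1}^{nc} = h_{k-1}^{nc} + m_{k-1}^{nc}\, t_k^{nc}$$
Since the compression is performed at level i, we can obtain that
$$h_j^{c} = h_j^{nc}, \quad i < j \le n$$
and from the previous lemma, $m_j^{c} = m_j^{nc}$ for all $i < j \le n$. Hence, for $k - 1 > i$,
$$t_{k-1}^{c} = t_{k-1}^{nc}$$
Therefore, $t_j^{c} = t_j^{nc}$ for all $i < j \le n$. ∎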
From Lemmas 1 and 2, the difference between $t_i^{nc}$ and $t_i^{c}$ thus relies only on the following.
The compression ratio at the i-th level.
The miss ratio at the i-th level for the compression and non-compression approaches.
The access delay at the i-th level introduced by the decompression hardware.
Note the memory access time and miss rate of level j, for all j > i, have no effect on $\Delta t$. When compression
at the i-th level memory is advantageous, the following condition can be derived using Lemmas 1 and 2:
$$t_i^{nc} - t_i^{c} = \bigl(h_i^{nc} + m_i^{nc}\, t_{i+1}^{nc}\bigr) - \bigl(h_i^{c} + m_i^{c}\, t_{i+1}^{c}\bigr) > 0$$
Letting $d = h_i^{c} - h_i^{nc}$ = the delay due to decompression, and using $t_{i+1}^{c} = t_{i+1}^{nc} = t_{i+1} = P_i$ together with $m_i^{c} = p^{\lg C} m_i^{nc}$ (Equation 1), the savings
from using compression is
$$\Delta t = t_1^{nc} - t_1^{c} = M_{i-1}\bigl[m_i^{nc} P_i\,(1 - p^{\lg C}) - d\bigr] \tag{9}$$
Thus, compression at level i is advantageous when
$$d < m_i^{nc} P_i\,(1 - p^{\lg C}) \tag{10}$$
Equation 10 indicates that the maximum allowed d is directly proportional to $m_i^{nc}$, the miss rate at level i, and to $t_{i+1}$, the access time
of level i + 1. For example, if $m_i^{nc}$ or $t_{i+1}$ doubles, the allowed d is doubled. We now assess two extreme cases, i = n
and i = 1, as an intuitive check of our analysis.
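To make the tradeoff concrete, the sketch below evaluates the bound on the decompression delay, reading Equation 10 as d < m * P * (1 - p^(lg C)), the same expression that appears in the Case 1 calculations of this section. The function names are ours and the code is a minimal illustration under that reading, not the authors' tooling.

```python
import math


def max_decompression_delay(miss_rate_nc: float, miss_penalty: float,
                            p: float, compression_ratio: float) -> float:
    """Largest delay d (in cycles) for which compression still breaks even."""
    return miss_rate_nc * miss_penalty * (1.0 - p ** math.log2(compression_ratio))


def savings(global_miss_prev: float, miss_rate_nc: float, miss_penalty: float,
            p: float, compression_ratio: float, d: float) -> float:
    """Time saved per CPU access, M_{i-1} * [m * P * (1 - p^(lg C)) - d]."""
    bound = max_decompression_delay(miss_rate_nc, miss_penalty, p, compression_ratio)
    return global_miss_prev * (bound - d)


if __name__ == "__main__":
    # Optimistic first-level case from the text: C = 2.0, p = 0.5,
    # miss rate 20%, t_2 = 34 cycles -> about 3.4 cycles.
    print(max_decompression_delay(0.20, 34.0, 0.5, 2.0))
    # Last level: the miss rate is zero, so any positive delay is a loss.
    print(savings(1.0, 0.0, 1000.0, 0.7, 1.5, d=5.0))
```

Doubling either the miss rate or the miss penalty doubles the returned bound, which matches the proportionality noted above.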
3.2.2 Case studies of memory systems
As examples, we evaluate using compression at the extreme ends of the memory hierarchy, namely at the
L1 cache and at secondary storage. We then study the general case, showing our approach is theoretically
feasible when used at main memory for next-generation processors.
Case 1: i = 1. Compression is done at the first level cache so that Equation 10 becomes
$$d < m_1^{nc}\, t_2\,(1 - p^{\lg C})$$
In order to assess the feasibility of compression at this level, we use the parameters from Table 2, which
are typical in the early 1990's according to [HP90]. Using Equation 2, $t_2 = h_2 + m_2 P_2$ = 8.5 - 34 cycles.
Let p = 0.5 - 0.8 and C = 1.2 - 2.5; then the extreme values are
$$m_1^{nc}\, t_2\,(1 - p^{\lg C}) = 0.002 - 3.34 \text{ cycles}$$
Even for an optimistic case where C = 2.0 and p = 0.5,
$$m_1^{nc}\, t_2\,(1 - p^{\lg C}) = 0.04 - 3.4 \text{ cycles}$$
As a ballpark figure, the allowed decompression delay for a 100 MHz processor would be 0.02 - 33.4 ns. Because
of the very short latency allowed for hardware decompression, compression at the first level cache is simply
not feasible.
Table 2: Parameters for Case 1
$m_1^{nc}$ | $h_2^{nc}$ | $m_2^{nc}$ | $P_2^{nc} = t_3^{nc}$
1% - 20% | 4 - 10 cycles | 15% - 30% | 30 - 80 cycles
Case 2: i = n, i.e., compression is done at the n-th memory level. E.g., the filesystem contains compressed
executables, but memory holds normal executables. As $m_n = 0$, Equation 10 gives d < 0, which means that
any decompression delay slows down memory performance.
Thus, compression at the n-th level degrades average memory access time, which is expected. Although
memory response time is not improved by doing compression at the last level, the delay is much less than the
hit time on the n-th level. Typically $t_n^{nc}$, the conventional disk access time, is in the range from 8 ms to 20
ms. For a 100 MHz processor, the disk access time is in the range from 8 × 10^5 to 2 × 10^6 cycles. The value
of d is due to the extra hardware decompression delay. In other words, $t_n^{nc} \gg d$. Hence,
$$t_n^{c} = t_n^{nc} + d \approx t_n^{nc}$$
Consequently, the only advantage of compression at this level is to save space.
Case 3: A general analysis. Figure 5 shows the maximum allowed decompression delay if using
compression is to be effective. For the particular set of parameters (p = 0.7, m × P = 300 CPU cycles), points
on the '--' curve show where compression neither helps nor hurts the average memory access time. Points
below the curve favor compression. As Section 5 will show, we can obtain C ≈ 1.5, so that the maximum
allowed decompression delay is about 60 CPU cycles. As d is proportional to m × P, we can calculate where
compression is advantageous by simply scaling the graph. For example, if p = 0.7 and m × P = 600 CPU cycles,
then d = 120 cycles. From the graph, if the original design has a large miss rate and the miss penalty is
large, a compression approach gains significant improvement. Clearly, as C grows, compression becomes
more feasible.
[Figure 5 plot (not reproduced): maximum allowed decompression delay; x-axis: C = compression ratio.]
Figure 5: The tradeoff conditions with various miss ratios and miss penalties. P = miss penalty to access
level i + 1. m = miss rate at level i. p = reduction ratio.
We empirically estimated m × P by measuring local miss ratios using trace-driven simulations. The
miss ratios and memory organization are shown in Table 3. To calculate the average access time and miss
penalty of each memory level, we assumed a 200 MHz processor with a disk access time of 5 ms - 15 ms and
a memory system with parameters similar to those in [HP90], as shown in Table 4. In this design, we observe
that the m × P values for the second level cache range from impractically small (1-3 cycles) to moderate (20-56
cycles). However, m × P for main memory is large (300 CPU cycles) even with the most pessimistic miss rate
measured and the shortest disk access time (5 ms). With a moderate miss rate of 0.05% and a 9 ms disk access time,
m × P = 900 CPU cycles, which makes compression a distinct possibility. For example, with p = 0.7 and
C = 1.5, the maximum allowed d is 180 CPU cycles. For a near-future CPU running at 400 MHz, using
compression becomes even more attractive.
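The main-memory figures quoted above can be retraced by hand. The working below is our own check, assuming the miss penalty P is dominated by the disk access time and reading the break-even bound as d < m × P × (1 - p^(lg C)):

$$m \times P = 0.0005 \times \bigl(9 \times 10^{-3}\,\text{s} \times 200 \times 10^{6}\,\text{cycles/s}\bigr) = 0.0005 \times 1.8 \times 10^{6} = 900 \text{ cycles}$$

$$d < 900\,\bigl(1 - 0.7^{\lg 1.5}\bigr) \approx 900 \times 0.188 \approx 170 \text{ cycles}$$

This is close to the 180 cycles quoted, which corresponds to scaling the roughly 60-cycle reading at m × P = 300 by a factor of three.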
                    | First level cache | Second level cache
Size (bytes)        | 8K - 256K         | 256K - 1M
Block size (bytes)  | 4 - 128           | 4 - 256
Associativity       | Direct            | Direct - Fully associative
Local miss rate (%) |                   |
Table 3: Local miss rates in % of various applications.
[Table 4 (cell layout not recoverable from the extracted text). Parameters listed: 200 MHz, 50 ns, 5 ms - 15 ms. Column headers: 2nd level cache, main memory (appearing twice). Row labels: access time (cycles), transfer time (cycles), miss rate (%), miss penalty (cycles), m × P (cycles). Cell values in extraction order: 5 - 66; 2 - 22; 0.06 - 0.6; 310 - 9310; 1 - 56; 310 - 9310; 10^4 - 10^(illegible); 0.03 - 0.31; 6 - 122; 0.06 - 0.6; 10^6 - 3 × (illegible); 300 - 9300; (illegible); (illegible) - 18610; 2 × (illegible); 1 - 112.]
Table 4: Design parameter sets.
3.2.3 Average Memory Access Time with Block Transfer Time
The effective access time at the i-th level memory, $t_i$, is still defined as $t_i = h_i + m_i \times P_i$. Every term in this
equation remains the same except $P_i$, which is now defined as
$$P_i = t_{i+1} + \frac{B_i}{X_{i+1}}$$
where $P_i$ = the penalty incurred when the access to the i-th level is a miss.
$t_{i+1}$ = the latency time to obtain the first data from level i + 1.
$B_i / X_{i+1}$ = time to transfer a block from level i + 1 to level i.
$B_i$ = block size at level i.
$X_{i+1}$ = transfer rate from level i + 1 to level i.
Equation 10 remains applicable, giving the following bound for $d_B$ for increased memory performance.
$$d_B < m_i^{nc} \Bigl(t_{i+1} + \frac{B_i}{X_{i+1}}\Bigr) (1 - p^{\lg C}) \tag{11}$$
As shown in Equation 11, the delay time allowed for decompression is increased when block transfer time
cannot be hidden by mechanisms such as early restart and out-of-order fetch.
4 Design of Memory Systems Using Compression 
Our compression method requires additional hardware for two reasons: runtime instruction decompression
and translation of uncompressed addresses to compressed addresses. For any compression algorithm, we
refer to the mapping from normal symbols to compressed symbols as the codebook. We maintain an address
table and a codebook for each process as shown in Figure 6.
The address translation problem occurs because the compressed instruction stream does not preserve a
linear addressing space. Thus, if we branch to (uncompressed) address A, where do we find A among the
compressed instructions? The address mapping table contains an index into the compressed instructions for
each cache index. For example, if the L2 cache has a line size of 256 bytes and addresses are 32 bits (4 bytes)
wide, then the address table would contain a 32-bit index into the compressed program for every 256 bytes
of code. Thus, the address table would be 4/256 = 1.6% of the original program size. As we shall see,
this additional overhead reduces the effective compression ratio.
Figure 6 shows that the hardware leaves the data stream intact. Decompression hardware also tracks
whether a program is in compressed form. A selective bypassing capability allows the system to run
uncompressed programs. In our example, we have assumed all caches are virtually addressed. The decompression
hardware stores a copy of the current codebook.
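[Figure 6 diagram (not reproduced): two panels, "Memory Hierarchy" and "Main Memory Usage"; labels include the user virtual address and a non-swappable region of main memory.]
Figure 6: Design of memory system with compression occurring at main memory.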
The operating system stores the codebook and the address mapping table in a non-swappable region in
the main memory. On a context switch, the operating system must reload the decode table with the
appropriate codebook.
As an example, we illustrate the sequence of actions for an L2 cache miss for virtual address A.
A hit at main memory: Look in the address mapping table for the L2 line holding address A. Read
the index X into the compressed instructions in main memory. Since this table is never swapped out,
it costs one extra main memory access latency. The decompression hardware then starts decompression
at index X in main memory.
A page fault: After translating the virtual address to a physical address, the operating system detects
a page fault. The page of compressed instructions is loaded from disk into main memory. We then
proceed as above when there is a hit on main memory.
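The address lookup in the hit case can be sketched in a few lines of Python. This is only an illustration of the idea of one table entry per uncompressed L2 line; the function names, the list-based table, and the compressed line sizes are hypothetical, and the real mechanism described here is hardware plus operating system support.

```python
LINE_SIZE = 256  # bytes of uncompressed code per L2 line (the example value in the text)


def build_address_table(compressed_line_sizes):
    """Start offset of each compressed line within the compressed program."""
    table = []
    offset = 0
    for size in compressed_line_sizes:
        table.append(offset)  # line k starts where line k-1 ended
        offset += size
    return table


def compressed_index(address_table, address, line_size=LINE_SIZE):
    """Map an uncompressed address A to the index X where decompression starts."""
    return address_table[address // line_size]


if __name__ == "__main__":
    # Hypothetical compressed sizes (bytes) of three consecutive 256-byte lines.
    table = build_address_table([170, 180, 160])
    print(table)
    print(compressed_index(table, 0x1FF))  # address 511 lies in line 1
```

With 4-byte table entries, one entry per 256-byte line gives the 4/256 overhead mentioned earlier.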
Both the address mapping table and the codebook require space in main memory and must be saved with the
compressed program in the filesystem. Therefore, the actual compression ratio must be adjusted. Figure 7
shows the adjusted compression ratio as a function of the compression ratio and the mapping overhead. Most
current workstations have L1 caches with line sizes greater than 32 bytes, or 8 instructions for most RISC processors
[HP90]. Thus, the overhead of address mapping will be less than 1/8. To be effective, the L2 I-cache
typically will have a much larger size and line size than the L1 I-cache. As the cache size in most workstations
is increased, the block size will change correspondingly. We expect the overhead to be less than 1/32. For
C = 1.5, the adjusted compression ratio is 1.4.
Adjusted C = 1/ ((l/C) + overhead) 
Figure 7: Adjusted compression ratio. 
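The adjustment formula shown with Figure 7 is simple enough to evaluate directly. The short Python sketch below (function name ours) reproduces the C = 1.5, overhead = 1/32 example from the text and also shows the effect of the larger 1/8 overhead.

```python
def adjusted_compression_ratio(c: float, overhead: float) -> float:
    """Adjusted C = 1 / ((1 / C) + overhead), with overhead as a fraction of original size."""
    return 1.0 / (1.0 / c + overhead)


if __name__ == "__main__":
    # The example in the text: C = 1.5 with a mapping overhead of 1/32.
    print(round(adjusted_compression_ratio(1.5, 1 / 32), 2))
    # A coarser mapping granularity (overhead 1/8) costs noticeably more.
    print(round(adjusted_compression_ratio(1.5, 1 / 8), 2))
```

The first value comes out at roughly 1.43, consistent with the adjusted ratio of 1.4 quoted above.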
5 Compression Methods and Basic Compression Unit
In this section, we address compression requirements and the choice of the smallest unit of code to be
compressed. We also present measurements for a simple compression method suitable for use in a memory
system.
5.1 Decompression requirements
Data compression [Hu~~][LEH~~][ST~~][WE84] has been used extensively to reduce data storage and
transmission costs. Recently, data is compressed on secondary storage, with the slight time penalty needed for
decompression more than offset by the increase in disk space. As an example, the operating system MS-DOS
6.0 contains a file-compression utility. These utilities compress the entire executable including instructions,
data, and the symbol table. Before execution, the entire program is decompressed and copied to main
memory. As we have seen, compression at the file system level must degrade memory system performance.
Because we will be decompressing fragments (i.e., a cache line) of a program at runtime, we require a
compression scheme that requires minimal synchronization between compression and decompression. On a
cache miss to instruction I, the system must locate I among the compressed instructions and decompress I,
filling the appropriate upstream cache slots. As I might be the target of a branch, I can have an arbitrary
address and the system might not have decompressed neighboring instructions. Thus, the Lempel-Ziv-Welch
(LZW) algorithm [WE84] is unsuitable because it uses a dynamic codebook for compression/decompression
that is built during a sequential pass over the data.
We only compress read-only items, such as instructions or read-only data. Thus, we do not consider 
compressing the writable data stream because data changes as the program executes so that it would have 
to be recompressed during a write. We do not know of any fast, effective technique that can compress small 
amounts (a cache line) of dynamically changing data. By considering only read-only items, the compression 
can be done at  compile/link time. At runtime we only need to do decompression. 
Our last consideration was the size of the basic compression unit (BCU). A small BCU offers little
opportunity for compression, as there is little repeated information. Hence, a basic block in a program,
normally 3-19 instructions, is not an effective BCU. A small procedure has the same potential drawback,
and there is no guarantee a program will not have small procedures. In addition, procedure calls and returns
complicate the use of a procedure as the BCU. Thus, we use an entire program as the BCU.
5.2 Experimental Compression Ratio 
In this section, we compare the compression ratios of several compression methods on various Unix
executables. We used the entire text segment from an executable as the BCU and fed it to the compression
algorithms. The compression ratios are measured on executable files of a SUN SPARC workstation running
SunOS 4.1.1.
After some experimentation, we found that independent compression of the different fields of a machine
instruction performed well. We broke down each instruction by its fields (opcode, operand, jump
displacement, immediate value, etc.) [R090] and compressed each field. For example, an opcode "LD", a register
"r31", and an immediate value "#4095" all belong to different fields. Each instruction uses only some of
the fields; e.g., an ADD instruction would not have a jump displacement field. We used this approach of
compressing fields on all the following strategies except LZW.
Most frequently used (MFU): For each field, we used an f-cache (field-cache) of fixed size preloaded with
the most frequent values for that field. E.g., an opcode f-cache of size four might be preloaded with
[0: (special index), 1: LOAD, 2: STORE, 3: BRANCH]. In the compressed instruction stream, each
field is an index into the appropriate f-cache. In the event the field value is not in the f-cache, we use
a special index (say 0) followed by the actual (uncompressed) value. Thus, the most frequently used
instructions are represented by f-cache indices, and all others result in f-cache misses. Indices into the
f-caches are shorter than the actual fields, giving compression.
For each field, we tried different f-cache sizes (always powers of two) and we selected the size providing
the best compression. The sizes of the f-caches differed depending on the field. Larger f-cache sizes
reduce the "miss rate", increasing compression, but require larger indices, decreasing compression.
The MFU method is ideal for use in a memory system, as the decompression hardware is always "in
sync" because the MFU f-caches are fixed. MFU lends itself to a straightforward implementation of
the decompression hardware.
Static Huffman coding: We estimated the effect of independently compressing e,ach instruction field 
via Huffman coding. We underestimated the compressed size by calculating the entropy of each field 
and then adding the space required for the Huffman trees. 
Compression bound: We calculated the entropy for each field, giving a theoretical upper bound for
compression schemes that independently compress each field. For field k (say the jump displacement
field) with possible values $f_1, f_2, \ldots, f_n$, the entropy is
$H_k = \sum_{i=1}^{n} -\Pr(f_i) \log_2 \Pr(f_i)$, where $\Pr(f_i)$ is the probability of $f_i$
occurring, given that field k exists. The entropy for the entire instruction is the
sum of the entropies for each field. Note that by adding the space for a Huffman encoding tree, we get
the Huffman bound.
While better compression might be possible by viewing instructions differently, our measurements
indicate our bound is fairly good (making it difficult to beat in practice).
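The per-field entropy calculation can be sketched as below. The function names and the toy field data are assumptions made for illustration; probabilities are estimated from the observed values of each field, counting only instructions in which that field exists.

```python
import math
from collections import Counter
from typing import Dict, List


def field_entropy(values: List[str]) -> float:
    """Entropy in bits per occurrence of one field, from observed frequencies."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def bound_bits(fields: Dict[str, List[str]]) -> float:
    """Lower bound on total compressed bits when each field is coded independently."""
    return sum(len(vals) * field_entropy(vals) for vals in fields.values())


# Hypothetical program: four opcodes, two of which carry an immediate field.
fields = {
    "opcode": ["LD", "LD", "ADD", "ST"],   # probabilities 1/2, 1/4, 1/4
    "imm": ["0", "4"],                     # probabilities 1/2, 1/2
}
opcode_entropy = field_entropy(fields["opcode"])   # 1.5 bits per opcode
total_bound = bound_bits(fields)                   # 4 * 1.5 + 2 * 1.0 bits
```

Each field contributes its entropy only for the instructions in which it occurs, which matches the conditioning "given that field k exists".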
Lempel-Ziv-Welch (LZW) [We84]: We also measured the popular LZW algorithm used by the UNIX
utility compress. The LZW result is used only as a comparison point because, as mentioned in
Section 5.1, LZW is unsuitable for our purposes.
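For readers unfamiliar with LZW, a minimal textbook-style compressor is sketched below. It is not the UNIX compress implementation (which adds variable-width codes and other refinements); the function name and the sample input are illustrative assumptions.

```python
from typing import Dict, List


def lzw_compress(data: bytes) -> List[int]:
    """Textbook LZW: emit dictionary codes, growing the dictionary as input is read."""
    table: Dict[bytes, int] = {bytes([i]): i for i in range(256)}
    current = b""
    out: List[int] = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate               # keep extending the match
        else:
            out.append(table[current])        # emit code for the longest match
            table[candidate] = len(table)     # learn the new string
            current = bytes([byte])
    if current:
        out.append(table[current])
    return out


codes = lzw_compress(b"ABABABA")
```

The dictionary is built adaptively from the input itself, so a decoder has to reconstruct it by processing the code stream from its beginning. This contrasts with the fixed f-caches of the MFU scheme, where the tables are known before any code is read.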
5.3 Results of Compression Methods 
Table 5: Experimental compression ratios. All sizes in bytes; all compression ratios in percent. 
Table 5 shows compression ratios of various files on a SUN4. The original size of the text (code) segment
is listed. The size for MFU compression includes the space for the preloaded f-cache values. For smaller
programs, the overhead due to the preloaded f-caches significantly decreased the compression ratio. For larger
programs, MFU had a compression ratio of roughly 150%, including the space for the f-cache codebook.
The size for static Huffman coding includes the size of the Huffman tree. The compression bound gives
the projected best possible compression. The majority of the difference between Huffman encoding and the
compression bound is due to the Huffman tree, which amounts to roughly 1/3 of the compression bound
size. The compression ratio of Huffman encoding is always greater than that of the simple MFU encoding.
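The effect of a fixed codebook on small versus large programs can be illustrated with a short calculation. The sketch assumes the compression ratio is the original size divided by the total compressed size (consistent with a ratio of 150% corresponding to 1.5); the function name and all byte counts are hypothetical and are not taken from Table 5.

```python
def compression_ratio_percent(original_bytes: int, payload_bytes: int,
                              codebook_bytes: int) -> float:
    """Original size over compressed size (payload plus codebook), in percent."""
    return 100.0 * original_bytes / (payload_bytes + codebook_bytes)


# Hypothetical figures: the same 1000-byte codebook and the same 1.5x payload
# compression, applied to a small and to a large program.
small = compression_ratio_percent(6_000, 4_000, 1_000)
large = compression_ratio_percent(600_000, 400_000, 1_000)
```

With these assumed numbers, the codebook pulls the small program down to 120%, while the large program stays just under 150%.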
We also listed the compression ratio of the LZW compression method. For small to medium size programs,
Huffman encoding performs slightly better than LZW encoding. For large programs, LZW usually gives
better compression ratios.
6 Conclusions
We have analyzed the effect of using compression in a memory system on the average system access time.
We have found that if a compression ratio of around 1.5 can be achieved, compression is feasible at main
memory for computers of today. We also found that the benefit from compression is quite sensitive to the
miss ratio and miss penalty at the level of compression.
We proposed a memory system design to deal with instruction decompression and address translation
and suggested OS support for this particular design. This design is capable of running compressed and
uncompressed programs. This capability provides a way to utilize compression when it improves memory
performance.
We have also measured the compression ratios of several different compression techniques. A simple
compression method using an f-cache of MFU values achieved compression ratios of roughly 150% on
larger programs. Static Huffman encoding gives even better compression ratios. With miss penalties
increasing in future systems, we believe using compression in the memory system will only become more
viable as time progresses.
7 Acknowledgements 
We thank Gary Lauterbach for his comments.
References 
[AGSH86] Agarwal, A., Sites, R., and Horowitz, M., ATUM: A New Technique for Capturing Address Traces
Using Microcode. Proceedings of the 13th Annual Symposium on Computer Architecture, June 1986,
pp. 119-127.
[FLMM87] Flynn, M. J., Mitchell, C., Mulder, H., And Now a Case for More Complex Instruction Sets.
IEEE Computer, Sep. 1987, pp. 71-83.
[HP90] Hennessy, J. L., Patterson, D. A., Computer Architecture: A Quantitative Approach. Morgan
Kaufmann Publishers, 1990.
[Hi88] Hill, M. D., A Case for Direct-Mapped Caches. IEEE Computer, Dec. 1988, pp. 25-40.
[Hu52] Huffman, D. A., A Method for the Construction of Minimum-Redundancy Codes. Proc. IRE, 40(9),
1952, pp. 1098-1101.
[LEH87] Lelewer, D. A., Hirschberg, D. S., Data Compression. ACM Computing Surveys, Vol. 19, No. 3,
Sep. 1987, pp. 261-296.
[PRHH88] Przybylski, S., Horowitz, M., Hennessy, J., Performance Tradeoffs in Cache Design. Proceedings
of the 15th Annual International Symposium on Computer Architecture, 1988, pp. 290-298.
[PRHH89] Przybylski, S., Horowitz, M., Hennessy, J., Characteristics of Performance-Optimal Multi-Level
Cache Hierarchies. Proceedings of the 16th Annual International Symposium on Computer Architecture,
1989, pp. 114-121.
[PR90] Przybylski, S., Cache and Memory Hierarchy Design: A Performance-Directed Approach. Morgan
Kaufmann Publishers, 1990.
[R090] ROSS Technology, Inc., SPARC RISC User's Guide. Cypress Semiconductor Corporation, 2nd Ed.,
Feb. 1990.
[SHL88] Short, R. T., Levy, H., A Simulation Study of Two-Level Caches. Proceedings of the 15th Annual
International Symposium on Computer Architecture, 1988, pp. 81-88.
[Sm82] Smith, A. J., Cache Memories. ACM Computing Surveys, Vol. 14, No. 3, Sep. 1982.
[St89] Steenkiste, P., The Impact of Code Density on Instruction Cache Performance. Proceedings of the
16th Annual International Symposium on Computer Architecture, 1989, pp. 252-259.
[St88] Storer, J. A., Data Compression: Methods and Theory. Computer Science Press, 1988.
[WAF87] Wakefield, S. P., Flynn, M. J., Reducing Execution Parameters Through Correspondence in
Computer Architecture. IBM J. Res. Develop., Vol. 31, No. 4, July 1987, pp. 420-434.
[We84] Welch, T. A., A Technique for High-Performance Data Compression. IEEE Computer, Jun. 1984,
pp. 8-19.
