This paper presents the power-performance trade-off of three different cache compression algorithms. Cache compression improves performance because compressed data increases the effective cache capacity, which reduces cache misses. The unused memory cells can be put into sleep mode to save static power. The performance gained and the power saved through cache compression must exceed, respectively, the delay and the power consumption added by the CODEC (COmpressor and DECompressor) block. Among the studied algorithms, the power-delay characteristic of Frequent Pattern Compression (FPC) is found to be the most suitable for cache compression.
I. INTRODUCTION
In recent years, a fair amount of research has addressed adding data compression to the memory path of microprocessors [1]. Typically, compression is proposed as a way to store more information in a given amount of memory. Some researchers have proposed increasing the aggregate performance of a microprocessor by using compression to improve the likelihood of a cache hit [2]. Because a CODEC block is added between the processor and the L2 cache, the delay of the critical path increases. For cache compression to be useful, the performance improvement due to the increase in effective cache size must not be offset by the additional delay of the CODEC block.
Another aspect of cache compression is reducing the overall static power by putting the unused memory cells in sleep mode [1], [3], [4]. In a CMOS design, static power is consumed through leakage current when a gate is not transitioning. As transistor size decreases [5], leakage current increases [6], [7]. To reduce the static power, the path between VDD and GND should be disconnected. This can be achieved by adding sleep transistors [8], [9], [10]. Again, the power saved by turning off the unused memory cells must not be offset by the power consumed by the CODEC block.
For the two reasons described above, it is important to study the power-delay trade-off of cache compression designs. The goal of this paper is to compare the power versus performance characteristics of three different algorithms: Frequent Pattern Compression (FPC) [11], Run Length Encoding (RLE), and Diff Lx [12].
The remainder of the paper is organized as follows. Section II reviews the cache compression algorithms, and Section III compares their compression performance. The circuit implementations of these algorithms are given in Section IV. Section V discusses the simulation results. Conclusions are presented in Section VI.
II. REVIEW OF THE COMPRESSION ALGORITHMS

A. Modified Frequent Pattern Compression
The original Frequent Pattern Compression (FPC) [11] is specifically designed for the compression of data cache contents. It is based on the observation that some data patterns are frequent and can also be compressed to fewer bits. For example, many small-value integers can be stored in 4, 8 or 16 bits, but are normally stored in a full 32-bit word. These frequent patterns can be treated specially to store them in a compact form. By encoding the frequent patterns, the effective cache size can be increased. In FPC, compression and decompression are performed on the entire cache line. Each cache line is divided into 32-bit words (e.g., 16 words for a 64-byte line). Each 32-bit word is encoded as a 3-bit prefix plus data. Table I shows the patterns that can be encoded using the FPC algorithm and the amount of compression attainable by each compression type. These patterns are selected based on their high frequency in many of our integer and commercial benchmarks. As shown in Table I, if the data does not match one of the first 6 compressible patterns, the original data is stored with a three-bit prefix. It is therefore possible for the algorithm to expand the data rather than compress it. To overcome this problem, the algorithm was modified. One bit, called the line bit, is added to each cache line to indicate whether or not the cache line is compressed. In this way, if a given cache line ends up with a negative compression ratio, the penalty is reduced to only one added bit out of the 8192 bits of a cache line (for this experiment, an 8191-bit cache line is used for all simulations).
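The effect of the line bit can be illustrated with a minimal sketch. It assumes a reduced, hypothetical pattern set (only the sign-extended 4-, 8- and 16-bit integers mentioned above) rather than the full dictionary of Table I, and the function names are illustrative only.

```python
PREFIX_BITS = 3  # per-word prefix, as described in the text


def sign_extends_from(word, bits):
    """True if the 32-bit word is the sign extension of its low `bits` bits."""
    word &= 0xFFFFFFFF
    upper = word >> (bits - 1)  # sign bit plus every bit above it
    return upper == 0 or upper == (1 << (33 - bits)) - 1


def word_bits(word):
    """Encoded size of one 32-bit word: prefix plus payload.

    Assumption: only three illustrative patterns are checked here,
    not the actual FPC dictionary.
    """
    for payload in (4, 8, 16):
        if sign_extends_from(word, payload):
            return PREFIX_BITS + payload
    return PREFIX_BITS + 32  # no pattern matched, so the word expands


def line_bits(words):
    """Stored size of a cache line, including the one-bit line flag."""
    raw = 32 * len(words)
    packed = sum(word_bits(w) for w in words)
    # The line bit selects whichever form is smaller, so the worst-case
    # penalty is a single bit per line.
    return 1 + min(packed, raw)


compressible = [0] * 16            # sixteen small integers
incompressible = [0x12345678] * 16  # no pattern matches
print(line_bits(compressible), line_bits(incompressible))
```

In this sketch a line of incompressible words costs 513 bits instead of the 560 bits that per-word prefixes alone would require.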
B. Run Length Encoding Compression
In an effort to find more ways for improving the compression results, a visual inspection was performed on the data which was uncompressible. This visual inspection revealed many repeated patterns at the sub-word level which led to the Run Length Encoding (RLE) algorithm.
RLE simply looks for repeated bit patterns. Two key parameters control the encoding: atom size and maximum run length. The atom size is the bit count of the smallest data division. For example, if RLE is performed at the nibble level, the atom size is 4 bits. Similarly, byte-level RLE has 8-bit atoms. The other key parameter is the Maximum Run Length (MRL). Encoding a run requires log2(MRL) bits, plus a "we have a run" bit and the data atom that is repeated. An example of the encoding with a 4-bit atom size and an MRL of 8 atoms is shown in Fig. 1. The first two data atoms in this example (green and yellow) were not compressible. The next several atoms (pink) were all identical and were encoded as a single 8-bit quantity: one run bit, three length bits and the 4-bit atom.
The ideal combination of MRL and atom size depends on the input data. For this data, the ideal parameters were determined to be an atom size of 16 bits and an MRL of 32 atoms (the length field is 5 bits).
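The bit accounting described above can be sketched as follows. This is only an illustration: the cost of a non-repeated atom (one flag bit plus the literal atom) is an assumption, since the text specifies only the cost of a run, and the function name is hypothetical.

```python
def rle_encoded_bits(atoms, atom_bits, mrl):
    """Count the output bits for a sequence of atoms given as integers."""
    length_bits = mrl.bit_length() - 1  # log2(MRL), assuming MRL is a power of two
    total = 0
    i = 0
    while i < len(atoms):
        # Measure the run starting at position i, capped at MRL atoms.
        run = 1
        while i + run < len(atoms) and atoms[i + run] == atoms[i] and run < mrl:
            run += 1
        if run > 1:
            total += 1 + length_bits + atom_bits  # run flag + length + atom
        else:
            total += 1 + atom_bits  # flag + literal atom (assumed cost)
        i += run
    return total


# Two literal nibbles followed by a run of five identical nibbles,
# using a 4-bit atom and an MRL of 8.
print(rle_encoded_bits([0xA, 0xB] + [0xC] * 5, atom_bits=4, mrl=8))
```

With the parameters chosen for this data (16-bit atoms, MRL of 32), each run costs 1 + 5 + 16 = 22 bits under the same assumptions.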
978-1-4244-7456-1/10/$26.00 ©2010 IEEE 
C. Diff Lx Compression
Diff-Lx was developed by Macii et al. [12]. It was used for VLIW (Very Long Instruction Word) processors, but the authors reported good results even with processors of smaller word size. They also report having implemented both the compression and decompression logic in purely combinational logic with a latency of only one clock period.
The algorithm encodes the difference from one word to the next. The basic idea behind the Diff-Lx differential scheme is that, for data words appearing in the same cache line, some of the bits are common across pairs of adjacent words, either from LSB to MSB or vice versa. The encoding scheme for Diff Lx is shown in Fig. 2. The first word of a cache line is stored verbatim. Every following word is then stored as a count field, a direction bit and a remaining-bits field. The count field stores the number of bits in the current word that are the same as in the previous word, and the direction bit stores the direction of similarity. The remaining-bits field (WCx) stores the dissimilar bits.
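The word-to-word encoding can be sketched as below. This is a minimal illustration and not the authors' circuit: the 16-bit word width, the meaning assigned to each direction value, and the function names are all assumptions.

```python
def common_msb_bits(a, b, width):
    """Number of equal bits counting from the MSB downwards."""
    diff = (a ^ b) & ((1 << width) - 1)
    return width - diff.bit_length()


def common_lsb_bits(a, b, width):
    """Number of equal bits counting from the LSB upwards."""
    diff = (a ^ b) & ((1 << width) - 1)
    if diff == 0:
        return width
    return (diff & -diff).bit_length() - 1  # index of the lowest differing bit


def encode_word(prev, cur, width=16):
    """Return (count, direction, remaining) for `cur` relative to `prev`.

    Assumed convention: direction 0 means the shared bits are at the MSB
    end, direction 1 means they are at the LSB end.
    """
    msb = common_msb_bits(prev, cur, width)
    lsb = common_lsb_bits(prev, cur, width)
    if msb >= lsb:
        return msb, 0, cur & ((1 << (width - msb)) - 1)
    return lsb, 1, (cur & ((1 << width) - 1)) >> lsb


def decode_word(prev, count, direction, remaining, width=16):
    """Rebuild the current word from the previous word and the encoded fields."""
    mask = (1 << width) - 1
    if direction == 0:
        shared = prev & mask & ~((1 << (width - count)) - 1)
        return shared | remaining
    return ((remaining << count) | (prev & ((1 << count) - 1))) & mask


count, direction, remaining = encode_word(0xAB12, 0xAB34)
print(count, direction, hex(remaining))
```

Only the remaining-bits field grows with the dissimilarity of adjacent words, so lines whose neighbouring words share long prefixes or suffixes compress well.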
III. COMPARISON OF COMPRESSION PERFORMANCE OF ALGORITHMS
Each algorithm (FPC, RLE, Diff Lx) was simulated to determine the compression ratio using various types of input data. To quantify the compression results for comparison purposes, a Compression Ratio (CR) was calculated as
$$\mathrm{CR} = 100\% \times \left(1 - \frac{\mathrm{Bits(out)}}{\mathrm{Bits(in)}}\right)$$
Using this scale, CR increases as the compressed data footprint decreases. A negative CR indicates that the compressed data is larger than the original (uncompressed) data. The goal is to find the algorithm which gives the highest average CR without worrying about excessively negative results in the worst case.
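As a small worked example of this metric, the sketch below reads the ratio Bits(out)/Bits(in) as a percentage; the function name is illustrative only.

```python
def compression_ratio(bits_in, bits_out):
    """CR in percent: positive when the output is smaller than the input."""
    return 100.0 * (1.0 - bits_out / bits_in)


print(compression_ratio(8192, 4096))  # output is half the size of the input
print(compression_ratio(512, 560))    # output is larger, so CR is negative
```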
To evaluate the performance of each compression algorithm, a set of data files was selected in an attempt to emulate data that might be found in the data cache of a modern microprocessor. The files, listed in Table II, were selected to be about 2 megabytes in size, and the types were chosen to represent common end-user applications.
Initial simulation results with the original FPC [11] showed a disturbing tendency towards negative compression ratios. Pattern usage histograms showed that most of the data words were falling into the uncompressed bin. This problem of negative CR is overcome by the modified FPC explained in Section II-A. Fig. 3 shows the results of simulating the modified FPC algorithm. Note that the addition of the line bit has been very effective in limiting the degree of negative compression in the JPEG sample data. The average compression ratio was about 10%. Table III shows the CR of all the simulated algorithms. Running the RLE simulation with an atom size of 16 bits and an MRL of 32 atoms on the set of sample files gave the results shown in Fig. 3. Every type of input file showed substantially less compression than with FPC, and the average CR dropped to 1%. The simulation results for Diff Lx are also shown in Fig. 3. For most data files, the results are mediocre. Only the Windows Library and Unicode Document files showed good results, pushing the average compression up to 8.5%.
IV. HARDWARE IMPLEMENTATION OF COMPRESSION SCHEME
The three selected compression algorithms were implemented in VHDL and simulated to verify their functionality. The algorithms were simulated with 8191-bit data lines to find the CR. To obtain their performance accurately, a transistor-level implementation of all the unique signal paths of each compression algorithm was derived, taking into account the gate fan-out associated with the complete design. This simplification allows the critical path of each compression algorithm to be found with a static timing analysis tool.
All the circuits were implemented in a commercial 180nm CMOS process, using the static CMOS design style. The length of the transistors was kept at the minimum, i.e. 0.18um. The minimum transistor width for this technology is 0.22um. The width of the NMOS transistors was kept at the minimum to reduce the CODEC's power consumption. The PMOS transistors were sized to achieve balanced rise and fall times at the output of all logic gates. Finally, the widths of the NMOS and PMOS transistors were adjusted so that the rise and fall times of every node in the circuits stayed below 500ps.
The critical path is defined as the path with the longest delay in the circuit. The critical paths of all the designs were determined using the static timing analysis tool from Synopsys (PrimeTime). PrimeTime PX was also used to obtain the power consumption of the circuits.
A. FPC Implementation
As shown in Fig. 4, the FPC compression block consists of a decoder, a RAM block, a comparator and an encoder. The decoder selects 32 bits at a time out of the 8191 input bits. The RAM block stores all the patterns, and the prefix of each pattern, listed in the FPC dictionary (Table I). The comparator compares the decoder-selected bits with the patterns stored in the RAM. If the decoder-selected bits match any of the stored patterns, the encoder generates the compressed and sleep signals from the prefix of the pattern and the decoder-selected bits. The decompressor block is fairly simple. It consists of a decoder and a combinational logic block. The decoder selects the 32-bit signal and the corresponding sleep signal from the compressed signal. The combinational logic block generates the decompressed signal, taking the decoder-selected signals and the sleep signal as inputs. The CODEC block has two sets of latches to synchronize its operation, so it takes two clock cycles for the data to come out. The effective delay of the circuit is therefore twice the critical path delay: the critical path delay is 0.74ns and, as shown in Table III, the effective delay is 1.48ns.
B. RLE Implementation
The RLE compressor is shown in Fig. 6. The atom (16 bits) is stored in a 16-bit Parallel-In-Parallel-Out (PIPO) register. The comparator compares the output of the PIPO register with the bits selected by the decoder. The counter keeps count of how many consecutive inputs match the atom stored in the register. The MRL for the circuit is 5. The compressed output and the sleep signals are generated from the register and counter outputs by the combinational block.
The decompressor is composed of a decoder, a shift register and a combinational block. The decoder selects a 16-bit compressed signal and the corresponding sleep signal at a time. The first 16 bits are stored in the PIPO register. The control logic determines the MRL by taking the sleep signal as input. Depending on the MRL and the decoder-selected input, the combinational logic generates the decompressed output. Fig. 7 shows the critical path of the RLE CODEC circuit. The rise time and fall time of each node are indicated as R/F in brackets. There is no latch in the circuit, so it takes one clock cycle for the data to come out. The path delay is 1.19ns (Table III).
C. Diff Lx Implementation
The flow of the Diff Lx compressor is shown in Fig. 8. The decoder selects 16 bits at a time from the 8191 input bits. A Parallel-In-Serial-Out (PISO) shift register is loaded with this 16-bit data. The register shifts the data continuously, either right or left depending on the direction bit, and a 2-bit comparator compares each shifted bit with the boundary bit. The output of this comparator is the input to the counter. The counter is incremented when the output of the shift register matches the boundary bit; otherwise it resets to 0. The output of the counter is stored in one of two shift registers, selected according to the shift direction of the PISO shift register. A four-bit comparator compares the outputs of these two shift registers. Depending on the comparator output, the output of one of the 4-bit registers is selected. The register output, the boundary bit of the corresponding register and the remaining uncompressed bits form the output of this circuit. The sleep signal is generated according to the number of bits in the compressed signal.
The decompressor consists of a decoder, a PISO shift register, a SIPO shift register, a down counter and a combinational function. The compressed input and the corresponding sleep signals are selected by the decoder. The PISO S/R is loaded with the input signal. The control block controls the shift direction of the SIPO. The output of the PISO S/R is stored in the SIPO register depending on the output of the combinational logic block, which is controlled by the counter. The SIPO output and the decoder output go to the combinational logic block to generate the decompressed output. Fig. 9 shows the critical path of the Diff Lx CODEC circuit. The rise time and fall time of each node are indicated as R/F in brackets. There is no latch in the circuit, so it takes one clock cycle for the data to come out. As shown in Table III, the effective path delay is 1.56ns.
[...]tion (approx. 50% of FPC), and the delay of its critical path is about the least of all the circuits. However, it has a much worse CR than the other two. Because of this poor CR, fewer memory cells can be put in sleep mode, so the purpose of saving power by putting memory cells in sleep mode is not served by this algorithm. RLE is therefore not suitable for cache compression.
We can see from Table III that both FPC and Diff Lx have good CR. We can observe from Table IV that the power consumption of FPC and Diff Lx is also comparable. The critical path delay of FPC is 5% less than that of Diff Lx. Therefore FPC, with a reasonable compression ratio and power consumption but a lower effective delay, can be used as the best cache compression algorithm.
VI. CONCLUSIONS
Although FPC is not the best algorithm with respect to power consumption, with less delay and a good CR it appears to be the best algorithm under consideration. With a good CR, more transistors can be put into off mode in the CODEC block. The CR, power and delay of each of these circuits are described in this paper. The next challenge is to integrate this CODEC with a processor and check whether the power consumed by the CODEC block is really nominal compared to the power saved by the sleep transistors. That would be the final test of the utility of the CODEC method for increasing processor performance.
