Abstract: Low-power and high-performance data compressors play an increasingly important role in the portable mobile computing and wireless communication markets. Among lossless data compression algorithms for hardware implementation, LZ77 is one of the most widely used. For real-time communication, several hardware LZ compressors/decompressors have been proposed in the past. Content addressable memory (CAM) is widely considered the most efficient architecture for the pattern matching required by the LZ77 compression process. In this paper, we propose a low-power CAM-based LZ77 data compressor. By shutting down the power for unnecessary comparisons between the CAM words and the input symbol, the proposed CAM architecture consumes much less power than conventional ones without noticeable performance penalty. Moreover, using the proposed conditional comparison mechanism and the novel CAM cell with NAND-type matching logic, on average we obtain an improvement of close to two orders of magnitude in power consumption, i.e., a reduction of more than 98 percent for 8-bit words. Speed is sacrificed if we use the NAND-type matching logic, but the NAND-type logic and the NOR-type logic can be combined to provide the best solution that balances power and delay. Our approach can also be applied to general-purpose CAMs that use valid bits, provided the proposed design techniques are adopted.
INTRODUCTION
Data compression is an economical way to increase the effective volume of a storage device and the effective bandwidth of a data communication channel. With the advent of VLSI technologies, more and more wireless and portable products are coming into our lives. As an increasing number of functions are built into these products, large-volume data transmission and intensive computation are becoming inevitable. Real-time, low-power transmission/computation is now the main design objective. Therefore, low-power and high-performance data compressors will play an increasingly important role in the growing portable computing and wireless communication markets.
Among the many proposed lossless data compression algorithms, the LZ77 algorithm [1] is the most widely used. The term lossless means that the data recovered by decompression must be identical to the original data. Variants of LZ77, such as compress (a typical compression program adopted in many UNIX systems), arj, lha, zip, and zoo, have become popular software tools. To fulfill real-time requirements, several works on hardware realization of LZ77 or its variants have been presented in the literature. The core computation in the LZ77 compression algorithm, which is the most time-, area-, and power-consuming task, is searching for a given string in a rather large buffer that stores previously processed input data. Several hardware architectures, including content addressable memory (CAM) [2], [3], [4], systolic array [5], [6], [7], [8], [9], and embedded processor [10], have been proposed in the past.
CAM has been considered the fastest architecture among all proposed hardware solutions for searching for a given string, as required in LZ77. A CAM-based LZ77 data compressor can process one input symbol per clock cycle, regardless of the buffer size and string length. Even if a CAM-based compressor's clock rate is only about half that of a systolic-array-based one [4], [8], [9], the former is still much faster than the latter as far as the overall compression efficiency is concerned. This is because the total number of clock cycles required by a CAM-based compressor is much smaller than that required by a systolic-array-based one. However, CAM's major drawbacks are its high hardware complexity and high power consumption. As IC fabrication technology continues to advance into the deep submicron age, large and complex functional blocks or cores with hundreds of thousands of logic gates or more (such as the CAM block discussed here) become feasible and popular. Silicon area is becoming less of a concern, but power consumption (heat removal) is becoming a major obstacle. For the CAM-based LZ77 data compressor, power consumption needs to be investigated and reduced.
In this paper, we focus on the design of a low-power CAM-based data compressor. The decompressor can be realized in a much simpler way: it consists of a RAM, an address decoder, and simple control logic [1]. We have examined every execution step of the CAM while it is searching for a match string in its contents, and found that many comparisons between the input symbol and the CAM words are actually redundant; these comparisons waste power. Based on this observation, we modify a CAM word's matching mechanism, without noticeable performance degradation, so that the power supply of the matching mechanism is cut off whenever a compare operation is unnecessary. Experimental results show that about 78.77 percent of the power consumed by the comparison mechanism, which is one of the two main sources of power consumption in the CAM, can be saved. To further reduce the power, we also propose a novel CAM cell with NAND-type matching logic. On average, we obtain an improvement of close to two orders of magnitude in power consumption as a whole. Speed is sacrificed if we use the NAND-type matching logic, but the NAND-type and NOR-type implementations can be combined to provide the best solution that balances power and delay. This low-power technique is not only very efficient for the CAM used in the LZ77 compressor, but can also be easily applied to other types of CAM, such as those using valid bits.
The paper is organized as follows: In Section 2, we present a typical procedure describing how CAM processes input symbols in an LZ77 compressor. From this procedure, we identify the compare operations that are redundant and can be removed. The CAM cell and word structure adopted in this paper is presented in Section 3. Under the structure, the bit lines (which are the major consumers of power in the CAM) have the lowest switching activities. In Section 4, we propose an approach to further reducing power by turning off the comparison mechanism for those words that do not need to be compared with the input symbol. A detailed analysis of the redundant comparisons is given. Other possible implementations considering the trade-offs between power and performance are also discussed. Finally, conclusions are given in Section 5.
CAM OPERATION IN LZ77
We assume the data to be compressed is composed of symbols. To increase the CAM's data-processing speed and simplify the logic blocks associated with the CAM, the CAM word length is selected to be equal to the symbol size (number of bits). For example, for characters using the ASCII code, the CAM word length will be eight bits. The address space of the CAM is considered cyclic. Let the CAM block have $N$ words, i.e., $\mathit{word}_0, \mathit{word}_1, \ldots, \mathit{word}_{N-1}$. For $1 \le i \le N-2$, the neighboring predecessor and successor of $\mathit{word}_i$ are $\mathit{word}_{i-1}$ and $\mathit{word}_{i+1}$, respectively. Furthermore, $\mathit{word}_{N-1}$ is the predecessor of $\mathit{word}_0$ ($\mathit{word}_0$ is the successor of $\mathit{word}_{N-1}$). Every word contains an additional flag bit, which indicates whether the word and its predecessors are still candidates for a match string.
All the CAM's activities required for processing an incoming string of input symbols according to the LZ77 compression algorithm are described by the procedure LZ-CAM().

Although there are many LZ77 variants, the central task of the compression process, string searching, can still be performed efficiently by a CAM, with possible slight modification to this procedure. The total number of CAM words and the maximal value for ML are two important LZ77 parameters which must be carefully determined according to the characteristics of the input data to achieve the best compression ratio. This can be easily done by simulation [9].
A CAM can simultaneously execute $N$ comparisons between the input symbol and each of the $N$ words stored in the CAM in Step 3 in one clock cycle. However, we found that many comparisons are redundant and, thus, can be removed (see Condition b of Step 4). In fact, Step 3 of LZ-CAM() can be rewritten as follows without affecting its correctness:
3. Compare the next input symbol with only the words whose neighboring predecessors' flags are 1.
That is, if the flag of a word's neighboring predecessor is 0 before Step 3, the flag of this word must be 0 after Step 4, no matter whether the compare result between the word and the input symbol is "matched" or "unmatched." Therefore, the comparison is redundant and we can save power by not doing it.
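The equivalence of the two forms of Step 3 can be checked with a small software model. The sketch below is an illustration only, not the LZ-CAM() listing: the function names are hypothetical, the buffer is held static (no symbol is written during the search), and the flag update is assumed to be "new flag = predecessor's old flag AND match," with all flags set to 1 before the search.

```python
def step_full(words, flags, symbol):
    """One compare cycle in which every word is compared (original Step 3)."""
    n = len(words)
    matched = [w == symbol for w in words]  # n comparisons
    # flags[i - 1] with i == 0 is flags[-1], the last word: cyclic address space.
    new_flags = [flags[i - 1] and matched[i] for i in range(n)]
    return new_flags, n


def step_conditional(words, flags, symbol):
    """One compare cycle that skips words whose predecessor's flag is 0."""
    n = len(words)
    new_flags = [False] * n
    compared = 0
    for i in range(n):
        if flags[i - 1]:  # only these words are compared at all
            compared += 1
            new_flags[i] = words[i] == symbol
    return new_flags, compared


def search(words, string, step):
    """Run one match-string search; return final flags and comparison count."""
    flags = [True] * len(words)  # all words start as candidates
    total = 0
    for symbol in string:
        flags, count = step(words, flags, symbol)
        total += count
    return flags, total


buffer_words = list("abcabcab")
full_flags, full_count = search(buffer_words, "abc", step_full)
cond_flags, cond_count = search(buffer_words, "abc", step_conditional)
print(full_flags == cond_flags, full_count, cond_count)
```

In this toy run both variants end with the same flags, while the conditional variant performs 14 comparisons instead of 24.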
In the original LZ77 algorithm, the buffer is treated as a FIFO (first-in first-out) queue. Whenever an input symbol has been processed, it is shifted into the buffer from one end and an unused symbol is shifted out from the other end. From the software point of view, however, direct shifting of an input symbol is an inefficient task which requires $N$ read-and-write commands for a buffer of $N$ words. A more efficient way is to treat the buffer as a circular queue. A pointer P is used to point to the word that should be shifted out in the next cycle. Note that P also points to the location into which an input symbol should be inserted. The "shifting" that involves all elements in the FIFO does not really take place. This strategy is adopted in LZ-CAM().
There are other possible hardware solutions for the shift operation. The most straightforward method is to use a shift register, which performs the shifting in constant time. The operating speed is basically independent of the buffer size if we neglect the delay increase due to the clock load growth. In [2], a CAM block with an embedded shifting function was proposed to meet this requirement. Although shifting is not a time-critical operation, it is a very power-consuming task. For random bit patterns, about one half of the flip-flops (FFs) in the shift register will change their states and consume power in each clock cycle. If a buffer of 2,048 eight-bit words is considered, at least 2,048 shift register stages will be required. Each of the register stages stores eight bits of data, i.e., is composed of eight flip-flops. Furthermore, power consumption of the clock tree must be taken into consideration in such a large circuit. For a low-power LZ compressor, neither shift registers nor the shiftable CAM design proposed in [2] are suitable for the buffer implementation.
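To make the scale concrete, a back-of-envelope count under the random-pattern assumption stated above (one half of the flip-flops toggling per cycle) gives:

```latex
\begin{align*}
\text{flip-flops} &= 2{,}048 \times 8 = 16{,}384,\\
\text{expected toggles per clock cycle} &\approx \tfrac{1}{2} \times 16{,}384 = 8{,}192.
\end{align*}
```

The clock tree must additionally drive all 16,384 flip-flops in every cycle, whether or not they toggle.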
Note that, when the pointer P is pointing to the last location, it will point back to the first location in Step 6. Such a pointer can be easily implemented by a binary counter with reset capability. Generally, LZ77 requires a buffer with a capacity of $N = 2^k$ symbols, which is equal to the number of words contained in the CAM. When a $k$-bit binary counter reaches the maximal value $2^k - 1$, it will automatically go back to 0 in the next cycle. A combination of a counter and a $k$-to-$2^k$ decoder can generate the necessary write-enable (WE) signals for all CAM words.
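The counter-plus-decoder pointer can be modeled in a few lines. This is a behavioral sketch only; the class name `WritePointer` and its methods are hypothetical and stand in for the counter and decoder hardware.

```python
class WritePointer:
    """Behavioral model of a k-bit counter feeding a k-to-2^k decoder."""

    def __init__(self, k):
        self.k = k
        self.mask = (1 << k) - 1  # 2^k - 1, the maximal counter value
        self.value = 0

    def reset(self):
        self.value = 0

    def write_enables(self):
        """Decoder output: one-hot WE vector, 1 only at the pointed word."""
        return [int(i == self.value) for i in range(1 << self.k)]

    def advance(self):
        """Increment; a k-bit counter wraps from 2^k - 1 back to 0."""
        self.value = (self.value + 1) & self.mask


p = WritePointer(k=3)  # models an 8-word CAM
for _ in range(7):
    p.advance()
we_at_last = p.write_enables()
p.advance()  # wraps around without any explicit comparison
print(we_at_last, p.value)
```

Because the buffer size is a power of two, the wraparound needs no comparator: it falls out of the counter width.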
CAM STRUCTURE
A typical one-bit CAM cell adopted in the LZ77 data compressor is shown in Fig. 1 [11], [4]. The cell consists of a traditional SRAM cell, a cross-coupled XOR (exclusive-or) gate, and a pull-down transistor (PDT, shown at the bottom of the figure). The PDT is gate-controlled by the output of the XOR gate. If an input bit, represented by the two complementary voltage values on the $\mathit{bit}$ and $\overline{\mathit{bit}}$ lines, is not identical to the stored bit in the cell, the PDT is turned on and the node match is shorted to ground; otherwise, the PDT is turned off and match is in the high-impedance (HZ) state. The input WE is the write-enable signal. The input bit is written into the SRAM cell when WE is high. A CAM word is composed of eight cells with all their match and WE signals connected together. The connected PDTs thus form an open-drain eight-input NOR gate. The common match node is in the HZ state if and only if every cell in the word matches its corresponding input bit. If at least one cell does not match the input bit, match will be low.
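The logical behavior of the NOR-type word can be summarized by a small model. It is a sketch at the logic level only, with hypothetical function names, and it treats the HZ state of the match node as logic high (as it would be with a pull-up or precharge device).

```python
def pdt_on(stored_bit, input_bit):
    """The XOR output drives the PDT gate: on when the bits differ."""
    return stored_bit ^ input_bit


def nor_word_match(stored, symbol, width=8):
    """Open-drain NOR: match stays high only if no PDT in the word is on."""
    for b in range(width):
        if pdt_on((stored >> b) & 1, (symbol >> b) & 1):
            return 0  # at least one PDT shorts the match node to ground
    return 1


print(nor_word_match(0x41, 0x41), nor_word_match(0x41, 0x40))
```

A single differing bit is enough to pull the shared match node low, which is why the NOR structure is fast but draws current in every mismatching word.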
Determining the size of the PDT is a dilemma. A large PDT can speed up the compare operation, since the match node can be quickly discharged when the input symbol is not identical to the word. However, a large PDT also increases the load on the bit lines, so power and speed penalties must be paid. A compromise solution is to partition a long CAM word into "bytes," each of which contains a small number of cells. Every byte has an individual match node. All match nodes from the bytes are then ANDed together to form the final match signal.
Two widely used logic families capable of accomplishing the desired NOR function for the match signal are the pseudo-nMOS logic [3], [4] and the dynamic-CMOS logic [12]. Both logic families have their own drawbacks as far as low-power CAM design is concerned. For the pseudo-nMOS logic gate, there is always a static current in the shorted path from $V_{DD}$ to ground if the input symbol does not match the word. Assume that the input data is random. The probability that two symbols are identical is only $1/2^n$, where $n$ is the number of bits in a symbol. That is, if $n = 8$ (which is assumed for most data-compression applications), about $255/256 = 99.61\%$ of the total CAM words will conduct this static current in every compare operation. One way to lower the power consumption is to increase the resistance of the pull-ups. However, this also increases the access time of the match line.
The dynamic-CMOS structure is a more appropriate choice since the static current is blocked by the pMOS precharge transistor. There is additional power consumption due to the precharge clock signal, which is a global line spanning all the CAM words and requires large driving buffers. However, the power saving due to the elimination of static currents is still very obvious.
To further lower the power consumption, the waveform depicted in Fig. 2 can be adopted. All bit lines are pulled low during the precharge phase so that all PDTs are turned off, allowing match to be pushed to the high state while avoiding the transient short-circuit current through the precharge transistor and the PDTs. Unfortunately, such a waveform has the drawback of increasing the switching activities of the bit lines. For every pair of bit lines, one line must make a transition in each cycle, no matter what the input patterns are.
The power dissipated on the bit lines is also one of the dominant factors for the total power consumption. Similar to the precharge clock, the bit lines are globally distributed over the whole CAM. Large driving buffers are required. The stray capacitance and the MOS drain/source junction capacitance associated with every CAM cell impose very heavy loading onto the bit lines. The bit lines also need to drive the PDTs and the SRAM cells. Obviously, the waveform shown in Fig. 2 will cause the bit lines to switch more frequently and, thus, to consume more power. In this paper, therefore, we use the waveform shown in Fig. 3 . If random input data is to be compressed, only half of the bit lines will transit on average in each clock cycle.
Besides the bit lines, the comparison mechanism is also an important power consumer. To reduce the power consumed there, we will propose an improved NOR structure by which most of the static currents will be cut off during the compression process. This will be discussed later.
REMOVAL OF REDUNDANT COMPARISONS

Conditional Comparison Mechanism
In LZ77, the CAM compares every input symbol with all the stored words. It writes the current symbol into a specific location in the memory core before reading the next input symbol for subsequent comparison. According to LZ-CAM() as discussed in Section 2, the CAM only needs to compare the input symbol with those words whose neighboring predecessors' flags are 1 after the previous cycle. The heuristic is that if the word located at address $i$ does not match the current input symbol, it is unnecessary for the word at address $i+1$ to be compared with the next input symbol. This is derived directly from the LZ77 algorithm. A typical match logic for a pair of neighboring words, i.e., words $i$ and $i+1$, is depicted in Fig. 4. In the figure, we use AND gates to mask unnecessary comparison results on the match nodes. Specifically, the match logic of each word consists of an AND gate and a flip-flop (FF). At the beginning of the search for a match string, all FFs are preset to high to enable all AND gates. This corresponds to Step 2 in LZ-CAM(). If the CAM word at address $i$ does not match the input symbol, the signal $\mathit{match}_i$ will be low. When the FF is triggered by the system clock to store the output value of $\mathit{AND}_i$ (which is low), $m_i$ becomes low and forces the output of $\mathit{AND}_{i+1}$ low. Thus, $m_{i+1}$ also becomes low in the next cycle no matter what $\mathit{match}_{i+1}$ is. The $m_i$ outputs from all words are sent to an address arbiter to generate an appropriate starting address of a match string for codeword generation.
Whenever $m_i$ is low, $\mathit{match}_{i+1}$ is always masked by the gate $\mathit{AND}_{i+1}$. The compare operations carried out at all $\mathit{word}_j$, $j > i$, become redundant. It does not matter whether the $\mathit{match}_j$ values are correct or not. To reduce the power consumption, we can turn off the comparison mechanism at all addresses $j$, $j > i$. A straightforward method for this purpose is to break the connection between the comparison mechanism of $\mathit{word}_j$ and $V_{DD}$. In Fig. 5, we show a novel conditional comparison mechanism (CCM), in which the gate of the p-type active load is connected to the inverse of the predecessor's match output (i.e., $\overline{m_i}$). If $m_i$ is low, the path from $V_{DD}$ to $\mathit{match}_{i+1}$ is open. No switching activity can occur at $\mathit{match}_{i+1}$ for the subsequent compare operation and the static current is totally blocked. Therefore, no power will be consumed for the redundant comparison on all $\mathit{word}_j$, $j > i$.
We designed an 8-bit CAM in a typical 0.35 μm CMOS technology. The operating frequency is 50 MHz at 3.3 V. To make a fair power comparison between the traditional and conditional match logic, all operating conditions are the same for both. For example, to achieve the same match precharge period, the cascaded pMOS transistors in the CCM need to be slightly enlarged to shorten the delay. The SPICE simulation shows that both structures consume almost the same power when the CCM is turned on during a compare cycle. If the CCM is turned off, however, its power consumption is negligible.
For a general-purpose CAM or some other application-specific CAMs, there is no match logic as discussed here. However, the CCM is still applicable to them. Usually, there is a valid bit associated with every word in such CAMs. Whenever an input symbol is written into a word, the valid bit is set to indicate that the data stored in this word is effective and should participate in the compare operations. If the valid bit is implemented by an FF, the gate of the p-type active load can be connected to the complemented output of the FF. Therefore, there will be no power consumed by the pseudo-nMOS structure in each word, unless the word becomes valid.
Redundancy Analysis
We now analyze the efficiency of the approach by calculating the number of words that can benefit from the proposed CCM.
Consider a typical data compressor using a CAM of 2,048 8-bit words as the buffer. In the original CAM, a significant power level is required for a word when the word does not match the input symbol. If the CCM is adopted, however, the significant power level exists only when the CCM is also turned on. Therefore, the average number of words whose CCMs are turned on will be approximately proportional to the power consumption during a compare cycle.
Assume that the input symbols are randomly distributed. On average, only $2{,}048 \times \frac{1}{256} = 8$ words can match any input symbol. That is, there are $2{,}048 - 8 = 2{,}040$ words that will require the power if the original CAM structure is used. To obtain more realistic figures, we ran the LZ77 algorithm on several benchmark files. Our experiments were performed on real files from the Calgary/Canterbury text compression corpus [13]. The second column of Table 1 shows the average number of words, denoted as $H_{CM}$, that do not match the input symbol in a compare cycle. These words require significant power while being compared with a given input symbol. We can see that on average 89.81 percent of the CAM words will dissipate substantial power for this set of files.
If the proposed CCM is adopted, all CAM words still must participate in the comparison with the first input symbol during the first cycle of the search for a new match string. However, the active loads of the match logic in most of the words will be open in the next cycle since these words do not need to be compared with the second input symbol during the next cycle. All subsequent compare operations involve only a very small portion of the CAM words.
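The effect described above can be illustrated by counting, cycle by cycle, the words that would conduct static current. The sketch below is a toy model under stated assumptions, not the authors' experiment: the function name is hypothetical, the buffer is static during the search, and a word is taken to conduct static current when its active load is enabled (its predecessor's FF output is 1) and it mismatches the input symbol, as in the pseudo-nMOS NOR structure.

```python
def static_current_counts(words, string):
    """Per-cycle count of words whose load is on and which mismatch the symbol."""
    n = len(words)
    m = [True] * n  # FF outputs, preset high before the search
    counts = []
    for symbol in string:
        # Load of word i is on iff predecessor's FF is 1 (index -1 wraps: cyclic).
        enabled = [m[i - 1] for i in range(n)]
        match = [words[i] == symbol for i in range(n)]
        counts.append(sum(1 for i in range(n) if enabled[i] and not match[i]))
        m = [enabled[i] and match[i] for i in range(n)]  # AND gate, then FF
    return counts


counts = static_current_counts(list("abcabcab"), "abc")
print(counts, sum(counts))
```

In this eight-word example the first cycle dominates (5 conducting words), and the later cycles contribute almost nothing (0 and 1), which is the pattern the paragraph describes.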
Assume that a given $n$-symbol match string $s_0 s_1 \cdots s_{n-1}$ can be found in the buffer. The number of CAM words which will conduct static current while being compared with a specific input symbol, $s_i$, is denoted as $D_i$, where $0 \le i < n$. Obviously, using the CCM, we have

[...] (1)

The total number of comparison operations resulting in static current in the search for the match string is

$$H = D_0 + D_1 + \cdots + D_{n-1}. \qquad (2)$$

Assume that there are $k$ match strings for an input file after we have performed the LZ77 compression algorithm, i.e., there are $k$ codewords in the final compressed result. The match strings are referred to as $S_i$, $1 \le i \le k$. Also, for each $S_i$, there is a corresponding $H_i$ (the number of comparisons resulting in static current in the search for a match of $S_i$). The average number of comparisons resulting in static current for each input symbol is

$$H_{CCM} = \frac{1}{N} \sum_{i=1}^{k} H_i, \qquad (3)$$

where $N$ is the total number of symbols contained in the input data. That is, the value of $H_{CCM}$ refers to the average number of CAM words whose CCM will dissipate static power during a compare operation for a given input symbol. The fourth column of [...]

$$L_{avg} = \frac{1}{k} \sum_{i=1}^{k} |S_i|, \qquad (4)$$

where $|S_i|$ is the string length of $S_i$. If longer match strings are often encountered while compressing a specific input file, $L_{avg}$ will be larger. From Table 1, we find that the ratio of $H_{CCM}$ to $H_{CM}$ decreases as $L_{avg}$ increases. This is also clear from the relation between $H_{CCM}$ and $L_{avg}$ (see (1)-(4)). On average, only 19.07 percent of the total CAM words can lead to static current under the CCM scheme. This number is only about 21.23 percent of $H_{CM}$ (the original comparison mechanism). In other words, about 78.77 percent of the power consumed by the comparison mechanism of the original CAM can be saved.
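The two closing percentages follow directly from the averages quoted in the text, namely 89.81 percent of the words conducting static current with the original comparison mechanism and 19.07 percent with the CCM:

```latex
\frac{19.07}{89.81} \approx 0.2123,
\qquad
1 - 0.2123 = 0.7877 .
```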
NAND Implementation
At the beginning of the search for a new match string, all flip-flops must be preset to high when we use the CCM. This means that all CAM words must participate in the compare operation during this cycle. Therefore, $D_0$ in (1) is usually a rather large value, which is equivalent to $H_{CM}$. To further reduce the power consumption, we propose the cell shown in Fig. 6. The two complementary inputs from the SRAM cell to the cross-coupled XOR gate are reversed, so the PDT is turned on if the input bit is identical to the stored bit and is turned off otherwise. In addition, all the PDTs of the same word are cascaded to form an open-drain NAND gate: one end of the PDT-chain is connected to ground, and the other end is the match node. The match node will be discharged to ground when all the PDTs of the same word are turned on. Therefore, only those words identical to the input symbol can pull their own match nodes to ground and consume power during this cycle. The CCM discussed above is also applicable to this structure, as illustrated in Fig. 7. The match logic consists of an OR gate and an FF. All FFs are reset to 0 to enable the OR gates before a new string search begins. If $\mathit{word}_i$ does not match the current input symbol, the output of its FF will turn off the p-type active load of $\mathit{word}_{i+1}$ to block the possible static current during the next cycle. Table 2 shows the average numbers of CAM words that will conduct static currents. The numbers, denoted as $H_{CCM}$, were obtained from the experiments done on the same set of benchmark files. From the table, we can see that the number of words that consume power during the compare operation has been greatly reduced. On average, we obtain an improvement of close to two orders of magnitude, as shown in the third column of the table. There is an exception: the file pic, which is a monochrome bitmap picture and consists of large amounts of white space. The white space is represented by very long runs of 0s in its graphic format.
Thus, the CAM is frequently filled with 0s during the compression process. Whenever there is a long run of 0s in the CAM, each 0 will be compared with almost all of the CAM words. Therefore, pic suffers from a high $H_{CCM}$ value.

The main drawback of the pseudo-nMOS NAND gate is its long delay. The critical path is a cascade of $n$ PDTs, where $n$ is the number of cells in a word. If a word does not match an input symbol due to the difference between the leftmost cell (see Fig. 7) and the input bit, and the word matches the next input symbol, then it will take the longest time to discharge the match line. In this case, the period of a compare operation increases as compared with the NOR-type implementation. The NAND-type implementation is suitable when power is more of a concern than speed. Otherwise, the NOR-type implementation should be used. If the fan-in of the NAND gate is large (e.g., greater than eight, which is rare in real applications), the chain can be partitioned and a multilevel NAND implementation can be used. For example, a high-fan-in NAND function has a two-level AND-NAND implementation, which is equivalent to the NAND-OR implementation, as shown in Fig. 8. This, in fact, is a trade-off between the NOR-type implementation (for delay) and the NAND-type implementation (for power). From the figure, we see that to speed up the discharge of the match node through the long PDT-chain, the long word is partitioned into segments (two segments are shown in the figure). Each segment has its own pseudo-nMOS NAND gate and an individual match node. The CCM is the same as that shown in Fig. 7, except that all the match nodes in the same word and the FF output of the preceding word should be ORed together.
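The segmented NAND-OR arrangement can be sketched at the logic level as follows. This is an illustrative model only: the function names are hypothetical, the word width is assumed to be divisible by the number of segments, and the FF signals are taken as active low (0 means "still a candidate"), consistent with the FFs being reset to 0 before a search.

```python
def segment_conducts(stored, symbol, lo, hi):
    """Series PDT chain over bits lo..hi-1: conducts only if every bit matches."""
    return all(((stored >> b) & 1) == ((symbol >> b) & 1) for b in range(lo, hi))


def nand_word_next_ff(stored, symbol, prev_ff, width=8, segments=2):
    """OR of all segment match nodes and the preceding word's FF output.

    Each segment node is 0 when its chain discharges it (segment matches)
    and 1 otherwise. The result is 0 only if prev_ff is 0 and all segments match.
    """
    seg = width // segments
    nodes = [
        0 if segment_conducts(stored, symbol, s * seg, (s + 1) * seg) else 1
        for s in range(segments)
    ]
    return int(bool(prev_ff) or any(nodes))


print(
    nand_word_next_ff(0x41, 0x41, prev_ff=0),  # full match, predecessor alive
    nand_word_next_ff(0x41, 0x40, prev_ff=0),  # low segment mismatches
    nand_word_next_ff(0x41, 0x41, prev_ff=1),  # predecessor already dropped
)
```

Splitting the chain shortens each series path from eight PDTs to four in this example, at the cost of one extra match node per word.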
CONCLUSION
In this paper, we present a low-power application-specific CAM design, which is the fundamental functional block of a high-performance LZ77-based data compressor. The proposed CAM consumes much less power than the conventional CAM. We found that the bit lines of the memory core and the comparison mechanism associated with every CAM word are the major power consumers of the CAM. An appropriate CAM cell and word structure has been proposed to reduce the power consumed by the bit lines. We showed that the redundant comparisons in the compression process can be removed by turning off the power supply to those words that do not need to participate, saving about 80 percent of the power consumption of the comparison mechanism as compared with the conventional CAM. Moreover, using the proposed conditional comparison mechanism and the novel CAM cell with NAND-type matching logic, on average we obtain an improvement of close to two orders of magnitude in power consumption, i.e., a reduction of more than 98 percent for 8-bit words. Speed is sacrificed if we use the NAND-type matching logic. We showed that the NAND-type logic and the NOR-type logic can be combined to provide the best solution that balances power and delay. Our approach can be applied to general-purpose CAMs, provided the design techniques proposed here are adopted. In that case, the power will be approximately proportional to the number of valid words.
