Abstract: A simple real-time parallel architecture for CMOS VLSI implementation of a Ziv-Lempel data compression system is presented. This encoding system employs a linear systolic array to find concurrently the matches between each input data character and its corresponding dictionary, and can easily achieve the ideal compression ratio by cascading chips of encoding cells. A new encoding architecture is proposed to improve the encoding speed and reduce the hardware complexity of the encoding cells. In addition, the number of memory accesses is reduced to save power in high-speed applications. The encoder codes one character (more than eight bits) per encoding cycle. In Verilog simulation the clock period can be kept below 15 ns using the Compass standard cell library for the 0.6 µm CMOS process.
Introduction
In recent years the need to develop efficient data compression methods has increased considerably owing to the increasing applications of data compression in various areas. The most widely used classes of lossless data compression algorithms are those developed by Ziv and Lempel in 1977 and 1978, labelled LZ77 and LZ78, respectively [1, 2]. In the LZ77 algorithm, pointers are used to denote phrases in a fixed-size window that precedes the coding position. There is a maximum length for substrings that may be replaced by a pointer, given by the parameter F (typically 24-25). These restrictions allow LZ77 to be implemented using a 'sliding window' of N characters. In this scheme the first N - F characters have already been encoded and the last F characters constitute a lookahead buffer. The window is illustrated in example 1 (see Fig. 1). To encode a character, the first N - F characters of the window are searched to find the longest match with the lookahead buffer. The match may overlap with the buffer but obviously cannot be the buffer itself. The longest match is then encoded into a triple codeword (o, l, a), where o is the offset of the longest match from the lookahead buffer, l is the length of the longest match, and a is the first character that did not match the substring in the window. The window is then shifted right l + 1 characters, ready for the next encoding step. Attaching the explicit character to each pointer ensures that coding can proceed even if no match is found for the first character of the lookahead buffer.
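The sliding-window step described above can be illustrated in software. The following is only a minimal sketch, assuming that the offset o is counted as the distance back from the coding position and using small hypothetical values N = 32 and F = 8; the function names are illustrative and do not come from [1].

```python
def lz77_encode(data, N=32, F=8):
    """Encode a string into (o, l, a) triples with a sliding window.

    N is the window size and F the lookahead size (both hypothetical
    values here); o is assumed to be the distance back from the coding
    position, l the match length, a the explicit next character.
    """
    dict_size = N - F
    i = 0
    triples = []
    while i < len(data):
        start = max(0, i - dict_size)
        best_len, best_off = 0, 0
        # Keep one character in reserve for the explicit character a.
        max_len = min(F - 1, len(data) - i - 1)
        for j in range(start, i):
            length = 0
            # The match may run past position i (overlap with the lookahead).
            while length < max_len and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        triples.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1  # shift the window by l + 1 characters
    return triples


def lz77_decode(triples):
    """Rebuild the string; copying one character at a time handles overlap."""
    buf = []
    for o, l, a in triples:
        for _ in range(l):
            buf.append(buf[-o])
        buf.append(a)
    return "".join(buf)
```

Note that an unmatched character still costs a full triple, e.g. (0, 0, 'a'), which is the expansion problem that LZSS addresses.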
In the LZ77 algorithm, whenever there is no match or there is a match of length one, each symbol of the matched substring is replaced by a codeword three bytes long. This may result in expansion rather than compression in the encoding process. If many such unfavourable codewords occur in a file, the compression achieved will be low. The LZSS algorithm [3], one of the variants of the LZ77 algorithm, adopts a free mixture of pointers and characters to replace the triple codeword of LZ77 [1] and so overcome the expansion problem. A character itself is used only when a pointer takes more space than the characters it encodes. An extra bit, known as a flag, is added to each pointer or character to distinguish between them. The output is packed so that there are no unused bits. Many designers have incorporated this technique in software implementations, which significantly improves the compression ratio. However, the speed of data compression by software is usually not satisfactory for a real-time system. Several LZ hardware architectures have been presented in the literature [4-8]. The most computationally expensive step in the hardware is the search for the maximum matching strings in the sliding dictionary, yet the LZ-based algorithm must be executed at a very high throughput rate for real-time transmission and storage. The content addressable memory (CAM) approach [7] can search the matching strings for each symbol in constant time, but it consumes static DC current in many of the CAM cells during the match action cycle. If the size of the sliding window is increased, the power consumption problem will dominate the chip performance.
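The LZSS rule that a pointer is used only when it is shorter than the characters it replaces can be checked with simple bit counting. The sketch below is illustrative only: the one-bit flag is from the text, while the field widths passed to the functions (offset bits, length bits, 8-bit characters) are hypothetical parameters.

```python
def literal_bits(char_bits=8):
    """Cost of one uncoded character: flag bit plus the character."""
    return 1 + char_bits


def pointer_bits(offset_bits, length_bits):
    """Cost of one pointer: flag bit plus offset and length fields."""
    return 1 + offset_bits + length_bits


def min_profitable_length(offset_bits, length_bits, char_bits=8):
    """Smallest match length for which a pointer is strictly cheaper
    than sending the matched characters as literals."""
    length = 1
    while length * literal_bits(char_bits) <= pointer_bits(offset_bits, length_bits):
        length += 1
    return length
```

For example, with a hypothetical 11-bit offset and 4-bit length a pointer costs 16 bits, so it pays off from length 2 upwards, whereas an unmatched character costs 9 bits instead of the 24 bits of an LZ77 triple.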
We present the design of a systolic array processor for efficient implementation of the LZSS compression technique using the wrap architecture [9]. This new approach can find a maximum match length in each clock cycle.
The new hardware structure can execute the data compression task on-the-fly in a real-time communication system. To implement the encoder with VLSI ASIC design technology, the system is divided into several modules built around the systolic architecture. As shown in Fig. 2, the encoding system contains several blocks: VLcell, monitor, packet, sender, and in_buf. The VLcell block is composed of a segment of the systolic array illustrated in Fig. 3 and finds a series of optimal matches. The in_buf block includes input pads, registers, and buffers to receive the input signals and convert them to internal CMOS signals. DIN is the input of the encoded character string for the dictionary registers; SIN and SINC are the inputs of the uncoded character string under normal mode and cascade mode, respectively. When the ACT1 signal goes high the encoder begins to compress the uncoded input character string. A sliding window embedded in VLcell can be expanded to a suitable width. The typical width of the window is 2-8 K of dictionary, which requires 0.25-1 K cells. The bit number n for the offset is 11-13. The primary advantages of our system are high speed and a simple hardware structure with an expandable dictionary size, so that the compression ratio can be increased easily. We also adopt an efficient output buffer to control the output rate and convert the output packets to and from words of fixed width, e.g. 16 bits for compatibility with typical computer hardware. Since power consumption is an important consideration for high throughput-rate applications, we use a parallel processing technique to reduce the number of register accesses and hence the power dissipation.
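The relation between the window width, the number of cells, and the offset width n can be verified with a short calculation. This is a sketch under the assumption that each encoding cell covers eight dictionary positions (one per comparison result l0 to l7); the function name is illustrative.

```python
def size_dictionary(window_chars, comparisons_per_cell=8):
    """Return (number of encoding cells, offset bit width n).

    comparisons_per_cell=8 is an assumption taken from the eight
    comparison results per cell described in the text.
    """
    cells = window_chars // comparisons_per_cell
    # Bits needed to address every position in the window.
    offset_bits = (window_chars - 1).bit_length()
    return cells, offset_bits
```

With these assumptions a 2 K window gives 256 cells and an 11-bit offset, and an 8 K window gives 1024 cells and a 13-bit offset, in line with the ranges quoted above.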
Wrap architecture for encoding and decoding
In the systolic array described in [10], data pass from one processor to the neighbouring ones in a regular, rhythmic pattern. As described in [11], the array data flow of the wrap architecture for encoding is given in Fig. 3 ing. Thus, we ignore the values of l_1-l_7 and l_k during these encoding cycles until l_0 is reset to zero. From this description, it can be seen that the length value is not the only quantity that can determine the position of switch S.
In our design we utilise only the group code and comparison results, instead of the real length value, to control the position of switch S. Hence, each encoding cell sends (code, offset) pairs instead of (length, offset) pairs to the next stage. Finally, these pairs are sent to the 'monitor' to extract a sequence of matches that exactly covers the corresponding input stream. Thus our system does not use any counter or magnitude comparator in the series of encoding cells to achieve the optimal compression ratio, and these modifications reduce the complexity of the hardware implementation.
VLSI implementation of encoding scheme
As shown in Fig. 3, the encoding cell selects the optimal length based on the match length, which indicates the relationship between adjacent input characters. If the match length increases monotonically over adjacent input characters, these characters will be compressed into the same codeword. Hence we classify such characters as one group, whose first character is called the group leader.
[Fig. 4, panel 'Position of the Switch S': switch node selected in cell k and cell k-1 for clock numbers 0 to 12]
As shown in Fig. 4, each encoding cell handles nine length values (l_0, l_1, ..., l_7, l_k), where l_k is obtained from the preceding stage. For the sake of simplification, we assume that the l_k value from cell k + 1 is always zero and negligible in Fig. 4. The symbol '-' represents the comparison operator between the uncoded character and the encoded character, and '=' indicates that the input character matches the encoded character. From cycles 0 to 4 in cell k the first five uncoded characters {group} find a series of matched characters {D3D4D5D6D7} in the dictionary. Thus the l_4 value will be sent to cell k-1, i.e. it becomes the l_k value of cell k-1, and the switch position should be at node N4 from cycles 1 to 4. During these five clock cycles the five input characters {group} are assembled into the same group by cell k, where the first uncoded character S0, 'g', marked by symbol '*', is the group leader and the following four characters {roup}, marked by symbol 'v', are its group members. Since the l_0 length value of cell k is always less than the l_4 value of cell k from cycles 0 to 4, the l_0 value of cell k is negligible. During cycle 5 there is no match and it is not necessary to determine the switch position of cell k. From cycles 6 to 9 encoding cell k finds a series of matches between the input character string {S6S7S8S9}, {roup}, and the coded character string {D8D9D10D11}. Thus, these four characters {roup} should be assembled into the same group, and the switch of cell k should be set to node N5 to send the l_5 value of cell k to cell k-1. The selected group in cell k will be transmitted to cell k-1, and it (l_k of cell k-1) becomes a candidate to compete with the eight comparison results of cell k-1 after four clock cycles. If the group leader of l_{k-1} first appears in cell k-1, the (length, offset) data of l_{k-1} will be transmitted to cell k-2. As shown in Fig. 4, the comparison results between the first five input characters {group} and the encoded string {D3D4D5D6D7} in cell k will be transferred transparently from cell k to cell k-1, cell k-2, ... and the final stage (cell 0) according to the arrival time of the group leader in the subsequent cells.
As described in Fig. 4, the main task of the encoding cell is to control the switch position. If the arrival time of any group leader is earlier than the others, the switch of that encoding cell will be set to the node that constructs a data path for the optimal group leader to pass, such as node N4 of cell k during clock 0 or node N_k of cell k-1 during clock 4 in Fig. 4. The flowchart of the encoding cell's main task is shown in Fig. 5, where codes 01 and 11 indicate that the character is a group leader and a group member, respectively. Code 00 indicates that the character does not belong to any group. Switch S of the encoding cell shown in Fig. 3 can be treated as a finite state machine which saves the data path of the existing group and controls the switch position. If one data path is selected in this encoding cycle, it has the highest priority to be selected during the next encoding cycle, e.g. the first group string {group} in cell k from cycles 0 to 4 and in cell k-1 from cycles 4 to 8, respectively. Conversely, the finite state machine will select the optimal data path again when the original data path gets an unmatched result during the next encoding cycle. For example, during clock cycle 9 the switch of cell k-1 is set to the N7 position. The block diagram of the modified encoding cell is illustrated in Fig. 6, where the finite state machine is marked by the dashed line, and its transition table is shown in Table 1. In Fig. 6, nine state registers are used to generate a new group code L_n and its appended offset value O_n. The individual value of l_0-l_7 is equal to 1 when the respective comparison result is matched, otherwise it is 0 for unmatched status. These filtered group codes and their appended offset values arrive at the MONITOR, which generates the optimal code pair (length, offset).
From now on the main task of the encoding cell can be adjusted to encode the group codes based on the current match results, the current state, and the current incoming group code as follows: where symbol '+' and symbol '.' represent logical AND and OR operators, respectively. We use a simple finite state machine circuit to generate the optimal group code according to l_0-l_7 and l_k, so that neither a counter nor a magnitude comparator is needed. This system has been implemented using a 0.6 µm standard cell library, e.g. that provided by Compass Company, USA.
The timing sequence of the encoding cell given in Finally, encoding cell n utilises the new state values to obtain the new offset output value O_n. Thus the encoding cell is separated into two pipeline stages to execute the encoding process for one input character. As simulated by the Synopsys synthesis tool, the critical path of cell n is identified as running from S_n, D3_n-D0_n, and D3_{n+1}-D0_{n+1} to the output of the new code L_n. The critical path delay is about 7 ns, calculated from the positive edge of ck to M as shown in 
Monitor module
As depicted in Fig. 9, the monitor module receives the character, offset and group codes from cell 0 and decodes the longest length based on the group code, where the longest length is equivalent to the pending values in Fig. 9. The main function of the monitor, as shown in Fig. 9, is to indicate by the send signal whether or not a codeword has been found.
Suppose a pointer occupies the space of p uncoded characters. The encoding rule is then based on the length value of the codeword: if the length value is greater than p, the monitor sends the codeword. Otherwise, the monitor sends the uncoded character.
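A software model may help to clarify how a length can be recovered from group codes alone. The following is a minimal behavioural sketch, not the circuit of Fig. 9: it assumes the monitor sees one (character, code, offset) tuple per cycle, counts a leader (code 01) plus its members (code 11) to obtain the length, and applies the threshold p; the token names and the default p = 2 are hypothetical.

```python
LEADER, MEMBER, NO_GROUP = 0b01, 0b11, 0b00  # group codes from the text


def monitor(stream, p=2):
    """Turn (char, code, offset) tuples into LZSS tokens.

    A group is a leader followed by consecutive members; its length is
    obtained by counting, so no length value has to be transmitted.
    """
    tokens = []
    group, group_offset = [], None

    def flush():
        nonlocal group, group_offset
        if group:
            if len(group) > p:  # pointer is cheaper than the characters
                tokens.append(("pointer", group_offset, len(group)))
            else:  # otherwise send the uncoded characters
                tokens.extend(("literal", c) for c in group)
        group, group_offset = [], None

    for ch, code, offset in stream:
        if code == MEMBER and group:
            group.append(ch)
        else:
            flush()
            if code == LEADER:
                group, group_offset = [ch], offset
            else:
                tokens.append(("literal", ch))
    flush()
    return tokens
```

For instance, a leader 'g' followed by the members 'r', 'o', 'u', 'p' yields a single pointer of length 5, whereas a group of only two characters is sent as two literals when p = 2.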
Packet module
The packet module receives codewords from the monitor, concatenates these codewords together and segments them into 16-bit words for output. In words for executing the packing task. The LZSS codeword has two codeword formats, i.e. uncompressed and compressed forms. Thus, it is not necessary for the LZSS codeword to utilise the PLA table. In our design a simple combinational logic circuit is used instead of the PLA table to control the barrel shifter. The circuit diagram of the packet module is shown in Fig. 10. The 16-bit register W2 is the output latch. If the sum of Residue and Length is greater than 15, the Full output of the accumulator will be set to High and the packet module will output the contents of the W2 register at the positive trigger of the Out-clock signal. Since the maximum bit number of the concatenation of Flag and codeword is greater than 16, the sum of Length and Residue may be equal to 32, i.e. the sum of 17 and 15. In this case two 16-bit packets are ready for output, and the system will send the W2 contents first and then the W3 contents, at the positive trigger of Out-clock and that of Out-again, respectively. Here Out-again is generated by the Next-any signal of the accumulator.
The parallel concatenation of codewords is done by MUX1 and Barrel_Shifter, which provide 16-bit windows on their 33 input bits. MUX1 is controlled by the Flag signal and shifts the codeword from W0 into W1 so that the rightmost bit of W1 is the last bit of the codeword. Consequently the data stored in W1 is ready to concatenate with the next codeword. The Barrel_Shifter is controlled by the residue value, which represents the number of residual bits in W1. The residual bits in W1 are determined by MUX1.
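The packing behaviour described above (accumulate variable-length codewords, emit a 16-bit word whenever Residue plus Length exceeds 15, and possibly emit two words at once) can be modelled in a few lines. This is a behavioural sketch only; it does not model MUX1, the barrel shifter or the registers W0 to W3, and the function name and return format are hypothetical.

```python
def pack_words(codewords, word_bits=16):
    """Concatenate (value, length_in_bits) codewords into fixed-width words.

    Returns (words, residue_value, residue_length), where the residue is
    the part of the bit stream that does not yet fill a complete word.
    """
    words = []
    acc = 0      # bits waiting to be output
    residue = 0  # number of waiting bits
    for value, length in codewords:
        acc = (acc << length) | (value & ((1 << length) - 1))
        residue += length
        # One codeword of up to 17 bits can complete two words at once.
        while residue >= word_bits:
            residue -= word_bits
            words.append((acc >> residue) & ((1 << word_bits) - 1))
            acc &= (1 << residue) - 1
    return words, acc, residue
```

A 15-bit residue followed by a 17-bit codeword gives the 32-bit case mentioned in the text, and the model emits two 16-bit words in that step.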

Sender module
When the Over signal of the packet module falls from high to low the system is to finish the encoding process. When the value of the residue register is not equal to zero, all the final residual bits in W2 or Barrel_Shifter must be sent to the sender module for output. If the packet module has just sent an output in the previous ck1 cycle, the 16-bit output of Barrel_Shifter is sent to the sender. Otherwise, packet sends the 16-bit output of register W2.
A block diagram of the decoding processor is shown in Fig. 11. The functions of the major modules are described as follows. The input to the Frontend is a bit stream without explicit word boundaries. The Frontend has to decode a codeword, determine its length and Flag, and shift the input data stream by the number of bits corresponding to the decoded code length before decoding the next codeword. The pre-processor separates the codeword into the corresponding offset and character parts, and generates Active=1 for length successive cycles. When the system receives a Ready signal the counter generates three enable signals to start the decoding process. As enable-0 rises to high, the 16-bit register W1 stores the first 16-bit input stream. During the enable-1 cycle, W1 stores the second 16-bit input stream and the first 16-bit input stream is shifted into W0. The flag and length of the first codeword are determined in the same cycle. When the enable-2 signal goes high the Frontend produces the first codeword and its flag, and the system enters the normal decoding process in the following cycle.
The circuit diagram of the pre-processor for decoding is shown in Fig. 12. The pre-processor has to separate the codeword into character and offset, generate the Active signal for the rightmost cell, and send a Request signal to the Frontend to process the next best match. In our decoding process the output rate is fixed, i.e. the output decoded characters have the same length and are generated at the same rate (one character per decoding cycle), but the input rate is variable. If the pre-processor receives a match pair (offset, length), this match will be decoded into a series of length successive characters. It is therefore not necessary for the pre-processor to receive a codeword from the Frontend in every decoding cycle. When the Request signal goes high it indicates that the previous codeword has been decoded completely and the buffer register should receive the next codeword ready for decoding. The down counter is responsible for informing the pre-processor when to take new data from the Frontend. The down counter latches the length value from DEMUX, holds the Active signal high for length consecutive cycles when the Flag is equal to 1, and decreases by one at each positive trigger of the ck timing clock. As soon as the down counter decreases to two it changes the state of the Request signal to high. The Buffer, Flag and Endsignal registers will then receive the new data in the next decoding clock cycle. If the new Flag is equal to 0, the current codeword in the Buffer is just a character. In this case the Request signal must also be changed to high, and all the registers will latch the new data in the next cycle, too.
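The fixed output rate and variable input rate can be seen in a simple cycle-level model. This sketch is an assumption-laden illustration: it treats the offset as a distance back into the already decoded output, uses hypothetical token tuples, and does not model the early Request (issued when the counter reaches two) or the systolic decoding cells.

```python
def decode_cycles(codewords):
    """Decode LZSS tokens, producing one character per decoding cycle.

    codewords: ("literal", ch) or ("pointer", offset, length) tuples.
    Returns (decoded string, number of decoding cycles used).
    """
    out = []
    cycles = 0
    for cw in codewords:
        if cw[0] == "literal":  # Flag = 0: the codeword is one character
            out.append(cw[1])
            cycles += 1
        else:  # Flag = 1: Active stays high for `length` cycles
            _, offset, length = cw
            counter = length  # down counter latched with the length
            while counter > 0:
                out.append(out[-offset])
                counter -= 1
                cycles += 1
    return "".join(out), cycles
```

In this model the cycle count always equals the number of decoded characters, while a new codeword is consumed only when the counter of the previous one has run out.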
Dataflow in the systolic array of series decoding cells has been described in previous literature [11]. The main task of decoding cell n is to compare the input offset value with the stored offset values in cell n, where the stored offset values are 2n and 2n + 1 when the cell number is n. If the input offset value is equal to one of the stored offset values 2n or 2n + 1, the output decoded character will be copied from the S_n register or the S_{n+1} register, respectively, where the S_{n+1} register comes from the preceding stage (cell n + 1). Otherwise, the input character is retransmitted to the next stage (cell n - 1). The original characters are recovered without distortion by the systolic array of series decoding cells. The decoding cell [11] is shown in Fig. 12.
Analysis
The new encoding architecture is implemented by the To verify our encoding system we compare the input strings of the encoding processor with the output strings of the decoding processor, whose block diagram is shown in Fig. 11. If there is no difference between them the modified architecture should be correct. We use some input files for verification by Verilog simulator [13]. From these simulation results we verify that our encoding and decoding circuits are correct. The hardware simulation results of the whole chip, including I/O and power pads, are shown in Fig. 13. The DATA of Fig. 13 is the data output consisting of flag and encoded codeword. The SEND signal is the handshaking signal that informs the receiver whether or not the encoding system is transmitting output data. The low level of the FINISH signal indicates that the encoding system has finished the entire encoding task. As shown in Fig. 13, our encoding system works correctly even when the clock cycle is about 11 ns. The operating frequency is the reciprocal of the clock period, so the maximum operating frequency can reach 91 MHz and the compression bit rate is about 728 Mbit/s. To reduce ground bounce noise for high-speed applications we use ten pairs of power pads for the drivers of the I/O pads, and six pairs of power pads for the core of the digital circuit and the predrivers of the I/O pads. The total I/O pin count is 128.
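The quoted frequency and bit rate follow directly from the clock period. A minimal check, assuming 8-bit input characters and one character encoded per clock cycle (the function name is illustrative):

```python
def throughput(clock_period_ns, bits_per_char=8):
    """Return (clock frequency in MHz, input bit rate in Mbit/s),
    assuming one character is encoded per clock cycle."""
    f_mhz = 1000.0 / clock_period_ns
    return f_mhz, f_mhz * bits_per_char
```

An 11 ns period gives about 90.9 MHz, which rounds to 91 MHz, and 91 MHz at 8 bits per cycle corresponds to 728 Mbit/s.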
For high-speed applications power consumption is an important issue. The registers contribute the major part of the gate count for most LZ-type compressors. In the Zito-Wolf system each register must access and latch new data in every clock cycle, and thus the dynamic power consumption is very large. As described in [14], we estimate the percentage activity of the circuit P_a and the total capacitance C_L driven by the gate outputs in the circuit. Thus, the power consumption of the complex circuit for CMOS design can be estimated as follows:
P = P_a C_L V_{DD}^2 f_c    (3)
where V_{DD} is the supply voltage and f_c is the clock frequency. From eqn. 3, minimising the percentage of the switching capacitance P_a saves power. In our design only one quarter of the registers execute the data access operation for input characters, codes, and offsets during one clock cycle, as shown in Fig. 7. Minimising the number of memory accesses therefore reduces the power dissipation by reducing P_a. The number of memory accesses for our encoding cell is about 7/19 of that of the Zito-Wolf architecture [11], so that the dynamic power dissipation is lower at the same throughput rate. Since our implementation is based on CMOS design, there is no DC path during the operating cycle. For the CAM implementation [7] a DC path exists in the circuit of the CAM cell, so that the power dissipation of the CAM architecture is larger than that of the CMOS design under the same throughput rate, clock frequency, and buffer size conditions. For high-speed applications the power dissipation of our architecture implemented in a CMOS process is lower than those of the other implementations for the same compression ratio.
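The effect of reducing the switching activity can be illustrated numerically. The sketch below assumes the usual first-order CMOS dynamic power model, P = P_a · C · V_DD² · f; the capacitance, supply voltage and activity values in the usage example are hypothetical and are not measurements from the chip.

```python
def dynamic_power(activity, c_total_farads, vdd_volts, f_clock_hz):
    """First-order dynamic power estimate in watts:
    activity * capacitance * Vdd^2 * clock frequency."""
    return activity * c_total_farads * vdd_volts ** 2 * f_clock_hz


# Hypothetical operating point, used only to compare two activity levels.
baseline = dynamic_power(0.5, 1e-9, 5.0, 91e6)
reduced = dynamic_power(0.5 * 7 / 19, 1e-9, 5.0, 91e6)
ratio = reduced / baseline
```

Because the model is linear in the activity, scaling the register activity by 7/19 scales this part of the dynamic power by the same factor (about 0.37) at an unchanged clock frequency; the share of total chip power this represents depends on what fraction of the capacitance belongs to the registers.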
Conclusion
A parallel structure for a high-speed LZSS coder has been introduced. This parallel LZSS coder encodes each character in one clock cycle, and its operating frequency can reach 91 MHz. This system has several advantages. First, the compression time is linearly proportional to the input length. Only a single clock cycle is required for processing one character, and the clock cycle is bounded by the critical delay of the encoder and independent of the window size. Secondly, the architecture is simple and modularly expandable. In our design at most 256 encoding cells are integrated in one chip because of the problems of power, clock distribution, and hardware complexity. We can cascade several chips to increase the dictionary buffer size and achieve the desired compression ratio.
Although the hardware complexity of our modified architecture is higher than those of other architectures in [4, 6], the speed is much higher. Suppose the trigger cycle C_trig represents the number of time units (clock cycles) between two initiations of a pipeline. A trigger cycle C_trig of k means that two initiations are separated by k clock cycles. Assume the clock period is T_c. The average compression speed bit If we want to achieve the ideal compression ratio, the encoding buffer in [U] is 512 at least.
Our C_trig is much lower than that of [4, 6], so our speed is clearly higher for the same process technology. Thus, the new compression system is more suitable for real-time applications that increase the bandwidth of a communication system, and it can also be used to effectively increase the amount of mass storage available to computer systems. By utilising VLSI technology to implement the system chip, the data compression hardware can be integrated into real-time systems so that data can be compressed and decompressed on-the-fly.
Acknowledgment
This work was supported by the National Science Council of Republic of China under grant NSCX2-0404-E009-338. 
