Systolic implementations for transpose coding by Stauffer, Lynn M. & Hirschberg, Daniel S.
UC Irvine
ICS Technical Reports
Title
Systolic implementations for transpose coding
Permalink
https://escholarship.org/uc/item/3p10849c
Authors
Stauffer, Lynn M.
Hirschberg, Daniel S.
Publication Date
1991-11-15
 
Peer reviewed
Systolic Implementations for Transpose Coding 
Lynn M. Stauffer
University of California, Irvine
Irvine, CA 92717 
stauffer@ics.uci.edu 
Daniel S. Hirschberg 
University of California, Irvine 
Irvine, CA 92717 
dan@ics.uci.edu 
Technical Report 91-69 
November 15, 1991 
Contents
1. Introduction
2. List Compression
3. Related Work in Move-to-Front Coding
4. Parallel Transpose Coding with Fixed-Length Words
4.1. Systolic Implementations ENC1 and DEC1
4.2. Systolic Implementations ENC2 and DEC2 with Reduced Delay
5. Parallel Transpose Coding with Arbitrary Words
6. Topics for Future Investigation
References
1. Introduction 
Data compression attempts to remove redundancy from data and thereby 
increases the density of transmitted or stored data. Traditionally, there has been a 
tradeoff between the benefits of employing data compression versus the computa-
tional costs related to encoding and decoding. Parallelism represents a means for 
speeding up data compression performance. The problem of compressing data as 
effectively as possible is a challenging one that has been extensively researched in 
the sequential setting [W91, BCW90, S88, LH87]. Included in this vast collection
of sequential methods is Move-to-Front coding which maintains a dynamic list of 
words (to be encoded by their list position) using the move-to-front self-organizing 
list strategy [BSTW86, E87, R87, HC87]. A systolic array implementation of 
Move-to-Front has been described [TW89]. In this paper, we present systolic array 
implementations of Transpose coding, which uses an alternative self-organizing list 
strategy but otherwise is similar to Move-to-Front coding. We present implemen-
tations for fixed-length word lists which provide improved system bandwidth by 
accelerating Transpose coding. 
The state-of-the-art in software data compression systems is the UNIX¹ compress
utility which is based on a variation of Ziv and Lempel coding due to Welch
[W84, ZL78]. The UNIX compress system provides compression savings of up to
80% at a relatively high input bandwidth of 30 Kbytes per second on a 1 MIPS 
machine [TW89]. Higher compression savings are achieved by high-order Markov 
models and improved versions of compress which operate at limited input band-
widths of approximately 10 Kbytes per second on a 1 MIPS machine. A systolic 
array implementation of Move-to-Front running on a 40 MHz clock operates at a 
bandwidth of 40 Mbytes per second with compression savings ranging from 20% 
to 70% [TW89]. Several parallel compression systems based on dictionary coding 
achieve similar compression at input rates exceeding 25 Mbytes per second using a
40 MHz clock. Our implementations operate at a bandwidth commensurate with
the systolic Move-to-Front system.
¹ UNIX is a trademark of AT&T Bell Laboratories.
Our algorithms are implemented on a systolic array. Systolic arrays consist 
of a linearly-connected collection of synchronized rudimentary processing elements. 
Each processor has its own local memory and is assigned a unique identification 
number. An advantage of the systolic implementation is that a larger pipe can 
be fabricated by placing a sequence of processing elements on a single chip, and 
then joining a series of chips on a board. Another benefit is that the lengths of
interprocessor connections are constant and independent of the array size.
Section 2 describes the basic list compression method and several of its 
variants. Related work on systolic implementations of Move-to-Front coding is 
presented in Section 3. Our systolic designs for Transpose coding are given in 
Sections 4 and 5. Conclusions and areas for further investigation are the focus of 
Section 6. 
2. List Compression 
Data compression schemes can be categorized according to the method used 
to parse the input stream into individual encodable messages. In defined-word 
schemes, the context determines a set of source messages or words² (sequences of
input symbols) that are candidates for encoding. There are a number of suitable 
definitions for the composition of a word. For instance, in text file compression, a 
word may be defined to consist of an individual character or a sequence of characters 
delineated by a space. 
This paper considers a class of compression algorithms which maintain a 
sequential list of words using a self-organizing heuristic so that frequently accessed 
words appear near the front of the list. To distinguish the collection of compression 
techniques which utilize a self-organizing list from dynamic dictionary or Ziv and 
Lempel compression schemes, we will refer to this collection as list compression
methods under a particular update heuristic.
² Bentley et al. refer to these source sequences as "words" [BSTW86].
A list compression method uses a self-organizing data structure to maintain 
a list of source messages and a variable-length encoding of the integers to compress 
list indices. To compress a word, it is located in the dynamic word list and encoded 
by its list position. After a word has been referenced, the list is reorganized 
appropriately. In Move-to-Front coding, the encoded word is removed from its 
current position and placed in the first list position. In Transpose coding, the 
encoded word is exchanged with the contents of the preceding list entry. By 
directing the encoded list position to a fixed-to-variable coder, the output is further 
compressed by assigning short codewords to positions near the front of the list. 
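The two update rules above can be sketched sequentially as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, every input word is assumed to be in the list already, and Elias gamma coding is used only as a stand-in for the unspecified fixed-to-variable coder.

```python
def list_encode(words, initial_list, heuristic):
    """Encode each word by its 1-based list position, then reorganize the list.

    heuristic is "mtf" (move-to-front) or "transpose".
    Assumes every word already occurs in initial_list.
    """
    if heuristic not in ("mtf", "transpose"):
        raise ValueError("heuristic must be 'mtf' or 'transpose'")
    lst = list(initial_list)
    positions = []
    for w in words:
        i = lst.index(w)                  # 0-based location of the word
        positions.append(i + 1)           # transmitted 1-based position
        if heuristic == "mtf":
            lst.insert(0, lst.pop(i))     # move the word to the front
        elif i > 0:
            lst[i - 1], lst[i] = lst[i], lst[i - 1]   # swap with predecessor
    return positions, lst


def elias_gamma(n):
    """Stand-in variable-length integer code (an assumption, see lead-in).

    Small positions receive short codewords.
    """
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b


if __name__ == "__main__":
    for h in ("mtf", "transpose"):
        pos, final = list_encode("decade", "abcdef", h)
        print(h, pos, "".join(final), [elias_gamma(p) for p in pos])
```

Both heuristics emit the same kind of output (a stream of list positions); they differ only in how the list is rearranged after each access.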
The move-to-front and transpose list organizing strategies are two update 
heuristics among a collection of many others (see [HH85] for a survey of self-
organizing linear search). After reaching a steady state, where many further 
search requests are not expected to significantly impact the expected search time, 
the expected access cost will be less for transpose than for move-to-front. But 
the convergence time or number of accesses required to reach a steady state is 
greater for transpose than for move-to-front [HH85]. Each heuristic outperforms
the other in some applications. Moreover, Horspool and Cormack report that the
transpose heuristic performs as well as move-to-front and is easier to
implement since updates involve only local rearrangement
[HC87]. For any particular application, simulations are necessary to determine the 
superior heuristic. This paper, by furnishing a systolic array implementation of 
transpose, provides the option of choosing between transpose and move-to-front 
for other applications. 
3. Related Work in Move-to-Front Coding 
Dictionary coding algorithms (which include list compression algorithms)
function by replacing blocks of input with references to earlier occurrences of 
identical data. Systolic implementations have been developed for several variants 
of the general dictionary scheme. Parallel dictionary methods are surveyed in 
[SH91A]. Previous findings in parallel list compression are described below. 
Thomborson and Wei investigate parallel implementations of Move-to-Front 
coding [TW89]. They distinguish two major algorithmic variants of Move-to-
Front compression. The simpler procedure permutes a byte-level fixed-length list 
of symbols and the other approach divides the input stream into "words" and 
transmits words by a move-to-front code. VLSI implementation issues are the 
root of this distinction. That is, a general defined-word scheme requires words 
of arbitrary length to be present in the word list. In a systolic array, arbitrary 
word lengths place unreasonable demands on the number of input/output pins
that must be placed on each processing element. Thomborson and Wei examine 
various alternatives, such as placing a limit on the length of words, and find that 
even permitting short words can be prohibitive. 
For the simpler byte-level fixed-length Move-to-Front coding, Thomborson 
and Wei describe an array of 256 processing elements each of which stores an 8-
bit number corresponding to an ASCII code [TW89]. The input stream enters 
the pipe and is encoded by detecting matches between the input characters and 
the bytes stored in the processing elements. When a match is detected, the input 
character is replaced by the identification number of the matching processor. By 
depositing the input character in the first processor as it enters the pipe and then 
cascading previous processing element contents down the array, the move-to-front 
behavior is realized. The output of the array, consisting of a sequence of 8-bit 
list indices, is fed into a fixed-to-variable length coding system. This byte-level 
design achieves compression savings of 19% to 38% and operates at a bandwidth 
of 40 Mbytes/second running on a 40 MHz byte clock.
Thomborson and Wei describe a systolic design for approximating general 
defined-word schemes [TW89]. The idea is to map variable-length words to an 8-
bit hashcode using a hardwired hash table. These 8-bit codes are entered into the 
Move-to-Front list of target strings and manipulated as in the byte-level systolic 
encoder and decoder arrays. A closed hashing scheme with no collision resolu-
tion is used to obtain a high-speed, high-bandwidth design. These performance 
improvements, however, come at the expense of poorer compression performance. 
Unlike the sequential Move-to-Front codes in which the least-recently-used target 
word "falls" off the end of the list, the hashing approach randomly eliminates list 
words. This random behavior of the systolic design yields compression savings 
ranging from 25% to 65% and an input bandwidth of 40 Mbytes per second running
on a 40 MHz clock.
4. Parallel Transpose Coding with Fixed-Length Words 
Parallel transpose list compression is described by the following general 
paradigm. Encoder and decoder maintain identical word lists using the transpose 
heuristic. Namely, after a word is used it is exchanged with the word stored in the 
position immediately preceding its original position. In general, to transmit word 
w on the systolic array, w is compared to the list entries of successive processors. 
If w matches the list entry in processor i, it is encoded as i. The encoder then 
updates the list by transposing the list entry (w) of processor i with the list entry
in processor i - 1. When the decoder array receives list index i, it decodes it as 
the list word stored in processor i (which will be w) and then updates the list by 
exchanging w with the previous list entry stored in processor i - 1. Since several 
matches can be detected in parallel the list update procedure needs additional 
consideration. 
In the sequential setting, a sequence of words that match the list structure
in successive entries is handled in the same way as other matches. However,
in the systolic environment, matches corresponding to successive entries in the
array impose additional constraints when the list of words is being manipulated in 
parallel. That is, simultaneous matches occurring in different locations in the array 
may force global communication among the processors to determine the contents of 
the updated list. To illustrate this difficulty, consider the input string "abcdefgh" 
and the word list "h, g, f, e, d, c, b, a". Sequential transpose list compression
outputs the sequence of positions 8, 8, 6, 6, 4, 4, 2, 2 and the final word list is 
identical to the original. In a sense, each pair of matches "cancels" the effect of 
their updates. On the systolic array, the input string pipes into the array and all 
eight matches are detected simultaneously. Handling the subsequent update may 
require global communication. 
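The cancellation in this example can be checked with a short sequential sketch. The function names are hypothetical and every word is assumed to be present in the list; the decoder applies the same transpose update to an identical copy of the list.

```python
def transpose_encode(words, initial_list):
    """Sequential transpose coding: emit 1-based positions, swap with predecessor."""
    lst = list(initial_list)
    positions = []
    for w in words:
        i = lst.index(w)
        positions.append(i + 1)
        if i > 0:
            lst[i - 1], lst[i] = lst[i], lst[i - 1]
    return positions, lst


def transpose_decode(positions, initial_list):
    """Inverse of transpose_encode, using an identical starting list."""
    lst = list(initial_list)
    words = []
    for p in positions:
        i = p - 1
        words.append(lst[i])
        if i > 0:
            lst[i - 1], lst[i] = lst[i], lst[i - 1]
    return words, lst


if __name__ == "__main__":
    start = list("hgfedcba")
    positions, final = transpose_encode("abcdefgh", start)
    print(positions)        # each pair of equal positions undoes its own swap
    print(final == start)   # the list returns to its starting order
    print("".join(transpose_decode(positions, start)[0]))
```

Each word in a pair undoes the swap made by its partner, which is why the sequential result depends on the order of updates. That order is exactly what a parallel array has to reproduce when all eight matches are detected at once.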
A systolic implementation may have difficulty allowing non-fixed-length words 
because of the unreasonable pin requirements [TW89]. Our initial designs are for 
defined-word methods which permute a list of fixed-length source messages. Section 
5 describes systolic implementations which approximate the more general dynamic 
list compression method. 
4.1. Systolic Implementations ENC1 and DEC1
This section gives systolic implementations (which we call ENC1 and DEC1)
for encoding and decoding using the transpose fixed-length list compression model.
For a list of length N, ENC1 consists of N processing elements (PEs) linearly
connected by a two-way communication channel. PE i stores the list entry which
is currently in position i in the list and a copy of the input word PE i considered
on the previous clock cycle. The list entry will be referred to as entry_i and the
prior input word as oldw_i. The input stream enters the pipe from the right (PE 1)
and the encoded message exits at the left (PE N). A schematic of the architecture
is shown in Figure 1. For N = 256, the array consists of 256 processing elements
each storing an 8-bit byte list entry corresponding to an ASCII symbol.
[Figure 1: Systolic array for ENC1. PE N, PE N-1, ..., PE 2, PE 1 are drawn from
left to right; each PE i holds entry_i and oldw_i. A 24-bit channel carries
(w, e, p) toward PE N and a 1-bit channel carries (m) toward PE 1.]
In order to prevent a contiguous sequence of matches from occurring concur-
rently, our design allows input packets to enter the array only on every other clock 
cycle. The word list is updated at the start of each encoding cycle. Later in the 
cycle, word matches are detected and encoded. 
At the beginning of the ENC1 clock cycle, PE i receives the 3-tuple (w, e, p)
from PE i-1 and the bit m_{i+1} from PE i+1. Here w is a word to be matched, e is
the current contents of entry_{i-1} that may be needed for a transpose update, p is
the list position of w (0 if no match has been found yet) and m_{i+1} is a bit flag
which is set if PE i+1 detected a match in the previous clock cycle. If m_{i+1} is
set then PE i overwrites entry_i with oldw_i. If w matches entry_i then PE i carries
out three tasks. Namely, PE i sets p = i, flags m_i = 1, and (if i > 1) overwrites
entry_i with input e (equivalently entry_{i-1} obtained from PE i-1).
At the close of the clock cycle, PE i overwrites oldw_i with w and transmits
(w, entry_i, p) to PE i+1 and (m_i) to PE i-1. Contention is avoided as a result
of restricting input to every other cycle. An example of this encoding procedure is
given in Figure 2.
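The cycle described above can be simulated in software. The following is a sketch under stated assumptions rather than the authors' design: the name enc1 is hypothetical, an empty slot is modeled as w = None, empty slots are assumed not to overwrite oldw_i, and a word that has already been matched (p > 0) is assumed not to be compared again.

```python
def enc1(words, initial_list):
    """Cycle-level sketch of ENC1. Returns (encoded positions, final list)."""
    n = len(initial_list)
    entry = [None] + list(initial_list)   # entry[i] lives in PE i (1-based)
    oldw = [None] * (n + 1)               # last real word seen by PE i
    fwd = [(None, None, 0)] * (n + 2)     # fwd[i]: (w, e, p) arriving at PE i
    back = [0] * (n + 2)                  # back[i]: flag m_{i+1} arriving at PE i
    out = []
    # Words enter on every other cycle; trailing empty slots drain the pipe.
    stream = [x for w in words for x in (w, None)] + [None] * (n + 1)
    for w_in in stream:
        fwd[1] = (w_in, None, 0)
        next_fwd = [(None, None, 0)] * (n + 2)
        next_back = [0] * (n + 2)
        for i in range(1, n + 1):
            w, e, p = fwd[i]
            if back[i]:                    # PE i+1 matched on the previous cycle
                entry[i] = oldw[i]         # finish the transpose on this side
            m = 0
            if w is not None and p == 0 and w == entry[i]:
                p, m = i, 1                # encode w as position i
                if i > 1:
                    entry[i] = e           # take over entry_{i-1}
            if w is not None:
                oldw[i] = w                # assumption: empty slots leave oldw_i alone
            next_fwd[i + 1] = (w, entry[i], p)
            next_back[i - 1] = m
        w, _, p = next_fwd[n + 1]          # packet leaving PE N
        if w is not None:
            out.append(p)
        fwd, back = next_fwd, next_back
    return out, entry[1:]


if __name__ == "__main__":
    positions, final = enc1("decade", "abcdef")
    print(positions, "".join(final))
```

On the input "decade" with the list a, b, c, d, e, f in PE 1 through PE 6, the sketch yields the same positions as sequential transpose coding.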
For N = 256, the ENC1 communication channel is 25 bits wide and each
processing element has three 8-bit registers, an 8-bit identity comparator, two
8-bit multiplexers, and additional control logic. The critical path in PE i passes
through three hardware components. First, entry_i and oldw_i pass through a
multiplexer triggered by flag m_{i+1}. Second, entry_i and the input word w enter the
identity comparator. Finally, the output of the comparator determines if the input
word should be encoded as i by triggering a second multiplexer. This critical path
is comparable to that of the move-to-front systolic array [TW89].
Our first decoding algorithm DEC1a resembles ENC1. As in ENC1, input
enters the pipe on every other clock cycle. At the outset of the cycle, the word list
is updated. Unlike ENC1, where only a single bit is passed from PE i back to PE
i-1 to facilitate updating, DEC1a requires that entry_i be transmitted along with the
single match bit. Following the list updating, list indices are replaced by word list
entries.
DEC1a proceeds as follows. At the beginning of the clock cycle, PE i receives
the 3-tuple (w, e, p) from PE i-1 and the 2-tuple (m_{i+1}, f) from PE i+1. Here p is
the encoded list index, e is the current contents of entry_{i-1}, w is the unencoded
word occurring in position p of the list (Λ if p < i), m_{i+1} is a flag indicating if PE
i+1 decoded a list index on the previous clock cycle and f is the prior contents
of entry_{i+1} that may be needed for a transpose update. If m_{i+1} is set then PE i
copies f into entry_i. If p = i then PE i replaces w by entry_i, sets m_i = 1, copies
entry_i into f, and (if i > 1) finally overwrites entry_i with e. At the end of the
clock cycle, PE i sends (w, entry_i, p) to PE i+1 and (m_i, f) to PE i-1. Figure 3
provides an example of DEC1a.
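The decoding cycle can be simulated in the same way as the encoder. This is a sketch under stated assumptions, not the authors' design: the name dec1a is hypothetical, an empty slot is modeled as p = 0, and an undecoded word is modeled as None.

```python
def dec1a(positions, initial_list):
    """Cycle-level sketch of DEC1a. Returns (decoded words, final list)."""
    n = len(initial_list)
    entry = [None] + list(initial_list)   # entry[i] lives in PE i (1-based)
    fwd = [(None, None, 0)] * (n + 2)     # fwd[i]: (w, e, p) arriving at PE i
    back = [(0, None)] * (n + 2)          # back[i]: (m_{i+1}, f) arriving at PE i
    out = []
    # Indices enter on every other cycle; trailing empty slots drain the pipe.
    stream = [x for p in positions for x in (p, 0)] + [0] * (n + 1)
    for p_in in stream:
        fwd[1] = (None, None, p_in)
        next_fwd = [(None, None, 0)] * (n + 2)
        next_back = [(0, None)] * (n + 2)
        for i in range(1, n + 1):
            w, e, p = fwd[i]
            m_next, f_in = back[i]
            if m_next:                     # PE i+1 decoded on the previous cycle
                entry[i] = f_in            # receive the word that moved forward
            m, f = 0, None
            if p == i:                     # this PE holds the addressed entry
                w = entry[i]
                m, f = 1, entry[i]         # hand the old entry_i back to PE i-1
                if i > 1:
                    entry[i] = e           # take over entry_{i-1}
            next_fwd[i + 1] = (w, entry[i], p)
            next_back[i - 1] = (m, f)
        w, _, p = next_fwd[n + 1]          # packet leaving PE N
        if p:
            out.append(w)
        fwd, back = next_fwd, next_back
    return out, entry[1:]


if __name__ == "__main__":
    words, final = dec1a([4, 5, 5, 1, 3, 5], "abcdef")
    print("".join(words), "".join(final))
```

The decoder ends with the same list as the encoder, which is the invariant the transpose paradigm relies on.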
[Figure 2: Example of ENC1 on input word "decade". The word list is initially
a, b, c, d, e, f in PE 1 through PE 6 and, after 16 clock cycles, a, d, b, e, c, f.
Final encoded output: 4, 5, 5, 1, 3, 5.]
[Figure 3: Example of DEC1a, decoding the list positions produced in Figure 2.
Final output after 16 clock cycles: decade.]
Our second decoding algorithm, DEC1b, processes packets on every system
cycle and therefore operates at twice the rate of ENC1 and DEC1a. However,
DEC1b restricts the input to a fixed predefined alphabet of size S. The symbols in
this alphabet are arbitrarily given indices 1 through S. The output of ENC1 is a
sequence of list positions which DEC1b receives as input and decodes in two stages.
The first stage decodes list positions into indices and the second stage translates
indices into symbols. PE i stores pos_i which is the list position of the alphabet
symbol with index representation i.
At the start of the clock cycle, PE i receives input packet (w, p) from PE
i-1. As in DEC1a, p is the encoded list position and w is the unencoded word
occurring in position p of the list (Λ if p < i). If p = pos_i then w is set to
i and, if p > 1, pos_i is decremented. If p = pos_i + 1 then pos_i is incremented.
All necessary manipulations of list positions transpire before a packet arrives at
the processor which will carry out the decoding.
After leaving the DEC1b array, the decoded symbol indices are replaced by
their corresponding alphabet symbol by a special purpose processor located at the
end of the pipe. The DEC1b algorithm is illustrated in Figure 4.
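A software sketch of the two-stage decoder follows. It is illustrative only: the name dec1b is hypothetical, and the update rule is an assumption about the intended behavior. PE i decodes when p equals pos_i (decrementing pos_i when p > 1) and increments pos_i when its symbol sits immediately in front of the decoded position (pos_i = p - 1), since that is the entry a transpose pushes back by one place.

```python
def dec1b(positions, alphabet):
    """Pipeline sketch of DEC1b. Returns the decoded symbol string.

    alphabet[k] is the symbol with index k + 1; the initial list order is
    assumed to equal the index order, so pos_i starts at i.
    """
    s = len(alphabet)
    pos = list(range(s + 1))        # pos[i]: list position of symbol index i
    pipe = [None] * (s + 1)         # pipe[i]: packet (w, p) currently in PE i
    indices = []
    # A new packet enters on every cycle; trailing empty slots drain the pipe.
    for p_in in list(positions) + [None] * s:
        for i in range(s, 1, -1):   # shift every packet one PE down the pipe
            pipe[i] = pipe[i - 1]
        pipe[1] = None if p_in is None else (0, p_in)
        for i in range(1, s + 1):
            if pipe[i] is None:
                continue
            w, p = pipe[i]
            if p == pos[i]:         # symbol i is at the transmitted position
                w = i
                if p > 1:
                    pos[i] -= 1     # it moves one place toward the front
            elif p == pos[i] + 1:   # symbol i is the predecessor being swapped
                pos[i] += 1
            pipe[i] = (w, p)
        if pipe[s] is not None:     # stage 1 output leaves the last PE
            indices.append(pipe[s][0])
    # Stage 2: translate indices into alphabet symbols.
    return "".join(alphabet[k - 1] for k in indices)


if __name__ == "__main__":
    print(dec1b([4, 5, 5, 1, 3, 5], "abcdef"))
```

Because each PE updates pos_i using only the packets that pass through it, no backward channel is needed, which is consistent with accepting a packet on every cycle.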
For a fixed-length list, such as the 256 different 8-bit ASCII characters, each 
processor is initialized to contain one of the 8-bit bytes. Alternatively, new words 
can be added to the list until the list becomes full. Moreover, if an input word 
w is not in the current list of size K (1 ≤ K ≤ N) the word is encoded by the
index K + 1 followed by the word w and the list is updated by transposing w 
with the list entry K. If K = N (i.e., the list is full), word w replaces the last 
list entry. The decoder "learns" the word list in a similar fashion. In the systolic 
array, an additional flag bit in each processor is used to delimit the current list. 
The processor holding the flag is designated as the first empty list entry. Initially, 
the flag bit in PE 1 is set. 
[Figure 4: Example of DEC1b. Symbols a, b, c, d, e, f have indices 1, 2, 3, 4, 5, 6.
If PE i has contents j then the symbol with index i is in list position j and thus
pos_i = j. Output of stage 1: 4, 5, 3, 1, 4, 5. Output of translator (indices into
alphabet symbols): decade.]
4.1.1. Empirical Evaluation
Compression findings for fixed-length word implementations of Move-to-Front
and Transpose coding are reported in Table 1. A systolic array consisting of
128 processing elements each initialized to contain a 7-bit ASCII character was
simulated. The test files used are part of the Calgary/Canterbury text compression
corpus [BCW90]. Transpose coding provides superior compression performance for
large text files but, for small files (under 40 Kbytes), Move-to-Front gives better
compression. These findings support the theoretical expectation that Transpose
coding takes longer to reach a steady state but, after reaching a steady state, has
a smaller expected access cost.
Table 1
Compression savings delivered by Transpose and Move-to-Front Coding
Input File   Size (Bytes)   Savings - MTF (%)   Savings - Transpose (%)
bib              111261           29.44                 32.24
book1            768771           37.91                 41.90
book2            610856           36.98                 39.54
paper1            53161           34.19                 34.33
paper2            82199           37.49                 39.07
paper3            46526           36.63                 36.81
paper4            13286           35.20                 29.75
paper5            11954           33.71                 26.64
paper6            38105           35.45                 33.79
news             377109           31.48                 34.43
progc             39611           30.72                 30.70
progl             71646           38.72                 38.61
progp             49379           35.26                 34.99
4.2. Systolic Implementations ENC2 and DEC2 with Reduced Delay 
The linear delay of the previous designs is determined by the piping of the 
input from PE 1 through to PE N. In this section we describe an architecture 
which combines a systolic array with trees, resulting in a logarithmic delay. The 
trees broadcast the input to the array processors and reduce the simultaneous 
outputs of the processors. A similar architecture for dictionary coding is described 
in [Z90]. In addition to decreased delay, the restriction allowing data to enter 
the pipe only on every other system cycle is eliminated in systolic implementation 
ENC2 for Transpose coding.
For a list of size N, the ENC2 architecture consists of 3N - 2 processors. 
The first N processors are arranged in a systolic pipe and the remaining processors 
are configured as two binary trees (each containing N - 1 processors) synchronized 
with the systolic array (see Figure 5). One of the binary trees (the broadcast tree) 
is placed on top of the systolic array. Input enters at the root of the broadcast 
tree and is propagated down the tree to each array processor. The other binary 
tree (the reduction tree) is placed below the systolic array. Results of the array 
processors are reduced to a single non-null output via the propagation toward the 
root of the reduction tree. The tree interconnect provides a total delay of 2 log_2 N.
Processor PE i in the systolic array stores the list entry which is currently
in position i in the list. As in ENC1, this entry is referred to as entry_i. PE i
receives input from the broadcast tree and from PE i-1 and PE i+1. PE i
transmits entry_i to PE i-1 and PE i+1 and outputs match information to the
reduction tree. Processors in the broadcast tree simply pass their input to their
outputs. Reduction processors receive two inputs which are either both zero or of
which exactly one is non-zero. In the first case, the reduction processor outputs
zero. In the second case, the reduction processor transmits the non-zero input.
[Figure 5: Broadcast/reduce architecture for ENC2 and DEC2. A generic PE i in the
systolic array receives w, entry_{i+1} and entry_{i-1}, sends entry_i to both
neighbors, and outputs i if entry_i matches w.]
After the input has propagated down the broadcast tree to the array
processors, encoding proceeds as follows. At the beginning of the clock cycle, PE i
receives (w) from the broadcast tree, (entry_{i-1}) from PE i-1, and (entry_{i+1}) from
PE i+1. w is the word to be encoded. If w matches entry_i then w is set to i and
entry_{i-1} is written into entry_i. If w matches entry_{i+1} (received from PE i+1 at
the start of the cycle) then PE i overwrites entry_i with entry_{i+1} and sets w to 0.
Otherwise, PE i sets w to 0. At the close of the clock cycle, PE i transmits entry_i
to PE i+1 and PE i-1 and sends w to its neighboring processor in the reduction
tree. Thus, at the end of each clock cycle, exactly one array processor (the one
which matched the input symbol) outputs a non-zero value into the base of the
reduction tree. This non-zero value is propagated to the root of the reduction tree
and finally output. Decompression (DEC2) mirrors ENC2.
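The per-cycle behavior of the array and the reduction tree can be sketched as below. This is an illustration under simplifying assumptions, not the authors' design: the names are hypothetical, one word is processed per loop iteration, the tree latency is not modeled, and the array is padded with zeros when its size is not a power of two.

```python
def reduce_tree(values):
    """Pairwise reduction: each node forwards its non-zero input, if any."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)                 # pad so every node has two inputs
        level = [a or b for a, b in zip(level[0::2], level[1::2])]
    return level[0]


def enc2(words, initial_list):
    """Sketch of ENC2: broadcast w, update all PEs at once, reduce the results."""
    n = len(initial_list)
    entry = [None] + list(initial_list) + [None]   # sentinels beside PE 1 and PE N
    out = []
    for w in words:
        old = entry[:]                      # neighbor values seen at cycle start
        results = []
        for i in range(1, n + 1):
            if w == old[i]:                 # this PE holds the word
                results.append(i)
                if i > 1:
                    entry[i] = old[i - 1]   # take over entry_{i-1}
            elif w == old[i + 1]:           # the successor holds the word
                entry[i] = old[i + 1]       # the word moves one place forward
                results.append(0)
            else:
                results.append(0)
        out.append(reduce_tree(results))
    return out, entry[1:n + 1]


if __name__ == "__main__":
    print(enc2("decade", "abcdef"))
    print(enc2("abcdefgh", "hgfedcba"))
```

Since both halves of a transpose are applied in the same cycle, a new word can be accepted on every cycle. For N = 256 the architecture uses 3N - 2 = 766 processors and the tree delay is 2 log_2 N = 16.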
5. Parallel Transpose Coding with Arbitrary Words 
The designs of Section 4 included only fixed-length words in the list. The 
more general word list compression scheme permits words of arbitrary lengths to 
be present. A straightforward extension of our previous designs allows arbitrary 
words. However, the 3-tuples that travel through the array, consisting of two words 
and a pointer, may lead to unreasonable VLSI implementation requirements. For 
example, the fixed-length implementations consisting of 256 processing elements 
require 24 bits per tuple or 48 input/output pins per systolic element. For words 
of length 8 or less on a systolic array of similar size, the pin requirement jumps to 
272 input/output pins per chip. One simple solution to the problem of unbounded 
pin requirements is to place a bound on the maximum allowable word length. This 
bound enforces a limit on the pin requirements. The appropriate maximum word 
length is dependent on the application. 
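The pin counts quoted above can be reproduced with a short calculation. The decomposition used here is an assumption: each tuple is taken to hold two words plus a pointer of ceil(log_2 N) bits, each element has one input and one output port for the tuple, and the 1-bit match channel is ignored. The function names are hypothetical.

```python
def tuple_bits(word_bytes, n_pes=256):
    """Bits in one (w, e, p) tuple: two words plus a list pointer."""
    pointer_bits = (n_pes - 1).bit_length()    # 8 bits for 256 elements
    return 2 * 8 * word_bytes + pointer_bits


def io_pins(word_bytes, n_pes=256):
    """Pins for one input port and one output port carrying the tuple."""
    return 2 * tuple_bits(word_bytes, n_pes)


if __name__ == "__main__":
    for word_bytes in (1, 2, 4, 8):
        print(word_bytes, tuple_bits(word_bytes), io_pins(word_bytes))
```

Under these assumptions one-byte words give 24 bits per tuple and 48 pins, and eight-byte words give 136 bits per tuple and 272 pins.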
In order to avoid the potential VLSI issues, Thomborson and Wei approximate 
Move-to-Front coding by using a hardwired hash table to map arbitrary words onto 
an 8-bit byte [TW89]. This single byte is piped into the array and encoded as a
fixed-length word. The increased throughput provided by the systolic design comes 
at the expense of lower compression savings. Their design provides compression 
savings ranging from 25% to 65% and an input bandwidth of 40 Mbytes per second
running on a 40 MHz clock. This is considerably lower than the compression savings of 30% to
75% obtained by the sequential Move-to-Front codes. Implementing this hashing 
approach on our systolic transpose coders is straightforward. We suspect that the 
random behavior of the hashing scheme would result in compression savings similar 
to those of the Move-to-Front systolic implementation [TW89]. 
6. Topics for Future Investigation 
Although we have outlined the design and compression performance of several 
systolic compressors for high-bandwidth applications, extensive empirical evalua-
tions of our methods are needed to fully describe their behavior. 
As supported by the empirical findings in Section 4.1.1, Move-to-Front coding 
performs well on small files and Transpose coding is superior for large files. This 
suggests the examination of a hybrid scheme which combines Move-to-Front and 
Transpose coding. Initially, Move-to-Front coding is utilized and at some point 
control is switched to a Transpose coding method. 
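One way such a hybrid might look sequentially is sketched below. The switch point is not specified in the text, so the parameter switch_after (the number of words coded with Move-to-Front before switching) is purely hypothetical, as are the function names. Encoder and decoder must agree on it.

```python
def _update(lst, i, use_mtf):
    """Apply one list update at 0-based position i."""
    if use_mtf:
        lst.insert(0, lst.pop(i))                  # move-to-front
    elif i > 0:
        lst[i - 1], lst[i] = lst[i], lst[i - 1]    # transpose


def hybrid_encode(words, initial_list, switch_after):
    """Move-to-Front for the first switch_after words, Transpose afterwards."""
    lst = list(initial_list)
    positions = []
    for count, w in enumerate(words):
        i = lst.index(w)                  # assumes w is in the list
        positions.append(i + 1)
        _update(lst, i, use_mtf=count < switch_after)
    return positions


def hybrid_decode(positions, initial_list, switch_after):
    """Mirror of hybrid_encode with the same switch point."""
    lst = list(initial_list)
    words = []
    for count, p in enumerate(positions):
        i = p - 1
        words.append(lst[i])
        _update(lst, i, use_mtf=count < switch_after)
    return "".join(words)


if __name__ == "__main__":
    coded = hybrid_encode("decade", "abcdef", switch_after=3)
    print(coded, hybrid_decode(coded, "abcdef", switch_after=3))
```

Setting switch_after to 0 reduces the sketch to pure Transpose coding, and setting it to the input length reduces it to pure Move-to-Front.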
The broadcast/reduce implementation for Transpose coding described in
Section 4.2 has a delay of logarithmic length. Further investigation into a
broadcast/reduce architecture for Move-to-Front coding may result in similar
reductions in the delay.
The systolic designs for arbitrary word lengths are limited by the imperfect 
approximation behavior of the hashing scheme. Alternative methods are necessary 
to better emulate general sequential list compression. 
As suggested by Thomborson and Wei, a high-bandwidth architecture for 
adaptive Huffman coding or other fixed-to-variable length coders which is capable 
of operating at a rate that is commensurate with the systolic list compression systems
is also of great interest. Potentially, this high-bandwidth coder, when coupled with 
a systolic list compressor, will lead to improved compression performance. 
REFERENCES
[BCW90] BELL, T. C., CLEARY, J. G., AND WITTEN, I. H. Text Compression,
Prentice-Hall, Englewood Cliffs, New Jersey, 1990.
[BSTW86] BENTLEY, J. L., SLEATOR, D. D., TARJAN, R. E., AND WEI, V. K. A
locally adaptive data compression scheme. Commun. ACM 29, 4 (April,
1986), 320-330.
[E87] ELIAS, P. Interval and recency rank source coding: two on-line adaptive
variable-length schemes. IEEE Trans. Inform. Theory IT-33, 1
(Jan., 1987), 3-10.
[HH85] HESTER, J. H. AND HIRSCHBERG, D. S. Self-organizing linear search.
ACM Comp. Sur. 17, 3 (Sep., 1985), 295-311.
[HC87] HORSPOOL, R. N., AND CORMACK, G. V. A locally adaptive data
compression scheme. Technical Correspondence. Commun. ACM 30,
9 (Sept., 1987), 792-794.
[LH87] LELEWER, D. A. AND HIRSCHBERG, D. S. Data compression. ACM
Comp. Sur. 19, 3 (Sep., 1987), 261-296.
[R87] RYABKO, B. Y. A locally adaptive data compression scheme. Technical
Correspondence. Commun. ACM 30, 9 (Sept., 1987), p. 792.
[SH91A] STAUFFER, L. M. AND HIRSCHBERG, D. S. Parallel data compression.
Technical Report 91-44. Info. and Comp. Sci. Department, University
of California, Irvine.
[S88] STORER, J. A. Data Compression Methods and Theory, Computer
Science Press, Rockville, Maryland, 1988.
[TW89] THOMBORSON, C. D. AND WEI, BELLE W.-Y. Systolic implementations
of a move-to-front text compressor. In Proceedings 1989 ACM Symposium
on Parallel Algorithms and Architectures, Santa Fe, New Mex.,
ACM, New York, 1989, pp. 283-290.
[W84] WELCH, T. A. A technique for high-performance data compression.
Computer 17, 6 (June, 1984), 8-19.
[W91] WILLIAMS, R. N. Adaptive Data Compression, Kluwer Academic
Publishers, Norwell, MA, 1991.
[Z90] ZITO-WOLF, R. J. A broadcast/reduce architecture for high-speed data
compression. In Proceedings of the Second IEEE Symposium on Parallel
and Distributed Processing, Dallas, Texas, 1990.
[ZL78] ZIV, J. AND LEMPEL, A. Compression of individual sequences via
variable-rate coding. IEEE Trans. Inf. Theory 24, 5 (1978), 530-536.
