University of Central Florida

STARS
Retrospective Theses and Dissertations
1988

Hardware algorithms for data compression
N. Ranganathan
University of Central Florida

Part of the Computer Sciences Commons

This Masters Thesis (Open Access) is brought to you for free and open access by STARS. It has been accepted for
inclusion in Retrospective Theses and Dissertations by an authorized administrator of STARS. For more information,
please contact STARS@ucf.edu.

STARS Citation
Ranganathan, N., "Hardware algorithms for data compression" (1988). Retrospective Theses and
Dissertations. 4329.
https://stars.library.ucf.edu/rtd/4329

HARDWARE ALGORITHMS FOR DATA COMPRESSION

by
N. RANGANATHAN

A dissertation submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in
the Department of Computer Science at
the University of Central Florida
Orlando, Florida
August 1988
Major Professors: Amar Mukherjee and Mostafa Bassiouni

HARDWARE ALGORITHMS FOR DATA COMPRESSION
N. Ranganathan
University of Central Florida
Orlando, FL 32816, August 1988
Major Professors: Amar Mukherjee and Mostafa Bassiouni

ABSTRACT

Data compression is the reduction of redundancy in data representation in
order to decrease storage and communication costs. Data compression techniques
have been used in practice primarily through software implementations which fail
to meet the speed and performance requirements of current and future systems.
This Ph.D. dissertation presents a set of hardware algorithms for compression and
decompression techniques and the results of detailed simulations performed to
quantify the effects of incorporating such hardware in various architectural environments.
A new pipelined algorithm for data compression applicable to static binary
encoding schemes is presented. A fast hardware algorithm for decompression that
uses a balanced binary tree structure to eliminate code storage tables is introduced.
Hardware algorithms are presented for the multi-group compression technique,
run-length encoding method and an enhanced version of arithmetic coding scheme.
These algorithms are suitable for VLSI implementation and can provide speeds that
are an order of magnitude higher than currently obtainable encoding speeds. The
design and implementation of a prototype compression chip for the Huffman
encoding scheme are presented. The chip yields an estimated compression rate of 10
million characters per second.
The effect of employing compression hardware on the performance of a general purpose computer system and a special purpose back-end multiprocessor
machine is analyzed by constructing detailed simulation models. Simulation results
establish that our VLSI chips for compression and decompression cause significant
improvements in system performance.


Dedicated
to

my grandmother

Kanthammal (Patti)


ACKNOWLEDGEMENTS
It is with pleasure that I express my appreciation to my advisor, Dr. Amar
Mukherjee, for his support and guidance in this work. It has been a privilege to work
under him.
I would like to thank Dr. Mostafa Bassiouni, who co-chaired my research committee, for his guidance and help. I would also like to thank the members of my
committee, Dr. H. N. Srinidhi, Dr. Ratan K. Guha and Dr. Brian E. Petrasko for their
time and assistance. I express my special thanks to Dr. Robert C. Brigham, who has
been a great source of encouragement and moral support throughout my graduate
study. I would like to thank the entire administrative and technical staff of the
Department of Computer Science for their support. The editorial assistance provided
by Ms. Susanne Payne is appreciated.
I would like to mention my special thanks to Teresa and Jack for their extraordinary friendship and support during my stay in Central Florida. I have enjoyed the
company of several friends while at UCF, including Greg Schaper, Jennifer, Viva,
Semwal, Ken, Tim, Greg Hanson, Loree, Mahesh, Donna, Don, Amy, Sudhir, Reddy,
Krishnan, Ashok, Manni and Srinivasa.
Finally, I am deeply grateful to my parents, N. Bagyalakshmi and V. Nagarajan,
and the rest of my family for their continued support, understanding and encouragement throughout these years of my graduate study in a faraway land.


Table of Contents

1 INTRODUCTION
1.1 Background
1.2 Motivation
1.3 Problems
1.4 Importance of Hardware Algorithms for Compression
1.5 Summary of Contributions
1.6 Outline of Dissertation
2 LITERATURE SURVEY
2.1 Software Compression Techniques
2.1.1 The Huffman's Encoding Scheme
2.1.2 The Multi-group Compression Technique
2.1.3 The Adaptive Huffman's Encoding Method
2.1.4 Run-length and Header Compression
2.1.5 Arithmetic Coding Scheme
2.1.6 Differencing, Null Bit-Maps and Front/Rear Compaction
2.1.7 Order-preserving Compression Methods
2.1.8 Dictionary-based Compression
2.2 Hardware-assisted Data Compression
2.2.1 The Microprocessor-assisted System (MAS)
2.2.2 A Finite State Machine Approach
2.2.3 A Systolic Approach
2.2.4 The UW Hardware
2.2.5 Mukherjee's Approach
3 HARDWARE ALGORITHMS
3.1 A Pipelined Architecture for Data Compression
3.2 The Decompression Hardware
3.3 The Multi-group Compression Hardware
3.4 The Multi-group Decompression Hardware
3.5 The Run-length Encoding Scheme
3.6 Hardware for an Enhanced Arithmetic Coding Scheme
3.7 Speed Estimates
4 DESIGN OF A COMPRESSION CHIP
4.1 Functional Description
4.2 Design of Basic Cells
4.2.1 The Decoder Cell
4.2.2 The Recirculating Latch Cell
4.2.3 The Basic Cells for the Tree
4.2.4 The Output Circuit
4.3 Layout Design
4.3.1 The Decoder Layout
4.3.2 The Recirculating Latch Layout
4.3.3 The Tree Layout
4.3.4 The Output Circuit Layout
4.3.5 The Chip Layout
4.4 External Interface
4.5 Dynamic Simulation
4.6 Timing Estimates
5 SIMULATION OF ARCHITECTURES
5.1 A Communication Network Architecture
5.2 A General Purpose Computer System
5.3 A Special Purpose Back-end Machine
6 CONCLUDING REMARKS
BIBLIOGRAPHY

CHAPTER 1
INTRODUCTION
The increasing use of computer-based systems and explosive growth in the
size of data within centralized and distributed information processing systems mandate the availability of massive on-line and archival storage media as well as
high-capacity communication links.
Data compression is the reduction of redundancy in data representation in
order to decrease data storage requirements and data communication costs. A
mechanism to encode the data in order to reduce the redundancy could possibly
provide a 30-80% reduction in the size of data in a large commercial database.
Due to their complexity, past implementations of data compression techniques have
been mostly restricted to software. The overheads incurred by the compression
process to encode data and the expansion process to recover original data are so
high that some databases do not use any compression technique. With the advent
of VLSI technology, the obvious solution is to design and develop fast VLSI chips
that could be integrated into real time systems, so that data can be encoded and
decoded on the fly.
The objective of this Ph.D. dissertation is to design and develop hardware
algorithms for data compression and decompression and to establish, with quantified
performance measures, the benefits of integrating such hardware into different
architectures through detailed simulation studies.
In this chapter, we will discuss the background and motivation for data
compression in sections 1.1 and 1.2. Section 1.3 discusses problems related to data
compression. The importance of VLSI algorithms for data compression and
decompression is described in section 1.4. The summary of contributions is given
in section 1.5. The last section describes the outline of the dissertation.

1.1 Background
Data compression can be defined as the reduction in the amount of signal
space that must be allocated to a given message or a data sample set (Lynch 1985).
The signal space may be in a physical volume, such as a storage medium like the
disk; an interval of time, such as the time required to transmit a given message set;
or bandwidth required to transmit the given message set. All these forms of signal
space -- volume, time, and bandwidth -- are interrelated, in the sense that volume
is a function of the product of time and bandwidth. Thus, a reduction in volume
can be translated into a reduction in transmission time or bandwidth.
Data compression techniques are broadly classified as reversible and irreversible
techniques. These classifications are also generally referred to as redundancy
reduction and entropy reduction, respectively, in the literature. We can think of
data representation as a combination of information and redundancy (Shannon
1948). Information is the portion of data that must be preserved in its original
form permanently. However, redundancy is that portion of data that can be
removed and reinserted at a later time and place. Most often, the redundancy has
to be reinserted in order to recognize the original data. A redundancy reduction
technique removes, or at least reduces, the redundancy in such a way that it can be
subsequently reinserted into the data. Thus, redundancy reduction is always a
reversible process, since the original data can be completely recovered. Entropy
can be defined as a measure of the average information of the source data set.
Entropy gives a lower bound on the average number of bits per symbol needed to
represent the source without loss.
Hence, entropy reduction results in a reduction of information. The information
lost can never be recovered, so an entropy reduction is irreversible.
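As an illustration of entropy as a measure of average information, the following minimal sketch computes the zero-order Shannon entropy, H = -sum(p_i * log2(p_i)), of a short string. The function name shannon_entropy, the sample string and the choice of a zero-order (symbol-frequency) model are assumptions made for this example; they are not taken from the dissertation.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Average information per symbol, in bits, from observed symbol frequencies."""
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the distinct symbols that actually occur.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    sample = "aaaabbcd"  # hypothetical message: p(a)=1/2, p(b)=1/4, p(c)=p(d)=1/8
    h = shannon_entropy(sample)
    # With 8-bit character codes, the gap between 8 and h bits per symbol is
    # the redundancy that a reversible technique may remove under this model.
    print(f"entropy = {h} bits/symbol, redundancy = {8 - h} bits/symbol")
```

Under this simple model, a reversible code cannot average fewer than H bits per symbol; going below that bound necessarily discards information.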
The entropy reduction schemes are classified as quantization techniques and
others. The redundancy reduction schemes are further classified into optimum
source coding, nonredundant sample coding, binary source coding and others. For
example, the Huffman code and the Shannon-Fano codes are optimum source
codes while run-length coding is used for nonredundant sample coding as well as
binary source coding. These techniques have been widely used in many applications, including the fields of speech, telemetry, television, pictures, and computer
information systems. The theoretical background, the various techniques and their
applications are described in great detail in Lynch (1985).
In this dissertation, we are concerned with data compression techniques for
redundancy reduction that are applicable to text files in order to decrease data
storage requirements and communication costs in a distributed or general purpose
computer system. Hence, in this dissertation, all references to data compression are
assumed to be in the context of text compression. Also, the terms compression and
encoding are used interchangeably. The terms decompression, expansion and
decoding will mean the same process of recovering the original data by reinserting
the redundancy removed during the compression phase. Within a digital system,
information is represented as an encoding of symbols. This representation involves
two distinct steps: the choice of a physical representation, which is a set of symbols
from a given alphabet, and the choice of a code set, which maps each symbol to a
particular code. We will assume that the symbol set consists of the letters,
digits and special characters represented using the ASCII code set. In this context,
data compression involves the encoding of data with codes that will reduce the
redundancy in this representation.
Compression Ratio is defined as the ratio of the length of a message before
compression to the length of the same message after compression. This parameter
reflects the effectiveness of compression.
Compression Rate is defined as the number of characters that can be encoded
per second. The speed of different hardware algorithms proposed in Chapter 3 will
be compared using compression rate as the metric.
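The two definitions above can be written as a short sketch. The function names and the sample figures are hypothetical and chosen only for illustration; the helper ratio_from_reduction is an added convenience that converts a fractional size reduction (such as the 30-80% range mentioned earlier) into the equivalent compression ratio.

```python
def compression_ratio(original_length: int, compressed_length: int) -> float:
    """Length before compression divided by length after compression."""
    if compressed_length <= 0:
        raise ValueError("compressed_length must be positive")
    return original_length / compressed_length


def compression_rate(chars_encoded: int, seconds: float) -> float:
    """Number of characters encoded per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return chars_encoded / seconds


def ratio_from_reduction(reduction: float) -> float:
    """Compression ratio equivalent to a fractional size reduction in [0, 1)."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # Hypothetical message: 1000 characters reduced to 400.
    print(compression_ratio(1000, 400))
    # Hypothetical encoder: 5 million characters in half a second.
    print(compression_rate(5_000_000, 0.5))
    # Ratios corresponding to 30% and 80% size reductions.
    print(ratio_from_reduction(0.3), ratio_from_reduction(0.8))
```

A ratio greater than 1 means the message shrank; a ratio below 1 means the encoding expanded it.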

1.2 Motivation
Data compression offers several benefits. A complete discussion on the
advantages of data compression is given in Bassiouni (1985). The most obvious
advantage of data compression is that of reducing the storage requirement of information. Reducing the storage requirement of databases is equivalent to increasing
the capacity of the storage medium. This latter issue is a constantly pressing problem in virtually all large database systems. In systems with several levels of
storage hierarchies, it is possible (at least in principle) to use data compression for
moving data files to higher (faster) storage levels (usually with smaller capacity). It
is interesting to note here that although advances in technology are continuously
reducing the cost and increasing the maximum capacity of storage devices, the
interest in and the need for data compression research have actually increased.
One main reason behind this is the fact that the explosive proliferation of data and
information processing applications continues to outgrow any advances in technology. In addition, data compression has other appealing benefits, as explained
below.
Since compressed data are encoded using a smaller number of bytes, transfer of
compressed information from one place to another requires less time (hence resulting in a higher effective transfer rate). The growing recognition of the importance
of data compression in reducing the cost of data transmission within distributed
databases and document delivery systems has become one of the main motivations
behind the increased interest in data compression. Great cost savings can be
achieved by compressing the voluminous amounts of data and electronic mail
before they are transmitted over long-distance communication links. In I/O-bound
systems, the fast transfer rate of compressed data from disk to main memory or
vice versa can contribute to reducing the transaction's turnaround time (or user's
response time). Since data compression reduces the loading of I/O channels, it
becomes feasible to process more I/O requests per second and hence achieve a
higher effective channel utilization.
Efficient data encoding schemes can also be very valuable to the design and
performance of supercomputers (Bassiouni and Mukherjee 1987). Some supercomputers (e.g., Fujitsu VP-200) use a compress/expand mode as one of the available
vector instruction modes (Lubeck 1985; Riganati and Schneck 1984). It is important
to realize that unless I/O transfer rates are proportionally improved, processor
utilization in supercomputers may yield sustained performance figures that are
orders of magnitude lower than peak rates. Attaining dramatic gains in the
performance of I/O and communication devices may prove to be one of the most
important challenges facing the design of future supercomputers. High speed VLSI
chips for data compression/encoding represent a viable solution that can effectively
increase the speed of I/O and communication devices.
Data compression also plays an important role in reducing the cost of backup
and recovery in many computer systems. Backup copies of the files and dumps of
databases are often stored in compressed form. Not only does this process reduce
the cost of backup, but it also speeds up recovery. Upon a system failure, the
backup copies are read and used to restore the database to the consistent state of
the last checkpoint (Kaunitz and Ekert 1984), and the audit trail is then processed
to restore the database to the consistent state at the time of failure. Recovery time
is therefore reduced since data compression reduces the overhead of the
time-consuming operation of writing into or reading from the backup tapes. It is also
worth mentioning that data security may be enhanced by data compression since
compressed data usually need to be decoded before they can be processed or interpreted.
Data compression can lead to other types of improvement in system performance. For example, in some index structures it is possible through compression
to pack more keys into each index block (Batory 1983). When the database is
searched for a given key value, the key is first compressed and the search is
performed against the compressed keys in the index blocks. The net effect is that
fewer blocks have to be retrieved and thus the average search cost is
reduced. This explains why data compression is provided as a standard option in
the IBM VSAM access method. Similar arguments may apply for associative
buffers used to store a subset of database records in main memory (Bassiouni
1985). When a database record needs to be accessed, the associative buffer is first
searched and a disk access operation is avoided if the record is found in the buffer.
Compressing records increases the effective capacity of the buffer and consequently its hit ratio (the probability that a record will be found in the buffer).

1.3 Problems
The availability of fast and efficient compression methods does not assure an
effective system. Data compression introduces several problems as we try to
integrate compression into a computer system. The remainder of this section
discusses the disadvantages of data compression.
The overhead incurred by the compression process to encode data and the
decompression process to recover original data is one of the most serious disadvantages of data compression. For some applications, this overhead could be considerable enough to discourage any consideration for employing data compression. The
compression rates of previous methods that use hardware assist features for
implementing compression have been estimated to be on the order of 0.64 megabytes
per second (Lea 1978) and 1.57 megabytes per second (Hawthorn 1982).
These figures still do not meet the performance requirements of current and future
systems. The solution to this problem is the design of complete hardware and/or
hardware assist features to implement efficient compression schemes. This dissertation focuses on solving this major problem by proposing fast and efficient VLSI
algorithms for implementing compression.
The length of the compressed message is usually not predictable prior to actually compressing the data since it depends on the compression method as well as
the contents of the message itself. Sometimes, the message may not be
compressed at all. Therefore, the space allocated for the compressed message has
to be as big as the size of the original message. Unpredictability in length of the
compressed message also impacts bandwidth requirements of a system; as a result
data paths in the system have to be overdesigned to handle peak load demands
(Welch 1984).
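To make the dependence of output length on message content concrete, the sketch below uses a deliberately naive run-length encoder that emits (count, byte) pairs. This encoder and its output format are assumptions made for illustration only; they are not the run-length scheme described later in the dissertation.

```python
def rle_encode(data: bytes) -> bytes:
    """Naive run-length encoding: each run becomes a (count, value) byte pair."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Extend the run while the next byte repeats; cap at 255 so the
        # count fits in a single byte.
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)


if __name__ == "__main__":
    repetitive = b"a" * 8 + b"b" * 4   # long runs: compresses well
    varied = b"abcdef"                 # no runs: every byte costs two
    for message in (repetitive, varied):
        encoded = rle_encode(message)
        print(len(message), "->", len(encoded))
```

The same encoder shrinks the first message and doubles the second, so the output size cannot be known without examining the data. In this naive format the output can exceed the input; a practical scheme would typically fall back to storing such data unencoded.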
It is obvious that reducing redundancy in data representation would in turn
reduce the ability to recover data from errors. A single-bit error in the output of
the Huffman Code (Huffman 1952), for example, could cause the decoder to misinterpret all subsequent bits. This problem has been alleviated to some degree by the
increasingly reliable technology in disk storage and communications media (Cysper
1978). Also, most telecommunication networks employ protocols for detecting
transmission errors, e.g., SDLC (Synchronous Data Link Control) and HDLC
(High-Level Data Link Control) protocols (Shanker and Lam 1983).
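The effect of a single-bit error on a variable-length code can be seen in a small sketch. The four-symbol prefix code below is a hypothetical table chosen for illustration (it is a valid Huffman code for probabilities 1/2, 1/4, 1/8 and 1/8), and the helper names are likewise assumptions.

```python
# Hypothetical prefix code; not a table taken from the dissertation.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}


def encode(text: str) -> str:
    """Concatenate the codeword of each symbol into one bit string."""
    return "".join(CODE[ch] for ch in text)


def decode(bits: str) -> str:
    """Greedy prefix decoding; trailing bits that form no codeword are dropped."""
    inverse = {codeword: symbol for symbol, codeword in CODE.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            out.append(inverse[buffer])
            buffer = ""
    return "".join(out)


def flip_bit(bits: str, index: int) -> str:
    """Return a copy of the bit string with one bit inverted."""
    flipped = "1" if bits[index] == "0" else "0"
    return bits[:index] + flipped + bits[index + 1:]


if __name__ == "__main__":
    message = "abacad"
    sent = encode(message)
    received = flip_bit(sent, 0)   # corrupt only the first bit
    print(decode(sent))            # the original message
    print(decode(received))        # codeword boundaries shift after the error
```

Because codeword boundaries are implicit in the bit stream, one flipped bit changes where the decoder believes each subsequent codeword starts, so the damage is not confined to a single symbol.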
Many data compression schemes disrupt data properties that may be important
for some applications. For example, when compression does not preserve the lexical
order of the data, efficient sorting and searching schemes can become inapplicable.
In this case, the data must be expanded and the search performed on the original
data. Some order-preserving compression methods have been proposed (Alsberg
1975; Batory 1983; Knuth 1973) which allow search operations to be performed
directly on compressed data.
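The order-preservation property can be illustrated with two hypothetical prefix codes over the same four symbols. In the first, codewords increase in the same order as the symbols, so comparing encoded strings gives the same result as comparing the originals; in the second they do not. Both tables and the helper names are assumptions for this sketch and do not reproduce any of the cited methods.

```python
# Codewords increase with the symbols: comparisons on encoded text agree
# with comparisons on the original text.
ORDERED_CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Same codeword lengths, but assigned in reverse symbol order.
SCRAMBLED_CODE = {"a": "111", "b": "110", "c": "10", "d": "0"}


def encode(text: str, code: dict) -> str:
    """Concatenate the codeword of each symbol into one bit string."""
    return "".join(code[ch] for ch in text)


def sort_by_encoding(words: list, code: dict) -> list:
    """Sort words by comparing only their encoded bit strings."""
    return sorted(words, key=lambda word: encode(word, code))


if __name__ == "__main__":
    words = ["bad", "cab", "add", "dab", "abc"]
    print(sorted(words))                            # order of the original text
    print(sort_by_encoding(words, ORDERED_CODE))    # same order
    print(sort_by_encoding(words, SCRAMBLED_CODE))  # order is lost
```

With the first table, a sorted index can be searched directly on the compressed keys; with the second, the keys would have to be expanded first.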
Most software implementations of data compression are coded in assembler
languages (for efficiency reasons), and there are not enough well-defined standards
in this area to make data compression utilities completely portable or easily
modifiable.
Some compression techniques require the storage of extra data like the
encoding/decoding trees in Huffman's scheme (Huffman 1952) and the header
file in the header compression method (Eggers 1981). However, this overhead is
far less than the space saving due to compression.


1.4 Importance of Hardware Algorithms for Compression
Several data compression algorithms of different philosophy, complexity and
application scope have been proposed in the literature. We will review these algorithms in Chapter 2. Many of these algorithms have been used in practice primarily through software implementations which fail to meet the speed and performance requirements of current and future systems. The most serious disadvantage
of data compression is the time and space overheads incurred during the encoding
and decoding phases. Successful implementation of efficient VLSI chips for data
compression and decompression that can be integrated into the remote terminals,
disk controllers, and communication controllers can result in significant improvement in the performance of computer systems. To use Lynch and Brownrigg's
(1981) words:

We are not aware of any generally available hardware assist features for algorithms such as Huffman coding at the present time, which seems strange since
such a feature could be used to good advantage throughout the design of an
operating system, a file subsystem or a database management system. If one considers the less revolutionary (and in our view, more likely) scenario of gradually
increasing controller intelligence, data compression (along with cache, encryption,
and perhaps expanded independent key searching capabilities) would seem like a
strong prospect for inclusion in an advanced disk controller. (441)

In 1988, the essence of this idea remains unchanged. A few paper designs
have been reported in the literature, but none have been implemented. Thus, we
see the need for development and implementation of efficient hardware algorithms
for data compression and decompression.
The effect of compression hardware on the performance of a computer system
can be quantified by constructing detailed simulation models of the system with
and without compression. We constructed such models and ran simulation programs for various input data. We will report the results of the simulations in
Chapter 5. Important performance measures like utilization, throughput and
response time were compared, using typical values of system
parameters/characteristics. The improvement in the performance of the system can
be used to evaluate the effectiveness of the proposed VLSI chips.
The architectures considered for the simulations are a general purpose computer system and a special purpose back-end machine. The compression hardware
will be of great advantage in communication systems. However, we will not consider communication architectures for simulation purposes since the improvement
in the effective bandwidth of the communication links due to compression is
directly dependent on the compression ratio. Device controllers and I/O channels
are appealing choices for location of data compression hardware in a general purpose machine. The data have to be compressed before being stored and expanded
before being supplied to the CPU for processing. Compression will be transparent
to both the user and the application software. The configuration used for the simulation is similar to that of a VAX 11/780 machine.


There are different classifications of database machines depending on factors
like processor organization, storage schema, control strategy and several others
(Srinidhi and Sloan 1988). We performed simulations for one such architecture
which is similar to that of the DELTA machine (Shibayama, Kakuta, Miyazaki,
Yokota and Murakami 1984). Such machines are equipped with very high speed
hardware mechanisms for performing operations like selection and join. The tuple
reconstruction overhead due to the adoption of an attribute-based internal schema
for data storage is extremely high, which makes the high speed processing capability of the machine transparent and useless to the user. The large number of disk
accesses required to fetch the large domains of attributes for tuple reconstruction
considerably affects the performance of the system. The tuple reconstruction time
could be reduced by storing the attribute values in compressed form. Our simulation results, which are reported in Chapter 5, quantify the improvements in the performance of such a system due to compression hardware.
Supercomputers and telecommunication networks can use the power of
hardware algorithms for compression and decompression. Supercomputers suffer
from degradation of the net sustained performance because of low processor
utilization in spite of high vector processing speeds. One of the major causes of this
degradation is I/O delay resulting from poor I/O transfer rates. Attaining
significant improvements in the performance of I/O devices may be one of the
major challenges facing the design of future supercomputer systems. High speed
VLSI chips for data compression/expansion represent a viable solution that can
effectively increase the speed of I/O and overcome the speed limitations of
mechanical disk devices.
Telecommunication links are currently among the most bandwidth-restricted
devices in computer systems. VLSI encoding chips can be inserted into the
transmitter logic of the communication controllers of these devices so that huge
amounts of data and electronic mail can be compressed before being transmitted
over communication channels. The original data can be recovered by the expansion
hardware inserted into the receiver logic of the communication controller at the
receiving end. Thus the placement of compression hardware into these systems
increases the effective bandwidth of communication links thereby reducing
transmission costs.
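The relationship between compression ratio and effective link bandwidth can be sketched with simple arithmetic. The model below is an idealization that assumes the encoder and decoder keep pace with the link and that protocol overhead is negligible; the function names and the sample link speed are hypothetical.

```python
def effective_bandwidth(link_bits_per_second: float, compression_ratio: float) -> float:
    """Uncompressed bits delivered per second over a link carrying compressed data."""
    return link_bits_per_second * compression_ratio


def transmission_time(message_bits: float, link_bits_per_second: float,
                      compression_ratio: float = 1.0) -> float:
    """Seconds needed to send a message, ignoring codec latency and protocol overhead."""
    return message_bits / effective_bandwidth(link_bits_per_second, compression_ratio)


if __name__ == "__main__":
    link = 9600.0          # hypothetical link speed in bits per second
    message = 1_000_000.0  # hypothetical message size in bits
    print(transmission_time(message, link))        # without compression
    print(transmission_time(message, link, 2.0))   # with a compression ratio of 2
    print(effective_bandwidth(link, 2.0))          # effective bits per second
```

Under these assumptions the effective bandwidth scales linearly with the compression ratio, which is why the gain for a communication link follows directly from the ratio achieved.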

1.5 Summary of Contributions
The major contributions of the dissertation are:

(i)  development of a set of new hardware algorithms for data compression
     and decompression that have applicability to various architectures of
     present and future computing. The speed of these algorithms is an
     order of magnitude higher than currently obtainable compression speeds.
(ii) simulation results that quantify the performance improvements obtained
     by incorporating compression hardware in various architectural environments.
Specifically, we make the following contributions:
Under (i),

*    A new pipelined architecture for data compression applicable to any
     static coding scheme is developed.
*    A hardware algorithm for decompression applicable to static binary
     codes is presented.
*    A set of new hardware algorithms for the multi-group compression and
     decompression schemes is proposed.
*    A hardware algorithm for run-length encoding scheme is presented.
*    Hardware designs for implementing an enhanced version of arithmetic
     coding scheme are introduced.
*    Design and implementation of a prototype CMOS VLSI chip for
     Huffman coding is described.
Under (ii),

*    Improvement in throughput of a general purpose computer system due
     to compression hardware is quantified through simulation studies, and
     the resulting measures of improvement in system performance are
     presented.
*    Quantified measures of improvement in query execution time as a result
     of including compression hardware in a special purpose multiprocessor-based
     back-end machine architecture are presented.

1.6 Outline of Dissertation
A detailed literature survey is given in Chapter 2, which summarizes the various software techniques that exist in the literature. Also, the state of the art with
respect to hardware implementations of compression/decompression is discussed in
the second part of Chapter 2.
Chapter 3 presents a set of new VLSI algorithms for software compression
and decompression techniques. A new pipelined architecture for data compression
applicable to any static coding scheme is proposed. A hardware algorithm for
decompression applicable to static codes is introduced. Hardware algorithms are
proposed for the multi-group compression technique, run-length encoding scheme
and an enhanced version of arithmetic coding scheme. Speed estimates of various
algorithms are also discussed.
The design and implementation of a prototype CMOS VLSI chip for the
Huffman encoding scheme are described in Chapter 4. The chip implements the
above mentioned pipelined algorithm for compression using static codes. The
results of detailed simulations to quantify the effects of incorporating compression
hardware in various architectural environments are discussed in Chapter 5.
We conclude in Chapter 6 by summarizing our results and identifying directions for further research.

CHAPTER 2
LITERATURE SURVEY
This chapter presents an overview of both software and hardware techniques
for data compression. Several data compression techniques of different philosophy,
complexity, and application scope have been reported in the literature and many
have been used in practice primarily through software implementations. A detailed
literature survey of the various techniques is presented in the following section of
this chapter. A few paper designs exist in the literature that propose partial or
complete hardware assist features for implementing compression, but no implementation has been done. The state of the art with respect to hardware implementation
of compression and decompression techniques is discussed later in the chapter.

2.1 Software Compression Techniques
Several data compression algorithms such as Huffman's method (Huffman
1952), adaptive Huffman's method (Gallager 1978), multi-group compression
method (Bassiouni and Hazboun 1983; Hazboun and Bassiouni 1982), run-length
encoding (Golomb 1966), header compression method (Eggers, Olken and Shoshani
1981), LZ algorithm (Ziv and Lempel 1977), LZW algorithm (Welch 1984), arithmetic coding (Langdon 1984), dictionary based methods (Wagner 1973; Storer and
Szymanski 1982) and many variations of these have been proposed in the literature. In the rest of this section, these techniques are described and illustrated with
examples.

2.1.1 The Huffman's Encoding Scheme
The Huffman's encoding scheme (Huffman 1952) takes advantage of the variability of the frequency of occurrence of characters. Accordingly, the more frequent
characters are assigned the shorter codes and the less frequent ones get the longer
codes. This reduces the average number of bits per character to a minimum value
in the encoded message. The Huffman's code has the prefix property, which
guarantees that no code for a character appears as a prefix in the code for any other
one.
The Huffman's method builds a decoding tree which yields codes that minimize the average number of bits per character. If $p_i$ is the probability of occurrence
of the character $i$ in the alphabet $\Sigma$, and $d_i$ is the length of the path from the root
of the tree to the leaf node representing the character $i$, then the Huffman's tree
minimizes the quantity

$$\sum_{i=1}^{N} p_i d_i$$

where $N$ is the size of $\Sigma$.

In the Huffman's tree, the leaf nodes represent the characters. The tree is constructed as follows: a parent node is created for the two nodes with the lowest probabilities. The sum of the probabilities of the two nodes will be assigned as the
probability of the parent node. Then the two nodes are replaced by the parent node
for further construction. The above steps are repeated until a single node is left
with an assigned probability of 1. This node becomes the root of the Huffman tree.
The left and right branches of each parent node are labeled as zero and one respectively, or vice versa. An example of Huffman's tree is given in Figure 2.1. The
tree is constructed for a set of 7 characters A, B, C, 0, 1, 2, b(blank) with the
corresponding probabilities of occurrence being 0.1, 0.1, 0.1, 0.3, 0.1, 0.1, and 0.2
respectively.
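
As a rough illustration of the construction just described, the following Python sketch repeatedly merges the two lowest-probability nodes using a heap. The names `huffman_codes` and `average_length`, the tie-breaking counter, and the choice of labeling the first-removed node with 0 are assumptions made for this sketch. Because several probabilities are equal, the codewords it produces need not match those of Figure 2.1, although every Huffman tree for these probabilities has the same average code length.

```python
import heapq
from itertools import count

# Probabilities of the seven example characters ("b" stands for blank).
PROBS = {"A": 0.1, "B": 0.1, "C": 0.1, "0": 0.3, "1": 0.1, "2": 0.1, "b": 0.2}

def huffman_codes(probs):
    """Return {symbol: codeword} built by merging the two least probable nodes."""
    tie = count()  # unique counter so that equal probabilities never compare dicts
    heap = [(p, next(tie), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, first = heapq.heappop(heap)    # lowest probability
        p1, _, second = heapq.heappop(heap)   # second lowest probability
        # The parent node covers both subtrees; prepend the branch label.
        merged = {s: "0" + c for s, c in first.items()}
        merged.update({s: "1" + c for s, c in second.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

def average_length(probs, codes):
    """Average number of bits per character, i.e. the sum of p_i * d_i."""
    return sum(p * len(codes[s]) for s, p in probs.items())

if __name__ == "__main__":
    codes = huffman_codes(PROBS)
    for sym in sorted(codes):
        print(sym, codes[sym])
    print("average bits per character:", round(average_length(PROBS, codes), 3))
```

For the example probabilities the average works out to 2.7 bits per character, compared with 3 bits for a fixed-length code over seven characters.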
The Huffman's code for a character is the sequence of 0's and 1's in the
unique path from the root of the tree to the leaf node representing the character
(the code for A is 000 and that of 0 is 01). To encode (compress) the string
"OAF", for example, the encoder concatenates the codes for the three characters to
produce the binary string "01000100". To decode (decompress) this latter
string, the decoder moves down the tree while processing the binary string from
left to right. Thus the first "0" causes the decoder to branch to the right child of the
root. The following "1" causes it to branch to the external node representing the
character D. After emitting "D" in the output, the decoder repeats this process starting at the root node. Thus the decoder must receive bits in the correct order, i.e.,
bits of the Huffman's codes must be generated starting at the root of the tree and
ending at an external node.

Figure 2.1. An example Huffman tree.
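
The encoding and tree-walking decoding procedure can be sketched in Python as follows. The code table is an assumption: only the codewords for A (000) and 0 (01) are stated in the text, and the remaining codewords are hypothetical placeholders chosen to form a complete prefix code with the same code lengths. The nested-dictionary tree and the function names are likewise illustrative only.

```python
# Assumed prefix code: A = 000 and 0 = 01 come from the text, the other
# codewords are hypothetical placeholders ("b" stands for blank).
CODES = {"A": "000", "B": "001", "0": "01", "C": "100",
         "1": "101", "2": "110", "b": "111"}

def build_tree(codes):
    """Build a decoding tree as nested dicts; leaves are the characters."""
    root = {}
    for sym, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = sym
    return root

def encode(text, codes=CODES):
    """Concatenate the codewords of the characters in text."""
    return "".join(codes[ch] for ch in text)

def decode(bits, codes=CODES):
    """Walk down the tree bit by bit, restarting at the root after each leaf."""
    root = build_tree(codes)
    node, out = root, []
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):   # reached an external (leaf) node
            out.append(node)
            node = root
    if node is not root:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(out)

if __name__ == "__main__":
    bits = encode("0AC")
    print(bits, "->", decode(bits))
```

Under this assumed table the three-character string 0AC encodes to 01000100, and the decoder recovers it by restarting at the root after each emitted character.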
There are overheads involved in implementing the Huffman's compression
scheme. Typically, it is a two pass algorithm where one pass is required to compute the frequency counts of the characters in the file to be compressed and a
second pass to actually encode the file. Moreover, the translation table must be
stored along with the compressed file for use during decompression. Another problem associated with Huffman's scheme, which is common to variable length coding
schemes, is that the length of each code is not known until a few bits have been
interpreted. Furthermore, the Huffman's scheme uses only the distributional properties of data and ignores the correlational properties (Bassiouni 1985). This leads to
the discussion of the next technique which makes use of both the above properties.

2.1.2 The Multi-group Compression Technique
The multi-group compression technique (Hazboun and Bassiouni 1982) is an
improvement over Huffman's since it takes advantage of the distributional as well
as the correlational properties observed in most commercial files. The method
makes use of two properties: (i) the group locality of character reference behavior
and (ii) the variable frequency of occurrence of different characters within the
different subgroupings of the character set. The first property refers to the tendency
of consecutive characters to fall within the same type, e.g., alphabets, digits, successive blanks, etc. The entire character set is divided into subgroups depending on
their types. The second property implies that there is skewness in distribution of
the frequency of occurrence of characters within any subgroup.
The Multi-group scheme involves building two sets of trees: local trees and
failure trees. A local tree for a given group is a Huffman tree for the characters
within that group along with a special character that will be used to indicate group
switch. This special character called the "failure" character is assigned a weight
computed from the average sequential run-length and the weights of the characters
in that group. The average sequential run-length of a locality group is the expected
number of consecutive characters from that group before a character from a
different locality group is encountered. The failure tree of each group is constructed by assigning appropriate weights to other groups depending on the statistics of the transition frequencies among the different groups.
The Multi-group scheme is illustrated with the same example that was used to
describe the Huffman's scheme. The character set is divided into three subgroups:
alphabets, digits and blanks with '&' being the failure character. The local and
failure trees are given in Figure 2.2. In the case of three subgroups, the weights in
failure trees are immaterial since each such tree has only two leaf nodes. The string
ABC00012bb can be coded in 23 bits as follows: 01 11 10 00 1 0 0 0 101 100 11
0 1 1. Notice that after the code for C (which is 10), the failure code 00 followed
by the branching code 1 (from alphabet failure tree) are used to indicate a switch
to the digit group. It should be noted that the same string needs 27 bits to be coded
using the Huffman's tree given in Figure 2.1.

Figure 2.2. Local and failure trees for multi-group scheme (panels: alphabet, digit and blank local trees; alphabet, digit and blank failure trees).
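
The 23-bit encoding of the example string can be reproduced with a small table-driven Python sketch. The codewords for the alphabet and digit groups are read off the bit sequence quoted in the text. The remaining entries (the second branch of each two-leaf failure tree and the whole blank failure tree) are assumptions, as is the convention that the encoder starts in the alphabet group.

```python
# Group membership of each character ("b" stands for blank).
GROUP_OF = {"A": "alpha", "B": "alpha", "C": "alpha",
            "0": "digit", "1": "digit", "2": "digit", "b": "blank"}

# Local trees: codewords of the characters and of the failure character "&".
LOCAL = {
    "alpha": {"A": "01", "B": "11", "C": "10", "&": "00"},
    "digit": {"0": "0", "1": "101", "2": "100", "&": "11"},
    "blank": {"b": "1", "&": "0"},
}

# Failure trees: branching code that selects the new group after "&".
FAILURE = {
    "alpha": {"digit": "1", "blank": "0"},
    "digit": {"blank": "0", "alpha": "1"},
    "blank": {"alpha": "0", "digit": "1"},   # assumed, not used by the example
}

def encode(text, start="alpha"):
    """Return the list of codewords emitted for text."""
    group, out = start, []
    for ch in text:
        target = GROUP_OF[ch]
        if target != group:
            out.append(LOCAL[group]["&"])        # failure code of current group
            out.append(FAILURE[group][target])   # branching code to new group
            group = target
        out.append(LOCAL[group][ch])
    return out

if __name__ == "__main__":
    words = encode("ABC00012bb")
    print(" ".join(words))
    print("total bits:", sum(len(w) for w in words))
```

Each group switch costs one failure code plus one branching code, which is why the scheme pays off only when consecutive characters tend to stay within the same group.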
Further optimization can be obtained by combining the local tree and failure
tree of each group (Bassiouni and Hazboun 1983). In this version, the failure character and the branching code are replaced by a single group switch indicator, with
one for each possible group switch. The group switch indicator represented as
&(i,j) is used to indicate a switch from group i to group j. The weight of &(i,j) is
assigned a value based on the expected frequency of transitions from group i to
group j. The alphabet local and failure trees in Figure 2.2 are combined into a single tree as given in Figure 2.3.
The multi-group method achieves an average compression improvement of 25%
over the Huffman's method (Bassiouni 1985). It should be noted that the technique has the disadvantage of the need to gather statistics about transition frequencies among the different groups in addition to the individual character frequencies
required by the Huffman's method. Both the methods need a first pass for collection of statistics and this overhead is eliminated in the next variation of the
Huffman's scheme discussed below.

Figure 2.3. Combining local and failure trees for the alphabet group.

2.1.3 The Adaptive Huffman's Encoding Method
The adaptive Huffman's encoding scheme is another variation of the
Huffman's method. This scheme aims at achieving further compression by adapting the Huffman's method to slowly varying estimates of the frequency distribution. Many recent versions of the UNIX operating system (Ritchie and Thompson
1974) provide a file compression facility based on this adaptive Huffman's method.
The adaptive scheme maintains a counter for each character in the alphabet,
and increments the counter each time that character occurs. The algorithm uses
these counts directly as the weights required to construct the Huffman's tree. The
idea of the scheme is to change the codes (i.e., the Huffman's tree) as the running
estimates of character frequencies change. Gallager (Gallager 1978) showed that a
Huffman's tree must have the property that each node (except the root) has a
sibling, and that the nodes can be listed in order of nonincreasing weight, with
each node being adjacent in the list to its sibling. We will illustrate the technique
with an example from Bassiouni (1985). Consider an alphabet of the five characters A, B, C, D, and E, with counts 20, 10, 10, 10, and 17 respectively. Figure 2.4
shows a Huffman's tree for this example. A list satisfying the sibling property
mentioned above is
V1 V2 V3 V4 V5 V6 V7 V8
which can be grouped into sibling pairs as
(V1 V2) (V3 V4) (V5 V6) (V7 V8)
Whenever a character occurs, the counts in the reversed path from the leaf node to
the root are incremented by one (in the order specified by the reversed path). For
example, if letter A occurs, the count of vertex V3 becomes 21, and then that of
V1 becomes 41. Whenever the count of a node is incremented, the new count must
be compared with the two counts of the next higher sibling pair (if any) in the
ordered list. If the new count becomes larger than any one of those two counts,
then the two nodes must be interchanged. For example, if for the tree of Figure
2.4, the letter C occurred, the new count (of value 11) of node V8 is compared to
the counts of V5 and V6 (since V5 and V6 form the next sibling pair higher than
V8 in the list given above). Therefore V6 and V8 are interchanged. The count of
the new parent of V8 (now V2) is incremented by one to become 28 and the final
tree is shown in Figure 2.5. Thus the occurrence of the character C has caused the
code of that character to change from 100 to 00.
Notice that the code change is made after the encoder has generated the old
code of the encountered character. This is necessary because the decoder must
decode the old code to obtain information on which changes are based. Notice
also that several node interchanges along the reversed path could occur as a result
of an occurrence of one character. Two forward and two backward pointers per
sibling pair are used to implement the interchange in a constant amount of
time (Gallager 1978).

Figure 2.4. An example Huffman tree for the adaptive scheme.

Figure 2.5. Updated tree after one occurrence of the character C.

The adaptive method described above uses two parameters: N and α. The
parameter N represents the periodic (fixed) interval (i.e., number of characters)
after which estimates (counts) must be re-evaluated. The fraction α represents the
degree by which previous intervals affect the current estimates. After every interval
of N characters, the counts are multiplied by α. Thus if α=0, counts are computed
using only the current interval, and hence resulting in rapidly changing estimates.
When α=1, estimates are computed using the entire previous history, resulting in a
slow adaptation. In Gallager (1978), the use of α=0.5 is advocated to make the
multiplication process easier and to give reasonable slowly changing estimates.
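
The effect of the two parameters can be illustrated with a minimal Python sketch of the count-aging step alone; it does not rebuild the Huffman tree. The function name `aged_counts` and the use of floating-point counts are assumptions of this sketch.

```python
def aged_counts(text, interval, alpha):
    """Count characters, scaling all counts by alpha after every `interval` characters."""
    counts = {}
    for position, ch in enumerate(text, start=1):
        counts[ch] = counts.get(ch, 0.0) + 1.0
        if position % interval == 0:
            # End of an interval of N characters: age every count by alpha.
            for sym in list(counts):
                counts[sym] *= alpha
    return counts

if __name__ == "__main__":
    sample = "aaaabb"
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, aged_counts(sample, interval=4, alpha=alpha))
```

With α=0.5 the scaling is a one-bit right shift on integer counts, which is presumably what makes the multiplication easy in a hardware or low-level implementation.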

2.1.4 Run-length and Header Compression
Many scientific and statistical databases are sparse in nature and contain long
sequences of repeated zeros or missing value codes. This fact makes the technique
of run-length encoding (Golomb 1966) and its variations one of the most effective
and commonly used data compression methods for scientific and statistical databases (Bassiouni 1985). Run-length encoding is also very effective in picture processing and some business and commercial applications.
The run-length encoding scheme replaces sequences of identical characters by
a count field followed by an identifier for the repeated value. The count field must
be somehow flagged so that it can be distinguished from other data values. Also, the
selected run-lengths must be large enough for their replacement by the identifier
and count values to be advantageous. For example, in the 8-bit EBCDIC code,
non-special characters are encoded with a zero in the second bit from the left. This
bit is set to one when the byte is used to store a count value. A second byte is
used to store the repeated character. Thus, only sequences of three repeated characters or more are replaced since smaller sequences do not give any storage savings.
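
A minimal Python sketch of this scheme is given below. Instead of setting a flag bit inside an EBCDIC byte, it represents a flagged count as a `(count, character)` tuple, which is an assumption made to keep the example readable; the threshold of three follows the text.

```python
from itertools import groupby

MIN_RUN = 3  # shorter runs give no saving with a two-byte (count, character) pair

def rle_encode(text):
    """Replace runs of MIN_RUN or more identical characters by (count, char)."""
    out = []
    for ch, run in groupby(text):
        n = len(list(run))
        if n >= MIN_RUN:
            out.append((n, ch))   # stands for the flagged count byte + character byte
        else:
            out.extend(ch * n)    # short runs are copied unchanged
    return out

def rle_decode(tokens):
    """Expand (count, char) pairs and copy plain characters."""
    return "".join(t[1] * t[0] if isinstance(t, tuple) else t for t in tokens)

def encoded_size(tokens):
    """Size in bytes, assuming two bytes per run and one byte per plain character."""
    return sum(2 if isinstance(t, tuple) else 1 for t in tokens)

if __name__ == "__main__":
    tokens = rle_encode("AB0000CC")
    print(tokens, encoded_size(tokens), rle_decode(tokens))
```

A real implementation would also have to split runs longer than the largest count that fits in the flagged byte, which this sketch does not attempt.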
Run-length encoding has the disadvantage that it requires serial decoding of
the compressed data in order to access a given data item. The Header compression
scheme described in Eggers, Olken and Shoshani (1981) overcomes this problem.
It achieves logarithmic access time by compressing data to a minimum byte size
and employing a variation of run-length encoding. This method produces two
different files: a compressed data file and a header file. The run-lengths are
extracted from the data stream and an entry for each run-length is stored in the
header file. The entry for the ith sequence is of the form (Ti, Li, Pi). Ti is a bit
flag where zero indicates a repeated sequence and one indicates a sequence of
different values requiring the same byte length. Li is an integer value that gives the
cumulative count of the total number of data values in sequences 1 through i. Pi gives
the cumulative count of the total number of bytes in the compressed stream of
sequences 1 through i. The data stream,
6 2 4 8 10 2 2 2 2 600 700 800 900 0 0 0 0
contains 4 sequences, with the first sequence having five single-byte integers, the
second having a single-byte integer repeated four times, the third having four two-byte integers and the last sequence with a single-byte integer repeated again four
times. The corresponding compressed file will contain
6 2 4 8 10 2 600 700 800 900 0
The header file for the above example is given in Table 1.

TABLE 1
HEADER FOR THE EXAMPLE DATA STREAM

i    ith sequence       Ti   Li   Pi
1    6 2 4 8 10         1    5    5
2    2 2 2 2            0    9    6
3    600 700 800 900    1    13   14
4    0 0 0 0            0    17   15

In order to access a data value in a specific position, a binary search is done
on the Li field of the header file with the position as the index. Say we want to
access the 14th data value in the original data stream. A logarithmic search indicates that the value is in the fourth sequence and the Ti flag indicates that it is a
sequence of repeated values. The size of the repeated element is calculated as P4 - P3, which is 1 in our example. The location of this element in the compressed
stream is P3 + 1, which is the 15th byte. A more formal description of the technique is given in Eggers, Olken and Shoshani (1981).
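
The access procedure can be sketched in Python using the (Ti, Li, Pi) entries of the example. The big-endian byte layout, the helper `pack` and the function name `access` are assumptions of this sketch; the element size of a non-repeated sequence is derived from the header as (Pi - Pi-1) / (Li - Li-1).

```python
from bisect import bisect_left

# Header entries (T, L, P) of the example data stream.
HEADER = [(1, 5, 5), (0, 9, 6), (1, 13, 14), (0, 17, 15)]

def pack(values, size):
    """Store each value in `size` bytes (big-endian byte order is an assumption)."""
    return b"".join(v.to_bytes(size, "big") for v in values)

# Compressed stream: 6 2 4 8 10 | 2 | 600 700 800 900 | 0
COMPRESSED = (pack([6, 2, 4, 8, 10], 1) + pack([2], 1)
              + pack([600, 700, 800, 900], 2) + pack([0], 1))

def access(position):
    """Return the data value at the given 1-based position of the original stream."""
    cumulative = [L for _, L, _ in HEADER]
    i = bisect_left(cumulative, position)      # first sequence with L_i >= position
    flag, L, P = HEADER[i]
    prev_L, prev_P = (HEADER[i - 1][1], HEADER[i - 1][2]) if i > 0 else (0, 0)
    if flag == 0:                              # repeated value, stored once
        size = P - prev_P
        start = prev_P
    else:                                      # distinct values of equal byte size
        size = (P - prev_P) // (L - prev_L)
        start = prev_P + (position - prev_L - 1) * size
    return int.from_bytes(COMPRESSED[start:start + size], "big")

if __name__ == "__main__":
    print(access(14))   # value in the fourth (repeated) sequence
    print(access(11))   # value in the third sequence of two-byte integers
```

The binary search touches only the header, so the cost of locating an element grows with the logarithm of the number of sequences rather than with the length of the data stream.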

2.1.5 Arithmetic Coding Scheme
The basic principles of arithmetic coding were first described by Elias in the
early 1960s (Abramson 1963). Later, Rissanen and Pasco proposed several practical techniques (Rissanen 1976; Pasco 1976; Rissanen 1979). In a recent paper by
Witten, Neal and Cleary (1987), an implementation of arithmetic coding was discussed in detail.
In arithmetic coding, a message is represented by an interval of real numbers
between 0 and 1. The consecutive symbols in a message are coded recursively
using the frequency of occurrences of the symbols within the message. Data
compression using arithmetic coding is viewed as two parts: modeling the statistics
of the source message and the coding of the message using the statistics generated
by the model. The model could be static or adaptive. When the model is static,
fixed probabilities are used throughout the process of coding. In the adaptive
model, the frequency counts are updated during the coding of each symbol and the
model dynamically adapts to the varying statistics of the source message. The
fixed model case is a two-pass version where the first pass is to gather statistics
and the second pass is used to actually code the message. The adaptive model
dynamically changes its model while coding the message in one pass. It has been
found that the adaptive model is never significantly worse than a two-pass
approach while it could be significantly better than the fixed model most of the
time. One problem with the adaptive model is that during the initial phase of coding, it does not have enough information about the statistics. So using a reasonable
initial model to begin with is an important consideration for the adaptive model
(Langdon and Rissanen 1983; Cleary and Witten 1984).
The technique of arithmetic coding can be described as follows: initially,
before any message is coded, the message is represented by the entire interval [0,
1). This interval is divided into subintervals such that each symbol is assigned a
particular subinterval and the length of the subinterval for a symbol is proportional
to its frequency count. As each symbol is processed, the interval representing the
message is narrowed to that portion of it allocated to the symbol. Thus the interval
length narrows down with the coding of each symbol. While encoding high probability symbols, the interval reduces at a slower rate; while coding low probability
symbols, the interval length reduces more rapidly. Also, as the interval size
reduces, more bits are needed to represent it without loss of precision.
In general, the codeword is an interval represented as (L, H) where L is the
current low point and H is the current high point of the interval. W is the width of
the current interval such that H = L + W. Let p(i) be the probability of occurrence
for character i and P(i) be the corresponding cumulative probability, i.e., the sum of the probabilities of the symbols that precede i in the chosen order. Any particular order of the symbols can be assumed in calculating the cumulative probabilities,
i.e., the ordering of the symbols will not have any effect on the compression
efficiency. Then while encoding character i, the new L is calculated as L + W*P(i)
and the new W is calculated as W*p(i). The new L and L+W will represent the
interval soon after the encoding of character i. We will illustrate the technique with
an example of the fixed model.
Let us assume a simplified alphabet set for our example, and that is { a, b, c,
0, 1, 2, * } with the probabilities being 0.1, 0.1, 0.1, 0.3, 0.1, 0.1 and 0.2, respectively. The symbol "*" is used to denote the end of the message. The character
set and the corresponding symbol probabilities p(i) are listed in Table 2.

TABLE 2
PROBABILITIES FOR THE CHARACTER SET

Symbol   p(i)   P(i)
a        0.1    0
b        0.1    0.1
c        0.1    0.2
0        0.3    0.3
1        0.1    0.6
2        0.1    0.7
*        0.2    0.8

The table also shows the cumulative probabilities P(i) calculated using the
symbol probability. The order of the symbols does not matter in calculating the
cumulative probabilities. Suppose the message to be coded is "0001abb". The initial code is the range (0, 1). The first symbol is encoded by narrowing the interval
to the range that corresponds to the symbol. After the symbol "0" is coded using
the values in Table 2, the range narrows down to (0.3, 0.6), where L = 0.3 and H =
0.3 + 0.3 = 0.6. This process is repeated until the end of the message and the
character "*" is encoded. The coding of the various symbols in the message string
is illustrated in Table 3.

TABLE 3
ENCODING OF STRING 0001abb

Symbol   L            W            H=L+W
0        0.3          0.3          0.6
0        0.39         0.09         0.48
0        0.417        0.027        0.444
1        0.4332       0.0027       0.4359
a        0.4332       0.00027      0.43347
b        0.433227     0.000027     0.433254
b        0.4332297    0.0000027    0.4332324
*        0.43323186   0.00000054   0.43323240
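
The interval-narrowing rule (new L = L + W*P(i), new W = W*p(i)) can be traced with the short Python sketch below. It uses exact rational arithmetic from the standard `fractions` module to avoid rounding effects; the function names and the use of dictionary insertion order to fix the symbol order are assumptions of this sketch.

```python
from fractions import Fraction

# Symbol probabilities p(i) of the example alphabet, in the order a, b, c, 0, 1, 2, *.
PROB = {"a": Fraction(1, 10), "b": Fraction(1, 10), "c": Fraction(1, 10),
        "0": Fraction(3, 10), "1": Fraction(1, 10), "2": Fraction(1, 10),
        "*": Fraction(2, 10)}

def cumulative(prob):
    """P(i): sum of the probabilities of the symbols that precede i."""
    total, cum = Fraction(0), {}
    for sym, p in prob.items():
        cum[sym] = total
        total += p
    return cum

def encode(message, prob=PROB):
    """Return one (symbol, L, W, H) tuple per encoded symbol."""
    cum = cumulative(prob)
    low, width = Fraction(0), Fraction(1)
    steps = []
    for sym in message:
        low = low + width * cum[sym]    # new L = L + W * P(i)
        width = width * prob[sym]       # new W = W * p(i)
        steps.append((sym, low, width, low + width))
    return steps

if __name__ == "__main__":
    for sym, low, width, high in encode("0001abb*"):
        print(sym, float(low), float(width), float(high))
```

Each interval produced by this trace lies inside the previous one, and the final width equals the product of the probabilities of all encoded symbols.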

The columns "L" and "H" give the low and high values of the range and "W"
gives the width of the interval after the encoding of the symbol in the left-most
column. Once the final range has been calculated after the encoding of the end-of-message symbol "*", any value within that range can be selected to represent
the compressed codeword. The decoder, on the other hand, recursively undoes
what the encoder had done during the coding phase. The steps involved in decoding are:
(i) find the symbol for which the codeword lies within the allocated range for
the symbol and include that symbol in the expanded message.
(ii) subtract the symbol's cumulative probability P(i) from the codeword.
(iii) divide the resultant value from (ii) by the symbol probability p(i) which
gives the codeword after the decoding of symbol i.
These steps are repeated until the end-of-message symbol is decoded.
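
The three decoding steps can be sketched in Python as follows, again with exact rational arithmetic. The probability table repeats the example alphabet; the function names, the half-open range test and the choice of test codewords are assumptions of this sketch.

```python
from fractions import Fraction

# Symbol probabilities p(i) of the example alphabet, in the order a, b, c, 0, 1, 2, *.
PROB = {"a": Fraction(1, 10), "b": Fraction(1, 10), "c": Fraction(1, 10),
        "0": Fraction(3, 10), "1": Fraction(1, 10), "2": Fraction(1, 10),
        "*": Fraction(2, 10)}

def cumulative(prob):
    """P(i): sum of the probabilities of the symbols that precede i."""
    total, cum = Fraction(0), {}
    for sym, p in prob.items():
        cum[sym] = total
        total += p
    return cum

def decode(codeword, prob=PROB, end="*"):
    """Recover the message (including the end symbol) from a codeword in [0, 1)."""
    cum = cumulative(prob)
    value, out = Fraction(codeword), []
    while True:
        # Step (i): find the symbol whose range [P(i), P(i) + p(i)) holds the value.
        for sym, p in prob.items():
            if cum[sym] <= value < cum[sym] + p:
                break
        else:
            raise ValueError("codeword outside [0, 1)")
        out.append(sym)
        if sym == end:
            return "".join(out)
        # Steps (ii) and (iii): subtract P(i), then divide by p(i).
        value = (value - cum[sym]) / p

if __name__ == "__main__":
    print(decode(Fraction(43323186, 10**8)))
```

Any codeword inside the final interval of the encoder decodes to the same message; the value used above is simply one such point.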

2.1.6 Differencing, Null Bit-Maps and Front/Rear Compaction
Differencing is a technique used when successive numeric values tend to fall
within a small range. In this case only the difference between each pair of successive values is stored. For example, if the data stream for a particular attribute contains the values
1500, 1520, 1600, 1550, 1570, 1610
then the compressed stream is
1500, 20, 80, -50, 20, 40
Since the differences are expected to have smaller absolute values than the original
entries, they can be stored using less space.
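
A minimal Python sketch of differencing and its inverse is shown below; the function names are illustrative only. Reconstruction is a running sum, which is why an item can be recovered only after the preceding differences have been processed.

```python
from itertools import accumulate

def difference(values):
    """Keep the first value, then store each value minus its predecessor."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def undifference(deltas):
    """Rebuild the original values by summing the differences serially."""
    return list(accumulate(deltas))

if __name__ == "__main__":
    stream = [1500, 1520, 1600, 1550, 1570, 1610]
    deltas = difference(stream)
    print(deltas)
    print(undifference(deltas))
```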
In order to access a given item, the differencing method requires serial decoding of the compressed file (in order to sum up the differences). This technique is
employed in picture processing applications. We would like to point out that the
header compression method can be modified to store differences of successive
values whenever the flag Ti of the run-length is 1. The resulting technique could
achieve better compression at the expense of reduced access time (in this case a
logarithmic search is used to find the run-length, then a linear search is performed
within that run-length).
The technique of front/rear compaction is usually applied to fields ordered
alphabetically and is based on an idea similar to that of differencing. Compression is achieved by deleting common bit strings from the front/end of successive
entries. A count of the number of deleted bits must be included with each
compressed entry. But the technique requires serial decoding of files to access data.
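
Front compaction can be illustrated with the following Python sketch. It works on characters rather than bits and handles only the front (prefix) case, both of which are simplifying assumptions; the function names are illustrative.

```python
from os.path import commonprefix

def front_compact(entries):
    """Store each entry as (number of leading characters shared with the previous entry, rest)."""
    out, prev = [], ""
    for entry in entries:
        shared = len(commonprefix([prev, entry]))
        out.append((shared, entry[shared:]))
        prev = entry
    return out

def front_expand(compacted):
    """Rebuild the entries serially from the counts and the stored suffixes."""
    out, prev = [], ""
    for shared, suffix in compacted:
        prev = prev[:shared] + suffix
        out.append(prev)
    return out

if __name__ == "__main__":
    keys = ["compress", "compression", "compressor", "computer"]
    packed = front_compact(keys)
    print(packed)
    print(front_expand(packed))
```

Because every entry is expressed relative to its predecessor, expansion has to proceed from the first entry onward, which matches the serial decoding requirement noted in the text.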
The technique of null bit-map is useful for flat files having a large number of
missing data items. A bit-map is used in each record to indicate the presence or absence
of data items. The scheme has an overhead of one bit for each present item, and a
space saving of bit_length(null indicator) - 1 for each missing item.
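
The following Python sketch shows the idea for a single record, with `None` standing in for a missing item; the representation of the bit-map as a list of 0/1 values and the function names are assumptions of this sketch.

```python
def pack_record(record):
    """Return (bit-map, present values); bit 1 marks a present item, 0 a missing one."""
    bitmap = [0 if value is None else 1 for value in record]
    present = [value for value in record if value is not None]
    return bitmap, present

def unpack_record(bitmap, present):
    """Rebuild the record, re-inserting None wherever the bit-map holds 0."""
    values = iter(present)
    return [next(values) if bit else None for bit in bitmap]

if __name__ == "__main__":
    record = [42, None, None, 7, None]
    bitmap, present = pack_record(record)
    print(bitmap, present)
    print(unpack_record(bitmap, present))
```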

The use of m-grams is based on replacing frequently occurring sequences of
m characters by a shorter code. Implementations are usually restricted to the case
of m=2 (bigrams) and the scheme is usually used to augment other compression
methods (e.g., Huffman-like compression methods).

2.1.7 Order-preserving Compression Methods
The Hu-Tucker algorithm (Knuth 1973) and some index encoding schemes
(Alsberg 1975; Batory 1983) allow some operations (e.g., searching) to be performed directly on compressed data. The basic idea is to assign numeric codes for
key values in such a way that the numeric ordering of the codes preserves the lexical ordering of their key counterparts.
The simple index encoding approach is more suitable for keys (attributes)
having small or static domains (Bassiouni 1985). All distinct values from the
given domain are identified and assigned numerical values that preserve the lexical
order of the key values. As a simple example, the domain whose elements are the
states of the USA can be index encoded using six bits, with bit pattern 000001 for
Alabama through bit pattern 110011 for Wyoming. In addition to space reduction,
sorting or searching files on that domain can be carried out more efficiently using
the compressed codes. Some scientific and statistical database systems incorporate
variations of the index encoding method described above, e.g., IRIS (Alsberg 1975)
and RAPID (Turner, Hammond and Cotton 1979).
A serious problem of the index encoding method arises when the domain is
dynamically growing, i.e., insertions of new key values occur after the initial
encoding. If the new key values cannot be known at the time of index encoding,
the initial codes may have to be reassigned and all data files may have to be
recoded. To reduce the frequency of (or practically eliminate) file recoding, longer
index codes (than the initial database requires) are usually used in such a way that
the initial key values are not assigned consecutive index codes. For example, if we
have a domain of 50 initial values, we can use 8 bits (instead of 6) to create the
index codes. The initial key values can then be assigned codes of 0, 4, 8, etc.
When a new key value needs to be added to the domain, it is simply assigned an
unused index code that preserves the lexical order of key values. Obviously only a
limited number of insertions can be handled in this way before the entire database
needs to be recoded. The frequency of file recoding depends on the interval
between successive initial codes, and the frequency and pattern of insertions (random insertions result in less frequent recoding than clustered insertions).
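
The gap-based assignment can be sketched in Python as follows. The spacing of 4 and the 8-bit code range follow the example in the text; choosing the midpoint of the free range for a new key, and the function names, are assumptions of this sketch.

```python
from bisect import bisect_left

def initial_codes(keys, step=4):
    """Assign codes 0, step, 2*step, ... to the keys in lexical order."""
    return {key: i * step for i, key in enumerate(sorted(keys))}

def insert_key(codes, key, max_code=255):
    """Give a new key an unused code that preserves the lexical order of the keys."""
    ordered = sorted(codes)
    pos = bisect_left(ordered, key)
    low = codes[ordered[pos - 1]] if pos > 0 else -1
    high = codes[ordered[pos]] if pos < len(ordered) else max_code + 1
    if high - low < 2:
        raise ValueError("no unused code left here: the files must be recoded")
    codes[key] = (low + high) // 2
    return codes[key]

if __name__ == "__main__":
    codes = initial_codes(["apple", "cherry", "mango"])
    print(insert_key(codes, "banana"), insert_key(codes, "blueberry"))
    print(sorted(codes, key=codes.get))
```

Clustered insertions exhaust a single gap quickly, as the third insertion between "banana" and "blueberry" would here, which mirrors the remark that clustered insertions force recoding sooner than random ones.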

2.1.8 Dictionary-based Compression
Dictionary-based methods (Wagner 1973; Storer and Szymanski 1982; Ok
1984) are mainly used for text compression. A dictionary of common words or
phrases is used and is sometimes augmented with a table for common word endings. The basic idea of these methods is to replace common words in the input
text by their address in the dictionary.
The LZ method (Ziv and Lempel 1977; Ziv and Lempel 1978) and the LZW
method (Welch 1984) are adaptive schemes that start with an empty table of symbol strings and then build it up both during compression and expansion. These
methods require no prior knowledge about the data. They encode variable length
strings using fixed length codes. The LZ method views the data to be compressed
as a bit string. A portion of the data is read into a buffer and the repeating
sequences are sought and encoded. Then the variable length part that has been
encoded is shifted out of the left of the buffer while a new portion of data is
shifted in from the right of the buffer. The choice of the size of the buffer is an
important decision in order to obtain effective compression. The LZW algorithm is
a modification of the LZ method.
The LZW method replaces strings of characters by fixed length codes (usually
12 bits). Its basic idea is similar to having a dictionary of common strings with
the requirement that for any string in the dictionary, all prefixes of that string must
also be in the dictionary. The table is initialized to contain all single-character
strings. The LZW string table contains strings that have been encountered previously in the message being compressed. LZW examines the input strings
character-serially in one pass, and the longest input string already in the table is
parsed off each time. Each parsed input string extended by its next input character
forms a new string to be added to the table. Each string in the table is assigned a
unique identifier, namely its code value. Thus the dictionary grows in size during
the compression phase. Similarly, the decompression algorithm constructs its translation table dynamically as the code values are translated into the corresponding
prefix string and the extension character. The extension character is pulled off and
the prefix string is decomposed into its prefix and extension. This process is
repeated recursively until the prefix string is just a single character.
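
A compact Python sketch of LZW compression and expansion is given below. It is a simplification under stated assumptions: the table is initialized with the 256 single-byte characters, codes are kept as plain integers rather than packed into 12 bits, the table is allowed to grow without limit, and the decoder stores whole strings instead of (prefix code, extension character) pairs.

```python
def lzw_compress(text):
    """Parse off the longest string already in the table and emit its code."""
    table = {chr(i): i for i in range(256)}
    current, codes = "", []
    for ch in text:
        if current + ch in table:
            current += ch
        else:
            codes.append(table[current])
            table[current + ch] = len(table)   # parsed string + next character
            current = ch
    if current:
        codes.append(table[current])
    return codes

def lzw_decompress(codes):
    """Rebuild the same table while translating the codes back to strings."""
    if not codes:
        return ""
    table = {i: chr(i) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):               # code defined by the current step
            entry = previous + previous[0]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = previous + entry[0]
        previous = entry
    return "".join(out)

if __name__ == "__main__":
    message = "TOBEORNOTTOBEORTOBEORNOT"
    packed = lzw_compress(message)
    print(len(message), "characters ->", len(packed), "codes")
    print(lzw_decompress(packed) == message)
```

The decoder never needs the table to be transmitted, since every entry it adds is determined by codes it has already received.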
The LZW scheme does not optimally select the strings to be included in the
table. It is mentioned by Welch (1984) that the method produces compression
results that, while less than optimum, are effective. The scheme seems to be useful
in commercial systems and could be applied to files that contain text intermixed
with numbers.

2.2 Hardware-assisted Data Compression
The lack of special hardware to assist data compression, the view of compression by many system designers as a utility of limited use, and the sophisticated
logic of many data compression schemes have been primary factors in limiting virtually all implementations of data compression to software. In very large
databases, the efficiency of compression is critical to the overall performance of the
system; hardware implementation of compression can be very beneficial to large
database systems. The rest of this chapter will discuss past attempts to implement
compression with hardware assistance.

2.2.1 The Microprocessor-assisted system (MAS)
In Lynch and Brownrigg (1981), a suggestion was made to include data
compression, encryption, and some key searching capabilities in an advanced disk
controller. The scheme proposed in Hawthorn (1982) uses a microprocessor-assisted system (MAS) to offload the process of data compression and attribute
partitioning/tuple assembly from the front-end computer running a statistical database management system (SDBMS). A brief discussion of this system is given
below.
The MAS design is targeted to minicomputers (e.g., VAX 11/780) dedicated to
support statistical databases. The idea is to support the front-end computer by a
hardware structure of microprocessors (MPs) which perform specific data management functions. The microprocessors are organized in a two level hierarchy. Each
disk in the system is connected to a distinct microprocessor (MP) in the bottom
level of the hierarchy (these are called leaf MPs). At the top-level, there is a single microprocessor (the root MP) that is connected to the front-end computer. Figure 2.6 shows the back-end structure for the case of four disks.
When the SDBMS running in the front-end receives a user query to retrieve
data, it sends a request to the root MP to retrieve the appropriate data from the
different files. The root MP analyzes the request and sends subrequests to the
appropriate leaf MPs. The function of leaf MPs is to schedule disk reads,
decompress data and send it to the root MP. The root MP assembles attributes that
are spread across multiple disks (because of attribute partitioning commonly used
in scientific and statistical databases), and passes them to the front-end computer.
When new records are appended, the data is passed from the front-end to the root
MP which performs attribute partitioning and sends the data to the appropriate leaf
MPs. The latter MPs compress data and schedule disk writes.
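The division of work between the root MP and the leaf MPs can be sketched in software as
follows. This is only an illustrative model, not the MAS implementation: the class names
`RootMP` and `LeafMP` are hypothetical, each disk is modeled as a dictionary, and the
standard `zlib` module stands in for whatever compression scheme the leaf MPs would run.

```python
import zlib


class LeafMP:
    """Hypothetical leaf MP: owns one disk holding one attribute column."""

    def __init__(self):
        self.disk = {}  # record id -> compressed attribute value

    def write(self, record_id, value):
        # Compress the value and schedule the disk write.
        self.disk[record_id] = zlib.compress(value.encode())

    def read(self, record_id):
        # Schedule the disk read and decompress the value.
        return zlib.decompress(self.disk[record_id]).decode()


class RootMP:
    """Hypothetical root MP: attribute partitioning and tuple assembly."""

    def __init__(self, attributes):
        self.leaves = {name: LeafMP() for name in attributes}

    def append(self, record_id, record):
        # Attribute partitioning: each attribute goes to its own leaf MP.
        for name, value in record.items():
            self.leaves[name].write(record_id, value)

    def retrieve(self, record_id, names):
        # Tuple assembly: gather the requested attributes from the leaves.
        return {name: self.leaves[name].read(record_id) for name in names}


root = RootMP(["age", "city"])
root.append(1, {"age": "42", "city": "Orlando"})
print(root.retrieve(1, ["city", "age"]))
```

In the real design the leaf MPs work in parallel, which is where the expected performance
gain comes from; the sequential loop above does not model that.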
The MAS design uses general purpose MPs rather than special data filters (like those used
in database machines). The reason for this choice is that MPs are intended to perform
specific functions; thus general restriction, projection, and semi-join operations cannot
be included in MAS. The general purpose MPs are slower than data filters and they cannot
process data at the speed of disk transfer rates. The expected gain in performance,
however, arises from the parallel operation of MPs.

Figure 2.6. The MAS structure.

A cost/performance analysis for the MAS design was given in Hawthorn (1982). The analysis
compared the cost and performance of the statistical DB model both with and without the
back-end (MAS) system. The conclusion of the analysis is that MAS is cost-effective and
is expected to have a positive impact on the performance of statistical databases and
other systems that rely heavily on compression and tuple assembly. This initial design of
MAS will probably need considerable tuning, and it remains to be seen whether it is more
advantageous to replace the general purpose MPs by data filters, which would allow both
the size of the back-end structure and the set of data management functions supported at
the back-end to be expanded.

2.2.2 A Finite State Machine Approach
The two-stage finite state machine presented in Hazboun and Raymond (1983) is a proposed
implementation of the Multi-group compression method (discussed earlier in this chapter)
used for efficient transmission of data in a distributed environment. The automaton is
conceived as an I/O board in a host machine. It consists of a decode machine (input
register, control logic, receive FIFO to hold characters, address register, and microcode
memory), an encode machine (primary and alternate send registers, transmit control logic,
look-ahead logic, transmit FIFO, and three flags to monitor transmitted characters), and
an encode/decode memory. Performance evaluation of this implementation was not completely
covered. Further research is needed to compare the proposed scheme with other
alternatives.

2.2.3 A Systolic Approach
Smith and Storer (1985) have proposed parallel algorithms for compressing data by textual
substitution that can be implemented in VLSI with systolic arrays, a popular model of
parallel computation. They discuss systolic algorithms for compressing data using two
different models. In the static dictionary model, data is compressed by replacing
substrings of text by pointers to a dictionary. In the sliding dictionary model, text is
compressed by replacing substrings of text by pointers to earlier occurrences in the
text. Both of these models have been extensively studied in the literature.
The basic model for the systolic structure for data compression along a communication
channel, using the static dictionary model, is given in Figure 2.7. The basic structure
is a pipe with three processing elements for each string of the dictionary. The
dictionary strings themselves are stored in the middle row, one string per element.
Pointers within dictionary elements will always point to a dictionary element to the
left. Data to be encoded enters from the left, is piped along the bottom row, and encoded
as it progresses. Data to be decoded enters from the right, is piped along the top row,
and decoded as it progresses. The dictionary may be loaded by either location. The circle
next to the transmitter for location A indicates a switch to divert data via the bypass
line, to the dictionary, and to the output.

Figure 2.7. A systolic structure for data compression.

A fixed dictionary is assumed and data is compressed by replacing substrings of data by
pointers to matching strings in the dictionary. For example, if the dictionary consists
of the two strings ab and bab, the string ababbab could be represented as 11b1, a22, or
a2b1, where 1 denotes a pointer to ab and 2 a pointer to bab. Decoding is always unique
and consists of simply replacing a pointer by its target. Also, dictionary entries
themselves could contain pointers to other entries and so decoding a pointer may involve
several dictionary lookups. The string represented by a pointer, after all its pointers
have been expanded, is referred to as the expanded target of that pointer. Pointers in
dictionary entries can reduce the space for the dictionary, but the major reason for such
pointers is that they allow for the representation of strings longer than the maximum
length of a dictionary element. The implementation issues for both the static and dynamic
models are discussed in Smith and Storer (1985).
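The decoding rule described above can be illustrated with a small software sketch. It is
a sequential illustration only, not the systolic algorithm of Smith and Storer; the
function names `expand` and `decode` and the nested entry `3` are assumptions introduced
for the example. Pointers are written as the digit characters used in the text.

```python
def expand(pointer, dictionary):
    """Return the expanded target of a pointer, following nested pointers."""
    pieces = []
    for item in dictionary[pointer]:
        if item in dictionary:      # the entry itself contains a pointer
            pieces.append(expand(item, dictionary))
        else:                       # a plain character
            pieces.append(item)
    return "".join(pieces)


def decode(encoded, dictionary):
    """Replace every pointer in `encoded` by its expanded target."""
    return "".join(
        expand(item, dictionary) if item in dictionary else item
        for item in encoded
    )


# The dictionary from the text: 1 points to "ab", 2 points to "bab".
dictionary = {"1": "ab", "2": "bab"}
for encoded in ("11b1", "a22", "a2b1"):
    assert decode(encoded, dictionary) == "ababbab"

# Hypothetical nested entry: 3 consists of pointer 1 followed by pointer 2,
# so its expanded target is longer than any single plain entry.
nested = {"1": "ab", "2": "bab", "3": "12"}
print(decode("3", nested))
```

Because pointers inside dictionary entries only refer to earlier entries (elements to the
left in the systolic pipe), the recursion in `expand` always terminates.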

2.2.4 The LZW Hardware
A brief discussion of a hardware design (Sperry proprietary) for the LZW algorithm is
given in Welch (1984). The hardware structures for compression and decompression proposed
in the paper are given in Figures 2.8 and 2.9. The details of implementation were not
given since they are Sperry proprietary. The principal implementation decision is
choosing the hashing strategy for the compression device. The scheme uses an 8K RAM as a
hash table with a load factor of 0.5. The speed depends on the hashing system, and it was
estimated (Welch 1984) that compression speeds of up to half the clock rate are possible
if the compression ratio is 50% and hash search lengths are short.

Figure 2.8. Hardware for LZW compression.

Figure 2.9. Hardware for LZW decompression.

2.2.5 Mukherjee's Approach
A set of hardware algorithms has been proposed in Mukherjee, Ranganathan and Bassiouni
(1988) based on a reverse binary tree. These algorithms are applicable to any static
binary coding scheme. The reverse binary tree is a labeled binary tree whose leaf nodes
and some of the internal nodes represent the symbols to be encoded in the following
sense: the sequence of 0's and 1's in the unique path from the node representing the
symbol to the root node is the code for the symbol, i.e., the tree represents the
reversed Huffman codes. A reverse binary tree is derived in Figure 2.10 for the example
Huffman tree given in Figure 2.1. A more formal treatment of the reverse binary tree will
be presented in Chapter 3.
The reverse binary tree can be mapped into a corresponding hardware layout. The approach
is to detect the symbol to be encoded and place a token on the corresponding leaf node of
the tree. As the token traverses upwards towards the root, the code bits that correspond
to the labels on the reverse tree are extracted in a bit-serial fashion.

Figure 2.10. The reverse binary tree.

CHAPTER 3
HARDWARE ALGORITHMS
Successful hardware/VLSI implementation of data compression and decompression can be
integrated into I/O and communication devices of current and future computer systems.
Front-end machines and host nodes can then be relieved from the overhead of
compression/decompression. VLSI chips for data compression can be incorporated into
communication controllers resulting in significant improvement in the performance of
distributed networks.
In this chapter, we will present a set of new hardware algorithms for data compression
and decompression. These algorithms can be implemented in VLSI using nMOS or CMOS
technology. Our approach differs from the previous methods primarily in that we use
designs that eliminate or reduce the use of storage (code-decode tables) or microcoding.
This yields faster processing rates independent of the memory or machine cycles of the
back-end or the host computer. The estimated speeds of our algorithms far exceed the
maximum data flow rates in current and projected disk and communication technologies.
We describe the following contributions in this chapter: (i) A new pipelined architecture
for data compression applicable to any static binary encoding scheme is proposed. (ii) A
fast hardware algorithm for decompression is presented. (iii) Hardware algorithms are
proposed for the Multi-group compression technique. (iv) A hardware circuit for the
Run-length encoding scheme is presented. (v) Hardware schemes for an enhanced version of
arithmetic coding are described. Speed estimates of the various algorithms are discussed
in the last section of this chapter.

3.1 A Pipelined Architecture for Data Compression
We present a new pipelined architecture for data compression that is applicable to any
static binary encoding scheme. We begin with a formal definition of the reverse binary
tree which forms the basis for several hardware algorithms that will be described in this
chapter. A reverse binary tree is a labeled binary tree whose leaf nodes and some of the
internal nodes represent the symbols such that the sequence of 0's and 1's in the unique
path from the node corresponding to a symbol to the root node is the code for the symbol.
Thus the sequence of 0's and 1's from the root to any node representing a symbol gives
the reversal of the original code. The reverse binary tree can be derived for any binary
code set (e.g., Huffman Code) in the following manner:

(i) Obtain the reverse code for each symbol by writing its original code backwards.

(ii) Consider the reverse code for the first symbol and construct a left child to the
root node if the first bit is a '0' or a right child if the first bit is a '1'.

(iii) Assuming this newly built node as the parent node, consider the second bit of the
reverse code and build a new child node as before. Repeat this step until all the bits of
the code for the first symbol are considered.

(iv) Consider the reverse code for the second symbol. If the first bit is a '0', we need
a left child from the root and if the bit is '1', a right child is to be constructed. If
the particular child node already exists due to the consideration of a previous symbol,
traverse to that node and consider the second bit of the reverse code. The same procedure
is applied to all the bits of the code for the second symbol, constructing only the
missing nodes during each step.

(v) Repeat step (iv) until the reverse codes for all the symbols have been considered.
The resulting tree is the reverse binary tree obtained from the original code.

The time complexity for the construction of the reverse binary tree is linearly
proportional to the total length of binary codes of all the symbols.
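Steps (i) to (v) can be sketched in software as follows. This is a minimal illustration
under stated assumptions: the `Node` class and function names are hypothetical, and the
four-symbol code set is an invented example, since the code of Figure 2.1 is not
reproduced in this section.

```python
class Node:
    """One node of the reverse binary tree."""

    def __init__(self, parent=None, label=None):
        self.parent = parent   # None for the root
        self.label = label     # bit on the edge to the parent: '0' or '1'
        self.children = {}     # '0' -> left child, '1' -> right child
        self.symbol = None     # set when this node represents a symbol


def build_reverse_tree(codes):
    """Build the reverse binary tree for a dict mapping symbol -> code string."""
    root = Node()
    node_of = {}
    for symbol, code in codes.items():
        node = root
        for bit in reversed(code):             # step (i): read the code backwards
            if bit not in node.children:       # steps (ii)-(iv): add missing nodes only
                node.children[bit] = Node(node, bit)
            node = node.children[bit]
        node.symbol = symbol
        node_of[symbol] = node
    return root, node_of


def code_from_tree(node):
    """Read the labels on the path from a symbol node up to the root."""
    bits = []
    while node.parent is not None:
        bits.append(node.label)
        node = node.parent
    return "".join(bits)


# Hypothetical prefix code used only for illustration.
codes = {"A": "0", "B": "10", "C": "110", "D": "111"}
root, node_of = build_reverse_tree(codes)
for symbol, code in codes.items():
    assert code_from_tree(node_of[symbol]) == code
```

Each code bit is visited exactly once during construction, which matches the linear time
bound stated above. In this example the nodes for A and B are internal nodes of the
reverse tree, illustrating why symbols are not confined to the leaves.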
We now describe the hardware algorithm for the Huffman encoding scheme proposed in
Mukherjee, Ranganathan and Bassiouni (1988), which uses a module of the reverse binary
tree as the core of the design. The algorithm will be described with the help of the
example Huffman tree given in Figure 2.1. We discussed earlier in section 2.1 that the
decoder must receive the bits in the correct order, i.e., the bits of the Huffman code
for each character must be generated starting at the root of the tree and ending at the
corresponding leaf node. The reverse binary tree module is required in order to generate
bits in the correct order. The reverse binary tree corresponding to the example Huffman
tree is given in Figure 2.10.
The functional block diagram of the hardware encoder for the character set of the example
under consideration is given in Figure 3.1. The hardware consists of a module of the
reverse binary tree and a decoding circuit. The decoding circuit (decoder) places a
"pulse" or a "token" at the leaf node (in the tree module) representing the symbol being
encoded. The token traverses one level up towards the root node at each clock pulse or
time step, delivering an output bit (for each level of the tree) whose value is equal to
the label of the tree edge being traversed. The output line from each level is connected
to an OR gate which produces the bit-serial output of the chip.
When the character is encoded (compressed), its Huffman code will be generated bit by bit
as the token traverses up the tree (one level at each clock pulse). The structure of the
reverse binary tree ensures that the bits of the Huffman code are generated in the
correct order (as needed by the decompression phase).
The entire operation of the device can be described as follows: the input symbols are
held in an input buffer whose size depends on the average number of symbols received per
unit of time. It will be safe if the size of the buffer equals the maximum length of any
path in the binary tree. Once a symbol is encoded, a new symbol is latched at the input
of the decoder which sends a token to the appropriate input of the tree module. The token
traverses up towards the root, generating an output bit at each level of the tree during
each clock pulse. When the token arrives at the root, it is used to activate the
compression of the next symbol. Once the code for the current symbol has been output, the
code for the next symbol starts being output without loss of continuity. Thus, once a
compression phase is started, one bit of compressed data is output during each clock
cycle.

Figure 3.1. The Huffman encoder circuit.

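The token-passing behaviour can be modeled cycle by cycle in software. The sketch below
is an illustration only: it names each tree node by its reverse-code prefix, uses a
hypothetical four-symbol code set, and ignores the input buffer and decoder timing. One
output bit is appended per simulated clock pulse.

```python
def build_parent_map(codes):
    """Name each node by its reverse-code prefix; map node -> (parent, edge label)."""
    parent = {}
    node_of = {}
    for symbol, code in codes.items():
        reverse = code[::-1]
        for i in range(1, len(reverse) + 1):
            parent[reverse[:i]] = (reverse[:i - 1], reverse[i - 1])
        node_of[symbol] = reverse
    return parent, node_of


def encode_stream(symbols, codes):
    """Simulate the encoder: one bit leaves the tree module per clock pulse."""
    parent, node_of = build_parent_map(codes)
    output = []
    for symbol in symbols:
        token = node_of[symbol]        # the decoder places the token on the tree
        while token != "":             # the empty prefix names the root
            token, bit = parent[token]
            output.append(bit)         # edge label passes through the OR gate
        # the token has reached the root: the next symbol is activated
    return "".join(output)


# Hypothetical prefix code used only for illustration.
codes = {"A": "0", "B": "10", "C": "110", "D": "111"}
bits = encode_stream("CAB", codes)
print(bits, "in", len(bits), "clock cycles")
```

The number of simulated cycles equals the number of output bits, reflecting the claim
that the codes of successive symbols are emitted without a gap.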
The main components of the design are the tree module and the decoder as shown in Figure
3.1. The function of a node in the tree is to pass an incoming token to its parent node.
The two-input nodes are OR gates while the single-input nodes are just shift register
stages. It must be noted that a node can have one, two or at most three inputs. If a node
corresponding to a symbol happens to be an internal node that has two child nodes in the
reverse binary tree, a three-input OR gate is required for the corresponding module in
the hardware layout. Since the logic of any node in the tree module is basically a shift
register stage, the propagation of the token up the tree can be achieved at a very high
speed (a critical delay of about 6 nanoseconds will result in a 25 nanosecond clock
cycle, equivalent to a speed of 40 MHz). But the decoder circuit, due to its multi-level
gate logic, will operate at a much slower speed. Thus the clock cycle of the entire chip
depends on the delay within the decoding circuit. Next, we present a hardware design that
minimizes this delay by breaking the decoder into a pipeline of several stages.
The pipelined architecture for data compression is shown in Figure 3.2. The basic
components of the architecture are a reverse binary tree, a multi-stage pipeline decoder,
a set of latches, a buffer to hold the data stream and some control logic as shown in the
figure. The reverse binary tree is similar to that shown in Figure 3.1. The latches are a
set of 1-bit registers to temporarily hold the output of the decoding circuit until all
the bits of the compressed code for the previous character have been output.
The function of the decoding circuit is to decode the character loaded into the symbol
register and place a token at the corresponding leaf node in the tree module. The basic
cell for the decoder consists of several NOR gates and can be implemented as a pipeline
of multiple stages. The decoding of the next symbol will proceed in parallel with the
traversal of the token corresponding to the previous symbol up the tree.
The decoder outputs will be stored into a string of recirculating latches (shown in
Figure 3.2). The latch containing a '1' corresponds to the input symbol to be encoded
next. As the previous token emerges out of the root of the tree, it is used to activate
the control which transfers the information waiting in the latches to the leaf nodes of
the tree. At the same time, the token will also initiate the decoding of the next symbol
stored in a buffer.

Figure 3.2. A pipelined architecture for Huffman encoding.

The number of stages within the decoder can be at most equal to the minimal path depth in
the Huffman tree. This is necessary since the decoder output for the next symbol must be
ready as soon as the previous symbol has been compressed. The minimum number of cycles to
compress any symbol corresponds to the shortest path in the tree. In our example, the
pipeline decoder will have two stages corresponding to the minimum depth of the code in
the Huffman tree. The chip will function such that one code bit will be generated during
each clock cycle. The decoder output for the next symbol will always be ready in the
latches when the last bit of the current symbol has been generated. Breaking the decoder
into a pipeline of two stages will double the maximum clock rate allowed. This principle
can be easily extended to a realistic Huffman tree. If we want to design the Huffman
compressor for the ASCII code, with a minimum code word length of 4 bits, the 8-bit
decoder can be designed to have four stages in its pipe. Thus the pipelining of the
decoder not only allows parallelism, but also decreases the cycle time, yielding a better
compression rate. A prototype CMOS VLSI implementation of this modified architecture is
described in detail in Chapter 4. The maximum clock speed of the circuit depends on the
delay in the feedback path from the root of the tree module to the load signal at the
symbol register. As per the estimates obtained from the implementation, a clock cycle of
25 nanoseconds was required, which yields a speed of about 40 MHz. Assuming an average
Huffman code length of 4 bits (a compression ratio of 50%), we can obtain a compression
rate of 10 million characters per second.
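The arithmetic behind these estimates can be written out as a short calculation. The
function names below are hypothetical, and the code set is an invented example; the 25
nanosecond cycle and the 4-bit average code length are the figures quoted in the text.

```python
def max_pipeline_stages(codes):
    """The decoder pipe may have at most as many stages as the shortest code has bits."""
    return min(len(code) for code in codes.values())


def characters_per_second(clock_ns, average_code_bits):
    """One code bit is produced per clock cycle, so rate = clock frequency / bits."""
    clock_hz = 1e9 / clock_ns
    return clock_hz / average_code_bits


# Figures quoted in the text: 25 ns cycle (40 MHz), 4 bits per character on average.
print(1e9 / 25 / 1e6, "MHz")
print(characters_per_second(25, 4) / 1e6, "million characters per second")

# Hypothetical code set whose shortest code has 2 bits, giving a 2-stage pipe.
print(max_pipeline_stages({"A": "00", "B": "01", "C": "10", "D": "110", "E": "111"}))
```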

3.2 The Decompression Hardware
We present a hardware algorithm for decompression that is based on the
Huffman tree structure. The hardware design is given in Figure 3.3. Although the
algorithm is applicable to any static binary code, we use Huffman code for illustration and our discussion is based on the example used in the previous section.
The basic components are the Huffman tree module and a read-only memory
(ROM) as shown in Figure 3.3. The function of a node in the Huffman tree
module is to pass the incoming token to one of its two child nodes depending on
the value of the code bit. If the code bit is a "1", the token is passed to the left
child and if the code bit is a "0" the token is passed on to the right child. The
logic design for a node is given in Figure 3.4.
The compressed code is input to the circuit at the rate of one bit per clock cycle. This
bit is used to control the flow of the token down the Huffman tree module as described
below. At the beginning, a start "pulse" or a "token" is applied at the root of the tree.
This token traverses down the tree, controlled by the bits of the input code (if the code
bit is 1, the token traverses the edge labeled 1; otherwise, it traverses the edge
labeled 0). If the token emerges out of a leaf node, it initiates a read operation from
the read-only memory which contains the uncompressed code of the symbol being decoded.
The uncompressed code of the symbol is loaded into the symbol register which is then
output by the decompression circuit. The token is fed back to the root node to continue
the decoding of the next symbol. The feedback logic is designed using precharge logic so
that the token is passed on to the root without much propagation delay. In the scheme
described above, the critical delay of the circuit depends on the time needed for the
read operation from the read-only memory (ROM), which is used to store the fixed length
code (for example, symbols of the ASCII character set). The ROM access time will
determine the speed at which code bits can be pumped into the device. The critical delay
of the circuit can be improved by replacing the read-only memory by a balanced binary
tree module built to represent the uncompressed fixed length code set.

Figure 3.3. Hardware design for decompression.

Figure 3.4. Node logic for decompression circuit.

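The downward traversal can be modeled in software as follows. This is a sketch under
stated assumptions, not the circuit itself: the tree is held in nested dictionaries, the
code set is a hypothetical prefix code, and returning the symbol name stands in for the
ROM read of its fixed length code.

```python
def build_huffman_tree(codes):
    """Nested-dict tree: keys '0' and '1' are edges, key 'symbol' marks a leaf."""
    root = {}
    for symbol, code in codes.items():
        node = root
        for bit in code:
            node = node.setdefault(bit, {})
        node["symbol"] = symbol
    return root


def decode_bits(bits, codes):
    """Consume one code bit per clock cycle and emit a symbol at each leaf."""
    root = build_huffman_tree(codes)
    token = root                              # the start pulse enters at the root
    output = []
    for bit in bits:
        token = token[bit]                    # the code bit steers the token
        if "symbol" in token:                 # the token emerges from a leaf
            output.append(token["symbol"])    # stands in for the ROM read
            token = root                      # feedback token returns to the root
    return "".join(output)


# Hypothetical prefix code used only for illustration.
codes = {"A": "0", "B": "10", "C": "110", "D": "111"}
print(decode_bits("110010", codes))
```

Because the token returns to the root in the same cycle in which a leaf is reached, the
model accepts one code bit per cycle with no idle cycles between symbols.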
The modified circuit is shown in Figure 3.5. The circuit corresponds to the earlier
example in Figure 3.3 and, for ease of explanation, the fixed length codes are assumed to
be of 3 bits instead of 8 as in the ASCII set. The hardware design consists mainly of a
Huffman tree structure that represents the compressed codes and a balanced binary tree
structure representing the uncompressed fixed length codes. Each leaf node that
corresponds to a symbol in the Huffman tree structure is connected to the corresponding
leaf node in the balanced binary tree module. The function of the Huffman tree module is
the same as in the scheme in Figure 3.3. The function of a node in the balanced binary
tree module is to pass the incoming token to the node in the next lower level. The nodes
are simply OR gates. At each level, the edges labeled "1" are tied to the inputs of an OR
gate as shown in the figure. The outputs of the OR gates at each level together provide
the fixed length code. The outputs of the OR gates are connected to a set of triangular
delay elements so that all the bits of the fixed length code for a symbol being
decompressed are obtained in parallel from the symbol register.

Figure 3.5. Modified hardware design for decompression.

The operation of the circuit is as follows: at the beginning, a "start" pulse initiates a
token at the root of the tree. The token traverses down the tree, controlled by the bits
of the input code. If the code bit is "1", the token traverses the edge labeled "1";
otherwise, it traverses the edge labeled "0". Thus, the function of a node in the Huffman
tree is to pass the incoming token to one of its two child nodes depending on whether the
code bit is a "1" or a "0". When the token emerges out of a leaf node of the Huffman
tree, it starts traversing down the balanced binary tree. Also, when the token passes
from the Huffman tree to the balanced binary tree, the precharge logic shown on the left
side of Figure 3.5 generates a feedback token. This token is passed on to the root of the
tree to initiate the decoding of the next character. The output circuit, which consists
of a set of OR gates with one gate for each level, outputs the fixed length code. The
delay elements ensure that all the bits of the fixed length code for a symbol are loaded
simultaneously into the symbol register.
It can be seen that the maximum clock speed for the decompression circuit depends on the
node logic. Since the logic for a node is equivalent to a shift register stage, it is
possible to obtain a clock speed of about 40 MHz using current technology. Assuming an
average Huffman code length of 4 bits (a compression ratio of 50%), we can obtain a
decompression speed of 10 million characters per second.

3.3 The Multi-group Compression Hardware
The Multi-group compression method is described in section 2.1.2. The hardware design for
the Multi-group method uses techniques similar to those used in implementing the Huffman
scheme. However, extra logic is required to detect group switches and insert the
corresponding switch codes in the compressed output. The basic circuit for the
Multi-group technique is shown in Figure 3.6. The circuit is designed for the modified
scheme where the local and failure trees for each subgroup are combined into a single
tree with an indicator node for each possible group switch. The main components are the
reverse binary tree module, two decoders, group switch logic and a 2:1 Multiplexer as
shown in the figure. Reg1 and Reg2 are registers to hold values (symbols) to be input to
the decoders. The reverse binary tree module consists of a tree structure that includes
the codes of characters of all groups and the various group switch codes. The design has
two decoders, Decoder1 for the character symbols and Decoder2 for the switch codes. The
Decoder1 and Decoder2 outputs are connected to the corresponding leaf nodes representing
the characters and the switch codes respectively. The group switch logic detects whenever
a group change occurs and causes the generation of the appropriate group switch code. The
2:1 Multiplexer selects the symbol in the next symbol register whenever there is no group
change. If there is a group change, the Multiplexer loads a "dummy" symbol into the
register Reg1. The dummy symbol corresponds to the situation when Decoder1 produces no
token. Simultaneously, the group switch logic loads a specific value into Reg2 which,
when decoded by Decoder2, places a token on the leaf node corresponding to the particular
group switch code. The Multiplexer is controlled by the signal sel.φ2 which is activated
by the group switch logic as in Figure 3.6.

Figure 3.6. Hardware design for multi-group compression.

The circuit works as follows: the next input symbol is loaded into a register from the
buffer, which is used by the group switch logic (GSL) to determine if there is a change
of group. If there is no change in the group, then GSL activates the sel.φ2 signal which
passes the symbol to the register Reg1 to be decoded. In this case, GSL loads the other
register Reg2 with the "dummy" symbol for which all the outputs of Decoder2 will be '0'.
Also, the sel.φ2 signal loads a new character into the next symbol register from the
buffer. If there is a change in the group, the next symbol is left to stay in the next
symbol register and the sel.φ2 signal selects the dummy symbol to be loaded into Reg1.
Thus the sel.φ2 signal acts as a control input to the 2:1 multiplexer which selects
between the dummy symbol and the next symbol. Again, this dummy symbol corresponds to the
situation where all the outputs of Decoder1 are '0's. Simultaneously, GSL loads a value
into Reg2. Decoding the value in Reg2 places a token on the leaf node that corresponds to
the particular group switch code.
We will refer to the multi-group hardware as we describe the working of the group switch
logic. The group switch logic is given in Figure 3.7. For ease of explanation, the logic
is worked out for an example having three groups, say, alphabets, digits and special
characters. Any fixed length code can be viewed as a binary integer value. Without loss
of generality, we will assume that the original codes in the first group fall below a
binary limit 'x', the codes in the third group are above or equal to the limit 'y', and
the codes in the second group fall in between. This property is valid for common ASCII
codes applicable to alphabets, digits and special characters. If the codes within a group
do not have adjacent binary values, the case can be handled by introducing additional
comparators and registers. The group switch logic for our example consists mainly of
three comparators, a 1's complement adder and two registers, Reg3 and Reg4. The logic
works as follows: as soon as a new symbol is loaded into the symbol register from the
buffer, the two comparators labeled '<x' and '>=y' detect the subgroup to which the
symbol belongs. Depending on the specific group, one of 100, 010 and 001, corresponding
to alphabets, digits and special characters respectively, is loaded into Reg3. The state
that corresponds to the previous symbol is stored in the register labeled Reg4. The
previous state and the current state are compared. If a match occurs, the control input
for the 2:1 Multiplexer in Figure 3.6 is enabled in order to pass on the symbol to be
decoded. A mismatch corresponding to a group switch causes the Multiplexer control
(signal sel.φ2) to be zero. When the multiplexer control is zero, the dummy symbol is
selected to be loaded into Reg1. The adder circuit is enabled simultaneously. The adder
performs a 1's complement subtraction of the contents of Reg3 from Reg4. The output of
the adder is loaded into Reg2 to be input to Decoder2. Decoding this value places a token
on the leaf node that corresponds to the particular group switch code. The propagation of
the token up the tree and the extraction of the bit-serial code is done in the same
manner as described in section 3.1 for the Huffman encoding hardware algorithm.

Figure 3.7. Group switch logic.
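The control flow of the group switch logic can be sketched as a sequential model. Several
details are assumptions made for illustration: the limits `x = 26` and `y = 36` describe
a hypothetical fixed length code (not ASCII), the function names are invented, and the
first symbol is assumed to belong to the initial group so that no switch precedes it. A
"switch" event corresponds to a cycle in which Reg1 receives the dummy symbol and Reg2
receives the adder output.

```python
ALPHABET, DIGIT, SPECIAL = 0b100, 0b010, 0b001   # one-hot group states


def classify(code, x, y):
    """Model of the '<x' and '>=y' comparators that fill Reg3."""
    if code < x:
        return ALPHABET
    if code >= y:
        return SPECIAL
    return DIGIT


def group_switch_logic(symbol_codes, x=26, y=36):
    """Return the sequence of events presented to the two decoders."""
    events = []
    reg4 = None                                   # group of the previous symbol
    for code in symbol_codes:
        reg3 = classify(code, x, y)               # group of the current symbol
        if reg4 is not None and reg3 != reg4:
            # Mismatch: the dummy symbol goes to Reg1 and the switch value goes
            # to Reg2; the symbol waits in the next symbol register.
            events.append(("switch", reg4, reg3))
        events.append(("symbol", code))           # match: symbol passes to Reg1
        reg4 = reg3
    return events


for event in group_switch_logic([0, 1, 30, 40]):
    print(event)
```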

In our example, there are three groups: alphabets, digits and special characters, in that
order. The registers Reg3 and Reg4 in Figure 3.7 hold 3-bit values, each being one of
100, 010 or 001, which correspond to alphabets, digits and special characters
respectively. Among three groups, there are six different group switches possible and
thus the adder in the group switch logic must be able to produce six different values
uniquely representing each possible group switch. Our scheme ensures this, as can be seen
from the table given below:

TABLE 4
TRANSITION TABLE FOR GROUP SWITCH LOGIC

P      C      Transition    F = P + C'
100    010    A-D           001
100    001    A-S           010
010    100    D-A           101
010    001    D-S           000
001    100    S-A           100
001    010    S-D           110

The state P represents the group of the previous character and the state C represents the
group of the current character. C' is the one's complement value of C and the table above
lists the unique output states corresponding to the six possible group switches in a
Multi-group scheme of three groups.
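The entries of Table 4 can be regenerated with a few lines of code. The sketch assumes
that the carry out of the 3-bit sum is discarded, which is the interpretation that
reproduces every value listed in the table; the names `GROUPS` and `switch_value` are
hypothetical.

```python
GROUPS = {"A": 0b100, "D": 0b010, "S": 0b001}   # alphabets, digits, special characters


def switch_value(previous, current):
    """F = P + C', with C' the 3-bit one's complement of C and the carry dropped."""
    complement = ~current & 0b111
    return (previous + complement) & 0b111


table = {
    f"{p}-{c}": switch_value(GROUPS[p], GROUPS[c])
    for p in GROUPS
    for c in GROUPS
    if p != c
}

for transition, value in table.items():
    print(transition, format(value, "03b"))

# The six group switches map to six distinct adder outputs.
assert len(set(table.values())) == 6
```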
In Figure 3.6, the reverse binary tree module is shown as a single tree that includes the
symbol codes and the group switch codes of all the groups. An alternative method is to
lay out a separate tree for each group; the roots of all the individual trees can then be
ORed in order to obtain the feedback token that will initiate the processing of the next
symbol. It is possible to obtain some optimization by using a single reverse binary tree
that includes all the group codes. In both cases, the outputs of the decoders are
connected to the appropriate leaf nodes in order to place the token at the bottom of the
tree. If symbols from different groups share the same code, then the corresponding
decoder outputs can be connected to the same leaf node. Thus, we can reduce the size of
the VLSI layout.
The two decoders in Figure 3.6 can be implemented as pipelines as in the hardware
algorithm described in section 3.1. The number of stages in each pipe must be less than
or equal to the length of the shortest code in the whole set. So we can use the same
number of clock cycles in the group switch logic as the number of stages in the decoder
pipes. Since the group switch logic is quite simple, the critical path delay for this
implementation will remain the same as in the Huffman compressor, yielding the same
compression rate of about 10 million characters per second.

3.4 The Multi-group Decompression Hardware
The hardware design for Multi-group decompression is similar to that used for
Huffman decompression described in section 3.2. The decoding tree module consists of the Huffman tree structure. A decoder tree similar to that described in section 3.2 is laid out for each group separately. The symbols are partitioned into the
corresponding groups and stored accordingly as shown in Figure 3.8. The decoding
of a symbol within a group is similar to the scheme described in section 3.2. The
token traverses down the tree depending on the value of the code bit and once it
reaches a leaf node, it enables a read operation of the corresponding symbol from
the memory. Also, the token is passed to the root of the same tree by the feedback
logic for decoding of the next symbol. Whenever there is a switch in the group,
the token comes out from the corresponding path in the tree and it is passed to the
tree corresponding to the next group. If there are n groups, then n(n-1) different
group switches are possible and hence a bus-width of n(n-1) is required in order to
switch the token from one tree to another. Since the number of lines increases
considerably with the number of groups (say, for 6 groups 30 lines are needed), an
alternative design is suggested which is given in Figure 3.9. In the new design, the
tokens are passed between the trees using a precharge bus of size equal to the
number of groups. During one phase of the clock, say phi1, the bus is precharged to
high. During the other phase, phi2, if there is a group switch, the token comes out of

Figure 3.8. Hardware design for multi-group decompression.

Figure 3.9. Improved design for multi-group decompression.

the current tree at the corresponding leaf node. This token pulls down the line to
zero that is connected to the root of the tree corresponding to the next group. The
signal in this line is inverted to a '1' which becomes the token for the new group.
Another advantage in using precharge logic besides reduction in the number of
lines is that it helps the passing of the token to the next tree with less propagation
delay.
The above circuit can have even less hardware if it is possible to obtain the
union of the decoder trees of the various subgroups. In this case, the hardware
required will be similar to the decompressor proposed in section 3.2, with some
additional logic to handle the group switch codes. The design is illustrated with an
example of the multi-group scheme given in (Bassiouni and Hazboun 1983). The
example consists of three groups: alphabets (A), digits (D), and special characters
(S). The multi-group codes for the characters as well as the group switch indicators are given in the table below:


TABLE 5
A MULTI-GROUP CHARACTER SET AND ITS CODES

Alphabets (A)       Digits (D)          Special (S)
A - 0               1 - 0               b - 0
B - 10              2 - 10              &(S,A) - 10
C - 110             3 - 110             &(S,D) - 11
&(A,D) - 1110       &(D,A) - 1110
&(A,S) - 1111       &(D,S) - 1111

The symbol &(A,D) shown in the above table represents the group switch
from alphabets group (A) to the digits group (D). It should be noted that in each
group there is a unique group switch character included for every other group in
the scheme.
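
The use of the group switch characters in Table 5 can be illustrated with a small software sketch. It is not part of the hardware design; the names CODES, group_of and encode are hypothetical, the blank is written as a space, and the assumption that encoding starts in the alphabets group is ours.

```python
# Codes copied from Table 5. A key such as "&D" stands for the group switch
# character into group D, i.e. &(A,D) when it appears in the table of group A.
CODES = {
    "A": {"A": "0", "B": "10", "C": "110", "&D": "1110", "&S": "1111"},
    "D": {"1": "0", "2": "10", "3": "110", "&A": "1110", "&S": "1111"},
    "S": {" ": "0", "&A": "10", "&D": "11"},
}

def group_of(ch):
    """Return the group (A, D or S) that contains the character ch."""
    for group, table in CODES.items():
        if ch in table:
            return group
    raise ValueError("character not in the example character set: %r" % ch)

def encode(text, start_group="A"):
    """Encode text, emitting a group switch code whenever the group changes."""
    bits = []
    group = start_group  # assumed initial group
    for ch in text:
        target = group_of(ch)
        if target != group:
            bits.append(CODES[group]["&" + target])  # e.g. &(A,D)
            group = target
        bits.append(CODES[group][ch])
    return "".join(bits)
```

For instance, encode("AB12 C") emits the codes for A and B, the switch &(A,D), the codes for 1 and 2, the switch &(D,S), the blank, the switch &(S,A) and finally the code for C.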
The hardware design is given in Figure 3.10. In this approach, the characters
in various groups that share the same multi-group code will be stored (in ASCII
format) in consecutive locations of the read-only memory. For the example under consideration, a 3-bit register labeled ASD (the size of the register depends on the
number of groups) will store 100, 010 or 001 depending on which of the three
groups A, S or D is being processed currently. This value is used to detect the
particular group change when the token reaches a leaf node corresponding to a
Figure 3.10. Another approach for multi-group decompression.

group switch character. The group transition logic shown in Figure 3.10 updates
the register labeled ASD whenever a group change occurs. The leaf nodes
corresponding to the group switch codes and the current value stored in the register
ASD are used by the group transition logic to decide on the next state value to be
stored in ASD. When the token flows out of the leaf node corresponding to a
group switch code, it is input to the group transition logic. The group transition
logic generates a new value that corresponds to the new group. The 3-bit register
value is used to drive the pass transistors that connect the leaf nodes to the
memory location where the corresponding symbol is stored. The same code may
be shared by symbols in different groups. For instance, in our example the code
"0" is shared by the symbols A, 1 and b (blank). When the token reaches the leaf
node corresponding to these symbols, the value in the ASD register selects only the
pass transistor that corresponds to the current group, so only one of the three
symbols is read out into the buffer from the memory. At any time the pass
transistors corresponding to only one group are activated, which is necessary since
symbols from different groups may share the same multi-group code. This
approach is superior to the previous one since the amount of hardware needed is
considerably reduced by combining the decoding trees into a single tree
module.
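
The behaviour of the combined tree and the ASD register can be imitated in software as follows. This is only a sketch under our own assumptions: the table LEAVES, the ">" marker for group switch entries and the function decode are hypothetical names, the blank is written as a space, and decoding is assumed to start in the alphabets group. The one-hot values follow the bit order A, S, D given in the text.

```python
# One-hot contents of the ASD register for each group (bit order A, S, D).
ONE_HOT = {"A": 0b100, "S": 0b010, "D": 0b001}

# Each code path of the combined tree lists what it means in each group.
# ">X" marks a group switch into group X; other entries are stored symbols.
LEAVES = {
    "0":    {"A": "A", "D": "1", "S": " "},
    "10":   {"A": "B", "D": "2", "S": ">A"},
    "11":   {"S": ">D"},
    "110":  {"A": "C", "D": "3"},
    "1110": {"A": ">D", "D": ">A"},
    "1111": {"A": ">S", "D": ">S"},
}

def decode(bits, start_group="A"):
    """Decode a bit string; return the text and the final ASD register value."""
    group = start_group      # assumed initial group
    path = ""                # position of the token in the tree
    out = []
    for bit in bits:
        path += bit
        entry = LEAVES.get(path, {}).get(group)
        if entry is None:
            continue         # the token has not reached a leaf of this group yet
        if entry.startswith(">"):
            group = entry[1]     # group transition logic updates ASD
        else:
            out.append(entry)    # only the current group's symbol is read out
        path = ""            # the token is fed back to the root
    return "".join(out), ONE_HOT[group]
```

With the codes of Table 5, the bit string 01011100101111010110 decodes to the characters A, B, 1, 2, blank and C, and the register ends with the value 100 for the alphabets group.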


The read-only memory (ROM) access time will be the critical
delay in the above design and will determine the speed at which the code bits can
be pumped into the device. With an access time of 25 nanoseconds and an average of 4
bits/character, the decompression rate will be in the neighbourhood of 10 million
characters per second.

3.5 The Run-length Encoding Scheme
The run-length encoding scheme (Golomb 1966) replaces sequences of identical characters by a count field followed by an identifier for the repeated value.
The scheme is described in detail in section 2.1.4. In this section, we present a
new hardware scheme for the run-length encoding compression technique.
The hardware design for implementing run-length encoding is given in Figure
3.11. The design uses a two-phase nonoverlapping clocking scheme. It is a simple
circuit that uses several registers, a counter and a comparator. The input string is
read at the rate of one character per clock cycle. Repeated occurrences of any
character are replaced by the character and a count, which is the number of repetitions of that character. The boxes labeled 'CUR', 'PREV' and 'TEMP' are
half register stages to hold the input characters as they come in. The other two
pairs of registers hold the output character and its count. The first pair,
labeled cnt1-char1, is a temporary buffer to accumulate the count values along

Figure 3.11. A circuit for run-length encoding scheme.

with the character and the second pair, labeled cnt2-char2, holds the encoded output. The comparison between the current input character and the previous one is
done during the phi1 phase. If the characters match, the output of the comparator is
'1', otherwise it is '0'. If the two consecutive characters are different, the comparator output is inverted and, during the following phi2 phase, is used to activate the
'clear' signal of the counter. The 'clear' signal initializes the counter value to "1"
and copies the contents of cnt1-char1 into the register pair cnt2-char2. If the comparator output is a '1' due to matching of the characters, then it increments the
counter value by '1'. This process repeats until a mismatch occurs. When there is
a mismatch, the contents of cnt1-char1 are copied into the register pair cnt2-char2.
Since the data path consists of two cycles, the output should be ignored during the
first cycle. The "clear" signal can be connected to an external pin and can be used
as an indicator of when the register pair cnt2-char2 is loaded with new
encoded values.
The comparator does bit-wise comparison and hence there is no ripple
involved. So the clock for the circuit depends on the size of the counter, which in
turn depends on the maximum repetition count of a character. Since a maximum repetition
count of 256 is typically used, an 8-bit counter can be used. An 8-bit counter will need a clock cycle of about 40 nanoseconds (a conservative estimate). Thus
our circuit will operate at a speed of at least 25 MHz, compressing 25 million
characters per second.
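
The register-level behaviour described above can be summarized by the following software sketch. It models one input character per loop iteration rather than the two clock phases, and the function name run_length_encode is hypothetical. The handling of a run longer than the counter limit (starting a new run) and the flushing of the last run at the end of the input are our assumptions, since the text does not describe them.

```python
def run_length_encode(text, max_count=256):
    """Return a list of (count, character) pairs, one pair per run."""
    encoded = []       # successive contents of the cnt2-char2 register pair
    prev = None        # PREV register
    count = 0          # counter feeding cnt1
    for cur in text:   # CUR register, one character per clock cycle
        if cur == prev and count < max_count:
            count += 1                         # comparator output '1': increment
        else:
            if prev is not None:
                encoded.append((count, prev))  # cnt1-char1 copied to cnt2-char2
            prev, count = cur, 1               # 'clear' sets the counter to 1
    if prev is not None:
        encoded.append((count, prev))          # flush the last run (assumed)
    return encoded
```

For example, the input aaabccdddd yields the pairs (3, a), (1, b), (2, c) and (4, d).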

3.6 Hardware for an Enhanced Arithmetic Coding Scheme
The basic arithmetic coding scheme is described in detail in section 2.1.5. In
this section, we present an enhancement to arithmetic coding (Bassiouni, Ranganathan and Mukherjee 1988) and introduce hardware algorithms to implement
both the original and the enhanced versions of arithmetic coding.
It should be noted that in arithmetic coding, the interval reduces in size as
more symbols are encoded and more bits are required to represent the interval as
its size decreases. The interval size decreases more rapidly when a less probable
symbol is encoded. Thus the precision required to represent the encoded message
may increase so much that it might not be possible to continue the encoding. At
that point, the end-of-message character has to be encoded and then the rest of the
message is coded starting with the initial interval (0, 1). More symbols can be
encoded within a limited precision if the probabilities of the symbols being
encoded are large. The arithmetic coding scheme uses only the distributional property of data. A modified arithmetic coding scheme which utilizes the correlational
properties of data, i.e., the property that consecutive characters tend to be of the
same type (alphabets, digits, blanks, successive zeros, etc.) is presented in this section. The scheme is based on the observation that in many commercial and
business data files, data records exhibit a strong locality behaviour of character
reference. Again we will describe the scheme with the help of an example.
The simplified character set shown in Table 4 is split into two groups as
alphabets and digits. The alphabet group will consist of {a, b, c, &, *} where the
symbol "&" indicates a group switch and the symbol "*" is used to indicate the
end of the message. The digit group will consist of {0, 1, 2, &, *}. The symbol
probabilities (p(i)) and the cumulative probabilities (P(i)) for the symbols in the
two groups are given in Tables 6 and 7.

TABLE 6
PROBABILITIES FOR GROUP-1

Symbol    p(i)    P(i)
a         0.3     0
b         0.3     0.3
c         0.2     0.6
&         0.1     0.8
*         0.1     0.9

TABLE 7
PROBABILITIES FOR GROUP-2

Symbol    p(i)    P(i)
0         0.5     0
1         0.15    0.5
2         0.15    0.65
&         0.1     0.8
*         0.1     0.9

Suppose we have to encode the string "0001abb". The string will actually be
coded as "0001&abb*", where "&" indicates the switch from the digit group to the alphabet
group while "*" indicates the end of message. The message is coded using the
arithmetic coding scheme, but the probabilities are taken from the appropriate table
depending on the group of the character being encoded. The encoding of the message and the calculation of low (L) and high (H) values of the interval code are
shown in Table 8. The width of the interval W is the difference between the high
and low points of the interval code.

TABLE 8
ENCODING OF STRING 0001abb WITH THE MODIFIED SCHEME

Symbol    L               W               H=L+W
0         0.0             0.5             0.5
0         0.0             0.25            0.25
0         0.0             0.125           0.125
1         0.0625          0.01875         0.08125
&         0.0775          0.0028125       0.0803125
a         0.0775          0.0008445       0.0783445
b         0.07775335      0.00025335      0.07800665
b         0.077829355     0.000076005     0.077905400
*         0.0778369555    0.0000076005    0.0779130005
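
The interval updates of the modified scheme can be traced with a short sketch. It is illustrative only: the names ALPHA, DIGIT and encode are hypothetical, exact rational arithmetic is used instead of the integer arithmetic of the hardware, and we assume that encoding starts in the digit group. The probabilities are copied from Tables 6 and 7. For the first four symbols (0, 0, 0, 1) the sketch gives L = 0.0625 and W = 0.01875; values for later symbols depend on the p(i) applied to the group switch symbol and on rounding, so they are not asserted here.

```python
from fractions import Fraction as F

# (p(i), P(i)) for each symbol, copied from Table 6 (alphabets) and Table 7 (digits).
ALPHA = {"a": (F(3, 10), F(0)), "b": (F(3, 10), F(3, 10)), "c": (F(2, 10), F(6, 10)),
         "&": (F(1, 10), F(8, 10)), "*": (F(1, 10), F(9, 10))}
DIGIT = {"0": (F(1, 2), F(0)), "1": (F(15, 100), F(1, 2)), "2": (F(15, 100), F(65, 100)),
         "&": (F(1, 10), F(8, 10)), "*": (F(1, 10), F(9, 10))}

def encode(message, start_group="digit"):
    """Return a list of (symbol, L, W) after each symbol of the message."""
    tables = {"alpha": ALPHA, "digit": DIGIT}
    group = start_group              # assumed initial group
    low, width = F(0), F(1)          # initial interval
    trace = []
    for sym in message:
        p, cum = tables[group][sym]
        low = low + width * cum      # new L = L + W * P(i)
        width = width * p            # new W = W * p(i)
        trace.append((sym, low, width))
        if sym == "&":               # group switch: use the other table next
            group = "alpha" if group == "digit" else "digit"
    return trace
```

Calling encode("0001&abb*") returns one (symbol, L, W) triple per coded symbol, and H can be obtained as L + W.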

In our example, since the character set is split into two groups and the probabilities are assigned with respect to the characters within a group, the symbols get
higher probability values. Thus in this new scheme, the interval length reduces at
a slower rate as the message is being encoded. This property may be observed by
comparing the code ranges in Table 8 with the corresponding ones in Table 3 in
Chapter 2, which uses the basic arithmetic coding. For our example, when the
same string is encoded using the enhanced version of the arithmetic coding scheme,
the resultant interval length is 14.075 times larger than the interval length when
coded using the original scheme. This means that we can encode more characters
within a specified precision limit by using the modified scheme. The overhead due
to group switch characters is more than offset by the slower narrowing of the code ranges due to the higher probability values of the individual characters
in the new scheme.
We now present hardware algorithms for implementing both the basic and the
enhanced versions of arithmetic coding. Since arithmetic coding consists of simple
arithmetic operations like add and multiply, it can be easily implemented in
hardware. The basic hardware as shown in Figure 3.12 consists of a random
access memory (RAM), a multiplier, an adder, a 2:1 multiplexer, a demultiplexer
and four registers. The probabilities for the characters are stored in a RAM and
the RAM is addressed by the binary value of the ASCII symbol set. For each
character, the symbol probability (p(i)) and the cumulative probability (P(i)) are
stored. The registers "New L" and "Cur L" hold the new and current low values of
the code interval during the encoding of symbols. The registers "New W" and
"Cur W" hold the new and current values of the width of the interval. The 2:1
multiplexer selects between p(i) and P(i), depending on the control value "c1", as the
input to the multiplier. The demultiplexer in Figure 3.12 passes the result from the
multiplier to the adder or to the register "New W" depending on the value of the control point "c2".

Figure 3.12. Basic hardware for arithmetic coding.

We assume integer arithmetic since it has been shown by Witten, Neal and
Cleary (1987) that arithmetic coding can be implemented using integer arithmetic.
As discussed in section 2.6, the new low point of the interval (L) is calculated as
the current value L + W*P(i) and the new width is calculated as W*p(i), where W is
the current width of the interval code. During the first cycle, P(i) and W are input
to the multiplier and W*P(i) is computed, and during the second cycle, the product
is input to the adder to be added to the current value of L. Also during the second cycle,
the product W*p(i) is computed to obtain the new width. The 2:1 multiplexer is
used to multiplex between P(i) and p(i) as the input to the multiplier. The only
control points in the design are the signals c1 and c2 shown in Figure 3.12. The
design needs very little control logic and the sizes of the registers and other
hardware depend on the precision required.
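
The two-cycle operation of the data path can be sketched in software using integer arithmetic. The sketch rests on assumptions of our own: a 16-bit fixed-point fraction (FRACTION_BITS), the hypothetical names to_fixed, multiply and encode_symbol, and the convention that c1 = 0 selects P(i) while c1 = 1 selects p(i). The text does not fix the register width or the polarity of the control signals.

```python
FRACTION_BITS = 16            # assumed register precision
ONE = 1 << FRACTION_BITS      # fixed-point representation of 1.0

def to_fixed(x):
    """Convert a probability in [0, 1] to the assumed fixed-point format."""
    return int(round(x * ONE))

def multiply(cur_w, p, cum, c1):
    """2:1 multiplexer followed by the multiplier (c1 polarity is assumed)."""
    operand = p if c1 else cum
    return (cur_w * operand) >> FRACTION_BITS

def encode_symbol(cur_l, cur_w, p, cum):
    """Return (New L, New W) for one symbol with probabilities p(i) and P(i)."""
    # First cycle: W * P(i) is computed; c2 then routes the product to the adder.
    new_l = cur_l + multiply(cur_w, p, cum, c1=0)
    # Second cycle: W * p(i) is computed; c2 routes the product to "New W".
    new_w = multiply(cur_w, p, cum, c1=1)
    return new_l, new_w
```

Starting from L = 0 and W = ONE, a symbol with p(i) = 0.5 and P(i) = 0 leaves L at 0 and halves W, as in the first row of Table 8.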
For the enhanced arithmetic coding scheme, the probabilities are stored in the
RAM by separating the characters in each group. Also, the probabilities
corresponding to the group switch characters are stored. The hardware for the
enhanced arithmetic coding scheme is given in Figure 3.13. The circuit works as
follows: The next input symbol is loaded into a register from the buffer, which is
used by the Group Switch Logic (GSL) to determine if there is a change of group. If
there is no change in the group, then GSL activates the sel.phi2 signal which
passes the symbol to the register Reg1 to be decoded. In this case, GSL loads the
Figure 3.13. Hardware for the enhanced arithmetic coding scheme.

other register Reg2 with a fixed binary value. The bits in Reg2 are appended to
those in Reg1 before they are input to the decoder in the RAM. These extra bits
from Reg2 will be used in selecting the current group. Also, the sel.phi2 signal loads a
new character into the next symbol register from the buffer. If there is a change in
the group, the next symbol is made to wait and sel.phi2 loads the dummy symbol into Reg1. Thus the sel.phi2 signal acts as a control input to the 2:1 multiplexer, which selects between the next symbol and the dummy symbol. This
dummy symbol corresponds to the situation where no symbol from any group is
selected. Simultaneously, GSL loads a value in Reg2 which will select the group
switch character and the corresponding probabilities will be output from the RAM.
The size of Reg1 will be 8 bits in order to hold the ASCII symbols, while the size
of Reg2 will depend on the number of groups. If there are three groups, then Reg2
will need 3 bits in order to represent six possible group switches. The group
switch logic is the same as given in Figure 3.7.

3.7 Speed Estimates
Our method of approach differs primarily from the previous methods in that
the algorithms are aimed at eliminating or significantly reducing the use of storage
(code-decode tables) or microcoding. The other key factor is that the algorithms
can be efficiently implemented in VLSI. Thus our algorithms help overcome the
main limitations of space and time overheads associated with software compression
as well as previous hardware proposals. The compression rates of previous methods
have been estimated at around 0.64 megabytes per second (Lea 1978)
and 1.57 megabytes per second (Hawthorn 1982). Our algorithm for
Huffman's scheme is estimated to yield a compression rate of 10 megabytes per
second. Further speedup can be obtained by using the proposed pipeline architecture for the above algorithm. Since the circuit for the multi-group compression
method also works on the principle of the reverse binary tree, we expect the same
compression rate of about 10 megabytes per second.
The hardware algorithms for arithmetic coding can be implemented in VLSI
using CMOS technology and the clock period of the circuit depends on the time
needed to perform multiplication. With the high packing density possible in VLSI
chips today, the proposed encoding hardware can be easily built within a single
chip. The arithmetic coding scheme is very sequential in nature. The encoding of
the next symbol cannot start until the current symbol has been encoded and the
interval representing the codeword is available. In order to obtain speedup, there
are two approaches possible and we briefly outline them.
The first approach is to pipeline the whole operation of encoding and perform
block coding. In this case, we will use a pipeline multiplier, and the various stages
of encoding like fetching the probabilities, multiplication, addition, etc., can be
done in a pipeline fashion. Let the entire hardware consist of n stages. The entire
data file can be split into n blocks and during the first n cycles, the coding of a
new block is initiated during each cycle. The first symbol of the first block is
coded at the end of the nth cycle, and during the (n+1)th cycle the encoding of the
second symbol of the first block is initiated. During the following cycle, the
encoding of the second symbol of the second block is initiated. Thus the utilization and throughput of the overall coding of the data file can be improved considerably due to pipelining. This scheme will be effective only for very large files
which can be broken into reasonably sized blocks.
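
The interleaving of blocks in the pipeline can be written down as a small sketch. The function name initiated is hypothetical, and cycles, blocks and symbols are all numbered from 1 as in the text.

```python
def initiated(cycle, n):
    """Return (block, symbol) whose encoding is initiated in the given cycle
    of an n-stage pipeline that interleaves n blocks."""
    block = (cycle - 1) % n + 1      # blocks take turns, one per cycle
    symbol = (cycle - 1) // n + 1    # each block advances once every n cycles
    return block, symbol
```

With n = 4, cycle 1 initiates the first symbol of block 1, cycle 5 initiates the second symbol of block 1, and cycle 6 initiates the second symbol of block 2, in agreement with the description above.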
The second approach is to build a compression system with several modules
of the compression circuit operating in parallel. We explain the scheme with the
example of having four such modules operating in parallel. In this case, the data is
input to the compression system in the following manner: the first four characters
are input to the four modules. After the encoders complete the entire encoding
operation, the next four consecutive characters are input to the encoders in the
same order. Thus the first encoder will encode the 1st, 5th, 9th characters, etc.
The second encoder will encode the 2nd, 6th, 10th characters, etc. Thus the
compressed data file is in four parts. During the decoding phase, four decoders
will operate in parallel on the four different parts of the compressed file. The first
four characters will be output by the decoders at the same time and they can be
assembled in the same order as they were in the original file before being
disassembled and compressed. The speedup we can obtain is directly proportional to the
number of encoder/decoder modules operating in parallel. With the decreasing cost
of hardware due to the advancement in VLSI technology, it is not expensive to
build such a compression system in an environment where compression could yield
considerable improvement in the overall system performance.
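
The distribution of characters among the parallel modules and their reassembly after decoding can be sketched as follows. Only the data routing is shown; the encoders and decoders themselves are omitted, and the function names split_round_robin and reassemble are hypothetical.

```python
def split_round_robin(text, modules=4):
    """Module k receives characters k, k + modules, k + 2*modules, ..."""
    return [text[i::modules] for i in range(modules)]

def reassemble(parts):
    """Interleave the outputs of the modules back into the original order."""
    out = []
    longest = max((len(part) for part in parts), default=0)
    for i in range(longest):
        for part in parts:
            if i < len(part):    # later modules may hold one character less
                out.append(part[i])
    return "".join(out)
```

For the input abcdefghij and four modules, the first module receives a, e and i (the 1st, 5th and 9th characters), and reassembling the four parts restores the original string.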
The transfer rates of current and newly released disk devices are on the order
of 2 to 3 megabytes per second. Thus our algorithms provide speeds that far
exceed the current rates for disk and communication controllers as well as the projected rates for these devices in the foreseeable future.

CHAPTER 4
DESIGN OF A COMPRESSION CHIP
In this chapter, we present the design and implementation of a prototype
CMOS VLSI chip for data compression using Huffman codes based on the
hardware algorithm proposed in section 3.1. The algorithm can be used to
compress text files based on an ASCII character set, but the implementation of the
Huffman Compressor HC7 chip is based on a simplified seven-character alphabet
set. The chip implements the example Huffman encoding scheme discussed in section 2.2.5. The layout and the basic cells can be easily adapted to design a real-time
compression system. This chapter includes discussions of the design process
from the functional level up to the final layout verification. The design and the
reasons for the choice of the individual cells are explained. The layout was
designed using version 4 of the MAGIC software, and design tools like ESIM
and CRYSTAL have been used for software verification of the layout. The chip
uses a two-phase non-overlapping clocking scheme and is designed to have 11
pins.


4.1 Functional Description
The architecture and the algorithm are described in detail in section 3.1. The
main components in the implementation are a decoder and a hardware binary tree.
The decoder decodes the ASCII character and places a token on the node that
represents the corresponding character in the tree which is actually a reverse
Huffman tree. As the token traverses towards the root through the reverse code
path, the corresponding bit is output at the rate of one code bit per clock cycle.
Once the code for the current character has been output, the code for the next character starts being output without loss of continuity. Thus, once a compression
phase is started, we get one bit of compressed data during each clock cycle. The
decoder is implemented as a pipeline and the number of stages in the decoder is
equal to the minimum path length in the Huffman tree. The decoding can be done
in parallel as the token traverses up the tree, and the decoded data will be stored in
a set of recirculating latches. Once the token reaches the root, the token is used to
place the data from the latches to the bottom of the tree. The decoding of the next
symbol is initiated simultaneously. The restriction on the number of stages in the
decoder pipe makes sure that the decoded data is ready in the latches before the
previous token reaches the root. The pipelining of the decoder not only allows
parallelism but also decreases the cycle time, thus aiding a better compression rate.


An important aspect that came into consideration during the layout design
process is the initialization that is required before every phase of encoding. Since
the output of the root cell is being used as a feedback signal, it became important
that the feedback signal is low whenever a token is not passing through the root.
This can be done by placing a "0" on every arc of the tree to begin with and also
by inserting "0"s at the bottom of the tree whenever the feedback signal is low
(when the feedback signal is high, the decoded output from the latches is placed on
the bottom of the tree). If no initialization is done, the feedback signal becomes
"undefined" later and so it becomes impossible to place a new token at the leaf
nodes or read a new character in for decoding. In the implementation, we have two
external signals called "i" for initialization purposes and "s" for starting an encoding phase. The start signal is actually used to receive the first character in the symbol register so that it can be decoded.

4.2 Design of Basic Cells
In this section, we describe the logic design of the various cells used in the
circuit.


4.2.1 The Decoder Cell
This stage decodes the character in the symbol register and the decoded data
are latched in recirculating latches. We used a 3:7 decoder, but the same layout
can be easily extended to handle 8-bit ASCII codes. The decoder is designed to be
a two-stage pipeline. The circuit diagram for the basic decoder cell is given in
Figure 4.1. During the first phi1 phase the character is loaded into the symbol
register and during the second phi2 phase the decoded output is captured in the
recirculating latches. The decoding is done between these two phases as shown in
the figure.

4.2.2 The Recirculating Latch Cell
The basic cell is given in Figure 4.2. Seven of these cells are placed together
to store the decoder output until the previous token passes through the root and
allows the latched data to be placed on the leaf nodes. Once the feedback signal
goes high, it initiates the decoding of the next character, and since decoding takes
two cycles in our example, the same feedback signal is delayed by two cycles and
is used to activate the transmission gate that allows the decoded output to be
latched. The data is refreshed through recirculation whenever new data is not
being latched in.

Figure 4.1. The basic decoder cell.

Figure 4.2. The recirculating latch cell (x: feedback signal; delx: x delayed by 2 cycles).

4.2.3 The Basic Cells for the Tree
The circuit diagram for the leaf cell is given in Figure 4.3. The input to the
leaf cell comes from the recirculating latches which are controlled by the feedback
signal. The leaf cell consists of two inverters separated by a transmission gate controlled by the phi2 signal.
There are four types of internal node cells which are given in Figures 4.4 to
4.7. The cell given in Figure 4.4 (named "icell") is simply a shift register stage, except
for the "ibar" signal that forces the output of the node to "0" during the initialization. The cell given in Figure 4.5 (named "tcell") has two inputs that are latched
by a NOR gate during the phi1 phase. The output of the gate is inverted during
the next phi2 phase so that the overall function of the cell is to "OR" the inputs.
The cell given in Figure 4.6 (named "hcell") has a non-uniform structure in that
one of the inputs goes through only the phi2 phase whereas the other input passes
through both phi1 and phi2 phases. The logical function of the circuit is the same
as the "tcell". The function of the root cell given in Figure 4.7 is again to OR its
two inputs, but this cell has no clocks since we do not want to insert any cycles
between codes. In other words, we would like to generate the code in a continuous
fashion with one bit every clock cycle, once a compression phase is started.

Figure 4.3. The leaf cell.

Figure 4.4. The internal node "icell" (ibar: initialization signal i inverted).

Figure 4.5. The internal node "tcell".

Figure 4.6. The internal node "hcell".

4.2.4 The Output Circuit
The output circuit produces the code with one bit of encoded information
every clock cycle. The edges in the tree of Figure 2.11 that are labeled as "1" are
connected to the output circuit which generates a "0" during each clock except
when a token passes through one of these edges in the tree. A precharge circuit is
used as shown in Figure 4.8.

4.3 Layout Design
This section describes the VLSI layouts for the various cells described in the
previous section. The layouts were designed using the CAD software called
MAGIC distributed by the University of California at Berkeley. The layouts were
verified using the event-driven switch-level simulator ESIM and timing estimates
were obtained using the analyzer called CRYSTAL. The layout designs for the
individual cells are described below.

4.3.1 The Decoder Layout
The decoder with a two-stage pipeline was designed and the layout was
verified. The basic cell is of size 138 lambda X 51 lambda and the cifplot is given
in Figure 4.9. The basic cell consists of a two-stage pipeline with the pass transistors left unconnected with the input lines. Seven of these cells are arranged

Figure 4.7. The root cell.

Figure 4.8. The output circuit (inputs 1, 2, 3, 4, 5: from arcs labeled '1' in the reverse tree).

Figure 4.9. Layout cifplot of the basic decoder cell.

together in a higher level cell to obtain a 3:7 decoder. The basic cell had to be left
incomplete so that it could be used iteratively. The transistors were connected to
the appropriate input signals once the basic cells were put together. The cifplot for
the complete decoder is given in Figure 4.10.

4.3.2 The Recirculating Latch Layout
The layout of the latch cell is given in Figure 4.11. It is of size 84 lambda X
51 lambda. The width of the latch cell is the same as the width of the basic
decoder cell to maintain uniformity. Our approach was towards obtaining a rectangular final layout.

4.3.3 The Tree Layout
The basic cells for the tree were laid out and verified. The cifplot for the leaf
cell is given in Figure 4.12. The cifplots for the internal node cells given in Figures 4.4 to 4.7 are given in Figures 4.13 to 4.16. The tree was laid out in two
rows. The size of the layout for the tree is 183 lambda X 540 lambda. The root
cell's output is "OR"ed with the start signal "s" (after "s" has been delayed by four
cycles for initialization purposes) and then "AND"ed with phi1 before it is used to
start the decoding of the next symbol while placing the next token at the bottom of
the tree. A cifplot of the tree is given in Figure 4.17.

Figure 4.10. Layout cifplot of the complete decoder.

Figure 4.11. Layout cifplot of recirculating latch cell.

115

ti)
Q)

.c:
0
C

ti)

C
0
I,.,

0

-

Q),..,

<ll
0

ti)
Q)

tn .c:
0

I c

I•-

I

I/')

~~
~~

N
-

ti)

•....
0

-N

f;-i-:...----:-:_""··"·'""·"'··""""'""""'""'":':-:"'.~
·~ ~ -1:::::::?}:/\//::

Figure 4.12. Layout cifplot of leaf cell.

Figure 4.13. Layout cifplot of "icell".

Figure 4.14. Layout cifplot of "tcell".

Figure 4.15. Layout cifplot of "hcell".

Figure 4.16. Layout cifplot of the root cell.

Figure 4.17. Layout cifplot of the tree.

4.3.4 The Output Circuit Layout
The cifplot of the layout for the output circuit is given in Figure 4.18. The
layout is of size 73 lambda X 85 lambda. The output signal is connected to the
output pad labeled "o" in the final chip layout.

4.3.5 The Chip Layout
A cifplot of the final layout without pads is given in Figure 4.19. The size of
the layout without I/O pads is 489 lambda X 616 lambda. The cifplot for the final
chip layout is given in Figure 4.20. A total of 11 pads have been used: seven
input pads, two output pads and two pads for Vdd and GND. The
input pads a, b and c are used for the input character symbol, and two input pads
are used for the start "s" and initialization "i" signals. The remaining two input pads are used for the phi1
and phi2 clocks. The output pad "o" is used for the code and the pad labeled "x"
is for the feedback signal. This signal will actually be used to load the symbol
register with the next character to be decoded. The power (Vdd), ground (GND)
and clock lines (phi1 and phi2) run horizontally through the cells while the data
flows in the vertical direction. The entire layout has been verified by the switch-level simulator ESIM. The overall size of the layout is 1483 lambda X 1078
lambda.

Figure 4.18. Layout cifplot of the output circuit.

Figure 4.19. Layout cifplot of the chip without pads.

Figure 4.20. Layout cifplot of the final chip.

4.4 External Interface
The HC7 chip has eleven pins in all with 3 data input lines, 2 clock lines, 2
output lines, 2 control signals and the Vdd and the GND lines. The external functional interface is shown in Figure 4.21.

4.5 Dynamic Simulation
The final layout was functionally verified using the event-driven switch-level
simulator ESIM, which works for both nMOS and CMOS circuits. The
simulation was done on a sample string which consisted of all the symbols in the
character set used in the example and also repetitions of some symbols. To start
with, all the input signals including phi1 and phi2 were set to "0" and the initialization signal "i" alone was set to "1". This forced a "0" on all the arcs of the tree;
"i" was then set to "0" and remained so for the rest of the simulation. Then
the inputs a, b and c were set to represent the first character in the string, and the
start signal "s" was set to "1" for the duration of the first clock cycle.
One clock cycle consists of a phi1 phase followed by a phi2 phase. Then "s" was
reset to low before continuing the simulation. The "s" signal passes through a
four-cycle delay before it is ORed with the feedback signal and used to place the
first token in the bottom of the tree and also to load the symbol register with the
second character of the string to be compressed. For the rest of the simulation "s"
was kept low and henceforth the above control solely depended on the feedback
signal. As the token passed through the first arc of the tree, during the fifth clock
cycle after "s" was high, the output signal "o" started generating the code. The
mechanism is designed so that from the fifth cycle onwards we will get one bit of
compressed code every clock cycle until the completion of the compression phase.

Figure 4.21. The external functional interface.

4.6 Timing Estimates
The final layout for the chip was tested for timing estimates using the CRYSTAL timing simulator. The worst-case analysis gave the phi1 phase to be 8 nanoseconds
and the phi2 phase to be 20 nanoseconds. Since CRYSTAL gives very
conservative estimates, it is justified to take the clock period as 44 nanoseconds
(8 + 8 + 20 + 8 = 44), where each inter-clock gap is 8 nanoseconds. For the example
used in our implementation, the average number of bits per character is 2.7, which
is also the average number of clock cycles required to generate the code for a character. A character therefore takes about 119 nanoseconds (2.7 X 44) on average, so the compression rate will be around 8.4 million characters per second,
which is well above the transfer rates of the current and newly released disk devices (between 1 and 3 million characters per second). The critical path in the chip
layout is the feedback signal from the root. By arranging the tree layout such that
the root cell is placed closer to the decoder, the clock period and the compression
rate can be maintained to be approximately the same in a real life chip.
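
As a quick check of the arithmetic in this section, the following sketch recomputes the clock period and the resulting compression rate from the CRYSTAL estimates. The function and variable names are illustrative assumptions; only the numeric inputs come from the text.

```python
# Back-of-the-envelope check of the timing estimate (names are illustrative).
PHI1_NS = 8.0         # worst-case phi1 phase reported by CRYSTAL
PHI2_NS = 20.0        # worst-case phi2 phase reported by CRYSTAL
GAP_NS = 8.0          # inter-clock gap assumed in the text
BITS_PER_CHAR = 2.7   # average code length for the example character set


def clock_period_ns(phi1=PHI1_NS, phi2=PHI2_NS, gap=GAP_NS):
    """One cycle is phi1, a gap, phi2 and a second gap."""
    return phi1 + gap + phi2 + gap


def chars_per_second(period_ns, bits_per_char=BITS_PER_CHAR):
    """In steady state one code bit is produced per clock cycle."""
    return 1e9 / (period_ns * bits_per_char)


period = clock_period_ns()
rate = chars_per_second(period)
print(f"clock period: {period:.0f} ns")
print(f"compression rate: {rate / 1e6:.1f} million characters/second")
```

With these inputs the sketch gives a 44 ns period and a rate of roughly 8.4 million characters per second, which remains well above the 1 to 3 million characters per second quoted for disk devices.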

CHAPTER 5
SIMULATION OF ARCHITECTURES
The effect of integrating compression hardware into a computing system can
be quantified by constructing detailed simulation models of the system with and
without compression. We constructed such models and performed simulations for
various input data. Important performance measures like throughput and response
time were compared, using typical values of system parameters/characteristics. The
improvement in the performance of the system due to compression hardware is
used to evaluate the effectiveness of the proposed VLSI chips. This chapter
describes the results of various simulation studies conducted by the author.
The architectures considered for the simulations are a general purpose computer system, and a special purpose backend machine. The compression hardware
will be of great advantage in communication systems. However, we will not consider communication architectures for simulation purposes since the improvement
in the effective bandwidth of the communication links due to compression is
directly dependent on the compression ratio. The next section describes the inclusion and significance of data compression hardware within a communication network architecture. The later sections discuss the specifications and the results of the
simulations performed by the author for the above mentioned architectures.

5.1 A Communication Network Architecture
Telecommunications links are currently one of the most bandwidth-restricted
devices in computer systems. Other than the LANs, few telecommunication links
exceed data rates of 1 Mbit/sec. Many run at rates as slow as 1200 to 9600
bits/sec. Data compression in network interfaces would significantly increase the
effective bandwidth of the links. A 2400 bits/sec line with compression of 50 percent would operate at an effective rate of 4800 bits/sec. Similarly, a 9600 bits/sec line would function
at an effective rate of 19,200 bits/sec (19.2 Kbits/sec). Clearly, teleprocessing links are a prime candidate for the use of data compression techniques.
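
The effective rates quoted in this paragraph follow from dividing the line rate by the fraction of the data that remains after compression. The short sketch below illustrates that relation; the function name and its `reduction` parameter are assumptions made for the example, not part of the thesis.

```python
def effective_rate(line_rate_bps, reduction):
    """Effective data rate of a link whose traffic shrinks by `reduction`.

    `reduction` is the fraction of the data removed by compression,
    e.g. 0.5 for "compression of 50 percent".
    """
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in the range [0, 1)")
    return line_rate_bps / (1.0 - reduction)


for line_rate in (2400, 9600):
    print(line_rate, "bits/sec ->", effective_rate(line_rate, 0.5), "bits/sec")
```

For a 50 percent reduction the sketch doubles the line rate, so a 9600 bits/sec line comes out at 19,200 bits/sec.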
The data communication process generally requires at least five elements: a
transmitter or source of information; a message; a binary serial interface; a communication channel or link; and a receiver of transmitted information. A data communications interface is often needed to make the binary serial data compatible
with the communication channel. The Data Link Control (DLC) hardware is
shown in Figure 5.1. If the serial communications interface is handled by a dedicated VLSI chip, as in the case of the SIGNETICS 2652 Multi-Protocol Communications Controller, the host can perform other processing during
transmission/reception. Implementing data compression within these line controller
chips seems to be the reasonable choice for communication systems. The presence
of data compression within the line controllers can be made transparent to the
system software, as well as the user.
The serial communications interface consists of transmission logic, receiver
logic, control and timing for implementation of SDLC and HDLC protocols, bit-stuffing and error detection logic. Typical transmitter and receiver logic in a
serial communication interface of the data link control hardware is given in Figures 5.2 and 5.3 (adapted from the SIGNETICS 2652 Multi-Protocol Communications Controller chip). The transmitter logic consists of registers to hold byte data,
a parallel-to-serial interface and logic to produce header data for outgoing frames.
The transmitter logic inserts the control data when required and transmits messages
with header in a bit-serial fashion. The idea is to replace the parallel-to-serial interface block by the compression module in the transmitter logic and correspondingly
the serial-to-parallel interface by the decompression module in the receiver logic. Thus
data can be compressed before transmission and decompressed at the receiving end.
This will result in a significant cut in communication costs and increase the
effective baud rate considerably. Buffering will be required to handle the variable
rate of compressed output, and buffer size needs to be carefully selected as discussed later in the case of a general purpose computer system.

Figure 5.1. The data link control hardware.

Figure 5.2. The transmitter logic.

Figure 5.3. The receiver logic.


5.2 A General Purpose Computer System
The architecture of a general purpose computer system is given in Figure 5.4.
It resembles an architecture that is close to that of the VAX 11/780 model, and we
have assumed values for the parameters that are close to those of the VAX machine.
We constructed queueing network models of such an architecture with and without
the compression/decompression hardware. The queueing network model shown in
Figure 5.5 formed the basis for our simulations of a general purpose machine
architecture.
Device controllers represent an appealing choice for location of data compression in a general purpose machine. Device controllers equipped with data compression capability should perform the task of compressing data before storing it on the
disk and expanding the compressed data before supplying it back to the CPU for processing. In this case, data compression is transparent to users and application
software. Because of the unpredictability of the size of the compressed data, its
instantaneous transfer rate can vary considerably. A buffer in the device controller
would therefore be needed to efficiently handle the transfer of compressed data to
the device. For disk systems, the size of the buffer should be carefully selected in
order to ensure good utilization of disk bandwidth. The determination of the buffer
size should take into account the block size of the disk, its transfer rate, and the
speed of the VLSI chips. The effect of varying buffer size on the performance of
the machine is quantified through our simulation studies, later in this section.

Figure 5.4. A general purpose computer architecture.

Figure 5.5. A queueing network model.

Another choice for the location of data compression is the I/O channel. As in
the case of disk controllers, buffering would also be needed if data compression is
implemented in the I/O channel. In this case, compression can be made a
separate channel instruction. Thus the operating system should instruct the channel
to compress data first, then to transfer the compressed results to the disk. Our
simulation models are constructed assuming that the compression hardware is
located in the I/O channel.
The queueing network model shown in Figure 5.5 is derived mainly to suit
our purpose of simulation. The square boxes labeled 'T' represent terminals
which submit a job and wait for its completion before submitting another. The job
joins the queue for CPU service. According to our model, a job has an alternating
sequence of CPU bursts and I/O bursts before it terminates. An I/O burst could be a
read or write request. Since we are only concerned with the effect of
compression/decompression, which are assumed to occur at the same rate, a read or
write request is treated in the same manner in the simulations. After a job completes its CPU service time, if an I/O service is required, it joins the queue for the
corresponding I/O channel. Since we are only interested in modeling the change in
the behaviour of the system due to compression hardware, we have not shown
details of the main memory accesses, etc. The CPU service time for each CPU
burst is created using an exponential distribution at a specific rate which is
assumed to include the memory access times. The simulation programs were written in SIMSCRIPT and run on the VAX 11/780 machine. The simulation language
SIMSCRIPT can be used to program queueing network models efficiently. Important parameters used for the simulation model were:
Disk access time: 0.018 seconds
Disk transfer rate: 1859000 bytes/second
Mean CPU service time (exponential distribution): 0.0001 second
Compression/decompression rate: 10000000 bytes/second
CPU quantum: 0.0001 second
Degree of multiprogramming: 15
Channel speed: same as disk transfer rate
Compression ratio (uniform distribution): range 0.25 to 0.65
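
To make the stochastic inputs concrete, the sketch below draws CPU bursts and compression ratios with the distributions named in the text (exponential service times, uniformly distributed compression ratio). It is written in Python rather than SIMSCRIPT, and the function names, the sample count and the fixed seed are assumptions chosen only for illustration; it is not the simulation program used for the results.

```python
import random

# Workload parameters quoted in the text.
MEAN_CPU_BURST_S = 0.0001                # mean CPU service time per burst
COMPRESSION_RATIO_RANGE = (0.25, 0.65)   # uniform compression ratio


def sample_cpu_burst(rng):
    """Draw one CPU burst length from an exponential distribution."""
    return rng.expovariate(1.0 / MEAN_CPU_BURST_S)


def sample_compression_ratio(rng):
    """Draw one compression ratio, uniform over the stated range."""
    low, high = COMPRESSION_RATIO_RANGE
    return rng.uniform(low, high)


rng = random.Random(1988)  # fixed seed (arbitrary) so runs are repeatable
bursts = [sample_cpu_burst(rng) for _ in range(20000)]
ratios = [sample_compression_ratio(rng) for _ in range(20000)]
print(f"mean CPU burst: {sum(bursts) / len(bursts):.6f} s")
print(f"mean compression ratio: {sum(ratios) / len(ratios):.3f}")
```

Over many samples the mean burst length approaches 0.0001 second and the mean compression ratio approaches the midpoint of the range, 0.45.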

The performance measures, throughput and response time, were measured against varying values of the mean I/O request size and the I/O buffer size. The results of the
simulations are illustrated with graphs shown in Figures 5.6 and 5.7. The effects
of increasing mean I/O request size on the throughput and response time are shown
in Figure 5.6. As the job becomes more I/O bound, the throughput reduces since
I/O takes more time due to the speed of I/O devices, which is much slower than the
CPU's. The throughput was measured with the same input parameter values for
both models, with and without compression. The throughput values for the model
with compression are always higher than those for the model without compression.
It can be observed from the graph that there is an order of magnitude increase in
throughput due to inclusion of compression hardware. There is a corresponding
increase in response time as the job becomes more I/O bound. The mean I/O
request size was increased from 200 bytes up to 3000 bytes and it can be seen from Figure 5.6 that the effect of compression increases considerably as the mean I/O
request size increases.

Figure 5.6. Throughput and response time vs mean I/O request.

Figure 5.7. Throughput and response time vs I/O buffer size.

These simulations were conducted assuming that compression/decompression
hardware is placed in the I/O channel. Because of the unpredictability of the size
of the compressed data, its instantaneous transfer rate may vary in time. A buffer
will be needed to efficiently handle the transfer of data to and from the I/O devices. The size of the buffer will have a pronounced effect on the system performance,
as can be seen in Figure 5.7. The graphs in Figure 5.7 illustrate the effect of
compression hardware on the throughput and response time as the buffer size is
increased. It is worthwhile to mention that increasing the buffer size means an
increase in cost and so the choice of buffer size is a trade-off between cost and
performance. The choice must be made carefully so that the effect of compression
is fully utilized, while keeping the I/O design cost effective.

5.3 A Special Purpose Back-end Machine
This section describes a performance analysis done to investigate the effect of
employing compression hardware on the performance of a special purpose backend database machine. The study was conducted by constructing detailed analytical
models. The simulation results will be used to obtain performance comparison
between the system with and without hardware assistance for data compression.
There are different classifications of database machines depending on factors
like the number of processors, the type of architecture (tightly coupled or loosely
coupled multiprocessor architectures), the storage schema and several other factors
(Srinidhi and Sloan 1988). Though they are all different in architectures, the
integration of data compression hardware will have more or less similar effects on
their system performance. A detailed performance analysis of different database
machine architectures is given in Dewitt and Hawthorn (1981). They compare four
different database machines, but do not cover an architecture similar to that of
DELTA. A generalized set of models for comparing different database machine
architectures is given by Srinidhi and Sloan (1988).
Our goal is to establish by means of simulation studies the effect of incorporating compression hardware into database machines which handle huge volumes
of data. We considered a specific machine for our simulation studies, which was
DELTA, a relational database machine. The reasons for our choice of DELTA will
be clear from the discussions that follow.
DELTA is a hardware-oriented relational database machine which is planned
to be one of the software development tools in Japan's Fifth Generation Computer
Project. It has a tightly coupled multiprocessor architecture and uses an attribute-based storage schema. DELTA is equipped with very high speed VLSI hardware
for performing relational algebra operations like join, selection, etc. DELTA has
better processing capability than most other such machines, but the advantage
gained is overshadowed by the overhead involved in the reconstruction of tuples
before sending results to the user. The tuple reconstruction overhead due to the
adoption of an attribute-based internal schema for data storage is extremely high,
which makes the high speed processing capability of the machine invisible and
useless to the user. The large number of disk accesses required to fetch the large
domains of attributes for tuple reconstruction affects the performance of DELTA
considerably. The tuple reconstruction time could be considerably reduced by storing the attribute values in compressed form. This can be done by introducing data
compression hardware in the DELTA architecture at suitable locations. In this section,
we present analytical models of DELTA with and without compression hardware
and present quantified measures of improvement in its performance due to
compression hardware.
The global architecture of DELTA is given in Figure 5.8. The main functional
units are the interface processor IP, the control processor CP, the maintenance processor MP, the relational database engine RDBE and the hierarchical memory HM
and its controller HMCTL. The IP is a front-end processor responsible for communication with the host, and it performs transfer of data between the host and the
hierarchical memory subsystem. The CP analyzes the DELTA commands passed
from the IP and compiles them into subcommands to instruct the individual processors to perform appropriate processing. The CP also maintains dictionary information consisting of schema information on relational data and controls data recovery.
The MP monitors the status of the system and performs system configuration
management. The HM is a hierarchical memory subsystem which consists of two
layers: the lower layer consisting of moving head disks with a maximum total
capacity of 20 gigabytes and the upper layer made up of fast semiconductor random access memory (SDK) with a capacity of 128 megabytes. A more detailed
description of the architecture and the various algorithms is given in Shibayama et
al. (1984).
The DELTA machine was designed to exploit the growth of VLSI technology
and its performance is affected considerably by the tuple reconstruction overhead.
One solution will be to provide multiprocessing capability at the HMCTL level in
order to construct tuples in parallel. This strategy requires major modifications to
DELTA's architecture. Another solution is to store the attribute domains in
compressed form on the disks and decompress the data whenever they have to be
read for processing. The hardware for compression and decompression can be
included at the HMCTL level. We suggest the second alternative and describe the
simulation studies conducted to analyze the effect of compression hardware on
DELTA's performance.

Figure 5.8. Global architecture of DELTA.

The parameters used in the simulation are listed below. The values for the parameters are taken from Dewitt and Hawthorn (1981).

$T_m$ = CPU time to receive or send a control message = 2.0 ms.
$T_q$ = CPU time to compile a query = 152 ms.
$T_a$ = average disk access time = 38.6 ms.
$T_o$ = page read/write time on disk = 16.7 ms.
$T_i$ = CPU time to initiate an I/O operation = 2.0 ms.
$T_p$ = transfer time per page = 1.303 ms.
$T_s$ = track-to-track seek time on disk = 10.1 ms.
$T_c$ = page read/write time on disk cache = 0.326 ms.
$T_r$ = tuple reconstruction operation time per page = 10 ms.
$P_s$ = page size = 13030 bytes.
$T$ = RDBE cycle time = 100 ns.
$B$ = bandwidth = 10 Mbytes/sec.
$P_i$ = number of pages of values for a given attribute i.
$I_1$ = index factor.
$S_1$ = selectivity factor.
$C_1$ = compression factor.
$O_c$ = compression/decompression time overhead.
$b_1$ = number of bytes in attribute value = 8.
$b_2$ = number of bytes in tid = 3.
$A_r$ = number of attributes per tuple in relation A.
$B_r$ = number of attributes per tuple in relation B.
$N_r$ = number of tuples in relation N.

The index factor $I_1$ represents the reduction in the number of unwanted pages
read during any disk read operation. This reduction is due to the indexing scheme
used in DELTA. The selectivity factor corresponds to the number of attributes
selected during a select query. The parameters $N_r$, $I_1$, $S_1$, $A_r$ and $B_r$ are used as
variables, and the query response times for the relational operations Selection and
Join are evaluated for each case. First we derive an expression for a generalized
query response time in the DELTA machine. Then the expression is modified for
selection and join queries. The expressions are further modified to include the
effect of data compression hardware in the HMCTL module of the DELTA
machine. We derive expressions for evaluating the following:
$T_Q$ = generalized query response time.
$T_{sel}$ = selection query response time in model without compression.
$T_{csel}$ = selection query response time in model with compression.
$T_{join}$ = join query response time in model without compression.
$T_{cjoin}$ = join query response time in model with compression.

Now we derive a general model for query response time in DELTA. The
query response time represented by $T_Q$ consists of the following 8 components,
referred to as $T_1$ through $T_8$:

$T_1$ = IP receives query.
$T_2$ = IP to CP transfer of query.
$T_3$ = CP compiles query.
$T_4$ = CP to HMCTL command transfer overhead.
$T_5$ = read attribute set(s) of relation(s).
$T_6$ = subcommand sequence execution or RDBE processing time.
$T_7$ = tuple reconstruction time.
$T_8$ = transfer of results from SDK to Host.

Thus, $T_Q = T_1 + T_2 + \cdots + T_8$. For both selection and join, we derive expressions for each of the eight components described above. We evaluate $T_{sel}$ for a
simple selection (based on a single attribute condition) as follows:

(i) $T_1 + T_2 + T_3 + T_4 = 3 T_m + T_q$.

(ii) $T_5 = T_i + T_a + T_o + (I_1 P_i - 1) \max(T_s + T_o,\ T_p + T_c) + T_p + T_c$.

In the equation for $T_5$ given above, $T_i$ is the time to initiate the read operation from
the disk and $T_a + T_o$ is the time for accessing and reading the first page. We assume that the
following pages are read in a consecutive manner. $P_i$ is the number of pages containing the values of the given attribute i, and $I_1$ corresponds to the reduction in
the reading of unwanted pages that is achieved by maintaining an index. The seeks
and reads on the disk are interleaved with transfers and writes on the disk cache
SDK. The transfer and write of the last page are accounted for by the last two terms,
$T_p$ and $T_c$.
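
The first five components of the selection query can be evaluated directly from the parameter list. The sketch below is a minimal illustration of expressions (i) and (ii); the constant and function names, and the sample values of 10 pages and an index factor of 0.8, are assumptions chosen for the example.

```python
# Sketch of T1..T5 for a simple selection query (all times in milliseconds).
T_M = 2.0      # CPU time per control message
T_Q = 152.0    # CPU time to compile a query
T_A = 38.6     # average disk access time
T_O = 16.7     # page read/write time on disk
T_I = 2.0      # CPU time to initiate an I/O operation
T_P = 1.303    # transfer time per page
T_S = 10.1     # track-to-track seek time on disk
T_C = 0.326    # page read/write time on the disk cache


def t1_to_t4():
    """Query receipt, IP-to-CP transfer, compilation and command transfer."""
    return 3 * T_M + T_Q


def t5_select(pages, index_factor):
    """Time to read the domain of one attribute that occupies `pages` pages."""
    # Disk seek/read of one page overlaps transfer/cache write of another.
    overlapped = max(T_S + T_O, T_P + T_C)
    return T_I + T_A + T_O + (index_factor * pages - 1) * overlapped + T_P + T_C


print(f"T1..T4 = {t1_to_t4():.1f} ms")
print(f"T5     = {t5_select(pages=10, index_factor=0.8):.3f} ms")
```

With the listed parameter values the disk term $T_s + T_o$ (26.8 ms) dominates the cache term $T_p + T_c$ (1.629 ms), so the read time grows almost linearly with the number of pages actually fetched.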

(iii) $T_6$ = The RDBE processing time represented by $T_6$ consists mainly of three
steps that can be interleaved: (1) reading a page from SDK, (2) transferring the
page to RDBE, and (3) processing time in RDBE. The number of items per page
$N_p$ for RDBE processing is given by

$$N_p = \lfloor P_s / (b_1 + b_2) \rfloor$$

where $P_s$ is the page size and $b_1$, $b_2$ are the sizes of the attribute value and the
tuple id, respectively.
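
As a small worked example of the floor expression above, the sketch below computes the number of (value, tid) items per page and the RDBE time to process one full page at two cycles per item. The function names are illustrative assumptions; the numeric defaults are the parameter values listed earlier.

```python
def items_per_page(page_size=13030, value_bytes=8, tid_bytes=3):
    """Number of (value, tid) items that fit in one page (floor division)."""
    return page_size // (value_bytes + tid_bytes)


def page_processing_time_ms(cycle_ns=100, cycles_per_item=2):
    """RDBE time to process every item of one page, in milliseconds."""
    return items_per_page() * cycles_per_item * cycle_ns / 1e6


print(items_per_page())             # items per page
print(page_processing_time_ms())    # about 0.2368 ms
```

With a 13030-byte page and 11 bytes per item, 1184 items fit in a page, so processing a page takes roughly 0.24 ms, which is shorter than the 1.303 ms page transfer time.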
The total time is made up of the time for reading and processing the first page (where $2T$ corresponds to processing a (value, tid) pair), the time for the rest of the pages, the time to sort the $S_1 P_i N_p$ selected values, and the time to transfer the results from RDBE to SDK, which is

$$2 T_m + P_i S_1 T_p$$

Then the sum of all the above terms is simplified to obtain the expression for $T_6$

which is given as,

$$T_6 = T_c + T_p + 2 T_m + N_p \cdot 2T + T_p (I_1 P_i - 1) + S_1 P_i T_p + (S_1 P_i N_p + \log(S_1 P_i N_p)) \cdot 2T$$

(iv) $T_7$ = The tuple reconstruction time involves reading the pages of all the attribute domains and reconstructing the tuples, one at a time, at the HMCTL level.
The reading of the pages, transfer to the HMCTL and the reconstruction operations
can be interleaved and the following expression is derived:

$$T_7 = (T_i + T_a + T_o + (S_1 P_i - 1) \max(T_s + T_o,\ T_p + T_c,\ T_r)) (A_r - 1) + T_p + T_c + T_r$$

(v) $T_8$ = The time to transfer the results from SDK to Host is given by:

$$T_8 = A_r S_1 P_i (T_p + 3 T_m)$$

where $3 T_m$ is included for control messages from CP to HMCTL, IP, and the
Host. As mentioned earlier, the sum of all the above terms, $T_1$ through $T_8$, gives
$T_{sel}$, the total execution time for a simple selection query.
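
The tuple reconstruction and result transfer terms for selection can be sketched in the same way. The code below is an illustrative reading of the expressions for T7 and T8 above; the function names and the sample inputs (10 pages, selectivity 0.7, 10 attributes per tuple) are assumptions for the example.

```python
# Sketch of T7 and T8 for a selection query (all times in milliseconds).
T_M = 2.0      # CPU time per control message
T_A = 38.6     # average disk access time
T_O = 16.7     # page read/write time on disk
T_I = 2.0      # CPU time to initiate an I/O operation
T_P = 1.303    # transfer time per page
T_S = 10.1     # track-to-track seek time on disk
T_C = 0.326    # page read/write time on the disk cache
T_R = 10.0     # tuple reconstruction time per page


def t7_select(pages, selectivity, attrs_per_tuple):
    """Tuple reconstruction time over the remaining (A_r - 1) attributes."""
    # Disk read, transfer to HMCTL and reconstruction are interleaved.
    overlapped = max(T_S + T_O, T_P + T_C, T_R)
    per_attribute = T_I + T_A + T_O + (selectivity * pages - 1) * overlapped
    return per_attribute * (attrs_per_tuple - 1) + T_P + T_C + T_R


def t8_select(pages, selectivity, attrs_per_tuple):
    """Time to transfer the result pages from SDK to the host."""
    return attrs_per_tuple * selectivity * pages * (T_P + 3 * T_M)


print(f"T7 = {t7_select(10, 0.7, 10):.3f} ms")
print(f"T8 = {t8_select(10, 0.7, 10):.3f} ms")
```

Even for this small example the reconstruction term is several times larger than the result transfer term, which is consistent with the text's point that tuple reconstruction dominates DELTA's response time.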

The above expressions represent the model without compression. The inclusion of compression will affect two terms in $T_{sel}$, which are the read time $T_5$ and
the tuple reconstruction time $T_7$. The other terms remain the same as in the model
without compression. Since the pages for the attribute domains are stored in
compressed form, the number of disk accesses needed to read an entire attribute
domain is reduced. The expression for $T_5$ can be rewritten as:

$$T_5 = T_i + T_a + T_o + (C_1 I_1 P_i - 1) \max(T_s + T_o,\ T_p + O_c + T_c) + T_p + O_c + T_c$$

Similarly the equation for tuple reconstruction time $T_7$ can be rewritten as:

$$T_7 = (T_i + T_a + T_o + (C_1 S_1 P_i - 1) \max(T_s + T_o,\ T_p + O_c + T_c,\ T_r)) (A_r - 1)$$
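
The effect of the compression factor on the read time can be seen by evaluating the two forms of T5 side by side. In the sketch below a single function covers both cases, since setting the compression factor to 1 and the overhead to 0 recovers the expression without compression. The value used for the overhead O_c is a placeholder assumption (the text does not give one here), as are the function name and the sample inputs.

```python
# Sketch comparing T5 with and without compression (times in milliseconds).
T_A = 38.6     # average disk access time
T_O = 16.7     # page read/write time on disk
T_I = 2.0      # CPU time to initiate an I/O operation
T_P = 1.303    # transfer time per page
T_S = 10.1     # track-to-track seek time on disk
T_C = 0.326    # page read/write time on the disk cache

O_C_ASSUMED = 1.303  # hypothetical per-page decompression overhead


def t5_select(pages, index_factor, c1=1.0, o_c=0.0):
    """Read time for one attribute domain; c1=1, o_c=0 means no compression."""
    overlapped = max(T_S + T_O, T_P + o_c + T_C)
    return (T_I + T_A + T_O
            + (c1 * index_factor * pages - 1) * overlapped
            + T_P + o_c + T_C)


plain = t5_select(100, 0.8)
packed = t5_select(100, 0.8, c1=0.5, o_c=O_C_ASSUMED)
print(f"T5 without compression: {plain:.3f} ms")
print(f"T5 with compression:    {packed:.3f} ms")
```

As long as the overhead keeps $T_p + O_c + T_c$ below the disk term $T_s + T_o$, the decompression cost is hidden by the disk access and the read time falls roughly in proportion to the compression factor.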

The expressions for the response time of join queries were derived in a similar manner as in the case of selection queries. The join queries involve the reading
of two attribute domains needed for the join operation. The tuple reconstruction
overhead will be high since it involves two relations. The expressions for the terms
$T_1$ through $T_4$ remain the same as for selection queries. The others are shown
below:

(i) $T_5$ = The time for reading attribute values of the two relations is derived as:

$$T_5 = 2 (T_i + T_a + T_o) + (P_1 - 1) \max(T_s + T_o,\ T_p + T_c) + (I_1 P_2 - 1) \max(T_s + T_o,\ T_p + T_c) + 2 (T_p + T_c)$$

In the above equation, the first part corresponds to the time for reading attribute
pages ($P_1$) of the first relation and the second part corresponds to the time for reading attribute pages of the second relation using an index based upon the values in the
page of the first relation.
The attribute values for the join operation are sorted and hence, it takes linear
time to merge those values.
(ii) $T_6$ = The RDBE processing time is given below:

$$T_6 = (P_1 + I_1 P_2) (T_c + T_p + \lfloor P_s / (b_1 + b_2) \rfloor \cdot 2T)$$
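
The read and RDBE processing terms for the join can be sketched as follows. This is an illustrative evaluation of the expressions for T5 and T6 above; the function names and the sample page counts are assumptions for the example.

```python
# Sketch of T5 and T6 for a join query (all times in milliseconds).
T_A = 38.6         # average disk access time
T_O = 16.7         # page read/write time on disk
T_I = 2.0          # CPU time to initiate an I/O operation
T_P = 1.303        # transfer time per page
T_S = 10.1         # track-to-track seek time on disk
T_C = 0.326        # page read/write time on the disk cache
T_CYCLE = 0.0001   # RDBE cycle time (100 ns) expressed in milliseconds
PAGE_SIZE = 13030  # bytes per page
B1, B2 = 8, 3      # bytes per attribute value and per tid


def t5_join(p1, p2, index_factor):
    """Time to read the join attribute pages of both relations."""
    overlapped = max(T_S + T_O, T_P + T_C)
    return (2 * (T_I + T_A + T_O)
            + (p1 - 1) * overlapped
            + (index_factor * p2 - 1) * overlapped
            + 2 * (T_P + T_C))


def t6_join(p1, p2, index_factor):
    """RDBE time to merge the sorted attribute values of both relations."""
    items_per_page = PAGE_SIZE // (B1 + B2)
    per_page = T_C + T_P + items_per_page * 2 * T_CYCLE
    return (p1 + index_factor * p2) * per_page


print(f"T5 = {t5_join(10, 10, 0.8):.3f} ms")
print(f"T6 = {t6_join(10, 10, 0.8):.4f} ms")
```

For these inputs the disk read term is more than an order of magnitude larger than the RDBE processing term, so the join is limited by the disk rather than by the engine.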

The resultant tid pair set after the join operation will have to be sorted on the
tid of each relation during tuple reconstruction. The tuple reconstruction consists of
the following phases. Initially, the ($tid_1$, $tid_2$) pairs are sorted. Then, the tuples of
the first relation are constructed, which are sorted based on the $tid_2$ values. Finally,
the attribute values for the second relation are retrieved and appended to complete
the reconstruction of tuples.

(iii) $T_7$ = The tuple reconstruction time is derived as follows:

The time to sort the tid pairs is,

$$S_1 P_1 P_2 (\lfloor P_s / (2 b_2) \rfloor \cdot 2T)$$

The time to reconstruct the tuples of the first relation is,

$$(T_i + T_a + T_o + T_p + T_c + T_r + (S_1 P_1 P_2 - 1) \max(T_s + T_o,\ T_p + T_c,\ T_r)) A_r$$

Each constructed tuple has ($A_r$ + 1) values and these tuples are sorted with $tid_2$ as
the key. The sorting time is,

$$S_1 P_1 P_2 (\lfloor P_s / (2 b_2) \rfloor (A_r + 1) T)$$

The time to construct the attribute values from the second relation is,

$$T_i + T_a + T_o + T_p + T_c + T_r + (S_1 P_1 P_2 - 1) \max(T_s + T_o,\ T_p + T_c,\ T_r) (B_r - 1)$$

The sum of all the above four expressions gives the final expression for the tuple
reconstruction time.

(iv) $T_8$ = The time to transfer results from SDK to the Host is,

$$T_8 = (A_r + B_r - 1) S_1 P_1 P_2 (T_p + 3 T_m)$$
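
The four parts of the join reconstruction time and the result transfer time can be combined in a short sketch. The code below is one reading of the expressions above, with the tid-pair size taken as twice the tid size; the function names and the small sample inputs are assumptions chosen so that the arithmetic is easy to follow.

```python
# Sketch of T7 and T8 for a join query (all times in milliseconds).
T_M = 2.0          # CPU time per control message
T_A = 38.6         # average disk access time
T_O = 16.7         # page read/write time on disk
T_I = 2.0          # CPU time to initiate an I/O operation
T_P = 1.303        # transfer time per page
T_S = 10.1         # track-to-track seek time on disk
T_C = 0.326        # page read/write time on the disk cache
T_R = 10.0         # tuple reconstruction time per page
T_CYCLE = 0.0001   # RDBE cycle time (100 ns) expressed in milliseconds
PAGE_SIZE = 13030  # bytes per page
B2 = 3             # bytes per tid


def t7_join(p1, p2, s1, a_r, b_r):
    """Tuple reconstruction time: two sorts plus two reconstruction passes."""
    k = s1 * p1 * p2                        # pages of matching tid pairs
    pairs_per_page = PAGE_SIZE // (2 * B2)  # tid pairs per page
    overlapped = max(T_S + T_O, T_P + T_C, T_R)
    first_page = T_I + T_A + T_O + T_P + T_C + T_R

    sort_pairs = k * pairs_per_page * 2 * T_CYCLE
    first_relation = (first_page + (k - 1) * overlapped) * a_r
    sort_tuples = k * pairs_per_page * (a_r + 1) * T_CYCLE
    second_relation = first_page + (k - 1) * overlapped * (b_r - 1)
    return sort_pairs + first_relation + sort_tuples + second_relation


def t8_join(p1, p2, s1, a_r, b_r):
    """Time to transfer the joined tuples from SDK to the host."""
    return (a_r + b_r - 1) * s1 * p1 * p2 * (T_P + 3 * T_M)


print(f"T7 = {t7_join(2, 2, 0.5, 3, 3):.4f} ms")
print(f"T8 = {t8_join(2, 2, 0.5, 3, 3):.2f} ms")
```

Because the page product $P_1 P_2$ multiplies every term, both times grow quickly with relation size, which is why reducing the number of pages read has a large effect on join response time.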

The expressions remain the same for the model with compression, except for
the terms $T_5$ and $T_7$, in which the compression factor $C_1$ is introduced to account
for the reduction in the number of disk reads.
The modified $T_5$ is

$$T_5 = 2 (T_i + T_a + T_o) + (P_1 - 1) \max(T_s + T_o,\ T_p + O_c + T_c) + (I_1 P_2 - 1) \max(T_s + T_o,\ T_p + O_c + T_c) + 2 (T_p + O_c + T_c)$$

154

The modified $T_7$ is derived as follows.
The time to sort the tid pairs is
$$(s_1 p_1 p_2)\,\big(\lfloor p_s/(2 b_2) \rfloor \cdot 2T\big)$$
The time to reconstruct the tuples of the first relation is
$$\Big(T_i + T_a + T_o + T_p + T_c + T_t + (c_1 s_1 p_1 p_2 - 1)\,\max\big((T_s + T_o),\ (T_p + D_c + T_c),\ T_t\big)\Big)\,A_t$$
Each constructed tuple has $(A_t + 1)$ values, and these tuples are sorted with
$\mathit{tid}_2$ as the key. The sorting time is
$$(s_1 p_1 p_2)\,\big(\lfloor p_s/(2 b_2) \rfloor \cdot (A_t + 1)\,T\big)$$
The time to construct the attribute values from the second relation is
$$T_i + T_a + T_o + T_p + D_c + T_c + T_t + \Big\{(c_1 s_1 p_1 p_2 - 1)\,\max\big((T_s + T_o),\ (T_p + D_c + T_c),\ T_t\big)\Big\}\,(B_t - 1)$$

The sum of the four expressions above gives the modified tuple reconstruction
time for the model with compression.
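The effect of the two compression parameters on the reconstruction time can be explored with a small sketch. The function below follows the four modified expressions term by term, with `c1` as the compression factor and `Dc` as the additional time that appears next to $T_p$ and $T_c$. Setting `c1 = 1` and `Dc = 0` recovers the model without compression. All default values are placeholders for illustration, not the values used in the simulations.

```python
def t7_modified(c1=1.0, Dc=0.0, *, Ti=1.0, Ta=2.0, To=1.0, Tp=4.0, Tc=1.0,
                Ts=3.0, Tt=2.0, T=1.0, p1=10, p2=20, s1=0.5, ps=100, b2=6,
                At=3, Bt=2):
    """Tuple reconstruction time for the model with compression."""
    n = s1 * p1 * p2
    steady = max(Ts + To, Tp + Dc + Tc, Tt)
    # Sorting the tid pairs is unchanged by compression.
    sort_pairs = n * ((ps // (2 * b2)) * 2 * T)
    # First relation: as in the text, the start-up part carries no Dc term.
    rebuild_first = (Ti + Ta + To + Tp + Tc + Tt + (c1 * n - 1) * steady) * At
    # Sorting the partial tuples on tid2 is unchanged by compression.
    sort_tuples = n * ((ps // (2 * b2)) * (At + 1) * T)
    # Second relation: the start-up part includes Dc.
    append_second = (Ti + Ta + To + Tp + Dc + Tc + Tt
                     + ((c1 * n - 1) * steady) * (Bt - 1))
    return sort_pairs + rebuild_first + sort_tuples + append_second


without_compression = t7_modified()
with_compression = t7_modified(c1=0.5, Dc=1.0)
```

In this placeholder setting the reduction in the number of reads outweighs the extra per-page time, so the reconstruction time drops. Whether that holds in general depends on the relative sizes of `c1` and `Dc`.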
The simulations were performed using the expressions derived above, and the
response times were calculated for both selection and join queries. The effect of
varying the number of tuples in the relation on the response time for a selection
query is given in Figure 5.9. As can be seen from the figure, the response time
increases slowly when the number of tuples is small (of the order of 10,000 to
14,000) and increases rapidly as the number of tuples grows from 100,000 to
1,000,000. The effect of compression is also higher as the number of tuples
increases. The response times are given in seconds. The other variables were kept
constant while the number of tuples was varied. The typical values assumed were
$I_1$ = 0.8, $s_1$ = 0.7, $A_t$ = 10, and $c_1$ = 0.5. Similar effects were observed for the join
queries, as can be seen in Figure 5.10.

Figure 5.9. Selection query response time vs number of tuples.
[Two panels: response time (sec) against number of tuples, over 10,000 to 18,000 and over 100,000 to 1,000,000, each with one curve for the model without compression and one for the model with compression.]

The graphs shown in Figure 5.11 illustrate the response times for both join
and selection queries as the number of attributes per relation is varied. The
response times increase linearly as the sizes of the tuples in the relations increase
and the effect of compression increases with the number of attributes per tuple.
The typical values assumed were $I_1$ = 0.8, $s_1$ = 0.7, $N_t$ = 500,000, and $c_1$ = 0.5.
The graphs in Figures 5.12 and 5.13 illustrate the effect of the selectivity factor
on the response time for two fixed values of the index factor, 0.6 and 0.9. As the
selectivity factor increases, more tuples have to be reconstructed, which increases
the response times considerably. Since the compression factor reduces the read
time and the reconstruction time, the effect of compression is quite
pronounced in these graphs. Thus the query response times in the DELTA machine can
be considerably improved by incorporating compression hardware into the system.

Figure 5.10. Join query response time vs number of tuples.
[Two panels: response time (sec) against number of tuples, over 10,000 to 18,000 and over 100,000 to 1,000,000, each with one curve for the model without compression and one for the model with compression.]

Figure 5.11. Response time vs number of attributes/tuple.
[Two panels: join response time (sec) and select response time (sec) against number of attributes per tuple (5 to 20), each with one curve for the model without compression and one for the model with compression.]

Figure 5.12. Response time vs selectivity factor.
[Index factor = 0.6. Two panels: join response time (sec) and select response time (sec) against selectivity factor (0.2 to 0.5), each with one curve for the model without compression and one for the model with compression.]

Figure 5.13. Response time vs selectivity factor.
[Index factor = 0.9. Two panels: join response time (sec) and select response time (sec) against selectivity factor (0.2 to 0.8), each with one curve for the model without compression and one for the model with compression.]

CHAPTER 6
CONCLUDING REMARKS
This Ph.D. dissertation has presented the design and development of several
hardware algorithms for data compression and decompression, and has established,
with quantified performance measures, the benefits of integrating such hardware
into different architectural environments.
As mentioned in the introductory chapter, data compression has been implemented
in the past only through software methods, which involve considerable
time and space overheads. This dissertation presented new hardware algorithms as
possible solutions to this problem. These algorithms were designed so that
data can be compressed and decompressed on the fly in order to obtain high
compression rates. The estimated speeds of these algorithms far exceed the maximum
data flow rates in current and projected disk and communication technologies.
The compression techniques considered in this dissertation are Huffman-like
static binary encoding schemes, the multi-group method, the run-length encoding
scheme, and an enhanced version of arithmetic coding. A prototype VLSI
implementation of the hardware algorithm for compression using Huffman codes has
been presented, and estimated compression rates are provided. Future work in this
area can be directed towards building real-life compression/decompression chips
that implement the algorithms above and interfacing them to computer
systems to obtain an exact analysis of the performance improvements.
The hardware algorithms introduced in this dissertation for Huffman,
multi-group, and arithmetic coding are based on static models. It has been found
that compression schemes following adaptive models yield better compression rates
than those using static probabilities. An important direction for future study is to
design and develop hardware algorithms for compression schemes using adaptive
models. In such schemes, the probabilities of occurrence of characters vary
dynamically as each symbol is compressed. In the case of adaptive Huffman
coding, the symbols have to be sorted and new Huffman codes have to be generated
periodically. These are complex operations to perform in hardware. As VLSI
technology advances rapidly, the cost of hardware decreases, and the
packing density of chips increases, it may become possible and advantageous to
develop efficient hardware algorithms for adaptive compression schemes.
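The periodic regeneration of codes can be illustrated with a small software sketch. It assumes a deliberately simple scheme in which the encoder and the decoder both keep symbol counts and rebuild the whole Huffman code every `period` symbols. This is neither an incremental adaptive Huffman algorithm nor a hardware design, and the names `build_codes`, `encode`, `decode`, and `period` are hypothetical. The alphabet is assumed to contain at least two symbols.

```python
import heapq


def build_codes(counts):
    """Build a Huffman code (symbol -> bit string) from symbol counts."""
    # Heap entries are (weight, unique tie-breaker, partial code table).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(text, alphabet, period=8):
    counts = {s: 1 for s in alphabet}      # both sides start from equal counts
    codes = build_codes(counts)
    bits = []
    for n, ch in enumerate(text, 1):
        bits.append(codes[ch])
        counts[ch] += 1
        if n % period == 0:                # periodic code regeneration
            codes = build_codes(counts)
    return "".join(bits)


def decode(bits, alphabet, length, period=8):
    counts = {s: 1 for s in alphabet}
    inverse = {c: s for s, c in build_codes(counts).items()}
    out, pos = [], 0
    for n in range(1, length + 1):
        word = ""
        while word not in inverse:         # codes are prefix-free
            word += bits[pos]
            pos += 1
        ch = inverse[word]
        out.append(ch)
        counts[ch] += 1
        if n % period == 0:                # mirror the encoder's rebuilds
            inverse = {c: s for s, c in build_codes(counts).items()}
    return "".join(out)


message = "abracadabra abracadabra"
alphabet = sorted(set(message))
encoded = encode(message, alphabet)
decoded = decode(encoded, alphabet, len(message))
```

Each rebuild involves ordering the symbols by weight and regenerating every code word, which is the work the paragraph above identifies as difficult to perform in hardware.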
The LZ and LZW algorithms described in Chapter 2 have been found to be
very effective compression techniques. These are dictionary-based methods where
the dictionary is constructed dynamically during the process of compression and
decompression. Future study may be directed towards developing hardware algorithms for these methods.
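The dynamic construction of the dictionary on both sides can be illustrated with a textbook software sketch of LZW. It assumes 8-bit characters and an unbounded dictionary, and it is not a hardware algorithm. The function names are chosen for the example.

```python
def lzw_compress(text):
    """Return a list of integer codes for text (characters below 256)."""
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code   # grow the dictionary
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes):
    """Rebuild the text, reconstructing the same dictionary on the fly."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    pieces = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # The code refers to the entry that is about to be created.
            entry = previous + previous[0]
        else:
            raise ValueError("invalid LZW code")
        pieces.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(pieces)


sample = "TOBEORNOTTOBEORTOBEORNOT"
sample_codes = lzw_compress(sample)
restored = lzw_decompress(sample_codes)
```

The decompressor never receives the dictionary. It rebuilds each entry one step behind the compressor, so a hardware version would need fast dictionary lookup and update on both sides.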

Although this dissertation was concerned with data compression techniques
that are applicable primarily to text files, the continuing growth of
digital communications technology has brought a rapid rise in the demand for
image transmission and storage. Efficient transmission and storage of images can
be achieved by developing image compression techniques that can be implemented
using advanced VLSI technology. Several image compression techniques exist
in the literature, and some of them have been implemented using hardware-assisted
features. Future research must be directed towards developing VLSI algorithms for
image compression that yield very high compression rates and high fidelity
in the reconstructed images.

BIBLIOGRAPHY

Abramson, N. Information Theory and Coding. McGraw-Hill, New York, 1963.
Alsberg, P. "Space and Time Savings Through Large Database Compression and
Dynamic Restructuring." Proc. IEEE, Vol. 63, No. 8, Aug. 1975, pp. 11141122.
Al-Suwaiyel, M. and Horowitz, E. "Algorithms for Trie Compaction." ACM Trans.
on Database Systems, Vol. 9, No. 2, June 1984, pp. 243-263.
Bassiouni, M. and Hazboun, K. "Utilization of Character Reference Locality for
Efficient Storage of Databases." Proc. 2nd Int. Workshop on Statistical
Database Management, Los Altos, CA, Sept. 1983, pp. 338-344.
Bassiouni, M. "Data Compression in Scientific and Statistical Data Bases." IEEE
Transactions on Software Engineering. Vol. SE-11, No. 10, Oct. 1985, pp.
1047-1058.
Bassiouni, M. and Mukherjee, A. "Supercomputer Algorithms for Data Transmission and Encoding." Proc. 2nd Int. Conj. on Supercomputing, 1988, pp.
371-379.
Bates, D., Boral, H., and DeWitt, D. "A Framework for Research in Database
Management for Statistical Analysis." Proc. ACM SJGMOD, 1982, pp. 6978.
Batory, D. "Index Coding: A Compression Technique for Large Statistical Databases." Proc. 2nd Int. Workshop on Stat. Database Management, Sept.
1983, pp. 306-314.
Bell, D. and Deen, S. "Key Space Compression and Hashing in PRECI." The Computer Journal, Vol. 25, No. 4, 1982, pp. 486-492.

164

165

Bishop, Y. and Freedman, S. "Classification of Metadata." Proc. 2nd Int. Workshop
on Stat. Database Management. 1983, pp. 230-234.
Burnett, R. and Thomas, J. "Data Management Support for Statistical Data Editing
and Subset Selection." Proc. Int. Workshop on Statistical Database
Management, Dec. 1981, pp. 88-102.
Chan, P. and Shoshani, A. "SUBJECT: A Directory Driven System for Organizing
and Accessing Large Statistical Databases." Proc. VWB, 1981, pp. 553563.
Cleary, J. and Witten, I. "Data Compression Using Adaptive Coding and Partial
String Matching." IEEE Trans. on Communications, Vol. COM-32, No. 4,
April 1984, 396-402.
Cormack, G. V. "Data Compression on a Database System." CACM, Vol. 28, No.
12, 1985, pp. 1336-1342.
Cysper, R. Communications Architecture for Distributed Systems. Reading MA,
Addison-Wesley, 1978.
Dance, D. and Pooch, U. "An Adaptive On-line Data Compression System." The
Computer Journal, Vol. 19, No. 3, 1976, pp. 216-223.
DeWitt, J. D. and Hawthorn, P. B. "A Performance Evaluation of Database
Machine Architecture." Computer Sciences Technical Report #437, University of Wisconsin-Madison, June 1981.
Eggers, S. and Shoshani, A. "Efficient Access of Compressed Data." Proc. 6th
VWB, 1980, pp. 205-211.
Eggers, S., Olken, F., and Shoshani, A. "A Compression Technique for Large Statistical Databases." Proc. VLDB, 1981, pp. 424-434.
Fraenkel, A. and Mor, M. "Combinatorial Compression and Partitioning of Large
Dictionaries: Theory and Experiences." Proc. 6th Int. ACM SIGIR Conj. on
Res. and Development in Info. Retrieval, 1983, pp. 205-219.

166

Gallager, R. "Variations on a Theme by Huffman." IEEE Trans. on Info. Theory,
Vol. IT-24, No. 6, 1978, pp. 668-674.
Golomb, S. "Run-length Encodings." IEEE Trans. on Information Theory , Vol. IT12, July 1966, pp. 399-401.
Hahn, B. "A New Technique for Compression and Storage of Data." CACM, Vol.
17, No. 8, 1974, pp. 434-436.
Hammer, M. and Niamir, B. "A Heuristic Approach to Attribute Partitioning."
ACM SIGMOD, 1979, pp. 93-101.
Hamming, R. Coding and Information Theory. Englewood Cliffs, N.J., PrenticeHall, 1980.
Hankamer, M. "A Modified Huffman Procedure with Reduced Memory Requirement." IEEE Trans. on Commun., Vol. COM-27, No. 6, 1979, pp. 930-932.
Hawthorn, P. "Microprocessor Assisted Tuple Access Decompression and Assembly for Statistical Database Systems." Proc. VI.DB , 1982, pp. 223-233.
Hazboun, K. and Bassiouni, M. "A Multi-group Technique for Data Compression."
Proc. ACM SIGMOD Int. Conj. on Management of Data, 1982, pp. 284292.
Hazboun, K. and Raymond, J. "A Multi-tree Automation for Efficient Data
Transmission." Proc. 2nd Int. Workshop on Stat. Database Management,
1983, pp. 54-63.
Huffman, D. "A Method for the Construction of Minimum Redundancy Codes. "
Proc. IRE, Vol. 40, 1952, pp. 1098-1101.
Jones, C. "An Efficient Coding System for Long Source Sequences." IEEE Trans.
on Info . Theory, Vol. IT-27, No. 3, 1981, pp. 280-291.
Kaunitz, J. and Ekert, L. "Audit Trail Compaction for Database Recovery." CACM,

167

Vol. 27, No. 7, 1984, pp. 678-683.
Knuth, D. The Art of Computer Programming, Vol. 3- Sorting and Searching.
Addison-Wesley, Reading MA, 1973.
Langdon, G. "An Introduction to Arithmetic Coding." IBM J. Res. and Develop.,
Vol 28., No. 2, March 1984, pp. 135-149.
Langdon, G. and Rissanen, J. "Compression of Black-White Images with Arithmetic Coding." IEEE Transactions on Communications, Vol. COM-29, No.
6, 1981, pp. 858-867.
Lea, R. "Text Compression with an Associative Parallel Processors." The Computer
Journal, Vol. 21, No. 1, 1978, pp. 45-56.
Lefons, E., Silvestri, A., and Tangorra, F. "An Analytic Approach to Statistical
Databases." Proc. 9th VWB, Florence, Italy, Oct. 1983, pp. 260-274.
Lubeck, J. "A Benchmark Comparison of Three Supercomputers: Fujitsu VP-200,
Hitachi S810/20, Cray X-MP/2" Computer, Vol. 18, No. 12, Dec. 1985, pp.
10-24.
Lynch, T. J. Data Compression Techniques and Applications. Lifetime Leaming
Publications, Belmont, California, 1985.
Lynch, C. and Brownrigg, E. "Application of Data Compression Techniques to a
Large Bibliographic Database." Proc. VWR, 1981, pp. 435-447.
McCarthy, J. "Metadata Management for Large Statistical Databases." Proc. VWB,
1982, pp. 234-243.
McIntyre, D. and Pechura, M. " Data Compression Using Static Huffman CodeDecode Tables." CACM, Vol. 28, No. 6, June 1985, pp. 612-616.
Mukherjee, A. Introduction to nMOS and CMOS VLSI Systems Design. PrenticeHall, Englewood Cliffs: N. J., 1985.

168

Mukherjee, A., Ranganathan, N. and Bassiouni, M. "High-speed VLSI Encoding
Chips for Supercomputers" to appear in Proc. 3rd Int. Conf. on Supercomputing, May 1988.
Ok, B. "DIBACOS: A Dictionary based Compression System." M.S. Thesis,
University of Central Florida, 1984.
Pasco, R. "Source Source Coding Algorithms for Fast Data Compression." Ph.D.
Dissertation, Dept. of Electrical Engineering, Stanford University, Stanford,
California, 197 6.
Pechura, M. "File Archival Techniques Using Data Compression." CACM, Vol. 25,
No.9, 1982, pp. 605-609.
Reghbati, H. "An Overview of Data Compression Techniques." Computer, Vol. 14,
No. 4, 1981, pp. 71-75.
Riganati, J. and Schneck, P. "Supercomputing." Computer, Vo. 17, No. 10, 1984,
pp. 97-113.
Rissanen, J. J. "Generalized Kraft inequality and arithmetic coding." IBM J. Res.
Dev. 20 (May 1976), pp. 198-203.
Rissanen, J. J. "Arithmetic Codings as Number Representations." Acta Polytech.
Scand. Math . 31 (Dec. 1979), pp. 44-51.
Rissanen, J. J. and Langdon, G. G. "Arithmetic Coding" IBM J. Res. Dev., 23, 2
(Mar 1979), pp. 149-162.
Rissanen, J. and Langdon, G. "Universal Modeling and Coding." IEEE Trans. on
Info. Theory, Vol. IT-27, No. 1, 1981, pp. 12-23.
Ritchie, D. and Thompson, K. "The UNIX Time-Sharing System." CACM, Vol. 17,
No. 7, 1974, pp. 365-375.
Rodeh, M., Pratt, V., and Even, S. "Linear Algorithm for Data Compression via

169

String Matching." JACM, Vol. 28, No. 1, 1981, pp. 16-24.
Rubin, F. "Experiments in Text File Compression." CACM, Vol. 19, No. 11, 1976,
pp. 617-623.
Shankar, A. and Lam, S. "An HDLC Protocol Specification and Its Verification
Using Image Protocols." ACM Trans. on Computer Systems, Vol. 1, No. 4,
Nov. 1983, pp. 331-368.
Shannon, C. E. "A Mathematical Theory of Communication." Bell System Technical Journal, Vol. 27, 1948, pp. 379-423, 623-656.
Shibyama, S., Kakuta, T., Miyazaki, N., Yokota, H., and Murakami, K. "A Relational Database Machine with Large Semiconductor Disk and Hardware
Relational Algebra Processor." New Generation Computing, Vol. 2, 1984,
pp. 131-155.
Shoshani, A. "Statistical Databases: Characteristics, Problems, and Some Solutions." Proc. VWB, Mexico City, Mexico, Sept. 1982, pp. 208-222.
Smith, M. E. G. and Storer, J. A. "Parallel Algorithms for Data Compression."
JACM, Vol. 32, No. 2, April 1985, pp. 344-373.
Srinidhi, H. N. and Sloan, J. C. "On a Classification of Relational Database
Machines." Proc. 26th Southeast Regional ACM Conference, Mobile, Alabama, Apr. 1988, pp. 105-107.
Storer, J. and Szymanski, T. "Data Compression via Textual Substitution." JACM,
Vol. 29, No. 4, 1982, pp. 928-951.
Svensson, P. "On Search Performance for Conjunctive Queries in Compressed,
Fully Transposed Order Files." Proc. 5th VWB, 1979, pp. 155-163.
Tarjan, R. and Yao, A. "Storing a Sparse Database." CACM, Vol. 22, No. 11, Nov.
1979, pp. 606-611.

170

Turner, M., Hammond, R., and Cotton, F. "A DBMS for Large Statistical Databases." Proc. 5th VWB, 1979, pp. 319-327.
Wagner, R. "Common Phrases and Minimum-Space Text Storage." CACM, Vol.
16, No. 3, 1973, pp. 148-152.
Welch, T. "A Technique for High-Performance Data Compression." Computer,
Vol. 17, No. 6, 1984, pp. 8-19.
Wells, M. "File Compression Using Variable Length Encodings." The Computer
Journal, Vol. 15, No. 4, 1972, pp. 308-313.
Wilhelm, C. N. "A General Model for the Performance of Disk Systems." JACM,
Vol. 24, No. 1, January 1977, pp. 14-31.
Witten, I. H., Neal, R. M., and Cleary, J. G. "Arithmetic Coding for Data
Compression." CACM, Vol. 30, No. 6, June 1987, pp. 520-540.
Ziv, J. and Lempel, A. "A Universal Algorithm for Sequential Data Compression."
IEEE Trans. on Info. Theory, Vol. IT-23, No. 5, 1977, pp. 337-343.
Ziv, J. and Lempel, A. "Compression of Individual Sequences via Variable Rate
Coding." IEEE Trans. on Info. Theory, Vol. IT-24, No. 5, 1978, pp. 530536.

