Investigations into hardware-based parallel lossless data compression systems by Mark J. Milward (7201970)
Investigations into 
Hardware-based Parallel 
Lossless Data Compression Systems 
By 
Mark John Milward 
A Doctoral Thesis 
Submitted in partial fulfilment of the requirements for the award of
Doctor of Philosophy of Loughborough University
October 2004
© by Mark John Milward 2004 
Abstract 
The current increases in silicon logic densities have made feasible the implementation 
of multiprocessor systems onto a single chip able to meet the intensive data 
processing demands of highly concurrent systems. This thesis describes research into 
a hardware implementation of a high performance parallel multi compressor chip. In 
order to fully explore the design space, several models are created at various levels of 
abstraction to capture the full characteristics of the architecture. A detailed 
investigation into the performances of alternative input and output routing strategies 
for realistic data sets demonstrates that the design of parallel compression devices 
involves important trade-offs that affect compression performance, latency, and 
throughput. The most promising approach is written in a hardware description 
language and synthesised for FPGA hardware as proof of concept. It is shown that a 
multi compressor architecture can be a scalable solution with the ability to operate at 
throughputs to cope with the demands of modern high-bandwidth applications whilst 
retaining good compression performance. 
Acknowledgements 
I would like to acknowledge all my teachers, mentors, lecturers, past and present 
supervisors, who have taught me many things but the most important lesson learnt is 
that 
"Every day you may make progress. Every step may not be fruitful. 
Yet there will stretch out before you an ever-lengthening, ever-
ascending, ever-improving path. You know you will never get to the 
end of the journey. But this, far from discouraging, only adds to the joy 
and glory of the climb." 
I would especially like to thank my colleagues from the Electronic System Design 
Group who gave me their friendship and joined me on my journey of learning for a 
time; it has made it ever more enjoyable. They mean the world to me. 
Also, I would like to thank my family who have never held me back and provided 
support in whatever I decide to do. 
Finally, my greatest appreciation goes to my long-term girlfriend Claire, who always 
believes in me. 
Statement of originality 
This is to certify that I am responsible for the work submitted in this thesis, that the 
original work is my own except as specified in acknowledgements or in footnotes, and 
that neither the thesis nor the original work contained therein has been submitted to this 
or any other institutions for a higher degree. 
October 2004 
Table of Contents 
Chapter 1 
Introduction 
1.1 Introduction 
1.2 Research Aims and Objectives 
1.3 Terminology 
1.3.1 Data Compression 
1.3.2 Parallel Systems 
1.3.3 Queuing Theory 
1.4 Structure of Thesis 
Chapter 2 
Review of Parallel Data Compression Systems 
2.1 Chapter Objectives 
2.2 Parallel Compression 
2.2.1 Parallel Huffman Coding 
2.2.2 Parallel Arithmetic Coding 
2.2.3 Parallel Entropy Coding 
2.2.4 Parallel Lempel-Ziv 
2.2.5 Parallel Markov Modelling with Approximate Arithmetic Coding 
2.2.6 Move-to-Front and Transpose Parallel Architectures 
2.2.7 X-MatchPro 
2.2.8 Comparison of Research Work 
2.3 Summary of Parallel Architectures 
Chapter 3 
Identification of Research Area 
3.1 Chapter Objectives 
3.2 Area of Research 
3.3 X-MatchProRli 
3.3.1 Algorithm 
3.3.2 X-MatchProRli Architecture and Performance 
3.4 Summary 
Chapter 4 
Experimental Framework 
4.1 Chapter Objectives 
4.2 Datasets 
4.3 Measurement Definitions 
4.4 System Modelling 
4.4.1 System Design Languages 
4.4.1.1 Handel-C 
4.4.1.2 SystemC 
4.4.1.3 SystemVerilog 
4.4.2 System Design Language Summary 
4.5 Summary 
Chapter 5 
Practical Investigation of Input and Output Routing Strategies 
5.1 Chapter Objectives 
5.2 MIMD Architecture Routing Strategies 
5.2.1 Input Routing 
5.2.1.1 Interleaved Input 
5.2.1.2 Input Blocked - Theoretical 
5.2.1.3 Input Blocked Experimentation 
5.2.2 Input Routing Conclusions 
5.2.3 Output Routing 
5.2.3.1 Single Compressed Block 
5.2.3.2 Multiple Compressed Block 
5.2.3.3 Interleaved Compressed Block 
5.2.4 Output Routing Conclusions 
5.3 Summary 
Chapter 6 
Implementation 
6.1 Chapter Objectives 
6.2 Hardware Design 
6.3 Hardware Implementation 
6.4 Summary 
Chapter 7 
Conclusions 
7.1 Chapter Objectives 
7.2 Summary of Objectives and Design Flow 
7.3 Contribution 
7.4 Measurements of Success 
7.5 Limitations of Research 
7.6 Future Work 
7.7 Summary 
Table of Figures 
Chapter 1 
Introduction 
1.1 Areas where compression can be applied into a computer system 
1.2 Current and proposed computer buses, networks and memory throughputs 
1.3 Data compression taxonomy 
1.4 An example of statistical model and coder 
1.5 Shared memory for multiprocessor system 
1.6 Distributed memory in a multiprocessor system 
1.7 Systolic SIMD array 
1.8 Mesh-connected SIMD array 
1.9 Tree-based SIMD array 
1.10 Producer and consumer model 
Chapter 2 
Review of Parallel Data Compression Systems 
2.1 Data allocation between processors 
2.2 Lempel-Ziv systolic architecture 
2.3 CAM-based Lempel-Ziv 
Chapter 3 
Identification of Research Area 
3.1 Block diagram of multiple compressor system 
3.2 Block diagram of the X-MatchProRli architecture 
3.3 Comparison of compression ratios of typical compression algorithms using the Canterbury corpus 
3.4 Comparison of compression ratios of typical compression algorithms using the memory corpus 
Chapter 4 
Experimental Framework 
4.1 Celoxica design flow for Handel-C 
4.2 SystemC abstraction and design flow 
Chapter 5 
Practical Investigation of Input and Output Routing Strategies 
5.1 Illustration of blocked and interleaved input routing 
5.2 Mean compression ratio for a range of datasets using the interleaved method with an internal 16 word dictionary 
5.3 Mean compression ratio for a range of datasets using the interleaved method with an internal 32 word dictionary 
5.4 Mean compression ratio for a range of datasets using the interleaved method with an internal 64 word dictionary 
5.5 Graph of ideal input rate against block length to remain in steady state 
5.6 Graph of input latency against block length for various number of processors 
5.7 Effect of the block size on the compression ratio using a 16-word dictionary 
5.8 Block diagram of a two processor system to monitor compressor utilisation 
5.9 Compressor 1 utilisation over time with a block length of 256 bits 
5.10 Compressor 2 utilisation over time with a block length of 256 bits 
5.11 Compressor 1 utilisation over time with a block length of 8192 bits 
5.12 Compressor 2 utilisation over time with a block length of 8192 bits 
5.13 X-MatchProRli compressors variable output block length 
5.14 Single compressed block output 
5.15 Single block output routing algorithm - compression 
5.16 Single block output routing algorithm - decompression 
5.17 Multiple compressed block output 
5.18 Multiple block routing algorithm - compression 
5.19 Multiple block routing algorithm - decompression 
5.20 Percentage increase in output block size due to the addition of a 64 bit tag 
5.21 Impact single tags have on the arithmetic mean compression ratio 
5.22 Interleaved compressed block output 
5.23 Interleaved routing algorithm - compression 
5.24 Interleaved routing algorithm - decompression 
5.25 Effect block length and dataset have on compression ratio standard deviation 
5.26 Final difference between two compressors output FIFOs 
Chapter 6 
Implementation 
6.1 Block architecture of the two X-MatchProRli compressor/decompressor system 
6.2 Multiprocessor block hierarchy 
6.3 Write input control for a two compressor system 
6.4 Waveform for the compression operation for a single X-MatchProRli engine 
6.5 Waveform from the generic multiplexor for the two X-MatchProRli engine system in compression mode 
Table of Tables 
Chapter 2 
Review of Parallel Data Compression Systems 
2.1 Summary of parallel lossless data compression 
Chapter 4 
Experimental Framework 
4.1 Memory dataset 
4.2 Application disc data set 
4.3 Executable disc data set 
4.4 General disc data set 
4.5 User disc data set 
4.6 Canterbury data set 
Chapter 5 
Practical Investigation of Input and Output Routing Strategies 
5.1 Percentage change in mean compression ratio 
5.2 Example status log of system components 
5.3 Comparison of the performances of input data schemes relative to a single compressor system 
5.4 Comparison of the performances of output data schemes relative to a single compressor system 
Chapter 6 
Implementation 
6.1 Architecture complexity 
6.2 Synthesis results for a range of FPGAs 
Chapter 1 
Introduction 
1.1 Introduction 
Information has become one of the most important commodities of the 21st century 
and there appear to be insatiable demands for ever-greater bandwidth in 
communication networks and computer buses and for ever-greater storage capacity in 
computer systems. For example, in communication networks, standards are under 
development to move from 1 Gbit/s Ethernet to 10 Gbit/s fast Ethernet and, in 
computer buses, the latest successor to the PCI bus is the PCI-X standard capable of 
delivering a bandwidth of 4.3 Gbit/s. More efficient use can be made of available 
bandwidth or storage if lossless compression is performed on the data involved. 
However, data compression will probably only be adopted if it can meet the 
bandwidth requirements of modern systems, otherwise the compression itself would 
become the bottleneck in these systems [Dorward00]. 
Lossless compression removes redundant bits while they are being transmitted or 
before they are stored in memory, and lossless decompression reintroduces these 
redundant bits to recover fully the original data. A familiar example is the 
International Telecommunication Union V.42bis standard [V42.bis] that is widely 
adopted in modems to enhance data throughput. 
Software implementations of lossless data compression algorithms have been 
available for a number of years, and range from the widely known dictionary-based 
Lempel-Ziv [Ziv77, Ziv78] methods and their variants, to the more computationally 
intensive statistical-based Markov modelling [Cormack87] and Arithmetic Coding 
[Rissanen79] approaches. Data compression is not currently utilised to its full 
potential in current computer systems and networks, probably due to the impact its 
long calculation times have on overall system performance. Several researchers and 
companies have implemented lossless data compression solutions [Hifn, AHA] in 
dedicated hardware rather than in software; this approach is specifically chosen to 
reduce calculation time. However, even these approaches do not meet the throughput 
requirements of modern systems and are generally not scalable to meet future 
demands [Nunez01a]. 
To overcome some of the drawbacks of existing methods and implementations, 
there have been several attempts to introduce aspects of parallelism into lossless data 
compression, although these have had limited success. One possible approach is to 
construct or modify an existing algorithm with the aim of exploiting inherent 
parallelism. However, existing compression algorithms are not all inherently parallel, 
and adapting them to parallel architectures would need significant simplifications that 
would adversely affect compression performance. Another approach would be to 
share the compression between a number of identical algorithms running 
concurrently. In this case, there would be no performance gain if the algorithm runs 
on a single CPU, but with the recent advances in chip logic densities, it would be 
possible to integrate a number of compression/decompression engines into a single 
chip. 
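The second approach can be sketched in software as follows. This is only an illustration under stated assumptions: Python's `zlib` stands in for a hardware compression engine, a thread pool stands in for the engines on a chip, and the names `compress_blocks`, `decompress_blocks`, `block_size` and `workers` are hypothetical. Because each block is compressed with no knowledge of the others, the sketch also hints at the trade-off between parallelism and compression performance.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(data: bytes, block_size: int, workers: int = 4) -> list[bytes]:
    """Split data into fixed-size blocks and compress each one independently."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each block gets a fresh compressor, so no history is shared between blocks.
        return list(pool.map(zlib.compress, blocks))


def decompress_blocks(compressed: list[bytes]) -> bytes:
    """Decompress the blocks in their original order and concatenate them."""
    return b"".join(zlib.decompress(block) for block in compressed)


sample = b"the quick brown fox jumps over the lazy dog. " * 200
whole = len(zlib.compress(sample))
split = sum(len(block) for block in compress_blocks(sample, block_size=512))
print(f"single stream: {whole} bytes, 512-byte blocks: {split} bytes")
```

On this highly repetitive sample the independently compressed blocks add up to more bytes than the single stream, since every block has to rediscover the same redundancy.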
Lossless compression can be applied at various points in a computer system. 
Depending on the location of the compression there are differing performance 
priorities for the compressor/decompressor; for example, in memory compression 
more emphasis is placed on low latency than on extracting every bit of 
redundancy from the data. Figure 1.1 shows the architecture of a typical modern 
computer and the areas therein where commercial applications and academic researchers 
have focussed on implementing lossless compression. Lekatsas et al. [Lekatsas01] 
attempted to improve performance by locating a decompressor between the cache and 
CPU, in spite of the higher bandwidth constraints that result when the 
compressor/decompressor is implemented close to the CPU. The benefits of siting the 
compression here are that it effectively increases the cache size and reduces the 
memory requirement, and this benefit is particularly apparent for memory-limited 
embedded systems. Lee et al. [Lee99] placed the compressor/decompressor between 
CPU and main memory in an attempt to increase the off-chip bandwidth in this 
databus. IBM has a long history of research into lossless compression; one of their 
latest developments is the integration of high-speed data compression in main 
memory. The main application is in the improvement of the price/performance of 
servers, as in these systems the memory is typically one of the most expensive 
components. Other examples of compression implemented in computer systems are 
the Stacker, DriveSpace and DoubleSpace programs [Stacker, DriveSpace, 
DoubleSpace], all targeted at compressing hard disk data. They work by trading the CPU 
time required to run the compression/decompression algorithm for an 
increase in the effective hard disk capacity. However, this is only beneficial if the hard 
disk is used to its maximum storage capability, as it can slow the system down if the 
CPU is already highly utilised. The payoff from compression in networks [Expand03, 
V42.bis] can be taken in two ways. With the reduction in data that need be transmitted, 
the benefit can be taken either in the form of a lower network bandwidth requirement 
(which generally reduces cost) or by sending more data using the same bandwidth. 
This is particularly true for slow modems where the link between the main telephone 
exchange and computer represents a bandwidth bottleneck. Wireless networks 
[Bongjin94] are a currently expanding market and unlike fixed cable networks their 
bandwidth is very limited. Implementing compression in these wireless systems may 
not only gain on bandwidth but also save on power. 
Figure 1.1. Areas where compression can be applied in a computer system 
Figure 1.2 illustrates the bandwidth performance for a number of current and 
proposed computer buses, networks and memory. 
[Figure 1.2 chart: throughput axis with ticks from 1 to 100000. Entries legible in the 
source: [A] Fast Ethernet, Gigabit Ethernet, OC-12, OC-48, USB HighSpeed (480 Mbit/s), 
Fast Gigabit Ethernet (10 Gbit/s), OC-48 with WDM (40 Gbit/s); [B] PCI 32 bits/33 MHz, 
Ultra320 SCSI (2.5 Gbit/s); [C] DRDRAM (12.8 Gbit/s), DDR RAM (25.6 Gbit/s).] 
Figure 1.2. Current and proposed computer buses, networks and memory 
throughputs. Entries labelled [A] are network standards, [B] are the bandwidths 
of the computer buses and [C] are the typical bandwidths 
at which high performance memory operates. 
1.2. Research Aims and Objectives 
The aim of the research is to improve data compression throughput to levels 
demanded by present and future systems. The issue of scalability also needs to be 
addressed to permit compression at future bandwidths without having to design 
completely new architectures each time throughput demands increase. However, in 
data compression there are often conflicting tradeoffs to be made between design 
complexity (required computational resources), the degree of compression, data 
throughput and latency. 
The research attempts to discover the effect parallel data compression architectures 
have on the compressor/decompressor performance and how design decisions affect 
the various tradeoffs in performance. To achieve these goals, several objectives are 
apparent. 
• Firstly, a literature review of current parallelism in lossless data compression 
is performed to highlight the current state of the art in parallel lossless 
compression. 
• A suitable architecture, with the potential to increase compression throughput 
and to provide scalability, is identified. Modelling 
and analysis of the architecture is carried out in order to understand how 
design decisions determine the overall compression performance. 
• The most promising findings are synthesised for hardware to demonstrate 
proof of concept and to gain a realistic figure of design complexity. The 
generated IP core needs to be suitable for integration with modern digital 
design practices. 
1.3. Terminology 
For readers not familiar with all the terminology and notation used in this 
thesis, the next sections introduce the basic concepts and further reading for data 
compression, parallel architectures with the relevant taxonomies and the basics of 
queuing theory. The data compression section describes the taxonomy of lossless data 
compression along with descriptions of the more popular data compression 
algorithms. The parallel architecture section covers the terminology used in the 
classification of parallel architectures. The equations for simple queuing theory are 
described in the queuing theory section; these are used later for the analyses of 
memory requirements and latency of the compressor system. 
1.3.1. Data Compression 
Data compression works by converting data into a smaller representation, from which 
the original signal or some approximation can be recovered at a later time. For data 
compression algorithms to work they need to identify and remove redundancy from 
the data. To achieve this reduction, several techniques and algorithms have been 
developed to identify and extract the redundancy present in the data. Over the years, a 
clear taxonomy of data compression has been developed. The next few paragraphs 
cover the different aspects of this taxonomy. 
Figure 1.3 illustrates a block diagram of the data compression taxonomy, discussed 
below. 
Figure 1.3. Data compression taxonomy 
First of all, there are two main classifications of data compression; these are termed 
lossy and lossless. When data are compressed and then decompressed, if the 
decompressed data are identical to the original data before compression then this is 
termed lossless data compression. When data is compressed and then decompressed, 
if the decompressed data is some approximation of the original data it is deemed to be 
lossy data compression. Lossy compression typically achieves greater compression 
than lossless compression but this is at the expense of some of the original 
information being permanently lost and being rendered unrecoverable. This method is 
unacceptable for many applications, for example, in computer data or medical 
imaging, where any misinterpretation of the data can have critical consequences. The 
rest of the thesis focuses on compressing computer data, so, from this point onwards, 
all discussion of compression refers to lossless data only. 
Lossless compression can be further divided into two types, namely statistical and 
dictionary based, and this distinction is clearly identified by Bell, Cleary and Witten 
[Bell90]. Methods that replace blocks of data with references to earlier data are 
termed dictionary based. Methods that use the probability, or the frequency, of 
characters or strings occurring in the data are termed statistical. Statistical techniques 
are generally more computationally intensive than dictionary techniques, and so they 
result in lower compression speeds, but have the advantage that they achieve 
superior compression. 
Compression can be further split up into two sub-classifications, termed modelling 
(model) and coding (coder) [Bell90]. The purpose of modelling is to identify where 
the redundancy is located in the data, and, based on this information, the coder can 
then perform compression. This classification is based on the fact that the modelling 
can be independent from the coding, meaning that once the data is modelled, different 
coding schemes can be used to compress the data. As a consequence, if the modelling 
of the data is not accurate at identifying the redundancy, then no matter how efficient 
the coder, good compression of the data will not be achieved. Figure 1.4 illustrates a 
block diagram of this concept for a statistical compression system. 
Figure 1.4. An example of statistical model and coder 
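The separation of model and coder can be illustrated with a minimal sketch. It assumes a simple order-0 frequency model and replaces a real coder with the ideal cost of -log2(p) bits per symbol; the function names `build_model` and `ideal_code_length` are hypothetical and are not taken from the thesis. The same coder cost function is applied to two different models of the same message.

```python
import math
from collections import Counter


def build_model(data: bytes) -> dict[int, float]:
    """Model: estimate a probability for each byte value from its frequency."""
    counts = Counter(data)
    total = len(data)
    return {symbol: count / total for symbol, count in counts.items()}


def ideal_code_length(data: bytes, model: dict[int, float]) -> float:
    """Coder bound: an ideal coder spends -log2(p) bits on a symbol of probability p."""
    return sum(-math.log2(model[symbol]) for symbol in data)


message = b"aaaaaaabbbc"
matched_model = build_model(message)
# A deliberately poor model that treats all three symbols as equally likely.
flat_model = {symbol: 1 / 3 for symbol in set(message)}

print(f"matched model: {ideal_code_length(message, matched_model):.1f} bits")
print(f"flat model:    {ideal_code_length(message, flat_model):.1f} bits")
```

With the coder held fixed, the poorer model costs more bits, which is the point made above: an efficient coder cannot make up for a model that fails to identify the redundancy.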
Modelling can be static, semi-static or dynamic (also known as adaptive). Static 
modelling is where the mapping from the set of symbols to the set of codewords is 
predetermined before transmission begins, so that a given message will be represented 
by the same codeword every time it appears in the message data. Semi-static 
modelling is where a preliminary pass over the data happens before the data is 
compressed to discover the characteristics of that particular data. With this information 
a suitable model of the data can be created or an appropriately constructed model can 
be selected to compress the data. The second pass compresses the data using the 
model from the preliminary pass. A model is dynamic if the mapping from the set of 
symbols to the set of codewords adjusts over time. The assignment of codewords to 
symbols is based on the values of the relative frequencies of occurrence at each given 
point in time. This means a symbol may be represented by a short codeword early in 
the transmission because it occurs frequently at the beginning of the data, even though 
its probability of occurrence over the total data is low. Later, when the more probable 
messages begin to occur with higher frequency, the short codeword will be mapped 
instead to one of the higher probability messages. 
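The behaviour of a dynamic model can be sketched as follows. This is an assumption-laden illustration, not the scheme used later in the thesis: it keeps a running count per symbol, starts every count at 1, and reports the ideal cost of each symbol at the moment it is coded. The name `adaptive_costs` and the two-symbol alphabet are hypothetical.

```python
import math


def adaptive_costs(data: bytes, alphabet: bytes) -> list[float]:
    """Return the ideal cost in bits of each symbol under an adaptive frequency model."""
    # Start every symbol with a count of 1 so that nothing has zero probability.
    counts = {symbol: 1 for symbol in alphabet}
    total = len(alphabet)
    costs = []
    for symbol in data:
        costs.append(-math.log2(counts[symbol] / total))
        # Update after coding, so that a decoder can maintain an identical model.
        counts[symbol] += 1
        total += 1
    return costs


# 'a' dominates the start of the data, 'b' dominates the data as a whole.
data = b"a" * 8 + b"b" * 24
costs = adaptive_costs(data, alphabet=b"ab")
print(f"first 'a': {costs[0]:.2f} bits, eighth 'a': {costs[7]:.2f} bits")
print(f"first 'b': {costs[8]:.2f} bits, last 'b':  {costs[-1]:.2f} bits")
```

The symbol that is frequent early on becomes cheap quickly, and the cost of the other symbol falls once it starts to dominate, mirroring the remapping of short codewords described above.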
There are numerous compression algorithms with their variants that have been 
developed. Listed in the next few sections are descriptions of some of the better-
known algorithms and where they fall in the data compression taxonomy. 
Statistical Modelling 
The best compression figures are regularly reported from software statistical models. 
Schemes such as DMC [Cormack87] and PPM [PPM] use methods based on 
Markov modelling, where predictions are made from the symbols that precede the 
current symbol. The order of the Markov model is based on the number of symbols 
needed to make the prediction. As the order of the Markov model increases the 
complexity tends to increase, resulting in an increase in compression time. The choice 
of coding for statistical modelling is usually Arithmetic Coding [Howard94], as the 
codeword generated is able to represent fractions of a bit, which is useful for 
probability distributions that are highly unbalanced. 
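As a rough illustration of order-k Markov modelling, the sketch below counts which symbol follows each k-symbol context and turns the counts into probability estimates. It is not DMC or PPM; the names `train_markov`, `predict` and `ideal_bits` are hypothetical. The last helper shows why a coder able to spend a fraction of a bit is attractive for skewed distributions.

```python
from collections import Counter, defaultdict
from math import log2


def train_markov(text, order):
    """Count the symbols that follow each context of `order` symbols."""
    model = defaultdict(Counter)
    for i in range(order, len(text)):
        model[text[i - order:i]][text[i]] += 1
    return model


def predict(model, context):
    """Estimated probability of each next symbol given the context."""
    counts = model.get(context)
    if not counts:
        return {}  # unseen context: no prediction in this simple sketch
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()}


def ideal_bits(probability):
    """Information content of an event, in bits (may be fractional)."""
    return -log2(probability)


model = train_markov("abababac", order=1)
print(predict(model, "a"))   # 'b' is far more likely than 'c' after 'a'
print(ideal_bits(0.75))      # less than one bit
```

A higher order gives sharper predictions but needs more contexts to be stored and searched, which is the complexity cost noted above.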
Dictionary Modelling 
Software- and hardware-based dictionary models are more popular than statistical 
models. This is because they achieve good throughput, as fewer computational 
resources are needed to construct the model, but this is at the expense of a reduction 
in the compression that can be achieved. The Lempel-Ziv algorithms, more commonly 
abbreviated as LZ-77 [Ziv77] and LZ-78 [Ziv78], have been the most successful of all 
the dictionary techniques and have been implemented in many applications. Both 
techniques perform compression by storing a sequence of pointers to previously seen 
phrases. LZ-77 algorithms adapt by maintaining a sliding-window buffer that consists 
of two parts: a search buffer, which contains a portion of the recently coded sequence, 
and a look-ahead buffer, which contains the next portion of the sequence to be coded. The 
algorithm tries to match the contents of the look-ahead buffer to a string in the 
fixed-size window. 
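The search step of a sliding-window coder can be sketched as follows. This is a simplified brute-force version written for illustration only: it assumes matches lie wholly inside the search buffer (practical LZ-77 variants often allow a match to run on into the look-ahead buffer), and the function name `longest_match` is hypothetical.

```python
def longest_match(search, lookahead):
    """Find the longest prefix of `lookahead` that occurs in `search`.

    Returns (offset, length), where offset is counted backwards from the
    end of the search buffer. (0, 0) means that no match was found.
    """
    best_offset, best_length = 0, 0
    for start in range(len(search)):
        length = 0
        while (length < len(lookahead)
               and start + length < len(search)
               and search[start + length] == lookahead[length]):
            length += 1
        if length > best_length:
            best_offset, best_length = len(search) - start, length
    return best_offset, best_length


# "abra" at the start of the look-ahead buffer is found 7 symbols back.
print(longest_match("abracad", "abrarr"))
```

The nested loops make the cost of this search grow with both the window size and the match length, which is why the search is the natural target for hardware parallelism.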
Rather than using fixed length phrases from a window, LZ-78 builds its dictionary 
out of all of the previously seen symbols in the input text. The basis of this method is 
to build a dictionary of strings while coding. Both the coder and decoder start with an 
empty dictionary and, as each character is read in, it is added to the current string. The 
dictionary is built progressively one character at a time, and as long as the character 
matches some existing phrase in the dictionary, this process continues. For example, 
the first time the string "Mark" is seen, the string "Ma" is added to the dictionary. The 
next time, "Mar" is added. If "Mark" is seen again, it is then added to the dictionary. 
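A minimal sketch of this dictionary growth is given below. It follows the textbook formulation of LZ-78, in which each output is a pair of (index of the longest known phrase, next character); the function names are hypothetical and details such as dictionary size limits are ignored.

```python
def lz78_encode(text):
    """Encode `text` as (phrase index, next character) pairs.

    Index 0 stands for the empty phrase. Every emitted pair also adds a
    new phrase, one character longer than a known one, to the dictionary.
    """
    dictionary = {}  # phrase -> index, starting at 1
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending while the phrase is known
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:  # flush a trailing phrase that is already in the dictionary
        output.append((dictionary[phrase], ""))
    return output, dictionary


def lz78_decode(pairs):
    """Rebuild the text; the decoder grows the same dictionary."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        out.append(phrase)
        phrases.append(phrase)
    return "".join(out)


pairs, dictionary = lz78_encode("MarkMarkMark")
print(list(dictionary))     # phrases in the order they were added
print(lz78_decode(pairs))   # round trip
```

In this run the phrases "Ma" and then "Mar" enter the dictionary on successive repetitions of the word, one extra character at a time.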
These two algorithms have spawned several variants over the years in attempts to 
improve their speed and compression. Nowadays, it appears these variations are more 
commonly used to try and bypass the patents on the Lempel-Ziv algorithms rather 
than improve their performance! 
A different approach to the Lempel-Ziv algorithms is the X-MatchPro [Jones00, 
Nunez99] hardware implementation of a dictionary based data compression scheme. 
The X-MatchPro algorithm uses a dictionary of previously seen data and attempts to 
match or partially match the current data phrase with an entry of previously seen data 
present in the dictionary. 
All of the above dictionary schemes can be classified as dynamic, as 
in all cases the model adapts to the nature of the incoming data. 
Coding 
When the modelling is static, Huffman coding [Huffman52], Shannon-Fano 
[Shannon-Fano] coding and Arithmetic coding [Howard92a, Howard94] are often 
used to assign short codes to symbols with a high probability of occurrence. 
Arithmetic coding and Dynamic Huffman coding [Knuth82] are examples of coders 
that work with dynamic modelling. 
Run Length Coding [Golomb66] is an example of a coding scheme designed to compress 
a specific characteristic of data. It works by taking a 'run' of repeating symbols and 
replacing it with just one symbol and a count that is the length of the run; for 
example, 'AAAAAA' would be replaced by A{6}. 
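A short sketch of run length coding is shown below. The (symbol, count) pair representation and the function names are illustrative choices, not a format taken from [Golomb66].

```python
from itertools import groupby


def rle_encode(data):
    """Replace each run of identical symbols by a (symbol, run length) pair."""
    return [(sym, len(list(run))) for sym, run in groupby(data)]


def rle_decode(pairs):
    """Expand (symbol, run length) pairs back into the original string."""
    return "".join(sym * count for sym, count in pairs)


print(rle_encode("AAAAAA"))   # a single pair with a count of 6
print(rle_encode("AAABBC"))
```

Data without runs gains nothing from this scheme, since every symbol then carries a count of one.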
Huffman coding was developed in the early 1950s by David Huffman as part of a 
class assignment in information theory. The algorithm to construct Huffman Codes is 
as follows. 
1. Rank all symbols in order of probability of occurrence. 
2. Successively combine the two symbols of the lowest probability to form a new 
composite symbol; eventually a binary tree is built where each node holds the 
combined probability of all nodes beneath it. 
3. Trace a path to each leaf, noting the direction taken at each node. 
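The three steps above can be sketched with a priority queue. The function name `huffman_codes` is hypothetical, symbols are assumed to be strings, and the tie-breaking counter is an implementation detail of this sketch that decides which of the equally good codes is produced.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Build a Huffman code from a {symbol: weight} mapping."""
    tie = count()  # unique tie-breaker so the heap never compares nodes
    heap = [(weight, next(tie), sym) for sym, weight in freqs.items()]
    heapq.heapify(heap)                      # step 1: order by weight
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:                     # step 2: merge the two lightest
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (left, right)))

    codes = {}

    def walk(node, prefix):                  # step 3: trace root-to-leaf paths
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


freqs = {"a": 5, "b": 2, "c": 1, "d": 1}
codes = huffman_codes(freqs)
print(codes)
print(sum(freqs[s] * len(codes[s]) for s in freqs))  # total coded length in bits
```

Swapping the 0 and 1 labels at any node gives a different but equally short code, which is why many Huffman codes exist for one frequency distribution.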
For a given frequency distribution, there are many possible Huffman codes, but the 
total compressed length will be the same. Huffman codes have the unique prefix 
property that no code is a prefix of another code; this means Huffman codes can be 
unambiguously decoded. A technique comparable to Huffman coding is Shannon-Fano 
coding, which works as follows. 
1. Divide the set of symbols into two equal, or almost equal, subsets, based on the 
probability of occurrence of characters in each subset. The first subset is 
assigned the bit 0, the second the bit 1. 
2. Repeat step 1 on each subset until all subsets have a single element. 
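A possible sketch of this top-down procedure is given below. It assumes the symbols are first sorted by falling frequency and that each split is placed where the two halves have the most nearly equal totals; the function name `shannon_fano` is hypothetical.

```python
def shannon_fano(freqs):
    """Build a Shannon-Fano code from a {symbol: weight} mapping."""
    symbols = sorted(freqs, key=lambda s: (-freqs[s], s))
    codes = {s: "" for s in symbols}

    def split(group):
        if len(group) < 2:
            return
        total = sum(freqs[s] for s in group)
        running, best_cut, best_diff = 0, 1, None
        for i in range(1, len(group)):
            running += freqs[group[i - 1]]
            diff = abs(total - 2 * running)  # imbalance between the halves
            if best_diff is None or diff < best_diff:
                best_cut, best_diff = i, diff
        for s in group[:best_cut]:
            codes[s] += "0"
        for s in group[best_cut:]:
            codes[s] += "1"
        split(group[:best_cut])
        split(group[best_cut:])

    split(symbols)
    return codes


freqs = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}
codes = shannon_fano(freqs)
print(codes)
print(sum(freqs[s] * len(codes[s]) for s in freqs))  # total coded length in bits
```

For these counts the sketch needs 89 bits in total, whereas a Huffman code for the same counts needs 87, a small example of the occasional extra bits that Shannon-Fano coding can incur.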
The algorithm used to create Huffman codes is a 'bottom-up' approach, in which 
codes are formed starting from the leaves and working towards the root. Shannon-Fano 
codes are 'top-down': code generation starts from the root, with bits being assigned 
to each code, and then works towards the leaves. Huffman encoding always generates 
optimal codes, but Shannon-Fano sometimes uses a few more bits in the construction 
of its codes, and as the complexity of the two is similar Huffman coding is generally 
the more popular. 
Arithmetic coding can represent several symbols with a single bit, and can handle 
probabilities more accurately than Huffman and Shannon-Fano coders (which, because 
every codeword is a whole number of bits, effectively round probabilities to a 
negative power of two). Arithmetic coding translates the input stream 
into a single floating-point value between 0 and 1. The model provides the symbols 
and their associated probabilities to the coder, and the combined probabilities of the 
whole dataset add up to 1. The algorithm for coding a file using arithmetic coding 
works conceptually as follows. 
1. The current interval [L,H) is initialised to [0, 1). 
2. For each event in the file, perform these two steps 
(a) Divide the current interval into subintervals, one for each possible 
event. The size of an event's subinterval is proportional to the 
estimated probability that the event will be the next event in the file 
according to the model of the input. 
(b) The subinterval corresponding to the event that actually occurs is 
selected, and made into the new current interval. 
3. Sufficient bits are sent to the output to distinguish the final current interval 
from all other possible final intervals. 
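Steps 1 and 2 can be sketched with exact rational arithmetic. The sketch assumes a fixed (static) model supplied as a dictionary of probabilities and omits step 3, the bit output; the function name `narrow_interval` is hypothetical. Practical coders use finite-precision integer arithmetic rather than exact fractions.

```python
from fractions import Fraction


def narrow_interval(message, probs):
    """Return the final interval [low, high) after coding `message`.

    `probs` maps each symbol to its probability; the dictionary order
    fixes the order of the subintervals inside the current interval.
    """
    low, high = Fraction(0), Fraction(1)          # step 1
    for sym in message:                           # step 2
        width = high - low
        cumulative = Fraction(0)
        for candidate, p in probs.items():
            if candidate == sym:
                # step 2(b): the subinterval of the event that occurred
                high = low + width * (cumulative + p)
                low = low + width * cumulative
                break
            cumulative += p                       # step 2(a): next subinterval
        else:
            raise ValueError(f"symbol {sym!r} is not in the model")
    return low, high


probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
low, high = narrow_interval("ab", probs)
print(low, high, high - low)
```

The width of the final interval equals the product of the probabilities of the coded symbols, so roughly -log2(width) bits are enough to identify it.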
1.3.2. Parallel Systems 
The idea of operating microprocessors in parallel to increase system performance 
dates back to the earliest history of computers. As more diverse and complicated 
systems are developed and proposed, the classification of such architectures becomes 
ever more difficult. Flynn proposed one of the more widely accepted classifications 
over 30 years ago [Flynn66] distinguishing between parallel architectures based on 
the concept of parallelism in instruction and data streams. An instruction stream is a 
sequence of instructions executed by a processor and a data stream is a sequence of 
input data. Using these constraints, processors can be placed into one of four 
categories. 
• Single-Instruction, Single-Data (SISD) is essentially an enhanced sequential 
computer capable of pipelining the instruction stream. The typical personal 
computer processor falls under this category. 
• Multiple-Instruction, Multiple-Data (MIMD) includes systems that have 
independent processors operating on non-overlapping sequences of input. Each 
processor has its own instructions and operates on its own data. 
• Multiple-Instruction, Single-Data (MISD) encompasses systems of several 
processors each executing a different instruction on identical input. This parallel 
approach is not very common, but can be used, for example, in multiple 
cryptography algorithms, where the same data is input into several cryptography 
processors, and to provide redundancy in safety-critical applications. 
• Single-Instruction, Multiple-Data (SIMD) consists of a number of uniform 
processors, an interconnection network and an associative memory. The 
Processing Elements (PEs) of a SIMD simultaneously perform the same operation 
on different data. The layout of these systems can differ depending on the 
interprocessor communications and number of processors. 
These only form a general classification model of parallel systems, as a number of 
architectures that have been developed do not fall readily into any one of these 
categories or may be a hybrid of these schemes. 
Another classification of parallel systems primarily used for theoretical study is the 
PRAM (Parallel Random Access Machine) [Stauffer93] model where the key features 
are: 
• all processors share a single, common, unbounded memory. 
• the number of processors in the system is unbounded. 
The PRAM classification can be further split up into categories based on the ability of 
the processors to read and write to memory: 
• CR (Concurrent Read) - all processors can read memory at the same time 
• CW (Concurrent Write) - all processors can write to memory at the same time 
• ER (Exclusive Read) - only one processor can read from memory at a time 
• EW (Exclusive Write) - only one processor can write to memory at a time 
Memory structure can be used to further categorise parallel architectures. The 
two divisions are shared and distributed. In shared memory, the processors act independently 
but share the same memory resource, as shown in Figure 1.5. Changes in the memory contents 
are visible to all the other processors. 
Figure 1.5. Shared memory for multiprocessor system 
In distributed memory, the processors have their own local memory with no concept 
of global memory; Figure 1.6. illustrates an example of this architecture. If data need 
to be transferred from one processor to another, the transfer needs to be explicitly controlled. 
Allocating distributed memory to processors has two major benefits. Firstly, each 
processor can utilise its own memory without interference from other processors. 
Secondly, latency is reduced due to quicker access to local memory, as there is no bus 
or switch contention. The main disadvantage of distributed memory is that 
communication between processors becomes more complex, as the 
processors no longer share a single centralised memory. 
Figure 1.6. Distributed memory in a multiprocessor system 
There are several architectures for SIMD models. The typical construction of these 
systems is that each Processor Element (PE) has its own local memory and data 
transfers between elements are made over a communication network. Examples of 
these communication structures are mesh-connected SIMD, systolic array SIMD and 
tree-connected SIMD; depending on the computational requirements of the application, 
some are better suited than others. Figure 1.7. illustrates a systolic SIMD array layout. 
Figure 1.7. Systolic SIMD Array 
In a mesh-connected SIMD, the processors are connected in a grid structure, see Figure 1.8. 
Figure 1.8. Mesh-Connected SIMD array 
In tree-based SIMD, data is passed from a parent node to a child or vice 
versa. Figure 1.9. shows an example of a section of a tree-based communication 
structure. 
Figure 1.9. Tree-based SIMD Array 
1.3.3. Queuing Theory 
To ensure data are readily available for the compressor to process, dedicated memory 
is often used to store data. However, the impact of this extra memory on the design 
needs to be carefully considered, for example, the capacity of the memory, and the 
latency introduced by storing data in memory. 
To help answer these questions, empirical methods can be used to estimate the 
memory requirement. For example, a small memory size may be initially chosen, and 
then, following simulation, should overflow occur, the memory size could be 
increased slightly or the rate at which data enters the memory could be restricted. This 
iterative approach can be taken until a memory size where no overflow occurs is 
found. 
Another technique is termed "high-watermarking", where a large quantity of 
memory is initially allocated in the design. Before simulation, the memory is filled 
with a repeating pattern of data. Following completion of the simulation, the memory 
is examined, and the lowest point in the memory where the data pattern has not been 
overwritten determines the maximum queue length. These techniques do not provide 
an accurate means of calculating memory requirements or the latency introduced, as 
the worst case may not be covered during simulation. The empirical predictions 
produced by this method can be improved by performing a series of simulations to 
build up a more accurate picture of system behaviour. 
An alternative approach to the empirical methods is one based on queuing theory 
[Bose01]. Using queuing theory, memory requirements and latency can be 
investigated on a mathematical basis. Figure 1.10. illustrates a simple producer-consumer 
model. 
Figure 1.10. Producer and consumer model 
The rate at which data enters the dedicated memory of the compressor is the 
production rate P(t). The rate at which data exits the memory is the consumption rate 
C(t). Both are functions of time, as data rates can vary during operation. 
If the production rate is permanently less than the consumption rate, P(t) < C(t), 
then any data entering the memory immediately exits. Therefore, no memory is 
required and the compressor will be under-utilised. However, if the production rate is 
always larger than the consumption rate, P(t) > C(t), then data is always entering the 
system faster than it can be consumed. This means that eventually all the memory 
space will be filled with data, and any further data entering the memory will cause an 
overflow condition with the subsequent loss or corruption of the data. Clearly, this 
situation cannot be allowed to occur in lossless compression. 
Data queue memory is only required when the production rate is greater than the 
consumption rate for a short time T, that is, P(t) > C(t) for time t within T. 
This period of temporarily high production is often termed a 'burst'. 
Bursts need to be separated in time if the memory is not going to overflow, and 
normally the memory needs to be empty at the beginning of a burst to ensure that 
data do not build up over time. During the burst period, data being produced are 
placed in the queue in preparation for processing by a compressor. As soon as data are 
available, they are taken out of the queue and compressed; that is, they are consumed. 
From this starting point, the following equations can be generated. 
Equation 1.1. $P \times T = N_c$ 
Equation 1.2. $C \times T = N_p$ 
In the above equations, $N_c$ data items are produced and put in the queue, $P$ is the 
production rate, $N_p$ data items are taken from the queue and $C$ is the consumption rate. 
Since the queue gradually becomes longer during a burst, the length of the queue is at 
its maximum, L, at the end of the burst. L can be calculated from the following 
equation. 
Equation 1.3. $L = (P - C) \times T$ 
Equation 1.3. is only valid if a new burst of data enters the memory only when it is 
empty of data from the previous burst. The emptying time from the burst can also be 
calculated, as at the end of a burst, the production rate tends to zero, but the 
consumption rate continues at a constant rate. As there are L data items in the queue at 
that time, the emptying time E can be calculated as follows. 
Equation 1.4. $E = L/C = (P/C - 1) \times T$ 
The emptying time E can be viewed as the length of time taken for the last message 
that was put into the queue at the end of the burst, to move through the queue and then 
exit. Since the queue is longest at the end of a burst, this is the longest time taken for a 
data item to pass through the system; in other words, it is the worst-case latency 
introduced by the queue, assuming the queue was initially empty before the new burst. 
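The burst relations of Equations 1.1 to 1.4 can be checked numerically with a small sketch. It assumes constant rates P and C during the burst, an empty queue at the start of the burst and no production afterwards; the function names and the example figures (5 items per cycle produced, 4 consumed, a burst of 1000 cycles) are illustrative only.

```python
def burst_queue(P, C, T):
    """Evaluate Equations 1.1 to 1.4 for a single burst of length T."""
    if not (P > C > 0):
        raise ValueError("a burst requires P > C > 0")
    L = (P - C) * T                 # Equation 1.3: peak queue length
    return {
        "N_c": P * T,               # Equation 1.1: items put in the queue
        "N_p": C * T,               # Equation 1.2: items taken from the queue
        "L": L,
        "E": L / C,                 # Equation 1.4: emptying time
    }


def simulate(P, C, T):
    """Step-by-step check: returns (peak queue length, steps to drain)."""
    queue = peak = 0
    for _ in range(T):              # burst: the queue grows by P - C per step
        queue += P - C
        peak = max(peak, queue)
    drain = 0
    while queue > 0:                # after the burst only consumption remains
        queue -= C
        drain += 1
    return peak, drain


print(burst_queue(5, 4, 1000))
print(simulate(5, 4, 1000))
```

With these figures the queue memory would need to hold 1000 items and the worst-case latency added by the queue is 250 cycles, in agreement between the formulas and the simulation.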
1.4 Structure of Thesis 
Chapter 2 provides a literature review of research in parallel data compression, 
identifies the range of compression algorithms using parallel approaches, describes 
how they can be applied and gives a classification of relevant parallel architectures. 
Chapter 3 identifies the research area of this thesis and introduces the X-MatchPro 
data compression core as an example of a high performance data compressor. This 
core is then used as a building block for investigating data compression implemented 
into a parallel architecture. 
Chapter 4 introduces the experimental framework for the research, dataset selection 
and the selection of suitable tools and methodologies for use in the exploration of the 
design space and in the experimental work. 
Chapter 5 covers the experimental work in the investigation of input and output 
routing strategies for a MIMD architecture. Results on the compression performance, 
latency and throughput of the system are presented. 
Chapter 6 describes the implementation of the system developed from the findings of 
the experimental work of Chapter 5. The design is synthesised and 'placed and routed' 
in a manner suitable for FPGA technology. This allows the demonstration of proof of 
concept and enables realistic figures for the design complexity to be obtained. 
Chapter 7 concludes the thesis by summarising the main findings and how these 
fulfil the aims of the research. It also outlines further potential research paths. 
Chapter 2 
Review of Parallel Data Compression Systems 
2.1 Chapter Objectives 
There have been several research efforts aimed towards implementing lossless data 
compression in parallel, ranging from software algorithms executed across several 
CPUs to highly concurrent VLSI hardware implementations. The common theme to 
all this research is to create algorithms and architectures utilising aspects of 
parallelism to achieve performance better than that available in sequential designs. 
There are two principal means by which parallelism can be introduced into data 
compression. The first approach is to construct or modify a compression algorithm 
with the specific aim of employing inherent internal parallelism to improve 
performance. For example, in dictionary-based algorithms implemented in hardware, 
parallel architectures such as CAMs perform dictionary searching in one clock cycle, 
a significant improvement compared to the many clock cycles taken to access 
conventional RAM. However, not all existing compression algorithms are inherently 
parallel and so readily adaptable to parallel architectures, and significant 
simplifications that adversely affect compression performance may have to be made 
to coax the algorithm into a suitable form. The second approach is to leave the 
algorithm unaltered but to employ a number of identical data compression algorithms 
running concurrently on the data. One example of this is in lossy image compression 
where the image is split up into blocks, each of which is fed to its own coder and 
compressor. This chapter reviews and focuses on the work in the area of parallel 
lossless compression. The objectives of the chapter are to: 
• Review past and current implementations of lossless parallel compression. 
• Identify those aspects of previous lossless parallel compression work that are 
related to the research pursued in this thesis. 
• Summarise the relevant aspects of performance achieved in the reviewed 
implementations. 
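The second approach described in the chapter objectives, running identical unmodified compressors concurrently on separate portions of the data, can be sketched in software as follows. This is only an illustration: it uses the DEFLATE compressor from Python's standard `zlib` module in place of any hardware compressor discussed in this thesis, and the block size, worker count and function names are arbitrary assumptions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(data, block_size, workers=4):
    """Split `data` into fixed-size blocks and compress each independently."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves block order, so the outputs can be concatenated.
        return list(pool.map(zlib.compress, blocks))


def decompress_blocks(chunks):
    """Decompress the blocks in order and join them back together."""
    return b"".join(zlib.decompress(chunk) for chunk in chunks)


data = b"abcabcabc" * 1000
chunks = compress_blocks(data, block_size=2048)
print(len(chunks), sum(len(c) for c in chunks), len(data))
```

Each block is compressed without knowledge of the others, so every compressor starts with an empty history; smaller blocks therefore tend to give up some compression in exchange for more parallelism, and the variable-sized outputs have to be collected in order.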
2.2 Parallel Compression 
The next few sections cover brief descriptions of previous research efforts on 
introducing parallelism into data compression. The classification of the sub-headings 
is in accordance with the compression technique used. This helps highlight which 
parallel architectures have been used for each of the compression algorithms and 
where the main focus of research has occurred. 
2.2.1 Parallel Huffman Coding 
Producing a parallel implementation of Huffman coding is desirable, as it is a widely 
known, fairly simple algorithm and produces codes that are close to optimal for a 
given model and can be implemented by fast lookup tables. Huffman codes also have 
the 'prefix property' that no code word is a prefix of another code, and thus allows 
straightforward decoding. A parallel Huffman encoder/decoder method for use in 
lossless image compression is described in [Howard92b]. Parallel coding is easy to 
implement in this application, since the code bits for all pixels are disjoint and 
independent. Parallel coding is achieved by assigning a pixel to each processor, which 
means each processor can independently compute the code length of its own pixel. 
Parallel decoding is achieved by transposing the outputs, so that the output for the 
first bit of each pixel is sent in the first time unit, then the second bit from all the 
processors in the second time unit. If one processor completes its operations while data 
are still to be processed by the other processors, some of those data are reassigned to the 
idle processor. When all of the processors have completed their output, a new set of data 
is assigned to all the processors. An example of the data allocation between 
processors is shown in Figure 2.1. In this example, there are four processors, 
numbered P1 to P4, and 12 pixels; initially, three pixels are assigned to each 
processor. The first phase ends when processor P3 finishes pixel 9 after 4 time steps; 
each processor has produced an output of 4 bits by this time. Pixels 3 and 10 have 
been partially encoded and they remain assigned to processors P1 and P4. At this point 
the untouched pixels (6, 11, 12) would be reassigned to balance the load. The paper 
also shows the reassignment of partially computed pixels, but the benefits of 
this process are not made clear. 
Figure 2.1. Data allocation between processors (Phase 1, Phase 2, Phase 3) 
2.2.2 Parallel Arithmetic Coding 
Arithmetic coding can theoretically achieve optimal coding if implemented using 
exact arithmetic. When combined with a model for accurately identifying the 
redundancy in the data, high levels of compression can be achieved. Implementations 
of arithmetic coding are slow due to the computationally intensive multiplication and 
divisions involved. This has led to the development of quasi-arithmetic coding, which 
avoids the need to calculate these multiplications by storing their results in look up 
tables. Howard et al. [Howard92b] apply quasi-arithmetic coding to a parallel 
architecture using techniques similar to their method discussed above for parallel 
Huffman coding. 
Jiang and Jones [Jiang94] discuss the design of a parallel algorithm suitable for 
real-time implementation of arithmetic coding. The design uses a parallel processing 
array arranged in a tree structure for compressing data, but the authors accept that in 
this method decompression needs to be carried out in a sequential manner, because 
the search and decode operations for parallel decompression would increase the 
complexity of the design to such an extent that the performance would suffer. The 
coder is expected to run eight times faster than a sequential 
coder, while offering similar compression ratios. 
Stefo et al. [Stefo01] describe an FPGA implementation of a statistical modelling 
unit that is able to support parallel binary arithmetic coding. The design is able to 
process 8 bits per clock cycle compared to the conventional 1 bit per clock cycle. The 
model has been implemented in an A500K130 ProASIC FPGA, giving a throughput 
of 256Mb/s. 
2.2.3 Parallel Entropy Coding 
An entropy coder attempts to encode a given set of symbols with the minimum 
number of bits required to represent them. Two such entropy coding schemes are 
Huffman coding and arithmetic coding. Boliek et al. [Boliek94] describe a prototype 
entropy coding hardware that divides the data into multiple streams that are fed into 
parallel coders. They also present a possible solution for the transmission of multiple 
streams of variable length compressed data, namely, by using a coded data interleave 
method. A concatenated file system for transmitting the coded data is used, but this 
requires that the system incorporates large memory buffers, so that each of the parallel 
encoders may encode several bits before emitting a codeword. This results in non-
deterministic output behaviour from each of the encoders. To enable decoding, the 
decoder requires data to be presented in bit order for the scheme to work, and also 
reordering of the code words using additional decoding circuitry is essential for 
decompression of the data. A prototype of an interleaved decoder system with six 
simple run length decoders was implemented in an Altera Flex800 FPGA and was able 
to achieve a bit rate of 16Mbit/s at the output of the decompressor. The compression 
performance was not explicitly quoted but the authors found it to be comparable to that 
of the IBM Q-coder [Arps88]. The paper highlights some of the practical problems of 
implementing parallel coding in several independent encoders. For example, if the 
efficiency of the system is to be maximised then data needs to be divided between the 
coders as equally as possible and the problem of routing variable sized compressed 
data from different coders can mean large output buffers are needed. 
2.2.4 Parallel Lempel-Ziv 
The majority of research efforts in lossless parallel data compression have used the 
dictionary-based Lempel-Ziv algorithm modified into a form suitable for 
implementation in hardware. The hardware architecture generally takes the form of 
large numbers of simple Processing Elements (PEs) arranged in a systolic array. Other 
popular Lempel-Ziv adaptations use a Content Addressable Memory (CAM). The 
search for the longest matching string in the dictionary (normally the most 
computationally expensive operation in the Lempel-Ziv algorithm) is performed by 
the CAM in a single clock cycle, while the systolic array method uses a much slower 
deep pipelining technique to implement its dictionary search. However, compared to 
the CAM solution, the systolic array method has advantages in terms of reduced 
hardware costs and lower power consumption. 
Ranganathan and Henriques [Ranganathan93] describe a Lempel-Ziv VLSI 
implementation that exploits parallelism through pipelining in the form of a 
systolic architecture. They used a parallel architecture of n processors, where n is the 
size of the longest match. The number of comparisons was reduced from a quadratic 
order in the sequential algorithm to linear order in this parallel architecture. A 
prototype containing nine Processing Elements in a systolic array was fabricated and 
tested using 2µm CMOS technology. From the prototype it was estimated that a 
compression chip should run at 13MByte/s, operating with a clock speed of 40MHz. 
The authors also described two methods for decompression, one using a sequential 
architecture that decompresses at a rate of one character per clock cycle and the 
second using a semi-systolic architecture, in which global signals are used in the 
control logic for a two level buffer. This design was also patented in 1993 under US 
patent number 5,179,378. 
Bongjin et al. [Bongjin94] described a systolic array compressor based on the 
Lempel-Ziv algorithm that improved on the latency performance of the system 
described by Ranganathan and Henriques. In a subsequent paper, Bongjin and 
Burleson [Bongjin95] presented a systolic array implementation of their design that 
achieved a throughput of 90Mb/s using 32 PEs clocked at 90MHz. This design was 
improved in a later paper [Bongjin98], where emphasis was placed on reducing chip 
area and power consumption. Figure 2.2. illustrates a diagram for the general 
architecture adopted in the work. The architecture has five major components: 1) a 
unidirectional shift register for the sliding dictionary; 2) a linear array of simple 
processors; 3) a counter to compute the match length; 4) a 512-input NOR gate to 
compute the end of encoding; and 5) a priority encoder to compute the location of the 
match. 
Figure 2.2. Bongjin et al. Lempel-Ziv systolic architecture 
Gonzalez and Storer [Gonzalez85] presented several parallel algorithms for textual 
substitution based on a systolic array. Modifications that made the solution suitable 
for both static dictionary and sliding window Lempel-Ziv algorithms were described. 
The static dictionary detected matches between input data and dictionary entries 
stored in the processing elements and encoded the input data using the PE's individual 
identification number. The dictionary is constructed prior to compression and is either 
loaded or hardwired into each PE. As a result, the static dictionary improved 
throughput, since no additional overhead was needed to maintain the 
dictionary, but had the drawback of needing to ensure that the data characteristics were 
adequately represented by the static model. The sliding window algorithm is applied 
to a match tree architecture, which is formed from two parts; the first being the linear 
systolic array and the second being a binary tree attached to the systolic array. Data 
enter the array and are gradually clocked through the system. Character data are 
stored in each PE of the array and the binary tree computes the match position and the 
length. 
Storer and Reif [Storer90] described the implementation of a sliding window array 
using the design of Gonzalez and Storer. A custom VLSI chip containing 128 
PEs was implemented in 1.2µm double metal technology. Thirty of these chips were 
placed together to form a system capable of operating at 40MB/s at a clock rate of 
25MHz. 
Chen et al. [chen98] presented an alternative systolic array implementation of the 
Lempel-Ziv algorithm that contained 64 PEs, with a dictionary buffer of 512 
characters distributed between them. While data are being processed, each PE compares the 
input character with its 8-character CAM dictionary and outputs to the 
adjacent PE both the character to be coded and the longest match string. The authors 
report a clock frequency of 91MHz, a throughput of 728Mbit/s and a design 
complexity of 90Kgates in 0.6µm technology. The latency of the system was found 
to be proportional to the number of PEs used. The design demonstrated a trade-off 
between the low latency, high power and high complexity of a CAM and the long latency, 
lower power and reduced complexity of a deeply pipelined systolic array. 
A number of other hardware solutions to the implementation of the Lempel-Ziv 
algorithm also incorporated a CAM. A CAM-based Lempel-Ziv data compressor can 
process one symbol per clock cycle, no matter how large the buffer size or how long 
the string. This offers a substantial advantage compared to the systolic-based 
architectures, which are both buffer size and string length dependent. Conversely, 
though, the CAM uses greater silicon resources and consumes more power than do 
systolic arrays. Figure 2.3 illustrates the general architectural layout for CAM-based
Lempel-Ziv designs. The input data are fed to the CAM to be compared with all 
dictionary locations and are used to provide the encoded match position pointer and to 
determine the global match signal. The priority encoder determines the best possible 
match position for the output. When the system is ready for the next symbol, the old 
symbol is added to the CAM to update the dictionary. 
Figure 2.3. CAM-based Lempel-Ziv (diagram not reproduced; recoverable labels: match results from each cell, match position, next symbol, match length)
Examples of CAM-based implementations can be found in a number of research
papers [Jones92, Wei93, Lee95]. Jones presented one of the first shiftable CAMs that
permit partial matching of incoming data; the implementation used 2 µm
technology and achieved a data rate of 100 Mb/s. Lee and Yang described a VLSI
implementation of the LZ77 algorithm whose architecture consisted of three units: a
CAM, match logic and an output stage. Simulations for 0.8 µm CMOS technology
achieved a clock speed of 50 MHz.
A Lempel-Ziv systolic array simulated in software on several CPUs was
produced by Simpson and Sabharwal [Simpson98]. The algorithmic architecture was
based on the sliding window array from Gonzalez and Storer [Gonzalez85]. The
results demonstrated improvements in compression ratio and in compression speed as
the number of CPUs was increased, with the processing distributed between a host
CPU and several 'worker' CPUs. The complexity of the host is O(n) and the complexity
of the worker processor is O(nL^2), where n is the number of dataset symbols and L is
the number of library records (PEs) in the system. The total algorithm complexity of
host and workers is O(n + (n+L^2)/p), where p is the number of worker processors in
the system. The performance results revealed that there is a reduction in compression
as the number of processors is increased and that this was due to losses in matching
when shifting data between processors. However, it was demonstrated that an increase
in throughput resulted from the addition of extra processors.
Rather than using a systolic array for the Lempel-Ziv algorithm, Penzhorn 
[Penzhorn92] developed a software-based parallel compression technique by making 
use of two CPUs. Parallelism was achieved by constructing two independent 
codebooks, one for each CPU. The results reveal that, at start up, there is a loss of 
compression compared with that achievable using a single processor, but that this 
difference diminishes when adequate data have been processed. Using two CPUs the 
compression was performed approximately 1.5 times faster than using a single CPU. 
In March 1998 a patent was awarded to Franaszek and Robinson [Franaszek98]
for a parallel compression and decompression approach using a cooperative dictionary
method based on the Lempel-Ziv algorithm. The approach compresses blocks of data 
using a shared dictionary with several compression and decompression processors. 
The data are divided into sub blocks, which are then allocated to the processors 
involved, and collectively, all the processors construct a shared dynamic dictionary, 
and compress the data in parallel using this dictionary. The output from each 
compressor is concatenated to form a compressed block with a prefix code (tag) that 
allows the size of each compressed sub block to be determined, enabling the 
compressed code to be allocated to the correct processor during decoding. The patent
also highlights why a shared dictionary is desirable. For example, starting with, say, a
512 byte block, this may need to be divided into 4 sub blocks (assuming 4 
processors), making each sub block only 128 bytes long and their corresponding 
dictionaries become too small for identifying redundancy. This would result in a 
compression performance significantly worse than that possible by compressing the 
whole block sequentially. 
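The penalty of giving each processor a private dictionary can be illustrated in software. The sketch below is not the patented cooperative dictionary method: it uses Python's zlib (a Lempel-Ziv based compressor) as a stand-in, and the function names and the sample text are hypothetical. It only compares the size obtained from one 512-byte block with the total size obtained from four independently compressed 128-byte sub-blocks.

```python
import zlib


def compress_whole(block: bytes) -> int:
    """Compressed size in bytes when one compressor sees the whole block."""
    return len(zlib.compress(block))


def compress_split(block: bytes, processors: int) -> int:
    """Total compressed size when the block is cut into equal sub-blocks and
    each sub-block is compressed with its own, initially empty, dictionary."""
    assert len(block) % processors == 0, "sketch assumes equal sub-blocks"
    size = len(block) // processors
    subs = [block[i * size:(i + 1) * size] for i in range(processors)]
    return sum(len(zlib.compress(s)) for s in subs)


# Hypothetical 512-byte block of repetitive text.
block = (b"the quick brown fox jumps over the lazy dog " * 12)[:512]

whole = compress_whole(block)      # one sequential compressor
split = compress_split(block, 4)   # four 128-byte sub-blocks
print(whole, split)
```

On repetitive input of this kind the split total is larger, because every sub-block has to relearn the same strings before it can exploit them.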
This method of compressing data in parallel has been implemented commercially
in the IBM MXT main memory compression chip [Franaszek01, Tremaine01], whose
hardware implementation uses four compression processors with a shared CAM
memory, along with main memory controller logic. The purpose of this design is to
compress and decompress data being sent to and from main memory in a computer
system. The uncompressed data are split into blocks and compressed using
independent compression engines but, by employing a shared dictionary, the system
is able to achieve compression equivalent to sequential LZ77 methods. Each
compression engine operates on 256 bytes at a rate of 1 byte/cycle, yielding a
4 byte/cycle aggregate compression rate. Design complexity and process technology
are not mentioned in any of the papers, but in an interview with Gary Dagastine
[Dagastine01] a design complexity of 5 million gates in 0.25 µm process technology
is mentioned. For the four-processor system, a data throughput of 2 Gb/s is also
quoted.
2.2.5 Parallel Markov Modelling with Approximate Arithmetic Coding 
Xie et al. [Xie01] describe two parallel architectures able to reduce decompression
time with the target application being the decompression of compressed instruction 
code for embedded systems. The algorithm used to model and code the data is 
Markov modelling in conjunction with approximate arithmetic coding. A two-pass 
system was used to compress the code; this technique is appropriate for embedded 
applications as compression of instruction code can be performed off-line and only 
the decompression needs to be performed in real-time. In the vertical method, the 
code is split into four independent streams, and each stream constructs its own model 
to compress and decompress data. In the other method, termed horizontal, the
data are compressed sequentially and appropriate tags are used to indicate the length
of the compressed segments for each decompressor. The main benefit of the latter
technique is that all the instruction code is available to construct a model of the data,
therefore resulting in better compression. However, deterioration in compression
performance can arise if the extra compression achieved by having a more accurate
model of the data is not sufficient to cancel out the impact of the additional tag
overhead.
The horizontal scheme was chosen because it requires only one decode table, which is
shared between the processors. Post-synthesis simulation for TSMC 0.25 µm
technology shows that the decoder unit has a clock speed of 45 MHz, achieving a
throughput of 2.115 Gb/s with a compression ratio of 0.8.
2.2.6 Move-to-front and Transpose Parallel Architectures 
Myoupo and Wabbi [Myoupo00] developed a systolic array architecture for
self-organising linear lists. Two schemes are presented that combine the move-to-front
and transpose heuristics. Simulation of the design is performed with Parallaxis, which
is a structured programming language developed by Thomas Bräunl in 1989 for the
simulation of data-parallel SIMD systems. The key concept of the techniques
presented in the paper is to maintain a sequential list of words, so that the frequently 
accessed words are kept near the beginning of the list. These words are then coded 
according to their location in the list, with shorter codes being assigned to words at 
the top of the list and larger codes for those at the bottom of the list. The first scheme 
uses the move-to-front heuristic, where the accessed word moves to the front of the 
dictionary and the remainder of the entries are shifted down. This continues until 
steady state is approached, then the transpose heuristic is applied, where the accessed 
word is exchanged with the one immediately ahead of it in the list. The second
scheme maintains two lists simultaneously and, at selected points in time, chooses
whichever list provides the better compression ratio to compress the
data. The results reveal the two hybrid schemes outperform the pure move-to-front or
transpose algorithms. Moreover, the hybrid scheme maintaining the two lists 
simultaneously achieves better compression than the hybrid scheme that switches at 
steady state. The predicted performance in hardware is reported to be similar to a 
previous hardware implementation of a move-to-front scheme that achieved a 
40 MByte/s throughput using 1000 PEs.
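The two list-update heuristics described above can be sketched sequentially as follows. This is a minimal software illustration, not the systolic array of [Myoupo00]; the function names and the four-word alphabet are hypothetical, and the hybrid switching between the heuristics is omitted.

```python
def move_to_front(words, accessed):
    """Return the list position of `accessed`, then move it to the front,
    shifting the entries that were ahead of it down by one place."""
    pos = words.index(accessed)
    words.insert(0, words.pop(pos))
    return pos


def transpose(words, accessed):
    """Return the list position of `accessed`, then exchange it with the
    entry immediately ahead of it (no change if it is already first)."""
    pos = words.index(accessed)
    if pos > 0:
        words[pos - 1], words[pos] = words[pos], words[pos - 1]
    return pos


def encode(stream, alphabet, heuristic):
    """Code each word of the stream as its current position in the list."""
    words = list(alphabet)
    return [heuristic(words, w) for w in stream]


print(encode("cccac", "abcd", move_to_front))  # [2, 0, 0, 1, 1]
print(encode("cccac", "abcd", transpose))      # [2, 1, 0, 1, 1]
```

Frequently accessed words drift towards position 0 under both heuristics, so they receive the small position values that would be given the shorter codes.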
2.2.7. X-MatchPro 
Research on high-speed lossless compression in the Electronic System Design Group
at Loughborough University has resulted in the development of the CAM-based
X-MatchProRLI algorithm aimed to provide high speed, good compression performance
and low complexity for hardware implementations [Jones00, Nunez99, Nunez01a,
Nunez01b, Nunez02].
For single-byte CAM architectures, data throughput improvements can only result
from the shortening of the cycle time, which, in turn, results largely from silicon
technology advancements. To make significant improvements in data throughput, a
scheme permitting the processing of more than one byte simultaneously can be 
chosen. As increasing the data granularity has the consequence of reducing the 
success of CAM data matches, a larger dictionary is necessary to maintain 
compression performance. However, the larger the dictionary, the greater the number 
of address bits needed to identify each memory location, reducing compression 
performance. Clearly, to maximize throughput, a compromise involving granularity 
and dictionary size must be made. 
This observation led to the development of the X-MatchPro architecture, which 
allows partial matching of incoming data with the data stored in the dictionary. This 
has the effect of increasing the effective length of the dictionary, while at the same 
time reducing the required number of address lines. Practical investigations revealed 
that when using 4-byte wide granularity in the data stream, X-MatchPro was able to 
apply data width parallelism to its algorithm to improve throughput without 
compromising compression performance. This feature offers X-MatchPro processing 
speed advantages compared with the majority of compression algorithms that are 
based on a granularity of a single bit or of a single byte. The X-MatchPro algorithm 
attempts to match a 4-byte data element with previously seen data entries in a 
dictionary implemented in a CAM. As each entry is also 4-bytes wide, several types 
of match are possible. If fewer than two bytes match in the dictionary, the full four 
bytes are transmitted with an additional miss bit. If all bytes are matched, then both 
the match location and match type are coded and transmitted, and this match is then 
moved to the front of the dictionary. If the incoming four bytes are partially matched, 
then the match location and match type are transmitted along with the bytes that do 
not match. 
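The three outcomes described above (miss, full match, partial match) can be sketched as a small classification routine. This is an illustrative model only: the function name, the returned tuples and the tie-breaking rule (the lowest location wins) are assumptions, and the real design selects the best match by code length in hardware.

```python
def classify(tuple4: bytes, dictionary: list):
    """Compare a 4-byte tuple with every dictionary entry, byte by byte.

    Returns ('miss', tuple4), ('full', location) or
    ('partial', location, match_type, literals), where match_type marks
    which of the four byte positions matched.
    """
    best_loc, best_mask, best_count = None, None, 0
    for loc, entry in enumerate(dictionary):
        mask = tuple(a == b for a, b in zip(tuple4, entry))
        count = sum(mask)
        if count > best_count:          # assumption: earliest best entry wins
            best_loc, best_mask, best_count = loc, mask, count
    if best_count < 2:                  # fewer than two bytes match: a miss
        return ("miss", tuple4)
    if best_count == 4:
        return ("full", best_loc)
    # Partial match: the unmatched bytes are sent as literals.
    literals = bytes(b for b, hit in zip(tuple4, best_mask) if not hit)
    return ("partial", best_loc, best_mask, literals)


dictionary = [b"ABCD", b"WXYZ"]         # hypothetical dictionary contents
print(classify(b"ABCD", dictionary))
print(classify(b"ABXD", dictionary))
print(classify(b"AQRS", dictionary))
```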
X-MatchPro also uses a pipelining technique to allow steps in the compression and
decompression process to be carried out simultaneously and so to increase throughput.
The X-MatchPro design has been fully implemented and tested in FPGA technology
with data independent throughput speeds in excess of 1.1 Gbit/s. However, attempts to
extract further internal parallelism from the X-MatchPro algorithm produced
diminishing returns and any future substantial improvements are likely to result only
from silicon technology advances. In [Nunez01] the compressor is developed further
with the addition of a Run Length encoding unit to increase compression for certain
types of data. The design is implemented in a ProASIC FPGA, allowing a
full-duplex performance of 200 Mbyte/s with a clock speed of 25 MHz.
Lee et al. [Lee00] used the X-Match compression algorithm for the
compression/decompression engine in their research into compressed memory. The
main aim of the research was to increase the effective memory space and bandwidth
by integrating compression between the L1 and L2 caches. However, decompression
times critically affected memory access times and the variable-sized compressed
block increased design complexity. To overcome these issues, X-Match with run
length encoding was used to provide parallel decompression with low latency and
high throughput. The key idea in the design is that the decompression is carried out
with two decompressors operating in parallel, achieved by first splitting a source
block into two sub-blocks that are compressed independently and their outputs
byte-interleaved, resulting in one compressed block. To overcome the variable block
sized output problem, the compression ratio is limited to 0.5, meaning that only data
that can be compressed by more than half is sent to the output. Additional loss in
compression performance results from the fixed space allocation method used to
manage variable-sized compressed blocks, in which memory areas of equal length are
allocated to all compressed blocks. These methods are chosen to simplify the
decompression architecture and provide faster processing times, at the expense of
compression performance.
The compression achieved using this architecture is not reported but trace driven 
simulation gives figures of a reduction in the on-chip cache miss ratio of 35% and 
data traffic by 53%. The complexity of the system is not available as the complete 
architecture was only modelled in software. 
2.2.8 Comparison of Research Work 
The literature review reveals a range of strategies for extracting the parallelism for
data compression; however, clear comparisons between implementations are difficult
to make because of the variety of algorithms, architectures and fabrication
technologies, which influence clock speed and available silicon area. With these
limitations in mind, Table 2.1 is intended only as a general summary of the
performance and implementation details of the related research.
Table 2.1 (reconstructed from a rotated page; the references heading the second and third columns are illegible in the source)
Reference | Storer90 | (illegible) | (illegible) | Penzhorn92 | IBM MXT | Stefo01 | Xie01 | Nunez01
Algorithm | Lempel-Ziv | Lempel-Ziv | Lempel-Ziv | Lempel-Ziv | Lempel-Ziv | Arithmetic coding modelling | Markov modelling | X-MatchPRO
Architecture | Systolic array | Systolic array (32 PE) | Systolic array | MIMD (two CPUs) | MIMD with shared memory, four compressors | Systolic tree (9 processors) | SIMD with shared memory | Multiple byte data stream
Implementation | Hardware | Hardware | Hardware | Software | Hardware | Hardware | Hardware | Hardware
Throughput | ~320 Mbit/s | 106 Mbit/s | 90 Mbit/s | Speed up of 1.5 times compared to single CPU | 2 Gbit/s | 256 Mbit/s | 2.1 Gbit/s (decompression only) | 1.6 Gbit/s
Technology | 1.2 µm double metal | 2 µm CMOS | 2 µm p-well CMOS | Processor not mentioned | 0.25 µm CMOS | A500K130 ProASIC FPGA | TSMC 0.25 µm technology | 0.25 µm A500K ProASIC
Compression ratio | Not mentioned | 0.55 | 0.5 | 0.5 | 0.5 | 0.64 | 0.8 | 0.58
Complexity | 128 PE per chip, 30 chips to board | 18K transistors | 11K transistors | Not applicable | 5 million transistors | 32% of device logic, 80% embedded RAM | Not mentioned | 70% of A500K130
Clock speed | 25 MHz | 40 MHz | 90 MHz | Not applicable | Not mentioned | 32 MHz | 45 MHz | 25 MHz
2.3 Summary of Parallel Architectures 
The literature review reveals that research has focused on the exploitation of
parallelism in the Lempel-Ziv algorithm. It is also clear that the earlier work by
Ranganathan and Henriques and by Gonzalez and Storer has spawned considerable
research effort into the development and refinement of systolic array architectures.
Only recently have other parallel architectures been investigated and developed, and
the commercial exploitation of work by Franaszek for main memory compression in
computer servers has been able to deliver performance enhancements that traditional
approaches cannot provide.
In a technical report by Stauffer and Hirschberg [Stauffer93] reviewing the
theoretical and practical aspects of parallel text compression, two important points are
highlighted in their conclusions: "empirical evaluation is a fundamental component of
parallel text compression research just as it is for sequential case" and also that
"dictionary compression systems need to be developed for alternative parallel
models."
The literature review emphasises that lossless parallel compression in hardware is an 
active area of research and that there are parallel architectures remaining to be 
investigated that may offer competitive performance compared to classical 
approaches. The market trend towards multiple processor computers and increases in
logic densities mean that designs that were previously either infeasible or
uncompetitive to implement deserve further investigation.
Chapter 3 
Identification of Research Area 
3.1 Chapter Objectives 
The objective of this chapter is to present background knowledge on the previous
research work on the X-MatchProRLI [Jones00, Nunez99, Nunez01a, Nunez01b,
Nunez02] data compression/decompression core and to propose how this core can be
used to investigate a MIMD architecture aimed at providing significant improvement
in data compression throughput.
3.2. Area of Research 
The literature review reveals that, until lately, the most commonly accepted way of
improving compression performance has been to adapt an existing algorithm with
the aim of exploiting inherent parallelism. However, existing compression algorithms
are not all inherently parallel nor have they been specifically designed to run on single
CPU systems. To adapt them to parallel architectures often requires the introduction
of significant simplifications that limit compression performance. An alternative
parallel approach is to share the compression between a number of identical
algorithms running concurrently. This possibility has only recently become viable
with the availability of systems with multiple CPUs, as previously there would have
been no performance gain had several identical algorithms been run on a single CPU.
Also, limitations in chip logic densities meant earlier designs were restricted to simple 
PE arrays, but with the continued advancement in silicon technology and densities, it 
is now possible to implement a number of complex compression and decompression 
engines on a single chip. One such intuitive architecture is where several
compressors work independently on their own data stored in distributed memory.
Figure 3.1 illustrates schematically the general concept of this approach.
Figure 3.1 Block diagram of Multiple Compressor System (diagram not reproduced; panels: flow diagram, block diagram)
The data stream to be compressed enters the compression system, where it is
partitioned and routed to the compressors. Appropriate methods for routing the data
need to be developed and analysed. Vital to achieving good compression performance,
however, is that the partitioning mechanism supplies the compressors with sufficient
data to keep them active for as great a proportion as possible of the time during which
the stream of data is entering the system. As the compressors operate
independently, each producing its own compressed data stream, a mechanism is 
required to merge these streams in such a way that subsequent decompression can 
reconstruct the original stream. Also, subsequent decompression needs to be capable 
of operating in an appropriate parallel fashion, otherwise a disparity in compression 
and decompression speeds will occur, reducing overall throughput. 
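The partition, route and merge steps described above can be sketched in software. This is only an illustrative model under stated assumptions, not the routing strategy investigated in this thesis: zlib stands in for the compressor engines, blocks are routed round-robin, every compressed block is prefixed with a 4-byte length tag so that the merged stream can be split again, and the names BLOCK, compress_stream and decompress_stream are hypothetical.

```python
import struct
import zlib

BLOCK = 4096  # hypothetical partition size in bytes


def compress_stream(data: bytes, n_compressors: int) -> bytes:
    """Partition the input, route block k to compressor k mod n, then merge
    the independent outputs into a single length-tagged stream."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    queues = [[] for _ in range(n_compressors)]      # one output queue per engine
    for k, block in enumerate(blocks):
        queues[k % n_compressors].append(zlib.compress(block))
    merged = bytearray()
    for k in range(len(blocks)):                     # same round-robin order
        chunk = queues[k % n_compressors][k // n_compressors]
        merged += struct.pack(">I", len(chunk)) + chunk
    return bytes(merged)


def decompress_stream(stream: bytes) -> bytes:
    """Walk the length tags and rebuild the original stream. The blocks are
    independent, so they could equally be handed to parallel decompressors."""
    out, pos = bytearray(), 0
    while pos < len(stream):
        (size,) = struct.unpack_from(">I", stream, pos)
        pos += 4
        out += zlib.decompress(stream[pos:pos + size])
        pos += size
    return bytes(out)


original = bytes(range(256)) * 100
assert decompress_stream(compress_stream(original, 4)) == original
```

The length tags are the price of independent operation: they add overhead to the merged stream but allow decompression to proceed in parallel as well.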
This general architecture raises several questions, such as 
• How does the mechanism used to supply data to each compressor affect the 
performance of the system? 
• How does the mechanism used to route data from the compressors affect the 
performance of the system? 
• What is the impact on compression performance as the system scales? 
Investigation into this chosen architecture for lossless parallel compression is carried
out in order to meet the overall goals of the research, which are to provide
• Improved lossless compression throughput without significantly
compromising other compression performance aspects.
• Scalability in the design, such that future bandwidth requirements can
be met.
3.3. X-MatchProRLI
Developing a hardware data compressor with adequate compression and throughput
performance that is suitable for the investigation of multiple compressor
architectures is not a trivial task and constitutes a major research undertaking.
Selecting a suitable existing compressor is therefore essential if the research
is to meet the project timescales. The research on the X-MatchProRLI
Compression/Decompression engine has led to the development of an extensive
Intellectual Property portfolio. The portfolio contains VHDL code (cycle accurate
RTL), behavioural C++ code (non-cycle accurate for algorithm analysis) and detailed 
documentation allowing ease of integration into other digital designs. 
3.3.1 Algorithm 
The pseudo code for the X-MatchProRLI algorithm [Nunez01a] is as follows:
set the dictionary to its initial state;
set run length count to zero;
DO
{
    read in tuple T from the data stream;
    search the dictionary for tuple T;
    IF (full hit at location zero)
    {
        increment run length count by one;
    }
    ELSE
    {
        IF (run length count = 1)
        {
            output '0';
            output Binary code for ML 0;
            output Huffman code for MT 0;
        }
        IF (run length count > 1)
        {
            output '0';
            output Binary code for ML MAX_TABLE_ENTRIES-1;
            output Binary code for run length;
        }
        set run length count to zero;
        IF (full or partial hit)
        {
            determine the best match location ML and the match type MT;
            output '0';
            output Binary code for ML;
            output Huffman code for MT;
            output any required literal characters of T;
        }
        ELSE
        {
            output '1';
            output tuple T;
        }
    }
    IF (full hit)
        next adaptation vector = move dictionary entries 0 to ML-1 by one location;
    ELSE
        next adaptation vector = move all dictionary entries down by one location;
    adapt next adaptation vector and dictionary using current adaptation vector;
    current adaptation vector = next adaptation vector;
    copy tuple T to dictionary location 0;
}
WHILE (more data is to be compressed);
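The control flow of the pseudo code can be followed more easily in a small software model. The sketch below is a simplification under several assumptions, not the X-MatchProRLI implementation: it emits symbolic tokens instead of binary and Huffman codes, starts from an empty dictionary, ignores the pipelined adaptation vectors, flushes a pending run at the end of the data, and uses hypothetical names and a hypothetical table size throughout.

```python
MAX_TABLE_ENTRIES = 16  # hypothetical; the last location is reserved for runs


def best_match(t, dictionary):
    """Return (location, mask) of the entry matching the most bytes of t,
    or None when no entry matches at least two byte positions."""
    best = None
    for loc, entry in enumerate(dictionary):
        mask = tuple(a == b for a, b in zip(t, entry))
        if sum(mask) >= 2 and (best is None or sum(mask) > sum(best[1])):
            best = (loc, mask)
    return best


def flush_run(run):
    """Tokens for a finished run of full hits at location zero."""
    if run == 1:
        return [("match", 0, (True,) * 4, b"")]   # ordinary full match
    if run > 1:
        return [("run", run)]                     # reserved run-length code
    return []


def compress(data: bytes):
    assert len(data) % 4 == 0, "sketch assumes whole 4-byte tuples"
    dictionary, run, tokens = [], 0, []
    for i in range(0, len(data), 4):
        t = data[i:i + 4]
        if dictionary and dictionary[0] == t:     # full hit at location zero
            run += 1
            continue                              # dictionary is left unchanged
        tokens.extend(flush_run(run))
        run = 0
        hit = best_match(t, dictionary)
        if hit is None:
            tokens.append(("miss", t))
        else:
            loc, mask = hit
            literals = bytes(b for b, m in zip(t, mask) if not m)
            tokens.append(("match", loc, mask, literals))
        if hit is not None and all(hit[1]):
            del dictionary[hit[0]]                # entries 0..ML-1 move down
        elif len(dictionary) == MAX_TABLE_ENTRIES - 1:
            dictionary.pop()                      # all move down, last is lost
        dictionary.insert(0, t)                   # copy tuple T to location 0
    tokens.extend(flush_run(run))                 # assumption: flush at the end
    return tokens


print(compress(b"ABCD" * 3 + b"ABCE"))
```

For the input shown, the first tuple is a miss, the two repeats are folded into a single run token, and the final tuple is a partial match at location 0 with one literal byte.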
3.3.2 X-MatchProRLI Architecture and Performance
Figure 3.2 illustrates a high level block diagram of the main features in the
X-MatchProRLI design. The dictionary is used in the searching and construction of the
model from the arriving data. Adaptation of the dictionary to the new incoming data is
maintained by the ODA logic and move generation logic. As several matches may
occur, the best match decision logic is required to select from the candidate matches
the one that achieves the greatest compression. The coder then codes the selected
match and sends it to the bit assembly logic, which packs the variable-length
codes into compressed data words of a fixed length. When decompressing, the
same dictionary and adaptation modules are used. However, it is necessary to employ
separate logic for the disassembly of the fixed length of compressed data and for
decoding the compressed data.
Fig. 3.2. Block Diagram of the X-MatchProRLI Architecture (diagram not reproduced)
Figure 3.3 shows the compression ratios of the X-MatchProRLI (X-RLI) algorithm
[Nunez01a] compared with other software and hardware-based algorithms when
applied to the Canterbury corpus. The hardware compression implementations are the
LZS [Hifn] hardware-based algorithm used in Hifn devices, DCLZ [AHA] developed
by AHA and the IBM ALDC [Slattery98] algorithm. The software algorithms are the
dictionary-based PKZIP algorithm [PKZIP] and the PPMZ algorithm [PPMZ], which uses
high-order context modelling with arithmetic coding.
The two software-based compression algorithms offer the best compression with
similar results. However, these require greater computational resources than a
corresponding hardware solution and operate at a lower throughput. The three
commercially available hardware algorithms offer very similar performance to each
other, whilst X-RLI falls behind. The restricted performance of X-RLI on the
Canterbury corpus can be explained by the fact that textual data does not follow the
four-byte granularity of the algorithm. However, when applied to other datasets such
as memory (Figure 3.4), X-RLI exhibits similar compression performance to the other
hardware algorithms.
Fig. 3.3. Comparison of Compression Ratios of Typical Compression Algorithms using the Canterbury Corpus (plot not reproduced; vertical axis ticks from 0.2 to 0.8; horizontal axis: block size of 256 bytes, 1 Kb, 4 Kb, 16 Kb and File; series: PKZIP, PPMZ, ALDC, LZS, DCLZ, X-RLI)
Fig. 3.4. Comparison of Compression Ratios of Typical Compression Algorithms using the Memory Corpus (plot not reproduced; vertical axis ticks from 0.2 to 0.8; horizontal axis: block size of 256 bytes, 1 Kb, 4 Kb, 16 Kb and File; series: PKZIP, PPMZ, ALDC, LZS, DCLZ, X-RLI)
3.4 Summary 
This chapter identified the area of research for this thesis, along with the selection of a 
compressor as the base compression engine for a multiple compressor system. 
Chapter 4 
Experimental Framework 
4.1 Chapter Objectives 
This chapter sets up the experimental framework, specifically covering 
• dataset selection
• measurement definitions 
• tools and methods for design space exploration 
4.2 Datasets 
To be able to predict the performance of any data compressor system, it is important 
that the test data are obtained from a representative sample of the types of data that 
are likely to be compressed in practice. The datasets used in this thesis are the same
as the ones used in previous work [Gooch96, Nunez01a], allowing direct comparisons
to be made with that work. The datasets selected are the industry standard for
lossless compression, namely the Canterbury corpus [Canterbury], as well as other
datasets collected in our research lab. These include a memory dataset captured
directly from main memory in a UNIX workstation whilst running a range of
applications and the datasets found in a typical workstation hard disk, which are
categorised as 'application', 'general', 'user' and 'executable'.
• Memory Data Set is a selection of data files of about 80 MB contained in memory 
and includes code and data from the SunOS operating system and eight real 
applications and utility programs. The set contains nine files from the SunOS 
operating system, Netscape, Emacs, Textedit, Ghostview, Xman, Matlab, 
Vlabplus and Logsyn, as listed in Table 4.1. For experimentation purposes, these 
files have been shortened to 9 MB of data, 1 MB from each file. 
Table 4.1 Memory Dataset
Category No. of Files Size (KBytes)
Xman - Unix manual page 1 1000
Text - Textedit with a small C source file open 1 1000
Ghos - Ghostscript postscript viewer with a technical paper open 1 1000
Emac - Emacs text editor with an elaborate set-up and a few buffers open 1 1000
Nets - Netscape world-wide-web viewer after some 'net-surfing' activity 1 1000
Vlab - VlabPlus analogue simulator from Intergraph during extraction and spice simulation of a parallel multiplier 1 1000
Suno - Approximation to the operating system SunOS working set 1 1000
Matl - Matlab matrix laboratory running a benchmark program 1 1000
Logs - Logsyn logic synthesis tool from Intergraph during logic optimisation of a parallel multiplier 1 1000
Total 9 9000
• Hard disk Data Set is a collection of 65 files, which range in size from 3K to 
450K, from audio, images, object and text files. This set was obtained for a thesis 
written by Gooch [Gooch96]. Tables 4.2, 4.3, 4.4 and 4.5 list the Application, 
Executable, General and User files. Each table lists the file categories, the number
of files in each category and their total size.
Table 4.2 Application disc data set 
Category No. of Files Size (KBytes) 
Library files from Cadence CAE system 5 1211 
Component files from Intergraph CAE system 5 959 
Simulation libraries from Intergraph CAE system 4 1649 
VHDL libraries from Intergraph CAE system 6 785 
Logic synthesis libraries from Intergraph CAE system 8 39 
Matlab function libraries 6 259
Simulation libraries from Synopsys CAE system 9 1232
Parts files from Unigraphics mechanical CAD/CAM system 3 988 
Data files from Visilog image processing system 2 779 
Parts files from Xilinx CAE system 4 77 
Total 52 8675 
Table 4.3 Executable disc data set. 
Category No. of Files Size (KBytes) 
General applications (ghostview, tin, xups and matlab) 4 4546
CAE applications from Intergraph and Xilinx 3 4841 
System Applications (sed, awk, xcal, gtar) 4 633 
User programs 5 189 
Total 16 10209 
Table 4.4 General disc data set.
Category No. of Files Size (Kbytes) 
System font and keyboard definition files 10 863 
System library files 6 758 
General operating system files 7 2085 
Manual pages 9 748 
Documents in either postscript, html or pdf format 12 2357
ASCII text files 3 372 
Total 47 7183 
Table 4.5 User disc data set. 
Category No. of Files Size (Kbytes)
CAE files from Intergraph and Xilinx 10 5942 
Microsoft Excel spreadsheet files 3 328 
Graphics files using Coredraw, Drawperfect and Microsoft powerpoint 5 1360 
ASCII textual files (C and VHDL source code and a mail folder) 7 290 
Results/statistics files 5 528 
Word processing files from Wordperfect and Microsoft Word 4 2175 
Total 34 10623 
The Canterbury dataset [Arnold97] has fairly recently been introduced as a standard
dataset, giving the data compression research community a common reference for
comparing algorithms. It was developed to replace the ageing Calgary dataset,
as it was felt that data compression designers were over-tailoring their algorithms to
those particular files [Bell90], and to include representative data found in modern
computer systems. The authors' final conclusion was that the Calgary corpus and the
Canterbury corpus offer very similar results, so the new corpus does not invalidate the
results obtained with the Calgary corpus. Table 4.6 shows the corresponding files, with
their categories and sizes.
Table 4.6. Canterbury data set.
Category                             No. of Files   Size (KBytes)
alice29.txt - English text           1              148
ptt5 - Fax images                    1              501
Fields.c - C source code             1              11.3
Kennedy.xls - Spreadsheet files      1              1003.5
Sum - SPARC executables              1              37.3
Lcet10.txt - Technical documents     1              416
Plrabn12.txt - English poetry        1              470
Cp.html - html                       1              24.6
Grammar.lsp - lisp source code       1              3.72
Xargs.1 - GNU manual pages           1              4.23
Asyoulik.txt - Plays                 1              126
Total                                11             2745.65
4.3 Measurement Definitions 
The comparison and summarising of the performances of architectures and systems
are often open to interpretation, and discrepancies often result from differing
assumptions or lack of information. To avoid such ambiguities, the following
measurement definitions are used in this body of work.
Compression ratio (CR): Compression ratio is defined as the ratio of the number of
output bits after compression to the number of input bits before compression.
CR = Output bits / Input bits
This means that the smaller the figure, the better the compression. A value larger than
1 implies that data expansion occurs and that no data compression took place.
Compression is obtained whenever the CR value is in the range 0 to 1.
For example, if 100 MBytes of uncompressed input data are compressed to 50 MBytes
of compressed data, then a CR of 0.5 is achieved.
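The definition above can be checked with a few lines of Python. This is only an illustrative sketch: the function names, the "no change" label for a CR of exactly 1 and the choice of 2**20 bytes per MByte are assumptions made for the example and are not taken from the thesis.

```python
def compression_ratio(output_bits: int, input_bits: int) -> float:
    """Return CR = output bits / input bits (a smaller value is better)."""
    if input_bits <= 0:
        raise ValueError("input_bits must be positive")
    return output_bits / input_bits


def describe(cr: float) -> str:
    """Classify a CR value; the label for exactly 1.0 is an assumption."""
    if cr < 1.0:
        return "compression"
    if cr > 1.0:
        return "expansion"
    return "no change"


# Assumed unit: 1 MByte = 2**20 bytes. The ratio does not depend on this choice.
MBYTE_BITS = 8 * 2**20

cr = compression_ratio(50 * MBYTE_BITS, 100 * MBYTE_BITS)
print(cr, describe(cr))  # 0.5 compression
```

Because CR is a ratio of two quantities in the same unit, the same result is obtained whether the sizes are given in bits, bytes or MBytes.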
Throughput: This is the maximum sustained flow of data moved successfully through
the system in a period of time. This is normally measured in bit/s or Byte/s. However,
the timing performance will change for different silicon implementations (depending on
the process technology used, FPGA or ASIC); therefore a more general measurement of
bit/clock cycle will be used. The clock cycle time can be determined when the process
technology is chosen and hence the bit/s throughput can be established.
Latency: Latency is usually taken as the delay between sending an item of data from
point A and its arrival at point B and is measured in units of time. However, for
compression, if latency is calculated from data first entering the system to data
emerging from the system, uncertainty in the latency time will arise due to the amount of
compression achieved (that is, latency will be dependent on the data). Therefore, to
have an accurate measurement of latency for compression, a slight modification to
this definition is needed. The new definition is taken as the time interval between the
last bit of data entering the system and the last bit emerging. Again, depending on the
technology implementation chosen, the latency time will change; therefore latency
will be in units of clock cycles. Latency in seconds can be determined from the
number of latency clock cycles and the clock frequency when implemented in silicon.
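As a small illustration of how the technology-independent units convert back to time-based units once a clock frequency is known, the following Python sketch may help. The 100 MHz clock is a purely hypothetical value chosen for the example, and the function names are not from the thesis.

```python
def throughput_bits_per_second(bits_per_cycle: float, clock_hz: float) -> float:
    """Convert a throughput in bit/clock cycle to bit/s."""
    return bits_per_cycle * clock_hz


def latency_seconds(latency_cycles: float, clock_hz: float) -> float:
    """Convert a latency in clock cycles to seconds."""
    return latency_cycles / clock_hz


CLOCK_HZ = 100e6  # hypothetical 100 MHz implementation, for illustration only

# A design moving 32 bit/clock cycle with a latency of 6 clock cycles.
print(throughput_bits_per_second(32, CLOCK_HZ))  # bit/s
print(latency_seconds(6, CLOCK_HZ))              # seconds
```

Keeping the measurements in clock cycles means that only `CLOCK_HZ` changes when the same architecture is retargeted to a different process technology.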
4.4 System Modelling 
Until recently, the vast majority of digital design methods involved first developing a
high-level behavioural model capturing the required functionality of the design, usually
written in C or C++, to verify the design concepts and algorithms. After the
concepts are validated and meet the desired functionality, the parts to be implemented
in hardware are manually converted to a VHDL or Verilog description, ready to be
taken into silicon. This is because low-level methods such as schematic capture or
hardware description languages (HDLs) are becoming inappropriate for analysing and
fully modelling a system as digital designs grow increasingly complex. However,
higher-level behavioural models in C++ are unable to easily simulate certain aspects
of digital design, as the language was originally designed for sequential software;
for example, the concepts of clock signals and concurrency are not
standard. To alleviate some of the problems with the increased complexity and
functionality facing digital designers, there have been several concerted efforts aimed
at creating a standard system-level language that allows designs to be described at
various levels of abstraction.
4.4.1 System Design Languages 
In the development of system design languages there seem to be two converging
paths. From the hardware perspective, there are attempts at modifying standard HDLs
(VHDL, Verilog) to produce constructs that give richer system models. On the
software side, languages used in software design (C, C++) are being extended to be
suitable for modelling hardware by adding support for timing and concurrency.
However, it is hard to determine which of these languages will come to dominate,
as a designer's preference usually depends on whether their background is
software or hardware. Since the choice of design language is not always clear, subjective
language wars often exist, with exponents able to give detailed and equally valid
explanations of the benefits of one language over another. As an example of this,
there are two competing hardware design languages, with VHDL dominating in
Europe and Verilog dominating in the USA. Without delving too deeply into the
language debate, the next few sections give a brief description of the main
contenders in system design languages and an introduction to some of their
principal advantages and disadvantages.
4.4.1.1 Handel-C 
The Handel-C [Handel-C] language and design environment was one of the first
system-level design developments and is being driven by Celoxica. Celoxica is a
spin-off company formed by the University of Oxford in 1996 to commercialise its
research into the Handel-C programming language. Handel-C is aimed at compiling
high-level algorithms directly into gate-level hardware. Handel-C uses a subset of
ANSI C with additional syntax for describing concurrency in the system. Figure
4.1 shows the Handel-C design flow, but it is worth noting that to date the design tools
are restricted to those produced by Celoxica and there is limited adoption by other
vendors.
Figure 4.1 Celoxica design flow for Handel-C
Advantages of Handel-C 
• Higher level of abstraction than most competitors 
• Familiar to software designers, as the language is based on the ANSI C standard
with additional syntax to represent hardware.
• Automatically deals with clock signals, so abstracting away some of the
complexity of hardware design.
Disadvantages of Handel-C
• Design tied to a proprietary system design language
• Limited toolset 
• Familiarity with C needed 
• Mainly aimed at FPGAs 
4.4.1.2 SystemC 
SystemC [SystemC] has been created to form a single standard open source system
level language for the design and verification of software and hardware architectures.
The SystemC community consists of a large and growing number of system,
semiconductor, IP, embedded software and EDA companies working
towards the submission of the SystemC language for IEEE standardisation. SystemC 
fits into the design process by providing functionality for system-level design and 
verification, while relying on much of the existing RTL Hardware Design Language 
(HDL) flow methodology to provide a path to silicon. Figure 4.2 illustrates the design 
flow and abstraction available in SystemC (taken from SystemC user guide version 
2.0) 
[Figure: diagram with the labels Design Exploration, Performance Analysis, HW/SW
Partitioning, Target RTOS, Untimed Functional, Timed Functional, Hardware, Bus Cycle
Accurate, Cycle Accurate and Synthesizable]
Fig. 4.2. SystemC abstractions and design flow
SystemC uses the object-oriented programming concept of classes, implemented as C++
libraries. It defines parent classes that describe ports, modules, signals, processes, etc.,
from which user code can inherit to build hardware and software descriptions. To simulate
the design, standard C++ streams can be used for console output, and tracing methods can
create VCD files that can be viewed in third-party software. A compiled
SystemC design takes the form of an executable that can be run from an operating
system prompt.
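To give a feel for what the clock and concurrency support adds over plain sequential code, the following toy Python sketch models signals whose new values only become visible after every clocked process has run. It is not SystemC and does not use the SystemC API; the `Signal`, `Register` and `run` names are hypothetical and exist only for this illustration.

```python
class Signal:
    """A value that only changes when the simulator commits pending writes."""

    def __init__(self, initial=0):
        self._current = initial
        self._next = initial

    def read(self):
        return self._current

    def write(self, value):
        self._next = value  # becomes visible after commit()

    def commit(self):
        self._current = self._next


class Register:
    """A clocked process: copies input d to output q on each clock edge."""

    def __init__(self, d, q):
        self.d, self.q = d, q

    def on_clock(self):
        self.q.write(self.d.read())


def run(modules, signals, cycles):
    """Advance the model: evaluate every process, then update every signal."""
    for _ in range(cycles):
        for module in modules:
            module.on_clock()
        for signal in signals:
            signal.commit()


def build():
    """Two registers in series, forming a two-stage shift register."""
    a, b, c = Signal(5), Signal(), Signal()
    return [Register(a, b), Register(b, c)], [a, b, c]


modules, signals = build()
run(modules, signals, 2)
print(signals[2].read())  # 5: the input reaches the output after two clocks
```

Separating evaluation from update means the result does not depend on the order in which the processes are called, which is the property that makes the processes behave as if they ran concurrently.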
Advantages: 
• Simulation speed adjusted according to the level of abstraction required 
• One language environment for mixed hardware/software design 
• High level of abstraction available - suitable for system descriptions 
• Source code is open source, is not tied to one EDA company, and the
language has recently been submitted for IEEE standardisation
• Executables are able to be fanned out to several computers to speed up 
system simulation times 
Disadvantages: 
• Requires some familiarity with C++ 
• As a new language there is limited support of certain tools (design entry, 
synthesis, etc.) 
• Uncertain future, as a number of rivals with significant backing are trying to 
push alternative solutions, e.g. System Verilog 
• Language still evolving as designers request new features. 
4.4.1.3 SystemVerilog
SystemVerilog represents an overhaul of the Verilog language, but it is to remain fully
compatible with the IEEE 1364-2001 Verilog standard. The proposed standard is
directed by Accellera [Accellera]. SystemVerilog extends the Verilog language
beyond RTL, by including basic capabilities for verification and more abstract
modelling, including some C++-like constructs. The first version, SystemVerilog 3.0,
was approved in June 2002. SystemVerilog supports built-in C types to provide a
clear translation to and from C for better encapsulation and code compaction. The C
types are also intended to provide users with improved methods for creating algorithmic
models and to advance the abstract syntax a designer can use to create efficient
synthesisable code. To improve modelling at higher abstraction levels, SystemVerilog
adds interfaces to help model the communication between blocks of a digital system,
which helps facilitate design reuse.
Advantages: 
• Backward compatible with Verilog standard 
• Able to use already developed and proven Verilog design flow 
• Several large EDA companies are pushing the standard 
• Hardware engineers familiar with Verilog will be able to move easily to
SystemVerilog
Disadvantages: 
• Unfamiliar language for software developers 
• Toolset is still limited 
• Uncertain future, as alternatives such as SystemC have many supporters,
including EDA vendors.
4.4.2 System Design Language Summary 
Of the more widely known system design languages presented, SystemC was
chosen for this work in order to model and analyse the routing of a multiple
compressor system. As the development of system design languages is still in its
infancy, SystemC offers the greater protection of the IP generated due to its open source
nature. SystemC has also had broader take-up from the commercial and academic
communities, and there is a significant body of knowledge that can be tapped by new
researchers.
4.5. Summary 
This chapter has introduced a selection of datasets used to analyse the performance of
the design and to allow valid comparisons with previous research. It also defines the
terms used to measure system performance. Also covered are the available system
design languages and the justification for choosing the SystemC design
methodology.
Chapter 5 Practical Investigation of Input and Output Routing Strategies
Chapter 5 
Practical Investigation of Input and Output 
Routing Strategies 
5.1. Chapter Objectives 
This chapter investigates the impact that alternative routing schemes have on system
performance when compressors are implemented in a MIMD architecture.
Specifically, the objectives of this chapter are to:
• Identify, investigate and develop input and output data routing techniques for a
MIMD architecture.
• Determine the effect input routing has on the compression, throughput and 
latency performance. 
• Determine the effect output routing has on the compression, throughput and 
latency performance. 
• Investigate how the system performs as the architecture is scaled. 
5.2. MIMD Architecture Routing Strategies 
There are several approaches in which data can be routed to and from the 
compressors. Two methods for supplying data are identified, termed interleaved and 
blocked. In the interleaved technique, data are separated into individual data streams 
for each compressor. In the blocked method, sections of fixed data length are routed 
to each compressor in turn. For output routing, schemes include sending single and 
multiple compressed blocks to the output, and interleaving compressed blocks 
together. The following sections describe the input and output routing methods in 
detail, along with an investigation of their performance in relation to the stated 
chapter objectives. 
5.2.1. Input Routing 
The two schemes for routing the data into the system are blocked and interleaved.
The left half of figure 5.1 shows the routing of data to the compressors in blocked
format. A fixed-length block of data is sent from the incoming data stream to each of
the compressors in turn. To minimise the latency introduced by the blocked mode, the
compressors are required to start processing data as soon as it arrives. Data must be
written to the memory dedicated to each compressor at a rate faster than it can be
processed, so that the memory fills with data. This ensures that sufficient data are
available to the compressor, since no new data can be added to its memory whilst data
are routed to the other compressors. The right half of figure 5.1
illustrates the routing of data to the compressors in an interleaved input approach for
two compressors. The interleaved method avoids the need for buffering, as data fed to
the compressors are acted upon as they arrive. For example, in a 64-bit wide input
stream, the first set of 32 bits is allocated to the first compressor and the second set of
32 bits is routed to the second compressor. Although both techniques shown in figure
5.1 are for two compressors, they can be extended to supply data to any required
number of compressors. Analyses of these techniques are carried out in the following
sections.
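The difference between the two input schemes can be sketched in a few lines of Python. This is a behavioural illustration only: each list element stands for one 32-bit word, and the function names and the word-level granularity are assumptions made for the example rather than details of the thesis models.

```python
from typing import List, Sequence


def route_blocked(words: Sequence, n_compressors: int, block_words: int) -> List[list]:
    """Send fixed-length blocks of consecutive words to each compressor in turn."""
    streams: List[list] = [[] for _ in range(n_compressors)]
    for start in range(0, len(words), block_words):
        target = (start // block_words) % n_compressors
        streams[target].extend(words[start:start + block_words])
    return streams


def route_interleaved(words: Sequence, n_compressors: int) -> List[list]:
    """Send successive words to successive compressors, one word at a time."""
    return [list(words[i::n_compressors]) for i in range(n_compressors)]


words = list(range(8))  # eight 32-bit words, numbered in arrival order

print(route_blocked(words, n_compressors=2, block_words=2))
# [[0, 1, 4, 5], [2, 3, 6, 7]]
print(route_interleaved(words, n_compressors=2))
# [[0, 2, 4, 6], [1, 3, 5, 7]]
```

In the blocked output each compressor still sees runs of neighbouring words, whereas in the interleaved output neighbouring words always end up in different compressors.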
[Figure: two panels, BLOCKED and INTERLEAVED, each showing uncompressed data routed
to the compressors]
Figure 5.1. Illustrations of blocked and interleaved input routing
5.2.1.1. Interleaved Input 
In the interleaved input approach, the router divides the input data into 4-byte wide 
data streams that are fed into each of the compressors. The interleaved method avoids 
the need for input buffering as data are continuously fed to the compressors and acted 
upon immediately on arrival. This minimisation of latency is an important advantage 
of the interleaved approach. Also this strategy benefits from not needing additional 
silicon real estate to buffer temporary data in readiness for processing. 
However, experimentation is required to determine the compression performance 
of the interleaved approach. Figure 5.2 shows the mean compression ratios for the test 
datasets as the number of compressors in the system is increased from one to eight. 
For the initial experimentation a 16 word deep dictionary for the compressor was 
chosen. 
[Figure: bar chart of mean compression ratio (0 to 0.8) for the Application, Canterbury,
Executable, General, Memory and User datasets, with bars for one, two, four and eight
compressors]
Figure 5.2 Mean compression ratio for a range of datasets using the interleaved
data method for input routing. The compressors have an internal 16 word deep
dictionary.
Although the throughput of data is increased by four bytes per clock cycle for each 
additional compressor, the results confirm that the attainable compression ratio per 
compressor worsens as the number of compressors is increased. On further 
investigation, this appears to arise because of a loss in data locality imparted by the 
interleaved approach. As the reoccurrence of a symbol is more likely the more closely
the data are spatially or temporally related, the models of the data constructed in the
individual dictionaries of the compressors are generally not as efficient at identifying
and removing redundancy as would be the case for a single compressor. For example,
in the case of a two-compressor interleaved system, where the data are represented by
symbol one, symbol two and symbol three, the information used to construct the
dictionary model of the data in the first compressor will come from symbols one and three.
In contrast, for a single compressor, symbols one and two would be used in the
construction of the dictionary.
The interleaved input routing is further tested to verify that the reduced
compression performance still applies for larger dictionaries. This investigation will
assess whether larger dictionaries, which allow a more accurate model of the
redundancy in the data to be constructed, can overcome the loss of locality in the
interleaved input routing.
[Figure: bar chart of mean compression ratio (0 to 0.8) for the Application, Canterbury,
Executable, General, Memory and User datasets, with bars for one, two, four and eight
compressors]
Figure 5.3 Mean compression ratio for a range of datasets using the interleaved
data method for input routing. The compressors have an internal 32 word deep
dictionary.
[Figure: bar chart of mean compression ratio (0 to 0.8) for the Application, Canterbury,
Executable, General, Memory and User datasets, with bars for one, two, four and eight
compressors]
Figure 5.4 Mean compression ratios for a range of datasets using the interleaved
data method for input routing. The compressors have an internal 64 word deep
dictionary.
Figures 5.3 and 5.4 both illustrate that interleaved input routing still degrades
the achievable compression ratio when larger dictionaries are used, across all of
the datasets. However, having larger dictionaries is still beneficial to the mean
compression ratio. On further analysis of the results, table 5.1 shows the percentage
changes in compression ratio between the one compressor and eight compressor
systems for all datasets with dictionary lengths of 16, 32 and 64. These results reveal
that interleaving the data has a greater effect on the compression ratio with larger
dictionaries than with a smaller dictionary.
Data Set         Compressor Dictionary
                 16        32        64
                 9.5%      10.2%     10.5%
                 2.6%      4.2%      6%
                 8.7%      11.5%     14.3%
                 3.7%      7.1%      7.6%
                 11.3%     15.5%     17%
                 4.8%      7%        9.6%
Table 5.1 Percentage changes in mean compression ratio when ranging the
number of compressors between one and eight with interleaved input
There is clearly a cost involved (extra bits needed) in addressing all the dictionary
locations in a larger dictionary. This is usually outweighed by the increased number of
data matches occurring in the dictionary, hence an overall better mean compression
ratio is achieved with larger dictionaries. However, when interleaving data across
larger dictionaries, more matches still occur than with smaller dictionaries, but
not as many as in the non-interleaved case. The cost of addressing the
additional locations with fewer matches therefore has more of an effect than with smaller
dictionaries.
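The addressing cost mentioned above can be made concrete with a short calculation. The sketch below assumes that a match location is sent as a plain fixed-width binary index; the actual compressor may code locations differently, so the figures are indicative only, and the function name is hypothetical.

```python
def index_bits(dictionary_words: int) -> int:
    """Bits needed for a fixed-width binary index into the dictionary."""
    if dictionary_words < 1:
        raise ValueError("dictionary must hold at least one word")
    return max(1, (dictionary_words - 1).bit_length())


for size in (16, 32, 64):
    print(size, index_bits(size))
# 16 4
# 32 5
# 64 6
```

Under this assumption each doubling of the dictionary adds one bit to every coded match, so the extra matches found must save more than that bit on average for the larger dictionary to pay off.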
5.2.1.2. Input Blocked - Theoretical 
For the blocked input routing performance, certain aspects can be calculated, such as 
the ideal input rate to keep all the compressors active, the latency introduced from 
having to buffer the input data and the size of these buffers. Experimentation is 
needed to determine the compression ratio and to verify the theoretical calculations. 
Rearranging the consumption rate equation from chapter 1 (Equation 1.1) in terms of
time gives
Equation 5.1. $T = N / C$
where $T$ = time (clock cycles), $N$ = block length (bits) and
$C$ = consumption rate (bits/clock cycle).
Due to the pipelining in the compressor, there is a number of clock cycles of delay
introduced to flush data through the pipeline before it is ready to accept a new block
of data. This leads to a modified equation for the time to consume a block of data. In
the following, $T_{burst}$ is the minimum time required before a new burst of data can be
accepted.
Equation 5.2. $T_{burst} = (N / C) + D$
where $D$ = pipelining delay (clock cycles).
Taking the equations for production rate and the modified consumption rate gives
Equation 5.3. $T_{burst} = N / P$ and consequently $T_{burst} = (N / C) + D$
Equation 5.4. $N / P = (N / C) + D$
Rearranging to put in terms of production rate, $P$
Equation 5.5. $P = N / ((N / C) + D)$
This equation gives the ideal production rate for a single compressor in terms of block
length ($N$) and consumption rate ($C$) if overflow or underflow conditions of the
buffers are not to occur. For a multiple compressor system, this needs to be multiplied
by the number of compressors in the system, termed $X$. This results in the following
equation that can be used to calculate the ideal production rate ($P_X$) to supply enough
data for a multiple compressor system.
Equation 5.6. $P_X = X N / ((N / C) + D)$
By changing the parameters block length (N) and number of compressors (X), it is
possible to investigate how the input rate (sustainable throughput) changes for
different architecture setups. The results are plotted in Figure 5.5. In the case of the
X-MatchProRli compressor, the pipelining delay is six clock cycles and the consumption
rate of each compressor is 32 bit/clock cycle.
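Equation 5.6 is straightforward to evaluate numerically. The following Python sketch uses the stated values of 32 bit/clock cycle and six clock cycles as defaults; the function and parameter names are chosen for this illustration, and the block lengths and compressor counts printed are sample points rather than values read from the thesis results.

```python
def burst_time(n_bits: float, c: float = 32.0, d: float = 6.0) -> float:
    """Equation 5.2: clock cycles before a compressor can accept a new block."""
    return n_bits / c + d


def ideal_production_rate(n_bits: float, x: int = 1,
                          c: float = 32.0, d: float = 6.0) -> float:
    """Equation 5.6: ideal input rate (bit/clock cycle) for x compressors."""
    return x * n_bits / burst_time(n_bits, c, d)


for n_bits in (256, 1024, 8192, 16384):
    rates = [round(ideal_production_rate(n_bits, x), 2) for x in (2, 4, 8)]
    print(n_bits, rates)
```

As the block length grows, the rate for each compressor approaches, but never reaches, the 32 bit/clock cycle consumption rate, because the six-cycle flush is amortised over more data.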
[Figure: line graph of input rate (bit/clock cycle, 0 to 400) against block length in bits
(0 to 16000), with one curve for each number of processors from 2 to 11]
Figure 5.5 Graph of ideal input rate (bit/clock cycle) against block length (bit)
for the buffers to remain in steady state as the number of compressors is raised.
Figure 5.5 demonstrates that, as the number of processors is increased, the production
rate increases, since more compressors are available to compress the data and so
the throughput is greater. Also, greater block lengths allow a higher throughput than
smaller block lengths for a given number of compressors, because the fixed
pipeline delay has a diminishing effect on the total compression
time when blocks are larger. The conclusion that can be drawn is that it is better for the
blocked input routing to have as large a block as possible in order to reduce the
overhead from flushing the pipeline after processing each block of data.
Now that the rate at which data enters the system for a varying number of
compressors and block size can be determined, the latency for each case can be
calculated. As mentioned in the queuing theory section of the thesis, the emptying
time E can be viewed as the time it takes for the last message that was put into the
queue at the end of the burst to move through the queue and exit. Since the queue is
at its longest at the end of a burst, this is the longest time it will take for any item of
data to go through the system, meaning this is the worst-case latency that is
introduced by the queue if it is completely empty before a new burst of data. The
worst-case latency E is given by
Equation 5.7. $E = T_c - T_p$
where $T_p$ is the time to route a block of data to the buffer and $T_c$ is the time to
consume all the data in the buffer and be ready to start a new block. If $C$ is the
consumption rate of a single compressor and $P_X$ is the production rate of the multiple
compressor system, then $T_p = N / P_X$ and
Equation 5.8. $T_c = (N / C) + D$
This allows the latency for a particular block size and production rate to be calculated. The
worst-case latency can now be obtained from
Equation 5.9. $E = ((N / C) + D) - N / P_X$
Expanding $P_X$ from Equation 5.6 allows the worst-case latency to be obtained in terms
of block size, pipeline delay, consumption rate of a single compressor and the number of
compressors in the system.
Equation 5.10. $E = ((N / C) + D) - N / (X N / ((N / C) + D))$
For the X-MatchProRli case, the consumption rate and pipeline delay are respectively 32
bit/clock cycle and 6 clock cycles. The number of compressors and the block length
are varied to determine their effect on latency; the variation of the latency against
block length for a range of compressor counts is shown in Figure 5.6.
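Equation 5.10 can be evaluated step by step, as in the Python sketch below. Since N / (XN / T_c) equals T_c / X, the expression is algebraically the same as E = ((N/C) + D)(1 - 1/X); that simplification is an observation added here, not a formula stated in the thesis, and the function names are hypothetical.

```python
def worst_case_latency(n_bits: float, x: int,
                       c: float = 32.0, d: float = 6.0) -> float:
    """Equation 5.10: worst-case input latency in clock cycles."""
    t_c = n_bits / c + d      # Equation 5.8: time to consume one block
    p_x = x * n_bits / t_c    # Equation 5.6: ideal production rate
    t_p = n_bits / p_x        # time to route one block into a buffer
    return t_c - t_p          # Equation 5.7


def simplified_latency(n_bits: float, x: int,
                       c: float = 32.0, d: float = 6.0) -> float:
    """Algebraically equivalent form: T_c * (1 - 1/X)."""
    return (n_bits / c + d) * (1.0 - 1.0 / x)


for x in (2, 4, 8, 11):
    print(x, round(worst_case_latency(16384, x), 1))
```

The simplified form shows directly why each extra compressor contributes less: the latency tends towards T_c as X grows, and the step from X to X + 1 compressors adds only T_c / (X(X + 1)) clock cycles.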
[Figure: line graph of latency (clock cycles, up to 450) against block length in bits
(2000 to 16000), with one curve for each number of processors from 2 to 11]
Figure 5.6. Graph of input latency (clock cycles) against block length (bits) for
various numbers of processors.
Figure 5.6 shows that as the block length routed to each compressor increases, so does
the latency introduced. This is due to the larger quantity of data that needs to be buffered
and processed in this approach. As the number of compressors in the system
increases, each additional compressor makes a diminishing contribution to the
overall latency. This can be explained by the fact that, for a given block length, the
time to process it is the same no matter how many compressors are in the system.
However, the time to route the data into an individual compressor's buffer decreases
as the input rate is increased to the level needed to keep all compressors
supplied. The overall latency (the time to empty a buffer less the time to fill
it) therefore increases, while the time taken to fill the buffer decreases with each
additional compressor.
5.2.1.3. Input Blocked Experimentation 
Experiments were also carried out to investigate the effect of block routing on 
compression ratio performance. A behavioural model of the X-MatchProRli was used 
to compress all of the datasets using a range of block lengths. 
Figure 5.7 shows the change in mean compression ratio for each data set over a range of
block lengths when using a 16-word dictionary.
[Figure: line graph of compression ratio (0.5 to 0.85) against block length in bits
(0 to 8000), with one curve each for the application, canterbury, executable, general,
memory and user datasets]
Figure 5.7. Effect of the block size on the compression ratio, using a 16-word
dictionary
The results reveal that, for block lengths above 1 kbit, further increases in block length
have a minimal effect on the compression ratio. Larger dictionaries maintain the
characteristic curve shown for the 16-word dictionary, but the mean compression ratio
for each dataset improves. By choosing the block length to be where this plateau
begins, the latency introduced into the system due to the data being blocked is
minimised. An additional benefit of having a smaller block length is that the
distributed memory requirement for each compressor is reduced. Using blocks of data
retains locality in the data; therefore, as the number of compressors in the system
increases, the compression ratio remains the same as for a single compressor. The
compression ratio is better for larger block lengths than for smaller block lengths,
as, with a larger block length, there are additional data available to build the dictionary
and consequently better modelling of the input data source can be achieved. The
differing mean compression ratios of the datasets demonstrate varying degrees
of redundancy in the different types of data.
Blocked input compressor utilisation 
It is important to determine the transient (start-up) characteristics of multiple
compressor systems as well as the steady state behaviour in order to understand how
the system performs over time. At start-up for the blocked scheme, data are routed to
each compressor in turn. Consequently, a certain number of clock cycles is required
before sufficient data are available for all the compressors to process and the full
benefits of multiple compressors can be reaped.
Cycle-accurate SystemC models of architectures consisting of two and four 
compressors have been developed and were used to simulate and monitor the 
compressors' utilisation over time. Figure 5.8 is a block diagram of the main 
components in the design. The 'Input Routing Control' module controls the quantity of 
data routed to each compressor, which in this case is set to blocked mode; the 'Control 
Logic' provides signals to control the compressor core and the reading of data from 
the input FIFO. In this simulation, large output FIFOs are selected so as not to cause 
any stall conditions that will affect the utilisation of the compressors. 
Figure 5.8. Block diagram of a two processor system to monitor compressor 
utilisation 
Table 5.2 shows the beginning of a log, which monitors and records how the different 
parts of the system react at start up for a two-processor system. In this example, the 
small block length (4 clock cycles * 64 bit) is deliberately selected to demonstrate, 
early on in the log, the two processors working independently from one another. 
Table 5.2. Example status log of system components at start up
Clock   INPUT   INPUT   X-MatchPro1   X-MatchPro2   OUTPUT   OUTPUT
Cycle   FIFO1   FIFO2                               FIFO1    FIFO2
1       0       0       Idle          Idle          0        0
2       0       0       Idle          Idle          0        0
3       0       0       Idle          Idle          0        0
4       0       0       Idle          Idle          0        0
5       1       0       Idle          Idle          0        0
6       1       0       Processing    Idle          0        0
7       2       0       Processing    Idle          0        0
8       2       0       Processing    Idle          0        0
9       2       1       Processing    Idle          0        0
10      1       1       Processing    Processing    1        0
11      1       2       Processing    Processing    1        0
12      0       2       Processing    Processing    2        0
13      1       2       Processing    Processing    2        0
14      2       1       Flushing      Processing    2        0
15      3       1       Flushing      Processing    2        0
16      4       0       Flushing      Processing    2        1
17      4       1       Flushing      Processing    2        1
18      4       2       Flushing      Flushing      3        1
19      4       3       Flushing      Flushing      3        2
At start-up, the system is in the idle state, where both compressors are idle and the
FIFOs are empty. When the start and compress signals are set, the Input Routing
Control enables a block length of data to be written to Input FIFO 1, then proceeds to
write a block length of data to Input FIFO 2. As soon as data are available to
process, a compressor starts to accept data to compress. The compressor accepts
data until the selected block length is reached. When the block length is reached, no
more data are read from the FIFO until the compressor is ready to compress a new
block of data. Figures 5.9 and 5.10 show, for the two compressors, the percentage
utilisation between the idle, processing (accepting and compressing the data) and
flushing (clearing the pipeline) states for each compressor over the first 5000 clock cycles
with a block length of 256 bits.
[Chart: percentage of time (0% to 100%) compressor 1 spends in each state against time in clock cycles (1 to 4501). Series: Idle 1, Processing 1, Flushing 1]
Figure 5.9. Compressor 1 utilisation with a block length of 256 bits 
[Chart: percentage of time (0% to 100%) compressor 2 spends in each state against time in clock cycles (1 to 4501). Series: Idle 2, Processing 2, Flushing 2]
Figure 5.10. Compressor 2 utilisation with a block length of 256 bits 
The figures reveal that when processing a 256-bit block length, approximately 28%
of a compressor's time is spent flushing the pipeline ready to accept a new block of 
data, 26% waiting for data and 46% of the time actually processing the data. On start 
up, the blocked input approach rapidly reaches its steady state compression 
performance and both compressors are quickly brought into action. 
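The utilisation percentages quoted above are simply the share of clock cycles a compressor spends in each state. As a minimal sketch (not the monitoring logic used in this work), the tally can be reproduced from a status log; the function name `utilisation` is illustrative, and the sample log is the X-MatchPro1 column of Table 5.2, which covers only the first 19 cycles and so does not reproduce the steady-state figures.

```python
from collections import Counter

def utilisation(states):
    """Return the percentage of logged cycles spent in each state."""
    counts = Counter(states)
    total = len(states)
    return {state: 100.0 * count / total for state, count in counts.items()}

# X-MatchPro1 column of Table 5.2, clock cycles 1 to 19:
# idle for cycles 1-5, processing for cycles 6-13, flushing for cycles 14-19.
log = ["Idle"] * 5 + ["Processing"] * 8 + ["Flushing"] * 6
shares = utilisation(log)

for state, percent in shares.items():
    print(f"{state}: {percent:.1f}%")
```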
The simulation was repeated for two compressors but using a block length of 8192 
bits and these results are plotted in figures 5.11 and 5.12. 
[Chart: percentage of time (0% to 100%) compressor 1 spends in each state against time in clock cycles (1 to 4501). Series: Idle 1, Processing 1, Flushing 1]
Figure 5.11. Compressor 1 utilisation over time with a block length of 8192 bits 
[Chart: percentage of time (0% to 100%) compressor 2 spends in each state against time in clock cycles (1 to 4501). Series: Idle 2, Processing 2, Flushing 2]
Figure 5.12. Compressor 2 utilisation over time with a block length of 8192 bits 
With an 8192-bit block length, the compressor reached a steady state with
approximately 90% of the time used for processing data, 8% for idling and 2% for
flushing. It is apparent that a larger block length results in greater throughput due to
increased compressor utilisation. This is in agreement with the theoretical findings
where, with a larger block length, a greater proportion of the time is spent processing
data. The initial start up shows that additional clock cycles are needed before
compressor 2 starts working to full capacity compared with its performance at the shorter
block length. For the 8192-bit block length, the benefit of two compressors is only
gained after 250 clock cycles, because the larger block takes longer
to route to each compressor.
5.2.2. Input Routing Conclusions 
Table 5.3 summarises the performance of the different input schemes compared to 
that of a single compressor system. For the interleaved technique, the compression
ratio is degraded relative to the single compressor case, while latency remains
the same, as the compressors can continuously process the data, thereby resulting in a
speed-up of a factor of n, where n is the number of compressors. The
blocked technique retains the locality in the data ensuring the achievable compression 
is similar to that of a single compressor. However there is a throughput and latency 
penalty from this routing strategy. The throughput penalty results from the 
compressor having to flush the pipeline of previous data before starting on a new 
block of data, and this penalty is greater for small blocks, as flushing takes a larger 
proportion of the compressor's time. The additional latency results from there being 
additional data waiting to be processed in a compressor's input buffer compared with 
the single processor case. 
Compression Speed Latency 
Single 1 1 1 
Interleaved <1 n 1
1 B 
B+D B 
Blocked n.-- P Throughput B+DP 
Table 5.3. Comparison of the performances of input data schemes
relative to a single compressor system, n = number of compressors, 
D = time to flush pipeline (clock cycles), B = Block length (bits), 
Throughput = Input data rate (bits/clock cycle), P = processing rate for a 
single processor (bits/clock cycle) 
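The dependence of the blocked scheme's throughput on block length can be illustrated with a simple model written in terms of the symbols B, D and P defined above. This is a sketch under stated assumptions rather than a restatement of the table entries: each compressor is assumed to alternate between B/P cycles of processing and D cycles of flushing, idle cycles caused by a slow input are ignored, and the values chosen for D and P are purely illustrative.

```python
def processing_fraction(block_bits, flush_cycles, rate_bits_per_cycle):
    """Fraction of busy time spent processing rather than flushing.

    Equivalent to B / (B + D * P); idle (input-starved) cycles are ignored.
    """
    processing_cycles = block_bits / rate_bits_per_cycle
    return processing_cycles / (processing_cycles + flush_cycles)

P = 32  # bits accepted per clock cycle (illustrative assumption)
D = 6   # pipeline flush time in clock cycles (illustrative assumption)

for B in (256, 2048, 8192):
    share = processing_fraction(B, D, P)
    print(f"B = {B:5d} bits: {100 * share:.1f}% of busy time spent processing")
```

Under this model the flushing overhead is fixed per block, so its share shrinks as B grows, which is the qualitative behaviour observed in the simulations.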
The findings for the two input routing schemes are summarised below. 
Blocked Input Scheme 
• Larger block sizes result in an increased throughput. For lengths above 8kbits, 
benefits of having larger block lengths are minimal. 
• Larger block sizes result in increased latency; as further compressors are 
added each has a diminishing impact on increasing the latency time. 
• Using longer blocks requires correspondingly more dedicated memory for 
each compressor, thus increasing silicon requirements. 
• A shorter block length reduces the ability to take advantage of redundancy in 
the data. However, increasing block lengths above 2kbits does not greatly 
enhance compression ratios. 
• Compressor utilisation for 256 bit blocks is low, but following start up, all
compressors quickly begin compressing data. Increased block lengths give
greater compressor utilisation, but the start up time before all compressors
start processing data is longer.
Interleaved Input Scheme 
• As further compressors are added to the system, they have a detrimental effect 
on the compression ratio due to loss of locality in the data.
• Latency introduced by the interleaved input routing is zero as data are supplied 
to all the compressors simultaneously. 
• As dedicated memory is not needed to store the data, silicon requirements are 
reduced. 
Both strategies offer increased throughput. The interleaved approach results in a
lower latency than the blocked method, but at the cost of a
reduction in compression, especially as the system scales. Conversely, in the blocked
method there is no detriment to compression performance as the system scales, but
greater latency is introduced.
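The difference between the two input schemes can be sketched as two ways of distributing a stream of input words over n compressors. The function names are illustrative, and the word-level granularity of the interleaved case is an assumption made for the example.

```python
def blocked_route(words, n, block_words):
    """Give each compressor a whole block of consecutive words in turn."""
    queues = [[] for _ in range(n)]
    for i, word in enumerate(words):
        queues[(i // block_words) % n].append(word)
    return queues

def interleaved_route(words, n):
    """Give each compressor every n-th word (assumed word-level interleave)."""
    queues = [[] for _ in range(n)]
    for i, word in enumerate(words):
        queues[i % n].append(word)
    return queues

words = list(range(8))
print(blocked_route(words, 2, 2))   # neighbouring words stay together
print(interleaved_route(words, 2))  # neighbouring words are split apart
```

In the blocked case each compressor sees runs of adjacent words, which preserves locality; in the interleaved case adjacent words are always sent to different compressors.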
5.2.3. Output Routing 
The lengths of the compressed data output blocks from an array of parallel
compressors will generally not be uniform due to variation in redundancy in the data. 
As a decompressor system would not know the data boundaries of each block, these 
data cannot be sent directly to the output bus and additional manipulation is needed in 
order to guarantee that the original data can be recovered. For data processed by a 
single compressor, decompression of blocks takes place according to their order of 
input, but to ensure correct reconstruction when using parallel sets of compressors and 
decompressors, the time order of the output data streams needs to be matched to the
original input order. One possible method for enabling reconstruction is to add a
boundary tag to the output stream, as shown in figure 5.13. The boundary tag forms
the basis of reconstruction in the single compressed block and multiple compressed
block output routing methods considered in the current work. A third method, namely
the interleaved compressed block architecture, avoids the need for a boundary tag by
routing a fixed length of data for each compressor. These output routing strategies
will be discussed in more detail in the following sections.
[Diagram: parallel X-MatchProRli compressors, each producing variable-length compressed data blocks]
Figure 5.13. The X-MatchProRli compressors output blocks that vary in
length. The boundary tag is added to delimit the blocks.
5.2.3.1 Single Compressed Block 
In the single compressed block method, shown in figure 5.14, it is assumed that the 
data enter the system using the blocked mode technique, as the interleaved input 
routing produces a continuous flow of data from the compressors, which is unsuitable 
for this routing strategy. The compressed data blocks are collected in each 
compressor's output buffers ready for routing. The buffer outputs are routed in strict 
order of the compressor number and a boundary tag that contains information on the 
block length is added so as to precede the data. As the tag will enter the decompressor 
system first, the tag can then be decoded to gain information on the length of the 
compressed data input belonging to the decompressor. The introduction of tags is 
detrimental to the compression ratio, but this diminishes as the block length is 
increased, as the overhead of one tag per block of compressed data is largely
constant. One of the drawbacks of this approach is that the output data bus may
contain idle time. This arises since a whole block of data needs to be compressed
before the appropriate tag values can be determined, and so a compressor may still be
compressing its data when the output router becomes available.
[Timing diagram, time 0 at the right and time t at the left. In order of transmission: TAG, CMP1, idle time, TAG, CMP2, TAG, CMP3]
Figure 5.14 Single compressed block output. Each compressor's output data 
(here denoted by CMP1, CMP2 and CMP3 for compressors 1, 2 and 3 
respectively) is preceded by a boundary tag indicating the block length. 
The pseudocode covering the single block output algorithm for compression and for
the decompression of the single block routing is described in figures 5.15 and 5.16.
Repeat until all data are compressed
{
    for (n = 1; n <= number of compressors; n++)
    {
        Wait until whole block of data is compressed by compressor n
        Insert a tag at the front of the compressed block
        Send the complete compressed data block with the tag from compressor n to
        the output bus
    }
}
Figure 5.15 Single Block Output Routing Algorithm - Compression 
Repeat until all data are decompressed
{
    for (n = 1; n <= number of decompressors; n++)
    {
        Read and decode tag to get compressed block length (X)
        Route compressed block length (X) of data to decompressor n
        Start decompression when complete compressed block is available for decompressor n
    }
}
Figure 5.16 Single Block Routing Algorithm - Decompression 
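The two algorithms above can be sketched in software as a length-prefixed framing of the compressed blocks. This is an illustration only: the tag layout (a 64-bit big-endian byte count) and the function names are assumptions made for the example, not the tag format used in the hardware.

```python
TAG_BYTES = 8  # 64-bit boundary tag; the field layout is an assumption

def frame_single(blocks):
    """Emit each compressed block preceded by a tag holding its length."""
    out = bytearray()
    for block in blocks:  # blocks are already in compressor order
        out += len(block).to_bytes(TAG_BYTES, "big")
        out += block
    return bytes(out)

def deframe_single(stream, n):
    """Decode each tag and route the following block to decompressor k mod n."""
    queues = [[] for _ in range(n)]
    pos, k = 0, 0
    while pos < len(stream):
        length = int.from_bytes(stream[pos:pos + TAG_BYTES], "big")
        pos += TAG_BYTES
        queues[k % n].append(stream[pos:pos + length])
        pos += length
        k += 1
    return queues

stream = frame_single([b"abc", b"", b"hello"])
print(deframe_single(stream, 2))
```

Because every block carries its own tag, the decoder never needs to know the compressed lengths in advance, at the cost of one tag per block.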
5.2.3.2. Multiple Compressed Block 
Figure 5.17 illustrates the format of an output data stream containing multiple blocks. 
This is similar to the single block scheme, but, instead of waiting for each compressor 
to finish processing its block of data, all compressors need to finish compressing 
blocks before the data are sent. In this technique, the tag provides information on the 
length of the compressed data to be sent to each decompressor. As all compressors 
need to have completed their operations before an output can be produced, this 
approach has a greater latency compared with the single compressed block case, but, 
as fewer tags are needed, the effect on the compression ratio is reduced. The 
combined tag is shorter than the sum of the individual tags as the output bus 
granularity is of fixed width. Output tags are sized in accordance with the output bus 
width in order to simplify the routing architecture and decoding operations, even 
though fewer bits are required to determine block length boundaries. 
[Timing diagram, time 0 at the right and time t at the left. In order of transmission: idle time, TAG, CMP1, CMP2, CMP3]
Figure 5.17. In the multiple compressed block output case, a single tag 
indicates the lengths of the blocks to follow from the complete set of 
compressors. 
Again, the pseudocode covering the multiple compressed block algorithms for
compression and decompression is described in figures 5.18 and 5.19.
Repeat until all data are compressed
{
    for (n = 1; n <= number of compressors; n++)
    {
        Wait until whole block of data is compressed
        Insert block length into output tag
    }
    Insert tag to the front of the first compressed data block
    for (n = 1; n <= number of compressors; n++)
    {
        Send complete compressed data block from compressor n to the output bus
    }
}
Figure 5.18 Multiple Block Routing Algorithm - Compression 
Repeat until all data are decompressed
{
    Read and decode tag to gain compressed block length count for each decompressor
    for (n = 1; n <= number of decompressors; n++)
    {
        Route compressed block length of data to decompressor n
        Start decompression when complete compressed block for decompressor n is
        available
    }
}
Figure 5.19 Multiple Block Routing Algorithm - Decompression 
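A software sketch of the multiple block scheme differs from the single block case only in that one tag carries the lengths of all n blocks that follow. The way the tag is divided into equal length fields here is a hypothetical layout chosen for illustration; the actual field allocation within the tag is not specified in this section.

```python
def frame_multiple(blocks, tag_bytes=8):
    """Emit one tag holding every block length, then all blocks in order."""
    n = len(blocks)
    field = tag_bytes // n  # bytes per length field (assumed layout)
    tag = b"".join(len(block).to_bytes(field, "big") for block in blocks)
    tag = tag.ljust(tag_bytes, b"\x00")  # pad the tag to the bus width
    return tag + b"".join(blocks)

def deframe_multiple(group, n, tag_bytes=8):
    """Decode the combined tag, then split the payload into n blocks."""
    field = tag_bytes // n
    lengths = [int.from_bytes(group[i * field:(i + 1) * field], "big")
               for i in range(n)]
    blocks, pos = [], tag_bytes
    for length in lengths:
        blocks.append(group[pos:pos + length])
        pos += length
    return blocks

group = frame_multiple([b"abc", b"hello!"])
print(deframe_multiple(group, 2))
```

The framing cannot start until every block length is known, which is the source of the additional latency noted above.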
The effect of the tags on the overall compression ratio can be determined 
theoretically. Figure 5.20 illustrates the minimum percentage increase in the length of 
a compressed block of data when using a 64-bit output tag for the single and multiple 
block methods. The graph reveals that the impact of tags on the compressors' 
performance is greater on smaller blocks than for larger blocks. The multiple block 
output has a smaller impact on the compression ratio for output routing compared 
with the single block routing strategy, but has the disadvantage of increasing the 
latency. 
[Line chart: percentage increase (axis from 0% to 30%) against output block length (256 to 8192 bits). Series: Single, Multiple]
Figure 5.20 Percentage increase in output block size 
due to the addition of a 64 bit tag 
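The minimum tag overhead follows directly from the ratio of tag length to compressed block length. The sketch below assumes that, in the multiple block case, one tag is shared by blocks_per_tag output blocks of equal length; the function name and that parameter are illustrative.

```python
def tag_overhead_percent(block_bits, tag_bits=64, blocks_per_tag=1):
    """Percentage growth in output size caused by the boundary tag."""
    return 100.0 * tag_bits / (block_bits * blocks_per_tag)

for block_bits in (256, 512, 1024, 2048, 4096, 8192):
    single = tag_overhead_percent(block_bits)
    shared = tag_overhead_percent(block_bits, blocks_per_tag=2)
    print(f"{block_bits:5d} bits: single {single:.2f}%, shared by two {shared:.2f}%")
```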
Further experiments were carried out to collect results on the mean compression ratio 
for the single output blocked routing scheme and in the case of compression without 
tags. Data were collected for all the datasets using a range of block lengths. The results
in figure 5.21 reveal that, as the block length is increased, the impact the tag has on
the mean compression ratio is reduced. Using tags for smaller blocks has the effect of 
causing data expansion rather than compression, due to the increased proportion of 
bits being used as tags. A converse effect is that shorter block lengths result in a 
reduced latency, and clearly a design trade-off is needed. 
[Line chart: mean compression ratio (axis from 0.5 to 1.3) against block length (0 to 8000 bits). Series: canterbury, general, application, memory, executable and user datasets, each with and without tags]
Figure 5.21 Impact single tags have on mean compression ratio on the different
datasets with varying block lengths
5.2.3.3 Interleaved Compressed Block 
Figure 5.22 illustrates an interleaved approach for routing multiple compressed blocks
of data to the output stream. Instead of waiting for a whole block to be compressed, a 
predefined fixed length of compressed data is always sent to the output. If a 
compressor has not completed its operations, the system must wait until the data has 
been produced. 
[Timing diagram, time 0 at the right and time t at the left. In order of transmission: idle time, CMP1, CMP2, idle time, CMP3, CMP1, each segment of the same fixed length]
Figure 5.22. In the interleaved compressed block output method, the 
block length and order are both constant. 
There are two benefits of this approach compared with the previously discussed 
methods. Firstly, there is a reduction in latency, since data can be sent to the output 
before the whole block is compressed. Secondly, since no boundary tags are required, 
there is an improvement in the compression ratio. The method is suitable for data
compressed with either the blocked or interleaved routing method, whereas the single
and multiple blocked output routing approaches are only applicable for blocked input
routing. One possible drawback of this routing strategy is that at the end of a
compression sequence, the interleaved approach may need extra dummy tags to be
added to the output stream in order to ensure a constant interleaved block length. On
receipt of the stop signal, output routing continues until all compressors have
completed operations on their input blocks. It is likely that the final interleaved block
from each compressor will contain insufficient data to fill the required fixed output
length, and so dummy data tags are added as required in order to maintain the
interleave length, so that output routing control can be allocated to the next compressor.
The main shortcoming of this scheme occurs when there is high variation in 
compressed data block lengths. In such a case, the output bus may be occupied 
waiting for a compressor to provide adequate data to output, while compressed data 
could be present in the other compressors' FIFOs. This would increase the overall 
latency of the approach and increase the output buffer memory requirements for each 
compressor. A reduction in compression performance will also occur if the system 
becomes exhausted of input data when there is a large disparity between the quantities 
of compressed data present in the compressors' outputs, as significant dummy data 
will need to be added to empty all the output FIFOs. The pseudocode for the 
interleaved output routing and the subsequent decompression routing is described in 
figures 5.23 and 5.24.
Repeat until all data are compressed
{
    for (n = 0; n < number of compressors; n++)
    {
        if (compression is still active)
        {
            Wait until compressed data count for compressor n >= interleave length
            Send interleave length of data from compressor n to the output bus
        }
        else
        {
            Send available compressed data from compressor n to output bus
            Send dummy data of length equal to (interleave length - available data length)
            to the output bus
        }
    }
}
Figure 5.23 Interleaved Routing Algorithm - Compression 
Repeat until all data are decompressed
{
    for (n = 0; n < number of decompressors; n++)
    {
        Route interleave length of data to decompressor n
    }
}
Figure 5.24 Interleaved Routing Algorithm - Decompression 
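The interleaved scheme can be sketched in software by treating each compressor's complete output as a byte string and emitting fixed-length segments in a constant order. This models only the final drained state rather than the cycle-by-cycle behaviour of the output router; the zero padding byte stands in for the dummy data, and how a decompressor recognises dummy data is not modelled here.

```python
def interleave(outputs, chunk, pad=b"\x00"):
    """Emit fixed-length segments from each compressor in turn, padding as needed."""
    rounds = max(-(-len(data) // chunk) for data in outputs)  # ceiling division
    stream = bytearray()
    for r in range(rounds):
        for data in outputs:
            piece = data[r * chunk:(r + 1) * chunk]
            stream += piece.ljust(chunk, pad)  # dummy data fills short segments
    return bytes(stream)

def deinterleave(stream, n, chunk):
    """Route successive fixed-length segments to decompressors 0..n-1 in turn."""
    queues = [bytearray() for _ in range(n)]
    for i in range(0, len(stream), chunk):
        queues[(i // chunk) % n] += stream[i:i + chunk]
    return [bytes(q) for q in queues]

outputs = [b"abcdef", b"xy"]
stream = interleave(outputs, 4)
dummy_bytes = len(stream) - sum(len(data) for data in outputs)
print(deinterleave(stream, 2, 4), dummy_bytes)
```

The example shows how a disparity between the compressors' output lengths increases the dummy data: the shorter output must keep supplying padded segments until the longer one has been emptied.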
In order to determine if the interleaved output routing method is suitable for routing
compressed data from a multiprocessor system, an understanding is needed of how 
the type of data being compressed will affect the lengths of compressed data output 
blocks. This will be critical in determining the appropriate output strategy to adopt, as,
if the variation is large and interleaved output is used, an undetermined amount of
latency will be introduced as it will be more likely that the output router will need to 
wait between the receipt of consecutive compressed data outputs from the bank of 
compressors. Conversely, if the variation is small, the interleaved method will 
introduce relatively little latency. Figure 5.25 shows the results of tests to determine
how both block length and dataset type affect the standard deviation of the 
compression ratio. The compression ratio mean used in the calculation of the standard 
deviation is that of the particular dataset, for example the 'memory' standard 
deviation is obtained using the 'memory' dataset's mean compression ratio. A large 
number of tests were able to demonstrate that the distribution of compressed block 
lengths around the mean takes the form of a Gaussian distribution. It can be seen that
the compression ratios of the executable and memory datasets have a similar standard 
deviation of around 0.25, while those of the other four datasets are grouped in the 
range 0.1 to 0.15. The low standard deviation in compression ratio keeps small the 
mean time the output router will need to wait for data to become available in the 
compressor's output buffer. This helps keep the latency introduced to a minimum, as 
well as providing an indication of the maximum size of the output buffers needed, and 
so saving on-chip memory. 
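The statistic discussed above can be computed per dataset from the compressed length of each block. The sketch below takes the compression ratio of a block to be compressed length divided by original length, so that values above 1 indicate expansion, and uses the population standard deviation; whether the population or sample form was used in the experiments is not stated, so that choice is an assumption, as are the sample figures.

```python
from statistics import mean, pstdev

def ratio_spread(block_bits, compressed_bits):
    """Return the mean and standard deviation of per-block compression ratios."""
    ratios = [compressed / block_bits for compressed in compressed_bits]
    return mean(ratios), pstdev(ratios)

# Hypothetical compressed lengths (bits) for four 2048-bit input blocks.
average, spread = ratio_spread(2048, [1024, 1536, 1280, 1792])
print(f"mean ratio {average:.3f}, standard deviation {spread:.3f}")
```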
[Line chart: standard deviation of compression ratio (axis from 0 to 0.3) against block length (0 to 8000 bits). Series: canterbury, application, executable, general, memory, user]
Figure 5.25. Effect of the block length and dataset type 
on the compression ratio standard deviation 
The memory and user datasets were selected, having been identified from figure 5.25 as
having differing standard deviation characteristics, and experiments were carried
out to determine the final difference between two compressors' output buffers with a
range of block lengths, indicating the quantity of dummy data needed to flush all
compressed data from the system. Figure 5.26 illustrates how data with different
standard deviation characteristics influence the quantity of dummy data that need to 
be created in order to allow all compressors to output their data at the end of the 
compression. The figure demonstrates that for a dataset with a relatively small mean 
of standard deviation of compression ratio the block length does not significantly 
affect the overall compression achieved. However for a high variation dataset with a 
relatively large mean of standard deviation of compression ratio, the number of 
additional dummy tags needed in the interleaved output routing method increases with 
block length. 
[Line chart: final difference between the two compressors' output FIFOs (axis from 0 to 50000) against block size (0 to 8000 bits). Series: arithmetic mean mem, arithmetic mean user]
Figure 5.26. Final difference between two compressors' output FIFOs for
the memory and user datasets for various block lengths
5.2.4 Output Routing Conclusions 
Table 5.4 summarises the relative performances of the described output routing 
methods compared to that of a single X-MatchProRli. Latency is introduced to the
single block output routing since a complete block of data needs to be processed and 
stored in memory before the tag can be determined. The latency introduced for 
multiple blocked output method is greater than that of the single block output 
approach, as multiple blocks of data need to be processed before the tag can be 
determined. However, the latency introduced for the interleaved output is dependent 
on several factors, such as the degree of compression in the data, the variation in 
length between compressed blocks and the selected predefined interleave length. 
                          Compression   Speed   Latency
Single X-MatchProRli      1             1       1
Single Blocked Output     »1            n       >1
Multiple Blocked Output   >1            n       »1
Interleaved Output        1             n       Data Dependent
Table 5.4. Comparison of the performances of output data schemes
relative to a single compressor system, n = number of processors
5.3. Summary 
This chapter has focused on analysing the impact various routing strategies have on 
the multiple compression system performance. When considering compression ratio 
in input routing, the blocked method provides a more scalable solution than the 
interleaved input scheme, where the achievable compression is adversely affected by 
an increase in the number of compressors. The blocked method introduces extra 
latency into the system, and although this can be minimised by adopting small block 
lengths, the effective compression performance starts to degrade when blocks shorter 
than 2kbits are employed. 
Several output routing schemes have been introduced which overcome the data 
boundary problem and maintain the time order of the data to ensure correct 
reconstruction when decompressing. The single and multiple blocked output routing
methods have a detrimental effect on compression due to the introduction of tags, with the effect
becoming more pronounced as the block lengths are reduced. The interleaved strategy
avoids the need for tags by routing a set quantity of data from each compressor in
turn. However, depending on the data characteristics, a variable number of dummy
data tags are required on completion of a compression cycle in order to enable output
routing control to be passed to the next compressor. Nevertheless, the results have
demonstrated that this overhead has a negligible detrimental effect on compression ratio
compared with the blocked output routing.
Based on the evidence gathered in this chapter, the methods selected for the 
hardware implementation work described in the remainder of this thesis are the 
blocked scheme for input routing with a block length of 2kbits, and the interleaved 
strategy for output routing. 
Chapter 6 
Implementation 
6.1. Chapter Objectives 
This chapter describes the detail of the hardware implementation of the multiple
compressor system. Analysis of performance and complexity is achieved by 
synthesising the design for a range of FPGA technologies. 
6.2 Hardware Design 
To demonstrate the feasibility and the operation of the architecture, a dual-compressor 
system was developed and its implementation is described. The purpose of carrying 
out this process is to demonstrate the practical feasibility of the multi-compressor 
approach described in the previous chapter. The HDL description of the architecture
was developed in a generic manner, thereby allowing the architecture to be extended
should a larger number of compression engines be required in a particular application.
The EDA tools used for the hardware implementation were Modelsim from Mentor
Graphics, to develop and simulate the RTL code, and SynplifyPro from Synplicity,
to place and route the design for FPGAs.
From the findings of chapter 5 the following routing strategies were adopted. 
• For the input routing, blocked input data is chosen rather than the interleaved 
approach due to the former's superior compression performance. 
• For the output routing, the interleaved technique is used as it imparts no overhead 
to maintain compressed data boundaries, and so has no detrimental effect on the 
compression ratio. Moreover, as data is routed to each decompressor in turn by a 
fixed interleaved length rather than by variable compressed block length, the 
design of the interleaved decoder can reuse the architecture of the routing logic 
already employed for compressing data. 
Figure 6.1 illustrates the block architecture of the multiple X-MatchProRli system. 
The architecture is designed to allow compression and decompression operations to 
share the same resources. The area enclosed by the broken line denotes the part of the 
architecture that would need to be replicated should additional compressors be added 
to the system, and includes a compression/decompression engine (X-MatchProRli)
with its own independent control logic (X-MatchProRli control and FIFO control)
and distributed memory (input FIFO and output FIFO). The routing of the input and
the output is provided from a single set of external architectural blocks. The user
signals allow the system to be started and halted and for the block length and 
interleave output length to be selected. 
[Block diagram; legible block labels: Input Control, Output Control, Selector]
Figure 6.1 Block architecture of the two X-MatchProRli
compressor/decompressor system
Applying external signals to the user signal registers sets up the system for operation; 
these signals indicate the block length for compression and decompression and 
start/stop status. The system begins operation when a suitable signal from the user
signal registers is sent to the input control state machine, enabling data to be written
to the first compressor's input FIFO. This process continues until the required block
length is reached, at which time writing is enabled for the next compressor's input
FIFO and consequently writing to the previous input FIFO is disabled. Whenever data
is present in any of the compressors' input FIFOs, the empty signal is sent low and as
a consequence this activates the input FIFO control state machine. If the empty signal
is low and the compressor's status is in an idle state, the read input FIFO control 
signal is activated high, which allows 64 bits (in the two compressor case) to be 
written to the selector. The selector ensures that 32 bits of data are sent on 
consecutive clock cycles to the X-MatchProRli compressor when in the compression 
mode. When the first 4 bytes of data are read from the input FIFO a start signal is sent 
to the single X-MatchProRli control logic to begin the state machine that generates 
the control signals for the X-MatchProRli engine. When the compressor has 64 valid bits
of compressed data, they are written to that compressor's output FIFO. This process is
repeated until a whole block of data has been compressed. At this point in time the
internal dictionary of the X-MatchProRli is reset ready for the next block of data. The
output control waits until an interleaved length of compressed data is present in the 
first compressor's output FIFO before the data are written to the output bus and 
control is passed to the next compressor. 
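The role of the selector in compression mode can be sketched as splitting each 64-bit FIFO word into two 32-bit words delivered on consecutive clock cycles. The ordering used below (upper half first) is an assumption made for illustration, and the function name is hypothetical.

```python
def split_word(word64):
    """Split a 64-bit word into two 32-bit words, upper half first (assumed order)."""
    upper = (word64 >> 32) & 0xFFFFFFFF
    lower = word64 & 0xFFFFFFFF
    return [upper, lower]

for cycle, word32 in enumerate(split_word(0x0123456789ABCDEF)):
    print(f"cycle {cycle}: {word32:08X}")
```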
As discussed above, the architecture used for compression has been reused for 
decompression. In the decompression mode, compressed data are routed to each 
compressor in turn, in a similar fashion to that used for compression, except the data 
length is that of the interleaved output section rather than being that of the input block 
size. Clearly, this interleaved length needs to be identical to the value used for 
compression otherwise the data will not be decompressed correctly. When a complete 
block of compressed data is present in the input FIFO, the decompression mode of the
X-MatchProRli is enabled. Compressed data are read as required and the block of
uncompressed data is accumulated in the output FIFO. The output control waits until
the decompressor has provided a quantity of data equal to the block size, writes these data
to the output, and then passes control to the next compressor.
[Hierarchy diagram; modules shown, in order: Multiple_compressor_generic, Gen_write_input_control, Engine_0, Read_input_fifo, FIFO1, Selector_input, Single_X-Match_Control, X-MatchProRli, Selector_output, FIFO2, Gen_mux, Gen_interleave_control]
Figure 6.2 Multiprocessor Block Hierarchy
Figure 6.2 displays the design hierarchy; the top-level module
Multiple_compressor_generic instantiates the other sub-modules to form a complete
system. This module includes a generic integer variable labelled proc_num, whose
value determines how many X-MatchProRli units (along with their associated signals)
are generated. Furthermore, gen_write_input_control, gen_mux and
gen_interleave_control all utilise the proc_num variable to adjust their
implementation for correct operation with the defined number of
compressors/decompressors. The gen_write_input_control handles the incoming data,
determining which processing engine receives data and the number of bits sent. The
gen_mux is a simple logic multiplexor that connects the output data streams from all
the engines to one single output bus. The gen_interleave_control manages the output
data streams from each X-MatchProRli engine's output FIFO to determine when data
should be placed onto the output stream to ensure data boundaries can be resolved.
[Figure 6.3: simulation waveform showing the signals start, compress, decompress, 
stop, clear, clk, bs_in, full and write_output.]
Figure 6.3 Write input control, block length 256 bits 
(4 clock cycles * 64 bits), for a two compressor system 
Figure 6.3 illustrates the waveform of a two processor system in compression mode 
(indicated by the compress signal going low). BS_IN determines the block length; in 
this case '0000' selects a block length of 256 bits (4 clock cycles x 64 bits). 
Each input FIFO has data written to its input in turn, by the appropriate write_output 
signal going high. When the stop signal goes low, the system continues until a 
complete block is routed to maintain the integrity of the data. Similarly, when the 
decompression mode is activated, the same operation occurs as for compression, but now 
the interleave length determines the length of data routed to each X-MatchProRli 
engine in turn. For decompression the BS_IN signal must be equal to the interleave 
length used during compression. 
[Figure 6.4: simulation waveform showing signals including start, stop, compress, 
decompress, bs_in, clk, u_datain, c_datain, u_dataout, c_dataout, finished, 
compressing, decompressing and rw, with the compressed data annotated.]
Figure 6.4. Waveform for the Compression Operation for a single 
X-MatchProRli Engine 
Figure 6.4 displays the compression of a block of data in one of the X-MatchProRli 
engines. The X-MatchProRli is set ready to perform compression or decompression if 
the corresponding compress or decompress signal is high. From the moment the start 
and compress signals go low, a block length of data, determined by the value of 
BS_IN, is read from the u_datain bus and is compressed. The rw signal going low 
indicates that 64 bits of compressed data are available on the c_dataout bus. The 
finished signal going high indicates that the compression of a block of data has been 
completed. 
[Figure 6.5: simulation waveform showing the input_stream, selector, output_stream 
and valid signals, with the annotation "Dummy tags inserted for interleave output". 
The caption is only partly recoverable: "... on a two Engine System in compression mode".]
Figure 6.5 illustrates an example of the waveform from the generic mux at the output 
for a two processor system in compression mode. The input_stream bus contains the 
data busses from each processor; the selector signal, controlled by the generic 
interleave control module, determines which bus has control of the main 
output_stream bus. In this case, an interleave length of compressed data from each 
processor is written to the output_stream bus in turn. Shown in the waveform is the 
end of the compression operation for the whole system, where dummy tags are inserted 
to flush the remaining compressed data from each of the processors' output FIFOs. 
These tags are inserted by reading from the output FIFO even though no data are 
present; when decompression occurs, this extra data is ignored. When the valid signal 
is high, there are valid data on the output_stream bus. 
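The flushing behaviour at the end of compression can be modelled at word level. The sketch below is a simplified Python illustration rather than the gen_mux or gen_interleave_control logic itself: the function name, the use of the integer 0 as the dummy word, and the choice to complete a whole round across all engines before stopping are assumptions made for the example.

```python
from collections import deque


def flush_interleaved(fifos, interleave_words, dummy=0):
    """Drain the output FIFOs round-robin, padding with dummy words.

    Returns a list of (selector, word) pairs, where selector is the index
    of the engine whose bus drives the output stream for that word.
    """
    out = []
    # Keep cycling through the engines while any FIFO still holds data.
    while any(fifos):
        for selector, fifo in enumerate(fifos):
            for _ in range(interleave_words):
                # Reading an empty FIFO yields a dummy word, so every
                # engine always contributes a full interleave length.
                word = fifo.popleft() if fifo else dummy
                out.append((selector, word))
    return out


# Engine 0 holds three words and engine 1 holds one; interleave length is two.
example = flush_interleaved([deque([1, 2, 3]), deque([10])], 2)
```

Because each engine always emits a whole interleave length, the decompression side can route the stream back to the engines without any boundary markers and simply discard the trailing dummy words.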
6.3 Hardware Implementation 
The design was synthesised using the Synplicity EDA tool Synplify, targeted at 
an Altera EP20K1000 FPGA, and the resource allocation for this device is shown in 
Table 6.1. 
Table 6.1. Architecture Complexity 
Logic Block                      Look-Up Tables   Chip area used   Equivalent gate count
Compressor * 2: X-MatchProRli    4540             11%              110,000
Compressor * 2: Control Logic    346              ~1%              10,000
Routing Input/Output             78               <1%              5,000
Total                            9850 of 38,400   25%              245,000
The logic used by the FIFO depends on the maximum block length selected for the 
system. For a single 8192 deep, 64-bit wide FIFO, eight Embedded System Blocks 
(ESB) are needed from the 160 ESBs available on the Altera FPGA. A system clock speed 
of 34.7 MHz is achievable, providing, for the two-compressor system, a throughput of 
2.2 Gbit/s (64 bits * 34.7 MHz). One of the major benefits of the particular routing 
strategies adopted is that they give a scalable solution, namely one that maintains 
compression performance as the number of compressors in the system is increased. 
Simple scaling calculations show that a four-compressor system would be required to 
meet the requirements of the PCI-X standard and a ten-compressor system would be 
needed to achieve the performance required for fast Ethernet. However, when four 
compression engines are synthesised for the Altera EP20K1000 FPGA, a maximum 
clock speed of 28.1 MHz is attained with a chip utilisation rate of 50%, giving a 
maximum throughput of 3.6 Gbit/s (128 bits * 28.1 MHz). This discrepancy in throughput 
results from FPGAs having a limited number of high-speed routing wires: as utilisation 
increases, signals must be routed through less optimal paths, which results in a 
commensurate decrease in clock speed. This limitation can be mitigated either by 
selecting larger FPGAs or by taking the design to an ASIC. Taking the system into 
ASIC technology allows a design to be produced that matches exactly the required 
number of connecting wires. Note that in FPGA implementations the benefits of 
flexibility are often offset by inefficiencies that arise from the reconfigurability 
overhead. 
Table 6.2. Synthesis results for a range of FPGA technologies using a 
four-compression engine system 
FPGA                            Logic Utilisation   Clock Speed (MHz)   Data throughput for four compressors
Xilinx Virtex2 XC2vp20ff896-7   59%                 56.6                7.2 Gbit/s
Xilinx Virtex XCV1000           49%                 41.6                5.3 Gbit/s
Altera APEX EP20K1000           50%                 28.1                3.6 Gbit/s
The resource allocation figures shown in Table 6.2 demonstrate that with modern 
FPGA technology, multiple compressor architectures with their own dedicated 
memory and routing mechanisms can easily be implemented on a single device. 
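The throughput figures quoted in this section all follow from the same arithmetic, which the short Python sketch below reproduces. The constant of 32 bits per engine per clock cycle is not stated explicitly in the text; it is inferred from the quoted products (64 bits per clock for two engines, 128 bits per clock for four), and the function name is hypothetical.

```python
# Inferred from "64 bits * 34.7 MHz" (two engines) and "128 * 28.1 MHz" (four).
BITS_PER_ENGINE_PER_CLOCK = 32


def throughput_gbit_s(engines, clock_mhz):
    """Aggregate throughput in Gbit/s for a given engine count and clock."""
    return engines * BITS_PER_ENGINE_PER_CLOCK * clock_mhz / 1000.0


results = {
    "Altera APEX EP20K1000, 2 engines": throughput_gbit_s(2, 34.7),
    "Altera APEX EP20K1000, 4 engines": throughput_gbit_s(4, 28.1),
    "Xilinx Virtex XCV1000, 4 engines": throughput_gbit_s(4, 41.6),
    "Xilinx Virtex2 XC2vp20ff896-7, 4 engines": throughput_gbit_s(4, 56.6),
}

# Ideal scaling: four engines at the two-engine clock speed of 34.7 MHz.
ideal_four_engines = throughput_gbit_s(4, 34.7)

for name, value in results.items():
    print(f"{name}: {value:.1f} Gbit/s")
```

Comparing `ideal_four_engines` with the four-engine APEX result shows the size of the shortfall that the text attributes to the drop in clock speed at higher utilisation.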
6.4 Summary 
A cycle accurate model of the system has been created in a suitable design language, 
allowing for several output routing strategies to be developed and tested in a parallel 
system for various numbers of compression engines. The practical implementation of 
the multi-compressor systems has demonstrated that it is possible to perform 
compression at throughputs of up to 7.2 Gbit/s, and has allowed the practical 
compression performance, speed and complexity to be assessed. 
Chapter 7 Conclusion 
Chapter 7 
Conclusion 
7.1. Chapter Objectives 
This chapter assesses the outcomes of the research with respect to the research 
objectives set out in chapter 3. Potential future research is also discussed. 
7.2. Summary of Objectives and Design Flow 
The objectives of the work as outlined in chapter 3 are to investigate and develop 
hardware architectures to 
• Improve Lossless Compression throughput without significantly 
compromising other compression performance aspects. 
• Provide scalability in the design, such that future bandwidth requirements can 
be met without the need for significant redesign. 
An extensive literature review was performed to identify high performance lossless 
compression systems, which utilise parallelism to improve data throughput and 
compression. It was found that current research has mainly focused on the use of 
systolic arrays and CAMs for hardware implementation. Until recently, there has been 
little research effort on using multiple compression algorithms that cooperate to share 
a computational task. The work presented in this thesis has built upon previous 
research that developed a high-performance lossless data compression engine. A 
number of architectures that contained compression/decompression engines operating 
in parallel were investigated, and this included the development of suitable 
behavioural and cycle accurate models. Representative datasets were used to gain 
understanding of how these candidate designs influenced the overall system 
performance compared to that of a single compressor/decompressor. The most 
suitable architecture for the current work was selected and was implemented as a 
synthesisable IP core. The HDL description was verified and tested against the 
SystemC model. The resulting soft-core had a minimal target technology bias 
allowing it to be potentially synthesised for a wide range of silicon technologies. 
Results for complexity and speed were demonstrated by the place-and-route 
statistics obtained from the Synplicity EDA tool for a selection of common 
FPGA targets. 
7.3. Contribution 
This thesis has provided a comprehensive review of the use of parallelism in lossless 
data compression, with the main research findings summarised in a table. The thesis 
has also identified a range of techniques for routing data to and from multiple 
compression engines, with their own dedicated internal memory. It has been shown 
that important design considerations need to be made that can have an effect on both 
compression and latency. It has also been shown that suitable architectures and 
routing strategies can be applied to implement scalable high-speed compression 
systems that can be tailored to meet the requirements of different data types and 
locations. For example, the main priority for backing up data is normally achievable 
compression rather than latency, while for the compression of memory data, more 
emphasis will be laid on the latency due to the time constraints involved. 
For the input routing, this work has identified that the interleaving technique has a 
low impact on latency, but that the attainable compression ratio worsens as the 
number of compressors in the system increases. This loss in compression detracts 
from the benefit of increased throughput, as the compressed data will need more 
bandwidth. Sending blocked data to each compressor increases the latency 
in the system, but takes into account the locality present in the data, 
thereby ensuring that the increase in the number of compression engines does not 
adversely affect the compression ratio. 
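The difference between the two input routing strategies can be explored with a small software experiment. The Python sketch below is an assumption-laden illustration, not the model used in this work: zlib stands in for the hardware compressor, the interleaved case is modelled as 4-byte pieces and the blocked case as 4096-byte pieces, each engine compresses its whole stream in one go (the per-block dictionary reset is not modelled), and the function names are hypothetical.

```python
import zlib


def route(data, n_engines, piece):
    """Send consecutive pieces of `piece` bytes to each engine in turn."""
    streams = [bytearray() for _ in range(n_engines)]
    for i in range(0, len(data), piece):
        streams[(i // piece) % n_engines] += data[i:i + piece]
    return [bytes(s) for s in streams]


def compression_ratio(data, n_engines, piece):
    """Total compressed size divided by input size (smaller is better)."""
    streams = route(data, n_engines, piece)
    compressed = sum(len(zlib.compress(s)) for s in streams)
    return compressed / len(data)


def compare(data, n_engines):
    """Ratios for interleaved (small pieces) and blocked (large pieces)."""
    return {
        "interleaved": compression_ratio(data, n_engines, piece=4),
        "blocked": compression_ratio(data, n_engines, piece=4096),
    }
```

Calling `compare` on a representative file for increasing values of `n_engines` gives a rough feel for how splitting the data into small pieces disturbs the locality that a dictionary-based compressor relies on; the actual figures depend on the data and on the stand-in compressor.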
For the output routing, several schemes are developed and analysed. The 
interleaved technique, needing no tags to mark data boundaries, introduces no loss in 
compression as the number of parallel compressors is increased and the similar 
compression ratio standard deviations between certain datasets would make it possible 
in many applications for good estimations to be made of the memory requirement for 
buffering. The blocked techniques introduce additional latency, but this is often 
offset by improved compression performance. Since the blocks of compressed data 
can be compared to the input blocks, it is possible to send the original data instead of 
the compressed data if expansion rather than compression has taken place. This is 
particularly useful if already-compressed data (for example mp3 or jpeg files) or 
encrypted information is present in the input data stream. Another contribution this 
work has provided is in using an emerging system level design language in the 
development of the cycle accurate model to analyse the architectural tradeoffs of the 
system. Because such system-level approaches have a largely unproven track record in 
digital design, companies may well be reluctant to take the risk of trying them out. The 
use of a system-level method in the current research has yielded some independent 
first-hand experience of this approach in the development of a large system, and the 
reporting of this work through conferences provides useful feedback to the business 
communities on the benefits and drawbacks of such languages. 
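The option of sending the original data when a block expands, mentioned above for the blocked output techniques, can be sketched as follows. This is a minimal Python illustration under stated assumptions: zlib stands in for the hardware compressor, and the one-byte flag and the function names are hypothetical rather than the tag format used in this work.

```python
import zlib

RAW, PACKED = 0, 1  # hypothetical one-byte flag values


def encode_block(block):
    """Emit the compressed block only if it is smaller than the original."""
    packed = zlib.compress(block)
    if len(packed) < len(block):
        return bytes([PACKED]) + packed
    # Expansion has taken place, so send the original data instead.
    return bytes([RAW]) + block


def decode_block(encoded):
    """Recover the original block from either form."""
    flag, payload = encoded[0], encoded[1:]
    return zlib.decompress(payload) if flag == PACKED else payload
```

With this scheme the output never exceeds the input block by more than the flag itself, which bounds the worst case for already-compressed or encrypted input.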
7.4. Measurement of Success 
A thorough analysis has been undertaken in the assessment of the merits of alternative 
solutions of multiple independent compression engines. As a result of the assessment, 
routing architectures that are suitable for parallel compression/decompression have 
been developed, thereby enabling an improvement in data throughput without 
significantly affecting compression performance by trading off compression ratio with 
latency. 
The work has generated a scalable and functional integrated multiple compression 
and decompression IP core suitable for incorporation in other digital systems. The 
SystemC model developed in this work is currently being used by another researcher 
to integrate compression with an embedded microprocessor [Xu04]. 
7.5. Limitations of Research 
The architectures developed in this thesis utilise the X-MatchProRli hardware 
compressor due to the availability of the IP. A comparison of the performance of 
X-MatchProRli with other compression algorithms in a multiple compressor system 
would have been a desirable part of the current work, but such hardware descriptions are 
limited in their availability. Moreover, implementing an alternative compression engine 
from scratch is very time-intensive and would have constituted a considerable drain 
on the time available to perform other aspects of the research. 
The routing strategies that have been explored have largely been those that 
maximise throughput, but generally there is a trade-off between achievable 
compression and latency. A solution that combines high throughput without 
compromising compression and latency would have been desirable, but none was 
identified in the current work. 
7.6. Future work 
Due to the limited time available, not all potential avenues for the PhD research could 
be explored. Some of the possible extensions to the work are as follows. 
• The system could be extended to provide the facility to select or 
dynamically change the routing strategy in multiple compressor systems 
depending on data characteristics or system requirements. 
• Similarly, a dynamic system that allocates additional compressors or 
decompressors depending on the current throughput requirements is also 
possible, allowing optimisation in power or silicon area usage. 
• The solution developed in this work could be integrated with other 
hardware IP blocks and software to form a complete system for appropriate 
parts of a computer system, for example high-speed networks or hard disk 
compression. 
• Alternative parallel paradigms could be investigated to understand their 
impact on performance. For example, one possible alternative is to carry 
out parallel compression using a shared dictionary as in the IBM MXT 
memory chip. 
• Instead of using multiple compressors for increased throughput, they could 
be used to increase compression. As each compressor can be tailored to 
suitable data characteristics, all compressors could process the same data 
but using different methods and the algorithm with the best compression 
performance selected. This approach is already used in some software 
compression archivers, but these are extremely computationally intensive as 
the data are compressed by each compression algorithm in a sequential 
manner before the one with the best compression ratio is chosen. 
• A combination of some of the above ideas could allow a system to be 
designed in which the best compression algorithm is chosen from multiple 
parallel compression algorithms in real-time, with the chosen algorithm 
being duplicated many times into a reprogrammable device where a 
speedup can be gained from using multiple identical algorithms. 
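The idea in the list above of compressing the same data with several methods and keeping the best result can be illustrated in software. The Python sketch below is only an analogy for the proposed hardware: three standard-library codecs stand in for differently tailored compression engines, they are run one after another rather than in parallel, and the one-byte codec identifier and function names are hypothetical.

```python
import bz2
import lzma
import zlib

# Hypothetical codec table: identifier -> (compress, decompress).
CODECS = {
    0: (zlib.compress, zlib.decompress),
    1: (bz2.compress, bz2.decompress),
    2: (lzma.compress, lzma.decompress),
}


def encode_best(block):
    """Compress with every codec and keep the smallest output."""
    candidates = [(cid, comp(block)) for cid, (comp, _) in CODECS.items()]
    best_id, best = min(candidates, key=lambda item: len(item[1]))
    # The leading byte tells the decoder which codec was selected.
    return bytes([best_id]) + best


def decode_best(encoded):
    """Decode using the codec named by the leading identifier byte."""
    _, decompress = CODECS[encoded[0]]
    return decompress(encoded[1:])
```

In software the candidates are evaluated sequentially, which is the cost noted in the list; dedicated parallel engines would evaluate them concurrently.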
7.7. Summary 
This thesis has addressed the problem of high-speed lossless data compression 
capable of operating at high bandwidths. The work has resulted in a scalable multiple 
processor compression and decompression architecture, and has adopted routing 
strategies suitable for implementation in a range of digital systems. 
References 
[Accellera] SystemVerilog, System Design Language, "http://www.accellera.org/", 
September 2003. 
[AHA] AHA, Advanced Hardware Architectures, Adaptive lossless data compression, 
"http://www.aha.com/tech.php", July 2003 
[Arnold97] R. Arnold and T. Bell, "A corpus for the evaluation of lossless compression 
algorithms", IEEE Computer Society Press, Data Compression Conference, pp. 201-210, 
1997 
[Arps88] R. B. Arps, T. K. Truong, D. J. Lu, R. C. Pasco, and T. D. Friedman, "A Multi-Purpose 
VLSI Chip for Adaptive Data Compression of Bilevel Images", IBM J. Res. 
Development, Vol. 32, pp. 775-795, 1988. 
[Bell90] T.C. Bell, J.G. Cleary and I.H. Witten, "Text Compression", Englewood Cliffs, 
NJ, Prentice-Hall, 1990 
[Boliek94] M. Boliek, J.D. Allen, E.L. Schwartz and M.J. Gormish, "Very High Speed 
Entropy Coding", IEEE International Conference on Image Processing, Vol. 3, pp. 625-629, 
November 1994 
[Bongjin94] B. Jung and W.P. Burleson, "A VLSI Systolic Array Architecture for 
Lempel-Ziv based Data Compression", Proc. IEEE Int. Symp on Circuits and Systems, 
pp. 65-68, June 1994 
[Bongjin95] B. Jung, W.P. Burleson, "Real-Time VLSI Compression for High Speed 
Wireless Local Networks", Data Compression Conference, March 1995 
[Bongjin98] B. Jung, W.P. Burleson, "Performance Optimization of Wireless Local Area 
Networks through VLSI Data Compression", Wireless Networks, Vol. 4, Issue 1, pp. 27-39, 
January 1998 
[Bose01] S. K. Bose, "An Introduction to Queueing Systems", Kluwer Academic/Plenum 
Publishers, ISBN 0-306-46734-8, December 2001. 
[Canterbury] Canterbury Corpus, Lossless Compression Dataset, 
"http://corpus.canterbury.ac.nz", valid August 2003 
[Chen98] J.M. Chen, C.R. Wei, "A Novel VLSI Design for Ziv-Lempel Data 
Compression", The 1998 IEEE Asia-Pacific Conference on Circuits and Systems, pp. 
739-742, 1998 
[Cormack87] G.V. Cormack and R.N.S. Horspool, "Data Compression using Dynamic 
Markov Modelling", The Computer Journal, Vol. 30, No. 6, pp. 541-550, 1987 
[Dagastine01] Gary Dagastine, MXT, IBM Compressed Memory, 
"http://domino.watson.ibm.com/comm/wwwr_thinkresearch.nsf/pages/memory200.html", 
2001 
[Dorward00] S. Dorward and S. Quinlan, "Robust Data Compression of Network 
Packets", Bell Labs, Lucent Technologies, 2000 
[DoubleSpace] DoubleSpace, Lossless disk compression, 
"http://www.stiller.com/dos6.htm", September 2003 
[DriveSpace] DriveSpace, Lossless disk Compression, 
"http://www.faqs.org/faqs/windows/win95/faq/part11/preamble.html", September 2002 
[Expand03] Expand Networks, Accelerator, "http://www.expand.com", July 2003 
[Flynn66] M.J. Flynn, "Very High Speed Computing Systems", Proceedings IEEE, Vol. 54, 
pp. 1901-1909, 1966 
[Franaszek98] P.A. Franaszek, United States Patent, "Parallel Compression and 
Decompression using a Cooperative Dictionary", Patent Number 5,729,228, 17 March 
1998 
[Franaszek01] P.A. Franaszek, P. Heidelberger, D.E. Poff and J.T. Robinson, "Algorithm 
and Data Structures for Compressed Memory Machines", IBM J. Research, Vol. 45, No. 2, 
March 2001 
[Golomb66] S. Golomb, "Run-Length Encoding", IEEE Trans. Info. Theory, Vol. IT-22, 
No. 4, pp. 399-401, July 1966 
[Gooch96] M. Gooch, M. Kjelso, S. Jones, "A Role for Main Memory Compression", 
Proceedings of the 22nd Euromicro Conference Beyond 2000: Hardware/Software 
Design Strategies Short Contributions, pp. 26-31, September 1996 
[Gonzalez85] M.E. Gonzalez Smith, J.A. Storer, "Parallel Algorithms for Data 
Compression", Journal of the Association for Computing Machinery, Vol. 32, No. 2, pp. 
344-373, April 1985 
[Handel-C] Handel-C, System Design Language, "http://www.celoxica.com", July 2003 
[Hifn] Hifn, LZS data compression, "http://www.hifn.com/products/Compression.html", 
valid August 2003 
[Howard92a] P.G. Howard, J.S. Vitter, "Analysis of Arithmetic Coding for Data 
Compression", Information Processing and Management, Vol. 28, No. 6, pp. 749-763, 
1992 
[Howard92b] P.G. Howard and J.S. Vitter, "Parallel Lossless Image Compression Using 
Huffman and Arithmetic Coding", Data Compression Conference 1992, pp. 299-308, 
March 1992 
[Howard94] P.G. Howard and J.S. Vitter, "Arithmetic Coding for Data Compression", 
Proceedings of IEEE, Vol. 82, No. 6, pp. 857-865, June 1994 
[Huffman52] D.A. Huffman, "A Method for the Construction of Minimum Redundancy 
codes", IRE Proceedings, Vol. 40, No. 9, pp. 1098-1101, 1952 
[Jiang94] J. Jiang and S. Jones, "Parallel Design of Arithmetic Coding", Proceedings 
IEE, Part E, Vol. 141, pp. 327-333, November 1994 
[Jones00] S. Jones, "Partial-Matching Lossless Data Compression Hardware", IEE Proc. 
Compt Digital Tech., Vol. 147, No. 5, September 2000 
[Jones92] S.R. Jones, "100 Mbit/s Adaptive Data Compressor Design Using Selectively 
Shiftable Content-Addressable Memory", Proc IEE Part G, Vol. 139, No. 4, pp. 498-502, 
August 1992 
[Knuth82] D.E. Knuth, "Dynamic Huffman Coding", Journal of Algorithms, Vol. 6, pp. 
163-180, 1982 
[Lee95] C.Y. Lee and R.Y. Yang, "High Throughput Data Compressor Designs using 
Content Addressable Memory", IEE Proceedings on Circuits Devices and Systems, Vol. 
142, No. 2, pp. 69-73, 1995 
[Lee99] J.S. Lee, W.K. Hong, and S.D. Kim, "Design and Evaluation of a Selective 
Compressed Memory System", International Conference on Computer, pp. 184-191, 
Texas USA, October 1999 
[Lee00] S. Lee, W.K. Hong, and S. D. Kim, "An On-Chip Cache Compression Technique 
to Reduce Decompression Overhead and Design Complexity", Journal of Systems 
Architecture, Vol. 46, pp. 1365-1382, December 2000 
[Lekatsas01] H. Lekatsas, J. Henkel and W. Wolf, "Design and Simulation of a Pipelined 
Decompression Architecture for Embedded Systems", Proceedings of the 14th 
International Symposium on Systems Synthesis, pp. 63-68, Montreal, 2001 
[Myoupo00] J.F. Myoupo and A. Wabbi, "Move-to-Front and Transpose Hybrid Parallel 
Architectures for High-Speed Data Compression", IPCCC2000, Arizona, February 2000 
[Nunez99] J.L. Nunez-Yanez, C. Feregrino, S. Bateman and S. Jones, "The 
X-MatchLITE FPGA-based Data Compressor", Proceedings 25th EuroMicro Conference, pp. 
126-133, September 1999 
[Nunez01a] Jose Luis Nunez-Yanez, "GBit/second Lossless Data Compression 
Hardware", Loughborough University, Thesis, 2001 
[Nunez01b] J.L. Nunez-Yanez, C. Feregrino, S. Jones and S. Bateman, "X-MatchPRO: A 
ProASIC-Based 200 Mbytes/s Full-Duplex Lossless Data Compressor", Proceedings of 
the 11th International Conference FPL 2001, Lecture Notes in Computer Science, 
Springer, pp. 613-617, August 2001. 
[Nunez02] J.L. Nunez-Yanez and S. Jones, "Lossless Data Compression Programmable 
Hardware for High-Speed Data Network", Proceedings of IEEE International Conference 
on Field-Programmable Technology (FPT), pp. 290-293, Hong Kong, China, December 
2002 
[Penzhorn92] W.T. Penzhorn, "A Parallel Architecture for High Speed data Compression", 
IEEE South African Symposium on Communications and Signal Processing, pp. 173-175, 
September 1992 
[PKZIP] Software lossless compression, Pkware, "http://www.pkware.com/", valid 
September 2003 
[PPM] PPM compression, "http://datacompression.info/PPM.shtml", valid 
September 2003 
[PPMZ] High Compression Markov predictive Coder, Charles Bloom, 
"http://www.cbloom.com/src/ppmz.html", valid September 2003 
[Ranganathan93] S. Henriques and N. Ranganathan, "High speed VLSI Design for 
Lempel-Ziv based Data Compression", IEEE Trans. on Circuits and Systems, Vol. 40, 
No. 2, pp. 90-106, February 1993 
[Rissanen79] J. Rissanen and G.G. Langdon, "Arithmetic Coding", IBM Journal of 
Research and Development, Vol. 23, No. 2, pp. 149-162, March 1979 
[Shannon-Fano] Shannon-Fano Coding, Lossless compression algorithm, 
"http://www.nist.gov/dads/HTML/shannonFano.html", valid September 2003 
[Simpson98] J.L. Simpson, C.L. Sabharwal, "A Multiple Processor Approach to Data 
Compression", ACM Symposium on Applied Computing, Florida, pp. 641-649, March 
1998 
[Slattery98] M.J. Slattery, F.A. Kampf, "Design Consideration for ALDC Cores", IBM 
Journal of Research and Development, Vol. 42, No. 6, pp. 747-752, November 1998 
[Stacker] Stacker, Hard disk Compression, "http://www.loc!.net/stac.html", valid August 
2003 
[Stauffer93] L.M. Stauffer and D.S. Hirschberg, "Parallel Text Compression", Technical 
Report 91-44 Revised, Info and Comp Sci. Department, University of California, Irvine, 
1993 
[Stefo01] R. Stefo, J.L. Nunez-Yanez, C. Feregrino, S. Mahapatra, and S. Jones, 
"FPGA-Based Modelling Unit for High Speed Lossless Arithmetic Coding", Proceedings of the 
11th International Conference FPL 2001, Springer, pp. 643-647, August 2001 
[Storer90] J.A. Storer and J.H. Reif, "A Parallel Architecture for High Speed Data 
Compression", 1990 IEEE 
[SystemC] SystemC, System design language, "http://www.systemc.org", August 2003 
[Tremaine01] R.B. Tremaine, P.A. Franaszek, J.T. Robinson, C.O. Schulz, T.B. Smith, 
M.E. Wazlowski and P.M. Bland, "Memory Expansion Technology (MXT)", IBM Journal 
of Research and Development, Vol. 45, No. 2, March 2001 
[V42.bis] Recommendation V42.bis, "Data Compression Procedures for Data Circuit 
Terminating Equipment (DCE) using Error Correction Procedures", CCITT (ITU), 
January 1990 
[Wei93] B. W. Y. Wei, R. Tarver, J. S. Kim, and K. Ng, "A Single Chip Lempel-Ziv Data 
Compressor", Proc. IEEE Int'l Symp. Circuits and Systems (ISCAS), pp. 1953-1955, May 
1993 
[Xie01] Y. Xie, W. Wolf, H. Lekatsas, "A Code Decompression Architecture for VLIW 
Processors", Proceedings, 34th Annual International Symposium on Microarchitecture, 
IEEE Computer Society Press, pp. 66-75, 2001 
[Xu04] X. H. Xu, C. T. Clarke, S. R. Jones, "High Performance code compression 
architecture for the embedded ARM/THUMB processor", Proceedings of the first 
conference on Computing Frontiers, pp. 451-456, 2004 
[Ziv77] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data 
Compression", IEEE Trans. on Information Theory, Vol. IT-23, No. 3, pp. 337-343, 1977 
[Ziv78] J. Ziv and A. Lempel, "Compression of Individual Sequences Via Variable-Rate 
Coding", IEEE Information Theory, Vol. 24, No. 5, pp. 530-536, 1978 
Conference 
Mark Milward, Simon Jones, Jose Luis Nunez-Yanez, "SystemC Modelling of the 
X-MatchPro Data Compressor for Multi-Gbit/s Networks", 4th European SystemC Users 
Group Meeting, Copenhagen, October 5th 2001 
Mark Milward, Simon Jones, Jose Luis Nunez-Yanez, "Implementation of the 
X-MatchPro Data Compressor into a Parallel Architecture", PREP2002, Nottingham, 17th 
April 2002 
Mark Milward, Jose Luis Nunez-Yanez, David Mulvaney, "Routing Strategies for High 
Speed Parallel Data Compression", International Conference on Parallel and Distributed 
Processing Techniques and Applications, Las Vegas, June 2003 
Mark Milward, "Lossless Parallel Data Compression Systems", ECS Division 1ST 
Miniconference, Loughborough, September 2003 
Journal Publication 
Mark Milward, Jose Luis Nunez-Yanez, David Mulvaney, "Design and Implementation 
of a Lossless Parallel High-Speed Data Compression System", IEEE Transactions on 
Parallel and Distributed Processing Techniques, Vol. 15, No. 6, June 2004 
Patent 
Simon Jones, Jose Luis Nunez-Yanez, Mark Milward, "Apparatus to Provide Fast Data 
Compression", Patent No. 02710146.8-2205-GB0200443, Date of Filing 01.02.02 
Abstract 
The current increases in silicon logic densities have made feasible the implementation 
of multiprocessor systems onto a single chip able to meet intensive data processing 
demands of highly concurrent systems. This thesis describes the research and 
hardware implementation of a high performance parallel multicompressor chip. In 
order to fully explore the design space, several models at various levels of 
abstraction have been implemented to capture the full characteristics of the 
architecture. A detailed investigation into the performances of alternative input and 
output routing strategies for realistic data sets demonstrates that the design of parallel 
compression devices involves important trade-offs that affect compression 
performance, latency, and throughput. The most promising approach is written in a 
hardware description language and synthesised for FPGA hardware as proof of 
concept. It is shown that a multicompressor architecture can be a scalable solution 
able to operate at throughputs able to cope with the demands of modern high-
bandwidth applications whilst retaining good compression performance. 
Keywords: Lossless Data Compression, Parallel Architectures, Hardware, Routing 
Strategies 
GLOSSARY 
ALDC     Adaptive Lossless Data Compression 
ASIC     Application Specific Integrated Circuit 
CAM      Content Addressable Memory 
CPU      Central Processing Unit 
DCLZ     Data Compression Lempel-Ziv 
DMC      Dynamic Markov Compression 
EDA      Electronic Design Automation 
FIFO     First In First Out 
FPGA     Field Programmable Gate Array 
LZ77     Lempel-Ziv Coding 1977 
LZ78     Lempel-Ziv Coding 1978 
LZS      Lempel-Ziv Stac (developed by Stac Electronics) 
MIMD     Multiple Instructions Multiple Data 
MISD     Multiple Instructions Single Data 
ODA      Out of Date Adaptation 
SISD     Single Instruction Single Data 
SIMD     Single Instruction Multiple Data 
PCI      Peripheral Component Interconnect 
PCI-X    Peripheral Component Interconnect eXtended 
PE       Processing Element 
PPM      Prediction by Partial Matching 
PPMZ     Prediction by Partial Matching with Lempel-Ziv 
RAM      Random Access Memory 
RTL      Register Transfer Language 
VLSI     Very Large-Scale Integration 
X-RLI    X-MatchPro-RLI (Run Length) 


