Compressed instruction cache architecture for high-performance embedded RISC systems by Elena G. Nikolova (7200578)
Compressed Instruction Cache Architecture for 
High Performance Embedded RISC Systems 
by Elena Georgieva Nikolova 
A Doctoral Thesis 
Submitted in partial fulfilment of the requirements 
for the award of 
Doctor of Philosophy 
of 
Loughborough University
July 2007 
© Elena Georgieva Nikolova 
Abstract 
The influence of embedded systems is felt in many aspects of our daily lives; being particularly 
apparent in consumer electronics and automotive products. Customer demand and rapid 
advances in the complexity of the underlying technology have enabled the introduction of new 
systems and services that were simply not feasible just a few years ago. 
Although the cost of embedded systems is an important design parameter, their development 
is also affected by performance and functionality. The performance issue is traditionally 
addressed by the design of faster microprocessors, but more recently by the exploitation of 
parallelism (for example, vector units and very long instruction word processors), as well as 
special purpose hardware architectures, such as graphics processing units and network cards. 
In such systems, however, the main performance bottleneck is often the memory hierarchy, 
particularly in systems with complex memory access arbitration, where read or write 
operations to the main memory could result in delays of thousands of cycles. Although the 
widespread use of cache memories aims to alleviate this effect to some extent, memory access 
penalties remain a significant drain on performance. Functionality is closely related to the 
memory capacity available, particularly in portable systems such as mobile phones and hand-
held games consoles. 
The work described in this thesis includes a comprehensive analysis of code size and 
performance issues of embedded reduced instruction set computer architectures. The main 
contribution is a novel lossless compression-based solution (both hardware architecture and 
software tools) that generates significant reductions both in the memory requirement of the 
executable and the number of instruction cache misses. The new solution is quantitatively 
evaluated, demonstrating improvements in system performance for a wide variety of 
embedded applications, but particularly in high-end, real-time applications, such as high-
definition televisions, mobile platforms and embedded media processing systems. The benefits 
obtained are twofold. Firstly, code compression gives the opportunity to increase the number 
of features implemented in the system without the need to increase the memory budget. 
Secondly, filling the instruction cache with compressed instructions enhances its effective 
capacity and consequently improves performance due to the increased likelihood of finding 
the required instruction in the cache. 
Acknowledgements 
I would like to thank ARC International Ltd for financing this research and providing the 
necessary equipment to carry it out. 
I am deeply indebted to my supervisor Dr. David Mulvaney for his support, guidance, sound 
advice and much valued input throughout the extended period of this research. Special thanks 
to Dr Vassilios Chouliaras for the stimulating suggestions and his technical expertise that he so 
generously shared with me. 
Thanks to my colleagues and friends at Loughborough University for making such a 
challenging period special and fun. 
My deepest gratitude to my parents, Zhika and Georgi, for their love, understanding and 
constant encouragement. The same goes to my new family, Soledad, Adolfo and Cesar, who 
have always been so supportive and good to me. 
Lastly, and most importantly, I would like to thank my husband, Dario, for his inspiration 
and support, and for sharing this and everything else in my life. It is to him that I dedicate this 
work for he is truly the most remarkable person that I have ever met ... 
... and, of course, to my little girl who keeps me company while I write these lines, and who 
is to be born in a few months. 
Abbreviations 
AEU      Address Extraction Unit
ALDC     Adaptive Lossless Data Compression
ALU      Arithmetic and Logic Unit
ASIC     Application Specific Integrated Circuit
ATU      Address Translation Unit
BBC      Burst Buffer Controller
BTC      Branch Target Cache
CAM      Content Addressable Memory
CCRP     Code Compression RISC Processor
CCS      Code Compression Suite
CISC     Complex Instruction Set Computer
CLB      Cache Line Address Lookaside Buffer
CLB      Configurable Logic Block (Xilinx)
COF      Change Of Flow
CPC      Compressed Program Counter
CPI      Cycles Per Instruction
CR       Compression Ratio
DCLZ     Data Compression according to Lempel and Ziv
DU       Decoding Unit
DDR      Double Data Rate
DMC      Dynamic Markov Compression
ELF      Executable Linkable Format
EPIC     Efficient Pyramid Image Coder
FCRAM    Fast Cycle RAM
FIFO     First-In First-Out
FPGA     Field Programmable Gate Array
FSM      Finite State Machine
HLL      High Level Language
IBC      Instruction Based Compression
IC       Instruction Count
ICache   Instruction Cache
ICE      In-Circuit Emulators
I/O      Input/Output
ISA      Instruction Set Architecture
ISS      Instruction Set Simulator
LAB      Logic Array Block
LAT      Line Address Table
LC       Logic Cell (Xilinx)
LE       Logic Element (Altera)
LIFO     Last-In First-Out
LRU      Least Recently Used
LTP      Long Term Prediction
LUT      Look-Up Table
LZ       Lempel-Ziv (compression algorithms)
MIPS     Microprocessor without Interlocked Pipeline Stages
MIPS     Million Instructions Per Second
MMU      Memory Management Unit
MSB      Most Significant Bit
PAG      Pre-fetch Address Generator
PBC      Pattern Based Compression
PC       Program Counter
PPM      Prediction by Partial Match
RISC     Reduced Instruction Set Computer
RPE      Residual Pulse Excitation
ROM      Read Only Memory
RTL      Register Transfer Level
RTOS     Real-Time Operating Systems
SADC     Semi-Adaptive Dictionary Compression
SAMC     Semi-Adaptive Markov Compression
SoC      System-on-Chip
SRS      Subroutine Return Stack
TBC      Tree Based Compression
TIA      Target Instruction Addresses
VHDL     VHSIC Hardware Description Language
VLIW     Very Long Instruction Word
Contents 
List of Figures
List of Tables
1 Introduction
    1.1 Historical overview
    1.2 Problem identification and existing solutions
    1.3 Research aims and objectives
    1.4 Summary of achievements
    1.5 Structure of Thesis
2 Literature Review
    2.1 Embedded systems
        2.1.1 ARCangel development board [9]
        2.1.2 PNX8550
    2.2 Data compression
        2.2.1 Modelling
        2.2.2 Coding
    2.3 Compiler optimisations
    2.4 Hybrid RISC Architectures
        2.4.1 MIPS16
        2.4.2 ARM Thumb Processor
        2.4.3 ARCompact
        2.4.4 Conclusions
    2.5 Code Compression
        2.5.1 Algorithms
        2.5.2 Decompression implementations
    2.6 FPGA devices
    2.7 Conclusions
3 Entropy of Embedded RISC Code
    3.1 The concept of entropy in information theory
    3.2 Experimental framework
        3.2.1 Entropy measurement tool
        3.2.2 Test Material
        3.2.3 Lossless data compression algorithms
    3.3 Experimental trials
        3.3.1 Entropy measurements
        3.3.2 Evaluation of data compression algorithms
    3.4 Conclusions
4 Compression Algorithm, Tools and Design Flow
    4.1 Overview of the proposed solution
    4.2 Compression Algorithm
        4.2.1 Algorithmic design considerations
        4.2.2 General description
        4.2.3 Compressed to uncompressed address space mapping
    4.3 Design flow and development tools
        4.3.1 Design flow
        4.3.2 Compressor tool description
    4.4 Conclusions
5 Decompressor Functionality, Architecture and Implementation
    5.1 Overview
        5.1.1 ARCTangent-A4 architecture
        5.1.2 Integration of the decompressor
    5.2 Functionality and architecture of the decompressor
        5.2.1 Decoding unit (DU)
        5.2.2 Address Translation Unit (ATU)
    5.3 Decompressor implementation
        5.3.1 Configuration parameters
        5.3.2 Implementation Results
    5.4 Conclusions
6 Metrics and Tools for System Performance Analysis
    6.1 Performance metrics
    6.2 Performance analysis techniques
    6.3 Simulation tools
        6.3.1 ARCTangent-A4 ISS
        6.3.2 Code Compression Suite (CCS)
    6.4 Summary
7 System Performance - Results and Analysis
    7.1 ICache simulation cases
    7.2 Performance parameters
        7.2.1 ICTOTAL and Pipeline Hazards cycles
        7.2.2 Instruction cache performance
        7.2.3 COF targets calculation
    7.3 System level performance
    7.4 Summary
8 Conclusions
    8.1 Summary of objectives
    8.2 Research overview
    8.3 Summary of achievements
    8.4 Future work
    8.5 Summary
REFERENCES
A EXECUTABLE AND LINKING FORMAT (ELF)
B COMPRESSION RATIO RESULTS
C PUBLICATIONS
List of Figures 
2.1 Data compression taxonomy
2.2 An example of compression process
2.3 Mapping of hybrid instructions
2.4 Code Compression RISC Processor [15]
2.5 Line Address Table Entry
2.6 CLB Organisation [27]
2.7 Index table mapping of TIAs to compressed memory [7]
2.8 Basic Spartan-IIE Family FPGA Block Diagram [31]
2.9 Virtex-II Architecture Overview [32]
2.10 Cyclone EP1C6 Device
3.1 Entropy Tool Pseudo Code
3.2 Ratio between entropy and actual programs' sizes
3.3 Compression performance of a range of algorithms applied to the test-bench programs
3.4 Compression ratio results for different input block sizes
3.5 Compression ratios for various dictionary sizes
4.1 Uncompressed to compressed file conversion
4.2 Codeword format
4.3 Compression ratios achieved for different dictionary sizes
4.4 Relative COF instructions format
4.5 PC update and loop detection mechanism [42]
4.6 Calculating the branch appendixes
4.7 Schematic of the hardware/software co-design process including compression
4.8 Compression stages
5.1 ARCTangent-A4 pipeline stages
5.2 Decompressor interface
5.3 Modification of the pipeline of ARCTangent-A4 for decompression
5.4 Architectural overview of the decompressor
5.5 Functional diagram of the decompressor's DU
5.6 Basic diagram of the decoding unit
5.7 Operation of the input buffer
5.8 The instruction word in the decoder
5.9 Decode control unit state diagram
5.10 Startup behaviour of the address translation unit
5.11 Functionality of the ATU during sequential execution
5.12 Functionality related to branch and branch and link instructions
5.13 Jump related functionality of the ATU
5.14 Main modules of the ATU
5.15 Factorial calculation illustrating recursive function calls
5.16 SRS stack operation
5.17 Jump table architecture
5.18 BTC entry fields
5.19 BTC configurability allows for hardwired and dynamic branch tables
5.20 ATU control unit finite state machine diagram
5.21 Output Address Select Logic
6.1 Excerpt from the trace file generated by the ARCTangent-A4 ISS during the simulation of a test bench program
6.2 CCS main interface window
6.3 Summary of the command bar functionalities
6.4 Settings dialog window
6.5 View of the dynamic instruction (left) and branch (right) analysis windows
6.6 Two-way set associative instruction cache architecture
6.7 Functional diagram of the cache simulation module
6.8 CCS modules interdependency
7.1 ICache simulation cases
7.2 Improvement in cache hit ratio for a number of the test applications
7.3 Average improvement in cache hit ratio
7.4 Number of BTC misses for various configurations
A.1 Structure of ELF file
List of Tables 
2.1 Four examples of embedded systems with approximate attributes, Koopman [8]
2.2 Results achieved by Araujo et al. [18]
2.3 Main complexity parameters for the selected Spartan-IIE FPGA
2.4 Virtex2 FPGA main complexity parameters
2.5 Selected Cyclone FPGAs main complexity parameters [33]
3.1 Entropies for a range of programs compiled for the ARCTangent-A4
5.1 Processor interface signals
5.2 Targeted FPGAs
5.3 Decompressor configuration parameters and values used for synthesis
5.4 Decompressor's clock frequencies for the targeted devices
5.5 Complexity figures for Cyclone EP1C6 device
5.6 Complexity figures for Xilinx devices
6.1 Summary of files that may be found in an application's project directory
7.1 ICTOTAL and pipeline stall cycles for the test bench applications
7.2 Number of instruction cache misses for uncompressed memory systems
7.3 Number of instruction cache misses for compressed memory systems
7.4 COF instructions and coverage of BTC hardwired entries
7.5 Performance improvement for systems with different number of ICache refill cycles
B.1 Compression ratio results for ALDC
B.2 Compression ratio results for LZS
B.3 Compression ratio results for X-MatchPro for dictionary size of 128 bytes
B.4 Compression ratio results for X-MatchPro for dictionary size of 256 bytes
B.5 Compression ratio results for X-MatchPro for dictionary size of 512 bytes
B.6 Compression ratio results for X-MatchPro for dictionary size of 1024 bytes
B.7 Compression ratio results for X-MatchPro for dictionary sizes of 2048 and 4096 bytes
B.8 Compression ratio results for PPMZ
B.9 Compression ratio results for DCLZ
B.10 Compression ratio results for CLB
Chapter 1 
INTRODUCTION 
Embedded systems play a substantial and ever-increasing part in our lives. Without such 
systems, most consumer electronics devices would not be feasible and many modern systems, 
such as those developed in the aerospace and automotive industries, would be significantly less 
advanced in their features. While the physical dimensions of many devices have reduced 
considerably, the performance and features expected by consumers have increased. 
Consequently, an ever-increasing quantity of code must be developed by software engineers 
for inclusion in embedded systems that are limited in terms of their resources, such as space, 
memory, speed, and power. This research concentrates on RISC processors that are 
commonly found at the core of embedded systems. In particular, the work focuses on the 
widely recognised problem [1] of such architectures, which has received a significant amount of 
attention in the past decade: namely the inefficient use of memory by their executables. 
Although a number of solutions have been adopted, they usually tend to improve the 
memory usage at the expense of performance. This thesis describes a novel solution based on 
a holistic approach, where the requirements of modern embedded systems in terms of code 
size, performance, and power consumption awareness are taken into account. The proposed 
solution is based on a lossless compression method, where the boundary between compressed 
and uncompressed address spaces is located between the instruction cache (ICache) and the 
processor core. In this way, the ICache holds compressed instructions, and therefore its 
effective size increases, thus reducing the ICache miss ratio. 
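As a rough illustration of the effective-size argument above, the sketch below estimates how many bytes of uncompressed code a cache can represent when its lines hold compressed instructions. It assumes that the compression ratio is defined as compressed size divided by original size, and it ignores any storage needed for address translation; the function name and the 8 KiB cache size are hypothetical and are not taken from the thesis.

```python
def effective_capacity(cache_bytes: int, compression_ratio: float) -> float:
    """Bytes of uncompressed code represented by a cache of compressed code.

    compression_ratio is assumed to mean compressed_size / original_size,
    so 0.5 corresponds to a 50% code size reduction.
    """
    if not 0.0 < compression_ratio <= 1.0:
        raise ValueError("compression_ratio must lie in (0, 1]")
    return cache_bytes / compression_ratio


if __name__ == "__main__":
    # Hypothetical 8 KiB ICache with code compressed to half its size.
    print(effective_capacity(8 * 1024, 0.5))
```

Under these assumptions, halving the code size doubles the amount of original code that the same physical cache can hold, which is the mechanism by which the miss ratio is expected to fall.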
In this chapter, a brief historical overview of the development and adoption of RISC 
architectures is presented, followed by the identification of the problem, objectives and aims 
of this research. The last section of this chapter summarises the structure of the thesis. 
1.1. Historical overview 
In the late 1960s, due to the significant increase in the code size of the developed applications, 
for the first time software costs began growing faster than hardware costs [1]. High-level 
programming languages (HLLs) had already appeared, but still, due to memory limitations and 
underdevelopment in compiler technology, most programs were written in assembly. In the 
following 15 years, in order to alleviate the software crisis, the major efforts mainly 
concentrated on developing new software-oriented processor architectures. Probably, the 
most significant representative of these architectures was the DEC VAX, with a highly 
orthogonal instruction set that provides mapping for most high-level language statements into 
a single instruction [1]. The implementation of an architecture of such complexity was a 
significant advance for its time, but it suited the limited memory resources and state-of-the-art 
compiler technology available. 
In 1975, IBM began a project to build the first reduced instruction set computer (RISC), 
the 801, which was not announced until 1982. In 1980, at Berkeley, the RISC-I and RISC-II 
processors were designed [1]. At around the same time, at Stanford University, the MIPS 
(microprocessor without interlocked pipeline stages) processor was developed. These first 
RISC computers all had the basic features that are still common in current RISC processors, 
namely a simple load-store architecture, efficient pipelining, fixed-length instructions (typically 
32-bit), and compiler-based static scheduling. By the end of the 1980s, most of the major 
processor manufacturers, including HP, Sun Microsystems, Apple, IBM and Motorola had 
developed a range of microprocessors based on the RISC architecture. These were rapidly 
adopted in server and desktop applications and these and their descendants have been widely 
used ever since. However, the embedded market proved to be rather slow in migrating to the 
32-bit RISC architectures, and only in the early 1990s, when demand increased for applications 
requiring enhanced performance, did the technology become widespread. The emergence of 
Systems-on-Chip (SoC) technologies, based on RISC cores, in the late 1990s, met the demand 
from the consumer electronics market for high performance and low power. 
Today, almost every modern processor is based on the RISC architectural principles, with 
the principal exception of the 80x86 families of microprocessors. The advantages and 
disadvantages of both RISC and CISC (complex instruction set computer) architectures have 
been thoroughly studied and documented [1]. In today's rather saturated and highly 
competitive market, the number of features that can be offered by a product is often seen as 
one of the main differentiating factors. To support such demand, additional functionality and 
more applications are under continuous development, requiring more memory and hence 
increasing cost. Furthermore, in many embedded systems such as mobile phones or real-time 
control systems, memory accounts for a substantial portion of the solution's cost. As 
additional memory consumes more power, optimising memory usage is of huge importance, 
especially for portable and handheld devices. 
1.2. Problem identification and existing solutions 
The RISC computer paradigm has determined, to a large extent, the way that microprocessors 
are currently engineered. Rather than following the design rules of their time, their 
architectures combined simplicity (e.g. fixed-length instruction encoding), efficiency (e.g. 
efficiently designed pipelining) and performance. Such a combination of characteristics makes 
them particularly suitable for embedded system and System-on-Chip (SoC) architectures, 
where area constraints and power consumption are often as important as the performance of 
the final solution. However, RISC architectures are not exempt from drawbacks, the principal 
one being poor code density. Considerable effort has been expended in the improvement of 
compiler technology, in order to better match the target executable requirements thereby 
generating smaller and more efficient code, and significant advances have been made in terms 
of local optimisations (within the scope of functions). Conversely, the quantity of software 
that needs to be run on modern embedded products (such as MP3 players, mobile phones and 
gaming platforms) is growing rapidly in order to support the ever-increasing number of 
features demanded by an extremely competitive market. Thus, applications that only a few 
years ago required a small number of assembler or C functions for their implementation now 
require full scale software engineering programs, some of which require the code to be 
partitioned for simultaneous execution on a number of embedded processors. Current 
compiler technology is not sufficiently advanced for the optimisation of such a parallel set of 
executables. Bearing in mind that embedded applications are constrained in terms of physical 
dimensions and unit cost, memory is often a scarce resource in embedded systems. Therefore, 
efforts to improve the code density of RISC processors date back to the very first embedded 
implementations, and have generally followed one of the three main strategies described 
below. 
Compiler technology improvements 
A range of techniques are available in most compilers when they are configured to optimise 
the code size of an application. These include dead or unreachable code elimination (which strips the 
program of redundant code) and global and local sub-expression elimination (which replaces two 
instances of the same computation by a single copy). A more comprehensive discussion of 
these and other compiler optimisation techniques is presented in Chapter 2. The main 
advantages of allowing the compiler to reduce the code size are the automation of the process 
and that no additional run-time support in terms of software or hardware is required. 
However, the reduction ratios that compilers achieve are rather moderate compared with 
compression, and some of the techniques, such as sub-expression elimination, can seriously 
reduce the performance of the generated executable [1,2]. 
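The two techniques named above can be illustrated at source level. The following is only a hand-written sketch in Python with hypothetical function names; a real compiler applies these transformations to its intermediate representation rather than to source text.

```python
def before(a: int, b: int, c: int) -> int:
    """Version containing a repeated sub-expression and unreachable code."""
    x = (a + b) * c
    y = (a + b) * 2          # (a + b) is computed a second time
    if False:                # this branch can never execute
        y = y + 1
    return x + y


def after(a: int, b: int, c: int) -> int:
    """Same result after both eliminations have been applied by hand."""
    t = a + b                # the common sub-expression is computed once
    x = t * c
    y = t * 2
    return x + y             # the unreachable branch has been removed
```

Both functions return the same value for any integer inputs; the second simply needs fewer operations to be encoded, which is the source of the code size saving.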
Hybrid instruction set architecture (ISA) 
This approach has been taken by most RISC manufacturers [3], and it is based on the 
development of a 16-bit subset of the previously purely 32-bit instruction set architecture 
(ISA). The implementation of a hybrid ISA requires the development of a new set of software 
tools (compiler, linker and debugger), together with the implementation of dedicated hardware 
to support the decoding of the new 16-bit instructions and a method for switching between 
ISAs. Some examples of this technology are: Thumb (from ARM Ltd.) [4], MIPS-16 (from 
MIPS) [5] and ARCompact (from ARC Int.) [6]. However, the 16-bit ISAs do not yield a 50% 
reduction in code size, nor is the 32-bit performance retained. ARM, for example, claims that 
the Thumb technology decreases code size by about 30%, but that this comes at the price of a 
reduction in performance (execution time increase) of up to 26% [4]. 
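To make the trade-off concrete, the figures quoted above can be applied to a hypothetical program. The sketch below uses the approximate 30% code size reduction and the worst-case 26% execution time increase cited for Thumb [4]; the function name, the 100 000-byte program size and the 1.0 s execution time are illustrative assumptions only.

```python
def hybrid_isa_tradeoff(code_bytes: int, exec_seconds: float,
                        size_reduction_pct: int = 30,
                        time_increase_pct: int = 26) -> tuple[float, float]:
    """Return (new code size, new execution time) after moving to a 16-bit subset.

    The default percentages are the figures quoted in the text for Thumb;
    the time increase is the stated upper bound, not a typical value.
    """
    new_size = code_bytes * (100 - size_reduction_pct) / 100
    new_time = exec_seconds * (100 + time_increase_pct) / 100
    return new_size, new_time


if __name__ == "__main__":
    size, time = hybrid_isa_tradeoff(100_000, 1.0)
    print(f"code size: {size:.0f} bytes, execution time: {time:.2f} s")
```

With these inputs the program shrinks to about 70% of its original size rather than the 50% that halving the instruction width might suggest, while in the worst case it runs about a quarter slower.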
Code compression 
The third approach uses code compression to achieve higher code density. Since the early 
1990s, a number of compression algorithms have been adapted for application to embedded 
code, but no single best approach is apparent at present. This technique achieves a more 
substantial code reduction than the other two methods discussed above and was exploited by 
IBM in the development of its CodePack system [7] (see Chapter 2). 
1.3. Research aims and objectives 
This research is concerned with alleviating the code size inefficiency typically found in 
embedded RISC implementations. Over the past twenty years, many solutions have been
proposed, but in most cases the resulting code density improvements, which often were 
moderate, came at the cost of serious performance degradation. Therefore, there is a clear 
need to further investigate this field with the aim of identifying a suitable solution that 
provides good code density while preserving (or ideally improving) the overall performance of 
the system. Thus, the aim of this research is to obtain such a solution and in order to achieve 
this goal, several objectives are apparent. 
• To highlight the current state-of-the-art in code reduction, by carrying out a 
comprehensive literature review of the field, including the previously mentioned 
techniques of compiler optimisations, hybrid architectures and code compression. 
• To quantify the redundancy present in RISC code, by studying the suitability of code
compression to be used at the core of a novel code reduction solution for high-performance
embedded RISC applications.
• To implement a suitable simulation model that would allow the validation of the 
proposed solution and to quantify its impact on a system's performance. 
• To develop the necessary software tools for compressing RISC code and the required
hardware architectures for supporting the real-time execution of the compressed
applications. Such an architecture should be power-efficient and performance oriented.
1.4. Summary of achievements 
This thesis presents a novel code compression technique particularly well suited to high-performance
embedded RISC applications and complex SoC solutions, which not only
provides excellent compression ratios (typically about 50% code reduction), but also improves
performance by significantly reducing the number of cache misses. The solution employs a
dictionary-based compression scheme that is tailored to the characteristics of the RISC ISA.
The decompression process is carried out in real-time by a novel hardware decompression 
architecture, which is seamlessly incorporated in the pipeline of the processor, extending it by 
two stages. A number of original design solutions have been developed in the building of the 
decompressor, including the elimination of large look-up tables usually employed to resolve 
the mapping between compressed and uncompressed address spaces. The solution is scalable 
and configurable to suit the characteristics of each application and can be easily tuned to the 
physical area or execution speed requirements of the project. A desktop software application 
supported by an intuitive graphical user interface (GUI) has been developed that allows the 
control and analysis of all the implementation parameters. For instance, this tool allows the 
user to analyse the entropy and control flow structure of a RISC executable, to compress it
selecting from a variety of options (e.g. dictionary size and type, codeword length, etc.), to 
generate the decompression tables, and to evaluate the performance of different hardware 
configurations, amongst others. 
1.5. Structure of Thesis 
The structure of the thesis is as follows. 
Chapter 2 provides background information on lossless data and code compression and 
presents the state-of-the-art in the particular field of code size reduction. The techniques
subject to analysis include those enumerated in Section 1.2.
Chapter 3 presents a quantitative study of the entropy of embedded RISC code. A set of
test bench programmes compiled for a particular commercial RISC processor (ARCTangent-A4
is used as proof of concept) is used to obtain results that are valid for a wide range of
embedded applications, ranging from automotive systems to consumer electronic applications. 
These executables are compressed with a number of alternative compression algorithms, 
demonstrating that compression is able to modify the executables in such a manner that their 
entropies become significantly closer to their intrinsic values.
Chapter 4 identifies the restrictions that embedded systems impose on the candidate 
algorithms used for compressing the executables, and the criteria for selecting one from the 
various candidate solutions. The compression scheme developed in this work is then described 
in detail, explaining its design considerations and benefits. This chapter also develops the 
design flow and the mechanisms necessary to perform the translation between compressed 
and uncompressed instruction addresses. 
Chapter 5 describes the hardware implementation of the real-time decompressor, its 
functionality, interfaces and the main building blocks. The design is synthesised and a layout 
produced for various FPGA technologies and implementation results are presented. 
Chapter 6 identifies the metrics used for analysing the performance of compressed 
executables. A comprehensive compression tool suite developed in this research is presented 
and described. Among the main components of this suite are: a system simulator designed to 
evaluate the impact of code compression on the overall system performance, a compressor 
tool, and a trace analyser that provides information about the number of delay cycles in the 
processor that arise from instruction dependencies and pipeline hazards. A configurable 
simulation model of the instruction cache enables the generation of results for different cache 
configurations. 
Chapter 7 presents and analyses the performance results (that is, the impact of the 
compression solution) at system level, obtained by the software suite presented in Chapter 6. 
Chapter 8 concludes the thesis by summarising the main findings and how they fulfil the
aims of the research. Potential future research paths are also outlined at the end of this 
chapter. 
Appendix A introduces the ELF format, commonly used in embedded code. Appendix 
B presents in full the compression results discussed in Chapter 3, and Appendix C presents a 
list of the publications based on the outcomes of this work. 
Chapter 2 
LITERATURE REVIEW 
To set the context of the current work, a brief overview of embedded systems is given first. 
An introduction to data compression is then given, followed by a review of existing methods 
developed in the past decade that improve the density of the code produced for RISC
embedded machines. The most efficient technique, in terms of maximising entropy, is code 
compression, but other methods such as compiler optimisations and hybrid architectures are 
widely available. These methods are introduced in this chapter, but as the literature indicates 
they are unlikely to give significant code reduction improvements, they are not considered in 
detail. However, code compression methods are considered more thoroughly as the 
investigation of this area is more likely to lead to efficiency improvements. Finally, an overview 
of the FPGA devices used for synthesising the developed design presented in Chapter 5 is 
given. 
2.1. Embedded systems 
Embedded systems are computer systems with software and hardware architectures tailored to 
the specific application for which they are designed. Examples of such systems are DVD 
players, digital TV sets, video cameras, mobile phones, microwave ovens, automotive control, 
avionics and spacecraft control systems, elevators, arcade games, laser printers and cash 
registers. These systems typically have one or more microprocessors, memory and peripheral
interfaces. Although in some applications the devices are programmable, in many embedded 
applications the only programming occurs in connection with the initial loading of the 
application, or during a later software upgrade. 
Koopman [8] categorised embedded systems design into four groups, as shown in Table 2.1, 
where each group represents a generalisation of a real system currently in production.
Table 2.1 Four examples of embedded systems with approximate attributes, Koopman [8]

Attribute               | Signal Processing (1)               | Mission Critical (2)                          | Distributed (3)                  | Small (4)
------------------------+-------------------------------------+-----------------------------------------------+----------------------------------+--------------------------------------
Computing speed         | 1 GFLOPS                            | 10-100 MIPS                                   | 1-10 MIPS                        | 100,000
I/O transfer rates      | 1 Gb/sec                            | 10 Mb/sec                                     | 100 Kb/sec                       | 1 Kb/sec
Memory size             | 32-256 MB                           | 16-32 MB                                      | 1-16 MB                          | 1 KB
Units sold              | 10-500                              | 100-1000                                      | 10-10,000                        | 1,000,000+
Development cost        | $20-100M                            | $10-50M                                       | $1-10M                           | $100K-1M
Lifetime                | 15-30 years                         | 20-30 years                                   | 25-50 years                      | 10-15 years
Environment             | Vibration, heat                     | Heat, Vibration, Light                        | Dirt, Fire                       | Over-voltage, Heat, Vibration
Cost sensitivity        | $1000                               | $100                                          | $10                              | $0.05
Other constraints       | Size, weight, power                 | Size, weight                                  | Size                             | Size, weight, power
Maintenance             | Frequent repairs                    | Aggressive fault detection                    | Scheduled maintenance            | "Never" breaks
Digital content         | Digital except I/O signals          | ~1/2 digital                                  | ~1/2 digital                     | Single digital chip; rest is analogue
Repair time goal        | 1-12 hours                          | 30 min                                        | 4 min-12 hours                   | 1-4 hours
Initial cycle time      | 3-5 years                           | 4-10 years                                    | 2-4 years                        | 0.1-4 years
Product variants        | 1-5                                 | 5-20                                          | 10-10,000                        | 3-10
Other possible examples | Radar/Sonar, Video, Medical imaging | Jet engines, Manned spacecraft, Nuclear power | Trains/Subways, Air conditioning | Automotive, Consumer electronics

Although some of the values are now somewhat dated (the table was published in 1999), it is
interesting to note that the memory sizes and input/output transfer rates shown in the first two
columns, namely military and aerospace embedded applications, are now not uncommon in
modern consumer electronic devices, demonstrating the substantial growth in hardware
capabilities in the intervening years.
One of the most important design decisions when building an embedded system is the 
selection of an appropriate hardware architecture. This decision is typically based on two 
major factors, namely the processing capability requirement and the cost per unit. 
Performance is often determined according to the real-time requirements of the application. 
For example, a domestic television with a single frame buffer must be able to process the 
current image frame before the next one is received. The production cost on the other hand, 
depends on a variety of factors, such as implementation technology, number and type of 
components, and manufacturing process. In many embedded systems the cost of memory
constitutes a substantial portion of the overall cost, restricting the memory capacity of the 
design and forcing software engineers to limit the supported features. As a consequence, the 
memory architecture, capacity and usage must be carefully considered in many embedded 
applications. Finally, systems with heat dissipation issues and embedded in a portable device 
may have additional restrictions in terms of power consumption. 
As is clear from Table 2.1, the term 'embedded system' covers a wide variety of
systems, ranging from simple electronic devices to complex multi-processor platforms. In
order to give a better idea of this variety, this subsection presents two modern systems with
very different complexities and performance capabilities. The first is a simple test board 
implementation, while the second is a complex SoC ASIC media-processor that is featured in 
a number of high quality multimedia products such as domestic televisions and video cameras. 
2.1.1. ARCAngel-1 test board [9] 
This FPGA development board is designed to test and evaluate embedded solutions based on 
the ARCTangent-A4 microprocessor. It includes an Altera Flex 10K200E/250E PLD, which
has the capability to hold an entire processor design (including a limited capacity instruction 
cache). It also implements phase-locked loop (PLL) circuitry (to generate system clock 
frequencies up to 25MHz), incorporates a range of displays and switches, and provides parallel 
and serial ports to enable interfacing with a host workstation. In terms of memory, the board has 1
MB of fast SRAM, of which 512 kB is used as system memory, and the remaining 512 kB is
allocated to the data cache. 
2.1.2. PNX8550 
Due to their efficient pipelined architecture and small physical area requirements, RISC 
processors are often used in high performance SoC implementations. Philips Semiconductors' 
PNX8550 SoC solution [10] is a highly integrated media processor, responsible for video 
improvement processing for both analogue and digital sources. It includes integrated dual 
program conditional access, dual program MPEG2 transport stream de-multiplexers, dual 
standard definition or single high definition MPEG2 video decode, audio decode and 
processing, graphics generation, video processing, and image composition and display. Its 
internal architecture includes more than 25 different hardware blocks that perform core video 
functions, such as picture-level MPEG2 decoding, scaling, image composition and pixel post 
processing. In addition, two 32-bit 240 MHz very-long instruction word (VLIW) processors 
carry out the advanced video improvement processing, as well as all audio operations, and a 
further 32-bit 250MHz RISC processor core runs the operating system and both accesses and 
controls all the hardware resources available on the chip. 
The memory access infrastructure of PNX8550, called the HUB, is based on the pipelined 
memory access network (PMAN) technology specification. In addition to the network 
function, the HUB includes a generic arbiter for data flow control within complex memory
systems and the PMAN Security block. The PMAN or HUB provides direct memory access
(DMA) data paths and control which link the majority of the peripheral devices with the main 
memory controller, allowing data to be read from or written to main memory at a very high 
rate. 
These two systems are used later in this work as a base for defining some of the 
parameters used in the performance evaluation study of the proposed compression solution. 
2.2. Data compression 
Technological developments continue to allow ever more data to be held by computer
storage systems. More efficient use of data storage can be achieved by sophisticated
compression algorithms, which are prevalent not only in portable systems, where the 
reduction in the memory requirement can bring power savings, but are also frequently 
provided by the memory management units of modern operating systems. 
Compression is a means of reducing the number of bits required to represent a given item 
of data. Not only does this result in an effective increase in the storage capacity of computer
memories, but it also improves the effective transmission bandwidth over internal busses.
Compression can be lossless or lossy. In lossless compression algorithms, data can be retrieved
completely after decompression, while during lossy compression some information is
irretrievably lost. In this thesis, the emphasis is on lossless compression, as in computer
applications any modifications to instruction code will affect its functionality. 
Lossless compression algorithms can be divided into three categories, namely ad hoc, 
statistical and dictionary, as shown in Figure 2.1. Ad hoc algorithms use the natural 
redundancies of specific data sets, for example differential coding, where only the
difference between consecutive data symbols is coded, or run-length encoding, where the
number of consecutive identical data items is recorded. Dictionary approaches achieve
compression by replacing groups of consecutive characters (or symbols) with indexes into a
dictionary that contains a list of symbols that occur frequently. The resultant compressed data
contains indices into the dictionary that take less space than the symbols they encode. In
statistical methods, a probability is estimated for each character and a code is chosen based on 
the value of the probability. Ad hoc methods generally only perform well on the data sets they 
are designed to work on, and on general data sets, statistical methods normally achieve better 
compression ratios, but require more complex calculations in their implementation. 
Figure 2.1 Data compression taxonomy 
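The ad hoc and dictionary categories described above can be illustrated with a minimal Python sketch. The function names, the escape convention for symbols outside the dictionary and the sample data are all assumptions made for illustration; they are not taken from any of the schemes reviewed in this chapter.

```python
from collections import Counter

def run_length_encode(data):
    """Ad hoc: record each value together with the length of its run."""
    runs = []
    for value in data:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]

def differential_encode(data):
    """Ad hoc: keep the first symbol, then only neighbour-to-neighbour differences."""
    if not data:
        return []
    return [data[0]] + [b - a for a, b in zip(data, data[1:])]

def dictionary_encode(symbols, dict_size):
    """Dictionary: replace the most frequent symbols with short indices.

    Symbols that are not in the dictionary are emitted unchanged,
    tagged as ('raw', symbol) (an assumed escape convention).
    """
    dictionary = [s for s, _ in Counter(symbols).most_common(dict_size)]
    index = {s: i for i, s in enumerate(dictionary)}
    encoded = [("idx", index[s]) if s in index else ("raw", s) for s in symbols]
    return dictionary, encoded
```

For instance, `run_length_encode([7, 7, 7, 3, 3, 9])` records three runs, and `dictionary_encode` with a two-entry dictionary leaves the rarest symbol of a short instruction stream in its raw form.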
For both dictionary-based and statistical compression, the process is performed in two 
steps; firstly modelling the input data and secondly coding it (see Figure 2.2). The model
provides the probability distribution of the symbols in the data set, while the coder, once 
supplied with this prediction, constructs a compressed representation of the data with respect 
to the probability distribution. 
[Figure 2.2: INPUT STREAM -> MODELLING -> CODING -> OUTPUT STREAM; labels: SYMBOL, PROBABILITY, CODE]
Figure 2.2. An example of compression process
The rate of compression is called the compression ratio (CR) and can be calculated as in 
equation 2.1: 
$$CR = \frac{\text{output size}}{\text{original size}} \qquad (2.1)$$
where original size is the length of a file before compression (in bits) and output size is the
length of the file after compression (in bits). The value of output size depends on the method
employed for compression and also on the information content, or entropy, of the data set
that is being compressed. The entropy (H) of a model can be determined by:
$$H_i = -\log_2 P_i \ \text{bits} \qquad (2.2)$$
where $P_i$ is the probability of occurrence of symbol i and $H_i$ is measured in number of bits as
defined by Shannon in [11] (for further details see Chapter 3, Section 3.1). The symbols with
greater probabilities of occurrence contain less useful 'information' and the lower the total
information content of a data set the greater the redundancy and consequently the greater the
potential compression that can be achieved.
It is important to notice that choosing a suitable model for the data set and implementing
an efficient coding technique can result in a compression ratio that approaches the limit that
can be determined from the natural entropy of the data [12].
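As a rough numerical illustration of equations 2.1 and 2.2, the following Python sketch computes the information content of a single symbol, the probability-weighted average over a data set, and the compression ratio. The averaging step is the standard Shannon definition and is assumed here rather than quoted from this chapter; the function names are hypothetical.

```python
import math
from collections import Counter

def symbol_information(p):
    """Equation 2.2: information content, in bits, of a symbol with probability p."""
    return -math.log2(p)

def average_entropy(data):
    """Probability-weighted mean of the per-symbol information (bits per symbol).

    Probabilities are estimated from relative frequencies in `data`,
    i.e. a simple order-0 model is assumed.
    """
    counts = Counter(data)
    total = len(data)
    return sum((n / total) * symbol_information(n / total) for n in counts.values())

def compression_ratio(output_size, original_size):
    """Equation 2.1: a smaller value means a smaller compressed file."""
    return output_size / original_size
```

Under this order-0 model, a stream in which two symbols are equally likely carries one bit per symbol, so storing each symbol in an 8-bit byte leaves considerable redundancy for a coder to remove.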
2.2.1. Modelling 
The role of the model is to supply the probabilities for the symbols. Based on how they are 
constructed, models can be divided into three main groups, termed static, semi-adaptive and 
adaptive. There is no rule of thumb that can be used to select the most suitable model for a 
particular application. Compared with static and semi-adaptive models, adaptive models are 
more complex to implement but they provide greater flexibility and better performance by 
being able to modify their behaviour as the probabilities of symbols change during 
transmission. The operation of a static model is determined before the transmission of the 
messages starts. Usually static models are built on analysis of a set of data representative of the
required type, resulting in an assignment of the same probability to any given symbol in the
message each time it appears in the stream. Semi-adaptive models require two data passes. In
the first, the data stream is parsed and the probabilities of each symbol estimated, and, in the 
second, each message is assigned the probability determined in the first pass before being 
encoded. Adaptive models, as the name suggests, are built as the data is being streamed. The 
assignment of probabilities to symbols is based on the values of the relative frequencies of 
occurrence at each given point in time. This means that a symbol that occurs frequently near
the start of the data stream may be represented by a short codeword c early in the 
transmission, even though its probability of occurrence over the complete data is low. Later, 
when other symbols begin to occur more frequently, the adaptive modeller is likely to assign c 
to one of these more probable symbols. 
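The difference between the semi-adaptive and adaptive approaches can be sketched in a few lines of Python. The class and function names are hypothetical, and initialising every count to one is an assumed convention that avoids zero probabilities; other initialisations are possible.

```python
from collections import Counter

def semi_adaptive_model(data):
    """First pass of a semi-adaptive scheme: measure the whole stream once."""
    counts = Counter(data)
    return {symbol: n / len(data) for symbol, n in counts.items()}

class AdaptiveModel:
    """Order-0 adaptive model: probabilities follow the counts seen so far."""

    def __init__(self, alphabet):
        # Every symbol starts with a count of 1 (assumed initialisation).
        self.counts = {symbol: 1 for symbol in alphabet}
        self.total = len(self.counts)

    def probability(self, symbol):
        return self.counts[symbol] / self.total

    def update(self, symbol):
        # Called after each symbol is coded, so encoder and decoder stay in step.
        self.counts[symbol] += 1
        self.total += 1
```

Feeding a run of identical symbols to `AdaptiveModel.update` raises the probability of that symbol, which is what allows a coder driven by the model to give it a shorter codeword early in the stream.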
In addition to the type, a number of classes of model exist. Finite-state probabilistic 
models are based on finite-state machines. They have a set of states S_i and a set of transition
probabilities P_ij that give the probability that when the model is in state S_i the next state will be
S_j. Each transition is labelled and no two transitions from a state will have the same label. A
message defines a specific path through the model that follows the sequence of symbols that 
form the message itself. The probability of a message is computed as the product of the 
transition probabilities that make up the path. The most popular finite-state model is the so-
called order-n fixed-context model, which uses the n preceding characters to determine the 
probability of the next character. The grammar model, which is built to accommodate nesting 
depths with probabilities associated with each production, has been used successfully for 
compressing text in formal languages, such as Pascal programs [12]. 
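A minimal sketch of an order-n fixed-context model is given below, assuming probabilities are estimated from relative frequencies in a training string. The function names are hypothetical, and the first n symbols of a message are simply treated as given context.

```python
from collections import Counter, defaultdict

def build_order_n_model(text, n):
    """Count which symbol follows each n-symbol context in the training text."""
    contexts = defaultdict(Counter)
    for i in range(n, len(text)):
        contexts[text[i - n:i]][text[i]] += 1
    return contexts

def next_symbol_probability(model, context, symbol):
    """Estimate P(symbol | context) from relative frequencies."""
    followers = model.get(context)
    if not followers:
        return 0.0  # context never seen during training
    return followers[symbol] / sum(followers.values())

def message_probability(model, message, n):
    """Product of the transition probabilities along the path the message defines."""
    probability = 1.0
    for i in range(n, len(message)):
        probability *= next_symbol_probability(model, message[i - n:i], message[i])
    return probability
```

With n = 1 and the training string "abababac", the context "a" is followed by "b" three times out of four, so the model assigns "b" a probability of 0.75 in that context.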
2.2.2. Coding 
Given a particular message, coding involves the selection of a suitable set of codewords, each 
member being a unique string of symbols (bits). Codewords can be of fixed length, in which 
case all symbols contain the same number of bits, or of variable length where symbols may be 
of different lengths. Fixed-length encoding generally yields relatively low compression ratios, 
but due to its simplicity it is still used in some applications (including code compression
schemes [13]). However, most modern coding techniques employ variable-length codewords.
A well-known example of variable-length coding is Huffman coding [14]. In Huffman coding, 
all symbols are sorted in order of probability of occurrence before the two symbols of lowest 
probability are combined to form a new composite symbol; eventually a binary tree is built 
where each node is the probability of all nodes beneath it and the leaves are the original
message. Forming a code for any symbol then consists of traversing the tree from the root to 
the particular leaf. For a given frequency distribution, there are many possible Huffman codes, 
but the total compressed length will be the same. Huffman codes have the unique property 
that no code is a prefix of another code; meaning Huffman codes can be unambiguously 
decoded. A number of different versions of Huffman coding have been used in dictionary-
based compression schemes. 
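The tree construction described above can be sketched in Python with a priority queue. This is a generic textbook construction, not one of the Huffman variants used in the dictionary-based schemes cited in this chapter; the function name and the assumption that symbols are not themselves tuples are illustrative choices.

```python
import heapq
from itertools import count

def huffman_code(frequencies):
    """Return {symbol: bit string} by repeatedly merging the two least
    probable entries. Symbols are assumed not to be tuples, since tuples
    are used here to represent internal tree nodes."""
    tie = count()  # unique tie-breaker so the heap never compares tree payloads
    heap = [(f, next(tie), symbol) for symbol, f in frequencies.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))

    codes = {}

    def walk(node, prefix):
        # Traverse from the root to each leaf, appending 0 or 1 per branch.
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes
```

Different tie-breaking rules produce different codeword assignments, but for a given frequency table the total encoded length is the same, which is the property noted in the text.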
Another popular and very efficient coding algorithm is arithmetic coding, which differs 
from Huffman coding in that it operates incrementally on a stream of symbols, whereas 
Huffman coding requires the full data set to be available to build its tree. In arithmetic coding, 
words are not necessarily represented by an integral number of bits, as the input stream is 
translated into a single floating-point value in the range [0, 1). The algorithm for coding a file 
using arithmetic coding works conceptually as follows. The model provides the symbols and 
their associated probabilities to the coder. When the first symbol is processed, the initial
interval [0, 1) is narrowed to the probability of the symbol, for example [0.2, 0.5). If the 
probability of the next symbol is in the range [0, 0.2) it will further narrow the interval by a 
factor of 1/5, namely to [0.2, 0.26). Each successive symbol then narrows the interval in the 
same manner in accordance with its probability. If the coding is incremental, the whole data 
set is divided into blocks of equal size and each of these blocks is encoded separately in the 
full interval [0,1). 
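The interval narrowing in the example above can be reproduced with a short Python sketch. The three-symbol model (names x, y, z and their sub-ranges) is an assumption chosen so that the numbers match the text, and exact fractions are used in place of floating-point values to avoid rounding effects; a practical coder would use finite-precision integer arithmetic instead.

```python
from fractions import Fraction

# Assumed model: each symbol owns a sub-range of [0, 1) whose width
# equals its probability (x: 0.2, y: 0.3, z: 0.5).
RANGES = {
    "x": (Fraction(0), Fraction(1, 5)),
    "y": (Fraction(1, 5), Fraction(1, 2)),
    "z": (Fraction(1, 2), Fraction(1)),
}

def narrow(interval, symbol_range):
    """Shrink the current interval to the part that belongs to one symbol."""
    low, high = interval
    width = high - low
    return (low + width * symbol_range[0], low + width * symbol_range[1])

def encode(message, ranges=RANGES):
    """Return the final [low, high) interval that represents the message."""
    interval = (Fraction(0), Fraction(1))
    for symbol in message:
        interval = narrow(interval, ranges[symbol])
    return interval

def decode(value, length, ranges=RANGES):
    """Recover `length` symbols from any value inside the final interval."""
    symbols = []
    for _ in range(length):
        for symbol, (low, high) in ranges.items():
            if low <= value < high:
                symbols.append(symbol)
                value = (value - low) / (high - low)  # rescale to [0, 1)
                break
    return "".join(symbols)
```

Encoding the two-symbol message "yx" first narrows [0, 1) to [0.2, 0.5) and then to [0.2, 0.26), matching the figures quoted in the paragraph.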
2.3. Compiler optimisations 
The reduced number of instructions available on RISC architectures makes them particularly 
dependent on compiler technology in order to provide efficient solutions. There is 
considerable research effort into code size reduction and the wide commercial adoption of 
RISC processors has stimulated the development of new compiler optimisation techniques. 
Modern optimising compilers usually provide the option to optimize the code either for 
'speed' or for 'space'. Speed optimization techniques, which aim to shorten execution time, 
include procedure inlining (rather than branching to a subroutine), loop unrolling and 
inversion, and allocation of live variables to registers. 'Space' optimizations, which aim to 
reduce the memory requirements, usually utilize techniques such as local and global common 
subexpression elimination (where two instances of the same computation are replaced by a 
single copy), factoring out invariants, global value numbering, and unreachable and dead code 
elimination. Due to the nature of these techniques, a program that is compiled with speed 
optimisations will, generally, result in faster but larger code, while conversely, when compiled 
for space, the code will be more compact (up to about 18%) [15], but slower. It is important to 
realize that compilers typically only deal with a small part of the program at any one time
(perhaps as much as a compilation unit, but often only a single function), and so are unable to 
take into consideration optimisation at a higher level of abstraction that could further reduce 
code size. 
Code compaction includes a number of compiler techniques that were developed as a 
next step of improving code density, building on the base of "classical" compiler 
optimisations. Initially, compaction was envisaged as a suitable solution to the RISC code 
density problem as it does not require any further support: no decompression takes place and 
there is no need to implement additional hardware or provide further run-time software
support. Thus the overhead is limited to the development of the special compiler optimizer 
only. 
Much of the earlier work to reduce the memory space occupied by executable code 
treated it as a simple linear sequence of instructions. Fraser et al. [16] used suffix tree
construction to identify repeated instruction sequences for extraction into functions. Applied
to a range of Unix utilities on a VAX processor, this technique was found to reduce the code
size by about 7%. A shortcoming is that since it relies on a purely textual interpretation of the 
program, the approach is sensitive to superficial differences between code fragments, such as 
the use of different registers in identical operations. This disadvantage was addressed by 
Cooper and McIntosh who used register renaming to alleviate the problem [17]. Two 
improvements were made: firstly, instructions were rewritten so that instead of using hard-coded
register names, the register operands of an instruction are expressed in terms of
previous references to that register; and secondly, jump instructions were rewritten in a format
where they were expressed relative to the Program Counter (PC), where possible. These
transformations allowed the suffix tree construction to detect the repetition of similar, but not
lexically identical instruction sequences. Code reductions of about 5% were reported using
these techniques on classically optimized code. 
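The effect of register renaming on repeat detection can be sketched as follows. This is a simplified illustration rather than the published method: registers are renamed by order of first use within a fixed-length window, and a hash table of windows stands in for the suffix tree. The instruction tuples, the "r" prefix for register operands and the function names are all assumptions.

```python
from collections import defaultdict

def canonicalise(sequence):
    """Rename registers by order of first use within the sequence, so that
    sequences differing only in register choice compare equal."""
    names = {}
    result = []
    for opcode, *operands in sequence:
        renamed = []
        for operand in operands:
            if operand.startswith("r"):  # assumed register syntax: r0, r1, ...
                names.setdefault(operand, f"R{len(names)}")
                renamed.append(names[operand])
            else:
                renamed.append(operand)  # immediates are left unchanged
        result.append((opcode, *renamed))
    return tuple(result)

def repeated_windows(program, length):
    """Map each canonical window of `length` instructions to its start offsets,
    keeping only windows that occur more than once."""
    seen = defaultdict(list)
    for start in range(len(program) - length + 1):
        seen[canonicalise(program[start:start + length])].append(start)
    return {window: starts for window, starts in seen.items() if len(starts) > 1}
```

Two load, add, store sequences that use different registers are textually different, yet they map to the same canonical window and are therefore reported as candidates for extraction into a shared function.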
Debray et al. [15] concentrated their efforts on the control flow of the program. They 
identified 'similar' basic blocks by adopting a fingerprinting scheme and, by renaming registers 
locally, similar blocks were translated into identical blocks. Then, these were replaced by a 
single code block and a number of jump instructions to its location. They called this technique
'code factoring'. By the use of an aggressive, inter-procedural application of the classical 
compiler techniques and code factoring they reported typical reductions in code size of 30%. 
They implemented their idea in the form of a binary-rewriting tool based on Alto, a post link-
time code optimizer. A derivative of Alto, called 'squeeze' can also be integrated into 
compilers capable of inter-procedural optimization in order to perform program compaction. 
The authors do not mention the effect this technique has on performance. 
In summary, it can be concluded that most compiler techniques developed for code 
compaction can only achieve moderate improvement in code density [15]. Moreover, they
can seriously degrade performance due to the nature of the optimisations carried out. For
example, limiting the number of registers used implies that more (slower) external memory
accesses are required, and introducing additional functions requires more branch and jump
instructions, which increase the execution time.
As shown in section 3.3, compression is able to achieve much more substantial reductions 
in code size, and, although requiring hardware support, has little effect on overall 
performance. 
2.4. Hybrid RISC Architectures
When the RISC architecture was first developed in the early 1980s, memory was not
considered a scarce resource. Nowadays, under the strict requirements for low memory usage
in embedded systems, most of the 32-bit RISC processor suppliers provide a hybrid
version of their cores, supporting both 32 and 16-bit instructions. The instructions included in
the 16-bit subset are typically selected for one of the following reasons: 
• Frequency of use 
• Size (they may not require the full 32 bits in the 32-bit ISA) 
• Importance to the compiler for generating small object code
Several examples of such hybrid architectures are described next. 
2.4.1. MIPS16 
MIPS16 is a 16-bit instruction encoding "application specific extension" that offers the choice 
of a mixed mode (16 or 32-bit instruction lengths) and runs on specific MIPS processors 
designed to accommodate it. MIPS16 instructions are translated on-the-fly using relatively 
simple hardware [5]. In order to reduce the number of instruction bits by half, all fields in the 
instruction word (opcodes, register numbers and immediate values) are downsized, as shown 
in Figure 2.3. 
32-bit: OPCODE (6 bits) | SOURCE REG (5 bits) | TARGET REG (5 bits) | IMMEDIATE VALUE (16 bits)
16-bit: OPCODE (5 bits) | SOURCE (3 bits)     | TARGET (3 bits)     | IMMEDIATE (5 bits)
Figure 2.3 Mapping of hybrid instructions 
The opcode field is reduced from 6 to 5 bits limiting the number of instructions available, 
and only 8 different registers can be accessed (3-bit operand fields). Furthermore, MIPS16 
usually permits only 2 register specifiers per arithmetic instruction as opposed to the 3 registers 
available to the 32-bit ISA. The most substantial reduction is achieved by restricting (from 16 
to 5 bits) the range of immediate values that can be represented. The MIPS16 instruction set 
specifically excludes coprocessor instructions, floating point instructions and those that 
reference the 'system' coprocessor. These instructions must be in 32-bit code routines. 
Switching between MIPS16 and 32-bit modes of operation is performed by an additional
instruction. 
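The consequence of the narrower fields can be illustrated with a small Python sketch that checks whether a set of field values fits the 16-bit layout of Figure 2.3. This is a deliberate simplification: field values are treated as plain unsigned integers, the 3-bit register fields are assumed to address registers 0 to 7 directly, and opcode remapping is ignored. None of these assumptions reflects the actual MIPS16 encoding rules.

```python
# Field widths in bits, taken from Figure 2.3 (most significant field first).
FIELDS_32 = {"opcode": 6, "source": 5, "target": 5, "immediate": 16}
FIELDS_16 = {"opcode": 5, "source": 3, "target": 3, "immediate": 5}

def fits(fields, widths):
    """True if every field value is representable in its allotted width."""
    return all(0 <= fields[name] < (1 << bits) for name, bits in widths.items())

def pack(fields, widths):
    """Concatenate the fields into one instruction word, first field uppermost."""
    word = 0
    for name, bits in widths.items():
        if not 0 <= fields[name] < (1 << bits):
            raise ValueError(f"{name}={fields[name]} does not fit in {bits} bits")
        word = (word << bits) | fields[name]
    return word
```

An instruction whose immediate is 100, for example, fits the 16-bit immediate of the 32-bit layout but not the 5-bit field of the compact layout, so under these assumptions it would have to remain a 32-bit instruction or be split into several 16-bit ones.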
2.4.2. ARM Thumb Processor 
The ARM Thumb instruction set contains a subset of 36 instructions drawn from the standard 
32-bit ARM instructions and recoded into 16-bit format. In the ARM7, the 16-bit instructions
are decompressed into their 32-bit equivalents in real time during the first phase of the decode
cycle. In the second phase these are decoded as normal 32-bit ARM instructions. A Thumb-aware
processor, such as the ARM9, has separate logic for executing both Thumb and
original ARM instructions, switching between them by means of a new instruction. High-level 
code can be compiled to either native ARM code or Thumb code on a function by function 
basis [3]. 
In order to define the Thumb instruction set in 16 bits, several restrictions have been 
imposed. Like MIPS16, only three bits are used to encode registers (as opposed to four in ARM
instructions), so only eight registers can be directly accessed by Thumb instructions. The 
immediate field is also narrower. In addition, instructions can contain two or three operands 
only, while some ARM instructions have four. Conditional execution is not supported, so 
Thumb code tends to look much more like a conventional assembler with compares followed 
by short branches. Yet another disadvantage is the increase in the number of instructions that 
need to be fetched from memory and executed, which may increase the overall power 
consumption of the system. Thumb code typically provides a 30% improvement in code 
density over native ARM code at the expense of a 26% performance reduction [4]. 
2.4.3. ARCompact 
The ARCompact ISA was announced as part of the ARCtangent-A5 processor architecture.
Its principles are largely the same as those discussed above for Thumb and MIPS16. However,
instead of using a decompressor at the head of the instruction pipeline to convert a 16-bit
instruction into its 32-bit equivalent (the technique used by MIPS16), the ARCtangent-A5
instruction decoder processes the 16-bit instructions natively, leaving the basic pipeline 
unchanged. Thus, no extra instructions are needed to change between compressed and 
uncompressed modes, since the nature of the instruction is encoded in its most significant bit 
(0 for 32-bit instructions and 1 for 16-bit instructions). With this approach, it is possible to 
achieve compression ratios of around 0.7, but the performance may be significantly affected, 
increasing the execution time by between 15% and 20%. Moreover, the new instruction subset 
implementation reduces the flexibility of the ISA and results in an increase of the instruction 
count by up to 40% [6]. 
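The mode-free decoding described above can be sketched as follows, assuming (as stated in the text) that the most significant bit of the first halfword selects the instruction length. The function name and the list-of-halfwords representation are illustrative assumptions; the actual ARCompact encoding is not reproduced here.

```python
def split_instructions(halfwords):
    """Group a stream of 16-bit halfwords into (length, value) instructions.

    Rule taken from the text: MSB 0 -> 32-bit instruction, MSB 1 -> 16-bit.
    """
    instructions = []
    i = 0
    while i < len(halfwords):
        first = halfwords[i]
        if first & 0x8000:
            # Compressed instruction: a single halfword.
            instructions.append((16, first))
            i += 1
        else:
            # Uncompressed instruction: this halfword plus the next one.
            if i + 1 >= len(halfwords):
                raise ValueError("truncated 32-bit instruction")
            instructions.append((32, (first << 16) | halfwords[i + 1]))
            i += 2
    return instructions
```

Because the length is recovered from the instruction itself, no mode-switching instruction is needed between 16-bit and 32-bit code.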
2.4.4. Conclusions 
Providing a hybrid subset of the ISA is an established technique for code compaction, which 
has been implemented in a number of commercial RISC architectures. The achieved memory 
reduction can reach up to 40%, but typically it comes at the expense of performance, which 
might decrease by up to 26% [4]. Performance degradation occurs due to the reduction in the
number of registers accessible in the instruction, resulting in the need to include additional 
instructions to carry out many arithmetic and logic operations. The reduced number of 
registers available also means that more data needs to be stored in external memory, thereby 
slowing access and increasing power consumption. An additional disadvantage of the hybrid 
approach is the need to develop new software tools (compiler, linker, and loader) to support 
the ISA, which increases the cost of supporting these solutions. 
2.5. Code Compression 
Shortly after the development of the first RISC architectures, a group at Berkeley University
[1] proposed compressing RISC executable code: their RISC machine supported both
16-bit and 32-bit instructions in memory, which were translated to 32 bits in the cache.
Since then, a number of different techniques have been developed based on modified data 
compression algorithms. 
Due to performance constraints, a natural separation arises in the implementation of the 
compression and decompression stages. Compression can be carried out off-line, and 
therefore a software approach is most convenient, whereas decompression needs to be 
executed in real-time immediately before execution and so is better suited to hardware 
implementation. 
The purpose of this section is to give a comprehensive appraisal of the most relevant 
existing approaches for compressing embedded RISC code in order to clarify the 
contributions of this thesis to the state-of-the-art. 
2.5.1. Algorithms 
As discussed in Section 2.1, embedded systems are subjected to many constraints such as low 
power consumption, high performance and minimal area requirements. In order to be viable 
in the highly competitive embedded systems market, it is important to make a significant 
impact in one of these areas, or perhaps more minor contributions across two or more of
these areas. As code compression was originally developed in order to
economise on memory area and, to a lesser extent, power consumption, the decompression
implementation must consist of relatively little logic, and, at the same time, should not 
significantly affect the performance of the system. Therefore, the algorithms used in code 
compression implementations need to take into account a range of factors additional to those 
considered in general data compression applications. 
Specifics of code compression algorithms 
Although code compression algorithms have generally been developed on the basis of 
existing, well-known techniques used for compressing either text or data, the following 
features are particular to code compression applications: 
• Asymmetric, fast decompression
In embedded systems, code compression and decompression are not symmetrical processes.
Compression is normally carried out off-line on a host development machine and can be
rather time consuming due to the optimisations and analysis of the code that need to be 
performed. After the code segment of the executable has been compressed, it can be loaded in 
the memory of the embedded system. Decompression is performed in real-time on the 
embedded system. The extra hardware required in the decompression stage should remain as 
simple as possible in order to avoid excessive overheads in silicon area and execution speed. 
• Random access decompression 
The program on the embedded system could not normally be decompressed in its entirety 
before being executed as this would defeat the whole purpose of compressing the program. 
The control flow of the program is not sequential as it will contain jump and branch 
instructions (discussed in more detail in section 4.2), which may require that the execution of 
the program moves to a new point in the code. Consequently, the decompression should be 
capable of starting at any point in the code, or, at least, at a suitable block boundary, which 
indicates that decompression should perhaps be carried out on an instruction or block basis. This
restricts the algorithmic development, as, in many compression algorithms, the statistical
history of the data already seen contributes to the building of the appropriate models. The
widely used Ziv-Lempel family of algorithms [12], for example, uses pointers to previous
occurrences of strings, and consequently such algorithms are unlikely to be well suited to the
decompression of fixed-size blocks.
• Low buffering requirements 
As the main purpose of introducing compression is to conserve memory, it is
appropriate to severely restrict the lengths of any buffers required by the compression
algorithms. Dynamic Markov compression (DMC) and prediction by partial match (PPM)
[12], are both able to achieve excellent compression ratios for certain applications. However, 
they are not suitable for compressing code as they require considerable memory resources for 
both compression and decompression, and such large buffers are not normally available in 
embedded systems. 
• Correct program execution
As expected of any lossless compression algorithm, the decompressed instructions must be
identical to the original ones. However, in code compression, correct execution requires that a
number of different mechanisms are in place to resolve certain issues that arise; principally 
those that result from changes in control flow. In particular, it is often necessary to 
incorporate large address translation tables in order to map between the compressed and 
uncompressed address spaces. 
Code compression algorithms 
Due to the limitations outlined above, many data compression algorithms are not suitable for 
code compression applications. Code compression algorithms are typically dictionary-based, 
where the most commonly used instructions or sequences of instructions are held in the 
dictionary, and replaced in the programme by a single, shorter codeword. In that way, the
compressed programme is composed either of codewords only, or of codewords interspersed
with non-compressed instructions. The codewords are generally encoded using some variant
of Huffman coding (often used due to its simplicity) or another fixed-to-variable size coding technique.
To simplify the decompression process, the number of different lengths available to represent 
the codewords may be limited (class-based approach) [12]. 
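The dictionary-based principle outlined above can be sketched as follows. This is a minimal illustration under stated assumptions, not any of the published schemes: single instructions are the only symbols, codewords are fixed-length dictionary indices, and each item carries a one-bit escape flag. All function names and the default bit widths are hypothetical.

```python
from collections import Counter


def build_dictionary(instructions, max_entries):
    """Select the most frequently occurring instruction words."""
    counts = Counter(instructions)
    return [word for word, _ in counts.most_common(max_entries)]


def compress(instructions, dictionary):
    """Replace dictionary hits by ('C', index); keep the rest as ('U', word)."""
    index = {word: i for i, word in enumerate(dictionary)}
    return [("C", index[w]) if w in index else ("U", w) for w in instructions]


def decompress(stream, dictionary):
    """Expand codewords back into the original instruction words."""
    return [dictionary[v] if tag == "C" else v for tag, v in stream]


def compression_ratio(stream, dictionary, index_bits=8, word_bits=32,
                      escape_bits=1):
    """Compressed size / original size, with dictionary storage included."""
    if len(dictionary) > (1 << index_bits):
        raise ValueError("dictionary does not fit the index width")
    compressed = sum(
        escape_bits + (index_bits if tag == "C" else word_bits)
        for tag, _ in stream
    )
    compressed += len(dictionary) * word_bits
    return compressed / (len(stream) * word_bits)
```

Counting the dictionary itself in the compressed size matters when comparing published figures, since some of the ratios quoted in this section exclude it.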
Araujo et al. [18] developed a code compression algorithm in three variants, which differ
only in the granularity of the symbol used for compression. The main principle behind the 
algorithm is to separate the symbols into classes based on the frequency distribution of the 
symbols. Symbols in each class are assigned codewords of the same length. The symbols used 
are whole expression trees (tree based compression - TBC), parts of expression trees (pattern 
based compression - PBC) or single instructions (instruction based compression - IBC). In
TBC and PBC, the expression trees (formed by the same rules used by compilers) are the basic 
unit of compression, but in PBC the operands are factored out, that is, the expression trees are 
divided into two parts that are separately encoded (the tree pattern and operand pattern). 
Table 2.2 shows typical compression ratios achieved by the three approaches described above. 
The better compression ratio of PBC results from the repetition of patterns of instructions, 
which differ from each other only in the operands they use. 
Table 2.2 Results achieved by Araujo et al. [18]
Technique                              Symbols used for compression            Compression ratio
Tree based compression (TBC)           whole expression trees                  0.73
Pattern based compression (PBC)        parts of expression trees (patterns)    0.60
Instruction based compression (IBC)    single instructions                     0.68
An algorithm, similar to the ones described by Araujo, was implemented in Motorola's 
MPC565 range of microcontrollers, the main difference being the use of instructions and half-
instructions as units of compression. The compression ratio achieved was quoted to be as low 
as 0.5, but the memory used by the dictionary was not taken into account in the supplied 
figures [19]. 
Lekatsas et al. [20, 21] proposed two methods for code compression. The first method,
semi-adaptive dictionary compression (SADC), is ISA dependent. Instructions are divided into
streams (groups of bits within the instruction) in a predetermined fashion according to the
properties of the ISA, whereby suitable combinations of instructions and fields within
instructions are identified, such as opcodes, opcode-register and opcode-immediate. Semi-adaptive
dictionaries (of up to 256 entries) are built for each executable programme, where
only the most frequently repeated instructions are included. The compression itself encodes all
streams using Huffman coding. The second method is semi-adaptive Markov compression 
(SAMC), which is ISA-independent. It uses a binary arithmetic coder driven by Markov 
models as well as the stream division approach of SADC. For each stream, a binary Markov
tree with 2^(k+1) - 1 states is built and transition probabilities are calculated, which are used by
the compressor as predictions of the forthcoming bit. The compressor parses the subject
programme a second time to encode the message. The storage requirements for SAMC comprise the
encoded message and the Markov trees for the streams, and the reported reduction in code
size is 50%. Although the authors acknowledged that the decompression speed is poor, they
proposed to improve this by building a decoding table that allows multi-bit arithmetic 
decoding. 
Yoshida et al. [22] described a logarithmic-based method to compress instruction 
memory, by substituting the original 32-bit instructions by a set of references to a translation 
table (dictionary) that can be stored in the on-chip memory. The main drawback of the
scheme is the substantial memory requirement of the large translation tables. However, the
approach not only produced good instruction memory size reductions (ignoring the
translation tables) with a mean compression ratio of 0.62 for an ARM610, but also registered 
significant power consumption savings. 
Lefurgy et al. [23] proposed two dictionary-based compression techniques differing in the 
codeword sizes they employed. In this approach, the most frequent sequences were 
compressed and additional bits, called escape bits, were used to define the boundaries between 
compressed and uncompressed instructions. The instructions corresponding to each 
codeword were stored in a dictionary in the decompression engine and the codeword bits were
used to index the dictionary entries. The decompression engine was designed to incorporate 
sufficient information to be able to expand codewords to reproduce the original instruction 
sequences. Since the compressed program is composed of codewords and uncompressed 
instructions, branch targets need to be recomputed so as to reflect their new location in the 
program. The target address bits are divided into two parts, namely the address of the 
compressed word and the offset from the beginning of the compressed word. Forming the
sum of these values determines the target address. To achieve bit addressing, further 
modifications to the control unit of the processor are required. The results achieved when 
using fixed-length codewords were compression ratios of 0.61, 0.66, and 0.74 for the 
PowerPC, ARM and i386 processors respectively. 
An approach described by Liao et al. [24] identified frequently appearing strings of
instructions (termed mini-subroutines) in a program and replaced each instance of a mini-subroutine
by a call instruction. Each mini-subroutine appears only once in the text of the
program and may itself contain branch instructions under certain conditions. Although a
mean compression ratio of 0.88 was achieved, the large number of branch instructions 
introduced by this algorithm adversely affects overall performance. An extension that 
augmented the instruction set with a new call instruction that was able to point to any location 
in the dictionary was also described, which resulted in the elimination of most of the mini-
subroutines. The call instruction consisted of a dictionary address plus a length value 
specifying the number of instructions to be executed from the dictionary. With these 
modifications the compression ratio was further reduced to 0.84.
2.5.2. Decompression implementations 
Although the decompression stage could be implemented in either software or hardware, the 
performance of a software solution makes this option impractical in most cases, whereas
hardware decompression has already been successfully included in commercial products. 
Commercial microcontrollers that incorporate dedicated hardware decompression engines include
the IBM CodePack and the Motorola M5xx series. The decompression architectures of these
devices, together with that of one of the first RISC processors to incorporate code compression,
are described in more detail later in this section.
Decompression in Software 
Lefurgy et al. [23] proposed a dictionary compression scheme, where each 32-bit instruction is
indexed by a 16-bit codeword. The obvious limitation is that a maximum of 2^16 (65,536)
unique instructions can be compressed in this scheme, but if the number of unique 
instructions is larger, then the less frequently used instructions are left in uncompressed form 
and the memory is divided into separate compressed and native regions. The decompression 
software is implemented in only 26 assembler instructions, but they need to be executed each 
time a compressed instruction is encountered, while the only hardware change is the addition 
of a special instruction that invokes the decompressor on a cache miss. The compression ratio
that can be achieved depends mainly on the number of unique instructions in the application 
and the dictionary. 
Bird and Mudge [25] proposed a look-up table method that provided fast decoding. 
Following code generation, the instruction stream is analyzed for often-reused sequences of
instructions from within the program's basic blocks. These patterns of multiple instructions
are then mapped to single-byte opcodes, yielding compression of multiple, multi-byte
operations into a single byte. Compressed opcodes are detected during the instruction fetch
stage, and expanded within the CPU into the original (multi-cycle) set of instructions. As they
operate within the program's basic blocks, branch instructions and their targets are left
unaffected by this technique. The authors reported that by incorporating "1 kbyte of decode
ROM in the CPU" [25], a reduction in the memory occupied by the program code of between
45% and 60% can be achieved.
Kirovski et al. [26] described a system where code is decompressed on a procedure basis 
from ROM into a procedure cache (a dedicated region of RAM that is explicitly managed by 
the runtime system). An important issue in this approach is the management of this procedure
cache. Any decompression algorithm (both dictionary as well as statistical) can be used, as long 
as fast software implementations are available. In the authors' implementation, the 
decompression is performed by software and little or no hardware support is required. 
Decompression in Hardware 
In 1992, Wolfe and Chanin [27] proposed a new solution, termed the code compression RISC 
processor (CCRP). The CCRP consists of a standard RISC processor core augmented with a
special code-expanding instruction cache. CCRP programs are compiled and linked using 
standard tools, and the generated executable is then compressed. Fixed 32-byte blocks are 
considered as the atomic unit of compression. The compression algorithm used is based on 
bounded (up to 16 bits) static Huffman encoding, where some blocks remain uncompressed if
their lengths were to increase during compression. The hardware decompression engine is 
located between the main memory and the cache, as shown in Figure 2.4. In this architecture, 
the instruction cache holds uncompressed instructions (those in the original form as generated 
by the compiler), while the compressed programme is stored in the main memory. 
Figure 2.4. Code Compression RISC Processor [27] 
At run time, on a cache miss, the ICache (Instruction Cache) refill-engine locates the 
missed cache line in the main (compressed) memory and expands (decompresses) it into the 
cache. This approach does not require any modifications to the core processor, but only the
implementation of an ICache refill-engine. 
[Figure: LAT entry layout, including the 24-bit BASE ADDRESS field]
Figure 2.5. Line Address Table Entry 
In a CCRP program, the instructions are allocated addresses according to their location 
(for example, in the main memory or in the cache). A line address table (LAT), stored
alongside the compressed program, is used to map the original (uncompressed)
instruction addresses to the compressed ones. Each LAT entry is 64 bits wide and is used to
map 64 consecutive instructions, which are grouped into eight blocks of eight instructions 
each. The base address field (see Figure 2.5), represents the (compressed) base address of the 
first block. The length (in bytes) of each compressed block is stored in the eight subsequent
fields. A length of 0 represents an uncompressed block, which is always of length 32 bytes. In 
this way, the starting address of a particular compressed block can be calculated by adding the 
length of its preceding blocks to the base address. This operation is performed within the 
cache line address lookaside buffer (CLB). The CLB contains a small fully associative cache 
with least recently used (LRU) replacement policy, capable of storing from 4 to 16 LAT 
entries. The CLB is accessed in the fetch instruction stage, simultaneously with the instruction 
cache (see Figure 2.6). When an instruction cache miss occurs, if the required LAT entry is
found in the CLB, the compressed block can be requested immediately from the
main instruction memory. If, on the other hand, the CLB misses, it has to be refilled before
requesting the compressed address, thus incurring delays due to both ICache and CLB refill
times. 
The CLB's cache is implemented using a content addressable memory (CAM) to map
uncompressed block addresses to LAT entries, and an address computational unit to calculate
the compressed address of the selected block. The processor instruction address (24 bits long)
is composed of three parts, as shown in Figure 2.6. The 16 most significant bits, the LAT
index, are compared with the tags of the entries currently in the CLB to determine whether the
LAT entry required for the address computation is currently in the CLB or has to be fetched from
memory. The next 3 bits are the LAT length pointer index that determines which elements of
the LAT entry have to be summed in order to compute the compressed block starting address.
The 5 least significant bits act as the byte offset into the selected cache line. In terms of
resources and performance, the LAT occupies around 3% of the total memory, and compression
ratios of around 0.73 are achieved for the MIPS R2000, although additional memory and logic
are clearly needed to implement the decompression engine.
[Figure: the instruction address is divided into the LAT index, the LAT length pointer index and the cache line offset; the CLB holds TAG / LAT entry pairs that feed an address computational unit, which outputs the compressed block address]
Figure 2.6. CLB Organisation [27] 
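The address computation performed by the CLB can be sketched as follows. This is a minimal illustration based only on the description above: the 24-bit address is split 16/3/5, and a stored length of 0 denotes an uncompressed 32-byte block. The LAT entry is represented as an already unpacked base address and list of eight lengths, since the exact bit packing of the entry is not assumed here; the function names are hypothetical.

```python
BLOCK_BYTES = 32        # uncompressed block (cache line) size in bytes
BLOCKS_PER_ENTRY = 8    # blocks mapped by one LAT entry


def split_address(addr):
    """Split a 24-bit instruction address into its three parts."""
    if not 0 <= addr < (1 << 24):
        raise ValueError("address must fit in 24 bits")
    lat_index = addr >> 8            # 16 most significant bits
    block_index = (addr >> 5) & 0x7  # next 3 bits: LAT length pointer index
    byte_offset = addr & 0x1F        # 5 least significant bits
    return lat_index, block_index, byte_offset


def compressed_block_address(base, lengths, block_index):
    """Base address plus the sizes of all preceding blocks in the entry."""
    if len(lengths) != BLOCKS_PER_ENTRY:
        raise ValueError("a LAT entry describes eight blocks")
    # A stored length of 0 marks an uncompressed block of 32 bytes.
    sizes = [BLOCK_BYTES if n == 0 else n for n in lengths]
    return base + sum(sizes[:block_index])
```

For example, with a base address of 0x1000 and stored lengths of 20, 0 and 17 for the first three blocks, the fourth block starts at 0x1000 + 20 + 32 + 17.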
Similar in operation to CCRP is IBM's CodePack code compression system [7,28] that 
was designed to compress any PowerPC programme into a format that can be executed by the 
processor after decompression. During compression, the CodePack compression utility 
analyses the instructions' frequency distribution and produces a pair of 2-kbyte lookup tables 
generated specifically for that particular programme. When the compressed programme is run, 
a CodePack-equipped processor uses these tables to decompress the code on the fly before
execution. Although there is some performance penalty for decompression (typically around 
10%), significant extra latency is introduced only during instruction cache misses. 
The CodePack compression algorithm operates as follows. Each 32-bit instruction is 
divided into 16-bit most significant and least significant half-words, which are then translated 
(using Huffman encoding) to a variable-length codeword of between 2 and
19 bits, where the most common half-word values are placed into one of the two look-up
tables. The main reason for using two separate look-up tables is the different frequency 
distributions that characterise the upper half of the PowerPC instruction (which holds the 
opcodes) and the lower half (which typically holds constants, displacements, or masks). Each 
group of 16 instructions is combined into a compression block, which is the granularity at which
decompression occurs and, when an instruction is requested by the processor, the block in
which it is located is fetched and decompressed.
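The first stage of the scheme described above (splitting instructions into half-words and gathering separate statistics for each half) can be sketched as follows. The Huffman code construction itself is omitted, and the function names are illustrative rather than part of the CodePack tools.

```python
from collections import Counter


def split_halfwords(instructions):
    """Split 32-bit instructions into upper and lower 16-bit half-words."""
    upper = [(word >> 16) & 0xFFFF for word in instructions]
    lower = [word & 0xFFFF for word in instructions]
    return upper, lower


def frequency_tables(instructions):
    """Separate frequency counts for the upper and lower half-words."""
    upper, lower = split_halfwords(instructions)
    return Counter(upper), Counter(lower)


def compression_blocks(instructions, block_size=16):
    """Group instructions into blocks of 16, the decompression granularity."""
    return [
        instructions[i:i + block_size]
        for i in range(0, len(instructions), block_size)
    ]
```

Keeping two tables lets the code lengths follow the distribution of each half independently, which is the motivation given in the text for using separate look-up tables.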
A further set of tables, termed the index table, is used for mapping between 
uncompressed and compressed addresses, as shown in Figure 2.7. The index table holds 32-bit 
entries that point to the compressed-space addresses of a group of two compression blocks, 
forming a compression group. During run-time, at a cache miss, the decompression core 
calculates the address of the index table entry of the compression group containing the target 
instruction address (TIA) and fetches it from the memory. Next, it calculates the address of 
the compressed block to which the requested instruction belongs, and as the compression 
block is read in, the decompression core uses the contents of the decode look-up tables to 
decompress the block. Finally, once the cache-line refill is completed the processor continues 
execution. The index table can be up to 2 Mbytes in size, which would cover an
entire 64 Mbyte compression region.
[Figure: each index table entry holds the 26-bit address of the first block of a compression group and the 6-bit offset of the second block in the group; the entries point to the variable-length compression blocks (B1, B2) of each group (G0, G1, G2) in the compressed address space]
Figure 2.7. Index table mapping ofTIAs to compressed memory [7] 
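The index table arithmetic can be checked with a short sketch. The group size follows from the text (two blocks of 16 four-byte instructions per 32-bit entry). The 26-bit/6-bit split of an entry is taken from the labels of Figure 2.7; placing the address in the upper bits, and interpreting the offset as relative to the first block, are assumptions made for illustration. All names are hypothetical.

```python
INSTRUCTION_BYTES = 4
INSTRUCTIONS_PER_BLOCK = 16
BLOCKS_PER_GROUP = 2
ENTRY_BYTES = 4  # 32-bit index table entries

BLOCK_BYTES = INSTRUCTION_BYTES * INSTRUCTIONS_PER_BLOCK  # 64 bytes
GROUP_BYTES = BLOCK_BYTES * BLOCKS_PER_GROUP              # 128 bytes


def index_table_bytes(region_bytes):
    """Index table size needed to cover an uncompressed region."""
    return (region_bytes // GROUP_BYTES) * ENTRY_BYTES


def locate(tia, region_base=0):
    """Map a target instruction address to (group number, block in group)."""
    offset = tia - region_base
    return offset // GROUP_BYTES, (offset % GROUP_BYTES) // BLOCK_BYTES


def decode_entry(entry):
    """Return the compressed addresses of both blocks in a group.

    Assumed layout: upper 26 bits = first block address,
    lower 6 bits = offset of the second block from the first.
    """
    first = entry >> 6
    return first, first + (entry & 0x3F)
```

With these figures, a 64 Mbyte region contains 524,288 groups of 128 bytes, so its index table occupies 2 Mbytes, in agreement with the sizes quoted above.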
The tool chain available for CodePack is similar to that of other PowerPC processors, as 
no specific compilation or linkage is required, but they are supplemented by a post-processor 
that compresses the executable and builds the compression tables. IBM claims that the 
compression scheme of CodePack typically reduces code size by 35 - 40% [7].
Larin and Conte [29] compared Huffman coding compression performed on cache misses
with tailored encoding of the ISA for a very long instruction word (VLIW) architecture. 
Depending on the requirements of a particular program, such as general purpose registers or 
the number of floating point operations, the register and opcode fields in the instructions are 
coded to the optimum length needed. For example, if no more than four registers are live at
any given point in the source code, the register fields are coded into two bits only.
In terms of performance improvement, the tailored ISA compares favourably with Huffman 
encoding, while, in terms of code size, compression gives better results. Lekatsas et al. [20]
investigated the potential benefits of implementing a modified arithmetic coding technique 
and claimed that they achieved energy savings between 22% and 82%, were able to reduce 
chip area and even simultaneously improve performance. Although their savings are attractive, 
the results were presented only for a theoretical study and no implementation has been carried 
out. The implementation of the logarithmic algorithm of Yoshida et al. [22], briefly described
earlier, achieved a reduction in power consumption of up to 42.3%.
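The tailored-encoding example of Larin and Conte given above (four live registers need only two bits) follows from a simple width calculation, sketched below. The function name is hypothetical and the rule shown is only the obvious minimum-width bound, not the authors' full encoding procedure.

```python
def register_field_bits(max_live_registers):
    """Minimum number of bits needed to distinguish the given registers."""
    if max_live_registers < 1:
        raise ValueError("at least one register is required")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without using
    # floating point; a field is given at least one bit.
    return max(1, (max_live_registers - 1).bit_length())
```

Under this rule a program that never has more than four registers live needs 2-bit register fields, whereas a full set of 32 registers needs the usual 5 bits.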
2.6. FPGA devices 
FPGA technologies are becoming increasingly widely used in industry, delivering performance 
and features that previously only ASIC devices could provide. Their prices are far more 
competitive and their design tools are often much more user friendly than their ASIC 
counterparts. FPGAs not only allow for fast prototyping, but they provide reconfigurability, a 
feature not available in ASICs. However, it has been suggested that FPGAs are still only a
first-generation embodiment of the big idea of a general-purpose, reconfigurable substrate for
special purpose computing [30]. In any case their advantages are sufficiently important to
investigate the role that this new technology can play in current and future control 
implementations. 
This section briefly reviews the main characteristics of a selected number of FPGAs that 
were used in the implementation of the decompressor in order to understand the 
implementation figures presented in later subsections. 
Spartan-IIE
The Spartan-IIE family of FPGAs is targeted towards very low cost solutions that require a
good level of performance. These devices provide system clock rates beyond 200 MHz and
they are implemented in a 0.15 µm technology. A Spartan-IIE FPGA is composed of five
major configurable elements [31] (see Figure 2.8).
• IOBs (Input/Output Blocks), which provide the interface between the package pins
and the internal logic. 
• CLBs (Configurable Logic Blocks), which provide the functional elements for 
constructing most logic. 
• Dedicated block RAM memories of 4096 bits (4 Kbits) each.
• Clock Delay-Locked Loops (DLLs) for clock-distribution delay compensation and 
clock domain control. 
• Versatile multi-level interconnect structure. 
The basic building block of the CLB is the logic cell (LC). An LC includes a 4-input 
function generator (implemented as a 4-input LUT), carry logic, and a storage element. The
output from the function generator in each LC drives the CLB output or the D input of the 
flip-flop. Each CLB contains four LCs, organised in two similar slices. The main parameters 
for the selected device are given in Table 2.3.
Figure 2.8. Basic Spartan-IIE Family FPGA Block Diagram [31]
Table 2.3. Main complexity parameters for the selected Spartan-IIE FPGA
Parameter                                    XC2S200E
Logic cells                                  5,292
Typical system gate range (logic and RAM)    71,000 - 200,000
CLB array (R x C)                            28 x 42
Total CLBs                                   1,176
Available user I/O                           285
Distributed RAM bits                         75,264
Block RAM bits                               56K
Number of block RAMs                         14
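Several of the figures in Table 2.3 follow from the architectural description above, and can be cross-checked with a few lines of arithmetic. The sketch assumes that every 4-input LUT can hold 16 bits of distributed RAM and uses only values quoted in the text and table; the variable names are illustrative.

```python
CLB_ROWS, CLB_COLS = 28, 42
LCS_PER_CLB = 4        # four logic cells per CLB, as described in the text
LUT_BITS = 2 ** 4      # a 4-input LUT stores 16 bits
BLOCK_RAMS = 14
BLOCK_RAM_BITS = 4096  # bits per block RAM

total_clbs = CLB_ROWS * CLB_COLS
distributed_ram_bits = total_clbs * LCS_PER_CLB * LUT_BITS
block_ram_kbits = BLOCK_RAMS * BLOCK_RAM_BITS // 1024
```

These expressions give 1,176 CLBs, 75,264 distributed RAM bits and 56 Kbits of block RAM, matching the corresponding table entries.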
Virtex-II
The Virtex-II devices are platform FPGAs developed for high performance designs that are 
based on IP cores and customised modules. The family is often used in telecommunication, 
wireless, networking, video, and DSP applications. The 0.15 µm / 0.12 µm CMOS 8-layer metal
process and the Virtex-II architecture are optimised for high speed with low power 
consumption. The main parameters of the Virtex-II device used for implementing the 
decompressor are summarised in Table 2.4. 
Table 2.4. Virtex-II FPGA main complexity parameters
Parameter                      XC2V500
System gates                   500K
CLB array (R x C)              32 x 24
CLB slices                     3,072
Distributed RAM (Kbits)        96
Multiplier blocks              32
18 Kbit block RAMs             32
Maximum block RAM (Kbits)      576
GCLKs                          8
I/O pads                       264
Its internal configurable logic includes four major elements organised in a regular array
[32] (see Figure 2.9).
• CLBs, which provide functional elements for combinatorial and synchronous logic, 
including basic storage elements. 
• BUFTs (3-state buffers) associated with each CLB element, which drive dedicated 
segmentable horizontal routing resources. 
• Block SelectRAM memory modules, which provide large 18 kbit storage elements of 
dual-port RAM. 
• DCM (Digital Clock Manager) blocks. 
CLB resources include four slices and two 3-state buffers. Each slice is equivalent and 
contains: a) two function generators; b) two storage elements; c) arithmetic logic gates; d) large
multiplexors; e) wide function capability; f) fast carry lookahead chain and g) horizontal 
cascade chain (OR gate). 
The function generators are configurable as 4-input look-up tables (LUTs), as 16-bit shift 
registers, or as 16-bit distributed SelectRAM memory. In addition, the two storage elements 
are either edge-triggered D-type flip-flops or level-sensitive latches. 
[Figure labels: Global Clock Mux, Programmable I/Os]
Figure 2.9. Virtex-II Architecture Overview [32] 
Each CLB has internal fast interconnect and connects to a switch matrix to access general 
routing resources. The block SelectRAM memory resources are 18 kbits of dual-port RAM,
programmable from 16384 x 1 bit to 512 x 36 bits, in various depth and width configurations.
Each port is totally synchronous and independent, offering three "read-during-write" modes.
Block SelectRAM memory is cascadable to implement large embedded storage blocks, and
memory configurations for dual-port and single-port modes are supported. A multiplier block
is associated with each SelectRAM memory block. The multiplier block is a dedicated 18 bit x
18 bit multiplier and is optimized for operations based on the block SelectRAM content on
one port. The 18 x 18 multiplier can be used independently of the block SelectRAM resource
and both the SelectRAM memory and the multiplier resource are connected to four switch 
matrices to access the general routing resources. 
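The block SelectRAM figures quoted above can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch: the constant and function names are hypothetical, and the values are those given in the text and in Table 2.4.

```python
# Block SelectRAM arithmetic for the XC2V500, using only figures quoted in the text.
BLOCK_KBITS = 18     # each block SelectRAM holds 18 kbits
NUM_BLOCKS = 32      # number of 18-kbit blocks in the XC2V500 (Table 2.4)


def config_bits(depth: int, width: int) -> int:
    """Number of bits addressed by a depth x width port configuration."""
    return depth * width


narrow = config_bits(16384, 1)           # 16384 x 1 configuration
wide = config_bits(512, 36)              # 512 x 36 configuration
total_kbits = BLOCK_KBITS * NUM_BLOCKS   # total block RAM in the device

print(narrow, wide, total_kbits)
```

The widest configuration addresses the full 18 kbits of a block, and the total for 32 blocks agrees with the maximum RAM entry of Table 2.4.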
Altera's Cyclone 
The Cyclone field programmable gate array family is based on 1.5 V, 0.13 µm technology. With
features such as phase-locked loops (PLLs) for clocking and a dedicated double data rate 
(DDR) interface to meet DDR SDRAM and fast cycle RAM (FCRAM) memory 
requirements, Cyclone devices are a cost-effective solution for data-path applications [33]. The 
main parameters of the Cyclone device used for describing the implementation of the 
decompressor are summarised in Table 2.5. 
Table 2.5. Selected Cyclone FPGAs main complexity parameters [33]
Part     LEs      M4K RAM blocks   Total RAM bits   PLLs   Max user I/O pins
EP1C6    5,980    20               92,160           2      249
Cyclone devices contain a two-dimensional row-based and column-based architecture to
implement custom logic. Column and row interconnects of varying speeds provide signal 
interconnects between logic array blocks (LABs) and embedded memory blocks. 
The logic array consists of LABs, with 10 logic elements (LEs) in each LAB. An LE is a 
small unit of logic providing efficient implementation of user logic functions. LABs are 
grouped into rows and columns across the device. 
M4K RAM blocks are true dual-port memory blocks with 4 kbits of memory plus parity 
(4,608 bits). These blocks provide dedicated true dual-port, simple dual-port, or single-port
memory up to 36-bits wide at up to 250 MHz. These blocks are grouped into columns across 
the device in between certain LABs [33]. 
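The Cyclone figures quoted above and in Table 2.5 are mutually consistent, as the short check below illustrates. This is a sketch only: the constant names are hypothetical, and the LAB count is derived under the assumption that every LE belongs to a LAB of exactly 10 LEs.

```python
# Consistency check of the EP1C6 figures quoted in the text and in Table 2.5.
M4K_DATA_BITS = 4 * 1024    # 4 kbits of memory per M4K block
M4K_TOTAL_BITS = 4608       # block size including parity, as quoted in the text
NUM_M4K_BLOCKS = 20         # M4K RAM blocks in the EP1C6
NUM_LES = 5980              # logic elements in the EP1C6
LES_PER_LAB = 10            # logic elements per LAB

parity_bits = M4K_TOTAL_BITS - M4K_DATA_BITS       # parity bits per block
total_ram_bits = NUM_M4K_BLOCKS * M4K_TOTAL_BITS   # should match "Total RAM bits"
num_labs = NUM_LES // LES_PER_LAB                  # assumes all LEs sit in full LABs

print(parity_bits, total_ram_bits, num_labs)
```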
Each Cyclone device input/output pin is fed by an I/O element located at the ends of 
LAB rows and columns around the periphery of the device. 
[Figure 2.10: device floorplan; legible labels include the logic array, M4K blocks, PLL and IOEs]
Figure 2.10. Cyclone EP1C6 Device
2.7. Conclusions 
The problem of poor code density of RISC processors has attracted considerable attention
and research effort since the introduction of RISC processors into embedded systems. A 
number of different techniques targeting this problem have been described in this chapter, 
including compiler optimisations, hybrid ISAs and code compression. 
Improvements in compiler technology have made it possible to eliminate much of the 
redundant application code and to provide optimisations that reduce memory requirements, 
such as local and global common subexpression elimination, as well as unreachable and dead 
code removal. In addition, code compaction techniques have been developed, consisting of a 
set of mechanisms, such as register renaming, to further maximise the effects of compiler 
optimisations. Although some of these techniques achieve relatively good results in terms of
code reduction (30% code reduction is reported in [15]), they have a detrimental effect on
performance. Similar improvements have been achieved following the introduction of hybrid
ISAs by a number of RISC suppliers. However, the use of hybrid 16-bit instructions can result
in a decrease in performance of up to 26% [4], due to the increase in the number of
instructions that need to be executed and the restricted number of registers that can be used,
which makes additional accesses to slower memory necessary.
Code compression often achieves better reductions compared with the compiler
optimisations and hybrid ISA approaches, and a number of existing implementations, both in 
software and hardware, have been presented in this chapter in order to reflect the state-of-the-
art in the field. A number of authors found significant improvements in the density of the 
code when using compression and a number of the previous works also report decreases in 
the power consumption as well as performance improvements. 
Chapter 3 
ENTROPY OF EMBEDDED RISC CODE
The problem of poor code density of RISC processors is commonly acknowledged in the field 
of computer architecture and, as the literature review showed (Chapter 2), considerable effort 
has been invested in its resolution, both in academic and industrial circles. The very idea on 
which the RISC ISA is based, namely the uniformity of the instruction formats, implies the
existence of significant redundancy in its encoding. This chapter investigates the nature and 
the extent of this redundancy and its amenability for compression. 
The first part of this chapter describes an experimental framework and methodology for 
quantifying the information content of data in terms of its entropy. A set of representative 
RISC programs is used in order to obtain an indication of the entropy of data that will be 
studied in this thesis. Once the entropy is known, suitable techniques can be identified to 
reduce the redundancy in the code while typically increasing its execution performance. 
Although a number of techniques exist that are capable of decreasing the code size (see section
1.2), code compression significantly outperforms other approaches (in terms of code size
reduction), such as compiler optimisations and hybrid architectures. This chapter will evaluate
relevant modern compression techniques in order to set the context for the final solution
presented in later chapters.
3.1. The concept of entropy in information theory 
In communication theory, one of the fundamental problems is to determine efficient 
representations of the data to be transmitted in order to increase the information bandwidth 
of the communication channel. This requires reproducing at one point (the destination), either 
exactly or approximately, a message selected at another point (the source). The task of the 
current study is to a large extent similar to the fundamental communication problem. It can be 
assumed that the executable program in its original format is the communication source, while 
the instruction fetch stage of the processor pipeline is the destination. The communication 
channel can be seen to be a combination of the memory resources and the instruction data 
bus, where both of these will benefit from an improved representation. This assumption 
would allow us to employ entropy, as presented by Shannon in his classical work 
"Mathematical Theory of Communication" [11], as a method of identifying the actual 
information content of the program's code. In the second theorem of Shannon's paper, 
known as the noiseless channel encoding theorem, he postulates that if there were a measure 
of the information content of a data set (entropy), H(p_1, p_2, ..., p_n), where p_i, i = 1, ..., n,
accounts for how much choice is involved in the selection of the event or of how uncertain we
are of the outcome (in our case the probability of occurrence of a selected symbol in the
code), the following three conditions must be satisfied:
• H must be continuous in the p_i.
• If all the p_i are equal, p_i = 1/n, then H should be a monotonically increasing function of
n.
• If a choice can be broken down into two successive choices, the original H should be 
the weighted sum of the individual values of H at each stage. 
And the only equation for H that satisfies the three conditions is of the form 
H(p_1, p_2, \ldots, p_n) = -K \sum_{i=1}^{n} p_i \log(p_i),    (3.1)
where K is a positive constant, which amounts to the choice of a unit of measure. The minus 
sign preceding K merely reflects the desire for entropy to be a positive quantity, whereas, 
always being less than unity, probabilities always have negative logarithms [11]. When the unit 
is bits, K = 1, the logarithms are taken with base 2 and H is the entropy of the set of
probabilities p_i, then
H = -\sum_{i=1}^{n} p_i \log_2 p_i  bits    (3.2)
As defined by Shannon [11], entropy applies only to a probability distribution, that is a set 
of choices with probabilities summing to one. In some cases, it can be more useful to know 
the information content of a particular choice. If the probability of a choice is p_i, its
information content or entropy can be defined as the negative logarithm of its probability
H_i = -\log_2 p_i  bits    (3.3)
This means that those symbols more likely to occur contain 'less' information. The term 
entropy as defined in Equation (3.2) will be further used in this thesis as a measure of the 
entropy in a program's code, and the entropy of an individual symbol will be defined as in 
Equation (3.3). 
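As an illustration of Equations (3.2) and (3.3), the following minimal sketch evaluates both quantities for two small probability distributions. The function names and the example distributions are hypothetical and serve only to make the definitions concrete.

```python
import math
from typing import Sequence


def entropy_bits(probabilities: Sequence[float]) -> float:
    """Equation (3.2): H = -sum(p_i * log2(p_i)), in bits per symbol.

    Terms with p_i == 0 are skipped, since p * log2(p) tends to 0 as p tends to 0.
    """
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to one")
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def self_information_bits(p: float) -> float:
    """Equation (3.3): H_i = -log2(p_i), in bits."""
    return -math.log2(p)


uniform = [0.25, 0.25, 0.25, 0.25]     # four equally likely symbols
skewed = [0.5, 0.25, 0.125, 0.125]     # one dominant symbol

print(entropy_bits(uniform))           # 2.0
print(entropy_bits(skewed))            # 1.75
print(self_information_bits(0.5))      # 1.0
print(self_information_bits(0.125))    # 3.0
```

The skewed distribution has a lower entropy than the uniform one, and the most probable symbol carries the least information, in line with the remark above that likely symbols contain 'less' information.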
3.2. Experimental framework 
The main purpose of this section is to describe the experimental framework used to identify 
the entropy of a set of programmes that represent adequately a wide variety of embedded 
RISC applications, and the improvements, in terms of code size, achieved by the use of
different compression techniques. These programs (selected mainly from the MiBench [34]
test suite) have been compiled for the ARCTangent-A4 microprocessor using the space
optimisation option. The software tool developed in this research for measuring the entropy
of the code is described along with a number of compression algorithms that were evaluated 
against this code, in order to highlight the state-of-the-art in the field of lossless data 
compression. 
3.2.1. Entropy measurement tool 
As stated at the beginning of the chapter, a scientific study that aims to improve the code 
memory utilisation of RISC processors requires a quantitative analysis of the actual entropy of
the code, thereby providing an insight into the degree of redundancy present. This analysis is 
automatically performed by a software tool developed in this research, whose main structure is 
presented in pseudo code in Figure 3.1. The tool takes as an input parameter the name of the
executable file for
which it builds a probability distribution table for symbols of three different lengths, namely 8, 
16 and 32 bits. The ELF headers of the executable file are processed in order to extract the 
code segment's size and offset. Analysis of the code segment identifies the elements of the 
data set and builds the frequency distribution tables for each of the three symbol lengths. The 
probability of occurrence of each symbol is calculated and its individual entropy is obtained 
according to Equation (3.3). Finally the entropy of the whole code segment is calculated using 
Equation (3.2) and the results are sent to a statistical log file. 
    /* Open the executable and locate its code segment via the ELF headers. */
    inFile = fopen(inputFile, "rb");
    readHeaderInto(&pCodeSegment, &codeSegmentSize);
    findElfCodeSegment(inFile, pCodeSegment);

    /* Pass 1: 8-bit symbols. */
    fseek(inFile, pCodeSegment, SEEK_SET);
    for (i = 0; i < codeSegmentSize; i++)
    {
        fread(&byte, 1, 1, inFile);
        frequencyDistributionTable(byte, 1, byteFrDistribution);
    }
    calculateProbabilities(byteLevelProbabilities, byteFrDistribution);
    calculateEntropies(byteLevelProbabilities);

    /* Pass 2: 16-bit symbols, so the offset advances two bytes per read. */
    fseek(inFile, pCodeSegment, SEEK_SET);
    for (i = 0; i < codeSegmentSize; i += 2)
    {
        fread(&halfWord, 2, 1, inFile);
        frequencyDistributionTable(halfWord, 2, hWordFrDistribution);
    }
    calculateProbabilities(hWordLevelProbabilities, hWordFrDistribution);
    calculateEntropies(hWordLevelProbabilities);

    /* The 32-bit pass follows the same pattern, reading four bytes at a time. */
Figure 3.1 Entropy Tool Pseudo Code 
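The measurement performed by the tool can be approximated by the following minimal sketch, which is not the tool developed in this research. It assumes that the code segment has already been extracted from the ELF file into a byte string, and that the entropy of the whole segment is the per-symbol entropy of Equation (3.2) multiplied by the number of symbols. All function and variable names are hypothetical.

```python
import math
from collections import Counter


def symbol_entropy_bits(code: bytes, symbol_bytes: int) -> float:
    """Zero-order entropy (bits per symbol) of `code` for a fixed symbol width."""
    usable = len(code) - len(code) % symbol_bytes   # ignore a trailing partial symbol
    symbols = [code[i:i + symbol_bytes] for i in range(0, usable, symbol_bytes)]
    counts = Counter(symbols)                       # frequency distribution table
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def segment_entropy_bits(code: bytes, symbol_bytes: int) -> float:
    """Entropy of the whole segment: per-symbol entropy times the symbol count."""
    return symbol_entropy_bits(code, symbol_bytes) * (len(code) // symbol_bytes)


# Hypothetical stand-in for an extracted code segment: two distinct 32-bit
# words, each repeated four times (256 bits in total).
segment = bytes.fromhex("0a0b0c0d") * 4 + bytes.fromhex("11223344") * 4

for width in (1, 2, 4):                             # 8-, 16- and 32-bit symbols
    print(width * 8, segment_entropy_bits(segment, width))
```

For this artificial segment the measured entropy falls as the symbol width grows, because the repetition is only visible at the granularity of whole 32-bit words.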
3.2.2. Test Material 
Ten representative embedded RISC applications (eight of them taken from the MiBench suite
[34]) were selected in order to reflect the typical redundancy present in embedded RISC
code. While many test bench suites target only specific areas of computation (such as integer 
or floating point arithmetic) or application (for example, MediaBench is specifically designed 
for multimedia applications), MiBench is a free, portable benchmark suite, specially designed 
to evaluate compiler and ISA performance for embedded implementations and based on a 
wide variety of representative commercial embedded applications. These applications are 
divided into the following five categories (with flexible boundaries), according to their market 
segments: control and automotive, consumer electronics, telecommunications, networking and 
security. The first category (automotive and industrial control) probably represents the 
majority of the current embedded system applications, including air bag controllers, image 
recognition, engine performance monitors, industrial controllers and sensor systems. The test 
programmes contained in this category are focussed on assessing performance in basic 
mathematical operations, including bit manipulation, data input/output and simple data 
organization. The consumer electronics sector is currently experiencing rapid expansion,
opening the arena to many new embedded technologies, and the selected test programmes for
this category are intended to represent the many consumer electronic devices that have grown
in popularity during recent years such as scanners, digital cameras and personal digital
assistants (PDAs). Another relatively new and growing sector powered by modern embedded
system applications is telecommunications. With the explosive growth of the Internet, many
portable consumer devices incorporate integrated wireless communications. The benchmarks
from this category consist of voice encoding and decoding algorithms, frequency analysis and
a checksum algorithm. The network category covers embedded systems that support network
devices, such as switches and routers. The benchmark designed for this category focuses on
shortest path calculations, lookup tables, search tree generation and processing, and data
input/output. Finally, the data security group includes a range of encryption algorithms. A 
brief description of each of the selected applications is provided below. 
Basicmath performs simple mathematical calculations that often don't have dedicated 
hardware support in embedded processors, such as cubic function solving, integer square root 
and angle conversions from degrees to radians. The input data is a set of constants. 
The bitcount algorithm tests the bit manipulation abilities of a processor by counting the 
number of bits in an array of integers. Five methods are used, including an optimized 1-bit per
loop counter, recursive bit count by nibbles, non-recursive bit count by nibbles using a table
look-up, non-recursive bit count by bytes using a table look-up and shift and count bits. The
input data is an array of integers with an equal number of 1s and 0s.
The CRC32 benchmark performs a 32-bit cyclic redundancy check (CRC) on a file, an
approach often used to detect errors in data transmission. 
Dijkstra is a benchmark that constructs a large graph in an adjacency matrix 
representation and then calculates the shortest path between every pair of nodes using 
repeated applications of Dijkstra's algorithm, a well-known solution to the shortest path 
problem. 
The display application was supplied by ARC International for performing simple tests 
on ARCAngel, the ARCTangent-A4 development board. It drives a set of user interface
indicators, including an LCD display, a 7-segment display and a number of LEDs. It was
selected as a part of the test set as its functionality is commonly found in many consumer
electronic devices.
The FFT/IFFT benchmark performs a fast Fourier transform (FFT) and its inverse
(IFFT) on an array of data.
The qsort program sorts a large array of strings into ascending order using the
well-known quick sort algorithm.
Sha is a secure hash algorithm that produces a 160-bit message digest for a given input. It 
is often used in the secure exchange of cryptographic keys and for generating digital 
signatures. 
Shine is an MP3 encoder, whose source code is freely available from the web [35]. The 
baseline source code contains no processor-specific optimizations. The encoder's code 
includes primarily 16-bit integer and double precision floating-point code. 
The susan image recognition package was developed for identifying corners and edges in
magnetic resonance images of the brain. It is also capable of image-smoothing and has 
adjustments for threshold, brightness, and spatial control. In this implementation, the input 
data is a black and white image of a rectangle. 
3.2.3. Lossless data compression algorithms 
As discussed in Chapter 2, data compression algorithms can be divided into three categories: 
ad hoc, dictionary and statistical. Ad hoc algorithms are not used in this study, as they are not 
suitable for compressing 32-bit RISC code as there is not sufficient regularity in the op-codes 
and the symbols would vary widely between programs. Dictionary algorithms are the most 
commonly used as they present a good combination of processing speed and compression 
efficiency. Statistical techniques are more complex to implement, but typically achieve the
highest compression ratios. Some of the most popular compression algorithms from both the 
dictionary and statistical categories have been selected for evaluation and they are briefly 
described next. 
LZ Family of Algorithms
LZ encoding [36] is a popular adaptive dictionary technique commonly used in text 
compression applications, which replaces phrases with pointers to their previous occurrences. 
A phrase might be a word, part of a word, or several words. The pointer is usually a pair (m, l),
where m is the position in the input string and l its length. References can be recursive, that is,
they can point to other pointers. There is a whole family of algorithms based on the LZ technique,
each member of which reflects different design decisions. The two main distinguishing
characteristics are the size of the history buffer (that is how far back in the file a pointer can
access), and which substrings are allowed as targets for a pointer. The history buffer can be 
unrestricted (growing window) or limited to (typically) several thousands of words (fixed-size 
window). The choice of characters can be also unrestricted or limited to a set of phrases that 
are chosen according to some heuristics. Each combination of these choices represents a 
trade-off between performance (execution time), memory requirement and compression ratio. 
Decoding is quite simple - the decoder replaces the pointer with its corresponding phrase. 
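The simplicity of decoding can be illustrated with a minimal sketch of a generic LZ77-style decoder. It does not reproduce the bit-level format of any of the commercial algorithms discussed in this chapter. As an assumption of the sketch, a pointer is expressed as a (distance, length) pair, where the distance is counted backwards from the current end of the output rather than as an absolute position; all names are hypothetical.

```python
from typing import Iterable, Tuple, Union

# A token is either a literal byte value or a (distance, length) pointer.
Token = Union[int, Tuple[int, int]]


def lz_decode(tokens: Iterable[Token]) -> bytes:
    """Rebuild the original data from literals and back-references."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, int):
            out.append(token)                  # literal byte, copied as-is
            continue
        distance, length = token
        if not 0 < distance <= len(out):
            raise ValueError("pointer reaches outside the history")
        start = len(out) - distance
        for k in range(length):                # byte by byte, so overlapping copies work
            out.append(out[start + k])
    return bytes(out)


tokens = [ord("a"), ord("b"), ord("c"), (3, 6), ord("d")]
print(lz_decode(tokens))                       # b'abcabcabcd'
```

The pointer (3, 6) copies more bytes than the distance it reaches back, which shows how a pointer can refer to text that is itself produced by the same pointer.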
Three different commercial algorithms based on LZ are used in this study. The LZS 
algorithm [37] (designed and commercialised by Hifn Inc.) has a fixed history window of 
2kbytes. The special characteristic of this implementation is the ability to specify the level of 
searching effort carried out to find the best matching (longest) string, allowing the
compression ratio to be traded for execution time. 
A more complex implementation of the LZ algorithm, which is used in this study, is data 
compression according to Lempel and Ziv, (DCLZ) produced by Advanced Hardware 
Architectures Inc. [38]. The algorithm allows the resetting of the dictionary if a poor 
compression ratio is detected and makes use of a number of control codes. The dictionary 
entries have lengths in the range 2 to 128 bytes and uncompressed bytes in the output string
are also encoded so that they fall into a specified range (8 .. 263). The algorithm has been used 
for compressing data in a range of commercial devices, including high speed data 
communication systems, high resolution laser printing devices and SCSI host-bus adaptors. 
Adaptive Lossless Data Compression (ALDC) is another implementation of the LZ family of
algorithms developed by IBM [39]. The two major differences between ALDC and DCLZ are
the variable size of ALDC's history buffer, and the use of control codes in DCLZ.
Prediction by partial match (PPM) 
PPM is a compression technique based on probability estimation. It uses an adaptive statistical
modelling technique that blends together different length context models to predict the next 
character of the input. The models record the frequency of characters that have followed each 
of the contexts. For example, if a particular context happens to be "thei", then all the 
characters that have followed this context are counted and the next time the context "thei" 
occurs in the text, these counts are used to estimate the probability of the next character. A 
feature of PPM is that it operates for different context lengths (for example, "hei" as well as
"ei"), to arrive at an overall probability distribution, and that arithmetic coding can then be
used to optimally encode the character with respect to this distribution. Variants of PPM have 
been developed for different context lengths and local order estimation (that is, selecting a 
particular context model from all the current context models). The executable used in this 
study (PPMZ2) includes a number of coders based on PPM techniques, including order-12 
PPMdet, order-8 PPMZ, order-5-4-PPMQ and order-3-2-1 PPMC [40]. 
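The context modelling idea behind PPM can be sketched as follows. This is a deliberate simplification and not PPMZ2: it only counts the characters that follow each context and predicts from the longest context already seen, whereas a real PPM coder blends the models through escape probabilities and feeds the result to an arithmetic coder. The function names and the sample text are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict


def build_context_counts(text: str, max_order: int) -> Dict[str, Counter]:
    """Count which characters follow every context of length 1..max_order."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for i in range(len(text)):
        for order in range(1, max_order + 1):
            if i - order < 0:
                break                                  # not enough history yet
            counts[text[i - order:i]][text[i]] += 1
    return counts


def predict(counts: Dict[str, Counter], history: str, max_order: int) -> Dict[str, float]:
    """Estimate next-character probabilities from the longest known context."""
    for order in range(min(max_order, len(history)), 0, -1):
        seen = counts.get(history[-order:])
        if seen:
            total = sum(seen.values())
            return {ch: n / total for ch, n in seen.items()}
    return {}                                          # no context has been seen


counts = build_context_counts("their theirs thein", max_order=4)
print(predict(counts, "thei", max_order=4))
```

In the sample text the context "thei" is followed twice by 'r' and once by 'n', so the sketch assigns these characters probabilities of two thirds and one third respectively.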
X-MatchPro 
X-MatchPro is an adaptive dictionary based algorithm that allows partial matching of 
incoming data with the data stored in the dictionary [41]. The data is read by tuples of 4 bytes, 
and each incoming tuple is compared with the dictionary entries. A full match occurs when all
the bytes of the incoming tuple match a dictionary entry and a partial match occurs when at 
least two of the bytes of the tuple exactly match a dictionary entry. In the case of a partial 
match, the bytes that do not match are transmitted literally. The code prefixes of each 
compressed tuple indicate its match location in the dictionary and the match type, thereby 
specifying which bytes matched the specified dictionary entry. In the case of a miss, a single bit
is added to the incoming tuple and it is placed in the dictionary. The dictionary is maintained 
using a move-to-front strategy, whereby the current tuple is placed at the front of the 
dictionary and other tuples move down by one location. XMatchPro is a high bandwidth 
algorithm, allowing very fast real-time processing, but the compression ratios achieved are 
moderate. 
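The matching rules described above can be sketched in a few lines. The sketch below is an illustration under stated assumptions and not the X-MatchPro implementation: it omits the bit-level coding of match locations and match types, it picks the entry with the most matching bytes, and it removes the matched entry only on a full match so that the dictionary holds no duplicates. All names are hypothetical.

```python
from typing import List, Optional, Tuple


def best_match(tuple4: bytes, dictionary: List[bytes]) -> Optional[Tuple[int, List[bool]]]:
    """Return (location, per-byte match mask) of the best entry, or None on a miss."""
    best = None
    for location, entry in enumerate(dictionary):
        mask = [a == b for a, b in zip(tuple4, entry)]
        # At least two bytes must match for a partial match to be accepted.
        if sum(mask) >= 2 and (best is None or sum(mask) > sum(best[1])):
            best = (location, mask)
    return best


def process_tuple(tuple4: bytes, dictionary: List[bytes], capacity: int):
    """Classify one 4-byte tuple and update the dictionary (move-to-front)."""
    match = best_match(tuple4, dictionary)
    if match is None:
        result = ("miss", None, tuple4)            # whole tuple is sent literally
    else:
        location, mask = match
        literals = bytes(b for b, hit in zip(tuple4, mask) if not hit)
        if all(mask):
            result = ("full", location, literals)
            del dictionary[location]               # assumption: avoid duplicate entries
        else:
            result = ("partial", location, literals)
    dictionary.insert(0, tuple4)                   # current tuple goes to the front
    del dictionary[capacity:]                      # the oldest entries fall off the end
    return result


dictionary = [bytes([1, 2, 3, 4])]
for incoming in (bytes([1, 2, 3, 4]), bytes([1, 2, 9, 9]), bytes([7, 7, 7, 7])):
    print(process_tuple(incoming, dictionary, capacity=64))
```

The three tuples in the usage example produce a full match, a partial match with two literal bytes, and a miss, in that order.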
Among the set of algorithms discussed above, PPM provides, in general, the best 
compression ratios, but its performance comes at the expense of a highly complex 
implementation that makes this algorithm unsuitable for real-time commercial embedded 
applications. Conversely, XMatchPro's architecture is targeted towards high-speed 
implementation, and, although it offers moderate compression ratios, it is the fastest 
compression implementation currently available [41]. The LZ family has many commercial 
features allowing the user to select an appropriate compromise between compression 
performance and implementation complexity, making it very suitable for a wide range of 
applications. 
3.3. Experimental trials 
The test applications described in the previous section were compiled for the ARCTangent-A4
microprocessor, using the hcarc compiler supplied by ARC International, and the entropy of 
each measured using the entropy measurement tool described in subsection 3.2.1. The 
following section presents the results of the experiments described to determine the entropy 
of the executable codes. 
3.3.1. Entropy measurements 
The entropy of a program's code can be defined as the minimum number of bits necessary to 
encode it. It is important to note that entropy can be evaluated only relative to an estimate of
the probabilities of occurrence of the symbols that comprise the program. Thus, the entropy 
depends not only on the choice of model but also on the set of symbols selected for 
generating the probabilities. 
Although, in general, better entropies can be achieved by applying more sophisticated
models (for example, using units of compression with variable lengths), the practical
overheads required to support the decompression process (such as hardware complexity,
slower execution speed, a larger physical area and greater power usage) make this
approach unrealistic in most cases.
As mentioned in Chapter 1, this research has, as its primary objective, the design and
development of a compression architecture suitable for demanding embedded RISC
applications where memory requirements are intimately coupled with real-time performance.
Therefore, the solution developed in this thesis (see Chapter 4) aims to balance three major
aspects: compression ratio, performance (increasing significantly the cache-hit ratios and
ensuring that the decompression engine can be clocked at least at the same frequency as the
host RISC processor) and flexibility.
An important part of the solution is its compression model. A zero-order fixed-context
model was selected because it provides good entropy results at low (hardware) costs (see
section 2.2.1). Once the model is identified, the next step is to determine the appropriate
number of bits for the input symbols. Table 3.1 presents the entropy results of the test
programmes included in the test bench (compiled for ARCTangent-A4), for three different 
symbol sizes (8, 16 and 32 bits). 
Table 3.1. Entropies for a range of programs compiled for the ARCTangent-A4
Program      Entropy (bits)                    Actual size (bits)
             8-bit      16-bit     32-bit
basicmath    299594     220038     128439      405632
bitcnts      250696     182378     106432      341216
crc          239291     173923     101144      325088
dijkstra     293427     214615     123899      396256
display       65248      45106      25291       91520
fft          252044     184730     107903      341888
qsort        294425     215415     126036      398048
sha          241043     176015     102552      327520
shine        492695     374779     226969      662656
susan        445214     333075     199031      606976
Figure 3.2 shows that, for 32-bit symbols, the original code appears to have a redundancy
of up to 69%. Clearly, embedded RISC code has significant redundancy, and an appropriate
compression model should be able to reduce the number of bits in the program.
[Figure 3.2: plot of the ratio between entropy and program size (vertical axis, 0 to 0.8) for each test program (horizontal axis), with one series each for 8-bit, 16-bit and 32-bit symbols]
Figure 3.2. Ratio between entropy and actual programs' sizes
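The ratios plotted in Figure 3.2 can be recomputed directly from Table 3.1, as the following sketch illustrates. The data are copied from the table; the names are hypothetical, and taking the redundancy as one minus the entropy-to-size ratio is an assumption about how the quoted percentage was derived.

```python
# (entropy for 8-, 16- and 32-bit symbols, actual size), all in bits, from Table 3.1.
TABLE_3_1 = {
    "basicmath": (299594, 220038, 128439, 405632),
    "bitcnts":   (250696, 182378, 106432, 341216),
    "crc":       (239291, 173923, 101144, 325088),
    "dijkstra":  (293427, 214615, 123899, 396256),
    "display":   (65248, 45106, 25291, 91520),
    "fft":       (252044, 184730, 107903, 341888),
    "qsort":     (294425, 215415, 126036, 398048),
    "sha":       (241043, 176015, 102552, 327520),
    "shine":     (492695, 374779, 226969, 662656),
    "susan":     (445214, 333075, 199031, 606976),
}


def entropy_ratios(row):
    """Return the entropy-to-size ratio for 8-, 16- and 32-bit symbols."""
    e8, e16, e32, size = row
    return tuple(entropy / size for entropy in (e8, e16, e32))


for name, row in TABLE_3_1.items():
    r8, r16, r32 = entropy_ratios(row)
    print(f"{name:10s} {r8:.3f} {r16:.3f} {r32:.3f}  redundancy (32-bit): {1 - r32:.1%}")
```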
As stated before, the selection of a model that matches the requirements of a particular
compression application is certainly necessary in order to achieve good compression.
However, as Table 3.1 demonstrates, another decisive factor is the selection of the appropriate
number of bits for the input symbols. During the experiments the smallest individual entropy
achieved for 32-bit input symbols (which corresponds to an instruction word size) was 4.3 bits,
while the largest was 14.63 bits. This not only means that the code size reduction that could be
achieved is more than 50%, but it also provides a suitable range for use when determining the
appropriate lengths of the codewords that should be employed in the compression algorithm.
3.3.2. Evaluation of data compression algorithms
To take advantage of the redundancy in the RISC programs as found in the previous section,
compression can be used to produce more compact executable code. Section 3.2.3 introduced
the most popular compression algorithms in the field and in this section, the same test bench
used for entropy evaluation is used to assess the code reduction that these general-purpose
compression techniques can achieve on RISC code.
The compression results presented in Figure 3.3 are obtained using history and input
buffers (for the LZ family and XMatchPro) set to 2 kbytes and a dictionary length for
XMatchPro of 256 bytes (64 entries).
As expected, PPMZ provides the greatest reduction in code size, generating compressed
executables that are typically 34% of the original length. The compression results of the LZ
family followed closely, with ALDC (0.43 mean compression ratio) consistently outperforming
the other two algorithms. The worst mean compression ratios were achieved by DCLZ (0.55)
and XMatchPro (0.53).
[Figure 3.3: chart of compression ratio (vertical axis, 0 to 0.7) for each test program (horizontal axis), with one series each for ALDC, PPMZ2, XMatchPro, DCLZ and LZS]
Figure 3.3. Compression performance of a range of algorithms applied to the test-bench programs
These results can be improved by fine-tuning the settings and parameters of each
algorithm, such as input buffer or dictionary size. The effect that the values of these two
parameters have on the final compression ratios is discussed below for three different
compression algorithms for which it is relevant.
LZS 
Figure 3.4 (a) shows the effect of varying the input buffer size on an LZS implementation. As 
can be seen there is a clear trade-off between buffer size (which is directly related to the 
hardware complexity and decompression performance) and compression ratio. As is expected, 
a small buffer size (such as 32 or 64 bytes) provides almost no improvement on the final code
size. On the other hand, in code-compression applications, big buffer sizes will often
introduce significant performance overheads related to the decompression and processing
of jump instructions.
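The dependence of an LZ-type compressor on its history window can be reproduced with a small experiment. The sketch below uses the DEFLATE implementation in Python's standard zlib module as a stand-in, since LZS itself is not freely available; DEFLATE combines LZ77 with Huffman coding, so the absolute ratios are not comparable with those reported in this chapter. The synthetic input and the function name are hypothetical.

```python
import random
import zlib


def deflate_ratio(data: bytes, window_bits: int) -> float:
    """Compressed size / original size for a DEFLATE window of 2**window_bits bytes."""
    compressor = zlib.compressobj(level=9, method=zlib.DEFLATED, wbits=window_bits)
    compressed = compressor.compress(data) + compressor.flush()
    return len(compressed) / len(data)


# Synthetic input: one 2000-byte pseudo-random block repeated four times, so the
# only redundancy lies 2000 bytes back in the history.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(2000))
data = block * 4

for window_bits in (9, 10, 11, 12):      # 512, 1024, 2048 and 4096-byte windows
    print(2 ** window_bits, round(deflate_ratio(data, window_bits), 2))
```

A window that cannot reach back to the previous copy of the block finds nothing to reference, whereas a sufficiently large window replaces the repeated blocks with pointers.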
DCLZ 
This implementation shows no significant response to a change in size of the history buffer.
As Figure 3.4 (b) shows, all the test programmes present almost no variation in their
compression ratios over the whole range of history buffers tested. Unfortunately, IBM's
implementation does not allow the buffer to be set to sizes smaller than 512 bytes, thus
preventing the evaluation of the algorithm at these very small block sizes.
XMatchPro 
This algorithm behaves in a similar way to LZS with respect to changes in the input buffer
length (see Figure 3.4 (c)). Although the LZ family of algorithms are formally categorised as
'dictionary' techniques, their implementation does not include a separate dictionary. Thus the
only algorithm (among the ones described in this chapter) that can be used to evaluate the
influence of the dictionary size on the compression performance is XMatchPro, which
features dictionary sizes between 128 bytes and 4 kbytes. Figure 3.5 shows the compression
ratios obtained for an input buffer of 4 kbytes for a range of different dictionary sizes. As
intuition would suggest, an increase in the dictionary size does indeed bring an improvement
in the compression ratio. However, as the dictionary size is increased and approaches the size
of the input buffer, the proportionate improvement diminishes.
[Figure 3.4: three charts of compression ratio for the ten test programs (basicmath, bitcnts, crc, dijkstra, display, fft, qsort, sha, shine, susan): (a) LZS against input buffer size (32 to 4096 bytes); (b) DCLZ against history buffer size (512 to 4096 bytes); (c) XMatchPro against input buffer size (32 to 4096 bytes)]
Figure 3.4. Compression ratio results for different input block sizes
Chapter 3 - E lltropy of Embedded rusc Code 
0.54 
0.53 
c 0.52 
'll 
~ 
~ 0.51 
= c 
.~ 
.. 0.5 il 
~ 
"" 049 e
c 
u 048 
047 : : :. 
046 
128 256 512 1024 2048 4096 
Diction:uy sizes (bytes) 
51 
-t-- basicmath 
~bitcnts 
~ C!C 
djikstta 
--display 
--- fft 
~qsort 
--sha 
__ shine 
sus an 
Figure 3.5. Compression ratios for various dictionary sizes 
Several conclusions can be drawn from the results presented in this subsection. Firstly, it
is clear that compression algorithms can significantly reduce the redundancy of RISC code,
achieving an average compression ratio of 0.46. Secondly, the compression results are highly
dependent on the selected parameters, where larger input buffer and dictionary sizes
significantly improve the compression performance of the algorithms. This suggests that the
compression algorithm to be developed by this work should allow for certain flexibility in a
number of parameters to be finely tuned to the characteristics of the program's code.
3.4. Conclusions 
This chapter presented a study of the entropy of the embedded RISC code, together with an
investigation of the performance of a number of state-of-the-art lossless data compression
algorithms.
The entropy of the code was numerically quantified by the use of several real-life
applications, representative of those found in embedded systems compiled for commercial
RISC processors. The results obtained demonstrated the presence of high levels of
redundancy in the executable RISC code, whose representation was found to frequently
exceed the most highly compressed form by a factor of three. Apart from demonstrating the
feasibility of the investigation of RISC code compression, the entropy study provided a
number of additional useful outcomes. For example, the results suggested that a suitable unit 
of compression is the 32-bit instruction word, while the lengths for the codewords, ranging 
between 5 bits for the instructions with higher probabilities and 14 bits for those with lower 
probability of occurrence, might result in the most efficient representation. 
In the second part of this chapter, the suitability of compression as a means of removing
redundancy from the code was examined. A number of the most popular compression
algorithms were presented and evaluated, achieving excellent compression ratio results (0.46
on average for the test bench programmes). This demonstrated that lossless compression can
be successfully and usefully applied to RISC executable code.
Chapter 4 
COMPRESSION ALGORITHM, TOOLS AND 
DESIGN FLOW 
The experiments described in Chapter 3 provided insight regarding the degree and nature of 
the redundancy in embedded RISC code, as well as indicating the relative performance of 
different compression algorithms when applied to embedded code, while disregarding 
hardware implementation issues. As described in Section 2.5.1, in order to be suitable for
embedded code compression, an algorithm should:
• allow for a highly efficient hardware implementation that not just avoids performance
degradation (lengthening of execution time), but enables improvement. This algorithm
will need to differ from those presented in Chapter 2, since the code size
reduction achieved by them was at the cost of performance; 
• be able to work from a limited context, where the program is decompressed in small 
blocks of length typically no greater than that of a cache line; 
• provide for real-time non-sequential decompression, so that it can cope with changes 
to the execution flow of the programme. 
The above issues are considered in this chapter. The chapter describes a novel algorithm 
suitable for high-performance real-time embedded applications, whose seamless hardware 
implementation does not come at the expense of degradation in the compression ratios 
relative to the algorithms presented in Chapter 3, and that, in most of the benchmark 
problems considered, improved the overall system performance. 
4.1. Overview of the proposed solution 
This section provides an overview of the design decisions leading to the compression solution 
developed in this study. Details of the algorithm's implementation are presented in later 
sections. 
One of the particular characteristics of code compression is its asymmetric nature, where 
compression and decompression take place in separate environments and, therefore, have to 
comply with different constraints. The compression process takes place off-line, on a host 
development machine, and is, therefore, substantially free from memory and performance 
restrictions. Generally, the compressor can be implemented in software, and can make use of 
relatively complex and time-consuming code analysis methods in order to achieve optimum 
encoding. On the other hand, decompression occurs in the embedded environment and needs 
to be carried out in real-time, as embedded products are often targets of strict performance
requirements. Nevertheless, overheads in terms of area usage, power consumption and 
execution time are inevitable as decompression involves additional processing of the 
instructions fetched from the memory at run-time. The hardware implementation will need to 
not only minimize overheads, but also counterbalance commensurate drawbacks by providing 
an overall performance gain as well as a reduction in memory requirements of executables. 
This study has achieved both a significant reduction in the size of embedded RISC
executables and an increase in the system performance by:
• Using code compression in order to obtain executables with sizes close to their
entropy levels, thus reducing significantly the redundancy levels of the original code.
• Relocating the boundary between compressed and uncompressed spaces (in order to
obtain a significant increase in the instruction cache hit ratio), hence improving the 
overall performance of the system. Alternatively the cache memory could be reduced 
in capacity while maintaining the same hit ratio. 
In most of the compression schemes described in Chapter 2, the boundary between 
compressed and uncompressed space is located between the ICache and the instruction 
memory, the decompression process being triggered by a cache miss. Only the instructions in 
memory are in compressed format, while in the ICache they remain in their original
uncompressed form. Relocating the boundary between the ICache and the processor core 
allows the ICache to hold compressed instructions, enlarging its effective size. As will be seen 
in the results presented in Chapter 7, this virtual increase in the ICache size has a significant 
effect on the performance of the system resulting from substantial improvements to the 
ICache hit ratio, and amply over-compensating, in the majority of the cases, the 
decompression-related overheads. Furthermore, this scheme does not require any changes to 
the processor core or to the ICache architecture and the whole compression/decompression 
system (IP-core and support software) could be implemented as a plug-in for System-on-Chip 
development tools. 
4.2. Compression Algorithm 
The initial intention for the compression algorithm developed in this work was to mirror the 
instruction encoding lengths typical in x86 processors, in order to obtain the code 
representation benefits of its CISC ISA. This implied replacing the most commonly-used 
instructions with codewords of length 8, 16 or 24 bits, thus preserving memory alignment and
keeping address translation relatively simple. However, after performing a series of tests, it
became clear that the algorithm should provide greater flexibility, by allowing a wider range of
classes and codeword lengths. This flexibility resulted in significant improvements in
compression ratio, at the cost only of losing the byte-alignment of instructions. This section
presents the algorithmic design decisions and the implementation details of the developed 
compression scheme. 
4.2.1. Algorithmic design considerations 
Code compression is performed in two stages, namely modelling and coding. In the first stage, 
a model well suited to the characteristics of the code is selected and used to obtain the 
probabilities of the symbols in the embedded program. The modelling requires that three 
important parameters, namely the unit (or symbol) of compression, the type of the model and 
its order, are determined. These parameters, together with the defined encoding algorithm, 
specify the compression algorithm. Due to the influence that these modelling decisions have 
on compression performance, the process involved in obtaining suitable parameters is now
considered in detail. 
In a number of the code compression studies, the instruction words are divided into 
smaller building blocks (half-words), which are compressed separately. Examples of this 
approach can be found in IBM's CodePack system [7] and the compression scheme 
implemented in Motorola MPC5xx microcontrollers [19]. This approach has been
demonstrated to be very successful when used with code that incorporates only a small
number of unique half-words that have high probabilities, or where the data contains some
half-words that have a much higher probability than others. However, as the entropy study
proved (see Section 3.3.1), such a scheme does not produce the best compression
performance for ARCTangent-A4's code (or for other RISC processors), due to the large
number of unique half-words normally present in the code. Studies that utilised compression 
symbols longer than a single instruction, such as expression trees or instruction patterns (see 
Section 2.5.1. for description of TBS and PBS), achieve compression results that are poor in 
comparison with using the whole instruction word. Therefore, based on the outcomes of the 
previous studies described in Chapter 2 and the entropy results presented in Figure 3.2, single 
instruction words are used as compression units in this work.
The remaining two parameters defining the model, namely the type and order of the model,
are determined mainly by the environment in which the compression process is carried out. 
Modelling of the embedded code, as already discussed, takes place on the host development 
machine and is not normally subject to strict timing constraints. This allows for a static model
to be used, in which the application to be compressed can be parsed many times in multiple
passes in order to determine accurately the probability of each symbol. This is likely to
improve compression ratio in comparison with dynamic modelling, as the probabilities of all
the symbols are more accurately known before encoding starts.
The model chosen for implementation in this work can be classified as a fixed-context
0-order model. This means that the probabilities of individual compression symbols will be
estimated without taking into account either the preceding or subsequent symbols, and the
probabilities are entirely independent of the location of the symbols in the source. This choice
is dictated according to the following line of reasoning. The code is unlikely to contain a 
sufficient number of repeated blocks of two or more consecutive instructions, where all the 
opcodes, register names and immediate values match precisely. The absence of such repeating 
blocks means that the additional complexity introduced in the hardware decompressor by
implementing support for higher-order models is not warranted. Moreover, should such
repetitions of blocks of instructions appear frequently in the code, they would be normally 
identified and optimised by the compiler when a space optimisation option is set. 
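As an illustration of the 0-order model, the following minimal sketch estimates symbol probabilities from frequency counts alone and derives the corresponding entropy. The function name and the toy instruction words are assumptions made for the example; they are not taken from the tools developed in this work.

```python
import math
from collections import Counter

def zero_order_entropy(words):
    """Entropy in bits per symbol, ignoring context and position."""
    counts = Counter(words)
    total = len(words)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Toy stream of 32-bit instruction words (hypothetical values).
program = [0x60111000, 0x63603800, 0x60111000, 0x10083600,
           0x60111000, 0x63603800, 0x53887810, 0x60111000]

h = zero_order_entropy(program)
# Dividing by the 32-bit symbol width gives a lower bound on the
# compression ratio achievable by any 0-order code for this stream.
print(f"{h:.3f} bits per instruction, bound on ratio {h / 32:.3f}")
```

Because each word is counted independently of its neighbours, the same result is obtained for any reordering of the stream, which is the defining property of a 0-order model.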
Following the definition of the parameters of the model, the encoding algorithm also 
needs to be specified. There are a number of encoding algorithms available, originally
developed mainly for text and data compression, but which can be adapted for compressing
code. However, due to the restrictions imposed by real-time embedded processing, many of
these algorithms proved to be unsuitable. A clear example is arithmetic coding, which requires
extensive use of multiplications and divisions. Other encoding techniques, such as Huffman
and Shannon-Fano [12] coding, produce variable-length codes, where the number of classes
(sets of codes of the same length) is dictated by the number of symbols and their probability
distribution, and not by the designer. Hence, if these techniques were to be used for code
compression, the decompressor would need to manage an undetermined (and often large)
number of different classes, which would result in a highly complex solution that is unsuitable
for real-time hardware implementation. Should variable-length encoding be employed by the
compression algorithm, then the following two requirements should be followed in order to 
achieve an efficient real-time-based hardware decompressor. Firstly, the number of classes 
should not be determined by the characteristics of the programme to be compressed (such as 
its length and probability distribution of instructions). Secondly, the number of classes should 
be small. 
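The dependence of the number of classes on the probability distribution can be illustrated with a small sketch that builds Huffman code lengths and counts how many distinct lengths result. The names huffman_code_lengths and number_of_classes, and the toy frequency tables, are assumptions introduced only for this example.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {s: 1 for s in freqs}
    tie = count()  # tie-breaker so the heap never compares symbol lists
    heap = [(f, next(tie), [s]) for s, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # each merge adds one bit to the merged symbols
        heapq.heappush(heap, (f1 + f2, next(tie), s1 + s2))
    return lengths

def number_of_classes(freqs):
    """A class is a set of codes sharing the same length."""
    return len(set(huffman_code_lengths(freqs).values()))

skewed = {"a": 8, "b": 4, "c": 2, "d": 1, "e": 1}
uniform = {"a": 1, "b": 1, "c": 1, "d": 1}
print(number_of_classes(skewed), number_of_classes(uniform))
```

In this sketch the designer has no control over the result: the skewed table yields several classes while the uniform one yields a single class, which is the behaviour the two requirements above are intended to avoid.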
The following section describes in detail the compression algorithm developed in this 
study in order to achieve an efficient hardware decompressor architecture that is particularly 
suited to operation in demanding real-time embedded applications.
4.2.2. General Description 
The compression algorithm that has been developed for this study can be described as a class-
based dictionary algorithm, with a 32-bit instruction word as the symbol of compression. The 
compression process consists of the following two stages. Firstly, the executable file produced
by the tool chain (compiler/linker) is analysed and the frequency of occurrence of each 
instruction is determined. Secondly, the instructions with the highest frequencies are assigned
codewords and placed in a dictionary table. As Figure 4.1 shows, during compression the
instructions are replaced by their corresponding codewords and a new, compressed executable
file is generated that also includes the uncompressed instructions not assigned a codeword.
Figure 4.1. Uncompressed to compressed file conversion
[Diagram: the Original Executable (32-bit instruction words at addresses from 0x1000), the
Dictionary (indexed entries holding the most frequent instruction words) and the resulting
Compressed Executable (codewords interleaved with uncompressed instructions).]
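The two stages described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the tuple representation of the output and the toy instruction values are hypothetical, and bit-level packing of the codewords is deliberately omitted.

```python
from collections import Counter

def build_dictionary(program, class1_size, class2_size):
    """Stage 1: rank instruction words by frequency, most frequent first.
    The first class1_size entries form class 1, the remainder class 2."""
    capacity = class1_size + class2_size
    return [word for word, _ in Counter(program).most_common(capacity)]

def compress(program, dictionary, class1_size):
    """Stage 2: replace dictionary hits by (class, index) codewords and
    keep every other instruction in its original, uncompressed form."""
    position = {word: i for i, word in enumerate(dictionary)}
    out = []
    for word in program:
        if word not in position:
            out.append(("raw", word))
        elif position[word] < class1_size:
            out.append(("class1", position[word]))
        else:
            out.append(("class2", position[word] - class1_size))
    return out

# Toy program with hypothetical instruction words.
program = [0xA, 0xB, 0xA, 0xC, 0xA, 0xB, 0xD]
dictionary = build_dictionary(program, class1_size=1, class2_size=1)
print(compress(program, dictionary, class1_size=1))
```

Keeping the dictionary ordered by frequency means that the most common instructions automatically receive the shorter class 1 codewords.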
Codewords are divided into two classes. The first (class 1) has a shorter word length and 
therefore is used to represent the instructions most frequently present (thus, achieving higher 
compression ratios), while the remainder of the instructions are represented using class 2 
codewords that are of longer length. The word length of each class not only determines the
number of instructions that can be represented using this class, but also affects the
compression results. Furthermore, two programs using classes of the same length will, in
general, yield different compression ratios. To achieve the best compression performance, it is
important to determine the optimum size of each class for a particular program, and to have a
decompressor architecture capable of dealing with parameterisable class sizes. In embedded
applications, there will need to be a practical limit placed on the dictionary size, which will
consequently place a constraint on the class sizes. Once the dictionary size is fixed, there will
be a finite number of class size combinations and the one that provides the best compression 
results can then be chosen. After selecting the size of the classes, the next step is to map the 
most frequently used instructions to a unique set of codewords. The manner in which a 
codeword is generated is shown in Figure 4.2. 
1. As not every instruction in the original program is compressed (due to the finite
size of the dictionary), this work uses the most significant bit (MSB) of an
instruction (in the compressed space) to signify whether it is a compressed
instruction (codeword) or an uncompressed one. In compression terminology,
this bit is called the Escape Bit. This bit was chosen as the ARCTangent-A4
processor used for proof-of-concept has the MSB of all its base-case instructions set to 0.
Consequently, this bit is available, assuming that no extensions to the ISA are
required.
2. The second MSB of a codeword is used to identify its class (namely 0 for class 1,
and 1 for class 2).
3. The remainder of the codeword is an index (the position of the codeword in the 
class). Note that, in general, the smaller the index the more frequently the 
codeword will be found in the compressed file. During decompression, this index 
is used to locate the original uncompressed instruction in the dictionary. 
Figure 4.2 Codeword format
[Diagram: codeword layout with fields labelled 'Information bit' and 'Index field'.]
For example, the following instruction 1100010 in the compressed space would represent
a codeword (MSB = 1), of class 2 (second MSB = 1), with index 2 (00010).
During the decompression process, an instruction is fetched from the cache and its MSB 
is decoded in order to determine whether it is a codeword or an uncompressed instruction. In 
the latter case, the instruction is sent directly to the processor; otherwise the class
information bit and index of the codeword are read. Depending on the class to which it
belongs, an appropriate offset is added to the index in order to calculate the precise
address of the original (uncompressed) instruction in the dictionary. The offsets are calculated
off-line as follows: for class 1, the offset is always 0, while class 2's offset is equal to the size
(the number of codewords) of class 1. Therefore, in the example above (codeword 1100010), if
the size of class 1 is 4, the corresponding uncompressed instruction for this particular
codeword would be located at position 2 + 4 = 6 in the dictionary. 
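The dictionary look-up described in this paragraph can be sketched in a few lines. The function name and the use of a bit string to represent the codeword are assumptions made for readability; a hardware decompressor would operate on the fetched bits directly.

```python
def decode_codeword(bits, class1_size):
    """Return the dictionary position addressed by a codeword.

    bits is the codeword as a string of '0'/'1' characters:
    escape bit, class information bit, then the index field.
    """
    if bits[0] != "1":
        raise ValueError("escape bit is 0: this is an uncompressed instruction")
    index = int(bits[2:], 2)
    # Class 1 entries start at position 0; class 2 entries follow them.
    offset = 0 if bits[1] == "0" else class1_size
    return offset + index

# The example from the text: codeword 1100010 with a class 1 size of 4.
print(decode_codeword("1100010", class1_size=4))
```

Since both offsets are fixed off-line, the run-time work reduces to one bit test and one addition per codeword.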
This form of codeword representation has proved very efficient during the
decompression of sequential sections of code. However, it will be necessary to implement 
mechanisms to support the control flow changes that occur during the compressed program's 
execution. Such mechanisms are discussed in detail in the next subsection of this chapter, 
while their hardware implementation is presented in Chapter 5. 
In order to evaluate the compression ratios achieved by the proposed algorithm, the same
test-bench presented in Chapter 3 for assessing different compression algorithms has been
used, with the results summarised in Figure 4.3 for six different dictionary sizes. For this test,
class 1 was allowed to vary in the range 4 to 8 bits, and class 2 in the range 8 to 12 bits, giving
dictionary sizes from 256 bytes to 4 Kbytes. As can be seen in Figure 4.3, the compression results are
very promising (close to 0.5) for dictionary sizes of 4 Kbytes. 
Figure 4.3 Compression ratios achieved for different dictionary sizes
[Plot: compression ratio (0.30 to 0.80) against dictionary size (256, 512, 1K, 2K, 4K, 8K) for
basicmath, bitcnts, crc, djikstra, display, qsort, sha, shine and susan, together with their mean.]
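The selection of class sizes for a fixed dictionary budget, as discussed in this subsection, can be sketched as an exhaustive search. The sketch rests on several assumptions that are not stated in the text: the quoted codeword length is taken to include the escape and class bits (so a class of length L holds 2^(L-2) entries), each dictionary entry occupies 4 bytes, uncompressed instructions keep their 32 bits, and branch appendices and the dictionary itself are not counted in the ratio. All names are hypothetical.

```python
from collections import Counter

def compressed_bits(program, len1, len2):
    """Estimated compressed size in bits for class codeword lengths len1, len2."""
    n1, n2 = 2 ** (len1 - 2), 2 ** (len2 - 2)
    counts = [n for _, n in Counter(program).most_common()]
    in_class1 = sum(counts[:n1])
    in_class2 = sum(counts[n1:n1 + n2])
    uncompressed = sum(counts[n1 + n2:])
    return in_class1 * len1 + in_class2 * len2 + uncompressed * 32

def best_class_lengths(program, max_dict_bytes,
                       range1=range(4, 9), range2=range(8, 13)):
    """Try every (len1, len2) pair that fits the dictionary budget and
    return (ratio, len1, len2) for the best one, or None if none fits."""
    best = None
    for len1 in range1:
        for len2 in range2:
            entries = 2 ** (len1 - 2) + 2 ** (len2 - 2)
            if entries * 4 > max_dict_bytes:
                continue
            ratio = compressed_bits(program, len1, len2) / (32 * len(program))
            if best is None or ratio < best[0]:
                best = (ratio, len1, len2)
    return best

# Toy program: four distinct instruction words with skewed frequencies.
program = [1] * 10 + [2] * 5 + [3] * 3 + [4] * 2
print(best_class_lengths(program, max_dict_bytes=4096))
```

Because the number of candidate pairs is small, an exhaustive off-line search of this kind is inexpensive on the host development machine.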
Although increasing the dictionary size initially improves the compression ratio, 
eventually, a saturation point is reached where any further increase does not provide any 
further substantial compression improvement. The selection of the dictionary size also has to
take into consideration the total size of the code. An unreasonably large dictionary would not
improve memory utilisation, as instructions would be stored in the decompressor's dictionary
instead of instruction memory. Therefore, although an 8 Kbyte dictionary gives better
compression ratio results for the programs included in the test-bench, the dictionary size used
for further experiments is 4 Kbytes.
In summary, this subsection has presented a novel compression scheme that provides 
similar compression ratios to those obtained by the standard text and data compression 
algorithms presented in Chapter 3. However, the principal advantage of the new class-based 
method is that it is amenable to an efficient hardware implementation in real-time embedded
systems. 
4.2.3. Compressed to uncompressed address space mapping 
The instructions of a computer programme are normally executed sequentially, except when 
control flow statements, which will change the point of execution, are encountered. Examples 
of control flow statements are: 
• Conditional statements, which may only be executed under certain conditions (typically
specified flag states);
• Loops, a group of statements that may be executed repeatedly;
• Subroutines, a group of remote statements that may be executed before control returns 
to the point from where the statements were called. 
Conditional statements are not further regarded by this study, unless the statements
themselves are jumps or branches, as they are always fetched and decoded by the processor in
order to evaluate the condition. Consequently, even when such statements are not executed,
their own and their following instruction's addresses are consecutive and thus they do not
represent a true change of flow (COF). The other two cases, namely loops and subroutines, would
normally interrupt the sequential fetch process, forcing the recalculation of the address in
real-time (by adding an offset to the current address, or by using a different address supplied as
immediate data in the instruction). Since, in the solution presented in this work, the processor
is unaware of compression, it always receives uncompressed instructions. If these instructions 
generate a COF, the target address will be always relative to the uncompressed space, and 
therefore a mapping mechanism needs to exist between the compressed and uncompressed 
address spaces in order to allow a compressed instruction to be obtained from an 
uncompressed target address.
In previous studies, this problem is usually solved by the use of address look-up tables, 
where each original address (or the address of a block of instructions) has a corresponding 
entry in the compressed address space [7, 27]. This solution guarantees correct mapping
between compressed and uncompressed memory spaces, but requires large buffers, which has
a significant detrimental effect on the compression results. Other implementations have
modified the processor's architecture in such a way that the programme counter (PC) value
corresponds to the address in the compressed space [4]. Since one of the aims of this study is
to implement the compression system such that its presence is unknown to the processor, the
latter needs to continue to operate entirely in uncompressed space, while it is the
decompressor that takes care of locating instructions in compressed space. The mechanisms
developed in this research for solving this problem are described in the following subsection
and are highly innovative while avoiding the use of large look-up tables and processor
modifications. This section will present the COF instructions available in the ISA of
ARCTangent-A4, thus clarifying the requirements for this scheme.
Relative changes of flow instructions 
In ARCTangent-A4's ISA there are three instructions whose execution can result in relative 
changes of flow, and each has the same format, as shown in Figure 4.4. These instructions 
are: conditional branch (Bcc), conditional branch and link (BLcc), and conditional loop set up
(LPcc) [42].
Figure 4.4 Relative COF instructions format
[Diagram: instruction layout including the 20-bit RELATIVE OFFSET field and the CONDITION field.]
For Bcc, the displacement value (or offset) is encoded into the 20-bit RELATIVE
OFFSET field, which, if the specified condition is met, will be added to the value of the
current PC in order to calculate the target instruction address. BLcc is executed as Bcc, but in
addition, it places the address of the instruction that follows it (nextPC) in a special Branch
and Link (BLINK) register. The LPcc instruction, together with three special purpose registers
(LP_COUNT, LP_START and LP_END) provides a mechanism for performing loops
without any delays being incurred by the count decrement or the end address comparison. The
functionality of the loop mechanism is illustrated in Figure 4.5.
Figure 4.5 PC update and loop detection mechanism [42]
If the loop condition is true, then the address of the next instruction is loaded into the 
LP_START register and LP_END is loaded with the address PC + RELATIVE OFFSET. If 
the condition is evaluated to false, then a branch occurs to the address PC + RELATIVE 
OFFSET. The loop mechanism comes into operation only if there are no pipeline stalls,
interrupts, branches or jumps. The number of loops that need to be executed is held in
another register, the loop counter, LP_COUNT. If LP_COUNT is not equal to 1, then the
PC is loaded with the contents of LP_START and LP_COUNT is decremented. Otherwise
(LP_COUNT = 1), the PC is allowed to increment normally and LP_COUNT is decremented.
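The PC update rule described above can be summarised in a short behavioural sketch. It assumes that the loop mechanism is triggered when the address of the next sequential instruction equals LP_END and that instructions are 4 bytes long; both points, like the function name, are assumptions of the example rather than statements taken from the processor documentation.

```python
def next_pc(pc, lp_start, lp_end, lp_count, word=4):
    """Return (new_pc, new_lp_count) for sequential execution,
    i.e. with no stall, interrupt, branch or jump in progress."""
    nxt = pc + word
    if nxt != lp_end:
        return nxt, lp_count          # not at the end of the loop body
    if lp_count != 1:
        return lp_start, lp_count - 1  # go round the loop again
    return nxt, lp_count - 1           # final iteration: fall through

# Loop body occupying 0xF8..0x100, so LP_END is 0x104.
print(next_pc(0x100, lp_start=0xF8, lp_end=0x104, lp_count=3))
print(next_pc(0x100, lp_start=0xF8, lp_end=0x104, lp_count=1))
```

The point of interest for the decompressor is that the jump back to LP_START is generated without any branch instruction being fetched, so the loop addresses must also be resolvable in compressed space.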
To produce a correct address translation during decompression, branch instructions are 
processed by the compression tool as follows. 
1) On detection of a branch, its target address (in uncompressed space) is calculated.
2) The branch is compressed, its codeword is written in the compressed executable
and the 16 bits that immediately follow are bookmarked and left empty for a
branch appendix to be written.
3) The location of the branch target in the compressed space is available once the
entire executable has been compressed. The branch appendix can now be
calculated by subtracting the compressed target address from the original
(uncompressed) target address (see Figure 4.6), and written back in its
bookmarked space.
As illustrated in Figure 4.6, the first 11 bits of the appendix represent the memory offset 
(difference) between compressed and uncompressed target addresses. Since the compressed 
space is not word-aligned, 5 bits of the appendix are used to indicate the offset of the 
compressed instruction in the 32-bit memory word. At run time, the inverse process takes 
place: the decompressor calculates the branch target in compressed space by simply adding the 
appendix to the target (uncompressed) address supplied by the microprocessor. 
Figure 4.6. Calculating the branch appendixes
[Diagram: the 22-bit original target address and the compressed target address (22 bits plus a
5-bit bit position) are combined into the 16-bit branch appendix, made up of an 11-bit offset
difference and a 5-bit bit position.]
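A minimal sketch of the appendix calculation and of its use at run time is given below. Several details are assumptions of the example: uncompressed targets are expressed as word addresses, compressed locations as bit addresses from the start of the compressed image, and the offset difference is taken to be non-negative and to fit in 11 bits. The sign convention is also an assumption: the appendix stores the original address minus the compressed one, so this sketch recovers the compressed address by subtraction, whereas storing the difference with the opposite sign would turn the run-time step into the addition described in the text.

```python
OFFSET_BITS, BITPOS_BITS = 11, 5

def make_appendix(orig_word_addr, comp_bit_addr):
    """Pack (word-offset difference, bit position) into a 16-bit appendix."""
    comp_word, bitpos = divmod(comp_bit_addr, 32)
    diff = orig_word_addr - comp_word
    if not 0 <= diff < (1 << OFFSET_BITS):
        raise ValueError("offset difference does not fit in 11 bits")
    return (diff << BITPOS_BITS) | bitpos

def resolve_target(orig_word_addr, appendix):
    """Run-time step: recover (compressed word address, bit position)."""
    diff = appendix >> BITPOS_BITS
    bitpos = appendix & ((1 << BITPOS_BITS) - 1)
    return orig_word_addr - diff, bitpos

# Hypothetical target: word 0x400 originally, word 0x250 bit 13 once compressed.
appendix = make_appendix(0x400, 0x250 * 32 + 13)
print(hex(appendix), resolve_target(0x400, appendix))
```

The round trip shows why no look-up table is needed for relative COF instructions: everything required to locate the target travels with the branch itself.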
Adding an appendix after each branch instruction will usually marginally degrade the 
compression ratio (depending on the application), yet it provides a simple and efficient 
mechanism for handling relative COF instructions, namely branches, loops and branch-and-
links. It is important to notice that these are the vast majority of COF instructions found in 
typical embedded applications. 
Jumps - absolute changes of flows 
Jump instructions are COF instructions that take a memory address as an argument and, upon 
execution, modify the PC to this new address. There is a small number of different jump types
that can be classified, as follows, according to the format of the address found in the 
instruction: 
• direct jumps, where the target address is given as an immediate value; 
• indirect jumps, where the target is calculated at run time and is held in a register; 
• jump to a link register value that holds the return address of a subroutine.
How these different types of jump can be dealt with in the compression architecture 
developed in the current work is described below. 
Direct Jumps 
Direct jumps are normally the least frequently occurring of the jumps being considered, 
although they are the easiest to resolve. The target address is either part of the jump 
instruction word (as a short 16-bit immediate value, which is expanded by the processor to the 
required 24-bit address form) or is stored in the 32-bit memory word that immediately follows 
the instruction. During compression, the target addresses are stored in a table together with 
the corresponding addresses in compressed format. Later, this table (which also holds the
targets of indirect jumps) will be required to form part of the decompressor's branch
management.
Indirect jumps 
An indirect jump (also known as a computed jump) does not specify the address of the 
next instruction to execute, but rather the argument (register) where the address is held. The 
actual target address of the jump is generally not known at assembly or compile time, as it is 
only computed when the instruction is executed. Indirect jumps are mainly used to make 
conditional, multi-way jumps, and might be generated by a compiler in the following cases [1]:
• switch statements, which select one of several alternatives;
• function pointers that hold the address of a function to be called;
• virtual functions in object oriented languages such as C++ and Java;
• dynamically shared libraries, which allow a library to be loaded at run time and its
functions called. 
In embedded systems the last case is rarely encountered, while the virtual function 
addresses are located in a virtual function address table (vtable) by the compiler and thus can 
be easily resolved. Therefore, this research focuses on resolving the mapping between 
compressed and uncompressed spaces for switch statements and function pointers. For both 
of these cases, compilers normally generate a jump table that holds a list of addresses of a set 
of routines that can be selected by index. For the specific purposes of this research, compiler 
engineers at ARC International generated a special code section in the executable that provides 
the address of the jump table and indicates the number of entries it contains. With this 
information, a look-up table is built that holds the target addresses of the indirect jumps and 
their corresponding addresses in the compressed instruction memory. In those cases where 
the compiler is unable to generate the required jump table information, it is possible to 
exhaustively determine all the targets of indirect jumps by simulating the application. 
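The construction of the jump look-up table during compression can be sketched as a single pass over the encoded program. The sketch assumes 4-byte instructions stored contiguously from a base address and a known encoded length for each instruction; the function name, the base address and the example lengths are hypothetical, and long immediate words are ignored for simplicity.

```python
def build_jump_table(encoded_bits, targets, base=0x1000):
    """Map each uncompressed jump target to its location in compressed space.

    encoded_bits[i] is the encoded length in bits of the i-th instruction.
    Returns {uncompressed address: (compressed word index, bit position)},
    with the word index counted from the start of the compressed image.
    """
    wanted = set(targets)
    table = {}
    bit_cursor = 0
    for i, nbits in enumerate(encoded_bits):
        addr = base + 4 * i
        if addr in wanted:
            table[addr] = divmod(bit_cursor, 32)
        bit_cursor += nbits
    return table

# Five instructions: two class 1 codewords, one class 2, two uncompressed.
lengths = [8, 32, 8, 12, 32]
print(build_jump_table(lengths, targets=[0x1008, 0x1010]))
```

Only addresses that are actual jump targets receive an entry, which is what keeps this table small in comparison with a full address translation table.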
Jumps to Blink Register
In the ARCTangent-A4 processor, there are two special-purpose registers dedicated to holding
the subroutine return addresses, namely BLINK and ILINK. In order to change the control
flow to the beginning of a subroutine, the two available instructions are conditional
branch-and-link (BLcc) and conditional jump-and-link (JLcc). The final instruction of a subroutine is
usually Jcc %blink (jump-to-blink), whose execution changes the control flow to the address
stored in the blink register. As the decompressor must be able to track the return addresses in
the compressed space, the solution proposed uses a small, configurable stack, managed by the
decompressor at run-time. This stack stores the return address (i.e. PC+4) of a taken branch
of the type BLcc or JLcc. Its size is determined off-line by simulation and the number of entries
corresponds to the maximum nesting of subroutines in the application. 
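As a rough illustration of this mechanism, the following Python sketch models a fixed-depth return stack and derives its depth from a simulated call trace. The class and function names, the trace format ('call' and 'return' events) and the overflow behaviour are assumptions made for the example; the hardware implementation is the one described in Chapter 5.

```python
class ReturnStack:
    """Fixed-depth stack of return addresses expressed in compressed space."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = []

    def push(self, return_addr):
        # Called when a BLcc or JLcc branch is taken.
        if len(self.entries) == self.depth:
            raise OverflowError("call nesting exceeds the configured depth")
        self.entries.append(return_addr)

    def pop(self):
        # Called when a jump to blink is executed.
        return self.entries.pop()


def max_nesting(trace):
    """Deepest call nesting in a simulated trace of 'call'/'return' events."""
    depth = deepest = 0
    for event in trace:
        if event == "call":
            depth += 1
            deepest = max(deepest, depth)
        elif event == "return":
            depth -= 1
    return deepest


trace = ["call", "call", "return", "call", "call", "return", "return", "return"]
stack = ReturnStack(max_nesting(trace))
```

Sizing the stack from the trace means it never overflows for the simulated workload, at the cost of having to re-run the analysis when the application changes.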
Using the above approaches, the mapping from uncompressed to compressed space can 
be achieved without the need for large look-up tables; details of the hardware implementation 
can be found in Chapter 5. 
4.3. Design flow and development tools 
Compiler-based and hybrid architecture techniques for reducing code size require significant 
modifications to the standard development tools and often to the whole development process. 
Code compression solutions, including the one proposed by this research, introduce only
minimal changes to the whole system, since not only the original hardware architecture 
(processor and memory system) is unchanged, but also the standard development tools 
(compiler and linker) remain unaltered. The proposed code compression scheme is thus, 
effectively, a code-size reducing and performance-improving plug-in module, which might be 
easily configured and used where a need for additional performance or extra memory has been 
identified.
4.3.1. Design flow
Embedded systems design is a complex paradigm of hardware and software development 
and their integration. With the ever-increasing complexity of embedded programs and pressing 
time-to-market constraints, software development very often starts far before there is any 
reliable target hardware in place for its verification. Thus, the quality of the development tools, 
their effectiveness and ease-of-use, together with the completeness of the design flow, play a 
key role in the successful completion of a project on time and within budget.
A hardware/software co-design process that includes the integration of a soft-IP 
processor would have the following software-development stages. 
• Software application writing (typically in C and, perhaps, a small proportion of 
assembly language) 
• Compilation and linkage of the software for the target hardware platform, often
resulting, for embedded RISC architectures, in a single executable ELF
(executable and linkable format) file.
• Testing, debugging and final implementation of the software run from
non-volatile memory (internal or external to the embedded system).
In parallel, the following hardware development stages would take place. 
• Customisation and simulation of the soft-IP processor core (in the case of this
research, a VHDL model of ARCTangent-A4). 
• Integration with other hardware modules required by the embedded solution. 
• Synthesis and place-and-route of the hardware system for a particular technology 
and vendor. 
Integrating the developed compression scheme in the design flow highlighted above is
relatively straightforward. Figure 4.7 shows the hardware/software co-design process
including the additional compression stage. In the software process, the only extra step
required is to run the linked executable file through the compressor tool in order to generate
all the outputs (compressed executable, dictionary and jump tables) necessary for
decompression and correct execution of the code in real-time. Based on these outputs, the 
hardware decompressor can be configured and synthesised as a part of the processor's 
architecture. The software-hardware verification can be performed, for instance, using an 
FPGA development board.
[Figure 4.7 diagram. Software development: source code (C, C++, assembly), compiler, linker,
executable (ELF), compressor, compressed executable (ELF). Hardware development: ARCTangent-A4
core, decompressor, I-cache and D-cache, synthesis tool, place and route, downloadable bitstream.
Target: ARCAngel development board (SRAM, FPGA).]
Figure 4.7 Schematic of the hardware/software co-design process including 
compression 
4.3.2. Compressor tool description 
The analyses that need to be carried out by the compressor in the development path in order 
to enable the run-time decompression are shown in Figure 4.8. 
[Figure 4.8 diagram. Actions taken during the different parses: gather statistics, sort unique
instruction entries by frequency distribution, generate dictionary, determine class structure;
compress code segment; backpatch branch instructions in the code. Outputs produced by the
compressor tool: configuration file, address look-up table, compressed executable.]
Figure 4.8. Compression stages 
First, the code segment of the executable file is parsed in order to gather the necessary
statistics regarding the frequency distribution of the instructions. Each unique instruction is
identified for sorting according to its static frequency distribution, while the size of the static
dictionary, and the length of the codewords for each class are determined by exhaustive
analysis. Next, the static dictionary is generated; its entries containing those uncompressed
instructions assigned a codeword. Two outputs are produced at this stage: the static dictionary
and a configuration file, which includes the optimum codeword lengths. The codewords are
encoded using two classes, with class 1 codewords being assigned the most frequently
occurring instructions and encoded using a smaller number of bits than those in class 2. Note
that the dictionary generated by the compressor is termed static because its entries contain the 
most frequently occurring instructions found in the executable file. However, as Chapter 6 will
show, the final dictionary used to compress the executable file may also contain a selection of
the most frequently executed instructions. 
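The first parse can be sketched in Python as follows. This is a simplified illustration: the index widths of the two classes are passed in as fixed parameters, whereas the compressor described here determines the dictionary size and codeword lengths by exhaustive analysis. The function name and the example instruction words are hypothetical.

```python
from collections import Counter


def build_dictionary(instructions, class1_bits, class2_bits):
    """Rank unique instructions by static frequency and split them into classes.

    class1_bits / class2_bits: assumed index widths, so class 1 can hold
    2**class1_bits entries and class 2 can hold 2**class2_bits entries.
    """
    counts = Counter(instructions)
    # most_common() orders by descending count; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    n1 = 1 << class1_bits
    n2 = 1 << class2_bits
    class1 = ranked[:n1]           # most frequent, shortest codewords
    class2 = ranked[n1:n1 + n2]    # next most frequent, longer codewords
    return class1, class2


# Illustrative code segment: 0xA occurs five times, 0xB three times, and so on.
code = [0xA] * 5 + [0xB] * 3 + [0xC] * 2 + [0xD]
class1, class2 = build_dictionary(code, class1_bits=1, class2_bits=1)
```

Instructions that fall outside both classes are simply left uncompressed in this sketch.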
In the second parse, the code segment is compressed. This again involves reading each 
instruction, searching for an identical entry in the dictionary and, if it is found, replacing it in 
the compressed executable with its corresponding codeword. At this stage, the instructions are 
partially decoded, so when a branch or branch-and-link instruction is detected, its offset in the
compressed executable is stored and an empty 16-bit entry appended for completion in the
final parse. An uncompressed-compressed address mapping table is generated, which holds
each instruction's original address and its corresponding address in the compressed file. This
table, together with the jump table information required to address indirect-jump cases, is used 
to generate the jump-table look-up table. 
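A minimal Python sketch of the bookkeeping done in the second parse is given below. It only tracks bit positions, not the actual bit stream. The width of an uncompressed instruction in the compressed file (`raw_bits`), the `is_branch` predicate and the choice of recording the position of the empty appendix (rather than the offset of the branch itself) are assumptions made for the example.

```python
def second_parse(code, base_addr, codeword_bits, raw_bits, is_branch):
    """Build the address mapping table and bookmark the branch appendices.

    code:          list of 32-bit instruction words in program order.
    codeword_bits: dict mapping a dictionary instruction to its codeword length.
    raw_bits:      assumed size of an instruction that is left uncompressed.
    is_branch:     hypothetical predicate standing in for the partial decode.
    """
    addr_map = {}      # original address -> bit position in compressed code
    bookmarks = []     # (branch address, bit position of its empty appendix)
    bit_pos = 0
    for i, word in enumerate(code):
        addr = base_addr + 4 * i
        addr_map[addr] = bit_pos
        bit_pos += codeword_bits.get(word, raw_bits)
        if is_branch(word):
            bookmarks.append((addr, bit_pos))
            bit_pos += 16          # empty entry, filled in by the final parse
    return addr_map, bookmarks, bit_pos


addr_map, bookmarks, total_bits = second_parse(
    code=[0x11, 0x22, 0x11, 0x33],
    base_addr=0x100,
    codeword_bits={0x11: 8},       # only 0x11 is in the dictionary
    raw_bits=32,                   # placeholder value
    is_branch=lambda word: word == 0x22,
)
```

The bookmarks give the final parse the exact locations to backpatch once all compressed addresses are known.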
In the third and final parse, the compressor calculates the branch appendices as described 
in section 4.2.3 and writes them at their defined (bookmarked) locations in the compressed 
executable. Finally, the ELF headers are adjusted in accordance with the new sizes and offsets 
of the sections and segments of the executable. 
4.4. Conclusions 
This chapter has presented a novel compression algorithm targeted at high-performance 
real-time embedded systems that would benefit from a significant reduction in the size of their 
executable code. In contrast with current state-of-the-art compression algorithms discussed in 
Chapter 3, the new algorithm allows operation from limited context, avoids the use of large 
buffers to hold decompressed instructions and removes any dependency on previously 
decompressed instructions. Non-sequential execution mechanisms have also been designed in 
order to permit compression when there are changes of execution flow. The hardware blocks 
developed to support these mechanisms, the rest of the components of the decompressor and 
its operation at run-time will be described in the following chapter. 
Chapter 5 
DECOMPRESSOR FUNCTIONALITY, 
ARCHITECTURE AND IMPLEMENTATION 
Chapter 1 put this research into context, introducing the widely acknowledged problem of
poor code density inherent to the RISC architecture, while Chapter 2 presented an overview of
the different approaches that have been proposed to address this issue. Chapter 3 focused on
the entropy analysis of embedded RISC code in order to quantify the amount of redundancy
present in an executable file. Based on these results and the analysis of the compression
algorithms described in the previous chapter, a suitable code compression algorithm was 
developed in this research (as presented in Chapter 4). 
The main objective of this chapter is to describe the functionality and architecture of the 
decompression hardware designed to support real-time high-performance embedded 
applications. The chapter also presents, for a number of FPGA devices, the hardware 
implementation results in terms of physical area usage and speed. 
5.1. Overview 
As discussed in Chapter 4, in order to (over)compensate for the decompression overheads
(delays), the decompression module is located between the ICache and the core of the
processor. This design increases the effective size of the ICache, since its lines will contain
compressed instructions, and so will improve the cache hit ratio. This section provides a brief
overview of the operation of the ARCTangent-A4 processor's pipeline (used in this study for
proof-of-concept) and the integration of the decompressor within its first stage.
5.1.1. ARCTangent-A4 architecture
ARCTangent-A4 is a 32-bit RISC processor with a four-stage pipeline implementation (see
Figure 5.1). In the first stage (Instruction fetch or IFetch), the requested instruction is fetched
from the ICache. The request is made by placing the instruction's address on the address bus
and setting the request signal, ifetch, high. In the event of a cache hit, the instruction is placed
on the instruction bus and the ivalid signal validates it. Otherwise, on a cache miss, the ICache 
sends a fetch request to the instruction memory and the processor is stalled until the cache is 
refilled. 
[Figure 5.1 diagram: ICache, Program Counter, Core Registers and ALU arranged across the
IFetch, Decode, Execute and WriteBack stages.]
Figure 5.1. ARCTangent-A4 pipeline stages 
In the second stage (Decode), the instruction is decoded and any operands required are
fetched from the register file. Where the instruction is a branch or jump, its condition codes
are evaluated. If the condition is true, the jump or branch is taken, the control flow is changed
and the PC is updated with the value of the operand supplied. Otherwise, the PC is
incremented normally (PC=PC+4). In the third stage (Execute), the appropriate arithmetic or
logic operations are carried out in the arithmetic logic unit (ALU). Finally, at the Write Back
stage, the output of the ALU or the result of a load instruction is written to the register file.
Load and store instructions use the ALU in stage three to calculate the data transfer address 
and make the access request to the memory management unit. 
Although the basic operation of the ARCTangent-A4 is very similar to other RISC 
architectures, its configurability, and the fact that it is a commercial processor, supplied as a 
synthesisable (VHDL) soft IP core, make it an excellent platform for experimentation. With 
the tools ARC provides to accompany the processor, design engineers can customise its 
architecture by adding or removing specified hardware blocks. Following the philosophy of 
this modular approach, the decompression hardware developed by the author is designed to 
operate as a plug-in module that can be easily configured and incorporated as a part of the
processor system. 
5.1.2. Integration of the decompressor
The interface between the processor core and the instruction cache supports a simple, 
synchronous handshake protocol, described in the previous subsection (stage one). The 
interface signals for the two modules are summarised in Table 5.1. 
Table 5.1. Processor interface signals 
Signal name   Input/Output   Description
pc_next       out [23..0]    Address of the instruction to be fetched.
ifetch        out            Instruction fetch signal. Indicates to the cache controller that a
                             new instruction is required. The instruction will be fetched from
                             the memory at address pc_next.
ic_busy       in             Instruction cache busy. Set active during tag clearing and line
                             loads from memory.
p1iw          in [31..0]     The 32-bit instruction fetched from the ICache.
ivalid        in             Qualifying signal for p1iw[31:0]. A low value would indicate that
                             the requested instruction is not yet available, and therefore the
                             programme counter should not be incremented.
In order to achieve a seamless integration of the decompressor in the processor's pipeline 
(one that avoids any changes in the original interface of the processor), the decompressor 
mirrors the signals presented in Table 5.1. The incorporation of the decompressor into the
processor system is shown in Figure 5.2. 
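[Figure 5.2 diagram: Processor, Decompressor and ICache blocks; the pc_next and ifetch signals
appear on both the processor side and the ICache side of the decompressor.]
Figure 5.2. Decompressor interface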
This scheme implies that the decompressor should directly receive the instruction request 
from the core, translate the instruction's address (expressed in uncompressed space) to the
corresponding one in the compressed space, and raise a request for the compressed 
instruction. Once the (compressed) instruction is available, it is decompressed and sent to the 
processor's core. 
The decompressor itself has a two-stage pipelined architecture (described in detail in the
next section), and therefore after being integrated into the processor's architecture, it extends
the processor's pipeline by two stages (see Figure 5.3). During the decompressor's first stage
(Decompressor Instruction Fetch or Dec IFetch), the compressed 32-bit instruction word is
fetched from the ICache and stored in a 64-bit input buffer. Due to its size, the buffer can
hold a number of compressed instructions and is refilled independently of the processor's
requests. In the second stage (Decompressor Decode or Dec Decode), the instruction is
decompressed and sent to the processor. During normal, sequential execution no delays are
incurred by the extension of the processor's pipeline. However, at start-up and when a COF
occurs, filling the decompressor's pipeline will take four additional cycles, which can be
compensated for by the decompressor's Branch Management Unit (BMU). The BMU's functional
specification and architecture, together with those of the rest of the decompressor's
components, are described in the following section.
[Figure 5.3 diagram. Pipeline stages: Dec IFetch, Dec Decode, µP IFetch, µP Decode, Execute,
Write Back. Blocks shown: ICache, Address Translation Unit, Decoder, Dictionary, Program
Counter, Core Registers, ALU.]
Figure 5.3. Modification of the pipeline of ARCTangent-A4 for decompression 
5.2. Functionality and architecture of the decompressor 
The role of the decompressor is to make the decompression process completely transparent to 
the processor, which will run unaware of the fact that the executable code is compressed. One
of the main tasks of the decompression unit is, therefore, the handling of misaligned
(compressed) memory. The instruction memory, which normally would hold 32-bit instruction 
words with 4-byte alignment, will, when compressed, contain instructions which start at any 
bit position within that memory word and may spread across the boundary between two 
words. In this way, the memory word, which for 32-bit memories is identical to the instruction 
word, in compressed space might hold a number of instructions, or, alternatively, only parts of 
two instructions. The same holds true for the instruction cache, where instructions may reside 
in two consecutive words or be spread across the boundary between adjacent cache lines, the 
latter requiring two cache line fills to fetch a single, compressed instruction. 
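The arithmetic behind this misalignment can be illustrated with a short Python sketch. Given the bit address of a compressed instruction and its length, it returns the byte address of the first 4-byte-aligned memory word to fetch, the bit offset inside that word and the number of words the instruction touches. The function name and the example values are illustrative only.

```python
WORD_BITS = 32


def locate(bit_addr, length_bits):
    """Find the aligned memory word(s) that hold a bit-addressed instruction."""
    first_word = bit_addr // WORD_BITS
    last_word = (bit_addr + length_bits - 1) // WORD_BITS
    byte_addr = first_word * 4            # 4-byte aligned fetch address
    bit_offset = bit_addr % WORD_BITS     # start position inside that word
    return byte_addr, bit_offset, last_word - first_word + 1


inside = locate(bit_addr=40, length_bits=8)      # fits in one word
straddle = locate(bit_addr=56, length_bits=16)   # crosses a word boundary
```

An instruction that straddles two words needs both of them in the input buffer before it can be decoded, which is the case the 64-bit buffer described later is sized for.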
The decompressor (shown in Figure 5.4) contains two parts, namely the Decoding Unit
(DU) and the Address Translation Unit (ATU). The DU carries out the buffering and
decompression of compressed instructions and generates the necessary control signals for the
interface with the processor and the ICache. The ATU performs the mapping between compressed
and uncompressed memory spaces and provides bit-addressed memory support. It contains the
following modules: Branch Target Cache (BTC), Jump Table, Subroutine Return Stack (SRS)
and Compressed PC (CPC) extraction logic. The ATU performs all the address conversion
operations, supplies the compressed address to the ICache when an instruction request is made
(ensuring that the memory is accessed following the original 4-byte address alignment) and
handles the bit-addressing operations internally. Both units (ATU and DU) are synchronised
with the processor's operations in order to ensure appropriate execution in the event of a
COF or a stall in the processor's core or ICache.
The remainder of this section provides a detailed description of the functionality and 
hardware architecture of the main components of the decompressor. 
[Figure 5.4 diagram. ICache-side signals: che_p1iw, che_ivalid, che_ic_busy, dec_ifetch,
dec_pc_next. Decoding Unit: Input Buffer, Decoder, Decode Control Unit, Dictionary. Address
Translation Unit (ATU): ATU Control Unit, Jump Table, CPC Logic, BTC, SRS, output address
select. Processor-side signals: dec_p1iw, dec_ivalid, ifetch, br_taken, pc_next.]
Figure 5.4. Architectural overview of the decompressor 
5.2.1. Decoding unit (DU) 
A functional description of the DU is provided first and then the architecture and design are 
given. 
Functional description 
The main tasks performed by the DU are:
• buffering the compressed instructions fetched from the ICache;
• bit-address alignment management of the buffer;
• instruction decompression and dispatching to the processor's core.
The functional operation of the DU is depicted in Figure 5.5. If the generated dictionary (see 
Section 4.2.2) is not implemented as a ROM component in the design, it is loaded into the 
DU's SRAM. Once the dictionary is in place, the DU waits for a request signal from the
microprocessor to start its normal execution cycle. During normal execution, the DU 
monitors the status of the input buffer, which might be either 'empty' or 'full'. In this 
particular context, these words mean the following: 'empty' signifies that the input buffer still 
does not hold 32 bits of valid data, while 'full' means the contrary. As soon as the input buffer
becomes 'empty', the ATU places the correct address on the address bus, and sends a fetch
request (dec_ifetch) to the ICache. When the ICache is ready to supply the requested memory
word (which may contain more than one compressed instruction), it enables the che_ivalid
signal and the decompressor stores the word into its input buffer. Once the buffer is 'full', the
decoder starts processing the first 32 valid bits available, and, depending on whether the
instruction is compressed, either the relevant uncompressed instruction is retrieved from the
dictionary or the instruction is sent directly to the processor's core.
[Figure 5.5 flowchart: the only legible labels are the Y/N decision branches and 'Signal address
translation unit'.]
Figure 5.5. Functional diagram of the decompressor's DU 
Architecture and design 
The decoding unit consists of four main blocks, namely input buffer, decoder, dictionary and
control unit. A schematic diagram of these blocks is given in Figure 5.6. 
[Figure 5.6 diagram. Blocks shown: instruction cache (che_ivalid, che_p1iw); input buffer
(address logic, 64-bit register, multiplexers); decoder; dictionary; 32-bit register; 2x1
multiplexer driving p1iw to the ARCTangent-A4 processor; Decode Control Unit (dec_ifetch,
dec_ivalid, ifetch).]
Figure 5.6 Basic diagram of the decoding unit
Input buffer
The input buffer is a 64-bit register, tailored to operate as a circular buffer, which stores the 
32-bit memory words fetched from the ICache. Note that these memory words may contain 
between zero and four compressed instructions. The contents of the register are updated by 
alternating writes to the upper half of the register (from bit 32 to bit 63) or to the lower one 
(from bit 0 to bit 31), when 32 or fewer bits are valid. 
The address logic of the input buffer (see Figure 5.6) keeps track of the number of valid 
bits that remain as well as a start pointer to the location of the first one of those bits. This
pointer is used by a small select logic unit to calculate the select signals for an array of 32 64-
to-1 multiplexers, the outputs of which form the input to the decoder. 
A second pointer (end pointer) designates the end of the 32-bit word, which will be sent
to the decoder. The values of both the start and end pointers are updated after the decoder
has determined the class and the length of the first codeword (or uncompressed instruction)
contained in the 32 bits provided. Figure 5.7 shows an example of the operation of the input
buffer. In the first cycle, there are 40 valid bits and the start pointer is set to bit 39. Thirty-two
bits (delimited by the start pointer and the end pointer) are fetched to the decoder, which
analyses the class information and updates the pointers with the length of the codeword A,
which, in this example, is 8 bits in length. In cycle 2, the next 32 bits (from 31 to 0) are output
to the decoder and codeword B is extracted. The start pointer is updated to point to bit 23
and, as there is an insufficient number of valid bits available in the buffer (the end pointer falls
into a region of non-valid data), a request for a refill is sent to the ICache. While the
subsequent 32 bits are being decoded by the decoder, the upper half of the register is filled
with valid data, as shown in cycle 3. As the word is longer than the number of valid bits
currently being held in the register, the output word from the decoder is invalidated and the
pointer positions are not updated. The processing can continue only after the register has been
loaded with the remainder of the instruction. However, when the length of the codeword is
smaller than the number of valid bits currently available in the buffer (as is the case with
codeword D), the pointers are updated and the word is sent to the dictionary.
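The behaviour of the input buffer can be approximated in software by a 64-bit bit queue. The Python sketch below is a behavioural model only: it replaces the circular register, the start and end pointers and the multiplexer array with integer shifts, and the class and method names are hypothetical.

```python
class InputBuffer:
    """Behavioural model of a 64-bit buffer refilled 32 bits at a time."""

    def __init__(self):
        self.bits = 0    # valid bits; the oldest bit is the most significant
        self.valid = 0   # number of valid bits currently held

    def can_refill(self):
        # A 32-bit half is free when 32 or fewer bits are valid.
        return self.valid <= 32

    def refill(self, word):
        if not self.can_refill():
            raise RuntimeError("no free half in the buffer")
        self.bits = (self.bits << 32) | (word & 0xFFFFFFFF)
        self.valid += 32

    def peek32(self):
        """Next 32 bits for the decoder, left-aligned; missing bits read as 0."""
        if self.valid >= 32:
            return (self.bits >> (self.valid - 32)) & 0xFFFFFFFF
        return (self.bits << (32 - self.valid)) & 0xFFFFFFFF

    def consume(self, length):
        """Advance past one instruction; False means the output is invalidated."""
        if length > self.valid:
            return False             # wait for the rest of the instruction
        self.valid -= length
        self.bits &= (1 << self.valid) - 1
        return True


buf = InputBuffer()
buf.refill(0xAABBCCDD)
buf.refill(0x11223344)
first = buf.peek32()          # 0xAABBCCDD
buf.consume(8)                # an 8-bit codeword has been decoded
second = buf.peek32()         # window now starts 8 bits later
```

Returning False from `consume` mirrors the case in which the decoder output is invalidated and the pointers are left unchanged until the refill arrives.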
[Figure 5.7 diagram: contents of the 64-bit register in cycles 1, 2 and 3, with the positions of
the start pointer and end pointer marked in each cycle.]
Figure 5.7 Operation of the input buffer
It is important to note that the 64-bit register is shadowed, that is, its previous value is
stored for one cycle, so that its contents can be recovered when a branch instruction is
detected at the next stage of the pipeline. This allows the correct retrieval of the 16-bit branch
appendices, which are otherwise treated as normal instructions.
Decoder 
After receiving 32 bits from the input buffer, the decoder takes the first instruction stored in 
these 32 bits (instruction A in cycle 1 of Figure 5.7). The MSB is analysed to determine
whether the instruction is compressed. When an uncompressed instruction is detected it is 
stored in a 32-bit pipeline register, to be accessed in the next cycle by the processor. When the 
decoder detects a codeword (indicating a compressed instruction), its class offset is added to
the index field (see Figure 4.1) in order to obtain the address of the corresponding
uncompressed instruction in the dictionary. Finally, the decoder transmits the length of the
codeword (obtained from the class information) both to the input buffer, which uses this
information to move the pointer to the beginning of the next instruction stored in the buffer
(see Figure 5.8), and to the ATU, to allow it to calculate the bit-address of the next instruction
in the ICache.
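The dictionary look-up step can be sketched in Python as shown below. The class offsets and codeword lengths in `CLASSES` are placeholder values chosen for the example (the real values come from the compressor's configuration file), and the extraction of the class and index fields from the codeword is deliberately left out because it depends on the codeword format of Figure 4.1.

```python
# Hypothetical class structure: class 1 occupies the first dictionary entries
# and uses shorter codewords than class 2.
CLASSES = {
    1: {"offset": 0, "length": 8},
    2: {"offset": 16, "length": 12},
}


def resolve_codeword(dictionary, cls, index):
    """Return the uncompressed instruction and the codeword length in bits."""
    info = CLASSES[cls]
    address = info["offset"] + index      # class offset + index field
    return dictionary[address], info["length"]


# Illustrative dictionary of 48 made-up instruction words.
dictionary = [0x1000 + i for i in range(48)]
instr_a, len_a = resolve_codeword(dictionary, cls=1, index=3)
instr_b, len_b = resolve_codeword(dictionary, cls=2, index=5)
```

The returned length is the value that would be passed on to the input buffer and to the ATU.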
[Figure 5.8 diagram: a 32-bit word, bits 31 to 0, holding codewords A, B and C.]
Figure 5.8 The instruction word in the decoder 
Dictionary 
The dictionary module holds the uncompressed instructions that have been replaced by 
codewords in the compressed programme memory. The dictionary can be implemented either 
in non-volatile memory or as on-chip RAM (typically SRAM on an FPGA device). Due to its 
significant effect on compression ratios, performance and memory requirements, the size of 
the dictionary is configurable to adapt to the requirements of each application. 
Decode control unit (DCU)
The DCU is responsible for synchronising the interaction between the decoding unit, the 
ICache and the processor's core. At the heart of the DCU is a finite state machine (FSM), 
whose structure is presented in Figure 5.9. 
At start-up (or after a soft reset) the DCU is in RESET state. When a request signal from
the processor arrives (ifetch), it moves to COF state. There the input buffer is flushed and a
request for refill (dec_ifetch) is sent to the ICache. In case of a cache miss, the state becomes
STALL and when stage one of the decompressor is valid (the input buffer is 'full'), the
transition is made to the CONTINUOUS_PROCESSING state. In this state and as long as
the input buffer is 'empty', instructions are continuously decompressed as they are fetched
from the cache. This occurs until either the processor or the ICache stall (the latter due to an
ICache miss), in which case, the DCU re-enters the STALL state. When both the processor
and the ICache are again available, the DCU returns to the CONTINUOUS_PROCESSING
state. When a change of execution flow occurs (indicated by br_taken=1), the DCU returns to the
COF state.
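The transitions listed above can be written down as a small Python transition function. This is a sketch under stated assumptions: the input names are simplified (`stalled` stands for a stall of either the processor or the ICache), and the priority between inputs that are asserted together, as well as the behaviour of STALL while the buffer is not yet 'full', are choices made for the example rather than details taken from Figure 5.9.

```python
from enum import Enum, auto


class State(Enum):
    RESET = auto()
    COF = auto()
    STALL = auto()
    CONTINUOUS_PROCESSING = auto()


def next_state(state, *, reset=False, ifetch=False, stalled=False,
               buffer_full=False, br_taken=False):
    """One step of the decode control unit's state machine (assumed priorities)."""
    if reset:
        return State.RESET
    if state is State.RESET:
        return State.COF if ifetch else State.RESET
    if state is State.COF:
        if stalled:                       # e.g. ICache miss on the refill
            return State.STALL
        return State.CONTINUOUS_PROCESSING if buffer_full else State.COF
    if state is State.STALL:
        if stalled or not buffer_full:
            return State.STALL
        return State.CONTINUOUS_PROCESSING
    # CONTINUOUS_PROCESSING
    if br_taken:                          # change of execution flow
        return State.COF
    return State.STALL if stalled else State.CONTINUOUS_PROCESSING


s = State.RESET
s = next_state(s, ifetch=True)            # RESET -> COF
s = next_state(s, stalled=True)           # COF -> STALL
s = next_state(s, buffer_full=True)       # STALL -> CONTINUOUS_PROCESSING
```

Keeping the state machine as a pure function of the current state and inputs makes each transition easy to check against the state diagram.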
[Figure 5.9 state diagram: legible transition labels are RESET=1, IFETCH=1 and IFETCH=0.]
Figure 5.9. Decode control unit state diagram 
5.2.2. Address Translation Unit (ATU)
In order to avoid the need for large look-up tables to map the address of each instruction
from uncompressed to compressed memory space, a more elegant and area-efficient solution
based on resolving each type of COF target independently has been developed in this work.
A functional and architectural description of the hardware required to support these
mechanisms is presented in this section.
Functional description 
Allowing misalignment of instructions within words for the purpose of compression, while 
retaining word aligned accesses to the instruction memory and ICache, requires that the entire 
address handling and mapping between compressed and uncompressed memory space is 
performed by the decompressor. The implementation of such a demanding task requires a 
number of provisions to be made in hardware, which include: 
• extraction of the bit-aligned compressed programme counter (CPC) address; 
• calculation of the compressed memory branch target addresses; 
• translation of the addresses of indirect jump targets from uncompressed to 
compressed space; 
• keeping track of subroutine return addresses in compressed space.
These functions are performed by the ATU of the decompressor.
The ATU has three modes of operation: start-up, sequential and non-sequential
execution. The start-up mode, as the name suggests, takes place during system initialisation.
The sequential execution mode of operation corresponds to the normal system operation,
where instructions are fetched from consecutive memory locations and executed in an orderly
fashion. When a COF instruction has been executed and the address supplied from the
processor is not consecutive, that is a jump takes place, the ATU switches to non-sequential
execution mode.
Start-up behaviour 
Figure 5.10 shows the start-up sequence of the address translation unit. 
[Figure 5.10 flowchart: initialise the jump table; look up in the jump table the address supplied
by the processor; output compressed space address.]
Figure 5.10 Start up behaviour of the address translation unit 
First, the jump table, which is generated by the compressor (see Chapter 4) and holds the 
target addresses of jump instructions, is loaded into the jump table memory (see Figure 5.4). 
The decompressor then waits for a request signal from the processor, takes the instruction's
uncompressed address, searches the jump table and outputs the corresponding compressed 
address. Once the instruction has been fetched, the decompressor enters into sequential 
execution mode. 
Sequential execution mode 
The decoder located in the ATU is responsible for detecting instructions capable of changing
the control flow, prior to being sent to the processor. Whenever such an instruction is found,
the ATU's processing mode changes in accordance with the type of the instruction. If the
instruction decoded is not COF, the ATU operates in sequential execution mode (see Figure
5.11), during which the ATU calculates and stores the 27-bit address (the bit-aligned address
of the instruction in compressed memory) of the instruction being currently decompressed.
When the input buffer is 'empty', the ATU updates the decompressor's PC (decPC=decPC+4)
and raises a request for a memory word fetch (dec_ifetch) to the ICache.
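The two counters maintained in sequential mode can be modelled with a few lines of Python. The sketch assumes that the 27-bit CPC is a bit address and that decPC is the byte address of the memory word most recently requested; the class and method names are hypothetical.

```python
CPC_MASK = (1 << 27) - 1    # the compressed PC is a 27-bit bit address


class SequentialTracker:
    """Tracks the compressed PC and the decompressor's word-fetch address."""

    def __init__(self, dec_pc, cpc):
        self.dec_pc = dec_pc   # 4-byte aligned address sent to the ICache
        self.cpc = cpc         # bit address of the instruction being decoded

    def on_instruction(self, length_bits):
        """Save the current CPC and advance it past the decoded instruction."""
        current = self.cpc
        self.cpc = (self.cpc + length_bits) & CPC_MASK
        return current

    def on_buffer_empty(self):
        """decPC = decPC + 4; the result accompanies the dec_ifetch request."""
        self.dec_pc += 4
        return self.dec_pc


tracker = SequentialTracker(dec_pc=0x40, cpc=0x40 * 8)
saved = tracker.on_instruction(8)      # an 8-bit codeword was decoded
fetch_addr = tracker.on_buffer_empty()
```

Keeping the bit-level CPC separate from the word-level decPC reflects the fact that the ICache is always accessed with 4-byte alignment while instructions may start at any bit.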
[Figure 5.11 flowchart: jump or non-COF instruction processing; buffer empty?; extract and save
CPC address; end.]
Figure 5.11 Functionality of the ATU during sequential execution
Execution of COF instructions
Once a codeword has been decoded, its related uncompressed instruction is found in the
dictionary and sent to the ATU's decoder in order to check if it is a COF instruction. The
ATU must handle all the addressing modes supported by the processor, which are:
• displacement addressing mode (used by branch and branch and link instructions); 
• immediate addressing mode (used by direct jumps); 
• indirect register addressing mode (resulting from the execution of indirect jumps).
As tailored mechanisms have been developed for handling the different addressing modes, 
these are presented separately below. 
In the event of a conditional branch or a branch and link instruction being detected, the
immediately following uncompressed instruction output from the dictionary will be discarded,
for each branch codeword is followed not by another codeword, but by a 16-bit appendix (see
Chapter 4), which will be used to calculate the target address of the branch. In addition, the
start pointer of the input buffer (see Figure 5.7) is updated to point to the next buffered
codeword (that is, it is increased by 16 in a circular fashion). The course of action that follows
is shown in Figure 5.12 and depends on whether the branch is taken (as informed by the
br_taken signal shown in Figure 5.4).
• If the branch is taken, then, as the ARcrangent-A4 conditionally executes the 
instruction that directly follows a branch (delay slot instruction), the codeword pointed 
by the updated start pointer of the input buffer needs to be decompressed and sent to 
the processor. Then, the input buffer is reset (flushed) and a fetch instruction request, 
decifetch on Figure 7.4., with the address of the calculated branch target is sent to the 
ICache. If the branch instruction is of the rype branch and link, then, additionally, the 
address of the instruction following the delay slot instruction is placed on the SRS, 
which is a stack that holds the return addresses from subroutines Q.e. generated by fLec 
and BLec instructions). 
• If the branch is not taken, the decompressor continues with its normal (sequentiaQ 
execution. 
Figure 5.12. Functionality related to branch and branch and link instructions
Figure 5.13 shows the decompressor's functionality for the case when a jump instruction
is detected by the ATU's decoder. First, the decompressor waits for the br_taken signal from
the processor core; where this indicates that the jump is not executed, the CPC is saved and
processing continues in its normal sequential mode. Conversely, if the jump is executed,
the instruction in the delay slot is decompressed. At this stage, the actions taken by the ATU
depend on whether the jump instruction is direct or indirect.
• For a direct jump, the target address in uncompressed space is held in the form of a 
long immediate value in the 32 bits directly following the jump instruction. Therefore, 
the operation required is simply to send the next 32 bits stored in the input buffer to 
the Jump Table in order to convert the target address from uncompressed to 
compressed address space. 
Figure 5.13. Jump related functionality of the ATU
• For an indirect jump, the target address (in uncompressed space) is supplied by the
processor and the same process is carried out as for direct jumps. If a jump to
BLINK is being decoded, the address at the top of the SRS is popped from the stack
and used as the target address (in compressed space). Finally, in the case of a jump and
link instruction, the target address is retrieved in accordance with the type of jump,
and the address of the instruction (in compressed space) following the delay slot
instruction is pushed onto the SRS.
The mechanism to handle loop instructions mirrors that of the processor (see Figure 4.5): 
the start and end loop addresses are stored, together with the number of loops that have to be 
performed. A counter is decremented in each loop, and, when it reaches zero, sequential 
execution resumes. 
ATU architectural implementation 
The architecture of the modules of the ATU is presented in Figure 5.14 and described in the
following subsections.
Figure 5.14. Main modules of the ATU
Decoder
The instructions output to the processor are partially decoded in the ATU's decoder in order
to detect COF instructions at as early a stage as possible. The relevant information,
which includes the type of COF detected, is sent to both control units of the decompressor. Based
on the decoder's output, the ATU's control unit activates the COF circuitry and determines
the correct output address.
CPC 
As previously discussed, the ARCtangent-A4 has a 24-bit instruction address bus, where the two
least significant bits are not used due to the instruction word-address alignment common in
32-bit microprocessor ISAs. This study, however, makes extensive use of these two bits
(byte-address alignment), and adds three further bits in order to implement the addressing that
permits instructions in the compressed space to start at any bit position in a memory word.
The CPC's architecture is based on a three-entry 27-bit shift-register file, whose write enable
signal is the ifetch signal from the processor. Reg A holds the address (in compressed space) of
the codeword being decoded (by the DU). Its value is periodically incremented by the size of
the codeword, unless a COF instruction is executed, in which case the register value is updated
with the target address of the branch or jump. The increment can be of x, y, 16 or 32 bits,
where x and y are the lengths of the codewords of the two different classes, 16 corresponds to
the length of the branch appendices and 32 to the length of uncompressed instructions. The
purpose of having a three-entry shift register is to be able to push onto the SRS the correct
return address that the subroutine execution requires. Note that subroutines are initiated by
JLcc or BLcc instructions, and terminated by a Jump-to-Blink instruction (Jcc %blink), since
the return address of a subroutine (in uncompressed space) is stored by the processor in its
BLINK register.
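The bit-aligned addressing described above can be illustrated with a short Python sketch. It is not the VHDL implementation: the function names are hypothetical, the field layout (a 32-bit word address in the upper bits and a 5-bit position within the word in the lower bits) is an assumption, and the codeword lengths of 8 and 14 bits are simply the values listed in Table 5.3.

```python
ADDR_BITS = 27                       # width of the CPC registers
ADDR_MASK = (1 << ADDR_BITS) - 1

def split_bit_address(bit_addr):
    """Split a bit-aligned address into (32-bit word address, bit offset in the word)."""
    word_addr = bit_addr >> 5        # 32 bits per memory word (assumed layout)
    bit_in_word = bit_addr & 0x1F
    return word_addr, bit_in_word

def next_cpc(cpc, increment_bits, cof_target=None):
    """Advance the CPC by one item, or load the target of a taken COF instruction."""
    if cof_target is not None:
        return cof_target & ADDR_MASK
    return (cpc + increment_bits) & ADDR_MASK

cpc = 0
# codeword (x = 8), codeword (y = 14), branch appendix (16), uncompressed instruction (32)
for length in (8, 14, 16, 32):
    cpc = next_cpc(cpc, length)
print(cpc, split_bit_address(cpc))
```

In this sketch the four items occupy 70 bits, so the next codeword starts at bit 6 of the third memory word, which is the kind of unaligned position that the three additional address bits are needed to express.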
SRS 
The SRS is a 27-bit wide dual port stack (of configurable size) that holds the return addresses of
all the pending subroutines (ones that have been called but have not yet returned). The
minimum size of the stack corresponds to the maximum number of nested subroutine calls
that can be issued by the executable, a figure that can be found during simulation or by the use
of profiling tools. A special case of subroutine nesting is recursion, that is, where a subroutine
calls itself. Although recursive functions are not in widespread use in embedded applications
(mainly due to their aggressive and statistically unpredictable use of the system's stack that is
liable to result in stack overflow errors), they are legal syntactical constructs and need to be
considered for completeness. Each time a recursive function call is made, not only its return
address, but also its local variables, return value (if such is present) and its parameters are
pushed onto the stack. Should the number of recursive calls exceed a certain limit, overflow of
the stack will result. This research addresses recursion by implementing a simple but effective
mechanism in the SRS that, instead of pushing the function's return address onto its stack every
time the function is called, keeps track of the number of calls made. The operation is depicted
in the following example of the calculation of the factorial of a given number (see Figure 5.15).
For this particular example, the SRS would behave as follows (see Figure 5.16).
    int main( ... )
    {
        int a;
        func1( .. );
        func2( .. );
        a = Factorial(4);
        // Return Address [A]
        func3( .. );
    }

    int Factorial(int number)
    {
        int intermediate_result;
        if (number > 1)
        {
            intermediate_result = number * Factorial(number - 1);
            // Return Address [B]
            return intermediate_result;
        }
        return 1;
    }

Figure 5.15. Factorial calculation illustrating recursive function calls
Initially, before Factorial(4) is called, the SRS stack is empty (disregarding, for the purpose
of clarity, the return address of the main() function), and the recursive counters are set to zero
(Figure 5.16 (a)). After Factorial(4) is called by the main() function, its return address [A] is
pushed onto the SRS (Figure 5.16 (b)) and the counter is incremented. During the execution of
the function, a recursive call occurs (Factorial(3)), and its return address [B] is pushed onto
the SRS (Figure 5.16 (c)). The subsequent Factorial(2) call has the same return address as that
currently on the top of the stack, so here the address is not pushed again; instead, its
counter is increased (Figure 5.16 (d)). At this point, there are no further calls to the factorial
function and, as it reaches its return point, its counter is decremented. Return addresses are
only popped when their associated counters are equal to zero and, therefore, no address is
removed from the stack at this stage (Figure 5.16 (e)). Finally, the factorial function finishes its
execution and the stack returns to its initial state (Figure 5.16 (f)).
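The walk-through above can be mirrored by a small behavioural model. The following Python sketch is only an illustration under stated assumptions: the class and method names are hypothetical, a counter of one is assumed for a freshly pushed address, and the symbolic addresses "A" and "B" stand for the return addresses [A] and [B] of Figure 5.15.

```python
class SubroutineReturnStack:
    """Return stack whose entries carry a recursion counter (behavioural sketch)."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = []            # each entry is [return_address, counter]

    def push(self, return_address):
        # A repeated call with the same return address only bumps the counter.
        if self.entries and self.entries[-1][0] == return_address:
            self.entries[-1][1] += 1
            return
        if len(self.entries) == self.depth:
            raise OverflowError("SRS full")
        self.entries.append([return_address, 1])

    def pop(self):
        entry = self.entries[-1]
        entry[1] -= 1
        if entry[1] == 0:            # the address is removed only when its counter reaches zero
            self.entries.pop()
        return entry[0]

srs = SubroutineReturnStack(depth=4)
for addr in ("A", "B", "B"):         # Factorial(4), Factorial(3), Factorial(2)
    srs.push(addr)
print(srs.entries)
print([srs.pop() for _ in range(3)], srs.entries)
```

Three calls therefore occupy only two stack entries in this model, and the stack is empty again once the three matching returns have been processed.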
Figure 5.16. SRS stack operation
Jump Table
The Jump Table (see Figure 5.17) is implemented as a content addressable memory (CAM), 
which stores the target addresses of direct and indirect jump instructions in both compressed 
and uncompressed spaces. The table ensures that the target address provided by the processor 
(for a jump instruction) is translated from its original uncompressed space into compressed 
space. 
[Figure 5.17: a CAM whose searchable contents are the jump targets in uncompressed space (24 bits) and whose associated output data are the jump targets in compressed space (27 bits).]
Figure 5.17. Jump table architecture
CAM architectures search the indexed entries in parallel, thus obtaining the entry sought
in a single cycle (outperforming any software-based search algorithm). In this implementation,
the index is the target address of a jump in uncompressed space, and the output is its
corresponding address in compressed space. In order to save power, the CAM is disabled
(cam_enable signal) when not required.
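As a purely behavioural illustration of the translation performed by the Jump Table, the Python sketch below uses a dictionary lookup to stand in for the parallel comparison of a hardware CAM. The class name, the method name and the table contents are hypothetical; only the cam_enable signal name is taken from the text.

```python
class JumpTable:
    """Behavioural model of a CAM mapping uncompressed to compressed jump targets."""

    def __init__(self, entries):
        self.entries = dict(entries)   # uncompressed address -> compressed (bit) address
        self.cam_enable = False        # the CAM stays disabled until it is needed

    def translate(self, uncompressed_target):
        if not self.cam_enable:
            raise RuntimeError("CAM disabled")
        # A hardware CAM compares all entries in parallel; a dict lookup stands in for that.
        return self.entries.get(uncompressed_target)   # None models a miss

# Hypothetical contents, for illustration only.
table = JumpTable({0x1000: 0x5A3, 0x2000: 0x9F1C})
table.cam_enable = True
print(hex(table.translate(0x2000)))
```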
Branch Target Cache
The calculation of jump and branch instructions' targets (in the compressed space) requires
four additional cycles: one cycle for translating the target address into compressed space, two 
cycles for loading the target instruction into the input buffer, and one cycle to perform 
decompression. In order to compensate for these overheads, a configurable branch target 
cache unit (BTC) is implemented, as shown in Figure 5.18. 
Figure 5.18. BTC entry fields
The Branch Table holds a list of target addresses (in uncompressed space), while the 
Instruction 1, Instruction 2 and Instruction 3 tables hold respectively the first, second and 
third uncompressed instructions originally located at the target address. The last table (Next 
Address) holds the address of the instruction following Instruction 3, but this time in 
compressed space. In this way, when the target address of a branch or jump, provided by the 
processor, is found in the Branch Table, the target instruction is sent without delay to the 
processor. In parallel, the corresponding entry in the Next Address table is output to the 
ICache in order to fill the input buffer with the correct instruction (to be output to the 
processor two cycles later). In the next two cycles, the processor will be given the instructions 
contained in the Instruction 2 and Instruction 3 tables. In the final cycle, the instruction
requested from the ICache using the Next Address entry will already have been decompressed and
will be ready to be sent to the processor.
The BTC is divided into two modules of similar architecture (see Figure 5.19). The first is the
Static Branch Table, which contains the target addresses of the jumps and branches most often
found in the code, as determined by a prior simulation. The second is the
Dynamic Branch Table, which is configured in a cache-like fashion to dynamically update the
tables, using a Least Recently Used (LRU) replacement algorithm. A trace-based software
tool, which is described in Chapter 6, has been developed to evaluate the required number of
static and dynamic entries for each particular application, thus ensuring the best performance
for the given area resources. The benefits of introducing the BTC module on the overall
system performance will be quantified in Chapter 7.
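The division into a static and a dynamic table can be sketched in software as follows. This Python model is an assumption-laden illustration rather than the hardware design: the class and method names are hypothetical, each entry is represented as a tuple (instruction 1, instruction 2, instruction 3, next address), and the LRU policy is modelled with an ordered dictionary.

```python
from collections import OrderedDict

class BranchTargetCache:
    """Static (fixed) table plus a dynamic table with LRU replacement (sketch)."""

    def __init__(self, static_entries, dynamic_size):
        self.static = dict(static_entries)    # target -> (instr1, instr2, instr3, next_address)
        self.dynamic = OrderedDict()
        self.dynamic_size = dynamic_size

    def lookup(self, target):
        if target in self.static:
            return self.static[target]
        if target in self.dynamic:
            self.dynamic.move_to_end(target)  # mark as most recently used
            return self.dynamic[target]
        return None                           # miss: the full target calculation is needed

    def fill(self, target, entry):
        if target in self.static or self.dynamic_size <= 0:
            return
        if target not in self.dynamic and len(self.dynamic) >= self.dynamic_size:
            self.dynamic.popitem(last=False)  # evict the least recently used entry
        self.dynamic[target] = entry
        self.dynamic.move_to_end(target)

# Hypothetical addresses and instruction placeholders.
btc = BranchTargetCache({0x100: ("i1", "i2", "i3", 0x8C0)}, dynamic_size=2)
btc.fill(0x200, ("a1", "a2", "a3", 0x1200))
btc.fill(0x300, ("b1", "b2", "b3", 0x1A40))
btc.lookup(0x200)                             # 0x200 becomes the most recently used entry
btc.fill(0x400, ("c1", "c2", "c3", 0x2000))   # evicts 0x300
print(sorted(hex(t) for t in btc.dynamic))
```

Static entries are never evicted in this model, which reflects the idea that the most frequent targets are fixed after profiling while the remaining entries adapt at run time.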
Figure 5.19. BTC configurability allows for hardwired and dynamic branch tables
ATU Control Unit and Output Address (dec_pc_next) Select Logic
The ATU control unit is implemented as a finite state machine (FSM) that controls the SRS
activities, enables the Jump Table's CAM, and determines the values of the dec_pc_next and
ifetch signals. This FSM is graphically described in Figure 5.20.
Figure 5.20. ATU control unit finite state machine diagram
The initial state of the FSM is RESET. The ATU waits until a request from the processor
arrives and then enters the COF address calculation state, where the BTC and the Jump Table
are searched in parallel in order to supply the correct starting address to the cache. If the
address is found in the BTC, the FSM enters the BTC state and remains in this state for
three cycles. If none of the instructions supplied by the BTC to the processor is a jump, then
the FSM moves into the Sequential Address Calculation state, where the CPC is
incremented and the Jump Table circuitry is disabled. If at any point in this state a branch or
jump instruction is detected (equally if a jump instruction is detected in the BTC state), the
FSM again enters the COF state. In the COF state, if there is no match in the BTC, then the
target address is calculated using the mechanisms explained earlier.
During sequential execution, the CPC and dec_pc_next are incremented. In the event of a
change of flow, dec_pc_next is updated with the value of the target address, as shown in
Figure 5.21. In the case of a processor or ICache stall, the FSM moves into the IDLE state, which
is left as soon as processing resumes.
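The state transitions described in the text can be summarised in a small software model. The Python sketch below is a simplification under several assumptions that are not stated in the text: the state and input names are hypothetical, a BTC miss in the COF state is assumed to lead to the sequential state after one step, and on leaving IDLE the machine is assumed to return to the state it was in before the stall.

```python
RESET, COF, BTC, SEQ, IDLE = "RESET", "COF", "BTC", "SEQ", "IDLE"

class AtuFsm:
    """Simplified model of the ATU control unit state machine."""

    def __init__(self):
        self.state = RESET
        self.btc_cycles = 0
        self.resume = RESET

    def step(self, request=False, btc_hit=False, cof_detected=False, stall=False):
        if self.state == IDLE:
            if not stall:
                self.state = self.resume          # leave IDLE when processing resumes
            return self.state
        if stall:
            self.resume, self.state = self.state, IDLE
            return self.state
        if self.state == RESET:
            if request:
                self.state = COF
        elif self.state == COF:
            if btc_hit:
                self.state, self.btc_cycles = BTC, 3
            else:
                self.state = SEQ                  # assumption: target calculated, then sequential
        elif self.state == BTC:
            self.btc_cycles -= 1
            if cof_detected:
                self.state = COF
            elif self.btc_cycles == 0:
                self.state = SEQ
        elif self.state == SEQ:
            if cof_detected:
                self.state = COF
        return self.state

fsm = AtuFsm()
trace = [fsm.step(request=True), fsm.step(btc_hit=True),
         fsm.step(), fsm.step(), fsm.step(), fsm.step(cof_detected=True)]
print(trace)
```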
Figure 5.21. Output Address Select Logic
5.3. Decompressor implementation 
The architecture of the decompressor, as described in Section 5.2, has been implemented in
VHDL. The implementation is vendor independent and can be synthesised for both FPGA and ASIC
(Application Specific Integrated Circuit) technologies. The results provided in this section
relate to the particular FPGA technologies listed in Table 5.2.
Table 5.2. Targeted FPGAs
Vendor   Technology    Part       Package   Speed Grade
Xilinx   Spartan-IIE   XC2S200E   fg456     -6
Xilinx   Virtex-II     XC2V500    fg256     -6
Altera   Cyclone       EP1C6      T144      -6
The synthesis tool used was Synplify Pro v.7.3.1 from Synplicity. The place and route tools
used were Xilinx's ISE (Integrated Software Environment) v.6.1 for the Xilinx devices and
Altera's Quartus II v.4.1 for the Altera Cyclone.
5.3.1. Configuration parameters 
The implementation results depend on the particular configuration of the decompressor. The
configuration parameters, and the values for which the decompressor architecture has
been synthesised, are summarised in Table 5.3.
Table 5.3. Decompressor configuration parameters and values used for synthesis
Configuration parameter                    Range      Implementation value
Number of jump table entries               1-256      32
Number of hardwired BTC entries            1-32       1
Number of dynamic BTC entries              1-32       1
Dictionary size (bytes)                    16-8192    16
SRS depth                                  1-256      4
Number of codewords                        1-4        2
Codeword lengths (bits)                    8-16       8, 14
Branch offset size (bits)                  8-32       16
Flag bit for uncompressed instructions     0 or 1     1
5.3.2. Implementation Results 
The results provided in the following subsections are related to the particular technologies 
listed in Table 5.2.
Clock rates 
The clock frequency achieved by the decompressor depends largely on the implementation
technology. The clock frequencies obtained after place and route for the targeted FPGA
technologies are summarised in Table 5.4.
Table 5.4. Decompressor's clock frequencies for the targeted devices
Technology    Part       Clock period (ns)   Frequency (MHz)
Virtex-II     XC2V500    17.928              55.778
Spartan-IIE   XC2S200E   35.437              28.2
Cyclone       EP1C6      17.637              56.7
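The frequencies in Table 5.4 follow directly from the reported clock periods, since f (MHz) = 1000 / T (ns). The short Python check below reproduces this conversion; the function name is a hypothetical helper and the periods are the values quoted in the table.

```python
def frequency_mhz(period_ns):
    """Convert a clock period in nanoseconds into a frequency in MHz."""
    return 1000.0 / period_ns

# Clock periods (ns) as reported in Table 5.4.
for technology, period in (("Virtex-II", 17.928), ("Spartan-IIE", 35.437), ("Cyclone", 17.637)):
    print(technology, round(frequency_mhz(period), 1))
```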
Complexity figures 
The complexity figures for the different configurations of the decompressor (see Table 5.3)
are presented in Table 5.5 (Cyclone) and Table 5.6 (Xilinx devices).
Table 5.5. Complexity figures for Cyclone EP1C6 device
LEs           M4K RAM blocks   PLLs   Max user I/O pins
3531 (59%)    0                0      119
Table 5.6. Complexity figures for Xilinx devices
                               Spartan-IIE    Virtex-II
Logic utilization
  Number of slice flip-flops   1,637 (34%)    1,610 (26%)
  Number of 4-input LUTs       2,616 (55%)    2,119 (34%)
Logic distribution
  Number of occupied slices    2,028 (86%)    1,785 (58%)
Total 4-input LUTs
  ... used as logic            2,616          2,119
  ... used as a route-thru     26             30
Number of bonded IOBs          118 (41%)      119 (69%)
Number of GCLKs                1 (25%)        1 (6%)
The decompressor in its standard configuration can be successfully implemented in low-cost
FPGA families such as Cyclone and Spartan-IIE, including its low gate count model, the
XC2S200E. As the dictionary size used is rather small, the memory resources of the targeted
FPGAs are not utilised. The utilisation of the RAM blocks of the devices will increase
according to the size of the dictionary.
5.4. Conclusions 
This chapter presented the functional specifications, hardware architecture and
implementation figures (speed and area) of the decompressor unit, which is designed as an
application-specific, reconfigurable IP module that extends the pipeline of the processor by
two extra stages. The COF issues typical of compressed code execution architectures were
introduced and addressed using novel mechanisms that avoid the implementation of large
address translation tables (in accordance with most of the work reported to date), thus
providing a more refined, scalable and area-aware solution. These mechanisms translate to an
additional delay of up to 4 cycles per executed COF, which is commonly more than compensated
by the significant increase in the ICache hit ratio, since modern embedded systems may incur
penalties of up to hundreds of cycles on every cache miss. Finally, this chapter introduced the
idea of a compression-transparent architecture, where the decompressor module can be
seamlessly integrated and synthesised on a host processor, without the need for any
architectural changes in the processor, which is completely unaware that the instruction
memory is compressed.
Chapter 6 
METRICS AND TOOLS FOR SYSTEM 
PERFORMANCE ANALYSIS 
In order to develop and improve a particular design or architecture, it is necessary to measure 
and analyse its performance. The results obtained by such an analysis (and their usefulness)
will be directly related to the measurement framework used, including the parameters and
metrics selected, as well as the measurement method and the dataset.
This chapter focuses on the description of the measurement framework developed to evaluate
the impact of the compression scheme (algorithm and hardware implementation), presented in 
this thesis, on the performance of embedded RISC systems. The measurement results 
obtained for the selected test applications are presented in Chapter 7. 
The software application developed to automate the compression of RISC executables, and 
the range of performance measurements applied are described at the end of this chapter. 
6.1. Performance metrics 
The development of modern consumer electronics products must temper production costs
with ever-increasing performance requirements, often leading to solutions with an acceptable
level of performance at the lowest possible price. In such an environment, it is important to
identify suitable performance metrics that correctly reflect the processing capabilities of the
system with respect to a particular set of design requirements. Usually, the performance of a
microprocessor-based embedded solution is directly related to the time (in terms of CPU
cycles) that the processor takes to execute a particular task or program. Although in the marketing
literature MIPS (million instructions per second), GFLOPS and TFLOPS (Giga and Tera Floating
Point Operations per Second, respectively) are commonly used metrics, they do not properly
reflect the performance of the entire system, since they do not take into account factors such
as memory bandwidth, complex hierarchical memory architectures, cache and page misses,
and pipeline issues. Thus, a performance metric is needed that takes into account the whole system and
not only the number of instructions that can be executed under ideal circumstances.
Such a metric could be defined as [1]:
$$T_{CPU} = \frac{\text{CPU clock cycles per program}}{\text{CPU clock frequency}}, \tag{6.1}$$
where $T_{CPU}$ is the CPU execution time, defined as the time necessary for the processor to
execute a predefined task. As a first approximation, $T_{CPU}$ could be expressed in terms of the (ideal)
number of clock cycles required to execute such a task, and therefore Equation (6.1) could be
written as:
$$T_{CPU\_IDEAL} = \frac{\sum_{i=1}^{n} IC_i \times CPI_i}{\text{clock frequency}}, \tag{6.2}$$
where $IC_i$ (instruction count) is the number of repetitions of instruction $i$, $CPI_i$ is the
number of clock cycles taken to execute instruction $i$, and $n$ is the number of instructions in the
ISA. The number of instructions required to execute a given program ($\sum_{i=1}^{n} IC_i$) is clearly
dependent on the processor's ISA, while $CPI_i$ depends on its architectural design. This
model, however, ignores effects on the performance that arise due to memory access and
pipeline hazards and, in general, any situation that causes the processor to stall. Pipeline
hazards prevent the next instruction in the instruction stream from executing during its
designated clock cycle and so reduce the performance from the ideal speedup gained by
pipelining. There are three classes of pipeline hazard [1]:
• Structural hazards, which arise from resource conflicts when the hardware cannot
support all possible combinations of instructions simultaneously in overlapped
execution.
• Data hazards, which arise when an instruction depends on the result of a previous
instruction in a way that is exposed by the overlapping of instructions in the
pipeline.
• Control hazards, which arise from the pipelining of branches and other instructions that
change the PC.
Most pipelined processors incorporate additional hardware structures designed to reduce
stalling, but these are not effective for all instruction combinations. Stalls effectively increase the
mean CPI of an instruction, thus adversely affecting the processor (and hence, the system)
performance.
Memory-related stalls are common in microprocessor-based systems, especially in complex
bus hierarchies where a cache miss can easily lead to hundreds of stall cycles when accessing
either the instruction or the data memory. Data memory stalls that occur when executing load or
store instructions are not considered in this study, since only the code is compressed and this
has no effect on the performance of the data memory (or the data cache). The memory
stalls considered here result from ICache misses, and the number of memory stall cycles can be
expressed as the product of the number of cache misses and the cache miss penalty in terms
of cycles:
$$\text{memory stall cycles} = N_{cache\_misses} \times \text{miss penalty} \tag{6.3}$$
The actual miss penalty that is obtained when accessing instruction memory can vary 
significantly. Determining the cache miss penalty for a particular system depends on many 
parameters, such as the type and access times of the memory, the priority of instruction loads 
in the memory arbiter, the status of the memory (such as busy or available), the memory 
controller implementation, etc. The choice of the values for cache miss penalties used in this 
research for the performance analysis of the compressed memory system is based on two 
alternative processor implementation scenarios, as presented in Chapter 2. The first, the 
ARCAngel-A4 evaluation board, is an example of a simple embedded system, in which the 
processor can directly access the memory. The second, the PNX8550, represents a complex 
SoC ASIC commercial system, where the memory accesses may be significantly delayed due to 
a highly complex bus hierarchy with multiple arbitrations; several hundred or even several
thousand cycles could be needed to recover from a cache miss.
Taking into account the additional execution times caused by the presence of hazards and 
memory access delays results in a more realistic (and therefore useful) representation of the 
processor execution time. Equation (6.1) can now be rewritten as: 
$$T_{CPU} = T_{CPU\_IDEAL} + T_{DELAYS}, \tag{6.4}$$
where
$$T_{DELAYS} = \frac{\text{system stall cycles}}{\text{clock frequency}} \tag{6.5}$$
and
$$\text{system stall cycles} = \text{memory stall cycles} + \text{pipeline hazard cycles}. \tag{6.6}$$
Finally, combining Equations (6.2), (6.4) and (6.5), the following result is obtained:
$$T_{CPU} = \frac{\sum_{i=1}^{n} IC_i \times CPI_i + \text{system stall cycles}}{\text{clock frequency}} \tag{6.7}$$
Substituting Equation (6.6) in Equation (6.7), the final expression for the CPU execution time 
is found: 
$$T_{CPU} = \frac{\sum_{i=1}^{n} IC_i \times CPI_i + N_{cache\_misses} \times \text{miss penalty} + \text{pipeline hazard cycles}}{\text{clock frequency}} \tag{6.8}$$
Equation (6.8) will be used to compare the performance of the original and the compressed
memory systems.
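As a worked illustration of Equation (6.8), the following Python sketch evaluates the execution time for a made-up workload. The function name and all numerical values are hypothetical and serve only to show how the terms combine.

```python
def cpu_time(ic, cpi, cache_misses, miss_penalty, hazard_cycles, clock_hz):
    """Execution time in seconds according to Equation (6.8)."""
    ideal_cycles = sum(n * c for n, c in zip(ic, cpi))           # sum of IC_i x CPI_i
    memory_stall_cycles = cache_misses * miss_penalty            # Equation (6.3)
    system_stall_cycles = memory_stall_cycles + hazard_cycles    # Equation (6.6)
    return (ideal_cycles + system_stall_cycles) / clock_hz

# Hypothetical workload with three instruction classes on a 50 MHz clock.
t = cpu_time(ic=[1000, 500, 200], cpi=[1, 2, 4],
             cache_misses=10, miss_penalty=100,
             hazard_cycles=300, clock_hz=50e6)
print(t)
```

Here the ideal cycle count is 2800 and the stall cycles add a further 1300, so even ten cache misses with a 100-cycle penalty account for roughly a quarter of the total execution time.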
In order to analyse how the developed compression scheme affects the overall
performance of the system, Amdahl's law [1] will be used. Amdahl's law, which is commonly
used for calculating the performance gain that can be obtained by improving a particular
portion of a computer system, states that if an improvement can be made to a portion of a
system, the increase in performance (in terms of $T_{CPU}$) when using that improvement is the
ratio:
$$\text{speedup} = \frac{T_{CPU\_ORIGINAL}}{T_{CPU\_ENHANCED}}, \tag{6.9}$$
where $T_{CPU\_ENHANCED}$ and $T_{CPU\_ORIGINAL}$ are the execution times of a task run on a system with
and without the enhancement, respectively.
Assuming that the processor in both the compressed and the original systems runs at the same 
clock frequency, the impact of compression on the system performance can be calculated by 
substituting Equation (6.8) in Equation (6.9), leading to: 
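$$\text{speedup} = \frac{\left(\sum_{i=1}^{n} IC_i \times CPI_i + N_{cache\_misses} \times \text{miss penalty} + \text{pipeline hazard cycles}\right)_{ORIGINAL}}{\left(\sum_{i=1}^{n} IC_i \times CPI_i + N_{cache\_misses} \times \text{miss penalty} + \text{pipeline hazard cycles}\right)_{ENHANCED}} \tag{6.10}$$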
Hence, in this study the main parameters required to calculate the performance 
improvement of the compressed system are: 
• $CPI_i$ of each instruction, which requires knowledge of the instruction's
execution cycles.
• $IC_i$ of each instruction executed.
• Number of ICache misses (for the selected ICache configuration) of both the original
and the compressed memory systems.
• Instruction cache miss penalty for the particular system architecture.
Section 6.2 presents suitable methods to obtain the values of the above parameters. 
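To make the use of Equation (6.10) concrete, the Python sketch below computes the speedup from the four parameters listed above. The function names and the numbers are hypothetical: the enhanced system is given fewer ICache misses but more hazard cycles, to mimic the additional COF delays of a compressed system.

```python
def total_cycles(ic, cpi, cache_misses, miss_penalty, hazard_cycles):
    """Numerator or denominator of Equation (6.10) for one system."""
    return sum(n * c for n, c in zip(ic, cpi)) + cache_misses * miss_penalty + hazard_cycles

def speedup(original, enhanced):
    """Equation (6.10): ratio of total cycles, assuming equal clock frequencies."""
    return total_cycles(**original) / total_cycles(**enhanced)

# Hypothetical figures for the same program on both systems.
original = dict(ic=[1000], cpi=[1], cache_misses=50, miss_penalty=100, hazard_cycles=200)
enhanced = dict(ic=[1000], cpi=[1], cache_misses=20, miss_penalty=100, hazard_cycles=400)
print(round(speedup(original, enhanced), 3))
```

With these figures the total falls from 6200 to 3400 cycles, so the reduction in cache misses outweighs the extra hazard cycles.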
6.2. Performance analysis techniques 
Techniques for computer performance analysis can be divided into three broad classes: 
performance measurement (benchmarking), analytical modelling and simulation modelling. 
Performance measurement (including the use of benchmarks) does not use models but instead
relies on direct observation of the system of interest, or a similar representative system.
Benchmarking is difficult to perform without actual hardware, which is not always available.
Additionally, the measurements that can be collected can be rather limited, depending on the
instrumentation installed. All analytical and simulation techniques require the construction of a
model, that is, an abstract representation of the real system, but while an analytical
performance model is a mathematical construct, a simulation model is a specialized computer
program that emulates (at different levels) the behaviour of the real system. Analytical
modelling provides a quick way to compare different product configurations and it is the cheapest
in terms of effort and cost. Its main disadvantage, and the reason it is unsuitable for this work, is that it
provides only the mean values of the basic performance metrics, such as throughput, latency
and utilization. It is not generally possible to obtain estimates of metrics such as number of
cache misses or instruction counts, and it does not account for the particularities of each
executed programme. Simulations, on the other hand, can be programmed to collect any
metric required (up to the level of accuracy provided by the simulation system) and to allow 
for different system configurations to be evaluated (such as different cache sizes in a processor 
or different dictionary sizes in a dictionary-based compression scheme). Clearly, the more 
parametrisable and accurate the model is, the more effort its design and implementation 
requires. 
Only simulation modelling will be discussed further as it stands out as the most suitable 
technique for measuring the performance of the compressed system developed in this work. 
Benchmarking is not feasible due to the absence of real hardware (that is, a complex embedded
SoC with its design fully available to be freely analysed and modified), while analytical 
modelling is generally limited to providing a guide of some particular macroscopic properties 
of the system highly averaged over time. 
The behaviour of a computer system can typically be simulated using one of the following
three techniques [1]:
• Profile-based static modelling
• Trace-driven simulation
• Cycle-accurate execution simulation
Profile-based static modelling is the simplest technique to implement, since it disregards
memory access behaviour and generally models complex pipeline architectures only
approximately. The main purpose of this technique is to provide a statistical report
(profile) of various well-defined parameters (for example, instruction repetition, total
instruction count, cache hits and misses), and often to generate a trace file with the memory
addresses that have been accessed. Its simplicity allows for relatively fast execution compared
with trace-driven and cycle-accurate simulations. Trace-driven simulation is particularly
useful for modelling memory system performance and, when combined with a static
(profile-based) analysis of the pipeline, leads to a reasonably accurate performance model of the
processor system. Finally, cycle-accurate execution simulation performs a detailed simulation
of both the memory system and the pipeline simultaneously. Although this technique is the
most accurate one, developing such a detailed model of a complex system requires substantial
effort. Furthermore, the simulation time can be very long (particularly for complex
architectures) and, depending on the parameters under observation, a profile-based technique
could provide similar results (within a known percentage of error) in a much shorter time.
In this study, the uncompressed system (incorporating the ARCTangent-A4 processor) was
simulated using ARC's instruction set simulator (ISS), which falls into the profile-based
simulator category. The influence of the compression scheme on the system was analysed 
using a trace-driven simulator developed during this research. Both simulation systems will be 
presented in detail in the following section. 
6.3. Simulation tools 
This section presents the tools used for evaluating the performance of the original 
(uncompressed) and modified (compressed) systems. Initially, the uncompressed system 
performance is analysed by means of a profile-based simulation tool, namely the ARCTangent-A4
ISS, which executes the original ELF executable file. This instruction simulator provides,
among other information, the number of instructions executed, the number of cache misses, 
and generates a trace file. Once the execution traces are available, they are processed by a trace 
analysis tool developed in this research, which obtains various statistics such as a list of the
most frequently executed instructions and jumps, and counts of instructions according to their
type, as well as producing an instruction-address table (IAT). This table is used by a
configurable instruction cache simulator (ICS), which, by processing the trace file, calculates
the number of cache misses. The ICS can be configured to analyse different cache
configurations, where the cache type, size and line size can be varied. All the code 
compression and analysis tools developed in this research have been gathered under a 
common visual interface (the Code Compression Suite) developed by the author using C++ 
for Windows platforms. This development suite is described in detail later in this section. 
6.3.1. ARCTangent-A4 ISS 
The ARCTangent-A4 ISS was provided by ARC International Ltd. in support of the author's
research. The ISS is designed to simulate, at instruction level, the execution of ARCTangent-A4
applications, providing a wide range of statistical information such as instruction count,
ICache and DCache misses, and whether branches are taken, and is able to generate
instruction trace files (see Figure 6.1). The trace files record the activity on the instruction
memory bus: the number of instructions executed (in decimal format), the actual
instruction executed (in hexadecimal format), its virtual address and its assembler
representation.
Each time a conditional instruction is executed, the contents of the flag register are shown
on a separate line and the outcome of the conditional test is shown in brackets (either 'execute'
or 'do not execute'). The details of data load and store operations are also written on a
separate line. The generated trace files do not record the execution of the processor's
initialisation code that is common to all ARCTangent-A4 applications, and which usually
accounts for the first few thousand instructions. The number and range of instructions to be
simulated is set by the user through the use of an ISS command. As a reference, the simulation
of 2.5 million instructions results in a trace file of about 2 GBytes.
1:0 (do not execute) 
5:0 (do not execute) 
condition 5:0 (do not execute) 
condition 2:1 (execute) 
reg 1 
Figure 6.1. Excerpt from the trace file generated by the ARCTangent-A4 ISS during
the simulation of a test bench program
6.3.2. Code Compression Suite (CCS) 
The CCS provides visual configuration, displays the analysis results and manages the 
integration and the flow of data and information between the various development and 
analysis tools. Figure 6.2 shows the main interface of the CCS. As can be seen in this figure, 
the interface is divided into four main areas, namely Command Bar Area, Working Directory 
Area, Applications Area and Results Area. The Command Bar allows the user to control the
operation of the tool (e.g. compress a file, process a trace file, view results, configure
parameters, etc.). The meaning of each button is summarised in the dialog window shown in
Figure 6.3, which can be opened at run-time by pressing the help button located on the
Command Bar. The Working Directory Area allows manual modification of the working
directory, whose structure and meaning will be described in detail later. The user can also
browse to this directory using the Command Bar. The Applications Area displays which
programs are stored in the working directory, and these determine the operations available.
Finally, the Results Area displays the statistical information extracted during the processing of
the trace files.
Figure 6.2. CCS main interface window. In the Applications List window, T stands for
trace file, E for uncompressed executable, D for dictionary file and IAT for
Instruction-Address Table file
Figure 6.3. Summary of the Command Bar functionalities
The Working Directory 
The working directory holds the applications' projects for all the applications that the user 
wishes to process (that is compress and analyse). Each application's project is located under its 
own subdirectory, and the list of the files it may contain is summarised in Table 6.1. 
Table 6.1. Summary of files that may be found in an application's project directory
File                                  Extension        Required  Description
Trace File                            .trace           X         ISS-generated trace file
Uncompressed Executable               .elf             X         Original (uncompressed) executable file
Compressed Executable                 .comp                      Compressed executable file
Address Map                           .map                       Compressed to uncompressed address mapping file generated by the compressor
Static Dictionary                     .static.dict               Dictionary based on the most frequently repeated instructions in the code
Dynamic Dictionary                    .dynamic.dict              Dictionary based on the most frequently executed instructions
Dictionary                            .dict                      Dictionary used to generate the compressed executable
Instruction-Address Table             .iat                       Instruction-address table file generated after analysing the trace file
Dynamic Instruction Analysis Results  .dinst.results             File containing dynamic code analysis statistics
Dynamic Branch Analysis Results       .branch.results            File containing dynamic branch analysis statistics
The first two files of Table 6.1, namely the trace file and the uncompressed executable, are the
only inputs that the CCS requires in order to generate the remainder according to the parameters
selected by the user. As this table shows, three different dictionaries are generated. The static
dictionary is generated based on the most frequently occurring instructions found in the
original executable file. The dynamic dictionary contains the most frequently executed
instructions found from the analysis of the trace file; clearly, its usefulness will depend on how
well the generated trace file characterises a typical execution of the application. The
third dictionary is a combination of the static and dynamic dictionaries, with the
constituent proportions selected by the user. As shown in Figure 6.2, the Applications Area
displays information about the existence or otherwise of a number of the key files presented in
Table 6.1.
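As a rough illustration of how the third dictionary might be assembled, the sketch below merges two frequency-ranked instruction lists according to a user-selected static proportion. The function name build_final_dictionary, its inputs and the rule for handling duplicates (an instruction already chosen is skipped, and any remaining slots are topped up from the static list) are assumptions made for this example; they are not taken from the CCS implementation.

```python
def build_final_dictionary(static_ranked, dynamic_ranked, size, static_share):
    """Merge two frequency-ranked instruction lists into one dictionary.

    static_ranked / dynamic_ranked: instruction words, most frequent first.
    size: number of dictionary entries.
    static_share: fraction (0.0 to 1.0) of entries taken from the static list.
    All names and the merge rule are hypothetical.
    """
    if not 0.0 <= static_share <= 1.0:
        raise ValueError("static_share must lie between 0 and 1")
    n_static = round(size * static_share)
    chosen = []
    seen = set()

    def take(source, limit):
        # Append words not yet chosen until the dictionary holds 'limit' entries.
        for word in source:
            if len(chosen) >= limit:
                break
            if word not in seen:
                seen.add(word)
                chosen.append(word)

    take(static_ranked, n_static)   # static portion first
    take(dynamic_ranked, size)      # then fill with dynamic entries
    take(static_ranked, size)       # top up if the dynamic list ran out
    return chosen
```

With a share of 0.5 and four entries, the two highest-ranked static instructions are kept and the remaining two slots are filled with the highest-ranked dynamic instructions that are not already present.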
System Parameters 
The following system parameters are accessible to the user through the Settings dialog 
window, see Figure 6.4. 
• Branch Target Cache (BTC) configuration. This area allows the selection of the number of 
entries in the static and dynamic BTC tables. 
• Codewords and dictionary configuration. The lengths of the codewords of each class can be
specified; note that the size of the dictionary is determined by these lengths (see 
Chapter 4). The proportion of static and dynamic dictionary entries used to generate 
the final dictionary can also be set. 
• ICache configuration. The number of ways and ICache lines, together with the ICache
size, can be determined by the user. 
• Executables' Path. This allows the user to select the location of the modules that 
comprise the CCS. 
Trace file analysis 
Figure 6.4. Settings dialog window
After the system parameters have been set, the CCS is ready to process the trace file, following 
which the following files are obtained. 
• The Instruction-Address Table (IAT) file, which contains the executed instructions
and their virtual addresses in uncompressed space. This file is used by the Dynamic 
ICache simulator. 
• The Dynamic Branch Analysis Results file shows the COF instructions sorted by 
their frequency of execution. This file also stores the number of hits and misses in 
the BTC module. 
• The Dynamic Dictionary file holds the most frequently executed instructions and is
generated from the Instruction Analysis Results for the selected dictionary size.
• The Dynamic Instruction Analysis Results file stores the instructions sorted by
their frequency of execution. This file also provides other statistical information 
such as the instruction count, number of executed cycles, number of loads and 
stores, executed COF instructions, and pipeline stalls. The results stored in this file 
can be visualised using the Instruction Analysis button of the Command Bar (see 
Figure 6.5). 
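The frequency ranking behind the Dynamic Instruction Analysis Results and the Dynamic Dictionary can be sketched as follows. This is a minimal illustration that assumes the executed instruction words are available as a simple sequence; the function name rank_instructions and the cumulative-percentage column are choices made for this example rather than a description of the CCS file format.

```python
from collections import Counter


def rank_instructions(executed_words, dictionary_size):
    """Rank executed instruction words by frequency (hypothetical helper).

    Returns (ranked, dictionary) where ranked is a list of
    (word, count, cumulative percentage of executed instructions) and
    dictionary holds the dictionary_size most frequently executed words.
    """
    counts = Counter(executed_words)
    total = sum(counts.values())
    ranked = []
    cumulative = 0
    # most_common() orders by descending count; ties keep first-seen order.
    for word, count in counts.most_common():
        cumulative += count
        ranked.append((word, count, 100.0 * cumulative / total))
    dictionary = [word for word, _, _ in ranked[:dictionary_size]]
    return ranked, dictionary
```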
Figure 6.5. View of the dynamic instruction (left) and branch (right) analysis
windows
In the ARCTangent-A4 processor, there are three main causes of pipeline stalls (apart
from cache misses), namely long immediate data (LIMM) fetch, flag setting before conditional
execution, and data interdependency. The first occurs because some ARCTangent-A4
instructions use 32-bit long immediate (LIMM) data, stored in the instruction memory. Thus, 
when the processor needs to access such data, it has to fetch it from the ICache, so 
introducing an additional cycle during which an instruction cannot be fetched. The second 
occurs when the instruction that modifies the processor flags is followed by a conditionally 
executed instruction. As flags are set in stage three of the pipeline, whereas the conditional 
tests that use these flags are carried out in stage two, the pipeline has to stall for one cycle until 
the flags are updated. The third cause of pipeline stalling occurs when the instruction in the 
execution stage requires a register whose value has not yet been obtained by a previous 
instruction. An example of this is the execution of an arithmetic instruction, which has an 
operand that depends on the result of a previous instruction whose execution is not yet 
completed. 
The first two stall cases are relatively straightforward to analyse during the processing of the
traces and the CCS includes their statistics in the Instruction Analysis Results file. However, to
account for the data-interdependency stalls, an accurate model of the microarchitecture of
the ARCTangent-A4 processor would be required. The development of a model of such
complexity is beyond the scope of a postgraduate research project.
In the current work, however, such a model is not required, as the compression of instructions
affects neither the order in which they are executed nor the occurrence of data cache misses,
and consequently has no effect on the data-interdependency-related stalls.
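The first two stall cases can be counted in a single pass over the trace, as the following sketch suggests. The TraceRecord fields are a hypothetical, simplified view of one trace entry (the ISS trace format itself is not reproduced here), and the one-cycle cost per stall follows the description given above.

```python
from dataclasses import dataclass


@dataclass
class TraceRecord:
    # Hypothetical per-instruction view of one trace entry.
    sets_flags: bool    # instruction updates the processor flags
    conditional: bool   # instruction is conditionally executed
    has_limm: bool      # instruction uses 32-bit long immediate data


def count_stalls(records):
    """Return (LIMM stall cycles, flag-dependency stall cycles)."""
    limm_stalls = 0
    flag_stalls = 0
    previous = None
    for record in records:
        # Fetching long immediate data costs one extra cycle.
        if record.has_limm:
            limm_stalls += 1
        # A flag-setting instruction directly followed by a conditional one
        # stalls the pipeline for one cycle.
        if previous is not None and previous.sets_flags and record.conditional:
            flag_stalls += 1
        previous = record
    return limm_stalls, flag_stalls
```

Data-interdependency stalls are deliberately left out, for the reason given in the text.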
ICache Simulator (ICS) 
The ARCTangent-A4 implementation allows the user to choose the type and size of 
instruction cache (1, 2, 4 or 8 Kbytes of instruction and data cache and cache line lengths of 
16, 32, 64, 128 and 256 bytes) according to the requirements of the target application [43]. All
instruction cache architectures essentially perform the same operations, in that, on a cache
miss, the processor is halted while the cache line containing the missing instruction word is
fetched into the cache. The main differentiating factor between instruction cache architectures is
the organisation of the RAM.
The direct-mapped instruction cache is the simplest supported architecture, where,
following a cache miss, the fetched instruction can be copied into only one of the cache lines.
The cache line address is derived from a number of the least significant valid bits of the address
(the index), which point to the physical location in the cache. An n-way set associative instruction
cache requires more complex logic in its implementation, allowing an instruction word to be
placed into any one of n locations in the cache. A set refers to a collection of cache line
locations where the instruction can be placed; thus, a cache whose sets each contain four lines is referred to as a
4-way set associative cache. Figure 6.6 shows a 2-way set associative cache.
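The way an address selects a set can be made concrete with a short sketch. It assumes that the cache size, line length and number of ways are powers of two and splits a byte address into tag, index and byte offset within the line; the function name split_address is hypothetical, and the unused address bits shown in Figure 6.6 are ignored.

```python
def split_address(address, cache_size, line_length, ways):
    """Split a byte address into (tag, index, offset).

    cache_size and line_length are in bytes; all three parameters are
    assumed to be powers of two. ways = 1 models a direct-mapped cache.
    """
    sets = cache_size // (line_length * ways)
    offset_bits = line_length.bit_length() - 1   # log2(line_length)
    index_bits = sets.bit_length() - 1           # log2(sets)
    offset = address & (line_length - 1)
    index = (address >> offset_bits) & (sets - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset
```

For a 2 Kbyte, 2-way cache with 32-byte lines there are 32 sets, so address 0x1234 maps to tag 4, index 17 and byte offset 20.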
Figure 6.6. Two-way set associative instruction cache architecture
When the cache line is filled with instructions after a cache miss, the tag part of the address
is stored into the tag RAM of the cache. Following an instruction fetch request, the tag of the
address is compared with the tag RAM entry pointed to by the index. In the example, assume that
a cache line is made of four 32-bit words, and a four-to-one multiplexer (with a select signal
coming from the address) is employed to output the correct word from the line. When the tag
of the requested address equals the one held in the entry pointed to by the index, a hit is detected and
the output instruction is validated by the hit signal. As the figure shows, the cache architecture
is duplicated for a 2-way cache and four units would be needed for a 4-way cache. When a 
miss occurs, the fetched instruction word can be placed in either of the two ways and a 
replacement algorithm is needed to determine the value stored in the way pointer (wp), which 
in turn dictates which of the ways receives the data. The replacement algorithms supported by 
ARCTangent-A4 for the multi-way set associative caches, are round robin (RR) and a pseudo-
random algorithm. In RR the wp value is incremented after each instruction cache miss, until 
the maximum number of ways is reached when the value of the wp becomes zero. In the 
pseudo-random algorithm the wp is incremented after each successful hit and so it is not easily 
possible to predict the value held in the wp when a miss occurs. 
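The two replacement policies differ only in the event that advances the way pointer, which the sketch below tries to capture. The class name WayPointer and the choice to return the pointer value before it is advanced on a miss are assumptions for illustration; the text does not state whether the hardware uses the value before or after the increment.

```python
class WayPointer:
    """Way pointer (wp) for a multi-way set associative cache (sketch)."""

    def __init__(self, ways, policy):
        if policy not in ("round_robin", "pseudo_random"):
            raise ValueError("unknown replacement policy")
        self.ways = ways
        self.policy = policy
        self.value = 0

    def _advance(self):
        # Wrap to zero once the maximum number of ways is reached.
        self.value = (self.value + 1) % self.ways

    def on_hit(self):
        # Pseudo-random: the pointer moves after every successful hit.
        if self.policy == "pseudo_random":
            self._advance()

    def on_miss(self):
        # Returns the way that receives the fetched line (assumed: the
        # value held before any increment).
        victim = self.value
        if self.policy == "round_robin":
            self._advance()
        return victim
```

Because the number of hits between two misses depends on the program, the pseudo-random pointer value at the time of a miss is hard to predict, whereas the round-robin pointer simply cycles through the ways.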
The ICache Simulator (ICS) has been specifically developed in the current work to provide 
an accurate model of all the ARC-supported ICache configurations, together with a functional 
model of the decompressor hardware. The ICS's flow diagram is depicted in Figure 6.7. 
Figure 6.7. Functional diagram of the cache simulation module
The ICS requires two files to start its execution, namely the IAT (generated during the trace
analysis) and the uncompressed-to-compressed address mapping (generated by the
compressor). The execution starts by reading a requested address (in uncompressed space) from
the IAT, which, in a physical embedded system, would be supplied by the processor. This
address is then translated into a compressed space address using the address mapping file and
the appropriate bits of the address are compared against the relevant entries in the tag RAM. If there is a
hit, an additional check is needed in order to determine whether the requested
instruction is only partially present in the current cache line. This is possible when the cache
line contains compressed instruction words, as they are not guaranteed to be aligned on 32-bit
boundaries, and so can be stored across memory word boundaries. In case of a miss, a cache 
miss counter is incremented and the tag is updated with the new address. The ICS repeats 
these steps for each IAT entry until the end of the file is reached. The number of instructions
that the ICS can process is limited only by the memory capacity and processing power of the
host machine. These parameters determine the simulation time required to process the whole
trace file. As an example of the execution time needed to obtain the results in this study, a 3 GHz
dual-processor Pentium 4 machine with 1 GB of RAM required over 50 hours to process a trace file of
2.5 million instructions.
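The main loop described above can be approximated by the following sketch of a direct-mapped cache holding compressed code. Several simplifications are assumed and are not taken from the ICS itself: the address map is a dictionary from an uncompressed address to a (compressed byte address, compressed length in bytes) pair, compressed instructions are byte-aligned, and an instruction that straddles two cache lines is handled by looking up each line it touches.

```python
def simulate_icache(iat, address_map, cache_size, line_length):
    """Count misses of a direct-mapped ICache holding compressed code.

    iat: uncompressed-space instruction addresses, in execution order.
    address_map: hypothetical dict mapping an uncompressed address to
        (compressed byte address, compressed length in bytes).
    cache_size, line_length: in bytes.
    """
    lines = cache_size // line_length
    tags = [None] * lines            # tag RAM, one entry per cache line
    misses = 0
    for address in iat:
        start, length = address_map[address]
        first_line = start // line_length
        last_line = (start + length - 1) // line_length
        # A compressed instruction may be only partially present in one
        # line, so every line it occupies is checked.
        for line_number in range(first_line, last_line + 1):
            index = line_number % lines
            tag = line_number // lines
            if tags[index] != tag:
                misses += 1          # miss: count it and update the tag
                tags[index] = tag
    return misses
```

Extending the sketch to set associative caches would mean keeping one tag list per way and selecting the victim way with a replacement policy.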
Code Compressor 
The code compression tool developed in this research is based on the compression scheme 
described in Chapter 4, and its basic architecture was described in Section 4.3.2. As mentioned 
there, the compressor tool generates a static dictionary that contains the most frequently 
occurring instructions. The analysis of the trace file allows the development of a dynamic 
dictionary that contains the most frequently executed instructions. Finally, the CCS allows the
user to select the proportion of static and dynamic instructions present in the final dictionary
that will be used by the compressor in order to generate the compressed executable file. This is
illustrated in Figure 6.8, where the interdependency of the different modules is depicted.
Figure 6.8. CCS modules interdependency
6.4. Summary 
This chapter has presented the metrics and tools used in the current study for analysing and 
comparing the performance of the compressed and uncompressed systems presented in this 
thesis. In order to automate and simplify the development and tuning of the compression 
scheme, the Code Compression Suite (CCS) was engineered. This chapter has presented an 
overview of its functionality and use, which allows the user to compress executables, generate 
specific types of dictionary, analyse trace files, obtain instruction and jump/branch-related
statistics (including BTC performance) and simulate the different types of compressed ICache 
system, all under a common and user-friendly interface. The results provided by the CCS for 
the test bench programmes introduced in Chapter 3, will be presented and discussed in 
Chapter 7. 
Chapter 7 
SYSTEM PERFORMANCE - RESULTS AND 
ANALYSIS 
The mathematical foundation and software tools described in the previous chapter are suitable 
for measuring the performance of processor-based embedded systems. In this chapter, these 
tools are used to quantify the impact that the code compression scheme proposed in this 
thesis has on the system performance. Producing accurate performance improvement figures 
for the scheme is rather difficult, as most of the parameters considered in a study of this type
(such as the processor clock frequency, memory, cache and ISA configurations, and
implementation complexity) vary significantly from one system implementation to another.
Consequently, rather than drawing general conclusions about the benefits of compression of
instruction memory, the approach taken is to collect results with the purpose of presenting the
effects of compression for a variety of typical embedded system configurations. From these
results, it is possible to identify some trends and those are also presented in this chapter. 
7.1. ICache simulation cases 
The cache size and configuration of the cache influence the performance of a computer 
system, affecting the execution time of the applications. In order to illustrate this, and how the 
proposed compression scheme can be seamlessly adapted to the particular needs of an 
embedded application, the test bench applications have been simulated with a set of 24 
different ICache configurations (see Figure 7.1), covering the ICache spectrum supported by 
ARCTangent-A4.
In total, for the eleven benchmark programs (section 3.2.2), 528 cache simulations were run, 
48 for each program, covering 24 different cache configurations for each of the compressed 
and uncompressed executables. In order to isolate the influence of the ICache configuration 
on the system performance, the compression parameters were kept constant for all the 
programs, with a fully static dictionary of 1024 entries (4 Kbytes) with codewords of 8 and 12
bits being used throughout.
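The simulation counts quoted above can be checked by enumerating the cases. The sketch below uses the cache types, sizes and line lengths of Figure 7.1 and the eleven test bench program names; the constant and function names are illustrative only.

```python
from itertools import product

CACHE_TYPES = ("direct mapped", "2-way set associative", "4-way set associative")
# Cache size in bytes mapped to the line lengths (bytes) simulated for it.
LINE_LENGTHS = {1024: (16, 32), 2048: (32, 64), 4096: (32, 64), 8192: (64, 128)}
PROGRAMS = ("basicmath", "bitcnts", "crc", "dijkstra", "display", "ifft",
            "fft", "qsort", "sha", "shine", "susan")
MEMORY_SYSTEMS = ("uncompressed", "compressed")


def cache_configurations():
    """All (cache type, cache size, line length) combinations."""
    return [(cache_type, size, line)
            for cache_type in CACHE_TYPES
            for size, lines in LINE_LENGTHS.items()
            for line in lines]


def simulation_runs():
    """One run per program, memory system and cache configuration."""
    return list(product(PROGRAMS, MEMORY_SYSTEMS, cache_configurations()))
```

This gives 3 x 8 = 24 configurations, 48 runs per program and 11 x 48 = 528 runs in total.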
Cache types: direct mapped, 2-way set associative, 4-way set associative
Cache sizes and line lengths: 1 Kbyte (16 or 32 bytes), 2 Kbytes (32 or 64 bytes), 4 Kbytes (32 or 64 bytes), 8 Kbytes (64 or 128 bytes)
Figure 7.1. ICache simulation cases
7.2. Performance parameters
Based on the metrics presented in Chapter 6, this section will introduce a new metric, 
purposefully devised to determine the overall performance improvement that the compression 
scheme proposed in this thesis provides. 
In general, the performance improvement between two processor-based systems can be 
calculated as shown in equation (6.10). 
In the particular case of an ARCTangent-A4-based system, CPI_i = 1 for all
instructions, i.e. every instruction supported by its ISA executes in a single cycle, and
therefore:
\[ \sum_{i=1}^{n} IC_i \times CPI_i = \sum_{i=1}^{n} IC_i = IC_{TOTAL}, \qquad (7.1) \]
where IC_TOTAL represents the number of instructions executed during the simulation.
As explained in Chapter 5, due to the particularities of the decompressor module, every
COF instruction requires four cycles to calculate the target address (in compressed memory
space) and to fill the decompressor's pipeline. The BTC module, introduced in Chapter 5, was
designed to mitigate these overheads, so that only BTC misses give rise to
decompressor overheads. These overheads need to be included in equation (6.10) in order to
reflect the compressed system's particular architecture, and equation (6.10) can be
rewritten as:
\[ Speedup = \frac{(IC_{TOTAL} + N_{CM} \times MP + PLH)_{ORIGINAL}}{(IC_{TOTAL} + N_{CM} \times MP + PLH + BTC_{MISSES} \times 4)_{COMPRESSED}} \qquad (7.2) \]
where N_CM is the number of ICache misses of each system, MP is the ICache miss penalty,
PLH is the number of pipeline hazard cycles, and BTC_MISSES is the number of misses in the BTC table.
The remainder of this section determines the parameters of equation (7.2) for the test bench
programs and quantifies the performance improvement of the compressed system.
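Equation (7.2) translates directly into a few lines of code. The sketch below is only an illustration of the arithmetic: the function and parameter names are hypothetical, and the miss penalty is left as a parameter because its value depends on the memory system being modelled.

```python
def execution_cycles(ic_total, cache_misses, miss_penalty, pipeline_hazards,
                     btc_misses=0, btc_miss_cost=4):
    """Cycle count of one system, following the terms of equation (7.2).

    btc_misses stays at 0 for the original (uncompressed) system; each BTC
    miss in the compressed system costs four cycles.
    """
    return (ic_total
            + cache_misses * miss_penalty
            + pipeline_hazards
            + btc_misses * btc_miss_cost)


def speedup(original_cycles, compressed_cycles):
    """Ratio of original to compressed execution cycles."""
    return original_cycles / compressed_cycles
```

A value above 1 means that the reduction in ICache miss cycles outweighs the overhead introduced by BTC misses.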
7.2.1. IC_TOTAL and Pipeline Hazards cycles
For the test bench programs, the value of IC_TOTAL (the number of instructions simulated) was
found to fall in the approximate range of one to three million. Neither the ISS nor the CCS
imposes a particular limit on the number of instructions that can be simulated; the only
limiting factors are the simulation time and memory requirements that a very large IC_TOTAL
would impose on the host machine. Note that the compression scheme does not affect the
number of instructions in the executable.
In order to calculate the number of pipeline stall cycles that occur on ARCTangent-A4, it is
necessary to determine the flags set prior to the execution of conditionally executed
instructions and to identify the instructions requiring long immediate data (LIMMs). Table 7.1
summarises the pipeline analysis results and the IC_TOTAL for each test bench application.
Table 7.1. IC_TOTAL and pipeline stall cycles for the test bench applications
                       Pipeline stall cycles due to      Number of
                           conditional                   different
Application    IC_TOTAL   instructions        LIMMs   instructions
basicmath       1015046           4747        61049           1849
bitcnts          959351            418        30211           1578
crc             1042150          20916        41987            706
dijkstra        1068193          22512        52464            897
display          791781             24       196839            456
ifft            2376902          17597       121515           2026
fft             2307810          17518       114512           1889
qsort           1579600           5561        14679            443
sha             1000343              4        75740            777
shine           2023763          15795       103727           1885
susan           3000130         107821       136649           2065
It can be observed that a substantial number of delay cycles is fairly consistently introduced
into all the test bench programs as a result of fetching long immediate data (32-bit data) stored
in the instruction memory, accounting, in many cases, for more than 5% (display, ifft, shine,
susan) of the total number of execution cycles. While fetching these data, the execution stage of
the pipeline does not perform any useful work. The number of stalls arising from flag
dependencies varies considerably, being quite significant in some cases (crc, dijkstra, susan), but
having a minimal influence in others (bitcnts, display, sha).
7.2.2. Instruction cache performance 
Instruction cache misses are often the major source of pipeline stalls in a computer
system. The cache miss ratio is application specific. If, for example, the code consists mainly
of loops containing a sufficiently small number of instructions that they can be contained in
the instruction cache, the cache miss ratio will be relatively low and cache refill times will
not have a major influence on the overall system performance. Conversely, if the code
contains few repeated sequences, the cache miss ratio will increase and the cache refill
delays might significantly impair the system performance. Both of these cases can be found in
the results of the simulated test applications (see Table 7.2). Another major factor that influences
the cache miss ratio is the instruction cache configuration, that is, its size, type and
architecture.
Table 7.2. Number of instruction cache misses for uncompressed memory systems

Cache type: Direct Mapped Cache (1 way)
Cache size                1K              2K              4K              8K
Line length       16      32      32      64      32      64      64     128
basicmath      77721   49087   26926   18428   16341   11003    7872    5606
bitcnts        16446   10936    6000    4481    4668    3563    2391    1746
crc           168572  115854   63509   52817   32030   21371     271     185
dijkstra      238232  150756  108812   74031   49282   37694   19308   13821
display          804     501     365     264     264     161     150      90
ifft          147401   95118   55030   40046   35594   24264   13237   10785
fft           204095  131149   72466   53125   43322   29485   15918   12888
qsort           3073    1781     348     239     274     166     154      95
sha             2800    1753     563     416     471     314     305     201
shine         162279   97020   46032   37279   17809   13788    8142    6426
susan         147031   89082   51650   33625   26117   16194    7533    7177

Cache type: 2 Way Set-associative Cache
Cache size                1K              2K              4K              8K
Line length       16      32      32      64      32      64      64     128
basicmath      76928   48968   26910   18232   14910    9697    6102    4023
bitcnts        10655    6923    5495    3944    4479    3348    2039    1463
crc           199934  131503   47740   52773     592     397     264     182
dijkstra      255545  166252   78722   69072   27687   23555    6517    8444
display          760     473     355     256     271     163     135      78
ifft          142436   92250   56157   41045   29919   20430    9644    8540
fft           196531  126669   75821   55861   38243   26143   10998    9683
qsort            738     445     325     213     279     176     136      78
sha             3739    2347     503     354     456     307     231     145
shine         160785   98302   47666   41422   15005   11863    3395    2331
susan         148560   92311   54092   35410   22484   15594    3759    3460

Cache type: 4 Way Set-associative Cache
Cache size                1K              2K              4K              8K
Line length       16      32      32      64      32      64      64     128
basicmath      83484   53619   24948   17274   15555    9873    6147    4040
bitcnts        10761    7353    5418    3910    4216    2879    2150    1464
crc           152971  149834     780   26709     481     377     269     192
dijkstra      263630  167594   84502   67072   17793   21662    3215    2048
display          635     412     339     233     271     161     135      78
ifft          151701  100516   53376   37769   29045   18934    9602    7651
fft           211088  140062   71164   50150   38333   50150   10619    8464
qsort            675     425     332     214     276     167     137      78
sha             4070    2543     548     307     408     262     233     144
shine         171798  107190   41039   33940   15669   12750    1776    1616
susan         157421   98013   54821   35137   25964   17863    4509    4372
In the following subsections the performance of the test bench programs run on a range of
instruction cache configurations is assessed for both the original and the compressed
memory systems.
Instruction cache misses for the uncompressed memory systems 
The number of cache misses for the original, uncompressed memory system was determined 
by running the test programs on the ISS of ARCTangent-A4 for all the use-cases specified in 
Figure 7.1. Note that although the IC_TOTAL for some of the programs is very similar, the 
number of cache misses, presented in Table 7.2, varies significantly. This confirms that, as each 
application has specific execution behaviour, the best way to determine the most suitable 
cache configuration, as well as appropriate compression parameters, is by simulation. 
From the data in Table 7.2, trends in the performance resulting from specific configurations 
can be derived. Firstly, increasing the size of the cache clearly improves the 
cache hit ratio. However, it should be noted that in some cases doubling the cache line 
length for a fixed cache size gives better results than doubling the size of the cache RAM while 
maintaining the same cache line length (see the results for bitcnts and display for a 2 kbyte 
cache with a 64 byte line length and for a 4 kbyte cache with a 32 byte line length). As far as cache 
architecture is concerned, for the smaller cache sizes (1 kbyte or 2 kbyte), the simpler direct 
mapped or two-way set associative caches would, in most cases, perform better in terms of 
cache misses than the four-way set associative cache type. For the larger cache sizes, however, 
the results were in favour of the four-way set associative cache. 
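
The dependence of the miss count on cache size, line length and associativity can be illustrated with a very small trace-driven model. The following Python sketch is not the ISS or the simulation model used in this study; the function name `count_misses`, the LRU replacement policy and the synthetic trace are all assumptions made purely for illustration.

```python
from collections import OrderedDict

def count_misses(trace, cache_bytes, line_bytes, ways):
    """Count misses of a set-associative cache (LRU replacement is assumed)."""
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        line = addr // line_bytes          # cache line that holds this address
        cache_set = sets[line % num_sets]  # set selected by the line index
        if line in cache_set:
            cache_set.move_to_end(line)    # hit: mark as most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict the least recently used line
            cache_set[line] = True
    return misses

# Hypothetical trace: a 1.5 kbyte loop of 32-bit instructions executed 10 times
trace = [pc for _ in range(10) for pc in range(0, 1536, 4)]

for size, line in [(1024, 16), (1024, 32), (2048, 32), (2048, 64)]:
    print(size, line, count_misses(trace, size, line, ways=1))
```

On this synthetic loop, and under the assumed LRU policy, a 1 kbyte two-way cache misses on every line fetch because three lines compete cyclically for each two-entry set, whereas the direct-mapped cache of the same size retains part of the loop. This is only one possible mechanism by which a simpler cache can outperform a more associative one at small capacities.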
Instruction cache misses for the compressed memory systems 
In the proposed memory compression scheme, the instructions in the instruction cache, as 
well as in the main memory, are in compressed format. This results in an effective increase in 
instruction cache capacity (and in the length of the cache line), thereby improving memory 
resource utilization. It is expected, therefore, that the cache miss results of the compressed 
memory system will improve on those of the uncompressed memory system. 
The cache performance figures for the compressed memory system were obtained by use of 
the trace-based simulation model described in Chapter 6. These are presented in Table 7.3 
and, as can be seen, the results for the different cache configurations for the test applications 
follow closely the trends found in the uncompressed memory system results. 
Table 7.3. Number of instruction cache misses for compressed memory systems 
Cache Type Direct Mapped Cache (1 way) 
Cache size 1K 2K 4K 8K 
Line lengths 16 32 32 64 32 64 64 128 
basicmath 31314 21775 10624 7717 6965 4948 2999 2150 
bitcnts 6591 4514 3801 2729 2906 2011 1140 922 
crc 73856 63188 31727 21174 31614 21083 123 97 
dijkstra 120698 96745 56500 45557 29985 20542 16391 10950 
display 360 231 154 94 117 70 57 36 
ifft 56426 38043 25080 17452 16682 12019 6621 5165 
fft 62407 45330 31588 21543 18777 13988 7484 5814 
qsort 316 202 141 100 114 72 69 42 
sha 1382 789 240 170 231 160 129 82 
shine 55189 38029 20746 16550 11882 8723 5427 5139 
susan 87741 54899 31007 24243 16629 12648 4693 4516 
Cache Type 2 Way Set-associative Cache 
Cache size 1K 2K 4K 8K 
Line lengths 16 32 32 64 32 64 64 128 
basicmath 32253 23398 12180 8652 7246 4777 2894 2170 
bitcnts 6294 4292 3110 2173 2478 1701 920 715 
crc 79108 63206 31747 36880 280 208 143 127 
dijkstra 106598 89376 30623 33136 10642 13262 1850 2873 
display 288 182 146 98 122 75 61 40 
ifft 64800 43779 25527 17577 15774 10842 4411 3959 
fft 74752 53770 30673 22509 16809 11910 5449 5091 
qsort 338 215 153 114 119 79 70 44 
sha 387 264 227 189 212 139 133 88 
shine 64581 48337 16748 14433 8659 6604 2703 1920 
susan 92995 57802 32835 25211 12156 8207 580 393 
Cache Type 4 Way Set-associative Cache 
Cache size 1K 2K 4K 8K 
Line lengths 16 32 32 64 32 64 64 128 
basicmath 30284 22455 11604 7866 4843 3206 2322 1919 
bitcnts 5736 3919 3005 2107 1428 1053 804 631 
crc 767 52782 371 296 256 209 118 77 
dijkstra 98997 79320 23300 22463 3739 8650 1628 1337 
display 271 196 115 75 107 60 57 36 
ifft 55051 37062 21908 14648 11855 8557 2978 3426 
fft 65275 47310 24981 17674 13819 9069 3943 4326 
qsort 274 182 128 96 116 75 69 43 
sha 375 246 218 147 198 127 117 73 
shine 64692 45335 14744 11368 1031 803 441 307 
susan 93198 59318 32635 23966 8822 7180 538 358 
In particular, the number of misses generally decreases as the cache size, the cache line size 
or the complexity of the cache architecture is increased, and doubling the cache line 
length often benefits performance more than doubling the cache capacity. 
Using the results presented in Table 7.2 and Table 7.3 for the number of cache misses for 
each system, the improvement in cache hit ratio achieved by the proposed compression 
scheme can be quantified as a percentage by: 
$$\text{cache hit rate improvement} = \left(1 - \frac{N_{compr}}{N_{original}}\right) \times 100, \qquad (7.3)$$
where $N_{compr}$ and $N_{original}$ are the numbers of cache misses for the compressed and 
uncompressed memory systems respectively. From the equation, it can be seen that a 60 per 
cent improvement in the cache hit ratio means that the number of misses for the compressed 
cache constitutes only 40 per cent of the total number of misses for the original system. 
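
As a worked illustration of equation (7.3), the short Python sketch below applies it to the basicmath rows as printed in the direct-mapped block of Table 7.3 and the corresponding first block of Table 7.2. The function name `hit_rate_improvement` is an assumption introduced here and is not part of the tool set described in this thesis.

```python
def hit_rate_improvement(n_compr, n_original):
    """Equation (7.3): percentage reduction in the number of cache misses."""
    return (1 - n_compr / n_original) * 100

# basicmath miss counts for the eight configurations 1K/16 ... 8K/128
original = [77721, 49087, 26926, 18428, 16341, 11003, 7872, 5606]
compressed = [31314, 21775, 10624, 7717, 6965, 4948, 2999, 2150]

improvements = [round(hit_rate_improvement(c, o), 1)
                for c, o in zip(compressed, original)]
print(improvements)
```

For these rows every configuration yields an improvement of between 55 and 62 per cent, which is in line with the observation that basicmath behaves similarly regardless of cache configuration.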
The achieved cache performance improvement for the test applications is visualised in 
Figure 7.2 and Figure 7.3. The proposed compression technique improves performance for all 
of the simulation use cases from Figure 7.1, with the degree of improvement ranging from 50 
% (half the number of cache misses) to 90 % (ten times fewer misses). For some programs, 
such as the test application sha, the cache performance improvement is more marked when the 
cache size is small and divided into more ways (as in the case of 2- and 4-way set associative 
caches). For others, such as susan, the most significant improvement is achieved for larger 
cache memories with longer line lengths. Other applications, such as basicmath, display and fft, 
show similar improvement regardless of cache configuration. Figure 7.3 presents the mean 
improvement in cache miss ratio achieved by the use of compression for the selected test 
applications. The mean improvement falls between 45 and 65 per cent for all the simulated 
cases, with the cache miss ratio of the 4-way set-associative caches being the most dramatically 
decreased, by a mean of around 60 %. 
[Figure 7.2: line plots of improvement (%) against cache and length sizes (1K/16 to 8K/128), one panel per test application, including basicmath, bitcnts and crc; series DM, 2W and 4W] 
Figure 7.2. Improvement in cache hit ratio for a number of the test applications (DM 
is the direct-mapped cache, 2W is the 2-way set associative cache and 4W is the 4-
way set associative cache) 
[Figure 7.3: line plot of mean improvement (vertical axis from 40% to 70%) against cache and length sizes (1K/16 to 8K/128); series DM, 2W and 4W] 
Figure 7.3. Average improvement in cache hit ratio 
In order to improve cache performance even further, the dictionary used for compressing 
the executable code can be built using a dynamic frequency distribution (that is, the most 
frequently executed instructions are incorporated) instead of a static one. Alternatively, a 
suitable mix of both static and dynamic dictionaries could be used. The introduction of a 
dynamic dictionary would likely lead to some degradation of the compression ratio, as the 
most frequently executed instructions are not always the ones most commonly found in static 
code. Experiments showed that using an equal combination of the most frequent static and 
dynamic instructions (50 per cent each) can typically improve the cache performance by a 
further 5 %, while decreasing the compression ratio by around 0.02. 
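
One way in which such a mixed dictionary might be assembled is sketched below in Python. This is an illustrative assumption rather than the selection procedure of the compression tool: the function `build_dictionary`, its parameters and the toy instruction mnemonics (used in place of 32-bit instruction words) are all hypothetical.

```python
from collections import Counter

def build_dictionary(static_code, executed_trace, capacity, dynamic_share=0.5):
    """Select dictionary entries from static and dynamic instruction frequencies."""
    static_rank = [ins for ins, _ in Counter(static_code).most_common()]
    dynamic_rank = [ins for ins, _ in Counter(executed_trace).most_common()]
    n_dynamic = int(capacity * dynamic_share)
    chosen = dynamic_rank[:n_dynamic]      # most frequently executed instructions
    for ins in static_rank:                # fill the rest by static frequency
        if len(chosen) >= capacity:
            break
        if ins not in chosen:
            chosen.append(ins)
    return chosen

# Hypothetical program: 'nop' is common in the binary, 'mul' dominates execution
static_code = ["nop"] * 5 + ["add"] * 4 + ["ld"] * 3 + ["st"] * 2 + ["mul"]
executed_trace = ["mul"] * 50 + ["ld"] * 30 + ["add"] * 5 + ["nop"]

print(build_dictionary(static_code, executed_trace, capacity=4))
```

With `dynamic_share=0.0` the sketch reduces to a purely static dictionary, and with `dynamic_share=1.0` to a purely dynamic one, so the parameter plays the role of the static/dynamic mix discussed above.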
The tuning of the compression parameters, such as the capacity and type of dictionary, 
allows the compression scheme to be targeted towards achieving either better performance or 
an improved compression ratio, and can be carried out by the user of the compression scheme 
using simulation. 
7.2.3. COF targets calculation 
As discussed in Chapter 6, in ARCTangent-A4 embedded systems the effect of the COF 
target calculation on execution time is minimised by the delay slot instruction execution, 
which utilises the extra cycle required for calculating the target address to fetch the next 
instruction and conditionally execute it. Consequently, the number of stall cycles has already 
been minimised and, therefore, need not be considered any further in this study. However, 
when the instruction memory is compressed, the decompressor recalculates the correct 
target addresses in the compressed address space and this process introduces stall cycles into 
the extended processor pipeline. In order to decrease the number of such stalls, a dedicated 
hardware unit, the BTC, is incorporated in the design to hold a number of target addresses 
and the necessary instructions to compensate for the delay. As the capacity of the BTC might be 
limited in practice by silicon area usage and power consumption constraints, the following 
results investigate the effect of the number of BTC entries both on the percentage of COF 
instructions that can be determined and on the overall system performance. 
The BTC is configurable both in the number of hardwired and dynamically written entries 
for a given application, and so simulations need to be carried out to determine the appropriate 
BTC configuration that gives satisfactory results for each testbench program. To determine 
the effects of altering the capacity and configuration of the BTC, the code compression suite 
can be used as described in Chapter 6. While processing the traces, the tool builds a table of 
the dynamic frequency distribution of the executed COF instructions and logs the number of 
hits and misses of the BTC for a preset number of hardwired and dynamic entries. 
The dynamic branch analysis window of the tool, shown in Figure 6.5, facilitates the BTC 
configuration definition process by showing the frequency distribution table of the target 
addresses, their number, and the coverage in terms of percentage of the total number of COF 
targets. 
Table 7.4. COF instructions and coverage of BTC hardwired entries 
Program     Total COF      COF            50%        75%        90% 
            instructions   instructions   coverage   coverage   coverage 
basicmath   159869         655            3          6          37 
bitcnts     156530         516            5          10         19 
crc         163052         216            14         22         26 
dijkstra    143511         263            18         32         50 
display     791781         149            1          2          2 
ifft        263014         674            3          11         57 
fft         366132         674            3          11         57 
qsort       7315           142            4          8          15 
sha         258186         230            4          7          9 
shine       313307         693            2          4          20 
susan       370596         719            6          18         53 
Table 7.4 presents the number of hardwired BTC entries for each test program that would 
be required to store specific fixed percentages of the total number of executed COF 
instructions. For example, for one of the test applications used (display), a single hardwired 
entry would provide around 50 per cent coverage of all the targets and two entries would 
provide coverage of at least 90 per cent. For most of the benchmarks, 16 hardwired entries 
were insufficient to cover even 75 per cent of the COF targets. The trade-off between silicon 
area usage and the number of BTC entries is effectively a trade-off between area and 
performance, as the greater the number of uncovered targets, the larger the number of 
pipeline stalls and the greater the degradation in system performance. The use of dynamic 
entries can, although rarely, improve the coverage. This can be seen in Figure 
7.4, which presents the number of BTC misses for several different BTC configurations. 
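
The coverage columns of Table 7.4 can be understood as a cumulative sum over the dynamic frequency distribution of COF targets. The Python sketch below shows one way such figures could be derived from a trace of executed targets; the function `entries_for_coverage` and the example trace are assumptions for illustration and do not reproduce the code compression suite.

```python
from collections import Counter
from itertools import accumulate

def entries_for_coverage(cof_targets, coverage):
    """Smallest number of most frequent targets that covers the given fraction."""
    counts = sorted(Counter(cof_targets).values(), reverse=True)
    total = sum(counts)
    for n, covered in enumerate(accumulate(counts), start=1):
        if covered >= coverage * total:
            return n
    return len(counts)

# Hypothetical trace in which two targets dominate the executed COF instructions
targets = [0x100] * 60 + [0x200] * 35 + [0x300] * 3 + [0x400] * 2

for share in (0.50, 0.75, 0.90):
    print(share, entries_for_coverage(targets, share))
```

A trace dominated by one or two targets, as in this hypothetical example, needs very few hardwired entries for high coverage, whereas a flatter distribution needs many more.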
[Figure 7.4: bar charts of the number of BTC misses for each BTC configuration, one panel per test program; legible panel titles include QSORT, SHA and SHINE] 
Figure 7.4. Number of BTC misses for various configurations 
The results include all combinations of 0, 4, 8 and 16 dynamic entries and 0, 4, 8, 16, 32 and 
64 hardwired entries, for each of the benchmark programs. For the programs bitcnts, sha and 
susan, a small number of dynamic entries can be used together with a few hardwired ones to 
achieve coverage almost equivalent to a much greater number of solely hardwired entries. For 
most of the programs a larger number of hardwired entries was more effective at reducing 
the number of BTC misses and consequently, in the next section, the system-level performance 
evaluation is carried out with the inclusion of 64 hardwired BTC entries. 
In general, it can be concluded that the BTC provides a significant improvement in target 
coverage, diminishing the effect of the delays due to COF target calculation when the memory 
of the embedded system is compressed. 
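
The interaction between hardwired and dynamic entries can be sketched with a simple model of the BTC. In the Python fragment below the hardwired entries hold the most frequently executed targets and the dynamic entries are managed with LRU replacement; both choices, together with the function name `btc_misses` and the example trace, are assumptions for illustration and may differ from the actual BTC design.

```python
from collections import Counter, OrderedDict

def btc_misses(cof_targets, n_hardwired, n_dynamic):
    """Count BTC misses for a given number of hardwired and dynamic entries."""
    hardwired = {t for t, _ in Counter(cof_targets).most_common(n_hardwired)}
    dynamic = OrderedDict()                 # dynamic entries, LRU order (assumed)
    misses = 0
    for target in cof_targets:
        if target in hardwired:
            continue                        # hit in a hardwired entry
        if target in dynamic:
            dynamic.move_to_end(target)     # hit in a dynamic entry
            continue
        misses += 1
        if n_dynamic > 0:
            if len(dynamic) >= n_dynamic:
                dynamic.popitem(last=False) # replace the least recently used entry
            dynamic[target] = True
    return misses

# Hypothetical program with two phases, each looping over two COF targets
targets = [1, 2] * 10 + [3, 4] * 10

for hw, dyn in [(0, 0), (2, 0), (0, 2), (4, 0)]:
    print(hw, dyn, btc_misses(targets, hw, dyn))
```

In this contrived two-phase trace, two dynamic entries perform almost as well as four hardwired ones, because the dynamic entries follow the change of working set. Traces without such phase behaviour would favour hardwired entries instead.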
7.3. System level performance 
The performance of an embedded system depends on a number of application-specific 
factors, such as instruction execution times, memory access times and cache miss ratios. The 
importance of instruction cache performance to the total system performance can vary from 
minor, for systems where cache line fills are few and refill times are short, to 
being a significant bottleneck when the refill times are longer or more fills are needed, such as 
when the code contains many branches. As the proposed compression scheme decreases the 
cache miss ratio, a suitable evaluation of the effect on system performance is to investigate a 
range of systems with different cache refill penalties. The video processing unit (PNX8550) 
was described in section 2.1.1 as an example of a complex SoC ASIC that incorporates 
elaborate networks of busses and arbitration units. The PNX's instruction cache fetches for 
the integrated RISC processor compete for bus accesses with continuous data and instruction 
memory transactions of two DSPs as well as a number of video and audio processing 
hardware blocks. For such complex architectures, an instruction cache miss might take from 
several thousand up to hundreds of thousands of cycles, depending on the load of the system. 
At the other end of the complexity scale is the ARCAngel-1 test board (section 2.1.1), in which 
the processor is implemented on an FPGA and runs at a much lower clock frequency, typically 
around 50 MHz, than an ASIC realisation would. The memory on the ARCAngel-1 board 
runs at such a clock frequency that refilling the cache line after a miss takes only 14 to 18 
processor clock cycles. There are no arbitration units and no multiple bus systems, making the 
system simpler and the operation of the processor itself more efficient than the corresponding 
PNX processor, with a reduced cache miss penalty and cache performance 
having a reduced impact on total system performance. 
The improvement in performance achieved by implementing the compressed instruction 
cache, obtained for systems with cache line refill penalties ranging from 
16 to 2048 cycles, is presented in Table 7.5. A result of 1.0 means that the performance of the 
system is not changed, while a result of 2.13 shows that the system with compression runs 
about 2.13 times faster than the original. 
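
The way in which the improvement factor grows with the refill penalty can be illustrated with a first-order timing model. The Python sketch below is an assumption made for illustration and is not the performance model used to produce Table 7.5: it treats execution time as a fixed number of base cycles plus the miss count multiplied by the penalty, with an optional term for COF stall cycles added by the decompressor. All numbers are hypothetical.

```python
def speedup(base_cycles, misses_orig, misses_compr, penalty, cof_stall_cycles=0):
    """Ratio of original to compressed execution time in a first-order model."""
    t_orig = base_cycles + misses_orig * penalty
    t_compr = base_cycles + misses_compr * penalty + cof_stall_cycles
    return t_orig / t_compr

# Hypothetical program: one million base cycles, misses reduced by 60 per cent
for penalty in (16, 64, 512, 2048):
    print(penalty, round(speedup(1_000_000, 10_000, 4_000, penalty), 2))

# Hypothetical program with very few misses but some extra COF stall cycles
print(round(speedup(1_000_000, 100, 50, 16, cof_stall_cycles=5_000), 3))
```

In this model the factor approaches the ratio of the two miss counts as the penalty grows, and it can fall below 1.0 when there are few misses to save and the added COF stall cycles dominate.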
The results obtained for the test applications demonstrate that the instruction memory 
compression technique can significantly improve overall system performance. Most of the test 
programmes show a steady increase in the improvement factor as the cache refill time 
increases. There are some cases, such as the test programme display, for which only a minimal 
performance improvement could be achieved. The main reason for this can be either a very low 
cache miss ratio in the original system, or a relatively high number of COF instructions in 
the code. As display has a small number of cache misses (as low as 76 in the original system), 
even though the ICache performance has increased by around 50 %, there is little scope for 
improving performance and, in particular, for compensating for the COF target calculation delays 
introduced by the decompressor. 
Table 7.5. Performance improvement for systems with different numbers of ICache refill cycles 
                              Cache miss penalties (cycles) 
Program                       16      64      512     2048 
Direct mapped cache 
basicmath                     1.13    1.47    2.13    2.33 
bitcnts                       1.02    1.12    1.56    1.86 
crc                           1.23    1.52    1.80    1.85 
dijkstra                      1.15    1.45    1.69    1.73 
display                       1.00    1.01    1.06    1.20 
ifft                          1.14    1.51    2.14    2.31 
fft                           1.14    1.52    2.17    2.34 
qsort                         1.05    1.19    2.84    3.76 
sha                           1.01    1.02    1.18    1.49 
shine                         1.13    1.45    2.12    2.32 
susan                         1.02    1.17    1.47    1.56 
2-way set-associative cache 
basicmath                     1.11    1.41    1.98    2.13 
bitcnts                       1.01    1.08    1.39    1.62 
crc                           1.25    1.59    1.95    2.02 
dijkstra                      1.21    1.66    2.10    2.18 
display                       1.00    1.01    1.06    1.21 
ifft                          1.12    1.43    1.96    2.09 
fft                           1.12    1.44    1.97    2.10 
qsort                         1.01    1.04    1.29    1.65 
sha                           1.01    1.05    1.35    2.10 
shine                         1.12    1.43    2.06    2.25 
susan                         1.02    1.17    1.49    1.59 
4-way set-associative cache 
basicmath                     1.14    1.50    2.23    2.45 
bitcnts                       1.02    1.18    1.49    1.81 
crc                           1.39    2.79    4.67    5.61 
dijkstra                      1.26    3.09    2.46    2.58 
display                       1.00    1.01    1.05    1.19 
ifft                          1.15    2.13    2.32    2.54 
fft                           1.15    2.16    2.33    2.55 
qsort                         1.01    1.05    1.32    1.75 
sha                           1.01    1.06    1.38    2.22 
shine                         1.14    1.97    2.38    2.66 
susan                         1.04    1.60    1.59    1.71 
Consequently, systems with relatively low cache miss ratios and systems with a very large 
number of COF calculation delays might benefit from compression only in terms of memory 
area, while only a limited improvement in performance can be achieved. 
Overall, it can be seen that the mean improvement factor is 1.10 for a cache miss penalty of 
16 cycles, 1.46 for a penalty of 64 cycles, 1.87 for a penalty of 512 cycles, and 2.14 for a penalty 
of 2048 cycles. Consequently, there will be some improvement even for less complex systems that 
have a small number of cache refill cycles, especially if the compression parameters are tuned 
to target performance. For the more complex SoC architectures, in which cache refills may 
take thousands of cycles, the results show that the performance could be several times better 
than when using the uncompressed cache. Further improvement in performance can be 
achieved by (a) the use of a larger dictionary, which would ensure that higher compression ratios 
are achieved and so result in larger segments of code being contained within a single cache 
line, (b) ensuring the dictionary is better configured for a specific target by adopting dynamic 
trace analysis, or (c) incorporating a larger BTC so that more of the delays that result from 
COF target calculation can be avoided. 
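
The mean improvement factors quoted above for the two extreme penalties can be checked against the entries of Table 7.5. The Python sketch below averages the 16-cycle and 2048-cycle columns over all 33 rows as printed (11 programs in each of the three blocks of the table); the variable names are introduced here for illustration only.

```python
from statistics import mean

# 16-cycle column of Table 7.5, three blocks of 11 programs each
penalty_16 = [
    1.13, 1.02, 1.23, 1.15, 1.00, 1.14, 1.14, 1.05, 1.01, 1.13, 1.02,
    1.11, 1.01, 1.25, 1.21, 1.00, 1.12, 1.12, 1.01, 1.01, 1.12, 1.02,
    1.14, 1.02, 1.39, 1.26, 1.00, 1.15, 1.15, 1.01, 1.01, 1.14, 1.04,
]

# 2048-cycle column of Table 7.5, same row order
penalty_2048 = [
    2.33, 1.86, 1.85, 1.73, 1.20, 2.31, 2.34, 3.76, 1.49, 2.32, 1.56,
    2.13, 1.62, 2.02, 2.18, 1.21, 2.09, 2.10, 1.65, 2.10, 2.25, 1.59,
    2.45, 1.81, 5.61, 2.58, 1.19, 2.54, 2.55, 1.75, 2.22, 2.66, 1.71,
]

print(round(mean(penalty_16), 2))    # mean factor for a 16-cycle penalty
print(round(mean(penalty_2048), 2))  # mean factor for a 2048-cycle penalty
print(max(penalty_2048))             # peak improvement factor in the table
```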
7.4. Summary 
This chapter has presented the results of studies investigating the effects of cache compression 
on both processor and system performance. A study of performance-related parameters, such 
as the CPI and instruction count, was followed by an investigation of the COF target coverage 
provided by the BTC to compensate for the target address calculation. An improvement in cache 
performance was used as an indicative measure of the effect of implementing compression, and the 
simulations showed a mean reduction of more than 50 per cent in cache misses. 
To characterise the improvement in 
performance for a range of practical processor-based systems, figures were obtained for a 
range of cache miss penalties and cache types. The results showed that, in the majority of the cases, 
the compression scheme implemented in this study can considerably improve system 
performance, reaching a peak of 5.6 times faster than the original 
programme execution (see Table 7.5). The limitations of the scheme were analysed and cases where 
compression might lead to deterioration of performance were considered. 
Chapter 8 
CONCLUSIONS 
This chapter reviews the aims of the thesis, summarises the results of each chapter, analyses 
how well these results meet the thesis objectives and discusses the limitations of the current 
research in order to identify potential future work. 
8.1. Summary of aims 
In recent decades the features offered by consumer electronics products have become a major 
differentiator in the market. This demand for additional functionality often requires 
commensurate development of ever more complex applications, which in turn require larger 
memories and better performing systems on which to run; all having an impact on the cost of 
the product. The main aim of this study is to address these issues with respect to 32-bit RISC 
processors. RISC processors are a popular choice in embedded design due to their small die 
size, high performance architecture and low power consumption. A major disadvantage of 
RISC processors compared to their CISC counterparts is code inefficiency, which, due to 
additional memory requirements, can significantly increase cost. 
In previous work, compression techniques have been applied to improve RISC memory 
performance, yet there is no holistic solution available. This thesis aimed to provide such a 
solution, by systematically studying the use of compression in RISC processors in order to 
improve system performance. Furthermore, inspired by the modular approach taken in the 
design of the processor used for proof-of-concept, ARCTangent-A4, the solution was 
developed as an independent add-in module which can be easily configured and incorporated 
into the processor pipeline. 
8.2. Research overview 
At the beginning of this work the research scope and objectives were identified (Chapter 1) 
and a literature review (Chapter 2) was performed. The literature review focused on the 
following aspects. 
• Identification of techniques for reducing code redundancy and their analysis in terms 
of efficiency and performance. At this stage, compression was identified as the most 
suitable approach to provide code reduction. 
• An analysis of algorithms used for code compression identified which were the most 
suitable for implementation in embedded environments. 
• For high-performance operation on a target, evidence from the literature was that the 
decompression should be implemented in hardware rather than in software. 
It became apparent from the initial stage that a unified approach to the problem of 
reducing embedded code redundancy can benefit system performance. Firstly, existing work in the 
field targeted only certain aspects of the problem, with some studies concentrating only on 
algorithm development and others only on power consumption reduction. Secondly, a large 
number of solutions found in the literature were unsuitable for embedded systems, such as the 
use of large look-up tables to map compressed to uncompressed address spaces or the placing 
of the decompressor after the ICache, which seriously impacts the performance of the system. 
To substantiate the basis of the current work, the entropy of embedded code for a 
commercial RISC processor, ARCTangent-A4, was studied in Chapter 3. For this purpose, a 
number of applications from MiBench were selected in order to give a representative sample 
from embedded market segments. The programs were compiled for ARCTangent-A4 and 
their entropies at different levels of resolution (instruction word, byte, etc.) were studied as 
well as other aspects such as the effect of dictionary size on compression ratio. Advanced data 
compression algorithms were then applied in order to verify their suitability for resolving 
embedded code redundancy problems. The results showed that a number of algorithms such 
as PPMZ almost fully achieve the code's entropy and so entirely remove the redundancy. 
Chapter 4 described a dictionary, class-based compression algorithm, especially developed 
for this study, to fully comply with the restriction of embedded systems implementation, 
giving compression ratio results comparable to some of the most advanced text compression 
algorithms. The compression tool takes the linked executable as input, determines the most 
suitable number of classes and dictionary size and generates the compressed executable and 
the other necessary components to achieve correct execution of the compressed code on the 
target system at run time. To obviate the need for large look-up tables for address mapping, 
novel approaches were developed to deal with the various classes of COF instructions. Taking 
into account the dictionary overheads, the achieved mean compression ratio for the test bench 
programmes was around 0.55. 
Once the application code has been compressed off-line, it is loaded in the instruction 
memory and run on the embedded target. The processor and its memory interface have 
remained unaltered in the current work and are unaware of the addition of the newly designed 
dedicated hardware decompressor that is able to carry out decompression in real-time. The 
functionality, architecture and implementation of the decompressor were presented in Chapter 
5. It was developed as a soft IP plug-in module, written in VHDL, which, in a practical 
solution, can be synthesised together with the ARCTangent-A4 processor. The decompressor 
is a configurable extension of the original pipeline of the processor, adding two additional 
stages and residing between the processor and the ICache. The decompressor in conjunction 
with the cache handles the entire functionality related to compression, including translation 
between uncompressed and compressed memory address spaces and decoding of dictionary 
information. The decompressor was synthesised for two different FPGA technologies, namely 
Xilinx and Altera. As the decompressor is configurable, its area requirement depends to a large 
extent on the nature of the particular application. For Altera's Cyclone device, the clock 
frequency achieved was 56 MHz, which is similar to that which can be obtained for the 
processor alone on FPGA devices. 
As one of the main objectives of this research is to produce a solution which not only 
does not degrade the system's performance but generally improves it, the performance 
efficiency of the compressed memory system has been one of the main considerations 
throughout the whole design process. In order to evaluate the achievement of this objective, 
Chapter 6 described the development of a trace analysis tool and an ICache simulation model, 
both operating in conjunction with the compressor. Such a tool set was able to simplify the 
selection of the configurable compression parameters and provided the necessary results for 
calculating performance improvements, both for the ICache and the full system. 
Chapter 7 presented results obtained from the benchmark programs for ICache 
configurations of different capacity, type and line length, and they demonstrated that the 
proposed compression scheme can lower the cache miss ratio by between 45 and 65 per cent 
on average. A decrease in cache miss ratio translates into an improvement in system 
performance. Systems with shorter refill times, particularly when this is combined with a 
small number of cache misses, do not show a substantial improvement in performance when 
the cache is compressed. However, for larger numbers of cache misses and longer refill times, 
the performance improvement gains significance; in particular, large systems with severe 
penalties, such as SoC architectures with complex bus hierarchies and multiple arbitration 
units in which cache misses cost thousands of cycles, may execute several times faster than a 
system without compression. 
8.3. Summary of achievements 
This research has addressed the problem of code size inefficiency in 32-bit RISC processors. It 
has focused on the development of a holistic solution, which has considered not only code 
density, but also performance improvements and power efficiency. The original contributions 
of the thesis are as follows. 
• Development of a dictionary class-based compression technique that is able to provide 
a substantial saving in the cost of relatively expensive cache memory, while 
complying with the constraints imposed by embedded environments. The mean 
compression ratio for the test applications was about 0.51 with 4 Kbytes dictionaries. 
• Implementation of a set of simple-to-use software tools, which allow the developer to 
incorporate memory compression in a fast and efficient manner. 
• Development of hardware decompressor IP that extends the processor pipeline and 
performs all related decompression functions, such as decoding and address 
translation. The module is configurable and allows tuning of several compression 
parameters to deliver a suitable balance between higher compression ratio and higher 
performance for the targeted application. 
• A novel approach for resolving COF targets in the compressed address space has been 
implemented, which provides a tailored solution for each type of COF instruction 
and eliminates the need for large address look-up tables, as used in earlier work. 
Furthermore, the COF calculation delays which would normally hinder performance 
improvement are greatly reduced by operations of the BTC. 
• A decrease in the ICache miss ratio of embedded applications of about 55 per cent on
average has been produced as a result of the above which, especially for systems with
larger cache refill penalties, translates into substantial overall performance
improvements.
• A modular approach has been followed to ensure that the compression hardware is a 
plug-in solution and no changes, either to the memory system, or to the processor 
architecture and the supplied development tools, are required. 
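The link between the reported miss-ratio reduction and overall performance can be illustrated with a simple cycles-per-instruction model. This is only a sketch under stated assumptions: the base CPI of 1.0, the 2 per cent starting miss ratio and the refill penalties are hypothetical values chosen for illustration, one instruction fetch per instruction is assumed, and any latency added by the decompressor is ignored. Only the 55 per cent reduction comes from the results summarised above.

```python
def effective_cpi(base_cpi, miss_ratio, miss_penalty):
    """Average cycles per instruction, assuming one ICache access per instruction."""
    return base_cpi + miss_ratio * miss_penalty


def speedup(miss_ratio, miss_penalty, reduction=0.55, base_cpi=1.0):
    """Speed-up obtained when the miss ratio is cut by `reduction` (0.55 = 55 per cent)."""
    before = effective_cpi(base_cpi, miss_ratio, miss_penalty)
    after = effective_cpi(base_cpi, miss_ratio * (1.0 - reduction), miss_penalty)
    return before / after


# Hypothetical refill penalties, from a tightly coupled memory to a complex SoC bus.
for penalty in (10, 100, 1000):
    print(f"penalty {penalty:5d} cycles: speed-up {speedup(0.02, penalty):.2f}x")
```

Under this model the benefit grows with the refill penalty and is bounded by 1 / (1 - reduction), which is why the gain is most visible in systems where a miss is expensive.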
8.4. Future work 
The work presented in this thesis can be developed further in several ways. 
• The research has focused on a single RISC architecture. Although a certain level of
generalisation has been provided for in the current work, other aspects that
differentiate RISC architectures could be covered. For example, the opcode instruction
field of ARCTangent-A4 is only 5 bits long, while that of MIPS is 6 bits. Having the
ability to support different opcode lengths, addressing modes and other ISA-specific
parameters would make the solution more general and applicable to any RISC machine.
• More sophisticated algorithms for COF management could be analysed in order to
further improve performance. 
• The transfer of data between different levels of memory hierarchy is one of the most 
power-hungry operations in embedded systems. As the use of a compressed cache 
decreases the number of transfers, it is likely that the power dissipation of the system 
can be reduced. The benefits in terms of reduction in power consumption could be 
quantified in future studies. 
• To further simplify the design flow, the CCS software tool could be extended to 
automatically configure the VHDL modules that make up the decompressor. 
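As an illustration of the first point, supporting different opcode lengths could start from a field extractor whose width is a parameter rather than a constant. The sketch below is hypothetical and is not taken from the CCS tool: the function name is invented, and it assumes that the opcode occupies the most significant bits of a 32-bit instruction word, which is the case for MIPS and is assumed here for ARCTangent-A4 as well.

```python
def opcode(word, opcode_bits, word_bits=32):
    """Return the top `opcode_bits` bits of an instruction word.

    Assumes the opcode field sits in the most significant bits of the word.
    """
    if not 0 < opcode_bits <= word_bits:
        raise ValueError("opcode_bits must be between 1 and word_bits")
    word &= (1 << word_bits) - 1          # discard anything above the word width
    return word >> (word_bits - opcode_bits)


# Hypothetical per-ISA configuration: only the field width differs.
OPCODE_BITS = {"ARCTangent-A4": 5, "MIPS": 6}

mips_lw = 0x8C010004                      # MIPS encoding of: lw $1, 4($0)
print(hex(opcode(mips_lw, OPCODE_BITS["MIPS"])))
```

Keeping the width in a per-ISA table means that the rest of a compression flow could stay unchanged when a different RISC target is selected.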
8.5. Summary 
This chapter has concluded this thesis by reviewing the research effort and by summarising its 
objectives and achievements. Finally, a number of potential future developments were 
outlined. 
REFERENCES 
[1] J. L. Hennessy, D. A. Patterson, Computer architecture: a quantitative approach, Morgan
Kaufmann Publishers, Inc., 2002.
[2] M. Barr, Programming Embedded Systems in C and C++, O'Reilly, 1999.
[3] Arm, An Introduction to Thumb, Advanced RISC Machines Ltd., March 1995.
[4] S. Segars, K. Clarke, L. Goudge, Embedded Control Problems, Thumb, and the ARM7TDMI,
IEEE Micro, Vol. 15, Issue 5, p. 22-30, October 1995.
[5] K. Kissell, Mips16: High-Density MIPS for the Embedded Market, Silicon Graphics Mips
Group, 1997.
[6] ARC International, ARCompact Technical Backgrounder, www.arccores.com
[7] M. Game, A. Booker, CodePack: Code Compression for Power PC Processors, International
Business Machines (IBM) Corporation, 1998.
[8] P. Koopman, Embedded Systems Design Issues (the Rest of the Story),
http://www-2.cs.cmu.edu/~koopman/des_s99/
[9] ARC International, ARCAngel-1 test board. Version 5.0, 17 December 2000.
[10] Philips Semiconductors, Nexperia PNX8550 Data Sheet, December 2003.
[11] C. E. Shannon, A Mathematical Theory of Communication, The Bell System Technical
Journal, Vol. 27, pp. 379-423, 623-656, July, October 1948.
[12] T. C. Bell, J. G. Cleary, I. H. Witten, Text Compression, Prentice Hall, Englewood
Cliffs, New Jersey, USA, 1990.
[13] C. Lefurgy and T. Mudge, Fast software-managed code decompression, International Workshop
on Compiler and Architecture Support for Embedded Systems, July 1999.
[14] D. A. Huffman, A method for construction of minimum redundancy codes, Proceedings of the
IRE, Vol. 40, Issue 9, p. 1098-1101, September 1952.
[15] S. Debray, W. Evans, R. Muth, B. de Sutter, Compiler techniques for code compaction, ACM 
Transactions on Programming Languages and Systems, Vol. 22, Issue 2, p. 378-415, 
2000. 
[16] C. W. Fraser, E. W. Myers, A. L. Wendt, Analysing and compressing assembly code,
Proceedings of the ACM SIGPLAN '84 Symposium on Compiler Construction,
SIGPLAN Notices Vol. 19, No. 8, p. 117-121, June 1984.
[17] K. D. Cooper, N. McIntosh, Enhanced code compression for embedded RISC processors,
Proceedings of Conference on Programming Languages Design and Implementation, 
p. 139-149, May 1999. 
[18] G. Araujo, P. Centoducatte, R. Azevedo, R. Pannain, Expression tree based algorithms for 
code compression on embedded RISC architectures, IEEE Transactions on VLSI Systems, Vol. 
8, p. 530-533, October 2000. 
[19] Motorola, MPC565 Product Brief, Ver. 3., February 2002. 
[20] H. Lekatsas, J. Henkel, W. Wolf, Code compression for low power embedded system design,
Proceedings of the 37th Conference on Design Automation, p. 294-299, 2000.
[21] H. Lekatsas, W. Wolf, SAMC: A code compression algorithm for embedded processors, IEEE
Transactions on Computer Aided Design of Integrated Circuits and Systems, Vol. 18,
p. 1689-1701, December 1999.
[22] Y. Yoshida, B. Song, H. Okuhata, T. Onoye, I. Shirakawa, An object compression approach to
embedded processors, Proceedings of the 1997 international symposium on Low power
electronics and design, p. 265-268, August 1997. 
[23] C. Lefurgy, P. Bird, I. C. Chen, T. Mudge, Improving code density using compression techniques, 
Proceedings of 30th Annual IEEE/ACM International Symposium on
Microarchitecture, p. 194-203, December 1997.
[24] S. Liao, S. Devadas, and K. Keutzer, Code density optimization for embedded DSP processors
using data compression techniques, Proceedings of the 16th Conference on Advanced
Research in VLSI, p. 272, March 1995.
[25] P. Bird, T. Mudge, An instruction stream compression technique, Technical Report CSE-TR-
319-96, EECS Department, University of Michigan, November 1996. 
[26] D. Kirovski, J. Kin, W. H. Mangione-Smith, Procedure based program compression, 
Proceedings of the 30th Annual ACM/IEEE International Symposium on 
Microarchitecture, p. 204-213, December 1997. 
[27] A. Wolf and A. Chanin, Executing Compressed Programs on an Embedded RISC Architecture, 
SIGMICRO Newsletter, Vol. 23, pp. 81-91, December 1992 
[28] T. M. Kemp, R. M. Montoye, J. D. Harper, J. D. Palmer, D. J. Auerbach, Decompression Core
For Power PC, IBM Journal of Research and Development, Vol. 42, Number 6, Nov
1998.
[29] S. Y. Larin, T. M. Conte, Compiler-driven cached code compression schemes for embedded ILP
processors, Proceedings of the 32nd Annual ACM/IEEE International Symposium on
Microarchitecture, p. 82-92, Haifa, Israel, November 1999.
[30] I. Page, Compiling software to gates, Embedded Systems Programming, no. 1, p. 14-21,
2005.
[31] Xilinx, Spartan-IIE 1.8V FPGA family: Complete data sheet, ds077-3 (v2.1), Xilinx, July 2003.
[32] Xilinx, Virtex-II platform FPGAs: Complete data sheet, ds031 (v3.4), Xilinx, March 2005.
[33] Altera Corporation, Cyclone Device Handbook, Volume 1, C51002-1.4, August 2005.
[34] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, R. B. Brown,
MiBench: A free, commercially representative embedded benchmark suite, Proceedings of the
IEEE 4th Annual Workshop on Workload Characterization, Vol. 00, p. 3-14,
December 2001.
[35] Gabriel Bouvigne, Shine MP3 encoder, http://www.mp3-tech.org.
[36] J. Ziv, A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Transactions
on Information Theory, Vol. 23, Issue 3, p. 337-343, May 1977.
[37] Hi/fn, How LZS Compression Works, AN-00009-00 Application Note, www.hifn.com
[38] ECMA, Data Compression for Information Exchange - Adaptive Coding with Embedded
Dictionary - DCLZ Algorithm, Standard ECMA-151, June 1991.
[39] ECMA, Adaptive Lossless Data Compression Algorithm (ALDC), Standard ECMA-222, June
1995.
[40] C. Bloom, PPMZ: High Compression Markov Predictive Coder,
http://www.cbloom.com/src/ppmz.html
[41] J. L. Núñez, S. Jones, Gbit/Second Lossless Data Compression Hardware, IEEE Transactions
on VLSI Systems, Vol. 11, No. 3, p. 499-510, June 2003.
[42] ARC International, ARCTangent Programmers Reference. Version 6.4, 19 December 2000.
[43] ARC International, ARCTangent-A4 Cache Systems Reference. Version 3.2, 27 December 
2001. 
Appendix A 
EXECUTABLE AND LINKING FORMAT (ELF) 
The Executable and Linking Format was originally developed and published by UNIX System 
Laboratories (USL) as a part of the Application Binary Interface (ABI). The Tool Interface 
Standards committee (TIS) has selected the evolving ELF standard as a portable object file
format that works on 32-bit Intel Architecture environments for a variety of operating
systems. The following three main types of object files comply with ELF.
• A relocatable file holds code and data suitable for linking with other object files to
create an executable or shared object file.
• An executable object file holds a program suitable for execution; the file specifies how
exec(BA_OS) creates a program's process image.
• A shared object file holds code and data suitable for linking in two contexts. First, the
link editor may process it with other relocatable and shared object files to create
another object file. Second, the dynamic linker combines it with an executable file and
other shared objects to create a process image.
As the objective of this research was to compress executable code, the structure of an
executable file is presented in Figure A.1 and a brief description of its main building blocks
follows.
The ELF header resides at the beginning and describes the file's organization. It gives
information on the type of machine used, the actual sizes of the other tables and their offsets from
the beginning of the file, the number of sections and segments, different processor-specific
flags, and the virtual address to which the system first transfers control, thus starting the
process. A section header table, which is optional for executable files, contains information
about the file's sections. It is an array of structures, which hold information for all the
sections in the ELF file. The program header table tells the system how to create a process
image. Its structure is very similar to that of the section header table, but it describes the file's segments.
A segment is a block of information that holds all sections of the same data type.
ELF header table
Program header table
Segment 1
Segment 2
..............
Section header table
Figure A.1. Structure of ELF file
The sections of main interest are those that hold executable instructions, namely .text,
.init and .fini, which are usually combined into a single segment. The ELF and section
header tables have to be changed after compression in order to reflect the new sizes and
offsets of the sections and segments. The program header table and the actual relocation
information have to be adapted as well.
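The header fields described in this appendix can be read with a few lines of code. The following is a minimal sketch, not the tool used in this work: it assumes a 32-bit ELF file and the standard 52-byte header layout, uses only Python's struct module, and the function and tuple names are chosen here for illustration.

```python
import struct
from collections import namedtuple

# Field names follow the ELF specification; the tuple type itself is illustrative.
Elf32Header = namedtuple(
    "Elf32Header",
    "e_ident e_type e_machine e_version e_entry e_phoff e_shoff e_flags "
    "e_ehsize e_phentsize e_phnum e_shentsize e_shnum e_shstrndx",
)

ELF32_LAYOUT = "16sHHIIIIIHHHHHH"   # 16 + 2*2 + 5*4 + 6*2 = 52 bytes
ELF32_SIZE = 52


def parse_elf32_header(data):
    """Decode the ELF header found at the start of `data` (32-bit files only)."""
    if len(data) < ELF32_SIZE or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 1:                 # e_ident[EI_CLASS]: 1 means 32-bit objects
        raise ValueError("only 32-bit ELF files are handled in this sketch")
    # e_ident[EI_DATA]: 1 means little-endian, 2 means big-endian
    byte_order = "<" if data[5] == 1 else ">"
    fields = struct.unpack(byte_order + ELF32_LAYOUT, data[:ELF32_SIZE])
    return Elf32Header(*fields)
```

Here e_entry is the virtual address to which control is first transferred, e_phoff and e_shoff are the offsets of the program and section header tables, and e_phnum and e_shnum give the number of entries in each. These are the values that would need rewriting once section sizes and offsets change after compression.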
Appendix B 
COMPRESSION RATIO RESULTS 
This appendix presents the full set of compression results gathered from the tests described in 
Section 3.3. 
Table B.1 Compression ratio results for ALDC
            History buffer size (bytes)
Program     512     1024    2048    4096
basicmath   0.476   0.457   0.437   0.437
bitcnts     0.469   0.448   0.429   0.429
crc         0.498   0.463   0.442   0.442
dijkstra    0.483   0.457   0.427   0.427
display     0.442   0.427   0.418   0.418
fft         0.474   0.455   0.435   0.435
qsort       0.481   0.452   0.422   0.422
sha         0.469   0.450   0.429   0.429
shine       0.498   0.481   0.469   0.469
susan       0.474   0.450   0.427   0.427
average     0.476   0.454   0.434   0.428
Table B.2 Compression ratio results for LZS
            Input block size (bytes)
Program     32      64      128     256     512     1024    2048    4096
basicmath   0.952   0.840   0.746   0.654   0.585   0.529   0.495   0.467
bitcnts     0.943   0.820   0.719   0.637   0.575   0.521   0.481   0.457
crc         0.943   0.826   0.730   0.649   0.581   0.532   0.490   0.474
dijkstra    0.952   0.833   0.735   0.658   0.592   0.535   0.493   0.467
display     0.893   0.769   0.676   0.588   0.526   0.488   0.457   0.439
fft         0.935   0.820   0.725   0.641   0.578   0.529   0.493   0.465
qsort       0.952   0.840   0.741   0.658   0.592   0.535   0.493   0.461
sha         0.935   0.820   0.725   0.641   0.578   0.526   0.485   0.461
shine       0.893   0.787   0.699   0.633   0.578   0.538   0.505   0.488
susan       0.962   0.847   0.741   0.662   0.588   0.532   0.490   0.465
average     0.9360  0.8204  0.7237  0.6421  0.5773  0.5264  0.4881  0.4643
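The dependence of the ratio on input block size seen in Table B.2 can be reproduced in spirit with a short experiment. The sketch below is an illustration only and makes several assumptions: it uses zlib (Deflate) from the Python standard library as a stand-in rather than LZS, it takes the compression ratio to be compressed size divided by original size, as the values in these tables suggest, and it runs on synthetic data built from a small vocabulary of 32-bit words instead of the benchmark binaries. All function names are hypothetical.

```python
import random
import zlib


def blockwise_ratio(data, block_size):
    """Compress `data` in independent blocks and return compressed/original size."""
    compressed = 0
    for start in range(0, len(data), block_size):
        compressed += len(zlib.compress(data[start:start + block_size], 9))
    return compressed / len(data)


def synthetic_code(n_words=4096, n_distinct=64, seed=0):
    """Build a byte string of 32-bit words drawn from a small fixed vocabulary.

    This only mimics the repetitiveness of instruction streams; it is not real code.
    """
    rng = random.Random(seed)
    vocabulary = [rng.getrandbits(32).to_bytes(4, "little") for _ in range(n_distinct)]
    return b"".join(rng.choice(vocabulary) for _ in range(n_words))


data = synthetic_code()
for block_size in (32, 64, 128, 256, 512, 1024, 2048, 4096):
    print(f"{block_size:5d} bytes: ratio {blockwise_ratio(data, block_size):.3f}")
```

Because each block is compressed on its own, a small block gives the compressor little history to match against and carries a fixed per-block overhead, so the ratio improves as the block grows. The absolute figures produced by this stand-in are not comparable with those in the table.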
Table B.3 Compression ratio results for X-MatchPro for dictionary size of 128 bytes
            Input block size (bytes)
Program     32      64      128     256     512     1024    2048    4096
basicmath   0.863   0.800   0.728   0.669   0.585   0.542   0.518   0.507
bitcnts     0.867   0.789   0.712   0.673   0.604   0.555   0.540   0.504
crc         0.858   0.796   0.723   0.653   0.591   0.548   0.527   0.519
dijkstra    0.870   0.795   0.730   0.665   0.608   0.541   0.516   0.505
display     0.811   0.743   0.691   0.612   0.553   0.510   0.496   0.484
fft         0.841   0.789   0.740   0.671   0.602   0.556   0.535   0.527
qsort       0.863   0.801   0.738   0.659   0.589   0.538   0.512   0.500
sha         0.852   0.781   0.715   0.649   0.597   0.534   0.509   0.498
shine       0.821   0.766   0.715   0.659   0.602   0.567   0.550   0.536
susan       0.878   0.809   0.725   0.668   0.589   0.541   0.516   0.503
average     0.852   0.787   0.722   0.658   0.592   0.543   0.522   0.508
Table B.4 Compression ratio results for X-MatchPro for dictionary size of 256 bytes
            Input block size (bytes)
Program     32      64      128     256     512     1024    2048    4096
basicmath   0.863   0.800   0.728   0.669   0.585   0.541   0.505   0.487
bitcnts     0.867   0.789   0.712   0.673   0.604   0.551   0.523   0.484
crc         0.858   0.796   0.723   0.653   0.591   0.547   0.511   0.500
dijkstra    0.870   0.795   0.730   0.665   0.608   0.539   0.502   0.484
display     0.811   0.743   0.691   0.612   0.553   0.509   0.484   0.471
fft         0.841   0.789   0.740   0.671   0.602   0.554   0.522   0.506
qsort       0.863   0.801   0.738   0.659   0.589   0.536   0.501   0.478
sha         0.852   0.781   0.715   0.649   0.597   0.532   0.497   0.480
shine       0.821   0.766   0.715   0.659   0.602   0.564   0.538   0.522
susan       0.878   0.809   0.725   0.668   0.589   0.540   0.503   0.484
average     0.852   0.787   0.722   0.658   0.592   0.541   0.509   0.490
Table B.5 Compression ratio results for X-MatchPro for dictionary size of 512 bytes
            Input block size (bytes)
Program     32      64      128     256     512     1024    2048    4096
basicmath   0.863   0.800   0.728   0.669   0.585   0.541   0.504   0.481
bitcnts     0.867   0.789   0.712   0.673   0.604   0.551   0.521   0.478
crc         0.858   0.796   0.723   0.653   0.591   0.547   0.510   0.495
dijkstra    0.870   0.795   0.730   0.665   0.608   0.539   0.502   0.478
display     0.811   0.743   0.691   0.612   0.553   0.509   0.483   0.469
fft         0.841   0.789   0.740   0.671   0.602   0.554   0.522   0.501
qsort       0.863   0.801   0.738   0.659   0.589   0.536   0.500   0.473
sha         0.852   0.781   0.715   0.649   0.597   0.532   0.496   0.475
shine       0.821   0.766   0.715   0.659   0.602   0.564   0.537   0.516
susan       0.878   0.809   0.725   0.668   0.589   0.540   0.503   0.479
average     0.852   0.787   0.722   0.658   0.592   0.541   0.508   0.485
Table B.6 Compression ratio results for X-MatchPro for dictionary size of 1024 bytes
            Input block size (bytes)
Program     32      64      128     256     512     1024    2048    4096
basicmath   0.863   0.800   0.728   0.669   0.585   0.541   0.504   0.481
bitcnts     0.867   0.789   0.712   0.673   0.604   0.551   0.521   0.478
crc         0.858   0.796   0.723   0.653   0.591   0.547   0.510   0.495
dijkstra    0.870   0.795   0.730   0.665   0.608   0.539   0.502   0.478
display     0.811   0.743   0.691   0.612   0.553   0.509   0.483   0.469
fft         0.841   0.789   0.740   0.671   0.602   0.554   0.522   0.500
qsort       0.863   0.801   0.738   0.659   0.589   0.536   0.500   0.472
sha         0.852   0.781   0.715   0.649   0.597   0.532   0.496   0.475
shine       0.821   0.766   0.715   0.659   0.602   0.564   0.537   0.515
susan       0.878   0.809   0.725   0.668   0.589   0.540   0.503   0.479
average     0.852   0.787   0.722   0.658   0.592   0.541   0.508   0.484
Table B.7 Compression ratio results for X-MatchPro for dictionary sizes of 2048 and
4096 bytes
            Input block size (bytes)
Program     32      64      128     256     512     1024    2048    4096
basicmath   0.863   0.800   0.728   0.669   0.585   0.541   0.504   0.481
bitcnts     0.867   0.789   0.712   0.673   0.604   0.551   0.521   0.478
crc         0.858   0.796   0.723   0.653   0.591   0.547   0.510   0.495
dijkstra    0.870   0.795   0.730   0.665   0.608   0.539   0.502   0.478
display     0.811   0.743   0.691   0.612   0.553   0.509   0.483   0.469
fft         0.841   0.789   0.740   0.671   0.602   0.554   0.522   0.500
qsort       0.863   0.801   0.738   0.659   0.589   0.536   0.500   0.472
sha         0.852   0.781   0.715   0.649   0.597   0.532   0.496   0.475
shine       0.821   0.766   0.715   0.659   0.602   0.564   0.537   0.515
susan       0.878   0.809   0.725   0.668   0.589   0.540   0.503   0.479
average     0.852   0.787   0.722   0.658   0.592   0.541   0.508   0.484
Table B.8 Compression ratio results for PPMZ
basicmath   bitcnts   crc     dijkstra   display   fft     qsort   sha     shine
0.345       0.336     0.353   0.335      0.360     0.363   0.332   0.369   0.315
Table B.9 Compression ratio results for DCLZ
            History buffer size (bytes)
Program     512     1024    2048    4096
basicmath   0.565   0.569   0.558   0.552
bitcnts     0.554   0.552   0.553   0.539
crc         0.565   0.564   0.555   0.560
dijkstra    0.567   0.564   0.559   0.549
display     0.543   0.538   0.519   0.532
fft         0.561   0.558   0.556   0.572
qsort       0.566   0.562   0.557   0.548
sha         0.559   0.557   0.553   0.550
shine       0.613   0.609   0.609   0.605
susan       0.562   0.563   0.559   0.552
average     0.565   0.564   0.558   0.556
Table B.10 Compression ratio results for CLB
            Dictionary size (bytes)
Program     256     512     1024    2048    4096    8191
basicmath   0.730   0.663   0.621   0.572   0.529   0.386
bitcnts     0.729   0.664   0.624   0.574   0.524   0.378
crc         0.729   0.664   0.623   0.572   0.522   0.384
dijkstra    0.744   0.679   0.634   0.582   0.534   0.390
display     0.667   0.598   0.552   0.488   0.402   0.325
fft         0.742   0.677   0.634   0.583   0.533   0.389
qsort       0.742   0.677   0.634   0.583   0.533   0.389
sha         0.735   0.671   0.630   0.577   0.526   0.369
shine       0.777   0.721   0.684   0.639   0.598   0.447
susan       0.758   0.699   0.654   0.603   0.557   0.434
average     0.735   0.671   0.629   0.577   0.526   0.389
PUBLICATIONS 
E. G. Nikolova, D. J. Mulvaney, V. A. Chouliaras, J. L. Núñez, "A Novel Code
Compression/Decompression Approach for High-performance SoC Design", IEE Seminar
on SoC Design, Test and Technology, Cardiff University, Cardiff, UK, 2 September 2003.
Received best paper award.
E. G. Nikolova, "A Compression/Decompression Scheme for Embedded Systems Code",
Proceedings of ESC Division Research Seminar, Loughborough University, 25 September
2003.
E. G. Nikolova, D. J. Mulvaney, V. A. Chouliaras, J. L. Nunez-Yanez, "A code compression
scheme for improving SoC performance", Proceedings of the 2003 IEEE International
Symposium on System-on-Chip, IEEE Cat. No. 03EX748.
