In this paper we present a method for reducing the memory requirements of an embedded system by using code compression. We compress the instruction segment of the executable running on the embedded system and we show how to design a run-time decompression unit to decompress code on the fly before execution. Our algorithm uses arithmetic coding in combination with a Markov model which is adapted to the instruction set and the application. We provide experimental results on two architectures, Analog Devices Sharc and ARM's ARM and Thumb instruction sets, and show that programs can often be reduced more than 50%. Furthermore, we suggest a table-based design which allows multi-bit decoding to speed up decompression.
Introduction
We propose a new code compression method which allows a processor to decompress compressed code during runtime. This could be used in embedded systems where the available memory to store the executable is often small. A reduced executable can also indirectly affect the size, weight and power consumption of the chip, which are also important considerations when designing embedded systems. Furthermore, denser programs can reduce cache misses for a given-size cache or allow smaller caches to be used. As the gap between processor and memory performance increases, the delays incurred by cache misses can increasingly become the bottleneck. If, however, the cache stores compressed code, instruction cache misses can be reduced, and the processor can potentially provide more instruction bandwidth. This holds only when the decompression unit is placed between the processor and the cache, decompressing on instruction fetch. On systems where decompression takes place between the cache and main memory, cache misses would not be reduced. In general, an instruction decode unit for compressed instructions could be part of the processor pipeline, or could be designed as a separate module which decompresses instructions and stores them in a buffer or a cache before they are sent to the pipeline. In this paper we assume that the CPU core is left intact and that all necessary decoding takes place in an add-on module. This simplifies the design as we need not modify the processor.
Our work is restricted to instruction compression. Since data is not compressed, our method will only reduce instruction memory. Compressing data would require a different approach since data must be writable as well, and we would have to provide a fast run-time compression scheme in addition to fast decompression. Regarding multimedia applications, where often data size is the dominant cost, an instruction compression scheme would not significantly reduce the overall memory requirements. This is usually true in video applications. However, in audio applications, where complex algorithms are used, our scheme could be useful. Another example of an application that would benefit from a code compression scheme is a laser printer, where program memory requirements are typically large.
Code compression poses some restrictions on the type of algorithm that can be used since executable code does not execute in a sequential manner. This fact, combined with the need for small tables (or models) and high speed when decompressing, renders most text or image compression algorithms unusable for code compression. Wolfe and Chanin were the first to propose an embedded processor design which incorporates code compression [19], where Huffman coding [7] is used to compress cache blocks. Subsequently Liao et al. [13] and Bird and Mudge [4] proposed dictionary methods [2] which compress instructions so that they are easily decompressible. In [11] we proposed two methods, a dictionary-based one (SADC) and a statistical one (SAMC). A number of industrial attempts have also emerged, including ARM's Thumb processor [1] and MIPS16 [8].
In this paper we extend our work on SAMC (Semiadaptive Markov Compression) which is based on arithmetic coding [18] in combination with a precalculated Markov model. We improve on our Markov modeling techniques, and we use decoding techniques similar to those we proposed in [12]. Our approach to arithmetic coding uses the look-up table technique by Howard and Vitter [6] as a starting point and extends it to allow for random access and multibit decoding. Apart from its usefulness for embedded code compression, our work makes a contribution to the data compression field as well, as it presents a new way of designing a decoder for fast individual block decompression. Some of the algorithms' details have been presented elsewhere [12]. Furthermore our method is general and can easily be used for any instruction set with fixed-size instructions. In terms of compression performance, SAMC compresses programs significantly better than byte-based Huffman coding as used by Wolfe and Chanin [19]. Although direct comparison with other methods such as those by Bird et al. [4] and Liao et al. [13] is not possible, our method seems to compress more effectively as seen by our results on the ARM instruction set. However, the main drawback of arithmetic coding is its slow execution. We propose a novel technique to attack this problem by building a decoding table which allows multi-bit arithmetic decoding.
In Section 2 we explain the constraints introduced by code compression and discuss some of the design issues involved. In Section 3 we discuss previous work in text compression and in code compression. Section 4 describes the details of binary arithmetic coding, while Section 5 discusses the Markov model details. In Section 6 we present experimental results on two different architectures and we evaluate different algorithms for code compression. Finally, in Section 7 we conclude and discuss possible future directions.
Code Compression basics
The main characteristic of code compression in a system with run-time decompression is the need to have random access in decompression. This stems from the lack of sequentiality in program execution as branches and calls alter the flow of programs. The decompression engine should be able to start decompression at any point in the code, or at least at some byte boundaries. Wolfe and Chanin [19] use a cache block decompression scheme, where cache blocks are decompressed whenever there is a cache miss. In such a system random access means that the decompression engine should be capable of starting decompression at cache block boundaries. In other systems, such as the one proposed by Lefurgy et al. [10], where a variable number of instructions is encoded with one codeword, decompression should start at a memory location that is a valid codeword.
We define a few terms which we will use in the following:
2. Block: We use the term block to refer to the number of bytes that constitute a decompressible unit. Our algorithm will be capable of starting decompression at any byte boundary that is a block beginning. Note that blocks are fixed-size and should not be confused with basic blocks. In our experiments they range between 4 and 64 bytes.
3. Byte alignment: To make decoding easier, code compression systems often impose a byte or word alignment restriction. This means that compressed blocks can only start at a byte or a word boundary. Word alignment is a big penalty for a block-based compression scheme, so we always ensure our blocks are aligned on byte boundaries.
4. Indexing a block: When we decompress code and come across a branch instruction, the target address we get is an uncompressed one which does not correspond to the same location in the compressed code. One way of solving this is to use a table called LAT [19] which maps program instruction block addresses into compressed code instruction block addresses. The main advantage of the LAT approach is that it can be readily used on any instruction set without modification. However, as the block size decreases, the overhead of storing a LAT increases. An alternative approach is the one proposed by Lefurgy et al. [10], where relative branch instructions are not compressed and their offset fields are patched to point to compressed instructions. Our compressor goes through the program once and compresses all instructions using arithmetic coding except for branches. Branches are packed into 2, 3 or 4 bytes depending on the offset size, and the offsets are patched to point to compressed addresses. Note that all direct branches as well as jump tables must be patched. Jumps to registers used for return from subroutines need not be changed and can be compressed using the arithmetic coder. Figure 1 shows an example ARM branch instruction. As shown in the figure, ARM instructions are 32 bits wide, where the first 4 bits (bits 28-31) specify a condition for the instruction (execute if the zero flag is set, etc.). Branch and branch-and-link instructions always have a "101" in bit positions 25-27 as shown in the figure, while bit 24 is used to specify whether the instruction is a branch or a branch and link. Finally, bits 0-23 constitute the 24-bit offset. In this branch compression example an ARM branch is compressed to a branch which has a 16-bit offset.
The highest bit is set to 1 to signify that this is a branch and should not be decoded using the arithmetic decoder (we can force our encoder to start with a 0 on all other instructions without significant loss of coding efficiency). The condition and link fields are preserved, while a 2-bit field is added (NB field) which tells the decompressor how many offset bytes follow (1, 2, 3, or 4). Note that in the worst case, since the new offsets point to byte positions, a 24-bit offset might be expanded to more than 24 bits.
Figure 1: ARM branch compression
An important characteristic of code compression is that compression and decompression can be asymmetric. Since decompression must be done in real time, it has to be fast. However, compression is done once, hence compression time is immaterial, as long as it can be done in reasonable time.
Figure 2 illustrates our compressor and decompressor. The left side of the figure shows a flow diagram for the encoder while the right side shows what happens during decoding. To explain decoding, assume that a block of n instructions is currently being decompressed. If the decoder is currently at position j of instruction k, the example in figure 2 shows how decoding is performed. For each possible bit position that can occur within an instruction, the decoding table gives a number of possible matches (encoded bits) which are compared with the leading bits at position j. In this example a match has been found, namely "01001", which gives the decoded bits "11010010" as given by the table. The table is precalculated and stored in main memory. From this diagram we see that decoding is essentially a number of comparisons between the encoded value stored in memory and the table entries for the current bit position. The decoding table is generated using a Markov model and a table-based arithmetic coder as described in Section 4. The coder and the model are combined into one table, resulting in fast decoding.
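As a rough illustration of the branch format of figure 1, the following Python sketch packs a branch into a one-byte header followed by the offset bytes. The text only states that the top bit is 1, that the condition and link fields are preserved, and that a 2-bit NB field gives the number of offset bytes; the order of the fields inside the header, the signed big-endian offset encoding, and the names `pack_branch` and `unpack_branch` are assumptions made for this example.

```python
def pack_branch(cond, link, offset):
    """Pack a branch as 1 header byte + 1..4 offset bytes (assumed layout).

    Header (assumed): bit 7 = 1 (branch marker), bits 6-3 = condition,
    bit 2 = link, bits 1-0 = NB field holding (number of offset bytes - 1).
    """
    if not 0 <= cond <= 0xF:
        raise ValueError("condition must fit in 4 bits")
    # Find the smallest number of bytes that holds the signed offset.
    for nbytes in range(1, 5):
        limit = 1 << (8 * nbytes - 1)
        if -limit <= offset < limit:
            break
    else:
        raise ValueError("offset does not fit in 4 bytes")
    header = 0x80 | (cond << 3) | (int(link) << 2) | (nbytes - 1)
    return bytes([header]) + offset.to_bytes(nbytes, "big", signed=True)


def unpack_branch(data):
    """Return (cond, link, offset, bytes_consumed) for a packed branch."""
    header = data[0]
    if not header & 0x80:
        raise ValueError("top bit is 0: not a branch, use the arithmetic decoder")
    cond = (header >> 3) & 0xF
    link = bool(header & 0x04)
    nbytes = (header & 0x03) + 1
    offset = int.from_bytes(data[1:1 + nbytes], "big", signed=True)
    return cond, link, offset, 1 + nbytes
```

In this sketch the decompressor only needs to inspect the first bit of a compressed instruction to choose between the branch path and the arithmetic decoder, which is the property the text relies on.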
Figure 3 shows the components of the decompressor module. This module can be placed between the CPU core and the ROM or a cache, or it can be placed between two levels of the memory hierarchy, for example between a cache and main memory. The main overhead is the decoding table which stores the matches and decoded bits as shown in figure 2. The match logic is essentially a number of comparators. Finally, the branch decoder maps compressed to decompressed branches as shown in figure 1. The branch signal is triggered by the first bit of the compressed instruction and is used to select between the output of the branch decoder and the match logic.
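The match logic can be pictured in software as a prefix comparison against the entries stored for the current bit position. The sketch below is only an illustration: the first table entry uses the match "01001" and decoded bits "11010010" from the example of figure 2, while the remaining entries, the explicit shift counts, and the names `ENTRY` and `decode_step` are hypothetical.

```python
# One decoding-table entry list for a single bit position.
# Each tuple is (encoded bits to match, decoded bits, encoded bits to shift out).
ENTRY = [
    ("01001", "11010010", 5),  # match/decoded pair from the text; shift is assumed
    ("00", "0000", 2),         # hypothetical entry
    ("1", "1", 1),             # hypothetical entry
]


def decode_step(encoded, pos, entry):
    """Compare the leading encoded bits at `pos` with every match in `entry`.

    Hardware would perform these comparisons in parallel with comparators;
    here they are tried one after another.
    """
    for match, decoded, shift in entry:
        if encoded.startswith(match, pos):
            return decoded, pos + shift
    raise ValueError("no decoding-table entry matches the encoded bits")
```

For example, `decode_step("0100110", 0, ENTRY)` finds the first entry and advances the read position by five encoded bits while producing eight decoded bits.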
Previous Work
There is an abundance of algorithms for data compression in the literature. Statistical methods, which use the frequencies of appearance of symbols to choose between possible encodings, usually achieve the best compression in terms of size. However, most of the existing state-of-the-art algorithms cannot be used directly, mainly because they violate our individual block decompression constraint. In terms of compression ratios, the algorithms that use finite context modeling such as PPM (Prediction by Partial Match) [2], DMC [5], and WORD [14] seem to achieve the best performance. However, they require large amounts of memory both for compression and decompression, making them unsuitable for hardware implementations where keeping the size small is important.
Figure 3: Decompression unit
The simplest in terms of hardware implementations appear to be the dictionary coding methods. In dictionary coding a group of commonly appearing characters is replaced by an index to a dictionary. This dictionary stores such groups, hence decompression has to replace the indices with the entries found in the dictionary. Compression is achieved because the indices usually occupy fewer bits than the corresponding groups of characters. It has been proven [2] that for any dictionary method there is a statistical one that can achieve at least equally good compression. This implies that statistical methods are at least as good as dictionary methods in terms of compression performance. However, the main attraction of dictionary methods is the ease of decompressing, making them particularly useful for hardware implementations.
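A toy Python sketch of dictionary coding follows, assuming that each "group of characters" is a whole instruction word and that entries not in the dictionary are emitted unchanged. It is not any of the published schemes discussed in this section; the function names and the token format are made up for illustration.

```python
from collections import Counter


def build_dictionary(words, size):
    """Keep the `size` most frequent words as dictionary entries."""
    return [word for word, _ in Counter(words).most_common(size)]


def compress(words, dictionary):
    """Replace dictionary words by ("idx", i); keep the rest as ("raw", word)."""
    index = {word: i for i, word in enumerate(dictionary)}
    return [("idx", index[w]) if w in index else ("raw", w) for w in words]


def decompress(tokens, dictionary):
    """Decompression is a plain table lookup, which is why it is cheap."""
    return [dictionary[value] if kind == "idx" else value for kind, value in tokens]
```

Compression is obtained only if an index is shorter than the word it replaces, and the dictionary itself has to be counted as part of the compressed program.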
The final size of the compressed program using a dictionary method is usually measured as the sum of the compressed program plus the dictionary size. The Ziv-Lempel family of algorithms [21] eliminates the dictionary by making the indices point to the file itself. These algorithms also seem to be relatively simple to implement in hardware, especially the variant by Welch [17]. However, the Ziv-Lempel family of algorithms uses pointers to previous occurrences of strings, which makes an individual block decompression scheme a very complex task.
Our method belongs to the statistical coding category. This gives us the advantages of statistical coding in terms of compression performance. Furthermore our algorithm is semiadaptive; this means that encoding involves two steps: in the first step we build a Markov model suitable for the given application and in the second step we perform the actual coding. The decoder uses the same model as the encoder and hence does not need to go through such a two-step process. More on the differences between static, semiadaptive and adaptive coding can be found in the book by Bell et al. [2].
The first code compression system for embedded processors was proposed by Wolfe and Chanin [19]. It assumes a system with an instruction cache which holds uncompressed code and a main memory which holds compressed code. The decompression unit is placed between the main memory and the instruction cache and decompresses a cache line whenever there is a cache miss. To compress main memory code, byte-based Huffman codes [7] are used. Kozuch and Wolfe [9] report a compression ratio around 0.73 (compressed size/original size) for MIPS code. A fast Huffman decompressor chip has been designed by Benes et al. [3] which uses asynchronous logic and achieves 1303 Mbits/sec.
The main advantage of this scheme is that the processor does not need any modification, as it fetches uncompressed code from the instruction cache. A shortcoming is that 8-bit symbols have been used instead of 32-bit symbols corresponding to the size of RISC instructions. This means that all 4 bytes within the same 32-bit word are encoded using the same table. Since instructions have different fields which have different statistical characteristics, such a choice increases the entropy of the source significantly. Furthermore, this method does not take into account dependencies between instructions, limiting the overall compression performance.
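The cost of encoding all four bytes of a word with the same table can be illustrated by comparing the zeroth-order entropy of the pooled byte stream with the average of the per-position entropies. The Python sketch below is an assumption-laden illustration (the function names are hypothetical and it ignores inter-instruction dependencies); by concavity of entropy, the pooled value can never be smaller than the per-position average.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Zeroth-order entropy in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def pooled_vs_positional(words):
    """Compare one shared byte table with one table per byte position.

    `words` is a list of 4-byte instruction words (bytes objects).
    Returns (pooled entropy, mean per-position entropy) in bits per byte.
    """
    pooled = entropy(b for word in words for b in word)
    positional = sum(entropy(word[i] for word in words) for i in range(4)) / 4
    return pooled, positional
```

When the leading byte of every word is almost constant (as with a fixed condition field) while the trailing bytes vary, the gap between the two numbers becomes large, which is the loss the paragraph above attributes to a single shared table.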
Liao et al. [13] proposed two methods based on the idea of mini-subroutines. The first method is done purely in software. The main idea is that frequently appearing sequences of instructions (mini-subroutines) are replaced by a call to a dictionary. The dictionary stores all these mini-subroutines once, thus compression is achieved by replacing a sequence of instructions with a call instruction which occupies less space. The second method needs some hardware modification. The use of the dictionary is more flexible as any substring in the dictionary can be replaced by a pointer. The return instructions from the mini-subroutines are eliminated, and the instruction set is augmented with a new call instruction which points to anywhere within the dictionary. This call instruction consists of a dictionary address plus a length which gives the number of instructions to be executed from the dictionary. Liao et al. report an average compression ratio of 0.882 (first method) and 0.841 (second method) for optimized TMS320C25 code.
Bird and Mudge [4] proposed a dictionary-based method which outperforms byte-based Huffman coding and results in a fast decoder. They find common instruction sequences and replace them with a codeword. This codeword is an index to a dictionary which contains the original instruction sequence. The dictionary is accessed during program execution time to expand the singleton codewords back to the original instruction sequence. The final compressed program is a mixture of codewords and uncompressed instructions. One drawback of this method is that branch targets should be aligned and the range of branches is reduced, which sometimes requires a jump table. They report compression ratios of 61%, 66% and 74% for the PowerPC, ARM and i386 instruction sets respectively.
Yoshida et al. [20] presented a logarithmic-based compression scheme to compact instruction memory. They store a ROM decompression table which is used to decode. The main drawback of this method is the large ROM table, which can exceed 270KB for large examples.
Okuma et al. [16] proposed a more elaborate compression method where the immediate fields of instructions are compressed. To implement the decoders they use ROMs and, unlike most authors, they present results which include the cost of the decompression hardware. They report an average overall memory reduction of 12.4%, including the cost of the ROM decoders.
Kemp et al. at IBM [15] proposed a code compression scheme for PowerPC code. They divide PowerPC instructions into two 16-bit pieces and they use two Huffman tables for each piece. In order to improve performance they group instructions into 64-byte blocks, minimizing the indexing table (LAT) requirements. They report a compression ratio of 60% for PowerPC code.
Lekatsas and Wolf [11] proposed a dictionary-based method. Dictionary generation is much more elaborate compared to the method proposed by Bird and Mudge [4]. The complicated encoding can result in significantly better compression, however it is more expensive to implement as the decompressor complexity is substantially larger. Lekatsas and Wolf report compression ratios of 52% and 67% on MIPS and x86 code respectively.
In the industry, a number of new architectures have emerged for the embedded market, which attempt to provide 32-bit performance at lower cost. The Thumb architecture by Advanced RISC Machines [1] is one such example. The Thumb instruction set is essentially a new instruction set which consists of a subset of the most commonly used 32-bit ARM instructions. On execution, these 16-bit instructions are decompressed to full 32-bit ARM instructions in real time. Another approach, very similar to Thumb, is MIPS16 [8]. MIPS16 instructions map to the MIPS architecture by presenting a subset of the MIPS-III instruction set to the programmer. Using some simple translation hardware, MIPS16 instructions can be translated on the fly into MIPS-III instructions. The Thumb instruction set achieves compression ratios between 60% and 70% while MIPS16 achieves an average compression ratio of 60%.
This approach has some disadvantages: it requires considerable effort to design a new instruction set, and although each 16-bit Thumb or MIPS16 instruction corresponds to a full 32-bit instruction, some instructions may need the extra bits. In such cases one 32-bit instruction will map to many 16-bit instructions. This means that, although absolute code size is smaller, the number of instructions in Thumb programs increases, resulting in some loss of performance as well as code density. Another disadvantage of this approach is that it requires a new instruction decoder and a whole new set of software development tools, such as a specialized compiler, an assembler and a linker.
Binary Arithmetic Coding using Tables
The first question we need to ask is which compression method would give us the best results. The best known statistical compression method which uses a probability model is Huffman coding [7]. Huffman coding has been proved to give optimal coding for each symbol encoded. However, the message encoded as a whole is not optimal since there is loss of coding efficiency at each symbol boundary. Furthermore, its redundancy greatly increases when the probabilities of symbols are highly skewed [6]. A more modern method of coding is arithmetic coding [18] which always achieves at least as good compression as Huffman coding [2]. If the whole file is encoded serially, arithmetic coding achieves the theoretical entropy bound to compression efficiency. Our motivation to use arithmetic coding is further strengthened by the fact that most state-of-the-art data compression algorithms such as [...]
[...] [3,8). For each state we have different transitions and outputs depending on the input bit and some probability ranges. We denote the input bit as either the MPS (Most Probable Symbol) or the LPS (Least Probable Symbol), which we will explain shortly. The probability ranges are shown in the second column of the table. For each transition we show the output and the next state (next interval) separated by a comma. For example, if the current state is state [1,8), the current probability of the MPS is 0.9, i.e. it falls in the [0.7857,1) range, and the input symbol is the LPS, then the table gives 001 as the output and [0,8) as the next state (row 4, column 1). Some high probability transitions do not output anything, denoted by a dash in the table. The "f" symbol appearing in some outputs denotes the "bits to follow" procedure [2]. This means that the output is not known yet; instead it depends on the next output bit. If the next output bit is a 0, then the "f" should be replaced with a 1 and vice versa.
Although we will demonstrate how to use the table in figure 4, we will not explain how approximate arithmetic coding can result in such a table, as this is the topic of the paper by Howard and Vitter [6]. Furthermore, it is not required for understanding our ideas.
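The remark earlier in this section, that the redundancy of Huffman coding grows when probabilities are highly skewed, is easy to check numerically for the binary case that matters here. The short Python sketch below is only an illustration with hypothetical function names: a Huffman code over a two-symbol alphabet cannot spend less than one bit per symbol, whereas the entropy bound that arithmetic coding approaches can be far lower.

```python
from math import log2


def binary_entropy(p):
    """Entropy in bits of a binary source whose MPS has probability p (0 < p < 1)."""
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def huffman_redundancy_binary(p):
    """Extra bits per symbol spent by a 1-bit-per-symbol Huffman code."""
    return 1.0 - binary_entropy(p)
```

For an MPS probability of 0.9, as in the example that follows, the entropy is roughly 0.47 bits per input bit, so more than half of every Huffman-coded bit would be redundant.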
We now take the reader through an example of compression and decompression. Consider the ARM instruction MOV R1,R0 with binary representation 11100001 10100000 00010000 00000000. Assume that each bit in the instruction has a probability of being a "0" given [...], where the first bit of an instruction is a "0" with probability 0.1, the second bit is a "0" with probability 0.3 and so on (generation of these probabilities is the topic of the next section). Compression works on a bit-by-bit basis: we start at state [0,8). The above list tells us that the first bit is a "0" with probability 0.1. In other words, the most probable symbol (MPS) is a "1" with probability 1-0.1 = 0.9. Since the first bit of the MOV instruction is a "1" we use the MPS column of the table in figure 4, and since Prob(MPS) = 0.9, which belongs in the [0.8125,1) range, we use the first row of this table. The result is no output (denoted with a "-" in the table) and the next state is state [1,8). The probability list tells us that a "0" will appear with probability 0.3, or equivalently the most probable symbol (MPS) is a "1" with probability 1-0.3 = 0.7. Since the second bit of the MOV instruction is a "1" we use the MPS column of the [...] [3,8). Continuing in the same manner, we now see that the third input bit, which is a "1", in combination with the probability of 0.2 tells us to use row 9 of the table, and the output is a 1 and the next state is [0,8). The procedure for compressing the first byte is summarized in figure 5, where we can see that the first byte 11100001 compresses to 11010. Using the same table we can decode back our MOV instruction. The decoder has access to the same probability model, hence it can use the list of probabilities we used during the encoding example. Furthermore, it has access to the encoded message.
Starting at state [0,8) and given the probability of a "0" being 0.1, or equivalently, the MPS symbol is a "1" with probability 0.9, we know that the first row of the table would have been used by the encoder. If the input was the LPS, i.e. a "0", then the output would have been a 000. By inspecting the encoded message we see that it starts with a 1, hence the input was the MPS or a "1". We have now decoded one bit of the original instruction and we move to state [1,8) as dictated by the table. Since this transition does not involve any output for the encoder, we do not shift any encoded bits and continue to use 11010 as our current encoded message buffer. Note that this method is guaranteed to give the original instruction back without any loss of information.
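The principle behind this example, that encoder and decoder stay in step because they share the same per-bit probabilities, can be sketched with an exact binary arithmetic coder. The Python code below deliberately uses exact rational arithmetic rather than the table-based approximation of figure 4, so its output will generally differ from the 11010 obtained above. Only the first three probabilities (0.1, 0.3, 0.2) come from the text; the remaining ones, and the function names, are hypothetical.

```python
from fractions import Fraction
from math import ceil


def encode(bits, p0):
    """Exact binary arithmetic encoder; p0[i] is the probability that bit i is 0."""
    low, width = Fraction(0), Fraction(1)
    for bit, p in zip(bits, p0):
        split = width * p              # sub-interval assigned to "0"
        if bit == 0:
            width = split
        else:
            low += split
            width -= split
    # Emit the shortest binary fraction that lies inside [low, low + width).
    k = 1
    while True:
        v = ceil(low * 2 ** k)
        if Fraction(v, 2 ** k) < low + width:
            return format(v, f"0{k}b")
        k += 1


def decode(code, p0):
    """Recover len(p0) bits by replaying the same interval subdivisions."""
    value = Fraction(int(code, 2), 2 ** len(code))
    low, width = Fraction(0), Fraction(1)
    bits = []
    for p in p0:
        split = width * p
        if value < low + split:
            bits.append(0)
            width = split
        else:
            bits.append(1)
            low += split
            width -= split
    return bits


# First byte of MOV R1,R0; the first three probabilities are from the text,
# the remaining five are made up for this sketch.
FIRST_BYTE = [1, 1, 1, 0, 0, 0, 0, 1]
P0 = [Fraction(n, 10) for n in (1, 3, 2, 8, 9, 9, 9, 3)]
```

Calling `decode(encode(FIRST_BYTE, P0), P0)` returns the original eight bits. The decoder must know how many bits to produce, which fixed-size instructions and fixed-size blocks provide.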
The main drawback of decoding as presented in the example above is that it works on a bit-by-bit basis. We now show how to build a decoding table that achieves multi-bit decoding. In the following "M" will denote the MPS and "L" the LPS for brevity. Suppose we are currently at state [0,8) and that the probability of the MPS falls in the [0.8125,1) range. If the encoded message is 000 then we would decode an "L" and the next state would be [0,8), else we would decode an "M" and the next state would be [1,8). Decoder I in figure 7 shows this, where at each transition we show the encoded fragment of the message decoded, the corresponding decoded bit(s), and a shift signal which tells us how many encoded bits to shift out, as they are no longer needed for decoding. We can go one step further and look at what happens at state [1,8). Assuming the probability falls in the range [0.7857,1), then if the output is not 001 the next symbol is an "M" and the next state is [2,8). At state [2,8), if the probability is in the range [0.5,0.75) and the output is a 1, then the input is a third "M" in a row. Hence, given that the probabilities are in the above ranges, a series of three "M" symbols will give a 1 as an output. The reader can verify that this is the only input sequence that will result in an output starting with a 1. This means that we can augment our decoder to include this transition as shown in decoder II, figure 7. If during this process an output conflicts with the output of another input sequence then we do not add this transition; instead shorter transitions are tried by taking the leading common bits of these two conflict-inducing inputs [12]. Continuing in this manner, adding highest probability transitions first and taking care not to include conflicting transitions, we get more complex decoders such as decoders III and IV. In other words, figure 7 shows a number of options of increasing space requirements, but decreasing decoding delay.
A possible decoding table entry for Decoder III is shown in figure 6. We have shown how to generate a decoding table entry for the decoder for a given starting interval state and a given set of probabilities. For any state, depending on the current position within an instruction, this set of probabilities can differ. The next section will discuss this in more detail. The more entries we store in our decoding table, the more bits we can decode per cycle. On the other hand, matching the right output with the encoded data will require more comparators and, more importantly, it will result in a larger decoding table.
Apart from the approximations introduced by the use of a table instead of doing floating point calculations [6], the encoder described above has some loss of efficiency because we [...] Finally we pad at most 7 extra bits at the end of each block to ensure byte-alignment.
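The padding step is small enough to state as code. In the following Python sketch the encoded block is held as a string of "0"/"1" characters and zero is used as the pad bit; both choices, and the name `pad_block`, are assumptions made for illustration.

```python
def pad_block(encoded_bits):
    """Append 0..7 pad bits so that the block ends on a byte boundary."""
    padding = (-len(encoded_bits)) % 8
    return encoded_bits + "0" * padding
```

The average cost is about half a byte per block, which is one reason why very small block sizes hurt the compression ratio.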
Model building
The next problem we need to address is the generation of the probabilities used by the arithmetic coder. Clearly we want our probabilities to have some memory; if for example we are encoding the fifth bit of an instruction we would like to remember what the previous 4 bits of the instruction were and adjust our probability accordingly. In order to keep track of such information we use a Markov model. A Markov model consists of a number of states, where each state is connected to other states and each transition has a probability assigned to it. By assigning each node to a certain bit of an instruction (in fact, as we will see subsequently, many different nodes are assigned to a certain bit) and assigning each of the two transitions to the probabilities of this bit being a "0" or a "1", we can form a Markov model which will remember what the previous bits were. Depending on the instruction to be encoded, a certain path along the Markov nodes is followed, giving a list of probabilities such as the one we used in the previous section for our example.
Elsewhere [11] we proposed a stream subdivision technique where instructions are broken into fields (streams). In this work we present a different Markov model which can avoid the stream subdivision process, one which is easier to expand for larger programs and can give better compression if it is properly tuned to match the instruction size of the given processor. Figure 9 shows an example model for a hypothetical instruction set with 9-bit instructions. It consists of 3 sub-models such as the one shown in figure 8. The last sub-model is connected to the first one but is not shown here for simplicity. Figure 8 shows an example of such a sub-model which could be used to encode 3 bits. The starting node is node 0. The following example demonstrates how it is used: consider a hypothetical instruction which starts with the bit sequence 010. Each Markov node in figure 8 has two outgoing arrows (transitions), the left one corresponding to a "0" and the right one to a "1". Each arrow has a probability assigned to it. Instruction 010 will follow the path shown in the figure. The probabilities on this path, i.e. 0.1, 0.7 and 0.2, are the probabilities used by the encoder for this instruction. In other words, the probability list we used in the encoding example of the previous section is actually derived by such a model. It is important to realize that different instructions will take different paths, and hence will use different probabilities. By assigning appropriate values to these probabilities, we can take advantage of the statistical properties of instructions and compress effectively. In practice bigger models are used, but due to lack of space this figure shows a smaller version.
We use two main variables to describe our models, namely the model depth and the model width. These variables represent the number of layers and the number of Markov nodes per layer respectively. We have found experimentally that the depth should divide the instruction size evenly, or be a multiple of the instruction size. This is intuitively true as we would like our model to start at exactly the same layer after some constant number of instructions. This ensures that each layer in the Markov model corresponds to a certain bit in the instruction and therefore it stores the statistics for this bit. We have a number of different nodes per layer because we need some memory of what the previous bits were (the current node depends on the path we followed for this instruction). The model's width is a measure of the model's ability to remember the path to a certain node. Since each node has two transitions leading to it, after log2(Width) transitions the model will lose all information about where it started. For 32-bit code, we found that a width of 16 and a depth of 32 achieve reasonable compression, while keeping the number of Markov nodes small (16 × 32 = 512 nodes). For large programs where the model size is amortized by a big reduction in program size, bigger models can be used.
On each node of the Markov model, a probability is assigned which corresponds to the probability of the MPS. An extra bit is also stored which signifies whether the MPS is a 0 or a 1. These probabilities are calculated in a preprocessing step. The encoder goes through the subject program once and traverses the paths according to the instructions encountered in the program. Keeping track of which paths are visited more frequently, it calculates the probabilities for each transition. Once the model has been built, the encoding [...] Apart from encoding the current bit, the encoder also calculates all possible outputs for all possible input combinations up to a specified number of bits. This information is used to build a decoding table entry for that bit. When encoding completes, the decoding table is ready, and the Markov model is essentially embedded in the decoding table. The final storage requirements are the encoded message and the decoding table.
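The preprocessing pass that gathers statistics can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that every instruction starts at node 0 of layer 0, that bits are processed most significant first, and that the next node is found by shifting the bit into the node index. The function name train and the tiny 3-bit program are hypothetical.

```python
from collections import defaultdict


def train(instructions, instr_bits, width):
    """Count bits per (layer, node) and derive (MPS, P(MPS)) for each node."""
    counts = defaultdict(lambda: [0, 0])  # (layer, node) -> [zeros, ones]
    for word in instructions:
        state = 0  # assumption: each instruction starts at node 0
        for layer in range(instr_bits):
            # Assumption: bits are consumed most significant first.
            bit = (word >> (instr_bits - 1 - layer)) & 1
            counts[(layer, state)][bit] += 1
            state = (2 * state + bit) % width  # hypothetical transition rule
    model = {}
    for node, (zeros, ones) in counts.items():
        mps = 1 if ones > zeros else 0  # the extra bit stored with the node
        model[node] = (mps, max(zeros, ones) / (zeros + ones))
    return model


# Hypothetical program of four 3-bit instructions.
program = [0b010, 0b010, 0b011, 0b110]
model = train(program, instr_bits=3, width=4)
for node in sorted(model):
    print(node, model[node])
```

Nodes that are never visited get no entry here; a practical encoder would need a default probability for them.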
Experimental results
In this section we present experimental results on two architectures: Analog Devices Sharc and ARM (both the ARM and Thumb instruction sets). These instruction sets have fixed-size instructions and are therefore suitable for SAMC. However, compression ratios are different because Sharc code tends to be more redundant than ARM code. This is due to the design of Sharc's instruction set. The instruction width is larger (6 bytes as opposed to 4 bytes for ARM) while the number of different instructions in programs remains small, making Sharc code very redundant. In general, the compressibility of programs on an instruction set depends on several factors: [...] We compare SAMC with well-known general-purpose algorithms such as Huffman coding and Gzip and show its advantages and limitations. We used a number of benchmarks which [...]
Experiments on Sharc
We use two models (768 and 3072 bytes respectively) and compress 8 benchmark programs. SAMC with model1 averages a 48% compression ratio while SAMC with model2 averages a 41% compression ratio. Figure 10 shows the two models' parameters. Note that the depth of the model for Sharc is 48 bits, which is in accordance with our statement that the depth should divide the instruction size evenly. We also show the range of decoding table sizes for our experiments. The larger the Markov model, the larger the resulting decoding table. The bars in figure 11 consist of two portions. The bottom portion is the compression ratio while the top portion shows the decoder ROM overhead. In this sense the results show the overall memory reduction. Sharc code seems to be very compressible and would benefit a lot from a SAMC implementation in a memory-constrained system. Another interesting observation is that although it is generally true that bigger models can give better compression, for the smaller benchmarks such as FFT, Lucas and Reed-Solomon the smaller model (model1) produced better results. Clearly, for small benchmarks a big model, and hence a big decoding table, is a considerable overhead.
In figure 12 we compare SAMC with two well-known algorithms, Huffman and Gzip. In the Huffman implementation used here we have mapped bytes to codes which range from 4 to 16 bits. Huffman coding does not take into account dependencies between bytes, hence its performance is limited, which is confirmed by our experiments. Of course it would be possible to build a Huffman table for whole instructions and achieve optimal encoding at the instruction level. However, building Huffman codes for 32-bit symbols will result in a huge table (2^32 entries), unsuitable for code compression. Gzip uses LZ77 and is the most effective among the three algorithms compared. However, as explained in section 3, it is unsuitable for compressing small blocks. SAMC gives an average compression ratio of 45%, which is around 17% better than Huffman and 15% worse than gzip. The benchmark programs which appear with an asterisk in figure 12 are the ones which use the big model (3072 bytes).
ARM & Thumb
To understand the difference between the fourth bar and the third bar, it is important to realize that the third bar is essentially SAMC code size over Thumb code size, while the fourth bar is essentially SAMC code size when compressing Thumb code over the size of its equivalent ARM code. This is effectively the result of multiplying the first bar with the third bar. Figure 10 shows the parameters used by SAMC for these experiments. SAMC outperforms Thumb on all benchmarks with an average compression ratio of 56% compared to an average of 68% produced by Thumb. The third bar shows that Thumb code is also compressible but to a lesser extent than ARM, as expected. The fourth bar should be contrasted with the second (SAMC on ARM) bar. If we compile for Thumb and then apply SAMC we get an average total compression ratio of 55%, which is almost equal to the SAMC average on ARM. This means that a system with SAMC on ARM will most likely be adequate, and a preprocessing step of compiling for Thumb will not be needed to enhance compression.
In figure 14 we compare SAMC with Huffman and Gzip on ARM code. Figure 10 shows the SAMC parameters used for these experiments. As expected, gzip outperforms SAMC on all benchmarks, while Huffman is about 15% worse than SAMC.
Block size effect
Here we show compression ratios on Sharc, ARM and Thumb code for different block sizes. We have used the parameters shown in figure 10 for Sharc (model1), ARM and Thumb, apart from the block size, which ranges from 4 bytes to 64 bytes. The block size can affect the compression ratio significantly. In general, the bigger the block, the better the compression, for two main reasons: after each block we have to add bits to ensure correct decompression and byte alignment, and bigger blocks can take better advantage of statistics. Figure 15 shows the results for block sizes ranging from 4 to 64 bytes. Blocks of 4 and 8 bytes compress significantly less than larger blocks. From 16-byte blocks onward the situation levels off.
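The byte-alignment part of this effect can be illustrated with simple arithmetic. The sketch below assumes a hypothetical ideal compression ratio of 40% that is independent of block size, and models only the padding of each compressed block up to a whole number of bytes; it ignores the statistical effect and any extra termination bits. The function name aligned_ratio is hypothetical.

```python
def aligned_ratio(block_bytes, ideal_percent):
    """Compressed/original size after padding each block to whole bytes."""
    # Integer ceiling of block_bytes * ideal_percent / 100.
    compressed_bytes = -(-block_bytes * ideal_percent // 100)
    return compressed_bytes / block_bytes


# Hypothetical ideal ratio of 40%, block sizes as in the experiments.
for block_bytes in (4, 8, 16, 32, 64):
    print(block_bytes, aligned_ratio(block_bytes, ideal_percent=40))
```

Even in this simplified setting, the padding cost is a large fraction of a 4-byte or 8-byte block and shrinks as blocks grow.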
Bits per probability effect
These experiments essentially measure how the interval subdivision accuracy affects the compression ratio. Figure 16 shows the experimental results for different probability sizes. A probability size of 2 means that the interval can only be divided into 2^2 = 4 parts. At the other extreme, a floating point probability can divide the interval using floating point arithmetic and can thus achieve maximum accuracy. We note that beyond a probability size of 3 bits, the compression ratio is not affected significantly. This means that with only 4 bits/probability we can achieve competitive compression ratios. Also note that the last column, which corresponds to floating point subdivision, is only 2-3% better than the rest.
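The cost of coarse subdivision can be estimated for a single binary decision. The sketch below is an assumption-laden illustration: it rounds a true probability p to the nearest multiple of 1/2^bits (clamped away from 0 and 1) and computes the expected number of output bits per input bit when coding with the rounded value. The rounding rule and the names quantize and expected_bits are hypothetical; the text does not state how probabilities are rounded.

```python
import math


def quantize(p, bits):
    """Round p to a multiple of 1/2**bits, keeping it strictly inside (0, 1)."""
    levels = 1 << bits
    k = min(max(round(p * levels), 1), levels - 1)
    return k / levels


def expected_bits(p, q):
    """Expected code length per bit when the true P(0) is p but q is used."""
    return -(p * math.log2(q) + (1 - p) * math.log2(1 - q))


p = 0.9  # hypothetical true probability of the more probable bit
for bits in (2, 3, 4, 8):
    q = quantize(p, bits)
    print(bits, q, expected_bits(p, q))
print("exact", expected_bits(p, p))
```

The cost with the exact probability is the entropy of the decision, which no rounded value can beat; the gap narrows quickly as the number of bits per probability grows.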
Decoding Speed Experiments
To evaluate the performance penalty imposed by the decompressor, we performed a number of experiments where we modified the number of bits the encoder looks ahead to store matches, and we measured the average number of bits per cycle the decoder can decode, the total number of matches found in the decoding table and the average number of comparisons needed per cycle. Since the number of cycles an instruction needs to be decoded is not [...] If we allow pipelining of the decoder, we can decode adjacent blocks simultaneously, increasing the throughput. Note that arithmetic coding and decoding are recursive operations, and we cannot decode bits of a certain block without having decoded all the previous bits in that particular block. That is because the next state of the decoding table is determined by the current match (see figure 3). This implies that we can pipeline only if we allow simultaneous decoding of different blocks, which are uniquely decodable by design. For example, if our block size is 4 bytes for the ARM processor, then we can pipeline by decoding 2 instructions (2 blocks) simultaneously. A possible way to pipeline is to access the decoding table in one cycle and perform the matching in the next cycle.
Algorithm Evaluation
Here we attempt to compare the different code compression methods we know of so far. Since the benchmark sets, compilers and often the instruction set architectures are different, the results shown in figures 18 and 19 only represent our opinion on the overheads involved. Since we are unable to directly compare speed and area overheads, the results shown are mostly derived from our experience with compression methods. More precisely, to evaluate speed we take into account the inherent decoding difficulty of each method. For example, compression schemes based on dictionary coding are the fastest. For the area overheads we essentially count the number of tables needed, although in most implementations exact table sizes either vary depending on the application or were not given by the authors.
Regarding SAMC, the general block diagram is shown in figure 3, where the biggest area overhead is the decoding table. Given the fastest configuration presented in figure 17 (8-bit lookahead), a 16-bit field for matches, an 8-bit field for output and an 8-bit field for the next state value, the decoding table can fit in less than 4K. Slower implementations can fit the table in 1K.
According to figure 18 it seems that SAMC has superior compression performance over all algorithms except SADC. In terms of speed it cannot match the fast dictionary coding methods; however, it can perform comparably to a Huffman decoder. The area requirements are somewhat bigger than those for a Huffman decoder. These results show that if compression ratio with a small table is the dominant design requirement, SAMC should be the choice. On the other hand, if decoding speed is of utmost importance, the dictionary methods should be preferred.
Conclusions and Future Work
This paper presents SAMC, a new algorithm to reduce code size for embedded systems. SAMC is targeted at instruction sets with fixed-size instructions and can work for any such architecture. The experimental results show that it has superior compression performance over all dictionary and statistical methods except SADC, which is a hybrid method. In particular, SAMC outperforms Thumb in terms of compression ratios on most applications. Furthermore, we demonstrated that even Thumb code can be compressed by SAMC. The main disadvantage of a compression algorithm based on arithmetic coding is the costly arithmetic operations. We present a novel decompression method which avoids arithmetic operations completely by building a table. This table can be designed in a way that allows for multi-bit decoding, resulting in decoding speed similar to a Huffman decoder.
There are several areas that need further work. One possible research area is to compare a clever Huffman approach, where multiple tables are used, with our arithmetic coding scheme. As mentioned in section 4, arithmetic coding does not suffer from some of the shortcomings of Huffman coding, such as outputting an integral number of bits per symbol and the loss of efficiency when the probabilities are highly skewed. However, due to our approximations, some of the advantages of arithmetic coding are lost, making a comparison with Huffman a reasonable idea. A problem that needs further research is how to divide instructions in order to create multiple Huffman tables.
We are investigating various ways of speeding up the decompressor design. One possible approach is the use of asynchronous logic for matching decoding table entries with the encoded message. Asynchronous logic can be useful since we have to match bits of variable length. We are also conducting some research on integrating decompression with decryption. This could be useful in systems where code resides encrypted in memory for intellectual property reasons and must be decrypted and decompressed on-the-fly before it is executed by the processor.
