Abstract: Dictionary-based code compression stores the most frequently used instruction sequences in a dictionary, and replaces the occurrences of these sequences in the program with codewords. The large dictionary size is due mainly to many instruction sequences that differ only in their operands but are otherwise the same. The operand factorisation technique divides the expression tree into a tree-pattern (opcode sequence) and an operand-pattern (operand sequence) to reduce this redundancy; instruction sequences with the same opcodes but different operands may thus share the same tree-pattern dictionary entry. The paper proposes an operand field remapping method to further reduce dictionary size. The key idea is to exploit the relations between the operand currently being compressed and those already compressed. The operand-pattern dictionary is therefore divided into an operand remapping dictionary and an operand list dictionary. Each entry in the operand remapping dictionary indicates whether the operand (register or immediate value) to be compressed is the most used operand, the same as the destination register of the previous instructions, or otherwise. With this remapping technique, the operand dictionary size is greatly reduced. An average compression ratio of 46% is achieved, where compression ratio = (dictionary size + compressed code size)/(original program size).
Introduction
Embedded processors are highly constrained by cost, size, and power. Reducing the program size of an embedded system is important to reduce system size, cost and power consumption, and to speed up program execution. Compression methods fall into two categories: statistical [1, 2] and dictionary [3] [4] [5]. Statistical compression replaces frequently used instructions with shorter codewords to reduce the code size. Dictionary-based compression methods store frequently used instruction sequences in a dictionary and replace the occurrences with shorter codewords. Methods can also be classified by compression granularity: program-based methods [6] compress the whole program and expand it back before execution, while procedure-based methods [7] reduce the granularity to a procedure. These two methods avoid the difficulties of branch target re-addressing, and provide the software with an opportunity to help the system to execute the compressed code directly, at the cost of execution speed and a larger decompression buffer. Instruction-block-based [1] [2] [3] [4] or instruction-based [5] [6] [7] [8] [9] [10] methods achieve effective decompression and execution with a much smaller decompression buffer. A front-end decompression engine is used to decompress the instructions and send them to the CPU on the fly, thus speeding up the execution.
This paper proposes an instruction-block, dictionary-based compression technique named 'operand field remapping'. This method divides the operand-pattern dictionary into an operand remapping dictionary and an operand list dictionary to reduce its size. The entries in the operand remapping dictionary indicate whether the operand (register or immediate value) to be compressed is the most used operand, the same as the destination register of the previous instruction, or otherwise. With this remapping technique, the operand dictionary size is reduced significantly and the compression ratio is improved.
Related code compression work
One way to reduce code size is to restrict the size of instructions. This is the approach adopted in the design of Thumb [9] and MIPS16 [10] for the ARM7 [11] and MIPS-III [10] processors, respectively. Shorter instructions are achieved mainly by restricting the number of bits that encode registers and immediate values. This results in programs that are 30-40% smaller but run 15-20% slower than programs using a standard RISC instruction set [10, 12].
Another way to reduce the size of a program is to compress the code by general compression methods. Lefurgy [3] proposed a dictionary-based compression method named Compressed Program Processor (CPP). This method simply stores one copy of common instruction sequences in the dictionary and replaces the occurrences with codewords. Average compression ratios of 61%, 66% and 74% were reported for the PowerPC, ARM and i386 processors respectively.
Wolfe [2] proposed a statistical compression method in a Compressed Code RISC Processor (CCRP) using Huffman encoding [13]. A Line Address Table (LAT) maps the original program instruction addresses to the compressed code instruction addresses. The size of the LAT is approximately 3% of the original program size. A cache-like hardware structure called the Cache Line Address Lookaside Buffer (CLB) stores the most recently referenced LAT entries, so the cache refill engine can rapidly translate the addresses and fill the instructions. An average compression ratio of 73% on the MIPS R2000 is reported.
Araujo [4] finds that the most frequently used instruction sequences are identical in either their opcode sequences or their operand sequences, but not both, and therefore separates the dictionary into a tree-pattern dictionary and an operand-pattern dictionary. The decompression engine reassembles the instruction sequence by combining the entries in both dictionaries indexed by the codeword pair of opcode and operand. The average compression ratio for this scheme is 43% using Huffman [13] and 48% using MPEG-2 VLC [14].
A language grammar-based code compression method [15, 16] accepts a grammar for programs written using a simple bytecoded, stack-based instruction set, as well as a training set of sample programs. The system transforms the grammar, creating an expanded grammar that represents the same language as the original grammar. An average compression ratio of about 36% [15] is reported.
Operand field remapping
The key idea in this paper involves the transformation of the operands to reduce the dictionary size, and the following subsections describe the detailed operand field remapping method.
Instruction reformatting
To establish the compression model, consider first the instruction formats (Appendix, Section 8.1) of the embedded processor ARM7TDMI [11]. All fields in the instruction format except opcodes (in conjunction with some fixed bits) are considered operands. The condition field, registers and immediate values are manipulated as multiples of 4 bit operand fields (OF) for simple hardware decompression. For example, Fig. 1a shows a shift instruction, which can be reformatted as in Fig. 1b: the 4 bit condition field as one OF, the opcode with some fixed-value bits ('000' and S) as the new opcode, RN, RD and RM as three OFs, and the 8 bit shift value as two OFs.
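The reformatting step can be illustrated with a minimal Python sketch. It assumes the standard ARM data-processing layout with a register operand (cond in bits 31-28, the fixed bits, opcode and S in bits 27-20, Rn, Rd, the 8 bit shift value, then Rm). The function names and the order of the six OFs are hypothetical choices made for illustration and are not taken from Fig. 1.

```python
def split_operand_fields(word):
    """Split a 32-bit ARM data-processing word into (new_opcode, six 4-bit OFs).

    Assumed layout: cond[31:28], fixed bits + opcode + S [27:20],
    Rn[19:16], Rd[15:12], shift[11:4], Rm[3:0].
    """
    cond = (word >> 28) & 0xF
    new_opcode = (word >> 20) & 0xFF      # opcode together with the fixed bits and S
    rn = (word >> 16) & 0xF
    rd = (word >> 12) & 0xF
    shift = (word >> 4) & 0xFF            # 8-bit shift value, treated as two OFs
    rm = word & 0xF
    return new_opcode, [cond, rn, rd, shift >> 4, shift & 0xF, rm]


def join_operand_fields(new_opcode, ofs):
    """Inverse of split_operand_fields: rebuild the 32-bit instruction word."""
    cond, rn, rd, shift_hi, shift_lo, rm = ofs
    return ((cond << 28) | (new_opcode << 20) | (rn << 16) | (rd << 12)
            | (shift_hi << 8) | (shift_lo << 4) | rm)


# ADD r0, r1, r2, LSL #2 (condition AL) encodes as 0xE0810102.
opcode, fields = split_operand_fields(0xE0810102)
```

Because every OF is exactly 4 bits wide, the split and join are pure shift-and-mask operations, which is what keeps the hardware decompressor simple.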
Operand factorisation
The opcode sequences are extracted by factoring all of the OFs out of the instruction sequences, and are stored in the opcode dictionary (OPD). The OFs removed from each instruction sequence form an entry of the operand field dictionary (OFD). For example, the instruction sequence in Fig. 2a is factorised as shown in Figs. 2b and 2c. An occurrence of the instruction sequence is now replaced with a codeword consisting of two parts: Idx OPD and Idx OFD, indexing to the OPD and to the OFD, respectively, as shown in Fig. 3.
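As a rough illustration of the factorisation, the sketch below builds an OPD and an OFD from a list of instruction sequences and emits one (Idx OPD, Idx OFD) pair per occurrence. The data representation (each instruction as an `(opcode, [OFs])` pair) and the function name `factorise` are assumptions made for this example; dictionary entry selection by frequency and Huffman coding of the indices are omitted.

```python
def factorise(sequences):
    """Factor instruction sequences into an opcode dictionary and an OF dictionary.

    Each sequence is a list of (opcode, [operand fields]) pairs.
    Returns (opd, ofd, codewords), where each codeword is (idx_opd, idx_ofd).
    """
    opd, ofd, codewords = [], [], []
    for seq in sequences:
        opcodes = tuple(op for op, _ in seq)
        fields = tuple(f for _, ofs in seq for f in ofs)
        if opcodes not in opd:          # store each opcode sequence once
            opd.append(opcodes)
        if fields not in ofd:           # store each operand field sequence once
            ofd.append(fields)
        codewords.append((opd.index(opcodes), ofd.index(fields)))
    return opd, ofd, codewords


# Two sequences with the same opcodes but different operands share one OPD entry.
seq_a = [(0x08, [14, 1, 0, 0, 0, 1]), (0x1A, [14, 0, 2, 0, 0, 2])]
seq_b = [(0x08, [14, 3, 4, 0, 0, 3]), (0x1A, [14, 0, 5, 0, 0, 5])]
opd, ofd, codewords = factorise([seq_a, seq_b, seq_a])
```

Here the OPD holds a single entry while the OFD holds two, which is exactly the redundancy that the remapping step described next tries to remove from the OFD.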
Operand field remapping
Note that there are many dependencies among the operands in an instruction sequence, e.g. a destination register may immediately be used as the source operand of the following instructions, or a source operand may be re-used. Therefore, if the dependencies can be recorded in fewer bits than an OF occupies, instead of storing the OF itself, a further reduction in OFD size may be achieved. To exploit this, the OFD is transformed into an operand remapping dictionary (ORD) and an operand list dictionary (OLD). The former records the dependency relations and the latter records the modified operand lists after remapping. This method is termed the 'operand field remapping' technique.
The Idx OFD in the original codeword is now split into two parts: Idx ORD and Idx OLD, indexing to an ORD entry and an OLD entry, respectively. Each entry in the ORD contains six fields, called mapping tags, to specify the six OFs of an instruction. A tag specifies one of two kinds of operation: load or mapping. When a load operation is specified, an operand is loaded from the OLD entry indexed by Idx OLD into the corresponding OF of the instruction. The load operation also pushes the loaded operand into a mapping queue (MQ) so that other tags can map to this operand. When a mapping operation is specified, the operand in the MQ at the position given by the value of the tag is loaded. So, if an operand depends on an operand in a previous instruction, which was loaded from the OLD by an earlier load tag, then the later tag can load it from the MQ. This reduces the repetitions in the OFD. With a 2 bit mapping tag, for example, '00' means a load operation, and '01'-'11' signify mapping operations that index to the first, second, or third operand in the MQ. When the MQ is full and a further operand is pushed into it, the oldest MQ entry is pushed out and discarded. Any further reference to this operand needs a load tag to load it from the OLD again.
Fig. 4 illustrates the operand remapping technique using the example in Fig. 2. First, the three opcodes of the instruction sequence are stored into the OPD. Next, notice that the first three operands of the first instruction are different, so three load tags '00' are required to load the three operands '1110', '0001' and '0000'. Since the fourth and fifth operands are the same as the third one, these are simply mapped to the third operand: two mapping tags '11' indicate the operand in the MQ at position 3. The last operand is the same as the second operand, so a mapping tag '10' is used. These six tags form one entry of the ORD. The second and third instructions are coded in the same way.
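The load and mapping operations can be sketched in Python as follows. This is a minimal model under stated assumptions: the MQ is a FIFO whose positions are numbered from the oldest entry (position 1), its capacity equals the number of non-zero tag values, and it persists across the instructions of a sequence. The names `remap` and `restore` are hypothetical.

```python
from collections import deque

LOAD = 0  # tag value 0 ('00' for 2-bit tags) means "load from the OLD"


def remap(operand_fields, tag_bits=2):
    """Compressor side: turn an OF sequence into (mapping tags, operand list)."""
    mq = deque(maxlen=(1 << tag_bits) - 1)   # oldest entry is dropped when full
    tags, operand_list = [], []
    for of in operand_fields:
        if of in mq:
            tags.append(mq.index(of) + 1)    # mapping tag = 1-based MQ position
        else:
            tags.append(LOAD)
            operand_list.append(of)          # value goes to the OLD entry
            mq.append(of)                    # and becomes available for mapping
    return tags, operand_list


def restore(tags, operand_list, tag_bits=2):
    """Decompressor side: rebuild the OF sequence from tags and the operand list."""
    mq = deque(maxlen=(1 << tag_bits) - 1)
    operands = iter(operand_list)
    fields = []
    for tag in tags:
        if tag == LOAD:
            of = next(operands)
            mq.append(of)
        else:
            of = mq[tag - 1]
        fields.append(of)
    return fields


# First instruction of the worked example: 1110 0001 0000 0000 0000 0001
tags, operand_list = remap([0b1110, 0b0001, 0b0000, 0b0000, 0b0000, 0b0001])
# tags == [0, 0, 0, 3, 3, 2], operand_list == [0b1110, 0b0001, 0b0000]
```

The six OFs (24 bits) become six 2 bit tags plus three stored operands; the saving comes from tag sequences and operand lists repeating across instruction sequences more often than raw OF sequences do.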
Use of operand majority
An operand sequence usually contains some values that appear more frequently than others. The most frequently used operand is termed the first majority, the second most frequent the second majority, and so on. Using the majorities in remapping is very effective and reduces the OLD size. The majorities are stored in majority registers (MRs), and some tag values are reserved to map OFs to them. Consider again the example in Fig. 4, and assume that the value '1110' is the majority operand in the OLD. Fig. 5 illustrates the remapping technique with a majority applied to the original code of Fig. 4. The mapping tag '11' is assigned to the majority. The tag '00' still denotes a load operation, and tags '01'-'10' are mapping operations.
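A minimal Python sketch of remapping with majorities follows. It assumes that tag 0 is the load tag, that the highest tag values are reserved for the majority registers, that the remaining tag values index the mapping queue, and that a majority operand is never pushed into the queue. The function name and the exact tag layout beyond what the text states are hypothetical.

```python
from collections import deque


def remap_with_majorities(operand_fields, majorities, tag_bits=2):
    """Remap an OF sequence when some tag values are reserved for majorities.

    Tag layout (assumed): 0 = load, 1..mq_size = MQ positions,
    mq_size+1.. = majority registers in the order given by `majorities`.
    """
    mq_size = (1 << tag_bits) - 1 - len(majorities)
    if mq_size < 0:
        raise ValueError("more majorities than available tag values")
    mq = deque(maxlen=mq_size)
    first_majority_tag = mq_size + 1
    tags, operand_list = [], []
    for of in operand_fields:
        if of in majorities:
            # Majority operands come from the MRs and need no OLD storage.
            tags.append(first_majority_tag + majorities.index(of))
        elif of in mq:
            tags.append(mq.index(of) + 1)
        else:
            tags.append(0)
            operand_list.append(of)
            mq.append(of)
    return tags, operand_list


# Same OF sequence as before, with 1110 held in a majority register.
tags, operand_list = remap_with_majorities([0b1110, 1, 0, 0, 0, 1], [0b1110])
```

With one majority and 2 bit tags, the value '1110' is encoded by tag '11' and no longer occupies an OLD slot, at the cost of one fewer mapping position in the queue.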
Size of mapping tag
In this section, the OFD size reduction is evaluated for various sizes of mapping tag. The benchmarks tested come from MediaBench [17], which contains applications culled from available image processing, communications and DSP applications. Appendix, Section 8.2 provides a summary of the programs in MediaBench. The benchmark programs are compiled for ARM7TDMI [11] using ARMCC [18] from ARM's Software Development Toolkit (SDT). All the experimental programs are compiled with '-O2' optimisation.
Figs. 6-8 show the ratios of the size of ORD plus OLD over the size of the original OFD using 3 bit, 2 bit and 1 bit mapping tags, respectively. The x-axis categorises the benchmark programs and the y-axis shows the size reduction ratio. In Fig. 6, each program has eight lines, indicating 0-7 majorities, respectively. It was found that introducing the first majority reduces the OFD most, but the greater the number of majorities, the smaller the reduction in OLD size and the larger the increment in ORD size. The best case is to use one tag 000 for the load operation, four tags 001-100 for mapping and three tags 101-111 for majorities. This tag assignment saves about 70% of the OFD size. In Fig. 7, each program has four lines, indicating 0-3 majorities, respectively. The best case is one tag 00 for the load operation and the other tags 01-11 for majorities. In Fig. 8, the best case is one tag 0 for the load operation and one tag 1 for the majority. However, Figs. 7 and 8 show that the size reductions are minor, because the mapping tag is too small to take advantage of the remapping. It is concluded that a 3 bit mapping tag is needed to achieve the best compression ratio.
Identities of the condition field
All ARM instructions contain a condition field (Cond in Fig. 1) which controls whether the instruction is executed, depending on the N (Negative), Z (Zero), C (Carry), and V (oVerflow) flags in the current program status register (CPSR) [11]. Experiments reveal that the values in the condition fields of 82% of instructions are the same as those of the previous or the following instructions. Only 18% of instructions have a different condition value from those of their neighbours. Therefore, when building the OPD, a restriction is made that all instructions in a sequence must have the same condition code. The condition field can then be removed from the OLD. A single condition code now stands for all instructions of a sequence and is encoded separately.
Compression algorithm
First, the benchmark programs are compiled to executables using ARMCC [18]. Instruction sequences with the same opcode sequence are extracted from within basic blocks, so that no branch instruction jumps into the middle of a sequence. The opcode sequences are put in the OPD. Each OPD entry has two fields: an opcode field (8 bits) and a boundary field (1 bit). The opcode field contains the opcode of an instruction and the boundary bit indicates whether this instruction is the end of a sequence. The OF sequence is separated into the OF mapping sequence and the operand list using 3 bit mapping tags. The OF mapping sequences make up the ORD and the operand lists make up the OLD. It is assumed that the OPD, ORD and OLD entries of the first instruction of a sequence must be aligned to a byte boundary, while the following entries are arranged side by side. Once all dictionary entries are defined, the occurrences of the instruction sequences are replaced with codewords. A variable-length codeword is used in our compression method to give a better compression ratio. A codeword consists of four parts: a condition code encoding (CC), an index to the OPD (Idx OPD), an index to the ORD (Idx ORD) and an index to the OLD (Idx OLD). All four parts are coded separately using Huffman coding [13]. One obvious side effect of code compression is that it alters the addresses of the instructions. To overcome this problem, every branch target must be a codeword. The branch instruction (B) and the branch and link instruction (BL) use offset addressing, which can simply be corrected according to the compressed memory locations of the targets. The 24 bit offset of the B (BL) instruction is divided into two parts: the first 21 bits indicate the byte offset, and the last three bits are treated as the bit offset. The call-return instructions need not be patched because these instructions load the contents of the link register, which contains the compressed address during execution.
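The split of the 24 bit branch offset can be sketched as below. The text specifies only the 21/3 bit split; treating the byte offset as a signed two's-complement value (as the original ARM branch offset is) and the bit offset as an unsigned position 0-7 is an assumption, and both function names are hypothetical.

```python
def pack_branch_offset(byte_offset, bit_offset):
    """Pack a byte offset (assumed signed, 21 bits) and a bit offset (0-7) into 24 bits."""
    if not -(1 << 20) <= byte_offset < (1 << 20):
        raise ValueError("byte offset does not fit in 21 bits")
    if not 0 <= bit_offset < 8:
        raise ValueError("bit offset must be in 0..7")
    return ((byte_offset & 0x1FFFFF) << 3) | bit_offset


def unpack_branch_offset(field):
    """Recover (byte_offset, bit_offset) from the 24-bit offset field."""
    byte_offset = (field >> 3) & 0x1FFFFF
    if byte_offset & (1 << 20):          # sign-extend the 21-bit value
        byte_offset -= 1 << 21
    return byte_offset, field & 0x7


field = pack_branch_offset(5, 3)         # target is 5 bytes and 3 bits away
```

The bit offset is needed because variable-length Huffman codewords do not in general start on byte boundaries in the compressed code.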
Note that the indirect branch is a branch and exchange instruction (BX). The BX instruction uses an address register to identify the boundary location between the ARM and Thumb codes. When modifying this type of instruction, we must backtrack the contents of the address register and modify the contents before storing the address into the register.
Experiment results
Compression ratios using the traditional [3], operand factorisation [4] and operand field remapping compression methods are compared. In the traditional compression method, the undefined instruction (refer to Appendix, Section 8.1) is used to store the codewords. Since the undefined instruction encoding occupies only 4 bits (bits 27 to 25 and bit 4), at most two codewords can be packed into an undefined instruction if there are consecutive codewords. The maximal number of dictionary entries is 2^14. In the operand factorisation method, expression trees [19] are factorised into opcode sequences and operand sequences. The condition field is treated as an immediate value and the instruction sequence is factorised into the new opcode (refer to Fig. 1) sequence and the operand field sequence; the occurrences are then encoded with codeword pairs using Huffman encoding. The first entries of both the opcode sequence and the operand sequence are also aligned to a byte boundary. Fig. 9 shows the final compression ratios. The x-axis categorises the benchmarks tested and the y-axis shows the final compression ratio. Each benchmark has three lines, indicating the compression ratios using the traditional, operand factorisation and operand field remapping methods, respectively. The average compression ratio using the traditional compression method is 65%, ranging from 61% to 67%. The compression ratio is constrained since 16% to 25% of instructions remain uncompressed. The reason is that these instructions are unique and are not selected into the dictionary, and there are not enough dictionary entries to include them.
The compression ratio using operand factorisation is about 48%, ranging from 43% to 51%. It was found that the RGEN (operand sequence dictionary) and the compressed code are the two major parts that contribute to the compression ratio. The average compression ratio using the operand field remapping compression method is 46%, ranging from 41% to 50%. This result is better than that of operand factorisation when the dictionaries of both methods are aligned to a byte boundary. The improvement arises because the mapping sequence can exploit more repetitions than the operand sequence does, and because omitting the condition codes further reduces the dictionary size. Table 1 shows the percentages of all components of the compressed program. Table 1 indicates that the average reduction from RGEN and IMD to ORD and OLD is about 13%, although the compressed code increases by about 12%. This is the main advantage of operand field remapping over the operand factorisation method.
Conclusions
The paper proposes an operand remapping compression method to compress embedded system programs for an ARM processor. The key idea of this method is to map later occurrences of an operand to an earlier occurrence of the same operand in the same operand field sequence. The best compression ratio obtained using this method was 41.34%, with an average of 45.9%. This can be further improved in several ways. First, compressing the OPD, ORD and OLD by utilising recursive pointers [20] to reuse the dictionary entries can further reduce dictionary sizes. Pointers in the dictionaries will make the decompression engine much more complicated, but will produce denser code. Second, the compiler could attempt to produce identical instruction sequences for the same expression tree [19] so that the more common instruction sequences become more compressible [21]. One way to accomplish this is to allocate the same registers [21, 22] for the same expression tree. Finally, the compression algorithm could be improved by finding more relations between operands, such as the register allocation rules.
