Memory has been one of the most restricted resources in the embedded computing system domain. Code compression has been proposed as a solution to this problem. Previous work used fixed-to-variable coding algorithms that translate fixed-length bit sequences into variable-length bit sequences.
INTRODUCTION
Embedded computing systems are space and cost sensitive. Memory has been one of the most restricted resources, which poses serious constraints on program size. This problem has led to many executable code compression efforts. An industrial example is the IBM PowerPC 400 series processor. In Figure 1, the compressed code is stored in the external memory, and a decompression core, which is called CodePack [1], is placed between the memory and the cache. Existing statistical code compression algorithms are mostly variable-to-variable coding or fixed-to-variable coding. This means that the decompressor takes variable-length input and the decompression procedure is sequential, since the decompressor does not know where to start decompressing the next symbol until the current symbol is fully decompressed. Power consumption is another important issue for embedded systems. According to Givargis and Vahid [2], a system's power can be broken into two components. The first component is internal circuit capacitance times the average internal circuit transitions, while the second component is external bus capacitance times the average external bus transitions.
On average, the busses in a typical IC consume half of the total chip power. It is therefore important to reduce bus power consumption. Much has been done in reducing the internal circuit power. Instruction busses are often considered highly rigid and unalterable. Therefore they have not been greatly optimized for power.
[Figure 1: PowerPC 40x; recoverable labels: External, Decoder Tables]
In this paper, we present novel code compression schemes that use variable-to-fixed (V2F) coding. The decompression procedure takes fixed-length input, which makes decompressor design easier. Parallel decompression for the memoryless V2F coding scheme favors VLIW architectures, where a high-bandwidth instruction pre-fetch mechanism is required to supply multiple operations per cycle. A novel instruction bus power reduction scheme is also proposed based on the V2F coding. This paper is organized as follows. Section 2 reviews previous related work. Section 3 describes the code compression algorithm. Section 4 discusses the decompressor design. Section 5 describes the instruction bus power reduction by using V2F coding. Experimental results on two VLIW architectures, IA-64 and TMS320C6x, are presented in Section 6, and finally Section 7 concludes the paper.
RELATED WORK
Wolfe and Chanin [8] were the first to propose an embedded processor design that used Huffman coding to compress cache blocks. A similar technique called CodePack [1], which uses more complicated Huffman tables, has been developed by IBM and used in their PowerPC processor. Liao et al. [5] proposed dictionary methods, which make compressed code easy to decompress. Lekatsas and Wolf proposed an algorithm called SAMC [4], which is based on arithmetic coding in combination with a precalculated Markov model.
Both schemes targeted RISC architectures.
For VLIW architecture, Ishiura [3] reduced the problem of finding a good instruction coding for code compression to the problem of splitting up instructions into fields such that these fields are compressed optimally. Their scheme is a dictionary-based table look-up approach. Nam et al. [6] also proposed a dictionary-based code compression using isomorphism among VLIW instruction words. Frequently used instruction words are extracted from the original code to be mapped into two dictionaries, an opcode dictionary and an operand dictionary. Both approaches mentioned above are dictionary-based schemes and target traditional VLIW architectures, which have rigid instruction word formats and a lot of redundancy.
Our investigation [9] of several modern VLIW architectures, including TI's DSP TMS320C6x and Motorola's StarCore DSP SC140, as well as Intel/HP's IA-64, reveals that modern VLIW ISAs adopt a VLES (variable-length execution set) scheme to achieve high code density, which implies that the dictionary-based schemes by Ishiura and Nam are not feasible for modern VLIW processors. We extended the arithmetic coding algorithm and presented compression schemes as well as the decompression architecture design for modern VLIW architectures, which have very flexible instruction word formats to achieve code density [9] [10]. Variable-to-fixed (V2F) coding was first investigated by Tunstall [7]. One advantage of V2F coding is that it is easy to index into the compressed data, since the codeword length is fixed. To the best of our knowledge, although variable-to-fixed length codes have been investigated, they have not yet been applied to code compression. Our research applies variable-to-fixed coding to code compression.
COMPRESSION ALGORITHM
Both the Huffman coding and the arithmetic coding used by previous work are fixed-to-variable coding algorithms, which translate fixed-length bit sequences into variable-length bit sequences. In this section, we describe two variable-to-fixed coding algorithms that translate variable-length bit sequences into fixed-length bit sequences.
Memoryless V2F coding algorithm
We use the same procedure that was proposed by Tunstall [7] to generate an optimal V2F code for a discrete memoryless source. Assume that the ones and zeros in the executable code are independent and identically distributed (iid); we calculate the probability of 1s and 0s. For example, in the IA-64 ISA, the probability of 0 is about 83% and the probability of 1 is about 17%, while in TMS320, the probability of 0 is about 75% and the probability of 1 is about 25%. Suppose we want to construct N-bit Tunstall codewords; the number of codewords is then $2^N$. The algorithm is given below:
1. A tree is created with the root node having probability 1.0. We attach 0 and 1 to the root; the resulting two leaf nodes have the probabilities of occurrence of 1 and 0, which are Prob(1) and Prob(0) respectively.
2. The leaf node with the highest probability is split into two branches labeled 0 and 1. After splitting, the number of leaf nodes increases by 1.
3. Step 2 is repeated until the total number of leaf nodes is equal to $2^N$.
4. Assign equal-length codewords (length = N) to the leaf nodes. The assignment can be arbitrary and does not affect the compression ratio at all.
Figure 2 shows the construction of a 2-bit Tunstall code for a binary stream with iid probability, in which the probability of a bit being 0 is 0.8 and the probability of a bit being 1 is 0.2. The code tree expands until there are 4 leaf nodes. A 2-bit codeword is randomly assigned to each leaf node. After the Tunstall codebook is ready, compression of the binary stream is straightforward. For example, a binary stream 000 01 001 is parsed into three phrases, each of which is encoded as one 2-bit codeword.
Figure 2: A memoryless Tunstall coding tree
After constructing the coding tree and the codebook, we compress the instructions block by block to ensure random access. To compress each block, we start from the root node; if a "1" occurs, we take the right branch; otherwise, we take the left branch. Whenever a leaf node is encountered, the codeword related to that leaf node is produced and we restart from the root node.
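The construction and the tree traversal described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `build_tunstall`, `make_codebook`, and `encode` are hypothetical, and the codeword assignment (phrases numbered in sorted order) is one arbitrary choice, so the codewords need not match those drawn in Figure 2.

```python
import heapq


def build_tunstall(p0, n_bits):
    """Return the 2**n_bits source phrases (leaves) of a binary Tunstall tree.

    p0 is the probability of a 0 bit; bits are assumed iid.
    """
    # Heap entries are (-probability, phrase) so the most probable leaf pops first.
    leaves = [(-p0, "0"), (-(1.0 - p0), "1")]
    heapq.heapify(leaves)
    while len(leaves) < 2 ** n_bits:
        neg_p, phrase = heapq.heappop(leaves)          # most probable leaf
        heapq.heappush(leaves, (neg_p * p0, phrase + "0"))
        heapq.heappush(leaves, (neg_p * (1.0 - p0), phrase + "1"))
    return sorted(phrase for _, phrase in leaves)


def make_codebook(phrases, n_bits):
    """Assign fixed-length codewords; any one-to-one assignment would do."""
    return {ph: format(i, "0{}b".format(n_bits)) for i, ph in enumerate(phrases)}


def encode(bits, codebook):
    """Walk the (prefix-free) phrase set; emit a codeword at every leaf.

    Returns the codeword list and any leftover bits that ended at a non-leaf.
    """
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in codebook:        # reached a leaf of the coding tree
            out.append(codebook[cur])
            cur = ""               # restart from the root
    return out, cur


phrases = build_tunstall(0.8, 2)
book = make_codebook(phrases, 2)
codewords, leftover = encode("00001001", book)
```

With Prob(0) = 0.8 and N = 2 the leaves are 000, 001, 01, and 1, the same four leaves as in the example above; the stream 000 01 001 yields three 2-bit codewords with no leftover bits.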
There are two problems that have to be taken care of during compression:
End of Block. Since we compress the instructions block by block, it is very likely that at the end of the block, the tree traversal ends at a non-leaf node. For instance, when we restart from the root node in Figure 2, if the last two bits in the block are "00", the compression ends at a non-leaf node and no codeword is produced. To avoid this problem, when the compression reaches the end of the block without reaching a leaf node, we pad extra bits to the block such that the traversal can continue until a leaf node is met and a codeword is produced. In the example we gave, we simply pad a "1" to the original block such that the last 3 bits "001" can be encoded into "11". During decompression, the whole block is decoded together with the extra padded bits. However, since we know the block size a priori, we simply truncate the extra bits.
Byte-alignment. To make the decompression hardware simpler, and to make the storage of the compressed code easier, the compressed block must be byte aligned. This means that if the result of compressing a block is not a multiple of 8 bits, a few extra bits are padded to ensure that it becomes a multiple of 8. We can thus ensure that the next block will start on a byte-aligned boundary.
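The two fixes above can be illustrated with a short Python sketch. It is written under stated assumptions: the codebook below uses the four leaves of Figure 2, but only the mapping of "001" to "11" comes from the text; the other three codewords, the function names, and the choice of "0" as the alignment filler are hypothetical. Padding with "1" always terminates because every internal node of the coding tree has both children.

```python
# Leaves of the Figure 2 tree; only "001" -> "11" is taken from the text,
# the remaining assignments are illustrative.
CODEBOOK = {"000": "00", "01": "01", "1": "10", "001": "11"}


def compress_block(bits, codebook):
    """Compress one block, padding at the end of the block and to a byte boundary."""
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in codebook:
            out.append(codebook[cur])
            cur = ""
    # End of block: the traversal stopped at a non-leaf node, so pad "1" bits
    # until a leaf is reached and a codeword can be produced.
    while cur:
        cur += "1"
        if cur in codebook:
            out.append(codebook[cur])
            cur = ""
    packed = "".join(out)
    # Byte alignment: fill up to the next multiple of 8 bits.
    packed += "0" * (-len(packed) % 8)
    return packed


def decompress_block(packed, codebook, block_bits):
    """Decode fixed-length codewords, then truncate to the known block size."""
    inverse = {cw: ph for ph, cw in codebook.items()}
    n = len(next(iter(inverse)))           # codeword length N
    out = ""
    for i in range(0, len(packed), n):
        out += inverse[packed[i:i + n]]
        if len(out) >= block_bits:         # the rest is padding
            break
    return out[:block_bits]


packed = compress_block("0000100", CODEBOOK)
restored = decompress_block(packed, CODEBOOK, block_bits=7)
```

Here the 7-bit block ends in "00", so one "1" is padded to reach the leaf "001"; the 6 codeword bits are then filled to 8 bits, and the decoder recovers the original block by truncating to 7 bits.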
Markov V2F coding algorithm
In this section, we present a new Markov V2F code compression algorithm that combines the original V2F coding algorithms with a Markov model.
In order to improve the compression ratio, we have to exploit the statistical dependencies among bits in the instructions and use a more complicated probability model. One of the most popular ways of representing dependence in data is through the use of Markov models, which consist of a number of states, where each state is connected to other states and each transition has a probability assigned to it. Two main variables are used to describe our model, namely the model depth and the model width, which represent the number of layers and the number of Markov nodes per layer, respectively. Intuitively, the depth should divide the instruction size evenly, or be a multiple of the instruction size, since we would like our model to start at exactly the same layer after a certain number of instructions, such that each layer corresponds to a certain bit in the instruction and therefore stores the statistics for this bit. The model's width is a measure of the model's ability to remember the path to a certain node. The upper part of Figure 3 is an example of a 4x4 Markov model. The left (right) edge with a probability P from a state A to a state B implies that if the current state is A, then the probability that the next bit is zero (one) is P and the next state will be B.
The procedure used to compress instructions using a Markov model can be described as follows:
1. Statistics-gathering phase. Choose the width and depth for the Markov model. The first state is the initial state, corresponding to no input bits. Its left and right children correspond to the "0 input" and "1 input", respectively. By going through the whole program, we gather the probability for each transition. Note that we always go back to the initial state whenever we start a new block.
2. Codebook construction phase. After constructing the Markov model, we generate an N-bit variable-to-fixed length coding tree and codebook for each state in the Markov model, using the same memoryless algorithm described in the previous section. Each state in the Markov model has its own coding tree and codebook. Therefore, for an M-state Markov model using N-bit variable-to-fixed length codes, there are M codebooks and each codebook has $2^N$ codewords. As in memoryless V2F coding, the codeword assignment for each of these M codebooks can be arbitrary and will not affect the compression ratio.
We compress instructions block by block. We always use the coding tree and the codebook for the initial state at the beginning of each block. This ensures that the decoder can start decompressing at any block boundary. Starting from the root of the coding tree for each state, the compression procedure traverses the tree according to the input bits until a leaf node is met. A codeword related to the leaf node is produced and the compression procedure jumps to the root node of the coding tree (and uses the codebook) of the state indicated by this leaf node. Compared with Figure 2, here we use the probability that is associated with each edge instead of a fixed probability for bit 0 and bit 1. Furthermore, for each codebook entry, we have to indicate what the next state is. For example, starting from state 0, if the input is 000, then the encoder output is 11 and the next state is 12. The encoder then jumps to the codebook for state 12 and starts encoding using that codebook.
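The per-state codebooks and the "next state" field of each codebook entry can be sketched as follows. This is a toy illustration under stated assumptions, not the model of Figure 3: the two-state `CODEBOOKS` table, its phrases, its codewords, and its next-state values are all invented for the example, and `markov_encode` and `markov_decode` are hypothetical names. In the actual scheme each state's phrases would come from a Tunstall tree built with that state's transition probabilities.

```python
# Toy model: state -> {phrase: (codeword, next_state)}. All values are
# illustrative; a real model would be derived from program statistics.
CODEBOOKS = {
    0: {"000": ("11", 1), "001": ("10", 0), "01": ("01", 1), "1": ("00", 0)},
    1: {"0": ("00", 0), "10": ("01", 1), "110": ("10", 0), "111": ("11", 1)},
}


def markov_encode(bits, codebooks, initial_state=0):
    """Encode one block; every block starts in the initial state."""
    state, cur, out = initial_state, "", []
    for b in bits:
        cur += b
        entry = codebooks[state].get(cur)
        if entry is not None:              # leaf of the current state's tree
            codeword, state = entry        # emit and jump to the next codebook
            out.append(codeword)
            cur = ""
    return out, cur                        # cur holds bits that need end-of-block padding


def markov_decode(codewords, codebooks, initial_state=0):
    """Decode sequentially: each codeword selects the codebook for the next one."""
    inverse = {
        s: {cw: (ph, nxt) for ph, (cw, nxt) in book.items()}
        for s, book in codebooks.items()
    }
    state, out = initial_state, ""
    for cw in codewords:
        phrase, state = inverse[state][cw]
        out += phrase
    return out


encoded, leftover = markov_encode("00010001", CODEBOOKS)
decoded = markov_decode(encoded, CODEBOOKS)
```

Note that the same codeword (for example "01") stands for different phrases in different states, which is why the decoder has to know the current state before it can interpret the next codeword.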
DECOMPRESSION ARCHITECTURE
In order to decode the compressed code, the same codebook must be available to the decoder. For memoryless variable-to-fixed code compression, parallel decompression is possible because the codeword size is fixed and all codewords in the compressed code are independent. If the code is compressed using N-bit V2F codes, we can segment the compressed code into many N-bit chunks, and all those N-bit chunks can be decompressed simultaneously in one clock cycle.
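The independence of the N-bit chunks can be shown with a small software sketch. It only models the data flow: the lookup table is a hypothetical inverse codebook for the four leaves of Figure 2, the function names are illustrative, and the list comprehension stands in for the hardware lookup units that would operate at the same time.

```python
# Hypothetical inverse codebook (codeword -> phrase) for a 2-bit code.
TABLE = {"00": "000", "01": "01", "10": "1", "11": "001"}


def split_chunks(packed, n_bits):
    """Cut the compressed bit string into fixed-length N-bit chunks."""
    return [packed[i:i + n_bits] for i in range(0, len(packed), n_bits)]


def parallel_decode(packed, table, n_bits):
    """Decode every chunk with its own table lookup.

    No lookup depends on the result of another, so in hardware all of them
    can be performed in the same cycle; only the final concatenation is ordered.
    """
    chunks = split_chunks(packed, n_bits)
    phrases = [table[c] for c in chunks]   # independent lookups
    return "".join(phrases)


decoded = parallel_decode("000111", TABLE, 2)
```

Because the chunk boundaries are known in advance, chunk k can be looked up without decoding chunks 0 to k-1 first, which is not possible with variable-length codewords.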
Figure 4 shows the parallel decoding for memoryless variable-to-fixed coding. Each decoder D is an N-bit table lookup unit that corresponds to the codebook such as the one in Figure 1. The decoder is very small. For example, a 4-bit decoder for the codebook in Figure 2 has fewer than 100 gates and the size is only 4 um^2 when implemented in the TSMC 0.25 standard cell library. For Markov variable-to-fixed coding, we cannot decompress the next N-bit chunk (assuming it is compressed using an N-bit fixed-length V2F code) before the current N-bit chunk is decompressed, because we have to decompress the current N bits to know which codebook to use to decode the next N-bit chunk. We can use an architecture similar to the one presented in our previous work for arithmetic coding [9], storing the codebooks in the RAM (or ROM), and using match logic to decode the current N bits and send the next codebook address to the RAM (or ROM).
POWER REDUCTION FOR INSTRUCTION BUS
Even though much work on reducing the address bus power has been done, the instruction bus is often considered highly rigid and unalterable. Therefore it has not been greatly optimized for power. In this section we show that by using variable-to-fixed coding, we can reduce instruction bus power consumption when transmitting compressed instructions. As we mentioned in Section 3, the codeword assignment for V2F coding can be arbitrary. Since the codeword length is fixed, any codeword assignment will result in the same compression ratio. However, carefully assigning the codewords can reduce bit toggling on the bus. Since the energy consumed on the bus is proportional to the number of bit toggles on the bus, this reduces bus power consumption.
Assume that we use Markov V2F coding (memoryless V2F coding is a special case of Markov V2F coding, in which the Markov model has only one state). There are M states in the Markov model and the length of the codeword is N. Therefore we have M codebooks and each codebook has $2^N$ codewords. Each codeword can be represented by $[C_i, W_j]$, in which $C_i$ ($C_i = 1, 2, 3, \ldots, M$) is one of the M codebooks and $W_j$ ($W_j = 1, 2, 3, \ldots, 2^N$) is a label for each of the $2^N$ codewords in codebook $C_i$.
When packing and transferring compressed blocks via the bus, there are two ways to pack a compressed block: one is to pad the current compressed block with part of the next compressed block to increase bandwidth, and the other is to just leave the leftover bits without padding. These two packing approaches may affect bus toggling differently. Our experiments show that the non-padding approach results in fewer bus toggles than the padding approach; therefore the discussion below is based on the non-padding approach. Figure 5 shows an example of codeword patterns that are transmitted over the instruction bus. $[C_i, W_j]$ is an N-bit codeword that belongs to codebook $C_i$. The beauty of variable-to-fixed coding compared to variable-to-variable coding or fixed-to-variable coding is that the bus transition patterns can be translated into codeword transition patterns, because the codeword length is fixed.
Figure 5: Instruction bus transition
By going through the whole compressed program, we can construct a codeword transition graph as shown in Figure 6. Each node in the graph is a codeword in the codebook. The edge between two nodes indicates that there are bus transitions between these two codewords. Each edge has a weight $E_i$ associated with it, specifying how many times the transition happens.
Figure 6: Codeword transition graph
The N-bit codeword assignment can be arbitrary, except that within the same codebook each codeword has to be distinct, i.e., for $[C_i, W_j]$ and $[C_k, W_l]$, if $C_i = C_k$ and $W_j \neq W_l$, the N-bit binary codeword assigned to node $[C_i, W_j]$ must be different from the one assigned to $[C_k, W_l]$. We use $H_i$ to denote the Hamming distance between the two N-bit binary codewords that are assigned to the nodes of edge i. Therefore, the total number of bus toggles can be represented by $\sum_i H_i \times E_i$. Our goal is to find the codeword assignment that minimizes the bus toggles. There are $(2^N!)^M$ possible assignments for an M-state Markov model with codeword length N. When M = 1, the problem reduces to the classical state assignment problem in VLSI design, which has been proved to be NP-hard. Therefore, we use the following greedy algorithm to achieve a good codeword assignment, even though it does not guarantee that the bus toggles are minimized.
A greedy heuristic codeword assignment algorithm:
1. sort all the edges by weights in decreasing order.
2. for each edge, if either node is not assigned, assign valid codewords with minimal Hamming distance.
3. go to step 2 until all nodes are assigned.
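The three steps above can be sketched in Python as follows. This is one possible reading of the heuristic, not the authors' code: the edge format `((codebook, label), (codebook, label), weight)`, the function names, and the tie-breaking rule (the first minimal pair found) are assumptions made for the example. Codewords are held as integers so that the Hamming distance is a popcount of their XOR.

```python
def hamming(a, b):
    """Number of differing bits between two integer codewords."""
    return bin(a ^ b).count("1")


def greedy_assign(edges, n_bits):
    """Greedy codeword assignment on a codeword transition graph.

    edges: list of (u, v, weight), where u and v are (codebook, label) nodes.
    Returns a dict mapping each node to an integer codeword below 2**n_bits.
    """
    assigned = {}                 # node -> codeword
    used = {}                     # codebook -> set of codewords already taken

    def candidates(node):
        if node in assigned:      # an assigned node keeps its codeword
            return [assigned[node]]
        taken = used.setdefault(node[0], set())
        return [w for w in range(2 ** n_bits) if w not in taken]

    # Step 1: heaviest edges first.
    for u, v, _ in sorted(edges, key=lambda e: -e[2]):
        if u in assigned and v in assigned:
            continue

        def valid(a, b):
            if u == v:                        # self-loop: one node, one codeword
                return a == b
            return a != b or u[0] != v[0]     # distinct within one codebook

        # Step 2: pick the valid pair with minimal Hamming distance.
        cu, cv = candidates(u), candidates(v)
        best = min(
            ((a, b) for a in cu for b in cv if valid(a, b)),
            key=lambda ab: hamming(*ab),
        )
        for node, word in zip((u, v), best):
            assigned[node] = word
            used.setdefault(node[0], set()).add(word)
    return assigned


def total_toggles(edges, assigned):
    """Sum of H_i * E_i over all edges of the transition graph."""
    return sum(w * hamming(assigned[u], assigned[v]) for u, v, w in edges)


edges = [(("A", 0), ("A", 1), 10), (("A", 1), ("B", 0), 5), (("A", 0), ("A", 2), 3)]
assignment = greedy_assign(edges, n_bits=2)
toggles = total_toggles(edges, assignment)
```

In this small example the heaviest edge gets two codewords at distance 1, and the edge that crosses from codebook A to codebook B gets distance 0, since codewords only have to be distinct within a codebook.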
The greedy algorithm sorts all the edges by weight in decreasing order. Then, for each edge, it tries to assign two binary N-bit codewords to the nodes associated with the edge such that the Hamming distance is minimized. The Hamming distance can be 0 if the two nodes belong to different codebooks. There is only one restriction on the assignment: codewords in the same codebook must be distinct.
EXPERIMENTAL RESULTS
Figure 7 and Figure 8 show the compression ratio for TMS320C6x and IA-64 respectively, using memoryless V2F coding. We can see that the best compression ratio is achieved when N=4. It is interesting to note that as the codeword length increases from 2 bits to 4 bits, the compression ratio improves, but after that the compression ratio gets worse as the codeword length increases. To explain the experimental results, note that, byte alignment aside, the ratio R decreases as N increases, which means that the compression ratio is improved. But the improvement is not very significant, especially when N is larger than 4. On the other hand, since the compression poses a byte-alignment restriction for every block, using a 4-bit codeword greatly reduces the chance of padding extra bits. This explains why we achieve the best compression ratio when N=4 in both experiments. Intuitively, if we choose N=8, there is no need to pad extra bits and we can get a better compression ratio. Our experiments confirmed this. However, the average improvement of the compression ratio is less than 1%. Considering that the codebook size for N=4 is only $2^4 = 16$ entries, while the codebook size for N=8 is $2^8 = 256$ entries, we conclude that the best choice for static V2F code compression is to use 4-bit codewords.
Figure 9 and Figure 10 show the compression ratio for TMS320C6x and IA-64 benchmarks using a Tunstall-based V2F compression scheme, with a 128x4 (depth=128, width=4) Markov model and a 32x4 (depth=32, width=4) model respectively.
As the codeword length increases, the compression ratio is improved, though the codebook size doubles when the codeword length increases by 1. When N=4, the average compression ratio is about 56% for IA-64 and about 70% for TMS320C6x.
To construct a codeword graph such as the one in Figure 6, we have to profile the program execution and get the memory access footprint. We used a cycle-accurate simulator for TMS320C6x and profiled a benchmark program, the ADPCM (Adaptive Differential Pulse Code Modulation) decoder. The experimental result on the bus toggles is shown in Figure 11. The bus toggles are normalized over the original toggle count of 6699013. The figure shows the bus toggles after compression and codeword assignment using the greedy algorithm described in Section 5. The experiment uses 4-bit codewords with different probability models. We can see that using a static 1-bit model, we cannot get much bus power saving. As the model becomes larger, we have more flexibility in codeword assignment to reduce the instruction bus toggles.
Figure 11: Instruction bus toggles reduction for ADPCM decoder running on TMS320C6x
CONCLUSION AND FUTURE WORK
In this paper, we propose code compression schemes using variable-to-fixed (V2F) coding for embedded systems. Though the algorithm can be used for any embedded processor, it is more suitable for VLIW. By using a greedy codeword assignment algorithm, the instruction bus toggles can be reduced compared to the original uncompressed program. These are the first code compression schemes that use variable-to-fixed coding. Our future work includes finding a better heuristic algorithm for low-power codeword assignment and the ASIC design of the decompression architecture.
ACKNOWLEDGMENTS
This work was supported by Semiconductor Research Corporation (SRC). The authors would like to thank Prof. Niraj Jha in Princeton University and Dr. George Cai from Intel for valuable discussion.
