Application-specific instructions significantly improve the performance, energy, and code size of configurable processors. A common approach used in the design of such instructions is to convert application-specific operation patterns into new complex instructions. However, processors with a fixed instruction bitwidth cannot accommodate all the potentially interesting operation patterns, due to the limited code space afforded by the fixed instruction bitwidth. We present a novel instruction set synthesis technique that employs an efficient instruction encoding method to achieve maximal performance improvement. We build a library of complex instructions with various encoding alternatives and select the best set of complex instructions while satisfying the instruction bitwidth constraint. We formulate the problem using integer linear programming and also present an effective heuristic algorithm. Experimental results using our technique generate instruction sets that show improvements of up to 38% over the native instruction set for several realistic benchmark applications on a typical embedded RISC processor.
Introduction
Configurable processors are application-specific synthesizable processors where the instruction set and/or microarchitectural parameters such as register file size, functional unit bitwidth, etc. can be easily changed for different applications at the time of the processor design. Easier integration and manufacturing, and architectural and implementational flexibility make them better suited for embedded processors in system-on-a-chip (SOC) designs than traditional custom designed architectures [1]. With the commercialization of such configurable processors [2] [3], and also the increased interest in platform-based SOCs that employ such configurable processors, the problem of instruction set (IS) optimization is receiving a lot of attention from both industry and academia [4] [5].
In the design of such application-specific instructions, one of the common ways of improving performance is to make complex instructions from frequently occurring operation patterns [6] [8] [9], where a complex instruction is an instruction with more than one basic operation. The use of such complex instructions generally leads to better performance, smaller code size, and lower energy¹, since they replace sequences of simple instructions. However, in a processor with a fixed instruction bitwidth, not every operation pattern can be made into a new instruction due to the instruction bitwidth limitation². This paper presents a novel instruction set synthesis technique that employs efficient instruction encoding to achieve maximal performance improvement. Our approach first builds a library of complex instructions with various encoding alternatives and then selects the best set of complex instructions while satisfying the instruction bitwidth constraint. We formulate the problem using integer linear programming (ILP). But since solving an ILP problem takes a prohibitively long time even for a moderately sized problem, we also present an effective heuristic algorithm. Experimental results show that our proposed technique synthesizes instruction sets that generate up to 38% performance improvement over the processor's native IS for different application domains.
The contributions of our work are three-fold. First, it is aimed at modern RISC pipelined architectures with multi-cycle instruction support (representative of current configurable processors), while most existing methodologies [...] processors. Second, it tries to improve a given processor hardware through IS specialization, which makes our technique suitable for an emerging class of configurable processors that build on existing, popular ISA families. Third, our technique takes instruction encoding into account so that the obtained IS can be as compact and efficient as custom [...].
Section 2 motivates application-specific instruction set synthesis using motivating examples. Section 3 summarizes related work and Section 4 details the proposed IS synthesis technique with the problem formulation and heuristic algorithm. Section 5 shows the efficacy of our techniques through experiments on a typical embedded RISC processor running realistic applications, and Section 6 concludes the paper.
¹ Due to the reduced number of instruction fetches.
² We consider fixed instruction bitwidth because it is more common in contemporary embedded RISC processors.
Motivation
We illustrate the potential for significant performance improvements using application-specific instructions on a typical embedded RISC processor (the Hitachi SH-3 [12]) and a realistic application (the H.263 decoder algorithm). Initial profiling of the H.263 decoder application shows that about 50% of the actual execution time is spent in a simple function that does not contain any function calls and which consists of only two nested loops, either one of which, depending on a conditional, is actually executed. One of these innermost loops, when processed by an SH-3 targeted GCC compiler, generates 13 native instructions that execute in 14 cycles (not including branch stalls). By devising a custom instruction that takes 6 cycles, the loop is reduced to 4 instructions that execute in 9 cycles. Alternatively, the entire loop can be encoded into a single custom instruction that executes in 7 cycles. In this example, greater performance enhancement and code size reduction can be obtained by introducing custom instructions than by relying on traditional compiler optimizations or assembly coding.
However, those two custom instructions require 4 and 5 register arguments, respectively. Since one SH-3 register argument requires 4 bits, neither of these seemingly attractive custom instructions can fit into a 16-bit instruction. On the other hand, if we fix the positions of register arguments for these custom instructions, we do not need to specify arguments and only the opcodes need to be specified (similar to a function call with fixed register arguments).
Due to the instruction bitwidth constraint, however, instruction sets may not have the full power to express all the possible combinations of the operations in the hardware. Furthermore, each processor family has its own IS quirks. For instance, the SH-3 has a two-operand IS with an implicit destination: a typical SH-3 ADD instruction assumes the destination is the same as one of the source operands. This instruction encoding restriction adversely affects the performance, which further motivates the need for an application-specific instruction encoding method.
One practical consideration is that it is very difficult to add any new complex instructions into an existing IS, due to the limited size of the IS code space. One possibility is to use "undefined" instruction opcodes, which, however, often provide too small a code space for complex instructions. Another way to solve this problem is to define a basic IS using only a part of the instruction code space. Complex instructions can then be created in an application-specific manner and added into the remaining code space. To gain a performance benefit in this manner, the generated complex instructions should be optimized in terms of their encoding so that more of those instructions can be located within the limited instruction code space. These observations motivate our code space-economical IS synthesis technique detailed in Section 4.
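The bit budget behind the two custom instructions above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `operand_bits` and `opcode_bits_left` are hypothetical, and the only facts taken from the text are the 16-bit instruction width and the 4 bits needed per freely chosen SH-3 register argument (16 selectable registers).

```python
IBW = 16  # instruction bitwidth of the SH-3, as stated in the text


def operand_bits(num_choices):
    """Bits needed to select one of num_choices operand instances (hypothetical helper)."""
    return (num_choices - 1).bit_length()


def opcode_bits_left(register_choices):
    """Bits remaining for the opcode after encoding every register argument."""
    return IBW - sum(operand_bits(n) for n in register_choices)


# Four or five freely chosen registers (16 choices each) leave no room for an opcode.
print(opcode_bits_left([16] * 4))
print(opcode_bits_left([16] * 5))
# With every register argument fixed (a single choice each), all bits remain for the opcode.
print(opcode_bits_left([1] * 5))
```

A result of zero or less means the instruction cannot be encoded, which is why fixing the register positions makes the otherwise infeasible instructions representable.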
Related Work
The synthesis of application-specific instruction set architectures (ISAs) has been approached in two ways, depending on which part of the ISA is decided first: IS-oriented and structure-oriented. In IS-oriented approaches [6] [7] [8], the IS is first optimized from the application's behavior, which is typically represented as dependency graphs. The hardware is later designed to implement the instruction set, manually or automatically. Among the approaches in this direction, PEAS-I [7] is most similar to our approach in that both approaches assume a basic IS and target pipelined RISC architectures. However, PEAS-I has a fixed set of instructions from which a subset is selected; thus instruction encoding is never an issue, unlike in our approach. Other approaches, however, tend to presume their own architectural styles (e.g., 'transport triggered architecture' in [6]) and thus cannot be applied to modern pipelined RISC processors. More importantly, they do not give any hints on how to improve an existing processor architecture, which can be very helpful, particularly in the context of configurable processor-based system-on-a-chip design.
Structure-oriented approaches [9] [10] have a structural model of the architecture either implicitly assumed or as an explicit input and try to find the best instruction set matching the application. This approach has the advantage that it can leverage existing processor designs. However, previous work in this direction has not fully addressed the issue of instruction encoding: opcode and operand fields all have fixed widths.
As a result, the instruction bitwidth cannot be fully utilized as in custom processors, leading to limited performance improvement for the same bitwidth or increased instruction bitwidth and code size. Our approach is significantly different since multiple field widths are allowed for the same operand type, with different benefits and cost profiles, so that the optimal set of complex instructions, depending on the application, can be selected satisfying the instruction bitwidth constraint.
Another very closely related work is application-specific ISA customization for configurable processors. Zhao et al., in a case study [5] with a commercial configurable core, demonstrated a 2 to 4 times performance improvement via configuration and application-specific instructions. They changed the data path width, which is readily supported by the core, and designed several custom instructions, which require the application program to be re-coded to accommodate the newly added instructions. Therefore, their work is limited to the application and the core they used. Our approach is significantly different, since it automates the design of application-specific instructions, tries to improve the processor by changing only the IS, and generates instructions that can be easily supported by compilers without re-coding the program.
IS Synthesis and Instruction Encoding
To make it easier to create complex instructions from an application, we first define a basic IS: a basic instruction is restricted to have at most one operation to maximize the code space for a complex IS. A complex instruction can be viewed as a pattern of basic instructions. A target processor is then represented by a basic IS, which specifies all operations supported by the target processor, and structural information associated with each basic instruction, which is crucial to decide the usefulness of complex instructions. The structural information reveals functional and pipeline resources for each instruction with their cycle-level timing.
From an assembly code of basic instructions we create complex instructions, each of which is essentially a condensed and generalized form of a basic instruction sequence. Complex instructions can be multi-cycle instructions and are created such that their latency is minimized under the resource constraints of the processor (e.g., the number of register read/write ports, the number of functional units and buses at each pipeline stage, etc.). High-level synthesis (HLS) techniques such as resource-constrained list scheduling [11] can be used to find the number of cycles saved by using a complex instruction over a basic instruction sequence. If the saving is positive, we create a group of complex instructions, differing in the degree of generalization of operands, and put them into the complex instruction pattern library, to be used in the subsequent step of complex IS selection. We now describe each step in more detail.
Complex Instruction Pattern Generation
We assume the total instruction bitwidth is fixed but opcode and operand field widths are optimized for the applications. The complex instruction patterns generated here have two distinctive features as detailed below.
First, the opcode field widths are set to use the remaining bits after the widths of the operands are determined. This scheme is more flexible and efficient in terms of total instruction bitwidth usage than fixed-width opcodes for two reasons: (1) complex instructions with more operands (thus requiring more bitwidth) can be allowed if they are used often enough to justify the bitwidth usage, and (2) more complex instructions can be allowed that have fewer operands and hence require fewer operand bits. The only condition to be satisfied here is that the sum of the code space (defined as 2^(the number of bits needed for operands)) of every instruction should not exceed the allowed total code space (defined as 2^(the instruction bitwidth)). If this condition is met, every instruction can be given an opcode, possibly with a different width.
Second, since there can be multiple choices of field widths for an operand type, we create operand classes, where each class defines a set of operand instances with a bitwidth specification. These operand classes are used to define complex instructions. Generally, in a custom instruction set, the operand field width of a certain operand type can vary among instructions to allow more compact encoding of the operands. For example, an immediate field may have only 4 bits in the ADD+LOAD type instructions while it may have 8 bits in an ADDI instruction. Register fields can also have a reduced size to allow more operands to be encoded at the cost of accessing only a subset of the registers in that instruction. An example can be found in the SH-3 microprocessor [12], where some complex load instructions have the R0 register as their destination, so that the destination register need not be specified at all. To support such variability in operand field widths, an operand is allowed to be generalized into multiple operand classes with different bitwidths.
Fig. 2 illustrates the process of generating a complex instruction from a sequence of basic instruction instances. Fig. 2 (a) shows a conceptual view of the complex instructions. Operand classes can be defined as in Fig. 2 (b), where "#bits" is the number of bits needed for encoding operands. Fig. 2 (c) shows a group of complex instructions created from a sequence of basic instructions, where "#bits" is defined as before, and CR stands for the benefit of a complex instruction in terms of the cycle count reduction.
[...] register allocation capability. Operand classes with a single register can be dealt with relatively easily, while those operand classes with multiple but not all registers require more complex register allocation. Complex instructions with such a register operand class have their CR value (cycle count reduction by a single use of the complex instruction) reduced by the probability of necessitating an additional move instruction (in Fig. 2 (c), this probability is assumed to be 0.75).
Complex instructions thus created are compared with those in the library, and the library is updated either by including the new complex instruction or by increasing the reference count of the matched one in the library. At the end of this pattern generation step, the library contains [...] complex instruction can be estimated.
Complex instructions contribute to cycle count reduction at the cost of code space. Therefore the problem is to select a set of instructions that maximizes the cycle count reduction across the entire application program while respecting the code space requirement. Let {CI_i} be the set of complex instructions created, W_i be the associated bitwidth needed for a complex instruction CI_i, and x_i be a binary variable representing whether CI_i is selected or not. Then the code space [...]
Then the complex instruction set selection problem can be thought of as selecting BII_ij's, for which corresponding CI_i's can be substituted. Here [...] the selected CI_i's is maximized. Table 1 shows a simple example of complex instructions and the associated information. In the fourth column, each entry represents a basic instruction instance (e.g., one line of assembly code). The benefit of a complex instruction is defined as the sum of the CRs from all instances of basic instruction sequences covered, with the block repetition count taken into account. The last column shows the cost of selecting the complex instruction in terms of the code space taken. The benefit and cost can change as complex instructions are selected.
This complication arises due to two factors: superset instructions and multiple covers. In Table 1, CI_2 is a superset instruction of CI_1 because CI_2 has the same opcodes and operands as those of CI_1 but has more general operand classes.³ Therefore CI_2 covers more basic instruction instances but also requires more bits for representation. In this case, CI_1 is a special case of CI_2, so if CI_2 is included in the set of selected complex instructions, say S_c, then CI_1 is in effect already included with no cost. On the other hand, the cost of CI_2 should be decreased to (2^6 - 2^4) when it is known that CI_1 has been selected. When there is more than one superset instruction of a complex instruction, however, only one of them should have the decreased cost.
Multiple covers occur when more than one complex instruction covers the same basic instruction instance. In Table 1, a_3 is covered by all the complex instructions. Obviously only one complex instruction can actually be substituted for a_3 (and the neighboring basic instructions), which leads to a decrease in their collective benefit. Assuming a possible compiler optimization pass that applies, in a predefined order, substitutions of complex instructions for certain basic instruction patterns, multiple covers lead to a reduction of the effective benefit of those with lower priority. In Table 1, if only CI_1 and CI_3 have been selected for S_c and CI_3 has higher priority than CI_1, the total benefit (number of reduced cycles) becomes 45 instead of the sum of all the benefits. In other words, assuming CI_3 has higher priority than CI_1, the benefit of including CI_3 in S_c when CI_1 is already in S_c is 15 (= 45 - 30), where 30 is the benefit of CI_1.
A special case of this problem, where no superset instructions or multiple covers take place, can be shown to be a knapsack problem (which is NP-hard) if the cost value of complex instructions can be any positive integer, while in the original formulation it should be 2^n (n ≥ 0) [13].
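The two effects described above can be made concrete with a small Python sketch. Table 1 is not reproduced in this text, so the instance names and per-sequence gains below are hypothetical values chosen only to match the totals quoted in the prose (30, 45, and the marginal 15); the functions `superset_cost` and `total_benefit` are illustrative names, not part of the original method.

```python
def superset_cost(general_bits, special_bits, special_selected):
    """Code space cost of a superset instruction, given operand widths in bits."""
    cost = 2 ** general_bits
    if special_selected:
        # The special case already occupies part of the superset's code space.
        cost -= 2 ** special_bits
    return cost


def total_benefit(priority_order, covers):
    """Apply substitutions in priority order; each basic instruction instance
    can be consumed by at most one complex instruction."""
    used = set()
    total = 0
    for ci in priority_order:
        for instances, gain in covers[ci]:
            if used.isdisjoint(instances):
                used.update(instances)
                total += gain
    return total


# Hypothetical cover data: both instructions need the shared instance a3.
covers = {
    "CI1": [(("a2", "a3"), 30)],
    "CI3": [(("a3", "a4"), 45)],
}

print(superset_cost(6, 4, special_selected=True))   # 2^6 - 2^4
alone = total_benefit(["CI1"], covers)
both = total_benefit(["CI3", "CI1"], covers)         # CI3 has higher priority
print(alone, both, both - alone)
```

Because CI3 is applied first and consumes a3, CI1 can no longer be substituted at that site, so the combined benefit is less than the sum of the individual benefits.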

ILP Problem Formulation
For the description of the problem formulation, we first introduce the following variables. Let the complex instruction library be given as {CI_i | i = 1, 2, ..., n}, where each instruction has bitwidth information W_i, that is, the number of bits needed for operand encoding. With each complex instruction CI_i is associated a set of basic instruction instance sequences {BII_ij | j = 1, 2, ..., m}, each member of which matches CI_i.
Lastly, G_ij is the benefit, or expected cycle count reduction, obtained by replacing BII_ij with CI_i. G_ij can be defined as in the following equation, where Repetition_ij is the repetition count of the block and Cycle-Reduc_i is the CR of CI_i. Note that the CR value has been compensated for the penalty or probability that CI_i necessitates an additional move instruction when CI_i has a register operand generalized with reduced field size.
$$G_{ij} = \text{Repetition}_{ij} \cdot \text{Cycle-Reduc}_{i} \qquad (3)$$
³ An operand class is more general than another if the set of its instances includes that of the other.
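As a point of reference, the special case mentioned earlier (no superset instructions and no multiple covers) can be written as a knapsack-style 0-1 program. The form below is a reconstruction from the definitions of x_i, W_i, and G_ij given in the text, with Constr denoting the code space available to the complex IS; it is a sketch under those assumptions and not the full formulation, which must additionally account for superset instructions and multiple covers.

$$\max \sum_{i=1}^{n} x_i \sum_{j} G_{ij} \quad \text{subject to} \quad \sum_{i=1}^{n} x_i \, 2^{W_i} \le \text{Constr}, \qquad x_i \in \{0, 1\}$$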
The actual compiler that will use those complex instructions is assumed to be able to find the optimal set of BII_ij's from a sequence of basic instruction instances. Now let us define the following binary variables for the ILP problem formulation: [...] Equations (6) and (7) can be linearized using the following identities. [...]
Heuristic Algorithm
Fig. 3 shows our heuristic algorithm proposed for the problem of complex IS selection and ordering based on the above observations. The algorithm works by repeatedly selecting the most promising complex instruction (i.e., the one with the largest benefit per cost ratio). The ordering of the selected complex instructions is decided by their CR values: the greater the CR, the higher the priority. Among those with the same CR value, priority follows the order in which they were selected.
Superset Instructions: Every complex instruction has a set of pointers to more general complex instructions (More-General-Form in Fig. 3). A complex instruction is more general when all operands are encoded with more general or equal operand classes. This set can be built up as new complex instructions are created and added to the library. When a complex instruction is selected, its cost is subtracted from the cost of each of the more generalized complex instructions.
Multiple Covers: If a basic instruction instance is covered by more than one complex instruction and those complex instructions are all selected, the total benefit of the selected instructions is less than the sum of each benefit. To accurately quantify the benefit of selecting a new complex instruction under the assumption that a compiler substitutes complex instructions for matching patterns in a predefined order, the algorithm introduces two new integer variables for every basic instruction instance. Max-Cycle-Reduc of an instruction instance is the maximum of the CRs (Cycle-Reduc) of the complex instructions that cover the instruction instance and have been selected so far.
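The selection loop described above can be sketched in Python as follows. Fig. 3 is not reproduced in this text, so this is a simplified reading of the prose rather than the published algorithm: the class `ComplexInstr`, the function `select_complex_is`, and the example library are hypothetical, only the Max-Cycle-Reduc variable is modeled (the second per-instance variable is not described here), and each covered site is represented by a single instance identifier.

```python
from dataclasses import dataclass, field


@dataclass
class ComplexInstr:
    name: str
    operand_bits: int        # W_i: bits needed to encode the operands
    cycle_reduc: float       # CR of a single use
    covers: dict             # instance id -> block repetition count
    more_general: list = field(default_factory=list)  # names of superset instructions


def select_complex_is(library, constr):
    """Greedy benefit-per-cost selection; returns names in substitution priority order."""
    cost = {ci.name: 2 ** ci.operand_bits for ci in library}
    max_cycle_reduc = {}                     # best CR selected so far, per instance
    remaining = {ci.name: ci for ci in library}
    selected, budget = [], constr

    def gain(ci):
        # Only the improvement over already selected covers counts as benefit.
        return sum(rep * max(0, ci.cycle_reduc - max_cycle_reduc.get(inst, 0))
                   for inst, rep in ci.covers.items())

    while remaining:
        best, best_ratio = None, 0.0
        for ci in remaining.values():
            c, g = max(cost[ci.name], 0), gain(ci)
            if g <= 0 or c > budget:
                continue
            ratio = g / c if c > 0 else float("inf")
            if ratio > best_ratio:
                best, best_ratio = ci, ratio
        if best is None:
            break
        paid = max(cost[best.name], 0)
        budget -= paid
        selected.append(best)
        del remaining[best.name]
        for inst in best.covers:
            max_cycle_reduc[inst] = max(max_cycle_reduc.get(inst, 0), best.cycle_reduc)
        for name in best.more_general:       # supersets become cheaper
            if name in remaining:
                cost[name] -= paid
    # Higher CR first; the stable sort keeps selection order among equal CRs.
    return [ci.name for ci in sorted(selected, key=lambda ci: -ci.cycle_reduc)]


library = [
    ComplexInstr("CI1", 4, 2, {"s1": 10, "s2": 5}, more_general=["CI2"]),
    ComplexInstr("CI2", 6, 2, {"s1": 10, "s2": 5, "s3": 20}),
    ComplexInstr("CI3", 5, 3, {"s2": 5}),
]
print(select_complex_is(library, 2 ** 7))
print(select_complex_is(library, 2 ** 6))
```

In this toy library, CI1 is picked first for its high benefit per cost, which lowers the cost of its superset CI2; with the smaller budget, CI3 no longer fits after CI1 and CI2 are chosen.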
Experiments
For our experiments, we used the SH-3 processor [12] as the representative architecture for the basic IS and structural information such as pipeline configuration. We ran a number of realistic benchmark applications covering multimedia (e.g., H.263 decoder, JPEG), control-intensive (e.g., ADPCM) and cryptography (e.g., DES) domains. These applications were processed by the EXPRESS retargetable compiler [14] (targeting the basic IS) to generate preliminary assembly code, which was used for the experiments. We chose the SH-3 processor because it is representative of popular contemporary RISC cores that also contain DSP-like features such as auto-increment load and MAC (multiply-accumulation). The SH-3 has an IBW (instruction bitwidth) of only 16 bits, so it becomes very important to find a better IS for a more effective utilization of the hardware. From the native IS of the SH-3 architecture, a basic IS was defined with a code space of 15442, which is a little less than 2^14, or a quarter of the total code space. Since about half (2^15) of the total code space is used for system control functions or reserved for other versions of the processor family, the code space that is available for a complex IS is about 2^14, which is used as the value of Constr in our experiments.
Another parameter, N (the number of basic instructions that are considered together for complex instruction creation), was set to be 2 to 4 (i.e., up to 4 consecutive basic instructions are considered). In defining operand classes, we used statistics such as frequent values of immediate/displacement fields and average register pressure. One definition of operand classes was used for all benchmark applications except for the DES application.
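The code space budget used in the experiments, and the code space condition from the pattern generation step, can be checked numerically. The sketch below uses only the figures stated in the text (a 16-bit instruction, about 2^15 reserved, a basic IS of 15442, and Constr = 2^14); the function names and the example operand widths are hypothetical.

```python
IBW = 16
TOTAL_CODE_SPACE = 2 ** IBW   # all 16-bit encodings
RESERVED = 2 ** 15            # system control and other family members (approximate)
BASIC_IS = 15442              # code space taken by the basic IS
CONSTR = 2 ** 14              # budget used for the complex IS


def code_space(operand_bits):
    """Code space consumed by one instruction with the given operand bits."""
    return 2 ** operand_bits


def fits(operand_widths, constr=CONSTR):
    """True if the instructions' combined code space stays within the budget."""
    return sum(code_space(w) for w in operand_widths) <= constr


leftover = TOTAL_CODE_SPACE - RESERVED - BASIC_IS
print(leftover, CONSTR)            # what remains vs. the budget actually used
print(fits([12, 12, 8, 8, 4]))     # a mix of wide and narrow operand fields
print(fits([13, 13, 8]))           # two 13-bit instructions already exhaust 2^14
```

The second example shows why operand width matters so much: each extra operand bit doubles the code space an instruction occupies.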
Comparison of ILP and Heuristic Algorithm
For the comparison of the two proposed selection methods, ILP and the heuristic algorithm (HA), we used small benchmark programs so that the ILP solver would terminate within a reasonable amount of time. Each small benchmark program, written in C, contains only one function other than main. For each benchmark program, a pattern library was created and then the two selection methods were applied. As the ILP solver, the public domain software lp_solve [15] was used on a Pentium 866 MHz Linux PC. Table 2 shows the results.
Complex instructions that require more bits than the IBW are not added to the library.
Comparison of Basic, Synthesized, Native IS's
To evaluate the effectiveness of the proposed IS synthesis technique, four realistic applications were used as benchmark programs: JPEG encoder, H.263 decoder, ADPCM coder/decoder, and DES (Data Encryption Standard) algorithm. The first two are multimedia applications, the third one is control-intensive, and the last one is a number crunching application with many bit-level operations. For each benchmark program a different complex IS was generated using our heuristic algorithm. The generated complex IS was used in a back-end optimization pass, which, in a given order, substitutes complex instructions for basic instruction sequences.
One of the reasons the synthesized IS gives only a slight improvement for the ADPCM application is that the application is control-intensive and has small basic blocks, often with only a few instructions. Since the proposed scheme creates complex instructions only within basic blocks, it often cannot find performance-improving complex instructions for those blocks in control-intensive applications. In the H.263 application, one of the reasons the synthesized IS gives especially good improvement is that the application has a number of multiplication operations (which take place in a different pipeline stage than ALU operations in the SH-3 architecture), which makes it possible to exploit the parallelism between the pipeline stages. On the other hand, all the other applications have very few multiplication operations. The results in Table 3 also show that, contrary to conventional wisdom, the code size and the performance improvement obtained through the use of application-specific instructions do not always go against each other.
This shows that the performance improvement by application-specific instructions is not very dependent upon the size of the application, although it is certain that the amount will diminish as the application grows bigger and more complex.
The newly generated complex instructions using our approach include subword access instructions (e.g., byte load), their combinations with other operations, and several kinds of load-shift and shift-store instructions, as well as those already included in the native IS such as ADDI+LOAD. One of the interesting instructions found only in multimedia applications is the ADD instruction with three different register operands, the absence of which, however, is one of the distinguishing features of the native IS.
Conclusion
We presented a novel IS synthesis technique employing an efficient instruction encoding method for configurable ASIPs. The technique improves the processor architecture through IS specialization, which makes the technique suitable for the emerging class of configurable ASIPs being deployed in contemporary SOCs and platform-based designs. We formulated the problem using integer linear programming and presented an efficient heuristic algorithm. Our experimental results demonstrate that through the use of efficient instruction encoding, our technique can generate up to 38% performance improvement over the native IS of a typical embedded RISC processor, for different domains of applications. We believe our approach is thus very useful for designers of systems that need customization of programmable engines, but which leverage software investments made in existing RISC-based ISAs.
Hardware implementation of the synthesized instruction set demands the modification of the control path, most importantly the instruction decoder. Modifying the instruction decoder to support complex instructions may affect the critical path and thus the clock speed of the processor, potentially leading to degraded performance improvement. This is particularly true if we compare the synthesized instruction set with the basic one. Native instruction sets, however, often have their own "complex" instructions, sometimes with many odd ones. The instruction set synthesis replaces the native complex instructions with application-specific ones. Therefore, the complexity increase of the instruction decoder (and also the cycle time increase) from using synthesized instruction sets rather than the native ones is not necessarily substantial. Thorough investigation is needed, though, to quantify the effect of the synthesized instructions on the cycle time and the final performance.
Acknowledgements
