An instruction set serves as the interface between hardware and software in a computer system. In an application specific environment, the system performance can be improved by designing an instruction set that matches the characteristics of hardware and the application. We present a systematic approach to generate application-specific instruction sets so that software applications can be efficiently mapped to a given pipelined microarchitecture. The approach synthesizes instruction sets from application benchmarks, given a machine model, an objective function, and a set of design constraints. In addition, assembly code is generated to show how the benchmarks can be compiled with the synthesized instruction set. The problem of designing instruction sets is formulated as a modified scheduling problem. A binary tuple is proposed to model the semantics of instructions and integrate the instruction formation process into the scheduling process. A simulated annealing scheme is used to solve for the schedules.
Introduction
Microprocessors (instruction set processors) offer a flexible and low cost solution for embedded systems with complex algorithms or control intensive applications. The performance of a microprocessor-based system depends on how efficiently the application is mapped to the hardware. One key issue determining the success of the mapping is the design of the instruction set, which serves as the interface between the hardware and application (software). How to design an instruction set that closely matches the characteristics of the hardware and the application is an important design problem.
The design of instruction sets was once viewed as a design process independent of the design of the hardware (micro-architecture). Instruction sets designed under this principle, such as those of many mainframe computers, suffered because their supporting hardware was difficult to speed up, or because hardware was wasted due to the low utilization rate of the related instructions in real applications. The necessity of closely matching the design of instruction sets with the design of micro-architectures was recognized and adopted in the design of many modern RISC-style pipelined processors, in order to achieve a better performance/cost tradeoff. However, in most design projects, the designs were carried out manually, which limited the exploration of the design space and the understanding of the interaction between hardware and software. CAD tools are necessary to explore and manage such a complex design space. While there has been much progress in automating instruction set processor design, most of the work synthesizes microarchitectures at the RTL level from given instruction sets (e.g., [11], [13], [14]). How to systematically design instruction sets which closely match the characteristics of hardware and software is still an open problem. The goal of our research is thus to investigate the instruction set design problem in a systematic way. The research intends to provide further understanding of the design and interaction of the hardware and software interface.
In this paper we present the problem formulation and the algorithm of a systematic approach [7] which synthesizes application-specific instruction sets for parameterized, pipelined microarchitectures, from a given application benchmark. The problem is formulated as a modified scheduling problem, with the micro-operations (MOPs) representing the application benchmark as the nodes to be scheduled, subject to several design constraints. Instructions are formed by an instruction formation process which is integrated into the scheduling process. The compiled code of the application is generated, using the synthesized instruction set. A simulated annealing scheme is used to solve for the schedule and the instruction set. The design issues addressed in this approach include instruction utilization, instruction operand encoding, delayed loads/stores, and delayed branches.
generate application-specific instruction sets and compiled codes for microprocessor-based embedded systems.
Another design problem that is close to the instruction set design problem is microcode compaction [15], [16], [17]. However, it differs in terms of the design space and design goals. The micro-instructions do not have "opcodes" (and hence the semantics) and the goal of microcode compaction is to reduce the number of cycles to execute a microprogram. On the other hand, in the instruction set design, the size of the instruction set is determined by both syntax and semantics. The goal of the instruction set design is to optimize and trade off the instruction set size, the program size, and the number of cycles to execute a program.
Design models
In this section we present the models of instruction sets, micro-architectures and application benchmark programs, and describe how they are represented in our design system.
Instruction sets
The instruction set is assumed to be of fixed word length, typically 32 bits, which is specified by the designer. An instruction consists of fields. The fields are a combination of some field types.
For example, the instruction add(R1,R2,Immed) consists of an opcode field add, two register index fields R1 and R2, and one immediate data field Immed. The bit width of each field type is provided by the designer. Table 1 lists the specification of some instruction field types and their bit widths, taken from the BAM instruction set [19]. Each instruction has one opcode field, but the use of other fields is constrained only by the total number of bits needed by the operations in the instruction.
[Table 1: columns "Instruction Field Type" and "Number of bits"; the table body was not recovered]
The operands of instructions can be encoded in the opcodes. There are two ways to encode operands. First, a specific value can be permanently assigned to an operand and becomes implicit to the opcode. Second, the register specifiers can be unified. For example, the instruction inc is obtained from the general instruction add. The facts of R1=R2 (unifying register specifiers; i.e., both register accesses refer to the same physical register) and Immed=1 (fixing an operand to a specific value which becomes implicit) are encoded into the opcode inc. Encoding operands saves instruction fields, at the cost of a possibly larger instruction set size, additional connections, and hardwired constants in the data path. For example, adding the instruction inc to the instruction set increases the instruction set size by one, and adds a hardwired constant '1' and an additional multiplexer in the data path, as shown in Figure 2. Furthermore, encoding allows more MOPs to be packed into a single instruction. For example, if we find that the values of two independent registers are very often increased by one at the same time, we may then devise a new instruction incd(R1,R2) which performs the MOPs 'R1 <- R1+1; R2 <- R2+1' (';' represents concurrency). This instruction uses only 16 bits, as opposed to the 58 bits used by its generalized form 'R1 <- R2+Immed1; R3 <- R4+Immed2', which does not meet the instruction word width constraint for 32-bit instructions.
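The bit counts quoted above can be reproduced with a small calculation. The sketch below is illustrative only: the field widths (6-bit opcode, 5-bit register index, 16-bit immediate) are assumptions chosen because they are consistent with the 16-bit and 58-bit figures in the text, not values taken from Table 1, and the function name is hypothetical.

```python
# Assumed field widths in bits (hypothetical; picked to match the 16- and
# 58-bit figures quoted in the text, not copied from Table 1).
FIELD_BITS = {"opcode": 6, "reg": 5, "imm": 16}


def instruction_width(explicit_fields):
    """Width of an instruction: one opcode plus every explicit field.

    Implicit (encoded) fields are simply left out of the list, because
    they consume no instruction bits.
    """
    return FIELD_BITS["opcode"] + sum(FIELD_BITS[f] for f in explicit_fields)


# add(R1, R2, Immed): all three operands are explicit.
add_width = instruction_width(["reg", "reg", "imm"])
# inc(R): R1=R2 unified into one field, Immed=1 made implicit.
inc_width = instruction_width(["reg"])
# incd(R1, R2): two unified register fields, both immediates implicit.
incd_width = instruction_width(["reg", "reg"])
# Generalized form: R1 <- R2+Immed1; R3 <- R4+Immed2.
general_width = instruction_width(["reg", "reg", "imm", "reg", "reg", "imm"])

fits_32_bit_word = {"incd": incd_width <= 32, "general": general_width <= 32}
```

Under these assumed widths the encoded form fits a 32-bit word while the generalized form does not, which is the tradeoff the paragraph describes.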
Micro-architectures
The styles of micro-architectures considered in this work are pipelined micro-architectures. The pipeline is controlled in a data stationary fashion [9]. In data stationary control, the opcode flows through the pipeline in synchronization with the data being processed in the data path. Figure 4 shows the relationship between the control path with the data stationary model and the data path. The register files at the top and bottom are the same register file; they are duplicated for readability. Opcodes are forwarded to the next stages synchronously. At each stage, the opcode, together with possible status bits from the data path, is decoded to generate the control signals necessary to drive the data path. This pipeline configuration supports single-cycle instructions, which are typical of modern RISC-style processors. Multiple-cycle instructions can be accommodated with some modification to the linear pipeline, such as the insertion of internal opcodes [10]. To manage the complexity of this research, general multiple-cycle instructions are not considered at this moment. However, multiple-cycle arithmetic/logic operations, memory access, and change of control flow (branch/jump/call) are supported by specifying the delay cycles as design parameters.
The specification for the target microarchitecture
The target microarchitecture can be fully described by specifying the supported MOPs and a set of parameters. The supported MOPs describe the functionality supported by the microarchitecture, and the connectivity among modules in the data path. For example, the first two columns of Table2 list some of the MOPs supported in the VLSI-BAM microprocessor [20] and their corresponding MOP type IDs. The basic pipeline structure of the microprocessor is the same as the 
Application benchmarks
Each application benchmark is represented as a group of weighted basic blocks. The weight is defined by the designers, and is usually used to indicate how many times the basic block is exe- 
Instruction set design as a modified scheduling problem
The instruction set design problem can be formulated as a modified scheduling problem (Figure6). The inputs of the problem are: an application represented in CDFGs, constraints of the 
Instruction formation: the binary tuple and its relation with scheduling process
The semantics of an instruction can be represented by a binary tuple <MOPTypeIDs, IMPFields>, where MOPTypeIDs is a list of type IDs (as shown in the first column of Table 2) of
[Summaries of two design states; the table bodies were not recovered]
Cycle count 7; instruction set size 6; hardware cost 2R, 1W, 1M, 2F; max. instruction width = 68
Cycle count 4; instruction set size 3; hardware cost 1R, 1W, 1M, 1F; max. instruction width = 32
Instructions are generated from time steps in the schedule. Each time step corresponds to one instruction. The type IDs of the MOPs scheduled to the same time step are assigned to the first argument of the binary tuple for the instruction at the time step. The operand encoding specification, which is generated by an encoding process integrated into the scheduling process (described in Section 5), is assigned to the second argument of the binary tuple.
For example, in Table 5, the MOPs scheduled into time steps 4 and 5 have the same binary tuple, and thus are mapped to the same instruction inst4(R1,R2,I), with their field values instantiated to (r1,r2,1) and (r2,r2,2), respectively. Note that we use capitalized letters, e.g. R1, to denote the instruction fields, and non-capitalized letters, e.g. r2, to denote the instantiated values of the fields. On the other hand, the MOP in time step 2 is mapped to a different instruction inst2(R1,R2), although it contains the same type of MOP, rrai, as in time steps 4 and 5. The reason is that its field for the immediate data I is permanently assigned to the constant 'zero' and made implicit in the opcode, which is indicated by the specification I=0 in the 'Encoded field' column. This implicit field makes the generated instruction behave as a 'move' instruction, instead of 'add'.
The compiled code can be obtained easily from the instruction names and instantiated field values. For example, the compiled code for the scheduled basic block in Table6 is represented as the sequence: inst7(r2,r0,0),inst7(r2,r1,1),inst5(1024),inst4(r2,r2,2).
The instruction set is formed by taking the union of the instructions generated from all time steps. For example, the instruction set derived from the schedule in Table 5 contains six instructions.
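The formation process described above (one binary tuple per time step, identical tuples sharing one instruction, compiled code read off the explicit field values) can be sketched as follows. This is a minimal illustration, not the ASIA implementation: the data layout, the function names, and the three-step schedule are hypothetical, loosely modeled on the rrai example in the text.

```python
from typing import NamedTuple


class Mop(NamedTuple):
    type_id: str   # MOP type ID, e.g. "rrai"
    fields: dict   # field name -> instantiated value, e.g. {"R1": "r1"}


def binary_tuple(mops, implicit):
    """Build <MOPTypeIDs, IMPFields> for the MOPs of one time step."""
    type_ids = tuple(sorted(m.type_id for m in mops))
    return (type_ids, tuple(sorted(implicit.items())))


def form_instructions(schedule):
    """Map each time step to an instruction and emit the compiled code."""
    names = {}   # binary tuple -> instruction name (the instruction set)
    code = []    # one (name, explicit operand values) pair per time step
    for mops, implicit in schedule:
        key = binary_tuple(mops, implicit)
        name = names.setdefault(key, f"inst{len(names) + 1}")
        args = tuple(v for m in mops
                     for f, v in m.fields.items() if f not in implicit)
        code.append((name, args))
    return names, code


# Hypothetical schedule: each entry is (MOPs in the step, implicit fields).
schedule = [
    ([Mop("rrai", {"R1": "r1", "R2": "r0", "I": 0})], {"I": 0}),  # 'move'
    ([Mop("rrai", {"R1": "r1", "R2": "r2", "I": 1})], {}),        # 'add'
    ([Mop("rrai", {"R1": "r2", "R2": "r2", "I": 2})], {}),        # 'add'
]
instruction_set, compiled = form_instructions(schedule)
```

The last two steps share a binary tuple and therefore one instruction, while the first step, whose immediate is implicit, becomes a distinct instruction with one operand fewer.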
Performance (cycle count) and Costs (instruction bits and hardware resources)
The weighted sum of the lengths (number of time steps) of the scheduled basic blocks is the execution cycles of the benchmarks. The length of the basic block includes nop slots which are inserted by the design process to preserve the constraints due to multi-cycle operations. The design process will try to eliminate the nop slots by reordering other independent operations into the nop slots.
Each instruction has two costs associated with it. One is the total number of bits required to represent the instruction. The number is the sum of the field widths of the opcode and all explicit fields required to operate the MOPs contained in the instruction. The implicit fields do not consume instruction bits. For example, in Table 5, the instruction inst4 requires 32 bits, using the bit width specification in Table 1. The example in Table 6 shows that compact and powerful instructions can be synthesized by packing more MOPs into a single instruction, and by making fields implicit and register ports unified to satisfy the cost constraints. This is particularly useful in an application-specific environment where instruction sets can be customized to produce compact and efficient code for the intended applications.
Constraints
The MOPs are scheduled into time steps, subject to several constraints. First, the data/control dependencies and the timing constraints (for multi-cycle MOPs) have to be satisfied. Data-dependent MOPs have to be scheduled into different time steps, subject to the precedence relationship and timing constraints, except single-cycle MOPs with WAR (write-after-read) dependencies, which can be scheduled into the same time step if the registers can be read and written simultaneously. A control dependency with a timing constraint, e.g., a delayed jump, has to be dealt with differently. The MOPs that are data-independent of the jump/branch MOPs can be scheduled into the time steps before the jump/branch MOPs or the delay slots after the jump/branch MOPs. The length of the delay slots is determined by the timing constraint. For example, in Table 6, the independent MOP MO5 is scheduled into time step 4, which is the delay slot of the jump MO6.
Second, the instruction word width and the hardware resources consumed by the instructions must not exceed the limits specified by the designer. Third, the size of the instruction set must not exceed the number of opcodes that the opcode field can encode.
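A simplified check of these constraints might look like the sketch below. It is an illustration under stated assumptions: every dependency is treated as a strict "before" relation (the WAR exception, multi-cycle delays, delay slots, and hardware resource counts are all omitted), and the function name and data layout are hypothetical.

```python
def violated_constraints(step_of, before, widths, max_width, opcode_bits):
    """Return a list of constraint violations for one design state.

    step_of:     MOP id -> time step it is scheduled into
    before:      (a, b) pairs meaning MOP a must be in an earlier step than b
    widths:      instruction name -> total instruction bits
    max_width:   instruction word width given by the designer
    opcode_bits: width of the opcode field
    """
    problems = []
    # First constraint: dependencies (simplified to strict ordering).
    for a, b in before:
        if step_of[a] >= step_of[b]:
            problems.append(("dependency", a, b))
    # Second constraint: instruction word width.
    for name, bits in widths.items():
        if bits > max_width:
            problems.append(("word_width", name, bits))
    # Third constraint: the opcode field must be able to name every instruction.
    if len(widths) > 2 ** opcode_bits:
        problems.append(("opcode_space", len(widths), 2 ** opcode_bits))
    return problems


# Hypothetical design state with one over-wide instruction.
before = [(1, 4), (2, 3), (2, 5), (3, 5), (4, 5)]
step_of = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}
widths = {"inst1": 32, "inst2": 48}
problems = violated_constraints(step_of, before, widths,
                                max_width=32, opcode_bits=6)
```

Returning the violations rather than a single boolean keeps the information needed to pick a time step from the pool of violating ones.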
Objective function
Generally speaking, a richer instruction set may result in more compact and efficient compiled code. On the other hand, the larger the instruction set, the more complex the decoding circuitry, and the more time the hardware designers spend on design and verification. The same trends hold true on the compiler side as well. Therefore, an objective function is necessary to control the performance/cost tradeoff.
The goal of our design system is to minimize the objective function. The objective function is a function of the cycle count C and instruction set size S, where C represents the performance metrics, how many cycles the benchmarks execute on the target machine, and S represents the cost metrics. An interesting objective function suitable for our purpose is the following equation.
$$\mathrm{Objective} = \frac{100}{P}\,\ln(C) + S \qquad \text{(EQ 1)}$$
This is an integral form, derived by Holmer in [4], of the statement "a new instruction will be accepted if it provides a P% performance improvement," which tries to balance the instruction set size with the performance gain. Other types of objective functions can be used with the design system as well.
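The break-even behavior of EQ 1 can be checked numerically. The sketch below is illustrative; the function name and the cycle counts are hypothetical. Adding one instruction raises S by one, so the objective only decreases if ln(C) drops by more than P/100, which for small P corresponds to roughly a P% reduction in cycle count.

```python
import math


def objective(cycles, iset_size, p):
    """EQ 1: (100/P) * ln(C) + S."""
    return (100.0 / p) * math.log(cycles) + iset_size


P = 1.0
base = objective(1000, 10, P)

# Break-even point: one more instruction and a cycle count scaled by
# exp(-P/100) leave the objective unchanged.
break_even = objective(1000 * math.exp(-P / 100), 11, P)

# A 2% speed-up pays for the extra instruction; a 0.5% speed-up does not.
worth_it = objective(980, 11, P) < base
not_worth_it = objective(995, 11, P) > base
```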
Note that in our formulation, the design constraints are checked separately, and are not captured in the objective function.
Simulated annealing algorithm and the design flow
Although we have formulated the instruction set design problem as a scheduling problem, it is indeed more difficult than a regular scheduling problem, because we have to control the number of unique patterns (instruction set) in the time steps during the scheduling, in addition to the dependency and performance/cost constraints. Also, the problem size is usually much larger than regular scheduling problems since the application benchmarks may easily contain thousands of MOPs to be scheduled.
We propose an efficient solution to the problem based on a simulated annealing scheme. An initial design state consisting of an initial schedule and its derived instruction set (generated by a pre-processor) is given to the design system, and then a simulated annealing process is invoked to modify the design state in order to optimize the objective function, until the design state achieves an equilibrium state.
Figure7 lists the basic structure of our simulated annealing algorithm. In the outer while loop are the operations performed at each temperature point T. The temperature T is updated at the end of the operations. At each temperature, several movements (changes of the design state) are generated by the inner while loop. The number of movements (M) generated is specified by the designer.
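Since the figure itself is not reproduced here, the loop structure just described can be sketched as follows. This is a generic simulated annealing skeleton, not the authors' listing: the function names, the pluggable `accept` rule, the default cooling factor, and the default stopping rule are assumptions made for illustration.

```python
import math
import random


def metropolis(delta, t, rng):
    """Accept improvements always, worse states with probability exp(-delta/t)."""
    return delta <= 0 or rng.random() < math.exp(-delta / t)


def anneal(state, cost, propose, t0, moves_per_temp,
           alpha=0.9, stable_limit=4, accept=metropolis, rng=random):
    """Generic annealing loop; the stopping rule is an assumption."""
    t, stable = t0, 0
    while stable < stable_limit:          # outer loop: one temperature point
        start = state
        for _ in range(moves_per_temp):   # inner loop: M movements at T
            candidate = propose(state, rng)
            delta = cost(candidate) - cost(state)
            if accept(delta, t, rng):
                state = candidate
        # Count consecutive temperature points that left the state unchanged.
        stable = stable + 1 if state == start else 0
        t *= alpha                        # T is updated at the end
    return state


# Toy usage: minimize x*x over the integers with +/-1 moves.
result = anneal(20, cost=lambda x: x * x,
                propose=lambda x, rng: x + rng.choice((-1, 1)),
                t0=10.0, moves_per_temp=50, rng=random.Random(0))
```

In the design system the state would be a schedule together with its derived instruction set, and `propose` would apply one of the move operators.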
In the following subsection, we present the move operators (Section 5.1) and heuristics (Sec- 
Move operators
The move operators change the design state. They provide methods of manipulating the MOPs and time steps. The move operators can be characterized into three groups.
Manipulation of the instruction semantics and format
The first group manipulates the instruction semantics and format of a selected time step. There are five move operators in this group.
• Unification: Unify two register accesses in the MOPs; i.e., they always access the same register. For example, the specification of R 1 =R 2 in our previous example of the increment instruction inc(R) is a result of the 'unification' operator. The effects of this operator are the decreases in the instruction word width and register read/write ports.
• Split: Cancel the effect of the 'unification' operator. Two register accesses that are previously unified to the same register are made independent. The effects of this operator are the increases in the instruction word width and register read/write ports.
• Implicit value: Bind a register specifier to a specific register, or an immediate data field to a specific value. The specific values are the instantiated values in the MOPs of the selected time step. For example, the specification of Immed=1 in the instruction inc(R) is a result of this operator. The effect of this operator is the decrease in the instruction word width.
• Explicit value: Cancel the effect of the 'implicit value' operator. Instruction fields that are previously bound to specific values are made explicit; i.e., their values are assigned by the compiler and are specified in the regular instruction fields. The effect of this operator is the increase in the instruction word width. 
Manipulation of MOP's locations
The second group of move operators involves the movement of the MOPs. There are four move operators in this group, which are all subject to the data/control dependencies and delay constraints when moving MOPs. The target MOPs and time steps can be selected randomly or with the guidance of heuristics.
• Interchange: Interchange the locations of two MOPs from different time steps. This operator changes the semantics and formats of the two instructions in the corresponding time steps.
• Displacement: Displace a MOP to another time step. This operator simplifies the semantics and format of the instruction in the original time step, and enriches the semantics and format of the other instruction in the destination time step.
• Insertion: Insert an empty time step after or before the selected time step and move one MOP to the new time slot. This operator simplifies the semantics and formats of instructions in the selected and new time steps, and increases the cycle count.
• Deletion: Delete the selected time step if it is an empty one. This operator decreases the cycle count.
In our current implementation, if the selected MOPs contain unified or implicit fields, these fields are restored to their original forms (generalized, explicit) before the move operators in this group are applied to the MOPs. In addition to the aforementioned effects, these move operators may change the resource usage in the selected time steps as well.
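The four location operators can be sketched on a schedule represented as a list of time steps, each holding the MOPs scheduled into it. This is a hypothetical representation for illustration only: time steps are indexed from 0, the functions return a new schedule instead of modifying the old one, and the dependency and delay checks required by the text are left out.

```python
def _copy(schedule):
    return [list(step) for step in schedule]


def interchange(schedule, mop_a, step_a, mop_b, step_b):
    """Swap the locations of two MOPs from different time steps."""
    new = _copy(schedule)
    new[step_a].remove(mop_a)
    new[step_b].remove(mop_b)
    new[step_a].append(mop_b)
    new[step_b].append(mop_a)
    return new


def displacement(schedule, mop, src, dst):
    """Move one MOP from time step src to time step dst."""
    new = _copy(schedule)
    new[src].remove(mop)
    new[dst].append(mop)
    return new


def insertion(schedule, mop, step, after=True):
    """Insert an empty time step next to `step` and move one MOP into it."""
    new = _copy(schedule)
    new[step].remove(mop)
    new.insert(step + 1 if after else step, [mop])
    return new


def deletion(schedule, step):
    """Delete the selected time step, but only if it is empty."""
    if schedule[step]:
        return _copy(schedule)
    return [list(s) for i, s in enumerate(schedule) if i != step]


schedule = [["MO1"], ["MO2"], ["MO3"]]
packed = displacement(schedule, "MO2", 1, 0)  # MO2 joins MO1's time step
shorter = deletion(packed, 1)                 # the emptied step is removed
```

A displacement followed by a deletion shortens the schedule by one cycle, at the price of a richer instruction in the destination time step.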
Microarchitecture-dependent operators
The third group of move operators includes methods that explore the special properties of the target microarchitecture. These move operators are provided by the designer as part of the microarchitecture specification.
For example, if the target microarchitecture provides both register file → functional unit → register file, and register file → register file data paths, then the designer can specify that the following MOPs ( rrai and rr) are functionally equivalent and can be transformed from one to another:
rrai:
rr:
These MOPs have different costs in hardware and instruction format. While rrai uses a functional unit and consumes an additional instruction field for the immediate data, rr uses a direct bus between the read and write ports of the register file. When discovering an rrai MOP with its immediate data being zero, the design system can map this MOP to the equivalent rr MOP, or vice versa.
An example: changing the design state with move operators
We demonstrate how the move operators are used to change design states. Here we show a sequence of move operators which transforms the schedule and instruction set (one design state) in Table5 to the ones (a better design state) in Table6. The sequence is:
1. DISPLACEMENT: displace the MO2 from time step 2 to 1 (as shown in Table 7).
Note that more than one sequence can accomplish the same design state transition. How such sequences are formed depends on the design algorithm. In our simulated annealing scheme, the move operators are selected with a mix of random and heuristic strategies as described in Section 5.
[Summaries of intermediate design states; the table bodies were not recovered]
Cycle count 7; instruction set size 5; hardware cost 4R, 1W, 1M, 3F; max. instruction width = 68
Cycle count 7; max. instruction width = 32
Cycle count 6; max. instruction width = 48
Table 9. The design state after the application of the seventh move operator
Heuristics for target selection
During each iteration, the design state is checked for violations of the design constraints. If any are found, a time step is randomly selected from the pool of time steps that violate constraints. If more than one constraint is violated, the resource violation gets higher priority than the instruction word width violation, since a movement that resolves the former may resolve the latter as well.
Depending on the type of the constraints, one of the following rules is applied.
1. If the instruction word width constraint is violated, apply randomly one of the move operators: 'unification', 'implicit value', 'interchange', 'displacement' or 'insertion';
2. If the resource constraint is violated, apply randomly one of the move operators: 'unification' (only when the register port constraint is violated), 'implicit value', 'displacement' or 'insertion'.
When the current design state does not violate any constraint, all move operators are eligible for changing the design state. In this case, a basic block is selected with the probability Selection_i, which is the selection weight of a basic block i and is defined by the following equation,
$$\mathrm{Selection}_i = \frac{F_i \, N_i}{\sum_j F_j \, N_j}$$
where F_i is the execution frequency of the basic block i in the benchmark, N_i is the number of MOPs in the basic block i, and the summation in the denominator is the total number of MOPs executed in the benchmark. Therefore, the selection weight is intended to denote the degree of importance of a basic block in the benchmark. A time step is then randomly chosen from the selected basic block, and one move operator is randomly selected and applied to the time step.
[Design-state summary; the table body was not recovered] Cycle count 5; max. instruction width = 32
Table 10. The design state after the application of the eleventh move operator
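The weighted selection of a basic block can be sketched as below. The sketch assumes that the selection weight of block i is F_i·N_i divided by the total number of MOPs executed, which is the reading implied by the definitions of F_i, N_i, and the denominator; the block names, frequencies, and function names are hypothetical.

```python
import random


def selection_weights(blocks):
    """blocks: name -> (F_i, N_i). Returns name -> selection weight."""
    total = sum(f * n for f, n in blocks.values())  # MOPs executed in total
    return {name: f * n / total for name, (f, n) in blocks.items()}


def pick_block(blocks, rng=random):
    """Draw one basic block with probability equal to its selection weight."""
    weights = selection_weights(blocks)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical benchmark: (execution frequency, number of MOPs) per block.
blocks = {"loop_body": (100, 6), "setup": (1, 20), "exit": (1, 4)}
weights = selection_weights(blocks)
chosen = pick_block(blocks, random.Random(0))
```

A short but frequently executed block dominates the weights, so most moves are spent where they affect the cycle count most.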
Cooling schedule
The cooling schedule is controlled by five parameters:
1. The initial temperature (T_0) should be high enough so that there is no rejection for high-cost states at the initial temperature. A simple heuristic to set the initial temperature is to start the simulated annealing algorithm with a given initial temperature. If some states are rejected at the initial temperature, then the value of the initial temperature is doubled. The trial run is repeated until the ideal initial temperature is obtained.
2. The number (M) of movements tried at each temperature is proportional to the total number (O_ps) of MOPs in the benchmarks. The proportionality factor is given by the designer and is typically five.
3. The next temperature is 90% of the current temperature.
4. A low temperature point is defined such that a special handling routine can be applied to stabilize the design state. The special handling routine stabilizes the design state by adopting move acceptance rules that are different from the ones in high temperatures. The move acceptance rules are described in Section 5.4.
5. The annealing process terminates when the design state stays unchanged for a certain (e.g., four) consecutive temperature points. The number of the consecutive stable temperature points is given by the designer.
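The first three parameters can be sketched in code as follows. This is an illustration under assumptions: `trial_run` is a hypothetical callback that runs the annealer at one temperature and returns how many moves it rejected, and the starting temperature of 1.0 is arbitrary.

```python
import itertools


def initial_temperature(trial_run, t_start=1.0):
    """Double the temperature until a trial run rejects no move."""
    t = t_start
    while trial_run(t) > 0:
        t *= 2
    return t


def moves_per_temperature(num_mops, factor=5):
    """M is proportional to the number of MOPs; the factor is designer-given."""
    return factor * num_mops


def temperatures(t0, alpha=0.9):
    """Yield T0, 0.9*T0, 0.81*T0, ... (geometric cooling)."""
    t = t0
    while True:
        yield t
        t *= alpha


# Hypothetical trial run: moves stop being rejected once T reaches 8.
t0 = initial_temperature(lambda t: 0 if t >= 8 else 3)
m = moves_per_temperature(18)
first_three = list(itertools.islice(temperatures(100.0), 3))
```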
The complexity of the algorithm is mainly determined by the cooling schedule and the data structures used to represent the design state. As discussed previously, the number of movements tried at each temperature is proportional to the total number (O_ps) of MOPs in the benchmarks; the complexity of accessing the data structures, in our current implementation, is proportional to O_ps as well. Therefore, the complexity of the algorithm at each temperature is of the order of O_ps^2. This complexity can be lowered by using more efficient data structures in our future implementation.
To derive the global complexity formally, we need to determine the total number of temperature points, which is difficult to analyze since it is affected by both the problem size and the nature of the benchmarks. However, our empirical study shows that the global complexity of the algorithm is roughly of the order of O_ps^3.
Move acceptance
At high temperatures, a movement that satisfies one of the following conditions is always accepted.
1. The movement reduces the value of the objective function;
2. The movement is a result of constraint resolution; i.e., it is a necessary movement in order to resolve some constraint violations.
Otherwise, a movement is accepted with the probability exp(-∆/T), where ∆ is the increase in the value of the objective function and T is the current temperature.
At low temperatures, a different strategy is adopted to stabilize the design state. A movement is accepted when either one of the following conditions is true.
1. The movement generates a new state which does not violate any design constraint and has lower objective value;
2. The movement is a result of constraint resolution. This condition is the same as the one at high temperatures.
Otherwise, only those movements that generate new states which do not violate any design constraint are accepted, with the probability exp(-∆/T).
In addition, the current best design state is kept when the algorithm decides to accept inferior design states. At the end of each temperature point, if the reached design state is inferior to the current best state, the design state falls back to the current best state with the probability 1 - T/T_i, where T_i is the initial temperature.
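The acceptance rules for both temperature regimes and the fall-back probability can be summarized in a short sketch. The function and parameter names are hypothetical, and the three boolean flags are assumed to be computed elsewhere by the design system.

```python
import math
import random


def accept_move(delta, t, rng, *, resolves_constraint,
                new_state_valid, low_temperature):
    """Decide whether to accept a movement.

    delta: change in the objective value (positive means worse)
    resolves_constraint: the move was made to resolve a constraint violation
    new_state_valid: the new state violates no design constraint
    low_temperature: the special low-temperature rules are in force
    """
    if resolves_constraint:
        return True               # accepted in both temperature regimes
    if low_temperature and not new_state_valid:
        return False              # low T: violating states are rejected
    if delta < 0:
        return True               # the objective value is reduced
    return rng.random() < math.exp(-delta / t)


def fallback_probability(t, t_initial):
    """Probability of returning to the best state at the end of a temperature."""
    return 1.0 - t / t_initial


rng = random.Random(0)
always = accept_move(5.0, 0.1, rng, resolves_constraint=True,
                     new_state_valid=False, low_temperature=True)
never = accept_move(-5.0, 0.1, rng, resolves_constraint=False,
                    new_state_valid=False, low_temperature=True)
```

The fall-back probability is zero at the initial temperature and approaches one as the temperature drops, so the search becomes increasingly anchored to the best state seen.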
Design flow based on the simulated annealing algorithm
The instruction set design process consists of three major steps.
1. The given application is translated to dependency graphs of MOPs which are supported by the given architecture template. This translation is performed in two steps. First, the application, written in a high level language, is translated into an intermediate representation by the compiler of the high level language (in our current environment, the Aquarius Prolog Compiler [21] ). Second, a retargetable MOP mapper, consulting the given architectural template specified with the language described in Section 3.2, transforms the intermediate representation into the dependency graphs of MOPs.
2. A preprocessor generates a simple schedule for the MOPs. The schedule is obtained by serializing the dependency graphs. An initial instruction set is then derived from the schedule. This is done by directly mapping time steps in the schedule into instructions without encoding any operand. The obtained schedule and instruction set constitute the initial design state.
The best instruction set, microarchitecture, and assembly code which minimize the objective function can be obtained after the design state reaches the equilibrium state.
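The preprocessing in step 2 (serializing the dependency graph into one MOP per time step) can be sketched with a topological sort. This is an assumption about how the serialization might be done, not the ASIA preprocessor: it uses Python's standard `graphlib` module, handles only "before" dependencies, and ignores control-flow delays; the MOP numbers and dependency pairs are illustrative.

```python
from graphlib import TopologicalSorter


def initial_schedule(mops, before):
    """Serialize the dependency graph: one MOP per time step.

    mops:   iterable of MOP ids
    before: (a, b) pairs meaning MOP a must precede MOP b
    """
    sorter = TopologicalSorter({m: set() for m in mops})
    for a, b in before:
        sorter.add(b, a)          # a is a predecessor of b
    return [[m] for m in sorter.static_order()]


# Illustrative dependencies among five MOPs.
mops = [1, 2, 3, 4, 5]
before = [(1, 4), (2, 3), (2, 5), (3, 5), (4, 5)]
schedule = initial_schedule(mops, before)
step_of = {step[0]: i for i, step in enumerate(schedule)}
```

Mapping each of these single-MOP time steps directly to an instruction, without encoding any operand, would then give the initial instruction set.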
We have implemented the algorithm and its supporting tools into our design system ASIA (Automatic Synthesis of Instruction-set Architectures). It consists of about 8000 lines of Prolog code.
Experiments
We first demonstrate our technique with a small, illustrative example, and then with Prolog application benchmarks.
A small example
In this example, we assumed the target architecture in Table 2, the instruction field specification in Table 1 with smaller bit widths for tag (2 bits) and immediate (14 bits), and the delay specification in Table 4. The example used in this subsection is a small application which sets up a list of two elements in Prolog. It consists of 18 MOPs. Table 11 lists the MOPs and their dependencies. The bf clauses in the last row specify the before dependencies between MOPs. For example, bf(1,4) requires that MOP 1 be scheduled in an earlier time step than MOP 4. The ctl(18) clause specifies that MOP 18 changes the control flow. Note that the control flow change has a one-cycle delay. We synthesized the 32-bit and 64-bit instruction sets, with the resource constraints <3R, 1W, 2M, 1F> and <6R, 4W, 4M, 4F>, respectively. The objective function used is EQ 1 with P=1.
The synthesized 32-bit instruction set is listed in Table 12, consisting of four instructions. Note that two instructions, inst11 and inst12, contain encoded fields, in order to satisfy the required 32-bit word constraint. This instruction set compiles the application into 12 cycles, as shown in Table 13. Note that time step 12 is the delay slot of inst11, which changes the control flow. An independent instruction inst12 is scheduled into time step 12 to make use of the delay slot.
Table 12. 32-bit instruction set. The right two columns specify the binary tuples for the corresponding instructions. (Only two rows were recovered.)
inst13(R, T, I): R <- T ^ I; MOP type ID rit
inst14(R1, R2, T, I): R1 <- T ^ (R2 + I); MOP type ID rrait
Dependencies among the MOPs of Table 11 (bf: before; ctl: control):
bf(1,4). bf(2,3). bf(2,5). bf(3,5). bf(4,5). bf(4,6). bf(4,7).
bf(5,6). bf(5,7). bf(5,8).
bf(6,8). bf(7,10). bf(7,9). bf(8,11).
bf(8,9). bf(9,11). bf(10,12). bf(10,13). bf(11,12). bf(13,14). bf(13,15).
bf(13,16). bf(14,15). bf(14,16). bf(16,17).
ctl(18).
Prolog application benchmarks
In this subsection, experiments are presented to show the versatility and practicality of our tools by synthesizing instruction sets for some application benchmarks, with various design constraints and objective functions. Four benchmarks were selected from the Prolog Benchmark suite [18] . The benchmarks con1 and nreverse are programs for list manipulation. The benchmark query is a program for database query. The benchmark circuit maps boolean equations into logic gates.
The second column in Table16 lists the characteristics of the benchmarks, including the numbers of MOPs, data-related dependencies, and control dependencies in the benchmarks. MOPs represents the size of the benchmark; the number of data-related dependencies is related to the degree of parallelism available within the benchmark; the number of control dependencies indicates the degree of the impact of the branch/jump delays on the benchmark.
We assumed that every basic block executes once. We also assumed the target architecture in Table 2 and the instruction field specification in Table 1. The delay constraints for control and memory operations are one and zero, respectively. The experiment was conducted on an HP750 workstation with 256 MB of memory.
For each benchmark, we synthesized its 32-bit, 48-bit, and 64-bit instruction sets. We were interested in how the instruction sets vary with bit width. Table 16 lists the results, synthesized under the objective function with P=1 in EQ 1. For all the benchmarks, as we had expected, the cycle count decreases when the instruction word width increases. However, we observed a smaller gain in nreverse and circuit. This can be explained by their larger ratios of the number of data dependencies to the number of MOPs. Most of the MOPs depend on each other, so there is less parallelism available when packing MOPs into instructions.
In general, the size of the instruction set also increases when the instruction word width increases. This is due to the fact that wider words can accommodate more MOPs, resulting in richer and more powerful instructions. However, the 48-bit instruction sets are 'embarrassing' designs for con1 and nreverse. Their instruction set sizes are larger, and their performance is worse than their 64-bit alternatives in compiling the benchmarks. The 48 bits are not wide enough for these benchmarks to accommodate the most frequent MOP patterns, for which 64 bits are sufficient. Therefore, the design process has to specialize the general forms of some powerful instructions into several distinct instructions by making fields implicit or unifying register ports, in order to satisfy the bit width constraint.
In the 'Instruction set space' column we examined the number of instruction candidates explored by the design process. The numbers, much larger than the final instruction sets, show that the design process was able to explore a rich design space for the best candidates while keeping the size of the design space manageable.
In the two rightmost columns we also list the run time and memory usage of our algorithm, which show that our tools were able to synthesize instruction sets for application benchmarks within a reasonable time while consuming a modest amount of memory.
In Table 17 we compared the synthesized 32-bit instruction sets for these benchmarks with the BAM instruction set, which was designed for the VLSI-BAM microprocessor by the Aquarius Project at the University of California, Berkeley. The VLSI-BAM microprocessor has RISC-style instructions plus some powerful instructions to support efficient logic computation such as Prolog. The benchmarks were compiled with the BAM instruction set, and we measured the number of distinct instructions used (in the 'Instruction set size' column) and the number of cycles to execute the compiled code (in the 'Cycle' column). The programs were compiled by the Aquarius Prolog Compiler, with the post-phase optimization turned off⁴. The experiments show that the synthesized instruction sets produced more compact code for all four benchmarks, with 10%, 5%, 17%, and 3% reduction in the code size, respectively. This was achieved at the cost of a small number of additional instructions (7, 1, and 2 for con1, nreverse, and query, respectively), except in circuit, where 16 additional instructions are required. We then used Holmer's objective function '100⋅ln(C)+S' to evaluate the global performance/cost tradeoffs for both instruction sets and found that in most cases (con1, nreverse, and query) the synthesized ones yield better results, as indicated in the 'Objective value' column (smaller values are better). It may be possible to improve the result for circuit by adjusting the initial temperature and the cooling schedule in future experiments. We also compared the hardware resources used by both instruction sets. They both use the same amount of resources, except that in the nreverse case our synthesized instruction set uses one fewer register read port and one fewer memory port than BAM does. This experiment shows that ASIA is capable of competing with manually designed instruction sets within our collection of benchmarks. Further studies will be needed to investigate its competence in more general cases.
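The objective function used in this comparison can be evaluated as in the following minimal sketch. It assumes that C is the cycle count and S is the instruction set size; the function name and the two sample design points are hypothetical and are not taken from Table 17.

```python
import math


def holmer_objective(cycles, set_size):
    """Evaluate 100*ln(C) + S; smaller values mean a better tradeoff."""
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return 100.0 * math.log(cycles) + set_size


# Hypothetical design points, chosen only to illustrate the tradeoff.
manual = holmer_objective(cycles=1000, set_size=20)
synthesized = holmer_objective(cycles=950, set_size=23)

if __name__ == "__main__":
    print("manual      :", round(manual, 2))
    print("synthesized :", round(synthesized, 2))
    print("synthesized is better:", synthesized < manual)
```

Because the cycle count enters through a logarithm scaled by 100, a 1% reduction in C lowers the objective by about one unit, which is the same as removing one instruction from the set. In the hypothetical case above, a 5% cycle reduction therefore outweighs three additional instructions.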
Table 18 shows some interesting instructions synthesized for the benchmark query. They are selected from the 32-bit, 48-bit, and 64-bit instruction sets, respectively. For ease of illustration, we do not list the binary tuples for these instructions; instead, we describe the RTLs of these instructions directly. In the RTLs, register sharing is indicated by using the same register index. Note that the 32-bit versions of the instructions can be found in the BAM instruction set as well. This fact gives the BAM designers more confidence in their instruction set, since some of the instructions that they considered 'powerful' reappear when the instruction set is designed independently (in this case, by the ASIA design automation system). This observation suggests that ASIA, in addition to its original purpose as an automatic design tool, can be used by designers to verify their manually designed instruction sets.

Table 17. Performance comparison with a manually designed instruction set

Finally, Table 19 shows how the synthesized instruction sets vary with the objective function. In this experiment we synthesized 32-bit instruction sets for the benchmark query with two objective functions: one with P=1, another with P=5. The latter assigns less importance to the cycle count. Therefore, the tools focused on reducing the instruction set size, resulting in 7 fewer instructions but 16 more cycles than in the former case.
Conclusions
We have presented a design automation system, ASIA (Automatic Synthesis of Instruction-set Architectures), [...] during the scheduling phase. A binary tuple is used to describe the semantics and formats of instructions. The binary tuple is the key idea that links instruction formation to the scheduling process. In addition to the synthesized instruction sets, ASIA also generates the compiled code for the given benchmarks, showing how the instruction sets can actually be used to compile programs. An objective function of the cycle count and instruction set size is used to guide the design process, in order to balance the performance/cost tradeoff. A simulated annealing algorithm is used to solve for the schedules. We have discussed the move operators suitable for our problem, and other issues such as cooling schedules and heuristics.
We have demonstrated the versatility and practicality of ASIA by conducting experiments on some application benchmarks, with various design constraints and objective functions. The tools used a reasonable amount of CPU time and a modest amount of memory. It has been shown that our tools are capable of synthesizing powerful instructions, many of which can be found in today's processors. Compared with manually designed instruction sets, the synthesized instruction sets produce more compact code and may require less hardware. The tools were able to explore a rich design space and handle important design options such as the instruction word width and the performance/cost tradeoff. We were able to explain the variation in the performance of the instruction sets on different benchmarks, based on the characteristics of the benchmarks. The experiments also show that ASIA, in addition to its original purpose of automating the design process, can be used by designers to verify their manually designed instruction sets.
The current limitations are as follows. First, the designers are required to specify the number of hardware resources, and it may take several iterations to find the best hardware allocation. Second, ASIA does not recognize the situation in which the constraints are too loose, e.g., the instruction word is too wide or the hardware resources are too rich. In this case, it is possible to suggest some partitioning of the constraints. For example, a 128-bit instruction word can be realized as a single wide-word instruction or as an abutment of several smaller instructions. Third, in our problem formulation, the concept of the basic block is used to partition benchmarks into small pieces. However, there are other ways of partitioning benchmarks, such as traces and random segments [5]. Which way is best is unknown at this moment. Fourth, even though we have demonstrated that our algorithm is able to synthesize instruction sets from thousands of MOPs within 22 hours, real-world application benchmarks, such as system, CAD, and simulation software, are usually much larger. How to manage problems of such sizes is an important issue. Fifth, the machine model is insufficient to account for the dynamic behavior of some modern architectures such as superscalar machines.
In the future, we will continue our efforts on ASIA and pursue the following issues: (1) addressing the aforementioned limitations; (2) code generation for the synthesized instruction sets; (3) synthesis and comparison for application-specific uniprocessors and VLIW processors; (4) design and synthesis of low-power instruction set architectures; (5) analysis of architectural properties for application benchmarks.
