In this paper, we show how hardware/software co-evaluation can be applied to instruction set definition. As a case study, we show the definition and evaluation of instruction set extensions for fuzzy processing. These instructions are based on the use of subword parallelism to fully exploit the processor's resources by processing multiple data streams in parallel. The proposed instructions are evaluated in software and hardware to gain a balanced view of the costs and benefits of each instruction. We have found that a simple instruction optimized to perform fuzzy rule evaluation offers the most benefit for improving fuzzy processing performance.
Introduction
In this work, we analyze how fuzzy processing can be implemented efficiently on general purpose CPUs and what functionality is required to achieve peak performance.
Instruction sets are often optimized for some software metric such as a minimum number of clock cycles, but this approach neglects the hardware impact of the proposed instructions. While some instructions may reduce the cycle count, they may also lengthen the cycle time and even result in a net performance loss. To obtain a more balanced view of software and hardware implications of instructions, a co-design approach to instruction set definition is necessary.
In this work, we show how proposed instruction sets can be evaluated for their impact on both program performance and hardware efficiency. This approach is demonstrated with the evaluation of fuzzy instruction set extensions using the hardware/software co-evaluation method.
Fuzzy computation can be implemented on any general purpose processor. However, because instruction set architectures were not designed with fuzzy computation in mind, the available primitives can result in an inefficient implementation. A possible approach to rectify this situation is to introduce additional instruction set primitives to efficiently support fuzzy processing.
The proposed fuzzy instruction set extensions have been designed to optimally exploit the available processing resources by using subword parallelism. Because fuzzy computation operates with short data types, multiple data can be packed in a single 32 bit processor word.
As a starting point for optimizing fuzzy performance, we use a RISC processor core based on the popular MIPS-I instruction set architecture [1]. Both the processor core and the added application-specific instructions have been described using the VHDL hardware description language [2], and synthesized using logic synthesis.
This paper is organized as follows: section 2 introduces the hardware/software co-design methodology we have used for instruction set architecture evaluation and section 3 gives an overview of related work. In section 4, we describe the MIPS RISC processor core which serves as the starting point for our design. We give an overview of fuzzy issues and the proposed extension of the processor in section 5. The hardware/software co-evaluation of the fuzzy instructions is given in section 6 and we draw our conclusions in section 7.
2 Hardware/Software Co-Design of Instruction Sets
The evaluation of the instruction set architecture is a major issue in the design of application-specific instruction processors [3]. To optimize processor performance for a particular application, a common approach is to extend the instruction set by application-specific instructions. Many instruction set extensions have been proposed, such as for signal processing or multimedia processing.
Adapting an instruction set to a particular problem is a difficult task, as many unknown issues have to be explored. Due to the many factors involved in performance optimization, suggested optimization solutions often minimize only the number of instructions necessary to solve a problem, or at best the number of cycles.
However, many of the suggested special purpose instructions are complex. As a result, it may not be possible to clock an extended processor at the same frequency as the original design. This is often neglected by studies, as the processor characteristics can be difficult to predict, and as a full implementation of a processor is often out of scope. Thus, studies most often use instruction or cycle level simulators such as SPIM [4] to predict performance. While these instruction set emulators can be used to test software, generate traces and gather statistics, they cannot predict the effects of the extended instruction set architecture on the processor design itself.
In this work, we evaluate the instruction set architecture (ISA) optimized for fuzzy computations using hardware/software co-evaluation of instruction set extensions to gain a more balanced view of the benefits of different instruction sets.
Proposed instructions are evaluated in hardware and software to establish the performance impact of each instruction:
Software evaluation can be performed using program traces, instruction set simulators or object code instrumentation. During this step, the cycle count of benchmarks is established.
Hardware evaluation is performed using rapid prototyping based on logic synthesis. To evaluate hardware effects of instruction set extensions, we have designed an extendible RISC processor core. Instructions are implemented and the extended processor architecture is synthesized to establish cycle time and chip area.
Using the information derived from these steps, the benefits of each proposed instruction can be evaluated in order to decide whether to implement a particular function in hardware or in software. Effectively, this process moves functional blocks from software to hardware or vice versa to optimize performance and cost.
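The way the two evaluation steps combine can be illustrated with a minimal Python sketch. It assumes that execution time is simply cycle count multiplied by cycle time; the function names and all numbers are hypothetical and are not measurements from this work.

```python
def execution_time_ns(cycles, cycle_time_ns):
    """Execution time as the product of cycle count and cycle time."""
    return cycles * cycle_time_ns


def speedup(base_cycles, base_cycle_ns, ext_cycles, ext_cycle_ns):
    """Speedup of an extended processor over the base processor.

    Values above 1.0 mean the extension pays off; values below 1.0 mean
    the longer cycle time outweighs the saved cycles.
    """
    base = execution_time_ns(base_cycles, base_cycle_ns)
    extended = execution_time_ns(ext_cycles, ext_cycle_ns)
    return base / extended


# Hypothetical example: an instruction saves 40% of the cycles.
# If the cycle time is unchanged, the extension is a clear win.
print(speedup(1000, 16.0, 600, 16.0))
# If the same instruction stretches the cycle time from 16 ns to 30 ns,
# the result is a net performance loss despite the lower cycle count.
print(speedup(1000, 16.0, 600, 30.0))
```

The second call shows the case a cycle-count-only study would miss: fewer cycles, but a slower processor overall.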
Using this co-evaluation approach, we have evaluated several application-specific instruction set extensions to implement a memory prefetching mechanism and other performance enhancing extensions, including tag support for dynamically typed languages such as Prolog [5]. In this work, we evaluate instruction set extensions optimized for fuzzy computation.
Previously presented automatic instruction set definition approaches have used either pipeline scheduling or module selection to define an optimal instruction set. Alas, these automatic approaches cannot consider or optimize data layout based on such methods as subword data parallelism, which require human intervention to adapt the data layout to a particular problem set.
Related Work
Instruction set definition has previously been addressed in a number of publications, but the authors have treated instruction set design and instruction set selection mostly as a scheduling problem of operations [7, 8], or as a module selection problem [9].
Scheduling approaches derive the best combination of operations to be executed in a pipeline or where to put them in a pipeline. One starts from a fixed pipeline and tries to schedule operations found in application programs to achieve high resource utilization of different functional blocks available in the pipeline [7]. A different approach is the partitioning of instructions on different pipeline steps while the instruction set is mostly fixed and the operations are mapped on the different pipeline stages to reduce delay.
The module selection approach [9] uses frequency analysis of software traces to determine the types of instruction to be supported by the processor. The n most frequent operations are implemented in hardware, selecting from a fixed set of modules, but no thought is given to the impact of implementing a functional block in hardware on the cycle time.
Neither the pipeline scheduling nor the module selection approach can generate new logic resources. An approach which can actually generate new logic capabilities for a processor has been presented in [10] for an adaptive machine architecture. Here, the compiler extracts functionality from a high-level language description and implements it in field-programmable gate arrays (FPGAs) attached to a processor. Alas, this approach suffers from high communication overhead between the processor and the attached FPGAs, and the idioms recognized by the system seem rather limited.
Thus, to generate truly optimized logic, human intervention as supported by our hardware/software co-evaluation approach is still required to make the best use of logic capacity.
In the area of fuzzy processing, a number of approaches have been used to optimize fuzzy processing based on either custom hardware implementations or programmable solutions, using custom hardware processors or extensions to existing processors.
Custom fuzzy implementations are generally mapped directly to an ASIC process to implement a particular class of hardware problems [11]. This approach gives the most efficient solution if only a restricted class of fuzzy problems is to be implemented.
Programmable implementations of fuzzy processing are either additions to existing processors, such as found in the CPU12 from Motorola [12] or in the FLORA processor [13], or custom fuzzy programmable processors. The FLORA processor extends a RISC instruction set with the min and macc instructions for the minimum calculation and the multiply-and-accumulate operation, respectively, to improve fuzzy processing.
An Extendible Processor Core and Its Development Environment
We have developed an extendible processor core based on the MIPS-I RISC architecture in VHDL. This processor core gives us the possibility to study the effects of instruction set architectures on processor speed and implementation area using rapid prototyping.
For the processor to be useful for these purposes, we identified the following requirements:
high-level description: The format of the processor description should be easy to understand and modify.
modular: To add new instructions, only the relevant parts should have to be modified. A monolithic design would make experiments difficult.
extendible: All data structures and interfaces should be designed such that new fields can be added with ease.
synthesizable: The processor description should be synthesizable to derive actual hardware implementations.
The processor core has been designed with a distributed controller to facilitate instruction set extension and processor adaptation for specific application requirements. This distributed controller approach replaces a monolithic controller which would be difficult to adapt. The distributed controller is responsible for pipeline flow management and consists of communicating state machines found in each pipeline stage. Thus, changes in the architecture can be restricted to those modules where new functionality is provided.
The processor core is described in synthesizable VHDL. Thus, hardware implementations can be derived using logic synthesis. In our work, we use the Synopsys Design Compiler [14] as synthesis tool to generate ASIC implementations. We have synthesized the VHDL description of the processor core for the AMS 0.6 µm CMOS process [15]. Table 1 gives the size of each module of the synthesized design.
A more detailed description of the processor core, its implementation and validation can be found in [16].
Defining Fuzzy Extensions

Fuzzy Principles
Fuzzy computation consists of three steps: fuzzification, inference and defuzzification.
During fuzzification, crisp input signals are mapped onto fuzzy variables. Each input value is assigned a degree of membership in each fuzzy set, also referred to as alpha value.
Inference implements the evaluation of fuzzy rules. Fuzzy rules from a rule database are applied to the fuzzified inputs, determining a fuzzy control action. The intersection of rule premises is performed by selecting the minimal alpha value. The rule conclusion gives the membership degree in the output fuzzy set C. As several rules may be applicable to some combination of input values, more than one membership degree may be computed for a single fuzzy set. These degrees are then consolidated into a single membership degree by selecting the maximum of all computed values for each output set.
During defuzzification, control actions are converted back to crisp signals. The alpha values delimit the output fuzzy sets, defining an area. The ordinate of the center of gravity of this area determines a crisp output control signal.
Figure 1: Subword data memory organization: each word contains membership information for two fuzzy sets. The fuzzy sets are specified using unique fuzzy set identifiers and the membership is encoded as an 8-bit unsigned integer. The current implementation supports 2 overlapping fuzzy sets for each input.
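The inference and defuzzification steps described in this subsection can be sketched in a few lines of Python. This is an illustration only: the rule representation and all names are hypothetical, and defuzzification is simplified to a weighted average over one representative position per output set (a singleton approximation of the center of gravity) rather than an integration over the delimited area.

```python
def evaluate_rules(rules, alphas):
    """Min/max inference.

    rules:  list of (premises, output_set); each premise is a key into alphas
    alphas: dict mapping a premise key to its membership degree (0..255)
    Returns a dict mapping each output set to its consolidated degree.
    """
    output = {}
    for premises, output_set in rules:
        # Intersection of premises: select the minimal alpha value.
        strength = min(alphas.get(premise, 0) for premise in premises)
        # Several rules may target one output set: keep the maximum.
        output[output_set] = max(output.get(output_set, 0), strength)
    return output


def defuzzify(output, centers):
    """Simplified center of gravity (ASSUMPTION: singleton output sets)."""
    weight = sum(output.values())
    if weight == 0:
        return 0.0
    return sum(alpha * centers[s] for s, alpha in output.items()) / weight


# Hypothetical two-input example with 8-bit membership degrees.
alphas = {("x", 0): 200, ("x", 1): 55, ("y", 0): 100}
rules = [
    ([("x", 0), ("y", 0)], "low"),
    ([("x", 1), ("y", 0)], "high"),
    ([("x", 1), ("y", 1)], "high"),  # ("y", 1) has degree 0: rule does not fire
]
degrees = evaluate_rules(rules, alphas)
crisp = defuzzify(degrees, {"low": 10, "high": 50})
```

The final division in `defuzzify` corresponds to the division step that the evaluation later reports separately from the defuzzification cycle count.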
Subword parallelism
Packing of multiple data streams in a single processor word is referred to as subword parallelism. This method has gained widespread acceptance lately to support operation on multiple related data items in a single cycle for applications such as media processing, video conferencing, or multimedia and communication applications [17].
Subword parallelism can be applied to fuzzy computation for fuzzification, inference and defuzzification, as 8 bits offer sufficient precision for representing inputs, outputs and membership functions of fuzzy sets [18]. Subword parallelism can be used during fuzzification to compute multiple membership degrees in one step, during inference to perform rule evaluation in a single cycle, and during defuzzification to operate on multiple data sets for integration.
As alpha values require only 8 bits, more than one set can be packed in a single processor word. To identify fuzzy sets, some set identification has to be included, requiring an additional 4 bits for a maximum of 16 possible sets for a single input. Thus, in a 32-bit processor we can operate on two overlapping sets concurrently. This concept can be extended to five overlapping sets on a 64-bit processor.
By using data parallelism, parallel operation is performed on multiple data in determining alpha values for all fuzzy sets for an input, and in performing defuzzification of two output sets concurrently. For fuzzification, we define an optimized data layout in memory (see figure 1). For each memory access, the membership function for a crisp input is computed for two fuzzy sets. Both fuzzy sets and the associated membership degrees are encoded in a single 32-bit word. The membership degree alpha for all other fuzzy sets defaults to 0.
This memory organization minimizes the number of memory accesses, requiring only one load per input and in this way speeding up the execution. In addition, memory requirements for this organization are minimal and independent of the number of fuzzy sets.
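The packing arithmetic and the memory word layout can be sketched as follows. The field widths (8-bit alpha, 4-bit set identifier) come from the text; the exact bit positions inside each 16-bit halfword are an assumption made for illustration, as are all function names.

```python
ALPHA_BITS = 8    # membership degree width (from the text)
SET_ID_BITS = 4   # set identifier width (from the text)


def sets_per_word(word_bits):
    """Number of (set id, alpha) pairs that fit in one processor word."""
    return word_bits // (ALPHA_BITS + SET_ID_BITS)


def pack_halfword(set_id, alpha):
    """ASSUMED layout: set id in bits 11..8, alpha in bits 7..0."""
    if not (0 <= set_id < 16 and 0 <= alpha < 256):
        raise ValueError("set id must fit in 4 bits, alpha in 8 bits")
    return (set_id << ALPHA_BITS) | alpha


def pack_word(left, right):
    """Pack two (set_id, alpha) pairs into one 32-bit word."""
    return (pack_halfword(*left) << 16) | pack_halfword(*right)


def unpack_word(word):
    """Recover the two (set_id, alpha) pairs from a 32-bit word."""
    halves = ((word >> 16) & 0xFFFF, word & 0xFFFF)
    return [((h >> ALPHA_BITS) & 0xF, h & 0xFF) for h in halves]


# A crisp input that belongs to set 3 with degree 200 and set 4 with degree 55.
word = pack_word((3, 200), (4, 55))
```

With 12 bits per pair, `sets_per_word(32)` gives 2 and `sets_per_word(64)` gives 5, matching the two and five overlapping sets stated above.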
Fuzzy Instructions
To improve performance of RISC processors for fuzzy calculation, we have explored different ISA extensions for fuzzy workloads. For this purpose, we have extended the original MIPS-I instruction set architecture (ISA) with several instructions specialized for fuzzy computation. For each new instruction, we have analyzed its impact on hardware and software, as well as the obtained performance. The final ISA extensions are determined by the results of the hardware/software co-evaluation of these extensions.
To support fuzzy computations, we have considered the following instructions:
slw loads fuzzified values from memory,
rulev evaluates fuzzy rules from the rule base,
macc multiply-and-accumulate operation for defuzzification,
hmul halfword multiplication for defuzzification, and
hadd result collection for defuzzification.
slw To optimize access to arrays, a register plus shifted register memory addressing mode (scaled load) can be used. This instruction is not specific to fuzzy calculation, as it improves the performance of array accesses found in all general purpose programs, and it is included in a number of microprocessors, such as most CISC machines and several RISC processors such as the Motorola m88k. The instruction slw Rd, Rs, Rt takes two operands to specify a memory address: the first operand Rs specifies the base of the array and the second operand Rt the index in the array. To generate the actual memory address, the index is multiplied by the data size (4 bytes) and added to the base.
In fuzzy processing, the scaled load instruction can be used to implement the table lookup used for fuzzification of input values in a single processor cycle.
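The address computation of the scaled load can be modeled in Python as below. Memory is represented by a plain dictionary from byte address to 32-bit word, which is a simplification chosen for this sketch; the function names and the table contents are hypothetical.

```python
WORD_BYTES = 4  # data size used to scale the index (from the text)


def slw_address(base, index):
    """Effective address: base + index * 4, wrapped to 32 bits."""
    return (base + index * WORD_BYTES) & 0xFFFFFFFF


def slw(memory, base, index):
    """Model of slw Rd, Rs, Rt: load the word at the scaled address.

    memory is a dict {byte address: 32-bit word} (illustrative model only).
    """
    return memory[slw_address(base, index)]


# Fuzzification as a table lookup: one precomputed word per crisp input value.
TABLE_BASE = 0x1000
memory = {
    slw_address(TABLE_BASE, 0): 0x00FF0100,  # made-up table entries
    slw_address(TABLE_BASE, 1): 0x00E0011F,
    slw_address(TABLE_BASE, 2): 0x00C0013F,
}
fuzzified = slw(memory, TABLE_BASE, 2)  # crisp input value 2
```

Without the scaled addressing mode, the same lookup needs a separate shift and add before the load, which is where the saved cycles come from.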
rulev Evaluation of fuzzy rules is performed by determining the minimum of all rule premises. During the inference step, only the rules where all premise alpha values are non-zero have to be evaluated. A rule evaluation instruction rulev evaluates a fuzzy rule in a single cycle. The instruction rulev Rd, Rs, Rt, Set1, Set2, 1|0 checks whether the alpha values from the source registers Rs and Rt are premises of the rule under evaluation and non-zero. If this is true, it determines the minimum alpha value and stores it in the register Rd. The values Set1 and Set2 are fuzzy set identifiers in the range 0 to 15. The last argument of the instruction can take the value 0 or 1 to specify whether the result is written in the left (if 1) or in the right (if 0) half of the destination register (see figure 2).
If the alpha values in the source registers are not premises of the rule under evaluation, the corresponding halfword of the destination register is set to zero.
Problem   I   O   R    MF
Simple    2   1   7    5
Medium    3   2   14   5
Complex   7   3   80   5
Table 2: Complexity of fuzzy problems: fuzzy problems are classified by the number of inputs I, the number of outputs O, the number of fuzzy rules R and the number of membership sets MF.
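One possible reading of the rulev semantics described above is modeled in the following Python sketch. Several details are assumptions, since the text does not spell them out: each 16-bit halfword is taken to hold the set identifier in bits 11..8 and the alpha value in bits 7..0, Set1 is matched against Rs and Set2 against Rt, and the halfword of Rd that is not selected is left unchanged.

```python
def lookup_alpha(reg, set_id):
    """Alpha value stored in reg for set_id, or 0 if the set is absent.

    ASSUMED halfword layout: set id in bits 11..8, alpha in bits 7..0.
    """
    halves = ((reg >> 16) & 0xFFFF, reg & 0xFFFF)
    matches = [h & 0xFF for h in halves if (h >> 8) & 0xF == set_id]
    return max(matches, default=0)


def rulev(rd, rs, rt, set1, set2, left):
    """Behavioral model of rulev Rd, Rs, Rt, Set1, Set2, 1|0.

    Returns the new value of Rd. The result is the minimum of the two
    premise alphas; it is 0 when a premise is missing or has degree 0.
    """
    result = min(lookup_alpha(rs, set1), lookup_alpha(rt, set2))
    if left:
        return (result << 16) | (rd & 0x0000FFFF)
    return (rd & 0xFFFF0000) | result


# Input 1 belongs to set 3 (degree 200) and set 4 (degree 55);
# input 2 belongs to set 1 (degree 100) and set 2 (degree 0).
rs = 0x03C80437
rt = 0x01640200
rd = rulev(0, rs, rt, 3, 1, 1)    # rule "in1 is set 3 AND in2 is set 1"
rd = rulev(rd, rs, rt, 4, 1, 0)   # rule "in1 is set 4 AND in2 is set 1"
```

Because a zero alpha already forces the minimum to zero, the non-zero check and the minimum selection collapse into a single `min` in this model.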
macc The multiply-and-accumulate instruction is a popular choice for various instruction set extensions, as it is useful for solving several problems. In our design, the instruction macc Rd, Rs, Rt, 1|0 multiplies the left or the right halfword (depending on the field specifier supplied as the last argument of the instruction) of the source registers Rs and Rt and accumulates the result in the register Rd.
hadd We have introduced this instruction to perform the addition of two halfwords in parallel.
The instruction hadd Rd, Rs, Rt performs addition of the left and of the right halfwords of the source registers Rs and Rt in parallel and stores the results in the corresponding halfwords of the destination register Rd.
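The halfword selection in macc and the parallel addition in hadd can be modeled as shown below. Overflow behavior is not specified in the text, so this sketch assumes wrap-around arithmetic (32 bits for the macc accumulator, 16 bits per hadd lane, with no carry between lanes); the helper names are hypothetical.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def half(reg, left):
    """Select the left (upper) or right (lower) 16-bit halfword of reg."""
    return (reg >> 16) & MASK16 if left else reg & MASK16


def macc(rd, rs, rt, left):
    """Model of macc Rd, Rs, Rt, 1|0: Rd += half(Rs) * half(Rt).

    ASSUMPTION: the accumulator wraps around at 32 bits.
    """
    return (rd + half(rs, left) * half(rt, left)) & MASK32


def hadd(rs, rt):
    """Model of hadd Rd, Rs, Rt: two independent 16-bit additions.

    ASSUMPTION: each lane wraps at 16 bits; no carry crosses the lanes.
    """
    upper = (half(rs, True) + half(rt, True)) & MASK16
    lower = (half(rs, False) + half(rt, False)) & MASK16
    return (upper << 16) | lower


acc = macc(10, 0x00030005, 0x00040006, False)  # 10 + 5 * 6
lanes = hadd(0x00010002, 0x00030004)           # (1 + 3, 2 + 4)
```

The key property of the subword form is that a carry out of the lower lane must not leak into the upper lane, which is what distinguishes hadd from an ordinary 32-bit add.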
hmul The last instruction we have analyzed is the halfword multiplication.
The instruction hmul Rd, Rs, Rt multiplies the halfwords of source registers Rs and Rt in parallel and stores the result in the destination register Rd.
6 Hardware/Software Instruction Co-Evaluation
To give a balanced evaluation of the new instructions, we have performed evaluation of both hardware and software aspects.
We evaluate the impact of proposed instruction set features on different classes of fuzzy problems of varying complexity. Table 2 shows the classification of fuzzy problems based on Costa et al. [13] which will be used throughout this work. For each class of problem complexity, we have generated application programs using the new instructions and established the cycle count required for execution.
We have analyzed the performance improvement offered by several different processor configurations, ranging from the addition of a single instruction to support fuzzy rule evaluation to hardware support for all fuzzy processing steps. Table 3 gives an overview of the analyzed processor configurations and the instructions included in each of these configurations.
conf.   instruction set features
A       core
B       core, rulev
C       core, slw, rulev
D       core, slw, rulev, macc
E       core, slw, rulev, hmul, hadd
F       core, slw, rulev, hmul
Table 3: Processor configurations under evaluation.
Configuration A implements the MIPS-I RISC instruction set architecture and serves as a reference for the comparison of the extended processor configurations. Configuration B implements only a single additional instruction supporting fuzzy rule evaluation, configuration C adds support for scaled loads for fuzzification, and the remaining configurations offer different types of hardware support for defuzzification.
Software evaluation of proposed extensions is performed by computing the number of cycles needed to implement the functionality of the test programs when the instructions under evaluation are used. The cycle counts for a sample program are given in table 4.
The table lists the cycle count required for performing the same fuzzy application on each of the analyzed configurations. The cycle count is given separately for each of the three fuzzy processing steps. In this table, the division has been extracted to simplify the analysis of proposed instruction set extensions.
Table 4: Cycle count required for each processing step in fuzzy calculation for a simple fuzzy problem. F indicates the cycle count for fuzzification, I for inference and D for defuzzification; the division is listed separately.
The performance improvement from introducing specialized instructions becomes more pronounced with growing problem complexity. Figure 3 gives an overview of the cycle count required for executing fuzzy problems of varying complexity on different processor configurations. The highest performance improvement is achieved by the introduction of the instruction rulev. This instruction improves the performance of fuzzy calculation by 74% to 157%, depending on problem complexity. The introduction of the scaled load instruction reduces the cycle count by an additional 7% to 9%, and the instructions supporting defuzzification by 20% to 46%, depending on the architecture and on the design complexity.
Figure 4 shows that the execution profile of fuzzy problems differs significantly from simple to complex problems. With growing program complexity, the impact of fuzzification and defuzzification on the overall execution time decreases, whereas the computing time for rule base evaluation becomes more significant. Thus, while simple problems spend much of their execution time in defuzzification, inference takes up 70% of execution time in complex problems. As a result, complex problems benefit the most from the rule evaluation instruction (157%), with only minor improvements gained by other instructions (7% for the scaled load, 20% for defuzzification support). For simple problems, the biggest gain is still obtained by rule evaluation support (74%), but other extensions also offer significant improvements (9% for scaled load, 46% for defuzzification support) because fuzzification and defuzzification make up a larger part of the execution time.
Another important metric is code size, especially for embedded applications. Using the extended ISA for fuzzy calculation, fewer instructions are needed for rule evaluation and the code size is reduced significantly (e.g., for a simple fuzzy problem the code size decreases from 82 to 40 instructions for configuration C).
However, the reduction of cycle count as a result of introducing the specialized instructions does not automatically imply shorter execution time. The implementation of specialized instructions is often complex and may increase the cycle time of the processor and thus reduce the benefits of using the application-specialized architecture. For this reason, it is necessary to perform hardware evaluation of the architectures under evaluation as well.
Data about the hardware implementation are derived by designing prototype implementations of the proposed instructions and using logic synthesis to generate a hardware implementation from the VHDL description. The resulting gate-level netlist can then be analyzed to obtain information about area and timing. As target process, we have used the AMS 0.6 µm process.
Table 5 columns: instruction, area increase (sq.mil), critical path (ns).
Table 6: Results of hardware/software co-evaluation for a fuzzy problem of medium complexity.
The unmodified processor core (configuration A) achieves an operating frequency of 62 MHz, resulting in an inference speed of 2.5 µs to 8.5 µs depending on the problem complexity. The area cost and critical path of the proposed instructions are reported in table 5. By combining this information with the information about cycle count, the overall performance of the different configurations can be obtained (table 6). This information can then be used to decide which instructions to include in the final processor design.
Because the proposed fuzzy instructions were all designed to minimize cycle time impact and resource usage in the first place, hardware evaluation reports only moderate implementation costs. The area increase to implement the proposed instructions is low, especially when compared to overall chip size. The critical path is increased by up to 2.5 ns for the most expensive instruction, macc, but none of the instructions leads to an increase in cycle time.
Conclusion
In this paper, we have demonstrated how to apply hardware/software co-evaluation to instruction set definition. We have defined and evaluated fuzzy instruction set extensions. The instruction set extensions have been added to a RISC processor core based on the MIPS instruction set architecture. The fuzzy processing instructions are based on the use of subword parallelism to fully exploit the processor's capabilities by processing multiple data streams in parallel.
The highest performance gain for fuzzy processing is brought by the rule evaluation instruction rulev, which alone accounts for a performance increase in excess of 150%. Fuzzification can be improved by using the scaled load instruction found in several commercially available processors. The defuzzification step in fuzzy processing can be improved significantly by subword instructions (hadd, hmul) such as those found in a number of multimedia extensions (e.g., HP's MAX, Intel's MMX, or Sun's VIS instructions).
Based on the results of hardware/software co-evaluation of the proposed configurations, we have identified configuration C as the optimal architecture for fuzzy processing. This architecture implements fuzzification and inference in hardware, whereas defuzzification is implemented in software. The architecture speeds up fuzzy processing by up to 175% at a hardware cost of only 1.3%.
