Abstract-In this paper, we introduce the ARISE framework for the systematic extension of typical processors with the infrastructure necessary to support an arbitrary number and type of reconfigurable hardware units. ARISE extends the microarchitecture of the processor with an interface that allows the coupling of the hardware units. Furthermore, the instruction set of the processor is extended with instructions that expose full control of the interface to the programmer/compiler. This control includes the configuration of operations on the hardware units, the execution of these operations, and the communication of data between the processor and the units. The new instructions are incorporated without the need to redesign the processor instruction set architecture. To evaluate our proposal, a model of an ARISE extended MIPS processor has been designed. Using a turbodecoder algorithm as the benchmarking application, a simulation of the ARISE model has been performed. Performance results show application speedups of up to x7.5.
I. INTRODUCTION
Instruction set customization is an effective way to improve performance over a certain set of applications (application domain). Critical portions of the application can be executed more efficiently on Custom Computing Units (CCUs). While the base instruction set provides the bulk of the flexibility needed to implement any application, the customized instructions use the CCUs to enhance performance over the target application domain. Furthermore, giving the CCUs the capability to reconfigure their functionality yields a high degree of flexibility. Reconfigurable units provide dynamic instruction set extensions, allowing the system to adapt to the targeted application.
The process of designing such a hybrid system, apart from designing the CCUs themselves, is divided into two steps: first, provide an interface between the processor and the CCUs, and second, expose the control of the new system to the architecture. Although a number of such systems have been presented in the literature, most of them use an ad-hoc design approach to couple the CCUs to the processor. In addition, most systems are designed around the needs of the attached CCUs, which rules out modularity and makes it difficult to extend the system with new types of CCUs. Finally, in most cases the approaches presented in the literature suffer from limitations which could reduce performance. Such limitations include the number of parameters which the CCU can access (parameters limitation) and the opcode space available to encode operations performed in the CCUs (opcode space explosion).
In this paper we present the AUTH Reconfigurable Instruction Set Extension (ARISE) framework. The framework extends a typical processor with: 1) a micro-architectural interface which is used to couple CCUs to the processor and 2) a set of instruction set extensions to control the interface. These extensions are applied once to a processor to create an ARISE machine. After that, an arbitrary number of CCUs can be attached to the machine. Moreover, the complexity of the CCUs can vary from hardwired units to reconfigurable units with multiple configuration contexts. The instruction set extensions expose to the programmer/compiler all the instructions required to control the configuration of operations in the CCUs, their execution, and the communication of arguments between the CCUs and the processor. Furthermore, using a buffering technique, ARISE overcomes the parameters limitation while at the same time exploiting the complete register file bandwidth. ARISE also deals with the opcode space explosion by dynamically assigning opcodes to operations.
To evaluate the proposal, a model of an ARISE extended MIPS processor was designed. A turbodecoder application was implemented in the ARISE machine and simulations were performed to estimate the performance of the machine and identify any possible bottlenecks. Results indicate that the ARISE machine is able to speed up the execution of the application by a factor of 7.5.
The rest of the paper is organized as follows. Section II gives a general description of the ARISE framework and discusses in detail the general organization of an ARISE machine, the ARISE micro-architectural interface, and the ARISE instruction set extensions. The procedure which must be followed to program an ARISE machine is also described. Section III presents the design of an ARISE evaluation machine based on a MIPS core processor. Experimental results are presented in Section IV. Section V discusses some related work, and we conclude in Section VI.
The data are read from the register of the register bank indicated by the address of the address generation unit. After the read, the address is incremented to point to the location from which the next data will be read.
F. Programming an ARISE Machine
The procedure of programming an ARISE machine starts from the identification of code fragments which can be implemented in a CCU as ARISE operations. What follows is the generation of the configuration bitstreams for these ARISE operations. A configuration bitstream is composed of two parts: the CCU specific and the ARISE specific. The first part is the common bitstream which configures reconfigurable hardware to implement a certain operation and is generated by the CCU's vendor tools. This bitstream can be generated at any time and ported later to any ARISE machine. Obviously, it is omitted for a hardwired CCU, which needs no configuration.
The ARISE specific part holds information which drives the AIF to successfully execute the operation. For example, the number of cycles required for the completion of the operation is part of the ARISE specific configuration bitstream. Only this part must be re-generated each time the operation is implemented in a new ARISE machine. The ARISE specific part of the configuration bitstream is stored in the ACW, in a storage structure designed specifically to support the attached CCU. Thus, if a range of addresses has been assigned to a CCU, implying that multiple operations can be configured in the CCU at the same time, the storage structure must support all these different configurations. Furthermore, in the case of a hardwired CCU with standard specifications, the bitstream can be permanently stored in the ACW (firmware can be used instead to ensure portability).
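The storage requirement described above can be illustrated with a minimal sketch. The names `AriseConfig` and `AcwStore`, the single `cycles` field, and the address values are hypothetical choices made for illustration; the text only states that the cycle count is part of the ARISE specific bitstream and that a CCU may be assigned a range of addresses.

```python
from dataclasses import dataclass, field


@dataclass
class AriseConfig:
    """ARISE specific part of a configuration bitstream (assumed layout)."""
    cycles: int  # cycles needed to complete the operation (named in the text)


@dataclass
class AcwStore:
    """Hypothetical per-CCU storage structure inside the ACW."""
    base: int   # first CCU address assigned to this CCU (assumed)
    size: int   # number of addresses in the assigned range (assumed)
    slots: dict = field(default_factory=dict)

    def store(self, address: int, config: AriseConfig) -> None:
        # One slot per address, so several operations stay configured at once.
        if not self.base <= address < self.base + self.size:
            raise ValueError(f"address {address} is outside the CCU range")
        self.slots[address] = config

    def lookup(self, address: int) -> AriseConfig:
        return self.slots[address]


# A CCU that was assigned the two addresses 4 and 5.
store = AcwStore(base=4, size=2)
store.store(4, AriseConfig(cycles=3))
store.store(5, AriseConfig(cycles=10))
```

A hardwired CCU would correspond to a store whose slots are filled once and never rewritten.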
The configuration bitstream can be assigned to a high-level data structure which is included in the source code of the application. During linking of the application, the data structure is stored in the main data memory or in a reserved space in a dedicated configuration memory. What remains is the straightforward process of constructing the instruction sequences for configuring and executing the ARISE operations. The configuration sequence is constructed by assigning the opcode of an ARISE operation to a CCU address (opc2addr ARISE instruction) and then, if required, downloading the configuration bitstream (using the confld ARISE instruction). The operation can then be called implicitly through its opcode, in any region of the application, until another opcode-to-address assignment invalidates it. This call is performed with an instruction sequence of movta-execa-mofa ARISE instructions.
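The configure-then-call sequence described above can be sketched as a small behavioural model. This is an illustration under stated assumptions, not the ARISE implementation: the instruction names come from the text, but their operands, the use of a Python callable as a stand-in for a configuration bitstream, and the `AriseModel` class with its input and output banks are all hypothetical.

```python
class AriseModel:
    """Toy behavioural model; operand formats are assumptions."""

    def __init__(self):
        self.opcode_table = {}  # ARISE operation opcode -> CCU address
        self.ccu = {}           # CCU address -> configured operation
        self.in_bank = []       # buffered input arguments
        self.out_bank = []      # buffered results

    def opc2addr(self, opcode, address):
        # Dynamically bind an opcode to a CCU address.
        self.opcode_table[opcode] = address

    def confld(self, address, operation):
        # Stand-in for downloading a bitstream: store a callable.
        self.ccu[address] = operation

    def movta(self, *values):
        # Move arguments to the ARISE input buffer.
        self.in_bank.extend(values)

    def execa(self, opcode):
        # Run the operation currently bound to this opcode.
        address = self.opcode_table[opcode]
        results = self.ccu[address](*self.in_bank)
        self.in_bank.clear()
        self.out_bank = list(results)

    def mofa(self):
        # Move one result back to the processor.
        return self.out_bank.pop(0)


m = AriseModel()
m.opc2addr(opcode=3, address=0)                 # configuration sequence
m.confld(0, lambda a, b, c, d: (a + b, c * d))  # 4 inputs, 2 outputs
m.movta(1, 2)                                   # call sequence
m.movta(3, 4)
m.execa(3)
lo = m.mofa()
hi = m.mofa()
```

Issuing two `movta` calls before `execa` mirrors how buffering lets an operation take more arguments than a single register file access can supply.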
III. AN ARISE EXTENDED MIPS-I EVALUATION MODEL
To evaluate the proposed scheme, a model of an ARISE extended processor has been designed. For this purpose a MIPS processor with the MIPS-I ISA was used as the core processor under extension. In the following we discuss issues encountered in designing this model. Although a specific processor is used, as will become clear in the following, the ARISE microarchitecture and the ARISE ISA can be viewed as a general template which can be easily re-targeted to any processor.
A. Encoding the ARISE ISA into the processor instruction word
The first step in creating an ARISE extended processor is to encode the ARISE ISA in the instruction word of the processor and design the AID. As already stated, the ARISE is tightly coupled to the processor following the model of a functional unit (just as an ALU is used by a RISC processor). Exploiting this feature, the MIPS-I R-Type instruction format is used to encode the ARISE ISA. Figure 3 shows the R-Type instruction format and the correspondence of the word fields to the ARISE. As already mentioned, the AID receives the fetched instruction word and performs an initial decode. If the opcode of the instruction belongs to the processor ISA, the instruction word is forwarded unchanged to the processor. If the ARISE reserved opcode is identified, the fields "ARISE Instruction" and "ARISE Operation Opcode" are forwarded to the AIF to drive the ARISE operation. In addition, the AID modifies the word to form a pseudo-instruction which is part of the processor ISA and which drives the processor operation according to the needs of the AIF. To achieve this, the AID replaces the "opcode" and "funct" fields to encode an R-Type instruction which does not raise any exception (e.g., a logical operation). The values of rs, rt, and rd are kept unchanged in order to communicate data with the processor register file. Issuing this pseudo-instruction to the processor drives the register file to provide the rs and rt registers and to write the result of the operation to the rd register. Thus, the AIF can directly access the rs and rt registers, while a multiplexer can force the rd value to be provided by the AIF rather than by the processor pipeline. Data hazards are resolved by the processor infrastructure.
The simulator of the ARISE machine is based on the ArchC architecture description language [3]. The MIPS-I ArchC model provided by the ArchC team has been extended with the complete ARISE features, and an ARISE MIPS-I simulator was generated. To model the execution of code fragments on CCUs, a functional simulation approach was followed. These code fragments were removed from the application code and placed in a separate source file which was compiled as a dynamic library. This library is loaded dynamically by the ArchC simulator to functionally simulate the CCU operation. The simulator has also been extended to provide profiling results regarding the ARISE operation.
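Returning to the encoding of subsection A, the initial decode performed by the AID can be sketched as follows. The sketch rests on several assumptions that the text does not confirm: the reserved opcode value `ARISE_OPCODE` is hypothetical, the placement of the "ARISE Instruction" field in shamt and the "ARISE Operation Opcode" field in funct is a guess (Figure 3 is not reproduced here), and the MIPS-I `or` encoding is used as one example of a logical operation that raises no exception.

```python
ARISE_OPCODE = 0x3F  # hypothetical reserved primary opcode
SPECIAL = 0x00       # MIPS-I primary opcode for R-Type ALU instructions
FUNCT_OR = 0x25      # MIPS-I funct code of the logical "or"


def fields(word):
    """Split a 32-bit R-Type word into its six fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


def aid_decode(word):
    """Return (word issued to the processor, fields sent to the AIF or None)."""
    f = fields(word)
    if f["opcode"] != ARISE_OPCODE:
        return word, None  # processor ISA: forward unchanged
    # Assumed mapping: shamt = ARISE instruction, funct = operation opcode.
    aif_fields = (f["shamt"], f["funct"])
    # Replace only opcode and funct; rs, rt, and rd are kept as they are.
    pseudo = word & ~((0x3F << 26) | 0x3F) & 0xFFFFFFFF
    pseudo |= (SPECIAL << 26) | FUNCT_OR
    return pseudo, aif_fields


arise_word = (ARISE_OPCODE << 26) | (4 << 21) | (5 << 16) | (2 << 11) | (3 << 6) | 0x11
pseudo_word, aif_fields = aid_decode(arise_word)
native_word, native_aif = aid_decode(0x00851020)  # add $2, $4, $5
```

The pseudo-instruction still names the original rs, rt, and rd, so the processor's own hazard logic sees the same register dependencies as the ARISE instruction.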
IV. EXPERIMENTAL RESULTS
To demonstrate the effectiveness of our proposal and evaluate its performance, we experimented with the ARISE MIPS-I evaluation model using a turbodecoder algorithm as the benchmarking application. The application source code was obtained from the XiRisc benchmarking suite [4]. For the implementation of the application in the ARISE machine we used information derived from [5], where the same application is implemented in the XiRisc [6].
Table IV presents the execution cycles spent by each type of ARISE instruction. As shown, 57% of the total execution time is consumed by ARISE instructions. Of these cycles, the execution and configuration ARISE instructions account for the largest share. The six ARISE operations were originally designed to be implemented on the XiRisc processor. XiRisc, however, limits each operation to four input and two output arguments. Although the ARISE machine has a 2-read/1-write register file, it is able to perform these operations by exploiting the move instructions. As observed, only a small 9% of the total execution time is spent moving data between the ARISE and the processor.
Table V presents the execution cycles reported by the simulator for this case. As expected, the cycles spent on ARISE move and execute instructions remain the same. However, they now account for a smaller portion of the total execution time, because the time consumed in reconfiguring the CCU has increased significantly, to 35% of the total execution time. The increase in configuration overhead reduced the speedup factor from x7.5 to x6.3. Nevertheless, the performance improvement obtained by exploiting the ARISE remains significant.
[1], [13], and [14].
Program Total: 31660 cycles (100%)
V. RELATED WORK
PRISC [7] uses a RISC processor core augmented with a Programmable Functional Unit (PFU). The unit is tightly coupled to the processor just like a typical ALU. The PFU can execute 2-input/1-output operations. Reconfiguration is performed via exceptions. Only one instruction set extension is required to access the PFU. OneChip [8] extends PRISC, allowing the PFU to implement multi-cycle sequential and combinational operations.
Garp [9] is a MIPS processor extended by a custom designed reconfigurable unit. The MIPS instruction set is extended with several instructions in order to control the unit operation following a co-processor-like model. Chimaera [10], like Garp, couples a custom designed FPGA-like unit to a MIPS processor. However, the coupling is tighter, following the functional unit approach. Communication of data is performed via a shadow register file.
PipeRench [11] augments a core processor with a coarse-grain reconfigurable array of ALUs. PipeRench is focused on implementing linear pipelines of arbitrary length. This machine is intended to serve as a co-processor in a general purpose computer and accesses the same memory space as the host processor.
In Molen [12] instructions are decoded by an arbiter determining which unit is targeted. "Normal" instructions are computed by the hardwired core processor while reconfigurable instructions are computed on the reconfigurable logic. Using this arbiter the processor ISA does not need to be redesigned to support the reconfigurable unit. Tasks to be mapped on the programmable hardware unit are considered microcoded in the processor architecture. The Molen can be considered as a general machine organization allowing a high degree of freedom in the definition of the reconfigurable hardware structure.
XiRisc [6] couples a very-long instruction word (VLIW) processor, featuring a set of digital signal processing (DSP)-specific hardwired function units, with a custom designed gate array. The gate array is tightly integrated within the CPU instruction set architecture, behaving as part of both the control unit and the datapath. Reconfigurable instructions implemented in the array can have at most four inputs and two outputs.
In [15] a coarse-grain reconfigurable functional unit is tightly coupled to a RISC processor. The functional unit consists of a 1-D array of processing elements. The integration of the functional unit into the processor allows the efficient exploitation of the processor pipeline stages. Reconfigurable instructions implemented in the hybrid system combine spatial and temporal computation to speed up execution. The processor register file has been extended to provide four input arguments to the functional unit. In [16] the same architecture has been extended with a "virtual" opcode technique to alleviate the opcode explosion problem, and with the ability to execute partial predicated operations in the functional unit to improve performance.
As the discussed related work makes clear, most approaches in the literature present an ad-hoc solution for coupling a specific reconfigurable unit with a typical processor. In contrast, our approach proposes a general machine organization together with a micro-architectural and an architectural extension which can be systematically applied to any processor. Once this extension has been performed, the new ARISE machine can support an arbitrary number and any type of reconfigurable units. From this point of view our approach is most closely related to [12]. In addition, our approach addresses limitations of previous approaches: it does not suffer from the parameters limitation, as PRISC, OneChip, XiRisc, and [15] do, or from opcode space explosion, as Chimaera and XiRisc do.
VI. CONCLUSIONS
In this paper, we introduced the ARISE framework for reconfigurable instruction set extensions. The microarchitectural interface of the ARISE allows a processor to couple arbitrary numbers and types of CCUs. Using the ARISE ISA, the programmer/compiler has full control over the interface and the coupled units. Furthermore, the new instructions can be incorporated without any redesign of the processor ISA. A buffering mechanism is used to overcome the parameters limitation problem. Thus, operations with any number of input/output parameters are supported by the ARISE. In addition, the dynamic assignment of opcodes to ARISE operations relaxes the opcode space explosion problem.
