Abstract-In this paper, we present a Reconfigurable Application-specific Instruction-set Processor (rASIP) that processes mixed-radix(2, 4) 64- and 128-point Fast Fourier Transform (FFT) algorithms while satisfying the partial execution-time requirements of the IEEE-802.11n standard. The rASIP was designed by integrating a template-based Coarse-Grain Reconfigurable Array (CGRA) in the datapath of a simple Reduced Instruction-Set Computing (RISC) processor. The instruction set of the RISC processor was extended with special instructions that enable cycle-accurate processing by the CGRA. The rASIP was synthesized for Field Programmable Gate Arrays to measure resource utilization and execution time. The post-fit gate-level netlist of the rASIP was simulated to estimate the power and energy consumption. Based on our measurements and estimates, we study the advantages of using the rASIP in comparison with other systems.
I. INTRODUCTION
A Reconfigurable Application-specific Instruction-set Processor (rASIP) is one of the most attractive solutions from a cost and performance point of view, as it combines features of general-purpose processors and application-specific accelerators. The added reconfigurability allows it to target multiple computationally intensive algorithms. In the recent past, many ASIPs were developed that specialize in image and video processing [1], Viterbi or LDPC channel coding/decoding ([2], [3]) and Fast Fourier Transform processing ([4], [5]). Later, the idea of adding dynamically reconfigurable circuitry to the datapath of the processor became popular, and rASIPs were built for Software-Defined Radio (SDR) applications [6] and for basic Digital Signal Processing (DSP) applications [7]. To control the reconfigurable part of an rASIP, special instructions are added to the instruction set of the processor, along with the additional hardware needed for interfacing. rASIPs require less area and fewer resources than traditional processor/coprocessor-based systems because the reconfigurable part is closely integrated with the processor and the control logic is reduced, as most of the control flow is written in software. During execution, while the reconfigurable part of the rASIP is busy processing large amounts of data, the ASIP can handle other applications of lower computational intensity.
An important class of reconfigurable devices is the Coarse-Grain Reconfigurable Array (CGRA). CGRAs have a proven track record of almost ten years, and some of the most popular CGRAs are ADRES [9], Morphosys [8] and PACT-XPP [10]. Their main drawback is that they require an area of a few million gates, and such a large area is not justified unless it is heavily utilized. To avoid this problem, CGRAs were designed as templates from which special-purpose array-based accelerators tailored to a set of applications can be generated [12]. CGRAs are well suited to computationally intensive signal processing algorithms, as they offer high throughput and parallelism. Some interesting examples are Wideband Code Division Multiple Access (WCDMA) cell search [14], Viterbi decoders [15] and image/video processing [11], [16].
In this paper, we use a template-based CGRA called AVATAR [13] as the reconfigurable part of the ASIP. In previous work, an AVATAR-generated mixed-radix(2, 4) FFT accelerator processed 64- and 128-point FFT algorithms while satisfying the IEEE-802.11n execution-time constraints. The supporting system for the AVATAR-generated accelerator consisted of a RISC processor called COFFEE [17] and a Direct Memory Access (DMA) device [19], which was responsible for fetching large amounts of data from the main memory of the system and delivering it to the internal memory of the accelerator. Between these modules, a network of switched interconnections provided dedicated connections for high-speed data transfer. The rASIP presented in this paper has AVATAR integrated in the datapath of a simple RISC processor, and the control flow is driven by special instructions added to the instruction set of the RISC processor. In this way, we have reduced the resource-utilization overhead caused by the DMA and the network of switched interconnections. Furthermore, the control logic has become simpler, as it now only has to control the cycle-accurate processing of the accelerator and not the functionality of the DMA. We synthesized the rASIP for Stratix-IV Field Programmable Gate Arrays (FPGAs), measured the execution time and resource utilization, and estimated the power and energy consumption for comparison with the COFFEE-based system and the state of the art.
In the next section, we explain the COFFEE/AVATAR (C/A)-based system. We then discuss how AVATAR was separated from the previous system and integrated into the datapath of a simple RISC processor to make an rASIP. Section IV discusses the synthesis results and comparisons based on execution time, resource utilization, power and energy consumption. In the last section, we draw conclusions.
II. COFFEE/AVATAR-BASED SYSTEMS AND THE FFT ACCELERATOR
This section is composed of two parts. The first describes the architecture of the C/A-based system, which serves as a reference for the design of the rASIP. The second describes the design and functionality of the mixed-radix(2, 4) FFT accelerator that was generated using AVATAR.
A. COFFEE RISC Processor and AVATAR
The COFFEE RISC processor and AVATAR work in a processor/coprocessor model in which C code can be compiled for COFFEE using a customized gcc compiler. COFFEE controls the cycle-accurate processing of the AVATAR-generated accelerator by passing control words to the control registers of the accelerator. AVATAR is written in VHDL; most of the information related to AVATAR can be found in [13] and [12], but the key points are highlighted here to support the discussion.
AVATAR is a 4×16 processing element (PE) template-based CGRA. Each PE can perform 32-bit integer arithmetic and logic operations as well as 32-bit IEEE-754 floating-point operations. The PEs have two inputs and two outputs, and they can connect to their neighboring PEs in a point-to-point fashion.
Between the local memories and the processing array of AVATAR, there are I/O-buffers that interleave the data from the local memories before and after it is processed by the array.
As AVATAR is a template-based device, the user can specify, using a graphical platform, different contexts that comprise the operation(s) to be performed by each PE and the pattern of interconnection among the PEs at each clock cycle. The graphical platform in turn generates a VHDL file containing the parameters that determine which hardware components are instantiated inside the accelerator. It also generates a C header file containing the configuration stream. At run-time, the configuration stream is fetched from the main memory of the system by the DMA and then distributed to the respective configuration memories of the PEs by a pipeline infrastructure designed to reduce the cost of distributing the configuration bit-stream [18]. AVATAR is equipped with two local memories that hold the data to be processed. The DMA loads the data into one of the local memories in a time-multiplexed way, as required by the accelerator. Once the data is loaded, the COFFEE RISC processor can enable different contexts as required by the flow of the algorithm. Once the data has been processed, it can be fetched from the local memories and stored in the main memory of the system. Fig. 1 shows the C/A-based system.
B. The FFT Accelerator
The main reason for the design of AVATAR was to accommodate a radix-4 butterfly in a single context, as the CREMA-generated accelerator needed three different contexts for the same purpose [20]. CREMA was a 4×8 PE template-based CGRA and had to be scaled up to 4×16 PEs to satisfy the execution-time constraints of the IEEE-802.11n standard for FFT processing. The design of the FFT accelerator is described in detail in [13], but some important details are highlighted in this subsection. The AVATAR-generated mixed-radix(2, 4) FFT accelerator could process both 64- and 128-point FFT algorithms. A 64-point FFT requires only the radix-4 scheme, which completes the processing in three stages. For a 128-point FFT, the first stage is processed by the radix-2 scheme, followed by three more radix-4 stages. The main structure of the accelerator consisted of two contexts: the first contained four radix-2 butterflies and the second contained one radix-4 butterfly. The other contexts were designed for the preprocessing required between two processing steps. Preprocessing is needed to reorder the data, because employing fewer butterflies than the signal flow graph of the algorithm demands violates the algorithm's inherent parallelism. Finally, the accelerator processed four different streams of 64- and 128-point FFT within 3.2 μs + 0.8 μs (guard interval) = 4.0 μs.
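As an illustration of the stage counts described above, the following minimal Python sketch derives the sequence of radices for a power-of-two FFT length, using radix-4 stages and one leading radix-2 stage when the length is an odd power of two. The function names `stage_plan` and `butterflies_per_stage` are hypothetical and do not come from the AVATAR tools; the sketch only reproduces the arithmetic of the decomposition, not the accelerator's contexts or its data reordering.

```python
def stage_plan(n):
    """Return the radix of each stage, first stage first, for an n-point FFT.

    Radix-4 stages are used wherever possible; a single leading radix-2
    stage is added when log2(n) is odd (as for n = 128).
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    log2n = n.bit_length() - 1
    radices = [2] if log2n % 2 else []
    radices += [4] * (log2n // 2)
    return radices


def butterflies_per_stage(n):
    """Return (radix, butterfly count) per stage; a radix-r stage has n // r butterflies."""
    return [(r, n // r) for r in stage_plan(n)]


plan_64 = stage_plan(64)
plan_128 = stage_plan(128)
work_64 = butterflies_per_stage(64)
work_128 = butterflies_per_stage(128)
```

For 64 points the plan is three radix-4 stages of 16 butterflies each; for 128 points it is one radix-2 stage of 64 butterflies followed by three radix-4 stages of 32 butterflies each.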
III. THE ARCHITECTURE OF RASIP, INTEGRATION AND TESTING
We used Synopsys Processor Designer with LISA (Language for Instruction Set Architectures) to integrate the AVATAR-generated accelerator with a RISC processor core [21]. A template RISC processor model in LISA is used as the starting point for the integration. The RISC processor has five pipeline stages (instruction-fetch, instruction-decode, execute, memory-access and write-back) plus the respective pipeline registers. The instruction set of the RISC processor is extended with special instructions to operate the AVATAR-generated accelerator. The design of the rASIP was complicated from an integration point of view and required three different phases, which are explained as follows.
A. rASIP Architecture
The RISC template in the LISA tool is a 32-bit architecture, but the instructions that belong directly to the RISC are 30 bits wide, so the two most significant bits can be used for adding special instructions to control the operations on AVATAR. The register file contains 16 32-bit general-purpose registers, so register-file indices are 4 bits wide. The immediate values are 16 bits wide in regular RISC instructions. Fig. 2 shows the 32-bit instruction set and the extensions we have introduced.
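To make the bit budget concrete, the following Python sketch extracts bit fields from a 32-bit instruction word and tests the two most significant bits. It is only an illustration: the helper names are hypothetical, the assumption that regular RISC instructions leave the two most significant bits at 00 is ours, and the field positions in the example word are invented for demonstration and are not taken from Fig. 2.

```python
WORD_BITS = 32


def field(word, hi, lo):
    """Return bits hi..lo (inclusive, bit 0 is the LSB) of a 32-bit word."""
    if not (0 <= lo <= hi < WORD_BITS):
        raise ValueError("bit range out of bounds")
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def extension_bits(word):
    """Return the two most significant bits, which the text reserves for extensions."""
    return field(word, 31, 30)


def is_cgra_instruction(word):
    # Assumption: regular 30-bit RISC instructions leave bits 31..30 at 00.
    return extension_bits(word) != 0


# Hypothetical word: extension bits 10, a 4-bit register index 5 in bits 19..16
# and a 16-bit immediate 0x1234 in bits 15..0 (positions assumed, not from Fig. 2).
word = (0b10 << 30) | (0x5 << 16) | 0x1234
reg_index = field(word, 19, 16)
immediate = field(word, 15, 0)
```

A 4-bit index is sufficient because 2^4 = 16 matches the number of general-purpose registers.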
As described in Section II, three basic steps need to be followed to operate an AVATAR-generated accelerator: 1) load the configuration stream, 2) load the data to be processed, and 3) write the control registers.
The configuration stream and the data to be processed reside in the memory of the rASIP. The configuration stream may consist of many configuration arrays in the main memory. The total number of configuration arrays and their sizes depend on the design of the context(s) created at design time to build the accelerator from the AVATAR template. All of these configuration arrays have to be loaded one by one into the configuration memories of the accelerator. To address a configuration array, we need its base address and its size. To store the configuration array, we also need the base address of the configuration memories and the offset to be added to it. The same parameters are required to store the configuration words in the configuration memory of the I/O-buffers described in Section II. To load the data to be processed from the main memory and store it in one of the local memories of the accelerator, we need the base address plus the offset of the respective memories. For this purpose, we employ two adders to calculate the addresses of the main memory and of one of the local memories of the accelerator. The two adders calculate the addresses in the execute pipeline stage of the RISC processor. Each adder increments its address iteratively by an offset, starting from the base address. The iteration stops when all the configuration or data words of an array stored in the main memory have been loaded. To count the iterations, a counter in the memory-access stage counts down from the size (total number of words) of the configuration or data array. Using these two adders and the counter, we complete the load from the main memory and the store into one of the local memories of the accelerator at a cost of only one clock cycle per word. The instruction cgra_load_counter_reg, which loads the counter register with the size of the configuration or data array, and its instruction coding are shown in Fig. 2. To load and store the configuration, data and context words, we added the special instructions cgra_load_word and cgra_load_context; their coding is also shown in Fig. 2. The two-bit mem_sel field enables either the configuration memory or one of the two local memories for write operations. If mem_sel = "00", the data can be read from the second local memory, provided that the accelerator has completed the data processing. The one-bit post_inc field is used as an enable signal for the register in the counter. The remaining fields follow from the above discussion. The context words are stored at specific addresses of the data memory, so we use the RISC load-word instruction to load these context words from those addresses and write them into the register file of the RISC processor. To carry out this operation, the cgra_load_context instruction contains two 4-bit fields for register indexing. The first field indexes the register that holds the main-memory address, and the second indexes the register that receives the context word loaded from that address. The main-memory address from which the context word is fetched has to be transferred to the accelerator, as there are two control registers that have to be written one at a time. This address is provided to the address decoder inside the accelerator, which decides which of the control registers is written with a given context word.
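The load mechanism described above can be modeled behaviorally as follows. This is a minimal Python sketch, not the LISA or VHDL implementation: the function name `cgra_block_load`, its parameters and the use of Python lists as memories are assumptions made for illustration, and the offset is assumed to be a constant stride applied by both adders. The selection made in hardware by the memory-select field is represented here simply by which list is passed as the target.

```python
def cgra_block_load(main_mem, src_base, target_mem, dst_base, size, offset=1):
    """Copy `size` words from main_mem to target_mem, one word per modeled cycle.

    Two address 'adders' advance the source and destination addresses by
    `offset`, and a down-counter initialized to `size` ends the transfer.
    Returns the number of modeled clock cycles.
    """
    src = src_base          # adder 1: main-memory address
    dst = dst_base          # adder 2: accelerator-memory address
    counter = size          # down-counter loaded with the array size
    cycles = 0
    while counter > 0:
        target_mem[dst] = main_mem[src]
        src += offset
        dst += offset
        counter -= 1
        cycles += 1
    return cycles


main_memory = list(range(100, 132))   # 32 words of placeholder data
local_memory = [0] * 16               # stands in for one accelerator memory
cycles = cgra_block_load(main_memory, 8, local_memory, 4, 4)
```

In this model a four-word array starting at main-memory address 8 lands at addresses 4 to 7 of the target memory in four modeled cycles, matching the one-cycle-per-word cost stated above.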
B. Integration and Testing
The integration between the RISC processor and the AVATAR template is supported by an interface register, as shown in Fig. 3. The upper part of the figure shows the RISC processor, while the lower part shows the AVATAR template. The interface register has the following fields, which can be updated as required by the RISC processor.
• The 32-bit address and data bus is provided to multiple blocks of the AVATAR-generated accelerator, and the data is loaded into a specific block when the relevant enable signal is high. Whether a configuration word is stored in the PEs' configuration memory or in the I/O-buffer configuration memory depends on the address field. The processed data can be read from the second local memory of the AVATAR-generated accelerator; in that case, the lm1[0], lm2[0] and config_en[0] signals are all in the low state.
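The role of the enable signals can be sketched as a small decision function. This Python model is an assumption-laden illustration: the function name `interface_target` and the returned labels are hypothetical, and the requirement that at most one enable is high at a time is our assumption, since the text does not say what happens otherwise.

```python
def interface_target(lm1, lm2, config_en):
    """Map the three enable bits to the action on the shared 32-bit bus."""
    enables = {
        "local_memory_1": lm1,
        "local_memory_2": lm2,
        "config_memory": config_en,
    }
    active = [name for name, bit in enables.items() if bit]
    if not active:
        # All enables low: processed data can be read from the second local memory.
        return "read_local_memory_2"
    if len(active) > 1:
        # Assumption: the enables are used one at a time.
        raise ValueError("more than one enable is high")
    return "write_" + active[0]
```

For example, `interface_target(0, 0, 1)` selects a write to the configuration memory, while `interface_target(0, 0, 0)` corresponds to reading back results.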
Integrating this rASIP was a complicated task, as the system in which AVATAR was previously integrated, shown in Fig. 1, was the result of the effort of many engineers over several years. To extract AVATAR from this system, we needed to be sure that its functional correctness remained unchanged. We wanted the AVATAR-generated mixed-radix(2, 4) FFT accelerator to be operated by a VHDL test-bench before analyzing its functional requirements for integration with the RISC processor. As the accelerator was controlled by a flow written in C and compiled for the COFFEE RISC core, we commented out the C code for radix-4 64-point FFT execution step by step and replaced the commented-out C functions with VHDL test sequences, keeping the C code execution and the VHDL synchronized by introducing fixed-time delay intervals. Using this method, the whole C code for the 64-point FFT was commented out and the accelerator was stimulated entirely by a VHDL test-bench. The compiler and the VHDL code for the rASIP are generated by Processor Designer. Since the AVATAR template is also written in VHDL, we port-mapped the accelerator to the interface register of the RISC processor. When integrating the accelerator with the RISC processor customized using the LISA tool, we commented out the VHDL test-bench code step by step and wrote equivalent micro-coded functions that were called in the C flow to be compiled. After each step, we checked the functional correctness, and the synchronization between the C flow and the VHDL stimulator was re-established using constant delay intervals. In this way, the whole VHDL test-bench was commented out and control of the radix-4 64-point FFT execution was completely transferred to the customized RISC core.
We also mapped the 128-point FFT onto this accelerator. After comprehensive functional testing, we are confident that any other application-specific accelerator can be tailored using the AVATAR template and operated simply by writing the control flow in C for this customized RISC processor.
IV. SYNTHESIS RESULTS AND COMPARISONS
We synthesized the rASIP on two different Stratix-IV FPGA devices to establish a comparison based on resource utilization, execution time, power and energy consumption. As described in Section I, the DMA and the network of switched interconnections were removed, as the configuration and data words can now be written using special instructions. Furthermore, the control unit for the AVATAR-generated accelerator is simpler, as it no longer has to control the functionality of the DMA but only the processing of the accelerator. The resource utilization of the rASIP on the Stratix-IV FPGA device (EP4S100G5H40I1) is shown in Table I in comparison with the system shown in Fig. 1. The system in Fig. 1 was also synthesized on the EP4S100G5H40I1 device in [13]. From the table, we can observe a 25% decrease in the usage of logic registers and a 10% decrease in ALUT consumption when the rASIP is compared with the C/A-based system. This reduction in resource utilization was the target to be achieved, besides studying the advantages of using an rASIP. In both the rASIP and the C/A-based system, most of the resources are consumed by AVATAR, as it is a large 4×16 PE CGRA. If a smaller CGRA, for example a 4×4 PE CGRA, were used in both systems, the relative difference in resource utilization between the two systems would be larger.
The operating frequencies achieved at temperatures of 0 and 100°C on the same FPGA device are shown in Table II. To calculate the execution time, we considered the operating frequency at 0°C (fast timing model), since the fast timing model was also used to calculate the execution time in [13]. We do not see much difference in execution time between the rASIP and the C/A-based system. For example, the rASIP needs 3.15 μs to process a 4×64-point FFT, while the C/A-based system needed 3.07 μs; both satisfy the execution-time constraints of the IEEE-802.11n standard. In addition, the rASIP requires less area than the C/A-based system, which shows the significance of using an rASIP. To satisfy the timing constraints for the 4×128-point FFT, there is a double C/A-based system, which requires twice the resources of a single C/A-based system. The double C/A-based system contains two AVATARs, and such a system cannot be compared with the rASIP, which contains only a single AVATAR. A device like the rASIP cannot be expected to meet the IEEE-802.11n constraint of processing four streams of 128-point FFT within 4.0 μs unless two AVATARs are employed in it.
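The timing comparison above amounts to simple arithmetic, which the following Python sketch spells out using only the figures quoted in the text (a 3.2 μs + 0.8 μs budget, 3.15 μs for the rASIP and 3.07 μs for the C/A-based system). The constant and function names are hypothetical, and times are kept as integer nanoseconds to avoid floating-point rounding.

```python
BUDGET_NS = 3200 + 800   # 3.2 us plus the 0.8 us guard interval
RASIP_NS = 3150          # 4x64-point FFT on the rASIP
CA_NS = 3070             # 4x64-point FFT on the C/A-based system


def slack_ns(exec_ns, budget_ns=BUDGET_NS):
    """Time left in the budget; a negative value means the constraint is missed."""
    return budget_ns - exec_ns


def relative_increase_percent(new_ns, ref_ns):
    """Percentage by which new_ns exceeds ref_ns."""
    return 100.0 * (new_ns - ref_ns) / ref_ns


rasip_slack = slack_ns(RASIP_NS)
ca_slack = slack_ns(CA_NS)
slowdown = relative_increase_percent(RASIP_NS, CA_NS)
```

With these figures the rASIP leaves 0.85 μs of slack and the C/A-based system 0.93 μs, and the rASIP is about 2.6% slower for the 4×64-point case.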
The power consumption of the rASIP was estimated by generating the post-fit gate-level netlist of the rASIP for a Stratix-IV FPGA device (EP4SGX70HF35C2). The netlist was then simulated at 0°C and at an operating frequency of 160 MHz to create the value-change-dump file, which is used to estimate the power consumption. This is the most accurate method for estimating the power consumption, and we achieved a 'HIGH' estimation confidence in the Quartus II tool. The static and dynamic power consumption of the rASIP and the C/A-based system are almost the same, as shown in Table III. The power consumption given in the table for the C/A-based system relates only to the mixed-radix(2, 4) FFT accelerator, its control unit and the DMA. It does not include the power consumption of the COFFEE processor and the network of switched interconnections. The rASIP contains the mixed-radix(2, 4) FFT accelerator generated from AVATAR, a simplified control unit and a simple RISC processor, and consumes almost the same power. The energy consumption of the rASIP for processing 64- and 128-point FFT algorithms is shown in Table IV.
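The text does not state how the energy figures are derived from the power estimates. A common relation, given here as an assumption rather than as the authors' stated method, is that the energy per transform equals the total estimated power multiplied by the execution time. The symbols below are introduced for illustration only.

```latex
E_{\mathrm{FFT}} = \left( P_{\mathrm{static}} + P_{\mathrm{dynamic}} \right) \cdot t_{\mathrm{exec}}
```

With power expressed in mW and execution time in μs, the result is in nJ.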
V. CONCLUSION
In this paper, we have presented a Reconfigurable Application-specific Instruction-set Processor (rASIP) designed by integrating a template-based Coarse-Grain Reconfigurable Array (CGRA) in the datapath of a Reduced Instruction-Set Computing (RISC) processor. The integration was carried out by extending the instruction set of the RISC processor with special instructions that control the cycle-accurate processing of the CGRA-generated accelerator. The accelerator used in this work is a mixed-radix(2, 4) Fast Fourier Transform (FFT) accelerator generated from the CGRA. The rASIP satisfies the partial execution-time constraints of the IEEE-802.11n standard for FFT processing. We observed a substantial reduction in resource utilization for the rASIP, while its power and energy consumption are almost the same as those of a standard processor/coprocessor model.
