Abstract: To meet the increasing demand for large bandwidth and high throughput in modern radar systems, we propose a reconfigurable application specified processor (RASP) tailored to the features of radar digital signal processing applications. RASP is a reconfigurable coprocessor based on hierarchical floating-point operation elements and is capable of executing a set of fundamental subalgorithms. Taking these subalgorithms as the minimal task nodes improves computational efficiency considerably. Experimental results show that the processor outperforms a state-of-the-art TI DSP by 1.05× to 3.22×. RASP can be integrated into customizable radar systems; it was fabricated in a TSMC 40 nm CMOS process and occupies an area of 19.2 mm².
Introduction
Coarse-grained reconfigurable architectures (CGRAs) have come into the spotlight as demand for high-performance applications grows across many fields, owing to their flexibility and performance [1, 2, 3]. CGRAs reduce the overheads of a typical fine-grained field programmable gate array (FPGA) by replacing look-up tables (LUTs) with coarser computational blocks and simplifying the interconnect patterns.
Coarse-grained reconfigurable logic has mainly been proposed for speeding up loops of multimedia and digital signal processing applications in embedded systems [4]. Such architectures consist of processing elements (PEs) with word-level data bit-widths connected by a reconfigurable interconnect network. Their coarse granularity greatly reduces delay, area, power consumption and reconfiguration time relative to an FPGA, at the cost of flexibility. In short, a CGRA not only boosts performance by adopting a coarse-grained array, but can also be reconfigured to adapt to the characteristics of a class of applications.
The commonly used solutions for digital signal processing applications are hardwired application specific integrated circuits (ASICs) and commercial digital signal processors (DSPs). However, an ASIC has limited applicability and therefore incurs a large non-recurring engineering (NRE) cost, while a DSP suffers from poor energy efficiency because of sequential software execution. A CGRA therefore provides an alternative solution that balances flexibility and efficiency.
Streaming applications, cryptography, video decoding and baseband processing are the relatively mature real-life applications implemented on reconfigurable architectures, and many commercial products have been released in recent years, such as ADRES [5] for wireless communication, Awashima [6] for cryptography, and MorphoSys [7] for audio-visual data codecs. These processors share the feature of integer-based operation, so they cannot handle floating-point applications effectively. FloRA [8, 9] added an extra FSM to support floating-point operation, but the improvement is not significant.
Radar applications must meet precision requirements and deal with large volumes of data, for which an integer-based architecture is insufficient. We therefore propose an architecture that uses floating-point algorithm intellectual properties (IPs) as the basic processing elements to construct a reconfigurable processor that performs floating-point operations efficiently. In radar applications, the most frequently used kernel subalgorithms are FFT, FIR, correlation and so on [10]. Based on this characteristic, we decompose a general radar task into a set of subalgorithms (RASP supports 17 frequently used subalgorithms) and take them as the minimal task nodes. This solution provides the desired ASIC-like performance while maintaining a certain degree of programmability.
The proposed processor is validated through real application benchmarks and prototype chip tests. The proposed system shows a performance boost, exceeding the state-of-the-art TI DSP by 1.05× to 3.22×.
The paper is organized as follows. Section 2 presents the base hardware architecture and the workflow of RASP. Section 3 explains the implementation of algorithms on RASP, and performance analysis is carried out in Section 4. Section 5 presents the characteristics of the chip. Finally, Section 6 concludes the paper with some future work.
Architecture
Overall hardware architecture
RASP is a digital signal processing architecture based on a hierarchical array of coarse-grained computing elements called Reconfigurable Processing Elements (RPEs). As shown in Fig. 1, it consists of a bus interface, a Direct Memory Access (DMA) engine, a main controller (MC), a reconfigure controller (RC), a data memory and an RPE array. The bus interface connects RASP to a general-purpose processor and performs data exchange with external memory. In this design, all these components are coupled through high-bandwidth AXI4 buses. The main controller manages task allocation and synchronization in the system; the DMA transports data between the external memory and the local data memory; the reconfigure controller selects and organizes processing elements to complete a given task; the data memory provides data storage; and the RPE array performs the regular computing.
The RPE array is the core part that takes charge of all computing tasks in the system, as shown in Fig. 1. Each RPE is a cluster of algorithm IPs that can be configured to perform dedicated operations. Different RPEs have different computational resources; the details are listed in Table I. To reach a high operating frequency, all the IPs are pipelined and have an intrinsic fixed delay of 4 clock cycles. The multiplexers in an RPE select the input operands according to the configure port. Operation codes (OPCODEs) define the operations in the RPE and the interconnections among them.
The RPE interconnection network is fully meshed, but the routing is relatively static: the communication between RPEs is determined during the configuration phase and cannot be changed at runtime. Through the interconnection network, two or four RPEs can be combined to perform more complex operations, such as the butterfly unit used in the FFT.
The data memory stores source data and intermediate results. It has a capacity of 2 MB and is divided into 32 banks to meet the requirements for high bandwidth and parallelism.
Main controller and processor workflow
This section introduces the detailed workflow and the delivery of configuration information.
RASP needs to be booted by an external host processor, which can be a RISC CPU. An Application Programming Interface (API) function library is provided for the host processor; when the host processor executes the API functions, the configuration information is generated. The entire configuration process is depicted in Fig. 2. The input application is described in a high-level language. First, the code is written manually using calls to the API functions. It is then compiled by the RISC processor, generating bit streams that are written to external memory. These bit streams are the initial configuration information that will be sent to RASP.
RASP can receive this initial configuration information directly from the host processor or actively fetch it from external memory; the MC accepts and translates it. The translation results determine which subalgorithm is to be executed, what size the subalgorithm is, and so on. The RC acts as a secondary decoder in the system and contains the subalgorithm controllers. When a particular subalgorithm is chosen, its subalgorithm controller drives the configure ports of the RPEs directly, and the RC organizes the interconnection between the RPEs at the same time.
An example of the configuration context transition of RPE1 is shown in Fig. 3. Fig. 3(a) shows the configuration information produced by API compilation and stored in external memory, and Fig. 3(b) shows the detailed configuration context structure. The contexts are sent to the RPE configure port by the reconfigure controller and directly determine the RPE function. The connection network between RPEs is also managed by the reconfigure controller.
Emerging configuration techniques such as time-multiplexing and runtime parallelism [11] can reduce configuration time, but they also significantly increase the configuration memory requirements. Our proposed multistage configuration procedure greatly reduces the usage of configuration memory, at the cost of some hardware overhead in the reconfigure controller.
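The two decode stages described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the 32-bit word layout, the opcode table, the context fields and the default of four RPEs are all hypothetical and are not taken from the RASP design. The point it tries to show is that external memory only needs to hold one compact word per task, while the expansion into per-RPE contexts happens inside the controllers.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical opcode table: the real encoding is not given in the text.
SUBALGORITHMS: Dict[int, str] = {0: "FFT", 1: "FIR", 2: "CORRELATION"}


@dataclass(frozen=True)
class Task:
    """Result of the first decode stage (main controller)."""
    algorithm: str
    size: int


def encode_task(algorithm_id: int, size: int) -> int:
    """Pack a task into one assumed 32-bit word: [31:24] id, [23:0] size."""
    return ((algorithm_id & 0xFF) << 24) | (size & 0xFFFFFF)


def main_controller_decode(word: int) -> Task:
    """Stage 1: translate the compact word into subalgorithm and size."""
    return Task(SUBALGORITHMS[(word >> 24) & 0xFF], word & 0xFFFFFF)


def reconfigure_controller_expand(task: Task, num_rpes: int = 4) -> List[dict]:
    """Stage 2: a subalgorithm controller expands the task into one context
    per RPE. The context fields here are placeholders."""
    return [
        {"rpe": index, "function": task.algorithm, "size": task.size}
        for index in range(num_rpes)
    ]
```

In this toy model the stored configuration is a single integer per task, whereas the expanded contexts grow with the number of RPEs, which mirrors the trade of configuration memory for controller hardware.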
Implementation of algorithm
This section discusses the execution of a single subalgorithm and of combinational algorithms. The most commonly used function in radar digital signal processing is the Fourier transform, so we use it to explain how a computing kernel is mapped onto the RPEs. We then briefly describe how complex combinational algorithms are implemented in RASP, taking a frequently used real-life radar application as the example.
Single algorithm implementation
The fast Fourier transform (FFT) forms the basis of many radar signal processing algorithms. The basic computational element of the FFT is the butterfly unit; the structure shown in Fig. 4(a) is referred to as a radix-2 butterfly. Higher-radix butterflies provide some computational savings. RASP provides flexible radix-2/4/8 butterflies, and mixed-radix FFT is also supported.
During execution, the internal structure of the RPEs is reorganized according to the configuration contexts. These contexts determine which IPs are active and how they are connected. The input data of each RPE are supplied by an additional address generator module. An example of the IP connections and data distribution in RPE1 when forming a radix-2 structure is shown in Fig. 4(b).
During the implementation of the FFT, the number of direct data flows exceeds the number of computational resources on the chip, so these flows need to be scheduled by time sharing. For example, it takes 8 cycles to complete a radix-8 butterfly unit under the control of the FFT finite-state machine (FSM).
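As a software reference for the radix-2 structure discussed above, the following Python sketch implements a butterfly and an iterative decimation-in-time FFT built from it. It is a plain textbook formulation, not the RPE mapping or the FSM schedule used in RASP; the function names are illustrative. The butterfly counter shows where the (N/2) log2 N complex multiplications of a radix-2 FFT come from.

```python
import cmath


def butterfly(a, b, w):
    """Radix-2 butterfly: one complex multiply and two complex additions."""
    t = w * b
    return a + t, a - t


def fft_radix2(x):
    """Iterative decimation-in-time FFT; len(x) must be a power of two.
    Returns the spectrum and the number of butterflies executed."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = list(x)
    # Bit-reversal permutation of the input order.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            data[i], data[j] = data[j], data[i]
    # log2(n) stages, each with n/2 butterflies.
    butterflies = 0
    span = 2
    while span <= n:
        half = span // 2
        w_step = cmath.exp(-2j * cmath.pi / span)
        for start in range(0, n, span):
            w = 1 + 0j
            for k in range(half):
                lo, hi = start + k, start + k + half
                data[lo], data[hi] = butterfly(data[lo], data[hi], w)
                w *= w_step
                butterflies += 1
        span *= 2
    return data, butterflies
```

For an 8-point input the function executes 12 butterflies, that is (8/2) × 3, and each butterfly is an independent unit of work that a hardware scheduler could time-share across the available multipliers and adders.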
Ping-pong operation for large size processing
As mentioned above, RASP has a local memory of 2 MB, which can store 256K floating-point numbers in total. The storage capacity limits the calculation size. We therefore divide the RAM into 32 banks, organized as two groups of 16 that operate in a double-buffering manner: one group is used for calculation while the other exchanges data with external memory. The schematic diagram is shown in Fig. 5.
In most cases, data transportation between external memory and local memory occupies a large proportion of the whole execution process. When the number of data transportation cycles is comparable to the number of processing cycles, the vast majority of data transportation can be hidden through ping-pong operation, which improves resource utilization.
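The benefit of ping-pong operation can be estimated with a simple cycle model. The Python sketch below is an idealized model, not a description of the RASP DMA: it assumes that the data are processed in equal blocks, that a single transfer channel stores the previous block's results and then loads the next block, and that this transfer overlaps completely with the computation on the other bank group. The cycle counts used in the usage note are made-up illustration values.

```python
def serial_cycles(blocks, load, compute, store):
    """No overlap: every block is loaded, processed and stored in turn."""
    return blocks * (load + compute + store)


def ping_pong_cycles(blocks, load, compute, store):
    """Double buffering: while one bank group computes block i, the other
    group stores the results of block i-1 and loads block i+1."""
    if blocks <= 0:
        return 0
    total = load  # the first load cannot be hidden
    for i in range(blocks):
        transfer = 0
        if i > 0:
            transfer += store  # write back the previous block
        if i + 1 < blocks:
            transfer += load  # prefetch the next block
        total += max(compute, transfer)
    total += store  # the last store cannot be hidden
    return total
```

With 4 blocks, 100 cycles per load, 250 per computation and 100 per store, the serial schedule needs 1800 cycles while the ping-pong schedule needs 1200, so only the first load and the last store remain exposed. When the transfers take longer than the computation, the model becomes transfer-bound and the computation is hidden instead.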
Combinational algorithm management
The strength of RASP originates from combining subalgorithms with a highly efficient management mechanism and a simple compilation method.
A complete radar task usually contains more than one subalgorithm. For instance, digital pulse compression by fast convolution includes FFT, point-by-point multiplication and IFFT, while moving target indication (MTI) includes matrix transposition, dot product and FIR. As mentioned above, the most computation-intensive and commonly used subalgorithms in radar systems are covered in RASP as subtasks.
We take pulse compression as an example to show how RASP implements a complex task. Pulse compression is a signal processing technique commonly used in radar to increase the range resolution as well as the signal-to-noise ratio. Compared with time-domain convolution, a more computationally efficient approach is complex multiplication in the frequency domain; the procedure is shown in Fig. 6. As the diagram shows, the task is partitioned into a sequential process containing three subalgorithms: FFT, multiplication and IFFT. The three subtasks are called by the user compiler, and the framework of the function construction is shown below. The code is a framework of the application preparation process, in which three segments represent the three subalgorithms respectively. The subalgorithm functions are predefined in the API library. Their parameters carry basic information such as algorithm type, processing points, source data addresses and destination data addresses. The function library thus makes programming considerably more convenient.
Unlike DSP instructions, which deal with common mathematical operations such as multiply-accumulates (MACs), one piece of RASP configuration corresponds to one function that takes charge of a macro-task. Although DSP instruction sets are optimized for digital signal processing, the coarser configuration in RASP is backed by tailor-made hardware acceleration. How the FFT and matrix inversion modules exploit the reconfigurable computing resources is described in [12] and [13]; the other kernels go through similar optimization.
By constructing such functions, a complex task can be accomplished with high efficiency, owing to the ease of task decomposition and modularization.
Real-life radar task: digital pulse compression
    int DoDpcByRasp(raspid, resptype, fftlen, srcaddr, dstaddr, tmpaddr, lfmcoefaddr, fftcoefaddr)  // define the parameters used in DPC
    int xPhyRaspDesc; int xVirRaspDesc;
    xVirRaspDesc = xPhyRaspDesc - 0xa0000000;
    xPhyRaspDesc = xPhyRaspDesc + insindex*14*4*3;
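The three-subtask flow of pulse compression (FFT, point-by-point multiplication, IFFT) can be sketched in plain Python as a functional reference. This is not the RASP API: the function names, the recursive FFT and the test chirp are illustrative assumptions, and the conjugated reference spectrum only plays a role similar to what a precomputed coefficient table might hold. The sketch shows that multiplying the echo spectrum by the conjugate spectrum of the transmitted pulse and transforming back yields a sharp peak at the echo delay.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(x):
    """Inverse FFT with the 1/N scaling applied."""
    n = len(x)
    return [v / n for v in fft(x, inverse=True)]


def lfm_chirp(length, bandwidth_fraction=0.5):
    """Baseband linear-FM pulse with unit amplitude (illustrative)."""
    return [
        cmath.exp(1j * cmath.pi * bandwidth_fraction * (t - length / 2) ** 2 / length)
        for t in range(length)
    ]


def pulse_compress(echo, reference):
    """Frequency-domain matched filtering in three subtasks."""
    n = len(echo)
    if n == 0 or n & (n - 1) or len(reference) > n:
        raise ValueError("echo length must be a power of two >= reference length")
    padded = list(reference) + [0j] * (n - len(reference))
    coef = [v.conjugate() for v in fft(padded)]  # matched-filter coefficients
    spectrum = fft(echo)                          # subtask 1: FFT
    product = [s * c for s, c in zip(spectrum, coef)]  # subtask 2: multiply
    return ifft(product)                          # subtask 3: IFFT
```

For example, placing a 16-sample chirp at sample 20 of a 64-sample echo and calling `pulse_compress(echo, chirp)` concentrates the pulse energy at index 20, with a peak magnitude equal to the pulse length.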
Experiment and comparison
In this section, we assess the performance of our processor. We first measure the configuration and computation efficiency of our method. We then implement some classic subalgorithms on RASP, analyze the results, and compare them with a state-of-the-art DSP.
Analysis of configuration and calculation efficiency
The total execution cycles of a task comprise three parts: configuration, calculation and data transportation. We choose the FFT as an example to analyze the efficiency of configuration and calculation; the cycles of the different phases are shown in Fig. 7. The FFT size under analysis varies from 32 points to 1M points.
Since each piece of configuration information stored in external memory has a fixed length, the number of configuration cycles stays at a relatively stable value of about 1200, as shown in Fig. 7. It is independent of the operation size and is negligible for large-point operations. When the total number of processing points is relatively small, data transportation dominates the whole process, whereas for large numbers of processing points the proportion can be reduced significantly by adopting ping-pong operation.
The 256K-point FFT, for which the data fill the entire local memory, has the highest computing efficiency. According to the measurement results, it provides a real-time FFT throughput of 97.55 Mpps and a data transmission bandwidth of 1.89 GB/s at 500 MHz.
The utilization rate of the resources in RASP is evaluated for both the calculation phase and the entire execution phase, where the computation amount of a radix-2 FFT is $\frac{N}{2}\log_2 N$ complex multiplications and $N\log_2 N$ complex additions.
Substituting the experimental values N = 256K, calculation cycles = 787212 and execution cycles = 1343733 gives utilization rates of 74.9% and 43.9%, respectively. The utilization rate is the average ratio of the number of active RPEs to the total number of RPEs within the execution cycles.
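The reported figures can be cross-checked with a few lines of arithmetic. The Python sketch below assumes that utilization is the number of complex multiplications of a radix-2 FFT divided by the number of multiplier-cycles available, and that four complex multipliers work in parallel. The value four is not stated in this text; it is an assumption chosen only because it reproduces the reported percentages. The throughput line simply divides the FFT size by the execution cycles at the 500 MHz clock.

```python
def fft_complex_multiplies(n):
    """(N/2) * log2(N) complex multiplications for a power-of-two N."""
    return (n // 2) * (n.bit_length() - 1)


def utilization(n, cycles, complex_multipliers):
    """Useful multiplications divided by available multiplier-cycles."""
    return fft_complex_multiplies(n) / (complex_multipliers * cycles)


N = 256 * 1024
CALCULATION_CYCLES = 787212
EXECUTION_CYCLES = 1343733
CLOCK_HZ = 500e6
ASSUMED_COMPLEX_MULTIPLIERS = 4  # assumption, not given in the text

u_calc = utilization(N, CALCULATION_CYCLES, ASSUMED_COMPLEX_MULTIPLIERS)
u_exec = utilization(N, EXECUTION_CYCLES, ASSUMED_COMPLEX_MULTIPLIERS)
throughput_mpps = N / EXECUTION_CYCLES * CLOCK_HZ / 1e6

print(f"calculation phase: {u_calc:.1%}")
print(f"execution phase:   {u_exec:.1%}")
print(f"throughput:        {throughput_mpps:.2f} Mpps")
```

Under these assumptions the two utilization values round to 74.9% and 43.9%, and the throughput comes out at roughly 97.5 Mpps, close to the measured 97.55 Mpps. The ratio of the two utilization rates equals the ratio of calculation cycles to execution cycles, which reflects the share of execution time not spent on calculation.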
Results comparison
We chose the TI TMS320C6672 DSP [14] for comparison because it has computing resources similar to those of RASP. The TMS320C6672 is a multicore fixed- and floating-point DSP optimized for radar and electronic warfare applications. It consists of two C66x DSP cores, and its total number of floating-point multipliers is the same as that of RASP. The main device characteristics are compared in Table II. The operating frequency of RASP reaches 500 MHz in the first-edition prototype chip, and can reach 1 GHz in the second edition through an updated memory IP and code optimization.
We selected three typical kernels that are available in commercial libraries, with benchmarks covering classical sizes. Table III [15] shows that for FIR and matrix multiplication, RASP achieves an improvement of 5% to 183% over the TMS320C6672. For the FFT, which went through deep optimization in our design phase, RASP achieves a higher speedup ratio (up to 3.22×) as the number of processing points increases.
Development environment and prototype chip
Our design was described in the Verilog hardware description language, developed on the Synopsys VCS platform, and implemented in TSMC 40 nm CMOS technology. RASP occupies a chip area of 19.2 mm², and the clock frequency of the chip is 500 MHz. The hardware layout is shown in Fig. 8. The SoC contains two DSP cores and one RASP core, with RASP acting as a coprocessor in the SoC.
Conclusion
We have introduced a reconfigurable architecture based on floating-point units and designed specifically for radar signal processing. Taking subalgorithms as the minimum tasks simplifies the configuration process and improves configuration efficiency, and the proposed multistage configuration procedure improves computation efficiency and reduces programming complexity. Ping-pong operation applied to large-size processing hides the vast majority of data transportation cycles. We analyzed the configuration and computing efficiency, and compared the chip implementation of RASP with a state-of-the-art TI DSP running the same tasks.
As future work, a larger RPE array with customizable capabilities should be used in our system, and the corresponding configuration method needs to be improved for higher flexibility and extensibility.
