Direct memory access (DMA) is widely used to transfer data between different kinds of memory-mapped slave endpoints, and a DMA controller is a common peripheral component of a digital signal processor (DSP). To make full use of the time spent on reading and writing, this paper proposes an intelligent DMA. Compared with a traditional DMA, a newly designed operational unit is added to the RAM-based architecture, and the parameter RAM (PRAM) is modified to suit the whole DMA controller. The design supports floating-point computation, and its interface is based on AXI3, a protocol of the AMBA bus architecture. An FFT algorithm is used to evaluate the proposed design.
Introduction
With the development of integrated circuits and advances in semiconductor processes, a variety of bus protocols have been proposed, among which the AMBA bus architecture stands out for its low power consumption and compatibility. DMA was introduced to meet the growing demands on the amount of data and the speed of processing it. It offloads work from the CPU and improves the efficiency of the whole system.
DMA must be continually optimized to support DSP workloads such as image processing, target tracking, radar signal analysis and scientific computing [1] [2]. The TMS320 C2000, C5000 and C6000 are three DSP product families from Texas Instruments (TI), mainly used in digital control, communication and portable applications, and audio and video technology, respectively [3] [4] [5]. Their DMAs are optimized to support these workloads. However, these DMAs have no operational units. Efficiency could be greatly improved if data were pre-processed before arriving at the CPU core.
The traditional DMA. The DMA consists of two blocks: the channel controller and the transfer controller. The main blocks of the channel controller are the DMA/QDMA channel logic, the parameter RAM (PRAM), the event queues, the transfer request submission logic, the completion detection logic and the registers [6]. The main blocks of the transfer controller are the DMA program register set, the DMA source active register set, the read controller, the destination FIFO register set, the write controller, the data FIFO and the completion interface. This paper designs a new PRAM format in the channel controller and an operational unit that supports floating-point computation.
Design of the Intelligent DMA
The PRAM. The DMA controller is a RAM-based architecture. As the traditional PRAM in Figure 1 shows, the PRAM contains transfer parameters such as the source address, destination address, transfer counts, indexes and options [7]. Each PRAM set includes eight 4-byte entries, and the first two bytes of the last 4-byte entry are reserved. This reserved space allows the PRAM set to be extended according to different needs. Table 1 and Figure 2 illustrate the 16-bit OP-CODE. The Op field gives the operational mode, the Times field gives the count when the DSP wants to execute an accumulation, and the Addr field gives the address at which the operand is placed in the data FIFO.
The DSTREGDEPTH parameter determines the number of entries in the destination (DST) FIFO register set, which in turn determines the amount of transfer request (TR) pipelining possible for a given transfer controller (TC). This means that the read controller can go ahead and issue read commands for subsequent TRs while the DST FIFO register set manages the write commands and data for the previous TR. If DSTREGDEPTH is set to n, the read controller can process up to n TRs ahead of the write controller [7].
The operational unit consists of two parts: floating-point pre-computation and fixed-point pre-computation. The fixed-point part is mainly aimed at radar signal processing. Radar signals are represented by grey values, so all the adders and multipliers in the fixed-point part are eight-bit [8]. The floating-point part consists of adders and multipliers with 32-bit inputs. In total, the operational unit integrates sixteen fixed-point adders, sixteen fixed-point multipliers, seven floating-point adders and four floating-point multipliers. The floating-point adders and multipliers are each pipelined over 3 cycles. As shown in Figure 4, the floating-point pre-computation part supports accumulation of up to 4 operands, so such an accumulation must go through another two add operations in the accumulator. However, some fixed-point operations, excluding accumulation and power sum, can be completed in one cycle.
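To make the role of the 16-bit OP-CODE more concrete, the following is a minimal sketch of how its three fields (Op, Times, Addr) could be packed into and recovered from the reserved half-word of a PRAM entry. The text names the fields but not their widths or bit positions, so the 4/4/8 split, the field order, the function names and the example mode value are all assumptions made purely for illustration.

```python
# Hypothetical field widths: the paper names the fields (Op, Times, Addr)
# but not their sizes. These are assumptions chosen only to total 16 bits.
OP_BITS, TIMES_BITS, ADDR_BITS = 4, 4, 8


def pack_opcode(op, times, addr):
    """Pack the three fields into one 16-bit word, Op in the top bits (assumed order)."""
    fields = (("op", op, OP_BITS), ("times", times, TIMES_BITS), ("addr", addr, ADDR_BITS))
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return (op << (TIMES_BITS + ADDR_BITS)) | (times << ADDR_BITS) | addr


def unpack_opcode(word):
    """Recover (op, times, addr) from a 16-bit OP-CODE word."""
    addr = word & ((1 << ADDR_BITS) - 1)
    times = (word >> ADDR_BITS) & ((1 << TIMES_BITS) - 1)
    op = (word >> (TIMES_BITS + ADDR_BITS)) & ((1 << OP_BITS) - 1)
    return op, times, addr


# Example: a made-up mode value 0x2, an accumulation count of 4,
# and an operand at data FIFO address 0x10.
word = pack_opcode(op=0x2, times=4, addr=0x10)
fields = unpack_opcode(word)
```

In this sketch the CPU would write `word` into the reserved half-word when it programs a PRAM set, and the operational unit would decode it with the equivalent of `unpack_opcode` before processing the data held in the FIFO.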
Experiment and result
To evaluate area and power, the proposed DMA design was modeled in Verilog HDL and, after full functional verification, synthesized in a TSMC 65 nm CMOS technology.
FFT Evaluation. The fast Fourier transform (FFT) is a highly efficient algorithm for computing the discrete Fourier transform. It improves on direct evaluation by exploiting characteristics of the Fourier transform such as parity and the symmetry of the real and imaginary parts.
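As a point of reference for the FFT used in the evaluation, the sketch below shows a textbook recursive radix-2 decimation-in-time FFT next to a direct discrete Fourier transform. It is a generic software illustration, not the FFT processor or the DMA data path of this paper, and the function names `dft` and `fft_radix2` are hypothetical.

```python
import cmath


def dft(x):
    """Direct O(n^2) discrete Fourier transform, used here as a reference."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])  # transform of the even-indexed samples
    odd = fft_radix2(x[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: one twiddle-factor multiply, then one add and one subtract.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


samples = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
max_error = max(abs(a - b) for a, b in zip(fft_radix2(samples), dft(samples)))
```

The multiply and add steps inside the butterfly loop are the kind of simple arithmetic that a pre-computing DMA could take over from the processor.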
As shown in Figure 5, every step of the butterfly operation requires an accumulation of four numbers. The proposed DMA can complete this accumulation if the PRAM is set appropriately. The CPU in Figure 6 is an FFT processor we designed previously. All the data is stored in SRAM, which is the memory in Figure 6. The results are shown in Table 2.

Table 2. Experiment results

Radix   256 points              512 points              1024 points
        traditional  improved   traditional  improved   traditional  improved
2       768          720        1536         1460       3072         2919
4       896          784        1792         1613       3584         3226

According to Table 2, efficiency is improved by using the proposed DMA. Our previously designed DSP has also been used to evaluate the DMA system. Benchmarks such as FIR, matrix and convolution have been run on the system. The results show that the DSP is more efficient when using the proposed DMA system.
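The size of the improvement in Table 2 can be made explicit by computing the relative reduction of each improved figure against its traditional counterpart. The short script below does this using only the values in the table. The table does not state a unit, so the script treats the entries as a cost in which lower is better, which is an assumption, and the names `TABLE_2` and `reduction_percent` are introduced here for illustration only.

```python
# Values copied from Table 2: {radix: {points: (traditional, improved)}}.
TABLE_2 = {
    2: {256: (768, 720), 512: (1536, 1460), 1024: (3072, 2919)},
    4: {256: (896, 784), 512: (1792, 1613), 1024: (3584, 3226)},
}


def reduction_percent(traditional, improved):
    """Relative reduction of the improved figure, in percent of the traditional one."""
    return 100.0 * (traditional - improved) / traditional


reductions = {
    (radix, points): reduction_percent(*pair)
    for radix, row in TABLE_2.items()
    for points, pair in row.items()
}

for (radix, points), pct in sorted(reductions.items()):
    print(f"radix-{radix}, {points} points: {pct:.2f}% lower")
```

Under this reading, the reduction is roughly 5% to 6% for radix 2 and roughly 10% to 12.5% for radix 4.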
Conclusions
This paper presents an intelligent DMA system design with a data pre-processing function, together with a PRAM that is modified to support the pre-processing. The design supports both floating-point and fixed-point operations, which gives the CPU more flexibility when processing data. Compared with a traditional DMA, the proposed design relieves the CPU of simple calculations and increases the efficiency of the whole operation. Experimental results show that the design can run at 800 MHz, and that the area and power of the additional arithmetic units for data pre-processing are 0.45 mm² and 20 mW, respectively.
