Abstract - This paper presents a processor architecture for Fast Fourier Transform computation of real-valued signals for on-chip analog-to-digital converter test and evaluation. The design implements a radix-2 technique optimized for low area overhead and easy integration into systems-on-chip. The hardware logic supports variable transform lengths and accurate parameter extraction. The processor has been validated on 0.18µm CMOS silicon and applied to a data converter test application for extraction of the dynamic parameters SINAD, SFDR and THD. The architecture is suitable for safety-critical applications where spectral integrity checks of the converter signal path can be run at start-up or during interval down times.
I. INTRODUCTION
Real-value Fast Fourier Transforms (RFFTs) find use in many application-specific ICs, such as biomedical and industrial system designs [1], whereby analog signals are converted and processed in the digital domain via an Analog to Digital Converter (ADC) component. These front-end physical signals are mostly real and, after FFT transformation, exhibit conjugate symmetry, which gives rise to redundancies that can be exploited in logic design [2]. ADC Built-In Self-Test (BIST) is an application area where RFFTs can be used to monitor dynamic spectral performance in-situ on-chip. The purpose of this paper is to explore a programmable architecture that trades off speed and area for minimal complexity, enabling a versatile design for mixed-signal Systems-on-Chip (SoCs) capable of accurate ADC dynamic test calculations. The architecture is well suited to moderate-speed, low-area and low-power on-chip applications. SoC data converter test techniques typically use DSP-based testing of analog and mixed-signal circuits [3], whereby an on-chip processor is used to compute the converter's static and dynamic performance parameters. The ADC spectral performance is evaluated by performing an FFT on the sampled results of a coherent sinewave applied to the input of the ADC. Other ADC BIST techniques also exist which rely on signal processing calculations that provide static and dynamic assessment of the converter device under test [4, 5].
In this paper, an in-place memory-based CPU architecture is proposed that computes the RFFT based on a radix-2 decimation-in-time algorithm. The architecture avoids pipelining and cache memory requirements, featuring application-specific operations that help with the FFT and ADC test execution. Complex datapath operations are also avoided, while simple conflict-free memory addressing techniques are employed to reduce logic overhead. The key contribution is an in-place programmable CPU with FFT support that operates on real-valued data and uses less logic. The solution enables accurate ADC dynamic parameter extraction and spectral analysis for in-situ chip applications requiring high reliability. The RFFT operations and architectural aspects are presented in Section II. Section III covers the processor implementation, and the test chip design results are discussed in Section IV. Finally, conclusions are drawn in Section V.
II. RFFT OPERATION AND ARCHITECTURAL ASPECTS
The typical ADC test flow for spectral test and evaluation involves sinewave application/storage, time-to-frequency domain conversion and parameter extraction. A simple but inefficient way to transform a time-discrete periodic signal from the time domain to the frequency domain is the Discrete Fourier Transform (DFT); however, an N-point DFT requires $N^2$ complex multiplications. A significant reduction in computation time and resources can be achieved by employing a Fast Fourier Transform (FFT). The FFT decomposes the process into successively smaller DFT computations, with an N-point FFT reduced to multiple 2-point DFTs (radix-2 technique). The number of complex multiplications is significantly lowered, to the order of $N \log_2 N$ operations.
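The scale of this reduction can be illustrated with a short calculation. The following is a minimal Python sketch, not part of the processor code; the function names are hypothetical, and the FFT count uses the order-of-magnitude figure $N \log_2 N$ quoted above rather than an exact multiplier count.

```python
def dft_complex_mults(n):
    """Complex multiplications for a direct N-point DFT: N^2."""
    return n * n


def fft_complex_mults(n):
    """Order-of-magnitude count N*log2(N) for a radix-2 FFT (illustrative)."""
    stages = n.bit_length() - 1  # log2(n) when n is a power of two
    if n < 2 or (1 << stages) != n:
        raise ValueError("radix-2 requires a power-of-two length")
    return n * stages


if __name__ == "__main__":
    for n in (2048, 8192, 32768):
        direct = dft_complex_mults(n)
        fast = fft_complex_mults(n)
        print(f"N={n}: DFT {direct}, FFT about {fast}, ratio {direct / fast:.0f}")
```

For the record sizes of interest here (2K to 32K points), the direct DFT is several hundred to a few thousand times more expensive under this estimate.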
The ADC output is real-valued, which makes the RFFT algorithm [6] applicable to BIST. The radix-2 decimation-in-time RFFT removes the redundant operations that occur at every FFT stage due to the real-only input, and hence memory requirements and computation are reduced approximately by a factor of 2. An N-point real input to the RFFT results in a complex output with N/2 real and N/2 imaginary points generated. Figure 1 shows the signal flow for an 8-point RFFT. The input of the RFFT routine consists of N real data words whose address locations are initially sorted in bit-reversed order. Compared to a traditional FFT, not all butterflies have to be computed, and the order for choosing butterfly coefficients is reconfigured.
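Both properties used above, the bit-reversed input ordering and the conjugate symmetry of a real-input spectrum, can be checked numerically. The sketch below is illustrative Python with hypothetical helper names; it uses a direct DFT for clarity and is not the RFFT routine executed by the CPU.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Input ordering for an n-point radix-2 decimation-in-time transform."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


def naive_dft(x):
    """Direct DFT, used here only to inspect the symmetry of the result."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


order = bit_reversed_order(8)
spectrum = naive_dft([3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0])

if __name__ == "__main__":
    print("bit-reversed order:", order)
    for k in range(1, 4):
        # For real input, X[N-k] is the complex conjugate of X[k].
        print(k, spectrum[k], spectrum[8 - k])
```

Because X[N-k] is the conjugate of X[k], only bins 0 to N/2 need to be computed and stored, which is the redundancy the RFFT exploits.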
The grey arcs show the conjugated complex numbers. After the RFFT completion, the output array contains N/2 + 1 real and N/2 - 1 complex values. The real-valued (black) and complex butterflies (red/blue) are seen in the 3rd RFFT stage.
Fig. 1. Signal flow for an 8-point RFFT (real inputs Re[n] in bit-reversed order; real and imaginary outputs RE[k] and IM[k]).
As floating-point processing is more area-intensive on-chip, the proposed CPU uses a fixed-point architecture for ease of implementation and reduced hardware. However, in fixed-point implementations, care is required to prevent overflows. During the RFFT, each butterfly performs arithmetic operations on two n-bit data words. To avoid overflow, the RFFT is modified so that each data word is divided by 2 prior to entering the butterfly; this prevents the butterfly outputs from exceeding the maximum bit width. To ensure that accuracy is maintained, the ADC outputs are pre-scaled to the maximum allowable value before being stored in memory. Higher-radix algorithms, such as radix-4, radix-8 and split-radix techniques, improve speed compared to the radix-2 FFT; however, they result in more complex butterfly structures with longer program code and more address computations. The aim of this work is to produce a processor design suitable for low-area, easy implementation into SoCs, and so the radix-2 RFFT is selected.
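The effect of the divide-by-2 step can be sketched for the simplest case, a real-valued add/subtract butterfly with no twiddle multiplication. The Python below is an illustration under that assumption; the names WORD_BITS, scaled_butterfly and fits_word are hypothetical and do not correspond to CPU instructions.

```python
WORD_BITS = 32  # assumed signed two's-complement word width
MAX_WORD = (1 << (WORD_BITS - 1)) - 1
MIN_WORD = -(1 << (WORD_BITS - 1))


def fits_word(value):
    """True if value is representable in a signed WORD_BITS-bit word."""
    return MIN_WORD <= value <= MAX_WORD


def scaled_butterfly(a, b):
    """Add/subtract butterfly with each input halved before the arithmetic.

    Python's >> on integers is an arithmetic shift, so negative values
    are halved with rounding toward minus infinity.
    """
    a_half = a >> 1
    b_half = b >> 1
    return a_half + b_half, a_half - b_half


if __name__ == "__main__":
    # Without scaling, the sum of two full-scale words does not fit.
    print("unscaled sum fits:", fits_word(MAX_WORD + MAX_WORD))
    # With scaling, every corner case stays inside the word.
    for a in (MIN_WORD, MAX_WORD):
        for b in (MIN_WORD, MAX_WORD):
            top, bottom = scaled_butterfly(a, b)
            print(a, b, fits_word(top), fits_word(bottom))
```

The cost of this scaling is one bit of magnitude per stage, which is why the ADC samples are pre-scaled to the top of the word before the transform starts.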
The on-chip BIST computational accuracy needs to be comparable with off-chip automated test equipment (ATE). An analysis of the RFFT technique for ADC spectral testing with a fixed-point implementation was used to study the CPU datapath bit width requirements, resulting in the plot shown in Fig. 2. The data shows that, in order to spectrally test ADC resolutions from 8 to 14 bits with variable FFT sizes up to 32K points, a datapath of at least 30 bits is necessary to achieve the accuracy needed for signal monitoring and safety-critical applications. The plot shows that lower ADC resolutions (< 14 bits) can make use of shorter datapath lengths, resulting in less logic; however, it is notable that the accuracy of results below a bit width of 22 bits is insufficient. Most CPU architectures use multiples of 8 bits for the datapath, and so the bit width for this CPU implementation is selected as 32 bits.
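As a point of reference for the accuracy requirement, the ideal signal-to-quantization-noise ratio of an n-bit converter driven by a full-scale sinewave is approximately 6.02n + 1.76 dB. The Python sketch below evaluates this textbook figure for the ADC resolutions considered; it is not the analysis behind Fig. 2, and the function name is hypothetical.

```python
import math


def ideal_sinad_db(adc_bits):
    """Ideal SINAD of an n-bit quantizer with a full-scale sine input.

    Equivalent to the familiar 6.02*n + 1.76 dB approximation.
    """
    return 20 * adc_bits * math.log10(2) + 10 * math.log10(1.5)


if __name__ == "__main__":
    for bits in (8, 10, 12, 14):
        print(f"{bits}-bit ADC: ideal SINAD about {ideal_sinad_db(bits):.2f} dB")
```

The arithmetic noise of the fixed-point RFFT has to sit well below these levels for the extracted parameters to reflect the converter rather than the processor.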
III. PROCESSOR IMPLEMENTATION
The processor unit is designed for easy IC integration and is capable of reusing SoC main memory. As multiple memory instances are not guaranteed to be available for specific applications, the CPU architecture uses single-port SRAM, and memory access is supported by complex instructions such as multiplication and CORDIC [7] functions. The CPU is focused more on small hardware size than on execution speed, and so the multiply and CORDIC functions utilize serial Booth operations. The instruction set is kept simple for easy instruction decoding and user programmability. The CPU is supported by 29 instructions, categorized into 6 data move, 13 arithmetic, 9 branch and 1 control instruction. Furthermore, C-code assembler and cycle-accurate simulator tools complement the datapath hardware design unit in Fig. 3. For the RFFT algorithm, direct and indirect address support via the P register is essential, as large amounts of data from multiple memory locations are processed within the software FFT loops. The memory interface also features hardwired address bit-reversal, alleviating remapping needs. The program counter (PC) holds the physical address of the executed instruction in memory, while the instruction decoder is responsible for decoding the current instruction and setting the control lines for all other modules. The datapath unit also has CORDIC functionality to generate the sine and cosine twiddle factors needed during RFFT butterfly computation. To perform the CORDIC operation, two barrel shifters shift right by multiple bit positions within a single cycle. Both ALU1 and ALU2 perform add/subtract operations, allowing the computation of one CORDIC iteration per cycle according to equations (2)-(4).
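A shift-and-add CORDIC iteration of this kind can be sketched in software. The Python below uses the standard rotation-mode CORDIC recurrence as an assumption, since it may differ in detail from equations (2)-(4); the constants FRAC_BITS and ITERATIONS and the function name are hypothetical. Each loop pass uses two right shifts and three add/subtract operations, mirroring the two barrel shifters and the ALUs described above.

```python
import math

FRAC_BITS = 30   # assumed fixed-point fraction width
ITERATIONS = 30  # assumed number of CORDIC iterations
ONE = 1 << FRAC_BITS

# Small arctangent table, one entry per iteration, in fixed point.
ATAN_TABLE = [round(math.atan(2.0 ** -i) * ONE) for i in range(ITERATIONS)]

# The CORDIC gain is compensated by starting x at 1/gain.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))
INV_GAIN = round(ONE / GAIN)


def cordic_cos_sin(angle_rad):
    """Return (cos, sin) for an angle in [-pi/2, pi/2] via rotation-mode CORDIC."""
    x, y, z = INV_GAIN, 0, round(angle_rad * ONE)
    for i in range(ITERATIONS):
        if z >= 0:
            # Rotate counter-clockwise by atan(2^-i).
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
        else:
            # Rotate clockwise by atan(2^-i).
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]
    return x / ONE, y / ONE


if __name__ == "__main__":
    cos_val, sin_val = cordic_cos_sin(math.pi / 6)
    print(cos_val, math.cos(math.pi / 6))
    print(sin_val, math.sin(math.pi / 6))
```

In this sketch x, y and z play the roles that the A, B and T registers play in the datapath, and the only stored constants are the per-iteration arctangent values.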
The x, y and z variables are then represented by the A, B and T registers in the datapath unit. The main advantage of the CORDIC algorithm in the proposed CPU architecture is that no large memory table is required to store FFT coefficients. A further benefit is the fast computation of the absolute value of a complex number, which is performed after the FFT is completed.
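The absolute-value computation can be illustrated with CORDIC in vectoring mode, where the vector is rotated onto the x axis and the final x value is the magnitude scaled by the CORDIC gain. The Python sketch below is an assumption about how such a computation might look, with hypothetical names and iteration count; it is not the CPU's microcode.

```python
import math

ITERATIONS = 24  # assumed number of CORDIC iterations

# Accumulated CORDIC gain for the chosen number of iterations.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_magnitude(re, im):
    """Approximate |re + j*im| for integer inputs using vectoring-mode CORDIC."""
    x, y = abs(re), im  # fold into the right half-plane; magnitude is unchanged
    for i in range(ITERATIONS):
        if y >= 0:
            # Rotate clockwise to drive y toward zero.
            x, y = x + (y >> i), y - (x >> i)
        else:
            # Rotate counter-clockwise to drive y toward zero.
            x, y = x - (y >> i), y + (x >> i)
    return x / GAIN  # remove the CORDIC gain


if __name__ == "__main__":
    re, im = 3 << 20, 4 << 20
    print(cordic_magnitude(re, im), math.hypot(re, im))
```

Only shifts and additions are needed inside the loop, so no multiplier or square-root unit is required for the magnitude step in this scheme.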
In order to evaluate CPU accuracy, both an ideal full-scale sinewave and a non-ideal full-scale sinewave containing distortion are fed into a 10-bit ADC. The quantized signal is loaded into memory, and processor code computes an 8K-point RFFT, the absolute values and the final parameter extraction. Table I shows that the parameters computed by the CPU (fixed-point) closely match those computed by MATLAB (double precision). A negligible amount of precision is lost due to the iterative CORDIC generation of the twiddle factors in the butterfly computation. Figure 4 plots the error difference for the CPU with and without CORDIC compared to a double-precision FFT implementation (MATLAB). The error difference is smaller than $8 \times 10^{-6}$, demonstrating the CPU's applicability for in-situ dynamic ADC test operations. A CPU without CORDIC is more precise; however, dedicated cosine/sine table storage in memory incurs more logic and power overhead.
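The parameter extraction step can be outlined as follows. This Python sketch uses common textbook definitions of SINAD, SFDR and THD on a coherently sampled record, which are assumptions and may differ from the exact definitions behind Table I; the helper names, the 64-point record and the harmonic amplitudes are hypothetical, and a direct DFT stands in for the RFFT.

```python
import cmath
import math


def power_spectrum(samples):
    """Single-sided power |X[k]|^2 for bins 0 .. N/2-1 using a direct DFT."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        acc = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def folded_bin(harmonic, fund_bin, n):
    """Bin where a harmonic of the fundamental lands after aliasing."""
    b = (harmonic * fund_bin) % n
    return n - b if b > n // 2 else b


def dynamic_parameters(power, fund_bin, n, num_harmonics=5):
    """Return SINAD, SFDR and THD in dB from a single-sided power spectrum."""
    signal = power[fund_bin]
    # Everything except DC and the fundamental counts as noise plus distortion.
    others = [p for k, p in enumerate(power) if k not in (0, fund_bin)]
    harmonic_bins = {folded_bin(h, fund_bin, n) for h in range(2, num_harmonics + 2)}
    harmonic_power = sum(
        power[b] for b in harmonic_bins if 0 < b < len(power) and b != fund_bin
    )
    return {
        "SINAD_dB": 10 * math.log10(signal / sum(others)),
        "SFDR_dB": 10 * math.log10(signal / max(others)),
        "THD_dB": 10 * math.log10(harmonic_power / signal),
    }


# Coherent test record: 5 cycles in 64 samples plus small 2nd and 3rd harmonics.
N = 64
FUND_BIN = 5
samples = [
    math.sin(2 * math.pi * FUND_BIN * t / N)
    + 0.01 * math.sin(2 * math.pi * 2 * FUND_BIN * t / N)
    + 0.001 * math.sin(2 * math.pi * 3 * FUND_BIN * t / N)
    for t in range(N)
]
params = dynamic_parameters(power_spectrum(samples), FUND_BIN, N)

if __name__ == "__main__":
    for name, value in params.items():
        print(f"{name}: {value:.2f}")
```

With a second harmonic 40 dB below the fundamental and no quantization, the sketch reports an SFDR of 40 dB and a THD just above -40 dB, as expected from the amplitudes chosen.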
IV. TEST CHIP & RESULTS
The CPU was designed using Verilog RTL and integrated into an overall ADC BIST SoC design. The design consists of a BIST manager unit supporting ADC data acquisition and an on-chip serial interface to communicate externally with a PC. The CPU core is connected to a single-port SRAM of 40,936 x 32 bits to support variable FFT record sizes up to 32,768 points. The remaining memory is available for general-purpose program code. The full design was implemented in UMC 180nm CMOS technology with a 44-pin CLCC package and verified on a test board integrating 8 to 14-bit ADCs. The plot for the chip and test board design is shown in Fig. 5. The CPU core equates to 11,750 ND2 gates and has an area of 0.142mm². The memory unit contains 4 banks of 10,234 x 32-bit SRAM cells (2mm² each) that surround the CPU core to minimize wire lengths and improve performance. The chip design operates from 3.3V IO and 1.8V core supplies, with power consumption measured at 150mW when operating from a 100MHz clock. The large memories consume most of the power, with the standalone CPU core dissipating 4.3mW during ADC test execution. The design is highly portable to advanced CMOS technology nodes due to its synchronous clock operation and synthesizable RTL code. This is particularly advantageous in denser nanometer ICs, as test time can be improved significantly by running the CPU faster at the cost of consuming more power. The program code for variable FFT records up to 32K points and dynamic parameter extraction is contained within 670 op-codes; this code is downloaded to the CPU memory via a serial interface. The test time duration, excluding data acquisition, is a function of the number of computation clock cycles taken to perform the RFFT, absolute value and parameter extraction phases. Table II gives the clock-cycle breakdown for an 8192-point FFT applied to a 12-bit ADC.
A significant amount of time is taken up by the memory accesses that occur during the FFT butterfly stages, which interrupt pipelining sequence capabilities. During butterfly operation, the indirect address modes could be improved using multi-port SRAM or cache access to minimize cycle time; however, this would come at the cost of extra hardware logic. Table II also shows the test time duration for FFT sizes from 2K to 32K points. Table III compares relevant characteristics of other memory-based FFT processors calculating 1024-point FFTs. It is difficult to achieve a like-for-like comparison, as FFT processors vary according to their application needs, but in [8, 9], normalized area, power per butterfly and FFTs per energy are useful as metrics for comparison. In this case, normalized area is the silicon area normalized to 90nm technology.
The adjusted FFTs per Joule metric compares the number of FFTs calculated per unit energy, scaled according to FFT size in (7).
The proposed CPU has the best normalized area metric, supporting the low-area, easy-integration needs of on-chip test. The power per butterfly operation is also notable, since a low-logic butterfly operation without the need for complex multiplier(s) is supported. The execution time due to the Booth-serial multiply and CORDIC operations makes the FFTs per energy result lower. In contrast, this design supports a very high SQNR (96dB) to enable accurate spectral test capability for a wide variety of ADC resolutions and is highly scalable to low-voltage advanced nanometer process nodes. The CPU not only performs FFTs, but also extracts accurate dynamic parameters from the frequency spectrum using the CORDIC unit and the general-purpose programming supported by the op-codes. This design is particularly useful for safety-critical IC applications where in-line testing of ADCs can be carried out during non-functional periods at start-up and during interval down times, ensuring greater product reliability.
V. CONCLUSIONS
This paper presented the analysis and design of a processor that executes variable-length Real-Value FFTs. The architecture is easily implementable in SoCs and is suitable for in-situ applications where highly accurate ADC spectral analysis can occur when test time duration is not the major concern. The CPU architecture has a very low logic area and can be reused for other non-test processing applications. A 0.18µm CMOS chip and test-board implementation validates the ADC dynamic measurement capabilities.
VI. ACKNOWLEDGMENT

