Abstract-The Discrete Fourier Transform (DFT) is one of the fundamental operations in digital signal processing. This paper presents a software based implementation of a less known observer based Recursive DFT on a PC x86 architecture and on a Microchip PIC30 microcontroller. The results are compared with an efficient/optimized Fast Fourier Transform (FFT) implementation. Both algorithms require complex operations with real valued adders and multipliers, thus a complex result has to be calculated separately for the real and imaginary parts. During the implementation floating-point and fixed-point number representation is applied. The comparison is performed based on the required resources and the computational time.
I. INTRODUCTION
The Discrete Fourier Transform (DFT) is used in many engineering applications: it is not limited only to the field of signal processing, but it can be widely adopted for solutions in telecommunications, data compression and measurements as well. Direct implementation of the DFT is computationally inefficient, thus signal processing applications tend to implement one of the various Fast Fourier Transform (FFT) algorithms instead. The basic operation of these algorithms is block oriented, but there are some problems which require a continues, sample-by-sample calculation of the spectra. Such problems can be spectral sensing for cognitive radios [1] where accurate and continues measurements are required for the fast response in the usage of the radio frequency or the processing of DTMF signals using Goertzel's algorithm [2] .
The Recursive DFT (R-DFT) provides an alternative computation method to acquire the DFT components of a signal, which applies the observer-based structure [3] and recalculates the spectral values for every new incoming sample in a sliding manner. This paper presents a possible implementation based on the x86 architecture using Streaming SIMD Extension (SSE) and 16-bit architectures. The results are compared with an optimized versions of the FFT implementation for the given architecture.
The paper is organized as follows. First, in Section II, a short theoretical background is given for the DFT, FFT and R-DFT methods and a brief comparison of their computational complexity is shown. Section III presents efficient implementations of the R-DFT on the different architectures. The comparison of the performance of the FFT's and R-DFT's various implementations are summarized in Section IV. Finally, the results are concluded and possible further improvements are discussed in Section V.
II. THEORY
In this section the operation of the DFT and the R-DFT are given. The main difference between the operation of the DFT and the R-DFT is that the DFT operates on a block of input samples with a length of N elements, thus it has to wait for the entire block before it can be executed, meanwhile the R-DFT operates as closed loop control system, thus it operates in a sample-by-sample manner.
A. Discrete Fourier Transform and Fast Fourier Transform
Mathematically, the m th spectral component of an N -point DFT (X m ) can be calculated for an N -sample long discrete
where j = √ −1. The calculation of one spectral component requires N complex multipliers and N − 1 complex adders, so the computational complexity for N frequencies is O N 2 . The FFT provides mathematically the same results as the DFT, but it has a modified structure, which reduces the number of elementary operations. There are many FFT algorithms like split-radix FFT, Prime-factor FFT, Brunn's FFT algorithm, etc. [4] . Its computational complexity, depending on the number of samples, can be reduced down to O(N log 2 N ), which is significantly lower than the requirements for the DFT.
B. Recursive Discrete Fourier Transform
The R-DFT realizes the observer-based structure [3] , which calculates the DFT values of the incoming signal for every single new input. The observer theory is based on the predictioncorrection scheme, where the observer is a system, which can copy the state variables of the observed system. By suggestion of Péceli the DFT can be also realized as an observer-scheme.
If
978-1-5090-4591-4/17/$31.00 ©2017 IEEE 
The structure of the R-DFT scheme can be seen in Fig. 1 [k] . In case where the state variables are perfectly estimated, the feedback is identical with the input, thus the error signal is zero.
The complexity depends on the number of branches: After receiving one input sample, all N resonators operate. The first valid spectral components are therefore acquired once N + 1 input values have been received. So, the computational complexity is O N 2 . It can be seen that the complexity is the same as of the DFT's, but the calculation is equally distributed in time over the N samples. Also the spectra of the N sample long block which is shifted by one sample can be calculated in O (N ).
Other resonator-based sliding DFT solutions like Goertzel's [2] or Jacobsen's solutions [5] suffer from the fact that the poles of system are located on the unit circle, inaccuracies due to the finite numerical representation of these values may lead to divergence or convergence to zero. The beneficial property of the R-DFT compared to these solutions is that the arithmetical inaccuracy of the resonator's pole does not cause instability due to feedback [6] .
C. Comparison of the Computational Complexity
The theoretical comparison of the FFT's and R-DFT's computational complexity is discussed for three scenarios: 1) Offline calculations: First, if an offline calculation is considered for a block of N samples, the computational complexity of the FFT is proportional to N log 2 (N ), while the complexity of the R-DFT is proportional to N 2 . In the case of offline calculation, all data are available in memory, so it is not needed to wait for input signal. The latency is proportional to the complexity and the computational time of core operation of FFT (t cF F T ) and R-DFT (t cRDF T ). 2) Online calculations for a block of N samples: The second scenario is where the samples are coming from an outside source (e.g. an A/D converter) with a given sample rate (t s ). In this case the FFT algorithm has to wait for N samples then it can proceed with the calculations. So the computational complexity remains unaltered only the additional latency is introduced. On the other hand, for the R-DFT the feedback loop operated after each incoming sample, so after the arrival of the N th sample, only N operations with a latency of a single stage is required to calculate the spectraassuming that N · t cRDF T < t s -, thus the operations are distributed over the incoming block.
3) Online calculations in a sliding manner: When continuous evaluation of DFT is required, the FFT has to be calculated in sliding window manner over the sampled signal for each block repeatedly. In this case the R-DFT provides the spectral values much faster [1] with a significantly reduced complexity proportional to N . The latency of FFT is equal to t s + (N log 2 N ) · t cF F T after the arrival of the next sample, while the R-DFT's latency is equal to t s + N · t cRDF T by the reason of the reduced complexity. The summary of the required computational complexities and latencies for the various scenarios are presented in Table I . The difference between the block of N samples and sliding manner is illustrated in the Fig. 2 .
III. IMPLEMENTATION
The PC implementation was developed on x86 architecture using C language and Microsoft Visual Studio 2013 development environment. Two solutions are investigated. The first realization uses only the x86 instructions to calculate the floating point numbers. After that the algorithm was optimized by SSE instruction set as follows: every single elementary complex operation (addition and multiplication) is converted to SSE instruction, which are calculated on 128-bits data units. In the second solution, the algorithm is also implemented 
Online calculations in a sliding manner
on a Microchip dsPIC30F6014A microcontroller with 16-bits architecture in Assembly language under MPLAB X using C30 compiler, in order to reveal exactly the number of operations that are executed during DFT processing. Using the x86 architecture the problem was that the operating system hides this information.
A. x86 Architecture
Most of today's desktop PC systems are based on the x86 architecture. In the first implementation only x86 instructions are used, which are then optimized by SSE instructions and these realizations are compared.
1) Direct Implementation: First, the direct implementation maps the structural elements of R-DFT -as presented in Section II -into control statements. The algorithm contains two nested for-loops, which is iterating for every new input sample. This outer loop is responsible for the time domain, and increments the discrete time-variable k. In every discrete time slice, the spectral components are calculated in an em- 2) SSE optimization: The second realization using the SSE, is a Single Instruction Multiple Data (SIMD) instruction set extension of the x86 architecture, which was developed by Intel in 1999 and the company's later processors processors support several version of SSE since Pentium III. The first version of instruction set offers 70 new instructions that can be used to calculate single precision values more efficient. The SSE2 already supports double precision calculations and SSE3 has operations supporting complex arithmetics [7] .
SIMD means that the same instruction is executed on multiple data. SSE instructions perform their operations on 128 bit data units, which can be defined in various ways according to the applied data type. In this paper the decomposition to 2-element double vectors is applied. This scenario uses the same program structure as the direct implementation, but it is optimized for complex additions and complex multiplications. The operations are using both x86 and SSE instructions. This solution is optimal in that sense that the execution is performed on 128 bit data units instead of 64 bit, which means that two double values can be stored in one register. Using the SSE operation ADDPD which adds two registers vertically, the complex addition can be performed with one ADDPD instead of two FADD instructions. Since both instruction's latency is 3 cycles, we are expecting to halve the duration of execution here.
The optimized complex multiplication works on the same data types as the non optimized version, but it is not enough to modify only the FMUL instruction to MULPD as for addition, because it can not be evaluated vertically for vectors. The structure of the calculation has to be redistributed, the resulting optimized source code is as follows:
inline void sse_complex_mul(__m128d * result, __m128d * x, __m128d * y) { __m128d aa = _mm_movedup_pd( * x); __m128d bb = _mm_movedup_pd(_mm_shuffle_pd( * x, * x, 1)); __m128d cd = * y; __m128d dc = _mm_shuffle_pd(cd, cd, 1); * result = _mm_addsub_pd(_mm_mul_pd(aa, cd), _mm_mul_pd(bb, dc)); } The direct implementation of the multiplication contains 4 FMUL (5 cycles latency) and 2 FADD (3 cycles latency) operations, which results in a total latency of 26 cycles. In this case the MULPD instruction has a latency of 5 cycles as well. ADDSUBPD, MOVDDUP and SHUFPD need 1 cycle to be completed. Summarized, the complex multiplication takes 17 clock cycles, as result a maximal time reduction to 65% may be achieved. These estimations are rather optimistic, which do not take into account the load/store operations, and presumes that all data are available in the registers.
B. dsPIC30 Architecture
On the x86 architecture, with the usage of an operating system only the execution time of the algorithm can be measured. Therefore the R-DFT is also implemented on a Microchip dsPIC30 16-bit architecture to compare the number of the executed instructions, which determines exactly the required execution time. The algorithm of R-DFT is implemented in Assembly language to reach the best performance and avoid the compilation overhead. The implementation is improved with the DSP features of the microcontroller's architecture. The DSP engine features a 17-bit by 17-bit multiplier, a 40-bit ALU, two 40-bit saturating accumulators and 40-bit bidirectional barrel shifter [8] . It has MAC instruction which can concurrently multiply two working registers and the result is added to the accumulator while fetching two data operands for next operation. The DSP unit provides fix-point arithmetic, where the Most Significant Bit (MSB) of the 16-bit register represents the sign bit and its other 15 bits are the fractional part of the variable. It means that the values can be between 1 − 2 −15 (0x7FFF) and −1 (0x8000). The number representation of the accumulator is also in fix-point arithmetic, using 31 bits forfractional part and 9 bits for integer part (including sign bit). The content of accumulator can be write back into 16-bit wide memory segment with rounding.
IV. COMPARISON
The comparison presented in this section covers two run cases. First, the two previously introduced x86 implementation of the R-DFT are compared with each other, after that, the R-DFT and FFT implementations are analyzed and investigated. Finally, the used memory and the number of executed operation are presented and compared in the case of dsPIC30 implementation.
A. x86 Architecture
In Figure 3 the execution time of the direct implemented and the SSE optimized algorithm of R-DFT depending on the signal length is shown, where N = 512, the sample rate is considered to be 100 kS/s and the analyzed signal lengths are between 1-10 seconds. The measurement show an average of about 64% time reduction in case of SSE runs, which is in the expected and pre-calculated 50-65% range. It can be also seen that execution time for both of the two implementation is linearly dependent on the length of the input signal. The FFT and R-DFT use different concepts for computation. As with the FFT, a fixed N -size block is always defined which has to be filled with data first (and the program waits so long), while the R-DFT performs its operations for each new input sample. Therefore the algorithm of FFT has to be run using a sliding window on the input, which means that it has to push each next input value into the window, removing the oldest stored value at the same time. For the comparison we used the FFTW as an efficient implementation for the FFT. It is a C subroutine library for computing the DFT in one or more dimensions, of arbitrary input size, and of both real and complex data. FFTW is free software, which provide the FFT library of choice for most applications [9] . FFTW is used on one hand to compare the applied resources, rapidness of algorithms, on the other hand to verify the accuracy of output values. Figure 4 shows the execution time of the FFT and the R-DFT on 1 second input signal, which sample rate is 100 kS/s, depending on its spectral resolution in the range N = 64−530. The measurement indicate that the R-DFT is almost always faster then FFT if a sliding DFT calculated for each new sample is required. The other notable result, shown in Fig. 4 , is that the execution time of FFT has a large deviation from its expected values, but it has a growing trend in function of N . By contrast, the execution time of R-DFT seems linearly proportional to N . The reason of this effect derive from the algorithmic specialty, because if N is a prime or it has a large (larger then 11) prime divisor, than the "divide and conquer" principle can not be applied, while the number of R-DFT's adders and multipliers is proportional to N .
B. dsPIC30 Architecture
The implementation of the R-DFT algorithm is compared to the FFT realization to determine the performance of the algorithms considering the required memory size and the number of executed operations (in clock cycle). The static library of C30 compiler offers an implementation solution for FFT, which has been written in Assembly, and it is compiled into a library called libdsp. The algorithm can calculate only power of 2 point DFT. It needs pre-calculated twiddle factor for FFT's butterfly structure, its size depends on N . A similar situation can be observed for the R-DFT, it requires a pre-calculated complex coefficient array of size N for the operation of its modulators and demodulators. The befit of the R-DFT algorithm is that it can be applied for arbitrary number of samples N .
In Fig. 5 the used memory and the execution time of FFT and R-DFT depending on the size of the block (in case of FFT) and the number of integrators (in case of R-DFT) can be observed. The used memory includes the data memory and the program memory. The duration of the algorithms is equal to the clock cycle, i.e. how many instructions are executed during the run of the core of the algorithms. In the case of FFT this core is called for a block of N samples, and calculates the DFT values for the block. On the other hand the core of R-DFT is only the algorithm, which refresh the integrators, and it has to be called for every sample. The execution time for more incoming sample depends on the mode of operation (offline, block of N , sliding manner), which is presented in Section II. If the FFT is used, the memory usage exceeds the available resource at N = 512, even so the R-DFT can be implemented with the parameter 256 < N < 512 on the same platform.
V. CONCLUSION
This paper presents possible implementations of the lessknown observer based R-DFT on different architectures. Possible optimizations of the algorithm are investigated, their computational time and required memory usage is presented. The results show that the R-DFT is capable of calculating the DFT of the input signal significantly faster in a sliding manner than the point-by-point usage of the FFT.
In the future work the implantation of the R-DFT algorithm on graphics processing unit (GPU) and field-programmable gate array (FPGA) is to be considered; the hardware implementation should further improve the presented software implementations.
ACKNOWLEDGMENT
This work was supported by the János Bolyai Research Fellowship of the Hungarian Academy of Sciences.
