Abstract. This paper presents two Cordic based algorithms which may be used for digital baseband processing in OFDM and/or CDMA based communication systems. The first one is a linear least squares based multiuser detector for CDMA incorporating descrambling and despreading. The second algorithm is a pure Cordic based FFT implementation. Both algorithms can be implemented using solely Cordic based architectures (e.g. coprocessors or ASIPs). The algorithms exactly fit the needs of a multistandard terminal, as both are freely parameterizable. This applies to the accuracy of the results as well as to the parameters of the function performed (e.g. the size of the FFT).
Introduction
Many modern signal processing applications demand so much computational power that only ASICs can meet the technical requirements. Unfortunately, ASICs are inflexible, costly to develop and debug, and only economical for mass products. Consequently, system designers strive to replace specialized hardware with software based solutions, as developments in the field of software radio demonstrate. Because even the most commonly used programmable devices, i.e. DSPs, often lack the required processing power, one tries to find a solution that lies somewhere between the two extremes of programmable signal processing and dedicated hardware. The efforts in this area are summarized under the term reconfigurable computing.
We use this approach to create a common software defined baseband implementation for UTRA/UMTS-FDD and WLAN, as shown in Fig. 1. The most computationally intensive tasks are performed on a dedicated hardware accelerator called RACE, which uses Cordic processing elements. This paper presents the implementation of a Cordic based FFT and of a Cordic based linear equalizer/multiuser detector for CDMA, including descrambling and despreading.

Correspondence to: B. Heyne (benjamin.heyne@uni-dortmund.de)
As the equalizer and the FFT can now be built solely from Cordics, we have derived a software defined architecture for a mobile multi-standard terminal. The main processing blocks of the WLAN and UMTS baseband can be replaced by this programmable architecture.
The paper is organized as follows. First, the multiuser detector, including the system model, the algorithm and simulations, is described in Sect. 2. The Cordic based FFT algorithm and its simulation results are then presented in Sect. 3, followed by an overview of the RACE coprocessor in Sect. 4. Finally, the conclusions are given in Sect. 5.

[...] the other users. The system model used is shown in Fig. 2.
The m incoming complex data symbols of user i, collected in the vector $d_i$, are first upsampled by the spreading factor q, so that one symbol consists of q chips. Each upsampled symbol is then convolved with an OVSF code (3GPP, 2002) contained in the vector $s_i$ of length q. Finally, the summed data streams are scrambled with the complex sequence in the vector c, which is repeated for every data frame (38 400 chips in UTRA/FDD).
The received chips are obtained by passing the signal through a channel, which is characterized by its complex valued impulse response vector h of length $h_l$, and adding an AWGN component n.
Therefore the received data vector r is given by

$$r = H \cdot C \cdot \sum_{i} S_i d_i + n$$

where H is a convolution matrix describing the time variant complex channel, C is a complex valued diagonal matrix containing the scrambling code c on its main diagonal, and $S_i$ is a block Toeplitz spreading matrix. In this matrix each block is one column wide and contains the OVSF code of the i-th data stream. The proposed algorithm assumes that the received signal r has already passed the chip matched filter and has been sampled at chip rate. We also assume that the channel impulse response h (or its strongest taps) is known, as channel estimation is not part of this paper.
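The signal chain of the system model can be tried out on a toy example. The following is a minimal sketch in plain Python, not the authors' simulation code: the function names, the toy dimensions (q = 4, m = 3, two users, three channel taps) and the QPSK-like symbols are assumptions chosen for illustration only.

```python
import random

def ovsf_codes(q):
    """Return the q OVSF codes of length q (q must be a power of two)."""
    codes = [[1]]
    while len(codes[0]) < q:
        nxt = []
        for c in codes:
            nxt.append(c + c)                 # child code (c, c)
            nxt.append(c + [-x for x in c])   # child code (c, -c)
        codes = nxt
    return codes

def transmit(symbols_per_user, codes, scramble):
    """Spread every user's symbols, sum the users, then scramble chip by chip."""
    q, m = len(codes[0]), len(symbols_per_user[0])
    chips = [0j] * (m * q)
    for d_i, s_i in zip(symbols_per_user, codes):
        for k, symbol in enumerate(d_i):
            for n in range(q):
                chips[k * q + n] += symbol * s_i[n]
    return [c * x for c, x in zip(scramble, chips)]

def channel(chips, h, noise_std=0.0, seed=0):
    """Convolve the chips with h and add complex white Gaussian noise."""
    rng = random.Random(seed)
    r = [0j] * (len(chips) + len(h) - 1)
    for n, x in enumerate(chips):
        for l, tap in enumerate(h):
            r[n + l] += tap * x
    return [v + complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std)) for v in r]

# Toy dimensions (assumed): q = 4 chips per symbol, m = 3 symbols, 2 users, 3 taps.
rng = random.Random(1)
q, m = 4, 3
codes = ovsf_codes(q)[:2]
d = [[complex(rng.choice([-1, 1]), rng.choice([-1, 1])) for _ in range(m)] for _ in codes]
c = [complex(rng.choice([-1, 1]), rng.choice([-1, 1])) for _ in range(m * q)]
h = [1.0, 0.5j, 0.25]
r = channel(transmit(d, codes, c), h, noise_std=0.05)
print(len(r))  # m*q + len(h) - 1 received chips
```

With a single unit tap and no noise, descrambling and despreading the first q chips returns the first symbol of the chosen user, which is a quick sanity check of the orthogonality of the codes.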
Multiuser detection
The derivation of the single user detector/equalizer from this model is described in Heyne et al. (2003). For multiuser detection we are interested in the first j users and treat the other users as an additional noise component, so that r becomes

$$r = H \cdot C \cdot \sum_{i=1}^{j} S_i d_i + \tilde{n}$$

with

$$\tilde{n} = H \cdot C \cdot \sum_{i>j} S_i d_i + n$$
Fig. 3. Resulting structure of the system matrix $\tilde{K}_j$. The ∗ operator indicates a convolution.
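To describe r as a simple matrix-vector multiplication, we create a new symbol vector $\tilde{d}_j$, which contains the time interleaved data symbols $d_i$ of the first j users, and a new spreading matrix $\tilde{S}_j$ containing the j corresponding spreading codes. Now the vector r can be rewritten as:

$$r = H \cdot C \cdot \tilde{S}_j \tilde{d}_j + \tilde{n}$$

where $\tilde{n}$ collects the noise and the contribution of the remaining users.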
When we take a close look at the structure of the matrices involved in the computation, we notice several characteristic properties that can be exploited to simplify the calculation of the data symbols. In our approach the matrices H, C and $\tilde{S}_j$ are multiplied to obtain the system matrix $\tilde{K}_j$ of width m · j:

$$\tilde{K}_j = H \cdot C \cdot \tilde{S}_j$$

This matrix is used to calculate the desired data symbols. The structure of $\tilde{K}_j$ is shown in Fig. 3.
The nonzero column vectors are calculated by convolving the channel impulse response with the scrambled spreading code. $c_x$ denotes the x-th code block of length q in c. As the scrambled spreading code only has entries of the form ±1 ± i, the system matrix can be built without multiplications.
It is obvious that $\tilde{K}_j$ has a very sparse structure, which can be exploited as described in Sect. 2.3 to reduce the computational effort of solving the linear system.
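The column structure described above can be sketched as follows. This is an illustrative, dense construction in plain Python under assumed toy dimensions; the function names and the column ordering (symbol index first, then user index) are assumptions, and a real implementation would store only the nonzero column segments.

```python
def convolve(a, b):
    """Full linear convolution of two sequences."""
    out = [0j] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for k, y in enumerate(b):
            out[i + k] += x * y
    return out

def system_matrix(h, scramble, codes, m):
    """Dense K with m*j columns; column k*j + i holds h * (c_k . s_i) from row k*q on."""
    q, j = len(codes[0]), len(codes)
    K = [[0j] * (m * j) for _ in range(m * q + len(h) - 1)]
    for k in range(m):                       # symbol index
        c_k = scramble[k * q:(k + 1) * q]    # k-th code block of the scrambling sequence
        for i, s_i in enumerate(codes):      # user index
            # Entries of c_k . s_i are +-1 +- 1j, so in hardware the products
            # below reduce to sign changes and additions.
            column = convolve(h, [c * s for c, s in zip(c_k, s_i)])
            for n, value in enumerate(column):
                K[k * q + n][k * j + i] = value
    return K

# Toy example (assumed values): q = 4, j = 2 users, m = 3 symbols per user, 3 taps.
codes = [[1, 1, 1, 1], [1, 1, -1, -1]]
scramble = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j] * 3
h = [1.0, 0.5j, 0.25]
d_tilde = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j, -1 + 1j, 1 + 1j]  # interleaved users
K = system_matrix(h, scramble, codes, m=3)
r = [sum(K[row][col] * d_tilde[col] for col in range(len(d_tilde))) for row in range(len(K))]
```

Each column carries only q + h_l − 1 nonzero entries, and the product of K with the interleaved symbol vector reproduces the noise-free output of the spread, scramble and channel chain.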
Implementation
The estimated data symbols $\tilde{d}_j$ can now be detected by solving the overdetermined linear system

$$\tilde{K}_j \tilde{d}_j = r$$
The equation is solved in the least squares sense by a QR decomposition, which can be implemented efficiently on a processor array (Otte et al., 2002) using Cordic processor elements (PE) that perform Givens transformations. However, as the structure of $\tilde{K}_j$ is known, a direct approach is used to calculate the QR decomposition. A new system matrix is built from $\tilde{K}_j$ and r. The required Givens transformations are then applied only to the nonzero elements $\tilde{k}_{xy}$ of $\tilde{K}_j$.
In each step one vector $\tilde{k}_{xy}$ is annihilated. Annihilating one element of $\tilde{k}_{xy}$ requires recomputing two rows of the matrix composed of $\tilde{K}_j$ and r. The last step yields the matrix R and the transformed vector r, which are used to perform the back-substitution.
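The procedure can be illustrated by a small floating point sketch. It is not the Cordic array implementation: the complex Givens rotation is computed in closed form, the matrix is stored densely, and the function name and the 4 × 2 example system are assumptions made for illustration.

```python
import math

def givens_least_squares(K, r):
    """Solve min ||K d - r|| with Givens rotations on the augmented matrix [K | r]."""
    rows, cols = len(K), len(K[0])
    A = [[complex(v) for v in K[i]] + [complex(r[i])] for i in range(rows)]
    for k in range(cols):
        for i in range(k + 1, rows):
            b = A[i][k]
            if b == 0:
                continue                      # zero entries need no rotation
            a = A[k][k]
            norm = math.sqrt(abs(a) ** 2 + abs(b) ** 2)
            row_k, row_i = A[k], A[i]
            # Each annihilation recomputes exactly two rows of [K | r].
            A[k] = [(a.conjugate() * x + b.conjugate() * y) / norm
                    for x, y in zip(row_k, row_i)]
            A[i] = [(a * y - b * x) / norm for x, y in zip(row_k, row_i)]
    # Back-substitution with the upper triangular part R and the rotated r.
    d = [0j] * cols
    for k in reversed(range(cols)):
        rhs = A[k][cols] - sum(A[k][n] * d[n] for n in range(k + 1, cols))
        d[k] = rhs / A[k][k]
    return d

# Assumed toy system: 4 equations, 2 unknown complex symbols.
K = [[1, 2], [0, 1j], [1j, 0], [2, -1]]
d_true = [1 + 1j, -2]
r = [sum(K[i][n] * d_true[n] for n in range(2)) for i in range(4)]
d_hat = givens_least_squares(K, r)
```

Skipping entries that are already zero is what makes the known sparsity of the system matrix pay off: only the nonzero column segments trigger rotations.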
Subdividing the calculation
It is obvious that $\tilde{K}_j$ can grow to a very large matrix. A whole data frame of m · j = 2 400 · 16 = 38 400 symbols at spreading factor q = 16 and a channel of length $h_l$ = 10 would require a $\tilde{K}_j$ matrix of size 38 409 × 38 400.
To overcome this problem the system $\tilde{K}_j \tilde{d}_j = r$ is subdivided into overlapping subsystems of manageable size, as shown in Fig. 4 and described in Vollmer et al. (2001).
The linear system is subdivided into blocks of size $m_b \cdot j$. Using this method it is possible to solve the complete system without storing the whole matrix, which would also entail high latency and large memory requirements. The blocks have to overlap, because solving the subproblems independently leads to higher symbol errors at the block edges. The overlap nearly eliminates these errors.
This method involves some computational overhead, as the grey blocks have to be calculated twice. The overhead can be reduced by choosing a larger block size $m_b$. For good results the overlapping factor should be at least as high as the block overlapping factor p of $\tilde{K}_j$. If the overlap method is used for a data frame with q = 16, $m_b$ = 8, p = 2, j = 16 and m = 2 400, about 400 blocks have to be calculated for one frame. The channel length $h_l$ is again assumed to be 10. In this case the decomposition and back-substitution for each block need approx. 179 000 (real valued) Cordic operations, and the creation of the system matrix needs ≈ 20 500 complex additions. The detection of a whole data frame of m · j = 38 400 symbols therefore requires about 400 times these per-block counts. Note that no further descrambling/despreading is necessary. Furthermore, the Cordic based QR decomposition can make nearly 100% use of two parallel Cordics.
The complexity of our proposed algorithm is compared to other implementations using the numbers given in Nahler et al. (2002) for Rake and PIC based receivers. To include some overhead, one Cordic operation is counted as the equivalent of three array multipliers. A Cordic operation is therefore counted as three operations and a complex addition as two operations. With this weighting, the complexity of our approach for this example is of the same order as that of the conventional Rake receiver, as shown in Table 1.
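The frame totals follow from simple arithmetic on the per-block figures quoted above. The sketch below is a back-of-the-envelope calculation, not a figure taken from Table 1. It assumes that consecutive blocks share p symbols, so that each block advances by m_b − p symbols, which is one reading of the overlap scheme that reproduces the count of about 400 blocks.

```python
import math

def block_count(m, m_b, p):
    """Number of overlapping blocks if consecutive blocks share p symbols (assumption)."""
    return math.ceil((m - m_b) / (m_b - p)) + 1

blocks = block_count(m=2400, m_b=8, p=2)
cordic_ops = blocks * 179_000       # per-block Cordic operations quoted in the text
complex_adds = blocks * 20_500      # per-block complex additions quoted in the text
# Weighting from the text: one Cordic operation = 3 operations, one complex addition = 2.
weighted_ops = 3 * cordic_ops + 2 * complex_adds
print(blocks, cordic_ops, complex_adds, weighted_ops)
```

Under these assumptions one frame needs 400 blocks, about 71.6 million Cordic operations and 8.2 million complex additions.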
Of course this is only a rough estimate of the computational complexity. It shows, however, that the complexity is about the same as for the conventional Rake receiver, while the performance is significantly increased, as shown in Sect. 2.6. For a detailed description of the algorithm and the complexity analysis see Heyne and Goetze (2005).
Simulations
Figure 5 shows the frame error rate for a 16-QAM based system with q = 16, j = 16, $h_l$ = 10, $m_b$ = 8 and p = 2. The channel is assumed to be constant throughout the simulation and contains four randomly distributed strong taps. Hence the Rake receiver uses four fingers. The figure shows that the Rake receiver is unable to detect the symbols in this case, whereas the LS approach achieves a large performance gain for rising SNR values.
FFT

MAC based FFT
A DFT with N input values s can be described as a matrix-vector multiplication. By exploiting the properties of the DFT matrix, the number of operations can be greatly reduced, which leads to the well known Fast Fourier Transform (FFT) (Oppenheim and Schafer, 1999).
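The listing below shows a recursive implementation of a MAC based FFT in Matlab style.

To derive a Cordic based FFT we stop the recursion at n = 2. In this case line 11 will look like:

$$\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} s(1) \\ s(2) \end{pmatrix}$$

The first matrix can then be decomposed to:

$$\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} = \begin{pmatrix} \sqrt{2} & 0 \\ 0 & -\sqrt{2} \end{pmatrix} \begin{pmatrix} \cos\frac{\pi}{4} & \sin\frac{\pi}{4} \\ -\sin\frac{\pi}{4} & \cos\frac{\pi}{4} \end{pmatrix}$$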
This equals a Cordic rotating the input values by π/4, followed by a scaling with √2 and −√2, respectively. As the Cordic elements are real valued but the input values are complex valued, the complex Cordic operation has to be separated into real valued operations. Due to the special structure of the rotation matrix, this is easy to do. For two complex numbers a, b ∈ C, the result of the complex rotation is (with t = 1/√2):

$$\begin{pmatrix} t & t \\ -t & t \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} t(\Re a + \Re b) + j\,t(\Im a + \Im b) \\ t(\Re b - \Re a) + j\,t(\Im b - \Im a) \end{pmatrix}$$
Thus the operation can be applied to the real and the imaginary part of the input values independently, and the complex "butterfly" can be performed by using two of the real valued Cordics shown in Fig. 6 . The scaling of the results can be performed at the end of the flow graph. The symbol for the resulting complex Cordic, called type I, rotating two complex valued numbers by π/4 is shown in Fig. 7 .
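The separation into real valued operations can be checked numerically. The following minimal sketch uses a closed-form π/4 rotation in place of an iterative Cordic and applies the √2 and −√2 scaling immediately. The function names and the rotation orientation, (x, y) → (t(x + y), t(y − x)), are assumptions chosen so that the 2-point DFT is reproduced.

```python
import math

T = 1 / math.sqrt(2)

def rotate_pi4(x, y):
    """Real plane rotation by pi/4 in the orientation (x, y) -> (t(x+y), t(y-x))."""
    return T * (x + y), T * (y - x)

def complex_cordic_type1(a, b):
    """Rotate real and imaginary parts independently with two real rotations."""
    upper_re, lower_re = rotate_pi4(a.real, b.real)
    upper_im, lower_im = rotate_pi4(a.imag, b.imag)
    return complex(upper_re, upper_im), complex(lower_re, lower_im)

def butterfly(a, b):
    """2-point DFT: rotation followed by scaling with sqrt(2) and -sqrt(2)."""
    upper, lower = complex_cordic_type1(a, b)
    return math.sqrt(2) * upper, -math.sqrt(2) * lower

a, b = 0.3 - 1.2j, -0.7 + 0.4j
print(butterfly(a, b))   # agrees with (a + b, a - b) up to rounding
```

The unscaled lower output equals t(b − a), i.e. it carries the reversed sign that the following paragraphs deal with.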
As shown in Eq. 7, the second scaling factor has a negative sign. Therefore all "lower" results of the complex Cordic operation have a reversed sign. Fortunately, because of the structure of the FFT stages, compensating for this does not cause additional computational overhead. In each stage of the FFT the sign reversed results are only combined with other sign reversed results. Therefore a −3π/4 rotation is applied to the results in these cases to project the result from the third quadrant back into the first one. This equals a multiplication of the complex input value by −T. Therefore this operation can be decomposed, too. The complex Cordic performing this operation is called type II.
The structure is the same as that of the type I Cordic, except that the real valued Cordics now perform a rotation by −3π/4.
Further optimizations are possible for $w_x^y = -j$. The result of this multiplication can also be obtained by swapping the real and imaginary parts of the input value and then inverting the sign of the new imaginary part. A closer look at the algorithm reveals that this operation is always performed on sign reversed results of the previous stage. This also implies that the result of the multiplication with w = −j is always passed to complex Cordics of type II.
Therefore the input value of the real valued Cordic is −a. Thus a w = −j multiplication followed by a type II complex Cordic can be replaced by another complex Cordic, called type III.
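Two of the identities used here are easy to verify numerically: a rotation by −3π/4 applied to sign reversed inputs gives the same result as a π/4 rotation of the original inputs, and a multiplication by −j is a swap followed by a sign change. The sketch below is only a check of these identities with closed-form rotations. The helper names and the rotation orientation are assumptions, and the type III Cordic itself is not modelled.

```python
import math

def rotate(x, y, angle):
    """Plane rotation; in this orientation pi/4 maps (x, y) to (t(x+y), t(y-x))."""
    c, s = math.cos(angle), math.sin(angle)
    return c * x + s * y, -s * x + c * y

def mul_minus_j(z):
    """Multiply by -j: swap the parts, then negate the new imaginary part."""
    return complex(z.imag, -z.real)

x, y = 0.8, -0.3
type1 = rotate(x, y, math.pi / 4)             # rotation of the original values
type2 = rotate(-x, -y, -3 * math.pi / 4)      # same result from sign reversed values
z = 2 + 5j
print(type1, type2, mul_minus_j(z))
```

Because a rotation by −3π/4 is the negative of a rotation by π/4, the sign reversal of the inputs is absorbed without any extra operation.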
The final optimized version of the FFT is shown in Fig. 8 . Note that there are no more twiddle factor multiplications after the first FFT stage, which saves one stage in a pipelined implementation.
The twiddle factors shown are the same as used for the standard FFT. As they are located on the unit circle, the multiplication with $\omega_x^y$ can be replaced directly by a Cordic rotation.
It is obvious that the FFT-like butterfly structure is kept. The $(\sqrt{2})^{\log_2 N} = \sqrt{N}$ scaling of the result can be performed after the computation of the three stages.
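The deferred scaling can be demonstrated with a rotation-only recursive FFT. The sketch below is a simplified illustration and not the optimized flow graph of Fig. 8: every butterfly is a π/4 rotation without scaling, every twiddle multiplication is a plane rotation of (Re, Im), the sign of the lower output is flipped explicitly instead of using type II and type III Cordics, and all function names are assumptions.

```python
import cmath
import math

T = 1 / math.sqrt(2)

def rotate_complex(z, angle):
    """Multiply z by exp(-j*angle), written as a plane rotation of (Re z, Im z)."""
    c, s = math.cos(angle), math.sin(angle)
    return complex(c * z.real + s * z.imag, -s * z.real + c * z.imag)

def rotate_pair(a, b):
    """Unscaled pi/4 rotation of the pair (a, b), applied to real and imaginary parts."""
    upper = complex(T * (a.real + b.real), T * (a.imag + b.imag))
    lower = complex(T * (b.real - a.real), T * (b.imag - a.imag))
    return upper, lower

def fft_unscaled(s):
    """Radix-2 FFT whose output is too small by the factor sqrt(2)**log2(N)."""
    n = len(s)
    if n == 1:
        return list(s)
    even, odd = fft_unscaled(s[0::2]), fft_unscaled(s[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddled = rotate_complex(odd[k], 2 * math.pi * k / n)
        upper, lower = rotate_pair(even[k], twiddled)
        out[k], out[k + n // 2] = upper, -lower   # explicit sign flip (simplification)
    return out

def cordic_style_fft(s):
    """Apply the accumulated sqrt(2)**log2(N) scaling once, after all stages."""
    scale = math.sqrt(2) ** round(math.log2(len(s)))
    return [scale * v for v in fft_unscaled(s)]

def dft(s):
    """Reference DFT by direct summation."""
    n = len(s)
    return [sum(s[i] * cmath.exp(-2j * cmath.pi * k * i / n) for i in range(n))
            for k in range(n)]

samples = [1 + 0j, 2 - 1j, 0.5j, -1 + 0j, 3 + 2j, -2j, 0.25 + 0j, -1.5 + 1j]
result = cordic_style_fft(samples)
```

For the eight samples above, the result agrees with the direct DFT up to rounding, which confirms that a single scaling after the last stage is sufficient.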
Complexity
If N is the size of the FFT, the number of Cordic operations is:
The same architecture used to implement the Cordic array also provides a MAC processing element (PE) (Lange et al., 2002) . This PE needs
activations for an FFT of size N. For small FFTs the operation count $OP_{Cordic}$ is even lower than $OP_{MAC}$. For the 64-point FFT used in a WLAN receiver, $OP_{Cordic}$ is 482 and $OP_{MAC}$ is 396. As the hardware accelerator currently provides two parallel PEs, these numbers can be halved to obtain the number of accelerator activations (198 for the MAC, 241 for the Cordic).
So the Cordic based FFT is slower than the MAC based implementation (about 22% more activations for the 64-point FFT), but on the other hand one Cordic based reconfigurable hardware architecture can now be used to implement both the FFT for WLAN and the Rake substitute for UMTS. The OFDM channel correction can also be implemented very efficiently on a Cordic, as it supports division. Hence, in the case of an FFT followed by an OFDM channel correction, the computational overhead is comparatively small. A more detailed description of the algorithm and its properties can be found in Heyne and Goetze (2004).
Simulation results
Finally, the proposed FFT has been used in a WLAN transmitter/receiver simulation in place of the regular FFT. The simulation shown in Fig. 9 has been performed using an AWGN channel and the coding parameters defined in IEEE Std 802.11a-1999.
The results for the 54 Mbit/s case show that a wordlength of ≥ 12 bit is sufficient to achieve the same BER performance as the floating point FFT implementation. Hence the Cordic based FFT needs only 12 bit arithmetic to replace the standard FFT in a WLAN environment.
Hardware coprocessor
Both algorithms have been implemented on the reconfigurable hardware accelerator (RACE) shown in Fig. 10 . The RACE can be described as an algorithm specific instruction set processor (ASIP) with a limited instruction set that is optimized for different classes of algorithms. 
In our case the class of algorithms consists of matrix based algorithms that can be implemented by enhanced Cordic operations. For this purpose the accelerator contains Cordic processing elements, which are based on simple shift-add operations, but other PE types such as MACs can be used as well. The operations that can be performed, and which are used to implement the QR decomposition of the system matrix, are given in Table 2. For example, in the "Orthogonal Rotation" mode the Cordic rotates a two dimensional input vector a = (x, y) by an angle $\varphi_z$.
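The shift-add principle behind such a rotation can be illustrated with the textbook Cordic iteration. This is a generic sketch and not the RACE processing element: it uses floating point numbers, emulates the shifts by multiplications with powers of two, rotates counter-clockwise for positive angles, and corrects the Cordic gain with one final division. The function name, the iteration count and the orientation are assumptions.

```python
import math

def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) by angle (|angle| below about 1.74 rad) with micro-rotations."""
    z = angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        # In hardware the factors 2**-i are right shifts; only additions remain.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    # Every micro-rotation stretches the vector; undo the accumulated gain once.
    gain = math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(iterations))
    return x / gain, y / gain

xr, yr = cordic_rotate(1.0, 0.0, math.pi / 4)
print(xr, yr)   # both close to 1/sqrt(2)
```

The number of iterations sets the accuracy of the result, which is one way in which the precision of a Cordic based algorithm can be parameterized.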
The accelerator contains several processing elements (PE) in parallel that perform the computations, a Data RAM to store values and a Configuration RAM in conjunction with a finite state-machine (FSM) to control the data flow.
Here, the RACE is embedded in a processor environment where it is connected to the system bus via a bus wrapper, which has direct memory access (DMA) capability and thereby controls the dataflow into and out of the RACE. The processor itself is freed from all data moving tasks and is just informed by an interrupt when the results of an operation are available.
The number and the interfaces of the PEs as well as the amount and structure of the memory inside the Data RAM can be parameterized. Hence the RACE can be tailored exactly to the needs of the multistandard terminal in which it is used.
Conclusions
We have presented two Cordic based algorithms usable for communication systems based on OFDM and CDMA, namely an FFT and a linear least squares based CDMA multiuser detector/equalizer. The good performance and the low computational complexity of both algorithms make an implementation feasible. Hence, they can be used to implement a reconfigurable (software defined) architecture for the digital baseband of multi-standard terminals, replacing dedicated hardware with Cordic coprocessors such as the RACE accelerator.
