Abstract: Software radio implementations of beamformers on programmable processors such as digital signal processors (DSPs) and field programmable gate arrays (FPGAs) remain a challenge for the integration of smart antennas into existing wireless base stations for 3G systems. This study compares DSP- and FPGA-based implementations of the space-code correlator (SCC) beamformer, which is practical to use in CDMA2000 systems. The implementation methodology is demonstrated, and results regarding beamforming accuracy, weight vector computation time (execution time) and resource utilisation are presented. The SCC algorithm is implemented on a Texas Instruments (TI) TMS320C6713 floating-point DSP and a Xilinx VirtexIV family FPGA. In signal modelling, the CDMA2000 reverse link format is employed. The results show that, for a five-element uniform linear array (ULA) and a direction-of-arrival (DOA) search resolution of $\Delta\theta = 2^\circ$, beamformer weights can be obtained in less than 10 ms on the C6713 DSP, whereas they can be obtained in less than 25 μs on the VirtexIV FPGA. These results demonstrate that the FPGA implementation computes the weight vector nearly 500 times faster than the DSP implementation in this study.
Introduction
Mobile communication systems have been growing rapidly, and the services they offer are widely used all around the world. Increasing traffic rates, limited system capacity and the low coverage range of base stations are factors to be considered in the design of these systems. To overcome these issues, software radio implementation, advanced antenna systems and adaptive signal processing techniques have been considered over the last decade. A software defined radio (SDR) is defined as a reconfigurable radio in which the functionality is described by software [1, 2]. An SDR provides a flexible radio architecture that allows the radio functionality to be changed in real time, so the same hardware can be used to implement different processes at different times.
A smart antenna system (SAS), which employs an antenna array at the base station together with advanced signal processing techniques, adaptively adjusts its beam pattern according to channel propagation dynamics. SAS decreases system complexity, expands coverage and increases data rate by efficiently utilising the bandwidth [3, 4]. One of the difficulties in integrating a SAS into wideband code division multiple access (WCDMA) systems is the implementation of algorithms on programmable processors. Implementation of beamforming algorithms on programmable processors such as digital signal processors (DSPs), field programmable gate arrays (FPGAs) or special types of application-specific integrated circuits (ASICs) is a key point for upgrading such flexible base stations [5, 6]. DSPs can be considered special-purpose CPUs that excel at fast instruction sequences, such as shift, add and multiply. FPGAs, on the other hand, with their re-programmable logic gates, are more hardware-oriented devices and are preferred for higher processing speeds.
Recently, many researchers have studied the implementation of various beamforming algorithms using different programmable processors. In [7], least mean square (LMS) and recursive least square (RLS) algorithms were implemented as a beamformer for WCDMA using Texas Instruments' (TI) C6211 DSP processor. In [8], a normalised least mean squared (NLMS) beamformer was implemented on the TI C6203 DSP, using two DSPs for the physical layer and the media access control (MAC) layer. In [9], a normalised constant modulus algorithm (NCMA) was implemented using Xilinx's SPARTAN II FPGA to study the digital beamforming capability of an FPGA. In [10], a beamformer system consisting of an eight-element antenna array, eight TI C6701 DSP processors and eight co-processors was implemented using Xilinx's XCV400E FPGA technology.
In this paper, we extend our previous works [11-13] on DSP and FPGA implementations for wireless environments. We specifically focus on the implementation of a smart antenna algorithm that we developed earlier, referred to as the space-code correlator (SCC) algorithm, using a TI floating-point DSP (C6713 DSK) [13] and Xilinx's VirtexIV FPGA. The signal received from the antenna array is assumed to be transmitted in CDMA2000 format [14]. The advantage of the SCC algorithm is that, unlike other adaptive algorithms such as LMS and constant modulus (CM) [15], it does not need any learning parameter, and its weight vector computation time is not affected by multipath propagation conditions [11].
The remainder of the paper is organised as follows. The SCC algorithm is described in Section 2. The implementation methodology based on DSP and FPGA is presented in Section 3. The implementation setup is explained in Section 4. Results pertaining to resource utilisation, weight vector computation time, the effect of direction-of-arrival (DOA) search resolution, the effect of signal-to-noise ratio (SNR) variation and antenna configuration are presented in Section 5. Finally, concluding remarks are given in Section 6.
Description of SCC algorithm
The SCC algorithm, whose implementation on DSP and FPGA is presented in this paper, was also discussed in [13, 16]. It is based on performing code correlation with the desired user's code and then spatial correlation of the despread signal with predetermined array response vectors in the reverse link search table. We briefly describe the algorithm here. The transmitted signal $s(t)$ from the mobile is exposed to a multipath propagation environment, which induces a complex path attenuation $\alpha_{i,f} = \beta_{i,f} e^{j\phi_{i,f}}$ and a time delay $\tau_{i,f}$ on the transmitted signal. Let $f$ and $F$ denote the multipath index and the number of multipaths, respectively, from the desired mobile to the base station. The received signal at the input of an antenna array is given as
where $a(\theta_{1,f})$ is the antenna array response vector, $n(t)$ is the additive white Gaussian noise (AWGN) term, $i$ is the interference index and $I$ is the total number of interferers.
The SCC algorithm has two parts: a code correlator and a space correlator. In the code correlator stage, the received signal $x(t)$ is despread by the code $c_1(t)$ of the desired user to obtain the $p$th multipath component
where $T_w$ is the symbol period and $\tau_{1,p}$ is the multipath delay for the $p$th path of the desired user. If the baseband signal is sampled at chip instants ($T_c$), and the pulse shaping waveform is chosen as a rectangular function with unit amplitude, then the above equation can be rewritten to obtain the post-correlation signal vector
In the spatial correlator part, the correlation of $Z_p(l)$ with the array response vectors $a(\theta)$ is carried out. The range $[0^\circ, 180^\circ]$ is divided into $K$ DOAs, separated by the search resolution $\Delta\theta = 180^\circ/K$. Thus, $K$ complex-valued steering vectors $a(\theta)$ of dimension $M \times 1$ must be stored in the reverse link table. The space correlator computes the dot product of the code correlator output and each array response vector, and takes the DOA corresponding to the peak as the estimate for each path
The estimated DOA can be applied for both uplink and downlink, since it changes very slowly over several symbol periods. As can be observed above, the complexity and accuracy of the SCC beamformer depend on the number of multipaths, the correlation level of the multipaths, the number of antenna elements in the array and the angular resolution of the DOA range to be scanned.
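The spatial correlator search described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the paper: it assumes a ULA with half-wavelength element spacing and angles measured from the array axis, and the names `steering_vector`, `build_search_table` and `estimate_doa` are hypothetical.

```python
import cmath
import math


def steering_vector(theta_deg, m):
    """Array response of an m-element ULA (assumed half-wavelength spacing)."""
    phase = math.pi * math.cos(math.radians(theta_deg))
    return [cmath.exp(-1j * phase * k) for k in range(m)]


def build_search_table(m, resolution_deg):
    """Precompute K = 180 / resolution steering vectors over [0, 180) degrees."""
    k = int(round(180 / resolution_deg))
    return [(i * resolution_deg, steering_vector(i * resolution_deg, m))
            for i in range(k)]


def estimate_doa(z, table):
    """Return the table angle whose steering vector correlates most with z."""
    best_angle, best_magnitude = None, -1.0
    for angle, a in table:
        # Dot product a(theta)^H z between the table entry and the despread vector.
        corr = sum(ak.conjugate() * zk for ak, zk in zip(a, z))
        if abs(corr) > best_magnitude:
            best_angle, best_magnitude = angle, abs(corr)
    return best_angle
```

With `M = 5` and a 2 degree resolution the table holds 90 entries, and a noiseless despread vector arriving from 32 degrees, for example `steering_vector(32, 5)`, peaks at the 32 degree entry. Noise, multipath and off-grid arrivals are not modelled here.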
DSP and FPGA implementation
DSP implementation
The implementation of the SCC algorithm on DSP was presented in our previous study [16], and we refer the reader to that paper for detailed information. TI's C67x family DSPs were used in the implementation. These provide specific instructions for 32-bit integer multiply, double-word load and floating-point operations. Consequently, we used single-precision floating-point operands to code the algorithms. TMS320C67x DSPs use the high-performance, advanced VelociTI very-long-instruction-word (VLIW) architecture [17-20], which enables multichannel and multifunction processing. The C67x processor consists of three main parts: CPU, peripherals and memory. Eight functional units operate in parallel, arranged as two similar sets of four functional units. The functional units communicate using a cross path between two register files, each of which contains 16 registers of 32-bit width. The 256-bit-wide program memory fetches eight 32-bit instructions every cycle.
FPGA implementation
Virtex FPGAs have an array of configurable logic blocks (CLBs) encircled by a ring of input/output blocks (IOBs). Block RAMs (BRAMs) are placed on the two sides of the FPGA. The CLBs are the main building blocks and consist of logic elements such as gates, flip-flops and wiring for connectivity. Each CLB has two slices, along with an input multiplexer and an output multiplexer.
We have used a very high speed integrated circuit hardware description language (VHDL) library that we previously designed for floating-point addition (fp_add) and floating-point multiplication (fp_mul) [21, 22].
Owing to the code correlator and space correlator parts of the SCC algorithm and the limited size of the FPGA, it was not feasible to compute the weights of each antenna element in parallel. Serial implementation is not as efficient as parallel implementation in terms of weight computation time, but we try to minimise the gap between the two by optimising our implementation of the arithmetic blocks on the FPGA. The operations of these units are managed by a control unit. The 32-bit floating-point format imposes too high a processing load, so we had to use the 16-bit floating-point format (half IEEE 754 floating-point format) for the implementation on the VirtexIV FPGA.
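To give a feel for the precision trade-off of the 16-bit format, the following Python sketch rounds values to IEEE 754 half precision using the standard `struct` module (format character `e`). It is an illustration only and says nothing about how the VHDL blocks round internally; the helper names `to_half` and `quantise_complex` are hypothetical.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantise_complex(z):
    """Quantise the real and imaginary parts of a sample independently."""
    return complex(to_half(z.real), to_half(z.imag))
```

Half precision keeps 10 stored fraction bits, so a value such as 0.1 is only reproduced to roughly three or four significant decimal digits, whereas values like 0.5 or 0.25 are represented exactly.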
The SCC implementation architecture for an FPGA is shown in Fig. 1. For ease of understanding, we explain the implementation blocks as described in [13]. The implementation on the FPGA is composed of six entities: Main, Search table, Received signal, Code correlator, Space correlator and Abs.
Received signal entity: This entity is used as a buffer and provides an interface for incoming signals. A signal at 1.2288 mega chips per second (Mcps) corresponds to 24 576 samples over a 20 ms duration. A 0.625 ms portion of the input signal is considered, so $S = 768$ samples are saved in the internal RAM of the Virtex IV. If the number of antenna elements is taken as $M = 5$, the table size of the RAM in the FPGA must be $M \times S \times 2$, because each signal sample has a real and an imaginary part. The values of the received signal from the antenna, which are in half IEEE 754 floating-point format, are updated in this entity at every iteration.
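The buffer sizes quoted for the received signal entity follow directly from the chip rate. The short Python check below reproduces that arithmetic with integer operations; the constant and function names are hypothetical, and the final bit count simply assumes 16 bits per stored value as stated for the half-precision format.

```python
CHIP_RATE_HZ = 1_228_800  # 1.2288 Mcps


def chips_in(duration_us):
    """Number of chip-rate samples in a window given in microseconds."""
    return CHIP_RATE_HZ * duration_us // 1_000_000


FRAME_SAMPLES = chips_in(20_000)   # samples in a 20 ms duration
S = chips_in(625)                  # samples in the 0.625 ms window
M = 5                              # antenna elements
RAM_VALUES = M * S * 2             # real and imaginary part per sample
RAM_BITS = RAM_VALUES * 16         # assuming 16-bit half-precision storage
```

Under these assumptions the received signal buffer holds 7680 half-precision values, or 122 880 bits.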
Code correlator entity: The code correlator process is implemented with 64 complex multiplications (in the multiplier and adder entities) and additions, in a total of $n = 60$ steps. The complexity of the code correlator consumes too much memory space on the FPGA. Hence, a sub-module that performs eight complex multiplications at a time was implemented. The output of the code correlator entity is an $M \times N$ complex-valued matrix whose elements are input to the space correlator entity.
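The idea of splitting the correlation into groups of eight complex multiplications can be illustrated with the following Python sketch. It assumes 64 chips per correlation and a conjugate multiply for despreading; the function name `despread_blocked` and the block structure are hypothetical and do not reproduce the actual VHDL scheduling.

```python
def despread_blocked(samples, code, block=8):
    """Correlate samples with a spreading code, `block` products at a time."""
    acc = 0j
    for start in range(0, len(code), block):
        partial = 0j
        # One pass of the sub-module: `block` complex multiplications.
        for x, c in zip(samples[start:start + block], code[start:start + block]):
            partial += x * c.conjugate()
        # Accumulate the partial sum, as an adder stage would.
        acc += partial
    return acc
```

Because addition is associative, the blocked accumulation gives the same correlation value as a single 64-term sum (up to floating-point rounding), while only eight multipliers need to exist at once.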
Space correlator entity:
The $M \times N$ complex-valued data received from the code correlator entity is stored as an $M \times N$ array. This entity correlates the $1 \times M$ complex-valued array response vector with the $M \times N$ data. In each step $\theta_{n,j}$, we obtain $1 \times N$ complex data that corresponds to the spatial correlation result.
Search table entity:
This entity saves a total of $\theta_n$ spatial angles, each of which corresponds to an array response vector. Hence, we require a table of size $\theta_n \times M$. In this implementation, $\theta_n$ is equal to 90 for $\Delta\theta = 2^\circ$ resolution.
Abs entity: This entity computes the absolute value of the $1 \times N$ complex-valued data from the space correlator entity for each spatial angle in each step.
Main entity: This entity controls all entities in this implementation. It saves the received data $X_r$ from the space correlator entity in a look-up table. The peak finder entity finds the largest value.
The adder entity and the multiplier entity can be considered sub-entities.
Adder entity: In the complex adder entity, floating-point operations, referred to as functions, are performed. In this entity, the addition function is called four times in order to perform a complex addition operation.
Multiplier entity: In the complex multiplier entity, the addition and multiplication functions are called two times and four times, respectively. The multiplier entity multiplies the signal values from each antenna element, which are received by the main entity, with the complex weight vectors stored in the RAM entity. These results are used for error correction in the next step.
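The operation count quoted for the complex multiplier (four real multiplications and two real additions) can be checked with a small Python sketch. The functions `fp_add` and `fp_mul` below are plain Python stand-ins that only count calls; they are not the VHDL library functions of the same name, and `complex_multiply` is a hypothetical helper.

```python
calls = {"fp_add": 0, "fp_mul": 0}


def fp_add(x, y):
    """Stand-in for a real floating-point adder; counts its invocations."""
    calls["fp_add"] += 1
    return x + y


def fp_mul(x, y):
    """Stand-in for a real floating-point multiplier; counts its invocations."""
    calls["fp_mul"] += 1
    return x * y


def complex_multiply(a, b):
    """Multiply (a_re, a_im) by (b_re, b_im) using real operations only."""
    a_re, a_im = a
    b_re, b_im = b
    # Real part: a_re*b_re - a_im*b_im (the subtraction is an add of a negated term).
    re = fp_add(fp_mul(a_re, b_re), -fp_mul(a_im, b_im))
    # Imaginary part: a_re*b_im + a_im*b_re.
    im = fp_add(fp_mul(a_re, b_im), fp_mul(a_im, b_re))
    return re, im
```

A single call such as `complex_multiply((1.0, 2.0), (3.0, 4.0))` leaves the counters at four multiplications and two additions.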
Input signal parameters
We consider a five-element uniform linear array (ULA) as the receiving antenna. In signal modelling, a simple wireless channel model having a direct path and a multipath component for the desired signal, together with an interference signal, is considered. The DOA of the direct path ($\theta_{1,1}$) is fixed at $32^\circ$, whereas the DOAs of the multipath ($\theta_{1,2}$) and the interference signal ($\theta_{2,1}$) change randomly (with uniform distribution) from one simulation run to another. The amplitude ($\beta$) and phase ($\phi$) components of the multipath fading parameters are Rayleigh and uniform random variables, respectively. The multipath signal power level is set to 5, 10 and 15 dB below the direct path signal for testing the performance of the algorithm. The interference signal is 10 dB below the direct path of the desired signal. In Fig. 2, the hardware settings for the DSP and FPGA implementations are depicted. All the signal parameters, and the signal samples using these parameters, are generated in Matlab and then loaded onto the relevant board for simulations.
Implementation results
We first test the performance of the DSP- and FPGA-implemented beamformers in terms of their beamforming accuracy. In this context, beamforming accuracy is measured via the DOA of the spatial spectrum peak. If this DOA is close enough to the DOA of the direct path of the desired user's signal, we consider the beamforming accuracy to be high. In other words, beamforming accuracy shows how closely the beamformer's spatial spectrum points in the direction of the desired user. Fig. 3 shows representative spatial spectrum results for the DSP and FPGA implementations. Since the spectrum peak points to nearly $32^\circ$, which is the direction of the desired user, we obtain high beamforming accuracy for both the DSP- and FPGA-implemented SCC beamformers.
The results regarding execution times for DSP and FPGA for a DOA search resolution of $\Delta\theta = 2^\circ$ are shown in Table 1.
Although DSP and FPGA have different hardware structures, it is useful to examine resource utilisation in the algorithm implementation. Owing to the limited resources on the FPGA (flip-flops, LUTs, slices), we were able to implement the SCC algorithm with $\Delta\theta = 2^\circ$ using the 16-bit floating-point format (half floating point) on the VirtexIV. DSP and FPGA resource utilisation is given in Table 2. In terms of the percentage of its own resources used, the DSP requires less than the FPGA. The SCC algorithm's execution time is not affected by changes in the multipath DOA ($\theta_{i,f}$), the fading level ($\alpha_{i,f}$) or the antenna array topology. However, the SNR level is crucial in separating direct path and multipath DOAs. In Table 3, mean values of the DOA estimation errors obtained over 100 repetitions of the implementation are summarised for the ULA topology. The DSP implementation leads to slightly smaller DOA estimation errors than the FPGA for all SNR conditions.
Conclusions
We have presented a comparative study of a beamformer implementation on a DSP (TI C6713) and an FPGA (Xilinx Virtex IV). As the beamformer, our previously developed SCC algorithm, which is well suited for 3G CDMA applications, was selected. In signal modelling, the CDMA2000 reverse link channel and a five-element ULA were considered. The performance evaluation of the implemented SCC algorithm on the DSP and the FPGA was made in terms of beamforming accuracy, execution time, resource utilisation and DOA estimation error. Both the DSP and the FPGA were able to provide a weight vector that can track the desired user direction. In terms of execution time (weight vector computation time), the FPGA implementation was much faster (500 times faster) than the DSP implementation. Hence, as expected, FPGAs can be used to reduce the execution time. The implementation of the SCC algorithm on the Virtex IV FPGA with 16-bit floating point (half floating point) used up to approximately 99% of physical resources, whereas the DSP required only 30% of its memory resources. In a further study, the implementation could be optimised by a hybrid method that uses both an FPGA and a DSP.
Acknowledgment
This work was partially supported by Kocaeli University Scientific Researches Divisions under the project number of KOU-BAP 2005/58. 
References

