Abstract. A heterogeneous Software Defined Radio (SDR) cluster platform that handles highly demanding processing algorithms in real time is proposed. The solution, based on a combination of FPGA, GPU and CPU, offers the best balance between performance, cost, and flexibility. The key feature of our heterogeneous platform is that it achieves the required performance by assigning tasks according to the characteristics of each technology. The FPGA in the proposed system not only acquires external data but also performs the initial signal acquisition. This facilitates parallelism on the GPU side and optimizes the data transfer.
I. INTRODUCTION
Wireless locating systems that provide position information for objects and individuals are in demand for diverse applications. The accuracy, the update rate and the number of localized objects are the parameters that define the complexity of such a system. An example of a demanding locating system is RedFIR [1], a high-accuracy, high-rate tracking system designed for real-time sports analysis.
The challenge of tracking up to 50,000 positions per second in real time, with an accuracy of a few centimeters, would traditionally require designing a custom hardware platform. This in turn limits the flexibility and scalability of the platform. Requirements such as a varying number of transmitters to be located or support for variable burst lengths favor an adaptive Software Defined Radio (SDR) approach.
Due to the required combination of computationally intensive signal processing, adaptivity and flexibility, the receivers are based on a programmable heterogeneous platform consisting of a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU).
II. SYSTEM OVERVIEW
In RedFIR, multiple receivers are used to triangulate the positions of multiple miniaturized transmitters. Each receiver calculates the Time Of Arrival (TOA) of the signals sent by the transmitters. The Time Difference Of Arrival (TDOA) method is applied to the TOA measurements, which allows the positions to be determined later without knowing the exact time of transmission. Unambiguous TDOA-based triangulation requires at least four receivers. However, in favor of accuracy and redundancy, more receivers are used in an average system setup. Figure 1 shows the system's layout: besides the free-running transmitters on the pitch, the system mainly consists of the receiving infrastructure around the locating area. The antennas are distributed around this area, while the digital processing receivers are located in a central computing cluster. The synchronization of the receivers, as well as the data exchange between the antenna units and the computing cluster, relies on a fiber-optic network.
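The reason TDOA removes the need to know the transmission time can be illustrated with a minimal sketch. The geometry, the receiver coordinates and the function names below are hypothetical and only serve the illustration; they are not taken from the RedFIR setup. Every TOA contains the same unknown transmit time, so it cancels in each difference. Four receivers yield three independent TDOAs, matching the three unknown coordinates of a transmitter.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def toa(tx_pos, rx_pos, t_tx):
    """Time of arrival at rx_pos of a signal sent from tx_pos at time t_tx."""
    return t_tx + math.dist(tx_pos, rx_pos) / C


def tdoas(toas):
    """TDOAs of all receivers relative to the first receiver."""
    return [t - toas[0] for t in toas[1:]]


# Hypothetical geometry: four receivers around a 100 m x 60 m area.
receivers = [(0.0, 0.0, 10.0), (100.0, 0.0, 10.0),
             (100.0, 60.0, 10.0), (0.0, 60.0, 12.0)]
transmitter = (30.0, 20.0, 1.0)

# The same burst, sent at two different (unknown) transmit times.
tdoa_a = tdoas([toa(transmitter, rx, t_tx=0.0) for rx in receivers])
tdoa_b = tdoas([toa(transmitter, rx, t_tx=1.234e-3) for rx in receivers])
```

Both lists agree up to floating-point rounding, although the transmit times differ. Solving the resulting hyperbolic equations for the position is a separate step and is not shown here.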
The computing cluster is the computational core of the system. It comprises all signal processing required for the computation of TOAs, TDOAs and transmitter positions. 
A. Complexity Factors
The system complexity is concentrated at the receiver side and the central computing unit, while the transmitters are designed for low power and small size. The complexity of such a system depends on a number of factors; the most relevant are:
1) Real-time processing: Real-time refers to the online processing of the received data from each antenna within a few milliseconds. Unlike in hard real-time systems (e.g. heart pacemakers, flight control), missing a deadline is not fatal, but the average processing time must be met.
2) Signal bandwidth: RedFIR operates in the 2.4 GHz ISM band, which allows the usage of a bandwidth close to 80 MHz. Direct Sequence Spread Spectrum (DSSS) has been widely used in different ranging and navigation applications [2]. The essence of spread spectrum is to expand the narrow-band signal over the available bandwidth, which in turn enables better performance in a noisy signal environment and leads to a better resolution. According to the Shannon-Nyquist sampling theorem, the signal has to be sampled at a rate of at least twice its bandwidth. The analog down-converted in-phase and quadrature components (I, Q) are sampled by two 100 MSPS ADCs (Analog-to-Digital Converters), one per component.
3) Update rate: The communication system operates in burst mode. The update rate is the rate at which the transmitter sends its bursts. A higher rate requires more processing power, and the receivers' peak performance should be at least 50,000 bursts/sec.
4) Burst length: The burst length is the duration of the transmitted signal. The length of the burst depends on the required processing gain and the number of users. A longer burst results in a larger processing gain at the cost of higher processing complexity.
5) Number of users: The system assigns a different code to each transmitter. This implies that the receivers can detect all the transmitters even if they send at the same frequency and at the same time. The system supports 144 different transmitters, six times more than a GPS receiver.
6) Adaptive Parametrization: Based on the application at hand, the number of transmitters and the burst rate vary. Hence the overall system capacity has to be flexibly partitionable.
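The figures quoted in the list above can be cross-checked with a few lines of arithmetic. The sketch below assumes zero-IF down-conversion with the 80 MHz band centered on the carrier, so that I and Q each occupy 0 to 40 MHz. It also assumes, purely for illustration, that all 144 transmitters share the peak burst rate equally. Neither assumption is stated in the text.

```python
ADC_RATE_SPS = 100e6        # per I and Q converter (from the text)
SIGNAL_BANDWIDTH_HZ = 80e6  # usable ISM bandwidth (from the text)
PEAK_BURST_RATE = 50_000    # bursts per second (from the text)
MAX_TRANSMITTERS = 144      # from the text

# Assumption: centered zero-IF signal, so each of I and Q carries half
# of the RF bandwidth and needs twice that half as its sampling rate.
min_adc_rate = 2 * (SIGNAL_BANDWIDTH_HZ / 2)
nyquist_ok = ADC_RATE_SPS >= min_adc_rate

# Average processing time available per burst at peak load.
budget_per_burst_s = 1.0 / PEAK_BURST_RATE

# Assumption: all transmitters active and sharing the peak rate equally.
bursts_per_tx = PEAK_BURST_RATE / MAX_TRANSMITTERS
```

Under these assumptions the 100 MSPS converters leave some margin above the 80 MSPS minimum, the receiver has 20 microseconds on average per burst, and each transmitter could send roughly 347 bursts per second.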
B. Motivation and Related Work
The first platform designed to support the requirements of RedFIR is an FPGA design based on the Xilinx Virtex-II [3]. The high input data rate and the complexity of the algorithms require around 450,000 Virtex-II Pro logic cells in addition to six PPC405 cores. Although the FPGA-based receiver platform achieves the required performance, the number of FPGAs per platform, combined with the number of receivers needed, results in extremely high material cost.
Designers nowadays have plenty of options to bring their algorithms to life. Since NVIDIA introduced the Compute Unified Device Architecture (CUDA), a platform has been available that dramatically increases parallel computing performance by exploiting the power of GPUs. On the other hand, modern FPGAs offer not only a large number of logic cells, but also dedicated blocks for mathematical operations such as the DSP48, in addition to other available hard cores.
A number of recent studies have been conducted in this field, mostly comparing the performance of different computing architectures. For instance, in many applications GPU performance surpasses that of the FPGA, while the FPGA provides a better solution for other applications [4].
In this paper we introduce a hardware architecture based on FPGA, CPU and GPU, implemented with off-the-shelf components. This platform takes advantage of the FPGA's flexibility to create custom hardware, along with the GPU's parallel processing power to accelerate CPU performance. Unlike similar recently published architectures [5], we introduce a concept for a balanced operation of these three components. Although the proposed heterogeneous platform is designed for the RedFIR locating system, the principle of operation can be beneficial for a wide range of applications (e.g. astronomy [6] and communication [2] [7]).
III. SIGNAL PROCESSING ALGORITHM
In order to increase the battery lifetime of the transmitters and reduce the channel usage, signals are sent in burst mode. Every transmitter sends a spread-spectrum burst (tracking burst), waits for a defined time and sends again. At a much lower rate, the transmitter also sends a narrowband signal (acquisition burst). Unlike wideband signals, narrowband signals suffer from a lack of resolution; however, their detection is computationally less demanding. Hence the system uses this signal in the acquisition phase to obtain an initial synchronization and to keep the receiver synchronized with the transmitter.
The utilization of a heterogeneous platform is only profitable if the algorithms to be deployed are capable of exploiting the platform's heterogeneity. Accordingly, the platform selection is based on the characteristics of the acquisition and tracking algorithms.
A. Acquisition
Every transmitter sends a narrowband signal in one of the available frequency channels. Bursts belonging to transmitters in the same frequency channel use different codes with good cross-correlation properties, so that the individual transmitters can be identified. Figure 2 shows the acquisition operation on the receiver side. The IQ output of the ADC is down-converted to the desired frequency channels, which are then down-sampled and passed through a root-raised-cosine filter. Afterwards, the filter output of each of the N channels reaches the correlation filter, which correlates the filtered signal with M reference sequences. Hence, N (channels) × M (correlations) is the total number of transmitters the receiver can identify. The final step is to identify the transmitter related to the correlation outputs and to calculate the time at which the corresponding acquisition burst arrived [8].
The digital down converter can easily be implemented using the method presented in [9]. However, a real-time correlator, or more precisely a partial correlator, is a complex component in terms of resource utilization. The term partial correlation comes from breaking the burst signal into parts. Correlation is then performed on each of the parts to minimize the effect of frequency offsets [10].
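The benefit of partial correlation under a frequency offset can be made concrete with a minimal sketch. It assumes that the partial results are combined by adding their magnitudes; the combination rule actually used in [10] is not given here and may differ. The code sequence, its length of 64 and the number of parts are hypothetical.

```python
import cmath
import random


def coherent_correlation(samples, code):
    """Magnitude of the full-length coherent correlation."""
    return abs(sum(s * c for s, c in zip(samples, code)))


def partial_correlation(samples, code, parts):
    """Correlate each of `parts` equal-length segments coherently, then
    add the segment magnitudes (an assumed non-coherent combination)."""
    seg = len(code) // parts
    total = 0.0
    for p in range(parts):
        lo, hi = p * seg, (p + 1) * seg
        total += abs(sum(s * c for s, c in zip(samples[lo:hi], code[lo:hi])))
    return total


rng = random.Random(0)
code = [rng.choice((-1, 1)) for _ in range(64)]  # hypothetical +/-1 code

# Received burst: the code rotated by a frequency offset of one full
# cycle over the burst (1/64 cycles per sample), without noise.
received = [c * cmath.exp(2j * cmath.pi * k / 64) for k, c in enumerate(code)]

full = coherent_correlation(received, code)
partial = partial_correlation(received, code, parts=8)
```

In this constructed case the phase rotation makes the full-length coherent sum cancel almost completely, whereas the eight partial sums together retain most of the ideal peak of 64. Shorter parts tolerate larger offsets but give up coherent processing gain, which is the trade-off a real design has to balance.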
B. Tracking
Because the time interval separating the acquisition and tracking bursts is predetermined, the receiver can predict the window in which the next tracking burst is expected as soon as the acquisition burst has been successfully detected. Correlation is performed across the samples of this window, which allows the Time Of Arrival of the tracking burst to be calculated precisely.
The tracking burst is spread over the complete available bandwidth, and each transmitter is again distinguished by a different code. Over the predetermined window, a complex correlation of the received signal with the reference signal operates at the input data rate. The correlation filter is followed by a frequency offset correction, and the result is passed to a precise detector. The function of the precise detector is to disregard multipath signals and to estimate the precise time of arrival of the detected tracking burst, as visualized in Figure 3. Once the correlation filter identifies the tracking burst, the receiver proceeds in tracking mode and moves on to the next burst. If the synchronization is lost, the correlation filter waits for the next trigger from the acquisition.
IV. SYSTEM DESIGN AND ARCHITECTURE
In the previous section, the receiver algorithm was introduced. The available receiver platform, based on Virtex-II, provides a guideline for the proposed implementation and for the efficiency of FPGA resource utilization for the acquisition and tracking algorithms.
It is noteworthy that neither the chosen FPGA nor the GPU or the CPU would be capable of performing the signal processing with sufficient performance on its own. While it would be possible to use a homogeneous platform consisting of several FPGAs, GPUs or CPUs to meet the performance requirements, our heterogeneous approach offers the best realization by exploiting the particular strengths of each of these technologies. Down-conversion, matched filtering and correlation operations involve serial processing of the input data stream. A GPU implementation of the acquisition leads to reduced performance and higher latency in comparison with an FPGA implementation. The tracking processing, on the other hand, is performed on a GPU because the signal tracking algorithms can be realized in parallel. As a result, an FPGA was chosen for the implementation of the acquisition, and the tracking is performed by a GPU. Furthermore, some parts of the signal processing are performed by the CPU, leading to a balanced load on all system components.
A. Acquisition Implementation
In the acquisition algorithm, the first block is a digital down converter. One of the main features of the down-mixer designed in [9] is that no actual mixers or multipliers are needed to acquire the desired spectra from the received IQ components. Down-conversion is performed at each stage with shifts of f_s/4, 0 and -f_s/4, where f_s represents the sampling frequency (100 MHz). The transmitter sends the acquisition signal on one of N dedicated channels. With N=9, these nine channels are acquired using a two-stage mixer. The mixed signal is then passed through a FIR (finite impulse response) low-pass filter with a polyphase structure, which simultaneously performs the down-sampling.
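Why a shift by a quarter of the sampling frequency needs no multiplier can be shown with a short sketch. It illustrates the general principle only and is not claimed to be the structure of [9]; the function names are hypothetical. Shifting by ±f_s/4 means multiplying sample n by a power of j, which reduces to swapping and negating the I and Q components.

```python
import cmath


def shift_fs4(i, q, n, direction):
    """Shift sample n of (i + jq) by direction * fs/4 (direction is +1,
    0 or -1) using only swaps and sign changes."""
    phase = (direction * n) % 4   # rotation factor is j ** phase
    if phase == 0:
        return i, q               # multiply by  1
    if phase == 1:
        return -q, i              # multiply by +j
    if phase == 2:
        return -i, -q             # multiply by -1
    return q, -i                  # multiply by -j


def reference_shift(i, q, n, direction):
    """Same shift computed with an explicit complex multiplication."""
    return complex(i, q) * cmath.exp(1j * cmath.pi / 2 * direction * n)
```

With three possible shifts per stage, two cascaded stages can select 3 × 3 = 9 bands, which is consistent with N=9 in the text. How the second stage is clocked relative to the first is not specified here and is left open.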
After the matched filter, the data rate on each channel is decimated to f_d = f_s/16 (i.e. 6.25 MHz). The filtered output (Figure 5 a) is correlated with all M possible sequences. Considering the number of correlation filters needed (N×M), a highly efficient implementation is vital for this design. Correlation is performed by comparing a code sequence of length L with the input data stream.
Since the code sequence consists of the values 1 and -1, the multiplications in the correlation can be implemented on the FPGA solely with addition and subtraction operations. The IQ correlation filter in Figure 4 operates at a rate 32 times faster than the filtered input data. This allows the correlation filter to perform several multiply-and-accumulate operations per data cycle and to take advantage of the SRL32 shift registers available on the Virtex-6 FPGAs [11]. The magnitude of the complex correlation indicates how closely the received signal resembles the code sequence (Figure 5 b). The output of every correlation enters a detection filter that detects the presence of the transmitter and identifies the point in time at which the signal arrived at the receiver antenna.
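The replacement of multiplications by additions and subtractions can be sketched in software as follows. This is a behavioral illustration only, not the FPGA structure of Figure 4; the code sequence and the sample values are hypothetical. Each chip of the code merely decides whether a sample is added to or subtracted from the accumulators.

```python
def correlate_addsub(i_samples, q_samples, code):
    """Correlate complex samples against a +/-1 code using only
    additions and subtractions; returns (I_sum, Q_sum)."""
    acc_i = acc_q = 0
    for i, q, chip in zip(i_samples, q_samples, code):
        if chip > 0:
            acc_i += i
            acc_q += q
        else:
            acc_i -= i
            acc_q -= q
    return acc_i, acc_q


def sliding_magnitude_sq(i_stream, q_stream, code):
    """Squared correlation magnitude for every alignment of the code."""
    length = len(code)
    out = []
    for start in range(len(i_stream) - length + 1):
        ci, cq = correlate_addsub(i_stream[start:start + length],
                                  q_stream[start:start + length], code)
        out.append(ci * ci + cq * cq)
    return out


# Hypothetical 8-chip code, embedded at sample 5 of a silent stream.
code = [1, -1, 1, 1, -1, -1, 1, -1]
i_stream = [0] * 5 + [3 * c for c in code] + [0] * 4
q_stream = [0] * 5 + [-2 * c for c in code] + [0] * 4
mags = sliding_magnitude_sq(i_stream, q_stream, code)
peak_index = mags.index(max(mags))
```

The peak of the squared magnitude marks the alignment at which the code was received, which is the quantity a detection filter would evaluate. The squaring in the last step does use multiplications, but only once per correlation output rather than once per chip.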
B. Tracking Implementation
In contrast to the FPGA implementation of the acquisition algorithm, the GPU does not perform the tracking algorithm on its own. The GPU is rather a co-processor, controlled by the host system's CPU. The overall tracking correlation process, comprising three distinct steps, is therefore heterogeneous: the window prediction and the post-processing of the correlation result are performed by the CPU, while the tracking correlation is outsourced to the GPU.
The window prediction is triggered either by the acquisition signal in acquisition mode or by a previously correlated tracking burst in tracking mode. The streamed ADC IQ data processed by the GPU is partitioned into 2.5 ms segments, and for each segment the window prediction determines the contained tracking bursts. This prediction is used to fill an index buffer containing the transmitter IDs and the predicted bursts' locations with respect to the segment's origin. The index buffer is transferred to the GPU's memory in parallel with the IQ data stream for further processing. Afterwards the correlation kernel is executed by the GPU for every signal segment containing tracking bursts. The execution itself is initiated by the CPU application after it has performed the window prediction for the signal segment. Two distinct CUDA streams are used alternately for this processing, in order to overlap data transfers, CPU processing and GPU operations.
The CUDA kernel's execution configuration is based on the number of bursts that have to be processed per kernel execution: each tracking burst is processed by one thread block consisting of 15 warps (of 32 threads each). The maximum system load of 50,400 bursts/second amounts to an average of 126 thread blocks per execution. Conversely, a system load of about 11% is sufficient to occupy all of the 15 SMs (Streaming Multiprocessors) available on the GPU. While a high number of blocks is beneficial for the GPU's occupancy, a lightly loaded system does not require the same efficiency as a highly loaded one.
Each block determines the data it has to process from the index buffer and its block ID: block n reads the n-th entry of the index buffer. Based on the retrieved offset, the start of the correlation window is determined. The retrieved transmitter ID is used to determine the reference sequence used for the correlation. After the kernel execution, the corresponding result data is stored in the GPU's main memory and transferred to the host system's main memory for post-processing.
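The index buffer and the resulting launch configuration can be sketched on the host side as follows. This is a Python model under stated assumptions, not the CUDA implementation: the entry layout, the function names and the use of the 100 MSPS ADC rate for the sample offsets are all hypothetical.

```python
from dataclasses import dataclass

SAMPLE_RATE_HZ = 100e6   # assumption: offsets counted in ADC samples
SEGMENT_S = 2.5e-3       # segment length (from the text)
MAX_BURST_RATE = 50_400  # bursts per second (from the text)
THREADS_PER_BLOCK = 15 * 32  # 15 warps of 32 threads (from the text)


@dataclass
class IndexEntry:
    tx_id: int
    offset: int  # window start in samples from the segment origin


def build_index_buffer(predictions, segment_start_s):
    """predictions: iterable of (tx_id, predicted_window_start_s).
    Keeps only windows that start inside the given segment."""
    buf = []
    for tx_id, t in predictions:
        rel = t - segment_start_s
        if 0.0 <= rel < SEGMENT_S:
            buf.append(IndexEntry(tx_id, round(rel * SAMPLE_RATE_HZ)))
    return buf


def block_work(index_buffer, block_id):
    """What thread block `block_id` would look up: entry n for block n."""
    entry = index_buffer[block_id]
    return entry.tx_id, entry.offset


# Average number of thread blocks per kernel execution at full load.
avg_blocks = round(MAX_BURST_RATE * SEGMENT_S)

# Hypothetical predictions for the segment starting at t = 10 ms.
index_buffer = build_index_buffer(
    [(7, 0.0101), (3, 0.0150), (12, 0.0119)], segment_start_s=0.010)
grid_size = len(index_buffer)  # one thread block per predicted burst
```

At full load the product of burst rate and segment length reproduces the 126 blocks quoted above, each with 480 threads. In the example, the burst of transmitter 3 falls outside the segment and is therefore left for a later segment.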
The partitioning between CPU and GPU not only reflects the CUDA processing flow, but also accounts for the exploitable degree of parallelism. Furthermore, the post-processing involves a high number of complex operations, favoring the CPU's or GPU's Floating-Point Unit.
The part of the tracking algorithm running on the CPU requires only about 70% of one of the cores, which is insignificant compared to the post-processing of the correlation results (e.g. position calculation). For the measurement of the GPU utilization, two different Nvidia GPUs have been used: a consumer-grade GeForce 780 and a Quadro K4000. The execution time for both GPUs is shown in Figure 6.
Figure 6. Execution time for the Quadro K4000 and GeForce 780.
A. External interface
Added to the flexibility in [...] algorithm, the FPGA plays a c[...] to the platform. This can be a[...] input (e.g. ADC) directly to th[...] the Gigabit transceivers. For th[...] the IQ data through fast serial t[...] by means of optical commu[...] function with multiple inputs [...] 2XC6VLX240T supports 24 G[...] data are transferred using a st[...] position computing unit.
B. Internal interface: GPU/FPGA Communication Via PCIe
Data transfer between the FPGA and the GPU is performed in two steps: first, the data is transferred by the FPGA's DMA controller to the host system's main memory via the 8-lane PCIe 1.0 interface. Afterwards the GPU's DMA controller reads the data from the host memory.
In principle, the tracking algorithm performed on the GPU requires only those windows of the data stream in which the tracking burst correlation is executed. However, under typical conditions the bursts from different transmitters occupy more than 75% of the radio channel. Hence the extraction of single correlation windows would provide only a minor reduction of the data volume. Furthermore, the extraction would require additional communication between FPGA and CPU: results of the tracking correlation would have to be fed back from the CPU to the FPGA with very low latency. Accordingly, the IQ data stream is transferred in its entirety from the FPGA to the host system's RAM, which avoids time-critical communication between CPU and FPGA at the cost of a slightly increased data transfer volume. Likewise, the data stream is transferred to the GPU as a whole in order to avoid costly in-memory copying.
1) DMA transfer
One critical element of the DMA transfer between FPGA and the host memory is the buffering of the IQ data stream:
Ideally the CPU program continuously initializes DMA transfers, and the stream data is continuously fed from the FPGA to the CPU's RAM. In practice however, latency between DMA transfers has to be considered. This latency has to be bridged by buffering the streamed data prior to the DMA transfer. Based on the data volume, about 0.5KB has to be buffered per microsecond; as a consequence even a buffer of 128KB is not sufficient to prevent buffer overflows reliably.
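The relation between buffer size and the latency it can bridge is a simple division, sketched below with the fill rate quoted above. The function name is hypothetical, and the model assumes that nothing drains the buffer while a transfer request is outstanding.

```python
FILL_RATE_KB_PER_US = 0.5  # stream data rate (from the text)


def bridged_latency_us(buffer_kb):
    """Time in microseconds until a buffer of the given size overflows
    if no DMA transfer drains it in the meantime."""
    return buffer_kb / FILL_RATE_KB_PER_US


latency_128kb = bridged_latency_us(128)
```

A 128KB buffer therefore covers only about a quarter of a millisecond, so any longer pause between DMA transfers would overflow it.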
Instead of increasing the buffer size, e.g. by using DDR memory for buffering, the DMA core was extended with a queuing mechanism and a ring buffer in the host memory. By queuing transfers of up to 12 data chunks, a latency of about 30ms between the transfer requests issued by the CPU can be bridged. This allows the continuous streaming of data with a buffer of only 64KB. Bridging a latency of 30ms without this queuing mechanism would require a buffer of 24MB on the FPGA board.
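The queuing idea can be illustrated with a toy model. It is a sketch under assumptions rather than a description of the actual DMA core: the class, its method names and the slot handling are hypothetical, and the tracking of which ring-buffer slots the consumer has already read is omitted. The point is that the CPU posts transfer requests ahead of time, so the DMA engine can keep moving chunks into the host-memory ring buffer while the CPU is busy.

```python
from collections import deque

QUEUE_DEPTH = 12  # number of queued transfers (from the text)


class QueuedDma:
    """Toy model: the CPU pre-posts transfer requests; the DMA engine
    consumes them one chunk at a time into a host-memory ring buffer."""

    def __init__(self, ring_slots):
        self.ring = [None] * ring_slots
        self.pending = deque()  # destination slots of queued requests
        self.next_slot = 0

    def cpu_post_requests(self):
        """CPU side: top the queue up to QUEUE_DEPTH; returns how many
        requests were posted."""
        posted = 0
        while len(self.pending) < QUEUE_DEPTH:
            self.pending.append(self.next_slot)
            self.next_slot = (self.next_slot + 1) % len(self.ring)
            posted += 1
        return posted

    def fpga_chunk_ready(self, chunk):
        """FPGA side: returns True if the chunk was transferred, False
        if no request was queued and the on-board buffer must hold it."""
        if not self.pending:
            return False
        self.ring[self.pending.popleft()] = chunk
        return True
```

With a full queue, twelve chunks can be delivered without any CPU involvement; only the thirteenth has to wait in the on-board buffer until the CPU posts new requests. Without the queue, every chunk would wait for the CPU, which is why a much larger on-board buffer would be needed.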
2) Heterogeneous data streams
Besides the IQ ADC data stream, additional data has to be transferred from the FPGA to the host memory in order to perform the window prediction. This additional data consists of timestamps associated with the data stream, and of transmitter IDs and timestamps associated with the acquisition trigger. Furthermore, erroneous data originating from buffer overflows has to be detected.
A valid approach consists of transferring this data along with the IQ input stream through additional DMA channels and/or register-based data transfer. This, however, requires synchronization between the different data streams, for example to relate a register-provided timestamp to an IQ data segment.
A more efficient approach, with respect to the software's complexity as well as the utilization of the PCIe bus, consists of keying this additional data into the data stream. In this way, the apparent drawback of transferring data between FPGA and GPU via the host system's RAM can also be exploited: while data is transferred from the host memory to the GPU memory, the CPU program can read out the additional data in parallel.
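Keying side-band data into the stream can be sketched with a simple tagged record format. The format below is entirely hypothetical, since the text does not describe the actual framing: each record consists of a one-byte tag, a two-byte payload length and the payload. Because all data arrives in one ordered stream, a timestamp or trigger is implicitly tied to the IQ samples that follow it, and no cross-stream synchronization is needed.

```python
import struct

# Hypothetical record tags.
TAG_IQ, TAG_TIMESTAMP, TAG_TRIGGER, TAG_OVERFLOW = 0, 1, 2, 3
_HEADER = struct.Struct("<BH")  # tag, payload length in bytes


def pack_record(tag, payload=b""):
    """Build one record: header followed by the payload."""
    return _HEADER.pack(tag, len(payload)) + payload


def parse_stream(data):
    """Split a keyed byte stream into IQ payload and side-band records."""
    iq = bytearray()
    side = []
    pos = 0
    while pos < len(data):
        tag, length = _HEADER.unpack_from(data, pos)
        pos += _HEADER.size
        payload = data[pos:pos + length]
        pos += length
        if tag == TAG_IQ:
            iq += payload
        elif tag == TAG_TIMESTAMP:
            side.append(("timestamp", struct.unpack("<Q", payload)[0]))
        elif tag == TAG_TRIGGER:  # transmitter ID and trigger timestamp
            side.append(("trigger",) + struct.unpack("<HQ", payload))
        elif tag == TAG_OVERFLOW:
            side.append(("overflow",))
    return bytes(iq), side


stream = (pack_record(TAG_TIMESTAMP, struct.pack("<Q", 1000))
          + pack_record(TAG_IQ, b"\x01\x02\x03\x04")
          + pack_record(TAG_TRIGGER, struct.pack("<HQ", 42, 1234))
          + pack_record(TAG_OVERFLOW)
          + pack_record(TAG_IQ, b"\x05\x06"))
iq_bytes, side_band = parse_stream(stream)
```

A real design would more likely use fixed positions or markers that the GPU kernel can skip cheaply; the variable-length records here are chosen only to keep the illustration short.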
VI. CONCLUSION
In this paper we introduced a novel platform that takes into account the characteristics of FPGAs, GPUs and CPUs. High performance is achieved by optimizing the communication between the FPGA and the GPU and by carefully mapping the algorithms onto the components. The FPGA is required to connect the external data input to the platform. In addition, the FPGA is in charge of performing the signal acquisition. This facilitates the parallelism of the GPU-CPU approach for signal tracking and time estimation.
By using standard components (except for the FPGA board) the overall cost is extremely low when compared to other solutions. The performance is demonstrated for a real-time localization system.
