Abstract: This paper presents a reconfigurable inner receiver for the LTE, DVB-H, and IEEE 802.11n (WLAN) radio systems, all of which are based on orthogonal frequency division multiplexing (OFDM). The receiver is implemented in the CAL language. An FPGA-based hardware implementation is synthesized from RTL generated from the CAL description. The purpose of our work is to investigate the feasibility of dataflow methodology for high-level description of digital radio transceivers.
I. INTRODUCTION
Future platforms for radio systems will use very diverse architectures. A platform may be composed of a single DSP, multiple DSPs, coarse-grained reconfigurable arrays, a plethora of special-purpose hardware units, or any combination thereof, whose components may be connected in synchronous or asynchronous on-chip networks. Describing a system at a high level, for instance using dataflow programming, while still being able to map that description to a variety of architectures is a major research topic.
Due to cost and energy constraints, minimizing the total chip size is an important design criterion. Sharing resources between the supported standards is therefore critical to producing cost-efficient solutions.
We will argue that dataflow graphs are not only suitable for modeling of signal processing systems, but also provide a synthesizable system representation useful for exploring system parallelism. Large and complex systems benefit the most from high-level models and the ability to do rapid iterations of the design is one key advantage.
The purpose of this work is to investigate the feasibility of dataflow methodology for high-level description of digital radio transceivers. Figure 1 shows a simplified overview of an OFDM receiver. This paper describes the implementation of the FFT, synchronization, and frequency compensation blocks. The paper is organized as follows: a general description of the CAL dataflow language is followed by a description of the implemented algorithms, after which the dataflow implementations, results, and conclusions are presented.
II. CAL DATAFLOW GRAPHS
In a case study [1], an MPEG-4 decoder was specified in CAL and implemented on an FPGA using a CAL-to-RTL code generator. The results of the case study were encouraging in that the code generated from the CAL specification outperforms the handwritten reference in VHDL, both in terms of throughput and silicon area, and also allowed for a significantly reduced development effort.
Signal processing systems can conveniently be modeled as dataflow graphs, in which the nodes denote computations and the directed arcs represent the flow of data. In this work, we have used the CAL dataflow language. CAL is a dataflow-oriented language that has been specified and developed as a subproject of the Ptolemy project at the University of California at Berkeley. CAL is extensively described in [2].
Figure 2 illustrates a dataflow graph with three nodes (actors) and two arcs (FIFO queues). An actor operates in steps, in each of which it fires one of its actions. The fired action may consume one or more tokens from the actor's inputs and present computational results at the outputs. CAL actors may have state, which a fired action may update. Dynamic behavior is achieved by allowing multiple actions with different preconditions on state and inputs. Actors can be composed in a hierarchical fashion: a dataflow graph itself constitutes an actor, possibly with both inputs and outputs.
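The actor firing model described for Figure 2 can be illustrated outside CAL. The following is a minimal Python sketch, not CAL code and not the authors' implementation: the actor names Scale and Accumulate, the limit parameter, and the simple scheduler are all hypothetical. FIFO queues are modeled with deques, and each call to fire executes at most one action whose precondition on state and inputs holds.

```python
from collections import deque

class Scale:
    """Stateless actor with a single action: consume one token, emit factor * token."""
    def __init__(self, factor, inp, out):
        self.factor, self.inp, self.out = factor, inp, out

    def fire(self):
        if not self.inp:                 # precondition: one input token available
            return False
        self.out.append(self.factor * self.inp.popleft())
        return True

class Accumulate:
    """Stateful actor with two actions selected by a precondition on the state."""
    def __init__(self, inp, out, limit):
        self.inp, self.out, self.limit = inp, out, limit
        self.total = 0                   # actor state, updated by fired actions

    def fire(self):
        if not self.inp:
            return False
        if self.total >= self.limit:     # action "reset": consumes no token
            self.total = 0
            return True
        self.total += self.inp.popleft() # action "add": consumes one token
        self.out.append(self.total)
        return True

def run_network(tokens, limit=10):
    """Fire actors until none can fire; the deques play the role of the arcs."""
    a, b, c = deque(tokens), deque(), deque()
    actors = [Scale(2, a, b), Accumulate(b, c, limit)]
    while any(actor.fire() for actor in actors):
        pass
    return list(c)
```

Because actors only communicate through the queues, the order in which the scheduler fires them does not change the output stream in this example, which is the property that makes such descriptions attractive for exploring parallelism.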
III. OFDM RECEIVER ALGORITHMS
In OFDM [5] [6], the bandwidth is divided into a number of orthogonal narrow-band subcarriers. The orthogonality of the subcarriers is realized through use of the IFFT for modulation and the FFT for demodulation. OFDM is simple to implement and facilitates low-complexity equalization techniques. However, OFDM systems are highly sensitive to synchronization errors. Further, OFDM systems are prone to frequency offsets, mainly caused by Doppler shifts.
A. Time Synchronization
The receiver needs to know when a symbol starts in order to correctly interpret the information carried by the symbol. Synchronization is usually handled in two stages: an initial coarse synchronization followed by a fine-tuning stage. A number of different methods have been proposed and employed in different standards [6]. Synchronization in time may be performed using different approaches, e.g., using specially transmitted synchronization signals or the existing cyclic prefix.
Acquisition in OFDM systems based on auto-correlation is usually performed using the architecture illustrated in Figure 3. The delay, N, is the auto-correlation window or repetition distance. In cases where the auto-correlation is performed with the cyclic prefix, N equals the FFT size. In other cases, such as IEEE 802.11, N is equal to 16 for the Short Training Field (STF) [4]. The autocorrelation function performed in Figure 3 is mathematically expressed as:
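The expression itself is not present in this copy of the text. A commonly used delay-and-correlate form that is consistent with the description of Figure 3 is given here as an assumption, not as the authors' exact equation; the received samples x(n) and the summation window length L are symbols introduced for illustration, while N is the delay defined above.

$$
r(n) = \sum_{k=0}^{L-1} x(n-k)\, x^{*}(n-k-N)
$$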
Time synchronization is accomplished by searching for autocorrelation peaks above a threshold level, as shown in Figure 4. When a peak that satisfies predefined requirements regarding height and peak width is found, it is marked as a synchronization point. In both DVB and LTE, the performance can be improved by using more than one OFDM symbol in the estimation.
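The correlate-and-search procedure can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implemented Correlator or Peak Detector: the summation window L, the threshold, the minimum peak width, and the function names are hypothetical, and the test signal is simply one segment repeated once at distance N.

```python
import cmath

def delay_correlate(x, N, L):
    """Return (n, r(n)) with r(n) = sum_{k<L} x[n-k] * conj(x[n-k-N])."""
    out = []
    for n in range(N + L - 1, len(x)):
        acc = 0j
        for k in range(L):
            acc += x[n - k] * x[n - k - N].conjugate()
        out.append((n, acc))
    return out

def find_sync_points(corr, threshold, min_width):
    """Mark the maximum of every run of |r| > threshold that is at least min_width long."""
    points, run = [], []
    for n, r in corr:
        if abs(r) > threshold:
            run.append((abs(r), n))
            continue
        if len(run) >= min_width:
            points.append(max(run)[1])   # index of the highest sample in the run
        run = []
    if len(run) >= min_width:
        points.append(max(run)[1])
    return points

def demo():
    """Unit-magnitude segment of 8 samples, repeated once, surrounded by silence."""
    seg = [cmath.exp(1j * 0.7 * i * i) for i in range(8)]
    x = [0j] * 20 + seg + seg + [0j] * 20
    corr = delay_correlate(x, N=8, L=8)
    return find_sync_points(corr, threshold=5.5, min_width=3)
```

In the demo the correlation magnitude rises linearly while the window slides into the repeated segment and peaks at the last sample of the second copy, which is the sample index the function reports.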
B. Frequency Error estimation and Compensation
The CORDIC algorithm [7] is used both for estimating the frequency error and for correcting it. The CORDIC algorithm can be used to iteratively rotate a complex number, or to find the phase of a complex number. Figure 5 shows the iterative stages in the CORDIC algorithm, where stage k rotates the complex number by ±arctan(2^-k). The resolution of the CORDIC depends on the number of iterations.
If the CORDIC is used for frequency estimation, its input is the output from the autocorrelation at the peak position. The CORDIC is then configured to rotate the angle towards zero, so that the output "arg" is an estimate of the rotation angle at the correlation peak, which can be used to calculate the frequency error. If the CORDIC is used as a frequency error compensator, its input is each incoming sample together with a rotation angle (arg). The rotation angle for the first CORDIC iteration is calculated from the frequency estimate and then updated on a per-sample basis for the following iterations. The resulting value of {Re, Im} is then the input rotated by the input angle "arg".
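Both uses of the CORDIC can be sketched in floating-point Python. This is a behavioral sketch only: a hardware implementation would use fixed-point shifts and adds, and the function names, the default of 16 iterations, and the helper freq_error_hz are assumptions introduced here. The helper assumes the correlation is formed as x(n) times the conjugate of x(n-N), so that the peak phase equals 2π·Δf·N/fs.

```python
import math

def cordic_gain(iterations):
    """Accumulated magnitude gain of the un-normalized micro-rotations."""
    g = 1.0
    for k in range(iterations):
        g *= math.sqrt(1.0 + 2.0 ** (-2 * k))
    return g

def cordic_vector(re, im, iterations=16):
    """Estimation mode: rotate towards im = 0, return (magnitude, arg). Needs re > 0."""
    arg = 0.0
    for k in range(iterations):
        d = -1.0 if im > 0 else 1.0          # rotate clockwise when im is positive
        re, im = re - d * im * 2.0 ** -k, im + d * re * 2.0 ** -k
        arg -= d * math.atan(2.0 ** -k)      # accumulate the angle that was removed
    return re / cordic_gain(iterations), arg

def cordic_rotate(re, im, arg, iterations=16):
    """Compensation mode: rotate (re, im) by arg radians, for |arg| below about 1.74."""
    for k in range(iterations):
        d = 1.0 if arg >= 0 else -1.0        # drive the residual angle towards zero
        re, im = re - d * im * 2.0 ** -k, im + d * re * 2.0 ** -k
        arg -= d * math.atan(2.0 ** -k)
    g = cordic_gain(iterations)
    return re / g, im / g

def freq_error_hz(arg, N, fs):
    """Hypothetical helper: frequency offset implied by phase arg over a lag of N samples."""
    return arg * fs / (2.0 * math.pi * N)
```

Each additional iteration roughly halves the worst-case residual angle, which is the sense in which the resolution depends on the number of iterations.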
C. FFT
The FFT is an efficient algorithm for calculating the discrete Fourier transform (DFT). FFTs often build on the factorization of the transform length into prime factors. A well-known algorithm is the Cooley-Tukey FFT [8].
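For reference, the radix-2 case of the Cooley-Tukey factorization can be written as a short recursive Python function. This is a textbook sketch for power-of-two lengths, not the pipelined architecture described later; the function name is an assumption.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])           # DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])            # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor times odd part
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

The length n is factored as 2 times n/2 at every level, so the work drops from the order of n squared operations for the direct DFT to the order of n log n.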
IV. DATAFLOW ARCHITECTURE DESCRIPTION
The architecture consists of two parts: 1) the "Synchronizer", which estimates the time position and frequency error and contains a CORDIC rotator for digital frequency error compensation, and 2) the "FFT", a configurable FFT supporting a maximum symbol length of 8k samples.
A CAL dataflow network supports multiple clock domains with up to one clock domain per Actor. If the FIFOs between clock domains are replaced by asynchronous FIFOs, the dataflow is transformed into a GALS [10] (Globally Asynchronous Locally Synchronous) network, making the CAL dataflow an attractive candidate for modeling GALS architectures.
A. Dataflow Implementation of the Synchronizer
The synchronizer dataflow architecture shown in Figure 7 consists to a large extent of control flow. The data processing flow contains several CORDIC rotators, implemented as one Actor for each rotator. Using one Actor for each rotation step is necessary to achieve sufficient data throughput, because the CORDIC rotators must be able to process tokens at the baseband data rate, i.e. up to 40 Msamples/s. The data processing path shown in Figure 7 also contains the Actor "CP Remove" (CP stands for cyclic prefix), which passes only the relevant samples on to the FFT.
The remaining part of the synchronizer dataflow estimates the FFT time position and the frequency error. The CORDIC rotator that estimates the frequency error consists of a single Actor, because this operation is done at most once for each correlation peak and hence can iterate over several clock cycles. The estimated frequency error is passed on to the controller, which then configures the initial CORDIC rotators. The time synchronization is estimated in the Actor "Peak Detector" in cooperation with the autocorrelation in the Actor "Correlator", as described in Section III.A. Besides handling the frequency error estimates, the Actor "Controller" configures all other actors at startup through a common configuration port, not shown in Figure 7.
The Actor named "Data Reducer" decreases the number of bits used for the inputs to the autocorrelation depending on the selected standard. It can also be used for decimating the incoming stream of samples before entering the autocorrelation. The possible reduction of bits and sample rate has the purpose of controlling the computational accuracy as well as limiting the need for large bit widths in the autocorrelator.
B. Dataflow Implementation of the FFT
The FFT needs to process 40 Msamples/s, just as the CORDIC rotator does. Because FPGA designs seldom reach far beyond 100 MHz, the FFT must process approximately one sample every other clock cycle. This calls for a pipelined design. Consider the dataflow graph in Figure 6 and assume the data arrive as a stream of sample tokens. One butterfly stage, illustrated in Figure 8, may then be composed of a stream de-interleaver, the actual butterfly, and an interleaver.
The de-interleaver splits the incoming stream into segments of one or more tokens, outputting the segments alternately over its output ports. The interleaver joins segments of the same length into a single stream. The twiddle factors are read from an internal table, indexed by a rotating index, with a repetition interval equal to the segment length. The input data reordering can also be formulated as data stream transformations. Again consider Figure 6. A reordering stage may be composed of a de-interleaver that divides the stream into segments, sending segments alternately over its output ports, followed by an actor that interleaves tokens one-by-one from its input ports. See also Figure 9. The described building blocks are combined to form an FFT pipeline. See Figure 10 for an example.
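The stream formulation can be tried out with Python generators. The sketch below is one possible arrangement under explicit assumptions, not necessarily the one in Figures 8 to 10: it uses radix-2 decimation in time, places all reordering stages before the butterfly stages, and computes twiddle factors directly instead of reading a table. All function names are hypothetical.

```python
import cmath
from itertools import islice

def deinterleave(stream, seg):
    """Split a token stream into pairs of segments (port A, port B), seg tokens each."""
    it = iter(stream)
    while True:
        a = list(islice(it, seg))
        b = list(islice(it, seg))
        if len(b) < seg:                 # stream exhausted
            return
        yield a, b

def butterfly_stage(stream, seg):
    """Radix-2 butterfly on segment pairs; the twiddle index repeats every seg tokens."""
    twiddle = [cmath.exp(-1j * cmath.pi * j / seg) for j in range(seg)]
    for a, b in deinterleave(stream, seg):
        prod = [w * y for w, y in zip(twiddle, b)]
        # Interleaver: join the sum segment and the difference segment into one stream.
        yield from (x + p for x, p in zip(a, prod))
        yield from (x - p for x, p in zip(a, prod))

def reorder_stage(stream, seg):
    """De-interleave into segments, then interleave token by token from both ports."""
    for a, b in deinterleave(stream, seg):
        for x, y in zip(a, b):
            yield x
            yield y

def fft_pipeline(samples, n):
    """Chain the stages for n-point transforms (n a power of two) over a sample stream."""
    stream = iter(samples)
    seg = 2
    while seg < n:                       # reordering segments 2, 4, ..., n/2
        stream = reorder_stage(stream, seg)
        seg *= 2
    seg = 1
    while seg < n:                       # butterfly segments 1, 2, ..., n/2
        stream = butterfly_stage(stream, seg)
        seg *= 2
    return list(stream)
```

With n = 8, the two reordering stages turn the natural order 0..7 into the bit-reversed order 0, 4, 2, 6, 1, 5, 3, 7, after which three butterfly stages produce the spectrum in natural order. Consecutive symbols can be streamed back to back because every stage works on aligned blocks of at most n tokens.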
In describing the configuration, preceding and following refer to the dataflow direction. In a variable-size transform, the stages must have configurable segment lengths, twiddle factors, and bypass ability. A stage is bypassed if the resulting reordering segment length does not divide the transform size (in the integer sense). In a configurable pipeline, the twiddle table size is the largest that a stage must accommodate. If a stage is bypassed, the index stepping for the following stage is multiplied by the radix of the bypassed stage. The butterfly segment length is 1 for the first stage, and is then successively multiplied by the radix of each active stage. Similarly, the reordering segment length is 1 for the first stage, and is then multiplied by the radix of each preceding stage.
A radix-2 stage requires two segment lengths of buffer capacity: one segment at one of the butterfly ports and another at one interleaver port. The size of the twiddle table, Nw, is the product of the radices of the following stages. The reordering stage needs to store (P-1) times N data elements.
V. RESULTS
The dataflow description was synthesized to an off-the-shelf FPGA-based development platform using the OpenDF [11] tools. Test data was streamed over Ethernet to the development board and the result was displayed on an attached VGA display. The Ethernet streaming and data display were handled by an on-chip processor. The synthesized hardware is capable of processing 50 Msamples/s, which is sufficient for real-time operation. Some effort was spent on obtaining a high throughput in order to keep the clock rate moderately low. The synchronizer, which contains a considerable amount of control, requires 2 clock cycles per sample. Since the synchronizer can run at 97 MHz, it meets the demand of 40 Msamples/s with some margin. Any further optimization of the synchronizer would involve reducing the amount of hardware at the cost of a lower maximum clock rate. The pipelined FFT also has a good throughput margin because it can process more than 50 Msamples/s. Any further optimization of the FFT would involve decreasing the throughput by allowing 2 or even 3 cycles per sample.
The figures are not directly comparable. The CAL implementation spends considerable area on reordering and configurability. The architectures are different, which mainly affects the memory figures. The CAL infrastructure adds some overhead, but the main source of the additional resource usage is probably duplication of functionality because of actor state isolation and a division of labor among many actors to reach the performance goals.
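The synchronizer margin follows directly from the clock rate and cycle count quoted in this section. As a worked check, using only those figures:

$$
\frac{97\ \text{MHz}}{2\ \text{cycles/sample}} = 48.5\ \text{Msamples/s} > 40\ \text{Msamples/s}
$$

This leaves a margin of 8.5 Msamples/s, or roughly 21% above the required baseband rate.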
VI. CONCLUSIONS
We found that we could certainly reach our performance goals, but also that our CAL implementation uses more resources than a comparable RTL implementation. This is probably due to the relatively crude tools. Essentially, there is a 1:1 relationship between the CAL code and the RTL, placing the burden of optimization on the RTL toolchain. However, we also found that the asynchrony of the dataflow programming model and its realization in hardware enabled convenient experiments with design ideas. Thanks to the emphasis on interfaces and hierarchical design, changes involving only one or a few actors do not break the rest of the design. In contrast, any design methodology that relies on precise specification of timing, such as RTL, where designers specify behavior cycle by cycle, would have resulted in changes that propagate through the design.
