Abstract -- The Vandermonde system is used in OFDM predistortion to enhance power efficiency dramatically. In this paper, we study efficient FPGA architectures for a recursive algorithm for the Cholesky and QR factorization of the Vandermonde system. We identify the key bottlenecks of the algorithm with respect to real-time constraints and resource consumption. Several architecture/resource tradeoffs are studied to find the commonalities among the architectures for the best partitioning. Hardware resources are reused according to the algorithmic parallelism and data dependency to achieve the best timing/area performance in hardware. The architectures are implemented in a Xilinx FPGA and tested on an Aptix real-time hardware platform, requiring 11348 cycles at a 25 ns clock period.
In this paper, we will focus on the efficient real-time implementation architectures.
The direct solution of the system by matrix inversion with Gaussian elimination has a high complexity on the order of $O(n^3)$. By exploiting the structure of the Vandermonde matrix, the propagating algorithm requires $(9n^2+n-10)/2$ multiply and divide operations (MDO) to solve the normal equations (NE), plus $n^2/2$ multiplications to compute the mean-square error during the recursion [6, 7]. Using the Levinson principle, the recursive algorithm proposed in [6] reduces the complexity to $3n^2+9n+3$ MDO. As an extra feature, the minimum mean squared error (MMSE) is available as an output at each recursion in the polynomial order. Avoiding matrix inversion makes the algorithm attractive for real-time implementation. Despite several patents on conventional linearization schemes, only a few reports can be found on real-time architectures for pre-distortion based on the Vandermonde system. Although [3] discussed the feasibility of DSP and FPGA implementation, it remained a software simulation of matrix inversion in Matlab. For a wireless system, the computation of the Vandermonde system is considerably expensive, so the exploration of efficient real-time architectures is of great interest both theoretically and practically.
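To make the operation counts quoted above concrete, the following sketch evaluates them for a few system sizes. The function names are illustrative and do not come from the paper, and the cubic figure for Gaussian elimination is only an order-of-magnitude stand-in.

```python
def mdo_propagating(n):
    """MDO count of the propagating algorithm: normal equations plus MSE tracking."""
    normal_equations = (9 * n * n + n - 10) / 2
    mse_tracking = n * n / 2
    return normal_equations + mse_tracking


def mdo_recursive(n):
    """MDO count quoted for the Levinson-type recursive algorithm of [6]."""
    return 3 * n * n + 9 * n + 3


if __name__ == "__main__":
    # Columns: n, n^3 (order of Gaussian elimination), propagating, recursive.
    for n in (5, 10, 20):
        print(n, n ** 3, mdo_propagating(n), mdo_recursive(n))
```

For n = 10 the two structured algorithms need 500 and 393 operations respectively, against roughly 1000 for the cubic reference.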
In this paper, we study several efficient architectures for FPGA implementation: a single DSP processor type of architecture, a fully parallel manual layout architecture, and a semi-parallel, pipelined architecture with a configurable number of functional units (FU). We identify the key bottlenecks of the algorithm with respect to real-time constraints and resource consumption. By studying the commonalities among the architectures, we achieve efficient system partitioning and resource sharing. A Precision-C based High-Level Synthesis (HLS) design methodology is applied to schedule area/time efficient RTL by studying the architecture tradeoffs. The design is implemented on a Xilinx Virtex-II FPGA in a real-time prototyping system.
II. PRE-DISTORTION IN OFDM SYSTEMS
In the transmitter of an OFDM system, a set of information bits $[b_1\ b_2\ b_3 \ldots b_M]$ is first mapped into the I/Q channel baseband symbols $\{S_n(i,r)\}$ using a modulation scheme such as phase-shift keying (PSK) or quadrature amplitude modulation (QAM). Then every N symbols are packed into a parallel block, and the OFDM symbols in the time domain over the time interval $t \in [0, T_s]$ are generated by the IFFT for $k = [1, 2, \ldots, N]$. The proposed pre-distorter includes two stages [4].
In the estimation stage, we send a training sequence with sufficient dynamic range to probe the non-linearity. The RF output for this sequence is fed back to the baseband, sampled, and modeled as a P-th order polynomial.
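The estimation stage amounts to a least-squares polynomial fit, which is where the Vandermonde matrix enters. The sketch below is a plain reference solution through the normal equations with Gaussian elimination, that is, the expensive baseline rather than the recursive algorithm studied in this paper. It assumes real-valued samples and a polynomial with powers 0 to P; both choices, and all function names, are illustrative assumptions.

```python
def vandermonde(x, order):
    """Rows [1, x, x^2, ..., x^order] for every sample in x."""
    return [[xi ** p for p in range(order + 1)] for xi in x]


def solve_linear(a, b):
    """Solve a * c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    c = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * c[j] for j in range(i + 1, n))
        c[i] = (m[i][n] - s) / m[i][i]
    return c


def fit_polynomial(x, y, order):
    """Least-squares polynomial coefficients from the normal equations."""
    cols = list(zip(*vandermonde(x, order)))
    # Covariance matrix X^T X and cross-correlation X^T y.
    r = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    return solve_linear(r, rhs)


if __name__ == "__main__":
    x = [i / 10 for i in range(-10, 11)]       # training sequence
    y = [xi - 0.2 * xi ** 3 for xi in x]       # hypothetical cubic compression
    print(fit_polynomial(x, y, 3))
```

The synthetic cubic in the demo is only a stand-in for a measured amplifier response.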
The principle of pre-distortion [4] is to apply a function g(x) before the actual non-linearity so that the overall RF output is linear. The inverse non-linearity is shown to be
Vandermonde matrix of the output vector.
III. RECURSIVE QR FACTORIZATION
A fast QR decomposition algorithm is proposed in [6] for Vandermonde systems. The matrix X is written as the product of an orthogonal matrix Q and an upper triangular matrix ℜ (X=Qℜ). Because it is recursive in the order, the Levinson-type algorithm does not require explicit inversion of the coefficient matrix. The recursive structure also makes it suitable for parallel architectures in VLSI implementation. We summarize the key ideas of the algorithm here and make some modifications to facilitate real-time implementation.
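For readers who want to see the factorization X=Qℜ itself, the sketch below computes it for a small Vandermonde matrix with ordinary modified Gram-Schmidt. This is a generic textbook method swapped in for illustration, not the fast recursive algorithm of [6]; the function name and the sample values are assumptions.

```python
import math


def qr_gram_schmidt(x):
    """Thin QR of a tall, full-rank matrix (list of rows) by modified Gram-Schmidt."""
    rows, cols = len(x), len(x[0])
    q_cols = [[x[i][j] for i in range(rows)] for j in range(cols)]
    r = [[0.0] * cols for _ in range(cols)]
    for j in range(cols):
        # Normalize column j, then remove its direction from the later columns.
        r[j][j] = math.sqrt(sum(v * v for v in q_cols[j]))
        q_cols[j] = [v / r[j][j] for v in q_cols[j]]
        for k in range(j + 1, cols):
            r[j][k] = sum(a * b for a, b in zip(q_cols[j], q_cols[k]))
            q_cols[k] = [b - r[j][k] * a for a, b in zip(q_cols[j], q_cols[k])]
    q = [[q_cols[j][i] for j in range(cols)] for i in range(rows)]
    return q, r


if __name__ == "__main__":
    samples = (0.1, 0.4, 0.7, 1.0)
    x = [[1.0, t, t * t] for t in samples]  # 4 x 3 Vandermonde matrix
    q, r = qr_gram_schmidt(x)
    for row in r:
        print(row)  # upper triangular factor
```

The columns of Q are orthonormal and the factor below the diagonal stays exactly zero, which is the property the recursive algorithm builds order by order.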
The Levinson principle has been used extensively since its invention for structured matrices, especially Toeplitz systems. 
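As a reminder of the Levinson principle in its classical setting, the sketch below is the textbook Levinson-Durbin recursion for a symmetric Toeplitz system (the Yule-Walker equations). It is not the Vandermonde algorithm of [6]; it is shown only because it has the same two features emphasized in this paper: the order-(m) solution is built from the order-(m-1) solution without any matrix inversion, and the prediction error is available at every order. The function name and the test data are assumptions.

```python
def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for autocorrelation r[0..order].

    Returns (a, errors) where a[k] is the predictor coefficient for lag k+1
    and errors[m] is the prediction error after order m.
    """
    a = []
    err = r[0]
    errors = [err]
    for m in range(1, order + 1):
        # Reflection coefficient from the mismatch of the previous-order solution.
        acc = r[m] - sum(a[k] * r[m - 1 - k] for k in range(m - 1))
        k_m = acc / err
        # Order update: old coefficients corrected by their reversed copy.
        a = [a[k] - k_m * a[m - 2 - k] for k in range(m - 1)] + [k_m]
        err *= 1.0 - k_m * k_m
        errors.append(err)
    return a, errors


if __name__ == "__main__":
    # Autocorrelation of a first-order process with coefficient 0.5.
    coeffs, errors = levinson_durbin([1.0, 0.5, 0.25], 2)
    print(coeffs, errors)
```

Each order costs a number of operations proportional to m, so the total is quadratic in the order rather than cubic.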
Using a guess procedure, the $(j+1)$th order solution is approximated by the forward estimation of the $j$th order solution.
The forward estimation $F_j$ is also formed similarly, with a relation to the forward MMSE $E_{F,j}$ as for the $(j+1)$th order. Using L.2 and minimizing with respect to the parameter $\mu_{j+1}$, we obtain a simplified form of L.1 in $(2\times 2)$ form as
Thus we can solve L.2 with $\mu_{j+1}$ and the forward MMSE $E_{A,j}$. By combining the recursive estimation of both the forward and backward estimates, we finally get a recursive algorithm whose $j$th iteration is summarized as:
where $G_{j-1}$ is the matrix used for the computation of the forward estimation $F_j$, and $r_{j,2j}$ is a vector which contains the $j$th to $2j$th independent coefficients of the covariance matrix R. This algorithm does not require the explicit computation of the inverse of the coefficient matrix.
IV. VLSI IMPLEMENTATION ISSUES

Design Methodology
High-level software platforms such as a general-purpose processor or a DSP are flexible to program for many applications. However, they are not fast enough for many real-time applications. Although an ASIC is compact and cheap when the product volume is large, it does not make it easy to study architecture tradeoffs. An FPGA provides programmability and the flexibility to study several area/timing tradeoffs in the hardware architecture by exploiting the intrinsic algorithmic parallelism. The FPGA netlist can be mapped to an ASIC VLSI design for mass production. We use the Precision-C and HDL Designer based design flow shown in Fig. 1 for our study of efficient architectures. Precision-C is an RTL scheduler from Mentor Graphics. It can assign the number of FUs according to the time/area constraints. We start from a floating-point algorithm in Matlab. Then we build a C/C++ test bench to model the exact behavior of the algorithm in a real system. Using some special design styles, we convert the algorithm to a Tsunami-compatible version. Precision-C helps study the data dependency of the algorithm. We add both time and area constraints to the design, and Tsunami schedules solutions for efficient architectures according to the constraints. By studying the parallelism both within Precision-C and offline, many of the functional units are reused across the computational cycles. The RTL output is generated and imported into HDL Designer, where the arrays are mapped to memory blocks. CoreGen generates Xilinx IP cores for the RAM/ROM blocks as well as the pipelined dividers. After simulation in ModelSim, Spectrum Exemplar is used for synthesis and the Xilinx Place & Route tools are used to generate the gate-level netlist. The design is finally verified in a real-time configurable FPGA prototyping system from Aptix Inc.
Real time requirement
Since the power amplifier characteristics vary with time due to aging, temperature changes, supply voltage variations and power control, the pre-distortion requires adaptive updating. The real-time requirement determines the architecture tradeoffs and is two-fold: 1). The update of the polynomial coefficients must track the changing non-linearity. For a broadband PA, an update rate within several ms is desirable based on measurement [3]. Considering the shift of frequency, however, a faster update is in demand. The interpolation requires much less training data than LUT-based schemes and can achieve adaptive updating. Moreover, an FPGA can work at a much lower clock rate and consumes less power than a DSP. Once the coefficients are updated, the computation units can be shut down to save power. 2). Real-time generation of the actual pre-distorted signal with the captured non-linearity is critical for the data rate. Higher speed favors a higher data rate and wider bandwidth. With this in mind, we derive different architectures with different pipelining or resource sharing tradeoffs.
Fig. 2. Data dependency graph within an iteration.
Data dependency
The data dependency determines the algorithmic parallelism and the possible resource sharing opportunities. In principle, two independent computations can be processed in parallel, while two dependent computations must proceed serially, the second starting after the first result is ready. On the other hand, two serial computations can share a common resource such as memory, or an expensive functional unit such as a multiplier or divider. The data dependency within an iteration is identified in Fig. 2. The data at the head of an arrow depend on the results of the data on the preceding paths in the graph. Also highlighted are the three divisions in the iteration, because a divide is much more expensive in both area and timing: a typical divider takes 1000 LUTs and 16 cycles. By studying the data dependency and the time/area tradeoff, we can exploit the parallelism and resource sharing to the greatest extent.
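The interplay between data dependency and the number of functional units can be illustrated with a small greedy scheduler. The dependency graph below is hypothetical (it is not the graph of Fig. 2), the units are modeled as non-pipelined, and only the 16-cycle divider latency is taken from the text; everything else is an assumption made for the sketch.

```python
def greedy_schedule(ops, units):
    """Assign start cycles to operations under dependency and resource limits.

    ops:   dict name -> (kind, latency, [names it depends on])
    units: dict kind -> number of non-pipelined functional units
    Returns (start, finish) dicts keyed by operation name.
    """
    start, finish = {}, {}
    free_at = {kind: [0] * count for kind, count in units.items()}
    pending = list(ops)
    while pending:
        progressed = False
        for name in list(pending):
            kind, latency, deps = ops[name]
            if all(d in finish for d in deps):
                ready = max([finish[d] for d in deps], default=0)
                pool = free_at[kind]
                unit = min(range(len(pool)), key=lambda u: pool[u])
                begin = max(ready, pool[unit])
                start[name], finish[name] = begin, begin + latency
                pool[unit] = finish[name]
                pending.remove(name)
                progressed = True
        if not progressed:
            raise ValueError("cyclic dependency")
    return start, finish


def makespan(ops, units):
    """Total latency in cycles of the greedy schedule."""
    return max(greedy_schedule(ops, units)[1].values())


# Hypothetical iteration: two products feed two divisions, then one addition.
OPS = {
    "m1": ("mul", 1, []),
    "m2": ("mul", 1, []),
    "d1": ("div", 16, ["m1"]),
    "d2": ("div", 16, ["m2"]),
    "a1": ("add", 1, ["d1", "d2"]),
}

if __name__ == "__main__":
    print(makespan(OPS, {"mul": 1, "div": 1, "add": 1}))  # one shared divider
    print(makespan(OPS, {"mul": 1, "div": 2, "add": 1}))  # two dividers
```

With one shared divider the two independent divisions serialize and the iteration takes 34 cycles; a second divider brings it down to 19 cycles at roughly twice the divider area, which is the kind of tradeoff explored in the following sections.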
V. SYSTEM PARTITIONING
Based on the data dependency graph, we partition the system into two major blocks: the initialization stage and the recursion stage. We also identify the critical path in terms of system latency from this graph. Fig. 3 shows the major partitioning of the system. The data vectors X and Y are input to the INIT block through a RAM block generated by the Xilinx CoreGen tool. Although dual-port RAMs are usually used for inter-process communication, X and Y are only read, so we can use a MUX to multiplex ADDR_RD and ADDR_WR onto a single RAM block, which is cheaper than a dual-port RAM.
1). Initialization:
The operations in the INIT block are shown in the figure. Since $R_j$ and $x^j Y$ in (R.1), (R.3), (R.5) are used in all iterations with no dependency on other variables, we initialize them before the iteration. The architecture will be discussed in more detail.
FUs sharing ∆ *TE, we can reduce the explicit divisions to 2 in each iteration. With a pipelined divider, the two divides in R.2 and R.4 can be pipelined. R.4, R.7 and R.8 can run in parallel, while R.4 can also be pipelined with R.2 since they have a similar computation pattern for the $j$th order vector. The computation latency in the $j$th iteration can be reduced by about half, with $4j+4$ instead of $6j+10$ multiplications. For a $P$th order polynomial, the complexity is reduced to $O(2P^2+2P)$ from $O(3P^2+7P)$ for multiplications and to $17P$ from $48P$ for division cycles. Moreover, by comparing the MSE with a threshold, we can terminate the iteration once the MSE has converged below the threshold.
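The totals quoted above follow from summing the per-iteration multiplication counts over all orders. The short check below assumes the iteration index runs over j = 0, ..., P-1; this indexing is an assumption, chosen because it reproduces both closed forms exactly. The function names are illustrative.

```python
def shared_mults(j):
    """Multiplications in iteration j with resource sharing and pipelining."""
    return 4 * j + 4


def original_mults(j):
    """Multiplications in iteration j of the unoptimized recursion."""
    return 6 * j + 10


def total_mults(per_iteration, order):
    """Sum a per-iteration count over j = 0 .. order-1 (assumed indexing)."""
    return sum(per_iteration(j) for j in range(order))


if __name__ == "__main__":
    for p in (4, 10):
        print(p, total_mults(shared_mults, p), total_mults(original_mults, p))
```

For P = 10 the totals are 220 and 370 multiplications, matching the closed forms 2P^2+2P and 3P^2+7P.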
3). Pipelined Pre-distortion:
In the pre-distortion stage, each transmitted sample needs to be processed at the sampling rate, so the accumulation architecture is not fast enough. Fig. 4 shows a pipelined architecture that applies a delay line in a nested-filter type of architecture. In this architecture, we only need 2P multipliers for the pow(P) computation and P adders for the accumulation.
Although this consumes more area, it is still acceptable because the algorithm typically converges at an order P<10. All functional units are fully utilized in a pipelined style, achieving the highest throughput.
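A cycle-by-cycle software model helps to see why the pipelined form sustains one sample per clock. The sketch below assumes the pre-distorter evaluates a polynomial with powers 1 to P of the input sample; that form, the function names and the coefficient values are assumptions, since the exact expression is not reproduced here. Each stage holds the delayed sample, its running power and the partial sum, and uses at most two multipliers and one adder.

```python
def predistort_direct(x, coeffs):
    """Reference: sum of coeffs[p] * x**(p+1), evaluated sample by sample."""
    return sum(a * x ** (p + 1) for p, a in enumerate(coeffs))


def predistort_pipeline(samples, coeffs):
    """Pipeline model: stage p holds (delayed x, x**(p+1), partial sum)."""
    depth = len(coeffs)
    stages = [None] * depth  # pipeline registers, None marks a bubble
    out = []
    # Extra idle cycles at the end flush the samples still in flight.
    for x in list(samples) + [None] * depth:
        if stages[-1] is not None:
            out.append(stages[-1][2])
        # Update from the last stage backwards so each stage reads old data.
        for p in range(depth - 1, 0, -1):
            prev = stages[p - 1]
            if prev is None:
                stages[p] = None
            else:
                xd, power, acc = prev
                power = power * xd                                # power chain
                stages[p] = (xd, power, acc + coeffs[p] * power)  # scale, add
        stages[0] = None if x is None else (x, x, coeffs[0] * x)
    return out


if __name__ == "__main__":
    coeffs = [1.0, 0.5, 0.25]  # hypothetical pre-distorter coefficients
    print(predistort_pipeline([2.0, 1.0, -1.0], coeffs))
```

After an initial latency of P cycles, one finished sample leaves the last stage every cycle, whereas a single shared multiply-accumulate unit would need on the order of P cycles per sample.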
4). Architecture Area/time tradeoffs:
With this partitioning, the major computation load is the INIT of the R matrix and the $Y^H X$ cross-correlation for a polynomial if P<<N, which is always true in practice. A naïve implementation of R in row order has a prohibitive complexity of $N P (P-1)/2$ multiplications for the X matrix and another $N P^2$ for the $X^T X$ matrix computation. The storage is also considerably large. However, it is shown that $R_j$ is a Hankel matrix. By merging the two matrix computations, we save not only many computations but also many memory accesses. The comparison of the two schemes is summarized in Table 1 and Table.
Fig. 4. Pipelined architecture for pre-distortion.
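The Hankel property can be checked numerically. For real-valued samples, entry (i, j) of the product of the transposed Vandermonde matrix with itself equals the sum of the samples raised to the power i+j, so only the 2P+1 distinct power sums need to be accumulated. The sketch below compares this with the naive entry-by-entry product; the function names and sample values are assumptions, and powers 0 to P are assumed for the columns.

```python
def gram_naive(x, order):
    """X^T X computed entry by entry from the explicit Vandermonde matrix."""
    v = [[xi ** p for p in range(order + 1)] for xi in x]
    n = order + 1
    return [[sum(row[i] * row[j] for row in v) for j in range(n)]
            for i in range(n)]


def gram_hankel(x, order):
    """Same matrix built from the 2*order+1 power sums s_k = sum(x**k)."""
    sums = [0.0] * (2 * order + 1)
    for xi in x:
        power = 1.0
        for k in range(2 * order + 1):
            sums[k] += power   # accumulate x**k
            power *= xi        # one multiplication per power
    n = order + 1
    # Hankel structure: entry (i, j) depends only on i + j.
    return [[sums[i + j] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0, 1.5]
    for row in gram_hankel(x, 2):
        print(row)
```

The merged scheme needs about 2P multiplications per sample instead of forming and multiplying the full N-by-(P+1) matrix, and it never stores that matrix.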
There are also many tradeoffs for the pow and accumulation architecture. The area-constrained VLSI architecture is shown in Fig. 5 (a). It resembles a processor type of architecture, repeating the computation within the loop structure on the same FUs (multiplier and adders). It achieves the best resource sharing because of the sequential operation. We lay out the FUs and use a Finite State Machine (FSM) and logic design to generate the address signals for memory access and the control signals for the MUX, with only one multiplier and one adder for the whole system. However, this area-constrained architecture may take too much time. To achieve the highest throughput, a time-constrained architecture is shown in Fig. 5 (b). Because the computations of $x^j$ for the N samples are independent, we can achieve full parallelism and pipelining by laying out an identical Process Entity (PE) for each sample. This requires at least N multipliers and all storage mapped to register files, because memory access can stall the pipeline. The area is too large and the achieved speed is more than required. In the area/time efficient architecture shown in Fig. 5 (c), we duplicate only as many PEs as are needed to still meet the time requirement. We divide the memories (X, Y, tmpXJ) according to the number of unrolled PEs. This architecture acts like multiple DSP processors in parallel, sized according to the speed requirement.
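The semi-parallel option can be modeled in software as memory banking plus partial accumulation. In the sketch below, the samples are interleaved across one bank per PE, each PE accumulates its own power sums, and the partial results are merged at the end. The cycle estimate is a deliberately crude assumption (one cycle per power per sample, no control overhead), and the function names are illustrative.

```python
def partition(samples, num_pe):
    """Interleave the samples across num_pe memory banks, one bank per PE."""
    return [samples[k::num_pe] for k in range(num_pe)]


def accumulate_powers(samples, max_power, num_pe):
    """Power sums sum(x**k), k = 0..max_power, computed by num_pe parallel PEs.

    Returns (totals, cycles) where cycles is a rough latency estimate.
    """
    banks = partition(samples, num_pe)
    # Each PE works only on its own bank.
    partial = [[sum(x ** k for x in bank) for k in range(max_power + 1)]
               for bank in banks]
    # Final merge of the per-PE partial sums.
    totals = [sum(p[k] for p in partial) for k in range(max_power + 1)]
    cycles = max(len(bank) for bank in banks) * max_power
    return totals, cycles


if __name__ == "__main__":
    data = list(range(1, 9))  # 8 hypothetical integer samples
    for pes in (1, 2, 4, 8):
        print(pes, accumulate_powers(data, 3, pes))
```

Sweeping the number of PEs in this way mimics the choice between Fig. 5 (a), (b) and (c): the result is identical, while the estimated latency shrinks in proportion to the number of PEs and the multiplier count grows with it.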
Because of the limited space, many important implementation issues such as the complex logic design and pipelining details as well as Finite State Machine (FSM) are not included in this paper.
The performance gain of the pre-distorter is depicted in Fig. 6. The output backoff (OBO) is defined as the ratio of the maximum output power (saturation power) to the mean power of the transmitted signal, $\mathrm{OBO(dB)} = 10\log_{10}(P_{maxo}/P_{avo})$. It determines the operating point of the amplifier. With pre-distortion, the BER curve at OBO = 7 dB is very close to the linear case, which is much better than the uncompensated non-linearity at the same OBO level.
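The OBO definition translates directly into a two-line computation. The sketch below evaluates it and also inverts it, which shows what power ratio an OBO of 7 dB corresponds to. The function names are illustrative and the power values are arbitrary.

```python
import math


def output_backoff_db(p_max, p_avg):
    """OBO in dB: saturation power over mean transmitted power."""
    return 10.0 * math.log10(p_max / p_avg)


def power_ratio_from_obo(obo_db):
    """Inverse relation: linear ratio of saturation power to mean power."""
    return 10.0 ** (obo_db / 10.0)


if __name__ == "__main__":
    print(output_backoff_db(10.0, 1.0))   # a ratio of 10 in linear units
    print(power_ratio_from_obo(7.0))      # linear ratio at the 7 dB point
```

An OBO of 7 dB means the mean transmitted power sits at roughly one fifth of the saturation power.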
VI. CONCLUSION
In this paper, we study efficient FPGA architectures of a recursive QR factorization for the Vandermonde system in an OFDM digital adaptive pre-distorter. The architectures utilize the arithmetic parallelism and complex logic design for pipelining and resource sharing. They demonstrate an efficient area/time tradeoff while meeting the real-time requirement. The architecture with partial optimizations is implemented on a Xilinx FPGA real-time platform and takes 11348 cycles at a 25 ns clock period for N=100, P=10, which corresponds to a latency of approximately 283.7 µs.
