Abstract: Approximation of Toeplitz matrices with circulant matrices is a well-known approach to reducing the computational complexity of linear equalizers. This paper presents a novel technique for computing linear equalizer coefficients in the frequency domain. It is shown how a regularization term can help to reduce the error caused by the frequency-domain approximation. A corresponding VLSI implementation provides a reference for the true silicon complexity and for the complexity increase associated with the proposed algorithm.
I. Introduction
Many wireless communication systems (such as UMTS or CDMA2000) suffer from inter-symbol and multiple-access interference due to multipath propagation. In practice, this interference often exceeds the thermal noise in the system and becomes the performance-limiting factor. Hence, techniques must be found to mitigate the negative impact of multipath channels on error-rate performance. An efficient approach for emerging standards, in terms of both performance and computational complexity, is the use of orthogonal frequency division multiplexing (OFDM) modulation. However, solving the problem for established standards requires leaving the existing modulation schemes intact, so that signal processing at the receiver is the only viable option to suppress interference. Linear equalization makes it possible to partially restore the transmitted signal by inverting the transfer function of the multipath channel with a properly designed finite impulse response (FIR) filter at the receiver. The corresponding filter coefficients can be obtained using different approaches. One possibility is to employ low-complexity adaptive algorithms such as least mean squares (LMS) or recursive least squares (RLS) to adjust the filter coefficients directly based on a received training or pilot sequence. The drawback of this adaptive approach is that it often suffers from slow convergence and requires continuous tracking, which may entail a considerable computational effort. Another possibility is to first estimate and track the channel's impulse response and then compute the equalizer coefficients using an algorithm for direct matrix inversion (DMI). The main difficulty associated with this second approach is that a straightforward implementation requires the inversion of a large Toeplitz matrix, which is a costly operation in terms of computational complexity and memory consumption.
Hence, for the practical application of equalization algorithms that are based on explicit estimation of the channel's impulse response, we are interested in techniques that avoid or simplify the associated matrix inversion. Viable solutions to this problem have been described, for example, in [1], [2], and [3]. The basic idea in these publications is to start from the DMI-based approach and to approximate the Toeplitz matrices with circulant matrices. The latter are easy to invert in the frequency domain, resulting in a complexity that is comparable to that of an OFDM receiver. However, it is also mentioned in [1] and [2] that achieving good performance with the cyclic approximation requires a regularization (or conditioning) of the circulant matrices. Unfortunately, neither of the two papers gives guidelines for the computation of this term.
Contribution: In this paper, equalizer coefficients are computed using a circulant approximation of the convolution in the channel. The approach deviates slightly from [1] and [2], since our derivation is designed to yield an expression for the regularization term required to partially mitigate the error caused by the cyclic approximation. The approach is described for a single-input single-output system, but can also be adapted to multiple-input multiple-output systems. The paper also describes a corresponding VLSI architecture and provides reference implementation results.
Outline: The next section introduces the system model and briefly reviews the computation of a straightforward time-domain (TD) equalizer. In Section III, the low-complexity frequency-domain (FD) algorithm is introduced. Section IV then describes our regularized frequency-domain (RFD) equalizer algorithm, together with an analysis of its performance and of how its complexity scales with the length of the equalizer. A VLSI architecture and corresponding implementation results are presented in Section V.
II. System Model and Linear Equalizer
Consider a single-carrier communication system in which the delay spread of the frequency-selective channel exceeds the symbol period, causing inter-symbol interference (ISI). The sampled impulse response of this channel with $L_c$ significant taps is given by the vector $h = [h_0, h_1, \ldots, h_{L_c-1}]^T$, and the vector $x = [x_{N-1}, x_{N-2}, \ldots, x_0]^T$ denotes a collection of $N$ subsequent transmitted data symbols in time-reversed order. The linear convolution in the channel that yields a vector $y$ of $L_e$ consecutive samples, where $L_e$ is the length of the subsequent equalizer, can be described in matrix notation according to
$$y = Hx + n, \tag{1}$$
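where the $L_e \times (L_c + L_e - 1)$-dimensional Toeplitz matrix $H$ is given by
$$H = \begin{bmatrix} h_0 & h_1 & \cdots & h_{L_c-1} & 0 & \cdots & 0 \\ 0 & h_0 & h_1 & \cdots & h_{L_c-1} & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & & \ddots & 0 \\ 0 & \cdots & 0 & h_0 & h_1 & \cdots & h_{L_c-1} \end{bmatrix}. \tag{2}$$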
The vector $n$ denotes the additive noise (assumed i.i.d. Gaussian) with zero mean and variance $\sigma^2$ per complex dimension. The output of a linear equalizer of length $L_e$ at the receiver is obtained by convolving the received signal with the equalizer's length-$L_e$ impulse response $w^H$. In matrix notation, a single result $\tilde{x}$ of this convolution is obtained by premultiplying the received vector $y$ with $w^H$:
$$\tilde{x} = w^H y. \tag{3}$$
The vector $w$ is obtained from the Wiener-Hopf equations according to
$$w = \left(H H^H + \sigma^2 I\right)^{-1} H e_D, \tag{4}$$
where $e_D$ is a unit vector of appropriate length with a one in the $D$th row. The design parameter $D$ determines the combined delay of the channel and the equalizer and is set to $D = L_e/2$ in the following. The bottleneck at the receiver is the costly computation of (4), since the effort required for a straightforward matrix inverse grows according to $O(L_e^3)$. Hence, if $L_e$ is large (typically $L_e \geq 16$), the computational complexity quickly becomes prohibitive.
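As an illustration of the time-domain computation described above, the following minimal sketch builds the Toeplitz channel matrix and solves the Wiener-Hopf system with plain Gaussian elimination, which makes the cubic cost visible. It assumes unit-power i.i.d. symbols and a 0-based delay index `D`; the function names `toeplitz_channel`, `solve`, and `td_equalizer` are hypothetical and not taken from the paper.

```python
def toeplitz_channel(h, Le):
    """Le x (Lc + Le - 1) Toeplitz matrix: row i holds h shifted right by i."""
    Lc = len(h)
    N = Lc + Le - 1
    H = [[0j] * N for _ in range(Le)]
    for i in range(Le):
        for k in range(Lc):
            H[i][i + k] = complex(h[k])
    return H


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting, O(n^3)."""
    n = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def td_equalizer(h, Le, sigma2, D=None):
    """w = (H H^H + sigma2 I)^{-1} H e_D, assuming unit-power i.i.d. symbols."""
    H = toeplitz_channel(h, Le)
    N = len(H[0])
    D = Le // 2 if D is None else D  # assumption: D is a 0-based index
    # Le x Le matrix R = H H^H + sigma2 * I
    R = [[sum(H[i][k] * H[j][k].conjugate() for k in range(N))
          + (sigma2 if i == j else 0.0)
          for j in range(Le)] for i in range(Le)]
    p = [H[i][D] for i in range(Le)]  # H e_D is column D of H
    return solve(R, p)
```

For a single-tap channel the matrix $H$ reduces to the identity, so the only non-zero coefficient is the one at index `D`, scaled by $1/(1 + \sigma^2)$.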
III. Frequency-Domain Equalization
To reduce this complexity, we start by defining an $\tilde{L}_e \times \tilde{L}_e$-dimensional circulant matrix $\tilde{H}$ whose first row $\tilde{h}$ is given by the vector obtained from zero-extending $h$ to a length of $\tilde{L}_e$. Note that, as opposed to [2], we allow for $\tilde{L}_e \geq L_e$. Since $\tilde{H}$ is circulant, the corresponding inverse is given by $\tilde{H}^{-1} = F \Lambda^{-1} F^H$, where $\Lambda$ is a diagonal matrix composed of the elements of the discrete Fourier transform (DFT; see footnote 1) of $\tilde{h}$, and where $F$ denotes the Fourier transformation matrix of appropriate dimension ($F F^H = I$). The approximate frequency-domain equalizer $\tilde{w}$ of length $L_e$ under the cyclic assumption is now given by (5), where $e_D$ is again a unit vector of appropriate length with a one in the $D$th row. The matrix $Z = [I_{L_e \times L_e}, 0_{L_e \times (\tilde{L}_e - L_e)}]$ in (5) serves to truncate the impulse response from $\tilde{L}_e$ to the desired length $L_e$.

For a performance comparison between the FD equalizer computed according to (5) and the TD equalizer given by (4), consider the cumulative distribution function of the squared error (SE) at the receiver after equalization. Corresponding simulation results for a channel with eight sample-spaced, equal-power taps and for an equalizer length of $L_e = 32$ and $L_e = 64$ are shown in Fig. 1. As expected, the FD equalizer suffers from a loss in terms of the average SE (taken in dB). However, it can also be seen that in many cases the FD equalizer completely fails to suppress the ISI, causing excessive levels of interference.
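The idea of inverting the circulant approximation bin by bin can be sketched as follows. This is a minimal illustration, not the paper's exact expression (5): it assumes the equalizer is a filter `g` whose circular convolution with the zero-extended channel equals a unit pulse at a 0-based delay `D`, and it ignores the conjugation convention implied by $w^H$. The names `dft`, `fd_equalizer`, and `linear_convolve` are hypothetical, and a naive DFT stands in for the FFT.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) DFT; a real implementation would use an FFT."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
               for m in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def fd_equalizer(h, Le, Lt, D=None):
    """First Le taps of the cyclic inverse of h (zero-extended to Lt), delayed by D."""
    D = Le // 2 if D is None else D  # assumption: D is a 0-based index
    h_ext = [complex(v) for v in h] + [0j] * (Lt - len(h))
    lam = dft(h_ext)  # diagonal entries of Lambda
    # Per-bin inversion; a spectral null (lam[k] == 0) makes this division fail.
    G = [cmath.exp(-2j * cmath.pi * k * D / Lt) / lam[k] for k in range(Lt)]
    g_full = dft(G, inverse=True)
    return g_full[:Le]  # truncation from Lt to Le taps


def linear_convolve(a, b):
    """Linear convolution, used to check the combined channel-equalizer response."""
    out = [0j] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out
```

For a well-conditioned channel such as `h = [1.0, 0.5]`, the combined response `linear_convolve(fd_equalizer(h, 32, 64), h)` is close to a unit pulse at index 16. For a channel with an exact spectral null, such as `h = [1, -1]`, the unregularized division breaks down, which is one way to picture the equalization failures mentioned above.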
IV. Regularized Frequency-Domain Equalizer
The goal of the subsequent derivation is to reduce the mean squared error (MSE) caused by the cyclic approximation and to avoid complete equalization failures. To this end, it is proposed to employ an additive noise model that takes the approximation error into account when computing the coefficients $\tilde{w}$ of the regularized FD equalizer. For clarity of exposition, the corresponding algorithm is derived for the high-SNR case, in which the influence of the thermal noise is neglected ($E\{\|n\|^2\} = 0$).

Footnote 1: The DFT operator is normalized so that $\operatorname{trace}(\Lambda^H \Lambda) = \tilde{L}_e \|h\|^2$.
A. RFD Algorithm
The derivation starts by writing a received sample $\tilde{x}_D$ after equalization according to (6).
The next step is to express the convolution of the transmitted signal with the channel in (6) based on the circulant matrix $\tilde{H}$. To this end, substitute $H = Z\tilde{H}K$ into (6), with the matrix $K$ as given in (7).
Further substituting $Z^H Z \tilde{H} = \tilde{H} - P$ with $P = \tilde{H} - Z^H Z \tilde{H}$ and subsequently $\tilde{H} = F \Lambda F^H$ yields (8), where $\tilde{n} = F^H P K x$ subsumes the source of residual interference from the cyclic approximation. In our model, the components of $\tilde{n}$ are described by zero-mean, independent identically distributed (i.i.d.) random variables with variance $\sigma_{\tilde{n}}^2$ per complex dimension. To compute this variance, consider first the expected overall interference-noise power, which is given by $E\{\|\tilde{n}\|^2\} = \|PK\|_F^2$, where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. Since this interference noise is modeled as i.i.d., $\sigma_{\tilde{n}}^2 = E\{\|\tilde{n}\|^2\}/\tilde{L}_e$, which can be written in the form (9) by exploiting the structure of the matrix $PK$.
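The variance of the interference noise can be evaluated numerically by brute force, as in the sketch below. It relies on a loudly stated assumption: the matrix $K$ is taken to be $[I_N; 0]$ with $N = L_c + L_e - 1$, which satisfies $H = Z\tilde{H}K$ whenever $\tilde{L}_e \geq N$, but which may differ from the definition used in the paper. The function names are hypothetical, and the closed form (9) is not reproduced.

```python
def circulant_from_first_row(r):
    """Row i is the first row cyclically shifted right by i positions."""
    n = len(r)
    return [[r[(j - i) % n] for j in range(n)] for i in range(n)]


def cyclic_error_variance(h, Le, Lt):
    """sigma_n^2 = ||P K||_F^2 / Lt with P = H~ - Z^H Z H~.

    ASSUMPTION: K = [I_N; 0] (Lt x N, N = Lc + Le - 1), valid for Lt >= N.
    The paper's own definition of K is not available here.
    """
    Lc = len(h)
    N = Lc + Le - 1
    if Lt < N:
        raise ValueError("this sketch requires Lt >= Lc + Le - 1")
    h_ext = list(h) + [0] * (Lt - Lc)
    Ht = circulant_from_first_row(h_ext)
    # Z^H Z keeps rows 0..Le-1 of H~, so P holds the remaining rows Le..Lt-1.
    # Multiplying by the assumed K keeps columns 0..N-1.
    frob2 = sum(abs(Ht[i][j]) ** 2 for i in range(Le, Lt) for j in range(N))
    return frob2 / Lt
```

Under this assumption a single-tap channel produces no cyclic approximation error, while `cyclic_error_variance([1, 1], 2, 4)` evaluates to 0.5. Dividing the result by $\sigma_z^2$ gives the regularization term used in the per-bin inversion.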
To obtain the RFD equalizer coefficients from this additive-noise model, substitute $\tilde{z} = F^H K x$ into (8) and consider now the problem of estimating $\tilde{z}$ from
$$y = \Lambda \tilde{z} + \tilde{n}. \tag{10}$$
The entries of $\tilde{z}$ are treated as if they were zero-mean, i.i.d. random variables with variance $\sigma_z^2 = N/\tilde{L}_e$ per complex dimension. The corresponding minimum mean squared error estimator is given by (11), which can then be used in (6) to obtain $\tilde{w}$.
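For the diagonal model (10) with i.i.d. entries, the linear MMSE estimator acts on each frequency bin separately with the gain $\lambda_i^* / (|\lambda_i|^2 + \sigma_{\tilde{n}}^2/\sigma_z^2)$. The sketch below applies this standard result; whether it matches the exact form of (11) is an assumption, as are the 0-based delay `D`, the parameter name `rho` for the ratio $\sigma_{\tilde{n}}^2/\sigma_z^2$, and the function names.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) DFT; a real implementation would use an FFT."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
               for m in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def regularized_bins(lam, rho):
    """Per-bin gain conj(lam) / (|lam|^2 + rho); rho = 0 gives the plain inverse."""
    return [v.conjugate() / (abs(v) ** 2 + rho) for v in lam]


def rfd_equalizer(h, Le, Lt, rho, D=None):
    """Regularized frequency-domain taps, delayed by D and truncated to Le."""
    D = Le // 2 if D is None else D  # assumption: D is a 0-based index
    h_ext = [complex(v) for v in h] + [0j] * (Lt - len(h))
    lam = dft(h_ext)
    gains = regularized_bins(lam, rho)
    G = [cmath.exp(-2j * cmath.pi * k * D / Lt) * gains[k] for k in range(Lt)]
    return dft(G, inverse=True)[:Le]
```

With `rho > 0` the magnitude of every bin gain is bounded by $1/(2\sqrt{\rho})$, so a spectral null in the channel no longer produces unbounded coefficients.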
B. Efficient Implementation and Complexity
A block diagram summarizing the procedure for computing the RFD equalizer is shown in Fig. 2. To properly exploit the complexity savings offered by the (R)FD algorithm, $\tilde{L}_e$ must be chosen as a power of two. This choice enables the use of the fast Fourier transform (FFT) to obtain the diagonal entries $\lambda_i$ of $\Lambda$ according to $[\lambda_1, \lambda_2, \ldots, \lambda_{\tilde{L}_e}] = \mathrm{FFT}\{\tilde{h}\}$, which has a complexity scaling behavior of $O(\tilde{L}_e \log_2 \tilde{L}_e)$. The reduced computational complexity compared to the TD algorithm results from the fact that the (regularized) matrix inversion in the frequency domain in (11) is trivial, with a complexity scaling behavior of $O(\tilde{L}_e)$. The final conversion of $\tilde{\Lambda}^{-1}$ back into the time domain is performed by an IFFT, which can be truncated to yield only $\tilde{w}_i$ for $i = 1, \ldots, L_e$.
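The power-of-two requirement can be seen in a textbook radix-2 FFT, sketched below as a generic illustration rather than the transform used in the implementation. The helper `channel_eigenvalues` is a hypothetical name for the step that zero-extends the channel and returns the diagonal of $\Lambda$. The unnormalized transform used here is consistent with the normalization in footnote 1.

```python
import cmath


def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def channel_eigenvalues(h, Lt):
    """Diagonal entries of Lambda: FFT of h zero-extended to length Lt."""
    h_ext = list(h) + [0] * (Lt - len(h))
    return fft(h_ext)
```

Summing the squared magnitudes of the returned values gives $\tilde{L}_e \|h\|^2$, which is a convenient sanity check on the scaling.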
The only overhead in the computation of the RFD algorithm compared to the FD algorithm results from (9). While the associated complexity scaling behavior is given by $O(L_e)$, the corresponding number of operations is very low compared to the other required operations and is therefore usually negligible.
C. Performance Analysis
The MSE performance of the RFD algorithm is illustrated by the simulation results shown in Fig. 1. In the example, the use of the regularization term reduces the average of the SE (taken in dB) by approximately 3.5 dB compared to the FD algorithm without regularization. In addition, the probability that the FD equalization leads to excessive interference is significantly reduced, as desired. Fig. 3 shows a comparison of the residual MSE after TD, FD, and RFD equalization for different channel and equalizer lengths. Over the entire range of the simulation, the RFD algorithm provides a significant improvement over the FD approximation without regularization. It can also be seen that increasing the length of the equalizer improves the MSE performance, since the circulant approximation becomes more accurate. However, it is also evident that the TD equalizer still outperforms both low-complexity frequency-domain schemes.
V. VLSI Implementation
In order to assess the true silicon complexity of the proposed algorithm, we briefly describe a corresponding VLSI implementation. The circuit under consideration is designed for a maximum channel length of $L_c = 16$, an equalizer length of $L_e = 32$, and an FFT/IFFT length of $\tilde{L}_e = 64$.
A. VLSI Architecture
The basic idea behind the chosen high-level VLSI architecture is to extend the arithmetic unit (butterfly unit) of an FFT/IFFT processor for the computation of (9) and (11). A suitable starting point for the 64-point FFT/IFFT operations required for $\tilde{L}_e = 64$ is a radix-4 architecture [4] implemented with a single time-shared radix-4 butterfly unit (BFU), as shown in Fig. 4. The memory of the FFT/IFFT processor is divided into four independent dual-ported banks (each holding 16 data words). The connection between these banks and the BFU is made through two bus barrel shifters, which can cyclically shift the connections between the four memories and the four inputs/outputs of the BFU.
The extended radix-4 BFU comprises three pipeline stages, which corresponds to the maximum degree of pipelining that does not cause any data hazards in this configuration.
The additional components added for the computation of (9) and (11) are highlighted in the schematic in Fig. 5. The fast computation of (9) adds almost no overhead, since three out of four branches of the standard radix-4 BFU already contain multipliers that can easily be reconfigured for the required norm computations. A squarer is added to the originally multiplier-free first branch of the radix-4 BFU so that four values are always processed in parallel. The norms are summed by the accumulator circuit shown in the top-right corner of Fig. 5 and then scaled. This accumulator and the subsequent constant-coefficient multiplier are the only components that can be removed when implementing the FD instead of the RFD algorithm. The computation of (11) is carried out in the second and third pipeline stages of the BFU. To this end, the multipliers and adders in the second stage are reused to obtain $|\lambda_i|^2$ and to add the regularization term $\sigma_{\tilde{n}}^2/\sigma_z^2$. For the FD algorithm, this term is simply set to zero. The subsequent division must be carried out on dedicated dividers, which have to be added to the standard radix-4 BFU. These dividers are responsible for most of the overhead (in terms of area and delay) compared to a standard radix-4 FFT/IFFT processor.
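The reuse of the butterfly datapath can be pictured with a small behavioral model. The sketch below assumes a decimation-in-time radix-4 butterfly in which inputs 1 to 3 are multiplied by twiddle factors while input 0 passes through unmultiplied; this structure, the function names, and the separate `norm_accumulate` mode are illustrative assumptions, not a description of the actual circuit.

```python
def radix4_butterfly(din, twiddles):
    """Behavioral radix-4 DIT butterfly.

    din:      four complex inputs
    twiddles: three twiddle factors applied to din[1], din[2], din[3]
    """
    a0 = din[0]                    # multiplier-free first branch
    a1 = din[1] * twiddles[0]
    a2 = din[2] * twiddles[1]
    a3 = din[3] * twiddles[2]
    # The 4-point DFT core needs only additions and multiplications by +/-1, +/-j.
    s02, d02 = a0 + a2, a0 - a2
    s13, d13 = a1 + a3, a1 - a3
    return [s02 + s13, d02 - 1j * d13, s02 - s13, d02 + 1j * d13]


def norm_accumulate(values, acc=0.0):
    """Norm mode: add the squared magnitudes of four values to an accumulator."""
    return acc + sum((v * v.conjugate()).real for v in values)
```

With all twiddle factors set to one, `radix4_butterfly` reduces to a plain 4-point DFT. In norm mode the three existing multipliers and the added squarer would each form one squared magnitude per cycle.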
B. Operation
The operation sequence of the FD/RFD circuit is illustrated in Fig. 6. First, the 16 entries of the channel impulse response vector $h$ are read. At the same time, the RFD implementation computes (9) on the extended BFU (the corresponding connection, shown in Fig. 4, is not needed for the FD algorithm). Next, a zero-extended 64-point FFT is computed. It requires 32 cycles plus three cycles to flush the BFU pipeline; only 32 cycles are needed because the first butterfly stage of a zero-extended FFT can be skipped. Next, (11) is computed in 16+3 cycles, followed by a 64-point IFFT in 48+3 cycles. Unloading the 32 data words of interest requires another 8 cycles, since four words can be accessed per cycle. In total, a single equalizer update requires 120 clock cycles for both the FD and the RFD algorithm.
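The stated cycle counts can be cross-checked with a few lines of arithmetic. The sketch below assumes that one radix-4 butterfly (four words) is processed per cycle, which is consistent with the numbers given but not stated explicitly; the function name `cycle_budget` and its parameters are hypothetical. The length of the initial read phase is not given in the text, so it is reported only as the remainder of the 120-cycle total.

```python
def cycle_budget(fft_len=64, n_taps_out=32, words_per_cycle=4,
                 pipeline_depth=3, total=120):
    """Reproduce the per-phase cycle counts of one equalizer update."""
    stages, n = 0, fft_len
    while n > 1:                                  # radix-4 stages: log4(fft_len)
        n //= 4
        stages += 1
    butterflies_per_stage = fft_len // 4          # 16 for a 64-point transform
    # Zero-extended FFT: the first stage is skipped.
    fft_cycles = (stages - 1) * butterflies_per_stage + pipeline_depth
    inversion_cycles = fft_len // words_per_cycle + pipeline_depth
    ifft_cycles = stages * butterflies_per_stage + pipeline_depth
    unload_cycles = n_taps_out // words_per_cycle
    stated = fft_cycles + inversion_cycles + ifft_cycles + unload_cycles
    return {
        "fft": fft_cycles,
        "inversion": inversion_cycles,
        "ifft": ifft_cycles,
        "unload": unload_cycles,
        "stated_phases": stated,
        "remaining_for_read": total - stated,     # not stated in the text
    }
```

The four phases with explicit counts add up to 113 cycles, which leaves 7 of the 120 cycles for reading the channel impulse response.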
C. Implementation Results
Table I shows the VLSI implementation results for the original FFT/IFFT processor and for the derived FD and RFD circuits. Compared to the FFT/IFFT processor, the FD and RFD circuits suffer from a 75% increase in silicon area and achieve only one third of the original clock frequency. The main reason for this speed penalty is the slow 12-bit radix-2 dividers. Nevertheless, the achievable update period of 2.15 µs is already sufficient for many applications with mobile speeds in excess of 200 km/h at carrier frequencies up to 5 GHz. Hence, the area overhead for additional pipeline registers in the dividers would not be justified.
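A rough plausibility check of the timing claim is sketched below. It assumes that the 2.15 µs update period corresponds to the 120 clock cycles of one equalizer update, so the resulting clock frequency is a derived value rather than a figure from Table I. The maximum Doppler shift uses the standard relation $f_D = v f_c / c$; the function names are hypothetical.

```python
def implied_clock_mhz(cycles=120, update_us=2.15):
    """Clock frequency implied by one update of `cycles` cycles in `update_us`."""
    return cycles / update_us          # cycles per microsecond equals MHz


def max_doppler_hz(speed_kmh=200.0, carrier_hz=5e9, c=299_792_458.0):
    """Maximum Doppler shift f_D = v * f_c / c."""
    return speed_kmh / 3.6 * carrier_hz / c


def updates_per_doppler_period(update_us=2.15, **kwargs):
    """Number of equalizer updates that fit into one period 1 / f_D."""
    return 1.0 / (max_doppler_hz(**kwargs) * update_us * 1e-6)
```

Under these assumptions the implied clock is roughly 56 MHz, and several hundred coefficient updates fit into one Doppler period at 200 km/h and 5 GHz, which supports the statement that the update period is not the limiting factor.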
VI. Conclusions
Computing the coefficients of a linear equalizer in the frequency domain (via FFTs) using a circulant approximation of the Toeplitz matrix that describes the convolution in the channel is a viable approach to reducing computational complexity. The associated loss in mean squared error performance from the circulant approximation can be partially mitigated by including a regularization term derived from the channel's impulse response. The incorporation of this term entails only a marginal increase in computational complexity, and the associated hardware overhead in a VLSI implementation is below 2%.
