Abstract-This paper presents an architecture of an SVD based channel estimator. A number of simplifications of the estimator are presented. These simplifications reduce the complexity of the estimator enough to allow for a hardware implementation. A system is defined, in which the channel estimator is to be used. It is shown that with the proposed pilot symbol pattern and the use of the channel correlation properties, the estimator reduces the number of multiplications by 66 % compared to a brute force approach. The hardware architecture of the channel estimator is also described. The estimator has a throughput of 1/6 estimates per clock cycle, and the word lengths are chosen such that there is only a negligible increase in mean square error compared to the floating point case and still a fivefold improvement compared to the least square estimator.
I. INTRODUCTION
Multiple Input Multiple Output (MIMO) technology offers an increased bandwidth efficiency and thus enables higher data rates [1]. The capacity gain comes from the different available signal paths from the transmitter to the receiver, which make it possible for the receiver to distinguish between different transmit antennas. However, the different paths also complicate the situation for the receiver, since different signal paths have different delay and attenuation.
Various methods are available to overcome this issue. A method that has become popular in recent years is the Orthogonal Frequency Division Multiplexing (OFDM) scheme. This method allows the receiver to collect all the energy from the different paths before making a decision on the transmitted symbol. The price that has to be paid for this is that some bandwidth and power are lost in the transmission of the so-called cyclic prefix (CP) [2].
To efficiently use the MIMO OFDM system, the channel must be known by the receiver. This means that the receiver must have knowledge of the attenuation and the delays of the different paths. In a narrowband system, as is provided by MIMO OFDM, it is sufficient to know the composite paths from transmitter to receiver. Each component is expressed as a complex number and the channel can thus be described as a set of matrices, one per subchannel, with the dimensions Nr × Nt, where Nr is the number of receive antennas and Nt the number of transmit antennas.
In many cases, such as for a mobile device, the channel changes over time and must be re-estimated for every data burst. This means that a scheme is needed for the channel estimation. This paper looks at the hardware implementation of a certain channel estimator based on Singular Value Decomposition (SVD), taking into account the correlation between the different subcarriers in the OFDM system. Section II describes the previous work within the field and gives motivation for this paper. In section III the channel and system models used for the estimator are explored, after which section IV describes the estimator algorithm. Then in section V the actual hardware implementation of the estimator is discussed. Some simulation results are shown in section VI before the paper is concluded in section VII.
II. PREVIOUS WORK AND MOTIVATION
Since channel estimation is an important task in many wireless communication systems, there have been many suggestions on how to perform this operation. In OFDM systems, the main theoretical ideas have been developed in a number of papers [3] - [5] . Combining OFDM with MIMO adds somewhat to the complexity. This has also been treated in the literature [6] , [7] . However, all of this work is theoretical and the authors of this paper have not found any VLSI implementations and tests of these ideas presented in the literature. The MIMO OFDM implementations available use only basic least square estimates, which do not look at the subcarrier correlation [8] , [9] . This method is simpler, but gives a worse Bit Error Ratio (BER), requiring higher Signal-to-Noise Ratio (SNR) for the systems to work properly. The need for higher SNR will translate to lower data transfer speed. This makes the channel estimator an interesting field of research for high speed systems.
III. CHANNEL AND SYSTEM MODEL

A. Channel Model
In an OFDM system, the cyclic prefix should not be shorter than the channel impulse response, or the orthogonality between the subcarriers is broken. An overly long channel impulse response may also introduce inter-symbol interference (ISI) between two consecutive symbols. Thus, the channel model assumes that the channel impulse response is not longer than the CP.
Between each transmit-receive antenna pair there are a number of different signal paths. The model assumes a quasi-stationary channel. This means that the coherence time of the channel is longer than one OFDM frame or burst, containing a number of OFDM symbols. The channel will not change during transmission and it is sufficient to estimate the parameters in the beginning of the burst.
The energy in the different paths is typically higher for the components arriving early, since they have traveled a shorter distance. This is illustrated in Fig. 1 and mathematically expressed as

$$h_{i,j}(\tau) = \sum_{n} \alpha_n \, \delta(\tau - \tau_n) \qquad (1)$$

where αn and τn describe the complex amplitude and delay of the different paths, δ is the Dirac function where δ(0) = 1, and i and j index the different receive and transmit antennas.
To get the frequency response of the channel, a Discrete Fourier transform is performed as

$$H_{i,j}(f) = \sum_{n} \alpha_n \, e^{-\mathrm{j} 2 \pi f \tau_n} \qquad (2)$$

where the normalized frequency f = f_k = k/N is assumed and the upright j denotes the imaginary unit. The different frequency output values of this transform describe the channel and are sought by the channel estimator. Each antenna pair, i.e., one transmit and one receive antenna, has its own independent frequency response. All of them still follow the same model as presented here, though.
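As a rough illustration of the channel model, the following Python sketch draws a hypothetical set of path amplitudes whose mean power decays with delay and evaluates the frequency response at f_k = k/N. The number of paths, the decay constant, the integer sample delays, and the function names are assumptions made for this example and are not taken from the paper.

```python
import cmath
import math
import random

def random_taps(num_taps=8, decay=0.5, seed=0):
    """Draw hypothetical complex path amplitudes whose mean power decays with delay."""
    rng = random.Random(seed)
    taps = []
    for n in range(num_taps):
        sigma = math.sqrt(math.exp(-decay * n) / 2.0)  # std dev per real/imag part
        taps.append(complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)))
    return taps

def frequency_response(taps, num_subcarriers=64):
    """Evaluate H(f_k) = sum_n alpha_n * exp(-j*2*pi*f_k*n) at f_k = k/N.

    Path n is assumed to arrive with a delay of exactly n samples.
    """
    n_sc = num_subcarriers
    return [
        sum(a * cmath.exp(-2j * math.pi * k * n / n_sc) for n, a in enumerate(taps))
        for k in range(n_sc)
    ]

H = frequency_response(random_taps())  # one transmit-receive antenna pair
```

With at most L = 8 paths, the response varies slowly over the 64 subcarriers, which is the subcarrier correlation the estimator later exploits.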
B. System Model
A 4 × 4 MIMO OFDM system is assumed, i.e., there are Nt = 4 transmit antennas and Nr = 4 receive antennas. Furthermore, the system has N = 64 subcarriers and a cyclic prefix of length L = 8, i.e., 1/8th of the length of the OFDM symbol without a CP. The system is full rate, i.e., each of the transmit antennas provides an independent data stream on each of its subcarriers. The antennas are sufficiently separated to be considered as independent [10]. The data is encoded using Quadrature Phase Shift Keying (QPSK). Thus each subchannel transmits two bits per OFDM symbol. In addition, the first four symbols of each burst are pilot symbols that are used for estimating the channel parameters.
It should be noted that not all system parameters are of equal concern for the channel estimator. Therefore this rather simple system model is assumed for this implementation. Adaptation of the estimator to a more complex environment is quite straightforward, as the channel estimator is unaffected by different coding schemes. Only the number of subcarriers and the structure of the pilot symbols are of concern. Thus, an increase in the number of subchannels or a more complex constellation will only affect the size of the estimator and will not change the basic architecture.
1) Transmitter: The four different transmitters will be totally decoupled in this system. The schematic structure of one of them is shown in Fig. 2. There it is shown how the data is split into the different subchannels before the signal is transferred to the time domain by an Inverse Fast Fourier Transform (IFFT). Addition of the CP is performed after the IFFT operation, before the digital-to-analog (DA) conversion and the analog radio frequency (RF) operations.
2) Receiver: The receiver tends to be more complicated than the transmitter. This is especially true in a MIMO system, where every receive chain receives all of the transmitted signals. The first operations of the receiver can, however, be executed independently for each receive chain, as shown in Fig. 3. Here the received signal is digitized with an analog-to-digital (AD) converter after the RF operations. After that, the CP is removed and the serial time signal is parallelized, before the Fast Fourier Transform (FFT) transfers the time domain signal to the frequency domain. Thereafter each chain estimates its channel parameters. For the rest of the operations, the different receive chains have to collaborate in order to detect the transmitted data. Exactly how this is done is not described in detail here, but a number of different architectures exist [10].
3) Pilot Structure: Channel estimation cannot generally be done independently for the different receive chains. However, the system presented here has that possibility due to its pilot structure, with four pilot symbols in the beginning of each burst as shown in Fig. 4. A black square means that pilot data are transmitted at the given frequency and antenna. A white square means that the transmitter is not transmitting. Using this structure, where only one antenna is transmitting at any given time and frequency, ensures that it is possible to separate the different transmit antennas for each receive chain. It should be noted that using four pilot symbols makes it possible to transmit a pilot for each subchannel from each transmitter.
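The property described here, one active antenna per time and frequency slot and full subcarrier coverage after four pilot symbols, can be sketched in Python as follows. The exact layout of Fig. 4 is not reproduced in the text, so the cyclic interleaving used below and the name `active_antenna` are assumptions chosen only to satisfy the stated property.

```python
NT = 4   # transmit antennas
N = 64   # subcarriers

def active_antenna(symbol, subcarrier, nt=NT):
    """Return the single antenna sending a pilot in this symbol/subcarrier slot.

    Assumed interleaving: the antenna assignment is shifted by one
    subcarrier for each new pilot symbol. The layout in Fig. 4 may differ.
    """
    return (subcarrier + symbol) % nt

# For each antenna, collect the (symbol, subcarrier) slots it occupies.
slots = {a: [] for a in range(NT)}
for s in range(NT):
    for k in range(N):
        slots[active_antenna(s, k)].append((s, k))

# After the four pilot symbols, each antenna has sounded every subcarrier once.
covered = {a: sorted(k for _, k in slots[a]) for a in range(NT)}
```

Because no two antennas share a slot, each receive chain can attribute every received pilot to exactly one transmit antenna.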
IV. CHANNEL ESTIMATION
As stated above, the channel is estimated with a set of pilot symbols. The values of these are known by the receiver, and by using them it is possible to get a least square estimate of the channel as

$$\hat{h}_{ls} = X^{-1} y \qquad (3)$$

where y is the receive vector and X is a diagonal matrix containing the pilot symbols.
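A minimal Python sketch of the least square step is given below. Since X is diagonal, applying its inverse reduces to one division per subcarrier. The bit-to-symbol mapping, the channel values, and the function names are hypothetical and serve only as a noise-free toy example.

```python
import math

def qpsk(bits):
    """Map a bit pair to a unit-energy QPSK symbol (mapping chosen for illustration)."""
    b0, b1 = bits
    return complex(1 - 2 * b0, 1 - 2 * b1) / math.sqrt(2)

def ls_estimate(y, x):
    """Least square estimate: X is diagonal, so X^-1 y is an element-wise division."""
    return [yk / xk for yk, xk in zip(y, x)]

# Noise-free toy example with four subcarriers and hypothetical channel values.
pilots = [qpsk(b) for b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
h_true = [0.8 + 0.1j, 0.5 - 0.3j, -0.2 + 0.6j, 0.1 + 0.1j]
y = [h * x for h, x in zip(h_true, pilots)]
h_ls = ls_estimate(y, pilots)
```

For unit-magnitude pilots such as QPSK, dividing by a pilot is the same as multiplying by its complex conjugate, so each least square value costs one complex multiplication.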
A. MMSE Estimation
The least square estimate does not use the correlation between the different subchannels that is present due to the time limited power delay profile of the channel impulse response. This correlation can be used to improve the least square estimates of the channels. A better estimator that uses this information is the linear Minimum Mean Square Error (MMSE) estimator. It can be found as [5]

$$\hat{h}_{mmse} = W \hat{h}_{ls} \qquad (4)$$

where the weighting matrix W is

$$W = R_{hh} \left( R_{hh} + \frac{\beta}{SNR} I \right)^{-1} \qquad (5)$$

where R_hh is the channel autocovariance matrix and β is a constellation dependent constant. For QPSK modulation, β = 1. The channel correlation is known and R_hh can be precalculated. W depends on the SNR and will thus vary. However, the dependence is weak, and it is therefore possible to design for a pre-defined SNR level and thereby also precalculate W. The estimation thus turns into a constant matrix multiplication.
B. SVD Based Transform Estimation
The complexity of this operation is still too high, though. The number of multiplications equals the number of elements in W, which means N × N = N^2 multiplications per estimated transmit antenna for each receive antenna, i.e., a total of NrNtN^2 complex multiplications in the receiver. In the proposed system this means 4 × 4 × 64^2 = 65536 multiplications. However, it is possible to lower the complexity by noting that W can be decomposed by a singular value decomposition (SVD) as

$$W = U Q^{-1} V^{H} \qquad (6)$$

where U and V are orthonormal matrices and Q^{-1} is a filtering matrix. This gives the structure shown in Fig. 5.
An SVD will always yield a diagonal singular value matrix, meaning that Q^{-1} will be diagonal. Also, almost all of the energy will be in the first few taps. The rest contain a very limited amount of energy and can therefore be disregarded. In addition, since R_hh is a Hermitian, positive-semidefinite matrix, it is possible to choose V = U. This allows for the simplified architecture shown in Fig. 6, where δ0 to δM−1 are found as [5]

$$\delta_i = \frac{\lambda_i}{\lambda_i + \frac{\beta}{SNR}} \qquad (7)$$

where λi are the singular values of R_hh, and δM to δN−1 are disregarded due to their minor energy content.
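To give a feeling for how the gains in (7) behave, the Python sketch below evaluates them for a few SNR levels. The singular values in `lambdas` are hypothetical and not taken from the paper; the function names are likewise assumptions made for this illustration.

```python
def snr_from_db(snr_db):
    """Convert an SNR in dB to a linear power ratio."""
    return 10.0 ** (snr_db / 10.0)

def delta(lam, snr_db, beta=1.0):
    """Filter gain delta_i = lambda_i / (lambda_i + beta / SNR), as in (7)."""
    return lam / (lam + beta / snr_from_db(snr_db))

# Hypothetical singular values, strongest first (not taken from the paper).
lambdas = [20.0, 10.0, 5.0, 1.0, 0.01]
for snr_db in (20, 30, 40):
    print(snr_db, [round(delta(lam, snr_db), 4) for lam in lambdas])
```

For the dominant singular values the gain stays close to one over a wide SNR range, while components with very small singular values are strongly attenuated.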
To ensure that most of the energy is used by the estimator, M must be larger than the cyclic prefix length, L. Simulations have shown that M = L + 3 is a good choice for the given system. A change in SNR has only a minor impact on δi, and it is thus possible to precalculate the values and combine them with the transform matrix, U. Doing this, the number of multiplications needed drops from N^2 to 2NM. The total number of multiplications is thus 2NrNtNM, which with the current system parameters equals 22528 multiplications. This means a reduction of 66 % compared to the previously calculated 65536 multiplications.
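The multiplication counts can be checked with a few lines of Python. The function name and its parameter names are introduced here only for illustration; the default values are the system parameters given in the text.

```python
def multiplication_counts(nr=4, nt=4, n=64, cp=8, extra=3):
    """Return (brute_force, reduced, relative_saving) in complex multiplications."""
    m = cp + extra                 # M = L + 3 retained coefficients
    brute_force = nr * nt * n * n  # full N x N matrix per antenna pair
    reduced = 2 * nr * nt * n * m  # two N x M transforms per antenna pair
    saving = 1.0 - reduced / brute_force
    return brute_force, reduced, saving

brute, reduced, saving = multiplication_counts()
print(brute, reduced, saving)
```

The relative saving equals 1 − 2M/N, so it depends only on the ratio between the number of retained coefficients and the number of subcarriers.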
V. HARDWARE ARCHITECTURE
Due to the separability of the different paths from transmitter to receiver antennas, the complete estimator for the full Nr × Nt channel matrix can be separated into Nr × Nt simpler estimators, similar to the ones described in section IV and shown in Fig. 6. This property simplifies the hardware architecture.
A. Complexity
The architecture chosen for the simple estimator is a pipelined architecture with a total of 4 complex multipliers available per estimator, making it a total of 64 complex multipliers for the whole receiver. Each subchannel estimate needs a total of 2M + 1 = 23 multiplications, where the first one is needed for finding the least square estimate, $\hat{h}_{ls}$. Assuming that one complex multiplication and one add operation can be performed per clock cycle, the maximum throughput of the estimator will be ≤ 4/23 estimates per clock cycle. The actual throughput is chosen to be 1/6 estimates per clock cycle to simplify the control logic and to ensure a new estimator value every 6th clock cycle.
It is assumed that the FFT is of a pipelined structure, so that new data are provided at the same rate as they are consumed by the estimator. The order of the inputs is of no concern, which simplifies the FFT implementation. The latency of this architecture is 1 + 192 + 3 = 196 clock cycles. The first cycle is due to there being no least square estimate available in the first iteration. The next 192 cycles are used for calculating the first transform, as well as the remaining least square estimates. The final 3 cycles are spent calculating the first output of the second transform.
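The throughput figures can be reproduced with exact fractions in Python. The variable names below are introduced for this sketch only. Rounding the cycle count up to a whole number is one way to arrive at the chosen rate; the text itself motivates the choice by simpler control logic.

```python
from fractions import Fraction
from math import ceil

M = 11                       # retained coefficients, M = L + 3
MULTIPLIERS = 4              # complex multipliers per estimator

mults_per_estimate = 2 * M + 1                      # two transforms plus the LS step
max_throughput = Fraction(MULTIPLIERS, mults_per_estimate)
cycles_per_estimate = ceil(mults_per_estimate / MULTIPLIERS)
chosen_throughput = Fraction(1, cycles_per_estimate)
utilization = chosen_throughput / max_throughput    # share of multiplier slots in use
```

Under these assumptions the multipliers are busy in 23 out of every 24 available slots, so little capacity is given up by choosing the simpler rate.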
B. Word Length
The fixed point word lengths of the estimator are chosen long enough to ensure a negligible difference compared to a floating point system. Still, it is important not to make the words longer than necessary, as this will increase the size and slow down the design. Through simulations it has been seen that 10 bits of precision each for the real and imaginary parts (hereafter denoted 2 × 10 bits) are needed to store the elements of the transform matrix. Since all values are between −2^−1 and 2^−1, the first bit (the sign bit) is at the 2^−1 position. It is also assumed that the data from the FFT unit have a word length of 2 × 10 bits. However, the maximum input values are in ranges requiring 3 integer bits, including the sign bit, thus allowing for 7 fractional bits.
Internally the design uses 2 × 16 bit registers to store the intermediate and final results of the first and second transforms. The words have 2 × 5 integer bits, including the sign bits, and thus 2 × 11 fractional bits each. Simulations have shown that this is sufficient to avoid overflows. The multipliers and adders should also manage 2 × 16 bits for each of the operands.
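The three word formats can be modeled in Python as shown below. This is a sketch under stated assumptions: two's-complement coding, round-to-nearest, and saturation on overflow are choices made here, since the text does not specify rounding or overflow behavior. The names `quantize`, `COEFF`, `FFT_DATA`, and `INTERNAL` are hypothetical.

```python
import math

def quantize(value, total_bits, frac_bits):
    """Round to a two's-complement grid with saturation; returns the represented value."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    code = math.floor(value * scale + 0.5)      # round to nearest (assumed mode)
    code = max(lo, min(hi, code))               # saturate instead of wrapping (assumed)
    return code / scale

def quantize_complex(z, total_bits, frac_bits):
    """Quantize the real and imaginary parts separately with the same format."""
    return complex(quantize(z.real, total_bits, frac_bits),
                   quantize(z.imag, total_bits, frac_bits))

# Formats described in the text: (total bits, fractional bits) per real/imaginary part.
COEFF = (10, 10)     # sign bit at the 2^-1 position, range [-0.5, 0.5)
FFT_DATA = (10, 7)   # 3 integer bits including sign, range [-4, 4)
INTERNAL = (16, 11)  # 5 integer bits including sign, range [-16, 16)
```

Running a floating point reference and a model like this side by side is one way to check whether a candidate word length changes the mean square error noticeably.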
C. Memory
Calculations of the total memory needs have been performed for both the Read Only Memory (ROM) and the Random Access Memory (RAM).
1) ROM: All estimators multiply with the same transform matrices, and these can thus be shared. Also, the whole N × N transform matrix U does not need to be stored; it is sufficient to store the M × N elements that are actually used, and no extra space is needed for U^H, as the Hermitian transpose is easily generated with a sign change of the imaginary part. However, it was stated above that the δi values are premultiplied into the transformation matrix. To ensure that this does not double the needed memory, it is noted that

$$D = \begin{bmatrix} \delta_0 & 0 & \cdots \\ 0 & \delta_1 & \\ \vdots & & \ddots \end{bmatrix} = D^{1/2} D^{1/2} \qquad (8)$$

which means

$$U D U^{H} = \left( U D^{1/2} \right) \left( U D^{1/2} \right)^{H} \qquad (9)$$

and that it is possible to store only the N × M = 704 elements of UD^{1/2} in the ROM. Each element is a complex number requiring 2 × 10 bits, giving a total ROM of 14080 bits. In addition, the pilot bit pattern needs to be known. In a QPSK constellation, two bits are needed per symbol, and in total there are NtN pilots transmitted, contributing another 512 bits.
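The two-stage use of the stored matrix can be sketched in Python as follows. The matrix `a` stands for the N × M elements of UD^{1/2}; its conjugate is formed on the fly, so nothing else needs to be stored. The function name and the tiny N = 2, M = 1 example, including the value of `delta0`, are assumptions made for illustration.

```python
def apply_estimator(a, h_ls):
    """Compute A (A^H h_ls), where A = U D^(1/2) is stored as N rows of M entries."""
    n, m = len(a), len(a[0])
    # First transform: project onto the M retained components (N*M multiplications).
    t = [sum(a[k][i].conjugate() * h_ls[k] for k in range(n)) for i in range(m)]
    # Second transform: map back to the N subcarriers (another N*M multiplications).
    return [sum(a[k][i] * t[i] for i in range(m)) for k in range(n)]

# Toy example with N = 2, M = 1: U = [1, 1]/sqrt(2) and a hypothetical delta_0 = 0.5.
delta0 = 0.5
a = [[complex((delta0 / 2) ** 0.5)], [complex((delta0 / 2) ** 0.5)]]
h = apply_estimator(a, [1 + 1j, 1 + 1j])
```

In the toy example the input lies entirely along the retained component and is scaled by delta_0, whereas an input orthogonal to it, such as [1, −1], is removed completely.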
2) RAM: There is no specific need for RAM in the architecture, except for the registers that store the intermediate steps. Each estimator needs to store M values for the first transform and N values for the second transform. Each of the words contains 2 × 16 bits, making a total of (11 + 64) × 2 × 16 = 2400 bits of distributed RAM per estimator and thus a total of 38400 bits of RAM in the complete receiver. However, since the memory is spread over different registers, it is not well suited to be put in a single memory bank.
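The memory figures for the ROM and the RAM follow directly from the system parameters, as the short Python calculation below shows. The variable names are introduced here for illustration only.

```python
N, M = 64, 11            # subcarriers and retained coefficients
NT, NR = 4, 4            # transmit and receive antennas
COEFF_BITS = 2 * 10      # complex ROM word
REG_BITS = 2 * 16        # complex register word

rom_transform = N * M * COEFF_BITS          # elements of U D^(1/2)
rom_pilots = NT * N * 2                     # two bits per QPSK pilot
ram_per_estimator = (M + N) * REG_BITS      # first and second transform registers
ram_total = NR * NT * ram_per_estimator     # one estimator per antenna pair
```

Since the ROM contents are shared by all estimators, only the register storage scales with the number of antenna pairs.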
D. Hardware Reuse
Since channel estimation is only done for the initial few symbols of a burst, the design should enable re-use of the multipliers by other parts of the receiver at other times. They could for example be used both in the channel preprocessing and the symbol detection process. Therefore both of the operands are set to 2 × 16 bits even when there are only 2 × 10 bits in the transform matrix elements.
VI. SIMULATION RESULTS
The proposed channel estimator, with the word lengths given in this paper, has been simulated and compared with the least square estimator. The difference in Mean Square Error (MSE) of the channel estimates is shown in Fig. 7. The MSE of the proposed estimator is roughly 1/5th of that of the least square estimator.
A minor difference is seen between the fixed point and floating point estimators at both low and high SNRs. At the lower SNRs, this is due to the fact that the fixed point estimator is designed for an SNR of 30 dB, whereas the floating point estimator recalculates the δi values for each SNR value. The discrepancy at high SNR is due to the limited resolution in the fixed point implementation. However, further simulations have shown that this difference is not visible in the BER curves, since the BER does not seem to be limited by the channel estimation error at these high SNRs. Shorter word lengths have been shown to yield worse MSE and also affect the SNR.
VII. CONCLUSION
In this paper it is demonstrated that it is feasible to implement a more complicated MIMO OFDM channel estimator than the least square estimator so far seen in the literature. An example of a hardware implementation is given and its complexity and performance are shown. Since the channel estimator is only active in the beginning of the OFDM burst, when the channel needs to be estimated, the architecture is designed to be reusable for other parts of the receive process, such as channel preprocessing or symbol detection.
Fig. 7. MSE comparison between the least square and linear MMSE estimators.
In addition, the word lengths of the channel estimator have been calculated. It is shown that 2 × 16 bit registers are sufficient to get the same performance as with floating point arithmetic. The channel estimator then has an MSE which is 1/5th of what is seen for the least square estimator. This means that the receiver will be able to work with higher data rates at lower SNR levels. This can be especially beneficial in an environment with many interferers, where an increase in transmitted power will not necessarily increase the SNR.
