A new vector perturbation precoder based on the minimum-total-mean-square-error (MTMSE) criterion is proposed for the multi-user multi-stream MIMO downlink. It has much lower computational complexity and slightly better BER performance than the traditional MTMSE-MUMS-VP precoder we proposed previously. Furthermore, the proposed precoder is verified on a Xilinx Virtex-4 FPGA at 400 MHz. The reduced complexity saves a large number of slices, and with a 16 QAM constellation the data throughput reaches 12.8 Gbit/s in a {2, 2, 2, 2} × 8 MIMO configuration.
Introduction
Vector perturbation (VP) is a promising multiuser multiple-input multiple-output (MU-MIMO) precoding technique for next-generation cellular [1] and WLAN [2] systems. It was first proposed in [3] and shown to achieve near-capacity performance in the MU-MIMO downlink with a single receive antenna per user [3, 4, 5]. The combination of the popular block diagonal (BD) precoding [6, 7] with VP, named BD-VP [8], extends the VP precoder to the MU-MIMO downlink with multiple receive antennas per user. However, directly combining the two structures makes the precoder computationally expensive to implement. To solve this problem, we developed a low-complexity BD-VP precoder in [9]. Moreover, the BD-VP precoder is designed to remove the multi-user interference (MUI) completely, without considering its effect on the noise. Therefore, several multi-user multi-stream VP precoder structures were proposed in [10] to mitigate the noise enhancement problem encountered in the BD-VP precoder. Among them, the minimum-total-mean-square-error (MTMSE) criterion based multi-user multi-stream VP (MTMSE-MUMS-VP) precoder (specifically, MTMSE-MUMS-VP II in [10]) has the highest achievable sum rate. However, its computational complexity is higher than that of the other MUMS-VP precoders in [10]. In this letter, we therefore further reduce the computational complexity of the MTMSE-MUMS-VP precoder for easier hardware implementation. Fixed-point bit error rate (BER) simulation results show that the proposed MTMSE-MUMS-VP precoder performs slightly better than the traditional one. We also verify the proposed precoder on a Xilinx Virtex-4 FPGA.
MU-MIMO downlink system model
Consider an MU-MIMO downlink system in which a base station (BS) transmits signals, together with possible interference, to $K$ independent users simultaneously. The BS is equipped with $N_t$ transmit antennas and the $k$th user has $n_k \geq 1$ receive antennas, a configuration written as $\{n_1, \ldots, n_K\} \times N_t$.
The total number of receive antennas at all users is defined as $N_r = \sum_{k=1}^{K} n_k$. The channel from the base station to the $k$th user is modeled as a flat-fading $n_k \times N_t$ channel matrix $H_k$. It is assumed that the stacked channel matrix
$H = [H_1^T, H_2^T, \ldots, H_K^T]^T$ can be obtained with sufficient accuracy by uplink-downlink duality or instantaneous feedback [3, 4, 5, 6, 7, 8, 9, 10, 11]. Then, the received signal $\hat{x}_k$ at the $k$th user is
where $x_k$ and $F_k$ are respectively the transmit signal and the precoder for the $k$th user, $g_k$ is the adaptive gain controller (AGC) factor at the $k$th user's receiver and can take any positive real value, and $z_k$ is the additive complex Gaussian noise vector at the $k$th user with zero mean and covariance matrix $\sigma_z^2 I_{n_k}$.
Low-complexity MTMSE-MUMS-VP

Traditional MTMSE-MUMS-VP
In the traditional MTMSE-MUMS-VP [10], let $y_k = F_k x_k$; then the MSE between $\hat{x}_k$ and $x_k$ of the $k$th user can be calculated as
where $P$ is the total transmit power. The optimal precoding matrix is derived as
where
, that is, $F_k$ can be obtained from the corresponding columns of $F$. However, because the columns of $F_k$ are not orthonormal, the AGC factor $g_k$ causes noise enhancement at the receiver side. A better approach is to perform a QR decomposition, i.e.,
where $R_k$ is an $n_k \times n_k$ upper triangular matrix and $Q_k$ is composed of $n_k$ orthonormal basis vectors of 
power of the $k$th user, $\eta_k$ includes the residual small MUI and can be approximated as $\alpha$ with marginal performance loss (refer to [10]). In summary, the precoding matrix $F_k$ has been updated as
To further minimize the transmit power, the perturbation vector $l_k$ added to the transmit signal is optimized as follows:
where $\tau_k$ is a positive real number related to the modulation order and constellation size of user $k$, defined as $\tau_k = 2(|C|_k^{\max} + \Delta/2)$ [3, 9]. Here $|C|_k^{\max}$ is the absolute value of the constellation symbol(s) with the largest magnitude for user $k$, and $\Delta$ is the spacing between constellation points. $\mathbb{CZ}^{n_k}$ represents the set of $n_k \times 1$ vectors in the complex integer field.
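As an illustration of the quantities defined above, the following minimal Python sketch evaluates $\tau_k$ and searches for a perturbation vector by brute force. It rests on several assumptions that are not taken from this letter: the objective is the usual VP cost $\|F_k(x_k + \tau_k l)\|^2$ from [3], the constellation levels are unnormalized (±1 per dimension for QPSK, ±1 and ±3 for 16 QAM, so $\Delta = 2$), and the search is restricted to complex integers whose real and imaginary parts lie within a small `radius`. The names `vp_tau`, `transmit_power`, `brute_force_perturbation` and `radius` are hypothetical. A practical encoder would use sphere encoding or lattice reduction rather than exhaustive search.

```python
from itertools import product


def vp_tau(c_max, delta):
    """tau = 2 * (|C|max + delta / 2), as defined in the text."""
    return 2.0 * (c_max + delta / 2.0)


def transmit_power(F, s):
    """Squared Euclidean norm of F @ s, with F given as a list of rows."""
    total = 0.0
    for row in F:
        z = sum(f * v for f, v in zip(row, s))
        total += z.real ** 2 + z.imag ** 2
    return total


def brute_force_perturbation(F, x, tau, radius=1):
    """Exhaustive search over complex-integer vectors l (assumed objective).

    Real and imaginary parts of every entry of l are limited to
    [-radius, radius]; this bound is an illustrative assumption.
    """
    parts = range(-radius, radius + 1)
    candidates = [complex(re, im) for re in parts for im in parts]
    best_l, best_power = None, float("inf")
    for l in product(candidates, repeat=len(x)):
        s = [xi + tau * li for xi, li in zip(x, l)]
        power = transmit_power(F, s)
        if power < best_power:
            best_l, best_power = list(l), power
    return best_l, best_power


tau_qpsk = vp_tau(1, 2)    # 4.0 for per-dimension levels +/-1
tau_16qam = vp_tau(3, 2)   # 8.0 for per-dimension levels +/-1, +/-3

# Hypothetical, poorly conditioned 2 x 2 precoder and QPSK symbols.
F_example = [[1.0, 0.95], [0.95, 1.0]]
x_example = [1 + 1j, 1 + 1j]
l_opt, p_opt = brute_force_perturbation(F_example, x_example, tau_qpsk)
```

Because the all-zero vector is among the candidates, the power returned by the search never exceeds the unperturbed power `transmit_power(F_example, x_example)`.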
Proposed low-complexity precoder
In the traditional MTMSE-MUMS-VP precoder, $Q_k$ is used to mitigate MUI, while $D_k$ is used to eliminate MSI. In fact, $F_k$ can be easily simplified as the closed-form expression
from which we can see the inclusion of the interference term
. Thus, we conclude that $F_k$ by itself can mitigate not only MUI but also MSI. Consequently, the precoding matrix can be expressed as 
Then, similar to the idea of the low-complexity BD-VP precoder in [9], we can directly optimize the perturbation vector $l_k$ using $F_k$ instead of the cascade of $Q_k$ and $D_k$.
According to [12] and assuming $N_r = N_t$, the detailed computational complexity comparison is shown in Table I, where K and K are the norms of the longest basis vectors of $D_k$ and $F_k$, respectively. The proposed precoder has four fewer processing steps than the traditional MTMSE-MUMS-VP precoder. In addition, the computational complexity against the number of transmit and receive antennas $N$ is illustrated in Fig. 1, with $n_k$ set to 2. The minimum signal-to-leakage-and-noise-ratio criterion based MUMS-VP (MSLNR-MUMS-VP) precoder has the lowest complexity among the precoder structures in [10]. Fig. 1 shows that the proposed precoder structure greatly reduces the computational complexity of the traditional MTMSE-MUMS-VP precoder, especially when $N$ is not large. For instance, in the {2, 2} × 4 configuration, the proposed MTMSE-MUMS-VP precoder requires 397 flops, which is much lower than the 672 flops of the traditional MTMSE-MUMS-VP precoder and even lower than the 493 flops of MSLNR-MUMS-VP.
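To put the quoted flop counts for the {2, 2} × 4 configuration in relative terms, the short Python sketch below computes the fractional savings. Only the three flop counts come from the text; the helper name `relative_saving` and the dictionary keys are illustrative.

```python
def relative_saving(baseline_flops, new_flops):
    """Fractional reduction of new_flops relative to baseline_flops."""
    return (baseline_flops - new_flops) / baseline_flops


# Flop counts quoted in the text for the {2, 2} x 4 configuration.
FLOPS = {"proposed": 397, "traditional": 672, "mslnr": 493}

saving_vs_traditional = relative_saving(FLOPS["traditional"], FLOPS["proposed"])
saving_vs_mslnr = relative_saving(FLOPS["mslnr"], FLOPS["proposed"])

print(f"{saving_vs_traditional:.1%} fewer flops than traditional MTMSE-MUMS-VP")
print(f"{saving_vs_mslnr:.1%} fewer flops than MSLNR-MUMS-VP")
```

For these counts the savings are roughly 41% relative to the traditional MTMSE-MUMS-VP precoder and roughly 19% relative to MSLNR-MUMS-VP.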
Numerical experiment and hardware verification
In the numerical experiment, one BS with $N_t = 8$ antennas is assumed to transmit simultaneously to $K = 4$ users with $n_k = 2$ antennas each. The SNR is defined as the ratio of the symbol energy per transmit antenna to the noise power. A fixed bit length of 15 bits with a fractional part of 10 bits is adopted for the BD-VP, traditional MTMSE-MUMS-VP and proposed MTMSE-MUMS-VP precoders, while a fixed bit length of 15 bits with a fractional part of 7 bits is adopted for the BD precoder because of its large transmit power. The solid and dotted lines represent QPSK and 16 QAM, respectively. From the figure, we can see that the MTMSE-MUMS-VP precoders outperform the BD and BD-VP precoders. More importantly, our proposed precoder is slightly better than the traditional MTMSE-MUMS-VP precoder. This is because $\eta_k$ of the traditional MTMSE-MUMS-VP precoder in equation (4) has no optimal closed-form expression but only an approximate value $\alpha$, which causes a small performance loss.
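The trade-off between the two word formats can be illustrated with a small fixed-point quantizer. This is a sketch under assumptions that the text does not state: the 15 bits are taken to include a sign bit (two's complement), values are rounded to the nearest code, and out-of-range values saturate. The function name `quantize_fixed_point` is hypothetical.

```python
def quantize_fixed_point(value, total_bits=15, frac_bits=10):
    """Quantize a real number to a signed fixed-point grid.

    Assumptions (not from the text): two's-complement range with the sign
    bit counted in total_bits, round-to-nearest via Python's round()
    (ties go to the even code), and saturation on overflow.
    """
    scale = 1 << frac_bits
    max_code = (1 << (total_bits - 1)) - 1
    min_code = -(1 << (total_bits - 1))
    code = round(value * scale)
    code = max(min_code, min(max_code, code))
    return code / scale


def quantize_complex(z, total_bits=15, frac_bits=10):
    """Quantize the real and imaginary parts independently."""
    return complex(
        quantize_fixed_point(z.real, total_bits, frac_bits),
        quantize_fixed_point(z.imag, total_bits, frac_bits),
    )
```

Under these assumptions, 10 fractional bits give a step of $2^{-10}$ over a range of about ±16, whereas 7 fractional bits give a coarser step of $2^{-7}$ over a range of about ±128, which matches the stated motivation of giving the BD precoder more headroom for its larger transmit power.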
We verify the traditional and the proposed MTMSE-MUMS-VP precoders on a Xilinx XC4VLX200FF1513-11 Virtex-4 FPGA. Matlab Simulink is used as the model builder in our design. The precoding matrices $F_k$ and perturbation vectors $l_k$ and $l_k$ are first calculated in Matlab and then read into Simulink. After that, the Xilinx System Generator transforms the Simulink model into an RTL description, which is then synthesised and mapped onto the FPGA. We realize configurations from {2, 2} × 4 to {2, 2, 2, 2} × 8 using QPSK and 16 QAM constellations. A fixed bit length of 15 bits with a fractional part of 10 bits is found to be the shortest one that gives nearly the same performance as floating point, so we use this bit width throughout the implementation. The area results of the QPSK and 16 QAM designs for the traditional and the proposed precoders are shown in Table II. The proposed precoder structure clearly reduces the slice resources required in the FPGA. Furthermore, with the proposed precoder structure at a clock rate of 400 MHz, the QPSK {2, 2, 2, 2} × 8 system gives a data throughput of 6.4 Gbit/s, and the 16 QAM {2, 2, 2, 2} × 8 system achieves a data throughput of 12.8 Gbit/s.
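The quoted throughputs are consistent with a simple count, under the assumption (not stated explicitly in the text) that the design processes one symbol on each of the eight streams per clock cycle. The following Python sketch reproduces the arithmetic; `throughput_bps` is a hypothetical helper name.

```python
def throughput_bps(num_streams, bits_per_symbol, clock_hz):
    """Raw data throughput, assuming one symbol per stream per clock cycle."""
    return num_streams * bits_per_symbol * clock_hz


CLOCK_HZ = 400_000_000   # 400 MHz clock rate from the text
NUM_STREAMS = 8          # {2, 2, 2, 2} x 8: four users with two streams each

qpsk_bps = throughput_bps(NUM_STREAMS, 2, CLOCK_HZ)    # QPSK: 2 bits per symbol
qam16_bps = throughput_bps(NUM_STREAMS, 4, CLOCK_HZ)   # 16 QAM: 4 bits per symbol

print(f"QPSK:   {qpsk_bps / 1e9:.1f} Gbit/s")
print(f"16 QAM: {qam16_bps / 1e9:.1f} Gbit/s")
```

Under this assumption the sketch yields 6.4 Gbit/s for QPSK and 12.8 Gbit/s for 16 QAM, matching the reported figures.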
Conclusion
A low-complexity minimum-total-mean-square-error criterion based multi-user multi-stream vector perturbation (MTMSE-MUMS-VP) precoder is proposed in this letter. Compared with the traditional MTMSE-MUMS-VP precoder, the proposed precoder has four fewer processing steps and slightly better BER performance. In practical hardware tests, the traditional and the proposed precoders are realized on a Xilinx Virtex-4 FPGA in configurations from {2, 2} × 4 to {2, 2, 2, 2} × 8 at a clock rate of 400 MHz. The proposed precoder structure greatly reduces the required slice resources, and with a 16 QAM constellation its data throughput reaches 12.8 Gbit/s.
