Abstract: This paper describes the first VLSI implementation of lattice reduction (LR) aided multi-antenna broadcast precoding with vector perturbation. The considered LR scheme is based on Brun's algorithm for finding integer relations. We analyze its high-level architectural issues, we devise a corresponding low-complexity implementation, and, finally, we develop a suitable VLSI architecture. The resulting circuit provides a reference for the true silicon complexity of LR for broadcast precoding with vector perturbation.
I. Introduction
Multi-antenna multi-user communication systems employ multiple antennas at the transmitter to concurrently serve multiple non-cooperating single-antenna users [1], [2], [3]. Here, precoding is performed so that each user receives only its intended data stream. Simple linear preequalization increases the transmit power significantly, an effect that can be partially mitigated by lattice reduction (LR) aided vector perturbation [4], [5], [6]. Unfortunately, many LR algorithms have a considerable computational complexity, rendering the economic VLSI implementation of real-time multi-user communication systems a challenging task.
Contribution: This paper shows how LR-aided vector perturbation can be implemented efficiently as an ASIC. To this end, we consider the LR algorithm proposed in [6], which is based on Brun's algorithm for finding integer relations. The algorithm is first streamlined for VLSI implementation and a suitable hardware architecture is presented. The reported implementation results provide a reference for the true silicon complexity of LR-aided vector perturbation, which, to the best of our knowledge, has not been described so far in the open literature.
Outline: The remainder of this section describes the system model and the LR-aided vector perturbation algorithm under consideration. Section II focuses on the high-level architecture and describes algorithm transformations that reduce the hardware complexity. The details of the proposed register transfer level (RTL) architecture are then described in Section III. Finally, implementation results are presented in Section IV.
A. System Model
We consider the downlink of a multi-user communication system in which the base station is equipped with $M_T$ antennas and transmits to $K$ single-antenna users (see, e.g., [1], [3]). The input-output relation of this system is given by

$$y = \gamma H x + n \tag{1}$$

where the $K$-dimensional vector $y$ collects the samples received by the $K$ users, the $K \times M_T$ matrix $H$ describes the channel, and the $M_T$-dimensional vector $x$ subsumes the complex-valued baseband samples transmitted from the $M_T$ transmit antennas. The normalization factor $\gamma = 1/\|x\|$ ensures unit transmit power. Finally, the $K$-dimensional vector $n$ denotes white complex Gaussian noise.
B. LR-Aided Zero-Forcing Precoding
With zero-forcing (ZF) preequalization, the vector $x$ is constructed from the data vector $d$ according to (e.g., [1])

$$x = P d, \qquad P = H^H (H H^H)^{-1} \tag{2}$$

The main drawback of this scheme is an enhancement of the transmit power level, which, after the compensation by the normalization factor $\gamma$, can lead to a significant performance degradation. LR-aided vector perturbation [4], [5], [6] is one approach to alleviate the problem of transmit power enhancement. The basic idea is to find a unimodular matrix $B$ that transforms $P$ into a better basis $\tilde{P} = PB$. The vector $x$ is then chosen according to

$$x = P (d + \tau z), \qquad z = -B \left\lfloor \frac{B^{-1} d}{\tau} \right\rceil \tag{3}$$

where $d$ denotes the original modulated data, $\tau$ is a fixed constant, and $\lfloor \cdot \rceil$ denotes rounding to the nearest integer.
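The perturbation step can be illustrated with a small numerical sketch. It assumes the common LR-aided rounding form $z = -B \lfloor B^{-1} d / \tau \rceil$ with element-wise rounding of real and imaginary parts; the matrices `B`, `C`, and `P`, the constant `tau`, and all function names are hypothetical values chosen for illustration and are not taken from the paper.

```python
def round_gaussian(v):
    """Round real and imaginary parts separately to the nearest integer."""
    return complex(round(v.real), round(v.imag))


def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def perturbation(B, C, d, tau):
    """Assumed form z = -B * round(C d / tau), with C the inverse of B."""
    rounded = [round_gaussian(c / tau) for c in matvec(C, d)]
    return [-b for b in matvec(B, rounded)]


B = [[1, 1], [0, 1]]            # unimodular: integer entries, determinant 1
C = [[1, -1], [0, 1]]           # inverse of B
P = [[1.0, 0.9], [0.0, 0.5]]    # illustrative precoding matrix, not from the paper
tau = 4.0
d = [3 + 1j, -1 - 3j]

z = perturbation(B, C, d, tau)
perturbed = [di + tau * zi for di, zi in zip(d, z)]
x = matvec(P, perturbed)

# Same vector expressed in the reduced basis P_tilde = P B.
P_tilde = [[sum(P[r][k] * B[k][c] for k in range(2)) for c in range(2)]
           for r in range(2)]
Cd = matvec(C, d)
x_reduced = matvec(P_tilde, [c - tau * round_gaussian(c / tau) for c in Cd])
```

In this toy example the perturbed data vector has smaller entries than `d`, and evaluating the expression in the original basis or in the reduced basis gives the same transmit vector up to floating-point rounding.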
C. Low Complexity LR Algorithm
The challenge of LR-aided ZF precoding lies in the efficient implementation of LR. Here, the LLL algorithm [7] is most widely used in the literature. Unfortunately, its complexity is often prohibitive for real-time implementations. An alternative, low-complexity LR scheme based on Brun's algorithm has been described in [6]. It can be summarized as follows:
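1) Initialization: Set $i = 0$, choose $u^{(0)}$ as an arbitrary column of $(HH^H)^{-1}$ (which is a side product of many matrix-inversion algorithms such as the one used in [8]), and set $P^{(0)} = P$, $B^{(0)} = I$, and $C^{(0)} = I$.
2) Choose the indices $s$ and $t$ according to (4).
3) Compute $\beta^{(i)}$ according to (5) and update $u^{(i)}$, $P^{(i)}$, $B^{(i)}$, and $C^{(i)}$ according to (6)-(9).
4) If (10) is false, increment $i$ and go to Step 2; otherwise terminate and return the matrices of the final iteration, $B^{(J-1)}$ and $C^{(J-1)}$.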
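The index selection and reduction at the core of Steps 2 and 3 can be sketched as follows. This is a minimal illustration that assumes the textbook form of Brun's algorithm (take the largest and second-largest entries of $u$ by magnitude, round their quotient, and reduce the largest entry); it is not necessarily identical to equations (4)-(6), and the updates of $P$, $B$, $C$ as well as the termination test (10) are left out. The names `brun_step` and `round_gaussian` are hypothetical.

```python
def round_gaussian(v):
    """Round real and imaginary parts separately to the nearest integer."""
    return complex(round(v.real), round(v.imag))


def brun_step(u):
    """One assumed Brun iteration on a list of complex numbers.

    Returns (s, t, beta, u_next), where only entry s of u is modified.
    """
    order = sorted(range(len(u)), key=lambda k: abs(u[k]), reverse=True)
    s, t = order[0], order[1]             # largest and second-largest entry
    beta = round_gaussian(u[s] / u[t])    # assumes u[t] != 0
    u_next = list(u)
    u_next[s] = u[s] - beta * u[t]        # reduce the largest entry
    return s, t, beta, u_next


u0 = [0.9 + 0.1j, -0.4 + 0.3j, 0.2 - 0.6j, 0.05 + 0.1j]
s0, t0, beta0, u1 = brun_step(u0)
s1, t1, beta1, u2 = brun_step(u1)
```

In this example the reduced entry becomes smaller than the former second-largest entry, so the index `s1` selected in the next step equals `t0`.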
II. System-Level Architecture Considerations
For VLSI implementations it is practical to distinguish between channel-rate preprocessing (all operations that are performed only when the channel changes) and symbol-rate processing (all operations that are carried out for each transmitted vector symbol). Since the update rate of the channel is usually much lower than the symbol rate, it is often argued that iterative decomposition may be used to schedule the computations on only a few processing resources to reduce silicon area. However, in MIMO-OFDM systems, where preprocessing has to be performed on a large number of channel matrices, such area-efficient architectural transformations also result in a considerable preprocessing latency, which often cannot be tolerated in packet-based systems [8]. Hence, even the allegedly low-rate preprocessing must often be implemented using fast hardware architectures.
To obtain such an efficient, high-throughput architecture, we start from a straightforward high-level block diagram and restructure the original algorithm [6] to reduce its hardware complexity.
A. Initial VLSI Architecture
Fig. 1a shows the data dependency graph (DDG) and the complexity of the LR algorithm implemented according to [6]. The channel-rate processing is shown on the left-hand side of the block diagram. It comprises Brun's algorithm, the update of the matrices $B^{(i)}$, $C^{(i)}$, and $P^{(i)}$, and the evaluation of the termination condition (10). Due to the nonlinear recursive structure of the LR algorithm, the highest degree of parallel processing corresponds to the parallel execution of one iteration. The corresponding symbol-rate processing (on the right-hand side of the block diagram) requires $B^{(J-1)}$ and $C^{(J-1)}$ to compute $(d + \tau z)$ according to (3) and $P = P^{(0)}$ to obtain the transmitted signal vector. Note that here $P^{(J-1)}$ is only required to evaluate the termination condition.
B. Restructuring for VLSI Implementation
At first, we streamline the original algorithm to avoid obvious redundant computations. To this end, one can exploit that

$$P (d + \tau z) = P^{(J-1)} \left( C^{(J-1)} d - \tau \left\lfloor \frac{C^{(J-1)} d}{\tau} \right\rceil \right)$$

and that $P^{(J-1)}$ must be computed anyway for the evaluation of the termination condition. The DDG in Fig. 1b shows the result of this optimization. Compared to the original implementation strategy, the computational complexity of both channel-rate and symbol-rate processing has been reduced, and less memory is required since only two matrices ($C^{(J-1)}$ and $P^{(J-1)}$) must be stored, while the original unoptimized architecture required storage for three matrices ($B^{(J-1)}$, $C^{(J-1)}$, and $P^{(0)}$).

The second optimization step primarily aims at eliminating redundant multiplications in the update of $C^{(i)}$ during preprocessing and in the application of $C^{(J-1)}$ to $d$ in the symbol-rate processing. The main reason for this redundancy in the preprocessing is that $C^{(i)}$ is sparse during the first few updates according to (9). However, in a parallel implementation that carries out one update step in each cycle, the number of multipliers is determined by the worst case where all $K$ entries of the vector $c_t$ being updated are nonzero, which makes it difficult to exploit the sparseness of $C^{(i)}$ for hardware complexity reduction in the channel-rate processing. A similar argument applies to the application of $C^{(J-1)}$ to $d$ in the symbol-rate processing, since for a small number of iterations even $C^{(J-1)}$ is still sparse. However, runtime allocation of processing resources to the nonzero entries of $C^{(i)}$ breaks the regularity of a matrix-vector multiplication and adds considerable hardware overhead for control.

The basic idea for avoiding multiplications with ones and zeros is to apply the transformations in (9) directly to the transmitted vectors $d$, as shown in the block diagram in Fig. 1c. The corresponding update yields $d^{(i+1)}$ from $d^{(i)}$ with a single multiplication by $\beta^{(i)}$ per iteration. The drawback of this approach is that the number of multiplications for the symbol-rate processing now depends on the number of iterations $J$. Compared to the implementation in Fig. 1b, the average computational complexity is only reduced if $E\{J\} < K^2$, where $E\{\cdot\}$ denotes the expectation and where $K^2$ is the number of multiplications required for the evaluation of $C^{(J-1)} d$. Numerical simulations show that $E\{J\} \approx 2$ for $M_T = K = 4$, which is far below the constant complexity of $K^2 = 16$ multiplications of a straightforward implementation based on matrix-vector multiplication.

Unfortunately, the storage between the channel-rate and the symbol-rate processing and the symbol-rate processing unit itself must be designed for the worst case rather than for the average case. Hence, an artificial upper bound $J_{\max}$ must be enforced on the number of iterations (which is necessary anyway to constrain the preprocessing latency). This constraint should be chosen such that premature termination is a rare event. In a system with $M_T = K = 4$ and an i.i.d. Rayleigh fading channel, setting $J_{\max} = 8$ is sufficient for 99.9% of all channel realizations, which at the same time still guarantees a 50% complexity and memory reduction when comparing the direct application of $\beta^{(i)}$ to $d$ with the approach in (3) that is based on matrix-vector multiplication.
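The trade-off between replaying the per-iteration transformations on $d$ and multiplying by the accumulated matrix can be sketched as follows. The sketch assumes that each iteration acts as one elementary operation (entry $t$ minus $\beta$ times entry $s$); the exact indices and sign of the paper's update equation may differ, and the triples in `updates` as well as all function names are hypothetical.

```python
def apply_updates_to_vector(d, updates):
    """Replay the stored (s, t, beta) triples on d: one multiplication each."""
    d = list(d)
    for s, t, beta in updates:
        d[t] = d[t] - beta * d[s]       # assumed elementary update
    return d


def accumulate_matrix(K, updates):
    """Build the K x K matrix produced by the same row operations."""
    C = [[1 if r == c else 0 for c in range(K)] for r in range(K)]
    for s, t, beta in updates:
        C[t] = [C[t][c] - beta * C[s][c] for c in range(K)]
    return C


def matvec(C, d):
    """Plain matrix-vector product with K * K multiplications."""
    return [sum(m * x for m, x in zip(row, d)) for row in C]


K = 4
updates = [(0, 2, 1j), (2, 1, -1 + 0j)]        # J = 2 illustrative triples
d = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j]

direct = apply_updates_to_vector(d, updates)
via_matrix = matvec(accumulate_matrix(K, updates), d)
mults_direct, mults_matrix = len(updates), K * K
```

Both paths produce the same vector, but the direct path needs `J` multiplications instead of `K * K`, which is why it only pays off when the number of iterations stays below $K^2$.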
III. RTL Implementation
The RTL implementation described in the following is based on the high-level architecture depicted in Fig. 1c. Since the implementation of the symbol-rate processing is straightforward, we now focus on the channel-rate processing. The corresponding circuit is partitioned into two main entities:
• The Brun's algorithm unit (BAU) receives the vector $u^{(0)}$ (for example, from the circuit described in [8]), carries out Brun's algorithm as described by (4)-(6), and forwards $\beta^{(i)}$ to the second entity.
• The Update P unit (UPU) applies $\beta^{(i)}$ to update $P^{(i)}$, checks the termination condition in (10), and sends the result of this check back to the BAU.
A. RTL Architecture of BAU
The RTL block diagram of the BAU is shown in Fig. 2 . The circuit is partitioned into two macro-pipeline stages:
1) Initialization Stage: The first stage (marked in dark-gray in Fig. 2) receives $u^{(0)}$ and determines the index $s^{(0)}$, so that the next vector is ready as soon as the current iteration terminates.
2) Iteration Stage: In the second stage, the index $t$ is determined. The entries $u_s^{(i)}$ and $u_t^{(i)}$ are then selected and a division unit computes $\beta$ according to (5), whereby the divisor $|u_t^{(i)}|^2$ is immediately available from the iteration register. The result of the division is used to compute the updated $u_s$ and $|u_s|^2$ and is also forwarded to the termination unit. When the termination flag is not set, the next cycle starts by loading the iteration register with the updated vector $u^{(1)}$ and the corresponding norms of its entries. At the same time, the iteration register that contains $s$ can be updated without explicitly searching for the entry of $u^{(1)}$ with the largest norm, since Brun's algorithm ensures that $s^{(i+1)} = t^{(i)}$. When the termination flag is set, the iteration is reinitialized ($i = 0$) from the first macro-pipeline stage.
B. RTL Architecture of UPU
The RTL block diagram of the UPU is shown in Fig. 3. Here, the problem is that an isomorphic architecture that implements (7) and (10) requires concurrent instantaneous read access to two columns of $P^{(i)}$ and write access to the same memory. However, the memory that contains $P^{(i)}$ can usually not provide random read access to more than one column of $P^{(i)}$ at a time [8] and must also be assumed to have an access latency of one cycle. The resulting memory access bottleneck can be alleviated considerably by exploiting that $p_s^{(i+1)} = p_t^{(i)}$ (a consequence of $s^{(i+1)} = t^{(i)}$), so that only one new column must be requested per iteration.

The resulting schedule is illustrated in Fig. 4. In the first iteration, the first macro-pipeline stage of the BAU decides on the index $s^{(0)}$, which is used immediately by the UPU to issue a request to the memory for the corresponding column of $P^{(0)}$. The answer to that request arrives one cycle later (Fig. 4 only shows the request to the one-cycle-latency memory). In that second cycle, the index $t^{(0)}$ is available from the BAU and the UPU can immediately request $p_t^{(0)}$. This vector $p_t^{(0)}$ arrives in the third cycle together with $\beta^{(0)}$, which has been computed in the second cycle but was delayed by one cycle by a pipeline register at the output of the BAU (shown in Fig. 2). Note that this register not only serves to align $\beta^{(i)}$ with the corresponding $p_t^{(i)}$ in the UPU, but also decouples the timing of the two units and prevents a long combinatorial path through both units.

The UPU can now update $p_s^{(0)}$ and check the termination condition. If (10) is false, the UPU continues by storing $p_t^{(0)}$ in its internal cache register and by requesting $p_t^{(1)}$, whose index $t$ is available from the BAU. If (10) is true, the termination flag is set, causing the BAU to proceed immediately to the next precoding matrix, which is already waiting in the first macro-pipeline stage, so that the UPU can again start by requesting $p_s^{(0)}$.

Finally, an additional memory bypass is required in the UPU to deal with the case where $t^{(i+1)} = s^{(i)}$. In that case, the UPU requests data from the memory that is just about to be written back and cannot be read at the same time. Hence, the corresponding data must be fed back through a local register as shown in the block diagram in Fig. 3.
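The need for the bypass can be illustrated with a toy cycle-level model. It is only a behavioral sketch under simplifying assumptions (one read request and at most one write per cycle, read data delivered one cycle after the request); the classes `ColumnMemory` and `BypassedMemory` are hypothetical and do not reproduce the circuit of Fig. 3.

```python
class ColumnMemory:
    """Toy column store: a read issued in cycle n is answered in cycle n + 1."""

    def __init__(self, columns):
        self.columns = [list(c) for c in columns]
        self.answer = None                  # data for last cycle's request

    def clock(self, read_index=None, write=None):
        out = self.answer
        # The read samples the old contents; a same-cycle write is not seen.
        self.answer = None if read_index is None else list(self.columns[read_index])
        if write is not None:
            index, data = write
            self.columns[index] = list(data)
        return out


class BypassedMemory:
    """Adds a local register that forwards data written in the request cycle."""

    def __init__(self, columns):
        self.memory = ColumnMemory(columns)
        self.forward = None                 # write data to forward next cycle

    def clock(self, read_index=None, write=None):
        out = self.memory.clock(read_index, write)
        if self.forward is not None:
            out = self.forward              # replace the stale memory answer
        hit = write is not None and write[0] == read_index
        self.forward = list(write[1]) if hit else None
        return out


plain = ColumnMemory([[1], [2]])
plain.clock(read_index=0, write=(0, [9]))   # read and write hit column 0
stale = plain.clock()

bypassed = BypassedMemory([[1], [2]])
bypassed.clock(read_index=0, write=(0, [9]))
fresh = bypassed.clock()
```

Without the bypass, the request that collides with the write-back returns the old column (`stale`), whereas the local register delivers the updated column (`fresh`).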
C. Fixed-Point Considerations
We now determine the accuracy requirements, where we consider a compromise between block floating-point and fixed-point arithmetic to work around the problem that the inputs of our circuit have an infinite dynamic range. To this end, the entries of $u^{(0)}$ and $P^{(0)}$ are normalized by powers of two in such a way that the magnitudes of their largest real and imaginary parts are as close as possible to, but smaller than, one. The same scaling factor is thereby applied to all entries of the same matrix or vector. As opposed to true block floating-point arithmetic, this normalization is only performed once, before the algorithm starts. In the following, we write $[I, F]$ for a number with $I$ integer and $F$ fractional bits in its real and imaginary part, where $I$ determines the dynamic range and $F$ the numerical accuracy.

1) Numerical Requirements for u: Evidently, $u^{(0)}$ requires zero integer bits, and Brun's algorithm ensures that the magnitude of the entries of $u^{(i)}$ can only decrease [6]. Thus, $I_u = 0$. Furthermore, simulations indicate that $F_u = 7$ maintains close-to-floating-point performance up to a symbol error rate (SER) of $10^{-4}$.

2) Numerical Requirements for β: Since $\beta$ is an integer by definition, $F_\beta = 0$. Furthermore, the number of integer bits required to capture the result of the division in (5) is given by $I_\beta = 1 + I_u + F_u$, provided that the exception $u_t^{(i)} = 0$ is handled separately.

3) Numerical Requirements for P: The number of integer bits for $P^{(0)}$ is zero because of the described input normalization. The termination condition in (10) ensures that the length of the column vectors of $P^{(i)}$ cannot increase. However, it may still happen that a column in $P^{(i)}$ is altered in such a way that the corresponding vector lies in a single real-valued dimension. In that case, the dynamic range of the corresponding quadrature component is bounded by $\pm\sqrt{2K}$ and thus requires $I_P = \lceil (1 + \log_2 K)/2 \rceil$ integer bits (i.e., $I_P = 2$ for $K = 4$).
The number of fractional bits must again be found through simulations. 
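The one-time normalization and the word-length rules above can be sketched in a few lines. The helper names are hypothetical, saturation and overflow handling are not modelled, and the ceiling in `integer_bits_p` is an assumption implied by $I_P = 2$ for $K = 4$.

```python
import math


def block_normalize(values):
    """Scale all entries by one power of two so that the largest real or
    imaginary part is below one and as close to one as possible."""
    peak = max(max(abs(v.real), abs(v.imag)) for v in values)
    shift = math.floor(math.log2(peak)) + 1     # assumes peak > 0
    return [v / 2 ** shift for v in values], shift


def quantize(v, frac_bits):
    """Round real and imaginary parts to a grid of 2**-frac_bits."""
    step = 2.0 ** -frac_bits
    return complex(round(v.real / step) * step, round(v.imag / step) * step)


def integer_bits_beta(i_u, f_u):
    """I_beta = 1 + I_u + F_u, as stated in the text."""
    return 1 + i_u + f_u


def integer_bits_p(K):
    """Integer bits needed to cover a range of +/- sqrt(2K)."""
    return math.ceil((1 + math.log2(K)) / 2)


scaled, shift = block_normalize([3 + 0.5j, -1 - 2j])
u_fixed = quantize(0.3 + 0.7j, 7)               # F_u = 7 fractional bits
```

With $I_u = 0$ and $F_u = 7$, `integer_bits_beta(0, 7)` gives 8 integer bits for $\beta$, and `integer_bits_p(4)` gives 2 integer bits for $P$.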
IV. Implementation Results
For a more detailed analysis of the true silicon complexity, the proposed architecture must be implemented in hardware. The corresponding results and parameters used in the reference design for a system with $M_T = K = 4$ are summarized in Table I. A direct comparison of the proposed hardware architecture to other multi-user precoders is not possible since, to the best of our knowledge, no other VLSI implementations of similar algorithms have been reported in the open literature.
Hence, we compare the complexity overhead associated with adding the proposed LR-aided vector perturbation unit to the matrix-inversion unit described in [8], which has a silicon area of 89K GEs and requires 610 ns for a 4 × 4 matrix. The throughput-optimized LR unit requires only a fraction of the time needed for matrix inversion. This indicates that a more area-optimized design for LR may be a more economic choice, since spending more time on LR would only have a minor impact on the total preprocessing latency. However, a high-throughput architecture for LR does become relevant in MIMO-OFDM systems, where the LR unit can be shared among multiple matrix-inversion circuits operating in parallel.
Corresponding results are shown in Fig. 6, where a single pipelined LR unit is used together with one, four, and eight matrix processors, required to handle the preprocessing of 1, 64, and 128 matrices within 10 µs.
Fig. 6. Share of the total silicon area occupied by the LR unit when 1, 64, and 128 matrices need to be processed within 10 µs.
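The number of matrix processors quoted for Fig. 6 follows from the inversion time and the latency budget. The sketch below assumes that each processor inverts its matrices one after another and that the time spent in the shared LR unit is negligible; the function name is hypothetical.

```python
import math

INVERSION_TIME_NS = 610     # per 4 x 4 matrix, as reported for [8]
BUDGET_NS = 10_000          # 10 µs preprocessing budget


def matrix_processors(num_matrices):
    """Processors needed so that all matrices are inverted within the budget."""
    per_processor = BUDGET_NS // INVERSION_TIME_NS   # matrices one unit finishes in time
    return math.ceil(num_matrices / per_processor)


counts = [matrix_processors(n) for n in (1, 64, 128)]
```

One processor completes 16 inversions within 10 µs, so 1, 64, and 128 matrices require one, four, and eight processors, in line with the configurations named in the text.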
V. Conclusions
Reduced-complexity algorithms based on integer relations enable the efficient VLSI implementation of lattice-reduction aided vector perturbation for multi-antenna multi-user communication systems. The presented high-throughput VLSI architecture is the first implementation of such a system and provides a reference for the true silicon complexity of the required algorithms. Our implementation also illustrates how streamlining of algorithms, motivated by architectural considerations, can lead to considerable complexity reduction.
