Abstract-The high processing complexity of data detection in the large-scale multiple-input multiple-output (MIMO) uplink necessitates high-throughput VLSI implementations. In this paper, we propose-to the best of our knowledge-first matrix inversion implementation suitable for data detection in systems having hundreds of antennas at the base station (BS). The underlying idea is to carry out an approximate matrix inversion using a small number of Neumann-series terms, which allows one to achieve near-optimal performance at low complexity. We propose a novel VLSI architecture to efficiently compute the approximate inverse using a systolic array and show reference FPGA implementation results for various system configurations. For a system where 128 BS antennas receive data from 8 single-antenna users, a single instance of our design processes 1.9 M matrices/s on a Xilinx Virtex-7 FPGA, while using only 3.9% of the available slices and 3.6% of the available DSP48 units.
I. INTRODUCTION
Multiple-input multiple-output (MIMO) combined with spatial multiplexing [1] is the key technology in most modern wireless communication standards, such as 3GPP LTE or IEEE 802.11n. MIMO technology offers improved link reliability and higher data rates compared to single-antenna systems by simultaneously transmitting multiple data streams in the same frequency band. However, conventional point-to-point and multi-user (MU) MIMO wireless systems already start to approach the theoretical throughput limits. Consequently, novel transmission technologies become necessary to meet the ever-growing demand for higher data rates without further increasing the communication bandwidth [2] , [3] .
Large-scale MIMO (or massive MIMO) is an emerging technology, which uses antenna arrays having orders of magnitude more elements at the base station (BS) compared to conventional (small-scale) MIMO systems, while simultaneously serving a small number of users in the same frequency band [2] . This technology promises further improvements in spectral efficiency and link reliability over conventional (small-scale) MIMO systems [3] , [4] . In addition, large-scale MIMO has the potential to reduce the operational power consumption at the BS [2] , [5] .
Unfortunately, the benefits of large-scale MIMO come at the cost of significantly increased computational complexity in the BS compared to small-scale MIMO systems. Specifically, data detection in the large-scale MIMO uplink is among the most critical tasks, as the presence of hundreds of antennas at the BS requires novel detection algorithms that scale favorably to high-dimensional problems. Since optimal methods, such as maximum-likelihood (ML) detection or sphere decoding (SD) [6] , would entail prohibitive complexity, one has to resort to low-complexity sub-optimal linear detection schemes [3] or stochastic techniques, such as Markov-chain Monte-Carlo (MCMC)-based detection methods [7] .
Contributions: In this paper, we address the complexity issue associated with data detection in the large-scale MIMO uplink. We focus on linear soft-output data detection in combination with a novel approximate method for matrix inversion relying on Neumann series, which significantly reduces the computational complexity compared to exact inversion algorithms, while approaching the performance of SD methods. To demonstrate the efficacy of our inversion method, we present a novel systolic VLSI architecture for carrying out the inversion at high throughput for the high-dimensional problems arising in large-scale MIMO systems. Finally, we present reference implementation results on a Virtex-7 FPGA for various system configurations.
II. LARGE-SCALE MIMO UPLINK
We consider a large-scale multi-user (MU) MIMO system with N antennas at the BS communicating with M < N single antenna users. 1 In what follows, we focus on the uplink, i.e., where M users are simultaneously transmitting to the BS.
A. Uplink System Model
The transmitted bit stream for each user is first encoded using a channel encoder and then mapped to constellation points in the set O. is the (tall and skinny) uplink channel matrix, and n ∈ C N models additive noise at the BS; its entries are assumed to be i.i.d. zero-mean Gaussian with variance N 0 . We furthermore assume that the transmit symbols satisfy
B. Soft-Output Detection for Large-Scale MIMO
The task of the BS is to compute soft-estimates in the form of log-likelihood ratios (LLRs) for the coded bits given the channel matrix 2 H and the receive-vector y, by means of a soft-output MIMO detection algorithm (see, e.g., [8] ). Since the number of BS antennas N and the number of users M is expected to be much larger than that of conventional (smallscale) MIMO systems, one must resort to low-complexity detection schemes because sophisticated detection methods, such as the k-Best algorithm or sphere decoding, would entail prohibitive complexity [3] . The most prominent lowcomplexity MIMO detection method is minimum mean-square error (MMSE) detection [6] , which mitigates MU interference by inverting the effect of the channel matrix.
To arrive at low-complexity, we follow the linear detection approach of [9] . We start by computing the matched-filter output y MF = H H y and the M ×M Gram matrix G = H H H. Then, we compute the following regularized matrix:
An estimate of the transmit vector can then be computed aŝ
For soft-output detection, we further decompose the right-hand side of (2) for each user i as follows: 2 } (see [9] for the details).
III. LOW COMPLEXITY MATRIX-INVERSION FOR LARGE-SCALE MIMO
The main computational complexity of the algorithm outlined above is caused by the inverse of A. More specifically, computation of A −1 requires O(M 3 ) number of operations, which quickly results in prohibitive complexity for the typical dimensions arising in large-scale MIMO. Hence, to arrive at cost-effective hardware implementations, an efficient inversion method is of paramount importance. We next present a novel method that approximately inverts the matrix A at low complexity, by exploiting the fact that in large-scale MIMO systems the channel matrices H are tall and skinny and therefore, well-conditioned, in general [2] .
A. Neumann Series Approximation
To reduce the complexity of computing A −1 compared to direct (and exact) matrix inversion methods, we propose to use the Neumann series approximation [10] . Concretely, if A is close to an invertible matrix X satisfying
then the inverse of A can be rewritten using the following Neumann series [10] :
We can decompose A in (1) in (1) becomes a diagonally dominant matrix, i.e A ≈ D. For i.i.d. Gaussian channel matrices H and in the large antenna limit, i.e., for N → ∞ and a given M , it was shown in [2] that A → I M . This property of large-scale MIMO systems is the key to arrive at a low-complexity matrix-inversion method.
B. Low-Complexity Approximate Matrix Inversion
Since A is close to D for large-scale MIMO, we apply the Neumann series by letting X = D. By assuming that the matrix D is invertible and (4) holds, we can rewrite the inverse of A = D + E as
By keeping only the first k terms of the Neumann series, we arrive at the following k-term approximation of A −1 :
For k = 1, the inverse coincides to a MF detector, as (2) simply re-scales each entry of y MF . For k = 2, the inverse of A is approximated by A in place of A −1 , which significantly reduces the complexity of linear detection in large-scale MIMO systems.
C. Block Error-Rate (BLER) Performance in the Uplink
We now demonstrate the performance of the approximate inversion method for the large-scale MIMO uplink. We simulate a coded MIMO-OFDM system with 128 subcarriers, 16-QAM, and assume a 10 m linear antenna array, where the antennas are equally spaced similarly to [11] . We use the WINNERPhase-2 mode [12] to generate the channel matrices. At the BS, we use the soft-output MMSE detector described in Sec. II-B in combination with a rate-5/6 soft-input Viterbi decoder. Figures 1(a) and 1(b) show the BLER performance of the proposed algorithm compared to an exact matrix inverse and soft-output single tree-search SD [13] for M = 4 and M = 8 users. Evidently, the proposed inversion method approaches the performance of an exact matrix inversion algorithm for a large number of BS antennas N . Moreover, the performance of linear detection approaches that of max-log optimal softoutput SD at significantly lower complexity. In addition, the proposed method significantly outperforms MF detection typically considered for low-complexity detection in largescale MIMO [2] . For smaller systems, the approximate method incurs an error floor, which strongly depends on the number of BS antennas. We emphasize that practical communication standards commonly specify BLER = 10 −2 , which is below the error floor in typical large-scale MIMO configurations.
IV. VLSI ARCHITECTURE
We now present an architecture that is able to compute the 2-term approximation at very high throughput. In our architecture, we partitioned the inversion into two tasks, i.e., 1) computation of the Gram matrix G = H H H and 2) computation of A −1
, which are executed in pipelined fashion. Since both tasks produce symmetric matrices, we only process the lower-triangular part.
A. Gram-Matrix Computation Unit
The first unit computes the Gram matrix using an M × M lower-triangular systolic array. The architecture is shown in Fig. 2 for M = 3. Each processing element (PE, denoted by P k in Fig. 2 ) in the systolic array consists of a multiplyand-accumulate (MAC) unit. The transposed input matrix H T is shifted one column at a time into the systolic array. Each processing element performs a MAC operation with both input operands. To ensure that each PE processes the correct set of operands, the values in i th row of H T are delayed by i − 1 cycles. Once an input value reaches a diagonal PE, the value is conjugated and passed to the lower part of the systolic array.
There are two PE variants in the architecture. A PE on the main diagonal requires two multipliers to compute the squared absolute value of a complex number, |a + bj| 2 = a 2 + b 2 . A PE on the off-diagonal computes the product of two complex numbers, i.e., (a + bj)(c − bj). To reduce the number of multipliers, we deploy strength reduction, which requires three multipliers and five additions [14] . Consequently, the number of multipliers required in this unit is (3M 2 + M )/2. To minimize the critical path, each MAC unit is pipelined and has a throughput of one MAC operation per clock cycle.
B. Approximate Matrix Inversion Unit
To implement the approximate inverse, A ii
jj which is one real multiplication followed by a real to complex multiplication. As a result, this module requires three multipliers in total.
In the proposed architecture, we compute the entries of A 
C. FPGA-Friendly Reciprocal Unit
To compute D
−1
ii at high throughput, we designed a reciprocal unit that is particularly suitable for FPGA implementations. To improve numerical stability, we use a similar approach as described in [9] and shift the input value D ii = x such that it falls between 0.5 and 2, i.e., we get x = 2 m y, where y ∈ [0.5, 1) and m ∈ Z. We next rewrite y =ŷ + Δ, whereŷ corresponds to the 10 most significant bits of y and Δ the 10 remaining bits of y. We can now apply a firstorder Taylor-series expansion to arrive at the approximation a Fig. 4 shows the block diagram of the reciprocal unit implementing this approximation. The termŝ a −1 andâ −2 are obtained using two lookup tables (LUTs). The termâ −2 is multiplied by Δ and added toâ −1 . Finally, the result is multiplied by 2 −m to compensate for the initial shift. Since fixed-point simulations suggest that only the first LUT is necessary, the shaded region in Fig. 4 was omitted in the final design. Note that the proposed reciprocal unit requires only one table look-up, which is in contrast to the architecture used in [9] that consists of two multipliers and one LUT.
V. FPGA IMPLEMENTATION RESULTS
We implemented the approximate inversion architecture for large-scale MIMO systems on a Virtex-7 XC7VX1140T FPGA using Xilinx Vivado High-Level Synthesis 2012. 
A. Implementation Details and Fixed-Point Parameters
For the Gram matrix unit, the input values of the channel matrix H are 16 bit and we mapped all multiplications onto Xilinx DSP48E1 slices. As a result, the internal MAC registers are 48 bit. To reduce the output word-length, we truncate the final values of the Gram matrix to 20 bit for the next unit. For the inversion unit, we mapped all multiplications onto DSP48E1 slices and truncate the output to 16 bit. The LUT in the reciprocal unit has 1024 addresses and 18 bit outputs. Hence, we can implement it efficiently using one block-RAM (RAMB18E1) available on the FPGA. The resulting fixedpoint performance is shown with dashed lines (and labeled by 'FP') in Fig. 1(a) . The implementation loss for 4 × 32 is less than 0.2 dB SNR compared to a floating-point reference.
B. FPGA Implementation Results
We parameterized the architecture for N and M to explore the impact on the required FPGA resources and the maximum achievable throughput. The corresponding implementation results are detailed in Tbl. I. Note that-to the best of our knowledge-no implementation suitable for such highdimensional problems has been described in the open literature and, hence, no comparison to existing designs can be given.
We can see that Gram-matrix unit requires more clock cycles than the approximate inversion unit as the dimensions of the MIMO system increases. The maximum clock frequency of the approximate inversion unit is slightly slower than that of the Gram-matrix unit. Assuming 64-QAM, the throughput of a single unit for a 32 × 4, 128 × 8, and 128 × 16 system is 139.2 Mb/s, 95.52 Mb/s, and 128.64 Mb/s, respectively. In the case of OFDM or SCFDMA, the throughput can be increased by simply instantiating more units that process multiple subcarriers in parallel. As shown in Tbl. I, the inversion units are fairly small. Hence, the considered Virtex-7 FPGA has enough free resources to instantiate more than 10 units for a 32 × 4 and 128 × 8 system; the corresponding designs would easily reach throughputs exceeding 1 Gb/s. For 128 × 16 system, we could instantiate up to 7 units which corresponds to 0.9 Gb/s.
VI. CONCLUSIONS
We have developed-to the best of our knowledge-first implementation of a matrix inversion unit for linear data detection in the large-scale MIMO uplink. We have approximated the required matrix inversion using a small number of Neumannseries terms to reduces the complexity (compared to exact inversion) while only slightly degrading the error-rate performance in large-scale MIMO systems. To implement a fast approximate inversion unit, we have proposed a systolic VLSI architecture scaling favorably to large-dimensional problems. We have provided three reference designs, which enable data detection in large-scale MIMO systems at low complexity and high throughput with state-of-the-art FPGAs.
