Abstract-Large-scale (or massive) multiple-input multipleoutput (MIMO) is expected to be one of the key technologies in next-generation multi-user cellular systems based on the upcoming 3GPP LTE Release 12 standard, for example. In this work, we propose-to the best of our knowledge-the first VLSI design enabling high-throughput data detection in single-carrier frequency-division multiple access (SC-FDMA)-based large-scale MIMO systems. We propose a new approximate matrix inversion algorithm relying on a Neumann series expansion, which substantially reduces the complexity of linear data detection. We analyze the associated error, and we compare its performance and complexity to those of an exact linear detector. We present corresponding VLSI architectures, which perform exact and approximate soft-output detection for large-scale MIMO systems with various antenna/user configurations. Reference implementation results for a Xilinx Virtex-7 XC7VX980T FPGA show that our designs are able to achieve more than 600 Mb/s for a 128 antenna, 8 user 3GPP LTE-based large-scale MIMO system. We finally provide a performance/complexity trade-off comparison using the presented FPGA designs, which reveals that the detector circuit of choice is determined by the ratio between BS antennas and users, as well as the desired error-rate performance.
I. INTRODUCTION
Multiple-input multiple-output (MIMO) in combination with spatial multiplexing [3] builds the foundation of most modern wireless communication standards, such as 3GPP LTE [4] - [6] or IEEE 802.11n [7] . MIMO technology offers significantly higher data rates over single-antenna systems by transmitting multiple data streams concurrently and in the same frequency band. Conventional MIMO wireless systems, however, already start to approach their throughput limits. Consequently, the deployment of novel transceiver technologies is of paramount importance in order to meet the evergrowing demand for higher data rates, better link reliability, and improved coverage, without further increasing the communication bandwidth [8] - [10] .
M. Wu and B. Yin contributed equally to the paper. Parts of this paper for a large-scale, point-to-point MIMO-OFDM system have been presented at the IEEE International Symposium on Circuit and Systems (ISCAS) [1] and the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) [2] .
M. Wu, B. Yin, G. Wang, and J. Cavallaro are with the Dept. of ECE, Rice University, Houston, TX (e-mail: {mbw2, by2, wgh, cavallar}@rice.edu).
C. Dick is with Xilinx, Inc., San Jose, CA (e-mail: chris.dick@xilinx.com). C. Studer is with the School of ECE, Cornell University, Ithaca, NY (e-mail: studer@cornell.edu)
This work was supported in part by Xilinx and by the US National Science Foundation under grants CNS-1265332, ECCS-1232274, EECS-0925942, and CNS-0923479.
A. Blessing and Curse of Massive MIMO
Large-scale (or massive) MIMO is an emerging technology, which postulates the use of antenna arrays having orders of magnitude more elements at the base station (BS) compared to conventional (small-scale) MIMO systems, while serving tens of users simultaneously and in the same frequency band [8] . This technology promises significant improvements in terms of spectral efficiency, link reliability, and coverage compared to conventional (small-scale) systems [9] , [11] , [12] .
Unfortunately, the promised benefits of large-scale MIMO come at the cost of significantly increased computational complexity in the BS, as opposed to small-scale MIMO systems, which commonly deploy 2-to-4 antennas at both ends of the wireless link. In particular, data detection in the large-scale MIMO uplink is expected to be among the most critical tasks in terms of complexity and power consumption, as the presence of hundreds of antennas at the BS and a large number of users will increase the computational complexity by orders of magnitude. In addition, current cellular systems, such as 3GPP-LTE [4] , [5] or LTE-Advanced (LTE-A) [6] , rely on single-carrier frequency division multiple access (SC-FDMA), which further increases the dimensionality (and hence the complexity) of the underlying detection problem. As a consequence, optimal data detection methods, such as maximumlikelihood (ML) detection [13] - [15] or soft-output sphere decoding (SD) [16] - [18] , whose (average) computational complexity scales exponentially in the number of transmitted data streams [19] , [20] , would simply result in prohibitive complexity. Hence, one has to resort to low-complexity (but suboptimal) linear detection schemes [9] or stochastic detection algorithms [21] that deliver acceptable error-rate performance and scale favorably to the high-dimensional detection problems faced in SC-FDMA-based large-scale MIMO systems.
B. Contributions
This paper addresses the complexity issue of data detection in SC-FDMA-based large-scale MIMO systems in the uplink, i.e., where multiple users communicate with the BS. We focus on linear soft-output detection in combination with a new approximate matrix inversion method relying on a Neumann series expansion, which significantly reduces the computational complexity compared to that of an exact matrix inversion method. We analyze the implementation trade-offs associated with approximate and exact linear data detection in the large-scale MIMO uplink, and we show analytically that the approximation error caused by the proposed approximate inversion method depends on the ratio between BS antennas and users. We show that the proposed approximation performs well for medium to large ratios between BS antennas and users, while exact linear detection is advantageous for small antenna ratios. We present reference FPGA designs for both, the approximate and exact matrix inversion, and for various antenna configurations, which enables us to characterize the associated hardware complexity vs. error-rate performance trade-offs. The resulting FPGA designs are-to the best of our knowledge-the first data detection engines for massive MIMO systems reported in the open literature that achieve a peak uplink throughput exceeding the 300 Mb/s specified in 3GPP LTE-Advanced operating at 20 MHz bandwidth [6] .
C. Notation
Lowercase boldface letters stand for column vectors; uppercase boldface letters designate matrices. For a matrix A, we denote its transpose and conjugate transpose A T and A H , respectively. The entry in the k th row and th column of a matrix A is denoted by A k, ; the k th entry of a vector a is designated by a k . The Frobenius norm and 2 -norm of a matrix A and vector a are denoted by A F and a 2 , respectively. The M × M identity matrix is denoted by I M , and F M refers to the M ×M discrete Fourier transform (DFT) matrix, normalized as F H M F M = I M . In order to simplify notation, we make frequent use of the superscript (·) (i,j) to indicate the i th base-station antenna and j th user; the subscript (·) w designates the SC-FDMA subcarrier index.
D. Paper Outline
The remainder of the paper is organized as follows. Section II introduces the uplink system model and outlines the basics of linear detection for SC-FDMA-based systems. The approximate matrix inversion approach, a corresponding error analysis, and an error-rate performance/complexity comparison are shown in Section III. Section IV details our VLSI architecture. Section V provides reference FPGA implementation results and a trade-off analysis. We conclude in Section VI. All proofs are relegated to the Appendices.
II. LARGE-SCALE MIMO IN LTE UPLINK
We next introduce the LTE uplink model and present a new and efficient method for linear soft-output minimum meansquare error (MMSE) detection in SC-FDMA-based systems.
A. LTE Uplink Model
We consider the large-scale multi-user (MU) MIMO uplink with B antennas at the base-station (BS) communicating with U ≤ B single-antenna users. 1 To reduce the peak-to-average power ratio of the user equipment, LTE uplink employs SC-FDMA (short for single-carrier frequency division multiple access) [5] . The U users first encode their own transmit bits using channel encoders and then, map the coded bit stream 1 More generally, instead of having U single-antenna users, the proposed system may equivalently support U spatial streams, which can, for example, be shared among a smaller number of user terminals that are equipped with more than one antenna.
to time-domain constellation points in the finite alphabet O with cardinality M = |O| and average transmit power E s per symbol. An L-point discrete Fourier transform (DFT) block 2 is used to perform modulation of these time-domain symbols onto orthogonal frequency bands. The L time-domain constellation points for the i th user are subsumed in the vector
T . The output of the DFT block, namely the frequency-domain symbol, is defined as
. Subsequent processing performed for each user corresponds to that of conventional orthogonal frequency-division multiplexing (OFDM) transmission [22] . Specifically, for each user, the frequency-domain symbols are first mapped onto data-carrying subcarriers and then, transformed back to the time domain with an inverse DFT (IDFT). After prepending the cyclic prefix to the time-domain symbols, all U users transmit their time-domain signals simultaneously over the wireless channel.
At the BS, each receive antenna obtains a mixture of the time-domain signals from all users. For data detection, the time-domain signals received at each antenna are first transformed back into the frequency domain using a DFT. The data-carrying symbols are then extracted from the DFT's output. Assuming a sufficiently long cyclic-prefix (i.e., longer than the delay spread of the channel's impulse response), the received frequency-domain symbols can be modeled using the standard input-output relation y = Hs + n, with the following definitions:
. . .
. . . . . .
Here, the vector
T contains the received symbols on the i th antenna in the frequency domain, where y
is the symbol received on the w th subcarrier of the i th antenna.
contains the channel's frequency response of length L between the i th receive antenna and j th transmit antenna on its main diagonal, and
T models thermal noise at the i th receive antenna in the frequency domain. The entries of the vector n (i) are assumed to be i.i.d. zero-mean Gaussian with variance N 0 per complex entry.
B. Linear MMSE Detection
The task of a data detector for MIMO systems is to compute soft-estimates in the form of log-likelihood ratio (LLR) values for each coded bit, given the channel matrix 3 H and receive vector y. In order to arrive at low computational complexity for data detection in SC-FDMA-based large-scale MIMO systems, we focus exclusively on linear soft-output detection [24] . Linear detection for SC-FDMA mainly consists of the following two steps: (i) channel equalization to generate estimates of the frequency domain symbols, and (ii) soft-output computation to generate LLRs from the equalized frequency domain symbols. Both of these steps are detailed next.
(i) Channel equalization: The most common approach to linear MIMO detection is the minimum-mean square error (MMSE) equalizer, which computes equalized frequencydomain symbols asŝ = Wy with the MMSE equalization matrix defined as follows [3] :
Since the effective channel matrix H is built from diagonal L×L submatrices, we can apply MMSE equalization on a persubcarrier basis. Specifically, the received frequency symbols on the w th subcarrier in the frequency domain can be modeled as y w = H w s w + n w , where
, and n w = n
Here, y
w is the frequency symbol received on the w th subcarrier for the i th antenna, and h
is the frequency gain (or attenuation) on the w th subcarrier between the i th receive antenna and j th transmit antenna. The scalar s
w denotes the symbol transmitted by the j th user on the w th subcarrier; the scalar n (i) w models thermal noise at the i th receive antenna on the w th subcarrier. With this reformulation, the equalized symbols on the w th subcarrier are given byŝ w = W w y w , with the per-subcarrier MMSE equalization matrix defined as
A key method to arrive at low-complexity linear MMSE detection was put forward in [25] . This approach first computes the matched-filter (MF) output as y (ii) LLR computation: To obtain symbol estimates in the time domain, the MMSE detector performs an IDFT on the equalized frequency domain symbols for each user. The time-domain symbol estimates for the i th user are given bŷ
T contains the time-domain symbol estimates of the symbols transmitted by the i th user. To extract LLRs from the time-domain symbol estimates, we approximate each estimate as an independent Gaussian random variable. In 4 We are aware of the fact that the estimateŝw could be computed without forming the explicit inverse A −1 w , e.g., via the Cholesky decomposition combined with forward/backward substitution [26] . However, soft-output detection as performed here requires the explicit inverse A −1 w to compute the post-equalization SINR (see Section II-B). particular, the estimated t th symbol transmitted from the i th user is modeled asx 
where ρ In order to obtain an explicit formulation of the effective channel gain µ (i) as well as the NPI variance ν 2 i , we can write the t th symbol estimate of the i th user as follows:
Here,
is a horizontal concatenation of the i th block row of (diagonal) submatrices of W. The row vector f H t corresponds to the t th row of
be the horizontal concatenation of the j th block column of (diagonal) submatrices of H, consisting of the frequencydomain channel responses between the receive antennas and the transmit antenna associated with the j th user. We first compute the effective channel gain:
Since W (i,j) and H (i,j) are both diagonal matrices, we can write µ (i) as a sum of per-subcarrier operations. In particular, let w H i,w be the i th row of W w and h i,w be the i th column of H w . Then, we obtain the effective channel gain as
We next compute the post-equalization NPI variance ν 2 i of the residual noise plus interference as
The MMSE equalization matrix can be written in two ways [25] , i.e., either
Hence, we have W E s HH H + N 0 I LB×LB = E s H H ; this allows us to rewrite the post-equalization NPI in compact form as follows [25] :
We emphasize that both parameters µ (i) and ν w , ∀w, in (1) is responsible for the main computational complexity of linear MMSE detection in SC-FDMA-based large-scale MIMO systems. For a conventional small-scale LTE uplink scenario, i.e., where the number of receive antennas B and users U is small (on the order of U, B ≤ 6), existing VLSI designs for linear detection, such as [28] - [30] , compute the exact inverse explicitly. For large-scale MIMO systems with a large number of users U however, the computation of the inverse A −1 w can quickly result in excessive complexity. Hence, practical solutions for large-scale MIMO detection in LTE necessitate low-complexity matrix inversion methods-a corresponding approximate solution is proposed next.
A. Neumann Series Approximation
For large-scale MIMO systems, where the number of receive antennas is larger than the number of single-antenna users, i.e., for U B, the Gram matrices G w , and, consequently A w , become diagonally dominant [9] . In fact, for i.i.d. Gaussian channel matrices H w (with properly normalized entries) and in the large antenna limit, [8] shows that G w → I U . Inspired by this central property of large-scale MIMO, one can derive a low-complexity approximation of the inverse. In particular, let A w ≈ D w , where D w is the main diagonal of A w . As a result, the inverse A w , which requires evidently much lower complexity than that of the exact inverse. Unfortunately, for realistic antenna/user configurations, such a crude approximation would cause a significant performance loss. Hence, to arrive at an accurate approximation of the inverse at low computational complexity, we propose to use a Neumann series expansion.
We start by rewriting the inverse A −1 w with the following Neumann series expansion [31] :
which holds if lim n→∞ (I − X −1 A w ) n = 0 U ×U is satisfied. By decomposing the regularized Gram matrix A w such that A w = D w + E w , where D w is the main diagonal of A w and E w is the hollow, regularized Gram matrix, we can rewrite the Neumann series in (5) as
where we substitute X in (5) by D w . Note that if
The key idea of the proposed approximate inversion method is to keep only the first K terms of the Neumann series (6) . Concretely, we compute a K-term approximation as follows:
which can be computed at low computational complexity for approximations consisting of only a few Neumann series terms, i.e., for small values of K. With this approximation, the resulting approximate MMSE equalization matrix is given by
w , which is simply a scaled version of the MF detector, as W w exists. Hence, the proposed approximation (7) simply coincides with the MF detector for K = 1. For K = 2, we obtain A −1
w , whose computational complexity only scales with O(U 2 ) operations; this is in contrast to the O(U 3 ) complexity scaling required by computing an exact inverse. Hence, a second-order Neumann series approximation can be obtained at lower computational complexity. For K = 3, we obtain
whose complexity scales with O(U 3 ), which is equivalent to that of an exact inverse. Nevertheless, evaluating (8) requires fewer arithmetic operations than an explicit evaluation of A −1 . Note that for K ≥ 4, computing the exact inverse can be of lower complexity than the proposed approximation, e.g., when using a Cholesky factorization (see Section III-D). 5 
B. Analysis of the Approximation Error
We next analytically characterize the error induced by the approximate inverse (7) for MMSE estimation. To this end, we define the approximation error as
Now, consider the situation of using the approximate A
w to compute the equalized frequency-domain symbols, i.e.,
w y MF w being the exact estimate. We can bound the 2 -norm of the residual estimation error resulting from this approximate equalization by
From (9), we see that if the condition
is satisfied, then the approximation error approaches zero exponentially fast as K → ∞. Moreover, one can show that (10) is a sufficient condition for (6) to converge. We now show that the condition D −1 w E w F < 1 is satisfied with high probability for large-scale MIMO systems with a larger number of BS antennas B than users U , and if the entries of H w ∈ C B×U are assumed to be i.i.d. circularly symmetric complex Gaussian with unit variance. More specifically, we arrive at a condition that only depends on U and B for (i) the proposed Neumann series to converge and (ii) the residual approximation error (9) to be small. The following theorem, which is proven in Appendix A, makes this behavior explicit. 
.
We emphasize that this theorem provides conditions 6 for which the Neumann series converges with a certain probability; this can be accomplished by setting α = 1 and K = 1 and by inspecting the convergence condition (10) . Furthermore, Theorem 1 provides conditions for which the residual estimation error (9) is small. In both cases, we can see from Theorem 1 that increasing the ratio between the number of BS antennas B and the number of users U increases the probability of convergence. Moreover, for α < 1, increasing K also increases the probability that the residual estimation error caused by a K-term approximation in (9) is smaller than α.
We note that Theorem 1 also provides insight into the behavior in the large-antenna limit, i.e., for B → ∞ while U is held constant. In this case, we have Pr D
, which implies (i) that the Neumann series converges with probability 1 and (ii) that the approximation error for any K-term approximation is arbitrary small, which includes the MF detector (corresponding to K = 1). We note that this behavior is in accordance with existing results for MF detection in large-scale MIMO systems [8] , [32] . 6 The result in (11) also holds for the case where the regularization term
vanishes, which coincides to ZF detection. As a consequence, the condition (11) is rather pessimistic and is likely to be sub-optimal, especially for
The derivation of a tighter condition is left for future work.
C. Channel Gain and NPI Variance Computation
Computation of the max-log LLRs via the proposed Neumann series approximation is carried out by simply replacing the exact inverse A −1 w by the approximationÃ −1 w | K to perform MMSE equalization (2) . For this approximation, the effective channel gainμ In order to compute the effective channel gainμ
K , we first construct the LU × LU matrix W −1 · | K from the subcarrier equalization matrices W −1 w | K , ∀w, as explained in Section II-B. With this, we haveμ
, which can be rewritten as in (3) by replacing W (i,:) with W (i,:) · | K . Consequently, the effective channel gain is given bỹ µ
w . In order to compute the post-equalization NPI varianceν 2 i|K one might assume that it simply corresponds to E sμ (4). Unfortunately, this expression no longer holds, because of the following fact:
Furthermore, the above NPI variance expression is not guaranteed to be non-negative and hence, using it to compute LLR values inevitably results in poor error-rate performance. As a consequence, an alternative expression forν 2 i|K is required when using the approximate matrix inverse for data detection. Following the steps of the derivation ofν · | K , the exact post-equalization NPI variance can be expressed as:
Since 
This expression, however, is computational intensive, as it involves the L matrix multiplications, each requiring O(U 3 ) operations. In order to reduce the complexity of computing ν 2 i | K , we can use the K = 1 term approximation NPĨ
as a substitute forṽ
is the i th diagonal entry of D w , a H i,w is the i th row of A w , and g i,w is the i th column of G w . This approximation requires low computational complexity as it involves only L inner products, each requiring U operations. In addition, the larger K is, the closer the approximate inversion in (7) is to the exact inverse (assuming the Neumann series converges). Hence, for K > 1, the exact NPI variance would be lower thanν 2 i | K , which reveals that (14) is a pessimistic approximation.
We emphasize that we can further reduce the computational complexity of the NPI approximation in (14) .
w . Hence, we propose the following low-complexity NPI approximation:
Note that our own simulations show that the low-complexity NPI approximation (15) performs well compared to the exact NPI variance (13) . For example, the performance loss caused by the approximation compared to the exact NPI computation for U = 4, B = 8, and K = 3 is less than 0.02 dB at a BLER of 10 −2 (cf. Section III-D2 for the simulation settings).
D. Simulation Results
We next demonstrate the advantages and limitations of the proposed approximate matrix inversion approach in terms of computational complexity and error-rate performance. To assess the error-rate performance for practically relevant antenna configurations, we note that the Samsung Full-Dimensional MIMO prototype [33] consists of 64 BS antennas, whereas the massive MIMO research platform developed at Rice University [34] currently consists of 96 BS antennas (with plans for larger array sizes). Hence, we focus our results on the following cases: B = 64, B = 128, and B = 256.
1) Computational complexity:
To demonstrate that the proposed approximate inverse exhibits (often significantly) lower complexity than an exact inverse, we chose a Cholesky decomposition-based inverse as a reference (see Section IV-E for algorithm details), as this method exhibits lower complexity compared to other inversion algorithms, including (but not limited to) direct matrix inversion, QR decomposition, or LU factorization [14] , [26] . The computational complexity (characterized by the sum of real-valued division 7 , addition, and multiplication 8 operations) of an exact Cholesky-based inverse scales with O(U 3 ), whereas the complexity of a K = 1 and K = 2 Neumann series expansion scales only with O(U ) 7 The number of divisions is not significant for the total operation count. 8 To obtain the real-valued multiplication count, we assumed four realvalued multiplications per one complex-valued multiplication. One could further reduce the number of the real-valued multiplications by using strengthreduction; this approach, however, maintains the trends observed in Fig. 1 . and O(U 2 ), respectively. The computational complexity of K ≥ 3 is dominated by matrix-by-matrix multiplications, where the number of such operations grows linearly with K. For example, K = 3 requires one matrix-by-matrix multiplication, whereas K = 4 requires two. In general, a K ≥ 3 term approximation requires K − 2 matrix-by-matrix multiplications. As a result, the complexity of a K ≥ 3 term approximation is O((K − 2)U 3 ). Hence, we have O(U 3 ) for K = 3, which is equivalent to that of an exact Cholesky-based inverse. Consequently, a Neumann series approximation with K ≥ 3 does not appear to be advantageous.
The overall operation counts of both methods are dominated by the number of real-valued multiplications and additions, where real-valued multiplication is more expensive than realvalued addition. Since asymptotic complexity scalings do not, in general, reveal the full truth, we count the number of realvalued multiplications of both methods in Fig. 1 for varying numbers of users U . We observe that for K ≤ 3, the Neumann series approach results in substantially lower complexity than the exact inversion approach. As expected, K ≥ 4 results in higher complexity than a Cholesky-based exact inversion.
2) Error-rate performance: Evidently, the reduction in complexity for K ≤ 3 Neumann series terms comes at the cost of an approximation error (cf. Section III-B). To characterize the associated performance loss, we now compare the errorrate performance of the proposed approximate matrix inverse with the error-rate performance of the exact inversion for an LTE-based large-scale MIMO uplink system. To this end, we show simulation results of an SC-FDMA LTE uplink system with B antennas at the BS and U ≤ B single-antenna users. In particular, we study a challenging communication scenario (from an error-rate perspective) and focus on the MCS (modulation and coding scheme) of the highest rate (i.e., MCS 28) and 20 MHz bandwidth with 1200 subcarriers, as specified by the LTE standard [4] ; this mode corresponds to 64-QAM, and a rate ≈ 0.75 3GPP LTE turbo code. In order to generate channel matrices that reflect a potential 9 real-world scenario, we use the WINNER-Phase-2 model [35] . In addition, we assume a linear antenna array with an antenna spacing of 10/128 ≈ 0.0781 m, which resembles that of the real-world channel measurement campaign in [36] . At the BS, we use the exact and approximate soft-output MMSE detectors detailed above. Furthermore, we use a log-MAP LTE turbo decoder that performs 16 (full-)iterations. Further, we define signalto-noise-ratio (SNR) as BE s /N 0 , which corresponds to the average SNR per receive antenna.
Figures 2(a), 2(b), and 2(c) show the block-error rate (BLER) performance of the proposed approximate detection algorithm compared to that of an exact MMSE detector for U = 4, U = 8 and U = 12, respectively.
We see that for small ratios between BS antennas and users, the MF detector (equivalent to K = 1) and the Neumann series approximation for K = 2 result in large residual errors. 10 Hence, considering the 10% BLER requirement for LTE [4] , the MF detector and K = 2 term approximation are not suitable in practice in the considered 64-QAM cases (note that this fact is also reflected by Theorem 1). For a larger number of BS antennas, this error floor can be recovered partially. Our own simulations have shown that the MF detector achieves < 10 −2 BLER for U = 4 and B = 512. Furthermore, for 16-QAM, our approximation method requires smaller values of K (see [1] , [2] for corresponding 16-QAM simulations in a large-scale MIMO-OFDM setting).
We see that for 64-QAM, the proposed approximate inversion method with K = 3 terms is able to approach the performance of the exact detector, i.e., the BLER performance loss is less than 0.25 dB SNR at 10 −2 BLER in all of the K = 3, U = 4 cases and the K = 3, U = 8, B = 256 case. Hence, the proposed approximate inverse for K = 3 can deliver the performance of an exact inversion at (often substantially) lower complexity for large ratios between BS antennas and users. For small antenna ratios, however, the approximate inverse with K = 3 exhibits an error floor.
We conclude that systems with small ratios between BS and user antennas will need to resort to an exact inverse, while systems with large ratios can take advantage of the proposed approximate inverse. Hence, we next propose corresponding MIMO detection architectures for both, the approximate inverse and an exact Cholesky-based inverse.
IV. VLSI ARCHITECTURE
We now detail two VLSI architectures suitable for largescale MIMO detection in 3GPP LTE-A. The first design implements the proposed approximate inversion approach and the second design implements an exact inverse; this enables us to perform a fair hardware complexity vs. error rate performance comparison (see Section V for the comparison). 9 To the best of our knowledge, no specific channel model for large-scale MIMO systems is available in the open literature. 10 Compared to lower modulation orders, such as 16-QAM (not shown here), 64-QAM requires a relatively high SNR to perform well. 
A. Architecture Overview
The proposed general architecture is depicted in Fig. 3 and consists of the following parts. The preprocessing unit performs matched filter computation, i.e., computes y MF w = H H w y w , the regularized Gram matrix, and the (approximate) inverse. Note that for the approximate inversion unit, we also output D −1 w and G w , which are needed to compute the SINR (cf. Section III-C). To achieve the peak throughput specified in LTE-A [6] , while being able to handle the (worst) case where the channel estimates change from subcarrier to subcarrier and from SC-FDMA symbol to SC-FDMA symbol (see, e.g, [37] ), we use multiple instances of the preprocessing unit. 11 The matched filter output, the (approximate) inverse, and the regularized Gram matrix, are then passed to the subcarrier processing unit. This unit performs equalization, i.e., computesŝ w = A −1 w y MF w and the post-equalization SINR (detailed in Section II-B for the exact inverse and in Section III-C for the Neumann series approximation). To perform per-user data detection, a buffer is required that aggregates all equalized symbols and SINR values, which are computed on a per-subcarrier basis. The architecture then performs an IFFT, which transforms the equalized symbols from the subcarrier domain into the user domain (or time domain). The LLR computation unit finally computes, together with the buffered post-equalization NPI values, soft-output information in the form of max-log LLRs (2). We next provide the details for the key blocks of the proposed detector architecture.
B. Approximate Inversion and Matched Filter Units 1) Approximate inverse computation:
In order to achieve high throughput, we propose a single systolic array that computes both, the regularized Gram matrix and the approximate inverse in four phases. The proposed architecture is detailed in Fig. 4 and is capable of computing inverses for various Kterm expansions, i.e., the number of Neumann series terms can be selected at run-time. As shown in Fig. 4 , the lower triangular systolic array consists of two distinct processing elements (PEs): (i) PEs on the main diagonal of the systolic array (referred to as PE-D) and PEs on the off-diagonal (referred to as PE-OD). As detailed next, both PEs have different modes in the four computation phases.
In the first phase, the U × U normalized regularized Gram matrix A w /B = (G w + N 0 E −1 s I U )/B is computed in B clock cycles. Since A w is diagonally dominant with diagonal entries close to B, i.e., the number of BS antennas, we reduce its dynamic range by computing a normalized version, whose entries on the main diagonal are close to 1 by the 'scale down' unit shown in Fig. 4 ; this trick mitigates dynamicrange issues, which are common for matrix inversion circuits implemented with fixed-point arithmetic. The systolic array also computes D w E w only requires a series of scalar multiplications (rather than a matrix multiplication).
In the third phase, the systolic array computes the K = 2 term Neumann series approximation, i.e., A In the fourth phase, the K-term Neumann series approximation is computed with the results residing in the distributed register files. In particular, the systolic array first performs a matrix multiplication of −D 2) SINR computation unit: The SINR computation unit simply consists of U MAC units that sequentially compute the approximate effective channel gainμ
K . This unit furthermore computes the approximate NPI (15) using a single MAC unit. Subsequently, the unit multipliesμ 
D. IFFT and LLR Computation Units 1) IFFT unit:
In order to transform the per-subcarrier data into the user (or time) domain, we deploy a single Xilinx Discrete Fourier Transform IP LogiCORE unit (see [38] for the specifications). This unit supports all forward and inverse DFT modes specified in 3GPP LTE [4] , but we only make use of its IDFT capabilities. The IFFT unit reads and outputs data in a serial manner. For an IFFT transform size of 1200 subcarriers, the core can process a new set of data every 3779 clock cycles. This FFT unit achieves more than 317 MHz on a Virtex-7 XC7VX980T FPGA and hence, achieves a throughput beyond 600 Mb/s for 8 users, 64-QAM, and 20MHz bandwidth.
2) LLR computation unit: The LLR computation unit (LCU) generates max-log soft output values given the effective channel gains µ 
t ) and realizing that λ b (·) is a piecewise linear function that depends on the bit index (see [25] for the details). To this end, the LCU first scales the real and imaginary parts of the equalized time-domain symbol with the reciprocal of the effective channel gain 1/µ (i) . Then, it evaluates the piecewise linear function λ b (x (i) t ) and scales the result with the post-equalization SINR ρ 2 i . The resulting max-log LLR value is then delivered to the output of the unit. In order to minimize the circuit area, the proposed architecture evaluates each piecewise linear function with logical shifts and additions only. The reciprocals are computed with a lookup table that is stored in B-RAM units (see [1] for architectural details). A single instance of the resulting LCU is able to processes one symbol every clock cycle, resulting in a peak throughput of 1.89 Gb/s for 64-QAM at 317 MHz.
E. Reference Cholesky-based Inversion Unit
In order to enable a fair performance/complexity assessment of the proposed approximate matrix inversion unit, we also implemented a reference unit that performs an exact matrix inversion. This unit simply replaces the approximate inverse unit detailed in Section IV-B. We next summarize the used Cholesky-based inversion algorithm and then, outline the corresponding VLSI architecture.
1) Inversion algorithm: In the proposed exact inversion unit, we compute A −1 w in three steps: (i) we form the regularized Gram matrix
where L w is a lower-triangular matrix with real-values on the main diagonal [26] ; (iii) we compute the inverse A −1 w using an efficient forward/backward substitution procedure proposed in [25] . Specifically, we first solve L w u i = e i for u i , i = 1, . . . , U , where e i is the i th unit vector, via forward substitution. We then solve L H w v i = u i for v i , i = 1, . . . , U , via back substitution, which leads to the desired inverse A
Note that this approach avoids a costly matrixby-matrix multiplication, which would be needed by directly computing
w . 2) Cholesky decomposition architecture: The VLSI architecture for the Cholesky-based inverse differs from the one in Section IV-B. In particular, we deploy three separate units that compute (i) the regularized Gram matrix, (ii) the exact inverse using the above algorithm, and (iii) a forward/backward substitution unit to compute the inverse A −1 w . All units are detailed next and separated by pipeline stages.
The regularized Gram matrix is computed as a sum of outer products, i.e., as
, where r i designates the i th row of H w . Since the Gram matrix is symmetric, it can be computed efficiently with a triangular systolic array of multiply and accumulate units (MACs), similar to the array detailed in Section IV-B. The Gram computation unit reads one row of H w at a time and is able to output a Gram matrix every B th clock cycle. To obtain the regularized Gram matrix A w , we add N 0 E −1 s to the diagonal of G w in the final clock cycle. We then perform the Cholesky decomposition of A w with a lower-triangular systolic array to obtain the lower-triangular matrix L w . The systolic array consists of two distinct processing elements (PEs): (i) the PEs on the main diagonal and (ii) the PEs on the off-diagonal. The data flow is similar to the linear systolic array (the "obvious case") proposed in [39] . The difference is that our design processes an incoming column of A w with multiple PEs, whereas an incoming column is processed with a single PE in [39] . As a result, our design is able to achieve the peak throughput requirements of LTE-A. In our design, the pipeline of one column of PEs is 16 stages deep and streams out one column of L w every clock cycle (after a latency of 16(U − 1) clock cycles). Consequently, the achieved throughput corresponds to one Cholesky decomposition every U clock cycles.
3) Forward/backward-substitution architecture: The forward/backward substitution unit (FBSU) receives a lowertriangular matrix L w as input, and computes A . . , U ) is independent, we use U processor elements (PEs) to solve for all x i in parallel. Each PE is implemented using a fully pipelined architecture, which consists of U stages of computation logic. Each stage contains two multiplexers, a complex-valued multiplier, and a complex-valued subtraction. In each stage, either
is computed according to the control signals. Therefore, for an input matrix L w of dimension U , the FSU uses U 2 complexvalued multipliers; the entire FBSU utilizes 2U 2 complexvalued multipliers. The matrix conjugate unit is implemented using multiplexers and U FIFOs (realized by on-chip B-RAMs in the FPGA). The conjugate matrix L H w is also reordered based on the pattern of the input sequence of the BSU.
V. IMPLEMENTATION RESULTS AND TRADE-OFFS
The approximate detection engine for 3GPP-LTE and the exact Cholesky-based detector have been implemented on a Xilinx Virtex-7 XC7VX980T FPGA. The fixed-point parameters, FPGA implementation results, and the associated performance/complexity trade-offs are presented next.
A. Fixed-Point Design Parameters
In order to minimize the hardware complexity, fixed-point arithmetic is used in the entire design. The associated fixedpoint parameters were determined via extensive simulations. In the following, the word-lengths refer to the real or imaginary part of a complex-valued number.
The channel matrices H w , the receive-vectors y w , and the noise variance N 0 E −1 s , are all quantized to 15 bit. The wordlength of the output of the Gram matrix and inversion unit are also set to 15 bit; equivalently, the matched filter unit has 15 bit at the input and output. For both matrix inversion circuits, all multiplications have been mapped onto Xilinx DSP48 slices. In order to achieve sufficient precision at minimum implementation complexity, the MAC registers within the DSP48 units are set to 22 bit. The LUT in the reciprocal unit consists of 1024 addresses with 12 bit outputs. Hence, it can be implemented efficiently using a single block-RAM (B-RAM) available on the FPGA. The equalizer module uses a 15 bit input and its output, which is stored in the data buffer, is quantized to 12 bit. The buffer stores (complex-valued) data for 1200 subcarriers and U users. The SINR computation module has a 15 bit input and 12 bit output. The input and output of the IFFT unit are 12 bit; the precision of the internal multipliers is set to 18 bit. The inputs of the LLR computation are quantized to 12 bit and the computed LLRs are represented by 8 bit.
The resulting fixed-point performance is shown in Fig. 2 (labeled by 'FP') for 64 × 4 and 128 × 8 systems. As it can be seen, the fixed-point implementation is virtually indistinguishable from the floating-point golden model. In particular, the implementation loss is less than 0.05 dB SNR at 10% BLER.
B. FPGA Implementation Results
Table I summarizes the key (post-place-and-route) implementation results of the proposed approximate and exact softoutput data detector for LTE-based massive MIMO wireless systems. We parameterized the architecture for U and B to explore the impact on the required FPGA resources and the corresponding throughput. The implementation results for antenna configurations of 128 × 8 and 64 × 4 are detailed in Table I . In order to support 75 Mb/s data rate for each LTE-A user in 20 MHz bandwidth, we use multiple instances of the preprocessing unit. Specifically, we used 8 and 5 instances of approximate matrix inversion units for the 128 × 8 and 64 × 4 system, respectively. For the exact inverse, we used 6 and 3 regularized Gram matrix units for the 128 × 8 and 64×4 system, respectively. In addition, we used one Cholesky decomposition unit and one forward and backward substitution unit for both cases to meet the data rate requirements.
As shown in Table I , all designs are capable of running at 317 MHz and the critical path is the routing between different blocks of the detector. For the 128 × 8 and 64 × 4 systems, the proposed units can achieve 603 Mb/s and 301 Mb/s, respectively. For the 64 × 4 system, the design meets the 300 Mb/s peak data rate requirement specified in LTE-A with 4 users and 20MHz bandwidth. In addition, our design can scale beyond LTE-A specifications, i.e., the proposed designs can support up to 8 users and still achieve a 75 Mb/s per-user requirement.
In terms used resources on the Virtex-7 XC7VX980T FPGA, the approximate soft-output data detector is smaller than the Cholesky-based unit. There are notable saving in logic slices and DSP48 units. For 64 × 4, K = 3 uses 56% fewer LUT slices and 29% fewer DSP48 units compared to that of the Cholesky-based unit. For 128 × 8, K = 3 uses 19% fewer LUT slices and 26% fewer DSP48 units compared to that of the Cholesky-based unit. We emphasize that the savings in hardware resources become significantly larger as the number of users U increases.
C. Performance/Complexity Trade-off
Based on the simulated BLER results in Fig. 2 and the associated FPGA implementation results, we are now ready to characterize the error-rate performance vs. hardware complexity trade-offs associated with the detector containing the proposed approximate matrix inversion and the Cholesky-based exact inversion. To this end, we show the associated hardware complexity against the minimum SNR required to achieve 10% BLER in Fig. 5 . Since both designs are dominated by multipliers, we define the hardware complexity as the number of multipliers required to achieve a 75 Mb/s per-user throughput.
From Fig. 5 , we observe that the hardware complexity of the Cholesky-based detector is larger than that of the approximate inversion circuit for K = 3 and K = 2. In addition, for large ratios between the number of BS antennas to the number of users B/U , we clearly see that the SNR performance of the approximate inverse with K = 3 and the exact inverse are very similar. For small ratios B/U , however, the performance difference between the approximate inverse and the exact inverse is rather large, which is reflected in the analysis shown Section III-B. Hence, the ratio B/U determines whether an approximate or exact inversion is beneficial in a practical largescale MIMO system. Note that for 128 × 8 and 64 × 4, the approximate inverse with K = 2 is unable to achieve 10% BLER (cf. Fig. 2 ). We note that when considering 16-QAM modulation (rather than 64-QAM modulation, as shown here), the approximate inversion for K = 2 is capable of achieving similar performance as the exact inverse (see [1] , [2] for corresponding simulation results).
D. Related FPGA Designs for Linear Data Detection
A host of FPGA designs for linear data detection in conventional (small-scale) MIMO systems have been proposed in the literature [28] - [30] , [40] - [44] . Unfortunately, all these designs differ in various ways. First, the corresponding architectures rely on different matrix inversion algorithms, such as the QR decomposition [29] , [40] , [43] , [44] , Gram-Schmidt orthogonalization [30] , [45] , LU decomposition [25] , direct matrix inversion [42] , divide-and-conquer methods [41] , [46] . Second, all FPGA implementations do not generate soft outputs, with the exception of [47] . Third, the designs were implemented on different FPGA types.
Since the soft-output detector implementations proposed in this paper are for large-scale MIMO systems having hundreds of BS antennas and none of the small-scale MIMO detector designs in [28] , [29] , [40] - [46] was implemented on a Xilinx Virtex-7 FPGA, a fair comparison of our design with the above-mentioned implementations is difficult. Hence, we decided to resort to the comparison with our own reference circuit, i.e., the Cholesky-based inverse, as shown in Section V-C.
VI. CONCLUSIONS
We have proposed a new soft-output data detector for large-scale (or massive) MIMO-based 3GPP LTE-Advanced (LTE-A) systems. The proposed solution is capable of performing high throughput detection in single-carrier frequency division multiple access (SC-FDMA)-based large-scale MIMO systems equipped with hundreds of antennas at the base station (BS). In order to achieve low computational complexity, we have proposed a new approximate linear detector relying on a Neumann series approximation of the matrix inverse.
We have designed two reference VLSI architectures, one relying on the approximate inverse, the other on an exact Cholesky-based matrix inversion. Both architectures have been successfully implemented on a state-of-the-art Xilinx Virtex-7 FPGA, are suitable for systems equipped with 128 BS antennas or fewer while serving up to 8 users, and achieve more than 600 Mb/s, exceeding the peak data rates specified in the 3GPP LTE-A uplink for 20 MHz bandwidth. Our FPGA implementation results reveal that for systems with a large ratio between the number of BS antennas and the number of users, the approximate matrix inversion is able to significantly reduce the hardware implementation complexity (compared to that of the exact inversion) with only a slight error-rate performance degradation. For systems with small ratios between the number of BS antennas and the number of users (as it is the case in, e.g., conventional, small-scale MIMO systems) one must resort to an exact inverse in order to avoid poor error-rate performance. This behavior is in accordance with the analytical results we have developed for the approximate matrix inverse. In summary, our FPGA implementation results demonstrate the practical feasibility of high-throughput data detection for 3GPP LTE-based large-scale MIMO systems. We finally note that a corresponding high-throughput ASIC design has recently been published in [48] .
There are many avenues for future work. The development of detection algorithms that are able to perform iterative detection and decoding (as, e.g., in [25] ) in large-scale MIMO systems is left for future work. Furthermore, the design of high-performance, near-optimal detection methods (e.g., based on the algorithms in [17] , [49] ) that require low computational complexity for large-dimensional antenna configurations and for SC-FDMA is a challenging open research problem.
APPENDIX A PROOF OF THEOREM 1
To prove Theorem 1, we need the following three Lemmata. = 2B(B + 1).
Proof:
We have −1 is an inverse chi-square random variable with 2B degrees of freedom. The inverse chi-square distribution with 2B degrees of freedom χ(2B) corresponds to an inverse-Gamma distribution with 2B degrees-of-freedom. The 4 th moment of this inverse chi-square distribution is given by .
We are now in position to prove Theorem 1. To this end, we start by using Markov's inequality to obtain the following straightforward inequality:
With Pr D 
