Abstract-For massive multiple-input multiple-output (MIMO) systems, linear minimum mean-square error (MMSE) detection has been shown to achieve near-optimal performance but suffers from excessively high complexity due to the large-scale matrix inversion. Being matrix-inversion free, detection algorithms based on the Gauss-Seidel (GS) method have been shown to be more efficient than conventional Neumann series expansion (NSE) based ones. In this paper, an efficient GS-based soft-output data detector for massive MIMO and a corresponding VLSI architecture are proposed. To accelerate the convergence of the GS method, a new initial solution is proposed. Several optimizations on the VLSI architecture level are proposed to further reduce the processing latency and area. Our reference implementation results on a Xilinx Virtex-7 XC7VX690T FPGA for a 128 base-station antenna and 8 user massive MIMO system show that our GS-based data detector achieves a throughput of 732 Mb/s with close-to-MMSE error-rate performance. Our implementation results demonstrate that the proposed solution has advantages over existing designs in terms of complexity and efficiency, especially under challenging propagation conditions.
Index Terms-Massive MIMO, minimum mean-square error (MMSE), Gauss-Seidel method, soft-output data detection, VLSI.
A. Detection in Massive MIMO Systems
The optimal data detection problem in MIMO systems is non-deterministic polynomial-time hard (NP-hard) [11] [12] [13]. Hence, existing algorithms that aim at solving this problem optimally, e.g., algorithms based on the optimal maximum-likelihood (ML) criterion [14] or the maximum a posteriori (MAP) criterion [15], inevitably require excessively high complexity as the number of decision variables increases with the number of transmitted data streams. Though hardware computing capability has evolved significantly over recent decades, efficient hardware implementation of optimal data detection remains challenging. On the other hand, newly emerged application scenarios such as massive machine type communications (mMTC) and ultra-reliable low latency communications (URLLC) call for massive MIMO detectors of low complexity, latency, and area, which implies that even data detectors of modest complexity would be unacceptable due to the stringent power and area constraints [10]. Therefore, near-optimal, low-complexity, and high-speed massive MIMO detectors are highly desired to bridge the gap between algorithms and hardware implementations.
One possible way to approach optimal data detection is to use non-linear data detection algorithms with reduced complexity, such as the sphere decoder (SD) [16] [17] [18] and tabu search (TS)-based data detectors [19, 20]. Admittedly, such solvers work well for traditional, small-scale MIMO systems. However, for massive MIMO systems with tens to hundreds of antennas and higher-order modulation schemes [21, 22], such methods result in prohibitively high computation and implementation complexity [12].
Alternatively, one can resort to linear data detection algorithms, such as zero-forcing (ZF) or minimum mean-square error (MMSE)-based equalization, to trade off performance against complexity [3]. Unfortunately, both of these methods require a large-dimensional matrix inversion. Exact inversion algorithms for an $N \times N$ matrix, using, for example, QR-Gram Schmidt [23], Gauss-Jordan [24], or Cholesky decomposition [25], entail a high complexity of $O(N^3)$. By exploiting the channel hardening property of wireless channels in massive MIMO systems, reference [26] proposed a Neumann series expansion (NSE) that reduces the complexity of matrix inversion. However, this algorithm still suffers from a complexity of $O(N^3)$ when the NSE length $K \geq 2$ (even higher than the exact inverse when $K \geq 4$) [26]. To achieve a lower complexity of $O(N^2)$, a soft-output detection based on the Gauss-Seidel (GS) method was recently proposed in [27]. However, the existing GS-based detectors still exhibit slow convergence rates and relatively low hardware efficiency.
B. Contributions
In this paper, an efficient GS-based soft-output data detection algorithm and a corresponding VLSI architecture are proposed. Based on the fact that the MMSE filtering matrix is diagonally dominant for massive MIMO systems, a 2-term NSE is employed to generate the initial solution of the GS method, which effectively accelerates the convergence, especially under challenging propagation environments, such as MIMO systems with a large system loading factor or correlated channels. We provide a VLSI architecture along with numerous architecture- and hardware-level optimizations. Our implementation results demonstrate that the proposed detector achieves a throughput of 732 Mb/s with close-to-MMSE error-rate performance, outperforming state-of-the-art designs in terms of complexity and efficiency. With the aid of the proposed efficient architecture, the GS-based detector is flexible and suitable to meet various system requirements.
C. Outline of the Paper
The remainder of the paper is organized as follows. Section II reviews the prerequisites. Section III proposes the GS-based soft-output data detection algorithm for massive MIMO systems and provides a complexity analysis and error-rate performance comparison. Section IV details the VLSI architecture with several optimizations. Section V provides reference FPGA implementation results and a comparison with existing designs. Section VI concludes the paper.
D. Notation
Lower- and upper-case boldface letters stand for column vectors and matrices, respectively. The entry in the ith row and jth column of a matrix $\mathbf{A}$ is represented by $A_{ij}$; the kth entry of a vector $\mathbf{a}$ is represented by $a_k$. The operations $(\cdot)^H$, $(\cdot)^T$, and $(\cdot)^{-1}$ denote conjugate transpose, transpose, and inverse, respectively; $|\cdot|$ and $\|\cdot\|$ denote the absolute value operator and the Euclidean norm, respectively. $\mathbb{E}\{\cdot\}$ stands for expectation. The superscript $(\cdot)^{(k)}$ indicates the kth iteration of iterative methods.
II. PREREQUISITES
A. System Model
We consider a multi-user MIMO uplink system with $N_t$ users and $N_r$ receive antennas at the base station (BS). In the case of massive MIMO, we assume that $N_r \gg N_t$. Spatial multiplexing is employed and each user is equipped with a single antenna (the model can be extended to the case of multiple spatial streams per user). For each user, the information bits $\mathbf{b}$ are encoded into the coded bit-stream $\mathbf{x}$ and then mapped to the transmit vector $\mathbf{s} = [s_1, \ldots, s_{N_t}]^T \in \Omega^{N_t}$, where $\Omega$ corresponds to the $2^B$-QAM constellation using Gray labelling. Therefore, each transmit vector is associated with $N_t B$ information bits and $x_{ib}$ denotes the bth bit of the ith entry of $\mathbf{s}$. The transmit vector $\mathbf{s}$ is transmitted over a wireless MIMO channel modeled as
$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}, \qquad (1)$$
where $\mathbf{y} = [y_1, \ldots, y_{N_r}]^T \in \mathbb{C}^{N_r}$ corresponds to the received vector at the BS, $\mathbf{H} \in \mathbb{C}^{N_r \times N_t}$ is the channel matrix, and $\mathbf{n} \in \mathbb{C}^{N_r}$ stands for independent identically distributed (i.i.d.) Gaussian noise with mean zero and variance $N_0$ per entry. In the following, the channel matrix $\mathbf{H}$ and the noise variance $N_0$ are assumed to be perfectly known at the receiver. The transmit symbol variance is normalized as $\mathbb{E}\{|s_i|^2\} = E_s$.
In what follows, we employ the average SNR per receive antenna defined as $\mathrm{SNR} = N_t E_s / N_0$.
B. Soft-Output MMSE Data Detection
The task of the BS is to compute log-likelihood ratio (LLR) values for the coded bits given $\mathbf{H}$ and $\mathbf{y}$ with a soft-output data detection algorithm. Since MMSE algorithms have been shown to be near-optimal for the massive MIMO uplink at low complexity [26], we consider the typical linear MMSE detection, which can be written as
$$\hat{\mathbf{s}} = \mathbf{W}^{-1}\mathbf{y}^{\mathrm{MF}}, \qquad (2)$$
where $\mathbf{y}^{\mathrm{MF}} = \mathbf{H}^H\mathbf{y}$ is the matched-filter output and $\mathbf{W} = \mathbf{G} + N_0\mathbf{I}_{N_t}$ is the regularized Gram matrix with the Gram matrix $\mathbf{G} = \mathbf{H}^H\mathbf{H}$. In order to obtain the LLR values, (2) can be rewritten as
$$\hat{\mathbf{s}} = \mathbf{U}\mathbf{s} + \mathbf{W}^{-1}\mathbf{H}^H\mathbf{n}, \qquad (3)$$
where $\mathbf{U} = \mathbf{W}^{-1}\mathbf{G}$ is the equivalent channel matrix. Then the equalized symbol of the ith user can be written as $\hat{s}_i = \mu_i s_i + t_i$, where $\mu_i = U_{ii}$ denotes the effective channel gain and $t_i$ denotes noise-plus-interference (NPI). The a posteriori LLRs for each bit of $\mathbf{x}$ can be computed by (4).
Entailing only a small loss for high-order modulation schemes, the approximation of the LLR computation proposed in [28] is hardware-friendly for Gray mappings. Therefore, an efficient approach to compute the extrinsic LLRs of the detector is given by (5), where $\Omega_b^0$ and $\Omega_b^1$ denote the subsets of $\Omega$ for which the bth bit is 0 and 1, respectively. As summarized in [28], the function $\lambda_b(z_i)$ can be computed efficiently for Gray mappings. Note that the computation of the LLRs is the same for both real parts and imaginary parts.
III. LOW-COMPLEXITY SIGNAL DETECTION ALGORITHM
In this section, a low-complexity soft-output detection algorithm named improved GS detection (IGS) is proposed for massive MIMO uplink. Algorithm 1 provides a summary of the method.
A. Signal Detection with Gauss-Seidel Method
Computing the inverse of the regularized Gram matrix, $\mathbf{W}^{-1}$, requires a computational complexity of $O(N_t^3)$ using hardware-friendly matrix inversion approaches, which is prohibitive for massive MIMO. Therefore, iterative methods with low complexity have been exploited in massive MIMO detection. The GS method is used to solve the N-dimensional linear equation $\mathbf{A}\mathbf{x} = \mathbf{b}$, where $\mathbf{A}$ is the $N \times N$ coefficient matrix, $\mathbf{x}$ is the solution vector, and $\mathbf{b}$ is the measurement vector. In what follows, we explain why the GS method can be used to solve the massive MIMO detection problem.
Lemma 1. For massive MIMO systems, the columns of the channel matrix $\mathbf{H}$ are asymptotically orthogonal, and the regularized Gram matrix $\mathbf{W}$ is Hermitian positive definite with probability one.
Proof: Please refer to [3] and [29] for the proof.
Specifically, signal detection in massive MIMO systems using the GS method is carried out by the following steps:
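Step 1: Decompose the regularized Gram matrix $\mathbf{W}$ as
$$\mathbf{W} = \mathbf{D} + \mathbf{L} + \mathbf{L}^H, \qquad (6)$$
where $\mathbf{D}$, $\mathbf{L}$, and $\mathbf{L}^H$ are the diagonal, strictly lower triangular, and strictly upper triangular components of $\mathbf{W}$, respectively.
Step 2: Compute the initial solution $\mathbf{s}^{(0)}$ of the GS method. Usually, $\mathbf{s}^{(0)}$ is set as a zero vector if no prior information of the final solution is available.
Step 3: The transmitted signal vector $\mathbf{s}$ is then estimated iteratively as follows:
$$\mathbf{s}^{(k)} = (\mathbf{D} + \mathbf{L})^{-1}\left(\mathbf{y}^{\mathrm{MF}} - \mathbf{L}^H\mathbf{s}^{(k-1)}\right), \quad k = 1, \ldots, K, \qquad (7)$$
where K is the maximum number of iterations.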
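The three steps above can be sketched in a few lines of plain Python. This is only an illustrative floating-point model, not the fixed-point architecture of Section IV; the function name `gauss_seidel` and the small test matrix are assumptions made here. The sweep applies $(\mathbf{D} + \mathbf{L})^{-1}$ by forward substitution, so no matrix inverse is formed.

```python
def gauss_seidel(W, y_mf, s0, num_iters):
    """Run num_iters Gauss-Seidel sweeps for W s = y_mf, starting from s0.

    W is a list of rows (Hermitian positive definite is assumed),
    y_mf and s0 are lists of (complex) numbers.
    """
    n = len(W)
    s = list(s0)
    for _ in range(num_iters):
        for i in range(n):
            # Entries j < i already hold the new iterate (the L part);
            # entries j > i still hold the previous iterate (the L^H part).
            acc = y_mf[i]
            for j in range(n):
                if j != i:
                    acc -= W[i][j] * s[j]
            s[i] = acc / W[i][i]
    return s


# Hypothetical 2 x 2 Hermitian positive definite example.
W = [[4, 1 + 1j], [1 - 1j, 3]]
y_mf = [1 + 0j, 2j]
s_hat = gauss_seidel(W, y_mf, [0j, 0j], num_iters=30)
residual = [sum(W[i][j] * s_hat[j] for j in range(2)) - y_mf[i] for i in range(2)]
print(max(abs(r) for r in residual))
```

Because each entry is overwritten as soon as it is updated, only one copy of the solution vector has to be stored, which is the property the architecture in Section IV relies on.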
B. Fast-Converging Initial Solution
As discussed in Section III-A, since no a priori information of the final solution is available, the initial solution s (0) in (7) is often set as an all-zero vector. Such a choice is simple but requires a large number of iterations. In general, the initial solution plays an important role for the convergence of the GS method and affects both complexity and accuracy when the number of iterations is finite (and small).
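The exact solution to which the GS method converges is given in (2). Note that the inverse of the regularized Gram matrix, $\mathbf{W}^{-1}$, can also be computed by the following NSE:
$$\mathbf{W}^{-1} = \sum_{n=0}^{\infty}\left(\mathbf{X}^{-1}(\mathbf{X} - \mathbf{W})\right)^{n}\mathbf{X}^{-1},$$
where $\mathbf{X}$ is an invertible matrix for which the series converges.
Letting $\mathbf{X} = \mathbf{D}$ for the sake of low complexity and keeping only the first two terms of the NSE, we arrive at the following 2-term approximation:
$$\mathbf{W}_2^{-1} = \mathbf{D}^{-1} - \mathbf{D}^{-1}\mathbf{E}\mathbf{D}^{-1},$$
where $\mathbf{E} = \mathbf{L} + \mathbf{L}^H$ is the off-diagonal part of $\mathbf{W}$. Exploiting this efficient approximation, we obtain the new initial solution $\mathbf{s}^{(0)}$ as follows:
$$\mathbf{s}^{(0)} = \mathbf{W}_2^{-1}\mathbf{y}^{\mathrm{MF}} = \left(\mathbf{D}^{-1} - \mathbf{D}^{-1}\mathbf{E}\mathbf{D}^{-1}\right)\mathbf{y}^{\mathrm{MF}}. \qquad (11)$$
Remark 1. The NSE-based approximation is only efficient when the number of terms is less than three; otherwise, it still suffers from high complexity.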
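As a minimal sketch of this initial solution, the 2-term NSE with $\mathbf{X} = \mathbf{D}$ gives $\mathbf{s}^{(0)} = (\mathbf{D}^{-1} - \mathbf{D}^{-1}\mathbf{E}\mathbf{D}^{-1})\mathbf{y}^{\mathrm{MF}}$, which can be evaluated with two diagonal scalings and one matrix-vector product. The function name `nse2_initial_solution` is hypothetical, and evaluating the product without forming $\mathbf{W}_2^{-1}$ explicitly is a simplification made here; the architecture in Section IV computes $\mathbf{W}_2^{-1}$ itself because it is reused later.

```python
def nse2_initial_solution(W, y_mf):
    """Return s0 = (D^-1 - D^-1 E D^-1) y_mf for a matrix W given as rows."""
    n = len(W)
    # t = D^-1 y_mf: scale each matched-filter output by the diagonal entry.
    t = [y_mf[i] / W[i][i] for i in range(n)]
    s0 = []
    for i in range(n):
        # (E t)_i: only off-diagonal entries of W contribute.
        off = sum(W[i][j] * t[j] for j in range(n) if j != i)
        # s0 = t - D^-1 (E t)
        s0.append(t[i] - off / W[i][i])
    return s0


# Hypothetical diagonally dominant example; the exact solution is [1/11, 7/11].
print(nse2_initial_solution([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

For this small example the result is already much closer to the exact solution than the all-zero vector, which illustrates why fewer GS iterations are needed afterwards.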
C. Efficient LLR Computation
Although the computational complexity of the LLR computation has been reduced significantly in (5), it requires the inverse matrix $\mathbf{W}^{-1}$ to compute the effective channel gains $\mu_i$. To compute the LLRs more efficiently, we propose an approximate method to obtain the effective channel gains with a negligible performance loss. Firstly, since $\mathbf{G} = \mathbf{W} - N_0\mathbf{I}_{N_t}$, the equivalent channel matrix can be rewritten as
$$\mathbf{U} = \mathbf{W}^{-1}\mathbf{G} = \mathbf{I}_{N_t} - N_0\mathbf{W}^{-1}.$$
Inspired by the proposed initial solution in Section III-B, we also exploit the 2-term NSE $\mathbf{W}_2^{-1}$ to approximate $\mathbf{W}^{-1}$ here:
$$\tilde{\mathbf{U}} = \mathbf{I}_{N_t} - N_0\mathbf{W}_2^{-1}.$$
Therefore, the effective channel gain can be computed by
$$\tilde{\mu}_i = 1 - N_0\tilde{W}_{ii}, \qquad (14)$$
where $\tilde{W}_{ii}$ denotes the ith diagonal entry of $\mathbf{W}_2^{-1}$. Then, the approximated LLRs can be efficiently computed by (5).
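A short sketch of the gain approximation follows. It relies on an observation derived here rather than stated in the text: with $\mathbf{W}_2^{-1} = \mathbf{D}^{-1} - \mathbf{D}^{-1}\mathbf{E}\mathbf{D}^{-1}$ and $\mathbf{E}$ having a zero diagonal, the ith diagonal entry of $\mathbf{W}_2^{-1}$ is simply $1/W_{ii}$, so the approximate gain becomes $1 - N_0/W_{ii}$. The function name `approx_channel_gains` is hypothetical.

```python
def approx_channel_gains(W, n0):
    """Approximate effective channel gains mu_i = 1 - N0 * (W_2^-1)_ii.

    Uses (W_2^-1)_ii = 1 / W_ii, which holds because the off-diagonal
    matrix E contributes nothing to the diagonal of D^-1 E D^-1.
    The diagonal of W is real-valued, so only its real part is used.
    """
    return [1.0 - n0 / W[i][i].real for i in range(len(W))]


# Hypothetical values: diagonal entries 4 and 2, noise variance 0.5.
print(approx_channel_gains([[4.0, 1.0], [1.0, 2.0]], 0.5))
```

Only one scaling per user is needed, which is consistent with the count of $N_t$ multiplications used in the complexity analysis.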
D. Computational Complexity Analysis
Since most linear MMSE algorithms for massive MIMO detection must pre-compute the regularized Gram matrix $\mathbf{W}$ and the matched-filter output $\mathbf{y}^{\mathrm{MF}}$, we focus mainly on the complexity of the other parts, namely the initial solution in (11), the GS iterations in (7), and the effective channel gains in (14).
Solving (7): Since $(\mathbf{D} + \mathbf{L})$ is a lower-triangular matrix, the computation of (7) for K iterations requires $KN_t^2$ complex-valued multiplications. Note that since the proposed initial solution is relatively close to the exact solution in favourable propagation environments, K can be quite small.
Once $\mathbf{W}_2^{-1}$ has been computed, the number of complex-valued multiplications required to compute $\tilde{\mu}_i$ (for $i = 1, \ldots, N_t$) is simply $N_t$ according to (14).
To sum up, the overall number of complex-valued multiplications required by the proposed IGS algorithm is $(K + 2)N_t^2$. For massive MIMO detection, the computational complexity of IGS is therefore $O((K + 2)N_t^2)$.
E. Simulation Results
Numerical results of BER performance against SNR are shown in Figs. 1-4 to compare the proposed IGS algorithm with the Cholesky decomposition, the NSE-based method, and conventional GS-based algorithms. We consider massive MIMO systems with different parameters, where a standard rate-1/2 convolutional channel code and a 64 quadrature amplitude modulation (QAM) scheme are employed. All the compared methods compute soft outputs. At the receiver, LLRs are extracted from the detected signal for soft-input Viterbi decoding.
Remark 2. We set the number of iterations of the NSE-based approach to be greater than that of IGS ($K_{\mathrm{NSE}} = K_{\mathrm{GS}} + 2$) in order to enable a fair comparison, since the 2-term NSE is employed to obtain the initial solution of IGS.
In Figs. 1 and 2, we compare the BER performance of IGS with the other conventional approaches under different antenna configurations. IGS outperforms the NSE-based algorithm under both the 64 × 16 and the 128 × 16 antenna configurations. According to Fig. 1, IGS with K = 1 achieves performance similar to that of the NSE-based algorithm with K = 4. Note that with a greater system loading factor $N_t/N_r$ (shown in Fig. 2), the NSE-based algorithm converges much more slowly than IGS. Therefore, the proposed IGS algorithm has an advantage in convergence rate over the NSE-based one.
In Fig. 3, we compare the performance of the GS-based algorithms with different initial solutions. Compared to existing choices, the proposed initial solution successfully accelerates the convergence. The GS method with the proposed initial solution for K = 1 achieves almost the same performance as that with the zero-vector initial solution for K = 2. Moreover, IGS also outperforms the GS method with the initial solution of [27] in terms of error-rate performance for a given K.
In Fig. 4, we study the BER performance of IGS and other conventional approaches considering the spatial correlation of realistic MIMO systems. Here we adopt the Kronecker model proposed in [32], where $\zeta_r$ and $\zeta_t$ ($0 \leq \zeta \leq 1$) denote the correlation factors at the BS and the user sides, respectively. We can see that all these approaches degrade to various extents as the channel correlation becomes stronger. For both $\zeta_t = 0.2$ and $\zeta_t = 0.5$, the conventional NSE-based algorithm is hardly able to converge. However, IGS still converges with a relatively small number of iterations.
IV. VLSI ARCHITECTURE
In this section, we describe a low-complexity VLSI architecture for the proposed IGS algorithm and provide several solutions to reduce hardware overhead and processing latency. 
A. Architecture Overview
The VLSI architecture is depicted in Fig. 5. The proposed architecture consists of five units: 1) preprocessing unit (PU), 2) initial solution computation unit (ISCU), 3) GS method unit (GSMU), 4) SINR computation unit (SCU), and 5) LLR computation unit (LCU). Fed by $\mathbf{y}$, $\mathbf{H}^H$, and $N_0$, PU performs matched filtering $\mathbf{y}^{\mathrm{MF}} = \mathbf{H}^H\mathbf{y}$ and computes the regularized Gram matrix $\mathbf{W}$. Note that both operations can be performed by systolic arrays to achieve high throughput. The outputs of PU are then passed to ISCU for computing the 2-term NSE
B. Preprocessing Unit (PU)
The preprocessing unit is employed to compute the matched filter output y MF and the regularized Gram matrix W, which will be further passed to other units. Note that this unit is able to perform the operations above in parallel, since there is no data dependence between them.
1) Matched Filtering: The matched filter (MF) consists of a linear array of $N_t$ PEs and performs the operation $\mathbf{y}^{\mathrm{MF}} = \mathbf{H}^H\mathbf{y}$. Fig. 6 shows the systolic array structure of MF. In each clock cycle, MF reads a new entry of $\mathbf{y}$ and the corresponding entries of $\mathbf{H}^H$, and the multiply-accumulate (MAC) operation is performed in each PE.
Remark 3. The total processing latency of MF is $(N_t + N_r - 1)$ clock cycles, and it utilizes $N_t$ complex-valued multipliers.
2) Regularized Gram Matrix Computation:
In massive MIMO systems, the dimension of the Gram matrix $\mathbf{G}$ tends to be very large. Therefore, the conventional systolic array for matrix-matrix multiplication, which consists of $N_t \times N_t$ PEs, is not scalable for large $N_t$. As discussed in Section III-A, the Gram matrix in the massive MIMO uplink is Hermitian positive definite. Hence, only the upper or the lower triangular part of $\mathbf{G}$ needs to be calculated. Fig. 7 depicts the systolic array structure for computing the regularized Gram matrix (RGM). In this array, only $N_t(N_t + 1)/2$ PEs are employed to compute the lower triangular part of the Gram matrix $\mathbf{G}$. Each PE in the array contains a MAC unit. There are two types of PEs in the array, denoted by PE-A and PE-B, respectively. The conjugate-transposed channel matrix $\mathbf{H}^H$ is shifted one column at a time into the systolic array. PE-As have the same structure as the PEs of MF. Once an input value reaches a PE-B, the value is conjugated and passed to the lower part of the systolic array. Then, PE-Bs add the noise variance $N_0$ to the diagonal entries of $\mathbf{G}$. Finally, the lower triangular part of the regularized Gram matrix $\mathbf{L}$, the upper triangular part $\mathbf{L}^H$, and the diagonal part $\mathbf{D}$ are stored in the register files.
Remark 4. The total processing latency of RGM is $(2N_t + N_r - 1)$ clock cycles, and it uses $N_t(N_t - 1)/2$ complex-valued multipliers and $N_t$ real-valued multipliers, since the diagonal entries of $\mathbf{W}$ are real-valued.
3) Data Compression Scheme: Note that the regularized Gram matrix $\mathbf{W}$ is diagonally dominant in the case of massive MIMO under the i.i.d. assumption, meaning that the diagonal entries of $\mathbf{W}$ are much greater than the off-diagonal ones. Hence, conventional uniform quantization schemes that cover the entire dynamic range of the entries of $\mathbf{W}$ consume substantial hardware resources. Fig. 8 shows the distribution of the entries of $\mathbf{W}$ in the form of a histogram. The values of $\mathbf{W}$ can be separated into two groups: one is around zero and the other is around $N_t$. By exploiting this property, the hardware overhead for storing and processing $\mathbf{W}$ can be reduced significantly. The hardware-efficient quantization scheme with data compression for $\mathbf{W}$ can be described as follows (see Fig. 9 for architecture details):
Step 1: Compare $W_{mn}$ with $N_t/2$. If the former is greater than the latter, set the first bit of the fixed-point output (the offset flag) to 1; otherwise, set it to 0.
Step 2: If $W_{mn} > N_t/2$, subtract $N_t$ from $W_{mn}$ and then obtain the remaining bits of the fixed-point output.
The compressed bits of $W_{mn}$ are sent to other units in the detector, e.g., ISCU and GSMU. Before these units start to compute, the corresponding data decompression is required. As shown in Fig. 10, the procedure of data decompression is simple and straightforward.
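The two compression steps can be modelled in software as follows. This is a sketch under stated assumptions: the function names, the `frac_bits` parameter, and the rounding rule are illustrative choices, and the decompression shown is simply the assumed inverse of Steps 1 and 2 (restore the offset $N_t$ when the flag is set), since the details of Fig. 10 are not reproduced in the text. Each call handles one real-valued component, i.e., a real or imaginary part.

```python
def compress_entry(w, n_t, frac_bits=6):
    """Return (offset_flag, fixed-point residual) for one real-valued part of W."""
    flag = 1 if w > n_t / 2 else 0            # Step 1: compare with N_t / 2
    residual = w - n_t if flag else w         # Step 2: remove the offset N_t
    return flag, round(residual * (1 << frac_bits))


def decompress_entry(flag, q, n_t, frac_bits=6):
    """Assumed inverse: rescale the residual and restore the offset if flagged."""
    residual = q / (1 << frac_bits)
    return residual + n_t if flag else residual


# Hypothetical values for N_t = 8: one diagonal-like and one off-diagonal-like entry.
for value in (8.25, -0.5):
    flag, q = compress_entry(value, 8)
    print(value, flag, q, decompress_entry(flag, q, 8))
```

After the offset is removed, both groups of values lie close to zero, so the residual needs far fewer integer bits than a uniform quantizer covering the full range up to $N_t$.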
C. Initial Solution Computation Unit (ISCU)
As shown in Fig. 11, ISCU first computes the approximate matrix inverse $\mathbf{W}_2^{-1}$ and then computes the initial solution $\mathbf{s}^{(0)}$ of the GS method. Since the computation of $\mathbf{s}^{(0)}$ is a typical matrix-vector multiplication, which can be performed by an array of MACs similar to MF, we focus mainly on the architecture design of the 2-term NSE approximation $\mathbf{W}_2^{-1}$, which is shown in Fig. 12(a) ... for the sake of saving storage space. The function of mul-A is to multiply a vector by a scalar.
Remark 5. For the architecture in Fig. 12(a), either one row or one column is computed in mul-A per clock cycle; therefore, four buffers are used to obtain a row of $\mathbf{W}_2^{-1}$ per clock cycle.
The major drawback of the architecture in Fig. 12(a) is the high processing latency. For a regularized Gram matrix $\mathbf{W}$ of order $N_t$, an undesirable latency of $2N_t$ clock cycles is required to obtain $\mathbf{w}_i^H$. Since $\mathbf{W} = \mathbf{D} + \mathbf{E}$ and $\mathbf{W}$ is Hermitian, the off-diagonal matrix $\mathbf{E}$ is also Hermitian. Based on this property, a low-latency version of this architecture is provided in Fig. 12(b). Instead of computing the reciprocal of one $d_i$ per clock cycle, the proposed structure in Fig. 12(b) computes the reciprocals of all diagonal entries of $\mathbf{D}$ during one clock cycle. Note that mul-B is employed here to perform element-wise multiplication of m ...
Remark 6. By exploiting the Hermitian characteristic, the proposed structure in Fig. 12(b) is able to obtain $\mathbf{W}_2^{-1}$ in $N_t$ clock cycles, and extra buffers are no longer needed.
After obtaining the 2-term NSE $\mathbf{W}_2^{-1}$, the initial solution of the GS method is computed according to (11), which can be performed by the systolic array in Fig. 6. This operation requires $(2N_t - 1)$ clock cycles.
D. GS Method Unit (GSMU)
Fig. 13(a) shows the basic architecture of the GS method unit, where $\mathbf{M} = -\mathbf{L}^H$ and $\mathbf{N} = (\mathbf{D} + \mathbf{L})^{-1}$. Since $(\mathbf{D} + \mathbf{L})$ is a lower triangular matrix, a specific systolic array that performs forward substitution (FS) [33] is employed here to compute $\mathbf{N}$. According to (7), each iteration of the GS method can be divided into three phases. In the first phase, a systolic array for matrix-vector multiplication (denoted as mul-C) is employed to compute $-\mathbf{L}^H\mathbf{s}^{(k-1)}$. In the second phase, the operation $\mathbf{b} = \mathbf{y}^{\mathrm{MF}} + \mathbf{M}\mathbf{s}^{(k-1)}$ is performed by $N_t$ complex-valued adders. In the third phase, another systolic array denoted as mul-D is used to compute the matrix-vector product of $\mathbf{N}$ and $\mathbf{b}$. These phases can be repeated for a configurable number of iterations until the error-rate performance meets the requirement. Since $\mathbf{s}$ is updated from the previous iteration, only a few registers are required to store the latest $\mathbf{s}^{(k-1)}$.
Remark 7. Both mul-C and mul-D in Fig. 13 consist of $N_t$ MACs.
1) Hardware-Efficient Architecture: Fig. 13(b) depicts the proposed hardware-efficient architecture of the GS method, derived from the basic architecture in Fig. 13(a). In order to reduce the dynamic range of the values in the GS method unit, we first normalize the inputs $\mathbf{y}^{\mathrm{MF}}$ and $\mathbf{N}$ into $\tilde{\mathbf{y}}^{\mathrm{MF}} = N_r^{-1}\mathbf{y}^{\mathrm{MF}}$ and $\tilde{\mathbf{N}} = N_r\mathbf{N}$. Such a trick is commonly used in fixed-point arithmetic. Therefore, $\mathbf{M}\mathbf{s}^{(k-1)}$ is scaled by $1/N_r$ correspondingly, ensuring that the final result is equivalent. Note that $N_r$ is usually a power of 2; therefore, these scaling operations can be performed easily by shifting. After scaling, the word-length of the data in GSMU can be significantly shortened and the hardware overhead is therefore reduced. Since the inputs $\mathbf{L}$, $\mathbf{D}$, and $\mathbf{L}^H$ are compressed according to Section IV-B3, they should be converted into conventional fixed-point numbers before computing. Also, the proposed architecture is independent of the number of iterations and is therefore suitable for various applications.
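The equivalence claimed for this scaling can be checked in one line. The sketch below assumes that the GS update takes the form $\mathbf{s}^{(k)} = \mathbf{N}(\mathbf{y}^{\mathrm{MF}} + \mathbf{M}\mathbf{s}^{(k-1)})$ with $\mathbf{M} = -\mathbf{L}^H$ and $\mathbf{N} = (\mathbf{D} + \mathbf{L})^{-1}$, as implied by (7); the tilde notation for the scaled quantities is chosen here for illustration.

```latex
\begin{align}
\tilde{\mathbf{N}}\left(\tilde{\mathbf{y}}^{\mathrm{MF}} + N_r^{-1}\mathbf{M}\mathbf{s}^{(k-1)}\right)
  &= N_r\mathbf{N}\left(N_r^{-1}\mathbf{y}^{\mathrm{MF}} + N_r^{-1}\mathbf{M}\mathbf{s}^{(k-1)}\right) \\
  &= \mathbf{N}\left(\mathbf{y}^{\mathrm{MF}} + \mathbf{M}\mathbf{s}^{(k-1)}\right)
   = \mathbf{s}^{(k)}.
\end{align}
```

The factors $N_r$ and $N_r^{-1}$ cancel exactly, so the scaling changes only the dynamic range of the intermediate values and not the iterate itself.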
2) Timing Schedule for Lower Latency: Usually, for a systolic array which performs matrix-vector multiplication of dimension N t , it requires (2N t − 1) clock cycles to obtain the result (see the left parts of Fig. 14 and Fig. 15 ). It is worth noting that the GS iteration could have certain disadvantages because the variables that depend on each other can only be updated sequentially. Thus, it is important to reduce the overall latency of GS iterations so as to meet the throughput requirement. To this end, we propose a fast-converging initial solution in Section III-B, which is able to significantly reduce the number of iterations. Here, we introduce an efficient timing schedule scheme to further reduce the latency within each GS iteration.
As shown in Fig. 14, since $\mathbf{M} = -\mathbf{L}^H$ is an upper-triangular matrix, we reverse the input sequence of d and each row of $\mathbf{M}$. As a result, the total latency of mul-C is reduced from $(2N_t - 1)$ clock cycles to only $N_t$ clock cycles. Likewise, since $\mathbf{N} = (\mathbf{D} + \mathbf{L})^{-1}$ is a lower-triangular matrix, we simply reverse the input sequence of $\mathbf{N}$ (see Fig. 15) and therefore reduce the total latency of mul-D to $N_t$ clock cycles.
Remark 8. By exploiting the rescheduling schemes in Figs. 14 and 15, the latency of each GS iteration can be reduced to about half of its original value.
E. SCU and LCU
The design of SCU is simple and efficient. According to Section III-C, the equivalent channel matrix U can be easily approximated by adders and scalar multipliers. Then the SINR computation is simply carried out by multipliers and reciprocal modules as described in Section II-B. Given the effective channel gains µ i and the post-equalization SINR values ρ i , the computation of the max-log LLRs can be simplified with Gray mappings according to Section II-B. Hence, the LCU focuses mainly on evaluating the linear function λ b (z i ). The readers can refer to [34] for more details of LCU architectures.
F. Timing Schedule of IGS Detector
As discussed before, because the GS method is inherently sequential, it is difficult to parallelize. However, the proposed IGS detector consists of several units, and some of them have no data dependencies on each other. Therefore, some operations can be performed concurrently. After careful analysis and arrangement, the overall timing schedule of the IGS detector is shown in Fig. 16.
V. IMPLEMENTATION RESULTS
In this section, we present an implementation of the proposed IGS detector on a Xilinx Virtex-7 XC7VX690T FPGA. The corresponding fixed-point parameters, FPGA implementation results, and a comparison with other designs are provided.
A. Fixed-Point Error-Rate Performance
In order to achieve near-optimal error-rate performance with lower hardware complexity, fixed-point arithmetic is used in this implementation. The associated word-lengths for the proposed architecture have been determined via numerical simulations. The parameters provided in the following refer to the real or imaginary part. For PU, the channel matrix $\mathbf{H}$, the receive vector $\mathbf{y}$, and the noise variance $N_0$ are represented with 15 bits; the output of RGM is also quantized to 15 bits and is then compressed to 9 bits with the proposed data compression scheme; the output of MF is set to 15 bits. For ISCU and GSMU, the inputs $\mathbf{D}$ and $\mathbf{E}$ are decompressed to 15 bits at the beginning. For SCU, the input and the output are set to 15 bits and 12 bits, respectively. For LCU, the input is represented with 12 bits and the output is quantized to 10 bits. All multiplications are mapped onto DSP48 slices. The MAC registers are set to 22 bits. Each LUT in the reciprocal module has 1024 addresses and 15-bit outputs, implemented by a single B-RAM.
Fig. 17 compares the BER performance of the proposed fixed-point scheme and the floating-point algorithms for 128 × 8 and 64 × 8 systems. One can observe that the degradation introduced by the fixed-point scheme is negligible compared to the floating-point result. Specifically, the implementation loss is less than 0.05 dB SNR at 0.1% BER.
B. FPGA Implementation Results and Comparison
Table I provides the key implementation results of our GS-based soft-output massive MIMO detector and compares them with the post-place-and-route results of other state-of-the-art massive MIMO detectors [26, 35, 36, 37]. In addition, Fig. 18 compares these detectors in terms of throughput and hardware overhead.
As discussed in Section III-E, the proposed IGS algorithm for the massive MIMO uplink should be compared to NSE-based algorithms using fewer iterations for the sake of fairness. Since the detectors presented in [26, 35, 37] use K = 3 iterations and a 2-term NSE is used in our algorithm, the proposed IGS detector is implemented with only K = 1 iteration. As shown in Table I, our IGS detector has a much higher throughput (732 Mb/s) than the other detectors. One can also achieve a higher clock frequency by increasing the number of pipeline stages, if the hardware complexity is affordable. Here we introduce the ratios of throughput/LUTs and throughput/FFs to measure the hardware efficiency of detectors. Our IGS detector achieves the highest hardware efficiency in terms of throughput/FFs and the second highest hardware efficiency in terms of throughput/LUTs. Furthermore, Figs. 1, 2, and 17 show that IGS with K = 1 iteration outperforms the NSE-based one with K = 3 for different antenna configurations. In addition, for large system loading factors $N_t/N_r$ (e.g., 16/64 and 8/64), the conventional NSE-based detection algorithm performs much worse than the proposed GS-based one and may even fail to converge. Table II reports the throughput of the IGS detector with different numbers of iterations. It indicates that the proposed IGS detector with K = 2, 3 can still achieve higher throughputs than the other aforementioned detectors.
VI. CONCLUSION
In this paper, we have proposed an efficient hardware architecture of the GS-based soft-output detection algorithm for massive MIMO systems. The proposed iterative algorithm employs a novel initial solution based on NSE to accelerate convergence. The proposed detection algorithm has shown its advantage in both convergence rate and BER performance in various antenna configurations and propagation environments.
The corresponding VLSI architecture has also been provided with low hardware complexity. To further reduce the word-length of the regularized Gram matrix, a hardware-efficient data compression/decompression scheme is proposed in this paper. Exploiting the Hermitian property of the 2-term NSE, the proposed architecture for NSE computation has a much lower processing latency. The optimized architecture for the GS method reduces the latency of each iteration by half. The provided reference FPGA implementation results have shown that our GS-based massive MIMO detector achieves a medium throughput with a much lower hardware complexity and a better error-rate performance, compared to the conventional ones. Future work will focus on iterative data detection and decoding in massive MIMO systems based on this work.
