Minimum Mean-Square-Error (MMSE) detection achieves near-optimal performance in signal detection for massive Multiple-Input Multiple-Output (MIMO) systems. However, MMSE detection still suffers from the high complexity of matrix inversion. In this paper, an efficient and flexible architecture is proposed based on a modified version of the Symmetric Successive Over-Relaxation (SSOR) method. A Reconfigurable Computing Array (RCA) is used to implement the SSOR method. To speed up the iteration, an initial solution is adopted. An approximated computational method is used to reduce the computing load of the Log-Likelihood Ratio (LLR) computations. FPGA implementation results show superior performance over state-of-the-art designs.
Introduction
MIMO is a key technology in most modern wireless communication standards; however, traditional MIMO systems cannot satisfy the increasing requirements for data rate, spectral efficiency, link reliability and energy efficiency in future wireless systems. Massive MIMO is a very promising technique for 5G wireless communications, and it has been shown that massive MIMO offers opportunities to meet the ever-growing demands of future wireless systems.
Realizing the benefits of massive MIMO faces several challenges, one of which is that the computational complexity at the base station increases by orders of magnitude. Optimal and near-optimal detection methods such as Maximum Likelihood (ML) [1] and K-Best [2] achieve high detection performance. Unfortunately, their computational complexity becomes non-negligible when the number of antennas is large. Zero-Forcing (ZF) and MMSE achieve a trade-off between performance and complexity; however, MMSE detection involves a complex matrix inversion. Recently, the Neumann series (NS) method [3], the Conjugate Gradient (CG) method [4], [5], and the Gauss-Seidel (GS) method [6], [7] were proposed to avoid explicit matrix inversion, but the reduction in complexity is limited because a large number of iterations is required.
In this paper, we describe a low-complexity data detection algorithm based on SSOR for uplink massive MIMO systems. Firstly, we focus on linear soft-output detection in combination with an optimized matrix inversion method based on the SSOR algorithm. Then, we propose a method that speeds up the SSOR iteration through the choice of the initial solution. Finally, an approximated Log-Likelihood Ratio (LLR) computational method is proposed to reduce the complexity. The simulation results show that the proposed algorithm achieves higher detection accuracy than recently proposed algorithms. Based on the proposed algorithm, we develop an efficient and flexible VLSI architecture for signal detection in massive MIMO systems. In particular, we propose a reconfigurable computing array (RCA). Furthermore, this flexible architecture can be reconfigured to support different antenna configurations in massive MIMO systems. The experimental results of our design on a Xilinx Virtex-7 FPGA show that it achieves 3.43×, 2.84× and 1.71× the throughput per slice of the NS-based detector [3], the CG-based detectors [5] and the GS-based detectors [7].
System Model
The massive MIMO uplink system (usually N >> M [6]) has N antennas at the base station (BS) that simultaneously communicate with M single-antenna users. The parallel transmit bit streams of the M users are encoded by channel encoders, and the results are mapped to constellation symbols to obtain a sequence of transmit vectors s. Let s denote the M × 1 transmitted signal vector of all M users, and let y denote the N × 1 signal vector received at the BS. We have:
$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}, \qquad (2.1)$$
where $\mathbf{H} \in \mathbb{C}^{N \times M}$ is the flat Rayleigh fading channel matrix whose elements are independent and identically distributed (i.i.d.) and follow $\mathcal{CN}(0,1)$, and $\mathbf{n}$ is the N × 1 vector of i.i.d. zero-mean complex additive white Gaussian noise (AWGN) with covariance $E\{\mathbf{n}\mathbf{n}^H\} = N_0\mathbf{I}$. Furthermore, we assume the power of the transmitted vector is $E\{\mathbf{s}\mathbf{s}^H\} = E_s\mathbf{I}$.
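The system model in (2.1) can be illustrated with a minimal simulation sketch. It uses only the Python standard library; the function names, the QPSK constellation with unit symbol energy, and the values N = 128, M = 8 and N0 = 0.1 are assumptions chosen for illustration, not parameters taken from the paper's simulations.

```python
import random

def crandn(rng, var=1.0):
    """One sample of a circularly symmetric complex Gaussian CN(0, var)."""
    sigma = (var / 2.0) ** 0.5          # variance split between real and imaginary parts
    return complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))

def simulate_uplink(N, M, N0, rng):
    """Return (H, s, y) with y = H s + n (assumed: QPSK symbols, Es = 1)."""
    H = [[crandn(rng) for _ in range(M)] for _ in range(N)]    # N x M channel matrix
    a = 0.5 ** 0.5
    s = [complex(rng.choice((-a, a)), rng.choice((-a, a))) for _ in range(M)]
    n = [crandn(rng, N0) for _ in range(N)]                    # AWGN with variance N0
    y = [sum(H[r][c] * s[c] for c in range(M)) + n[r] for r in range(N)]
    return H, s, y

rng = random.Random(0)                  # fixed seed for reproducibility
H, s, y = simulate_uplink(N=128, M=8, N0=0.1, rng=rng)
print(len(H), len(H[0]), len(s), len(y))   # prints "128 8 8 128"
```

Each row of `H` corresponds to one base-station antenna and each column to one user, matching the N × M dimensions in the text.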
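According to H and y, the base-station detector can compute soft estimates in the form of LLRs for s. The MMSE estimate of s, which is the most common choice, can be computed as:
$$\hat{\mathbf{s}} = (\mathbf{H}^H\mathbf{H} + N_0E_s^{-1}\mathbf{I}_M)^{-1}\mathbf{H}^H\mathbf{y} = \mathbf{A}^{-1}\mathbf{y}^{\mathrm{MF}}, \qquad (2.2)$$
where $\mathbf{G} = \mathbf{H}^H\mathbf{H}$ is the Gram matrix, $\mathbf{A} = \mathbf{G} + N_0E_s^{-1}\mathbf{I}_M$, and $\mathbf{y}^{\mathrm{MF}} = \mathbf{H}^H\mathbf{y}$ is the matched-filter output. According to (2.2), the MMSE estimate of the transmitted vector can be rewritten as $\hat{\mathbf{s}} = \mathbf{U}\mathbf{s} + \mathbf{v}$. As U is a non-diagonal matrix, the MMSE estimate $\hat{s}_i$ of the symbol transmitted by the ith user contains not only the information of $s_i$ but also interference from the other symbols $s_j$ (j ≠ i). To distinguish the useful information from the interference, each element of $\hat{\mathbf{s}}$ can be written as:
$$\hat{s}_i = U_{ii}s_i + \eta_i,$$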
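where $U_{ij}$ is the element of U in the ith row and jth column, $U_{ii}$ is the effective channel gain, and $\eta_i = \sum_{j=1, j\neq i}^{M} U_{ij}s_j + v_i$ denotes the post-equalization noise-plus-interference (NPI) term, which includes both interference and noise. Clearly, $s_i$ and $\eta_i$ are independent when the streams are independent; hence, the expectation of $\eta_i$ is 0. Let $\sigma_{eq}^2$ denote the variance of $\eta_i$ and b be the bit index of the LLR of the ith user. Then the max-log LLR can be expressed as:
$$L_{i,b} = \gamma_i\left(\min_{a\in\mathcal{O}_b^0}\left|\frac{\hat{s}_i}{U_{ii}} - a\right|^2 - \min_{a\in\mathcal{O}_b^1}\left|\frac{\hat{s}_i}{U_{ii}} - a\right|^2\right),$$
where $\gamma_i = U_{ii}^2/\sigma_{eq}^2$ is the signal-to-interference-plus-noise ratio (SINR) for the ith user, and $\mathcal{O}_b^0$ and $\mathcal{O}_b^1$ are the sets of constellation symbols whose bth bit is 0 and 1, respectively. Computing $\hat{\mathbf{s}}$, and hence the LLRs, requires inverting the M × M matrix A, so the detector needs a large number of multiplications when M is large. Hence, practical solutions for uplink massive MIMO detection demand low-complexity matrix inversion.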
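The max-log LLR described above can be sketched for a small constellation. The sketch below assumes the common form in which the SINR scales the difference between the minimum squared distances to the bit-0 and bit-1 symbol sets; the Gray-mapped QPSK mapping, the sign convention and the function names are illustrative assumptions rather than the paper's definitions.

```python
import itertools

def qpsk_symbol(bits):
    """Gray-mapped unit-energy QPSK (assumed mapping: bit 0 -> +, bit 1 -> -)."""
    b0, b1 = bits
    return complex(1 - 2 * b0, 1 - 2 * b1) / 2 ** 0.5

def max_log_llrs(s_hat_i, u_ii, sinr_i):
    """Max-log LLRs of the two QPSK bits of one user.

    With this sign convention a positive value favours bit = 1
    and a negative value favours bit = 0.
    """
    z = s_hat_i / u_ii                       # remove the effective channel gain
    llrs = []
    for b in range(2):
        best = {0: float("inf"), 1: float("inf")}
        for bits in itertools.product((0, 1), repeat=2):
            dist = abs(z - qpsk_symbol(bits)) ** 2
            best[bits[b]] = min(best[bits[b]], dist)
        llrs.append(sinr_i * (best[0] - best[1]))
    return llrs

# Noise-free estimate of the symbol carrying bits (0, 1), scaled by U_ii = 0.9.
example = max_log_llrs(s_hat_i=0.9 * qpsk_symbol((0, 1)), u_ii=0.9, sinr_i=2.0)
print(example)   # approximately [-4.0, 4.0]
```

The exhaustive search over the constellation is only practical for small alphabets; it is meant to show the structure of the computation, not an efficient implementation.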
SSOR-Based Signal Detection Method for Massive MIMO Systems
In this section, an optimized SSOR iterative method is first used to avoid explicit matrix inversion. Then, we propose a method that speeds up the SSOR iteration by exploiting the properties of the massive MIMO channel. Finally, an approximated Log-Likelihood Ratio (LLR) computational method is proposed to reduce the complexity.
Proposed SSOR-based Signal Detection Method

Optimized SSOR Method
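In massive MIMO systems, the columns of the channel matrix H are asymptotically orthogonal; hence, the Gram matrix G and the matrix A are Hermitian positive definite [6]. The SSOR-based signal detection method is used to solve the linear equation shown in (2.2). Following the SSOR iteration, we decompose the matrix A into three parts: $\mathbf{A} = \mathbf{L} + \mathbf{D} + \mathbf{L}^H$, where D, L and $\mathbf{L}^H$ are the diagonal, strictly lower triangular and strictly upper triangular parts of A, respectively.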
PoS(ISCC 2017)055
where k = 0, 1, ... is the iteration index, $\hat{\mathbf{s}}^{(0)}$ is the initial solution (discussed later in the paper), and ω is the relaxation parameter. To realize the iteration efficiently in hardware, we change the computing rule according to the definitions of D and L, and the iteration can be presented as: ŝ
where $D_{i,j}$ and $L_{i,j}$ denote the elements in the ith row and jth column of D and L, respectively. According to (3.1.1.2), the optimized SSOR method takes full advantage of the information in the whole matrix A. Moreover, all the computations (vector multiplications) have a similar form, indicating that the method can be implemented in hardware easily, with high efficiency and flexibility.
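To make the element-wise structure of the iteration concrete, the following is a minimal sketch of the textbook SSOR method: one forward and one backward successive over-relaxation sweep per iteration. It is not necessarily identical to the optimized computing rule of (3.1.1.2); the function name, the default parameters and the small 3 × 3 test matrix are assumptions for illustration.

```python
def ssor_solve(A, y, omega=1.0, iterations=3, s0=None):
    """Approximate the solution of A s = y with textbook SSOR sweeps.

    A must be Hermitian positive definite; 0 < omega < 2 is the relaxation parameter.
    """
    M = len(y)
    s = list(s0) if s0 is not None else [0j] * M
    for _ in range(iterations):
        # Forward sweep (i = 0..M-1) followed by backward sweep (i = M-1..0).
        for order in (range(M), range(M - 1, -1, -1)):
            for i in order:
                # The off-diagonal sum always uses the newest entries of s.
                off = sum(A[i][j] * s[j] for j in range(M) if j != i)
                s[i] = (1 - omega) * s[i] + omega * (y[i] - off) / A[i][i]
    return s

# Small Hermitian, diagonally dominant example with a known solution.
A = [[4, 1 + 1j, 0],
     [1 - 1j, 5, 1j],
     [0, -1j, 3]]
s_true = [1 + 0j, -1j, 0.5 + 0.5j]
y = [sum(A[i][j] * s_true[j] for j in range(3)) for i in range(3)]
s_hat = ssor_solve(A, y, omega=1.0, iterations=25)
max_err = max(abs(a - b) for a, b in zip(s_hat, s_true))
print(max_err)
```

In the notation of the text, `A[i][i]` plays the role of $D_{i,i}$ and `A[i][j]` with j < i the role of $L_{i,j}$. Every update is a short vector multiply-accumulate followed by one scaling, which is the regularity the text relies on for the hardware mapping.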
Speed-up
To reduce the number of iterations, we consider the initial solution. If the initial solution is close to the final solution, only a few iterations are needed. Hence, the next task is to determine the initial solution, which is traditionally set to the zero vector. For massive MIMO systems, the Gram matrix G and the matrix A are diagonally dominant, indicating that we have:
where $\mathbf{h}_i$ denotes the ith column vector of the matrix H. The dominance of the diagonal elements of A becomes more pronounced as the ratio N/M increases. By exploiting this special property of massive MIMO systems, a low-complexity initial solution is proposed as:
The proposed initial solution noticeably speeds up the iteration. Its computational complexity is very low, and it can be computed in parallel.
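The effect of the starting point can be explored with a small experiment. The sketch below compares the zero vector with a diagonal initial solution obtained by dividing each matched-filter output by the corresponding diagonal element of A. This diagonal choice is an assumption motivated by the diagonal dominance discussed above and may differ from the initial solution proposed in the paper; the SSOR routine, the channel parameters and the QPSK symbols are likewise illustrative.

```python
import random

def ssor(A, y, s0, iterations, omega=1.0):
    """Textbook SSOR: one forward and one backward SOR sweep per iteration."""
    M = len(y)
    s = list(s0)
    for _ in range(iterations):
        for order in (range(M), range(M - 1, -1, -1)):
            for i in order:
                off = sum(A[i][j] * s[j] for j in range(M) if j != i)
                s[i] = (1 - omega) * s[i] + omega * (y[i] - off) / A[i][i]
    return s

def error(s, ref):
    """Euclidean distance between two complex vectors."""
    return sum(abs(a - b) ** 2 for a, b in zip(s, ref)) ** 0.5

rng = random.Random(1)
N, M, N0, Es = 128, 8, 0.1, 1.0                 # assumed example values

def cn():
    """One CN(0, 1) sample."""
    return complex(rng.gauss(0, 0.5 ** 0.5), rng.gauss(0, 0.5 ** 0.5))

H = [[cn() for _ in range(M)] for _ in range(N)]
s_tx = [complex(rng.choice((-1, 1)), rng.choice((-1, 1))) / 2 ** 0.5 for _ in range(M)]
y = [sum(H[r][c] * s_tx[c] for c in range(M)) + N0 ** 0.5 * cn() for r in range(N)]

# A = H^H H + (N0/Es) I and the matched-filter output y_mf = H^H y.
A = [[sum(H[r][i].conjugate() * H[r][j] for r in range(N))
      + (N0 / Es if i == j else 0.0) for j in range(M)] for i in range(M)]
y_mf = [sum(H[r][i].conjugate() * y[r] for r in range(N)) for i in range(M)]

zero_init = [0j] * M
diag_init = [y_mf[i] / A[i][i] for i in range(M)]      # assumed diagonal initial solution
reference = ssor(A, y_mf, zero_init, iterations=100)   # stands in for the exact MMSE solve

for K in (0, 1, 2):
    e_zero = error(ssor(A, y_mf, zero_init, K), reference)
    e_diag = error(ssor(A, y_mf, diag_init, K), reference)
    print(K, e_zero, e_diag)
```

The row K = 0 shows the error of each starting point before any iteration. Computing the diagonal start costs one division per user, and the M divisions are independent of each other.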
Approximated LLRs Computational Method
The equivalent channel matrix can be expressed as $\mathbf{U} = \mathbf{A}^{-1}\mathbf{G}$.
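The form of the equivalent channel matrix follows from substituting the system model (2.1) into the MMSE estimate. The short derivation below assumes the definitions $\mathbf{G} = \mathbf{H}^H\mathbf{H}$ and $\mathbf{A} = \mathbf{G} + N_0E_s^{-1}\mathbf{I}$, which match the Gram matrix and the added parameter described for the hardware computation.

```latex
\begin{aligned}
\hat{\mathbf{s}}
  &= \mathbf{A}^{-1}\mathbf{H}^{H}\mathbf{y}
   = \mathbf{A}^{-1}\mathbf{H}^{H}\left(\mathbf{H}\mathbf{s} + \mathbf{n}\right) \\
  &= \underbrace{\mathbf{A}^{-1}\mathbf{G}}_{\mathbf{U}}\,\mathbf{s}
   + \underbrace{\mathbf{A}^{-1}\mathbf{H}^{H}\mathbf{n}}_{\mathbf{v}}, \\
\mathbf{U}
  &= \mathbf{A}^{-1}\left(\mathbf{A} - N_0E_s^{-1}\mathbf{I}\right)
   = \mathbf{I} - N_0E_s^{-1}\mathbf{A}^{-1}.
\end{aligned}
```

The last line shows that the exact effective gains $U_{ii}$ depend on the diagonal of $\mathbf{A}^{-1}$, which is why an exact LLR computation would still require the matrix inverse.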
Simulation Results
To evaluate the proposed SSOR-based algorithm, we simulate its BER performance and compare it with the NS, CG and GS methods, as shown in Figure 1. Note that the exact MMSE algorithm with Cholesky decomposition [7] is also included in this figure as the reference for these approximate methods. The Rayleigh fading channel model is used. The simulation results show that the proposed SSOR-based method achieves performance closer to the exact MMSE reference than the other algorithms in different antenna configurations. Hence, to achieve the same performance, the SSOR-based algorithm needs fewer iterations. For example, in Figure 1(b), the SSOR-based algorithm with K = 2 iterations has almost the same BER performance as the NS-based algorithm with K = 3.
Reconfigurable VLSI Architecture
We propose a low-complexity VLSI architecture based on the proposed SSOR detection method. The overall architecture consists of a Reconfigurable Computing Array (RCA), which is shown in Figure 2.
Reconfigurable Computing Array
In the RCA, a Finite-State Machine (FSM) controls the data memory, the configuration memory, the interconnection and the RCA. The main blocks of the RCA can be reconfigured according to the SSOR method. To achieve high parallelism, the RCA contains 16 reconfigurable computing units (RCUs). The input data are stored in the data memory, and the configuration is stored in the configuration memory. The RCA can implement the whole SSOR algorithm. In addition, the RCA can be reconfigured to perform signal detection for different antenna configurations in massive MIMO systems.
Reconfigurable Computing Unit
The reconfigurable computing unit includes four real-real multipliers, four adders, two accumulators, one subtractor, and three multiplexers, as shown in Figure 3. Each RCU supports one complex multiply-accumulate operation, one complex multiplication, or two conjugate complex multiply-accumulate operations. For the different steps of the SSOR algorithm, the RCU can be reconfigured to perform different functions. Depending on the configuration, the outputs are selected from three kinds of values: the real-real multiplication results, the accumulated results, and the results of adding a parameter. To support the SSOR method, each RCU supports the following elementary operations: matrix-matrix multiplication, matrix-vector multiplication, initial solution and iteration, and LLR computation.

For the Gram matrix computation, all the RCUs are configured to perform the matrix multiplication. In each RCU, there are three main steps. Firstly, the real-real multiplications are performed. Secondly, the results are passed to the accumulators. Finally, to obtain the matrix A, the parameter $N_0E_s^{-1}$ is added to the output of each accumulator. The matched filter can also be computed by the RCU. The matched-filter computation is a matrix-vector multiplication, and the accumulated results are output.

In the initial solution and iteration computations, each element of the vector $\hat{\mathbf{s}}$ is computed in one RCU. The number of iterations is controlled by the FSM according to the configuration memory. The results are output from the accumulators. The LLR computation depends on the SINR and on a piecewise linear function for Gray mappings. In each RCU, only two real-real multipliers and some lookup tables (LUTs) are needed for the LLR computation.
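The complex multiply-accumulate mode of an RCU can be sketched in software to show how four real-real multipliers, a subtractor, an adder and two accumulators combine. The mapping of operations to hardware blocks in the comments is an assumption based on the component list above, not a description of the actual datapath in Figure 3; the function names are hypothetical.

```python
def complex_mac(acc, a, b):
    """Return acc + a * b, with every value given as a (real, imag) tuple.

    Uses four real multiplications, mirroring a real-valued datapath.
    """
    ar, ai = a
    br, bi = b
    p0, p1, p2, p3 = ar * br, ai * bi, ar * bi, ai * br   # four real-real multipliers
    real = p0 - p1                                         # subtractor
    imag = p2 + p3                                         # adder
    return (acc[0] + real, acc[1] + imag)                  # two accumulators

def dot(row, vec):
    """Inner product of two complex vectors built from repeated MAC steps."""
    acc = (0.0, 0.0)
    for a, b in zip(row, vec):
        acc = complex_mac(acc, a, b)
    return acc

# (1 + 2j)(0.5 + 0.5j) + (3 - 1j)(2 + 0j) = 5.5 - 0.5j
result = dot([(1.0, 2.0), (3.0, -1.0)], [(0.5, 0.5), (2.0, 0.0)])
print(result)   # prints (5.5, -0.5)
```

A matrix-vector product such as the matched-filter computation is one such inner product per output element, so several units can each produce one element in parallel.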
Experimental Results and Conclusion
We implemented our SSOR-based massive MIMO detector for a 128×8 system on a Xilinx Virtex-7 XC7VX690T FPGA to allow a fair comparison with the NS detector [3], the CG detector [5] and the GS detector [7]. Table 1 compares the FPGA implementation results of the proposed SSOR-based detector with those of the other detectors. As Table 1 shows, the SSOR-based implementation has a lower throughput but consumes far fewer hardware resources. Hence, the throughput per slice of the proposed detector is 3.43× that of [3] and 2.84× that of [7]. In addition, compared with the CG-based detector [5], the proposed detector achieves a better throughput per slice (1.71×).

Table 1: FPGA Implementation Results

In this paper, we proposed an efficient and flexible VLSI architecture for SSOR-based soft-output massive MIMO detection. An initial solution method is proposed to speed up the iteration, and a low-complexity LLR computation method is proposed. It has been demonstrated that the proposed VLSI architecture is suitable for different numbers of SSOR iterations and different antenna configurations. The FPGA implementation results show advantages in throughput per slice. Future work will focus on the development of a reconfigurable coarse-grained hardware architecture to achieve high area and energy efficiency in uplink massive MIMO systems.
