Abstract-As an effective non-orthogonal multiple access (NOMA) technique, sparse code multiple access (SCMA) is promising for future wireless communication. Compared with orthogonal techniques, SCMA enjoys higher overloading tolerance and lower complexity because of its sparsity. In this paper, based on the deterministic message passing algorithm (DMPA), algorithmic simplifications such as domain changing and probability approximation are applied to SCMA decoding. Early termination, adaptive decoding, and initial noise reduction are also employed for faster convergence and better performance. Numerical results show that the proposed optimizations benefit both decoding complexity and speed. Furthermore, efficient hardware architectures based on folding and retiming are proposed, and a VLSI implementation is given. Comparisons with the state of the art show the proposed decoder's advantages in both latency and throughput (multi-Gbps).
I. INTRODUCTION
THE fifth generation of cellular networks (5G) is put forward to meet the ever-increasing demand for wireless communication. Enabling techniques of 5G include massive multiple-input multiple-output (MIMO), advanced coding, new multiple access (MA), full spectrum access, new network architectures, etc. [1]. In the past decades, MAs such as time division multiple access (TDMA) [2], frequency division multiple access (FDMA) [3], and code division multiple access (CDMA) [4] became part of wireless standards. However, those orthogonal MAs can hardly meet 5G's capacity requirement ($10^3$ times that of LTE), due to limitations on multiplexing approaches towards physical resources [5]. According to the 3GPP white book, in the enhanced Mobile Broadband (eMBB) scenario, the peak data rate should be 20 Gbps (10 to $10^2$ times that of LTE), the peak spectral efficiency should be 30 bps/Hz (3 to 5 times that of LTE), and the latency should be less than 1 ms (10% of that of LTE) [6, 7]. Thus, non-orthogonal MA (NOMA) [8] has been proposed to alleviate these bottlenecks.
A. Challenges for Existing NOMA
Compared to orthogonal MAs, NOMA techniques allow multiple users to overlap in the time, frequency, or code domain, in other words, to share the same physical resources [9]. NOMA is able to distinguish different users via successive interference cancellation (SIC) [10] or multiple user decoding (MUD) [11]. Besides the very first version [12], the state-of-the-art (SOA) NOMA includes multiuser shared access (MUSA) [13], pattern division multiple access (PDMA) [14], sparse code multiple access (SCMA) [15], etc. SIC was employed in [12] [13] [14] and has practical challenges:
• Computational complexity: SIC implies that each user can be decoded only when all the prior users are properly decoded. Therefore, its computational complexity scales with the in-cell user number.
• Error propagation: For SIC, if an error occurs, all users afterward are likely to be decoded incorrectly.
• Decoding latency: User power sorting is involved in SIC and causes considerable latency overhead compared to other methods. Since the data with the lowest power is decoded last, the latency becomes even higher.
Therefore, SCMA employs MUD instead of SIC. Thanks to its sparsity, the message passing algorithm (MPA) can be applied for better decoding performance.
B. Sparse Code Multiple Access
SCMA was proposed in 2013 [15], aiming to increase the user scale from a new perspective: enabling more efficient multiple access through non-orthogonal sparse spreading codes of users.
1) Properties of SCMA: As a promising MA, SCMA has the following properties: i) multiplexing in the frequency domain; ii) a codebook based on both mapping and spreading; iii) a multidimensional constellation for shaping gain and spectral efficiency; iv) non-orthogonality, which allows more users to access; v) spreading, which reduces noise interference and enhances system robustness; and vi) sparsity, which reduces decoding complexity. Thanks to these properties, SCMA is more physically realizable and overloading tolerant than other MAs [16]. Details of SCMA can be found in Section II.
2) Challenges of SCMA:
achieve the eMBB peak rate with acceptable complexity. Admittedly, such throughput can be achieved with a larger overloading factor, leading to prohibitive hardware complexity and performance degradation.
• Latency: On one hand, utilizing MUD, SCMA avoids the sorting latency required by SIC. On the other hand, for imperfect channels the iterative MPA tends to cost more iterations, which will counteract its latency advantage.
• Implementation: First, though VLSI techniques ensure that complexity is no longer a bottleneck for SCMA implementation when the overloading factor is not extremely large, existing iterative algorithms are not hardware friendly. Second, the noise power density $N_0$ results in a large data range, leading to either an unbearable quantization length or poor error performance.
C. Relevant Prior Art
Regarding SCMA decoding, existing literature mainly focuses on three aspects: i) stochastic computing, ii) tree structure approximation, and iii) efficient hardware implementation.
1) Stochastic Computing: In [17], a stochastic MPA (SMPA) decoder was proposed, where beliefs are given by the weights of bit streams. Multiplication and addition are implemented by AND gates and multiplexers (MUX), respectively. Though this work effectively reduces the complexity per iteration, it has the following problems:
• Accuracy: Stochastic computing suffers from low accuracy, due to randomness loss. Beliefs in MPA usually require precision of 10
, which length-limited could not give. Performance degradation is observed.
• Latency: For SMPA, the calculation of a single value requires a large number ($10^5$ to $10^6$) of bit-level operations. The considerable number of iterations makes the latency even larger and unsuitable for practice.
• Complexity: Though SMPA helps to reduce the hardware of a single operation, the number of bit operations in one decoding is around $10^7$. Thus, the total complexity may be even larger than that of deterministic MPA (DMPA).
A VLSI architecture of SMPA was discussed in [18]. The throughput of a 6-user decoder is 57 Mbps, which is far from the 3GPP requirements. Though the hardware cost is low, the latency is not suitable for eMBB.
2) Tree Structure Approximation: In [19], a pruned tree approximation was proposed. The decoder accurately represents values with high probabilities, while approximating those with low probabilities [20]. Squares are replaced by additions, multiplications, and comparisons. Though the complexity is expected to decrease, the search breadth must be larger than 2 to maintain performance, which increases the complexity again.
3) Efficient Hardware Architecture: In our prior work [21], a stage-level folded architecture for DMPA was proposed with consideration of both speed and efficiency. However, only theoretical analysis and a simple architecture were given; a real VLSI implementation is missing.
D. Contributions
This paper focuses on iteration reduction, convergence speedup, computation simplification, and implementation of the SCMA decoder. Compared to the SOA, our contributions are:
• We propose an early termination scheme based on the convergence behavior of DMPA, which significantly reduces the required number of iterations.
• We propose an adaptive decoder, which adjusts beliefs according to the variation trend, accelerates the convergence, and compensates for the performance loss. Results show that it outperforms the decoders in [17, 18] in terms of latency and throughput, satisfying the 3GPP requirements.
• We perform numerical analysis of the conditional probability approximation in Initialization (over 60% of the computation in MPA is for conditional probabilities). The approximation is square-free and division-free, and suffers from little performance loss. Both computational complexity and hardware implementation benefit greatly.
• We propose a distributed matrix scheme for prior noise reduction in the DMPA decoder, which compensates for the approximation loss with negligible extra complexity.
• We improve our stage-level folded decoder with the proposed algorithms, achieving higher hardware efficiency while meeting the eMBB requirements on throughput and latency.
• We implement the proposed DMPA decoder on a Xilinx Virtex-7 XC7VX690T FPGA to demonstrate its advantages for real applications.
E. Notations
Lowercase and uppercase boldface letters designate column vectors and matrices, respectively. Matrix A's transpose and conjugate are A 
F. Paper Outline
The remainder of this paper is organized as follows. Section II reviews the preliminaries of SCMA. DMPA and its optimized versions are discussed in Section III. Numerical results and analysis are given in Section IV. Hardware architecture is described in Section V. VLSI implementation is given in Section VI. Section VII concludes the entire paper.
II. PRELIMINARIES
Preliminaries of SCMA are given in this section. A 6-user system in Fig. 1 is used as a running example.
A. SCMA Encoder
Suppose the codeword set, constellation set, and information set are $\mathcal{X}$, $\mathcal{C}$, and $\mathcal{B}$, respectively. Define $x \in \mathcal{X}$, $c \in \mathcal{C}$, and $b \in \mathcal{B}$, with $|\mathcal{B}| = M$, $|\mathcal{X}| = K$, and $|\mathcal{C}| = N$. SCMA encoding is given by two rounds of mapping [15]. The first round of mapping is
$$c = g(b),$$
where $g$ is a constellation mapping function. The second round of mapping is
$$x = Vc,$$
where $V$ is the mapping matrix. Suppose the entire mapping function of SCMA encoding is $f$. Then we have
$$x = f(b) = V g(b).$$
An M -size SCMA codebook consisting of K complex values is constructed. Note that V contains (K − N ) all-zero rows. Mapping matrix is generated by inserting (K − N ) all-zero rows into an N ×N identity matrix I N randomly. So when the SCMA system is regular, it supports C
B. SCMA Multiplexing
Consider a K-dimensional SCMA encoder with J separate layers. Each layer is defined by $(V_j, g_j, M_j, N_j)$, where $j = 1, \ldots, J$. If $i \neq j$, then $V_i \neq V_j$ and $g_i \neq g_j$, in order to distinguish one layer from another. In general, $M_j$ and $N_j$ can be either the same or different for different layers. Without loss of generality, for $\forall j$ we set
$$M_j = M, \quad N_j = N.$$
We call this SCMA system semi-regular because $J$ is not necessarily $C_K^N$ (the regular system will be discussed later). The SCMA codewords are multiplexed over K shared orthogonal resources, e.g., OFDMA tones or MIMO spatial layers [16]. With this semi-regular system, the received signal after synchronous layer multiplexing can be expressed as
$$y = \sum_{j=1}^{J} \mathrm{diag}(h_j)\, x_j + n, \quad (4)$$
where $h_j$ and $x_j$ are the K-dimensional channel vector and SCMA codeword of layer $j$, and $n$ is the noise. Suppose the signals of all layers are from the same transmit point. Then, for a specific receiver, the channel vectors of all layers are identical, i.e., $h_j = h$ for $\forall j$. Now Eq. (4) reduces to
$$y = \mathrm{diag}(h) \sum_{j=1}^{J} x_j + n.$$
Define the overloading factor as $\lambda = J/K$, which indicates the overloading tolerance or access ability of an SCMA system. Fig. 2 illustrates a 6-user SCMA multiplexing.
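The encoding and multiplexing described above can be sketched in a few lines. This is only an illustration under stated assumptions: the constellation points in `points` are hypothetical (not a real SCMA codebook), the mapping matrices are enumerated in a fixed order rather than randomly, the channel is taken as identical and equal to one on every resource, and noise is omitted.

```python
from itertools import combinations

K, N = 4, 2  # resources and nonzero entries per codeword (example values)

def mapping_matrix(rows, k=K):
    """K x N binary matrix: an N x N identity with all-zero rows inserted."""
    n = len(rows)
    V = [[0] * n for _ in range(k)]
    for col, r in enumerate(rows):
        V[r][col] = 1
    return V

def encode(V, c):
    """x = V c: place the N-dimensional constellation point c on K resources."""
    return [sum(V[k][n] * c[n] for n in range(len(c))) for k in range(len(V))]

# One layer per choice of N nonzero rows out of K (regular system).
layers = [mapping_matrix(rows) for rows in combinations(range(K), N)]
J = len(layers)

# Hypothetical constellation points, one per layer (assumption, illustrative).
points = [[complex(1 + j, -1), complex(-1, 1 + j)] for j in range(J)]
codewords = [encode(V, c) for V, c in zip(layers, points)]

# Identical channel h for all layers; noise omitted for clarity.
h = [1 + 0j] * K
y = [h[k] * sum(x[k] for x in codewords) for k in range(K)]

overloading = J / K  # lambda = J / K
```

With K = 4 and N = 2 this yields six layers on four resources, i.e., an overloading factor of 1.5, and every codeword occupies exactly two resources.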
C. Factor Graph Representation
Define the binary indicator vector as
. Then the factor graph matrix is $F = (f_1, \ldots, f_J)$, and the factor graph representation can be obtained in the same way as for LDPC codes. Each column of $F$ corresponds to a layer node, and each row to a resource node. The degree of each resource node is defined as
For more details, please refer to [15] .
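As a rough illustration of the factor graph matrix, the sketch below builds $F$ for a regular system with K = 4 resources and N = 2 nonzero entries per codeword, and counts the node degrees. The column ordering is an assumption and may differ from the one used in the figures; the indicator vectors are generated directly from the occupied resources.

```python
from itertools import combinations

K, N = 4, 2  # example dimensions

# Indicator vector of each layer: 1 on the resources the layer occupies.
indicator_vectors = [
    [1 if k in rows else 0 for k in range(K)]
    for rows in combinations(range(K), N)
]
J = len(indicator_vectors)

# Factor graph matrix F (K x J): column j is the indicator vector of layer j.
F = [[indicator_vectors[j][k] for j in range(J)] for k in range(K)]

# Row sums give resource node degrees; column sums give layer node degrees.
resource_degree = [sum(row) for row in F]
layer_degree = [sum(F[k][j] for k in range(K)) for j in range(J)]
```

Each resource node ends up connected to three layer nodes and each layer node to two resource nodes, which is the structure the decoding steps rely on.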
Take K = 4 and N = 2 as an example. The factor graph is in Fig. 3 and
T and the overloading factor λ = J/K = 1.5. The 4 × 6 factor graph matrix of this system is in Eq 
B. DMPA Decoding
The DMPA decoding for SCMA mainly includes 4 steps.
1) Initialization:
Calculate the conditional probability with extrinsic information to prepare for belief propagation.
where $y_k$ denotes the k-th element of the received signal $y$; $x_{k,1}$, $x_{k,2}$, and $x_{k,3}$ denote the overlapped symbols of the 3 layers connected to the k-th resource node; and $N_0$ is the noise power density.
2) Resource Node Updating: The resource node update is in the sum-product form, which is an approximation of the marginal probability. In Eq. (10), $R_k$ is the k-th resource node, and $m_{1,2,3} = 1, \ldots, M$ are the transmitted symbols. $I_{R_k \to L_{1,2,3}}$ denotes the belief propagated from the k-th resource node to the neighboring layer nodes, and $I_{L_{1,2,3} \to R_k}$ is the belief passing in the opposite direction.
3) Layer Node Updating: The normalization ensures that each belief falls in [0, 1].
where $m = 1, \ldots, M$ corresponds to different symbols.
4) Probability Calculating and Symbol Judging: After the iterations, the final probability of each symbol is
where $L_j$ denotes the j-th layer. The symbol with the highest probability becomes the estimated symbol for each layer.
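The four steps above can be sketched as small functions. This is a minimal sketch, not the authors' implementation: it assumes a Gaussian likelihood with unit channel gain for Initialization, resource nodes of degree 3, and layer nodes of degree 2 (as in the K = 4, N = 2 example). All function names are hypothetical.

```python
import math

def init_prob(y_k, x1, x2, x3, n0):
    """Step 1: Gaussian likelihood of y_k given the three overlapped symbols."""
    return math.exp(-abs(y_k - (x1 + x2 + x3)) ** 2 / n0)

def resource_update(f, msg2, msg3):
    """Step 2: message from a resource node to layer 1 (sum-product form).

    f[m1][m2][m3] is the initial probability; msg2 and msg3 are the beliefs
    arriving from the other two layer nodes.
    """
    M = len(f)
    return [
        sum(f[m1][m2][m3] * msg2[m2] * msg3[m3]
            for m2 in range(M) for m3 in range(M))
        for m1 in range(M)
    ]

def layer_update(msg_from_other_rn):
    """Step 3: a degree-2 layer node forwards the other resource node's
    message, normalized so that the beliefs lie in [0, 1] and sum to 1."""
    total = sum(msg_from_other_rn)
    return [v / total for v in msg_from_other_rn]

def decide(msg_a, msg_b):
    """Step 4: multiply the incoming messages and pick the likeliest symbol."""
    scores = [a * b for a, b in zip(msg_a, msg_b)]
    return max(range(len(scores)), key=scores.__getitem__)

# Tiny usage example with M = 2 symbols per layer.
f = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
to_layer1 = resource_update(f, [0.8, 0.2], [0.5, 0.5])
normalized = layer_update(to_layer1)
```

In a full decoder these functions would be applied along every edge of the factor graph in each iteration, with `decide` called once per layer after the last iteration.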
C. Max-Log Algorithm
Decoding in the probability domain suffers from huge complexity and relatively high latency. Therefore, its Max-Log version is considered [24], using Jacobi's logarithm formula [25]:
The updating steps now become:
1) Initialization:
2) Resource Node Updating:
4) Probability Calculating and Symbol Judging:
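To make the Max-Log idea concrete, the sketch below compares the exact logarithm of a sum of exponentials with its maximum-only approximation, and shows a log-domain resource node update in which products become sums and the marginalizing sum becomes a maximum. This is a generic illustration of the Max-Log technique under the same degree-3 resource node assumption as the example system; the function names are hypothetical.

```python
import math

def log_sum_exp(values):
    """Exact log(sum(exp(v))), computed in a numerically stable way."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def max_log(values):
    """Max-Log approximation: keep only the dominant term."""
    return max(values)

def resource_update_maxlog(log_f, log_msg2, log_msg3):
    """Log-domain resource node update towards layer 1.

    log_f[m1][m2][m3] is the log-domain initial value; log_msg2 and log_msg3
    are the log-domain beliefs from the other two layer nodes.
    """
    M = len(log_f)
    return [
        max(log_f[m1][m2][m3] + log_msg2[m2] + log_msg3[m3]
            for m2 in range(M) for m3 in range(M))
        for m1 in range(M)
    ]
```

The approximation error is bounded: for n terms, the exact value lies between the maximum and the maximum plus log(n), so the error shrinks in relative terms when one term dominates.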
D. Early Termination
Early termination is based on the belief judgement for each layer node and resource node [26]. Our judgement steps are:
1) Create a zero matrix to record the stability condition of the beliefs; all zeros denotes that all beliefs are unstable.
2) Judge the stability of all beliefs in each iteration. If $|(V - V_{temp})/V_{temp}| \le \epsilon$ ($\epsilon > 0$), the belief is stable, and the corresponding entry in the matrix is set to "1".
3) When the stability matrix becomes an all-ones matrix, the beliefs of all layer nodes and resource nodes are stable, and convergence is achieved. Then, the iterative decoding terminates.
Here, $V_{temp}$ and $V$ are the belief values in the previous and present iteration, respectively, and $\epsilon$ is a judgment constant. The DMPA with early termination is shown in Alg. 1. The Max-Log version is similar and omitted.
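The judgement steps above can be sketched as a generic loop. This is a minimal sketch under assumptions: `update` is a hypothetical stand-in for one full decoding iteration, beliefs are kept in a flat list, and the stability record is reset in every iteration.

```python
def run_with_early_termination(update, beliefs, max_iter, eps):
    """Iterate `update` until every belief changes by at most eps (relative).

    `update` maps the list of beliefs to the next list; it stands in for one
    full decoding iteration. Returns the final beliefs and iterations used.
    """
    for t in range(1, max_iter + 1):
        previous = beliefs
        beliefs = update(previous)
        # Stability record: starts all-zero, an entry becomes 1 once stable.
        stable = [0] * len(beliefs)
        for j, (v, v_temp) in enumerate(zip(beliefs, previous)):
            if v_temp != 0 and abs((v - v_temp) / v_temp) <= eps:
                stable[j] = 1
        if all(stable):
            return beliefs, t  # all beliefs stable: terminate early
    return beliefs, max_iter

# Toy update that pulls every belief halfway towards 1.0.
toy_update = lambda b: [0.5 * (v + 1.0) for v in b]
final, used = run_with_early_termination(toy_update, [0.5, 0.9], 20, 0.05)
```

In this toy run the loop stops well before the iteration limit, which is the effect early termination aims for in the decoder.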
Algorithm 1 DMPA with Early Termination
Input: $y$, $I_{max}$, and $\epsilon$
Iteration:
for t = 1 : $I_{max}$
    Set stability matrix S = 0
    Update beliefs V
    for j = 1 :
        if temp ≤ $\epsilon$
            $S_j$ = 1
Decide $\hat{u}$
Output: $\hat{u} = \{\hat{u}_1, \hat{u}_2, \ldots, \hat{u}_6\}$
E. Self-Adaption Algorithm
Self-adaption [27, 28] is also based on stability judgement. Compared to the one in early termination, the judgement in self-adaption requires an extra step between 2) and 3):
"Forecast and adjust the belief of the next iteration based on the convergence trend. If
for t = 1 : $I_{max}$
    if temp ≥ $\epsilon$
    elseif temp ≤ $-\epsilon$
    else
        $S_j$ = 1
F. Initial Noise Reduction
"Distributed matrix" D is to reduce random error, enhance accuracy of initial value [29] , and speed up the convergence.
For the SCMA system in Fig. 4, we have the overlapped signals a, b, c, and d after multiplexing. The random error of these signals can be either positive or negative, depending on the environmental noise. Therefore, we can regroup the signals and assign them to the 4 resource nodes. At the receiver, we first recover the original signals using the inverse of the "distributed matrix" and then start the decoding. Compared to the original transmitting scheme, each signal of a specific resource node is likely to be added with both positive and negative random noise, which increases the accuracy of the initial value. Note that D is not constant and can be adjusted according to the codebook and channel condition.
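The regroup-and-recover mechanism can be illustrated with a toy example. The matrix used here is purely hypothetical: a 4 x 4 Hadamard matrix, chosen only because its inverse is itself divided by 4. The actual D is not specified here and, as noted, depends on the codebook and channel. The signal and noise values are likewise made up.

```python
# Hypothetical 4x4 "distributed matrix": a Hadamard matrix, chosen only
# because it is easy to invert (D * D = 4 * I). The real D may differ.
D = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]

def matvec(A, v):
    """Plain matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def spread(signals):
    """Transmitter side: regroup the overlapped signals a, b, c, d."""
    return matvec(D, signals)

def recover(received):
    """Receiver side: apply the inverse of D, which is D / 4 here."""
    return [v / 4 for v in matvec(D, received)]

signals = [1 + 2j, -1 + 0.5j, 0.25 - 1j, 3 + 0j]   # made-up a, b, c, d
noise = [0.1, -0.2, 0.05, 0.15]                    # made-up noise samples
noisy = [r + n for r, n in zip(spread(signals), noise)]
recovered = recover(noisy)
residual = [r - s for r, s in zip(recovered, signals)]
```

After recovery, the residual error on each signal is a signed average of all four noise samples, so positive and negative noise contributions partly cancel. Transmit power scaling is ignored in this sketch.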
G. Initial Probability Approximation
As discussed above, the calculation of the initial probability results in high computational complexity, which is especially evident in Max-Log decoding. Thus, suitable approximations in Initialization are expected to improve calculation efficiency and reduce latency with little performance loss. For SCMA decoding, the purpose of iterative updating is to find the symbol with the largest probability. Hence, the absolute value is not critical for making a decision, and detection correctness can still be ensured with relative beliefs. The relative magnitude is determined by the initial probability and the initial value of different users in Initialization. We carry out the approximation in three steps: i) simplify the initial probability calculation by reducing operations with large complexity; ii) adjust the initial value of different users according to the relative magnitude determined by the initial probabilities; iii) update beliefs iteratively based on the relative values. The formulae of initial probabilities in DMPA become:
For square and division, which are of higher complexity, DMPA approximations 1 to 3 are proposed:
Similarly, we have Max-Log approximations 1 to 3 as follows.
Analysis below will show these approximations have different effects on error performance and computational complexity.
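The remark that relative beliefs are enough for detection can be illustrated with a small check. The sketch below is not one of the paper's Approximations 1 to 3; it only shows, for an assumed log-domain initial value of the form $-|y_k - s|^2 / N_0$, that dropping the division by $N_0$ leaves the ranking of the candidates unchanged. The candidate sums and the received sample are made-up values.

```python
def maxlog_metric(y_k, candidate_sum, n0):
    """Assumed exact log-domain initial value: -|y_k - sum|^2 / N0."""
    return -abs(y_k - candidate_sum) ** 2 / n0

def relative_metric(y_k, candidate_sum):
    """Relative value without the division by N0 (illustrative only)."""
    return -abs(y_k - candidate_sum) ** 2

def best(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)

y_k = 0.9 + 0.2j                          # made-up received sample
candidates = [1 + 0j, -1 + 0j, 1j, -1j]   # made-up overlapped symbol sums
exact = [maxlog_metric(y_k, c, n0=0.37) for c in candidates]
relative = [relative_metric(y_k, c) for c in candidates]
```

Because a common positive scale factor preserves the ordering of the values, the same candidate wins in both cases, which is consistent with the Max-Log decoder's low sensitivity to $N_0$.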
IV. RESULTS AND ANALYSIS

A. Error-Rate Performance
The 6-user SCMA system is simulated. Additive white Gaussian noise (AWGN) is assumed. The maximum iteration number is 5. Results are given in Fig. 5. Fig. 5(a) shows the BLER performance of the DMPA algorithm with different approximations, different iterations, early termination, self-adaption, and initial noise reduction. Fig. 5(b) shows the curves of the Max-Log algorithm. According to Fig. 5, we see that:
1) DMPA/Max-Log with more iterations enjoys better performance, but the improvement is limited when the iteration number is sufficiently large. As shown by the numerical results, DMPA/Max-Log with 3 iterations is a good choice for real implementation.
2) The average iteration number of the early termination or adaptive scheme is around 3, but the performance is similar to that of DMPA with 5 iterations. Results with different parameters reveal that self-adaption performs better at high SNR. Thus, the adjusting factor in self-adaption should be smaller at higher SNR.
3) DMPA and Max-Log have similar performance without approximation. However, since DMPA heavily depends on $N_0$, approximations without precise $N_0$ cause unbearable performance loss. On the other hand, the Max-Log algorithm is not sensitive to $N_0$, and its approximations without exact $N_0$ can still achieve good performance. Therefore, Max-Log is preferred.
In summary, suitable configurations for hardware implementation are: i) the Max-Log approach; ii) 3 iterations; iii) early termination and self-adaption; iv) Approximation 2 or 3; and v) initial noise reduction.
B. Computational Complexity
Suppose the symbol set size for each user is M, the number of physical resources is N, the user number is K, and the maximum iteration number is I. Then, we summarize the computational complexity of different decoding methods in Table I. Compared with other methods, the proposed method has the lowest computational complexity, while maintaining the error performance. In fact, the proposed method is similar to Max-Log, but has lower complexity in Initialization due to the approximation. For a real system, M and N are usually large, and the number of multiplications and divisions makes other methods unsuitable for implementation. However, as discussed above, the proposed algorithm is multiplication/division-free with Approximation 3. Therefore, it can greatly improve the computational efficiency and reduce the latency, making it more applicable to the hardware implementation in Section VI. The VLSI implementation results in Table IV will further verify the proposed decoder's hardware efficiency advantage over the SOA design.
C. Performance/Complexity Trade-Off Analysis
Fig. 6 illustrates the trade-off between error performance and computational complexity of the proposed methods. The minimum SNR required to achieve 1% BER is employed as a metric. The complexity is given by the Timing (TM) complexity, which is in terms of the iteration number. Fig. 6 shows the trade-off of DMPA with approximations. It is clear that Max-Log with Approximation 3 provides the best performance/complexity trade-off.
V. HARDWARE ARCHITECTURE
The hardware architecture of the Max-Log DMPA is discussed. Timing optimization and the folding technique are introduced for higher efficiency.
A. Overall Architecture
The overall architecture is shown in Fig. 7. It has 4 units and 2 memory networks, which are the RN-to-LN and LN-to-RN networks for $I_{R \to L}$ and $I_{L \to R}$, respectively. The elementary units are the Initialization Unit, Resource Node Update Unit, Layer Node Update Unit, and Probability Calculating Unit, which execute the steps indicated by Eq. (15)
Update Unit and Layer Node Update Unit, both of which cannot start the current propagation unless all previous data have been calculated. We call this data updating interval a "step". Optimization details of this scheduling will be discussed below.
B. Stage-Level Scheduling Optimization
[Scheduling diagram: Steps 1 to 4; operations ACC, STO, SWOP, RESET, CMP]
The proposed stage-level scheduling is a finer-grained optimization over the step-level scheduling. With stage-level scheduling, it is convenient to insert deep pipelines to achieve a higher throughput [32] [33] [34]. Compared with step-level scheduling, updating in stage-level scheduling does not have to wait for the completion of data computation in the previous unit, which avoids low hardware efficiency and long processing latency. In sum, stage-level scheduling enjoys faster processing speed and higher hardware efficiency than the step-level one. Fig. 8 shows the stage-level scheduling. It details each computing step to achieve a deeper pipelined structure.
C. Folding
The architecture of stage-level DMPA turns out to be very complicated in the form of a data flow graph (DFG). To achieve an efficient architecture, the folding technique is employed for further optimization. Since folding on the fine-grained architecture is difficult to carry out, a unit-based folding scheme is considered. Figs. 13 and 14 in the appendix show the entire step-level and stage-level algorithms, respectively. Due to the page constraint, we only take a branch of the Initialization Unit, which is fully parallel in the DFG, as an example to show the proposed folding details. The folding transform of other units can be conducted in a similar fashion. The DFG of the branch in the Initialization Unit is shown in Fig. 9. The folding includes 3 steps: i) construct folding sets and folding equations, ii) analyze life spans, and iii) allocate registers. More details of this method are explained in [35].
1) Folding Sets and Folding Equations: Setting the folding factor to 7, we obtain the following folding sets:
where S in , S A , and S M denote the folding sets for inputs, adders, and multipliers, respectively.
Then, folding equations can be derived based on the given folding sets
where $D_F(x \to y)$ denotes the number of delays on the path from $x$ to $y$.
2) Life Time Analysis: Life span analysis is presented in the form of a life time chart, as shown in Fig. 10, which is obtained from the folding equations. Each thick line in the figure represents the survival time of a datum. The activated number is the number of data in use at a given moment [36]. According to Fig. 10, this folding architecture requires at least 8 registers.
3) Register Allocation: The forward-backward scheme of register allocation is employed based on life span analysis [37] . The specific allocation process is displayed in Fig. 11 . After all the steps, we can finally obtain the folded architecture of the branch in Initialization Unit.
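For readers unfamiliar with folding, the folding equations mentioned in step 1) follow the standard textbook form $D_F(U \to V) = N\,w(e) - P_U + v - u$, where $N$ is the folding factor, $w(e)$ the number of delays on the edge, $P_U$ the number of pipeline stages of the unit executing $U$, and $u$, $v$ the folding orders. The sketch below evaluates it for a folding factor of 7. The edges, pipeline depths, and folding orders are hypothetical and are not the paper's folding sets.

```python
def folded_delays(folding_factor, edge_delays, pipeline_stages_u, order_u, order_v):
    """D_F(U -> V) = N * w(e) - P_U + v - u (standard folding equation)."""
    return folding_factor * edge_delays - pipeline_stages_u + order_v - order_u

N_FOLD = 7  # folding factor used in the text

# Hypothetical edges: (w(e), P_U, u, v). These are not the paper's folding sets.
edges = {
    "in0 -> A0": (0, 0, 0, 2),
    "A0 -> M0": (0, 1, 2, 5),
    "M0 -> A1": (1, 2, 5, 1),
}
delays = {name: folded_delays(N_FOLD, *e) for name, e in edges.items()}

# A folding is only realizable if no folded edge needs a negative delay.
feasible = all(d >= 0 for d in delays.values())
```

If any folded delay were negative, the original graph would have to be retimed before folding; the register count then follows from the life time analysis of the resulting delays.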
D. Hardware Architecture and Loop Analysis
The final stage-level folded architecture of DMPA is illustrated at module level in Fig. 12. Its main advantages are lower hardware cost and reasonable processing speed.
The loop bound analysis [38, 39] of this folded architecture is also given here. Suppose the processing times of an adder, a comparator, and a swopper are $T_A$, $T_C$, and $T_S$, respectively. We obtain the results listed in Table III.
Thus, the iteration bound is calculated as follows:
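As a reminder of the general definition, the iteration bound of a recursive data flow graph is the maximum, over all loops, of the loop computation time divided by the number of delays in the loop. The sketch below evaluates this definition on hypothetical loops; the unit times and loop compositions are made-up values and do not reproduce Table III.

```python
from fractions import Fraction

def iteration_bound(loops):
    """Maximum over all loops of (loop computation time / loop delays)."""
    return max(Fraction(time, delays) for time, delays in loops)

# Hypothetical unit processing times (arbitrary time units).
T_A, T_C, T_S = 2, 1, 1

# Hypothetical loops: (total computation time, number of delay elements).
loops = [
    (2 * T_A + T_C, 2),      # two adders and a comparator, 2 delays
    (T_A + T_C + T_S, 1),    # adder, comparator, swopper, 1 delay
    (3 * T_A, 4),            # three adders, 4 delays
]
bound = iteration_bound(loops)
```

The loop with the largest ratio is the critical loop; it limits the achievable clock period regardless of how much pipelining is added outside the loop.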
VI. VLSI IMPLEMENTATION
The proposed decoder's VLSI implementation is given and compared to two SOA baselines. The first is the DMPA decoder [21], and the second is the SMPA decoder [18]. As neither baseline considers folding, the proposed decoder does not either, for fair comparison. If all designs are folded, the proposed decoder's advantages remain. As discussed previously, the proposed decoder is based on: i) the Max-Log approach; ii) early termination and self-adaption; iii) Approximation 3; and iv) initial noise reduction. Since the SMPA decoder employed 5 iterations, 1 up to 5 iterations are considered, though 3 turns out to be efficient per our analysis. Both the proposed decoder and the DMPA decoder are implemented on a Xilinx Virtex-7 XC7VX690T FPGA. The results of the SMPA decoder are taken from [18], since it is implemented as an ASIC. The frequency is 500 MHz. The input quantization is 8 bits for both real and imaginary parts, and the intermediate quantization is 16 bits.
A. Module Details of Proposed Decoder
The proposed decoder consists of four basic parts as shown in Fig. 7 : initialization module, layer node updating network, resource node updating network, and symbol judging module. The design details are presented as follows.
1) Initialization Module: It calculates the initial belief of each user from the received signal and the inner codebook. The received signal is made up of 4 complex resource nodes, thus the input is 8-parallel. Each input has a quantization length of 8. Note that the output belief has a quantization length of 16, due to multiplication. The codebook is stored in memories, which costs 96 memory blocks of 8-bit length each.
2) Resource Node Updating Network: It calculates the sum of belief and outputs the largest, based on the approximated Jacobi's formula. It is made up of resource node updating units, where the input data are initial beliefs and layer node beliefs, and the output data are the 4 resource node beliefs. The largest value is selected from 16 intermediate beliefs, in 3 steps of comparison with 14 buffers. Thus, 56 buffers are required by each unit, and 672 by the entire network.
3) Layer Node Updating Network: It is made up of layer node updating units, which normalize the input value and swop it by the inner connection. In each unit, the input data are resource node beliefs only, and the output data are the corresponding 4 layer node beliefs. Four 16-bit dividers are required per unit with 28 clocks' delay. Hence, the whole network needs 48 dividers. Besides, layer node beliefs would also be reset at the start of each frame of the received signals in layer node updating network.
4) Symbol Judging Module: It finds the largest belief and maps it to original source code according to the codebook of each user. Also, this module consists of 6 smaller judging units, which perform the basic function for each user. In each unit, 4 beliefs are compared with each other. Thus 2 steps of comparison and 3 buffers are required. Then, the entire module needs 18 buffers.
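The resource counts quoted in the module descriptions can be cross-checked with simple arithmetic. The number of updating units (12 in each network) is not stated explicitly; it is inferred here from the quoted totals and is consistent with 4 resource nodes of degree 3 and 6 layer nodes of degree 2, so it should be read as an assumption.

```python
BUFFERS_PER_SELECTION = 14     # buffers to pick the largest of 16 beliefs
SELECTIONS_PER_RN_UNIT = 4     # one selection per output belief (assumed)
RN_UNITS = 12                  # assumed: 4 resource nodes x 3 connected layers
DIVIDERS_PER_LN_UNIT = 4       # 16-bit dividers per layer node updating unit
LN_UNITS = 12                  # assumed: 6 layers x 2 connected resources
JUDGING_UNITS = 6              # one judging unit per user
BUFFERS_PER_JUDGING_UNIT = 3   # buffers to compare 4 beliefs

rn_buffers_per_unit = BUFFERS_PER_SELECTION * SELECTIONS_PER_RN_UNIT
rn_buffers_total = rn_buffers_per_unit * RN_UNITS
ln_dividers_total = DIVIDERS_PER_LN_UNIT * LN_UNITS
judging_buffers_total = JUDGING_UNITS * BUFFERS_PER_JUDGING_UNIT
```

Under these assumptions the tallies reproduce the figures in the text: 56 buffers per resource node updating unit, 672 buffers in that network, 48 dividers, and 18 judging buffers.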
The implementation comparison with the DMPA decoder is listed in Table IV. It shows the proposed decoder's advantages in both complexity and throughput, thanks to the log-domain processing and approximation approaches. Since speed is the main focus of our design, throughput and latency comparisons with the baselines are shown in Table V, where "L" stands for latency and "T" for throughput. As the table shows, the proposed SCMA decoder outperforms the SOA in both throughput and latency, and also meets the multi-Gbps and millisecond requirements of 3GPP. Though the SMPA decoder has a complexity advantage, the proposed decoder's complexity can be further reduced with folding techniques.
VII. CONCLUSION
In this paper, simplifications such as log-domain calculation and probability approximation have been introduced to lower the complexity of SCMA's DMPA decoder. Early termination, adaptive decoding, and initial noise reduction are also proposed for faster convergence and better performance. Hardware optimizations with folding and retiming are introduced. VLSI implementation results have confirmed the advantages of the proposed SCMA decoder for high-speed applications over the SOA designs. Future research will be directed towards further improvements on both algorithm and implementation. 
