Efficient Sparse Code Multiple Access Decoder Based on Deterministic
  Message Passing Algorithm by Zhang, Chuan et al.
IEEE TRANSACTIONS ON , 2018 1
Efficient Sparse Code Multiple Access Decoder
Based on Deterministic Message Passing Algorithm
Chuan Zhang, Member, IEEE, Chao Yang, Wei Xu, Senior Member, IEEE, Shunqing Zhang, Senior
Member, IEEE, Zaichen Zhang, Senior Member, IEEE, and Xiaohu You, Fellow, IEEE
Abstract—Being an effective non-orthogonal multiple access
(NOMA) technique, sparse code multiple access (SCMA) is
promising for future wireless communication. Compared with or-
thogonal techniques, SCMA enjoys higher overloading tolerance
and lower complexity because of its sparsity. In this paper, based
on deterministic message passing algorithm (DMPA), algorithmic
simplifications such as domain changing and probability ap-
proximation are applied for SCMA decoding. Early termination,
adaptive decoding, and initial noise reduction are also employed
for faster convergence and better performance. Numerical results
show that the proposed optimizations benefit both decoding
complexity and speed. Furthermore, efficient hardware archi-
tectures based on folding and retiming are proposed. VLSI
implementation is also given in this paper. Comparisons with the
state of the art show the proposed decoder’s advantages
in both latency and throughput (multi-Gbps).
Index Terms—Sparse code multiple access (SCMA), determin-
istic message passing algorithm (DMPA), folding, retiming, VLSI.
I. INTRODUCTION
THE fifth generation of cellular networks (5G) has been put forward to
meet the ever-increasing demand for wireless communication. Enabling
techniques of 5G include massive multiple-input multiple-output (MIMO),
advanced coding,
new multiple access (MA), full spectrum access, new network
architectures, etc [1]. In the past decades, MAs such as
time division multiple access (TDMA) [2], frequency division
multiple access (FDMA) [3], and code division multiple access
(CDMA) [4], became part of wireless standards. However,
those orthogonal MAs can hardly meet 5G’s capacity
requirement ($10^3$ times that of LTE), due to limitations on
multiplexing approaches towards physical resources [5]. According
to the 3GPP white book, in the enhanced Mobile Broadband
(eMBB) scenario, the peak data rate should be 20 Gbps (10 to
$10^2$ times that of LTE), the peak spectral efficiency should be 30
bps/Hz (3 to 5 times that of LTE), and the latency should be less
than 1 ms (10% of LTE) [6, 7]. Thus, ideas of non-orthogonal
MA (NOMA) [8] are proposed to alleviate these bottlenecks.
Chuan Zhang and Chao Yang are with Lab of Efficient Architectures for
Digital-communication and Signal-processing (LEADS), Southeast University,
Nanjing, China. Chuan Zhang, Chao Yang, Wei Xu, Zaichen Zhang, and
Xiaohu You are with the National Mobile Communications Research Laboratory,
Southeast University, Nanjing, China. Chuan Zhang, Chao Yang, and Zaichen
Zhang are with Quantum Information Center of Southeast University, Nanjing,
China. Email: {chzhang, chaoyang, wxu, zczhang, xhyu}@seu.edu.cn.
Shunqing Zhang is with Shanghai Institute for Advanced Communications
and Data Science, Shanghai University, Shanghai, China. Email:
shunqing@shu.edu.cn.
This paper was presented in part at IEEE Asia Pacific Conference on
Circuits and Systems (APCCAS), Jeju, Korea, 2016, as a Best Paper Award
recipient. Chuan Zhang and Chao Yang contributed equally to this work.
(Corresponding author: Chuan Zhang.)
A. Challenges for Existing NOMA
Compared to orthogonal MAs, NOMA techniques allow multiple
users to overlap in the time, frequency, or code domain, in other
words, to share the same physical resources [9]. NOMA is able
to distinguish different users via successive interference
cancellation (SIC) [10] or multiple user decoding (MUD) [11].
Besides the very first version [12], the state-of-the-art (SOA)
NOMA includes multiuser shared access (MUSA) [13], pattern
division multiple access (PDMA) [14], sparse code multiple
access (SCMA) [15], etc. SIC was employed in [12–14] and faces
practical challenges:
• Computational complexity: SIC implies that each user
can be decoded only when all the prior users are properly
decoded. Therefore, its computational complexity scales
with the in-cell user number.
• Error propagation: For SIC, if an error occurs, all users
afterward are likely to be decoded incorrectly.
• Decoding latency: User power sorting is involved in SIC,
and causes considerable latency overhead compared to other
methods. Since the data with the lowest power is decoded
last, its latency will be even higher.
Therefore, SCMA employs MUD instead of SIC. Thanks to
its sparsity, message passing algorithm (MPA) can be applied
for better decoding performance.
B. Sparse Code Multiple Access
SCMA was proposed in 2013 [15], trying to increase user
scale via a new perspective: enabling more efficient multiple
access by non-orthogonal sparse spreading codes of users.
1) Properties of SCMA: As a promising MA, SCMA
has the properties: i) multiplexing in frequency domain; ii)
codebook based on both mapping and spreading; iii) multi-
dimensional constellation for shaping gain and spectral effi-
ciency; iv) non-orthogonality ensuring more accessed users;
v) spreading which reduces noise interference and enhances
system robustness; and vi) sparsity which reduces decoding
complexity. Thanks to these properties, SCMA is more phys-
ically realizable and overloading tolerant, compared to other
MAs [16]. Details of SCMA can be found in Section II.
arXiv:1804.00180v1 [cs.IT] 31 Mar 2018
2) Challenges of SCMA:
• Throughput: Though the throughput of SCMA outperforms
that of other MAs, especially orthogonal ones, it is hard to
achieve the eMBB peak rate with acceptable complexity.
Admittedly, such throughput can be achieved with a
larger overloading factor, but this leads to prohibitive
hardware complexity and performance degradation.
• Latency: On one hand, utilizing MUD, SCMA avoids the
sorting latency required by SIC. On the other hand, for
imperfect channels the iterative MPA tends to cost more
iterations, which will counteract its latency advantage.
• Implementation: First, though VLSI techniques ensure that
complexity is no longer a bottleneck for SCMA
implementation when the overloading factor is not extremely
large, existing iterative algorithms are not hardware
friendly. Second, the noise power density N0 results
in a large data range, leading to an unbearable quantization
length, or otherwise poor error performance.
C. Relevant Prior Art
Regarding SCMA decoding, the existing literature mainly focuses
on three aspects: i) stochastic computing, ii) tree structure
approximation, and iii) efficient hardware implementation.
1) Stochastic Computing: In [17], a stochastic MPA
(SMPA) decoder was proposed, where beliefs are given by
weights of bit streams. Multiplication and addition are
implemented by AND and MUX, respectively. Though this work
effectively reduces the complexity per iteration, its problems are:
• Accuracy: Stochastic computing suffers from low accuracy,
due to randomness loss. Beliefs in MPA usually
require a precision of $10^{-5}$, which length-limited bit
streams cannot provide. Performance degradation is observed.
• Latency: For SMPA, the calculation of a single value requires
a large number ($10^5$ to $10^6$) of bit-level operations.
The considerable number of iterations makes the latency even
larger and unsuitable for practice.
• Complexity: Though SMPA helps to reduce the hardware of
a single operation, the number of bit operations in one
decoding is around $10^7$. Thus, the total complexity may
be even larger than that of deterministic MPA (DMPA).
A VLSI architecture of SMPA was discussed in [18]. The
throughput for a 6-user decoder is 57 Mbps and far from 3GPP
requirements. Though the hardware cost is low, the latency is
not suitable for eMBB.
2) Tree Structure Approximation: In [19], a pruned tree
approximation was proposed. The decoder accurately represents
values with high probabilities, whereas it approximates
those with low probabilities [20]. Squares are replaced by
additions, multiplications, and comparisons. Though the complexity
is expected to decrease, the search breadth must be larger than 2
to maintain performance, which increases the complexity again.
3) Efficient Hardware Architecture: In [21], a stage-level
folded architecture for DMPA was proposed with considera-
tion of both speed and efficiency, which is our prior work.
However, only theoretical analysis and simple architecture
were given. The real VLSI implementation is missing.
D. Contributions
This paper focuses on iteration reduction, convergence
speedup, computation simplification, and implementation of
the SCMA decoder. Compared to SOA, our contributions are:
• We propose an early termination scheme based on the
convergence behavior of DMPA, which significantly reduces
the required iteration number.
• We propose an adaptive decoder, which adjusts beliefs
according to the variation trend, accelerates the convergence,
and compensates for the performance loss. Results
show that it outperforms the ones in [17, 18] in terms
of latency and throughput, satisfying the 3GPP requirements.
• We perform numerical analysis for conditional probability
approximation (over 60% of the computation in MPA is for
conditional probabilities) in Initialization. The approximation
is square-free and division-free, and suffers from little
performance loss. Both computational complexity and hardware
implementation benefit greatly.
• We propose a distributed matrix scheme for prior noise
reduction of the DMPA decoder, which compensates for the
approximation loss with negligible extra complexity.
• We improve our stage-level folded decoder with the
proposed algorithms, achieving higher hardware efficiency
while meeting the eMBB requirements on throughput and latency.
• We implement the proposed DMPA decoder on a Xilinx
Virtex-7 XC7VX690T FPGA to demonstrate its advantages
for real applications.
E. Notations
Lowercase and uppercase boldface letters designate column
vectors and matrices, respectively. Matrix $\mathbf{A}$’s transpose and
conjugate are $\mathbf{A}^T$ and $\mathbf{A}^H$. The $M \times M$ identity
matrix is $\mathbf{I}_M$ and the $M \times N$ all-zeros matrix is
$\mathbf{0}_{M \times N}$. Sets are denoted by uppercase calligraphic
letters $\mathcal{A}$, with cardinality $|\mathcal{A}|$.
F. Paper Outline
The remainder of this paper is organized as follows. Section
II reviews the preliminaries of SCMA. DMPA and its opti-
mized versions are discussed in Section III. Numerical results
and analysis are given in Section IV. Hardware architecture
is described in Section V. VLSI implementation is given in
Section VI. Section VII concludes the entire paper.
II. PRELIMINARIES
Preliminaries of SCMA are given in this section. A 6-user
system in Fig. 1 is used as a running example.
A. SCMA Encoder
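Suppose the codeword set, constellation set, and information
set are $\mathcal{X}$, $\mathcal{C}$, and $\mathcal{B}$, respectively.
Define $\mathbf{x} \in \mathcal{X}$, $\mathbf{c} \in \mathcal{C}$,
and $\mathbf{b} \in \mathcal{B}$. $|\mathcal{B}| = M$, $|\mathcal{X}| = K$,
and $|\mathcal{C}| = N$. The SCMA
encoding is given by two rounds of mapping [15]. The first
round of mapping is:
$$ g : \mathcal{B} \to \mathcal{C}, \quad \mathbf{c} = g(\mathbf{b}), \tag{1} $$
where $\mathcal{B} \subset \mathbb{B}^{\log_2 M}$, $\mathcal{C} \subset \mathbb{C}^N$,
and $g$ is a constellation mapping
function. The second round of mapping is:
$$ \mathbf{V} : \mathcal{C} \to \mathcal{X}, \quad \mathbf{x} = \mathbf{V}\mathbf{c}, \tag{2} $$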
C. ZHANG et al.: EFFICIENT SPARSE CODE MULTIPLE ACCESS DECODER BASED ON DETERMINISTIC MESSAGE PASSING ALGORITHM 3
[Figure: Users 1 to 6 each pass their bits b1, ..., b6 through Channel Coding and an SCMA Encoder with its own codebook (Codebook 1 to Codebook 6, entries indexed 00, 01, 10, 11); the encoded bits x1 to x4 occupy resource elements RE1 to RE4; the Receiver consists of Resource Nodes (RN) and Layer Nodes (LN) followed by Channel Decoding for each user.]
Fig. 1. A 6-user SCMA system.
where $\mathcal{X} \subset \mathbb{C}^K$, and $\mathbf{V} \in \mathbb{B}^{K \times N}$
is the mapping matrix.
Suppose the entire mapping function of SCMA encoding is
$f$. Then we have
$$ f : \mathcal{B} \to \mathcal{X}, \quad \mathbf{x} = f(\mathbf{b}), \quad f = \mathbf{V}g. \tag{3} $$
An $M$-size SCMA codebook consisting of $K$ complex values
is constructed. Note that $\mathbf{V}$ contains $(K - N)$ all-zero rows.
The mapping matrix is generated by inserting $(K - N)$ all-zero
rows into an $N \times N$ identity matrix $\mathbf{I}_N$ randomly. So when the
SCMA system is regular, it supports $C_K^N = C_K^{K-N}$ different
layers (users).
B. SCMA Multiplexing
Consider a $K$-dimensional SCMA encoder with $J$ separated
layers. Each layer is defined by $(\mathbf{V}_j, g_j, M_j, N_j)$, where
$j = 1, ..., J$. If $i \neq j$, then $\mathbf{V}_i \neq \mathbf{V}_j$ and
$g_i \neq g_j$, in order
to distinguish one layer from another. In general, $M_j$ and $N_j$
can be either the same or different for different layers. Without
loss of generality, for $\forall j$ we set $M_j = M$, $N_j = N$.
We call this SCMA system semi-regular because $J$ is
not necessarily $C_K^N$ (the regular system will be discussed
later). The SCMA codewords are multiplexed over $K$ shared
orthogonal resources, e.g. OFDMA tones or MIMO spatial
layers [16]. With this semi-regular system, the received signal
after synchronous layer multiplexing can be expressed as
$$ \mathbf{y} = \sum_{j=1}^{J} \mathrm{diag}(\mathbf{h}_j)\mathbf{x}_j + \mathbf{n}, \tag{4} $$
where $\mathbf{h}_j$ and $\mathbf{x}_j$ are the $K$-dimensional channel vector and
SCMA codeword of layer $j$. If the signals of all layers
come from the same transmit point, then for a specific receiver the
channel vectors of all layers are identical, i.e., $\mathbf{h}_j = \mathbf{h}$ for $\forall j$.
Now Eq. (4) reduces to
$$ \mathbf{y} = \mathrm{diag}(\mathbf{h}) \sum_{j=1}^{J} \mathbf{x}_j + \mathbf{n}. \tag{5} $$
Define the overloading factor as $\lambda = J/K$, which indicates the
overloading tolerance or access ability of an SCMA system.
Fig. 2 illustrates a 6-user SCMA multiplexing.
[Figure: Codebook 1 through Codebook 6.]
Fig. 2. SCMA multiplexing example.
C. Factor Graph Representation
Define the binary indicator vector as $\mathbf{f}_j = \mathrm{diag}(\mathbf{V}_j\mathbf{V}_j^T)$.
Then the factor graph matrix is $\mathbf{F} = (\mathbf{f}_1, ..., \mathbf{f}_J)$, and the
factor graph representation can be obtained in the same way as
for LDPC codes. Each column of $\mathbf{F}$ is associated with a layer node,
and each row with a resource node. The degree of each resource node is
defined as $\mathbf{d}_f = (d_{f1}, ..., d_{fK})^T = \sum_{j=1}^{J} \mathbf{f}_j$. For more details,
please refer to [15].
Take $K = 4$ and $N = 2$ as an example. The factor graph is
in Fig. 3 and $J = C_4^2 = 6$. The degree is
$\mathbf{d}_f = (d_{f1}, ..., d_{fK})^T = (3, 3, 3, 3)^T$, since each of the $K = 4$
rows of $\mathbf{F}$ contains three ones, and the overloading factor is $\lambda = J/K = 1.5$.
The $4 \times 6$ factor graph matrix of this system is in Eq. (6).
[Figure: bipartite factor graph with layer nodes L1 to L6 and resource nodes R1 to R4.]
Fig. 3. Factor graph representation of an SCMA with K = 4 and J = 6.
$$ \mathbf{F} = \begin{bmatrix}
1 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 1 & 0 \\
0 & 1 & 0 & 1 & 0 & 1 \\
0 & 0 & 1 & 0 & 1 & 1
\end{bmatrix} \tag{6} $$
III. OPTIMIZATIONS ON SCMA DECODING
A. Regular Form of SCMA
Regular SCMA refers to the absolute-regular form [22, 23],
where the number of layers $J$ equals $C_K^N$. In other words, it
employs all the available layers (users). Eq. (6) is an example
of the regular form. The definition is as follows.
Definition 1. SCMA with $K$ complex dimensions and weight
$N$ that satisfies the following requirements is called regular
(absolute-regular) SCMA.
Requirement 1: It has $J = C_K^N$ layers (users) in total.
Requirement 2: The columns of the factor graph matrix must be
listed in the sequential permutation order, with weight $\lambda$.
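As a minimal sketch of Definition 1, the regular factor graph matrix can be enumerated by listing every choice of N nonzero rows out of K. The helper name `regular_factor_graph` is hypothetical, and lexicographic order is assumed to be what "sequential permutation order" refers to; under that assumption the K = 4, N = 2 case reproduces Eq. (6).

```python
from itertools import combinations
from math import comb

import numpy as np


def regular_factor_graph(K: int, N: int) -> np.ndarray:
    """Return the K x J indicator matrix F with J = C(K, N) columns.

    Column j marks the N resources (rows) occupied by layer j.
    Assumption: columns are listed in lexicographic order.
    """
    supports = list(combinations(range(K), N))
    F = np.zeros((K, len(supports)), dtype=int)
    for j, rows in enumerate(supports):
        F[list(rows), j] = 1
    return F


F = regular_factor_graph(K=4, N=2)
J = F.shape[1]                 # number of layers (users)
d_f = F.sum(axis=1)            # degree of each resource node
overloading = J / F.shape[0]   # lambda = J / K
```

For K = 4 and N = 2 this gives six layers, a degree of 3 at every resource node, and an overloading factor of 1.5.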
B. DMPA Decoding
The DMPA decoding for SCMA mainly includes 4 steps.
1) Initialization: Calculate the conditional probability with
extrinsic information to prepare for belief propagation.
$$ P_k(y_k | x_{k,1}, x_{k,2}, x_{k,3}, N_0) = \exp\left(-\| y_k - (x_{k,1} + x_{k,2} + x_{k,3}) \|^2 / N_0\right), \tag{7} $$
where $y_k$ denotes the $k$-th element of the received signal $\mathbf{y}$.
$x_{k,1}$, $x_{k,2}$, and $x_{k,3}$ denote the overlapped codeword
elements of the 3 layers which are connected to the $k$-th resource
node, and $N_0$ is the noise power density.
2) Resource Node Updating: The updating formulation
of the resource node is in the sum-product form, which is an
approximation of the marginal probability:
$$ I_{R_k \to L_1}(m_1) = \sum_{m_2=1}^{M} \sum_{m_3=1}^{M} P_k \, I_{L_2 \to R_k}(m_2) \, I_{L_3 \to R_k}(m_3), \tag{8} $$
$$ I_{R_k \to L_2}(m_2) = \sum_{m_1=1}^{M} \sum_{m_3=1}^{M} P_k \, I_{L_1 \to R_k}(m_1) \, I_{L_3 \to R_k}(m_3), \tag{9} $$
$$ I_{R_k \to L_3}(m_3) = \sum_{m_1=1}^{M} \sum_{m_2=1}^{M} P_k \, I_{L_1 \to R_k}(m_1) \, I_{L_2 \to R_k}(m_2), \tag{10} $$
where $R_k$ is the $k$-th resource node and $m_{1,2,3} = 1, ..., M$ are
the transmitted symbols. $I_{R_k \to L_{1,2,3}}$ denotes the belief propagated
from the $k$-th resource node to its neighboring layer nodes, and
$I_{L_{1,2,3} \to R_k}$ is the belief passing in the opposite direction.
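The resource node update of Eqs. (8) to (10) can be sketched for a single resource node of degree 3 as follows. This is only an illustration: the function names, the toy codebook entries `c1`, `c2`, `c3`, and the noise level are assumptions rather than values from the paper, and symbols are indexed from 0 instead of 1.

```python
import numpy as np


def conditional_prob(y_k, c1, c2, c3, N0):
    """Eq. (7) for all M^3 symbol combinations at one resource node.

    c1, c2, c3 are length-M complex arrays holding each layer's codeword
    entry on this resource (hypothetical values in this sketch).
    """
    s = c1[:, None, None] + c2[None, :, None] + c3[None, None, :]
    return np.exp(-np.abs(y_k - s) ** 2 / N0)


def resource_node_update(P, I_l1, I_l2, I_l3):
    """Eqs. (8)-(10): marginalize P over the other two layers."""
    to_l1 = np.einsum("abc,b,c->a", P, I_l2, I_l3)  # Eq. (8)
    to_l2 = np.einsum("abc,a,c->b", P, I_l1, I_l3)  # Eq. (9)
    to_l3 = np.einsum("abc,a,b->c", P, I_l1, I_l2)  # Eq. (10)
    return to_l1, to_l2, to_l3


M = 4
c1 = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])  # toy entries, not a real codebook
c2 = 0.5 * c1
c3 = 1j * c1
y_k = c1[0] + c2[1] + c3[2]                         # noiseless observation
P = conditional_prob(y_k, c1, c2, c3, N0=0.5)
uniform = np.full(M, 1.0 / M)                       # uninformative initial beliefs
to_l1, to_l2, to_l3 = resource_node_update(P, uniform, uniform, uniform)
```

Storing P as an M x M x M array keeps the three updates symmetric; each one is a contraction over the two axes belonging to the other layers.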
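3) Layer Node Updating: The normalization ensures that each
belief falls in [0, 1].
$$ I_{L_j \to R_1}(m) = \mathrm{normalize}(I_{R_2 \to L_j}(m)), \tag{11} $$
$$ I_{L_j \to R_2}(m) = \mathrm{normalize}(I_{R_1 \to L_j}(m)), \tag{12} $$
where $m = 1, ..., M$ indexes the symbols, and $R_1$ and $R_2$ denote
the two resource nodes connected to layer node $L_j$.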
4) Probability Calculating and Symbol Judging: After the
iterations, the final probability of each symbol is
$$ Q_{L_j}(m) = I_{R_1 \to L_j}(m) \cdot I_{R_2 \to L_j}(m), \tag{13} $$
where $L_j$ denotes the $j$-th layer. The symbol with the highest
probability becomes the estimated symbol $\hat{l}$ for each layer.
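Steps 3) and 4) can be sketched for one layer node with two neighboring resource nodes as below. The function names and the numerical beliefs are illustrative assumptions, normalization is assumed to mean division by the sum, and symbols are indexed from 0.

```python
import numpy as np


def normalize(v):
    """Scale a belief vector so that its entries sum to one."""
    v = np.asarray(v, dtype=float)
    return v / v.sum()


def layer_node_update(from_r1, from_r2):
    """Eqs. (11)-(12): forward to each resource node the normalized
    belief received from the other resource node."""
    to_r1 = normalize(from_r2)
    to_r2 = normalize(from_r1)
    return to_r1, to_r2


def decide(from_r1, from_r2):
    """Eq. (13): combine both incoming beliefs and pick the best symbol."""
    q = np.asarray(from_r1) * np.asarray(from_r2)
    return int(np.argmax(q)), q


from_r1 = np.array([0.1, 0.6, 0.2, 0.1])  # illustrative beliefs only
from_r2 = np.array([0.4, 1.0, 0.4, 0.2])
to_r1, to_r2 = layer_node_update(from_r1, from_r2)
symbol, q = decide(from_r1, from_r2)
```

With these example beliefs both resource nodes favor the symbol at index 1, so the product in Eq. (13) selects it.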
C. Max-Log Algorithm
The decoder in the probability domain suffers from huge
complexity and relatively high latency. Therefore, its Max-Log version
is considered [24] with the Jacobi’s logarithm formula [25]:
$$ \log\left( \sum_{i=1}^{N} \exp(f_i) \right) \approx \max_{i=1,...,N} \{ f_1, f_2, ..., f_N \}. \tag{14} $$
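The quality of the approximation in Eq. (14) can be checked numerically with a short sketch. The helper names and the sample values are assumptions for illustration; the only general fact relied on is that the exact log-sum-exp exceeds the maximum by at most log N.

```python
import numpy as np


def log_sum_exp(f):
    """Exact left-hand side of Eq. (14), computed in a numerically stable way."""
    f = np.asarray(f, dtype=float)
    m = f.max()
    return float(m + np.log(np.exp(f - m).sum()))


def max_log(f):
    """Right-hand side of Eq. (14): keep only the dominant term."""
    return float(np.max(f))


f = np.array([-0.5, -4.0, -9.0])   # illustrative log-domain metrics
exact = log_sum_exp(f)
approx = max_log(f)
gap = exact - approx               # lies between 0 and log(len(f))
```

The gap is small when one term dominates and reaches log N only when all terms are equal.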
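Updating steps now become:
1) Initialization:
$$ P^{\log}_k(y_k | x_{k,1}, x_{k,2}, x_{k,3}, N_0) = -\frac{1}{N_0} \| y_k - (x_{k,1} + x_{k,2} + x_{k,3}) \|^2, \tag{15} $$
2) Resource Node Updating:
$$ I^{\log}_{R_k \to L_1}(m_1) = \max \left\{ P^{\log}_k + I^{\log}_{L_2 \to R_k}(m_2) + I^{\log}_{L_3 \to R_k}(m_3) \right\}, \tag{16} $$
$$ I^{\log}_{R_k \to L_2}(m_2) = \max \left\{ P^{\log}_k + I^{\log}_{L_1 \to R_k}(m_1) + I^{\log}_{L_3 \to R_k}(m_3) \right\}, \tag{17} $$
$$ I^{\log}_{R_k \to L_3}(m_3) = \max \left\{ P^{\log}_k + I^{\log}_{L_1 \to R_k}(m_1) + I^{\log}_{L_2 \to R_k}(m_2) \right\}, \tag{18} $$
3) Layer Node Updating:
$$ I^{\log}_{L_j \to R_1}(m) = I^{\log}_{R_2 \to L_j}(m), \tag{19} $$
$$ I^{\log}_{L_j \to R_2}(m) = I^{\log}_{R_1 \to L_j}(m), \tag{20} $$
4) Probability Calculating and Symbol Judging:
$$ Q^{\log}_{L_j}(m) = I^{\log}_{R_1 \to L_j}(m) + I^{\log}_{R_2 \to L_j}(m). \tag{21} $$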
D. Early Termination
Early termination is based on the belief judgement for each
layer node and resource node [26]. Our judgement steps are:
1) Create a zero matrix to record the stability condition of
beliefs, which denotes that all the beliefs are unstable.
2) Judge the stability of all beliefs per iteration. If
$|(V - V_{temp}) / V_{temp}| \le \epsilon$ ($\epsilon > 0$), the belief is stable, and the
corresponding value in the matrix is set to “1”.
3) When the stability matrix becomes an all-ones matrix,
the beliefs of all layer nodes and resource nodes are stable,
and convergence is achieved. Then, the iterative
decoding terminates.
Here, $V_{temp}$ and $V$ are the belief values in the previous and
present iteration, respectively, and $\epsilon$ is a judgment constant. The
DMPA with early termination is shown in Alg. 1. The Max-Log
version is similar and omitted.
Algorithm 1 DMPA with Early Termination
Input: $\mathbf{y}$, $I_{max}$, and $\epsilon$
Iteration:
for t = 1 : $I_{max}$
    Set stability matrix S = 0
    Update beliefs V
    for j = 1 : N
        temp = $|(V_j^{(t)} - V_j^{(t-1)}) / V_j^{(t-1)}|$
        if temp ≤ $\epsilon$
            $S_j$ = 1
        end if
    end for
    if S = 1
        break
    end if
end for
Judgement_early:
Compute beliefs
Decide $\hat{u}$
Output: $\hat{u} = \{\hat{u}_1, \hat{u}_2, ..., \hat{u}_6\}$
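A minimal sketch of the stopping rule in Alg. 1 is given below. The belief update itself is abstracted into a callable `update`, and the toy contraction used in the example is a hypothetical stand-in for one DMPA iteration, not the decoder's real update; the names and parameter values are assumptions.

```python
import numpy as np


def iterate_with_early_termination(update, v0, max_iter, eps):
    """Repeat `update` until every belief changes by at most eps (relative).

    Returns the final beliefs and the number of iterations actually run.
    Assumes the previous beliefs are nonzero so the ratio is defined.
    """
    v_prev = np.asarray(v0, dtype=float)
    iters = 0
    for t in range(1, max_iter + 1):
        v = update(v_prev)
        stable = np.abs((v - v_prev) / v_prev) <= eps  # stability matrix S
        v_prev, iters = v, t
        if stable.all():                               # S is all ones
            break
    return v_prev, iters


# Toy stand-in for one decoding iteration: move halfway to a fixed point.
target = np.array([0.2, 0.8])
toy_update = lambda v: 0.5 * v + 0.5 * target

beliefs, iters = iterate_with_early_termination(
    toy_update, v0=[1.0, 1.0], max_iter=20, eps=0.05
)
```

In this toy run the loop stops well before the iteration cap, which is the behavior the early termination scheme aims for.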
E. Self-Adaption Algorithm
Self-adaption [27, 28] is also based on stability judgement.
Compared to the one in early termination, the judgement in
self-adaption requires an extra step between 2) and 3):
“Forecast and adjust the belief of the next iteration based on
the convergence trend. If $(V - V_{temp}) / V_{temp} \ge \epsilon$, $V \Leftarrow \alpha V$ with $\alpha > 1$,
since the convergence trend makes values larger. Otherwise,
if $(V - V_{temp}) / V_{temp} \le -\epsilon$, $V \Leftarrow \beta V$ with $\beta < 1$.”
The DMPA with self-adaption is shown in Alg. 2. The
Max-Log version is omitted.
Algorithm 2 DMPA with Self-Adaption
Input: $\mathbf{y}$, $I_{max}$, and $\epsilon$
Iteration:
for t = 1 : $I_{max}$
    Set stability matrix S = 0
    Update beliefs V
    for j = 1 : N
        temp = $(V_j^{(t)} - V_j^{(t-1)}) / V_j^{(t-1)}$
        if temp ≥ $\epsilon$
            $V_j^{(t)} \leftarrow \alpha \cdot V_j^{(t)}$
        elseif temp ≤ $-\epsilon$
            $V_j^{(t)} \leftarrow \beta \cdot V_j^{(t)}$
        else
            $S_j$ = 1
        end if
    end for
    if S = 1
        break
    end if
end for
Judgement_adapt:
Compute beliefs
Decide $\hat{u}$
Output: $\hat{u} = \{\hat{u}_1, \hat{u}_2, ..., \hat{u}_6\}$
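The inner loop of Alg. 2 can be sketched as a single vectorized adjustment step. The function name and the values chosen for ε, α, and β are assumptions for illustration; the source only requires α > 1 and β < 1.

```python
import numpy as np


def self_adapt(v, v_prev, eps, alpha, beta):
    """One self-adaption pass over a belief vector.

    Beliefs that grew by at least eps (relative) are scaled by alpha > 1,
    beliefs that shrank by at least eps are scaled by beta < 1, and the
    rest are marked stable. Assumes nonzero previous beliefs.
    """
    v = np.asarray(v, dtype=float)
    v_prev = np.asarray(v_prev, dtype=float)
    rel = (v - v_prev) / v_prev
    growing = rel >= eps
    shrinking = rel <= -eps
    adjusted = np.where(growing, alpha * v, np.where(shrinking, beta * v, v))
    stable = ~(growing | shrinking)
    return adjusted, stable


v_prev = np.array([0.2, 0.5, 0.3])
v_now = np.array([0.3, 0.4, 0.303])
adjusted, stable = self_adapt(v_now, v_prev, eps=0.05, alpha=1.1, beta=0.9)
```

Here the first belief is pushed further up, the second further down, and the third is left unchanged and flagged as stable.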
[Figure: at the transmitter, the multiplexed signals a, b, c, d are combined by the distributed matrix D = [0.1 0.4 0.3 0.2; 0.2 0.1 0.4 0.3; 0.3 0.2 0.1 0.4; 0.4 0.3 0.2 0.1] into A, B, C, D, which are transmitted on Subcarriers 1 to 4 over the wireless channel with noise (0, σ^2); the receiver applies the inverse matrix before soft decoding.]
Fig. 4. Procedure of initial noise reduction.
F. Initial Noise Reduction
The “distributed matrix” D is used to reduce random error, enhance
the accuracy of the initial value [29], and speed up the convergence.
For the SCMA system in Fig. 4, we have the overlapped
signals a, b, c, and d after multiplexing. The random error of
these signals can be either positive or negative, depending
on the environment noise. Therefore, we can regroup the signals
and assign them to 4 resource nodes. At the receiver, we
first recover the original signals according to the inverse of the
“distributed matrix” and then start the decoding. Compared to the
original transmitting scheme, each signal of a specific resource
node has a good chance of being added with both positive and
negative random noises, which increases the accuracy of the initial
value. Note that D is not constant and can be adjusted
according to the codebook and channel condition.
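The regrouping and recovery described above can be sketched with the matrix D shown in Fig. 4. The sketch only illustrates the noise-free round trip (mix at the transmitter, invert at the receiver); the signal values and function names are hypothetical, and the effect on noise is not evaluated here.

```python
import numpy as np

# Distributed matrix D as it appears in Fig. 4.
D = np.array([
    [0.1, 0.4, 0.3, 0.2],
    [0.2, 0.1, 0.4, 0.3],
    [0.3, 0.2, 0.1, 0.4],
    [0.4, 0.3, 0.2, 0.1],
])


def spread(signals):
    """Transmitter side: regroup the overlapped signals (a, b, c, d) into (A, B, C, D)."""
    return D @ signals


def recover(received):
    """Receiver side: undo the regrouping by solving D x = received."""
    return np.linalg.solve(D, received)


signals = np.array([1 + 1j, -1 + 0.5j, 0.5 - 1j, -0.5 - 0.5j])  # illustrative values
transmitted = spread(signals)
recovered = recover(transmitted)
```

Using `np.linalg.solve` instead of forming the explicit inverse is a numerical convenience; both correspond to applying the inverse of D at the receiver.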
G. Initial Probability Approximation
As discussed above, the calculation of the initial probability results
in high computational complexity, which is obvious in Max-Log
decoding. Thus, suitable approximations in Initialization
are expected to improve calculation efficiency and reduce
latency with little performance loss. For SCMA decoding,
the purpose of iterative updating is to find the symbol with
the largest probability. Hence, the absolute value is not that
critical to make a decision. We can still ensure the detection
correctness even with relative beliefs. The relative magnitude
is determined by the initial probability and the initial value
of different users in Initialization. Now, we carry out the
approximation in steps: i) simplify the initial probability
calculation by reducing operations with large complexity;
ii) adjust the initial value of different users according to
the relative magnitude determined by initial probabilities; iii)
update beliefs iteratively based on the relative values. The
precise initial probability in DMPA is:
$$ P_k(y_k | x_{k,1}, x_{k,2}, x_{k,3}, N_0) = \exp\left(-\| y_k - (x_{k,1} + x_{k,2} + x_{k,3}) \|^2 / N_0\right), \tag{22} $$
To avoid the square and the division, which are of higher complexity,
DMPA approximations 1 to 3 are proposed:
$$ P_k(y_k | x_{k,1}, x_{k,2}, x_{k,3}, N_0) = \exp\left(-\| y_k - (x_{k,1} + x_{k,2} + x_{k,3}) \| / N_0\right), \tag{23} $$
$$ P_k(y_k | x_{k,1}, x_{k,2}, x_{k,3}, N_0) = \exp\left(-\| y_k - (x_{k,1} + x_{k,2} + x_{k,3}) \|^2\right), \tag{24} $$
$$ P_k(y_k | x_{k,1}, x_{k,2}, x_{k,3}, N_0) = \exp\left(-\| y_k - (x_{k,1} + x_{k,2} + x_{k,3}) \|\right), \tag{25} $$
Similarly, we have Max-Log approximations 1 to 3 as follows.
$$ P^{\log}_k(y_k | x_{k,1}, x_{k,2}, x_{k,3}, N_0) = -\frac{1}{N_0} \| y_k - (x_{k,1} + x_{k,2} + x_{k,3}) \|, \tag{26} $$
$$ P^{\log}_k(y_k | x_{k,1}, x_{k,2}, x_{k,3}, N_0) = -\| y_k - (x_{k,1} + x_{k,2} + x_{k,3}) \|^2, \tag{27} $$
$$ P^{\log}_k(y_k | x_{k,1}, x_{k,2}, x_{k,3}, N_0) = -\| y_k - (x_{k,1} + x_{k,2} + x_{k,3}) \|, \tag{28} $$
The analysis below will show that these approximations have different
effects on error performance and computational complexity.
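The four Max-Log initial metrics, Eq. (15) and Eqs. (26) to (28), can be compared side by side with a small sketch. The function name, the variant labels, and the candidate values are assumptions for illustration. For a single resource node all four variants are monotone in the same distance, so they rank the candidates identically; the sketch does not show how they behave once metrics from several nodes are combined.

```python
import numpy as np


def max_log_metric(y_k, s, N0, variant="precise"):
    """Log-domain initial metric for candidate superposed symbols s.

    variant: "precise" (Eq. 15), "approx1" (Eq. 26),
             "approx2" (Eq. 27), "approx3" (Eq. 28).
    """
    d = np.abs(y_k - np.asarray(s))
    if variant == "precise":
        return -d ** 2 / N0
    if variant == "approx1":
        return -d / N0
    if variant == "approx2":
        return -d ** 2
    if variant == "approx3":
        return -d
    raise ValueError(f"unknown variant: {variant}")


y_k = 0.9 + 0.2j
candidates = np.array([1 + 0j, -1 + 0j, 0 + 1j, 0 - 1j])  # illustrative sums
best = {
    v: int(np.argmax(max_log_metric(y_k, candidates, N0=0.5, variant=v)))
    for v in ("precise", "approx1", "approx2", "approx3")
}
```

In this example every variant selects the candidate closest to the observation.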
IV. RESULTS AND ANALYSIS
A. Error-Rate Performance
The 6-user SCMA system is simulated. Additive white
Gaussian noise (AWGN) is assumed. The maximum iteration
number is 5. Results are given in Fig. 5.
Fig. 5(a) shows the BLER performance of DMPA algorithm
with different approximations, different iterations, early ter-
mination, self-adaption, and initial noise reduction. Fig. 5(b)
shows the curves of Max-Log algorithm. According to Fig. 5,
we see
1) DMPA/Max-Log with more iterations enjoys better
performance, but the improvement is limited when the iteration
number is sufficiently large. As shown by the numerical results,
DMPA/Max-Log with 3 iterations is a good choice in
real implementation.
2) The average iteration number of the early termination or
adaptive scheme is around 3, but the performance is
similar to that of DMPA with 5 iterations. Results with different
parameters reveal that self-adaption performs better at
high SNR. Thus, the adjusting factor in self-adaption is
supposed to be smaller at higher SNR.
3) DMPA and Max-Log have similar performance without
approximation. However, since DMPA heavily depends
on N0, approximations without precise N0 will cause
unbearable performance loss. On the other hand, the Max-Log
algorithm is not sensitive to N0, and its approximations
without exact N0 can still achieve good performance.
Therefore, Max-Log is preferred.
[Figure: four panels (DMPA Precise, DMPA Approximation 1, DMPA Approximation 2, DMPA Approximation 3) plotting BLER versus SNR [dB]; curves: DMPA: Early, DMPA: Adaptive, DMPA: Iter1 to DMPA: Iter5.]
(a) Error-rate performance of DMPA with different approximation.
[Figure: four panels (Max-Log Precise, Max-Log Approximation 1, Max-Log Approximation 2, Max-Log Approximation 3) plotting BLER versus SNR [dB]; curves: Max-Log: Early, Max-Log: Adaptive, Max-Log: Iter1 to Max-Log: Iter5.]
(b) Error-rate performance of Max-Log with different approximation.
Fig. 5. Error-rate performance of SCMA with different detecting methods.
[Appendix figures, from a page headed “IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I, 2017”: Fig. 13. Data flow graph (DFG) of step-level architecture. Fig. 14. Data flow graph (DFG) of stage-level architecture. Fig. 15. Error-rate performance of DMPA with different approximation and detecting methods.]
We conclude that suitable configurations for hardware
implementation are: i) Max-Log approach; ii) 3 iterations; iii)
early termination and self-adaption; iv) Approximation 2 or 3;
and v) initial noise reduction.
B. Computational Complexity
Suppose the symbol set size for each user is M , the number
of physical resources is N , the user number is K, and the
maximum iteration number is I . Then, we summarize the
computational complexity of different decoding methods in
Table I. Compared with other methods, the proposed method
has the lowest computational complexity, while maintaining
the error performance. In fact, the proposed method is similar
to Max-Log, but has lower complexity in Initialization due
to the approximation. For a real system, M and N are
usually large, and the number of multiplications and divisions
makes other methods unsuitable for implementation.
However, as discussed above, the proposed algorithm is
multiplication/division-free with Approximation 3. Therefore,
it can substantially improve the computational efficiency and
reduce the latency, making it more applicable for hardware
implementation in Section VI. The VLSI implementation
results in Table IV will further verify the proposed decoder’s
hardware efficiency advantage over the SOA design.
C. Performance/Complexity Trade-Off Analysis
Fig. 6 illustrates the trade-off between error performance
and computational complexity of the proposed methods. The
minimum required SNR to achieve 1% BLER is employed as a
metric. The complexity is given by Timing (TM) complexity,
which is in terms of iteration number. Fig. 6 shows the trade-off
of DMPA with approximations. It is clear that Max-Log with
Approximation 3 provides the best performance/complexity
trade-off.
[Figure: Timing (TM) complexity versus minimum SNR [dB] to achieve 1% BLER; points: DMPA Precise, DMPA Approx 1, Max-Log Precise, Max-Log Approx 1, Max-Log Approx 2, Max-Log Approx 3.]
Fig. 6. Performance/complexity trade-off analysis of DMPA algorithm.
V. HARDWARE ARCHITECTURE
The hardware architecture of the Max-Log DMPA is discussed.
Timing optimization and the folding technique are introduced
for higher efficiency.
A. Overall Architecture
The overall architecture is shown in Fig. 7. It has 4 units
and 2 memory networks, which are the RN-to-LN and LN-to-RN
networks for $I_{R \to L}$ and $I_{L \to R}$, respectively. The elementary
units are the Initialization Unit, Resource Node Update Unit,
Layer Node Update Unit, and Probability Calculating Unit,
which execute the steps indicated by Eq. (15) to Eq. (21),
respectively. The iterative calculation is done by the Resource Node
Update Unit and the Layer Node Update Unit, neither of which can
start the current propagation until all previous data have
been calculated. We call this data updating interval a “step”.
Optimization details of this scheduling will be discussed
below.
TABLE I
COMPARISON OF COMPUTATIONAL COMPLEXITY FOR DIFFERENT DECODING ALGORITHMS
Procedure | Operation | This work | DMPA [21] [APCCAS ’17] | Max-Log [30] [China Comm. Dec. ’15] | Pruned DMPA [31] [DSP ’16]
Initial probability calculation | ADD | $2M^3N/T_{adp}$ | $3M^3N/T_{MPA}$ | $3M^3N/T_{Max-Log}$ | $3M^3N/T_{tree}$
Initial probability calculation | MUL | 0 | $3M^3N/T_{MPA}$ | $3M^3N/T_{Max-Log}$ | $3M^3N/T_{tree}$
Initial probability calculation | EXP | 0 | $M^3N/T_{MPA}$ | 0 | $M^3N/T_{tree}$
Resource node updating | ADD | $2 \cdot 3M^3N$ | $3M^3N$ | $2 \cdot 3M^3N$ | $3M^3N$
Resource node updating | MUL | 0 | $2 \cdot 3M^3N$ | 0 | $2 \cdot 3M^3N$
Resource node updating | MAX | $3M^3N$ | 0 | $3M^3N$ | 0
Layer node updating | ADD | 0 | $2MK$ | 0 | $2MK$
Layer node updating | MUL | 0 | $2MK$ | 0 | $2MK$
Layer node updating | SWOP | $2MK$ | $2MK$ | $2MK$ | $2MK$
Users’ symbol judgement | ADD | $MK$ | 0 | $MK$ | 0
Users’ symbol judgement | MUL | 0 | $MK$ | 0 | $MK$
Users’ symbol judgement | MAX | 0 | $MK$ | 0 | $MK$
[Figure: blocks Initialization Unit (inputs y, H, N0), Resource Node Update Unit, RN-to-LN Network, LN-to-RN Network, Layer Node Update Unit, Probability Calculating Unit, and Output $\hat{l}$.]
Fig. 7. Overall architecture of DMPA.
B. Stage-Level Scheduling Optimization
[Figure: timeline of one iteration divided into Steps 1 to 4 and Stages 0 to 10, listing operands and operators (RESET, MUL, ACC, STO, SWOP, CMP, SLT, MAG, SUM, SYM) acting on $I_{L \to R}$, $I_{R \to L}$, $-1/N_0$, and $Q_L(m)$, with 4-parallel and 6-parallel groups and the final symbol decision after several iterations.]
Fig. 8. Stage-level scheduling.
The proposed stage-level scheduling is a finer-grained op-
timization over the step-level scheduling. With this stage-
level scheduling, it is convenient to insert deep pipelines to
achieve a higher throughput [32–34]. Compared with step-
level scheduling, updating of stage-level scheduling does not
have to wait for the completion of data computation from
the previous unit, which therefore avoids low hardware-
efficiency and long processing-latency. In sum, stage-level
scheduling enjoys faster processing speed and higher hardware
efficiency than the step-level one. Fig. 8 shows the stage-level
scheduling. It details each computing step to achieve a deeper
pipelined structure.
C. Folding
The architecture of stage-level DMPA turns out to be very
complicated in the form of a data flow graph (DFG). To achieve an
efficient architecture, the folding technique is employed for further
optimization. Since a folding operation based on a fine-grained
architecture is difficult to carry out, a folding scheme
based on units is considered. Figs. 13 and 14 in the appendix show
the DFGs of the entire step- and stage-level architectures,
respectively. Due
to the page constraint, we only take a branch of the Initialization
Unit, which is fully parallel in the DFG, as an example to show
the proposed folding details. The folding transform of other units can
be conducted in a similar fashion. The DFG of the branch
in the Initialization Unit is shown in Fig. 9.
[Figure 9 DFG: nodes 1–12 connected through delay elements D; inputs x_{1,real}, x_{2,real}, x_{3,real} on the real branch and x_{1,imag}, x_{2,imag}, x_{3,imag} on the imaginary branch; scaling constant −1/N_0.]
Fig. 9. Original hardware of Step 1 before folding.
The folding includes 3 steps: i) construct the folding sets
and folding equations, ii) analyze the life span, and iii) allocate
registers. More details of this method are given in [35].
1) Folding Sets and Folding Equations: Setting the folding
factor to 7, we obtain the following folding sets:
\[
\begin{aligned}
S_{in} &= \{1, 2, \phi, \phi, \phi, \phi, \phi\}, \\
S_A &= \{3, 4, 5, 6, 7, 8, 9\}, \\
S_M &= \{10, 11, 12, \phi, \phi, \phi, \phi\},
\end{aligned} \tag{29}
\]
where S_in, S_A, and S_M denote the folding sets for inputs,
adders, and multipliers, respectively, and φ denotes a null operation.
Then, folding equations can be derived based on the given
folding sets
\[
\begin{aligned}
&D_F(1 \to 3) = 0, && D_F(3 \to 4) = 7, && D_F(4 \to 5) = 7, \\
&D_F(2 \to 7) = 3, && D_F(7 \to 8) = 7, && D_F(8 \to 9) = 7, \\
&D_F(5 \to 10) = 4, && D_F(10 \to 6) = 8, && D_F(6 \to 11) = 4, \\
&D_F(9 \to 12) = 2, && D_F(12 \to 6) = 6,
\end{aligned} \tag{30}
\]
where D_F(x → y) denotes the number of delays on the path
from x to y.
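As a cross-check, the folding equations can be reproduced with the standard folding rule of [35], D_F(U → V) = N·w(e) − P_U + v − u, where u and v are the folding orders of the source and destination nodes. The sketch below is only illustrative: the edge delays w(e) and the pipeline depths (0 for inputs, 1 for adders, 2 for multipliers) are not stated in the text and are assumptions chosen because they reproduce (30); the names `FOLDING_SETS`, `PIPELINE`, `EDGES`, `locate`, and `folding_delay` are hypothetical.

```python
# Folding factor and folding sets from (29); None stands for the null operation.
N = 7
FOLDING_SETS = {
    "in": [1, 2, None, None, None, None, None],
    "A": [3, 4, 5, 6, 7, 8, 9],
    "M": [10, 11, 12, None, None, None, None],
}

# ASSUMPTION: pipeline depth of each unit type (not stated in the text).
PIPELINE = {"in": 0, "A": 1, "M": 2}

# ASSUMPTION: number of delays w(e) on each DFG edge (inferred, not stated).
EDGES = {
    (1, 3): 0, (3, 4): 1, (4, 5): 1,
    (2, 7): 0, (7, 8): 1, (8, 9): 1,
    (5, 10): 1, (10, 6): 1, (6, 11): 1,
    (9, 12): 1, (12, 6): 1,
}


def locate(node):
    """Return (unit type, folding order) of a DFG node."""
    for unit, members in FOLDING_SETS.items():
        if node in members:
            return unit, members.index(node)
    raise KeyError(node)


def folding_delay(src, dst, w):
    """Folding rule: D_F(src -> dst) = N*w - P_src + v - u."""
    unit, u = locate(src)
    _, v = locate(dst)
    return N * w - PIPELINE[unit] + v - u


# Folded delay of every edge, keyed by (source, destination).
folded = {edge: folding_delay(*edge, w) for edge, w in EDGES.items()}
```

Under these assumptions every entry of `folded` agrees with (30); a different choice of pipeline depths would need different edge delays to give the same result.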














[Figure 10 axes: Cycle, Activated number.]
Fig. 10. Life time figure.
2) Life Time Analysis: The life span analysis is presented as
a life time chart in Fig. 10, which is derived
from the folding equations. Each thick line in the chart represents
the survival time of one data item, and the activated number gives the number
of data items in use in each cycle [36]. According to Fig. 10,
this folding architecture requires at least 8 registers.
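The register count can be checked numerically from the folding equations. The sketch below follows the usual life time rule of [35]: a node with folding order u and pipeline depth P produces its result at cycle u + P, and the result stays alive until the longest folded delay leaving that node has elapsed. The pipeline depths (0 for inputs, 1 for adders, 2 for multipliers) are an assumption not stated in the text, and the names `FOLDED`, `ORDER`, `DEPTH`, `lifetimes`, and `min_registers` are hypothetical.

```python
N = 7  # folding factor

# Folding equations (30): (source, destination) -> D_F.
FOLDED = {
    (1, 3): 0, (3, 4): 7, (4, 5): 7,
    (2, 7): 3, (7, 8): 7, (8, 9): 7,
    (5, 10): 4, (10, 6): 8, (6, 11): 4,
    (9, 12): 2, (12, 6): 6,
}

# Folding order u of each source node, read from the folding sets (29).
ORDER = {1: 0, 2: 1, 3: 0, 4: 1, 5: 2, 6: 3, 7: 4, 8: 5, 9: 6, 10: 0, 12: 2}

# ASSUMPTION: pipeline depths (inputs 0, adders 1, multipliers 2).
DEPTH = {node: 1 for node in range(3, 10)}
DEPTH.update({1: 0, 2: 0, 10: 2, 12: 2})


def lifetimes():
    """Return {node: (t_in, t_out)} for every node whose output needs storage."""
    spans = {}
    for node, u in ORDER.items():
        longest = max(d for (src, _), d in FOLDED.items() if src == node)
        t_in = u + DEPTH[node]
        if longest > 0:  # a zero-delay output is consumed immediately
            spans[node] = (t_in, t_in + longest)
    return spans


def min_registers(spans):
    """Peak number of live variables over one period of N cycles."""
    live = [0] * N
    for t_in, t_out in spans.values():
        # A variable is live after the cycle it is produced, up to its last use.
        for cycle in range(t_in + 1, t_out + 1):
            live[cycle % N] += 1
    return max(live)
```

With these assumptions, `min_registers(lifetimes())` evaluates to 8, in agreement with the register count read from Fig. 10.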
3) Register Allocation: The forward-backward scheme of
register allocation is employed based on life span analysis [37].
The specific allocation process is displayed in Fig. 11.
Cycle  Input     Registers R1–R8 (occupied cells only, left to right)  Output
0
1      n2, n3
2      n4, n10   n3 n2
3      n5        n3 n2 n4 n10
4      n6, n12   n5 n3 n2 n4 n10                                       n2
5      n7        n12 n5 n6 n3 n4 n10
6      n8        n7 n12 n5 n6 n3 n4 n10
7      n9        n8 n7 n12 n5 n6 n3 n10 n4                             n5
8                n9 n8 n7 n12 n4 n6 n3 n10                             n3, n6
9                n9 n8 n7 n12 n4 n10                                   n4, n9
10               n8 n7 n12 n10                                         n10, n12
11               n8 n7
12               n8 n7                                                 n7
13               n8                                                    n8
Fig. 11. Register allocation table of folding architecture.
After all the steps, we can finally obtain the folded archi-
tecture of the branch in Initialization Unit.
D. Hardware Architecture and Loop Analysis
The final stage-level folded architecture of DMPA is
illustrated at the module level in Fig. 12. Its main advantages
are a lower hardware cost and a reasonable processing speed (see Table II).
The loop bound analysis [38, 39] of this folded architecture
is also given here. Suppose the processing times of an adder, a
comparator, and a swopper are T_A, T_C, and T_S, respectively.
The resulting loop bounds are listed in Table III.
[Figure 12 block labels: input y, MAG, MEM, CMP, SWOP, SLT, output l̂, and delay elements D; Loops 1–4 are marked.]
Fig. 12. Hardware architecture of DMPA.
TABLE II
COST OF DIFFERENT ARCHITECTURES FOR “J = 6”
                              Hardware cost (main units)
Architecture         Cycles   Adders (Comparators)   Multipliers
Original             80       52                     12
Stage-level folded   300      4                      2
TABLE III
LOOP BOUND ANALYSIS
Loop   ADD   CMP   SWOP   Delay   Loop bound
1      1     1     0      3       (T_A + T_C)/3
2      2     1     0      4       (2T_A + T_C)/4
3      1     1     1      4       (T_A + T_C + T_S)/4
4      2     1     1      5       (2T_A + T_C + T_S)/5
Thus, the iteration bound is calculated as follows:
\[
T_\infty = \max\left\{ \frac{T_A + T_C}{3},\ \frac{2T_A + T_C}{4},\ \frac{T_A + T_C + T_S}{4},\ \frac{2T_A + T_C + T_S}{5} \right\}. \tag{31}
\]
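To make the iteration bound concrete, the following sketch evaluates the four loop bounds of Table III and takes their maximum. The processing times passed in are hypothetical integer time units, not measured values from the implementation, and the names `LOOPS`, `loop_bound`, and `iteration_bound` are illustrative.

```python
from fractions import Fraction

# Loops of Table III: loop index -> (adders, comparators, swoppers, delays).
LOOPS = {
    1: (1, 1, 0, 3),
    2: (2, 1, 0, 4),
    3: (1, 1, 1, 4),
    4: (2, 1, 1, 5),
}


def loop_bound(loop, t_a, t_c, t_s):
    """Loop computation time divided by the number of delays in the loop."""
    adds, cmps, swops, delays = LOOPS[loop]
    return Fraction(adds * t_a + cmps * t_c + swops * t_s, delays)


def iteration_bound(t_a, t_c, t_s):
    """Largest loop bound over all loops, i.e. the bound in (31)."""
    return max(loop_bound(k, t_a, t_c, t_s) for k in LOOPS)
```

For example, with the hypothetical values T_A = 2, T_C = 1, and T_S = 1, the four loop bounds are 1, 5/4, 1, and 6/5, so Loop 2 is critical and `iteration_bound(2, 1, 1)` returns `Fraction(5, 4)`. Which loop dominates depends on the actual unit delays.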
VI. VLSI IMPLEMENTATION
The proposed decoder’s VLSI implementation is given and
compared with two SOA baselines. The first is the DMPA
decoder [21], and the second is the SMPA decoder [18]. As neither
baseline considers folding, the proposed decoder is also
implemented without folding for a fair comparison. If all designs were folded, the
proposed decoder’s advantages would remain. As discussed previously,
the proposed decoder is based on: i) the Max-Log approach; ii)
early termination and self-adaption; iii) Approximation 3; and
iv) initial noise reduction. Since the SMPA decoder employs
5 iterations, 1 to 5 iterations are considered, though 3 iterations turn
out to be efficient per our analysis. Both the proposed decoder
and the DMPA decoder are implemented on a Xilinx Virtex-7
XC7VX690T FPGA. The results of the SMPA decoder are taken
from [18], since it is implemented as an ASIC. The frequency
is 500 MHz. The input quantization is 8-bit for both real and
imaginary parts, and the intermediate quantization is 16-bit.
A. Module Details of Proposed Decoder
The proposed decoder consists of four basic parts as shown
in Fig. 7: initialization module, layer node updating network,
resource node updating network, and symbol judging module.
The design details are presented as follows.
1) Initialization Module: It calculates the initial belief of each
user from the received signal and the inner codebook. The received
signal is made up of 4 complex resource nodes, so the input
is 8-parallel, and each input has a quantization length of 8 bits. Note
that the output belief has a quantization length of 16 bits
due to multiplication. The codebook is stored in memories,
which costs 96 memory blocks of 8-bit length each.
2) Resource Node Updating Network: It calculates the sum
of belief and outputs the largest, based on the approximated
Jacobi’s formula. It is made up of resource node updating
units, where the input data are initial beliefs and layer node
beliefs, and the output data are the 4 resource node beliefs.
The largest value is selected from 16 intermediate beliefs, in
3 steps of comparison with 14 buffers. Thus, 56 buffers are
required by each unit, and 672 by the entire network.
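The reason a comparison tree can replace the summation here is the Max-Log simplification of the Jacobian logarithm: in the log domain, log(Σ exp(v_i)) is approximated by max(v_i). The sketch below contrasts the exact and approximated forms. It assumes the approximation meant in the text is this usual Max-Log form; the function names and sample values are purely illustrative and not taken from the implementation.

```python
import math


def jacobian_log(values):
    """Exact log(sum(exp(v))) evaluated in a numerically stable way."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def max_log(values):
    """Max-Log approximation: keep only the largest log-domain term."""
    return max(values)


# Hypothetical log-domain beliefs (illustrative values only).
beliefs = [-1.0, -4.0, -9.0, -2.5]
exact = jacobian_log(beliefs)
approx = max_log(beliefs)
```

The approximation never exceeds the exact value and is off by at most log(n) for n terms, so `approx <= exact <= approx + math.log(len(beliefs))` holds; in hardware it turns exponentials and additions into comparisons only.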
3) Layer Node Updating Network: It is made up of layer
node updating units, which normalize the input values and
swop them according to the inner connection. In each unit, the input data
are resource node beliefs only, and the output data are the
corresponding 4 layer node beliefs. Four 16-bit dividers are
required per unit, with a delay of 28 clock cycles. Hence, the whole
network needs 48 dividers. In addition, the layer node beliefs are
reset at the start of each frame of the received signal
in the layer node updating network.
4) Symbol Judging Module: It finds the largest belief and
maps it to the original source code according to the codebook
of each user. This module consists of 6 smaller judging
units, each of which performs this function for one user. In each
unit, 4 beliefs are compared with each other, which requires 2 steps of
comparison and 3 buffers. The entire module therefore
needs 18 buffers.
The implementation comparison with the DMPA decoder is
listed in Table IV. It shows the proposed decoder’s advantages
in both complexity and throughput, thanks to the log-domain
processing and approximation approaches.
TABLE IV
FPGA RESULTS FOR DIFFERENT DECODERS WITH J/K = 6/4
SCMA decoders       DMPA decoder [21]   This work
LUTs                139,205 (36%)       82,909 (19%)
Registers           248,217 (28%)       109,997 (12%)
LUT-FF pairs        103,127 (36%)       52,203 (18%)
DSP48E1s            436 (12%)           436 (12%)
Maximum frequency   167.6 MHz           359.1 MHz
Since speed is the main focus of our design, the throughput
and latency are compared with the baselines
in Table V, where “L” denotes latency and “T” denotes throughput.
As the table shows, the proposed SCMA decoder
outperforms the SOA designs in both throughput and latency, and
also meets the multi-Gbps and millisecond requirements of
3GPP. Although the SMPA decoder has a complexity advantage, the
proposed decoder’s complexity can be further reduced with
folding techniques.
VII. CONCLUSION
In this paper, simplifications such as log-domain calculation
and probability approximation have been introduced to lower
the complexity of SCMA’s DMPA decoder. Early termination,
adaptive decoding, and initial noise reduction are also pro-
posed for faster convergence and better performance. Hard-
ware optimizations with folding and retiming are introduced.
VLSI implementation results have confirmed the advantages of
the proposed SCMA decoder for high-speed applications over
the SOA designs. Future research will be directed towards
further improvements on both algorithm and implementation.
REFERENCES
[1] T. B. Iliev, G. Y. Mihaylov, T. D. Bikov et al., “LTE eNB traffic
analysis and key techniques towards 5G mobile networks,” in Proc.
IEEE International Convention on Information and Communication
Technology, Electronics and Microelectronics (MIPRO), May 2017, pp.
497–500.
[2] M. Anwar, Y. Xia, and Y. Zhan, “TDMA-based IEEE 802.15.4 for low-
latency deterministic control applications,” IEEE Trans. Ind. Informat.,
vol. 12, no. 1, pp. 338–347, Feb. 2016.
[3] A. N. Akansu and M. V. Taztbay, “Orthogonal trans multiplexer: A
multiuser communications platform from FDMA to CDMA,” in Proc.
IEEE European Signal Processing Conference (EUSIPCO), Sep. 1996,
pp. 1–4.
[4] Q. Xue, Y. Li, L. Zhong et al., “Study on key techniques for 3G mobile
learning platform based on cloud service,” in Proc. IEEE International
Conference on Consumer Electronics, Communications and Networks
(CECNet), Apr. 2011, pp. 3588–3591.
[5] J. G. Andrews, S. Buzzi, W. Choi et al., “What will 5G be?” IEEE J.
Sel. Areas Commun., vol. 32, no. 6, pp. 1065–1082, Jun. 2014.
[6] 3GPP, “Study on scenarios and requirements for next generation access
technologies,” 3rd Generation Partnership Project (3GPP), vol. 32,
no. 6, pp. 1065–1082, Mar. 2016.
[7] J. Gozalvez, “Tentative 3GPP timeline for 5G [mobile radio],” IEEE
Veh. Technol. Mag., vol. 10, no. 3, pp. 12–18, Sep. 2015.
[8] W. Shin, M. Vaezi, B. Lee et al., “Non-orthogonal multiple access
in multi-cell networks: Theory, performance, and practical challenges,”
IEEE Commun. Mag., vol. 55, no. 10, pp. 176–183, Oct. 2017.
[9] K. S. Ali, H. Elsawy, A. Chaaban et al., “Non-orthogonal multiple access
for large-scale 5G networks: Interference aware design,” IEEE Access,
vol. 5, pp. 21 204–21 216, May 2017.
[10] Z. Dawy, A. Seeger, and M. Mecking, “Design methodologies and power
setting strategies for WCDMA serial interference cancellation receivers,”
in Proc. IEEE International Zurich Seminar on Communications (IZSC),
Oct. 2004, pp. 28–31.
[11] M. A. Pasha, M. Uppal, M. H. Ahmed et al., “Towards design and
automation of hardware-friendly NOMA receiver with iterative multi-
user detection,” in Proc. IEEE ACM/EDAC/IEEE Design Automation
Conference (DAC), Jun. 2017, pp. 1–6.
[12] T. Manglayev, R. C. Kizilirmak, Y. H. Kho et al., “NOMA with
imperfect SIC implementation,” in Proc. IEEE EUROCON International
Conference on Smart Technologies (ICST), Jul. 2017, pp. 22–25.
[13] F.-L. Luo and C. Zhang, Non-orthogonal Multi-User Superposition and
Shared Access. Wiley-IEEE Press, 2016, pp. 616–. [Online]. Available:
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7572709
[14] S. Chen, B. Ren, Q. Gao et al., “Pattern division multiple access
– a novel nonorthogonal multiple access for fifth-generation radio
networks,” IEEE Trans. Veh. Technol., vol. 66, no. 4, pp. 3185–3196,
Apr. 2017.
[15] H. Nikopour and H. Baligh, “Sparse code multiple access,” in Proc.
IEEE Annual International Symposium on Personal, Indoor, and Mobile
Radio Communications (PIMRC), Sep. 2013, pp. 332–336.
[16] H. Nikopour, E. Yi, A. Bayesteh et al., “SCMA for downlink multiple
access of 5G wireless networks,” in Proc. IEEE Global Communications
Conference (GLOBECOM), Dec. 2014, pp. 3940–3945.
[17] K. Han, J. Hu, J. Chen et al., “A low complexity SCMA detector
based on stochastic computation,” in Proc. IEEE International Midwest
Symposium on Circuits and Systems (MWSCAS), Aug. 2017, pp. 783–
786.
[18] ——, “A low complexity sparse code multiple access detector based on
stochastic computing,” IEEE Trans. Circuits Syst. I, vol. PP, no. 99, pp.
1–14, Oct. 2017.
[19] Z. Jia, Z. Hui, and L. Xing, “A low-complexity tree search based quasi-
ML receiver for SCMA system,” in Proc. IEEE International Conference
on Computer and Communications (ICCC), Oct. 2015, pp. 319–323.
[20] M. P. C. Fossorier, M. Mihaljevic, and H. Imai, “Reduced complexity
iterative decoding of low-density parity check codes based on belief
propagation,” IEEE Trans. Commun., vol. 47, no. 5, pp. 673–680, May
1999.
[21] C. Yang, C. Zhang, S. Zhang, and X. You, “Efficient hardware archi-
tecture of deterministic MPA decoder for SCMA,” in Proc. IEEE Asia
Pacific Conference on Circuits and Systems (APCCAS), Oct. 2016, pp.
293–296.
TABLE V
LATENCY (L) IN [µs] AND THROUGHPUT (T) IN [Mb/s] FOR DIFFERENT DECODERS (FREQUENCY: 500 MHz)
User # (J)           6                12                24                48                96                192
Resource # (K)       4                8                 16                32                64                128
Iteration # (Imax)   L / T            L / T             L / T             L / T             L / T             L / T
This work
1                    3.50 / 857.14    3.70 / 1628.57    3.96 / 3012.85    4.26 / 5423.13    4.58 / 9490.48    4.95 / 16133.81
2                    5.50 / 547.69    5.84 / 1040.62    6.22 / 1925.14    6.66 / 3465.25    7.16 / 6064.20    7.74 / 10309.14
3                    8.62 / 349.96    9.14 / 664.93     9.74 / 1230.12    10.42 / 2208.46   11.22 / 3874.88   12.12 / 6587.30
4                    13.48 / 223.62   14.28 / 424.88    15.22 / 786.02    16.32 / 1411.16   17.56 / 2475.96   18.98 / 4209.14
5                    21.00 / 142.86   22.12 / 271.43    23.42 / 502.15    24.94 / 903.88    26.68 / 1581.78   28.62 / 2689.03
C. Yang [21], [DMPA decoder, APCCAS ’17]
1                    5.60 / 150.00    5.80 / 285.00     6.06 / 527.25     6.36 / 949.05     6.78 / 1660.84    7.16 / 2823.42
2                    10.92 / 76.92    11.26 / 146.148   11.64 / 270.37    12.08 / 486.67    12.58 / 851.68    13.16 / 1447.85
3                    16.24 / 51.72    16.76 / 98.27     17.36 / 181.80    18.04 / 327.23    18.84 / 572.66    19.74 / 973.52
4                    21.56 / 38.96    22.36 / 74.02     23.30 / 136.94    24.40 / 246.50    25.64 / 431.37    27.06 / 733.34
5                    26.88 / 31.25    28.00 / 59.38     29.30 / 109.84    30.82 / 197.72    32.56 / 346.01    34.50 / 588.21
K. Han [18], [SMPA decoder, TCAS-I Oct. ’17]
5                    n.a. / 57        n.a. / n.a.       n.a. / n.a.       n.a. / n.a.       n.a. / 640        n.a. / n.a.
[22] W. E. Ryan, A Low-Density Parity-Check Code Tutorial, Part II -
The Iterative Decoder, 1st ed. The University of Arizona, 2002.
[23] W. Li, B. Chen, J. Lei et al., “Low density parity check codes with
quasi-cyclic structure and zigzag pattern,” in Proc. IEEE International
Conference on Signal Processing (ICSP), Oct. 2014, pp. 1730–1734.
[24] J. Chen, A. Dholakia, E. Eleftheriou et al., “Reduced-complexity de-
coding of LDPC codes,” IEEE Trans. Commun., vol. 53, no. 8, pp.
1288–1299, Aug. 2005.
[25] P. Robertson, E. Villebrun, and P. Hoeher, “A comparison of optimal and
sub-optimal MAP decoding algorithms operating in the log domain,” in
Proc. IEEE International Conference on Communications (ICC), vol. 2,
Jun. 1995, pp. 1009–1013.
[26] J. Chen, R. M. Tanner, C. Jones et al., “Improved min-sum decoding
algorithms for irregular LDPC codes,” in Proc. IEEE International
Symposium on Information Theory (ISIT), Sep. 2005, pp. 449–453.
[27] X. Wu, Y. Song, M. Jiang et al., “Adaptive-normalized/offset min-sum
algorithm,” IEEE Commun. Lett., vol. 14, no. 7, pp. 667–669, Jul. 2010.
[28] V. Savin, “Self-corrected min-sum decoding of LDPC codes,” in Proc.
IEEE International Symposium on Information Theory (ISIT), Jul. 2008,
pp. 146–150.
[29] G. D. Forney and G. Ungerboeck, “Modulation and coding for linear
Gaussian channels,” IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2384–
2415, Oct. 1998.
[30] L. Lu, Y. Chen, W. Guo et al., “Prototype for 5G new air interface tech-
nology SCMA and performance evaluation,” China Communications,
vol. 12, no. Supplement, pp. 38–48, Dec. 2015.
[31] J. Chen, K. Han, J. Hu et al., “Low complexity sparse code multiple
access decoder based on tree pruned method,” in Proc. IEEE Interna-
tional Conference on Digital Signal Processing (DSP), Oct. 2016, pp.
341–345.
[32] S. Simon, E. Bernard, M. Sauer et al., “A new retiming algorithm for
circuit design,” in Proc. IEEE International Symposium on Circuits and
Systems (ISCAS), vol. 4, May 1994, pp. 35–38.
[33] J. Monteiro, S. Devadas, and A. Ghosh, “Retiming sequential circuits
for low power,” in Proc. IEEE International Conference on Computer
Aided Design (ICCAD), Nov. 1993, pp. 398–402.
[34] K. K. Parhi, C. Y. Wang, and A. P. Brown, “Synthesis of control circuits
in folded pipelined DSP architectures,” IEEE J. Solid-State Circuits,
vol. 27, no. 1, pp. 29–43, Jan. 1992.
[35] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and
Implementation, 1st ed. John Wiley & Sons, 1999.
[36] K. K. Parhi, “Calculation of minimum number of registers in arbitrary
life time chart,” IEEE Trans. Circuits Syst. II, vol. 41, no. 6, pp. 434–
436, Jun. 1994.
[37] ——, “Systematic synthesis of DSP data format converters using life-
time analysis and forward-backward register allocation,” IEEE Trans.
Circuits Syst. II, vol. 39, no. 7, pp. 423–440, Jul. 1992.
[38] D. Y. Chao and D. T. Wang, “Iteration bounds of single-rate data flow
graphs for concurrent processing,” IEEE Trans. Circuits Syst. I, vol. 40,
no. 9, pp. 629–634, Sep. 1993.
[39] K. Ito and K. K. Parhi, “Determining the iteration bounds of single-rate
and multi-rate data-flow graphs,” in Proc. IEEE Asia-Pacific Conference
on Circuits and Systems (APCCAS), Dec. 1994, pp. 163–168.
[Figure 13 legend: A = Adder, B = Multiplier, C = Comparator, D = Swopper, E = LN Memory, F = RN Memory, G = Symbol Selector, H = Pk Memory. Nodes shown: A0–A41, B0–B11, C0–C9, D0–D3, E0, F0, G0–G5, and H0, with delay elements between them; inputs y_1–y_4, codeword components x_{1,Re}–x_{3,Re} and x_{1,Im}–x_{3,Im}, and the constant −1/N_0; outputs l_1–l_6.]
Fig. 13. Data flow graph (DFG) of step-level architecture.
[Figure 14 legend: A = Adder, B = Multiplier, C = Comparator, D = Swopper, E = LN Memory, F = RN Memory, G = Symbol Selector, H = Pk Memory. Nodes shown: A0–A41, B0–B11, C0–C9, D0–D3, E0, F0, and G0–G5, with delay elements between them; inputs y_1–y_4, codeword components x_{1,Re}–x_{3,Re} and x_{1,Im}–x_{3,Im}, and the constant −1/N_0; outputs l_1–l_6.]
Fig. 14. Data flow graph (DFG) of stage-level architecture.
