Symbol-Decision Successive Cancellation List Decoder for Polar Codes by Xiong, Chenrong et al.
Chenrong Xiong, Jun Lin, Student Member, IEEE, and Zhiyuan Yan, Senior Member, IEEE
Abstract—Polar codes are of great interest because they
provably achieve the capacity of both discrete and continuous
memoryless channels while having an explicit construction. Most
existing decoding algorithms of polar codes are based on bit-wise
hard or soft decisions. In this paper, we propose symbol-decision
successive cancellation (SC) and successive cancellation list (SCL)
decoders for polar codes, which use symbol-wise hard or soft
decisions for higher throughput or better error performance.
First, we propose to use a recursive channel combination to
calculate symbol-wise channel transition probabilities, which lead
to symbol decisions. Our proposed recursive channel combination
also has a lower complexity than simply combining bit-wise
channel transition probabilities. The similarity between our
proposed method and Arikan’s channel transformations also
helps to share hardware resources between calculating bit- and
symbol-wise channel transition probabilities. Second, a two-stage
list pruning network is proposed to provide a trade-off between
the error performance and the complexity of the symbol-decision
SCL decoder. Third, since memory is a significant part of SCL
decoders, we propose a pre-computation memory-saving technique
to reduce the memory requirement of an SCL decoder. Finally,
to evaluate the throughput advantage of our symbol-decision
decoders, we design an architecture based on a semi-parallel
successive cancellation list decoder. In this architecture, different
symbol sizes, sorting implementations, and message scheduling
schemes are considered. Our synthesis results show that in terms
of area efficiency, our symbol-decision SCL decoders outperform
existing bit- and symbol-decision SCL decoders.
Index Terms—Error control codes, polar codes, successive
cancellation, list decoding algorithm, hardware implementation
I. INTRODUCTION
Polar codes, a groundbreaking finding by Arikan [1] in
2009, have ignited a spark of research interest in the fields of
communication and coding theory, because they can provably
achieve the capacity for both discrete [1] and continuous [2]
memoryless channels. The second reason why polar codes are
attractive is their low encoding and decoding complexity. For
example, a polar code of length N can be decoded by the
successive cancellation (SC) algorithm [1] with a complexity
of O(N logN). However, capacity-approaching performance is
achieved only when the code length is large enough ($N > 2^{20}$
[3]) if the SC algorithm is used. For short or moderate code
length, in terms of the error performance, polar codes with
the SC algorithm are inferior to Turbo codes or low-density
parity-check (LDPC) codes [4], [5].
Since the debut of polar codes, considerable effort has been
made to improve the error performance of short polar codes.
Systematic polar codes [6] were proposed to reduce the bit
error rate (BER) while guaranteeing the same frame error
rate (FER) as their non-systematic counterparts. Although a
Viterbi algorithm [7], a sphere decoding algorithm [8] and
stack sphere decoding algorithm [9] can provide maximum
likelihood (ML) decoding of polar codes, they are considered
infeasible, especially for long polar codes, due to their much
higher complexity than the SC algorithm. Recently, an SC
list algorithm for polar codes was proposed in [10] to bridge
the performance gap between the SC algorithm and ML
algorithms at the cost of complexity of O(LN logN), where
L is the list size. Moreover, the concatenation of polar codes
with cyclic redundancy check (CRC) codes was introduced in
[4], [11]. To decode the CRC-concatenated polar codes, a CRC
detector is used in the SCL algorithm to help select the output
codeword. The combination of an SCL algorithm and a CRC
detector is called CRC-aided SCL (CA-SCL) algorithm. [11]
shows that with the CA-SCL algorithm, the error performance
of a (2048, 1024) CRC-concatenated polar code is better than that
of a (2304, 1152) LDPC code, which is used in the WiMax
standard [12].
Several architectures have been proposed for the SC algorithm.
Arikan [1] showed that a fully parallel SC decoder has
a latency of 2N − 1 clock cycles. A tree SC decoder and a
line SC decoder with complexity of O(N) were proposed in
[13]. These two decoders have the same latency as the fully
parallel SC decoder. To reduce complexity further, Leroux
et al. [3] proposed a semi-parallel SC decoder for polar codes
by taking advantage of the recursive structure of polar codes
to reuse processing resources. Assuming that the number of
processing elements (PEs) is $P$ ($P = 2^p \le N$), the latency
of the semi-parallel SC decoder is $2N + \frac{N}{P}\log_2\left(\frac{N}{4P}\right)$ clock
cycles. To reduce the latency, a simplified SC (SSC) polar
decoder was introduced in [14] and it was further analyzed
in [15]. In the SSC polar decoder, a polar code is converted
to a binary tree including three types of nodes: rate-one, rate-
zero and rate-R nodes. Based on the SSC polar decoder, the
ML-SSC decoder in [16] uses the ML algorithm to decode
some of the rate-R nodes. However, the SSC and ML-SSC
polar decoders depend on the positions of information bits
and frozen bits, and are consequently code-specific. In [17], a
pre-computation look-ahead technique was proposed to reduce
the latency of the tree SC decoder by half. For the SCL polar
decoder, the semi-parallel architecture was adopted in [18]. In
[19] Balatsoukas-Stimming et al. proposed an architecture of
L = 4 to achieve a throughput of 124 Mbps and a latency of
8.25 µs when decoding a (1024, 512) polar code. In [20], Lin
and Yan designed an SCL polar decoder with a throughput
of 182 Mbps and a latency of 5.63 µs. To reduce the memory
requirement, the log-likelihood ratio (LLR) messages are used
in [21]. The throughput of existing polar decoders is still not
high enough for high speed applications.
arXiv:1501.04705v1 [cs.IT] 20 Jan 2015
Since the low throughput (or long latency) of the SC
decoder is due to its serial nature, several previous works
attempt to improve the throughput (or reduce the latency). In [22], the
data bits of a polar code are split into several streams, which
are decoded simultaneously. This idea of parallel processing
is extended in [23], where the SC decoder is transformed into
a concatenated decoder whose inner SC decoders are
run in parallel. Yuan and Parhi proposed a multi-bit
SCL decoder [24].
In this paper, we address the throughput/latency issue by
proposing symbol-decision SC and SCL decoders, which are
based on symbol-wise hard or soft decisions. Since each
symbol consists of M bits, when M > 1 the symbol-
decision decoders achieve higher throughput as well as better
error performance. The proposed symbol-decision decoders
are natural generalizations of their bit-wise counterparts, and
reduce to existing bit-wise decoders when the symbol size is
one bit. The main contributions of this paper are:
• We propose a novel recursive channel combination to
calculate the symbol-wise channel transition probabilities,
which enable symbol decisions in SC and SCL
algorithms. The proposed recursive channel combination
also has a lower complexity than simply combining
bit-wise channel transition probabilities. The similarity
between Arikan's recursive channel transformation
and our symbol-wise recursive channel combination helps
to share hardware resources when calculating the bit- and
symbol-based channel transition probabilities.
• An M-bit symbol-decision SCL decoder needs to find the
L most reliable candidates out of $2^M L$ list candidates. We
propose a two-stage list pruning network to perform this
sorting function. This pruning network also provides a
trade-off between performance and complexity.
• By adopting the pre-computation technique [25], we develop
a pre-computation memory-saving (PCMS) technique to
reduce the memory requirement of the SCL decoder.
Specifically, the channel information memory can be
eliminated when using the PCMS technique. Moreover,
this technique also helps to improve throughput slightly.
• To evaluate the throughput of symbol-decision SC decoders,
we propose an area-efficient architecture for
symbol-decision SCL decoders¹. In our architecture, to
save area, adders in processing units are reused to
calculate the symbol-wise channel transition probability.
We propose two scheduling schemes for sharing hardware
resources. We also propose two list pruning networks for
designs with different symbol sizes.
• We design two-, four-, and eight-bit symbol-decision SCL
decoders for a (1024, 480) CRC32-concatenated polar
code with a list size of four. Synthesis results show
that in terms of area efficiency, our symbol-decision
SCL decoder outperforms all existing state-of-the-art
SCL decoders in [19]–[21], [24]. For example, the area
efficiency of our four-bit symbol-decision SCL decoder
is 259.2 Mb/s/mm², which is 1.51 times that
of [21]. Our implementation results also demonstrate that
the symbol-decision SCL decoder can provide a range of
tradeoffs between area, throughput, and area efficiency.
¹ We focus on the SCL decoder because the SC decoder can be considered
as an SCL decoder with a list size of one.
Our symbol-decision decoding algorithms assume that the
underlying channel has a binary input, and our symbol-wise
channel transformation is virtual and introduced for decoding
only. Hence, our work is different from those assuming a q-ary
(q > 2) channel (see, for example, [26]).
The decoding schedule (bit sequence) of our symbol-
decision decoding algorithms is actually the same as those
in [22]–[24], but our symbol-decision decoding algorithms
are different from those in [22]–[24] in two aspects. First,
our symbol-wise recursive channel transition is different from
how transition probabilities are derived in [22]–[24]. Second,
the symbol-decision perspective allows us to prove that
the symbol-decision algorithms have better frame error rates
(FERs) than their bit-decision counterparts [27], while only
simulation results are provided in [22], [24] and error
performance is not investigated in [23]. There are additional
differences between our decoding algorithms/architectures and
those in [22]–[24]. For instance, all the bits within a symbol
are estimated jointly in our symbol-decision SC algorithm,
whereas some bits are decoded independently for the decoder
with parallelism two in [22]. Also, while our symbol-decision
decoding is introduced on the algorithmic level, the multibit
decoder is introduced on the level of decoding operations [24].
Finally, for our symbol-decision SCL decoders, we use the
semi-parallel architecture because it is more area efficient than
the tree architecture and the line architecture [13].
The rest of our paper is organized as follows. Section II
briefly reviews polar codes and existing decoding algorithms
for polar codes. In Section III, the symbol-based recursive
channel combination is proposed to calculate the symbol-
based channel transition probability. Moreover, to simplify
the selection of the list candidates, a two-stage list pruning
network is proposed. In Section IV, we introduce a method to
reduce the memory requirement of list decoders for polar codes
using a pre-computation technique. In Section V, we present the
hardware architecture for symbol-decision SCL decoders. Two
scheduling schemes for hardware sharing are discussed. We
also propose two list pruning networks for different designs:
a folded sorting implementation and a tree sorting
implementation. A discussion on the latency of our architecture
and synthesis results for our implementations are provided in
this section as well. Finally, we draw some conclusions in
Section VI.
II. POLAR CODES AND EXISTING DECODING
ALGORITHMS
A. Preliminaries
We follow the notation for vectors in [1], namely $u_a^b = (u_a, u_{a+1}, \cdots, u_{b-1}, u_b)$; if $a > b$, $u_a^b$ is regarded as void. $u_{a,o}^b$ and $u_{a,e}^b$ denote the subvectors of $u_a^b$ with odd and even indices, respectively.
Let $W : \mathcal{X} \to \mathcal{Y}$ represent a generic B-DMC with binary input alphabet $\mathcal{X}$, arbitrary output alphabet $\mathcal{Y}$, and transition probabilities $W(y|x)$, $y \in \mathcal{Y}$, $x \in \{0, 1\}$. Assume $N$ is an arbitrary integer and $M$ is an integer satisfying $M | N$. Let $W_{N,M}^{(j)}$ denote a set of $\frac{N}{M}$ coordinate channels: $W_{N,M}^{(j)} : \mathcal{X}^M \to \mathcal{Y}^N \times \mathcal{X}^{(j-1)M}$, $0 < j \le \frac{N}{M}$, with the transition probabilities $W_{N,M}^{(j)}(y_1^N, x_1^{(j-1)M} | x_{(j-1)M+1}^{jM})$, where $(y_1^N, x_1^{(j-1)M})$ and $x_{(j-1)M+1}^{jM}$ denote the output and input of $W_{N,M}^{(j)}$, respectively.
B. Polar Codes
Polar codes are linear block codes, and their block lengths are restricted to powers of two, denoted by $N = 2^n$ for $n \ge 2$. Assume $\mathbf{u} = u_1^N = (u_1, u_2, \cdots, u_N)$ is the data bit sequence. Let $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. The corresponding encoded bit sequence $\mathbf{x} = x_1^N = (x_1, x_2, \cdots, x_N)$ is generated by
$$\mathbf{x} = \mathbf{u} B_N F^{\otimes n}, \tag{1}$$
where $B_N$ is the $N \times N$ bit-reversal permutation matrix and $F^{\otimes n}$ denotes the n-th Kronecker power of $F$ [1].
For any index set $\mathcal{A} \subseteq \{1, 2, \cdots, N\}$, $\mathbf{u}_{\mathcal{A}} = (u_i : 0 < i \le N, i \in \mathcal{A})$ is the sub-sequence of $\mathbf{u}$ restricted to $\mathcal{A}$. For an (N, K) polar code, the data bit sequence is grouped into two parts: a K-element part $\mathbf{u}_{\mathcal{A}}$ which carries information bits, and $\mathbf{u}_{\mathcal{A}^c}$ whose elements are predefined frozen bits, where $\mathcal{A}^c$ is the complement of $\mathcal{A}$. For convenience, frozen bits are set to zero.
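The encoding rule in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `bit_reverse`, `polar_encode`, and `place_bits`, as well as the information set used at the end, are assumptions made for the example. The product with $F^{\otimes n}$ is carried out as n in-place butterfly stages instead of an explicit matrix multiplication.

```python
def bit_reverse(index, n):
    """Reverse the n-bit binary representation of index."""
    result = 0
    for _ in range(n):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def polar_encode(u):
    """Compute x = u * B_N * F^{(x)n} over GF(2) for a list u of N = 2^n bits."""
    N = len(u)
    n = N.bit_length() - 1
    if N != 1 << n:
        raise ValueError("block length must be a power of two")
    # u * B_N: entry i of the product is u at the bit-reversed position.
    x = [u[bit_reverse(i, n)] for i in range(N)]
    # Multiplication by F^{(x)n}, done as n in-place butterfly stages.
    step = 1
    while step < N:
        for i in range(N):
            if i & step == 0:
                x[i] ^= x[i + step]
        step <<= 1
    return x


def place_bits(info_bits, info_set, N):
    """Build u: information bits on the 1-based index set A, frozen bits set to zero."""
    u = [0] * N
    for position, bit in zip(sorted(info_set), info_bits):
        u[position - 1] = bit
    return u


# Hypothetical (8, 3) example: the information set {4, 6, 8} is chosen for illustration only.
u = place_bits([1, 0, 1], {4, 6, 8}, 8)
x = polar_encode(u)
```

Because $B_N F^{\otimes n}$ is its own inverse over GF(2), applying `polar_encode` twice returns the original sequence, which gives a convenient sanity check.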
C. SC Algorithm for Polar Codes
Given a transmitted codeword $\mathbf{x}$ and the corresponding received word $\mathbf{y}$, the SC algorithm for an (N, K) polar code estimates the encoding bit sequence $\mathbf{u}$ successively as shown in Alg. 1. Here, $\hat{\mathbf{u}} = (\hat{u}_1, \hat{u}_2, \cdots, \hat{u}_N)$ represents the estimated value for $\mathbf{u}$.
Algorithm 1: SC Decoding Algorithm [1]
for j = 1 : N do
    if $j \in \mathcal{A}^c$ then $\hat{u}_j = 0$
    else if $\frac{W_{N,1}^{(j)}(\mathbf{y}, \hat{u}_1^{j-1} | u_j = 1)}{W_{N,1}^{(j)}(\mathbf{y}, \hat{u}_1^{j-1} | u_j = 0)} \ge 1$ then $\hat{u}_j = 1$
    else $\hat{u}_j = 0$
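To calculate $W_{N,1}^{(j)}(\mathbf{y}, \hat{u}_1^{j-1} | u_j)$, Arikan's recursive channel transformation [1] is applied. A pair of binary channels $W_{2\Lambda,1}^{(2i-1)}$ and $W_{2\Lambda,1}^{(2i)}$ are obtained by a single-step transformation of two independent copies of a binary input channel $W_{\Lambda,1}^{(i)}$: $(W_{\Lambda,1}^{(i)}, W_{\Lambda,1}^{(i)}) \mapsto (W_{2\Lambda,1}^{(2i-1)}, W_{2\Lambda,1}^{(2i)})$. The channel transition probabilities of $W_{2\Lambda,1}^{(2i-1)}$ and $W_{2\Lambda,1}^{(2i)}$ are given by
$$W_{2\Lambda,1}^{(2i-1)}(y_1^{2\Lambda}, u_1^{2i-2} | u_{2i-1}) = \frac{1}{2} \sum_{u_{2i}} \left[ W_{\Lambda,1}^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} | u_{2i-1} \oplus u_{2i}) \cdot W_{\Lambda,1}^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2} | u_{2i}) \right], \tag{2}$$
and
$$W_{2\Lambda,1}^{(2i)}(y_1^{2\Lambda}, u_1^{2i-1} | u_{2i}) = \frac{1}{2} W_{\Lambda,1}^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} | u_{2i-1} \oplus u_{2i}) \cdot W_{\Lambda,1}^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2} | u_{2i}), \tag{3}$$
where $0 < i \le \Lambda = 2^{\lambda} < N$, and $0 \le \lambda < n$.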
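As an illustration of how Alg. 1 and the recursion in Eqs. (2) and (3) fit together, the following Python sketch evaluates the bit-channel transition probabilities directly in the probability domain. It is a naive, exponential-time sketch intended only for very short codes. The binary symmetric channel `bsc`, the helper names, and the 0-based frozen set are assumptions of the example, not part of the paper.

```python
def bsc(p):
    """W(y|x) of a binary symmetric channel with crossover probability p (assumed channel)."""
    return lambda y, x: 1.0 - p if y == x else p


def xor(a, b):
    return [s ^ t for s, t in zip(a, b)]


def bit_channel(y, u_prev, u_j, W):
    """W_{N,1}^{(j)}(y, u_1^{j-1} | u_j) with j = len(u_prev) + 1, following Eqs. (2)-(3)."""
    N = len(y)
    if N == 1:
        return W(y[0], u_j)
    half = N // 2
    if len(u_prev) % 2 == 0:
        # j = 2i - 1 is odd: marginalize over the unknown u_{2i}, Eq. (2).
        odd, even = u_prev[0::2], u_prev[1::2]
        return 0.5 * sum(
            bit_channel(y[:half], xor(odd, even), u_j ^ b, W)
            * bit_channel(y[half:], even, b, W)
            for b in (0, 1)
        )
    # j = 2i is even: the last entry of u_prev is u_{2i-1}, Eq. (3).
    head, u_odd = u_prev[:-1], u_prev[-1]
    odd, even = head[0::2], head[1::2]
    return 0.5 * (
        bit_channel(y[:half], xor(odd, even), u_odd ^ u_j, W)
        * bit_channel(y[half:], even, u_j, W)
    )


def sc_decode(y, frozen, W):
    """Alg. 1; `frozen` holds the 0-based positions of the frozen bits."""
    u_hat = []
    for j in range(len(y)):
        if j in frozen:
            u_hat.append(0)
        else:
            w1 = bit_channel(y, u_hat, 1, W)
            w0 = bit_channel(y, u_hat, 0, W)
            # Comparing w1 >= w0 is the ratio test of Alg. 1 without a division.
            u_hat.append(1 if w1 >= w0 else 0)
    return u_hat
```

For instance, with N = 4, first bit frozen, and an error-free received word, `sc_decode([0, 1, 0, 1], {0}, bsc(0.1))` recovers the data sequence (0, 1, 0, 1), whose codeword under Eq. (1) is also (0, 1, 0, 1).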
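Expressed in log-likelihood (LL), Eqs. (2) and (3) can be approximated as [4]:
$$LL_{2\Lambda}^{(2i-1)}(y_1^{2\Lambda}, u_1^{2i-2} | u_{2i-1}) \approx \max\left\{ \left[ LL_{\Lambda}^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} | u_{2i-1} \oplus 0) + LL_{\Lambda}^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2} | 0) \right], \left[ LL_{\Lambda}^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} | u_{2i-1} \oplus 1) + LL_{\Lambda}^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2} | 1) \right] \right\} - \log 2, \tag{4}$$
$$LL_{2\Lambda}^{(2i)}(y_1^{2\Lambda}, u_1^{2i-1} | u_{2i}) \approx LL_{\Lambda}^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} | u_{2i-1} \oplus u_{2i}) + LL_{\Lambda}^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2} | u_{2i}) - \log 2. \tag{5}$$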
To simplify the calculation, the constants in Eqs. (4) and
(5) can be discarded since this global offset for all LLs does
not affect the decoding decision.
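A small Python sketch of the LL-domain updates may help make Eqs. (4) and (5) concrete. Each message is represented here as a pair `(LL given bit 0, LL given bit 1)`; this representation and the function names are assumptions of the sketch. The constant $-\log 2$ is dropped, as described above, and an exact log-sum-exp version of the odd-index update is included only to show how far the max approximation can deviate.

```python
import math


def ll_odd(ll_top, ll_bot):
    """Eq. (4) without the -log 2 offset.

    ll_top[b] and ll_bot[b] are the LLs of the upper and lower copies of
    W_Lambda^{(i)} for conditional bit b. Returns the LL pair for u_{2i-1} in (0, 1).
    """
    return tuple(
        max(ll_top[u ^ 0] + ll_bot[0], ll_top[u ^ 1] + ll_bot[1]) for u in (0, 1)
    )


def ll_even(ll_top, ll_bot, u_odd):
    """Eq. (5) without the offset; u_odd is the already decided bit u_{2i-1}."""
    return tuple(ll_top[u_odd ^ u] + ll_bot[u] for u in (0, 1))


def ll_odd_exact(ll_top, ll_bot):
    """Exact logarithm of the sum in Eq. (2), again without the offset."""
    return tuple(
        math.log(math.exp(ll_top[u ^ 0] + ll_bot[0]) + math.exp(ll_top[u ^ 1] + ll_bot[1]))
        for u in (0, 1)
    )


# Two channel observations that both favor bit 0 with probability 0.9 (illustrative values).
top = bot = (math.log(0.9), math.log(0.1))
approx = ll_odd(top, bot)
exact = ll_odd_exact(top, bot)
```

The approximation never exceeds the exact value and underestimates it by at most $\log 2$, which happens when the two terms inside the maximum are equal.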
D. Parallel SC Algorithm for Polar Codes
[Figure 1: (a) bits 1, 2, ..., N are decided one at a time; (b) bits are decided in groups (1, ..., M), (M+1, ..., 2M), (2M+1, ..., 3M), ..., (N−M+1, ..., N).]
Fig. 1. Decoding of (a) bit-decision vs. (b) M-bit symbol-decision
The SC algorithm makes a hard decision for only one bit
at a time, as shown in Fig. 1(a). We call it the bit-decision
decoding algorithm. A parallel SC decoder [22]–[24] makes
hard decisions for M bits at a time instead of only one bit, as
shown in Fig. 1(b).
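Without loss of generality, assume M is a power of two, i.e., $M = 2^m$ ($0 \le m \le n$). Define $\mathcal{IM}_j \overset{\text{def}}{=} \{jM - M + 1, jM - M + 2, \cdots, jM\}$ for $0 < j \le \frac{N}{M}$, $\mathcal{AM}_j \overset{\text{def}}{=} \mathcal{IM}_j \cap \mathcal{A}$, and $\mathcal{AM}_j^c \overset{\text{def}}{=} \mathcal{IM}_j \cap \mathcal{A}^c$.
Given $\mathbf{y}$ and $\hat{u}_1^{jM-M}$, $u_{jM-M+1}^{jM}$ is determined by
$$\hat{u}_{jM-M+1}^{jM} = \arg\max_{\substack{u_{\mathcal{AM}_j} \in \{0,1\}^{|\mathcal{AM}_j|} \\ u_{\mathcal{AM}_j^c} \in \{0\}^{|\mathcal{AM}_j^c|}}} W_{N,M}^{(j)}(\mathbf{y}, \hat{u}_1^{jM-M} | u_{jM-M+1}^{jM}), \tag{6}$$
where $|\mathcal{AM}_j|$ represents the cardinality of $\mathcal{AM}_j$. If M = N, this decoding algorithm is exactly a maximum-likelihood sequence decoding algorithm.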
Algorithm 2: SCL Decoding Algorithm [10]
α = 1;
for j = 1 : N do
    if $j \in \mathcal{A}^c$ then
        for i = 1 : α do
            $(\mathcal{L}_i)_j = 0$;
    else if 2α ≤ L then
        for i = 1 : α do
            $(\mathcal{L}_i)_1^j$ = conc($(\mathcal{L}_i)_1^{j-1}$, 0);
            $(\mathcal{L}_{i+\alpha})_1^j$ = conc($(\mathcal{L}_i)_1^{j-1}$, 1);
        α = 2α;
    else
        for i = 1 : L do
            S[i].P = $W_{N,1}^{(j)}(\mathbf{y}, (\mathcal{L}_i)_1^{j-1} | 0)$;
            S[i].L = $(\mathcal{L}_i)_1^{j-1}$;
            S[i].U = 0;
            S[i + L].P = $W_{N,1}^{(j)}(\mathbf{y}, (\mathcal{L}_i)_1^{j-1} | 1)$;
            S[i + L].L = $(\mathcal{L}_i)_1^{j-1}$;
            S[i + L].U = 1;
        sortPDecrement(S);
        for i = 1 : L do
            $(\mathcal{L}_i)_1^j$ = conc(S[i].L, S[i].U);
        α = L;
$\hat{\mathbf{u}} = \mathcal{L}_1$;
E. SCL and CA-SCL Algorithms for Polar Codes
Instead of making a hard decision for each information bit
of u in the SC algorithm, the SCL algorithm creates two
paths in which the information bit is assumed to be 0 and
1, respectively. If the number of paths is greater than the list
size L, the L most reliable paths are selected. At the end of
the decoding procedure, the most reliable path is chosen as uˆ.
The SCL algorithm is formally described in Alg. 2. Without
loss of generality, we assume L to be a power of two, i.e.,
$L = 2^l$. We use $\mathcal{L}_i = ((\mathcal{L}_i)_1, (\mathcal{L}_i)_2, \cdots, (\mathcal{L}_i)_N)$ to represent
the i-th list vector, where $0 < i \le L$. S is a structure type
array with size 2L. Each element of S has three members: P,
L, and U. The function sortPDecrement sorts the array
S by decreasing order of P. c=conc(a,b) attaches a bit
sequence b at the end of a bit sequence a, and the length of
the output bit sequence c is the sum of lengths of a and b.
The CA-SCL algorithm is used for the CRC-concatenated
polar codes. The difference between CA-SCL [11] and SCL
algorithms is how to make the final decision for uˆ. If there is at
least one path satisfying the CRC constraint, the most reliable
CRC-valid path is chosen for uˆ. Otherwise, the decision rule
of the SCL algorithm is used for the CA-SCL algorithm.
III. M -BIT SYMBOL-DECISION DECODING ALGORITHMS
FOR POLAR CODES
A. M-bit Symbol-Decision SC Algorithm
Here, we propose a symbol-decision SC algorithm, which treats M-bit data as a symbol and decodes one symbol at a time. Let $\mathcal{Z}$ represent the alphabet of all M-bit symbols. The symbol-decision SC algorithm deals with the virtual channel $\mathbb{W}_N^{(j)} : \mathcal{Z} \to \mathcal{Y}^N \times \mathcal{Z}^{j-1}$, $0 < j \le \frac{N}{M}$, with the transition probabilities $\mathbb{W}_N^{(j)}(y_1^N, z_1^{j-1} | z_j)$, where $(y_1^N, z_1^{j-1})$ and $z_j = (u_{jM-M+1}, \cdots, u_{jM})$ denote the output and input of $\mathbb{W}_N^{(j)}$, respectively. Actually, $\mathbb{W}_N^{(j)}$ is exactly equivalent to $W_{N,M}^{(j)}$ if we consider $\mathcal{X}^M$ as the binary vector representation of $\mathcal{Z}$. Therefore, the symbol-decision SC algorithm has the same schedule as the parallel SC algorithm in [22]–[24]. However, our symbol-decision SC algorithm uses a different approach, called symbol-based recursive channel combination, to compute the symbol-based channel transition probabilities $W_{N,M}^{(j)}(\mathbf{y}, \hat{u}_1^{jM-M} | u_{jM-M+1}^{jM})$, which is our main focus.
B. Symbol-Based Recursive Channel Combination
Assume $u_{iM-M+1}^{iM} = (w_i, w_{i+\frac{N}{M}}, \cdots, w_{i+N-\frac{N}{M}}) B_M F^{\otimes m}$ for $1 \le i \le \frac{N}{M}$. In [22]–[24], the calculation of the symbol-based channel transition probability $W_{N,M}^{(i)}(\mathbf{y}, \hat{u}_1^{iM-M} | u_{iM-M+1}^{iM})$ is based on the following equation, referred to as direct-mapping calculation:
$$W_{N,M}^{(i)}(\mathbf{y}, \hat{u}_1^{iM-M} | u_{iM-M+1}^{iM}) = \prod_{j=0}^{M-1} W_{\frac{N}{M},1}^{(i)}\left(y_{j\frac{N}{M}+1}^{(j+1)\frac{N}{M}}, \hat{w}_{1+j\frac{N}{M}}^{(i-1)+j\frac{N}{M}} \,\Big|\, w_{i+j\frac{N}{M}}\right), \tag{7}$$
where $W_{\frac{N}{M},1}^{(i)}\left(y_{j\frac{N}{M}+1}^{(j+1)\frac{N}{M}}, \hat{w}_{1+j\frac{N}{M}}^{(i-1)+j\frac{N}{M}} \,\Big|\, w_{i+j\frac{N}{M}}\right)$ is calculated by Arikan's recursive channel transformations.
Actually, a symbol-based recursive channel combination described in Proposition 1 can be used to calculate $W_{N,M}^{(i)}(\mathbf{y}, \hat{u}_1^{iM-M} | u_{iM-M+1}^{iM})$.
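Proposition 1. Assume that all bits of $\mathbf{u}$ are independent and each bit has an equal probability of being a 0 or 1. Given $0 < m \le n$, $N = 2^n$, $M = 2^m$, for any $1 \le \phi \le m$, $0 \le \lambda < n$, $\Lambda = 2^{\lambda}$, $\Phi = 2^{\phi}$, and $0 \le i < \frac{2\Lambda}{\Phi}$, we say that a $\Phi$-bit channel $W_{2\Lambda,\Phi}^{(i+1)}$ is obtained by a single-step combination of two independent copies of a $\frac{\Phi}{2}$-bit channel $W_{\Lambda,\Phi/2}^{(i+1)}$ and write
$$(W_{\Lambda,\Phi/2}^{(i+1)}, W_{\Lambda,\Phi/2}^{(i+1)}) \mapsto W_{2\Lambda,\Phi}^{(i+1)}, \tag{8}$$
where the channel transition probability satisfies
$$W_{2\Lambda,\Phi}^{(i+1)}(y_1^{2\Lambda}, u_1^{i\Phi} | u_{i\Phi+1}^{i\Phi+\Phi}) = W_{\Lambda,\Phi/2}^{(i+1)}(y_1^{\Lambda}, u_{1,o}^{i\Phi} \oplus u_{1,e}^{i\Phi} | u_{i\Phi+1,o}^{i\Phi+\Phi} \oplus u_{i\Phi+1,e}^{i\Phi+\Phi}) \cdot W_{\Lambda,\Phi/2}^{(i+1)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{i\Phi} | u_{i\Phi+1,e}^{i\Phi+\Phi}). \tag{9}$$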
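The recursion in Eq. (9) can be checked numerically on a short code. The Python sketch below computes a symbol-wise transition probability in two ways: recursively via Eq. (9), falling back to Arikan's transformation once the symbol shrinks to one bit, and by brute force, averaging the channel likelihood over all equiprobable future bits. The normalization $2^{-(N-M)}$ in the brute-force version follows from the equiprobable-bits assumption of Proposition 1. The binary symmetric channel and all function names are assumptions of this sketch.

```python
from itertools import product


def bsc(p):
    """W(y|x) of a binary symmetric channel (assumed channel model)."""
    return lambda y, x: 1.0 - p if y == x else p


def xor(a, b):
    return [s ^ t for s, t in zip(a, b)]


def polar_encode(u):
    """x = u * B_N * F^{(x)n} over GF(2)."""
    N, n = len(u), len(u).bit_length() - 1
    x = [u[int(format(i, "0{}b".format(n))[::-1], 2)] for i in range(N)]
    step = 1
    while step < N:
        for i in range(N):
            if i & step == 0:
                x[i] ^= x[i + step]
        step <<= 1
    return x


def bit_channel(y, u_prev, u_j, W):
    """Arikan's recursion for W_{N,1}^{(j)}(y, u_prev | u_j)."""
    if len(y) == 1:
        return W(y[0], u_j)
    half = len(y) // 2
    if len(u_prev) % 2 == 0:
        odd, even = u_prev[0::2], u_prev[1::2]
        return 0.5 * sum(bit_channel(y[:half], xor(odd, even), u_j ^ b, W)
                         * bit_channel(y[half:], even, b, W) for b in (0, 1))
    head, u_odd = u_prev[:-1], u_prev[-1]
    odd, even = head[0::2], head[1::2]
    return 0.5 * (bit_channel(y[:half], xor(odd, even), u_odd ^ u_j, W)
                  * bit_channel(y[half:], even, u_j, W))


def symbol_channel(y, u_prev, sym, W):
    """W_{N,Phi}^{(i+1)}(y, u_1^{i*Phi} | sym) with Phi = len(sym), via Eq. (9)."""
    if len(sym) == 1:
        return bit_channel(y, u_prev, sym[0], W)
    half = len(y) // 2
    p_odd, p_even = u_prev[0::2], u_prev[1::2]
    s_odd, s_even = sym[0::2], sym[1::2]
    return (symbol_channel(y[:half], xor(p_odd, p_even), xor(s_odd, s_even), W)
            * symbol_channel(y[half:], p_even, s_even, W))


def symbol_channel_brute(y, u_prev, sym, W):
    """Same quantity from the definition: average over all future data bits."""
    N = len(y)
    known = list(u_prev) + list(sym)
    total = 0.0
    for tail in product((0, 1), repeat=N - len(known)):
        p = 1.0
        for yi, xi in zip(y, polar_encode(known + list(tail))):
            p *= W(yi, xi)
        total += p
    return total / 2 ** (N - len(sym))
```

For N = 8 and M = 4, the two functions agree for every candidate symbol of both the first and the second symbol position, while the recursive version never enumerates the future bits explicitly.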
Similar to the SC algorithm, with the help of the symbol-based recursive channel combination, an M-bit symbol-decision SC algorithm can be represented by a message flow graph (MFG) as well, where a channel transition probability is referred to as a message for the sake of convenience. This MFG is referred to as the SR-MFG. If the code length of a polar code is N, the SR-MFG can be divided into (n + 1) stages ($S_0, S_1, \cdots, S_n$) from right to left: one initial stage $S_0$ and n calculation stages. For the SC algorithm, all calculation stages carry out Arikan's recursive channel transformation. However, for the M-bit symbol-decision SC algorithm, in the left-most m calculation stages ($S_n, \cdots, S_{n-m+1}$), called S-COMBS stages, symbol-based channel combinations are carried out. For the remaining (n − m) calculation stages ($S_{n-m}, \cdots, S_1$), called B-TRANS stages, Arikan's recursive channel transformations are performed. The S-COMBS stages use the outputs of the B-TRANS stages to calculate symbol-based messages.
For [22]–[24], we refer to the MFG as the DM-MFG, which also consists of two parts: B-TRANS and DM-CAL. The B-TRANS part of the DM-MFG is the same as that of the SR-MFG. However, there is only one stage in the DM-CAL part of the DM-MFG, which performs the direct-mapping calculation.
For example, as shown in Fig. 2, the SR-MFG of a four-bit symbol-decision SC algorithm for a polar code with N = 8 has four stages. Messages of the initial stage ($S_0$) come from the channel directly. Messages of the first stage ($S_1$) are calculated with Arikan's transformations. Messages of the second and third stages ($S_2$ and $S_3$) are calculated with Eq. (9). Stages in the left gray box are the S-COMBS stages. Stages in the right gray box are the B-TRANS stages. Fig. 3 shows the DM-MFG when the direct-mapping calculation is used to calculate the symbol-based channel transition probabilities $W_{8,4}^{(1)}(y_1^8 | u_1^4)$ and $W_{8,4}^{(2)}(y_1^8, u_1^4 | u_5^8)$. Here,
$$v_1^4 = u_{1,o}^8 \oplus u_{1,e}^8, \quad v_5^8 = u_{1,e}^8,$$
$$w_1 = v_1 \oplus v_2 = u_1 \oplus u_2 \oplus u_3 \oplus u_4,$$
$$w_2 = v_3 \oplus v_4 = u_5 \oplus u_6 \oplus u_7 \oplus u_8,$$
$$w_3 = v_2 = u_3 \oplus u_4,$$
$$w_4 = v_4 = u_7 \oplus u_8,$$
$$w_5 = v_5 \oplus v_6 = u_2 \oplus u_4,$$
$$w_6 = v_7 \oplus v_8 = u_6 \oplus u_8,$$
$$w_7 = v_6 = u_4,$$
$$w_8 = v_8 = u_8.$$
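The relations between $u_1^8$ and $w_1^8$ listed above can be verified exhaustively. The Python sketch below solves $u_{iM-M+1}^{iM} = (w_i, w_{i+N/M}, \cdots) B_M F^{\otimes m}$ for the w bits, relying on the fact that $B_M F^{\otimes m}$ is its own inverse over GF(2), and compares the result with the listed expressions for all 256 data sequences. As an extra consistency check (an observation of this sketch, not a statement from the text), it also confirms that each pair $(w_{2k-1}, w_{2k})$ maps to the codeword bits $(x_{2k-1}, x_{2k})$ through a length-2 polar transform. Function names are illustrative.

```python
from itertools import product


def polar_encode(u):
    """x = u * B_N * F^{(x)n} over GF(2)."""
    N, n = len(u), len(u).bit_length() - 1
    x = [u[int(format(i, "0{}b".format(n))[::-1], 2)] for i in range(N)]
    step = 1
    while step < N:
        for i in range(N):
            if i & step == 0:
                x[i] ^= x[i + step]
        step <<= 1
    return x


def intermediate_bits(u, M):
    """Recover w_1^N from u_1^N, one M-bit symbol at a time."""
    N = len(u)
    groups = N // M
    w = [0] * N
    for i in range(groups):
        # B_M F^{(x)m} is an involution, so encoding the symbol inverts the mapping.
        sub = polar_encode(u[i * M:(i + 1) * M])
        for j in range(M):
            w[i + j * groups] = sub[j]
    return w


def example_holds():
    """Check the N = 8, M = 4 identities for every data sequence."""
    for u in product((0, 1), repeat=8):
        u1, u2, u3, u4, u5, u6, u7, u8 = u
        expected = [u1 ^ u2 ^ u3 ^ u4, u5 ^ u6 ^ u7 ^ u8, u3 ^ u4, u7 ^ u8,
                    u2 ^ u4, u6 ^ u8, u4, u8]
        w = intermediate_bits(list(u), 4)
        x = polar_encode(list(u))
        pairs_ok = all(x[k] == w[k] ^ w[k + 1] and x[k + 1] == w[k + 1]
                       for k in range(0, 8, 2))
        if w != expected or not pairs_ok:
            return False
    return True
```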
Fig. 2. The message flow graph of a four-bit symbol-decision SC algorithm for a polar code with a code length of eight by using the proposed symbol-based recursive channel combination.
For the direct-mapping calculation, Eq. (7) needs (M − 1) additions. Therefore, a total of $2^{|\mathcal{AM}_j|}(M - 1)$ additions are needed to calculate all LL-based symbol-based channel transition probabilities for $u_{jM+1}^{jM+M}$. Consider the recursive symbol-based channel combination. The S-COMBS stages of the SR-MFG are indexed as 1 to m from left to right. There are $2^{n-i}$ ($0 < i \le m$) nodes in the i-th S-COMBS stage and each node contains $2^{M+i-n}$ messages. One addition is needed to compute each LL message according to Eq. (9). Hence, the number of additions needed by the S-COMBS stages to calculate $W_{N,M}^{(j)}(\mathbf{y}, \hat{u}_1^{jM-M} | u_{jM-M+1}^{jM})$ is $\sum_{i=1}^{m-1} 2^i 2^{\frac{M}{2^i}} + 2^{|\mathcal{AM}_j|}$.
Fig. 3. The message flow graph of a four-bit symbol-decision SC algorithm for a polar code with a code length of eight by using direct-mapping calculation [22]–[24].
Actually, for a hardware implementation, the worst case, in which all bits of a symbol are information bits, should be considered. Therefore, the recursive symbol-based channel combination can be used to reduce the complexity of calculating the symbol-based channel transition probability.
For the example shown in Fig. 2, Eq. (7) needs $2^4(4 - 1) = 48$ additions to calculate $\log(W_{8,4}^{(1)}(y_1^8 | u_1^4))$. With the symbol-based channel combination, 4, 4 and 16 additions are needed to calculate $\log(W_{4,2}^{(1)}(y_1^4 | v_1^2))$, $\log(W_{4,2}^{(1)}(y_5^8 | v_5^6))$ and $\log(W_{8,4}^{(1)}(y_1^8 | u_1^4))$, respectively. Therefore, our method needs only $2^4 + 2 \times 2^2 = 24$ additions, which is only half of those needed by Eq. (7). Table I lists the numbers of additions needed by our recursive method and the direct-mapping calculation [22]–[24] when all M bits of a symbol are information bits. When M = 8, the number of additions needed by our proposed method is 17% of that needed by the direct-mapping calculation.
TABLE I
THE NUMBERS OF ADDITIONS TO CALCULATE $W_{N,M}^{(j+1)}(\mathbf{y}, \hat{u}_1^{jM} | u_{jM+1}^{jM+M})$ WHEN THE (j + 1)-TH SYMBOL HAS NO FROZEN BIT.
         Proposed method    Direct-mapping calculation [22]–[24]
M = 2    4                  4
M = 4    24                 48
M = 8    304                1792
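The entries of Table I follow directly from the two addition counts given in the text: $2^{|\mathcal{AM}_j|}(M-1)$ for the direct-mapping calculation and $\sum_{i=1}^{m-1} 2^i 2^{M/2^i} + 2^{|\mathcal{AM}_j|}$ for the recursive combination. The short Python sketch below evaluates both; the function names and the optional `info_bits` argument (the number of information bits in the symbol, defaulting to the worst case M) are assumptions of the example.

```python
def additions_direct(M, info_bits=None):
    """Additions for the direct-mapping calculation, Eq. (7): 2^{|AM_j|} (M - 1)."""
    k = M if info_bits is None else info_bits
    return 2 ** k * (M - 1)


def additions_recursive(M, info_bits=None):
    """Additions for the S-COMBS stages: sum_{i=1}^{m-1} 2^i 2^{M/2^i} + 2^{|AM_j|}."""
    m = M.bit_length() - 1          # M = 2^m
    k = M if info_bits is None else info_bits
    inner = sum(2 ** i * 2 ** (M // 2 ** i) for i in range(1, m))
    return inner + 2 ** k


# Reproduce the rows of Table I (no frozen bit in the symbol).
table = {M: (additions_recursive(M), additions_direct(M)) for M in (2, 4, 8)}
ratio_m8 = table[8][0] / table[8][1]   # about 0.17, as stated in the text
```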
The other advantage of the proposed method to calculate the
symbol-based channel transition probability is that it reveals
the similarity between Arikan's recursive channel transformation
and the symbol-based recursive channel combination. We
will take advantage of this similarity to reuse adders and to
save area when computing the bit- and symbol-based channel
transition probabilities in our proposed architecture. In [24],
additional dedicated adders are used to calculate the symbol-
based channel transition probability, which is not area efficient.
In terms of the error performance, the symbol-decision SC
algorithm is not worse than the bit-decision SC algorithm
[27]. Fig. 4 shows the BERs and FERs of symbol-decision
SC algorithms for a (1024, 512) polar code. SDSC-i denotes
the i-bit symbol-decision SC algorithm. When M = 2 and 4,
the FER performance is the same as that of the bit-decision
SC algorithm. When M = 8, the FER performance is slightly
better.
Fig. 4. Error rates of symbol-decision SC algorithms for a (1024, 512) polar
code.
C. Generalized Symbol-Decision SCL Decoding Algorithm
Similarly, the symbol-based recursive channel combination is also useful for the SCL algorithm. The symbol-decision SCL algorithm is more complicated than the SCL algorithm, since the path expansion coefficient is no longer a constant. In the SCL algorithm, for each information bit, the path expansion coefficient is two. But for the M-bit symbol-decision SCL algorithm, the path expansion coefficient is $2^{|\mathcal{AM}_j|}$, which depends on the number of information bits in an M-bit symbol. The M-bit symbol-decision SCL algorithm is formally described in Alg. 3. Without any ambiguity, 0 represents a zero vector whose bit-width is determined by the left-hand operand. The function dec2bin(d, b) converts a decimal number d to a b-bit binary vector. Eq. (9) is used to calculate the symbol-based channel transition probability corresponding to each list, i.e., $W_{N,M}^{(j+1)}(\mathbf{y}, (\mathcal{L}_i)_1^{jM} | u_{jM+1}^{jM+M})$.
Fig. 5 shows the BERs and FERs of symbol-decision
SCL algorithms for a (1024, 480) CRC32-concatenated polar
code with L = 4 where the generator polynomial of the
CRC32 is 0x1EDC6F41. This CRC32 is also used in all the
CRC-concatenated polar codes used in the following section.
SDSCL-i denotes the i-bit symbol-decision SCL algorithm.
The performances of the symbol-decision SCL algorithms with
different symbol sizes are almost the same.
Algorithm 3: M-bit Symbol-Decision SCL Decoding Algorithm
α = 1;
for j = 1 : N/M do
    β = $2^{|\mathcal{AM}_j|}$;
    if β == 1 then
        for i = 1 : α do
            $(\mathcal{L}_i)_{jM-M+1}^{jM} = 0$;
    else if αβ ≤ L then
        $u_{\mathcal{AM}_j^c} = 0$;
        for k = 0 : β − 1 do
            $u_{\mathcal{AM}_j}$ = dec2bin(k, $|\mathcal{AM}_j|$);
            for i = 1 : α do
                t = i + kα;
                $(\mathcal{L}_t)_1^{jM}$ = conc($(\mathcal{L}_i)_1^{jM-M}$, $u_{jM-M+1}^{jM}$);
        α = αβ;
    else
        $u_{\mathcal{AM}_j^c} = 0$;
        for k = 0 : β − 1 do
            $u_{\mathcal{AM}_j}$ = dec2bin(k, $|\mathcal{AM}_j|$);
            for i = 1 : L do
                t = i + kL;
                S[t].P = $W_{N,M}^{(j)}(\mathbf{y}, (\mathcal{L}_i)_1^{jM-M} | u_{jM-M+1}^{jM})$;
                S[t].L = $(\mathcal{L}_i)_1^{jM-M}$;
                S[t].U = $u_{jM-M+1}^{jM}$;
        sortPDecrement(S);
        for i = 1 : L do
            $(\mathcal{L}_i)_1^{jM}$ = conc(S[i].L, S[i].U);
        α = L;
Fig. 5. Error rates of symbol-decision SCL algorithms for a (1024, 480)
CRC32-concatenated polar code with L = 4.
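The control flow of Alg. 3 can be sketched in Python as follows. To keep the example short and self-contained, the symbol-based transition probability is supplied through a hypothetical callback `metric(prefix, symbol)` standing in for $W_{N,M}^{(j)}(\mathbf{y}, (\mathcal{L}_i)_1^{jM-M} | u_{jM-M+1}^{jM})$; the function name `sd_scl_decode`, the 0-based information set, and the plain full sort used for pruning are all assumptions of this sketch. Unlike the listing, the pruning branch iterates over the currently active paths rather than over a fixed L.

```python
from itertools import product


def sd_scl_decode(N, M, L, info_set, metric):
    """Sketch of M-bit symbol-decision SCL decoding.

    info_set holds 0-based information-bit positions; metric(prefix, symbol)
    returns the reliability of extending `prefix` by `symbol` (larger is better).
    Returns the surviving paths, most reliable first after any pruning step.
    """
    paths = [[]]
    for j in range(N // M):
        info_pos = [p for p in range(M) if j * M + p in info_set]
        # Enumerate the 2^{|AM_j|} candidate symbols; frozen positions stay zero.
        symbols = []
        for bits in product((0, 1), repeat=len(info_pos)):
            sym = [0] * M
            for p, b in zip(info_pos, bits):
                sym[p] = b
            symbols.append(sym)
        if len(paths) * len(symbols) <= L:
            # Enough room in the list: keep every extension.
            paths = [p + s for s in symbols for p in paths]
        else:
            # Too many candidates: score all extensions and keep the best L.
            scored = [(metric(p, s), p + s) for p in paths for s in symbols]
            scored.sort(key=lambda t: t[0], reverse=True)
            paths = [p for _, p in scored[:L]]
    return paths
```

In a complete decoder the callback would be backed by the recursive channel combination of Eq. (9), and for CA-SCL the final choice among the returned paths would additionally consult the CRC.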
D. Two-Stage List Pruning Network for Symbol-Decision SCL Algorithm
For the M-bit symbol-decision SCL algorithm, the maximum path expansion coefficient is $2^M$, i.e., each existing path generates $2^M$ paths. Therefore, in the worst-case scenario, the L most reliable paths should be selected out of $2^M L$ paths. To implement this sorting network efficiently, we propose a two-stage list pruning network. In the first stage, the q most reliable paths are selected from up to $2^M$ paths that come from the expansion of each existing path. Therefore, there are qL paths left. In the second stage, the L most reliable paths are sorted out from the qL paths generated by the first stage. The message flow of a two-stage list pruning network is illustrated in Fig. 6.
[Figure 6: L first-stage $2^M$-path sorting functions, each taking up to $2^M$ paths and outputting q paths, feed one second-stage qL-path sorting function that outputs L paths.]
Fig. 6. Message flow for a two-stage list pruning network.
If q ≥ L, the L paths found by the two-stage list pruning network are exactly the L most reliable paths among the $2^M L$ paths. When q < L, this is no longer guaranteed, and the probability that the L paths found by the two-stage list pruning network are exactly the L most reliable paths among the $2^M L$ paths decreases as q decreases. This may cause some performance loss. But a smaller q leads to a two-stage list pruning network with lower complexity.
Fig. 7. Error rates of the SDSCL-8 decoder for a (1024, 480) CRC32-
concatenated polar code with L = 4.
Fig. 7 shows how different values of q affect the error performance of an SDSCL-8 algorithm for a (1024, 480) CRC32-concatenated polar code with L = 4. When L = 4 and q = 2, the SDSCL-8 algorithm shows an FER performance loss of about 0.25 dB at an FER level of $10^{-3}$. As shown in Fig. 8, for a (2048, 1401) CRC32-concatenated polar code, the two-stage list pruning network with q = 4 helps to reduce the complexity of the SDSCL-4 decoder without observed performance loss when L = 8. When q = 2 and L = 8, the SDSCL-4 decoder has a performance degradation of about 0.1 dB at an FER level of $10^{-3}$, compared with the SDSCL-4 decoder with q = 8 and L = 8. If L = 4, the error performance loss due to q = 2 is very small.
Fig. 8. Error rates of the SDSCL-4 algorithm for a (2048, 1401) CRC32-concatenated polar code with L = 4 and L = 8.
Therefore, the two-stage list pruning network uses an additional
parameter q to introduce different trade-offs between
error performance and complexity.
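The behavior of the two-stage list pruning network described in this subsection can be illustrated with a small software model. The Python sketch below is not a hardware description: it models each sorting function with `heapq.nlargest`, and the function names and the toy path metrics are assumptions chosen for illustration.

```python
import heapq


def two_stage_prune(children, L, q):
    """children[i] lists (metric, path_id) pairs expanded from existing path i."""
    survivors = []
    for candidates in children:
        # Stage 1: keep the q most reliable extensions of each existing path.
        survivors.extend(heapq.nlargest(q, candidates, key=lambda t: t[0]))
    # Stage 2: keep the L most reliable of the (at most) qL survivors.
    return heapq.nlargest(L, survivors, key=lambda t: t[0])


def exact_prune(children, L):
    """Reference: select the L most reliable paths among all candidates at once."""
    everything = [c for candidates in children for c in candidates]
    return heapq.nlargest(L, everything, key=lambda t: t[0])


# Toy example with L = 4 existing paths, each expanded into 2^M = 4 candidates.
children = [
    [(0.9, "a0"), (0.8, "a1"), (0.7, "a2"), (0.05, "a3")],
    [(0.6, "b0"), (0.1, "b1"), (0.02, "b2"), (0.01, "b3")],
    [(0.5, "c0"), (0.4, "c1"), (0.03, "c2"), (0.015, "c3")],
    [(0.3, "d0"), (0.2, "d1"), (0.04, "d2"), (0.025, "d3")],
]
exact = [pid for _, pid in exact_prune(children, 4)]
with_q4 = [pid for _, pid in two_stage_prune(children, 4, 4)]
with_q2 = [pid for _, pid in two_stage_prune(children, 4, 2)]
```

With q = 4 the result matches the exact selection (a0, a1, a2, b0). With q = 2 the third-best extension of the first path is discarded in the first stage, so the output becomes (a0, a1, b0, c0), which is the kind of deviation that can cause the performance loss discussed above.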
IV. PRE-COMPUTATION MEMORY-SAVING TECHNIQUE
The pre-computation technique was first proposed in [25] and
can be used to improve processing rate when the number
of possible outputs is finite. In [17], the pre-computation
technique is used to improve the throughput of the line SC
decoder with an additional cost of increased area. Here, our
main purpose is to use the pre-computation technique to reduce
the memory required by list decoders because the memory of
an SCL decoder to store the channel transition probability
becomes a big challenge as the list size and code length
increase. Henceforth, this memory saving technique is called
the pre-computation memory-saving (PCMS) technique. It is
worth noting that this memory-saving technique is independent
of the decoder architecture and the message representation of
SCL decoders.
Let us take the MFG shown in Fig. 2 as an example. For stages $S_0$ and $S_1$, the numbers of pairs of LLs stored by the list decoder are 8 and 4L, respectively. Actually, the outgoing message $W_{2,1}^{(1)}(y_1^2 | w_1)$ of the top black node in $S_1$ can only be either $W_{2,1}^{(1)}(y_1^2 | 0)$ or $W_{2,1}^{(1)}(y_1^2 | 1)$. The outgoing message $W_{2,1}^{(2)}(y_1^2, w_1 | w_2)$ can only be one of $W_{2,1}^{(2)}(y_1^2, 0 | 0)$, $W_{2,1}^{(2)}(y_1^2, 0 | 1)$, $W_{2,1}^{(2)}(y_1^2, 1 | 0)$, and $W_{2,1}^{(2)}(y_1^2, 1 | 1)$. Hence, no matter what the list size is, the total number of possible values of outgoing messages of $S_1$ is $2 \times 4 + 4 \times 4 = 24$. These 24 values provide all the information we need for the calculations of further stages. With knowledge of these 24 values, the channel LLs are no longer needed.
Generally speaking, the PCMS technique takes advantage of
the relationship between messages of S0 (channel LLs), and
outgoing messages of S1. By storing only all possible outgoing
messages of S1, the PCMS technique helps list decoders save
memory.
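As an illustration of the idea above, the following minimal sketch precomputes every possible outgoing message of S1 for a length-8 example. It assumes LL messages with the max-log approximation for the upper node; the helper names (`f_node`, `g_node`, `precompute_s1`) and the channel values are hypothetical and are not taken from the decoder in this paper.

```python
def f_node(a, b):
    """Upper node: max-log LL pair over the hidden bit u2 (assumed approximation)."""
    return tuple(max(a[u1 ^ u2] + b[u2] for u2 in (0, 1)) for u1 in (0, 1))


def g_node(a, b, u1):
    """Lower node: LL pair for a given partial sum u1."""
    return tuple(a[u1 ^ u2] + b[u2] for u2 in (0, 1))


def precompute_s1(channel_ll):
    """Store every possible S1 outgoing message once, independent of the list size."""
    table = []
    for k in range(0, len(channel_ll), 2):
        a, b = channel_ll[k], channel_ll[k + 1]
        table.append({"f": f_node(a, b),
                      "g": {u1: g_node(a, b, u1) for u1 in (0, 1)}})
    return table


def count_values(table):
    """Count stored LL values: 2 per f entry plus 2 per (g, partial sum) entry."""
    return sum(len(e["f"]) + sum(len(p) for p in e["g"].values()) for e in table)


# Hypothetical channel LL pairs (LL of bit 0, LL of bit 1) for N = 8.
channel_ll = [(-1, -24), (-19, -2), (-3, -14), (-5, -9),
              (-22, -1), (-7, -7), (-2, -18), (-11, -4)]
s1_table = precompute_s1(channel_ll)
print(count_values(s1_table))  # 2 * 4 + 4 * 4 = 24

# Each decoding path only looks up its entry with its own partial sum w1.
path_partial_sums = [0, 1, 1, 0]  # hypothetical w1 values for L = 4 paths
g_messages = [s1_table[0]["g"][w1] for w1 in path_partial_sums]
```

The table size does not depend on the number of paths, which is the property the PCMS technique relies on.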
Let us evaluate the memory saving of the PCMS technique,
assuming the LL representation is used for the channel transition
probability. Without the PCMS technique, a list decoder with a list
size of L for a polar code of length N stores
(N − 2)L + N LL pairs. Each pair contains two messages,
which are associated with the conditional bit being zero or
one. The total number of bits used for LL storage is
$$
\begin{aligned}
B_{LL} &= 2\left( N Q_{ch} + L \sum_{i=1}^{\log N - 1} 2^i (Q_{ch} + \log N - i) \right) \\
&= 2(L+1) N Q_{ch} + 4L(N - \log N - Q_{ch} - 1),
\end{aligned}
\tag{10}
$$
where $Q_{ch}$ denotes the number of bits used for the quantization
of the channel LLs.
With the PCMS technique, the total number of LL pairs
needed by a list decoder is $\frac{N}{2}L + \frac{3}{2}N$. The total number of
bits needed for LL storage is:
$$
\begin{aligned}
B_{PCMS} &= 2\left(\frac{N}{2} + N\right)(Q_{ch} + 1) + 2L \sum_{i=1}^{\log N - 2} 2^i (Q_{ch} + \log N - i) \\
&= 3N(Q_{ch} + 1) + LN(Q_{ch} + 3) - 4L(\log N + Q_{ch} + 1) \\
&= B_{LL} - N(L Q_{ch} + L - Q_{ch} - 3).
\end{aligned}
\tag{11}
$$
Therefore, when the LL representation is used for messages,
the PCMS technique saves $N(L Q_{ch} + L - Q_{ch} - 3)$ bits of
memory. The saving is linear in both N and L. Consider
a polar code with N = 1024 and a list decoder with L = 4 and
$Q_{ch} = 4$. Without the PCMS technique, $B_{LL} = 57104$. With
the PCMS technique, $B_{PCMS} = 43792$. The PCMS technique
helps to save 13312 bits of memory, which is 23% of $B_{LL}$.
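The numbers in this example can be checked directly from the summation forms of Eqs. (10) and (11). The short sketch below does so; the function names `ll_bits` and `pcms_bits` are ours, and N is assumed to be a power of two.

```python
def ll_bits(N, L, Qch):
    """Bits for LL storage without PCMS, summation form of Eq. (10)."""
    n = N.bit_length() - 1  # log2(N) for N a power of two
    return 2 * (N * Qch + L * sum(2**i * (Qch + n - i) for i in range(1, n)))


def pcms_bits(N, L, Qch):
    """Bits for LL storage with PCMS, summation form of Eq. (11)."""
    n = N.bit_length() - 1
    return (2 * (N // 2 + N) * (Qch + 1)
            + 2 * L * sum(2**i * (Qch + n - i) for i in range(1, n - 1)))


N, L, Qch = 1024, 4, 4
b_ll, b_pcms = ll_bits(N, L, Qch), pcms_bits(N, L, Qch)
saving = b_ll - b_pcms
print(b_ll, b_pcms, saving)            # 57104 43792 13312
print(round(100 * saving / b_ll))      # 23 (percent of B_LL)
```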
The other advantage of the PCMS technique is that it
improves the throughput slightly because the messages of S1
are already in the memory and do not need to be calculated
from the channel messages. For example, for a bit-decision
semi-parallel SCL decoder with a list size of L, if the code
length is N and the number of processing units is P, the
latency saving due to the PCMS technique is $\frac{NL}{P}$ clock cycles.
V. IMPLEMENTATION OF SYMBOL-DECISION SCL DECODERS
A. Architecture of Symbol-Decision SCL Decoders
We propose an architecture of an M-bit symbol-decision
SCL decoder, shown in Fig. 9. It consists of M MPU blocks
($\mathrm{MPU}_0, \mathrm{MPU}_1, \cdots, \mathrm{MPU}_{M-1}$), a list pruning network (LPN),
a mask bit generator (MBG), a message-screening block
(MSNG), a control block (CNTL), an output-list generator
(OLG), and a CRC checker (CRCC).
Fig. 9. Top architecture for an M-bit symbol-decision SCL decoder.
An MPU block calculates messages for the B-TRANS and S-COMBS
stages and updates the partial-sum network by
adopting blocks of the SCL decoder in [20]. The additions of
the S-COMBS stages are carried out by reusing the same hardware
resources that are used to calculate messages of the B-TRANS
stages, which reduces the area. Compared with the SCL decoder
in [20], the MPU has neither a path pruning unit nor a CRC
checker. The other improvement for the MPU is that the PCMS
technique is used here. The architecture of an MPU is shown
in Fig. 10. Channel messages are not needed any more due
to the adoption of the PCMS technique. L-MEM stores messages
corresponding to stages of the MFG. For the stage S1, MSEL
selects the appropriate messages from L-MEM based on partial-sum
values and/or the type of calculation nodes. PUs are
processing units that calculate LL messages. PSUs are used to
update partial sums. ISel selects messages from L-MEM or the
OSel module for the crossbar (CB) module, which chooses
proper messages for the PUs. OSel outputs messages to L-MEM
for intermediate stages and outputs symbol-based messages to
MSNG.
Fig. 10. Architecture of an MPU.
We take the MFG of Fig. 2 as an example to illustrate
the function of block MSEL. For node f21 of path
$l$, $\{W_{2,1}^{(1)}(y_1^2|0), W_{2,1}^{(1)}(y_1^2|1)\}_l$ and $\{W_{2,1}^{(1)}(y_3^4|0), W_{2,1}^{(1)}(y_3^4|1)\}_l$
are selected from L-MEM by MSEL and output to ISel.
For node g21 of path $l$, $\{W_{2,1}^{(2)}(y_1^2, w_{1l}|0), W_{2,1}^{(2)}(y_1^2, w_{1l}|1)\}_l$
and $\{W_{2,1}^{(2)}(y_3^4, w_{3l}|0), W_{2,1}^{(2)}(y_3^4, w_{3l}|1)\}_l$ are selected from
L-MEM. Here, $w_{1l}$ and $w_{3l}$ are the partial sums for $w_1$ and $w_3$,
respectively, belonging to path $l$. The detailed information of
other blocks in Fig. 10 can be found in [20] and will not be
discussed in this paper.
The message-passing scheme in the MFG of a polar code is
serial, which means that the calculation of a stage
depends on the output of its previous stage. The PUs in [20]
only carry out the B-TRANS additions. On the other hand,
the S-COMBS stages need only additions, and a processing
unit has four adders. Therefore, in order to save hardware
resources, the adders in the processing units are reused to
calculate the symbol-based channel transition probability after
these processing units finish the calculations for the B-TRANS
stages. In other words, additions of both the B-TRANS and
the S-COMBS stages are folded onto the same adders in the
processing units. As shown in Fig. 11, c[0] and c[1] are
outputs for the B-TRANS stages; d[0], d[1], d[2], and d[3] are
outputs for the S-COMBS stages.
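A behavioural sketch of this adder sharing is given below for the two-bit symbol case. It assumes LL messages, the max-log approximation for the upper B-TRANS node, and a particular ordering of the four sums; the signal names a, b, c, d, u, and mode follow the text, but the wiring and the function `processing_unit` are our assumptions rather than the circuit of Fig. 11.

```python
def processing_unit(a, b, mode, u=0):
    """a, b: LL pairs indexed by bit value. mode: 'f', 'g', or 'combine'."""
    # Four shared additions: d[2*u1 + u2] = a[u1 ^ u2] + b[u2].
    d = [a[u1 ^ u2] + b[u2] for u1 in (0, 1) for u2 in (0, 1)]
    if mode == "combine":   # S-COMBS: LLs of the two-bit symbol (u1, u2)
        return d
    if mode == "f":         # B-TRANS upper node: maximise over the hidden bit u2
        return [max(d[0], d[1]), max(d[2], d[3])]
    if mode == "g":         # B-TRANS lower node: u1 is fixed to the partial sum u
        return [d[2 * u], d[2 * u + 1]]
    raise ValueError(f"unknown mode: {mode}")


a, b = (0, -3), (-1, 0)                     # hypothetical input LL pairs
d_out = processing_unit(a, b, "combine")    # [-1, -3, -4, 0]
c_f = processing_unit(a, b, "f")            # [-1, 0]
c_g = processing_unit(a, b, "g", u=1)       # [-4, 0]
print(d_out, c_f, c_g)
```

In this sketch the same four sums serve all three modes, so only the comparison and selection logic differs between B-TRANS and S-COMBS operation.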
Fig. 11. Architecture of a processing unit.
Block MBG provides a mask bit for each path. If there
are $f$ ($f \geq 0$) frozen bits in the M-bit symbol, the number
of expanded paths will be $2^{M-f}$. For hardware implementations,
we need to consider the worst case, and all messages
corresponding to the $2^M$ possible paths are calculated. Each path
is associated with a mask bit. When some paths are not
needed, due to frozen bits, they are turned off by mask bits.
Fig. 12 shows how to generate the mask bit for path $i$, where
$i = (i_1, i_2, \cdots, i_M) \in \{1, 0\}^M$ ($0 \leq i \leq 2^M - 1$) and
$b_j = (b_{j,1}, b_{j,2}, \cdots, b_{j,M})$ is a frozen-bit indication vector
for $u_{jM+1}^{jM+M}$. If $u_{jM+t}$ is a frozen bit, $b_{j,t} = 1$. Otherwise,
$b_{j,t} = 0$. If $b_j$ is an all-one vector, all bits of $u_{jM+1}^{jM+M}$ are
frozen bits, and $u_{jM+1}^{jM+M}$ is called an M-bit frozen vector. If $\mathrm{Mask\_bit}_i$ is 1,
$u_{jM+1}^{jM+M}$ cannot be $i$, and the message corresponding
to $u_{jM+1}^{jM+M} = i$ is set to 0 in block MSNG.
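One way to realize the mask bit described above is sketched below. It assumes that frozen bits are fixed to zero, so a candidate is masked whenever it assigns a one to any frozen position (an OR of bitwise ANDs). This logic is our reading of the description, not a transcription of Fig. 12, and the function names are hypothetical.

```python
from itertools import product


def mask_bit(i, b):
    """Return 1 if candidate i places a 1 on any frozen position, else 0."""
    return int(any(i_t & b_t for i_t, b_t in zip(i, b)))


def surviving_candidates(b):
    """All M-bit candidates whose mask bit is 0 for the indication vector b."""
    return [i for i in product((0, 1), repeat=len(b)) if mask_bit(i, b) == 0]


b = (1, 0, 1, 0)                 # M = 4 with f = 2 frozen bits
alive = surviving_candidates(b)
print(len(alive))                # 4, i.e. 2 ** (M - f)
```

With an all-one indication vector only the all-zero candidate survives, which corresponds to the M-bit frozen vector case.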
Fig. 12. Architecture for generating a mask bit.
Block LPN receives $2^M L$ messages from block MSNG,
finds the most reliable L paths, and feeds decision results
back to the MPUs. Here, we use two different sorting
implementations, a folded sorting implementation and a tree
sorting implementation, for different designs. The basic unit
for these two implementations is a bitonic sorter [28], which
outputs the L max values out of 2L inputs. It is referred
to as BS_L. The folded sorting implementation needs $2^{M-1}$
BS_Ls ($\mathrm{BS\_L}_0, \mathrm{BS\_L}_1, \cdots, \mathrm{BS\_L}_{2^{M-1}-1}$). The outputs of
$\mathrm{BS\_L}_{2i}$ and $\mathrm{BS\_L}_{2i+1}$ ($0 \leq i < 2^{M-2}$) are connected
with the inputs of $\mathrm{BS\_L}_i$ through registers and multiplexers. For
the tree sorting implementation with $2^M L$ inputs, $2^M - 1$
BS_Ls are needed. The tree sorting implementation can be
divided into M layers. For $0 \leq i < M$, there are $2^i$ BS_Ls
in the $i$-th layer. Inputs of the BS_Ls of the $i$-th layer are
connected with outputs of the BS_Ls of the $(i+1)$-th layer.
Figs. 13 and 14 show examples of the folded and tree sorting
implementations, respectively, for $2^M = 8$.
Fig. 13. Architecture for the folded sorting implementation when $2^M = 8$.
Fig. 14. Architecture for the tree sorting implementation when $2^M = 8$.
The folded sorting implementation has a smaller area than
the tree sorting implementation. However, pipelining can
be applied to the tree sorting implementation by inserting
registers between layers, which improves the throughput of the tree
sorting implementation.
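The layered reduction can be modelled behaviourally as follows. The sketch stands in for each BS_L with a plain software sort (not a bitonic network) and counts how many BS_L evaluations a tree needs; the names `bs_l` and `tree_sort` and the test metrics are hypothetical.

```python
def bs_l(values, L):
    """Behavioural stand-in for BS_L: keep the L largest of 2L inputs."""
    assert len(values) == 2 * L
    return sorted(values, reverse=True)[:L]


def tree_sort(metrics, L):
    """Reduce 2**M * L metrics to the L largest; return (result, BS_L count)."""
    groups = [metrics[k:k + 2 * L] for k in range(0, len(metrics), 2 * L)]
    layer = [bs_l(g, L) for g in groups]          # layer fed by the 2**M * L inputs
    used = len(layer)
    while len(layer) > 1:                         # merge neighbours layer by layer
        layer = [bs_l(layer[k] + layer[k + 1], L)
                 for k in range(0, len(layer), 2)]
        used += len(layer)
    return layer[0], used


L = 4
metrics = [(7 * k) % 32 for k in range(32)]       # a permutation of 0..31 (2**M = 8)
best, used = tree_sort(metrics, L)
print(best, used)                                 # [31, 30, 29, 28] 7
```

A folded implementation would evaluate the same layers one after another on the $2^{M-1}$ sorters of the first layer, trading latency for area.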
For the two-stage list pruning network proposed in
Sec. III-D, either the folded sorting implementation or the tree
sorting implementation can be used for the $2^M$-to-$q$ sorting
function and the $qL$-to-$L$ sorting function.
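The two sorting functions can be illustrated with a small behavioural model. The sketch assumes that the first stage keeps the q best extensions of each parent path and the second stage keeps the L best of the remaining qL candidates; the function name and the example metrics are hypothetical.

```python
def two_stage_prune(metrics, L, q):
    """metrics[l][s] is the metric of parent path l extended by symbol s."""
    stage1 = []
    for l, row in enumerate(metrics):
        # Stage 1 (2**M-to-q): keep the q best extensions of this parent path.
        best = sorted(range(len(row)), key=lambda s: row[s], reverse=True)[:q]
        stage1.extend((row[s], l, s) for s in best)
    # Stage 2 (qL-to-L): keep the L best of the q*L remaining candidates.
    return sorted(stage1, reverse=True)[:L]


# Hypothetical metrics for L = 2 parent paths and 2-bit symbols (4 extensions each).
metrics = [[-1, -5, -2, -9],
           [-3, -4, -8, -7]]
print(two_stage_prune(metrics, L=2, q=2))   # [(-1, 0, 0), (-2, 0, 2)]
print(two_stage_prune(metrics, L=2, q=1))   # [(-1, 0, 0), (-3, 1, 0)]
```

In this toy case q = 2 still returns the two globally best candidates, whereas q = 1 discards the second-best one, which mirrors the trade-off between complexity and error performance controlled by q.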
Block CNTL provides control signals to schedule the hardware
sharing for the MPUs and decides when to start pruning
paths. The signal frz_flag is an indicator that is one when
a frozen vector appears. When frz_flag is one, all MPUs use
zero to update the partial sums instead of the outputs of the LPN.
In this case, the LPN, the MSNG, and the calculation of the
S-COMBS stages are bypassed. The OLG stores the output
paths. The CRCC checks whether a path satisfies the CRC constraint.
B. Message Scheduling and Latency Analysis
To improve area efficiency, different scheduling schemes are
needed for different numbers of PUs. To reuse the adders
of the processing units, the additions of the S-COMBS stages
in the MFG must be scheduled properly. Assume the number
of processing units is P. The total number of adders
provided by the processing units is 4P. If $2^M L \leq 4P$, we use a
serial scheduling, which means that there is no overlap between the
processing units and the LPN in terms of operation time,
as shown in Fig. 15.
Fig. 15. Serial scheduling (in clock cycles).
Suppose each addition takes one clock cycle. Then each S-
COMBS stage takes one clock cycle to compute messages.
Therefore, it takes m clock cycles for the S-COMBS stages
to output messages to the LPN. To save the area, the folded
sorting implementation is applied for the serial scheduling.
When $2^M L > 4P \geq 2^{M/2} L$, there are not enough
adders to calculate all $2^M L$ messages of the stage $S_n$ in
one clock cycle, but all $2^{M/2^{n-i}} L$ messages of the stage $S_i$
($n-m+1 \leq i \leq n-1$) can be calculated in one clock cycle.
Without increasing the number of adders, $\frac{2^M L}{4P}$ cycles are
needed for the stage $S_n$. In each cycle, 4P messages are calculated. To reduce
the latency, the overlapping scheduling shown in Fig. 16 is
used. In clock cycle $c_0$, the first 4P messages come out. In
clock cycle $c_1$, the LPN starts to work. Therefore, the MPUs
and the LPN work simultaneously for $\frac{2^M L}{4P} - 1$ clock
cycles. Here, the LPN works in a pipelined way. Hence, the
tree sorting implementation is deployed for the overlapping
scheduling, and a BS_L is connected at the end of the tree
sorting implementation in the way shown in Fig. 17, where
the number on a line represents the number of messages
transmitted through the line.
Fig. 16. Overlapping scheduling (in clock cycles).
Fig. 17. A pipelined tree sorting implementation for the overlapping
scheduling.
The latency of an M-bit symbol-decision SCL decoder
consists of the latency for calculating messages of the
B-TRANS stages, the latency for calculating messages of the
S-COMBS stages, and the latency of the list pruning network.
$T_B$ represents the overall number of clock cycles for the
calculations of the B-TRANS stages. It is equivalent to the
latency of a bit-decision SCL decoder with a code length of
$\frac{N}{M}$ and $\frac{P}{M}$ processing units:
$$
T_B = 2\frac{N}{M} + \frac{NL/M}{P/M} \log_2\left(\frac{NL/M}{4P/M}\right) - \frac{NL/M}{P/M},
$$
where the third term, $-\frac{NL/M}{P/M}$, is the latency saving by using
the PCMS technique. $T_S$ represents the number of clock cycles
for the calculations of the S-COMBS stages per symbol. $T_N$
represents the number of extra clock cycles per symbol needed
by the LPN to finish the list pruning after all messages of the
stage $S_n$ are calculated. If $2^M L \leq 4P$, the number of clock
cycles used to calculate messages for the S-COMBS stages is
$T_S = m$. When $2^M L > 4P \geq 2^{M/2} L$, $T_S = m - 1 + \lceil \frac{2^M L}{4P} \rceil$.
More generally, $T_S \leq \sum_{i=1}^{m} \lceil \frac{2^{2^i} L}{4P} \rceil$. $T_N$ is determined by the
detailed implementation. Hence, the latency of the symbol-decision
SCL decoder is:
$$
\begin{aligned}
T(M) &= (1 - \gamma)\frac{N}{M}(T_S + T_N) + T_B \\
&= (1 - \gamma)\frac{N}{M}(T_S + T_N) + 2\frac{N}{M} + \frac{NL}{P} \log_2\left(\frac{NL}{8P}\right),
\end{aligned}
\tag{12}
$$
where $\gamma$ is the ratio of the number of frozen vectors to $\frac{N}{M}$.
Table II shows the latencies (in clock cycles) for different
decoders to decode a (1024, 480) CRC32-concatenated polar
code with 64 processing units and L = 4. We assume a BS_L
needs one clock cycle to find the four maximum values out
of eight values. For M = 2 and M = 4, a folded sorting
implementation and the serial scheduling are used. For M = 8,
a pipelined tree sorting implementation and the overlapping
scheduling are applied. For M = 8 and q = 2, the basic unit
in the tree sorting implementation finds the two maximum
values out of eight values, which needs one clock cycle.
Therefore, $T_N = 4$ when M = 8 and q = 2.
TABLE II
LATENCIES FOR DIFFERENT DECODERS FOR A (1024, 480)
CRC32-CONCATENATED POLAR CODE WITH 64 PROCESSING UNITS AND
L = 4.
| Decoder | γ | $T_S$ | $T_N$ | q | Latency (# of cycles) |
|---|---|---|---|---|---|
| SDSCL-2 | 0.445 | 1 | 2 | 4 | 2069 |
| SDSCL-4 | 0.395 | 2 | 4 | 4 | 1634 |
| SDSCL-8 | 0.344 | 6 | 7 | 4 | 1540 |
| SDSCL-8 | 0.344 | 6 | 4 | 2 | 1288 |
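The entries of Table II can be reproduced from Eq. (12) with a few lines of code. The sketch below takes γ and $T_N$ from the table as inputs, computes $T_S$ from the two scheduling cases, and rounds the result up to whole clock cycles; the rounding and the function names are our assumptions.

```python
from math import ceil, log2


def t_s(M, L, P):
    """Clock cycles per symbol for the S-COMBS stages (two cases from the text)."""
    m = M.bit_length() - 1                    # m = log2(M) for M a power of two
    if 2**M * L <= 4 * P:                     # serial scheduling
        return m
    if 4 * P >= 2**(M // 2) * L:              # overlapping scheduling
        return m - 1 + ceil(2**M * L / (4 * P))
    raise ValueError("configuration outside the two cases covered here")


def latency(N, M, L, P, gamma, t_n):
    """Eq. (12), rounded up to an integer number of clock cycles."""
    t_b = 2 * N / M + (N * L / P) * log2(N * L / (8 * P))
    return ceil((1 - gamma) * (N / M) * (t_s(M, L, P) + t_n) + t_b)


N, L, P = 1024, 4, 64
rows = [(2, 0.445, 2), (4, 0.395, 4), (8, 0.344, 7), (8, 0.344, 4)]  # (M, gamma, T_N)
cycles = [latency(N, M, L, P, gamma, t_n) for M, gamma, t_n in rows]
print(cycles)   # [2069, 1634, 1540, 1288]
```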
It is claimed in [22] that the M-bit SDSCL decoder could
have M times faster decoding speed than the bit-decision
SCL decoder, which is much better than our implementation
results. Let us review Eq. (12) again. For a fair comparison,
suppose the MPUs of the M-bit SDSCL decoder have the
same architecture as the conventional SCL decoder. Then a
conventional SCL decoder with the PCMS technique has a
latency of $T(1) = 2N + \frac{NL}{P} \log_2 \frac{NL}{8P}$. The decoding speed
gain of the M-bit SDSCL decoder is
$$
\begin{aligned}
\frac{T(1)}{T(M)} &= \frac{2N + \frac{NL}{P} \log_2 \frac{NL}{8P}}{(1 - \gamma)\frac{N}{M}(T_S + T_N) + 2\frac{N}{M} + \frac{NL}{P} \log_2 \frac{NL}{8P}} \\
&= M - \frac{(1 - \gamma)N(T_S + T_N) + (M - 1)\frac{NL}{P} \log_2 \frac{NL}{8P}}{(1 - \gamma)\frac{N}{M}(T_S + T_N) + 2\frac{N}{M} + \frac{NL}{P} \log_2 \frac{NL}{8P}}.
\end{aligned}
\tag{13}
$$
To be exactly M (M > 1) times faster, $(1 - \gamma)N(T_S + T_N) + (M - 1)\frac{NL}{P} \log_2 \frac{NL}{8P}$
should be zero. For NL > 8P,
$\frac{T(1)}{T(M)} < M$, because $T_S \geq 0$ and $T_N \geq 0$. For NL = 8P,
$T_S = T_N = 0$ should be satisfied, which means that the
calculation of the symbol-based channel transition probability
and the list pruning procedure do not take any clock cycle.
This is impractical, because $T_S$ and $T_N$ cannot be zero in a
practical design. If NL < 8P and P ≤ NL, to achieve M
times faster decoding, $T_S + T_N = \frac{(M-1) L \log_2 \frac{8P}{NL}}{(1-\gamma)P} < \frac{5(M-1)}{(1-\gamma)N}$. Usually,
$(1 - \gamma)N \gg 5(M - 1)$, so $T_S + T_N$ would have to be far below one clock cycle.
Therefore, the statement about the
decoding speed gain in [22] is too idealistic to be achieved
in practice, because a practical implementation needs some
extra cycles to calculate the symbol-based channel transition
probability and to perform the list pruning function.
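To make the argument concrete, the speed gain of Eq. (13) can be evaluated for the four configurations of Table II. The sketch below does this with γ, $T_S$, and $T_N$ taken from that table; the function name `speed_gain` is ours.

```python
from math import log2


def speed_gain(N, M, L, P, gamma, t_s, t_n):
    """T(1) / T(M) from Eq. (13), without rounding to whole cycles."""
    x = (N * L / P) * log2(N * L / (8 * P))      # term shared by T(1) and T(M)
    t_1 = 2 * N + x
    t_m = (1 - gamma) * (N / M) * (t_s + t_n) + 2 * N / M + x
    return t_1 / t_m


# (M, gamma, T_S, T_N) for SDSCL-2, SDSCL-4, SDSCL-8 (q = 4), SDSCL-8 (q = 2).
rows = [(2, 0.445, 1, 2), (4, 0.395, 2, 4), (8, 0.344, 6, 7), (8, 0.344, 6, 4)]
gains = [speed_gain(1024, M, 4, 64, gamma, ts, tn) for M, gamma, ts, tn in rows]
for (M, *_), gain in zip(rows, gains):
    print(f"M = {M}: gain = {gain:.2f}")
```

For these parameters every gain stays below two, well short of M, in line with the analysis above.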
C. Synthesis Results
To implement the proposed symbol-decision SCL decoder,
we consider only M = 2, 4, and 8. For M ≥ 16, it is
impractical to build list pruning networks. For example, in
the worst case of M = 16, where all the bits of a symbol are
information bits, there are $2^{16}L = 65536L$ paths. Even if
L = 1, finding the maximum value among 65536 values still
needs a huge amount of hardware resources and leads to a
huge latency.
In our implementations, L = 4. Each implementation has
64 processing units. LL messages are used in our designs.
The channel LL messages are quantized with 4 bits. A (1024,
480) CRC32-concatenated polar code is used. The synthesis
tool is Cadence RTL Compiler. The process technology is a
TSMC 90nm CMOS technology. Our proposed architectures
are compared with the state-of-the-art SCL architectures in
[19]–[21], [24], which cover both bit- and symbol-decision algorithms. The
synthesis results in [21] and [20] are also based on a TSMC
90nm CMOS technology. The original synthesis results of [19]
and [24] are based on UMC 90nm and ST 65nm CMOS
technologies, respectively.
The synthesis results shown in Table III demonstrate that
our symbol-decision SCL polar decoders have higher area
efficiencies than the SCL decoders in [19], [20], [24], and
[21]. The SCL decoders in [19], [21], [24] have higher clock
rates than our designs because they use registers as storage units,
whereas register files are used in our designs.
The SDSCL-8 decoders provide a higher throughput and
a smaller latency than the SDSCL-2 and SDSCL-4 decoders,
but occupy larger areas. However, the improvements in
throughput and latency are not linear in the symbol size.
Compared with the SCL decoder in [20], the increase in the
areas of the symbol-decision SCL decoders is mainly due to the
sorting networks, because the adders of the processing units are
reused to calculate both the bit- and symbol-based channel
transition probabilities. For the SDSCL-4 decoder, because its
sorting network occupies only 0.073 mm²,
there is no need to shrink q further. For the SDSCL-8 decoders,
when q = 4, the area of the sorting network is 0.454 mm².
However, when q = 2, the sorting network occupies 0.196
mm², which is less than half of that for q = 4. A smaller q
does help the SDSCL-8 decoder achieve a higher throughput,
a smaller latency, a smaller area, and a higher area efficiency,
but it also introduces an FER performance loss of 0.25 dB to
the SDSCL-8 decoder at an FER level of $10^{-3}$, as shown in
Fig. 7.
Moreover, we also provide synthesis results for SDSCL-8
decoders without the PCMS technique. The PCMS technique
helps the SDSCL-8 decoders gain an area saving of about 0.12
mm².
We have already mentioned that LL messages are used in our
designs. If LLR messages [21] are used, symbol-decision SCL
decoders can have better area efficiencies than our current
designs because the memory requirement for LLR messages
is smaller than that for LL messages [21].
VI. CONCLUSION
In this paper, we use the symbol-based recursive channel
combination to calculate the symbol-based channel transition
probability. We show that, based on the LL representation of
the transition probability, this recursive procedure needs fewer
additions than the method used in [22], [24]. Furthermore,
a two-stage list pruning network is proposed to simplify
the L-path finding problem. We use the PCMS technique to
reduce the memory requirement of list decoders. By applying
the PCMS technique, we design an efficient architecture for
symbol-decision SCL decoders. Specifically, we introduce two
scheduling schemes to perform the hardware sharing. A folded
sorting implementation and a tree sorting implementation are
also discussed. We also implement symbol-decision SCL polar
decoders with two-bit, four-bit, and eight-bit symbols and
a list size of four. Our synthesis results show that symbol-decision
SCL polar decoders outperform existing SCL polar
decoders in terms of area efficiency. Our proposed methods
and architecture provide a range of tradeoffs between area,
throughput, and area efficiency.
ACKNOWLEDGMENT
We would like to thank the authors of [19] for providing
the synthesis results using the TSMC 90nm technology in
Table III.
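TABLE III
SYNTHESIS RESULTS FOR DIFFERENT DECODERS WITH L = 4.
| Design | Algorithm | M | Message type | Clock rate (MHz) | Latency (µs) | Throughput (Mbps) | Area (mm²) | Area eff. (Mb/s/mm²) |
|---|---|---|---|---|---|---|---|---|
| Proposed | Symbol-decision SCL | 2 | LL | 500 | 4.14 | 247 | 1.126 | 219.4 |
| Proposed | Symbol-decision SCL | 4 | LL | 500 | 3.27 | 313 | 1.209 | 259.2 |
| Proposed | Symbol-decision SCL | 8 | LL | 500 | 3.08 | 332 | 1.669 | 199.2 |
| Proposed | Symbol-decision SCL | 8 | LL | 500 | 3.21** | 319** | 1.782** | 179.1** |
| Proposed | Symbol-decision SCL | 8 | LL | 500 | 2.58 | 398 | 1.403 | 283.3 |
| Proposed | Symbol-decision SCL | 8 | LL | 500 | 2.70** | 379** | 1.519** | 249.3** |
| [24] | Symbol-decision SCL | 2 | LL | 525 | 3.89 | 262 | 1.98 | 132.3 |
| [24] | Symbol-decision SCL | 2 | LL | 379* | 5.39* | 189* | 3.79* | 49.9* |
| [24] | Symbol-decision SCL | 4 | LL | 400 | 2.56 | 401 | 2.14 | 187.3 |
| [24] | Symbol-decision SCL | 4 | LL | 289* | 3.53* | 289* | 4.10* | 70.6* |
| [20] | Bit-decision SCL | N/A | LL | 500 | 5.63 | 182 | 1.099 | 165.6 |
| [19]‡ | Bit-decision SCL | N/A | LL | 694 | 4.06 | 252 | 2.197 | 114.7 |
| [19]† | Bit-decision SCL | N/A | LL | 314 | 8.25 | 124 | 3.53 | 35.1 |
| [21] | Bit-decision SCL | N/A | LLR | 794 | 3.34 | 307 | 1.78 | 172 |
† The synthesis result in [19] is based on a UMC 90nm CMOS technology.
‡ The synthesis result is provided by the authors of [19] based on a TSMC 90nm CMOS technology.
* Original synthesis results in [24] are based on an ST 65nm CMOS technology. For a fair comparison, synthesis results scaled to a 90nm technology
are used in the comparison.
** The design is without the PCMS technique.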
APPENDIX
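Proof of Proposition 1: According to the definition of
conditional probability, $\Pr(B|A) = \frac{\Pr(AB)}{\Pr(A)}$,
$$
W_{2\Lambda,\Phi}^{(i+1)}(y_1^{2\Lambda}, u_1^{i\Phi} | u_{i\Phi+1}^{i\Phi+\Phi})
= \frac{W_{2\Lambda,1}^{(i\Phi+\Phi)}(y_1^{2\Lambda}, u_1^{i\Phi+\Phi-1} | u_{i\Phi+\Phi})}{\Pr(u_{i\Phi+1}^{i\Phi+\Phi-1} | u_{i\Phi+\Phi})}.
\tag{14}
$$
Because all bits of u are independent and each bit has an equal
probability of being a 0 or 1,
$$
\Pr(u_{i\Phi+1}^{i\Phi+\Phi-1} | u_{i\Phi+\Phi}) = \Pr(u_{i\Phi+1}^{i\Phi+\Phi-1}) = 2^{-(\Phi-1)}.
$$
Therefore,
$$
W_{2\Lambda,\Phi}^{(i+1)}(y_1^{2\Lambda}, u_1^{i\Phi} | u_{i\Phi+1}^{i\Phi+\Phi})
= 2^{(\Phi-1)} W_{2\Lambda,1}^{(i\Phi+\Phi)}(y_1^{2\Lambda}, u_1^{i\Phi+\Phi-1} | u_{i\Phi+\Phi}).
\tag{15}
$$
According to Eq. (3),
$$
\begin{aligned}
W_{2\Lambda,1}^{(i\Phi+\Phi)}(y_1^{2\Lambda}, u_1^{i\Phi+\Phi-1} | u_{i\Phi+\Phi})
= \frac{1}{2}\, & W_{\Lambda,1}^{(\frac{i\Phi+\Phi}{2})}(y_1^{\Lambda}, u_{1,o}^{i\Phi+\Phi-2} \oplus u_{1,e}^{i\Phi+\Phi-2} | u_{i\Phi+\Phi-1} \oplus u_{i\Phi+\Phi}) \\
\cdot\, & W_{\Lambda,1}^{(\frac{i\Phi+\Phi}{2})}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{i\Phi+\Phi-2} | u_{i\Phi+\Phi}).
\end{aligned}
\tag{16}
$$
Similarly, we have
$$
\begin{aligned}
& W_{\Lambda,1}^{(\frac{i\Phi+\Phi}{2})}(y_1^{\Lambda}, u_{1,o}^{i\Phi+\Phi-2} \oplus u_{1,e}^{i\Phi+\Phi-2} | u_{i\Phi+\Phi-1} \oplus u_{i\Phi+\Phi}) \\
&\quad = 2^{-(\frac{\Phi}{2}-1)} W_{\Lambda,\Phi/2}^{(i+1)}(y_1^{\Lambda}, u_{1,o}^{i\Phi} \oplus u_{1,e}^{i\Phi} | u_{i\Phi+1,o}^{i\Phi+\Phi} \oplus u_{i\Phi+1,e}^{i\Phi+\Phi}),
\end{aligned}
\tag{17}
$$
and
$$
W_{\Lambda,1}^{(\frac{i\Phi+\Phi}{2})}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{i\Phi+\Phi-2} | u_{i\Phi+\Phi})
= 2^{-(\frac{\Phi}{2}-1)} W_{\Lambda,\Phi/2}^{(i+1)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{i\Phi} | u_{i\Phi+1,e}^{i\Phi+\Phi}).
\tag{18}
$$
Then, by Eqs. (15)–(18), Eq. (9) is obtained.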
REFERENCES
[1] E. Arikan, “Channel polarization: A method for constructing capacity-
achieving codes for symmetric binary-input memoryless channels,” IEEE
Transactions on Information Theory, vol. 55, no. 7, pp. 3051–3073, July
2009.
[2] E. Sasoglu, I. Telatar, and E. Arikan, “Polarization for arbitrary discrete
memoryless channels,” in Proceedings of IEEE Information Theory
Workshop, Oct 2009, pp. 144–148.
[3] C. Leroux, A. Raymond, G. Sarkis, and W. Gross, “A semi-parallel
successive-cancellation decoder for polar codes,” IEEE Transactions on
Signal Processing, vol. 61, no. 2, pp. 289–299, Jan 2013.
[4] K. Niu and K. Chen, “CRC-aided decoding of polar codes,” IEEE
Communications Letters, vol. 16, no. 10, pp. 1668–1671, October 2012.
[5] A. Eslami and H. Pishro-Nik, “A practical approach to polar codes,” in
Proceedings of IEEE International Symposium on Information Theory,
July 2011, pp. 16–20.
[6] E. Arikan, “Systematic polar coding,” IEEE Communications Letters,
vol. 15, no. 8, pp. 860–862, August 2011.
[7] E. Arikan, H. Kim, G. Markarian, U. Ozgur, and E. Poyraz, “Perfor-
mance of short polar codes under ml decoding,” in Proceedings of ICT
Mobile Summit Conference, 2009.
[8] S. Kahraman and M. Celebi, “Code based efficient maximum-likelihood
decoding of short polar codes,” in Proceedings of IEEE International
Symposium on Information Theory, July 2012, pp. 1967–1971.
[9] K. Niu, K. Chen, and J. Lin, “Low-complexity sphere decoding of polar
codes based on optimum path metric,” IEEE Communications Letters,
vol. PP, no. 99, pp. 1–4, 2014.
[10] I. Tal and A. Vardy, “List decoding of polar codes,” in Proceedings of
IEEE International Symposium on Information Theory, July 2011, pp.
1–5.
[11] ——, “List decoding of polar codes,” arXiv:1206.0050, Jun. 2012.
[Online]. Available: http://arxiv.org/abs/1206.0050
[12] IEEE Standard for Local and Metropolitan Area Networks Part 16:
Air Interface for Fixed and Mobile Broadband Wireless Access Systems
Amendment 2: Physical and Medium Access Control Layers for Com-
bined Fixed and Mobile Operation in Licensed Bands and Corrigendum
1, IEEE Std. 802.16e-2005, Mar. 2006.
[13] C. Leroux, I. Tal, A. Vardy, and W. Gross, “Hardware architectures
for successive cancellation decoding of polar codes,” in Proceedings
of IEEE International Conference on Acoustics, Speech and Signal
Processing, May 2011, pp. 1665–1668.
[14] A. Alamdar-Yazdi and F. Kschischang, “A simplified successive-
cancellation decoder for polar codes,” IEEE Communications Letters,
vol. 15, no. 12, pp. 1378–1380, Dec. 2011.
[15] C. Zhang and K. Parhi, “Latency analysis and architecture design
of simplified sc polar decoders,” IEEE Transactions on Circuits and
Systems II: Express Briefs, vol. 61, no. 2, pp. 115–119, Feb. 2014.
[16] G. Sarkis and W. Gross, “Increasing the throughput of polar decoders,”
IEEE Communications Letters, vol. 17, no. 4, pp. 725–728, Apr. 2013.
[17] C. Zhang and K. Parhi, “Low-latency sequential and overlapped archi-
tectures for successive cancellation polar decoder,” IEEE Transactions
on Signal Processing, vol. 61, no. 10, pp. 2429–2441, May 2013.
[18] J. Lin and Z. Yan, “Efficient list decoder architecture for polar codes,” in
Proceedings of IEEE International Symposium on Circuits and Systems,
June 2014, pp. 1022–1025.
[19] A. Balatsoukas-Stimming, A. Raymond, W. Gross, and A. Burg, “Hard-
ware architecture for list successive cancellation decoding of polar
codes,” IEEE Transactions on Circuits and Systems II: Express Briefs,
vol. 61, no. 8, pp. 609–613, Aug 2014.
[20] J. Lin and Z. Yan, “An efficient list decoder architecture for polar codes,”
IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
vol. PP, no. 99, pp. 1–1, 2015.
[21] A. Balatsoukas-Stimming, M. B. Parizi, and A. Burg, “LLR-based
successive cancellation list decoding of polar codes,” arXiv:1401.3753,
September 2014. [Online]. Available: http://arxiv.org/abs/1401.3753
[22] B. Li, H. Shen, and D. Tse, “Parallel decoders of polar codes,”
arXiv:1309.1026, September 2013. [Online]. Available:
http://arxiv.org/abs/1309.1026
[23] B. Li, H. Shen, D. Tse, and W. Tong, “Low-latency polar codes via
hybrid decoding,” in Proceedings of 2014 8th International Symposium
on Turbo Codes and Iterative Information Processing, Aug. 2014, pp.
223–227.
[24] B. Yuan and K. Parhi, “Low-latency successive-cancellation list decoders
for polar codes with multibit decision,” IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, vol. PP, no. 99, pp. 1–13, 2014.
[25] K. Parhi, “Pipelining in algorithms with quantizer loops,” IEEE Trans-
actions on Circuits and Systems, vol. 38, no. 7, pp. 745–754, July 1991.
[26] W. Park and A. Barg, “Polar codes for q-ary channels, q = 2r ,”
IEEE Transactions on Information Theory, vol. 59, no. 2, pp. 955–969,
February 2013.
[27] C. Xiong, J. Lin, and Z. Yan, “Error performance analysis of the
symbol-decision SC polar decoder,” arxiv:1501.01706, 2015. [Online].
Available: http://arxiv.org/abs/1501.01706
[28] K. E. Batcher, “Sorting networks and their applications,” in AFIPS
Proceeding of the Spring Joint Computer Conference, 1968, pp. 307–
314.
