A multi-mode area-efficient SCL polar decoder by Xiong, Chenrong et al.
Chenrong Xiong, Jun Lin, Student member, IEEE and Zhiyuan Yan, Senior member, IEEE
arXiv:1510.07510v1 [cs.IT] 26 Oct 2015
Abstract—Polar codes are of great interest since they are
the first provably capacity-achieving forward error correction
codes. To improve throughput and to reduce decoding latency
of polar decoders, maximum likelihood (ML) decoding units are
used by successive cancellation list (SCL) decoders as well as
successive cancellation (SC) decoders. This paper proposes an
approximate ML (AML) decoding unit for SCL decoders first.
In particular, we investigate the distribution of frozen bits of polar
codes designed for both the binary erasure and additive white
Gaussian noise channels, and take advantage of the distribution
to reduce the complexity of the AML decoding unit, improving
the area efficiency of SCL decoders. Furthermore, a multi-mode
SCL decoder with variable list sizes and parallelism is proposed.
If high throughput or small latency is required, the decoder
decodes multiple received codewords in parallel with a small list
size. However, if error performance is of higher priority, the
multi-mode decoder switches to a serial mode with a bigger list
size. Therefore, the multi-mode SCL decoder provides a flexible
tradeoff between latency, throughput and error performance,
and adapts to different throughput and latency requirements
at the expense of small overhead. Hardware implementation
and synthesis results show that our polar decoders not only
have a better area efficiency but also easily adapt to different
communication channels and applications.
Index Terms—Error control codes, polar codes, successive
cancellation list decoding, ML decoding, multi-mode decoding,
parallel decoding
I. INTRODUCTION
Polar codes [1], a breakthrough in coding theory, have
attracted lots of research interest since they achieve the sym-
metric capacity of memoryless channels with both binary input
[1] and nonbinary input [2]. A lot of effort has been made to
improve the error performance of polar codes with short or
moderate lengths [3]–[8], and to improve the hardware area
efficiency of polar decoders [9]–[20].
Maximum likelihood (ML) decoding algorithms — the
sphere decoding algorithm [6], stack sphere decoding algo-
rithm [7] and a Viterbi algorithm [8] — can be used to
decode polar codes, but their complexity can be prohibitively
high. Compared with ML decoding algorithms, the successive
cancellation (SC) decoding algorithm [1] has a lower com-
plexity at the cost of sub-optimal performance. To improve the
performance of the SC algorithm, the SC list (SCL) decoding
algorithm [21] and the CRC-aided SCL (CA-SCL) algorithm
[3], [5] were proposed. A key drawback of the SC, SCL and
CA-SCL algorithms is their long decoding latency and low
decoding throughput, as these algorithms deal with only one
bit at a time: the SC algorithm makes hard bit decisions only
one bit at a time; in the SCL and CA-SCL algorithms, the
path expansion is with respect to only one bit.
To reduce decoding latency and improve throughput of an
SC polar decoder, several algorithms [13], [15], [22], [23] were
proposed to deal with several bits at a time instead of only
one bit by using ML decoding units, which calculate symbol-
wise channel transition probabilities and make hard decisions
for several bits at a time. Based on the SC algorithm, the
parallel SC [22], hybrid ML-SC [23], ML simplified SC (ML-
SSC) [15] and fast ML-SSC [13] algorithms were proposed.
The basic difference of ML decoding units between these four
algorithms is that hybrid ML-SC [23] and fast ML-SSC [13]
take advantage of the distribution of frozen bits to reduce
complexity, but neither parallel SC [22] nor ML-SSC [15]
algorithms do so.
ML decoding units in [19], [20], [24]–[27] are also used
to improve throughput of SCL-based decoders and to reduce
decoding latency. Instead of making hard decisions in SC-
based algorithms, an ML decoding unit for SCL-based algo-
rithms calculates symbol-wise channel transition probabilities
and performs path expansion and pruning. None of these SCL-
based algorithms takes advantage of the distribution of frozen
bits to reduce complexity of ML decoding units. Therefore,
ML decoding units in these SCL-based algorithms have high
complexities. For example, when the list size is four and the
symbol size is eight, the ML decoding unit accounts for 27%
of the overall decoder area in [25]. In [27], when the code
length is 1024, the area of an ML decoding unit takes up as
much as 62% of the overall decoder area.
In this paper, we first propose a low-complexity approximate
ML (AML) decoding unit by utilizing the distribution of
frozen bits of polar codes and then propose a multi-mode SCL
(MM-SCL) polar decoder to support variable throughput and
latency. Our main contributions are:
• The divide-and-conquer method in [23] is applied in
the probability domain to simplify the ML unit for SC-
based algorithms. By extending this idea, a divide-and-
conquer AML decoding unit for SCL-based algorithms is
proposed by considering the distribution of frozen bits.
Its computational complexity is much lower than that
of existing ML decoding units for SCL-based algorithms.
When an appropriate design parameter for the divide-and-
conquer AML decoding unit is selected, the SCL decoder
has negligible performance loss.
• The distribution of frozen bits of polar codes is analyzed
from the viewpoint of code construction. We show that
there are only a small number of frozen-location patterns
for polar codes constructed by a method proposed by
Arıkan in [28] and a method in [29].
• Since only a small number of frozen-location patterns
exist in polar codes, the divide-and-conquer AML de-
coding unit for SCL-based algorithms is simplified fur-
ther. A low-complexity hardware implementation for the
simplified divide-and-conquer AML decoding unit, the
LC-AML decoding unit, is proposed. Synthesis results
show that by taking advantage of a small number of
frozen-location patterns, our CA-SCL decoder with the
LC-AML unit has a better area efficiency than existing
SCL decoders, while working for all channel conditions.
• An MM-SCL polar decoder is also proposed. This de-
coder supports SCL algorithms with different list sizes
and parallelism. When a high throughput or small la-
tency is needed, the MM-SCL decoder decodes multiple
received codewords in parallel with a small list size. If
a good performance is required, the MM-SCL decoder
switches to a mode with a greater list size to decode only
one received codeword. Therefore, the MM-SCL polar
decoder provides a flexible tradeoff between latency,
throughput and performance, and consequently adapts
to different throughput and latency requirements at the
expense of small overhead.
Our proposed divide-and-conquer AML decoding unit for
SCL-based algorithms is an extension of the method for
SC-based algorithms in [23]. However, by investigating the
distribution of frozen bits of polar codes, we reduce the complexity
of the ML decoding unit further. Existing ML decoding units
for SCL decoders [19], [24]–[27] perform the list pruning
function after all the symbol-wise channel transition probabilities
are calculated, whereas the LC-AML decoding unit sorts
intermediate calculation results generated by the recursive
channel combination method [19], and consequently reduces
the number of symbol-wise channel transition probabilities
dealt with by the list pruning function. Hence, the proposed
LC-AML decoding unit has a much smaller complexity. The
performance degradation due to the proposed LC-AML is
the same as that in [19], [25]. Although the ML decoding
unit in [24] has no performance degradation, its complexity
grows quickly as the list size and symbol size increase. The
performance degradation of the ML decoding unit in [26], [27]
depends on its design parameters.
Many applications, such as modern wireless or wireline
communication systems, require variable data rate transmission
and have stringent latency requirements. Since polar codes are a
potential FEC candidate for future communication systems, a
polar decoder supporting variable data rates and variable decoding
latency is desired. Unfortunately, existing polar decoders
provide only fixed latency and throughput (data rate). To the
best of our knowledge, the proposed MM-SCL decoder is
the first polar decoder with variable throughput and decoding
latency given a polar code.
The rest of this paper is organized as follows. In Section II,
polar codes as well as construction methods for polar codes
and existing ML decoding units are reviewed. In Section III,
first, the divide-and-conquer method is applied in the proba-
bility domain for the ML unit of SC-based algorithms. Then,
the divide-and-conquer AML decoding unit for SCL-based
algorithms is proposed, and its computational complexity is
also analyzed. In Section IV, frozen-location patterns for polar
codes are investigated. In Section V, a hardware design of
the LC-AML unit is proposed, and an area-efficient CA-SCL
decoder with the LC-AML unit is implemented as well. The
hardware implementation and synthesis results for the area-
efficient SCL decoder are also presented in this section. In Sec-
tion VI, the MM-SCL decoder, its hardware implementation
and synthesis results are presented. Finally, some conclusions
are provided in Section VII.
II. PRELIMINARIES
A. Notations
Suppose $u$ represents a binary vector $(u_1, u_2, \cdots, u_N)$. $u_a^b$
denotes a binary subvector $(u_a, u_{a+1}, \cdots, u_{b-1}, u_b)$ of $u$ for
$1 \le a, b \le N$; if $a > b$, $u_a^b$ is regarded as void. $u_{a,o}^b$ and
$u_{a,e}^b$ represent the subvectors of $u_a^b$ with odd and even indices,
respectively. For an index set $\mathcal{A} \subseteq \mathcal{I} = \{1, 2, \cdots, N\}$, its
complement in $\mathcal{I}$ is denoted as $\mathcal{A}^c$. The subvector of $u = u_1^N$
restricted to $\mathcal{A}$ is represented by
$u_{\mathcal{A}} = (u_i : 0 < i \le N, i \in \mathcal{A})$.
B. Polar codes
For an $(N,K)$ polar code, the code length $N$ is a power
of two, i.e., $N = 2^n$ for $n > 0$. The data bit sequence,
represented by $u = u_1^N$, is divided into two parts: a
$K$-element part $u_{\mathcal{A}}$ which carries information bits, and
$u_{\mathcal{A}^c}$ whose elements (called frozen bits) are set to zero. The
corresponding encoded bit sequence $x = x_1^N$ is generated
by $x = u B_N F^{\otimes n}$, where $B_N$ is the $N \times N$ bit-reversal
permutation matrix, $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$, and
$F^{\otimes n}$ is the $n$-th Kronecker power of $F$ [1].
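The encoding rule $x = u B_N F^{\otimes n}$ can be illustrated with a minimal Python sketch. The function names (`bit_reverse_permute`, `kron_power_multiply`, `polar_encode`) are hypothetical and chosen only for this illustration; the butterfly loop is one common way to multiply by $F^{\otimes n}$ over GF(2) and is not taken from the paper. Indices in the code are zero-based, whereas the text uses one-based indices.

```python
def bit_reverse_permute(u):
    """Apply the bit-reversal permutation B_N to a list of length N = 2**n (n >= 1)."""
    n = len(u).bit_length() - 1
    out = [0] * len(u)
    for i, bit in enumerate(u):
        rev = int(format(i, "0{}b".format(n))[::-1], 2)  # reverse the n-bit index
        out[rev] = bit
    return out


def kron_power_multiply(v):
    """Return v * F^(kron n) over GF(2), with F = [[1, 0], [1, 1]]."""
    v = list(v)
    step = 1
    while step < len(v):
        for start in range(0, len(v), 2 * step):
            for k in range(start, start + step):
                v[k] ^= v[k + step]  # upper half absorbs the lower half
        step *= 2
    return v


def polar_encode(u):
    """Encode the data vector u (frozen positions already set to zero)."""
    return kron_power_multiply(bit_reverse_permute(u))
```

For $N = 4$, the one-hot input selecting the last row of $F^{\otimes 2}$ yields the all-ones codeword, which is a quick sanity check of the butterfly.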
C. Construction Methods of Polar Codes
An essential problem for constructing a polar code is to
determine the locations of frozen bits (elements of $\mathcal{A}^c$). For
the BEC with an erasure probability $\epsilon$ ($0 < \epsilon < 1$), assuming
$z_{0,1} = \epsilon$, the following recursions [28] are used to construct
an $(N,K)$ polar code, where $N = 2^n$ and $0 < K < N$:
$$z_{i,2j-1} = 2 z_{i-1,j} - z_{i-1,j}^2, \tag{1}$$
$$z_{i,2j} = z_{i-1,j}^2, \tag{2}$$
where $1 \le i \le n$. Then $\mathcal{A}^c$ is chosen such that
$\sum_{j \in \mathcal{A}^c} z_{n,j}$ is maximal and $|\mathcal{A}^c| = N - K$.
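As a hedged illustration of Eqs. (1) and (2), the following Python sketch computes the erasure probabilities $z_{n,j}$ and picks the $N-K$ positions with the largest values as frozen bits. The function names `bec_reliabilities` and `bec_frozen_set` are assumptions made for this example; indices returned are one-based to match the text.

```python
def bec_reliabilities(eps, n):
    """Return [z_{n,1}, ..., z_{n,2^n}] for a BEC with erasure probability eps."""
    z = [eps]
    for _ in range(n):
        nxt = []
        for val in z:
            nxt.append(2 * val - val * val)  # Eq. (1): z_{i,2j-1}
            nxt.append(val * val)            # Eq. (2): z_{i,2j}
        z = nxt
    return z


def bec_frozen_set(eps, n, k):
    """Return the one-based indices of the N-K frozen bits (largest z_{n,j})."""
    z = bec_reliabilities(eps, n)
    order = sorted(range(len(z)), key=lambda j: z[j], reverse=True)
    return sorted(j + 1 for j in order[:len(z) - k])
```

For example, `bec_frozen_set(0.5, 3, 4)` selects positions 1, 2, 3 and 5 of an (8, 4) code.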
For the AWGN channel and a given initial value of $z_{0,1}$,
which is determined by a desired signal-to-noise ratio (SNR),
the following recursive method [29] based on Gaussian
approximation is used for $1 \le i \le n$:
$$z_{i,2j-1} = \tau^{-1}\left(1 - (1 - \tau(z_{i-1,j}))^2\right), \tag{3}$$
$$z_{i,2j} = 2 z_{i-1,j}, \tag{4}$$
where
$$\tau(x) = \begin{cases} e^{-0.4527 x^{0.86} + 0.0218}, & 0 < x < 10, \\ \sqrt{\frac{\pi}{x}}\, e^{-\frac{x}{4}} \left(1 - \frac{10}{7x}\right), & x \ge 10. \end{cases} \tag{5}$$
In this case, $\mathcal{A}^c$ is chosen such that $\sum_{j \in \mathcal{A}^c} z_{n,j}$ is minimal
and $|\mathcal{A}^c| = N - K$.
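A rough Python sketch of the Gaussian-approximation recursion in Eqs. (3)-(5) follows. The paper does not say how $\tau^{-1}$ is evaluated, so the bisection search in `tau_inverse` is an assumption of this example, as are all function names and the search interval. The initial value `z0` is treated as a given input. The identity $1-(1-t)^2 = t(2-t)$ is used only for numerical convenience.

```python
import math


def tau(x):
    """Piecewise function of Eq. (5)."""
    if x < 10:
        return math.exp(-0.4527 * x ** 0.86 + 0.0218)
    return math.sqrt(math.pi / x) * math.exp(-x / 4) * (1 - 10 / (7 * x))


def tau_inverse(y, lo=1e-9, hi=1e3, iterations=100):
    """Invert tau by bisection (assumed method; tau is treated as decreasing)."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if tau(mid) > y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def ga_reliabilities(z0, n):
    """Return [z_{n,1}, ..., z_{n,2^n}] from Eqs. (3) and (4)."""
    z = [z0]
    for _ in range(n):
        nxt = []
        for val in z:
            t = tau(val)
            nxt.append(tau_inverse(t * (2.0 - t)))  # Eq. (3)
            nxt.append(2.0 * val)                   # Eq. (4)
        z = nxt
    return z


def ga_frozen_set(z0, n, k):
    """Return the one-based indices of the N-K frozen bits (smallest z_{n,j})."""
    z = ga_reliabilities(z0, n)
    order = sorted(range(len(z)), key=lambda j: z[j])
    return sorted(j + 1 for j in order[:len(z) - k])
```

Note that the two branches of Eq. (5) do not meet exactly at $x = 10$, so a production implementation may need a more careful inverse near that point.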
D. Existing ML Decoding Units for Polar Decoders
When $x = u B_N F^{\otimes n}$ is transmitted, suppose the received
word is $y = y_1^N$ and the symbol size is $M = 2^m$. A
symbol-decision [19] ML decoding unit first calculates symbol-wise
channel transition probabilities,
$\Pr(y, \hat{u}_1^{jM} \mid u_{jM+1}^{jM+M})$ ($0 \le j < \frac{N}{M}$),
then makes a symbol-wise ML decision for SC-based
decoders or chooses the $L$ most reliable paths for SCL-based
decoders. Here, $\hat{u}_1^{jM}$ denotes the previously estimated bits.
There are three methods to calculate the symbol-wise
channel transition probabilities, and none of them takes
advantage of the distribution of frozen bits. The first [15],
[22], [24], [26] is based on an $M$-element product of bit-wise
channel transition probabilities, called the directing mapping
method (DMM):
$$\Pr(y, \hat{u}_1^{iM-M} \mid u_{iM-M+1}^{iM}) = \prod_{j=0}^{M-1} \Pr\left(y_{j\frac{N}{M}+1}^{(j+1)\frac{N}{M}}, \hat{w}_{1+j\frac{N}{M}}^{(i-1)+j\frac{N}{M}} \,\middle|\, w_{i+j\frac{N}{M}}\right), \tag{6}$$
where $u_{iM-M+1}^{iM} = (w_i, w_{i+\frac{N}{M}}, \cdots, w_{i+N-\frac{N}{M}}) B_M F^{\otimes m}$ for
$1 \le i \le \frac{N}{M}$, and $\hat{w}_{1+j\frac{N}{M}}^{(i-1)+j\frac{N}{M}}$ is the previously estimated bit
vector of $w_{1+j\frac{N}{M}}^{(i-1)+j\frac{N}{M}}$.
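The second [19], called the recursive channel combination
(RCC) method, recursively forms a product of symbol-wise
channel transition probabilities,
$$\Pr(y_1^{2\Lambda}, \hat{u}_1^{i\Phi} \mid u_{i\Phi+1}^{i\Phi+\Phi}) = \Pr(y_1^{\Lambda}, \hat{u}_{1,o}^{i\Phi} \oplus \hat{u}_{1,e}^{i\Phi} \mid u_{i\Phi+1,o}^{i\Phi+\Phi} \oplus u_{i\Phi+1,e}^{i\Phi+\Phi}) \cdot \Pr(y_{\Lambda+1}^{2\Lambda}, \hat{u}_{1,e}^{i\Phi} \mid u_{i\Phi+1,e}^{i\Phi+\Phi}), \tag{7}$$
where $1 \le \phi \le m$, $0 \le \lambda < n$, $\Lambda = 2^{\lambda}$, and $\Phi = 2^{\phi}$.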
The third [27] is a hybrid method that applies the DMM
first and then the RCC method, referred to as the DRH method.
Based on the distribution of frozen bits, some data symbols
in [13] are treated as special constituent codes, such
as repetition codes and single-parity-check codes. Different
methods were proposed to deal with different constituent
codes.
Furthermore, an ML decoding unit in [23] with the divide-
and-conquer method was proposed for SC algorithms based
on an empirical assumption [23]:
Assumption 1. For a well-designed polar code, there is no
case in which $u_{2i-1}$ is an information bit and $u_{2i}$ is a frozen
bit, for any $1 \le i \le \frac{N}{2}$.
Based on this assumption and the divide-and-conquer
method, a simplified ML unit was proposed in [23]. Moreover,
a recursive version of the divide-and-conquer method was
proposed in [23], but it targets large symbol sizes, whose
complexity is too high for a practical hardware implementation.
How to take advantage of frozen-location patterns to reduce
complexity of ML decoding units has been discussed in
[13] and [23] for SC-based algorithms, but it has not been
investigated yet for SCL-based algorithms.
III. DIVIDE-AND-CONQUER AML DECODING UNIT
The simplified ML unit in [23] is based on the Euclidean
distance since an AWGN channel is assumed. Here, we
first apply the divide-and-conquer method in the probability
domain and reformulate the ML unit for SC-based algorithms.
This simplified ML unit in the probability domain is slightly
more general than that in [23], because it is applicable to
both AWGN channels and other channels. We then extend the
simplified ML unit in the probability domain to SCL-based
algorithms.
A. Reformulation of the divide-and-conquer ML unit for SC-
based algorithms [23] in the probability domain
For ease of discussion, a string vector $S_a^b =$ ‘$S_a \cdots S_b$’
(for $1 \le a \le b \le N$) is introduced to represent the
frozen-location pattern of symbol $u_a^b$. If $u_j$ ($a \le j \le b$) is an
information bit, $S_j$ is denoted as ‘D’; otherwise, $S_j$ is denoted as ‘F’.
Consider a toy example of a four-bit symbol $u_1^4$. Assume
$u_1$ and $u_3$ are frozen bits, and $u_2$ and $u_4$ are information
bits. Then the frozen-location pattern $S_1^4$ of $u_1^4$ is ‘FDFD’.
Obviously, for $M$-bit symbols, there are $2^M$ possible frozen-location
patterns.
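The pattern notation can be made concrete with a small Python sketch. The helper names (`frozen_pattern`, `satisfies_assumption_1`, `count_patterns`) are hypothetical. The sketch builds the pattern string for a symbol from a set of frozen indices and counts how many of the $2^M$ patterns survive Assumption 1, i.e. contain no ‘DF’ pair.

```python
from itertools import product


def frozen_pattern(frozen, a, b):
    """Return the pattern string S_a^b; `frozen` holds one-based frozen indices."""
    return "".join("F" if j in frozen else "D" for j in range(a, b + 1))


def satisfies_assumption_1(pattern):
    """True if no pair (u_{2i-1}, u_{2i}) has the form 'DF'."""
    pairs = [pattern[i:i + 2] for i in range(0, len(pattern), 2)]
    return "DF" not in pairs


def count_patterns(M):
    """Count the M-bit patterns allowed by Assumption 1 (each pair: FF, FD or DD)."""
    return sum(satisfies_assumption_1("".join(p)) for p in product("FD", repeat=M))
```

Since every pair has three admissible forms, the count equals $3^{M/2}$, for instance 81 when $M = 8$.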
Based on Assumption 1, $u_{jM+1}^{jM+M}$ ($0 \le j < \frac{N}{M}$) can be
divided into $\frac{M}{2}$ pairs, $u_{jM+2i-1}$ and $u_{jM+2i}$ for
$1 \le i \le \frac{M}{2}$. In theory, any pair of $u_{jM+2i-1}$ and $u_{jM+2i}$ has four
possibilities. ‘FF’ is trivial. Under Assumption 1, ‘DF’ is
not possible. Hence, in [23], only the two remaining possibilities
are considered: ‘FD’ and ‘DD’. Let $\Omega_{01}^{(j)}$ represent the set of
indices $i$ for which $u_{jM+2i-1}^{jM+2i}$ is ‘FD’, and let $\Omega_{11}^{(j)}$ represent the
set of indices $i$ for which $u_{jM+2i-1}^{jM+2i}$ is ‘DD’.
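For SC-based algorithms, the maximum of $2^{|\Omega_{01}^{(j)}| + 2|\Omega_{11}^{(j)}|}$
values of $T(u_{jM+1}^{jM+M}) \triangleq \Pr(y, \hat{u}_1^{jM} \mid u_{jM+1}^{jM+M})$ needs
to be found. Based on the RCC method [19],
$T(u_{jM+1}^{jM+M}) = T_1(v_{\frac{jM}{2}+1}^{\frac{jM+M}{2}}) \times T_2(u_{jM+1,e}^{jM+M})$, where
$v_1^{\frac{N}{2}} \triangleq u_{1,o}^N \oplus u_{1,e}^N$,
$T_1(v_{\frac{jM}{2}+1}^{\frac{jM+M}{2}}) \triangleq \Pr(y_1^{\frac{N}{2}}, \hat{v}_1^{\frac{jM}{2}} \mid v_{\frac{jM}{2}+1}^{\frac{jM+M}{2}})$,
and $T_2(u_{jM+1,e}^{jM+M}) \triangleq \Pr(y_{\frac{N}{2}+1}^{N}, \hat{u}_{1,e}^{jM} \mid u_{jM+1,e}^{jM+M})$. The possible
values of $T(u_{jM+1}^{jM+M})$ can be divided into $2^{|\Omega_{01}^{(j)}|}$ groups
based on $\Omega_{01}^{(j)}$, each with $2^{2|\Omega_{11}^{(j)}|}$ values. In each group,
for $i \in \Omega_{11}^{(j)}$, since $v_{\frac{jM}{2}+i}$ and $u_{jM+2i}$ are independent,
$\max(T) = \max(T_1)\max(T_2)$. Then, the maximum of the $2^{|\Omega_{01}^{(j)}|}$
values generated in the previous step is found. Therefore, if
$\Omega_{01}^{(j)} = \emptyset$,
$$\max_{u_{jM+1}^{jM+M}} (T) = \max_{\substack{u_{jM+2i-1}^{jM+2i} \\ i \in \Omega_{11}^{(j)}}} (T_1) \times \max_{\substack{u_{jM+2i} \\ i \in \Omega_{11}^{(j)}}} (T_2); \tag{8}$$
otherwise,
$$\max_{u_{jM+1}^{jM+M}} (T) = \max_{\substack{u_{jM+2i} \\ i \in \Omega_{01}^{(j)}}} \left[ \max_{\substack{u_{jM+2i-1}^{jM+2i} \\ i \in \Omega_{11}^{(j)}}} (T_1) \times \max_{\substack{u_{jM+2i} \\ i \in \Omega_{11}^{(j)}}} (T_2) \right]. \tag{9}$$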
Fig. 1. Examples for calculating the $\max()$ and $\max_2()$ functions when $M = 4$, $N = 4$, $q = 2$, and $j = 0$: (a) the
calculation rule for $\Pr(y_1^4 \mid u_1^4)$, (b) the calculation of $\max(\Pr(y_1^4 \mid u_1^4))$ in [23] when the frozen-location pattern is ‘DDDD’,
(c) the calculation of $\max(\Pr(y_1^4 \mid u_1^4))$ in [23] when the frozen-location pattern is ‘FDDD’, (d) the calculation of
$\max_2(\Pr(y_1^4 \mid u_1^4))$ when the frozen-location pattern is ‘DDDD’, and (e) the calculation of $\max_2(\Pr(y_1^4 \mid u_1^4))$
when the frozen-location pattern is ‘FDDD’.
Under Assumption 1, considering Eq. (8), if $\Omega_{01}^{(j)} = \emptyset$, the
maximal value of $T$ is just a product of the maximal value
of $T_1$ and the maximal value of $T_2$. For example, Fig. 1(b)
shows an example of frozen-location pattern ‘DDDD’ which
has $\Omega_{01}^{(0)} = \emptyset$, when $M = 4$ and $N = 4$. If $\Omega_{01}^{(j)}$ is not empty,
for any $i \in \Omega_{01}^{(j)}$, $v_{\frac{jM}{2}+i} = u_{jM+2i}$.
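The factorization in Eqs. (8) and (9) can be checked on toy data. The Python sketch below is only an illustration under stated assumptions: `t1` and `t2` are made-up lookup tables standing in for $T_1$ and $T_2$ (indexed by the $v$ bits and the even-indexed $u$ bits of one symbol), and all function names are hypothetical. It compares the divide-and-conquer maximum with an exhaustive search over every symbol value consistent with a frozen-location pattern.

```python
import itertools
import random


def group_values(table, base, free_positions):
    """Yield table entries for every assignment of the free ('DD') positions."""
    for bits in itertools.product((0, 1), repeat=len(free_positions)):
        index = list(base)
        for pos, bit in zip(free_positions, bits):
            index[pos] = bit
        yield table[tuple(index)]


def divide_and_conquer_max(pattern, t1, t2):
    """Eqs. (8)/(9): outer max over 'FD' pairs, product of two inner maxima."""
    pairs = [pattern[i:i + 2] for i in range(0, len(pattern), 2)]
    assert "DF" not in pairs, "pattern violates Assumption 1"
    fd = [i for i, p in enumerate(pairs) if p == "FD"]  # Omega_01
    dd = [i for i, p in enumerate(pairs) if p == "DD"]  # Omega_11
    best = 0.0
    for b in itertools.product((0, 1), repeat=len(fd)):
        base = [0] * len(pairs)            # 'FF' pairs contribute zeros
        for pos, bit in zip(fd, b):
            base[pos] = bit                # v_i = u_{2i} for 'FD' pairs
        best = max(best, max(group_values(t1, base, dd)) * max(group_values(t2, base, dd)))
    return best


def brute_force_max(pattern, t1, t2):
    """Exhaustive maximum of T1(v) * T2(u_even) over all valid symbols."""
    free = [i for i, s in enumerate(pattern) if s == "D"]
    best = 0.0
    for bits in itertools.product((0, 1), repeat=len(free)):
        u = [0] * len(pattern)
        for pos, bit in zip(free, bits):
            u[pos] = bit
        v = tuple(u[i] ^ u[i + 1] for i in range(0, len(u), 2))
        u_even = tuple(u[i + 1] for i in range(0, len(u), 2))
        best = max(best, t1[v] * t2[u_even])
    return best


def random_table(num_bits, rng):
    """Made-up positive values indexed by bit tuples (illustration only)."""
    return {idx: rng.uniform(0.01, 1.0) for idx in itertools.product((0, 1), repeat=num_bits)}
```

With `rng = random.Random(0)` and two tables from `random_table(2, rng)`, the two functions agree for patterns such as ‘DDDD’ and ‘FDDD’, which correspond to Fig. 1(b) and 1(c).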
B. Divide-and-conquer AML decoding unit for SCL-based
algorithms
Extending the idea in Eqs. (8) and (9), we propose a divide-
and-conquer AML decoding method for SCL-based algorithms
under Assumption 1. For SC-based algorithms, only the maximal
value of $\Pr(y, \hat{u}_1^{jM} \mid u_{jM+1}^{jM+M})$ is needed. In contrast, for
SCL-based algorithms with list size $L$, the $L$ maximal values
of $\Pr(y, \hat{u}_1^{jM} \mid u_{jM+1}^{jM+M})$ are needed. Put simply, in
our method the $\max(\Pr(\rho))$ function is replaced by
a function finding the $L$ maximal values of $\Pr(\rho)$, denoted
as $[\Pr(\rho_1), \cdots, \Pr(\rho_L)] = \max_L(\Pr(\rho))$. The pairwise product
$\max_L(\Pr(\rho)) \otimes \max_L(\Pr(\psi))$ generates the $L^2$ values
$\Pr(\rho_i)\Pr(\psi_j)$ for $1 \le i, j \le L$.
The path expansion-and-pruning procedure of SCL-based
algorithms is divided into two stages. In the first stage, the
q most reliable paths are selected for each list by calculating
and comparing path metrics. In the second stage, the L most
reliable paths are selected among the qL surviving paths generated
in the first stage. This two-stage approach was proposed in our prior
work [19], and the novelty herein is that we use the
divide-and-conquer method to reduce the complexity of the first stage.
The second stage has been described in [19], and we omit its
discussion.
Assume $|\Omega_{01}^{(j)}| = \beta_j$ and $|\Omega_{11}^{(j)}| = \gamma_j$. When $\beta_j \ge 1$, let
$\Omega_{01}^{(j)} = \{i_1^{(j)}, i_2^{(j)}, \cdots, i_{\beta_j}^{(j)}\}$
($i_1^{(j)} < i_2^{(j)} < \cdots < i_{\beta_j}^{(j)}$). The
first stage includes:
Step 0: The RCC method [19] is applied to calculate both
$T_1$ and $T_2$.
Step 1: Given any $\beta_j$-bit binary vector
$B^{(j)} = (u_{jM+2i_1^{(j)}}, u_{jM+2i_2^{(j)}}, \ldots, u_{jM+2i_{\beta_j}^{(j)}})$, there are $2^{\gamma_j}$ possible
values for both $v_{\frac{jM}{2}+1}^{\frac{jM+M}{2}}$ and $u_{jM+1,e}^{jM+M}$. We find the $\min(q, 2^{\gamma_j})$
maximal values among the $2^{\gamma_j}$ values of $T_1$, and the $\min(q, 2^{\gamma_j})$
maximal values among the $2^{\gamma_j}$ values of $T_2$.
Step 2: For $B^{(j)}$, there are $(\min(q, 2^{\gamma_j}))^2$ values of $T$,
each of which is a product of values of $T_1$ and $T_2$ generated in Step 1.
Step 3: The $q$ maximal values are selected from the
$(\min(q, 2^{\gamma_j}))^2\, 2^{\beta_j}$ values of $T$ generated by Step 2, because
there are $2^{\beta_j}$ possible values for $B^{(j)}$.
If $\Omega_{01}^{(j)} = \emptyset$ and $\beta_j = 0$, we still use the aforementioned
four steps to find the $q$ most reliable paths for each list, except
that $B^{(j)}$ is considered as a void binary vector, which is the
only value for $B^{(j)}$ when $\beta_j = 0$.
Fig. 1(d) and 1(e) show two examples for frozen-location
patterns ‘DDDD’ and ‘FDDD’, respectively when M = 4,
N = 4, and q = 2. After these four steps are carried out for
each list, there are qL values of T left, which are sorted to
choose the L maximal values in the second stage.
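Steps 1 to 3 of the first stage can be sketched in Python for a single list. This is a toy model under stated assumptions: `t1` and `t2` are made-up lookup tables standing in for the outputs of Step 0, only metric values (not the associated symbol values) are tracked, and the function names are hypothetical. An exhaustive top-$q$ search is included for comparison.

```python
import heapq
import itertools
import random


def group_values(table, base, free_positions):
    """Yield table entries for every assignment of the free ('DD') positions."""
    for bits in itertools.product((0, 1), repeat=len(free_positions)):
        index = list(base)
        for pos, bit in zip(free_positions, bits):
            index[pos] = bit
        yield table[tuple(index)]


def first_stage_top_q(pattern, t1, t2, q):
    """Return the q largest values of T using Steps 1-3."""
    pairs = [pattern[i:i + 2] for i in range(0, len(pattern), 2)]
    fd = [i for i, p in enumerate(pairs) if p == "FD"]  # Omega_01, size beta
    dd = [i for i, p in enumerate(pairs) if p == "DD"]  # Omega_11, size gamma
    candidates = []
    for b in itertools.product((0, 1), repeat=len(fd)):  # every value of B
        base = [0] * len(pairs)
        for pos, bit in zip(fd, b):
            base[pos] = bit
        # Step 1: keep min(q, 2^gamma) largest values of T1 and of T2.
        top_t1 = heapq.nlargest(q, group_values(t1, base, dd))
        top_t2 = heapq.nlargest(q, group_values(t2, base, dd))
        # Step 2: form all pairwise products for this B.
        candidates.extend(a * c for a in top_t1 for c in top_t2)
    # Step 3: keep the q largest products over all values of B.
    return heapq.nlargest(q, candidates)


def exhaustive_top_q(pattern, t1, t2, q):
    """Reference: compute every T value first, then keep the q largest."""
    free = [i for i, s in enumerate(pattern) if s == "D"]
    values = []
    for bits in itertools.product((0, 1), repeat=len(free)):
        u = [0] * len(pattern)
        for pos, bit in zip(free, bits):
            u[pos] = bit
        v = tuple(u[i] ^ u[i + 1] for i in range(0, len(u), 2))
        u_even = tuple(u[i + 1] for i in range(0, len(u), 2))
        values.append(t1[v] * t2[u_even])
    return heapq.nlargest(q, values)


def random_table(num_bits, rng):
    """Made-up positive values indexed by bit tuples (illustration only)."""
    return {idx: rng.uniform(0.01, 1.0) for idx in itertools.product((0, 1), repeat=num_bits)}
```

For positive table entries the two functions return the same $q$ values, because a product can only rank among the $q$ largest if both of its factors rank among the $q$ largest within their group.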
The proposed divide-and-conquer AML decoding unit has
a lower computational complexity. It reduces the number of
symbol-wise channel transition probabilities dealt with by the list
pruning function by sorting the intermediate calculation results
generated by the RCC method [19], whereas the DMM, RCC,
and DRH methods perform the list pruning function after all the
symbol-wise channel transition probabilities are calculated.
For example, in Fig. 1(d), the DMM, RCC, and DRH methods
perform $\max_2(\Pr(y_1^4 \mid u_1^4))$ after all 16 values of $\Pr(y_1^4 \mid u_1^4)$ are
calculated. The proposed divide-and-conquer method performs
$\max_2(\Pr(y_1^2 \mid v_1^2))$ and $\max_2(\Pr(y_3^4 \mid u_2, u_4))$ first. Then it finds
the two maximal values out of the four elements generated by
$\max_2(\Pr(y_1^2 \mid v_1^2)) \otimes \max_2(\Pr(y_3^4 \mid u_2, u_4))$. The output of the
proposed AML decoding unit is the same as those of other
ML decoding units if they have the same input.
Given an $M$-bit symbol $u_{jM+1}^{jM+M}$, $\Omega_{01}^{(j)}$, $\Omega_{11}^{(j)}$,
$\Pr(y_1^{\frac{N}{2}}, \hat{v}_1^{\frac{jM}{2}} \mid v_{\frac{jM}{2}+1}^{\frac{jM+M}{2}})$, and
$\Pr(y_{\frac{N}{2}+1}^{N}, \hat{u}_{1,e}^{jM} \mid u_{jM+1,e}^{jM+M})$, the
first stage using the divide-and-conquer decoding unit needs
two $2^{\gamma_j}$-to-$\min(q, 2^{\gamma_j})$ sorts, one
$(\min(q, 2^{\gamma_j}))^2\, 2^{\beta_j}$-to-$q$ sort, and
$(\min(q, 2^{\gamma_j}))^2\, 2^{\beta_j}$ multiplications per list, whereas
the ML decoding unit in [19] needs $2^{\beta_j + 2\gamma_j}$ multiplications
and a $2^{\beta_j + 2\gamma_j}$-to-$q$ sort per list. By examining all possible
values of $\beta_j$ and $\gamma_j$, we can find the worst-case computational
complexity.
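The worst-case search over $\beta_j$ and $\gamma_j$ can be sketched as follows. The function names and dictionary keys are hypothetical, and the sketch covers only the sort sizes and the product count of Steps 1 to 3 given by the expressions above; it does not model the cost of Step 0. Under Assumption 1 each of the $M/2$ pairs is ‘FF’, ‘FD’ or ‘DD’, so $\beta_j + \gamma_j \le M/2$.

```python
def first_stage_cost(beta, gamma, q):
    """Sort sizes and product count for one (beta, gamma) combination."""
    keep = min(q, 2 ** gamma)
    return {
        "pre_sort_inputs": 2 ** gamma,                  # each of the two Step 1 sorts
        "final_sort_inputs": keep * keep * 2 ** beta,   # Step 2 products fed to Step 3
        "exhaustive_inputs": 2 ** (beta + 2 * gamma),   # sort size without divide-and-conquer
    }


def worst_case(M, q):
    """Maximize each cost term over all beta + gamma <= M/2."""
    pairs = M // 2
    costs = [first_stage_cost(b, g, q)
             for b in range(pairs + 1)
             for g in range(pairs + 1 - b)]
    return {key: max(c[key] for c in costs) for key in costs[0]}
```

For $M = 8$ and $q = 4$ this gives 16-input pre-sorts and a 64-input final sort, against a 256-input sort for the exhaustive approach, consistent with the sort sizes reported in Table I.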
We demonstrate the advantage of the proposed divide-and-conquer
AML unit in computational complexity over
other ML decoding units with an example of M = 8 and q =
4. Henceforth, we only discuss the computational complexity
per list to accomplish the job of the first stage of the proposed
method. Table I lists the worst-case computational complexities of
different methods and shows that the proposed method has the
smallest computational complexity when all 81 eight-bit
frozen-location patterns under Assumption 1 need to be
handled.
Regarding the impact on the error performance, our proposed
method has the same performance degradation as in
[25]. If q ≥ L, our method does not introduce any performance
degradation for SCL-based algorithms. If q < L, the
performance degradation depends on the values of q and L, and the
performance degradation is usually negligible when q and L
are small. Fig. 2 shows the frame and bit error rates of a (2048,
1433) polar code with a 32-bit CRC for CA-SCL decoders with
the LC-AML decoding unit when M = 8 and L = 8. When
q = 4, the performance loss is negligible. However, q = 2
leads to a performance loss of about 0.1 dB at an FER level
of $10^{-3}$.

TABLE I
WORST-CASE COMPUTATIONAL COMPLEXITY OF DIFFERENT
METHODS WHEN M = 8 AND q = 4.
Method                     | Computational Complexity
RCC [25] ‡                 | 304 multiplications, a 256-to-4 sort
DMM [15] ‡                 | 1792 multiplications, a 256-to-4 sort
DRH [27] ‡                 | 784 multiplications, a 256-to-4 sort
Divide-and-Conquer AML ‡   | 112 multiplications, a 64-to-4 sort and two 16-to-4 sorts
Divide-and-Conquer AML †   | 80 multiplications, a 32-to-4 sort and two 16-to-4 sorts
LC-AML *                   | 80 multiplications, a 32-to-4 sort and four 8-to-4 sorts
‡ All 81 eight-bit patterns under Assumption 1 are dealt with.
† Only nine eight-bit patterns in Sec. IV-A are dealt with.
* Only six eight-bit patterns in Sec. V are dealt with.

Fig. 2. Frame and bit error rates of CA-SCL decoders with different values of q for
a (2048, 1433) polar code with a 32-bit CRC over the AWGN channel when
adopting the LC-AML decoding unit.
IV. FROZEN-LOCATION PATTERNS FOR POLAR CODES
Considering the hardware implementation for the divide-
and-conquer AML unit, a uniform hardware design for all
frozen-location patterns is preferred rather than different dedi-
cated designs for various frozen-location patterns. For M = 8
and M = 16, there are 81 and 6561 possible frozen-location
patterns satisfying Assumption 1, respectively. Actually, some
of them may never exist in a polar code. Therefore, we want to
know the exact number of frozen-location patterns in a polar
code, since the number of frozen-location patterns impacts the
complexity of the divide-and-conquer AML decoding unit for
SCL-based algorithms: the more frozen-location patterns, the
higher complexity the divide-and-conquer AML decoding unit
has.
Fig. 3. Recursive calculation of erasure probabilities for polar codes over
the BEC when the code length is no more than eight.
A. Polar Codes for the BEC
For polar codes constructed for the BEC with an erasure
probability $\epsilon$ ($0 < \epsilon < 1$), Eqs. (1) and (2) are used in [28].
Fig. 3 illustrates the transition graph of the erasure probability
for constructing a polar code with a code length of no more
than eight. Because of the recursive calculation, it can also be
viewed as a sub-graph of the erasure probability transition for
any eight-bit symbol of a polar code. In order to
examine frozen-location patterns in a polar code, we have the
following results regarding the ordering of $z_{i,j}$ for $i \ge 1$ and
$1 \le j \le 2^i$. This ordering determines the possible frozen-location
patterns in a polar code.
Proposition 1. Assuming $z_{0,1} = \epsilon \in (0, 1)$, given any $i \ge 1$
and $1 \le j \le 2^i$, $z_{i,j}$ is calculated by Eq. (1) or (2). We have
(a) $0 < z_{i,j} < 1$ for $i \ge 1$ and $1 \le j \le 2^i$,
(b) $z_{i,2j-1} > z_{i,2j}$ for $i \ge 1$ and $1 \le j \le 2^{i-1}$,
(c) $z_{i,4j-3} > z_{i,4j-2} > z_{i,4j-1} > z_{i,4j}$ for $i \ge 2$ and
$1 \le j \le 2^{i-2}$,
(d) $z_{i,8j-7} > z_{i,8j-6} > z_{i,8j-5} > z_{i,8j-3} > z_{i,8j-4} > z_{i,8j-2} > z_{i,8j-1} > z_{i,8j}$
for $i \ge 3$ and $1 \le j \le 2^{i-3}$.
The proof of Proposition 1 is provided in the Appendix.
Now, let us explain how the ordering of $z_{n,j}$ determines
$2^m$-bit ($1 \le m \le 3$) frozen-location patterns in an $(N,K)$
polar code over the BEC. First, to choose the elements of $\mathcal{A}^c$ for
an $(N,K)$ polar code over the BEC, $\mathcal{A}^c$ is chosen such that
$\sum_{j \in \mathcal{A}^c} z_{n,j}$ is maximal and $|\mathcal{A}^c| = N - K$, where $N = 2^n$.
Then, if there are $k_j$ frozen bits in a symbol $u_{2^m j+1}^{2^m (j+1)}$
($0 \le j < \frac{N}{2^m}$), a set $\mathcal{A}_j^c$ consisting of the indices of these $k_j$ frozen
bits must be chosen such that $\sum_{t \in \mathcal{A}_j^c} z_{n,t}$ is maximal while
$|\mathcal{A}_j^c| = k_j$. For example, assuming there are four frozen bits in
$u_1^8$ in a (16, 12) polar code, by Proposition 1(d),
$z_{4,1} > z_{4,2} > z_{4,3} > z_{4,5} > z_{4,4} > z_{4,6} > z_{4,7} > z_{4,8}$. Hence, $u_1$, $u_2$, $u_3$,
and $u_5$ will be frozen bits and the frozen-location pattern for
$u_1^8$ will be ‘FFFDFDDD’.
Therefore, for polar codes constructed by the method in
[28], by Proposition 1(b), there are three two-bit frozen-
location patterns: ‘DD’, ‘FD’, and ‘FF’. We note that the
implication of Proposition 1(b) is the counterpart over the
BEC of Assumption 1 in [23]. By Proposition 1(c), there are
five four-bit frozen-location patterns: ‘DDDD’, ‘FDDD’,
‘FFDD’, ‘FFFD’, and ‘FFFF’. By Proposition
1(d), there are nine eight-bit frozen-location patterns:
‘DDDDDDDD’, ‘FDDDDDDD’, ‘FFDDDDDD’,
‘FFFDDDDD’, ‘FFFDFDDD’, ‘FFFFFDDD’,
‘FFFFFFDD’, ‘FFFFFFFD’, and ‘FFFFFFFF’.
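The way an ordering of $z_{n,j}$ fixes the admissible patterns can be sketched in a few lines of Python. The helper name `patterns_from_ordering` and the constant `ORDER_8` are hypothetical; `ORDER_8` simply transcribes the ordering of Proposition 1(d) for one eight-bit symbol. Freezing the first $k$ indices of the ordering for $k = 0, \ldots, 8$ yields one pattern per value of $k$.

```python
# One-based bit indices sorted by decreasing z, as in Proposition 1(d).
ORDER_8 = [1, 2, 3, 5, 4, 6, 7, 8]


def patterns_from_ordering(order):
    """Return the frozen-location pattern for each possible number of frozen bits."""
    size = len(order)
    patterns = []
    for k in range(size + 1):
        frozen = set(order[:k])  # the k positions with the largest z are frozen
        patterns.append("".join("F" if j in frozen else "D" for j in range(1, size + 1)))
    return patterns
```

Calling `patterns_from_ordering(ORDER_8)` reproduces the nine eight-bit patterns listed above, including ‘FFFDFDDD’ for four frozen bits.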
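For a larger symbol size, it is hard to get the ordering of $z_{i,j}$
by an analytical method, but a numerical method can be used. For
example, consider a symbol size of 16. By Proposition 1(d), we have
$z_{4,1} > z_{4,2} > z_{4,3} > z_{4,5} > z_{4,4} > z_{4,6} > z_{4,7} > z_{4,8}$ and
$z_{4,9} > z_{4,10} > z_{4,11} > z_{4,13} > z_{4,12} > z_{4,14} > z_{4,15} > z_{4,16}$.
We also have $z_{4,5} > z_{4,9} > z_{4,7} > z_{4,11}$ and
$z_{4,4} > z_{4,6} > z_{4,10} > z_{4,8} > z_{4,12}$. For $0 < z_{0,1} = \epsilon < 1$,
$$z_{4,10} - z_{4,7} = 2\epsilon^4(\epsilon-1)^4\left[\epsilon^3(\epsilon^2 - \epsilon + 24)(\epsilon - 1)^3 - 8\right] < 0.$$
Moreover, for $0 < z_{0,1} = \epsilon < 1$,
$$z_{4,4} - z_{4,9} = 2\epsilon^2(\epsilon-1)^4(\epsilon^{10} - 4\epsilon^9 + 34\epsilon^8 - 116\epsilon^7 + 237\epsilon^6 - 376\epsilon^5 + 420\epsilon^4 - 280\epsilon^3 + 102\epsilon^2 - 16\epsilon - 4) < 0$$
and
$$z_{4,8} - z_{4,13} = 2\epsilon^4(\epsilon-1)^2(\epsilon^{10} - 6\epsilon^9 + 43\epsilon^8 - 132\epsilon^7 + 251\epsilon^6 - 262\epsilon^5 + 121\epsilon^4 - 8\epsilon^3 - 6\epsilon^2 - 4\epsilon - 2) < 0.$$
These two inequalities are verified numerically.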
Because of the recursive calculation of $z_{i,j}$, for $i \ge 4$ and
$1 \le j \le 2^{i-4}$, we have
$$z_{i,16j-15} > z_{i,16j-14} > z_{i,16j-13} > z_{i,16j-11} > z_{i,16j-7} > z_{i,16j-12} > z_{i,16j-10} > z_{i,16j-9}$$
$$> z_{i,16j-6} > z_{i,16j-5} > z_{i,16j-3} > z_{i,16j-8} > z_{i,16j-4} > z_{i,16j-2} > z_{i,16j-1} > z_{i,16j}.$$
Thus, there are only 17 frozen-location patterns for 16-bit
symbols.
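The 16-bit ordering can be spot-checked numerically. The sketch below is an illustration, not the verification procedure used in the paper: it evaluates Eqs. (1) and (2) with exact rational arithmetic on a grid of erasure probabilities chosen arbitrarily for this example, and compares the resulting ordering of $z_{4,1}, \ldots, z_{4,16}$ with the chain above. The names `bec_z`, `ordering`, `EXPECTED_ORDER` and `grid` are hypothetical.

```python
from fractions import Fraction


def bec_z(eps, n):
    """Return [z_{n,1}, ..., z_{n,2^n}] from Eqs. (1) and (2)."""
    z = [eps]
    for _ in range(n):
        z = [w for val in z for w in (2 * val - val * val, val * val)]
    return z


# One-based indices sorted by decreasing z_{4,j}, transcribed from the chain above.
EXPECTED_ORDER = [1, 2, 3, 5, 9, 4, 6, 7, 10, 11, 13, 8, 12, 14, 15, 16]


def ordering(eps, n=4):
    """Return the one-based indices sorted by decreasing z_{n,j}."""
    z = bec_z(eps, n)
    return sorted(range(1, len(z) + 1), key=lambda j: z[j - 1], reverse=True)


# Exact rationals avoid floating-point ties when eps is close to 0 or 1.
grid = [Fraction(k, 100) for k in range(1, 100)]
all_match = all(ordering(eps) == EXPECTED_ORDER for eps in grid)

# Freezing the first k indices of the ordering gives one pattern per k = 0..16.
patterns = set()
for k in range(17):
    frozen = set(EXPECTED_ORDER[:k])
    patterns.add("".join("F" if j in frozen else "D" for j in range(1, 17)))
```

On this grid `all_match` is true and `patterns` contains 17 distinct strings, in line with the count stated above.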
It is not meaningful to consider the symbol size greater than
16, because this will incur very high complexity for hardware
implementations.
B. Polar Codes for the AWGN Channel
For the construction method introduced in [29] for the
AWGN channel, it is difficult to analyze the relationship
between the values $z_{3,i}$ for $1 \le i \le 8$ based on Eqs. (3) and (5).
Instead, we examine eight polar codes constructed with the
method in [29], which have code lengths from $2^{10}$ to $2^{13}$ and
code rates of 0.5 and 0.8, to identify eight-bit frozen-location
patterns. By examining all eight-bit symbols of these polar
codes, we found that in these codes there are only nine eight-
bit frozen-location patterns, which are the same as those for
polar codes constructed for the BEC, listed in Sec. IV-A. Our
observation is consistent with Assumption 1 in [23].
C. Computational Complexity of the Divide-and-Conquer
AML Decoding Unit
When it needs to deal with only the frozen-location patterns
mentioned in Sections IV-A and IV-B, the divide-and-conquer
AML decoding unit has a smaller complexity. If M = 8 and
q = 4, it needs 80 multiplications, two 16-to-4 sorts, and a
32-to-4 sort, as listed in Table I. It saves 32 multiplications, a
32-to-4 sort and an 8-to-4 sort compared with the
divide-and-conquer AML decoding unit which deals with all 81
frozen-location patterns satisfying Assumption 1, since a 64-to-4 sort
consists of two 32-to-4 sorts and an 8-to-4 sort.
If M = 16 and q = 4, to deal with all $3^8 = 6561$ 16-bit
frozen-location patterns satisfying Assumption 1, the first stage
of the proposed ML decoding unit needs 1632 multiplications,
two 256-to-4 sorts, and a 1024-to-4 sort. However, to deal with
only the 17 16-bit frozen-location patterns discussed in Section IV-A,
the simplified divide-and-conquer AML decoding unit needs
736 multiplications, two 256-to-4 sorts, and a 128-to-4 sort.
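The reduction in sort sizes when only the observed patterns are handled can be reproduced with a short Python sketch. It evaluates $\beta_j$ and $\gamma_j$ for each pattern and takes the worst case of the Step 1 sort size $2^{\gamma_j}$ and the Step 3 sort size $(\min(q, 2^{\gamma_j}))^2 2^{\beta_j}$. The helper names and the constants `ORDER_8` and `ORDER_16` (transcriptions of the orderings in Section IV-A) are assumptions of this example, and multiplication counts that include Step 0 are not modeled.

```python
ORDER_8 = [1, 2, 3, 5, 4, 6, 7, 8]
ORDER_16 = [1, 2, 3, 5, 9, 4, 6, 7, 10, 11, 13, 8, 12, 14, 15, 16]


def patterns_from_ordering(order):
    """Return one frozen-location pattern per possible number of frozen bits."""
    size = len(order)
    result = []
    for k in range(size + 1):
        frozen = set(order[:k])
        result.append("".join("F" if j in frozen else "D" for j in range(1, size + 1)))
    return result


def beta_gamma(pattern):
    """Count 'FD' pairs (beta) and 'DD' pairs (gamma) in a pattern."""
    pairs = [pattern[i:i + 2] for i in range(0, len(pattern), 2)]
    return pairs.count("FD"), pairs.count("DD")


def worst_case_over(patterns, q):
    """Return (largest Step 1 sort input size, largest Step 3 sort input size)."""
    pre_sort = 0
    final_sort = 0
    for pattern in patterns:
        beta, gamma = beta_gamma(pattern)
        keep = min(q, 2 ** gamma)
        pre_sort = max(pre_sort, 2 ** gamma)
        final_sort = max(final_sort, keep * keep * 2 ** beta)
    return pre_sort, final_sort
```

With $q = 4$, the nine eight-bit patterns give `(16, 32)` and the 17 16-bit patterns give `(256, 128)`, matching the sort sizes quoted in this subsection.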
V. LOW-COMPLEXITY AML DECODING UNIT
For convenience, we implement the proposed divide-and-
conquer AML decoding unit, assuming M = 8 henceforth.
Our implementation can be readily extended to other values
of M . To further reduce complexity and latency, we do
not use the divide-and-conquer method to deal with patterns
‘DDDDDDDD’, ‘FFFFFFFD’, and ‘FFFFFFFF’,
which will be described in Sec. V-B. Then the divide-and-
conquer AML decoding unit can be simplified further by
dealing with only the remaining six eight-bit frozen-location
patterns. This simplified divide-and-conquer AML decoding
unit, referred to as the LC-AML decoding unit, needs 80
multiplications, four 8-to-4 sorts, and a 32-to-4 sort. It saves
two 8-to-4 sorts compared with the divide-and-conquer AML
decoding unit dealing with nine patterns, since a 16-to-4 sort
consists of three 8-to-4 sorts. This also leads to a shorter
critical path in our design than the divide-and-conquer AML
decoding unit.
A. Hardware Design for the LC-AML Decoding Unit
SCL-based polar decoders in the literature can be divided
into two categories: the log-likelihood (LL) based decoders
[11], [25], [30] and the log-likelihood-ratio (LLR) based de-
coders [10], [27]. Although our proposed algorithm in Sec. III
is described in the probability domain, it can be easily adapted
for both the LL-based decoder and the LLR-based decoder.
We focus on the LLR-based polar decoder, because in general
the LLR-based decoder has a better area efficiency than the
LL-based decoder.
First, we adapt the proposed LC-AML decoding unit to the
LLR-based SCL decoder. Suppose the path metrics $PM_k^{(t)}$ of the $L$ list
survivors are given and $u_t$ is the last bit processed by the
decoder, where $1 \le k \le L$, $1 \le t \le N$, and $t$ is a multiple
of $M$. Suppose $\alpha_{j,l}$ ($0 \le j < M$) represents the LLR of
$\Pr\left(y_{j\frac{N}{M}+1}^{(j+1)\frac{N}{M}}, \hat{w}_{1+j\frac{N}{M}}^{\frac{t}{M}+j\frac{N}{M}} \,\middle|\, w_{\frac{t}{M}+1+j\frac{N}{M}}\right)$
corresponding to the list $l$.
The path metric $PM_{k,p}^{(t+M)}$ of the $p$-th expanded path from the
$k$-th list survivor corresponding to $u_{t+1}^{t+M} = p$ ($0 \le p < 2^M$)
is $PM_{k,p}^{(t+M)} = PM_k^{(t)} + \sum_{j=0}^{M-1} m_j |\alpha_{j,l}|$, where $m_j = 0$ if
$w_{\frac{t}{M}+1+j\frac{N}{M}} = \frac{1}{2}(1 - \mathrm{sign}(\alpha_{j,l}))$ [10]; otherwise, $m_j = 1$.
Then our goal is to calculate $PM_{k,p}^{(t+M)}$ and to select the $L$
minimum values of $PM_{k,p}^{(t+M)}$.
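The path-metric update can be illustrated with a minimal Python sketch. It works directly in the $w$ domain and does not model the mapping between a symbol value $p$ and the bits $w$. The function names are hypothetical, and treating an LLR of exactly zero as a hard decision of 0 is an assumption of this example. Each candidate adds $|\alpha|$ for every bit that disagrees with the hard decision of its LLR, and the $L$ smallest metrics are kept.

```python
import heapq
import itertools


def hard_decision(alpha):
    """Return (1 - sign(alpha)) / 2, taking alpha == 0 as bit 0 (assumption)."""
    return 0 if alpha >= 0 else 1


def path_metric(pm_prev, llrs, w_bits):
    """Add |alpha_j| for every bit w_j that disagrees with its hard decision."""
    penalty = sum(abs(a) for a, w in zip(llrs, w_bits) if w != hard_decision(a))
    return pm_prev + penalty


def expand_and_prune(pm_list, llr_lists, L):
    """Expand every survivor over all bit vectors and keep the L smallest metrics.

    Returns tuples (metric, survivor index, bit vector) in increasing metric order.
    """
    candidates = []
    for k, (pm_prev, llrs) in enumerate(zip(pm_list, llr_lists)):
        for w_bits in itertools.product((0, 1), repeat=len(llrs)):
            candidates.append((path_metric(pm_prev, llrs, w_bits), k, w_bits))
    return heapq.nsmallest(L, candidates)
```

This exhaustive expansion corresponds to the brute-force baseline; the LC-AML unit described next avoids sorting all $2^M$ candidates per list.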
Fig. 4 shows the top architecture of our low-complexity
implementation for the LC-AML decoding unit. MLD_S1
calculates path metrics and selects the q minimum values for
each list. FrzInfVec is an M-bit frozen bit indication vector
$(f_1, f_2, \cdots, f_M)$ for $u_{t+1}^{t+M}$. For $1 \le j \le M$, if $u_{t+j}$ is a
frozen bit, $f_j = 1$; otherwise, $f_j = 0$. LLRInV_l is the vector
$(\alpha_{0,l}, \alpha_{1,l}, \cdots, \alpha_{M-1,l})$ for $1 \le l \le L$.
Fig. 4. Top architecture of the proposed LC-AML decoding unit.
Fig. 5(a) shows the design for MLD_S1_q4 when M = 8
and q = 4. Here, we focus on the data path for calculating
path metrics. The circuitry to generate symbol values associ-
ated with path metrics is simple and consists of XORs, and
therefore is omitted. The data paths corresponding to different
steps aforementioned in Section III are labeled as well.
In Step 0, two RCC blocks, shown in Fig. 5(b),
are used. LLR $a_i$ ($16 \le i \le 31$) associated with
$\Pr\big(y_{N/2+1}^{N}, \hat{u}_{1,e}^{8j} \mid u_{8j+1,e}^{8j+8} = (i-16)_2\big)$ is calculated by the
right RCC block. LLR $a_i$ ($0 \le i \le 15$) associated with
$\Pr\big(y_{1}^{N/2}, \hat{v}_{1}^{4j} \mid v_{4j+1}^{4j+4} = (i)_2\big)$ is calculated by the left RCC
block. Here, $(i)_2$ represents the binary string of integer $i$. 16-
ADDER contains 16 adders to calculate path metrics, shown
in Fig. 5(c).
In Step 1, for different frozen-location patterns, path metrics
go through different data paths selected by 16 2-to-1 multi-
plexers. Their control words are 1 if frozen-location patterns
are FDDDDDDD, FFDDDDDD; otherwise they are 0.
In Step 2, results from Step 1 are combined to calculate
$\sum_{j=0}^{7} m_j |\alpha_{j,l}|$.
In Step 3, there are 32 path metrics going through a 32-to-4
sorter. However, for some frozen-location patterns, the number
of valid symbol values is less than 32 because the number of
frozen bits can be larger than three. Therefore, path metrics
associated with those invalid symbol values need to be set to
the maximal positive value as well so that the 4 minimum path
metrics belong to valid symbol values. MSNG accomplishes
this job with FrzInfVec, which contains the frozen-location
pattern information.
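The masking idea in Step 3 can be sketched in Python as follows. This is a hedged behavioral model, not the MSNG circuit: the helper names are hypothetical, frozen bits are assumed to be fixed to 0, and `float("inf")` stands in for the maximal positive fixed-point value.

```python
MAX_PM = float("inf")  # stand-in for the maximal positive path-metric value

def is_valid(symbol_bits, frz_inf_vec):
    # Assumption: frozen bits are 0, so a candidate symbol is valid only if
    # every position flagged as frozen (f_j = 1) carries a 0.
    return all(b == 0 for b, f in zip(symbol_bits, frz_inf_vec) if f == 1)

def mask_invalid(metrics, symbols, frz_inf_vec):
    # Replace the metric of every invalid symbol by MAX_PM so that a
    # subsequent "select the minimum values" step never returns it.
    return [pm if is_valid(s, frz_inf_vec) else MAX_PM
            for pm, s in zip(metrics, symbols)]
```

After masking, a plain minimum selection can be applied to all candidates without inspecting the frozen-location pattern again.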
Different sorters used in our design are shown in Fig. 5(d),
5(e) and 5(f). S8TO4 finds the minimum four values of eight
values. S4 sorts the four inputs and outputs them in decreasing
order and has a shorter critical path of two comparators and
one 4-to-1 multiplexer than a four-input bitonic sorter [31],
which has a critical path of three comparators. S32TO4 consists
of seven S8TO4 units in a binary tree structure.
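To make the tree structure concrete, the following Python sketch models S32TO4 as a binary tree of 8-to-4 selectors and counts the units used. It is a behavioral model under stated assumptions: the function names are hypothetical, each S8TO4 is modeled by a software sort rather than by S4 and S2 blocks, and the output ordering of the hardware blocks is not modeled.

```python
def s8to4(values):
    # Behavioral model of S8TO4: keep the four smallest of eight inputs.
    assert len(values) == 8
    return sorted(values)[:4]

def s32to4(values):
    # Binary tree of 8-to-4 selectors: four units at the first level,
    # two at the second, and one at the root.
    assert len(values) == 32
    units = 0
    level = [values[i:i + 8] for i in range(0, 32, 8)]
    while True:
        outs = [s8to4(group) for group in level]
        units += len(outs)
        if len(outs) == 1:
            return outs[0], units
        # Pair up neighbouring 4-element results into new 8-element groups.
        level = [outs[i] + outs[i + 1] for i in range(0, len(outs), 2)]
```

The tree is exact: any of the four overall smallest values is also among the four smallest of its own group of eight, so it survives every level.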
Although MLD_S1_q4 is designed for the six eight-bit frozen-
location patterns, it can also deal with other frozen-location
patterns, such as all frozen-location patterns
satisfying the following two conditions. First, the frozen-
location pattern has at least three ‘F’s. Second, the first
two bits of the data symbol are frozen bits.
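The two conditions can be written as a small predicate. This Python sketch is illustrative only: the function name is hypothetical, patterns are assumed to be eight-character strings of 'F' and 'D', and the check covers only the two stated conditions (the six target patterns themselves are handled regardless of this test).

```python
def meets_extra_conditions(pattern):
    # pattern: eight characters, 'F' for a frozen bit and 'D' for a data bit.
    # Condition 1: at least three frozen bits.
    # Condition 2: the first two bits of the symbol are frozen.
    return (len(pattern) == 8
            and pattern.count("F") >= 3
            and pattern.startswith("FF"))
```

For instance, a pattern such as 'FFDDDFDD' satisfies both conditions, while 'FDFFDDDD' fails the second one.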
Fig. 5. Design of MLD_S1_q4 when M = 8 and q = 4. (a) Top architecture of MLD_S1_q4, (b) diagram of RCC, (c) diagram of 16-ADDER, (d) diagram
of S8TO4, (e) diagram of S4, (f) diagram of S2.
B. Area-Efficient SCL Decoder
To examine the advantage of our proposed design, we
incorporate MLD_S1_q4 into CA-SCL polar decoders with
list size L = 4. Architecture-wise, our decoder, referred to as
the AE-SCL decoder, is almost the same as the architecture
of the tree based reduced latency SCL polar decoder in [27],
which performs the CA-SCL decoding algorithm on a binary
tree representation of a polar code, except that our AE-SCL
decoder uses the LC-AML decoding unit instead of the DRH
ML decoding unit used in [27].
Leaf nodes of the decoding tree for our decoder are divided
into four categories:
1) Rate-0 node: its frozen-location pattern contains only ‘F’,
i.e., the node contains only frozen bits.
2) Rate-1 node: its frozen-location pattern contains only ‘D’,
i.e., the node contains only information bits.
3) Repetition node [13]: its frozen-location pattern is either
‘FFFFFFFF FFFFFFFD’ or ‘FFFFFFFD’.
4) Rate-R-2 node: its frozen-location pattern is one of the
six eight-bit frozen-location patterns.
Rate-0 and rate-1 nodes are decoded with the same methods
as in [27]. The main difference between our proposed decoder
and the tree based reduced latency SCL polar decoder
is how repetition nodes and rate-R-2 nodes are handled. For
repetition nodes, a binary tree of adders is used to calculate
LLRs in order to reduce the decoding latency [13]. Rate-R-2
nodes are dealt with by MLD_S1_q4, which reduces the area
of AE-SCL decoders.
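The four leaf-node categories listed above can be sketched as a simple classifier. This Python fragment is an assumption-laden illustration: the function name is hypothetical, patterns are strings of 'F' and 'D', and the six rate-R-2 patterns are passed in as a parameter because they are not enumerated in this section.

```python
REPETITION_PATTERNS = ("FFFFFFFFFFFFFFFD", "FFFFFFFD")

def classify_leaf(pattern, rate_r2_patterns):
    # rate_r2_patterns: the set of six eight-bit frozen-location patterns
    # targeted by the LC-AML unit (supplied by the caller).
    if set(pattern) == {"F"}:
        return "rate-0"
    if set(pattern) == {"D"}:
        return "rate-1"
    if pattern in REPETITION_PATTERNS:
        return "repetition"
    if pattern in rate_r2_patterns:
        return "rate-R-2"
    return None  # not a leaf node; the decoding tree is split further
```

A caller would walk the decoding tree top-down and stop splitting a subtree as soon as `classify_leaf` returns a category.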
C. Synthesis Results
AE-SCL decoders with L = 4 are implemented for three
polar codes: a (1024, 512) code, an (8192,4096) code, and
a (32768, 29504) code. All three codes use a 32-bit
cyclic redundancy check. The number of processing units of
decoders for N = 1024 is 256. For the other two codes,
the decoder has 512 processing units. Five-bit channel LLRs
are used. The synthesis tool is Cadence RTL compiler. The
process technology is TSMC 90nm CMOS technology. Here,
four stages of pipeline registers are used in the LC-AML
decoding unit. Areas of different ML decoding units for the
(1024, 512) polar codes are listed in Table II. The area of our
proposed LC-AML decoding unit is only one fourth of that of
the ML decoding unit in [27]. By taking into account fewer
patterns, the area of the LC-AML decoding unit is 67% of that
of the Divide-and-Conquer AML design which deals with all
81 eight-bit frozen-location patterns following Assumption 1.
The synthesis results of three entire decoders (AE-SCLs)
are also listed in Tables V, VI, and VII, respectively. Here,
NIT means the net information throughput. Compared with
decoders in [27], the SCL decoder architecture with the best
area efficiency to our knowledge, the AE-SCL decoders have
smaller areas because the proposed LC-AML decoding unit
is applied. The LC-AML decoding unit has a slightly larger
decoding latency than that in [27], because the proposed LC-
AML decoding unit deals with only eight-bit frozen-location
patterns, whereas the ML decoding unit in [27] can deal with
some 16-bit frozen-location patterns. Since the extra decoding
cycles needed by AE-SCL decoders are a very small fraction
of the entire decoding cycles, the proposed AE-SCL decoders
still achieve better area efficiency than decoders in [27]. For
example, for the (1024, 512) polar code, the area efficiency
of the AE-SCL decoder is 1.93 times that of the decoder
in [27]. As the code length increases, the advantage in area
efficiency shrinks because the ML decoding unit occupies a
smaller fraction of the entire decoder if the code is longer.
Compared with symbol-decision SCL decoders in [10], [24],
[25], the advantage of our decoders in area efficiency is
more significant. The area efficiency of the AE-SCL decoder
is 3.32, 8.25, and 3.17 times that of decoders in [10], [24],
[25], respectively, for the (1024, 512) polar code.
TABLE II
AREAS OF DIFFERENT ML DECODING UNITS FOR THE (1024, 512) POLAR CODE.
               LC-AML†   Divide-and-Conquer AML‡   [27]
Area (mm^2)    0.456     0.673                     2.298
† The LC-AML design targets the six eight-bit frozen-location
patterns.
‡ The Divide-and-Conquer AML design targets all 81 eight-bit
frozen-location patterns following Assumption 1.
VI. MULTI-MODE SCL DECODER
All existing SCL polar decoders in the literature provide
fixed throughput and decoding latency given a polar code.
These SCL decoders cannot adapt to variable communication
channels and applications. In order to adapt to different
throughput and latency requirements, we propose a multi-
mode SCL (MM-SCL) decoder with nd decoding paths, which
can decode P received words with list size L in parallel, where
1 ≤ P,L ≤ nd and nd ≥ P × L. This multi-mode feature
requires the decoder to perform SCL decoding algorithms with
different list sizes (the SC decoding algorithm is a special case
of the SCL decoding algorithm with list size L = 1).
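The constraint on P and L can be enumerated directly. The following Python sketch is only an illustration with hypothetical function names; it lists every pair (P, L) allowed by the inequalities above and, separately, the pairs that use all nd decoding paths.

```python
def allowed_modes(n_d):
    # All (P, L) with 1 <= P, L <= n_d and P * L <= n_d.
    return [(p, l)
            for p in range(1, n_d + 1)
            for l in range(1, n_d + 1)
            if p * l <= n_d]

def full_use_modes(n_d):
    # Pairs that occupy every decoding path, i.e. P * L == n_d.
    return [(p, l) for p, l in allowed_modes(n_d) if p * l == n_d]
```

For nd = 4, the full-use pairs are (1, 4), (2, 2), and (4, 1): one word with list size 4, two words with list size 2, or four words decoded by SC.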
A. Architecture Description
Fig. 6. Top architecture of the MM-SCL decoder when nd = 4.
Assuming nd = 4, the top architecture of the MM-SCL
decoder is shown in Fig. 6. It has four blocks of channel
memory, CMEMi (1 ≤ i ≤ 4), to store four received
codewords, since the decoder in the MODE-1 mode can deal
with four received codewords simultaneously. Block DCDi
(1 ≤ i ≤ 4) contains processing units to calculate LLRs,
and partial-sum units to update partial-sum for each list. The
intermediate LLRs calculated by DCDi are stored in block
LMEM. Designs for processing units, partial-sum units and the
interface between processing units and LMEM adopt blocks
of the reduced-latency tree-based SCL decoder in [27]. We
focus on the additional logic to support multi-mode features.
Mode_Sel is a two-bit control word that selects the decoding
mode of the MM-SCL decoder, as shown in Table III.
TABLE III
CONTROL WORD FOR THE DECODING MODE OF THE MM-SCL DECODER WITH nd = 4.
Mode_Sel   Decoding Mode     Notation
0          SCL with L = 4    MODE-4
1          SCL with L = 2    MODE-2
2          SC                MODE-1
Fig. 7. Top architecture of MM-LC-AML for the MM-SCL decoder.
Block MM-LC-AML performs the LC-AML decoding func-
tion for different types of leaf nodes and is supposed to output
the most reliable list candidate for the MODE-1 mode, the two
most reliable list candidates for the MODE-2 mode, and the
four most reliable list candidates for the MODE-4 mode. The
architecture in Fig. 4 is for a fixed list size only. Here, we
propose an MM-LC-AML unit (we take nd = 4 and M = 8
as an example) shown in Fig. 7 to support the multi-mode
features. When the mode is MODE-4, all of SCLO1, SCLO2,
SCLO3, and SCLO4 are used to decode a received codeword.
When the mode is MODE-2, SCLO1 and SCLO2 are used for
one of two received codewords; SCLO3 and SCLO4 are used
for the other received codeword. When the mode is MODE-1,
each of SCOi(1 ≤ i ≤ 4) is used by an individual received
codeword.
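The assignment of outputs to received codewords in each mode can be summarized by a short Python sketch. The dictionary and function names are hypothetical; the port names and the mode encoding follow Fig. 7 and Table III, with nd = 4 assumed.

```python
# Mode_Sel value -> (mode name, list size), following Table III.
MODES = {0: ("MODE-4", 4), 1: ("MODE-2", 2), 2: ("MODE-1", 1)}

def output_groups(mode_sel, n_d=4):
    # Returns one list of output ports per received codeword.
    _, list_size = MODES[mode_sel]
    prefix = "SCO" if list_size == 1 else "SCLO"
    ports = [f"{prefix}{i}" for i in range(1, n_d + 1)]
    return [ports[i:i + list_size] for i in range(0, n_d, list_size)]
```

The number of groups equals the number of codewords decoded in parallel, and the size of each group equals the list size of the mode.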
Block MM_MLD_S1 performs the same function as block
MLD_S1_q4 in Fig. 5, except that block MM_MLD_S1
supports multiple modes. This can be accomplished by simply
adding an S4 block between block S32TO4 and the adder
at the bottom right of Fig. 5(a). This implementation has a
disadvantage: the extra block in the data path lengthens the
critical path, which increases the decoding latency.
To avoid the unnecessary increase of the decoding latency,
we redesign the MLD_S1 block for the MODE-1 and MODE-
2 modes, respectively, called MLD_S1_q1 and MLD_S1_q2,
shown in Fig. 8(a) and 8(b). Symbol values for ‘Z’ and ‘F’
are four-bit vectors ‘0000’ and ‘1111’, respectively. Hence,
the symbol value calculated from ‘Z’ and ‘F’ is ‘11111111’,
which is guaranteed to be an invalid symbol value for our
designs.
Fig. 8. (a) Design of MLD_S1_q1 for q = 1, (b) design of MLD_S1_q2 for q = 2.
If MODE-2 is used, the control words for patterns
‘FFDDDDDD’, ‘FDDDDDDD’, and ‘FFFDDDDD’
are 0, 1, and 2, respectively; for the remaining patterns,
the control words are 3. If MODE-1 is used, the control
words for patterns ‘FFDDDDDD’, ‘FDDDDDDD’, and
‘FFFDDDDD’ are 0, 0, and 1, respectively; for the remain-
ing patterns, the control words are 2.
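The control-word assignment just described amounts to a small lookup table. The Python sketch below is illustrative only; the dictionary and function names are hypothetical, while the values are taken from the paragraph above.

```python
# Control words of the multiplexers for the patterns named in the text.
CONTROL_WORDS = {
    "MODE-2": {"FFDDDDDD": 0, "FDDDDDDD": 1, "FFFDDDDD": 2},
    "MODE-1": {"FFDDDDDD": 0, "FDDDDDDD": 0, "FFFDDDDD": 1},
}
# Control word used for all remaining patterns in each mode.
DEFAULT_WORD = {"MODE-2": 3, "MODE-1": 2}

def control_word(mode, pattern):
    return CONTROL_WORDS[mode].get(pattern, DEFAULT_WORD[mode])
```

Note that in MODE-1 the patterns ‘FFDDDDDD’ and ‘FDDDDDDD’ share control word 0, so only three distinct data paths are selected there, versus four in MODE-2.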
In practice, MLD_S1_q4, MLD_S1_q2, and MLD_S1_q1 are
integrated together in block MM_MLD_S1 rather than
implemented as three individual blocks, since they have the
same circuitry for Step 0. Furthermore, sorting units of the top row of Step 1 in
these three designs can also be reused because block S8TO4
contains several S4 blocks and S2 blocks. The hardware
sharing reduces the additional area for supporting multiple
modes and improves area efficiency without increasing the
critical path delay.
Compared with the AE-SCL decoder in Section V-B, the
MM-SCL decoder needs additional hardware for supporting
multiple modes. The main area increase comes from the
three additional blocks of channel memory and the hardware of
MM_MLD_S1 to support the MODE-2 and MODE-1 modes.
Frame error rates of different modes for the MM-SCL
decoder when decoding all three codes are shown in Fig. 9,
which shows that, regarding the frame error rates, MODE-4
< MODE-2 < MODE-1.
Fig. 9. Frame error rates of different modes for the MM-SCL decoder.
B. Variation of modes
An additional feature of the MM-SCL decoder is that
Mode_Sel can be changed during the decoding procedure of a
received word. This does not need any additional hardware. It
means that the SCL algorithm with different list sizes can be
used to deal with different segments of a received word. For
example, for a (32768, 29504) polar code, $u_{1}^{21000}$ is decoded
by the SCL algorithm with L = 4 and $u_{21001}^{32768}$ is decoded by
the SC algorithm. For the first 21000 bits, the decoder is in
MODE-4 and four lists are maintained. The remaining bits are
decoded by the SC algorithm for each list. This mode is called
MODE-4 1. By choosing the switching point θ (where the
mode switches) of Mode_Sel carefully, the decoding latency
can be improved slightly without any observed performance
loss as shown in Fig. 9. The decoding latency of the MODE-4
mode is 6718 cycles. MODE-4 1 takes 6530 cycles to decode
a received codeword and improves the throughput slightly.
To reduce the decoding latency further, a smaller switching
point can be used at the expense of small performance loss.
If θ = 10000, MODE-4 1 has a decoding latency of 6206
cycles and has a performance loss of about 0.03 dB compared
with MODE-4 as shown in Fig. 9, but still has a better
performance than MODE-2. Hence, MODE-4 1 provides a
more flexible way to achieve a decoding latency between those
of the MODE-4 mode and the MODE-1 mode.
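The switching rule and the cycle counts quoted above can be checked with a short Python sketch. The function and variable names are hypothetical, bit indices are assumed to be 1-based, and the cycle counts are the ones reported in the text for the (32768, 29504) code.

```python
def expands_paths(bit_index, theta):
    # Bits 1..theta are decoded in MODE-4 (list expansion with L = 4);
    # later bits are decoded by SC on each surviving list.
    return bit_index <= theta

def relative_saving(cycles_base, cycles_new):
    # Fraction of decoding cycles saved with respect to the baseline.
    return (cycles_base - cycles_new) / cycles_base

CYCLES_MODE4 = 6718
CYCLES_SWITCHED = {21000: 6530, 10000: 6206}  # switching point -> cycles

# Percentage of cycles saved for each switching point, to one decimal place.
savings = {theta: round(100 * relative_saving(CYCLES_MODE4, cycles), 1)
           for theta, cycles in CYCLES_SWITCHED.items()}
```

With these figures, the saving in decoding cycles is about 2.8% for θ = 21000 and about 7.6% for θ = 10000.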
Therefore, the variation of modes provides a way for the
MM-SCL decoder to reduce decoding latency somewhat
further without noticeable performance loss, which also improves area
efficiency. It can also be used when decoding needs to
be finished as soon as possible due to external reasons, such
as buffer overflow.
C. Synthesis Results
The MM-SCL decoder is implemented for the aforemen-
tioned three codes. For the (1024, 512) code, the areas of the
channel memory and the ML decoding unit are listed in
Table IV. It shows that the increased area of the MM-SCL
decoder over the AE-SCL decoder is dominated by the area
of the three additional blocks of channel memory. Due to the
hardware sharing, the increased area of the ML decoding unit
is small.
TABLE IV
AREAS (IN mm^2) OF THE CHANNEL MEMORY AND THE ML DECODING UNIT FOR MM-SCL AND AE-SCL DECODERS WHEN N = 1024 AND R = 0.5.
                            MM-SCL   AE-SCL   Difference
Area of Channel Memory      0.484    0.121    0.363
Area of ML Decoding Unit    0.513    0.456    0.057
Synthesis results of MM-SCL decoders for different polar
codes are listed in Tables V, VI and VII. The decoding latency
of the MODE-2 mode is smaller than that of the MODE-
4 mode and the decoder has the smallest decoding latency
with the MODE-1 mode. This is because MLD_S1_q2 and
MLD_S1_q1 have shorter data paths. Therefore, in block MM-
LC-AML, three stages and two stages of pipeline registers are
used by the circuitry for the MODE-2 mode and the MODE-
1 mode, respectively. If MODE-4 1 is used, the MM-SCL
decoder can achieve a smaller latency than the MODE-4 mode
and the MODE-2 mode.
Compared with the AE-SCL decoder, the MM-SCL decoder
in the MODE-4 mode has a smaller area efficiency due to the
additional circuitry for supporting multiple modes. However,
the MM-SCL decoder is more flexible to provide multiple
choices of output throughput and decoding latency, which
is more suitable for variable communication channels and
applications. If a higher throughput or a smaller decoding
latency is required, the MM-SCL decoder can be switched
to the MODE-2, MODE-1 or MODE-4 1 mode.
Compared with the decoder in [27], for N = 1024 and
N = 8192, the MM-SCL decoder has a smaller area and a
better area efficiency. For N = 32768, the area of the MM-
SCL decoder is bigger than that of the decoder in [27] because
the additional circuitry to support multiple modes is larger
than the saving due to the low-complexity ML decoding unit
in the MM-SCL decoder. For the (1024, 512) code, under
the MODE-4, MODE-2, and MODE-1 modes, the MM-SCL
decoder provides area-efficiencies of 1.59, 3.51, and 8.39 times
of area-efficiency of the decoder in [27], respectively.
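As a sanity check on these ratios, area efficiency can be recomputed as net information throughput divided by area. The Python sketch below uses the N = 1024 figures from Table V; the variable names are hypothetical, and it assumes that the MM-SCL area of 2.31 mm^2 applies to all three modes.

```python
def area_efficiency(nit_mbps, area_mm2):
    # Area efficiency in Mbps/mm^2.
    return nit_mbps / area_mm2

MM_SCL_AREA = 2.31                       # mm^2, assumed common to all modes
NIT = {"MODE-4": 547, "MODE-2": 1208, "MODE-1": 2887}   # Mbps
REF27_EFF = 149                          # Mbps/mm^2, decoder in [27]

# Efficiencies rounded to integers, as in Table V.
eff = {mode: round(area_efficiency(nit, MM_SCL_AREA))
       for mode, nit in NIT.items()}
# Ratios to the decoder in [27], rounded to two decimals.
ratios = {mode: round(e / REF27_EFF, 2) for mode, e in eff.items()}
```

The recomputed efficiencies are 237, 523, and 1250 Mbps/mm^2, and the ratios are 1.59, 3.51, and 8.39, matching the values stated above.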
Compared with decoders in [10], [24], [25], the advantage
in area efficiency of the MM-SCL decoder is more significant.
This advantage comes from two aspects. The first is that the
tree-based low-latency SCL architecture in [27] is adopted for
the MM-SCL decoder. This helps to reduce the decoding la-
tency. The second is due to the low-complexity AML decoding
unit. For N = 1024, the MM-SCL decoder in MODE-4 mode
provides an area efficiency of 2.72, 6.77, and 2.60 times of area
efficiencies of SCL decoders in [10], [24], [25], respectively.
When the mode is MODE-1, the ratios of the area efficiency
of the MM-SCL decoder over those of SCL decoders in [10],
[24], [25] are 14.37, 35.71, and 13.73, respectively.
For N = 32768, decoding latencies and throughputs with respect
to different switching points of MODE-4 1 are also provided.
A smaller switching point leads to a smaller latency. When
the switching point is 10000, the latency of MODE-4 1 is
even smaller than that of MODE-2. Compared with MODE-4,
improvements on throughput and latency are about 8%.
TABLE V
SYNTHESIS RESULTS FOR DIFFERENT DECODERS WHEN N = 1024 AND R = 0.5.
Decoder                  AE-SCL  MODE-4  MODE-2  MODE-1  [27]   [25]   [10]   [24]   [24]*
List Size                4       4       2       1       4      4      4      4      4
Area (mm^2)              1.89    2.31    2.31    2.31    3.83   1.70   1.78   2.14   4.10
Clock Rate (MHz)         409     409     409     409     403    500    794    400    289
# of Decoding Cycles     391     391     357     304     371    1540   2649   1022   1022
Latency (us)             0.96    0.96    0.87    0.74    0.92   3.08   3.34   2.56   3.54
NIT (Mbps)               547     547     1208    2887    570    155    154    200    144
Area Eff. (Mbps/mm^2)    289     237     523     1250    149    91     87     93     35
* Original synthesis results in [24] are based on an ST 65nm CMOS technology. For a fair comparison, synthesis results
scaled to a 90nm technology are used in the comparison.
TABLE VI
SYNTHESIS RESULTS FOR DIFFERENT DECODERS WHEN N = 8192 AND R = 0.5.
Decoder                  AE-SCL  MODE-4  MODE-2  MODE-1  [27]   [25]‡   [10]†
List Size                4       4       2       1       4      4       4
Area (mm^2)              4.49    5.51    5.51    5.51    6.46   7.32    12.73
Clock Rate (MHz)         398     398     398     398     398    434     794
# of Decoding Cycles     2542    2542    2323    1975    2367   11700   20736
Latency (us)             6.39    6.39    5.84    4.96    5.95   26.96   26.12
NIT (Mbps)               670     670     1473    3503    723    150     156
Area Eff. (Mbps/mm^2)    149     122     267     636     112    20      12
‡ The decoder architecture in [25] has been re-synthesized with the TSMC 90nm CMOS technology.
† These results for the decoder in [10] are estimated conservatively.
TABLE VII
SYNTHESIS RESULTS FOR DIFFERENT DECODERS WHEN N = 32768 AND R = 0.9.
Decoder                  AE-SCL  MODE-4  MODE-4 1  MODE-4 1  MODE-2  MODE-1  [27]    [25]‡    [10]†
List Size                4       4       4§        4¶        2       1       4       4        4
Area (mm^2)              9.97    11.93   11.93     11.93     11.93   11.93   11.89   15.8     50.41
Clock Rate (MHz)         350     350     350       350       350     350     359     389      794
# of Decoding Cycles     6718    6718    6530      6206      6300    5368    6492    65813    96576
Latency (us)             19.19   19.19   18.66     17.73     18      15.34   18.08   169.19   121.63
NIT (Mbps)               1662    1662    1714      1811      3564    8499    1772    165      242
Area Eff. (Mbps/mm^2)    167     139     144       152       299     712     149     10       5
‡ The decoder architecture in [25] has been re-synthesized with the TSMC 90nm CMOS technology.
† These results for the decoder in [10] are estimated conservatively.
§ The switching point for variation of modes is 21000.
¶ The switching point for variation of modes is 10000.
VII. CONCLUSION
In this paper, the divide-and-conquer method is applied to
SC-based algorithms in the probability domain. By extending
this idea, a divide-and-conquer AML decoding unit for SCL-
based polar decoder is proposed. By examining frozen-location
patterns of polar codes, an efficient hardware design for a sim-
plified divide-and-conquer AML decoding unit is developed.
To adapt to different throughput and latency requirements, the
MM-SCL polar decoder is proposed in this paper. Synthesis
results show that our implementations for our MM-SCL de-
coder and SCL decoder with the LC-AML unit achieve better
area efficiencies than existing SCL polar decoders.
APPENDIX
Proof of Proposition 1:
(a) First, $0 < z_{1,1} = 2\epsilon - \epsilon^2 = 1 - (1-\epsilon)^2 < 1$. Second,
$0 < z_{1,2} = \epsilon^2 < 1$. Then, by induction, for $i \ge 1$ and
$1 \le j \le 2^i$, $0 < z_{i,j} < 1$ is satisfied.
(b) For any $i \ge 1$ and $1 \le j \le 2^i$,
$z_{i,2j-1} - z_{i,2j} = 2z_{i-1,j} - z_{i-1,j}^2 - z_{i-1,j}^2 = 2z_{i-1,j}(1 - z_{i-1,j})$.
By Proposition 1(a), $z_{i,2j-1} - z_{i,2j} > 0 \Rightarrow z_{i,2j-1} > z_{i,2j}$.
(c) By Proposition 1(b), $z_{i,4j-3} > z_{i,4j-2}$ and $z_{i,4j-1} > z_{i,4j}$.
Moreover, $z_{i,4j-2} - z_{i,4j-1} = 2z_{i-2,j}^2(1 - z_{i-2,j})^2$. By Propo-
sition 1(a), $z_{i,4j-2} - z_{i,4j-1} > 0 \Rightarrow z_{i,4j-2} > z_{i,4j-1}$.
Therefore, $z_{i,4j-3} > z_{i,4j-2} > z_{i,4j-1} > z_{i,4j}$.
(d) By Proposition 1(c), $z_{i,8j-7} > z_{i,8j-6} > z_{i,8j-5} > z_{i,8j-4}$
and $z_{i,8j-3} > z_{i,8j-2} > z_{i,8j-1} > z_{i,8j}$. We also
have $z_{i,8j-5} > z_{i,8j-3}$ and $z_{i,8j-4} > z_{i,8j-2}$ because
$z_{i,4j-2} > z_{i,4j-1}$.
Now let us compare $z_{i,8j-4}$ and $z_{i,8j-3}$:
$$z_{i,8j-4} - z_{i,8j-3} = -2z_{i-3,j}^2 (1 - z_{i-3,j})^2 \left(2 + 4z_{i-3,j} - 5z_{i-3,j}^2 + 2z_{i-3,j}^3 - z_{i-3,j}^4\right).$$
By Proposition 1(a), $z_{i,8j-4} < z_{i,8j-3}$.
Therefore, $z_{i,8j-7} > z_{i,8j-6} > z_{i,8j-5} > z_{i,8j-3} > z_{i,8j-4} > z_{i,8j-2} > z_{i,8j-1} > z_{i,8j}$.
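The ordering in part (d) can also be checked numerically. The Python sketch below is a spot check rather than a proof: the function names are hypothetical, and it assumes the recursion used in the proof, in which a parent value z has an odd-indexed child 2z − z² and an even-indexed child z².

```python
def children(z):
    # One polarization step: odd-indexed child 2z - z^2, even-indexed child z^2.
    return [2 * z - z * z, z * z]

def descendants(z, levels):
    # Values z_{i,8j-7}, ..., z_{i,8j} (in index order) for levels = 3,
    # starting from z = z_{i-3,j}.
    zs = [z]
    for _ in range(levels):
        zs = [c for parent in zs for c in children(parent)]
    return zs

def ordered_as_in_part_d(z):
    d = descendants(z, 3)
    # Claimed order: 8j-7 > 8j-6 > 8j-5 > 8j-3 > 8j-4 > 8j-2 > 8j-1 > 8j.
    chain = [d[0], d[1], d[2], d[4], d[3], d[5], d[6], d[7]]
    return all(a > b for a, b in zip(chain, chain[1:]))
```

Calling `ordered_as_in_part_d` for a few starting values strictly between 0 and 1 gives a quick consistency check of the chain of inequalities.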
REFERENCES
[1] E. Arıkan, “Channel polarization: A method for constructing capacity-
achieving codes for symmetric binary-input memoryless channels,” IEEE
Transactions on Information Theory, vol. 55, no. 7, pp. 3051–3073, July
2009.
[2] E. Sasoglu, I. Telatar, and E. Arıkan, “Polarization for arbitrary discrete
memoryless channels,” in Proceedings of IEEE Information Theory
Workshop, Oct 2009, pp. 144–148.
[3] I. Tal and A. Vardy, “List decoding of polar codes,” IEEE Transactions
on Information Theory, vol. 61, no. 5, pp. 2213–2226, May 2015.
[4] E. Arıkan, “Systematic polar coding,” IEEE Communications Letters,
vol. 15, no. 8, pp. 860–862, August 2011.
[5] K. Niu and K. Chen, “CRC-aided decoding of polar codes,” IEEE
Communications Letters, vol. 16, no. 10, pp. 1668–1671, October 2012.
[6] S. Kahraman and M. Celebi, “Code based efficient maximum-likelihood
decoding of short polar codes,” in Proceedings of IEEE International
Symposium on Information Theory, July 2012, pp. 1967–1971.
[7] K. Niu, K. Chen, and J. Lin, “Low-complexity sphere decoding of polar
codes based on optimum path metric,” IEEE Communications Letters,
vol. 18, no. 2, pp. 332–335, February 2014.
[8] E. Arıkan, H. Kim, G. Markarian, U. Ozgur, and E. Poyraz, “Perfor-
mance of short polar codes under ml decoding,” in Proceedings of ICT
Mobile Summit Conference, 2009.
[9] Y. S. Park, Y. Tao, S. Sun, and Z. Zhang, “A 4.68gb/s belief propagation
polar decoder with bit-splitting register file,” in Symposium on VLSI
Circuits Digest of Technical Papers, June 2014, pp. 1–2.
[10] A. Balatsoukas-Stimming, M. Bastani Parizi, and A. Burg, “LLR-based
successive cancellation list decoding of polar codes,” IEEE Transactions
on Signal Processing, vol. 63, no. 19, pp. 5165–5179, Oct 2015.
[11] J. Lin and Z. Yan, “An efficient list decoder architecture for polar
codes,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 2015, accepted and to appear, available on IEEE Xplore, DOI:
10.1109/TVLSI.2014.2378992.
[12] C. Leroux, A. Raymond, G. Sarkis, and W. Gross, “A semi-parallel
successive-cancellation decoder for polar codes,” IEEE Transactions on
Signal Processing, vol. 61, no. 2, pp. 289–299, Jan 2013.
[13] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. Gross, “Fast polar
decoders: Algorithm and implementation,” IEEE Journal on Selected
Areas in Communications, vol. 32, no. 5, pp. 946–957, May 2014.
[14] A. Balatsoukas-Stimming, A. Raymond, W. Gross, and A. Burg, “Hard-
ware architecture for list successive cancellation decoding of polar
codes,” IEEE Transactions on Circuits and Systems II: Express Briefs,
vol. 61, no. 8, pp. 609–613, Aug 2014.
[15] G. Sarkis and W. Gross, “Increasing the throughput of polar decoders,”
IEEE Communications Letters, vol. 17, no. 4, pp. 725–728, Apr. 2013.
[16] C. Zhang and K. Parhi, “Latency analysis and architecture design
of simplified sc polar decoders,” IEEE Transactions on Circuits and
Systems II: Express Briefs, vol. 61, no. 2, pp. 115–119, Feb. 2014.
[17] ——, “Low-latency sequential and overlapped architectures for succes-
sive cancellation polar decoder,” IEEE Transactions on Signal Process-
ing, vol. 61, no. 10, pp. 2429–2441, May 2013.
[18] A. Raymond and W. Gross, “A scalable successive-cancellation decoder
for polar codes,” IEEE Transactions on Signal Processing, vol. 62,
no. 20, pp. 5339–5347, Oct. 2014.
[19] C. Xiong, J. Lin, and Z. Yan, “Symbol-based successive cancellation list
decoder for polar codes,” in Proceedings of IEEE Workshop on Signal
Processing Systems, Belfast, UK, October 2014, pp. 198–203.
[20] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. Gross, “Increasing
the speed of polar list decoders,” in 2014 IEEE Workshop on Signal
Processing Systems (SiPS), Oct 2014, pp. 1–6.
[21] I. Tal and A. Vardy, “List decoding of polar codes,” in Proceedings of
IEEE International Symposium on Information Theory, July 2011, pp.
1–5.
[22] B. Li, H. Shen, and D. Tse, “Parallel decoders of polar codes,”
arXiv:1309.1026, Sep. 2013. [Online]. Available: http://arxiv.org/abs/
1309.1026
[23] B. Li, H. Shen, D. Tse, and W. Tong, “Low-latency polar codes via
hybrid decoding,” in Proceedings of 2014 8th International Symposium
on Turbo Codes and Iterative Information Processing, Aug. 2014, pp.
223–227.
[24] B. Yuan and K. Parhi, “Low-latency successive-cancellation list decoders
for polar codes with multibit decision,” IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, 2014, accepted and to appear, available
on IEEE Xplore, DOI: 10.1109/TVLSI.2014.2359793.
[25] C. Xiong, J. Lin, and Z. Yan, “Symbol-decision successive cancellation
list decoder for polar codes,” IEEE Transactions on Signal Processing,
2015, accepted and to appear, available on IEEE Xplore, DOI:
10.1109/TSP.2015.2486750.
[26] J. Lin, C. Xiong, and Z. Yan, “A reduced latency list decoding
algorithm for polar codes,” in Proceedings of IEEE Workshop on
Signal Processing Systems (SiPS 2014), Belfast, UK, October 2014,
pp. 56–61. [Online]. Available: http://arxiv.org/abs/1405.4819
[27] ——, “A high throughput list decoder architecture for polar
codes,” arXiv:1510.02574, Oct. 2015. [Online]. Available: http:
//arxiv.org/abs/1510.02574
[28] E. Arıkan, “A performance comparison of polar codes and Reed-Muller
codes,” IEEE Communications Letters, vol. 12, no. 6, pp. 447–449, June
2008.
[29] D. Wu, Y. Li, and Y. Sun, “Construction and block error rate analysis of
polar codes over awgn channel based on gaussian approximation,” IEEE
Communications Letters, vol. 18, no. 7, pp. 1099–1102, July 2014.
[30] J. Lin and Z. Yan, “Efficient list decoder architecture for polar codes,” in
Proceedings of IEEE International Symposium on Circuits and Systems,
June 2014, pp. 1022–1025.
[31] K. E. Batcher, “Sorting networks and their applications,” in AFIPS
Proceeding of the Spring Joint Computer Conference, 1968, pp. 307–
314.
