High Throughput Polar Decoding Using Two-Staged Adaptive Successive
  Cancellation List Decoding by Xia, ChenYang et al.
arXiv:1905.09120v1 [eess.SP] 22 May 2019
SUBMITTED TO IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS 1
High Throughput Polar Decoding Using Two-Staged
Adaptive Successive Cancellation List Decoding
ChenYang Xia, Student Member, IEEE, YouZhe Fan, Member, IEEE, Chi-ying Tsui, Senior Member, IEEE
Abstract—Polar codes are the first class of capacity-achieving
forward error correction (FEC) codes. They have been selected as
one of the coding schemes for the 5G communication systems due
to their excellent error correction performance when successive
cancellation list (SCL) decoding with cyclic redundancy check
(CRC) is used. A large list size is necessary for SCL decoding to
achieve a low error rate. However, it impedes SCL decoding from
achieving a high throughput as the computational complexity
is very high when a large list size is used. In this paper, we
propose a two-staged adaptive SCL (TA-SCL) decoding scheme
and the corresponding hardware architecture to accelerate SCL
decoding with a large list size. Constant system latency and
data rate are supported by TA-SCL decoding. To analyse the
decoding performance of TA-SCL, an accurate mathematical
model based on Markov Chain is derived, which can be used
to determine the parameters for practical designs. A VLSI
architecture implementing TA-SCL decoding is then proposed.
The proposed architecture is implemented using UMC 90nm
technology. Experimental results show that TA-SCL can achieve
throughputs of 3.00 and 2.35 Gbps when the list sizes are 8 and
32, respectively, which are nearly three times those of
state-of-the-art SCL decoding architectures, with negligible
performance degradation over a wide signal-to-noise ratio (SNR)
range and small hardware overhead.
Index Terms—Polar codes, Successive cancellation list decod-
ing, Adaptive decoding, Markov chain, VLSI decoder architec-
tures
I. INTRODUCTION
SINCE they were invented by Arıkan in 2009, polar codes
[1] have attracted much research interest due to their ex-
cellent error correction performance. Polar codes decoded by
successive cancellation (SC) decoding provably achieve chan-
nel capacity for symmetric binary-input, discrete, memoryless
channels (B-DMC) as their code length approaches
infinity [2]. However, as the source word is recovered bit by bit
in the SC decoding process, the decoding latency for long polar
codes is large [3], [4]. Numerous fast SC architectures have
been proposed to improve the decoding latency [5], [6]. On the
other hand, the error correction performance of SC decoding
is not satisfactory when it is used for practical polar codes
with short to medium code lengths [7], such as the channel
codes for 5G communication systems [8]. Thus, successive
cancellation list (SCL) decoding [7], [9] has been proposed
to improve the error correction performance of polar codes.
C.-Y. Xia, and C.-Y. Tsui are with the Department of Electronic and
Computer Engineering, Hong Kong University of Science and Technology,
Kowloon, Hong Kong (e-mail: cxia@connect.ust.hk; eetsui@ust.hk).
Y.-Z. Fan is with MaxLinear, Carlsbad, CA, USA (e-mail: jason-
fan@connect.ust.hk).
This paper will be presented in part at the IEEE International Conference
on Circuits and Systems, Sapporo, Japan, May 2019.
However, it has a large computational complexity and latency
overhead.
In SCL decoding, L concurrently-executed SC decodings
are used to keep L candidates of decoded vectors, where
L is called the list size. Compared with SC decoding, SCL
decoding has better error correction performance because the
correct source word can remain in the list even when a decoding
error happens. Moreover, by concatenating polar codes with
cyclic redundancy check (CRC) codes [10], [11], the valid
output vector is selected according to the CRC checksums
after decoding. Consequently, SCL decoding significantly out-
performs SC decoding for polar codes in error correction
performance. Polar codes using SCL decoding with L ≥ 16
even outperform low-density parity-check (LDPC) [12] and
turbo [13] codes using iterative decoding [14], [15], and
hence short polar codes have been selected as one of the
coding schemes in the coming 5G enhanced mobile broadband
(eMBB) standard [8].
Aiming at increasing the decoding throughput of polar
codes, VLSI architectures of SCL decoding have become a popular
research topic [16]–[27]. Compared with single SC decoding,
SCL decoding has latency overhead because of the need to
execute list management (LM) [16]. During the decoding
process of a bit, the L survival paths are expanded to
2L paths as all of them are possible candidates of the partial
decoded vectors. LM is executed to select the L best paths
to keep. Basically, LM needs to solve a radix-2L sorting
problem, which has a computational complexity of O(L^2)
[28]. To minimize the latency overhead brought by LM, the
most popular optimization scheme used in state-of-the-art
hardware architectures is to decode multi-bit sub-codes at
the same time so that fewer LM operations are needed. A
sub-code can either be fixed-length [18]–[22] or match
a special code pattern with variable length [23]–[27].
Besides, the sorting algorithm itself can be simplified [28]–[30].
An approximate sorting algorithm called double thresholding
scheme (DTS) was proposed in our previous work [15], [21],
[22]. It simplifies the sorting complexity to O(L) with the help
of two run-time generated thresholds. The corresponding VLSI
architecture supports a list size up to 32 [22] so as to achieve
an excellent decoding performance. However, as shown in Fig.
1, state-of-the-art SCL decoding architectures suffer from a
severe throughput degradation when the list size is increased.
This is because a larger list size causes larger computational
complexity for LM and hence the critical path delay of the
SCL decoding architectures increases.
[Fig. 1. Throughputs versus list sizes of various VLSI architectures of SCL
decoders synthesized with or scaled to 90nm CMOS technology. Horizontal
axis: list size (2 to 32). Vertical axis: throughput (Gbps). Curves:
TSP18' [22], TSP17' [24], TVLSI16' [23].]
In the iterative decoding of LDPC codes, the number of
iterations for each frame to converge is not fixed so that
the decoding speed can be increased by adaptively assigning
different decoding iterations for different frames [31]–[33].
Similarly, to increase the decoding speed of polar codes so
that they can be comparable with those of LDPC and turbo
codes, adaptive SCL (A-SCL) decoding was proposed in [10],
in which the list size is adaptive. Specifically, a codeword
is first decoded by a single SC decoding. If the decoded
vector cannot pass CRC, the codeword will be decoded by
SCL with a doubled list size. This process is iterated until
a valid output is obtained or a predefined maximal list size
Lmax is reached. Experimental results in [10] show that A-SCL
with Lmax has error correction performance equivalent to
that of SCL decoding with L=Lmax, and the average list size
L̄ ≪ Lmax as most of the codewords can be decoded by SCL
with L ≪ Lmax. According to the relationship between list
size and throughput shown in Fig. 1, the reduction in L̄
increases the average throughput of a hardware polar decoder.
Nevertheless, the decoding latency of each codeword in A-
SCL is different. This is not an issue for a software decoder
such as the one in [26]. However, a directly-mapped A-SCL
hardware architecture cannot support applications that need a
constant system latency and transmission data rate, such as the
digital baseband in a communication system.
In this work, we will introduce how to accelerate polar
decoding on hardware with the help of A-SCL decoding. The
paper is an extension of our previous work [34]. Here, the
main contributions of this work are summarized as follows:
• A hardware-friendly two-staged adaptive SCL (TA-SCL)
algorithm is proposed, which can achieve a high through-
put with constant transmission data rate and system
latency. To analyse the error correction performance of
TA-SCL, a mathematical model on B-DMC is developed
based on Markov chain. The model is an extension from
our previous work [34] such that the speed gain achieved
by TA-SCL is not restricted to an integer multiple but
can be any rational multiple.
• The relationships between the error correction perfor-
mance and design parameters are studied, and a method
of how to select the design parameters is introduced.
• A hardware architecture of TA-SCL is developed based
on the proposed model. The memory usage is analysed
and the corresponding timing schedule is presented. A
low-latency SCL (LL-SCL) architecture combining sev-
eral existing low-latency decoding schemes is introduced
to satisfy the high requirement of a component SCL
decoder in the TA-SCL decoder architecture.
• Experimental results show that the throughput is about
three times that of the state-of-the-art SCL decoding
architecture [22] with negligible performance degradation and
small hardware overhead.
[Fig. 2. Scheduling tree of SC decoding for polar codes with N = 4.
The tree has stages 0 to 2 and is built from F-nodes and G-nodes; the
decoded bits û0 to û3 are produced at stage 0.]
The rest of this paper is organized as follows. In Section II,
the background knowledge of polar codes and the decoding
algorithms will be reviewed. In Section III, the algorithm of
TA-SCL and its analytical model will be introduced. The
relationships between the error correction performance and
design parameters will also be analysed. In Section IV, the
hardware architecture of the TA-SCL decoder will be intro-
duced. Finally, simulation and implementation results of the
hardware-based TA-SCL will be presented in Section V, and
conclusions will be given in Section VI.
II. PRELIMINARIES
A. Polar Codes
Polar codes [1] are a class of linear block codes of length
$N$. Without loss of generality, we assume $N = 2^n$ in this work,
where $n$ is an integer. Let $u_N$ and $x_N$ be the input source
word and the output codeword of an $N$-bit binary frame,
respectively. The encoding process can be expressed as
$x_N = u_N \cdot F^{\otimes n}$, in which $F^{\otimes n}$ is called the
generator matrix and equals the $n$th Kronecker power of the
polarization matrix
$$F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}.$$
Due to the polarization effect, each bit
in $u_N$ has a different reliability. An information set $\mathcal{A}$ is
determined by finding the $K$ most reliable bits. These $K$ bits,
called information bits, are used to transmit information. The
complement of $\mathcal{A}$, $\mathcal{A}^c$, is defined as the frozen set, in which
the bits are called the frozen bits and are set to 0. If an $r$-bit CRC
code is used, the last $r$ information bits are used to transmit
the checksum generated from the other $K - r$ information bits,
and the code rate of polar codes is $R = \frac{K - r}{N}$.
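The encoding rule above can be illustrated with a minimal Python sketch. It builds the Kronecker power of F directly and multiplies over GF(2). The function names (`kronecker_power`, `polar_encode`, `place_info_bits`) are hypothetical, and the information set {1, 2, 3} used in the example is only an assumption that mirrors the N = 4 code of Fig. 3, not a set derived from a reliability calculation.

```python
def kronecker_power(n):
    """Return F^{(x)n} as a list of rows over GF(2), with F = [[1, 0], [1, 1]]."""
    g = [[1]]
    for _ in range(n):
        size = len(g)
        # Kronecker product with F has the block form [[G, 0], [G, G]].
        top = [row + [0] * size for row in g]
        bottom = [row + row for row in g]
        g = top + bottom
    return g


def polar_encode(u):
    """Compute x = u * F^{(x)n} (mod 2) for a source word u of length N = 2^n."""
    length = len(u)
    if length == 0 or length & (length - 1):
        raise ValueError("source word length must be a power of two")
    g = kronecker_power(length.bit_length() - 1)
    return [sum(u[i] & g[i][j] for i in range(length)) % 2 for j in range(length)]


def place_info_bits(info_bits, info_set, n_total):
    """Put information bits on the positions in info_set; frozen bits stay 0."""
    u = [0] * n_total
    for pos, bit in zip(sorted(info_set), info_bits):
        u[pos] = bit
    return u


# Example: N = 4, u0 frozen, u1..u3 carry information (assumed information set).
u = place_info_bits([1, 0, 1], {1, 2, 3}, 4)
x = polar_encode(u)
```

Building the full N x N matrix is only practical for small N; it is used here because it follows the matrix definition literally.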
B. Successive Cancellation Decoding
Successive cancellation decoding is a basic decoding algo-
rithm of polar codes and has a low computational complexity
of O(N logN). Its decoding process is usually represented
by a scheduling tree. An example for an N=4 polar code is
shown in Fig. 2. It is a full binary tree with n+1 stages. The
operands of SC decoding are the log-likelihood ratios (LLRs).
The $i$th LLR at stage $s$ is denoted as $L_i^s$, where $i \in [0, N-1]$
and $s \in [0, n]$. Specifically, the $L_i^n$ are the channel LLRs that
are input to the tree from the root node at stage $n$. The $L_i^0$ are the
LLRs corresponding to the $N$ leaf nodes, and the hard decision
of $u_i$, denoted as $\hat{u}_i$, is made according to
$$\hat{u}_i = \Theta(\Lambda_i) = \begin{cases} 0, & \text{if } \Lambda_i > 0 \text{ or } i \in \mathcal{A}^c, \\ 1, & \text{otherwise,} \end{cases} \tag{1}$$
where $\Lambda_i = L_i^0$. To obtain these $\Lambda_i$, the nodes in the scheduling
tree are calculated as follows. A pair of sibling nodes at stage
$s$ share $2^{s+1}$ LLR inputs from stage $s+1$ and both of them
execute $2^s$ calculations in parallel. The left and right sibling
nodes (denoted as F- and G-nodes) calculate the following F-
and G-functions, respectively:
$$L_F(L_a, L_b) = (-1)^{\Theta(L_a) \oplus \Theta(L_b)} \cdot \min(|L_a|, |L_b|), \tag{2}$$
$$L_G(L_a, L_b, \hat{ps}) = (-1)^{\hat{ps}} L_a + L_b, \tag{3}$$
where $L_a$ and $L_b$ are the two input LLRs and $\hat{ps}$ for the
G-function is a binary bit called the partial sum. Equation (2) is a
hardware-friendly version of the F-function proposed in [3]. For a G-node
at stage $s$, its partial sums are obtained by
$$[\hat{ps}^s_{j+1}, \ldots, \hat{ps}^s_{j+2^s}] = [\hat{u}_{j-2^s+1}, \ldots, \hat{u}_j] \cdot F^{\otimes s}, \tag{4}$$
where $\hat{u}_j$ is the last decoded bit. According to (4), the partial
sums of a G-node have a data dependency on the $2^s$ decoded
bits rooted at its sibling F-node. Thus, it can be seen that the
decoding process of SC decoding follows a depth-first traversal
of the scheduling tree.
[Fig. 3. Decoding tree of different SCL decodings for a polar code with N = 4
and L = 2. (a) Traditional SCL decoding. (b) SCL decoding with MBD.
(c) SCL decoding with SND. In all panels u0 is a frozen bit and u1, u2, u3
are information bits; in (c), u2 and u3 form a rate-1 sub-code.]
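The depth-first schedule described above can be sketched as a short recursive Python routine. This is an illustrative software model, not the hardware architecture of the paper. It assumes the natural (non-bit-reversed) ordering implied by $x_N = u_N \cdot F^{\otimes n}$, so that LLR $i$ in the first half of a node is paired with LLR $i$ in the second half; the function names and the boolean `frozen` mask are hypothetical.

```python
def theta(llr, frozen=False):
    """Hard decision of (1): 0 if the LLR is positive or the bit is frozen, else 1."""
    return 0 if (frozen or llr > 0) else 1


def f_func(la, lb):
    """Min-sum F-function of (2)."""
    sign = -1 if theta(la) ^ theta(lb) else 1
    return sign * min(abs(la), abs(lb))


def g_func(la, lb, ps):
    """G-function of (3); ps is the partial sum (0 or 1)."""
    return (-1) ** ps * la + lb


def sc_decode(llrs, frozen):
    """Depth-first SC decoding of one sub-tree.

    Returns (decoded bits, partial sums), where the partial sums are the
    re-encoded bits of this sub-tree, as required by the parent G-node.
    """
    if len(llrs) == 1:
        bit = theta(llrs[0], frozen[0])
        return [bit], [bit]
    half = len(llrs) // 2
    la, lb = llrs[:half], llrs[half:]
    # F-node (left child) is visited first.
    left_llrs = [f_func(a, b) for a, b in zip(la, lb)]
    u_left, ps_left = sc_decode(left_llrs, frozen[:half])
    # G-node (right child) needs the partial sums of its sibling F-node.
    right_llrs = [g_func(a, b, ps) for a, b, ps in zip(la, lb, ps_left)]
    u_right, ps_right = sc_decode(right_llrs, frozen[half:])
    partial_sums = [x ^ y for x, y in zip(ps_left, ps_right)] + ps_right
    return u_left + u_right, partial_sums


# Noiseless example: codeword [0, 0, 1, 1] mapped to LLRs (+2 for 0, -2 for 1),
# with u0 frozen as in Fig. 3.
decoded, reencoded = sc_decode([2.0, 2.0, -2.0, -2.0], [True, False, False, False])
```

The recursion makes the data dependency in (4) explicit: the right child cannot start until the left child has returned its partial sums.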
C. Successive Cancellation List Decoding
SCL decoding was proposed in [7], [9]. It has a significant
performance gain over single SC decoding. Its decoding pro-
cess can be regarded as a search problem on a binary decoding
tree of depth N. Fig. 3a shows a decoding tree for a polar code
of N=4. The ith source bit ui, which corresponds to the ith
leaf node in the scheduling tree, is mapped to the nodes at
depth i+1 in the decoding tree. A path from the root node
to a leaf node represents a candidate of decoded vector. For a
parent node at depth i, its left and right children correspond to
two different expansions of the partial decoded path with ui=0
and 1, respectively. For example, the paths marked with single
and double crosslines in Fig. 3a represent decoding vectors
0010 and 0100, respectively.
The decoding process begins from the root node. When the
decoding process reaches a frozen bit ui, such as u0 in Fig.
3a, the sub-tree rooted at the right child is pruned (marked
with light colour) as it does not contain any valid path, and
the number of valid paths is unchanged. Otherwise, if ui is
an information bit, the valid decoded paths are expanded to
both sibling nodes and the number of valid paths in the list
doubles. The number of the path candidates increases expo-
nentially with respect to the number of decoded information
bits. When a practical code length is used, the computational
complexity will be too high to be implemented after a few bits
are decoded. To limit the computational complexity, an LM
operation is executed at each new depth to keep the number
of survival paths to a predefined value L which is called the list
size. In Fig. 3a, the lines with dark colours represent the paths
that have been expanded during LM and those with crosslines
represent the paths kept after LM.
The criterion for selecting survival paths during LM is their
reliability measured by path metrics (PM). We denote the path
metric of a path $l$ ($l \in [0, L-1]$) after the decoding of bit $u_i$
as $\gamma^k_{i+1}$, where $k \in \{2l, 2l+1\}$. The PM is initialized as
$\gamma^0_0 = 0$ and is updated based on bit-wise accumulation as [16]
$$\begin{cases} \gamma^{2l}_{i+1} = \gamma^{l}_{i}, & \text{where } \hat{u}^{2l}_i = \Theta(\Lambda^l_i), \\ \gamma^{2l+1}_{i+1} = \gamma^{l}_{i} + |\Lambda^l_i|, & \text{where } \hat{u}^{2l+1}_i = 1 - \Theta(\Lambda^l_i). \end{cases} \tag{5}$$
Similar to (2), (5) is a hardware-friendly version of the PM update.
For a frozen bit, only the equation in (5) that satisfies $\hat{u}_i = 0$
is computed and the number of paths remains L. As
mentioned above, only the L left child nodes are kept.
Otherwise, both equations in (5) are computed and the
number of paths is doubled. After that, a list pruning operation
is executed, in which all the 2L PMs are sorted and the L
paths with the smallest PMs are kept in the list.
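A minimal Python sketch of one LM step, following the PM update in (5) and the pruning rule just described, is given below. It uses a plain full sort rather than any of the simplified sorting schemes cited in this paper, and the function name `expand_and_prune` and the tuple layout `(metric, parent path, bit)` are hypothetical choices made for illustration.

```python
def theta(llr):
    """Hard decision on a leaf LLR: 0 if positive, else 1."""
    return 0 if llr > 0 else 1


def expand_and_prune(path_metrics, leaf_llrs, list_size, frozen=False):
    """Expand every surviving path by one bit and keep the best list_size paths.

    path_metrics[l] and leaf_llrs[l] belong to path l. Each returned entry is
    (new path metric, parent path index, decoded bit), smallest metric first.
    """
    candidates = []
    for l, (pm, llr) in enumerate(zip(path_metrics, leaf_llrs)):
        likely = theta(llr)
        children = [
            (pm, l, likely),                 # follows the hard decision, no penalty
            (pm + abs(llr), l, 1 - likely),  # opposes it, penalised by |LLR|
        ]
        if frozen:
            # A frozen bit must be 0, so only the child with bit 0 is valid.
            children = [c for c in children if c[2] == 0]
        candidates.extend(children)
    candidates.sort(key=lambda c: c[0])      # full sort of up to 2L metrics
    return candidates[:list_size]


# Two surviving paths, list size L = 2, information bit.
kept = expand_and_prune([0.0, 1.0], [3.0, -0.5], list_size=2)
```

In the example, the four expanded metrics are 0.0, 3.0, 1.0 and 1.5, so the two paths with metrics 0.0 and 1.0 survive.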
Recently, a variety of algorithms have been proposed to
reduce the decoding latency of SCL by decoding multiple
bits and executing their LM operation only once. The
algorithms in the literature can be divided into two different
classes, multi-bit decoding (MBD) [18]–[20], [22] and special
node decoding (SND) [23]–[27]. MBD decodes $M = 2^m$ bits
simultaneously, where M is a fixed and predefined value. The
decoding tree of MBD is modified to a full $2^M$-ary tree as
shown in Fig. 3b, in which M=2. LM is still executed at each
depth of this tree for each M bits. SND, on the other hand,
runs simplified LM algorithms for variable-length sub-codes
that match special code patterns. Fig. 3c shows the decoding
tree modified for SND, in which rate-1 sub-code, a sub-code
with only information bits, is used to simplify the decoding.
Fewer paths are expanded from each survival path in SND and
hence the computational complexity is reduced.
D. Adaptive SCL Decoding for Polar Codes
Adaptive SCL with CRC was proposed in [10] and the
algorithm is summarised in Algorithm 1. Each time, a new
codeword which contains N LLRs is inputted for decoding.
A-SCL starts with an SCL of L=1, i.e., a single SC. If
there are some decoded vectors that pass CRC at the end
of decoding, the one with the highest reliability is chosen as
Algorithm 1: Adaptive SCL with CRC.
Input: N channel LLRs; Initial: L = 1;
while true do
    SCL decoding with L: codeword from channel;
    if at least one path passes CRC or L == Lmax then
        Output the most reliable path; Break;
    else
        L = 2 · L; // L = Lmax for simplified A-SCL
TABLE I
AVERAGE LIST SIZE OF THE (N, K, r) = (1024, 512, 24) POLAR CODE
DECODED BY A-SCL WITH Lmax = 32
SNR (dB)        | 1.2   | 1.4   | 1.6  | 1.8  | 2.0
Original [10]   | 4.11  | 2.49  | 1.67 | 1.30 | 1.13
Simplified [26] | 18.55 | 13.52 | 9.06 | 5.74 | 3.53
output. Otherwise, the list size is doubled and the codeword
is decoded again by an SCL with the new list size. Usually, a
predefined Lmax is used to limit the computational complexity,
that is, after the decoding using an SCL with Lmax, the
decoding terminates even when there is no valid candidate.
From [10], the error correction performance of A-SCL is the
same as that of an SCL with Lmax. At the same time, as
most of the valid decoded vectors can be obtained using SCL
with smaller list sizes, the average list size L̄ of A-SCL is
much smaller than Lmax. An example for a polar code of
(N, K, r) = (1024, 512, 24) is shown in Table I, in
which L̄ ≪ Lmax = 32. As SCL with a smaller list size has a
higher decoding speed, the average decoding speed of A-SCL
is much higher than that of SCL.
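The control flow of Algorithm 1 can be sketched in Python as follows. The SCL decoder and the CRC check are passed in as callables because their implementations are outside the scope of this illustration: `scl_decode(llrs, list_size)` is a hypothetical interface assumed to return the candidate vectors sorted from most to least reliable, and `crc_ok(candidate)` is assumed to return True when the CRC passes. The `simplified` flag selects the simplified A-SCL variant, which jumps directly from L = 1 to Lmax.

```python
def adaptive_scl(llrs, scl_decode, crc_ok, l_max, simplified=False):
    """Adaptive SCL loop; returns (output vector, list size finally used)."""
    list_size = 1  # start with a single SC decoding
    while True:
        candidates = scl_decode(llrs, list_size)
        valid = [c for c in candidates if crc_ok(c)]
        if valid:
            # At least one path passes CRC: output the most reliable valid one.
            return valid[0], list_size
        if list_size >= l_max:
            # Lmax reached without a valid candidate: decoding terminates
            # and the most reliable (but CRC-failing) path is output.
            return candidates[0], list_size
        list_size = l_max if simplified else min(2 * list_size, l_max)


# Stand-in decoder for demonstration only: candidate k is the vector [k],
# and the "CRC" accepts only the vector [5].
fake_scl = lambda llrs, size: [[k] for k in range(size)]
fake_crc = lambda cand: cand == [5]

result, used = adaptive_scl(None, fake_scl, fake_crc, l_max=32)
```

With the stand-in decoder, the original scheme stops at list size 8, whereas the simplified scheme would go straight to 32, which illustrates why its average list size in Table I is larger.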
However, if we directly implement the A-SCL algorithm in
hardware, the following issues need to be addressed:
• In an A-SCL decoding, different codewords may be
decoded by SCL with different list sizes and SCL with a
larger list size has a much larger latency. The decoding
latency varies from frame to frame, so the system latency
is not fixed. Because of that, a directly-mapped architec-
ture may not be able to support applications that need to
have a constant transmission data rate and latency, such
as the channel coding blocks in communication systems.
• A directly-mapped architecture is required to support
multiple SCL decodings with list sizes L ranging from
1 to Lmax. This increases the design effort and also the
hardware complexity. A simplified A-SCL decoding was
proposed for accelerating software polar decoding [26],
in which only one single SC and one SCL with Lmax is
used. However, as shown in Table I, its L¯ is larger than
that of the original A-SCL, which means the achievable
throughput gain is much less than that of the original
A-SCL decoding. Moreover, simplified A-SCL does not
support a constant transmission data rate either.
• The throughput of A-SCL is not constant under different
channel conditions. As shown in Table I, the average
list sizes of A-SCL decoding increase when the channel
condition deteriorates as more frames need to be decoded
multiple times, which indicates that the throughput could be
affected accordingly. It is also shown in [26] that a
software-based simplified A-SCL decoder suffers from a
20x throughput reduction when the signal-to-noise ratio
(SNR) is reduced by 1 dB. To adjust the data rate
according to the channel condition, the transmitter side in
a real system needs to know the channel SNR in real time,
which increases the difficulty of system implementation.
[Fig. 4. Block diagram of TA-SCL. Blocks: channel, SCL (small L), LLR
buffer, SCL (large L), output buffer. The links carry N LLRs or N bits.]
[Fig. 5. Timing schedule of TA-SCL. (a) The buffer size is infinite. Buffer
overflow never happens. (b) The buffer size is one (finite). Buffer overflow
may happen. Rows in each panel: SCL (small L), SCL (large L), LLR buffer.
The codewords in gray cannot be decoded correctly by Ds. The arrows
represent data flows.]
To design a hardware-friendly A-SCL algorithm, we draw on
variable-iteration decoders for LDPC codes
[31]–[33]. In these decoders, an additional buffer is employed
at the input of the iterative decoder, in which the newly
received codeword can be stored temporarily before the current
decoding is finished.
By doing so, the iterative decoder can use different decod-
ing iterations to achieve decoding convergence and support
constant transmission data rate at the same time. In the next
section, we will propose a two-staged A-SCL decoding which
solves the problems mentioned above with the help of some
buffers and is hence suitable for hardware implementation.
III. TWO-STAGED ADAPTIVE SUCCESSIVE
CANCELLATION DECODING
A. Algorithm of TA-SCL Decoding
As mentioned in Section II-D, the average list size and the
error correction performance of the A-SCL algorithm follow those
of single SC and SCL with Lmax, respectively. Based on this
observation, we propose a hardware-friendly TA-SCL algorithm,
which is described below.
The block diagram of TA-SCL and its timing schedule are
shown in Fig. 4 and Fig. 5a, respectively. Basically, it includes
two SCL decodings, which are an SCL decoding with small
list size (not necessarily 1), denoted as Ds, and an SCL
decoding with large list size Lmax, denoted as Dl. Each input
codeword from the channel is first decoded by Ds. If none of
the candidates in the list passes CRC after this decoding, e.g.
fr.1 in Fig. 5a, the current codeword will be decoded again
by Dl. This decoding usually takes longer time than decoding
using Ds. Nevertheless, Dl runs concurrently with Ds so that
Ds starts decoding the next codeword immediately. Most of
the time, Ds can decode the input codewords correctly and
Dl becomes idle when the current decoding process finishes.
However, if the channel is subject to burst errors, it is possible
that a new codeword cannot be correctly decoded by Ds and
the decoding in Dl has not finished yet. To deal with this, an
LLR buffer is needed to store the LLRs of the codeword from
Ds temporarily, such as fr.2, fr.3, fr.4 and fr.6 shown in Fig.
5a. An output buffer is also needed to re-order the decoded
vectors as the codewords may be decoded out of order. For
example, fr.7 to fr.13 are stored in the output buffer until the
decoding of fr.6 finishes.
The major difference between TA-SCL and A-SCL, either
the original one or the simplified one, is that, with the help of
the LLR buffer, Ds can decode the next codeword from the
channel input immediately instead of waiting for Dl to finish
decoding the current codeword. The continuous running of Ds and
LLR storage in the buffer permit the data to be transmitted at a
constant data rate which is equal to the decoding throughput of
Ds regardless of the SNR of the channel, while the decoding
performance is guaranteed by Dl. Also, TA-SCL benefits the
hardware complexity as only two SCL decoders need to be
implemented on hardware. The issues mentioned in Section
II-D are hence solved.
B. Performance Bound of TA-SCL Decoding
If we have unlimited buffer resources, the decoding perfor-
mance of TA-SCL will be the same as that of Dl. However, in
actual hardware implementation, the buffer size is limited and
buffer overflow will happen, as shown in Fig. 5b. It happens
when a new codeword needs to be stored in the LLR buffer
but the buffer is full and decoding in Dl has not finished yet.
To deal with buffer overflow, either the codeword in Ds or
Dl would be thrown away and the incorrect decoding results
from Ds will be used as the final output decoded vector of the
corresponding codeword. Thus, the block error rate (BLER)
of DTA, denoted as $\epsilon_{D_{TA}}$, is bounded by
$$\epsilon_l \le \epsilon_{D_{TA}} < \epsilon_l + \Pr(\text{Overflow}), \tag{6}$$
where the upper limit is the sum of the BLER of Dl and the
probability of buffer overflow. Obviously, it is important to
prevent buffer overflow in order to reduce the performance loss,
which is defined as
$$\delta = \frac{\epsilon_{D_{TA}} - \epsilon_l}{\epsilon_l} \cdot 100\%. \tag{7}$$
In summary, the benefits of TA-SCL on hardware come at the
cost of error correction performance loss. To obtain the best
tradeoff among performance, hardware usage and throughput,
an analytical model of DTA is introduced in the next
sub-section to obtain the relationship between Pr(Overflow)
and the parameters of DTA. Before that, we define the design
parameters of TA-SCL as follows.
• Ls/Ll: list sizes of Ds/Dl.
• $\epsilon_s$/$\epsilon_l$: BLERs of Ds/Dl.
• ts/tl: decoding time of each codeword using Ds/Dl.
• β: speed gain, i.e., $\beta = t_l / t_s$. In this work, β is not limited to
integer values and can be any rational number, which is
given by $\beta = \beta_n / \beta_d$, where $\beta_n, \beta_d \in \mathbb{Z}^+$ and
$\beta_n \perp \beta_d$, i.e., $\beta_n$ and $\beta_d$ are co-prime.
• ζ: size of the LLR buffer, which equals the number of
codewords that can be stored in the buffer. In this work,
we assume $\zeta \ge 1$ and $\zeta \in \mathbb{Z}^+$.
[Fig. 6. States and state transitions of the proposed model. (a) An example
for DTA(3, 1). (b) Summary of the three kinds of states. The white and black
arrows mean the frame is decoded correctly and incorrectly, respectively.]
Fig. 6b, summary of the three kinds of states:
State  | Number of states | Next state, incorrect decoding | Next state, correct decoding
Hazard | βn − βd          | Xτ − 1                         | Xτ − 1
Safe   | βnζ              | Xτ + β − 1                     | Xτ − 1
Idle   | βd + 1           | β                              | 0
We also denote a TA-SCL decoding whose speed gain is β and
buffer size is ζ as DTA(β, ζ). The TA-SCL decoding in the
example shown in Fig. 5a and 5b hence can be described as
DTA(3,∞) and DTA(3, 1), respectively, and the corresponding
Dl needs 3ts to decode a codeword.
C. Analytical Model of TA-SCL Based on Markov Chain
In this sub-section, we model the behavior of DTA(β, ζ) on
B-DMC. Without loss of generality, we assume the channel is
an additive white Gaussian noise (AWGN) channel. We first
introduce the states that the decoder can operate in. We define
the number of codewords currently stored in the LLR buffer
as $i_\zeta$ and the remaining time required to finish the decoding of
the current codeword in Dl as $i_\beta$, which is an integer multiple
of $t_s / \beta_d$. Each codeword in the LLR buffer needs $\beta t_s$ to decode.
Then, the state of TA-SCL is defined as
$$X_\tau = \beta \cdot i_\zeta + i_\beta, \quad i_\zeta \in [0, \zeta], \quad i_\beta \in \left\{ \frac{i}{\beta_d} \,\middle|\, i \in [0, \beta_n] \right\}, \tag{8}$$
which is the total time required to clear the buffer in
terms of ts. For a DTA(β, ζ), there are $S = \beta_n \zeta + \beta_n + 1$ states
in total. The S states can be categorized into the following
groups according to whether buffer overflow will happen.
• Hazard states: The states in which the LLR buffer is full and
the current codeword decoded by Dl cannot be finished
within ts, which means $i_\zeta = \zeta$ and $i_\beta > 1$. Buffer
overflow will occur if Ds cannot decode the next codeword
correctly. The codeword in Ds will be thrown away as
this allows Dl to decode as many codewords as possible
without any interruption.
• Safe/Idle states: In contrast with the hazard states, these
states do not have overflow hazard as the LLR buffer has
enough space for a codeword that cannot be correctly
decoded by Ds. The transitions in idle states are a little
different from those in safe states in the sense that Dl
will finish its decoding and become idle during this ts.
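To make the three groups concrete, the following Python sketch enumerates the states of a DTA(β, ζ) and labels each one. It is an illustration under two assumptions: the hazard condition is taken in the equivalent form $X_\tau > \beta\zeta + 1$, and idle states are taken to be those with $X_\tau \le 1$, which matches the state counts in Fig. 6b. The function name `classify_states` is hypothetical.

```python
from fractions import Fraction


def classify_states(beta, zeta):
    """Map every state X (in units of ts) of DTA(beta, zeta) to its group."""
    beta = Fraction(beta)
    # Fraction always stores the reduced form, so bn and bd are co-prime.
    bn, bd = beta.numerator, beta.denominator
    groups = {}
    for k in range(bn * zeta + bn + 1):   # k counts steps of ts / bd
        x = Fraction(k, bd)
        if x <= 1:
            groups[x] = "idle"            # Dl becomes free within this ts
        elif x > beta * zeta + 1:
            groups[x] = "hazard"          # buffer full, Dl busy for more than ts
        else:
            groups[x] = "safe"
    return groups


def count(groups, kind):
    """Number of states that belong to the given group."""
    return sum(1 for g in groups.values() if g == kind)


integer_case = classify_states(3, 1)               # DTA(3, 1)
rational_case = classify_states(Fraction(5, 2), 1)  # DTA(5/2, 1)
```

For DTA(3, 1) this gives 7 states with hazard states 5 and 6, and for DTA(5/2, 1) it gives 11 states in steps of ts/2.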
We show an example for DTA(3, 1) in Fig. 6a, where the
black and white arrows represent the probabilities $\epsilon_s$ and
$\epsilon'_s = 1 - \epsilon_s$, respectively. The first three columns show
$i_\beta$, $i_\zeta$ and $X_\tau$, respectively. Typical transitions from
hazard, safe and idle states are marked with “H”, “S” and “I” in the
figure, respectively. The numbers of these states and their transitions
are summarized in Fig. 6b. After defining these states, we
introduce the modeling of TA-SCL by Proposition 1.
Proposition 1. Decoding with DTA is a Markov process and
can be modeled with a Markov chain.
Proof: An input codeword is independent of the codewords
input at other times as all the LLRs are independent and
identically distributed (i.i.d.) random variables.
Thus, the decoding correctness of Ds only depends on the
inputs at that time. It follows a Bernoulli distribution
which takes the value 1 (error happens) with probability $\epsilon_s$.
As all the state transitions depend only on $\epsilon_s$, the next state
of DTA(β, ζ) only depends on the current state instead of any
earlier state. Hence, Proposition 1 is proved.
From Proposition 1, the state diagram of a DTA(β, ζ) can be
easily obtained by finding all the possible state transitions
in Fig. 6a. The state diagrams of DTA(3, 1) and DTA(5/2, 1)
are both shown in Fig. 7. The latter has some non-integer
states as the remaining time to clear the buffer is an integer
multiple of ts/2 instead of ts. If βn and βd are not co-prime,
the corresponding model can be simplified. For example, if
βn=6 and βd=2 in Fig. 7a, some states (the light gray ones
in Fig. 7a) can never be accessed and are redundant in the
model. These states can be removed and the model is the
same as the one with βn=3 and βd=1. So we only consider
the situations in which βn⊥βd.
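For further mathematical analysis, we map the state diagram
to a transition matrix $P$ whose size is $S \times S$. An element
$P_{x,y} \in P$ ($x, y \in [0, S-1]$) corresponds to the transition
probability from state $x$ to state $y$, i.e.,
$$P_{x,y} = \Pr(X_{\tau+1} = y \mid X_\tau = x), \tag{9}$$
where $X_\tau$ is the current state and $X_{\tau+1}$ is the next state. The
transition matrix of DTA(3, 1) mapped from the state diagram
is
$$P = \begin{bmatrix}
\epsilon'_s & 0 & 0 & \epsilon_s & 0 & 0 & 0 \\
\epsilon'_s & 0 & 0 & \epsilon_s & 0 & 0 & 0 \\
0 & \epsilon'_s & 0 & 0 & \epsilon_s & 0 & 0 \\
0 & 0 & \epsilon'_s & 0 & 0 & \epsilon_s & 0 \\
0 & 0 & 0 & \epsilon'_s & 0 & 0 & \epsilon_s \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0
\end{bmatrix}. \tag{10}$$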
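As a cross-check of (10), the transition matrix can be generated programmatically from the rules summarized in Fig. 6b. The Python sketch below is an illustration, not part of the proposed architecture. States are indexed by $k = X_\tau \cdot \beta_d$ so that rational speed gains are handled with integer indices; the function name `transition_matrix` is hypothetical.

```python
from fractions import Fraction


def transition_matrix(beta, zeta, eps_s):
    """Build the S x S transition matrix of DTA(beta, zeta).

    Row and column k correspond to the state X = k / bd (in units of ts).
    eps_s is the BLER of the small-list decoder Ds.
    """
    beta = Fraction(beta)
    bn, bd = beta.numerator, beta.denominator
    size = bn * zeta + bn + 1
    p = [[0.0] * size for _ in range(size)]
    for k in range(size):
        if k <= bd:                      # idle: Dl is free at the end of this ts
            ok, fail = 0, bn             # correct -> 0, incorrect -> beta
        elif k > bn * zeta + bd:         # hazard: a failed codeword is dropped
            ok = fail = k - bd           # both cases -> X - 1
        else:                            # safe: a failed codeword is buffered
            ok, fail = k - bd, k - bd + bn   # X - 1 or X + beta - 1
        p[k][ok] += 1.0 - eps_s
        p[k][fail] += eps_s
    return p


# DTA(3, 1) with an arbitrary illustrative value of eps_s.
P = transition_matrix(3, 1, 0.25)
```

With `eps_s = 0.25`, the rows of `P` reproduce the pattern of (10), and the two hazard rows contain a single entry equal to 1.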
The transition matrix P is time-independent according to
Proposition 1. To do steady-state analysis for TA-SCL, the fol-
lowing proposition of the Markov chain model is introduced.
Proposition 2. The Markov chain of TA-SCL decoding is
irreducible (possible to get to any state from any state) and
aperiodic.
Proof: The proof is given in Appendix A.
From Proposition 2, the chain converges to the stationary
distribution regardless of its initial distribution. Suppose that
the decoding begins with DTA at state 0, i.e., the initial
distribution $\lambda_0 = [1, 0, \ldots, 0]$. After $k \cdot t_s$ ($k \in \mathbb{Z}^+$),
the probability distribution becomes $\lambda_k = \lambda_0 \cdot P^{(k)}$. Let
$P_\infty = \lim_{k \to \infty} P^{(k)}$, then
the stationary distribution $\pi$ of DTA is
$$\pi = \lambda_0 \cdot P_\infty = [(P_\infty)_{0,0}, \ldots, (P_\infty)_{0,\beta\zeta+\beta}]. \tag{11}$$
Actually, all the rows of $P_\infty$ are the same, and so the stationary
distribution is irrespective of the initial state $\lambda_0$ of DTA.
[Fig. 7. State diagrams of the proposed model. (a) DTA(3, 1). (b) DTA(5/2, 1).
The white and black arrows mean the frame is decoded correctly and
incorrectly, respectively.]
Buffer overflow happens when DTA is in any hazard state
and Ds cannot decode the next codeword correctly. Thus, the
probability of buffer overflow is expressed as
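\begin{align*}
\Pr(\text{Overflow}) &= \epsilon_s \cdot \Pr(\text{Hazard}) \tag{12} \\
&= \epsilon_s \cdot \Pr(i_\zeta = \zeta \text{ and } i_\beta > 1) \tag{13} \\
&= \epsilon_s \cdot \Pr(X_\tau > \beta\zeta + 1) \tag{14} \\
&= \epsilon_s \cdot \sum_{i=\beta\zeta+1+\beta_n^{-1}}^{\beta\zeta+\beta} \pi_i. \tag{15}
\end{align*}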
This probability of overflow bounds ϵDTA in (6). It is a function of the error correction performance ϵs of Ds, the speed gain β and the buffer size ζ, i.e., Pr(Overflow)=f(ϵs, β, ζ). We will use this to analyse the error correction performance of TA-SCL next.
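As a rough numerical illustration of (10)-(15), the following Python sketch iterates λk+1 = λk · P from state 0 for DTA(3, 1) until the distribution settles, and then sums the stationary mass of the states above βζ+1 = 4. It assumes that ϵ′s = 1 − ϵs (so that each row of P sums to one) and that the states are indexed 0 to 6 as in (10). The function names and the example values of ϵs are illustrative only and are not taken from the paper.

```python
def transition_matrix_dta_3_1(eps_s):
    """Transition matrix (10) of D_TA(3, 1), stored as a list of rows."""
    ok = 1.0 - eps_s  # assumed meaning of eps'_s, so that every row sums to 1
    return [
        [ok, 0, 0, eps_s, 0, 0, 0],
        [ok, 0, 0, eps_s, 0, 0, 0],
        [0, ok, 0, 0, eps_s, 0, 0],
        [0, 0, ok, 0, 0, eps_s, 0],
        [0, 0, 0, ok, 0, 0, eps_s],
        [0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 1, 0],
    ]


def stationary_distribution(P, tol=1e-13, max_steps=100_000):
    """Iterate lambda_{k+1} = lambda_k * P from state 0 until it settles."""
    n = len(P)
    lam = [1.0] + [0.0] * (n - 1)  # lambda_0 = [1, 0, ..., 0]
    for _ in range(max_steps):
        nxt = [sum(lam[x] * P[x][y] for x in range(n)) for y in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, lam)) < tol:
            return nxt
        lam = nxt
    return lam


def overflow_probability(eps_s):
    """eps_s times the stationary mass of the states above beta*zeta + 1 = 4."""
    pi = stationary_distribution(transition_matrix_dta_3_1(eps_s))
    hazard = sum(p for state, p in enumerate(pi) if state > 4)
    return eps_s * hazard


for eps in (0.3, 0.1, 0.01):  # illustrative error rates of D_s
    print(eps, overflow_probability(eps))
```

The iteration is only expected to settle for ϵs < 1, where the chain is aperiodic; solving πP = π directly would be an equally valid way to obtain π.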
D. Error Correction Performance of TA-SCL
In this sub-section, we study the relationships between the
error correction performance of TA-SCL decoding and its
design parameters based on the proposed model. Simulation
results are presented to verify the analysis numerically. A polar
code of (N,K, r)=(1024, 512, 24) is used for simulation over
an AWGN channel, which is the same as that used in [22]. The low-latency hardware-friendly decoding algorithm proposed in [22], multi-bit DTS (MB-DTS), is used for Dl with Ll=32 as TA-SCL targets hardware-based applications.
Instead of directly studying the relationships between the design parameters and ϵDTA, we first study their relationships with Pr(Overflow) with the help of the derived model.
Proposition 3. If β and ζ are fixed, lim_{ϵs→0} Pr(Overflow) = 0.
Proof: As shown in (10), all the elements in a transition matrix are linear combinations of 1 and ϵs. Thus, each element πi in the stationary distribution π, and hence Pr(Overflow), is a polynomial of ϵs. Moreover, since Pr(Hazard) ≤ 1, (12) gives Pr(Overflow) ≤ ϵs, which vanishes as ϵs → 0, and the proposition is proved.
[Fig. 8. The impact of parameters ϵs, β and ζ on the error correction performance of DTA, plotted as BLER versus Eb/N0. (a) DTA(3, 3) with Ls=1 and 2. (b) DTA(β, 2) (β ∈ [2, 4]) with Ls=2. (c) DTA(3, ζ) (ζ ∈ [1, 3]) with Ls=2.]
Proposition 3 indicates that TA-SCL decoding should have a better error correction performance at a higher SNR or if a larger Ls is used. To verify this, we simulate the error correction performance of the proposed TA-SCL decoding with different Ls but the same β and ζ, and the results are shown in Fig. 8a. We assume that β=3, which is a reasonable estimate, as will be shown later in Section V.
The solid lines show the simulation results of ϵDTA and the dashed lines show the upper bound of ϵDTA calculated by the proposed model. It can be observed that TA-SCL with Ls=2 has almost negligible error correction performance degradation compared to Dl over a wide SNR range. Also, the performance degradation gradually disappears as the SNR increases and ϵs decreases. In contrast, TA-SCL with Ls=1 has much poorer performance in the low SNR range. This is because more codewords cannot be correctly decoded by Ds and more operations of Dl are needed, hence Pr(Overflow) increases. Nevertheless, in the high SNR range, the performance degradation is still negligible. Since a smaller Ls usually implies a lower decoding latency, a larger speed gain can be achieved with Ls=1. Moreover, it can be observed in Fig. 8a that the simulation results are almost the same as the upper bounds obtained from the analysis, i.e., ϵDTA ≈ ϵl+Pr(Overflow). Thus, the upper bound of the derived model can be used to estimate the error correction performance of DTA.
Next, we study the relationships between Pr(Overflow) and
the other two design parameters, β and ζ.
• When ζ increases, more codewords can be stored in the
LLR buffer for Dl decoding. As discussed in Section
III-A, if we have infinite buffer resources, buffer overflow
never happens.
• When β decreases, Dl can decode the codewords in
the LLR buffer sooner after they were stored. When
β ≤ 1, Dl decodes a codeword faster than Ds and buffer
overflow never happens.
Fig. 8b and 8c show the performance of TA-SCL with different β and ζ, respectively. It can be seen that, in the low SNR range, both increasing the buffer size ζ and decreasing the speed gain β lead to a lower Pr(Overflow) and hence a lower ϵDTA, which is in accordance with the discussion above. In the high SNR range, the curves of all the DTA(β, ζ) overlap, which means the error correction performance is good even when a small buffer is used or a high speed gain is required.
From the simulation results shown in Fig. 8, the performance loss δ of TA-SCL decoding is larger in the low SNR range. In practical applications, the choice of Ds depends on how much performance loss we can tolerate at a certain BLER. For example, in Fig. 8a, the performance loss is 20% at a BLER of 2·10^-2 for DTA(3, 3) with Ls=2, while that for Ls=1 is much larger. It can also be seen that δ gradually approaches 0 as the SNR increases.
Based on the relationships between the error correction performance and the design parameters, we summarise the following steps for designing TA-SCL decoding for a specific polar code.
1) Run simulations of the target code using Ds and Dl.
2) Calculate ts, tl and the corresponding β.
3) Gradually increase the buffer size from ζ = 1, using the Markov model to calculate the performance loss δ and check whether it meets the requirement at the target BLER.
4) If ζ reaches a predefined buffer resource constraint and the requirement on δ is still not met, add idle time to ts to decrease β and redo step 3.
5) Run simulations of the target code using the designed DTA(β, ζ) for verification.
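Step 3 of the flow above can be sketched in Python as follows. This is only an illustration under several assumptions that are not stated in this section: the chain is extrapolated from the DTA(3, 1) matrix in (10) to other integer β and ζ, the performance loss is approximated as δ ≈ Pr(Overflow)/ϵl (using ϵDTA ≈ ϵl + Pr(Overflow)), and the error rates passed in the example call are hypothetical rather than simulated values.

```python
def overflow_probability(eps_s, beta, zeta, tol=1e-13, max_steps=100_000):
    """Pr(Overflow) for integer beta, using a chain extrapolated from (10)."""
    top = beta * zeta + beta        # largest state index
    hazard_from = beta * zeta + 1   # states above this are hazard states, cf. (14)
    lam = [1.0] + [0.0] * top       # start in state 0
    for _ in range(max_steps):
        nxt = [0.0] * (top + 1)
        for x, mass in enumerate(lam):
            down = max(x - 1, 0)    # one time step t_s elapses
            if x > hazard_from:
                nxt[down] += mass   # hazard state: a failed frame cannot be buffered
            else:
                nxt[down] += mass * (1.0 - eps_s)  # D_s decodes the frame correctly
                nxt[down + beta] += mass * eps_s   # D_s fails, frame is queued for D_l
        converged = max(abs(a - b) for a, b in zip(nxt, lam)) < tol
        lam = nxt
        if converged:
            break
    return eps_s * sum(lam[hazard_from + 1:])


def smallest_buffer(eps_s, eps_l, beta, delta_max, zeta_max):
    """Return the first zeta whose estimated loss meets delta_max, else None."""
    for zeta in range(1, zeta_max + 1):
        delta = overflow_probability(eps_s, beta, zeta) / eps_l  # assumed loss measure
        if delta <= delta_max:
            return zeta
    return None  # step 4 would now add idle time to lower beta and retry


# Hypothetical operating point: eps_s = 0.1, eps_l = 0.01, beta = 3, delta <= 30%.
print(smallest_buffer(0.1, 0.01, 3, 0.3, zeta_max=8))
```

A `None` result corresponds to step 4, where idle time is added to ts so that β decreases before the search is repeated.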
IV. HARDWARE ARCHITECTURE FOR TA-SCL
In this section, we first present the overall architecture of
TA-SCL decoder and analyze its memory usage and timing
schedule. Then, we introduce a low-latency architecture for
Ds. We denote the number of clock cycles required for Ds
and Dl to decode one frame as Cs and Cl, respectively, then
\[
\beta = \frac{t_l}{t_s} = \frac{C_l}{C_s}. \tag{16}
\]
In the rest of the paper, we will use Cs and Cl to represent the
latency instead of ts and tl.
A. Overall Architecture
The proposed architecture of TA-SCL decoder is shown in
Fig. 9. It consists of four major sub-blocks: two constituent
SCL decoders for Ds and Dl, an LLR buffer and an output
buffer. Data width of each connection is also marked in Fig. 9,
where P is called parallelism factor, i.e., the maximum number
of F- or G-nodes that can be executed at the same time in Dl,
and Q is the number of quantization bits for the LLR values.
Details of each sub-block are introduced below.
To support a high speed gain β, the architecture of Ds should have a very low decoding latency. Empirically, a Ds with Ls ≤ 2 provides error correction performance that is good enough to achieve a very low Pr(Overflow) and ϵDTA, and a larger Ls brings little performance gain but a large overhead in timing and hardware complexity. Hence, a low-latency SCL decoding scheme targeting Ls=2 is used in the proposed TA-SCL architecture. It will be introduced in detail in Section IV-C. The memory usage of Ds is shown in Fig. 9. We only show the major memories which dominate the overall memory usage. The LLR memory stores both the channel LLRs and the LLRs calculated during decoding. The channel LLR memory has NQ bits. It is also re-used for the storage of the calculated LLRs from the G-node at stage n-1. The calculated LLRs from the G-nodes at the other stages require about Ls · 2 · NQ/2 bits for storage in total [22]. The calculated LLRs of F-nodes at different stages can share an NQ/2-bit memory as they will be used only once in the next clock cycle directly and do not need to be stored afterwards [3], [22]. The partial-sum memory and path memory require Ls · N/2 bits and Ls · N bits, respectively. This architecture can also be configured for Ls=1, and the corresponding memory usage is slightly larger than half of that of Ds with Ls=2.
[Fig. 9. Overall hardware architecture and memory usage.]
[Fig. 10. Timing schedule of DTA(5/2, 1). The codewords in gray cannot be decoded correctly by Ds. Legend entries: LLR inputs of any frame, SCLD w/ Ls, LLR inputs of error frames, SCLD w/ Ll, decoded vector of Ds, decoded vector of Dl, final outputs.]
For Dl, we use the architecture proposed in our previous work [22] and the details will not be discussed here. The sizes of the LLR memory, partial-sum memory and path memory are [(Ll + 1)N + 3LlP] · Q bits, Ll · N/2 bits and Ll · N bits, respectively. The number of processing elements (PEs) for each path in Dl equals P.
The LLR buffer is implemented by a simple one-read one-write dual-port SRAM. The incoming channel LLR values are written to the channel LLR memory in Ds and the LLR buffer at the same time. If Ds cannot decode the frame correctly, the frame data is kept in the LLR buffer for Dl decoding. Otherwise, it is overwritten by the next frame. Thus, the total size of the LLR buffer is ζ+1 frames, where ζ is the number of buffers required for storing the frame data for Dl, as discussed in the analytical model, and the additional buffer holds the current frame data being decoded by Ds. The size of the LLR buffer is hence NQ · (ζ+1) bits. The width of each port is 2PQ bits. This parallelism matches that of the I/O ports of Dl [22].
The output buffer is implemented by a true dual-port SRAM of which both ports can be used for reading or writing. It is used to align the output order to be the same as the input sequence because the decoding of the frames can be out of order. All the decoding results from Ds are temporarily stored in this buffer. The results are overwritten if the corresponding codeword is decoded by Dl. Consider the worst case in which TA-SCL reaches the maximum state: the frame that has just been stored into the LLR buffer and could not be decoded correctly by Ds will be decoded by Dl after (βζ+β) · Cs clock cycles. All the decoding results of the codewords input after this frame need to be temporarily stored in the output buffer. So the output buffer needs to accommodate at most ⌈βζ + β⌉ frames theoretically. In the real design, the output buffer needs to store ⌊βζ + β + 1⌋ frames, i.e., one more frame of the decoded vectors needs to be stored if βζ+β is an integer, and the reason will be explained in Section IV-B. The size of the output buffer is hence N · ⌊βζ + β + 1⌋ bits.
TABLE II
MEMORY USAGE OF TA-SCL DECODING AND AN EXAMPLE FOR DTA(3, 3), ASSUMING THAT N=1024, Ll=32, P=64 AND Q=6
Sub-blocks   Memory usage (bits)                     Ls=2      Ls=1
Dl           [(Ll+1)N+3LlP]·Q + Ll·3N/2              288,768   288,768
Others       (ζ+Ls+5/2)NQ + (⌊βζ+β+1⌋+3Ls/2)N        62,464    54,784
Overhead                                             21%       19%
The memory usage of TA-SCL is summarized in Table II
according to the analysis above. The hardware complexity
overhead of TA-SCL over the traditional architecture for Dl
comes from Ds and the two buffers, which is dominated by
the memory overhead. As an example, the memory overhead
of DTA(3, 3) (the one in Fig. 8a) is also shown in Table II
and the overhead is around 20%. More accurate experimental
results on hardware usage will be presented in Section V.
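The entries of Table II can be re-derived from the per-memory sizes given in the text. The Python sketch below adds up those sizes for the DTA(3, 3) example (N=1024, Ll=32, P=64, Q=6); the function names are illustrative and the grouping of terms is our reading of the text rather than code from the paper.

```python
from fractions import Fraction
from math import floor


def dl_memory_bits(N, Ll, P, Q):
    """LLR, partial-sum and path memories of D_l."""
    return ((Ll + 1) * N + 3 * Ll * P) * Q + Ll * N // 2 + Ll * N


def other_memory_bits(N, Ls, Q, beta, zeta):
    """Memories of D_s plus the LLR buffer and the output buffer."""
    ds_llr = N * Q + Ls * 2 * (N * Q // 2) + N * Q // 2  # channel, G-node, F-node LLRs
    ds_ps_path = Ls * N // 2 + Ls * N                    # partial sums and paths
    llr_buffer = N * Q * (zeta + 1)
    output_buffer = N * floor(beta * zeta + beta + 1)
    return ds_llr + ds_ps_path + llr_buffer + output_buffer


N, Ll, P, Q = 1024, 32, 64, 6
beta, zeta = Fraction(3), 3
dl = dl_memory_bits(N, Ll, P, Q)
for Ls in (2, 1):
    others = other_memory_bits(N, Ls, Q, beta, zeta)
    print(Ls, dl, others, others / dl)  # last value is the memory overhead ratio
```

Using `Fraction` for β keeps the floor in the output buffer size exact when β is not an integer, e.g. β = 5/2.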
B. Timing schedule
The timing schedule of the TA-SCL architecture is illustrated in Fig. 10 with an example for DTA(5/2, 1). Each line represents a decoding process or a data flow in Fig. 9, as marked with circled letters. The numbers shown in the waveforms represent the frame indices.
[Fig. 11. Hardware architecture of Ds. The blocks with “+” stripes are disabled when Ls=1.]
TABLE III
SUMMARY OF SPECIAL NODES USED AT LOW STAGES
Name     # frz.   # inf.
Rate-0   T        0
Rate-1   0        T
Rep.     T-1      1
SPC      1        T-1
Rep2     T-2      2
SPC2     2        T-2
The first and third rows represent the decoding operations of Ds and Dl, respectively, which are similar to the timing schedule shown in Fig. 5. The first Crw = N/2P clock cycles are used to load the input channel LLRs. The periods filled with “//”, “\\” and “X” stripes represent the LLR loading time for Ds, for Dl and for both, respectively. The second and fourth rows represent the operations of sending the decoding results from Ds and Dl to the output buffer, respectively. These two operations are executed concurrently with the loading operations of the next codeword to be decoded. The fifth row shows when the final output results are generated from the output buffer. From the timing schedule, it can be seen that the final decoding results of any codeword are available Cs+Cs·(βζ+β)+Crw clock cycles after the corresponding LLRs are input to the decoder. The first term is caused by Ds. The second term is dictated by the worst case discussed in Section IV-A. The third term is added to avoid the potential memory collision circled in Fig. 10: the decoded data is read out from the output buffer only after the two decoders have sent their decoded results to the output buffer. This is also the reason why we need space for one more frame, as mentioned in Section IV-A. As an example, the system latency of DTA(5/2, 1) in Fig. 10 is 6Cs+Crw clock cycles.
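The system latency expression above is easy to evaluate for concrete parameters. The short Python sketch below does so for the DTA(5/2, 1) example and for one further set of values; the function name and the cycle counts used in the calls (Cs=100, and Cs=203 with Cl=647, ζ=2 and Crw=8) are chosen here for illustration.

```python
from fractions import Fraction


def system_latency(Cs, beta, zeta, Crw):
    """Worst-case cycles from LLR input to final output: Cs + Cs*(beta*zeta + beta) + Crw."""
    return Cs + Cs * (beta * zeta + beta) + Crw


# D_TA(5/2, 1): the result equals 6*Cs + Crw for any Cs and Crw.
print(system_latency(100, Fraction(5, 2), 1, 8))

# With beta = Cl/Cs from (16), the middle term becomes Cl*(zeta + 1).
print(system_latency(203, Fraction(647, 203), 2, 8))
```

Because β = Cl/Cs, the latency can equivalently be written as Cs + Cl·(ζ+1) + Crw, which shows that each additional buffer costs one more Dl decoding time.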
C. Low-Latency SCL Decoding Scheme
In this section, we introduce a low-latency SCL (LL-SCL)
decoding scheme customized for Ds with Ls ≤2 such that
Ds can support a large speed gain. It combines several state-
of-the-art low-latency decoding schemes for SCL decoding,
including G-node look-ahead scheme [6], multi-bit decoding
(MBD) [19] and special node decoding (SND) [24], [25].
Specifically, LL-SCL can be divided into the following two
parts.
• SC calculations at stages not lower than stage m = log2 Ms (Ms is the number of merged bits for MBD in Ds) are performed by the normal SC algorithm with full parallelism, i.e., N/2 PEs are used and any node in the scheduling tree takes only one clock cycle. Moreover, the G-node look-ahead scheme [6] is used so that each pair of sibling nodes is calculated simultaneously and half of the latency is saved. Thus, the decoding latency of this part is N/Ms − 1 clock cycles.
• SC calculations at the low stages are replaced with MBD, which decodes Ms bits in the same sub-tree rooted at stage m simultaneously. According to [19], given a certain number of frozen bits, there is only one code pattern for the Ms-bit sub-codes. So in total there are only Ms+1 different code patterns. To reduce the decoding latency, the decoding scheme for each code pattern is designed as follows. These Ms+1 code patterns are divided into multiple special nodes as shown in Table III, where T is the number of bits in each special node. The corresponding numbers of information bits and frozen bits are also listed. Rate-0, rate-1, repetition (Rep.) and single parity check (SPC) nodes are decoded according to the schemes presented in [24]–[26]. Rep2/SPC2 nodes can be divided into two Rep./SPC nodes of half the length, which can be decoded concurrently. Thus, each special node can be decoded by SND within one clock cycle. The number of paths is doubled after a special node is decoded. Different from the traditional SCL decoding, all the expanded paths are kept temporarily. List pruning is not executed until the end of each Ms-bit sub-code, which takes another clock cycle. Thus, an Ms-bit sub-code that can be divided into MSN special nodes requires MSN+1 clock cycles to decode, except for Ms-bit rate-0 and rate-1 nodes, which do not need a sorting operation and only take 1 clock cycle.
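The total decoding latency of the LL-SCL decoding scheme is
\begin{align*}
C_s &= C_{\mathrm{MBD}} + C_{\mathrm{SCD}} + C_{\mathrm{rw}} \tag{17} \\
&= \sum_{i=1}^{N/M_s} \bigl(M_{\mathrm{SN}}(F_i) + C_{\mathrm{sort}}(F_i)\bigr) + \Bigl(\frac{N}{M_s} - 1\Bigr) + \frac{N}{2P}, \tag{18}
\end{align*}
where Crw is the LLR loading latency and Fi is the number of frozen bits in the ith Ms-bit sub-code, MSN(Fi) and Csort(Fi) are the decoding clock cycles for SND and sorting of the code pattern corresponding to Fi, respectively. For Ls=2,
\[
C_{\mathrm{sort}}(F_i) =
\begin{cases}
0, & \text{if } F_i = 0 \text{ or } M_s, \\
1, & \text{otherwise.}
\end{cases} \tag{19}
\]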
For Ls=1, the sorting operation for list pruning is unnecessary.
Hence, Csort(Fi)=0 for Ls=1, which indicates the throughput
can be increased by using conventional SC decoding as Ds.
The top level architecture of LL-SCL is shown in Fig. 11. For high stages, the SCL architecture [16] uses a PE array with N/4 PEs for each path. As the calculations at the highest stage n-1, which require N/2 PEs, are the same for both paths, they can be carried out using the two PE arrays of the two paths together. For low stages, max(MSN) stages of SND blocks and a radix-2^(max(MSN)+1) sorter are implemented in a feedforward manner, which are directly mapped from the decoding schemes introduced above. The architecture can be configured for Ls=1 by simply disabling some of the blocks as shown in Fig. 11.
V. EXPERIMENTAL RESULTS
A. Decoding Latency and Error Correction Performance of
TA-SCL Decoding for Practical Polar Codes
In this sub-section, we demonstrate the speed gain and the
error correction performance of the proposed architectures.
TABLE IV
CODE SETTINGS, NUMBERS OF 16-BIT SUB-CODES IN DIFFERENT GROUPS AND LATENCY OF THE TWO COMPONENT DECODERS
Part I: code settings (N, K, r, |Ar|). Part II: number of sub-codes with Fi in group (1) 0, 16; (2) 1, 2, 14, 15; (3) 7, 8, 9; (4) others. Part III: latency of Ds with Ls=2 (CMBD, CSCD, Cs). Part IV: latency of Dl (CLM, CSCD, Cfine, Czero, Cl).
Code   N      K     r    |Ar|   (1)   (2)   (3)   (4)   CMBD   CSCD   Cs    CLM   CSCD   Cfine   Czero   Cl
P1     1024   512   24   360    32    13    2     17    132    63     203   520   296    -128    -41     647
P2     1024   768   24   622    38    12    2     12    116    63     187   501   296    -128    -18     651
P3     256    128   8    78     6     4     2     4     36     15     53    152   66     -32     -18     168
[Fig. 12. Error correction performance of TA-SCL decoders D1~D5 shown in Table V, plotted as BLER versus Eb/N0 for Ds, Dl and DTA, each in floating-point and fixed-point. (a) D1 for P1. (b) D2 for P2. (c) D3 for P3. (d) D4 for P1. (e) D5 for P1.]
TABLE V
DIFFERENT TA-SCL DECODERS FOR P1~P3
DTA   Code   Ls   Cs    Ll   Cl    β      ζ
D1    P1     2    203   32   647   3.18   2
D2    P2     2    187   32   651   3.48   2
D3    P3     2    53    32   168   3.17   2
D4    P1     1    170   32   647   3.80   6
D5    P1     2    203   8    647   3.18   2
We apply TA-SCL decoding on several polar codes with
different code lengths and code rates as shown in Table IV
for illustration. These codes are similar to those chosen for
the 5G eMBB control channel [8]. P1 is the same as the one
used in [22] and similar to those presented in many existing
works [23], [24]. P2 and P3 have different code rate and code
length, respectively, and are used as examples to show the
flexibility of TA-SCL decoding.
Table IV summarizes the code characteristics and the number of cycles required for the corresponding Ds and Dl. |Ar| is the number of reliable bits for the MB-DTS in the corresponding Dl [22]. For Ds, MBD is applied to each 16-bit sub-code, i.e., Ms=16, which means there are 17 code patterns in total. The number of special nodes in each code pattern, and hence the number of cycles required for decoding, is
\[
M_{\mathrm{SN}}(F_i) =
\begin{cases}
1, & \text{if } F_i = 0, 1, 2, 14, 15, 16, \\
2, & \text{if } F_i = 7, 8, 9, \\
3, & \text{if } F_i = 3, 4, 5, 6, 10, 11, 12, 13.
\end{cases} \tag{20}
\]
As max(MSN)=3, the decoding latency of a sub-code varies from one to four clock cycles according to (19) and (20). As shown in part II of Table IV, these code patterns are divided into four different groups according to their decoding latency, as shown in the brackets. The numbers of sub-codes in each group are then shown in the table, with which the decoding latency of Ds can be calculated according to (18) and is presented in part III. For Dl, MB-DTS is applied to sub-codes with a maximum length of Ml=4 and P=64 PEs are used for each path. Its total latency is calculated according to [22] and the results are shown in part IV. With the selected Ms and Ml, the critical path delays of Ds and Dl are similar (both require 4 to 5 stages of adder delay) and the decoding latency can be minimized.
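The latency of Ds in part III of Table IV follows directly from (18)-(20). The Python sketch below evaluates these equations for Ls=2 and Ms=16. Since Table IV only lists how many sub-codes fall into each latency group, the frozen-bit counts used in the example (0, 1, 8 and 4) are hypothetical representatives of groups (1) to (4); the function names are likewise illustrative.

```python
def msn(F):
    """Number of special nodes in a 16-bit sub-code with F frozen bits, eq. (20)."""
    if F in (0, 1, 2, 14, 15, 16):
        return 1
    if F in (7, 8, 9):
        return 2
    return 3


def c_sort(F, Ms=16):
    """Sorting cycles for Ls=2, eq. (19)."""
    return 0 if F in (0, Ms) else 1


def ds_latency(frozen_counts, N, P, Ms=16):
    """Total latency C_s of eq. (18) for Ls=2."""
    assert len(frozen_counts) == N // Ms
    c_mbd = sum(msn(F) + c_sort(F, Ms) for F in frozen_counts)
    c_scd = N // Ms - 1
    c_rw = N // (2 * P)
    return c_mbd + c_scd + c_rw


# Group sizes from part II of Table IV; one representative F per group (assumed).
p1 = [0] * 32 + [1] * 13 + [8] * 2 + [4] * 17
p2 = [0] * 38 + [1] * 12 + [8] * 2 + [4] * 12
p3 = [0] * 6 + [1] * 4 + [8] * 2 + [4] * 4
print(ds_latency(p1, 1024, 64), ds_latency(p2, 1024, 64), ds_latency(p3, 256, 64))
```

Only the group membership of each Fi matters for the result, so any other representative from the same group gives the same latency.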
Based on the parameters shown in Table IV, we can design the TA-SCL decoders for P1~P3 to meet different requirements of error correction performance and speed gain. Some designs are listed in Table V. If we target a TA-SCL decoder that has good decoding performance across a wide SNR range, as shown in Fig. 8, Ls=2 should be used. D1, D2 and D3 shown in Table V are the designs for P1, P2 and P3, respectively, with Ls=2 and Ll=32. To determine the values of β and ζ for each design, we set the maximum performance loss δ at a BLER of 10^-2 to be 30%. β is then calculated according to (16) and ζ is obtained by using the design flow presented in Section III-D. It can be observed from Table V that the speed gain is 3x~3.5x for these decoders. The error correction performance of these decoders on the AWGN channel is simulated using both floating-point and fixed-point numbers and the results are shown in Fig. 12a-12c. The quantization schemes for Ds are Qs,LLR=7 and Qs,PM=8 so that the performance loss due to quantization error is minimized, which is essential to avoid performance loss for the TA-SCL decoding as analysed in Section III. The quantization schemes for Dl are Ql,LLR=6 and Ql,PM=8 to achieve a balance between performance loss and hardware complexity. It can be seen that the fixed-point simulation results of TA-SCL decoding have negligible performance degradation (<0.05dB) compared with the floating-point results of Dl at the BLER of 10^-2.
TABLE VI
SYNTHESIS RESULTS OF THE PROPOSED TA-SCL DECODERS AND COMPARISON WITH STATE-OF-THE-ART SCL DECODERS FOR 1024-BIT POLAR CODES
                             D1       D4       D5       [22] (Ml=8)        [24]†    [23]⋄
K = |A|                      512      512      512      512      512       528      512
List size L                  32       32       8        32       8         8        8
Clock freq. (MHz)            465      465      595      417      556       520      289
Throughput (Mbps)            2346     2801     3002     827      1103      862      732
Total area (mm2)             22.00    21.27    7.67     19.58    4.54      7.64     7.22
Area efficiency (Mbps/mm2)   106.63   131.69   391.40   42.34    242.95    112.83   101.36
Notes: † The synthesis results in [24] are based on TSMC 65nm technology and are scaled to a 90nm technology. ⋄ The synthesis results in [23] are based on TSMC 90nm technology.
TABLE VII
AREA BREAKDOWN OF THE PROPOSED TA-SCL DECODERS
Area (mm2)      D1             D4             D5
Ds              1.97 (9%)      1.07 (5%)      1.97 (25%)
Dl (Ml=4)       18.76 (85%)    18.76 (88%)    4.43 (58%)
LLR buffer      1.14 (5%)      1.24 (6%)      1.14 (15%)
Output buffer   0.13 (1%)      0.20 (1%)      0.13 (2%)
Total area      22.00 (100%)   21.27 (100%)   7.67 (100%)
For some applications that only need to work in the high SNR range but require a high throughput, we can use a DTA with Ls=1. We illustrate the performance of such a decoder with the example of D4 shown in Table V for P1. The decoding latency of Ds with Ls=1 is 32 clock cycles fewer than that of Ds with Ls=2. Consequently, D4 can achieve 0.6x more speed gain compared with D1. The error correction performance of D4 is shown in Fig. 12d. As ϵs and hence Pr(Overflow) are large at a lower SNR, we set the maximum performance loss δ at a lower BLER, i.e., a higher SNR. In this case, δ is set to be 30% at a BLER of 10^-3 and the LLR buffer size ζ is equal to 6 to achieve this performance.
For some applications that can trade error correction performance for lower decoding complexity, Ll can be smaller than 32 [22]. D5 shown in Table V is an example for P1. This design has the same design parameters, β and ζ, as those of D1. The hardware complexity of D5 is much smaller than that of D1 and the results will be shown later. The error correction performance of D5 is shown in Fig. 12e.
The simulation results show that by just using a small number of buffers, e.g. ζ=2, the maximum speed gain β can be achieved for all the designs. To show the actual throughput gain achieved by TA-SCL on hardware, we realize the design of the TA-SCL decoders D1, D4 and D5 for P1 and obtain their throughputs. The results will be presented in the next sub-section and compared with the results of the state-of-the-art polar decoders [22]–[24].
B. Implementation Results of the Proposed Architecture for
TA-SCL Decoding
The proposed architecture is synthesized with a UMC
90nm CMOS process using Synopsys Design Compiler. The
quantization schemes and the number of PEs for Dl are the
same as those presented in [22]–[24] for a fair comparison.
The reported throughputs are in terms of coded bits and the
reported area includes both cell and net area.
Table VI summarizes the synthesis results of the TA-SCL decoders D1, D4 and D5 for the polar code P1. The results of [22]–[24] are also shown for comparison. When Ll=32, the critical path delay of Dl is larger than that of Ds so the clock frequency is determined by Dl. Hence, the clock frequencies of D1 and D4 are the same and lower than that of D5. The decoding throughput of D4 is higher than that of D1 because Ls=1 is used and fewer clock cycles are required for each frame in D4. When Ll=8, the critical path of Dl is shorter so the clock frequency of D5 follows that of Ds. The corresponding throughput is the highest due to the high clock frequency. Note that Ml=4, rather than Ml=8 which is used to maximize the throughputs reported in [22], is used for all the Dl. This is because the critical path delay for Ml=8 is larger than that for Ml=4 and also that of Ds. The clock frequency and hence the throughput of DTA would be lower if Ml=8 were used.
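The throughputs in Table VI are consistent with a decoder that accepts one N-bit frame every Cs clock cycles. The Python sketch below checks this under the assumption, not stated explicitly in this section, that the coded throughput is N · f / Cs, using the clock frequencies of Table VI and the Cs values of Table V; the function name is illustrative.

```python
def coded_throughput_mbps(N, clock_mhz, Cs):
    """Coded throughput in Mbps when one N-bit frame is accepted every Cs cycles."""
    return N * clock_mhz / Cs


designs = {
    "D1": (1024, 465, 203),
    "D4": (1024, 465, 170),
    "D5": (1024, 595, 203),
}
for name, (N, f_mhz, Cs) in designs.items():
    print(name, round(coded_throughput_mbps(N, f_mhz, Cs), 1))
```

The values obtained this way agree with the reported 2346, 2801 and 3002 Mbps to within about 1 Mbps; the small residual is presumably due to rounding of the reported clock frequencies.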
From the area breakdown shown in Table VII, the area of the TA-SCL architecture is dominated by that of Dl. Numerically, Dl contributes 85%, 88% and 58% of the total area for D1, D4 and D5, respectively. When Ll=32, the area overhead of D1 and D4 is 18% and 11% compared with the corresponding single Dl, respectively. The area of D4 with Ls=1 is similar to that of D1 with Ls=2 as the area is dominated by that of Dl. The area of D5 is much smaller than those of D1 and D4 because Dl with Ll=8 has a much smaller area.
Table VI compares the synthesis results with state-of-the-art architectures for polar codes with N=1024 and R ≈ 1/2 [22]–[24]. Compared with the results in [22] in which Ml=8 is used, the throughput gains achieved by D1 and D5 are 2.83x and 2.72x for Ll=32 and 8, respectively, which are slightly lower than the theoretical speed gains due to the clock rate issue discussed above. For applications targeting only the high SNR range, D4 can achieve a throughput gain of up to 3.39x. Comparing D5 with the decoders in [23], [24] with Ll=8, the area is similar while the throughput is nearly 4 times higher. The implementation results show that the proposed TA-SCL architecture can significantly improve the decoding throughput with a small hardware overhead and negligible error correction performance degradation over a wide SNR range.
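The throughput gains and area efficiencies quoted above can be recomputed from the raw entries of Table VI. The following Python sketch does this arithmetic; the dictionary layout and names are illustrative, and the numbers are copied from Table VI.

```python
# (throughput in Mbps, total area in mm2), copied from Table VI
table_vi = {
    "D1": (2346, 22.00),
    "D4": (2801, 21.27),
    "D5": (3002, 7.67),
    "[22], L=32": (827, 19.58),
    "[22], L=8": (1103, 4.54),
}


def throughput_gain(design, baseline):
    """Ratio of the throughputs of two Table VI entries."""
    return table_vi[design][0] / table_vi[baseline][0]


def area_efficiency(design):
    """Throughput per unit area in Mbps/mm2."""
    throughput, area = table_vi[design]
    return throughput / area


print(throughput_gain("D1", "[22], L=32"))  # compared with the quoted 2.83x
print(throughput_gain("D5", "[22], L=8"))   # compared with the quoted 2.72x
print(throughput_gain("D4", "[22], L=32"))  # compared with the quoted 3.39x
print(area_efficiency("D5"))
```

The recomputed ratios match the quoted gains to within 0.01, which suggests the figures in the text were derived from the rounded table entries.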
VI. CONCLUSION
In this work, a two-staged SCL decoding scheme is proposed, which significantly increases the throughput of polar decoding on hardware. To analyse the error correction performance of TA-SCL decoding, a mathematical model based on a Markov chain is proposed. With a proper selection of the design parameters, the performance loss is negligible over a wide SNR range. A high-performance VLSI architecture is then developed for the proposed TA-SCL decoding. Experimental results show that the throughput of TA-SCL decoding implemented by the proposed architecture is about three times as high as that of the state-of-the-art architectures.
APPENDIX A
PROOF OF PROPOSITION 2
We first review Theorem 1 which is necessary to prove
Proposition 2.
Theorem 1. (Bézout's identity) Let a and b be integers with greatest common divisor d. Then, there exist integers x and y such that
ax + by = d. (21)
More generally, the integers of the form ax+by are exactly the multiples of d.
Next, the proof of Proposition 2 is given below.
Proof: We prove the irreducibility of the Markov chain model first. First, we consider the safe states only and prove that any safe state j is accessible from any other safe state i. Let a=βn and b=βd. As βn ⊥ βd, d=1, and (21) can be rewritten as
(βn − βd)x + βd(y + x) = 1, (22)
(βn − βd)x′ + βd y′ = 1. (23)
If x′ is positive and y′ is negative, (23) means that state i + 1/βd is accessible from state i after (|x′|+|y′|)ts, during which |x′| and |y′| frames are decoded correctly and incorrectly by Ds, respectively. State i − 1/βd is also accessible, as (23) is still valid when its right-hand side is -1 according to Theorem 1. Thus, any safe state j can be accessed from state i by repeating this procedure. The accessibility of an idle/hazard state from/to a safe state is obvious. Thus, it is possible to get to any state from any state in this model and the irreducibility is proved.
An irreducible Markov chain is aperiodic if any state is aperiodic. As state 0 is aperiodic, the aperiodicity is proved.
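The integers x′ and y′ in (23) can be found with the extended Euclidean algorithm. The Python sketch below is a standard implementation of that algorithm, applied to the β = 5/2 example (βn=5, βd=2); the function name is illustrative, and the algorithm itself is not part of the proof in the paper.

```python
def extended_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y


beta_n, beta_d = 5, 2
g, x_prime, y_prime = extended_gcd(beta_n - beta_d, beta_d)
print(g, x_prime, y_prime)  # coefficients of (beta_n - beta_d)*x' + beta_d*y' = g
```

For this example the coefficients are x′ = 1 and y′ = −1, i.e., 3·1 + 2·(−1) = 1, which is the sign pattern considered in the proof.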
REFERENCES
[1] E. Arıkan, “Channel polarization: A method for constructing capacity-
achieving codes for symmetric binary-input memoryless channels,” IEEE
Trans. Inf. Theory, vol. 55, no. 7, pp. 3051–3073, June 2009.
[2] I. Tal and A. Vardy, “How to construct polar codes,” IEEE Trans. Inf.
Theory, vol. 59, no. 10, pp. 6562–6582, Oct 2013.
[3] C. Leroux et al., “A semi-parallel successive-cancellation decoder for
polar codes,” IEEE Trans. Signal Process., vol. 61, no. 2, pp. 289–299,
Jan 2013.
[4] Y. Fan and C.-Y. Tsui, “An efficient partial-sum network architecture for
semi-parallel polar codes decoder implementation,” IEEE Trans. Signal
Process., vol. 62, no. 12, pp. 3165–3179, Jun 2014.
[5] G. Sarkis et al., “Fast polar decoders: Algorithm and implementation,”
IEEE J. Sel. Areas Commun., vol. 32, no. 5, pp. 946–957, May 2014.
[6] C. Zhang, B. Yuan, and K. K. Parhi, “Reduced-latency SC polar decoder
architectures,” in IEEE Int. Conf. Commun. (ICC), 2012, pp. 3471–3475.
[7] I. Tal and A. Vardy, “List decoding of polar codes,” IEEE Trans. Inf.
Theory, vol. 61, no. 5, pp. 2213–2226, May 2015.
[8] 3rd Generation Partnership Project, “Draft Report of 3GPP TSG RAN WG1 #87 v0.1.0,”
http://www.3gpp.org/ftp/tsg ran/WG1 RL1/TSGR1 87/Report/Draft Minutes report RAN1#87 v011.zip,
p. 129, 2016, [Online; accessed 11-Jan-2017].
[9] K. Chen, K. Niu, and J. R. Lin, “List successive cancellation decoding
of polar codes,” IET Electron. Lett., vol. 48, no. 9, pp. 500–501, Apr
2012.
[10] B. Li, H. Shen, and D. Tse, “An adaptive successive cancellation list
decoder for polar codes with cyclic redundancy check,” IEEE Commun.
Lett., vol. 16, no. 12, pp. 2044–2047, Dec 2012.
[11] K. Niu and K. Chen, “CRC-aided decoding of polar codes,” IEEE
Commun. Lett., vol. 16, no. 10, pp. 1668–1671, Oct 2012.
[12] R. G. Gallager, Low-Density Parity-Check Codes. Cambridge, MA,
USA: MIT Press, 1963.
[13] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near shannon limit error-
correcting coding and decoding: Turbo-codes. 1,” in IEEE Int. Conf.
Commun. (ICC), May 1993, pp. 1064–1070.
[14] K. Niu, K. Chen, and J. R. Lin, “Beyond turbo codes: Rate-compatible
punctured polar codes,” in Proc. IEEE Int. Conf. Commun.(ICC), 2013,
pp. 3423–3427.
[15] Y. Fan et al., “Low-latency list decoding of polar codes with double
thresholding,” in IEEE Int. Conf. Acoust., Speech, Signal Process.
(ICASSP), 2015, pp. 1042–1046.
[16] A. Balatsoukas-Stimming, M. Bastani Parizi, and A. Burg, “LLR-based
successive cancellation list decoding of polar codes,” IEEE Trans. Signal
Process., vol. 63, no. 19, pp. 5165–5179, Oct 2015.
[17] P. Giard et al., “POLARBEAR: A 28-nm FD-SOI ASIC for decoding
of polar codes,” IEEE Trans. Emerg. Sel. Topics Circuits Syst., vol. PP,
no. 99, pp. 1–14, 2017.
[18] B. Yuan and K. K. Parhi, “Low-latency successive-cancellation list
decoders for polar codes with multibit decision,” IEEE Trans. VLSI Syst.,
vol. 23, no. 10, pp. 2268–2280, Oct 2015.
[19] C. Xiong, J. Lin, and Z. Yan, “A multimode area-efficient SCL polar
decoder,” IEEE Trans. VLSI Syst., vol. 24, no. 12, pp. 3499–3512, Dec
2016.
[20] ——, “Symbol-decision successive cancellation list decoder for polar
codes,” IEEE Trans. Signal Process., vol. 64, no. 3, pp. 675–687, Feb
2016.
[21] Y. Fan et al., “A low-latency list successive-cancellation decoding
implementation for polar codes,” IEEE J. Sel. Areas Commun., vol. 34,
no. 2, pp. 303–317, Feb. 2016.
[22] C. Xia et al., “A high-throughput architecture of list successive can-
cellation polar codes decoder with large list size,” IEEE Trans. Signal
Process., vol. 66, no. 14, pp. 3859 – 3874, Jul 2018.
[23] J. Lin, C. Xiong, and Z. Yan, “A high throughput list decoder architecture
for polar codes,” IEEE Trans. VLSI Syst., vol. 24, no. 6, pp. 2378–2391,
June 2016.
[24] S. A. Hashemi, C. Condo, and W. J. Gross, “Fast and flexible successive-
cancellation list decoders for polar codes,” IEEE Trans. Signal Process.,
vol. 65, no. 21, pp. 5756 – 5769, Nov 2017.
[25] ——, “A fast polar code list decoder architecture based on sphere
decoding,” IEEE Trans. Circuits Syst. I, vol. 63, no. 12, pp. 2368–2380,
Dec 2016.
[26] G. Sarkis et al., “Fast list decoders for polar codes,” IEEE J. Sel. Areas
Commun., vol. 34, no. 2, pp. 318–328, Feb. 2016.
[27] J. Lin and Z. Yan, “An efficient list decoder architecture for polar codes,”
IEEE Trans. VLSI Syst., vol. 23, no. 11, pp. 2508–2518, Nov 2015.
[28] A. Balatsoukas-Stimming, M. Bastani Parizi, and A. Burg, “On metric
sorting for successive cancellation list decoding of polar codes,” in IEEE
Int. Symp. Circ. and Syst. (ISCAS), 2015, pp. 1993–1996.
[29] B. Y. Kong, H. Yoo, and I. C. Park, “Efficient sorting architecture
for successive-cancellation-list decoding of polar codes,” IEEE Trans.
Circuits Syst. II, vol. 63, no. 7, pp. 673–677, July 2016.
[30] V. Bioglio et al., “Two-step metric sorting for parallel successive
cancellation list decoding of polar codes,” IEEE Commun. Lett., vol. 3,
no. 21, pp. 456–459, March 2017.
[31] G. Bosco, G. Montorsi, and S. Benedetto, “Decreasing the complexity
of LDPC iterative decoders,” IEEE Commun. Lett., vol. 9, no. 7, pp.
634–635, July 2005.
[32] M. Rovini and A. Martinez, “On the addition of an input buffer to an
iterative decoder for LDPC codes,” in IEEE Vehicular Techn. Conf. -
Spring (VETECS), 2007, pp. 1995–1998.
[33] S. L. Sweatlock, S. Dolinar, and K. Andrews, “Buffering requirements
for variable-iterations LDPC decoders,” in IEEE Inf. Theory and Appl.
Workshops (ITA), 2008, pp. 1–8.
[34] C. Xia, Y. Fan, and C. Tsui, “A two-staged adaptive successive cancel-
lation list decoding for polar codes,” in IEEE Int. Symp. Circ. and Syst.
(ISCAS), 2019, pp. 1–5.
