A Low-Latency List Successive-Cancellation Decoding Implementation for
  Polar Codes by Fan, YouZhe et al.
arXiv:1806.11301v1 [cs.IT] 29 Jun 2018
ACCEPTED FOR PUBLICATION IN IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS 1
A Low-Latency List Successive-Cancellation
Decoding Implementation for Polar Codes
YouZhe Fan, Member, IEEE, ChenYang Xia, Student Member, IEEE, Ji Chen, Student Member, IEEE,
Chi-ying Tsui, Senior Member, IEEE, Jie Jin, Hui Shen, Member, IEEE, and Bin Li, Member, IEEE
Abstract—Due to their provably capacity-achieving perfor-
mance, polar codes have attracted a lot of research interest
recently. For a good error-correcting performance, list successive-
cancellation decoding (LSCD) with large list size is used to
decode polar codes. However, as the complexity and delay of
the list management operation rapidly increase with the list
size, the overall latency of LSCD becomes large and limits the
applicability of polar codes in high-throughput and latency-
sensitive applications. Therefore, in this work, the low-latency
implementation for LSCD with large list size is studied. Specifi-
cally, at the system level, a selective expansion method is proposed
such that some of the reliable bits are not expanded to reduce
the computation and latency. At the algorithmic level, a double
thresholding scheme is proposed as a fast approximate-sorting
method for the list management operation to reduce the LSCD
latency for large list size. A VLSI architecture of the LSCD
implementing the selective expansion and double thresholding
scheme is then developed, and implemented using a UMC 90
nm CMOS technology. Experimental results show that, even for
a large list size of 16, the proposed LSCD achieves a decoding
throughput of 460 Mbps at a clock frequency of 658 MHz.
Index Terms—Polar codes, successive-cancellation decoding,
list decoding, selective expansion, double thresholding, VLSI
decoder architectures.
I. INTRODUCTION
As the first family of error-correcting codes provably
achieving the channel capacity with explicit construction,
polar codes are a major breakthrough in coding theory [1]. Due
to their low encoding and decoding complexities, polar codes
have drawn a lot of research interest recently [2]-[16].
Successive-cancellation decoding (SCD) was proposed in
[1] for decoding polar codes. It was shown that SCD asymp-
totically achieves the channel capacity when the code length
N is large [1]. Moreover, the computational complexity of the
SCD algorithm is low, on the order of $N \log_2 N$ [1]. Therefore,
the SCD algorithm and its hardware implementation have
been extensively studied recently [17]-[28]. However, for polar
codes with short-to-medium code length, the error-correcting
performance of SCD is unsatisfactory. For example, as shown
in [29], compared with the low-density parity-check (LDPC)
code with similar code length and code rate, the SNR penalty
of SCD for N = 2048 polar codes is greater than 1 dB at a bit
error rate of $10^{-5}$. Hence, to improve the performance of polar
codes with short-to-medium code length, SCDs generating
multiple codeword candidates were proposed. They are list
successive-cancellation decoding (LSCD) [29], [30] and its
variants [31]-[33].
This work has been published in part in the 40th International Conference
on Acoustics, Speech and Signal Processing (ICASSP'2015).
Y.-Z. Fan, C.-Y. Xia, J. Chen, and C.-Y. Tsui are with the Department of
Electronic and Computer Engineering, the Hong Kong University of Science
and Technology, Clear Water Bay, Kowloon, Hong Kong (e-mail: {jasonfan,
cxia, jchenbh}@connect.ust.hk, eetsui@ust.hk).
J. Jin, H. Shen, and B. Li are with the Communications Technology
Research Lab., Huawei Technologies, Shenzhen, P. R. China (e-mail:
{steven.jinjie, henry.shenhui, binli.binli}@huawei.com).
During the decoding of one codeword, LSCD generates L
codeword candidates where L is called the list size. The value
of L determines the trade-off between the error-correcting
performance and the computational complexity. From [29], the
LSCD approaches the maximum likelihood decoding (MLD)
performance of polar codes with a moderate list size. However,
this performance is still not comparable with that of the
advanced error-correcting codes such as Turbo codes and
LDPC codes. To this end, to further improve the error-
correcting performance, cyclic redundancy check (CRC) code
is serially concatenated with the polar codes and the CRC bits
are used to choose the valid codeword from the candidates
of the LSCD [29], [34], [35]. With the help of the CRC
code, the LSCD of polar codes achieves or even exceeds the
error-correcting performance of Turbo codes [36] and LDPC
codes [29]. However, this performance improvement is at the
cost of a larger list size (e.g., L = 16 or 32) and hence
the complexity of the corresponding LSCD becomes high.
The high computational complexity also results in an LSCD
architecture with high decoding latency and low throughput.¹
This limits the applicability of polar codes in high-throughput
and latency-sensitive applications. In this work, a low-latency
LSCD architecture is explored, aiming at promoting polar
codes as a competitive coding candidate in both the error-
correcting and hardware implementation aspects.
LSCD mainly consists of two classes of operations: 1) SCD
operations for generating each of the L codeword candidates,
and 2) list management (LM) operations for maintaining the L
(locally) best codeword candidates in the list. SCD operations
are serial in nature and hence affect the decoding latency.
LM operations involve the finding of the best L out of
2L candidates and maintaining the copy of the candidates.
This requires sorting and copying operations of which the
complexity increases rapidly with L. To achieve a low latency,
existing LSCD architectures apply optimizations at either the
algorithmic or architectural level.
¹ Non-overlapped decoding architecture is assumed in this work; i.e., only
one codeword is decoded at a time in the hardware. Hence, the higher the
decoding latency is, the lower the decoding throughput will be. Moreover,
unless otherwise stated, the latency in this work is given in the number of
clock cycles.
Figure 1. Low-latency LSCD design flowgraph. [Block labels: system level,
selective expansion method, codes properties; algorithmic level, double
thresholding scheme, approximated sorting; architectural level, low-latency
architecture, VLSI design methods.]
As the first work on LSCD, lazy copy was proposed in [29] to reduce the data copying
complexity and hence the latency for the LM operation.
The corresponding gate-level implementation was detailed in
[37]. In [38] and [39], the operand of the SCD operation
was changed from the log-likelihood (LL) value to the log-
likelihood ratio (LLR), resulting in a simplified data path
and improved clock frequency as well as a smaller memory
data storage. To reduce the latency introduced by the SCD
operation, multiple bits of a codeword were decoded at the
same time in [40]-[44]. In [45], the pre-computation look-
ahead technique was used to reduce the SCD latency by
half, at the cost of a larger memory. However, all these
LSCD architectures [37]-[45] were designed for a small list
size (L ≤ 4).² With the increase of the list size, both the
computational complexity and the logic delay of the LM
operation become larger. Therefore, to support LSCD for
L = 8 with a reasonable clock frequency, up to three pipeline
stages were inserted in the LM operation and three cycles
were needed for each LM operation in [46]. This resulted in
a long decoding latency. In [47], the serial sorting operation
in the LM operation was parallelized at the architectural level
[48], and the latency of the resulting LSCD architecture was
reduced for L = 8. However, as shown in [49], even using
a parallel architecture, the logic delay of the LM operation
keeps increasing with the list size, and it deteriorates the clock
frequency of the overall LSCD architecture for a larger list size
(L > 8). Therefore, in this work, we concentrate on reducing
the latency introduced by LM operations, especially for a large
list size L.
This work achieves low-latency LSCD implementation by
performing optimizations at the system, algorithmic, and architectural
levels, as depicted in Fig. 1. At the system level, a
method called selective expansion (SE) is proposed based on
the properties of polar codes. From [1], each source word bit of
the polar code’s codeword corresponds to a synthetic channel,
and different synthetic channels have different reliabilities.
In the SE method, only those bits associated with the less
reliable synthetic channels are decoded with the LSCD, while
the more reliable bits are decoded by the SCD [50]. As a
result, the LM operation (and its associated latency) for the
reliable bits is not needed.
² In [41], an architecture for the LM operation supporting L = 8 was
proposed; however, the overall LSCD architecture was not presented. An
LSCD algorithm for list sizes up to L = 128 was discussed in [43]. However,
it was implemented on a PC platform instead of in VLSI.
To implement the SE method on
the LSCD architecture, an optimization problem is formulated
to determine which bits are decoded by the LSCD, such that
the latency saving is maximized for a given error-correcting
performance requirement of the system. We note that, similar
to SE, a concurrent work [51] was proposed to reduce the
complexity of the LSCD by utilizing the synthetic channel
characteristics. However, the methodology and the goal of
this work and ours are different. At the algorithmic level, an
approximated LM operation called the double thresholding
scheme (DTS) is proposed. Instead of exactly maintaining
the L (locally) best codeword candidates, the DTS keeps the
L almost-the-best codeword candidates in the list such that
the performance degradation introduced is negligible [52].
Compared with the original LM operation, the DTS is parallel
in nature and its logic delay is independent of the list size.
Hence, the latency of the LM operation is not increased,
even for a large list size. Finally, at the architectural level,
an efficient LSCD architecture based on the DTS is proposed.
By optimizing the schedule and logic of the blocks related
to the LM operation, a low-latency LSCD implementation is
achieved, even for L = 16.
The remainder of this paper is organized as follows. The
construction of polar codes and the algorithm of LSCD are
reviewed in Section II. Section III presents the proposed
SE method for reducing the latency of LSCD. The DTS is
detailed in Section IV and Section V presents the LSCD
architecture with a low decoding latency. In Section VI, the
simulation results of the error-correcting performance of the
proposed low-latency LSCD architecture are presented. The
ASIC implementation results of the proposed architecture are
also shown. Finally, Section VII concludes the work.
II. PRELIMINARIES
In this section, the channel polarization phenomenon [1]
discovered by Arıkan is first reviewed, as it is fundamental
to the SE method discussed in Section III. After that,
the construction of polar codes and the algorithm of LSCD
are reviewed.
A. Channel Polarization Phenomenon
Consider a binary-input discrete memoryless channel, denoted
as $W : \mathcal{X} \to \mathcal{Y}$, with an input alphabet
$\mathcal{X} = \{0, 1\}$ and an output alphabet $\mathcal{Y}$.
Channel $W$ is specified by the channel transition probabilities
$W(y|x)$ with $x \in \mathcal{X}$ and $y \in \mathcal{Y}$.
Let $W_N : \mathcal{X}^N \to \mathcal{Y}^N$ denote $N$ independent
copies of channel $W$, where $N = 2^n$ and $n \in \mathbb{N}$.
Channel $W_N$ is described by the channel transition probabilities
$$W_N(\mathbf{y}_N | \mathbf{x}_N) = \prod_{i=0}^{N-1} W(y_i | x_i), \tag{1}$$
where $\mathbf{x}_N \in \mathcal{X}^N$ and $\mathbf{y}_N \in \mathcal{Y}^N$
are the input and the output of $W_N$, respectively.
Let $\mathbf{u}_N \in \mathcal{X}^N$ be a binary vector one-to-one
mapped to $\mathbf{x}_N$ by the following relation:
$$\mathbf{x}_N^T = \mathbf{u}_N^T \mathbf{F}^{\otimes n}, \tag{2}$$
where $\mathbf{x}^T$ is the transpose of $\mathbf{x}$, and
$\mathbf{F}^{\otimes n}$ is the $n$th Kronecker power of the kernel
matrix $\mathbf{F}$. Since
$\mathbf{F} \triangleq \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$,
$\mathbf{F}^{\otimes n}$ is invertible.
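As an illustration of (2), the following minimal Python sketch builds $\mathbf{F}^{\otimes n}$ explicitly and multiplies over GF(2), and also shows an equivalent in-place butterfly form. The function names `kron_power`, `polar_encode` and `polar_encode_butterfly` are chosen here for illustration only and do not come from the paper.

```python
def kron_power(n):
    """Return the n-th Kronecker power of F = [[1, 0], [1, 1]] as a list of rows."""
    F = [[1, 0], [1, 1]]
    G = [[1]]
    for _ in range(n):
        # Kronecker product G (x) F: each entry of G scales a full copy of F.
        G = [[a * b for a in row_g for b in row_f] for row_g in G for row_f in F]
    return G


def polar_encode(u):
    """Compute x^T = u^T F^{(x)n} over GF(2), as in (2)."""
    N = len(u)
    n = N.bit_length() - 1
    assert 1 << n == N, "code length must be a power of two"
    G = kron_power(n)
    return [sum(u[i] * G[i][j] for i in range(N)) % 2 for j in range(N)]


def polar_encode_butterfly(u):
    """Same mapping as polar_encode, using n stages of XOR butterflies."""
    x = list(u)
    step = 1
    while step < len(x):
        for start in range(0, len(x), 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]
        step *= 2
    return x
```

Because $\mathbf{F}^2 = \mathbf{I}$ over GF(2), applying the mapping twice returns the original vector, which is one way to see that $\mathbf{F}^{\otimes n}$ is invertible.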
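Based on (2), $N$ synthetic channels are obtained from $W_N$. They are
denoted as $W_N^i : \mathcal{X} \to \mathcal{X}^i \times \mathcal{Y}^N$,
where $i \in \{0, 1, \ldots, N-1\}$. The transition probabilities of
channel $W_N^i$ are given by
$$W_N^i\left(\mathbf{y}_N, \mathbf{u}_0^{i-1} \,|\, u_i\right) = \sum_{\mathbf{u}_{i+1}^{N-1} \in \mathcal{X}^{N-i-1}} \frac{1}{2^{N-1}} W_N(\mathbf{y}_N | \mathbf{x}_N), \tag{3}$$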
where $\mathbf{x}_N$ and $\mathbf{u}_N$ are related by (2), and
$\mathbf{x}_a^b$ denotes the sub-vector of $\mathbf{x}$ with starting
index $a$ and ending index $b$. From (3), the input of a synthetic
channel $W_N^i$ is a binary bit $u_i \in \mathcal{X}$, and its output
includes the $W_N$ output $\mathbf{y}_N$ and the side information of the
$i$ preceding bits $\mathbf{u}_0^{i-1}$. To evaluate the performance of
the synthetic channels, a probability of error $P_e(i)$ is associated
with each channel $W_N^i$. Under maximum likelihood decoding (MLD),
$P_e(i)$ is given as
$$P_e(i) = \sum_{\mathbf{u}_0^{i-1},\, \mathbf{y}_N} \frac{\min\left\{ W_N^i\left(\mathbf{y}_N, \mathbf{u}_0^{i-1} \,|\, 0\right),\; W_N^i\left(\mathbf{y}_N, \mathbf{u}_0^{i-1} \,|\, 1\right) \right\}}{2}, \tag{4}$$
where $\mathbf{u}_0^{i-1} \in \mathcal{X}^i$ and
$\mathbf{y}_N \in \mathcal{Y}^N$, and $u_i$ takes each value in
$\mathcal{X}$ with equal probability. For any given $N$, the values of
the $P_e(i)$s can be found efficiently by density evolution
techniques, as presented in [9]-[13].
Arıkan's Channel Polarization Theorem studies the behavior of the
synthetic channels $W_N^i$ [1]. One key observation of the theorem is
that when $N \to \infty$, the performance of the synthetic channels is
polarized; i.e., except for a vanishing fraction of the $W_N^i$s, the
rest of the $W_N^i$s are either almost noise-free ($P_e(i) \to 0$) or
almost useless ($P_e(i) \to 0.5$). For a finite value of $N$, the
$P_e(i)$s of the synthetic channels approach either 0 or 0.5, and the
$P_e(i)$s are different for different $W_N^i$s [9]-[13].
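The polarization effect can be observed numerically in a special case. The sketch below assumes a binary erasure channel (BEC) with erasure probability `eps`, which is an assumption made here for illustration and is not the channel model of the paper, which uses density evolution for general channels. For a BEC, every synthetic channel is again a BEC, its erasure probability follows the closed-form recursion $\epsilon \mapsto 2\epsilon - \epsilon^2$ and $\epsilon \mapsto \epsilon^2$, and (4) evaluates to half the erasure probability. The function name `bec_error_probs` is hypothetical.

```python
def bec_error_probs(n, eps):
    """Return P_e(i), i = 0..2^n - 1, for the synthetic channels of a BEC(eps).

    Assumption: the underlying channel W is a BEC, so each synthetic channel
    is a BEC with erasure probability z_i and P_e(i) = z_i / 2 under (4).
    """
    z = [eps]
    for _ in range(n):
        nxt = []
        for e in z:
            nxt.append(2 * e - e * e)  # degraded ("minus") channel, index 2i
            nxt.append(e * e)          # upgraded ("plus") channel, index 2i + 1
        z = nxt
    return [e / 2 for e in z]


def polarized_fraction(n, eps, delta=0.01):
    """Fraction of channels with P_e(i) within delta of 0 or of 0.5."""
    pe = bec_error_probs(n, eps)
    near = sum(1 for p in pe if p < delta or p > 0.5 - delta)
    return near / len(pe)
```

Calling `polarized_fraction` for increasing `n` gives a simple way to watch the $P_e(i)$s drift towards 0 or 0.5 as the code length grows.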
B. Construction of Polar Codes
Based on the channel polarization phenomenon, the con-
struction of polar codes is simple. In a polar coding scheme,
(2) represents the encoding operation of a length-$N$ polar code.
Vectors $\mathbf{u}_N$ and $\mathbf{x}_N$ are the source word and
codeword, respectively. A rate $R = K/N$ polar code is specified by the
frozen set $\mathcal{A}^c \subset \{0, 1, \ldots, N-1\}$ of cardinality
$|\mathcal{A}^c| = N - K$ and the information set $\mathcal{A}$ defined
as $\mathcal{A} = \{0, 1, \ldots, N-1\} \setminus \mathcal{A}^c$.
The $K$ source word bits $u_i$ ($i \in \mathcal{A}$) carry the
information bits, and the remaining $N - K$ bits $u_i$
($i \in \mathcal{A}^c$) are the frozen bits. Since the frozen bits are
set to a value, e.g. 0, known to both the encoder and the decoder, the
block-error probability $P_b$ of polar codes is bounded by [1]
$$P_b \le \sum_{i \in \mathcal{A}} P_e(i). \tag{5}$$
From (5), choosing the $K$ indices with the smallest $P_e(i)$s as
$\mathcal{A}$ minimizes this upper bound on the block-error probability
$P_b$. From the discussion in Section II-A, if $K$ is not greater than
the number of the almost noise-free synthetic channels, reliable
communication is achieved by the polar codes.
Figure 2. Decoding tree for an N = 4 polar code. [Depth levels 0 to 4;
the root at depth 0 is the null state, and depths 1 to 4 correspond to
source word bits $u_0$ to $u_3$, with each left child labeled 0 and each
right child labeled 1; in the example, $u_0 \in \mathcal{A}^c$ and
$u_1, u_2, u_3 \in \mathcal{A}$.]
If an $r$-bit CRC code is used with polar codes, to maintain a
fixed code rate $R$, the information set $\mathcal{A}$ is extended such
that $|\mathcal{A}| = NR + r$ by switching the $r$ most reliable frozen
bits to information bits. These extended bits carry the CRC code
bits of the original $NR$ information bits. In the LSCD, only
the codeword candidate passing the CRC check is output as
the decoding result.
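The construction described in this subsection can be sketched in a few lines of Python. The sketch assumes that the error probabilities $P_e(i)$ are already available as a list `pe` (for example from density evolution); the names `build_sets` and `union_bound` are illustrative and not taken from the paper.

```python
def build_sets(pe, K, r=0):
    """Pick the information set A (size K + r) and the frozen set Ac.

    pe : list of P_e(i) for i = 0..N-1 (assumed given)
    K  : number of information bits (N * R)
    r  : number of CRC bits; r of the most reliable otherwise-frozen
         indices are moved into A so that the code rate stays K / N
    """
    order = sorted(range(len(pe)), key=lambda i: pe[i])  # most reliable first
    A = sorted(order[:K + r])
    Ac = sorted(order[K + r:])
    return A, Ac


def union_bound(pe, A):
    """Right-hand side of (5): sum of P_e(i) over the information set."""
    return sum(pe[i] for i in A)
```

For instance, with `pe = [0.46875, 0.28125, 0.21875, 0.03125]` and `K = 2`, the sketch selects indices 2 and 3 as information bits; adding `r = 1` CRC bit also moves index 1 into the information set.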
C. List Successive-Cancellation Decoding
The decoding process of polar codes can be treated as a
search problem in the decoding tree. As an example, Fig. 2
shows the decoding tree for an N = 4 polar code. In general,
the decoding tree of a length-$N$ polar code is a depth-$N$
binary tree, with $u_i$ mapped to the nodes at depth $i + 1$. As
shown in Fig. 2, its root node represents a null state, and
the left and right children at depth $i + 1$ represent $u_i = 0$
and $u_i = 1$, respectively. Therefore, a path from the root
node to a depth-$i$ node represents a sub-vector
$\mathbf{u}_0^{i-1} \in \mathcal{X}^i$, and it is called a decoding
path. Specifically, a complete decoding path is a path from the root
node to a leaf node, and it represents a vector
$\mathbf{u}_N \in \mathcal{X}^N$. The value of each bit of
$\mathbf{u}_N$ is shown in the corresponding node lying on this decoding
path. If $u_i$ is a frozen bit, it only assumes a preset value,
e.g. 0. Consequently, the right-hand sub-tree rooted at the
depth-$(i+1)$ node is pruned, as the $\mathbf{u}_N$s included in this
sub-tree are not valid source words. For example, if
$\mathcal{A}^c = \{0\}$, the gray sub-tree in Fig. 2 is pruned. As a
result, each complete decoding path in the pruned decoding tree
corresponds one-to-one to a valid source word of the polar code, and the
set of valid source words is denoted as
$\mathcal{U} = \{\mathbf{u}_N \,|\, u_i = 0,\ i \in \mathcal{A}^c\}$.
In the subsequent discussion, let $\mathbf{u}_N \in \mathcal{U}$ be the
transmitted source word, and the task of the decoder is to find a
complete decoding path $\hat{\mathbf{u}}_N \in \mathcal{U}$ to decode
$\mathbf{u}_N$.
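The MLD of polar codes exhaustively searches all the
complete decoding paths in the decoding tree and generates
the likelihood $\Pr(\mathbf{y}_N | \hat{\mathbf{u}}_N)$ for each
complete decoding path $\hat{\mathbf{u}}_N \in \mathcal{U}$, where
$$\Pr(\mathbf{y}_N | \hat{\mathbf{u}}_N) = W_N\left(\mathbf{y}_N \,|\, \hat{\mathbf{u}}_N^T \mathbf{F}^{\otimes n}\right). \tag{6}$$
The decoding path $\hat{\mathbf{u}}_N^{\mathrm{MLD}}$ with the maximum
$\Pr(\mathbf{y}_N | \hat{\mathbf{u}}_N)$ is output as the decoding result.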
Figure 3. List management (LM) operation. [The PMU takes the $L$ path
metrics $\gamma_0^i, \ldots, \gamma_{L-1}^i$ and the $L$ output LLRs
$\Lambda_0^i, \ldots, \Lambda_{L-1}^i$ and produces $2L$ path metrics
$\gamma_0^{i+1}, \ldots, \gamma_{2L-1}^{i+1}$; the LPO keeps $L$ of them
as $\gamma_0^{i+1}, \ldots, \gamma_{L-1}^{i+1}$.]
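To ease the implementation, the likelihood
$\Pr(\mathbf{y}_N | \hat{\mathbf{u}}_N)$ is represented by the path
metric $\gamma_N(\hat{\mathbf{u}}_N)$, which is given by [38]:
$$\gamma_N(\hat{\mathbf{u}}_N) = -\log\left[\Pr(\mathbf{y}_N | \hat{\mathbf{u}}_N)\right] - \log\frac{\Pr(\hat{\mathbf{u}}_N)}{\Pr(\mathbf{y}_N)}. \tag{7}$$
For a given channel observation $\mathbf{y}_N$, the second term in (7)
is the same for all the source words $\hat{\mathbf{u}}_N$. Therefore,
the MLD of polar codes is described by
$$\hat{\mathbf{u}}_N^{\mathrm{MLD}} = \arg\min_{\hat{\mathbf{u}}_N \in \mathcal{U}} \gamma_N(\hat{\mathbf{u}}_N). \tag{8}$$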
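Recently, [38] and [39] showed that the path metric
$\gamma_N(\hat{\mathbf{u}}_N)$ can be expressed as
$$\gamma_N(\hat{\mathbf{u}}_N) = \sum_{i=0}^{N-1} \log\left\{1 + \exp\left[(2\hat{u}_i - 1) \cdot \Lambda^i\left(\hat{\mathbf{u}}_0^{i-1}\right)\right]\right\}, \tag{9}$$
where $\hat{u}_i$ is the $i$th bit of the decoding path
$\hat{\mathbf{u}}_N$. $\Lambda^i(\hat{\mathbf{u}}_0^{i-1})$ denotes the
output LLR of the synthetic channel $W_N^i$, which is given as
$$\Lambda^i\left(\hat{\mathbf{u}}_0^{i-1}\right) = \log\frac{W_N^i\left(\mathbf{y}_N, \hat{\mathbf{u}}_0^{i-1} \,|\, u_i = 0\right)}{W_N^i\left(\mathbf{y}_N, \hat{\mathbf{u}}_0^{i-1} \,|\, u_i = 1\right)}. \tag{10}$$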
From (10), the value of $\Lambda^i(\hat{\mathbf{u}}_0^{i-1})$ depends on
the previous decoding path $\hat{\mathbf{u}}_0^{i-1}$, and therefore each
decoding path $\hat{\mathbf{u}}_0^{i-1}$ corresponds to a different
output LLR $\Lambda^i(\hat{\mathbf{u}}_0^{i-1})$. Using the
alternative form of the path metric expressed in (9) enables the
use of LLR-based SCD in LSCD, which leads to a lower logic
delay and memory requirement than its LL-based counterpart
[38], [39].
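Similarly, a path metric $\gamma^i(\hat{\mathbf{u}}_0^{i-1})$ is
associated with the decoding path $\hat{\mathbf{u}}_0^{i-1}$, and is
given as
$$\gamma^i\left(\hat{\mathbf{u}}_0^{i-1}\right) = \sum_{j=0}^{i-1} \log\left\{1 + \exp\left[(2\hat{u}_j - 1) \cdot \Lambda^j\left(\hat{\mathbf{u}}_0^{j-1}\right)\right]\right\}, \tag{11}$$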
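where $\gamma^0 = 0$. Considering all the decoding paths
$\hat{\mathbf{u}}_0^{i-1}$ at a certain depth of the decoding tree, the
path metric $\gamma^i(\hat{\mathbf{u}}_0^{i-1})$ and the output LLR
$\Lambda^i(\hat{\mathbf{u}}_0^{i-1})$ of each path are available. When
the decoding path is extended to the next depth, the path metric
of $\hat{\mathbf{u}}_0^i$ is updated as
$$\gamma^{i+1}\left(\hat{\mathbf{u}}_0^i\right) = \gamma^i\left(\hat{\mathbf{u}}_0^{i-1}\right) + \log\left\{1 + \exp\left[(2\hat{u}_i - 1) \cdot \Lambda^i\left(\hat{\mathbf{u}}_0^{i-1}\right)\right]\right\}, \tag{12}$$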
where the decoding path $\hat{\mathbf{u}}_0^i$ is extended from
$\hat{\mathbf{u}}_0^{i-1}$. Here, $\hat{u}_i$ can be either 0 or 1 if
$i \in \mathcal{A}$. Otherwise, $\hat{u}_i = 0$. The
operation in (12) is called the Path Metric Update (PMU) in
this work. With the PMU, the path metrics of all the paths
$\hat{\mathbf{u}}_N$ are generated and (8) can be executed accordingly.
Figure 4. Scheduling tree for an N = 4 polar code. [The root supplies the
channel LLRs $L_i^n$; f nodes are left-hand children and g nodes are
right-hand children at stage 1 and stage 0; the four leaves output
$\Lambda^0, \Lambda^1, \Lambda^2, \Lambda^3$.]
Therefore, the MLD can be regarded as a breadth-first search in the
decoding tree.
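The PMU of (12) is a one-line recursion. The following Python sketch evaluates it with a numerically stable form of $\log(1 + e^x)$; the function name `pmu_exact` and the stable rewrite are illustrative choices made here, not part of the paper.

```python
import math


def pmu_exact(gamma, llr, u_hat):
    """Path metric update of (12).

    gamma : path metric gamma^i of the parent path
    llr   : output LLR Lambda^i of the parent path
    u_hat : bit value (0 or 1) chosen for the extension
    """
    x = (2 * u_hat - 1) * llr
    # log(1 + exp(x)) rewritten as max(x, 0) + log1p(exp(-|x|)) to avoid overflow
    return gamma + max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the chosen bit agrees with the sign of the LLR, the penalty is close to zero; when it disagrees, the penalty is close to the LLR magnitude.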
Since there are $2^K$ complete decoding paths in the pruned
decoding tree, the MLD complexity is as large as $O(2^K)$. To
achieve a reasonable decoding complexity, LSCD is proposed
to obtain a decoding performance close to that of MLD
with a much smaller complexity. For an LSCD with a list
size of $L$, at most $L$ decoding paths are maintained at each
depth of the decoding tree. Therefore, after decoding $\log_2 L$
information bits $u_i$,³ the decoding list has $L$ decoding paths.
In the subsequent decoding, if $u_i$ is a frozen bit, $L$ decoding
paths $\hat{\mathbf{u}}_0^i$ are extended from the $L$ paths
$\hat{\mathbf{u}}_0^{i-1}$. On the other hand, if $u_i$ is
an information bit, $2L$ decoding paths $\hat{\mathbf{u}}_0^i$ are
extended from the $L$ paths $\hat{\mathbf{u}}_0^{i-1}$. As a result, to
maintain the list size, a List Pruning Operation (LPO) has to be
executed. Out of the $2L$ decoding paths, the LPO keeps the $L$ paths
with the minimum path metrics and drops the rest. For simplicity, the
decoding path $\hat{\mathbf{u}}_0^{i-1}$ in the path metric notation
$\gamma^i(\hat{\mathbf{u}}_0^{i-1})$ and output LLR notation
$\Lambda^i(\hat{\mathbf{u}}_0^{i-1})$ is dropped in the subsequent
discussion, and the $L$ path metrics (and output LLRs) are indexed by the
subscript $l = 0, 1, \ldots, L - 1$. In this work, as depicted in
Fig. 3, the LPO together with the PMU is denoted as the LM
operation.
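A minimal software sketch of one LM operation (PMU followed by LPO) is given below. It uses the exact metric of (12) and a plain sort for the pruning step, which is the conventional baseline rather than the approximate scheme proposed later in the paper. The function names and the list-of-lists path representation are assumptions made for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def lm_operation(paths, metrics, llrs, list_size, is_info):
    """Extend each surviving path by one bit and keep at most list_size paths.

    paths   : list of decoded bit lists, one per surviving path
    metrics : path metrics gamma^i_l, one per path
    llrs    : output LLRs Lambda^i_l, one per path
    is_info : True for an information bit (two extensions per path),
              False for a frozen bit (single extension with bit 0)
    """
    candidates = []
    for path, gamma, llr in zip(paths, metrics, llrs):
        for bit in ((0, 1) if is_info else (0,)):
            new_gamma = gamma + softplus((2 * bit - 1) * llr)  # PMU, eq. (12)
            candidates.append((new_gamma, path + [bit]))
    candidates.sort(key=lambda c: c[0])   # LPO: smallest path metrics first
    survivors = candidates[:list_size]
    return [p for _, p in survivors], [g for g, _ in survivors]
```

The sort over up to $2L$ candidates is the step whose cost grows with the list size, which is why the LM operation is the focus of the latency optimizations in this work.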
In the PMU operation specified by (12), the output LLR
$\Lambda^i$ of each decoding path $\hat{\mathbf{u}}_0^{i-1}$ is required
and it is generated by the SCD. The SCD operation for a length-$N$ polar
code can be represented by a depth-$n$ balanced binary tree, called
the scheduling tree [25]. Fig. 4 shows an example of the
scheduling tree for an N = 4 polar code. Its root node
provides the input LLRs $L_i^n$ from the channel observation
$\mathbf{y}_N$ as follows:
$$L_i^n = \log W(y_i | x_i = 0) - \log W(y_i | x_i = 1), \tag{13}$$
where $i = 0, 1, \ldots, N - 1$. The non-root nodes in the
scheduling tree are categorized into two types: the f node
as the left-hand child and the g node as the right-hand child.
The f node at stage $t$ executes the following f function,
$$L_j^t = 2 \tanh^{-1}\left[\tanh\left(L_j^{t+1}/2\right) \tanh\left(L_{j+2^t}^{t+1}/2\right)\right], \tag{14}$$
and the g node executes the following g function,
$$L_j^t = L_{j+2^t}^{t+1} + (-1)^{s_j} L_j^{t+1}, \tag{15}$$
where $j = 0, 1, \ldots, 2^t - 1$ and the $L_j^t$s are the output LLRs at
stage $t$.
³ To ease the discussion, $L$ is assumed to be an integer power of 2. The
methodology of this work does not have a constraint on the value of $L$.
Algorithm 1: Procedure of LSCD
 1  $L_{cur} \leftarrow 1$;   // initialize the actual size $L_{cur}$ of the candidate list
 2  for $i = 0, 1, \ldots, N - 1$ do
 3    for $l = 0, 1, \ldots, L_{cur} - 1$ do   // given the $l$th decoding path $\hat{\mathbf{u}}_0^{i-1}$
 4      update $\Lambda_l^i$ with (14)-(16);   // SCD operation
 5      if $i \in \mathcal{A}^c$ then
 6        extend $\hat{\mathbf{u}}_0^{i-1}$ to $\hat{\mathbf{u}}_0^i$ with $\hat{u}_i \leftarrow 0$;
 7        update $\gamma_l^{i+1}$ from $\gamma_l^i$ and $\Lambda_l^i$ with $\hat{u}_i \leftarrow 0$;   // PMU in (12)
 8      else
 9        extend $\hat{\mathbf{u}}_0^{i-1}$ to two paths $\hat{\mathbf{u}}_0^i$ with $\hat{u}_i \leftarrow 0/1$;
10        update the $\gamma_l^{i+1}$s from $\gamma_l^i$ and $\Lambda_l^i$ with $\hat{u}_i \in \{0, 1\}$;   // PMU in (12)
11        $L_{cur} \leftarrow L_{cur} + 1$;
12    if $L_{cur} > L$ then   // LPO
13      find the $L$ smallest $\gamma_l^{i+1}$s and the corresponding $\hat{\mathbf{u}}_0^i$s;
14      $L_{cur} \leftarrow L$;
15  return the $\hat{\mathbf{u}}_N$ that passes the CRC check;
From (14) and (15), each function of the SCD has
two LLRs as inputs and one LLR as output. A node at stage $t$
of the scheduling tree includes $2^t$ functions, and they can be
executed in parallel. As a result, $2^t$ $L_j^t$s are output by a node
at stage $t$, and they are the inputs of its two children in the
next stage.
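The f and g functions of (14) and (15), and the way a stage-$t$ node applies $2^t$ of them in parallel, can be sketched as follows. The names `f_exact`, `g_func` and `node_update` are illustrative. The direct `atanh`/`tanh` form is used only to mirror (14); it is an assumption here that input LLR magnitudes are moderate, since the product of hyperbolic tangents rounds to 1 for very large inputs.

```python
import math


def f_exact(a, b):
    """f function of (14) for two input LLRs a = L_j^{t+1}, b = L_{j+2^t}^{t+1}."""
    return 2.0 * math.atanh(math.tanh(a / 2.0) * math.tanh(b / 2.0))


def g_func(a, b, s):
    """g function of (15): b + (-1)^s * a, where s is the partial-sum bit s_j."""
    return b + (1 - 2 * s) * a


def node_update(llrs_in, t, kind, partial_sums=None):
    """Compute the 2^t output LLRs of a stage-t node from its 2^(t+1) input LLRs.

    kind         : "f" for a left-hand child, "g" for a right-hand child
    partial_sums : list of 2^t bits s_j, required only for a g node
    """
    half = 1 << t
    assert len(llrs_in) == 2 * half
    if kind == "f":
        return [f_exact(llrs_in[j], llrs_in[j + half]) for j in range(half)]
    return [g_func(llrs_in[j], llrs_in[j + half], partial_sums[j])
            for j in range(half)]
```

Each list comprehension in `node_update` corresponds to the $2^t$ independent functions that hardware can evaluate in parallel within one node.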
The variable $s_j$ in (15) is known as the partial-sum in [20]
and [25]. The partial-sum $\mathbf{s} = [s_0, s_1, \ldots, s_{2^t-1}]$
is calculated from the previous decoding path
$\tilde{\mathbf{u}} = \hat{\mathbf{u}}_{i-2^t}^{i-1}$ by
$$\mathbf{s}^T = \tilde{\mathbf{u}}^T \mathbf{F}^{\otimes t}. \tag{16}$$
Due to the data dependency introduced by the partial-sum, the
decoding schedule of the SCD follows the depth-first traversal
of the scheduling tree. As shown in Fig. 4, the $i$th leaf node of
the scheduling tree outputs the LLR of the synthetic channel
$W_N^i$ as $\Lambda^i = L_0^0$, and hence the $\Lambda^i$s are serially
generated. Based on $\Lambda^i$, if $i \in \mathcal{A}$, the MLD of
$u_i$ is given by
$$\Theta\left(\Lambda^i\right) = \begin{cases} 0 & \text{if } \Lambda^i \ge 0, \\ 1 & \text{else,} \end{cases} \tag{17}$$
where $\Theta(\Lambda^i)$ is the hard-decision function based on the
value of $\Lambda^i$. The probability of error for
$\hat{u}_i = \Theta(\Lambda^i(\mathbf{u}_0^{i-1}))$, i.e.,
$\Pr(\hat{u}_i \ne u_i)$, is given by $P_e(i)$ in (4). If
$i \in \mathcal{A}^c$, $u_i$ is decoded as 0.
Algorithm 1 summarizes the procedure of an LSCD with list
size $L$. Line 4 indicates that the LSCD consists of $L$ SCDs.
They are executed in parallel until a leaf node of the scheduling
tree is reached. With the $L$ output LLRs $\Lambda_l^i$, the decoding
paths $\hat{\mathbf{u}}_0^{i-1}$ are extended to the next depth of the
decoding tree and the path metrics are updated by the PMU. If the number
of extended paths is greater than $L$, the LPO is executed. Note
that the SCD operation has to be stalled until the LPO is finished
because the subsequent SCD operation needs the knowledge
of the previous path $\hat{\mathbf{u}}_0^i$, as discussed in (16). As a
result, the decoding schedule of the LSCD can also be represented by
the depth-first traversal of the scheduling tree, except that the
LM operation (Lines 5-14 in Algorithm 1) has to be executed
at each leaf node of the scheduling tree. Hence, the decoding
latency of the LSCD depends on the latency of both the SCD
and LM operations.
Finally, it is noted that the PMU in (12) and the f function
in (14) are non-linear functions. To simplify the hardware
implementation, the PMU is approximated as follows [38],
[39], [47]:
$$\begin{cases} \gamma_{2l}^{i+1} = \gamma_l^i & \text{if } \hat{u}_i = \Theta\left(\Lambda_l^i\right), \\ \gamma_{2l+1}^{i+1} = \gamma_l^i + \left|\Lambda_l^i\right| & \text{if } \hat{u}_i = \overline{\Theta}\left(\Lambda_l^i\right), \end{cases} \tag{18}$$
where $l = 0, 1, \ldots, L - 1$, and $\gamma_{2l}^{i+1}$ and
$\gamma_{2l+1}^{i+1}$ denote the path metrics of the two path extensions
from the $l$th decoding path $\hat{\mathbf{u}}_0^{i-1}$, respectively.
Here, $\overline{x}$ is the complement of the binary
variable $x$. Similarly, the f function is usually approximated
as [20]-[28]
$$L_j^t = \mathrm{sgn}\left(L_j^{t+1}\right) \oplus \mathrm{sgn}\left(L_{j+2^t}^{t+1}\right) \, \min\left(\left|L_j^{t+1}\right|, \left|L_{j+2^t}^{t+1}\right|\right), \tag{19}$$
where $\mathrm{sgn}(\cdot)$ and $|\cdot|$ represent the sign bit and the
magnitude of a variable, respectively. As hardware implementation is
discussed in this work, (18) and (19) will be used for the
corresponding calculations unless otherwise stated.
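The two hardware-friendly approximations (18) and (19) reduce to sign and magnitude manipulations, as the following Python sketch shows. The function names are illustrative; signed floating-point values stand in for the sign-magnitude LLR representation that a hardware data path would use.

```python
def hard_decision(llr):
    """Theta of (17): 0 if the LLR is non-negative, 1 otherwise."""
    return 0 if llr >= 0 else 1


def pmu_approx(gamma, llr):
    """Approximate PMU of (18) for one parent path.

    Returns ((bit, metric), (bit, metric)) for the two extensions:
    the extension that follows the hard decision keeps the parent metric,
    the complementary extension is penalized by |llr|.
    """
    best = hard_decision(llr)
    return (best, gamma), (1 - best, gamma + abs(llr))


def f_minsum(a, b):
    """Approximate f function of (19): XOR of the signs, minimum of the magnitudes."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))
```

For example, a parent path with metric 1.5 and LLR -2.0 yields the extensions (bit 1, metric 1.5) and (bit 0, metric 3.5).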
III. SELECTIVE EXPANSION
A. Selective Expansion Scheme
From the discussion in Section II-C, additional latency is
introduced by an LM operation, when L decoding paths are
expanded into 2L paths for an information bit ui (i ∈ A) in the
LSCD. In this section, we present a selective expansion (SE)
scheme where the path expansion for some of the information
bits is not executed; i.e., L decoding paths are only extended
into L paths for those bits. As a result, the list pruning
operation (LPO) is not needed and the associated latency will
not be added to the overall latency.
When an information bit $u_i$ ($i \in \mathcal{A}$) is decoded, there
are $L$ surviving decoding paths $\hat{\mathbf{u}}_0^{i-1}$ available
from the decoding of the previous $i$ bits. Assuming that the LSCD will
ultimately decode the source word correctly, there exists one
path $\mathbf{u}_0^{i-1}$ among the $L$ surviving decoding paths
$\hat{\mathbf{u}}_0^{i-1}$ that will lead to the correct decoding of the
source word $\mathbf{u}_N$. Consider the path extensions from
$\mathbf{u}_0^{i-1}$. From (10), the output LLR of $\mathbf{u}_0^{i-1}$
is $\Lambda^i(\mathbf{u}_0^{i-1}) = \log W_N^i(\mathbf{y}_N, \mathbf{u}_0^{i-1} | u_i = 0) - \log W_N^i(\mathbf{y}_N, \mathbf{u}_0^{i-1} | u_i = 1)$.
From the discussion of (17), $\hat{u}_i$ assumes either
$\Theta(\Lambda^i)$ or $\overline{\Theta}(\Lambda^i)$, and the
probability of error for $\hat{u}_i = \Theta(\Lambda^i)$ is $P_e(i)$.
Therefore, if the decoding path $\mathbf{u}_0^{i-1}$ is only extended
into a single path taking $\hat{u}_i$ as $\Theta(\Lambda^i)$, the
probability of this path extension leading to an incorrect decoding of
the transmitted source word $\mathbf{u}_N$ is $P_e(i)$. From
the discussion in Section II-A, even inside the information set,
different bits have different $P_e(i)$s. If $u_i$ corresponds to a very
reliable channel with a very low $P_e(i)$, the probability of
$\mathbf{u}_0^i$ not being in the candidate list when the path is
extended into only a single path with $\hat{u}_i = \Theta(\Lambda^i)$ is
small, and the performance degradation introduced is negligible.
Based on the above discussion, the SE method is proposed.
It divides the information set A into two subsets: the reliable
set and the unreliable set, denoted by Ar and Au, respectively.
Only for those bits inside $\mathcal{A}_u$ are the $L$ decoding paths
expanded into $2L$ paths. If $u_i$ is in $\mathcal{A}_r$, each of the
$L$ decoding paths is extended into a single path by taking
$\hat{u}_i = \Theta(\Lambda^i)$.
Consequently, the LPO and the associated latency are saved
for those bits inside $\mathcal{A}_r$. Moreover, from (18), since
$\hat{u}_i$ is taken to be $\Theta(\Lambda^i)$, the path metric is
unchanged and no PMU operation is required. Next, the method
of determining the set $\mathcal{A}_r$ is discussed.
B. Reliable Set for Selective Expansion
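To determine the reliable set $\mathcal{A}_r$ (or equivalently
$\mathcal{A}_u = \mathcal{A} \setminus \mathcal{A}_r$), the performance
of LSCD using the SE method is first analyzed. Let
$\mathcal{M}_{\mathrm{LSCD}}$ and $\mathcal{M}_{\mathrm{SE}}$ denote the
candidate lists output from the conventional LSCD and the LSCD using
the SE method, respectively. We are mainly interested in the
block-error event that the transmitted source word $\mathbf{u}_N$ is not
in $\mathcal{M}_{\mathrm{LSCD}}$ or $\mathcal{M}_{\mathrm{SE}}$. The
block-error event of the SE method,
$E_{\mathrm{SE}} \triangleq \{\mathbf{u}_N \notin \mathcal{M}_{\mathrm{SE}}\}$,
is given by
$$E_{\mathrm{SE}} = E_{\mathrm{SE}}^{\overline{\mathrm{LSCD}}} \cup E_{\mathrm{SE}}^{\mathrm{LSCD}} = \{\mathbf{u}_N \notin \mathcal{M}_{\mathrm{SE}}, \mathbf{u}_N \notin \mathcal{M}_{\mathrm{LSCD}}\} \cup \{\mathbf{u}_N \notin \mathcal{M}_{\mathrm{SE}}, \mathbf{u}_N \in \mathcal{M}_{\mathrm{LSCD}}\}, \tag{20}$$
where $E_{\mathrm{SE}}^{\mathrm{LSCD}}$ and
$E_{\mathrm{SE}}^{\overline{\mathrm{LSCD}}}$ denote the error events in
the SE method that can and cannot be correctly decoded by the
conventional LSCD, respectively. In other words,
$E_{\mathrm{SE}}^{\mathrm{LSCD}}$ contains the error events introduced
by the SE method, since otherwise they can be decoded by the
conventional LSCD. In addition, let
$E_{\mathrm{LSCD}} \triangleq \{\mathbf{u}_N \notin \mathcal{M}_{\mathrm{LSCD}}\}$
be the block-error event of the conventional LSCD, and we have
$E_{\mathrm{SE}}^{\overline{\mathrm{LSCD}}} \subset E_{\mathrm{LSCD}}$.
A block-error event in $E_{\mathrm{SE}}^{\mathrm{LSCD}}$ occurs when we
decode an information bit $u_i$, where $i \in \mathcal{A}_r$, and the
resulting $L$ candidate paths do not include the correct path
$\mathbf{u}_0^i$. This event is denoted by
$B = \{u_i \ne \Theta(\Lambda^i(\mathbf{u}_0^{i-1})) \,|\, i \in \mathcal{A}_r\}$,
where $u_i$ is the $i$th bit of the transmitted source word
$\mathbf{u}_N$ and $\Lambda^i(\mathbf{u}_0^{i-1})$ is obtained
based on $\mathbf{u}_0^{i-1}$. Similar to the union bound of the SCD in
(5), the probability of event $B$ satisfies
$$\Pr(B) \le \sum_{i \in \mathcal{A}_r} P_e(i). \tag{21}$$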
From the above discussion, the block-error probability of the
LSCD using the SE method, i.e.,
$P_b^{\mathrm{SE}} \triangleq \Pr(E_{\mathrm{SE}})$, is upper
bounded by
$$\begin{aligned} P_b^{\mathrm{SE}} &= \Pr\left(E_{\mathrm{SE}}^{\overline{\mathrm{LSCD}}}\right) + \Pr\left(E_{\mathrm{SE}}^{\mathrm{LSCD}}\right) \\ &\le P_b^{\mathrm{LSCD}} + \left(1 - P_b^{\mathrm{LSCD}}\right) \cdot \Pr(B) \\ &\le P_b^{\mathrm{LSCD}} + \sum_{i \in \mathcal{A}_r} P_e(i), \end{aligned} \tag{22}$$
where $P_b^{\mathrm{LSCD}} \triangleq \Pr(E_{\mathrm{LSCD}})$ denotes
the block-error probability of the conventional LSCD.⁴ Furthermore, to
simplify the calculation of (22), the $P_e(i)$s are approximated by the
error probabilities $P_e^d(i)$ of their degraded channels [13], where
$P_e(i) \le P_e^d(i)$. As a result, the upper bound on
$P_b^{\mathrm{SE}}$ is given by
$$P_b^{\mathrm{SE}} \le P_b^{\mathrm{LSCD}} + \sum_{i \in \mathcal{A}_r} P_e^d(i). \tag{23}$$
⁴ It is assumed in this work that the value of $P_b^{\mathrm{LSCD}}$ is
already available and can be obtained from simulation. We leave the
theoretical analysis of $P_b^{\mathrm{LSCD}}$ to our future work.
Based on (23), we define the upper bound of the block-error
probability degradation $\eta$ introduced by the SE method as
$$\eta(\mathcal{A}_r) \triangleq \frac{\sum_{i \in \mathcal{A}_r} P_e^d(i)}{P_b^{\mathrm{LSCD}}}, \tag{24}$$
and the block-error probability of the LSCD using SE is no
greater than $(1 + \eta) P_b^{\mathrm{LSCD}}$.
From the above performance analysis result, we formulate
an optimization problem given a constraint $\epsilon$ on the tolerable
error-correcting performance degradation as follows:
$$\begin{aligned} \text{maximize} \quad & |\mathcal{A}_r| \\ \text{subject to} \quad & \mathcal{A}_r \subset \mathcal{A}, \\ & \eta \le \epsilon. \end{aligned} \tag{25}$$
The solution of (25) is the optimal set $\mathcal{A}_r$, as the
objective function $|\mathcal{A}_r|$, reflecting the latency saving
achieved by the SE method, is maximized.
The optimal solution to problem (25) can be obtained by
sorting the information set $\mathcal{A}$ by $P_e^d(i)$
($i \in \mathcal{A}$) in ascending order and taking the first $k$
elements of the sorted $\mathcal{A}$, where $k$ is the largest value for
which the corresponding $\eta$ of this $k$-element set does not exceed
$\epsilon$. For an information set $\mathcal{A}$ of polar codes with a
given $\epsilon$, the reliable set $\mathcal{A}_r$ of (25) can be found
offline accordingly.
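The offline procedure described above amounts to a sort followed by a running sum. The Python sketch below is one possible realization under the assumption that the degraded-channel error probabilities $P_e^d(i)$ and the simulated $P_b^{\mathrm{LSCD}}$ are given; the function name `reliable_set` and its argument names are illustrative.

```python
def reliable_set(pde, A, p_lscd, eps):
    """Split the information set A into (Ar, Au) according to (24) and (25).

    pde    : mapping from index i to P_e^d(i) (list or dict), assumed given
    A      : iterable of information-bit indices
    p_lscd : block-error probability of the conventional LSCD (e.g. simulated)
    eps    : tolerable performance degradation epsilon
    """
    ordered = sorted(A, key=lambda i: pde[i])  # most reliable bits first
    Ar, total = [], 0.0
    for i in ordered:
        if (total + pde[i]) / p_lscd > eps:    # eta of (24) would exceed eps
            break
        total += pde[i]
        Ar.append(i)
    chosen = set(Ar)
    Au = [i for i in A if i not in chosen]
    return sorted(Ar), sorted(Au)
```

Taking the smallest $P_e^d(i)$s first maximizes the number of indices that fit within the budget $\epsilon P_b^{\mathrm{LSCD}}$, which is why the greedy pass solves (25).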
IV. DOUBLE THRESHOLDING SCHEME
For the SE method, the LM operation still has to be executed
for those unreliable information bits. The LPO needs to find
the smallest L path metrics from the 2L candidate inputs, and
a sorting method is required. However, the sorting operation
introduces a large latency, particularly when the list size is
large. To reduce the latency, parallel sorting can be used, but
the computation complexity will be very high for a large
list size. Therefore, a low complexity sorting operation is
needed. In this section, a Double Thresholding Scheme (DTS)
is proposed at the algorithmic level as a good approximation
of the conventional sorting method. Low complexity parallel
comparisons are executed in the DTS to find the surviving
paths, and the latency of the LPO is greatly reduced for a
large list size L.
A. Properties of the Path Metric
From Section II-C, the inputs to the LPO of bit $u_i$ are $2L$
path metrics $\gamma^{i+1}_k$ ($k = 0, 1, \ldots, 2L-1$) generated from the
PMU as stated in (18). To approximate the LPO, the properties
of the input path metrics are first studied. Specifically, we are
interested in the number of the path metrics that are smaller
than a certain value $T$, i.e., the cardinality of the set $\Omega(T)$,
which is defined as
$$\Omega(T) \triangleq \left\{ \gamma^{i+1}_k \,\middle|\, \gamma^{i+1}_k < T \right\}. \tag{26}$$
The properties related to the cardinality $|\Omega(T)|$ are stated as
follows.
Figure 5. Double thresholding scheme. (a) $RT = \gamma^i_{L-1}$. (b) A smaller $RT$. (c) A too small $RT$.
Proposition 1: Assume the $L$ path metrics $\gamma^i_l$ ($l = 0, 1, \ldots, L-1$)
input to the PMU are sorted and
$$\gamma^i_0 < \gamma^i_1 < \cdots < \gamma^i_l < \gamma^i_{l+1} < \cdots < \gamma^i_{L-1}. \tag{27}$$
The cardinality of $\Omega(T)$, when $T = \gamma^i_l$, satisfies
$$l \le \left|\Omega\left(\gamma^i_l\right)\right| \le 2l. \tag{28}$$
Proof: From (18) and (27), $\gamma^{i+1}_0 < \gamma^{i+1}_2 < \cdots < \gamma^{i+1}_{2l} = \gamma^i_l$,
and hence the left-hand part of (28) is proved. On the other
hand, $\gamma^i_l = \gamma^{i+1}_{2l} < \gamma^{i+1}_{2l+2} < \cdots < \gamma^{i+1}_{2L-2}$, and (18) implies
that $\gamma^{i+1}_{2l+1} \ge \gamma^{i+1}_{2l}$. As a result, $\gamma^{i+1}_k \ge \gamma^i_l$ for $k \ge 2l$, and the
right-hand part of (28) is proved.
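The bounds in (28) can be checked numerically with a small sketch. The helper names (`expand`, `omega_size`, `check_proposition`) and the random test data are assumptions introduced here for illustration; the expansion rule follows the description of (18), where the even-indexed child keeps the parent metric and the odd-indexed child adds the magnitude of the parent's LLR.

```python
import random


def expand(metrics, llrs):
    """Expand L parent metrics into 2L children following the rule described by (18)."""
    children = []
    for g, llr in zip(metrics, llrs):
        children += [g, g + abs(llr)]  # even child, then odd child
    return children


def omega_size(children, t):
    """|Omega(T)|: number of expanded metrics strictly smaller than T, as in (26)."""
    return sum(1 for c in children if c < t)


def check_proposition(seed, list_size=8):
    """Return True if l <= |Omega(gamma_l)| <= 2l holds for every l."""
    rng = random.Random(seed)
    parents = sorted(rng.sample(range(1000), list_size))  # strictly increasing, as in (27)
    llrs = [rng.uniform(-20.0, 20.0) for _ in range(list_size)]
    children = expand(parents, llrs)
    return all(l <= omega_size(children, g) <= 2 * l for l, g in enumerate(parents))


print(all(check_proposition(seed) for seed in range(100)))
```

A random check of this kind does not replace the proof; it only illustrates how the even-indexed children pin down the lower bound while the children of larger parents cannot fall below $\gamma^i_l$.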
B. Double Thresholding Scheme
Based on the path metric properties presented in Proposition
1, the DTS is proposed for a fast LPO. It finds the L
approximately smallest path metrics from the 2L inputs to
form the surviving path metric set Ψ.
Double Thresholding Scheme: Assuming the $L$ path metrics
$\gamma^i_l$ ($l = 0, 1, \ldots, L-1$) input to the PMU satisfy (27), two
threshold values, one the acceptance threshold ($AT$) and the
other the rejection threshold ($RT$), can be determined, and
they are given as
$$[AT, RT] = \left[\gamma^i_{L/2},\ \gamma^i_{L-1}\right]. \tag{29}$$
The LPO for $\gamma^{i+1}_k$ ($k = 0, 1, \ldots, 2L-1$) is then summarized
as follows:
DTS.1) if $\gamma^{i+1}_k < AT$, $\gamma^{i+1}_k \in \Psi$;
DTS.2) if $\gamma^{i+1}_k > RT$, $\gamma^{i+1}_k \notin \Psi$; and
DTS.3) if $AT \le \gamma^{i+1}_k \le RT$, it is randomly chosen to be
included in $\Psi$ such that $|\Psi| = L$.
Finally, the path extensions with the path metrics $\gamma^{i+1}_k$ that
are inside $\Psi$ are kept and the rest of the path extensions are
pruned.
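A minimal software sketch of rules DTS.1 to DTS.3 is given below. It assumes that the parent metrics satisfy (27) and that the thresholds come from (29), so that the undecided pool is always large enough; the function name `dts_select` and the toy values are assumptions made here, and a hardware implementation would not use a software random generator.

```python
import random


def dts_select(children, at, rt, list_size, rng=random):
    """Return the indices of the surviving metrics chosen by DTS.1 to DTS.3."""
    accepted = [k for k, g in enumerate(children) if g < at]          # DTS.1
    undecided = [k for k, g in enumerate(children) if at <= g <= rt]  # pool for DTS.3
    # Metrics with g > rt never enter either list (DTS.2).
    need = list_size - len(accepted)
    return sorted(accepted + rng.sample(undecided, need))


L = 4
parents = [1.0, 2.0, 4.0, 7.0]      # sorted as in (27)
llrs = [0.5, -3.0, 6.0, -0.2]       # hypothetical leaf LLRs
children = []
for g, llr in zip(parents, llrs):   # expansion following the description of (18)
    children += [g, g + abs(llr)]
at, rt = parents[L // 2], parents[L - 1]   # thresholds from (29)
survivors = dts_select(children, at, rt, L, random.Random(1))
print(survivors)
```

In this toy case three children fall below $AT$ and are always kept, two exceed $RT$ and are always dropped, and the fourth survivor is drawn at random from the three undecided candidates, which is where the scheme can differ from exact sorting.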
The operation of the DTS is illustrated in Fig. 5. Assuming
the $2L$ path metrics $\gamma^{i+1}_k$ ($k = 0, 1, \ldots, 2L-1$) are sorted
in ascending order, the top $L$ path metrics are the smallest.
Hence, they are the elements of $\Psi$ if an exact sorting method
is used for the LPO. On the other hand, when the DTS is used,
the shaded path metrics are the elements of $\Psi$.
From Proposition 1, DTS.1 ensures that at least $L/2$ path
metrics are picked and they are the smallest among all $2L$
path metrics. So these path metrics are in the original exactly-
sorted $\Psi$. Therefore, based on DTS.1, the performance of the
resulting LSCD with list size $L$ would not be worse than that
of the LSCD with a list size $L/2$ based on the exact sorting
method.

Algorithm 2: Procedure of the low-latency LSCD
($\mathcal{L}$: list size, written $L$ in the text; $L$: actual size of the candidate list)
 1  $L \leftarrow 1$; // initialize the actual size of the candidate list
 2  for $i = 0, 1, \cdots, N-1$ do
 3    for $l = 0, 1, \cdots, L-1$ do // given the $l$th decoding path $\hat{u}_0^{i-1}$
 4      update $\Lambda^i_l$ with (14)-(16); // SCD operation
 5      if $i \in A^c$ then
 6        extend $\hat{u}_0^{i-1}$ to $\hat{u}_0^i$ with $\hat{u}_i \leftarrow 0$;
 7        update $\gamma^{i+1}_l$ from $\gamma^i_l$ and $\Lambda^i_l$ with $\hat{u}_i \leftarrow 0$; // PMU in (12)
 8      else if $i \in A_r$ then
 9        extend $\hat{u}_0^{i-1}$ to $\hat{u}_0^i$ with $\hat{u}_i \leftarrow \Theta(\Lambda^i_l)$;
10        update $\gamma^{i+1}_l$ from $\gamma^i_l$ and $\Lambda^i_l$ with $\hat{u}_i \leftarrow \Theta(\Lambda^i_l)$; // PMU in (12)
11      else // $i \in A_u$
12        extend $\hat{u}_0^{i-1}$ to two paths $\hat{u}_0^i$ with $\hat{u}_i \leftarrow 0/1$;
13        update the $\gamma^{i+1}_l$s from $\gamma^i_l$ and $\Lambda^i_l$ with $\hat{u}_i \in \{0, 1\}$; // PMU in (12)
14        $L \leftarrow L + 1$;
15    if $L > \mathcal{L}$ then // DTS for LPO
16      update $\Psi$ from the $\gamma^{i+1}_l$s, $AT$, and $RT$, and reserve the corresponding $\hat{u}_0^i$s;
17      $L \leftarrow \mathcal{L}$;
18      find $AT$ and $RT$ from the updated $\Psi$;
19  return $\hat{u}^N$ passed the CRC check;
From Proposition 1, $|\Omega(RT)| \ge L - 1$, and (18) implies
$\gamma^{i+1}_{2L-2} = RT$. Hence, at least $L$ of the $\gamma^{i+1}_k$s are less than or equal
to $RT$. It also means that at most $L$ path metrics are greater
than $RT$. Therefore, DTS.2 efficiently excludes at most the $L$
largest path metrics and these are surely not in the original
exactly-sorted $\Psi$. Finally, as shown in Fig. 5(a), when the
number of path metrics picked by DTS.1 is smaller than $L$,
DTS.3 randomly chooses the metrics from the remaining $\gamma^{i+1}_k$s
to fill up the decoding list such that $|\Psi| = L$.
Compared with the exact-sorting method, the performance
of the DTS is potentially degraded due to DTS.3. As shown in
Fig. 5(a), some of the larger of the $L$ smallest path metrics may
not be chosen by DTS.3, and this happens when the number of
path metrics accepted by DTS.1 and the number excluded by DTS.2
are both fewer than $L$. Therefore, to improve the performance
of the DTS, a larger $AT$ or a smaller $RT$ can be used. If
the $AT$ is increased, it is possible that more than $L$ path
metrics are accepted by DTS.1. Also, as will be discussed in
the next section, in order to reduce the number of comparisons,
our proposed architecture does not explicitly generate the $AT$
value for comparison. Hence, in this work, a smaller $RT$, e.g.,
$RT = \gamma^i_l$ ($l < L-1$), is used to improve the performance.
As indicated in Fig. 5(b), a smaller $RT$ excludes more path
metrics, and hence the path metric chosen by DTS.3 is more
likely to be one of the $L$ smallest metrics. On the other hand,
with a smaller $RT$, it is possible that more than $L$ path metrics
will be excluded by DTS.2. As shown in Fig. 5(c), this results
in a list size smaller than $L$. Hence, if the $RT$ is reduced by
too much, the performance of the LSCD will also be degraded.
In the next section, we propose an architecture that can use a
smaller $RT$ value while guaranteeing to generate a list with
size $L$.

Figure 6. Top-level architecture of the low-latency LSCD.
The overall procedure of the proposed low-latency LSCD
based on SE and DTS is summarized in Algorithm 2. Lines
8-14 execute the SE method discussed in Section III and Lines
15-18 describe the DTS. From the hardware implementation
perspective, since now we only need to compare the 2L input
path metric values with fixed threshold values, the DTS can be
executed in parallel, without a large increase in computation
complexity. Therefore, the logic delay is much smaller than
that of the exact sorting method and the overall latency of
the LPO is reduced. In the next section, a VLSI architecture
implementing Algorithm 2 will be discussed in detail.
V. LOW-LATENCY LSCD ARCHITECTURE
The top-level architecture of the proposed LSCD is shown
in Fig. 6. It mainly consists of five modules: the SCD module,
the state memory module, the LM module, the CRC check
unit, and the control unit. The SCD module is composed of L
independent semi-parallel SCDs, each using M (M < N/2)
processing elements (PEs) for the f and g function evaluation
[20], [25]. The CRC check unit contains L bit-serial units
computing the CRC check of each decoding path. As shown
in [47], the latency of the CRC check unit is masked by that
of the LSCD and hence can be neglected. A 2N bit ROM is
used to store the flags to indicate whether ui is a frozen bit, a
reliable information bit, or an unreliable information bit, and
this is used by the control unit to generate the corresponding
control signals to each block. In the rest of this section, the
state memory module and the LM module are discussed in
detail.
A. State Memory Module
Similar to the architecture in [37], the state memory module
is composed of three memories: the LLR memory, storing the
intermediate $L^t_j$s ($0 \le j < 2^t$, $0 \le t \le n$) of each SCD;
the partial-sum memory, storing the partial-sums of each SCD
[25]; and the path memory, storing the $L$ decoding paths.
As discussed in [20] and [25], a semi-parallel SCD with
$M = 2^m$ processing elements uses a dual-port SRAM to
store the intermediate LLR operands at every decoding stage.
It consists of $2\left(\frac{N}{M} + m\right)$ words with $MQ$ bits each (i.e., an
overall size of $2(N + mM)Q$ bits), where $Q$ is the number of
quantization bits for the LLR values. In every cycle, two words
are needed for the corresponding f and g node execution and
one word of the $M$ LLR values is generated and stored back.
$NQ$ bits of memory are used to store the channel input LLRs
$L^n_i$ ($0 \le i < N$) and the remaining $(N + 2mM)Q$ bits
are used for the intermediate output LLRs $L^t_j$ ($0 \le j < 2^t$,
$0 \le t < n$). To support the operation of $L$ parallel SCDs, $L$
SRAMs are needed for the LLR memory. Since the channel
inputs $L^n_i$ are the same for all $L$ SCDs, they can be stored in
the first SRAM, while the size of the other SRAMs is reduced
to $(N + 2mM)Q$ bits each. As a result, the overall size of
the LLR memory is $[(L+1)N + 2LmM]Q$ bits.5
As shown in [25], N/2 bits of partial-sums are stored for
the g function evaluation for one SCD. Hence, the size of the
partial-sum memory in the LSCD is LN/2 bits. The size of
the path memory is LK bits, as each of the L decoding paths
has K information bits (the values of the N −K frozen bits
are pre-known and need not be stored). Since the sizes of the
partial-sum memory and the path memory are much smaller
than that of the LLR memory, they are implemented using
registers and organized into L register blocks with equal size,
as shown in Fig. 6.
For LSCD, each SCD expands a decoding path into two
when an information bit is decoded. The two paths can both
be kept in or excluded from the surviving candidate list. That
means an SCD used for the decoding of a path stored in
a certain SRAM in this decoding cycle may be assigned to
decode another path stored in another SRAM in the next
decoding cycle. Therefore, we need to re-align the connection
between the state memory and the SCD in each decoding
cycle. As shown in Fig. 6, for the partial-sum memory and
the path memory, L × L crossbars are used for moving the
data for the alignment. For the LLR memory, since the size
is very large and moving the contents has a large timing and
power overhead, the lazy copy method, which uses a pointer
to manipulate the alignment instead of physically moving the
data content, is introduced in [29] and [37]. As shown in Fig.
6, an L×L crossbar with port width 2MQ bits is used to direct
the memory contents to the corresponding SCD hardware. The
control signals of this crossbar are generated by the pointer
memory updated by the LM module, and the details of the
updating logic have been presented in [37] and [47]. The size
of the pointer memory is $L \times (n-1) \times \log_2 L$ bits, and the
memory is implemented with registers.
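The memory sizes quoted in this subsection can be tabulated with a short sketch. The function name `state_memory_bits` is an assumption made here, $N$ and $M$ are assumed to be powers of two, and the quantization width $Q = 6$ used in the example is a placeholder value chosen for illustration only.

```python
def state_memory_bits(N, M, L, K, Q):
    """Sizes in bits of the state memories, following the expressions in the text."""
    n = N.bit_length() - 1                       # N = 2^n
    m = M.bit_length() - 1                       # M = 2^m
    per_scd_intermediate = (N + 2 * m * M) * Q   # intermediate LLRs of one SCD
    return {
        "llr": N * Q + L * per_scd_intermediate,           # channel LLRs stored once
        "partial_sum": L * N // 2,
        "path": L * K,
        "pointer": L * (n - 1) * (L.bit_length() - 1),     # L * (n - 1) * log2(L)
    }


# N, M, L and K follow the decoder discussed in the paper; Q = 6 is assumed here.
sizes = state_memory_bits(N=1024, M=64, L=16, K=528, Q=6)
for name, bits in sizes.items():
    print(f"{name}: {bits} bits")
```

With these example parameters the LLR memory is more than an order of magnitude larger than the partial-sum and path memories, which is consistent with the choice of SRAM for the former and registers for the latter.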
5As discussed in [20], for an easy memory layout and a simple connection
between the memory and the PEs, every word of the memory has the same bit
width. For each SCD, the memory location for storing the $L^t_j$s at stage $t$, where
$0 \le t \le m$, has $(2mM + 1)Q$ unused bits and hence for an LSCD with
list size $L$, there is an overall unused overhead of $(2mM + 1)LQ$ bits.
Figure 7. The data path of the LM module using the DTS.
B. List Management Module
The LM module implements the LM operation shown in
Fig. 3. Fig. 7 shows the data path when the DTS is used for
the LPO. It mainly consists of four components: the threshold-
tracking architecture (TTA), the PMU block, the DTS block,
and the lazy copy (LC) block. Specifically, the PMU block
executes the PMU operation in Fig. 3, and the DTS block
together with the LC block implements the LPO shown in
Fig. 3. The TTA calculates the thresholds to support the
operation of the DTS block. As shown in Fig. 7, after decoding
$u_{i-1}$, the path metrics of the $L$ surviving decoding paths
are $\gamma^i_l$ ($l = 0, 1, \ldots, L-1$). In decoding $u_i$ ($i \in A_u$), the
$L$ SCDs generate $L$ output LLRs $\Lambda^i_l$, and the PMU block
generates the path metrics of the $2L$ extended paths. After
this, the DTS block finds the $L$ approximately best path metrics
$\gamma^{i+1}_l$ ($l = 0, 1, \ldots, L-1$) and their corresponding decoding
paths. Based on the information on path removal and survival,
the LC block manipulates the memory contents in the state
memory module, and its logic has been discussed in [37] and
[47]. Running in parallel with the LC block, the TTA block
calculates the values of $AT$ and $RT$ from the surviving $\gamma^{i+1}_l$s,
and they will be used by the DTS block for the decoding of
the next bit. In the following, the architectures for the PMU,
TTA, and DTS blocks are presented in detail.
1) PMU Block in the List Management Module: The PMU
block expands and updates the path metrics based on (18).
Its $2L$ outputs $\gamma^{i+1}_l$ ($l = 0, 1, \ldots, 2L-1$) are divided into
two groups: path metrics with an even index (PME), i.e. $\gamma^{i+1}_j$
($j = 0, 2, \ldots, 2L-2$), and path metrics with an odd index
(PMO), i.e. $\gamma^{i+1}_k$ ($k = 1, 3, \ldots, 2L-1$). From (18), no extra
hardware is required to generate the path metrics in the PME
as $\gamma^{i+1}_j = \gamma^i_{j/2}$ when $j$ is an even number. On the other hand,
$L$ adders are needed in the PMU block to generate the path
metrics in the PMO as $\gamma^{i+1}_k = \gamma^i_{(k-1)/2} + \left|\Lambda^i_{(k-1)/2}\right|$ when $k$
is an odd number.
2) TTA in the List Management Module: The TTA is
responsible for calculating the acceptance threshold AT and
the rejection threshold $RT$ for the DTS to work. The $AT$
and $RT$ values for decoding bit $u_i$ are generated from $\gamma^i_l$
($l = 0, 1, \ldots, L-1$), which are the $L$ surviving path metrics
at bit $u_{i-1}$, as shown in Fig. 7. The architecture of the TTA
is shown in Fig. 8. In addition to the generation of $AT$ and
$RT$, as shown in Fig. 8, the TTA also outputs the partially-
sorted $\gamma^i_l$s. The smallest $L/2$ path metrics are on the top and
the largest $L/2$ path metrics, which are exactly-sorted, are at
the bottom. The details of the TTA operations are as follows.
The L input path metrics are evenly divided into two
groups. Each group is then sorted by a radix-L/2 sorter [48].
Figure 8. Threshold-tracking architecture.
Therefore, their outputs $d^k_j$ ($j = 0, 1, \ldots, L/2-1$; $k = 0, 1$)
satisfy $d^k_0 \le d^k_1 \le \cdots \le d^k_{L/2-1}$, for $k = 0$ and 1. Similar
to [46], $L/2$ comparing-and-swapping (C&S) elements take
pairs of the output values of the sorters $\left(d^0_j, d^1_{L/2-1-j}\right)$,
$j = 0, 1, \ldots, L/2-1$, as their inputs and direct the smaller
value to the upper output and the larger value to the lower
output. As a result, the outputs of the C&S array are partially
sorted, where the top $L/2$ outputs are guaranteed to be
smaller than or equal to the lower $L/2$ outputs. For an easier
implementation of the DTS architecture, the lower $L/2$ outputs
are further exactly sorted by another radix-$L/2$ sorter. The
reason for this will be discussed in the next sub-section. From
the discussion in Section IV-B, the first element in the lower
$L/2$ sorted output path metrics, $\gamma^i_{L/2}$ in Fig. 8, is $AT$. In fact,
we do not need to know the value of $AT$. The group of path
metrics that satisfies the $AT$ check can be directly obtained
from the top $L/2$ outputs of the TTA. This will be discussed
in more detail in the next sub-section. Moreover, $RT$ can be
chosen from the lower $L/2$ sorted output path metrics of the
TTA. For example, the $RT$ used in (29) is the last output path
metric of the TTA. The TTA requires an exact sorting of $L/2$
elements. For other LSCD architectures that use exact sorting
for list pruning, the input size of the sorter is $2L$ instead of
$L/2$. So the complexity of the proposed TTA is much smaller.
In addition, the TTA is executed in parallel with the execution
of f or g nodes and the PMU for the decoding of the next bit,
and hence the latency is hidden and no extra cycle is added
to the overall latency.
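The data flow of the TTA described above can be mimicked in software as follows. This is a behavioural sketch only: the function name `tta` and the example values are assumptions made here, Python's built-in sort stands in for the radix-$L/2$ sorters, and the bookkeeping of which path each metric belongs to is omitted.

```python
def tta(gamma):
    """Partially sort L metrics and derive AT and RT as in the threshold-tracking scheme."""
    half = len(gamma) // 2
    d0 = sorted(gamma[:half])   # first radix-L/2 sorter
    d1 = sorted(gamma[half:])   # second radix-L/2 sorter
    upper, lower = [], []
    for j in range(half):       # C&S element on the pair (d0[j], d1[L/2 - 1 - j])
        a, b = d0[j], d1[half - 1 - j]
        upper.append(min(a, b))
        lower.append(max(a, b))
    lower.sort()                # third radix-L/2 sorter on the lower outputs
    # AT is the first lower output; RT as in (29) is the last lower output.
    return upper, lower, lower[0], lower[-1]


upper, lower, at, rt = tta([9, 2, 7, 4, 1, 8, 3, 6])
print(upper, lower, at, rt)   # [2, 4, 3, 1] [6, 7, 8, 9] 6 9
```

Pairing the $j$th smallest value of one group with the $j$th largest value of the other is what guarantees that every upper output is no larger than every lower output, even though the upper half itself is left unsorted.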
3) DTS Block in the List Management Module: As shown
in Fig. 7, when we decode bit ui, the DTS takes the two groups
of path metrics (PME and PMO) output from the PMU block
as input. The AT and RT values obtained from the TTA are
used as the threshold values for the DTS operation.
As shown in Fig. 6, the path metrics in the PME and
PMO are firstly passed to two permutation networks (PNs),
respectively. Since each partially-sorted path metric output of
the TTA corresponds to the generation of one path metric
element in the PME and another in the PMO, according to
(18), the elements in the PME and PMO are permutated
based on the sorted-order of their parent path metrics in the
TTA output. For example, if the orders of the outputs of the
TTA are $\gamma^i_3$, $\gamma^i_2$, $\gamma^i_0$, and $\gamma^i_1$ for $L = 4$, the orders of the
PME and PMO after permutation are
$\left[\gamma^{i+1}_6, \gamma^{i+1}_4, \gamma^{i+1}_0, \gamma^{i+1}_2\right]$ and
$\left[\gamma^{i+1}_7, \gamma^{i+1}_5, \gamma^{i+1}_1, \gamma^{i+1}_3\right]$, respectively. Since the first $L/2$
outputs of the TTA are smaller than $AT$ and $\gamma^{i+1}_{2l} = \gamma^i_l$, the
first $L/2$ elements in the permutated PME are all smaller than
$AT$. Similarly, as $AT$ is the smallest value among the last $L/2$
outputs of the TTA, the last $L/2$ elements in the permutated
PME are all greater than or equal to $AT$.

Figure 9. Architecture of the pruning and copying (PC) block.
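The permutation example for $L = 4$ can be reproduced with a few lines. The function name `permute_children` is an assumption made here; it works on child indices only, not on metric values, to show how the TTA output order determines the order of the PME and PMO.

```python
def permute_children(tta_order):
    """Child indices of the permuted PME and PMO for a given order of parent indices."""
    pme = [2 * l for l in tta_order]       # even child 2l inherits the metric of parent l
    pmo = [2 * l + 1 for l in tta_order]   # odd child 2l + 1 of the same parent
    return pme, pmo


pme_idx, pmo_idx = permute_children([3, 2, 0, 1])
print(pme_idx, pmo_idx)   # [6, 4, 0, 2] [7, 5, 1, 3]
```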
After permutation, the elements of the PME and PMO are
passed to the pruning and copying (PC) block to determine
the L surviving paths. The architecture of the PC is shown
in Fig. 9. From the above discussion, the first L/2 elements
in the PME are definitely smaller than AT and hence will be
included in the surviving set Ψ. To fill up the remaining L/2
elements in Ψ, as discussed in Section IV, we need to compare
the last L/2 elements in the PME and the elements in the
PMO with AT and RT . Random inclusion or exclusion has
to be done if the number of elements passing the two threshold
checks is not exactly equal to L/2. To reduce the number of
comparisons and also avoid the random inclusion/exclusion,
which will complicate the hardware implementation, we pro-
pose a different method to select the remaining L/2 elements
in Ψ. We temporarily accept the last L/2 elements in the PME
first. We then compare the elements in the PMO with a fixed
RT value using L comparators. A flag equal to 1 is generated
if the corresponding path metric is not greater than RT . Note
that this RT value is smaller than that stated in (29) in order
to prune out more paths with larger metric values. All the flags
are then added up by an accumulator to decide how many path
metrics are not greater than $RT$. Carry-save adders and an adder
tree are used to reduce the delay of the accumulator. Let $k$
be the output of the accumulator. Then the largest $k$ elements
of the last $L/2$ elements in the PME are replaced by the $k$
path metrics in the PMO that are not greater than $RT$. Note
that since the last $L/2$ elements in the PME are exactly sorted
in ascending order, we simply pick the last $k$ elements in the set for
replacement. If $k$ is larger than $L/2$, we just take the first
L/2 elements in the PMO that pass the RT test to replace the
last L/2 elements in the PME in Ψ.
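The selection rule of the PC block can be sketched in software as below. The function name `pc_select` and the toy metrics are assumptions made here; the sketch returns metric values rather than path indices and assumes its inputs are already permuted, with the first half of the PME below $AT$ and the second half sorted in ascending order.

```python
def pc_select(pme, pmo, rt):
    """Pick L surviving metrics from permuted PME and PMO lists, each of length L."""
    half = len(pme) // 2
    passing = [g for g in pmo if g <= rt]   # PMO entries whose comparator flag is 1
    k = min(len(passing), half)             # accumulator output, capped at L/2
    kept_pme = pme[: len(pme) - k]          # drop the k largest (last) PME entries
    return kept_pme + passing[:k]           # first k passing PMO entries take their place


pme = [1.0, 2.0, 4.0, 7.0]   # toy case: already in TTA order
pmo = [1.5, 5.0, 10.0, 7.2]
print(pc_select(pme, pmo, rt=4.0))   # [1.0, 2.0, 4.0, 1.5]
```

Because entries are only ever swapped one for one, the result always has exactly $L$ elements, whatever value of $RT$ is chosen; with `rt=0.0` in the toy case the PME is returned unchanged.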
The DTS architecture presented in Fig. 9 has two advantages
over the DTS operation discussed in Section IV-B. Firstly, a
much smaller RT can be used to exclude more paths with
large metric values. Even when a smaller RT is used, we
can still guarantee at any time that the candidate list of the
LSCD has L decoding paths. In the worst case, when all the
path metrics in the PMO are greater than RT , we will keep
the last L/2 elements in the PME in the surviving path list.
Secondly, since the last $L/2$ elements in the PME are already
sorted by the TTA, we always replace the worst elements in the
PME. This is better than randomly selecting a path to replace,
as the probability that the last few elements of the PME belong
to the actual surviving path set is low. As a result, the error-correcting
performance of the DTS is improved by using the architecture
shown in Fig. 9, and we denote this as DTS-Advance.

Figure 10. Timing diagram example of the low-latency LSCD architecture.
(a) Timing diagram of decoding $u_0$ and $u_1$ for $N = 4$ polar codes. (b) Timing
diagram of decoding $u_0$ and $u_1$ with the LSCD.
C. Decoding Latency of the Proposed LSCD Architecture
Fig. 10(a) shows the timing diagram of decoding $u_0$ and
$u_1$ in the scheduling tree of Fig. 4 using a single SCD.
When LSCD is used, additional cycles are required for the
path metric updating and list pruning. Fig. 10(b) shows the
timing diagram of decoding $u_0$ and $u_1$ with the proposed
LSCD architecture,6 where the detailed timing of the list
management (LM) component is also shown. Specifically, $\gamma^i_l$
in the PMU row denotes the generation of the $2L$ output path
metrics from the $L$ input path metrics in the PMU block,
and $\gamma^i_l$ in the DTS row denotes finding the $L$ surviving path
metrics from the $2L$ path metric candidates in the DTS block. Compared
with the architecture presented in [52], the processing element
data path is optimized and the PMU block is executed in the
same clock cycle as the leaf f/g node execution of the SCD
operation. Moreover, the DTS and the lazy copying (LC) blocks,
which together implement the LPO, operate in the same clock cycle.
Due to the data dependency, the TTA operation for finding the
threshold values for the next bit is executed when the DTS for
the current bit is finished and it is hidden in the cycle where
the leaf f/g nodes are executed. As a result, by using the DTS
for the LPO, only one additional cycle is introduced for each
LM operation.
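From [20], the decoding latency (i.e., the time to traverse
the scheduling tree) of a semi-parallel SCD using $M$ PEs is
equal to $2N + \frac{N}{M}\log_2\left(\frac{N}{4M}\right)$ clock cycles. Hence, with one
additional cycle for the LM operation of each of the $N$ bits, the overall
latency of the LSCD architecture is
$$D = 3N + \frac{N}{M}\log_2\left(\frac{N}{4M}\right). \tag{30}$$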
As discussed in Section III, when the SE method is used,
if $u_i$ is a reliable bit, i.e. $i \in A_r$, the operations of the
PMU and the LPO after the decoding of bit $u_i$ are not
required. Moreover, the LPO for the frozen bit is not executed
either. Hence, the latency in (30) can be reduced. The
latency is further reduced by considering two source bits at
a time. A source-bit couple is defined as $(u_{2i}, u_{2i+1})$, with
$i \in \{0, 1, \ldots, N/2-1\}$. Based on the types of bits of $u_{2i}$
and $u_{2i+1}$, the source-bit couples can be categorized into six
cases, which are summarized in Table I, where $a_f$ and $a_r$
denote the number of frozen bits and reliable bits in a source-
bit couple, respectively. Without loss of generality, we use the
couple $(u_0, u_1)$ and its decoding timing diagram in Fig. 10 for
illustration in the following discussion.

6For illustration, it is assumed that the list is already full of $L$ decoding
paths in the beginning.

Table I
DIFFERENT LATENCY REDUCTION CASES FOR SOURCE-BIT COUPLE

            | $a_f = 2$ | $a_f = 1$ (a) | $a_f = 0$
$a_r = 2$   | n/a       | n/a           | Case I
$a_r = 1$   | n/a       | Case II       | Case III
$a_r = 0$   | Case IV   | Case V        | Case VI

(a) From the properties of polar codes, if $a_f = 1$ in $(u_{2i}, u_{2i+1})$, then
$2i \in A^c$ and $2i+1 \in A$.
1) Case I: Both $u_0$ and $u_1$ are reliable information bits.
Hence, the LM operation after decoding each bit is saved.
Moreover, since the PMU operation is not needed, the output
LLRs $\Lambda^0$ and $\Lambda^1$ are not needed, and hence the leaf nodes
of the scheduling tree, $f^0_0$ and $g^0_0$, are not executed. For the
$l$th decoding path, the values of $(\hat{u}_0, \hat{u}_1)$ on its path extension
are determined by the hard decision of the SCD assigned for
that decoding path and they are given by $\left[\Theta\left(L^1_0\right), \Theta\left(L^1_1\right)\right] F$,
where $L^1_0$ and $L^1_1$ are the LLRs from the parent node of the
SCD, i.e., $f^1_0$ in Fig. 4.
Based on the above discussion, the operations in cycles 0
to 3 of Fig. 10(b) are saved for Case I. Moreover, as part of
the LM operation, the TTA in cycle 4 is also not needed. As
a result, four clock cycles are saved for the Case I source-bit
couple.
2) Case II: Bit $u_0$ is a frozen bit and $u_1$ is a reliable
information bit. The LPOs for both bits and the PMU operation
for bit $u_1$ are not executed. However, the PMU for the frozen
bit $u_0$ still has to be executed, and it can be combined with
the SCD operation as follows:
$$\gamma^2_l = \begin{cases} \gamma^0_l, & \text{if } \Theta\left(L^1_0\right) = \Theta\left(L^1_1\right), \\ \gamma^0_l + \min\left(\left|L^1_0\right|, \left|L^1_1\right|\right), & \text{if } \Theta\left(L^1_0\right) \ne \Theta\left(L^1_1\right), \end{cases} \tag{31}$$
where $l = 0, 1, \ldots, L-1$. Similar to Case I, $L^1_0$ and $L^1_1$ are
the LLRs output from node $f^1_0$.
As a result, for the Case II source-bit couple, the leaf nodes
of the scheduling tree are not executed and the LM operations
are simplified to (31). The $l$th decoding path's path extension
$(\hat{u}_0, \hat{u}_1)$ is given as $\left(0, \Theta\left(L^1_0 + L^1_1\right)\right)$. Specifically, the PMU
operation in (31) is retimed and it is executed in the same
cycle as $f^1_0$. Thus, the corresponding operations in cycles 0
to 3 are not needed. Different from that in Case I, the TTA in
cycle 4 has to be executed, as the path metrics are changed
by (31).
Table II
LATENCY REDUCTION FOR DIFFERENT SOURCE-BIT COUPLE CASES

Case                     | I | II | III | IV | V | VI
Number of cycles reduced | 4 | 4  | 1   | 4  | 1 | 0
3) Case III: u0 is an unreliable information bit and u1 is
a reliable bit.7 In this case, the operations of the PMU, the
LPO, and the TTA after decoding u1 are not needed. Hence,
one clock cycle (i.e., cycle 3 in Fig. 10(b)) is saved.
4) Case IV: Both $u_0$ and $u_1$ are frozen bits. The LPOs for
both bits are saved, and the PMU operations of the two bits
are combined and simplified as [52]
$$\gamma^2_l = \gamma^0_l + \Theta\left(L^1_0\right) \cdot \left|L^1_0\right| + \Theta\left(L^1_1\right) \cdot \left|L^1_1\right|, \tag{32}$$
where $l = 0, 1, \ldots, L-1$, and $L^1_0$ and $L^1_1$ are the output
LLRs of node $f^1_0$. Therefore, the leaf node operations $f^0_0$ and
$g^0_0$ of the SCD together with the LM operations are simplified
to (32). This PMU operation is retimed and it is executed in
the same cycle as $f^1_0$. Hence, similar to Case II, four clock
cycles are saved.
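The combined updates (31) and (32) can be compared against a leaf-by-leaf computation with the sketch below. Several conventions are assumptions made here because the corresponding equations lie outside this section: $\Theta(x)$ is taken as 0 for a non-negative LLR and 1 otherwise, the f node uses the min-sum rule, the g node with $\hat{u}_0 = 0$ is the sum of its two inputs, and a path is penalised by the LLR magnitude whenever the chosen bit disagrees with the hard decision. All function names are hypothetical.

```python
def theta(x):
    """Hard decision: 0 for a non-negative LLR, 1 otherwise (assumed convention)."""
    return 0 if x >= 0 else 1


def pmu_case_ii(gamma, l0, l1):
    """(31): u0 frozen, u1 reliable. Returns the new metric and (u0_hat, u1_hat)."""
    penalty = 0.0 if theta(l0) == theta(l1) else min(abs(l0), abs(l1))
    return gamma + penalty, (0, theta(l0 + l1))


def pmu_case_iv(gamma, l0, l1):
    """(32): u0 and u1 are both frozen."""
    return gamma + theta(l0) * abs(l0) + theta(l1) * abs(l1)


def leaf_by_leaf(gamma, l0, l1, u1_frozen):
    """Reference: f node, metric update, g node, metric update, with u0 frozen to 0."""
    sign = 1 if theta(l0) == theta(l1) else -1
    f = sign * min(abs(l0), abs(l1))   # min-sum f node (assumed)
    gamma += theta(f) * abs(f)         # penalty when the frozen 0 contradicts f
    g = l0 + l1                        # g node with u0_hat = 0 (assumed)
    if u1_frozen:
        gamma += theta(g) * abs(g)     # a reliable bit follows theta(g), so no penalty
    return gamma


for l0, l1 in [(3.0, 5.0), (3.0, -5.0), (-3.0, 5.0), (-3.0, -5.0), (-5.0, 3.0)]:
    assert pmu_case_iv(1.0, l0, l1) == leaf_by_leaf(1.0, l0, l1, u1_frozen=True)
    assert pmu_case_ii(1.0, l0, l1)[0] == leaf_by_leaf(1.0, l0, l1, u1_frozen=False)
```

Under these assumptions both shortcuts reproduce the leaf-by-leaf metric for every sign combination tried, which is why the two leaf nodes can be skipped for Case II and Case IV couples.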
5) Case V: u0 is a frozen bit and u1 is an unreliable
information bit. This case is different from Case II, because
the LM operation is needed for u1. Hence, only the LPO for
u0 can be eliminated and one cycle is saved.
6) Case VI: Both u0 and u1 are unreliable information
bits. Fig. 10(b) depicts the timing of this case, and no latency
reduction is achieved.
Table II summarizes the latency reduction achieved by
different source-bit couple cases. As a result, the decoding
latency of the proposed LSCD architecture is given as
$$D_{\mathrm{LSCD}} = D - 4\left(N_{\mathrm{I}} + N_{\mathrm{II}} + N_{\mathrm{IV}}\right) - \left(N_{\mathrm{III}} + N_{\mathrm{V}}\right), \tag{33}$$
where $N_\alpha$ denotes the number of source-bit couples for Case
$\alpha$ found in the polar codes. These values depend on the frozen
set $A^c$ and the reliable set $A_r$. To achieve the timing specified
in (33), the PMU block shown in Figs. 6 and 7 has to support
the operation of (31) and (32), and it is easily achieved with
additional comparators and adders.
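Equations (30) and (33), together with the classification of Table I, can be evaluated with a short sketch. The function names and the 'F'/'R'/'U' labels for frozen, reliable, and unreliable bits are assumptions made here, and $N/(4M)$ is assumed to be a power of two. The example uses the source-bit couple counts listed for $\epsilon = 0.3$ in Table IV.

```python
def couple_case(t0, t1):
    """Classify a source-bit couple; each type is 'F' (frozen), 'R' (reliable) or 'U' (unreliable)."""
    a_f = (t0 == "F") + (t1 == "F")
    a_r = (t0 == "R") + (t1 == "R")
    table = {(0, 2): "I", (1, 1): "II", (0, 1): "III",
             (2, 0): "IV", (1, 0): "V", (0, 0): "VI"}   # (a_f, a_r) as in Table I
    return table[(a_f, a_r)]


def base_latency(N, M):
    """(30): latency in clock cycles before any source-bit couple saving."""
    ratio = N // (4 * M)                       # assumed to be a power of two
    return 3 * N + (N // M) * (ratio.bit_length() - 1)


def lscd_latency(N, M, counts):
    """(33): counts maps the case names 'I' to 'VI' to the number of couples."""
    saved4 = counts["I"] + counts["II"] + counts["IV"]   # four cycles saved each
    saved1 = counts["III"] + counts["V"]                 # one cycle saved each
    return base_latency(N, M) - 4 * saved4 - saved1


eps_03 = {"I": 158, "II": 0, "III": 66, "IV": 224, "V": 48, "VI": 16}
print(base_latency(1024, 64), lscd_latency(1024, 64, eps_03))   # 3104 1462
```

The result of 1462 cycles agrees with the latency reported for $\epsilon = 0.3$ in Table III, and the six counts sum to $N/2 = 512$ couples.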
VI. EXPERIMENTAL RESULTS
In this section, to demonstrate the error-correcting perfor-
mances of the proposed SE method and DTS algorithm, an
(N,R, r) = (1024, 1/2, 16) polar code is simulated over a
binary-input AWGN channel.8 We then present the implementation
results of the proposed LSCD architecture and
compare them with those of other existing works.
7From the properties of polar codes, if $\{2i, 2i+1\} \subset A$ and $a_r = 1$,
then $2i \in A_u$ and $2i+1 \in A_r$.
8As stated in Section II-B, when a 16-bit CRC code is used, the information
set $A$ of polar codes is extended such that $K = |A| = NR + r = 528$.
When SCD is used to decode polar codes, the CRC code is not used and hence
the size of $A$ remains $K = 512$ for the same code rate of $R = 1/2$.
Specifically, the information sets $A$ of both polar codes with $K = 528$ and
$K = 512$ are optimized for $E_b/N_0 = 1.5$ dB.
Figure 11. BLERs of LSCD using SE with different $\epsilon$s.
[Plot: BLER versus $E_b/N_0$ (1 to 2.5 dB); curves: Conventional, SE with $\epsilon$ = 0.3, 1, 3, 9, and Random $A_r$.]

Table III
THE CARDINALITY OF $A_r$ AND DECODING LATENCY FOR DIFFERENT $\epsilon$S

$\epsilon$ @ 2.25 dB           | 0.3    | 1      | 3      | 9
$|A_r| / |A|$                  | 72.35% | 75.76% | 78.98% | 82.77%
$D_{\mathrm{LSCD}}$ (cycles)   | 1462   | 1424   | 1381   | 1329
A. Error-correcting Performance of the SE Method
Fig. 11 shows the block-error rate (BLER) of different
LSCD implementations with a list size of $L = 16$. First the
BLER of the conventional LSCD, i.e., $P_b^{\mathrm{LSCD}}$ in (23) and
(24), is shown. The BLERs of the proposed SE method with
different sizes of the reliable set $A_r$ are also shown. The
size of $A_r$ depends on the tolerable performance degradation
parameter $\epsilon$. In the simulation, we use different $\epsilon$ values,
ranging from 0.3 to 9 at $E_b/N_0 = 2.25$ dB.
From Fig. 11, it can be seen that, for each given $\epsilon$, the
degradation in BLER of the LSCD using the SE method is
close to the upper bound predicted by (23) and (24). This
indicates that the performance analysis in (24) well estimates
the performance degradation introduced by the SE method for
a given reliable set $A_r$. To investigate the relationship between
the latency reduction and the performance degradation of the
SE method, Table III summarizes the cardinality of $A_r$ for
different $\epsilon$s. Moreover, based on $A^c$ and the corresponding
$A_r$, Table IV presents the number of different source-bit
couples for each $\epsilon$ value. Assuming that the LSCD architecture
proposed in Section V is used and each SCD uses $M = 64$
PEs, the last row of Table III compares the decoding latency
($D_{\mathrm{LSCD}}$) for different $\epsilon$s, based on (33). From Table III, we can
see that for $\epsilon = 0.3$, more than 72% of the information bits are
included in set $A_r$ and hence more than 72% of the LPOs are
saved by the corresponding LSCD with SE. From Fig. 11, it is
also shown that the performance degradation introduced by the
SE method with $\epsilon = 0.3$ is negligible compared with that of
the conventional LSCD. If a larger $\epsilon$ is used, Table III shows
that $|A_r|$ is only slightly increased, while the performance of
the corresponding LSCD is degraded significantly, as shown
in Fig. 11. For example, when $\epsilon = 9$, the decoding latency is
only reduced by 9% compared with that of $\epsilon = 0.3$. Therefore,
$\epsilon = 0.3$ is used in the SE method for our low-latency LSCD
implementation.

Figure 12. BLERs of LSCD using DTS and DTS-Advance with different
$RT$ values.
[Plot: BLER versus $E_b/N_0$ (1 to 2.5 dB); curves: Exact sorting; DTS with $RT = \gamma_{15}$ and $RT = \gamma_{11}$; DTSA with $RT = \gamma_{15}$ and $RT = \gamma_{11}$.]

Table IV
SOURCE-BIT COUPLE DISTRIBUTION FOR DIFFERENT $\epsilon$S

$\epsilon$ | $N_{\mathrm{I}}$ | $N_{\mathrm{II}}$ | $N_{\mathrm{III}}$ | $N_{\mathrm{IV}}$ | $N_{\mathrm{V}}$ | $N_{\mathrm{VI}}$
0.3        | 158 | 0  | 66 | 224 | 48 | 16
1          | 168 | 0  | 64 | 224 | 48 | 8
3          | 176 | 5  | 60 | 224 | 43 | 4
9          | 186 | 11 | 54 | 224 | 37 | 0
To verify the effectiveness of the method proposed in
Section III in finding set $A_r$, we randomly choose 72.35% of the
information bits in $A$ to compose set $A_r$. Fig. 11 shows its
BLER using the SE method. It is shown that the performance
is greatly degraded from that using the $A_r$ generated by our
proposed method.
B. Error-correcting Performance of the DTS
Next the error-correcting performance of LSCD using the
DTS to replace exact sorting in the LPO is investigated.
Simulations for the polar code used in the previous sub-section
are carried out. Fig. 12 shows the BLERs of different LSCDs,
including those using the DTS discussed in Section IV and
the DTS-Advance discussed in Section V. Comparisons of
the BLERs of the DTS using different RT values are also
shown. Compared with the LSCD using the exact sorting
method, when $\gamma^i_{L-1}$ is used as $RT$, as stated in (29), the LSCD
using the DTS introduces an SNR penalty of around 0.2 dB
when the BLER is $10^{-4}$. For the DTS-Advance discussed in
Table V
SYNTHESIS RESULTS COMPARISON OF DIFFERENT LSCD ARCHITECTURES FOR (N,R) = (1024, 1/2) POLAR CODES
This work [47] [46] [44] [42] [38] [37]
PE number per SCD M 64 n/a 64
K = |A| 528 528^a 512
List size L 16 8 4
Technology UMC 90 nm TSMC 90 nm 90 nm ST 65 nm TSMC 90 nm UMC 90 nm UMC 90 nm
Area (mm^2) 7.47 3.85 8.64 2.14 1.669 1.743 3.53
Clock freq. (MHz) 658 637 625 400 500 412 314
Throughput (Mbps) 460 245 177 401 332 162 124
^a A 16-bit CRC code is used with (N,R) = (1024, 1/2) polar codes in [47].
[Figure 13 plot: BLER (log scale, 10^-5 to 10^0) versus Eb/N0 (1 to 2.5). Curves: SCD; L = 2; L = 4; L = 8; L = 16; Impl.; LDPC.]
Figure 13. BLERs of LSCD with different list sizes.
Section V-B, the SNR loss is only around 0.1 dB. Moreover,
when a smaller RT value is used, such as γ^i_{11} shown in
Fig. 8, the performance degradation of the DTS-Advance is
negligible. However, when the same RT is used for the DTS, a
performance loss of around 0.1 dB is recorded. This is because
fewer decoding paths are chosen by the DTS,³ and the candidate
list is not full most of the time. As a result, the
DTS-Advance with RT = γ^i_{11} is used for a low-latency LPO in our
LSCD implementation.
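The effect of a smaller RT on list occupancy can be illustrated with a toy model. The sketch below is not the DTS or DTS-Advance hardware of Sections IV and V. It assumes a simplified two-threshold rule (accept every candidate whose path metric is at most AT, reject every candidate above RT, and fill the remaining slots from the in-between candidates in index order), it assumes that a smaller metric means a more likely path, and it uses made-up metrics. The function name and all values are hypothetical.

```python
def two_threshold_prune(metrics, at, rt, list_size):
    """Toy two-threshold pruning; returns the indices of surviving paths.

    Assumption: a smaller path metric indicates a more likely path.
    """
    accepted = [i for i, m in enumerate(metrics) if m <= at]
    undecided = [i for i, m in enumerate(metrics) if at < m <= rt]
    survivors = accepted[:list_size]
    free_slots = list_size - len(survivors)
    # Fill the remaining slots without any sorting of the undecided paths.
    survivors += undecided[:free_slots]
    return survivors


# Made-up metrics for 2L = 8 candidate paths with L = 4.
metrics = [0.1, 0.5, 0.9, 1.4, 2.0, 2.6, 3.1, 4.0]
full = two_threshold_prune(metrics, at=0.6, rt=3.0, list_size=4)
short = two_threshold_prune(metrics, at=0.6, rt=1.0, list_size=4)
print(len(full), len(short))  # 4 3
```

With the tighter RT, only three of the four list slots are occupied in this example, which mirrors the under-filled candidate list described above.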
C. Implementation Results of the Low-latency LSCD
The LSCD architecture proposed in Fig. 6 is designed
and implemented for an (N,R, r) = (1024, 1/2, 16) polar
code with list size L = 16. M = 64 PEs are used for
each SCD. From the simulation results, the SE method with
ε = 0.3 and the DTS-Advance with RT = γ^i_{11} introduce
negligible degradation in the error-correcting performance, and
hence they are used for the hardware implementation. Fig. 13
compares the error-correcting performance of our implementation
with that of the conventional LSCD with different list sizes.
It can be seen that our LSCD architecture has a very similar
BLER performance to the conventional LSCD. As a reference,
the performances of SCD and an (N,R) = (1152, 1/2) LDPC
code used in the WiMAX standard [53] are also shown in Fig.
13. Here, 40 iterations are used for the LDPC decoding. It can
be seen that polar codes have better performance when LSCD
with a larger list size L is used. When LSCD with L = 16 is
used, the BLER performance of polar codes is comparable to
that of the LDPC code.
The design is synthesized with a UMC 90 nm CMOS
process, using Synopsys Design Compiler. For a fair com-
parison, the quantization scheme in [47] is used, i.e., the
LLR and the path metric are represented in 6 bits and 8
bits, respectively. Table V summarizes the synthesis results
and compares them with those of the existing architectures.
Compared with the state-of-the-art architectures, our proposed
LSCD architecture supports a much larger list size, which
results in an error-correcting performance comparable to that of
other advanced error-correcting codes. Moreover, from Table
III, the proposed LSCD architecture requires 1462 clock cycles
to decode one codeword, and hence it achieves a decoding
throughput of 460 Mbps at a clock frequency of 658 MHz.
Compared with [46] and [47], the list size is doubled and the
decoding throughput is about 2.6 and 1.9 times higher,
respectively. The chip area presented in Table
V is mainly due to the state memory module. The SCD module
occupies only 0.53 mm^2, and the area of the LM module is
smaller than 0.1 mm^2.
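The throughput figure can be checked with a short calculation. The sketch assumes that throughput is counted in coded bits, i.e. N = 1024 bits per decoded codeword, since this assumption reproduces the reported 460 Mbps. The function and variable names are ours, and the reference throughputs are copied from Table V.

```python
N = 1024          # code length in bits (assumed unit for the throughput)
CYCLES = 1462     # clock cycles per codeword, as stated in the text
F_CLK_MHZ = 658   # clock frequency from Table V


def throughput_mbps(n_bits, cycles, f_mhz):
    """Bits per codeword times codewords per microsecond."""
    return n_bits * f_mhz / cycles


ours = throughput_mbps(N, CYCLES, F_CLK_MHZ)
latency_us = CYCLES / F_CLK_MHZ
print(f"{ours:.1f} Mbps, {latency_us:.2f} us per codeword")  # 460.9 Mbps, 2.22 us per codeword

# Ratios against the reported 460 Mbps, using the Table V throughputs.
for ref, mbps in {"[47]": 245, "[46]": 177}.items():
    print(ref, f"{460 / mbps:.2f}x")  # [47] 1.88x, [46] 2.60x
```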
VII. CONCLUSION
In this work, a low-latency LSCD architecture is presented,
which is optimized at the system, algorithmic, and architectural
levels. At the system level, a selective expansion method
is proposed such that the number of LM operations and the
associated latency for the reliable information bits are reduced.
At the algorithmic level, a double thresholding scheme is
proposed as an approximate sorting method for the list pruning
operation, which greatly reduces its logic delay for a large
list size. Finally, an optimized VLSI architecture for the LM
operation is presented. Experimental results show that both
the decoding throughput and the list size are doubled when
compared with the state-of-the-art architectures.
REFERENCES
[1] E. Arıkan, “Channel polarization: A method for constructing capacity-
achieving codes for symmetric binary-input memoryless channels,” IEEE
Trans. Inform. Theory, vol. 55, no. 7, pp. 3051-3073, Jul. 2009.
[2] S. H. Hassani, R. Mori, T. Tanaka, and R. L. Urbanke, “Rate-dependent
analysis of the asymptotic behavior of channel polarization,” IEEE
Trans. Inform. Theory, vol. 59, no. 4, pp. 2267-2276, Apr. 2013.
[3] S. H. Hassani, K. Alishahi, and R. L. Urbanke, “Finite-length scaling
for polar codes,” IEEE Trans. Inform. Theory, vol. 60, no. 10, pp. 5875-
5898, Oct. 2014.
[4] M. Mondelli, S. H. Hassani, and R. L. Urbanke, “From polar to Reed-
Muller codes: A technique to improve the finite-length performance,”
IEEE Trans. Commun., vol. 62, no. 9, pp. 3084-3091, Sep. 2014.
[5] D.-M. Shin, S.-C. Lim, and K. Yang, “Design of length-compatible
polar codes based on the reduction of polarizing matrices,” IEEE Trans.
Commun., vol. 61, no. 7, pp. 2593-2599, Jul. 2013.
[6] M. Seidl, A. Schenk, C. Stierstorfer, and J. B. Huber, “Polar-coded
modulation,” IEEE Trans. Commun., vol. 61, no. 10, pp. 4108-4119,
Oct. 2013.
[7] A. Eslami and H. Pishro-Nik, “On finite-length performance of polar
codes: Stopping sets, error floor, and concatenated design,” IEEE Trans.
Commun., vol. 61, no. 3, pp. 919-929, Mar. 2013.
[8] E. Hof, I. Sason, S. Shamai, and C. Tian, “Capacity-achieving polar
codes for arbitrarily permuted parallel channels,” IEEE Trans. Inform.
Theory, vol. 59, no. 3, pp. 1505-1516, Mar. 2013.
[9] R. Mori and T. Tanaka, “Performance and construction of polar codes
on symmetric binary-input memoryless channels,” in Proc. IEEE Int.
Symp. Inf. Theory (ISIT), Jun. 2009, pp. 1496-1500.
[10] R. Mori and T. Tanaka, “Performance of polar codes with the construc-
tion using density evolution,” IEEE Commun. Lett., vol. 13, no. 7, pp.
519-521, Jul. 2009.
[11] R. Pedarsani, S. H. Hassani, I. Tal, and E. Telatar, “On the construction
of polar codes,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Aug. 2011,
pp. 11-15.
[12] R. Pedarsani, “Polar codes: Construction and performance analysis,”
Master’s thesis, Swiss Federal Institute of Technology (EPFL), Lau-
sanne, Switzerland, Jun. 2011.
[13] I. Tal and A. Vardy, “How to construct polar codes,” IEEE Trans. Inform.
Theory, vol. 59, no. 10, pp. 6562-6582, Oct. 2013.
[14] M. Andersson, R. F. Schaefer, T. J. Oechtering, and M. Skoglund,
“Polar coding for bidirectional broadcast channels with common and
confidential messages,” IEEE J. Select. Areas Commun., vol. 31, no. 9,
pp. 1901-1908, Sep. 2013.
[15] D. U. Fayyaz and J. R. Barry, “Low-complexity soft-output decoding
of polar codes,” IEEE J. Select. Areas Commun., vol. 32, no. 5, pp.
958-966, May 2014.
[16] K. Niu, K. Chen, J. Lin, and Q.-T. Zhang, “Polar codes: Primary
concepts and practical decoding algorithms,” IEEE Commun. Mag., vol.
52, no. 7, pp. 192-203, Jul. 2014.
[17] A. Alamdar-Yazdi and F. R. Kschischang, “A simplified successive-
cancellation decoder for polar codes,” IEEE Commun. Lett., vol. 15,
no. 12, pp. 1378-1380, Dec. 2011.
[18] G. Sarkis and W. J. Gross, “Increasing the throughput of polar decoders,”
IEEE Commun. Lett., vol. 17, no. 4, pp. 725-728, Apr. 2013.
[19] Z. Huang, C. Diao, J. Dai, C. Duanmu, X. Wu, and M. Chen, “An
improvement of modified successive-cancellation decoder for polar
codes,” IEEE Commun. Lett., vol. 17, no. 12, pp. 2360-2363, Dec. 2013.
[20] C. Leroux, A. J. Raymond, G. Sarkis, and W. J. Gross, “A semi-parallel
successive-cancellation decoder for polar codes,” IEEE Trans. Signal
Process., vol. 61, no. 2, pp. 289-299, Jan. 2013.
[21] C. Zhang and K. K. Parhi, “Low-latency sequential and overlapped
architectures for successive cancellation polar decoder,” IEEE Trans.
Signal Process., vol. 61, no. 10, pp. 2429-2441, May 2013.
[22] B. Yuan and K. K. Parhi, “Low-latency successive-cancellation polar
decoder architectures using 2-bit decoding,” IEEE Trans. Circuits Syst.
I, Reg. Papers, vol. 61, no. 4, pp. 1241-1254, Apr. 2014.
[23] C. Zhang and K. K. Parhi, “Latency analysis and architecture design of
simplified SC polar decoders,” IEEE Trans. Circuits Syst. II, Exp. Briefs,
vol. 61, no. 2, pp. 115-119, Feb. 2014.
[24] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. J. Gross, “Fast
polar decoders: algorithm and implementation,” IEEE J. Select. Areas
Commun., vol. 32, no. 5, pp. 946-957, May 2014.
[25] Y.-Z. Fan and C.-Y. Tsui, “An efficient partial-sum network architecture
for semi-parallel polar codes decoder implementation,” IEEE Trans.
Signal Process., vol. 62, no. 12, pp. 3165-3179, Jun. 2014.
[26] A. J. Raymond and W. J. Gross, “A scalable successive-cancellation
decoder for polar codes,” IEEE Trans. Signal Process., vol. 62, no. 20,
pp. 5339-5347, Oct. 2014.
[27] A. Mishra, A. J. Raymond, L. G. Amaru, G. Sarkis, C. Leroux,
P. Meinerzhagen, A. Burg, and W. J. Gross, “A successive cancellation
decoder ASIC for a 1024-bit polar code in 180 nm CMOS,” in Proc.
IEEE Asian Solid-State Circuits Conf. (A-SSCC), Nov. 2012, pp. 205-
208.
[28] O. Dizdar and E. Arıkan, “A high-throughput energy-efficient
implementation of successive cancellation decoder for polar codes using
combinational logic,” 2014, arXiv:1412.3829v3 [Online]. Available:
http://arxiv.org/abs/1412.3829
[29] I. Tal and A. Vardy, “List decoding of polar codes,” IEEE Trans. Inform.
Theory, vol. 61, no. 5, pp. 2213-2226, Mar. 2015.
[30] K. Chen, K. Niu, and J. R. Lin, “List successive cancellation decoding
of polar codes,” Electron. Lett., vol. 48, no. 9, pp. 500-501, Apr. 2012.
[31] K. Niu and K. Chen, “Stack decoding of polar codes,” Electron. Lett.,
vol. 48, no. 12, pp. 695-697, Jun. 2012.
[32] K. Chen, K. Niu, and J. R. Lin, “Improved successive cancellation
decoding of polar codes,” IEEE Trans. Commun., vol. 61, no. 8, pp.
3100-3107, Aug. 2013.
[33] K. Niu, K. Chen, and J. R. Lin, “Low-complexity sphere decoding of
polar codes based on optimum path metric,” IEEE Commun. Lett., vol.
18, no. 2, pp. 332-335, Feb. 2014.
[34] K. Niu and K. Chen, “CRC-aided decoding of polar codes,” IEEE
Commun. Lett., vol. 16, no. 10, pp. 1668-1671, Oct. 2012.
[35] B. Li, H. Shen, and D. Tse, “An adaptive successive cancellation list
decoder for polar codes with cyclic redundancy check,” IEEE Commun.
Lett., vol. 16, no. 12, pp. 2044-2047, Dec. 2012.
[36] K. Niu, K. Chen, and J. R. Lin, “Beyond Turbo codes: Rate-compatible
punctured polar codes,” in Proc. IEEE Int. Conf. Commun. (ICC), Jun.
2013, pp. 3423-3427.
[37] A. Balatsoukas-Stimming, A. J. Raymond, W. J. Gross, and A. Burg,
“Hardware architecture for list successive cancellation decoding of polar
codes,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 8, pp. 609-
613, Aug. 2014.
[38] A. Balatsoukas-Stimming, M. B. Parizi, and A. Burg, “LLR-based
successive cancellation list decoding of polar codes,” in Proc. IEEE
Int. Conf. Acoust., Speech, Signal Process. (ICASSP), May 2014, pp.
3903-3907.
[39] B. Yuan and K. K. Parhi, “Successive cancellation list polar decoder
using log-likelihood ratios,” in Proc. Asilomar Conf. Signals, Syst., and
Computers, Nov. 2014, pp. 548-552.
[40] J. Lin, C. Xiong, and Z. Yan, “A reduced latency list decoding algorithm
for polar codes,” in Proc. IEEE Workshop Signal Process. Syst. (SiPS),
Oct. 2014, pp. 1-6.
[41] C. Xiong, J. Lin, and Z. Yan, “Symbol-based successive cancellation
list decoder for polar codes,” in Proc. IEEE Workshop Signal Process.
Syst. (SiPS), Oct. 2014, pp. 1-6.
[42] C. Xiong, J. Lin, and Z. Yan, “Symbol-decision successive cancellation
list decoder for polar codes,” 2015, arXiv:1501.04705 [Online]. Avail-
able: http://arxiv.org/abs/1501.04705
[43] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. J. Gross, “Increasing
the speed of polar list decoders,” in Proc. IEEE Workshop Signal
Process. Syst. (SiPS), Oct. 2014, pp. 1-6.
[44] B. Yuan and K. K. Parhi, “Low-latency successive-cancellation list
decoders for polar codes with multibit decision,” IEEE Trans. Very Large
Scale Integr. Syst., to appear.
[45] C. Zhang, X. You, and J. Sha, “Hardware architecture for list successive
cancellation polar decoder,” in Proc. IEEE Int. Symp. Circuits Syst.
(ISCAS), Jun. 2014, pp. 209-212.
[46] J. Lin and Z. Yan, “Efficient list decoder architecture for polar codes,” in
Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Jun. 2014, pp. 1022-1025.
[47] A. Balatsoukas-Stimming, M. B. Parizi, and A. Burg, “LLR-based
successive cancellation list decoding of polar codes,” IEEE Trans. Signal
Process., vol. 63, no. 19, pp. 5165-5179, Oct. 2015.
[48] L. Amaru, M. Martina, and G. Masera, “High speed architectures for
finding the first two maximum/minimum values,” IEEE Trans. Very Large
Scale Integr. Syst., vol. 20, no. 12, pp. 2342-2346, Dec. 2012.
[49] A. Balatsoukas-Stimming, M. B. Parizi, and A. Burg, “On metric sorting
for successive cancellation list decoding of polar codes,” in Proc. IEEE
Int. Symp. Circuits Syst. (ISCAS), May 2015, pp. 1993-1996.
[50] B. Li and H. Shen, Method and device for decoding polar codes, United
States Patent 20150026543 A1.
[51] C. Cao, Z. Fei, J. Yuan, and J. Kuang, “Low complexity list successive
cancellation decoding of polar codes,” IET Commun., vol. 8, no. 17, pp.
3145-3149, Nov. 2014.
[52] Y.-Z. Fan, J. Chen, C.-Y. Xia, C.-Y. Tsui, J. Jin, H. Shen, and B. Li,
“Low-latency list decoding of polar codes with double thresholding,” in
Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr.
2015, pp. 1042-1046.
[53] Air Interface for Fixed and Mobile Broadband Wireless
Access Systems, IEEE 802.16e, Oct. 2005 [Online]. Available:
http://www.ieee802.org/16/tge
YouZhe Fan (S’11-M’15) received the B.E. degree
in electronic engineering from the Harbin Institute
of Technology, China, in 2009, and the Ph.D. degree
in electronic and computer engineering from the
Hong Kong University of Science and Technology
(HKUST), Hong Kong, in 2015.
He is now a Research Associate in the Depart-
ment of Electronic and Computer Engineering at the
HKUST. His research interests are VLSI architec-
tures and integrated circuit design for communica-
tions and coding theory applications, digital signal
processing systems, and general purpose computing systems. He is currently
working on low-power high-speed VLSI design for wideband wireless MIMO
communications and advanced error-control coding schemes such as low-
density parity-check (LDPC) codes and polar codes.
ChenYang Xia (S’15) received the B.E. degree in
electronic engineering from Shanghai Jiao Tong Uni-
versity, China, in 2013. He is currently pursuing the
M.Phil. degree at the Department of Electronic and
Computer Engineering, the Hong Kong University
of Science and Technology, Hong Kong.
His research interests include VLSI architecture
and implementation for communication systems and
other digital signal processing systems. He is currently
working on high-speed low-complexity FPGA
design for channel codec systems such as polar codes.
Ji Chen (S’15) received the B.E. degree in electronic
information and communications from Huazhong
University of Science and Technology (HUST),
China, in 2013. He is currently working towards the
M.Phil. degree in the Department of Electronic and
Computer Engineering at the Hong Kong University
of Science and Technology (HKUST), Hong Kong.
His research interests are in information theory and
signal processing. He is currently working on the
high-speed low-complexity decoding algorithm de-
sign of polar codes.
Chi-ying Tsui (SM’11) received the B.S. degree in
electrical engineering from the University of Hong
Kong and the Ph.D. degree in computer engineering
from the University of Southern California in 1994.
He joined the Department of Electronic and Com-
puter Engineering, Hong Kong University of Science
and Technology in 1994 and is currently a full
professor in the department. His research interests
include designing VLSI architectures for low power
multimedia and wireless applications, developing
power management circuits and techniques for em-
bedded portable devices and ultralow power systems. He has published more
than 170 refereed publications and holds 10 US patents on power management,
VLSI and multimedia systems.
Dr. Tsui received the Best Paper awards from the IEEE TRANSACTIONS
ON VLSI SYSTEMS in 1995, IEEE ISCAS in 1999, IEEE/ACM ISLPED
in 2007, IEEE DELTA in 2008, and CODES in 2012. He also received the
Design Awards in the IEEE ASP-DAC University Design Contest in 2004
and 2006.
Jie Jin received the B.S. degree in electronic engineering
from Xi’an Jiaotong University and the
Ph.D. degree in electronic and computer engineering
from the Hong Kong University of Science and
Technology in 2009. He joined Huawei Technologies
in 2009 and is currently a senior research engineer.
His research interests include VLSI architectures
for low power communications and channel coding
applications, and digital signal processing systems.
He is currently working on VLSI architectures for
advanced channel coding schemes such as low-
density parity-check codes and polar codes.
Hui Shen (M’09) was born in 1975. He received
the Ph.D. degree in electronics and communication
engineering from the Huazhong University of Sci-
ence and Technology, P.R.China in April 2004. From
April 2004 to September 2007, he was with Techni-
cal Center, Research Department of ZTE Corpora-
tion, Shenzhen, P.R.China as a researcher and stan-
dard senior engineer. Currently, he is with Huawei
Corporation, Shenzhen, P.R.China. His research in-
terests lie in the areas of wireless communications,
design and analysis of multiple-antenna systems,
multi-user MIMO pre-coding, and interference alignment.
Bin Li (M’08) received the Ph.D. degree in com-
munications engineering from the Nanjing Institute
of Communications Engineering, Nanjing, China,
in 1993. From 1996 to 1997, he was a visiting
professor with the School of Engineering Science,
Simon Fraser University, Canada. From 1997 to
2001, he was a member of technical staff in Nortel,
Ottawa. From 2001 to 2005, he was a senior staff
engineer in InterDigital, NY, USA. Since November
2005, he has been a senior expert in Huawei Tech-
nologies, Shenzhen, China. His research interests are
modulation, coding and MIMO.
