A High Throughput List Decoder Architecture for Polar Codes by Lin, Jun et al.
A High Throughput List Decoder Architecture
for Polar Codes
Jun Lin, Student Member, IEEE, Chenrong Xiong, and Zhiyuan Yan, Senior Member, IEEE
Abstract—While long polar codes can achieve the capacity of
arbitrary binary-input discrete memoryless channels when de-
coded by a low complexity successive cancelation (SC) algorithm,
the error performance of the SC algorithm is inferior for polar
codes with finite block lengths. The cyclic redundancy check
(CRC) aided successive cancelation list (SCL) decoding algorithm
has better error performance than the SC algorithm. However,
current CRC aided SCL (CA-SCL) decoders still suffer from
long decoding latency and limited throughput. In this paper, a
reduced latency list decoding (RLLD) algorithm for polar codes
is proposed. Our RLLD algorithm performs the list decoding on
a binary tree, whose leaves correspond to the bits of a polar code.
In existing SCL decoding algorithms, all the nodes in the tree are
traversed and all possibilities of the information bits are considered.
Instead, our RLLD algorithm visits far fewer nodes in
the tree and considers fewer possibilities of the information bits.
When configured properly, our RLLD algorithm significantly
reduces the decoding latency and hence improves throughput,
while introducing little performance degradation. Based on our
RLLD algorithm, we also propose a high throughput list decoder
architecture, which is suitable for larger block lengths due to its
scalable partial sum computation unit. Our decoder architecture
has been implemented for different block lengths and list sizes
using the TSMC 90nm CMOS technology. The implementation
results demonstrate that our decoders achieve significant latency
reduction and area efficiency improvement compared with other
list polar decoders in the literature.
Index Terms—polar codes, successive cancelation decoding, list
decoding, hardware implementation, low latency decoding
I. INTRODUCTION
Polar codes [3] are a significant breakthrough in coding
theory, since they can achieve the channel capacity of binary-
input symmetric memoryless channels [3] and arbitrary dis-
crete memoryless channels [4]. Polar codes of block length N
can be efficiently decoded by a successive cancelation (SC)
algorithm [3] with a complexity of $O(N \log N)$. While polar
codes of very large block length ($N > 2^{20}$ [5]) approach the
capacity of underlying channels under the SC algorithm, for
short or moderate polar codes, the error performance of the
SC algorithm is worse than that of turbo or LDPC codes [6].
Considerable effort [6]–[8] has already been devoted to the
improvement of error performance of polar codes with short
or moderate lengths. An SC list (SCL) decoding algorithm [6]
performs better than the SC algorithm. In [6]–[8], the cyclic
redundancy check (CRC) is used to pick the output codeword
from L candidates, where L is the list size. The CRC-aided
SCL (CA-SCL) decoding algorithm performs much better than
the SCL decoding algorithm at the expense of negligible loss
in code rate.
(Part of the preliminary results were presented at the 2014 IEEE Workshop
on Signal Processing Systems (SiPS 2014) [1] and the 2015 IEEE International
Conference on Acoustics, Speech, and Signal Processing (ICASSP 2015) [2].)
Despite its significantly improved error performance, the
hardware implementations of SC based list decoders [9]–[13]
still suffer from long decoding latency and limited throughput
due to the serial decoding schedule. In order to reduce the
decoding latency of an SC based list decoder, M (M > 1)
bits are decoded in parallel in [14]–[16], where the decoding
speed can be improved by M times ideally. However, for
the hardware implementations of the algorithms in [14]–[16],
the actual decoding speed improvement is less than M times
due to extra decoding cycles on finding the L most reliable
paths among $2^M L$ candidates, where L is the list size. A software
adaptive SSC-list-CRC decoder was proposed in [17]. For a
(2048, 1723) polar+CRC-32 code, the SSC-list-CRC decoder
with L = 32 was shown to be about 7 times faster than an
SC based list decoder. However, it is unclear whether the list
decoder in [17] is suitable for hardware implementation.
In this paper, a tree based reduced latency list decoding
algorithm and its corresponding high throughput architecture
are proposed for polar codes. The main contributions are:
• A tree based reduced latency list decoding (RLLD)
algorithm over the log-likelihood ratio (LLR) domain
is proposed for polar codes. Inspired by the simplified
successive cancelation (SSC) [18] decoding algorithm
and the ML-SSC algorithm [19], our RLLD algorithm
performs the SC based list decoding on a binary tree.
Previous SCL decoding algorithms visit all the nodes
in the tree and consider all possibilities of the information
bits, while our RLLD algorithm visits far fewer
nodes in the tree and considers fewer possibilities of the
information bits. When configured properly, our RLLD
algorithm significantly reduces the decoding latency and
hence improves throughput, while introducing little per-
formance degradation.
• Based on our RLLD algorithm, a high throughput list
decoder architecture is proposed for polar codes. Com-
pared with the state-of-the-art SCL decoders in [10], [12],
[15], our list decoder achieves lower decoding latency and
higher area efficiency (throughput normalized by area).
More specifically, the major innovations of the proposed
decoder architecture are:
• An index based partial sum computation (IPC) algo-
rithm is proposed to avoid copying partial sums directly
when one decoding path needs to be copied to another.
Compared with the lazy copy algorithm in [6], our IPC
algorithm is more hardware friendly since it copies only
path indices, while the lazy copy algorithm needs more
complex index computation.
• Based on our IPC algorithm, a hybrid partial sum unit
(Hyb-PSU) is proposed so that our list decoder is suitable
for larger block lengths. The Hyb-PSU is able to store
most of the partial sums in area efficient memories such
as register file (RF) or SRAM, while the partial sum units
(PSUs) in [9], [10], [12] store partial sums in registers,
which need much larger area when the block length N
is large. Compared with the PSU of [10], our Hyb-PSU
achieves an area saving of 23% and 63% for block length
$N = 2^{13}$ and $2^{15}$, respectively, under the TSMC 90nm
CMOS technology.
• For our RLLD algorithm, when certain types of nodes are
visited, each current decoding path splits into multiple
ones, among which the L most reliable paths are kept.
In this paper, an efficient path pruning unit (PPU) is
proposed to find the L most reliable decoding paths
among the split ones. For our high throughput list de-
coder architecture, the proposed PPU is the key to the
implementation of our RLLD algorithm.
• For the fixed-point implementation of our RLLD algo-
rithm, a memory efficient quantization (MEQ) scheme is
used to reduce the number of stored bits. Compared with
the conventional quantization scheme, our MEQ scheme
reduces the number of stored bits by 17%, 25% and 27%
for block length $N = 2^{10}$, $2^{13}$ and $2^{15}$, respectively, at
the cost of slight error performance degradation.
Note that the SSC and ML-SSC algorithms reduce the
decoding latency by first performing the SC decoding on a binary tree and
then pruning the binary tree. Inspired by this idea, our RLLD
algorithm performs the SC based list decoding algorithm on a
binary tree. The low-latency list decoding algorithm [17] also
performs the list decoding algorithm on a binary tree. Our
work [1] and the decoding algorithm in [17] are developed
independently. While both our RLLD algorithm and the low-
latency list decoding algorithm in [17] visit fewer nodes in
the binary tree so as to reduce the decoding latency, there are
some differences:
• Compared with the decoding algorithm in [17], our
RLLD algorithm visits fewer nodes. Inspired by the
ML-SSC algorithm, our RLLD algorithm processes cer-
tain arbitrary rate nodes [18] in a fast way.
• When a rate-1 node [18] is visited, our RLLD algorithm
employs a less complex and hardware friendly algorithm
to compute the returned constituent codewords.
• Our RLLD algorithm is based on LLR messages, while
the algorithm in [17] is based on logarithm likelihood
(LL) messages, which require a larger memory to store.
In terms of hardware implementations, compared with state-
of-the-art SC list decoders [9], [10], [12], [13], [15], [16], our
high throughput list decoder architecture shows advantages in
various aspects:
• For the high throughput list decoder architecture, LLR
messages are employed, while LL messages were used in [9],
[10], [15], [16]. LL messages require more
quantization bits and hence a larger memory to store. The
area efficient memory architecture in [10] is employed
to store all LLR messages. LLR messages were also
employed in [12], [13]. However, the register based
memories in [12], [13] suffer from excessive area and
power consumption when N is large.
• Our list decoder architecture employs a Hyb-PSU, which
is scalable for polar codes of large block lengths. The
register based PSUs of the list decoders in [9], [10],
[12] suffer from area overhead when the block length
is large. Instead of copying partial sums directly, our
scalable PSU copies only decoding path indices, which
avoids additional energy consumption.
The proposed high throughput list decoder architecture has
been implemented for several block lengths and list sizes under
the TSMC 90nm CMOS technology. The implementation re-
sults show that our decoders outperform existing SCL decoders
in both decoding latency and area efficiency. For example,
compared with the decoders of [12], the area efficiency and
decoding latency of our decoders are 1.59 to 32.5 times and
3.4 to 6.8 times better, respectively.
For our RLLD algorithm and the corresponding decoder
architecture, when computing the returned constituent code-
words from an FP node or a rate-1 node, the returned L
constituent codewords may not be the L most reliable ones
among all candidates. This kind of approximation leads to
more efficient hardware implementation of our list decoding
algorithm at the cost of certain performance degradation. In
contrast, existing SC list decoders in [6], [13] usually select
the L most reliable candidates.
The rest of the paper is organized as follows. Related
preliminaries are reviewed in Section II. The proposed RLLD
algorithm is presented in Section III. The high throughput list
decoder architecture is presented in Section IV. In Section V,
the implementation and comparison results are shown. Finally,
the conclusion is drawn in Section VI.
II. PRELIMINARIES
A. Polar Codes
Let $u_0^{N-1} = (u_0, u_1, \cdots, u_{N-1})$ denote the data bit sequence
and $x_0^{N-1} = (x_0, x_1, \cdots, x_{N-1})$ the corresponding codeword,
where $N = 2^n$. Under the polar encoding,
$x_0^{N-1} = u_0^{N-1} B_N F^{\otimes n}$ ($n > 1$), where $B_N$ is the bit reversal
permutation matrix, and $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$.
Here $\otimes n$ denotes the $n$th Kronecker power,
$F^{\otimes n} = F \otimes F^{\otimes (n-1)}$ and $F^{\otimes 0} = 1$.
For $i = 0, 1, \cdots, N-1$, $u_i$ is either an information bit or a
frozen bit, which is usually set to zero. For an (N, K) polar
code, there are a total of K information bits within $u_0^{N-1}$.
The encoding graph of a polar code with N = 8 is shown in
Fig. 1.
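The encoding rule above can be illustrated with a minimal sketch, assuming small block lengths where building the full matrix explicitly is acceptable. The function names (`kron_power`, `bit_reverse`, `polar_encode`) are hypothetical helpers introduced only for this illustration and are not part of the paper.

```python
def kron_power(n):
    """Return the n-th Kronecker power of F = [[1, 0], [1, 1]] as a list of rows."""
    g = [[1]]                                        # F to the 0-th power is 1
    for _ in range(n):
        top = [row + [0] * len(row) for row in g]    # block row [G 0]
        bottom = [row + row for row in g]            # block row [G G]
        g = top + bottom                             # F (x) G
    return g


def bit_reverse(i, n):
    """Reverse the n-bit binary representation of index i."""
    return int(format(i, "0{}b".format(n))[::-1], 2)


def polar_encode(u):
    """Compute x = u * B_N * F^(n-th Kronecker power), arithmetic modulo 2."""
    size = len(u)
    n = size.bit_length() - 1
    assert n >= 1 and size == 1 << n, "block length must be a power of two"
    permuted = [u[bit_reverse(j, n)] for j in range(size)]   # u * B_N
    g = kron_power(n)
    return [sum(permuted[i] * g[i][j] for i in range(size)) % 2
            for j in range(size)]
```

For instance, under this sketch `polar_encode([0, 1, 0, 0])` selects the row of the second Kronecker power indexed by the bit-reversed position of u1, giving `[1, 0, 1, 0]`.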
B. Prior Tree-Based SC Algorithms
A polar code of block length $N = 2^n$ can also be
represented by a full binary tree $G_n$ of depth n [18], where
each node of the tree is associated with a constituent code.
For example, for node 1 shown in Fig. 2, the corresponding
constituent code is the set {(s20, s22, s24, s26)}, where each
element (s20, s22, s24, s26) relates to the data word $u_0^7$ as
shown in Fig. 1. The binary tree representation of an (8, 3)
polar code is shown in Fig. 2, where the black and white leaf
nodes correspond to information and frozen bits, respectively.
Fig. 1. Polar encoder with N = 8
Fig. 2. Binary tree representation of an (8, 3) polar code
There are three types of nodes in a binary tree representation
of a polar code: rate-0, rate-1 and arbitrary rate nodes. The
leaf nodes of rate-0 and rate-1 nodes correspond to only
frozen and information bits, respectively. The leaf nodes of
an arbitrary rate node are associated with both information
and frozen bits. The rate-0, rate-1 and arbitrary rate nodes
in Fig. 2 are represented by circles in white, black and gray,
respectively.
The SC algorithm can be mapped on $G_n$, where each
node acts as a decoder for its constituent code. The SC
algorithm is initialized by feeding the root node with the channel
LLRs, $(Y_0, Y_1, \cdots, Y_{N-1})$, where
$Y_i = \log(\Pr(y_i|x_i = 0)/\Pr(y_i|x_i = 1))$ and $(y_0, y_1, \cdots, y_{N-1})$ is the received
channel message vector. As shown in Fig. 2, the decoder at
node v receives a soft information vector $\alpha_v$ and returns a
constituent codeword $\beta_v$. When a non-leaf node v is activated
by receiving an LLR vector $\alpha_v$, it calculates a soft information
vector $\alpha_v^0$ and sends it to its left child. Node v then waits until
it receives a constituent codeword $\beta_v^0$, and then computes and
sends a soft information vector $\alpha_v^1$ to its right child. Once the
right child returns a constituent codeword $\beta_v^1$, node v computes
and returns a constituent codeword $\beta_v$. When a leaf node v is
activated, the returned constituent codeword $\beta_v$ contains only
one bit $\beta_v[0]$, where $\beta_v[0]$ is set to 0 if leaf node v is associated
with a frozen bit; otherwise, $\beta_v[0]$ is calculated by making a
hard decision on the received LLR $\alpha_v[0]$, where
$$\beta_v[0] = h(\alpha_v[0]) = \begin{cases} 0, & \alpha_v[0] > 0, \\ 1, & \alpha_v[0] < 0. \end{cases} \tag{1}$$
From the root node, all nodes in a tree are activated in a
recursive way for the SC algorithm. Once $\beta_v$ for the last leaf
node is generated, the codeword $x_0^{N-1}$ can be obtained by
combining and propagating $\beta_v$ up to the root node.
The SSC decoding algorithm in [18] simplifies the process-
ing of both rate-0 and rate-1 nodes. Once a rate-0 node is
activated, it immediately returns the all zero vector. Once a
rate-1 node is activated, a constituent codeword is directly
calculated by making hard decisions on the received soft
information vector as shown in Eq. (1). The ML-SSC decoding
algorithm [19] further accelerates the SSC decoding algorithm
by performing the exhaustive-search ML decoding on some
resource constrained arbitrary rate nodes, which are called
ML nodes in [19]. For an ML node with layer index t, the
constituent codeword passed to the parent node $p_v$ is
$$\beta_v = \arg\max_{x \in C} \sum_{i=0}^{2^{n-t}-1} (1 - 2x[i])\,\alpha_v[i], \tag{2}$$
where C is the constituent code associated with node v.
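The exhaustive search in Eq. (2) can be sketched as follows, assuming the constituent code is small enough to be enumerated as an explicit list of codewords. The names `correlation` and `ml_node_decode` are hypothetical and used only for illustration.

```python
def correlation(alpha, codeword):
    """Sum of (1 - 2 x[i]) * alpha[i], the objective maximized in Eq. (2)."""
    return sum((1 - 2 * bit) * llr for llr, bit in zip(alpha, codeword))


def ml_node_decode(alpha, code):
    """Return the codeword of `code` with the largest correlation to the LLRs.

    `code` is an explicit list of candidate codewords (an assumption of this
    sketch; a hardware decoder would not store the code this way).
    """
    return max(code, key=lambda codeword: correlation(alpha, codeword))
```

With a length-2 repetition code `[[0, 0], [1, 1]]` and LLRs `[1.0, -3.0]`, the two correlations are -2.0 and 2.0, so the sketch returns `[1, 1]`.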
C. LLR Based List Decoding Algorithms
For SCL decoding algorithms [6], [9], [13], when decoding
an information bit $u_i$, each decoding path splits into two paths
with $\hat{u}_i$ being 0 and 1, respectively. Thus 2L path metrics
are computed and the L paths corresponding to the L minimum
path metrics are kept. The list decoding algorithms [6], [9] are
performed in either the probability or the logarithmic likelihood (LL)
domain. In [13], an LLR based list decoding algorithm was
proposed to reduce the message memory requirement and the
computational complexity of the LL based list decoding algorithm.
For decoding path l ($l = 0, 1, \cdots, L-1$), the LLR based list
decoding algorithm employs a novel approximated path metric
$$\mathrm{PM}_l^{(i)} = \sum_{k=0}^{i} D(L_n^{(k)}[l], \hat{u}_k[l]), \tag{3}$$
where $D(L_n^{(k)}[l], \hat{u}_k[l])$ is set to 0 if $h(L_n^{(k)}[l])$ equals $\hat{u}_k[l]$ and to
$|L_n^{(k)}[l]|$ otherwise. Here
$$L_n^{(k)}[l] \triangleq \log \frac{W_n^{(k)}(y_0^{N-1}, \hat{u}_0^{k-1}[l] \,|\, 0)}{W_n^{(k)}(y_0^{N-1}, \hat{u}_0^{k-1}[l] \,|\, 1)}$$
and $y_0^{N-1} = (y_0, y_1, \cdots, y_{N-1})$ is the received channel message
vector.
III. REDUCED LATENCY LIST DECODING ALGORITHM
A. SCL Decoding on a Tree
Similar to the SSC decoding algorithm, we also perform
the SC based list decoding algorithms [6], [9] on a full
binary tree $G_n$ [1], [17]. The SCL decoding is initiated
by sending the received channel LLR vector to the root
node of $G_n$. As shown in Fig. 3, without loss of generality,
each internal node v in $G_n$ is activated by receiving L
LLR vectors, $\alpha_{v,0}, \alpha_{v,1}, \cdots, \alpha_{v,L-1}$, from its parent node $v_p$
and is responsible for producing L constituent codewords,
$\beta_{v,0}, \beta_{v,1}, \cdots, \beta_{v,L-1}$, where $\alpha_{v,l}$ and $\beta_{v,l}$ correspond to
decoding path l for $l = 0, 1, \cdots, L-1$. Suppose the layer
index of node v is t; then $\alpha_{v,l}$ and $\beta_{v,l}$ have $2^{n-t}$ LLR messages
and binary bits, respectively, for $l = 0, 1, \cdots, L-1$.
Once a non-leaf node v is activated, it calculates L LLR
vectors, $\alpha_{v_L,0}, \alpha_{v_L,1}, \cdots, \alpha_{v_L,L-1}$, and passes them to its left
child node $v_L$, where
$$\alpha_{v_L,l}[i] = f(\alpha_{v,l}[2i], \alpha_{v,l}[2i+1]) \tag{4}$$
for $0 \leq i < 2^{n-t-1}$ and $l = 0, 1, \cdots, L-1$. Here
$f(a, b) = 2\tanh^{-1}(\tanh(a/2)\tanh(b/2))$ and can be approximated as:
$$f(a, b) \approx \mathrm{sign}(a) \cdot \mathrm{sign}(b) \min(|a|, |b|). \tag{5}$$
Node v then waits until it receives L codewords, $\beta_{v_L,0}$,
$\beta_{v_L,1}, \cdots, \beta_{v_L,L-1}$, from $v_L$. In the following step, node v
calculates another L LLR vectors, $\alpha_{v_R,0}, \alpha_{v_R,1}, \cdots, \alpha_{v_R,L-1}$,
and passes them to its right child node $v_R$, where
$$\alpha_{v_R,l}[i] = g(\alpha_{v,l}[2i], \alpha_{v,l}[2i+1], \beta_{v_L,l}[i]) = \alpha_{v,l}[2i](1 - 2\beta_{v_L,l}[i]) + \alpha_{v,l}[2i+1] \tag{6}$$
for $0 \leq i < 2^{n-t-1}$ and $l = 0, 1, \cdots, L-1$.
Finally, after node v receives L codewords, $\beta_{v_R,0}, \beta_{v_R,1},
\cdots, \beta_{v_R,L-1}$, from $v_R$, it calculates $\beta_{v,0}, \beta_{v,1}, \cdots, \beta_{v,L-1}$
and passes them to its parent node $v_p$, where
$$(\beta_{v,l}[2i], \beta_{v,l}[2i+1]) = (\beta_{v_L,l}[i] \oplus \beta_{v_R,l}[i], \beta_{v_R,l}[i]), \tag{7}$$
for $0 \leq i < 2^{n-t-1}$ and $l = 0, 1, \cdots, L-1$.
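The node schedule of Eqs. (4), (6) and (7) can be sketched for a single decoding path (L = 1), which reduces to plain SC decoding on the tree. This is a minimal illustration, not the decoder of the paper: it uses the min-sum approximation of Eq. (5), it resolves an LLR of exactly zero to bit 0 (an assumption, since the hard decision rule does not cover that case), and the names `f`, `g`, `combine`, `tree_encode` and `sc_decode` are hypothetical.

```python
def f(a, b):
    """Min-sum approximation of Eq. (5)."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))


def g(a, b, bit):
    """Eq. (6): LLR sent to the right child given the left child's bit."""
    return a * (1 - 2 * bit) + b


def combine(beta_left, beta_right):
    """Eq. (7): interleave the two child codewords into the parent codeword."""
    out = []
    for bl, br in zip(beta_left, beta_right):
        out += [bl ^ br, br]
    return out


def tree_encode(u):
    """Build the root codeword from the leaf bits using Eq. (7) only."""
    if len(u) == 1:
        return list(u)
    half = len(u) // 2
    return combine(tree_encode(u[:half]), tree_encode(u[half:]))


def sc_decode(alpha, frozen):
    """Return (constituent codeword, decoded leaf bits) for one decoding path."""
    if len(alpha) == 1:
        # Assumption: an LLR of exactly zero is decided as bit 0.
        bit = 0 if frozen[0] or alpha[0] >= 0 else 1
        return [bit], [bit]
    half = len(alpha) // 2
    alpha_left = [f(alpha[2 * i], alpha[2 * i + 1]) for i in range(half)]
    beta_left, u_left = sc_decode(alpha_left, frozen[:half])
    alpha_right = [g(alpha[2 * i], alpha[2 * i + 1], beta_left[i])
                   for i in range(half)]
    beta_right, u_right = sc_decode(alpha_right, frozen[half:])
    return combine(beta_left, beta_right), u_left + u_right
```

In a noiseless setting (every channel LLR has the sign of its code bit), this sketch recovers the leaf bits exactly, which is a convenient sanity check on the indexing of Eqs. (4), (6) and (7).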
Fig. 3. Node activation schedule for SC based list decoding on $G_n$
For $l = 0, 1, \cdots, L-1$, $\mathrm{PM}_l$ is the path metric associated
with decoding path l and is initialized with 0. When a leaf node
v associated with an information bit is activated, decoding path
l splits into two paths with $\beta_{v,l}$ being 0 and 1, respectively.
Note that the layer index of a leaf node is n, hence $\alpha_{v,l}$ and
$\beta_{v,l}$ have only one LLR and binary bit, respectively, when
node v is a leaf node. For the SCL decoding, 2L expanded
path metrics are computed, where
$$\mathrm{PM}_l^j = \mathrm{PM}_l + D(\alpha_{v,l}, j), \tag{8}$$
for $j = 0, 1$ and $l = 0, 1, \cdots, L-1$. Here $D(\alpha_{v,l}, j) = 0$ if
$h(\alpha_{v,l})$ equals j; otherwise, $D(\alpha_{v,l}, j) = |\alpha_{v,l}|$. Suppose
the L minimum expanded path metrics are
$\mathrm{PM}_{a_0}^{j_0}, \mathrm{PM}_{a_1}^{j_1}, \cdots, \mathrm{PM}_{a_{L-1}}^{j_{L-1}}$,
which correspond to the L most reliable paths; then
$\beta_{v,l} = j_l$ for $l = 0, 1, \cdots, L-1$. Decoding path $a_l$ will be
copied to decoding path l before further partial sum and LLR
vector computations. For each decoding path l, the path metric is
also updated with $\mathrm{PM}_l = \mathrm{PM}_{a_l}^{j_l}$. When a leaf node v associated
with a frozen bit is activated, $\beta_{v,l} = 0$ for $l = 0, 1, \cdots, L-1$
are passed to its parent node $v_p$, and the path metric is updated as
$\mathrm{PM}_l = \mathrm{PM}_l + D(\alpha_{v,l}, 0)$.
The SCL algorithm on a tree described above is equivalent
to the SCL algorithms in [6], [9].
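The path split at an information leaf, Eq. (8), can be sketched as below. The sketch assumes a software sort in place of a hardware sorter, breaks ties between equal metrics by path index and then bit value (an arbitrary choice), and treats an LLR of zero as a hard decision of 0. The names `penalty` and `split_at_info_leaf` are hypothetical.

```python
def penalty(alpha, bit):
    """D(alpha, j): 0 if j agrees with the hard decision, |alpha| otherwise."""
    hard = 0 if alpha >= 0 else 1        # assumption: zero LLR decides 0
    return 0.0 if bit == hard else abs(alpha)


def split_at_info_leaf(path_metrics, alphas):
    """Expand L paths into 2L candidates and keep the L smallest metrics.

    Returns a list of (new metric, source path a_l, leaf bit j_l) tuples,
    one per surviving path l.
    """
    num_paths = len(path_metrics)
    candidates = [(pm + penalty(alpha, j), l, j)
                  for l, (pm, alpha) in enumerate(zip(path_metrics, alphas))
                  for j in (0, 1)]
    candidates.sort()
    return candidates[:num_paths]
```

For example, with path metrics `[0.0, 1.0]` and leaf LLRs `[2.0, -0.5]`, the four expanded metrics are 0.0, 2.0, 1.5 and 1.0, and the two survivors are path 0 with bit 0 and path 1 with bit 1.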
B. Proposed RLLD Algorithm
In this paper, a reduced latency list decoding (RLLD)
algorithm is proposed to reduce the decoding latency of SC
list decoding for polar codes. For a node v, let $I_v$ denote the
total number of its leaf nodes that are associated with information
bits. Let $X_{th}$ be a predefined threshold value and $X_0$ and $X_1$
be predefined parameters. Our RLLD algorithm performs the
SC based list decoding on $G_n$ and follows the node activation
schedule in Section III-A, except when certain types of nodes
are activated. These nodes calculate and return the codewords
to their parent nodes while updating the decoding paths and
their metrics, without activating their child nodes. Specifically:
• When a rate-0 node v is activated, $\beta_{v,l}$ is a zero vector
for $l = 0, 1, \cdots, L-1$.
• When a rate-1 node v with $I_v > X_{th}$ is activated, $\beta_{v,l}$ is
just the hard decision of $\alpha_{v,l}$ for $l = 0, 1, \cdots, L-1$.
For polar codes constructed in [20], [21], we observe
that the polarized channel capacities of the information
bits corresponding to rate-1 nodes with $I_v > X_{th}$ are
greater than those of the other information bits. Hence,
for rate-1 nodes with $I_v > X_{th}$, our RLLD algorithm
considers only the most reliable candidate codeword for
each decoding path, since the underlying channels are more reliable.
• When a rate-1 node v with $I_v \leq X_{th}$ is activated,
the returned codewords are calculated by a candidate
generation (CG) algorithm, which is proposed later.
• Let t denote the layer index of node v. When an arbitrary
rate node v with $I_v \leq X_0$ and $2^{n-t} \leq X_1$ is activated,
each decoding path splits into $2^{I_v}$ paths. From now on,
such an arbitrary rate node is called a fast processing
(FP) node. A metric based search (MBS) algorithm,
which is proposed later, is used to calculate the returned
codewords.
Moreover, our RLLD algorithm works on a pruned tree. As a
result, our RLLD algorithm visits fewer nodes than the SCL
algorithm in [6], [9]. The full binary tree is pruned in the
following ways:
• Starting from the complete tree representation of a polar
code, label all FP nodes such that the parent node of each
of them is not an FP node. Note that an FP node v is an
arbitrary rate node with $I_v \leq X_0$ and $2^{n-t} \leq X_1$. For
each labeled FP node, remove all its child nodes.
• Based on the pruned tree from the previous step, label all
rate-0 and rate-1 nodes such that the parent node of each
of these rate-0 and rate-1 nodes is not a rate-0 and rate-1
node, respectively. Next, remove all child nodes of
each labeled rate-0 and rate-1 node.
The leaf nodes of the pruned tree from the above two steps
consist of rate-0, rate-1 and FP nodes. The non-leaf nodes of
the pruned tree are arbitrary rate nodes.
When a rate-1 node with $I_v > X_{th}$ or a rate-0 node is
activated, ideally $\mathrm{PM}_l$ is updated with $\mathrm{PM}_l + \Delta_{v,l}$ for
$l = 0, 1, \cdots, L-1$, where
$\Delta_{v,l} = \sum_{i=0}^{I_v-1} D(\alpha_{v,l}[i], \beta_{v,l}[i])$. For
each rate-1 node with $I_v > X_{th}$, $\Delta_{v,l} = 0$ since $\beta_{v,l}$ is the
hard decision of $\alpha_{v,l}$. However, for a rate-0 node, $\Delta_{v,l}$ could
have a non-zero value. For our RLLD algorithm, $\Delta_{v,l}$ is also
set to 0 for each rate-0 node, since the resulting performance
degradation is negligible. By setting $\Delta_{v,l}$ to 0, we no longer
need to calculate the $\alpha_{v,l}$ sent to a rate-0 node.
1) Proposed CG Algorithm: When a rate-1 node with
$I_v \leq X_{th}$ is activated, at most L codewords from the same
decoding path can be passed to the parent node. Hence, instead
of considering all $2^{I_v}$ candidate codewords for each decoding
path, it is enough to find only the
L most reliable codewords among the $2^{I_v}$ candidates for each
decoding path. When $I_v$ is large (e.g., $I_v > 32$), finding
the L most reliable codewords is computationally intensive
and lacks efficient hardware implementations. For our RLLD
algorithm, we consider only the W (W < L) most reliable
codewords among the $2^{I_v}$ candidates for each decoding path. In
this paper, W is set to 2, since it results in efficient hardware
implementations at the cost of negligible performance loss.
When W = 2, the proposed CG algorithm, shown in
Alg. 1, is used to calculate the codewords passed to the
parent node. Besides, the CG algorithm also outputs L list
indices, $a_0, a_1, \cdots, a_{L-1}$, which indicate that decoding path
$a_l$ needs to be copied to path l. Suppose the layer index of
such a rate-1 node v is t. For each decoding path l, there
are $2^{I_v} = 2^{2^{n-t}}$ candidate codewords that could be passed to
the parent node $v_p$. However, our CG algorithm considers only
the most reliable codeword $C_{v,l,0}$ and the second most reliable
codeword $C_{v,l,1}$. In order to find these two codewords, each
candidate codeword $C_{v,l,j}$ is associated with a node metric
$$\mathrm{NM}_l^j = \sum_{k=0}^{I_v-1} m_k |\alpha_{v,l}[k]| \tag{9}$$
for $j = 0, 1, \cdots, 2^{I_v}-1$, where $m_k = 0$ if $C_{v,l,j}[k]$
equals $h(\alpha_{v,l}[k])$ and 1 otherwise. As a result, the smaller a
node metric is, the more reliable the corresponding candidate
codeword is. Based on Eq. (9), $C_{v,l,0} = h(\alpha_{v,l})$ is the hard
decision of the received LLR vector $\alpha_{v,l}$. $C_{v,l,1}$ is obtained by
flipping the $k_{M,l}$-th bit of $C_{v,l,0}$, where $k_{M,l}$ is the index of
the LLR element with the smallest absolute value among $\alpha_{v,l}$.
Each decoding path splits into two paths and has two
associated candidate codewords. Alg. 1 calculates 2L expanded
path metrics $\mathrm{PM}_l^j$ for $l = 0, 1, \cdots, L-1$ and $j = 0, 1$ to select
L codewords passed to the parent node. The minL function in
Alg. 1 finds the L smallest values among 2L input expanded
path metrics. Once $\beta_{v,l}$ for $l = 0, 1, \cdots, L-1$ are computed,
decoding path $a_l$ is copied to decoding path l before further
operations.
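The CG steps described above (hard decision, single least-reliable bit flip, selection of the L smallest expanded path metrics) can be sketched in software as follows. This is an illustration under stated assumptions: a plain sort stands in for the minL function, ties are broken by path index and then candidate index, an LLR of zero is decided as 0, and the names `hard_decision` and `candidate_generation` are hypothetical.

```python
def hard_decision(alpha):
    """Element-wise hard decision; a zero LLR is decided as 0 (assumption)."""
    return [0 if a >= 0 else 1 for a in alpha]


def candidate_generation(alphas, path_metrics):
    """Sketch of the CG algorithm with W = 2 candidates per decoding path.

    Returns (codewords beta_l, source path indices a_l, updated path metrics).
    """
    num_paths = len(alphas)
    candidates = []          # entries: (expanded metric, path, choice, codeword)
    for l, alpha in enumerate(alphas):
        k_min = min(range(len(alpha)), key=lambda k: abs(alpha[k]))
        best = hard_decision(alpha)
        second = list(best)
        second[k_min] ^= 1                  # flip the least reliable bit
        candidates.append((path_metrics[l], l, 0, best))
        candidates.append((path_metrics[l] + abs(alpha[k_min]), l, 1, second))
    candidates.sort(key=lambda c: c[:3])    # stands in for minL
    kept = candidates[:num_paths]
    return ([c[3] for c in kept], [c[1] for c in kept], [c[0] for c in kept])
```

With two paths, LLR vectors `[[3.0, -1.0], [0.5, 2.0]]` and path metrics `[0.0, 5.0]`, both survivors come from path 0: its hard decision `[0, 1]` with metric 0.0 and the flipped codeword `[0, 0]` with metric 1.0.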
Algorithm 1: The proposed CG algorithm
input : $\alpha_{v,0}, \alpha_{v,1}, \cdots, \alpha_{v,L-1}$
output: $\beta_{v,0}, \beta_{v,1}, \cdots, \beta_{v,L-1}$; $a_0, a_1, \cdots, a_{L-1}$
for $l = 0$ to $L-1$ do
    $k_{M,l} = \arg\min_{k \in \{0, 1, \cdots, I_v-1\}} |\alpha_{v,l}[k]|$
    $\mathrm{NM}_l^0 = 0$; $C_{v,l,0} = h(\alpha_{v,l})$
    $\mathrm{NM}_l^1 = |\alpha_{v,l}[k_{M,l}]|$; $C_{v,l,1} = \mathrm{Flip}(C_{v,l,0}, k_{M,l})$
    $\mathrm{PM}_l^j = \mathrm{PM}_l + \mathrm{NM}_l^j$ for $j = 0, 1$
$(\mathrm{PM}_{a_0}^{b_0}, \cdots, \mathrm{PM}_{a_{L-1}}^{b_{L-1}}) = \mathrm{minL}(\mathrm{PM}_0^0, \mathrm{PM}_0^1, \cdots, \mathrm{PM}_{L-1}^1)$
for $l = 0$ to $L-1$ do
    $\beta_{v,l} = C_{v,a_l,b_l}$; $\mathrm{PM}_l = \mathrm{PM}_{a_l}^{b_l}$
2) Proposed MBS Algorithm: When an FP node is activated,
each current decoding path expands to $2^{I_v}$ paths, each
of which is associated with a candidate codeword. Similar
to the CG algorithm, the proposed MBS algorithm calculates
L codewords passed to the parent node and L path indices,
$a_0, a_1, \cdots, a_{L-1}$. The returned codewords are calculated
as follows.
• For each candidate codeword $C_{v,l}^j$, calculate its
corresponding node metric $\mathrm{NM}_l^j$ for $j = 0, 1, \cdots, 2^{I_v}-1$
and $l = 0, 1, \cdots, L-1$.
• Calculate $2^{I_v}L$ expanded path metrics $\mathrm{PM}_l^j$ for
$l = 0, 1, \cdots, L-1$ and $j = 0, 1, \cdots, 2^{I_v}-1$.
• Find L expanded path metrics among the $2^{I_v}L$ ones. The
corresponding candidate codewords are passed to the
parent node $v_p$.
To calculate the node metric, we propose a new method
with low computational complexity. In the literature, two
methods can be used: the direct-mapping method (DMM)
shown in Eq. (9) and the recursive channel combination
(RCC) [16]. In terms of computational complexity, the former
needs $2^{I_v}(2^{n-t}-1)L$ additions, where $N = 2^n$ and
t is the layer index of an FP node v. The RCC needs
$\left(\sum_{i=1}^{n-t-1} 2^i 2^{2^{n-t-i}} + 2^{I_v}\right)L$ additions. Compared to the DMM,
the RCC approach needs fewer additions. For our RLLD
algorithm, we want to compute these $2^{I_v}$ node metrics in
parallel. However, the parallel hardware implementations of
the DMM and RCC algorithms require large area consumption.
This will be discussed in more detail in Section IV-C.
In this paper, a hardware efficient node metric computation
method, which takes advantage of both the DMM and the
RCC, is proposed. The proposed method, referred to as
the DR-Hybrid (DRH) method, is shown in Alg. 2, where
$C_{v,l}^j[2i : 2i+1] = (C_{v,l}^j[2i], C_{v,l}^j[2i+1])$, and r ($0 \leq r \leq 3$) is
represented by a binary tuple $(r_0, r_1)$ of length two, i.e., $r = r_0 + 2r_1$.
In our method, the RCC approach is used to calculate $\theta_{l,i}$ first.
Then, the DMM is carried out.
Algorithm 2: DR-Hybrid method
for $l = 0$ to $L-1$ do
    /* ---------- RCC ---------- */
    for $i = 0$ to $2^{n-t-1}-1$ do
        for $r = 0$ to 3 do
            $\theta_{l,i}[(r_0, r_1)] = (1-2r_0)\alpha_{v,l}[2i] + (1-2r_1)\alpha_{v,l}[2i+1]$;
    /* ---------- DMM ---------- */
    for $j = 0$ to $2^{I_v}-1$ do
        $\mathrm{NM}_l^j = \sum_{i=0}^{2^{n-t-1}-1} \theta_{l,i}[C_{v,l}^j[2i : 2i+1]]$.
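The two phases of Alg. 2 can be sketched for a single decoding path as below. The sketch follows Alg. 2 as written, so each table entry is a signed sum of two LLRs, and it checks the result against a direct term-by-term evaluation of the same signed sum. The candidate codewords are passed in as an explicit list, which is an assumption of this illustration; the names `drh_node_metrics` and `direct_signed_sum` are hypothetical.

```python
def drh_node_metrics(alpha, codewords):
    """Two-phase metric computation for one path, following Alg. 2 as written."""
    half = len(alpha) // 2
    # RCC phase: four pairwise combinations per pair of adjacent LLRs.
    theta = [
        {(r0, r1): (1 - 2 * r0) * alpha[2 * i] + (1 - 2 * r1) * alpha[2 * i + 1]
         for r0 in (0, 1) for r1 in (0, 1)}
        for i in range(half)
    ]
    # DMM phase: one table lookup per bit pair, summed over the codeword.
    return [sum(theta[i][(c[2 * i], c[2 * i + 1])] for i in range(half))
            for c in codewords]


def direct_signed_sum(alpha, codewords):
    """Reference: the same signed sum computed one LLR at a time."""
    return [sum((1 - 2 * bit) * llr for llr, bit in zip(alpha, c))
            for c in codewords]
```

Both functions return the same values; the difference is only in how the additions are grouped, which is what the addition counts in the text compare.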
The DRH method needs $4 \times 2^{n-t-1} + 2^{I_v}(2^{n-t-1}-1)$
additions. Taking $X_0 = 8$ and $X_1 = 16$ as an example, the DMM,
RCC and DRH methods need 3840, 864 and 1824 additions, respectively.
Though our DRH method needs more additions than the RCC,
it results in a more area efficient hardware implementation
when all $2^{I_v}$ node metrics are computed in parallel, since the
RCC method needs more complex multiplexors.
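The three addition counts can be reproduced with a short calculation. The sketch assumes the quoted figures are for a single decoding path (the factor L is left out) and for an FP node with $I_v = X_0 = 8$ and $2^{n-t} = X_1 = 16$; the function names are hypothetical.

```python
def dmm_additions(iv, size):
    """Additions per path for the DMM: 2^Iv * (2^(n-t) - 1)."""
    return 2 ** iv * (size - 1)


def rcc_additions(iv, size):
    """Additions per path for the RCC, with size = 2^(n-t)."""
    depth = size.bit_length() - 1                      # n - t
    recursive = sum(2 ** i * 2 ** (2 ** (depth - i)) for i in range(1, depth))
    return recursive + 2 ** iv


def drh_additions(iv, size):
    """Additions per path for the DRH: 4 * 2^(n-t-1) + 2^Iv * (2^(n-t-1) - 1)."""
    half = size // 2
    return 4 * half + 2 ** iv * (half - 1)


print(dmm_additions(8, 16), rcc_additions(8, 16), drh_additions(8, 16))
# prints: 3840 864 1824
```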
Once we have $2^{I_v}L$ node metrics and corresponding candidate
codewords, $2^{I_v}L$ expanded path metrics
$\mathrm{PM}_l^j = \mathrm{PM}_l + \mathrm{NM}_l^j$ for $l = 0, 1, \cdots, L-1$ and $j = 0, 1, \cdots, 2^{I_v}-1$ can
be computed. The next step is selecting L returned codewords
and their corresponding expanded path metrics.
Directly finding the L minimum values among the $2^{I_v}L$
ones is computationally intensive and lacks efficient hardware
implementations. A bitonic sequence based sorter (BBS) [10]
with $2^{I_v}L$ inputs is able to fulfill this task. Such a BBS
takes $2^{I_v-1}L\left(\sum_{i=1}^{s-1} i\right) + 2^{I_v-2}L$ compare-and-switch (CS)
units [10], where each of them has one comparator and
two 2-to-1 multiplexors and $s = \log_2(2^{I_v}L)$. In order to
simplify the hardware implementation, a two-stage sorting
scheme was proposed in [16], where the first stage selects
the q (q < L) smallest node metrics from the $2^{I_v}$ ones for each
decoding path. The second stage selects the L smallest metrics
from the Lq expanded path metrics produced by the first
stage. Compared with the direct sorting scheme [10], [15],
the hardware implementation of the two-stage sorting scheme
is more efficient at the cost of certain error performance
degradation.
In this paper, our MBS algorithm employs the two-stage
sorting scheme and improves the first stage in the following
two aspects:
• Instead of using a fixed q, our MBS algorithm employs
a dynamic $q_{I_v,L}$ ($q_{I_v,L} \leq L$), which is a power of 2 and
depends on both $I_v$ and L.
• An approximated sorting (ASort) method, which leads to
an efficient hardware implementation, is used to select
$q_{I_v,L}$ metrics from the $2^{I_v}$ ones, though these sorted metrics
are not always the $q_{I_v,L}$ smallest ones.
Our ASort method is illustrated as follows:
• When $2^{I_v} \leq 2L$, the BBS with 2L inputs and L outputs
is used to select the $q_{I_v,L}$ minimum node metrics from the
$2^{I_v}$ ones.
• When $2^{I_v} > 2L$, all $2^{I_v}$ node metrics are divided into
$q_{I_v,L}$ groups:
$$\underbrace{\mathrm{NM}_l^0, \cdots, \mathrm{NM}_l^{m-1}}_{\text{group } 1}, \cdots, \underbrace{\mathrm{NM}_l^{(q_{I_v,L}-1)m}, \cdots, \mathrm{NM}_l^{q_{I_v,L}m-1}}_{\text{group } q_{I_v,L}}.$$
Here $m = 2^{I_v}/q_{I_v,L}$. The two minimum node metrics of each
group are first computed. The BBS computes the minimum
$q_{I_v,L}$ node metrics among the $2q_{I_v,L}$ ones.
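The grouped case of the ASort method (more than 2L node metrics) can be sketched as follows. The sketch uses `heapq.nsmallest` from the Python standard library in place of the hardware minimum finders and the BBS, and it returns metric and index pairs so the selected candidates can be traced back; the function name `asort_grouped` is hypothetical.

```python
import heapq


def asort_grouped(node_metrics, q):
    """Approximate selection of q small metrics out of len(node_metrics).

    The metrics are split into q consecutive groups, the two smallest of each
    group are shortlisted, and the q smallest of the 2q shortlisted values
    are returned as (metric, index) pairs in ascending order.
    """
    group_size = len(node_metrics) // q
    shortlist = []
    for group in range(q):
        members = [(node_metrics[j], j)
                   for j in range(group * group_size, (group + 1) * group_size)]
        shortlist += heapq.nsmallest(2, members)
    return heapq.nsmallest(q, shortlist)
```

The approximation shows up when one group holds more than two of the overall smallest metrics. For 16 metrics `[0, 1, 2, 50, 10, 11, 12, 13, 20, 21, 22, 23, 30, 31, 32, 33]` and q = 4, the sketch selects 0, 1, 10 and 11, whereas the true four smallest values are 0, 1, 2 and 10.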
After the first stage of sorting, the number of expanded path
metrics $N_e$ could be $2L, 4L, \cdots, L \times L$. The second stage of
sorting is the same as that in [16]. A binary tree of 2L-L
BBSs is employed to find the final L minimum expanded
path metrics. Taking $N_e = 4L$ as an example, there are 4L
expanded path metrics:
$\mathrm{PM}_{l_0}^{j_0}, \mathrm{PM}_{l_1}^{j_1}, \cdots, \mathrm{PM}_{l_{4L-1}}^{j_{4L-1}}$. Then
$\mathrm{PM}_{l_0}^{j_0}, \cdots, \mathrm{PM}_{l_{2L-1}}^{j_{2L-1}}$ and
$\mathrm{PM}_{l_{2L}}^{j_{2L}}, \cdots, \mathrm{PM}_{l_{4L-1}}^{j_{4L-1}}$ are applied to two
2L-L BBSs, respectively. Thus, 2L metrics are selected. Then
the 2L-L BBS is employed again to generate the final L
minimum expanded path metrics:
$\mathrm{PM}_{l'_0}^{j'_0}, \mathrm{PM}_{l'_1}^{j'_1}, \cdots, \mathrm{PM}_{l'_{L-1}}^{j'_{L-1}}$.
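The second sorting stage can be sketched as a binary tree of 2L-to-L selections. The sketch assumes that the number of inputs is 2L times a power of two and uses Python's `sorted` in place of each 2L-L BBS; the function name `second_stage` is hypothetical.

```python
def second_stage(expanded_metrics, num_paths):
    """Reduce N_e expanded path metrics to the num_paths smallest ones.

    Each tree level merges blocks pairwise and keeps the num_paths smallest
    values of every merged block, mirroring a tree of 2L-to-L sorters.
    """
    width = 2 * num_paths
    survivors = [sorted(expanded_metrics[i:i + width])[:num_paths]
                 for i in range(0, len(expanded_metrics), width)]
    while len(survivors) > 1:
        survivors = [sorted(survivors[i] + survivors[i + 1])[:num_paths]
                     for i in range(0, len(survivors), 2)]
    return survivors[0]
```

Because every 2L-to-L selection here is exact, the tree returns the same result as sorting all inputs at once; for L = 2 and inputs `[9, 1, 8, 2, 7, 3, 6, 4]` the result is `[1, 2]`.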
C. Parameters of Our RLLD Algorithm
For our RLLD algorithm, the returned codewords from rate-
1 nodes with Iv > Xth are obtained by making hard decisions
on the received LLR vectors. The other rate-1 nodes are pro-
cessed by our CG algorithm. Note that both the hard decision
approach and our CG algorithm could cause potential error
performance degradation since ideally we should consider 2Iv
candidate codewords for each decoding path. With more rate-1
nodes (decreasing Xth) being processed by the hard decision
approach, the decoding latency could be reduced at the cost
of more error performance degradation. Besides, in order to
save computations, path metrics remain unchanged when a
rate-0 node is activated, which may cause error performance
degradation.
The choices of X0 and X1 are tradeoffs between implemen-
tation complexity and achieved decoding latency reduction.
Ideally, we want X0 and X1 to be as large as possible so that
more data bits could be decoded in parallel. Since the number
of adders needed by Alg. 2 is proportional to $2^{X_0}X_1$, the
values of $X_0$ and $X_1$ are limited by hardware implementations.
For the two-stage sorting scheme of our MBS algorithm,
we want $q_{I_v,L}$ to be as small as possible so that the sorting
complexity is minimized. However, reducing $q_{I_v,L}$
could degrade the resulting error performance, since ideally
we need to consider the L most reliable candidate codewords
for each decoding path. As a result, the selection of $q_{I_v,L}$ is a
tradeoff between sorting complexity and error performance.
D. Comparison with Related Algorithms
If we perform the SC based list decoding algorithms [6],
[9] on a tree, then all 2N − 1 nodes of the tree will be
activated. For our RLLD algorithm, denote na as the number
of activated nodes. Then we have na < 2N − 1, where na is
determined by the block length N , the code rate, the locations
of frozen bits and the parameters X0 and X1. X0 and X1 are
used to identify all FP nodes. The reduction in the number
of activated nodes translates into reduced decoding latency
and increased throughput. Take the (8, 3) polar code in Fig. 2
as an example, suppose X0 = 1 and X1 = 2, then only 5
nodes (nodes 0, 1, 2, 5, and 6) need to be activated by our
RLLD algorithm, whereas the algorithms in [6], [9] need to
activate all 15 nodes.
The CA-SCL decoding algorithm was also performed on
a binary tree in [17]. Compared with the low-latency list
decoding algorithm [17], our RLLD algorithm employs the
proposed MBS algorithm to process FP nodes, while FP
nodes were processed by activating its child nodes in [17].
Our MBS algorithm results in decreased decoding latency
at the cost of potential error performance loss. Besides, our
RLLD algorithm takes a simpler approach when a rate-1 node
is activated. When a rate-1 node is activated, a Chase-like
algorithm was used to calculate the L codewords passed to
the parent node in [17]. Compared to the Chase-like algorithm,
our CG algorithm has lower computational complexity and is
more suitable for hardware implementation because:
(1) The Chase-like algorithm in [17] is performed in the
log-likelihood (LL) domain, while our method is performed
in the LLR domain. Compared with our LLR based method, the
Chase-like algorithm takes more additions to calculate the
related metrics.
(2) For each decoding path, the Chase-like algorithm considers
$1 + \binom{c}{1} + \binom{c}{2}$ candidate constituent codewords, where c = 2
in [17]. In contrast, our method considers only two constituent
codewords, which leads to simpler hardware implementations.
(3) In order to find the L best decoding paths and their
constituent codewords, the Chase-like algorithm creates a
candidate path list. The final L candidates are determined by
inserting and removing elements from the list. The Chase-like
algorithm is suitable for software implementations. However,
the hardware implementation of the Chase-like algorithm is
not discussed in [17]. On the other hand, with a bitonic
based sorter [10] (BBS), the L most reliable decoding paths
can be decided in parallel for our CG algorithm.
E. Simulation Results
For an (8192, 4096) polar code, the bit error rate (BER)
performances of the proposed RLLD algorithm as well as other
algorithms are shown in Fig. 4. In Fig. 4, CSx denotes the CA-
SCL decoding algorithm with L = x, where CRC-32 is used.
Rx-y denotes our RLLD algorithm with L = x and Xth = y.
The values of qIv,L’s under different list sizes and Iv’s are
shown in Table I. For all simulated algorithms, the additive
white Gaussian noise (AWGN) channel and binary phase-shift
keying (BPSK) modulation are used. For all simulated RLLD
algorithms, X0 = 8 and X1 = 16.
TABLE I
THE VALUES OF qIv,L’S UNDER DIFFERENT LIST SIZES AND Iv ’S
          Iv = 1    2    3    4    5    6    7    8
L = 2          2    2    2    2    2    2    2    2
L = 4          2    4    4    4    4    4    4    2
L = 8          2    4    8    8    8    8    4    2
L = 16         2    4    8    8    8    8    8    2
L = 32         2    4    8    8    8    4    4    2
Based on the simulation results shown in Fig. 4, we observe
that R2-8 performs nearly the same as CS2 and R2-64.
When the list size increases, compared with CS4, R4-8 shows
obvious error performance degradation when BER is below
$10^{-7}$. The degradation is reduced by increasing Xth to 128,
as we observe that R4-128 performs nearly the same as CS4.
When the list size further increases (e.g. L = 16 and 32), at
low BER level, the error performance degradation exists even
when Xth = 256. As shown in Fig. 4, R16-256 and R32-256
are worse than CS16 and CS32 when BER is below $10^{-5}$
and $10^{-6}$, respectively. Note that for the (8192, 4096) polar
code in this paper, Iv of a rate-1 node is at most 256. The
simulation results of a (1024, 512) polar code show similar
phenomena.
[Plot: bit error rate (BER, $10^{-9}$ to $10^{-1}$) versus SNR (1.2 to 2.8 dB) for SC, CS2, R2-8, R2-64, CS4, R4-8, R4-128, CS8, R8-128, CS16, R16-256, CS32 and R32-256.]
Fig. 4. BER performance for an (8192, 4096) polar code
Depending on the specific list size, it seems that our RLLD
algorithm has performance degradation compared to the CA-
SCL algorithm at certain BER values even when all rate-1
nodes are processed by the proposed CG algorithm. There are
several reasons for the error performance degradation:
(1) For our RLLD algorithm, when a rate-1 node with
Iv ≤ Xth is activated, only the two most reliable constituent
codewords are kept. When list size L is large, there may not be
enough candidate codewords to include the correct codeword,
since our CG algorithm could miss certain good candidate
codewords.
(2) When a rate-1 node with Iv > Xth is activated, only
the most reliable candidate codeword is considered for each
decoding path, which could also cause error performance
degradation.
(3) During the first sorting stage of our MBS algorithm,
when $2^{I_v} > L$, qIv,L is selected to be no greater than L for
certain Iv values for efficient hardware implementation. As a
result, we may lose certain good candidate codewords due to
the limitation on qIv,L.
IV. HIGH THROUGHPUT LIST POLAR DECODER
ARCHITECTURE
A. Top Decoder Architecture
[Block diagram: input Din and output Dout; memories CMEM and IMEM; buffers LBuf0, LBuf1, LBuf2 and CBuf; processing unit arrays PUA0 to PUAL−1; Hyb-PSU with CU0 to CUL−1; PPU; IEnc and CRCC blocks.]
Fig. 5. Decoder top architecture
In this paper, based on the proposed RLLD algorithm, a high
throughput list decoder architecture, shown in Fig. 5, for polar
codes is proposed. In Fig. 5, the channel message memory
(CMEM) stores the received channel LLRs, and the internal
LLR message memory (IMEM) stores the LLRs generated
during the SC computation process. With the concatenation
and split method in our prior work [10], the IMEM is
implemented with area efficient memories, such as register
file (RF) or SRAM. The proposed architecture has L groups
of processing unit arrays (PUAs), each of which contains T
processing units [5] (PUs) and is capable of performing either
the f or the g computation in Eqs. (5) and (6), respectively.
The hybrid partial sum unit (Hyb-PSU) in Fig. 5 consists
of L computation units, CU0, CU1, · · · , CUL−1, which are
responsible for updating the partial sums of L decoding paths,
respectively. The path pruning unit (PPU) in Fig. 5 finds
the list indices and corresponding constituent codewords for
the L surviving decoding paths. The control of our
decoder architecture can be designed based on the instruction
RAM based methodology in [22].
Both our high throughput list decoder architecture in Fig. 5
and that in [10] employ a partial parallel processing method.
Besides, both architectures contain a channel message memory
and internal message memory. However, compared to the ar-
chitecture in [10], the major improvements of our list decoder
architecture are:
(a) Instead of LL messages, our high throughput list decoder
architecture employs LLR messages, which result in more area
efficient internal and channel message memories.
(b) The PPU in Fig. 5 implements our CG and MBS
algorithms, while the PPU in [10] is just a sorter which selects
L values among 2L ones. Due to the proposed PPU, our
decoder architecture achieves much higher throughput than
that in [10].
(c) Our list decoder architecture employs a novel Hyb-PSU,
which is more area and energy efficient than that in [10].
Our Hyb-PSU is based on the proposed index based partial
sum computation algorithm. When a decoding path needs to
be copied to another one, our Hyb-PSU copies only decoding
path indices instead of copying partial sums directly. In
contrast, the PSU in [10] copies partial sums directly, which
incurs additional energy consumption. Our Hyb-PSU stores
most of the partial sums in area efficient memories, while the
PSU in [10] stores all the partial sums in area demanding
registers. Hence, our Hyb-PSU is scalable for larger block
lengths.
B. Memory Efficient Quantization Scheme
For an SC or SCL decoder, the message memory oc-
cupies a large part of the overall decoder area [5], [10].
An SCL decoder needs a channel message memory and an
internal message memory. For an LLR based SCL decoder,
the channel memory stores N channel LLR messages. The
internal message memory stores Ln LLR matrices: $P_{l,t}$ for
l = 0, 1, · · · , L − 1 and t = 1, 2, · · · , n, where $P_{l,t}$ has $2^{n-t}$
LLR messages.
For a fixed point implementation of our RLLD algorithm, it
is straightforward to quantize all LLRs in the internal memory
with Q bits. In this paper, a memory efficient quantization
(MEQ) scheme is proposed to reduce the size of the internal
memory. f(a, b) in Eq. (5) has the same magnitude range
as those of a and b, while the magnitude range of g(a, b, s)
in Eq. (6) is at most twice of those of a and b (s is either
0 or 1). Since P0,t, P1,t, · · · , PL−1,t are computed based on
P0,t−1, P1,t−1, · · · , PL−1,t−1, for a decoding path l, the LLRs
in Pl,t1 may need a greater magnitude range than that of the
LLRs in Pl,t2 , where t1 > t2. Suppose each channel LLR
is quantized with Qc bits, the proposed MEQ scheme is as
follows:
(1) Suppose all LLRs within the internal memory are quantized
with Qm bits. Determine the minimal Qm such that the
error performance degradation of the fixed point decoder
is negligible.
(2) Let $t_1, t_2, \cdots, t_r$ be r integers, where
$t_1 \le t_2 \le \cdots \le t_r \le n$ and $r = Q_m - Q_c$. Denote
$\mathbf{P}_t = (P_{0,t}, P_{1,t}, \cdots, P_{L-1,t})$. Suppose the LLRs associated with
$\mathbf{P}_1, \mathbf{P}_2, \cdots, \mathbf{P}_{t_1}$ are quantized with Qc bits and the remaining
LLRs are quantized with Qm bits. Decide the maximal $t_1$ such
that the resulting fixed point error performance degradation
is negligible. Once $t_1$ is decided, suppose the LLRs within
$\mathbf{P}_{t_1+1}, \mathbf{P}_{t_1+2}, \cdots, \mathbf{P}_{t_2}$ are all quantized with Qc + 1 bits, and find
the maximal $t_2$ such that the corresponding error performance
degradation is negligible. In this way, $t_3, \cdots, t_r$ are decided in
a serial manner so that $\mathbf{P}_{t_i+1}, \mathbf{P}_{t_i+2}, \cdots, \mathbf{P}_{t_{i+1}}$ are quantized
with Qc + i bits for $1 \le i \le r-1$, and $\mathbf{P}_j$ is quantized with
Qm bits for $j > t_r$.
With the proposed MEQ scheme, the number of bits stored
in the internal memory is
$$N_B = \sum_{j=1}^{r+1} \sum_{t=t_{j-1}+1}^{t_j} L \, 2^{n-t} (Q_c + j - 1), \qquad (10)$$
where $t_0 = 0$ and $t_{r+1} = n$ are introduced for convenience.
In order to show the effectiveness of our MEQ scheme, the
error performances of our RLLD algorithm with the proposed
MEQ scheme are shown in Fig. 6, where the RLLD algorithm
with our MEQ scheme is compared with the floating-point CA-
SCL decoding algorithm, floating-point RLLD algorithm, and
RLLD algorithm with a uniform quantization scheme for three
different polar codes, (1024, 512), (8192, 4096) and (32768,
29504) with Xth = 32, 128, 1024, respectively. For all fixed-
point decoders, each channel LLR is quantized with Qc = 5
bits. For the RLLD algorithm with uniform quantization, each
LLR in the internal memory is quantized with Qm = 6 bits for
the length $2^{10}$ and $2^{13}$ polar codes. For the polar code with
a length of $2^{15}$, the uniform quantization takes 7 bits. For
our MEQ scheme, Qm = 7. Since Qm − Qc = 2, we need
to determine two integers, $t_1$ and $t_2$, for our MEQ scheme.
When N = $2^{10}$, $2^{13}$ and $2^{15}$, $(t_1, t_2)$ = (1,2), (3,4) and (4,5),
respectively. As shown in Fig. 6, the performance degradation
caused by our MEQ scheme is small. Compared with the
uniform quantization, the proposed MEQ scheme reduces the
number of stored bits by 4.5%, 13.5% and 27.2% for N = $2^{10}$,
$2^{13}$ and $2^{15}$, respectively. For all the simulation results shown
in Fig. 6, list size L = 4.
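The memory figures above can be checked with a few lines of Python. The sketch below evaluates Eq. (10), read here as the number of bits the internal memory holds under MEQ, and compares it with a uniform Q-bit memory holding the same $2^n - 1$ LLRs per decoding path. The function names are hypothetical, and the two thresholds passed in are the integer pairs listed in the text.

```python
def meq_bits(n, L, Qc, thresholds):
    """Eq. (10): matrices P_t with t_{j-1} < t <= t_j use Qc + j - 1 bits per LLR."""
    bounds = [0] + list(thresholds) + [n]          # t_0 = 0 and t_{r+1} = n
    total = 0
    for j in range(1, len(bounds)):
        for t in range(bounds[j - 1] + 1, bounds[j] + 1):
            total += L * 2 ** (n - t) * (Qc + j - 1)
    return total

def uniform_bits(n, L, Q):
    """All 2^n - 1 internal LLRs of each of the L paths stored with Q bits."""
    return L * (2 ** n - 1) * Q

def reduction(n, L, Qc, thresholds, Q_uniform):
    """Fraction of stored bits removed by MEQ relative to uniform quantization."""
    return 1 - meq_bits(n, L, Qc, thresholds) / uniform_bits(n, L, Q_uniform)
```

With L = 4, Qc = 5, thresholds (3, 4) and a 6-bit uniform memory for n = 13, this sketch gives a reduction of about 13.5%; with thresholds (4, 5) and a 7-bit uniform memory for n = 15 it gives about 27.2%.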
[Plot: frame error rate (FER, $10^{-6}$ to $10^{0}$) versus SNR (1 to 5 dB) for Floating CA-SCL, Floating RLLD, Uniform and MEQ, with curve groups for N = $2^{10}$, $2^{13}$ and $2^{15}$.]
Fig. 6. Effects of the proposed MEQ scheme on the error performances
C. Proposed Path Pruning Unit
When a rate-1 node with Iv ≤ Xth or an FP node is
activated, each decoding path splits into multiple ones and
only the L most reliable paths are kept. The PPU in Fig. 5
implements our CG and MBS algorithms, and is responsible
for calculating L returned codewords, βv,0, βv,1, · · · , βv,L−1,
and L path indices, a0, a1, · · · , aL−1. For l = 0, 1, · · · , L−1,
decoding path l copies from decoding path al before further
decoding steps.
Take L = 4 as an example, the proposed PPU is shown in
Fig. 7, which can be easily adapted to other L values. Our
PPU in Fig. 7 has two types of node metric generation (NG)
units, NG-I and NG-II, which compute the node metrics for
a rate-1 node and an FP node, respectively. NG-Il and NG-IIl
correspond to decoding path l. For decoding path l, the
expanded path metrics $PM_l^j$ are obtained by adding the node
metrics to the path metric $PM_l$, which is stored in the path
metric registers (PMR) and initialized with 0.
When a rate-1 node is activated, NG-Il outputs two node
metrics for l = 0, 1, · · · , L−1. After 2L expanded path metrics
are computed, a stage of metric sorter (MS2L−L) selects the
L minimum metrics and their corresponding codewords from
2L ones. The metrics sorter MS2L−L implements the minL
function in Alg. 1 and can be constructed with a BBS. When
an FP node is activated, L NG-II modules implement the
first part of our two-stage sorting scheme. For each decoding
path, qIv,L node metrics and their corresponding codewords
are computed. The tree of metric sorters selects the L minimum
metrics among the qIv,L × L candidates. This is achieved by log2 qIv,L
stages of metric sorters, where qIv,L is a power of 2. The
output expanded path metrics of the last stage of metric sorter
are saved in the PMR. The corresponding codewords of the
selected L expanded path metrics are also chosen. The related
circuitry is omitted for brevity.
The micro architecture of NG-Il is shown in Fig. 8. The
most complex part of NG-Il is finding the minimum LLR
magnitude and its corresponding index among the LLR vector
$|\alpha_{v,l}| \triangleq (|\alpha_{v,l}[0]|, |\alpha_{v,l}[1]|, \cdots, |\alpha_{v,l}[I_v-1]|)$. Since the node
metric of the most reliable candidate codeword is always 0,
we need to compute $NM_l^1 = |\alpha_{v,l}[k_{M,l}]|$ in Fig. 8, which
is the node metric of the second most reliable candidate
codeword, with a corresponding index $k_{M,l}$. For our list
decoder architecture, for each decoding path, at most T LLRs
are computed in one clock cycle, since we have only T PUs
per decoding path. The Min-1 unit in Fig. 8 is capable of
finding the minimum value, mLLR, and its corresponding
index, mIdx, from at most T parallel inputs. When Iv ≤ T ,
$NM_l^1$ = mLLR and $k_{M,l}$ = mIdx. $C_{v,l,0} = h(\alpha_{v,l})$ in Fig. 8 is
the hard decision of $\alpha_{v,l}$, which is the most reliable candidate
codeword. The second most reliable candidate codeword is
obtained by flipping the $k_{M,l}$-th bit of $C_{v,l,0}$.
[Block diagram: node metric generators NG-I0 to NG-I3 and NG-II0 to NG-II3, three MS8-4 metric sorters, and the PMR holding PM0 to PM3.]
Fig. 7. The proposed architecture for PPU
[Block diagram: Min-1 unit with outputs mLLR and mIdx, registers mLR and mIR, a comparator, and codeword memories HCM0 and HCM1.]
Fig. 8. Hardware architecture of the proposed NG-Il
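The candidate generation performed by NG-I and the subsequent 2L-to-L selection can be sketched in Python as follows. This is a behavioral illustration under stated assumptions, not the hardware: the function names are hypothetical, and the hard decision is assumed to map a negative LLR to bit 1.

```python
import heapq

def cg_candidates(alpha):
    """Return the two most reliable (node metric, codeword) pairs of a rate-1 node."""
    hard = [1 if a < 0 else 0 for a in alpha]      # assumed LLR sign convention
    k = min(range(len(alpha)), key=lambda i: abs(alpha[i]))  # least reliable position
    flipped = hard.copy()
    flipped[k] ^= 1                                 # flip the least reliable bit
    return [(0.0, hard), (abs(alpha[k]), flipped)]

def prune_rate1(path_metrics, alphas):
    """Expand each path into two candidates and keep the L smallest expanded metrics."""
    L = len(path_metrics)
    expanded = []
    for l, (pm, alpha) in enumerate(zip(path_metrics, alphas)):
        for nm, codeword in cg_candidates(alpha):
            expanded.append((pm + nm, l, codeword))  # (metric, parent path, codeword)
    return heapq.nsmallest(L, expanded, key=lambda e: e[0])
```

Each returned tuple carries the parent path index, which plays the role of the path index output by the PPU.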
When Iv > T , suppose T is a power of 2, then Iv is
divisible by T . During each clock cycle, only T LLRs are
fed to NG-Il, and the minimum value and its corresponding
index are computed in a partial parallel way. The minimum
value and associated index of the first T inputs are stored in
mLR and mIR, respectively. The minimum value of the second
group of T inputs is compared with the current value stored
in mLR, and is stored in mLR if it is smaller than the current
value of mLR. This repeats until the whole LLR vector αv,l is
processed. At last, the minimum value of |αv,l| and its index
are stored in mLR and mIR, respectively. The hard decision
of αv,l is stored in the hard decoded constituent codeword
memory (HCM0), and is copied to HCM1 when the second
most reliable constituent codeword is computed.
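The partial parallel minimum search can be modeled in a few lines of Python. The sketch below processes T magnitudes per iteration, one iteration per clock cycle, and keeps the running minimum and its index in variables named after the registers mLR and mIR; the function names are hypothetical.

```python
def min1(values):
    """Min-1 unit: minimum value and its index among at most T inputs."""
    idx = min(range(len(values)), key=lambda i: values[i])
    return values[idx], idx

def partial_parallel_min(abs_llrs, T):
    """Scan |alpha| in groups of T and track the global minimum and its index."""
    mLR, mIR = None, None
    for base in range(0, len(abs_llrs), T):        # one group of T inputs per cycle
        mLLR, mIdx = min1(abs_llrs[base:base + T])
        if mLR is None or mLLR < mLR:              # compare with the stored minimum
            mLR, mIR = mLLR, base + mIdx
    return mLR, mIR
```

For a vector of length Iv, the loop runs Iv/T times, which is the source of the extra clock cycles when Iv > T.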
The micro-architecture of NG-IIl under X0 = 8 and X1 =
16 is shown in Fig. 9, where the block MUX4T256 includes
256 4-to-1 multiplexers. Our NG-IIl consists of two parts:
the first part calculates $2^{I_v}$ node metrics, $NM_l^0, NM_l^1, \cdots, NM_l^{2^{I_v}-1}$,
based on Alg. 2, and the second part implements
the first stage sorting of our MBS algorithm. For L = 4,
when $2^{I_v} > 2L$, the $2^{I_v}$ metrics are first divided into four
groups. The Min-2 [23] block is modified slightly to find
the two minimum node metrics and their associated indices
for each metric group. The MS8-4 block calculates the final
output metrics. When $2^{I_v} = 2L = 8$, the MS8-4 blocks work
directly on the 2L = 8 expanded path metrics. When $2^{I_v} \le L$,
the expanded path metrics are output directly. As shown in
Figs. 7 to 9, our PPU has a long critical path delay, since there
are many levels of logic from the inputs to the outputs. Pipeline
stages should be inserted to improve the overall decoder frequency.
[Block diagram: MUX4T256 multiplexer banks selecting among |αv,l[0]|, · · · , |αv,l[15]| and 0 under controls m0, · · · , m15; SUM blocks producing $NM_l^0$ to $NM_l^{255}$; four Min-2 blocks; one MS8-4 sorter.]
Fig. 9. Architecture of NG-IIl
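For L = 4, the sorting part of NG-II can be sketched in Python as below. The sketch assumes that the four groups are contiguous blocks of the metric vector and that q = 4 metrics are kept; both choices, like the function names, are illustrative assumptions rather than details taken from Fig. 9.

```python
import heapq

def min2(group):
    """Min-2 block: the two smallest (metric, index) pairs of one group."""
    return heapq.nsmallest(2, group)

def first_stage_sort(node_metrics, q=4, n_groups=4):
    """Keep q of the 2^Iv node metrics of one decoding path."""
    indexed = [(m, i) for i, m in enumerate(node_metrics)]
    if len(indexed) <= q:
        return sorted(indexed)                     # few candidates: pass them through
    size = len(indexed) // n_groups
    finalists = []
    for g in range(n_groups):
        finalists += min2(indexed[g * size:(g + 1) * size])
    return heapq.nsmallest(q, finalists)           # behavioral model of MS8-4
```

In this sketch a group contributes at most two metrics, so a third small metric falling in the same group is dropped; the result is therefore not always the exact set of q smallest metrics.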
Based on the DMM method in Eq. (9), the node metric
computation part needs $2^{I_v}(2^{n-t}-1)L$ adders and $2^{I_v} 2^{n-t} L$
2-to-1 multiplexers, where $N = 2^n$ and t is the layer
index of an FP node v. Based on the RCC method, it
takes $\left(\sum_{i=1}^{n-t-1} 2^i \, 2^{2^{n-t-i}} + 2^{I_v}\right)L$ adders, $2^{I_v+1}L$ $2^{2^{n-t-1}}$-to-1
multiplexers and $4 \times 2^{n-t-1} L$ 2-to-1 multiplexers. In
contrast, based on our DRH method, it takes
$4 \times 2^{n-t-1} + 2^{I_v}(2^{n-t-1}-1)$ adders, $2^{I_v} 2^{n-t-1}$ 4-to-1 and $4 \times 2^{n-t-1}$ 2-to-1
multiplexers. Table II compares the hardware resources needed
by the DMM, RCC and DRH methods when X0 = 8,
X1 = 16, and $\alpha_{v,l}[j]$ ($0 \le j < 2^{n-t}$) is a 6-bit LLR. As
shown in Table II, the DRH method requires the smallest total
area. Besides, the implementations based on DMM, RCC and
DRH have roughly the same critical path delay.
TABLE II
HARDWARE RESOURCES NEEDED BY DIFFERENT METHODS PER LIST
                           DMM        RCC          DRH
# of adders                3840       864          1824
# of MUX2−1                4096       32           32
# of MUX4−1                0          0            2048
# of MUX256−1              0          512          0
total area (# of NANDs)    313,967    1,673,810    229,449
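The adder and multiplexer counts in Table II follow from the expressions above. The Python sketch below evaluates them per list (that is, without the factor L), assuming the table corresponds to an FP node with Iv = X0 = 8 and length $2^{n-t}$ = X1 = 16; the function names and dictionary keys are hypothetical.

```python
def dmm_resources(Iv, s):
    """DMM counts per list; s = n - t, so the node length is 2^s."""
    length = 2 ** s
    return {"adders": 2 ** Iv * (length - 1), "mux2": 2 ** Iv * length}

def rcc_resources(Iv, s):
    """RCC counts per list; the wide multiplexers are 2^(2^(s-1))-to-1."""
    adders = sum(2 ** i * 2 ** (2 ** (s - i)) for i in range(1, s)) + 2 ** Iv
    return {"adders": adders, "mux_wide": 2 ** (Iv + 1), "mux2": 4 * 2 ** (s - 1)}

def drh_resources(Iv, s):
    """DRH counts as given in the text."""
    half = 2 ** (s - 1)
    return {"adders": 4 * half + 2 ** Iv * (half - 1),
            "mux4": 2 ** Iv * half, "mux2": 4 * half}
```

Under this assumption, Iv = 8 and s = 4 give the adder and multiplexer counts listed in Table II for all three methods.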
D. Proposed Hybrid Partial Sum Unit
For the list decoder architectures in [9], [10], all partial sums
are stored in registers and the partial sums of decoding path l′
are copied to decoding path l when decoding path l′ needs to
be copied to decoding path l. The PSUs in [9] and [10] need
L(N − 1) and L(N/2 − 1) single bit registers to store all partial
sums, respectively. Thus, for large N , the register based PSU
architectures in [9], [10] are inefficient for two reasons. First,
the area of the PSU is linearly proportional to N . For large
N (e.g. N > $2^{15}$), the area of the PSU is large since registers are
usually area demanding. Second, the power dissipation due to
the copying of partial sums between different decoding paths
is high when N is large.
1) Proposed Index Based Partial Sum Computation Al-
gorithm: In order to avoid copying partial sums directly,
an index based partial sum computation (IPC) algorithm is
proposed in Algorithm 3, where pl[z] (l = 0, 1, · · · , L −
1 and z = 0, 1, · · · , n) is a list index reference. Cl,z for
l = 0, 1, · · · , L − 1 and z = 0, 1, · · · , n are partial sum
matrices [6], [10]. $C_{l,z}$ has $2^{n-z}$ elements, each of which
stores two binary bits.
For our RLLD algorithm, once a rate-0, rate-1 or an FP node
sends L codewords to its parent node, the partial sum compu-
tation is performed after decoding path pruning. Let t denote
the layer index of such a node v. Let (Bn−1, Bn−2, · · · , B0)
denote the binary representation of the index of the last
leaf node belonging to node v, where Bn−1 is the most
significant bit. Let te = n− j, where j is the smallest integer
such that Bj = 0. If Bj = 1 for j = 0, 1, · · · , n − 1,
te = 0. Once βv,0, βv,1, · · · , βv,L−1 are calculated, decoding
path l′ may need to be copied to path l before the following
partial sum computation. Under this circumstance, the index
references are first copied, where pl′ [z] is copied to pl[z] for
z = t, t − 1, · · · , 0. The lazy copy algorithm was proposed
in [6] to avoid copying partial sums directly. However, the lazy
copy algorithm is not suitable for hardware implementation
due to complex index computation. The PSU in [10] copies all
partial sums belonging to one decoding path to the corresponding
locations of another decoding path.
Algorithm 3: Index Based Partial Sum Computation (IPC) Algorithm
input : $t_e$, t, $(\beta_{v,0}, \beta_{v,1}, \cdots, \beta_{v,L-1})$
output: $C_{l,t_e}[j][0]$ for l = 0, 1, · · · , L − 1 and j = 0, 1, · · · , $2^{n-t_e} - 1$
for l = 0 to L − 1 do
    for j = 0 to $2^{n-t} - 1$ do
        if v is the left child node of its parent node then
            $C_{l,t}[j][0] = \beta_{v,l}[j]$; $p_l[t] = l$
        else
            $C_{l,t}[j][1] = \beta_{v,l}[j]$
if v is the left child node of its parent node then exit
for l = 0 to L − 1 do
    for z = t − 1 down to $t_e$ do
        for j = 0 to $2^{n-z-1} - 1$ do
            $v_0 = C_{p_l[z+1],z+1}[j][0]$; $v_1 = C_{l,z+1}[j][1]$
            if z == $t_e$ then
                $C_{l,z}[2j][0] = v_0 \oplus v_1$; $C_{l,z}[2j+1][0] = v_1$
                $p_l[z+1] = p_l[z] = l$
            else
                $C_{l,z}[2j][1] = v_0 \oplus v_1$; $C_{l,z}[2j+1][1] = v_1$
                $p_l[z+1] = l$
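A software model helps to see how the index references replace partial sum copying. The Python sketch below is one possible reading of the IPC algorithm, not a reference implementation: the class and method names are hypothetical, the inner loop is assumed to run over $2^{n-z-1}$ entries, and the update of an index reference is deferred until all entries of a stage have been computed, which mimics hardware that reads all entries in parallel.

```python
def compute_te(last_leaf, n):
    """te = n - j, with j the lowest zero bit of the last leaf index (0 if none)."""
    for j in range(n):
        if not (last_leaf >> j) & 1:
            return n - j
    return 0

class IndexPartialSums:
    def __init__(self, n, L):
        self.n, self.L = n, L
        # C[l][z][j] = [bit from left child, bit from right child]
        self.C = [[[[0, 0] for _ in range(2 ** (n - z))] for z in range(n + 1)]
                  for _ in range(L)]
        self.p = [[l] * (n + 1) for l in range(L)]   # index references p_l[z]

    def copy_paths(self, a, t):
        """Path l continues from path a[l]: copy index references for z = 0..t only."""
        old = [row[:] for row in self.p]
        for l in range(self.L):
            self.p[l][:t + 1] = old[a[l]][:t + 1]

    def update(self, t, te, beta, is_left):
        C, p, n = self.C, self.p, self.n
        for l in range(self.L):
            for j in range(2 ** (n - t)):
                C[l][t][j][0 if is_left else 1] = beta[l][j]
            if is_left:
                p[l][t] = l
        if is_left:
            return
        for l in range(self.L):
            for z in range(t - 1, te - 1, -1):
                side = 0 if z == te else 1
                for j in range(2 ** (n - z - 1)):
                    v0 = C[p[l][z + 1]][z + 1][j][0]   # read through the reference
                    v1 = C[l][z + 1][j][1]
                    C[l][z][2 * j][side] = v0 ^ v1
                    C[l][z][2 * j + 1][side] = v1
                p[l][z + 1] = l                         # deferred reference update
                if z == te:
                    p[l][z] = l
```

A typical call sequence for a node v is `copy_paths(a, t)` followed by `update(t, te, beta, is_left)`, with `te` obtained from `compute_te` applied to the index of the last leaf of v.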
Fig. 10. (a) Top architecture of CUl. (b) Type-I PE. (c) Type-II PE. (d) Inputs and outputs of the CN.
2) Micro Architecture of the Proposed Hybrid Partial Sum
Unit: Based on our IPC algorithm, a Hyb-PSU is proposed
with two improvements. First, some partial sums are stored in
memory, while others are stored in registers. Second, instead
of partial sums, only list index matrices are copied. These
two improvements reduce the area and power overhead of
the partial sum computation unit when N is large. The Hyb-PSU
consists of L computation units, CU0, CU1, · · · , CUL−1,
where the micro architecture of CUl is shown in Fig. 10(a)
and is described as follows.
(a) Let m be a predefined integer parameter. For block
length $N = 2^n$, CUl consists of n stages, where the first
n − m + 1 stages are a binary tree of the type-I and type-II
unit processing elements (PEs) shown in Figs. 10(b) and 10(c),
respectively. Stage z (z ≥ m) has $2^{n-z}$ PEs. Each of the
remaining m − 1 stages has the same circuitry.
(b) Two types of PEs are used in the PE tree in Fig. 10(a).
Suppose the maximal length of a constituent codeword that
is returned from a rate-0, rate-1 or FP node is $2^{\mu}$, then stage
z (z ≥ n − µ) employs only the type-I PEs. The remaining
stages in the PE tree employ the type-II PEs.
(c) Compared with the type-II PE, the type-I PE has an
extra data load unit (DLU). For $PE_{l,z,j}$ within stage z
($j = 0, 1, \cdots, 2^{n-z}-1$), the binary outputs, $o_{l,z,2j}$ and $o_{l,z,2j+1}$,
are connected to $b_{l,z-1,2j}$ and $b_{l,z-1,2j+1}$, respectively. The
wired connections are not shown in Fig. 10(a) for simplicity.
(d) $BM_{l,z}$ (z ≤ m − 1) is a bit memory with $c_{w,z} = 2^{n-z}/T$
words, where each word contains T bits. T is the number
of processing elements belonging to a decoding path in a
partial parallel list decoder. For our memory compiler, if $c_{w,z}$
is greater than a threshold value, then $BM_{l,z}$ is implemented
with an RF. If $c_{w,z}$ is greater than a second, larger threshold value,
then $BM_{l,z}$ is implemented with an SRAM.
(e) The connector module (CN) has two T -bit inputs and
two T -bit outputs. The connections between the outputs and
inputs are
$$\begin{aligned}
O_0[2j] &= I_0[j] \oplus I_1[j], & 0 \le j < T/2 \\
O_0[2j+1] &= I_1[j], & 0 \le j < T/2 \\
O_1[2j-T] &= I_0[j] \oplus I_1[j], & T/2 \le j < T \\
O_1[2j+1-T] &= I_1[j], & T/2 \le j < T
\end{aligned} \qquad (11)$$
(f) For our Hyb-PSU, L computation units are needed. For
each PE within CUl, $m_{l,z,j}$ in Figs. 10(b) and 10(c) is the
output of an L-to-1 multiplexer whose inputs are $d_{0,z,j}, d_{1,z,j}, \cdots, d_{L-1,z,j}$,
where L − 1 of them are from other computation
units. For each CN, $M_{l,z}$ is the output of an L-to-1 multiplexer
whose inputs are $D_{0,z}, D_{1,z}, \cdots, D_{L-1,z}$.
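The connector mapping of Eq. (11) can be written as a short Python function. The function name is hypothetical; inputs and outputs are lists of T bits, with T assumed even.

```python
def connector(I0, I1):
    """CN of Eq. (11): combine two T-bit words into two T-bit output words."""
    T = len(I0)
    O0, O1 = [0] * T, [0] * T
    for j in range(T):
        # The first half of the inputs fills O0, the second half fills O1.
        out, base = (O0, 2 * j) if j < T // 2 else (O1, 2 * j - T)
        out[base] = I0[j] ^ I1[j]
        out[base + 1] = I1[j]
    return O0, O1
```

Concatenating O0 and O1 gives the 2T-bit word whose entries 2j and 2j + 1 are I0[j] ⊕ I1[j] and I1[j], which is the same interleaved combination used by the IPC algorithm.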
3) Computation Schedule of Our Hybrid Partial Sum Unit:
Once the returned L codewords βv,0, βv,1, · · · , βv,L−1 are
computed, the path pruning unit also outputs L indices
a0, a1, · · · , aL−1, where decoding path al needs to be copied to decoding
path l. For l = 0, 1, · · · , L − 1, βv,l is first loaded into stage
t by the DLU in Fig. 10(b), and the output partial sums in
Alg. 3 come out from stage te. For stage t, if βv,l is sent from
a rate-0 node, then the control signal LZt is 0, since βv,l is a
zero vector. Otherwise, LDt = 0 and LZt = 1. For the other
stages, LDz = 1 and LZz = 1 (z ≠ t).
For all partial sums within the partial sum matrix $C_{l,z}$, we
divide them into two sets, $C^0_{l,z}$ and $C^1_{l,z}$, where $C^0_{l,z}$ consists
of $C_{l,z}[j][0]$ for $j = 0, 1, \cdots, 2^{n-z}-1$ and $C^1_{l,z}$ consists of
the other partial sums within $C_{l,z}$. For each $C_{l,z}$, our Hyb-PSU
stores only $C^0_{l,z}$ in the registers or bit memory of stage z. As
shown in Alg. 3, for z = t − 1 to te + 1, $C^1_{l,z}$ is computed
serially. At last, $C^0_{l,t_e}$ is computed. For our Hyb-PSU, after
loading the returned L codewords into stage t, for z = t − 1
to te + 1, $C^1_{l,z}$ is computed on-the-fly and passed to the next
stage as shown in Fig. 10.
When $t_e \ge m$, $C^0_{l,t_e}$ is computed in one clock cycle and is
output from stage te, where $C_{l,t_e}[j][0]$ is set to $s_{l,t_e,j}$ produced
by the type-I and type-II PEs for $j = 0, 1, \cdots, 2^{n-t_e}-1$.
When $t_e < m$, $C^0_{l,t_e}$ is computed in $2^{n-t_e}/T$ cycles, and T
updated partial sums are computed in each clock cycle. Since
decoding path $a_l$ needs to be copied to path l, for
$z = t, t-1, \cdots, t_e+1$, the computation of $C^1_{l,z}$ is based on $C^0_{a_l,z+1}$ and
$C^1_{l,z+1}$. Hence, the multiplexers within stage z are configured
so that $m_{l,z,j} = d_{a_l,z,j}$ for $z \ge m$. When z < m, $M_{l,z} = Q_{a_l,z}$.
4) Comparisons with Related Works: Compared to the
partial sum computation architectures in [9], [10], the proposed
Hyb-PSU architecture has advantages in the following two
aspects.
(1) The proposed Hyb-PSU is a scalable architecture. The
PSU architectures in [9], [10] require L(N−1) and L(N/2−1)
single bit registers, where N = 2n is the block length. Hence,
they will suffer from excessive area overhead when the block
length N is large. In contrast, the proposed Hyb-PSU stores
L(N − 1) bits and most of these bits are stored in RFs or
SRAMs, which are more area efficient than registers.
(2) The architectures in [9], [10] copy the partial sums of a decoding
path to another decoding path when needed, while our
Hyb-PSU copies only index references. We define the copying
of a single bit from one register to another as a single copy
operation. When decoding path l′ needs to be copied to path l,
the PSU in [10] requires $N_1 = 2^{n-1} - 1$ copy operations, while
our Hyb-PSU needs only $N_2 = (n+1)\log_2 L$ copy operations.
Since the value of L for practical hardware implementation is
small, our index based copy needs far fewer copy operations than
direct copy.
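To give a sense of scale, the two copy counts can be evaluated numerically. The short Python sketch below uses hypothetical function names and assumes L is a power of 2.

```python
import math

def direct_copy_ops(n):
    """N1: single-bit copies when all partial sums of a path are copied, as in [10]."""
    return 2 ** (n - 1) - 1

def index_copy_ops(n, L):
    """N2: n + 1 index references of log2(L) bits each."""
    return (n + 1) * int(math.log2(L))
```

For n = 13 and L = 4, direct copying takes 4095 single-bit copy operations, whereas copying the index references takes 28.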
In this paper, when L = 4 and T = 128, for N = $2^{13}$
and $2^{15}$, the proposed hybrid partial sum unit architecture is
implemented with m = 3 and m = 5, respectively, under a
TSMC 90nm CMOS technology. Our partial sum computation
unit consumes an area of 0.779 mm$^2$ and 1.31 mm$^2$ for N =
$2^{13}$ and N = $2^{15}$, respectively.
To the best of our knowledge, the decoder architectures
in [9], [10], [15], [24] are the only ones for SC based list decoding
algorithms of polar codes. However, in [9], [15], [24], the
partial sum computation unit architecture is not discussed
in detail and implementation results for the PSU alone are
not shown. Hence, we compare our proposed Hyb-PSU with
that in [10]. When L = 4, the partial sum unit architecture
in [10] for N = $2^{13}$ and $2^{15}$ consumes an area of 1.011 mm$^2$
and 3.63 mm$^2$, respectively, under the same CMOS technology.
All PSUs are synthesized under a frequency of 500MHz. Our
Hyb-PSU achieves an area saving of 23% and 63% for block
lengths $2^{13}$ and $2^{15}$, respectively.
E. Latency and Throughput
For the proposed high throughput decoder architecture, the
number of clock cycles, ND, used on the decoding of a
codeword depends on the block length, the code rate and the
positions of frozen bits. For our RLLD algorithm, let NV be
the number of nodes (except the root node) visited in Gn. Let
SV denote the set of indices of visited nodes (except the root
node). Let $S'_V$ be the subset of $S_V$ that consists of rate-1
nodes with Iv ≤ Xth and all FP nodes. For $v_i \in S_V$, let $t_i$ be
the layer index of node $v_i$ for $i = 0, 1, \cdots, N_V - 1$. Then
$$N_D = \sum_{i=0}^{N_V-1} \left(N_L^{(i)} + N_P^{(i)}\right) + N_C, \qquad (12)$$
where $N_L^{(i)} = \lceil 2^{n-t_i}/T \rceil$ is the number of clock cycles needed
to calculate the LLR vectors sent to node $v_i$, and $N_P^{(i)}$ is the
number of clock cycles used by our PPU when $v_i$ is activated.
Note that a decoding path splits only if node $v_i$ is a rate-1 node
with Iv ≤ Xth or an FP node. Hence, $N_P^{(i)} = 0$ if $v_i \notin S'_V$. If
$v_i \in S'_V$, $N_P^{(i)} \neq 0$ and depends on the node type, Xth, qIv,L,
T , L and the number of pipeline stages in our PPU. This will
be discussed in more detail in Section V.
Since our list decoder outputs $x_0^{N-1}$ instead of $u_0^{N-1}$, we
need to obtain $u_0^{N-1}$ based on $x_0^{N-1}$ before calculating the
CRC checksum of the information bits. A partial-parallel polar
encoder [25] can be used and the corresponding latency is
N/T clock cycles when T bits are fed to the encoder in parallel. For the
computation of CRC, a partial parallel CRC unit [26] can be
used, and the corresponding latency is also N/T clock cycles. As a result,
$N_C = 2N/T$ is the number of clock cycles due to encoding and
CRC checksum computation.
The latency of our decoder is $T_L = N_D/f$, where f is the
decoder frequency. Since we use a CRC for the final output
data word, we calculate the net information throughput (NIT)
of our decoder as
$$\mathrm{NIT} = \frac{(N \times R - h) f}{N_D - N_C},$$
where h is the CRC
checksum length. Here, the latency due to the CRC checksum
computation does not affect our decoder throughput, since our
decoder can work on the next frame once our Hyb-PSU begins
to output decoded codewords for the current frame.
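The latency and throughput expressions are easy to evaluate for a given design point. The Python sketch below uses hypothetical function names; in the usage note, the frequency and cycle count are the Table III entries of the proposed decoder with L = 2, T = 64 is the number of PUs per path reported in Section V for N = $2^{10}$, and h = 32 is an assumption based on the CRC-32 used in the simulations.

```python
def encode_crc_cycles(N, T):
    """N_C = 2N/T: cycles for re-encoding plus CRC checksum computation."""
    return 2 * N // T

def latency_us(ND, f_mhz):
    """T_L = N_D / f, in microseconds when f is given in MHz."""
    return ND / f_mhz

def nit_mbps(N, R, h, f_mhz, ND, NC):
    """NIT = (N*R - h) f / (N_D - N_C), in Mbps when f is given in MHz."""
    return (N * R - h) * f_mhz / (ND - NC)
```

For N = 1024, R = 0.5, f = 423 MHz and ND = 337, this gives NC = 32, a latency of about 0.80 µs and an NIT of about 666 Mbps, in line with the corresponding Table III entries.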
V. IMPLEMENTATION RESULTS AND COMPARISONS
To compare with prior works, we implement our high
throughput list decoder architecture for three polar codes with
lengths of $2^{10}$, $2^{13}$ and $2^{15}$, and rates 0.5, 0.5
and 0.9, respectively. The last polar code is intended for
storage applications. For each code, three different list sizes
are considered: L = 2, 4, 8. All our decoders are synthesized
under the TSMC 90nm CMOS technology using the Cadence
RTL compiler. The area efficiency (AE) of a partly parallel
decoder architecture depends on the number of PUs. In order
to make a fair comparison with prior works in [10], [12], [16],
the number of PUs for each decoding path of our implemented
decoders is selected to be 64 when N = $2^{10}$. When N = $2^{13}$
and $2^{15}$, the number of PUs per decoding path is 128 for
our decoders. The list decoders in [27] are based on a line
architecture, which requires N/2 PUs.
A total of 3, 4 and 6 pipeline stages are
inserted in the PPU for decoders with L = 2, 4 and 8,
respectively. The number of pipeline stages needed for our
PPU is determined by the longest data path. For each $v_i \in S'_V$,
if node $v_i$ is a rate-1 node with Iv ≤ Xth, $N_P^{(i)}$ depends on the
number of PUs in a decoding path: when Iv ≤ T , $N_P^{(i)} = 2$
for all our implemented decoders; otherwise, $N_P^{(i)} = 4$ for
all our decoders, since the minimum value of a received LLR
vector is calculated in a partial parallel way, which incurs
extra clock cycles. When node $v_i$ is an FP node, $N_P^{(i)}$ depends
on qIv,L. Depending on the value of qIv,L, we may use
different data paths when computing the L minimum expanded
path metrics. The locations of all pipeline stages are arranged so that
fewer clock cycles are needed when qIv,L is smaller. In
Table VI, we list the detailed values of $N_P^{(i)}$ with respect to Iv
and L.
The selection of Xth is a trade-off between AE and error
performance. When increasing Xth, more rate-1 nodes will
be processed by our CG algorithm. Hence, ND increases and
the resulting NIT decreases. Meanwhile, the corresponding
error performance is better, especially in the high SNR region.
Our high throughput list decoder architecture supports all
Xth values. For all our implemented decoders, Xth is large
enough so that all rate-1 nodes are processed by our CG
algorithm. In this setup, for each implemented decoder, ND is
maximized with respect to Xth, and hence the throughput of
our decoder architecture in Tables III, IV and V is the minimum
achieved by our decoders. For each code, the corresponding
error performance is better than that of the RLLD with the
MEQ in Fig. 6.

TABLE III
IMPLEMENTATION RESULTS FOR $N = 2^{10}$, $R = 0.5$

| Decoder | proposed | proposed | proposed | [12] | [12] | [12] | [10]‡ | [10]‡ | [10]‡ | [15] | [15] | [16] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L | 2 | 4 | 8 | 2 | 4 | 8 | 2 | 4 | 8 | 2 | 4 | 4 |
| Frequency (MHz) | 423 | 403 | 289 | 847 | 794 | 637 | 507 | 492 | 462 | 500∗ / 361† | 400∗ / 288† | 500 |
| Cell Area (mm^2) | 1.98 | 3.83 | 7.22 | 0.88 | 1.78 | 3.85 | 1.23 | 2.46 | 5.28 | 1.06∗ / 2.03† | 2.14∗ / 4.10† | 1.403 |
| # of Decoding Cycles | 337 | 371 | 404 | 2592 | 2649 | 2649 | 2592 | 2592 | 3104 | 1022 | 1022 | 1290 |
| NIT (Mbps) | 666 | 570 | 374 | 168 | 154 | 123 | 93 | 91 | 71 | 250∗ / 180† | 200∗ / 144† | 186 |
| Latency (us) | 0.79 | 0.92 | 1.39 | 3.06 | 3.34 | 4.16 | 5.11 | 5.26 | 6.72 | 2.04∗ / 2.83† | 2.55∗ / 3.54† | 2.58 |
| AE (Mbps/mm^2) | 336 | 148 | 51 | 191 | 86 | 32 | 76 | 37 | 13 | 237∗ / 88† | 94∗ / 35† | 132 |

‡The decoder architecture in [10] has been re-synthesized under the TSMC 90nm CMOS technology. ∗These are the original implementation results
based on a 65nm CMOS technology. †These are the scaled results under the TSMC 90nm CMOS technology.

TABLE IV
IMPLEMENTATION RESULTS FOR $N = 2^{13}$, $R = 0.5$

| Decoder | proposed | proposed | proposed | [12]† | [12]† | [12]† | [10]‡ | [10]‡ | [10]‡ | [16]‡ |
|---|---|---|---|---|---|---|---|---|---|---|
| L | 2 | 4 | 8 | 2 | 4 | 8 | 2 | 4 | 8 | 4 |
| Frequency (MHz) | 416 | 398 | 289 | 847 | 794 | 637 | 467 | 434 | 434 | 434 |
| Cell Area (mm^2) | 3.42 | 6.46 | 12.26 | 6.48 | 12.73 | 28.04 | 3.97 | 7.93 | 17.45 | 7.02 |
| # of Decoding Cycles | 2146 | 2367 | 2576 | 20736 | 20736 | 20736 | 20736 | 20736 | 24832 | 11488 |
| NIT (Mbps) | 839 | 723 | 479 | 167 | 156 | 125 | 92 | 85 | 71 | 153 |
| Latency (us) | 5.16 | 5.94 | 8.91 | 24.48 | 26.11 | 32.55 | 44.40 | 47.78 | 58.56 | 26.47 |
| AE (Mbps/mm^2) | 245 | 111 | 39 | 26 | 12 | 4.6 | 23 | 11 | 4.1 | 21.79 |

†These results are estimated conservatively. ‡The decoder architectures in [10], [16] have been re-synthesized under the
TSMC 90nm CMOS technology. The number of PUs per decoding path is 128.

TABLE V
IMPLEMENTATION RESULTS FOR $N = 2^{15}$, $R = 0.9004$

| Decoder | proposed | proposed | proposed | [12]† | [12]† | [12]† | [10]‡ | [10]‡ | [10]‡ | [16]‡ |
|---|---|---|---|---|---|---|---|---|---|---|
| L | 2 | 4 | 8 | 2 | 4 | 8 | 2 | 4 | 8 | 4 |
| Frequency (MHz) | 367 | 359 | 286 | 847 | 794 | 637 | 398 | 389 | 389 | 389 |
| Cell Area (mm^2) | 6.22 | 11.89 | 23.1 | 25.68 | 50.41 | 111.08 | 8.59 | 17.54 | 34 | 15.5 |
| # of Decoding Cycles | 6070 | 6492 | 6895 | 96576 | 96576 | 96576 | 96576 | 96576 | 126080 | 63606 |
| NIT (Mbps) | 1949 | 1772 | 1323 | 258 | 242 | 194 | 121 | 118 | 90 | 180 |
| Latency (us) | 16.53 | 18.08 | 24.11 | 114.02 | 121.63 | 151.61 | 242.65 | 248.26 | 324.1 | 163.5 |
| AE (Mbps/mm^2) | 313 | 149 | 57 | 11 | 4.8 | 1.75 | 14 | 6.72 | 2.64 | 11.61 |

†These results are estimated conservatively. ‡The decoder architectures in [10], [16] have been re-synthesized under the TSMC
90nm CMOS technology. The number of PUs per decoding path is 128.
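TABLE VI
$N_P^{(i)}$ WITH RESPECT TO $I_v$ AND L

| $I_v$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| L = 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 |
| L = 4 | 2 | 4 | 4 | 4 | 4 | 4 | 4 | 3 |
| L = 8 | 2 | 3 | 4 | 5 | 5 | 6 | 5 | 3 |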
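The cycle counts in Table VI, together with the rule stated for rate-1 nodes, can be written as a small lookup. This is only an illustrative sketch: the function name `ppu_cycles` and the node-type labels are hypothetical, and it assumes that Table VI applies to FP nodes and that the rate-1 rule applies to nodes with $I_v \le X_{th}$, which is how the surrounding text reads.

```python
# Table VI transcribed: N_P^(i) for I_v = 1..8, keyed by list size L.
# ASSUMPTION: these entries are used for FP nodes.
TABLE_VI = {
    2: [2, 2, 3, 3, 3, 3, 3, 3],
    4: [2, 4, 4, 4, 4, 4, 4, 3],
    8: [2, 3, 4, 5, 5, 6, 5, 3],
}

def ppu_cycles(node_type: str, i_v: int, list_size: int, t: int) -> int:
    """Return the PPU cycle count N_P^(i) for one node (hypothetical helper).

    node_type: "rate1" for a rate-1 node with I_v <= X_th, or "fp" for an FP node.
    t: number of PUs per decoding path.
    """
    if node_type == "rate1":
        # 2 cycles when I_v <= T; otherwise 4, because the minimum LLR
        # is then found in a partially parallel way.
        return 2 if i_v <= t else 4
    if node_type == "fp":
        if not 1 <= i_v <= 8:
            raise ValueError("Table VI only lists I_v from 1 to 8")
        return TABLE_VI[list_size][i_v - 1]
    raise ValueError(f"unknown node type: {node_type}")

# Example with T = 64 PUs per path (the N = 2^10 configuration).
print(ppu_cycles("fp", 6, 8, 64))       # Table VI entry for I_v = 6, L = 8
print(ppu_cycles("rate1", 128, 2, 64))  # rate-1 node wider than T
```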
The implementation results are shown in Tables III, IV
and V. These results show that our decoders
outperform existing SCL decoders [10], [12], [15] in both
decoding latency and area efficiency. Compared with the
decoders of [12], the area efficiency and decoding latency
of our decoders are 1.59 to 32.5 times and 3.4 to 6.8 times
better, respectively. The area efficiency and decoding latency
of our decoders are 3.9 to 21.5 times and 5.5 to 13 times
better, respectively, than the decoders of [10]. Compared with
decoders of [16], our decoders improve the area efficiency
and decoding latency by 1.12 to 12 times and 2.8 to 9 times,
respectively. When $N = 2^{10}$, the area efficiency and decoding
latency of our decoders are 3.8 to 4.2 times and 3.58 to
3.84 times better, respectively, than the decoders of [15].
Compared with the decoders of [15], our decoders would show
more significant improvements in area efficiency and decoding
latency when N is larger.
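The area efficiency range quoted against [12] can be reproduced directly from the AE rows of Tables III, IV and V. The sketch below is only a worked check of that one comparison; the dictionary names are hypothetical and the values are copied from the tables.

```python
# AE (Mbps/mm^2) for L = 2, 4, 8, copied from Tables III, IV and V.
AE_PROPOSED = {
    "2^10": [336, 148, 51],
    "2^13": [245, 111, 39],
    "2^15": [313, 149, 57],
}
AE_REF12 = {
    "2^10": [191, 86, 32],
    "2^13": [26, 12, 4.6],
    "2^15": [11, 4.8, 1.75],
}

# Ratio of our AE to that of [12] for every (N, L) pair.
ratios = [
    ours / ref
    for n in AE_PROPOSED
    for ours, ref in zip(AE_PROPOSED[n], AE_REF12[n])
]

print(f"AE improvement over [12]: {min(ratios):.2f}x to {max(ratios):.2f}x")
```

The smallest ratio occurs for $N = 2^{10}$, $L = 8$ and the largest for $N = 2^{15}$, $L = 8$, matching the quoted range of roughly 1.59 to 32.5 times.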
Based on the implementation results shown in Tables III, IV
and V, it is observed that when the block length is fixed, as the
list size L increases, the area efficiency decreases and the
decoding latency increases, because:
• It takes more memory to store internal LLRs when L
increases.
• The number of pipeline stages within our PPU will
increase when L increases, which in turn increases the
overall decoding clock cycles.
The latency reduction and area efficiency improvement of
our decoders are due to the reduced number of nodes activated
in the decoding. However, the area and frequency overhead of
the proposed PPU somewhat dilutes the effect of the reduction
in decoding clock cycles. For example, our decoder reduces
the number of decoding cycles to approximately 1/7 of that
of the decoders in [12] for L = 2, 4 and 8. However, the
reduction in decoding cycles does not fully translate into an
improvement in decoding latency and area efficiency. Based
on our implementation results, taking L = 2 as an example,
the PPU occupies 61.99%, 40.16% and 25.40% of the area
of the whole decoder for $N = 2^{10}$, $2^{13}$ and $2^{15}$, respectively.
Compared with the decoders with $N = 2^{10}$ and $2^{13}$, the effect
on the area efficiency caused by the area overhead of the PPU is
smaller for decoders with $N = 2^{15}$. Keeping T unchanged,
as N increases, the area of the PPU increases very slowly
while the total area of all LLR memories is proportional to
N. Hence, for larger N, the PPU occupies a smaller percentage
of the total area of the whole decoder. When the list size L is
fixed, as N increases, the latency reduction and area efficiency
improvement compared with other decoders in the literature
will be greater.
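The dilution effect described in this paragraph can be illustrated with the reported numbers. The sketch below uses the L = 2 entries of Table III for the cycle and frequency comparison with [12], and multiplies the stated PPU area percentages by the L = 2 cell areas of Tables III, IV and V to estimate the absolute PPU area. The variable names are hypothetical, and the PPU areas are derived estimates rather than figures reported in the paper.

```python
# Part 1: cycle reduction vs. latency reduction (Table III, L = 2, versus [12]).
cycles_ours, cycles_ref = 337, 2592
freq_ours, freq_ref = 423.0, 847.0            # MHz

cycle_gain = cycles_ref / cycles_ours         # reduction in decoding cycles
freq_penalty = freq_ref / freq_ours           # our clock is slower by this factor
latency_gain = cycle_gain / freq_penalty      # what remains as latency reduction

print(f"cycles: {cycle_gain:.2f}x fewer, clock: {freq_penalty:.2f}x slower, "
      f"latency: {latency_gain:.2f}x lower")

# Part 2: estimated absolute PPU area for L = 2.
total_area = {"2^10": 1.98, "2^13": 3.42, "2^15": 6.22}      # mm^2, Tables III-V
ppu_share = {"2^10": 0.6199, "2^13": 0.4016, "2^15": 0.2540}  # from the text

ppu_area = {n: total_area[n] * ppu_share[n] for n in total_area}
rest_area = {n: total_area[n] - ppu_area[n] for n in total_area}

for n in total_area:
    print(f"N = {n}: PPU ~ {ppu_area[n]:.2f} mm^2, rest ~ {rest_area[n]:.2f} mm^2")
```

With these inputs, a reduction of about 7.7 times in cycles shrinks to a latency reduction of about 3.8 times once the lower clock frequency is accounted for. The estimated PPU area grows only from about 1.23 to 1.58 mm^2 while the rest of the decoder grows several-fold. Note that the number of PUs per path is 64 for $N = 2^{10}$ and 128 for the two longer codes, so T is not identical across the three designs.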
VI. CONCLUSION
In this paper, a reduced latency list decoding algorithm
is proposed for polar codes. The proposed list decoding
algorithm results in a high throughput list decoder architecture
for polar codes. A memory efficient quantization method is
also proposed to reduce the size of message memories. The
proposed list decoder architecture can be adapted to large
block lengths due to our hybrid partial sum unit, which is area
efficient. The implementation results of our high throughput
list decoder demonstrate significant advantages over current
state-of-the-art SCL decoders.
REFERENCES
[1] J. Lin, C. Xiong, and Z. Yan, “A reduced latency list decoding algorithm
for polar codes,” in Proc. IEEE Workshop on Signal Processing Systems
(SiPS), Belfast, UK, October 2014, pp. 56–61.
[2] J. Lin and Z. Yan, “A hybrid partial sum computation unit architecture
for list decoders of polar codes,” in IEEE Int. Conference on
Acoustics, Speech, and Signal Processing (ICASSP), 2015, [Online:
http://arxiv.org/abs/1506.05896].
[3] E. Arıkan, “Channel polarization: a method for constructing capacity-
achieving codes for symmetric binary-input memoryless channels,”
IEEE Trans. Info. Theory, vol. 55, no. 7, pp. 3051–3073, Jul. 2009.
[4] E. Sasoglu, E. Telatar, and E. Arıkan, “Polarization for arbitrary discrete
memoryless channels,” in Proc. IEEE Int. Symp. on Information Theory,
Seoul, South Korea, Jun. 2009, pp. 144–148.
[5] C. Leroux, A. J. Raymond, G. Sarkis, and W. J. Gross, “A semi-parallel
successive-cancellation decoder for polar codes,” IEEE Trans. Signal
Process., vol. 61, no. 2, pp. 289–299, Jan. 2013.
[6] I. Tal and A. Vardy, “List decoding of polar codes,” IEEE Trans. Info.
Theory, 2015, [Online; DOI: 10.1109/TIT.2015.2410251].
[7] K. Niu and K. Chen, “CRC-aided decoding of polar codes,” IEEE
Commun. Lett., vol. 16, no. 10, pp. 1668–1671, Oct. 2012.
[8] B. Li, H. Shen, and D. Tse, “An adaptive successive cancellation list
decoder for polar codes with cyclic redundancy check,” IEEE Commun.
Lett., vol. 16, no. 12, pp. 2044–2047, Dec. 2012.
[9] A. Balatsoukas-Stimming, A. J. Raymond, W. J. Gross, and A. Burg,
“Hardware architecture for list successive cancellation decoding of polar
codes,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 8, pp.
609–613, Aug. 2014.
[10] J. Lin and Z. Yan, “An efficient list decoder architecture for polar codes,”
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2015, to appear.
[11] B. Yuan and K. K. Parhi, “Successive cancellation list polar decoder
using log-likelihood ratios,” in IEEE Asilomar Conf. on Signal, Systems
and Computers, November 2014.
[12] A. Balatsoukas-Stimming, M. B. Parizi, and A. Burg, “LLR-
based successive cancellation list decoding of polar codes,”
http://arxiv.org/abs/1401.3753v3, submitted to IEEE Trans. Signal
Process.
[13] A. Balatsoukas-Stimming, M. B. Parizi, and A. Burg, “LLR-based
successive cancellation list decoding of polar codes,” in Proc. IEEE
Int. Conference on Acoustics, Speech, and Signal Processing (ICASSP),
Florence, Italy, May 2014, pp. 3903–3907.
[14] B. Li, H. Shen, and D. Tse, “Parallel decoders of polar codes,”
http://arxiv.org/abs/1309.1026v1, Sep. 2013.
[15] B. Yuan and K. K. Parhi, “Low-latency successive-cancellation list
decoders for polar codes with multibit decision,” IEEE Trans. Very
Large Scale Integr. (VLSI) Syst., to appear.
[16] C. Xiong, J. Lin, and Z. Yan, “Symbol-decision successive cancellation
list decoder for polar codes,” http://arxiv.org/abs/1501.04705, submitted
to IEEE Trans. Signal Processing.
[17] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. J. Gross, “Increasing
the speed of polar list decoders,” in Proc. IEEE Workshop on Signal
Processing Systems (SiPS), Belfast, UK, 2014.
[18] A. Alamdar-Yazdi and F. R. Kschischang, “A simplified successive-
cancellation decoder for polar codes,” IEEE Commun. Lett., vol. 15, no.
12, pp. 1378–1380, Dec. 2011.
[19] G. Sarkis and W. J. Gross, “Increasing the throughput of polar decoders,”
IEEE Commun. Lett., vol. 17, no. 9, pp. 725–728, Apr. 2013.
[20] E. Arıkan, “A performance comparison of polar codes and Reed-Muller
codes,” IEEE Commun. Lett., vol. 12, no. 6, pp. 447–449, June 2008.
[21] I. Tal and A. Vardy, “How to construct polar codes,” IEEE Trans. Info.
Theory, vol. 59, no. 10, pp. 6562–6582, October 2013.
[22] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. J. Gross, “Fast polar
decoders: algorithm and implementation,” IEEE J. Sel. Areas Commun.,
vol. 32, no. 5, pp. 946–957, May 2014.
[23] C.-L. Wey, M.-D. Shieh, and S.-Y. Lin, “Algorithms of finding the first
two minimum values and their hardware implementation,” IEEE Trans.
Circuits Syst. I, Reg. Papers, vol. 55, no. 11, pp. 1549–8328, Dec. 2008.
[24] C. Zhang, X. Yu, and J. Sha, “Hardware architecture for list successive
cancellation polar decoder,” in Proc. IEEE Int. Symp. on Circuits and
Systems (ISCAS), Melbourne, AU, Jun. 2014, pp. 209–212.
[25] H. Yoo and I.-C. Park, “Partially parallel encoder architecture for long
polar codes,” IEEE Trans. Circuits Syst. II, Exp. Briefs, accepted.
[26] Y. Huo, X. Li, W. Wang, and D. Liu, “High performance table-based
architecture for parallel CRC calculation,” in IEEE Int. workshop on
local and metropolitan area network (LANMAN), April 2015, pp. 1–6.
[27] B. Yuan and K. K. Parhi, “Low-latency successive-cancellation polar
decoder architectures using 2-bit decoding,” IEEE Trans. Circuits Syst.
I, Reg. Papers, vol. 61, no. 4, pp. 1241–1254, Apr. 2014.
