Rate-Flexible Fast Polar Decoders by Hashemi, Seyyed Ali et al.
arXiv:1903.09203v1 [cs.IT] 21 Mar 2019
Rate-Flexible Fast Polar Decoders
Seyyed Ali Hashemi, Carlo Condo, Marco Mondelli, Warren J. Gross
Abstract—Polar codes have gained extensive attention during
the past few years and recently they have been selected for
the next generation of wireless communications standards (5G).
Successive-cancellation-based (SC-based) decoders, such as SC
list (SCL) and SC flip (SCF), provide a reasonable error
performance for polar codes at the cost of low decoding speed.
Fast SC-based decoders, such as Fast-SSC, Fast-SSCL, and Fast-
SSCF, identify the special constituent codes in a polar code graph
off-line, produce a list of operations, store the list in memory,
and feed the list to the decoder to decode the constituent codes
in order efficiently, thus increasing the decoding speed. However,
the list of operations is dependent on the code rate and as the rate
changes, a new list is produced, making fast SC-based decoders
not rate-flexible. In this paper, we propose a completely rate-
flexible fast SC-based decoder by creating the list of operations
directly in hardware, with low implementation complexity. We
further propose a hardware architecture implementing the pro-
posed method and show that the area occupation of the rate-
flexible fast SC-based decoder in this paper is only 38% of the
total area of the memory-based base-line decoder when 5G code
rates are supported.
Index Terms—polar codes, successive-cancellation decoding,
list decoding, hardware implementation.
I. INTRODUCTION
Polar codes are a family of channel codes which can
provably achieve the capacity of a binary memoryless sym-
metric (BMS) channel with the low-complexity successive-
cancellation (SC) decoding algorithm [1]. However, this
capacity-achieving property under SC decoding only occurs as
the code length tends towards infinity. For practical values of
code length, SC decoding fails to provide a reasonable error-
correction performance.
In order to improve the error-correction performance of
SC decoding, SC list (SCL) [2] and SC flip (SCF) [3]
decoders run multiple SC decoders in parallel and in series,
respectively. Therefore, SCL improves the error-correction
performance of SC at the cost of higher area occupation when
implemented on hardware, while SCF improves the error-
correction performance of SC at the cost of higher latency
and lower throughput. With this error-correction performance
improvement, polar codes were selected as a channel coding
scheme for the enhanced mobile broadband (eMBB) control
channel in the next generation of wireless communications
standard (5G).
S. A. Hashemi was with the Department of Electrical and Computer
Engineering, McGill University, Montréal, QC H3A 0G4, Canada. He is now
with the Department of Electrical Engineering, Stanford University, Stanford,
CA 94305, USA (email: ahashemi@stanford.edu).
C. Condo was with the Department of Electrical and Computer Engineering,
McGill University, Montréal, QC H3A 0G4, Canada. He is now with
Huawei Technologies France, 92100 Boulogne-Billancourt, France (e-mail:
carlo.condo@huawei.com).
M. Mondelli is with the Department of Electrical Engineering, Stanford
University, Stanford, CA 94305, USA (email: mondelli@stanford.edu).
W. J. Gross is with the Department of Electrical and Computer Engineering,
McGill University, Montréal, QC H3A 0G4, Canada (e-mail:
warren.gross@mcgill.ca).
SC-based decoding algorithms such as SC, SCL, and SCF,
suffer from high latency and low throughput when imple-
mented on hardware. This is due to the serial nature of SC
decoding in which the decoding proceeds bit by bit. In order
to address this issue, polar codes were shown to be a
concatenation of smaller constituent codes which can be decoded
in parallel [4], [5]. These constituent codes are shown to
add small implementation complexity overhead while keeping
the error-correction performance of SC unchanged. In [6],
more constituent codes were identified and low-complexity
parallel decoders were designed to increase the throughput
of SC decoders even further. It was shown in [7], [8] that
the constituent codes can be decoded efficiently under SCL
decoding while keeping the error-correction performance of
SCL decoder unaltered. The same approach was applied to
the SCF decoder in [9].
The construction of polar codes is based on the identification
of reliable bit-channels through which information bits are
transmitted. The remaining bit-channels carry fixed values and
are called frozen bits. The location of the frozen bits and of
the information bits is known to the encoder and the decoder.
In SC-based decoders, the frozen and information bit sequence
can be either stored in a memory, or computed on-line given
the bit-channel relative reliability vector and desired code rate,
as proposed in [10]. In fact, the latter approach is significantly
more efficient in case of multi-code decoders, and is facilitated
by nested reliability vectors such as those selected for the 5G eMBB
control channel [11]. Therefore, in 5G, the polar encoder and
decoder are provided with a vector of bit indices in descending
reliability order and an information length K , from which the
encoder and the decoder should extract the frozen/information
bit sequence. It should be noted that the number of information
bits for polar codes in the 5G eMBB control channel can be
any value between 12 and 1706 [12]. Thus, the encoder and
the decoder should be able to support a vast range of code
rates.
Fast SC-based decoders rely on the identification of the
type and the length of constituent codes in a polar code.
While the calculation of the frozen/information bit sequence
is straightforward and can be performed by simply assigning
information bits to the first K elements of the reliability vector,
the direct calculation of the list of operations for fast SC-based
decoders requires complicated controller logic [5]. Therefore,
the identification of the type and the length of constituent
codes is performed off-line and the decoding order is stored
in a dedicated memory as a list of operations [5], [7], [8]. The
decoder fetches the list of operations from memory to decode
the constituent codes in order one by one. The main drawbacks
of the aforementioned fast SC-based decoders are twofold:
first, the list of operations requires high memory usage when
implemented on hardware. Second, the list of operations is
highly dependent on the rate of the polar code and as the
rate changes, the list of operations changes too. Therefore, for
5G applications which require the support of multiple rates,
multiple lists of operations need to be stored in memory. This
in turn increases the hardware implementation overhead and
renders fast SC-based decoders not rate-flexible.
In this paper, we propose completely rate-flexible fast SC-
based decoders by introducing a method to infer the list
of operations directly in hardware by using the bit-channel
relative reliability vector and without the need to store it
in memory. We show that the type and the length of a
constituent code in a polar code can be identified with low
hardware implementation complexity, by checking only a few
bits of the constituent code. We further show that the list
of operations adapts with the rate of the code, allowing the
resulting fast SC-based decoder to be completely rate-flexible.
We design and implement a hardware architecture for the
proposed decoder and show that the memory required to store
the list of operations can be completely removed, resulting in
significantly lower decoder area occupation.
The remainder of this paper is organized as follows: Sec-
tion II reviews polar codes, SC-based decoding algorithms, and
their fast counterparts. We propose the rate-flexible fast de-
coder for polar codes in Section III. In Section IV, a hardware
architecture to implement the proposed method is introduced.
Section V provides the hardware implementation results and
comparisons with the state of the art. Finally, conclusions are
drawn in Section VI.
II. PRELIMINARIES
A. Polar Codes
A polar code of length $N = 2^n$ that carries $K$ information
bits has a rate $R = K/N$ and can be represented as $P(N, K)$. It
can be constructed using a lower-triangular generator matrix
$G$ as
\[ x = uG, \tag{1} \]
where $x = \{x_0, x_1, \ldots, x_{N-1}\}$ is the vector of coded bits and
$u = \{u_0, u_1, \ldots, u_{N-1}\}$ is the vector of input bits. The matrix
$G = B_N F^{\otimes n}$, where $B_N$ is the bit-reversal permutation matrix,
and $F^{\otimes n}$ is the $n$-th Kronecker product of the polarizing matrix
\[ F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}. \]
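As a rough illustration of (1), the sketch below builds $G = B_N F^{\otimes n}$ with plain Python lists and encodes an input vector over GF(2). The helper names (`kron`, `bit_reverse`, `generator_matrix`, `encode`) are illustrative choices and do not come from the paper, and the dense matrix product is used only for readability, not efficiency.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def bit_reverse(i, n):
    """Reverse the n-bit binary expansion of the integer i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def generator_matrix(n):
    """Return G = B_N * F^{(x)n} for N = 2**n as a list of rows."""
    F = [[1, 0], [1, 1]]
    Fn = [[1]]
    for _ in range(n):
        Fn = kron(Fn, F)
    # Left-multiplying by the bit-reversal permutation B_N reorders the rows:
    # row i of G is row bit_reverse(i) of F^{(x)n}.
    return [Fn[bit_reverse(i, n)] for i in range(1 << n)]


def encode(u, n):
    """Compute x = uG over GF(2)."""
    G = generator_matrix(n)
    N = 1 << n
    return [sum(u[i] * G[i][j] for i in range(N)) % 2 for j in range(N)]
```

For example, `encode([0, 0, 0, 1], 2)` selects the last row of $G$, which is the all-ones vector.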
As N goes toward infinity, the polarization phenomenon
creates bit-channels that are either completely noisy or com-
pletely noiseless and the fraction of noiseless bit-channels
equals the channel capacity. For finite practical code lengths,
the polarization of bit-channels is incomplete; therefore, there
are bit-channels that are partially noisy. In principle, a bit-
channel relative reliability vector v = {v0, v1, . . . , vN−1}, where
0 ≤ vi < N , is generated and fed into the encoder and the
decoder based on the polarization phenomenon which shows
the rank of each bit-channel. Thus, v is a vector of integers
such that if vi < vj , then bit-channel i is more reliable (less
noisy) than bit-channel j. The polar encoding process consists
of the classification of the bit-channels in u into two groups
based on $v$: the $K$ good (more reliable) bit-channels which
carry the information bits, and the $N - K$ bad (less reliable)
bit-channels that are fixed to a predefined value (usually 0).
This classification can be represented as a sequence of binary
values $s = \{s_0, s_1, \ldots, s_{N-1}\}$ where
\[ s_i = \begin{cases} 0 & \text{if } v_i \ge K, \\ 1 & \text{if } v_i < K. \end{cases} \tag{2} \]
Fig. 1: SC-based decoding on a binary tree for P(8,4) and
v = {7, 6, 5, 3, 4, 2, 1, 0} (s = {0, 0, 0, 1, 0, 1, 1, 1}).
[Tree diagram with levels t = 3 down to t = 0, leaves û0 to û7, and
messages α, β, α^ℓ, β^ℓ, α^r, β^r.]
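The classification in (2) amounts to one comparison per bit-channel. A minimal sketch, assuming the reliability vector is given as a Python list (the function name `frozen_pattern` is hypothetical), reproduces the sequences $s$ quoted in the captions of Fig. 1 and Fig. 3:

```python
def frozen_pattern(v, K):
    """Return s per (2): s[i] = 1 (information) if v[i] < K, else 0 (frozen)."""
    return [1 if vi < K else 0 for vi in v]


# Reliability vector used in the P(8,4) and P(8,5) examples of the paper.
v = [7, 6, 5, 3, 4, 2, 1, 0]
s_rate_half = frozen_pattern(v, 4)
s_rate_five_eighths = frozen_pattern(v, 5)
```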
More formally, let W be a BMS channel with input alphabet
X = {0, 1} and output alphabet Y, and let {W(y | x) : x ∈
X, y ∈ Y} be the transition probabilities. In order to quantify
the reliability of the channel W , we use the Bhattacharyya
parameter $Z(W) \in [0, 1]$, that is defined as
\[ Z(W) = \sum_{y \in \mathcal{Y}} \sqrt{W(y \mid 0)\, W(y \mid 1)}. \tag{3} \]
Hence, the good bit-channels are the ones that have the lowest
Bhattacharyya parameter.
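To make (3) concrete, the following sketch evaluates the sum for two textbook channels with a finite output alphabet. Representing a channel as a dictionary from each output $y$ to the pair $(W(y \mid 0), W(y \mid 1))$ is an assumption made here for illustration, as are the helper names.

```python
from math import sqrt


def bhattacharyya(W):
    """Evaluate (3); W maps each output y to the pair (W(y|0), W(y|1))."""
    return sum(sqrt(p0 * p1) for p0, p1 in W.values())


def bsc(p):
    """Binary symmetric channel with crossover probability p."""
    return {0: (1 - p, p), 1: (p, 1 - p)}


def bec(eps):
    """Binary erasure channel with erasure probability eps ('e' is the erasure)."""
    return {0: (1 - eps, 0.0), 1: (0.0, 1 - eps), "e": (eps, eps)}
```

For the erasure channel only the erasure symbol contributes to the sum, so $Z$ equals the erasure probability; for the symmetric channel the sum gives $2\sqrt{p(1-p)}$.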
B. SC-Based Decoding
SC-based decoding algorithms can be represented as a
depth-first binary tree search with priority to the left branches
as depicted in Fig. 1. Two kinds of messages are passed
between the nodes in the graph: the soft log-likelihood ratio
(LLR) values $\alpha = \{\alpha_0, \alpha_1, \ldots, \alpha_{2T-1}\}$ which are passed from a
parent node at level $\log_2(2T) = t+1$ to the child nodes at level
$\log_2(T) = t$, and the hard bit estimates $\beta = \{\beta_0, \beta_1, \ldots, \beta_{2T-1}\}$
which are passed from a child node at level $t$ to a parent node
at level $t + 1$.
The $T = 2^t$ elements of the left child node
$\alpha^\ell = \{\alpha^\ell_0, \alpha^\ell_1, \ldots, \alpha^\ell_{T-1}\}$ can be computed by the $F_t$ function, and
those of the right child node $\alpha^r = \{\alpha^r_0, \alpha^r_1, \ldots, \alpha^r_{T-1}\}$ can be
computed by the $G_t$ function as
\[ \alpha^\ell_i = F_t(\alpha_i, \alpha_{i+T}), \tag{4} \]
\[ \alpha^r_i = G_t(\alpha_i, \alpha_{i+T}, \beta^\ell_i), \tag{5} \]
where
\[ F_t(a, b) = 2 \operatorname{arctanh}\left( \tanh\left(\frac{a}{2}\right) \tanh\left(\frac{b}{2}\right) \right) \tag{6} \]
\[ \approx \operatorname{sgn}(a) \operatorname{sgn}(b) \min(|a|, |b|), \tag{7} \]
\[ G_t(a, b, c) = b + (1 - 2c)\, a. \tag{8} \]
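The update rules (4)-(8) can be sketched in a few lines. The function names below are illustrative, LLRs are assumed to be Python floats, and the exact form (6) is only safe for moderate magnitudes because `math.atanh` raises an error when its argument reaches ±1.

```python
import math


def f_exact(a, b):
    """Exact F function (6); assumes |a| and |b| are moderate."""
    return 2.0 * math.atanh(math.tanh(a / 2.0) * math.tanh(b / 2.0))


def f_minsum(a, b):
    """Min-sum approximation (7); sgn(0) is treated as +1 here."""
    sign = (1 if a >= 0 else -1) * (1 if b >= 0 else -1)
    return sign * min(abs(a), abs(b))


def g(a, b, c):
    """G function (8) with partial-sum bit c in {0, 1}."""
    return b + (1 - 2 * c) * a


def left_llrs(alpha, f=f_minsum):
    """Equation (4): LLRs sent to the left child of a node with 2T LLRs."""
    T = len(alpha) // 2
    return [f(alpha[i], alpha[i + T]) for i in range(T)]


def right_llrs(alpha, beta_left):
    """Equation (5): LLRs sent to the right child, given the left estimates."""
    T = len(alpha) // 2
    return [g(alpha[i], alpha[i + T], beta_left[i]) for i in range(T)]
```

The min-sum form never underestimates the magnitude of the exact result, which is why it is a common hardware-friendly substitute.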
Assume that the vector of relative reliabilities of bit-channels
v is stored in memory and is available to the decoder. In SC
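and SCF decoding algorithms, when a leaf node is reached,
the $i$-th bit $\hat{u}_i$ can be estimated as
\[ \hat{u}_i = \begin{cases} 0, & \text{if } v_i \ge K \text{ or } \alpha_i \ge 0, \\ 1, & \text{if } v_i < K \text{ and } \alpha_i < 0, \end{cases} \tag{9} \]
while in SCL decoding, at a leaf node we have
\[ \hat{u}_i = \begin{cases} 0, & \text{if } v_i \ge K, \\ 0 \text{ and } 1, & \text{if } v_i < K. \end{cases} \tag{10} \]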
As can be seen in (10), when an information bit is reached
in SCL decoding, both of its possible values of 0 and 1 are
considered. In order to limit the exponential growth in the
complexity of the SCL decoder, at each bit estimation, only
L candidates are allowed to survive with the help of a path
metric (PM) [13]. To this end, a sorter module is used to
rank the PMs of the 2L generated candidates and select the
L of them with the best PMs. After the estimation of bits
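by (9) or (10), the left child and right child node messages
$\beta^\ell = \{\beta^\ell_0, \beta^\ell_1, \ldots, \beta^\ell_{T-1}\}$ and $\beta^r = \{\beta^r_0, \beta^r_1, \ldots, \beta^r_{T-1}\}$ are used
successively to calculate the $2T$ values of $\beta$ as [1]
\[ \beta_i = \begin{cases} \beta^\ell_i \oplus \beta^r_i, & \text{if } i < T, \\ \beta^r_{i-T}, & \text{otherwise}, \end{cases} \tag{11} \]
where $\oplus$ is the bitwise XOR operation.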
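The leaf decision (9) and the partial-sum combination (11) can be sketched as follows. This is only an illustration for SC and SCF decoding; the function names are hypothetical, and the list-decoding rule (10) with its path metrics is not modelled.

```python
def hard_decision(alpha_i, v_i, K):
    """Leaf estimate per (9): frozen bits and non-negative LLRs give 0."""
    return 1 if (v_i < K and alpha_i < 0) else 0


def combine(beta_left, beta_right):
    """Partial sums per (11): the first T entries are XORs, the last T copy beta_right."""
    return [bl ^ br for bl, br in zip(beta_left, beta_right)] + list(beta_right)
```

With `beta_left = [1, 0]` and `beta_right = [1, 1]`, the parent receives `[0, 1, 1, 1]`.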
The depth-first binary tree search of SC-based decoding
algorithms can be represented by a list of operations. Let
$b^i = \{b^i_{n-1}, b^i_{n-2}, \ldots, b^i_0\}$ represent the binary expansion of the
integer $i$. The LLR value associated with $u_i$ can be calculated
by a set of $F_t$ and $G_t$ operations as [14]:
\[ \begin{cases} F_t, & \text{if } b^i_t = 0, \\ G_t, & \text{if } b^i_t = 1. \end{cases} \tag{12} \]
For example, the LLR value associated with u0 in Fig. 1
can be calculated by performing F2, F1, and F0, respec-
tively, and the LLR value associated with u1 in Fig. 1 can
be calculated by performing F2, F1, and G0, respectively.
However, the calculation of the LLR value for u1 can use
the already calculated F2 and F1 operations in u0. Let M
denote the minimum index in $b^i$ such that $b^i_M = 1$. It is
only required to perform $F_t$ or $G_t$ operations with $t \le M$
because for $t > M$, the LLR values are already calculated for
previous bits. For example, the list of operations associated
with the SC-based decoder of Fig. 1 can be represented as
$\{F_2, F_1, F_0, G_0, G_1, F_0, G_0, G_2, F_1, F_0, G_0, G_1, F_0, G_0\}$. It should
be noted that since the hard estimate operations of (9), (10),
and (11) are performed right after Ft or Gt functions at a
leaf node and in the same time step, we do not include them
in the list of operations. The list of operations for SC-based
decoders can be generated directly on hardware by simple
bitwise operations [13], [14].
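The bitwise generation mentioned above can be sketched in software. The function below is an illustration under the conventions of (12), not the circuit of [13], [14]: for bit $i > 0$ it finds the lowest set bit $M$ of $i$, emits $G_M$, and then $F_{M-1}, \ldots, F_0$. The name `sc_schedule` and the string labels are hypothetical.

```python
def sc_schedule(n):
    """List of F/G operations for SC decoding of a length-2**n polar code."""
    # Bit 0: every level needs an F operation.
    ops = [f"F{t}" for t in range(n - 1, -1, -1)]
    for i in range(1, 1 << n):
        # M is the minimum index with a 1 in the binary expansion of i.
        M = (i & -i).bit_length() - 1
        ops.append(f"G{M}")
        # Bits below M are 0, so the remaining operations are F.
        ops.extend(f"F{t}" for t in range(M - 1, -1, -1))
    return ops
```

For `n = 3` this reproduces the 14-entry list given for Fig. 1, and in general the list has $2N - 2$ entries.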
It is worth mentioning that the list of operations for SC-
based decoders is fixed for all rates and thus SC-based
decoders are rate-flexible. However, the number of time steps
required to finish the decoding process in SC-based decoders
is at least 2N − 2.¹ This limits the latency and throughput of
polar codes when decoded by SC-based decoders.
C. Fast SC-Based Decoding
In order to reduce the latency and increase the throughput
of SC-based decoders for polar codes, special node structures
are identified and the decoding is performed based on the LLR
values at the intermediate levels in the SC-based decoding
tree without the need of traversing it. It was shown in [4],
[5] that four special nodes can be decoded efficiently in fast
simplified SC (Fast-SSC) decoding without traversing the tree
at the special nodes. Let $v^t = \{v^t_0, v^t_1, \ldots, v^t_{T-1}\}$ represent a
subset of $v$ and $s^t = \{s^t_0, s^t_1, \ldots, s^t_{T-1}\}$ represent a subset of $s$
corresponding to a node of length $T$ in a polar code decoding
tree. The four special nodes are:
• Rate-0 Node: This node consists of only frozen bits, i.e.,
$v^t_i \ge K$ for any $i \in \{0, 1, \ldots, T-1\}$ ($s^t = \{0, 0, \ldots, 0\}$).
• Rate-1 Node: This node consists of only information
bits, i.e., $v^t_i < K$ for any $i \in \{0, 1, \ldots, T-1\}$
($s^t = \{1, 1, \ldots, 1\}$).
• Repetition (Rep) Node: This node consists of frozen bits
except for the last bit which is an information bit, i.e.,
$v^t_{T-1} < K$ and $v^t_i \ge K$ for any $i \in \{0, 1, \ldots, T-2\}$
($s^t = \{0, \ldots, 0, 0, 1\}$).
• Single parity-check (SPC) Node: This node consists of
information bits except for the first bit which is a frozen
bit, i.e., $v^t_0 \ge K$ and $v^t_i < K$ for any $i \in \{1, 2, \ldots, T-1\}$
($s^t = \{0, 1, 1, \ldots, 1\}$).
It was shown in [7], [8] that these nodes can be decoded
efficiently also in simplified SCL (SSCL), SSCL-SPC, fast
SSCL (Fast-SSCL), and Fast-SSCL-SPC decoding without the
need for traversing the tree. This is performed by estimating
bits one by one at an intermediate level of the decoding tree,
thus generating only 2L candidates and selecting the best L
from them, similar to the conventional SCL decoding process.
This guarantees that the sorter module which selects the L
candidates out of 2L remains the same as the conventional
SCL decoder. The method was also applied to the SCF decoder
which resulted in the Fast-SSCF decoder in [9]. Recently, five
new special nodes were identified in [6] and efficient decoders
that can be used in SC decoding were designed for them.
These nodes are:
• Type-I Node: This node consists of frozen bits except for
the last two bits which are information bits, i.e., $v^t_{T-1} < K$,
$v^t_{T-2} < K$, and $v^t_i \ge K$ for any $i \in \{0, 1, \ldots, T-3\}$
($s^t = \{0, \ldots, 0, 1, 1\}$).
• Type-II Node: This node consists of frozen bits except
for the last three bits which are information bits, i.e.,
$v^t_{T-1} < K$, $v^t_{T-2} < K$, $v^t_{T-3} < K$, and $v^t_i \ge K$ for any
$i \in \{0, 1, \ldots, T-4\}$ ($s^t = \{0, \ldots, 0, 1, 1, 1\}$).
• Type-III Node: This node consists of information bits
except for the first two bits which are frozen bits, i.e.,
$v^t_0 \ge K$, $v^t_1 \ge K$, and $v^t_i < K$ for any $i \in \{2, 3, \ldots, T-1\}$
($s^t = \{0, 0, 1, \ldots, 1\}$).
• Type-IV Node: This node consists of information bits
except for the first three bits which are frozen bits, i.e.,
$v^t_0 \ge K$, $v^t_1 \ge K$, $v^t_2 \ge K$, and $v^t_i < K$ for any
$i \in \{3, 4, \ldots, T-1\}$ ($s^t = \{0, 0, 0, 1, \ldots, 1\}$).
• Type-V Node: This node consists of frozen bits except for
the bits $T-5$, $T-3$, $T-2$, and $T-1$ which are information
bits, i.e., $v^t_{T-1} < K$, $v^t_{T-2} < K$, $v^t_{T-3} < K$, $v^t_{T-4} \ge K$,
$v^t_{T-5} < K$, and $v^t_i \ge K$ for any $i \in \{0, 1, \ldots, T-6\}$
($s^t = \{0, \ldots, 0, 1, 0, 1, 1, 1\}$).
Fig. 2: Fast SC-based decoding on a binary tree for P(8, 4)
and v = {7, 6, 5, 3, 4, 2, 1, 0} (s = {0, 0, 0, 1, 0, 1, 1, 1}).
[Pruned tree: the root at t = 3 is a Type-V node; its children at
t = 2 are a Rep node and an SPC node; messages α and β.]
¹ For the SCL decoder, K more time steps are needed to perform the PM
computation and path pruning [13]. For the SCF decoder, additional rounds of
SC decoding add to the number of required time steps [3].
It was shown in [15] that these new nodes can be decoded
efficiently to improve the speed of SCL decoding. However,
the drawback of using these new nodes when implementing
the decoder on hardware is that these nodes are based on
multiple bit estimations at a time, thus producing more than
2L candidates in each decoding step. Therefore, a large sorter
is required to select the final L candidates which adversely
affects the hardware implementation complexity. In particular,
at each decoding step, Type-I node produces 4L candidates to
account for all the cases for its two information bits, Type-
II node produces 8L candidates to account for all the cases
for its three information bits, and Type-V node produces 16L
candidates to account for all the cases for its four information
bits. Moreover, Type-III node is decoded using two parallel
SPC node decoders, and Type-IV node starts by decoding a
Rep node of length four followed by four parallel SPC node
decoders [15].
The pruned decoding tree for the same example as in Fig. 1
is shown in Fig. 2. If the new nodes are not taken into account,
P(8, 4) can be decoded in four time steps by traversing the
tree for one level and decoding the resulting Rep and SPC
nodes. The resulting list of operations for the decoder would
be $\{F_2, \text{Rep}_2, G_2, \text{SPC}_2\}$, where $\text{Rep}_t$ and $\text{SPC}_t$ represent the
decoding of Rep and SPC nodes of length $T = 2^t$, respectively.
However, by considering the new nodes, the decoder
can immediately decode the received vector by decoding the
Type-V node. The corresponding list of operations would be
$\{\text{Type-V}_3\}$, where $\text{Type-V}_t$ represents the decoding of Type-V
nodes of length $T = 2^t$. The operations which are performed
in fast SC-based decoders are summarized in Table I. Note
that Ft and Gt operations are common between conventional
SC-based and fast SC-based decoding algorithms. In the
hardware implementation of fast SC-based decoders, this list
of operations is stored in memory and is fed into the decoder
to perform decoding [5], [7], [8].
TABLE I: Different operations that are supported in SC-based
decoding algorithms.
Operation       Description                                    Decoder
$F_t$           Calculate $\alpha^\ell$ at level $t$.          SC-based
$G_t$           Calculate $\alpha^r$ at level $t$.             SC-based
Rate-0$_t$      Decode Rate-0 node of length $2^t$.            Fast SC-based
Rate-1$_t$      Decode Rate-1 node of length $2^t$.            Fast SC-based
Rep$_t$         Decode Rep node of length $2^t$.               Fast SC-based
SPC$_t$         Decode SPC node of length $2^t$.               Fast SC-based
Type-I$_t$      Decode Type-I node of length $2^t$.            Fast SC-based
Type-II$_t$     Decode Type-II node of length $2^t$.           Fast SC-based
Type-III$_t$    Decode Type-III node of length $2^t$.          Fast SC-based
Type-IV$_t$     Decode Type-IV node of length $2^t$.           Fast SC-based
Type-V$_t$      Decode Type-V node of length $2^t$.            Fast SC-based
Fig. 3: Fast SC-based decoding on a binary tree for P(8, 5)
and v = {7, 6, 5, 3, 4, 2, 1, 0} (s = {0, 0, 0, 1, 1, 1, 1, 1}).
[Pruned tree: the root at t = 3 is a Type-IV node; its children at
t = 2 are a Rep node and a Rate-1 node; messages α and β.]
Let us consider the example in Fig. 2. If the rate of the
code changes from 1/2 to 5/8, the list of operations also
changes as shown in Fig. 3. Without using the new nodes,
the list of operations becomes $\{F_2, \text{Rep}_2, G_2, \text{Rate-1}_2\}$, and by
considering the new nodes it becomes $\{\text{Type-IV}_3\}$. Therefore,
as the rate changes, the list of operations changes. The
resulting decoder is therefore not rate-flexible. For applications
that support codes with multiple rates, for each rate, the list
of operations has to be stored in memory to make the decoder
flexible. However, this results in high memory usage when
implemented on hardware.
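The rate dependence described above can be reproduced with a short software sketch of the off-line procedure: the tree is traversed recursively and pruned wherever the full frozen/information pattern of a node matches Rate-0, Rate-1, Rep, or SPC. This is an illustration of the conventional memory-based flow restricted to those four node types, not the hardware method proposed in this paper; the function names and the `(operation, level)` tuple representation are assumptions.

```python
def node_type(s):
    """Classify a node from its complete frozen/information pattern s."""
    if not any(s):
        return "Rate-0"
    if all(s):
        return "Rate-1"
    if s[-1] == 1 and not any(s[:-1]):
        return "Rep"
    if s[0] == 0 and all(s[1:]):
        return "SPC"
    return None


def fast_schedule(v, K):
    """List of (operation, level) pairs for a fast SC-based decoder."""
    s = [1 if vi < K else 0 for vi in v]
    ops = []

    def visit(lo, hi):
        T = hi - lo
        t = T.bit_length() - 1          # node length is T = 2**t
        kind = node_type(s[lo:hi])
        if kind is not None:
            ops.append((kind, t))       # special node: decode without traversal
            return
        mid = lo + T // 2
        ops.append(("F", t - 1))
        visit(lo, mid)
        ops.append(("G", t - 1))
        visit(mid, hi)

    visit(0, len(v))
    return ops
```

With `v = [7, 6, 5, 3, 4, 2, 1, 0]`, calling `fast_schedule(v, 4)` and `fast_schedule(v, 5)` gives two different lists, which is the reason a memory-based decoder has to store one list per rate.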
III. RATE-FLEXIBLE FAST POLAR DECODING
The high memory usage of storing the list of operations can
be mitigated by generating the list of operations on hardware
as the decoding proceeds. A rudimentary approach would be
to generate the vector $s^t$ from $K$ and the vector $v^t$ using
comparators, and check the pattern of information and frozen
bits in $s^t$ for every encountered node. This is shown in Fig. 4
for determining Rate-0, Rate-1, Rep, and SPC nodes of length
8. It should be noted that the comparators in Fig. 4a have two
inputs $A$ and $B$, and an output $C$ where
\[ C = \begin{cases} 0, & \text{if } A \ge B, \\ 1, & \text{if } A < B. \end{cases} \tag{13} \]
The problem with this approach is that for nodes of large
length, there is a high hardware complexity overhead in
generating st from K and vt , and determining the node types.
Moreover, the module that generates the list of operations
should account for the largest possible node which is the root
node in the decoding tree with size N . This results in a large
critical path which limits the operating frequency.
Fig. 4: Determination of node types for fast SC-based decoding in a node of length $T = 8$. (a) generation of $s^t$ from $K$ and
$v^t$, (b) Rate-0 node, (c) Rate-1 node, (d) Rep node, (e) SPC node.
[Circuit diagrams: in (a), eight comparators with inputs A, B and output C
compare $v^t_0, \ldots, v^t_7$ with $K$ to produce $s^t_0, \ldots, s^t_7$; in (b)-(e), logic on
$s^t_0, \ldots, s^t_7$ produces the Rate-0, Rate-1, Rep, and SPC outputs.]
In order to tackle the above issue, the idea is to exploit
the inherent order in the Bhattacharyya parameters of the
bit-channels. Let $W_i$ and $W_j$ be the bit-channels corresponding
to $u_i$ and $u_j$, and let $b^i$ and $b^j$ be the binary expansions
of the integers $i$ and $j$. In [16], [17] a partial order between
the polarized bit-channels was introduced. In particular, it was
proven that $W_i$ is stochastically degraded with respect to $W_j$,
i.e., $W_i \prec W_j$, when one of the following two properties holds:
• Addition Property [18]: There exists $k \in \{0, 1, \ldots, n-1\}$
such that
\[ \begin{cases} b^i_m = b^j_m, & \text{if } m \neq k, \\ b^i_k = 0, \\ b^j_k = 1. \end{cases} \tag{14} \]
• Left-Swap Property [18]: There exist $k, l \in \{0, 1, \ldots, n-1\}$
such that $l < k$ and
\[ \begin{cases} b^i_m = b^j_m, & \text{if } m \neq k,\ m \neq l, \\ b^i_k = b^j_l = 0, \\ b^i_l = b^j_k = 1. \end{cases} \tag{15} \]
Recall that, if Wi ≺ Wj , then all the reliability measures of
Wi are worse than those of Wj , i.e., Wi has smaller mutual
information, larger Bhattacharyya parameter, and larger error
probability. Consequently, if uj belongs to the frozen set, then
also ui belongs to the frozen set. Furthermore, if ui belongs
to the information set, then also uj belongs to the information
set. By using the two properties above, it was shown in [18]
that it suffices to compute the reliability of a sublinear fraction
of channels in order to identify the frozen and the information
sets.
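The two properties (14) and (15) are simple bit tests on the indices. A minimal sketch, assuming indices are non-negative Python integers and using hypothetical function names, checks whether a single application of either property yields $W_i \prec W_j$:

```python
def addition_holds(i, j):
    """Property (14): i and j differ in exactly one bit, which is 0 in i and 1 in j."""
    diff = i ^ j
    one_bit = diff != 0 and (diff & (diff - 1)) == 0
    return one_bit and (j & diff) != 0


def left_swap_holds(i, j):
    """Property (15): i and j differ in two bits l < k, with i = (0 at k, 1 at l)."""
    diff = i ^ j
    if bin(diff).count("1") != 2:
        return False
    l = (diff & -diff).bit_length() - 1   # lower differing position
    k = diff.bit_length() - 1             # higher differing position
    # Since both positions differ, j automatically has 1 at k and 0 at l.
    return ((i >> k) & 1) == 0 and ((i >> l) & 1) == 1
```

For instance, with 4-bit indices, `left_swap_holds(10, 12)` compares 1010 with 1100 and returns `True`. The partial order itself is the transitive closure of these one-step relations, which this sketch does not compute.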
Another option to find an ordering between the Bhat-
tacharyya parameters of the bit-channels can be described as
follows. Consider the transmission over a BMS channel W
with Bhattacharyya parameter $Z(W)$ and define the synthetic
channels $W^0$ and $W^1$ as
\[ \begin{aligned} W^0(y_1, y_2 \mid x_1) &= \sum_{x_2} \frac{1}{2} W(y_1 \mid x_1 \oplus x_2)\, W(y_2 \mid x_2), \\ W^1(y_1, y_2, x_1 \mid x_2) &= \frac{1}{2} W(y_1 \mid x_1 \oplus x_2)\, W(y_2 \mid x_2). \end{aligned} \tag{16} \]
Then, the following inequalities between $Z(W^0)$, $Z(W^1)$ and
$Z(W)$ hold
\[ \begin{aligned} Z(W) \sqrt{2 - Z(W)^2} \le Z(W^0) &\le 2Z(W) - Z(W)^2, \\ Z(W^1) &= Z(W)^2, \end{aligned} \tag{17} \]
which follow from Proposition 5 of [1] and from Exercise 4.62
of [19]. Furthermore, the bit-channel $W_i$ corresponding to $u_i$
is given by the recursive formula below:
\[ W_i = (((W^{b^i_{n-1}})^{b^i_{n-2}}) \cdots)^{b^i_0}. \tag{18} \]
In what follows, we will denote by Zi the Bhattacharyya
parameter of Wi .
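Combining (17) with the recursion (18) gives an interval that contains $Z_i$. The sketch below propagates a lower and an upper bound through the binary expansion of $i$, starting from $Z(W) = z$; it relies on the fact that each bound in (17) is non-decreasing on $[0, 1]$. The function name `z_bounds` is hypothetical.

```python
from math import sqrt


def z_bounds(i, n, z):
    """Return (lower, upper) bounds on Z_i for a length-2**n code, from (17) and (18)."""
    lo = hi = z
    # Apply the transforms in the order b_{n-1}, b_{n-2}, ..., b_0, as in (18).
    for t in range(n - 1, -1, -1):
        if (i >> t) & 1:
            lo, hi = lo * lo, hi * hi                       # Z(W^1) = Z(W)^2
        else:
            lo, hi = lo * sqrt(2.0 - lo * lo), 2.0 * hi - hi * hi
    return lo, hi
```

For an all-ones expansion the two bounds coincide, since only the equality in (17) is used.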
At this point, we are ready to state and prove the first result
of this paper, which concerns the identification of Rate-0, Rate-
1, Rep, and SPC nodes.
Theorem 1. Consider a node of length $T = 2^t$ in a polar code
of length $N = 2^n$. Then, the following properties hold:
1) If $v^t_{T-1} \ge K$, i.e., $s^t_{T-1} = 0$, then the node represents a
Rate-0 node.
2) If $v^t_0 < K$, i.e., $s^t_0 = 1$, then the node represents a Rate-1
node.
3) If $v^t_{T-1} < K$ and $v^t_{T-2} \ge K$, i.e., $s^t_{T-1} = 1$ and $s^t_{T-2} = 0$,
then the node represents a Rep node.
4) If $v^t_0 \ge K$ and $v^t_1 < K$, i.e., $s^t_0 = 0$ and $s^t_1 = 1$, then
the node represents an SPC node.
Proof. 1) Note that $b^{T-1} = \{1, \ldots, 1\}$. By using the addition
property (14), we obtain that $W_i \prec W_{T-1}$ for any
$i \in \{0, 1, \ldots, T-2\}$. Hence, as $v^t_{T-1} \ge K$, $v^t_i \ge K$ for
any $i \in \{0, 1, \ldots, T-2\}$. This means that the polar code
consists of only frozen bits, i.e., it is a Rate-0 node.
2) Note that $b^0 = \{0, \ldots, 0\}$. By using the addition property
(14), we obtain that $W_0 \prec W_i$ for any $i \in \{1, 2, \ldots, T-1\}$.
Hence, as $v^t_0 < K$, $v^t_i < K$ for any $i \in \{1, 2, \ldots, T-1\}$.
This means that the polar code consists of only information
bits, i.e., it is a Rate-1 node.
3) Note that $b^{T-2} = \{1, \ldots, 1, 0\}$. By using the addition
property (14) and the left-swap property (15), we obtain
that $W_i \prec W_{T-2}$ for any $i \in \{0, 1, \ldots, T-3\}$. Hence,
as $v^t_{T-2} \ge K$, $v^t_i \ge K$ for any $i \in \{0, 1, \ldots, T-3\}$. As
$v^t_{T-1} < K$, the polar code consists of frozen bits except
for the last bit which is an information bit, i.e., it is a
Rep node.
4) Note that $b^1 = \{0, \ldots, 0, 1\}$. By using the addition
property (14) and the left-swap property (15), we obtain
that $W_1 \prec W_i$ for any $i \in \{2, 3, \ldots, T-1\}$. Hence, as
$v^t_1 < K$, $v^t_i < K$ for any $i \in \{2, 3, \ldots, T-1\}$. As $v^t_0 \ge K$,
the polar code consists of information bits except for the
first bit which is a frozen bit, i.e., it is an SPC node. □
Fig. 5: Efficient generation of the list of operations on hardware.
[Circuit: four comparators with inputs A, B and output C compare
$v^t_0$, $v^t_1$, $v^t_{T-2}$, $v^t_{T-1}$ with $K$ to produce $s^t_0$, $s^t_1$, $s^t_{T-2}$, $s^t_{T-1}$,
which drive the Rate-1, SPC, Rep, and Rate-0 outputs.]
In the proof of Theorem 1, we used the fact that for any
node of length $T = 2^t$ in a polar code of length $N = 2^n$, the
$n$-bit binary expansions of the integers corresponding to the
bit-channels in the node are equal in the bits $\{n-1, n-2, \ldots, t\}$,
and are different in the bits $\{t-1, t-2, \ldots, 0\}$. An immediate
consequence of Theorem 1 is that, by checking only one value,
we can find out if a constituent node is either a Rate-0 or
a Rate-1 node. Furthermore, by checking only two values,
we can find out if a constituent node is either a Rep or an
SPC node. This observation significantly reduces the hardware
complexity associated with the on-line node identification.
In addition, the proposed approach is independent of the
node length, making it suitable for codes of any length and
rate. Fig. 5 shows the circuit required to generate the list of
operations on-line for any node of length T . It can be seen
that the circuit consists of only four comparators, three NOT
gates, and two AND gates.
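A software analogue of the identification in Theorem 1 might look as follows. It inspects at most four entries of $v^t$ regardless of the node length; the function name is hypothetical, and the result is only meaningful when $v$ is consistent with the partial order used in the proof.

```python
def identify_basic(vt, K):
    """Identify Rate-0, Rate-1, Rep, or SPC nodes from at most four entries of vt."""
    T = len(vt)
    if vt[T - 1] >= K:                  # s[T-1] = 0
        return "Rate-0"
    if vt[0] < K:                       # s[0] = 1
        return "Rate-1"
    # From here on, s[T-1] = 1 and s[0] = 0, which implies T >= 2.
    if vt[T - 2] >= K:                  # s[T-2] = 0
        return "Rep"
    if vt[1] < K:                       # s[1] = 1
        return "SPC"
    return None                         # not one of the four basic nodes
```

On the running example, the left half `[7, 6, 5, 3]` with `K = 4` is reported as a Rep node and the right half `[4, 2, 1, 0]` as an SPC node, matching the pruned tree of Fig. 2 without the new nodes.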
Let us now state and prove the second result of this paper,
which concerns the identification of Type-I, Type-II, Type-III,
Type-IV, and Type-V nodes.
Theorem 2. Consider a node of length $T = 2^t$ in a polar code
of length $N = 2^n$. Then, the following properties hold:
1) If $v^t_{T-1} < K$, $v^t_{T-2} < K$, and $v^t_{T-3} \ge K$, then the node
represents a Type-I node.
2) If $v^t_{T-1} < K$, $v^t_{T-2} < K$, $v^t_{T-3} < K$, and $v^t_{T-5} \ge K$, then
the node represents a Type-II node.
3) If $v^t_0 \ge K$, $v^t_1 \ge K$, and $v^t_2 < K$, then the node
represents a Type-III node.
4) If $v^t_0 \ge K$, $v^t_1 \ge K$, $v^t_2 \ge K$, and $v^t_4 < K$, then the node
represents a Type-IV node.
5) If $v^t_{T-1} < K$, $v^t_{T-2} < K$, $v^t_{T-3} < K$, $v^t_{T-4} \ge K$, $v^t_{T-5} < K$,
and $v^t_{T-9} \ge K$, then the node represents a Type-V node.
Proof. 1) Note that $b^{T-3} = \{1, \ldots, 1, 0, 1\}$. By using the
addition property (14) and the left-swap property (15),
we obtain that $W_i \prec W_{T-3}$ for any $i \in \{0, 1, \ldots, T-4\}$.
Hence, as $v^t_{T-3} \ge K$, $v^t_i \ge K$ for any $i \in \{0, 1, \ldots, T-4\}$.
As $v^t_{T-1} < K$ and $v^t_{T-2} < K$, the node consists of frozen
bits except for the last two bits which are information
bits, i.e., it is a Type-I node.
2) Note that $b^{T-5} = \{1, \ldots, 1, 0, 1, 1\}$. By using the addition
property (14) and the left-swap property (15), we obtain
that $W_i \prec W_{T-5}$ for any $i \in \{0, 1, \ldots, T-6\}$. Hence, as
$v^t_{T-5} \ge K$, $v^t_i \ge K$ for any $i \in \{0, 1, \ldots, T-6\}$.
Furthermore, note that $b^{T-4} = \{1, \ldots, 1, 1, 0, 0\}$. Let $W$ be
the transmission channel and let $z$ be the Bhattacharyya
parameter of the channel defined as
\[ \underbrace{(((W^{1})^{1}) \cdots)^{1}}_{t-3 \text{ times}}. \]
Then, by using (17), we have that
\[ \begin{aligned} Z_{T-5} &\le (2z - z^2)^4, \\ Z_{T-4} &\ge z^2 \sqrt{2 - z^4} \sqrt{2 - z^4 (2 - z^4)}. \end{aligned} \]
It is easy to check that, for any $z \in [0, 1]$,
\[ (2z - z^2)^4 \le z^2 \sqrt{2 - z^4} \sqrt{2 - z^4 (2 - z^4)}, \tag{19} \]
which implies that
\[ Z_{T-5} \le Z_{T-4}. \]
Consequently, as $v^t_{T-5} \ge K$, $v^t_{T-4} \ge K$. As a result, since
$v^t_{T-1} < K$, $v^t_{T-2} < K$, and $v^t_{T-3} < K$, the node consists
of frozen bits except for the last three bits which are
information bits, i.e., it is a Type-II node.
3) Note that $b^2 = \{0, \ldots, 0, 1, 0\}$. By using the addition
property (14) and the left-swap property (15), we obtain
that $W_2 \prec W_i$ for any $i \in \{3, 4, \ldots, T-1\}$. Hence, as
$v^t_2 < K$, $v^t_i < K$ for any $i \in \{3, 4, \ldots, T-1\}$. As $v^t_0 \ge K$
and $v^t_1 \ge K$, the node consists of information bits except
for the first two bits which are frozen bits, i.e., it is a
Type-III node.
4) Note that $b^4 = \{0, \ldots, 0, 1, 0, 0\}$. By using the addition
property (14) and the left-swap property (15), we obtain
that $W_4 \prec W_i$ for any $i \in \{5, 6, \ldots, T-1\}$. Hence,
as $v^t_4 < K$, $v^t_i < K$ for any $i \in \{5, 6, \ldots, T-1\}$.
Furthermore, note that $b^3 = \{0, \ldots, 0, 0, 1, 1\}$. Let $W$ be
the transmission channel and let $z$ be the Bhattacharyya
parameter of the channel defined as
\[ \underbrace{(((W^{0})^{0}) \cdots)^{0}}_{t-3 \text{ times}}. \]
Then, by using (17), we have that
\[ \begin{aligned} Z_3 &\le (2z - z^2)^4, \\ Z_4 &\ge z^2 \sqrt{2 - z^4} \sqrt{2 - z^4 (2 - z^4)}. \end{aligned} \]
Since (19) holds for any $z \in [0, 1]$, we obtain that
\[ Z_3 \le Z_4. \]
Consequently, as $v^t_4 < K$, $v^t_3 < K$. As a result, since
$v^t_0 \ge K$, $v^t_1 \ge K$, and $v^t_2 \ge K$, the node consists of
information bits except for the first three bits which are
frozen bits, i.e., it is a Type-IV node.
5) Note that $b^{T-9} = \{1, \ldots, 1, 0, 1, 1, 1\}$. By using the addition
property (14) and the left-swap property (15), we obtain
that $W_i \prec W_{T-9}$ for any $i \in \{0, 1, \ldots, T-10\}$. Hence,
as $v^t_{T-9} \ge K$, $v^t_i \ge K$ for any $i \in \{0, 1, \ldots, T-10\}$.
By using again the left-swap property (15), we obtain
that $W_{T-6} \prec W_{T-4}$ and $W_{T-7} \prec W_{T-4}$. By using again
the addition property (14), we obtain that $W_{T-8} \prec W_{T-4}$.
Hence, as $v^t_{T-4} \ge K$, $v^t_i \ge K$ for any $i \in \{T-6, T-7, T-8\}$.
As a result, since $v^t_{T-1} < K$, $v^t_{T-2} < K$, $v^t_{T-3} < K$,
and $v^t_{T-5} < K$, the node is a Type-V node. □
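Inequality (19) is stated as easy to check; a quick numerical sanity check on a uniform grid over $[0, 1]$ is sketched below. This is not a proof, and the grid size and the tolerance are arbitrary choices made for this illustration.

```python
from math import sqrt


def lhs(z):
    """Left-hand side of (19): upper bound on Z_{T-5} (and on Z_3)."""
    return (2.0 * z - z * z) ** 4


def rhs(z):
    """Right-hand side of (19): lower bound on Z_{T-4} (and on Z_4)."""
    z4 = z ** 4
    return z * z * sqrt(2.0 - z4) * sqrt(2.0 - z4 * (2.0 - z4))


def inequality_holds(num_points=10001, tol=1e-12):
    """Check lhs(z) <= rhs(z) on a uniform grid over [0, 1]."""
    grid = (k / (num_points - 1) for k in range(num_points))
    return all(lhs(z) <= rhs(z) + tol for z in grid)
```

The two sides coincide at the endpoints $z = 0$ and $z = 1$, which is why a small tolerance is included.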
The proofs for the identification of Rate-0, Rep, SPC, Rate-
1, Type-I, Type-III, and Type-V nodes are based on stochastic
degradation arguments. Consequently, these proofs are general
and do not depend on the fact that the frozen bits are deter-
mined according to the value of the Bhattacharyya parameter.
On the contrary, the proofs for Type-II and Type-IV nodes
use the inequalities (17) which are valid for Bhattacharyya
parameters. However, let us point out that the strategy of the
proof (use extremes of information combining bounds such as
(17) in order to compare the reliability of specific channels)
is general. In order to prove a similar statement for different
reliability measures, one would need to find bounds of the
form (17) for the desired reliability measure (e.g., mutual
information, error probability). Let us further clarify that the
proofs for Type-II and Type-IV nodes provide an ordering
between the Bhattacharyya parameter of bit-channels. As such,
they do not depend on the particular technique used to compute
those Bhattacharyya parameters (Gaussian approximation [20],
beta-expansion [21], Monte Carlo simulation [1], etc.). Let
us also note that the Bhattacharyya parameter represents the
typical performance metric employed for code construction
[22]–[24].
It is also worth mentioning that since every node in the SC-
based decoding tree represents a polar code constructed for a
different channel [1], the results in this section are valid for
all the nodes in any polar code of any length. Fig. 6 shows the
circuit required to generate the list of operations on-line for
any node of length T , if Type-I, Type-II, Type-III, Type-IV,
and Type-V nodes are considered in addition to Rate-0, Rep,
SPC, and Rate-1 nodes. It can be seen that the circuit consists
of ten comparators, nine NOT gates, and fourteen AND gates,
in order to identify all the special nodes.
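The conditions used in the proofs of items 3 to 5 translate directly into a few comparisons against K. The Python sketch below is a software analogue of those three conditions only; it is not a gate-level reproduction of Fig. 6, and the remaining node types are omitted. The function name, the return labels, and the restriction to T ≥ 16 (so that index T − 9 exists) are assumptions introduced here.

```python
def identify_new_node(v, K):
    """Classify a node of length T = len(v) as Type-III, Type-IV, Type-V, or None.

    v[i] is the relative reliability of bit i of the node. Following the
    proofs above, bit i is an information bit when v[i] < K and a frozen
    bit otherwise. Only a handful of entries of v are inspected.
    """
    T = len(v)
    if T < 16:
        raise ValueError("this sketch assumes T >= 16 so that index T - 9 exists")

    def info(i):
        return v[i] < K       # comparator output for position i

    def frozen(i):
        return v[i] >= K

    # Type-III (item 3): bits 0 and 1 frozen, bit 2 information.
    if frozen(0) and frozen(1) and info(2):
        return "Type-III"
    # Type-IV (item 4): bits 0, 1, 2 frozen, bit 4 information.
    if frozen(0) and frozen(1) and frozen(2) and info(4):
        return "Type-IV"
    # Type-V (item 5): bits T-9 and T-4 frozen; bits T-5, T-3, T-2, T-1 information.
    if (frozen(T - 9) and frozen(T - 4)
            and info(T - 5) and info(T - 3) and info(T - 2) and info(T - 1)):
        return "Type-V"
    return None
```

The sketch relies on the ordering results above to infer the status of the bits that are not inspected, so it is meaningful only for frozen sets that respect those orderings.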
[Fig. 6 circuit diagram: ten comparators compare vt0, vt1, vt2, vt4, vtT−9, vtT−5, vtT−4, vtT−3, vtT−2, and vtT−1 with K to produce st0, st1, st2, st4, stT−9, stT−5, stT−4, stT−3, stT−2, and stT−1, which are combined to generate the Rate-1, SPC, Type-III, Type-IV, Type-V, Type-II, Type-I, Rep, and Rate-0 signals.]
Fig. 6: Efficient generation of the list of operations on hardware considering new nodes.
IV. DECODER ARCHITECTURE
As a proof of concept, a decoder architecture implementing
the proposed technique has been designed. It implements the
layered partitioned SCL (LPSCL) decoding algorithm detailed
in [25] and the Fast-SSCL-SPC algorithm introduced in [8],
along with the memory-reduction techniques proposed in [26].
The LPSCL decoder decreases the memory requirements of
standard SCL decoding by dividing the SC decoding tree in
different partitions; the bottom part of the SC decoding tree
belonging to each partition is decoded with SCL with a list size
Lmax. When information needs to be passed between partitions,
i.e. at the top stages of the tree, only Lt < Lmax candidate
codewords are passed, with Lt decreasing progressively as the
stage t increases. The Fast-SSCL-SPC algorithm is applied
to the lower stages of the tree, where Lmax candidates are
considered.
[Fig. 7 block diagram: Lmax sets of processing elements PE0, ..., PE(NPE−1), indexed 0 to Lmax−1, together with the path memory, LLR memory, PM memory, PM calculation units, and PM sorting.]
Fig. 7: Decoder architecture.
Fig. 7 shows the architecture of the proposed decoder. It
is based on a semi-parallel SCL decoder architecture, where
Lmax sets of NPE processing elements (PEs) are instantiated
in parallel, implementing (7) and (8). Each set works on a
different candidate codeword, as explained in Section II-B.
Different candidate codewords are created whenever one or
more information bits are estimated. Each set of PEs relies on
a dedicated memory to store the internal LLR values relative
to all stages of the SC decoding tree. LLR values are quantized
with QLLR bits, and represented with sign and magnitude.
Each stage of the SC decoding tree requires the storage of
$2^{t-1}$ LLR values. However, given the limited number of PEs
instantiated, the LLR memory is split in high stage and low
stage memories. The high stage memory stores LLR values of
stages with nodes of size greater than $N_{PE}$: at stage $t$, where
$2^t > 2N_{PE}$, a total of $2^t/(2N_{PE})$ decoding steps are needed
to descend to the lower tree level. The depth of the high
stage memory is $\sum_{j=\log_2 N_{PE}+1}^{n-1} 2^j/N_{PE} = N/N_{PE} - 2$, while it is
$Q_{LLR} \times N_{PE}$ wide. The low stage memory stores LLR values
for stages where $2^t \le 2N_{PE}$, and it is $Q_{LLR}$ bits wide, while
its depth is $\sum_{j=0}^{\log_2 N_{PE}-1} N_{PE}/2^j = 2N_{PE} - 2$. High and low stage
memory words are rewritten when a node belonging to the
same stage t is traversed. Lmax different instantiations of both
high and low stage memories are required. Lmax separate path
memories store the hard bit estimates (11) for all the tree stages
as well, updating them every time that a bit is estimated. PMs,
that identify the likelihood of a candidate codeword (or path)
to be correct, are incremented every time a bit is estimated
differently from the sign of the LLR value associated to it.
They are sorted in PM memory before and after the estimation
of an information bit, in order to identify the Lmax surviving
paths out of the 2Lmax created. When none of the paths coming
from the splitting of a particular candidate codeword survives,
all stages of its LLR memory are overwritten, along with the
bit estimate and PM memories.
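The two memory depths given above can be checked with a few lines of arithmetic. The following Python sketch evaluates both sums directly; the function name and the power-of-two check are assumptions made here for illustration, and the result is counted in memory words as in the text.

```python
def llr_memory_depths(code_length, num_pe):
    """Return (high_stage_depth, low_stage_depth) of the LLR memory in words."""
    n = code_length.bit_length() - 1      # n = log2(N)
    p = num_pe.bit_length() - 1           # log2(NPE)
    if (1 << n) != code_length or (1 << p) != num_pe:
        raise ValueError("code_length and num_pe must be powers of two")
    # High stage: sum over j = log2(NPE) + 1, ..., n - 1 of 2^j / NPE.
    high = sum((1 << j) // num_pe for j in range(p + 1, n))
    # Low stage: sum over j = 0, ..., log2(NPE) - 1 of NPE / 2^j.
    low = sum(num_pe >> j for j in range(p))
    return high, low
```

For N = 1024 and NPE = 64, the sums evaluate to 14 and 126 words, matching N/NPE − 2 and 2NPE − 2.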
This baseline architecture has been modified to implement
the LPSCL decoder. The bottom stages of the SC decoding
tree are left unchanged, and decoded with a list size Lmax.
Given the partitioning factor P, the top log2 P stages rely on
a smaller list size Lt , with n − log2 P < t ≤ n, and Lt ≥ Lt+1.
Consequently, only Lt LLR memories are instantiated in the
upper stages, reducing the LLR memory requirements for
each upper stage by a factor of $(L_{\max} - L_t)/L_{\max}$, as shown in Fig. 8.
Depending on the number of instantiated PEs and on the
partitioning factor, the high and/or low stage memories might
need to be separated into different memory structures, each
part belonging to a different layer of LPSCL and thus instan-
tiated a different number of times, depending on Lt . Since
the number of surviving paths is reduced from Lmax to Lt
when ascending the decoding tree above stage n − log2 P, the
Lmax − Lt candidate codewords with the highest PMs need
to be discarded. In the baseline architecture, PMs are sorted
only when an information bit is estimated, i.e. when the paths
split. However, in the proposed architecture the PMs need to
be sorted also when i mod (N/P) = 0, where i is the index
of the codeword bit that needs to be estimated, and mod
represents the modulo operation. The decoding of a bit with
such an index i identifies the completion of the decoding of a
subtree of size N/P, and the need to transfer information to
the upper tree stages, where Lt < Lmax. The sorting of PMs
allows the most reliable paths, their LLR values, and their hard
bit estimates to be transferred between partitions through the
memory copy mechanism addressed in Fig. 8.
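The additional sorting step at partition boundaries can be sketched in software as follows. This is a minimal illustration, assuming that PMs are held in a plain list and that a lower PM denotes a more reliable path, as implied by the statement that the candidates with the highest PMs are discarded; the function names are hypothetical.

```python
def is_partition_boundary(i, code_length, partitions):
    """True when the bit index i satisfies i mod (N/P) == 0."""
    return i % (code_length // partitions) == 0


def keep_most_reliable(path_metrics, l_t):
    """Return the indices of the l_t paths with the smallest PMs, best first."""
    order = sorted(range(len(path_metrics)), key=lambda k: path_metrics[k])
    return order[:l_t]
```

With Lmax = 4 and Lt = 2, for instance, `keep_most_reliable` retains two of the four paths, and only their LLR values and hard bit estimates would need to be copied to the upper stages.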
The implementation of the Fast-SSCL-SPC algorithm requires
more substantial modifications, which are detailed
in [8]. The hard bit estimate memory and path memories
are updated according to different values depending on the
node type, along with PMs. This requires different parallel
instantiations of the PM computation logic, as shown in Fig. 9.
More complex routing and selection logic are necessary to
update memories, since multiple concurrent values need to be
updated and propagated through the hard bit estimates memory
structure. A sorter module for LLR values is needed in Rate-
1 and SPC nodes, to identify the order with which bits are
estimated: the disruption of the sequential bit estimation order
that SC is based on leads to additional complexity in memory
updates and control logic.
Aside from the logic needed to perform the calculations
for special node PM update and bit estimations, the decoder
needs to know at which point in the SC tree the special nodes
are found, and what is their type. This information is used to
identify the number of clock cycles needed for the decoding
of a particular node, and which of the different parallel PM,
path, and LLR updates is memorized. In [8], the proposed
decoder architecture relied on an off-line compiler to obtain
the sequence of special nodes, their size, and the stage at which
they are encountered. This information differs for every code
supported by the decoder and needs to be stored in a memory.
Note that the frozen and information bit sequence can be either
stored in a memory, as assumed by most decoder architectures
in the literature, or computed on-line given the bit-channel relative
reliability vector and the desired code rate, as proposed in [10].
This approach is significantly more efficient in case of multi-
code decoders, and is facilitated by nested reliability vectors
as those selected for the 5G eMBB control channel [11]. This
is the approach taken in both the baseline and the modified
architectures in this paper, by comparing each entry of the
relative reliability vector v to the desired K in order to obtain
s.
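The on-line computation of s from v and K amounts to one comparison per entry. The Python sketch below illustrates this, assuming (as in the proofs of Section III) that bit i is an information bit when v[i] < K; the encoding of s with 1 for information bits and 0 for frozen bits, the function name, and the toy vector used in the usage note are assumptions and do not correspond to the 5G sequence.

```python
def information_flags(v, K):
    """Return s, with s[i] = 1 if bit i is an information bit (v[i] < K), else 0."""
    return [1 if vi < K else 0 for vi in v]


# Hypothetical relative reliability vector for N = 8 (0 = most reliable).
TOY_V = [7, 6, 5, 3, 4, 2, 1, 0]
```

With this toy vector, `information_flags(TOY_V, 4)` gives `[0, 0, 0, 1, 0, 1, 1, 1]` and `information_flags(TOY_V, 2)` gives `[0, 0, 0, 0, 0, 0, 1, 1]`: the same stored vector serves both rates, which is the property that nested reliability vectors provide.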
[Fig. 8 diagram: LLR memories 0 to Lt+1 − 1 at stage t + 1, LLR memories 0 to Lt − 1 at stage t, and LLR memories 0 to Lmax − 1 below, with copy blocks for the upper/lower stage transfer.]
Fig. 8: LPSCL LLR memory structure.
[Fig. 9 diagram: the LLR values feed parallel PM computation blocks labelled Rate-1, Rep, SPC, Rate-0, Rate-1, Rep, and SPC, producing PMs for bit = 0 and bit = 1.]
Fig. 9: PM calculation for Fast-SSCL-SPC.
The control unit of the modified architecture implements
the proposed special node on-line identification, based on
the relative reliability vector v and K . Fig. 5 shows the
simple logic needed to identify the considered special nodes.
Given the low complexity of the node identification circuit,
the structure is instantiated at every decoding tree stage t,
separately at every partition identified by LPSCL, to reduce the
amount of multiplexing needed at the inputs and the possible
increase in the system critical path. The logic pictured in Fig. 5
is inserted within a finite state machine (FSM) in the decoder
control unit to identify the correct decoding phase, through
two main control signals, NodeType and NodeSize. A
maximum NodeSize value for each NodeType is selected
at design time, to limit the additional complexity and critical
path degradation.
• While the general node type can be identified easily
through the proposed identification, different decoding
phases are foreseen within each special node. Thus,
NodeType foresees subtypes in the special node. While
the Rate-0 node is a standalone node type, the Rate-
1 node is divided into three subtypes: one phase is
assigned to the fetching and sorting of the LLR values,
a second to the estimation of the bits associated to the
least reliable LLR values, and the third for the hard-
decision on the remainder of the bits. The Rep node
is divided in two subtypes, one for the frozen bits and
one for the information bit. Finally, SPC nodes foresee
four subtypes: one for the concurrent fetching and sorting
of LLR values and frozen bit selection, one for the bit
estimations, one for the hard decision on the remaining
bits, and one for the parity correction. The NodeType
signal is thus influenced not only by the result of the logic
in Fig. 5, but also by the number of estimated bits within
the special node, the stage t, and the current NodeType
subtype.
• The control unit identifies the size of the special node
NodeSize as 2t , given the current SC decoding tree
stage t. This information is used to update the index
i of the codeword bit to be estimated. The index i is
usually updated once a leaf node has been reached and
the corresponding bit estimated, but during the decoding
of special nodes, it is kept fixed pointing at the first bit
of the node. Once the decoding is terminated, the index
is updated as i + NodeSize.
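The two points above can be summarized in a small behavioural model. The Python sketch below lists the subtypes per node type as described in the text and shows the index update i + NodeSize; the phase labels and the function name are hypothetical, and the model ignores the number of clock cycles spent in each phase.

```python
# Decoding phases (NodeType subtypes) per special node, as listed above.
NODE_PHASES = {
    "Rate-0": ("rate0",),
    "Rate-1": ("fetch_sort_llrs", "estimate_least_reliable_bits", "hard_decision"),
    "Rep": ("frozen_bits", "information_bit"),
    "SPC": ("fetch_sort_llrs_select_frozen", "estimate_bits",
            "hard_decision", "parity_correction"),
}


def decode_special_node(node_type, stage, bit_index):
    """Walk through the phases of one special node.

    Returns the list of (phase, bit_index) pairs visited and the bit index
    to be used after the node has been decoded.
    """
    node_size = 1 << stage            # NodeSize = 2^t
    visited = []
    for phase in NODE_PHASES[node_type]:
        # The index stays fixed on the first bit of the node in every phase.
        visited.append((phase, bit_index))
    return visited, bit_index + node_size
```

For example, an SPC node at stage t = 4 starting at i = 32 goes through four phases with i held at 32, after which the index becomes 48.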
V. HARDWARE IMPLEMENTATION RESULTS
The proposed decoder architecture has been described in
VHDL and synthesized in TSMC 65 nm CMOS technology,
at the operating conditions defined by the NCCOM corner, i.e.
1.2 V core voltage and a temperature of 298 K. Two versions
of the decoder have been implemented: one considering the
proposed special node identification technique, and one based
on the off-line identification and storage used in [8]. Both
decoders target the 5G polar code with a code length N = 1024
[11], rely on a partitioning factor P = 4, and make use of
64 parallel PEs. The bottom part of the SC decoding tree is
decoded with a list size Lmax = 4, while for the upper stages
L10 = L9 = 2. Fig. 10 shows the frame error rate (FER)
and bit error rate (BER) performance of the LPSCL decoder
used in this paper in comparison with SCL decoding with
L = 4. The curves in Fig. 10 are provided for the code rates
of {1/12, 1/6, 1/3, 1/2, 2/3}. It can be seen that LPSCL decoding incurs
negligible FER and BER performance loss with respect to SCL
for all considered rates. It should be noted that the introduction
of the proposed technique to infer the list of operations on
the fly does not change the FER or BER performance of the
decoder in comparison with the same memory-based decoder.
The channel LLR values are quantized with 4 bits and
internal LLR values with 6 bits, with 2 bits assigned to
the fractional part, while PMs are quantized with 8 bits
[26]. The maximum node size is set to 16 for Rate-0 and
Rep nodes, and to 64 for Rate-1 and SPC nodes. Table II
reports the area occupation and achievable frequency for the
proposed decoder, and for the decoder based on the off-line
identification technique, labelled as memory-based decoder.
The two decoders differ in their implementation of the control
unit (CU): its area occupation ACU in the proposed decoder is
24% less than that of the memory-based decoder. This is due to
the fact that the information computed off-line in the memory-
based case, i.e. the equivalent of the NodeType signal, needs
to be inserted in an FSM analogous to that used by the
control unit of the proposed decoder. This FSM handles the
node subtypes and the internal counters that determine when a
special node decoding is terminated. Moreover, the memory-
based case needs an additional information, NodeStage, to
identify at which SC decoding tree stage the special node is
encountered: the NodeSize information is derived from that.
The NodeStage signal is inserted in its own FSM, that adds
substantial complexity to the control unit, resulting in a larger
ACU. While the contribution of ACU to the total decoder area
occupation Atotal is relatively small, with Atotal = 1.410 mm^2
and Atotal = 1.454 mm^2 for the proposed and the memory-based
decoders respectively, the NodeStage FSM influences
signals in the NodeSize and NodeType FSM, lengthening
the critical path. In particular, the state of NodeStage is
combined to the NodeType and NodeSize to determine
the current and future node subtypes. This leads to a lower
achievable frequency f , lower throughput TP, and lower area
efficiency Aeff in the memory-based decoder in comparison
with the proposed decoder, as provided in Table II.
The proposed decoder fetches four required values of the
relative reliability vector from memory, compares them with
K , and identifies the node types efficiently. Table II also reports
the external memory requirements Memext of the proposed
decoder in comparison with the memory-based decoder con-
sidering 5G code rates are supported. For a code of length
1024, the vector of relative reliabilities v contains 1024 entries
where each entry is stored with 10 bits. Therefore, a total
of 1024 × 10 = 10240 bits are stored in memory.

TABLE II: TSMC CMOS 65 nm synthesis results for N = 1024, P = 4, Lmax = 4, and L10 = L9 = 2.
                              Proposed   Memory-based
ACU [µm^2]                    35881      47025
Atotal [mm^2]                 1.410      1.454
f [MHz]                       955        926
TP @ R = 1/2 [Mb/s]           1223       1186
Aeff @ R = 1/2 [Mb/s/mm^2]    867        816
Memext [bits]                 10240      1025120

For the
memory-based decoder, the memory requirement is different
for different values of K (different rates). This is depicted in
Fig. 11 where it can be seen that the list of operations is large
for medium rates and becomes small as the rate becomes very
high or very low. Note that the proposed decoder is capable
of supporting any code rate within a given code length which
is also foreseen in 5G [12]. If the memory-based decoder is
designed such that it supports all the code rates of 5G for a
code of length 1024 (12 ≤ K ≤ 1024), its memory requirement,
considering a 4-bit representation for each of the NodeType
and NodeStage signals, is 128140 × 8 = 1025120 bits, more
than 100 times larger than the number of bits required for the
proposed decoder.
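The memory figures quoted above follow from simple products. The Python sketch below reproduces them from the numbers in the text and in Table II; reading 128140 × 8 as 128140 stored entries of 8 bits each is an interpretation made here, and the function and variable names are illustrative.

```python
def reliability_memory_bits(code_length, bits_per_entry):
    """Bits needed to store the relative reliability vector v."""
    return code_length * bits_per_entry


def operation_list_memory_bits(num_entries, bits_per_entry):
    """Bits needed to store an off-line list of operations."""
    return num_entries * bits_per_entry


proposed_bits = reliability_memory_bits(1024, 10)            # 1024 x 10
memory_based_bits = operation_list_memory_bits(128140, 8)    # 128140 x 8
memory_ratio = memory_based_bits / proposed_bits

# Relative control unit area saving, from the ACU row of Table II.
cu_area_saving = 1.0 - 35881 / 47025
```

The ratio is slightly above 100, and the control unit saving rounds to the 24% reported in the text.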
Artisan dual-port SRAM compiler was used for the im-
plementation of the external memories. Table III shows
the area occupation of the external memory for the pro-
posed decoder in comparison with the memory-based de-
coder. While the proposed decoder supports all the code
rates, the memory requirement of the memory-based decoder
depends on the number of code rates it can support. In
Table III, we show four cases of memory requirements
for the memory-based decoder: when it supports the 5 code
rates {1/12, 1/6, 1/3, 1/2, 2/3}, when it supports the 10 code rates
{1/16, 1/12, 1/8, 1/6, 1/4, 1/3, 1/2, 2/3, 5/6, 7/8}, when it supports the 20 code rates
{1/24, 1/16, 1/12, 1/8, 1/6, 1/5, 1/4, 5/16, 1/3, 3/8, 2/5, 1/2, 3/5, 5/8, 2/3, 11/16, 3/4, 4/5, 5/6, 7/8}, and
when it supports all the code rates considered in 5G, similar to
the proposed decoder. It can be seen that the proposed decoder
occupies a smaller area in comparison with the memory-based
decoder even when the memory-based decoder supports only
5 code rates. In fact, the area occupation of the memory-
based decoder increases as the number of supported code rates
increases. This consequently reduces the area efficiency of the
memory-based decoder as can be seen in Table III. The area
occupation of the proposed decoder is only 38% of that of the
memory-based decoder when both decoders support all 5G
code rates.
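The totals in Table III can be reproduced by adding the external memory area to the core area of Table II and dividing the throughput by the result. The Python sketch below does this for the two all-rates configurations; the dictionary and function names are illustrative, and all numerical inputs are taken from Tables II and III.

```python
CORE_AREA_MM2 = {"proposed": 1.410, "memory_based": 1.454}   # Table II
THROUGHPUT_MBPS = {"proposed": 1223, "memory_based": 1186}   # Table II, R = 1/2


def with_external_memory(decoder, memory_area_mm2):
    """Return (total area in mm^2, area efficiency in Mb/s/mm^2)."""
    total = round(CORE_AREA_MM2[decoder] + memory_area_mm2, 3)
    return total, THROUGHPUT_MBPS[decoder] / total


proposed_total, proposed_eff = with_external_memory("proposed", 0.039)
all_rates_total, all_rates_eff = with_external_memory("memory_based", 2.358)
area_ratio = proposed_total / all_rates_total
```

The resulting ratio of about 0.38 corresponds to the 38% figure quoted in the text.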
It is worth mentioning that the goal of this paper is to
propose a low-complexity approach to generate the list of
operations for fast SC-based decoders directly on hardware,
therefore, allowing for the implementation of a fast and rate-
flexible SC-based decoder. Our implementation results show
that by using the proposed method, there is a negligible area
occupation overhead or throughput loss in comparison with
the memory-based decoders, while having a completely rate-
flexible decoder.
[Fig. 10 plots: two panels showing FER and BER versus Eb/N0 [dB], from 1 dB to 5 dB, for SCL and LPSCL decoding at R = 1/12, 1/6, 1/3, 1/2, and 2/3.]
Fig. 10: FER and BER performance comparison of decoding the 5G polar code of length N = 1024 and R ∈ {1/12, 1/6, 1/3, 1/2, 2/3},
using LPSCL decoding with Lmax = 4 and L10 = L9 = 2, and SCL decoding with L = 4.
[Fig. 11 plot: Memext [bits], from 0 to 1,500, versus K, from 0 to 1,000.]
Fig. 11: Memory requirements to store the list of operations
of the memory-based decoder for different values of K . The
polar code of length 1024 is used which is adopted in 5G [11].
TABLE III: SRAM synthesis results for external memories.
Decoder        Supported rates   Memext [mm^2]   Atotal [mm^2]   Aeff @ R = 1/2 [Mb/s/mm^2]
Proposed       All rates         0.039           1.449           844
Memory-based   5 rates           0.033           1.487           798
Memory-based   10 rates          0.047           1.501           790
Memory-based   20 rates          0.080           1.534           773
Memory-based   All rates         2.358           3.812           311

The main advantage of the proposed approach is that given
the design code length, any code with the same N can be
decoded using the Fast-SSCL-SPC algorithm without fore-
knowledge of the information/frozen bit sequence, regardless
of rate and target Eb/N0. On the contrary, the memory-based
decoder needs to store the NodeType and NodeStage
information for each considered code in an external memory
of Memext bits.
Table IV compares the proposed decoder to other architec-
tures in the state of the art which use 64 parallel PEs. Results
are reported for P(1024,512) and L = 4. The architectures
presented in [8] and [7] are based on the Fast-SSCL-SPC and
SSCL-SPC algorithms, respectively: it is possible to add the
cost of the external memory directly to their area occupation
and evaluate its impact on the area efficiency, considering all
the code rates in 5G are supported. These modified results are
reported within parentheses. It can be seen that the external
memory increases Atotal by 131% in [8] and by 193% in [7]:
the proposed special node identification technique is thus able
to substantially limit the area occupation and increase the area
efficiency in both architectures. Once the external memory is
included, the architecture presented in
this work has higher Aeff and lower Atotal than both [7] and [8].
Different design choices in terms of concurrent operations in
the special nodes lead to a slightly lower TP than [8], together
with a substantially lower Atotal and higher Aeff .
The architectures presented in [27]–[29] do not rely on a
special-node-based decoding algorithm: thus, the throughput
benefits and complexity saving of the proposed node identification
technique cannot be directly evaluated. Moreover, the
synthesis results of [27] were reported for 90 nm technology,
although the synthesis was carried out in 65 nm technology. Therefore,
a factor of 90/65 was used to convert the frequency, and a
factor of (65/90)^2 was used to convert the area of the decoder
in [27] from 90 nm back to 65 nm technology. The same conversion
factors were used to convert to 65 nm technology the synthesis
results in [28], [29], which were synthesized with a 90 nm
node.
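The conversion factors described above can be written as a small helper. The Python sketch below applies a frequency factor of from_nm/to_nm and an area factor of (to_nm/from_nm)^2, as stated in the text; the function name is hypothetical, and the inputs in the usage note are made-up values, not results from [27]–[29].

```python
def scale_to_node(freq_mhz, area_mm2, from_nm, to_nm):
    """Scale frequency and area between technology nodes.

    Frequency is multiplied by from_nm / to_nm and area by
    (to_nm / from_nm) ** 2, i.e. 90/65 and (65/90)^2 when going
    from 90 nm to 65 nm.
    """
    s = from_nm / to_nm
    return freq_mhz * s, area_mm2 / (s * s)
```

For instance, `scale_to_node(400.0, 2.0, 90, 65)` scales a hypothetical 400 MHz, 2.0 mm^2 design at 90 nm to its 65 nm equivalent under this first-order model.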
TABLE IV: Comparison with state-of-the-art decoders.
                   This work  [8]            [7]           [27]   [28]†  [29]†
Atotal [mm^2]      1.449      1.797 (4.155)  1.22 (3.578)  0.62   0.73   2.00
f [MHz]            955        840            961           498    692    558
TP [Mb/s]          1223       1338           1146          935    551    1578
Latency [µs]       0.84       0.77           0.89          1.10   1.86   0.66
Aeff [Mb/s/mm^2]   844        744 (322)      939 (320)     1508   755    789
†The results are originally based on TSMC 90 nm technology and are scaled to TSMC 65 nm technology.

Our work shows 31% higher throughput and 24% lower
latency (0.84 µs versus 1.10 µs) with respect to the multibit decision SCL decoder
architecture of [27], while the smaller area occupation of
[27] leads to a higher Aeff . The decoder in [28] shows lower
area occupation than our work. However, the architecture
proposed in this work achieves 122% higher throughput and
55% lower latency, leading to 11% higher area efficiency. The
high throughput SCL decoder architecture of [29] achieves
higher throughput and lower latency than this work, at the cost
of 38% higher area occupation and 7% lower Aeff . Moreover,
[29] relies on tunable parameters that can lead to more than
0.2 dB error-correction performance loss. These parameters
also reduce the flexibility of the decoder, since for each code
rate, a different set of parameters needs to be used. However,
the decoder proposed in this paper is designed to guarantee
rate-flexibility, making it suitable for 5G applications.
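Several of the relative figures quoted in this comparison can be recomputed from Table IV. The Python sketch below does so with a single helper; the function name and the choice of which quantity serves as the reference in each case are assumptions made here to match the wording of the text.

```python
def percent_change(value, reference):
    """Relative change of value with respect to reference, in percent."""
    return 100.0 * (value - reference) / reference


# Inputs taken from Table IV.
tp_gain_vs_27 = percent_change(1223, 935)          # throughput, this work vs [27]
tp_gain_vs_28 = percent_change(1223, 551)          # throughput, this work vs [28]
latency_cut_vs_28 = -percent_change(0.84, 1.86)    # latency, this work vs [28]
area_increase_29 = percent_change(2.00, 1.449)     # area, [29] vs this work
aeff_drop_29 = -percent_change(789, 844)           # area efficiency, [29] vs this work
```

Rounded to the nearest integer, these evaluate to 31%, 122%, 55%, 38%, and 7%, respectively.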
VI. CONCLUSION
The main drawback of fast successive-cancellation-based
decoders for polar codes is that they need to store
a list of operations for each code rate in a dedicated memory,
in order to tell the decoder when a special node in a polar
code graph is reached. In this paper, we tackled this issue by
proposing a technique to generate the list of operations on-the-
fly directly in hardware. We proved that this technique can be
applied to polar codes of any rate, therefore, removing the
memory needed to store the list of operations completely. We
proposed a hardware architecture for the proposed technique
and showed that the total area occupation of the proposed
decoder is 38% of the base-line memory-based decoder, if 5G
code rates are considered.
ACKNOWLEDGMENTS
The authors would like to thank Arash Ardakani and
Harsh Aurora of McGill University for helpful discussions.
S. A. Hashemi is supported by a Postdoctoral Fellowship
from the Natural Sciences and Engineering Research Council
of Canada (NSERC). M. Mondelli is supported by an Early
Postdoc.Mobility fellowship from the Swiss National Science
Foundation and by the Simons Institute for the Theory of
Computing.
REFERENCES
[1] E. Arıkan, “Channel polarization: A method for constructing capacity-
achieving codes for symmetric binary-input memoryless channels,” IEEE
Trans. Inf. Theory, vol. 55, no. 7, pp. 3051–3073, July 2009.
[2] I. Tal and A. Vardy, “List decoding of polar codes,” IEEE Trans. Inf.
Theory, vol. 61, no. 5, pp. 2213–2226, May 2015.
[3] O. Afisiadis, A. Balatsoukas-Stimming, and A. Burg, “A low-complexity
improved successive cancellation decoder for polar codes,” in Asilomar
Conf. on Signals, Syst. and Comput., November 2014, pp. 2116–2120.
[4] A. Alamdar-Yazdi and F. R. Kschischang, “A simplified successive-
cancellation decoder for polar codes,” IEEE Commun. Lett., vol. 15,
no. 12, pp. 1378–1380, December 2011.
[5] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. Gross, “Fast polar
decoders: Algorithm and implementation,” IEEE J. Sel. Areas Commun.,
vol. 32, no. 5, pp. 946–957, May 2014.
[6] M. Hanif and M. Ardakani, “Fast successive-cancellation decoding of
polar codes: Identification and decoding of new nodes,” IEEE Commun.
Lett., vol. 21, no. 11, pp. 2360–2363, November 2017.
[7] S. A. Hashemi, C. Condo, and W. J. Gross, “A fast polar code list
decoder architecture based on sphere decoding,” IEEE Trans. Circuits
Syst. I, vol. 63, no. 12, pp. 2368–2380, December 2016.
[8] S. A. Hashemi, C. Condo, and W. J. Gross, “Fast and flexible successive-
cancellation list decoders for polar codes,” IEEE Trans. Signal Process.,
vol. 65, no. 21, pp. 5756–5769, November 2017.
[9] P. Giard and A. Burg, “Fast-SSC-flip decoding of polar codes,” in IEEE
Wireless Commun. and Netw. Conf. Workshops, April 2018, pp. 73–77.
[10] C. Condo, S. A. Hashemi, and W. J. Gross, “Efficient bit-channel reli-
ability computation for multi-mode polar code encoders and decoders,”
in IEEE Int. Workshop on Signal Process. Syst., October 2017, pp. 1–6.
[11] 3GPP TSG RAN WG1 #90, “Summary of email
discussion [NRAH2-11] polar code sequence,”
http://www.3gpp.org/ftp/tsg_ran/wg1_rl1/TSGR1_90/Docs/R1-1712174.zip,
Prague, Czech Republic, August 2017.
[12] 3GPP, “Multiplexing and channel coding,”
http://www.3gpp.org/ftp/Specs/archive/38_series/38.212/38212-f11.zip,
April 2018.
[13] A. Balatsoukas-Stimming, M. Bastani Parizi, and A. Burg, “LLR-based
successive cancellation list decoding of polar codes,” IEEE Trans. Signal
Process., vol. 63, no. 19, pp. 5165–5179, October 2015.
[14] C. Leroux, A. Raymond, G. Sarkis, and W. Gross, “A semi-parallel
successive-cancellation decoder for polar codes,” IEEE Trans. Signal
Process., vol. 61, no. 2, pp. 289–299, January 2013.
[15] M. Hanif, M. H. Ardakani, and M. Ardakani, “Fast list decoding of polar
codes: Decoders for additional nodes,” in IEEE Wireless Commun. and
Netw. Conf. Workshops, April 2018, pp. 37–42.
[16] C. Schürch, “A partial order for the synthesized channels of a polar
code,” in IEEE Int. Symp. on Inform. Theory, July 2016, pp. 220–224.
[17] M. Bardet, V. Dragoi, A. Otmani, and J.-P. Tillich, “Algebraic properties
of polar codes from a new polynomial formalism,” in IEEE Int. Symp.
on Inform. Theory, July 2016, pp. 230–234.
[18] M. Mondelli, S. H. Hassani, and R. Urbanke, “Construction of polar
codes with sublinear complexity,” in IEEE Int. Symp. on Inform. Theory,
June 2017, pp. 1853–1857.
[19] T. Richardson and R. Urbanke, Modern Coding Theory. Cambridge
University Press, 2008.
[20] P. Trifonov, “Efficient design and decoding of polar codes,” IEEE Trans.
Commun., vol. 60, no. 11, pp. 3221–3227, November 2012.
[21] G. He, J. C. Belfiore, I. Land, G. Yang, X. Liu, Y. Chen, R. Li,
J. Wang, Y. Ge, R. Zhang, and W. Tong, “Beta-expansion: A theoretical
framework for fast and recursive construction of polar codes,” in IEEE
Global Commun. Conf., December 2017, pp. 1–6.
[22] I. Tal and A. Vardy, “How to construct polar codes,” IEEE Trans. Inf.
Theory, vol. 59, no. 10, pp. 6562–6582, October 2013.
[23] J. Guo, M. Qin, A. G. i Fàbregas, and P. H. Siegel, “Enhanced belief
propagation decoding of polar codes through concatenation,” in IEEE
Int. Symp. on Inf. Theory, June 2014, pp. 2987–2991.
[24] H. Vangala, E. Viterbo, and Y. Hong, “A comparative study of polar
code constructions for the AWGN channel,” ArXiv e-prints, January
2015. [Online]. Available: https://arxiv.org/abs/1501.02473
[25] S. A. Hashemi, M. Mondelli, S. H. Hassani, C. Condo, R. L. Urbanke,
and W. J. Gross, “Decoder partitioning: Towards practical list decoding
of polar codes,” IEEE Trans. Commun., vol. 66, no. 9, pp. 3749–3759,
September 2018.
[26] S. A. Hashemi, C. Condo, F. Ercan, and W. J. Gross, “Memory-efficient
polar decoders,” IEEE J. on Emerging and Sel. Topics in Circuits and
Syst., vol. 7, no. 4, pp. 604–615, December 2017.
[27] B. Yuan and K. K. Parhi, “LLR-based successive-cancellation list
decoder for polar codes with multibit decision,” IEEE Trans. Circuits
Syst. II, vol. 64, no. 1, pp. 21–25, January 2017.
[28] C. Xiong, J. Lin, and Z. Yan, “Symbol-decision successive cancellation
list decoder for polar codes,” IEEE Trans. Signal Process., vol. 64, no. 3,
pp. 675–687, February 2016.
[29] J. Lin, C. Xiong, and Z. Yan, “A high throughput list decoder architecture
for polar codes,” IEEE Trans. VLSI Syst., vol. 24, no. 6, pp. 2378–2391,
June 2016.
