Practical Dynamic SC-Flip Polar Decoders: Algorithm and Implementation by Ercan, Furkan et al.
arXiv:2009.08547v2 [cs.IT] 21 Sep 2020
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1
Practical Dynamic SC-Flip Polar Decoders:
Algorithm and Implementation
Furkan Ercan, Student Member, IEEE, Thibaud Tonnellier, Nghia Doan, Student Member, IEEE,
and Warren J. Gross, Senior Member, IEEE
Abstract—SC-Flip (SCF) is a low-complexity polar code decod-
ing algorithm with improved performance, and is an alternative
to high-complexity (CRC)-aided SC-List (CA-SCL) decoding.
However, the performance improvement of SCF is limited since
it can correct at most one channel error (ω = 1). The Dynamic
SCF (DSCF) algorithm addresses this problem by targeting multiple
errors (ω ≥ 1), but it requires logarithmic and exponential
computations, which make it infeasible for practical applica-
tions. In this work, we propose simplifications and approxima-
tions to make DSCF practically feasible. First, we reduce the
transcendental computations of DSCF decoding to a constant
approximation. Then, we show how to incorporate special node
decoding techniques into DSCF algorithm, creating the Fast-
DSCF decoding. Next, we reduce the search span within the
special nodes to further reduce the computational complexity.
We then describe a hardware architecture for the Fast-DSCF
decoder, in which we introduce additional simplifications
such as metric normalization and sorter length reduction. All the
simplifications and approximations are shown to have minimal
impact on the error-correction performance, and the reported
Fast-DSCF decoder is the only SCF-based architecture that can
correct multiple errors. The Fast-DSCF decoders synthesized using
TSMC 65nm CMOS technology achieve throughputs of 1.25, 1.06 and
0.93 Gbps for ω ∈ {1, 2, 3}, respectively. Compared
to the state-of-the-art fast CA-SCL decoders with equivalent FER
performance, the proposed decoders are up to 5.8× more area-efficient.
Finally, observations of energy dissipation indicate that
Fast-DSCF is more energy-efficient than its CA-SCL-based
counterparts.
Index Terms—Polar codes, 5G, energy efficiency, Dynamic
SCFlip, wireless communications, hardware implementation.
I. INTRODUCTION
The 5th generation wireless mobile communications stan-
dard (5G) creates a vast infrastructure that enhances the
existing communications platforms and enables new tech-
nologies. Among the 5G use cases, massive machine-type
communications (mMTC) [1] prioritize enhanced connectivity
and energy efficiency.
Polar codes are a class of forward error-correcting codes that
asymptotically achieve the channel capacity [2]. They have
been selected as the coding scheme for the control channel
for 5G eMBB [3], and are being evaluated for 5G URLLC
and mMTC use cases [4], [5]. Even though the successive
cancellation (SC) decoding algorithm of polar codes suffices
to prove the capacity-achieving property, its error-correction
performance is mediocre at practical code lengths.
F. Ercan, T. Tonnellier, N. Doan and W. J. Gross are with the Department of
Electrical and Computer Engineering, McGill University, Montréal, Québec,
Canada. (e-mail: furkan.ercan@mail.mcgill.ca, thibaud.tonnellier@mcgill.ca,
nghia.doan@mail.mcgill.ca, warren.gross@mcgill.ca.)
Part of this work has been published in ICASSP 2020.
In order to improve the error correction performance of
polar codes, SC-List (SCL) decoding was proposed [6]. SCL
uses L SC decoders in parallel to maintain a list of candidate
codewords which improves the error-correction performance
at the cost of increased implementation complexity [7]. SC-
Flip (SCF) decoding [8] is another SC-based polar decoder
algorithm that uses several SC decoding attempts when an
initial SC decoding fails due to a single channel-induced error.
Compared to SC decoding, SCF has improved error-correction
performance at the cost of variable decoding latency. The
average computational complexity of SCF decoding is similar
to that of SC decoding at medium-to-high signal-to-noise ratio
(SNR) regions. However, the performance gain of
SCF decoding is limited and can only match that of SCL
decoders with small list sizes. The limited performance
improvement of the SCF is due to two main issues. The first
problem is that SCF cannot correct more than one channel-
induced error. The second problem is that the metric used to
identify the error is suboptimal.
Dynamic SC-Flip (DSCF) decoding [9] proposes a solution
to address both of these problems, by extending the search
to more than one channel-induced error, and by proposing
an enhanced metric that is significantly more effective at
locating the erroneous positions in the codeword. However,
the logarithmic and exponential calculations involved in the
DSCF decoding make it challenging for practical hardware
implementations.
Our goal in this work is to make the DSCF algorithm
practically feasible, so that it can be implemented in hard-
ware at low cost to become an alternative for existing high-
performance polar decoder architectures. State-of-the-art decoder
architectures for polar codes either require a substantial
amount of resources (e.g. Fast-SSCL decoding [10]), or have
limited error-correction performance improvement (e.g. Fast-
SCF decoding [11]). Accordingly, our contributions are sum-
marized as follows:
• First, we show that the logarithmic and exponential
computations in the DSCF algorithm can be replaced
by a simple constant approximation. We show that the
proposed approximation does not incur any significant
loss in error-correction performance.
• Then, we propose novel methods to implement decoding
of special nodes under the DSCF algorithm. We reformulate
the original computations of DSCF decoding to accommodate
special nodes, and we show that it is possible to
maintain similar error-correction performance. Moreover,
we show that the achievable error-correction performance
can in fact be improved with one of the special nodes.
• We show how to further reduce the computational complexity
associated with two special nodes. Using a mathematical
framework, we first find the theoretical frame-error
rate (FER) for these special nodes with and without
error correction. Then, we show the achievable performance
approximations when the computational effort in
these nodes is intentionally reduced. In light of
these findings, we limit the computational effort in the
hardware architecture that follows.
• Finally, we show how to implement the proposed Fast-
DSCF algorithm in hardware. The proposed hardware
takes advantage of all the proposed simplifications. In
addition, we present further simplifications, such as met-
ric normalization and sorter length reduction. There are
several SCF-based algorithms that describe multiple bit-
flipping operations, but to the best of our knowledge none
of them describe a hardware architecture. Therefore, the
proposed Fast-DSCF decoder is the first reported SCF-
based decoder architecture that can correct more than a
single channel-induced error.
Simulation results show that Fast-DSCF decoder using all
the simplifications, approximations and quantizations main-
tains similar error-correction performance to the baseline
DSCF algorithm. Moreover, the proposed algorithm is able
to match the FER performance of the state-of-the-art fast
CA-SCL-based decoders with up to L = 16, while keeping an
average computational complexity similar to that of a single
SC decoder. Synthesis results in TSMC 65nm CMOS technology
show that the proposed Fast-DSCF decoders are able
to achieve an average throughput of up to 1.25 Gbps, while
being up to 5.8× more area-efficient compared to the fast
CA-SCL decoders with equivalent FER performance. Finally,
the trends in energy consumption as error-correction
performance improves indicate that Fast-DSCF is more
energy-efficient than its CA-SCL-based counterparts.
The structure of this paper is as follows: Preliminaries
are described in Section II. Approximations to replace the
transcendental computations in DSCF decoding are explained
in Section III. Fast decoding techniques for DSCF decoding
are detailed in Section IV. Reducing the computational effort
for fast decoding techniques is discussed in Section V. The
hardware architecture for the Fast-DSCF decoding is explained
in Section VI, followed by simulation and implementation
results in Section VII. Conclusions are drawn in Section VIII.
Note that a portion of this study was presented previously
in [12].
II. PRELIMINARIES
Vectors and matrices are denoted with bold letters (v), an
index of a vector is denoted with a subscript ($v_i$), and a range of
indices from i to j of a vector is denoted as $v_{i:j}$. LLRs
(L) and partial sums (β) at decoding tree stage S are denoted
using a superscript ($L^S$, $\beta^S$). $\mathcal{E}_\omega$ denotes the set of bit-flipping
indices for the DSCF algorithm, and $L^S[\mathcal{E}_\omega]_i$ denotes the LLR
at stage S and at index i when the decisions at the indices in $\mathcal{E}_\omega$
are flipped.
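[Figure 1: SC decoding tree with stages S = 4 (root) down to S = 0 (leaves $\hat{u}_0, \ldots, \hat{u}_{15}$). A parent node v holds $L^v$ and $\beta^v$; its children hold $L^l$, $\beta^l$ (left) and $L^r$, $\beta^r$ (right). Special nodes, from left to right: Rate-0, Rep, SPC, Rate-1.]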
Fig. 1. Successive cancellation decoding tree for PC(16, 8). LLR (L) and
partial sum (β) vectors of parent node v and of child nodes are represented
with superscripts that indicate the direction (l for left, r for right) and not
with their in-text superscripts for simplicity. Stages (S) for each level and
the sub-codes with special frozen bit-patterns (Rate-0, Rate-1, Rep, SPC) are
outlined for reference.
A. Polar Codes
A polar code PC(N, K) splits N channels into K reliable
ones that are used to transmit the information bits, and N − K
unreliable ones, which are frozen to a known value (usually
0). The sets of frozen and non-frozen indices are denoted
by $\mathcal{A}^C$ and $\mathcal{A}$, respectively. The encoding of a polar code
is a linear transformation, such that $x = u G^{\otimes n}$, where x is
the encoded vector, u is the message vector, and the generator
matrix $G^{\otimes n}$ is the n-th Kronecker power of the polar
code kernel $G = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$, with $n = \log_2 N$, $n \in \mathbb{Z}^+$.
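As an illustration of the encoding rule, the following minimal Python sketch computes $x = u G^{\otimes n}$ over GF(2) with XOR butterflies instead of building the matrix explicitly. The function name `polar_encode` is an assumption made here for illustration, and no bit-reversal permutation is applied, which matches the plain Kronecker power stated above.

```python
def polar_encode(u):
    """Return x = u * G^(kron n) over GF(2) for a bit list u of length 2^n.

    Hypothetical helper for illustration; G = [[1, 0], [1, 1]].
    """
    n_bits = len(u)
    if n_bits == 0 or n_bits & (n_bits - 1):
        raise ValueError("length must be a power of two")
    x = list(u)
    step = 1
    while step < n_bits:
        # Each pass applies one Kronecker factor of the kernel G.
        for start in range(0, n_bits, 2 * step):
            for i in range(start, start + step):
                x[i] ^= x[i + step]  # upper output = upper xor lower
        step *= 2
    return x


if __name__ == "__main__":
    # The second unit vector selects the second row of G^(kron 2).
    print(polar_encode([0, 1, 0, 0]))
```

Because G is its own inverse over GF(2), applying `polar_encode` twice returns the original message, which gives a quick sanity check for the sketch.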
The decoding schedule of SC can be interpreted as a binary
tree search that starts from the root node (that contains the
channel observation), and with priority given to the left branch.
An illustration of the SC decoder tree is shown in Fig. 1 for
PC(16, 8). Each stage in the tree is defined by the inverse of
its depth from the root node, which is denoted by S where
$0 \le S \le n$. Each node contains $N_v = 2^S$ soft values,
interpreted in log-likelihood ratio (LLR) form ($L^S$), that are
propagated to its child nodes. In return, each child node
propagates $N_v$ hard values ($\beta^S$), called partial sums,
to its parent node. As illustrated in Fig. 1, from a node v
that has $L^v$ LLRs, the LLRs at the left child ($L^l$) and the
right child ($L^r$) are calculated as
$$L^l_i = \operatorname{sgn}(L^v_i)\,\operatorname{sgn}(L^v_{i+2^{S-1}})\,\min(|L^v_i|, |L^v_{i+2^{S-1}}|), \tag{1}$$
$$L^r_i = L^v_{i+2^{S-1}} + (1 - 2\beta^l_i)\,L^v_i. \tag{2}$$
Given that the hard decision information from the left child
(βl) and the right child (βr) of node v are available, the βv
for node v is calculated as
$$\beta^v_i = \begin{cases} \beta^l_i \oplus \beta^r_i, & \text{if } i < 2^{S-1} \\ \beta^r_{i-2^{S-1}}, & \text{otherwise.} \end{cases} \tag{3}$$
The bit estimations are performed at leaf node stage S = 0
sequentially, starting from the leftmost index. Estimation of
each bit $\hat{u}_i$ depends on the channel observation y and previously
decoded bits $\hat{u}_{0:i-1}$, such that
$$\hat{u}_i = \begin{cases} 0, & \text{if } \Pr[y, \hat{u}_{0:i-1} \mid u_i = 0] \ge \Pr[y, \hat{u}_{0:i-1} \mid u_i = 1]; \\ 0, & \text{if } i \in \mathcal{A}^C; \\ 1, & \text{otherwise.} \end{cases} \tag{4}$$
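The update rules (1)-(4) can be put together in a short recursive sketch. The Python code below is illustrative only: the function name `sc_decode` and the LLR sign convention (a non-negative LLR favors bit 0) are assumptions made here, and indices are 0-based as in the leaf labels of Fig. 1.

```python
import math


def sc_decode(llrs, frozen):
    """Decode one node of the SC tree (hypothetical helper).

    llrs   -- LLRs at the top of the node, length a power of two
    frozen -- list of bools, True where the leaf index is frozen
    Returns (u_hat, beta): leaf decisions and the node's partial sums.
    """
    if len(llrs) == 1:
        # Eq. (4): frozen bits are 0, otherwise decide by the LLR sign.
        bit = 0 if frozen[0] or llrs[0] >= 0 else 1
        return [bit], [bit]
    half = len(llrs) // 2
    upper, lower = llrs[:half], llrs[half:]
    # Eq. (1): LLRs passed to the left child.
    left_llrs = [
        math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))
        for a, b in zip(upper, lower)
    ]
    u_left, beta_left = sc_decode(left_llrs, frozen[:half])
    # Eq. (2): LLRs passed to the right child, using the left partial sums.
    right_llrs = [b + (1 - 2 * bl) * a
                  for a, b, bl in zip(upper, lower, beta_left)]
    u_right, beta_right = sc_decode(right_llrs, frozen[half:])
    # Eq. (3): combine the partial sums for the parent node.
    beta = [bl ^ br for bl, br in zip(beta_left, beta_right)] + beta_right
    return u_left + u_right, beta


if __name__ == "__main__":
    # Noiseless toy case: codeword [0, 1, 0, 1] mapped to LLRs of magnitude 2.
    print(sc_decode([2.0, -2.0, 2.0, -2.0], [True, True, False, False]))
```

In the toy call, the partial sums returned at the root reproduce the transmitted codeword, which is the expected behavior when no channel errors are present.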
It was shown in [13] and [14] that nodes in the SC decoding
tree with special frozen bit patterns do not need to be
explicitly traversed; dedicated fast decoding techniques for
such special nodes improve the decoding throughput
substantially. Among these special nodes, the decoding of Rate-0
(where all indices are frozen), Rate-1 (where no indices
are frozen), repetition (Rep, where only the rightmost index
is non-frozen) and single-parity check (SPC, where only the
leftmost index is frozen) nodes is within the scope of this work.
An example of each of the considered nodes is highlighted
with dashed lines in Fig. 1.
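The four frozen-bit patterns above can be recognized with a few list checks. The following Python sketch is only an illustration: the function name `classify_node` is hypothetical, and the PC(16, 8) frozen pattern in the usage example is inferred from the node layout of Fig. 1 (four nodes of size 4) rather than stated explicitly in the text.

```python
def classify_node(frozen):
    """Classify a sub-code by its frozen pattern (True = frozen index).

    Hypothetical helper; a length-2 pattern [True, False] satisfies both
    the Rep and SPC definitions and is reported as Rep here.
    """
    if all(frozen):
        return "Rate-0"
    if not any(frozen):
        return "Rate-1"
    if all(frozen[:-1]) and not frozen[-1]:
        return "Rep"
    if frozen[0] and not any(frozen[1:]):
        return "SPC"
    return "other"


if __name__ == "__main__":
    # Frozen pattern inferred from Fig. 1: Rate-0, Rep, SPC, Rate-1.
    frozen = [True] * 4 + [True, True, True, False] \
        + [True, False, False, False] + [False] * 4
    print([classify_node(frozen[k:k + 4]) for k in range(0, 16, 4)])
```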
B. SC-Flip and Dynamic SC-Flip Decoding
In a failed SC decoding, the incorrect bit estimations (i.e.,
errors) can occur in two different ways. The first is
due to the noise present in the channel; this type of
error is called a channel-induced error. The second way
to incur an error is due to a previously made error during
the sequential schedule of SC decoding (4), which we call
a propagated error. Since the first error in the codeword
cannot be propagated from a previously made error, it is a
channel-induced error. Therefore, if this error is found and
corrected, then its associated propagated errors – if there are
any – also disappear. In this context, while propagated errors
are dependent on their associated channel errors, the channel
errors are independent from one another.
The observation above was originally made in [8], and it
was also observed that most of the decoding failures are due
to a single channel-induced error. Hence, if a single channel-
induced error was avoided, then the error-correction perfor-
mance would improve. Aided by an outer cyclic redundancy
check (CRC) code for detecting whether an initial SC decoding
has failed, SCF decoding first creates a list of bit-flipping
positions using non-frozen leaf indices sorted according to
their LLR magnitudes. Then, the SC decoding process is
relaunched but the hard decision at the index that holds the
next lowest LLR magnitude is flipped. This is repeated until
no errors are detected or until all positions in the list have
been considered. When SCF decoding fails, it is either due to:
(i) a wrong codeword with a valid CRC, or (ii) not locating
the correct erroneous index within a maximum number of
additional attempts (Tmax), or (iii) having more than one
channel-induced error in the codeword.
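The construction of the SCF flipping list described above amounts to ranking the non-frozen leaf indices by LLR magnitude. The Python sketch below illustrates only that ranking step; the function name `scf_flip_candidates` and its arguments are assumptions made here, not part of [8].

```python
def scf_flip_candidates(leaf_llrs, info_set, t_max):
    """Return up to t_max non-frozen indices, least reliable first.

    Hypothetical helper: leaf_llrs holds the stage-0 LLRs of the failed
    SC attempt and info_set the non-frozen indices.
    """
    ranked = sorted(info_set, key=lambda i: abs(leaf_llrs[i]))
    return ranked[:t_max]


if __name__ == "__main__":
    # Index 1 has the smallest LLR magnitude, so it is tried first.
    print(scf_flip_candidates([5.0, -0.3, 2.0, -1.1], [1, 2, 3], t_max=2))
```

Each returned index would then be used for one additional SC attempt with the hard decision at that index flipped, until the CRC passes or the list is exhausted.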
The performance improvement in error-correction intro-
duced by the SCF decoding algorithm is limited due to
two different problems. The first problem is that the SCF
algorithm cannot correct more than a single channel-induced
error. Even though several attempts lead to improvements in
the SCF decoding [15]–[19], they are unable to tackle more
than one channel-induced error. An alternative approach that
segments the codeword into multiple partitions and applies
SCF decoding to each partition separately has also been shown
to yield limited performance improvements [20]–[22]. The
second problem is that the decision metric used in the SCF
decoding is not able to distinguish channel-induced errors
from propagated errors. Indeed, we have observed that the
propagated errors may also carry small LLR magnitudes [16]
which is the only parameter that the SCF decoding relies on
for identifying channel-induced errors.
The proposals that address the limited performance im-
provement problem of SCF decoding in the literature can be
classified into two different techniques. The first technique is
to merge the SCF algorithm with the SCL algorithm [23]–
[25]. This approach has been shown to improve the decoding
performance at the cost of an increased computational complexity.
The second technique is to create combinations of bit-flipping
positions to tackle more than one channel-induced error [9],
[26]–[31]. Among these, the Dynamic SCF (DSCF) algorithm
[9] tackles both of the problems associated with the SCF
decoding simultaneously: (i) the search for bit-flipping is not
limited to a single channel-induced error, and (ii) the decision
metric is more efficient in identifying the correct bit-flipping
positions than the SCF algorithm.
To identify more than one channel-induced error, DSCF
updates the set of flipping indices progressively over the course
of each decoding attempt: Let Eω = {i1, . . . , iω} denote the
set of bit-flipping indices at an additional decoding attempt,
where i1 < · · · < iω and 0 ≤ ω ≤ K + C. Note that
here, C is the CRC remainder length. In this sense, ω is the
number of attempted channel-induced errors, which is referred
to as the decoding order. Eω is built progressively over a prior
additional decoding attempt with Eω−1 = {i1, . . . , iω−1}.
Unlike in the SCF decoding, the decision metric of DSCF
decoding for non-frozen indices does not only depend on
their LLR magnitudes. Instead, all the decisions that were
made at the prior non-frozen indices are also considered to
calculate a more comprehensive decision metric. In this sense,
let Pr(Eω) be the probability of SC decoding being successful
after flipping the bits in Eω. It was shown in [9] that Pr(Eω)
can be formulated as
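$$\Pr(\mathcal{E}_\omega) = \prod_{j \in \mathcal{E}_\omega} p_e(\hat{u}[\mathcal{E}_{\omega-1}]_j) \times \prod_{\substack{j < i_\omega \\ j \in \mathcal{A} \setminus \mathcal{E}_\omega}} \bigl(1 - p_e(\hat{u}[\mathcal{E}_{\omega-1}]_j)\bigr) \tag{5}$$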
where pe(uˆ[Eω−1]j) is the probability of incurring an error at
index j, such that
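$$p_e(\hat{u}[\mathcal{E}_{\omega-1}]_j) := \Pr\bigl(\hat{u}[\mathcal{E}_{\omega-1}]_j \ne u_j \mid y, \hat{u}[\mathcal{E}_{\omega-1}]_{0:j-1} = u_{0:j-1}\bigr). \tag{6}$$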
Let us elaborate on the computation of (5) with a simple
example. Assume that the estimations uˆ7 and uˆ12 in the
PC(16, 8) polar code in Fig. 1 have channel-induced errors.
Let a specific Eω include the erroneous indices, i.e. ω = 2 and
E2 = {7, 12}. By the successive course of the decoding, uˆ7 is
flipped first. The probability of a bit-flip at index 7 yielding the
correct decision is equal to the probability of index 7 incurring
a channel-induced error originally. The bit estimations that
follow (i.e. at indices 9, 10, 11) are impacted by the first
bit-flip and are therefore denoted as $\hat{u}[\mathcal{E}_1]_j$. Consequently, their
associated probabilities of correct estimation are represented
as $\bigl(1 - p_e(\hat{u}[\mathcal{E}_1]_j)\bigr)$. Finally, the probability of error at the
last index of $\mathcal{E}_2$ (which is index 12) is the same as the
probability of incurring an error when it is not corrected, which
is $p_e(\hat{u}[\mathcal{E}_1]_{12})$. The product of all these probabilities gives
the probability of the successful decoding after flipping the
indices of Eω, which is summarized in (5).
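To make the example concrete, the product in (5) can be evaluated numerically. The Python sketch below uses made-up per-index error probabilities, so the numbers carry no meaning beyond illustration; the non-frozen set {7, 9, 10, 11, 12, 13, 14, 15} is inferred from the node layout of Fig. 1, and the function name `success_probability` is hypothetical.

```python
from math import prod


def success_probability(flip_set, info_set, p_err):
    """Evaluate Eq. (5) for given per-index error probabilities.

    p_err[j] stands for p_e(u_hat[E_{w-1}]_j); the caller is assumed to
    supply values taken from the appropriate decoding trajectory.
    """
    last = max(flip_set)
    flipped = prod(p_err[j] for j in flip_set)
    kept = prod(1.0 - p_err[j]
                for j in info_set if j < last and j not in flip_set)
    return flipped * kept


if __name__ == "__main__":
    info_set = [7, 9, 10, 11, 12, 13, 14, 15]
    # Hypothetical error probabilities, chosen only for illustration.
    p_err = {7: 0.2, 9: 0.05, 10: 0.1, 11: 0.02,
             12: 0.3, 13: 0.01, 14: 0.01, 15: 0.01}
    print(success_probability({7, 12}, info_set, p_err))
```

Indices 13 to 15 do not enter the product because they lie after the last flipped index, in line with the condition $j < i_\omega$ in (5).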
It can be seen in (6) that $p_e(\hat{u}[\mathcal{E}_{\omega-1}]_j)$ depends on all
the previous bits being decoded correctly, which cannot be guaranteed
in practice. Hence, an approximation to $p_e(\hat{u}[\mathcal{E}_{\omega-1}]_j)$ that
depends on all the previously decoded bits, regardless of whether
they were correctly decoded, can be used instead:
$$q_e(\hat{u}[\mathcal{E}_{\omega-1}]_j) = \frac{1}{1 + \exp(|L^0[\mathcal{E}_{\omega-1}]_j|)}, \quad \forall j \in \mathcal{A} \tag{7}$$
where $L^0[\mathcal{E}_{\omega-1}]_j$ is the LLR at index j of the current decoding
attempt. Hence, approximating (6) with (7) and
substituting (7) into (5) yields the decision metric m for the
bit-flipping set $\mathcal{E}_\omega$:
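$$m(\mathcal{E}_\omega) = \prod_{j \in \mathcal{E}_\omega} \frac{1}{1 + \exp(|L^0[\mathcal{E}_{\omega-1}]_j|)} \times \prod_{\substack{j < i_\omega \\ j \in \mathcal{A} \setminus \mathcal{E}_\omega}} \left(1 - \frac{1}{1 + \exp(|L^0[\mathcal{E}_{\omega-1}]_j|)}\right) \tag{8}$$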
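Using the fact that $\frac{1}{1+\exp(x)} = \frac{\exp(-x)}{1+\exp(-x)}$, (8) can also be
written as
$$m(\mathcal{E}_\omega) = \prod_{j \in \mathcal{E}_\omega} \exp(-|L^0[\mathcal{E}_{\omega-1}]_j|) \times \prod_{\substack{j \le i_\omega \\ j \in \mathcal{A}}} \frac{1}{1 + \exp(-|L^0[\mathcal{E}_{\omega-1}]_j|)}$$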
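The rewriting step can be checked term by term. The short derivation below is a sketch that uses the shorthand $x_j = |L^0[\mathcal{E}_{\omega-1}]_j|$, which is introduced here only for compactness.

```latex
\begin{align*}
j \in \mathcal{E}_\omega:\quad
  & \frac{1}{1+e^{x_j}} = \frac{e^{-x_j}}{1+e^{-x_j}}
    = e^{-x_j}\cdot\frac{1}{1+e^{-x_j}}, \\
j \in \mathcal{A}\setminus\mathcal{E}_\omega,\ j < i_\omega:\quad
  & 1-\frac{1}{1+e^{x_j}} = \frac{e^{x_j}}{1+e^{x_j}}
    = \frac{1}{1+e^{-x_j}}.
\end{align*}
```

Every non-frozen index up to and including $i_\omega$ therefore contributes a factor $1/(1+e^{-x_j})$, and the flipped indices contribute an additional factor $e^{-x_j}$. Since $i_\omega \in \mathcal{E}_\omega$, the two index sets together cover $\{j \in \mathcal{A} : j \le i_\omega\}$, which explains the change from $j < i_\omega$ to $j \le i_\omega$ in the second product.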
For numerical stability, the metric is converted to the
logarithmic domain using M(Eω) = − log(m(Eω)) as
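$$M(\mathcal{E}_\omega) = \sum_{j \in \mathcal{E}_\omega} |L^0[\mathcal{E}_{\omega-1}]_j| + S(\mathcal{E}_\omega), \tag{9}$$
where
$$S(\mathcal{E}_\omega) = \sum_{\substack{j \le i_\omega \\ j \in \mathcal{A}}} \log\bigl(1 + \exp(-|L^0[\mathcal{E}_{\omega-1}]_j|)\bigr). \tag{10}$$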
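Finally, to bring the value of $q_e(\hat{u}[\mathcal{E}_{\omega-1}]_j)$
in (7) closer to $p_e(\hat{u}[\mathcal{E}_{\omega-1}]_j)$ in (6), a perturbation parameter α was
introduced in [9], such that
$$M_\alpha(\mathcal{E}_\omega) = \sum_{j \in \mathcal{E}_\omega} |L^0[\mathcal{E}_{\omega-1}]_j| + S_\alpha(\mathcal{E}_\omega), \tag{11}$$
and
$$S_\alpha(\mathcal{E}_\omega) = \frac{1}{\alpha} \sum_{\substack{j \le i_\omega \\ j \in \mathcal{A}}} \log\bigl(1 + \exp(-\alpha |L^0[\mathcal{E}_{\omega-1}]_j|)\bigr). \tag{12}$$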
The value of α can be optimized via Monte-Carlo simulations
[9] or machine learning [32].
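A direct evaluation of (11) and (12) can be sketched in a few lines of Python. The function name `dscf_metric` and its argument layout are assumptions made for illustration; `leaf_llrs` is taken to hold the stage-0 LLRs of the decoding attempt performed with $\mathcal{E}_{\omega-1}$ flipped.

```python
import math


def dscf_metric(flip_set, info_set, leaf_llrs, alpha=0.3):
    """Evaluate M_alpha(E_w) from Eqs. (11)-(12) (hypothetical helper).

    flip_set  -- indices in E_w
    info_set  -- non-frozen indices (the set A)
    leaf_llrs -- stage-0 LLRs L^0[E_{w-1}] of the current attempt
    """
    last = max(flip_set)
    # First term of Eq. (11): LLR magnitudes of the flipped indices.
    penalty = sum(abs(leaf_llrs[j]) for j in flip_set)
    # Eq. (12): accumulated over all non-frozen indices up to i_w.
    s_alpha = sum(math.log1p(math.exp(-alpha * abs(leaf_llrs[j])))
                  for j in info_set if j <= last) / alpha
    return penalty + s_alpha


if __name__ == "__main__":
    llrs = [1.2, -0.4, 2.5, -0.1, 3.0]
    print(dscf_metric((3,), [0, 2, 3, 4], llrs))
```

With `alpha=1.0` the function reduces to (9) and (10), so its output should equal the negative logarithm of (8) for the same inputs, which is a convenient consistency check.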
The procedure of the DSCF decoding is summarized in
Algorithm 1. Required inputs are the channel LLRs, maximum
number of iterations Tmax, maximum bit-flipping order ω and
A. The received codeword is initially decoded with the SC
algorithm (line 3). Here, the input for SCF algorithm is the bit-
flipping indices, hence SCF with an empty input is equivalent
to SC decoding. If the CRC on the estimated information
Algorithm 1: Dynamic SCF Algorithm
 1  procedure DSCF(y_{0:N−1}, T_max, ω, A)
 2    initialize: t ← 0, w ← 0, E_w ← ∅, V = {v_w ← w; v_idx ← ∅; v_Mα ← +∞}
 3    (û_{0:N−1}, L^0_{0:N−1}) ← SCF(∅)
 4    if (!CRC(û_{0:N−1}, A) & T_max > 0 & ω > 0) then
 5      V_new ← ∅
 6      for i ∈ A do
 7        compute M_α(E_1) (Eq. (11))
 8        V_new ← push({1; {i}; M_α(E_1)})
 9      V ← sort(V_new → v_Mα, ascending)
10      while (!CRC(û_{0:N−1}, A) & t < T_max) do
11        t ← t + 1
12        w ← V_0(v_w)
13        E_w ← V_0(v_idx)
14        V ← pop(V_0)
15        (û_{0:N−1}, L^0_{0:N−1}) ← SCF(E_w)
16        if w < ω then
17          V_new ← ∅
18          for i > back(E_w), i ∈ A do
19            compute M_α(E_{w+1}) (Eq. (11))
20            V_new ← push({w + 1; E_w ∪ {i}; M_α(E_{w+1})})
21          V ← sort(V_new ∪ V → v_Mα, ascending)
22    return û_{0:N−1}
bits fails and if there is room for additional bit-flips (line
4), then an initial list of bit-flipping indices is created using
leaf LLRs (lines 5-8). Here, a special data structure (V) is
used to hold the bit-flipping order ($v_w$), bit-flipping indices
($v_{idx}$) and the metric ($v_{M_\alpha}$). The created vector V is then
sorted by metric value (indicated with →)
in ascending order (line 9). Next, a series of decoding
iterations is initiated that is conditioned on the CRC output and
$T_{max}$ (lines 10-21). At each iteration, the decoding order and
the bit-flipping indices are obtained from the next item in V
(lines 12-13). The used entry is discarded from the list
(pop() in line 14). The indices are used as an input to the new
SCF decoding attempt (line 15). If the new decoding attempt
allows for further bit-flipping investigations over the newly
created decoding trajectory (line 16), then new bit-flipping
indices are built on top of the current bit-flipping attempt (note
the $E_w \cup \{i\}$ in line 20), and the updated vector is re-sorted
before the next iteration begins. Here, the evaluated bit-flipping
indices must be greater than the last index in $E_w$ (obtained by the
back() operation in line 18), following the definition of $E_w$.
If the initial SC decoding has a valid CRC, or as soon as
a loop condition no longer holds, the bit estimation is reported
(line 22).
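The control flow of Algorithm 1 can be sketched in Python with the decoder and the CRC passed in as callbacks. Everything about the interface is an assumption made for illustration: `run_scf(flips)` is expected to return the bit estimates and leaf LLRs of one SC attempt with the given indices flipped, `crc_ok(u_hat)` is expected to return a boolean, and the extra guard on an empty candidate list is not part of Algorithm 1.

```python
import math


def dscf_decode(run_scf, crc_ok, info_set, t_max, omega, alpha=0.3):
    """Sketch of Algorithm 1 with caller-supplied SC and CRC callbacks."""

    def metric(flips, llrs):
        # Eq. (11), evaluated on the LLRs of the attempt that precedes
        # the flip of the last index in `flips`.
        last = flips[-1]
        s = sum(math.log1p(math.exp(-alpha * abs(llrs[j])))
                for j in info_set if j <= last)
        return sum(abs(llrs[j]) for j in flips) + s / alpha

    u_hat, llrs = run_scf(())                            # line 3
    if crc_ok(u_hat) or t_max <= 0 or omega <= 0:        # line 4
        return u_hat
    candidates = sorted((metric((i,), llrs), (i,)) for i in info_set)
    t = 0
    # `candidates` guard added here so the sketch stops on an empty list.
    while not crc_ok(u_hat) and t < t_max and candidates:  # line 10
        t += 1
        _, flips = candidates.pop(0)                     # lines 12-14
        u_hat, llrs = run_scf(flips)                     # line 15
        if len(flips) < omega:                           # line 16
            candidates += [(metric(flips + (i,), llrs), flips + (i,))
                           for i in info_set if i > flips[-1]]
            candidates.sort()                            # line 21
    return u_hat


if __name__ == "__main__":
    # Toy stand-in for an SC decoder: fixed LLRs, no error propagation.
    toy_llrs = [3.0, 0.2, -2.5, -0.4]
    truth = [0, 1, 1, 0]

    def toy_scf(flips):
        u = [0 if l >= 0 else 1 for l in toy_llrs]
        for i in flips:
            u[i] ^= 1
        return u, toy_llrs

    print(dscf_decode(toy_scf, lambda u: u == truth, [0, 1, 2, 3], 10, 2))
```

The toy decoder in the usage example ignores error propagation, so it only exercises the candidate management; a real `run_scf` would rerun SC decoding and return updated LLRs for every attempt. Re-sorting the whole list at each iteration is kept for clarity, whereas a priority queue would be the more usual choice in software.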
To gain an in-depth theoretical background on the derivation
of (11) and the DSCF algorithm, the readers are strongly
encouraged to refer to Section V-A and Section VI-A of [9],
respectively.
[Figure 2: FER versus SNR (dB) for SCF, DSCF and SCO, each at ω = 1, 2, 3.]
Fig. 2. Error-correction performance comparison of SCF and DSCF algo-
rithms at ω ∈ {1, 2, 3} with Tmax ∈ {10, 50, 200}, using PC(1024, 512)
[3] and C = 16.
III. APPROXIMATIONS FOR DSCF DECODING
The Sα(Eω) component of the metric computation in DSCF
decoding has a crucial impact on the identification of the
correct bit-flipping indices; and thus, on the improvement of
the error-correction performance. In fact, without $S_\alpha(\mathcal{E}_\omega)$ the
decision metric of DSCF reverts to that of SCF decoding with a
higher error order. Fig. 2 illustrates the differences in FER
performance using the metric of SCF and DSCF, at error orders
ω ∈ {1, 2, 3} using PC(1024, 512) and C = 16. For this
exercise, the SCF algorithm is enhanced to tackle higher order
errors by updating the set of flipping indices progressively
similar to that of DSCF algorithm. In other words, SCF
builds a bit-flipping set Eω, however it uses only the LLR
magnitude values of the flipping indices while building it. In
this sense, SCF with ω = 1 is the original SCF decoding. For
each error order, the dashed lines depict the ideal decoding
performance when all ω errors are corrected using a genie-
aided decoder, called SC-Oracle (SCO) [8]. Observe that even
when SCF is modified to correct higher-order errors, the
associated performance improvement is marginal, such that
the FER of SCF with ω = 3 is worse than that of DSCF with
ω = 1. This implies that the metric computation of DSCF
algorithm is more efficient, and essential for tackling higher-
order errors. On the other hand, although DSCF has superior
error-correction performance due to the Sα(Eω) term, its
required logarithmic and exponential operations make DSCF
inconvenient for efficient hardware implementations.
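We reformulate $S_\alpha(\mathcal{E}_\omega)$ in (12) as
$$S_\alpha(\mathcal{E}_\omega) = \sum_{\substack{j \le i_\omega \\ j \in \mathcal{A}}} f_\alpha(|L^0[\mathcal{E}_{\omega-1}]_j|), \tag{13}$$
where
$$f_\alpha(x) = \frac{1}{\alpha} \log\bigl(1 + \exp(-\alpha x)\bigr). \tag{14}$$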
Following the Monte-Carlo optimizations from [9], α = 0.3
is used throughout this paper. Interestingly, it was shown that
a similar expression, $f(x) = \log(1 + \exp(-x))$, used in
the soft-input soft-output decoding algorithm of turbo codes,
can be approximated in different ways without adversely
affecting the decoding performance [33], [34]. Inspired by
[Figure 3: plot of $f_{\alpha=0.3}(x)$, $f^{lin}_{\alpha=0.3}(x)$ and $f^{*}_{\alpha=0.3}(x)$ for x from 0 to 16.]
Fig. 3. $f_\alpha(x)$ with α = 0.3, and its constant and linear approximations
$f^{*}_\alpha(x)$ and $f^{lin}_{\alpha=0.3}(x)$, respectively.
the constant log-MAP approximation in [33], we use a similar
approximation to simplify fα(x) as:
$$f^{*}_{\alpha=0.3}(x) = \begin{cases} \frac{3}{2}, & \text{if } |x| \le 5 \\ 0, & \text{otherwise.} \end{cases} \tag{15}$$
The values of 3/2 and 5 in (15) are selected to ensure both
an easy future hardware implementation and a reduced fitting
error. For illustration purposes, Fig. 3 plots the original function
$f_\alpha$ and its proposed approximation $f^{*}_\alpha$, with α = 0.3. Note that
in [35], a linear approximation to $f_\alpha(x)$ was used following
[34] to reduce its complexity. An illustration of the linear
approximation is also depicted in Fig. 3, labeled as $f^{lin}_{\alpha=0.3}(x)$.
However, our approach to approximate fα(x) is simpler as
it only involves a constant value. Fig. 4 compares the FER
performance of DSCF decoding with the constant and linear
approximations against its original approach from [9], using
length-1024 polar codes with different rates. The polar codes
are constructed using the 5G reliability sequence [3]. The
length-16 CRC defined in [3] is serially concatenated with the
polar code. A BPSK modulation and an AWGN channel are
considered. Note that the same settings are used for all the
following Monte-Carlo simulations. Three error orders ω ∈
{1, 2, 3} are targeted, corresponding to Tmax ∈ {10, 40, 200},
respectively. The decoding performance with SC-Oracle for
each ω value is also shown. Observe that in the consid-
ered cases, the approximated DSCF curves achieve similar
decoding performance as the original approach but without
transcendental computations. As the constant approximation is
more favorable for reduced complexity, we choose to replace
$f_\alpha(x)$ with $f^{*}_\alpha(x)$.
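The gap between (14) and the constant approximation (15) can be inspected numerically. The Python sketch below tabulates both functions for α = 0.3; the function names `f_alpha` and `f_const` are chosen here for illustration, and the linear approximation of [35] is left out because its coefficients are not given in this section.

```python
import math


def f_alpha(x, alpha=0.3):
    """Eq. (14): exact term of the accumulated metric."""
    return math.log1p(math.exp(-alpha * x)) / alpha


def f_const(x):
    """Eq. (15): constant approximation, stated for alpha = 0.3 only."""
    return 1.5 if abs(x) <= 5 else 0.0


if __name__ == "__main__":
    # Tabulate the exact function against its constant approximation.
    for x in range(0, 17, 2):
        print(f"x={x:2d}  f_alpha={f_alpha(x):.3f}  f_const={f_const(x):.1f}")
```

In a fixed-point datapath, the approximation reduces each update to a threshold comparison followed by a conditional addition of a constant, which is consistent with the hardware motivation given for the choice of 3/2 and 5.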
IV. FAST-DSCF DECODING
A. Achievable Performance by the Fast-SCF-based Decoders
We introduce the concept of Fast-SCO decoder to depict
the ideal limit on the achievable performance by SCF-based
decoders when special nodes that enclose more than one non-
frozen index are incorporated. The modified SCO decoder
works exactly as a Fast-SSC decoder, except that it is able to
identify and correct up to a certain number of channel-induced
errors at the top-level of the special nodes. The introduction
of Fast-SCO is essential towards the performance evaluation
of the proposed Fast-DSCF decoder.
The two special nodes of interest, which involve more
than a single non-frozen index, are Rate-1 and SPC nodes.
Note that the FER value of Fast-SCO is the same as that of SCO
[Figure 4: FER versus SNR (dB) for code rates R = 1/8, 1/4, 1/2, 3/4 and 7/8; curves: DSCF using $f^{*}_\alpha$, DSCF using $f^{lin}_\alpha$, DSCF using $f_\alpha$, and SC-Oracle.]
Fig. 4. FER performance comparison of DSCF decoding with and without approximations, using α = 0.3. Polar codes with N = 1024 and R ∈ {1/8, 1/4, 1/2, 3/4, 7/8}, C = 16, ω ∈ {1, 2, 3} with Tmax ∈ {10, 40, 200}.
[Figure 5: FER versus SNR (dB) at ω = 1, 2, 3 for SCO, Fast-SCO (Rate-1), and Fast-SCO (Rate-1 & SPC).]
Fig. 5. FER performance limits depicted by SCO and Fast-SCO when Rate-1
and SPC nodes are involved, using PC(1024, 512) and C = 16.
when only Rate-0 and Rep nodes are involved. For Rate-1
nodes, up to ω channel-induced errors are found and flipped.
On the other hand, an SPC node involves a single frozen index,
and its parity must be kept even at all times. As such, an even
number of bit-flips have to take place at SPC nodes in order to
keep their even parity constraint. Therefore, two simultaneous
bit-flips have to be performed in SPC nodes, for each error
order.
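The parity argument for SPC nodes can be illustrated with a tiny Python sketch. The helper names `parity` and `flip_positions` are hypothetical, and the example only demonstrates the constraint itself, not how the flipping positions are selected.

```python
def parity(bits):
    """Return 0 for even parity and 1 for odd parity."""
    return sum(bits) % 2


def flip_positions(bits, positions):
    """Return a copy of bits with the given positions flipped."""
    flipped = list(bits)
    for p in positions:
        flipped[p] ^= 1
    return flipped


if __name__ == "__main__":
    word = [1, 0, 1, 0]                          # valid SPC word, even parity
    print(parity(flip_positions(word, [1])))     # a single flip breaks parity
    print(parity(flip_positions(word, [1, 3])))  # a pair of flips keeps it
```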
Fig. 5 compares the FER performance limits depicted by
SCO against Fast-SCO, when Rate-1 or Rate-1 and SPC nodes
are involved for ω ∈ {1, 2, 3}. It can be noticed that the perfor-
mance limit of Fast-SCO with only Rate-1 nodes is equivalent
to that of SCO, and the Fast-SCO adaptation involving SPC
nodes has an improved performance limit compared to others.
This is because a portion of corrected errors at the top of
the SPC node correspond to multiple channel-induced error-
corrections at the leaf node level. Hence, a better performance
can be ideally achieved when SPC nodes are used in SCF-
based decoders.
B. Incorporation of Special Nodes into DSCF
The main task in incorporating the special nodes into DSCF
decoding is to formulate the metric computations (11)-(15) using
the available information at the top level of the special nodes.
In the following, we present how to perform the metric
computations for each special node. Note that the following
approaches can be applied to DSCF decoding with or without
the approximation presented in Section III.
Let us split the metric calculation of DSCF into two parts,
such that:
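$$M_\alpha(\mathcal{E}_\omega) = \underbrace{|L^0[\mathcal{E}_{\omega-1}]_{i_\omega}|}_{M'_\alpha(\mathcal{E}_\omega)} + \underbrace{\Bigl( \sum_{j \in \mathcal{E}_{\omega-1}} |L^0[\mathcal{E}_{\omega-1}]_j| + S_\alpha(\mathcal{E}_\omega) \Bigr)}_{M''_\alpha(\mathcal{E}_\omega)}. \tag{16}$$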
Observe that $M'_\alpha(\mathcal{E}_\omega)$ takes a value only if the index $i_\omega$ is a
candidate for bit-flipping during a future decoding iteration.
$M'_\alpha(\mathcal{E}_\omega)$ contains the instantaneous value at the index $i_\omega$ and
is used for the metric computation of the index $i_\omega$ only. On
the other hand, $M''_\alpha(\mathcal{E}_\omega)$ is the accumulative part that is used
for the next possible set of bit-flips throughout the decoding
attempt. $M''_\alpha(\mathcal{E}_\omega)$ is set to 0 at the beginning of any extra
decoding attempt and accumulated for each non-frozen leaf
index j as follows:
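$$\begin{aligned} M''_\alpha(\mathcal{E}_\omega)_j = M''_\alpha(\mathcal{E}_\omega)_{j-1} &+ |L^0[\mathcal{E}_{\omega-1}]_j| && \text{if } j \in \mathcal{E}_{\omega-1} \\ &+ f_\alpha(|L^0[\mathcal{E}_{\omega-1}]_j|) && \text{if } j \in \mathcal{A}. \end{aligned} \tag{17}$$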
We now provide a way to compute $M'_\alpha(\mathcal{E}_\omega)$ and $M''_\alpha(\mathcal{E}_\omega)$
when special nodes are encountered during DSCF decoding.
First, we redefine the set $\mathcal{E}_\omega$ to hold the information of the
special nodes as well as the flipping indices. Let us define
the notation {j, i} as the coordinate in a polar code tree
where j denotes the special node index and i denotes a
set of top-node indices that belong to j. Accordingly, let
$\mathcal{E}_\omega = \{\{j_1, i_1\}, \ldots, \{j_\omega, i_\omega\}\}$ denote the set of flipping
coordinates at special nodes $\{j_1, \ldots, j_\omega\}$ ($j_1 \le \cdots \le j_\omega$).
In this context, a flipping coordinate {j, i} becomes a subset
of the set $\mathcal{E}_\omega$. Depending on the type of the special node, the
cardinality of i can vary. As we will see, the instantaneous
component of the metric computation $M'_\alpha(\mathcal{E}_\omega)$ depends on the
subset {j, i}, whereas the accumulative component $M''_\alpha(\mathcal{E}_\omega)$
is updated at once at each special node j.
The decoding of Rate-0 nodes for DSCF decoding is the
same as [13] since their LLR magnitudes are not evaluated
towards metric updates; this has been addressed previously in
[36] for SCF-based algorithms.
Repetition (Rep) nodes contain a single non-frozen instance
located at the rightmost index at the leaf node. From the top-
level perspective, the information of the non-frozen index is
distributed amongst all indices of the top node. In other words,
the LLR in the last leaf node obtained through SC decoding
is equal to the sum of all LLRs in the root node; and there is
only one possible flipping-event. Therefore, for a Rep node j
of size Nv , at decoding tree stage S, we can write
\[
M'_\alpha(\mathcal{E}_\omega)_{\{j,\varnothing\}} = \left|\sum_{i \in N_v} L^S[\mathcal{E}_{\omega-1}]_i\right|, \tag{18}
\]
and the update for M ′′α(Eω) at a Rep node can be expressed
as
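\[
\begin{aligned}
M''_\alpha(\mathcal{E}_\omega) \mathrel{+}= {} & f_\alpha\!\left(\left|\sum_{i \in N_v} L^S[\mathcal{E}_{\omega-1}]_i\right|\right) \\
& + \left|\sum_{i \in N_v} L^S[\mathcal{E}_{\omega-1}]_i\right| \quad \text{if } \{j,\varnothing\} \subset \mathcal{E}_{\omega-1}.
\end{aligned} \tag{19}
\]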
If a flipping event for a Rep node is selected during an extra
decoding attempt, all the Nv partial sums at level S have to
be flipped. Accordingly, there is no index information required
for Rep nodes (i = ∅ in {j, i}). Note that the proposed metric
calculation and update for Rep nodes are identical to those of
baseline DSCF decoding.
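As a minimal sketch of (18) and (19), the snippet below computes both metric components for a Rep node from its top-node LLRs. The names `rep_node_metric` and `f_alpha` are hypothetical, and `f_alpha` again assumes the form ln(1 + exp(−αx))/α for the DSCF penalty term.

```python
import math


def f_alpha(x, alpha=0.3):
    """Assumed DSCF penalty term: ln(1 + exp(-alpha * x)) / alpha."""
    return math.log1p(math.exp(-alpha * x)) / alpha


def rep_node_metric(top_llrs, already_flipped=False):
    """Return (M', increment of M'') for a Rep node, following (18)-(19).

    top_llrs        : the N_v LLRs at the top of the Rep node
    already_flipped : True if {j, empty set} is part of E_{w-1}
    """
    s = abs(sum(top_llrs))        # magnitude of the single information-bit LLR
    m1 = s                        # instantaneous part, (18)
    m2_inc = f_alpha(s)           # always accumulated, first term of (19)
    if already_flipped:
        m2_inc += s               # second term of (19)
    return m1, m2_inc


m1, inc_plain = rep_node_metric([1.5, -0.5, 2.0, -1.0])
_, inc_flipped = rep_node_metric([1.5, -0.5, 2.0, -1.0], already_flipped=True)
```

Only one candidate is produced per Rep node, which matches the observation that no index information needs to be stored for it.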
By definition, Rate-1 nodes do not involve any frozen bits.
Thus, they correspond to an uncoded sequence and all the
indices at the top of the node are evaluated for bit-flipping. We
therefore use the top-level LLRs directly in metric calculations
for the prospective flipping indices. For a Rate-1 node j of size
Nv, at decoding tree stage S, M ′α(Eω) is calculated for each
index i (0 ≤ i < Nv) within the node, such that
\[
M'_\alpha(\mathcal{E}_\omega)_{\{j,i\}} = \left|L^S[\mathcal{E}_{\omega-1}]_i\right|. \tag{20}
\]
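Note that the LLR values at the top of a Rate-1 node do not
have any dependencies on one another. Therefore, M ′′α(Eω) is
updated at once for the entire Rate-1 node, such that
\[
M''_\alpha(\mathcal{E}_\omega) \mathrel{+}= \sum_{i \in N_v} f_\alpha\!\left(\left|L^S[\mathcal{E}_{\omega-1}]_i\right|\right) + \sum_{\substack{i \in N_v \\ \{j,i\} \subset \mathcal{E}_{\omega-1}}} \left|L^S[\mathcal{E}_{\omega-1}]_i\right|. \tag{21}
\]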
If the flipping subset {j, i} corresponds to a Rate-1 node at an
additional decoding attempt, its top-node index i is flipped.
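The Rate-1 rules (20) and (21) can be illustrated as follows. This is a sketch with hypothetical names (`rate1_node_metrics`, `f_alpha`); the form of `f_alpha` is assumed to be ln(1 + exp(−αx))/α, which is defined outside this section.

```python
import math


def f_alpha(x, alpha=0.3):
    """Assumed DSCF penalty term: ln(1 + exp(-alpha * x)) / alpha."""
    return math.log1p(math.exp(-alpha * x)) / alpha


def rate1_node_metrics(top_llrs, flipped_indices=()):
    """Return ({i: M'}, increment of M'') for a Rate-1 node, per (20)-(21).

    top_llrs        : the N_v LLRs at the top of the Rate-1 node
    flipped_indices : top-node indices i with {j, i} already in E_{w-1}
    """
    mags = [abs(llr) for llr in top_llrs]
    m1 = dict(enumerate(mags))                      # one candidate per index, (20)
    m2_inc = sum(f_alpha(m) for m in mags)          # first sum of (21)
    m2_inc += sum(mags[i] for i in flipped_indices) # second sum of (21)
    return m1, m2_inc


m1, inc = rate1_node_metrics([2.0, -0.5, 1.0, 3.0], flipped_indices=(1,))
```

Because every top-node index is a candidate here, a node of size Nv yields Nv candidates before any search-span reduction is applied.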
The proposed metric for Rate-1 nodes is not identical to that of
baseline DSCF decoding. To assess the impact, Fig. 6 depicts how the
FER evolves with Tmax, using two polar codes PC(1024, 512) and
PC(1024, 896), evaluated at error orders ω ∈ {1, 2, 3} and
simulated at six different SNR points each. For these two codes,
95% and 99% of the indices in the non-frozen set A fall under
Rate-1 nodes, respectively. It can be seen that the
error-correction performance with Rate-1 nodes is similar
to that of the original DSCF algorithm. In return, the decoding
tree does not need to be traversed at Rate-1 nodes.
[Figure 6: FER versus Tmax, one row per error order (ω = 1, 2, 3); panels (a) PC(1024, 512) and (b) PC(1024, 896); curves: DSCF [9] and Fast-DSCF (Rate-1 only).]
Fig. 6. FER performance of DSCF decoding with and without using Rate-1
nodes, with respect to a wide range of Tmax values. C = 16, ω ∈ {1, 2, 3}.
fα=0.3(x) is not approximated. Simulated SNR values are between 1.5 dB
and 2.75 dB for PC(1024, 512), and between 6.0 dB and 7.25 dB for
PC(1024, 896), with 0.25 dB steps each.
As mentioned in Section IV-A, an even number of bit-flips
has to take place at SPC nodes in order to keep their even
parity constraint. Accordingly, for each attempted error order,
two indices are considered, which leads to $\binom{N_v}{2}$ combinations
for an SPC node of size Nv. Therefore, for an SPC node j
of size Nv, at decoding tree stage S, M ′α(Eω) is calculated
for each {j, i} (i = {i1, i2}, 0 ≤ i1 < Nv, 0 ≤ i2 < Nv,
i1 ≠ i2), such that
\[
M'_\alpha(\mathcal{E}_\omega)_{\{j,\{i_1,i_2\}\}} = \sum_{i \in \{i_1,i_2\}} \left(\left|L^S[\mathcal{E}_{\omega-1}]_i\right| - \gamma\left|L^S[\mathcal{E}_{\omega-1}]_{i_{\min}}\right|\right), \tag{22}
\]
where imin denotes the top-node index with the minimum LLR
magnitude and γ is the initial parity.
As noted in (12), only the non-frozen index LLR magnitudes
are used towards the calculation of Sα(Eω). Thus, the index
imin is excluded in the calculation of Sα(Eω) in SPC nodes.
Instead, its LLR magnitude is applied as an offset to all other
indices, such that
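\[
\begin{aligned}
M''_\alpha(\mathcal{E}_\omega) \mathrel{+}= {} & \sum_{i \in N_v \setminus i_{\min}} f_\alpha\!\left(\left|L^S[\mathcal{E}_{\omega-1}]_i\right| + (1-2\gamma)\left|L^S[\mathcal{E}_{\omega-1}]_{i_{\min}}\right|\right) \\
& + \sum_{\substack{i \in \{i_1,i_2\} \\ \{j,\{i_1,i_2\}\} \subset \mathcal{E}_{\omega-1}}} \left(\left|L^S[\mathcal{E}_{\omega-1}]_i\right| - \gamma\left|L^S[\mathcal{E}_{\omega-1}]_{i_{\min}}\right|\right).
\end{aligned} \tag{23}
\]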
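A possible reading of (22) and (23) in code is given below. It is a sketch, not the authors' implementation: the names are hypothetical, `f_alpha` assumes the form ln(1 + exp(−αx))/α, and the initial parity γ is assumed to be the XOR of the hard decisions taken on the top-node LLRs (negative LLR mapped to bit 1).

```python
import math
from itertools import combinations


def f_alpha(x, alpha=0.3):
    """Assumed DSCF penalty term: ln(1 + exp(-alpha * x)) / alpha."""
    return math.log1p(math.exp(-alpha * x)) / alpha


def spc_node_metrics(top_llrs, flipped_pair=None):
    """Return (gamma, i_min, {(i1, i2): M'}, increment of M'') per (22)-(23)."""
    n = len(top_llrs)
    mags = [abs(llr) for llr in top_llrs]
    gamma = sum(1 for llr in top_llrs if llr < 0) % 2    # assumed parity rule
    i_min = min(range(n), key=mags.__getitem__)
    offset = gamma * mags[i_min]
    # (22): one candidate per unordered pair of top-node indices
    m1 = {(i1, i2): (mags[i1] - offset) + (mags[i2] - offset)
          for i1, i2 in combinations(range(n), 2)}
    # (23), first sum: i_min is excluded and acts as an offset on the others
    m2_inc = sum(f_alpha(mags[i] + (1 - 2 * gamma) * mags[i_min])
                 for i in range(n) if i != i_min)
    # (23), second sum: only if a pair of this node was already flipped
    if flipped_pair is not None:
        m2_inc += sum(mags[i] - offset for i in flipped_pair)
    return gamma, i_min, m1, m2_inc


gamma, i_min, m1, inc = spc_node_metrics([2.0, -0.5, 1.0, 3.0])
```

For a node of size 4 this produces 6 pair candidates, which is the binomial count stated in the text.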
If the flipping subset {j, i} corresponds to an SPC node at
an additional decoding attempt, then the two top-node indices
in i are flipped.
[Figure 7: FER versus Tmax, one row per error order (ω = 1, 2, 3); panels (a) PC(1024, 512) and (b) PC(1024, 896); curves: Fast-DSCF (Rate-1) and Fast-DSCF (Rate-1 & SPC).]
Fig. 7. FER trend line of DSCF decoding with and without using SPC
nodes, with respect to a wide range of Tmax values. C = 16, ω ∈ {1, 2, 3}.
fα=0.3(x) is not approximated. Simulated SNR values are between 1.5 dB
and 2.75 dB for PC(1024, 512), and between 6.0 dB and 7.25 dB for
PC(1024, 896), with 0.25 dB steps each.
Fig. 7 compares the FER performance of our approach with
SPC nodes (22)-(23) against the baseline DSCF algorithm
for ω ∈ {1, 2, 3}. As in the Rate-1 study, PC(1024, 512)
and PC(1024, 896) are used; 68% and 65% of the indices in
their non-frozen sets fall under SPC nodes, respectively.
For both polar codes, lower FER values are reached
when SPC nodes are involved, as outlined by the Fast-SCO
performance limits in Fig. 5. On the other hand, with
increasing ω, the performance with SPC nodes is slightly
degraded at low Tmax values, and better performance is
achieved at relatively higher SNR and Tmax values.
V. REDUCING THE BIT-FLIPPING SEARCH SPAN IN
FAST-DSCF DECODING
In this section, we attempt to reduce the search span for
bit-flipping within Rate-1 and SPC nodes without introducing
significant degradation in error-correction performance using
a theoretical framework. This study is divided into three
parts. First, we derive the theoretical FER bounds for Rate-1
nodes for any error order via density evolution using Gaussian
approximation [37]. Then, we attempt to approach the
derived bounds using a reduced number of elements within
the Rate-1 nodes. Finally, we extend our study towards SPC
nodes.
Note that these derived theoretical computations are not
exclusive to the DSCF algorithm, and can be used for other
algorithms and purposes. This study is carried out using BPSK
modulation; it can be extended to higher-order modulation
scenarios.
A. Reducing the Search Span in Rate-1 Nodes
Assuming an all-zero codeword with BPSK signaling, the
LLRs can be expressed as Gaussian random variables, such
that
\[
L^0_i \sim \mathcal{N}(\mu_i, \sigma_i^2), \quad i \in [0, N). \tag{24}
\]
As the variance (σ²) and the mean (µ) of the random
variable model are coupled (σ² = 2µ), it is sufficient to
track the mean for each channel [37]. Then, the probability
of error (πi) associated with each leaf node index can be
approximated by the Q-function Q(x), which is the tail
probability of the standard Gaussian distribution. The Q-function
can be expressed using the complementary error function [38]:
\[
\pi_i \approx Q\!\left(\frac{\mu_i}{\sigma_i}\right) = Q\!\left(\sqrt{\frac{\mu_i}{2}}\right) = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{\sqrt{\mu_i}}{2}\right). \tag{25}
\]
Recall that Pr(Eω) is the probability of SC decoding being
successful after flipping the indices in Eω (5). Here, let us
define a slightly different probability, Pr(eω), as the probability
of SC decoding being successful after flipping ω indices. Unlike
Pr(Eω), Pr(eω) does not involve a specific set of indices but
only concerns the number of flipped indices. In this sense,
when ω = 0, Pr(e0) is the probability of SC decoding
being successful, and can be expressed in terms of πi as
\[
\Pr(e_0) = \prod_{i \in \mathcal{A}} (1 - \pi_i). \tag{26}
\]
Accordingly, the FER of SC can be easily computed by taking
the complement of Pr(e0) [39], such that
\[
\mathrm{FER}_{\mathrm{SC}} = 1 - \Pr(e_0) = 1 - \left[\prod_{i \in \mathcal{A}} (1 - \pi_i)\right]. \tag{27}
\]
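Equations (25) to (27) translate directly into a short numerical sketch. The function names are hypothetical, and the per-index means passed to `fer_sc` are assumed to come from a separate density-evolution step under the Gaussian approximation, which is not shown here.

```python
import math


def error_probability(mu):
    """Error probability for an LLR ~ N(mu, 2*mu), following (25)."""
    return 0.5 * math.erfc(math.sqrt(mu) / 2.0)


def fer_sc(means, info_set):
    """FER of SC decoding per (26)-(27), given the mean LLR of every leaf."""
    p_success = 1.0
    for i in info_set:
        p_success *= 1.0 - error_probability(means[i])   # (26)
    return 1.0 - p_success                               # (27)


# Toy example with made-up means for a length-4 code and A = {1, 3}.
toy_fer = fer_sc([0.5, 6.0, 1.0, 20.0], info_set={1, 3})
```

A mean of zero gives an error probability of exactly one half, which is a quick way to sanity-check the mapping.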
It is worth mentioning that the theoretical FER of SC when
one or more errors are corrected (ω > 0) can be derived similarly
(e.g. 1 − Pr(e0) − Pr(e1)). However, the derived performance
does not correlate well with the simulated performance. This
is because of the propagated errors that occur at the leaf nodes.
Since no systematic model is derived to explain the behavior
of the propagated errors, they are assumed unpredictable.
The correlation between channel errors and propagated errors
has a negative impact on the accuracy of the performance
estimation. This behavior was also addressed briefly in Section
IV-B of [9]. On the other hand, we show that there is a way
to estimate the performance for ω > 0 accurately for Rate-1
and SPC nodes.
Based on (27), the FER for a Rate-1 node under SC
decoding can be computed using its top-node LLRs rather than
its leaf node LLRs, by exploiting the fact that there are no
frozen (parity) bits involved [28]. As such,
for a Rate-1 node at stage S and of size Nv, (27) can be
adapted into
\[
\mathrm{FER}_{\text{Rate-1,SC}} = 1 - (1 - \pi^S)^{N_v}, \tag{28}
\]
where $\pi^S$ is derived by substituting the mean value at stage S
($\mu^S$) into (25). Note that this derivation is possible because all
variables at the top-level of a node have the same mean and
variance. Similarly, we can derive the FER for higher error
orders for Rate-1 nodes. Unlike the case in SC, the theoretical
FER for Rate-1 nodes is accurate if it is calculated using their
top-node LLRs, because channel-induced errors at top-level
indices do not yield error propagation. In other words, the
independence of the top-node errors in Rate-1 nodes is the key to
obtaining an accurate theoretical FER calculation. For instance,
for a Rate-1 node of size Nv and at stage S, the probability of a
channel-induced error at a top-node index is $\pi^S$. To compute
Pr(e1), we must ensure that all other indices are error-free
($(1-\pi^S)^{N_v-1}$), and account for the fact that the channel
error can occur at Nv different indices:
\[
\Pr(e_1) = N_v \, \pi^S (1 - \pi^S)^{N_v - 1}.
\]
The FER for a Rate-1 node with correcting one error at the
top-level can be expressed as 1 − Pr(e0) − Pr(e1):
\[
\mathrm{FER}_{\text{Rate-1},\omega=1} = 1 - (1 - \pi^S)^{N_v} - N_v \, \pi^S (1 - \pi^S)^{N_v - 1}.
\]
To generalize, the probability of error for a Rate-1 node of
size Nv at a target error order ω can be expressed as
\[
\mathrm{FER}_{\text{Rate-1},\omega} = 1 - \sum_{i=0}^{\omega} \Pr(e_i), \tag{29}
\]
which can be expanded into
\[
\mathrm{FER}_{\text{Rate-1},\omega} = 1 - \sum_{i=0}^{\omega} \binom{N_v}{i} (\pi^S)^i (1 - \pi^S)^{N_v - i}. \tag{30}
\]
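As a quick numerical check, (30) can be evaluated with the standard library alone. The function name `rate1_fer` is hypothetical; the error probability is passed in directly so that the sketch does not depend on how the stage mean is obtained.

```python
import math


def rate1_fer(n_v, omega, pi_s):
    """Evaluate (30): FER of a Rate-1 node of size n_v when up to `omega`
    top-node errors are corrected and each index errs with probability pi_s."""
    p_corrected = sum(
        math.comb(n_v, i) * pi_s ** i * (1.0 - pi_s) ** (n_v - i)
        for i in range(min(omega, n_v) + 1)
    )
    # Note: for very small FER values the subtraction loses precision.
    return 1.0 - p_corrected


fer_by_order = [rate1_fer(4, w, 0.1) for w in range(5)]
```

With ω = 0 the expression reduces to (28), and with ω = Nv every error pattern is covered so the result drops to zero up to rounding.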
As the theoretical achievable FER limit for Rate-1 nodes is
identified as a function of Nv, ω and $\mu^S$, we now attempt
to approximate this with an artificially reduced size. We
claim that the Rate-1 top-node indices with the lowest LLR
magnitudes become far more susceptible to incur errors than
others. Since the lowest LLR magnitudes at a parent node are
propagated to its left child nodes using (1), the FER of a Rate-1
node can be approximated by using its left child nodes instead
of the root node. For example, the lowest LLR magnitude at
the top of the Rate-1 node is also found at the leftmost leaf
node (S = 0), with a different mean value µ0. Similarly, the
two lowest LLR magnitudes at the top of the Rate-1 node are
also found at the leftmost node at S = 1 with mean value
µ1. Accordingly, we use the size and mean values of the left
nodes of a Rate-1 node to obtain an approximated achievable
FER limit to the original one in (30).
Let δ denote the number of indices within a search span
of a node which comprises the lowest LLR magnitudes at
a Rate-1 node, where Nv ≥ δ ≥ ω. If δ < ω, then the
number of errors cannot fit within the search span and a failed
decoding is guaranteed. Accordingly, the achievable FER can
be approximated as
\[
\mathrm{FER}_{\text{Rate-1},\omega} \approx 1 - \sum_{i=0}^{\omega} \binom{\delta}{i} (\pi^*)^i (1 - \pi^*)^{\delta - i}, \tag{31}
\]
where $\pi^*$ is the error probability calculated from the mean of the
leftmost child at stage S = log2 δ. It should be noted that a similar
study has been carried out in [28] but it is limited to ω =
0 and δ = 1, which cannot be used for higher order FER
approximations.
[Figure 8: FER (Rate-1, ω) versus mean (µ); panels (a) Nv = 4, (b) Nv = 8, (c) Nv = 32, (d) Nv = 64; curves for δ = Nv, δ = 1, δ = 2, δ = 4 and δ = 8.]
Fig. 8. Theoretical achievable FER limits (30) for Rate-1 nodes for ω ∈
{0, 1, 2, 3} and different node sizes, compared to their approximations (31)
using δ ∈ {1, 2, 4, 8}.
Fig. 8 depicts the exact achievable FER for Rate-1 nodes
(30) compared to the approximated versions (31) using four
different δ values, four different node sizes and four error
orders. It can be observed that the approximations follow the
original FERRate-1,ω more closely with increasing δ. At
large node sizes and higher error orders, the approximations
begin to diverge from the original achievable FER, but they
may still be considered acceptable over a
wide range of mean values. A reduced bit-flipping search span
will be introduced for Rate-1 nodes based on the presented
approximations in Section VI.
B. Reducing the Search Span in SPC Nodes
At the top-level of an SPC node, all indices but one carry
information and one bit carries the parity of all the other indices.
This allows the SPC node to correct one erroneous index
naturally, but certain conditions must be met. These conditions
can be described as two probabilistic events: the event
that only one bit estimate is incorrect, and the event
that the incorrect bit has the lowest absolute LLR value
among all indices. If the incorrect bit does not have the lowest
absolute LLR, then the correction mechanism of the SPC node
introduces an additional erroneous index instead of correcting the error.
We denote the probability of a naturally correctable error with
Pr(e∗1).
The associated events are visualized in the Gaussian proba-
bility density function in Fig. 9, and explained next. Consider
an SPC node of size Nv at stage S with top-level LLR values
LS . Consider an all-zero codeword, a BPSK modulation and
the AWGN channel. The probability that an SPC index i is
erroneous and naturally correctable can be described as the
probability that index i has the LLR value $L^S_i = x$
(x < 0) and |x| is the smallest LLR magnitude at the top-level:
\[
\Pr(e^*_1, L^S_i = x) = \Pr\!\left(L^S_i = x \,\cap\, \{L^S \setminus L^S_i\} \ge -x\right). \tag{32}
\]
[Figure 9: Gaussian probability density function f(x; µ, σ) over x, with the tail Q(µ/σ), a negative value x, and the region Pr(L ≥ −x) marked.]
Fig. 9. A Gaussian probability density function highlighting the Q-function,
the probability of a negative value x, and the probability of a value equal to
or larger than −x.
The first event in (32) can be approximated using the normal
probability density function:
\[
f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(\mu - x)^2}{2\sigma^2}}, \quad x \in \mathbb{R}. \tag{33}
\]
On the other hand, the second event can be described in terms
of a Q-function with mean value shifted by x. Finally, since
the two described events are statistically independent, we can
apply Pr(A ∩B) = Pr(A)× Pr(B); leading to
\[
\Pr(e^*_1, L^S_i = x) = f(x; \mu, \sigma) \times \left[1 - Q\!\left(\frac{\mu + x}{\sigma}\right)\right]^{N_v - 1}. \tag{34}
\]
Finally, taking into account that the described event in (32)
may occur at any index, and for any x < 0, we get
\[
\Pr(e^*_1) = \binom{N_v}{1} \int_{-\infty}^{0^-} f(x; \mu, \sigma) \left[1 - Q\!\left(\frac{\mu + x}{\sigma}\right)\right]^{N_v - 1} dx. \tag{35}
\]
Accordingly, the theoretical FER for an SPC node under SC
decoding can be computed as:
\[
\mathrm{FER}_{\mathrm{SPC},\omega=0} = 1 - \Pr(e_0) - \Pr(e^*_1), \tag{36}
\]
where Pr(e0) and Pr(e∗1) can be substituted from (26) and (35).
Now we show how to extend this computation to SPC
nodes with higher error orders. Recall that two SPC indices
are corrected per attempted error order. Hence, for ω = 1,
two channel-induced errors are corrected by bit-flipping, and a
third error can also be corrected if it is correctable by the
parity check (Pr(e∗3)). For the all-zero
codeword scenario, given an SPC index i with an LLR
value of Li = x (x < 0), the remaining two erroneous indices
must have lower values than x and all other indices must have
higher values than −x so that index i becomes correctable by
the parity check. Considering all combinations of the described
scenario and integrating over the range of x yields
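\[
\Pr(e^*_3) = \int_{-\infty}^{0^-} \binom{N_v}{1} f(x; \mu, \sigma) \binom{N_v - 1}{2} Q\!\left(\frac{\mu - x}{\sigma}\right)^{2} \times \left[1 - Q\!\left(\frac{\mu + x}{\sigma}\right)\right]^{N_v - 3} dx. \tag{37}
\]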
Following (29), (36) and (37) the theoretical FER of SPC
nodes for any error order can be generalized as follows:
\[
\mathrm{FER}_{\mathrm{SPC},\omega} = 1 - \sum_{i=0}^{2\omega} \Pr(e_i) - \Pr(e^*_{2\omega+1}), \tag{38}
\]
where
\[
\Pr(e^*_{2\omega+1}) = \int_{-\infty}^{0^-} \binom{N_v}{1} f(x; \mu, \sigma) \binom{N_v - 1}{2\omega} Q\!\left(\frac{\mu - x}{\sigma}\right)^{2\omega} \times \left[1 - Q\!\left(\frac{\mu + x}{\sigma}\right)\right]^{N_v - 2\omega - 1} dx. \tag{39}
\]
[Figure 10: FER (SPC, ω) versus mean (µ); panels (a) Nv = 4, (b) Nv = 8, (c) Nv = 32, (d) Nv = 64; curves for δ = Nv, δ = 2, δ = 4 and δ = 8.]
Fig. 10. Theoretical achievable FER limits (38) for SPC nodes for ω ∈
{0, 1, 2} and different node sizes, compared to their approximations (40)
using δ ∈ {2, 4, 8}.
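The integral in (39) has no closed form, but it is easy to evaluate numerically. The sketch below uses composite Simpson's rule with the standard library only; the function names, the truncation of the lower limit at µ − 12σ and the number of steps are illustrative assumptions. The terms Pr(e_i) in (38) are assumed to take the binomial form used in (30), since all top-level indices share the same mean.

```python
import math


def q_func(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def gauss_pdf(x, mu, sigma):
    """Normal density f(x; mu, sigma) as in (33)."""
    return math.exp(-((mu - x) ** 2) / (2.0 * sigma ** 2)) / (
        math.sqrt(2.0 * math.pi) * sigma)


def pr_parity_correctable(n_v, omega, mu, steps=4000):
    """Numerically evaluate Pr(e*_{2w+1}) in (39); `steps` must be even, mu > 0."""
    k = 2 * omega
    if n_v < k + 1:
        return 0.0
    sigma = math.sqrt(2.0 * mu)          # sigma^2 = 2 * mu
    lo = mu - 12.0 * sigma               # assumed truncation of -infinity
    if lo >= 0.0:
        return 0.0                       # negligible mass below zero

    def integrand(x):
        return (n_v * gauss_pdf(x, mu, sigma) * math.comb(n_v - 1, k)
                * q_func((mu - x) / sigma) ** k
                * (1.0 - q_func((mu + x) / sigma)) ** (n_v - k - 1))

    h = -lo / steps
    total = integrand(lo) + integrand(0.0)
    for s in range(1, steps):
        total += (4 if s % 2 else 2) * integrand(lo + s * h)
    return total * h / 3.0


def spc_fer(n_v, omega, mu):
    """Evaluate (38), with Pr(e_i) assumed binomial as in (30)."""
    pi_s = q_func(math.sqrt(mu / 2.0))
    p_flip = sum(math.comb(n_v, i) * pi_s ** i * (1.0 - pi_s) ** (n_v - i)
                 for i in range(min(2 * omega, n_v) + 1))
    return 1.0 - p_flip - pr_parity_correctable(n_v, omega, mu)


fer_w0 = spc_fer(8, 0, 4.0)   # reduces to (36)
fer_w1 = spc_fer(8, 1, 4.0)
```

Setting ω = 0 recovers (35) and (36), so the same routine covers the plain SC case as well as the higher error orders.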
Our next and final goal is to approximate the theoretical
FER trend lines of SPC nodes with a limited search span. To
do this, we follow a similar approach to the study in Rate-1
nodes: a search span that comprises the δ lowest LLR magnitudes
at an SPC node is defined, where Nv ≥ δ ≥ max(2, 2ω + 1), δ ∈ Z+.
If δ < max(2, 2ω + 1), then the
number of errors cannot fit within the search span and a failed
decoding is guaranteed. Accordingly, the achievable FER for
SPC nodes within a limited search span can be expressed as
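\[
\begin{aligned}
\mathrm{FER}_{\mathrm{SPC},\omega} \approx 1 & - \sum_{i=0}^{2\omega} \binom{\delta}{i} (\pi^*)^i (1 - \pi^*)^{\delta - i} \\
& - \int_{-\infty}^{0^-} \binom{\delta}{1} f(x; \mu^*, \sigma^*) \binom{\delta - 1}{2\omega} Q\!\left(\frac{\mu^* - x}{\sigma^*}\right)^{2\omega} \times \left[1 - Q\!\left(\frac{\mu^* + x}{\sigma^*}\right)\right]^{\delta - 2\omega - 1} dx,
\end{aligned} \tag{40}
\]
where µ∗ is obtained from the leftmost node at stage S =
log2 δ, and π∗ and σ∗ are calculated using µ∗.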
Fig. 10 depicts the exact achievable FER for SPC nodes
(38) compared to the approximated version (40) using three
different δ values where applicable, four different node sizes
and three error orders. Similar observations to the case of Rate-1
nodes can also be made for the SPC nodes: the considered
approximations follow a close trend to the exact derivation,
and higher δ values yield a better approximation
at the cost of a larger search span. Similar to the Rate-1 case,
a reduced bit-flipping search span is created for SPC nodes
based on the presented approximations in Section VI.
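[Figure 11: block diagram. The decision metric generator takes the top-node LLRs L; the upper path applies abs(), f∗α(x), a summation and an accumulator to produce M ′′α(Eω); the lower path applies findmin4() to produce M ′α(Eω), followed by presort() and join() with the current λ. The candidates then pass through a shift register into the insertion sorter.]
Fig. 11. Sorter datapath for the proposed Fast-DSCF decoder architecture.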
VI. HARDWARE ARCHITECTURE
The architecture for a typical SCF-based polar decoder
comprises the following main components: an SC decoder
core where the decoding iterations are carried out, a CRC unit
that works in parallel with the SC core to validate the output,
and a sorter datapath to collect and regulate the bit-flipping
information for possible additional decoding attempts. For
this work, the SC decoder core is derived from the Fast-SCF
decoder reported in [11], which is based on a semi-parallel
polar decoder architecture [40] with a parallelization factor
of Pe = 64. Other than the branch operations, the modified
decoder supports only Rep, Rate-1 and SPC instructions. Since
Rate-0 nodes are always followed by a right branch (G)
operation (1), Rate-0 nodes are merged with right branch
operations to reduce latency [41]. A highly-parallelized CRC
processor for the polynomial 0x1021 is implemented after [42]
to validate the decoder output.
Compared to the state-of-the-art Fast-SCF decoder architec-
tures, the main difference of the proposed Fast-DSCF archi-
tecture is on the sorter datapath, which includes generation of
the decision metrics for bit-flipping candidates, followed by
sorting and storing of the candidates. The architecture for the
proposed sorter datapath is visualized in Fig. 11.
A. Metric Normalization and Quantization
The M ′′α(Eω) component of the decision metric is accumu-
lative as mentioned in Section IV-B, which is not a desired
property for quantization. When an accumulative component
is quantized, it carries the risk of saturating over the course
of the decoding, which disorganizes the sorted order of the
bit-flipping indices and therefore may degrade the error-
correction performance. One way to get around this problem
is to increase the number of quantization bits at the cost of
an increased latency, area and power consumption. Another
way is to normalize the metric update by identifying and
eliminating the computations that are performed more than
once so that the risk of saturation could be minimized, as
explained next.
Recall that the Mα(Eω) is computed as described in (16),
in which M ′′α(Eω) is expressed as:
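\[
M''_\alpha(\mathcal{E}_\omega) = \sum_{j \in \mathcal{E}_{\omega-1}} \left|L^0[\mathcal{E}_{\omega-1}]_j\right| + \sum_{\substack{j \le i_\omega \\ \forall j \in \mathcal{A}}} f_\alpha\!\left(\left|L^0[\mathcal{E}_{\omega-1}]_j\right|\right). \tag{41}
\]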
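Note that a part of this computation can be found in the
preceding metric Mα(Eω−1). Following (11) and (13), Mα(Eω−1)
can be expressed as:
\[
M_\alpha(\mathcal{E}_{\omega-1}) = \sum_{j \in \mathcal{E}_{\omega-1}} \left|L^0[\mathcal{E}_{\omega-2}]_j\right| + \sum_{\substack{j \le i_{\omega-1} \\ \forall j \in \mathcal{A}}} f_\alpha\!\left(\left|L^0[\mathcal{E}_{\omega-2}]_j\right|\right). \tag{42}
\]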
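Given that $L^0[\mathcal{E}_{\omega-2}]_j = L^0[\mathcal{E}_{\omega-1}]_j$ for j ≤ iω−1, M ′′α(Eω)
can be normalized by Mα(Eω−1):
\[
M''_\alpha(\mathcal{E}_\omega) - M_\alpha(\mathcal{E}_{\omega-1}) = \sum_{\substack{i_{\omega-1} < j \le i_\omega \\ \forall j \in \mathcal{A}}} f_\alpha\!\left(\left|L^0[\mathcal{E}_{\omega-1}]_j\right|\right). \tag{43}
\]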
Therefore, the normalized M ′′α(Eω) can be initialized to 0 at
the beginning of any extra decoding attempt; it remains
unchanged until the last flipped index, and is updated only
after the last flipped index as:
\[
M''_\alpha(\mathcal{E}_\omega)_j \mathrel{+}= f_\alpha\!\left(\left|L^0[\mathcal{E}_{\omega-1}]_j\right|\right) \quad \text{if } j \in \mathcal{A},\ j > i_{\omega-1}. \tag{44}
\]
The normalization of M ′′α(Eω) is not only in favor of quanti-
zation, but it also helps to reduce the computational effort by
avoiding redundant calculations. Note that, this normalization
procedure is extended to the computation of M ′′α(Eω) at the
special nodes that are detailed in (19), (21), (23) for the
hardware implementation.
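The effect of the normalization can be checked numerically on a toy example. In the sketch below, all names and LLR values are made up for illustration, and `f_alpha` assumes the form ln(1 + exp(−αx))/α. The two LLR vectors agree up to the last flipped index, which is the condition under which the shared terms of (41) and (42) cancel.

```python
import math


def f_alpha(x, alpha=0.3):
    """Assumed DSCF penalty term: ln(1 + exp(-alpha * x)) / alpha."""
    return math.log1p(math.exp(-alpha * x)) / alpha


def full_metric(abs_llrs, info_set, flipped, last_index):
    """Unnormalized sum as in (41)/(42): flipped magnitudes plus the
    f_alpha terms of all non-frozen indices up to `last_index`."""
    return (sum(abs_llrs[j] for j in flipped)
            + sum(f_alpha(abs_llrs[j]) for j in info_set if j <= last_index))


def normalized_m2(abs_llrs, info_set, i_prev, i_cur):
    """Normalized accumulation per (44): only non-frozen indices after
    the last flipped index i_prev contribute."""
    return sum(f_alpha(abs_llrs[j]) for j in info_set if i_prev < j <= i_cur)


info_set = {1, 2, 3, 5}
i_prev, i_cur = 2, 5
llrs_prev = [1.0, 0.4, 2.0, 0.7, 1.5, 0.9]   # attempt that flipped index 2
llrs_cur = [1.0, 0.4, 2.0, 1.2, 0.3, 2.2]    # identical up to index 2

m_prev = full_metric(llrs_prev, info_set, {i_prev}, i_prev)
m2_cur = full_metric(llrs_cur, info_set, {i_prev}, i_cur)
delta = normalized_m2(llrs_cur, info_set, i_prev, i_cur)
```

The normalized value `delta` stays small because it no longer carries the terms already contained in the preceding metric, which is what reduces the risk of saturation after quantization.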
For quantization of Fast-DSCF decoding with ω = 1, 5
bits and 6 bits are set for the channel and internal LLRs,
respectively, with 1 bit reserved for the fractional part. A
quantization of 5 bits for the metric is shown to be sufficient
to obtain a well-approximated performance to the floating-
point decoding. On the other hand, these quantization schemes
are not shown to be sufficient for higher order decoding: one
extra bit is required for the fractional part of the LLRs which
impacts channel and internal LLRs, and the metric. Moreover,
another extra bit is required for the metric to sort the bit-
flipping indices efficiently. Hence, 6 and 7 bits are set for the
channel and internal LLRs, and 7 bits are set for the metric
quantization at ω > 1.
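To make the bit widths above concrete, the following sketch models a saturating fixed-point quantizer. It is an assumption-laden illustration: the function name `quantize` is hypothetical, LLRs are assumed to use a symmetric signed range, and the metric is assumed to be unsigned because it is a sum of non-negative terms.

```python
def quantize(value, total_bits, frac_bits=1, signed=True):
    """Round `value` to a grid with `frac_bits` fractional bits and
    saturate it to the range representable with `total_bits` bits."""
    scale = 1 << frac_bits
    if signed:
        max_code = (1 << (total_bits - 1)) - 1
        min_code = -max_code              # symmetric range (assumption)
    else:
        max_code = (1 << total_bits) - 1
        min_code = 0
    code = int(round(value * scale))
    code = max(min_code, min(max_code, code))
    return code / scale


channel_llr = quantize(3.3, total_bits=5)                # 5-bit channel LLR
internal_llr = quantize(-100.0, total_bits=6)            # 6-bit internal LLR
metric = quantize(20.0, total_bits=5, signed=False)      # 5-bit metric
```

Under these assumptions a 5-bit metric with one fractional bit cannot exceed 15.5, which illustrates why an unnormalized accumulative metric would saturate quickly.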
B. Decision Metric Generation
The M ′α(Eω) and M ′′α(Eω) components of the decision
metric are generated simultaneously, then summed to acquire
the desired Mα for each bit-flipping candidate. During the
execution of each special node, their top-node LLR values
(shown as L in Fig. 11) are forwarded to the decision metric
generator with the associated control signals, such as the
instruction, parity information, last flipping location (if any),
and the stage size. The upper datapath in Fig. 11 visualizes the
signal flow on the generation of M ′′α(Eω): the f∗α(x) function
from (15) is applied to the absolute top-node LLRs, which are
then summed together and forwarded to the register that holds
the current M ′′α(Eω). The M ′′α(Eω) component of the metric is
updated based on the rule described in (44), and the control
signals are not shown for a simplified view.
Following the discussions on reducing the bit-flipping
search span from Section V, the M ′α(Eω) component of the
decision metric should only involve a predetermined number
of indices. Following the studies in Fig. 8 and Fig. 10, we
reduce the search span for Rate-1 and SPC indices to 2 and
4, respectively. It can be seen from Fig. 10 that limiting the
search span to 4 for SPC nodes results in some performance
degradation for larger node sizes at higher orders. On the
other hand, increasing the search span for SPC nodes increases
the complexity of the sorting process drastically. In order to
minimize the performance degradation, the maximum size for
SPC nodes is limited for higher order decoding. Namely, the
maximum SPC node sizes are set to 8 and 4 for ω = 2 and
ω = 3, respectively.
The lower path in Fig. 11 shows the M ′α(Eω) genera-
tion followed by the bit-flipping candidate generation. The
findmin4() function in Fig. 11 finds the four indices with
the minimum LLR magnitudes, with which 2 and $\binom{4}{2} = 6$
bit-flipping candidates are generated with their M ′α(Eω) values for
Rate-1 and SPC nodes, respectively. The generated candidates
are pre-sorted (presort() function in Fig. 11), which
greatly helps with the actual sorting process. The generated
candidates are then assembled with the M ′′α(Eω) component and
with the current bit-flipping information (join() function in
Fig. 11). Hence, up to six bit-flipping candidates are generated
and forwarded to the sorter architecture.
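The candidate generation described above can be mimicked in software as follows. This is a behavioral sketch rather than a description of the hardware: the function names mirror the labels in Fig. 11 but their signatures are hypothetical, and the SPC candidate metric follows (22) with the initial parity γ supplied by the caller.

```python
from itertools import combinations


def find_min4(top_llrs):
    """Indices of the (up to) four smallest LLR magnitudes, smallest first."""
    order = sorted(range(len(top_llrs)), key=lambda i: abs(top_llrs[i]))
    return order[:4]


def generate_candidates(node_type, top_llrs, gamma=0):
    """Return pre-sorted (indices, M') candidates for one special node.

    Rate-1 nodes use a search span of 2 (two single-index candidates);
    SPC nodes use a search span of 4 (six index pairs).
    """
    mags = [abs(llr) for llr in top_llrs]
    low = find_min4(top_llrs)
    if node_type == "rate1":
        cands = [((i,), mags[i]) for i in low[:2]]
    elif node_type == "spc":
        offset = gamma * mags[low[0]]          # low[0] plays the role of i_min
        cands = [((i1, i2), (mags[i1] - offset) + (mags[i2] - offset))
                 for i1, i2 in combinations(low, 2)]
    else:
        raise ValueError("unsupported node type: " + node_type)
    return sorted(cands, key=lambda c: c[1])   # presort() step


llrs = [2.0, -0.5, 1.0, 3.0, 0.2, -4.0]
rate1_cands = generate_candidates("rate1", llrs)
spc_cands = generate_candidates("spc", llrs, gamma=0)
```

The join() step, which adds the accumulated M′′ component and the current flipping set to every candidate, is left out to keep the sketch short.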
C. Sorter Architecture
Since there is always at least one branch operation in
between any two leaf node operations due to the sequential tree
traversal [41], the newly created sorting items can be processed
in two clock cycles. Since up to six new sorting elements are
created per special node, the sorter has to accept up to three
elements per clock cycle. This is achieved by inserting a
dedicated shift register before the sorter architecture that
takes up to 6 elements and sends 3 elements to the sorter at a time.
Let us denote a sorting element by λ, which contains the
information of {ω,Mα(Eω), Eω}. The sorter architecture keeps
an array of λ items, λ, with their metrics in increasing order.
Let us further denote the newly generated sorting elements by
λ∗ = {λ∗0, λ∗1, λ∗2}. Note that the items in λ∗ are also sorted
in increasing order due to the presort() function in Fig. 11.
Since the length of λ is far greater than the size of λ∗, together
they form a nearly sorted array, that needs to be fully sorted.
The insertion sort method is an algorithm that is in favor of
nearly sorted lists [43], therefore we create an insertion sorter
that is able to insert up to three new elements in a single clock
cycle. Fig. 12 depicts the proposed insertion sorter architecture
for Fast-DSCF decoding: The elements can shift back up to
three places based on the places of the newly inserted values.
The elements are also capable of shifting forward by one place,
which is performed at the beginning of an additional decoding
iteration. That way, the current bit-flipping information is
always stored at λ0. Based on the normalization procedure
described in Section VI-A, the metrics at each element of the
sorter are normalized by the metric value of the current λ
(denoted as λ0{M}) every time the sorter shifts its elements
forward.
[Figure 12: chain of sorter elements λ0, λ1, λ2, λ3, ..., λl−1 with the new elements λ∗ as input; the metric λ0{M} is subtracted from the stored metrics.]
Fig. 12. Insertion sorter architecture.
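A software model of the sorter behavior can help when reasoning about it. The sketch below is only one plausible reading of the description: the element layout `(metric, payload)`, the function names, and the choice to subtract the metric of the new λ0 from every stored element after a forward shift are assumptions.

```python
import bisect


def insert_candidates(sorter, new_items, length):
    """Insert up to three pre-sorted (metric, payload) items into a list
    kept in increasing metric order, then truncate it to `length`."""
    if len(new_items) > 3:
        raise ValueError("at most three new elements per call")
    for item in new_items:
        metrics = [m for m, _ in sorter]
        sorter.insert(bisect.bisect_right(metrics, item[0]), item)
    del sorter[length:]          # elements shifted past the end are dropped
    return sorter


def shift_forward(sorter):
    """Start of an extra attempt: the old lambda_0 leaves, the next element
    becomes lambda_0, and all metrics are normalized by its metric."""
    if sorter:
        sorter.pop(0)
    if sorter:
        base = sorter[0][0]
        sorter[:] = [(m - base, payload) for m, payload in sorter]
    return sorter


sorter = [(1.0, "a"), (2.0, "b"), (4.0, "c"), (5.0, "d")]
insert_candidates(sorter, [(0.5, "x"), (3.0, "y"), (9.0, "z")], length=5)
```

In hardware the three insertions happen in one clock cycle; the sequential loop here only reproduces the resulting order.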
With the increasing error order, the size of the sorter
architecture increases in two dimensions: the cardinality of
Eω within the sorting element (λ) increases and therefore
more bits are required to describe a flipping event, and the
length of the sorter increases linearly with Tmax. Hence, the
complexity of the sorter increases dramatically with ω, requir-
ing a substantial portion of the overall decoding architecture.
To illustrate, for the Fast-DSCF decoder with ω = 1 and
Tmax = 10, assuming 5 bits for metric quantization, 2 bits
for order information, 9 bits for the special node and 6 bits
for each stored index, 300 bits are required for the sorter.
On the other hand, assuming the same quantization numbers,
5,100 and 28,800 bits are required for the sorter for ω = 2
with Tmax = 100 and ω = 3 with Tmax = 400, respectively.
In other words, the sorter for ω = 3 would be
about 5 times the size of the LLR memory, and about 60 times
the size of the partial sum memory. Therefore, it is essential
to reduce the sorter complexity for higher error orders as much
as possible.
For SCF architectures that feature ω = 1 only and not
higher order error-correction, the sorter length must match
the Tmax since the bit-flipping indices are calculated only
once during the initial decoding attempt. On the other hand,
for architectures that feature higher-order error-correction, the
last elements at the sorter are most likely to be shifted
out when the sorter gets updated with the higher-order
bit-flipping information. Motivated by this observation, we propose a
sorter length l ≤ Tmax. When l = Tmax, the error-correction
performance of the Fast-DSCF algorithm is preserved. When
l < Tmax, the original error-correction performance is not
guaranteed; however, the sorter length can be reduced
at the expense of an ideally negligible loss in
error-correction performance. Accordingly, empirical studies
with ω = 2 and ω = 3 have shown that setting the sorter
length to 50% of the Tmax value has minimal impact on
error-correction performance, while it greatly helps with reducing
the computational complexity of the decoder.
VII. RESULTS
The following results for the proposed Fast-DSCF decoder
use all the simplifications, optimizations and approximations
discussed throughout this work. The constant approximation
from (15), and all the special node decoding techniques from
(18)-(23) are used. Furthermore, following the discussion in
Section V, the search spans for Rate-1 and SPC nodes are
reduced to 2 and 4, respectively. The maximum node size for
SPC nodes is set to {64, 8, 4} for ω ∈ {1, 2, 3}, respectively.
The metric normalization and quantization schemes following
Section VI-A are employed. For both DSCF and Fast-DSCF
decoders, Tmax values are set to 10, 100 and 400 for
ω ∈ {1, 2, 3}, respectively. The sorter lengths of the Fast-DSCF
decoder are set to 10, 50 and 200 for ω ∈ {1, 2, 3},
respectively. A FER value of 10⁻⁵ is targeted for comparison.
[Figure 13: FER versus SNR (dB); curves: Fast-SSCL-SPC [10] with L = 2, 4, 8, 16; DSCF [9] with ω = 1, 2, 3; this work with ω = 1, 2, 3.]
Fig. 13. FER comparison of the proposed Fast-DSCF decoding against
baseline DSCF and Fast-SSCL-SPC [10] decoders, using PC(1024, 512) and
C = 16. The Tmax are set to 10, 100, 400 for ω ∈ {1, 2, 3} for both DSCF
and Fast-DSCF decoders. The Fast-SSCL-SPC and the Fast-DSCF decoders
are quantized, whereas DSCF decoder is simulated using floating-point.
A. Error-Correction Performance
Fig. 13 presents the error-correction performance of the
proposed Fast-DSCF decoding, against baseline DSCF de-
coding from [9] and CRC-aided Fast-SSCL-SPC (Fast-SSCL)
decoding with L ∈ {2, 4, 8, 16} [10], using PC(1024, 512)
and C = 16 from [3]. Note that the Fast-DSCF is simulated
using the quantization schemes presented in Section VI-A,
whereas the DSCF decoder is simulated using floating point
numbers, and the Fast-SSCL decoder is quantized with the
scheme presented in [10].
According to Fig. 13, the Fast-DSCF decoder equipped
with all the simplifications exhibits similar performance to the
baseline DSCF decoder at all error orders and SNRs, even
though the Fast-DSCF decoder is quantized. At ω = 2
and high SNR, Fast-DSCF shows a slight performance loss
compared to DSCF; this is mostly due to the reduced
search span in SPC nodes. This means that a better SPC
node approximation could be required when targeting FER
values below 10⁻⁶.
At the target FER of 10−5, the proposed Fast-DSCF at
ω = 1 achieves a 0.18 dB gain over Fast-SSCL-SPC with L = 2.
At ω = 2, Fast-DSCF is 0.21 dB better than Fast-SSCL-SPC
with L = 4 and is only 0.08 dB away from the Fast-SSCL-SPC
performance with L = 8. At ω = 3, the proposed Fast-DSCF
performs slightly better than Fast-SSCL-SPC with L = 16,
by 0.03 dB. Based on these performance results, in the
following comparisons Fast-DSCF is compared against the
Fast-SSCL-SPC counterparts
[Figure 14 plot: average number of clock cycles versus SNR (dB). Legend: Fast-SSCL-SPC [10] (L = 2, 8, 16); Fast-SSCL-SPC [44] (L = 2, 8, 16); This Work (ω = 1, 2, 3).]
Fig. 14. Comparison of the average number of clock cycles for the
proposed Fast-DSCF decoding and the Fast-SSCL-SPC decoders from [10]
and [44], using PC(1024, 512) and C = 16.
that yield the closest error-correction performance at the
target FER = 10−5. That is, Fast-DSCF with ω = 1, 2 and 3 is
compared against Fast-SSCL-SPC (and other state-of-the-art
SCL-based decoders) with L = 2, 8 and 16, respectively.
B. Average Computational Complexity
Fig. 14 compares the average computational complexity
of the Fast-DSCF decoder against the state-of-the-art
Fast-SSCL-SPC decoders from [10] and [44] in the
medium-to-high SNR regime. The computational complexity is
measured as the average number of clock cycles, obtained for
each decoder as the product of the average number of decoding
iterations and the number of clock cycles per iteration. Note
that the cycle counts are obtained using the same polar code,
PC(1024, 512), and the same number of parallel processing
elements (Pe = 64) for all considered decoders. While the
latency of the Fast-SSCL-SPC decoder from [10] is fixed for
each list size, the decoder from [44] uses early termination
logic, so its latency depends on the SNR. The Fast-DSCF
decoder has a large computational complexity at low SNR, but
its complexity saturates quickly around our FER performance
of interest.
In the medium-to-high SNR regime, Fast-DSCF with
ω = 1 requires up to 16% more clock cycles than its
counterparts with L = 2. On the other hand, for ω = 2 and
ω = 3, Fast-DSCF requires 15.6% and 21.7% fewer cycles
on average than the Fast-SSCL-SPC decoders with L = 8
and L = 16, respectively. Therefore, the proposed Fast-DSCF
decoding for higher-order error correction requires fewer
operations on average than Fast-SSCL-SPC decoding, which
makes it favorable for applications that require improved
FER performance with lower average computational complexity.
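As a minimal sketch of how this complexity measure is formed, the snippet below multiplies an average iteration count by a per-iteration cycle count and reports the relative difference between two decoders. The function names and all numeric inputs are hypothetical and chosen only for illustration; they are not values from the paper.

```python
def average_cycles(avg_iterations, cycles_per_iteration):
    """Average clock cycles per frame: iterations times cycles per iteration."""
    return avg_iterations * cycles_per_iteration


def relative_change(candidate, reference):
    """Signed relative difference of candidate with respect to reference."""
    return (candidate - reference) / reference


# Hypothetical inputs for illustration only (not taken from the paper).
flip_decoder = average_cycles(avg_iterations=1.05, cycles_per_iteration=400)
list_decoder = average_cycles(avg_iterations=1.0, cycles_per_iteration=500)
change = relative_change(flip_decoder, list_decoder)
print(f"{flip_decoder:.0f} vs {list_decoder:.0f} cycles ({change:+.1%})")
```

A flip decoder can spend extra iterations on some frames and still come out ahead on average, provided each iteration is cheap enough and additional iterations are rare at the SNR of interest.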
C. ASIC Synthesis Results
The proposed Fast-DSCF decoder has been implemented
in VHDL, validated with test benches and synthesized in the
TSMC 65nm CMOS technology node using the Cadence Genus
RTL compiler. To ensure accurate power measurements,
switching activities from real test frames are extracted for the
three architectures. The non-frozen bits of the test frames are
generated from a Bernoulli distribution with equal probability.
TABLE I
TSMC 65NM CMOS IMPLEMENTATION RESULTS FOR PROPOSED FAST-DSCF ARCHITECTURES FEATURING ω ∈ {1, 2, 3} AGAINST EQUIVALENT STATE-OF-THE-ART SCF AND SCL DECODERS WITH PC(1024, 512).

Base Algorithm              SCF        SCF        SCF        SCF        SCF        SCF        SCL        SCL        SCL        SCL        SCL
Reference                   This Work  This Work  This Work  [11]       [19]       [45](a)    [10]       [10]       [46](a)    [46](a)    [47](a)
Methodology                 Synthesis  Synthesis  Synthesis  Synthesis  Synthesis  Silicon    Synthesis  Synthesis  Synthesis  Synthesis  Synthesis
Technology (nm)             65         65         65         65         65         28         65         65         90         90         90
Supply (V)                  1.0        1.0        1.0        1.0        1.0        0.9        -          -          -          -          -
Pe                          64         64         64         64         64         -          64         64         -          -          64
ω (SCF) / L (SCL)           1          2          3          1          1          1          2          8          8          16         16
Tmax                        10         100        400        20         10         20         N/A        N/A        N/A        N/A        N/A
Qchn,Qint,Qm                5,6,5      6,7,7      6,7,7      5,6,5      5,6,5      6,6,8      6,6,8      6,6,8      6,6,8      6,6,8      6,6,8
Frequency (MHz)             455        425        410        455        480        145        885        722        770        676        911
Latency (µs)                0.82(b)    0.97(b)    1.11(b)    0.68       0.64       12.68      0.55       0.85       0.67       0.76       1.60
W.C. Latency (µs)           9.02       49.7       447        14.23      7.04       266.18     0.55       0.85       0.67       0.76       1.60
Average Coded T/P (Mbps)    1248(b)    1060(b)    935(b)     1511       1595       80.8       1861       1198       1527       1340       637
Area (mm2)                  0.55       0.69       1.00       0.56       0.49       -          1.05       3.98       2.37       4.21       3.90
Power Consumption (mW)      118.01     136.35     201.61     83.44      -          51.30      -          -          -          -          -
Power Density (mW/mm2)      214.56     197.61     201.61     149.53     -          -          -          -          -          -          -
Area Efficiency (Gbps/mm2)  2.27       1.54       0.94       2.71       3.25       -          1.78       0.30       0.64       0.32       0.16
Energy (pJ/info.bit)        189.14     259.79     439.86     110.43     -          1270       -          -          -          -          -

(a) Normalized for 65nm CMOS technology and 1V supply, based on the scaling techniques from [45].
(b) Average value at target FER = 10−5.
Table I presents the ASIC synthesis results of the
Fast-DSCF decoder, implemented separately for each considered
error order, and compares them against other available
SCF-based decoders and the best available SCL-based decoders.
The results from [45], [46] and [47] are scaled to 65nm CMOS
technology using the scaling techniques presented in [45].
The latency and the throughput for this work are average
values at the target FER = 10−5; the worst-case latency values
are also presented for a fair comparison. The quantization
values are given for the channel LLRs (Qchn), the internal
LLRs (Qint) and the metric (Qm).
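The derived rows of Table I can be cross-checked from the primary ones. The sketch below assumes that the coded throughput is N divided by the average latency, that area efficiency and power density divide by the area, and that energy per information bit is power times latency divided by K. These formulas are an assumption inferred from the units, not a procedure stated in the text, and the function name derived_metrics is hypothetical. Under that assumption the Fast-DSCF columns are reproduced to within about 2%, with the residual presumably due to rounding of the tabulated latency.

```python
N, K = 1024, 512  # PC(1024, 512): code length and number of information bits

# Fast-DSCF columns of Table I: omega -> (avg. latency [us], area [mm^2], power [mW])
FAST_DSCF = {
    1: (0.82, 0.55, 118.01),
    2: (0.97, 0.69, 136.35),
    3: (1.11, 1.00, 201.61),
}


def derived_metrics(latency_us, area_mm2, power_mw):
    """Recompute the derived Table I rows (assumed formulas, see lead-in)."""
    throughput_mbps = N / latency_us  # bits per microsecond equals Mbps
    return {
        "throughput_mbps": throughput_mbps,
        "area_eff_gbps_mm2": throughput_mbps / 1000.0 / area_mm2,
        "power_density_mw_mm2": power_mw / area_mm2,
        # mW * us = nJ, so multiply by 1000 to obtain pJ
        "energy_pj_per_info_bit": power_mw * latency_us * 1000.0 / K,
    }


for omega, row in FAST_DSCF.items():
    metrics = derived_metrics(*row)
    print(omega, {name: round(value, 2) for name, value in metrics.items()})
```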
The latency and the average throughput of the Fast-DSCF
decoder differ for each implementation, because the
different quantization schemes lead to different operating
frequencies and the different SPC node sizes lead to different
numbers of clock cycles per decoding iteration. The power
consumption and the area increase with ω, mostly due to the
increased complexity of the sorter. Specifically, the insertion
sorter accounts for 2.2% of the overall power consumption at
ω = 1, and its share increases to 16.1% and 46.5% for ω = 2
and ω = 3, respectively.
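To see what these shares imply in absolute terms, the following sketch multiplies the reported sorter shares by the total power values of Table I. It is only arithmetic on the published numbers; the helper name split_power is hypothetical.

```python
# omega -> (total power in mW from Table I, sorter share of total power from the text)
POWER = {1: (118.01, 0.022), 2: (136.35, 0.161), 3: (201.61, 0.465)}


def split_power(total_mw, sorter_share):
    """Split total power into the sorter part and everything else."""
    sorter = total_mw * sorter_share
    return sorter, total_mw - sorter


for omega, (total, share) in POWER.items():
    sorter, rest = split_power(total, share)
    print(f"omega={omega}: sorter {sorter:.1f} mW, rest {rest:.1f} mW")
```

Under this split, the non-sorter power stays roughly between 108 and 115 mW across the three error orders, which is consistent with the sorter being the dominant source of the increase.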
The only SCF-based decoders in the literature that
report ASIC results target ω = 1 [11], [19], [45]. According
to Table I, the Fast-SCF [11] and Fast-TSCF [19] decoders
provide a higher throughput than the proposed Fast-DSCF
decoder at ω = 1. The main reason is that these
architectures use merged special node techniques (e.g.
Rep-SPC), which would require complex metric updates and
bit-flips in our framework and are therefore not implemented
here. Moreover, the Fast-TSCF decoder does not use a sorter,
which allows for a slightly higher throughput. The worst-case
latency of the Fast-DSCF decoder is also higher than that of
Fast-TSCF, mainly due to the merged special node techniques
embedded in the Fast-TSCF. However, the worst-case latency of
Fast-SCF is higher than that of the proposed scheme because
it requires a larger Tmax value to reach a performance similar
to that of the Fast-DSCF decoder. On the other hand, the fast
decoding techniques used in the Fast-DSCF (ω = 1) decoder
yield 15.4× more throughput than the regular SCF
implementation from [45]. Finally, compared to the
Fast-SSCL-SPC decoder with L = 2 [10], the proposed decoder
has 32% less throughput but uses 1.9× less area, leading to
27.5% better area efficiency.
The Fast-DSCF decoder with ω = 2 demonstrates 11.5% to
30.5% less throughput than its SCL-based counterparts with
L = 8 [10], [46]. On the other hand, the SCL-based decoders
require 3.4× to 5.8× more area than Fast-DSCF. Therefore,
the Fast-DSCF decoder with ω = 2 is 2.4× to 5.1× more
area-efficient than the reported fast SCL decoders. The
Fast-DSCF decoder with ω = 3 achieves 46.8% more throughput
than the SCL-based decoder in [47], but 30% less than that
of [46] with L = 16. Similar to the ω = 2 case, the Fast-DSCF
decoder has significantly less area, and is therefore up to
5.8× more area-efficient. In return, due to the iterative
nature of the SCF algorithm, the proposed decoder has a
larger worst-case latency than its SCL-based counterparts.
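The ratios quoted in the last two paragraphs follow directly from the throughput and area rows of Table I. The sketch below recomputes them; the dictionary layout and the function name compare are hypothetical, and area efficiency is recomputed from throughput and area rather than read from the rounded table row, so the last digit can differ slightly from the quoted values.

```python
# Table I values: (average coded throughput in Mbps, area in mm^2)
FAST_DSCF = {1: (1248, 0.55), 2: (1060, 0.69), 3: (935, 1.00)}
SCL = {
    ("[10]", 2): (1861, 1.05),
    ("[10]", 8): (1198, 3.98),
    ("[46]", 8): (1527, 2.37),
    ("[46]", 16): (1340, 4.21),
    ("[47]", 16): (637, 3.90),
}
COUNTERPART = {1: 2, 2: 8, 3: 16}  # omega -> list size with the closest FER


def compare(omega):
    """Compare Fast-DSCF at error order omega with its SCL counterparts."""
    throughput, area = FAST_DSCF[omega]
    rows = {}
    for (ref, list_size), (scl_throughput, scl_area) in SCL.items():
        if list_size != COUNTERPART[omega]:
            continue
        rows[ref] = {
            "throughput_change": throughput / scl_throughput - 1.0,
            "area_ratio": scl_area / area,
            "area_eff_ratio": (throughput / area) / (scl_throughput / scl_area),
        }
    return rows


for omega in (1, 2, 3):
    for ref, row in compare(omega).items():
        print(omega, ref, {name: round(value, 3) for name, value in row.items()})
```

For ω = 2, for instance, this gives throughput changes of about −11.5% and −30.6% and area ratios of about 5.8× and 3.4× against [10] and [46], respectively.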
As a result of the increased latency and power at higher error
orders, improving the FER costs more energy per decoded
information bit: an increase of 37% for ω = 2 and another 69%
for ω = 3 is observed. Since the operating voltage and power
consumption are not reported for the architectures described
in [10], [19], [46], [47], we are not able to directly compare
our results on power density and energy efficiency against the
state-of-the-art. The only available results on energy
consumption for SCL-based decoders were reported in [7] for
polar codes of shorter lengths (i.e. N = 256 and N = 512),
and it was
shown that the increase in energy consumption is supralinear
in the list size. For example, for the Fast-SSCL decoder with
PC(512, 256), the energy consumption increases by 6.68×
between L = 2 and L = 8. Note that the energy consumption of
Fast-SSCL decoding is the lowest among all the SCL-based
algorithms considered in [7]. Assuming that a similar trend
holds for the polar codes used in this work, the proposed
Fast-DSCF decoder is expected to be much more energy-efficient
than its fast SCL-based counterparts, given that the increase
in energy consumption is only 37% between ω = 1 and ω = 2.
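The energy figures behind this argument can be checked with a few lines of arithmetic on the Energy row of Table I. The snippet is only a sketch: the variable names are hypothetical, and the 6.68× figure is the value quoted from [7] for PC(512, 256), so the final comparison crosses code lengths exactly as the text does.

```python
ENERGY_PJ = {1: 189.14, 2: 259.79, 3: 439.86}  # Table I, pJ per information bit


def step_increase(low_order, high_order):
    """Relative energy increase when moving from one error order to another."""
    return ENERGY_PJ[high_order] / ENERGY_PJ[low_order] - 1.0


inc_12 = step_increase(1, 2)
inc_23 = step_increase(2, 3)
SCL_L2_TO_L8 = 6.68  # energy ratio quoted from [7] for Fast-SSCL, PC(512, 256)

print(f"omega 1 -> 2: +{inc_12:.0%}, omega 2 -> 3: +{inc_23:.0%}")
print(f"Fast-DSCF ratio {1.0 + inc_12:.2f}x vs Fast-SSCL ratio {SCL_L2_TO_L8:.2f}x")
```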
The proposed Fast-DSCF implementations at ω > 1 have a
greater worst-case latency than their SCL-based counterparts.
The worst-case latency increases with Tmax, which
is a major drawback for communications that prioritize low
latency, including FPGA-based applications. On the other hand,
the proposed implementations demonstrate improved area and
energy efficiency as a trade-off, and have average latency
and throughput similar to the state-of-the-art. Therefore, the
proposed Fast-DSCF implementations are well suited to use cases
where area and energy efficiency are prioritized over
worst-case latency, such as massive machine-type communications
[1].
VIII. CONCLUSION
In this work, we showed how to make Dynamic SC-Flip
decoding practical. We replaced the transcendental
computations of the DSCF algorithm with a constant, and showed
how to implement fast decoding techniques. Then, we further
reduced the computational effort by employing a theoretical
framework, which is later exploited in a hardware
implementation. We showed further simplifications on metric
updates and sorter complexity in hardware. The proposed
approximations and simplifications do not alter the
error-correction performance significantly but make DSCF
decoding practically feasible. The proposed Fast-DSCF decoders
synthesized using TSMC 65nm CMOS technology demonstrate 1.25,
1.06 and 0.93 Gbps throughput for ω ∈ {1, 2, 3}, respectively.
Compared to the state-of-the-art fast CA-SCL decoders with
equivalent FER performance, the proposed decoders are up to
5.8× more area-efficient, a figure reached for ω = 3 against
L = 16. Finally, observations on energy dissipation indicate
that the Fast-DSCF is more energy-efficient than its
CA-SCL-based counterparts.
ACKNOWLEDGMENTS
The authors would like to acknowledge Prof. Dr. Jan Bajcsy
from McGill University for his valuable insights on the
derivation of the theoretical error probability on parity check
codes.
REFERENCES
[1] C. Bockelmann, N. Pratas, H. Nikopour, K. Au, T. Svensson, C. Ste-
fanovic, P. Popovski, and A. Dekorsy, “Massive machine-type commu-
nications in 5G: physical and MAC-layer solutions,” IEEE Communica-
tions Magazine, vol. 54, no. 9, pp. 59–65, Sep. 2016.
[2] E. Arıkan, “Channel polarization: A method for constructing capacity-
achieving codes for symmetric binary-input memoryless channels,” IEEE
Trans. Inf. Theory, vol. 55, no. 7, pp. 3051–3073, Jul. 2009.
[3] 3GPP, “NR; Multiplexing and Channel Coding (Release 15),”
http://www.3gpp.org/DynaReport/38-series.htm, Tech. Rep. TS 38.212
V15.2.0 (2018-06), Jan. 2018.
[4] A. Sharma and M. Salim, “Polar code: The channel code contender
for 5G scenarios,” in 2017 International Conference on Computer,
Communications and Electronics (Comptelix), Jul. 2017, pp. 676–682.
[5] M. Sybis, K. Wesolowski, K. Jayasinghe, V. Venkatasubramanian, and
V. Vukadinovic, “Channel coding for ultra-reliable low-latency com-
munication in 5G systems,” in 2016 IEEE 84th Vehicular Technology
Conference (VTC-Fall), Sep. 2016, pp. 1–5.
[6] I. Tal and A. Vardy, “List decoding of polar codes,” IEEE Trans. Inf.
Theory, vol. 61, no. 5, pp. 2213–2226, May 2015.
[7] F. Ercan, C. Condo, S. A. Hashemi, and W. J. Gross, “On error-
correction performance and implementation of polar code list decoders
for 5G,” in 2017 55th Annual Allerton Conference on Communication,
Control, and Computing (Allerton), Oct. 2017, pp. 443–449.
[8] O. Afisiadis, A. Balatsoukas-Stimming, and A. Burg, “A low-complexity
improved successive cancellation decoder for polar codes,” in Asilomar
Conference on Signals, Systems and Computers, Nov. 2014, pp. 2116–
2120.
[9] L. Chandesris, V. Savin, and D. Declercq, “Dynamic-SCFlip decoding
of polar codes,” IEEE Trans. Commun., vol. 66, no. 6, pp. 2333–2345,
Jun. 2018.
[10] S. A. Hashemi, C. Condo, and W. J. Gross, “Fast and flexible successive-
cancellation list decoders for polar codes,” IEEE Trans. Signal Process.,
vol. 65, no. 21, pp. 5756–5769, Nov. 2017.
[11] F. Ercan, T. Tonnellier, and W. J. Gross, “Energy-efficient hardware
architectures for fast polar decoders,” IEEE Trans. Circuits Syst. I, Reg.
Papers, vol. 67, no. 1, pp. 322–335, 2020.
[12] F. Ercan, T. Tonnellier, N. Doan, and W. J. Gross, “Simplified dynamic
SC-flip polar decoding,” in 2020 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 1733–
1737.
[13] A. Alamdar-Yazdi and F. R. Kschischang, “A simplified successive-
cancellation decoder for polar codes,” IEEE Commun. Lett., vol. 15,
no. 12, pp. 1378–1380, Dec. 2011.
[14] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. J. Gross, “Fast polar
decoders: Algorithm and implementation,” IEEE J. Sel. Areas Commun.,
vol. 32, no. 5, pp. 946–957, May 2014.
[15] C. Condo, F. Ercan, and W. J. Gross, “Improved successive cancellation
flip decoding of polar codes based on error distribution,” in 2018
IEEE Wireless Communications and Networking Conference Workshops
(WCNCW), Apr. 2018, pp. 19–24.
[16] F. Ercan, C. Condo, and W. J. Gross, “Improved bit-flipping algorithm
for successive cancellation decoding of polar codes,” IEEE Trans.
Commun., vol. 67, no. 1, pp. 61–72, Jan. 2019.
[17] H. Kim, H. Lee, and H. Park, “Interleaver-aided successive cancellation
flip decoding algorithm for polar codes,” in 2018 IEEE 4th International
Conference on Computer and Communications (ICCC), Dec. 2018, pp.
2559–2563.
[18] Y. Lv, H. Yin, J. Li, W. Xu, and Y. Fang, “Modified successive
cancellation flip decoder for polar codes based on Gaussian approximation,” in
2019 28th Wireless and Optical Communications Conference (WOCC),
May 2019, pp. 1–5.
[19] F. Ercan and W. J. Gross, “Fast thresholded SC-flip decoding of
polar codes,” in ICC 2020 - 2020 IEEE International Conference on
Communications (ICC), 2020, pp. 1–7.
[20] F. Ercan, C. Condo, S. A. Hashemi, and W. J. Gross, “Partitioned
successive-cancellation flip decoding of polar codes,” in 2018 IEEE
International Conference on Communications (ICC), May 2018, pp. 1–
6.
[21] S. Li, Y. Deng, X. Gao, H. Li, L. Guo, and Z. Dong, “Generalized
segmented bit-flipping scheme for successive cancellation decoding of
polar codes with cyclic redundancy check,” IEEE Access, vol. 7, pp.
83 424–83 436, 2019.
[22] Y. Fang, J. Li, and Y. Lv, “Improved segmented SC-flip decoding of
polar codes based on Gaussian approximation,” in 2019 4th International
Conference on Smart and Sustainable Technologies (SpliTech), Jun.
2019, pp. 1–5.
[23] Y. Yongrun, P. Zhiwen, L. Nan, and Y. Xiaohu, “Successive cancellation
list bit-flip decoder for polar codes,” in 2018 10th International Confer-
ence on Wireless Communications and Signal Processing (WCSP), Oct.
2018, pp. 1–6.
[24] F. Cheng, A. Liu, Y. Zhang, and J. Ren, “Bit-flip algorithm for successive
cancellation list decoder of polar codes,” IEEE Access, vol. 7, pp.
58 346–58 352, 2019.
[25] A. Cao, L. Zhang, J. Qiao, and Y. He, “An LLR-based segmented flipped
SCL decoding algorithm for polar codes,” in 2019 IEEE/CIC International
Conference on Communications in China (ICCC), Aug. 2019, pp. 724–
729.
[26] L. Chandesris, V. Savin, and D. Declercq, “An improved SCFlip de-
coder for polar codes,” in IEEE Global Communications Conference
(GLOBECOM), Dec. 2016, pp. 1–6.
[27] Z. Zhang, K. Qin, L. Zhang, H. Zhang, and G. T. Chen, “Progressive
bit-flipping decoding of polar codes over layered critical sets,” in IEEE
Global Communications Conference (GLOBECOM), Dec. 2017, pp. 1–
6.
[28] Z. Zhang, K. Qin, L. Zhang, and G. T. Chen, “Progressive bit-flipping
decoding of polar codes: A critical-set based tree search approach,” IEEE
Access, vol. 6, pp. 57 738–57 750, 2018.
[29] J. Cui, Z. Zhang, X. Zhang, H. Li, and Q. Zeng, “A low-complexity
improved progressive bit-flipping decoding for polar codes,” in 2018
IEEE 4th International Conference on Computer and Communications
(ICCC), Dec. 2018, pp. 39–44.
[30] Y. Wang, L. Chen, Q. Wang, Y. Zhang, and Z. Xing, “Algorithm and
architecture for path metric aided bit-flipping decoding of polar codes,”
in 2019 IEEE Wireless Communications and Networking Conference
(WCNC), Apr. 2019, pp. 1–6.
[31] C. Condo, V. Bioglio, and I. Land, “SC-flip decoding of polar codes
with high order error correction based on error dependency,” in 2019
IEEE Information Theory Workshop (ITW), Aug. 2019, pp. 1–5.
[32] N. Doan, S. A. Hashemi, F. Ercan, T. Tonnellier, and W. J. Gross,
“Neural dynamic successive cancellation flip decoding of polar codes,”
in 2019 IEEE International Workshop on Signal Processing Systems
(SiPS), Oct. 2019, pp. 272–277.
[33] W. J. Gross and P. G. Gulak, “Simplified MAP algorithm suitable for
implementation of turbo decoders,” Electronics Letters, vol. 34, no. 16,
pp. 1577–1578, Aug. 1998.
[34] J.-F. Cheng and T. Ottosson, “Linearly approximated log-MAP
algorithms for turbo decoding,” in VTC2000-Spring. 2000 IEEE 51st
Vehicular Technology Conference Proceedings (Cat. No.00CH37026),
vol. 3, 2000, pp. 2252–2256.
[35] Y. Zhou, J. Lin, and Z. Wang, “Improved fast-SSC-flip decoding of polar
codes,” IEEE Commun. Lett., vol. 23, no. 6, pp. 950–953, Jun. 2019.
[36] P. Giard and A. Burg, “Fast-SSC-flip decoding of polar codes,” in 2018
IEEE Wireless Communications and Networking Conference Workshops
(WCNCW), Apr. 2018, pp. 73–77.
[37] P. Trifonov, “Efficient design and decoding of polar codes,” IEEE Trans.
Commun., vol. 60, no. 11, pp. 3221–3227, Nov. 2012.
[38] J. Proakis, Digital Communications. McGraw Hill, 1995.
[39] D. Wu, Y. Li, and Y. Sun, “Construction and block error rate analysis
of polar codes over AWGN channel based on Gaussian approximation,”
IEEE Commun. Lett., vol. 18, no. 7, pp. 1099–1102, 2014.
[40] C. Leroux, A. J. Raymond, G. Sarkis, and W. J. Gross, “A semi-parallel
successive-cancellation decoder for polar codes,” IEEE Trans. Signal
Process., vol. 61, no. 2, pp. 289–299, Jan. 2013.
[41] F. Ercan, T. Tonnellier, C. Condo, and W. J. Gross, “Operation merging
for hardware implementations of fast polar decoders,” Journal of Signal
Processing Systems, Nov. 2018.
[42] E. Stavinov, “Parallel CRC Generator,” 2014, retrieved: 2020-02-07.
[Online]. Available: https://opencores.org/projects/parallelcrcgen
[43] C. R. Cook and D. J. Kim, “Best sorting algorithm for nearly sorted
lists,” Commun. ACM, vol. 23, no. 11, pp. 620–624, Nov. 1980.
[44] D. Kim and I. Park, “A fast successive cancellation list decoder for polar
codes with an early stopping criterion,” IEEE Trans. Signal Process.,
vol. 66, no. 18, pp. 4971–4979, Sep. 2018.
[45] P. Giard, A. Balatsoukas-Stimming, T. C. Müller, A. Bonetti,
C. Thibeault, W. J. Gross, P. Flatresse, and A. Burg, “Polarbear: A
28-nm FD-SOI ASIC for decoding of polar codes,” IEEE Trans. Emerg.
Sel. Topics Circuits Syst., vol. 7, no. 4, pp. 616–629, Dec. 2017.
[46] C. Xia, J. Chen, Y. Fan, C. Tsui, J. Jin, H. Shen, and B. Li, “A
high-throughput architecture of list successive cancellation polar codes
decoder with large list size,” IEEE Trans. Signal Process., vol. 66, no. 14,
pp. 3859–3874, Jul. 2018.
[47] Y. Fan, C. Xia, J. Chen, C. Tsui, J. Jin, H. Shen, and B. Li, “A low-
latency list successive-cancellation decoding implementation for polar
codes,” IEEE J. Sel. Areas Commun., vol. 34, no. 2, pp. 303–317, Feb.
2016.
