A High-Throughput Energy-Efficient Implementation of
  Successive-Cancellation Decoder for Polar Codes Using Combinational Logic by Dizdar, Onur & Arıkan, Erdal
arXiv:1412.3829v5 [cs.AR] 11 Jan 2016
A High-Throughput Energy-Efficient
Implementation of Successive Cancellation Decoder
for Polar Codes Using Combinational Logic
Onur Dizdar, Student Member, IEEE, and Erdal Arıkan, Fellow, IEEE
Abstract—This paper proposes a high-throughput energy-
efficient Successive Cancellation (SC) decoder architecture for
polar codes based on combinational logic. The proposed combi-
national architecture operates at relatively low clock frequencies
compared to sequential circuits, but takes advantage of the high
degree of parallelism inherent in such architectures to provide
a favorable tradeoff between throughput and energy efficiency
at short to medium block lengths. At longer block lengths, the
paper proposes a hybrid-logic SC decoder that combines the
advantageous aspects of the combinational decoder with the
low-complexity nature of sequential-logic decoders. Performance
characteristics on ASIC and FPGA are presented with a detailed
power consumption analysis for combinational decoders. Finally,
the paper presents an analysis of the complexity and delay of
combinational decoders, and of the throughput gains obtained
by hybrid-logic decoders with respect to purely synchronous
architectures.
Index Terms—Polar codes, successive cancellation decoder,
error correcting codes, VLSI, energy efficiency.
I. INTRODUCTION
POLAR codes were proposed in [1] as a low-complexity
channel coding method that can provably achieve Shannon’s
channel capacity for any binary-input symmetric discrete
memoryless channel. Apart from the intense theoretical
interest in the subject, polar codes have attracted attention for
their potential applications. There have been several proposals
on hardware implementations of polar codes, which mainly
focus on maximizing throughput or minimizing hardware
complexity. In this work, we propose an architecture for SC
decoding using combinational logic in an effort to obtain a
high throughput decoder with low power consumption. We
begin with a survey of the relevant literature.
The basic decoding algorithm for polar codes is the SC de-
coding algorithm, which is a non-iterative sequential algorithm
with complexity O(N logN) for a code of length N . Many
of the SC decoding steps can be carried out in parallel and
the latency of the SC decoder can be reduced to roughly 2N
in a fully-parallel implementation, as pointed out in [1] and
[2]. This means that the throughput of any synchronous SC
decoder is limited to fc/2 in terms of the clock frequency fc, as
pointed out in [3]. The throughput is reduced further in semi-
parallel architectures, such as [5] and [6], which increase the
Copyright (c) 2015 IEEE. Personal use of this material is permitted.
However, permission to use this material for any other purposes must be
obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
The authors are with the Department of Electrical-Electronics
Engineering, Bilkent University, Ankara, TR-06800, Turkey (e-mail:
odizdar@ee.bilkent.edu.tr , arikan@ee.bilkent.edu.tr.)
decoding latency further in exchange for reduced hardware
complexity. This throughput bottleneck in SC decoding is
inherent in the logic of SC decoding and stems from the fact
that the decoder makes its final decisions one at a time in a
sequential manner.
Some algorithmic and hardware implementation methods
have been proposed to overcome the throughput bottleneck
problem in polar decoding. One method that has been tried
is Belief Propagation (BP) decoding, starting with [7]. In BP
decoding, the decoder has the capability of making multiple
bit decisions in parallel. Indeed, BP throughputs of 2 Gb/s
(with clock frequency 500 MHz) and 4.6 Gb/s (with clock
frequency 300 MHz) are reported in [8] and [9], respectively.
Generally speaking, the throughput advantage of BP decoding
is observed at high SNR values, where correct decoding can
be achieved after a small number of iterations; this advantage
of BP decoders over SC decoders diminishes as the SNR
decreases.
A second algorithmic approach to break the throughput
bottleneck is to exploit the fact that polar codes are a class
of generalized concatenated codes (GCC). More precisely, a
polar code C of length-N is constructed from two length-N/2
codes C1 and C2, using the well-known Plotkin |u|u + v|
code combining technique [10]. The recursive nature of the
polar code construction ensures that the constituent codes C1
and C2 are polar codes in their own right and each can be
further decomposed into two polar codes of length N/4, and
so on, until the block-length is reduced to one. In order to
improve the throughput of a polar code, one may introduce
specific measures to speed up the decoding of the constituent
polar codes encountered in the course of such recursive
decomposition. For example, when a constituent code Ci of
rate 0 or 1 is encountered, the decoding becomes a trivial
operation and can be completed in one clock cycle. Similarly,
decoding is trivial when the constituent code is a repetition
code or a single parity-check code. Such techniques have
been applied earlier in the context of Reed-Muller codes by
[11] and [12]. They have been also used in speeding up SC
decoders for polar codes by [13]. Results reported by such
techniques show a throughput of 1 Gb/s by using designs
tailored for specific codes [14]. On the other hand, decoders
utilizing such shortcuts require reconfiguration when the code
is changed, which makes their use difficult in systems using
adaptive coding methods.
Implementation methods such as precomputation,
pipelining, and unrolling have also been proposed
to improve the throughput of SC decoders. These methods
trade hardware complexity for gains in throughput. For
example, it has been shown that the decoding latency may
be reduced to N by doubling the number of adders in a
SC decoder circuit [18]. A similar approach has been used
in a first ASIC implementation of a SC decoder to reduce
the latency at the decision-level LLR calculations by N/2
clock cycles and provide a throughput of 49 Mb/s with
150 MHz clock frequency for a rate-1/2 code [5]. In contrast,
pipelined and unrolled designs do not affect the latency of the
decoder; the increase in throughput is obtained by decoding
multiple codewords simultaneously without resource sharing.
A recent study [19] exhibits a SC decoder achieving 254
Gb/s throughput with a fully-unrolled and deeply-pipelined
architecture using component code properties for a rate-1/2
code. Pipelining in the context of polar decoders was used
earlier in various forms and in a more limited manner in [2],
[3], [4], [18], and [20].
SC decoders, while being simple, are suboptimal. In [15],
SC list-of-L decoding was proposed for decoding polar codes,
following similar ideas developed earlier by [16] for Reed-
Muller codes. Ordinary SC decoding is a special case of SC list
decoding with list size L = 1. SC list decoders show markedly
better performance compared to SC decoders at the expense of
complexity, and are subject to the same throughput bottleneck
problems as ordinary SC decoding. Parallel decision-making
techniques, as discussed above, can be applied to improve the
throughput of SC list decoding. For instance, it was shown
in [17] that by using 4-bit parallel decisions, a list-of-2 SC
decoder can achieve a throughput of around 500 Mb/s with a
clock frequency of 500 MHz.
The present work is motivated by the desire to obtain high-
throughput SC decoders with low power consumption, which
has not been a main concern in the literature so far. These desired
properties are attained by designing completely combinational
decoder architectures, which is possible thanks to the recursive
and feed-forward (non-iterative) structure of the SC algorithm.
Combinational decoders operate at lower clock frequencies
compared to ordinary synchronous (sequential logic) decoders.
However, in a combinational decoder an entire codeword
is decoded in one clock cycle. This allows combinational
decoders to operate with less power while maintaining a high
throughput, as we demonstrate in the remaining sections of
this work.
Pipelining can be applied to combinational decoders at any
depth to adjust their throughput, hardware usage, and power
consumption characteristics. Therefore, we also investigate the
performance of pipelined combinational decoders. We do not
use any of the multi-bit decision shortcuts in the architectures
we propose. Thus, for a given block length, the combinational
decoders that we propose retain the inherent flexibility of polar
coding to operate at any desired code rate between zero and
one. Retaining such flexibility is important since one of the
main motivations behind the combinational decoder is to use
it as an “accelerator” module as part of a hybrid decoder that
combines a synchronous SC decoder with a combinational
decoder to take advantage of the best characteristics of the
two types of decoders. We give an analytical discussion of the
throughput of hybrid-logic decoders to quantify the advantages
of the hybrid decoder.
The rest of this paper is organized as follows. Section II
gives a brief discussion of polar coding to define the SC
decoding algorithm. Section III introduces the main decoder
architectures considered in this paper, namely, combinational
decoders, pipelined combinational decoders, and hybrid-logic
decoders. Also included in that section is an analysis of
the hardware complexity and latency of the proposed de-
coders. Implementation results of combinational decoders and
pipelined combinational decoders are presented in Section IV,
with a detailed power consumption analysis for combinational
decoders. Also presented in the same section is an analysis
of the throughput improvement obtained by hybrid-logic de-
coders relative to synchronous decoders. Section V concludes
the paper.
Throughout the paper, vectors are denoted by boldface
lowercase letters. All matrix and vector operations are
over vector spaces over the binary field F2. Addition over
F2 is represented by the ⊕ operator. For any set S ⊆
{0, 1, . . . , N − 1}, S^c denotes its complement. For any vector
u = (u0, u1, . . . , uN−1) of length N and set S ⊆
{0, 1, . . . , N − 1}, we define u_S = [u_i : i ∈ S]. We define a binary
sign function s(ℓ) as
$$ s(\ell) = \begin{cases} 0, & \text{if } \ell \ge 0 \\ 1, & \text{otherwise.} \end{cases} \tag{1} $$
II. BACKGROUND ON POLAR CODING
We briefly describe the basics of polar coding in this section,
including the SC decoding algorithm. Consider the system
given in Fig. 1, in which a polar code is used for channel
coding. All input/output signals in the system are vectors of
length N , where N is the length of the polar code that is being
used.
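[Figure: block diagram. u → Polar Encoder → x → W → y → LLR Calc. → ℓ → SC Polar Decoder → uˆ, with the frozen-bit indicator vector a as an additional decoder input.]
Fig. 1. Communication scheme with polar coding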
The encoder input vector u ∈ F_2^N consists of a data part
u_A and a frozen part u_{A^c}, where A is chosen in accordance
with polar code design rules as explained in [1]. We fix the
frozen part u_{A^c} to zero in this study. We define a frozen-bit
indicator vector a so that a is a 0-1 vector of length N with
$$ a_i = \begin{cases} 0, & \text{if } i \in A^c \\ 1, & \text{if } i \in A. \end{cases} $$
The frozen-bit indicator vector is made available to the decoder
in the system.
The channel W in the system is an arbitrary discrete memo-
ryless channel with input alphabet X = {0, 1}, output alphabet
Y and transition probabilities {W (y|x) : x ∈ X , y ∈ Y}. In
each use of the system, a codeword x ∈ F_2^N is transmitted,
and a channel output vector y ∈ Y^N is received. The receiver
calculates a log-likelihood ratio (LLR) vector ℓ = (ℓ0, . . . , ℓN−1)
with
$$ \ell_i = \ln\left(\frac{P(y_i \mid x_i = 0)}{P(y_i \mid x_i = 1)}\right), $$
and feeds it into the SC decoder.
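As a concrete instance of the LLR definition above, the following sketch assumes a binary symmetric channel (BSC) with crossover probability p, which is one example of the binary-input discrete memoryless channels considered here. The function name `bsc_llr` and the value p = 0.1 are illustrative assumptions, not part of the paper.

```python
import math

def bsc_llr(y, p):
    """LLR ln(P(y|x=0) / P(y|x=1)) for a BSC with crossover probability p.

    Assumption: 0 < p < 0.5 and y is a received bit in {0, 1}.
    """
    magnitude = math.log((1.0 - p) / p)
    # y = 0 favours x = 0 (positive LLR); y = 1 favours x = 1 (negative LLR).
    return magnitude if y == 0 else -magnitude

# Hypothetical received word and crossover probability.
received = [0, 1, 1, 0]
llrs = [bsc_llr(y, 0.1) for y in received]
```

With this sign convention, the sign function s of (1) applied to a channel LLR returns the hard decision on the corresponding coded bit.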
Algorithm 1: uˆ = DECODE(ℓ, a)
  N = length(ℓ)
  if N == 2 then
    uˆ0 ← s(f(ℓ0, ℓ1)) · a0
    uˆ1 ← s(g(ℓ0, ℓ1, uˆ0)) · a1
    return uˆ ← (uˆ0, uˆ1)
  else
    ℓ′ ← fN/2(ℓ)
    a′ ← (a0, . . . , aN/2−1)
    uˆ′ ← DECODE(ℓ′, a′)
    v ← ENCODE(uˆ′)
    ℓ′′ ← gN/2(ℓ, v)
    a′′ ← (aN/2, . . . , aN−1)
    uˆ′′ ← DECODE(ℓ′′, a′′)
    return uˆ ← (uˆ′, uˆ′′)
  end
The decoder in the system is an SC decoder as described in
[1], which takes as input the channel LLRs and the frozen-bit
indicator vector and calculates an estimate uˆ ∈ F_2^N of the data
vector u. The SC algorithm outputs bit decisions sequentially,
one at a time in natural index order, with each bit decision
depending on prior bit decisions. A precise statement of the
SC algorithm is given in Algorithm 1, where the functions
fN/2 and gN/2 are defined as
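$$ f_{N/2}(\boldsymbol{\ell}) = \left(f(\ell_0, \ell_1), \ldots, f(\ell_{N-2}, \ell_{N-1})\right) $$
$$ g_{N/2}(\boldsymbol{\ell}, \mathbf{v}) = \left(g(\ell_0, \ell_1, v_0), \ldots, g(\ell_{N-2}, \ell_{N-1}, v_{N/2-1})\right) $$
with
$$ f(\ell_1, \ell_2) = 2 \tanh^{-1}\left(\tanh(\ell_1/2) \tanh(\ell_2/2)\right) $$
$$ g(\ell_1, \ell_2, v) = \ell_1 (-1)^{v} + \ell_2. $$
In actual implementations discussed in this paper, the function
f is approximated using the min-sum formula
$$ f(\ell_1, \ell_2) \approx (1 - 2s(\ell_1)) \cdot (1 - 2s(\ell_2)) \cdot \min\{|\ell_1|, |\ell_2|\}, \tag{2} $$
and g is realized in the alternative (exact) form
$$ g(\ell_1, \ell_2, v) = \ell_2 + (1 - 2v) \cdot \ell_1. \tag{3} $$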
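The relation between the exact f function and its min-sum approximation (2) can be illustrated with a short numerical sketch. The function names below are illustrative, and floating-point LLRs are used for simplicity, whereas the decoders in this paper use Q-bit sign-magnitude values.

```python
import math

def s(l):
    """Binary sign function of (1): 0 for l >= 0, 1 otherwise."""
    return 0 if l >= 0 else 1

def f_exact(l1, l2):
    """Exact f; suitable only for moderate |LLR| values in floating point."""
    return 2.0 * math.atanh(math.tanh(l1 / 2.0) * math.tanh(l2 / 2.0))

def f_minsum(l1, l2):
    """Min-sum approximation (2): product of signs times the smaller magnitude."""
    return (1 - 2 * s(l1)) * (1 - 2 * s(l2)) * min(abs(l1), abs(l2))

def g(l1, l2, v):
    """Form (3): l2 + l1 if v == 0, l2 - l1 if v == 1."""
    return l2 + (1 - 2 * v) * l1

exact = f_exact(1.5, -0.4)
approx = f_minsum(1.5, -0.4)
```

In this example both values are negative, and the min-sum magnitude is the larger of the two, which reflects the fact that the approximation keeps the sign of the exact value but overestimates its reliability.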
A key property of the SC decoding algorithm that makes
low-complexity implementations possible is its recursive na-
ture, where a decoding instance of block length N is broken
in the decoder into two decoding instances of lengths N/2
each.
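The recursion of Algorithm 1 can be sketched in software as follows. This is a behavioral model only, not the hardware architecture. It uses the min-sum f of (2) and the g of (3). The ENCODE function is not spelled out in the text, so the bit ordering used in `encode` below is an assumption chosen to match the adjacent pairing (ℓ0, ℓ1), (ℓ2, ℓ3), . . . in fN/2 and gN/2.

```python
def s(l):
    return 0 if l >= 0 else 1

def f(l1, l2):
    # Min-sum approximation (2).
    return (1 - 2 * s(l1)) * (1 - 2 * s(l2)) * min(abs(l1), abs(l2))

def g(l1, l2, v):
    # Form (3).
    return l2 + (1 - 2 * v) * l1

def encode(u):
    """Polar encoding with ASSUMED bit ordering: x[2i] = v[i] ^ w[i], x[2i+1] = w[i],
    where v and w are the encodings of the first and second halves of u."""
    n = len(u)
    if n == 1:
        return list(u)
    v, w = encode(u[: n // 2]), encode(u[n // 2 :])
    x = []
    for vi, wi in zip(v, w):
        x += [vi ^ wi, wi]
    return x

def decode(llr, a):
    """Recursive SC decoding following Algorithm 1 (len(llr) is a power of two >= 2)."""
    n = len(llr)
    if n == 2:
        u0 = s(f(llr[0], llr[1])) * a[0]
        u1 = s(g(llr[0], llr[1], u0)) * a[1]
        return [u0, u1]
    half = n // 2
    l1 = [f(llr[2 * i], llr[2 * i + 1]) for i in range(half)]
    u1 = decode(l1, a[:half])
    v = encode(u1)  # partial sums fed back to the g block
    l2 = [g(llr[2 * i], llr[2 * i + 1], v[i]) for i in range(half)]
    u2 = decode(l2, a[half:])
    return u1 + u2

# Noiseless round trip with a hypothetical frozen-bit indicator vector.
a = [0, 0, 0, 1, 0, 1, 1, 1]
u = [0, 0, 0, 1, 0, 1, 0, 1]            # frozen positions are zero
llr = [4.0 if bit == 0 else -4.0 for bit in encode(u)]
u_hat = decode(llr, a)
```

Because the recursion contains no state carried between calls, each call corresponds to a fixed block of logic, which is the property the combinational architecture relies on.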
III. SC DECODER USING COMBINATIONAL LOGIC
The pseudocode in Algorithm 1 shows that the logic of SC
decoding contains no loops, hence it can be implemented
using only combinational logic. The potential benefits of a
combinational implementation are high throughput and low
power consumption, which we show are feasible goals. In
this section, we first describe a combinational SC decoder
for length N = 4 to explain the basic idea. Then, we
describe the three architectures that we propose. Finally, we
give an analysis of complexity and latency characteristics of
the proposed architectures.
A. Combinational Logic for SC Decoding
In a combinational SC decoder the decoder outputs are
expressed directly in terms of decoder inputs, without any
registers or memory elements in between the input and output
stages. Below we give the combinational logic expressions
for a decoder of size N = 4, for which the signal flow graph
(trellis) is depicted in Fig. 2.
[Figure: two-stage signal flow graph with inputs ℓ0, . . . , ℓ3, f and g nodes at Stage 0 producing ℓ′0, ℓ′1, ℓ′′0, ℓ′′1, and f and g nodes at Stage 1 producing uˆ0, . . . , uˆ3.]
Fig. 2. SC decoding trellis for N = 4
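At Stage 0 we have the LLR relations
$$ \ell'_0 = f(\ell_0, \ell_1), \quad \ell'_1 = f(\ell_2, \ell_3), $$
$$ \ell''_0 = g(\ell_0, \ell_1, \hat{u}_0 \oplus \hat{u}_1), \quad \ell''_1 = g(\ell_2, \ell_3, \hat{u}_1). $$
At Stage 1, the decisions are extracted as follows.
$$ \begin{aligned}
\hat{u}_0 &= s\left[f\left(f(\ell_0, \ell_1), f(\ell_2, \ell_3)\right)\right] \cdot a_0, \\
\hat{u}_1 &= s\left[g\left(f(\ell_0, \ell_1), f(\ell_2, \ell_3), \hat{u}_0\right)\right] \cdot a_1, \\
\hat{u}_2 &= s\left[f\left(g(\ell_0, \ell_1, \hat{u}_0 \oplus \hat{u}_1), g(\ell_2, \ell_3, \hat{u}_1)\right)\right] \cdot a_2, \\
\hat{u}_3 &= s\left[g\left(g(\ell_0, \ell_1, \hat{u}_0 \oplus \hat{u}_1), g(\ell_2, \ell_3, \hat{u}_1), \hat{u}_2\right)\right] \cdot a_3,
\end{aligned} $$
where the decisions uˆ0 and uˆ2 may be simplified as
$$ \begin{aligned}
\hat{u}_0 &= \left[s(\ell_0) \oplus s(\ell_1) \oplus s(\ell_2) \oplus s(\ell_3)\right] \cdot a_0, \\
\hat{u}_2 &= \left[s\left(g(\ell_0, \ell_1, \hat{u}_0 \oplus \hat{u}_1)\right) \oplus s\left(g(\ell_2, \ell_3, \hat{u}_1)\right)\right] \cdot a_2.
\end{aligned} $$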
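The simplification of uˆ0 and uˆ2 can be checked numerically. The sketch below evaluates the nested and the simplified expressions for N = 4 over all sign patterns of one set of LLR magnitudes and all frozen-bit patterns, assuming the min-sum f of (2). The magnitudes are arbitrary illustrative values, chosen so that no intermediate LLR is exactly zero; at exact zeros the two forms can differ because s(0) = 0 by definition.

```python
from itertools import product

def s(l):
    return 0 if l >= 0 else 1

def f(l1, l2):
    return (1 - 2 * s(l1)) * (1 - 2 * s(l2)) * min(abs(l1), abs(l2))

def g(l1, l2, v):
    return l2 + (1 - 2 * v) * l1

def decode4_nested(l, a):
    """Decisions written as nested f/g expressions."""
    l0, l1, l2, l3 = l
    u0 = s(f(f(l0, l1), f(l2, l3))) * a[0]
    u1 = s(g(f(l0, l1), f(l2, l3), u0)) * a[1]
    u2 = s(f(g(l0, l1, u0 ^ u1), g(l2, l3, u1))) * a[2]
    u3 = s(g(g(l0, l1, u0 ^ u1), g(l2, l3, u1), u2)) * a[3]
    return [u0, u1, u2, u3]

def decode4_simplified(l, a):
    """Same decoder with the XOR-of-signs forms for u0 and u2."""
    l0, l1, l2, l3 = l
    u0 = (s(l0) ^ s(l1) ^ s(l2) ^ s(l3)) * a[0]
    u1 = s(g(f(l0, l1), f(l2, l3), u0)) * a[1]
    u2 = (s(g(l0, l1, u0 ^ u1)) ^ s(g(l2, l3, u1))) * a[2]
    u3 = s(g(g(l0, l1, u0 ^ u1), g(l2, l3, u1), u2)) * a[3]
    return [u0, u1, u2, u3]

magnitudes = (1.5, 2.0, 0.5, 3.0)  # illustrative values with distinct magnitudes
mismatches = 0
for signs in product((1, -1), repeat=4):
    llr = [sign * m for sign, m in zip(signs, magnitudes)]
    for a in product((0, 1), repeat=4):
        if decode4_nested(llr, a) != decode4_simplified(llr, a):
            mismatches += 1
```

The simplified forms remove a comparator from the path to uˆ0 and uˆ2, since only the signs of the operands of f are needed for a hard decision.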
[Figure: combinational circuit with Q-bit inputs ℓ0, . . . , ℓ3, comparators on the (Q − 1)-bit magnitudes |ℓ0|, . . . , |ℓ3|, adder and subtractor pairs with selection logic, sign bits s(ℓ0), . . . , s(ℓ3), frozen-bit indicators a0, . . . , a3, and outputs uˆ0, . . . , uˆ3.]
Fig. 3. Combinational decoder for N = 4
Fig. 3 shows a combinational logic implementation of the
above decoder using only comparators and adders. We use
sign-magnitude representation, as in [21], to avoid an excessive
number of conversions between different representations.
Channel observation LLRs and calculations throughout the
decoder are represented by Q bits. The function g of (3)
is implemented using the precomputation method suggested
in [18] to reduce latency. In order to reduce latency and
complexity further, we implement the decision logic for
odd-indexed bits as
$$ \hat{u}_{2i+1} = \begin{cases} 0, & \text{if } a_{2i+1} = 0 \\ s(\lambda_2), & \text{if } a_{2i+1} = 1 \text{ and } |\lambda_2| \ge |\lambda_1| \\ s(\lambda_1) \oplus \hat{u}_{2i}, & \text{otherwise.} \end{cases} \tag{4} $$
[Figure: the block DECODE(ℓ, a) composed of fN/2(ℓ), DECODE(ℓ′, a′), ENCODE(v), gN/2(ℓ, v), and DECODE(ℓ′′, a′′), with internal signals ℓ′, a′, uˆ′, v, ℓ′′, a′′, uˆ′′ and output uˆ.]
Fig. 4. Recursive architecture of polar decoders for block length N
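The odd-indexed decision rule in (4) can be compared against the direct form s(g(λ1, λ2, uˆ2i)) · a2i+1. The sketch below assumes that λ1 and λ2 are the two LLR inputs of the g node that feeds the odd-indexed decision, with g as in (3). The function names and test values are illustrative.

```python
def s(l):
    return 0 if l >= 0 else 1

def g(l1, l2, v):
    return l2 + (1 - 2 * v) * l1

def odd_decision_direct(lam1, lam2, u_even, a_odd):
    """Decision obtained by evaluating g with an adder or subtractor first."""
    return s(g(lam1, lam2, u_even)) * a_odd

def odd_decision_eq4(lam1, lam2, u_even, a_odd):
    """Decision logic of (4): one magnitude comparison, no addition."""
    if a_odd == 0:
        return 0
    if abs(lam2) >= abs(lam1):
        return s(lam2)
    return s(lam1) ^ u_even

values = (-3.0, -1.5, -0.5, 0.5, 1.5, 3.0)
disagreements = [
    (l1, l2, u, a)
    for l1 in values for l2 in values
    for u in (0, 1) for a in (0, 1)
    if abs(l1) != abs(l2)  # ties excluded, see the note below
    and odd_decision_direct(l1, l2, u, a) != odd_decision_eq4(l1, l2, u, a)
]
```

The two forms agree whenever |λ1| ≠ |λ2|. When the magnitudes are equal and the two terms of g cancel, the direct form yields s(0) = 0 while (4) yields s(λ2), so the outputs can differ in that tie case.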
B. Architectures
In this section, we propose three SC decoder architectures
for polar codes: combinational, pipelined combinational, and
hybrid-logic decoders. Thanks to the recursive structure of the
SC decoder, the above combinational decoder of size N = 4
will serve as a basic building block for the larger decoders
that we discuss in the next subsection.
1) Combinational Decoder: A combinational decoder ar-
chitecture for any block length N using the recursive algorithm
in Algorithm 1 is shown in Fig. 4. This architecture uses two
combinational decoders of size N/2, with glue logic consisting
of one fN/2 block, one gN/2 block, and one size-N/2 encoder
block.
[Figure: input registers holding ℓ0, . . . , ℓ7 (Q bits each) feed four f blocks and four g blocks; the f outputs drive a combinational decoder (N = 4), whose decisions pass through an encoder (N = 4) to the g blocks, which drive a second combinational decoder (N = 4); bit indicator registers hold a0, . . . , a7 and output registers hold uˆ0, . . . , uˆ7.]
Fig. 5. RTL schematic for combinational decoder (N = 8)
The RTL schematic for a combinational decoder of this type
is shown in Fig. 5 for N = 8. The decoder submodules of size-
4 are the same as in Fig. 3. The size-4 encoder is implemented
using a combinational circuit consisting of XOR gates. The
logic blocks in a combinational decoder are directly connected
without any synchronous logic elements in-between, which
helps the decoder to save time and power by avoiding mem-
ory read/write operations. Avoiding the use of memory also
reduces hardware complexity. In each clock period, a new
channel observation LLR vector is read from the input registers
and a decision vector is written to the output registers. The
clock period is equal to the overall combinational delay of
the circuit, which determines the throughput of the decoder.
The decoder differentiates between frozen bits and data bits by
AND gates and the frozen bit indicators ai, as shown in Fig. 3.
The frozen-bit indicator vector can be changed at the start
of each decoding operation, making it possible to change the
code configuration in real time. Advantages and disadvantages
of combinational decoders will be discussed in more detail in
Section IV.
2) Pipelined Combinational Decoder: Unlike sequential
circuits, the combinational architecture explained above has
no need for any internal storage elements. The longest path
delay determines the clock period in such a circuit. This saves
hardware by avoiding usage of memory, but slows down the
decoder. In this subsection, we introduce pipelining in order to
increase the throughput at the expense of some extra hardware
utilization.
It is seen in Fig. 4 that the outputs of the first decoder block
(DECODE(ℓ′,a′)) are used by the encoder to calculate partial-
sums. Therefore, this decoder needs to preserve its outputs
after they settle to their final values. However, this particular
decoder can start the decoding operation for another codeword
if these partial-sums are stored with the corresponding channel
observation LLRs for the second decoder (DECODE(ℓ′′,a′′)).
Therefore, adding register blocks to certain locations in the
decoder enables a pipelined decoding process.
Early examples of pipelining in the context of synchronous
polar decoders are [2], [3], and [4]. In synchronous design with
pipelining, shared resources at certain stages of decoding have
to be duplicated in order to prevent conflicts on calculations
when multiple codewords are processed in the decoder. The
number of duplications and their stages depend on the number
of codewords to be processed in parallel. Since pipelined
decoders are derived from combinational decoders, they do
not use resource sharing; therefore, resource duplications are
not needed. Instead, pipelined combinational decoders aim to
reuse the existing resources. This resource reuse is achieved
by using storage elements to save the outputs of smaller
combinational decoder components and re-employ them in
decoding of another codeword.
A single stage pipelined combinational decoder is shown in
Fig. 6. The channel observation LLR vectors ℓ1 and ℓ2 in this
architecture correspond to different codewords. The partial-
sum vector v1 is calculated from the first half of the decoded
vector for ℓ1. Output vectors uˆ′2 and uˆ′′1 are the first and second
halves of decoded vectors for ℓ2 and ℓ1, respectively. The
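schedule for this pipelined combinational decoder is given in
Table I.
[Figure: the architecture of Fig. 4 with register blocks added. The input ℓ2 (N×Q bits) feeds fN/2(ℓ) and DECODE(ℓ′, a′), which outputs uˆ′2, while the stored LLR vector ℓ1 and the stored partial-sum vector v1 (N/2×1 bits) feed gN/2(ℓ, v) and DECODE(ℓ′′, a′′), which outputs uˆ′′1.]
Fig. 6. Recursive architecture for pipelined polar decoders for block length N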
TABLE I
SCHEDULE FOR SINGLE STAGE PIPELINED COMBINATIONAL DECODER
Clock Cycle | 1 2 3 4 5 6 7 8
Input of DECODE(ℓ,a) | ℓ1 ℓ2 ℓ3 ℓ4 ℓ5 ℓ6
Output of DECODE(ℓ′,a′) | uˆ′1 uˆ′2 uˆ′3 uˆ′4 uˆ′5 uˆ′6
Output of DECODE(ℓ′′,a′′) | uˆ′′1 uˆ′′2 uˆ′′3 uˆ′′4 uˆ′′5 uˆ′′6
Output of DECODE(ℓ,a) | uˆ1 uˆ2 uˆ3 uˆ4 uˆ5 uˆ6
As seen from Table I, pipelined combinational decoders,
like combinational decoders, decode one codeword per clock
cycle. However, the maximum path delay of a pipelined com-
binational decoder for block length N is approximately equal
to the delay of a combinational decoder for block length N/2.
Therefore, the single stage pipelined combinational decoder
in Fig. 6 provides approximately twice the throughput of a
combinational decoder for the same block length. On the
other hand, power consumption and hardware usage increase
due to the added storage elements and increased operating
frequency. Pipelining stages can be increased by making the
two combinational decoders for block length N/2 in Fig. 6
also pipelined in a similar way to increase the throughput
further. Comparisons between combinational decoders and
pipelined combinational decoders are given in more detail in
Section IV.
3) Hybrid-Logic Decoder: In this part, we give an architec-
ture that combines synchronous decoders with combinational
decoders to carry out the decoding operations for compo-
nent codes. In sequential SC decoding of polar codes, the
decoder slows down every time it approaches the decision
level (where decisions are made sequentially and the number of
parallel calculations decreases). In a hybrid-logic SC decoder,
the combinational decoder is used near the decision level to
speed up the SC decoder by taking advantage of the GCC
structure of polar codes. The GCC structure is illustrated in
Fig. 7, which shows that a polar code C of length N = 8 can
be seen as the concatenation of two polar codes C1 and C2 of
length N ′ = N/2 = 4, each.
The dashed boxes in Fig. 7 represent the component
codes C1 and C2. The input bits of the component codes are
$\hat{u}^{(1)} = (\hat{u}^{(1)}_0, \ldots, \hat{u}^{(1)}_3) = (\hat{u}_0, \ldots, \hat{u}_3)$ and
$\hat{u}^{(2)} = (\hat{u}^{(2)}_0, \ldots, \hat{u}^{(2)}_3) = (\hat{u}_4, \ldots, \hat{u}_7)$. For a polar code of
block length 8 and R = 1/2, the frozen bits are uˆ0, uˆ1, uˆ2,
and uˆ4. This makes 3 input bits of C1 and 1 input bit of C2
frozen bits; thus, C1 is a R = 1/4 code with $\hat{u}^{(1)}_0, \hat{u}^{(1)}_1, \hat{u}^{(1)}_2$ frozen
and C2 is a R = 3/4 code with $\hat{u}^{(2)}_0$ frozen.
[Figure: encoding circuit for N = 8. The inputs u0, . . . , u3 and u4, . . . , u7 enter two length-4 encoders C1 and C2, shown as dashed boxes, whose outputs x(1)0, . . . , x(1)3 and x(2)0, . . . , x(2)3 are combined to produce the coded bits x0, . . . , x7.]
Fig. 7. Encoding circuit of C with component codes C1 and C2 (N = 8 and
N ′ = 4)
Encoding of C is done by first encoding uˆ(1) and uˆ(2)
separately using encoders for block length 4 to obtain the
coded outputs xˆ(1) and xˆ(2). Then, each pair of coded bits
$(\hat{x}^{(1)}_i, \hat{x}^{(2)}_i)$, 0 ≤ i ≤ 3, is encoded again using encoders for
block length 2 to obtain the coded bits of C.
[Figure: three-stage decoding trellis with inputs ℓ0, . . . , ℓ7. The f and g nodes at Stage 0 produce the component-code input LLRs λ(1)0, . . . , λ(1)3 and λ(2)0, . . . , λ(2)3; the f and g nodes at Stages 1 and 2, shown in dashed boxes, produce the decisions uˆ0, . . . , uˆ7.]
Fig. 8. Decoding trellis for hybrid-logic decoder (N = 8 and N ′ = 4)
Decoding of C is done in a reversed manner with respect to
encoding explained above. Fig. 8 shows the decoding trellis
for the given example. Two separate decoding sessions for
block length 4 are required to decode component codes C1
and C2. We denote the input LLRs for component codes as
λ(1) and λ(2), as shown in Fig. 8. These inputs are calculated
by the operations at stage 0. The frozen bit indicator vector
of C is a = (0, 0, 0, 1, 0, 1, 1, 1) and the frozen bit vectors of
component codes are a(1) = (0, 0, 0, 1) and a(2) = (0, 1, 1, 1).
It is seen that λ(2) depends on the decoded outputs of C1,
since g functions are used to calculate λ(2) from input LLRs.
This implies that the component codes cannot be decoded in
parallel.
The dashed boxes in Fig. 8 show the operations performed
by a combinational decoder for N ′ = 4. The operations
outside the boxes are performed by a synchronous decoder.
Algorithm 2: HL_DECODE(ℓ, a, N′)
  for i = 1 to N/N′ do
    if i == 1 then
      λ(i) ← DECODE_SYNCH(ℓ, i, N′)
    else
      λ(i) ← DECODE_SYNCH(ℓ, i, N′, uˆ(i−1))
    end
    uˆ(i) ← DECODE(λ(i), a(i))
  end
  return uˆ
The sequence of decoding operations in this hybrid-logic
decoder is as follows: a synchronous decoder takes channel
observation LLRs and uses them to calculate intermediate
LLRs that require no partial-sums at stage 0. When the
synchronous decoder completes its calculations at stage 0,
the resulting intermediate LLRs are passed to a combinational
decoder for block length 4. The combinational decoder outputs
uˆ0, . . . , uˆ3 (uncoded bits of the first component code) while
the synchronous decoder waits for a period equal to the
maximum path delay of combinational decoder. The decoded
bits are passed to the synchronous decoder to be used in
partial-sums (uˆ0 ⊕ uˆ1 ⊕ uˆ2 ⊕ uˆ3, uˆ1 ⊕ uˆ3, uˆ2 ⊕ uˆ3, and uˆ3).
The synchronous decoder calculates the intermediate LLRs
using these partial-sums with channel observation LLRs and
passes the calculated LLRs to the combinational decoder,
where they are used for decoding of uˆ4, . . . , uˆ7 (uncoded
bits of the second component code). Since the combinational
decoder architecture proposed in this work can adapt to operate
on any code set using the frozen bit indicator vector input,
a single combinational decoder is sufficient for decoding
all bits. During the decoding of a codeword, each decoder
(combinational and sequential) is activated 2 times.
Algorithm 2 shows the algorithm for hybrid-logic polar
decoding for general N and N ′. For the ith activation of
combinational and sequential decoders, 1 ≤ i ≤ N/N ′, the
LLR vector that is passed from synchronous to combina-
tional decoder, the frozen bit indicator vector for the ith
component code, and the output bit vector are denoted by
$\lambda^{(i)} = (\lambda^{(i)}_0, \ldots, \lambda^{(i)}_{N'-1})$, $a^{(i)} = (a_{(i-1)N'}, \ldots, a_{iN'-1})$, and
$\hat{u}^{(i)} = (\hat{u}_{(i-1)N'}, \ldots, \hat{u}_{iN'-1})$, respectively. The function
DECODE_SYNCH represents the synchronous decoder that calculates
the intermediate LLR values at stage (log2(N/N′) − 1),
using the channel observations and partial-sums at each
repetition.
During the time period in which combinational decoder
operates, the synchronous decoder waits for ⌈DN ′ · fc⌉ clock
cycles, where fc is the operating frequency of synchronous
decoder and DN ′ is the delay of a combinational decoder for
block length N ′. We can calculate the approximate latency
gain obtained by a hybrid-logic decoder with respect to the
corresponding synchronous decoder as follows: let LS(N)
denote the latency of a synchronous decoder for block length N.
The latency reduction obtained using a combinational decoder
for a component code of length N′ in a single repetition is
Lr(N′) = LS(N′) − ⌈DN′ · fc⌉. In this formulation, it is
assumed that no numerical representation conversions are needed
when LLRs are passed from synchronous to combinational
decoder. Furthermore, we assume that maximum path delays
of combinational and synchronous decoders do not change
significantly when they are implemented together. Then, the
latency gain factor can be approximated as
$$ g(N, N') \approx \frac{L_S(N)}{L_S(N) - (N/N')\, L_r(N')}. \tag{5} $$
The approximation is due to the additional latency from
partial-sum updates at the end of each repetition using the
N ′ decoded bits. Efficient methods for updating partial sums
can be found in [6] and [22]. This latency gain multiplies the
throughput of the synchronous decoder, so that
$$ TP_{HL}(N, N') = g(N, N')\, TP_S(N), $$
where TPS(N) and TPHL(N, N′) are the throughputs of
synchronous and hybrid-logic decoders, respectively. An example
of the analytical calculations for throughputs of hybrid-logic
decoders is given in Section IV.
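The gain factor in (5) is easy to evaluate once LS(N), DN′ and fc are known. The sketch below is a hedged illustration only: it assumes a synchronous latency of LS(N) = 2N − 2 clock cycles, in line with the roughly 2N latency mentioned in the introduction, and uses made-up values DN′ = 11 ns for N′ = 16 and fc = 500 MHz. None of these numbers are results from this paper.

```python
import math

def sync_latency(n):
    """ASSUMED synchronous SC decoder latency in clock cycles (2N - 2)."""
    return 2 * n - 2

def latency_gain(n, n_prime, d_ns, fc_mhz):
    """Approximate gain factor g(N, N') of (5).

    d_ns:   combinational delay D_{N'} in nanoseconds (hypothetical value)
    fc_mhz: synchronous clock frequency in MHz (hypothetical value)
    """
    wait_cycles = math.ceil(d_ns * fc_mhz / 1000)     # ceil(D_{N'} * fc)
    l_r = sync_latency(n_prime) - wait_cycles         # L_r(N')
    repetitions = n // n_prime                        # N / N'
    total = sync_latency(n)
    return total / (total - repetitions * l_r)

gain = latency_gain(1024, 16, d_ns=11, fc_mhz=500)
```

Under these assumed numbers the synchronous decoder waits 6 cycles per activation instead of spending 30, and the gain evaluates to 2046/510, or about 4.01. The partial-sum update latency that makes (5) an approximation is not modeled here.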
C. Analysis
In this section, we analyze the complexity and delay of
combinational architectures. We benefit from the recursive
structure of polar decoders (Algorithm 1) in the provided
analyses.
1) Complexity: Combinational decoder complexity can be
expressed in terms of the total number of comparators, adders,
and subtractors in the design, as they are the basic building
blocks of the architecture with similar complexities.
First, we estimate the number of comparators. Comparators
are used in two different places in the combinational decoder
as explained in Section III-A: in implementing the function f
in (2), and as part of decision logic for odd-indexed bits. Let
cN denote the number of comparators used for implementing
the function f for a decoder of block length N . From
Algorithm 1, we see that the initial value of cN may be taken
as c4 = 2. From Fig. 3, we observe that there is the recursive
relationship
$$ c_N = 2c_{N/2} + \frac{N}{2} = 2\left(2c_{N/4} + \frac{N}{4}\right) + \frac{N}{2} = \ldots. $$
This recursion has the following (exact) solution
$$ c_N = \frac{N}{2} \log_2 \frac{N}{2} $$
as can be verified easily.
Let sN denote the number of comparators used for the
decision logic in a combinational decoder of block length N .
We observe that s4 = 2 and more generally sN = 2sN/2;
hence,
$$ s_N = \frac{N}{2}. $$
Next, we estimate the number of adders and subtractors.
The function g of (3) is implemented using an adder and a
subtractor, as explained in Section III-A. We define rN as
the total number of adders and subtractors in a combinational
decoder for block length N. Observing that rN = 2cN, we
obtain
$$ r_N = N \log_2(N/2). $$
Thus, the total number of basic logic blocks with similar
complexities is given by
$$ c_N + s_N + r_N = N\left(\frac{3}{2} \log_2(N) - 1\right), \tag{6} $$
which shows that the complexity of the combinational decoder
is roughly N log2 (N).
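The closed-form counts above can be cross-checked against the recursions directly. The sketch below uses the base values c4 = 2 and s4 = 2 and the relation rN = 2cN from the text; the function names are illustrative.

```python
def counts(n):
    """Return (c_N, s_N, r_N) from the recursions, with N = 4 as the base case."""
    if n == 4:
        return 2, 2, 4                   # c_4 = 2, s_4 = 2, r_4 = 2 * c_4
    c_half, s_half, _ = counts(n // 2)
    c = 2 * c_half + n // 2              # c_N = 2 c_{N/2} + N/2
    s = 2 * s_half                       # s_N = 2 s_{N/2}
    return c, s, 2 * c                   # r_N = 2 c_N

def closed_form(n):
    """Return (c_N, s_N, r_N, total) from the closed-form expressions and (6)."""
    k = n.bit_length() - 1               # log2(N) for N a power of two
    c = (n // 2) * (k - 1)               # (N/2) log2(N/2)
    s = n // 2
    r = n * (k - 1)                      # N log2(N/2)
    total = n * (3 * k - 2) // 2         # N (3/2 log2 N - 1)
    return c, s, r, total

table = {n: closed_form(n) for n in (4, 8, 16, 32, 64, 128, 256, 512, 1024)}
```

For example, N = 8 gives 8 comparators for f, 4 decision comparators, and 16 adders and subtractors, for a total of 28 basic blocks.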
2) Combinational Delay: We approximately calculate the
delay of combinational decoders using Fig. 4. The combinational
logic delays, excluding interconnect delays, of the
components forming the DECODE(ℓ,a) block are listed in Table II.
TABLE II
COMBINATIONAL DELAYS OF COMPONENTS IN DECODE(ℓ,a)
Block | Delay
fN/2(ℓ) | δc + δm
DECODE(ℓ′,a′) | D′N/2
ENCODE(v) | EN/2
gN/2(ℓ,v) | δm
DECODE(ℓ′′,a′′) | D′′N/2
The parallel comparator block fN/2(ℓ) in Fig. 4 has a
combinational delay of δc + δm, where δc is the delay of a
comparator and δm is the delay of a multiplexer. The delay
of the parallel adder and subtractor block gN/2(ℓ,v) appears
as δm due to the precomputation method, as explained in
Section III-A. The maximum path delay of the encoder can
be approximated as EN/2 ≈ [log2(N/2)] δx, where δx denotes
the propagation delay of a 2-input XOR gate.
We model D′N/2 ≈ D′′N/2, although it is seen from
Fig. 4 that DECODE(ℓ′,a′) has a larger load capacitance
than DECODE(ℓ′′,a′′) due to the ENCODE(v) block it drives.
However, this assumption is reasonable since the circuits that
are driving the encoder block at the output of DECODE(ℓ′,a′)
are bit-decision blocks and they compose a small portion of
the overall decoder block. Therefore, we can express DN as
$$ D_N = 2D'_{N/2} + \delta_c + 2\delta_m + E_{N/2}. \tag{7} $$
We use the combinational decoder for N = 4 as the base
decoder to obtain combinational decoders for larger block
lengths in Section III-A. Therefore, we can write DN in terms
of D′4 and substitute the expression for D′4 to obtain the
final expression for combinational delay. Using the recursive
structure of combinational decoders, we can write
$$ D_N = \frac{N}{4} D'_4 + \left(\frac{N}{4} - 1\right)(\delta_c + 2\delta_m) + \left(\frac{3N}{4} - \log_2(N) - 1\right)\delta_x + T_N. \tag{8} $$
Next, we obtain an expression for D′4 using Fig. 3. Assuming
δc ≥ 3δx + δa, we can write
D′4 = 3δc + 4δm + δx + 2δa, (9)
where δa represents the delay of an AND gate. Finally,
substituting (9) in (8), we get
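$$D_N = N\left(\frac{3\delta_m}{2} + \delta_c + \delta_x + \frac{\delta_a}{2}\right) - \left\{\delta_c + 2\delta_m + \left[\log_2 N + 1\right]\delta_x\right\} + T_N, \tag{10}$$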
for N > 4. The interconnect delay of the overall design,
TN , cannot be formulated since the routing process is not
deterministic.
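As a sanity check, the recursion (7), started from the base delay (9), can be unrolled numerically and compared with the closed form (10). The sketch below omits the interconnect term TN and uses hypothetical unit delays (arbitrary time units, not taken from any library) that satisfy δc ≥ 3δx + δa; the function names are illustrative.

```python
from math import log2

# Hypothetical unit delays in arbitrary time units (illustration only).
D_C, D_M, D_X, D_A = 5.0, 1.0, 0.5, 0.5   # comparator, mux, XOR, AND

def base_delay():
    """D'_4 from (9); valid under the assumption delta_c >= 3*delta_x + delta_a."""
    assert D_C >= 3 * D_X + D_A
    return 3 * D_C + 4 * D_M + D_X + 2 * D_A

def delay_recursive(n):
    """Logic delay obtained by unrolling recursion (7), with T_N omitted."""
    if n == 4:
        return base_delay()
    encoder = log2(n / 2) * D_X            # E_{N/2} = log2(N/2) * delta_x
    return 2 * delay_recursive(n // 2) + D_C + 2 * D_M + encoder

def delay_closed_form(n):
    """Logic delay from the closed form (10), with T_N omitted."""
    return (n * (1.5 * D_M + D_C + D_X + 0.5 * D_A)
            - (D_C + 2 * D_M + (log2(n) + 1) * D_X))

for n in (8, 16, 64, 1024):
    # Both routes must give the same logic delay.
    assert abs(delay_recursive(n) - delay_closed_form(n)) < 1e-9
```

With these example values, both expressions give a logic delay of 49 time units for N = 8.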
We mentioned in Section III-A that the delay reduction
obtained by precomputation in adders increases linearly with
N . This can be seen from expressions (8) and
(9). Recalling that we model the delay of an adder with
precomputation by δm, the first and second terms of (8) contain
the delays of adder block stages, both of which are multiplied
by a factor of roughly N/4. This implies that the overall delay
gain obtained by precomputation is approximately equal to the
difference between the delay of an adder and a multiplexer,
multiplied by N/2.
The expression (10) shows the relation between basic logic
element delays and maximum path delay of combinational
decoders. As N grows, the second term in (10) becomes
negligible with respect to the first term, making the maximum
path delay approximately N (3δm/2 + δc + δx + δa/2), i.e., linear
in N , with the additive interconnect delay term TN . The combinational
architecture involves heavy routing, and the interconnect delay is
expected to be a non-negligible component of the maximum path
delay. The analytical results obtained here will be compared
with implementation results in the next section.
IV. PERFORMANCE RESULTS
In this section, implementation results of combinational and
pipelined combinational decoders are presented. Throughput
and hardware usage are studied both in ASIC and FPGA, and
a detailed discussion of the power consumption characteristics
is given for the ASIC design.
The metrics we use to evaluate ASIC implementations are
throughput, energy-per-bit, and hardware efficiency, which are
defined as
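$$\text{Throughput [b/s]} = \frac{N\ \text{[bit]}}{D_N\ \text{[sec]}},$$
$$\text{Energy-per-bit [J/b]} = \frac{\text{Power [W]}}{\text{Throughput [b/s]}},$$
$$\text{Hardware Efficiency [b/s/m}^2\text{]} = \frac{\text{Throughput [b/s]}}{\text{Area [m}^2\text{]}}, \tag{11}$$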
respectively. These metrics of combinational decoders are also
compared with state-of-the-art decoders. The number of look-
up tables (LUTs) and flip-flops (FFs) in the design are studied
in addition to throughput in FPGA implementations. Formulas
for achievable throughputs in hybrid-logic decoders are also
given in this section.
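The metrics in (11) can be evaluated with a short sketch. The numbers below are the N = 1024 figures reported in Table III; taking DN as the reciprocal of the reported frequency reflects the fact that a combinational decoder decodes one codeword per clock period. Function and variable names are illustrative.

```python
def throughput_bps(n_bits, delay_s):
    """Throughput in b/s: N bits decoded in D_N seconds."""
    return n_bits / delay_s

def energy_per_bit(power_w, tp_bps):
    """Energy per decoded bit in J/b."""
    return power_w / tp_bps

def hardware_efficiency(tp_bps, area):
    """Throughput per unit area (units follow those of `area`)."""
    return tp_bps / area

# N = 1024 row of Table III: 2.5 MHz, 190.7 mW, 3.213 mm^2.
n, freq_hz, power_w, area_mm2 = 1024, 2.5e6, 190.7e-3, 3.213
tp = throughput_bps(n, 1.0 / freq_hz)          # D_N = 1 / frequency
print(f"throughput   : {tp / 1e9:.2f} Gb/s")
print(f"energy/bit   : {energy_per_bit(power_w, tp) * 1e12:.1f} pJ/b")
print(f"hardware eff.: {hardware_efficiency(tp, area_mm2) / 1e6:.1f} Mb/s/mm^2")
```

Up to rounding, the results agree with the corresponding Table III entries (2.56 Gb/s, 74.5 pJ/b, 796 Mb/s/mm2).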
A. ASIC Synthesis Results
1) Post-Synthesis Results: Table III gives the post-synthesis
results of combinational decoders using Cadence Encounter
RTL Compiler for block lengths 2^6 to 2^10 with Faraday’s UMC
90 nm 1.3 V FSD0K-A library. Combinational decoders of
such sizes can be used as standalone decoders, e.g., in wireless
transmission of voice and data, or as parts of a hybrid-logic
decoder of much larger size, as discussed in Section III-B3.
We use Q = 5 bits for quantization in the implementation. As
shown in Fig. 9, the performance loss with 5-bit quantization
is negligible at N = 1024 (this is true also at lower block
lengths, although not shown here).
TABLE III
ASIC IMPLEMENTATION RESULTS
Technology: 90 nm, 1.3 V (all block lengths)
| N | 2^6 | 2^7 | 2^8 | 2^9 | 2^10 |
|---|---|---|---|---|---|
| Area [mm^2] | 0.153 | 0.338 | 0.759 | 1.514 | 3.213 |
| Number of Cells | 24.3K | 57.2K | 127.5K | 260.8K | 554.3K |
| Dec. Power [mW] | 99.8 | 138.8 | 158.7 | 181.4 | 190.7 |
| Frequency [MHz] | 45.5 | 22.2 | 11.0 | 5.2 | 2.5 |
| Throughput [Gb/s] | 2.92 | 2.83 | 2.81 | 2.69 | 2.56 |
| Engy.-per-bit [pJ/b] | 34.1 | 49.0 | 56.4 | 67.4 | 74.5 |
| Hard. Eff. [Mb/s/mm^2] | 19084 | 8372 | 3700 | 1776 | 796 |
[Figure 9: FER versus Eb/No; curves for floating point, fixed-point (4-bit), and fixed-point (5-bit).]
Fig. 9. FER performance with different numbers of quantization bits (N =
1024, R = 1/2)
The results given in Table III verify the analytical results
for complexity and delay. It is expected from (6) that the ratio
of decoder complexities for block lengths N and N/2 should
be approximately 2. This can be verified by observing the
number of cells and area of decoders in Table III. As studied
in Section III-C2, (8) implies that the part of the maximum path
delay due to the basic logic elements approximately doubles
when the block length is doubled,
and there is also a non-deterministic additive delay due to
the interconnects, which is also expected to at least double
when block length is doubled. The maximum delay results in
Table III show that this analytical derivation also holds for the
given block lengths.
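The scaling argument can be illustrated with a short sketch that compares the ratio predicted by (6) with the cell counts reported in Table III, and computes the delay ratio from the reported frequencies (delay taken as the reciprocal of frequency). The dictionaries simply restate Table III values; the names are illustrative.

```python
from math import log2

# Number of cells (thousands) and frequency (MHz) as reported in Table III.
cells_k = {64: 24.3, 128: 57.2, 256: 127.5, 512: 260.8, 1024: 554.3}
freq_mhz = {64: 45.5, 128: 22.2, 256: 11.0, 512: 5.2, 1024: 2.5}

def total_blocks(n):
    """Closed form (6): N (3/2 log2 N - 1)."""
    return n * (1.5 * log2(n) - 1)

def predicted_ratio(n):
    """Ratio of block counts for block lengths N and N/2 according to (6)."""
    return total_blocks(n) / total_blocks(n // 2)

for n in (128, 256, 512, 1024):
    cell_ratio = cells_k[n] / cells_k[n // 2]
    delay_ratio = freq_mhz[n // 2] / freq_mhz[n]   # delay = 1 / frequency
    print(f"N={n:5d}  (6): {predicted_ratio(n):.2f}  "
          f"cells: {cell_ratio:.2f}  delay: {delay_ratio:.2f}")
```

All three ratios stay between 2 and 2.4 for the listed block lengths; the ratio from (6) is slightly above 2 because of the log2 (N) factor.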
It is seen from Table III that the removal of registers and
RAM blocks from the design keeps the hardware usage at
moderate levels despite the high number of basic logic blocks
in the architecture. Moreover, the delays due to register read
and write operations and clock setup/hold times, which
accumulate to significant amounts as N increases, are eliminated.
2) Power Analysis: Table III shows that the power con-
sumption of combinational decoders tends to saturate as N
increases. In order to fully understand this behavior, a detailed
report for power characteristics of combinational decoders is
given in Table IV.
TABLE IV
POWER CONSUMPTION
| N | 2^6 | 2^7 | 2^8 | 2^9 | 2^10 |
|---|---|---|---|---|---|
| Stat. [nW] | 701.8 | 1198.7 | 2772.8 | 6131.2 | 14846.7 |
| Dyn. [mW] | 99.8 | 138.8 | 158.7 | 181.3 | 190.5 |
Table IV shows the power consumption in combinational
decoders in two parts: static and dynamic power. Static power
is due to the leakage currents in transistors when there is no
voltage change in the circuit. Therefore, it is proportional to
the number of transistors and capacitance in the circuit [23].
By observing the number of cells given in Table III, we can
verify that the static power consumption in Table IV approximately
doubles when N is doubled. On the other hand, dynamic power consumption
is related to the total charging and discharging capacitance
in the circuit and is defined as
$$P_{\text{dynamic}} = \alpha C V_{DD}^2 f_c, \tag{12}$$
where α, the activity factor, represents the average percentage of the circuit
that switches, C is the total load
capacitance, VDD is the drain voltage, and fc is the operating
frequency of the circuit [23]. The behavior of dynamic power
consumption given in Table IV can be explained as follows.
The total load capacitance of the circuit is approximately doubled
when N is doubled, since load capacitance is proportional
to the number of cells in the decoder. On the other hand, the
operating frequency of the circuit is approximately halved
when N is doubled, as discussed above. The activity
factor represents the switching percentage of load capacitance;
thus, it is not affected by changes in N . Therefore, the
product of these parameters is approximately the
same for decoders of
different block lengths.
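The saturation argument can be sketched numerically. The first part evaluates (12) with a hypothetical activity factor and capacitance (illustration only) and shows that doubling C while halving fc leaves the dynamic power unchanged. The second part uses cells × frequency from Table III as a rough stand-in for C·fc; treating the cell count as proportional to load capacitance follows the text, but the proxy itself is an assumption of this sketch.

```python
from math import isclose

def dynamic_power(alpha, cap_f, vdd_v, freq_hz):
    """Dynamic power from (12): alpha * C * VDD^2 * f_c."""
    return alpha * cap_f * vdd_v ** 2 * freq_hz

# Hypothetical activity factor and load capacitance (illustration only).
alpha, cap, vdd, freq = 0.1, 1e-9, 1.3, 45.5e6
p_n = dynamic_power(alpha, cap, vdd, freq)
p_2n = dynamic_power(alpha, 2 * cap, vdd, freq / 2)   # N doubled
assert isclose(p_n, p_2n)

# Number of cells (thousands) and frequency (MHz) from Table III, N = 2^6 .. 2^10.
cells_k = [24.3, 57.2, 127.5, 260.8, 554.3]
freq_mhz = [45.5, 22.2, 11.0, 5.2, 2.5]
proxy = [c * f for c, f in zip(cells_k, freq_mhz)]     # stand-in for C * f_c
print([round(p / proxy[0], 2) for p in proxy])
```

The proxy varies by less than 30% over a 16-fold increase in N, which is consistent with the flattening trend of the dynamic power in Table IV, although it does not reproduce the reported values exactly.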
The decoding period of a combinational decoder is almost
equally shared by the two combinational decoders for half
code length. During the first half of this period, the bit estimate
voltage levels at the output of the first decoder may vary
until they are stabilized. These variations cause the input LLR
values of the second decoder to change as they depend on
the partial-sums that are calculated from the outputs of the
first decoder. Therefore, the second decoder may consume
undesired power during the first half of decoding period. In
order to prevent this, the partial-sums are fed to the gN/2 block
through 2-input AND gates, the second input of which is given
as low during the first half of delay period and high during the
second half. This method can be recursively applied inside the
decoders for half code lengths in order to reduce the power
consumption further.
We have observed that small variations in timing constraints
may lead to significant changes in power consumption. More
precise figures about power consumption will be provided in
the future when an implementation of this design becomes
available.
3) Comparison With Other Polar Decoders: In order to
have a better understanding of decoder performance, we com-
pare the combinational decoder for N = 1024 with three state-
of-the-art decoders in Table V. We use standard conversion
formulas in [24] and [25] to convert all designs to 65 nm,
1.0 V for a fair (subject to limitations in any such study)
comparison.
TABLE V
COMPARISON WITH STATE-OF-THE-ART POLAR DECODERS
| | Comb. | [5] | [6] | [9]** (1.0 V) | [9]** (0.475 V) |
|---|---|---|---|---|---|
| Decoder Type | SC | SC | SC | BP | BP |
| Block Length | 1024 | 1024 | 1024 | 1024 | 1024 |
| Technology | 90 nm | 180 nm | 65 nm | 65 nm | 65 nm |
| Area [mm^2] | 3.213 | 1.71 | 0.68 | 1.476 | 1.476 |
| Voltage [V] | 1.3 | 1.3 | 1.2 | 1.0 | 0.475 |
| Freq. [MHz] | 2.5 | 150 | 1010 | 300 | 50 |
| Power [mW] | 190.7 | 67 | - | 477.5 | 18.6 |
| TP [Mb/s] | 2560 | 49† | 497 | 4676 | 779.3 |
| Engy.-per-bit [pJ/b] | 74.5 | 1370 | - | 102.1 | 23.8 |
| Hard. Eff. [Mb/s/mm^2] | 796 | 29* | 730* | 3168 | 528 |
| Converted to 65 nm, 1.0 V | | | | | |
| Area [mm^2] | 1.676 | 0.223 | 0.68 | 1.476 | 1.476 |
| Power [mW] | 81.5 | 14.3 | - | 477.5 | 82.4 |
| TP [Mb/s] | 3544 | 136 | 497 | 4676 | 779.3 |
| Engy.-per-bit [pJ/b] | 23.0 | 105.2 | - | 102.1 | 105.8 |
| Hard. Eff. [Mb/s/mm^2] | 2114 | 610 | 730 | 3168 | 528 |
* Not presented in the paper, calculated from the presented results
** Results are given for (1024, 512) code at 4dB SNR
† Information bit throughput for (1024, 512) code
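The technology conversion can be sketched as follows. The formulas of [24] and [25] are not restated in this paper; the scaling rules below (area ∝ s^2, throughput ∝ 1/s, power ∝ s (V′/V)^2, with s the ratio of target to original feature size) are an assumption that happens to reproduce the converted Comb. column of Table V, and the function name is illustrative.

```python
def convert(area_mm2, power_mw, tp_mbps, tech_nm, vdd_v,
            target_nm=65.0, target_v=1.0):
    """Scale reported results to a target technology node and supply voltage.

    Assumed rules: area ~ s^2, throughput ~ 1/s, power ~ s * (V'/V)^2,
    where s = target_nm / tech_nm.
    """
    s = target_nm / tech_nm
    area = area_mm2 * s ** 2
    power = power_mw * s * (target_v / vdd_v) ** 2
    tp = tp_mbps / s
    return {
        "area_mm2": area,
        "power_mw": power,
        "tp_mbps": tp,
        "energy_pj_per_bit": power / tp * 1e3,   # mW / (Mb/s) = nJ/b
        "hw_eff_mbps_per_mm2": tp / area,
    }

# Combinational decoder, N = 1024: 90 nm, 1.3 V figures from Table V.
comb = convert(3.213, 190.7, 2560, tech_nm=90, vdd_v=1.3)
for key, value in comb.items():
    print(f"{key:22s} {value:10.3f}")
```

Under these assumed rules the combinational decoder comes out at about 1.676 mm2, 81.5 mW, and 3.54 Gb/s, in agreement with the converted figures in Table V up to rounding.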
As seen from the technology-converted results in Table V,
the combinational decoder provides the highest throughput among
the state-of-the-art SC decoders. Combinational decoders are
composed of simple basic logic blocks with no storage elements
or control circuits. This helps to reduce the maximum
path delay of the decoder by removing delays from read/write
operations, setup/hold times, complex processing elements
and their management. Another factor that reduces the delay
is assigning a separate logic element to each decoding
operation, which allows simplifications such as the use of
comparators instead of adders for odd-indexed bit decisions.
Furthermore, the precomputation method reduces the delays of
addition/subtraction operations to that of multiplexers. These
factors give the combinational decoders a throughput advantage
even over fully-parallel SC decoders, and therefore also over
[5] and [6], which are semi-parallel decoders with slightly
higher latencies than fully-parallel decoders. The reduced
operating frequency, combined with simple basic logic blocks
and the lack of read, write, and control operations, gives
the combinational decoders a low power consumption.
The use of separate logic blocks for each computation in
the decoding algorithm and the precomputation method increase the
hardware consumption of combinational decoders. This can be
observed from the areas occupied by the three SC decoders. This
is an expected result due to the trade-off between throughput,
area, and power in digital circuits. However, the high throughput
of combinational decoders makes them hardware-efficient
architectures, as seen in Table V.
Implementation results for the BP decoder in [9] are given for
operating characteristics at 4 dB SNR, where the decoder
requires 6.57 iterations per codeword for low error rates. The
number of required iterations for BP decoders increases at
lower SNR values. Therefore, the throughput of the BP decoder
in [9] is expected to decrease while its power consumption
increases with respect to the results in Table V. On the other
hand, SC decoders operate with the same performance metrics
at all SNR values since the total number of calculations in the
conventional SC decoding algorithm is constant (N log2 N )
and independent of the number of errors in the received
codeword.
The performance metrics for the decoder in [9] are
given for low-power-low-throughput and high-power-high-
throughput modes. The power reduction in this decoder is
obtained by reducing the operating frequency and supply
voltage for the same architecture, which also leads to the
reduction in throughput. Table V shows that the throughput of
the combinational decoder is only lower than the throughput
of [9] when it is operated at high-power mode. In this
mode, [9] provides a throughput which is approximately 1.3
times larger than the throughput of combinational decoder,
while consuming 5.8 times more power. The advantage of
combinational decoders in power consumption can be seen
from the energy-per-bit characteristics of decoders in Table V.
The combinational decoder consumes the lowest energy per
decoded bit among the decoders in comparison.
4) Comparison With LDPC Decoders: A comparison of
combinational SC polar decoders with state-of-the-art LDPC
decoders is given in Table VI. The LDPC decoder presented
in [26] is a multirate decoder capable of operating with 4
different code rates. The LDPC decoder in [27] is a high
throughput LDPC decoder. It is seen from Table VI that
the throughputs of the LDPC decoders, given for 5 and 10 iterations
without early termination, are higher than that of the
combinational decoder. The throughput of the LDPC decoders is expected
to increase at higher and decrease at lower SNR values, as explained above. The power
consumption and area of the LDPC decoders are seen to be
higher than those of the combinational decoder.
An advantage of combinational architecture is that it pro-
vides a flexible architecture in terms of throughput, power
consumption, and area by its pipelined version. One can
increase the throughput of a combinational decoder by adding
any number of pipelining stages. This increases the operating
frequency and number of registers in the circuit, both of which
increase the dynamic power consumption in the decoder core
and storage parts of the circuit. The changes in throughput
and power consumption with the added registers can be esti-
mated using the characteristics of the combinational decoder.
Therefore, combinational architectures present an easy way
to control the trade-off between throughput, area, and power.
FPGA implementation results for pipelined combinational
decoders are given in the next section.
TABLE VI
COMPARISON WITH STATE-OF-THE-ART LDPC DECODERS
| | Comb.** | [26]* | [27] |
|---|---|---|---|
| Code/Decoder Type | Polar/SC | LDPC/BP | LDPC/BP |
| Block Length | 512 | 672 | 672 |
| Code Rate | Any | 1/2, 5/8, 3/4, 7/8 | 1/2 |
| Area [mm^2] | 0.79 | 1.56 | 1.60 |
| Power [mW] | 77.5 | 361† | 782.9†† |
| TP [Gb/s] | 3.72 | 5.79† | 9.0†† |
| Engy.-per-bit [pJ/b] | 20.8 | 62.4 | 89.5** |
| Hard. Eff. [Gb/s/mm^2] | 4.70 | 3.7 | 5.63** |
* Technology=65 nm, 1.0 V
** Technology converted to 65 nm, 1.0 V
† Results are given for (672, 588) code and 5 iterations without early
termination
†† Results are given for (672, 336) code and 10 iterations without early
termination
B. FPGA Implementation Results
Combinational architecture involves heavy routing due to
the large number of connected logic blocks. This increases
hardware resource usage and maximum path delay in FPGA
implementations, since routing is done through pre-fabricated
routing resources as opposed to ASIC. In this section, we
present FPGA implementations for the proposed decoders and
study the effects of this phenomenon.
Table VII shows the place-and-route results of combina-
tional and pipelined combinational decoders on Xilinx Virtex-
6-XC6VLX550T (40 nm) FPGA core. The implementation
strategy is adjusted to increase the speed of the designs. We
use RAM blocks to store the input LLRs, frozen bit indicators,
and output bits in the decoders. FFs in combinational decoders
are used for small logic circuits and fetching the RAM outputs,
whereas in pipelined decoder they are also used to store the
input LLRs and partial-sums for the second decoding function
(Fig. 4). It is seen that the throughputs of combinational
decoders in FPGA drop significantly with respect to their
ASIC implementations. This is due to the high routing delays
in FPGA implementations of combinational decoders, which
account for up to 90% of the overall delay.
Pipelined combinational decoders are able to obtain
throughputs on the order of Gb/s with an increase in the
number of FFs used. Pipelining stages can be increased further
to increase the throughput with a penalty of increasing FF
usage. The results in Table VII show that we can double the
throughput of combinational decoder for every N by one stage
of pipelining as expected.
The error rate performance of combinational decoders is
given in Fig. 10 for different block lengths and rates. The
investigated code rates are commonly used in various wireless
communication standards (e.g., WiMAX, IEEE 802.11n). It is
seen from Fig. 10 that the decoders can achieve very low error
rates without any error floors.
C. Throughput Analysis for Hybrid-Logic Decoders
As explained in Section III-B3, a combinational decoder
can be combined with a synchronous decoder to increase its
throughput by a factor g(N,N ′) as in (5). In this section, we
present analytical calculations for the throughput of a hybrid-logic
decoder. We consider the semi-parallel architecture in
[21] as the synchronous decoder part and use the implementation
results given in that paper for the calculations.
[Figure 10: error rates (FER and BER) versus Eb/No; curves for N = 1024, R = 1/2 and N = 512, R = 5/6.]
Fig. 10. FER performance of combinational decoders for different block
lengths and rates
A semi-parallel SC decoder employs P processing elements,
each of which is capable of performing the operations (2)
and (3) and performs one of them in one clock cycle. The
architecture is called semi-parallel since P can be chosen
smaller than the number of possible parallel calculations
in early stages of decoding. The latency of a semi-parallel
architecture, in clock cycles, is given by
$$L_{SP}(N,P) = 2N + \frac{N}{P}\log_2\left(\frac{N}{4P}\right). \tag{13}$$
The minimum latency that can be obtained with the semi-parallel
architecture by increasing hardware usage is 2N − 2,
the latency of a conventional SC algorithm, when P = N/2.
Throughput of a semi-parallel architecture is its maximum
operating frequency divided by its latency. Since the 2N term
dominates (13) regardless of P , using
N/2 processing elements does not provide a significant multiplicative
gain for the throughput of the decoder.
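The latency expression (13) is easy to evaluate directly. In the sketch below, the bit throughput is taken as N bits per LSP(N,P) clock cycles; this convention is an assumption, chosen because it agrees with the TPSP column of Table VIII. Function names are illustrative.

```python
from math import log2

def latency_sp(n, p):
    """Latency in clock cycles of a semi-parallel SC decoder, from (13)."""
    return 2 * n + (n / p) * log2(n / (4 * p))

def throughput_sp_mbps(n, p, freq_mhz):
    """Bit throughput in Mb/s: N bits per L_SP(N, P) cycles (assumed convention)."""
    return n * freq_mhz / latency_sp(n, p)

# With P = N/2 the latency reduces to 2N - 2, as stated in the text.
assert latency_sp(1024, 512) == 2 * 1024 - 2

# N = 1024, P = 64, f = 173 MHz (values from Table VIII).
print(latency_sp(1024, 64))                          # 2080.0
print(round(throughput_sp_mbps(1024, 64, 173), 1))   # 85.2
```

Going from P = 64 to P = N/2 = 512 only lowers the latency from 2080 to 2046 cycles, which illustrates why extra processing elements alone bring little throughput gain.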
We can approximately calculate the throughput
of a hybrid-logic decoder with semi-parallel architecture using
the implementation results given in [21]. Implementations in
[21] are done using a Stratix IV FPGA, which has a similar technology
to the Virtex-6 FPGA used in this work. Table VIII gives
these calculations and comparisons with the performance of the
semi-parallel decoder.
Table VIII shows that the throughput of a hybrid-logic decoder
is significantly better than the throughput of a semi-parallel
decoder. It is also seen that the multiplicative gain increases as
the size of the combinational decoder increases. This increase
is dependent on P , as P determines the decoding stage after
which the number of parallel calculations becomes smaller than
the hardware resources and causes the throughput bottleneck.
It should be noted that the gain will be smaller for decoders
that spend fewer clock cycles in the final stages of the decoding trellis,
such as [28] and [29]. The same method can be used in ASIC
to obtain a high increase in throughput.
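The hybrid-logic throughput follows by scaling the semi-parallel throughput with the gain factor g(N,N ′). The sketch below takes TPSP and g as reported in Table VIII rather than recomputing g from (5), and compares the product with the TPHLSP column; variable names are illustrative.

```python
# (N, N', TP_SP [Mb/s], g, TP_HLSP [Mb/s]) rows as reported in Table VIII.
rows = [
    (2**10, 2**4, 85, 5.90, 501),
    (2**10, 2**5, 85, 6.50, 552),
    (2**10, 2**6, 85, 7.22, 613),
    (2**11, 2**4, 83, 5.70, 473),
    (2**11, 2**5, 83, 6.23, 517),
    (2**11, 2**6, 83, 7.27, 603),
]

for n, n_comb, tp_sp, g, tp_reported in rows:
    tp_hybrid = g * tp_sp                      # throughput scaled by the gain
    # The product should match the reported value up to rounding.
    assert abs(tp_hybrid - tp_reported) < 1.0
    print(f"N={n:5d}  N'={n_comb:3d}  g={g:.2f}  TP_HLSP ~ {tp_hybrid:.1f} Mb/s")
```

The products match the reported TPHLSP values to within 1 Mb/s, the difference being due to rounding in the table.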
Hybrid-logic decoders are especially useful for decoding
large codewords, for which the hardware usage is high for
combinational architecture and latency is high for synchronous
decoders.
TABLE VII
FPGA IMPLEMENTATION RESULTS
| N | Comb. LUT | Comb. FF | Comb. RAM (bits) | Comb. TP [Gb/s] | Pipelined LUT | Pipelined FF | Pipelined RAM (bits) | Pipelined TP [Gb/s] | TP Gain |
|---|---|---|---|---|---|---|---|---|---|
| 2^4 | 1479 | 169 | 112 | 1.05 | 777 | 424 | 208 | 2.34 | 2.23 |
| 2^5 | 1918 | 206 | 224 | 0.88 | 2266 | 568 | 416 | 1.92 | 2.18 |
| 2^6 | 5126 | 392 | 448 | 0.85 | 5724 | 1166 | 832 | 1.80 | 2.11 |
| 2^7 | 14517 | 783 | 896 | 0.82 | 13882 | 2211 | 1664 | 1.62 | 1.97 |
| 2^8 | 35152 | 1561 | 1792 | 0.75 | 31678 | 5144 | 3328 | 1.58 | 2.10 |
| 2^9 | 77154 | 3090 | 3584 | 0.73 | 77948 | 9367 | 6656 | 1.49 | 2.04 |
| 2^10 | 193456 | 6151 | 7168 | 0.60 | 190127 | 22928 | 13312 | 1.24 | 2.06 |
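TABLE VIII
APPROXIMATE THROUGHPUT INCREASE FOR SEMI-PARALLEL SC
DECODER
| N | P | f [MHz] | TPSP [Mb/s] | N ′ | g | TPHLSP [Mb/s] |
|---|---|---|---|---|---|---|
| 2^10 | 64 | 173 | 85 | 2^4 | 5.90 | 501 |
| 2^10 | 64 | 173 | 85 | 2^5 | 6.50 | 552 |
| 2^10 | 64 | 173 | 85 | 2^6 | 7.22 | 613 |
| 2^11 | 64 | 171 | 83 | 2^4 | 5.70 | 473 |
| 2^11 | 64 | 171 | 83 | 2^5 | 6.23 | 517 |
| 2^11 | 64 | 171 | 83 | 2^6 | 7.27 | 603 |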
V. CONCLUSION
In this paper, we proposed a combinational architecture
for SC polar decoders with high throughput and low power
consumption. The proposed combinational SC decoder op-
erates at much lower clock frequencies compared to typical
synchronous SC decoders and decodes a codeword in one
long clock cycle. Due to the low operating frequency, the
combinational decoder consumes less dynamic power, which
reduces the overall power consumption.
Post-synthesis results showed that the proposed combina-
tional architectures are capable of providing a throughput of
approximately 2.5 Gb/s with a power consumption of 190 mW
for a 90 nm 1.3 V technology. These figures are independent
of the SNR level at the decoder input. We gave analytical
formulas for the complexity and delay of the proposed combi-
national decoders that verify the implementation results, and
provided a detailed power analysis for the ASIC design. We
also showed that one can add pipelining stages at any desired
depth to this architecture in order to increase its throughput
at the expense of increased power consumption and hardware
complexity.
We also proposed a hybrid-logic SC decoder architecture
that combined the combinational SC decoder with a syn-
chronous SC decoder so as to extend the range of applicability
of the purely combinational design to larger block lengths. In
the hybrid structure, the combinational part acts as an acceler-
ator for the synchronous decoder in improving the throughput
while keeping complexity under control. The conclusion we
draw is that the proposed combinational SC decoders offer a
fast, energy-efficient, and flexible alternative for implementing
polar codes.
ACKNOWLEDGMENT
This work was supported by the FP7 Network of Excellence
NEWCOM# under grant agreement 318306. The authors ac-
knowledge O. Arıkan, A. Z. Alkar, and A. Atalar for the useful
discussions and support during the course of this work. The
authors are also grateful to the reviewers for their constructive
suggestions and comments.
REFERENCES
[1] E. Arıkan, “Channel polarization: a method for constructing capacity-
achieving codes for symmetric binary-input memoryless channels,” IEEE
Trans. Inform. Theory, vol. 55, no. 7, pp. 3051–3073, July 2009.
[2] E. Arıkan, “Polar codes: A pipelined implementation,” in Proc. Int.
Symp. Broadband Communication (ISBC2010), Melaka, Malaysia,
2010.
[3] C. Leroux, I. Tal, A. Vardy, and W. J. Gross, “Hardware architectures
for successive cancellation decoding of polar codes,” 2010. [Online].
Available: http://arxiv.org/abs/1011.2919
[4] A. Pamuk, “An FPGA implementation architecture for decoding of polar
codes,” in Proc. 8th Int. Symp. Wireless Comm. (ISWCS), pp. 437–441,
2011.
[5] A. Mishra, A. Raymond, L. Amaru, G. Sarkis, C. Leroux, P. Meinerzha-
gen, A. Burg, and W. Gross, “A successive cancellation decoder ASIC
for a 1024-bit polar code in 180nm CMOS,” in IEEE Asian Solid State
Circuits Conf. (A-SSCC), Nov. 2012, pp. 205–208.
[6] Y. Fan and C.-Y. Tsui, “An efficient partial-sum network architecture for
semi-parallel polar codes decoder implementation,” IEEE Trans. Signal
Process., vol. 62, no. 12, pp. 3165–3179, June 2014.
[7] E. Arikan, “A performance comparison of polar codes and Reed-Muller
codes,” IEEE Commun. Letters, vol. 12, no. 6, pp. 447–449, June 2008.
[8] B. Yuan and K. Parhi, “Architectures for polar BP decoders using
folding,” in IEEE Int. Symp. Circuits Syst. (ISCAS), June 2014, pp. 205–
208.
[9] Y. S. Park, Y. Tao, S. Sun, and Z. Zhang, “A 4.68Gb/s belief propagation
polar decoder with bit-splitting register file,” in Symp. VLSI Circuits Dig.
of Tech. Papers, June 2014, pp. 1–2.
[10] M. Plotkin, “Binary codes with specified minimum distance,” IRE Trans.
Inform. Theory, vol. 6, no. 4, pp. 445–450, Sept. 1960.
[11] G. Schnabl and M. Bossert, “Soft-decision decoding of Reed-Muller
codes as generalized multiple concatenated codes,” IEEE Trans. Inform.
Theory, vol. 41, no. 1, pp. 304–308, Jan. 1995.
[12] I. Dumer and K. Shabunov, “Recursive decoding of Reed-Muller codes,”
in Proc. IEEE Int. Symp. Inform. Theory (ISIT), 2000, pp. 63–.
[13] A. Alamdar-Yazdi and F. Kschischang, “A simplified successive-
cancellation decoder for polar codes,” IEEE Commun. Letters, vol. 15,
no. 12, pp. 1378–1380, Dec. 2011.
[14] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. Gross, “Fast polar
decoders: algorithm and implementation,” IEEE J. Sel. Areas Commun.,
vol. 32, no. 5, pp. 946–957, May 2014.
[15] I. Tal and A. Vardy, “List decoding of polar codes,” in Proc. IEEE Int.
Symp. Inform. Theory (ISIT), July 2011, pp. 1–5.
[16] I. Dumer and K. Shabunov, “Soft-decision decoding of Reed-Muller
codes: recursive lists,” IEEE Trans. Inform. Theory, vol. 52, no. 3, pp.
1260–1266, Mar. 2006.
[17] B. Yuan and K. Parhi, “Low-latency successive-cancellation list decoders
for polar codes with multibit decision,” IEEE Trans. Very Large Scale
Integration (VLSI) Syst., vol. 23, no. 10, pp. 2268–2280, Oct. 2015.
[18] C. Zhang and K. Parhi, “Low-latency sequential and overlapped archi-
tectures for successive cancellation polar decoder,” IEEE Trans. Signal
Process., vol. 61, no. 10, pp. 2429–2441, May 2013.
[19] P. Giard, G. Sarkis, C. Thibeault, and W. J. Gross, “Unrolled polar
decoders, part I: hardware architectures,” 2015. [Online]. Available:
http://arxiv.org/abs/1505.01459
[20] C. Zhang and K. Parhi, “Interleaved successive cancellation polar
decoders,” in Proc. IEEE Int. Symp. Circuits and Syst. (ISCAS), June
2014, pp. 401–404.
[21] C. Leroux, A. Raymond, G. Sarkis, and W. Gross, “A semi-parallel
successive-cancellation decoder for polar codes,” IEEE Trans. Signal
Process., vol. 61, no. 2, pp. 289–299, Jan. 2013.
[22] A. Raymond and W. Gross, “A scalable successive-cancellation decoder
for polar codes,” IEEE Trans. Signal Process., vol. 62, no. 20, pp. 5339–
5347, Oct. 2014.
[23] N. Weste and D. Harris, Integrated Circuit Design. Pearson, 2011.
[24] C.-C. Wong and H.-C. Chang, “Reconfigurable turbo decoder with
parallel architecture for 3GPP LTE system,” IEEE Trans. Circuits and Syst.
II, Exp. Briefs, vol. 57, no. 7, pp. 566–570, July 2010.
[25] A. Blanksby and C. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density
parity-check code decoder,” IEEE J. Solid-State Circuits, vol. 37,
no. 3, pp. 404–412, Mar. 2002.
[26] S.-W. Yen, S.-Y. Hung, C.-L. Chen, H.-C. Chang, S.-J. Jou, and
C.-Y. Lee, “A 5.79-Gb/s energy-efficient multirate LDPC codec chip
for IEEE 802.15.3c applications,” IEEE J. Solid-State Circuits, vol. 47,
no. 9, pp. 2246–2257, Sep. 2012.
[27] Y. S. Park, “Energy-efficient decoders of near-capacity channel codes,”
Ph.D. dissertation, Univ. of Michigan, Ann Arbor, 2014.
[28] A. Pamuk and E. Arikan, “A two phase successive cancellation decoder
architecture for polar codes,” in Proc. IEEE Int. Symp. Inform. Theory
(ISIT), July 2013, pp. 957–961.
[29] B. Yuan and K. Parhi, “Low-latency successive-cancellation polar de-
coder architectures using 2-bit decoding,” IEEE Trans. Circuits Syst. I,
Reg. Papers, vol. 61, no. 4, pp. 1241–1254, Apr. 2014.
Onur Dizdar (S’10) was born in Ankara, Turkey,
in 1986. He received the B.S. and M.S. degrees
in electrical and electronics engineering from the
Middle East Technical University, Ankara, Turkey
in 2008 and 2011. He is currently a Ph.D. candidate
in the Department of Electrical and Electronics Engi-
neering, Bilkent University, Ankara, Turkey. He also
works as a Senior Design Engineer in ASELSAN,
Turkey.
Erdal Arıkan (S’84–M’79–SM’94–F’11) was
born in Ankara, Turkey, in 1958. He received the
B.S. degree from the California Institute of Tech-
nology, Pasadena, CA, in 1981, and the S.M. and
Ph.D. degrees from the Massachusetts Institute of
Technology, Cambridge, MA, in 1982 and 1985, re-
spectively, all in Electrical Engineering. Since 1987
he has been with the Electrical-Electronics Engi-
neering Department of Bilkent University, Ankara,
Turkey, where he works as a professor. He is the
recipient of the 2010 IEEE Information Theory Society
Paper Award and the 2013 IEEE W.R.G. Baker Award, both for his work on
polar coding.
