Scalable Successive-Cancellation Hardware
Decoder for Polar Codes
Alexandre J. Raymond and Warren J. Gross
Department of Electrical and Computer Engineering
McGill University
Montréal, Québec, Canada
alexandre.raymond@mail.mcgill.ca, warren.gross@mcgill.ca
Abstract—Polar codes, discovered by Arıkan, are the first
error-correcting codes with an explicit construction to provably
achieve channel capacity, asymptotically. However, their error-
correction performance at finite lengths tends to be lower than
existing capacity-approaching schemes. Using the successive-
cancellation algorithm, polar decoders can be designed for very
long codes, with low hardware complexity, leveraging the regular
structure of such codes. We present an architecture and an
implementation of a scalable hardware decoder based on this
algorithm. This design is shown to scale to code lengths of up
to $N = 2^{20}$ on an Altera Stratix IV FPGA, limited almost
exclusively by the amount of available SRAM.
Index Terms—Error-correcting codes, polar codes, successive-
cancellation decoding, hardware implementation.
I. INTRODUCTION
Since their introduction in 2008, polar codes [1] have
attracted a lot of attention from the information theory com-
munity, as they are the first codes to provably achieve channel
capacity, asymptotically in code length.
Although initially only defined for the binary erasure chan-
nel (BEC), they were later extended to other models, such as
the additive white Gaussian noise (AWGN) channel [2].
Their recursive construction was shown to support low-
complexity implementations of the successive-cancellation
(SC) algorithm in hardware [3][4]. Those low-complexity
decoders can in turn be used as components in more complex
schemes, such as list decoding [5][6] and concatenated coding
[7], which improve the error-correction performance of polar
codes at finite lengths.
The remainder of this paper is structured as follows. Sec-
tion II provides background information on polar codes and SC
decoding. Then, Section III details the proposed architecture.
Section IV analyzes FPGA implementation results, while
Section V concludes this work.
II. BACKGROUND
Polar codes are a class of linear block codes based on a
recursive definition. They are constructed using a generator
matrix $G$, obtained from the base matrix
\[ F_2 = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}, \]
using
\[ G = F_N \triangleq F_2^{\otimes n}, \]
where $N = 2^n$ is the code length, and $\otimes$ represents the
Kronecker product. In this paper, we use $u$ to denote an
information vector, $x$ for a codeword, $y$ for a received vector,
and $\hat{u}$ for the information vector estimated by the decoder.
[Figure 1: error rate ($10^{-6}$ to $10^{0}$, logarithmic scale) versus $E_b/N_0$ (1 to 2 dB); FER and BER curves for MSA with (6,3,2) quantization, floating-point MSA, and floating-point SPA.]
Fig. 1. Performance of an $N = 2^{15}$, $R = 0.50$ polar code optimized for a
frame error rate of $10^{-5}$.
These codes can be decoded using a recursive, multi-stage
structure featuring $n$ stages of $N/2$ nodes, yielding a
complexity of $O(N \log N)$ [1].
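As a minimal sketch of the construction above (not taken from the paper), the following Python builds $G$ by repeated Kronecker products and compares it with an in-place butterfly encoder that uses $n$ stages of $N/2$ XOR operations, which is the structure behind the $O(N \log N)$ complexity. All function names are illustrative assumptions.

```python
# Illustrative sketch; function names are assumptions, not from the paper.
F2 = [[1, 0], [1, 1]]


def kron(a, b):
    """Kronecker product of two binary matrices stored as lists of rows."""
    return [[x & y for x in row_a for y in row_b] for row_a in a for row_b in b]


def generator(n):
    """Return G = F2 Kronecker-multiplied with itself n times (size 2^n)."""
    g = [[1]]
    for _ in range(n):
        g = kron(g, F2)
    return g


def encode_matrix(u):
    """Compute x = uG over GF(2) by direct matrix multiplication, O(N^2)."""
    size = len(u)
    g = generator(size.bit_length() - 1)  # len(u) must be a power of two
    return [sum(u[i] & g[i][j] for i in range(size)) % 2 for j in range(size)]


def encode_butterfly(u):
    """Compute the same x with n stages of N/2 XORs, O(N log N)."""
    x = list(u)
    half = 1
    while half < len(x):
        for start in range(0, len(x), 2 * half):
            for j in range(start, start + half):
                x[j] ^= x[j + half]
        half *= 2
    return x
```

Both encoders implement the same linear map, so they can be cross-checked on any vector whose length is a power of two.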
To simplify their implementation in hardware, decoding can
be carried out in the log-likelihood-ratio (LLR) domain, where
the SC equations become the standard sum-product algorithm
(SPA) equations, which can be approximated using the well-
known min-sum algorithm (MSA) [8]:
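\[ \lambda_f(\lambda_a, \lambda_b) \approx \operatorname{sign}(\lambda_a)\,\operatorname{sign}(\lambda_b)\,\min(|\lambda_a|, |\lambda_b|), \tag{1} \]
\[ \lambda_g(\hat{s}, \lambda_a, \lambda_b) = \lambda_a (-1)^{\hat{s}} + \lambda_b, \tag{2} \]
where $\hat{s}$ designates a partial sum. This approximation yields
a performance degradation of about 0.1 dB relative to SPA, as illustrated
in Figure 1, although this gap tends to shrink for higher-rate
($R = k/N$) codes.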
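Equations (1) and (2) can be sketched in a few lines of Python. The exact SPA form of $\lambda_f$ used for comparison, $2\tanh^{-1}(\tanh(\lambda_a/2)\tanh(\lambda_b/2))$, is the standard expression and is not printed in this paper; the function names are assumptions made for illustration.

```python
import math


def f_spa(la, lb):
    """Exact sum-product f function (standard form, assumed for comparison).

    Only suitable for moderate LLR magnitudes: math.atanh raises
    ValueError if the tanh product rounds to +/-1.
    """
    return 2.0 * math.atanh(math.tanh(la / 2.0) * math.tanh(lb / 2.0))


def f_msa(la, lb):
    """Min-sum approximation of f, as in equation (1)."""
    sign = -1.0 if (la < 0) != (lb < 0) else 1.0
    return sign * min(abs(la), abs(lb))


def g(s_hat, la, lb):
    """g function of equation (2): la * (-1)^s_hat + lb."""
    return lb - la if s_hat else lb + la
```

The min-sum output always has the same sign as the SPA output and a magnitude at least as large, which is the source of the small performance gap mentioned above.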
arXiv:1306.3529v1 [cs.IT] 14 Jun 2013
[Figure 2: block diagram showing the channel buffer, two channel SRAMs, two LLR SRAMs and two $\hat{s}$ SRAMs (the latter two pairs with bypass paths), $P$ decoding PEs, the chained PE, $P/2$ encoding processing elements, and the frozen-bit ROM, with input $\lambda_{in}$ and outputs $\lambda_{out\{i,i+1\}}$ and $\hat{u}_{\{i,i+1\}}$.]
Fig. 2. Block diagram of the improved decoder architecture.
III. ARCHITECTURE
The architecture presented in this paper is based on the
semi-parallel decoder of [3], and introduces modifications
aiming to improve its scalability with respect to code length.
This decoder uses a fixed datapath, and operates under resource
constraints, where only $P \ll N/2$ processing elements
(PE) are implemented. This limitation, however, only impacts
throughput minimally [8].
Figure 2 provides a top-level overview of the redesigned
decoder architecture, while its various changes are discussed
in the following sections.
A. Memory Improvements
Unlike [3], which makes use of a single SRAM to store all
LLRs, this improved architecture relies on two separate types
of memories: channel and internal. This separation allows full-
throughput operation of the decoder by supporting the loading
of a subsequent frame into the channel memory, without write
contention, while the previous one is still being processed.
This is made possible by the fact that, per the structure of the
decoding graph, channel LLRs are not directly required in the
second half of the decoding process, i.e., after bit $i = N/2$ and
stage $l = n - 1$.
Furthermore, the improved design does away with asymmetric
read/write ports in its SRAMs. Those memories are
replaced by pairs of $P$-LLR-wide SRAMs. The outputs of each pair are
concatenated into $2P$-LLR words consumed by the processing
elements, and the $P$-LLR outputs of the processing elements are written
to each SRAM of the pair in sequence. Note that the & operator used in Figure 2
symbolizes concatenation, with sign extension if needed.
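The paired-memory idea can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the class name, the alternating write order, and the shared read address are hypothetical choices made for illustration, and the paper does not specify the actual addressing scheme.

```python
class PairedLLRMemory:
    """Behavioural model of two P-LLR-wide SRAMs with symmetric ports.

    Assumption: writes alternate between bank 0 and bank 1, and a read
    returns the words stored at the same address in both banks.
    """

    def __init__(self, depth, p):
        self.p = p
        self.banks = [[[0] * p for _ in range(depth)] for _ in range(2)]
        self._next_bank = 0
        self._next_addr = 0

    def write(self, word):
        """Store one P-LLR word, alternating between the two banks."""
        if len(word) != self.p:
            raise ValueError("word must contain exactly P LLRs")
        self.banks[self._next_bank][self._next_addr] = list(word)
        if self._next_bank == 1:
            self._next_addr += 1
        self._next_bank ^= 1

    def read(self, addr):
        """Return the 2P-LLR word formed by concatenating both banks."""
        return self.banks[0][addr] + self.banks[1][addr]
```

In this model every port is $P$ LLRs wide, yet the processing elements still receive $2P$ operands per read, which is the property the paired SRAMs are meant to provide.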
B. Quantization
The separation of the channel and internal memories, de-
scribed in Section III-A, also makes it possible to use distinct
quantization levels for each memory. This enhancement is
suggested by a characteristic of the successive-cancellation
algorithm, namely that (2) affects the range of the compu-
tations in each successive stage, while their precision remains
unchanged by both operations. It follows that the values
processed by lower-indexed stages require more range than
those in the higher ones.
[Figure 3: encoding graph for $N = 8$ with encoding stages $l_{enc} = 0, 1, 2$, built from $\hat{f}$ and $\hat{g}$ nodes; the estimates $\hat{u}$ and $\hat{x}$ are both listed in the order 0, 4, 2, 6, 1, 5, 3, 7, and the partial sums are labelled $\hat{s}_{l,j}$ for $l = 0, 1, 2$ and $j = 0, \ldots, 3$.]
Fig. 3. Encoding graph for the partial sums.
Since the decoder must retain an entire $N$-LLR frame in
memory, the channel SRAMs account for nearly half of the
decoder's soft information storage requirements [3]. A lower
quantization for this memory therefore reduces the decoder
area significantly.
Quantization is denoted using the shorthand $(Q_i, Q_{ic}, Q_f)$,
which indicates the number of integer bits for internal LLRs,
integer bits for channel LLRs, and fractional bits for both
types, respectively; $Q = Q_i + Q_f$ and $Q_c = Q_{ic} + Q_f$ are
also used to refer to the total number of quantization bits in
each case.
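To make the shorthand concrete, the sketch below quantizes an LLR to a fixed-point value with saturation. It assumes a signed two's-complement format in which the sign bit is counted among the $Q$ (or $Q_c$) bits; the paper does not state this convention, and the helper names are hypothetical.

```python
def widths(q_i, q_ic, q_f):
    """Return (Q, Qc): total bits for internal and channel LLRs."""
    return q_i + q_f, q_ic + q_f


def quantize(value, total_bits, frac_bits):
    """Saturating fixed-point quantization of one LLR.

    Assumption: signed two's-complement codes of `total_bits` bits with
    `frac_bits` fractional bits. Python's round() rounds ties to even.
    """
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = round(value * scale)
    code = max(lowest, min(highest, code))  # saturate instead of wrapping
    return code / scale
```

With (6, 3, 2), `widths` gives $Q = 8$ and $Q_c = 5$, so under these assumptions a channel LLR of 40.0 saturates at 3.75 while the same value stored as an internal LLR saturates at 31.75.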
Simulations showed that full-range quantization does not
benefit error-correction performance; much lower levels can
match a floating-point implementation. Specifically, we carried
out those simulations for codes of length $N = 2^{15}$, with
$R \in \{0.25, 0.50, 0.75, 0.90\}$; results are summarized in Table I. We
found that, for those codes, 6 to 8 bits of quantization suffice for
good error-correction performance (within about 0.1 dB of floating-point
MSA), depending on their rate, as shown in Figure 1.
We also noticed that higher-rate codes tend to require fewer
bits of fractional precision and integer range for internal LLRs,
but more bits of integer range for channel ones.
TABLE I
QUANTIZATION REQUIRED FOR GOOD ERROR-CORRECTION PERFORMANCE OF $N = 2^{15}$ CODES USING MSA.
$R$                    0.25       0.50       0.75       0.90
$(Q_i, Q_{ic}, Q_f)$   (6, 3, 2)  (6, 3, 2)  (6, 4, 1)  (6, 4, 0)
C. Chained PE
This architecture makes use of a chained PE in stage 0,
carrying out functions $\lambda_f$ and $\lambda_g$ in a single clock cycle
(CC). The concept behind this improvement was introduced
in [9], while the restricted implementation used in this paper,
targeting only stage 0, was independently proposed in [4].
This chained PE relies on the specific schedule of the polar
decoding graph, in which stage 0 is always activated twice
in a row, using the same operands: first for function $\lambda_f$, and
then for function $\lambda_g$, using the result of $\lambda_f$. By chaining both
operations in a special PE, we can output two decoded bits
$\hat{u}_{\{i,i+1\}}$ at once, yielding an $(N/2)$-CC reduction in decoding
latency.
This behavior is illustrated in Table III, specifically in
clock cycles {3, 6, 13, 16}. In those cases, the computations of
functions $\lambda_f$ and $\lambda_g$ are performed in the same clock cycle,
yielding two decoded bits simultaneously.
The chained PE does not incur any overhead over the regular
PE. The data dependency between functions $\lambda_f$ and
$\lambda_g$, satisfied by the sign of $\lambda_f$, occurs late in the processing
of $\lambda_g$, and can be computed very rapidly.
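A behavioural sketch of the chained PE is given below. It assumes the usual hard-decision convention (a negative LLR decodes to 1, a frozen bit is forced to 0) and computes both candidate $\lambda_g$ values before selecting one from the first decision, which mirrors the late data dependency described above. The function names and the frozen-bit arguments are hypothetical.

```python
def f_msa(la, lb):
    """Min-sum f function, as in equation (1)."""
    sign = -1.0 if (la < 0) != (lb < 0) else 1.0
    return sign * min(abs(la), abs(lb))


def chained_pe(la, lb, frozen_first=False, frozen_second=False):
    """Decode two stage-0 bits from one pair of LLRs in a single step.

    Assumptions: hard decision is 1 for a negative LLR, and frozen bits
    are fixed to 0.
    """
    # Both g candidates are prepared without waiting for the f result.
    g_if_zero = lb + la
    g_if_one = lb - la

    # The first decision needs only the sign of f (unless the bit is frozen).
    u_first = 0 if frozen_first else int(f_msa(la, lb) < 0)

    # Late selection of the g result, then the second decision.
    lg = g_if_one if u_first else g_if_zero
    u_second = 0 if frozen_second else int(lg < 0)
    return u_first, u_second
```

Because the selection happens only after both sums are available, the additional logic on top of a regular PE in this model is a single multiplexer controlled by the first decoded bit.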
D. Semi-Parallel Partial-Sum Encoder
The main factor limiting the scalability of [3] is the growing
complexity of its partial-sum update logic. In this paper, we
introduce an encoder-based alternative inspired by the design
of [9], which proposed a fully-parallel partial-sum computation
module. Our implementation extends this encoder, adapting it
to a novel semi-parallel architecture. This architecture operates
over multiple clock cycles and uses a fixed datapath, removing
it from the decoder’s critical path altogether.
This encoder is triggered after decoding stage 0, and processes
two decoded bits at a time. Figure 3 illustrates its
structure, a mirrored version of the decoding graph, in which
the $\hat{f}$ nodes are defined as binary additions (XOR), and the $\hat{g}$
nodes, as pass-through connections:
\[ \hat{f}(\hat{u}_a, \hat{u}_b) = \hat{u}_a \oplus \hat{u}_b, \tag{3} \]
\[ \hat{g}(\hat{u}_a) = \hat{u}_a. \tag{4} \]
As in the decoding graph, the nodes are associated into N/2
pairs per stage. Those pairs are processed by the P/2 encoding
PEs.
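The pairing of nodes and the sharing of $P/2$ encoding PEs can be sketched as follows. This is an illustrative model only: it assumes that each PE takes a pair $(\hat{u}_a, \hat{u}_b)$, applies $\hat{f}$ to both inputs and $\hat{g}$ to the second one, and that pairs are processed in order, $P/2$ per clock cycle. The wiring of the actual graph in Figure 3 is not reproduced here.

```python
def encoding_pe(u_a, u_b):
    """One encoding PE: returns (f_hat(u_a, u_b), g_hat(u_b)).

    Assumption: the pass-through g_hat node forwards the second input.
    """
    return u_a ^ u_b, u_b


def encode_pairs(pairs, p):
    """Process node pairs with P/2 encoding PEs per clock cycle.

    Returns the list of PE outputs and the number of clock cycles used.
    """
    pes_per_cycle = p // 2
    outputs = []
    cycles = 0
    for start in range(0, len(pairs), pes_per_cycle):
        chunk = pairs[start:start + pes_per_cycle]
        outputs.extend(encoding_pe(a, b) for a, b in chunk)
        cycles += 1
    return outputs, cycles
```

With $P = 2$ there is a single encoding PE, so a stage activation that covers two pairs takes two clock cycles in this model, consistent with the two-cycle $l_{enc} = 1$ activations of Table III.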
In order to make the design scalable, a semi-parallel archi-
tecture was chosen for the encoder. Since the encoding graph
mirrors the decoding graph, their schedules are very similar.
The encoding schedule is illustrated in Table III, where $e$
denotes the activation of encoding stage $l_{enc}$. Due to the
semi-parallel nature of the encoder, stages which are handled in
multiple clock cycles are denoted using a subscript, e.g., $e_0$.
TABLE II
FPGA IMPLEMENTATION RESULTS TARGETING THE ALTERA STRATIX IV GX EP4SGX530KH40C2.
N         R     P   Qtz.     LUT      FF       SRAM (bits)  fmax (MHz)  T/P (Mbps)
$2^{15}$  0.25  64  (6,3,2)  4,161    1,629    510,464      156         15
$2^{15}$  0.50  64  (6,3,2)  4,161    1,629    510,464      156         29
$2^{15}$  0.75  64  (6,4,1)  3,731    1,496    477,440      155         43
$2^{15}$  0.90  64  (6,4,0)  3,263    1,304    411,648      167         56
$2^{16}$  -     64  (6,4,0)  3,414    1,316    821,248      157         57R
$2^{18}$  -     64  (6,4,0)  3,548    1,349    3,278,848    140         51R
$2^{20}$  -     64  (6,4,0)  5,956    1,366    13,109,248   102         38R
$2^{15}$  -     64  (7,4,0)  3,927    1,427    444,672      153         57R
$2^{15}$  -     64  (8,4,0)  4,141    1,569    477,696      154         57R
$2^{15}$  -     64  (9,4,0)  4,673    1,689    510,720      159         59R
$2^{15}$  -     64  (7,3,0)  3,725    1,365    411,904      153         57R
$2^{15}$  -     64  (7,4,0)  3,927    1,427    444,672      153         57R
$2^{15}$  -     64  (7,5,0)  3,731    1,496    477,440      155         57R
$2^{15}$  -     64  (5,5,0)  2,811    1,235    411,392      169         63R
$2^{17}$  -     64  (5,5,0)  2,714    1,263    1,640,192    160         58R
Decoder from [3]
$2^{15}$  -     64  (5,5,0)  58,480   33,451   364,288      66          31R
$2^{17}$  -     64  (5,5,0)  221,471  131,764  1,445,632    10          6R
In Figure 3, the subgraph highlighted in bold illustrates the
nodes activated to calculate partial sums $\hat{s}_{1,0}$ and $\hat{s}_{1,2}$. Those
two values are subsequently used to evaluate $\lambda_g$ nodes in stage
$l = 1$ of the decoding graph.
The partial-sum encoder follows a schedule similar to that of
the decoding, although with half as many processing elements;
those processing elements produce two values instead of one,
since they are not restricted by a data dependency as the
decoding PEs are. The encoder thus increases latency by
$\frac{N}{P}(P - 1) + \frac{N}{P}\log_2\left(\frac{N}{4P}\right) - \log_2 P + 2$ CC, or about 67% for
$P = 64$, but allows higher operating frequencies, for a net
throughput gain.
Using $P/2$ encoding PEs, the encoder can make use of $P$-bit-wide
words in the $\hat{s}$ SRAMs, allowing the decoder to retrieve
$P$ partial sums simultaneously during decoding, in a single
clock cycle. Furthermore, because of the specific structure of
the encoding graph, the values stored in memory are properly
aligned for direct consumption by the decoding PEs, via a
fixed datapath.
Note that the internal partial sums $\hat{s}_{0,j}$ correspond to $\hat{u}_i$,
where $i$ is the bit-reversed [1] value of $j$. Furthermore, $\hat{s}_{n,j}$
yields an estimation of codeword value $\hat{x}_i$, where $i$ is again
the bit-reversed value of $j$. As part of the encoding process resulting in this
estimated codeword $\hat{x}$, the encoder creates internal estimations
$\hat{s}_{l,j}$, which are required by $\lambda_g$ during the decoding process.
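The bit-reversed index mapping mentioned above can be written as a short helper. The function name is an assumption; for $n = 3$ it reproduces the ordering 0, 4, 2, 6, 1, 5, 3, 7 used for the labels in Figure 3.

```python
def bit_reverse(j, n):
    """Reverse the n least-significant bits of index j."""
    reversed_index = 0
    for _ in range(n):
        reversed_index = (reversed_index << 1) | (j & 1)
        j >>= 1
    return reversed_index


# Index order for an N = 8 graph (n = 3).
order = [bit_reverse(j, 3) for j in range(8)]
```

Bit reversal is its own inverse, so the same helper maps $j$ to $i$ and $i$ back to $j$.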
In a non-systematic polar decoder, it is not necessary to
evaluate $\hat{x}$ completely, which saves a final encoding stage
after $\hat{u}_{N-2}$ and $\hat{u}_{N-1}$ are decoded. However, in a systematic
decoder [10], those extra steps could be carried out to obtain $\hat{x}$,
which is required to retrieve the original information vector,
while avoiding the need for extra hardware to perform the
additional encoding step.
IV. EXPERIMENTAL RESULTS
The various characteristics of this architecture, explored in
Section III, are summarized in Table IV. In this table, latency
takes into account the semi-parallel schedule, as well as the
chained PE and partial-sum encoder. The LLR SRAMs entry
combines both the channel and the internal LLR memories. The $\hat{s}$
SRAMs store internal encoding estimations, but not the whole
estimated codeword $\hat{x}$. Finally, throughput is estimated for
$P = 64$, a common value, to simplify its representation.
TABLE III
SCHEDULE OF THE PROPOSED SEMI-PARALLEL ARCHITECTURE, WITH $N = 8$ AND $P = 2$.
Stage / CC  0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19
l = 2       f0  f1  -   -   -   -   -   -   -   -   g0  g1  -   -   -   -   -   -   -   -
l = 1       -   -   f   -   -   g   -   -   -   -   -   -   f   -   -   g   -   -   -   -
l = 0       -   -   -   fg  -   -   fg  -   -   -   -   -   -   fg  -   -   fg  -   -   -
lenc = 0    -   -   -   -   e   -   -   e   -   -   -   -   -   -   e   -   -   e   -   -
lenc = 1    -   -   -   -   -   -   -   -   e0  e1  -   -   -   -   -   -   -   -   e0  e1
Output: $\hat{u}_0\hat{u}_1$ at CC 3, $\hat{u}_2\hat{u}_3$ at CC 6, $\hat{u}_4\hat{u}_5$ at CC 13, $\hat{u}_6\hat{u}_7$ at CC 16.
TABLE IV
SUMMARY OF THE TECHNICAL CHARACTERISTICS OF THE PROPOSED ARCHITECTURE.
Decoding latency (CC): $\frac{N}{P}\left(\frac{5P}{2} - 1\right) + \frac{2N}{P}\log_2\left(\frac{N}{4P}\right) - \log_2 P + 2$
LLR SRAMs (bits): $Q_c N + Q(N + P\log_2 P - P)$
$\hat{s}$ SRAMs (bits): $P\left(\frac{3N}{2P} + 2\log_2 P - 4\right)$
ROM (bits): $N$
T/P [$P = 64$] (bits/sec): $\sim R \cdot \frac{32}{71.5 + \log_2 N} \cdot f_{max}$
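The expressions of Table IV can be evaluated directly, as in the following sketch. The function names are hypothetical, and the throughput relation $R N f_{max} / \text{latency}$ is an assumption; for $P = 64$ and large $N$ it reduces to the approximation listed in the table. $N$ and $P$ are assumed to be powers of two with $N \geq 4P$.

```python
def log2_int(x):
    """Exact base-2 logarithm of a power of two."""
    if x < 1 or x & (x - 1):
        raise ValueError("argument must be a power of two")
    return x.bit_length() - 1


def decoding_latency_cc(n_len, p):
    """Decoding latency in clock cycles (first row of Table IV)."""
    blocks = n_len // p
    return (blocks * (5 * p // 2 - 1)
            + 2 * blocks * log2_int(n_len // (4 * p))
            - log2_int(p) + 2)


def llr_sram_bits(n_len, p, q, q_c):
    """Channel plus internal LLR storage, in bits."""
    return q_c * n_len + q * (n_len + p * log2_int(p) - p)


def s_sram_bits(n_len, p):
    """Partial-sum (s-hat) storage, in bits."""
    return p * (3 * n_len // (2 * p) + 2 * log2_int(p) - 4)


def throughput_bps(n_len, p, rate, f_max_hz):
    """Information throughput, assuming R*N bits per decoded frame."""
    return rate * n_len * f_max_hz / decoding_latency_cc(n_len, p)
```

For $N = 8$ and $P = 2$ the latency expression evaluates to 17 CC, and for $N = 2^{15}$, $P = 64$, $R = 0.50$ at 156 MHz the throughput sketch gives roughly 29 Mbps.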
Table II then presents implementation results targeting an
Altera Stratix IV FPGA. Maximum frequencies are reported
for the slow 900 mV, 85 °C timing model.
This table starts by presenting implementation results for the
four $N = 2^{15}$ codes described in Section III-B. It then explores
the scalability of our design with respect to two parameters:
code length and quantization. Finally, it compares this work
with [3].
Note that, as in [3], our decoder architecture is not affected
by code rate, as the choice of a specific code only modifies
the contents of a ROM. Code rate is thus only reported in the
first section of this table.
Those results show that the improved architecture retains a
high clock frequency over a wide variety of code lengths, due
to its fixed datapaths; the decreases observed are mostly due
to routing delays, as more SRAM elements are used on the
FPGA. Compared to [3], this new design scales much better
with respect to all parameters; its higher memory use could
be compensated, in an actual decoder, by choosing $Q_{ic} < Q_i$, whereas
both are set to the same value here for a fair comparison.
The register, logic and memory use of the decoder targeting
the $N = 2^{20}$ code amount to 0.5%, 2%, and 72% of
the resources available on the selected FPGA, respectively.
Additionally, register and logic use grow roughly linearly
in the number of PEs and quantization bits, but are mostly
unaffected by code length. Therefore, we can state that this
architecture will scale to extremely long codes, limited almost
exclusively by the amount of SRAM available on the FPGA.
At $N = 2^{17}$, the largest code length supported by our
previous-generation decoder, this improved architecture uses
81 times fewer look-up tables (LUT) and 104 times fewer flip-flops
(FF), has a maximum operating frequency 16 times higher,
and a throughput 11 times greater, using the same parameters.
V. CONCLUSION
In this paper, we presented a scalable architecture for
SC decoding of polar codes. This decoder features a semi-
parallel, encoder-based partial-sum update module. This mod-
ule utilizes SRAM for storage, and makes use of a fixed
datapath. Additionally, this architecture leverages a multi-level
quantization scheme for LLRs, decreasing memory use and
decoder area. This state-of-the-art decoder was synthesized
for an Altera Stratix IV FPGA target up to $N = 2^{20}$, limited
almost exclusively by the amount of available SRAM.
ACKNOWLEDGEMENT
The authors would like to thank Gabi Sarkis and Pascal
Giard, of McGill University, for helpful discussions.
REFERENCES
[1] E. Arıkan, “Channel polarization: A method for constructing capacity-
achieving codes,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), 2008,
pp. 1173–1177.
[2] I. Tal and A. Vardy, “How to construct polar codes,” arXiv/CoRR, vol.
abs/1105.6164, 2011.
[3] C. Leroux, A. J. Raymond, G. Sarkis, and W. J. Gross, “A semi-parallel
successive-cancellation decoder for polar codes,” Signal Processing,
IEEE Trans. on, vol. 61, no. 2, pp. 289–299, 2013.
[4] A. Mishra, A. J. Raymond, L. G. Amaru, G. Sarkis, C. Leroux,
P. Meinerzhagen, A. Burg, and W. J. Gross, “A successive cancellation
decoder ASIC for a 1024-bit polar code in 180nm CMOS,” in Proc.
IEEE Asian Solid-State Circuits Conf (A-SSCC), 2012, to appear.
[5] I. Tal and A. Vardy, “List decoding of polar codes,” in Proc. IEEE Int.
Symp. Inf. Theory (ISIT), 2011, pp. 1–5.
[6] A. Balatsoukas-Stimming and A. Burg, “Tree search architecture for list
SC decoding of polar codes,” arXiv/CoRR, vol. abs/1303.7127, 2013.
[7] H. Mahdavifar, M. El-Khamy, J. Lee, and I. Kang, “On the
construction and decoding of concatenated polar codes,” arXiv/CoRR,
vol. abs/1301.7491, 2013.
[8] C. Leroux, I. Tal, A. Vardy, and W. J. Gross, “Hardware architectures
for successive cancellation decoding of polar codes,” in Proc. IEEE
Int Acoustics, Speech and Signal Processing (ICASSP) Conf, 2011, pp.
1665–1668.
[9] C. Zhang, B. Yuan, and K. K. Parhi, “Reduced-latency SC polar
decoder architectures,” in Proc. IEEE Int. Conf. on Commun. (ICC),
2012, pp. 3471–3475.
[10] E. Arıkan, “Systematic polar coding,” IEEE Commun. Lett., vol. 15,
no. 8, pp. 860–862, 2011.
